# Lesson 2 

## Prep Environment

In [2]:
# import module(s) into namespace
import pandas as pd
import numpy as np
import requests
pd.set_option('display.max_colwidth', 150) #important for getting all the text


## Reminder:  last week we talked about what text you can get where and how to organize it. 

### Twitter, HTML, Documents - where else?

### Now we will look at what we can do with the text. 

#### For example: 
https://empiricalscotus.com/2017/12/17/masterpiece-bake/

https://www.nytimes.com/2017/08/18/upshot/evidence-of-a-toxic-environment-for-women-in-economics.html

http://www.vox.com/2016/5/16/11603854/donald-trump-twitter

## How do you do that?


Create numbers from it!  We'll use Vectorizers from SciKit Learn:
http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html

Video introduction to Machine Learning using SciKit Learn:

Part 1 - https://www.youtube.com/watch?v=dUzL0ox3C5o
Part 2 - https://www.youtube.com/watch?v=baWMKkum4mo

Note these are more general overviews and cover much more than we'll talk about.

Let's start with an easy example.  You're in a group text with people talking about what they do for fun.  Some of the texts are relevant to the question. Others?  Not so much. 



In [3]:
from sklearn.feature_extraction.text import CountVectorizer
import math

friend1 = "Machine learning is super fun"
friend2 = "Python is super, super cool"
friend3 = "Statistics is cool, too"
friend4 = "Fun? Data science is more than fun"
friend5 = "Python is great for machine learning"
friend6 = "I like football"
friend7 = "Football is great to watch"
friend8 = "Python is cool super, super cool"
friend9 = "Statistics not cool, too"


textStr = [friend1, friend2, friend3, friend4, friend5, friend6, friend7, friend8, friend9]
print(textStr)

#we've just created our corpus
#the resulting object is a Python list of strings

['Machine learning is super fun', 'Python is super, super cool', 'Statistics is cool, too', 'Fun? Data science is more than fun', 'Python is great for machine learning', 'I like football', 'Football is great to watch', 'Python is cool super, super cool', 'Statistics not cool, too']


### Tokenization: creating "tokens" out of strings.  Might be a word, might not.
### Vectorizer: method by which the tokens are counted and represented in a sparse matrix (vector space)
### Feature space: Resulting sparse matrix where rows are entries in the corpus and columns are the "feature" which may or may not appear in that item
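
To see tokenization on its own, here is a minimal sketch using `build_analyzer()`, which returns the function a `CountVectorizer` applies to each document before counting. It assumes the default settings: text is lowercased and only tokens of two or more word characters are kept, which is why the one-letter word "I" never shows up as a feature.

```python
from sklearn.feature_extraction.text import CountVectorizer

# build_analyzer() returns the preprocessing + tokenizing function
# that the vectorizer runs on every document before counting
analyzer = CountVectorizer().build_analyzer()

tokens_a = analyzer("Python is super, super cool")
tokens_b = analyzer("I like football")

print(tokens_a)  # punctuation is dropped, repeated tokens are kept
print(tokens_b)  # the single-character token "I" is dropped by the default token_pattern
```
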


In [4]:
# Start by defining our method
  
cv1 = CountVectorizer(binary=True) #defines the method by which we will turn the list of words into a numeric representation

# Any guesses about what we're doing by setting binary = True?
# Does this make sense?



In [5]:
# Apply that transformation to our text
cv1_chat = cv1.fit_transform(textStr)

print(type(cv1_chat))
print(cv1_chat.shape)


<class 'scipy.sparse.csr.csr_matrix'>
(9, 20)


In [6]:
# A sparse matrix is hard to read when printed, so it's not very helpful for inspection
# It is computationally efficient though
print(cv1_chat)


  (0, 4)	1
  (0, 15)	1
  (0, 6)	1
  (0, 7)	1
  (0, 9)	1
  (1, 0)	1
  (1, 12)	1
  (1, 15)	1
  (1, 6)	1
  (2, 18)	1
  (2, 14)	1
  (2, 0)	1
  (2, 6)	1
  (3, 16)	1
  (3, 10)	1
  (3, 13)	1
  (3, 1)	1
  (3, 4)	1
  (3, 6)	1
  (4, 3)	1
  (4, 5)	1
  (4, 12)	1
  (4, 6)	1
  (4, 7)	1
  (4, 9)	1
  (5, 2)	1
  (5, 8)	1
  (6, 19)	1
  (6, 17)	1
  (6, 2)	1
  (6, 5)	1
  (6, 6)	1
  (7, 0)	1
  (7, 12)	1
  (7, 15)	1
  (7, 6)	1
  (8, 11)	1
  (8, 18)	1
  (8, 14)	1
  (8, 0)	1
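
Each printed line above is a `(row, column)  value` triple: only the non-zero cells are stored. The sketch below rebuilds the same binary matrix from the chat corpus and counts how many cells are actually stored, which gives a rough sense of why the sparse format is efficient.

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "Machine learning is super fun",
    "Python is super, super cool",
    "Statistics is cool, too",
    "Fun? Data science is more than fun",
    "Python is great for machine learning",
    "I like football",
    "Football is great to watch",
    "Python is cool super, super cool",
    "Statistics not cool, too",
]

X = CountVectorizer(binary=True).fit_transform(corpus)

n_rows, n_cols = X.shape
stored = X.nnz                        # number of explicitly stored (non-zero) entries
density = stored / (n_rows * n_cols)  # share of cells that are non-zero

print(X.shape, stored, round(density, 3))
```

On a real corpus with thousands of features the density is usually far lower, so the savings grow with the size of the vocabulary.
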


In [88]:
#arrays are easier
print(cv1_chat.toarray())

[[0 0 0 0 1 0 1 1 0 1 0 0 0 0 0 1 0 0 0 0]
 [1 0 0 0 0 0 1 0 0 0 0 0 1 0 0 1 0 0 0 0]
 [1 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 1 0]
 [0 1 0 0 1 0 1 0 0 0 1 0 0 1 0 0 1 0 0 0]
 [0 0 0 1 0 1 1 1 0 1 0 0 1 0 0 0 0 0 0 0]
 [0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 1 0 0 1 1 0 0 0 0 0 0 0 0 0 0 1 0 1]
 [1 0 0 0 0 0 1 0 0 0 0 0 1 0 0 1 0 0 0 0]
 [1 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 1 0]]


In [89]:
#what do these numbers mean?
#note: scikit-learn 1.0+ provides get_feature_names_out(); get_feature_names() was removed in 1.2
cv1.get_feature_names()

['cool',
 'data',
 'football',
 'for',
 'fun',
 'great',
 'is',
 'learning',
 'like',
 'machine',
 'more',
 'not',
 'python',
 'science',
 'statistics',
 'super',
 'than',
 'to',
 'too',
 'watch']

In [7]:
#and data frames are even better
pd.DataFrame(cv1_chat.toarray(),columns = cv1.get_feature_names())


Unnamed: 0,cool,data,football,for,fun,great,is,learning,like,machine,more,not,python,science,statistics,super,than,to,too,watch
0,0,0,0,0,1,0,1,1,0,1,0,0,0,0,0,1,0,0,0,0
1,1,0,0,0,0,0,1,0,0,0,0,0,1,0,0,1,0,0,0,0
2,1,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,1,0
3,0,1,0,0,1,0,1,0,0,0,1,0,0,1,0,0,1,0,0,0
4,0,0,0,1,0,1,1,1,0,1,0,0,1,0,0,0,0,0,0,0
5,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0
6,0,0,1,0,0,1,1,0,0,0,0,0,0,0,0,0,0,1,0,1
7,1,0,0,0,0,0,1,0,0,0,0,0,1,0,0,1,0,0,0,0
8,1,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,1,0


In [8]:
cv2 = CountVectorizer(binary=False) #define the transformation
cv2_chat = cv2.fit_transform(textStr) #apply the transformation

print(type(cv2_chat))
print(cv2_chat.shape)

<class 'scipy.sparse.csr.csr_matrix'>
(9, 20)


In [9]:
# multiset or "bag of words"

print(cv2.get_feature_names())


['cool', 'data', 'football', 'for', 'fun', 'great', 'is', 'learning', 'like', 'machine', 'more', 'not', 'python', 'science', 'statistics', 'super', 'than', 'to', 'too', 'watch']


In [10]:
pd.DataFrame(cv2_chat.toarray(),columns = cv2.get_feature_names())

Unnamed: 0,cool,data,football,for,fun,great,is,learning,like,machine,more,not,python,science,statistics,super,than,to,too,watch
0,0,0,0,0,1,0,1,1,0,1,0,0,0,0,0,1,0,0,0,0
1,1,0,0,0,0,0,1,0,0,0,0,0,1,0,0,2,0,0,0,0
2,1,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,1,0
3,0,1,0,0,2,0,1,0,0,0,1,0,0,1,0,0,1,0,0,0
4,0,0,0,1,0,1,1,1,0,1,0,0,1,0,0,0,0,0,0,0
5,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0
6,0,0,1,0,0,1,1,0,0,0,0,0,0,0,0,0,0,1,0,1
7,2,0,0,0,0,0,1,0,0,0,0,0,1,0,0,2,0,0,0,0
8,1,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,1,0


The two vectorizers above differ only in the "binary" parameter setting. The default is False, which keeps raw counts.
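
A quick way to convince yourself of the relationship between the two settings: the binary matrix is just the count matrix with every non-zero count replaced by 1. This is a small sketch on two of the chat messages, not part of the original lesson code.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

docs = ["Python is super, super cool", "Fun? Data science is more than fun"]

counts = CountVectorizer(binary=False).fit_transform(docs).toarray()
flags = CountVectorizer(binary=True).fit_transform(docs).toarray()

# binary=True records presence/absence: every non-zero count becomes 1
same = np.array_equal((counts > 0).astype(int), flags)

print(counts.max(), flags.max(), same)
```
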

There are lots of parameters you can set for the vectorizer which defines the feature space:
http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html

`class sklearn.feature_extraction.text.CountVectorizer(input='content', encoding='utf-8', decode_error='strict', strip_accents=None, lowercase=True, preprocessor=None, tokenizer=None, stop_words=None, token_pattern='(?u)\b\w\w+\b', ngram_range=(1, 1), analyzer='word', max_df=1.0, min_df=1, max_features=None, vocabulary=None, binary=False, dtype=<class 'numpy.int64'>)`

Some important ones:

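* ***lowercase***: convert everything to lowercase before tokenizing.  Default is True
* ***stop_words***: remove common words.  Default is None (nothing is removed); 'english' uses the built-in English list
* ***ngram_range***: the smallest and largest n-gram size (sequences of n consecutive tokens) to extract.  Default is (1, 1), i.e. single words only
* ***max_df*** and ***min_df***: drop terms that appear in too many or too few documents.  A float is read as a proportion of documents, an integer as an absolute count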

### It's important to understand the defaults, whether or not you want to leave them that way

# __________

### Let's break into groups and dig into the parameters a little bit

#### Each group take one of the following to look into
stop words

ngram_range

min_df & max_df

max_features

vocabulary

#### Take 5 minutes to review the documentation
Explain your parameter and whether there is a meaningful default setting
Give one example where the parameter might be useful

# __________

## Now let's try changing a few of them

In [11]:
# try changing a few parameters
cv3 = CountVectorizer(binary=False, lowercase = False) #define the transformation
cv3_chat = cv3.fit_transform(textStr) #apply the transformation

print(type(cv3_chat))
print(cv3_chat.shape)
pd.DataFrame(cv3_chat.toarray(),columns = cv3.get_feature_names())

<class 'scipy.sparse.csr.csr_matrix'>
(9, 23)


Unnamed: 0,Data,Football,Fun,Machine,Python,Statistics,cool,football,for,fun,...,like,machine,more,not,science,super,than,to,too,watch
0,0,0,0,1,0,0,0,0,0,1,...,0,0,0,0,0,1,0,0,0,0
1,0,0,0,0,1,0,1,0,0,0,...,0,0,0,0,0,2,0,0,0,0
2,0,0,0,0,0,1,1,0,0,0,...,0,0,0,0,0,0,0,0,1,0
3,1,0,1,0,0,0,0,0,0,1,...,0,0,1,0,1,0,1,0,0,0
4,0,0,0,0,1,0,0,0,1,0,...,0,1,0,0,0,0,0,0,0,0
5,0,0,0,0,0,0,0,1,0,0,...,1,0,0,0,0,0,0,0,0,0
6,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,1
7,0,0,0,0,1,0,2,0,0,0,...,0,0,0,0,0,2,0,0,0,0
8,0,0,0,0,0,1,1,0,0,0,...,0,0,0,1,0,0,0,0,1,0


In [12]:
# try changing a few parameters
cv4 = CountVectorizer(binary=False, stop_words='english') #define the transformation
cv4_chat = cv4.fit_transform(textStr) #apply the transformation

print(type(cv4_chat))
print(cv4_chat.shape)
pd.DataFrame(cv4_chat.toarray(),columns = cv4.get_feature_names())

<class 'scipy.sparse.csr.csr_matrix'>
(9, 13)


Unnamed: 0,cool,data,football,fun,great,learning,like,machine,python,science,statistics,super,watch
0,0,0,0,1,0,1,0,1,0,0,0,1,0
1,1,0,0,0,0,0,0,0,1,0,0,2,0
2,1,0,0,0,0,0,0,0,0,0,1,0,0
3,0,1,0,2,0,0,0,0,0,1,0,0,0
4,0,0,0,0,1,1,0,1,1,0,0,0,0
5,0,0,1,0,0,0,1,0,0,0,0,0,0
6,0,0,1,0,1,0,0,0,0,0,0,0,1
7,2,0,0,0,0,0,0,0,1,0,0,2,0
8,1,0,0,0,0,0,0,0,0,0,1,0,0


In [13]:
# try changing a few parameters
pd.set_option('display.max_columns', None)
cv5 = CountVectorizer(binary=False, stop_words = 'english', ngram_range = (1,2)) #define the transformation
cv5_chat = cv5.fit_transform(textStr) #apply the transformation

print(type(cv5_chat))
print(cv5_chat.shape)
pd.DataFrame(cv5_chat.toarray(),columns = cv5.get_feature_names())

<class 'scipy.sparse.csr.csr_matrix'>
(9, 30)


Unnamed: 0,cool,cool super,data,data science,football,football great,fun,fun data,great,great machine,great watch,learning,learning super,like,like football,machine,machine learning,python,python cool,python great,python super,science,science fun,statistics,statistics cool,super,super cool,super fun,super super,watch
0,0,0,0,0,0,0,1,0,0,0,0,1,1,0,0,1,1,0,0,0,0,0,0,0,0,1,0,1,0,0
1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,2,1,0,1,0
2,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0
3,0,0,1,1,0,0,2,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,1,1,0,1,0,0,0,1,1,1,0,1,0,0,0,0,0,0,0,0,0,0
5,0,0,0,0,1,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
6,0,0,0,0,1,1,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
7,2,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,2,1,0,1,0
8,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0
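
Notice the column "learning super" in the output above, even though nobody typed those two words next to each other. A short sketch of why: stop words are removed first, and the n-grams are then built from whatever tokens remain. The example uses `build_analyzer()` to show the tokens produced for the first chat message.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Same settings as cv5: English stop words removed, unigrams and bigrams kept
analyzer = CountVectorizer(stop_words="english", ngram_range=(1, 2)).build_analyzer()

grams = analyzer("Machine learning is super fun")
print(grams)

# "is" is removed before bigrams are formed, so "learning" and "super"
# become neighbours and produce the bigram "learning super"
```

Whether that kind of bigram is acceptable depends on the analysis; it is one reason to look at the feature names rather than only the shape.
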


In [14]:
# try changing a few parameters
cv6 = CountVectorizer(binary=False, max_df = .5, min_df= .2) #define the transformation
# only asking it to make changes based on document frequency
# not using stop words, but it might still help eliminate "is"

cv6_chat = cv6.fit_transform(textStr) #apply the transformation

print(type(cv6_chat))
print(cv6_chat.shape)
pd.DataFrame(cv6_chat.toarray(),columns = cv6.get_feature_names())

<class 'scipy.sparse.csr.csr_matrix'>
(9, 10)


Unnamed: 0,cool,football,fun,great,learning,machine,python,statistics,super,too
0,0,0,1,0,1,1,0,0,1,0
1,1,0,0,0,0,0,1,0,2,0
2,1,0,0,0,0,0,0,1,0,1
3,0,0,2,0,0,0,0,0,0,0
4,0,0,0,1,1,1,1,0,0,0
5,0,1,0,0,0,0,0,0,0,0
6,0,1,0,1,0,0,0,0,0,0
7,2,0,0,0,0,0,1,0,2,0
8,1,0,0,0,0,0,0,1,0,1
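
To make the document-frequency filter concrete, here is a sketch that reproduces the cv6 vocabulary by hand. It assumes that float values of `min_df` and `max_df` are treated as proportions of the number of documents, as the scikit-learn documentation describes; the variable names are ours.

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "Machine learning is super fun",
    "Python is super, super cool",
    "Statistics is cool, too",
    "Fun? Data science is more than fun",
    "Python is great for machine learning",
    "I like football",
    "Football is great to watch",
    "Python is cool super, super cool",
    "Statistics not cool, too",
]
n_docs = len(corpus)

# Document frequency: in how many documents does each term appear?
cv_all = CountVectorizer(binary=True)
presence = cv_all.fit_transform(corpus).toarray()
terms = sorted(cv_all.vocabulary_, key=cv_all.vocabulary_.get)  # terms in column order
doc_freq = presence.sum(axis=0)

# Keep terms whose document count lies between min_df and max_df (as proportions)
min_df, max_df = 0.2, 0.5
kept_by_hand = [t for t, df in zip(terms, doc_freq)
                if min_df * n_docs <= df <= max_df * n_docs]

# Let the vectorizer apply the same limits and compare
cv_limited = CountVectorizer(min_df=min_df, max_df=max_df).fit(corpus)
kept_by_sklearn = sorted(cv_limited.vocabulary_)

print(kept_by_hand)
print(kept_by_hand == kept_by_sklearn)
```

With 9 documents, a term must appear in at least 1.8 and at most 4.5 of them, so "is" (7 documents) and every one-document word are dropped.
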


## When would you want to use what settings?

## Who cares?  How would you use this information?



## Weights on terms: binary, raw count, other?

What if relevance doesn't increase with count?  When would you want to find unusual words rather than common words?

In [15]:
from sklearn.feature_extraction.text import TfidfVectorizer

print(textStr)


['Machine learning is super fun', 'Python is super, super cool', 'Statistics is cool, too', 'Fun? Data science is more than fun', 'Python is great for machine learning', 'I like football', 'Football is great to watch', 'Python is cool super, super cool', 'Statistics not cool, too']


## We can use a different vectorizer to apply weights based on the frequency of the term in a Document

### Important terms:
 * Term frequency - the number of occurrences of a term in a particular document
 * Document frequency - the number of documents in the collection that contain the term
 * Collection frequency - the number of occurrences of the term in the whole collection
 
 So for our textStr example:
 <table>
 <tr> <td>Word</td><td>tf1</td><td>tf2</td><td>tf3</td><td>tf4</td><td>tf5</td><td>tf6</td><td>tf7</td><td>tf8</td><td>tf9</td><td>df</td><td>cf</td></tr>
 <tr> <td>cool</td><td>0</td><td>1</td><td>1</td><td>0</td><td>0</td><td>0</td><td>0</td><td>2</td><td>1</td><td>4</td><td>5</td></tr>
 <tr> <td>data</td><td>0</td><td>0</td><td>0</td><td>1</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>1</td><td>1</td></tr>
 <tr> <td>football</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>1</td><td>1</td><td>0</td><td>0</td><td>2</td><td>2</td></tr>
 <tr> <td>for</td><td>0</td><td>0</td><td>0</td><td>0</td><td>1</td><td>0</td><td>0</td><td>0</td><td>0</td><td>1</td><td>1</td></tr>
 <tr> <td>fun</td><td>1</td><td>0</td><td>0</td><td>2</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>2</td><td>3</td></tr>
 <tr> <td>great</td><td>0</td><td>0</td><td>0</td><td>0</td><td>1</td><td>0</td><td>1</td><td>0</td><td>0</td><td>2</td><td>2</td></tr>
 <tr> <td>is</td><td>1</td><td>1</td><td>1</td><td>1</td><td>1</td><td>0</td><td>1</td><td>1</td><td>0</td><td>7</td><td>7</td></tr>
 <tr> <td>learning</td><td>1</td><td>0</td><td>0</td><td>0</td><td>1</td><td>0</td><td>0</td><td>0</td><td>0</td><td>2</td><td>2</td></tr>
 <tr> <td>like</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>1</td><td>0</td><td>0</td><td>0</td><td>1</td><td>1</td></tr>
 <tr> <td>machine</td><td>1</td><td>0</td><td>0</td><td>0</td><td>1</td><td>0</td><td>0</td><td>0</td><td>0</td><td>2</td><td>2</td></tr>
 <tr> <td>more</td><td>0</td><td>0</td><td>0</td><td>1</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>1</td><td>1</td></tr>
 <tr> <td>not</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>1</td><td>1</td><td>1</td></tr>
 <tr> <td>python</td><td>0</td><td>1</td><td>0</td><td>0</td><td>1</td><td>0</td><td>0</td><td>1</td><td>0</td><td>3</td><td>3</td></tr>
 <tr> <td>science</td><td>0</td><td>0</td><td>0</td><td>1</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>1</td><td>1</td></tr>
 <tr> <td>statistics</td><td>0</td><td>0</td><td>1</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>1</td><td>2</td><td>2</td></tr>
 <tr> <td>super</td><td>1</td><td>2</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>2</td><td>0</td><td>3</td><td>5</td></tr>
 <tr> <td>than</td><td>0</td><td>0</td><td>0</td><td>1</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>1</td><td>1</td></tr>
 <tr> <td>to</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>1</td><td>0</td><td>0</td><td>1</td><td>1</td></tr>
 <tr> <td>too</td><td>0</td><td>0</td><td>1</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>1</td><td>2</td><td>2</td></tr>
 <tr> <td>watch</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>1</td><td>0</td><td>0</td><td>1</td><td>1</td></tr>

</table>

More details: http://web.stanford.edu/class/cs276/handouts/lecture6-tfidf-handout-6-per.pdf

The slides at the above link are about information retrieval in general.  The tf-idf slides start at about slide 18. 

There are different ways of using and combining these counts to weight terms differently.  Here's the documentation for the vectorizer that applies these weights:


http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html




In [16]:
# How do the weights work here?
tfidf1 = TfidfVectorizer(use_idf=False, norm=None) #define the transformation

tf1_chat = tfidf1.fit_transform(textStr) #apply the transformation
print(tf1_chat.shape)
pd.DataFrame(tf1_chat.toarray(),columns = tfidf1.get_feature_names())

(9, 20)


Unnamed: 0,cool,data,football,for,fun,great,is,learning,like,machine,more,not,python,science,statistics,super,than,to,too,watch
0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
1,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,2.0,0.0,0.0,0.0,0.0
2,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0
3,0.0,1.0,0.0,0.0,2.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0
4,0.0,0.0,0.0,1.0,0.0,1.0,1.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,0.0,0.0,1.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0
7,2.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,2.0,0.0,0.0,0.0,0.0
8,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0


In [17]:
# now apply weights based on the inverse document frequency
tfidf2 = TfidfVectorizer(use_idf=True, norm=None) #define the transformation
tf2_chat = tfidf2.fit_transform(textStr) #apply the transformation

pd.DataFrame(tf2_chat.toarray(),columns = tfidf2.get_feature_names())

Unnamed: 0,cool,data,football,for,fun,great,is,learning,like,machine,more,not,python,science,statistics,super,than,to,too,watch
0,0.0,0.0,0.0,0.0,2.203973,0.0,1.223144,2.203973,0.0,2.203973,0.0,0.0,0.0,0.0,0.0,1.916291,0.0,0.0,0.0,0.0
1,1.693147,0.0,0.0,0.0,0.0,0.0,1.223144,0.0,0.0,0.0,0.0,0.0,1.916291,0.0,0.0,3.832581,0.0,0.0,0.0,0.0
2,1.693147,0.0,0.0,0.0,0.0,0.0,1.223144,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.203973,0.0,0.0,0.0,2.203973,0.0
3,0.0,2.609438,0.0,0.0,4.407946,0.0,1.223144,0.0,0.0,0.0,2.609438,0.0,0.0,2.609438,0.0,0.0,2.609438,0.0,0.0,0.0
4,0.0,0.0,0.0,2.609438,0.0,2.203973,1.223144,2.203973,0.0,2.203973,0.0,0.0,1.916291,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,2.203973,0.0,0.0,0.0,0.0,0.0,2.609438,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,0.0,0.0,2.203973,0.0,0.0,2.203973,1.223144,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.609438,0.0,2.609438
7,3.386294,0.0,0.0,0.0,0.0,0.0,1.223144,0.0,0.0,0.0,0.0,0.0,1.916291,0.0,0.0,3.832581,0.0,0.0,0.0,0.0
8,1.693147,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.609438,0.0,0.0,2.203973,0.0,0.0,0.0,2.203973,0.0


TFIDF things to remember:

- Weight is highest when a term $t$ occurs many times within a small number of documents (thus lending high discriminating power to those documents)


- Weight is lower when the term occurs fewer times in a document, or occurs in many documents (thus offering a less pronounced relevance signal)


- Weight is lowest when the term occurs in virtually all documents
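
Where do numbers like 1.693147 come from? With its default `smooth_idf=True`, scikit-learn documents the inverse document frequency as idf(t) = ln((1 + n) / (1 + df(t))) + 1, where n is the number of documents and df(t) the document frequency of the term. The weight in each cell is then tf × idf (no further scaling here because we set `norm=None`). The sketch below recomputes the idf values by hand and checks them against the fitted vectorizer; the variable names are ours.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = [
    "Machine learning is super fun",
    "Python is super, super cool",
    "Statistics is cool, too",
    "Fun? Data science is more than fun",
    "Python is great for machine learning",
    "I like football",
    "Football is great to watch",
    "Python is cool super, super cool",
    "Statistics not cool, too",
]
n_docs = len(corpus)

# Document frequency of each term, from binary presence/absence flags
cv = CountVectorizer(binary=True)
doc_freq = cv.fit_transform(corpus).toarray().sum(axis=0)

# Smoothed idf, as documented for the default smooth_idf=True
idf_by_hand = np.log((1 + n_docs) / (1 + doc_freq)) + 1

tfidf = TfidfVectorizer(use_idf=True, norm=None).fit(corpus)
matches = np.allclose(idf_by_hand, tfidf.idf_)

cool_idf = float(idf_by_hand[cv.vocabulary_["cool"]])
print(matches, round(cool_idf, 6))
```

For example, "cool" appears in 4 of 9 documents, so its idf is ln(10/5) + 1 ≈ 1.693147, and the message that uses it twice gets 2 × 1.693147 ≈ 3.386294.
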

## Let's see why understanding weights vs. counts might be important when we start to work with the feature space

#### Let's start with counts...

In [18]:
names = cv6.get_feature_names()   #create list of feature names
print(type(names), len(names))

count = np.sum(cv6_chat.toarray(), axis = 0) # convert sparse matrix to a dense array, then add up counts per feature 
count2 = count.tolist()  # convert numpy array to list

print("") #this is just to add a break in the output
print("We started with", len(names), "and we ended with",len(count2))
print("")

count_df = pd.DataFrame(count2, index = names, columns = ['count']) # create a dataframe from the list
print(count_df)

<class 'list'> 10

We started with 10 and we ended with 10

            count
cool            5
football        2
fun             3
great           2
learning        2
machine         2
python          3
statistics      2
super           5
too             2


### What about the weighted feature space?

In [19]:
names = tfidf2.get_feature_names()   #create list of feature names
count = np.sum(tf2_chat.toarray(), axis = 0) # add up feature counts 
count2 = count.tolist()  # convert numpy array to list

print("We started with", len(names), "and we finished with", len(count2), ". Wait.  What? 20?")
print("")
print("Notice anything funny about our numbers and how we got them?")
print("")
count_df = pd.DataFrame(count2, index = names, columns = ['count']) # create a dataframe from the list
print(count_df)


We started with 20 and we finished with 20 . Wait.  What? 20?

Notice anything funny about our numbers and how we got them?

               count
cool        8.465736
data        2.609438
football    4.407946
for         2.609438
fun         6.611918
great       4.407946
is          8.562005
learning    4.407946
like        2.609438
machine     4.407946
more        2.609438
not         2.609438
python      5.748872
science     2.609438
statistics  4.407946
super       9.581454
than        2.609438
to          2.609438
too         4.407946
watch       2.609438


In [20]:
# What about other forms of aggregation?  Maybe Average weight?
count = np.mean(tf2_chat.toarray(), axis = 0) # find the average weight for each feature 
count2 = count.tolist()  # convert numpy array to list
count_df = pd.DataFrame(count2, index = names, columns = ['count']) # create a dataframe from the list
print(count_df)

               count
cool        0.940637
data        0.289938
football    0.489772
for         0.289938
fun         0.734658
great       0.489772
is          0.951334
learning    0.489772
like        0.289938
machine     0.489772
more        0.289938
not         0.289938
python      0.638764
science     0.289938
statistics  0.489772
super       1.064606
than        0.289938
to          0.289938
too         0.489772
watch       0.289938


In [21]:
# What about max?
count = np.max(tf2_chat.toarray(), axis = 0) # find the max weight of each feature 
count2 = count.tolist()  # convert numpy array to list
count_df = pd.DataFrame(count2, index = names, columns = ['count']) # create a dataframe from the list
print(count_df)

               count
cool        3.386294
data        2.609438
football    2.203973
for         2.609438
fun         4.407946
great       2.203973
is          1.223144
learning    2.203973
like        2.609438
machine     2.203973
more        2.609438
not         2.609438
python      1.916291
science     2.609438
statistics  2.203973
super       3.832581
than        2.609438
to          2.609438
too         2.203973
watch       2.609438


### How about some "real" data?

#### Apply this to some news - corpus of articles from the NYTimes in 2013
From the Socially-Informed Timeline Generation Corpus (first released in March 2015)

URL: https://arxiv.org/pdf/1606.05699.pdf

Corpus construction:
"We collected all articles with comments from NYT in 2013 to form a training set for learning importance scoring functions on articles sentences and comments (see details in Section 3). NYT2013 contains 3,863 articles and 833,032 comments."

We are using only the articles that have text beyond the headline. (Some of the entries in the corpus looked to be "interactive" content which had no text.) I have created and cleaned the corpus for you so we have 3848 articles with both a headline and text. 

In [28]:
#pull in text

filename = "C:\\Users\\Paul\\Desktop\\Rockhurst\\BIA 6304-Text Mining\\Week 2\\nytimes2013.csv"
newsdf = pd.read_csv(filename, index_col = 0) 



print(newsdf.shape)
newsdf.head()

(3848, 5)


Unnamed: 0,date,description,headline,url,text
0,2013-01-01,"Ending a climactic showdown in the final hours of the 112th Congress, the House sent to President Obama legislation to avert big income tax increa...",Divided House Passes Tax Deal in End to Latest Fiscal Standoff,http://www.nytimes.com/2013/01/02/us/politics/house-takes-on-fiscal-cliff.html,"WASHINGTON — Ending a climactic fiscal showdown in the final hours of the 112th Congress, the House late Tuesday passed and sent to legislation ..."
1,2013-01-01,A report on nearly three million people found that those whose body mass index ranked them as overweight had less risk of dying than people of nor...,Study Suggests Lower Mortality Risk for People Deemed to Be Overweight,http://www.nytimes.com/2013/01/02/health/study-suggests-lower-death-risk-for-the-overweight.html,"A century ago, Elsie Scheel was the perfect woman. So said a 1912 in The New York Times about how Miss Scheel, 24, was chosen by the “medical ex..."
2,2013-01-01,"As the United States prepares to withdraw from an unpopular war in Afghanistan, it faces challenges similar to what the country’s last occupier, t...","With U.S. Set to Leave Afghanistan, Echoes of 1989",http://www.nytimes.com/2013/01/02/world/asia/us-war-in-afghanistan-has-echoes-of-soviet-experience.html,WASHINGTON — The young president who ascended to office as a change agent decides to end the costly and unpopular war in Afghanistan. He seeks an ...
3,2013-01-01,"The popularity of the drinks reflects success in convincing consumers that they provide an edge, but most of their ingredients have no or little b...","Energy Drinks Promise Edge, but Experts Say Proof Is Scant",http://www.nytimes.com/2013/01/02/health/scant-proof-is-found-to-back-up-claims-by-energy-drinks.html,"Energy drinks are the fastest-growing part of the beverage industry, with sales in the United States reaching more than $10 billion in 2012 — more..."
4,2013-01-01,"New Hampshire, which again chose a woman to be governor, will also become the first state in history to have an all-female delegation in Washington.","From Congress to Halls of State, in New Hampshire, Women Rule",http://www.nytimes.com/2013/01/02/us/politics/from-congress-to-halls-of-state-in-new-hampshire-women-rule.html,"Most states are red or blue. A few are purple. After the November election, New Hampshire turned pink. Women won the state’s two Congressi..."


In [29]:
pd.set_option('display.max_colwidth', 15000) #important for getting all the text

print(newsdf['text'][0:1])
print(type(newsdf['text']))

0    WASHINGTON — Ending a climactic fiscal showdown in the final hours of the 112th Congress, the House late Tuesday passed and sent to   legislation to avert big income tax increases on most Americans and prevent large cuts in spending for the Pentagon and other government programs.         The measure, brought to the House floor less than 24 hours after its passage in the Senate, was approved 257 to 167, with 85 Republicans joining 172 Democrats in voting to allow income taxes to rise for the first time in two decades, in this case for the highest-earning Americans. Voting no were 151 Republicans and 16 Democrats.         The bill was expected to be signed quickly by Mr. Obama, who won re-election on a promise to increase taxes on the wealthy.         Mr. Obama strode into the White House briefing room shortly after the vote, less to hail the end of the fiscal crisis than to lay out a marker for the next one. “The one thing that I think, hopefully, the new year will focus on,” he sa

In [30]:
# let's see what the feature space looks like for these news headlines

cv1_news = cv1.fit_transform(newsdf['headline']) #we already defined cv1 so now we can just apply it
print(cv1_news.shape)


(3848, 7423)
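
One thing to keep in mind when reusing a vectorizer: `fit_transform()` relearns the vocabulary from whatever text it is given, so after the cell above cv1 no longer knows the chat vocabulary. If you want to keep an existing vocabulary, `transform()` is the method to use. A small sketch, where the two "headlines" are made-up stand-ins and not taken from the NYT corpus:

```python
from sklearn.feature_extraction.text import CountVectorizer

chat = ["Python is super, super cool", "I like football"]
headlines = ["Football season opens", "Python found in garden"]  # hypothetical examples

cv = CountVectorizer(binary=True)
cv.fit_transform(chat)
chat_vocab = set(cv.vocabulary_)

# transform() reuses the vocabulary learned from the chat messages
reused = cv.transform(headlines)

# fit_transform() throws that vocabulary away and learns a new one
refit = cv.fit_transform(headlines)
headline_vocab = set(cv.vocabulary_)

print(reused.shape, refit.shape)
print(chat_vocab == headline_vocab)
```
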


In [31]:
cv1.get_feature_names()

['000',
 '06',
 '09',
 '10',
 '100',
 '100th',
 '101',
 '11',
 '12',
 '13',
 '14',
 '142',
 '150',
 '153',
 '16',
 '17',
 '17th',
 '18',
 '183',
 '19',
 '1953',
 '1964',
 '1965',
 '1988',
 '1989',
 '20',
 '200',
 '2000',
 '2004',
 '2008',
 '2012',
 '2013',
 '2014',
 '2015',
 '2016',
 '2020',
 '2022',
 '2040',
 '20s',
 '21',
 '21st',
 '22',
 '24',
 '25',
 '250',
 '26',
 '2nd',
 '30',
 '300',
 '30s',
 '32',
 '325',
 '33',
 '35',
 '38',
 '3rd',
 '40',
 '400',
 '401',
 '41',
 '42',
 '44',
 '45',
 '460',
 '47',
 '49',
 '49ers',
 '50',
 '500',
 '51',
 '52',
 '535',
 '546',
 '57th',
 '58',
 '5pointz',
 '60',
 '60s',
 '62',
 '649',
 '65',
 '66',
 '67',
 '68',
 '700',
 '708',
 '70s',
 '75',
 '765',
 '78',
 '7th',
 '80',
 '800',
 '82',
 '89',
 '911',
 '96',
 'aarp',
 'abandon',
 'abandoned',
 'abandoning',
 'abates',
 'abducted',
 'abedin',
 'abets',
 'ability',
 'able',
 'abominably',
 'abortion',
 'abortions',
 'abound',
 'aboushi',
 'about',
 'above',
 'abroad',
 'abruptly',
 'absence',
 'abu

#### Since we didn't use informative names for our vectorizers, here's a summary cheat sheet.
* cv1 - simple binary
* cv2 - Bag of words (counts)
* cv3 - preserves case
* cv4 - gets rid of stop words
* cv5 - Bag of words, stop words, bigrams
* cv6 - uses min_df and max_df


# __

## Let's get some practice with real text

#### Take 5-10 minutes to apply a new vectorizer to the news text with parameter settings of your choosing.

### When we come back, be prepared to share:
- Which vectorizer did you select and why?
- Which parameter settings did you select and why?
- What was the resulting shape of your feature space?
- In reviewing your feature space, is there anything else you'd like to change?
# __

### Now let's step through some of the vectorizers we already set up and see how they help

In [32]:
# how does excluding stopwords help?
cv4_news = cv4.fit_transform(newsdf['headline'])
print(cv4_news.shape)

(3848, 7194)


In [33]:
# using min and max df?
cv6_news = cv6.fit_transform(newsdf['headline'])
print(cv6_news.shape)
cv6.get_feature_names()

(3848, 2)


['in', 'to']

In [34]:
# let's try more useful df limits and a few other settings as well
cv7 = CountVectorizer(binary=False, min_df = .01, stop_words = "english") #define the transformation
cv7_news = cv7.fit_transform(newsdf['headline']) #apply the transformation
print(cv7_news.shape)


(3848, 14)


In [35]:
cv7.get_feature_names()

['china',
 'city',
 'gay',
 'health',
 'home',
 'house',
 'law',
 'new',
 'obama',
 'plan',
 'says',
 'syria',
 'world',
 'york']

In [36]:
#apply some weights
tfidf3 = TfidfVectorizer(use_idf=True, norm=None, stop_words = "english", min_df = .01) #define the transformation
tf3_news = tfidf3.fit_transform(newsdf['headline']) #apply the transformation

print(tf3_news.shape)

(3848, 14)


In [37]:
# compare unweighted and weighted
pd.DataFrame(cv7_news.toarray(),columns = cv7.get_feature_names()).head(10)

Unnamed: 0,china,city,gay,health,home,house,law,new,obama,plan,says,syria,world,york
0,0,0,0,0,0,1,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,1,0,0,0,0,0,0
5,0,0,0,0,0,0,0,0,0,0,0,0,0,0
6,0,0,0,0,0,0,0,0,0,0,1,0,0,0
7,0,0,0,0,0,0,0,0,0,0,0,0,0,0
8,0,0,0,0,0,0,0,1,0,0,0,0,0,0
9,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [38]:
pd.DataFrame(tf3_news.toarray(),columns = tfidf3.get_feature_names()).head(10)

Unnamed: 0,china,city,gay,health,home,house,law,new,obama,plan,says,syria,world,york
0,0.0,0.0,0.0,0.0,0.0,5.384368,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.977454,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5.266585,0.0,0.0,0.0
7,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.977454,0.0,0.0,0.0,0.0,0.0,0.0
9,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [39]:
# My settings and resulting feature space
cv8 = CountVectorizer(binary=False, min_df = .005, stop_words = "english", ngram_range = (1,2)) #define the transformation
cv8_news = cv8.fit_transform(newsdf['headline']) #apply the transformation
print(cv8_news.shape)

names = cv8.get_feature_names()   #create list of feature names
count = np.sum(cv8_news.toarray(), axis = 0) # add up feature counts 
count2 = count.tolist()  # convert numpy array to list

(3848, 96)


In [40]:
count_df = pd.DataFrame(count2, index = names, columns = ['count']) # create a dataframe from the list
print(count_df.head()) # notice entries are alphabetical
print(count_df.tail()) # notice entries are alphabetical

        count
aid        26
attack     25
ban        21
big        27
blasio     29
         count
world       45
yankees     29
year        23
years       32
york        62


In [41]:
sorted_count = count_df.sort_values(['count'], ascending = False)  #arrange by count instead
print(sorted_count.head(10))
print(sorted_count.tail(10))

          count
new         201
obama        94
health       82
york         62
new york     62
syria        61
plan         56
law          54
says         53
china        48
         count
face        21
seen        21
secret      21
say         21
cuts        20
crisis      20
shows       20
ratings     20
jets        20
risk        20


### Now you can choose if "new" and "york" are important on their own... We'll see how to deal with that next week.

So far we've only used the headlines.  What would you expect to be the difference if we used the full text?

We will just compare cv8 on both, but I *strongly* encourage you to play with other models as well.

In [42]:
cv8_news = cv8.fit_transform(newsdf['text']) #apply the transformation
print(cv8_news.shape)

names = cv8.get_feature_names()   #create list of feature names
count = np.sum(cv8_news.toarray(), axis = 0) # add up feature counts 
count2 = count.tolist()  # convert numpy array to list
count_df = pd.DataFrame(count2, index = names, columns = ['count']) # create a dataframe from the list


(3848, 10146)


In [43]:
sorted_count2 = count_df.sort_values(['count'], ascending = False)  #arrange by count instead
print(sorted_count2.head(10))
print(sorted_count2.tail(10))

        count
said    21374
mr      13575
new      7997
like     6614
people   5547
year     5461
years    4853
time     4757
just     4127
city     3514
                 count
boring              20
did play            20
did little          20
slid                20
breeze              20
brendan             20
denies              20
political party     20
shortcomings        20
collectively        20
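
The totals above were computed by converting the sparse matrix to a dense array first. That works at this size, but as the feature space grows it can use a lot of memory. A sketch of an alternative that sums the sparse matrix directly, shown on the small chat corpus so it runs without the news file:

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "Machine learning is super fun",
    "Python is super, super cool",
    "Statistics is cool, too",
    "Fun? Data science is more than fun",
    "Python is great for machine learning",
    "I like football",
    "Football is great to watch",
    "Python is cool super, super cool",
    "Statistics not cool, too",
]

cv = CountVectorizer()
X = cv.fit_transform(corpus)

# X.sum(axis=0) returns a 1 x n_features matrix; flatten it to a 1-D array
totals = np.asarray(X.sum(axis=0)).ravel()
terms = sorted(cv.vocabulary_, key=cv.vocabulary_.get)  # terms in column order

top_terms = pd.Series(totals, index=terms).sort_values(ascending=False)
print(top_terms.head(3))
```
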


### You now have the tools to do some basic analysis similar to the articles above.

### Main takeaways:
* In order to work with text, we need to create numbers from it. 
* There are lots of choices to be made on how to do that creation. 
* We will spend lots of time discussing the implications of those choices. 


And what else can you do with Trump tweets?
http://www.techrepublic.com/article/a-data-visualization-of-trump-trends-on-social-media/