# Stop Words and tf-idf

Let's investigate one of the most useful feature weightings, tf-idf, and see how stop words fall out of it naturally. To start, let's load a set of short documents.

In [1]:
import pandas as pd
import numpy as np

In [2]:
# load data
try:
    df = pd.read_csv('../data/nlp_data/rt_critics.csv')
except IOError:
    print 'cannot find file'

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 14072 entries, 0 to 14071
Data columns (total 8 columns):
critic         13382 non-null object
fresh          14072 non-null object
imdb           14072 non-null float64
publication    14072 non-null object
quote          14072 non-null object
review_date    14072 non-null object
rtid           14072 non-null float64
title          14072 non-null object
dtypes: float64(2), object(6)
memory usage: 989.4+ KB


In [4]:
df.head(2)

| | critic | fresh | imdb | publication | quote | review_date | rtid | title |
|---|---|---|---|---|---|---|---|---|
| 0 | Derek Adams | fresh | 114709 | Time Out | So ingenious in concept, design and execution ... | 2009-10-04 | 9559 | Toy story |
| 1 | Richard Corliss | fresh | 114709 | TIME Magazine | The year's most inventive comedy. | 2008-08-31 | 9559 | Toy story |


In [5]:
# It seems silly to call such short blurbs 'documents', but we'll stick with the NLP nomenclature.

documents = list(df['quote'])
documents[:5]

['So ingenious in concept, design and execution that you could watch it on a postage stamp-sized screen and still be engulfed by its charm.',
 "The year's most inventive comedy.",
 'A winning animated feature that has something for everyone on the age spectrum.',
 "The film sports a provocative and appealing story that's every bit the equal of this technical achievement.",
 "An entertaining computer-generated, hyperrealist animation feature (1995) that's also in effect a toy catalog."]

## Document Frequency

Let's start by calculating the document frequency of the words in these documents, that is, the fraction of documents in which each word appears. For this task, let's also remove all punctuation marks and make everything lowercase.

In [6]:
from nltk.tokenize import wordpunct_tokenize  # for tokenizing our text
import string  # helps with removing punctuation
from collections import Counter  # great dict-like datastructure for counting things

In [7]:
print string.punctuation

!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~


In [8]:
# This is a bit of text cleanup
word_bag_list = []
for doc in documents:
    cleaned = doc.lower().replace('-', ' ')  # make lowercase and split hyphenated words in two
    for c in string.punctuation:  # strip punctuation marks.
        cleaned = cleaned.replace(c, '')
    word_bag_list.append(wordpunct_tokenize(cleaned))

# How do things look?
print 'a few tokens:', word_bag_list[:3]

# this flattens the nested lists into one big list for some stats
token_list = []
for tokens in word_bag_list:
    token_list.extend(tokens)
print 'number of tokens:', len(token_list)
print 'number of unique tokens:', len(set(token_list))
print 'number of documents:', len(word_bag_list)

a few tokens: [['so', 'ingenious', 'in', 'concept', 'design', 'and', 'execution', 'that', 'you', 'could', 'watch', 'it', 'on', 'a', 'postage', 'stamp', 'sized', 'screen', 'and', 'still', 'be', 'engulfed', 'by', 'its', 'charm'], ['the', 'years', 'most', 'inventive', 'comedy'], ['a', 'winning', 'animated', 'feature', 'that', 'has', 'something', 'for', 'everyone', 'on', 'the', 'age', 'spectrum']]
number of tokens: 280092
number of unique tokens: 22424
number of documents: 14072


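In [15]:
# count how often each token appears across all documents
word_dict = {}
for doc in word_bag_list:  # each doc is a list of tokens
    for token in doc:      # key on the tokens: a list is unhashable and cannot be a dict key
        word_dict[token] = word_dict.get(token, 0) + 1
print len(word_dict)  # one key per unique token

22424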

In [20]:
a = [7, 7, 2, 2, 3, 3, 3]

def return_count(input_list):
    # count how many times each item appears in input_list
    count_dict = {}
    for item in input_list:
        if item not in count_dict:
            count_dict[item] = 1   # first time we see this item
        else:
            count_dict[item] += 1  # seen before, so increment
    return count_dict

In [22]:
return_count(a)

{2: 2, 3: 3, 7: 2}

In [27]:
# calculate the document frequency of all the unique tokens in the bags of words.
# NOTE: this rebinds `df`, which held the DataFrame above; from here on `df` is the document-frequency Counter.

df = Counter()  # dict-like counter; missing keys default to 0

for doc in word_bag_list:
    # count each word once per document (not its total across all documents)
    for token in set(doc):
        df[token] += 1

# normalize the counts by the number of documents
# (float() avoids Python 2 integer division, which would give all zeros)
for token in df:
    df[token] = df[token] / float(len(documents))

# print the 20 highest-scoring words and their scores
df.most_common(20)

[('the', 0.6140562819783968),
 ('a', 0.5035531552018192),
 ('and', 0.48969584991472426),
 ('of', 0.4640420693575895),
 ('is', 0.3320068220579875),
 ('to', 0.32106310403638433),
 ('in', 0.23848777714610575),
 ('that', 0.20082433200682206),
 ('its', 0.1991898806139852),
 ('it', 0.1960631040363843),
 ('with', 0.15513075611142696),
 ('but', 0.15157760090960773),
 ('this', 0.1467453098351336),
 ('movie', 0.12933484934621944),
 ('film', 0.12926378624218307),
 ('for', 0.1286242183058556),
 ('as', 0.12784252416145536),
 ('an', 0.10993462194428652),
 ('be', 0.08484934621944286),
 ('on', 0.08449403069926094)]

## Stop Words

Which words are likely to be stop words? The ones that show up in the most documents. Terms with the largest document frequency say little about any single document, which makes them natural stop-word candidates. The threshold above which you call something a stop word is up to you.
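
As a minimal sketch of that idea, the snippet below picks stop words from a tiny tokenized corpus by thresholding the document frequency. The corpus `toy_docs` and the cutoff of 0.7 are made-up values for illustration, not taken from the review data; the `__future__` import is only there so the snippet behaves the same under Python 2 and 3.

```python
from __future__ import division, print_function
from collections import Counter

# Hypothetical toy corpus, already tokenized (not taken from the review data).
toy_docs = [
    ['the', 'film', 'is', 'a', 'delight'],
    ['the', 'plot', 'is', 'thin'],
    ['a', 'thin', 'plot', 'sinks', 'the', 'film'],
    ['the', 'cast', 'is', 'a', 'delight'],
]

# Document frequency: fraction of documents that contain each token.
toy_df = Counter()
for doc in toy_docs:
    toy_df.update(set(doc))  # set() so a token counts once per document
for token in toy_df:
    toy_df[token] /= len(toy_docs)

threshold = 0.7  # assumed cutoff; tune it for your own corpus
stop_words = sorted(t for t, score in toy_df.items() if score >= threshold)
print(stop_words)
```

Here 'the' appears in every document and 'is' and 'a' in three of the four, so those three clear the cutoff. Lowering the threshold would start to pull in content words such as 'film'.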

## tf-idf

More interesting than stop words is the tf-idf score, which tells us which words are most discriminative between documents. A word that occurs often in one document but appears in few documents overall tells you something special about that document:

$$
\text{tf-idf} = tf \cdot \log(idf) = tf \cdot \log\left(\frac{1}{df}\right) = -tf \cdot \log(df)
$$

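Here $tf$ is the term's share of the tokens in a document and $df$ is the fraction of documents that contain the term. Recall that: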

$$
\log{x} = -\log{1 \over x}
$$
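
To make the formula concrete, here is a small sketch that applies it to a made-up three-document corpus. The corpus `toy_docs` and the helper `tfidf_scores` are hypothetical names introduced for illustration only, and the code is written to run under both Python 2 and 3.

```python
from __future__ import division, print_function
import math
from collections import Counter

# Hypothetical toy corpus, already tokenized.
toy_docs = [
    ['a', 'toy', 'story', 'a', 'triumph'],
    ['a', 'dull', 'story'],
    ['a', 'triumph'],
]
n_docs = len(toy_docs)

# Number of documents that contain each token.
doc_freq = Counter()
for doc in toy_docs:
    doc_freq.update(set(doc))

def tfidf_scores(doc):
    """Return {token: tf * log(1 / df)} for one tokenized document."""
    scores = {}
    for token, count in Counter(doc).items():
        tf = count / len(doc)                          # share of this document's tokens
        inv_df = n_docs / doc_freq[token]              # 1 / df, with df a fraction of documents
        scores[token] = tf * math.log(inv_df)
    return scores

scores = tfidf_scores(toy_docs[0])
for token, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(token, round(score, 3))
```

In the first document, 'toy' gets the highest score because no other document contains it, while 'a' scores exactly zero: it appears in every document, so $df = 1$ and $\log(1/df) = 0$, however often it repeats.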

What are the most discriminative words in the first few documents?

In [35]:
# calculate the term frequency and tf-idf of the tokens in each of the first 100 bags of words.

for doc in word_bag_list[:100]:
    tf = Counter()     # per-document term counts
    tfidf = Counter()  # per-document tf-idf scores; Counter gives us most_common()

    # calculate term frequencies (similar to the document frequencies above).
    # NOTE: iterating over set(doc) counts each distinct token once, so a repeated
    # word gets no extra weight; iterate over doc instead to count repeats.
    # The output below was produced with set(doc).
    for token in set(doc):
        tf[token] += 1
    total = float(sum(tf.values()))  # number of distinct tokens in this document

    # calculate tf-idf scores: (tf / total) * -log(df)
    for token in tf:
        tfidf[token] = (tf[token] / total) * (-np.log(df[token]))

    # print the five most significant words in the document
    print tfidf.most_common(5)

[('engulfed', 0.39799759526739137), ('postage', 0.3691164627440604), ('sized', 0.34023533022072933), ('stamp', 0.31691800572341999), ('ingenious', 0.27994703926504905)]
[('inventive', 1.1776761280575496), ('years', 0.8588893828779226), ('comedy', 0.65543605303509112), ('most', 0.59453821488145864), ('the', 0.097533738115198984)]
[('spectrum', 0.65025615367302192), ('winning', 0.47574203511007074), ('everyone', 0.43231666566869759), ('age', 0.39485397527852278), ('animated', 0.39393272981339084)]
[('equal', 0.41634815803257685), ('sports', 0.39581665384682457), ('provocative', 0.36964725791818803), ('technical', 0.36339204175837664), ('achievement', 0.36339204175837664)]
[('catalog', 0.63679615242782617), ('hyperrealist', 0.63679615242782617), ('1995', 0.49031451393874498), ('toy', 0.45195690427850749), ('generated', 0.41464918508281268)]
[('ushered', 0.26533173017826089), ('revived', 0.26533173017826089), ('lion', 0.21556063381081494), ('repetition', 0.21556063381081494), ('landmark', 

## scikit-learn

scikit-learn comes with utilities that do these calculations for us: `TfidfVectorizer` turns a list of raw documents into a matrix of tf-idf features.

In [36]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [37]:
tfidf_vec = TfidfVectorizer(stop_words='english')
output = tfidf_vec.fit_transform(documents)  # sparse matrix: one row per document, one column per term
print output[20:30, :10].toarray()  # slice first, then densify only the small block

[[ 0.          0.          0.          0.          0.          0.          0.
   0.          0.          0.        ]
 [ 0.          0.          0.          0.          0.          0.          0.
   0.          0.          0.        ]
 [ 0.          0.          0.          0.          0.          0.          0.
   0.          0.          0.        ]
 [ 0.          0.          0.          0.          0.          0.          0.
   0.          0.          0.        ]
 [ 0.          0.          0.          0.          0.33171187  0.          0.
   0.          0.          0.        ]
 [ 0.          0.          0.          0.          0.          0.          0.
   0.          0.          0.        ]
 [ 0.          0.          0.          0.          0.          0.          0.
   0.          0.          0.        ]
 [ 0.          0.          0.          0.          0.          0.          0.
   0.          0.          0.        ]
 [ 0.          0.          0.          0.          0.          0

In [38]:
output.shape[1]  # number of columns, i.e. the size of the vocabulary

21254
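
The numbers from `TfidfVectorizer` will not match the hand-computed scores from earlier, because its documented defaults use a smoothed idf, $\ln\frac{1 + n}{1 + df} + 1$ with $df$ as a raw document count, and then scale each document vector to unit length. The sketch below imitates just that weighting in plain Python on a made-up corpus. `sklearn_style_tfidf` and `toy_docs` are hypothetical names, and the sketch skips the vectorizer's tokenization and stop-word removal, so treat it as an approximation of the defaults rather than the library's implementation.

```python
from __future__ import division, print_function
import math
from collections import Counter

def sklearn_style_tfidf(docs):
    """Pure-Python sketch of smoothed-idf, L2-normalized tf-idf weighting."""
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # raw count of documents containing each token
    # Smoothed idf: ln((1 + n) / (1 + df)) + 1
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1 for t, c in doc_freq.items()}
    rows = []
    for doc in docs:
        raw = {t: count * idf[t] for t, count in Counter(doc).items()}
        # Scale to unit Euclidean length; fall back to 1.0 for an empty document.
        norm = math.sqrt(sum(v * v for v in raw.values())) or 1.0
        rows.append({t: v / norm for t, v in raw.items()})
    return rows

# Hypothetical toy corpus, already tokenized.
toy_docs = [['toy', 'story', 'triumph'], ['dull', 'story'], ['triumph']]
rows = sklearn_style_tfidf(toy_docs)
print(rows[0])
```

Because of the `+ 1` in the idf, a term that appears in every document still gets a nonzero weight, unlike in the plain $tf \cdot \log(1/df)$ formula, where its score is zero.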

In [39]:
print tfidf_vec.get_stop_words()

frozenset(['all', 'six', 'less', 'being', 'indeed', 'over', 'move', 'anyway', 'four', 'not', 'own', 'through', 'yourselves', 'fify', 'where', 'mill', 'only', 'find', 'before', 'one', 'whose', 'system', 'how', 'somewhere', 'with', 'thick', 'show', 'had', 'enough', 'should', 'to', 'must', 'whom', 'seeming', 'under', 'ours', 'has', 'might', 'thereafter', 'latterly', 'do', 'them', 'his', 'around', 'than', 'get', 'very', 'de', 'none', 'cannot', 'every', 'whether', 'they', 'front', 'during', 'thus', 'now', 'him', 'nor', 'name', 'several', 'hereafter', 'always', 'who', 'cry', 'whither', 'this', 'someone', 'either', 'each', 'become', 'thereupon', 'sometime', 'side', 'two', 'therein', 'twelve', 'because', 'often', 'ten', 'our', 'eg', 'some', 'back', 'up', 'go', 'namely', 'towards', 'are', 'further', 'beyond', 'ourselves', 'yet', 'out', 'even', 'will', 'what', 'still', 'for', 'bottom', 'mine', 'since', 'please', 'forty', 'per', 'its', 'everything', 'behind', 'un', 'above', 'between', 'it', 'neit

In [43]:
#tfidf_vec.get_feature_names()