# TF-IDF (term frequency-inverse document frequency)

In this notebook we'll learn about TF-IDF by trying to figure out if there are some words that are indicative of positive or negative sentiment.

TF-IDF is usually used to figure out how sets of words differ from each other. It lets you find words that occur much more frequently in one document class than in another, and it rests on two simple insights:
* Term frequency matters. Terms that are more frequent in a document are more indicative of the topic of that document. ("Term-Frequency")
* Some words are very frequent but meaningless for distinguishing documents (e.g., 'the').  These terms will tend to be frequent across many documents. ("Inverse Document Frequency")

We multiply a function of the term frequency by a function of the inverse document frequency to get the relative weight of each word as an indicator of each document class.
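
To make that product concrete, here is a minimal sketch on a made-up count matrix. It assumes the textbook weighting (raw count times `log(N / df)`); libraries such as scikit-learn use smoothed variants, so their numbers will differ. The `terms` and `counts` values are invented for illustration.

```python
import numpy as np

# Toy term-count matrix (made up for illustration):
# rows are documents, columns are the terms listed below.
terms = ["the", "great", "awful"]
counts = np.array([
    [10, 3, 0],   # document 0
    [12, 0, 4],   # document 1
    [ 9, 1, 0],   # document 2
], dtype=float)

n_docs = counts.shape[0]

# Term frequency: here simply the raw count of each term in each document.
tf = counts

# Document frequency: the number of documents that contain each term.
df = (counts > 0).sum(axis=0)

# Inverse document frequency, textbook form: log(N / df).
idf = np.log(float(n_docs) / df)

# TF-IDF weight: the product of the two.
tfidf = tf * idf

for term, column in zip(terms, tfidf.T):
    print("%-6s %s" % (term, np.round(column, 2)))
```

Because "the" appears in all three toy documents, its IDF is `log(3 / 3) = 0`, so its weight is zero everywhere despite having the highest counts.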

Mostly this notebook is an excuse to show some code for doing TF-IDF -- it doesn't really make sense to have just 2 document classes and use TF-IDF.

We often use TF-IDF weights instead of raw frequencies or presence/absence data to scale the importance of each word in describing a document. It is a very old heuristic without much theoretical grounding, but it works surprisingly well in practice.

(You could imagine a weighted average of dense vectors for a sentence based on TF-IDF weights from a training corpus... Does that work better or worse than a straight average?)
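
The mechanics of that idea might look like the sketch below. It does not answer the question, which needs a real evaluation. The word vectors, the weights, and the `sentence_vector` helper are all hypothetical; in practice the vectors and weights would come from models fit on a training corpus.

```python
import numpy as np

# Hypothetical 3-dimensional word vectors and TF-IDF weights (made up).
word_vectors = {
    "the":   np.array([0.1, 0.1, 0.1]),
    "movie": np.array([0.2, 0.0, 0.4]),
    "great": np.array([0.9, 0.8, 0.0]),
}
tfidf_weights = {"the": 0.1, "movie": 0.5, "great": 2.0}

def sentence_vector(sentence, weights=None):
    """Average the vectors of the known words, optionally weighted.

    Assumes the sentence contains at least one known word.
    """
    words = [w for w in sentence.split() if w in word_vectors]
    vectors = np.array([word_vectors[w] for w in words])
    if weights is None:
        return vectors.mean(axis=0)
    word_weights = np.array([weights[w] for w in words])
    return np.average(vectors, axis=0, weights=word_weights)

plain = sentence_vector("the movie was great")
weighted = sentence_vector("the movie was great", weights=tfidf_weights)
print(plain)
print(weighted)
```

With these toy numbers the weighted average sits much closer to the vector for "great", since that word carries most of the weight.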

In [None]:
import pickle
import csv
import string
import numpy as np
from collections import Counter
from sklearn.feature_extraction import DictVectorizer

In [None]:
with open("data/sentiment_splits.p", "rb") as f:
    X_train, X_dev, X_test, y_train, y_dev, y_test = pickle.load(f)

In [None]:
# We're going to precalculate the words of interest in a separate pass
# because this is a tiny dataset... but as the dataset grows, extra passes
# through it are a bad idea
words_of_interest = Counter() # number of documents each word occurs in
for item in X_train:
    for word in set(item.split()):
        words_of_interest[word] += 1

print("Number of unique words:")
print(len(words_of_interest))
print("Number of words that occur in more than 1 document:")
print(sum(1 for amt in words_of_interest.values() if amt > 1))


In [None]:
# Let's create a mapping from each word in the vocabulary to its
# index in our one-hot embedding space.
# Any word that occurs in only one document is mapped to the special
# "UNK" index (0) -- the others get indices 1, 2, ... in order of
# decreasing document frequency
embeddings = {}
UNK_IDX = 0
known_words_in_vocab = 1
for word, count in words_of_interest.most_common():
    if count == 1:
        embeddings[word] = UNK_IDX
    else:
        embeddings[word] = known_words_in_vocab
        known_words_in_vocab += 1

In [None]:
# Create a function that will convert each paragraph to a binary
# presence/absence vector (repeated words still only set a 1).
def convert_to_vector(paragraph):
    representation = np.zeros(known_words_in_vocab)
    for word in paragraph.split():
        if word in embeddings:
            idx = embeddings[word]
            representation[idx] = 1
        else:
            representation[UNK_IDX] = 1
    return representation
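
As a self-contained illustration of the indexing scheme used above, here is a toy version with three made-up documents. The names `docs`, `vocab`, and `to_presence_vector` are hypothetical stand-ins, not part of the notebook's code.

```python
from collections import Counter
import numpy as np

docs = ["good good film", "bad film", "good plot"]  # toy documents

# Document frequency of each word.
doc_freq = Counter(word for doc in docs for word in set(doc.split()))

# Index 0 is reserved for UNK; words in 2+ documents get indices 1, 2, ...
vocab = {}
next_idx = 1
for word, count in doc_freq.most_common():
    if count > 1:
        vocab[word] = next_idx
        next_idx += 1

def to_presence_vector(text):
    vec = np.zeros(next_idx)
    for word in text.split():
        vec[vocab.get(word, 0)] = 1  # rare or unseen words fall into UNK
    return vec

print(vocab)
print(to_presence_vector("good good bad"))
```

Note that column 0 never corresponds to a real word, so the word at rank `r` in `most_common()` lives in column `r + 1`, not column `r`.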

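We'll create a matrix with one row per document class: negative reviews (row 0) and positive reviews (row 1). Because each review contributes a presence/absence vector, each entry counts the reviews of that class that contain the word.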

In [None]:
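def create_dataset(dataset, targets):
    # Create a 2-row matrix (row 0 = negative, row 1 = positive) with one
    # column per vocabulary index. Each entry counts the documents of that
    # class that contain the word.
    dataset_vector = np.zeros((2, known_words_in_vocab))
    for paragraph, target in zip(dataset, targets):
        # A label of -1 is treated as negative; anything else as positive
        row = 0 if target == -1 else 1
        dataset_vector[row] += convert_to_vector(paragraph)
    return dataset_vector

X_train_vector = create_dataset(X_train, y_train)
print(X_train_vector.shape)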

In [None]:
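# Let's see how many negative and positive reviews contain each of the
# 20 most common words:

most_common = words_of_interest.most_common(20)

print("%-10s %5s %5s" % ("Word", "Neg", "Pos"))
for word, _ in most_common:
    # Column 0 is UNK, so look up each word's column instead of using its rank
    col = embeddings[word]
    print("%-10s %5d %5d" % (word, X_train_vector[0, col], X_train_vector[1, col]))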

We can start to see some potentially interesting things already...
* "and" is positive
* "this" is positive
* "but" is negative

But it's unclear whether we are reading too much into these counts, so let's use TF-IDF.

In [None]:
from sklearn.feature_extraction.text import TfidfTransformer
transformer = TfidfTransformer()
tfidf = transformer.fit_transform(X_train_vector)
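
To see what the transformer computes, the sketch below reproduces its default settings by hand on a made-up 2 x 3 count matrix. It assumes the documented defaults (`smooth_idf=True`, `norm="l2"`, `sublinear_tf=False`); the `counts` values are invented.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer

# Toy 2 x 3 count matrix (made up): rows are classes, columns are words.
counts = np.array([
    [5.0, 2.0, 0.0],
    [4.0, 0.0, 3.0],
])

n_docs = counts.shape[0]
df = (counts > 0).sum(axis=0)

# Smoothed IDF as documented for smooth_idf=True:
# idf = ln((1 + n) / (1 + df)) + 1
idf = np.log((1.0 + n_docs) / (1.0 + df)) + 1.0

# Weight the counts, then scale each row to unit Euclidean length (norm="l2").
weighted = counts * idf
manual = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

from_sklearn = TfidfTransformer().fit_transform(counts).toarray()
print(np.allclose(manual, from_sklearn))
```

Each row of the input is treated as one document, so with our two-row class matrix the transformer sees only two documents.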

In [None]:
tfidf = tfidf.toarray()
print(tfidf)

In [None]:
print(tfidf.shape)

In [None]:
# Now let's look at the weights of the most common words
most_common = words_of_interest.most_common(20)

print("%-10s %5s %5s" % ("Word", "Neg", "Pos"))
for word, _ in most_common:
    # Column 0 is UNK, so look up each word's column instead of using its rank
    col = embeddings[word]
    print("%-10s %5.2f %5.2f" % (word, tfidf[0, col], tfidf[1, col]))

In [None]:
print(tfidf.shape)
# Column index of the largest TF-IDF weight in each row
print(np.argmax(tfidf, axis=1))

In [None]:
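# Column indices of the 20 largest TF-IDF weights in each row (unordered)
indices = np.argpartition(tfidf, -20, axis=1)[:, -20:]

# Reverse lookup from column index to word. Column 0 is shared by every
# word that appears in only one document, so label it "UNK".
index_to_word = {idx: word for word, idx in embeddings.items() if idx != UNK_IDX}
index_to_word[UNK_IDX] = "UNK"

def get_words(row):
    # (word, number of training documents containing it, TF-IDF weight)
    words = [(index_to_word[i], X_train_vector[:, i].sum(), tfidf[row, i])
             for i in indices[row]]
    words = sorted(words, key=lambda x: x[2], reverse=True)
    for unit in words:
        print("%-10s %5d %5.2f" % unit)
    return words

print("Negative words:")
negwords = get_words(0)
print("")
print("Positive words:")
poswords = get_words(1)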
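
The cell above relies on `argpartition` to pick out the largest weights without sorting the whole row. A small sketch of that pattern on a made-up 1-D array (the `weights` values and `k` are invented):

```python
import numpy as np

weights = np.array([0.10, 0.70, 0.30, 0.90, 0.20, 0.50])  # toy weights
k = 3

# argpartition places the indices of the k largest entries in the
# last k slots, but does not order them among themselves.
top_k = np.argpartition(weights, -k)[-k:]

# Sort just those k indices by weight, largest first.
top_k_sorted = top_k[np.argsort(weights[top_k])[::-1]]
print(top_k_sorted)
```

Partitioning first and sorting only the `k` survivors avoids a full sort, which matters more as the vocabulary grows.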

So -- unsurprisingly, since we only have 2 classes -- TF-IDF is not very helpful here. With more classes we might find more interesting things. This makes sense: "inverse document frequency" carries little information with only two classes, because each class is treated as a single document and a word can occur in only 1 or 2 of them.
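
A quick way to see this is to list every IDF value that is possible for a given number of documents. The sketch assumes the smoothed formula that scikit-learn's `TfidfTransformer` documents as its default, `ln((1 + n) / (1 + df)) + 1`; the helper name `possible_idf_values` is hypothetical.

```python
import math

def possible_idf_values(n_docs):
    """Smoothed IDF for every possible document frequency 1..n_docs."""
    return [math.log((1.0 + n_docs) / (1.0 + df)) + 1.0
            for df in range(1, n_docs + 1)]

# With 2 documents there are only two possible IDF values.
print([round(v, 3) for v in possible_idf_values(2)])

# With 5 documents there are five, so IDF can separate words more finely.
print([round(v, 3) for v in possible_idf_values(5)])
```

With two classes, every word gets one of just two IDF multipliers, so the ranking within each row is driven almost entirely by the raw counts.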