# Weighting words using Tf-Idf

We need to start thinking about how to translate collections of texts into quantifiable phenomena.  The easiest way to start is to think about word frequencies.


If I ask you “Do you remember the article about electrons in the NY Times?”, there’s a better chance you will remember it than if I ask “Do you remember the article about electrons in the physics books?”. Here’s why: an article about electrons is far rarer in the NY Times than in a collection of physics books. You are less likely to stumble upon the “electron” concept in the NY Times than in a physics book.

Now consider the scenario of a single article. Suppose you read an article and you’re asked to rank the concepts found in it by importance. Chances are you’ll order the concepts roughly by frequency, simply because important concepts are mentioned repeatedly: the narrative gravitates around them.

Combining the two insights, given a term, a document, and a collection of documents, we can loosely say that:

```
importance ~ appearances(term, document) / count(documents containing term in collection)
```

This technique is called Tf-Idf (Term Frequency – Inverse Document Frequency). Here’s how the measure is defined:

```
tf = count(word, document) / len(document)                                  # term frequency
idf = log( len(collection) / count(document_containing_term, collection) )  # inverse document frequency
tf-idf = tf * idf                                                           # term frequency times inverse document frequency
```
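
As a quick sanity check of the definition, here is a minimal sketch on two made-up collections. The documents and the helper names `tf`, `idf`, and `tf_idf` are illustrative assumptions, not part of the Reuters example that follows. It mirrors the electron intuition: the same word in the same article scores higher against a collection where it is rare.

```python
import math

def tf(word, document):
    # share of the document's tokens that are `word`
    return document.count(word) / len(document)

def idf(word, collection):
    # log(number of documents / number of documents containing `word`)
    # (assumes the word occurs in at least one document)
    containing = sum(1 for document in collection if word in document)
    return math.log(len(collection) / containing)

def tf_idf(word, document, collection):
    return tf(word, document) * idf(word, collection)

# Hypothetical, pre-tokenized toy collections
article = ["electron", "spin", "measured", "in", "new", "experiment"]
news = [article,
        ["markets", "rally", "after", "rate", "cut"],
        ["city", "council", "approves", "new", "budget"],
        ["team", "wins", "final", "in", "overtime"]]
physics = [article,
           ["electron", "orbitals", "in", "atoms"],
           ["electron", "charge", "and", "mass"],
           ["photon", "and", "electron", "scattering"]]

print(tf("electron", article))               # 1 of 6 tokens
print(tf_idf("electron", article, news))     # rare in the news collection: high score
print(tf_idf("electron", article, physics))  # in every physics document: 0.0
```

In the physics collection the word appears in every document, so its Idf is `log(4/4) = 0` and the score vanishes regardless of how often the article uses it.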

Let’s test this theory on some data. We’re going to use the Reuters dataset bundled with NLTK.

In [1]:
from nltk.corpus import reuters
 
print(len(reuters.fileids()))            # Number of files in the corpus = 10788
 
# Print the categories associated with a file
print(reuters.categories('training/999'))      # ['interest', 'money-fx']
 
# Print the contents of the file
print(reuters.raw('test/14829'))

10788
['interest', 'money-fx']
JAPAN TO REVISE LONG-TERM ENERGY DEMAND DOWNWARDS
  The Ministry of International Trade and
  Industry (MITI) will revise its long-term energy supply/demand
  outlook by August to meet a forecast downtrend in Japanese
  energy demand, ministry officials said.
      MITI is expected to lower the projection for primary energy
  supplies in the year 2000 to 550 mln kilolitres (kl) from 600
  mln, they said.
      The decision follows the emergence of structural changes in
  Japanese industry following the rise in the value of the yen
  and a decline in domestic electric power demand.
      MITI is planning to work out a revised energy supply/demand
  outlook through deliberations of committee meetings of the
  Agency of Natural Resources and Energy, the officials said.
      They said MITI will also review the breakdown of energy
  supply sources, including oil, nuclear, coal and natural gas.
      Nuclear energy provided the bulk of Japan's electric power
 

In [2]:
import nltk
nltk.download('reuters')

[nltk_data] Downloading package reuters to /home/jon/nltk_data...
[nltk_data]   Package reuters is already up-to-date!


True

Let’s build a tokenizer that ignores punctuation and stopwords:

In [3]:

from string import punctuation
from nltk.corpus import stopwords
from nltk import word_tokenize
 
stop_words = stopwords.words('english') + list(punctuation)
 
def tokenize(text):
    words = word_tokenize(text)
    words = [w.lower() for w in words]
    return [w for w in words if w not in stop_words and not w.isdigit()]
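
To see what this filtering does without downloading any NLTK data, here is a rough, dependency-free approximation. The regex split and the tiny `STOP_WORDS` set are stand-ins assumed for illustration; `word_tokenize` and NLTK's English stopword list behave differently on edge cases such as contractions.

```python
import re
from string import punctuation

# Tiny illustrative stopword set (an assumption; NLTK's list is much longer)
STOP_WORDS = {"the", "a", "an", "of", "to", "in", "and", "will", "its", "by"} | set(punctuation)

def simple_tokenize(text):
    # Split into runs of word characters or single punctuation marks
    words = re.findall(r"\w+|[^\w\s]", text.lower())
    # Same filter as above: drop stopwords, punctuation, and all-digit tokens
    return [w for w in words if w not in STOP_WORDS and not w.isdigit()]

print(simple_tokenize("MITI will revise its outlook by August, to 550 mln kilolitres."))
# ['miti', 'revise', 'outlook', 'august', 'mln', 'kilolitres']
```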
 


We now need to know all the words in the collection:

In [4]:

# build the vocabulary in one pass
vocabulary = set()
for file_id in reuters.fileids():
    words = tokenize(reuters.raw(file_id))
    vocabulary.update(words)
 
vocabulary = list(vocabulary)
word_index = {w: idx for idx, w in enumerate(vocabulary)}
 
VOCABULARY_SIZE = len(vocabulary)
DOCUMENTS_COUNT = len(reuters.fileids())
 
print(VOCABULARY_SIZE, DOCUMENTS_COUNT)    
 

51558 10788


Let’s compute the Idf for every word in the vocabulary:

In [11]:
from collections import defaultdict
import math
word_idf = defaultdict(lambda: 0)
for file_id in reuters.fileids():
    words = set(tokenize(reuters.raw(file_id)))
    for word in words:
        word_idf[word] += 1

for word in vocabulary:
    word_idf[word] = math.log(DOCUMENTS_COUNT / float(1 + word_idf[word]))

print(word_idf['deliberations'])    
print(word_idf['committee'])   

9.28618968425962
9.28618968425962
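
Note that the code above adds 1 to the document count before dividing, which differs slightly from the definition given earlier. Here is a small sketch of the difference, using made-up document counts (the function names are hypothetical). With the `1 +` term, a word that occurs in every document gets a slightly negative Idf, and a word counted in no document gets `log(N)` instead of a division by zero.

```python
import math

N = 10788  # number of documents, as in the Reuters corpus above

def idf_plain(doc_count):
    # definition from the text: log(N / df); undefined for df == 0
    return math.log(N / doc_count)

def idf_smoothed(doc_count):
    # variant used in the code above: log(N / (1 + df))
    return math.log(N / (1 + doc_count))

for df in (0, 1, 100, N):
    plain = idf_plain(df) if df > 0 else float("nan")
    print(df, plain, idf_smoothed(df))
```

`idf_smoothed(0)` is `log(10788)`, about 9.286, which is the value printed for both words above. That value corresponds to a document count of zero.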


Let’s write, as an exercise, the vectorized numpy version of the Idf computation:

In [13]:
import numpy as np
 
word_idf = np.zeros(VOCABULARY_SIZE)
for file_id in reuters.fileids():
    words = set(tokenize(reuters.raw(file_id)))
    indexes = [word_index[word] for word in words]
    word_idf[indexes] += 1.0

word_idf = np.log(DOCUMENTS_COUNT / (1 + word_idf).astype(float))
print(word_idf[word_index['deliberations']])   
print(word_idf[word_index['committee']])      

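KeyError: 'parti'

This error, like the two identical Idf values printed by the previous cell, is most likely an artifact of running the cells out of order. `'parti'` looks like a Porter stem (of “party”), which the stemming `tokenize` defined in the classifier section below produces, while `word_index` was built from unstemmed words. Re-running the notebook top to bottom, so that the tokenizer used here is the one that built the vocabulary, should avoid it.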
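
One subtlety in the numpy version is worth spelling out: `word_idf[indexes] += 1.0` increments each listed position only once, even if an index is repeated. That is harmless here because `words` is a `set`, so every index is unique within a document. A small sketch with a made-up index list shows the difference, with `np.add.at` as the unbuffered alternative for cases where repeats should accumulate:

```python
import numpy as np

indexes = [0, 0, 2]          # hypothetical indices with a repeat

buffered = np.zeros(4)
buffered[indexes] += 1.0     # repeated index 0 is only incremented once
print(buffered)              # [1. 0. 1. 0.]

unbuffered = np.zeros(4)
np.add.at(unbuffered, indexes, 1.0)   # every occurrence is accumulated
print(unbuffered)            # [2. 0. 1. 0.]
```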

Since Idf doesn’t depend on the current document but only on the collection, we can precompute it once, as we did above. Here’s the code for the final computation:

In [2]:
def word_tf(word, document):
    # Accept raw text as well as an already tokenized list
    if isinstance(document, str):
        document = tokenize(document)
 
    return document.count(word) / len(document)
 
def tf_idf(word, document):
    # If not tokenized
    if isinstance(document, str):
        document = tokenize(document)
 
    # Words outside the vocabulary get a score of zero
    if word not in word_index:
        return .0
 
    return word_tf(word, document) * word_idf[word_index[word]]
 

In [None]:
print(tf_idf('year', reuters.raw('test/14829')))
print(tf_idf('following', reuters.raw('test/14829')))
print(tf_idf('provided', reuters.raw('test/14829')))
print(tf_idf('structural', reuters.raw('test/14829')))
print(tf_idf('japanese', reuters.raw('test/14829')))
print(tf_idf('downtrend', reuters.raw('test/14829')))
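
The same idea can rank every distinct term of a document, which is the importance ordering motivated at the start. Below is a minimal self-contained sketch on a made-up, pre-tokenized collection. `rank_terms` and the toy data are hypothetical, the document is assumed to be part of the collection, and the sketch uses the unsmoothed Idf from the definition rather than the `1 +` variant.

```python
import math
from collections import Counter

def rank_terms(document, collection):
    """Return (term, tf-idf) pairs for `document`, highest score first."""
    n_docs = len(collection)
    # document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for doc in collection for term in set(doc))
    counts = Counter(document)
    scores = {
        term: (count / len(document)) * math.log(n_docs / doc_freq[term])
        for term, count in counts.items()
    }
    return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)

collection = [
    ["energy", "demand", "japan", "energy", "outlook"],
    ["oil", "prices", "demand", "rise"],
    ["japan", "yen", "rise"],
    ["oil", "supply", "outlook"],
]

for term, score in rank_terms(collection[0], collection):
    print(f"{term:8s} {score:.3f}")
```

Here "energy" comes out on top: it is both the most frequent term in the first document and absent from the others.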

### Putting it together with a classifier!
We're going to go ahead and classify all the documents in this dataset (what are all the classes?). This code uses scikit-learn's built-in functions to vectorize and classify the data.

![](https://upload.wikimedia.org/wikipedia/commons/thumb/2/26/Precisionrecall.svg/350px-Precisionrecall.svg.png)

In [6]:
from nltk.corpus import stopwords, reuters
from nltk import word_tokenize
from nltk.stem.porter import PorterStemmer
import re

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import LinearSVC
from sklearn.multiclass import OneVsRestClassifier
from sklearn.metrics import f1_score, precision_score, recall_score

cachedStopWords = stopwords.words("english")
stemmer = PorterStemmer()                 # build the stemmer once, not once per token
token_pattern = re.compile('[a-zA-Z]+')

def tokenize(text):
    min_length = 3
    words = [word.lower() for word in word_tokenize(text)]
    words = [word for word in words if word not in cachedStopWords]
    tokens = [stemmer.stem(word) for word in words]
    # keep tokens that start with a letter and have at least min_length characters
    return [token for token in tokens
            if token_pattern.match(token) and len(token) >= min_length]

def represent(documents):
    train_docs_id = list(filter(lambda doc: doc.startswith("train"), documents))
    test_docs_id = list(filter(lambda doc: doc.startswith("test"), documents))
    
    train_docs = [reuters.raw(doc_id) for doc_id in train_docs_id]
    test_docs = [reuters.raw(doc_id) for doc_id in test_docs_id]
    
    # Tokenisation
    vectorizer = TfidfVectorizer(tokenizer=tokenize)
    
    # Learn and transform train documents
    vectorised_train_documents = vectorizer.fit_transform(train_docs)
    vectorised_test_documents = vectorizer.transform(test_docs)

    # Transform multilabel labels
    mlb = MultiLabelBinarizer()
    train_labels = mlb.fit_transform([reuters.categories(doc_id) for doc_id in train_docs_id]) 
    test_labels = mlb.transform([reuters.categories(doc_id) for doc_id in test_docs_id])
    
    return (vectorised_train_documents, train_labels, vectorised_test_documents, test_labels, vectorizer)
 
def train_classifier(train_docs, train_labels):
    classifier = OneVsRestClassifier(LinearSVC(random_state=42))
    classifier.fit(train_docs, train_labels)
    return classifier

def evaluate(test_labels, predictions):
    precision = precision_score(test_labels, predictions, average='macro')
    recall = recall_score(test_labels, predictions, average='macro')

    print("Precision: {:.4f}, Recall: {:.4f}".format(precision, recall))
    

documents = reuters.fileids()
train_docs, train_labels, test_docs, test_labels, vectorizer = represent(documents)
model = train_classifier(train_docs, train_labels)
predictions = model.predict(test_docs)
evaluate(test_labels, predictions)

Precision: 0.6493, Recall: 0.3948


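
To make the `average='macro'` numbers easier to interpret, here is a small dependency-free sketch of how macro-averaged precision and recall are computed on multilabel indicator rows: each label gets its own precision and recall, and the per-label values are averaged with equal weight. The data and the `macro_precision_recall` helper are made up for illustration. Labels with no predicted (or no true) samples are scored 0 here, which matches scikit-learn's default of reporting 0 with a warning in that case.

```python
def macro_precision_recall(y_true, y_pred):
    """y_true, y_pred: lists of equal-length 0/1 rows, one column per label."""
    n_labels = len(y_true[0])
    precisions, recalls = [], []
    for j in range(n_labels):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 1)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 0 and p[j] == 1)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 0)
        # a label that is never predicted (or never true) contributes 0
        precisions.append(tp / (tp + fp) if tp + fp else 0.0)
        recalls.append(tp / (tp + fn) if tp + fn else 0.0)
    return sum(precisions) / n_labels, sum(recalls) / n_labels

# Hypothetical ground truth and predictions for 4 documents and 3 labels
y_true = [[1, 0, 0], [1, 1, 0], [0, 1, 0], [0, 0, 1]]
y_pred = [[1, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 0]]

precision, recall = macro_precision_recall(y_true, y_pred)
print("Precision: {:.4f}, Recall: {:.4f}".format(precision, recall))
# Precision: 0.6667, Recall: 0.5000
```

In this toy case the third label is never predicted, so it contributes 0 to both averages even though every prediction that was made is correct. With many rare classes, macro averaging can be pulled down in the same way.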


In [8]:
list(filter(lambda doc: doc.startswith("test"), documents))

['test/14826',
 'test/14828',
 'test/14829',
 'test/14832',
 'test/14833',
 'test/14839',
 'test/14840',
 'test/14841',
 'test/14842',
 'test/14843',
 'test/14844',
 'test/14849',
 'test/14852',
 'test/14854',
 'test/14858',
 'test/14859',
 'test/14860',
 'test/14861',
 'test/14862',
 'test/14863',
 'test/14865',
 'test/14867',
 'test/14872',
 'test/14873',
 'test/14875',
 'test/14876',
 'test/14877',
 'test/14881',
 'test/14882',
 'test/14885',
 'test/14886',
 'test/14888',
 'test/14890',
 'test/14891',
 'test/14892',
 'test/14899',
 'test/14900',
 'test/14903',
 'test/14904',
 'test/14907',
 'test/14909',
 'test/14911',
 'test/14912',
 'test/14913',
 'test/14918',
 'test/14919',
 'test/14921',
 'test/14922',
 'test/14923',
 'test/14926',
 'test/14928',
 'test/14930',
 'test/14931',
 'test/14932',
 'test/14933',
 'test/14934',
 'test/14941',
 'test/14943',
 'test/14949',
 'test/14951',
 'test/14954',
 'test/14957',
 'test/14958',
 'test/14959',
 'test/14960',
 'test/14962',
 'test/149

### Challenge:
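Can we find which features are important to each class? Try to use some of the code below! Note that `top_feats_by_class` relies on numpy (imported as `np`) and on a helper, `top_mean_feats`, that is not defined in this notebook.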

In [16]:
def top_feats_by_class(Xtr, y, features, min_tfidf=0.1, top_n=25):
    ''' Return a list of dfs, where each df holds top_n features and their mean tfidf value
        calculated across documents with the same class label. '''
    dfs = []
    labels = np.unique(y)
    for label in labels:
        ids = np.where(y==label)
        feats_df = top_mean_feats(Xtr, features, ids, min_tfidf=min_tfidf, top_n=top_n)
        feats_df.label = label
        dfs.append(feats_df)
    return dfs

In [None]:

# get_feature_names() was removed in scikit-learn 1.2
features = vectorizer.get_feature_names_out()

top_feats_by_class(test_docs, test_labels, features)
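
As one possible starting point for the challenge, the sketch below adapts the idea to the multilabel setting used here, where the labels form a 0/1 indicator matrix with one column per class rather than a vector of class ids. For each class it averages the Tf-Idf rows of the documents carrying that label and keeps the highest-scoring features. The function name `top_terms_per_label` and the toy inputs are assumptions for illustration, not part of the original notebook. With the real data you would pass something like `train_docs`, `train_labels`, `features`, and the class names from the `MultiLabelBinarizer` (which `represent` would need to return).

```python
import numpy as np

def top_terms_per_label(X, Y, features, label_names, top_n=3):
    """Map each label to its top_n (feature, mean tf-idf) pairs.

    X: documents x features matrix (dense array or scipy sparse matrix)
    Y: documents x labels 0/1 indicator matrix
    """
    features = np.asarray(features)
    result = {}
    for j, name in enumerate(label_names):
        rows = np.flatnonzero(np.asarray(Y)[:, j])   # documents with this label
        if rows.size == 0:
            result[name] = []
            continue
        # mean tf-idf per feature over those documents, flattened to 1-D
        means = np.asarray(X[rows].mean(axis=0)).ravel()
        best = np.argsort(means)[::-1][:top_n]
        result[name] = [(str(features[i]), float(means[i])) for i in best]
    return result

# Toy stand-ins for the vectorised documents and binarised labels
X = np.array([[0.9, 0.0, 0.1],
              [0.8, 0.2, 0.0],
              [0.0, 0.7, 0.2]])
Y = np.array([[1, 0],
              [1, 0],
              [0, 1]])
print(top_terms_per_label(X, Y, ["oil", "wheat", "price"], ["crude", "grain"], top_n=2))
```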