# tf-idf

This notebook demonstrates how to calculate tf-idf (term frequency-inverse document frequency) in Python. The data is extracted from online textbooks written at the high school and early college level. There are four documents, each containing the text of one chapter on a different topic:

* anatomy
* business law
* economics
* geography

In [8]:
num_docs = 4  # number of documents in the corpus

with open('data/anat.txt', 'r') as f:
    doc_anat = f.read().lower()
    doc_anat = doc_anat.replace('\n', ' ')
    

with open('data/buslaw.txt', 'r') as f:
    doc_buslaw = f.read().lower()
    doc_buslaw = doc_buslaw.replace('\n', ' ')
    
with open('data/econ.txt', 'r') as f:
    doc_econ = f.read().lower()
    doc_econ = doc_econ.replace('\n', ' ')
    
with open('data/geog.txt', 'r') as f:
    doc_geog = f.read().lower()
    doc_geog = doc_geog.replace('\n', ' ')
    
# look at part of a document
doc_geog[:50]

'chapter 13 the pacific and antarctica the immense '

## tf

The code below defines a function that calculates the frequency of each term in a document, normalized by the number of tokens in that document. Using a `collections.Counter` object would make the code faster, but the goal here is to see how the sausage is made.
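
For comparison, the `Counter` approach mentioned above might look like the following minimal sketch. The function name `tf_with_counter` and the sample tokens are illustrative and not part of the notebook. The sketch assumes the tokens have already been lowercased and filtered.

```python
from collections import Counter

def tf_with_counter(tokens):
    """Return {term: count / len(tokens)} for a list of tokens (hypothetical helper)."""
    if not tokens:
        return {}
    counts = Counter(tokens)  # counts every term in a single pass
    total = len(tokens)
    return {term: count / total for term, count in counts.items()}

# illustrative tokens, not taken from the notebook's data
sample_tf = tf_with_counter(['nerve', 'system', 'nerve', 'cell'])
print(sample_tf)
```

Because `Counter` tallies all terms in one pass, it avoids rescanning the token list once per term.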

In [9]:
# imports and set up

from nltk import word_tokenize
from nltk.corpus import stopwords

# store the stop words in a set for fast membership tests
stopwords = set(stopwords.words('english'))

In [10]:
# create tf dictionaries for each document

vocab = set()  # set of words

def create_tf_dict(doc):
    tf_dict = {}
    tokens = word_tokenize(doc)
    tokens = [w for w in tokens if w.isalpha() and w not in stopwords]
     
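    # get term counts with an explicit loop
    for t in tokens:
        if t in tf_dict:
            tf_dict[t] += 1
        else:
            tf_dict[t] = 1
            
    # get the same counts in a more compact way; this overwrites the dict
    # built by the loop above (tokens.count rescans the list for every term)
    token_set = set(tokens)
    tf_dict = {t: tokens.count(t) for t in token_set}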
    
    # normalize tf by number of tokens
    for t in tf_dict.keys():
        tf_dict[t] = tf_dict[t] / len(tokens)
        
    return tf_dict

tf_anat = create_tf_dict(doc_anat)
tf_buslaw = create_tf_dict(doc_buslaw)
tf_econ = create_tf_dict(doc_econ)
tf_geog = create_tf_dict(doc_geog)
    
    
# add to vocab
vocab = set(tf_anat.keys())
vocab = vocab.union(set(tf_buslaw.keys()))
vocab = vocab.union(set(tf_econ.keys()))
vocab = vocab.union(set(tf_geog.keys()))

print("number of unique words:", len(vocab))
    

number of unique words: 4054


In [11]:
# get tf for 'work' in each doc

print('tf for "work" in anat =', tf_anat.get('work'))
print('tf for "work" in buslaw =', tf_buslaw.get('work'))
print('tf for "work" in econ =', tf_econ.get('work'))
print('tf for "work" in geog =', tf_geog.get('work'))


tf for "work" in anat = 0.00046040515653775324
tf for "work" in buslaw = 0.0027739251040221915
tf for "work" in econ = 0.0006854009595613434
tf for "work" in geog = 0.0009285051067780873


## idf

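Build an idf dictionary for the vocabulary. Adding 1 to the denominator avoids division by zero for a term that occurs in no document (here every vocabulary term occurs in at least one document, so it acts only as smoothing). Adding 1 to the numerator as well keeps the idf from going negative: a term found in every document gets log(1) = 0.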
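
Written out, the smoothed idf computed in the next cell is shown below. The symbols are introduced here for illustration: $N$ is the number of documents (`num_docs`) and $\mathrm{df}(t)$ is the number of documents that contain term $t$.

$$\mathrm{idf}(t) = \ln\frac{1 + N}{1 + \mathrm{df}(t)}$$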

In [19]:
import math

idf_dict = {}

vocab_by_topic = [tf_anat.keys(), tf_buslaw.keys(), 
                  tf_econ.keys(), tf_geog.keys()]

for term in vocab:
    # number of documents that contain the term
    doc_count = sum(term in voc for voc in vocab_by_topic)
    idf_dict[term] = math.log((1 + num_docs) / (1 + doc_count))

In [20]:
# look at idf for 'work'
# 0 idf because it occurs in all docs
print('idf for work:', idf_dict['work'])

# look at idf for 'inflation'
# high idf because it occurs in 1 of the 4 docs
print('idf for inflation:', idf_dict['inflation'])

idf for work: 0.0
idf for inflation: 0.9162907318741551
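
To see why both +1 terms matter, the sketch below tabulates the idf for every possible document count in a four-document corpus. `smoothed_idf` mirrors the formula used above. `denominator_only_idf` is a hypothetical variant, added only for comparison, that omits the +1 in the numerator.

```python
import math

def smoothed_idf(df, n_docs):
    """idf with +1 in both numerator and denominator, as in the notebook."""
    return math.log((1 + n_docs) / (1 + df))

def denominator_only_idf(df, n_docs):
    """Hypothetical variant with +1 only in the denominator, for comparison."""
    return math.log(n_docs / (1 + df))

n_docs = 4
for df in range(1, n_docs + 1):
    # columns: document count, smoothed idf, denominator-only idf
    print(df, round(smoothed_idf(df, n_docs), 4),
          round(denominator_only_idf(df, n_docs), 4))
```

For a term that occurs in all four documents, the smoothed idf is exactly 0, while the denominator-only variant is negative.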


## tf-idf

Create a tf-idf dictionary for each document.

In [14]:
def create_tfidf(tf, idf):
    tf_idf = {}
    for t in tf.keys():
        tf_idf[t] = tf[t] * idf[t] 
        
    return tf_idf

tf_idf_anat = create_tfidf(tf_anat, idf_dict)
tf_idf_buslaw = create_tfidf(tf_buslaw, idf_dict)
tf_idf_econ = create_tfidf(tf_econ, idf_dict)
tf_idf_geog = create_tfidf(tf_geog, idf_dict)
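
To check the pipeline by hand, here is a sketch that runs the same three steps on a tiny corpus. The three toy documents and all names prefixed with `toy_` are invented for illustration. The sketch splits on whitespace rather than using nltk tokenization, and it keeps stop words so that a word shared by every document is visible.

```python
import math

# hypothetical toy corpus, not the notebook's data
toy_docs = {
    'a': 'the cell membrane protects the cell',
    'b': 'the contract binds the party',
    'c': 'the island lies in the pacific',
}

def toy_tf(text):
    """Term frequency normalized by the number of tokens."""
    tokens = text.split()
    return {t: tokens.count(t) / len(tokens) for t in set(tokens)}

tf_by_doc = {name: toy_tf(text) for name, text in toy_docs.items()}

# vocabulary is the union of the terms in every document
toy_vocab = set().union(*tf_by_doc.values())

n = len(toy_docs)
toy_idf = {}
for term in toy_vocab:
    doc_count = sum(term in tf for tf in tf_by_doc.values())
    toy_idf[term] = math.log((1 + n) / (1 + doc_count))

toy_tfidf = {name: {t: tf[t] * toy_idf[t] for t in tf}
             for name, tf in tf_by_doc.items()}

print(toy_tfidf['a'])
```

Here 'the' occurs in all three documents, so its idf and tf-idf are 0, while 'cell' occurs twice in a single document and gets the highest weight in document 'a'.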

In [15]:
# find the lowest tf-idf terms for the anatomy text
doc_term_weights = sorted(tf_idf_anat.items(), key=lambda x:x[1])
doc_term_weights[:5]

[('may', 0.0),
 ('additional', 0.0),
 ('fact', 0.0),
 ('common', 0.0),
 ('simple', 0.0)]

In [16]:
# find the highest tf-idf terms for each document
doc_term_weights = sorted(tf_idf_anat.items(), key=lambda x:x[1], reverse=True)
print("\nanatomy: ", doc_term_weights[:5])

doc_term_weights = sorted(tf_idf_buslaw.items(), key=lambda x:x[1], reverse=True)
print("\nbusiness law: ", doc_term_weights[:5])

doc_term_weights = sorted(tf_idf_econ.items(), key=lambda x:x[1], reverse=True)
print("\neconomics: ", doc_term_weights[:5])

doc_term_weights = sorted(tf_idf_geog.items(), key=lambda x:x[1], reverse=True)
print("\ngeography: ", doc_term_weights[:5])


anatomy:  [('sympathetic', 0.01581993666909798), ('system', 0.014228798452045322), ('autonomic', 0.012234084357435773), ('parasympathetic', 0.010335691957144016), ('receptors', 0.008648232045773564)]

business law:  [('party', 0.023129668959930128), ('damages', 0.023129668959930128), ('breach', 0.011946092759524352), ('nonbreaching', 0.011691920573151495), ('remedies', 0.008133509963931473)]

economics:  [('inflation', 0.04542725355647514), ('prices', 0.013816584031001654), ('index', 0.009839082567531481), ('price', 0.009103129690141027), ('basket', 0.005861581104061308)]

geography:  [('islands', 0.023609162311520708), ('island', 0.012123623889699824), ('antarctica', 0.008295111082426195), ('ozone', 0.007444330458587611), ('pacific', 0.0068062449907086734)]
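
The repeated sort-and-slice pattern above could be wrapped in a small helper. The sketch below is one possible version: `top_terms` is a hypothetical name, and the example weights are rounded, illustrative values rather than the notebook's exact output. It uses `heapq.nlargest`, which avoids sorting the whole dictionary when only a few terms are needed.

```python
import heapq

def top_terms(weights, n=5):
    """Return the n (term, weight) pairs with the largest weights (hypothetical helper)."""
    return heapq.nlargest(n, weights.items(), key=lambda item: item[1])

# illustrative weights only
example_weights = {'inflation': 0.045, 'prices': 0.014, 'may': 0.0, 'index': 0.0098}
print(top_terms(example_weights, n=2))
```

In the notebook this would be called as, for example, `top_terms(tf_idf_econ)`.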


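For these documents, tf-idf did a good job of identifying important words: the top-scoring terms in each document are specific to its topic (for example 'sympathetic', 'damages', 'inflation', and 'islands'), while words that appear in all four documents, such as 'may' and 'common', score 0.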