# tf-idf

*Lauren F. Klein wrote version 1.0 of this notebook in 2019 based on tutorials by [Matthew Lavin](https://programminghistorian.org/en/lessons/analyzing-documents-with-tfidf) and [Kavita Ganesan](https://kavita-ganesan.com/tfidftransformer-tfidfvectorizer-usage-differences/#.XZVlcOdKhSw). Dan Sinykin supplemented it with material from Melanie Walsh's chapter [TF-IDF](https://melaniewalsh.github.io/Intro-Cultural-Analytics/features/Text-Analysis/TF-IDF.html) in 2020. Lauren Klein updated it again in 2021.*

We will learn powerful data science techniques soon. But, in many cases, just counting words can tell you a lot. To wit:

<img src="http://lklein.com/wp-content/uploads/2021/10/Screen-Shot-2021-10-06-at-3.33.34-PM.png">

Today, we're going to explore a method called Term Frequency - Inverse Document Frequency (tf-idf). Tf-idf comes up a lot in text analysis projects because it’s both a corpus exploration method and a pre-processing step for many other text-mining measures and models.

The procedure was introduced in a 1972 paper by Karen Spärck Jones under the name “term specificity,” and the basic idea is this:

Instead of representing a term in a document by its raw frequency or its relative frequency (the term count divided by the document length), each term is *weighted* by dividing the term frequency by the number of documents in the corpus containing the word. 

The overall effect of this weighting scheme is to avoid a common problem when conducting text analysis: the most frequently used words in any particular document are often the most frequently used words in all of the documents.

By contrast, the terms with the highest tf-idf scores are the ones that are distinctively frequent in a document when that document is compared to the other documents in the corpus. When you sort by tf-idf score, these distinctive terms rise to the top.
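
To make the basic idea concrete, here is a minimal sketch in plain Python. The three "documents" are made-up sentences, and the weighting is the simplified version described above (count divided by the number of documents containing the word), not the exact formula that scikit-learn uses later in this notebook.

```python
# Three tiny made-up "documents" (hypothetical data, for illustration only)
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the bird sang",
]
tokenized = [doc.split() for doc in docs]

# Document frequency: the number of documents that contain each word
doc_freq = {}
for tokens in tokenized:
    for word in set(tokens):
        doc_freq[word] = doc_freq.get(word, 0) + 1

# Weight each word in the first document: raw count divided by document frequency
first = tokenized[0]
weights = {word: first.count(word) / doc_freq[word] for word in set(first)}

# Print the words of the first document, highest weight first
for word, weight in sorted(weights.items(), key=lambda item: (-item[1], item[0])):
    print(f"{word:>4}  count={first.count(word)}  df={doc_freq[word]}  weight={weight:.2f}")
```

Even though "the" has the highest raw count in the first document, it appears in every document, so its weight falls below that of words like "mat" that appear only there.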

## An Analogy ##
    
If this explanation doesn’t quite resonate, a brief analogy might help. 

Say you've decided to leave campus to get dinner on Buford Highway. Since leaving campus takes a lot of effort (and also, crucially, access to a car), the food better be worth it! That means you'll need to balance two competing goals:

1) The food has to be really tasty; and also, crucially: 
2) If you're going to go all the way out to Buford Highway, it better be something that you can't also get in Emory Village. Otherwise, why go to all the trouble of getting there?!

Or, to give an example involving actual food: you don't want to go all the way out to Buford Highway to get pizza. Even if the pizza on Buford Highway is pretty tasty, you can get pizza anywhere in town. How can you find out what is distinctively tasty on Buford Highway?

If you looked up the Yelp reviews for all the restaurants on Buford Highway and sorted by score, you would get an answer to the question of what's the tastiest. But it still won't help solve the problem of what's *distinctively tasty* on Buford Highway--like hot pot, for example, which is something that you can't get in Emory Village.

So you need a way to tell the difference between what's tasty and what's distinctively tasty. To do so, you need to distinguish between four categories of food. Food that, on Buford Highway, is:

- both tasty and distinctive (e.g. hot pot)
- tasty but not distinctive (e.g. pizza) 
- distinctive but not tasty (e.g. tacos-- tho I'm open to disagreement here)
- neither tasty nor distinctive (e.g. Taco Bell--again, open to disagreement).

These categories are what tf-idf helps you measure. Term frequencies can be assessed according to the same criteria. A term might be:

- Frequently used in the corpus, and used especially frequently in one particular document <-- Interesting! 
- Frequently used in the corpus, but used frequently in equal measure across all documents <-- Less interesting
- Infrequently used in the corpus, but nonetheless used frequently in one particular document <-- Potentially interesting
- Infrequently used in the corpus, and consistently infrequent across all documents <-- Not interesting

It's the words that are especially frequent in one document that are most interesting to us, and the ones that TF/IDF helps us identify. To see how, let's take a look at our next corpus--a slightly bigger one--articles published in the *Emory Wheel*.

## *The Emory Wheel* 

In this lesson, we're going to use tf-idf to study the articles published by *The Emory Wheel* between 2014 and 2019. This dataset was created by Honggang Min and Kexin Guan for their final project in the 2019 iteration of this course, and was generously transferred back to me for future class use.

## Pre-processing: prepare the documents

Tf-idf works on sets of documents. Because we'll be using sk-learn's CountVectorizer, which we learned about last class, in order to count the words, we'll need to get the documents into a list, with each document stored as its own string. 

In this particular case, the documents are stored in a single CSV file along with some metadata. Below is some code to get the data from the csv format into a list for processing. 

While this is custom code written for this particular dataset, you'll always need to write some sort of file/text pre-processing code in order to use CountVectorizer and other sk-learn functions. You'll get very familiar with writing code like this by the end of the course! 

In [2]:
import pandas as pd

# read in the file -- using a dataframe since it's already formatted as a csv
df = pd.read_csv('../corpora/emory-wheel/sorteddata.csv')

# because I happen to know that there are some encoding errors in this corpus,
# we'll explicitly convert each value in this column to a Python string
df['Content'] = df['Content'].astype(str)

# and then convert to a list for vectorizing
all_docs = df['Content'].tolist()

# take a look at the first one
print(all_docs[0])



The Commission on the Liberal Arts (CoLA) has submitted its final report on developing a vision forEmory University as a residential liberal arts research university, signifying the end of a yearlongprocess that engaged hundreds of faculty, students and staff.The report recommends building intellectual engagement through conversation, such as creatingdiscussions around University-wide events; the creation of a new cross-unit course called a synthesisseminar that would focus on different, specific academic topics; and the expansion of mentoringprograms, progressively structured in a “Karate belt” system. In the long term, CoLA recommendscloser examination of five interrelated strategies ranging from re-evaluation of faculty course load tobetter telling Emory’s story. These recommendations stemmed from four themes identified in theyearlong investigation and subcommittee reports, which were: ongoing and open communication,creating synergies and leveraging existing programs, evaluation and
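
If you want to try this loading pattern without the corpus file, here is a small sketch that reads a made-up, in-memory CSV instead. The `Title` column and both rows are hypothetical. The `fillna("")` step is an extra precaution that is not in the cell above: it keeps an empty cell from turning into the literal text "nan" when everything is converted to strings.

```python
import io
import pandas as pd

# A made-up two-row CSV standing in for the real file (hypothetical data).
# The second row has no content, so pandas reads it as a missing value.
csv_text = "Title,Content\nFirst story,Emory students met today.\nSecond story,\n"

toy_df = pd.read_csv(io.StringIO(csv_text))

# Replace missing values with empty strings, then make sure every entry is a string
toy_df["Content"] = toy_df["Content"].fillna("").astype(str)

# Convert the column to a plain Python list, one string per document
toy_docs = toy_df["Content"].tolist()
print(toy_docs)
```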

## Import our sk-learn libraries

Conveniently, scikit-learn, which we were introduced to in the previous lesson, allows us to calculate tf-idf with just a few lines of code. We'll go through it slowly first, and then quickly.

In [3]:
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer


## Create document-term matrix

We'll use a doc-term matrix to calculate tf-idf. Remember how to create a doc-term matrix from last lesson?

In [4]:
#instantiate CountVectorizer()
cv=CountVectorizer(stop_words='english') # using stopwords this time

# this step generates the document-term matrix for the docs
dtm=cv.fit_transform(all_docs)

# check shape
dtm.shape

(4027, 135198)

**How many articles do we have in this document-term matrix and how many unique features/terms?**

4027 articles, 135K features

That line, as we learned last class, tells us that we have 4027 rows, one for each document in the corpus, and 135,198 columns, one for each word (minus single-character words, which the tokenizer excludes, as well as the default stopwords, which we've indicated should be excluded with the `stop_words='english'` parameter above).
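
To see those exclusions at a small scale, here is a sketch with two made-up sentences (hypothetical data). It uses the same `CountVectorizer(stop_words='english')` setup as above, so you can check by eye which words survive as columns.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two made-up documents (hypothetical data)
toy_docs = [
    "a cat and a dog",
    "the dog chased the cat",
]

toy_cv = CountVectorizer(stop_words="english")
toy_dtm = toy_cv.fit_transform(toy_docs)

# rows = documents, columns = words that were kept
print(toy_dtm.shape)

# "a" is dropped as a single-character token; "and" and "the" are dropped as stopwords
print(sorted(toy_cv.vocabulary_))

# the counts themselves, one row per document
print(toy_dtm.toarray())
```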

We can also look at the whole vocabulary like this:

In [13]:
cv.vocabulary_

{'commission': 28924,
 'liberal': 68569,
 'arts': 14127,
 'cola': 28303,
 'submitted': 110161,
 'final': 45157,
 'report': 96173,
 'developing': 35404,
 'vision': 128172,
 'foremory': 46836,
 'university': 125794,
 'residential': 96580,
 'research': 96476,
 'signifying': 103942,
 'end': 40694,
 'yearlongprocess': 134183,
 'engaged': 40846,
 'hundreds': 57686,
 'faculty': 43599,
 'students': 109696,
 'staff': 107897,
 'recommends': 94722,
 'building': 22786,
 'intellectual': 61262,
 'engagement': 40854,
 'conversation': 30840,
 'creatingdiscussions': 31993,
 'wide': 131806,
 'events': 42182,
 'creation': 32003,
 'new': 78147,
 'cross': 32320,
 'unit': 125723,
 'course': 31635,
 'called': 24006,
 'synthesisseminar': 111645,
 'focus': 46280,
 'different': 35710,
 'specific': 106915,
 'academic': 3858,
 'topics': 122149,
 'expansion': 42884,
 'mentoringprograms': 73479,
 'progressively': 91704,
 'structured': 109513,
 'karate': 65540,
 'belt': 19217,
 'long': 69798,
 'term': 113067,
 'reco

The numbers above are the indices for each feature, not the word counts. But we can use the indices in order to generate our word counts.

Note that we did a similar thing last class using `cv.get_feature_names()`. Working from `vocabulary_` is arguably more efficient, and `get_feature_names()` has been deprecated in recent versions of scikit-learn (in favor of `get_feature_names_out()`), so it is perhaps better to use this!

In [24]:
sum_words = dtm.sum(axis=0) # sum_words is a vector that contains the number of times each word appears in all 
                            # the docs in the corpus. In other words, we are summing the elements for each column 
                            # of the doc-term matrix and storing those counts as a vector 

# then sort the list of tuples that contain the word and their occurrence in the corpus.
# tuples are Python's name for single variables that actually store multiple variables, 
# like the word and index in the vocabulary attribute above 

words_freq = [(word, sum_words[0, idx]) for word, idx in cv.vocabulary_.items()] # remember list comprehension!

words_freq = sorted(words_freq, key = lambda x: x[1], reverse=True)

# display the top 10
words_freq[:10]

[('emory', 15503),
 ('said', 13088),
 ('students', 7297),
 ('university', 5670),
 ('time', 4971),
 ('team', 4777),
 ('like', 4748),
 ('college', 4719),
 ('student', 4695),
 ('people', 4505)]

We can already see some words with the most counts don't seem too distinctive: "emory" and "students," for example. It's not surprising that those are the most frequently occurring words since the Wheel is a newspaper about Emory students.

So now let's calculate the IDF values so that we can balance them out. While we could also calculate these by hand, sk-learn makes it really easy to do it in a few lines of code, so we'll use that instead. 

## Initialize TfidfTransformer

When you initialize TfidfTransformer, you can choose to set it with different parameters. These parameters will change the way you calculate tf–idf. The recommended way to run `TfidfTransformer` is with smoothing (`smooth_idf = True`) and normalization (`norm='l2'`) turned on. These parameters will better account for differences in document length, and, overall, they'll produce more meaningful tf–idf scores. 

In [5]:
# Call tfidf_transformer.fit on the word count vector we computed earlier.
tfidf_transformer=TfidfTransformer(smooth_idf=True,use_idf=True)
tfidf_transformer.fit(dtm)

TfidfTransformer(norm='l2', smooth_idf=True, sublinear_tf=False, use_idf=True)
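
To get a feel for what the `norm='l2'` setting does, here is a sketch on three made-up documents (hypothetical data) of very different lengths. It compares the scores with and without normalization; `norm=None` is used here only for contrast.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Made-up documents: the second repeats the first four times over
toy_docs = [
    "pizza tacos",
    "pizza tacos pizza tacos pizza tacos pizza tacos",
    "hot pot",
]
toy_counts = CountVectorizer().fit_transform(toy_docs)

with_norm = TfidfTransformer(smooth_idf=True, norm="l2").fit_transform(toy_counts).toarray()
without_norm = TfidfTransformer(smooth_idf=True, norm=None).fit_transform(toy_counts).toarray()

# With norm='l2', each row (document) is rescaled to have length 1
row_lengths = np.linalg.norm(with_norm, axis=1)
print(row_lengths)

# So the short and the long pizza/tacos documents end up with identical scores
print(np.allclose(with_norm[0], with_norm[1]))

# Without normalization, the longer document's scores are simply four times larger
print(without_norm[:2].round(2))
```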

## Print inverse document frequency (idf) values

In [26]:
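# make a dataframe for the idf values
# (get_feature_names_out() needs scikit-learn 1.0+; on older versions, use cv.get_feature_names())
df_idf = pd.DataFrame(tfidf_transformer.idf_, index=cv.get_feature_names_out(), columns=["idf_weights"])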
 
# sort ascending
df_idf.sort_values(by=['idf_weights'])

| | idf_weights |
|---|---|
| emory | 1.299087 |
| said | 1.508428 |
| time | 1.566656 |
| like | 1.654901 |
| just | 1.753803 |
| year | 1.762282 |
| university | 1.762814 |
| college | 1.786522 |
| new | 1.795285 |
| people | 1.829414 |


In the table above, the words at the top are those that appear in the largest number of documents across the whole corpus, and the words at the bottom are those that appear in the fewest documents.

Once again, it makes sense that words like "Emory" and "said" are at the top. It's a newspaper, after all! 

The words at the bottom appear to be either typos or whitespace errors. I'm guessing most of those appear only once across the entire corpus. 

## IDF by the numbers

But what are these numbers that we're looking at?

The most direct formula would be **N/df<sub>i</sub>**, where N represents the total number of documents in the corpus, and df<sub>i</sub> is the number of documents in which term i appears.

However, many implementations of tf-idf, including scikit-learn, which we are using, normalize the results with additional operations. 

In tf-idf, normalization is generally used in two ways, and for two reasons: first, to prevent bias in term frequency from terms in shorter or longer documents; and second, as above, to calculate each term’s idf value. 

With smoothing turned on (`smooth_idf=True`, as above), scikit-learn’s implementation of tf-idf adds 1 to both N and df<sub>i</sub>, calculates the natural logarithm of **(N+1)/(df<sub>i</sub>+1)**, and then adds **1** to the final result. Here is the general shape of that formula, formatted slightly more nicely:

<img src="http://lklein.com/wp-content/uploads/2019/10/Screen-Shot-2019-10-02-at-11.52.31-PM.png">

**Important note!** This is only one way to calculate TF-IDF. There are many, many versions. The number itself isn't important. It's the *ranking* that the number enables that's most interesting to us.


## Produce & print tf-idf scores

Once you have the idf values, you can compute the tf-idf scores for any document or set of documents. Let’s compute tf-idf scores for the documents in our corpus.

In [28]:
# tf-idf scores
tf_idf_vector=tfidf_transformer.transform(dtm)

Now, let’s print the tf-idf values of the first document to see if they make sense. 

We'll place the tf-idf scores from the first document into a pandas dataframe and sort the dataframe in descending order of scores.

In [29]:
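# (get_feature_names_out() needs scikit-learn 1.0+; on older versions, use cv.get_feature_names())
feature_names = cv.get_feature_names_out()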

#get tfidf vector for first document
first_document_vector=tf_idf_vector[0]
 
#print the scores for the first doc
df = pd.DataFrame(first_document_vector.T.todense(), index=feature_names, columns=["tfidf"])
df.sort_values(by=["tfidf"],ascending=False)

| | tfidf |
|---|---|
| cola | 0.505182 |
| report | 0.280217 |
| faculty | 0.215381 |
| intellectual | 0.177281 |
| recommendations | 0.158464 |
| fivush | 0.123603 |
| liberal | 0.115338 |
| evaluation | 0.108227 |
| belt | 0.105246 |
| recommends | 0.096602 |


Notice that only certain words have scores. This is because only the words in this document have a tf-idf score and everything else, from other documents, shows up as zeroes.

## tf-idf: the fast way

Since we're now tf-idf pros, we're going to use scikit-learn's all-in-one tf-idf vectorizer to do this entire notebook again in two lines of code. 

In [6]:
tfidf_vectorizer=TfidfVectorizer(stop_words='english', use_idf=True) # excluding stopwords again
 
# send in all your docs here
tfidf_vectorizer_vectors=tfidf_vectorizer.fit_transform(all_docs)
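
If you'd like to convince yourself that the fast way and the slow way agree, here is a sketch that runs both on three made-up documents (hypothetical data) and compares the results. Both routes use scikit-learn's default settings for smoothing and normalization.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

# Made-up documents (hypothetical data)
toy_docs = [
    "the dog chased the cat",
    "a cat and a dog",
    "hot pot on Buford Highway",
]

# The slow way: count the words first, then re-weight the counts
toy_counts = CountVectorizer(stop_words="english").fit_transform(toy_docs)
slow = TfidfTransformer().fit_transform(toy_counts)

# The fast way: one object does both steps
fast = TfidfVectorizer(stop_words="english").fit_transform(toy_docs)

same = np.allclose(slow.toarray(), fast.toarray())
print(same)
```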

In [34]:
# place tf-idf values for all docs in a pandas dataframe
# (get_feature_names_out() needs scikit-learn 1.0+; on older versions, use get_feature_names())
tfidf_df = pd.DataFrame(tfidf_vectorizer_vectors.toarray(), columns=tfidf_vectorizer.get_feature_names_out())

tfidf_df

Unnamed: 0,00,000,00000,000academic,000amounts,000and,000annually,000attendees,000award,000bond,...,àpjp,ángel,åkerman,ébrik,épée,írisz,östlund,œuvre,ἱῥή,ﻔﺮ
0,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,0.000000,0.031760,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0



In [35]:
# Add a row for the number of documents in which each word appears
tfidf_df.loc['Document Frequency'] = (tfidf_df > 0).sum()

## Let's explore!

We can look at specific words and how they appear in our newspaper corpus. I've entered five words below that we might want to investigate given the recent [symposium on Emory's legacy of slavery](https://libraries.emory.edu/slavery-symposium/index.html).

In [37]:
tfidf_slice = tfidf_df[['slavery', 'white', 'black', 'history', 'change']]
tfidf_slice

| | slavery | white | black | history | change |
|---|---|---|---|---|---|
| 0 | 0.0 | 0.011193 | 0.011407 | 0.000000 | 0.000000 |
| 1 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 2 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.017177 |
| 3 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 4 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 5 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 6 | 0.0 | 0.000000 | 0.000000 | 0.014421 | 0.000000 |
| 7 | 0.0 | 0.000000 | 0.015073 | 0.000000 | 0.000000 |
| 8 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 9 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |


**Does this output make sense? What does it tell you about which articles you might want to go read? What about some research questions you might ask of this corpus using TF-IDF?**
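
One possible starting point for questions like these is to ask which articles score highest for a given word. The sketch below uses a tiny made-up dataframe (the numbers are hypothetical, not taken from the corpus). It also sets aside the "Document Frequency" row first, since that row sits in the same dataframe as the articles and would otherwise be ranked as if it were one.

```python
import pandas as pd

# Made-up tf-idf scores for four articles (hypothetical numbers)
toy_tfidf_df = pd.DataFrame({
    "slavery": [0.0, 0.31, 0.02, 0.0],
    "history": [0.01, 0.12, 0.0, 0.2],
})

# The same bookkeeping row that we added to tfidf_df above
toy_tfidf_df.loc["Document Frequency"] = (toy_tfidf_df > 0).sum()

# Set that row aside so it is not mistaken for an article
articles_only = toy_tfidf_df.drop(index="Document Frequency")

# The two articles with the highest tf-idf score for "slavery"
top_articles = articles_only["slavery"].nlargest(2)
print(top_articles)
```

The index values that come back are the row numbers of the articles, which you could then use to look up the full text in `all_docs`.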

## Cosine similarity

Just so you know how to do everything in that Pudding Hip-Hop feature, this is how you calculate cosine similarity between documents on the basis of their tf-idf scores:

In [7]:
# CALCULATE SIMILARITY TO FIRST DOC 

from sklearn.metrics.pairwise import cosine_similarity
cosine_similarity(tfidf_vectorizer_vectors[0:1], tfidf_vectorizer_vectors)


array([[1.        , 0.01970937, 0.0515625 , ..., 0.03055578, 0.06319356,
        0.0306805 ]])
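
A long array of numbers is hard to read, so a natural next step is to rank the other documents by their similarity to the first one. Here is a sketch on four made-up documents (hypothetical data); `np.argsort` is one of several ways you could do the ranking.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Made-up documents (hypothetical data)
toy_docs = [
    "hot pot on buford highway",
    "pizza in emory village",
    "spicy hot pot and noodles on buford highway",
    "the swim team won",
]
toy_vectors = TfidfVectorizer(stop_words="english").fit_transform(toy_docs)

# Similarity of the first document to every document (including itself)
sims = cosine_similarity(toy_vectors[0:1], toy_vectors).ravel()
print(sims.round(3))

# Sort document indices from most to least similar, skipping the document itself
ranked = [int(i) for i in np.argsort(sims)[::-1] if i != 0]
most_similar = ranked[0]
print("Most similar to document 0:", most_similar)
```

A document always has a similarity of 1 with itself, and documents that share no words with it score 0.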