# tf-idf

*Lauren F. Klein wrote version 1.0 of this notebook, based on tutorials by [Matthew Lavin](https://programminghistorian.org/en/lessons/analyzing-documents-with-tfidf) and [Kavita Ganesan](https://kavita-ganesan.com/tfidftransformer-tfidfvectorizer-usage-differences/#.XZVlcOdKhSw). I have supplemented it with material from Melanie Walsh's chapter [TF-IDF](https://melaniewalsh.github.io/Intro-Cultural-Analytics/features/Text-Analysis/TF-IDF.html) from her online textbook [_Introduction to Cultural Analytics & Python_](https://melaniewalsh.github.io/Intro-Cultural-Analytics/features/welcome.html).*

We will learn powerful data science techniques soon. But, in many cases, just counting words can tell you a lot. 

Today, we're going to explore a method called Term Frequency - Inverse Document Frequency (tf-idf). Tf-idf comes up a lot in text analysis projects because it’s both a corpus exploration method and a pre-processing step for many other text-mining measures and models.

The procedure was introduced in a 1972 paper by Karen Spärck Jones under the name “term specificity,” and the basic idea is this:

Instead of representing a term in a document by its raw frequency or its relative frequency (the term count divided by the document length), each term is *weighted* by dividing the term frequency by the number of documents in the corpus containing the word. 

The overall effect of this weighting scheme is to avoid a common problem when conducting text analysis: the most frequently used words in a document are often the most frequently used words in all of the documents.

By contrast, the terms with the highest tf-idf scores are those that are distinctively frequent in a document when that document is compared to the other documents in the corpus. When you sort by tf-idf score, these distinctive terms rise to the top.
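
To make the weighting idea concrete, here is a minimal sketch that computes tf-idf by hand. It assumes the common textbook formula tf × log(N / df), which is not exactly the variant scikit-learn uses, and the three-sentence corpus and the `simple_tfidf` helper are made up for illustration.

```python
import math

# A made-up three-document corpus (hypothetical, for illustration only).
toy_docs = {
    "doc_a": "the queen ruled the empire",
    "doc_b": "the general led the army",
    "doc_c": "the queen met the general",
}

def simple_tfidf(docs):
    """Return {doc_name: {term: score}} using the textbook tf * log(N / df)."""
    tokenized = {name: text.split() for name, text in docs.items()}
    n_docs = len(tokenized)

    # document frequency: in how many documents does each term appear?
    doc_freq = {}
    for tokens in tokenized.values():
        for term in set(tokens):
            doc_freq[term] = doc_freq.get(term, 0) + 1

    # weight each raw term count by the log of (corpus size / document frequency)
    scores = {}
    for name, tokens in tokenized.items():
        scores[name] = {
            term: tokens.count(term) * math.log(n_docs / doc_freq[term])
            for term in set(tokens)
        }
    return scores

toy_scores = simple_tfidf(toy_docs)
for name, term_scores in toy_scores.items():
    ranked = sorted(term_scores.items(), key=lambda pair: pair[1], reverse=True)
    print(name, [(term, round(score, 3)) for term, score in ranked])
```

In this toy corpus "the" is the most frequent word in every document, yet it scores zero because it appears in all three; words that occur in only one document rise to the top.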

## *New York Times* Obituaries

In this lesson, we're going to use tf-idf to study 378 obituaries published by *The New York Times*. This dataset is based on data originally collected by Matt Lavin for his *Programming Historian* [TF-IDF tutorial](https://programminghistorian.org/en/lessons/analyzing-documents-with-tfidf#lesson-dataset). Melanie Walsh re-scraped the obituaries so that the subject's name and death year are included in each text file name; she also added 12 more ["Overlooked"](https://www.nytimes.com/interactive/2018/obituaries/overlooked.html) obituaries.

## Pre-processing: prepare the documents

Tf-idf works on a set of documents. Each document needs to be a single string. You'll get very familiar with writing document and text pre-processing code like this by the end of the class.

In [1]:
import os

base_dir = "../docs/NYT-Obituaries/"

all_docs = []
text_titles = []

docs = os.listdir(base_dir)

for doc in docs:
    with open(base_dir + doc, "r") as file:
        text = file.read()
        all_docs.append(text)
        text_titles.append(str(doc))
# just take a look at the first item to be sure
print(docs[0]) 
print("\n")
print(all_docs[0])

1945-Adolf-Hitler.txt


May 2, 1945

 OBITUARY

 Hitler Fought Way to Power Unique in Modern History

 BY THE NEW YORK TIMES

 Adolf Hitler, one-time Austrian vagabond who rose to be the dictator of Germany, "augmenter of the Reich" and the scourge of Europe, was, like Lenin and Mussolini, a product of the First World

 War. The same general circumstances, born of the titanic conflict, that carried Lenin, a bookish professional revolutionist, to the pinnacle of power in the Empire of the Czars and cleared the road to mastery for Mussolini in the Rome of the Caesars also paved the way for Hitler's domination in the former mighty Germany of the Hohenzollerns.

 Like Lenin and Mussolini, Hitler came out of the blood and chaos of 1914-18, but of the three he was the strangest phenomenon. Lenin, while not know to the general public, had for many years before the Russian Revolution occupied a prominent place as leader and theoretician, of the Bolshevist party. Mussolini was a widely known So

## Import libraries

Conveniently, scikit-learn, which we were introduced to in the previous lesson, lets us calculate tf-idf with just a few lines of code.

In [2]:
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer

import pandas as pd # this will help us keep track of our data; 
# we'll talk about pandas in more detail later this semester

## Create document-term matrix

We'll use a doc-term matrix to calculate tf-idf. Remember how to create a doc-term matrix from last lesson?

In [3]:
#instantiate CountVectorizer()
cv=CountVectorizer()

# this step generates the document-term matrix for the docs
dtm=cv.fit_transform(all_docs)

# check shape
dtm.shape

(378, 37160)

## Initialize TfidfTransformer

When you initialize `TfidfTransformer`, you can set different parameters, and these change how tf-idf is calculated. The recommended way to run `TfidfTransformer` is with smoothing (`smooth_idf=True`) and normalization (`norm='l2'`) turned on. Together these settings better account for differences in document length and, overall, produce more meaningful tf-idf scores. (`norm='l2'` is the default, which is why the cell below does not pass it explicitly.)

In [4]:
# Call tfidf_transformer.fit on the word count vector we computed earlier.
tfidf_transformer=TfidfTransformer(smooth_idf=True,use_idf=True)
tfidf_transformer.fit(dtm)

TfidfTransformer(norm='l2', smooth_idf=True, sublinear_tf=False, use_idf=True)

## Produce inverse document frequency (idf) values

In [5]:
# print idf values
df_idf = pd.DataFrame(tfidf_transformer.idf_, index=cv.get_feature_names_out(), columns=["idf_weights"])  # get_feature_names() was removed in scikit-learn 1.2
 
# sort ascending
df_idf.sort_values(by=['idf_weights'])

| | idf_weights |
|---|---|
| and | 1.000000 |
| in | 1.000000 |
| the | 1.000000 |
| of | 1.002642 |
| was | 1.002642 |
| ... | ... |
| petrus | 6.244389 |
| petrovo | 6.244389 |
| petronius | 6.244389 |
| petticoat | 6.244389 |
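
Where do these particular numbers come from? With `smooth_idf=True`, scikit-learn documents the idf weight as ln((1 + n) / (1 + df)) + 1, where n is the number of documents and df is the number of documents containing the term. The sketch below plugs in n = 378; the document frequencies are inferred from the idf values in the output, not read from the data, and `smoothed_idf` is a hypothetical helper name.

```python
import math

def smoothed_idf(n_docs, doc_freq):
    """idf as scikit-learn documents it for smooth_idf=True."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

n_docs = 378  # number of obituaries, from dtm.shape

# document frequencies inferred from the idf values in the output (an assumption)
cases = [
    ("appears in every document", 378),
    ("appears in all but one", 377),
    ("appears in a single document", 1),
]
for label, doc_freq in cases:
    print(label, round(smoothed_idf(n_docs, doc_freq), 6))
```

Under those assumptions the three results reproduce the values in the output: 1.0 for words like "the", 1.002642 for "of" and "was", and 6.244389 for words like "petticoat" that occur in only one obituary.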


## Produce & print tf-idf scores

Once you have the idf values, you can compute the tf-idf scores for any document or set of documents. Let’s compute tf-idf scores for the documents in our corpus.

In [6]:
# tf-idf scores
tf_idf_vector=tfidf_transformer.transform(dtm)

Now, let’s print the tf-idf values of the first document to see if they make sense. 

We'll place the tf-idf scores from the first document (the obituary of Adolf Hitler) into a pandas dataframe and sort the dataframe in descending order of scores.

In [7]:
feature_names = cv.get_feature_names_out()  # get_feature_names() was removed in scikit-learn 1.2

#get tfidf vector for first document
first_document_vector=tf_idf_vector[0]
 
#print the scores for the first doc
df = pd.DataFrame(first_document_vector.T.todense(), index=feature_names, columns=["tfidf"])
df.sort_values(by=["tfidf"],ascending=False)

| | tfidf |
|---|---|
| the | 0.578120 |
| hitler | 0.388941 |
| of | 0.345261 |
| and | 0.228612 |
| to | 0.213694 |
| ... | ... |
| feinstein | 0.000000 |
| feiner | 0.000000 |
| feinberg | 0.000000 |
| feigned | 0.000000 |


Notice that only certain words have scores. This is because only the words in this document have a tf-idf score and everything else, from other documents, shows up as zeroes.
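
The scores also all fall between 0 and 1. That is the effect of `norm='l2'`: each document's row of scores is scaled so that its Euclidean length is 1. A small sketch on a made-up corpus (the `toy_` names are hypothetical) shows both the unit-length rows and the zeros for absent words.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# three made-up documents (hypothetical, for illustration only)
toy_docs = [
    "the queen ruled the empire",
    "the general led the army",
    "the queen met the general",
]

toy_counts = CountVectorizer().fit_transform(toy_docs)
toy_tfidf = TfidfTransformer(norm="l2", smooth_idf=True).fit_transform(toy_counts)

dense = toy_tfidf.toarray()

# Euclidean (L2) length of each document's row of scores
row_lengths = np.sqrt((dense ** 2).sum(axis=1))
print(row_lengths)

# number of vocabulary words with a score of zero in each document
zeros_per_doc = (dense == 0).sum(axis=1)
print(zeros_per_doc)
```

Because every row has the same length, a long obituary does not get larger scores simply for containing more words.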

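Notice also that very common words ("the", "of", "and") still rank near the top. Their idf weights are about as low as they can be, but they occur so often in the document that their scores stay high. They're frequent, but they're not interesting.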

## tf-idf: the fast way

So now we're going to do it again with scikit-learn's stopword list. And since we're tf-idf pros, we're going to use scikit-learn's all-in-one tf-idf vectorizer. 

In [8]:
from sklearn.feature_extraction.text import TfidfVectorizer 

# to exclude stopwords, add the argument `stop_words='english'`
tfidf_vectorizer=TfidfVectorizer(stop_words='english', use_idf=True)
 
# send in all your docs here
tfidf_vectorizer_vectors=tfidf_vectorizer.fit_transform(all_docs)
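
If you want to convince yourself that the all-in-one vectorizer does the same work as the two-step version, the sketch below compares them on a made-up corpus with default settings, then checks what the stopword list removes. The `toy_` names are hypothetical.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

# three made-up documents (hypothetical, for illustration only)
toy_docs = [
    "the queen ruled the empire",
    "the general led the army",
    "the queen met the general",
]

# slow way: count words first, then reweight the counts
two_step = TfidfTransformer().fit_transform(CountVectorizer().fit_transform(toy_docs))

# fast way: one object does both steps
one_step = TfidfVectorizer().fit_transform(toy_docs)

same_scores = np.allclose(two_step.toarray(), one_step.toarray())
print(same_scores)

# with the built-in English stopword list, "the" never enters the vocabulary
toy_vocab = TfidfVectorizer(stop_words="english").fit(toy_docs).vocabulary_
print("the" in toy_vocab, "queen" in toy_vocab)
```

Dropping stopwords changes the vocabulary, so scores from a stopword-filtered run are not directly comparable to scores from an unfiltered run.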

In [9]:
# as above, get the first vector out (for the first document)
# first_vector_tfidfvectorizer=tfidf_vectorizer_vectors[0]
 
# place tf-idf values in a pandas dataframe
# reminder: we'll cover pandas in a later lesson
# df = pd.DataFrame(first_vector_tfidfvectorizer.T.todense(), index=tfidf_vectorizer.get_feature_names(), columns=["tfidf"])
# df.sort_values(by=["tfidf"],ascending=False)

In [10]:
tfidf_df = pd.DataFrame(tfidf_vectorizer_vectors.toarray(), index=text_titles, columns=tfidf_vectorizer.get_feature_names_out())  # get_feature_names() was removed in scikit-learn 1.2
tfidf_df = tfidf_df.sort_index()
tfidf_df

Unnamed: 0,00,000,000f,001,006,01,010,021,025,028,...,zrathustra,zuber,zuker,zukor,zukors,zula,zululand,zurich,zvai,zwilich
1852-Ada-Lovelace.txt,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000
1870-Robert-E-Lee.txt,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000
1875-Andrew-Johnson.txt,0.0,0.007591,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000
1877-Bedford-Forrest.txt,0.0,0.018902,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000
1880-Lucretia-Mott.txt,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1999-Iris-Murdoch.txt,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000
1999-King-Hussein.txt,0.0,0.007611,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000
2000-Charles-M-Schulz.txt,0.0,0.004885,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.009675
2000-Elliot-Richardson.txt,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000


In [11]:
# Add a row counting how many documents each word appears in (its document frequency)
tfidf_df.loc['Document Frequency'] = (tfidf_df > 0).sum()
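
The expression `(tfidf_df > 0).sum()` may look cryptic, so here is a sketch of what it does on a tiny made-up table. The scores and the `toy_` names are invented for illustration.

```python
import pandas as pd

# a made-up table of tf-idf scores (hypothetical values, for illustration only)
toy_tfidf_df = pd.DataFrame(
    {
        "queen": [0.4, 0.0, 0.3],
        "army": [0.0, 0.7, 0.0],
        "the": [0.2, 0.2, 0.2],
    },
    index=["doc_a", "doc_b", "doc_c"],
)

# True wherever a word has a non-zero score, i.e. wherever it occurs in that document
occurs = toy_tfidf_df > 0

# summing a column of True/False values counts the Trues: one count per word
toy_doc_freq = occurs.sum()
print(toy_doc_freq)
```

One thing to keep in mind: once this count is stored as a row in the same dataframe as the scores, it is a whole number sitting among values between 0 and 1, so it will come first if you later sort a column in descending order.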

## Let's explore!

We can look at specific words and how they appear in our obituaries corpus. I've entered the eight words below.

In [12]:
tfidf_slice = tfidf_df[['education', 'old', 'slavery', 'hope','government', 'sports', 'love', 'rights']]
tfidf_slice

| | education | old | slavery | hope | government | sports | love | rights |
|---|---|---|---|---|---|---|---|---|
| 1852-Ada-Lovelace.txt | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.00000 | 0.000000 | 0.000000 |
| 1870-Robert-E-Lee.txt | 0.016767 | 0.009478 | 0.000000 | 0.000000 | 0.045256 | 0.00000 | 0.000000 | 0.000000 |
| 1875-Andrew-Johnson.txt | 0.010315 | 0.000000 | 0.040739 | 0.000000 | 0.046400 | 0.00000 | 0.000000 | 0.035382 |
| 1877-Bedford-Forrest.txt | 0.000000 | 0.009679 | 0.000000 | 0.000000 | 0.000000 | 0.00000 | 0.000000 | 0.000000 |
| 1880-Lucretia-Mott.txt | 0.034614 | 0.058700 | 0.205067 | 0.000000 | 0.000000 | 0.00000 | 0.000000 | 0.039578 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1999-King-Hussein.txt | 0.000000 | 0.003897 | 0.000000 | 0.000000 | 0.012406 | 0.00000 | 0.003557 | 0.000000 |
| 2000-Charles-M-Schulz.txt | 0.003319 | 0.003752 | 0.000000 | 0.000000 | 0.000000 | 0.00000 | 0.034245 | 0.000000 |
| 2000-Elliot-Richardson.txt | 0.011454 | 0.012949 | 0.000000 | 0.000000 | 0.020609 | 0.00000 | 0.000000 | 0.000000 |
| 2000-Pierre-Trudeau.txt | 0.000000 | 0.015667 | 0.000000 | 0.006679 | 0.024935 | 0.00868 | 0.000000 | 0.000000 |
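
A table like this is easier to read once you sort it. The sketch below rebuilds three rows and two columns of the output by hand (the values are copied from the table; the `toy_slice` name is hypothetical) and asks which obituary scores highest for each word.

```python
import pandas as pd

# three rows and two columns copied from the output
toy_slice = pd.DataFrame(
    {
        "slavery": [0.000000, 0.040739, 0.205067],
        "government": [0.045256, 0.046400, 0.000000],
    },
    index=[
        "1870-Robert-E-Lee.txt",
        "1875-Andrew-Johnson.txt",
        "1880-Lucretia-Mott.txt",
    ],
)

# rank the documents by their score for a single word, highest first
top_for_slavery = toy_slice["slavery"].sort_values(ascending=False)
print(top_for_slavery)

# for every word at once: which document has the highest score?
best_doc_per_word = toy_slice.idxmax()
print(best_doc_per_word)
```

The same two calls should work on the full `tfidf_slice`, though the "Document Frequency" row added earlier would need to be dropped first so that it does not win every comparison.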


Does this output make sense? What does it tell you?

ANSWER HERE
* *
* *
* *
* *

Try your own words! In the cell below, replace my eight words with eight of your own.

In [None]:
# YOUR CODE HERE

Are your results what you expected? Why or why not?

ANSWER HERE
* *
* *
* *
* *

## Store tf-idf vectors & print top five for each doc

Finally, let's store our tf-idf vectors to files. Don't worry if you can't follow every bit of code below.

In [None]:
# make a directory to store the output in (no error if it already exists)
output_dir = "./tf_idf_output"
os.makedirs(output_dir, exist_ok=True)

# one csv name per document, in the same order as the rows of the tf-idf matrix
csvs = [title.replace(".txt", ".csv") for title in text_titles]

# convert sparse matrix to array
tfidf_vectors_as_array = tfidf_vectorizer_vectors.toarray()

# look up the vocabulary once, outside the loop
terms = tfidf_vectorizer.get_feature_names_out()

# loop over each document's row in tfidf_vectors_as_array
for counter, doc in enumerate(tfidf_vectors_as_array): # note enumerate. useful!
    # construct a dataframe of (term, score) pairs, highest score first
    tf_idf_tuples = list(zip(terms, doc))
    one_doc_as_df = pd.DataFrame.from_records(tf_idf_tuples, columns=['term', 'score']).sort_values(by='score', ascending=False).reset_index(drop=True)

    # output to the one csv whose name matches this document
    # (the earlier version looped over every csv here and overwrote them all)
    one_doc_as_df.to_csv(os.path.join(output_dir, csvs[counter]))
    title = csvs[counter].replace(".csv", "")
    print("\n" + title + " top 5 terms: ")
    print(one_doc_as_df.head())


1945-Adolf-Hitler top 5 terms: 
        term     score
0     hitler  0.713154
1     german  0.178339
2    germany  0.170369
3  reichstag  0.153355
4     poland  0.130337

1915-F-W-Taylor top 5 terms: 
         term     score
0      taylor  0.373259
1  management  0.306042
2  scientific  0.198236
3  efficiency  0.181847
4     midvale  0.173128

1975-Chiang-Kai-shek top 5 terms: 
            term     score
0         chiang  0.791663
1          china  0.242680
2     communists  0.152967
3       stilwell  0.134823
4  generalissimo  0.104891

1984-Ethel-Merman top 5 terms: 
       term     score
0    merman  0.663258
1      miss  0.202980
2  broadway  0.202166
3     gypsy  0.156912
4  sondheim  0.147391

1953-Jim-Thorpe top 5 terms: 
       term     score
0    thorpe  0.759121
1  carlisle  0.224952
2       jim  0.165563
3  football  0.159144
4   olympic  0.128771

1964-Nella-Larsen top 5 terms: 
          term     score
0       larsen  0.715102
1       harlem  0.159677
2         imes  0.15


1941-Frank-Conrad top 5 terms: 
           term     score
0        conrad  0.569491
1         radio  0.305098
2    pittsburgh  0.253126
3  westinghouse  0.243591
4       station  0.239495

1966-Alfred-P-Sloan-Jr top 5 terms: 
      term     score
0    sloan  0.811161
1   motors  0.347037
2       mr  0.159016
3  general  0.105019
4  million  0.091264

1960-Beno-Gutenberg top 5 terms: 
            term     score
0      gutenberg  0.521241
1           beno  0.260621
2  seismological  0.260621
3     goettingen  0.243698
4        caltech  0.231691

1976-J-Paul-Getty top 5 terms: 
       term     score
0     getty  0.842211
1       oil  0.268839
2        mr  0.129075
3  business  0.090790
4    sutton  0.083905

1891-P-T-Barnum top 5 terms: 
         term     score
0      barnum  0.783674
1       jumbo  0.223525
2          mr  0.166274
3  bridgeport  0.139558
4      museum  0.090131

1901-Queen-Victoria top 5 terms: 
       term     score
0  victoria  0.383062
1     queen  0.365494
2    prin

## Exercise

**Analyze the printed results: the top five terms, by tf-idf score, for each obituary. The results are likely obvious for a small corpus like ours. But does anything surprise you? And what corpus, or what experiment, can you imagine using tf-idf for?**

ANSWER HERE

## That's it!