## TF-IDF, streamlined

Last class, we went under the hood of TF-IDF to understand how it's calculated and how scikit-learn's `CountVectorizer` plays a part. We learned a lot, but the code that actually calculates TF-IDF was buried at the bottom. I also didn't show you how to associate specific words in specific articles with their TF-IDF scores, which is how the measure is most commonly used.

Hence this notebook: TF-IDF, streamlined.


## The Emory Wheel

As in the previous lesson, our corpus will be the articles published by *The Emory Wheel* between 2014 and 2019.

This dataset was created by Honggang Min and Kexin Guan for their final project in the 2019 iteration of this course, and was generously transferred back to me for future class use.

## Pre-processing #1: Downloading the documents

TF-IDF works on sets of documents. In this particular case, the documents are individual .txt files that are stored in a zip file on my Google Drive. Below is some code to download the data from Google Drive, unzip it, and format it into a list for processing.

In [None]:
# For downloading large files from Google Drive
# https://github.com/wkentaro/gdown
import gdown

# then download the zip file
gdown.download('https://drive.google.com/uc?export=download&id=1SUWUVswaY_RDLhzFruQIDJe-i6I3gznC', quiet=False)

In [None]:
# unzip it
!unzip wheel-clean.zip

## Pre-processing #2: From individual files to a single list

In order to calculate our tf-idf scores, we'll need to get all of the documents into a single Python list, with each document stored as a single (string) item in the list.

Note that this is custom code for processing this particular set of documents. While the specific code will (almost) always be different depending on the particular storage location and format of the files, you will always need *some* pre-processing code like this in order to get the files into the format that any particular method requires.  

In [None]:
# import this library for directory/file manipulation
import os

# set the base directory -- note that this may need to change if you've saved a copy
# of this notebook elsewhere
base_dir = "wheel-clean/"

# read in a list of all the filenames
docs = os.listdir(base_dir)

# a list for storing the text of all the docs
wheel_docs = []

# a list for storing the titles of the docs -- not necessary, just makes recall easier
text_titles = []

# iterate through each of the docs in the directory
for doc in docs:
    with open(base_dir + doc, "r") as file:     # open the doc file
        text = file.read()                      # read the contents of the file
        wheel_docs.append(text)                 # append the contents of the file to our wheel_docs list for future manipulation
        text_titles.append(str(doc))            # append the title of the doc to another list for future reference

# just take a look at the first item to be sure it worked
print("Filename: " + str(text_titles[0]) + "\n")
print(wheel_docs[0])
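
Because `os.listdir` returns filenames in arbitrary order, the index number of a given article can differ from one machine to the next. Here is a minimal sketch of an alternative that sorts the filenames first, so the order is reproducible. The `load_docs` helper, the two demo files, and the UTF-8 encoding are all assumptions for illustration, not part of the original dataset code.

```python
import tempfile
from pathlib import Path

def load_docs(base_dir):
    """Return (titles, texts) for every .txt file in base_dir, sorted by filename."""
    paths = sorted(Path(base_dir).glob("*.txt"))
    titles = [p.name for p in paths]
    # assumes the files are UTF-8 encoded; adjust if yours are not
    texts = [p.read_text(encoding="utf-8") for p in paths]
    return titles, texts

# demo on a throwaway directory with two made-up files
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "2014-02-01-b.txt").write_text("second article", encoding="utf-8")
    Path(tmp, "2014-01-01-a.txt").write_text("first article", encoding="utf-8")
    demo_titles, demo_texts = load_docs(tmp)

print(demo_titles)
```

With the real corpus, the call would be something like `text_titles, wheel_docs = load_docs("wheel-clean/")`.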

## To the TF-IDF calculation!

Here are the few lines of code that make use of scikit-learn's all-in-one tf-idf vectorizer.

In [None]:
# import our required library
from sklearn.feature_extraction.text import TfidfVectorizer

# `stop_words='english'` excludes common English stopwords; remove the argument to keep them
tfidf_vectorizer=TfidfVectorizer(stop_words='english', use_idf=True)

# send in all your docs here
tfidf_vectorizer_vectors=tfidf_vectorizer.fit_transform(wheel_docs)


**And we're done! 🎉 🎉 🎉**



---






Now, to explore the results...

In [None]:
# import pandas for Python dataframes
import pandas as pd

# send the output of the vectorizer into a dataframe
# tfidf_df = pd.DataFrame(tfidf_vectorizer_vectors.toarray(), index=text_titles, columns=tfidf_vectorizer.get_feature_names_out())
tfidf_df = pd.DataFrame(tfidf_vectorizer_vectors.toarray(), columns=tfidf_vectorizer.get_feature_names_out())

# add in a column for the titles of each article for future reference
tfidf_df['Title'] = text_titles

# print out our dataframe of tf-idf scores
# note that this is not super readable yet, but the next few cells will help us make sense of the results
tfidf_df


## Searching/sorting by tf-idf score




One question you often want to ask about tf-idf scores relates to individual words: specifically, which documents have the highest tf-idf scores for a specific word.

You might want to search/sort this way if you were curious, for example, which documents were most uniquely about, say, food:

In [None]:
tfidf_slice_sorted = tfidf_df[['Title', 'food']].sort_values(by=['food'], ascending=False)

# print out the top ten
print(tfidf_slice_sorted[:10])

The above list then suggests the articles that you should prioritize reading if your interest was in food. For example, the first one about sustainability from November 21, 2014:

In [None]:
wheel_docs[3096]

# note that I am using the index number listed above to pull up the document
# the files may be read in a different order on your machine, so your index number
# for the top article may not be the same, even though the same article
# ("2014-11-21-Sustainability-Security-Vital-to-Fu...") will still have the top score
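
Since the index number can shift, one option is to look a document up by its filename instead. The sketch below is a hypothetical helper, `doc_by_title_prefix`, that matches on the start of the filename, which is handy when the printed title is cut off. The two-item lists are made-up stand-ins for `text_titles` and `wheel_docs`.

```python
def doc_by_title_prefix(prefix, titles, docs):
    """Return (index, text) for the first document whose title starts with `prefix`."""
    for i, title in enumerate(titles):
        if title.startswith(prefix):
            return i, docs[i]
    raise ValueError("no title starts with " + repr(prefix))

# made-up stand-ins for the notebook's `text_titles` and `wheel_docs` lists
demo_titles = ["2014-09-02-Example-One.txt", "2014-11-21-Example-Two.txt"]
demo_docs = ["Text of the first example.", "Text of the second example."]

idx, text = doc_by_title_prefix("2014-11-21", demo_titles, demo_docs)
print(idx, text)
```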


## Searching for multiple terms

This works much as above, except that the slice includes several terms. Note that `sort_values` sorts by the first column listed (`food`) and uses the remaining columns only to break ties.

In [None]:
tfidf_slice_sorted = tfidf_df[['Title', 'food', 'dinner', 'lunch', 'breakfast']].sort_values(by=['food', 'dinner', 'lunch', 'breakfast'], ascending=False)

# print out the top ten
print(tfidf_slice_sorted[:10])
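
If you would rather rank documents by all of the terms at once, one possible approach is to add the per-term scores into a single column and sort by that. This is a sketch on a made-up three-row dataframe, and the `combined` column is my own addition for illustration. Summing is only one of several reasonable ways to combine scores.

```python
import pandas as pd

# a made-up stand-in for `tfidf_df`
toy_df = pd.DataFrame({
    "Title": ["a.txt", "b.txt", "c.txt"],
    "food": [0.5, 0.5, 0.0],
    "dinner": [0.0, 0.1, 0.6],
    "lunch": [0.0, 0.2, 0.3],
})

terms = ["food", "dinner", "lunch"]

# copy the slice so the original dataframe is left untouched
combined = toy_df[["Title"] + terms].copy()
combined["combined"] = combined[terms].sum(axis=1)
combined = combined.sort_values(by="combined", ascending=False)

print(combined)
```

Here `c.txt` ranks first even though its `food` score is zero, which a sort keyed on `food` alone would never show.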



## Displaying the top terms for any particular document

The second major use of TF-IDF is to characterize the most significant words in any particular document. Here's some code that will do that for the first document in the corpus:

In [None]:
# pull out the row at location 0
# replace the index number to pull up another doc
doc_row = tfidf_df.iloc[0]

# drop the title col b/c it messes up sorting
doc_row = doc_row.drop('Title')

# sort by tf-idf scores top to bottom
doc_row = doc_row.sort_values(ascending=False)

# print out the top ten
print(doc_row[:10])
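
To avoid retyping those steps for every document, they can be wrapped in a small function. The `top_terms` helper below is a hypothetical name of my own, shown on a made-up two-row dataframe. It assumes the dataframe has one non-numeric `Title` column alongside the score columns.

```python
import pandas as pd

def top_terms(df, row, n=10, title_col="Title"):
    """Return the n highest-scoring terms for the document at position `row`."""
    # drop the title, then convert the remaining scores to floats for sorting
    scores = df.iloc[row].drop(title_col).astype(float)
    return scores.nlargest(n)

# a made-up stand-in for `tfidf_df`
toy_df = pd.DataFrame({
    "campus": [0.1, 0.7],
    "food": [0.9, 0.0],
    "tuition": [0.0, 0.6],
    "Title": ["a.txt", "b.txt"],
})

top = top_terms(toy_df, 0, n=2)
print(top)
```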


## Most unique words in a corpus

Oh and of course! Here are the most unique words in the corpus overall!

Recall from the previous lesson that this is distinct from the most *frequent* words in the corpus.

In [None]:
# drop the title col b/c it messes up sorting
tfidf_df = tfidf_df.drop(columns=['Title'])

# add in a row with the total TF-IDF scores
tfidf_df.loc['Total_TFIDF'] = tfidf_df.sum()

# sort by Total_TFIDF values, high to low
tfidf_df.sort_values(by=['Total_TFIDF'], axis=1, ascending=False, inplace=True)

tfidf_df

Here's the same thing formatted slightly more nicely.

Note the use of Python's built-in `range` function to generate the sequence of index numbers.


In [None]:
sorted_terms = tfidf_df.columns.tolist()

for i in range(25):
    print(str(i) + ". " + str(sorted_terms[i]))
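
One thing to be aware of: the cells above drop the `Title` column from `tfidf_df` and add a `Total_TFIDF` row to it, so the earlier cells will not behave the same way if you re-run them afterward. Here is a sketch of an alternative that computes the totals as a separate Series and leaves the dataframe alone, shown on a made-up two-row dataframe.

```python
import pandas as pd

# a made-up stand-in for `tfidf_df`, with its Title column still in place
toy_df = pd.DataFrame({
    "Title": ["a.txt", "b.txt"],
    "campus": [0.1, 0.7],
    "food": [0.9, 0.2],
    "tuition": [0.0, 0.6],
})

# sum each term's scores across all documents, then sort high to low
totals = toy_df.drop(columns=["Title"]).sum().sort_values(ascending=False)

# print a ranked list, numbering from 1
for rank, (term, score) in enumerate(totals.head(25).items(), start=1):
    print(f"{rank}. {term} ({score:.3f})")
```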


*Lauren F. Klein wrote version 1.0 of this notebook in 2022*

