# Identifying distinctive words with Term Frequency-Inverse Document Frequency

Tf-idf builds on raw word frequency, but it aims more specifically to identify the words that are distinctively frequent in one document compared with a larger set of documents.

With tf-idf, each term's weight starts from how often the term occurs in a document and is then scaled down according to the number of documents in the corpus that contain the term. This gives the most weight to terms that appear often in one document but are rare or absent in the others.

Tf-idf is calculated in three steps. First, count the number of times a term occurs in a document (term frequency). Second, take the number of documents in which the same term occurs at least once, divided by the total number of documents (document frequency), and flip that fraction on its head (inverse document frequency). The smoothed, log-scaled version used here is inverse_document_frequency = log((1 + total_number_of_documents) / (1 + number_of_documents_with_term)) + 1. Third, multiply the two numbers together (term_frequency * inverse_document_frequency). We take the inverse, or flipped fraction, of document frequency to boost the rarer words that occur in relatively few documents.
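
To make the formula above concrete, here is a minimal sketch that computes tf-idf by hand for a toy corpus, using only the Python standard library. The three tiny documents and all variable names are hypothetical and are not part of the corpus used in this notebook. The sketch assumes a natural logarithm and a final L2 normalisation of each document's weights, which is our understanding of Scikit-learn's default settings; treat it as an illustration of the idea rather than a replacement for the library.

```python
import math
from collections import Counter

# Hypothetical toy corpus (not the notebook's data): three tiny "documents"
docs = {
    "doc_a": "rain rain fog",
    "doc_b": "rain sun",
    "doc_c": "sun sun sun wind",
}

# Term frequency: raw count of each term in each document
term_counts = {name: Counter(text.split()) for name, text in docs.items()}
vocabulary = sorted({term for counts in term_counts.values() for term in counts})
n_docs = len(docs)

# Document frequency: number of documents containing the term at least once
doc_freq = {
    term: sum(1 for counts in term_counts.values() if term in counts)
    for term in vocabulary
}

# Smoothed inverse document frequency, following the formula above
# (natural log is an assumption about the library's default)
idf = {
    term: math.log((1 + n_docs) / (1 + doc_freq[term])) + 1
    for term in vocabulary
}


def tfidf_row(counts):
    """Return L2-normalised tf-idf weights for one document."""
    raw = {term: counts[term] * idf[term] for term in vocabulary}
    norm = math.sqrt(sum(weight * weight for weight in raw.values()))
    return {term: weight / norm for term, weight in raw.items()}


tfidf = {name: tfidf_row(counts) for name, counts in term_counts.items()}

# Print the highest-weighted term for each document
for name, row in tfidf.items():
    top_term = max(row, key=row.get)
    print(name, top_term, round(row[top_term], 3))
```

In this toy corpus, "fog" and "wind" each occur in one document only, so they receive a higher inverse document frequency than "rain" and "sun", which each occur in two.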


In this notebook we use the implementation of tf-idf in Scikit-learn (sklearn).

In [None]:
# Import the libraries we're going to use
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd
from pathlib import Path  
import glob

In [None]:
# Set up path to files 
# and a variable with the file names called text_titles
#(we use these later to create our pandas dataframe)
directory_path = 'soderberg-corpus'
text_files = glob.glob(f'{directory_path}/*.txt')
text_titles = [Path(text).stem for text in text_files]

In [None]:
#Set up tf-idf vectorizing
tfidf_vectorizer = TfidfVectorizer(input='filename', stop_words='english')

N.B. This uses the default Scikit-learn stop word list. Try it first with the default; if you want to modify your results, you can use your own custom stop word list instead (using the two cells below).

In [None]:
#Read in your txt file as list
#with open('custom-stopwords.txt', 'r') as f:
    #custom_stopwords = [s.rstrip('\n') for s in f.readlines()]

In [None]:
#Set up tf-idf vectorizing
#tfidf_vectorizer = TfidfVectorizer(input='filename', stop_words=custom_stopwords)
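
The commented-out cells above read one stop word per line and strip only the trailing newline. If your file contains blank lines, stray spaces, capital letters, or repeated words, a slightly more defensive reader may help. The sketch below is one possible approach; `parse_stopwords` is a hypothetical helper and `example_lines` stands in for the contents of a stop word file. Lowercasing is included on the assumption that the vectorizer lowercases the text before removing stop words, which is our understanding of its default behaviour.

```python
def parse_stopwords(lines):
    """Turn raw lines (one word per line) into a clean, de-duplicated list."""
    cleaned = []
    seen = set()
    for line in lines:
        word = line.strip().lower()  # drop newline, stray spaces, capitals
        if not word:  # skip blank lines
            continue
        if word not in seen:  # keep the first occurrence only
            seen.add(word)
            cleaned.append(word)
    return cleaned


# Hypothetical file contents, standing in for a custom stop word file
example_lines = ["The\n", "and \n", "\n", "said\n", "the\n"]
custom_stopwords = parse_stopwords(example_lines)
print(custom_stopwords)  # ['the', 'and', 'said']
```

With a real file, the same helper could be called on the open file object, for example `parse_stopwords(f)` inside the `with open(...)` block.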

In [None]:
#Actually do the tf-idf vectorizing
#(returns a tf-idf-weighted document-term matrix)
tfidf_vector = tfidf_vectorizer.fit_transform(text_files)

In [None]:
#Make a DataFrame out of the resulting tf–idf vectors, 
#setting the “feature names” (words in vocabulary) as columns 
#and the document titles as rows
tfidf_df = pd.DataFrame(tfidf_vector.toarray(), 
                        index=text_titles, 
                        columns=tfidf_vectorizer.get_feature_names_out())

In [None]:
#Re-organize DataFrame so that words are in rows rather than columns
#(stack() gives one row per document-term pair; reset_index() names
#the two index levels 'level_0' and 'level_1' and the values column 0)
tfidf_df = tfidf_df.sort_index()
stacked_tfidf_df = tfidf_df.stack().reset_index()
stacked_tfidf_df = stacked_tfidf_df.rename(columns={0: 'tfidf', 'level_0': 'document', 'level_1': 'term'})
stacked_tfidf_df.sample(n=20)

In [None]:
#Create a DataFrame holding the 10 words with the highest tf–idf for every document
num_top_words = 10
top_tfidf = stacked_tfidf_df.sort_values(by=['document','tfidf'], ascending=[True,False]).groupby(['document']).head(num_top_words)
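
To see what the sort-then-group step above does, here is a small sketch on a hand-written table. The `toy` DataFrame and its scores are hypothetical and only mimic the shape of `stacked_tfidf_df`. Sorting by document (ascending) and score (descending) puts each document's best terms first, and `groupby(...).head(n)` then keeps the first n rows of each group.

```python
import pandas as pd

# Hypothetical long-format scores, shaped like stacked_tfidf_df
toy = pd.DataFrame({
    "document": ["A", "A", "A", "B", "B", "B"],
    "term": ["rain", "fog", "sun", "rain", "fog", "sun"],
    "tfidf": [0.2, 0.9, 0.4, 0.7, 0.1, 0.5],
})

top_n = 2
toy_top = (
    toy.sort_values(by=["document", "tfidf"], ascending=[True, False])
    .groupby("document")
    .head(top_n)  # first top_n rows of each document, in sorted order
)
print(toy_top)
```

Here document A keeps "fog" and "sun", and document B keeps "rain" and "sun".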

In [None]:
#Zoom in on particular words
#What documents have the given word in their top significant words?
top_tfidf[top_tfidf['term'].str.contains('real')]

In [None]:
#Zoom in on particular document
#What are the significant words for a given document?
top_tfidf[top_tfidf['document'].str.contains('Drizzle')]

In [None]:
#What are the top 20 significant words for the given document?
(stacked_tfidf_df[stacked_tfidf_df['document']
                  .str.contains('Drizzle')]
 .sort_values('tfidf', ascending=False)
 .head(20)
)

In [None]:
#Create bar plots of top 10 significant words for each document in corpus
import matplotlib.pyplot as plt
import seaborn as sns

# catplot is a figure-level function that creates its own figure,
# so a separate plt.subplots() call would only produce an empty extra figure
figure = sns.catplot(data=top_tfidf, row='document', x='tfidf', y='term', kind='bar', sharey=False)

# Save figure
#figure.savefig("sod-distinctive-words.pdf", bbox_inches='tight')

_Acknowledgements_: This notebook is inspired by Melanie Walsh's [_Introduction to Cultural Analytics & Python_](https://melaniewalsh.github.io/Intro-Cultural-Analytics/05-Text-Analysis/01-TF-IDF.html) and Matthew Lavin's ["Analyzing Documents with TF-IDF"](https://programminghistorian.org/en/lessons/analyzing-documents-with-tfidf).