# Term Frequency–Inverse Document Frequency

[Download relevant files here](https://melaniewalsh.org/TF-IDF.zip) or run `git pull` from the command line in the "Intro-Cultural-Analytics-Notebooks" directory.

In this lesson, we're going to learn about a text analysis method called **term frequency–inverse document frequency** (tf–idf). This method will help us identify the most unique words in a document from a given corpus. 

>term = word <br>
>document = text (or chunk of a text) <br>
>corpus = collection of texts <br>

# Why is tf–idf Useful?

tf–idf is a very useful and commonly used text analysis method. Why?

Let's say we wanted to find out the most interesting or meaningful words in each of the 14 short stories in *Lost in the City*, a short story collection by Edward P. Jones.

We could calculate the most *frequent* words in each story. But stop words like "the" or "and" would be the most frequent words for every single story. We could remove the stop words, but the most frequent words still wouldn't be very interesting. They would be the same or similar for every short story: "said," "man," "woman," "day."

Tf–idf can help remedy these problems. That's because tf–idf calculates a uniqueness score for every word based on how many times the word appears in an individual text **as well as** how many times that word appears *in all the other texts in the corpus*. By taking into consideration the entire short story collection, tf–idf helps to identify words that are unique to a given story (i.e., they don't appear very often in the other stories).

# The Basic Math

> `term_frequency * inverse_document_frequency`

There are more than a few ways to calculate tf–idf scores. Understanding the minute details of tf–idf math is not our primary goal, but it's helpful to understand what's happening in broad strokes. Let's walk through one possible tf–idf formula with one example.
 
## Breaking Down the Formula

For this version of the formula, `term_frequency` equals the number of times a word appears in one of *Lost in the City*'s short stories...

> `term_frequency = number of times a given word appears in story or text`

`inverse_document_frequency` equals the total number of short stories  divided by the number of short stories that contain the given word...

> `total_number_of_documents / number_of_documents_with_term`

...and then we take the logarithm of that result (Python's `log()` is the natural logarithm) and add 1:

> `inverse_document_frequency = log(total_number_of_documents / number_of_documents_with_term) + 1`

Do you see how flipping the fraction, making it `number_of_documents_with_term / total_number_of_documents`, would just give us "document frequency"? By inverting this fraction, we get "inverse document frequency."
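To get a feel for how this part of the formula behaves, here is a minimal sketch that computes the inverse document frequency for a word appearing in different numbers of the 14 stories. The helper name `idf` and the list of document counts are only illustrative.

```python
from math import log

def idf(total_number_of_documents, number_of_documents_with_term):
    """Unsmoothed inverse document frequency, following the formula above."""
    return log(total_number_of_documents / number_of_documents_with_term) + 1

# 14 is the number of stories in the collection; the document counts are illustrative
for number_of_documents_with_term in (1, 2, 7, 13, 14):
    score = idf(14, number_of_documents_with_term)
    print(number_of_documents_with_term, round(score, 3))
```

A word that appears in every story gets an inverse document frequency of exactly 1 (because the logarithm of 1 is 0), so its tf–idf score is simply its raw count. The rarer the word, the larger the number its count gets multiplied by.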

## The Formula in Action

**"said" vs "pigeons"**

Using this formula, we're going to calculate and compare the tf–idf scores for the word "said" and the word "pigeons" in "The Girl Who Raised Pigeons," the first short story in *Lost in the City*.

We need the log() function for our calculation, so we're going to import it from the `math` package.

In [6]:
from math import log

**"said"**

In [7]:
total_number_of_documents = 14 #total number of short stories in *Lost in the City*
number_of_documents_with_term = 13 #number of short stories that contain the word "said"

In [8]:
term_frequency = 47 #number of times "said" appears in "The Girl Who Raised Pigeons"
inverse_document_frequency = log(total_number_of_documents / number_of_documents_with_term) + 1

In [None]:
term_frequency * inverse_document_frequency

**"pigeons"**

In [10]:
total_number_of_documents = 14 #total number of short stories in *Lost in the City*
number_of_documents_with_term = 2 #number of short stories that contain the word "pigeons"

In [11]:
term_frequency = 30 #number of times "pigeons" appears in "The Girl Who Raised Pigeons"
inverse_document_frequency = log(total_number_of_documents / number_of_documents_with_term) + 1

In [None]:
term_frequency * inverse_document_frequency

**tf–idf Scores**

"said" = 50.48<br>
"pigeons" = 88.38

Though the word "said" appears 47 times in "The Girl Who Raised Pigeons" and the word "pigeons" only appears 30 times, "pigeons" has a higher tf–idf score than "said" because it's a rarer word. The word "pigeons" appears in 2 of 14 stories, while "said" appears in 13 of 14 stories, almost all of them.
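The two calculations above can also be wrapped in a small function so that both words are scored in one pass. This is just a sketch of the same formula; the function name `tfidf_score` and the `words` dictionary are illustrative, and the counts are the ones used in the cells above.

```python
from math import log

def tfidf_score(term_frequency, total_number_of_documents, number_of_documents_with_term):
    """term_frequency * (log(total documents / documents with term) + 1)"""
    inverse_document_frequency = log(total_number_of_documents / number_of_documents_with_term) + 1
    return term_frequency * inverse_document_frequency

# word: (count in "The Girl Who Raised Pigeons", number of stories containing the word)
words = {"said": (47, 13), "pigeons": (30, 2)}

scores = {word: tfidf_score(count, 14, stories_with_word)
          for word, (count, stories_with_word) in words.items()}

for word, score in scores.items():
    print(f"{word}: {score:.2f}")
```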

# tf–idf with scikit-learn

## Import Libraries

We could continue calculating tf–idf scores in this manner — by doing all the math with Python — but conveniently there's a Python library that can calculate tf–idf scores in just a few lines of code.

This library is called [scikit-learn](https://scikit-learn.org/stable/index.html), imported as `sklearn`. It's a popular Python library for machine learning approaches such as clustering, classification, and regression, among others. Though we're not doing any machine learning in this lesson, we're nevertheless going to use scikit-learn's `TfidfVectorizer` and `CountVectorizer`.

In [13]:
!pip install scikit-learn



In [14]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer

We're also going to import `pandas` and change a few of its default display settings. We're going to increase the maximum number of rows and columns that pandas will display, and we're going to format numbers in a special way: whole numbers are shown without decimal places, and all other numbers are shown with three decimal places.

In [15]:
import pandas as pd
pd.set_option("display.max_rows", 200)
pd.set_option("display.max_columns", 200)
pd.options.display.float_format = lambda value : '{:.0f}'.format(value) if round(value,0) == value else '{:,.3f}'.format(value)
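To see what that number-formatting rule does on its own, here is the same logic written as a named function and applied to a few made-up values. The name `format_number` is only for illustration.

```python
# The same formatting rule that is passed to pandas above, written as a named function
def format_number(value):
    if round(value, 0) == value:
        return '{:.0f}'.format(value)   # whole numbers: no decimal places
    return '{:,.3f}'.format(value)      # otherwise: three decimal places, commas between thousands

print(format_number(3.0))        # 3
print(format_number(0.12345))    # 0.123
print(format_number(1234.5678))  # 1,234.568
```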

Finally, we're going to import two libraries that will help us work with files and the file system: [`pathlib`](https://docs.python.org/3/library/pathlib.html#basic-use) and [`glob`](https://docs.python.org/3/library/glob.html). These libraries will help us read in all the short story text files from *Lost in the City*.

In [16]:
from pathlib import Path  
import glob

## Set Directory Path

Below we're setting the directory filepath that contains all the short story text files that we want to analyze.

In [17]:
directory_path = "../texts/literature/Lost-in-the-City_Stories/"

Then we're going to use `glob` and `Path` to make a list of all the short story filepaths in that directory and a list of all the short story titles.

In [18]:
text_files = glob.glob(f"{directory_path}/*.txt")
text_titles = [Path(text).stem for text in text_files]
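If `Path(...).stem` is unfamiliar, the short sketch below shows what it returns for a single filepath, next to two related attributes. The filepath is one of the story files in the directory.

```python
from pathlib import Path

# One of the filepaths from the stories directory
example_path = "../texts/literature/Lost-in-the-City_Stories/11-Gospel.txt"

print(Path(example_path).name)    # 11-Gospel.txt
print(Path(example_path).stem)    # 11-Gospel
print(Path(example_path).suffix)  # .txt
```

Note that `glob` does not promise to return files in any particular order, so the list of titles may not be in story order.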

Let's display them to make sure they're correct:

In [19]:
text_files, text_titles

(['../texts/literature/Lost-in-the-City_Stories/11-Gospel.txt',
  '../texts/literature/Lost-in-the-City_Stories/13-A-Dark-Night.txt',
  '../texts/literature/Lost-in-the-City_Stories/01-The-Girl-Who-Raised-Pigeons.txt',
  '../texts/literature/Lost-in-the-City_Stories/12-A-New-Man.txt',
  '../texts/literature/Lost-in-the-City_Stories/02-The-First-Day.txt',
  '../texts/literature/Lost-in-the-City_Stories/07-The-Sunday-Following-Mother’S-Day.txt',
  '../texts/literature/Lost-in-the-City_Stories/03-The-Night-Rhonda-Ferguson-Was-Killed.txt',
  '../texts/literature/Lost-in-the-City_Stories/05-The-Store.txt',
  '../texts/literature/Lost-in-the-City_Stories/08-Lost-In-The-City.txt',
  '../texts/literature/Lost-in-the-City_Stories/14-Marie.txt',
  '../texts/literature/Lost-in-the-City_Stories/09-His-Mother’S-House.txt',
  '../texts/literature/Lost-in-the-City_Stories/10-A-Butterfly-On-F-Street.txt',
  '../texts/literature/Lost-in-the-City_Stories/06-An-Orange-Line-Train-To-Ballston.txt',
  '../t

## Calculate Word Frequency (Optional Step)

This is an optional step, but for the sake of comparison, we're first going to calculate the raw frequency for every word in every story with scikit-learn's [`CountVectorizer`](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html). Later, when we calculate our tf–idf scores, we can compare these two methods and see how tf–idf helps us find more unique words.

(Machine learning approaches require that you transform words into a "vector," aka a series of numbers. This is what `CountVectorizer` does. But it's also just a convenient way to tokenize and count words.)
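To make the idea of a word "vector" concrete, here is a plain-Python sketch of the same kind of counting, using three made-up sentences rather than the short stories. It roughly mirrors `CountVectorizer`'s default behavior of lowercasing the text and keeping words of two or more characters, but it is a simplified stand-in, not scikit-learn's implementation, and it does not remove stop words.

```python
import re
from collections import Counter

# Three made-up "documents" (not from the collection)
documents = [
    "The pigeons flew home.",
    "She said the pigeons were hers.",
    "He said nothing.",
]

def tokenize(text):
    # Lowercase the text and keep words of two or more characters
    return re.findall(r"\b\w\w+\b", text.lower())

counts_per_document = [Counter(tokenize(document)) for document in documents]

# The vocabulary is every distinct word, in alphabetical order
vocabulary = sorted(set().union(*counts_per_document))

# One row (vector) per document, one column per vocabulary word
count_matrix = [[counts[word] for word in vocabulary] for counts in counts_per_document]

print(vocabulary)
for row in count_matrix:
    print(row)
```

Each row is a document's vector: the number in each position is how many times the corresponding vocabulary word appears in that document.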

In [20]:
#Initialize CountVectorizer with desired parameters
count_vectorizer = CountVectorizer(input='filename', stop_words='english')

#Plug in "text_files," which contains all our short stories, to the initialized count_vectorizer
word_count_vector = count_vectorizer.fit_transform(text_files)

In [21]:
#Make a DataFrame out of the word count vector and sort by title
word_count_df = pd.DataFrame(word_count_vector.toarray(), index=text_titles, columns=count_vectorizer.get_feature_names_out())
word_count_df = word_count_df.sort_index()

#Add a row for the number of documents that each word appears in
word_count_df.loc['Document Frequency'] = (word_count_df > 0).sum()

This DataFrame, `word_count_df`, displays all the words that appear in *Lost in the City* (minus English stop words), how many times each word appears in each story, and how many stories each word appears in at least once (the very last row of numbers, titled "Document Frequency").

Let's look at a sample of 10 words. You can run the cell again to look at a different sample of words.

In [None]:
word_count_df.sample(10, axis='columns')

Let's zoom in on some specific words.

In [None]:
word_count_df[['pigeons', 'school', 'said', 'gospelteers', 'church', 'thunder','girl', 'street', 'father', 'dreaming', 'car']]

To find the top 10 most frequent words in every story, we're going to make and run the following function: `get_top_n_counts()`

In [24]:
def get_top_n_counts(dataframe, top_n=10):
    pretty_df = dataframe.stack().groupby(level=0).nlargest(top_n).reset_index()
    pretty_df = pretty_df.rename(columns={0:'count', 'level_1': 'story', 'level_2': 'word'})
    pretty_df = pretty_df.drop(columns='level_0')
    pretty_df['word_freq_rank'] = pretty_df.groupby('story')['count'].rank(method='min', ascending=False)
    return pretty_df

This function will rearrange the dataframe, `.groupby()` short story, and filter for the top 10 most frequent words in every story. Finally, it will produce a dataframe with a new column `word_freq_rank`, which contains a 1-10 ranking of the most frequent words.

In [25]:
word_count_df = word_count_df.drop('Document Frequency', errors='ignore')

In [None]:
top_word_freq = get_top_n_counts(word_count_df)
top_word_freq

# Calculate tf–idf

To calculate tf–idf scores for every word, we're going to follow a very similar pattern with scikit-learn's [`TfidfVectorizer`](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html).

When you initialize TfidfVectorizer, you can choose to set it with different parameters. These parameters will change the way you calculate tf–idf.

### Without Smoothing or Normalization (Not Recommended)

Remember how we calculated the tf–idf score for the word "pigeons" above?

In [27]:
total_number_of_documents = 14 
number_of_documents_with_term = 2
term_frequency = 30
inverse_document_frequency = log(total_number_of_documents / number_of_documents_with_term) + 1

term_frequency * inverse_document_frequency

88.3773044716594

We can use this exact formula by running `TfidfVectorizer` and turning off smoothing (`smooth_idf=False`) and normalization (`norm=None`). This is **not** the best or recommended way to calculate tf–idf scores. But it's useful to see the basic math that we discussed earlier in action with scikit-learn.

In [28]:
#Initialize TfidfVectorizer with desired parameters (turn off smoothing and normalization)
tfidf_vectorizer = TfidfVectorizer(input='filename', stop_words='english', smooth_idf = False, norm=None)

#Plug in "text_files" which contains all our short stories
tfidf_vector = tfidf_vectorizer.fit_transform(text_files)

In [29]:
#Make a DataFrame out of the tf–idf vector and sort by title
tfidf_df = pd.DataFrame(tfidf_vector.toarray(), index=text_titles, columns=tfidf_vectorizer.get_feature_names_out())
tfidf_df = tfidf_df.sort_index()

#Add a row for the number of documents that each word appears in
tfidf_df.loc['Document Frequency'] = (tfidf_df > 0).sum()

In [None]:
tfidf_slice = tfidf_df[['pigeons', 'school', 'said', 'church', 'gospelteers', 'thunder','girl', 'street', 'father', 'dreaming', 'car']]
tfidf_slice

### With Smoothing and Normalization (Defaults/Recommended)

The recommended way to run `TfidfVectorizer`, however, is with smoothing (`smooth_idf = True`) and normalization (`norm='l2'`) turned on. These parameters will better account for differences in story length, and, overall, they'll produce more meaningful tf–idf scores. 

Smoothing and L2 normalization are actually the default settings for `TfidfVectorizer`. To turn them on, you don't need to include any extra code at all.
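For a rough sense of what these two settings change, the sketch below uses the smoothed formula described in scikit-learn's documentation, `log((1 + total_number_of_documents) / (1 + number_of_documents_with_term)) + 1`, and then L2-normalizes the result. The function names are illustrative, and the example pretends the story contains only the words "said" and "pigeons." Real vectors span the whole vocabulary, so the scores scikit-learn produces for the stories will be different.

```python
from math import log, sqrt

def smoothed_idf(total_number_of_documents, number_of_documents_with_term):
    # Smoothing adds 1 to the numerator and the denominator, as if one extra
    # document contained every word in the corpus
    return log((1 + total_number_of_documents) / (1 + number_of_documents_with_term)) + 1

def l2_normalize(scores):
    # Divide every score by the Euclidean length of the whole vector
    length = sqrt(sum(score ** 2 for score in scores))
    return [score / length for score in scores]

# Counts from "The Girl Who Raised Pigeons", treating the story as if it
# held only these two words (a simplifying assumption)
raw_scores = [47 * smoothed_idf(14, 13), 30 * smoothed_idf(14, 2)]  # "said", "pigeons"
normalized_scores = l2_normalize(raw_scores)

print([round(score, 3) for score in normalized_scores])
```

After normalization, the squared scores in a story's vector add up to 1, so a long story and a short story end up on the same scale.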

In [31]:
#Initialize TfidfVectorizer with desired parameters (default smoothing and normalization)
tfidf_vectorizer = TfidfVectorizer(input='filename', stop_words='english')

#Plug in "text_files" which contains all our short stories
tfidf_vector = tfidf_vectorizer.fit_transform(text_files)

In [32]:
#Make a DataFrame out of the tf–idf vector and sort by title
tfidf_df = pd.DataFrame(tfidf_vector.toarray(), index=text_titles, columns=tfidf_vectorizer.get_feature_names_out())
tfidf_df = tfidf_df.sort_index()

#Add a row for the number of documents that each word appears in
tfidf_df.loc['Document Frequency'] = (tfidf_df > 0).sum()

In [None]:
tfidf_slice = tfidf_df[['pigeons', 'school', 'said', 'church', 'gospelteers', 'thunder','girl', 'street', 'father', 'dreaming', 'car']]
tfidf_slice

To find out the top 10 words with the highest tf–idf for every story, we're going to make and run the following function: `get_top_tfidf_scores()`

In [34]:
def get_top_tfidf_scores(series, top_n=10):
    pretty_df = series.stack().groupby(level=0).nlargest(top_n).reset_index()
    pretty_df = pretty_df.rename(columns={0:'tfidf_score', 'level_1': 'story', 'level_2': 'word'})
    pretty_df = pretty_df.drop(columns='level_0')
    pretty_df['tfidf_rank'] = pretty_df.groupby('story')['tfidf_score'].rank(method='first', ascending=False)
    return pretty_df

As before, this function will rearrange the dataframe, `.groupby()` short story, and filter for the top 10 highest tf–idf scores in every story. Finally, it will produce a dataframe with a new column `tfidf_rank`, which contains a 1-10 ranking of the highest tf–idf scores.
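One small difference between the two helper functions is how they rank ties: `get_top_n_counts()` uses `method='min'`, while `get_top_tfidf_scores()` uses `method='first'`. The sketch below shows the difference on a made-up Series with one tie.

```python
import pandas as pd

# Made-up scores with a tie between the second and third values
scores = pd.Series([5, 3, 3, 1], index=["a", "b", "c", "d"])

# 'min' gives tied values the same (lowest) rank and skips the next one
print(scores.rank(method="min", ascending=False).tolist())    # [1.0, 2.0, 2.0, 4.0]

# 'first' breaks ties by order of appearance, so every rank is distinct
print(scores.rank(method="first", ascending=False).tolist())  # [1.0, 2.0, 3.0, 4.0]
```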

In [35]:
tfidf_df = tfidf_df.drop('Document Frequency', errors='ignore')

In [None]:
top_tfidf = get_top_tfidf_scores(tfidf_df)
top_tfidf

## Write to a CSV File

In [37]:
filename = "tfidf_Lost-in-The-City.csv"
top_tfidf.to_csv(filename, encoding='UTF-8', index=False)

# Compare Word Frequency and tf–idf Scores

Now let's compare the raw word frequencies and tf–idf scores for all the stories in *Lost in the City*.

First, we're going to merge the top raw word frequency ranks into our top tf–idf dataframe.

In [38]:
tfidf_compare = top_tfidf.merge(top_word_freq[['word_freq_rank', 'word', 'story']] , on=['story', 'word'], how='left')

Then we're going to add a column that calculates the change in rank, that is, how the significance of a word changes when we calculate tf–idf vs raw word frequency. A positive number means the word ranks higher by tf–idf than it does by raw frequency.

In [39]:
tfidf_compare['changed_rank'] = tfidf_compare['word_freq_rank'] - tfidf_compare['tfidf_rank']
tfidf_compare = tfidf_compare.fillna("*new top word*")
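Here is the same merge-and-subtract pattern on a tiny, made-up example, which may make it easier to see where the missing values come from. The DataFrame names, the story "A," and all of the ranks are hypothetical.

```python
import pandas as pd

# Made-up ranks for a single hypothetical story
toy_tfidf = pd.DataFrame({"story": ["A", "A"],
                          "word": ["pigeons", "dreaming"],
                          "tfidf_rank": [1.0, 2.0]})
toy_word_freq = pd.DataFrame({"story": ["A", "A"],
                              "word": ["said", "pigeons"],
                              "word_freq_rank": [1.0, 3.0]})

# how='left' keeps every row of toy_tfidf, even when there is no match on the right
toy_compare = toy_tfidf.merge(toy_word_freq, on=["story", "word"], how="left")
toy_compare["changed_rank"] = toy_compare["word_freq_rank"] - toy_compare["tfidf_rank"]

print(toy_compare)
```

In this toy example "pigeons" moves from 3rd by raw frequency to 1st by tf–idf, so its `changed_rank` is 2. The word "dreaming" has no raw frequency rank to compare against, so its `changed_rank` is missing (`NaN`); that is the kind of value the cell above replaces with `*new top word*`.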

Finally, we're going to make some functions that will alter the style of our pandas DataFrame, so that words that move up in tf–idf rank are emphasized in green with a `+` sign and words that move down in tf–idf rank are emphasized in red with a `-` sign.

In [40]:
def make_positive(value):
    if value != '*new top word*':
        if float(value) > 0:
            value = f'+{round(value)}'
    return value

def make_bold(value):
    return 'font-weight: bold'

def color_df(value):
    if value == '*new top word*':
        color = 'green'    
    else:
        value = str(value).replace('+', '')
        value = float(value)
        
        if value < 0:
            color = 'red'
        elif value > 0:
            color = 'green'
        else:
             color = 'black'        
    df_style = f'color: {color}; font-weight: bold'
    return df_style

Now let's display the dataframe and explore which words have become more significant and which words have become less so.

In [None]:
tfidf_compare['changed_rank'] = tfidf_compare['changed_rank'].apply(make_positive)
tfidf_compare_styled = tfidf_compare.style.applymap(color_df, subset=['changed_rank']).applymap(make_bold, subset=['tfidf_rank'])
tfidf_compare_styled

The word "said," which is one of the most frequent words throughout the collection, gets knocked down in tf–idf importance precisely because it occurs in almost every story.

*Note: To style your dataframe with color and bolding (as above), add `.style.applymap(color_df, subset=['changed_rank'])` to the end of the code below*

In [None]:
tfidf_compare[tfidf_compare['word'] == 'said']

A word like "pigeons," on the other hand, becomes more significant because it is rarer.

In [None]:
tfidf_compare[tfidf_compare['word'] == 'pigeons']

Words that were not frequent enough to make the top 10 for raw word frequency, such as "dreaming," "gospelteers," or "dreadlocks," now suddenly show up in the top 10 for tf–idf scores.

In [None]:
tfidf_compare[tfidf_compare['word'] == 'dreaming']

In [None]:
tfidf_compare[tfidf_compare['word'] == 'gospelteers']

In [None]:
tfidf_compare[tfidf_compare['word'] == 'dreadlocks']

# Your Turn!

Take a few minutes to explore the dataframe below and then answer the following questions.

In [None]:
tfidf_compare

**1.** What is the difference between a tf-idf score and raw word frequency?

**#** Your answer here

**2.** Based on the dataframe above, what is one potential problem or limitation that you notice with tf-idf scores?

**#** Your answer here

**3.** What's another collection of texts that you think might be interesting to analyze with tf-idf scores?  Why?

**#** Your answer here