## Term Frequency–Inverse Document Frequency

In this lesson, we're going to learn about a text analysis method called **term frequency–inverse document frequency** (tf–idf). This method will help us identify the most unique words in a document from a given corpus. 

>term = word <br>
>document = text (or chunk of a text) <br>
>corpus = collection of texts <br>

## Introduction 

Calculating the most frequent words in a text can be useful. But often the most *frequent words* in a text aren't the most *interesting* words in a text.

Term frequency–inverse document frequency is a method that tries to help with this problem because it identifies the most frequent *unique* words in a text compared to other texts.

### Breaking Down the Formula


**term_frequency * inverse_document_frequency**

**term_frequency** = number of times a given term appears in a story or text

**inverse_document_frequency** = log(total number of documents / number of documents with term) + 1

We're going to calculate and compare the tf–idf scores for the word *said* and the word *pigeons* in "The Girl Who Raised Pigeons," the first short story in *Lost in the City*.

We need the `log()` function for our calculation, so we're going to import it from Python's built-in `math` module. Called with a single argument, `log()` returns the natural logarithm.

In [38]:
from math import log

**"said"**

In [39]:
total_number_of_documents = 14  # total number of short stories in *Lost in the City*
number_of_documents_with_term = 13  # number of short stories that contain the word "said"

In [40]:
term_frequency = 47  # number of times "said" appears in "The Girl Who Raised Pigeons"
inverse_document_frequency = log(total_number_of_documents / number_of_documents_with_term) + 1

In [41]:
term_frequency * inverse_document_frequency

50.48307469122493

**"pigeons"**

In [42]:
total_number_of_documents = 14  # total number of short stories in *Lost in the City*
number_of_documents_with_term = 2  # number of short stories that contain the word "pigeons"

In [43]:
term_frequency = 30  # number of times "pigeons" appears in "The Girl Who Raised Pigeons"
inverse_document_frequency = log(total_number_of_documents / number_of_documents_with_term) + 1

In [44]:
term_frequency * inverse_document_frequency

88.3773044716594

**tf–idf Scores**

"said" = 50.48<br>
"pigeons" = 88.38

Though the word "said" appears 47 times in "The Girl Who Raised Pigeons" and the word "pigeons" only appears 30 times, "pigeons" has a higher tf–idf score than "said" because it's a rarer word. The word "pigeons" appears in 2 of 14 stories, while "said" appears in 13 of 14 stories, almost all of them.
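
The two calculations above follow the same pattern, so they can be wrapped in a small helper. This is only a sketch: the function name `tf_idf` and its parameter names are our own, not part of any library, and it uses the unsmoothed formula from this section.

```python
from math import log

def tf_idf(term_frequency, total_number_of_documents, number_of_documents_with_term):
    """Unsmoothed tf-idf: term frequency times (natural log of N / df, plus 1)."""
    inverse_document_frequency = log(total_number_of_documents / number_of_documents_with_term) + 1
    return term_frequency * inverse_document_frequency

# Counts taken from the cells above
said_score = tf_idf(47, 14, 13)
pigeons_score = tf_idf(30, 14, 2)

print(f"said: {said_score:.2f}")
print(f"pigeons: {pigeons_score:.2f}")
```

Writing the formula once makes it easy to try other words: only the three counts change.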

## tf–idf with scikit-learn

#### Import Libraries

We could continue calculating tf–idf scores in this manner — by doing all the math with Python — but conveniently there's a Python library that can calculate tf–idf scores in just a few lines of code.

This library is called [scikit-learn](https://scikit-learn.org/stable/index.html), imported as `sklearn`. It's a popular Python library for machine learning approaches such as clustering, classification, and regression, among others. Though we're not doing any machine learning in this lesson, we're nevertheless going to use scikit-learn's `TfidfVectorizer` and `CountVectorizer`.

In [45]:
!pip install scikit-learn



In [46]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer

We're also going to import `pandas` and change a few of its default display settings. We're going to increase the maximum number of rows and columns that pandas will display, and we're going to format numbers in a special way: whole numbers are shown without decimals, and all other numbers are shown to three decimal places.

In [47]:
import pandas as pd
pd.set_option("display.max_rows", 200)
pd.set_option("display.max_columns", 200)
pd.options.display.float_format = lambda value : '{:.0f}'.format(value) if round(value,0) == value else '{:,.3f}'.format(value)
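
To see what the `float_format` setting does, here is the same formatting rule written as a named function and applied to a few sample values. The name `format_number` and the sample values are ours, chosen only for illustration.

```python
def format_number(value):
    # Whole numbers are shown without decimals; everything else gets three decimal places
    if round(value, 0) == value:
        return '{:.0f}'.format(value)
    return '{:,.3f}'.format(value)

for sample in [30.0, 0.20683, 1234.5]:
    print(sample, '->', format_number(sample))
```

This is why word counts in the tables below appear as whole numbers while tf–idf scores appear with three decimal places.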

Finally, we're going to import two libraries that will help us work with files and the file system: [`pathlib`](https://docs.python.org/3/library/pathlib.html#basic-use) and [`glob`](https://docs.python.org/3/library/glob.html). These libraries will help us read in all the short story text files from *Lost in the City*.

In [None]:
from pathlib import Path  
import glob

#### Set Directory Path

Below we're setting the directory filepath that contains all the short story text files that we want to analyze.

In [48]:
directory_path = "../texts/literature/Lost-in-the-City_Stories/"

Then we're going to use `glob` and `Path` to make a list of all the short story filepaths in that directory and a list of all the short story titles.

In [49]:
text_files = glob.glob(f"{directory_path}/*.txt")

In [50]:
text_files

['../texts/literature/Lost-in-the-City_Stories/11-Gospel.txt',
 '../texts/literature/Lost-in-the-City_Stories/13-A-Dark-Night.txt',
 '../texts/literature/Lost-in-the-City_Stories/01-The-Girl-Who-Raised-Pigeons.txt',
 '../texts/literature/Lost-in-the-City_Stories/12-A-New-Man.txt',
 '../texts/literature/Lost-in-the-City_Stories/02-The-First-Day.txt',
 '../texts/literature/Lost-in-the-City_Stories/07-The-Sunday-Following-Mother’S-Day.txt',
 '../texts/literature/Lost-in-the-City_Stories/03-The-Night-Rhonda-Ferguson-Was-Killed.txt',
 '../texts/literature/Lost-in-the-City_Stories/05-The-Store.txt',
 '../texts/literature/Lost-in-the-City_Stories/08-Lost-In-The-City.txt',
 '../texts/literature/Lost-in-the-City_Stories/14-Marie.txt',
 '../texts/literature/Lost-in-the-City_Stories/09-His-Mother’S-House.txt',
 '../texts/literature/Lost-in-the-City_Stories/10-A-Butterfly-On-F-Street.txt',
 '../texts/literature/Lost-in-the-City_Stories/06-An-Orange-Line-Train-To-Ballston.txt',
 '../texts/literatur

In [51]:
text_titles = [Path(text).stem for text in text_files]

In [54]:
Path("../texts/literature/Lost-in-the-City_Stories/04-Young-Lions.txt").stem

'04-Young-Lions'

In [52]:
text_titles

['11-Gospel',
 '13-A-Dark-Night',
 '01-The-Girl-Who-Raised-Pigeons',
 '12-A-New-Man',
 '02-The-First-Day',
 '07-The-Sunday-Following-Mother’S-Day',
 '03-The-Night-Rhonda-Ferguson-Was-Killed',
 '05-The-Store',
 '08-Lost-In-The-City',
 '14-Marie',
 '09-His-Mother’S-House',
 '10-A-Butterfly-On-F-Street',
 '06-An-Orange-Line-Train-To-Ballston',
 '04-Young-Lions']
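
Notice that `glob` returned the files in no particular order. Below is a minimal sketch of the same glob-and-stem pattern, with `sorted()` added so that the titles come back in story order. The temporary directory and the two empty files are stand-ins that we create just for this example.

```python
import glob
import tempfile
from pathlib import Path

# Create a throwaway directory with two empty, hypothetical story files
with tempfile.TemporaryDirectory() as example_directory:
    for name in ["02-The-First-Day.txt", "01-The-Girl-Who-Raised-Pigeons.txt"]:
        (Path(example_directory) / name).touch()

    # sorted() makes the order predictable; glob alone makes no ordering promise
    example_files = sorted(glob.glob(f"{example_directory}/*.txt"))
    example_titles = [Path(file).stem for file in example_files]

print(example_titles)
```

In this lesson the order doesn't affect the results, because the dataframes are later sorted by title with `.sort_index()`.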

Let's display them to make sure they're correct:

In [55]:
text_files, text_titles

(['../texts/literature/Lost-in-the-City_Stories/11-Gospel.txt',
  '../texts/literature/Lost-in-the-City_Stories/13-A-Dark-Night.txt',
  '../texts/literature/Lost-in-the-City_Stories/01-The-Girl-Who-Raised-Pigeons.txt',
  '../texts/literature/Lost-in-the-City_Stories/12-A-New-Man.txt',
  '../texts/literature/Lost-in-the-City_Stories/02-The-First-Day.txt',
  '../texts/literature/Lost-in-the-City_Stories/07-The-Sunday-Following-Mother’S-Day.txt',
  '../texts/literature/Lost-in-the-City_Stories/03-The-Night-Rhonda-Ferguson-Was-Killed.txt',
  '../texts/literature/Lost-in-the-City_Stories/05-The-Store.txt',
  '../texts/literature/Lost-in-the-City_Stories/08-Lost-In-The-City.txt',
  '../texts/literature/Lost-in-the-City_Stories/14-Marie.txt',
  '../texts/literature/Lost-in-the-City_Stories/09-His-Mother’S-House.txt',
  '../texts/literature/Lost-in-the-City_Stories/10-A-Butterfly-On-F-Street.txt',
  '../texts/literature/Lost-in-the-City_Stories/06-An-Orange-Line-Train-To-Ballston.txt',
  '../t

#### Calculate Word Frequency (Optional Step)

This is an optional step, but for the sake of comparison, we're first going to calculate the raw frequency for every word in every story with scikit-learn's [`CountVectorizer`](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html). Later, when we calculate our tf–idf scores, we can compare these two methods and see how tf–idf helps us find more unique words.

(Machine learning approaches require that you transform words into a "vector," aka a series of numbers. This is what `CountVectorizer` does. But it's also just a convenient way to tokenize and count words.)

In [56]:
#Initialize CountVectorizer with desired parameters
count_vectorizer = CountVectorizer(input='filename', stop_words='english')

#Plug in "text_files," which contains all our short stories, to the initialized count_vectorizer
word_count_vector = count_vectorizer.fit_transform(text_files)
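
To make the idea of a word-count "vector" concrete, here is a simplified, plain-Python sketch of the kind of work `CountVectorizer` does: lowercase the text, split it into tokens of two or more characters, drop stop words, and count what's left. The two sentences and the tiny stop-word list are made up for illustration, and the real class handles many more cases.

```python
import re
from collections import Counter

# Hypothetical mini-corpus and stop-word list, for illustration only
example_documents = [
    "The pigeons flew.",
    "She said the pigeons said nothing.",
]
example_stop_words = {"the", "she"}

def count_words(text):
    # Lowercase, keep tokens of 2+ word characters, then remove stop words
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    return Counter(token for token in tokens if token not in example_stop_words)

counts_per_document = [count_words(text) for text in example_documents]

# The vocabulary is every remaining word, in alphabetical order
vocabulary = sorted(set().union(*counts_per_document))

# One row per document, one column per vocabulary word
count_matrix = [[counts[word] for word in vocabulary] for counts in counts_per_document]

print(vocabulary)
for row in count_matrix:
    print(row)
```

Each row of `count_matrix` is one document's vector: the same words in the same column order, with a count in each slot.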

Check the scikit-learn stop words:

In [57]:
count_vectorizer.get_stop_words()

frozenset({'a',
           'about',
           'above',
           'across',
           'after',
           'afterwards',
           'again',
           'against',
           'all',
           'almost',
           'alone',
           'along',
           'already',
           'also',
           'although',
           'always',
           'am',
           'among',
           'amongst',
           'amoungst',
           'amount',
           'an',
           'and',
           'another',
           'any',
           'anyhow',
           'anyone',
           'anything',
           'anyway',
           'anywhere',
           'are',
           'around',
           'as',
           'at',
           'back',
           'be',
           'became',
           'because',
           'become',
           'becomes',
           'becoming',
           'been',
           'before',
           'beforehand',
           'behind',
           'being',
           'below',
           'beside',
           'besides'

In [60]:
#Make a DataFrame out of the word count vector and sort by title
#(get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2)
word_count_df = pd.DataFrame(word_count_vector.toarray(), index=text_titles, columns=count_vectorizer.get_feature_names_out())
word_count_df = word_count_df.sort_index()

#Add a row for the number of documents in which each word appears
word_count_df.loc['Document Frequency'] = (word_count_df > 0).sum()

This dataframe `word_count_df` displays all the words that appear in *Lost in the City* (minus the stop words), how many times each word appears in each story, and how many stories each word appears in at least once (the very last row of numbers, titled "Document Frequency").
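
The "Document Frequency" row relies on a small pandas idiom: `word_count_df > 0` turns every count into `True` or `False`, and `.sum()` then counts the `True` values in each column. Here is a minimal sketch on a three-story, two-word excerpt of this lesson's word counts (the variable names are ours):

```python
import pandas as pd

# Counts for two words in three stories, taken from this lesson's word counts
example_counts = pd.DataFrame(
    {"pigeons": [30, 0, 0], "said": [47, 0, 111]},
    index=["01-The-Girl-Who-Raised-Pigeons", "02-The-First-Day",
           "03-The-Night-Rhonda-Ferguson-Was-Killed"],
)

# True where a word appears at least once; summing counts the True values per column
document_frequency = (example_counts > 0).sum()
print(document_frequency)
```

Because the result is stored as an extra row alongside the stories, it has to be dropped again before ranking words within each story.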

Let's look at a sample of 10 words. You can run the cell again to look at a different sample of words.

In [61]:
word_count_df.sample(10, axis='columns')

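| | divide | thawing | treats | berated | heat | crowned | stay | resigned | st | sang |
|---|---|---|---|---|---|---|---|---|---|---|
| 01-The-Girl-Who-Raised-Pigeons | 0 | 0 | 0 | 0 | 0 | 0 | 6 | 0 | 0 | 0 |
| 02-The-First-Day | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 03-The-Night-Rhonda-Ferguson-Was-Killed | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 8 |
| 04-Young-Lions | 0 | 0 | 0 | 0 | 1 | 0 | 2 | 0 | 0 | 0 |
| 05-The-Store | 0 | 2 | 1 | 0 | 1 | 1 | 4 | 0 | 0 | 0 |
| 06-An-Orange-Line-Train-To-Ballston | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 07-The-Sunday-Following-Mother’S-Day | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 08-Lost-In-The-City | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 09-His-Mother’S-House | 0 | 0 | 0 | 0 | 0 | 0 | 9 | 0 | 0 | 1 |
| 10-A-Butterfly-On-F-Street | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |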


Let's zoom in on some specific words.

In [62]:
word_count_df[['pigeons', 'school', 'said', 'gospelteers', 'church', 'thunder','girl', 'street', 'father', 'dreaming', 'car']]

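| | pigeons | school | said | gospelteers | church | thunder | girl | street | father | dreaming | car |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 01-The-Girl-Who-Raised-Pigeons | 30 | 11 | 47 | 0 | 3 | 0 | 22 | 37 | 15 | 0 | 0 |
| 02-The-First-Day | 0 | 10 | 0 | 0 | 2 | 0 | 8 | 6 | 1 | 0 | 0 |
| 03-The-Night-Rhonda-Ferguson-Was-Killed | 0 | 9 | 111 | 0 | 1 | 0 | 17 | 43 | 32 | 0 | 42 |
| 04-Young-Lions | 0 | 5 | 71 | 0 | 0 | 0 | 2 | 25 | 28 | 0 | 4 |
| 05-The-Store | 0 | 5 | 79 | 0 | 3 | 0 | 21 | 30 | 32 | 0 | 9 |
| 06-An-Orange-Line-Train-To-Ballston | 0 | 7 | 64 | 0 | 0 | 0 | 5 | 8 | 5 | 0 | 4 |
| 07-The-Sunday-Following-Mother’S-Day | 0 | 1 | 82 | 0 | 3 | 0 | 11 | 9 | 23 | 0 | 25 |
| 08-Lost-In-The-City | 0 | 1 | 46 | 0 | 3 | 0 | 3 | 8 | 8 | 5 | 0 |
| 09-His-Mother’S-House | 0 | 2 | 96 | 0 | 0 | 0 | 4 | 27 | 3 | 0 | 9 |
| 10-A-Butterfly-On-F-Street | 0 | 0 | 16 | 0 | 0 | 0 | 0 | 12 | 4 | 0 | 3 |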


To find the top 10 most frequent words in every story, we're going to make and run the following function: `get_top_n_counts()`

In [63]:
def get_top_n_counts(dataframe, top_n=10):
    pretty_df = dataframe.stack().groupby(level=0).nlargest(top_n).reset_index()
    pretty_df = pretty_df.rename(columns={0:'count', 'level_1': 'story', 'level_2': 'word'})
    pretty_df = pretty_df.drop(columns='level_0')
    pretty_df['word_freq_rank'] = pretty_df.groupby('story')['count'].rank(method='min', ascending=False)
    return pretty_df

This function will rearrange the dataframe, `.groupby()` short story, and filter for the top 10 most frequent words in every story. Finally, it will produce a dataframe with a new column `word_freq_rank`, which contains a 1-10 ranking of the most frequent words.

In [64]:
word_count_df = word_count_df.drop('Document Frequency', errors='ignore')

In [65]:
top_word_freq = get_top_n_counts(word_count_df)
top_word_freq

| | story | word | count | word_freq_rank |
|---|---|---|---|---|
| 0 | 01-The-Girl-Who-Raised-Pigeons | miss | 47 | 1 |
| 1 | 01-The-Girl-Who-Raised-Pigeons | said | 47 | 1 |
| 2 | 01-The-Girl-Who-Raised-Pigeons | ann | 45 | 3 |
| 3 | 01-The-Girl-Who-Raised-Pigeons | betsy | 45 | 3 |
| 4 | 01-The-Girl-Who-Raised-Pigeons | jenny | 44 | 5 |
| 5 | 01-The-Girl-Who-Raised-Pigeons | robert | 37 | 6 |
| 6 | 01-The-Girl-Who-Raised-Pigeons | street | 37 | 6 |
| 7 | 01-The-Girl-Who-Raised-Pigeons | pigeons | 30 | 8 |
| 8 | 01-The-Girl-Who-Raised-Pigeons | birds | 29 | 9 |
| 9 | 01-The-Girl-Who-Raised-Pigeons | coop | 28 | 10 |
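
Notice the tied ranks in the table above: "miss" and "said" both rank 1, and the next word ranks 3. That comes from `rank(method='min')`, which gives tied values the lowest rank in their group. Here is a plain-Python sketch of that rule, using the counts from the table (the function name `min_rank` is ours):

```python
def min_rank(values):
    # A value's rank is 1 plus the number of values strictly greater than it,
    # so tied values share the same (lowest) rank
    return [1 + sum(other > value for other in values) for value in values]

# Top 10 counts for "The Girl Who Raised Pigeons," from the table above
story_counts = [47, 47, 45, 45, 44, 37, 37, 30, 29, 28]
print(min_rank(story_counts))
```

The printed ranks match the `word_freq_rank` column, including the skipped ranks 2, 4, and 7.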


## Calculate tf–idf

To calculate tf–idf scores for every word, we're going to follow a very similar pattern with scikit-learn's [`TfidfVectorizer`](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html).

When you initialize TfidfVectorizer, you can choose to set it with different parameters. These parameters will change the way you calculate tf–idf.

#### Without Smoothing or Normalization (Not Recommended)

Remember how we calculated the tf–idf score for the word "pigeons" above?

In [66]:
total_number_of_documents = 14 
number_of_documents_with_term = 2
term_frequency = 30
inverse_document_frequency = log(total_number_of_documents / number_of_documents_with_term) + 1

term_frequency * inverse_document_frequency

88.3773044716594

We can use this exact formula by running `TfidfVectorizer` and turning off smoothing (`smooth_idf=False`) and normalization (`norm=None`). This is **not** the best or recommended way to calculate tf–idf scores. But it's useful to see the basic math that we discussed earlier in action with scikit-learn.

In [67]:
#Initialize TfidfVectorizer with desired parameters (turn off smoothing and normalization)
tfidf_vectorizer = TfidfVectorizer(input='filename', stop_words='english', smooth_idf = False, norm=None)

#Plug in "text_files" which contains all our short stories
tfidf_vector = tfidf_vectorizer.fit_transform(text_files)

In [68]:
#Make a DataFrame out of the tf–idf vector and sort by title
#(get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2)
tfidf_df = pd.DataFrame(tfidf_vector.toarray(), index=text_titles, columns=tfidf_vectorizer.get_feature_names_out())
tfidf_df = tfidf_df.sort_index()

#Add a row for the number of documents in which each word appears
tfidf_df.loc['Document Frequency'] = (tfidf_df > 0).sum()

In [69]:
tfidf_slice = tfidf_df[['pigeons', 'school', 'said', 'church', 'gospelteers', 'thunder','girl', 'street', 'father', 'dreaming', 'car']]
tfidf_slice

| | pigeons | school | said | church | gospelteers | thunder | girl | street | father | dreaming | car |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 01-The-Girl-Who-Raised-Pigeons | 88.377 | 13.653 | 50.483 | 4.325 | 0.0 | 0.0 | 23.63 | 39.742 | 16.112 | 0.0 | 0.0 |
| 02-The-First-Day | 0.0 | 12.412 | 0.0 | 2.884 | 0.0 | 0.0 | 8.593 | 6.445 | 1.074 | 0.0 | 0.0 |
| 03-The-Night-Rhonda-Ferguson-Was-Killed | 0.0 | 11.17 | 119.226 | 1.442 | 0.0 | 0.0 | 18.26 | 46.187 | 34.371 | 0.0 | 52.129 |
| 04-Young-Lions | 0.0 | 6.206 | 76.262 | 0.0 | 0.0 | 0.0 | 2.148 | 26.853 | 30.075 | 0.0 | 4.965 |
| 05-The-Store | 0.0 | 6.206 | 84.855 | 4.325 | 0.0 | 0.0 | 22.556 | 32.223 | 34.371 | 0.0 | 11.17 |
| 06-An-Orange-Line-Train-To-Ballston | 0.0 | 8.688 | 68.743 | 0.0 | 0.0 | 0.0 | 5.371 | 8.593 | 5.371 | 0.0 | 4.965 |
| 07-The-Sunday-Following-Mother’S-Day | 0.0 | 1.241 | 88.077 | 4.325 | 0.0 | 0.0 | 11.815 | 9.667 | 24.704 | 0.0 | 31.029 |
| 08-Lost-In-The-City | 0.0 | 1.241 | 49.409 | 4.325 | 0.0 | 0.0 | 3.222 | 8.593 | 8.593 | 18.195 | 0.0 |
| 09-His-Mother’S-House | 0.0 | 2.482 | 103.114 | 0.0 | 0.0 | 0.0 | 4.296 | 29.001 | 3.222 | 0.0 | 11.17 |
| 10-A-Butterfly-On-F-Street | 0.0 | 0.0 | 17.186 | 0.0 | 0.0 | 0.0 | 0.0 | 12.889 | 4.296 | 0.0 | 3.723 |
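
As a quick check that these settings match the manual formula, here is a small sketch that runs `TfidfVectorizer` on three made-up one-line documents held in memory (so no `input='filename'`) and compares one score against `term_frequency * (log(N / df) + 1)`. The documents and variable names are ours.

```python
from math import log
from sklearn.feature_extraction.text import TfidfVectorizer

# Three hypothetical mini-documents, for illustration only
example_documents = ["pigeons pigeons said", "said said", "said coop"]

example_vectorizer = TfidfVectorizer(smooth_idf=False, norm=None)
example_scores = example_vectorizer.fit_transform(example_documents).toarray()

# Look up the column for "pigeons" and read its score in the first document
pigeons_column = example_vectorizer.vocabulary_["pigeons"]
sklearn_score = example_scores[0][pigeons_column]

# "pigeons" appears 2 times in the first document and in 1 of the 3 documents
manual_score = 2 * (log(3 / 1) + 1)

print(sklearn_score, manual_score)
```

The two printed numbers should agree, which is the same relationship we see between the 88.377 in the table and our manual calculation for "pigeons."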


#### With Smoothing and Normalization (Defaults/Recommended)

The recommended way to run `TfidfVectorizer`, however, is with smoothing (`smooth_idf = True`) and normalization (`norm='l2'`) turned on. These parameters will better account for differences in story length, and, overall, they'll produce more meaningful tf–idf scores. 

Smoothing and L2 normalization are actually the default settings for `TfidfVectorizer`. To turn them on, you don't need to include any extra code at all.
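
Here is a short sketch of what these two settings change, based on the formulas in the scikit-learn documentation. Smoothing adds 1 to both the document count and the document frequency, as if one extra document contained every word. L2 normalization then divides each story's scores by the length of that story's score vector, so long and short stories end up on the same scale. The helper names and the three-score example vector are ours.

```python
from math import log, sqrt

def smoothed_idf(total_number_of_documents, number_of_documents_with_term):
    # Smoothed idf: log((1 + N) / (1 + df)) + 1
    return log((1 + total_number_of_documents) / (1 + number_of_documents_with_term)) + 1

def l2_normalize(scores):
    # Divide every score by the square root of the sum of squared scores
    length = sqrt(sum(score ** 2 for score in scores))
    return [score / length for score in scores]

pigeons_idf = smoothed_idf(14, 2)   # "pigeons" appears in 2 of 14 stories
said_idf = smoothed_idf(14, 13)     # "said" appears in 13 of 14 stories
print(round(pigeons_idf, 3), round(said_idf, 3))

# Hypothetical three-word story, just to show the effect of normalization
normalized = l2_normalize([3.0, 4.0, 0.0])
print(normalized)
```

With only three hypothetical scores, this can't reproduce the normalized values that scikit-learn reports for the real stories, because the real normalization uses every word's score in the story.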

In [70]:
#Initialize TfidfVectorizer with desired parameters (default smoothing and normalization)
tfidf_vectorizer = TfidfVectorizer(input='filename', stop_words='english')

#Plug in "text_files" which contains all our short stories
tfidf_vector = tfidf_vectorizer.fit_transform(text_files)

In [71]:
#Make a DataFrame out of the tf–idf vector and sort by title
#(get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2)
tfidf_df = pd.DataFrame(tfidf_vector.toarray(), index=text_titles, columns=tfidf_vectorizer.get_feature_names_out())
tfidf_df = tfidf_df.sort_index()

#Add a row for the number of documents in which each word appears
tfidf_df.loc['Document Frequency'] = (tfidf_df > 0).sum()

In [72]:
tfidf_slice = tfidf_df[['pigeons', 'school', 'said', 'church', 'gospelteers', 'thunder','girl', 'street', 'father', 'dreaming', 'car']]
tfidf_slice

| | pigeons | school | said | church | gospelteers | thunder | girl | street | father | dreaming | car |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 01-The-Girl-Who-Raised-Pigeons | 0.207 | 0.036 | 0.133 | 0.011 | 0.0 | 0.0 | 0.062 | 0.105 | 0.042 | 0.0 | 0.0 |
| 02-The-First-Day | 0.0 | 0.134 | 0.0 | 0.031 | 0.0 | 0.0 | 0.094 | 0.07 | 0.012 | 0.0 | 0.0 |
| 03-The-Night-Rhonda-Ferguson-Was-Killed | 0.0 | 0.02 | 0.212 | 0.003 | 0.0 | 0.0 | 0.032 | 0.082 | 0.061 | 0.0 | 0.092 |
| 04-Young-Lions | 0.0 | 0.015 | 0.186 | 0.0 | 0.0 | 0.0 | 0.005 | 0.065 | 0.073 | 0.0 | 0.012 |
| 05-The-Store | 0.0 | 0.018 | 0.246 | 0.012 | 0.0 | 0.0 | 0.065 | 0.093 | 0.1 | 0.0 | 0.032 |
| 06-An-Orange-Line-Train-To-Ballston | 0.0 | 0.036 | 0.286 | 0.0 | 0.0 | 0.0 | 0.022 | 0.036 | 0.022 | 0.0 | 0.02 |
| 07-The-Sunday-Following-Mother’S-Day | 0.0 | 0.003 | 0.21 | 0.01 | 0.0 | 0.0 | 0.028 | 0.023 | 0.059 | 0.0 | 0.073 |
| 08-Lost-In-The-City | 0.0 | 0.007 | 0.292 | 0.025 | 0.0 | 0.0 | 0.019 | 0.051 | 0.051 | 0.09 | 0.0 |
| 09-His-Mother’S-House | 0.0 | 0.006 | 0.231 | 0.0 | 0.0 | 0.0 | 0.01 | 0.065 | 0.007 | 0.0 | 0.025 |
| 10-A-Butterfly-On-F-Street | 0.0 | 0.0 | 0.171 | 0.0 | 0.0 | 0.0 | 0.0 | 0.128 | 0.043 | 0.0 | 0.037 |


To find out the top 10 words with the highest tf–idf for every story, we're going to make and run the following function: `get_top_tfidf_scores()`

In [74]:
def get_top_tfidf_scores(dataframe, top_n=10):
    pretty_df = dataframe.stack().groupby(level=0).nlargest(top_n).reset_index()
    pretty_df = pretty_df.rename(columns={0:'tfidf_score', 'level_1': 'story', 'level_2': 'word'})
    pretty_df = pretty_df.drop(columns='level_0')
    pretty_df['tfidf_rank'] = pretty_df.groupby('story')['tfidf_score'].rank(method='first', ascending=False)
    return pretty_df

As before, this function will rearrange the dataframe, `.groupby()` short story, and filter for the top 10 highest tf–idf scores in every story. Finally, it will produce a dataframe with a new column `tfidf_rank`, which contains a 1-10 ranking of the highest tf–idf scores.

In [75]:
tfidf_df = tfidf_df.drop('Document Frequency', errors='ignore')

In [76]:
top_tfidf = get_top_tfidf_scores(tfidf_df)
top_tfidf

| | story | word | tfidf_score | tfidf_rank |
|---|---|---|---|---|
| 0 | 01-The-Girl-Who-Raised-Pigeons | betsy | 0.358 | 1 |
| 1 | 01-The-Girl-Who-Raised-Pigeons | jenny | 0.35 | 2 |
| 2 | 01-The-Girl-Who-Raised-Pigeons | ann | 0.31 | 3 |
| 3 | 01-The-Girl-Who-Raised-Pigeons | robert | 0.295 | 4 |
| 4 | 01-The-Girl-Who-Raised-Pigeons | coop | 0.223 | 5 |
| 5 | 01-The-Girl-Who-Raised-Pigeons | pigeons | 0.207 | 6 |
| 6 | 01-The-Girl-Who-Raised-Pigeons | miss | 0.163 | 7 |
| 7 | 01-The-Girl-Who-Raised-Pigeons | birds | 0.147 | 8 |
| 8 | 01-The-Girl-Who-Raised-Pigeons | clara | 0.143 | 9 |
| 9 | 01-The-Girl-Who-Raised-Pigeons | said | 0.133 | 10 |


#### Write to a CSV File

In [None]:
filename = "tfidf_Lost-in-The-City.csv"
top_tfidf.to_csv(filename, encoding='UTF-8', index=False)

## Compare Word Frequency and tf–idf Scores

Now let's compare the raw word frequencies and tf–idf scores for all the stories in *Lost in the City*.

First, we're going to merge the top raw word frequency ranks into our top tf–idf dataframe.

In [77]:
tfidf_compare = top_tfidf.merge(top_word_freq[['word_freq_rank', 'word', 'story']] , on=['story', 'word'], how='left')

Then we're going to add a column that calculates the change in rank, that is, how a word's rank changes when we use tf–idf instead of raw word frequency.

In [78]:
tfidf_compare['changed_rank'] = tfidf_compare['word_freq_rank'] - tfidf_compare['tfidf_rank']
tfidf_compare = tfidf_compare.fillna("*new top word*")
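
To see why some rows end up as `*new top word*`, here's a minimal sketch of the same left merge on two tiny, made-up dataframes. A word that appears in the tf–idf table but not in the word-frequency table gets a missing value (`NaN`) after the merge, and that missing value is what `fillna()` replaces.

```python
import pandas as pd

# Hypothetical ranks for one story, for illustration only
example_tfidf = pd.DataFrame({
    "story": ["story-a", "story-a", "story-a"],
    "word": ["betsy", "clara", "said"],
    "tfidf_rank": [1, 2, 3],
})
example_freq = pd.DataFrame({
    "story": ["story-a", "story-a"],
    "word": ["said", "betsy"],
    "word_freq_rank": [1, 2],
})

# how='left' keeps every row of example_tfidf, matched or not
example_compare = example_tfidf.merge(example_freq, on=["story", "word"], how="left")
example_compare["changed_rank"] = example_compare["word_freq_rank"] - example_compare["tfidf_rank"]
print(example_compare)
```

Here "betsy" moves up one place, "said" moves down two, and "clara" has no word-frequency rank at all, so its change in rank is missing.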

Finally, we're going to make some functions that will alter the style of our pandas dataframe, so that words that move up in tf–idf rank are emphasized in green with a `+` sign and words that move down in tf–idf rank are emphasized in red with a `-` sign.

In [79]:
def make_positive(value):
    if value != '*new top word*':
        if float(value) > 0:
            value = f'+{round(value)}'
    return value

def make_bold(value):
    return 'font-weight: bold'

def color_df(value):
    if value == '*new top word*':
        color = 'green'    
    else:
        value = str(value).replace('+', '')
        value = float(value)
        
        if value < 0:
            color = 'red'
        elif value > 0:
            color = 'green'
        else:
            color = 'black'
    df_style = f'color: {color}; font-weight: bold'
    return df_style

Now let's display the dataframe and explore which words have become more significant and which words have become less so.

In [80]:
tfidf_compare['changed_rank'] = tfidf_compare['changed_rank'].apply(make_positive)
tfidf_compare_styled = tfidf_compare.style.applymap(color_df, subset=['changed_rank']).applymap(make_bold, subset=['tfidf_rank'])
tfidf_compare_styled

| | story | word | tfidf_score | tfidf_rank | word_freq_rank | changed_rank |
|---|---|---|---|---|---|---|
| 0 | 01-The-Girl-Who-Raised-Pigeons | betsy | 0.358451 | 1 | 3 | +2 |
| 1 | 01-The-Girl-Who-Raised-Pigeons | jenny | 0.350486 | 2 | 5 | +3 |
| 2 | 01-The-Girl-Who-Raised-Pigeons | ann | 0.310244 | 3 | 3 | 0 |
| 3 | 01-The-Girl-Who-Raised-Pigeons | robert | 0.294727 | 4 | 6 | +2 |
| 4 | 01-The-Girl-Who-Raised-Pigeons | coop | 0.223036 | 5 | 10 | +5 |
| 5 | 01-The-Girl-Who-Raised-Pigeons | pigeons | 0.206829 | 6 | 8 | +2 |
| 6 | 01-The-Girl-Who-Raised-Pigeons | miss | 0.162691 | 7 | 1 | -6 |
| 7 | 01-The-Girl-Who-Raised-Pigeons | birds | 0.146826 | 8 | 9 | +1 |
| 8 | 01-The-Girl-Who-Raised-Pigeons | clara | 0.14338 | 9 | \*new top word\* | \*new top word\* |
| 9 | 01-The-Girl-Who-Raised-Pigeons | said | 0.132745 | 10 | 1 | -9 |


The word "said," which is one of the most frequent words throughout the collection, gets knocked down in tf-idf importance precisely because it occurs in almost every story.

```{note}
To style your dataframe with color and bolding (as above), add `.style.applymap(color_df, subset=['changed_rank'])` to the end of the code below.
```

In [81]:
tfidf_compare[tfidf_compare['word'] == 'said']

| | story | word | tfidf_score | tfidf_rank | word_freq_rank | changed_rank |
|---|---|---|---|---|---|---|
| 9 | 01-The-Girl-Who-Raised-Pigeons | said | 0.133 | 10 | 1 | -9 |
| 24 | 03-The-Night-Rhonda-Ferguson-Was-Killed | said | 0.212 | 5 | 2 | -3 |
| 34 | 04-Young-Lions | said | 0.186 | 5 | 2 | -3 |
| 41 | 05-The-Store | said | 0.246 | 2 | 1 | -1 |
| 55 | 06-An-Orange-Line-Train-To-Ballston | said | 0.286 | 6 | 1 | -5 |
| 64 | 07-The-Sunday-Following-Mother’S-Day | said | 0.21 | 5 | 1 | -4 |
| 71 | 08-Lost-In-The-City | said | 0.292 | 2 | 1 | -1 |
| 83 | 09-His-Mother’S-House | said | 0.231 | 4 | 1 | -3 |
| 93 | 10-A-Butterfly-On-F-Street | said | 0.171 | 4 | 3 | -1 |
| 103 | 11-Gospel | said | 0.233 | 4 | 1 | -3 |


A word like "pigeons," on the other hand, becomes more significant because it is rarer.

In [82]:
tfidf_compare[tfidf_compare['word'] == 'pigeons']

| | story | word | tfidf_score | tfidf_rank | word_freq_rank | changed_rank |
|---|---|---|---|---|---|---|
| 5 | 01-The-Girl-Who-Raised-Pigeons | pigeons | 0.207 | 6 | 8 | 2 |


Words that were not frequent enough to make the top 10 for raw word frequency, such as "dreaming," "gospelteers," or "dreadlocks," now suddenly show up in the top 10 for tf–idf scores.

In [83]:
tfidf_compare[tfidf_compare['word'] == 'dreaming']

| | story | word | tfidf_score | tfidf_rank | word_freq_rank | changed_rank |
|---|---|---|---|---|---|---|
| 77 | 08-Lost-In-The-City | dreaming | 0.09 | 8 | \*new top word\* | \*new top word\* |


In [84]:
tfidf_compare[tfidf_compare['word'] == 'gospelteers']

| | story | word | tfidf_score | tfidf_rank | word_freq_rank | changed_rank |
|---|---|---|---|---|---|---|
| 106 | 11-Gospel | gospelteers | 0.128 | 7 | \*new top word\* | \*new top word\* |


In [85]:
tfidf_compare[tfidf_compare['word'] == 'dreadlocks']

| | story | word | tfidf_score | tfidf_rank | word_freq_rank | changed_rank |
|---|---|---|---|---|---|---|
| 58 | 06-An-Orange-Line-Train-To-Ballston | dreadlocks | 0.139 | 9 | \*new top word\* | \*new top word\* |


## Your Turn!

Take a few minutes to explore the dataframe below and then answer the following questions.

In [None]:
tfidf_compare

**1.** What is the difference between a tf-idf score and raw word frequency?

**Your answer here**

**2.** Based on the dataframe above, what is one potential problem or limitation that you notice with tf-idf scores?

**Your answer here**

**3.** What's another collection of texts that you think might be interesting to analyze with tf-idf scores?  Why?

**Your answer here**