# TF-IDF

In this lesson, we're going to learn about a text analysis method called *term frequency–inverse document frequency* (tf–idf). This method will help us identify the most unique words in a document from a given corpus. 

### Simple Formula

Calculating the most frequent words in a text can be useful. But often the most *frequent words* in a text aren't the most *interesting* words in a text.

Term frequency–inverse document frequency is a method that tries to help with this problem because it identifies the most frequent *unique* words in a text by comparing that text to other texts.

```{margin}
term = word <br>
document = text (or chunk of a text) <br>
corpus = collection of texts <br>
```

**term_frequency * inverse_document_frequency**

**term_frequency** = number of times a given term appears in document

**inverse_document_frequency** = log(total number of documents / number of documents with term) + 1

We're going to calculate and compare the tf–idf scores for the word *said* and the word *pigeons* in "The Girl Who Raised Pigeons," the first short story in *Lost in the City*.

We need the `log()` function for our calculation, so we're going to import it from Python's built-in `math` module. Called with a single number, `log()` returns the natural logarithm.

In [52]:
from math import log

**"said"**

In [58]:
total_number_of_documents = 14  # total number of short stories in *Lost in the City*
number_of_documents_with_term = 13  # number of short stories that contain the word "said"

In [59]:
term_frequency = 47  # number of times "said" appears in "The Girl Who Raised Pigeons"
# Divide first, then take the log, then add 1, exactly as in the formula above
inverse_document_frequency = log(total_number_of_documents / number_of_documents_with_term) + 1

In [60]:
round(term_frequency * inverse_document_frequency, 2)

50.48

**"pigeons"**

In [69]:
total_number_of_documents = 14  # total number of short stories in *Lost in the City*
number_of_documents_with_term = 2  # number of short stories that contain the word "pigeons"

In [70]:
term_frequency = 30  # number of times "pigeons" appears in "The Girl Who Raised Pigeons"
inverse_document_frequency = log(total_number_of_documents / number_of_documents_with_term) + 1

In [72]:
round(term_frequency * inverse_document_frequency, 2)

88.38

**tf–idf Scores**

"said" = 50.48<br>
"pigeons" = 88.38

Though the word "said" appears 47 times in "The Girl Who Raised Pigeons" and the word "pigeons" only appears 30 times, "pigeons" has a higher tf–idf score than "said" because it's a rarer word. The word "pigeons" appears in 2 of 14 stories, while "said" appears in 13 of 14 stories, almost all of them.
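
The two calculations above follow the same steps, so they can be wrapped in one small function. The sketch below is only an illustration: the function name `simple_tfidf` is our own invention, and it assumes the simple (unsmoothed) formula from this section with the natural logarithm from `math.log`.

```python
from math import log

def simple_tfidf(term_frequency, total_number_of_documents, number_of_documents_with_term):
    """tf-idf with the simple formula above (natural log, no smoothing)."""
    inverse_document_frequency = log(total_number_of_documents / number_of_documents_with_term) + 1
    return term_frequency * inverse_document_frequency

# Counts taken from the text above: 14 stories; "said" is in 13 of them, "pigeons" in 2
said_score = simple_tfidf(47, 14, 13)
pigeons_score = simple_tfidf(30, 14, 2)

print(round(said_score, 2))     # 50.48
print(round(pigeons_score, 2))  # 88.38
```

Writing the formula once makes it harder to mistype the parentheses when we repeat the calculation for another word.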

## tf–idf with scikit-learn

We could continue calculating tf–idf scores in this manner — by doing all the math with Python — but conveniently there's a Python library that can calculate tf–idf scores in just a few lines of code.

This library is called [scikit-learn](https://scikit-learn.org/stable/index.html), imported as `sklearn`. It's a popular Python library for machine learning approaches such as clustering, classification, and regression, among others. Though we're not doing any machine learning in this lesson, we're nevertheless going to use scikit-learn's `TfidfVectorizer` and `CountVectorizer`.

In [45]:
!pip install scikit-learn



#### Import Libraries

In [1]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd
pd.set_option("display.max_rows", 600)
pd.set_option("display.max_columns", 200)
#pd.options.display.float_format = lambda value : '{:.0f}'.format(value) if round(value,0) == value else '{:,.3f}'.format(value)
from pathlib import Path  
import glob

We're also going to import `pandas` and change two of its default display settings: we're going to increase the maximum number of rows and the maximum number of columns that pandas will display. The commented-out line shows an optional third setting that formats numbers in a special way. If it's a decimal number, format to three decimal places; if it's a whole number, round to the whole number.

Finally, we're going to import two libraries that will help us work with files and the file system: [`pathlib`](https://docs.python.org/3/library/pathlib.html#basic-use) and [`glob`](https://docs.python.org/3/library/glob.html). These libraries will help us read in all the short story text files from *Lost in the City*.

#### Set Directory Path

Below we're setting the directory filepath that contains all the short story text files that we want to analyze.

In [2]:
directory_path = "../texts/literature/Lost-in-the-City_Stories/"

Then we're going to use `glob` and `Path` to make a list of all the short story filepaths in that directory and a list of all the short story titles.

In [3]:
text_files = glob.glob(f"{directory_path}/*.txt")

In [4]:
text_files

['../texts/literature/Lost-in-the-City_Stories/01-The-Girl-Who-Raised-Pigeons.txt',
 '../texts/literature/Lost-in-the-City_Stories/02-The-First-Day.txt',
 '../texts/literature/Lost-in-the-City_Stories/03-The-Night-Rhonda-Ferguson-Was-Killed.txt',
 '../texts/literature/Lost-in-the-City_Stories/04-Young-Lions.txt',
 '../texts/literature/Lost-in-the-City_Stories/05-The-Store.txt',
 '../texts/literature/Lost-in-the-City_Stories/06-An-Orange-Line-Train-To-Ballston.txt',
 '../texts/literature/Lost-in-the-City_Stories/07-The-Sunday-Following-Mother’S-Day.txt',
 '../texts/literature/Lost-in-the-City_Stories/08-Lost-In-The-City.txt',
 '../texts/literature/Lost-in-the-City_Stories/09-His-Mother’S-House.txt',
 '../texts/literature/Lost-in-the-City_Stories/10-A-Butterfly-On-F-Street.txt',
 '../texts/literature/Lost-in-the-City_Stories/11-Gospel.txt',
 '../texts/literature/Lost-in-the-City_Stories/12-A-New-Man.txt',
 '../texts/literature/Lost-in-the-City_Stories/13-A-Dark-Night.txt',
 '../texts/lit

In [5]:
text_titles = [Path(text).stem for text in text_files]

In [6]:
text_titles

['01-The-Girl-Who-Raised-Pigeons',
 '02-The-First-Day',
 '03-The-Night-Rhonda-Ferguson-Was-Killed',
 '04-Young-Lions',
 '05-The-Store',
 '06-An-Orange-Line-Train-To-Ballston',
 '07-The-Sunday-Following-Mother’S-Day',
 '08-Lost-In-The-City',
 '09-His-Mother’S-House',
 '10-A-Butterfly-On-F-Street',
 '11-Gospel',
 '12-A-New-Man',
 '13-A-Dark-Night',
 '14-Marie']
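
One thing to keep in mind is that `glob.glob()` returns filepaths in arbitrary order, so the list is not guaranteed to come back sorted by title. The sketch below shows one possible way to get a predictable order with `sorted()` and `Path.glob()`. It builds a throwaway directory so that it can run anywhere; the temporary directory and the `notes.md` file are made up for the example.

```python
import tempfile
from pathlib import Path

# Build a throwaway directory with a few empty files (hypothetical stand-ins for the stories)
with tempfile.TemporaryDirectory() as temp_dir:
    for name in ["02-The-First-Day.txt", "01-The-Girl-Who-Raised-Pigeons.txt", "notes.md"]:
        (Path(temp_dir) / name).touch()

    # Path.glob() keeps only the .txt files; sorted() makes the order predictable
    example_files = sorted(Path(temp_dir).glob("*.txt"))
    # .stem is the filename without its directory or its extension
    example_titles = [path.stem for path in example_files]

print(example_titles)
# ['01-The-Girl-Who-Raised-Pigeons', '02-The-First-Day']
```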

## Calculate tf–idf

To calculate tf–idf scores for every word, we're going to use scikit-learn's [`TfidfVectorizer`](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html).

When you initialize TfidfVectorizer, you can choose to set it with different parameters. These parameters will change the way you calculate tf–idf.

The recommended way to run `TfidfVectorizer`, however, is with smoothing (`smooth_idf = True`) and normalization (`norm='l2'`) turned on. These parameters will better account for differences in story length, and, overall, they'll produce more meaningful tf–idf scores. 

Smoothing and L2 normalization are actually the default settings for `TfidfVectorizer`. To turn them on, you don't need to include any extra code at all.
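
To make these two settings more concrete, here is a rough sketch in plain Python. It assumes the smoothed formula given in the scikit-learn documentation, idf = ln((1 + n) / (1 + df)) + 1, and then rescales a document's scores so that their squares sum to 1 (L2 normalization). The function names are our own, and the toy "document" contains only two terms, so the numbers will not match what `TfidfVectorizer` produces for the full vocabulary.

```python
from math import log, sqrt

def smoothed_idf(total_number_of_documents, number_of_documents_with_term):
    """Smoothed idf: add 1 to both counts, as if one extra document contained every term."""
    return log((1 + total_number_of_documents) / (1 + number_of_documents_with_term)) + 1

def l2_normalize(scores):
    """Rescale a list of scores so that the sum of their squares equals 1."""
    length = sqrt(sum(score ** 2 for score in scores))
    return [score / length for score in scores]

# Toy "document" with only two terms, using the counts from the first story
raw_scores = [
    47 * smoothed_idf(14, 13),  # "said"
    30 * smoothed_idf(14, 2),   # "pigeons"
]
normalized_scores = l2_normalize(raw_scores)

print([round(score, 2) for score in raw_scores])
print([round(score, 2) for score in normalized_scores])
```

Because every document is rescaled to the same length, a long story does not get higher scores just because it contains more words.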

**Initialize TfidfVectorizer with desired parameters (default smoothing and normalization)**

In [7]:
tfidf_vectorizer = TfidfVectorizer(input='filename', stop_words='english')

**Plug in "text_files" which contains all our short stories**


In [36]:
from sklearn.feature_extraction.text import CountVectorizer

# Instantiate CountVectorizer, reading each document from its filepath and dropping English stopwords
cv = CountVectorizer(input='filename', stop_words='english')

# This step returns the word counts for the words in every document
word_count_vector = cv.fit_transform(text_files)

# Check the shape: (number of documents, number of unique words)
word_count_vector.shape

(14, 5769)
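
To see what a word-count matrix like this one holds, the sketch below builds a tiny version by hand in plain Python. It is only an approximation of `CountVectorizer`: the two example sentences and the three-word stopword list are invented, and the regular expression is scikit-learn's default token pattern, which keeps words of two or more characters.

```python
import re
from collections import Counter

documents = {
    "story_a": "The girl raised pigeons. The pigeons slept.",
    "story_b": "The girl said nothing.",
}
stop_words = {"the", "a", "of"}  # invented mini stopword list

def count_words(text):
    """Lowercase, split into tokens of 2+ characters, and drop stopwords."""
    tokens = re.findall(r"(?u)\b\w\w+\b", text.lower())
    return Counter(token for token in tokens if token not in stop_words)

counts = {title: count_words(text) for title, text in documents.items()}

# The sorted vocabulary gives the matrix columns; each document is one row of counts
vocabulary = sorted(set().union(*counts.values()))
matrix = [[counts[title][word] for word in vocabulary] for title in documents]

print(vocabulary)  # ['girl', 'nothing', 'pigeons', 'raised', 'said', 'slept']
print(matrix)      # [[1, 0, 2, 1, 0, 1], [1, 1, 0, 0, 1, 0]]
```

With 14 stories and 5,769 unique words, the real matrix has the same layout, just with 14 rows and 5,769 columns.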

In [48]:
# get_feature_names is a method, so it has to be called with parentheses;
# in scikit-learn 1.0 and later it is named get_feature_names_out()
len(cv.get_feature_names_out())

5769

In [44]:
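# Convert the sparse matrix with .toarray(); the stories label the rows and the words label the columns
word_count_df = pd.DataFrame(word_count_vector.toarray(), index=text_titles, columns=cv.get_feature_names_out())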

In [39]:
cv.vocabulary_

{'girl': 2117,
 'raised': 3988,
 'pigeons': 3681,
 'father': 1817,
 'say': 4319,
 'years': 5750,
 'later': 2811,
 'dreamed': 1523,
 'gone': 2150,
 'kitchen': 2743,
 'window': 5644,
 'morning': 3229,
 'visit': 5472,
 'birds': 481,
 'time': 5167,
 'life': 2884,
 'notions': 3395,
 'set': 4416,
 'concrete': 1076,
 'having': 2323,
 'believed': 442,
 'slept': 4595,
 'lightly': 2894,
 'want': 5521,
 'think': 5120,
 'walk': 5504,
 'hour': 2460,
 'night': 3355,
 'waking': 5503,
 'asking': 260,
 'dark': 1294,
 'matter': 3098,
 'visits': 5476,
 'dreams': 1526,
 'remained': 4092,
 'forever': 1984,
 'vivid': 5479,
 'memory': 3137,
 'way': 5562,
 'iridescent': 2625,
 'necklaces': 3319,
 'flirted': 1934,
 'light': 2889,
 'begin': 435,
 'compulsion': 1065,
 'sleeping': 4592,
 'mind': 3175,
 'simple': 4535,
 'need': 3321,
 'pee': 3609,
 'drink': 1536,
 'water': 5551,
 'went': 5586,
 'barefoot': 375,
 'room': 4230,
 'past': 3567,
 'conversing': 1128,
 'sleep': 4590,
 'roof': 4229,
 'steps': 4835,
 'coop

In [None]:
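from sklearn.feature_extraction.text import TfidfTransformer

# TfidfTransformer turns a matrix of word counts into tf–idf scores
tfidf_transformer = TfidfTransformer(smooth_idf=True, use_idf=True)
tfidf_transformer.fit(word_count_vector)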


In [11]:
tfidf_vector = tfidf_vectorizer.fit_transform(text_files)

**Make a DataFrame out of the tf–idf vector and sort by title**

In [12]:
# In scikit-learn 1.0 and later, the vocabulary is returned by get_feature_names_out()
tfidf_df = pd.DataFrame(tfidf_vector.toarray(), index=text_titles, columns=tfidf_vectorizer.get_feature_names_out())
tfidf_df = tfidf_df.sort_index()

**Add a row for the number of documents in which each word appears**

In [13]:
tfidf_df.loc['Document Frequency'] = (tfidf_df > 0).sum()
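
The expression `(tfidf_df > 0).sum()` counts, for every word, how many documents have a tf–idf score above zero. Here is a plain-Python sketch of the same idea. The scores and story names are invented toy data, not values from the corpus.

```python
# Toy tf-idf scores (invented): one dictionary of word -> score per document
toy_scores = {
    "story_a": {"pigeons": 0.21, "said": 0.13, "school": 0.04},
    "story_b": {"pigeons": 0.0, "said": 0.0, "school": 0.13},
    "story_c": {"pigeons": 0.0, "said": 0.21, "school": 0.02},
}

# A word "appears" in a document when its score there is greater than zero
document_frequency = {}
for scores in toy_scores.values():
    for word, score in scores.items():
        # (score > 0) is True or False, which Python adds as 1 or 0
        document_frequency[word] = document_frequency.get(word, 0) + (score > 0)

print(document_frequency)  # {'pigeons': 1, 'said': 2, 'school': 3}
```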

In [14]:
tfidf_slice = tfidf_df[['pigeons', 'school', 'said', 'church', 'gospelteers', 'thunder','girl', 'street', 'father', 'dreaming', 'car']]
tfidf_slice

| | pigeons | school | said | church | gospelteers | thunder | girl | street | father | dreaming | car |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 01-The-Girl-Who-Raised-Pigeons | 0.206829 | 0.035548 | 0.132745 | 0.01114 | 0.0 | 0.0 | 0.062136 | 0.104501 | 0.042365 | 0.0 | 0.0 |
| 02-The-First-Day | 0.0 | 0.134055 | 0.0 | 0.030807 | 0.0 | 0.0 | 0.093728 | 0.070296 | 0.011716 | 0.0 | 0.0 |
| 03-The-Night-Rhonda-Ferguson-Was-Killed | 0.0 | 0.019663 | 0.211948 | 0.00251 | 0.0 | 0.0 | 0.03246 | 0.082106 | 0.061102 | 0.0 | 0.091761 |
| 04-Young-Lions | 0.0 | 0.014985 | 0.185973 | 0.0 | 0.0 | 0.0 | 0.005239 | 0.065483 | 0.073342 | 0.0 | 0.011988 |
| 05-The-Store | 0.0 | 0.017826 | 0.246152 | 0.01229 | 0.0 | 0.0 | 0.065433 | 0.093475 | 0.099707 | 0.0 | 0.032086 |
| 06-An-Orange-Line-Train-To-Ballston | 0.0 | 0.035788 | 0.28597 | 0.0 | 0.0 | 0.0 | 0.022341 | 0.035746 | 0.022341 | 0.0 | 0.02045 |
| 07-The-Sunday-Following-Mother’S-Day | 0.0 | 0.002925 | 0.209612 | 0.010083 | 0.0 | 0.0 | 0.028119 | 0.023006 | 0.058794 | 0.0 | 0.073121 |
| 08-Lost-In-The-City | 0.0 | 0.007264 | 0.292022 | 0.025039 | 0.0 | 0.0 | 0.019045 | 0.050787 | 0.050787 | 0.089521 | 0.0 |
| 09-His-Mother’S-House | 0.0 | 0.005509 | 0.231101 | 0.0 | 0.0 | 0.0 | 0.009629 | 0.064997 | 0.007222 | 0.0 | 0.02479 |
| 10-A-Butterfly-On-F-Street | 0.0 | 0.0 | 0.170616 | 0.0 | 0.0 | 0.0 | 0.0 | 0.127962 | 0.042654 | 0.0 | 0.036604 |


To find out the top 10 words with the highest tf–idf for every story, we're going to make and run the following function: `get_top_tfidf_scores()`

In [20]:
tfidf_slice.stack()

01-The-Girl-Who-Raised-Pigeons           pigeons         0.206829
                                         school          0.035548
                                         said            0.132745
                                         church          0.011140
                                         gospelteers     0.000000
                                         thunder         0.000000
                                         girl            0.062136
                                         street          0.104501
                                         father          0.042365
                                         dreaming        0.000000
                                         car             0.000000
02-The-First-Day                         pigeons         0.000000
                                         school          0.134055
                                         said            0.000000
                                         church          0.030807
          

In [27]:
tfidf_df.stack().groupby(level=0).nlargest(10)

01-The-Girl-Who-Raised-Pigeons           01-The-Girl-Who-Raised-Pigeons           betsy            0.358451
                                                                                  jenny            0.350486
                                                                                  ann              0.310244
                                                                                  robert           0.294727
                                                                                  coop             0.223036
                                                                                  pigeons          0.206829
                                                                                  miss             0.162691
                                                                                  birds            0.146826
                                                                                  clara            0.143380
                            

In [127]:
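def get_top_tfidf_scores(series, top_n=10):
    # Reshape to one row per (story, word) pair, then keep the top_n scores within each story
    pretty_df = series.stack().groupby(level=0).nlargest(top_n).reset_index()
    # Give the automatically named columns readable names
    pretty_df = pretty_df.rename(columns={0:'tfidf_score', 'level_1': 'story', 'level_2': 'word'})
    # The groupby repeats the story title in an extra column, so drop the duplicate
    pretty_df = pretty_df.drop(columns='level_0')
    # Rank the scores within each story, with 1 as the highest
    pretty_df['tfidf_rank'] = pretty_df.groupby('story')['tfidf_score'].rank(method='first', ascending=False)
    return pretty_df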

As in the cells above, this function will rearrange the dataframe with `.stack()`, group it by short story with `.groupby()`, and keep the 10 highest tf–idf scores in every story. Finally, it will produce a dataframe with a new column, `tfidf_rank`, which ranks those scores from 1 to 10.
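
If the chain of pandas methods feels opaque, the sketch below walks through the same logic in plain Python: group the scores by story, keep the highest ones, and number them. It is not the pandas function itself. The name `top_scores`, the toy scores, and the `top_n=2` default are all invented for the example.

```python
from heapq import nlargest

# Toy tf-idf scores (invented), one dictionary of word -> score per story
toy_scores = {
    "story_a": {"betsy": 0.36, "pigeons": 0.21, "said": 0.13, "school": 0.04},
    "story_b": {"school": 0.13, "girl": 0.09, "street": 0.07},
}

def top_scores(scores_by_story, top_n=2):
    """Return (story, word, score, rank) rows for the top_n words of each story."""
    rows = []
    for story, scores in scores_by_story.items():
        # Keep the top_n (word, score) pairs, highest score first
        best = nlargest(top_n, scores.items(), key=lambda item: item[1])
        # Number them starting from 1, like the tfidf_rank column
        for rank, (word, score) in enumerate(best, start=1):
            rows.append((story, word, score, rank))
    return rows

top_rows = top_scores(toy_scores)
for row in top_rows:
    print(row)
```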

In [128]:
tfidf_df = tfidf_df.drop('Document Frequency', errors='ignore')

In [129]:
top_tfidf = get_top_tfidf_scores(tfidf_df)
top_tfidf

| | story | word | tfidf_score | tfidf_rank |
|---|---|---|---|---|
| 0 | 01-The-Girl-Who-Raised-Pigeons | betsy | 0.358 | 1 |
| 1 | 01-The-Girl-Who-Raised-Pigeons | jenny | 0.35 | 2 |
| 2 | 01-The-Girl-Who-Raised-Pigeons | ann | 0.31 | 3 |
| 3 | 01-The-Girl-Who-Raised-Pigeons | robert | 0.295 | 4 |
| 4 | 01-The-Girl-Who-Raised-Pigeons | coop | 0.223 | 5 |
| 5 | 01-The-Girl-Who-Raised-Pigeons | pigeons | 0.207 | 6 |
| 6 | 01-The-Girl-Who-Raised-Pigeons | miss | 0.163 | 7 |
| 7 | 01-The-Girl-Who-Raised-Pigeons | birds | 0.147 | 8 |
| 8 | 01-The-Girl-Who-Raised-Pigeons | clara | 0.143 | 9 |
| 9 | 01-The-Girl-Who-Raised-Pigeons | said | 0.133 | 10 |


#### Write to a CSV File

In [None]:
filename = "tfidf_Lost-in-The-City.csv"
top_tfidf.to_csv(filename, encoding='UTF-8', index=False)

**Repeat the same steps with a second corpus, U.S. Inaugural Addresses**

In [155]:
directory_path = "../texts/history/US_Inaugural_Addresses/"

In [156]:
text_files = glob.glob(f"{directory_path}/*.txt")

In [157]:
text_titles = [Path(text).stem for text in text_files]

In [158]:
tfidf_vector = tfidf_vectorizer.fit_transform(text_files)

**Make a DataFrame out of the tf–idf vector and sort by title**

In [159]:
# In scikit-learn 1.0 and later, the vocabulary is returned by get_feature_names_out()
tfidf_df = pd.DataFrame(tfidf_vector.toarray(), index=text_titles, columns=tfidf_vectorizer.get_feature_names_out())
tfidf_df = tfidf_df.sort_index()

**Add a row for the number of documents in which each word appears**

In [160]:
tfidf_df.loc['Document Frequency'] = (tfidf_df > 0).sum()

To find the top 10 words with the highest tf–idf score in every address, we're going to reuse the `get_top_tfidf_scores()` function that we defined above. Because the function was written for the short stories, the column that holds each address's title is still named `story`.

In [163]:
tfidf_df = tfidf_df.drop('Document Frequency', errors='ignore')

In [166]:
top_tfidf = get_top_tfidf_scores(tfidf_df)
top_tfidf

| | story | word | tfidf_score | tfidf_rank |
|---|---|---|---|---|
| 0 | 01_washington_1789 | government | 0.114 | 1 |
| 1 | 01_washington_1789 | immutable | 0.104 | 2 |
| 2 | 01_washington_1789 | impressions | 0.104 | 3 |
| 3 | 01_washington_1789 | providential | 0.104 | 4 |
| 4 | 01_washington_1789 | ought | 0.104 | 5 |
| 5 | 01_washington_1789 | public | 0.103 | 6 |
| 6 | 01_washington_1789 | present | 0.098 | 7 |
| 7 | 01_washington_1789 | qualifications | 0.096 | 8 |
| 8 | 01_washington_1789 | peculiarly | 0.091 | 9 |
| 9 | 01_washington_1789 | article | 0.086 | 10 |


## Your Turn!

Take a few minutes to explore the dataframe below and then answer the following questions.

In [None]:
top_tfidf

**1.** What is the difference between a tf–idf score and raw word frequency?

**Your answer here**

**2.** Based on the dataframe above, what is one potential problem or limitation that you notice with tf–idf scores?

**Your answer here**

**3.** What's another collection of texts that you think might be interesting to analyze with tf–idf scores? Why?

**Your answer here**