# TF-IDF with Scikit-Learn

## Breaking Down the TF-IDF Formula

But first, let's quickly discuss the tf-idf formula. The idea is pretty simple.

**tf-idf = term_frequency * inverse_document_frequency**

**term_frequency** = number of times a given term appears in document

**inverse_document_frequency** = log(total number of documents / number of documents with term) + 1**\***

You take the number of times a term occurs in a document (term frequency). Then you take the number of documents in which the same term occurs at least once, divided by the total number of documents (document frequency), and you flip that fraction on its head, take its logarithm, and add 1 (inverse document frequency). Then you multiply the two numbers together (term_frequency * inverse_document_frequency).

The reason we take the *inverse*, or flipped fraction, of document frequency is to boost the rarer words that occur in relatively few documents. Think about the inverse document frequency for the word "said" vs the word "pigeons." The term "said" appears in 13 (document frequency) of the 14 (total documents) *Lost in the City* stories (14 / 13 --> a smaller inverse document frequency), while the term "pigeons" occurs in only 2 (document frequency) of the 14 stories (total documents) (14 / 2 --> a bigger inverse document frequency, a bigger tf-idf boost).
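To make the arithmetic concrete, here's a small sketch that plugs the two document frequencies from this example into the basic formula above. The function name `basic_idf` is ours, made up for illustration, and we're assuming the natural logarithm.

```python
import math

def basic_idf(total_documents, documents_with_term):
    """Unsmoothed idf: log(total / documents_with_term) + 1 (natural log)."""
    return math.log(total_documents / documents_with_term) + 1

idf_said = basic_idf(14, 13)     # "said" appears in 13 of the 14 stories
idf_pigeons = basic_idf(14, 2)   # "pigeons" appears in 2 of the 14 stories

print(f"said: {idf_said:.2f}, pigeons: {idf_pigeons:.2f}")
```

Under these assumptions "said" gets a weight of about 1.07 and "pigeons" a weight of about 2.95, so each occurrence of the rarer word counts for more.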

\*There are several slightly different ways to calculate inverse document frequency. The version of idf that we're going to use is the [scikit-learn default](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html#sklearn.feature_extraction.text.TfidfTransformer), which uses "smoothing": it adds 1 to both the numerator and the denominator:

**inverse_document_frequency**  = log((1 + total_number_of_documents) / (number_of_documents_with_term +1)) + 1

<div class="margin sidebar" style=" padding: 10px">

> If smooth_idf=True (the default), the constant “1” is added to the numerator and denominator of the idf as if an extra document was seen containing every term in the collection exactly once, which prevents zero divisions: idf(t) = log [ (1 + n) / (1 + df(t)) ] + 1.  
> -[scikit-learn documentation](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html#sklearn.feature_extraction.text.TfidfTransformer)

</div>
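As a quick check on the smoothed formula, the sketch below computes it for two edge cases: a term that appears in every document and a term that appears in none. The function name `smoothed_idf` and the 14-document total are ours, chosen only for illustration.

```python
import math

def smoothed_idf(total_documents, documents_with_term):
    """Smoothed idf: log((1 + n) / (1 + df)) + 1 (natural log)."""
    return math.log((1 + total_documents) / (1 + documents_with_term)) + 1

# A term that occurs in every document gets the minimum weight of exactly 1
idf_everywhere = smoothed_idf(14, 14)

# A term with a document frequency of 0 no longer causes a division by zero
idf_unseen = smoothed_idf(14, 0)

print(idf_everywhere, round(idf_unseen, 2))
```

Without the added 1s, the second call would divide by zero.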

## TF-IDF with scikit-learn

[scikit-learn](https://scikit-learn.org/stable/index.html), imported as `sklearn`, is a popular Python library for machine learning approaches such as clustering, classification, and regression. Though we're not doing any machine learning in this lesson, we're nevertheless going to use scikit-learn's `TfidfVectorizer` and `CountVectorizer`.

Import necessary modules and libraries

In [2]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd
from pathlib import Path  
import glob

<div class="admonition pandasreview" name="html-admonition" style="background: black; color: white; padding: 10px">
<p class="title">Pandas</p>
 Do you need a refresher or introduction to the Python data analysis library Pandas? Be sure to check out <a href="https://melaniewalsh.github.io/Intro-Cultural-Analytics/Data-Analysis/Pandas-Basics-Part1.html"> Pandas Basics (1-3) </a> in this textbook!
    
</div>

The cell above imports `pandas` along with two libraries that will help us work with files and the file system: [`pathlib`](https://docs.python.org/3/library/pathlib.html#basic-use) and [`glob`](https://docs.python.org/3/library/glob.html).

#### Set Directory Path

Below we're setting the directory filepath that contains all the text files that we want to analyze.

In [3]:
directory_path = "assets/Raw Corpora/F/"

Then we're going to use `glob` and `Path` to make a list of all the filepaths in that directory and a list of all the text titles.

In [4]:
text_files = glob.glob(f"{directory_path}/*.txt")

In [5]:
text_files

['assets/Raw Corpora/F/1872_de-la-ramee-a-dog-of-flanders.txt',
 'assets/Raw Corpora/F/1857_browne-grannys-wonderful-chair.txt',
 'assets/Raw Corpora/F/1876_ewing-six-to-sixteen-a-story-for-girls.txt',
 'assets/Raw Corpora/F/1887_molesworth-four-winds-farm.txt',
 'assets/Raw Corpora/F/1877_molesworth-the-cuckoo-clock.txt',
 'assets/Raw Corpora/F/1869_ewing-the-land-of-lost-toys.txt',
 'assets/Raw Corpora/F/1877_sewell-black-beauty.txt',
 'assets/Raw Corpora/F/1857_tucker-the-rambles-of-a-rat.txt',
 'assets/Raw Corpora/F/1841_martineau-the-settlers-at-home.txt',
 'assets/Raw Corpora/F/1899_nesbit-the-story-of-the-treasure-seekers.txt',
 'assets/Raw Corpora/F/1869_ewing-mrs-overtheways-remembrances.txt',
 'assets/Raw Corpora/F/1877_ewing-a-great-emergency-and-other-tales.txt',
 'assets/Raw Corpora/F/1882_ewing-brothers-of-pity-and-other-tales-of-beasts-and-men.txt',
 'assets/Raw Corpora/F/1862_ewing-melchiors-dream-and-other-tales.txt',
 'assets/Raw Corpora/F/1876_ewing-jan-of-the-windmi

In [6]:
text_titles = [Path(text).stem for text in text_files]
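If you'd like to see how `glob` and `Path(...).stem` behave without touching the real corpus, here's a minimal sketch that runs the same two steps on a throwaway directory. The temporary directory and the `example_` variable names are ours; the file names are borrowed from the listing above, and the files themselves are empty.

```python
import glob
import tempfile
from pathlib import Path

# Build a throwaway directory with two empty .txt files and one file to ignore
with tempfile.TemporaryDirectory() as example_dir:
    for name in ["1877_sewell-black-beauty.txt",
                 "1877_molesworth-the-cuckoo-clock.txt",
                 "notes.md"]:
        (Path(example_dir) / name).touch()

    # glob returns matches in arbitrary order, so we sort for a stable listing
    example_files = sorted(glob.glob(f"{example_dir}/*.txt"))

    # Path.stem drops the directory and the final file extension
    example_titles = [Path(file).stem for file in example_files]

print(example_titles)
```

Sorting is optional, but it keeps the order of titles the same from one run to the next.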

In [7]:
text_titles

['1872_de-la-ramee-a-dog-of-flanders',
 '1857_browne-grannys-wonderful-chair',
 '1876_ewing-six-to-sixteen-a-story-for-girls',
 '1887_molesworth-four-winds-farm',
 '1877_molesworth-the-cuckoo-clock',
 '1869_ewing-the-land-of-lost-toys',
 '1877_sewell-black-beauty',
 '1857_tucker-the-rambles-of-a-rat',
 '1841_martineau-the-settlers-at-home',
 '1899_nesbit-the-story-of-the-treasure-seekers',
 '1869_ewing-mrs-overtheways-remembrances',
 '1877_ewing-a-great-emergency-and-other-tales',
 '1882_ewing-brothers-of-pity-and-other-tales-of-beasts-and-men',
 '1862_ewing-melchiors-dream-and-other-tales',
 '1876_ewing-jan-of-the-windmill',
 '1875_craik-the-little-lame-prince-and-his-traveling-cloack',
 '1873_ewing-a-flat-iron-for-a-farthing',
 '1870_ewing-the-brownies-and-other-tales',
 '1886_hodgson-burnett-little-lord-fauntleroy',
 '1879_ewing-jackanapes-daddy-darwins-dovecot-and-other-stories',
 '1888_ewing-snap-dragons-old-father-christmas',
 '1839_sinclair-holiday-house-a-series-of-tales',
 '18

## Calculate tf–idf

To calculate tf–idf scores for every word, we're going to use scikit-learn's [`TfidfVectorizer`](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html).

When you initialize TfidfVectorizer, you can choose to set it with different parameters. These parameters will change the way you calculate tf–idf.

The recommended way to run `TfidfVectorizer` is with smoothing (`smooth_idf = True`) and normalization (`norm='l2'`) turned on. These parameters will better account for differences in text length, and overall produce more meaningful tf–idf scores. Smoothing and L2 normalization are actually the default settings for `TfidfVectorizer`, so to turn them on, you don't need to include any extra code at all.
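To see what smoothing and L2 normalization actually do, here's a rough sketch that rebuilds the default scores by hand on three made-up sentences and compares them with what `TfidfVectorizer` returns. The toy sentences and variable names are ours, and the sketch assumes the default settings (`smooth_idf=True`, `norm='l2'`, `sublinear_tf=False`).

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

toy_docs = ["the cat sat", "the cat sat on the mat", "pigeons flew over the city"]

# Raw term counts: one row per document, one column per word
counts = CountVectorizer().fit_transform(toy_docs).toarray()

# Smoothed idf: log((1 + n) / (1 + df)) + 1
n_docs = counts.shape[0]
document_frequency = (counts > 0).sum(axis=0)
idf = np.log((1 + n_docs) / (1 + document_frequency)) + 1

# Multiply tf by idf, then divide each row by its Euclidean (L2) length
raw_tfidf = counts * idf
manual_tfidf = raw_tfidf / np.linalg.norm(raw_tfidf, axis=1, keepdims=True)

sklearn_tfidf = TfidfVectorizer().fit_transform(toy_docs).toarray()
matches = np.allclose(manual_tfidf, sklearn_tfidf)
print(matches)
```

Because every row is scaled to length 1, a long text and a short text end up on a comparable scale.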

Initialize TfidfVectorizer with desired parameters (default smoothing and normalization)

In [11]:
tfidf_vectorizer = TfidfVectorizer(input='filename', stop_words='english')

Run TfidfVectorizer on our `text_files`

In [12]:
tfidf_vector = tfidf_vectorizer.fit_transform(text_files)

Make a DataFrame out of the resulting tf–idf vector, setting the "feature names" or words as columns and the titles as rows

In [40]:
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
tfidf_df = pd.DataFrame(tfidf_vector.toarray(), index=text_titles, columns=tfidf_vectorizer.get_feature_names_out())



Add a row for document frequency, i.e. the number of documents in which each word appears

In [14]:
tfidf_df.loc['00_Document Frequency'] = (tfidf_df > 0).sum()
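The `(tfidf_df > 0).sum()` trick is easier to follow on a tiny table. The sketch below uses made-up scores and our own `toy_` names.

```python
import pandas as pd

# Hypothetical tf-idf scores: rows are documents, columns are terms
toy_scores = pd.DataFrame(
    {"girl": [0.04, 0.0, 0.01], "aunt": [0.0, 0.0, 0.03], "bride": [0.0, 0.0, 0.0]},
    index=["doc-a", "doc-b", "doc-c"],
)

# (toy_scores > 0) is a table of True/False values; summing each column counts the Trues
toy_document_frequency = (toy_scores > 0).sum()
print(toy_document_frequency)
```

Here "girl" appears in 2 documents, "aunt" in 1, and "bride" in none.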

In [44]:
tfidf_slice = tfidf_df[['heroine','dress','actress','women','aunt','aunts','bride','daughter','daughters','female','girl','girls']]
tfidf_slice.sort_index().round(decimals=2)

| | heroine | dress | actress | women | aunt | aunts | bride | daughter | daughters | female | girl | girls |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1839_sinclair-holiday-house-a-series-of-tales | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1841_martineau-the-settlers-at-home | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.01 | 0.0 |
| 1857_browne-grannys-wonderful-chair | 0.0 | 0.01 | 0.0 | 0.0 | 0.01 | 0.0 | 0.01 | 0.05 | 0.02 | 0.0 | 0.04 | 0.0 |
| 1857_tucker-the-rambles-of-a-rat | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1862_ewing-melchiors-dream-and-other-tales | 0.01 | 0.01 | 0.0 | 0.01 | 0.0 | 0.0 | 0.0 | 0.01 | 0.0 | 0.0 | 0.01 | 0.01 |
| 1869_ewing-mrs-overtheways-remembrances | 0.0 | 0.03 | 0.0 | 0.01 | 0.04 | 0.0 | 0.0 | 0.01 | 0.0 | 0.0 | 0.01 | 0.02 |
| 1869_ewing-the-land-of-lost-toys | 0.0 | 0.01 | 0.0 | 0.01 | 0.03 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.01 | 0.0 |
| 1870_ewing-the-brownies-and-other-tales | 0.0 | 0.01 | 0.0 | 0.01 | 0.02 | 0.0 | 0.01 | 0.01 | 0.0 | 0.0 | 0.01 | 0.0 |
| 1872_craik-the-adventure-of-a-brownie | 0.0 | 0.0 | 0.0 | 0.01 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.03 | 0.02 |
| 1872_de-la-ramee-a-dog-of-flanders | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |


Let's drop the "00_Document Frequency" row since we were just using it for illustration purposes.

In [45]:
tfidf_df = tfidf_df.drop('00_Document Frequency', errors='ignore')

Let's reorganize the DataFrame so that the words are in rows rather than columns.

In [46]:
tfidf_df.stack().reset_index()

| | level_0 | level_1 | 0 |
|---|---|---|---|
| 0 | 1872_de-la-ramee-a-dog-of-flanders | 000 | 0.0 |
| 1 | 1872_de-la-ramee-a-dog-of-flanders | 01 | 0.0 |
| 2 | 1872_de-la-ramee-a-dog-of-flanders | 02 | 0.0 |
| 3 | 1872_de-la-ramee-a-dog-of-flanders | 03 | 0.0 |
| 4 | 1872_de-la-ramee-a-dog-of-flanders | 04 | 0.0 |
| ... | ... | ... | ... |
| 531088 | 1872_craik-the-adventure-of-a-brownie | zuch | 0.0 |
| 531089 | 1872_craik-the-adventure-of-a-brownie | zz | 0.0 |
| 531090 | 1872_craik-the-adventure-of-a-brownie | æolian | 0.0 |
| 531091 | 1872_craik-the-adventure-of-a-brownie | ærial | 0.0 |


In [47]:
tfidf_df = tfidf_df.stack().reset_index()

In [48]:
tfidf_df = tfidf_df.rename(columns={0: 'tfidf', 'level_0': 'document', 'level_1': 'term'})
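If the jump from a wide table to a long one feels abrupt, this small sketch walks through the same `stack()`, `reset_index()`, and `rename()` steps on a two-by-two table. The scores and `toy_` names are made up for illustration.

```python
import pandas as pd

toy_wide = pd.DataFrame(
    {"girl": [0.04, 0.0], "aunt": [0.0, 0.03]},
    index=["doc-a", "doc-b"],
)

# stack() moves the column labels into the index, giving one row per (document, term) pair;
# reset_index() then turns both index levels into ordinary columns
toy_long = toy_wide.stack().reset_index()
toy_long = toy_long.rename(columns={"level_0": "document", "level_1": "term", 0: "tfidf"})
print(toy_long)
```

Two documents times two terms gives four rows, each holding a single score.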

To find the 10 words with the highest tf–idf in every document, we're going to sort by document and tf–idf score, then group by document and take the first 10 rows of each group.

In [49]:
tfidf_df.sort_values(by=['document','tfidf'], ascending=[True,False]).groupby(['document']).head(10)

| | document | term | tfidf |
|---|---|---|---|
| 496826 | 1839_sinclair-holiday-house-a-series-of-tales | laura | 0.604090 |
| 494727 | 1839_sinclair-holiday-house-a-series-of-tales | harry | 0.504806 |
| 490024 | 1839_sinclair-holiday-house-a-series-of-tales | crabtree | 0.269727 |
| 493555 | 1839_sinclair-holiday-house-a-series-of-tales | frank | 0.202770 |
| 494245 | 1839_sinclair-holiday-house-a-series-of-tales | graham | 0.153771 |
| ... | ... | ... | ... |
| 213875 | 1899_nesbit-the-story-of-the-treasure-seekers | dicky | 0.268199 |
| 209016 | 1899_nesbit-the-story-of-the-treasure-seekers | albert | 0.167532 |
| 219941 | 1899_nesbit-the-story-of-the-treasure-seekers | like | 0.114178 |
| 229175 | 1899_nesbit-the-story-of-the-treasure-seekers | uncle | 0.091717 |


In [50]:
top_tfidf = tfidf_df.sort_values(by=['document','tfidf'], ascending=[True,False]).groupby(['document']).head(10)
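The sort-then-group pattern is worth trying on a small table first. In this sketch we keep the top 2 terms per document rather than the top 10; the scores and `toy_` names are made up.

```python
import pandas as pd

toy_long = pd.DataFrame({
    "document": ["doc-a", "doc-a", "doc-a", "doc-b", "doc-b", "doc-b"],
    "term": ["girl", "aunt", "dress", "girl", "aunt", "dress"],
    "tfidf": [0.04, 0.10, 0.01, 0.20, 0.00, 0.05],
})

# Sort documents alphabetically and, within each document, scores from high to low
toy_sorted = toy_long.sort_values(by=["document", "tfidf"], ascending=[True, False])

# head(2) after groupby keeps the first 2 rows of each document, preserving the sort order
toy_top = toy_sorted.groupby("document").head(2)
print(toy_top)
```

The sort has to come first: `head()` only takes the first rows of each group, so it's the sort order that makes those rows the highest-scoring ones.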

We can zoom in on particular words and particular documents.

In [51]:
top_tfidf[top_tfidf['term'].str.contains('uncle')]

| | document | term | tfidf |
|---|---|---|---|
| 229175 | 1899_nesbit-the-story-of-the-treasure-seekers | uncle | 0.091717 |
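One thing to keep in mind is that `str.contains` matches substrings, so a search for "uncle" would also return "uncles". The sketch below, on a made-up table, contrasts it with an exact match.

```python
import pandas as pd

toy_top = pd.DataFrame({
    "document": ["doc-a", "doc-a", "doc-b"],
    "term": ["uncle", "uncles", "aunt"],
    "tfidf": [0.09, 0.02, 0.05],
})

# str.contains matches substrings (and treats the pattern as a regular expression by default)
substring_matches = toy_top[toy_top["term"].str.contains("uncle")]

# == keeps only rows whose term is exactly "uncle"
exact_matches = toy_top[toy_top["term"] == "uncle"]

print(len(substring_matches), len(exact_matches))
```

Substring matching is handy for document names, where you usually know only part of the title; for terms, an exact match is often what you want.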


It turns out that the term "women" is very distinctive in Obama's Inaugural Address.

In [52]:
top_tfidf[top_tfidf['document'].str.contains('obama')]

| | document | term | tfidf |
|---|---|---|---|


In [20]:
top_tfidf[top_tfidf['document'].str.contains('trump')]

| | document | term | tfidf |
|---|---|---|---|
| 504405 | 58_trump_2017 | america | 0.350162 |
| 506586 | 58_trump_2017 | dreams | 0.156436 |
| 504406 | 58_trump_2017 | american | 0.149226 |
| 508577 | 58_trump_2017 | jobs | 0.142766 |
| 510263 | 58_trump_2017 | protected | 0.132439 |
| 509410 | 58_trump_2017 | obama | 0.120288 |
| 509767 | 58_trump_2017 | people | 0.11237 |
| 512002 | 58_trump_2017 | thank | 0.109171 |
| 504990 | 58_trump_2017 | borders | 0.107075 |
| 512597 | 58_trump_2017 | ve | 0.107075 |


In [21]:
top_tfidf[top_tfidf['document'].str.contains('kennedy')]

| | document | term | tfidf |
|---|---|---|---|
| 391774 | 44_kennedy_1961 | let | 0.267869 |
| 394306 | 44_kennedy_1961 | sides | 0.262849 |
| 392921 | 44_kennedy_1961 | pledge | 0.16096 |
| 387632 | 44_kennedy_1961 | ask | 0.107713 |
| 387864 | 44_kennedy_1961 | begin | 0.106495 |
| 388991 | 44_kennedy_1961 | dare | 0.106495 |
| 395895 | 44_kennedy_1961 | world | 0.10311 |
| 390313 | 44_kennedy_1961 | final | 0.102311 |
| 392370 | 44_kennedy_1961 | new | 0.0966 |
| 390120 | 44_kennedy_1961 | explore | 0.094223 |


## Visualize TF-IDF

We can also visualize our TF-IDF results with the data visualization library Altair.

In [None]:
!pip install altair

Let's make a heatmap that shows the highest TF-IDF scoring words for each president, and let's put a red dot next to two terms of interest: "war" and "peace":

The code below was contributed by [Eric Monson](https://github.com/emonson). Thanks, Eric!

In [53]:
import altair as alt
import numpy as np

# Terms in this list will get a red dot in the visualization
term_list = ['war', 'peace']

# adding a little randomness to break ties in term ranking
top_tfidf_plusRand = top_tfidf.copy()
top_tfidf_plusRand['tfidf'] = top_tfidf_plusRand['tfidf'] + np.random.rand(top_tfidf.shape[0])*0.0001

# base for all visualizations, with rank calculation
base = alt.Chart(top_tfidf_plusRand).encode(
    x = 'rank:O',
    y = 'document:N'
).transform_window(
    rank = "rank()",
    sort = [alt.SortField("tfidf", order="descending")],
    groupby = ["document"],
)

# heatmap specification
heatmap = base.mark_rect().encode(
    color = 'tfidf:Q'
)

# red circle over terms in above list
circle = base.mark_circle(size=100).encode(
    color = alt.condition(
        alt.FieldOneOfPredicate(field='term', oneOf=term_list),
        alt.value('red'),
        alt.value('#FFFFFF00')        
    )
)

# text labels, white for darker heatmap colors
text = base.mark_text(baseline='middle').encode(
    text = 'term:N',
    color = alt.condition(alt.datum.tfidf >= 0.30, alt.value('white'), alt.value('black'))
)

# display the three superimposed visualizations
(heatmap + circle + text).properties(width = 600)
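The chart computes each term's rank inside Altair, which makes the tie-breaking step a little hard to inspect. Here's a rough pandas sketch of the same idea: add a tiny amount of noise, then rank terms within each document. This isn't part of the chart code above; the toy scores are made up, and the seeded generator is our addition so that the result is the same on every run.

```python
import numpy as np
import pandas as pd

toy_top = pd.DataFrame({
    "document": ["doc-a", "doc-a", "doc-a", "doc-b", "doc-b"],
    "term": ["war", "peace", "nation", "war", "people"],
    "tfidf": [0.20, 0.20, 0.10, 0.05, 0.30],
})

# A seeded generator makes the tie-breaking noise reproducible between runs
rng = np.random.default_rng(seed=0)
toy_jittered = toy_top.copy()
toy_jittered["tfidf"] = toy_jittered["tfidf"] + rng.random(len(toy_jittered)) * 0.0001

# Rank terms within each document, highest score first
toy_jittered["rank"] = (
    toy_jittered.groupby("document")["tfidf"].rank(ascending=False).astype(int)
)
print(toy_jittered.sort_values(["document", "rank"]))
```

Without the noise, "war" and "peace" in `doc-a` would tie for the same rank and land in the same heatmap cell. The noise is small enough that it doesn't reorder terms whose scores genuinely differ.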