# TF-IDF — Workbook

In this lesson, we're going to learn about a text analysis method called *term frequency–inverse document frequency*, often abbreviated *tf-idf*. Tf-idf is a method that tries to identify the most distinctively frequent or significant words in a document.
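
To make the idea concrete, here is a minimal sketch of a textbook-style calculation on three made-up documents. It assumes one common formulation, the raw count of a term in a document multiplied by ln(N / df), where N is the number of documents and df is the number of documents containing the term. The toy documents and the `tf_idf` helper are hypothetical, and the scikit-learn variant used later in this lesson differs slightly.

```python
import math

# Three tiny made-up "documents" (hypothetical, for illustration only)
docs = {
    "doc_a": "the house the house the street".split(),
    "doc_b": "the name the name".split(),
    "doc_c": "the horse".split(),
}

def tf_idf(term, doc_name):
    """Textbook tf-idf: raw count in the document times ln(N / document frequency).

    Assumes the term appears in at least one document.
    """
    tf = docs[doc_name].count(term)
    df = sum(1 for words in docs.values() if term in words)
    return tf * math.log(len(docs) / df)

# "the" appears in every document, so ln(3 / 3) = 0 wipes out its high count
print(tf_idf("the", "doc_a"))

# "house" appears only in doc_a, so it scores well above zero there
print(tf_idf("house", "doc_a"))
```

A word that is frequent everywhere ends up with a low score, while a word that is frequent in only one document ends up with a high one. That is the sense in which tf-idf picks out "distinctive" words.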

## Import Modules and Libraries

[scikit-learn](https://scikit-learn.org/stable/index.html), imported as `sklearn`, is a popular Python library for machine learning approaches such as clustering, classification, and regression.

Though we're not doing any machine learning in this lesson, we're nevertheless going to use scikit-learn's `TfidfVectorizer`.

In [150]:
from sklearn.feature_extraction.text import TfidfVectorizer

import pandas as pd
pd.options.display.max_rows = 100

import glob
from pathlib import Path

We're also importing `pandas` and raising the maximum number of rows it displays to 100. And we're importing two libraries that will help us work with files and the file system: [`pathlib`](https://docs.python.org/3/library/pathlib.html#basic-use) and [`glob`](https://docs.python.org/3/library/glob.html).

## Get Texts

Below we're setting the directory filepath that contains all the text files that we want to analyze.

In [151]:
directory_path = "../texts/literature/House-on-Mango-Street"

Then we're going to use `glob` and `Path` to make a list of all the filepaths in that directory and a list of all the chapter titles.

In [152]:
text_files = glob.glob(f"{directory_path}/*.txt")

Making a list of chapter titles with our traditional method

In [154]:
text_titles = []
for text in text_files:
    title = Path(text).stem
    # what goes here?

Making a list of chapter titles with a [list comprehension](https://www.w3schools.com/python/python_lists_comprehension.asp)

In [None]:
text_titles = [Path(text).stem for text in text_files]
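
To see what `Path(...).stem` is doing, here is a small sketch that runs without any files on disk. The filenames below are hypothetical stand-ins for whatever `glob` finds in your directory.

```python
from pathlib import Path

# A hypothetical filepath, shaped like the ones glob returns for our directory
example_path = "../texts/literature/House-on-Mango-Street/example-chapter.txt"

# .name keeps the extension; .stem drops the directory and the final extension
print(Path(example_path).name)
print(Path(example_path).stem)

# The same idea applied to a whole list; sorted() gives a predictable order,
# since glob does not guarantee one
example_files = [
    "texts/b-chapter.txt",
    "texts/a-chapter.txt",
]
example_titles = [Path(f).stem for f in sorted(example_files)]
print(example_titles)
```

If you do sort your filepaths, build the titles from the same sorted list so that each title still lines up with its file.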

## Calculate Tf–idf

To calculate tf–idf scores for every word, we're going to use scikit-learn's [`TfidfVectorizer`](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html).

The recommended way to run `TfidfVectorizer` is with smoothing (`smooth_idf=True`) and normalization (`norm='l2'`) turned on. Normalization accounts for differences in text length, and smoothing adds one to every document frequency so that no term can cause a division by zero. Together they tend to produce more meaningful tf–idf scores.

Smoothing and L2 normalization are actually the default settings for `TfidfVectorizer`, so to turn them on, you don't need to include any extra code at all.
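
Here is a rough sketch, in plain Python, of what those two settings do. The idf formula follows the one given in the scikit-learn documentation for `smooth_idf=True`; the helper names, the three-document corpus, and the counts are hypothetical, and this is an illustration rather than scikit-learn's actual implementation.

```python
import math

def smooth_idf(n_docs, doc_freq):
    """idf as documented for scikit-learn with smooth_idf=True: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

def l2_normalize(vector):
    """Scale a vector so that the sum of its squared values is 1."""
    length = math.sqrt(sum(value ** 2 for value in vector))
    return [value / length for value in vector]

# Hypothetical corpus of 3 documents; the two terms appear in 3 and 1 of them
idfs = [smooth_idf(3, 3), smooth_idf(3, 1)]

# Raw counts of those two terms in a short document, and in one twice as long
short_counts = [2, 1]
long_counts = [4, 2]

short_scores = l2_normalize([tf * idf for tf, idf in zip(short_counts, idfs)])
long_scores = l2_normalize([tf * idf for tf, idf in zip(long_counts, idfs)])

print(idfs)
print(short_scores)
print(long_scores)
```

Two things are worth noticing. A term that appears in every document gets an idf of 1 under this formula rather than 0, so it is down-weighted but not erased. And after L2 normalization the short and long documents end up with the same scores, because only the proportions between terms matter, not the raw length of the text.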

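Here we initialize `TfidfVectorizer` with our desired parameters: `input='filename'` tells it to read each text from a filepath, and `stop_words='english'` drops common English words. Smoothing and normalization stay at their defaults.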

In [203]:
tfidf_vectorizer = TfidfVectorizer(input='filename', stop_words='english')

In [None]:
tfidf_vectorizer.get_stop_words()

Run TfidfVectorizer on our `text_files`

In [205]:
tfidf_vector = tfidf_vectorizer.fit_transform(text_files)

Make a DataFrame out of the resulting tf–idf vector, setting the "feature names" or words as columns and the titles as rows

In [206]:
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
tfidf_df = pd.DataFrame(tfidf_vector.toarray(), index=text_titles, columns=tfidf_vectorizer.get_feature_names_out())

Add a row for document frequency, that is, the number of documents in which each word appears

In [207]:
tfidf_df.loc['00_Document-Frequency'] = (tfidf_df > 0).sum()

In [None]:
tfidf_slice = tfidf_df[['esperanza', 'house', 'horse', 'mexico', 'pink', 'laughter', 'clowns']]
tfidf_slice.sort_index().round(decimals=2)

Let's drop the "00_Document-Frequency" row since we were just using it for illustration purposes.

In [209]:
tfidf_df = tfidf_df.drop('00_Document-Frequency', errors='ignore')

Let's reorganize the DataFrame so that the words are in rows rather than columns.

In [None]:
tfidf_df.stack().reset_index().rename(columns={0:'tfidf', 'level_0': 'document','level_1': 'term', 'level_2': 'term'})

In [214]:
tfidf_df = tfidf_df.stack().reset_index().rename(columns={0:'tfidf', 'level_0': 'document','level_1': 'term', 'level_2': 'term'})

To find the 15 words with the highest tf–idf in every chapter, we're going to sort by document and tf–idf score, then group by document and take the first 15 rows of each group.

In [None]:
tfidf_df.sort_values(by=['document','tfidf'], ascending=[True,False]).groupby(['document']).head(15)

In [216]:
top_tfidf = tfidf_df.sort_values(by=['document','tfidf'], ascending=[True,False]).groupby(['document']).head(15)
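
If the sort-then-group-then-head chain feels opaque, here is the same logic sketched on plain tuples, without pandas. The rows and the `top_n` value are hypothetical stand-ins for the rows of `tfidf_df` and for the 15 used above.

```python
from itertools import groupby

# Hypothetical (document, term, tfidf) rows, standing in for the rows of tfidf_df
rows = [
    ("chapter-2", "name", 0.40),
    ("chapter-1", "house", 0.70),
    ("chapter-1", "street", 0.20),
    ("chapter-2", "horse", 0.60),
    ("chapter-1", "red", 0.50),
]

top_n = 2

# Sort by document (ascending), then by score (descending)
rows_sorted = sorted(rows, key=lambda row: (row[0], -row[2]))

# Within each document, keep only the first top_n rows
top_rows = []
for document, group in groupby(rows_sorted, key=lambda row: row[0]):
    top_rows.extend(list(group)[:top_n])

for row in top_rows:
    print(row)
```

Because the rows are already sorted from highest to lowest score within each document, taking the first few rows of each group is the same as taking the highest-scoring terms.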

We can zoom in on particular words and particular documents.

In [None]:
top_tfidf[top_tfidf['document'].str.contains('My-Name')]

Why doesn't "name" show up here...?

## Your Turn!

Pick another chapter and examine the tf-idf scores. 

In [None]:
top_tfidf...

Pick a term and examine the tf-idf scores in the chapters in which it shows up.

In [None]:
top_tfidf...

- How well do you think tf-idf scores capture what is distinctive about each chapter?
- What does tf-idf seem good at?
- What are some problems that you notice with tf-idf scores as a metric?
- What would be a good text collection for using tf-idf?