## TF-IDF — Code

This notebook is a streamlined version of the previous lesson on **term frequency–inverse document frequency** (tf–idf). It is primarily intended for those who want to reuse the code without the previous lesson's overview and explanations.

## Import Libraries

To calculate tf–idf scores, we're going to use a Python library called [scikit-learn](https://scikit-learn.org/stable/index.html), imported as `sklearn`.

In [13]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer

We're also going to import `pandas` and change three of its default display settings. We're going to increase the maximum number of rows and columns that pandas will display, and we're going to format numbers in a special way: whole numbers are shown without decimals, and all other numbers are shown to three decimal places.

In [25]:
import pandas as pd
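# Use the full option names; the short forms can match more than one option in recent pandas versions
pd.set_option("display.max_rows", 500)
pd.set_option("display.max_columns", 200)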
pd.options.display.float_format = lambda value : '{:.0f}'.format(value) if round(value,0) == value else '{:,.3f}'.format(value)
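
To see what the number formatting rule does, here is a minimal sketch that applies the same logic to a few sample values outside of pandas. The helper name `format_number` and the sample values are illustrative assumptions, not part of the notebook.

```python
def format_number(value):
    # Same rule as the float_format lambda: whole numbers lose their decimals,
    # everything else is shown to three decimal places (with thousands separators)
    if round(value, 0) == value:
        return '{:.0f}'.format(value)
    return '{:,.3f}'.format(value)

# Hypothetical sample values, chosen only to exercise both branches
examples = [3.0, 0.2071, 1234.5678]
formatted = [format_number(value) for value in examples]
print(formatted)
```

Note that the thousands separator only applies to the decimal branch, so a whole number such as 1000 would be shown without a comma.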

Finally, we're going to import two libraries that will help us work with files and the file system: [`pathlib`](https://docs.python.org/3/library/pathlib.html#basic-use) and [`glob`](https://docs.python.org/3/library/glob.html).

In [15]:
from pathlib import Path  
import glob

## Set Directory Path

Below we set the path to the directory that contains all the short story text files that we want to analyze.

In [16]:
directory_path = "../texts/literature/Lost-in-the-City_Stories/"

Then we're using `glob` and `Path` to make a list of all the short story filepaths in that directory and a list of all the short story titles.

In [17]:
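# Join the directory and the pattern with Path to avoid a doubled slash in the glob pattern
text_files = glob.glob(str(Path(directory_path) / "*.txt"))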
text_titles = [Path(text).stem for text in text_files]
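
As a quick illustration of what these two lists contain, the sketch below builds a temporary directory with placeholder files and runs the same `glob` and `Path(...).stem` steps on it. The temporary directory, the placeholder file contents, and the `sorted()` call are assumptions added for the demonstration; `glob` itself does not guarantee any particular order.

```python
import glob
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp_dir:
    # Create placeholder files; only the .txt files should be picked up
    for name in ["02-The-First-Day.txt", "01-The-Girl-Who-Raised-Pigeons.txt", "notes.md"]:
        (Path(tmp_dir) / name).write_text("placeholder", encoding="utf-8")

    # Full file paths matching the pattern (sorted here only to make the demo deterministic)
    demo_files = sorted(glob.glob(str(Path(tmp_dir) / "*.txt")))
    # .stem drops the directory and the ".txt" extension, leaving just the title
    demo_titles = [Path(path).stem for path in demo_files]

print(demo_titles)
```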

## Calculate tf–idf

We need to initialize [`TfidfVectorizer`](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html) with our desired parameters. Then we pass the list of text file paths to `.fit_transform()`, which reads the files and calculates the tf–idf scores.

### With Smoothing and Normalization (Defaults/Recommended)

In [19]:
#Initialize TfidfVectorizer with desired parameters (default smoothing and normalization)
tfidf_vectorizer = TfidfVectorizer(input='filename', stop_words='english')

#Plug in "text_files" which contains all our short stories
tfidf_vector = tfidf_vectorizer.fit_transform(text_files)
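
For readers curious about what "smoothing and normalization" mean here, the following is a rough pure-Python sketch of the calculation that scikit-learn documents for its default settings: raw term counts, a smoothed idf of `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each document's scores. The function name `tfidf_smoothed_l2` and the toy token lists are hypothetical, and the sketch skips tokenization and stop word removal, so it is an approximation of the idea rather than a replacement for `TfidfVectorizer`.

```python
import math
from collections import Counter

def tfidf_smoothed_l2(tokenized_docs):
    n_docs = len(tokenized_docs)
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for doc in tokenized_docs for term in set(doc))
    # Smoothed idf: acts as if one extra document contained every term
    idf = {term: math.log((1 + n_docs) / (1 + df)) + 1 for term, df in doc_freq.items()}

    rows = []
    for doc in tokenized_docs:
        counts = Counter(doc)
        raw = {term: count * idf[term] for term, count in counts.items()}
        # L2 normalization: scale each document so its squared scores sum to 1
        norm = math.sqrt(sum(score * score for score in raw.values())) or 1.0
        rows.append({term: score / norm for term, score in raw.items()})
    return rows

# Hypothetical, already-tokenized toy documents
toy_docs = [["pigeons", "said", "pigeons"], ["school", "said"]]
toy_scores = tfidf_smoothed_l2(toy_docs)
print(toy_scores)
```

In this toy corpus, "said" appears in both documents, so it scores lower than the words that are distinctive to a single document.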

Then we make a DataFrame with one row per story and one column per word in the collection, where each cell holds that word's tf–idf score in that story.

In [20]:
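# Make a DataFrame out of the tf–idf vector and sort by title
# (get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2)
tfidf_df = pd.DataFrame(tfidf_vector.toarray(), index=text_titles, columns=tfidf_vectorizer.get_feature_names_out())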
tfidf_df = tfidf_df.sort_index()

In [21]:
tfidf_slice = tfidf_df[['pigeons', 'school', 'said', 'church', 'gospelteers', 'thunder','girl', 'street', 'father', 'dreaming', 'car']]
tfidf_slice

| | pigeons | school | said | church | gospelteers | thunder | girl | street | father | dreaming | car |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 01-The-Girl-Who-Raised-Pigeons | 0.207 | 0.036 | 0.133 | 0.011 | 0.0 | 0.0 | 0.062 | 0.105 | 0.042 | 0.0 | 0.0 |
| 02-The-First-Day | 0.0 | 0.134 | 0.0 | 0.031 | 0.0 | 0.0 | 0.094 | 0.07 | 0.012 | 0.0 | 0.0 |
| 03-The-Night-Rhonda-Ferguson-Was-Killed | 0.0 | 0.02 | 0.212 | 0.003 | 0.0 | 0.0 | 0.032 | 0.082 | 0.061 | 0.0 | 0.092 |
| 04-Young-Lions | 0.0 | 0.015 | 0.186 | 0.0 | 0.0 | 0.0 | 0.005 | 0.065 | 0.073 | 0.0 | 0.012 |
| 05-The-Store | 0.0 | 0.018 | 0.246 | 0.012 | 0.0 | 0.0 | 0.065 | 0.093 | 0.1 | 0.0 | 0.032 |
| 06-An-Orange-Line-Train-To-Ballston | 0.0 | 0.036 | 0.286 | 0.0 | 0.0 | 0.0 | 0.022 | 0.036 | 0.022 | 0.0 | 0.02 |
| 07-The-Sunday-Following-Mother’S-Day | 0.0 | 0.003 | 0.21 | 0.01 | 0.0 | 0.0 | 0.028 | 0.023 | 0.059 | 0.0 | 0.073 |
| 08-Lost-In-The-City | 0.0 | 0.007 | 0.292 | 0.025 | 0.0 | 0.0 | 0.019 | 0.051 | 0.051 | 0.09 | 0.0 |
| 09-His-Mother’S-House | 0.0 | 0.006 | 0.231 | 0.0 | 0.0 | 0.0 | 0.01 | 0.065 | 0.007 | 0.0 | 0.025 |
| 10-A-Butterfly-On-F-Street | 0.0 | 0.0 | 0.171 | 0.0 | 0.0 | 0.0 | 0.0 | 0.128 | 0.043 | 0.0 | 0.037 |


To find the 10 words with the highest tf–idf scores in every story, we're going to define and run the following function, `get_top_tfidf_scores()`.

In [22]:
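def get_top_tfidf_scores(tfidf_df, top_n=10):
    # Reshape to one row per (story, word) pair, then keep the top_n scores within each story
    pretty_df = tfidf_df.stack().groupby(level=0).nlargest(top_n).reset_index()
    pretty_df = pretty_df.rename(columns={0:'tfidf_score', 'level_1': 'story', 'level_2': 'word'})
    # groupby() prepends the story as an extra index level, so drop the duplicate column
    pretty_df = pretty_df.drop(columns='level_0')
    # Rank words within each story (1 = highest score); ties are broken by order of appearance
    pretty_df['tfidf_rank'] = pretty_df.groupby('story')['tfidf_score'].rank(method='first', ascending=False)
    return pretty_df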

This function reshapes the DataFrame so that each row is a single story and word pair, groups the rows by short story with `.groupby()`, and keeps the `top_n` highest tf–idf scores in every story (10 by default). It returns a DataFrame with a new column, `tfidf_rank`, which ranks each story's top words from 1 (the highest score) to `top_n`.
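
To make the reshaping steps easier to follow, here is a small sketch that runs a copy of the function on a toy DataFrame with two stories and three words. The toy titles and scores are invented for illustration, and the function body is repeated only so that the example is self-contained.

```python
import pandas as pd

def get_top_tfidf_scores(tfidf_df, top_n=10):
    # stack(): wide table -> Series indexed by (story, word)
    # groupby(level=0).nlargest(): keep the top_n scores per story
    pretty_df = tfidf_df.stack().groupby(level=0).nlargest(top_n).reset_index()
    pretty_df = pretty_df.rename(columns={0: 'tfidf_score', 'level_1': 'story', 'level_2': 'word'})
    pretty_df = pretty_df.drop(columns='level_0')
    pretty_df['tfidf_rank'] = pretty_df.groupby('story')['tfidf_score'].rank(method='first', ascending=False)
    return pretty_df

# Hypothetical scores: rows are stories, columns are words
toy_tfidf = pd.DataFrame(
    {"pigeons": [0.5, 0.0], "school": [0.1, 0.4], "said": [0.3, 0.2]},
    index=["story-a", "story-b"],
)

toy_top = get_top_tfidf_scores(toy_tfidf, top_n=2)
print(toy_top)
```

With `top_n=2`, each story contributes two rows: its two highest-scoring words, ranked 1 and 2.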

In [23]:
top_tfidf = get_top_tfidf_scores(tfidf_df, top_n=10)
top_tfidf

| | story | word | tfidf_score | tfidf_rank |
|---|---|---|---|---|
| 0 | 01-The-Girl-Who-Raised-Pigeons | betsy | 0.358 | 1 |
| 1 | 01-The-Girl-Who-Raised-Pigeons | jenny | 0.35 | 2 |
| 2 | 01-The-Girl-Who-Raised-Pigeons | ann | 0.31 | 3 |
| 3 | 01-The-Girl-Who-Raised-Pigeons | robert | 0.295 | 4 |
| 4 | 01-The-Girl-Who-Raised-Pigeons | coop | 0.223 | 5 |
| 5 | 01-The-Girl-Who-Raised-Pigeons | pigeons | 0.207 | 6 |
| 6 | 01-The-Girl-Who-Raised-Pigeons | miss | 0.163 | 7 |
| 7 | 01-The-Girl-Who-Raised-Pigeons | birds | 0.147 | 8 |
| 8 | 01-The-Girl-Who-Raised-Pigeons | clara | 0.143 | 9 |
| 9 | 01-The-Girl-Who-Raised-Pigeons | said | 0.133 | 10 |


If you want to change how many top tf–idf scores are shown for every text, change the `top_n` value, for example to 20.

In [26]:
top_tfidf = get_top_tfidf_scores(tfidf_df, top_n=20)
top_tfidf

| | story | word | tfidf_score | tfidf_rank |
|---|---|---|---|---|
| 0 | 01-The-Girl-Who-Raised-Pigeons | betsy | 0.358 | 1 |
| 1 | 01-The-Girl-Who-Raised-Pigeons | jenny | 0.35 | 2 |
| 2 | 01-The-Girl-Who-Raised-Pigeons | ann | 0.31 | 3 |
| 3 | 01-The-Girl-Who-Raised-Pigeons | robert | 0.295 | 4 |
| 4 | 01-The-Girl-Who-Raised-Pigeons | coop | 0.223 | 5 |
| 5 | 01-The-Girl-Who-Raised-Pigeons | pigeons | 0.207 | 6 |
| 6 | 01-The-Girl-Who-Raised-Pigeons | miss | 0.163 | 7 |
| 7 | 01-The-Girl-Who-Raised-Pigeons | birds | 0.147 | 8 |
| 8 | 01-The-Girl-Who-Raised-Pigeons | clara | 0.143 | 9 |
| 9 | 01-The-Girl-Who-Raised-Pigeons | said | 0.133 | 10 |


## Write to a CSV File

In [37]:
filename = "tfidf_Lost-in-The-City.csv"

top_tfidf.to_csv(filename, encoding='UTF-8', index=False)