# Analyzing keywords with Term Frequency-Inverse Document Frequency

**Contents**  
[Tf-idf Overview](#section-1)  

[Digging Deeper into Tf-idf](#section-2)
> [Stopwords](#section-3)  

> [A note on tokenizing and preprocessing](#section-4)  

> [Playing with Tf-idf parameters](#section-5)

[Interpreting Tf-idf outputs](#section-6)

<a id='section-1'></a>
## Tf-idf Overview

Keyness or distinctiveness is a catchall term for a constellation of statistical measures that attempt to indicate the numerical significance of a term to a document or set of documents, in direct comparison with a larger set of documents or corpus. 

With TF-IDF each term is weighted by dividing the term frequency by the number of documents in the corpus containing the word. It gives weight to terms that appear in a document but are rare or absent in other documents.

TF-IDF is calculated in three steps. First, count the number of times a term occurs in a document (term frequency). Second, take the number of documents in which the same term occurs at least once and divide it by the total number of documents (document frequency), then flip that fraction on its head and take its logarithm (inverse document frequency). Scikit-learn uses a smoothed version by default: `inverse_document_frequency = log((1 + total_number_of_documents) / (1 + number_of_documents_with_term)) + 1`. Third, multiply the two numbers together (`term_frequency * inverse_document_frequency`). The reason we take the inverse, or flipped fraction, of document frequency is to boost the rarer words that occur in relatively few documents.
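To make the arithmetic concrete, here is a minimal sketch that computes the score by hand with the standard library only. The three tiny "documents" and the helper names `smoothed_idf` and `tf_idf` are made up for illustration. The sketch stops before the length normalization that scikit-learn applies by default (`norm='l2'`), so the numbers will not match the vectorizer's output exactly.

```python
import math

# Toy corpus, invented for illustration: three tiny "documents"
documents = [
    "the castle on a hill",
    "the trial of a clerk",
    "the hunger artist",
]
tokenized = [doc.split() for doc in documents]
n_documents = len(tokenized)

def smoothed_idf(term):
    """log((1 + N) / (1 + df)) + 1, the formula quoted above."""
    document_frequency = sum(term in tokens for tokens in tokenized)
    return math.log((1 + n_documents) / (1 + document_frequency)) + 1

def tf_idf(term, tokens):
    """Raw count of the term in one document times its smoothed idf."""
    return tokens.count(term) * smoothed_idf(term)

# "the" occurs in all three documents, so its idf is the lowest possible value
print(smoothed_idf("the"))              # 1.0
# "castle" occurs in only one document, so its idf is higher
print(smoothed_idf("castle"))           # log(2) + 1, roughly 1.69
# Both occur once in the first document, but "castle" gets the higher score
print(tf_idf("the", tokenized[0]), tf_idf("castle", tokenized[0]))
```

Both words appear exactly once in the first document, so the difference in their scores comes entirely from how many other documents contain them.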

In this notebook we use the implementation of tf-idf in Scikit-learn (sklearn). Let's first run through using tf-idf in sklearn to get a sense of what it does, and then we'll go through it again while paying more attention to parameters and stopwords ("Digging Deeper into Tf-idf" section).

In [None]:
# Imports
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd
pd.set_option('display.max_rows', 600)
from pathlib import Path  
import glob

In [None]:
# Data: set up path to files and a variable with file names
directory_path = 'kafka-corpus/'
text_files = glob.glob(f'{directory_path}/*.txt')
text_titles = [Path(text).stem for text in text_files]

In [None]:
#Set up tf-idf vectorizing (we'll go over parameters in more detail further below)
tfidf_vectorizer = TfidfVectorizer(input='filename', stop_words='english')

In [None]:
#Actually do the vectorizing
tfidf_vector = tfidf_vectorizer.fit_transform(text_files)

In [None]:
#Make a DataFrame out of the resulting tf–idf vectors, 
#setting the “feature names” (words in vocabulary) as columns and the titles as rows
tfidf_df = pd.DataFrame(tfidf_vector.toarray(), 
                        index=text_titles, 
                        columns=tfidf_vectorizer.get_feature_names_out())
tfidf_df.head(11)

In [None]:
#Add row for Document Tf-idf (sum of tf-idf scores for each word across all documents)
tfidf_df.loc['00_Document Tf-idf'] = tfidf_df.sum(axis=0)
tfidf_df.head(12)

In [None]:
#Re-organize so words are in rows rather than columns
tfidf_df = tfidf_df.sort_index()
stacked_tfidf_df = tfidf_df.stack().reset_index()
stacked_tfidf_df = stacked_tfidf_df.rename(columns={0:'tfidf', 'level_0': 'document','level_1': 'term', 'level_2': 'term'})
stacked_tfidf_df.sample(n=20)

In [None]:
#Top 10 words with the highest tf–idf for every story
top_tfidf = stacked_tfidf_df.sort_values(by=['document','tfidf'], ascending=[True,False]).groupby(['document']).head(10)
top_tfidf

In [None]:
#Zoom in on particular words
#What documents have the given word in their top significant words?
top_tfidf[top_tfidf['term'].str.contains('people')]

In [None]:
#Display significance scores of the given word across all documents
stacked_tfidf_df[stacked_tfidf_df['term'].str.contains('people')]

In [None]:
#Zoom in on particular document
#What are the top significant words for a given document?
top_tfidf[top_tfidf['document'].str.contains('kafka_a-hunger-artist')]

In [None]:
#What are the top 20 significant words for the given document?
(stacked_tfidf_df[stacked_tfidf_df['document']
                  .str.contains('kafka_a-hunger-artist')]
 .sort_values('tfidf', ascending=False)
 .head(20)
)

In [None]:
#Create bar plots of top 10 significant words for all documents, and each story
import seaborn as sns
sns.catplot(data=top_tfidf, row='document', x='tfidf', y='term', kind='bar', sharey=False)

<a id='section-2'></a>
## Digging Deeper into Tf-idf

<a id='section-3'></a>
### Stopwords

You might already have noticed that there are some words you might want to remove from the analyses in order to create more relevant results. Above we used scikit-learn’s built-in stopwords list. We've discussed how these lists might be problematic and might need to be modified in order to be more relevant for our particular research aims.

First we need to build our custom stopwords list. We might start by looking at [scikit-learn’s built-in stopwords list](https://github.com/scikit-learn/scikit-learn/blob/main/sklearn/feature_extraction/_stop_words.py) (also in the cell below) in order to identify terms we want or don't want on our list. If you’ve already identified some words from the previous analyses, add those too.

In [None]:
custom_stopwords = (
    [
        "a",
        "about",
        "above",
        "across",
        "after",
        "afterwards",
        "again",
        "against",
        "all",
        "almost",
        "alone",
        "along",
        "already",
        "also",
        "although",
        "always",
        "am",
        "among",
        "amongst",
        "amoungst",
        "amount",
        "an",
        "and",
        "another",
        "any",
        "anyhow",
        "anyone",
        "anything",
        "anyway",
        "anywhere",
        "are",
        "around",
        "as",
        "at",
        "back",
        "be",
        "became",
        "because",
        "become",
        "becomes",
        "becoming",
        "been",
        "before",
        "beforehand",
        "behind",
        "being",
        "below",
        "beside",
        "besides",
        "between",
        "beyond",
        "bill",
        "both",
        "bottom",
        "but",
        "by",
        "call",
        "can",
        "cannot",
        "cant",
        "co",
        "con",
        "could",
        "couldnt",
        "cry",
        "de",
        "describe",
        "detail",
        "do",
        "done",
        "down",
        "due",
        "during",
        "each",
        "eg",
        "eight",
        "either",
        "eleven",
        "else",
        "elsewhere",
        "empty",
        "enough",
        "etc",
        "even",
        "ever",
        "every",
        "everyone",
        "everything",
        "everywhere",
        "except",
        "few",
        "fifteen",
        "fifty",
        "fill",
        "find",
        "fire",
        "first",
        "five",
        "for",
        "former",
        "formerly",
        "forty",
        "found",
        "four",
        "from",
        "front",
        "full",
        "further",
        "get",
        "give",
        "go",
        "had",
        "has",
        "hasnt",
        "have",
        "he",
        "hence",
        "her",
        "here",
        "hereafter",
        "hereby",
        "herein",
        "hereupon",
        "hers",
        "herself",
        "him",
        "himself",
        "his",
        "how",
        "however",
        "hundred",
        "i",
        "ie",
        "if",
        "in",
        "inc",
        "indeed",
        "interest",
        "into",
        "is",
        "it",
        "its",
        "itself",
        "keep",
        "last",
        "latter",
        "latterly",
        "least",
        "less",
        "ltd",
        "made",
        "many",
        "may",
        "me",
        "meanwhile",
        "might",
        "mill",
        "mine",
        "more",
        "moreover",
        "most",
        "mostly",
        "move",
        "much",
        "must",
        "my",
        "myself",
        "name",
        "namely",
        "neither",
        "never",
        "nevertheless",
        "next",
        "nine",
        "no",
        "nobody",
        "none",
        "noone",
        "nor",
        "not",
        "nothing",
        "now",
        "nowhere",
        "of",
        "off",
        "often",
        "on",
        "once",
        "one",
        "only",
        "onto",
        "or",
        "other",
        "others",
        "otherwise",
        "our",
        "ours",
        "ourselves",
        "out",
        "over",
        "own",
        "part",
        "per",
        "perhaps",
        "please",
        "put",
        "rather",
        "re",
        "same",
        "see",
        "seem",
        "seemed",
        "seeming",
        "seems",
        "serious",
        "several",
        "she",
        "should",
        "show",
        "side",
        "since",
        "sincere",
        "six",
        "sixty",
        "so",
        "some",
        "somehow",
        "someone",
        "something",
        "sometime",
        "sometimes",
        "somewhere",
        "still",
        "such",
        "system",
        "take",
        "ten",
        "than",
        "that",
        "the",
        "their",
        "them",
        "themselves",
        "then",
        "thence",
        "there",
        "thereafter",
        "thereby",
        "therefore",
        "therein",
        "thereupon",
        "these",
        "they",
        "thick",
        "thin",
        "third",
        "this",
        "those",
        "though",
        "three",
        "through",
        "throughout",
        "thru",
        "thus",
        "to",
        "together",
        "too",
        "top",
        "toward",
        "towards",
        "twelve",
        "twenty",
        "two",
        "un",
        "under",
        "until",
        "up",
        "upon",
        "us",
        "very",
        "via",
        "was",
        "we",
        "well",
        "were",
        "what",
        "whatever",
        "when",
        "whence",
        "whenever",
        "where",
        "whereafter",
        "whereas",
        "whereby",
        "wherein",
        "whereupon",
        "wherever",
        "whether",
        "which",
        "while",
        "whither",
        "who",
        "whoever",
        "whole",
        "whom",
        "whose",
        "why",
        "will",
        "with",
        "within",
        "without",
        "would",
        "yet",
        "you",
        "your",
        "yours",
        "yourself",
        "yourselves",
    ]
)
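One way to adapt a list like the one above is to treat it as a set, remove the words you want to keep in the analysis, and add words specific to your corpus. The sketch below uses a short stand-in list rather than the full one. The words removed and added are hypothetical choices, not recommendations.

```python
# A short stand-in for the full list above (illustration only)
base_stopwords = ["a", "and", "cry", "fire", "he", "she", "the"]

# Hypothetical choices: words that may matter for a literary analysis,
# and corpus-specific words we would rather ignore
words_to_keep_in_analysis = {"cry", "fire", "he", "she"}
words_to_add = {"said", "mr"}

# Set arithmetic: drop the words we want to analyze, then add the extra ones
example_stopwords = sorted(
    (set(base_stopwords) - words_to_keep_in_analysis) | words_to_add
)
print(example_stopwords)  # ['a', 'and', 'mr', 'said', 'the']
```

The result is a plain list, which is the form the `stop_words` parameter accepts.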

### Identifying frequent words to add to custom stopword list

We could also generate a document-term matrix and inspect the most frequent words in the vocabulary across all texts. This can help us add further words to our stopwords list.

In [None]:
# Imports
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd
pd.set_option('display.max_rows', 600)
from pathlib import Path  
import glob

In [None]:
# Data
directory_path = 'kafka-corpus/'
text_files = glob.glob(f'{directory_path}/*.txt')
text_titles = [Path(text).stem for text in text_files]

In [None]:
#Set up CountVectorizer()
cv=CountVectorizer(input='filename', stop_words=custom_stopwords)

In [None]:
#Generate document-term matrix for the docs
#DTM reminder: 
#each row is a document
#each column is an element (word) of the total vocabulary for the corpus
dtm=cv.fit_transform(text_files)

In [None]:
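#Sorting vocabulary in order of counts
#first we create a dictionary with words in the vocabulary as keys and counts as values
#(the column sums are computed once, instead of converting the matrix on every loop)
import numpy as np
vocabulary = cv.get_feature_names_out()
total_counts = np.asarray(dtm.sum(axis=0)).ravel()
dictVocab = dict(zip(vocabulary, total_counts))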

In [None]:
#then we sort the dictionary in order of counts
sortVocab = sorted(dictVocab.items(), key=lambda x: x[1], reverse=True)

In [None]:
#then we print top 30 words with highest frequency counts across all texts
for i in sortVocab[0:30]:
    print(i[0], i[1])

<a id='section-4'></a>
### A note on tokenizing and preprocessing

When using Scikit-Learn for calculating tf-idf, certain pre-processing steps are set as defaults.  

As we’ve discussed, defaults in text analysis tools are often suited to assumptions about how the English language works, and this might not be the best way to do things depending on what language you’re working with and what your research goals are. 

We’ll have a look here at Scikit-learn’s preprocessing defaults, and how to modify them. 

#### Lowercasing

The default setting for lowercasing is `lowercase=True`, which means that text will be converted to lower case before tokenizing. Specify `lowercase=False` if you don’t want to lowercase the text.

#### Tokenizing

There are two ways you can modify the built-in tokenizing process. 

You can override the default tokenizing by defining your own tokenizing function and passing it as the `tokenizer` parameter: 

In [None]:
#Define tokenizing function
import re

def tokenize(text):
    lowercase_text = text.lower()
    #raw string so that \W is read as a regex class, not a string escape
    split_words = re.split(r'\W+', lowercase_text)
    tokenized = [word for word in split_words if word.isalpha()]
    return tokenized

#Set up vectorizer
tfidf = TfidfVectorizer(input='filename',
                        tokenizer=tokenize,
                        stop_words=custom_stopwords,
                        lowercase=False)

In [None]:
#Actually do the vectorizing
tfidf_vector = tfidf.fit_transform(text_files)

In [None]:
#Make a DataFrame out of the resulting tf–idf vectors, 
#setting the “feature names” (words in vocabulary) as columns and the titles as rows
tfidf_df = pd.DataFrame(tfidf_vector.toarray(), 
                        index=text_titles, 
                        columns=tfidf.get_feature_names_out())
tfidf_df.head(11)

Or you could change the built-in tokenizing pattern directly.  

The default tokenizing pattern is `token_pattern=r"(?u)\b\w\w+\b"`. This pattern selects tokens of 2 or more alphanumeric characters (punctuation is completely ignored and always treated as a token separator).
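To see what the default pattern keeps and drops, you can try it outside scikit-learn with Python's `re` module. This is only a sketch of the matching step: the sample sentence is made up, and the call to `.lower()` stands in for the vectorizer's own lowercasing.

```python
import re

# The default pattern quoted above, applied with the standard library only
default_pattern = re.compile(r"(?u)\b\w\w+\b")

# Made-up sample sentence
sample = "I saw a man, K., at 9 o'clock in room 101."

tokens = default_pattern.findall(sample.lower())
print(tokens)  # ['saw', 'man', 'at', 'clock', 'in', 'room', '101']
```

One-character tokens ("i", "a", "k", "9" and the "o" of "o'clock") are dropped, while the number "101" is kept, because digits count as word characters.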

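You could replace this pattern with your own regular expression. Note that `token_pattern` describes what a token looks like, not where to split the text. For example, `token_pattern=r"(?u)\b\w+\b"` would treat every run of word characters as a token, so the text is split at anything that is not a word character and one-character tokens such as "a" or "I" are kept. 

In [None]:
#Replace tokenizing pattern
tfidf = TfidfVectorizer(token_pattern=r'(?u)\b\w+\b')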

#### Lemmatization

Since tf-idf relies on word counts to identify words that are distinctive of particular documents compared to other documents, for highly inflected languages it might be helpful to lemmatize the text in order to improve the word counts. Scikit-learn has no built-in lemmatizers, so we need to use other libraries to lemmatize the text before calculating tf-idf scores. 

In [None]:
#Lemmatizing using spaCy for German
import spacy
!python -m spacy download de_core_news_sm

In [None]:
#Load language model
nlp = spacy.load('de_core_news_sm')

#Open your text and create spaCy document
filepath = 'kafka_dv.txt'
with open(filepath, encoding='utf-8') as f:
    text = f.read()
document = nlp(text)

outname = filepath.replace('.txt', '-lemmatized.txt')
with open(outname, 'w', encoding='utf8') as out:   
    for token in document:
        # Get the lemma for each token
        out.write(token.lemma_.lower())
        # Insert white space between each token
        out.write(' ')

In [None]:
#Lemmatizing using spaCy for French
import spacy
!python -m spacy download fr_core_news_sm

In [None]:
#Load language model
nlp = spacy.load('fr_core_news_sm')
#And follow process above

In [None]:
#Lemmatizing using spaCy for Spanish
import spacy
!python -m spacy download es_core_news_sm

In [None]:
#Load language model
nlp = spacy.load('es_core_news_sm')
#And follow process above

<a id='section-5'></a>
### Playing with Tf-idf parameters

Now let's go through the process of identifying significant words with tf-idf, this time looking at parameters more closely, and using a custom stopwords list.

In [None]:
"""
Parameters
The parameters you choose affect the output. 
These settings all have pros and cons; 
there’s no singular, correct way to preset them and produce output. 
Instead, it’s best to understand what settings do so that you can describe 
and defend the choices you’ve made. 
The full list of parameters is described in Scikit-Learn’s documentation.

input='filename' so we can pass a list of files that the vectorizer will read and
fetch the raw content to analyze. default: input='content'

min_df, max_df
These settings control the minimum number of documents a term must be found in to be included 
and the maximum number of documents a term can be found in in order to be included. 
Either can be expressed as a decimal between 0 and 1 indicating the percent threshold, 
or as a whole number that represents a raw count. 
Setting max_df below .9 will typically remove most or all stopwords.
e.g. max_df=0.75 will limit to terms appearing in 75 percent of the documents or lower.

norm, smooth_idf, and sublinear_tf
Each of these will affect the range of numerical scores that the tf-idf algorithm outputs.
norm normalizes the scores, default: norm='l2'
Smooth_idf adds one to each document frequency score, default: smooth_idf=True
Sublinear_tf applies another scaling transformation, replacing tf with 1 + log(tf). 
default: sublinear_tf=False
It is recommended to keep the default norm and smooth_idf parameters; this will better account 
for differences in text length and overall produce more meaningful tf–idf scores.

strip_accents='unicode' will remove accents and perform other character normalization 
during the preprocessing step. default: strip_accents=None

We've already mentioned above the stop_words, lowercase, tokenizer, and token_pattern parameters.
"""
#Set up tf-idf vectorizing with custom settings (here: our custom stopwords list)
tfidf_vectorizer = TfidfVectorizer(input='filename', stop_words=custom_stopwords)
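The effect of `min_df` and `max_df` can be sketched without scikit-learn by counting, for each term, how many documents contain it, and then filtering on that count. The toy corpus and the helper `filter_vocabulary` below are invented for illustration. They are a simplified stand-in for the vectorizer's behavior, not its actual implementation.

```python
# Toy corpus, invented for illustration: each document is already tokenized
toy_documents = [
    ["the", "castle", "village", "snow"],
    ["the", "trial", "court", "village"],
    ["the", "hunger", "artist", "cage"],
    ["the", "court", "judge", "law"],
]
n_documents = len(toy_documents)

# Document frequency: in how many documents does each term occur at least once?
document_frequency = {}
for tokens in toy_documents:
    for term in set(tokens):
        document_frequency[term] = document_frequency.get(term, 0) + 1

def filter_vocabulary(min_count=1, max_proportion=1.0):
    """Keep terms found in at least min_count documents and in at most
    max_proportion of all documents (simplified stand-in for min_df / max_df)."""
    return sorted(
        term
        for term, df in document_frequency.items()
        if df >= min_count and df / n_documents <= max_proportion
    )

# A 75 percent ceiling removes "the", which occurs in every document
print(filter_vocabulary(max_proportion=0.75))
# Adding a floor of 2 documents also removes words unique to one story
print(filter_vocabulary(min_count=2, max_proportion=0.75))  # ['court', 'village']
```

Raising the floor trades away exactly the words that are unique to one text, which are often the ones tf-idf is meant to surface, so it is worth checking what a given threshold removes.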

In [None]:
#Actually do the vectorizing
tfidf_vector = tfidf_vectorizer.fit_transform(text_files)
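The `norm` and `sublinear_tf` settings described in the parameter notes above can also be illustrated by hand. The following sketch uses made-up numbers and two hypothetical helpers, `l2_normalize` and `sublinear`. It assumes the l2 norm is the usual Euclidean length and that sublinear scaling replaces a count `tf` with `1 + log(tf)`, as scikit-learn's documentation describes.

```python
import math

def l2_normalize(scores):
    """Scale a vector so that the sum of its squared values is 1."""
    length = math.sqrt(sum(value * value for value in scores))
    return [value / length for value in scores] if length else list(scores)

def sublinear(term_count):
    """Dampen raw counts: 1 + log(tf) for tf > 0, otherwise 0."""
    return 1 + math.log(term_count) if term_count > 0 else 0.0

# Made-up scores for the same three terms in a short and a long document
short_document = [3.0, 4.0, 0.0]
long_document = [30.0, 40.0, 0.0]

# After normalization the two documents look alike despite their lengths
print(l2_normalize(short_document))
print(l2_normalize(long_document))

# A word used 100 times no longer counts 100 times as much as a word used once
print(sublinear(1), sublinear(100))
```

This is why normalization helps when texts differ in length: the long document's scores are ten times larger before normalization, but the proportions between its terms are the same as in the short one.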

In [None]:
#Make a DataFrame out of the resulting tf–idf vector, 
#setting the “feature names” or words as columns and the titles as rows
"""
N.B. The fit_transform() method above converts the list of files to something called a sparse matrix. 
Sparse matrices save on memory by leaving out all zero values, but we want access to those, 
so the code below uses the toarray() method to convert the sparse matrix to a numpy array. 
By converting the sparse matrix to an array, we ensure that our array has one row for each 
document in our list of documents. 
We want every term represented so that each document has the same number of values, 
one for each word in the corpus even if it doesn't occur in some documents.
"""

tfidf_df = pd.DataFrame(tfidf_vector.toarray(), 
                        index=text_titles, 
                        columns=tfidf_vectorizer.get_feature_names_out())
tfidf_df.head(11)

In [None]:
#Add row for document Tf-idf (sum of tf-idf score for each word across all documents)
tfidf_df.loc['00_Document Tf-idf'] = tfidf_df.sum(axis=0)
tfidf_df.head(12)

In [None]:
#sort index (you could also round to only 2 decimals,
#but since our scores are very low we'll leave them as is)
tfidf_df = tfidf_df.sort_index()#.round(decimals=2)
tfidf_df.head(12)

In [None]:
#Re-organize so words are in rows rather than columns
stacked_tfidf_df = tfidf_df.stack().reset_index()
stacked_tfidf_df = stacked_tfidf_df.rename(columns={0:'tfidf', 'level_0': 'document','level_1': 'term', 'level_2': 'term'})
stacked_tfidf_df.sample(n=20)

In [None]:
#Top 10 words with the highest tf–idf for every story
top_tfidf = stacked_tfidf_df.sort_values(by=['document','tfidf'], ascending=[True,False]).groupby(['document']).head(10)
top_tfidf

In [None]:
#Zoom in on particular words
#What documents have the given word in their top significant words?
top_tfidf[top_tfidf['term'].str.contains('people')]

In [None]:
#Display significance scores of the given word across all documents
stacked_tfidf_df[stacked_tfidf_df['term'].str.contains('people')]

In [None]:
#Zoom in on particular document
#What are the top significant words for a given document?
top_tfidf[top_tfidf['document'].str.contains('kafka_a-hunger-artist')]

In [None]:
#What are the top 20 significant words for the given document?
(stacked_tfidf_df[stacked_tfidf_df['document']
                  .str.contains('kafka_a-hunger-artist')]
 .sort_values('tfidf', ascending=False)
 .head(20)
)

In [None]:
#Create bar plots of top 10 significant words for all documents, and each story
import seaborn as sns
sns.catplot(data=top_tfidf, row='document', x='tfidf', y='term', kind='bar', sharey=False)

<a id='section-6'></a>
## Interpreting Tf-idf outputs

Tf-idf outputs can suggest hypotheses and spark questions that you want to look into further. They constitute a first step for identifying patterns that can be investigated further.
For example:  

- What does it mean that a certain term is significant to a certain document? What are the actual contexts the term appears in?  
- What role does the term play in the narrative?  
- How does it compare to other stories: do other stories include instances of the term? Do they include similar concepts but not that term? If so what are those concepts?  

Interpretations can be developed further by combining with other methods (ngrams, vector comparison) and other modes of reading (contextualized reading, close reading). Tf-idf in particular is both a corpus exploration method in itself and a pre-processing step for many other text-mining measures and models. As we heard from [Ari's presentation](https://drive.google.com/file/d/1lLLFzeCMUDKClrm7UqyIgpdX3b_P9zjO/view?usp=sharing), tf-idf scores can be used instead of word counts to generate word vectors. 

In [None]:
#Use tf-idf scores and cosine distance to compare documents
import glob
from pathlib import Path
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd
from scipy.spatial.distance import pdist, squareform

# Data: set up path to files and a variable with file names
directory_path = 'kafka-corpus/'
text_files = glob.glob(f'{directory_path}/*.txt')
text_titles = [Path(text).stem for text in text_files]

#Set up tf-idf vectorizing with custom settings
tfidf_vectorizer = TfidfVectorizer(input='filename')

#Actually do the vectorizing
tfidf_vector = tfidf_vectorizer.fit_transform(text_files)

In [None]:
#Assessing similarities between texts based on tf-idf scores
#Creates a dataframe with cosine distances between the texts
#calculated from the tf-idf vectors for each text
#cosine distance (as opposed to cosine similarity): the closer to 0, the more similar
cosine_distances = pd.DataFrame(squareform(pdist(tfidf_vector.toarray(), metric='cosine')), 
                                columns=text_titles, index=text_titles)
cosine_distances
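For intuition about what the distance matrix above contains, cosine distance can be written out for a single pair of vectors. The sketch below uses three made-up vectors over a three-term vocabulary. The helper `cosine_distance` is illustrative and follows the usual definition, one minus the cosine of the angle between the vectors.

```python
import math

def cosine_distance(u, v):
    """1 minus the cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1 - dot / (norm_u * norm_v)

# Made-up tf-idf vectors over a three-term vocabulary
story_a = [0.5, 0.5, 0.0]
story_b = [1.0, 1.0, 0.0]  # same proportions as story_a, larger values
story_c = [0.0, 0.0, 1.0]  # shares no terms with story_a

print(cosine_distance(story_a, story_b))  # approximately 0: same direction
print(cosine_distance(story_a, story_c))  # 1.0: nothing in common
```

Only the direction of the vectors matters, not their size, so two texts that use the same terms in the same proportions count as close even when their scores differ in magnitude.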

In [None]:
#Visualize the cosine distance between texts with heatmap based on significance scores
import seaborn as sns
sns.heatmap(data=cosine_distances, annot=False)

In [None]:
#Visualize relations between texts with cluster map
sns.clustermap(data=cosine_distances, annot=False)

_Acknowledgements_: This notebook is inspired by Melanie Walsh’s [_Introduction to Cultural Analytics & Python_](https://melaniewalsh.github.io/Intro-Cultural-Analytics/05-Text-Analysis/01-TF-IDF.html) and Matthew Lavin's ["Analyzing Documents with TF-IDF"](https://programminghistorian.org/en/lessons/analyzing-documents-with-tfidf).