# Counting and Vectorizing Words with sk-learn

We will get back to large language models very soon. But it remains true that in many cases, just counting words can tell you a lot. To wit:

<img src="http://lklein.com/wp-content/uploads/2021/10/Screen-Shot-2021-10-06-at-3.33.34-PM.png">

Today, we're going to explore a measure called Term Frequency - Inverse Document Frequency (tf-idf). Tf-idf comes up a lot in text analysis projects because it’s both a corpus exploration method and a pre-processing step for many other text-mining measures and models.

The procedure was introduced in a 1972 paper by Karen Spärck Jones under the name “term specificity,” and the basic idea is this:

Instead of representing a term in a document by its raw frequency or its relative frequency (the term count divided by the document length), each term is *weighted* by dividing the term frequency by the number of documents in the corpus containing the word.

The overall effect of this weighting scheme is to avoid a common problem when conducting text analysis: the most frequently used words in any particular document are often the most frequently used words in all of the documents. Therefore, they are not particularly informative!

By contrast, terms with the highest tf-idf scores are the terms that are *distinctively* frequent in any particular document when that document is compared to other documents. When you sort by tf-idf score, these distinctive terms rise to the top.
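
To make the weighting concrete, here is a minimal sketch in plain Python. It assumes a made-up three-document corpus (the documents and the helper name `simple_weight` are illustrative, not part of the Yelp data), and it applies only the simplified scheme described above: a term's count in a document divided by the number of documents that contain the term. Real implementations, including scikit-learn's, add logarithms and smoothing on top of this.

```python
from collections import Counter

# a made-up toy corpus: each string is one "document"
docs = [
    "the pizza was good and the service was good",
    "the hot pot was good",
    "the tacos were fine",
]

# tokenize naively on whitespace
tokenized = [doc.split() for doc in docs]

# document frequency: how many documents contain each term
doc_freq = Counter()
for tokens in tokenized:
    doc_freq.update(set(tokens))

def simple_weight(term, tokens):
    """Raw count of `term` in one document, divided by its document frequency."""
    return tokens.count(term) / doc_freq[term]

# weight every term in the first document
first = tokenized[0]
weights = {term: simple_weight(term, first) for term in set(first)}

# sort from highest to lowest weight (ties broken alphabetically)
for term, weight in sorted(weights.items(), key=lambda pair: (-pair[1], pair[0])):
    print(f"{term:10s}{weight:.2f}")
```

In this toy corpus, "the" is tied for the most frequent word in the first document, but it ends up with the lowest weight because it appears in every document.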

## An Analogy ##
    
If this explanation doesn’t quite resonate, a brief analogy might help.

Say you've decided to leave campus to get dinner on Buford Highway. Since leaving campus takes a lot of effort (and also, crucially, access to a car), the food better be worth it! That means you'll need to balance two competing goals:

1) The food has to be really tasty; and also, crucially:
2) If you're going to go all the way out to Buford Highway, it better be something that you can't also get in Emory Village. Otherwise, why go to all the trouble of getting there?!

Or, to give an example involving actual food: you don't want to go all the way out to Buford Highway to get pizza. Even if the pizza on Buford Highway is pretty tasty, you can get pizza anywhere in town. How can you find out what is distinctively tasty on Buford Highway?

If you looked up the Yelp reviews for all the restaurants on Buford Highway and sorted by score, you would get an answer to the question of what's the tastiest. But it still won't help solve the problem of what's *distinctively tasty* on Buford Highway--like hot pot, for example, which is something that you can't get in Emory Village.

So you need a way to tell the difference between what's tasty and what's distinctively tasty. To do so, you need to distinguish between four categories of food. Food that, on Buford Highway, is:

- both tasty and distinctive (e.g. hot pot)
- tasty but not distinctive (e.g. pizza)
- distinctive but not tasty (e.g. tacos-- tho I'm open to disagreement here)
- neither tasty nor distinctive (e.g. Taco Bell--again, open to disagreement).

These categories are what tf-idf helps you measure. Term frequencies can be assessed according to the same criteria. A term might be:

- Frequently used in the corpus, and used especially frequently in one particular document <-- Interesting!
- Frequently used in the corpus, but used frequently in equal measure across all documents <-- Less interesting
- Infrequently used in the corpus, but nonetheless used frequently in one particular document <-- Potentially interesting
- Infrequently used in the corpus, and consistently infrequent across all documents <-- Not interesting

It's the words that are especially frequent in one document that are most interesting to us, and the ones that tf-idf helps us identify. To see how, let's take a look at our next dataset--a bigger one--Yelp reviews from restaurants in three cities: Atlanta, New York, and San Francisco.

## Overview of Yelp Review Data ##

In this lesson, we're going to use tf-idf to study Yelp reviews for restaurants in Atlanta. This dataset was created by [Naitian Zhou](https://naitian.org/) (who we'll zoom with later in the semester) for a project that we've been working on, along with Lucy Li, involving questions of taste and authenticity in US restaurant reviews, inspired by work by [Sara Kay](https://ny.eater.com/2019/1/18/18183973/authenticity-yelp-reviews-white-supremacy-trap), [Yiwei Luo, Kristina Gligoric, and Dan Jurafsky](https://arxiv.org/pdf/2307.07645) (which you read for today), and [Sharon Zukin, Scarlett Lindeman, and Laurie Hurson](https://journals-sagepub-com.proxy.library.emory.edu/doi/full/10.1177/1469540515611203). All found instances of racial and ethnic bias in such reviews, and over the next few class sessions, we're going to see what we can find out that might confirm or refute these findings.

## Pre-processing: prepare the reviews

Tf-idf works on sets of documents--individual reviews in our case. We'll be using another (new to us) library, scikit-learn, in order to count the words in the reviews. But before we do, we'll need to get the reviews out of a .jsonl file and into a list, with each review stored as its own string.

The reviews are stored in a .jsonl file that is zipped and stored on my Google Drive. Below is some code to get the zipped jsonl file from Google Drive, unzip it, and format the review text into a list for processing.

First, download the file:

In [None]:
# For downloading large files from Google Drive
# https://github.com/wkentaro/gdown
import gdown

# then download the zip files
# atlanta
gdown.download('https://drive.google.com/uc?export=download&id=1gIm9NcoeY1gn9EQjRr2MojGRJ-fpBSqz', quiet=False)

# we'll use these next class, maybe
# san francisco
# gdown.download('https://drive.google.com/uc?export=download&id=1NU19CyDbRVDJfGwV9kpV-JDn5OLPuNlp', quiet=False)

# new york
# gdown.download('https://drive.google.com/uc?export=download&id=1tzL-k2wMqskcDUiA5a_VD7Z3ezu6GNJ3', quiet=False)

Then, unzip it:

In [None]:
# unzip it
!unzip Atlanta-random.jsonl.zip


Then, process the data. So we can take a quick look at everything that's in the json file, we'll pull it into a dataframe first.

Note that while this code is written to process this particular dataset, you'll usually need to write some sort of file/text pre-processing code in order to use any particular library/method/tool. You'll get very familiar with writing code like this by the end of the course!

In [None]:
# import libraries
import os             # for directory/file manipulation
import json           # for json
import pandas as pd   # for dataframes

# read in the file
atlanta_reviews_df = pd.read_json(path_or_buf="./Atlanta-random.jsonl", lines=True)

len(atlanta_reviews_df)


In [None]:
# take a quick look at the top
atlanta_reviews_df.head()

Great! But what we really want is what's in the "comment" column, and in particular we want the value of the "text" key. So let's make a list with only that.

In [None]:
# first extract the 'comment' values from the dataframe
comments = atlanta_reviews_df['comment'].tolist()

# create list to store reviews
reviews = []

# iterate through the comments and append the reviews to the list
for comment in comments:
  reviews.append(comment['text'])

# print out the first one to check
reviews[0]

Oops! There's still some HTML in there. Let's do a quick cleaning pass.

In [None]:
from bs4 import BeautifulSoup

# new array w/ clean text
reviews_clean = []

for review in reviews:
    soup = BeautifulSoup(review, "html.parser")
    text = soup.get_text(separator=' ')

    reviews_clean.append(text)

One last thing. Let's make a set of IDs so we can get a sense of what restaurant each review is about.

In [None]:
# extract the 'business' values from the dataframe
businesses = atlanta_reviews_df['business'].tolist()

# create list to store business aliases
aliases = []

# iterate through the business and append the alias to the list
for business in businesses:
  aliases.append(business['alias'])

# extract ratings
ratings = atlanta_reviews_df['rating'].tolist()

# create list to store IDs
ids = []

# now put them all together into IDs
# (named review_id so we don't overwrite Python's built-in id function)
for i, alias in enumerate(aliases):
  review_id = alias + "-review" + str(i) + "-" + str(ratings[i]) + "stars"
  ids.append(review_id)

# print out the first one to check
ids[0]

OK. Looks like we're finally ready to go!

## To the TF-IDF calculation... or not (yet)!

Like many other methods, we can calculate tf-idf with just a few lines of code. [Here is another notebook](https://colab.research.google.com/drive/1dptJ366O9TG9a77g-qX8f4Glf_2A8LVd?usp=sharing) that presents everything below in a much more streamlined fashion.

But because the TF-IDF calculation relies on a process that is, in itself, a pre-processing step for many future methods, and is in itself a little heady, we're going to spend some time today looking under the hood.

So without further ado, introducing.... the `CountVectorizer`!

## But wait... the count what?

When humans make sense of language, they interpret sequences of words through rules of grammar and syntax. Computers don't need to do that (although we will learn some ways to do that later in the course). To model language computationally, it's far easier to turn words into numbers, and then apply statistical measures and modeling approaches to the numbers that represent the words.

TF-IDF, as well as machine learning approaches including language modeling, topic modeling, similarity measures, classification, and clustering--essentially, the set of methods we'll be learning in the next few units of this course--all rely on this basic transformation.

## Introducing scikit-learn!

We've now reached another milestone--our first use of scikit-learn (often abbreviated as sk-learn), Python's major machine learning library, which also happens to be crucial to many of the more advanced methods named above.

We're going to use this to transform our words into numbers.

We'll begin by importing sk-learn's `CountVectorizer`, which [converts a collection of text documents into a matrix of token counts](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html).

I think I've used the word "token" in passing before, but here I'll take a minute to formally define it, along with some related terms.

## Tokens, features, document-term matrices

According to the [Stanford NLP group](https://nlp.stanford.edu/IR-book/html/htmledition/tokenization-1.html), a *token* is "an instance of a sequence of characters in some particular document that are grouped together as a useful semantic unit for processing." The unit of the token is usually the word, but it can also be the sentence, the subword, or anything else that makes sense for that particular task.

Take this famous phrase, for example:

"To be or not to be"

This line has six tokens: "to", "be", "or", "not", "to", "be".

It has four features: "to", "be", "or", "not"

But wait, what is a feature?

In this case, a *feature* is a unique token in the corpus. (Caveat: features, like tokens, can actually be anything that makes sense for the task, but for the purposes of turning words into numbers, features are most often unique words, or "terms," as they're also sometimes called).
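
The token/feature distinction can be checked in a couple of lines of plain Python. This is only a sketch: it lowercases the phrase and splits on whitespace, which is a much cruder tokenizer than the ones we'll actually use, and the variable names are just for illustration.

```python
phrase = "To be or not to be"

# tokens: every word occurrence, lowercased so "To" and "to" match
tokens = phrase.lower().split()

# features: each unique token, kept in order of first appearance
features = list(dict.fromkeys(tokens))

print(len(tokens), tokens)
print(len(features), features)
```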

**When sk-learn's CountVectorizer does its thing, it first *tokenizes* all of the documents in the corpus--that is, it breaks up each document into its individual tokens--and then creates a *document-term matrix* that counts up how many times each term, or feature, appears in each document.**

For example, the document-term matrix for the line above might look something like this:

|   | to | be | or | not |
|---|----|----|----|-----|
|   | 2  | 2  | 1  | 1   |


If we add in the second part of that phrase as a new document, we might get something like this:


|         | to | be | or | not | that | is | the | question |
|---------|----|----|----|-----|------|----|-----|----------|
| line 1  | 2  | 2  | 1  | 1   | 0    | 0  | 0   | 0        |
| line 2  | 0  | 0  | 0  | 0   | 1    | 1  | 1   | 1        |
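
Here is a rough sketch of that same by-hand counting in plain Python, assuming a simple lowercase-and-split tokenizer. The names `lines`, `vocabulary`, and `matrix` are made up for this illustration; this is not how sk-learn does it internally, just the same idea spelled out.

```python
from collections import Counter

lines = ["To be or not to be", "that is the question"]

# tokenize: lowercase and split on whitespace
tokenized = [line.lower().split() for line in lines]

# vocabulary: unique terms in order of first appearance
vocabulary = list(dict.fromkeys(token for tokens in tokenized for token in tokens))

# one row per document, one column per term
matrix = []
for tokens in tokenized:
    counts = Counter(tokens)
    matrix.append([counts[term] for term in vocabulary])

print(vocabulary)
for row in matrix:
    print(row)
```

Each row has one slot for every term in the vocabulary, so a document gets a zero for any term it doesn't contain.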


But enough of vectorizing by hand; let's try it out using sk-learn!


## Importing sk-learn's CountVectorizer




In [None]:
# import CountVectorizer from sk-learn
from sklearn.feature_extraction.text import CountVectorizer

## Vectorize a teeny dataset

Now let's vectorize a teeny dataset we can see:

In [None]:
# here's our dataset: the chorus from Taylor Swift's "Tortured Poets Department"
# in which each line is its own document
dataset = [
    'And who\'s gonna hold you like me?',
    'And who\'s gonna know you, if not me?',
    'I laughed in your face and said',
    '\"You\'re not Dylan Thomas, I\'m not Patti Smith',
    'This ain\'t the Chelsea Hotel, we\'re modern idiots\"',
    'And who\'s gonna hold you like me?',
]

In [None]:
# quickly strip out the single quote marks from each element of the dataset above
# (strings are immutable, so replace() returns a new string that we need to keep)
dataset = [item.replace("'", "") for item in dataset]

dataset

In [None]:
# now instantiate the CountVectorizer object
# note that this is the same conceptual process we used to instantiate
# the VADER sentiment analysis object, and the spaCy document object
cv=CountVectorizer()

# this step generates the document-term matrix for the docs;
# it's required before you do almost anything else
dtm=cv.fit_transform(dataset)

# this method gives us the feature names that the CountVectorizer vectorized:
features = cv.get_feature_names_out()

# this method turns our doc-term matrix into an array that can be manipulated:
dtm_array = dtm.toarray()

print("All of the features in our dataset:")
print(str(features))

print ("\nAnd their counts in each of the \"documents,\" each of which is really")
print("just a single line of the song:")
print(dtm_array)

In [None]:
# here is some code that uses dataframes to make the above slightly more legible

df = pd.DataFrame(data=dtm_array,columns=features)
print(df)

Let's take a minute to figure out what we're looking at:

* Each column is a feature, or "term," labeled with the name of the term, which in this case is a unique token
* Each row is a line, labeled in order of being ingested
* The "1" in row 0 of the "and" column means that the term "and" appears 1 time in the first line... and so on.  

## Vectorizing a dataset from a set of files

The reality is that you almost always will be vectorizing a dataset from a set of files, and not a list that you type in by hand. This is how you'd do it with the Yelp reviews dataset that we loaded earlier:

In [None]:
# instantiate the vectorizer, as before
cv=CountVectorizer()

# generates document-term matrix for all the docs
dtm=cv.fit_transform(reviews_clean)

# get the feature names aka terms
features = cv.get_feature_names_out()

# take a look at some features in the middle
print(features[3000:3099])

In [None]:
# you can also check the overall shape of the doc-term matrix
dtm.shape

**What does this tell you about how many documents there are? What about the number of features?**

## Helpful CountVectorizer Parameters

Since we're on the subject of the `CountVectorizer`, here are a few more helpful `CountVectorizer` parameters to know about:



In [None]:
# lowercase all words -- this is True by default, but if you want to preserve case,
# you can set lowercase to False
cv_caps = CountVectorizer(lowercase=False)

# generates document-term matrix for all the docs
dtm2=cv_caps.fit_transform(reviews_clean)

# check the shape
dtm2.shape

So, there are more terms since it's not merging the uppercase and the lowercase versions of each word together.

Another parameter to know about has to do with stopwords. These are common words like "and", "not", "or", etc. that are not usually that interesting.

In [None]:
# use the built-in English stopwords list
cv_no_stops = CountVectorizer(stop_words='english')

# generates document-term matrix for all the docs
dtm3=cv_no_stops.fit_transform(reviews_clean)

# check the shape
dtm3.shape

In [None]:
# use a custom stopwords list
cv_no_stops = CountVectorizer(stop_words=['food','review','atlanta'])

# generates document-term matrix for all the docs
dtm4=cv_no_stops.fit_transform(reviews_clean)

# check the shape
dtm4.shape

One last helpful feature of CountVectorizer is that you can tell it very easily to tokenize by ngrams as well as words. To wit:

In [None]:
# ngram_range=(1, 3) keeps single words, bigrams, and trigrams;
# use ngram_range=(2, 2) if you want bigrams only
ngram_cv = CountVectorizer(analyzer='word', ngram_range=(1, 3))

# generates document-term matrix for all the docs
dtm5=ngram_cv.fit_transform(reviews_clean)

# get the feature names -- unigrams, bigrams, and trigrams in this case
features = ngram_cv.get_feature_names_out()

# take a look at some features in the middle
print(features[3000:3099])

## Return to TF-IDF

At long last, we're ready to calculate the TF-IDF scores for our corpus!

We're going to do this step by step at first, to make sure that you understand each of the processes, and then at the end, you'll see how to do this in only a few lines of code.

In [None]:
# import the TF-IDF libraries
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import TfidfVectorizer

## Create document-term matrix

We'll use a doc-term matrix to calculate tf-idf. We already know how to do this step:

In [None]:
#instantiate CountVectorizer()
cv=CountVectorizer(stop_words='english') # using stopwords this time

# this step generates the document-term matrix for the docs
dtm=cv.fit_transform(reviews_clean)

# check shape
dtm.shape

**How many reviews do we have in this document-term matrix and how many unique features/terms?**

That line tells us that we have 34370 rows, one for each review in the dataset, and 35151 columns, one for each word (minus single character words, which the tokenizer excludes, as well as the default stopwords, which we've indicated should be excluded with the stop_words='english' parameter above).

We can also look at the whole vocabulary like this:

In [None]:
cv.vocabulary_

The numbers above are the indices for each feature, not the word counts. But we can use the indices in order to generate our word counts.

Note that we did a similar thing above using `cv.get_feature_names_out()`. This is arguably more efficient.

In [None]:
sum_words = dtm.sum(axis=0) # sum_words is a vector that contains the number of times each word appears in all
                            # the docs in the corpus. In other words, we are summing the elements for each column
                            # of the doc-term matrix and storing those counts as a vector

# then build and sort a list of tuples that pair each word with its count in the corpus.
# tuples are Python's fixed-length, ordered groupings of values,
# like the (word, index) pairs in the vocabulary attribute above

words_freq = [(word, sum_words[0, idx]) for word, idx in cv.vocabulary_.items()] # remember list comprehension!

words_freq = sorted(words_freq, key = lambda x: x[1], reverse=True)

# display the top 20
words_freq[:20]

We can already see some words with the most counts don't seem too distinctive: "food" and "place," for example. It's not surprising that those are the most frequently occurring words since these are reviews about restaurants.

So now let's calculate the IDF values so that we can balance them out. While we could also calculate these by hand, sk-learn makes it really easy to do it in a few lines of code, so we'll use that instead.

## Initialize TfidfTransformer

When you initialize TfidfTransformer, you can choose to set it with different parameters. These parameters will change the way you calculate tf–idf. The recommended way to run `TfidfTransformer` is with smoothing (`smooth_idf = True`) and normalization (`norm='l2'`) turned on. These parameters will better account for differences in document length, and, overall, they'll produce more meaningful tf–idf scores.

In [None]:
# Call tfidf_transformer.fit on the word count vector we computed earlier.
tfidf_transformer=TfidfTransformer(smooth_idf=True,use_idf=True)
tfidf_transformer.fit(dtm)
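
To get a feel for what the `norm='l2'` setting does about document length, here is a small sketch with made-up counts (the matrix below is an assumption for illustration, not our reviews). The second "document" is just the first one repeated ten times over.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer

# made-up counts: a short document, the same document ten times longer,
# and a third, different document
counts = np.array([
    [1, 2, 0],
    [10, 20, 0],
    [0, 1, 3],
])

tfidf = TfidfTransformer(smooth_idf=True, norm='l2').fit_transform(counts)
dense = tfidf.toarray()

# with l2 normalization, every row has a Euclidean length of 1
row_lengths = np.linalg.norm(dense, axis=1)
print(row_lengths)

# so the short and the long version of the document get the same scores
same_scores = np.allclose(dense[0], dense[1])
print(same_scores)
```

Because each row is scaled to the same length, a long review doesn't get higher scores just for being long.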

## Print inverse document frequency (idf) values

In [None]:
# make a dataframe for the idf values
df_idf = pd.DataFrame(tfidf_transformer.idf_, index=cv.get_feature_names_out(),columns=["idf_weights"])

# sort ascending
df_idf.sort_values(by=['idf_weights'])

In the table above, the words at the top are those that appear in the greatest number of reviews across the whole dataset, and the words at the bottom are those that appear in the fewest.

Once again, it makes sense that words like "food" and "good" are at the top.

The words at the bottom appear in only one review in the entire dataset.

## IDF by the numbers

But what are these numbers that we're looking at?

The most direct formula would be **N/df<sub>i</sub>**, where N represents the total number of reviews (or documents) in the dataset, and df<sub>i</sub> is the number of documents in which term *i* appears.

However, many implementations of tf-idf, including scikit-learn, which we are using, normalize the results with additional operations.

In tf-idf, normalization is generally used in two ways, and for two reasons: first, to prevent bias in term frequency from terms in shorter or longer documents; and second, as above, to calculate each term’s idf value.

Scikit-learn’s implementation of tf-idf, with smoothing turned on as we have it here, adds **1** to both N and df<sub>i</sub>, calculates the natural logarithm of **(N+1)/(df<sub>i</sub>+1)**, and then adds **1** to the final result. Here is this same thing formatted slightly more nicely:

<img src="http://lklein.com/wp-content/uploads/2019/10/Screen-Shot-2019-10-02-at-11.52.31-PM.png">
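
If you'd like to check the arithmetic yourself, here is a minimal sketch that computes the smoothed idf by hand and compares it with what `TfidfTransformer` stores in `idf_`. The tiny count matrix is made up for illustration, and the by-hand formula is the one given in scikit-learn's documentation for `smooth_idf=True`: ln((1 + N) / (1 + df)) + 1.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer

# a made-up document-term matrix: 4 documents (rows) x 3 terms (columns)
counts = np.array([
    [3, 0, 1],
    [2, 0, 0],
    [1, 1, 0],
    [4, 0, 0],
])

n_docs = counts.shape[0]

# number of documents containing each term
doc_freq = (counts > 0).sum(axis=0)

# smoothed idf: add 1 to numerator and denominator, take the natural log, add 1
idf_by_hand = np.log((1 + n_docs) / (1 + doc_freq)) + 1

transformer = TfidfTransformer(smooth_idf=True)
transformer.fit(counts)

print(idf_by_hand)
print(transformer.idf_)
print(np.allclose(idf_by_hand, transformer.idf_))
```

Notice that the first term, which appears in all four documents, gets the lowest possible idf, while the two terms that each appear in a single document get the highest.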

**Important note!** This is only one way to calculate TF-IDF. There are many, many versions. The number itself isn't important. It's the *ranking* that the number enables that's most interesting to us.


## Produce & print tf-idf scores

Once you have the idf values, you can compute the tf-idf scores for any document or set of documents. Let’s compute tf-idf scores for the documents in our corpus.

In [None]:
# tf-idf scores
tf_idf_vector=tfidf_transformer.transform(dtm)

Now, let’s print the tf-idf values of the first document to see if they make sense.

We'll place the tf-idf scores from the first document into a pandas dataframe and sort the dataframe in descending order of scores.

In [None]:
import textwrap

feature_names = cv.get_feature_names_out()

#get tfidf vector for first document
first_document_vector=tf_idf_vector[0]

# print the review text for the first doc
print(textwrap.fill(reviews_clean[0],100))

#print the scores for the first doc
df = pd.DataFrame(first_document_vector.T.todense(), index=feature_names, columns=["tfidf"])
df.sort_values(by=["tfidf"],ascending=False)

Notice that only certain words have scores. This is because only the words in this document have a tf-idf score and everything else, from other documents, shows up as zeroes.

Let's try another:


In [None]:
review_num = 3000

# get tfidf vector for another document
document_vector=tf_idf_vector[review_num]

# print the review text for this doc
print(textwrap.fill(reviews_clean[review_num],100))

# print the scores for this doc
df = pd.DataFrame(document_vector.T.todense(), index=feature_names, columns=["tfidf"])
df.sort_values(by=["tfidf"],ascending=False)

## tf-idf: the fast way

Since we're now tf-idf pros, we're going to use scikit-learn's all-in-one tf-idf vectorizer to do this entire notebook again in two lines of code.

In [None]:
tfidf_vectorizer=TfidfVectorizer(stop_words='english', use_idf=True) # excluding stopwords again

# send in all your docs here
tfidf_vectorizer_vectors=tfidf_vectorizer.fit_transform(reviews_clean)

In [None]:
# place tf-idf values for all docs in a pandas dataframe
tfidf_df = pd.DataFrame(tfidf_vectorizer_vectors.toarray(), columns=tfidf_vectorizer.get_feature_names_out())

tfidf_df

In [None]:
# Add a row for the number of documents in which each word appears
tfidf_df.loc['Document Frequency'] = (tfidf_df > 0).sum()

**And we're done! 🎉 🎉 🎉**

Now let's explore the results...

## Let's explore!

We can look at specific words and how they appear in our reviews dataset. I've entered five words below that we might want to investigate:

In [None]:
tfidf_slice = tfidf_df[['wings', 'dumplings', 'tacos', 'pancakes', 'bao']]
tfidf_slice

**Does this output make sense? What does it tell you about which reviews you might want to go read? What about some research questions you might ask of this corpus using TF-IDF?**

## Searching for multiple terms

This isn't much different from the above: your slice can include more than one term, which lets you find the reviews that mention all of them.

In [None]:
# select the two terms; drop the 'Document Frequency' summary row added above
# so that only actual reviews remain
tfidf_slice = tfidf_df[['chicken', 'waffles']].drop(index='Document Frequency', errors='ignore')

# keep only the reviews in which both terms appear
tfidf_slice = tfidf_slice[
    (tfidf_slice['chicken'] > 0) & (tfidf_slice['waffles'] > 0)
].copy()

# now calculate the sum of 'chicken' and 'waffles' columns
tfidf_slice['total'] = tfidf_slice['chicken'] + tfidf_slice['waffles']

# sort by total
slice_sorted = tfidf_slice.sort_values(by=['total'],ascending=False)

print(slice_sorted[:10])

This suggests that we should check out the reviews with these indices if we want to read about chicken and waffles. Let's try it!

In [None]:
indexes = slice_sorted[:10].index.tolist()

for index in indexes:
  print(str(ids[index]))
  print(textwrap.fill(reviews_clean[index],100) + "\n")


## Displaying the top terms for any particular document

The second major use of TF-IDF is to characterize the most significant words in any particular document. Here's some code that will do that for the first document in the corpus:

In [None]:
# pull out the row at location 0
# replace the index number to pull up another doc
idx = 0

print("ID: " + str(ids[idx]))
doc_row = tfidf_df.iloc[idx]
print("Top words: ")

# sort by tf-idf scores top to bottom
doc_row = doc_row.sort_values(ascending=False)

# print out the top ten
print (doc_row[:10])

## Most unique words in a corpus

Oh and of course! Here are the most unique words in the corpus overall!

Recall from the previous lesson that this is distinct from the most *frequent* words in the corpus.

In [None]:
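# add in a row with the total TF-IDF scores
# (leaving out the 'Document Frequency' summary row added earlier)
tfidf_df.loc['Total_TFIDF'] = tfidf_df.drop(index='Document Frequency', errors='ignore').sum()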

# sort by Total_TFIDF values, high to low
tfidf_df.sort_values(by=['Total_TFIDF'], axis=1, ascending=False, inplace=True)

tfidf_df

Here's the same thing formatted slightly more nicely.

Note the use of Python's built-in `range` to generate the sequence of numbers for the loop.

In [None]:
sorted_terms = tfidf_df.columns.tolist()

for i in range(25):
  print(str(i) + ". " + str(sorted_terms[i]))


## Searching/sorting by tf-idf score

One question you often want to ask about tf-idf scores relates to individual words-- more specifically, which documents have the highest tf-idf scores for a specific word.  

You might want to search/sort this way if you were curious, for example, which documents were most uniquely about, say, boba:

In [None]:
# new dataframe for id lookup
tfidf_df = pd.DataFrame(tfidf_vectorizer_vectors.toarray(), columns=tfidf_vectorizer.get_feature_names_out())

# add in a column for the ID of each review for future reference
tfidf_df['IDs'] = ids

In [None]:
tfidf_slice_sorted = tfidf_df[['IDs', 'boba']].sort_values(by=['boba'], ascending=False)

# print out the top ten
print (tfidf_slice_sorted[:10])

The above list then suggests the reviews that you should prioritize reading if your interest was in boba.

## Searching for multiple terms

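Not too different from the above: pass a list of columns to sort by. The first term takes priority, and the second breaks ties.

In [None]:
# sort high to low so that the highest-scoring reviews come first
tfidf_slice_sorted = tfidf_df[['IDs', 'boba', 'coffee']].sort_values(by=['boba', 'coffee'], ascending=False)

# print out the top ten
print(tfidf_slice_sorted[:10])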


## Cosine similarity

Just so you know how to do everything in that Pudding Hip-Hop feature, this is how you calculate cosine similarity between documents on the basis of their tf-idf scores:

In [None]:
# CALCULATE SIMILARITY TO FIRST DOC

from sklearn.metrics.pairwise import cosine_similarity
cosine_similarity(tfidf_vectorizer_vectors[0:1], tfidf_vectorizer_vectors)
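
In case you're curious what that function is computing, here is a small sketch with two made-up vectors (not real reviews). Cosine similarity is the dot product of the two vectors divided by the product of their lengths, and the by-hand result should match scikit-learn's.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# two made-up tf-idf vectors over the same three terms
doc_a = np.array([0.0, 0.6, 0.8])
doc_b = np.array([0.6, 0.0, 0.8])

# cosine similarity: dot product divided by the product of the vector lengths
by_hand = doc_a @ doc_b / (np.linalg.norm(doc_a) * np.linalg.norm(doc_b))

# the same calculation using scikit-learn (which expects 2D inputs)
from_sklearn = cosine_similarity([doc_a], [doc_b])[0, 0]

print(by_hand, from_sklearn)
```

A score of 1 means the two documents weight their terms in exactly the same proportions, and a score of 0 means they share no terms at all.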

*Lauren F. Klein wrote version 1.0 of this notebook in 2019 based on tutorials by [Matthew Lavin](https://programminghistorian.org/en/lessons/analyzing-documents-with-tfidf) and [Kavita Ganesan](https://kavita-ganesan.com/tfidftransformer-tfidfvectorizer-usage-differences/#.XZVlcOdKhSw). Dan Sinykin supplemented it with material from Melanie Walsh's chapter [TF-IDF](https://melaniewalsh.github.io/Intro-Cultural-Analytics/features/Text-Analysis/TF-IDF.html) in 2020. Lauren Klein updated it again in 2021, 2022, and 2024.*

