# TF-IDF and similarity scores
> Learn how to compute tf-idf weights and the cosine similarity score between two vectors. You will use these concepts to build a movie recommender and a TED Talk recommender. Finally, you will learn about word embeddings and use word vector representations to compute similarities between various Pink Floyd songs.

- toc: true 
- badges: true
- comments: true
- author: Lucas Nunes
- categories: [Datacamp]
- image: images/datacamp/___

> Note: This is a summary of the chapter 4 exercises of the course "Feature Engineering for NLP in Python" at DataCamp. <br>[Github repo](https://github.com/lnunesAI/Datacamp/) / [Course link](https://www.datacamp.com/tracks/machine-learning-scientist-with-python)

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
plt.rcParams['figure.figsize'] = (8, 8)

## Building tf-idf document vectors

### tf-idf weight of commonly occurring words

<p>The word <code>bottle</code> occurs 5 times in a particular document <code>D</code> and also occurs in every document of the corpus. What is the tf-idf weight of <code>bottle</code> in <code>D</code>?</p>

<pre>
Possible Answers

<b>0</b>

1

Not defined

5

</pre>

**In fact, the tf-idf weight of bottle is 0 in every document. The inverse document frequency of a term is the same for every document in the corpus, and since bottle occurs in every document, its idf is log(1), which is 0. Multiplying any term frequency by 0 gives a weight of 0.**
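
The arithmetic behind this answer can be sketched in a few lines. The snippet below is a minimal illustration that assumes the textbook definition tf × log(N / df); the corpus size of 100 is an invented value, and both function names are hypothetical. For comparison, it also evaluates the smoothed idf formula that scikit-learn documents for its default settings, `ln((1 + N) / (1 + df)) + 1`, under which a word that appears in every document still receives a non-zero weight.

```python
import math

def tfidf_weight(tf, n_docs, doc_freq):
    """Textbook tf-idf: term frequency times log(N / df)."""
    return tf * math.log(n_docs / doc_freq)

def smoothed_idf(n_docs, doc_freq):
    """Smoothed idf as documented for scikit-learn's defaults."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

# 'bottle' occurs 5 times in D and appears in every document.
# The corpus size (100) is an assumed value for illustration.
print(tfidf_weight(tf=5, n_docs=100, doc_freq=100))  # 0.0
print(smoothed_idf(n_docs=100, doc_freq=100))        # 1.0
```

So the answer of 0 holds for the textbook formula used in the question, while the numbers produced by `TfidfVectorizer` in the exercises that follow use the smoothed variant.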

### tf-idf vectors for TED talks

<div class=""><p>In this exercise, you have been given a corpus <code>ted</code> which contains the transcripts of 500 TED Talks. Your task is to generate the tf-idf vectors for these talks.</p>
<p>In a later lesson, we will use these vectors to generate recommendations of similar talks based on the transcript.</p></div>

In [None]:
ted = pd.read_csv('https://github.com/lnunesAI/Datacamp/raw/main/2-machine-learning-scientist-with-python/13-Feature%20Engineering%20for%20NLP%20in%20Python/datasets/ted_500x1.csv')
ted = ted['transcript']

Instructions
<ul>
<li>Import <code>TfidfVectorizer</code> from <code>sklearn</code>.</li>
<li>Create a <code>TfidfVectorizer</code> object. Name it <code>vectorizer</code>.</li>
<li>Generate <code>tfidf_matrix</code> for <code>ted</code>  using the <code>fit_transform()</code> method.</li>
</ul>

In [None]:
# Import TfidfVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

# Create TfidfVectorizer object
vectorizer = TfidfVectorizer()

# Generate matrix of word vectors
tfidf_matrix = vectorizer.fit_transform(ted)

# Print the shape of tfidf_matrix
print(tfidf_matrix.shape)

(500, 29158)


**You now know how to generate tf-idf vectors for a given corpus of text. You can use these vectors to perform predictive modeling just like we did with CountVectorizer. In the next few lessons, we will see another extremely useful application of the vectorized form of documents: generating recommendations.**
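
To make the shape of the resulting matrix less abstract, here is a from-scratch sketch of the same idea on a tiny invented corpus. It is not how `TfidfVectorizer` is implemented: it assumes whitespace tokenization and the unsmoothed tf × log(N / df) weighting, and the function name `tfidf_vectors` is hypothetical.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return (vocabulary, rows) where each row is one document's tf-idf vector."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(tok for toks in tokenized for tok in set(toks))
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        rows.append([counts[t] * math.log(n_docs / doc_freq[t]) for t in vocab])
    return vocab, rows

toy_corpus = ['the cat sat', 'the dog barked', 'the cat purred']
vocab, rows = tfidf_vectors(toy_corpus)
print(len(rows), len(vocab))  # 3 6  (documents x vocabulary size)
```

Each document becomes one row and each vocabulary term one column, which is why the TED matrix above has 500 rows and one column per distinct token.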

## Cosine similarity

### Range of cosine scores

<p>Which of the following is a possible cosine score for a pair of document vectors?</p>

<pre>
Possible Answers

<b>0.86</b>

-0.52

2.36

-1.32

</pre>

**Since document vectors use only non-negative weights, the cosine score lies between 0 and 1.**
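
A small numerical check can make this range concrete. The sketch below assumes the usual definition, the dot product divided by the product of the vector lengths; the vectors are invented and `cosine_score` is a hypothetical helper name.

```python
import numpy as np

def cosine_score(a, b):
    """Cosine of the angle between two vectors: (a . b) / (|a| * |b|)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Non-negative weights only, as in count or tf-idf vectors
no_shared_terms = cosine_score([3, 0, 1], [0, 2, 0])
some_overlap = cosine_score([1, 1, 0], [1, 0, 0])
print(no_shared_terms)          # 0.0
print(round(some_overlap, 4))   # 0.7071

# A negative component makes scores below 0 possible
print(cosine_score([1, 0], [-1, 0]))  # -1.0
```

In general the cosine lies between -1 and 1. The lower bound of 0 for document vectors comes only from the fact that their weights are never negative.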

### Computing dot product

<p>In this exercise, we will learn to compute the dot product between two vectors, A = (1, 3) and B = (-2, 2), using the <code>numpy</code> library. More specifically, we will use the <code>np.dot()</code> function to compute the dot product of two numpy arrays.</p>

Instructions
<ul>
<li>Initialize <code>A</code> (1,3) and <code>B</code> (-2,2) as <code>numpy</code> arrays using <code>np.array()</code>.</li>
<li>Compute the dot product using <code>np.dot()</code> and passing <code>A</code> and <code>B</code> as arguments.</li>
</ul>

In [None]:
# Initialize numpy vectors
A = np.array([1,3])
B = np.array([-2,2])

# Compute dot product
dot_prod = np.dot(A, B)

# Print dot product
print(dot_prod)

4


**The dot product of the two vectors is 1 * -2 + 3 * 2 = 4, which is indeed the output produced. We will not be using np.dot() too much in this course but it can prove to be a helpful function while computing dot products between two standalone vectors.**
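
As a quick cross-check of that arithmetic, the following sketch computes the same dot product three ways: element by element, with `np.dot()`, and with the `@` operator.

```python
import numpy as np

A = np.array([1, 3])
B = np.array([-2, 2])

# Multiply matching components and add them up: 1 * -2 + 3 * 2
manual = int(sum(a * b for a, b in zip(A, B)))

print(manual, int(np.dot(A, B)), int(A @ B))  # 4 4 4
```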

### Cosine similarity matrix of a corpus

<div class=""><p>In this exercise, you have been given a <code>corpus</code>, which is a list containing five sentences. The <code>corpus</code> is printed in the console. You have to compute the cosine similarity matrix which contains the pairwise cosine similarity score for every pair of sentences (vectorized using tf-idf). </p>
<p>Remember, the value corresponding to the ith row and jth column of a similarity matrix denotes the similarity score for the ith and jth vector.</p></div>

In [None]:
from sklearn.metrics.pairwise import cosine_similarity
corpus = ['The sun is the largest celestial body in the solar system', 
          'The solar system consists of the sun and eight revolving planets', 
          'Ra was the Egyptian Sun God', 
          'The Pyramids were the pinnacle of Egyptian architecture', 
          'The quick brown fox jumps over the lazy dog']

Instructions
<ul>
<li>Initialize an instance of <code>TfidfVectorizer</code>. Name it <code>tfidf_vectorizer</code>.</li>
<li>Using <code>fit_transform()</code>, generate the tf-idf vectors for <code>corpus</code>. Name it <code>tfidf_matrix</code>.</li>
<li>Use <code>cosine_similarity()</code> and pass <code>tfidf_matrix</code> to compute the cosine similarity matrix <code>cosine_sim</code>.</li>
</ul>

In [None]:
# Initialize an instance of tf-idf Vectorizer
tfidf_vectorizer = TfidfVectorizer()

# Generate the tf-idf vectors for the corpus
tfidf_matrix = tfidf_vectorizer.fit_transform(corpus)

# Compute and print the cosine similarity matrix
cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)
print(cosine_sim)

[[1.         0.36413198 0.18314713 0.18435251 0.16336438]
 [0.36413198 1.         0.15054075 0.21704584 0.11203887]
 [0.18314713 0.15054075 1.         0.21318602 0.07763512]
 [0.18435251 0.21704584 0.21318602 1.         0.12960089]
 [0.16336438 0.11203887 0.07763512 0.12960089 1.        ]]


**As you will see in a subsequent lesson, computing the cosine similarity matrix lies at the heart of many practical systems such as recommenders. From our similarity matrix, we see that the first and second sentences are the most similar pair. Also, the fifth sentence has, on average, the lowest pairwise cosine scores. This is intuitive, as it contains entities that are not present in the other sentences.**
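
Both observations can be read off the matrix programmatically. The sketch below copies the values printed above into a NumPy array, ignores the diagonal (every sentence has a score of 1 with itself), and then looks up the most similar pair and the row with the lowest average score. The variable names are illustrative.

```python
import numpy as np

# Values copied from the similarity matrix printed above
cosine_sim = np.array([
    [1.0,        0.36413198, 0.18314713, 0.18435251, 0.16336438],
    [0.36413198, 1.0,        0.15054075, 0.21704584, 0.11203887],
    [0.18314713, 0.15054075, 1.0,        0.21318602, 0.07763512],
    [0.18435251, 0.21704584, 0.21318602, 1.0,        0.12960089],
    [0.16336438, 0.11203887, 0.07763512, 0.12960089, 1.0],
])

# Mask the diagonal so a sentence is never matched with itself
off_diag = cosine_sim.copy()
np.fill_diagonal(off_diag, -np.inf)
i, j = np.unravel_index(np.argmax(off_diag), off_diag.shape)
print(int(i), int(j))  # 0 1  (first and second sentence)

# Average score of each sentence against the other four
n = cosine_sim.shape[0]
row_means = (cosine_sim.sum(axis=1) - np.diag(cosine_sim)) / (n - 1)
print(int(np.argmin(row_means)))  # 4  (fifth sentence)
```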

## Comparing linear_kernel and cosine_similarity

<div class=""><p>In this exercise, you have been given <code>tfidf_matrix</code> which contains the tf-idf vectors of a thousand documents. Your task is to generate the cosine similarity matrix for these vectors first using <code>cosine_similarity</code> and then, using <code>linear_kernel</code>. </p>
<p>We will then compare the computation times for both functions.</p></div>

In [7]:
import time
from sklearn.metrics.pairwise import linear_kernel

Instructions 1/2
<p>Compute the cosine similarity matrix for <code>tfidf_matrix</code> using <code>cosine_similarity</code>.</p>

In [None]:
# Record start time
start = time.time()

# Compute cosine similarity matrix
cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)

# Print cosine similarity matrix
print(cosine_sim)

# Print time taken
print("Time taken: %s seconds" %(time.time() - start))

[[1.         0.36413198 0.18314713 0.18435251 0.16336438]
 [0.36413198 1.         0.15054075 0.21704584 0.11203887]
 [0.18314713 0.15054075 1.         0.21318602 0.07763512]
 [0.18435251 0.21704584 0.21318602 1.         0.12960089]
 [0.16336438 0.11203887 0.07763512 0.12960089 1.        ]]
Time taken: 0.007631778717041016 seconds


Instructions 2/2
<p>Compute the cosine similarity matrix for <code>tfidf_matrix</code> using <code>linear_kernel</code>.</p>

In [None]:
# Record start time
start = time.time()

# Compute cosine similarity matrix
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)

# Print cosine similarity matrix
print(cosine_sim)

# Print time taken
print("Time taken: %s seconds" %(time.time() - start))

[[1.         0.36413198 0.18314713 0.18435251 0.16336438]
 [0.36413198 1.         0.15054075 0.21704584 0.11203887]
 [0.18314713 0.15054075 1.         0.21318602 0.07763512]
 [0.18435251 0.21704584 0.21318602 1.         0.12960089]
 [0.16336438 0.11203887 0.07763512 0.12960089 1.        ]]
Time taken: 0.003402233123779297 seconds


**Notice how both linear_kernel and cosine_similarity produced the same result. The results match because TfidfVectorizer scales every vector to unit length by default, so the plain dot product computed by linear_kernel already equals the cosine score. However, linear_kernel took less time to execute. When you're working with a very large amount of data and your vectors are in the tf-idf representation, it is good practice to default to linear_kernel to improve performance. (NOTE: In case you see linear_kernel taking more time, it's because the dataset we're dealing with is extremely small and a single time.time() measurement is too noisy to capture such minute time differences reliably.)**
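
The reason the two functions agree can be checked with NumPy alone. `TfidfVectorizer` L2-normalizes each row by default (`norm='l2'`), and for unit-length vectors the dot product and the cosine score coincide. The sketch below uses a random non-negative matrix as a stand-in for tf-idf vectors; it illustrates the identity and is not scikit-learn's implementation.

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.random((6, 10))  # stand-in for 6 document vectors, assumed data

# Scale every row to unit length, as L2 normalization does
X_unit = X / np.linalg.norm(X, axis=1, keepdims=True)

# Plain dot products between normalized rows (what a linear kernel computes)
dot_products = X_unit @ X_unit.T

# Full cosine formula on the unnormalized rows
norms = np.linalg.norm(X, axis=1)
cosine = (X @ X.T) / np.outer(norms, norms)

print(np.allclose(dot_products, cosine))  # True
```

Skipping the normalization step on data that is already normalized is where the time saving comes from.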

### Plot recommendation engine

<div class=""><p>In this exercise, we will build a recommendation engine that suggests movies based on similarity of plot lines. You have been given a <code>get_recommendations()</code> function that takes in the title of a movie, a similarity matrix and an <code>indices</code> series as its arguments and outputs a list of most similar movies. <code>indices</code> has already been provided to you.</p>
<p>You have also been given a <code>movie_plots</code> Series that contains the plot lines of several movies. Your task is to generate a cosine similarity matrix for the tf-idf vectors of these plots.</p>
<p>Finally, we will check the potency of our engine by generating recommendations for one of my favorite movies, The Dark Knight Rises.</p></div>

In [46]:
metadata = pd.read_csv('https://raw.githubusercontent.com/lnunesAI/Datacamp/main/2-machine-learning-scientist-with-python/13-Feature%20Engineering%20for%20NLP%20in%20Python/datasets/movies_metadata.csv').dropna().reset_index(drop=True)
indices = pd.Series(metadata.index, index=metadata['title']).drop_duplicates()
movie_plots = metadata['overview']

In [32]:
def get_recommendations(title, cosine_sim, indices):
    # Get the index of the movie that matches the title
    idx = indices[title]
    # Get the pairwise similarity scores
    sim_scores = list(enumerate(cosine_sim[idx]))
    # Sort the movies based on the similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    # Get the scores for 10 most similar movies
    sim_scores = sim_scores[1:11]
    # Get the movie indices
    movie_indices = [i[0] for i in sim_scores]
    # Return the top 10 most similar movies
    return metadata['title'].iloc[movie_indices]

Instructions
<ul>
<li>Initialize a <code>TfidfVectorizer</code> with English <code>stop_words</code>. Name it <code>tfidf</code>.</li>
<li>Construct <code>tfidf_matrix</code> by fitting and transforming the movie plot data using <code>fit_transform()</code>.</li>
<li>Generate the cosine similarity matrix <code>cosine_sim</code> using <code>tfidf_matrix</code>. Don't use <code>cosine_similarity()</code>!</li>
<li>Use <code>get_recommendations()</code> to generate recommendations for <code>'The Dark Knight Rises'</code>.</li>
</ul>

In [48]:
# Initialize the TfidfVectorizer 
tfidf = TfidfVectorizer(stop_words='english')

# Construct the TF-IDF matrix
tfidf_matrix = tfidf.fit_transform(movie_plots)

# Generate the cosine similarity matrix
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)
 
# Generate recommendations 
print(get_recommendations('The Dark Knight Rises', cosine_sim, indices))

1                              Batman Forever
2                                      Batman
8                  Batman: Under the Red Hood
3                              Batman Returns
9                            Batman: Year One
10    Batman: The Dark Knight Returns, Part 1
11    Batman: The Dark Knight Returns, Part 2
5                Batman: Mask of the Phantasm
7                               Batman Begins
4                              Batman & Robin
Name: title, dtype: object


**You've just built your very first recommendation system. Notice how the recommender correctly identifies 'The Dark Knight Rises' as a Batman movie and recommends other Batman movies as a result. This system is, of course, very primitive, and there are a host of ways in which it could be improved. One method would be to look at the cast, crew and genre in addition to the plot to generate recommendations. We will not be covering this in this course, but you have all the tools necessary to accomplish this. Do give it a try!**

### The recommender function

<div class=""><p>In this exercise, we will build a recommender function <code>get_recommendations()</code>, as discussed in the lesson and the previous exercise. As we know, it takes in a title, a cosine similarity matrix, and a mapping from movie titles to indices as arguments, and outputs a list of the 10 titles most similar to the original title (excluding the title itself).</p>
<p>You have been given a dataset <code>metadata</code> that consists of the movie titles and overviews. The head of this dataset has been printed to console.</p></div>

Instructions
<ul>
<li>Get index of the movie that matches the title by using the <code>title</code> key of <code>indices</code>.</li>
<li>Extract the ten most similar movies from <code>sim_scores</code> and store it back in <code>sim_scores</code>.</li>
</ul>

In [62]:
# Generate mapping between titles and index
indices = pd.Series(metadata.index, index=metadata['title']).drop_duplicates()

def get_recommendations(title, cosine_sim, indices):
    # Get index of movie that matches title
    idx = indices[title]
    # Sort the movies based on the similarity scores
    sim_scores = list(enumerate(cosine_sim[idx]))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    # Get the scores for 10 most similar movies
    sim_scores = sim_scores[1:11]
    # Get the movie indices
    movie_indices = [i[0] for i in sim_scores]
    # Return the top 10 most similar movies
    return metadata['title'].iloc[movie_indices]

**With this recommender function in our toolkit, we are now in a very good place to build the rest of the components of our recommendation engine.**
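
One design detail is worth a second look: the function above drops position 0 of the sorted list on the assumption that a movie's best match is always itself. A possible variant, sketched below with NumPy's `argsort` on an invented toy matrix, excludes the query index explicitly instead. `top_n_similar` and the toy titles are hypothetical names, and the sketch returns a plain list rather than a pandas Series.

```python
import numpy as np

def top_n_similar(titles, cosine_sim, idx, n=10):
    """Return the n titles with the highest similarity to item idx."""
    scores = np.asarray(cosine_sim[idx], dtype=float).copy()
    scores[idx] = -np.inf                 # never recommend the item itself
    top = np.argsort(scores)[::-1][:n]    # indices sorted by descending score
    return [titles[i] for i in top]

# Toy data, invented for illustration
titles = ['A', 'B', 'C', 'D']
toy_sim = np.array([
    [1.0, 0.2, 0.9, 0.5],
    [0.2, 1.0, 0.1, 0.3],
    [0.9, 0.1, 1.0, 0.4],
    [0.5, 0.3, 0.4, 1.0],
])
print(top_n_similar(titles, toy_sim, idx=0, n=2))  # ['C', 'D']
```

The explicit exclusion matters when another item ties with the query at a score of 1, for example two entries with identical overviews, because sorting alone does not guarantee which of them lands in position 0.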

### TED talk recommender

<div class=""><p>In this exercise, we will build a recommendation system that suggests TED Talks based on their transcripts. You have been given a <code>get_recommendations()</code> function that takes in the title of a talk, a similarity matrix and an <code>indices</code> series as its arguments, and outputs a list of most similar talks. <code>indices</code> has already been provided to you.</p>
<p>You have also been given a <code>transcripts</code> series that contains the transcripts of around 500 TED talks. Your task is to generate a cosine similarity matrix for the tf-idf vectors of the talk transcripts.</p>
<p>Finally, we will generate recommendations for a talk titled '5 ways to kill your dreams' by Brazilian entrepreneur Bel Pesce.</p></div>

In [91]:
ted = pd.read_csv('https://github.com/lnunesAI/Datacamp/raw/main/2-machine-learning-scientist-with-python/13-Feature%20Engineering%20for%20NLP%20in%20Python/datasets/ted_500x2.csv')
indices = pd.Series(ted.index, index=ted['title'])
transcripts = ted['transcript']

Instructions
<ul>
<li>Initialize a <code>TfidfVectorizer</code> with English stopwords. Name it <code>tfidf</code>.</li>
<li>Construct <code>tfidf_matrix</code> by fitting and transforming <code>transcripts</code>.</li>
<li>Generate the cosine similarity matrix <code>cosine_sim</code> using <code>tfidf_matrix</code>.</li>
<li>Use <code>get_recommendations()</code> to generate recommendations for '5 ways to kill your dreams'.</li>
</ul>

In [92]:
# Initialize the TfidfVectorizer
tfidf = TfidfVectorizer(stop_words='english')

# Construct the TF-IDF matrix
tfidf_matrix = tfidf.fit_transform(transcripts)

# Generate the cosine similarity matrix
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)

# Generate recommendations
print(get_recommendations('5 ways to kill your dreams', cosine_sim, indices))

453             Success is a continuous journey
157                        Why we do what we do
494                   How to find work you love
149          My journey into movies that matter
447                        One Laptop per Child
230             How to get your ideas to spread
497         Plug into your hard-wired happiness
495    Why you will fail to have a great career
179             Be suspicious of simple stories
53                          To upgrade is human
Name: title, dtype: object


**You have successfully built a TED talk recommender. This recommender works surprisingly well despite being trained only on a small subset of TED talks. In fact, three of the talks recommended by our system are also recommended by the official TED website as talks to watch next after '5 ways to kill your dreams'!**

## Beyond n-grams: word embeddings

### Generating word vectors

<p>In this exercise, we will generate the pairwise similarity scores of all the words in a sentence. The sentence is available as <code>sent</code> and has been printed to the console for your convenience.</p>

In [None]:
!python -m spacy download en_core_web_lg

In [1]:
# Restart the Colab runtime after the download so the model can be loaded
import spacy
nlp = spacy.load('en_core_web_lg')
sent = 'I like apples and oranges'

Instructions
<ul>
<li>Create a <code>Doc</code> object <code>doc</code> for <code>sent</code>.</li>
<li>In the nested loop, compute the similarity between <code>token1</code> and <code>token2</code>.</li>
</ul>

In [2]:
# Create the doc object
doc = nlp(sent)

# Compute pairwise similarity scores
for token1 in doc:
  for token2 in doc:
    print(token1.text, token2.text, token1.similarity(token2))

I I 1.0
I like 0.55549127
I apples 0.20442723
I and 0.31607857
I oranges 0.18824081
like I 0.55549127
like like 1.0
like apples 0.32987145
like and 0.5267485
like oranges 0.27717474
apples I 0.20442723
apples like 0.32987145
apples apples 1.0
apples and 0.2409773
apples oranges 0.77809423
and I 0.31607857
and like 0.5267485
and apples 0.2409773
and and 1.0
and oranges 0.19245945
oranges I 0.18824081
oranges like 0.27717474
oranges apples 0.77809423
oranges and 0.19245945
oranges oranges 1.0


**Notice how the words 'apples' and 'oranges' have the highest pairwise similarity score of any two distinct words. This is expected, as they are both fruits and are more related to each other than any other pair of words.**

### Computing similarity of Pink Floyd songs

<div class=""><p>In this final exercise, you have been given lyrics of three songs by the British band Pink Floyd, namely 'High Hopes', 'Hey You' and 'Mother'. The lyrics to these songs are available as <code>hopes</code>, <code>hey</code> and <code>mother</code> respectively.</p>
<p>Your task is to compute the pairwise similarity between <code>mother</code> and <code>hopes</code>, and <code>mother</code> and <code>hey</code>.</p></div>

In [4]:
mother = "\nMother do you think they'll drop the bomb?\nMother do you think they'll like this song?\nMother do you think they'll try to break my balls?\nOoh, ah\nMother should I build the wall?\nMother should I run for President?\nMother should I trust the government?\nMother will they put me in the firing mine?\nOoh ah,\nIs it just a waste of time?\nHush now baby, baby, don't you cry.\nMama's gonna make all your nightmares come true.\nMama's gonna put all her fears into you.\nMama's gonna keep you right here under her wing.\nShe won't let you fly, but she might let you sing.\nMama's gonna keep baby cozy and warm.\nOoh baby, ooh baby, ooh baby,\nOf course mama's gonna help build the wall.\nMother do you think she's good enough, for me?\nMother do you think she's dangerous, to me?\nMother will she tear your little boy apart?\nOoh ah,\nMother will she break my heart?\nHush now baby, baby don't you cry.\nMama's gonna check out all your girlfriends for you.\nMama won't let anyone dirty get through.\nMama's gonna wait up until you get in.\nMama will always find out where you've been.\nMama's gonna keep baby healthy and clean.\nOoh baby, ooh baby, ooh baby,\nYou'll always be baby to me.\nMother, did it need to be so high?\n"
hopes = "\nBeyond the horizon of the place we lived when we were young\nIn a world of magnets and miracles\nOur thoughts strayed constantly and without boundary\nThe ringing of the division bell had begun\nAlong the Long Road and on down the Causeway\nDo they still meet there by the Cut\nThere was a ragged band that followed in our footsteps\nRunning before times took our dreams away\nLeaving the myriad small creatures trying to tie us to the ground\nTo a life consumed by slow decay\nThe grass was greener\nThe light was brighter\nWhen friends surrounded\nThe nights of wonder\nLooking beyond the embers of bridges glowing behind us\nTo a glimpse of how green it was on the other side\nSteps taken forwards but sleepwalking back again\nDragged by the force of some in a tide\nAt a higher altitude with flag unfurled\nWe reached the dizzy heights of that dreamed of world\nEncumbered forever by desire and ambition\nThere's a hunger still unsatisfied\nOur weary eyes still stray to the horizon\nThough down this road we've been so many times\nThe grass was greener\nThe light was brighter\nThe taste was sweeter\nThe nights of wonder\nWith friends surrounded\nThe dawn mist glowing\nThe water flowing\nThe endless river\nForever and ever\n"
hey = "\nHey you, out there in the cold\nGetting lonely, getting old\nCan you feel me?\nHey you, standing in the aisles\nWith itchy feet and fading smiles\nCan you feel me?\nHey you, don't help them to bury the light\nDon't give in without a fight\nHey you out there on your own\nSitting naked by the phone\nWould you touch me?\nHey you with you ear against the wall\nWaiting for someone to call out\nWould you touch me?\nHey you, would you help me to carry the stone?\nOpen your heart, I'm coming home\nBut it was only fantasy\nThe wall was too high\nAs you can see\nNo matter how he tried\nHe could not break free\nAnd the worms ate into his brain\nHey you, out there on the road\nAlways doing what you're told\nCan you help me?\nHey you, out there beyond the wall\nBreaking bottles in the hall\nCan you help me?\nHey you, don't tell me there's no hope at all\nTogether we stand, divided we fall\n"

Instructions
<ul>
<li>Create <code>Doc</code> objects for <code>mother</code>, <code>hopes</code> and <code>hey</code>.</li>
<li>Compute the similarity between <code>mother</code> and <code>hopes</code>.</li>
<li>Compute the similarity between <code>mother</code> and <code>hey</code>.</li>
</ul>

In [5]:
# Create Doc objects
mother_doc = nlp(mother)
hopes_doc = nlp(hopes)
hey_doc = nlp(hey)

# Print similarity between mother and hopes
print(mother_doc.similarity(hopes_doc))

# Print similarity between mother and hey
print(mother_doc.similarity(hey_doc))

0.8653562508450858
0.9595267703981097


**Notice that 'Mother' and 'Hey You' have a similarity score of about 0.96, whereas 'Mother' and 'High Hopes' have a lower score of about 0.87. This is probably because 'Mother' and 'Hey You' were both songs from the same album, 'The Wall', and were penned by Roger Waters. On the other hand, 'High Hopes' was a part of the album 'The Division Bell', with lyrics by David Gilmour and his wife, Polly Samson. Treat yourself by listening to these songs. They're some of the best!**
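
It helps to know what `Doc.similarity()` measures. According to the spaCy documentation, the default estimate is the cosine similarity of the two documents' vectors, and a document's vector defaults to the average of its token vectors. The sketch below mimics that idea with invented 3-dimensional embeddings; the vectors and helper names are hypothetical and far smaller than the 300-dimensional vectors shipped with `en_core_web_lg`.

```python
import numpy as np

# Hypothetical 3-dimensional embeddings, invented for illustration only
toy_vectors = {
    'mother': np.array([0.9, 0.1, 0.2]),
    'baby':   np.array([0.8, 0.2, 0.1]),
    'wall':   np.array([0.1, 0.9, 0.3]),
    'river':  np.array([0.2, 0.3, 0.9]),
}

def doc_vector(tokens):
    """Average the token vectors to get a single document vector."""
    return np.mean([toy_vectors[t] for t in tokens], axis=0)

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

doc_a = doc_vector(['mother', 'baby', 'wall'])
doc_b = doc_vector(['mother', 'wall'])
doc_c = doc_vector(['river'])

# Documents sharing vocabulary end up with closer average vectors
print(cosine(doc_a, doc_b) > cosine(doc_a, doc_c))  # True
```

Because every word in a text contributes to the average, long texts that share many common words tend to receive high scores, which may be part of why both song pairs above score well over 0.8.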