Learn how to compute tf-idf weights and the cosine similarity score between two vectors. You will use these concepts to build a movie recommender and a TED Talk recommender. Finally, you will learn about word embeddings and use word vector representations to compute similarities between various Pink Floyd songs.

# Building tf-idf document vectors

## tf-idf weight of commonly occurring words
The word bottle occurs 5 times in a particular document D and also occurs in every document of the corpus. What is the tf-idf weight of bottle in D?
> Correct! In fact, the tf-idf weight for bottle in every document will be 0. This is because the inverse document frequency is constant across documents in a corpus and since bottle occurs in every document, its value is log(1), which is 0.
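
The arithmetic behind this answer can be sketched in a few lines. This is a minimal illustration of the textbook formula (raw term count times log(N / df)); the corpus size of 100 is a hypothetical value, since the question does not state one, and the helper name `tf_idf` is invented for this example.

```python
import math

def tf_idf(term_count, n_docs, doc_freq):
    """Textbook tf-idf: raw term count times log(N / df)."""
    return term_count * math.log(n_docs / doc_freq)

# 'bottle' occurs 5 times in D and in every document
# of a hypothetical 100-document corpus
weight = tf_idf(term_count=5, n_docs=100, doc_freq=100)
print(weight)  # 0.0
```

Note that scikit-learn's TfidfVectorizer uses a smoothed idf by default (it adds 1 to the idf), so a word that appears in every document gets a small non-zero weight there rather than exactly 0.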

## tf-idf vectors for TED talks

In this exercise, you have been given a corpus ted which contains the transcripts of 500 TED Talks. Your task is to generate the tf-idf vectors for these talks.

In a later lesson, we will use these vectors to generate recommendations of similar talks based on the transcript.
- Import TfidfVectorizer from sklearn.
- Create a TfidfVectorizer object. Name it vectorizer.
- Generate tfidf_matrix for ted using the fit_transform() method.

In [19]:
import pandas as pd
df1 = pd.read_csv('../datasets/ted.csv')

ted = df1.transcript
df1.head()

Unnamed: 0,transcript,url
0,"We're going to talk — my — a new lecture, just...",https://www.ted.com/talks/al_seckel_says_our_b...
1,"This is a representation of your brain, and yo...",https://www.ted.com/talks/aaron_o_connell_maki...
2,It's a great honor today to share with you The...,https://www.ted.com/talks/carter_emmart_demos_...
3,"My passions are music, technology and making t...",https://www.ted.com/talks/jared_ficklin_new_wa...
4,It used to be that if you wanted to get a comp...,https://www.ted.com/talks/jeremy_howard_the_wo...


In [5]:
# Import TfidfVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

# Create TfidfVectorizer object
vectorizer = TfidfVectorizer()

# Generate matrix of word vectors
tfidf_matrix = vectorizer.fit_transform(ted)

# Print the shape of tfidf_matrix
print(tfidf_matrix.shape)

(500, 29158)


Good job! You now know how to generate tf-idf vectors for a given corpus of text. You can use these vectors to perform predictive modeling just like we did with CountVectorizer. In the next few lessons, we will see another extremely useful application of the vectorized form of documents: generating recommendations.

# Cosine similarity
- Cosine Similarity is one of the most popular metrics in NLP.
- It measures how similar two vectors are in direction, using the cosine of the angle between them:

![](https://github.com/Shoklan/datacamp/raw/master/NLPFeatureEngineeringPython/images/Cosine-Similarity.png)
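
For reference, the standard definition of the cosine score between two vectors A and B (the symbols match the dot product exercise in this section) can be written as:

$$
\cos(\theta) = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert} = \frac{\sum_i A_i B_i}{\sqrt{\sum_i A_i^2} \, \sqrt{\sum_i B_i^2}}
$$

The numerator is the dot product and the denominator is the product of the two Euclidean lengths, so the score depends only on the angle between the vectors and not on their magnitudes.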

## Range of cosine scores
Which of the following is a possible cosine score for a pair of document vectors?
> Great job! Since document vectors use only non-negative weights, the cosine score lies between 0 and 1.

## Computing dot product

In this exercise, we will learn to compute the dot product between two vectors, A = (1, 3) and B = (-2, 2), using the numpy library. More specifically, we will use the np.dot() function to compute the dot product of two numpy arrays.

- Initialize A (1,3) and B (-2,2) as numpy arrays using np.array().
- Compute the dot product using np.dot() and passing A and B as arguments.

In [7]:
import numpy as np
# Initialize numpy vectors
A = np.array([1,3])
B = np.array([-2,2])

# Compute dot product
dot_prod = np.dot(A, B)

# Print dot product
print(dot_prod)

4


Good job! The dot product of the two vectors is 1 * -2 + 3 * 2 = 4, which is indeed the output produced. We will not be using np.dot() too much in this course but it can prove to be a helpful function while computing dot products between two standalone vectors.
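
As a small extension, the same two vectors can be used to compute a cosine score by dividing the dot product by the product of the vector lengths. This is a sketch using np.linalg.norm, which is not part of the original exercise.

```python
import numpy as np

# Same vectors as in the exercise
A = np.array([1, 3])
B = np.array([-2, 2])

# Cosine score = dot product / (length of A * length of B)
cosine = np.dot(A, B) / (np.linalg.norm(A) * np.linalg.norm(B))
print(round(cosine, 4))  # 0.4472
```

Here the lengths are sqrt(10) and sqrt(8), so the score is 4 / sqrt(80), roughly 0.4472.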

## Cosine similarity matrix of a corpus

In this exercise, you have been given a corpus, which is a list containing five sentences. The corpus is printed in the console. You have to compute the cosine similarity matrix which contains the pairwise cosine similarity score for every pair of sentences (vectorized using tf-idf).

Remember, the value corresponding to the ith row and jth column of a similarity matrix denotes the similarity score for the ith and jth vector.

- Initialize an instance of TfidfVectorizer. Name it tfidf_vectorizer.
- Using fit_transform(), generate the tf-idf vectors for corpus. Name it tfidf_matrix.
- Use cosine_similarity() and pass tfidf_matrix to compute the cosine similarity matrix cosine_sim.

In [11]:
# Import the cosine_similarity
from sklearn.metrics.pairwise import cosine_similarity

corpus = ['The sun is the largest celestial body in the solar system', 'The solar system consists of the sun and eight revolving planets', 'Ra was the Egyptian Sun God', 'The Pyramids were the pinnacle of Egyptian architecture', 'The quick brown fox jumps over the lazy dog']

In [12]:
# Initialize an instance of tf-idf Vectorizer
tfidf_vectorizer = TfidfVectorizer()

# Generate the tf-idf vectors for the corpus
tfidf_matrix = tfidf_vectorizer.fit_transform(corpus)

# Compute and print the cosine similarity matrix
cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)
print(cosine_sim)

[[1.         0.36413198 0.18314713 0.18435251 0.16336438]
 [0.36413198 1.         0.15054075 0.21704584 0.11203887]
 [0.18314713 0.15054075 1.         0.21318602 0.07763512]
 [0.18435251 0.21704584 0.21318602 1.         0.12960089]
 [0.16336438 0.11203887 0.07763512 0.12960089 1.        ]]


Great work! As you will see in a subsequent lesson, computing the cosine similarity matrix lies at the heart of many practical systems such as recommenders. From our similarity matrix, we see that the first and the second sentence are the most similar. Also the fifth sentence has, on average, the lowest pairwise cosine scores. This is intuitive as it contains entities that are not present in the other sentences.
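
The two observations above can be checked directly from the matrix. The sketch below copies the printed values into a numpy array (rather than recomputing them) and looks for the largest off-diagonal score and the row with the lowest average score; the variable names are chosen for this example.

```python
import numpy as np

# Values copied from the printed cosine similarity matrix
cosine_sim = np.array([
    [1.0,        0.36413198, 0.18314713, 0.18435251, 0.16336438],
    [0.36413198, 1.0,        0.15054075, 0.21704584, 0.11203887],
    [0.18314713, 0.15054075, 1.0,        0.21318602, 0.07763512],
    [0.18435251, 0.21704584, 0.21318602, 1.0,        0.12960089],
    [0.16336438, 0.11203887, 0.07763512, 0.12960089, 1.0],
])

# Mask the diagonal so a sentence is never matched with itself
off_diag = cosine_sim.copy()
np.fill_diagonal(off_diag, -np.inf)
i, j = np.unravel_index(np.argmax(off_diag), off_diag.shape)
print("Most similar pair (0-based):", i, j)

# Average score of each sentence against the other four
n = cosine_sim.shape[0]
mean_scores = (cosine_sim.sum(axis=1) - np.diag(cosine_sim)) / (n - 1)
print("Lowest average score (0-based):", np.argmin(mean_scores))
```

The pair (0, 1) corresponds to the first and second sentences, and index 4 corresponds to the fifth sentence.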

# Building a plot line based recommender

## Comparing linear_kernel and cosine_similarity

In this exercise, you have been given tfidf_matrix which contains the tf-idf vectors of a thousand documents. Your task is to generate the cosine similarity matrix for these vectors first using cosine_similarity and then, using linear_kernel.

We will then compare the computation times for both functions.
1. Compute the cosine similarity matrix for tfidf_matrix using cosine_similarity.


In [16]:
import time

# Import linear_kernel (cosine_similarity was imported above)
from sklearn.metrics.pairwise import linear_kernel

In [14]:
# Record start time
start = time.time()

# Compute cosine similarity matrix
cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)

# Print cosine similarity matrix
print(cosine_sim)

# Print time taken
print("Time taken: %s seconds" %(time.time() - start))

[[1.         0.36413198 0.18314713 0.18435251 0.16336438]
 [0.36413198 1.         0.15054075 0.21704584 0.11203887]
 [0.18314713 0.15054075 1.         0.21318602 0.07763512]
 [0.18435251 0.21704584 0.21318602 1.         0.12960089]
 [0.16336438 0.11203887 0.07763512 0.12960089 1.        ]]
Time taken: 0.01328730583190918 seconds


2. Compute the cosine similarity matrix for tfidf_matrix using linear_kernel.

In [17]:
# Record start time
start = time.time()

# Compute cosine similarity matrix
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)

# Print cosine similarity matrix
print(cosine_sim)

# Print time taken
print("Time taken: %s seconds" %(time.time() - start))

[[1.         0.36413198 0.18314713 0.18435251 0.16336438]
 [0.36413198 1.         0.15054075 0.21704584 0.11203887]
 [0.18314713 0.15054075 1.         0.21318602 0.07763512]
 [0.18435251 0.21704584 0.21318602 1.         0.12960089]
 [0.16336438 0.11203887 0.07763512 0.12960089 1.        ]]
Time taken: 0.004376649856567383 seconds


Good job! Notice how both linear_kernel and cosine_similarity produced the same result. However, linear_kernel took less time to execute. When you're working with a very large amount of data and your vectors are in the tf-idf representation, it is good practice to default to linear_kernel to improve performance. (NOTE: If you see linear_kernel taking more time, it's because the dataset we're dealing with is extremely small and a single time.time() measurement cannot capture such minute time differences accurately.)
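
The two functions agree because TfidfVectorizer, with its default norm='l2' setting, scales every document vector to unit Euclidean length, and for unit-length vectors the plain dot product computed by linear_kernel already equals the cosine score. The sketch below checks this on three sentences borrowed from the earlier corpus; it is an illustration, not part of the original exercise.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

corpus = [
    'The sun is the largest celestial body in the solar system',
    'The solar system consists of the sun and eight revolving planets',
    'Ra was the Egyptian Sun God',
]
tfidf_matrix = TfidfVectorizer().fit_transform(corpus)

# With the default norm='l2', every row has Euclidean length 1
row_norms = np.linalg.norm(tfidf_matrix.toarray(), axis=1)
print(np.allclose(row_norms, 1.0))

# So the dot-product kernel and the cosine similarity coincide
same = np.allclose(cosine_similarity(tfidf_matrix), linear_kernel(tfidf_matrix))
print(same)
```

If the vectorizer were created with norm=None, the rows would no longer have unit length and linear_kernel would return raw dot products instead of cosine scores.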

## Plot recommendation engine

In this exercise, we will build a recommendation engine that suggests movies based on similarity of plot lines. You have been given a get_recommendations() function that takes in the title of a movie, a similarity matrix and an indices series as its arguments and outputs a list of most similar movies. indices has already been provided to you.

You have also been given a movie_plots Series that contains the plot lines of several movies. Your task is to generate a cosine similarity matrix for the tf-idf vectors of these plots.

We will then check the potency of our engine by generating recommendations for one of my favorite movies, The Dark Knight Rises.

- Initialize a TfidfVectorizer with English stop_words. Name it tfidf.
- Construct tfidf_matrix by fitting and transforming the movie plot data using fit_transform().
- Generate the cosine similarity matrix cosine_sim using tfidf_matrix. Don't use cosine_similarity()!
- Use get_recommendations() to generate recommendations for 'The Dark Knight Rises'.

### Prepare the data to match the DataCamp setup as closely as possible

In [105]:
df2 = pd.read_csv('../datasets/movie_overviews.csv')
# df2 = df2.dropna(subset=['overview'])
df2 = df2.dropna(subset=['tagline'])
df2.isnull().sum()

id          0
title       0
overview    0
tagline     0
dtype: int64

In [106]:
df2

Unnamed: 0,id,title,overview,tagline
1,8844,Jumanji,When siblings Judy and Peter discover an encha...,Roll the dice and unleash the excitement!
2,15602,Grumpier Old Men,A family wedding reignites the ancient feud be...,Still Yelling. Still Fighting. Still Ready for...
3,31357,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom...",Friends are the people who let you be yourself...
4,11862,Father of the Bride Part II,Just when George Banks has recovered from his ...,Just When His World Is Back To Normal... He's ...
5,949,Heat,"Obsessive master thief, Neil McCauley leads a ...",A Los Angeles Crime Saga
...,...,...,...,...
9091,390734,Kingsglaive: Final Fantasy XV,The magical kingdom of Lucis is home to the wo...,Kingsglaive: Final Fantasy XV
9093,390989,Sharknado 4: The 4th Awakens,The new installment of the Sharknado franchise...,"What happens in Vegas, stays in Vegas. Unless ..."
9095,392572,Rustom,"Rustom Pavri, an honourable officer of the Ind...",Decorated Officer. Devoted Family Man. Defendi...
9097,315011,Shin Godzilla,From the mind behind Evangelion comes a hit la...,A god incarnate. A city doomed.


In [110]:
# https://stackoverflow.com/questions/19851005/rename-pandas-dataframe-index
# movie_plots = df2.overview
movie_plots = df2.tagline
metadata = df2

df_temp = df2.copy()
# df_temp.index.names = ['new_id']
df_temp = df_temp.reset_index(drop=True)
df_temp = df_temp.reset_index()
df_temp = df_temp.rename(columns={'index': 'new_id'})
df_temp = df_temp[['title' ,'new_id']]
df_temp = df_temp.set_index('title')
indices = df_temp['new_id']
indices
# df_temp

title
Jumanji                                                  0
Grumpier Old Men                                         1
Waiting to Exhale                                        2
Father of the Bride Part II                              3
Heat                                                     4
                                                      ... 
Kingsglaive: Final Fantasy XV                         7028
Sharknado 4: The 4th Awakens                          7029
Rustom                                                7030
Shin Godzilla                                         7031
The Beatles: Eight Days a Week - The Touring Years    7032
Name: new_id, Length: 7033, dtype: int64

In [108]:
def get_recommendations(title, cosine_sim, indices):
    # Get the index of the movie that matches the title
    idx = indices[title]
    # Get the pairwise similarity scores
    sim_scores = list(enumerate(cosine_sim[idx]))
    # Sort the movies based on the similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    # Get the scores for 10 most similar movies
    sim_scores = sim_scores[1:11]
    # Get the movie indices
    movie_indices = [i[0] for i in sim_scores]
    # Return the top 10 most similar movies
    return metadata['title'].iloc[movie_indices]

In [111]:
# Initialize the TfidfVectorizer 
tfidf = TfidfVectorizer(stop_words='english')

# Construct the TF-IDF matrix
tfidf_matrix = tfidf.fit_transform(movie_plots)

# Generate the cosine similarity matrix
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)
 
# Generate recommendations 
print(get_recommendations('The Dark Knight Rises', cosine_sim, indices))

7827    Harry Potter and the Deathly Hallows: Part 2
8305                           The Hangover Part III
8749                                         Taken 3
1662                                       Tall Tale
3056                                     Coyote Ugly
8618                                        Hercules
8266                                          Stoker
1374                                       Dark City
1594                                  The Black Hole
7072                                           Saw V
Name: title, dtype: object


Congratulations! You've just built your very first recommendation system. In the original exercise, which vectorizes plot overviews, the recommender picks up on 'The Dark Knight Rises' being a Batman movie and recommends other Batman movies as a result. The output above was generated from taglines rather than overviews, which likely explains why no Batman titles appear in it. This system is, of course, very primitive and there are a host of ways in which it could be improved. One method would be to look at the cast, crew and genre in addition to the plot to generate recommendations. We will not be covering this in this course but you have all the tools necessary to accomplish this. Do give it a try!

## The recommender function

In this exercise, we will build a recommender function get_recommendations(), as discussed in the lesson and the previous exercise. As we know, it takes in a title, a cosine similarity matrix, and a movie title and index mapping as arguments and outputs a list of 10 titles most similar to the original title (excluding the title itself).

You have been given a dataset metadata that consists of the movie titles and overviews. The head of this dataset has been printed to console.

- Get index of the movie that matches the title by using the title key of indices.
- Extract the ten most similar movies from sim_scores and store it back in sim_scores.

In [112]:
# Generate mapping between titles and index
# (this assumes metadata has a default 0..n-1 index, so each label matches a row of cosine_sim)
indices = pd.Series(metadata.index, index=metadata['title']).drop_duplicates()

def get_recommendations(title, cosine_sim, indices):
    # Get the index of the movie that matches the title
    idx = indices[title]
    # Get the pairwise similarity scores
    sim_scores = list(enumerate(cosine_sim[idx]))
    # Sort the movies based on the similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    # Get the scores for 10 most similar movies
    sim_scores = sim_scores[1:11]
    # Get the movie indices
    movie_indices = [i[0] for i in sim_scores]
    # Return the top 10 most similar movies
    return metadata['title'].iloc[movie_indices]

Good job! With this recommender function in our toolkit, we are now in a very good place to build the rest of the components of our recommendation engine.

## TED talk recommender

In this exercise, we will build a recommendation system that suggests TED Talks based on their transcripts. You have been given a get_recommendations() function that takes in the title of a talk, a similarity matrix and an indices series as its arguments, and outputs a list of most similar talks. indices has already been provided to you.

You have also been given a transcripts series that contains the transcripts of around 500 TED talks. Your task is to generate a cosine similarity matrix for the tf-idf vectors of the talk transcripts.

We will then generate recommendations for a talk titled '5 ways to kill your dreams' by Brazilian entrepreneur Bel Pesce.

- Initialize a TfidfVectorizer with English stopwords. Name it tfidf.
- Construct tfidf_matrix by fitting and transforming transcripts.
- Generate the cosine similarity matrix cosine_sim using tfidf_matrix.
- Use get_recommendations() to generate recommendations for '5 ways to kill your dreams'.
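
The steps above can be sketched end to end as follows. The real exercise supplies transcripts, indices and get_recommendations(); here they are replaced with hypothetical stand-ins (four invented talk titles and one-line "transcripts") so that the example runs on its own, and the top_n parameter is an addition for this sketch.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

# Hypothetical stand-in data, invented for illustration only
talks = pd.DataFrame({
    'title': ['Talk A', 'Talk B', 'Talk C', 'Talk D'],
    'transcript': [
        'dreams goals success hard work dreams',
        'success comes from hard work and clear goals',
        'the ocean is full of strange deep sea creatures',
        'deep sea creatures glow in the dark ocean',
    ],
})
transcripts = talks['transcript']
indices = pd.Series(talks.index, index=talks['title'])

def get_recommendations(title, cosine_sim, indices, top_n=10):
    # Row of the similarity matrix that belongs to this title
    idx = indices[title]
    # Sort all talks by similarity, highest first
    sim_scores = sorted(enumerate(cosine_sim[idx]), key=lambda x: x[1], reverse=True)
    # Drop the talk itself and keep the top_n remaining talks
    sim_scores = [s for s in sim_scores if s[0] != idx][:top_n]
    return talks['title'].iloc[[i for i, _ in sim_scores]]

# Initialize the TfidfVectorizer with English stop words
tfidf = TfidfVectorizer(stop_words='english')

# Construct the tf-idf matrix from the transcripts
tfidf_matrix = tfidf.fit_transform(transcripts)

# Generate the cosine similarity matrix
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)

# Generate recommendations for one of the stand-in talks
recs = get_recommendations('Talk A', cosine_sim, indices, top_n=2)
print(recs)
```

With the real data, the call would use '5 ways to kill your dreams' as the title and the provided transcripts series in place of the stand-in column.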

Excellent work! You have successfully built a TED talk recommender. This recommender works surprisingly well despite being trained only on a small subset of TED talks. In fact, three of the talks recommended by our system are also recommended by the official TED website as talks to watch next after '5 ways to kill your dreams'!

# Beyond n-grams: word embeddings
## Generating word vectors

In this exercise, we will generate the pairwise similarity scores of all the words in a sentence. The sentence is available as sent and has been printed to the console for your convenience.

In [115]:
sent = 'I like apples and oranges'

import spacy
# Load the en_core_web_sm model
nlp = spacy.load('en_core_web_sm')

In [116]:
# Create the doc object
doc = nlp(sent)

# Compute pairwise similarity scores
for token1 in doc:
  for token2 in doc:
    print(token1.text, token2.text, token1.similarity(token2))

I I 1.0
I like 0.19372894
I apples 0.0821518
I and -0.013632407
I oranges 0.049489032
like I 0.19372894
like like 1.0
like apples 0.25841203
like and 0.14297311
like oranges 0.12997378
apples I 0.0821518
apples like 0.25841203
apples apples 1.0
apples and -0.04025854
apples oranges 0.61655116
and I -0.013632407
and like 0.14297311
and apples -0.04025854
and and 1.0
and oranges -0.076360986
oranges I 0.049489032
oranges like 0.12997378
oranges apples 0.61655116
oranges and -0.076360986
oranges oranges 1.0

Good job! Notice how the words 'apples' and 'oranges' have the highest pairwise similarity score. This is expected as they are both fruits and are more related to each other than any other pair of words.

## Computing similarity of Pink Floyd songs

In this final exercise, you have been given lyrics of three songs by the British band Pink Floyd, namely 'High Hopes', 'Hey You' and 'Mother'. The lyrics to these songs are available as hopes, hey and mother respectively.

Your task is to compute the pairwise similarity between mother and hopes, and mother and hey.

- Create Doc objects for mother, hopes and hey.
- Compute the similarity between mother and hopes.
- Compute the similarity between mother and hey.
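
To make the idea of document similarity concrete without loading a spaCy model, here is a minimal numpy sketch. It assumes the common approach of representing a document as the average of its word vectors and comparing documents with the cosine score, which is also how spaCy's default Doc vectors and similarity() are documented to behave. The three-dimensional embeddings and the short texts are invented for illustration and are not real word vectors or lyrics.

```python
import numpy as np

# Hypothetical 3-dimensional embeddings, invented purely for illustration
embeddings = {
    'home':  np.array([0.9, 0.1, 0.0]),
    'wall':  np.array([0.8, 0.2, 0.1]),
    'fear':  np.array([0.7, 0.3, 0.0]),
    'river': np.array([0.1, 0.9, 0.2]),
    'bell':  np.array([0.0, 0.8, 0.4]),
}

def doc_vector(text):
    # Average the vectors of all known words in the text
    vectors = [embeddings[w] for w in text.lower().split() if w in embeddings]
    return np.mean(vectors, axis=0)

def similarity(text1, text2):
    # Cosine score between the two averaged document vectors
    v1, v2 = doc_vector(text1), doc_vector(text2)
    return float(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))

close = similarity('home wall', 'fear wall')
far = similarity('home wall', 'river bell')
print(close, far)
```

In the exercise itself, the Doc objects would be created with nlp(mother), nlp(hopes) and nlp(hey), and the scores obtained with calls such as mother_doc.similarity(hopes_doc).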

Excellent work! Notice that 'Mother' and 'Hey You' have a similarity score of 0.9 whereas 'Mother' and 'High Hopes' have a score of only 0.6. This is probably because 'Mother' and 'Hey You' were both songs from the same album, 'The Wall', and were penned by Roger Waters. On the other hand, 'High Hopes' was a part of the album 'The Division Bell' with lyrics by David Gilmour and his wife, Polly Samson. Treat yourself by listening to these songs. They're some of the best!