In [23]:
# Import libraries
import pandas as pd
import numpy as np
import time
import spacy

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.metrics.pairwise import linear_kernel
from pprint import pprint

In [18]:
# Load model
nlp = spacy.load('en_core_web_lg')

In [2]:
# Read data
#ted_talk = pd.read_csv('data/ted.csv')
#print(f'Head of ted_talk: \n{ted_talk.head()}')

movie_overviews = pd.read_csv('data/movie_overviews.csv', index_col=0)
print(f'Head of movie_overviews: \n{movie_overviews.head()}')

ted_main = pd.read_csv('data/ted_main.csv.zip', compression='zip', index_col='url',
                       usecols=['title', 'name', 'languages', 'published_date', 'event', 'url'])
ted_transcripts = pd.read_csv('data/ted_transcripts.csv.zip', compression='zip', index_col='url',
                              usecols=['transcript', 'url'])

ted_talk = ted_main.join(ted_transcripts, how='right').reset_index()
print(f'\n\nHead of ted_talk: \n{ted_talk.head()}')

Head of movie_overviews: 
                             title  \
id                                   
862                      Toy Story   
8844                       Jumanji   
15602             Grumpier Old Men   
31357            Waiting to Exhale   
11862  Father of the Bride Part II   

                                                overview  \
id                                                         
862    Led by Woody, Andy's toys live happily in his ...   
8844   When siblings Judy and Peter discover an encha...   
15602  A family wedding reignites the ancient feud be...   
31357  Cheated on, mistreated and stepped on, the wom...   
11862  Just when George Banks has recovered from his ...   

                                                 tagline  
id                                                        
862                                                  NaN  
8844           Roll the dice and unleash the excitement!  
15602  Still Yelling. Still Fighting. Still Ready 

In [3]:
print(ted_transcripts.tail(100))
print(ted_main.tail(100))

                                                                                           transcript
url                                                                                                  
https://www.ted.com/talks/shah_rukh_khan_though...  Namaskar.I'm a movie star, I'm 51 years of age...
https://www.ted.com/talks/stuart_russell_how_ai...  This is Lee Sedol. Lee Sedol is one of the wor...
https://www.ted.com/talks/lucy_kalanithi_what_m...  A few days after my husband Paul was diagnosed...
https://www.ted.com/talks/ted_halstead_a_climat...  I have a two-year-old daughter named Naya who ...
https://www.ted.com/talks/wendy_troxel_why_scho...  It's six o'clock in the morning, pitch black o...
...                                                                                               ...
https://www.ted.com/talks/duarte_geraldino_what...  So, Ma was trying to explain something to me a...
https://www.ted.com/talks/armando_azua_bustos_t...  This is a picture of a sunset 

## 4. TF-IDF and similarity scores
Learn how to compute tf-idf weights and the cosine similarity score between two vectors. You will use these concepts to build a movie recommender and a TED Talk recommender. Finally, you will learn about word embeddings and use word vector representations to compute similarities between various Pink Floyd songs.

## 4.1 Building tf-idf document vectors

1. Building tf-idf document vectors
>In the last chapter, we learned about n-gram modeling.

2. n-gram modeling
>In n-gram modeling, the weight of a dimension for the vector representation of a document is dependent on the number of times the word corresponding to the dimension occurs in the document. Let's say we have a document that has the word 'human' occurring 5 times. Then, the dimension of its vector representation corresponding to 'human' would have the value 5.

3. Motivation
>However, some words occur very commonly across all the documents in the corpus. As a result, the vector representations end up being characterized mostly by these dimensions. Consider a corpus of documents on the Universe. Let's say there is a particular document on Jupiter in which the words 'jupiter' and 'universe' both occur about 20 times. However, 'jupiter' rarely figures in the other documents, whereas 'universe' is just as common in them as it is here. We could argue that although both *jupiter* and *universe* occur 20 times, *jupiter* should be given a larger weight on account of its exclusivity. In other words, the word 'jupiter' characterizes the document more than 'universe' does.

4. Applications
>Weighting words this way has a huge number of applications. The weights can be used to automatically detect stopwords for the corpus instead of relying on a generic list. They're used in search algorithms to rank the pages containing the search query and, as we will soon find out, in recommender systems. In a lot of cases, this kind of weighting also improves performance in predictive modeling.

5. Term frequency-inverse document frequency
>The weighting mechanism we've described is known as term frequency-inverse document frequency or tf-idf for short. It is based on the idea that the weight of a term in a document should be proportional to its frequency and an inverse function of the number of documents in which it occurs.

6. Mathematical formula
>Mathematically, the weight $w_{i,j}$ of a term $i$ in document $j$ is computed as
>
>$$w_{i,j} = tf_{i,j} \cdot \log\left(\frac{N}{df_i}\right)$$

7. Mathematical formula
>where $tf_{i,j}$ is the term frequency of the term $i$ in document $j$,

8. Mathematical formula
>multiplied by the log of the ratio of $N$, the number of documents in the corpus, to $df_i$, the number of documents in which the term $i$ occurs.

9. Mathematical formula
>Therefore, let's say the word 'library' occurs in a document 5 times. There are 20 documents in the corpus and 'library' occurs in 8 of them. Then, the tf-idf weight of 'library' in the vector representation of this document will be 5 times the log (base 10) of 20 divided by 8, which is approximately 2. In general, the higher the tf-idf weight, the more important the word is in characterizing the document. A high tf-idf weight for a word in a document may imply that the word is relatively exclusive to that particular document, that the word occurs extremely commonly in the document, or both.

10. tf-idf using scikit-learn
>Generating vectors that use tf-idf weighting is almost identical to what we've already done so far. Instead of using CountVectorizer, we use the TfidfVectorizer class of scikit-learn. Its parameters and methods are almost identical to those of CountVectorizer. The only difference is that TfidfVectorizer assigns weights using the tf-idf formula from before (by default, a smoothed and L2-normalized variant of it) and has extra parameters related to inverse document frequency which we will not cover in this course. Here, we can see how using TfidfVectorizer is almost identical to using CountVectorizer for a corpus. However, notice that the weights are non-integer and reflect values calculated by the tf-idf formula.

11. Let's practice!
>That's enough theory for now. Let's practice!
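
As a quick check on the 'library' example from the formula slides, the weight can be computed directly. This is a minimal sketch: the helper name `tfidf_weight` is made up for illustration, and the base-10 logarithm is an assumption, chosen because it reproduces the "approximately 2" quoted in the slides. A natural logarithm would put the weight on a different scale.

```python
import math

def tfidf_weight(tf, n_docs, doc_freq, log=math.log10):
    """Weight of a term: term frequency times log(N / df)."""
    return tf * log(n_docs / doc_freq)

# 'library' occurs 5 times in the document and in 8 of the 20 documents
weight_log10 = tfidf_weight(tf=5, n_docs=20, doc_freq=8)
weight_ln = tfidf_weight(tf=5, n_docs=20, doc_freq=8, log=math.log)

print(round(weight_log10, 2))  # 1.99
print(round(weight_ln, 2))     # 4.58
```

The choice of base only rescales every weight by the same constant, so it does not change which words rank as most characteristic.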

In [4]:
# Bag of words model using sklearn
lcorpus = pd.Series([
'The lion is the king of the jungle',
'Lions have lifespans of a decade',
'The lion is an endangered species'
])
print(f'Corpus to analyze: \n{lcorpus}')

# Create TfidfVectorizer object
vectorizer = TfidfVectorizer()

# Generate matrix of word vectors
tfidf_matrix = vectorizer.fit_transform(lcorpus)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
features = vectorizer.get_feature_names_out().tolist()
matrix = tfidf_matrix.toarray()

print(f'\nFeatures: \n{features}')
print(f'\nVectorized data: \n{matrix}')

Corpus to analyze: 
0    The lion is the king of the jungle
1      Lions have lifespans of a decade
2     The lion is an endangered species
dtype: object

Features: 
['an', 'decade', 'endangered', 'have', 'is', 'jungle', 'king', 'lifespans', 'lion', 'lions', 'of', 'species', 'the']

Vectorized data: 
[[0.         0.         0.         0.         0.25434658 0.33443519
  0.33443519 0.         0.25434658 0.         0.25434658 0.
  0.76303975]
 [0.         0.46735098 0.         0.46735098 0.         0.
  0.         0.46735098 0.         0.46735098 0.35543247 0.
  0.        ]
 [0.45954803 0.         0.45954803 0.         0.34949812 0.
  0.         0.         0.34949812 0.         0.         0.45954803
  0.34949812]]
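
The weights printed above do not follow the plain `tf * log(N / df)` formula exactly. The sketch below reproduces the first row using only the standard library, under the assumption that the vectorizer runs with its documented defaults: tokens of two or more word characters, a smoothed idf of `ln((1 + N) / (1 + df)) + 1`, and L2 normalization of each row. The helper name `tfidf_row` is hypothetical.

```python
import math
import re

corpus = [
    'The lion is the king of the jungle',
    'Lions have lifespans of a decade',
    'The lion is an endangered species',
]

# Lowercase and keep tokens of two or more word characters (so 'a' is dropped)
docs = [re.findall(r'\b\w\w+\b', text.lower()) for text in corpus]
vocab = sorted({tok for doc in docs for tok in doc})
n_docs = len(docs)

# Smoothed inverse document frequency: ln((1 + N) / (1 + df)) + 1
df = {term: sum(term in doc for doc in docs) for term in vocab}
idf = {term: math.log((1 + n_docs) / (1 + df[term])) + 1 for term in vocab}

def tfidf_row(doc):
    """Raw tf * idf weights for one document, scaled to unit length."""
    raw = [doc.count(term) * idf[term] for term in vocab]
    norm = math.sqrt(sum(w * w for w in raw))
    return [w / norm for w in raw]

first_row = tfidf_row(docs[0])
print({term: round(w, 8) for term, w in zip(vocab, first_row) if w > 0})
```

The value for 'the' comes out at about 0.763 and the value for 'king' at about 0.334, matching the first row of the printed matrix.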


## 4.2 tf-idf weight of commonly occurring words

**Instruction**

The word bottle occurs 5 times in a particular document D and also occurs in every document of the corpus. What is the tf-idf weight of bottle in D?

**Possible Answers**

1. <font color=blue>__0__</font> Correct!
2. 1
3. Not defined
4. 5

**Results**

<font color=darkgreen>Correct! In fact, the tf-idf weight for bottle in every document will be 0. This is because the inverse document frequency is constant across documents in a corpus and since bottle occurs in every document, its value is log(1), which is 0.</font>
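
The answer can be verified with the formula from the lesson. The sketch below also evaluates the smoothed idf that scikit-learn documents as its default (`smooth_idf=True`), which is why TfidfVectorizer would not assign such a word a weight of exactly 0. The corpus size of 100 is an arbitrary assumption; any size gives the same result.

```python
import math

n_docs = 100        # corpus size (arbitrary assumption)
doc_freq = n_docs   # 'bottle' appears in every document
tf = 5              # 'bottle' occurs 5 times in document D

textbook_idf = math.log(n_docs / doc_freq)                   # log(1) = 0
smoothed_idf = math.log((1 + n_docs) / (1 + doc_freq)) + 1   # smoothed variant = 1

print(tf * textbook_idf)  # 0.0
print(tf * smoothed_idf)  # 5.0 (before any row normalization)
```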

## 4.3 tf-idf vectors for TED talks

In this exercise, you have been given a corpus **ted** which contains the transcripts of 500 TED Talks. Your task is to generate the tf-idf vectors for these talks.

In a later lesson, we will use these vectors to generate recommendations of similar talks based on the transcript.

**Instructions**

1. Import TfidfVectorizer from sklearn.
2. Create a TfidfVectorizer object. Name it vectorizer.
3. Generate tfidf_matrix for ted using the fit_transform() method.

**Results**

<font color=darkgreen>Good job! You now know how to generate tf-idf vectors for a given corpus of text. You can use these vectors to perform predictive modeling just like we did with CountVectorizer. In the next few lessons, we will see another extremely useful application of the vectorized form of documents: generating recommendations.</font>

In [5]:
# Read data
ted = ted_talk.transcript.copy(deep = True)

# Create TfidfVectorizer object
vectorizer = TfidfVectorizer()

# Generate matrix of word vectors
tfidf_matrix = vectorizer.fit_transform(ted)

# Print the shape of tfidf_matrix
print(tfidf_matrix.shape)

(2467, 58795)


## 4.4 Cosine similarity

1. Cosine similarity
>We now know how to compute vectors out of text documents. With this representation in mind, let us now explore techniques that will allow us to determine how similar two vectors, and consequently two documents, are to each other. More specifically, we will learn about the cosine similarity score, which is one of the most popular similarity metrics in NLP.

2. Mathematical formula
>Very simply put, the cosine similarity score of two vectors is the cosine of the angle between the vectors. Mathematically, it is the ratio of the dot product of the vectors to the product of their magnitudes: $\cos(\theta) = \frac{A \cdot B}{\|A\| \, \|B\|}$. Let's walk through what this formula really means.

3. The dot product
>The dot product is computed by summing the product of values across corresponding dimensions of the vectors. Let's say we have two n-dimensional vectors V and W. Then, the dot product here would be v1 times w1 plus v2 times w2 and so on until vn times wn. As an example, consider two vectors A = (4, 7, 1) and B = (5, 2, 3). By applying the formula above, we see that the dot product comes to 4 * 5 + 7 * 2 + 1 * 3 = 37.

4. Magnitude of a vector
>The magnitude of a vector is essentially the length of the vector. Mathematically, it is defined as the square root of the sum of the squares of values across all the dimensions of a vector. Therefore, for an n-dimensional vector V, the magnitude, mod V, is computed as the square root of v1 squared plus v2 squared and so on until vn squared. Consider the vector A from before. Using the above formula, we compute its magnitude to be root 66.

5. The cosine score
>We are now in a position to compute the cosine similarity score of A and B. It is the dot product, which is 37, divided by the product of the magnitudes of A and B, which are root 66 and root 38 respectively. The value comes out to be approximately 0.738, which is the value of the cosine of the angle theta between the two vectors.

6. Cosine Score: points to remember
>Since the cosine score is simply the cosine of the angle between two vectors, its value is bounded between -1 and 1. However, in NLP, document vectors almost always use non-negative weights. Therefore, cosine scores vary between 0 and 1, where 0 indicates no similarity and 1 indicates that the vectors point in the same direction, as they do for identical documents. Finally, since the cosine score ignores the magnitude of the vectors, it is fairly robust to document length. This may be an advantage or a disadvantage depending on the use case.

7. Implementation using scikit-learn
>Scikit-learn offers a cosine_similarity function that outputs a similarity matrix containing the pairwise cosine scores for a set of vectors. You can import cosine_similarity from sklearn.metrics.pairwise. However, remember that cosine_similarity takes in 2-D arrays as arguments. Passing in 1-D arrays will throw an error. Let us compute the cosine similarity scores of vectors A and B from before. We see that we get the same answer of 0.738 as before.

8. Let's practice!
>That's enough theory for now. Let's practice!
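
The arithmetic from the slides can be retraced by hand before turning to scikit-learn. The sketch below uses only the standard library and the vectors A = (4, 7, 1) and B = (5, 2, 3) that appear in the next cell.

```python
import math

A = (4, 7, 1)
B = (5, 2, 3)

# Dot product: sum of products across corresponding dimensions
dot = sum(a * b for a, b in zip(A, B))

# Magnitudes: square root of the sum of squares
mag_a = math.sqrt(sum(a * a for a in A))   # root 66
mag_b = math.sqrt(sum(b * b for b in B))   # root 38

score = dot / (mag_a * mag_b)
print(dot)              # 37
print(round(score, 3))  # 0.739
```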

In [6]:
# Define two 3-dimensional vectors A and B
A = (4,7,1)
B = (5,2,3)

# Compute the cosine score of A and B
score = cosine_similarity([A], [B])

# Print the cosine score
print(score)

[[0.73881883]]


## 4.5 Range of cosine scores

**Instructions**

Which of the following is a possible cosine score for a pair of document vectors?

**Possible Answers**

1. <font color=blue>__0.86__</font> Correct!
2. -0.52
3. 2.36
4. -1.32

**Results**

<font color=darkgreen>Great job! Since document vectors use only non-negative weights, the cosine score lies between 0 and 1.</font>
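
To see why negative scores are ruled out for document vectors but not for vectors in general, here is a small standard-library sketch. The helper `cosine` and the example vectors are made up for illustration.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

print(cosine((3, 4), (-3, -4)))  # -1.0: opposite directions need negative components
print(cosine((1, 0), (0, 1)))    # 0.0: no shared dimensions
print(cosine((3, 4), (3, 4)))    # 1.0: same direction
```

With non-negative weights every term in the dot product is non-negative, so the score cannot drop below 0.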

## 4.6 Computing dot product

In this exercise, we will learn to compute the dot product between two vectors, A = (1, 3) and B = (-2, 2), using the numpy library. More specifically, we will use the np.dot() function to compute the dot product of two numpy arrays.

**Instructions**

1. Initialize A (1,3) and B (-2,2) as numpy arrays using np.array().
2. Compute the dot product using np.dot() and passing A and B as arguments.

**Results**

<font color=darkgreen>Good job! The dot product of the two vectors is 1 * -2 + 3 * 2 = 4, which is indeed the output produced. We will not be using np.dot() too much in this course but it can prove to be a helpful function while computing dot products between two standalone vectors.</font>

In [7]:
# Initialize numpy vectors
A = np.array([1, 3])
B = np.array([-2, 2])

# Compute dot product
dot_prod = np.dot(A, B)

# Print dot product
print(dot_prod)

4


## 4.7 Cosine similarity matrix of a corpus

In this exercise, you have been given a __corpus__, which is a list containing five sentences. The __corpus__ is printed in the console. You have to compute the cosine similarity matrix which contains the pairwise cosine similarity score for every pair of sentences (vectorized using tf-idf).

Remember, the value corresponding to the ith row and jth column of a similarity matrix denotes the similarity score for the ith and jth vector.

**Instructions**

1. Initialize an instance of TfidfVectorizer. Name it tfidf_vectorizer.
2. Using fit_transform(), generate the tf-idf vectors for corpus. Name it tfidf_matrix.
3. Use cosine_similarity() and pass tfidf_matrix to compute the cosine similarity matrix cosine_sim.

**Results**

<font color=darkgreen>Great work! As you will see in a subsequent lesson, computing the cosine similarity matrix lies at the heart of many practical systems such as recommenders. From our similarity matrix, we see that the first and the second sentence are the most similar. Also the fifth sentence has, on average, the lowest pairwise cosine scores. This is intuitive as it contains entities that are not present in the other sentences.</font>

In [8]:
# Read data
corpus = ['The sun is the largest celestial body in the solar system', 
          'The solar system consists of the sun and eight revolving planets', 
          'Ra was the Egyptian Sun God', 
          'The Pyramids were the pinnacle of Egyptian architecture', 
          'The quick brown fox jumps over the lazy dog']

# Initialize an instance of tf-idf Vectorizer
tfidf_vectorizer = TfidfVectorizer()

# Generate the tf-idf vectors for the corpus
tfidf_matrix = tfidf_vectorizer.fit_transform(corpus)

# Compute and print the cosine similarity matrix
cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)
print(cosine_sim)

[[1.         0.36413198 0.18314713 0.18435251 0.16336438]
 [0.36413198 1.         0.15054075 0.21704584 0.11203887]
 [0.18314713 0.15054075 1.         0.21318602 0.07763512]
 [0.18435251 0.21704584 0.21318602 1.         0.12960089]
 [0.16336438 0.11203887 0.07763512 0.12960089 1.        ]]
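
The two observations in the results note (the first two sentences are the most similar pair, and the fifth sentence has the lowest average score) can be read off the matrix programmatically. This sketch hard-codes the values printed above and uses plain Python lists rather than the NumPy array returned by `cosine_similarity`.

```python
# Similarity matrix copied from the printed output
cosine_sim = [
    [1.0,        0.36413198, 0.18314713, 0.18435251, 0.16336438],
    [0.36413198, 1.0,        0.15054075, 0.21704584, 0.11203887],
    [0.18314713, 0.15054075, 1.0,        0.21318602, 0.07763512],
    [0.18435251, 0.21704584, 0.21318602, 1.0,        0.12960089],
    [0.16336438, 0.11203887, 0.07763512, 0.12960089, 1.0],
]
n = len(cosine_sim)

# Most similar pair of distinct sentences (upper triangle only)
pairs = [(cosine_sim[i][j], i, j) for i in range(n) for j in range(i + 1, n)]
best_score, best_i, best_j = max(pairs)

# Average score of each sentence against the others (diagonal excluded)
row_means = [(sum(row) - row[i]) / (n - 1) for i, row in enumerate(cosine_sim)]
least_similar = row_means.index(min(row_means))

print(best_i, best_j, best_score)  # 0 1 0.36413198
print(least_similar)               # 4
```

Indices are zero-based, so `0 1` refers to the first and second sentences and `4` to the fifth.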


## 4.8 Building a plot line based recommender

1. Building a plot line based recommender
>In this lesson, we will use tf-idf vectors and cosine scores to build a recommender system that suggests movies based on overviews.

2. Movie recommender
>We have a dataset containing movie overviews. Here, we can see two movies, Shanghai Triad and Cry, the Beloved Country, and their overviews.

3. Movie recommender
>Our task is to build a system that takes in a movie title and outputs a list of movies that have similar plot lines. For instance, if we passed in 'The Godfather', we could expect output like this. Notice how a lot of the movies listed here have to do with crime and gangsters, just like The Godfather.

4. Steps
>Following are the steps involved. The first step, as always, is to preprocess movie overviews. The next step is to generate the tf-idf vectors for our overviews. Finally, we generate a cosine similarity matrix which contains the pairwise similarity scores of every movie with every other movie. Once the cosine similarity matrix is computed, we can proceed to build the recommender function.

5. The recommender function
>We will build a recommender function as part of this course. Let's take a look at how it works. The recommender function takes a movie title, the cosine similarity matrix and an indices series as arguments. The indices series is a reverse mapping from movie titles to their indices in the original dataframe. The function extracts the pairwise cosine similarity scores of the movie passed in with every other movie. Next, it sorts these scores in descending order. Finally, it outputs the titles of movies corresponding to the highest similarity scores. Note that the function ignores the highest similarity score of 1. This is because the movie most similar to a given movie is the movie itself!

6. Generating tf-idf vectors
>Let's say we already have the preprocessed movie overviews as 'movie_plots'. We already know how to generate the tf-idf vectors.

7. Generating cosine similarity matrix
>Generating the cosine similarity matrix is also extremely simple. We simply pass in tfidf_matrix as both the first and second argument of cosine_similarity. This generates a matrix that contains the pairwise similarity score of every movie with every other movie. The value corresponding to the ith row and the jth column is the cosine similarity score of movie i with movie j. Notice that the diagonal elements of this matrix are all 1. This is because, as stated earlier, the cosine similarity score of movie k with itself is 1.

8. The linear_kernel function
>TfidfVectorizer L2-normalizes each document vector by default, so the magnitude of a tf-idf vector it produces is 1. Recall from the previous lesson that the cosine score is computed as the ratio of the dot product to the product of the magnitudes of the vectors. Since the magnitude is 1, the cosine score of two tf-idf vectors is equal to their dot product! This fact can help us greatly improve the speed of computation of our cosine similarity matrix, as we do not need to compute the magnitudes while working with tf-idf vectors. Therefore, while working with tf-idf vectors, we can use the linear_kernel function, which computes the pairwise dot product of every vector with every other vector.

9. Generating cosine similarity matrix
>Let us replace the cosine_similarity function with linear_kernel. As you can see, the output remains the same but it takes significantly less time to compute.

10. The get_recommendations function
>The recommender function and the indices series described earlier will be built in the exercises. You can use this function to generate recommendations using the cosine similarity matrix.

11. Let's practice!
>In the exercises, you will build recommendation systems of your own and see them in action. Let's practice!
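
The argument behind the linear_kernel slide, that the dot product of unit-length vectors equals their cosine score, can be checked on the vectors A and B from the cosine similarity lesson. This is a standard-library sketch with hypothetical helper names (`dot`, `normalize`, `cosine`), not the scikit-learn implementation.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    """Scale a vector to unit length (L2 normalization)."""
    mag = math.sqrt(dot(v, v))
    return [x / mag for x in v]

def cosine(u, v):
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

A = (4, 7, 1)
B = (5, 2, 3)
unit_a, unit_b = normalize(A), normalize(B)

# Once both vectors have magnitude 1, the plain dot product is the cosine score
print(round(cosine(A, B), 6))
print(round(dot(unit_a, unit_b), 6))
```

The shortcut only holds for normalized vectors; on raw counts or unnormalized weights, a plain dot product is not a cosine score.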

In [9]:
# Compute and print the cosine similarity matrix with cosine_similarity
start_time = time.time()
cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)
print(cosine_sim)
print("\nWith cosine_similarity, the program took %.8f seconds to complete.\n\n" % (time.time() - start_time))

# Compute and print the same matrix with linear_kernel
start_time = time.time()
linear_ker = linear_kernel(tfidf_matrix, tfidf_matrix)
print(linear_ker)
print("\nWith linear_kernel, the program took %.8f seconds to complete.\n\n" % (time.time() - start_time))

# Both methods should agree up to floating-point rounding
assert np.allclose(cosine_sim, linear_ker)

[[1.         0.36413198 0.18314713 0.18435251 0.16336438]
 [0.36413198 1.         0.15054075 0.21704584 0.11203887]
 [0.18314713 0.15054075 1.         0.21318602 0.07763512]
 [0.18435251 0.21704584 0.21318602 1.         0.12960089]
 [0.16336438 0.11203887 0.07763512 0.12960089 1.        ]]

With cosine_similarity, the program took 0.00498533 seconds to complete.


[[1.         0.36413198 0.18314713 0.18435251 0.16336438]
 [0.36413198 1.         0.15054075 0.21704584 0.11203887]
 [0.18314713 0.15054075 1.         0.21318602 0.07763512]
 [0.18435251 0.21704584 0.21318602 1.         0.12960089]
 [0.16336438 0.11203887 0.07763512 0.12960089 1.        ]]

With linear_kernel, the program took 0.00399542 seconds to complete.




## 4.9 Comparing linear_kernel and cosine_similarity

In this exercise, you have been given __tfidf_matrix__ which contains the tf-idf vectors of a thousand documents. Your task is to generate the cosine similarity matrix for these vectors first using __cosine_similarity__ and then, using __linear_kernel__.

We will then compare the computation times for both functions.

**Instructions**

1. Compute the cosine similarity matrix for tfidf_matrix using cosine_similarity.
2. Compute the cosine similarity matrix for tfidf_matrix using linear_kernel.

**Results**

<font color=darkgreen>Good job! Notice how both linear_kernel and cosine_similarity produced the same result. However, linear_kernel took a smaller amount of time to execute. When you're working with a very large amount of data and your vectors are in the tf-idf representation, it is good practice to default to linear_kernel to improve performance. (NOTE: If you see linear_kernel taking more time, it's because the dataset we're dealing with is extremely small and Python's time module is incapable of capturing such minute time differences accurately.)</font>

In [10]:
# Read data
corpus = movie_overviews[movie_overviews.overview.notnull()].overview
print(corpus.head())

# Create TfidfVectorizer object
vectorizer = TfidfVectorizer(lowercase=True, stop_words='english')

# Generate matrix of word vectors
tfidf_matrix = vectorizer.fit_transform(corpus)

# Print the shape of tfidf_matrix
print(f'\nShape of tfidf_matrix: {tfidf_matrix.shape}\n\n')

methods = {cosine_similarity: 'cosine_similarity', linear_kernel: 'linear_kernel'}
for method in methods:
    # Record start time
    start = time.time()
    
    # Compute cosine similarity matrix
    cosine_sim = method(tfidf_matrix, tfidf_matrix)
    
    # Print cosine similarity matrix
    #print(cosine_sim)
    
    # Print time taken
    print("Time taken in method %s: %s seconds" %  (methods[method], (time.time() - start)))

id
862      Led by Woody, Andy's toys live happily in his ...
8844     When siblings Judy and Peter discover an encha...
15602    A family wedding reignites the ancient feud be...
31357    Cheated on, mistreated and stepped on, the wom...
11862    Just when George Banks has recovered from his ...
Name: overview, dtype: object

Shape of tfidf_matrix: (9087, 29727)


Time taken in method cosine_similarity: 1.4352073669433594 seconds
Time taken in method linear_kernel: 2.30485200881958 seconds
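
A single `time.time()` measurement like the one above is noisy, so one run can easily show the "wrong" function winning. A common way to reduce that noise is to repeat the call and keep the fastest run, using `time.perf_counter()` for higher resolution. The sketch below is one such approach; `best_time` and the stand-in `workload` are hypothetical names, and in the notebook the workload could instead be `lambda: linear_kernel(tfidf_matrix, tfidf_matrix)`.

```python
import time

def best_time(func, repeats=5):
    """Return the fastest wall-clock time over `repeats` calls to func()."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        timings.append(time.perf_counter() - start)
    return min(timings)

# Stand-in workload (an assumption for illustration only)
def workload():
    return sum(i * i for i in range(10_000))

elapsed = best_time(workload)
print(f"Best of 5 runs: {elapsed:.6f} seconds")
```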


## 4.10 Plot recommendation engine

In this exercise, we will build a recommendation engine that suggests movies based on similarity of plot lines. You have been given a __get_recommendations()__ function that takes in the title of a movie, a similarity matrix and an __indices__ series as its arguments and outputs a list of most similar movies. __indices__ has already been provided to you.

You have also been given a __movie_plots__ Series that contains the plot lines of several movies. Your task is to generate a cosine similarity matrix for the tf-idf vectors of these plots.

We will then check the potency of our engine by generating recommendations for one of my favorite movies, The Dark Knight Rises.

**Instructions**

1. Initialize a TfidfVectorizer with English stop_words. Name it tfidf.
2. Construct tfidf_matrix by fitting and transforming the movie plot data using fit_transform().
3. Generate the cosine similarity matrix cosine_sim using tfidf_matrix. Don't use cosine_similarity()!
4. Use get_recommendations() to generate recommendations for 'The Dark Knight Rises'.

**Results**

<font color=darkgreen>Congratulations! You've just built your very first recommendation system. Notice how the recommender correctly identifies 'The Dark Knight Rises' as a Batman movie and recommends other Batman movies as a result. This system is, of course, very primitive and there are a host of ways in which it could be improved. One method would be to look at the cast, crew and genre in addition to the plot to generate recommendations. We will not be covering this in this course but you have all the tools necessary to accomplish this. Do give it a try!</font>

In [11]:
# Function to retrieve movie recommendations
def get_recommendations(title, cosine_sim, indices, metadata):
    """Return the titles of the 10 movies most similar to `title`."""
    # Get the index of the movie that matches the title
    idx = indices[title]
    #print('\n\nidx:', idx)
    
    # Get the pairwise similarity scores
    sim_scores = list(enumerate(cosine_sim[idx]))
    #print('\nsim_scores:', cosine_sim)
    
    # Sort the movies based on the similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    
    # Get the scores for 10 most similar movies
    sim_scores = sim_scores[1:11]
    
    # Get the movie indices
    movie_indices = [i[0] for i in sim_scores]
    
    # Return the top 10 most similar movies
    return metadata['title'].iloc[movie_indices]

In [12]:
# Read data
metadata = movie_overviews[movie_overviews.overview.notnull()].reset_index()
movie_plots = metadata.overview
indices = metadata.reset_index().set_index('title')['index']
print(f'Metadata: \n{metadata.head()}')
print(f'\n\nMovie_plots: \n{movie_plots.head()}')
print(f'\n\nIndices: \n{indices.head()}')

Metadata: 
      id                        title  \
0    862                    Toy Story   
1   8844                      Jumanji   
2  15602             Grumpier Old Men   
3  31357            Waiting to Exhale   
4  11862  Father of the Bride Part II   

                                            overview  \
0  Led by Woody, Andy's toys live happily in his ...   
1  When siblings Judy and Peter discover an encha...   
2  A family wedding reignites the ancient feud be...   
3  Cheated on, mistreated and stepped on, the wom...   
4  Just when George Banks has recovered from his ...   

                                             tagline  
0                                                NaN  
1          Roll the dice and unleash the excitement!  
2  Still Yelling. Still Fighting. Still Ready for...  
3  Friends are the people who let you be yourself...  
4  Just When His World Is Back To Normal... He's ...  


Movie_plots: 
0    Led by Woody, Andy's toys live happily in his ...
1   

In [13]:
# Initialize the TfidfVectorizer 
tfidf = TfidfVectorizer(lowercase=True, stop_words='english', ngram_range=(1, 2))

# Construct the TF-IDF matrix
tfidf_matrix = tfidf.fit_transform(movie_plots)
#print(list(tfidf_matrix.toarray()))

# Generate the cosine similarity matrix
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)

# Generate recommendations
movies = ['Toy Story', 'The Dark Knight Rises']
for movie in movies:
    print(f'\n\nBased on movie: "{movie}", our recommendations are:')
    print(get_recommendations(movie, cosine_sim, indices, metadata))



Based on movie: "Toy Story", our recommendations are:
2499               Toy Story 2
7537               Toy Story 3
6194    The 40 Year Old Virgin
889      Rebel Without a Cause
2544           Man on the Moon
6629              Factory Girl
1597                 Condorman
6556    For Your Consideration
4988          Rivers and Tides
436                     Malice
Name: title, dtype: object


Based on movie: "The Dark Knight Rises", our recommendations are:
132                              Batman Forever
6902                            The Dark Knight
523                                      Batman
1113                             Batman Returns
7567                 Batman: Under the Red Hood
7901                           Batman: Year One
8164    Batman: The Dark Knight Returns, Part 1
7245                  The File on Thelma Jordon
6145                              Batman Begins
4489                                      Q & A
Name: title, dtype: object


## 4.11 The recommender function

In this exercise, we will build a recommender function __get_recommendations()__, as discussed in the lesson and the previous exercise. As we know, it takes in a title, a cosine similarity matrix, and a movie title and index mapping as arguments and outputs a list of 10 titles most similar to the original title (excluding the title itself).

You have been given a dataset __metadata__ that consists of the movie titles and overviews. The head of this dataset has been printed to console.

**Instructions**

1. Get index of the movie that matches the title by using the title key of indices.
2. Extract the ten most similar movies from sim_scores and store them back in sim_scores.


**Results**

<font color=darkgreen>Good job! With this recommender function in our toolkit, we are now in a very good place to build the rest of the components of our recommendation engine.</font>
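
The recommender skips the first sorted entry on the assumption that the movie itself always ranks first. That assumption can fail when another entry ties at a score of 1, for example a duplicate overview. The sketch below shows an alternative that excludes the query row by index instead. It uses plain lists, a made-up similarity matrix, and a hypothetical function name `top_similar`; it is not the function from the course.

```python
def top_similar(idx, sim_matrix, k=10):
    """Return the k row indices most similar to idx, never including idx itself."""
    scores = [(j, s) for j, s in enumerate(sim_matrix[idx]) if j != idx]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return [j for j, _ in scores[:k]]

# Toy similarity matrix (made-up values); items 0 and 1 are exact duplicates
sim = [
    [1.0, 1.0, 0.2],
    [1.0, 1.0, 0.2],
    [0.2, 0.2, 1.0],
]

print(top_similar(1, sim, k=2))  # [0, 2]
```

With the slice-based approach, querying item 1 on this matrix would drop item 0 and return item 1 itself, because the stable sort keeps item 0 ahead of item 1 when their scores tie.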

In [14]:
# Read data
metadata = movie_overviews[movie_overviews.tagline.notnull()].reset_index(drop=True)[['title', 'tagline']]

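# Generate mapping between titles and index, keeping the first row for any repeated title
# (drop_duplicates() would compare the index values, which are already unique)
indices = pd.Series(metadata.index, index=metadata['title'])
indices = indices[~indices.index.duplicated(keep='first')]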

print(f'Metadata: \n{metadata.head()}')
print(f'\n\nIndices: \n{indices.head()}')


# Function to retrieve movie recommendations
def get_new_recommendations(title, cosine_sim, indices, metadata, num_recomendation=10):
    """Return the titles of the `num_recomendation` movies most similar to `title`."""
    # Get index of movie that matches title
    idx = indices[title]

    # Sort the movies based on the similarity scores
    sim_scores = list(enumerate(cosine_sim[idx]))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    
    # Get the scores for the most similar movies, skipping the movie itself
    sim_scores = sim_scores[1:num_recomendation+1]
    
    # Get the movie indices
    movie_indices = [i[0] for i in sim_scores]
    
    # Return the titles of the most similar movies
    return metadata['title'].iloc[movie_indices]

Metadata: 
                         title  \
0                      Jumanji   
1             Grumpier Old Men   
2            Waiting to Exhale   
3  Father of the Bride Part II   
4                         Heat   

                                             tagline  
0          Roll the dice and unleash the excitement!  
1  Still Yelling. Still Fighting. Still Ready for...  
2  Friends are the people who let you be yourself...  
3  Just When His World Is Back To Normal... He's ...  
4                           A Los Angeles Crime Saga  


Indices: 
title
Jumanji                        0
Grumpier Old Men               1
Waiting to Exhale              2
Father of the Bride Part II    3
Heat                           4
dtype: int64
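
To see the recommender's mechanics without loading the full dataset, here is a minimal sketch on a made-up four-title catalogue. The titles, taglines, the `toy*` names and the `recommend` helper are all illustrative assumptions; `recommend` is a condensed stand-in for the function above so that the snippet runs on its own.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel


def recommend(title, cosine_sim, indices, metadata, n=10):
    """Hypothetical condensed variant of the recommender above."""
    idx = indices[title]
    # Rank every row by its similarity to the requested title
    ranked = sorted(enumerate(cosine_sim[idx]), key=lambda x: x[1], reverse=True)
    # Skip position 0, which is the title itself
    top_indices = [i for i, _ in ranked[1:n + 1]]
    return metadata['title'].iloc[top_indices]


# Made-up toy catalogue (not part of the course data)
toy = pd.DataFrame({
    'title': ['Star Battle', 'Galaxy War', 'Paris Romance', 'Love Story'],
    'tagline': ['A war in deep space', 'Space war among the stars',
                'Finding love in Paris', 'A love story set in Paris'],
})
toy_indices = pd.Series(toy.index, index=toy['title'])

# Vectorize the taglines and compute pairwise similarities
toy_matrix = TfidfVectorizer(stop_words='english').fit_transform(toy['tagline'])
toy_sim = linear_kernel(toy_matrix)

top = recommend('Star Battle', toy_sim, toy_indices, toy, n=1)
print(top.tolist())
```

Only 'Galaxy War' shares any terms ('space', 'war') with 'Star Battle', so it is the single recommendation returned here.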


## 4.12 TED talk recommender

In this exercise, we will build a recommendation system that suggests TED Talks based on their transcripts. You have been given a __get_recommendations()__ function that takes in the title of a talk, a similarity matrix and an __indices__ series as its arguments, and outputs a list of most similar talks. __indices__ has already been provided to you.

You have also been given a __transcripts__ series that contains the transcripts of around 500 TED talks. Your task is to generate a cosine similarity matrix for the tf-idf vectors of the talk transcripts.

We will then generate recommendations for a talk titled '5 ways to kill your dreams' by Brazilian entrepreneur Bel Pesce.

**Instructions**

1. Initialize a TfidfVectorizer with English stopwords. Name it tfidf.
2. Construct tfidf_matrix by fitting and transforming transcripts.
3. Generate the cosine similarity matrix cosine_sim using tfidf_matrix.
4. Use get_recommendations() to generate recommendations for '5 ways to kill your dreams'.

**Results**

<font color=darkgreen>Excellent work! You have successfully built a TED talk recommender. This recommender works surprisingly well despite being trained only on a small subset of TED talks. In fact, three of the talks recommended by our system are also recommended by the official TED website as talks to watch next after '5 ways to kill your dreams'!</font>
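
The solution cell for this exercise calls `linear_kernel` rather than `cosine_similarity`. The sketch below checks why that is a safe substitution: with its default `norm='l2'`, `TfidfVectorizer` produces unit-length rows, so plain dot products equal cosine similarities. The three sample sentences and the `sample_*` names are assumptions made for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Made-up mini-corpus standing in for the talk transcripts
sample_transcripts = [
    'Dreams need hard work and patience',
    'Great leaders inspire action and hard work',
    'Robots can fly like birds',
]

sample_matrix = TfidfVectorizer(stop_words='english').fit_transform(sample_transcripts)

# Each row has unit length under the default L2 normalization ...
row_norms = np.linalg.norm(sample_matrix.toarray(), axis=1)
rows_are_unit_length = np.allclose(row_norms, 1.0)

# ... so the dot-product kernel and cosine similarity agree
kernels_match = np.allclose(linear_kernel(sample_matrix),
                            cosine_similarity(sample_matrix))

print(rows_are_unit_length, kernels_match)
```

Skipping the normalization step is what makes `linear_kernel` the cheaper choice on larger matrices.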

In [15]:
# Read data
metadata = ted_talk[ted_talk.transcript.notnull()].reset_index(drop=True)[['title', 'transcript', 'url']]

# Generate mapping between titles and index
indices = pd.Series(metadata.index, index=metadata['title']).drop_duplicates()

print(f'Metadata: \n{metadata.head()}')
print(f'\n\nIndices: \n{indices.head()}')

Metadata: 
                                           title  \
0  The mothers who found forgiveness, friendship   
1                   My year of living biblically   
2                 A robot that flies like a bird   
3                A TED speaker's worst nightmare   
4              A whistleblower you haven't heard   

                                          transcript  \
0  Phyllis Rodriguez: We are here today because o...   
1  I thought I'd tell you a little about what I l...   
2  It is a dream of mankind to fly like a bird. B...   
3  Today I'm going to talk about unexpected disco...   
4  (Whistling)(Whistling ends)(Applause)Thank you...   

                                                 url  
0  https://www.ted.com/talks/9_11_healing_the_mot...  
1  https://www.ted.com/talks/a_j_jacobs_year_of_l...  
2  https://www.ted.com/talks/a_robot_that_flies_l...  
3  https://www.ted.com/talks/a_ted_speaker_s_wors...  
4  https://www.ted.com/talks/a_whistleblower_you_...  


Indices

In [16]:
# Initialize the TfidfVectorizer 
tfidf = TfidfVectorizer(stop_words='english')

# Construct the TF-IDF matrix
tfidf_matrix = tfidf.fit_transform(metadata.transcript)

# Generate the cosine similarity matrix
cosine_sim = linear_kernel(tfidf_matrix)
 
# Generate recommendations 
talks = ['5 ways to kill your dreams', 'The mothers who found forgiveness, friendship']
for talk in talks:
    print(f'\n\nBased on TED talk: "{talk}", our recommendations are:')
    print(get_new_recommendations(talk, cosine_sim, indices, metadata))



Based on TED talk: "5 ways to kill your dreams", our recommendations are:
519                  The dream we haven't dared to dream
2159                   Bring on the learning revolution!
1921                     Success is a continuous journey
1012                 Let's crowdsource the world's goals
2359                                Why we do what we do
2070                           How to find work you love
2156                    How great leaders inspire action
1568    How we can make the world a better place by 2030
1905         How to run a company with (almost) no rules
1927                                   A life of purpose
Name: title, dtype: object


Based on TED talk: "The mothers who found forgiveness, friendship", our recommendations are:
2129                    Why we have too few women leaders
2055                           A one-woman global village
631         What we don't know about Europe's Muslim kids
2023    How I stopped the Taliban from shutting down m...
2235

## 4.13 Beyond n-grams: word embeddings

1. Beyond n-grams: word embeddings
>We have covered a lot of ground in the last 4 chapters. However, before we bid adieu, we will cover one advanced topic that has a large number of applications in NLP.

2. The problem with BoW and tf-idf
>Consider the three sentences 'I am happy', 'I am joyous' and 'I am sad'. If we were to compute their similarities, 'I am happy' and 'I am joyous' would have the same score as 'I am happy' and 'I am sad', regardless of whether we vectorize them with BoW or tf-idf. This is because 'happy', 'joyous' and 'sad' are treated as completely different words. However, we know that happy and joyous are closer in meaning to each other than either is to sad. This is something that the vectorization techniques we've covered so far simply cannot capture.

3. Word embeddings
>Word embedding is the process of mapping words into an n-dimensional vector space. These vectors are usually produced using deep learning models and huge amounts of data. The techniques used are beyond the scope of this course. However, once generated, these vectors can be used to discern how similar two words are to each other. Consequently, they can also be used to detect synonyms and antonyms. Word embeddings are also capable of capturing complex relationships. For instance, they can be used to detect that the words king and queen relate to each other in the same way as man and woman, or that France and Paris are related in the same way as Russia and Moscow. One last thing to note is that word embeddings are not trained on your data; they depend on the pre-trained spaCy model you're using and are independent of the size of your dataset.

4. Word embeddings using spaCy
>Generating word embeddings is easy using spaCy's pre-trained models. As usual, we load the spaCy model and create the doc object for our string. Note that it is advisable to load larger spaCy models while working with word vectors. This is because the en_core_web_sm model does not technically ship with word vectors but with context-specific tensors, which tend to give relatively poorer results. We generate word vectors for each word by looping through the tokens and accessing the vector attribute. The truncated output is as shown.

5. Word similarities
>We can compute how similar two words are to each other by using the similarity method of a spaCy token. Let's say we want to compute how similar happy, joyous and sad are to each other. We define a doc containing the three words. We then use a nested loop to calculate the similarity scores between each pair of words. The lesson reports that happy and joyous are more similar to each other than they are to sad, although the scores depend on the model: in the en_core_web_lg run in this notebook, happy scores 0.64 against sad and 0.53 against joyous.

6. Document similarities
>spaCy also allows us to directly compute the similarity between two documents by using the average of the word vectors of all the words in a particular document. Let's consider the three sentences from before. We create doc objects for the sentences. Like spaCy tokens, docs also have a similarity method. Therefore, we can compute the similarity between two docs as follows. The lesson reports that 'I am happy' is more similar to 'I am joyous' than it is to 'I am sad'; with en_core_web_lg in this notebook, the scores are 0.92 and 0.95 respectively. Note that the similarity scores are high in both cases because all sentences share two of their three words, 'I' and 'am'.

7. Let's practice!
>With this, we come to the end of this lesson. Let's now practice our newfound skills in the last set of exercises.
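
The limitation described in point 2 can be checked directly. The sketch below vectorizes the three sentences with a bag-of-words model and compares the two similarity scores. Widening `token_pattern` so that the one-letter token "I" is kept is a choice made for this demo (the default pattern drops single-character tokens); the variable names are illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

sentences = ['I am happy', 'I am joyous', 'I am sad']

# Keep one-letter tokens such as "I" (assumption for this demo)
vectorizer = CountVectorizer(token_pattern=r'(?u)\b\w+\b')
bow = vectorizer.fit_transform(sentences)
bow_sim = cosine_similarity(bow)

happy_joyous = bow_sim[0, 1]
happy_sad = bow_sim[0, 2]

# Each pair shares two of three tokens, so both scores are 2/3
# (up to floating-point rounding)
print(happy_joyous, happy_sad)
```

Because every word is its own independent dimension, the model has no way to place 'joyous' closer to 'happy' than 'sad' is.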

In [24]:
# Create Doc object
doc = nlp('I am happy')

# Generate word vectors for each token
pprint([token.vector for token in doc])

[array([ 1.8733e-01,  4.0595e-01, -5.1174e-01, -5.5482e-01,  3.9716e-02,
        1.2887e-01,  4.5137e-01, -5.9149e-01,  1.5591e-01,  1.5137e+00,
       -8.7020e-01,  5.0672e-02,  1.5211e-01, -1.9183e-01,  1.1181e-01,
        1.2131e-01, -2.7212e-01,  1.6203e+00, -2.4884e-01,  1.4060e-01,
        3.3099e-01, -1.8061e-02,  1.5244e-01, -2.6943e-01, -2.7833e-01,
       -5.2123e-02, -4.8149e-01, -5.1839e-01,  8.6262e-02,  3.0818e-02,
       -2.1253e-01, -1.1378e-01, -2.2384e-01,  1.8262e-01, -3.4541e-01,
        8.2611e-02,  1.0024e-01, -7.9550e-02, -8.1721e-01,  6.5621e-03,
        8.0134e-02, -3.9976e-01, -6.3131e-02,  3.2260e-01, -3.1625e-02,
        4.3056e-01, -2.7270e-01, -7.6020e-02,  1.0293e-01, -8.8653e-02,
       -2.9087e-01, -4.7214e-02,  4.6036e-02, -1.7788e-02,  6.4990e-02,
        8.8451e-02, -3.1574e-01, -5.8522e-01,  2.2295e-01, -5.2785e-02,
       -5.5981e-01, -3.9580e-01, -7.9849e-02, -1.0933e-02, -4.1722e-02,
       -5.5576e-01,  8.8707e-02,  1.3710e-01, -2.9873e-03, -2.6

In [29]:
# Word similarities
doc = nlp("happy joyous sad")

for token1 in doc:
    for token2 in doc:
        print('{:>6} - {:<6} : {}'.format(token1.text, token2.text, token1.similarity(token2)))

 happy - happy  : 1.0
 happy - joyous : 0.5333030223846436
 happy - sad    : 0.6438988447189331
joyous - happy  : 0.5333030223846436
joyous - joyous : 1.0
joyous - sad    : 0.4383276700973511
   sad - happy  : 0.6438988447189331
   sad - joyous : 0.4383276700973511
   sad - sad    : 1.0


In [41]:
# Document similarities
sent1 = nlp("I am happy")
sent2 = nlp("I am sad")
sent3 = nlp("I am joyous")

# Compute similarity between sent1 and sent2
print('{} - {:<11} : {}'.format(sent1.text, sent2.text, sent1.similarity(sent2)))

# Compute similarity between sent1 and sent3
print('{} - {:<11} : {}'.format(sent1.text, sent3.text, sent1.similarity(sent3)))

I am happy - I am sad    : 0.9492464724721577
I am happy - I am joyous : 0.9239675481730458
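
Both document scores above are high even though 'happy' and 'sad' differ in meaning. A rough sketch of why, assuming a document vector is the plain average of its token vectors compared by cosine similarity: the 3-dimensional vectors below are made up for illustration and are not real spaCy embeddings, and `doc_vector` and `cosine` are hypothetical helpers.

```python
import numpy as np

# Made-up 3-dimensional vectors for illustration only; real spaCy
# vectors have hundreds of dimensions and different values.
toy_vectors = {
    'I':     np.array([0.9, 0.1, 0.0]),
    'am':    np.array([0.8, 0.2, 0.1]),
    'happy': np.array([0.1, 0.9, 0.2]),
    'sad':   np.array([0.1, -0.7, 0.6]),
}


def doc_vector(text):
    """Average the vectors of the whitespace-separated tokens in `text`."""
    return np.mean([toy_vectors[token] for token in text.split()], axis=0)


def cosine(u, v):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


word_sim = cosine(toy_vectors['happy'], toy_vectors['sad'])
doc_sim = cosine(doc_vector('I am happy'), doc_vector('I am sad'))

print(word_sim, doc_sim)
```

In this toy setup the two words point in nearly opposite directions, yet the sentence-level score is clearly positive, because the shared tokens 'I' and 'am' dominate both averages.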


## 4.14 Generating word vectors

In this exercise, we will generate the pairwise similarity scores of all the words in a sentence. The sentence is available as __sent__ and has been printed to the console for your convenience.

**Instructions**

1. Create a Doc object doc for sent.
2. In the nested loop, compute the similarity between token1 and token2.

**Results**

<font color=darkgreen>Good job! Notice how the words 'apples' and 'oranges' have the highest pairwise similarity score. This is expected as they are both fruits and are more related to each other than any other pair of words.</font>

In [43]:
# Read data
sent = 'I like apples and oranges'

# Create the doc object
doc = nlp(sent)

# Compute pairwise similarity scores
for token1 in doc:
    for token2 in doc:
        print('{:>8} - {:<8} : {}'.format(token1.text, token2.text, token1.similarity(token2)))

       I - I        : 1.0
       I - like     : 0.5554912686347961
       I - apples   : 0.20442721247673035
       I - and      : 0.3160785734653473
       I - oranges  : 0.18824081122875214
    like - I        : 0.5554912686347961
    like - like     : 1.0
    like - apples   : 0.32987144589424133
    like - and      : 0.5267484188079834
    like - oranges  : 0.2771747410297394
  apples - I        : 0.20442721247673035
  apples - like     : 0.32987144589424133
  apples - apples   : 1.0
  apples - and      : 0.24097733199596405
  apples - oranges  : 0.7780942320823669
     and - I        : 0.3160785734653473
     and - like     : 0.5267484188079834
     and - apples   : 0.24097733199596405
     and - and      : 1.0
     and - oranges  : 0.19245947897434235
 oranges - I        : 0.18824081122875214
 oranges - like     : 0.2771747410297394
 oranges - apples   : 0.7780942320823669
 oranges - and      : 0.19245947897434235
 oranges - oranges  : 1.0
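
Scanning a printed table for the strongest pair gets tedious as sentences grow. As a small sketch, the scores above (rounded to four decimals) can be placed in a matrix and the best distinct pair picked out with NumPy. The variable names are illustrative.

```python
import numpy as np

words = ['I', 'like', 'apples', 'and', 'oranges']

# Pairwise scores copied from the output above, rounded to 4 decimals
scores = np.array([
    [1.0000, 0.5555, 0.2044, 0.3161, 0.1882],
    [0.5555, 1.0000, 0.3299, 0.5267, 0.2772],
    [0.2044, 0.3299, 1.0000, 0.2410, 0.7781],
    [0.3161, 0.5267, 0.2410, 1.0000, 0.1925],
    [0.1882, 0.2772, 0.7781, 0.1925, 1.0000],
])

# Keep only entries above the diagonal so that self-pairs and
# mirrored duplicates are ignored
above_diagonal = np.triu(np.ones(scores.shape, dtype=bool), k=1)
masked = np.where(above_diagonal, scores, -np.inf)

row, col = np.unravel_index(np.argmax(masked), masked.shape)
best_pair = (words[row], words[col])
best_score = scores[row, col]

print(best_pair, best_score)
```

Masking with negative infinity rather than zero keeps the lookup correct even if some similarity scores are negative.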


## 4.15 Computing similarity of Pink Floyd songs

In this final exercise, you have been given lyrics of three songs by the British band Pink Floyd, namely 'High Hopes', 'Hey You' and 'Mother'. The lyrics to these songs are available as __hopes__, __hey__ and __mother__ respectively.

Your task is to compute the pairwise similarity between __mother__ and __hopes__, and __mother__ and __hey__.

**Instructions**

1. Create Doc objects for mother, hopes and hey.
2. Compute the similarity between mother and hopes.
3. Compute the similarity between mother and hey.

**Results**

<font color=darkgreen>Excellent work! Notice that 'Mother' and 'Hey You' have a similarity score of 0.9 whereas 'Mother' and 'High Hopes' have a score of only 0.6. This is probably because 'Mother' and 'Hey You' were both songs from the same album, 'The Wall', and were penned by Roger Waters. On the other hand, 'High Hopes' was part of the album 'The Division Bell', with lyrics by David Gilmour and his wife, Polly Samson. Treat yourself by listening to these songs. They're some of the best!</font>

In [65]:
with open('data/mother.dat', 'r', encoding='utf-8') as f: mother = f.read()
with open('data/hopes.dat' , 'r', encoding='utf-8') as f: hopes  = f.read()
with open('data/hey.dat'   , 'r', encoding='utf-8') as f: hey    = f.read()
    
# Create Doc objects
mother_doc = nlp(mother)
hopes_doc = nlp(hopes)
hey_doc = nlp(hey)

# Print similarity between mother and hopes
print('mother - hopes :', mother_doc.similarity(hopes_doc))

# Print similarity between mother and hey
print('mother - hey   :', mother_doc.similarity(hey_doc))

mother - hopes : 0.8653561365788051
mother - hey   : 0.9591095703289078


## 4.16 Congratulations!

1. Congratulations!
>Congratulations on making it to the end of the course!

2. Review
>In this course, we learned about various feature engineering techniques for natural language processing in Python. We started off by computing basic features such as character length and word length of documents. We then moved on to readability scores and learned various metrics that could help us deduce the amount of education required to comprehend a piece of text fully. Next, we were introduced to the spaCy library and learned to perform tokenization and lemmatization. Building on these techniques, we proceeded to explore text cleaning. We also learned how to perform part-of-speech tagging and named entity recognition using spaCy models and had a sneak peek at their applications. The third chapter was dedicated to n-gram modeling. We also explored an application of it in sentiment analysis of movie reviews. The final chapter saw us covering tf-idf vectors and cosine similarity. Using these concepts, we built a movie and a TED Talk recommender. The final lesson gave you a sneak peek into word embeddings and their use cases.

3. Further resources
>This is by no means the end of the road. Once you're done with this course, it is highly recommended that you take the following courses, also offered by DataCamp, to strengthen your skills further.

4. Thank you!
>We hope you have enjoyed taking this course as much as we did developing it. Thank you and all the best with your data science journey!

# Additional material

- DataCamp course: https://learn.datacamp.com/courses/feature-engineering-for-nlp-in-python
- POS annotations in spaCy: https://spacy.io/api/annotation#pos-tagging
- NER annotations in spaCy: https://spacy.io/api/annotation#named-entities
- TfidfVectorizer: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html