# Building tf-idf document vectors
![image-4](image-4.png)


In [1]:
import pandas as pd
ted = pd.read_csv('datasets/ted.csv')
ted.head()

Unnamed: 0,transcript,url
0,"We're going to talk — my — a new lecture, just...",https://www.ted.com/talks/al_seckel_says_our_b...
1,"This is a representation of your brain, and yo...",https://www.ted.com/talks/aaron_o_connell_maki...
2,It's a great honor today to share with you The...,https://www.ted.com/talks/carter_emmart_demos_...
3,"My passions are music, technology and making t...",https://www.ted.com/talks/jared_ficklin_new_wa...
4,It used to be that if you wanted to get a comp...,https://www.ted.com/talks/jeremy_howard_the_wo...


In [2]:
# Import TfidfVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

# Create TfidfVectorizer object
vectorizer = TfidfVectorizer()

# Generate matrix of word vectors
tfidf_matrix = vectorizer.fit_transform(ted['transcript'])

# Print the shape of tfidf_matrix
print(tfidf_matrix.shape)

(500, 29158)


In [3]:
# First 5 rows of tfidf_matrix: each row is a transcript, each column is a vocabulary term, and each value is that term's tf-idf weight
tfidf_matrix[:5].toarray()

array([[0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.01290035, 0.        , ..., 0.        , 0.        ,
        0.        ]])
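
Most entries above are zero because each transcript uses only a small part of the vocabulary. The following is a minimal sketch on a made-up three-document corpus (the names `toy_docs`, `toy_vectorizer` and `toy_tfidf` are illustrative and not part of the TED dataset) showing how the non-zero weights of one row can be mapped back to their terms through the fitted `vocabulary_` attribute.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus, used only to keep the matrix small enough to read
toy_docs = [
    "the lion is the king of the jungle",
    "lions live in the jungle",
    "the king sleeps",
]

toy_vectorizer = TfidfVectorizer()
toy_tfidf = toy_vectorizer.fit_transform(toy_docs)

# vocabulary_ maps term -> column index; invert it to read columns as terms
index_to_term = {col: term for term, col in toy_vectorizer.vocabulary_.items()}

# A sparse row only stores its non-zero entries (column indices and values)
first_row = toy_tfidf[0]
weights = {
    index_to_term[col]: round(float(weight), 3)
    for col, weight in zip(first_row.indices, first_row.data)
}

print(toy_tfidf.shape)
print(sorted(weights.items(), key=lambda kv: kv[1], reverse=True))
```

Only the terms that occur in the first document appear in `weights`; every other column of that row is an implicit zero.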

# Cosine Similarity
![image-5.png](image-5.png)

**Points to remember:**
- Cosine similarity takes values between -1 and 1.
- In NLP, document vectors almost always use non-negative weights. Therefore, cosine scores fall between 0 and 1, where 0 indicates that the documents share no terms and 1 indicates that their vectors point in the same direction (as they do for identical documents).
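
As a rough illustration of the points above, the sketch below computes the cosine score directly from its definition, the dot product divided by the product of the two magnitudes. The helper `cosine` and the vectors `a`, `b`, `c` and `d` are made up for this example.

```python
import numpy as np

def cosine(u, v):
    """Cosine of the angle between two non-zero vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

a = np.array([1.0, 2.0, 0.0])
b = np.array([2.0, 4.0, 0.0])   # same direction as a, different length
c = np.array([0.0, 0.0, 3.0])   # shares no non-zero component with a
d = -a                          # opposite direction (needs negative weights)

print(cosine(a, b))  # close to 1: same direction
print(cosine(a, c))  # 0: nothing in common
print(cosine(a, d))  # close to -1: only reachable with negative components
```

Because `b` is just a scaled copy of `a`, the score ignores vector length and depends only on direction, which is why long and short documents remain comparable.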

In [4]:
from sklearn.metrics.pairwise import cosine_similarity

# Initialize an instance of tf-idf Vectorizer
tfidf_vectorizer = TfidfVectorizer()

# Generate the tf-idf vectors for the corpus
tfidf_matrix = tfidf_vectorizer.fit_transform(ted['transcript'])

# Compute and print the cosine similarity matrix
cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)
print(cosine_sim)

[[1.         0.47014672 0.43111468 ... 0.46412547 0.45670633 0.49282253]
 [0.47014672 1.         0.44271363 ... 0.46985012 0.44711665 0.50449557]
 [0.43111468 0.44271363 1.         ... 0.46801727 0.39181858 0.46637994]
 ...
 [0.46412547 0.46985012 0.46801727 ... 1.         0.45097242 0.48634418]
 [0.45670633 0.44711665 0.39181858 ... 0.45097242 1.         0.49410619]
 [0.49282253 0.50449557 0.46637994 ... 0.48634418 0.49410619 1.        ]]


In [5]:
# tfidf_matrix: one row per TED transcript, one column per term in the vocabulary of the whole dataset
tfidf_matrix.shape

(500, 29158)

In [6]:
# cosine_sim: the similarity between every pair of TED transcripts
cosine_sim.shape

(500, 500)

# Building a plot line based recommender
**Steps:**
1. Text preprocessing
2. Generate tf-idf vectors
3. Generate cosine similarity matrix

## The linear_kernel function
- With the default settings of scikit-learn's `TfidfVectorizer` (`norm='l2'`), every non-empty tf-idf vector is scaled to a magnitude of 1.
- Since both magnitudes are 1, the denominator of the cosine formula is 1, so the cosine score of two tf-idf vectors equals their dot product.
- This lets us compute the cosine similarity matrix faster, because the magnitudes do not need to be computed again.
- Therefore, while working with tf-idf vectors, we can use the **linear_kernel function**, which computes the pairwise dot product of every vector with every other vector.
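
The claim above can be checked on a small made-up corpus. This is a sketch under the assumption that `TfidfVectorizer` is used with its default `norm='l2'`; `toy_corpus`, `toy_matrix`, `row_norms` and `same` are illustrative names only.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Hypothetical toy corpus
toy_corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are pets",
]

# Default norm='l2' rescales every row to unit length
toy_matrix = TfidfVectorizer().fit_transform(toy_corpus)

# Magnitude of each document vector
row_norms = np.linalg.norm(toy_matrix.toarray(), axis=1)
print(row_norms)

# With unit-length rows, the plain dot product matches the cosine score
same = np.allclose(
    linear_kernel(toy_matrix, toy_matrix),
    cosine_similarity(toy_matrix, toy_matrix),
)
print(same)
```

If the vectorizer were created with `norm=None`, the rows would no longer have unit length and `linear_kernel` would return raw dot products rather than cosine scores.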

In [7]:
movie_overview = pd.read_csv('datasets/movie_overviews.csv')
movie_overview.head()

Unnamed: 0,id,title,overview,tagline
0,862,Toy Story,"Led by Woody, Andy's toys live happily in his ...",
1,8844,Jumanji,When siblings Judy and Peter discover an encha...,Roll the dice and unleash the excitement!
2,15602,Grumpier Old Men,A family wedding reignites the ancient feud be...,Still Yelling. Still Fighting. Still Ready for...
3,31357,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom...",Friends are the people who let you be yourself...
4,11862,Father of the Bride Part II,Just when George Banks has recovered from his ...,Just When His World Is Back To Normal... He's ...


In [8]:
# Drop rows whose tagline is missing
movie_overview.dropna(subset=['tagline'],inplace=True)
movie_overview.head()

Unnamed: 0,id,title,overview,tagline
1,8844,Jumanji,When siblings Judy and Peter discover an encha...,Roll the dice and unleash the excitement!
2,15602,Grumpier Old Men,A family wedding reignites the ancient feud be...,Still Yelling. Still Fighting. Still Ready for...
3,31357,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom...",Friends are the people who let you be yourself...
4,11862,Father of the Bride Part II,Just when George Banks has recovered from his ...,Just When His World Is Back To Normal... He's ...
5,949,Heat,"Obsessive master thief, Neil McCauley leads a ...",A Los Angeles Crime Saga


In [9]:
# Reset Index
movie_overview.reset_index(drop=True,inplace=True)

In [10]:
# Pre-processed dataset
movie_overview.head()

Unnamed: 0,id,title,overview,tagline
0,8844,Jumanji,When siblings Judy and Peter discover an encha...,Roll the dice and unleash the excitement!
1,15602,Grumpier Old Men,A family wedding reignites the ancient feud be...,Still Yelling. Still Fighting. Still Ready for...
2,31357,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom...",Friends are the people who let you be yourself...
3,11862,Father of the Bride Part II,Just when George Banks has recovered from his ...,Just When His World Is Back To Normal... He's ...
4,949,Heat,"Obsessive master thief, Neil McCauley leads a ...",A Los Angeles Crime Saga


In [11]:
# Shape of preprocessed dataset
movie_overview.shape

(7033, 4)

In [12]:
from sklearn.metrics.pairwise import linear_kernel

# Text used to describe each movie (here the tagline column)
movie_plots = movie_overview['tagline']

# Initialize the TfidfVectorizer 
tfidf = TfidfVectorizer(stop_words='english')

# Construct the TF-IDF matrix
tfidf_matrix = tfidf.fit_transform(movie_plots)

# Generate the cosine similarity matrix (dot products of the L2-normalized tf-idf rows)
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)

In [13]:
# Similarity between every pair of movies
cosine_sim.shape

(7033, 7033)

In [14]:
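# Generate mapping between titles and index
# Note: drop_duplicates() compares the Series values (the row numbers), which are all unique, so repeated titles are kept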
indices = pd.Series(movie_overview.index, index=movie_overview['title']).drop_duplicates()
indices

title
Jumanji                                                  0
Grumpier Old Men                                         1
Waiting to Exhale                                        2
Father of the Bride Part II                              3
Heat                                                     4
                                                      ... 
Kingsglaive: Final Fantasy XV                         7028
Sharknado 4: The 4th Awakens                          7029
Rustom                                                7030
Shin Godzilla                                         7031
The Beatles: Eight Days a Week - The Touring Years    7032
Length: 7033, dtype: int64

In [15]:
# Recommender function
def get_recommendations(title, cosine_sim, indices):
    
    # Get index of movie that matches title
    idx = indices[title]
    
    # Sort the movies based on the similarity scores
    sim_scores = list(enumerate(cosine_sim[idx]))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    
    # Keep the 10 most similar movies, skipping the top entry (normally the movie itself)
    sim_scores = sim_scores[1:11]
    
    # Get the movie indices
    movie_indices = [i[0] for i in sim_scores]
    
    # Return the top 10 most similar movies
    return movie_overview['title'].iloc[movie_indices]

In [20]:
# Generate recommendations 
print(get_recommendations('Jumanji', cosine_sim, indices))

3763                                 Hulk
2868    Final Fantasy: The Spirits Within
2049                        Sleepy Hollow
4839                            Last Days
4209            The Plague of the Zombies
1629                       Trick or Treat
6437                           Iron Man 3
4772                 Panic in the Streets
968                        Vegas Vacation
2708                      Head Over Heels
Name: title, dtype: object


# Problem with BoW and tf-idf
![image-6](image-6.png)

Consider three sentences: 'I am happy', 'I am joyous' and 'I am sad'. If we compute their similarities using BoW or tf-idf vectors, 'I am happy' and 'I am joyous' get exactly the same score as 'I am happy' and 'I am sad'. This is because 'happy', 'joyous' and 'sad' are treated as three completely unrelated words. However, we know that 'happy' and 'joyous' are much closer in meaning to each other than either is to 'sad'. The vectorization techniques we've covered so far simply cannot capture this.
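
A short sketch of this limitation, using the three sentences from the paragraph above. The variable names are illustrative. Note that the default tokenizer of both vectorizers drops one-character tokens such as "I", which does not affect the comparison.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

sentences = ["I am happy", "I am joyous", "I am sad"]

# Bag-of-words vectors
bow_sim = cosine_similarity(CountVectorizer().fit_transform(sentences))

# tf-idf vectors
tfidf_sim = cosine_similarity(TfidfVectorizer().fit_transform(sentences))

# Row 0 is 'I am happy': compare it with 'joyous' (column 1) and 'sad' (column 2)
print("BoW   happy~joyous:", bow_sim[0, 1], " happy~sad:", bow_sim[0, 2])
print("tfidf happy~joyous:", tfidf_sim[0, 1], " happy~sad:", tfidf_sim[0, 2])
```

Under both schemes the two scores come out equal, because the only overlap between any pair of sentences is the shared word "am".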

# Word embeddings
- Mapping words into an n-dimensional vector space
- Produced using deep learning and huge amounts of data
- Discern how similar two words are to each other
- Used to detect synonyms and antonyms
- Captures complex relationships
    - `King`-`Queen` -> `Man`-`Woman`
    - `France`-`Paris` -> `Russia`-`Moscow`
- Dependent on the pre-trained spaCy model you load; independent of the dataset you use
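
To make the idea of relationships as vector offsets concrete, here is a toy sketch. The two-dimensional vectors below are hand-made for illustration and are NOT learned embeddings; real models use hundreds of dimensions and values learned from data. The names `toy_vectors`, `cosine` and `nearest` are hypothetical.

```python
import numpy as np

# Hand-made toy vectors (assumption: axis 0 ~ "royalty", axis 1 ~ "gender")
toy_vectors = {
    "king":  np.array([1.0, 1.0]),
    "queen": np.array([1.0, -1.0]),
    "man":   np.array([0.2, 1.0]),
    "woman": np.array([0.2, -1.0]),
    "apple": np.array([-1.0, 0.1]),
}

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def nearest(vec):
    """Return the word whose toy vector is most similar to vec."""
    return max(toy_vectors, key=lambda word: cosine(vec, toy_vectors[word]))

# king - man + woman: remove the "man" offset, add the "woman" offset
target = toy_vectors["king"] - toy_vectors["man"] + toy_vectors["woman"]
print(nearest(target))
```

In this constructed space the result lands exactly on the vector for "queen"; with real embeddings the match is only approximate.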

In [23]:
import spacy

sent = 'I like apples and oranges'

# Load the model
nlp = spacy.load('en_core_web_lg')

# Create the doc object
doc = nlp(sent)

# Compute pairwise similarity scores
for token1 in doc:
  for token2 in doc:
    print(token1.text, token2.text, token1.similarity(token2))

I I 1.0
I like 0.3184410631656647
I apples 0.1975560337305069
I and -0.0979200005531311
I oranges 0.05048730596899986
like I 0.3184410631656647
like like 1.0
like apples 0.29574331641197205
like and 0.24359610676765442
like oranges 0.2706858515739441
apples I 0.1975560337305069
apples like 0.29574331641197205
apples apples 1.0
apples and 0.24472734332084656
apples oranges 0.7808241248130798
and I -0.0979200005531311
and like 0.24359610676765442
and apples 0.24472734332084656
and and 1.0
and oranges 0.3738573491573334
oranges I 0.05048730596899986
oranges like 0.2706858515739441
oranges apples 0.7808241248130798
oranges and 0.3738573491573334
oranges oranges 1.0


Among pairs of different tokens, 'apples' and 'oranges' have the highest similarity (about 0.78), which makes sense since both are fruits.