# Motivation
* Some words occur very commonly across all documents
* Corpus of documents on the universe
    * One document has jupiter and universe occurring 20 times each.
    * jupiter rarely occurs in the other documents. universe is common.
    * Give more weight to jupiter on account of exclusivity.

# Applications
* Automatically detect stopwords
* Search algorithms to determine the ranking of pages
* Recommender systems
* Better performance in predictive modeling for some cases

# Term frequency-inverse document frequency
> The weight of a term in a document is proportional to term frequency and an inverse function of the number of documents in which it occurs

## Mathematical formula

$$w_{i,j} = tf_{i,j} \times \log{\left(\frac{N}{df_i}\right)}$$
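
To make the formula concrete, here is a minimal sketch that applies it by hand to the jupiter/universe example from the motivation. Only the term count of 20 comes from the text; the corpus size (`n_docs = 100`), the document frequencies and the helper name `tfidf_weight` are invented for illustration, and the natural logarithm is assumed.

```python
import math

def tfidf_weight(tf, df, n_docs):
    """Plain tf-idf: term frequency times log(N / df)."""
    return tf * math.log(n_docs / df)

# Hypothetical corpus of 100 documents about the universe.
n_docs = 100
w_jupiter = tfidf_weight(tf=20, df=2, n_docs=n_docs)    # rare term
w_universe = tfidf_weight(tf=20, df=90, n_docs=n_docs)  # common term

print(f"jupiter:  {w_jupiter:.2f}")
print(f"universe: {w_universe:.2f}")
```

Both terms occur 20 times in the document, yet the rare term ends up with a much larger weight because its document frequency is low.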

## tf-idf using scikit-learn

In [1]:
import pandas as pd
corpus = pd.Series(
    [
        "The lion is the king of the jungle",
        "Lions have lifespans of a decade",
        "The lion is an endangered species",
    ]
)

In [2]:
# Import TfidfVectorizer and CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer

In [3]:
vectorizer = CountVectorizer()
count_matrix = vectorizer.fit_transform(corpus)
pd.DataFrame(count_matrix.toarray(), columns=vectorizer.get_feature_names_out())

Unnamed: 0,an,decade,endangered,have,is,jungle,king,lifespans,lion,lions,of,species,the
0,0,0,0,0,1,1,1,0,1,0,1,0,3
1,0,1,0,1,0,0,0,1,0,1,1,0,0
2,1,0,1,0,1,0,0,0,1,0,0,1,1


In [4]:
# Create TfidfVectorizer object
vectorizer = TfidfVectorizer()
# Generate matrix of word vectors
tfidf_matrix = vectorizer.fit_transform(corpus)
pd.DataFrame(tfidf_matrix.toarray(), columns=vectorizer.get_feature_names_out())

Unnamed: 0,an,decade,endangered,have,is,jungle,king,lifespans,lion,lions,of,species,the
0,0.0,0.0,0.0,0.0,0.254347,0.334435,0.334435,0.0,0.254347,0.0,0.254347,0.0,0.76304
1,0.0,0.467351,0.0,0.467351,0.0,0.0,0.0,0.467351,0.0,0.467351,0.355432,0.0,0.0
2,0.459548,0.0,0.459548,0.0,0.349498,0.0,0.0,0.0,0.349498,0.0,0.0,0.459548,0.349498


## tf-idf weight of a word occurring in all documents

The word bottle occurs 5 times in a particular document D and also occurs in every document of the corpus. What is the tf-idf weight of bottle in D?

$$w_{i,j} = tf_{i,j} \times \log{\left(\frac{N}{df_i}\right)}$$

In [5]:
import numpy as np
5 * np.log(1)

0.0

> The inverse document frequency of a term is the same for every document in the corpus. Since bottle occurs in every document, N/df = 1 and the idf is log(1) = 0, so the weight is 0 regardless of the term frequency.
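
Note that scikit-learn's `TfidfVectorizer` does not use this exact formula by default. As documented for its default settings (`smooth_idf=True`), it computes the idf as $\ln\frac{1+N}{1+df} + 1$, so a term that occurs in every document gets an idf of 1 rather than 0 and keeps a non-zero weight. The sketch below compares the two variants in plain Python; the corpus size of 500 and the function names are assumptions made for illustration.

```python
import math

def textbook_idf(df, n_docs):
    # idf from the formula above: log(N / df)
    return math.log(n_docs / df)

def smoothed_idf(df, n_docs):
    # Variant documented for TfidfVectorizer's defaults (smooth_idf=True):
    # ln((1 + N) / (1 + df)) + 1
    return math.log((1 + n_docs) / (1 + df)) + 1

n_docs = 500   # hypothetical corpus size
df = 500       # "bottle" occurs in every document
tf = 5

print(tf * textbook_idf(df, n_docs))   # 0.0
print(tf * smoothed_idf(df, n_docs))   # 5.0
```

The smoothed value is shown before L2 normalization, which `TfidfVectorizer` also applies by default.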

## tf-idf vectors for TED talks

In [6]:
ted = pd.read_csv('ted.csv')

In [7]:
ted

Unnamed: 0,transcript,url
0,"We're going to talk — my — a new lecture, just...",https://www.ted.com/talks/al_seckel_says_our_b...
1,"This is a representation of your brain, and yo...",https://www.ted.com/talks/aaron_o_connell_maki...
2,It's a great honor today to share with you The...,https://www.ted.com/talks/carter_emmart_demos_...
3,"My passions are music, technology and making t...",https://www.ted.com/talks/jared_ficklin_new_wa...
4,It used to be that if you wanted to get a comp...,https://www.ted.com/talks/jeremy_howard_the_wo...
...,...,...
495,Today I'm going to unpack for you three exampl...,https://www.ted.com/talks/john_hodgman_design_...
496,Both myself and my brother belong to the under...,https://www.ted.com/talks/sheikha_al_mayassa_g...
497,John Hockenberry: It's great to be here with y...,https://www.ted.com/talks/tom_shannon_the_pain...
498,"What you're doing, right now, at this very mom...",https://www.ted.com/talks/nilofer_merchant_got...


In [8]:
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(ted['transcript'])
pd.DataFrame(tfidf_matrix.toarray(), columns=vectorizer.get_feature_names_out())

Unnamed: 0,00,000,000001,00001,000042,0001,000th,001,01,024,...,zywiec,ºf,čapek,ʾan,ʾilla,ʾilāha,อย,อยman,อร,送你葱
0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.012900,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
495,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
496,0.0,0.008076,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
497,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
498,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


# Cosine similarity

We now know how to compute vectors from text documents. With this representation in mind, let us explore techniques for determining how similar two vectors, and consequently two documents, are to each other. More specifically, we will learn about the cosine similarity score, one of the most widely used similarity metrics in NLP.

## Mathematical formula

Simply put, the cosine similarity score of two vectors is the cosine of the angle between them. Mathematically, it is the ratio of the dot product of the vectors to the product of their magnitudes. Let's walk through what this formula means.

$$ similarity(\vec{A},\vec{B}) = \cos{(\theta)} = \frac{\vec{A} \cdot \vec{B}}{||\vec{A}|| \times ||\vec{B}||}$$

## The dot product

Consider two vectors,
$$\vec{V} = (v_1 , v_2 , ⋯ , v_n ), \vec{W} = (w_1 , w_2 , ⋯ , w_n)$$

Then the dot product of V and W is,
$$\vec{V} \cdot \vec{W} = (v_1 \times w_1 ) + (v_2 \times w_2 ) + ⋯ + (v_n \times w_n )$$

Example:

$$\vec{A} = (4, 7, 1), \vec{B} = (5, 2, 3)$$
$$\vec{A} \cdot \vec{B} = (4 \times 5) + (7 \times 2) + (1 \times 3)$$
$$ = 20 + 14 + 3 = 37$$

In [9]:
np.dot(A:=np.array([4,7,1]), B:=np.array([5,2,3]))

37

## Magnitude of a vector

For any vector,
$$\vec{V} = (v_1 , v_2 , ⋯ , v_n)$$
The magnitude is defined as,
$$||\vec{V}|| = \sqrt{v_1^2 + v_2^2 + ⋯ + v_n^2}$$
Example:
$$\vec{A} = (4,7,1), \vec{B} = (5,2,3)$$
$$||\vec{A}|| = \sqrt{(4)^2 + (7)^2 + (1)^2} = \sqrt{16+49+1} = \sqrt{66}$$

In [10]:
np.linalg.norm(A)

8.12403840463596

$$||\vec{B}|| = \sqrt{(5)^2 + (2)^2 + (3)^2} = \sqrt{25+4+9} = \sqrt{38}$$

In [11]:
np.linalg.norm(B)

6.164414002968976

## The cosine score

$$\cos{(\theta)} = similarity(\vec{A},\vec{B}) = \frac{\vec{A} \cdot \vec{B}}{||\vec{A}|| \times ||\vec{B}||} = \frac{37}{\sqrt{66}\times\sqrt{38}} \approx 0.7388$$

In [12]:
np.dot(A, B) / (np.linalg.norm(A) * np.linalg.norm(B))

0.7388188340435563

## Cosine Score: points to remember
* In general, the value lies between -1 and 1.
* In NLP, the value lies between 0 and 1 because term frequencies (and tf-idf weights) are non-negative.
* Robust to document length: since the cosine score ignores the magnitude of the vectors, it is fairly robust to document length. This may be an advantage or a disadvantage depending on the use case. 
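
The robustness to document length can be checked directly: repeating a document scales its count vector but leaves the angle to other vectors unchanged. A small sketch in plain Python, where the helper name `cosine` and the variable names are illustrative and the vectors are the A and B from the worked example:

```python
import math

def cosine(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

doc_a = [4, 7, 1]
doc_b = [5, 2, 3]
doc_a_tripled = [3 * x for x in doc_a]   # same "document" repeated three times

print(cosine(doc_a, doc_b))
print(cosine(doc_a_tripled, doc_b))
```

Both calls print the same score (up to floating-point rounding), even though the second vector is three times as long.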

In [13]:
from sklearn.metrics.pairwise import cosine_similarity
cosine_similarity([A],[B])[0]

array([0.73881883])

## Cosine similarity matrix of a corpus

Compute the cosine similarity matrix which contains the pairwise cosine similarity score for every pair of sentences (vectorized using tf-idf). 

In [14]:
corpus = ['The sun is the largest celestial body in the solar system', 'The solar system consists of the sun and eight revolving planets', 'Ra was the Egyptian Sun God', 'The Pyramids were the pinnacle of Egyptian architecture', 'The quick brown fox jumps over the lazy dog']
corpus

['The sun is the largest celestial body in the solar system',
 'The solar system consists of the sun and eight revolving planets',
 'Ra was the Egyptian Sun God',
 'The Pyramids were the pinnacle of Egyptian architecture',
 'The quick brown fox jumps over the lazy dog']

In [15]:
# Initialize an instance of tf-idf Vectorizer
tfidf_vectorizer = TfidfVectorizer()

# Generate the tf-idf vectors for the corpus
tfidf_matrix = tfidf_vectorizer.fit_transform(corpus)
pd.DataFrame(tfidf_matrix.toarray(), columns=tfidf_vectorizer.get_feature_names_out())

Unnamed: 0,and,architecture,body,brown,celestial,consists,dog,egyptian,eight,fox,...,pyramids,quick,ra,revolving,solar,sun,system,the,was,were
0,0.0,0.0,0.337218,0.0,0.337218,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.272065,0.225839,0.272065,0.482058,0.0,0.0
1,0.346907,0.0,0.0,0.0,0.0,0.346907,0.0,0.0,0.346907,0.0,...,0.0,0.0,0.0,0.346907,0.279882,0.232328,0.279882,0.330606,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.387878,0.0,0.0,...,0.0,0.0,0.480764,0.0,0.0,0.321974,0.0,0.229087,0.480764,0.0
3,0.0,0.401284,0.0,0.0,0.0,0.0,0.0,0.323754,0.0,0.0,...,0.401284,0.0,0.0,0.0,0.0,0.0,0.0,0.382428,0.0,0.401284
4,0.0,0.0,0.0,0.355599,0.0,0.0,0.355599,0.0,0.0,0.355599,...,0.0,0.355599,0.0,0.0,0.0,0.0,0.0,0.33889,0.0,0.0


In [16]:
# Compute cosine similarity matrix
cosine_sim = cosine_similarity(tfidf_matrix)
cosine_sim

array([[1.        , 0.36413198, 0.18314713, 0.18435251, 0.16336438],
       [0.36413198, 1.        , 0.15054075, 0.21704584, 0.11203887],
       [0.18314713, 0.15054075, 1.        , 0.21318602, 0.07763512],
       [0.18435251, 0.21704584, 0.21318602, 1.        , 0.12960089],
       [0.16336438, 0.11203887, 0.07763512, 0.12960089, 1.        ]])

> The cosine similarity matrix lies at the heart of many practical systems such as recommenders. From our similarity matrix, we see that the first and the second sentence are the most similar. Also the fifth sentence has, on average, the lowest pairwise cosine scores. This is intuitive as it contains entities that are not present in the other sentences.
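
These two observations can also be read off programmatically. The sketch below copies the matrix printed above into a plain list of lists, then looks for the largest off-diagonal entry and for the row with the lowest mean off-diagonal score. The variable names are illustrative and the indices are zero-based.

```python
sim = [
    [1.0, 0.36413198, 0.18314713, 0.18435251, 0.16336438],
    [0.36413198, 1.0, 0.15054075, 0.21704584, 0.11203887],
    [0.18314713, 0.15054075, 1.0, 0.21318602, 0.07763512],
    [0.18435251, 0.21704584, 0.21318602, 1.0, 0.12960089],
    [0.16336438, 0.11203887, 0.07763512, 0.12960089, 1.0],
]
n = len(sim)

# Most similar pair of distinct sentences (upper triangle only).
pairs = [(sim[i][j], i, j) for i in range(n) for j in range(i + 1, n)]
best_score, best_i, best_j = max(pairs)

# Mean similarity of each sentence to all the others (diagonal excluded).
mean_others = [(sum(row) - row[i]) / (n - 1) for i, row in enumerate(sim)]
least_similar = min(range(n), key=lambda i: mean_others[i])

print(best_i, best_j, best_score)   # 0 1 0.36413198
print(least_similar)                # 4
```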

# Building a plot line based recommender

In [17]:
import numpy as np
import pandas as pd
movies = pd.read_csv('movie_overviews.csv')
movies

Unnamed: 0,id,title,overview,tagline
0,862,Toy Story,"Led by Woody, Andy's toys live happily in his ...",
1,8844,Jumanji,When siblings Judy and Peter discover an encha...,Roll the dice and unleash the excitement!
2,15602,Grumpier Old Men,A family wedding reignites the ancient feud be...,Still Yelling. Still Fighting. Still Ready for...
3,31357,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom...",Friends are the people who let you be yourself...
4,11862,Father of the Bride Part II,Just when George Banks has recovered from his ...,Just When His World Is Back To Normal... He's ...
...,...,...,...,...
9094,159550,The Last Brickmaker in America,A man must cope with the loss of his wife and ...,
9095,392572,Rustom,"Rustom Pavri, an honourable officer of the Ind...",Decorated Officer. Devoted Family Man. Defendi...
9096,402672,Mohenjo Daro,"Village lad Sarman is drawn to big, bad Mohenj...",
9097,315011,Shin Godzilla,From the mind behind Evangelion comes a hit la...,A god incarnate. A city doomed.


## Steps
1. Text preprocessing
2. Generate tf-idf vectors
3. Generate cosine similarity matrix

## The recommender function
1. Take a movie title, cosine similarity matrix and indices series as arguments.
2. Extract pairwise cosine similarity scores for the movie.
3. Sort the scores in descending order.
4. Output titles corresponding to the highest scores.
5. Ignore the highest similarity score (of 1).

## Generating tf-idf vectors

In [18]:
# Import TfidfVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
# Create TfidfVectorizer object
vectorizer = TfidfVectorizer()
# Generate matrix of tf-idf vectors
tfidf_matrix = vectorizer.fit_transform(movies['overview'].fillna(''))
# pd.DataFrame(tfidf_matrix.A, columns=vectorizer.get_feature_names_out())
tfidf_matrix.shape

(9099, 30020)

## Generating cosine similarity matrix

In [19]:
# Import cosine_similarity
from sklearn.metrics.pairwise import cosine_similarity
# Generate cosine similarity matrix

In [20]:
cosine_sim1 = cosine_similarity(tfidf_matrix, tfidf_matrix, dense_output=False)

In [21]:
print(f'Mean similarity: {cosine_sim1.mean()}')

Mean similarity: 0.031205929594335755


## The `linear_kernel` function

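* With the default L2 normalization of `TfidfVectorizer`, the magnitude of a tf-idf vector is 1. Documents without any tokens, such as empty overviews, yield all-zero vectors of magnitude 0, which is the likely reason the mean computed below is slightly under 1.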

In [22]:
# Checking whether the magnitude of a tf-idf vector is 1
magnitude = np.apply_along_axis(func1d=np.linalg.norm, axis=1, arr=tfidf_matrix.toarray()).mean()
magnitude

0.9985712715683042

* Since the magnitude is 1, the cosine score of two tf-idf vectors is equal to their dot product!
$$ \cos{(\theta)} = \frac{\vec{A} \cdot \vec{B}}{||\vec{A}|| \times ||\vec{B}||} = \frac{\vec{A} \cdot \vec{B}}{1 \times 1} = \vec{A} \cdot \vec{B}$$
* Cosine similarity, or the cosine kernel, computes similarity as the normalized dot product of X and Y
    * On L2-normalized data, this function is equivalent to `linear_kernel`.
* Can significantly improve computation time.
* Use `linear_kernel` instead of `cosine_similarity`.
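
A minimal sketch of why the shortcut works, using plain Python and two toy vectors (all names here are illustrative, not scikit-learn's): once both vectors are scaled to unit length, the denominator of the cosine formula is 1 and the dot product alone gives the score.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def l2_normalize(v):
    """Scale v to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v] if norm > 0 else list(v)

u = [4.0, 7.0, 1.0]
w = [5.0, 2.0, 3.0]

# Full cosine formula on the raw vectors.
cosine = dot(u, w) / (math.sqrt(dot(u, u)) * math.sqrt(dot(w, w)))
# Plain dot product on the L2-normalized vectors.
linear = dot(l2_normalize(u), l2_normalize(w))

print(math.isclose(cosine, linear))   # True
```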

In [23]:
# Checking whether cosine_similarity() is like matmul (@)
cosine_sim2 = tfidf_matrix @ tfidf_matrix.T

In [24]:
cosine_sim1.mean() == cosine_sim2.mean() 

True

In [25]:
# Import linear_kernel
from sklearn.metrics.pairwise import linear_kernel
# Generate cosine similarity matrix

In [26]:
cosine_sim3 = linear_kernel(tfidf_matrix, tfidf_matrix, dense_output=False)

In [27]:
cosine_sim2.mean() == cosine_sim3.mean() 

True

## Plot recommendation engine

In [28]:
import pandas as pd
movies = pd.read_csv('movie_overviews.csv')

In [29]:
# indices
indices = movies.reset_index(drop=False).set_index('title')['index']

In [30]:
# Initialize the TfidfVectorizer 
tfidf_matrix = TfidfVectorizer(stop_words='english').fit_transform(movies['overview'].fillna(''))

In [31]:
# cosine similarity
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)

In [32]:
def get_recommendations(title, cosine_sim, indices):
    # Get the index of the movie that matches the title
    idx = indices[title]
    # Get the pairwise similarity scores
    sim_scores = list(enumerate(cosine_sim[idx]))
    # Sort the movies based on the similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    # Get the scores for 10 most similar movies
    sim_scores = sim_scores[1:11]
    # Get the movie indices
    movie_indices = [i[0] for i in sim_scores]
    # Return the top 10 most similar movies
    return movies['title'].iloc[movie_indices]

In [33]:
get_recommendations(title='The Dark Knight Rises',
                    cosine_sim=cosine_sim,
                    indices=indices)

132                              Batman Forever
6907                            The Dark Knight
1116                             Batman Returns
7573                 Batman: Under the Red Hood
524                                      Batman
7907                           Batman: Year One
8171    Batman: The Dark Knight Returns, Part 1
2581               Batman: Mask of the Phantasm
8232    Batman: The Dark Knight Returns, Part 2
6150                              Batman Begins
Name: title, dtype: object
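
`get_recommendations` sorts every score and then drops the first entry, assuming the queried movie ranks first. One possible variation, sketched here on toy data, selects only the top k scores with `heapq.nlargest` and skips the query by index. The function name `top_k_similar`, the four-title catalogue and the matrix are hypothetical and not part of the dataset above.

```python
import heapq

def top_k_similar(idx, sim_matrix, titles, k=10):
    """Return the k titles most similar to the item at position idx."""
    scores = sim_matrix[idx]
    # Skip the query item explicitly instead of assuming it sorts first.
    candidates = ((score, i) for i, score in enumerate(scores) if i != idx)
    best = heapq.nlargest(k, candidates)
    return [titles[i] for _, i in best]

# Hypothetical 4-movie catalogue and similarity matrix.
titles = ["A", "B", "C", "D"]
sim_matrix = [
    [1.0, 0.2, 0.7, 0.1],
    [0.2, 1.0, 0.3, 0.4],
    [0.7, 0.3, 1.0, 0.0],
    [0.1, 0.4, 0.0, 1.0],
]

print(top_k_similar(0, sim_matrix, titles, k=2))   # ['C', 'B']
```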

# Beyond n-grams: word embeddings

## Word embeddings
* Mapping words into an n-dimensional vector space
* Produced using deep learning and huge amounts of data
* Discern how similar two words are to each other
* Used to detect synonyms and antonyms
* Captures complex relationships
    * King - Queen → Man - Woman
    * France - Paris → Russia - Moscow
* Dependent on the pre-trained spaCy model; independent of the dataset you use
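
The "complex relationships" bullet refers to vector arithmetic: the offset between two related words is roughly the same as the offset between an analogous pair. The sketch below uses made-up 2-dimensional vectors purely to show the arithmetic. Real embeddings have hundreds of dimensions, and the relation holds only approximately.

```python
# Toy 2-D "embeddings", invented for illustration (not from any model).
# First axis: royalty, second axis: gender.
vectors = {
    "king":  [0.9, 0.8],
    "queen": [0.9, 0.2],
    "man":   [0.1, 0.8],
    "woman": [0.1, 0.2],
}

def sub(a, b):
    return [x - y for x, y in zip(a, b)]

def add(a, b):
    return [x + y for x, y in zip(a, b)]

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

# king - man + woman should land near queen
target = add(sub(vectors["king"], vectors["man"]), vectors["woman"])
nearest = min(vectors, key=lambda word: sq_dist(vectors[word], target))

print(nearest)   # queen
```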

In [38]:
import spacy
# Require a GPU (raises an error if none is available)
spacy.require_gpu()

True

In [39]:
nlp = spacy.load('en_core_web_lg')
doc = nlp('I am happy')
# Generate word vectors for each token
for token in doc:
    print(token.vector)

[ -1.8607     0.15804   -4.1425    -8.6359   -16.955      1.157
  -1.588      5.6609   -12.03      16.417      4.1907     5.5122
  -0.11932   -6.06       3.8957    -7.8212     3.6736   -14.824
  -7.6638     2.5344     7.9893     3.6785     4.3296   -11.338
  -3.5506    -5.899      1.0998     3.4515    -5.4191     1.8356
  -2.902     -7.9294    -1.1269     8.4124     5.1416    -3.1489
  -4.2061    -1.459      7.8313     0.27859   -4.3832     8.0756
  -0.94784   -6.1214     8.2792     5.0529    -8.3611    -6.0743
  -0.53773    2.7538     3.8162    -4.1612     0.7591    -2.8374
  -6.4851    -3.3435     3.2703     2.759      2.6645     4.0013
  13.381     -5.2907    -3.133      4.5374   -11.899     -6.716
  -0.041939  -2.0879     3.0101    10.3        2.6835     2.7265
   8.3018    -4.4563    14.43       3.9642    -4.8287    -5.648
  -7.2597   -11.475     -2.6171     0.3325    14.454     -5.155
   0.93722   -2.6187    -1.783      3.8711     1.4681    -6.705
  -4.0953    -0.22536    9.444  

## Word similarities

In [50]:
doc = nlp("happy joyous sad")
for token1 in doc:
    for token2 in doc:
        print(token1.text, token2.text, token1.similarity(token2))

happy happy 1.0
happy joyous 0.38305556774139404
happy sad 0.5034751296043396
joyous happy 0.38305556774139404
joyous joyous 1.0
joyous sad 0.5143248438835144
sad happy 0.5034751296043396
sad joyous 0.5143248438835144
sad sad 1.0


## Document similarities

In [57]:
# Generate doc objects
sent1 = nlp("I am happy")
sent2 = nlp("I am sad")
sent3 = nlp("I am joyous")

In [58]:
# Compute similarity between sent1 and sent2
sent1.similarity(sent2)

0.9740256667137146

In [59]:
# Compute similarity between sent1 and sent3
sent1.similarity(sent3)

0.981972336769104
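
Both sentence scores are very high, and "I am sad" scores almost as high as "I am joyous". A likely explanation, assuming spaCy's default behaviour in which a `Doc` vector is the average of its token vectors and `similarity` is the cosine of those vectors, is that the shared tokens "I" and "am" dominate the average. The sketch below reproduces the effect with invented 3-dimensional token vectors; the numbers are not taken from any model.

```python
import math

def mean_vector(vectors):
    """Element-wise average of a list of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Invented token vectors for illustration only.
tok = {
    "I":     [1.0, 0.0, 0.0],
    "am":    [0.0, 1.0, 0.0],
    "happy": [0.0, 0.0, 1.0],
    "sad":   [0.0, 0.0, -1.0],
}

word_sim = cosine(tok["happy"], tok["sad"])
sent_sim = cosine(
    mean_vector([tok["I"], tok["am"], tok["happy"]]),
    mean_vector([tok["I"], tok["am"], tok["sad"]]),
)

print(word_sim)   # -1.0
print(sent_sim)
```

Even with two words that point in opposite directions, the averaged sentence vectors end up far more similar than the words themselves, because two of the three tokens are identical.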

## Generating word vectors

In [60]:
# Create the doc object
sent = 'I like apples and oranges'
doc = nlp(sent)

# Compute pairwise similarity scores
for token1 in doc:
    for token2 in doc:
        print(token1.text, token2.text, token1.similarity(token2))

I I 1.0
I like 0.3184410333633423
I apples 0.1975560337305069
I and -0.0979199931025505
I oranges 0.05048731341958046
like I 0.3184410333633423
like like 1.0
like apples 0.29574334621429443
like and 0.2435961216688156
like oranges 0.2706858813762665
apples I 0.1975560337305069
apples like 0.29574334621429443
apples apples 1.0
apples and 0.24472729861736298
apples oranges 0.7808240652084351
and I -0.0979199931025505
and like 0.2435961216688156
and apples 0.24472729861736298
and and 1.0
and oranges 0.3738572895526886
oranges I 0.05048731341958046
oranges like 0.2706858813762665
oranges apples 0.7808240652084351
oranges and 0.3738572895526886
oranges oranges 1.0


> Notice how the words 'apples' and 'oranges' have the highest pairwise similarity score among distinct words. This is expected, as both are fruits and are more closely related to each other than any other pair of words in the sentence.

## Computing similarity of Pink Floyd songs

In [62]:
hopes = "\nBeyond the horizon of the place we lived when we were young\nIn a world of magnets and miracles\nOur thoughts strayed constantly and without boundary\nThe ringing of the division bell had begun\nAlong the Long Road and on down the Causeway\nDo they still meet there by the Cut\nThere was a ragged band that followed in our footsteps\nRunning before times took our dreams away\nLeaving the myriad small creatures trying to tie us to the ground\nTo a life consumed by slow decay\nThe grass was greener\nThe light was brighter\nWhen friends surrounded\nThe nights of wonder\nLooking beyond the embers of bridges glowing behind us\nTo a glimpse of how green it was on the other side\nSteps taken forwards but sleepwalking back again\nDragged by the force of some in a tide\nAt a higher altitude with flag unfurled\nWe reached the dizzy heights of that dreamed of world\nEncumbered forever by desire and ambition\nThere's a hunger still unsatisfied\nOur weary eyes still stray to the horizon\nThough down this road we've been so many times\nThe grass was greener\nThe light was brighter\nThe taste was sweeter\nThe nights of wonder\nWith friends surrounded\nThe dawn mist glowing\nThe water flowing\nThe endless river\nForever and ever\n"
mother = "\nMother do you think they'll drop the bomb?\nMother do you think they'll like this song?\nMother do you think they'll try to break my balls?\nOoh, ah\nMother should I build the wall?\nMother should I run for President?\nMother should I trust the government?\nMother will they put me in the firing mine?\nOoh ah,\nIs it just a waste of time?\nHush now baby, baby, don't you cry.\nMama's gonna make all your nightmares come true.\nMama's gonna put all her fears into you.\nMama's gonna keep you right here under her wing.\nShe won't let you fly, but she might let you sing.\nMama's gonna keep baby cozy and warm.\nOoh baby, ooh baby, ooh baby,\nOf course mama's gonna help build the wall.\nMother do you think she's good enough, for me?\nMother do you think she's dangerous, to me?\nMother will she tear your little boy apart?\nOoh ah,\nMother will she break my heart?\nHush now baby, baby don't you cry.\nMama's gonna check out all your girlfriends for you.\nMama won't let anyone dirty get through.\nMama's gonna wait up until you get in.\nMama will always find out where you've been.\nMama's gonna keep baby healthy and clean.\nOoh baby, ooh baby, ooh baby,\nYou'll always be baby to me.\nMother, did it need to be so high?\n"
hey = "\nHey you, out there in the cold\nGetting lonely, getting old\nCan you feel me?\nHey you, standing in the aisles\nWith itchy feet and fading smiles\nCan you feel me?\nHey you, don't help them to bury the light\nDon't give in without a fight\nHey you out there on your own\nSitting naked by the phone\nWould you touch me?\nHey you with you ear against the wall\nWaiting for someone to call out\nWould you touch me?\nHey you, would you help me to carry the stone?\nOpen your heart, I'm coming home\nBut it was only fantasy\nThe wall was too high\nAs you can see\nNo matter how he tried\nHe could not break free\nAnd the worms ate into his brain\nHey you, out there on the road\nAlways doing what you're told\nCan you help me?\nHey you, out there beyond the wall\nBreaking bottles in the hall\nCan you help me?\nHey you, don't tell me there's no hope at all\nTogether we stand, divided we fall\n"

In [63]:
# Create Doc objects
mother_doc = nlp(mother)
hopes_doc = nlp(hopes)
hey_doc = nlp(hey)

# Print similarity between mother and hopes
print(mother_doc.similarity(hopes_doc))

# Print similarity between mother and hey
print(mother_doc.similarity(hey_doc))

0.5779930353164673
0.9465446472167969


> Notice that 'Mother' and 'Hey You' have a similarity score of about 0.95, whereas 'Mother' and 'High Hopes' have a score of only about 0.58. This is probably because 'Mother' and 'Hey You' are both songs from the same album, 'The Wall', and were penned by Roger Waters. 'High Hopes', on the other hand, is part of the album 'The Division Bell', with lyrics by David Gilmour and his wife, Polly Samson.