1. Introduction: 

In this article we build a movie recommendation system. Instead of relying on ratings, we recommend movies based on the similarity of their plots or summaries.

We use this [dataset](https://www.kaggle.com/jrobischon/wikipedia-movie-plots) as the source of the movie plot summaries.

2. Process:

  1. We need to convert each movie plot into a vector representation so that we can measure the similarity between these vectors.
  2. For this we use the sentence-transformers library, [sbert](https://www.sbert.net/).
  3. Its models are trained with Siamese networks so that semantically similar sentences get similar embeddings.
  4. We then convert every movie plot in the dataset into a vector embedding.
  5. Finally, we find the embeddings closest to a given movie's embedding using the [k-nearest neighbors algorithm](https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm) (knn).

3. Why vectors:
  1. Once the plot summaries are converted to vectors, plots that are semantically similar have vector representations that lie close to each other.
  2. We can then use this property to search for nearest neighbors with knn, or to compute the cosine similarity between two plots.
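
To make the idea of "close" vectors concrete, here is a minimal sketch of cosine similarity in NumPy. The three 3-dimensional vectors and their names are made up for illustration; real plot embeddings come from the model and have far more dimensions.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1 = same direction, 0 = orthogonal."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical toy "plot embeddings" (not produced by any model).
heist_a = [0.9, 0.1, 0.0]
heist_b = [0.8, 0.2, 0.1]
romance = [0.0, 0.2, 0.9]

sim_close = cosine_similarity(heist_a, heist_b)  # vectors pointing roughly the same way
sim_far = cosine_similarity(heist_a, romance)    # vectors pointing in different directions
print(sim_close, sim_far)
```

The two "heist" vectors score much higher than the heist/romance pair, which is the property the recommender relies on.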

4. Sentence Bert (sbert):

  1. Sentence-BERT (sbert) was introduced in the paper [Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks](https://arxiv.org/abs/1908.10084).
  2. The authors feed pairs of sentences into a [Siamese network](https://en.wikipedia.org/wiki/Siamese_neural_network). A Siamese network takes two distinct inputs, but the weights of its two branches are tied together.

    ![sbert](https://imgur.com/m7nXRwA.png) 
  
  3. The diagram above, taken from the paper, shows sbert receiving two inputs and computing their [cosine similarity](https://en.wikipedia.org/wiki/Cosine_similarity). When two vectors point in the same direction their cosine similarity is 1, when they are orthogonal (unrelated) it is 0, and when they point in opposite directions it is -1.

  4. One of the training objectives described in the paper is [Triplet Loss](https://en.wikipedia.org/wiki/Triplet_loss), which is commonly used with Siamese networks.

  > max(||s<sub>a</sub> − s<sub>p</sub>|| − ||s<sub>a</sub> − s<sub>n</sub>|| + ε, 0)

  5. Given an anchor sentence a, a positive sentence p, and a negative sentence n, triplet loss tunes the network such that the distance between a and p is smaller than the distance between a and n. Here s<sub>x</sub> is the sentence embedding for a/n/p, || · || is a distance metric, and ε is the margin. The margin ε ensures that s<sub>p</sub> is at least ε closer to s<sub>a</sub> than s<sub>n</sub> is. As the metric, the authors of the paper used Euclidean distance and set ε = 1.
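
The triplet loss formula above can be sketched in a few lines of NumPy. This is only an illustration of the formula with Euclidean distance and a margin of 1; it is not the library's training code, and the 2-dimensional points are made-up values.

```python
import numpy as np

def triplet_loss(s_a, s_p, s_n, margin=1.0):
    """max(||s_a - s_p|| - ||s_a - s_n|| + margin, 0) with Euclidean distance."""
    s_a, s_p, s_n = (np.asarray(x, dtype=float) for x in (s_a, s_p, s_n))
    d_pos = np.linalg.norm(s_a - s_p)  # distance anchor -> positive
    d_neg = np.linalg.norm(s_a - s_n)  # distance anchor -> negative
    return float(max(d_pos - d_neg + margin, 0.0))

# Hypothetical embeddings on a line, chosen so the arithmetic is easy to follow.
anchor = [0.0, 0.0]
positive = [0.5, 0.0]
near_negative = [1.0, 0.0]  # only 0.5 farther away than the positive
far_negative = [3.0, 0.0]   # 2.5 farther away than the positive

loss_violated = triplet_loss(anchor, positive, near_negative)  # 0.5 - 1.0 + 1.0 = 0.5
loss_satisfied = triplet_loss(anchor, positive, far_negative)  # 0.5 - 3.0 + 1.0 < 0, clipped to 0
print(loss_violated, loss_satisfied)
```

The loss is zero once the negative is at least one margin farther from the anchor than the positive, so training only pushes on triplets that still violate the margin.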






Let's install the sentence-transformers package.

In [None]:
!pip install -U sentence-transformers

In [1]:
!unzip /content/dataPlot.zip

Archive:  /content/dataPlot.zip
   creating: dataPlot/
  inflating: dataPlot/moviePlot.csv  
  inflating: dataPlot/animePlot.csv  


In [2]:
path = "__path__to_data__"

For movie recommendations

In [4]:
import pandas as pd #import pandas 
dfM = pd.read_csv(path)

Let's take a look at our data.


In [5]:
dfM.head()

Unnamed: 0.1,Unnamed: 0,Title,Plot
0,0,Kansas Saloon Smashers,"A bartender is working at a saloon, serving dr..."
1,1,Love by the Light of the Moon,"The moon, painted with a smiling face hangs ov..."
2,2,The Martyred Presidents,"The film, just over a minute long, is composed..."
3,3,"Terrible Teddy, the Grizzly King",Lasting just 61 seconds and consisting of two ...
4,4,Jack and the Beanstalk,The earliest known adaptation of the classic f...


In [17]:
dfM.describe()

Unnamed: 0.1,Unnamed: 0
count,34886.0
mean,17442.5
std,10070.865082
min,0.0
25%,8721.25
50%,17442.5
75%,26163.75
max,34885.0


There are 34,886 movie entries in total.

In [6]:
dfM['Plot'].iloc[0]

"A bartender is working at a saloon, serving drinks to customers. After he fills a stereotypically Irish man's bucket with beer, Carrie Nation and her followers burst inside. They assault the Irish man, pulling his hat over his eyes and then dumping the beer over his head. The group then begin wrecking the bar, smashing the fixtures, mirrors, and breaking the cash register. The bartender then sprays seltzer water in Nation's face before a group of policemen appear and order everybody to leave.[1]"

The plot needs some cleaning: it contains Wikipedia reference markers such as [1] and [2], which we want to remove.

In [7]:
import re #for data cleaning

We write a function that takes a row of the dataframe dfM as its argument and returns the cleaned plot.

In [14]:
def cleanRow(row):
  cleanData = re.sub(r'\[\d+\]','',row['Plot']) #raw-string pattern: finds every [number] in the string and replaces it with the empty string ''
  return cleanData

Let's test it on the first entry of our dataframe.

In [15]:
cleanRow(dfM.iloc[0])

"A bartender is working at a saloon, serving drinks to customers. After he fills a stereotypically Irish man's bucket with beer, Carrie Nation and her followers burst inside. They assault the Irish man, pulling his hat over his eyes and then dumping the beer over his head. The group then begin wrecking the bar, smashing the fixtures, mirrors, and breaking the cash register. The bartender then sprays seltzer water in Nation's face before a group of policemen appear and order everybody to leave."

Applying the cleanRow() function to all the rows in our dataframe

In [16]:
dfM['cleanPlot'] = dfM.apply(cleanRow, axis=1) #apply cleanRow to every row
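
As a possible alternative to the row-wise apply, pandas can run the same regular expression over the whole column at once with `Series.str.replace`. The sketch below uses a tiny made-up dataframe (`toy`) in place of dfM; the plots in it are invented for illustration.

```python
import pandas as pd

# Toy stand-in for dfM; these plots are made up, not taken from the dataset.
toy = pd.DataFrame({
    "Plot": [
        "Everybody is ordered to leave.[1]",
        "A short film.[2][13] It ends.",
    ]
})

# regex=True makes pandas treat the pattern as a regular expression
# and apply it to every row of the column in one call.
toy["cleanPlot"] = toy["Plot"].str.replace(r"\[\d+\]", "", regex=True)
print(toy["cleanPlot"].tolist())
```

Both approaches should produce the same cleaned text; the column-wise version just avoids calling a Python function once per row.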

As the cleaned example above shows, the reference numbers are gone. The rest is simple: we just need to pass these plots to our sentence-transformer model.

In [23]:
dfM.head()

Unnamed: 0.1,Unnamed: 0,Title,Plot,cleanPlot
0,0,Kansas Saloon Smashers,"A bartender is working at a saloon, serving dr...","A bartender is working at a saloon, serving dr..."
1,1,Love by the Light of the Moon,"The moon, painted with a smiling face hangs ov...","The moon, painted with a smiling face hangs ov..."
2,2,The Martyred Presidents,"The film, just over a minute long, is composed...","The film, just over a minute long, is composed..."
3,3,"Terrible Teddy, the Grizzly King",Lasting just 61 seconds and consisting of two ...,Lasting just 61 seconds and consisting of two ...
4,4,Jack and the Beanstalk,The earliest known adaptation of the classic f...,The earliest known adaptation of the classic f...


Importing our model

In [20]:
from sentence_transformers import SentenceTransformer

We are using a pre-trained model. The library's documentation lists all the available pre-trained [models](https://www.sbert.net/docs/pretrained_models.html).

In [None]:
model = SentenceTransformer('all-MiniLM-L6-v2')

Move the model to the GPU (this requires a CUDA-capable device)

In [22]:
model = model.cuda()

This is the step where we compute the embeddings. The sentence-transformers package makes it very easy to compute all these vector embeddings.

The model can take some time to encode all 34,886 plots.

In [24]:
embeds = model.encode(dfM['cleanPlot'].tolist()) #pass the plots as a plain list of strings

As the shape below shows, all 34,886 movie plots are converted to 384-dimensional embeddings.

In [26]:
embeds.shape

(34886, 384)

For example, here is the embedding of the first plot (output truncated):

In [25]:
embeds[0]

array([-2.12161932e-02,  3.00669353e-02, -2.20030937e-02, -4.86579239e-02,
        4.65949513e-02,  1.39347359e-03,  1.11619830e-01, -1.05095394e-01,
        1.64841451e-02, -6.58066571e-02,  4.28134426e-02, -8.68948922e-02,
       -6.84939772e-02,  4.01585288e-02, -4.27194089e-02, -2.80906223e-02,
       -5.49537353e-02, -1.82290319e-02,  1.69992838e-02, -1.23173064e-02,
       -7.43851960e-02,  1.77508891e-02, -2.53297910e-02,  4.20343839e-02,
       -1.91297680e-02, -5.33761717e-02,  7.73109645e-02,  3.12473997e-02,
       -1.56767934e-03, -1.35527225e-02,  7.67379850e-02, -8.90569668e-03,
        1.67426597e-02,  5.27440123e-02, -5.66607080e-02, -3.37757282e-02,
        8.18887725e-02,  4.01530862e-02,  7.09116757e-02,  1.03832453e-01,
        1.21762659e-02, -3.67102958e-02, -5.90380980e-03,  2.12801900e-02,
        8.85269195e-02,  6.31055385e-02,  2.16334965e-03,  5.64815775e-02,
        3.59476879e-02, -1.78708490e-02, -1.78310033e-02, -3.49012762e-02,
        3.46690491e-02,  

If you wish, you can store the embeddings in the dataframe.

In [27]:
dfM['embeddings'] = embeds.tolist()
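
Since encoding takes a while, it may also be worth saving the embedding matrix to disk so it does not have to be recomputed. The sketch below shows one way to do that with `np.save` and `np.load`; the random array `toy_embeds` stands in for `embeds`, and the file name is hypothetical.

```python
import os
import tempfile

import numpy as np

# Small random stand-in for the (34886, 384) embedding matrix.
rng = np.random.default_rng(0)
toy_embeds = rng.normal(size=(5, 384)).astype(np.float32)

# Hypothetical output path; any writable location works.
out_path = os.path.join(tempfile.mkdtemp(), "plot_embeddings.npy")
np.save(out_path, toy_embeds)  # writes the array in NumPy's binary .npy format

restored = np.load(out_path)   # reads it back with the same shape and dtype
print(restored.shape, restored.dtype)
```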

Now let's produce some movie recommendations. First we fit our knn model.

In [28]:
from sklearn.neighbors import NearestNeighbors

In [36]:
import numpy as np

In [29]:
neighbors = NearestNeighbors(n_neighbors=15)

In [31]:
neighbors.fit(dfM['embeddings'].tolist())#fitting our data

NearestNeighbors(algorithm='auto', leaf_size=30, metric='minkowski',
                 metric_params=None, n_jobs=None, n_neighbors=15, p=2,
                 radius=1.0)
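
The fitted estimator reports metric='minkowski' with p=2, which is plain Euclidean distance, while the earlier sections talk about cosine similarity. For unit-length vectors the two give the same neighbor ranking, because ||a − b||² = 2 − 2·cos(a, b). The sketch below checks this on random unit vectors. Whether the real `embeds` are unit-length is an assumption here; it can be checked with `np.linalg.norm(embeds, axis=1)`.

```python
import numpy as np

# Random stand-in vectors, explicitly normalized to unit length.
rng = np.random.default_rng(0)
vecs = rng.normal(size=(6, 384))
vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)

query = vecs[0]
cos_sim = vecs @ query                       # cosine similarity (dot product of unit vectors)
eucl = np.linalg.norm(vecs - query, axis=1)  # Euclidean distance to the query

# For unit vectors: squared Euclidean distance = 2 - 2 * cosine similarity.
identity_holds = np.allclose(eucl ** 2, 2 - 2 * cos_sim)

# Sorting by increasing distance gives the same order as sorting by decreasing similarity.
same_ranking = np.array_equal(np.argsort(eucl), np.argsort(-cos_sim))
print(identity_holds, same_ranking)
```

If the embeddings were not normalized, the two rankings could differ, and passing `metric='cosine'` to `NearestNeighbors` would be the way to rank by cosine distance instead.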

knn returns the distances and the indices of the 15 closest embeddings. So let's make a dict that maps an index to the title of the movie.

In [32]:
idx_dict = dfM['Title'].to_dict()#making an index dictionary  - index-> title of movie

In [42]:
idx_dict[0]

'Kansas Saloon Smashers'

For each movie embedding we want knn to return its 15 closest neighbors. We will pass the embeddings one by one and save the indices of the recommended movies.

Here is an example

In [37]:
neighbors.kneighbors(np.array(embeds[0]).reshape(1,-1))

(array([[0.        , 0.8927627 , 0.97209753, 0.98154652, 0.98469113,
         0.99231732, 0.99920433, 1.00403609, 1.01327464, 1.01370713,
         1.0170882 , 1.01793011, 1.01885963, 1.02645933, 1.02752993]]),
 array([[    0,  9231, 22037,    98,   172, 10478,  8426,  1269, 25738,
           431, 21299, 34853, 26449,   231, 28327]]))

Here the second array, [0, 9231, 22037, 98, 172, 10478, 8426, 1269, 25738, 431, 21299, 34853, 26449, 231, 28327], holds the indices of the similar movies.

So the 15 movies closest to the movie at index 0, i.e. "Kansas Saloon Smashers", are at indices 0, 9231, 22037, 98, 172, 10478, 8426, 1269, 25738, 431, 21299, 34853, 26449, 231, 28327.

Let's print their names.

In [44]:
recList = [    0,  9231, 22037,  98,   172, 10478,  8426,  1269, 25738, 431, 21299, 34853, 26449,   231, 28327]
print(f"movies similar to {idx_dict[0]} are following: ")
for i in recList:
  print(idx_dict[i]) 

movies similar to Kansas Saloon Smashers are following: 
Kansas Saloon Smashers
Dixie Dynamite
FUBAR: The Movie
The Rounders
In Again, Out Again
8 Million Ways to Die
Childish Things
Broadway to Cheyenne
Amanaat
The Frozen North
U.F.O.
Black and White
Johnny Gaddar
Out West
Idiots


You can see that the movie most similar to the query is the query movie itself: the distance from an embedding to itself is 0, so knn returns its index first. The rest are listed in increasing order of distance.
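
To see why the query comes back first and the rest are ordered by distance, here is a minimal brute-force sketch of a k-nearest-neighbors lookup in NumPy. It illustrates the idea only; it is not how scikit-learn implements `kneighbors` internally, and the four 2-dimensional points are made up.

```python
import numpy as np

def brute_force_knn(vectors, query, k):
    """Return (distances, indices) of the k vectors closest to query, nearest first."""
    vectors = np.asarray(vectors, dtype=float)
    dists = np.linalg.norm(vectors - np.asarray(query, dtype=float), axis=1)
    order = np.argsort(dists)[:k]  # indices sorted by increasing distance
    return dists[order], order

# Hypothetical 2-dimensional "embeddings".
points = np.array([[0.0, 0.0], [5.0, 5.0], [1.0, 0.0], [0.0, 2.0]])

# Query with the first point itself, as the notebook does with embeds[0].
dists, idx = brute_force_knn(points, points[0], k=3)
print(idx.tolist(), dists.tolist())  # [0, 2, 3] [0.0, 1.0, 2.0]
```

Index 0 comes first with distance 0 because the query is part of the data, which is why the recommendation function drops the first result.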

Let's write a function to apply to all the embeddings saved in our dataframe.

In [65]:
def recFun(row):#return similar movie indexes
    embeds = row['embeddings']
    dist,idx = neighbors.kneighbors(np.array(embeds).reshape(1,-1))
    return idx[0][1:] #since the 0th will always be the same movie 

Apply it to the dataframe. This can take a long time (up to 20 minutes in my case), since we query knn separately for each of the 34,886 movies and it returns 15 neighbors for each (14 recommendations after dropping the movie itself).
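
`kneighbors` also accepts a 2-D array with many query rows, so a possibly faster alternative is to query all embeddings in a single call instead of one row at a time. The sketch below compares both approaches on a small random matrix; `toy_embeds` and `nn` are hypothetical stand-ins for `embeds` and `neighbors`, and no timing on the real data is claimed.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Small random stand-in for the embedding matrix.
rng = np.random.default_rng(0)
toy_embeds = rng.normal(size=(200, 8))

nn = NearestNeighbors(n_neighbors=5).fit(toy_embeds)

# One call for all rows: all_idx has shape (n_rows, n_neighbors).
_, all_idx = nn.kneighbors(toy_embeds)
batch_recs = all_idx[:, 1:]  # drop the first column, which is the query row itself here

# The row-by-row approach used in the notebook, for comparison.
loop_recs = np.vstack(
    [nn.kneighbors(row.reshape(1, -1))[1][0][1:] for row in toy_embeds]
)
print(batch_recs.shape, np.array_equal(batch_recs, loop_recs))
```

On the real data this would correspond to something like `neighbors.kneighbors(embeds)`, with the result stored via `list(batch_recs)`. One caveat: if two movies have identical plots, the first column is not guaranteed to be the query movie itself.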

In [81]:
dfM['reccomendations'] = dfM.apply(lambda row: recFun(row),axis=1)

Here are our recommendations. The `reccomendations` column holds the indices of the recommended movies.

In [90]:
dfM.head()

Unnamed: 0.1,Unnamed: 0,Title,Plot,cleanPlot,embeddings,reccomendations
0,0,Kansas Saloon Smashers,"A bartender is working at a saloon, serving dr...","A bartender is working at a saloon, serving dr...","[-0.021216193214058876, 0.03006693534553051, -...","[9231, 22037, 98, 172, 10478, 8426, 1269, 2573..."
1,1,Love by the Light of the Moon,"The moon, painted with a smiling face hangs ov...","The moon, painted with a smiling face hangs ov...","[0.011549428105354309, 0.07491016387939453, 0....","[14371, 28968, 28967, 23560, 27556, 7518, 8604..."
2,2,The Martyred Presidents,"The film, just over a minute long, is composed...","The film, just over a minute long, is composed...","[-0.019372664391994476, 0.04275871440768242, -...","[15550, 32883, 3, 39, 14897, 17522, 9139, 1038..."
3,3,"Terrible Teddy, the Grizzly King",Lasting just 61 seconds and consisting of two ...,Lasting just 61 seconds and consisting of two ...,"[0.018185125663876534, 0.022814400494098663, 0...","[2, 19433, 14148, 2702, 9297, 12015, 431, 1261..."
4,4,Jack and the Beanstalk,The earliest known adaptation of the classic f...,The earliest known adaptation of the classic f...,"[-0.03626694902777672, -0.011878693476319313, ...","[16572, 5651, 6246, 16573, 4370, 33672, 598, 1..."


Now that we have recommendations for every movie in our dataset, we can implement a simple search function that returns the recommendations for a given movie.

In [113]:
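def recs(dfM,movie,idx_dict):
  matches = dfM[dfM['Title'].str.contains(movie, na=False, case=False, regex=False)] #rows whose title contains the query (plain text match, case-insensitive)
  if (len(matches)>0): #check for a match before indexing, otherwise .iloc[0] raises an IndexError
    reccoms = matches.iloc[0]['reccomendations'] #recommendations of the first matching movie
    for i in reccoms:
      print(idx_dict[i])#convert index to movie name
  else:
    print("movie not in database")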


Movies whose plot is similar to "The godfather" are:

In [111]:
recs(dfM,"The godfather",idx_dict)

A Bronx Tale
The Godfather Part II
The Sicilian
The Freshman
Donnie Brasco
Jane Austen's Mafia!
Jersey Boys
Avenging Angelo Baby Beethoven Baby Newton 
King of New York
Black Hand
Carlito's Way
Miller's Crossing
The Brothers Rico
Family Business


Movies whose plot is similar to "fast five" are:

In [104]:
recs(dfM,"fast five",idx_dict)

 The Fast and the Furious
The Fate of the Furious
2 Fast 2 Furious
Restraint
Fast & Furious
Collateral
Gunmen
Drive
Beverly Hills Cop III
The Fast and the Furious
Dawn of the Dead
Getaway
Pulp Fiction
The Courtship of Andy Hardy


In [112]:
id_toRecs = dfM['reccomendations'].to_dict() #dictionary from index to reccomendations

### Here is a very simple website I made for movie recommendations based on plot: [site](https://kcmankar.github.io/website_movie_recommendations_sbert.github.io/).
 
### The code for the website is [here](https://github.com/kcmankar/website_movie_recommendations_sbert.github.io)