# **Movie Recommender System based on Wiki-plots using sentence embeddings**
![recommended movie?](https://alvinalexander.com/sites/default/files/2017-09/netflix-christmas-movie-suggestions.jpg)*You may like these?*

#### The dataset contains movie plots scraped from Wikipedia, spanning 1901-2017 and approximately 22 regions, along with key metadata
<!-- blank line -->
----
## Content (plot) based Recommender System
### We embed every sentence of each plot using Google's [Universal Sentence Encoder](https://arxiv.org/abs/1803.11175), average the sentence embeddings into one vector per plot, and rank movies by the [Cosine similarity](https://en.wikipedia.org/wiki/Cosine_similarity) between their vector and the input plot's vector
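
As a quick illustration of the comparison step, here is a minimal sketch of cosine similarity in NumPy. The helper name `cosine_similarity` and the 3-dimensional toy vectors are assumptions made for this example; the real plot embeddings are 512-dimensional.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two 1-D vectors (assumes neither is all zeros)."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy vectors standing in for plot embeddings.
a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])   # same direction as a, twice the length
c = np.array([-3.0, 0.0, 1.0])  # orthogonal to a (dot product is 0)

sim_ab = cosine_similarity(a, b)
sim_ac = cosine_similarity(a, c)
print(sim_ab, sim_ac)
```

Vectors pointing the same way score close to 1 regardless of their length, and orthogonal vectors score 0, which is why the measure suits comparing embeddings of plots with very different word counts.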

### **Imports**

In [1]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import os
from tqdm import tqdm_notebook
import tensorflow as tf
import tensorflow_hub as hub
from nltk import sent_tokenize
%matplotlib inline

### Plotly imports

In [2]:
import plotly.graph_objs as go
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
init_notebook_mode(connected=True)
import cufflinks as cf
cf.go_offline()
cf.set_config_file(offline=False, world_readable=True)

### Loading Dataset

In [3]:
movie = pd.read_csv('../input/wiki_movie_plots_deduped.csv')
movie.head()

Unnamed: 0,Release Year,Title,Origin/Ethnicity,Director,Cast,Genre,Wiki Page,Plot
0,1901,Kansas Saloon Smashers,American,Unknown,,unknown,https://en.wikipedia.org/wiki/Kansas_Saloon_Sm...,"A bartender is working at a saloon, serving dr..."
1,1901,Love by the Light of the Moon,American,Unknown,,unknown,https://en.wikipedia.org/wiki/Love_by_the_Ligh...,"The moon, painted with a smiling face hangs ov..."
2,1901,The Martyred Presidents,American,Unknown,,unknown,https://en.wikipedia.org/wiki/The_Martyred_Pre...,"The film, just over a minute long, is composed..."
3,1901,"Terrible Teddy, the Grizzly King",American,Unknown,,unknown,"https://en.wikipedia.org/wiki/Terrible_Teddy,_...",Lasting just 61 seconds and consisting of two ...
4,1902,Jack and the Beanstalk,American,"George S. Fleming, Edwin S. Porter",,unknown,https://en.wikipedia.org/wiki/Jack_and_the_Bea...,The earliest known adaptation of the classic f...


In [4]:
# Before dropping plot duplicates
len(movie)

34886

In [5]:
## Drop duplicates
movie = movie.drop_duplicates(subset='Plot', keep="first")
len(movie)

33869
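
The two counts above differ by 34886 - 33869 = 1017 rows, so that many repeated plots were dropped. The sketch below repeats the same call on a made-up three-row frame (the titles and plots are invented for illustration) to show that `keep="first"` retains the earliest copy and leaves a gap in the index, which is why the next cell resets it.

```python
import pandas as pd

# Hypothetical frame: rows 0 and 1 share the same plot text.
toy = pd.DataFrame({
    "Title": ["Film A", "Film A (re-release)", "Film B"],
    "Plot": [
        "A sailor is lost at sea.",
        "A sailor is lost at sea.",
        "Two rivals race to the moon.",
    ],
})

deduped = toy.drop_duplicates(subset="Plot", keep="first")
removed = len(toy) - len(deduped)

# The surviving rows keep their old labels (0 and 2) until the index is reset.
renumbered = deduped.reset_index(drop=True)
print(list(deduped.index), list(renumbered.index), removed)
```

Passing `drop=True` to `reset_index` discards the old labels directly, which would make the separate `drop(columns=['index'])` step used below unnecessary.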

In [6]:
movie.reset_index(inplace=True)
movie.head()

Unnamed: 0,index,Release Year,Title,Origin/Ethnicity,Director,Cast,Genre,Wiki Page,Plot
0,0,1901,Kansas Saloon Smashers,American,Unknown,,unknown,https://en.wikipedia.org/wiki/Kansas_Saloon_Sm...,"A bartender is working at a saloon, serving dr..."
1,1,1901,Love by the Light of the Moon,American,Unknown,,unknown,https://en.wikipedia.org/wiki/Love_by_the_Ligh...,"The moon, painted with a smiling face hangs ov..."
2,2,1901,The Martyred Presidents,American,Unknown,,unknown,https://en.wikipedia.org/wiki/The_Martyred_Pre...,"The film, just over a minute long, is composed..."
3,3,1901,"Terrible Teddy, the Grizzly King",American,Unknown,,unknown,"https://en.wikipedia.org/wiki/Terrible_Teddy,_...",Lasting just 61 seconds and consisting of two ...
4,4,1902,Jack and the Beanstalk,American,"George S. Fleming, Edwin S. Porter",,unknown,https://en.wikipedia.org/wiki/Jack_and_the_Bea...,The earliest known adaptation of the classic f...


In [7]:

movie.drop(columns=['index'],inplace=True)
movie.head()

Unnamed: 0,Release Year,Title,Origin/Ethnicity,Director,Cast,Genre,Wiki Page,Plot
0,1901,Kansas Saloon Smashers,American,Unknown,,unknown,https://en.wikipedia.org/wiki/Kansas_Saloon_Sm...,"A bartender is working at a saloon, serving dr..."
1,1901,Love by the Light of the Moon,American,Unknown,,unknown,https://en.wikipedia.org/wiki/Love_by_the_Ligh...,"The moon, painted with a smiling face hangs ov..."
2,1901,The Martyred Presidents,American,Unknown,,unknown,https://en.wikipedia.org/wiki/The_Martyred_Pre...,"The film, just over a minute long, is composed..."
3,1901,"Terrible Teddy, the Grizzly King",American,Unknown,,unknown,"https://en.wikipedia.org/wiki/Terrible_Teddy,_...",Lasting just 61 seconds and consisting of two ...
4,1902,Jack and the Beanstalk,American,"George S. Fleming, Edwin S. Porter",,unknown,https://en.wikipedia.org/wiki/Jack_and_the_Bea...,The earliest known adaptation of the classic f...


### **Bare-bones EDA**

In [8]:
movie['word count'] = movie['Plot'].apply(lambda x : len(x.split()))

In [9]:
movie['word count'].iplot(
    kind='hist',
    bins=100,
    xTitle='word count',
    linecolor='black',
    yTitle='no of plots',
    title='Plot Word Count Distribution')

In [10]:
movie['Origin/Ethnicity'].value_counts().iplot(kind='bar')

In [11]:
movie['Release Year'].value_counts().iplot(kind='bar')

### **Plot text preprocessing**

In [12]:
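import re
def clean_plot(text_list):
    """Strip punctuation and digits from each sentence, then lower-case it."""
    clean_list = []
    for sent in text_list:
        # Delete punctuation (square brackets included, so no separate '[]' pass is needed)
        sent = re.sub('[%s]' % re.escape(r"""!"#$%&'()*+,-.:;<=>?@[\]^`{|}~"""), '', sent)
        # Replace each run of digits with a single space
        sent = re.sub(r'\d+', ' ', sent)
        sent = sent.lower()
        clean_list.append(sent)
    return clean_list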
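
To make the effect of this preprocessing concrete, the sketch below applies the same substitutions to one made-up sentence. The helper name `clean_sentence` and the sample text are assumptions for illustration only.

```python
import re

# Same punctuation set as the cell above; note that "/" and "_" are not in it.
PUNCT = r"""!"#$%&'()*+,-.:;<=>?@[\]^`{|}~"""

def clean_sentence(sent):
    sent = re.sub('[%s]' % re.escape(PUNCT), '', sent)  # delete punctuation
    sent = re.sub(r'\d+', ' ', sent)                     # digit runs become one space
    return sent.lower()

sample = "In 1969, Dr. Smith's well-known crew left Earth!"
cleaned = clean_sentence(sample)
print(repr(cleaned))  # 'in   dr smiths wellknown crew left earth'
```

Two side effects are worth knowing about: hyphenated words are fused ("well-known" becomes "wellknown") because punctuation is deleted rather than replaced by a space, and removed numbers leave runs of extra spaces behind.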

### **Embedding using USE** (for more information, see the [USE tutorial](https://colab.research.google.com/github/tensorflow/hub/blob/master/examples/colab/semantic_similarity_with_tf_hub_universal_encoder.ipynb))

In [13]:
plot_emb_list = []
with tf.Graph().as_default():
    embed = hub.Module("https://tfhub.dev/google/universal-sentence-encoder/2")
    messages = tf.placeholder(dtype=tf.string, shape=[None])
    output = embed(messages)
    with tf.Session() as session:
        session.run([tf.global_variables_initializer(), tf.tables_initializer()])
        for plot in tqdm_notebook(movie['Plot']):
            sent_list = sent_tokenize(plot)
            clean_sent_list = clean_plot(sent_list)
            sent_embed = session.run(output, feed_dict={messages: clean_sent_list})
            plot_emb_list.append(sent_embed.mean(axis=0).reshape(1,512))            
movie['embeddings'] = plot_emb_list
movie.head()


Unnamed: 0,Release Year,Title,Origin/Ethnicity,Director,Cast,Genre,Wiki Page,Plot,word count,embeddings
0,1901,Kansas Saloon Smashers,American,Unknown,,unknown,https://en.wikipedia.org/wiki/Kansas_Saloon_Sm...,"A bartender is working at a saloon, serving dr...",83,"[[-0.05115993, -0.027155714, -0.012244814, 0.0..."
1,1901,Love by the Light of the Moon,American,Unknown,,unknown,https://en.wikipedia.org/wiki/Love_by_the_Ligh...,"The moon, painted with a smiling face hangs ov...",86,"[[-0.017801737, 0.019581867, -0.012856972, 0.0..."
2,1901,The Martyred Presidents,American,Unknown,,unknown,https://en.wikipedia.org/wiki/The_Martyred_Pre...,"The film, just over a minute long, is composed...",76,"[[-0.016033914, -0.00033690967, -0.0016325507,..."
3,1901,"Terrible Teddy, the Grizzly King",American,Unknown,,unknown,"https://en.wikipedia.org/wiki/Terrible_Teddy,_...",Lasting just 61 seconds and consisting of two ...,153,"[[0.016082382, 0.020591006, -0.006167304, -0.0..."
4,1902,Jack and the Beanstalk,American,"George S. Fleming, Edwin S. Porter",,unknown,https://en.wikipedia.org/wiki/Jack_and_the_Bea...,The earliest known adaptation of the classic f...,140,"[[0.0027011565, 0.0040353737, -0.025162894, 0...."
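
The pooling step in the cell above (`sent_embed.mean(axis=0).reshape(1,512)`) can be looked at in isolation. In this sketch, random numbers stand in for the encoder output of a hypothetical three-sentence plot, so only the shapes and the averaging are meaningful.

```python
import numpy as np

# Stand-in for the encoder output: one 512-d row per sentence.
rng = np.random.default_rng(0)
sent_embed = rng.normal(size=(3, 512)).astype(np.float32)

# Average over the sentence axis, then keep a leading axis of size 1 as the notebook does.
plot_vector = sent_embed.mean(axis=0).reshape(1, 512)

print(sent_embed.shape, "->", plot_vector.shape)
```

Mean pooling gives every sentence equal weight and ignores sentence order, which keeps the representation a fixed size for plots of any length.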


### Pickling the embeddings for future (re)use

In [14]:
import pickle
plot_emb_arr = np.array(plot_emb_list).reshape(len(movie),512)
with open("movie_plot.pickle","wb") as pickle_out:
    pickle.dump(plot_emb_arr, pickle_out)
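
Reading the embeddings back is the mirror image of the dump. The sketch below round-trips a small stand-in array through a temporary directory; the 2x3 array and the temporary path are assumptions for the example, whereas the notebook's real array has shape `(len(movie), 512)`.

```python
import os
import pickle
import tempfile

import numpy as np

# Small stand-in for plot_emb_arr.
emb = np.arange(6, dtype=np.float32).reshape(2, 3)

path = os.path.join(tempfile.mkdtemp(), "movie_plot.pickle")
with open(path, "wb") as f:
    pickle.dump(emb, f)

# Later (for example in a new session), load the array instead of re-embedding.
with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored.shape, np.array_equal(emb, restored))
```

Only unpickle files you created yourself, since loading a pickle can execute arbitrary code.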

## Similar Movie function 

In [15]:
from scipy import spatial
from operator import itemgetter
def similar_movie(plot_sent,topn=5):
    with tf.Graph().as_default():
        embed = hub.Module("https://tfhub.dev/google/universal-sentence-encoder/2")
        messages = tf.placeholder(dtype=tf.string, shape=[None])
        output = embed(messages)
        with tf.Session() as session:
            session.run([tf.global_variables_initializer(), tf.tables_initializer()])
            sent_list = sent_tokenize(plot_sent)
            clean_sent_list = clean_plot(sent_list)
            sent_embed2 = (session.run(output, feed_dict={messages: clean_sent_list})).mean(axis=0).reshape(1,512)
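            similarities = []
            for tensor,title in zip(movie['embeddings'],movie['Title']):
                # scipy's cosine() expects 1-D vectors, so flatten the (1, 512) arrays
                cos_sim = 1 - spatial.distance.cosine(sent_embed2.ravel(),tensor.ravel())
                similarities.append((title,cos_sim))
            # Skip the top hit: when the input plot comes from the dataset, that hit is the movie itself
            return sorted(similarities,key=itemgetter(1),reverse=True)[1:topn+1]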
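
The per-row loop works, but the same ranking can be computed in one matrix operation once the embeddings are stacked into a 2-D array (such as the pickled `plot_emb_arr`). The following is a minimal sketch under that assumption; `top_n_similar` is a hypothetical helper, not part of the notebook, and the 2-dimensional vectors and titles are toy data.

```python
import numpy as np

def top_n_similar(query_vec, emb_matrix, titles, topn=5):
    """Rank rows of emb_matrix by cosine similarity to query_vec.

    Hypothetical helper for illustration; assumes no all-zero vectors.
    """
    q = np.asarray(query_vec, dtype=float).ravel()
    m = np.asarray(emb_matrix, dtype=float)
    sims = (m @ q) / (np.linalg.norm(m, axis=1) * np.linalg.norm(q))
    order = np.argsort(-sims)[:topn]  # indices of the highest scores first
    return [(titles[i], float(sims[i])) for i in order]

# Toy 2-d "embeddings" for three made-up titles.
titles = ["A", "B", "C"]
emb = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])
ranked = top_n_similar([1.0, 0.0], emb, titles, topn=2)
print(ranked)
```

Unlike `similar_movie`, this sketch does not drop the first result, so a caller who queries with a plot taken from the dataset would need to remove that movie's own row from the ranking.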

### Testing our model - using Interstellar's plot

In [16]:
for i in range(len(movie)):
    if movie['Title'][i].lower() == 'interstellar':
        print(i)

16779
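
The same lookup can also be written as a vectorized pandas filter instead of a Python loop. This sketch uses a made-up frame of titles; in the notebook the frame would be `movie`.

```python
import pandas as pd

# Made-up titles for illustration.
toy = pd.DataFrame({"Title": ["Gravity", "Interstellar", "INTERSTELLAR", "Life"]})

# Boolean mask: True where the lower-cased title matches exactly.
mask = toy["Title"].str.lower() == "interstellar"
matches = toy.index[mask].tolist()
print(matches)  # [1, 2]
```

Because the notebook reset the index earlier, the labels returned this way coincide with row positions and can be passed to `iloc`.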


In [17]:
movie.iloc[16779]

Release Year                                                     2014
Title                                                    Interstellar
Origin/Ethnicity                                             American
Director                                            Christopher Nolan
Cast                Anne Hathaway\r\nMatthew McConaughey\r\nJessic...
Genre                                                 science fiction
Wiki Page           https://en.wikipedia.org/wiki/Interstellar_(film)
Plot                In the mid-21st century, crop blights and dust...
word count                                                        665
embeddings          [[-0.024700206, 0.0074386476, -0.030214606, -0...
Name: 16779, dtype: object

In [18]:
plots = movie['Plot'][16779]
similar_movie(plots)

[('2001: A Space Odyssey', 0.9396531581878662),
 ('Riders to the Stars', 0.9393153190612793),
 ('The Martian', 0.9187579154968262),
 ('2010', 0.9162747263908386),
 ('Life', 0.9144946336746216)]

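## **The results seem to make sense: the closest matches to Interstellar include other space-exploration films such as 2001: A Space Odyssey and The Martian**
### Other embeddings such as BERT, ELMo, or Flair may give different, and possibly better, results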