# Movie Recommender System using Plot Summary

Importing the necessary libraries

NumPy is a Python library that adds support for large, multi-dimensional arrays and matrices.
Pandas is a Python library for data manipulation and analysis.


The data set is read from the file movies1.csv.

In [30]:
# Import modules
import numpy as np
import pandas as pd
import nltk
import pickle

# Set seed for reproducibility
np.random.seed(5)

# Read in IMDb and Wikipedia movie data (both in same file)
movies_df = pd.read_csv('movies1.csv')

print("Number of movies loaded: %s " % (len(movies_df)))


Number of movies loaded: 5000 


We see that 5000 rows of movie data are loaded. A sample of the first 5 rows is shown below.

In [31]:
movies_df.head()

Unnamed: 0.1,Unnamed: 0,Genre,Title,Plot
0,6050,western,The Bounty Hunter,A prologue explains the role of the bounty hun...
1,6051,comedy,The Bowery Boys Meet the Monsters,The front window of Louie's Sweet Shop is a fr...
2,6052,war,The Bridges at Toko-Ri,U.S. Navy Lieutenant Harry Brubaker (William H...
3,6053,musical,Brigadoon,Americans Tommy Albright (Gene Kelly) and Jeff...
4,6054,drama,Bright Road,Jane Richards (Dorothy Dandridge) is a new tea...


We are interested only in the genre and plot features.

In [32]:
movies_df.shape

(5000, 4)

In [33]:
type(movies_df)

pandas.core.frame.DataFrame

Drop any rows with missing values, then confirm that no missing data remains

In [34]:
movies_df = movies_df.dropna()  # drop any rows that contain missing values
print(movies_df.isna().sum())

Unnamed: 0    0
Genre         0
Title         0
Plot          0
dtype: int64


We see that the data is clean: the EDA has already been performed and values were replaced logically where needed. We now have the genre, the title and the plot of each movie.

## Tokenization
<p>Tokenization is the process by which we break down articles into individual sentences or words, as needed. Besides the tokenization method provided by NLTK, we might have to perform additional filtration to remove tokens which are entirely numeric values or punctuation.</p>

An example for a short paragraph is given below.

In [35]:
# Tokenize a paragraph into sentences and store in sent_tokenized
nltk.download('punkt')

from nltk import word_tokenize,sent_tokenize
sent_tokenized = [sent for sent in nltk.sent_tokenize("""
                        Today is a good day to be alive. Life is beautiful
                        """)]

# Word Tokenize first sentence from sent_tokenized, save as words_tokenized
words_tokenized = [word for word in nltk.word_tokenize(sent_tokenized[0])]

# Remove tokens that do not contain any letters from words_tokenized
import re

filtered = [word for word in words_tokenized if re.search('[a-zA-Z]', word)]

# Display filtered words to observe words after tokenization
filtered

[nltk_data] Downloading package punkt to C:\Users\Sheetal
[nltk_data]     Sekhar\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


['Today', 'is', 'a', 'good', 'day', 'to', 'be', 'alive']

The movies_df DataFrame has 4 columns and 5000 rows.

The Natural Language Toolkit, more commonly NLTK, is a suite of libraries and programs for symbolic and statistical natural language processing for English, written in the Python programming language.

Here we choose English as the language.

Stemming is the process of reducing the morphological variants of a word to a common root/base form.

So all variants of a word map to the same root word.

## Stemming
<p>Stemming is the process by which we bring down a word from its different forms to the root word. This helps us establish meaning to different forms of the same words without having to deal with each form separately.</p>
<p>There are different algorithms available for stemming such as the Porter Stemmer, Snowball Stemmer, etc. We shall use the Snowball Stemmer.</p>

In [36]:

from nltk.stem.snowball import SnowballStemmer

# Create an English language SnowballStemmer object
stemmer = SnowballStemmer("english")

# Print filtered to observe words without stemming
print("Without stemming: ", filtered)




Without stemming:  ['Today', 'is', 'a', 'good', 'day', 'to', 'be', 'alive']


In [37]:
stemmed_words = [stemmer.stem(word) for word in filtered]

print("After stemming:   ", stemmed_words)

After stemming:    ['today', 'is', 'a', 'good', 'day', 'to', 'be', 'aliv']


## Club together Tokenize & Stem
Tokenization is the process of splitting a string of text into a list of tokens.

One can think of a token as a part of a larger unit: a word is a token in a sentence, and a sentence is a token in a paragraph.

The function below performs both tokenization and stemming.

In [38]:
# Define a function to perform both stemming and tokenization
def tokenize_and_stem(text):
    
    # Tokenize by sentence, then by word
    tokens = [word for sentence in nltk.sent_tokenize(text) for word in nltk.word_tokenize(sentence)]
    
    # Filter out raw tokens to remove noise
    filtered_tokens = [token for token in tokens if re.search('[a-zA-Z]', token)]
    
    # Stem the filtered_tokens
    stems = [stemmer.stem(token) for token in filtered_tokens]
    
    return stems

words_stemmed = tokenize_and_stem("Today (May 19, 2016) is his only daughter's wedding.")
print(words_stemmed)

['today', 'may', 'is', 'his', 'onli', 'daughter', "'s", 'wed']
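
The date and the punctuation are missing from the output because of the regular-expression filter inside the function. The following is a minimal sketch of that one step. The token list is written by hand to approximate what the word tokenizer produces for this sentence; it is an assumption, not captured output.

```python
import re

# Hand-written tokens approximating the tokenizer output (assumption)
tokens = ['Today', '(', 'May', '19', ',', '2016', ')', 'is', 'his',
          'only', 'daughter', "'s", 'wedding', '.']

# Keep only tokens that contain at least one ASCII letter
kept = [t for t in tokens if re.search('[a-zA-Z]', t)]
print(kept)
```

Tokens such as `19`, `2016` and `(` contain no letters, so they are dropped before stemming, while `'s` survives because it contains a letter.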


The sklearn.feature_extraction.text.TfidfVectorizer class has the advantage of emphasizing the words that are most important for a given document.

We use it below to derive meaning from the plot summaries and suggest similar movies.

##  Create TfidfVectorizer
<p>Computers do not <em>understand</em> text. They are machines capable only of understanding numbers and performing numerical computation. Hence, we must convert our textual plot summaries to numbers for the computer to be able to extract meaning from them.</p>
<p>TF-IDF recognizes words which are unique and important to any given document.</p>
  

In [39]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Instantiate TfidfVectorizer object with stopwords and tokenizer
# parameters for efficient processing of text
tfidf_vectorizer = TfidfVectorizer(max_df=0.8, max_features=200000,
                                 min_df=0.2, stop_words='english',
                                 use_idf=True, tokenizer=tokenize_and_stem,
                                 ngram_range=(1,3))
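
To make the weighting less of a black box, here is a minimal pure-Python sketch of a TF-IDF computation. It assumes the scikit-learn defaults (raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, L2 row normalization) and uses a tiny hypothetical corpus of pre-tokenized plots; it is an illustration, not the library's implementation.

```python
import math
from collections import Counter

# Hypothetical pre-tokenized plot summaries
docs = [
    ["sheriff", "hunts", "outlaw"],
    ["sheriff", "falls", "in", "love"],
    ["outlaw", "robs", "bank"],
]

def tfidf(docs):
    n_docs = len(docs)
    # Document frequency: number of documents each term occurs in
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        # Term count times smoothed inverse document frequency
        raw = {t: tf[t] * (math.log((1 + n_docs) / (1 + df[t])) + 1) for t in tf}
        # Scale each document vector to unit length (L2 norm)
        norm = math.sqrt(sum(w * w for w in raw.values()))
        weights.append({t: w / norm for t, w in raw.items()})
    return weights

weights = tfidf(docs)
print(weights[0])
```

In the first document, "hunts" occurs in no other document and therefore receives a higher weight than "sheriff", which occurs in two.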

##  Fit transform TfidfVectorizer
<p>Once we create a TF-IDF Vectorizer, we must fit the text to it and then transform the text to produce the corresponding numeric form of the data which the computer will be able to understand and derive meaning from. To do this, we use the <code>fit_transform()</code> method of the <code>TfidfVectorizer</code> object. </p>
<p>If we observe the <code>TfidfVectorizer</code> object we created, we come across the parameter <code>stop_words</code>. Stop words are those words in a given text which do not contribute considerably towards the meaning of the sentence and are generally grammatical filler words.</p>
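
The effect of stop-word removal together with the `min_df` and `max_df` thresholds can be sketched in plain Python. The stop-word set, the toy documents and the helper `prune_vocabulary` below are all hypothetical; scikit-learn's built-in English list is much longer, and this sketch only mimics the idea of keeping terms whose document-frequency share lies between the two thresholds.

```python
from collections import Counter

# Hypothetical mini stop-word list (assumption)
STOP_WORDS = {"the", "a", "is", "in", "and", "to"}

docs = [
    "the sheriff is in the town",
    "a sheriff and a bandit",
    "the bandit rides to the town",
    "a teacher in the town",
    "the sheriff rides",
]

def prune_vocabulary(docs, min_df, max_df):
    """Keep terms whose share of documents lies in [min_df, max_df]."""
    tokenized = [[w for w in d.split() if w not in STOP_WORDS] for d in docs]
    df = Counter(t for doc in tokenized for t in set(doc))
    n = len(docs)
    return sorted(t for t, c in df.items() if min_df <= c / n <= max_df)

vocab = prune_vocabulary(docs, min_df=0.3, max_df=0.8)
print(vocab)
```

Here "teacher" appears in only one of five documents and is pruned. With `min_df=0.2` on 5000 plots, a term must appear in at least 1000 of them to be kept, which is why the resulting vocabulary is small.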

In [40]:
tfidf_matrix = tfidf_vectorizer.fit_transform([x for x in movies_df["Plot"]])

print(tfidf_matrix)



  (0, 59)	0.11140479903856593
  (0, 56)	0.3597652642368178
  (0, 2)	0.11695767935214813
  (0, 42)	0.11771636672836318
  (0, 58)	0.1026865764737806
  (0, 51)	0.46435116925307324
  (0, 40)	0.09898251735088832
  (0, 63)	0.10544904385768836
  (0, 16)	0.22288390485344234
  (0, 61)	0.1139223722860266
  (0, 0)	0.1134156754947762
  (0, 7)	0.37472436926936553
  (0, 35)	0.17779173516759503
  (0, 54)	0.38594378054310163
  (0, 11)	0.2169085591111055
  (0, 33)	0.10551450336093159
  (0, 41)	0.10321372258245412
  (0, 25)	0.1146729243026315
  (0, 31)	0.11716719689686722
  (0, 15)	0.10680958460299651
  (0, 24)	0.11360991115073629
  (0, 1)	0.10327613137556808
  (0, 47)	0.11621107777682275
  (0, 32)	0.08787323072338439
  (0, 49)	0.11971294068357613
  :	:
  (4998, 28)	0.08964473742541876
  (4998, 12)	0.10371948332060231
  (4998, 3)	0.6369690720547054
  (4998, 45)	0.08683601271467449
  (4998, 39)	0.0814044719113148
  (4998, 18)	0.08922421450238449
  (4998, 5)	0.09756496568757209
  (4998, 65)	0.101871230170

##  KMeans and create clusters
To determine how closely one movie is related to another with the help of unsupervised learning, we can use clustering techniques. Clustering is the method of grouping items so that the items in a group exhibit similar properties. Depending on the measure of similarity chosen, a given sample of items can form one or more clusters.

A good basis for clustering in our dataset could be the genre of the movies.

K-means is a clustering algorithm; scikit-learn provides an implementation of it in Python.
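
For intuition, the two alternating steps of k-means (assign each point to its nearest centroid, then move each centroid to the mean of its members) can be sketched in a few lines. This is a simplified illustration on hypothetical 2-D points that seeds the centroids with the first `k` points; scikit-learn's `KMeans` uses a more careful initialization and works on the sparse TF-IDF matrix.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, n_iter=10):
    # Simplification: seed the centroids with the first k points
    centroids = list(points[:k])
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid
        labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c]))
                  for p in points]
        # Update step: each centroid moves to the mean of its members
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = tuple(sum(col) / len(members)
                                     for col in zip(*members))
    return labels, centroids

# Hypothetical toy data with two obvious groups
points = [(0.0, 0.1), (5.0, 5.1), (0.2, 0.0), (5.2, 4.9), (0.1, 0.2)]
labels, centroids = kmeans(points, k=2)
print(labels)
```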

In [41]:
# Import k-means to perform clusters
from sklearn.cluster import KMeans

# Create a KMeans object with 5 clusters and save as km
km = KMeans(n_clusters=5)

# Fit the k-means object with tfidf_matrix
km.fit(tfidf_matrix)

clusters = km.labels_.tolist()

# Create a column cluster to denote the generated cluster for each movie
movies_df["cluster"] = clusters

# Display number of films per cluster (clusters from 0 to 4)
movies_df['cluster'].value_counts() 

3    1597
4    1416
1    1113
2     460
0     414
Name: cluster, dtype: int64
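
One way to check whether the clusters track genre is to count genres within each cluster. The sketch below does this with the standard library on a handful of hypothetical `(cluster, genre)` pairs standing in for the `cluster` and `Genre` columns; the values are made up for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical (cluster, genre) pairs (assumption, not real results)
rows = [(0, "western"), (0, "western"), (0, "drama"),
        (1, "comedy"), (1, "comedy"), (1, "musical")]

# Count how often each genre occurs in each cluster
by_cluster = defaultdict(Counter)
for cluster, genre in rows:
    by_cluster[cluster][genre] += 1

# Most frequent genre per cluster
dominant = {c: counts.most_common(1)[0][0] for c, counts in by_cluster.items()}
print(dominant)
```

If one genre clearly dominates each cluster, the clustering agrees with genre; a flat distribution suggests the clusters capture something else.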

In [43]:
# Import cosine_similarity to calculate similarity of movie plots
from sklearn.metrics.pairwise import cosine_similarity

# Calculate the similarity distance
similarity_distance = 1 - cosine_similarity(tfidf_matrix)

In [44]:
type(similarity_distance)

numpy.ndarray
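
As a rough illustration of what each entry of this matrix means, the cosine distance between two vectors can be computed by hand. The vectors and the helper `cosine_distance` below are hypothetical stand-ins for rows of the TF-IDF matrix.

```python
import math

def cosine_distance(u, v):
    """1 minus the cosine of the angle between u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

a = [1.0, 2.0, 0.0]
b = [2.0, 4.0, 0.0]  # same direction as a, different length
c = [0.0, 0.0, 3.0]  # shares no terms with a

print(cosine_distance(a, b), cosine_distance(a, c))
```

Vectors that point in the same direction have a distance of about 0 regardless of length, and vectors with no terms in common have a distance of 1.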

Pickle file for serializing

To deploy our ML model on a website, it is better to pickle the result and deploy that: loading the stored matrix is faster than recomputing it.
The code below does this. It stores the data in serialized form so that it can be loaded later.

In [45]:

with open('simDist.pkl', 'wb') as f:
    pickle.dump(similarity_distance, f)
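
On the deployment side the file is read back with `pickle.load`. The following round-trip sketch uses a small nested list as a stand-in for the distance matrix and a hypothetical file name in a temporary directory, so it does not touch the real simDist.pkl.

```python
import os
import pickle
import tempfile

# Stand-in for similarity_distance (assumption)
matrix = [[0.0, 0.25], [0.25, 0.0]]

# Hypothetical demo path inside a fresh temporary directory
path = os.path.join(tempfile.mkdtemp(), "simDist_demo.pkl")

# Serialize to disk
with open(path, "wb") as f:
    pickle.dump(matrix, f)

# Load it back, as a web application would at startup
with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored == matrix)
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.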

## Find Similar Movie

We can also create a function to search for the movie most similar to another.

Here we use the NumPy method argsort and pick the second entry of the sorted distances. This is because the minimum distance of a movie is with itself, which can be seen on the main diagonal of similarity_distance (all the values are zero).

In [46]:
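def find_similar(title):
  # Positional row of the first movie whose title matches
  index = np.flatnonzero((movies_df['Title'] == title).to_numpy())[0]
  # Distances from that movie to every movie, itself included
  vector = similarity_distance[index, :]
  # argsort position 0 is the movie itself (distance 0), so take position 1
  most_similar = movies_df['Title'].iloc[np.argsort(vector)[1]]
  return most_similar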
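
The same ranking idea extends to returning several recommendations. The sketch below is a hypothetical variant, `top_n_similar`, written in plain Python on a made-up 4 x 4 distance matrix; the `sorted(range(...))` call plays the role of `np.argsort`.

```python
def top_n_similar(title, titles, distances, n=3):
    """Return the n titles with the smallest distance to `title`."""
    idx = titles.index(title)
    # Indices ordered by increasing distance (what np.argsort would return)
    order = sorted(range(len(titles)), key=lambda j: distances[idx][j])
    # Exclude the query movie itself, then keep the first n
    return [titles[j] for j in order if j != idx][:n]

# Hypothetical titles and symmetric distance matrix
titles = ["A", "B", "C", "D"]
distances = [
    [0.0, 0.2, 0.9, 0.5],
    [0.2, 0.0, 0.7, 0.4],
    [0.9, 0.7, 0.0, 0.3],
    [0.5, 0.4, 0.3, 0.0],
]

print(top_n_similar("A", titles, distances, n=2))
```

Filtering out the query index explicitly, rather than skipping the first sorted entry, keeps the result correct even if another movie happens to have a distance of exactly zero.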


## Some examples

In [47]:
find_similar("The Bounty Hunter")

'Count Five and Die'

In [48]:
find_similar("The Exorcist")

'On Golden Pond'

## Conclusion

Training a machine learning model for basic NLP tasks is simple.
First, the data is tokenized and filtered so that it can be represented with units called tokens. We can also reduce the words to their root form, which shrinks the vocabulary, and then we vectorize the dataset using an algorithm that depends on the problem we are trying to solve.

Second, we used K-Means (unsupervised learning) to group similar movies.

Finally, we wrote a function which takes a movie as input and, based on the similarity distance computed above, returns the most similar movie.

This has also been deployed as a Web Application.

