# Introduction
In this final project of Web Analytics course, we worked on building a movie recommender system based on movie's text scraped from Wikipedia. 

In [216]:
##Import all the required packages
import pandas as pd
import numpy as np
from IPython.display import display # Allows the use of display() for DataFrames
import time
import pickle #To save the objects that were created using webscraping
import pprint
from lxml import html
import requests
from sklearn.model_selection import train_test_split
import warnings
warnings.filterwarnings('ignore')
from urllib.request import urlopen
from bs4 import BeautifulSoup
from IPython.display import HTML
import urllib
import os

import os
import re
import nltk
import string
from collections import Counter


# Data collection

We scraped the text related to movie's plot from Wikipedia using a Python based web robot. The design of this web robot is present at: 


The data collection is divided into 2 phases:

1. In the first phase the web robot has successfully scraped the list of target movie URLs, by iteratively visiting theweb site  https://en.wikipedia.org/wiki/List_of_American_films_of_xxxx (where xxxx represent the year. For example, to obtain the list of american movies released in 2000, we have to visit the website https://en.wikipedia.org/wiki/List_of_American_films_of_2000). From each URL the web robot will get the list of movie names, movie's Wiki page, cast, director and genre details. We scraped the list of movies for the years 2000-2016. A total of 4045 movies list was obtained. 

2. In the secod phase the web robot has visited all the 4045 URLs, and has successfully downloaded the movies plots from wikipedia. The web robot has run for approximately 7 hours to download 4045 movies text (we purposefully used a delay of 3 seconds between every hit, so that we do not overwhlem the wikipedia server with constant hits). 

The output of phase-1 is a comma separated file, with the following details: 

**Movie** - Movie Name

**URL** - Wikipedia web page for the movie

**Year** - Year of release

**Director** - Director of the movie

**Cast** - Cast of the movie

**Genre** - Movie's genre

**Movie_ID** - Unique key to distinguish each movie

In [217]:
URL = pd.read_csv("Movies_URL_Latest.csv")
#URL[URL["Movie_ID"] == 899]
display(URL.head())
print(URL.shape)


Unnamed: 0,Movie,URL,Year,Director,Cast,Genre,Movie_ID
0,102 Dalmatians,https://en.wikipedia.org/wiki/102_Dalmatians,2000,Kevin Lima,"Glenn Close, Gérard Depardieu, Alice Evans","Comedy, family",1
1,28 Days,https://en.wikipedia.org/wiki/28_Days_(film),2000,Betty Thomas,"Sandra Bullock, Viggo Mortensen",Drama,2
2,3 Strikes,https://en.wikipedia.org/wiki/3_Strikes_(film),2000,DJ Pooh,"Brian Hooks, N'Bushe Wright",Comedy,3
3,The 6th Day,https://en.wikipedia.org/wiki/The_6th_Day,2000,Roger Spottiswoode,"Arnold Schwarzenegger, Robert Duvall",Science fiction,4
4,Across the Line,https://en.wikipedia.org/wiki/Across_the_Line_...,2000,Martin Spottl,"Brad Johnson, Adrienne Barbeau, Brian Bloom",Thriller,5


(4045, 7)


The output of phase-2 is 4045 text files and image files (Not all the movies and images were downloaded. But will be fixed later). The output of phase-2 (text files) is further processed (cleaned by removing unnecessary characters, stop words etc) to create a CSV file, with the following format.

**Movie_ID** - Unique ID of the movie

**Plot** - Plot of the movie

The initial rows of this file is displayed below:

In [218]:
df = pd.read_csv("processed_data.csv")
df.head()

Unnamed: 0,Movie_ID,Plot
0,1,102 dalmatians 2000 american family comedy fil...
1,10,american psycho 2000 american black comedy hor...
2,100,legacy 2000 american documentary film directed...
3,1000,lemony snicket series unfortunate events 2004 ...
4,1001,life death peter sellers 2004 british-american...


# Building the recommender

## Get the TFIDF scores

Using the data frame, we have to get the TFIDF scores for all the words in each of the document.

In [219]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(df["Plot"])
print(tfidf_matrix.shape)

(4037, 54075)


## Get the cosine similarity measure between each pair of movie

In [220]:
from sklearn.metrics.pairwise import cosine_similarity
cos_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)


#Convert cos_sim (a numpy array) to a data frame with rows and columns as movie IDs
cos_sim_df = pd.DataFrame(cos_sim,columns=df["Movie_ID"].tolist(),index=df["Movie_ID"].tolist())

## Getting recommendations
Assume that the user has liked the following movies:

In [221]:
#Get the mapping between available Movie plots and movie IDs
Movie_Map=pd.merge(URL[["Movie","Movie_ID"]],df,how='inner',on=["Movie_ID"])[["Movie","Movie_ID"]]

def Get_Recommendations(Movie_ID,cos_sim_df):
    recommended_idx=np.argpartition(np.array(cos_sim_df[Movie_ID].tolist()), -6)[-6:]
    Recommended_Movie_IDs = cos_sim_df.columns[recommended_idx].tolist()
    return Recommended_Movie_IDs

def Get_Available_Images():
    
    image_files = os.listdir("./images")
    #Make sure that we are dealing with movie data files only
    image_files = [i for i in image_files if re.search('[1-9]*\.jpg',i)]
    y = list()
    for i in image_files:
        y.append(int(i.split(".")[0]))
    return y

def Display_Recommendations(Recommended_Movies,Movie_Map,Source_Movie_ID):
    Movie_Map[Movie_Map["Movie_ID"].isin(Recommended_Movies)]["Movie_ID"].tolist()
    Available_Images_List = Get_Available_Images()
    Source_Movie_Name = Movie_Map[Movie_Map["Movie_ID"] == Source_Movie_ID]["Movie"].tolist()[0]
    print("Since the user liked {}:".format(Source_Movie_Name))
    
    Recommended_Movies = list(set(Recommended_Movies) - set([Source_Movie_ID]))
    if Source_Movie_ID in Available_Images_List:
        #print("The user has liked {}".format(Source_Movie_Name))
        display(HTML("<table><tr><td><img src='./images/"+str(Source_Movie_ID)+".jpg'></td></tr></table>" \
            ))        
        
    display_html = ""
    for i in Recommended_Movies:
        if i in Available_Images_List:
            display_html = display_html + "<td><img src='./images/"+str(i)+".jpg'></td>"
    print("The following movies are recommended:")        
    display(HTML("<table><tr><td>"+display_html+"</tr></table>" \
            ))        
    #return display_html            
    #Get available images for movies:

### Demonstration of the recommendations
We will get recommended movies given that the user has liked some movies:

In [222]:
# Get recommendations for the Movie_IDs: [1, 73, 3316, 3883]
Recommended_Movies = Get_Recommendations(1,cos_sim_df)
Display_Recommendations(Recommended_Movies,Movie_Map,1)

Since the user liked  102 Dalmatians:


The following movies are recommended:


In [211]:

Recommended_Movies = Get_Recommendations(73,cos_sim_df)
Display_Recommendations(Recommended_Movies,Movie_Map,73)


Since the user liked Gladiator:


The following movies are recommended:


In [215]:
Recommended_Movies = Get_Recommendations(3934,cos_sim_df)
Display_Recommendations(Recommended_Movies,Movie_Map,3934)


Since the user liked London Has Fallen:


The following movies are recommended:


In [213]:
Recommended_Movies = Get_Recommendations(3883,cos_sim_df)
Display_Recommendations(Recommended_Movies,Movie_Map,3883)


Since the user liked Minions:


The following movies are recommended:


Cosine similarity is a numpy array. 

In [50]:
cos_sim[Interested_Movie_ID]

array([[  1.31096471e-02,   1.00000000e+00,   9.44418644e-03, ...,
          1.34289270e-02,   9.61086104e-04,   4.96355875e-03],
       [  2.84413501e-03,   1.30861024e-02,   7.98737751e-03, ...,
          1.48305863e-02,   7.58306966e-03,   1.05293617e-02],
       [  9.75945776e-03,   1.13742963e-02,   9.06565598e-03, ...,
          9.36789954e-03,   2.40102478e-03,   1.14813899e-02],
       [  4.77905113e-03,   1.04962520e-02,   2.80845323e-03, ...,
          1.78730739e-02,   2.41143992e-03,   5.36660507e-03],
       [  2.02944376e-02,   1.54408038e-02,   8.46405079e-03, ...,
          1.26664653e-02,   5.99270202e-04,   6.37538135e-03]])

In [60]:
def Get_Recommendations(cos_sim, id,n,URL):
    '''
    cos_sim is the cosine similarity between each pair of movies
    id is the movie_id
    n is the desired number of recommendations
    '''
    for i in id:
        print("Given that the user liked {}, the following movies are recommended:"
              .format(list(URL["Movie"][URL["Movie_ID"] == i])[0]))
        ind = np.argpartition(cos_sim[i], -n)[-n:]
        print(ind)
        #print(cos_sim[i][ind])
    #y[ind]
Get_Recommendations(cos_sim, Interested_Movie_ID,5,URL)

Given that the user liked  102 Dalmatians, the following movies are recommended:
[3162 2781    1 1438 3002]
Given that the user liked Gladiator, the following movies are recommended:
[2038 2824 1654   73 2283]
Given that the user liked Frozen, the following movies are recommended:
[ 995 2559 1618  610 2772]
Given that the user liked Frozen, the following movies are recommended:
[3512 1604 3232 3554 3316]
Given that the user liked Minions, the following movies are recommended:
[1123 3511 1320 3883  508]


In [25]:
for i in list(URL["Movie"]) if i 

URL[URL["Movie"].isin(["Gladiator","102 Dalmatians"])]

AttributeError: 'Series' object has no attribute 'strip'

# Building the recommender

## Data cleaning

Reading the files data into a data frame

In [121]:
def Read_File(p):
   with open(p, 'r',encoding='utf-8') as f:
    text = f.read()
    #Convert all the text to lower case
    #
    lowers = text.lower()
    #remove the punctuation using the character deletion step of translate
    no_punctuation = lowers.translate(string.punctuation)
    tokens = nltk.word_tokenize(no_punctuation)
    return tokens

def Remove_Stop_Words(tokens):
    filtered = [w for w in tokens if not w in stopwords.words('english')]
    return filtered



def Clean_Text(tokens):
    text = " ".join(tokens)
    #Remove punctuation marks, text in [], (, ), :
    filtered1 = re.sub('\.|\`|\'|\[.*\]|\(|\)|,|:', " ",text)
    
    #Remove any single characters
    filtered1 = re.sub('(^| ).( |$)', " ",filtered1)
    #Remove any contiguous spaces    
    filtered1 = re.sub(' +'," ",filtered1)
    
    #Include only alpha numeric characters
    filtered1=" ".join([i for i in filtered1.split() if re.search('[0-9 a-z]*',i)])
    return filtered1

In [129]:
file_names = os.listdir("./data")
file_names = [i for i in file_names if re.search('[1-9]*\.txt',i)]
y = list()
x = list()
k = 0
for i in file_names:
    y.append(int(i.split(".")[0]))
    #print(y)
    tokens = Read_File("./data/"+i)
    tokens = Remove_Stop_Words(tokens)
    cleaned_text = Clean_Text(tokens)
    x.append(cleaned_text)
    k = k+1
    if(k%100 == 0):
        print("Processed {} files".format(k))
    

Processed 100 files
Processed 200 files
Processed 300 files
Processed 400 files
Processed 500 files
Processed 600 files
Processed 700 files
Processed 800 files
Processed 900 files
Processed 1000 files
Processed 1100 files
Processed 1200 files
Processed 1300 files
Processed 1400 files
Processed 1500 files
Processed 1600 files
Processed 1700 files
Processed 1800 files
Processed 1900 files
Processed 2000 files
Processed 2100 files
Processed 2200 files
Processed 2300 files
Processed 2400 files
Processed 2500 files
Processed 2600 files
Processed 2700 files
Processed 2800 files
Processed 2900 files
Processed 3000 files
Processed 3100 files
Processed 3200 files
Processed 3300 files
Processed 3400 files
Processed 3500 files
Processed 3600 files
Processed 3700 files
Processed 3800 files
Processed 3900 files
Processed 4000 files


In [142]:
df=pd.DataFrame(list(zip(y,x)),columns = ["Movie_ID","Plot"])
df.to_csv("processed_data.csv",encoding='utf-8',index=False)


In [154]:
df = pd.read_csv("processed_data.csv")
#df.head()
#X = df.pop("Plot")
#display(X.head())
#y = df.pop("Movie_ID")
#display(y)
df.head()

Unnamed: 0,Movie_ID,Plot
0,1,102 dalmatians 2000 american family comedy fil...
1,10,american psycho 2000 american black comedy hor...
2,100,legacy 2000 american documentary film directed...
3,1000,lemony snicket series unfortunate events 2004 ...
4,1001,life death peter sellers 2004 british-american...


In [151]:
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer


In [159]:
start=time.time()
count_vect = CountVectorizer()
X_counts = count_vect.fit_transform(df["Plot"])
#print(X_counts)
tfidf_transformer = TfidfTransformer()
X_tfidf = tfidf_transformer.fit_transform(X_counts)
end=time.time()
print("Run time using sklearn package is {} sec\n".format(end-start))
#print "Some of the initial TFIDF rows:\n\n{}".format(X_tfidf[0:2])
print("The TF-IDF matrix has {} rows and {} columns\n".format(X_tfidf.shape[0],X_tfidf.shape[1]))

Run time using sklearn package is 1.7591631412506104 sec

The TF-IDF matrix has 4037 rows and 54075 columns



In [161]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(df["Plot"])
print(tfidf_matrix.shape)
#(4, 11)

(4037, 54075)


In [185]:
from sklearn.metrics.pairwise import cosine_similarity
cos_sim = cosine_similarity(tfidf_matrix[0:1], tfidf_matrix)
#array([[ 1.        ,  0.36651513,  0.52305744,  0.13448867]])
#sorted(cos_sim[0])[-5:]
ind = np.argpartition(cos_sim[0], -4)[-4:]
cos_sim[0][ind]
y[ind]

1201    2080
2846    3564
3660     659
0          1
Name: Movie_ID, dtype: int64

In [106]:
import nltk
import string

from collections import Counter

def get_tokens():
   with open('./data/34.txt', 'r') as f:
    text = f.read()
    print(text)
    lowers = text.lower()
    #remove the punctuation using the character deletion step of translate
    no_punctuation = lowers.translate(string.punctuation)
    tokens = nltk.word_tokenize(no_punctuation)
    return tokens

tokens = get_tokens()
count = Counter(tokens)
print(count.most_common(10))
print(" ".join(tokens))

Brother is a 2000 American-British-Japanese film starring, written, directed, and edited by Takeshi Kitano.[2]Yamamoto Takeshi Kitano is a brutal and experienced Yakuza enforcer whose boss was killed and whose clan was defeated in a criminal war with a rival family. Surviving clan members have few options: either to join the winners, reconciling with shame and distrust, or to die by committing seppuku. Yamamoto, however, decides to escape to Los Angeles along with his associate Kato (Susumu Terajima). There he finds his estranged half-brother Ken (Claude Maki), who runs a small-time drug business together with his local African-American friends. At the first meeting, Yamamoto badly hurts one of them, Denny (Omar Epps), for an attempt to fraud him. Later, Denny becomes one of the Yamamoto's closest friends and associates.Used to living in a clan and according to its laws, Yamamoto creates a hapless gang out of Ken's buddies. The new gang quickly and brutally attacks Mexican drug bosses 

In [107]:
from nltk.corpus import stopwords

tokens = get_tokens()
filtered = [w for w in tokens if not w in stopwords.words('english')]
count = Counter(filtered)
print(count.most_common(100))

Brother is a 2000 American-British-Japanese film starring, written, directed, and edited by Takeshi Kitano.[2]Yamamoto Takeshi Kitano is a brutal and experienced Yakuza enforcer whose boss was killed and whose clan was defeated in a criminal war with a rival family. Surviving clan members have few options: either to join the winners, reconciling with shame and distrust, or to die by committing seppuku. Yamamoto, however, decides to escape to Los Angeles along with his associate Kato (Susumu Terajima). There he finds his estranged half-brother Ken (Claude Maki), who runs a small-time drug business together with his local African-American friends. At the first meeting, Yamamoto badly hurts one of them, Denny (Omar Epps), for an attempt to fraud him. Later, Denny becomes one of the Yamamoto's closest friends and associates.Used to living in a clan and according to its laws, Yamamoto creates a hapless gang out of Ken's buddies. The new gang quickly and brutally attacks Mexican drug bosses 

In [108]:
filtered

['brother',
 '2000',
 'american-british-japanese',
 'film',
 'starring',
 ',',
 'written',
 ',',
 'directed',
 ',',
 'edited',
 'takeshi',
 'kitano',
 '.',
 '[',
 '2',
 ']',
 'yamamoto',
 'takeshi',
 'kitano',
 'brutal',
 'experienced',
 'yakuza',
 'enforcer',
 'whose',
 'boss',
 'killed',
 'whose',
 'clan',
 'defeated',
 'criminal',
 'war',
 'rival',
 'family',
 '.',
 'surviving',
 'clan',
 'members',
 'options',
 ':',
 'either',
 'join',
 'winners',
 ',',
 'reconciling',
 'shame',
 'distrust',
 ',',
 'die',
 'committing',
 'seppuku',
 '.',
 'yamamoto',
 ',',
 'however',
 ',',
 'decides',
 'escape',
 'los',
 'angeles',
 'along',
 'associate',
 'kato',
 '(',
 'susumu',
 'terajima',
 ')',
 '.',
 'finds',
 'estranged',
 'half-brother',
 'ken',
 '(',
 'claude',
 'maki',
 ')',
 ',',
 'runs',
 'small-time',
 'drug',
 'business',
 'together',
 'local',
 'african-american',
 'friends',
 '.',
 'first',
 'meeting',
 ',',
 'yamamoto',
 'badly',
 'hurts',
 'one',
 ',',
 'denny',
 '(',
 'omar',
 '

In [109]:
import re
#filtered1 = ['[37]']
#[i for i in filtered1 if re.search('.*\[.*\].*', i)]
["" for i in filtered if re.search(' *\'s.*', i)]

['', '']

In [110]:
filtered=" ".join(filtered)
filtered

"brother 2000 american-british-japanese film starring , written , directed , edited takeshi kitano . [ 2 ] yamamoto takeshi kitano brutal experienced yakuza enforcer whose boss killed whose clan defeated criminal war rival family . surviving clan members options : either join winners , reconciling shame distrust , die committing seppuku . yamamoto , however , decides escape los angeles along associate kato ( susumu terajima ) . finds estranged half-brother ken ( claude maki ) , runs small-time drug business together local african-american friends . first meeting , yamamoto badly hurts one , denny ( omar epps ) , attempt fraud . later , denny becomes one yamamoto 's closest friends associates.used living clan according laws , yamamoto creates hapless gang ken 's buddies . new gang quickly brutally attacks mexican drug bosses takes control territory la . also form alliance shirase ( masaya kato ) , criminal leader little tokyo district , making group even stronger . time passes , yamamot

In [111]:
filtered1 = re.sub('\.|\`|\'|\[.*\]|\(|\)|,|:', " ",filtered)
#print(filtered1)
filtered1 = re.sub('(^| ).( |$)', " ",filtered1)
filtered1 = re.sub(' +'," ",filtered1)


filtered1=" ".join([i for i in filtered1.split() if re.search('[0-9 a-z]*',i)])
print(filtered1)

brother 2000 american-british-japanese film starring written directed edited takeshi kitano yamamoto takeshi kitano brutal experienced yakuza enforcer whose boss killed whose clan defeated criminal war rival family surviving clan members options either join winners reconciling shame distrust die committing seppuku yamamoto however decides escape los angeles along associate kato susumu terajima finds estranged half-brother ken claude maki runs small-time drug business together local african-american friends first meeting yamamoto badly hurts one denny omar epps attempt fraud later denny becomes one yamamoto closest friends associates used living clan according laws yamamoto creates hapless gang ken buddies new gang quickly brutally attacks mexican drug bosses takes control territory la also form alliance shirase masaya kato criminal leader little tokyo district making group even stronger time passes yamamoto new gang emerge formidable force gradually expanding turf extent confront power

In [89]:
count = Counter(filtered1.split())
print(filtered1)
print(count.most_common(100))

102 dalmatians 2000 american family comedy film directed kevin lima live-action directorial debut produced edward  feldman walt disney pictures sequel 1996 film 101 dalmatians stars glenn close reprising role cruella de vil attempts steal puppies  grandest  fur coat yet close tim mcinnerny two actors first film return sequel however film nominated academy award best costume design lost gladiator   three years prison cruella de vil cured desire fur coats dr  pavlov released custody probation office provision forced pay remainder fortune eight million pounds dog shelters borough westminster repeat crime cruella therefore mends working relationship valet alonzo lock away fur coats cruella  probation officer chloe simon nevertheless suspects partly chloe owner now-adult dipstick one original 15 puppies previous film   dipstick  mate dottie recently given birth three puppies domino little dipper oddball lacks spots   mend reputation cruella buys second chance dog shelter owned kevin shepher