# Web Analytics - Final Project
## Movie recommendations based on text from Wikipedia
_July 10, 2017_

### Group 1 Members:
* Mauricio Alarcon
* Sekhar Mekala
* Aadi Kalloo
* Srinivasa Illapani
* Param Singh

#### Background
Movie recommendation is one of the classic applications of recommender systems, and there are several ways to approach it. In this project, our goal is to apply Natural Language Processing (NLP) techniques to movie text obtained from Wikipedia and to determine which movies are most relevant to a given movie. We scraped text for 4037 American movies released from 2000 to 2016. The key deliverables of this project are:

* Text corpus of 4037 movies
* Movie posters of 3749 movies
* Movie recommender based on each movie's plot

#### Technologies used:
We used the following software/packages to develop the core logic of this project:
* Python 3
* Pandas
* Numpy
* Sklearn
* BeautifulSoup
* urllib

**NOTE:** We scraped movie release posters so that the recommendations could be displayed more attractively. However, we could not get posters for all the movies: some are not available, and others could not easily be downloaded by our crawler because the HTML IDs on the Wikipedia pages are inconsistent.

## Importing the required packages for the project

In [1]:
##Import all the required packages
import pandas as pd
import numpy as np
from IPython.display import display # Allows the use of display() for DataFrames
import time
import pickle #To save the objects that were created using webscraping
import pprint
from IPython.display import HTML
from lxml import html
import requests
from sklearn.model_selection import train_test_split
import warnings
warnings.filterwarnings('ignore')
from urllib.request import urlopen
from bs4 import BeautifulSoup

import urllib
import os
import re
import nltk
import string
from collections import Counter


# Data collection

We scraped the text related to movies from Wikipedia using a Python-based web robot. The design of this web robot and its source code are given in *Appendix-A*.

The data collection is divided into 2 phases:

### Phase 1:
In the first phase, the web robot scraped the list of target movie URLs by iteratively visiting the pages https://en.wikipedia.org/wiki/List_of_American_films_of_xxxx, where xxxx represents the year. For example, the list of American movies released in 2000 is at https://en.wikipedia.org/wiki/List_of_American_films_of_2000. From each page the web robot obtained the movie names, each movie's Wikipedia page, and the cast, director and genre details. We scraped the lists for the years 2000-2016 and obtained a total of 4045 movies.
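As a minimal sketch of how the per-year list pages described above could be enumerated (the helper name `build_year_urls` is ours for illustration and is not taken from the web robot in Appendix-A):

```python
# URL pattern quoted in the text; {year} is filled in for each year.
BASE_URL = "https://en.wikipedia.org/wiki/List_of_American_films_of_{year}"

def build_year_urls(first_year=2000, last_year=2016):
    """Return one list-page URL per year, including both end years."""
    return [BASE_URL.format(year=year) for year in range(first_year, last_year + 1)]

year_urls = build_year_urls()
print(len(year_urls))  # number of list pages to visit
print(year_urls[0])    # first page (year 2000)
print(year_urls[-1])   # last page (year 2016)
```

The years 2000-2016 give 17 list pages, each of which is then parsed for movie names and links.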

The output of phase-1 is a comma separated file (*Movie_Details.csv*), with the following details: 

**Movie** - Movie Name

**URL** - Wikipedia web page for the movie

**Year** - Year of release

**Director** - Director of the movie

**Cast** - Cast of the movie

**Genre** - Movie's genre

**Movie_ID** - Unique key to distinguish each movie

### Phase 2:
In the second phase, the web robot visited all 4045 URLs and downloaded the movie text from Wikipedia. The web robot ran for approximately 7 hours (we deliberately used a delay of 3 seconds between hits so that Wikipedia was not bombarded with continuous requests). Out of the 4045 movies, we obtained text for 4037, since 8 movies do not have any text on Wikipedia. Along with the text, the web robot also downloaded the release posters for the movies. However, we were able to download only 3749 posters, since the remaining posters were either unavailable or could not be located by the web robot.
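The delay between hits can be sketched as below. This is an illustration only, not the code from Appendix-A: `crawl_politely` and its `fetch` argument are hypothetical names, and the demonstration uses a fake fetcher so that it runs offline.

```python
import time

def crawl_politely(urls, fetch, delay_seconds=3, sleep=time.sleep):
    """Fetch each URL in turn, pausing between consecutive requests.

    `fetch` is any callable that takes a URL and returns the page text
    (hypothetical here; a real one would wrap an HTTP client).
    """
    pages = {}
    for position, url in enumerate(urls):
        if position > 0:
            sleep(delay_seconds)  # pause before every request except the first
        pages[url] = fetch(url)
    return pages

# Offline demonstration: a fake fetcher, and a "sleep" that only records the pauses.
pauses = []
demo_pages = crawl_politely(
    ["page-a", "page-b", "page-c"],
    fetch=lambda url: "text of " + url,
    sleep=pauses.append,
)
print(demo_pages["page-b"])
print(pauses)

# Hours spent only on pauses when visiting 4045 pages with a 3 second delay.
print(round((4045 - 1) * 3 / 3600, 2))
```

With 4045 pages, the pauses alone add up to roughly 3.4 hours, which accounts for a large share of the approximately 7 hour run time.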

The outputs of phase-2 are given below:

1. A set of 4037 text files containing movie plots

2. A set of 3749 movie posters

3. A comma separated file (*processed_data.csv*), with the following details:

      **Movie_ID** - Unique identifier of the movie

      **Plot** - Plot of the movie, with stop words and punctuation removed, and all characters converted to lower case

See **Appendix-A** for the design of web robot logic and source code

See **Appendix-B** for the detailed steps of data cleansing
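The Plot column is described as lower-cased text with stop words and punctuation removed. A minimal sketch of such a cleaning step follows. It is not the Appendix-B code: the stop-word list is a tiny illustrative set (a fuller list, such as NLTK's, would be used in practice), the input sentence is made up for the example, and keeping hyphens is an assumption based on tokens like "british-american" in the sample output.

```python
import string

# Tiny illustrative stop-word list (assumption, not the project's actual list).
STOP_WORDS = {"a", "an", "and", "the", "is", "of", "in", "by", "to"}

def clean_plot(text, stop_words=STOP_WORDS):
    """Lower-case the text, strip punctuation, and drop stop words."""
    # Keep hyphens so that compounds such as "british-american" survive.
    punctuation = string.punctuation.replace("-", "")
    lowered = text.lower()
    without_punctuation = lowered.translate(str.maketrans("", "", punctuation))
    kept = [word for word in without_punctuation.split() if word not in stop_words]
    return " ".join(kept)

# Illustrative input sentence, not copied from the scraped corpus.
print(clean_plot("102 Dalmatians is a 2000 American family comedy film."))
```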

## Movie details
In phase-1, we produced the Movie_Details.csv file. This file is read into a pandas data frame called URL, and its first rows are shown below:

In [3]:
URL = pd.read_csv("Movie_Details.csv")
display(URL.head())
print(URL.shape)


Unnamed: 0,Movie,URL,Year,Director,Cast,Genre,Movie_ID
0,102 Dalmatians,https://en.wikipedia.org/wiki/102_Dalmatians,2000,Kevin Lima,"Glenn Close, Gérard Depardieu, Alice Evans","Comedy, family",1
1,28 Days,https://en.wikipedia.org/wiki/28_Days_(film),2000,Betty Thomas,"Sandra Bullock, Viggo Mortensen",Drama,2
2,3 Strikes,https://en.wikipedia.org/wiki/3_Strikes_(film),2000,DJ Pooh,"Brian Hooks, N'Bushe Wright",Comedy,3
3,The 6th Day,https://en.wikipedia.org/wiki/The_6th_Day,2000,Roger Spottiswoode,"Arnold Schwarzenegger, Robert Duvall",Science fiction,4
4,Across the Line,https://en.wikipedia.org/wiki/Across_the_Line_...,2000,Martin Spottl,"Brad Johnson, Adrienne Barbeau, Brian Bloom",Thriller,5


(4045, 7)


The output of phase-2 is 4037 text files and 3749 image files. The text files were further processed (cleaned by removing unnecessary characters, stop words etc) to create a CSV file (*processed_data.csv*), with the following format.

**Movie_ID** - Unique ID of the movie

**Plot** - Plot of the movie

The initial rows of this file are displayed below:

In [5]:
df = pd.read_csv("processed_data.csv")
df.head()

Unnamed: 0,Movie_ID,Plot
0,1,102 dalmatians 2000 american family comedy fil...
1,10,american psycho 2000 american black comedy hor...
2,100,legacy 2000 american documentary film directed...
3,1000,lemony snicket series unfortunate events 2004 ...
4,1001,life death peter sellers 2004 british-american...


# Building the recommender

Our recommender system is based on text analytics of the movie plots. We compute a TF-IDF (Term Frequency - Inverse Document Frequency) score for each unique word in each document. The unique words in the combined text of all the documents form the features.

Once the TF-IDF scores are computed, we obtain the cosine similarity between each pair of movies.

### TF-IDF Algorithm:

TF-IDF (Term Frequency - Inverse Document Frequency) is one of the most popular text processing algorithms. It assigns an importance score to each word in a document: a word scores highly when it occurs often in that document but in few other documents.

At a high level, the algorithm works as follows:

Let $D = \{d_1, d_2, \ldots, d_n\}$ be a set of $n$ documents.

a. For each document $d$ in $D$, get the frequencies of all the words in $d$. Call this the TF (Term Frequency) vector $TF_d$ for document $d$.

b. Get the list of all unique words across all the documents and, for each unique word, count the number of documents containing it. Let DF (Document Frequency) be the vector of these counts.

c. For each word $w$ in DF, compute the IDF (Inverse Document Frequency):

$$IDF_w=\log\left(\frac{n}{1+DF_w}\right)$$

where $DF_w$ is the number of documents containing the word $w$ and $n$ is the total number of documents. The log can have any valid base.

d. For each document $d$, multiply the elements of $TF_d$ by the corresponding elements of IDF to obtain the TF-IDF vector for document $d$.
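The steps above can be sketched in plain Python. This is a small illustration on a made-up toy corpus, using the natural log and the formula exactly as written above; the function name `tfidf_vectors` is ours and is not part of any library.

```python
import math
from collections import Counter

def tfidf_vectors(documents):
    """Return one {word: score} dict per document, using IDF = log(n / (1 + df))."""
    tokenized = [doc.split() for doc in documents]
    n = len(tokenized)
    # Document frequency: number of documents that contain each word.
    df = Counter(word for tokens in tokenized for word in set(tokens))
    idf = {word: math.log(n / (1 + count)) for word, count in df.items()}
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)  # raw term counts for this document
        vectors.append({word: tf[word] * idf[word] for word in tf})
    return vectors

# Made-up toy corpus for illustration only.
toy_corpus = [
    "space crew mars mission",
    "space station crew rescue",
    "wizard school magic",
    "dog puppy rescue",
]
for vector in tfidf_vectors(toy_corpus):
    print(vector)
```

In this toy corpus, "mars" (one document out of four) receives a higher weight than "space" (two documents). Note that with this particular formula a word that occurs in every document gets a slightly negative weight, since $n/(1+n) < 1$.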


The sklearn package provides the TfidfVectorizer class, which implements TF-IDF weighting. Using this class, we obtain the TF-IDF scores of all the unique words in all the movie plots.
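For reference, with its default settings (`smooth_idf=True`, `norm='l2'`), TfidfVectorizer uses a slightly different weighting from the textbook formula given earlier, so its scores will not match a hand computation exactly. To the best of our understanding of the scikit-learn documentation, the default IDF is

$$IDF_w=\ln\left(\frac{1+n}{1+DF_w}\right)+1$$

where $DF_w$ is the number of documents containing the word $w$ and $n$ is the total number of documents. Each document's TF-IDF vector is then scaled to unit Euclidean length:

$$\widehat{TFIDF}_d=\frac{TFIDF_d}{\lVert TFIDF_d\rVert_2}$$

The added 1 keeps every weight positive, and the unit-length scaling means the cosine similarity of two documents reduces to the dot product of their vectors.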

## Get the TFIDF scores

Using the data frame, we get the TF-IDF scores for all the words in each of the documents.

In [6]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(df["Plot"])
print(tfidf_matrix.shape)

(4037, 54075)


The TF-IDF matrix for the movie text has 4037 rows (one per movie) and 54075 columns (one per unique word in the combined text). The matrix is stored as a SciPy sparse matrix, since most of its elements are 0.

## Get the cosine similarity measure between each pair of movies
To obtain the cosine similarity between each pair of movies, we use the cosine_similarity function of the sklearn package. The code block below computes the cosine similarity between each pair of movies.

In [10]:
from sklearn.metrics.pairwise import cosine_similarity
cos_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)


#Convert cos_sim (a numpy array) to a data frame with rows and columns as movie IDs
cos_sim_df = pd.DataFrame(cos_sim,columns=df["Movie_ID"].tolist(),index=df["Movie_ID"].tolist())

In [16]:
display(cos_sim_df)

Unnamed: 0,1,10,100,1000,1001,1002,1003,1004,1005,1006,...,990,991,992,993,994,995,996,997,998,999
1,1.000000,0.013110,0.025153,0.016308,0.011577,0.007299,0.010271,0.010509,0.006320,0.009129,...,0.011846,0.008216,0.011399,0.005237,0.007315,0.006427,0.008360,0.006734,0.002983,0.006313
10,0.013110,1.000000,0.009444,0.012178,0.004680,0.005145,0.009484,0.008657,0.011202,0.007491,...,0.013249,0.017382,0.007149,0.007385,0.011362,0.004854,0.019540,0.013429,0.000961,0.004964
100,0.025153,0.009444,1.000000,0.016322,0.059142,0.027863,0.014048,0.036151,0.007757,0.006688,...,0.042228,0.001339,0.009732,0.007993,0.013280,0.009319,0.021211,0.006464,0.023891,0.016957
1000,0.016308,0.012178,0.016322,1.000000,0.004812,0.024725,0.011021,0.018253,0.011957,0.012848,...,0.034581,0.014664,0.008955,0.009943,0.010786,0.008122,0.021269,0.020851,0.005010,0.009184
1001,0.011577,0.004680,0.059142,0.004812,1.000000,0.006460,0.007578,0.032768,0.004027,0.009007,...,0.009151,0.002705,0.014901,0.004565,0.016529,0.044348,0.009097,0.007988,0.014264,0.025655
1002,0.007299,0.005145,0.027863,0.024725,0.006460,1.000000,0.006180,0.015325,0.006960,0.030447,...,0.015545,0.013500,0.031731,0.008173,0.010223,0.007692,0.015265,0.008178,0.010762,0.013937
1003,0.010271,0.009484,0.014048,0.011021,0.007578,0.006180,1.000000,0.009379,0.007544,0.010232,...,0.027237,0.018090,0.007262,0.006958,0.008695,0.009016,0.010088,0.012814,0.011469,0.015597
1004,0.010509,0.008657,0.036151,0.018253,0.032768,0.015325,0.009379,1.000000,0.013861,0.013198,...,0.013786,0.027088,0.012796,0.010316,0.018577,0.009816,0.023062,0.009664,0.007165,0.012340
1005,0.006320,0.011202,0.007757,0.011957,0.004027,0.006960,0.007544,0.013861,1.000000,0.018479,...,0.011095,0.013903,0.015171,0.006398,0.006556,0.006096,0.010419,0.014563,0.004381,0.005944
1006,0.009129,0.007491,0.006688,0.012848,0.009007,0.030447,0.010232,0.013198,0.018479,1.000000,...,0.006813,0.011289,0.014773,0.010392,0.007992,0.007115,0.015426,0.007499,0.003727,0.018923


The cosine similarity matrix has 4037 rows and 4037 columns (only the first rows are shown above), and its elements are the cosine similarities between each pair of movies. The diagonal elements are 1, since the similarity of a movie with itself is always 1.
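To make the measure concrete, here is a small sketch of cosine similarity computed by hand on made-up sparse vectors. The function name `cosine_similarity_sparse` and the toy weights are ours for illustration; the project itself uses sklearn's cosine_similarity.

```python
import math

def cosine_similarity_sparse(u, v):
    """Cosine similarity of two sparse vectors stored as {word: weight} dicts."""
    # Only words present in both vectors contribute to the dot product.
    dot = sum(weight * v[word] for word, weight in u.items() if word in v)
    norm_u = math.sqrt(sum(weight ** 2 for weight in u.values()))
    norm_v = math.sqrt(sum(weight ** 2 for weight in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention for an all-zero vector
    return dot / (norm_u * norm_v)

# Made-up toy vectors for illustration only.
movie_a = {"space": 1.0, "crew": 1.0, "mars": 2.0}
movie_b = {"space": 1.0, "crew": 1.0, "rescue": 2.0}
movie_c = {"wizard": 1.0, "magic": 3.0}

print(cosine_similarity_sparse(movie_a, movie_a))  # a movie compared with itself
print(cosine_similarity_sparse(movie_a, movie_b))  # two shared words
print(cosine_similarity_sparse(movie_a, movie_c))  # no shared words
```

A movie compared with itself scores 1 (up to floating-point rounding), the pair sharing two words scores about 0.33, and the pair with no words in common scores 0.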

## Getting recommendations

Let us build the required functions to make movie recommendations, given that the user has liked a movie.



In [33]:
#Get the mapping between available Movie plots and movie IDs
Movie_Map=pd.merge(URL[["Movie","Movie_ID"]],df,how='inner',on=["Movie_ID"])[["Movie","Movie_ID","Plot"]]
#display(Movie_Map.head())
def Get_Recommendations(Movie_ID,cos_sim_df):
    #Similarity of the given movie to every movie (including itself)
    scores = cos_sim_df[Movie_ID].values
    #Positions of the 6 highest scores (unsorted): the movie itself plus 5 candidates
    recommended_idx = np.argpartition(scores, -6)[-6:]
    Recommended_Movie_IDs = cos_sim_df.columns[recommended_idx].tolist()
    #Map each recommended movie ID to its similarity score
    return dict(zip(Recommended_Movie_IDs, scores[recommended_idx]))

def Get_Available_Images():
    #Return the movie IDs for which a poster image (<Movie_ID>.jpg) is available
    image_files = os.listdir("./images")
    #Keep only files named <digits>.jpg, so that int() below cannot fail
    image_files = [i for i in image_files if re.fullmatch(r'[0-9]+\.jpg',i)]
    return [int(i.split(".")[0]) for i in image_files]

def Display_Recommendations(Recommended_Movies_Dict,Movie_Map,Source_Movie_ID):
    #Sort the movie IDs in descending order of similarity
    Recommended_Movies = [movie_id for movie_id, _ in sorted(Recommended_Movies_Dict.items(), key=lambda x: -x[1])]
    
    #Delete the liked movie from the list (by ID, rather than assuming it is ranked first)
    Recommended_Movies = [movie_id for movie_id in Recommended_Movies if movie_id != Source_Movie_ID][:5]
    
    Recommended_Movies_Plot = dict()
    for i in Recommended_Movies:
        Recommended_Movies_Plot[i] = Movie_Map[Movie_Map["Movie_ID"] == i]["Plot"].tolist()[0]
    
    #Recommended_Movies=list(Recommended_Movies_Dict.keys())
    #Movie_Map[Movie_Map["Movie_ID"].isin(Recommended_Movies)]["Movie_ID"].tolist()
    Available_Images_List = Get_Available_Images()
    Source_Movie_Name = Movie_Map[Movie_Map["Movie_ID"] == Source_Movie_ID]["Movie"].tolist()[0]
    Source_Plot = Movie_Map[Movie_Map["Movie_ID"] == Source_Movie_ID]["Plot"].tolist()[0]
    print("Assuming that the user liked {}:".format(Source_Movie_Name))
    
    #Recommended_Movies = list(set(Recommended_Movies) - set([Source_Movie_ID]))
    
    if Source_Movie_ID in Available_Images_List:
        #print("The user has liked {}".format(Source_Movie_Name))
        display(HTML("<table><tr><td><img src='./images/"+str(Source_Movie_ID)+".jpg' title='"+str(Source_Plot)+"'></td></tr></table>" \
            ))        
        
    display_html = ""
    display_values = ""
    for i in Recommended_Movies:
        if i in Available_Images_List:
            display_html = display_html + "<td><img src='./images/"+str(i)+".jpg' title='"+str(Recommended_Movies_Plot[i])+"'></td>"
            display_values = display_values + "<td> Similarity:"+str(Recommended_Movies_Dict[i])+"</td>"
    print("The following movies are recommended:")        
    display(HTML("<table><tr>"+display_html+"</tr><tr>"+display_values+"</tr></table>" \
            ))        
    #return display_html            
    #Get available images for movies:

### Demonstration of the system
We now get recommendations for several movies that a user is assumed to have liked. The cosine similarity score is displayed along with each recommendation, and the recommended movies are sorted in descending order of similarity. The top 5 movies are displayed; in some cases fewer than 5 appear, because a movie is not displayed when its poster image is unavailable (the web robot could not download it, or it does not exist):
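The selection step (take the most similar movies, leave out the liked movie, sort by score) can be sketched in plain Python as follows. This is an illustrative alternative to the argpartition-based function, with a hypothetical helper name `top_k_similar` and made-up scores.

```python
import heapq

def top_k_similar(similarities, source_id, k=5):
    """Return the k (movie_id, score) pairs most similar to source_id, best first.

    `similarities` maps every movie ID to its similarity with the source movie.
    """
    candidates = (
        (movie_id, score)
        for movie_id, score in similarities.items()
        if movie_id != source_id  # never recommend the liked movie itself
    )
    return heapq.nlargest(k, candidates, key=lambda pair: pair[1])

# Made-up similarity scores for illustration; movie 1 is the liked movie.
toy_scores = {1: 1.0, 2: 0.31, 3: 0.05, 4: 0.19, 5: 0.18, 6: 0.02, 7: 0.14}
print(top_k_similar(toy_scores, source_id=1, k=3))
# → [(2, 0.31), (4, 0.19), (5, 0.18)]
```

Excluding the source movie by ID, rather than by its position in the sorted list, stays correct even if another movie ties with it at a similarity of 1.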

In [34]:
Recommended_Movies = Get_Recommendations(3974,cos_sim_df)
Recommended_Movies
Display_Recommendations(Recommended_Movies,Movie_Map,3974)

Assuming that the user liked X-Men: Apocalypse:


The following movies are recommended:


[Posters of the recommended movies not shown]
Similarity: 0.268575850171, 0.238367317257, 0.229481952653, 0.221129217079, 0.190950502687


In [35]:
Recommended_Movies = Get_Recommendations(1,cos_sim_df)
Display_Recommendations(Recommended_Movies,Movie_Map,1)

Assuming that the user liked  102 Dalmatians:


The following movies are recommended:


[Posters of the recommended movies not shown]
Similarity: 0.306608854172, 0.185453340218, 0.175589654586, 0.14104740047, 0.133580264125


In [36]:

Recommended_Movies = Get_Recommendations(73,cos_sim_df)
Display_Recommendations(Recommended_Movies,Movie_Map,73)


Assuming that the user liked Gladiator:


The following movies are recommended:


[Posters of the recommended movies not shown]
Similarity: 0.0945836626403, 0.0631114505259, 0.0631047930739, 0.0514068891896, 0.0469541381883


In [37]:
Recommended_Movies = Get_Recommendations(3934,cos_sim_df)
Display_Recommendations(Recommended_Movies,Movie_Map,3934)


Assuming that the user liked London Has Fallen:


The following movies are recommended:


[Posters of the recommended movies not shown]
Similarity: 0.543469898779, 0.301055015997, 0.175642267065, 0.132846901922, 0.107062246397


In [38]:
Recommended_Movies = Get_Recommendations(3883,cos_sim_df)
Display_Recommendations(Recommended_Movies,Movie_Map,3883)


Assuming that the user liked Minions:


The following movies are recommended:


[Posters of the recommended movies not shown]
Similarity: 0.237926883929, 0.160253052959, 0.149015292148, 0.12885679193, 0.123991901771


In [39]:
Recommended_Movies = Get_Recommendations(2635,cos_sim_df)
Display_Recommendations(Recommended_Movies,Movie_Map,2635)

Assuming that the user liked Paranormal Activity 2:


The following movies are recommended:


[Posters of the recommended movies not shown]
Similarity: 0.526099643661, 0.242463753141, 0.234107937331, 0.189501563318, 0.182762222917


In [40]:
Recommended_Movies = Get_Recommendations(2731,cos_sim_df)
Display_Recommendations(Recommended_Movies,Movie_Map,2731)

Assuming that the user liked Captain America: The First Avenger:


The following movies are recommended:


[Posters of the recommended movies not shown]
Similarity: 0.298077735765, 0.212774310509, 0.178031538704


In [41]:
Recommended_Movies = Get_Recommendations(2800,cos_sim_df)
Display_Recommendations(Recommended_Movies,Movie_Map,2800)

Assuming that the user liked Insidious:


The following movies are recommended:


[Posters of the recommended movies not shown]
Similarity: 0.681227267623, 0.286212737551, 0.241297598608, 0.230389225785, 0.194715624972


In [42]:
Recommended_Movies = Get_Recommendations(2810,cos_sim_df)
Display_Recommendations(Recommended_Movies,Movie_Map,2810)

Assuming that the user liked Kung Fu Panda 2:


The following movies are recommended:


[Posters of the recommended movies not shown]
Similarity: 0.477916709463, 0.472015971221, 0.234033843151, 0.221169773384, 0.0752134805139


In [43]:
Recommended_Movies = Get_Recommendations(2656,cos_sim_df)
Display_Recommendations(Recommended_Movies,Movie_Map,2656)

Assuming that the user liked Saw VII:


The following movies are recommended:


[Posters of the recommended movies not shown]
Similarity: 0.299457229304, 0.263637194848, 0.214968641642, 0.198831678817, 0.180022259236


In [44]:
Recommended_Movies = Get_Recommendations(2733,cos_sim_df)
Display_Recommendations(Recommended_Movies,Movie_Map,2733)

Assuming that the user liked Cars 2:


The following movies are recommended:


[Posters of the recommended movies not shown]
Similarity: 0.38027263614, 0.295401592436, 0.148393064492, 0.109960631703, 0.085337820135


In [45]:
Recommended_Movies = Get_Recommendations(2825,cos_sim_df)
Display_Recommendations(Recommended_Movies,Movie_Map,2825)

Assuming that the user liked Mission: Impossible – Ghost Protocol:


The following movies are recommended:


[Posters of the recommended movies not shown]
Similarity: 0.284669659951, 0.229786262283, 0.197036840156, 0.196086677744, 0.184644692921


In [46]:
Recommended_Movies = Get_Recommendations(3176,cos_sim_df)
Display_Recommendations(Recommended_Movies,Movie_Map,3176)

Assuming that the user liked Titanic 3D:


The following movies are recommended:


[Posters of the recommended movies not shown]
Similarity: 0.292896298323, 0.171477544925, 0.15832825584, 0.156473330037, 0.149228285278


In [47]:
Recommended_Movies = Get_Recommendations(3331,cos_sim_df)
Display_Recommendations(Recommended_Movies,Movie_Map,3331)

Assuming that the user liked Gravity:


The following movies are recommended:


[Posters of the recommended movies not shown]
Similarity: 0.276681226339, 0.204091148072, 0.182912442585, 0.179175945546, 0.143045330245


In [48]:
Recommended_Movies = Get_Recommendations(3893,cos_sim_df)
Display_Recommendations(Recommended_Movies,Movie_Map,3893)

Assuming that the user liked The Martian:


The following movies are recommended:


[Posters of the recommended movies not shown]
Similarity: 0.175169934667, 0.152099978063, 0.0842259518701, 0.0823599248437, 0.081500883778


In [49]:
Recommended_Movies = Get_Recommendations(4015,cos_sim_df)
Display_Recommendations(Recommended_Movies,Movie_Map,4015)

Assuming that the user liked Sully:


The following movies are recommended:


[Posters of the recommended movies not shown]
Similarity: 0.104853207218, 0.0937654970696, 0.091478167585, 0.0818109615473, 0.0676232709732


In [50]:
Recommended_Movies = Get_Recommendations(795,cos_sim_df)
Display_Recommendations(Recommended_Movies,Movie_Map,795)

Assuming that the user liked The Matrix Reloaded:


The following movies are recommended:


[Posters of the recommended movies not shown]
Similarity: 0.711975340685, 0.0977197774246, 0.0703271071526, 0.0681079448531


In [51]:
Recommended_Movies = Get_Recommendations(3077,cos_sim_df)
Display_Recommendations(Recommended_Movies,Movie_Map,3077)

Assuming that the user liked The Hunger Games:


The following movies are recommended:


[Posters of the recommended movies not shown]
Similarity: 0.756288369602, 0.626990711657, 0.0599779801182, 0.0502080355009, 0.0455761376112


In [52]:
Recommended_Movies = Get_Recommendations(761,cos_sim_df)
Display_Recommendations(Recommended_Movies,Movie_Map,761)

Assuming that the user liked Hulk:


The following movies are recommended:


[Posters of the recommended movies not shown]
Similarity: 0.493282754986, 0.463848922177, 0.226625372715, 0.22531960068, 0.164103131198


In [53]:
Recommended_Movies = Get_Recommendations(616,cos_sim_df)
Display_Recommendations(Recommended_Movies,Movie_Map,616)

Assuming that the user liked Spider-Man:


The following movies are recommended:


[Posters of the recommended movies not shown]
Similarity: 0.532694697879, 0.515483578747, 0.45961557517, 0.412063982744, 0.351215409571


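The test cases above suggest that the text-based recommender does a reasonable job, although the strength of the best displayed match varies widely, from a similarity of about 0.09 for Gladiator to about 0.76 for The Hunger Games. As future work, we will develop a simple web-based interface for finding recommended movies.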

## Challenges
The major challenges in this project were getting the data from Wikipedia and cleaning it so that TF-IDF and cosine similarity could be applied.