# Web Analytics - Final Project
### Movie recommendations based on text from Wikipedia
_July 12, 2017_


## Group 1 Members:
* Mauricio Alarcon
* Sekhar Mekala
* Aadi Kalloo
* Srinivasa Illapani
* Param Singh

#### Background
Movies recommendation is one of the classic application of recommendation systems, and there are several ways to achieve this. In this project, our goal is to apply Natural Language Processing (NLP) techniques to the movie plot obtained from Wikipedia and determine relevant movies for a given movie. We scraped text related to 4037 movies from Wikipedia. These movies are American movies released since the year 2000. The key deliverables of this project are:

* Text corpus of 4037 movies
* Movie posters of 3749 movies
* Movie recommender based on movie's plot

#### Technologies used:
We used the following software/packages to develop the core logic of this project:
* Python 3
* Pandas
* Numpy
* Sklearn
* BeautifulSoup
* urllib

**NOTE:** We scraped movie release posters to render the recommendations in a more aesthetic fashion. However, we could not get all the movie posters, since some of them are not available, and some of them were not easily downloadable by our crawler since the webpage's HTML IDs are not consistent.

This project is divided into 4 logical phases:

1. Phase-1: Build a web crawler to download the movies text and release posters
2. Phase-2: Cleanse the text data to build the recommender system
3. Phase-3: Build the recommender system using the text data
4. Phase-4: Get the recommendations

The subsequent sections will have a detailed explanation of these phases.

# I. Phase-1: Build a Web Crawler

The main goal of **Phase-1** is to build a web crawler and scrape the text related to American movies, which were released between 2000 and 2016 years. Along with the text, we will also crawl the Movies posters. The major deliverables of this phase are:

1. Text (or plot) of American movies, which were released between the years 2000 and 2016
2. Movies release posters

NOTE: All the data will be obtained from Wikipedia.


## I.I Design

Wikipedia maintains list of movies, released in each year. The list of American movies released in each year are present at https://en.wikipedia.org/wiki/List_of_American_films_of_XXXX, where XXXX is the year. For each year between 2000 and 2016 (XXXX = 2000 to 2016), we have to recurrsively visit each year's URL to obtain the movies URLs, along with movie details such as cast, director, genre etc. Once the URLs related to all the movies are obtained, the web crawler will visit the movie URLs to scrape the plot of the movie. 

The following flowchart will provide an overview of the web crawling process:

<img src="crawler_logic.png">

 **Figure-1: Web crawling process**

The whole web crawling process is divided into 2 steps. In Step-1, we will visit Wikipedia to obtain a list of all the URLs related to the American movies which were released between the years 2000-2016. In Step-2, we will visit each of the URLs obtained in Step-1, to download movie text and release poster. There must be some delay of 2 to 3 seconds between successive requests to Wikipedia website.

The output of Step-1 will be a comma separated file (Movie_Details.csv), with the following details: 

**Movie** - Movie Name

**URL** - Wikipedia web page for the movie

**Year** - Year of release

**Director** - Director of the movie

**Cast** - Cast of the movie

**Genre** - Movie's genre

**Movie_ID** - Unique key to distinguish each movie

The output of Step-2 will be a set of text files, each file named using the Movie_ID, and a set of image files, each of the image files will also be named using the Movie_ID. Naming the files using the Movie_ID will help us to refer movie's text and image files uniquely.

# I.II Implementation

### I.II.I Import the required packages
Let us import all the required packages:

In [3]:
import pandas as pd
import numpy as np
from IPython.display import display # Allows the use of display() for DataFrames
import time
import pickle #To save the objects that were created using webscraping
import pprint
from lxml import html
import requests
from sklearn.model_selection import train_test_split
import warnings
warnings.filterwarnings('ignore')
from urllib.request import urlopen
from bs4 import BeautifulSoup
from IPython.display import HTML
import re
import urllib
import os

### I.II.II Step-1 Implementation

In Step-1 of the web crawling process, we will obtain the list of movies, along with the URLs of the movies, for all the movies which were released between the years 2000 and 2016. 

But the challenge was, the format of the HTML file. Wikipedia used one format for the movies released between 2000 and 2013, and another format for the movies released in 2014-16. So we will sub-divide the Step-1 process further to scrape the list of movies between 2000 and 2013 in phase 1 and 2015-16 in phase2.

In [60]:
##PHASE-1: Get the movies and URLs for the years 2000-2013
#Define the lists to hold the details of the movies
URL = list()
Movie_Name = list()
Director = list()
Cast = list()
Genre = list()
year = list()

#Create a beautiful soup object
bs = BeautifulSoup(html)

#Iterate over the years 2000 to 2014.
for y in list(range(2000,2014)):
    
    #Prepare the URL String and open the URL
    url = "https://en.wikipedia.org/wiki/List_of_American_films_of_"+str(y)
    html = urlopen(url)
    
    #Mandatory wait of 3 seconds
    time.sleep(3)
    
    #Get the web page as HTML document
    bs = BeautifulSoup(html)
    
    #Parse and get the required data
    for table in bs.find_all('table', {"class":"wikitable"}):
        for row in table.find_all('tr'):
            columns = row.find_all('td')
            if len(columns) > 4:
                Movie_Name.append(columns[0].get_text())
                Director.append(columns[1].get_text())
                Cast.append(columns[2].get_text())
                Genre.append(columns[3].get_text())
                year.append(y)
                
                #Handle exceptions, so that the process continues
                try:
                    a = columns[0].find('a',href=True)['href']
                    URL.append("https://en.wikipedia.org"+a)
                except:
                    URL.append("NA")
                    continue


                

In [33]:
##PHASE-2: Get the movies details for the years 2014 to 2016
#Declare the lists
URL1 = list()
Movie_Name1 = list()
Director1 = list()
Cast1 = list()
Genre1 = list()
year1 = list()

#For the years between 2014 and 2016
for y in range(2014,2017):          
    
    #Prepare wiki URL
    url = "https://en.wikipedia.org/wiki/List_of_American_films_of_"+str(y)
    
    #Exception handling to ignore the failures and continue processing
    try: 
        html = urlopen(url)
    except:
        print("problem with the following URL...continuining...:")
        print(url)
        continue
    #Sleep for 3 secs    
    time.sleep(3)
    
    #Declare beautiful soup object
    bs = BeautifulSoup(html)
    for table in bs.find_all('table', {"class":"wikitable"}):
        for row in table.find_all('tr'):
            columns = row.find_all('td')
            if len(columns) > 3: #To make sure that we are accessing the movies tables only
                if len(columns) == 6:
                    #print(columns[0].get_text())
                    Movie_Name1.append(columns[0].get_text())
                    Director1.append(columns[1].get_text())
                    Cast1.append(columns[2].get_text())
                    Genre1.append(columns[3].get_text())
                    year1.append(y)
                    try:
                        a=columns[0].find('a',href=True)['href']
                        URL1.append("https://en.wikipedia.org"+a)
                    except:
                        URL1.append("NA")
                        continue
                
                if len(columns) == 7:
                    #print(columns[1].get_text())
                    Movie_Name1.append(columns[1].get_text())
                    Director1.append(columns[2].get_text())
                    Cast1.append(columns[3].get_text())
                    Genre1.append(columns[4].get_text())
                    year1.append(y)
                    
                    try:
                        a=columns[1].find('a',href=True)['href']
                        URL1.append("https://en.wikipedia.org"+a)
                    except:
                        URL1.append("NA")
                        continue

                if len(columns) > 7:
                    #print("col len:{}".format(len(columns)))
                    #print(columns[2].get_text())
                    Movie_Name1.append(columns[2].get_text())
                    Director1.append(columns[3].get_text())
                    Cast1.append(columns[4].get_text())
                    Genre1.append(columns[5].get_text())
                    year1.append(y)                    
                    try:
                        a=columns[2].find('a',href=True)['href']
                        URL1.append("https://en.wikipedia.org"+a)
                    except:
                        URL1.append("NA")
                        continue
                    

### I.II.III Save the results
We will save the movies details as a CSV file named Movie_Details.csv. This helps us to avoid running Step-1 again.

In [None]:
#Create a data frame:
df = pd.DataFrame(list(zip(Movie_Name+Movie_Name1,URL+URL1,year+year1,Director+Director1,Cast+Cast1,Genre+Genre1)),\
                  columns=["Movie","URL","Year","Director","Cast","Genre"])

#Remove the rows which do not have URL information
df = df[df["URL"] != "NA"]

df["Movie_ID"] = df.index + 1

#Write the file
df.to_csv("Movies_Details.csv",encoding='utf-8',index=False)

### I.II.IV Read the saved file to a data frame.
We will read back the saved file in Step-1, to a data frame. You can download the Movies_Details.csv file from *https://goo.gl/RDhVtf*.

In [4]:
URL = pd.read_csv("Movie_Details.csv")

print("Initial rows of the file Movie_Details.csv")
display(URL.head())

print("The Movie_Details.csv has {} rows and {} columns".format(URL.shape[0],URL.shape[1]))

Initial rows of the file Movie_Details.csv


Unnamed: 0,Movie,URL,Year,Director,Cast,Genre,Movie_ID
0,102 Dalmatians,https://en.wikipedia.org/wiki/102_Dalmatians,2000,Kevin Lima,"Glenn Close, Gérard Depardieu, Alice Evans","Comedy, family",1
1,28 Days,https://en.wikipedia.org/wiki/28_Days_(film),2000,Betty Thomas,"Sandra Bullock, Viggo Mortensen",Drama,2
2,3 Strikes,https://en.wikipedia.org/wiki/3_Strikes_(film),2000,DJ Pooh,"Brian Hooks, N'Bushe Wright",Comedy,3
3,The 6th Day,https://en.wikipedia.org/wiki/The_6th_Day,2000,Roger Spottiswoode,"Arnold Schwarzenegger, Robert Duvall",Science fiction,4
4,Across the Line,https://en.wikipedia.org/wiki/Across_the_Line_...,2000,Martin Spottl,"Brad Johnson, Adrienne Barbeau, Brian Bloom",Thriller,5


The Movie_Details.csv has 4045 rows and 7 columns


There are 4045 movie URLs that have to be scraped from Wikipedia. Our goal is to scrape the image of the movie (if exists), along with the plot and initial introduction texts.

### I.II.V Step-2 Implementation

In Step-2 we will use the file Movie_Details.csv, which has the list of all movie URLs, along with some other details. Each URL of the movie will be crawled to extract the movies text and the images. In Step-2, we will build a set of functions to perform the web crawling. These functions are explained below:

#### I.II.V.I Functions
We will code the following functions to obtain the plot information of the movie, along with the release poster image of the movie.

* **Open_URL(url)** Gets the HTML content, prepares Beautiful Soup object and returns the Beautiful Soup object. The _url_ parameter represents the complete URL of the webpage to be scraped. If an error occurs while opeing the URL, then -1 is returned. If an error occurs while preparing the beautiful soup object, then -2 is returned.

* **Get_Plot(bs)** Takes a beautiful soup object as input. It extracts the introductory text, and the text in the section _plot_
If plot section is NOT present, then it return a negative code. The function returns the extracted text (in the _plot_ section and the initial paragraph). If an error occurs, then a negative code is returned.
    Return code = -1: If an error has occurred while getting the paragraphs from bs object
    Return code = -2: If there is NO _Plot_ section in the document
    Return code = -3: If an error occurred while extracting the first paragrapg in HTML doc

* **Get_All_Text(bs)** This function is called only when **Get_Plot(bs)** returns a -2 (No section with the heading _Plot_ is found). This function will take a beautiful object as input and gives all the text (present in < p > tags) as output. It will return -1, if any error occurs.

* **Save_Text_File(text,text_file_name)** It will save the text string (_text_) as a text file (with the name contained in *text_file_name*). The file is saved into the _data_ directory. Returns 0, if sucessfully saved. Returns -1, when the change directory command fails (while changing to _data_ sub-directory), and -2 when the change to parent directory from _data_ fails.

* **Get_And_Save_Image(bs,image_file_name)** It will get the movies poster (image file) and saves the image in the _images_ directory. It will take beautiful soup object as input and extracts the image URL. The image URL will be downloaded and saved to _images_ directory with the file name as the value present in *image_file_name*. Returns 0, if successfully downloaded and saved. Returns -1 if image is not found, and -2 if the an error occurs when saving the image.

* **Write_Error(url,msg,file)** Will write a error/warning message (contained in the parameter _msg_), while parsing the URL (present in the _url_ parameter). The _file_ parameter contains the name of the error file.  

The source code of these functions is given below:

In [9]:
def Open_URL(url):
    '''
    Gets the HTML content, prepares Beautiful Soup object and 
    returns the  Beautiful Soup object. 
    The url parameter represents the complete URL of the webpage.
    '''
    try:
         html = urlopen(url)
    except:
        return -1
    try:
        #bs = BeautifulSoup(html).encode("ascii")
        bs = BeautifulSoup(html)
        
        return bs
    except:
        return -2       

def Get_Plot(bs):  
    """
    Takes a beautiful soup object as input.
    Extracts the introductory text, and the text in the section plot
    If plot section is NOT present, just get all the available text in the webpage
    Returns the extracted text (if NO error, else returns a negative error code).
    -1: If an error has occurred while getting the paragraphs from bs object
    -2: If an error occurred while extracting the plot text
    -3: If an error occurred while extracting the first paragrapg in HTML doc
    """
    try:
        p = bs.find("p")
        initial_paragraph = p.getText()
    except:
        return -1

    # collect plot in this list
    plot = []
    
    # find the node with id of "Plot"
    try:
        mark = bs.find(id="Plot")
        # walk through the siblings of the parent (H2) node 
        # until we reach the next H2 node
        for elt in mark.parent.nextSiblingGenerator():
            if elt.name == "h2":
                break
            if hasattr(elt, "text"):
                plot.append(elt.text)
    except:
         return -2
    
    try:
        plot="".join(plot)
        text = initial_paragraph + plot
        return text
    except:
        return -3    
    

    
def Get_All_Text(bs):
    try:
        p = bs.find_all("p")
        l = list()
        for i in p:
            l.append(i.getText())
        return " ".join(l)
    except:
        return -1

def Save_Text_File(text,text_file_name):
    try:
        os.chdir("./data")
    except:
        return -1
    
    with open(text_file_name, 'w',encoding='utf-8') as f:
         f.write(text)
    
    try:
        os.chdir("..")
        return 0
    except:
        return -2

def Get_And_Save_Image(bs,image_file_name):    
    try: 
        img=bs.findAll("img",{"class":"thumbborder"})
        img_URL="https:"+img[0]['src']
    except:
        return -1
     
    try:
        os.chdir("./images")
        ignore=urllib.request.urlretrieve(img_URL,image_file_name)
        os.chdir("..")
        return 0
    except:
        return -2

def Write_Error(url,msg,file):
    with open(file,'a') as f:
        f.write("\n"+msg)
        f.write("\n"+url)

## I.III Beginning the major crawling process

The following code block will crawl the movies text from Wikipedia. This code ran for approximately 7 hours. So do NOT execute this code, unless you really want to start the download process. The output of this code is a series of text files and image files. The text files are saved to the _data_ directory and images to the _image_ directory. You can download these files directly from the location: *https://goo.gl/RDhVtf*

In [57]:
tracker = 0
term=1
start = time.time() # Get start time
print("Beginning the files download...")
#k = 1
for movie, url, year,Movie_ID in zip(list(URL["Movie"]),list(URL["URL"]),list(URL["Year"]),list(URL["Movie_ID"])):
    #if k < 30:
    #    k = k+1
    #    continue
    #print("{},{},{}".format(movie,url,year))
    #Open the URL
    
    bs=Open_URL(url)

    if bs == -1:
        Write_Error(url,"Error in opening the URL","error.txt")
        #print("Error in opening the URL: {}".format(url))
        continue
        

    if bs == -2:
        Write_Error(url,"Error in the creation of bs object for the URL","error.txt")
        #print("Error in the creation of bs object for the URL: {}".format(url))
        continue

    time.sleep(3)

    #create a name for the files
    #image_file_name = str(year)+"_"+movie.strip()+".jpg"
    #text_file_name =  str(year)+"_"+movie.strip()+".txt"
    image_file_name = str(Movie_ID)+".jpg"
    text_file_name =  str(Movie_ID)+".txt"

    text=Get_Plot(bs)
    
    if text == -1:
        Write_Error(url,"No paragraphs are found","error.txt")
        #print("No paragraphs are found")
        #print(url)
        continue

    if text == -2:
        Write_Error(url,"Warning: No Plot ID found","error.txt")
        #print("Warning: No Plot ID found")
        #print(url)
        
        text = Get_All_Text(bs)
        if text == -1:
            Write_Error(url,"No paragraphs are found","error.txt")
            #print("No paragraphs are found")
            #print(url)
            continue
        
    if text == -3:
        Write_Error(url,"Error while appending the main plot with the initial paragraph","error.txt")
        #print("Error while appending the main plot with the initial paragraph")
        #print(url)
        continue
    
    status = Save_Text_File(text,text_file_name)
    
    if status == -1:
        Write_Error(url,"Not able to change the directory to ./data","error.txt")
        #print("Not able to change the directory to ./data")
        #print(url)
        continue
    
    if status == -2:
        Write_Error(url,"Not able to change the directory to .. (parent directory) from ./data","error.txt")
        #print("Not able to change the directory to .. (parent directory) from ./data")
        #print(url)
        continue
        
    #Downloading Image files    
    status = Get_And_Save_Image(bs,image_file_name)
    
    if status == -1:
        Write_Error(url,"Not able to find the image","error.txt")
        #print("Not able to find the image")
        #print(url)
        continue

    if status == -2:
        Write_Error(url,"Not able to save the image","error.txt")
        #print("Not able to save the image")
        #print(url)
        continue

    #Check the status of the webbot    
    tracker = tracker + 1
    if (tracker % 100 == 0):
        print("Processed {} URLs".format(tracker))
        end = time.time() # Get end time
        elapsed_time = end - start
        print("Elapsed time to process 100 URLs:{} secs".format(elapsed_time))
        start = time.time() # Get end time
        #break

    #if term == 1:
    #    break

Beginning the files download...
Processed 100 URLs
Elapsed time to process 100 URLs:446.38204622268677 secs
Processed 200 URLs
Elapsed time to process 100 URLs:433.5029966831207 secs
Processed 300 URLs
Elapsed time to process 100 URLs:461.3205850124359 secs
Processed 400 URLs
Elapsed time to process 100 URLs:458.53809428215027 secs
Processed 500 URLs
Elapsed time to process 100 URLs:476.4255542755127 secs
Processed 600 URLs
Elapsed time to process 100 URLs:473.6112816333771 secs
Processed 700 URLs
Elapsed time to process 100 URLs:458.1378722190857 secs
Processed 800 URLs
Elapsed time to process 100 URLs:434.91878509521484 secs
Processed 900 URLs
Elapsed time to process 100 URLs:569.373973608017 secs
Processed 1000 URLs
Elapsed time to process 100 URLs:533.2878279685974 secs
Processed 1100 URLs
Elapsed time to process 100 URLs:493.94636368751526 secs
Processed 1200 URLs
Elapsed time to process 100 URLs:560.7728536128998 secs
Processed 1300 URLs
Elapsed time to process 100 URLs:519.74306

Out of 4045 movies we obtained 4037 movies text successfully. But we were able to obtain only 3749 images, since images were not available to some of the movies. The errors and warnings are logged into a file named _error.txt_. You can find this file at *https://goo.gl/RDhVtf*

## II. Phase-2: Data Cleansing
Now that we downloaded the data, let us clean the data to make the data ready for applying text analytics algorithms. To perform data cleansing, we created the following 3 functions:

* **Read_File(p)** It will open the input file, reads the text in the file, converts all the test to lower case, removes the punctuation (if any), and returns a list of tokens.

* **Remove_Stop_Words(tokens)** It will remove all stop words from the list of input tokens. Returns a refined list of tokens, with no stop words


* **Clean_Text(tokens)** It will clean all the text by removing any square brackets "[...]", braces "(" and ")", commas, colons, apostrophes etc. Returns a list of tokens that are just alphanumeric. 

The source code of these functions is given below:

In [3]:
def Read_File(p):
  with open(p, 'r',encoding='utf-8') as f:
    text = f.read()
    #Convert all the text to lower case
    #
    lowers = text.lower()
    #remove the punctuation using the character deletion step of translate
    no_punctuation = lowers.translate(string.punctuation)
    tokens = nltk.word_tokenize(no_punctuation)
    return tokens

def Remove_Stop_Words(tokens):
    filtered = [w for w in tokens if not w in stopwords.words('english')]
    return filtered



def Clean_Text(tokens):
    text = " ".join(tokens)
    #Remove punctuation marks, text in [], (, ), :
    filtered1 = re.sub('\.|\`|\'|\[.*\]|\(|\)|,|:', " ",text)
    
    #Remove any single characters
    filtered1 = re.sub('(^| ).( |$)', " ",filtered1)
    #Remove any contiguous spaces    
    filtered1 = re.sub(' +'," ",filtered1)
    
    #Include only alpha numeric characters
    filtered1=" ".join([i for i in filtered1.split() if re.search('[0-9 a-z]*',i)])
    return filtered1

The following code block will use the above functions to clean the text. Do NOT run this code, unless you want to test it, since it will run for some time. To save time, we saved the results as *processed_data.csv*. This file can be downloaded from *https://goo.gl/RDhVtf*.

In [6]:
import re
import string
import nltk
from collections import Counter
from nltk.corpus import stopwords

#List the files in the directory./data
file_names = os.listdir("./data")

#Process each file
file_names = [i for i in file_names if re.search('[1-9]*\.txt',i)]
y = list()
x = list()
k = 0
start = time.time()
for i in file_names:
    y.append(int(i.split(".")[0]))
    tokens = Read_File("./data/"+i)
    tokens = Remove_Stop_Words(tokens)
    cleaned_text = Clean_Text(tokens)
    x.append(cleaned_text)
    k = k+1
    if(k%100 == 0):
        temp_time = time.time() - start
        print("Processed {} files. Elapsed time:{} seconds".format(k, temp_time))
        
temp_time = time.time() - start
print("Processed {} files. Elapsed time:{} seconds".format(k, temp_time))


print("Now saving the result as processed_data.csv file...")
df=pd.DataFrame(list(zip(y,x)),columns = ["Movie_ID","Plot"])
df.to_csv("processed_data.csv",encoding='utf-8',index=False)        

Processed 100 files. Elapsed time:34.47499084472656 seconds
Processed 200 files. Elapsed time:68.88607907295227 seconds
Processed 300 files. Elapsed time:102.14953780174255 seconds
Processed 400 files. Elapsed time:134.21351504325867 seconds
Processed 500 files. Elapsed time:160.64091873168945 seconds
Processed 600 files. Elapsed time:190.43764638900757 seconds
Processed 700 files. Elapsed time:219.01563954353333 seconds
Processed 800 files. Elapsed time:251.53410053253174 seconds
Processed 900 files. Elapsed time:282.54603719711304 seconds
Processed 1000 files. Elapsed time:316.63187885284424 seconds
Processed 1100 files. Elapsed time:351.59121203422546 seconds
Processed 1200 files. Elapsed time:387.9956316947937 seconds
Processed 1300 files. Elapsed time:429.9995768070221 seconds
Processed 1400 files. Elapsed time:471.7074546813965 seconds
Processed 1500 files. Elapsed time:509.83835220336914 seconds
Processed 1600 files. Elapsed time:549.4117841720581 seconds
Processed 1700 files. E

We can see that the data cleaninsing process has ran for approximately 23 minutes. However the results of this process are saved as a CSV file processed_data.csv. This file is located at *https://goo.gl/RDhVtf*

Reading the processed_data.csv file into a data frame.

In [5]:
df = pd.read_csv("processed_data.csv")
print("Initial records of processed_data.csv file")
df.head()

Initial records of processed_data.csv file


Unnamed: 0,Movie_ID,Plot
0,1,102 dalmatians 2000 american family comedy fil...
1,10,american psycho 2000 american black comedy hor...
2,100,legacy 2000 american documentary film directed...
3,1000,lemony snicket series unfortunate events 2004 ...
4,1001,life death peter sellers 2004 british-american...


The above display shows that our final data frame, which will be used to build movies recommender is composed of two columns. The first column *Movie_ID* will uniquely identify the movie, and the *Plot* will identify the cleansed text of the movie plot. The movie's release poster will be present in the "./images" directory (you can download it from *https://goo.gl/RDhVtf*. But after download save it in the ./data directory or else this Jupyter notebook will not find the images)

## III. Phase-3: Building the recommender

Our recommender system is based on text analytics of the movies plot. We will use TF-IDF (Term Frequency - Inverse Document Frequency) score for each unique word in each document. All the unique words in the combined text of all the documents will form the *features*. 

Once the TFIDF is computed, we will obtain the cosine similarity between each pair of movies.

### III.I. TF-IDF Algorithm:

TFIDF (Term Frequency - Inverse Document Frequency) is one of the most popular text processing algorithms that helps us to accurately assign importance scores to each word in a document. 

At a very high level, the algorithm follows the below logic:

Let $D = {d_1, d_2 ... d_n}$ be a set of documents.

For each document $d$ in $D$ perform the following:

a. Get the frequencies of all the words in $d$. Call this as TF (Term Frequency) vector for document $d$

b. Get the list of all unique words in all the documents, and for each unique word, get the number of documents containing the word. Let DF (Document Frequency) be the vector containing these counts.

For each word $w$ in DF, get the following:

$$IDF_w=log(n/(1+\mbox{number of documents containing the word }w))$$ 

The log can have any valid base. 
IDF stands for Inverse Document Frequency. "n" represents the total number of documents

For each document $d$, multiply the elements of $TF_d$ with the corresponding elements of IDF, to obtain TFIDF vector for document $d$.


In sklearn package, we have TfidfVectorizer class, which implements the TF-IDF algorithm. Using this class, we are able to obtain the TF-IDF scores of all the unique words in all the movies plot.

### III.II Get the TFIDF scores

Using the data frame (*processed_data*), The below code block will get the TFIDF scores for all the words in each of the document.

In [6]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(df["Plot"])

print("The TF-IDF matrix has {} rows and {} columns".format(tfidf_matrix.shape[0],tfidf_matrix.shape[1]))

The TF-IDF matrix has 4037 rows and 54075 columns


The TF-IDF matrix for the movies text has 4037 rows (representing the number of movies) and 54075 columns (representing the unique words in all the movies text). Internally python represents this matrix as a sparse matrix, since most of the elements of this matrix have a value of 0.

### III.III. Get the cosine similarity measure between each pair of movie
To obtain the cosine similarity between each pair of movies, we will use cosine_similarity class of sklearn package. The below code block will compute the cosine similarity between each pair of movies.

In [7]:
from sklearn.metrics.pairwise import cosine_similarity
cos_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)
cos_sim_df = pd.DataFrame(cos_sim,columns=df["Movie_ID"].tolist(),index=df["Movie_ID"].tolist())

Let us display some rows and columns of the cosine similarity measure.

In [8]:
display(cos_sim_df)

Unnamed: 0,1,10,100,1000,1001,1002,1003,1004,1005,1006,...,990,991,992,993,994,995,996,997,998,999
1,1.000000,0.013110,0.025153,0.016308,0.011577,0.007299,0.010271,0.010509,0.006320,0.009129,...,0.011846,0.008216,0.011399,0.005237,0.007315,0.006427,0.008360,0.006734,0.002983,0.006313
10,0.013110,1.000000,0.009444,0.012178,0.004680,0.005145,0.009484,0.008657,0.011202,0.007491,...,0.013249,0.017382,0.007149,0.007385,0.011362,0.004854,0.019540,0.013429,0.000961,0.004964
100,0.025153,0.009444,1.000000,0.016322,0.059142,0.027863,0.014048,0.036151,0.007757,0.006688,...,0.042228,0.001339,0.009732,0.007993,0.013280,0.009319,0.021211,0.006464,0.023891,0.016957
1000,0.016308,0.012178,0.016322,1.000000,0.004812,0.024725,0.011021,0.018253,0.011957,0.012848,...,0.034581,0.014664,0.008955,0.009943,0.010786,0.008122,0.021269,0.020851,0.005010,0.009184
1001,0.011577,0.004680,0.059142,0.004812,1.000000,0.006460,0.007578,0.032768,0.004027,0.009007,...,0.009151,0.002705,0.014901,0.004565,0.016529,0.044348,0.009097,0.007988,0.014264,0.025655
1002,0.007299,0.005145,0.027863,0.024725,0.006460,1.000000,0.006180,0.015325,0.006960,0.030447,...,0.015545,0.013500,0.031731,0.008173,0.010223,0.007692,0.015265,0.008178,0.010762,0.013937
1003,0.010271,0.009484,0.014048,0.011021,0.007578,0.006180,1.000000,0.009379,0.007544,0.010232,...,0.027237,0.018090,0.007262,0.006958,0.008695,0.009016,0.010088,0.012814,0.011469,0.015597
1004,0.010509,0.008657,0.036151,0.018253,0.032768,0.015325,0.009379,1.000000,0.013861,0.013198,...,0.013786,0.027088,0.012796,0.010316,0.018577,0.009816,0.023062,0.009664,0.007165,0.012340
1005,0.006320,0.011202,0.007757,0.011957,0.004027,0.006960,0.007544,0.013861,1.000000,0.018479,...,0.011095,0.013903,0.015171,0.006398,0.006556,0.006096,0.010419,0.014563,0.004381,0.005944
1006,0.009129,0.007491,0.006688,0.012848,0.009007,0.030447,0.010232,0.013198,0.018479,1.000000,...,0.006813,0.011289,0.014773,0.010392,0.007992,0.007115,0.015426,0.007499,0.003727,0.018923


We can see that the cosine similarity matrix has 4037 rows and 4037 columns, and the elements represent the cosine similarity between each pair of movies. The diagonal elements of this matrix will be 1 since similarity score between the same move is always 1.

## IV. Phase-4: Getting recommendations

Let us build the required functions to make movie recommendations, given that the user has liked a movie.

### IV.I Functions
* **Get_Recommendations(Movie_ID,cos_sim_df)** This function will accept *Movie_ID*, and *cos_sim_df* as inputs. The *Movie_ID* is a number unique to a movie, and *cos_sim_df* is a pandas data frame containing the cosine similarity scores between all pairs of movies. This function will get top 6 Movies (based on the cosine similarity score between the input Movie_ID and other movies. Higher cosine similarity measure, better the match). The result is returned in the form of a dictionary.


* **Get_Available_Images()** This function will not accept any input. It returns the list of all movie IDs, for which we have an available image.


* **Display_Recommendations(Recommended_Movies_Dict,Movie_Map,Source_Movie_ID)** This function will accept 3 inputs. The *Recommended_Movies_Dict* is the dictionary of recommended movies (output of *Get_Recommendations(Movie_ID,cos_sim_df)* function). The *Movie_Map* is a data frame with the columns: *"Movie" (Movie name),"Movie_ID" (Unique ID),"URL" (Movie URL)*. This data frame is obtained by joining the *Movie_Details.csv* and *processed_data.csv* files data (using movie ID). This joining is needed, since it will help to map the movies which are successfully downloaded (4037 movies) and all the available movie names (4045 movies). The *Source_Movie_ID* is the movie ID, which is assumed to be liked by the user. The function does NOT return any value. It just renders the recommended movies along with the cosine similarity scores. The user can click on the movie to read visit the wikipedia site or hover on the image to get the text, which was used for building cosine similarity matrix.

The source code of these functions is given below:

In [9]:
#Get the mapping between available Movie plots and movie IDs
Movie_Map=pd.merge(URL[["Movie","Movie_ID","URL"]],df,how='inner',on=["Movie_ID"])[["Movie","Movie_ID","Plot","URL"]]


def Get_Recommendations(Movie_ID,cos_sim_df):
    #Get the indices (movie IDs) with highest cosine sim scores
    recommended_idx=np.argpartition(np.array(cos_sim_df[Movie_ID].tolist()), -6)[-6:]
    
    #Convert to a list
    Recommended_Movie_IDs = cos_sim_df.columns[recommended_idx].tolist()
    
    #Prepare a dict and return the recommended movies list
    return dict(zip(Recommended_Movie_IDs,np.array(cos_sim_df[Movie_ID].tolist())[recommended_idx]))


def Get_Available_Images():
    #Get all the available image names (movie IDs which have images)    
    image_files = os.listdir("./images")
    
    #Make sure that we are dealing with movie data files only
    image_files = [i for i in image_files if re.search('[1-9]*\.jpg',i)]
    
    #Define a list to collect the movie IDs
    y = list()
    for i in image_files:
        y.append(int(i.split(".")[0]))
    #Return the list    
    return y


def Display_Recommendations(Recommended_Movies_Dict,Movie_Map,Source_Movie_ID):
    #The following statement will make sure that we sort the movies in the descending order of similarity
    Recommended_Movies = pd.DataFrame(sorted(Recommended_Movies_Dict.items(), key=lambda x: -x[1]))[0].tolist()
    
    #Delete the liked movie from the list (since cosine sim with itself is 1)
    Recommended_Movies = Recommended_Movies[1:]
    
    Recommended_Movies_Plot = dict()
    Recommended_Movies_URL = dict()
    
    for i in Recommended_Movies:
        Recommended_Movies_Plot[i] = Movie_Map[Movie_Map["Movie_ID"] == i]["Plot"].tolist()[0]
        Recommended_Movies_URL[i] = Movie_Map[Movie_Map["Movie_ID"] == i]["URL"].tolist()[0]

    #Get the available movies with images    
    Available_Images_List = Get_Available_Images()
    
    Source_Movie_Name = Movie_Map[Movie_Map["Movie_ID"] == Source_Movie_ID]["Movie"].tolist()[0]
    Source_Plot = Movie_Map[Movie_Map["Movie_ID"] == Source_Movie_ID]["Plot"].tolist()[0]
    Source_URL = Movie_Map[Movie_Map["Movie_ID"] == Source_Movie_ID]["URL"].tolist()[0]
    print("Assuming that the user liked {}:".format(Source_Movie_Name))
    
    #Prepare HTML for display:    
    if Source_Movie_ID in Available_Images_List:
        display(HTML("<table><tr><td><a href='"+str(Source_URL)+\
                     "' target='_blank'><img src='./images/"+str(Source_Movie_ID)+".jpg' title='"+\
                     str(Source_Plot)+"'></a></td></tr></table>" \
            ))        
        
    display_html = ""
    display_values = ""
    for i in Recommended_Movies:
        if i in Available_Images_List:
            display_html = display_html + "<td><a href='"+str(Recommended_Movies_URL[i])+\
            "' target='_blank'><img src='./images/"+str(i)+".jpg' title='"+\
            str(Recommended_Movies_Plot[i])+"'></a></td>"
            display_values = display_values + "<td> Similarity:"+\
            str(Recommended_Movies_Dict[i])+"</td>"
    print("The following movies are recommended:")        
    display(HTML("<table><tr>"+display_html+"</tr><tr>"+display_values+"</tr></table>" \
            ))        

### IV.II Demonstration of the system
We will get recommended movies given that the user has liked some movies. The cosine similarity measure is also displayed, along with the movie recommendations. The recommended movies are sorted in the descending order of similarity score. Also the top 5 movies are displayed. At some places you may find less than 5 movies, since we avoided the display of the movie, if an associated image is not available (as the web robot did not download the picture due to unavailability or some other reason). Also if you hover over the image, you will see the text (cleansed) used for building the recommender, and if you click the image, you will redirected to the Wikipedia URL:

In [11]:
Recommended_Movies = Get_Recommendations(3974,cos_sim_df)
Recommended_Movies
Display_Recommendations(Recommended_Movies,Movie_Map,3974)

Assuming that the user liked X-Men: Apocalypse:


The following movies are recommended:


0,1,2,3,4
,,,,
Similarity:0.268575850171,Similarity:0.238367317257,Similarity:0.229481952653,Similarity:0.221129217079,Similarity:0.190950502687


In [33]:
Recommended_Movies = Get_Recommendations(1,cos_sim_df)
Display_Recommendations(Recommended_Movies,Movie_Map,1)

Assuming that the user liked  102 Dalmatians:


The following movies are recommended:


0,1,2,3,4
,,,,
Similarity:0.306608854172,Similarity:0.185453340218,Similarity:0.175589654586,Similarity:0.14104740047,Similarity:0.133580264125


In [34]:
Recommended_Movies = Get_Recommendations(3934,cos_sim_df)
Display_Recommendations(Recommended_Movies,Movie_Map,3934)

Assuming that the user liked London Has Fallen:


The following movies are recommended:


0,1,2,3,4
,,,,
Similarity:0.543469898779,Similarity:0.301055015997,Similarity:0.175642267065,Similarity:0.132846901922,Similarity:0.107062246397


In [35]:
Recommended_Movies = Get_Recommendations(2635,cos_sim_df)
Display_Recommendations(Recommended_Movies,Movie_Map,2635)

Assuming that the user liked Paranormal Activity 2:


The following movies are recommended:


0,1,2,3,4
,,,,
Similarity:0.526099643661,Similarity:0.242463753141,Similarity:0.234107937331,Similarity:0.189501563318,Similarity:0.182762222917


In [36]:
Recommended_Movies = Get_Recommendations(2810,cos_sim_df)
Display_Recommendations(Recommended_Movies,Movie_Map,2810)

Assuming that the user liked Kung Fu Panda 2:


The following movies are recommended:


0,1,2,3,4
,,,,
Similarity:0.477916709463,Similarity:0.472015971221,Similarity:0.234033843151,Similarity:0.221169773384,Similarity:0.0752134805139


In [37]:
Recommended_Movies = Get_Recommendations(2656,cos_sim_df)
Display_Recommendations(Recommended_Movies,Movie_Map,2656)

Assuming that the user liked Saw VII:


The following movies are recommended:


0,1,2,3,4
,,,,
Similarity:0.299457229304,Similarity:0.263637194848,Similarity:0.214968641642,Similarity:0.198831678817,Similarity:0.180022259236


In [38]:
Recommended_Movies = Get_Recommendations(3176,cos_sim_df)
Display_Recommendations(Recommended_Movies,Movie_Map,3176)

Assuming that the user liked Titanic 3D:


The following movies are recommended:


0,1,2,3,4
,,,,
Similarity:0.292896298323,Similarity:0.171477544925,Similarity:0.15832825584,Similarity:0.156473330037,Similarity:0.149228285278


In [39]:
Recommended_Movies = Get_Recommendations(3893,cos_sim_df)
Display_Recommendations(Recommended_Movies,Movie_Map,3893)

Assuming that the user liked The Martian:


The following movies are recommended:


0,1,2,3,4
,,,,
Similarity:0.175169934667,Similarity:0.152099978063,Similarity:0.0842259518701,Similarity:0.0823599248437,Similarity:0.081500883778


In [40]:
Recommended_Movies = Get_Recommendations(4015,cos_sim_df)
Display_Recommendations(Recommended_Movies,Movie_Map,4015)

Assuming that the user liked Sully:


The following movies are recommended:


0,1,2,3,4
,,,,
Similarity:0.104853207218,Similarity:0.0937654970696,Similarity:0.091478167585,Similarity:0.0818109615473,Similarity:0.0676232709732


In [41]:
Recommended_Movies = Get_Recommendations(3077,cos_sim_df)
Display_Recommendations(Recommended_Movies,Movie_Map,3077)

Assuming that the user liked The Hunger Games:


The following movies are recommended:


0,1,2,3,4
,,,,
Similarity:0.756288369602,Similarity:0.626990711657,Similarity:0.0599779801182,Similarity:0.0502080355009,Similarity:0.0455761376112


In [42]:
Recommended_Movies = Get_Recommendations(616,cos_sim_df)
Display_Recommendations(Recommended_Movies,Movie_Map,616)

Assuming that the user liked Spider-Man:


The following movies are recommended:


0,1,2,3,4
,,,,
Similarity:0.532694697879,Similarity:0.515483578747,Similarity:0.45961557517,Similarity:0.412063982744,Similarity:0.351215409571


In [11]:
Recommended_Movies = Get_Recommendations(2708,cos_sim_df)
Recommended_Movies
Display_Recommendations(Recommended_Movies,Movie_Map,2708)

Assuming that the user liked The Adventures of Tintin: The Secret of the Unicorn:


The following movies are recommended:


0,1,2,3,4
,,,,
Similarity:0.0602222780375,Similarity:0.0600969469384,Similarity:0.0597959791667,Similarity:0.0592945237438,Similarity:0.0539888052148


In [15]:
Recommended_Movies = Get_Recommendations(2354,cos_sim_df)
Recommended_Movies
Display_Recommendations(Recommended_Movies,Movie_Map,2354)

Assuming that the user liked The Hangover:


The following movies are recommended:


0,1,2,3,4
,,,,
Similarity:0.573032072419,Similarity:0.553884363519,Similarity:0.336836724917,Similarity:0.306045314506,Similarity:0.286943231872


In [17]:
Recommended_Movies = Get_Recommendations(3166,cos_sim_df)
Recommended_Movies
Display_Recommendations(Recommended_Movies,Movie_Map,3166)

Assuming that the user liked Taken 2:


The following movies are recommended:


0,1,2,3,4
,,,,
Similarity:0.563379839379,Similarity:0.518757927248,Similarity:0.151438366191,Similarity:0.132547902327,Similarity:0.127330282131


In [18]:
Recommended_Movies = Get_Recommendations(3604,cos_sim_df)
Recommended_Movies
Display_Recommendations(Recommended_Movies,Movie_Map,3604)

Assuming that the user liked 300: Rise of an Empire:


The following movies are recommended:


0,1,2,3
,,,
Similarity:0.416349579034,Similarity:0.297551118695,Similarity:0.125321250007,Similarity:0.0776579866645


## Conclusion

In this project we were able to implement the following:

* Built a web crawler to scrape American movies text and release posters from Wikipedia. The web crawler ran for approximately 7 hours (with a mandatory delay of 3 seconds between each hit) to download the text and images of the movies released after since 2000.

* Using regular expressions, we cleaned the text, and using sklearn packages, we calculated TF-IDF scores.

* Between each pair of movies, we obtained the cosine similarity measure.

* For a given movie _m_, we identified the top 5 relevant movies based on the cosine similarity measure.

* The approach looks promising, since in most of the test cases, the system has found at least 3 relevant movies, out of 5 movies recommended. If this system is combined with other external resources, such as movie ratings (matrix factorization methods like ALS - Alternating Least Squares can be used to predict the ratings a user could give to the movies), then the recommendations quality can improve further. 
