# Recommender systems - Project 2

## Abstract
In this project we evaluate two types of recommender systems, content-based and collaborative filtering, and compare their performance on the MovieLens data. The content-based recommender uses the people credited on each movie (leading actors, directors and writers) as item features, while the collaborative filtering recommender uses the similarity between users to identify movies that a user might be interested in. The two methods are compared using Mean Absolute Error (MAE). The MovieLens data does not include the names of the actors in each movie, so we use the IMDB URL provided in the data set to scrape them. To keep the project simple, we consider 10 randomly selected users from the MovieLens data set and include only the movies rated by these 10 users in the analysis.
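
Mean Absolute Error is the average absolute difference between the predicted and the observed ratings, so lower values mean better predictions. A minimal sketch of the computation follows; the function name and the ratings are made up for illustration and are not taken from the MovieLens data.

```python
def mean_absolute_error(actual, predicted):
    """Average absolute difference between observed and predicted ratings."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    total = sum(abs(a - p) for a, p in zip(actual, predicted))
    return total / float(len(actual))

# Hypothetical ratings on the 1-5 MovieLens scale (illustrative values only)
actual_ratings = [4, 5, 3, 2]
predicted_ratings = [3.5, 4.0, 3.0, 4.0]

mae = mean_absolute_error(actual_ratings, predicted_ratings)
print(mae)
```

The absolute errors here are 0.5, 1.0, 0.0 and 2.0, so the result is their mean.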

## Data pre-processing
The MovieLens data has several files, but we will use the following files for this project:
1. u.user (a pipe-delimited data set containing the user details; users are uniquely identified by the user_id key)
2. u.item (a pipe-delimited data set containing the movie details; movies are uniquely identified by item_id)
3. u.data (a tab-delimited data set containing the user_id, the item_id and the rating given to the item by the user)

We will choose 10 users randomly from the u.user data set and get all the movies rated by these users (using u.data). Using the u.item data set, we will get the IMDB URL of each movie and scrape that page to extract the list of actors. This list will serve as the item properties for the content-based recommender system.

### Importing all the required packages

In [89]:
import pandas as pd
import numpy as np
from IPython.display import display # Allows the use of display() for DataFrames

from lxml import html
import requests

### Reading the files 

We will read the files u.data, u.user and u.item, and keep only the columns required for this project. The comments in the code indicate which columns are kept and which are dropped.

In [94]:
#Reading the movie information (u.item is pipe-delimited and contains non-ASCII titles)
movie_df=pd.read_table("u.item",sep="|",header=None,encoding="latin-1")
#Selecting only the needed columns: id, title and IMDB URL
movie_df=movie_df[[0,1,4]]
movie_df.columns = ["item_id","item_name","item_url"]
#Displaying sample data
print("Movie data frame sample rows:")
display(movie_df.head())

#Reading the user information (u.user is pipe-delimited)
user_df=pd.read_table("u.user",sep="|",header=None)
user_df.columns=["user_id","age","gender","profession","zip"]

print("User data frame sample rows:")
display(user_df.head())

#Reading the user, movie ratings information (u.data is tab-delimited)
ratings_df=pd.read_csv("u.data",delimiter="\t",header=None)
ratings_df.columns = ["user_id","item_id","rating","timestamp"]
#Dropping the timestamp info, since we do not need that information in this project
del ratings_df["timestamp"]

print("Displaying the ratings details (mapping between user id and item id):")
display(ratings_df.head())

Movie data frame sample rows:


Unnamed: 0,item_id,item_name,item_url
0,1,Toy Story (1995),http://us.imdb.com/M/title-exact?Toy%20Story%2...
1,2,GoldenEye (1995),http://us.imdb.com/M/title-exact?GoldenEye%20(...
2,3,Four Rooms (1995),http://us.imdb.com/M/title-exact?Four%20Rooms%...
3,4,Get Shorty (1995),http://us.imdb.com/M/title-exact?Get%20Shorty%...
4,5,Copycat (1995),http://us.imdb.com/M/title-exact?Copycat%20(1995)


User data frame sample rows:


Unnamed: 0,user_id,age,gender,profession,zip
0,1,24,M,technician,85711
1,2,53,F,other,94043
2,3,23,M,writer,32067
3,4,24,M,technician,43537
4,5,33,F,other,15213


Displaying the ratings details (mapping between user id and item id):


Unnamed: 0,user_id,item_id,rating
0,196,242,3
1,186,302,3
2,22,377,1
3,244,51,2
4,166,346,1


### Selecting the users randomly
We will select 10 random users in the following code block. Once the users are selected, we will get the list of all movie names rated by these 10 random users. A data frame is prepared with all the required columns, so that we can use the URL to scrape the IMDB webpage to obtain the actors list (who acted in the movies rated by the 10 selected users).

In [79]:
#Selecting 10 row positions at random
#(randint draws with replacement from 1..max-1, so duplicates are possible)
np.random.seed(1234)
uids = np.random.randint(1,user_df["user_id"].max(),10)

#iloc selects by position, so position p returns the user with user_id p+1
user_df=user_df.iloc[uids]
print("Here are the randomly selected users:\n")
display(user_df)

#Combined data frame: users joined to their ratings (on user_id), then to the movies (on item_id)
df = pd.merge(pd.merge(user_df,ratings_df),movie_df)
print("All columns combined:")
display(df.head())

print("The combined data frame has {} rows and {} columns".format(df.shape[0],df.shape[1]))

Here are the randomly selected users:



Unnamed: 0,user_id,age,gender,profession,zip
816,817,19,M,student,60152
724,725,21,M,student,91711
295,296,43,F,administrator,16803
54,55,37,M,programmer,1331
205,206,14,F,student,53115
373,374,36,M,executive,78746
665,666,44,M,administrator,61820
656,657,26,F,none,78704
690,691,34,M,educator,60089
280,281,15,F,student,6059


All columns combined:


Unnamed: 0,user_id,age,gender,profession,zip,item_id,rating,item_name,item_url
0,817,19,M,student,60152,748,4,"Saint, The (1997)",http://us.imdb.com/M/title-exact?Saint%2C%20Th...
1,725,21,M,student,91711,748,4,"Saint, The (1997)",http://us.imdb.com/M/title-exact?Saint%2C%20Th...
2,206,14,F,student,53115,748,4,"Saint, The (1997)",http://us.imdb.com/M/title-exact?Saint%2C%20Th...
3,691,34,M,educator,60089,748,4,"Saint, The (1997)",http://us.imdb.com/M/title-exact?Saint%2C%20Th...
4,281,15,F,student,6059,748,5,"Saint, The (1997)",http://us.imdb.com/M/title-exact?Saint%2C%20Th...


The combined data frame has 899 rows and 9 columns
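
The selection step draws row positions with `np.random.randint`, which samples with replacement and never returns position 0, so duplicate users are possible and the first user can never be picked. As an alternative sketch (not the method used for the results in this notebook), pandas' `DataFrame.sample` draws distinct rows directly. The frame `toy_users` below is a hypothetical stand-in for `user_df`.

```python
import pandas as pd

# Hypothetical stand-in for user_df; the real frame is read from u.user
toy_users = pd.DataFrame({"user_id": range(1, 21), "age": range(20, 40)})

# Draw 10 distinct rows; random_state makes the draw repeatable
sampled = toy_users.sample(n=10, random_state=1234)

print(len(sampled))
print(sampled["user_id"].is_unique)
```

Because sampling is without replacement by default, every selected user appears exactly once.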


Because each movie appears once per rating in the combined data frame displayed above, we will use movie_df (which has one row per movie, identified by item_id) to get the list of people from the IMDB website. To keep the list short, we collect only the leading actors of each movie together with its director and writers.

### Scraping the IMDB web pages to extract actors, writers and directors

In [306]:
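#Select the movies rated by the chosen users and preview them
selected_movie_id=df["item_id"].unique()
#Caution: iloc selects by row position, so item_id k returns the row at position k,
#which holds item_id k+1 (see the output below); movie_df["item_id"].isin(...) selects by id
selected_movies=movie_df.iloc[selected_movie_id]
display(selected_movies.head())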

Unnamed: 0,item_id,item_name,item_url
748,749,"MatchMaker, The (1997)",http://us.imdb.com/M/title-exact?Matchmaker%2C...
124,125,Phenomenon (1996),http://us.imdb.com/M/title-exact?Phenomenon%20...
118,119,Maya Lin: A Strong Clear Vision (1994),http://us.imdb.com/M/title-exact?Maya%20Lin:%2...
1,2,GoldenEye (1995),http://us.imdb.com/M/title-exact?GoldenEye%20(...
358,359,"Assignment, The (1997)",http://us.imdb.com/M/title-exact?Assignment%2C...
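
In the output above the index is one less than item_id on every row, because the row index of movie_df starts at 0 while item_id starts at 1. The sketch below contrasts positional selection with `iloc` against selection by id with `isin` on a small frame; `toy_movies` and `wanted_ids` are hypothetical names, and the titles are the first four sample rows of movie_df.

```python
import pandas as pd

# Toy stand-in for movie_df: item_id starts at 1, the row index starts at 0
toy_movies = pd.DataFrame({
    "item_id": [1, 2, 3, 4],
    "item_name": ["Toy Story (1995)", "GoldenEye (1995)",
                  "Four Rooms (1995)", "Get Shorty (1995)"],
})
wanted_ids = [1, 3]

# Positional selection: returns the rows at positions 1 and 3
by_position = toy_movies.iloc[wanted_ids]

# Selection by value: returns the rows whose item_id is 1 or 3
by_item_id = toy_movies[toy_movies["item_id"].isin(wanted_ids)]

print(list(by_position["item_name"]))
print(list(by_item_id["item_name"]))
```

Only the `isin` version returns the movies whose ids were requested; the `iloc` version returns their neighbours.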


Let us get the actors list for a movie using the lxml package. Once we validate that we are successfully getting the actors list, we can implement the same logic in a loop to pull the actors list for all the movies selected (movies rated by the users selected).

In [288]:
page = requests.get('http://us.imdb.com/M/title-exact?Toy%20Story%20(1995)')
tree = html.fromstring(page.content)
actors = tree.xpath('//*[contains(concat( " ", @class, " " ), concat( " ", "credit_summary_item", " " ))]//*[contains(concat( " ", @class, " " ), concat( " ", "itemprop", " " ))]/text()')
print actors

['John Lasseter', 'John Lasseter', 'Pete Docter', 'Tom Hanks', 'Tim Allen', 'Don Rickles']


We can see that the credits for Toy Story were pulled successfully into a list. The list starts with the director and the writers and ends with the leading cast (Toy Story is an animated movie, so these actors voiced the characters).

The number of movies that need to be scraped is given below:

In [290]:
print "We need to scrape {} movies".format(selected_movies.shape[0])

We need to scrape 546 movies


Let us now automate the IMDB web scraping to pull the list for every selected movie. The following code block runs for well over an hour, since it sleeps for 3 seconds between successive requests to avoid hitting the IMDB server constantly.

In [276]:
import time
import pickle

#XPath matching the names in the credit summary (director, writers and stars)
CREDITS_XPATH = '//*[contains(concat( " ", @class, " " ), concat( " ", "credit_summary_item", " " ))]//*[contains(concat( " ", @class, " " ), concat( " ", "itemprop", " " ))]/text()'

def scrape_credits(url):
    #Return the non-blank names found in the credit summary of the page at url
    page = requests.get(url)
    tree = html.fromstring(page.content)
    return [name for name in tree.xpath(CREDITS_XPATH) if name.strip()]

start = time.time()
actors_list = list()
j = 0
for ind in selected_movies.index:
    movie_name = selected_movies.loc[ind]["item_name"]
    movie_url = selected_movies.loc[ind]["item_url"]

    actors = scrape_credits(movie_url)
    #If nothing was found and the title has the form "Name, Article (year)",
    #retry with the article moved to the front, e.g. "Article Name (year)"
    if len(actors) == 0 and len(movie_name.split(",")) == 2:
        x = movie_name.split("(")
        year = "(" + x[-1].strip()
        name = x[0].split(",")
        if len(name) > 1:
            name = name[-1] + ",".join(name[0:-1])
        else:
            name = name[0]
        temp_url = (name.strip() + "%20" + year).replace(" ","%20")
        actors = scrape_credits("http://us.imdb.com/M/title-exact?" + temp_url)

    actors_list.append(actors)
    #Pause between requests to avoid constant hits to the IMDB server
    time.sleep(3)
    j = j+1
    if j % 50 == 0:
        print("processed {} URLs".format(j))

end = time.time()

print("Total processing time {} secs".format(end - start))

#Saving the scraped lists so that the scraping does not have to be repeated
with open('actors.pkl','wb') as f:
    pickle.dump(actors_list,f,-1)


processed 50 URLs
processed 100 URLs
processed 150 URLs
processed 200 URLs
processed 250 URLs
processed 300 URLs
processed 350 URLs
processed 400 URLs
processed 450 URLs
processed 500 URLs
Total processing time 6191.62400007 secs
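
The retry step in the loop handles titles that MovieLens stores with a trailing article, such as "Saint, The (1997)". The sketch below isolates that idea in two small helpers using only the standard library. The names `move_trailing_article` and `title_to_url` are hypothetical, and whether IMDB resolves the resulting URL is an assumption that is not checked here.

```python
try:
    from urllib.parse import quote  # Python 3
except ImportError:
    from urllib import quote        # Python 2

def move_trailing_article(title):
    """Turn 'Saint, The (1997)' into 'The Saint (1997)'; leave other titles unchanged."""
    name, sep, year = title.rpartition(" (")
    if not sep or "," not in name:
        return title
    head, _, article = name.rpartition(",")
    return "{} {} ({}".format(article.strip(), head.strip(), year)

def title_to_url(title):
    """Build a title-exact lookup URL in the style used by the u.item file."""
    return "http://us.imdb.com/M/title-exact?" + quote(title, safe="()")

fixed = move_trailing_article("Saint, The (1997)")
print(fixed)
print(title_to_url(fixed))
```

Unlike manual string replacement, `quote` also percent-encodes characters other than spaces, which matters for titles containing colons or accented letters.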


In [292]:
import pickle
#To read back the pickled data
with open('actors.pkl', 'rb') as f:
    data1 = pickle.load(f)

Adding the actors lists to the data frame.

In [307]:
#Work on a copy so that adding a column does not modify a slice of movie_df
selected_movies=selected_movies.copy()
selected_movies["actors"]=actors_list

#Flag the movies for which no names were scraped
not_scraped=selected_movies["actors"].apply(len) == 0
remaining_movies=selected_movies[not_scraped]

print("There are {} movies which are NOT scraped successfully.".format(remaining_movies.shape[0]))
print("We will delete these movies from our movie list")
selected_movies=selected_movies[~not_scraped]
print("After eliminating the movies which were not successfully scraped from the list, we obtained the following data frame finally")
print("The final data frame has {} movies".format(selected_movies.shape[0]))
display(selected_movies.head())


There are 96 movies which are NOT scraped successfully.
We will delete these movies from our movie list
After eliminating the movies which were not successfully scraped from the list, we obtained the following data frame finally
The final data frame has 450 movies


Unnamed: 0,item_id,item_name,item_url,actors
748,749,"MatchMaker, The (1997)",http://us.imdb.com/M/title-exact?Matchmaker%2C...,"[Mark Joffe, Greg Dinner, Karen Janszen, Janea..."
124,125,Phenomenon (1996),http://us.imdb.com/M/title-exact?Phenomenon%20...,"[Jon Turteltaub, Gerald Di Pego, John Travolta..."
118,119,Maya Lin: A Strong Clear Vision (1994),http://us.imdb.com/M/title-exact?Maya%20Lin:%2...,"[Freida Lee Mock, Freida Lee Mock, Maya Lin]"
1,2,GoldenEye (1995),http://us.imdb.com/M/title-exact?GoldenEye%20(...,"[Martin Campbell, Ian Fleming, Michael France,..."
358,359,"Assignment, The (1997)",http://us.imdb.com/M/title-exact?Assignment%2C...,"[Christian Duguay, Dan Gordon, Sabi H. Shabtai..."


## Preparation of item profile

Using the data frame displayed above, we will now prepare the item profile from the people credited on each movie.

In [322]:
def list_to_dict(l):
    #Map each distinct name to 1; set() drops names credited more than once
    return {el:1 for el in set(l)}
d=list(selected_movies["actors"].apply(list_to_dict))

#One column per person: 1 where the person is credited on the movie, NaN elsewhere
d=pd.DataFrame(d)
d.index = selected_movies.index
pd.concat([selected_movies,d],axis=1)

Unnamed: 0,item_id,item_name,item_url,actors,A.S. Byatt,Aaron Seltzer,Aaron Sorkin,Abel Ferrara,Adam Sandler,Adi Hasak,...,Winona Ryder,Winston Groom,Wolfgang Petersen,Woody Allen,Woody Gelman,Woody Harrelson,Xiaorui Zhao,Yves Montand,Zack Duhame,Zinedine Soualem
748,749,"MatchMaker, The (1997)",http://us.imdb.com/M/title-exact?Matchmaker%2C...,"[Mark Joffe, Greg Dinner, Karen Janszen, Janea...",,,,,,,...,,,,,,,,,,
124,125,Phenomenon (1996),http://us.imdb.com/M/title-exact?Phenomenon%20...,"[Jon Turteltaub, Gerald Di Pego, John Travolta...",,,,,,,...,,,,,,,,,,
118,119,Maya Lin: A Strong Clear Vision (1994),http://us.imdb.com/M/title-exact?Maya%20Lin:%2...,"[Freida Lee Mock, Freida Lee Mock, Maya Lin]",,,,,,,...,,,,,,,,,,
1,2,GoldenEye (1995),http://us.imdb.com/M/title-exact?GoldenEye%20(...,"[Martin Campbell, Ian Fleming, Michael France,...",,,,,,,...,,,,,,,,,,
358,359,"Assignment, The (1997)",http://us.imdb.com/M/title-exact?Assignment%2C...,"[Christian Duguay, Dan Gordon, Sabi H. Shabtai...",,,,,,,...,,,,,,,,,,
147,148,"Ghost and the Darkness, The (1996)",http://us.imdb.com/M/title-exact?Ghost%20and%2...,"[Stephen Hopkins, William Goldman, Michael Dou...",,,,,,,...,,,,,,,,,,
15,16,French Twist (Gazon maudit) (1995),http://us.imdb.com/M/title-exact?Gazon%20maudi...,"[Josiane Balasko, Patrick Aubrée, Josiane Bala...",,,,,,,...,,,,,,,,,,
281,282,"Time to Kill, A (1996)",http://us.imdb.com/M/title-exact?Time%20to%20K...,"[Joel Schumacher, John Grisham, Akiva Goldsman...",,,,,,,...,,,,,,,,,,
129,130,Kansas City (1996),http://us.imdb.com/M/title-exact?Kansas%20City...,"[Robert Altman, Robert Altman, Frank Barhydt, ...",,,,,,,...,,,,,,,,,,
328,329,Desperate Measures (1998),http://us.imdb.com/Title?Desperate+Measures+(1...,"[Barbet Schroeder, David Klass, Michael Keaton...",,,,,,,...,,,,,,,,,,
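
In the table above, a person who is not credited on a movie shows up as a missing value rather than 0. For similarity calculations a 0/1 matrix is usually more convenient. The sketch below builds one on a toy frame; `toy` and `item_profile` are hypothetical names, and the credit lists are shortened versions of lists scraped earlier.

```python
import pandas as pd

# Toy credit lists (shortened from the scraped output shown earlier)
toy = pd.DataFrame({
    "item_name": ["Toy Story (1995)", "Maya Lin: A Strong Clear Vision (1994)"],
    "actors": [["John Lasseter", "John Lasseter", "Tom Hanks"],
               ["Freida Lee Mock", "Freida Lee Mock", "Maya Lin"]],
}, index=[0, 118])

# One row per movie, one column per person; set() collapses repeated names
indicator = pd.DataFrame(
    [{name: 1 for name in set(names)} for names in toy["actors"]],
    index=toy.index,
)

# Replace the missing values with 0 to obtain a 0/1 item profile
item_profile = indicator.fillna(0).astype(int)
print(item_profile)
```

Each row of `item_profile` is then a binary vector that can be compared with other movies or with a user profile.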


In [150]:
#Same XPath as in the scraping loop, applied to the most recently parsed page (tree)
actors = tree.xpath('//*[contains(concat( " ", @class, " " ), concat( " ", "credit_summary_item", " " ))]//*[contains(concat( " ", @class, " " ), concat( " ", "itemprop", " " ))]/text()')
#Dropping the entries that contain only whitespace
actors = [i for i in actors if i.strip()]
print(actors)

['Ingmar Bergman', 'Ingmar Bergman', 'Ingmar Bergman', 'Max von Sydow', u'Gunnar Bj\xf6rnstrand', 'Bengt Ekerot']



## Project methodology

At a high level we will perform the following tasks in this project:

1. Choose 10 users randomly
2. Get the list of all movies rated by these 10 users
3. To develop the content-based recommender, scrape the IMDB website to get the list of actors in the movies identified in step 2
4. Build items and user profiles
5. For each user 
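
Step 4 can be sketched as follows. This is one common way to build a user profile and score an unseen movie, not necessarily the approach used in this project: the user profile is taken as the rating-weighted average of the vectors of the movies the user rated, and an unrated movie is scored by its cosine similarity to that profile. All values below are hypothetical.

```python
import numpy as np

# Hypothetical item profile: rows are movies, columns are people (1 = credited)
item_profile = np.array([
    [1, 1, 0],   # movie A
    [0, 1, 1],   # movie B
    [1, 0, 0],   # movie C (not yet rated by the user)
], dtype=float)

# Hypothetical ratings given by one user to movies A and B
rated_rows = [0, 1]
ratings = np.array([5.0, 2.0])

# User profile: rating-weighted average of the rated movies' vectors
user_profile = ratings.dot(item_profile[rated_rows]) / ratings.sum()

def cosine(u, v):
    """Cosine similarity; returns 0.0 when either vector is all zeros."""
    norm = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u.dot(v) / norm) if norm else 0.0

score_c = cosine(user_profile, item_profile[2])
print(user_profile)
print(score_c)
```

A higher score means the unseen movie shares more credited people with the movies the user rated highly. Turning the score into a predicted rating, which is needed for MAE, requires an additional mapping that is not shown here.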

## Selecting random users

Unnamed: 0,user_id,age,gender,profession,zip,item_id,rating,item_name,item_url
14,817,19,M,student,60152,1,4,Toy Story (1995),http://us.imdb.com/M/title-exact?Toy%20Story%2...
18,691,34,M,educator,60089,1,5,Toy Story (1995),http://us.imdb.com/M/title-exact?Toy%20Story%2...
17,657,26,F,none,78704,1,3,Toy Story (1995),http://us.imdb.com/M/title-exact?Toy%20Story%2...
16,374,36,M,executive,78746,1,4,Toy Story (1995),http://us.imdb.com/M/title-exact?Toy%20Story%2...
15,296,43,F,administrator,16803,1,5,Toy Story (1995),http://us.imdb.com/M/title-exact?Toy%20Story%2...
689,374,36,M,executive,78746,2,4,GoldenEye (1995),http://us.imdb.com/M/title-exact?GoldenEye%20(...
723,666,44,M,administrator,61820,4,5,Get Shorty (1995),http://us.imdb.com/M/title-exact?Get%20Shorty%...
722,374,36,M,executive,78746,4,2,Get Shorty (1995),http://us.imdb.com/M/title-exact?Get%20Shorty%...
633,374,36,M,executive,78746,5,4,Copycat (1995),http://us.imdb.com/M/title-exact?Copycat%20(1995)
634,666,44,M,administrator,61820,5,2,Copycat (1995),http://us.imdb.com/M/title-exact?Copycat%20(1995)
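
For collaborative filtering, rows like the ones above are typically reshaped into a user-by-item matrix before computing the similarity between users. The sketch below uses the ten ratings shown above and compares users on the movies they have both rated. Cosine similarity is used here as an assumption, since the similarity measure is not specified at this point, and `cosine_on_common` is a hypothetical helper.

```python
import pandas as pd

# (user_id, item_id, rating) triples copied from the sample rows shown above
rows = [(817, 1, 4), (691, 1, 5), (657, 1, 3), (374, 1, 4), (296, 1, 5),
        (374, 2, 4), (666, 4, 5), (374, 4, 2), (374, 5, 4), (666, 5, 2)]
sample = pd.DataFrame(rows, columns=["user_id", "item_id", "rating"])

# User x item matrix; NaN marks movies that a user has not rated
matrix = sample.pivot(index="user_id", columns="item_id", values="rating")

def cosine_on_common(u, v):
    """Cosine similarity restricted to the items both users rated (0.0 if none)."""
    common = u.notnull() & v.notnull()
    if not common.any():
        return 0.0
    a, b = u[common], v[common]
    return float((a * b).sum() / ((a ** 2).sum() ** 0.5 * (b ** 2).sum() ** 0.5))

sim = cosine_on_common(matrix.loc[374], matrix.loc[666])
print(sim)
```

Users 374 and 666 share two movies (item_id 4 and 5), so their similarity is based on those two ratings only. Pairs with no movie in common, such as users 817 and 666, get a similarity of 0.0.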
