# Recommender systems - Project 2

## Abstract
In this project we evaluate two types of recommender systems, content-based and collaborative filtering, and compare their performance on the MovieLens data. The content-based recommender uses the actors who appear in each movie as item features; the collaborative filtering recommender uses similarity between users to identify movies that a user might be interested in. The methods are compared using Mean Absolute Error. The MovieLens data does not include actor names, so we use the IMDB URL provided in the data set to look up the actors of each movie. To keep the project simple, we consider 10 random users from the MovieLens data set and include only the movies rated by these 10 users in our analysis.

## Data pre-processing
The MovieLens data has several files, but we will use the following files for this project:
1. u.user (a pipe-delimited data set containing the user details; users are uniquely identified by the user_id key)
2. u.item (a pipe-delimited data set containing the movie details; movies are uniquely identified by item_id)
3. u.data (a tab-delimited data set containing the user_id, the item_id and the rating given to the item by the user)

We choose 10 users at random from the u.user data set and get all the movies rated by these users (using u.data). From u.item we take each movie's IMDB URL and scrape that page to extract the actors list. This list serves as the item properties for the content-based recommender system.

### Importing all the required packages

In [1]:
import pandas as pd
import numpy as np
from IPython.display import display # Allows the use of display() for DataFrames
import time
import pickle #To save the objects that were created using webscraping
import pprint
from lxml import html
import requests

### Reading the files 

We read the files u.data, u.user and u.item and keep only the columns needed for this project. The comments in the code indicate which columns are dropped.

In [2]:
#Reading the movie information
movie_df=pd.read_table("u.item",sep="|" ,header=None)
#Selecting only the needed columns
movie_df=movie_df[[0,1,4]]
movie_df.columns = ["item_id","item_name","item_url"]
#Displaying sample data
print "Movie data frame sample rows:"
display(movie_df.head())

#Reading the user ID information
user_df=pd.read_table("u.user",sep="|" ,header=None)
user_df.columns=["user_id","age","gender","profession","zip"]

print "User data frame sample rows:"
display(user_df.head())

#Reading the user, movie ratings information
ratings_df=pd.read_csv("u.data",delimiter = "\t",header=None)
ratings_df.columns = ["user_id",  "item_id","rating","timestamp"]
#Dropping the timestamp info, since we do not need that information in this project
del ratings_df["timestamp"]

print "Displaying the ratings details (mapping between user id and item id):"
display(ratings_df.head())

Movie data frame sample rows:


,item_id,item_name,item_url
0,1,Toy Story (1995),http://us.imdb.com/M/title-exact?Toy%20Story%2...
1,2,GoldenEye (1995),http://us.imdb.com/M/title-exact?GoldenEye%20(...
2,3,Four Rooms (1995),http://us.imdb.com/M/title-exact?Four%20Rooms%...
3,4,Get Shorty (1995),http://us.imdb.com/M/title-exact?Get%20Shorty%...
4,5,Copycat (1995),http://us.imdb.com/M/title-exact?Copycat%20(1995)


User data frame sample rows:


,user_id,age,gender,profession,zip
0,1,24,M,technician,85711
1,2,53,F,other,94043
2,3,23,M,writer,32067
3,4,24,M,technician,43537
4,5,33,F,other,15213


Displaying the ratings details (mapping between user id and item id):


,user_id,item_id,rating
0,196,242,3
1,186,302,3
2,22,377,1
3,244,51,2
4,166,346,1


### Selecting the users randomly
We select 10 random users in the following code block and then collect every movie rated by them. The resulting data frame holds all the columns we need, including the IMDB URLs that we scrape to obtain the actors of each movie.

In [3]:
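#Selecting only 10 users randomly
#Set the seed to 1234 to reproduce the same results
np.random.seed(1234)
#Note: uids are 0-based row positions (used with .iloc below), so the selected user_id is uids + 1.
#randint samples with replacement, so repeated positions are possible in general.
uids = np.random.randint(1,user_df["user_id"].max(),10)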

user_df=user_df.iloc[uids]
#user_df=user_df[user_df["user_id"].isin(uids)]
#user_df.index = user_df["user_id"]
print uids
print "Here are the randomly selected users:\n"
display(user_df)

#Combined data frame
df = pd.merge(pd.merge(user_df,ratings_df),movie_df)
print "All columns combined:"
display(df.head())
#df.sort_values(["item_id"])

print "The combined data frame has {} rows and {} columns".format(df.shape[0],df.shape[1])

[816 724 295  54 205 373 665 656 690 280]
Here are the randomly selected users:



,user_id,age,gender,profession,zip
816,817,19,M,student,60152
724,725,21,M,student,91711
295,296,43,F,administrator,16803
54,55,37,M,programmer,1331
205,206,14,F,student,53115
373,374,36,M,executive,78746
665,666,44,M,administrator,61820
656,657,26,F,none,78704
690,691,34,M,educator,60089
280,281,15,F,student,6059


All columns combined:


,user_id,age,gender,profession,zip,item_id,rating,item_name,item_url
0,817,19,M,student,60152,748,4,"Saint, The (1997)",http://us.imdb.com/M/title-exact?Saint%2C%20Th...
1,725,21,M,student,91711,748,4,"Saint, The (1997)",http://us.imdb.com/M/title-exact?Saint%2C%20Th...
2,206,14,F,student,53115,748,4,"Saint, The (1997)",http://us.imdb.com/M/title-exact?Saint%2C%20Th...
3,691,34,M,educator,60089,748,4,"Saint, The (1997)",http://us.imdb.com/M/title-exact?Saint%2C%20Th...
4,281,15,F,student,6059,748,5,"Saint, The (1997)",http://us.imdb.com/M/title-exact?Saint%2C%20Th...


The combined data frame has 899 rows and 9 columns
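
The selection above draws row positions with np.random.randint, which samples with replacement. As a minimal sketch of an alternative, the following draws distinct user IDs directly using only the standard library (Python 3 syntax). The function name `pick_user_ids` and the assumption that user IDs run contiguously from 1 to `max_user_id` are illustrative and not part of the notebook; the IDs it returns will differ from the ones shown above.

```python
import random

def pick_user_ids(max_user_id, n_users=10, seed=1234):
    """Return n_users distinct user IDs drawn from 1..max_user_id (inclusive)."""
    rng = random.Random(seed)  # local generator, so the global random state is untouched
    return sorted(rng.sample(range(1, max_user_id + 1), n_users))

# MovieLens 100K lists 943 users in u.user (adjust if your copy differs)
user_ids = pick_user_ids(943)
print(user_ids)
```

Because the function returns user IDs rather than positions, the users can then be selected by label (for example with isin on the user_id column) instead of by position.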


Since the combined data frame above contains the same movie several times (once per rating), we use movie_df, which has one row per item_id, to get the actors list from the IMDB website. Note that we collect only the leading actors together with the director and writer names, to keep the list simple.

### Scraping the IMDB web pages to extract actors, writers and directors

In [4]:
#Select the movies rated by the 10 chosen users
selected_movie_id=df["item_id"].unique()
#Caution: .iloc is positional, so passing an item_id k selects the row at position k (item_id k+1).
#The outputs below were produced with this positional selection.
#A label-based selection would be: movie_df[movie_df["item_id"].isin(selected_movie_id)]
#.copy() avoids the pandas SettingWithCopyWarning when the slice is modified below
selected_movies=movie_df.iloc[selected_movie_id].copy()
selected_movies.index=selected_movies["item_id"]
selected_movies.drop(["item_id"],axis=1,inplace=True)
display(selected_movies.head())


item_id,item_name,item_url
749,"MatchMaker, The (1997)",http://us.imdb.com/M/title-exact?Matchmaker%2C...
125,Phenomenon (1996),http://us.imdb.com/M/title-exact?Phenomenon%20...
119,Maya Lin: A Strong Clear Vision (1994),http://us.imdb.com/M/title-exact?Maya%20Lin:%2...
2,GoldenEye (1995),http://us.imdb.com/M/title-exact?GoldenEye%20(...
359,"Assignment, The (1997)",http://us.imdb.com/M/title-exact?Assignment%2C...


Let us get the actors list for a movie using the lxml package. Once we validate that we are successfully getting the actors list, we can implement the same logic in a loop to pull the actors list for all the movies selected (movies rated by the users selected).

In [5]:
page = requests.get('http://us.imdb.com/M/title-exact?Toy%20Story%20(1995)')
tree = html.fromstring(page.content)
actors = tree.xpath('//*[contains(concat( " ", @class, " " ), concat( " ", "credit_summary_item", " " ))]//*[contains(concat( " ", @class, " " ), concat( " ", "itemprop", " " ))]/text()')
print actors

['John Lasseter', 'John Lasseter', 'Pete Docter', 'Tom Hanks', 'Tim Allen', 'Don Rickles']


We can see that the artists for Toy Story were pulled successfully into a list. The list includes the director and writers as well as the leading actors (Toy Story is an animated movie, so these actors voiced the characters).

The number of movies that need to be scraped is given below:

In [5]:
print "We need to scrape {} movies".format(selected_movies.shape[0])

We need to scrape 546 movies


Let us implement the IMDB web scraping (to pull the actors list) in an automated fashion. The following code block will run for more than an hour (since I kept a sleep time of 3 secs between successive requests in order to avoid constant hits to the IMDB server).

In [None]:
#Track the time
start = time.time()
selected_movies.index
actors_list = list()
j = 0

#Iterate the data frame by index
for ind in selected_movies.index:
    #Get the movie name and the URL of the IMDB
    movie_name= selected_movies.loc[ind]["item_name"]
    movie_url= selected_movies.loc[ind]["item_url"]
    
    #Pull the page
    page = requests.get(movie_url)
    tree = html.fromstring(page.content)
    actors = tree.xpath('//*[contains(concat( " ", @class, " " ), concat( " ", "credit_summary_item", " " ))]//*[contains(concat( " ", @class, " " ), concat( " ", "itemprop", " " ))]/text()')
    
    #Extract the actors
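    #Drop bare newlines and whitespace-only strings
    #(a list comprehension also works on Python 3, where filter() returns an iterator and len() would fail)
    actors = [name for name in actors if name.strip()]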
    
    #Some URL links are not correct. So if we do not get any actors list, then
    #fix the error and pull the data again
    if len(actors) == 0:
        if len(movie_name.split(",")) == 2:
            x=movie_name.split("(")
            year = x[-1]
            year = "("+year.strip()
            name = x[0]

            name = name.split(",")
            if len(name) > 1:
                name = name[-1]+",".join(name[0:-1])
            name=str(name)
            name = name.strip()
            temp_url = name+"%20"+year

            temp_url=temp_url.replace(" ","%20")
            temp_url="http://us.imdb.com/M/title-exact?"+temp_url
            try:
                page = requests.get(temp_url)
            except requests.RequestException:
                #Report the failing URL and keep the empty actors list for this movie
                print(temp_url)
            else:
                tree = html.fromstring(page.content)
                actors = tree.xpath('//*[contains(concat( " ", @class, " " ), concat( " ", "credit_summary_item", " " ))]//*[contains(concat( " ", @class, " " ), concat( " ", "itemprop", " " ))]/text()')
                actors = [name for name in actors if name.strip()]

    #Collect the actors list to actors_list
    actors_list.append(actors)
    
    #Sleep for 3 seconds to avoid bombarding IMDB server with immediate requests
    time.sleep(3)
    
    #j variable will keep track of how many URLs were processed
    j = j+1
    
    #Display the status of scraping (how many URLs were processed till now?)
    if j % 50 == 0:
        print "processed {} URLs".format(j)

end = time.time()

print "Total processing time {} secs".format(end - start)

len(actors_list)

#Save the actors list, so that we do not have to gather the actors details again
#(pickle is already imported at the top of the notebook)
with open('actors.pkl','wb') as f:
    #Protocol -1 selects the highest pickle protocol available
    pickle.dump(actors_list,f,-1)
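
The retry branch in the loop above rewrites titles such as "Saint, The (1997)" so that the trailing article moves to the front before the URL is requested again. As a minimal standalone sketch of that idea (Python 3, standard library only): the function name `repaired_url` is hypothetical, and unlike the loop above it splits on the last comma only. Whether the resulting URL resolves on IMDB is not checked here.

```python
from urllib.parse import quote

BASE_URL = "http://us.imdb.com/M/title-exact?"

def repaired_url(movie_name):
    """Return a retry URL for titles like 'Saint, The (1997)', or None if no fix applies."""
    title, paren, year = movie_name.rpartition("(")
    if not paren or "," not in title:
        return None  # no year suffix, or no trailing article to move
    head, _, article = title.strip().rpartition(",")
    fixed = "{} {} ({}".format(article.strip(), head.strip(), year.strip())
    # Percent-encode spaces and other reserved characters, keeping parentheses readable
    return BASE_URL + quote(fixed, safe="()")

print(repaired_url("Saint, The (1997)"))
```

Returning None for titles without a comma makes it easy for the caller to skip the second request when there is nothing to repair.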


In [6]:
import pickle
import pprint
#To read back the pickled data (the with block closes the file automatically)
with open('actors.pkl', 'rb') as f:
    actors_list = pickle.load(f)
#Uncomment the following lines to verify
#pprint.pprint(actors_list)
#actors_list[1][0]

Adding the actors lists to the data frame.

In [7]:
import warnings
warnings.filterwarnings('ignore')
print len(actors_list)
print selected_movies.shape
selected_movies["actors"]=actors_list

#Get the list of movies which are not scraped successfully
remaining_movies=selected_movies[selected_movies.astype(str)['actors'] == '[]']

print "There are {} movies which are NOT scraped successfully.".format(remaining_movies.shape[0])
print "We will delete these movies from our movie list"
selected_movies=selected_movies[selected_movies.astype(str)['actors'] != '[]']
print "After eliminating the movies which were not successfully scraped from the list, we obtained the following data frame finally"
print "The final data frame has {} movies".format(selected_movies.shape[0])
#selected_movies.index=selected_movies["item_id"]
display(selected_movies.head())
#selected_movies.loc[:,len("actors"] == []
#selected_movies.loc[selected_movies['actors'].isin([[]])]
#selected_movies.loc[i for i in selected_movies["actors"] if len(i) == 0]
#selected_movies[selected_movies["actors"] == []]


546
(546, 2)
There are 96 movies which are NOT scraped successfully.
We will delete these movies from our movie list
After eliminating the movies which were not successfully scraped from the list, we obtained the following data frame finally
The final data frame has 450 movies


item_id,item_name,item_url,actors
749,"MatchMaker, The (1997)",http://us.imdb.com/M/title-exact?Matchmaker%2C...,"[Mark Joffe, Greg Dinner, Karen Janszen, Janea..."
125,Phenomenon (1996),http://us.imdb.com/M/title-exact?Phenomenon%20...,"[Jon Turteltaub, Gerald Di Pego, John Travolta..."
119,Maya Lin: A Strong Clear Vision (1994),http://us.imdb.com/M/title-exact?Maya%20Lin:%2...,"[Freida Lee Mock, Freida Lee Mock, Maya Lin]"
2,GoldenEye (1995),http://us.imdb.com/M/title-exact?GoldenEye%20(...,"[Martin Campbell, Ian Fleming, Michael France,..."
359,"Assignment, The (1997)",http://us.imdb.com/M/title-exact?Assignment%2C...,"[Christian Duguay, Dan Gordon, Sabi H. Shabtai..."


## Preparation of item profile

Using the data frame displayed above, we now prepare the item profile from the actors (together with the directors and writers) listed for each movie.

In [8]:
#Define a function that converts a list to a dict
def list_to_dict(l):
    l = list(set(l)) #eliminate duplicate values, if any, in the list
    #Return a dictionary with key as the actor's name and 1 as the key's value
    return {el:1 for el in l} 

#Convert the list of all actors as a list of dictionaries
d=list(selected_movies["actors"].apply(list_to_dict))

#Convert the list of dictionaries to a data frame
d=pd.DataFrame(d)

#Set the row indices of the data frame to the same as the movies data frame, so that we can join the data frames later
d.index = selected_movies.index

#Concatenate the data frames, based on the row indices
items_profile = pd.concat([selected_movies,d],axis=1)

print "The items profile data frame has {} rows and {} columns. The initial 3 rows are displayed below:"\
.format(items_profile.shape[0],items_profile.shape[1])

display(items_profile.head(n=3))

#items_profile[['Mark Joffe', 'Greg Dinner']]



The items profile data frame has 450 rows and 1723 columns. The initial 3 rows are displayed below:


item_id,item_name,item_url,actors,A.S. Byatt,Aaron Seltzer,Aaron Sorkin,Abel Ferrara,Adam Sandler,Adi Hasak,Adolphe Menjou,...,Winona Ryder,Winston Groom,Wolfgang Petersen,Woody Allen,Woody Gelman,Woody Harrelson,Xiaorui Zhao,Yves Montand,Zack Duhame,Zinedine Soualem
749,"MatchMaker, The (1997)",http://us.imdb.com/M/title-exact?Matchmaker%2C...,"[Mark Joffe, Greg Dinner, Karen Janszen, Janea...",,,,,,,,...,,,,,,,,,,
125,Phenomenon (1996),http://us.imdb.com/M/title-exact?Phenomenon%20...,"[Jon Turteltaub, Gerald Di Pego, John Travolta...",,,,,,,,...,,,,,,,,,,
119,Maya Lin: A Strong Clear Vision (1994),http://us.imdb.com/M/title-exact?Maya%20Lin:%2...,"[Freida Lee Mock, Freida Lee Mock, Maya Lin]",,,,,,,,...,,,,,,,,,,


Most values in the item profile are NaN, because each person appears in only a few of the movies. Let us make sure that the value 1 appears at the right positions in the matrix: we check which movies list "Tom Hanks" and which movies list "Julia Roberts".

In [10]:
display(items_profile[["item_name","Tom Hanks"]].dropna())
display(items_profile[["item_name","Julia Roberts"]].dropna())

item_id,item_name,Tom Hanks
88,Sleepless in Seattle (1993),1.0
845,That Thing You Do! (1996),1.0
69,Forrest Gump (1994),1.0
28,Apollo 13 (1995),1.0


item_id,item_name,Julia Roberts
328,Conspiracy Theory (1997),1.0
255,My Best Friend's Wedding (1997),1.0
370,Mary Reilly (1996),1.0
744,Michael Collins (1996),1.0
319,Everyone Says I Love You (1996),1.0


The display confirms that we have encoded the item profile matrix correctly. You may search the IMDB database with the movie names to confirm that these actors have indeed acted in those movies.
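
The encoding used for the item profile can be illustrated on a tiny scale. The following is a minimal sketch in plain Python 3, not the pandas code used above: the function name `build_item_profile` is hypothetical, missing entries are written as 0 instead of NaN, and the two actor lists are copied from outputs shown earlier (the item ID 1 for Toy Story is taken from the movie table).

```python
def build_item_profile(actors_by_movie):
    """Return (vocabulary, profile), where profile maps movie ID to a 0/1 row."""
    # One column per distinct name, in alphabetical order
    vocabulary = sorted({name for names in actors_by_movie.values() for name in names})
    column = {name: i for i, name in enumerate(vocabulary)}
    profile = {}
    for movie_id, names in actors_by_movie.items():
        row = [0] * len(vocabulary)
        for name in set(names):  # duplicates (e.g. a writer-director) count once
            row[column[name]] = 1
        profile[movie_id] = row
    return vocabulary, profile

actors_by_movie = {
    1: ["John Lasseter", "John Lasseter", "Pete Docter", "Tom Hanks", "Tim Allen", "Don Rickles"],
    119: ["Freida Lee Mock", "Freida Lee Mock", "Maya Lin"],
}
vocabulary, profile = build_item_profile(actors_by_movie)
print(vocabulary)
print(profile)
```

Each row has a 1 for every person credited on that movie, which is what the Tom Hanks and Julia Roberts checks above verify column by column.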

### Building utility matrix
Now we build a utility matrix and normalize the ratings provided by each of the 10 selected users. To normalize, we subtract the mean of a user's ratings from each rating given by that user. Before constructing the utility matrix, we hold out 50 of the 450 movies (the first 50 in the order returned by set(), which is arbitrary rather than truly random). These 50 movies act as test data on which we evaluate our algorithms: we pretend that the users have not watched them and propose them to the users based on the cosine similarities obtained. We then compare each proposal with the rating the user actually gave. If we propose a held-out movie M to a user U and U rated M 3 or higher, we count the recommendation as a success; otherwise we count it as a failure.

In [9]:
#all_movie_ids = list(set(items_profile["item_id"]))
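all_movie_ids = list(set(items_profile.index))
#Hold out the first 50 movie IDs for testing and train on the rest
#(the slices [:50] and [50:] together cover every movie exactly once)
test_movie_ids = all_movie_ids[:50]
train_movie_ids = all_movie_ids[50:]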
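
Slicing a list built from a set gives an arbitrary but not random split. If a reproducible random hold-out is wanted, a minimal sketch could look like the following (Python 3, standard library only). The function name `holdout_split` and the seed value are assumptions for illustration, and the IDs 1 to 450 stand in for the real movie IDs.

```python
import random

def holdout_split(movie_ids, n_test=50, seed=1234):
    """Split movie IDs into a random test set of size n_test and the remaining train set."""
    ordered = sorted(movie_ids)  # fixed order, so the seed fully determines the split
    rng = random.Random(seed)
    test_ids = sorted(rng.sample(ordered, n_test))
    held_out = set(test_ids)
    train_ids = [m for m in ordered if m not in held_out]
    return test_ids, train_ids

# Placeholder IDs standing in for the 450 scraped movies
test_ids, train_ids = holdout_split(range(1, 451))
print(len(test_ids), len(train_ids))
```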

The following movies will not be used to build the user profile. These movies will be used for testing.

In [10]:
selected_movies.loc[test_movie_ids]
#list(items_profile["item_id"])
#items_profile["item_id"]

item_id,item_name,item_url,actors
2,GoldenEye (1995),http://us.imdb.com/M/title-exact?GoldenEye%20(...,"[Martin Campbell, Ian Fleming, Michael France,..."
3,Four Rooms (1995),http://us.imdb.com/M/title-exact?Four%20Rooms%...,"[Allison Anders, Alexandre Rockwell, Allison A..."
5,Copycat (1995),http://us.imdb.com/M/title-exact?Copycat%20(1995),"[Jon Amiel, Ann Biderman, David Madsen, Sigour..."
8,Babe (1995),http://us.imdb.com/M/title-exact?Babe%20(1995),"[Chris Noonan, Dick King-Smith, George Miller,..."
9,Dead Man Walking (1995),http://us.imdb.com/M/title-exact?Dead%20Man%20...,"[Tim Robbins, Helen Prejean, Tim Robbins, Susa..."
10,Richard III (1995),http://us.imdb.com/M/title-exact?Richard%20III...,"[Richard Loncraine, William Shakespeare, Ian M..."
11,Seven (Se7en) (1995),http://us.imdb.com/M/title-exact?Se7en%20(1995),"[David Fincher, Andrew Kevin Walker, Morgan Fr..."
12,"Usual Suspects, The (1995)",http://us.imdb.com/M/title-exact?Usual%20Suspe...,"[Bryan Singer, Christopher McQuarrie, Kevin Sp..."
13,Mighty Aphrodite (1995),http://us.imdb.com/M/title-exact?Mighty%20Aphr...,"[Woody Allen, Woody Allen, Woody Allen, Mira S..."
15,Mr. Holland's Opus (1995),http://us.imdb.com/M/title-exact?Mr.%20Holland...,"[Stephen Herek, Patrick Sheane Duncan, Richard..."


In [11]:
def normalize(rec):
    return rec - np.nanmean(rec)

utility_matrix = ratings_df.pivot(index='user_id',columns='item_id',values='rating')
utility_matrix= utility_matrix.iloc[uids]

#Observe that we are using only train data.
#.loc replaces .ix, which is deprecated and has been removed from recent pandas versions
utility_matrix = utility_matrix.loc[:,train_movie_ids]


utility_matrix=utility_matrix.apply(normalize,axis=1)
utility_matrix=utility_matrix.fillna(0)
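
The normalization step can be checked by hand on a single user. The sketch below is plain Python 3 with hypothetical ratings; the function name `mean_center` is illustrative, and None plays the role of NaN for movies the user has not rated.

```python
def mean_center(ratings):
    """Subtract the user's mean rating from each rating; unrated movies (None) become 0."""
    rated = [r for r in ratings.values() if r is not None]
    if not rated:
        return {item: 0.0 for item in ratings}
    mean = sum(rated) / len(rated)
    return {item: (r - mean if r is not None else 0.0) for item, r in ratings.items()}

# Hypothetical ratings of one user for four movies
centered = mean_center({117: 5, 118: 3, 121: 4, 109: None})
print(centered)
```

Note that a rating equal to the user's mean (movie 121 here) also becomes 0, so after filling missing values with 0 it can no longer be told apart from an unrated movie.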


In [12]:
#movie_id = list(set(items_profile["item_id"]))
#movie_id
#utility_matrix[:,[movie_id]]
#utility_matrix = utility_matrix.ix[:,movie_id]
utility_matrix

user_id (rows) / item_id (columns),101,107,109,110,112,115,117,118,119,121,...,1267,1278,1285,1314,1396,1408,1431,1432,1452,1475
817,0.0,0.0,0.0,0.0,0.0,0.0,1.470588,-0.529412,0.0,-0.529412,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
725,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
296,0.0,0.0,0.0,0.0,0.0,0.0,-1.246377,0.0,0.0,0.753623,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
55,0.0,0.0,0.0,0.0,0.0,0.0,-0.625,1.375,0.0,-0.625,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
206,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,-1.384615,-1.384615,0.0,0.0
374,0.0,0.0,0.0,0.0,0.0,0.0,1.486957,1.486957,0.0,0.486957,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
666,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-0.775862,0.0,-0.775862,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
657,0.0,0.0,-2.1875,0.0,0.0,0.0,0.8125,-2.1875,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
691,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
281,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [13]:
items_profile=items_profile.fillna(0)
items_profile.drop(["item_name","item_url","actors"],inplace=True,axis=1)
#del(items_profile[["item_name","item_url","actors"]]
#items_profile=items_profile.sort(columns="item_id").to_sparse()

#items_profile.index = items_profile["item_id"]
#items_profile.drop(["item_id"],inplace=True,axis=1)
items_profile.head()

item_id,A.S. Byatt,Aaron Seltzer,Aaron Sorkin,Abel Ferrara,Adam Sandler,Adi Hasak,Adolphe Menjou,Adriana Caselotti,Agga Olsen,Agnieszka Holland,...,Winona Ryder,Winston Groom,Wolfgang Petersen,Woody Allen,Woody Gelman,Woody Harrelson,Xiaorui Zhao,Yves Montand,Zack Duhame,Zinedine Soualem
749,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
125,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
119,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
359,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [14]:
#Removing the test movie IDs from the profile or including only the train movie IDs
test_items_profile=items_profile.loc[test_movie_ids]
items_profile=items_profile.loc[train_movie_ids]
items_profile

item_id,A.S. Byatt,Aaron Seltzer,Aaron Sorkin,Abel Ferrara,Adam Sandler,Adi Hasak,Adolphe Menjou,Adriana Caselotti,Agga Olsen,Agnieszka Holland,...,Winona Ryder,Winston Groom,Wolfgang Petersen,Woody Allen,Woody Gelman,Woody Harrelson,Xiaorui Zhao,Yves Montand,Zack Duhame,Zinedine Soualem
101,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
107,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
109,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
110,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
112,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
115,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
117,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
118,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
119,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
121,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [15]:
display(utility_matrix.head())
x=np.dot(utility_matrix,items_profile)
print x.shape
print x

user_id (rows) / item_id (columns),101,107,109,110,112,115,117,118,119,121,...,1267,1278,1285,1314,1396,1408,1431,1432,1452,1475
817,0.0,0.0,0.0,0.0,0.0,0.0,1.470588,-0.529412,0.0,-0.529412,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
725,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
296,0.0,0.0,0.0,0.0,0.0,0.0,-1.246377,0.0,0.0,0.753623,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
55,0.0,0.0,0.0,0.0,0.0,0.0,-0.625,1.375,0.0,-0.625,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
206,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,-1.384615,-1.384615,0.0,0.0


(10L, 1720L)
[[ 0.          0.          0.         ...,  0.          0.          0.        ]
 [ 0.          0.          0.         ...,  0.          0.          0.        ]
 [ 0.          0.          0.         ...,  0.          0.          0.75362319]
 ..., 
 [ 0.          0.          0.         ...,  0.          0.          0.        ]
 [ 0.          0.          0.75       ...,  0.          0.          0.        ]
 [ 0.          0.          0.         ...,  0.          0.          0.        ]]


In [16]:
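#Work on a copy so that utility_matrix itself keeps the normalized ratings
temp_utility_matrix = utility_matrix.copy()
#Indicator matrix: 1 where the user has a non-zero normalized rating, 0 elsewhere
temp_utility_matrix[temp_utility_matrix != 0] = 1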
#items_profile
user_profile=x/np.dot(temp_utility_matrix,items_profile)
user_profile= pd.DataFrame(user_profile).fillna(0)
user_profile.columns = items_profile.columns
user_profile.index=utility_matrix.index
user_profile

user_id,A.S. Byatt,Aaron Seltzer,Aaron Sorkin,Abel Ferrara,Adam Sandler,Adi Hasak,Adolphe Menjou,Adriana Caselotti,Agga Olsen,Agnieszka Holland,...,Winona Ryder,Winston Groom,Wolfgang Petersen,Woody Allen,Woody Gelman,Woody Harrelson,Xiaorui Zhao,Yves Montand,Zack Duhame,Zinedine Soualem
817,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
725,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
296,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-3.246377,0.0,...,0.0,0.0,0.753623,0.753623,0.0,0.753623,0.0,0.0,0.0,0.753623
55,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
206,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-1.384615,...,-1.384615,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
374,0.0,0.0,1.486957,0.0,-0.513043,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.486957,0.0,-0.513043,0.0,0.0,0.0,-0.513043,0.0
666,0.0,0.0,-0.775862,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.224138,0.0,0.724138,0.224138,0.0,0.0,0.0,0.0,-0.775862,0.0
657,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.8125,0.0,0.0,0.0,0.0
691,0.0,0.0,0.75,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
281,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
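
The user profile computed above is, for every actor, the average normalized rating the user gave to the movies featuring that actor. A minimal sketch of this computation for a single user, in plain Python 3 with hypothetical numbers, is shown below; the function name `user_profile_row` is illustrative. Unlike the cell above, the sketch counts every rated movie, including those whose normalized rating happens to be exactly 0.

```python
def user_profile_row(centered_ratings, item_features):
    """Average the user's normalized ratings over the movies that contain each feature."""
    n_features = len(next(iter(item_features.values())))
    totals = [0.0] * n_features
    counts = [0] * n_features
    for movie_id, rating in centered_ratings.items():
        for j, present in enumerate(item_features[movie_id]):
            if present:
                totals[j] += rating
                counts[j] += 1
    # Features that never occur in the user's rated movies get weight 0
    return [t / c if c else 0.0 for t, c in zip(totals, counts)]

# Hypothetical data: two rated movies, three actor columns
centered_ratings = {1: 1.5, 2: -0.5}
item_features = {1: [1, 0, 1], 2: [1, 1, 0]}
row = user_profile_row(centered_ratings, item_features)
print(row)
```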


In [17]:
#items_profile[items_profile["A.S. Byatt"] !=0]
from sklearn.metrics.pairwise import cosine_similarity
sim = cosine_similarity(user_profile,items_profile)
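
For reference, the cosine similarity between one user profile row and one item profile row can be written out directly. This is a minimal sketch in plain Python 3 for a single pair of vectors, not a replacement for the scikit-learn call above; returning 0 for an all-zero vector is a convention assumed here.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumed convention: no information means no similarity
    return dot / (norm_u * norm_v)

print(cosine([1, 1], [1, 0]))
```

A positive value means the movie features actors the user tended to rate above their own average, and a negative value means the opposite.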


In [18]:
sim = pd.DataFrame(sim)
sim.columns = items_profile.index
sim.index = user_profile.index
#sim.loc[817].sort()
sim

user_id (rows) / item_id (columns),101,107,109,110,112,115,117,118,119,121,...,1267,1278,1285,1314,1396,1408,1431,1432,1452,1475
817,0.0,0.0,0.0,-0.002823,0.0,0.0,0.423483,-0.152454,0.0,-0.165459,...,0.024741,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
725,0.0,0.0,0.0,0.0,0.0,0.0,-0.007088,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
296,0.0,0.0,0.0,0.0,0.0,0.0,-0.13962,0.0,0.0,0.079732,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
55,0.0,0.0,0.0,0.0,0.0,0.0,-0.172585,0.379687,0.0,-0.157548,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
206,0.0,0.015393,0.0,0.0,0.0,0.0,0.0,0.014052,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,-0.154889,-0.1897,0.0,0.0
374,0.0,-0.016233,0.0,-0.014818,0.0,0.0,0.107183,0.099962,0.0,0.033245,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.021474,0.0,0.0
666,0.0,0.011345,0.0,0.0,0.0,0.0,-0.01527,-0.084448,0.0,-0.098181,...,0.011345,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
657,0.0,0.0,-0.334881,-0.078932,0.0,0.0,0.175906,-0.473593,0.0,0.0,...,-0.086466,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
691,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
281,0.0,0.026434,0.0,0.0,0.0,0.0,0.0,0.024131,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [19]:
sim = cosine_similarity(user_profile,test_items_profile)
sim = pd.DataFrame(sim)
sim.columns = test_items_profile.index
sim.index = user_profile.index
#sim.loc[817].sort()
sim

user_id (rows) / item_id (columns),2,3,5,8,9,10,11,12,13,15,...,82,83,88,89,90,92,93,96,97,98
817,0.0,0.0,0.0,0.0,0.0,0.0,-0.08041,0.0,0.0,0.0,...,-0.055668,0.0,0.0,-0.098813,0.0,-0.08041,0.0,0.0,0.0,0.0
725,0.0,0.0,0.0,0.0,0.07765,0.0,0.069885,0.0,0.0,0.0,...,0.0,0.0,0.0,-0.007088,0.0,0.0,0.0,0.0,0.0,0.0
296,0.0,0.0,-0.110857,0.0,0.015946,0.015946,0.0,0.015946,0.020587,0.0,...,0.00552,-0.005829,0.015946,-0.014417,0.0,0.0,0.0,0.0,0.0,0.0
55,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.056717,0.0,0.0,0.017258,0.0,0.0,0.0,0.0,0.0,0.0
206,-0.063233,0.0,-0.031617,0.0,0.040407,0.06542,0.086586,0.130841,0.0,0.0,...,0.015393,0.0,0.0,-0.031617,0.0,0.0,0.0,-0.034634,0.0,0.0
374,-0.007409,0.0,-0.000377,0.0,0.0,0.023524,-0.008116,0.007704,-0.051326,-0.000413,...,0.017838,0.0,-0.023936,0.018691,0.0,-0.023936,0.0,-0.013802,0.0,0.021474
666,0.0,0.0,0.027087,0.0,-0.019636,-0.019636,-0.002618,0.042327,0.007323,-0.019636,...,-0.055999,0.0,0.0,-0.026822,0.0,0.0,0.0,-0.046254,0.0,0.0
657,0.0,0.0,0.0,0.0,0.0,0.0,-0.046939,-0.046939,0.0,0.0,...,-0.086466,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
691,0.0,0.0,0.0,0.0,0.0,0.054956,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
281,-0.032174,0.0,0.0,0.0,-0.035245,0.0,-0.008811,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-0.035245,0.0,0.0


In [28]:
sim_t=sim.T
x=sim_t[[666]].sort_values([666])
x
#x.loc[294]

item_id (rows) / user_id (columns),666
29,-0.083253
67,-0.071563
82,-0.055999
96,-0.046254
62,-0.039272
70,-0.039272
28,-0.035851
56,-0.0336
69,-0.030672
89,-0.026822


In [23]:
#uids holds 0-based row positions; adding 1 converts them to the matching user_id values
uids = [i+1 for i in uids]
actual_ratings=ratings_df[(ratings_df["item_id"].isin(test_movie_ids))&(ratings_df["user_id"].isin(uids))]
actual_ratings.sort_values(["user_id"])

,user_id,item_id,rating
6633,55,56,4
7615,55,89,5
402,296,20,5
77042,296,13,3
76754,296,9,4
65321,296,11,5
22967,296,55,5
18961,296,89,5
11846,296,56,5
97753,296,23,5


In [389]:
#Ratings given by user 296 to the held-out test movies (.isin is the pandas membership test; .in is a syntax error)
ratings_df[(ratings_df["user_id"] == 296)&(ratings_df["item_id"].isin(test_movie_ids))]
#ratings_df[(ratings_df["user_id"]==296) &(ratings_df["item_id"] ==88)].sort_values(["rating"])

In [234]:
top_n = 10
pd.DataFrame({n: sim.T[col].nlargest(top_n).index.tolist() 
                  for n, col in enumerate(sim.T)}).T


,0,1,2,3,4,5,6,7,8,9
0,117,273,294,187,222,281,288,328,329,124
1,881,333,286,157,9,11,825,39,80,148
2,20,98,199,257,277,293,357,429,485,508
3,118,144,89,273,174,56,187,226,425,467
4,302,315,288,310,333,269,896,12,332,873
5,144,185,197,226,742,1010,71,125,762,98
6,493,506,302,269,286,529,647,483,179,188
7,294,269,117,475,508,9,286,57,67,340
8,185,205,604,178,692,524,180,209,653,10
9,271,304,500,332,322,288,310,877,1011,771
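
Picking the top-N movies for one user amounts to taking the N largest similarity values. A minimal sketch in plain Python 3 follows; the function name `top_n` and the `exclude` parameter (for movies the user has already rated) are assumptions for illustration. The similarity values are copied from user 666's row in the test similarity table above.

```python
import heapq

def top_n(similarities, n=10, exclude=()):
    """Return the n movie IDs with the highest similarity, skipping excluded IDs."""
    candidates = {m: s for m, s in similarities.items() if m not in exclude}
    return heapq.nlargest(n, candidates, key=candidates.get)

# Similarities of user 666 to a few held-out movies (from the table above)
sims_666 = {5: 0.027087, 9: -0.019636, 11: -0.002618, 12: 0.042327, 13: 0.007323}
best = top_n(sims_666, n=3)
print(best)
```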



## Project methodology

At a high level we will perform the following tasks in this project:

1. Choose 10 users randomly
2. Get the list of all movies rated by these 10 users
3. To develop the content-based recommender, scrape the IMDB website to get the actors of the movies identified in step 2
4. Build items and user profiles
5. For each user 

## Selecting random users

,user_id,age,gender,profession,zip,item_id,rating,item_name,item_url
14,817,19,M,student,60152,1,4,Toy Story (1995),http://us.imdb.com/M/title-exact?Toy%20Story%2...
18,691,34,M,educator,60089,1,5,Toy Story (1995),http://us.imdb.com/M/title-exact?Toy%20Story%2...
17,657,26,F,none,78704,1,3,Toy Story (1995),http://us.imdb.com/M/title-exact?Toy%20Story%2...
16,374,36,M,executive,78746,1,4,Toy Story (1995),http://us.imdb.com/M/title-exact?Toy%20Story%2...
15,296,43,F,administrator,16803,1,5,Toy Story (1995),http://us.imdb.com/M/title-exact?Toy%20Story%2...
689,374,36,M,executive,78746,2,4,GoldenEye (1995),http://us.imdb.com/M/title-exact?GoldenEye%20(...
723,666,44,M,administrator,61820,4,5,Get Shorty (1995),http://us.imdb.com/M/title-exact?Get%20Shorty%...
722,374,36,M,executive,78746,4,2,Get Shorty (1995),http://us.imdb.com/M/title-exact?Get%20Shorty%...
633,374,36,M,executive,78746,5,4,Copycat (1995),http://us.imdb.com/M/title-exact?Copycat%20(1995)
634,666,44,M,administrator,61820,5,2,Copycat (1995),http://us.imdb.com/M/title-exact?Copycat%20(1995)
