# Recommender system

We start by importing the MovieLens dataset, which contains 100,000 ratings.

In [1]:
import pandas as pd

r_cols = ['user_id', 'movie_id', 'rating']
ratings = pd.read_csv('u.data', sep='\t', names=r_cols, usecols=range(3), encoding="ISO-8859-1")

m_cols = ['movie_id', 'title']
movies = pd.read_csv('u.item', sep='|', names=m_cols, usecols=range(2), encoding="ISO-8859-1")

ratings = pd.merge(movies, ratings)



In [2]:
ratings.head(4)

| | movie_id | title | user_id | rating |
|---|---|---|---|---|
| 0 | 1 | Toy Story (1995) | 308 | 4 |
| 1 | 1 | Toy Story (1995) | 287 | 5 |
| 2 | 1 | Toy Story (1995) | 148 | 4 |
| 3 | 1 | Toy Story (1995) | 280 | 4 |


A pivot table is then built to give a DataFrame in which each row is a user and each column is a movie. Each cell holds that user's rating of the movie, or NaN if the user has not rated it.

In [3]:
mr = ratings.pivot_table(index=['user_id'],columns='title',values='rating')
mr.head()

title,'Til There Was You (1997),1-900 (1994),101 Dalmatians (1996),12 Angry Men (1957),187 (1997),2 Days in the Valley (1996),"20,000 Leagues Under the Sea (1954)",2001: A Space Odyssey (1968),3 Ninjas: High Noon At Mega Mountain (1998),"39 Steps, The (1935)",...,Yankee Zulu (1994),Year of the Horse (1997),You So Crazy (1994),Young Frankenstein (1974),Young Guns (1988),Young Guns II (1990),"Young Poisoner's Handbook, The (1995)",Zeus and Roxanne (1997),unknown,Á köldum klaka (Cold Fever) (1994)
user_id
0,,,,,,,,,,,...,,,,,,,,,,
1,,,2.0,5.0,,,3.0,4.0,,,...,,,,5.0,3.0,,,,4.0,
2,,,,,,,,,1.0,,...,,,,,,,,,,
3,,,,,2.0,,,,,,...,,,,,,,,,,
4,,,,,,,,,,,...,,,,,,,,,,
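
To see what the pivot does on a small scale, here is a minimal sketch on hypothetical toy data (the `toy` frame and its titles `A`, `B`, `C` are made up for illustration and are not part of MovieLens):

```python
import pandas as pd

# Hypothetical toy ratings, not taken from MovieLens.
toy = pd.DataFrame({
    'user_id': [1, 1, 2, 3, 3],
    'title':   ['A', 'B', 'A', 'B', 'C'],
    'rating':  [5, 3, 4, 2, 1],
})

# One row per user, one column per title; user/title pairs
# without a rating show up as NaN.
toy_mr = toy.pivot_table(index='user_id', columns='title', values='rating')
print(toy_mr)
```

Most cells in the real table are NaN for the same reason: each user has rated only a small share of the movies.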


Next, a correlation matrix is computed over the movie columns. Each entry is the correlation between the ratings two movies received from the users who rated both, which gives a convenient measure of how similar the two movies are.

In [4]:
mr.corr().head()

title,'Til There Was You (1997),1-900 (1994),101 Dalmatians (1996),12 Angry Men (1957),187 (1997),2 Days in the Valley (1996),"20,000 Leagues Under the Sea (1954)",2001: A Space Odyssey (1968),3 Ninjas: High Noon At Mega Mountain (1998),"39 Steps, The (1935)",...,Yankee Zulu (1994),Year of the Horse (1997),You So Crazy (1994),Young Frankenstein (1974),Young Guns (1988),Young Guns II (1990),"Young Poisoner's Handbook, The (1995)",Zeus and Roxanne (1997),unknown,Á köldum klaka (Cold Fever) (1994)
title
'Til There Was You (1997),1.0,,-1.0,-0.5,-0.5,0.522233,,-0.426401,,,...,,,,,,,,,,
1-900 (1994),,1.0,,,,,,-0.981981,,,...,,,,-0.944911,,,,,,
101 Dalmatians (1996),-1.0,,1.0,-0.04989,0.269191,0.048973,0.266928,-0.043407,,0.111111,...,,-1.0,,0.15884,0.119234,0.680414,0.0,0.707107,,
12 Angry Men (1957),-0.5,,-0.04989,1.0,0.666667,0.256625,0.274772,0.178848,,0.457176,...,,,,0.096546,0.068944,-0.361961,0.144338,1.0,1.0,
187 (1997),-0.5,,0.269191,0.666667,1.0,0.596644,,-0.5547,,1.0,...,,0.866025,,0.455233,-0.5,0.5,0.475327,,,


Being a Star Wars fan, I wanted to find movies similar or related to it.

In [6]:
sw_rating = mr['Star Wars (1977)']
sw_rating.head()

user_id
0    5.0
1    5.0
2    5.0
3    NaN
4    5.0
Name: Star Wars (1977), dtype: float64

In [7]:
# First, correlate every movie's ratings with the Star Wars ratings
similar = mr.corrwith(sw_rating)
# Movies with no users in common with Star Wars have no score; drop them
similar = similar.dropna()
df = similar.to_frame('Scores')



In [8]:
# Sort the movies by similarity score, highest first
df.sort_values(ascending=False,by='Scores').head(10)

| title | Scores |
|---|---|
| Hollow Reed (1996) | 1.0 |
| Commandments (1997) | 1.0 |
| Cosi (1996) | 1.0 |
| No Escape (1994) | 1.0 |
| Stripes (1981) | 1.0 |
| Star Wars (1977) | 1.0 |
| Man of the Year (1995) | 1.0 |
| Beans of Egypt, Maine, The (1994) | 1.0 |
| Old Lady Who Walked in the Sea, The (Vieille qui marchait dans la mer, La) (1991) | 1.0 |
| Outlaw, The (1943) | 1.0 |



But the results obtained don't make much sense: apart from Star Wars itself, the top matches are obscure titles with a perfect score of 1.0. Something important has been overlooked.

The likely cause is that the scores are skewed by movies rated by only a handful of people who also happened to rate Star Wars. With so few shared ratings, a perfect correlation can arise by chance, so these movies need to be filtered out.
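
The effect is easy to reproduce. In the sketch below (hypothetical toy data; the column names `Popular` and `Obscure` are made up), only two users rated both movies, so the Pearson correlation is computed from two points and can only come out as +1 or -1:

```python
import numpy as np
import pandas as pd

# Hypothetical ratings from five users for two movies.
toy = pd.DataFrame({
    'Popular': [5.0, 4.0, 2.0, 1.0, 3.0],
    'Obscure': [np.nan, 5.0, 3.0, np.nan, np.nan],
})

# Number of users who rated both movies.
overlap = len(toy.dropna())

# Series.corr ignores rows where either value is NaN, so only the
# two shared ratings are used here.
r = toy['Popular'].corr(toy['Obscure'])
print(overlap, r)
```

A score like this says little about how similar the two movies really are, which is why a minimum amount of shared data is worth requiring.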

A new DataFrame is created by grouping the initial table by title to get the mean rating and the sum of all ratings for each movie. Note that `sum` adds up the rating values; it is not the number of ratings.


In [10]:
import numpy as np
# 'sum' is the total of the rating values per movie, not the number of raters
stats = ratings.groupby('title').agg({'rating': ['mean', 'sum']})

In [11]:
stats.head()

| title | rating (mean) | rating (sum) |
|---|---|---|
| 'Til There Was You (1997) | 2.333333 | 21 |
| 1-900 (1994) | 2.6 | 13 |
| 101 Dalmatians (1996) | 2.908257 | 317 |
| 12 Angry Men (1957) | 4.344 | 543 |
| 187 (1997) | 3.02439 | 124 |


In [12]:
stats = stats.sort_values([('rating','sum')], ascending = False)

In [13]:
stats['rating']['sum'].describe()

count    1664.000000
mean      212.137620
std       310.776354
min         1.000000
25%        18.000000
50%        81.000000
75%       273.000000
max      2546.000000
Name: sum, dtype: float64


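The summary shows that the summed ratings per movie average about 212, with a median of 81. Because `sum` adds up rating values rather than counting raters, these figures overstate the number of people who rated each movie. We keep only the movies whose ratings sum to at least 100. This threshold can be tuned as required.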
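
If the intent is to filter on how many people rated a movie, rather than on the sum of their ratings, a count can be aggregated alongside the mean. A minimal sketch on hypothetical toy data (the `toy` frame and the threshold of 3 are assumptions for illustration, not values from this notebook):

```python
import pandas as pd

# Hypothetical ratings: three for movie A, two for movie B.
toy = pd.DataFrame({
    'title':  ['A', 'A', 'A', 'B', 'B'],
    'rating': [5, 4, 5, 1, 2],
})

# 'count' is the number of ratings; 'sum' is the total of their values.
toy_stats = toy.groupby('title')['rating'].agg(['mean', 'sum', 'count'])

# Keep only movies with at least 3 ratings.
enough = toy_stats[toy_stats['count'] >= 3]
print(enough)
```

Filtering on `count` keeps the threshold independent of how well a movie was liked, whereas a sum-based threshold lets a well-liked movie pass with fewer raters.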


In [14]:
stats.head()

| title | rating (mean) | rating (sum) |
|---|---|---|
| Star Wars (1977) | 4.359589 | 2546 |
| Fargo (1996) | 4.155512 | 2111 |
| Return of the Jedi (1983) | 4.00789 | 2032 |
| Contact (1997) | 3.803536 | 1936 |
| Raiders of the Lost Ark (1981) | 4.252381 | 1786 |


In [15]:
popular = stats['rating']['sum'] >= 100
pop_movies = stats[popular]
pop_movies.head(3)

| title | rating (mean) | rating (sum) |
|---|---|---|
| Star Wars (1977) | 4.359589 | 2546 |
| Fargo (1996) | 4.155512 | 2111 |
| Return of the Jedi (1983) | 4.00789 | 2032 |


In [16]:
# Join the popular movies with the similarity scores computed earlier, so that only popular titles keep a score
rel = pop_movies.join(df)



In [18]:
rel.sort_values(by='Scores',ascending=False).head()

| title | (rating, mean) | (rating, sum) | Scores |
|---|---|---|---|
| Star Wars (1977) | 4.359589 | 2546 | 1.0 |
| Empire Strikes Back, The (1980) | 4.206522 | 1548 | 0.748353 |
| Return of the Jedi (1983) | 4.00789 | 2032 | 0.672556 |
| Raiders of the Lost Ark (1981) | 4.252381 | 1786 | 0.536117 |
| Austin Powers: International Man of Mystery (1997) | 3.246154 | 422 | 0.377433 |


These results look much more sensible: the top matches include the Star Wars sequels.

# Making recommendations for users

User 0 was added to the data as a custom profile, and recommendations will be made for this user based on the movies they have rated.

In [11]:
profile = mr.loc[0].dropna()
profile

title
Empire Strikes Back, The (1980)    5.0
Gone with the Wind (1939)          1.0
Star Wars (1977)                   5.0
Name: 0, dtype: float64

In [12]:
type(profile)

pandas.core.series.Series

In [14]:
# Build the correlation matrix, keeping a value only where at least 100 users rated both movies.
corr_mat = mr.corr(min_periods=100)
corr_mat.head()

title,'Til There Was You (1997),1-900 (1994),101 Dalmatians (1996),12 Angry Men (1957),187 (1997),2 Days in the Valley (1996),"20,000 Leagues Under the Sea (1954)",2001: A Space Odyssey (1968),3 Ninjas: High Noon At Mega Mountain (1998),"39 Steps, The (1935)",...,Yankee Zulu (1994),Year of the Horse (1997),You So Crazy (1994),Young Frankenstein (1974),Young Guns (1988),Young Guns II (1990),"Young Poisoner's Handbook, The (1995)",Zeus and Roxanne (1997),unknown,Á köldum klaka (Cold Fever) (1994)
title
'Til There Was You (1997),,,,,,,,,,,...,,,,,,,,,,
1-900 (1994),,,,,,,,,,,...,,,,,,,,,,
101 Dalmatians (1996),,,1.0,,,,,,,,...,,,,,,,,,,
12 Angry Men (1957),,,,1.0,,,,,,,...,,,,,,,,,,
187 (1997),,,,,,,,,,,...,,,,,,,,,,
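
The effect of `min_periods` can be checked on a small example. This is a sketch on hypothetical toy data (the columns `A`, `B`, `C` and the threshold of 3 are made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical ratings: A and B share four raters, C has only two ratings.
toy = pd.DataFrame({
    'A': [5.0, 4.0, 2.0, 1.0],
    'B': [4.0, 5.0, 1.0, 2.0],
    'C': [np.nan, np.nan, 3.0, 5.0],
})

loose = toy.corr()                # any overlap is enough
strict = toy.corr(min_periods=3)  # require at least 3 shared raters

print(loose)
print(strict)
```

With the stricter setting, every pair involving `C` becomes NaN, including `C` with itself. This matches the output above, where sparsely rated movies no longer have a 1.0 on the diagonal.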


# Recommendations based on each movie:

In [32]:
for title, rating in profile.items():
    sims = corr_mat[title].dropna().sort_values(ascending=False)

    # Scale each similarity score by the rating this user gave the watched movie,
    # so that movies similar to the user's highly rated titles score higher
    # than movies similar to titles the user rated poorly.

    sims = sims * rating
    print('Because you watched ' + title + '...')
    print(sims.head(4))
    print('\n')

Because you watched Empire Strikes Back, The (1980)...
title
Empire Strikes Back, The (1980)    5.000000
Star Wars (1977)                   3.741763
Return of the Jedi (1983)          3.606146
Raiders of the Lost Ark (1981)     2.693297
Name: Empire Strikes Back, The (1980), dtype: float64


Because you watched Gone with the Wind (1939)...
title
Gone with the Wind (1939)            1.000000
Wizard of Oz, The (1939)             0.430219
E.T. the Extra-Terrestrial (1982)    0.361463
Schindler's List (1993)              0.344765
Name: Gone with the Wind (1939), dtype: float64


Because you watched Star Wars (1977)...
title
Star Wars (1977)                   5.000000
Empire Strikes Back, The (1980)    3.741763
Return of the Jedi (1983)          3.362779
Raiders of the Lost Ark (1981)     2.680586
Name: Star Wars (1977), dtype: float64




From these results, it can be seen that the recommendations for each movie include that same movie, because every movie is perfectly correlated with itself. This can be fixed by dropping the first element of each sorted Series.

In [30]:
for title, rating in profile.items():
    sims = corr_mat[title].dropna().sort_values(ascending=False)
    sims = sims * rating
    print('Because you watched ' + title + '...')
    # The first entry is the watched movie itself, so skip it
    print(sims.drop(sims.index[0]).head(4))
    print('\n')

Because you watched Empire Strikes Back, The (1980)...
title
Star Wars (1977)                        3.741763
Return of the Jedi (1983)               3.606146
Raiders of the Lost Ark (1981)          2.693297
Bridge on the River Kwai, The (1957)    1.783717
Name: Empire Strikes Back, The (1980), dtype: float64


Because you watched Gone with the Wind (1939)...
title
Wizard of Oz, The (1939)             0.430219
E.T. the Extra-Terrestrial (1982)    0.361463
Schindler's List (1993)              0.344765
Graduate, The (1967)                 0.326215
Name: Gone with the Wind (1939), dtype: float64


Because you watched Star Wars (1977)...
title
Empire Strikes Back, The (1980)                       3.741763
Return of the Jedi (1983)                             3.362779
Raiders of the Lost Ark (1981)                        2.680586
Austin Powers: International Man of Mystery (1997)    1.887164
Name: Star Wars (1977), dtype: float64
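
Dropping the first element relies on the watched movie sorting to the top. If another title ties at a correlation of 1.0, the sort order between them is not guaranteed, so a slightly safer variant is to drop the watched movie by its label. A minimal sketch with hypothetical scores (the titles `Watched`, `Tied`, and `Other` are made up):

```python
import pandas as pd

# Hypothetical similarity scores for one watched title.
sims = pd.Series({'Watched': 1.0, 'Tied': 1.0, 'Other': 0.4})
watched = 'Watched'
rating = 5.0

# Scale by the user's rating, then remove the watched title by name
# rather than by position.
scaled = (sims * rating).drop(labels=watched, errors='ignore')
scaled = scaled.sort_values(ascending=False)
print(scaled)
```

`errors='ignore'` keeps the call from failing if the watched title is absent from the scores.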




# Overall recommendations

Here we generate an overall list of recommendations based on all the movies watched.

In [31]:
parts = []
for title, rating in profile.items():
    sims = corr_mat[title].dropna().sort_values(ascending=False)
    # Weight the similarity scores by the user's rating of the watched movie
    parts.append(sims * rating)

# Series.append was removed in pandas 2.0; pd.concat stacks the pieces instead
recommendations = pd.concat(parts)

print('Working on your recommendations....\n')
print('Almost done...\n')
print(recommendations.sort_values(ascending=False))
    

Working on your recommendations....

Almost done...

Empire Strikes Back, The (1980)                       5.000000
Star Wars (1977)                                      5.000000
Star Wars (1977)                                      3.741763
Empire Strikes Back, The (1980)                       3.741763
Return of the Jedi (1983)                             3.606146
Return of the Jedi (1983)                             3.362779
Raiders of the Lost Ark (1981)                        2.693297
Raiders of the Lost Ark (1981)                        2.680586
Austin Powers: International Man of Mystery (1997)    1.887164
Sting, The (1973)                                     1.837692
Bridge on the River Kwai, The (1957)                  1.783717
Indiana Jones and the Last Crusade (1989)             1.750535
Cinderella (1950)                                     1.749598
Back to the Future (1985)                             1.726427
Terminator 2: Judgment Day (1991)                     1.667662
Fr


Again, the movies the user has already watched appear in the list. In addition, some movies appear more than once because they are similar to more than one of the watched movies.

Both problems are handled in the final step: the watched movies are dropped, and the duplicate entries are grouped so that their scores add up. A movie that is similar to several watched movies therefore ends up with a higher score.
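
The grouping step can be checked on a toy Series with a repeated index label. This is a sketch with hypothetical scores (the labels `X`, `Y`, `Z` are made up):

```python
import pandas as pd

# Two hypothetical score lists that both contain movie X.
recs = pd.concat([
    pd.Series({'X': 3.0, 'Y': 1.5}),
    pd.Series({'X': 2.0, 'Z': 0.5}),
])

# Group on the index so that repeated titles are summed into one entry.
combined = recs.groupby(level=0).sum().sort_values(ascending=False)
print(combined)
```

Here `X` appears in both lists, so its two scores are added together and it moves to the top.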

# Final recommendations

In [63]:
print('Working on your recommendations....\n')
print('Almost done...\n')
print('There you go!..\n')
f = recommendations.drop(profile.index)
f = f.groupby(f.index).sum()
f.sort_values(ascending=False).head(10)

Working on your recommendations....

Almost done...

There you go!..



Return of the Jedi (1983)                    7.178172
Raiders of the Lost Ark (1981)               5.519700
Indiana Jones and the Last Crusade (1989)    3.488028
Bridge on the River Kwai, The (1957)         3.366616
Back to the Future (1985)                    3.357941
Sting, The (1973)                            3.329843
Cinderella (1950)                            3.245412
Field of Dreams (1989)                       3.222311
Wizard of Oz, The (1939)                     3.200268
Dumbo (1941)                                 2.981645
dtype: float64
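
The steps above can be collected into one function. The sketch below is one possible arrangement, not the notebook's own code: the name `recommend` and its parameters are hypothetical, and it is exercised here on a made-up 3x3 correlation matrix rather than on `corr_mat`.

```python
import pandas as pd

def recommend(profile, corr_mat, n=10):
    """Hypothetical helper: rank unseen titles by rating-weighted similarity.

    profile  -- Series mapping watched titles to the user's ratings
    corr_mat -- square DataFrame of title-to-title correlations
    """
    # One weighted similarity list per watched title.
    parts = [corr_mat[title].dropna() * rating
             for title, rating in profile.items()]
    if not parts:
        return pd.Series(dtype='float64')

    scores = pd.concat(parts)
    # Sum the scores of titles that appear in more than one list.
    scores = scores.groupby(level=0).sum()
    # Remove the titles the user has already watched.
    scores = scores.drop(labels=profile.index, errors='ignore')
    return scores.sort_values(ascending=False).head(n)

# Made-up correlation matrix and profile for a quick check.
titles = ['A', 'B', 'C']
toy_corr = pd.DataFrame(
    [[1.0, 0.8, 0.2],
     [0.8, 1.0, 0.5],
     [0.2, 0.5, 1.0]],
    index=titles, columns=titles)
toy_profile = pd.Series({'A': 5.0})

top = recommend(toy_profile, toy_corr)
print(top)
```

Under these assumptions, a call such as `recommend(profile, corr_mat)` would follow the same sequence as the cells above: weight, concatenate, group, drop the watched titles, and sort.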