# Movie Recommendation System with Python
In this project, we'll develop a basic recommender system with Python and pandas.

Movies are suggested based on how similar their ratings are to those of another movie. This is not a robust recommendation system, but it is a reasonable starting point.

In [1]:
import numpy as np
import pandas as pd

We use two datasets:
- user ratings of movies
- a list of all movie titles and their ids

In [2]:
# Read the ratings dataset and the movie titles/ids, then merge them on item_id.
column_names = ['user_id', 'item_id', 'rating', 'timestamp']
df = pd.read_csv('user.data', sep='\t', names=column_names)
movie_titles = pd.read_csv("Movie_Id_Titles.txt")

df = pd.merge(df,movie_titles,on='item_id')
df.head()

,user_id,item_id,rating,timestamp,title
0,0,50,5,881250949,Star Wars (1977)
1,290,50,5,880473582,Star Wars (1977)
2,79,50,4,891271545,Star Wars (1977)
3,2,50,5,888552084,Star Wars (1977)
4,8,50,5,879362124,Star Wars (1977)
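pd.merge performs an inner join by default, so any rating whose item_id has no match in the titles file is silently dropped. The sketch below shows one way to check for that on made-up toy data; toy_ratings, toy_titles and their values are assumptions for illustration, not part of the dataset.

```python
import pandas as pd

# Toy stand-ins for the two files (values are made up for illustration).
toy_ratings = pd.DataFrame({
    "user_id": [1, 1, 2, 3],
    "item_id": [50, 172, 50, 999],
    "rating": [5, 4, 3, 2],
})
toy_titles = pd.DataFrame({
    "item_id": [50, 172],
    "title": ["Star Wars (1977)", "Empire Strikes Back, The (1980)"],
})

# Inner join (the default): item_id 999 has no title, so its rating is dropped.
inner = pd.merge(toy_ratings, toy_titles, on="item_id")

# Left join with indicator=True keeps every rating and flags unmatched ones.
left = pd.merge(toy_ratings, toy_titles, on="item_id", how="left", indicator=True)
unmatched = left[left["_merge"] == "left_only"]

print(len(toy_ratings), len(inner), len(unmatched))
```

If unmatched is empty on the real data, the inner join loses nothing.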


With the ratings and titles merged into a single frame, we can summarize each movie by its mean rating and its number of ratings.

In [3]:
# mean rating per title
ratings = pd.DataFrame(df.groupby('title')['rating'].mean())
# number of ratings per title (the Series aligns on the title index)
ratings['num of ratings'] = df.groupby('title')['rating'].count()
ratings.head()

title,rating,num of ratings
'Til There Was You (1997),2.333333,9
1-900 (1994),2.6,5
101 Dalmatians (1996),2.908257,109
12 Angry Men (1957),4.344,125
187 (1997),3.02439,41
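The same summary can also be computed in a single pass with agg. This is a minimal sketch on made-up toy data (toy_df and its values are assumptions for illustration), not a replacement for the cell above.

```python
import pandas as pd

# Made-up ratings, only to illustrate the aggregation.
toy_df = pd.DataFrame({
    "title": ["A", "A", "B"],
    "rating": [4, 5, 3],
})

# Compute mean and count together, then rename to match the notebook's columns.
toy_summary = (
    toy_df.groupby("title")["rating"]
    .agg(["mean", "count"])
    .rename(columns={"mean": "rating", "count": "num of ratings"})
)
print(toy_summary)
```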


Now that we have the ratings table, we build a user-by-movie matrix: one row per user, one column per title, with each cell holding that user's rating of that movie.

NaN means that the user has not rated that movie.

In [4]:
moviemat = df.pivot_table(index='user_id',columns='title',values='rating')
moviemat.head()

user_id,'Til There Was You (1997),1-900 (1994),101 Dalmatians (1996),12 Angry Men (1957),187 (1997),2 Days in the Valley (1996),"20,000 Leagues Under the Sea (1954)",2001: A Space Odyssey (1968),3 Ninjas: High Noon At Mega Mountain (1998),"39 Steps, The (1935)",...,Yankee Zulu (1994),Year of the Horse (1997),You So Crazy (1994),Young Frankenstein (1974),Young Guns (1988),Young Guns II (1990),"Young Poisoner's Handbook, The (1995)",Zeus and Roxanne (1997),unknown,Á köldum klaka (Cold Fever) (1994)
0,,,,,,,,,,,...,,,,,,,,,,
1,,,2.0,5.0,,,3.0,4.0,,,...,,,,5.0,3.0,,,,4.0,
2,,,,,,,,,1.0,,...,,,,,,,,,,
3,,,,,2.0,,,,,,...,,,,,,,,,,
4,,,,,,,,,,,...,,,,,,,,,,
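To see how pivot_table produces the NaN cells, here is a small sketch on made-up toy data (toy and its values are assumptions for illustration). Note that pivot_table averages by default, so if a user had rated the same title twice, the cell would hold the mean of those ratings.

```python
import pandas as pd

# Made-up long-format ratings: one row per (user, title) pair.
toy = pd.DataFrame({
    "user_id": [1, 1, 2, 3],
    "title": ["A", "B", "A", "B"],
    "rating": [5, 3, 4, 2],
})

# Rows become users, columns become titles; missing pairs become NaN.
toy_mat = toy.pivot_table(index="user_id", columns="title", values="rating")
print(toy_mat)

# Fraction of cells that actually hold a rating.
filled = toy_mat.notna().to_numpy().mean()
print(filled)
```

On real rating data this fraction is usually small, which is why most of the matrix shown above is NaN.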


With this matrix, producing recommendations is straightforward: .corrwith correlates every column of the matrix with the ratings of a target movie.

As an example, we look for movies similar to Star Wars.

In [5]:
starwars_user_ratings = moviemat['Star Wars (1977)']

similar_to_starwars = moviemat.corrwith(starwars_user_ratings)

corr_starwars = pd.DataFrame(similar_to_starwars,columns=['Correlation'])
# drop movies whose correlation with Star Wars is undefined (NaN)
corr_starwars.dropna(inplace=True)

corr_starwars.sort_values('Correlation',ascending=False).head(10)

title,Correlation
Hollow Reed (1996),1.0
Stripes (1981),1.0
Star Wars (1977),1.0
Man of the Year (1995),1.0
"Beans of Egypt, Maine, The (1994)",1.0
Safe Passage (1994),1.0
"Old Lady Who Walked in the Sea, The (Vieille qui marchait dans la mer, La) (1991)",1.0
"Outlaw, The (1943)",1.0
"Line King: Al Hirschfeld, The (1996)",1.0
Hurricane Streets (1998),1.0
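The many 1.0 values follow from how the correlation is computed: for each pair of columns, only the users who rated both movies are used. The sketch below illustrates this on a made-up matrix (toy_mat, its column names and its values are assumptions for illustration). A movie with two shared raters can correlate perfectly, and a movie with one shared rater gets NaN; NumPy may emit a RuntimeWarning for that last case.

```python
import numpy as np
import pandas as pd

nan = np.nan
# Made-up user-by-movie matrix; NaN means "not rated".
toy_mat = pd.DataFrame(
    {
        "Target": [5, 4, 3, 2, 1],
        "Popular": [4, 5, 3, 1, 2],          # rated by everyone
        "Obscure": [nan, nan, 4, 3, nan],    # only two users rated it
        "Single": [5, nan, nan, nan, nan],   # only one user rated it
    },
    index=pd.Index([1, 2, 3, 4, 5], name="user_id"),
)

# Pearson correlation of every column with the Target column,
# using only the rows where both columns have a rating.
corr = toy_mat.corrwith(toy_mat["Target"])
print(corr)
```

Here Popular correlates at 0.8 over five users, while Obscure reaches 1.0 from just two users, which is the same effect seen in the table above.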


These results are not reliable, because the number of ratings is not taken into account. A movie rated by only a handful of the users who also rated Star Wars can reach a correlation of 1.0 by chance.

We need to take the number of ratings into account as well.

In [6]:
corr_starwars = corr_starwars.join(ratings['num of ratings'])
corr_starwars.sort_values('Correlation', ascending=False).head()

title,Correlation,num of ratings
Hollow Reed (1996),1.0,6
Stripes (1981),1.0,5
Star Wars (1977),1.0,584
Man of the Year (1995),1.0,9
"Beans of Egypt, Maine, The (1994)",1.0,2


With the counts attached, the problem is visible: apart from Star Wars itself, the top titles have only a handful of ratings. We keep only movies whose number of ratings exceeds a threshold; in our case we try 200.

In [48]:
result = corr_starwars[corr_starwars['num of ratings']>200]
result.sort_values('Correlation',ascending=False).head()

title,Correlation,num of ratings
Star Wars (1977),1.0,584
"Empire Strikes Back, The (1980)",0.748353,368
Return of the Jedi (1983),0.672556,507
Raiders of the Lost Ark (1981),0.536117,420
"Sting, The (1973)",0.367538,241


Finally, we need to remove Star Wars itself from the results.

In [49]:
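# drop the target by label; comparing Correlation to 1.0 relies on exact float equality
result.drop('Star Wars (1977)', errors='ignore').sort_values('Correlation',ascending=False).head(10)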

title,Correlation,num of ratings
"Empire Strikes Back, The (1980)",0.748353,368
Return of the Jedi (1983),0.672556,507
Raiders of the Lost Ark (1981),0.536117,420
"Sting, The (1973)",0.367538,241
Indiana Jones and the Last Crusade (1989),0.350107,331
L.A. Confidential (1997),0.319065,297
E.T. the Extra-Terrestrial (1982),0.303619,300
Batman (1989),0.289344,201
Field of Dreams (1989),0.285286,212
Star Trek: The Wrath of Khan (1982),0.282206,244
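The num of ratings column counts every rating a movie received, while the correlation only uses the users who rated both that movie and the target. As a possible refinement, the overlap can be counted directly. This is a sketch on a made-up matrix (toy_mat, its column names and its values are assumptions for illustration), not something the notebook itself does.

```python
import numpy as np
import pandas as pd

nan = np.nan
# Made-up user-by-movie matrix; NaN means "not rated".
toy_mat = pd.DataFrame(
    {
        "Target": [5, 4, nan, 2, 1],
        "Popular": [4, 5, 3, 1, 2],
        "Obscure": [nan, nan, 4, 3, nan],
    },
    index=pd.Index([1, 2, 3, 4, 5], name="user_id"),
)

# Total ratings per movie (what 'num of ratings' measures).
total = toy_mat.count()

# Ratings per movie among users who also rated the target.
rated_target = toy_mat["Target"].notna()
co_rated = toy_mat[rated_target].count()

overlap = pd.DataFrame({"num of ratings": total, "co-rated with Target": co_rated})
print(overlap)
```

In this toy example Obscure has two ratings but only one of them comes from a user who also rated Target, so the overlap is the stricter measure.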


This looks pretty good now. We can also keep only the rows with a correlation above 0.5.

In [50]:
# drop the target by label rather than by comparing Correlation to 1.0
result = result.drop('Star Wars (1977)', errors='ignore')
result[result['Correlation']>0.5].sort_values('Correlation',ascending=False).head()

title,Correlation,num of ratings
"Empire Strikes Back, The (1980)",0.748353,368
Return of the Jedi (1983),0.672556,507
Raiders of the Lost Ark (1981),0.536117,420
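The steps above can be collected into one reusable function. The sketch below is one possible way to do it; similar_movies, its parameter names and the toy data are hypothetical and introduced only for illustration. It assumes a user-by-title matrix and a summary frame with a 'num of ratings' column, as built earlier.

```python
import pandas as pd

def similar_movies(moviemat, ratings, title, min_ratings=200, min_corr=0.5):
    """Hypothetical helper: movies whose ratings correlate with `title`."""
    # Correlate every movie with the target and drop undefined values.
    corr = moviemat.corrwith(moviemat[title]).dropna().rename("Correlation")
    # Attach the rating counts and remove the target itself by label.
    out = corr.to_frame().join(ratings["num of ratings"])
    out = out.drop(title, errors="ignore")
    # Apply both thresholds, then rank by correlation.
    keep = (out["num of ratings"] > min_ratings) & (out["Correlation"] > min_corr)
    return out[keep].sort_values("Correlation", ascending=False)

# Made-up long-format ratings to exercise the function.
rows = [
    (1, "Target", 5), (2, "Target", 4), (3, "Target", 3), (4, "Target", 2), (5, "Target", 1),
    (1, "Popular", 4), (2, "Popular", 5), (3, "Popular", 3), (4, "Popular", 1), (5, "Popular", 2),
    (3, "Obscure", 4), (4, "Obscure", 3),
]
toy = pd.DataFrame(rows, columns=["user_id", "title", "rating"])
toy_mat = toy.pivot_table(index="user_id", columns="title", values="rating")
toy_counts = (
    toy.groupby("title")["rating"]
    .agg(["mean", "count"])
    .rename(columns={"mean": "rating", "count": "num of ratings"})
)

top = similar_movies(toy_mat, toy_counts, "Target", min_ratings=3, min_corr=0.5)
print(top)
```

On the notebook's data, a call such as similar_movies(moviemat, ratings, 'Star Wars (1977)') would apply the same thresholds of 200 ratings and 0.5 correlation used above.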
