### Movie Recommendation System (Item Similarity)

We pick a movie, then use the user ratings to find other movies whose ratings correlate with it, and finally filter the results by number of ratings so that rarely rated titles do not dominate.

Let's import the necessary packages.

In [1]:
import numpy as np
import pandas as pd

In [2]:
movie_ratings = pd.read_csv("ratings.csv",sep=',')
movie_ratings.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


In [4]:
movie_id = pd.read_csv('movies.csv')
movie_id.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [5]:
movies_data = pd.merge(movie_ratings, movie_id, on='movieId')
movies_data.head()

Unnamed: 0,userId,movieId,rating,timestamp,title,genres
0,1,1,4.0,964982703,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,5,1,4.0,847434962,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
2,7,1,4.5,1106635946,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
3,15,1,2.5,1510577970,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
4,17,1,4.5,1305696483,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
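
The merge above attaches a title to every rating by matching on `movieId`. As a minimal sketch on made-up data (the `toy_*` names are illustrative and not part of the notebook), the optional `validate` argument of `pd.merge` can confirm that each `movieId` appears only once in the movie table, so that no rating is silently duplicated:

```python
import pandas as pd

# Made-up ratings and movie tables shaped like ratings.csv and movies.csv.
toy_ratings = pd.DataFrame({
    "userId": [1, 2, 1],
    "movieId": [1, 1, 2],
    "rating": [4.0, 4.5, 3.0],
})
toy_movies = pd.DataFrame({
    "movieId": [1, 2, 3],
    "title": ["Toy Story (1995)", "Jumanji (1995)", "Grumpier Old Men (1995)"],
})

# Inner join on movieId; validate raises MergeError if a movieId is
# repeated on the right-hand (movie) side.
toy_merged = pd.merge(
    toy_ratings, toy_movies, on="movieId", how="inner", validate="many_to_one"
)
```

Because the join is an inner join, the movie with no ratings (`movieId` 3 in this toy table) does not appear in the result.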


In [6]:
movies_data.drop(['genres'],axis=1,inplace=True)

In [7]:
movies_data.head()

Unnamed: 0,userId,movieId,rating,timestamp,title
0,1,1,4.0,964982703,Toy Story (1995)
1,5,1,4.0,847434962,Toy Story (1995)
2,7,1,4.5,1106635946,Toy Story (1995)
3,15,1,2.5,1510577970,Toy Story (1995)
4,17,1,4.5,1305696483,Toy Story (1995)


In [8]:
movie_user_grid = movies_data.pivot_table(index='userId',columns='title',values='rating')
movie_user_grid.head()

title,'71 (2014),'Hellboy': The Seeds of Creation (2004),'Round Midnight (1986),'Salem's Lot (2004),'Til There Was You (1997),'Tis the Season for Love (2015),"'burbs, The (1989)",'night Mother (1986),(500) Days of Summer (2009),*batteries not included (1987),...,Zulu (2013),[REC] (2007),[REC]² (2009),[REC]³ 3 Génesis (2012),anohana: The Flower We Saw That Day - The Movie (2013),eXistenZ (1999),xXx (2002),xXx: State of the Union (2005),¡Three Amigos! (1986),À nous la liberté (Freedom for Us) (1931)
userId,,,,,,,,,,,,,,,,,,,,,
1,,,,,,,,,,,...,,,,,,,,,4.0,
2,,,,,,,,,,,...,,,,,,,,,,
3,,,,,,,,,,,...,,,,,,,,,,
4,,,,,,,,,,,...,,,,,,,,,,
5,,,,,,,,,,,...,,,,,,,,,,
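
To make the shape of this grid concrete, here is a small sketch on made-up ratings (the `toy_*` names are illustrative). It shows how `pivot_table` turns long-format ratings into one row per user and one column per title, with `NaN` wherever a user has not rated a movie:

```python
import pandas as pd

# Made-up ratings in the same long format as movies_data.
toy_ratings = pd.DataFrame({
    "userId": [1, 1, 2, 2, 3],
    "title": ["A", "B", "A", "C", "B"],
    "rating": [4.0, 3.0, 5.0, 2.0, 4.5],
})

# One row per user, one column per title; unrated pairs become NaN.
# pivot_table averages duplicates (its default aggfunc is the mean).
toy_grid = toy_ratings.pivot_table(index="userId", columns="title", values="rating")

# Fraction of user/title cells with no rating.
toy_sparsity = toy_grid.isna().to_numpy().mean()
```

Most cells in the real grid are empty for the same reason: each user rates only a small fraction of the catalogue.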


In [9]:
temp = movies_data.groupby('title')['rating']
recommender_data = pd.DataFrame(temp.mean())
recommender_data['count'] = temp.count()
recommender_data.rename(columns = {'rating':'mean ratings'},inplace=True)
recommender_data.head()

title,mean ratings,count
'71 (2014),4.0,1
'Hellboy': The Seeds of Creation (2004),4.0,1
'Round Midnight (1986),3.5,2
'Salem's Lot (2004),5.0,1
'Til There Was You (1997),4.0,2


Most-rated movies by count (an impressive list, as expected):

In [10]:
recommender_data.sort_values('count',ascending=False).head()

title,mean ratings,count
Forrest Gump (1994),4.164134,329
"Shawshank Redemption, The (1994)",4.429022,317
Pulp Fiction (1994),4.197068,307
"Silence of the Lambs, The (1991)",4.16129,279
"Matrix, The (1999)",4.192446,278


Highest-rated movies by mean rating (quite a disappointing list: every title shown has a single rating, so the mean alone is not a good criterion for recommendation):

In [11]:
recommender_data.sort_values('mean ratings',ascending=False).head()

title,mean ratings,count
Gena the Crocodile (1969),5.0,1
True Stories (1986),5.0,1
Cosmic Scrat-tastrophe (2015),5.0,1
Love and Pigeons (1985),5.0,1
Red Sorghum (Hong gao liang) (1987),5.0,1
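
One possible way to make the mean rating more useful is to require a minimum number of ratings before ranking. The sketch below uses made-up statistics, and the threshold `min_ratings` is a hypothetical value chosen only for illustration:

```python
import pandas as pd

# Made-up per-title statistics with the same columns as recommender_data.
toy_stats = pd.DataFrame(
    {"mean ratings": [5.0, 4.4, 4.2, 5.0], "count": [1, 317, 329, 2]},
    index=pd.Index(["Obscure A", "Popular B", "Popular C", "Obscure D"], name="title"),
)

min_ratings = 50  # hypothetical threshold, not taken from the notebook

# Keep only titles with enough ratings, then rank by mean rating.
top_rated = (
    toy_stats[toy_stats["count"] >= min_ratings]
    .sort_values("mean ratings", ascending=False)
)
```

With the filter in place, the titles rated only once or twice no longer crowd out the widely rated ones.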


Let's recommend some movies to Forrest Gump watchers.

In [12]:
forrest_gump_ratings = movie_user_grid['Forrest Gump (1994)']
forrest_gump_ratings.head()

userId
1    4.0
2    NaN
3    NaN
4    NaN
5    NaN
Name: Forrest Gump (1994), dtype: float64

Let's find similar movies by correlating every movie's ratings with Forrest Gump's, then drop the NaN results.

In [13]:
similar_to_forrest = movie_user_grid.corrwith(forrest_gump_ratings)
similar_to_forrest.dropna(inplace=True)
similar_to_forrest.head()


title
'burbs, The (1989)                0.197712
(500) Days of Summer (2009)       0.234095
*batteries not included (1987)    0.892710
...And Justice for All (1979)     0.928571
10 Cent Pistol (2015)            -1.000000
dtype: float64
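
To see why some of these correlations are exactly 1 or -1, here is a minimal sketch on a made-up grid (all names and values are illustrative). `corrwith` computes a Pearson correlation for each column using only the users who rated both movies, so a movie that shares just two raters with the target always scores 1 or -1, and a movie that shares one rater gives `NaN`:

```python
import numpy as np
import pandas as pd

# Made-up user x title grid; NaN means "not rated".
toy_grid = pd.DataFrame(
    {
        "Target": [5.0, 4.0, 2.0, 1.0],
        "Similar": [4.5, 4.0, 2.5, np.nan],
        "TwoRaters": [np.nan, np.nan, 3.0, 2.5],
        "OneRater": [3.0, np.nan, np.nan, np.nan],
    },
    index=pd.Index([1, 2, 3, 4], name="userId"),
)

# Pearson correlation of every column with the target column,
# computed over the users who rated both movies.
toy_similarity = toy_grid.corrwith(toy_grid["Target"])

# Number of users each movie shares with the target.
overlap = toy_grid.notna().loc[toy_grid["Target"].notna()].sum()
```

Comparing `toy_similarity` with `overlap` shows that the extreme scores belong to the columns with the smallest overlap, which is why a count filter is needed before the scores can be trusted.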

In [14]:
rel_forrest = pd.DataFrame(similar_to_forrest,columns=['Similarity'])
rel_forrest.head()

title,Similarity
"'burbs, The (1989)",0.197712
(500) Days of Summer (2009),0.234095
*batteries not included (1987),0.89271
...And Justice for All (1979),0.928571
10 Cent Pistol (2015),-1.0


Let's sort these by similarity (descending, of course).

In [15]:
rel_forrest.sort_values('Similarity',ascending=False).head()

title,Similarity
Lost & Found (1999),1.0
"Century of the Self, The (2002)",1.0
The 5th Wave (2016),1.0
Play Time (a.k.a. Playtime) (1967),1.0
Memories (Memorîzu) (1995),1.0


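To be honest, I have never heard of any of these films. A perfect correlation like this usually means the movie shares only a handful of raters with Forrest Gump, so let's filter by the number of ratings.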

In [16]:
rel_forrest = rel_forrest.join(recommender_data['count'])
rel_forrest.head()

title,Similarity,count
"'burbs, The (1989)",0.197712,17
(500) Days of Summer (2009),0.234095,42
*batteries not included (1987),0.89271,7
...And Justice for All (1979),0.928571,3
10 Cent Pistol (2015),-1.0,2


In [17]:
rel_forrest[rel_forrest['count']>100].sort_values('Similarity',ascending=False).head()

title,Similarity,count
Forrest Gump (1994),1.0,329
Good Will Hunting (1997),0.484042,141
Aladdin (1992),0.464268,183
American History X (1998),0.457287,129
"Truman Show, The (1998)",0.432556,125


I guess that's a good result :)
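
The steps in this notebook can be collected into one reusable function. This is only a sketch under stated assumptions: the function name `recommend_similar` and its parameters are hypothetical, the input is assumed to be a long-format frame with `userId`, `title` and `rating` columns, and, unlike the table above, the queried movie itself is dropped from the result.

```python
import pandas as pd


def recommend_similar(ratings, title, min_count=100, top_n=5):
    """Return the titles whose ratings correlate most with `title`.

    `ratings` is assumed to have userId, title and rating columns.
    """
    # User x title grid, as built with pivot_table earlier.
    grid = ratings.pivot_table(index="userId", columns="title", values="rating")
    # Pearson correlation of every movie with the chosen one.
    similarity = grid.corrwith(grid[title]).dropna().rename("Similarity")
    # Number of ratings per title, used to filter out rarely rated movies.
    counts = ratings.groupby("title")["rating"].count().rename("count")
    result = pd.concat([similarity, counts], axis=1, join="inner")
    # Keep well-rated titles and drop the queried movie itself.
    result = result[(result["count"] > min_count) & (result.index != title)]
    return result.sort_values("Similarity", ascending=False).head(top_n)


# Made-up data: "Like" mirrors "Target", "Unlike" reverses it,
# and "Rare" has a single rating.
toy = pd.DataFrame({
    "userId": [1, 2, 3, 4] * 3 + [1],
    "title": ["Target"] * 4 + ["Like"] * 4 + ["Unlike"] * 4 + ["Rare"],
    "rating": [5, 4, 2, 1, 5, 4, 2, 1, 1, 2, 4, 5, 3],
})
toy_recommendations = recommend_similar(toy, "Target", min_count=2)
```

On the real data, a call such as `recommend_similar(movies_data, 'Forrest Gump (1994)')` would follow the same steps as cells 13 to 17.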