# Movie Recommendation System

## Objective

Use the MovieLens dataset (https://grouplens.org/datasets/movielens/latest/) to perform analytics and machine learning for the following tasks:

-    **Task 1: Recommend movies based on the ratings provided by users**
-    **Task 2: Recommend movies based on the viewing experience of similar users**

We will break these tasks into further sub-tasks.

### Importing Necessary Libraries

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import re


%matplotlib inline

### Reading the CSV Files into the Project

In [2]:
ratingData = pd.read_csv('Dataset/ratings.csv')

moviesData = pd.read_csv('Dataset/movies.csv')

linksData = pd.read_csv('Dataset/links.csv')

In [3]:
ratingData.head()

,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


In [4]:
moviesData.head()

,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [5]:
linksData.head()

,movieId,imdbId,tmdbId
0,1,114709,862.0
1,2,113497,8844.0
2,3,113228,15602.0
3,4,114885,31357.0
4,5,113041,11862.0
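The `tmdbId` column above is displayed as a float (`862.0`) even though the IDs are whole numbers. A likely explanation is that some rows of `links.csv` have no `tmdbId`, and pandas stores an integer column with missing values as `float64` by default. The sketch below reproduces that behaviour on a made-up two-row CSV; the text in `csv_text` and the name `links` are illustrative assumptions, not part of the dataset.

```python
import io

import pandas as pd

# Hypothetical two-row stand-in for links.csv; the second row has no tmdbId.
csv_text = "movieId,imdbId,tmdbId\n1,114709,862\n2,113497,\n"
links = pd.read_csv(io.StringIO(csv_text))

# imdbId has no gaps and stays integer; tmdbId contains NaN and becomes float.
print(links.dtypes)
print(links["tmdbId"].isna().sum(), "missing tmdbId value(s)")
```

Running `linksData.isna().sum()` on the real file would show whether this is what happens here.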


## Task 1: Recommend Movies Based on the Ratings Provided by Users

To achieve this task, we will perform the following sub-tasks:
-    **Merging both the ratings and movies datasets**
-    **Grouping the merged dataset by movieId**
-    **Getting the rating count and mean rating**
-    **Calculating the recommendation score**
-    **Top 10 recommended movies**
-    **Top 10 movies based on ratings**

### Merging Both Ratings and Movies Datasets
To achieve this, we use the **merge** function of the pandas library, joining the two datasets on `movieId`.

In [6]:
wholeData = pd.merge(ratingData,moviesData,on="movieId")


wholeData.head()

,userId,movieId,rating,timestamp,title,genres
0,1,1,4.0,964982703,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,5,1,4.0,847434962,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
2,7,1,4.5,1106635946,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
3,15,1,2.5,1510577970,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
4,17,1,4.5,1305696483,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
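To make the join semantics concrete, here is a minimal sketch on two toy frames. The frames `ratings` and `movies` and their values are invented for illustration. With the default `how="inner"`, `pd.merge` keeps only the `movieId` values present in both inputs, so ratings for unknown movies and movies without ratings are dropped.

```python
import pandas as pd

# Toy stand-ins for ratings.csv and movies.csv (values are illustrative only).
ratings = pd.DataFrame({
    "userId": [1, 1, 2, 3],
    "movieId": [10, 20, 10, 30],   # movieId 30 has no entry in `movies`
    "rating": [4.0, 3.5, 5.0, 2.0],
})
movies = pd.DataFrame({
    "movieId": [10, 20, 40],       # movieId 40 has no ratings
    "title": ["Movie A", "Movie B", "Movie D"],
})

# Default how="inner": keep only movieIds that appear in both frames.
merged = pd.merge(ratings, movies, on="movieId")
print(merged)
```

Passing `how="left"` instead would keep every rating and leave `title` as NaN where no movie matches.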


In [7]:
wholeData.dtypes

userId         int64
movieId        int64
rating       float64
timestamp      int64
title         object
genres        object
dtype: object

### Grouping the Merged Dataset by movieId

We build a new DataFrame with one row per movie by grouping on `movieId`.

In [8]:
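newData = pd.DataFrame()

# One row per movieId; unique() yields an array per group, so astype(str) keeps the brackets.
newData['title'] = wholeData.groupby('movieId')['title'].unique().astype(str)
# Copy the index (movieId) into a regular column.
newData['movieId'] = newData.index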

newData.head(10)

movieId,title,movieId
1,['Toy Story (1995)'],1
2,['Jumanji (1995)'],2
3,['Grumpier Old Men (1995)'],3
4,['Waiting to Exhale (1995)'],4
5,['Father of the Bride Part II (1995)'],5
6,['Heat (1995)'],6
7,['Sabrina (1995)'],7
8,['Tom and Huck (1995)'],8
9,['Sudden Death (1995)'],9
10,['GoldenEye (1995)'],10
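The titles above carry list brackets because each group's unique values are converted to a string. If plain titles are preferred, one possible alternative (not what the notebook does) is to take the first title in each group. The sketch below uses a made-up `merged` frame as a stand-in for `wholeData`.

```python
import pandas as pd

# Illustrative stand-in for the merged ratings/movies data.
merged = pd.DataFrame({
    "movieId": [1, 1, 2],
    "title": ["Toy Story (1995)", "Toy Story (1995)", "Jumanji (1995)"],
    "rating": [4.0, 4.5, 3.0],
})

# first() returns the first title seen in each group as a plain string.
titles = merged.groupby("movieId")["title"].first()
print(titles)
```

This relies on every `movieId` mapping to a single title, which holds for the rows shown above.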


### Getting the Rating Count and Mean Rating
To achieve this, we apply the **count** and **mean** functions to the ratings of each movie.

In [9]:
newData['count'] = wholeData.groupby('movieId')['rating'].count() 
newData['avg.rating'] = wholeData.groupby('movieId')['rating'].mean()

newData.head()

movieId,title,movieId,count,avg.rating
1,['Toy Story (1995)'],1,215,3.92093
2,['Jumanji (1995)'],2,110,3.431818
3,['Grumpier Old Men (1995)'],3,52,3.259615
4,['Waiting to Exhale (1995)'],4,7,2.357143
5,['Father of the Bride Part II (1995)'],5,49,3.071429
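Both statistics can also be computed in a single groupby pass with named aggregation. This is only a sketch of an equivalent formulation; the frame `merged` and the column name `avg_rating` are illustrative assumptions.

```python
import pandas as pd

# Illustrative ratings: movie 1 has three ratings, movie 2 has two.
merged = pd.DataFrame({
    "movieId": [1, 1, 1, 2, 2],
    "rating": [4.0, 5.0, 3.0, 2.0, 4.0],
})

# Named aggregation: one groupby pass, one output column per keyword.
stats = merged.groupby("movieId")["rating"].agg(count="count", avg_rating="mean")
print(stats)
```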


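### Calculating the Recommendation Score
To achieve this, we multiply **count** by **avg.rating**. The product equals the sum of all ratings a movie received, so the score rewards movies that are both widely and highly rated.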

In [10]:
newData['recommendScore'] = newData['count'] * newData['avg.rating']

newData.head()

movieId,title,movieId,count,avg.rating,recommendScore
1,['Toy Story (1995)'],1,215,3.92093,843.0
2,['Jumanji (1995)'],2,110,3.431818,377.5
3,['Grumpier Old Men (1995)'],3,52,3.259615,169.5
4,['Waiting to Exhale (1995)'],4,7,2.357143,16.5
5,['Father of the Bride Part II (1995)'],5,49,3.071429,150.5
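Since the mean is the sum divided by the count, count times mean is just the sum of the ratings (for example, 215 × 3.92093 ≈ 843.0 for Toy Story above). The following sketch checks this on invented data; `merged`, `score`, and `total` are illustrative names.

```python
import pandas as pd

# Illustrative ratings for two movies.
merged = pd.DataFrame({
    "movieId": [1, 1, 1, 2, 2],
    "rating": [4.0, 5.0, 3.5, 2.0, 4.5],
})

grouped = merged.groupby("movieId")["rating"]
score = grouped.count() * grouped.mean()   # count x mean, as in the cell above
total = grouped.sum()                      # plain sum of the ratings

# The two agree up to floating-point rounding.
print((score - total).abs().max())
```

One consequence is that a movie with many mediocre ratings can outscore a movie with few excellent ones.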


### Top 10 Recommended Movies
To achieve this, we sort the movies by recommendation score using the **sort_values** function.

In [11]:
top = newData.sort_values('recommendScore',ascending=False)
print("Below are the Top 10 Recommended Movies:")
top.head(10)['title']


Below are the Top 10 Recommended Movies:


movieId
318              ['Shawshank Redemption, The (1994)']
356                           ['Forrest Gump (1994)']
296                           ['Pulp Fiction (1994)']
2571                           ['Matrix, The (1999)']
593              ['Silence of the Lambs, The (1991)']
260     ['Star Wars: Episode IV - A New Hope (1977)']
110                             ['Braveheart (1995)']
2959                            ['Fight Club (1999)']
527                       ["Schindler's List (1993)"]
480                          ['Jurassic Park (1993)']
Name: title, dtype: object
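As a side note, pandas also offers `DataFrame.nlargest`, which selects the top rows without sorting the whole frame explicitly. The sketch below compares both approaches on a made-up `scores` frame; it is an alternative spelling, not a change to the method used above.

```python
import pandas as pd

# Illustrative scores for four hypothetical movies.
scores = pd.DataFrame(
    {"title": ["A", "B", "C", "D"], "recommendScore": [10.0, 40.0, 25.0, 5.0]}
)

# Sort everything, then take the head ...
by_sort = scores.sort_values("recommendScore", ascending=False).head(2)
# ... or ask directly for the two largest scores.
by_nlargest = scores.nlargest(2, "recommendScore")

print(by_sort["title"].tolist(), by_nlargest["title"].tolist())
```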

### Top 10 Movies Based on Ratings
To achieve this, we first drop every movie with 100 or fewer ratings (the filter keeps `count > 100`), then sort the remaining movies by average rating.

In [12]:
newData = newData[newData['count']>100]
top2 = newData.sort_values('avg.rating',ascending=False)

top2.head(10)['title']

movieId
318               ['Shawshank Redemption, The (1994)']
858                          ['Godfather, The (1972)']
2959                             ['Fight Club (1999)']
1221                ['Godfather: Part II, The (1974)']
48516                         ['Departed, The (2006)']
1213                             ['Goodfellas (1990)']
58559                      ['Dark Knight, The (2008)']
50                      ['Usual Suspects, The (1995)']
1197                    ['Princess Bride, The (1987)']
260      ['Star Wars: Episode IV - A New Hope (1977)']
Name: title, dtype: object
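The filter-then-sort step can be sketched on toy data as follows. The frame `stats` and its values are invented. Unlike the cell above, this sketch stores the filtered rows under a new name so the unfiltered table stays available.

```python
import pandas as pd

# Illustrative per-movie statistics, indexed by movieId.
stats = pd.DataFrame(
    {"count": [150, 100, 3, 220], "avg.rating": [4.1, 4.8, 5.0, 3.9]},
    index=pd.Index([1, 2, 3, 4], name="movieId"),
)

min_count = 100  # same threshold as the cell above
# Strict comparison: a movie with exactly 100 ratings is dropped.
popular = stats[stats["count"] > min_count]
ranked = popular.sort_values("avg.rating", ascending=False)
print(ranked)
```

Without the threshold, movie 3 would rank first on the strength of only three ratings.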

## Task 2: Recommend Movies Based on the Viewing Experience of Similar Users
We use the **pivot_table** function to build a matrix with one row per user ID and one column per movie title. A cell is NaN when that user has not rated that movie.

In [14]:
movieData = wholeData.pivot_table(index = 'userId',columns='title',values = 'rating')

movieData.head()

userId,'71 (2014),'Hellboy': The Seeds of Creation (2004),'Round Midnight (1986),'Salem's Lot (2004),'Til There Was You (1997),'Tis the Season for Love (2015),"'burbs, The (1989)",'night Mother (1986),(500) Days of Summer (2009),*batteries not included (1987),...,Zulu (2013),[REC] (2007),[REC]² (2009),[REC]³ 3 Génesis (2012),anohana: The Flower We Saw That Day - The Movie (2013),eXistenZ (1999),xXx (2002),xXx: State of the Union (2005),¡Three Amigos! (1986),À nous la liberté (Freedom for Us) (1931)
1,,,,,,,,,,,...,,,,,,,,,4.0,
2,,,,,,,,,,,...,,,,,,,,,,
3,,,,,,,,,,,...,,,,,,,,,,
4,,,,,,,,,,,...,,,,,,,,,,
5,,,,,,,,,,,...,,,,,,,,,,
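The reshaping is easier to see on a handful of rows. In this sketch the frame `merged` is a made-up stand-in for `wholeData`, and `wide` plays the role of `movieData`.

```python
import pandas as pd

# Long format: one row per (user, movie) rating.
merged = pd.DataFrame({
    "userId": [1, 1, 2, 3],
    "title": ["Movie A", "Movie B", "Movie A", "Movie B"],
    "rating": [4.0, 3.0, 5.0, 2.0],
})

# Wide format: rows are users, columns are titles, cells are ratings.
# pivot_table averages duplicates, and missing combinations become NaN.
wide = merged.pivot_table(index="userId", columns="title", values="rating")
print(wide)
```

User 2 never rated "Movie B", so that cell is NaN, just like most cells in the real matrix.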


### Getting Most Rated Movies

In [16]:
newData.sort_values('count',ascending=False).head(10)

movieId,title,movieId,count,avg.rating,recommendScore
356,['Forrest Gump (1994)'],356,329,4.164134,1370.0
318,"['Shawshank Redemption, The (1994)']",318,317,4.429022,1404.0
296,['Pulp Fiction (1994)'],296,307,4.197068,1288.5
593,"['Silence of the Lambs, The (1991)']",593,279,4.16129,1161.0
2571,"['Matrix, The (1999)']",2571,278,4.192446,1165.5
260,['Star Wars: Episode IV - A New Hope (1977)'],260,251,4.231076,1062.0
480,['Jurassic Park (1993)'],480,238,3.75,892.5
110,['Braveheart (1995)'],110,237,4.031646,955.5
589,['Terminator 2: Judgment Day (1991)'],589,224,3.970982,889.5
527,"[""Schindler's List (1993)""]",527,220,4.225,929.5


#### Now let's select two movies, 'Forrest Gump (1994)' and 'Matrix, The (1999)', and grab the user ratings for them

In [17]:
forrestgumpRatings = movieData['Forrest Gump (1994)']
matrixRatings = movieData['Matrix, The (1999)']

matrixRatings.head()

userId
1    5.0
2    NaN
3    NaN
4    1.0
5    NaN
Name: Matrix, The (1999), dtype: float64

### Getting Correlations

We now use the **corrwith()** function to compute the correlation between each movie's column of ratings and the ratings Series of the selected movie.

In [18]:
similar_to_forrest_gump = movieData.corrwith(forrestgumpRatings)
similar_to_matrix = movieData.corrwith(matrixRatings)

similar_to_forrest_gump.head()



title
'71 (2014)                                NaN
'Hellboy': The Seeds of Creation (2004)   NaN
'Round Midnight (1986)                    NaN
'Salem's Lot (2004)                       NaN
'Til There Was You (1997)                 NaN
dtype: float64
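The NaN values come from how the correlation handles missing ratings: each column is compared with the target using only the users who rated both movies, and with fewer than two such users no correlation can be computed. The sketch below illustrates this on an invented matrix; the column names and values are assumptions chosen to show the three typical outcomes.

```python
import warnings

import numpy as np
import pandas as pd

# Illustrative user x movie matrix; NaN means "not rated".
wide = pd.DataFrame(
    {
        "Target": [5.0, 4.0, 1.0, 2.0],
        "Agrees": [4.5, 4.0, 1.5, np.nan],           # similar taste, 3 shared users
        "Opposes": [1.0, 2.0, 5.0, 4.0],             # exactly reversed ratings
        "OneOverlap": [np.nan, np.nan, 3.0, np.nan], # only 1 shared user
    },
    index=pd.Index([1, 2, 3, 4], name="userId"),
)

# Columns with a single shared rating may trigger NumPy RuntimeWarnings,
# so they are silenced here to keep the output readable.
with warnings.catch_warnings():
    warnings.simplefilter("ignore", RuntimeWarning)
    corr = wide.corrwith(wide["Target"])

print(corr)
```

"Agrees" gets a strongly positive value, "Opposes" gets -1, and "OneOverlap" gets NaN.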

### Converting the Forrest Gump Correlations into a DataFrame and Dropping NaN Values

In [19]:
corr_forrest = pd.DataFrame(similar_to_forrest_gump,columns = ['correlation'])
corr_forrest.dropna(inplace=True)

corr_forrest.head()

title,correlation
"'burbs, The (1989)",0.197712
(500) Days of Summer (2009),0.234095
*batteries not included (1987),0.89271
...And Justice for All (1979),0.928571
10 Cent Pistol (2015),-1.0


### Top 10 Movies Most Correlated with Forrest Gump

In [20]:
corr_forrest.sort_values('correlation',ascending=False).head(10)

title,correlation
Lost & Found (1999),1.0
"Century of the Self, The (2002)",1.0
The 5th Wave (2016),1.0
Play Time (a.k.a. Playtime) (1967),1.0
Memories (Memorîzu) (1995),1.0
Playing God (1997),1.0
Killers (2010),1.0
"Girl Walks Home Alone at Night, A (2014)",1.0
Tampopo (1985),1.0
"Cercle Rouge, Le (Red Circle, The) (1970)",1.0


### Repeating the Same Process for The Matrix

In [21]:
corr_matrix = pd.DataFrame(similar_to_matrix,columns = ['correlation'])
corr_matrix.dropna(inplace=True)

corr_matrix.head()

title,correlation
"'burbs, The (1989)",-0.160843
(500) Days of Summer (2009),0.302316
*batteries not included (1987),0.392232
...And Justice for All (1979),0.654654
10 Cent Pistol (2015),-1.0


In [23]:
corr_matrix.sort_values('correlation',ascending=False).head(10)

title,correlation
Haywire (2011),1.0
Highway 61 (1991),1.0
World on a Wire (Welt am Draht) (1973),1.0
"War Zone, The (1999)",1.0
"Hitcher, The (1986)",1.0
Gross Anatomy (a.k.a. A Cut Above) (1989),1.0
Paper Towns (2015),1.0
Juwanna Mann (2002),1.0
Topsy-Turvy (1999),1.0
All the King's Men (2006),1.0
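
Every correlation in the two top-10 lists is exactly 1.0, which is what Pearson correlation tends to produce when only a couple of users rated both movies. A common refinement, not part of the notebook above, is to count how many users rated both movies and keep only pairs above a minimum overlap. The sketch below shows the idea on invented data; `min_overlap` and all column names are hypothetical.

```python
import numpy as np
import pandas as pd

# Illustrative user x movie matrix; NaN means "not rated".
wide = pd.DataFrame(
    {
        "Target": [5.0, 4.0, 1.0, 2.0, 3.0],
        "WellRated": [4.0, 4.5, 2.0, 1.0, 3.0],         # rated by all five users
        "Obscure": [np.nan, np.nan, 1.0, 2.0, np.nan],  # rated by two users only
    },
    index=pd.Index([1, 2, 3, 4, 5], name="userId"),
)
target = wide["Target"]

corr = wide.corrwith(target).rename("correlation").to_frame()
# Number of users who rated both the target and each other movie.
corr["overlap"] = wide.loc[target.notna()].count()

min_overlap = 3  # hypothetical threshold; tune it for the real dataset
reliable = corr[corr["overlap"] >= min_overlap].drop(index="Target")
reliable = reliable.sort_values("correlation", ascending=False)
print(reliable)
```

Here "Obscure" correlates perfectly with the target on just two shared ratings and is filtered out, while "WellRated" remains.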
