### A simple recommender system that finds similar movies based on user ratings

#### The data used here is the MovieLens dataset
#### We have ratings data and titles data; we combine the two and work on the merged result


## Import Libraries

In [2]:
import numpy as np
import pandas as pd

## Get the Data

In [3]:
column_names = ['user_id', 'item_id', 'rating', 'timestamp']
df = pd.read_csv('u.data', sep='\t', names=column_names)

In [4]:
df.head()

,user_id,item_id,rating,timestamp
0,0,50,5,881250949
1,0,172,5,881250949
2,0,133,1,881250949
3,196,242,3,881250949
4,186,302,3,891717742


In [5]:
df.shape

(100003, 4)

Now let's get the movie titles:

In [6]:
movie_titles = pd.read_csv("Movie_Id_Titles")
movie_titles.shape

(1682, 2)

We can merge them together:

In [7]:
df = pd.merge(df,movie_titles,on='item_id')
df.head()

,user_id,item_id,rating,timestamp,title
0,0,50,5,881250949,Star Wars (1977)
1,290,50,5,880473582,Star Wars (1977)
2,79,50,4,891271545,Star Wars (1977)
3,2,50,5,888552084,Star Wars (1977)
4,8,50,5,879362124,Star Wars (1977)
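
A merge like this can silently drop ratings whose `item_id` has no title, because `pd.merge` defaults to an inner join. The following is a minimal sketch on made-up toy data (the frames `toy_ratings` and `toy_titles` and the title "Some Other Movie" are illustrative, not part of the notebook) showing one way to check for that:

```python
import pandas as pd

# Toy stand-ins for the ratings and titles tables (values are made up).
toy_ratings = pd.DataFrame({
    "user_id": [1, 2, 3],
    "item_id": [50, 50, 999],   # 999 deliberately has no title
    "rating": [5, 4, 3],
})
toy_titles = pd.DataFrame({
    "item_id": [50, 172],
    "title": ["Star Wars (1977)", "Some Other Movie"],
})

# how="left" keeps every rating row; validate="many_to_one" raises if an
# item_id appears more than once in the titles table; indicator=True adds
# a "_merge" column saying where each row came from.
merged = pd.merge(toy_ratings, toy_titles, on="item_id", how="left",
                  validate="many_to_one", indicator=True)

# Rows flagged "left_only" are ratings whose item_id has no matching title.
unmatched = merged[merged["_merge"] == "left_only"]
print(len(merged), "rows after merge,", len(unmatched), "without a title")
```

If `unmatched` is empty on the real data, the inner join used above loses nothing.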


# EDA

Let's explore the data a bit and get a look at some of the best rated movies.

In [8]:
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('white')
%matplotlib inline

Let's create a ratings dataframe with average rating and number of ratings:

In [9]:
df.groupby('title')['rating'].mean().sort_values(ascending=False).head()

title
Marlene Dietrich: Shadow and Light (1996)     5.0
Prefontaine (1997)                            5.0
Santa with Muscles (1996)                     5.0
Star Kid (1997)                               5.0
Someone Else's America (1995)                 5.0
Name: rating, dtype: float64

In [10]:
df.groupby('title')['rating'].count().sort_values(ascending=False).head()

title
Star Wars (1977)             584
Contact (1997)               509
Fargo (1996)                 508
Return of the Jedi (1983)    507
Liar Liar (1997)             485
Name: rating, dtype: int64

In [11]:
ratings = pd.DataFrame(df.groupby('title')['rating'].mean())
ratings.head(10)

title,rating
'Til There Was You (1997),2.333333
1-900 (1994),2.6
101 Dalmatians (1996),2.908257
12 Angry Men (1957),4.344
187 (1997),3.02439
2 Days in the Valley (1996),3.225806
"20,000 Leagues Under the Sea (1954)",3.5
2001: A Space Odyssey (1968),3.969112
3 Ninjas: High Noon At Mega Mountain (1998),1.0
"39 Steps, The (1935)",4.050847


Now add a column with the number of ratings each movie received:

In [12]:
ratings['num of ratings'] = df.groupby('title')['rating'].count()
ratings.head()

title,rating,num of ratings
'Til There Was You (1997),2.333333,9
1-900 (1994),2.6,5
101 Dalmatians (1996),2.908257,109
12 Angry Men (1957),4.344,125
187 (1997),3.02439,41
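
As an alternative to building the two columns in separate steps, both statistics can be computed in a single `groupby` pass. This is a sketch on made-up toy data (the frame `toy` and the name `summary` are illustrative), not a change to the cells above:

```python
import pandas as pd

# Toy ratings: movie "A" has two ratings, movie "B" has three.
toy = pd.DataFrame({
    "title": ["A", "A", "B", "B", "B"],
    "rating": [5, 3, 4, 4, 1],
})

# agg computes the mean and the count per title in one pass.
summary = toy.groupby("title")["rating"].agg(["mean", "count"])

# Rename to match the column names used in this notebook.
summary = summary.rename(columns={"mean": "rating", "count": "num of ratings"})
print(summary)
```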


Okay! Now that we have a general idea of what the data looks like, let's move on to creating a simple recommendation system:

## Recommending Similar Movies

Now let's create a matrix that has the user ids on one axis and the movie titles on the other. Each cell will then hold the rating the user gave to that movie. Note there will be a lot of NaN values, because most people have not rated most of the movies.

In [13]:
moviemat = df.pivot_table(index='user_id',columns='title',values='rating')
moviemat.head()

user_id,'Til There Was You (1997),1-900 (1994),101 Dalmatians (1996),12 Angry Men (1957),187 (1997),2 Days in the Valley (1996),"20,000 Leagues Under the Sea (1954)",2001: A Space Odyssey (1968),3 Ninjas: High Noon At Mega Mountain (1998),"39 Steps, The (1935)",...,Yankee Zulu (1994),Year of the Horse (1997),You So Crazy (1994),Young Frankenstein (1974),Young Guns (1988),Young Guns II (1990),"Young Poisoner's Handbook, The (1995)",Zeus and Roxanne (1997),unknown,Á köldum klaka (Cold Fever) (1994)
0,,,,,,,,,,,...,,,,,,,,,,
1,,,2.0,5.0,,,3.0,4.0,,,...,,,,5.0,3.0,,,,4.0,
2,,,,,,,,,1.0,,...,,,,,,,,,,
3,,,,,2.0,,,,,,...,,,,,,,,,,
4,,,,,,,,,,,...,,,,,,,,,,
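
To put a number on "a lot of NaN values", the share of empty cells can be measured directly. The sketch below uses a made-up toy table (`toy` and `toy_mat` are illustrative names); the same three lines should work on `moviemat`. Note that `pivot_table` averages duplicates by default, so a user who rated the same title twice would get the mean of the two ratings.

```python
import pandas as pd

# Toy long-format ratings: 3 users, 3 titles, only 4 ratings given.
toy = pd.DataFrame({
    "user_id": [1, 1, 2, 3],
    "title": ["A", "B", "A", "C"],
    "rating": [5, 3, 4, 2],
})
toy_mat = toy.pivot_table(index="user_id", columns="title", values="rating")

filled = toy_mat.notna().sum().sum()           # cells that hold a rating
total = toy_mat.shape[0] * toy_mat.shape[1]    # users x titles
sparsity = 1 - filled / total                  # share of NaN cells
print(f"{filled} of {total} cells filled, sparsity = {sparsity:.2f}")
```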


Most rated movies:

In [14]:
ratings.sort_values('num of ratings',ascending=False).head(10)

title,rating,num of ratings
Star Wars (1977),4.359589,584
Contact (1997),3.803536,509
Fargo (1996),4.155512,508
Return of the Jedi (1983),4.00789,507
Liar Liar (1997),3.156701,485
"English Patient, The (1996)",3.656965,481
Scream (1996),3.441423,478
Toy Story (1995),3.878319,452
Air Force One (1997),3.63109,431
Independence Day (ID4) (1996),3.438228,429


### Let's choose two movies:

1. Fargo (a black comedy thriller)

2. Air Force One (an action film)

In [15]:
ratings.head()

title,rating,num of ratings
'Til There Was You (1997),2.333333,9
1-900 (1994),2.6,5
101 Dalmatians (1996),2.908257,109
12 Angry Men (1957),4.344,125
187 (1997),3.02439,41


Now let's grab the user ratings for those two movies:

In [17]:
moviemat['Fargo (1996)']

user_id
0      NaN
1      5.0
2      5.0
3      NaN
4      NaN
5      5.0
6      5.0
7      5.0
8      NaN
9      NaN
10     5.0
11     4.0
12     NaN
13     5.0
14     5.0
15     NaN
16     5.0
17     4.0
18     5.0
19     NaN
20     NaN
21     5.0
22     NaN
23     5.0
24     5.0
25     NaN
26     5.0
27     5.0
28     5.0
29     NaN
      ... 
914    NaN
915    NaN
916    5.0
917    4.0
918    NaN
919    5.0
920    NaN
921    NaN
922    NaN
923    5.0
924    4.0
925    NaN
926    NaN
927    NaN
928    NaN
929    4.0
930    3.0
931    4.0
932    5.0
933    5.0
934    4.0
935    3.0
936    4.0
937    3.0
938    5.0
939    NaN
940    3.0
941    NaN
942    NaN
943    5.0
Name: Fargo (1996), Length: 944, dtype: float64

In [18]:
fargo_user_ratings = moviemat['Fargo (1996)']
airforceone_user_ratings = moviemat['Air Force One (1997)']
fargo_user_ratings.head()

user_id
0    NaN
1    5.0
2    5.0
3    NaN
4    NaN
Name: Fargo (1996), dtype: float64

We can then use the corrwith() method to correlate every column of the matrix (that is, every movie) with a single movie's ratings Series:

In [19]:
similar_to_fargo = moviemat.corrwith(fargo_user_ratings)
similar_to_airforceone = moviemat.corrwith(airforceone_user_ratings)
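
To make clear what `corrwith` computes here, the sketch below compares it with a column-by-column Pearson correlation on a made-up toy matrix (`toy_mat`, `target`, `by_corrwith` and `by_hand` are illustrative names). For each movie, only users who rated both that movie and the target contribute to the correlation.

```python
import numpy as np
import pandas as pd

# Toy user x movie matrix; NaN means "this user did not rate this movie".
toy_mat = pd.DataFrame({
    "A": [5.0, 4.0, 1.0, np.nan],
    "B": [4.0, 5.0, 2.0, 3.0],
    "C": [1.0, np.nan, 5.0, 4.0],
}, index=pd.Index([1, 2, 3, 4], name="user_id"))

target = toy_mat["A"]

# Correlate every column with the target column.
by_corrwith = toy_mat.corrwith(target)

# The same result, one column at a time: Series.corr ignores any user
# for whom either of the two ratings is missing.
by_hand = toy_mat.apply(lambda col: col.corr(target))

print(by_corrwith)
```

In this toy matrix, "A" and "C" share only two users, so their correlation is exactly -1; with so little overlap the value says very little about the movies.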



Let's clean this by removing NaN values and using a DataFrame instead of a series:

In [20]:
corr_fargo = pd.DataFrame(similar_to_fargo,columns=['Correlation'])
corr_fargo.dropna(inplace=True)
corr_fargo.head()

title,Correlation
'Til There Was You (1997),0.1
1-900 (1994),0.866025
101 Dalmatians (1996),-0.245368
12 Angry Men (1957),0.098676
187 (1997),0.142509


Now if we sort the dataframe by correlation, we should get the most similar movies. However, note that some of the results don't really make sense. This is because many movies were rated by only a handful of users who also rated Fargo, and with so few overlapping ratings a perfect correlation can arise by chance.

In [21]:
corr_fargo.sort_values('Correlation',ascending=False).head(10)

title,Correlation
"Smile Like Yours, A (1997)",1.0
Open Season (1996),1.0
"Journey of August King, The (1995)",1.0
"Wooden Man's Bride, The (Wu Kui) (1994)",1.0
"Wedding Gift, The (1994)",1.0
Nowhere (1997),1.0
Captives (1994),1.0
City of Industry (1997),1.0
"Convent, The (Convento, O) (1995)",1.0
King of the Hill (1993),1.0
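
The effect can be reproduced on a small scale. In the sketch below (made-up toy data; the column names "Target", "Popular" and "Obscure" are illustrative), a movie that shares only two raters with the target gets a correlation of exactly 1. As one possible alternative to the approach this notebook takes, `Series.corr` accepts a `min_periods` argument that returns NaN when the overlap is too small.

```python
import numpy as np
import pandas as pd

# Toy user x movie matrix: "Obscure" shares only two raters with "Target".
toy_mat = pd.DataFrame({
    "Target":  [5.0, 4.0, 1.0, 2.0, 5.0],
    "Popular": [4.0, 5.0, 2.0, 1.0, 4.0],
    "Obscure": [np.nan, 3.0, np.nan, 1.0, np.nan],
})
target = toy_mat["Target"]

# Number of users who rated both the target and each movie.
overlap = toy_mat[target.notna()].notna().sum()

# Plain correlation: any two points lie on a line, so "Obscure" scores 1.
naive = toy_mat.corrwith(target)

# Requiring at least 3 shared raters turns that score into NaN.
corr_min3 = toy_mat.apply(lambda col: col.corr(target, min_periods=3))

print(pd.DataFrame({"overlap": overlap, "naive": naive, "min3": corr_min3}))
```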


Let's fix this by keeping only movies with more than 150 ratings. The threshold is a judgment call; a histogram of the 'num of ratings' column is a reasonable way to choose it. First, attach the rating counts to the correlations:

In [22]:
corr_fargo = corr_fargo.join(ratings['num of ratings'])
corr_fargo.head()

title,Correlation,num of ratings
'Til There Was You (1997),0.1,9
1-900 (1994),0.866025,5
101 Dalmatians (1996),-0.245368,109
12 Angry Men (1957),0.098676,125
187 (1997),0.142509,41


Now filter on the number of ratings, sort the values, and notice how the titles make a lot more sense:

In [25]:
corr_fargo[corr_fargo['num of ratings']>150].sort_values('Correlation',ascending=False).head()

title,Correlation,num of ratings
Fargo (1996),1.0,508
Lone Star (1996),0.370915,187
Quiz Show (1994),0.355031,175
Lawrence of Arabia (1962),0.353408,173
"People vs. Larry Flynt, The (1996)",0.341784,215


Apart from Fargo itself, which correlates perfectly with its own ratings, these are the top movies that can be recommended alongside Fargo.

Now the same for Air Force One:

In [26]:
corr_airforceone = pd.DataFrame(similar_to_airforceone,columns=['Correlation'])
corr_airforceone.dropna(inplace=True)
corr_airforceone = corr_airforceone.join(ratings['num of ratings'])
corr_airforceone[corr_airforceone['num of ratings']>150].sort_values('Correlation',ascending=False).head()

title,Correlation,num of ratings
Air Force One (1997),1.0,431
"Hunt for Red October, The (1990)",0.554383,227
"Firm, The (1993)",0.526743,151
Murder at 1600 (1997),0.514906,218
Eraser (1996),0.500606,206


Apart from Air Force One itself, these are the top movies that can be recommended alongside Air Force One.
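
The steps applied to both movies follow the same pattern, so they can be wrapped in a helper. The function below is a sketch, not part of the original notebook: `similar_movies` and its parameters `min_ratings` and `top_n` are hypothetical names, and it also drops the query movie from its own results. It is shown on made-up toy data so that it runs on its own.

```python
import pandas as pd

def similar_movies(moviemat, ratings, title, min_ratings=150, top_n=5):
    """Return the movies whose ratings correlate best with `title`.

    moviemat: user x title matrix of ratings (NaN where unrated).
    ratings:  DataFrame indexed by title with a 'num of ratings' column.
    """
    # Correlate every movie with the query movie and drop undefined results.
    corr = moviemat.corrwith(moviemat[title]).dropna().rename("Correlation")
    result = corr.to_frame().join(ratings["num of ratings"])
    # Keep only movies with enough ratings, and exclude the query itself.
    result = result[result["num of ratings"] > min_ratings]
    result = result.drop(index=title, errors="ignore")
    return result.sort_values("Correlation", ascending=False).head(top_n)

# Toy data: four users, three movies.
toy = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4],
    "title":   ["A", "B", "C", "A", "B", "C", "A", "B", "C", "A", "B"],
    "rating":  [5, 4, 1, 4, 5, 2, 1, 2, 5, 2, 1],
})
toy_mat = toy.pivot_table(index="user_id", columns="title", values="rating")
toy_ratings = toy.groupby("title")["rating"].agg(["mean", "count"]).rename(
    columns={"mean": "rating", "count": "num of ratings"})

top = similar_movies(toy_mat, toy_ratings, "A", min_ratings=2)
print(top)
```

With the notebook's own objects, a call such as `similar_movies(moviemat, ratings, 'Fargo (1996)')` would be expected to reproduce the Fargo table above minus its first row.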