# Lab | Recommendation Systems
Movie Recommendation

Dataset: https://grouplens.org/datasets/movielens/100k/

Steps:

1. Load the data.
2. Find the total number of ratings.
3. Do exploratory data analysis (visualize the data, find the min and max ratings, etc.).
4. Exclude users who have rated only a few movies, to improve the quality of recommendations.
5. Remove outliers, if any.
6. Calculate the density of the rating matrix.
7. Divide the data into training and testing parts in the ratio 75:25.
8. Apply:
   a. Item-Based Collaborative Filtering
   b. User-Based Collaborative Filtering
9. Train the model.
10. Evaluate the model.
11. Recommend top - 10 movies.
12. State your insights.

#### Load Required Libraries

In [1]:
%matplotlib inline

import pandas as pd
from sklearn.model_selection import train_test_split
import numpy as np
import time
import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23; use the standalone package

# Load The Data

In [2]:
r_cols = ['user_id', 'movie_id', 'rating', 'unix_timestamp']
ratings = pd.read_csv('u.data', sep='\t', names=r_cols,encoding='latin-1')

i_cols = ['movie_id', 'movie title' ,'release date','video release date', 'IMDb URL', 'unknown', 'Action', 'Adventure','Animation', 'Children\'s', 'Comedy', 'Crime', 'Documentary', 'Drama', 'Fantasy','Film-Noir', 'Horror', 'Musical', 'Mystery', 'Romance', 'Sci-Fi', 'Thriller', 'War', 'Western']
items = pd.read_csv('u.item', sep='|', names=i_cols,encoding='latin-1')

u_cols = ['user_id', 'age', 'sex', 'occupation', 'zip_code']
users = pd.read_csv('u.user', sep='|', names=u_cols,encoding='latin-1')

In [3]:
print(users.shape)
users.head()

(943, 5)


Unnamed: 0,user_id,age,sex,occupation,zip_code
0,1,24,M,technician,85711
1,2,53,F,other,94043
2,3,23,M,writer,32067
3,4,24,M,technician,43537
4,5,33,F,other,15213


In [4]:
print(ratings.shape)
ratings.head()

(100000, 4)


Unnamed: 0,user_id,movie_id,rating,unix_timestamp
0,196,242,3,881250949
1,186,302,3,891717742
2,22,377,1,878887116
3,244,51,2,880606923
4,166,346,1,886397596


In [5]:
print(items.shape)
items.head()

(1682, 24)


Unnamed: 0,movie_id,movie title,release date,video release date,IMDb URL,unknown,Action,Adventure,Animation,Children's,...,Fantasy,Film-Noir,Horror,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
0,1,Toy Story (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Toy%20Story%2...,0,0,0,1,1,...,0,0,0,0,0,0,0,0,0,0
1,2,GoldenEye (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?GoldenEye%20(...,0,1,1,0,0,...,0,0,0,0,0,0,0,1,0,0
2,3,Four Rooms (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Four%20Rooms%...,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
3,4,Get Shorty (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Get%20Shorty%...,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,5,Copycat (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Copycat%20(1995),0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0


# Exploratory Data Analysis

#### Total Number of Ratings

In [6]:
print('Count of Ratings in dataset: ', ratings.rating.count())

Count of Ratings in dataset:  100000


In [7]:
print('Unique Values of Ratings: ', ratings.rating.unique())
print('Count of Unique Values of Ratings: ',len(ratings.rating.unique()))

Unique Values of Ratings:  [3 1 2 4 5]
Count of Unique Values of Ratings:  5
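
The steps list asks for the minimum and maximum rating and a look at the data. A minimal sketch of that check is shown below; `toy_ratings` is a made-up stand-in for the `ratings` DataFrame, and the same calls should apply to `ratings.rating`.

```python
import pandas as pd

# Toy stand-in for the `ratings` DataFrame loaded above (values are made up).
toy_ratings = pd.DataFrame({
    'user_id':  [1, 1, 2, 2, 3, 3],
    'movie_id': [10, 20, 10, 30, 20, 30],
    'rating':   [3, 5, 4, 1, 5, 4],
})

# Smallest and largest rating present in the data.
min_rating = toy_ratings['rating'].min()
max_rating = toy_ratings['rating'].max()

# Number of ratings at each star level, ordered from the lowest level upward.
rating_counts = toy_ratings['rating'].value_counts().sort_index()

print('Min rating:', min_rating, '| Max rating:', max_rating)
print(rating_counts)
```

In a notebook, `rating_counts.plot(kind='bar')` would turn the same counts into a quick bar chart.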


In [8]:
# Shape of Dataset
print('Number of Rows and Columns in dataset:', ratings.shape)

Number of Rows and Columns in dataset: (100000, 4)


#### The ratings table has no missing values: all 100000 rows carry a rating.

#### Merge dataframes

In [9]:
df_1 = pd.merge(ratings,items.drop_duplicates(['movie_id']), on = 'movie_id', how='left')
df_1.head()

Unnamed: 0,user_id,movie_id,rating,unix_timestamp,movie title,release date,video release date,IMDb URL,unknown,Action,...,Fantasy,Film-Noir,Horror,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
0,196,242,3,881250949,Kolya (1996),24-Jan-1997,,http://us.imdb.com/M/title-exact?Kolya%20(1996),0,0,...,0,0,0,0,0,0,0,0,0,0
1,186,302,3,891717742,L.A. Confidential (1997),01-Jan-1997,,http://us.imdb.com/M/title-exact?L%2EA%2E+Conf...,0,0,...,0,1,0,0,1,0,0,1,0,0
2,22,377,1,878887116,Heavyweights (1994),01-Jan-1994,,http://us.imdb.com/M/title-exact?Heavyweights%...,0,0,...,0,0,0,0,0,0,0,0,0,0
3,244,51,2,880606923,Legends of the Fall (1994),01-Jan-1994,,http://us.imdb.com/M/title-exact?Legends%20of%...,0,0,...,0,0,0,0,0,1,0,0,1,1
4,166,346,1,886397596,Jackie Brown (1997),01-Jan-1997,,http://us.imdb.com/M/title-exact?imdb-title-11...,0,0,...,0,0,0,0,0,0,0,0,0,0


In [10]:
df = pd.merge(df_1,users.drop_duplicates(['user_id']), on = 'user_id', how='left')
df.head()

Unnamed: 0,user_id,movie_id,rating,unix_timestamp,movie title,release date,video release date,IMDb URL,unknown,Action,...,Mystery,Romance,Sci-Fi,Thriller,War,Western,age,sex,occupation,zip_code
0,196,242,3,881250949,Kolya (1996),24-Jan-1997,,http://us.imdb.com/M/title-exact?Kolya%20(1996),0,0,...,0,0,0,0,0,0,49,M,writer,55105
1,186,302,3,891717742,L.A. Confidential (1997),01-Jan-1997,,http://us.imdb.com/M/title-exact?L%2EA%2E+Conf...,0,0,...,1,0,0,1,0,0,39,F,executive,0
2,22,377,1,878887116,Heavyweights (1994),01-Jan-1994,,http://us.imdb.com/M/title-exact?Heavyweights%...,0,0,...,0,0,0,0,0,0,25,M,writer,40206
3,244,51,2,880606923,Legends of the Fall (1994),01-Jan-1994,,http://us.imdb.com/M/title-exact?Legends%20of%...,0,0,...,0,1,0,0,1,1,28,M,technician,80525
4,166,346,1,886397596,Jackie Brown (1997),01-Jan-1997,,http://us.imdb.com/M/title-exact?imdb-title-11...,0,0,...,0,0,0,0,0,0,47,M,educator,55113


In [11]:
#Any Missing Value?
print('Summary of Missing Values:')
print(df.isnull().sum(axis=0))

Summary of Missing Values:
user_id                    0
movie_id                   0
rating                     0
unix_timestamp             0
movie title                0
release date               9
video release date    100000
IMDb URL                  13
unknown                    0
Action                     0
Adventure                  0
Animation                  0
Children's                 0
Comedy                     0
Crime                      0
Documentary                0
Drama                      0
Fantasy                    0
Film-Noir                  0
Horror                     0
Musical                    0
Mystery                    0
Romance                    0
Sci-Fi                     0
Thriller                   0
War                        0
Western                    0
age                        0
sex                        0
occupation                 0
zip_code                   0
dtype: int64


In [12]:
# 'video release date' is entirely null and 'IMDb URL' also has some null values.
# We can drop these columns from the dataset.
df = df.drop(['video release date', 'IMDb URL'], axis = 1)

#### The video release date and IMDb URL columns have been dropped because they have missing values and are not essential for recommendations.

## Most Rated Movies

In [13]:
movies_rated = df.groupby(['movie title']).agg({'rating': 'count'}).reset_index()
movies_rated.sort_values(['rating', 'movie title'], ascending = [0,1])

Unnamed: 0,movie title,rating
1398,Star Wars (1977),583
333,Contact (1997),509
498,Fargo (1996),508
1234,Return of the Jedi (1983),507
860,Liar Liar (1997),485
460,"English Patient, The (1996)",481
1284,Scream (1996),478
1523,Toy Story (1995),452
32,Air Force One (1997),431
744,Independence Day (ID4) (1996),429


# Popular Movies

In [14]:
movies_grouped = df.groupby(['movie title']).agg({'rating':['mean']}).reset_index()
movies_grouped.sort_values([('rating', 'mean')], ascending = False)

Unnamed: 0_level_0,movie title,rating
Unnamed: 0_level_1,Unnamed: 1_level_1,mean
1472,They Made Me a Criminal (1939),5.000000
944,Marlene Dietrich: Shadow and Light (1996),5.000000
1273,"Saint of Fort Washington, The (1993)",5.000000
1359,Someone Else's America (1995),5.000000
1387,Star Kid (1997),5.000000
633,"Great Day in Harlem, A (1994)",5.000000
30,Aiqing wansui (1994),5.000000
1277,Santa with Muscles (1996),5.000000
1172,Prefontaine (1997),5.000000
462,Entertaining Angels: The Dorothy Day Story (1996),5.000000
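
Every movie in the table above has a perfect 5.0 mean, which usually means it was rated by only a handful of users. One possible refinement, sketched here on made-up data, is to require a minimum number of ratings before ranking by mean. The threshold `MIN_RATINGS` and the `toy` frame are assumptions for illustration, not values from this lab.

```python
import pandas as pd

# Toy stand-in for the merged `df` (titles and ratings are made up).
toy = pd.DataFrame({
    'movie title': ['A', 'A', 'A', 'B', 'C', 'C', 'C'],
    'rating':      [4, 5, 4, 5, 3, 4, 2],
})

MIN_RATINGS = 3  # hypothetical cut-off; tune it for the real dataset

# Mean rating and number of ratings per movie.
movie_stats = (toy.groupby('movie title')['rating']
                  .agg(['mean', 'count'])
                  .reset_index())

# Keep only movies with enough ratings, then rank by mean rating.
popular = (movie_stats[movie_stats['count'] >= MIN_RATINGS]
           .sort_values('mean', ascending=False))
print(popular)
```

Movie `B` has the highest mean in the toy data but only one rating, so it is filtered out before ranking.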


# Active Users

In [15]:
users_grouped = df.groupby(['user_id']).agg({'rating': 'count'}).reset_index()
users_grouped.sort_values(['rating', 'user_id'], ascending = [0,1])

Unnamed: 0,user_id,rating
404,405,737
654,655,685
12,13,636
449,450,540
275,276,518
415,416,493
536,537,490
302,303,484
233,234,480
392,393,448


#### All ratings lie on the 1 to 5 scale (see the unique values above), so there are no outlier ratings to remove.

# Restrict users who have rated few movies

In [16]:
passive_users = users_grouped[users_grouped.rating < 25]
print(passive_users.count())
print(passive_users.head())

user_id    121
rating     121
dtype: int64
    user_id  rating
3         4      24
8         9      22
18       19      20
32       33      24
33       34      20


In [17]:
keys = ['user_id']
i1 = df.set_index(keys).index
i2 = passive_users.set_index(keys).index
df_treated = df[~i1.isin(i2)]
df_treated.head()

Unnamed: 0,user_id,movie_id,rating,unix_timestamp,movie title,release date,unknown,Action,Adventure,Animation,...,Mystery,Romance,Sci-Fi,Thriller,War,Western,age,sex,occupation,zip_code
0,196,242,3,881250949,Kolya (1996),24-Jan-1997,0,0,0,0,...,0,0,0,0,0,0,49,M,writer,55105
1,186,302,3,891717742,L.A. Confidential (1997),01-Jan-1997,0,0,0,0,...,1,0,0,1,0,0,39,F,executive,0
2,22,377,1,878887116,Heavyweights (1994),01-Jan-1994,0,0,0,0,...,0,0,0,0,0,0,25,M,writer,40206
3,244,51,2,880606923,Legends of the Fall (1994),01-Jan-1994,0,0,0,0,...,0,1,0,0,1,1,28,M,technician,80525
5,298,474,4,884182806,Dr. Strangelove or: How I Learned to Stop Worr...,01-Jan-1963,0,0,0,0,...,0,0,1,0,1,0,44,M,executive,1581
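
The same filter can be written without building two indexes. The sketch below uses `groupby(...).transform('count')` on made-up data; `toy` and the threshold of 2 are illustrative assumptions (the notebook uses 25 on `df`).

```python
import pandas as pd

# Toy stand-in for the merged `df` (values are made up).
toy = pd.DataFrame({
    'user_id':  [1, 1, 1, 2, 3, 3],
    'movie_id': [10, 20, 30, 10, 20, 30],
    'rating':   [4, 5, 3, 2, 5, 4],
})

MIN_RATINGS_PER_USER = 2  # the notebook uses 25 on the real data

# For every row, the number of ratings given by that row's user.
ratings_per_user = toy.groupby('user_id')['rating'].transform('count')

# Keep only rows that belong to sufficiently active users.
toy_treated = toy[ratings_per_user >= MIN_RATINGS_PER_USER]
print(toy_treated)
```

User 2 has a single rating in the toy data and is dropped; users 1 and 3 are kept with all of their rows.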


In [18]:
df_treated.count()

user_id           97363
movie_id          97363
rating            97363
unix_timestamp    97363
movie title       97363
release date      97355
unknown           97363
Action            97363
Adventure         97363
Animation         97363
Children's        97363
Comedy            97363
Crime             97363
Documentary       97363
Drama             97363
Fantasy           97363
Film-Noir         97363
Horror            97363
Musical           97363
Mystery           97363
Romance           97363
Sci-Fi            97363
Thriller          97363
War               97363
Western           97363
age               97363
sex               97363
occupation        97363
zip_code          97363
dtype: int64

# Density of Data

In [19]:
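# Count the zero-valued cells of the merged dataframe df_treated.
# Note: this measures zeros in the merged table (mostly the 0/1 genre flags),
# not the unrated cells of the user-item rating matrix.
zero_count = 0
for i in range(0,97363):
    for j in range(0,28):
        if df_treated.iloc[i,j] == 0:
            zero_count+=1

In [20]:
# i and j keep their final loop values here (97362 and 27), so the denominator
# is slightly smaller than the number of cells visited (97363 * 28).
sparsity = zero_count/(i*j)
print('Sparsity of Dataframe = ', sparsity)
print('Density of Dataframe = ', 1- sparsity)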

Sparsity of Dataframe =  0.6250586775432198
Density of Dataframe =  0.37494132245678025
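
The density of the rating matrix itself is usually defined as the number of observed ratings divided by (number of users × number of movies). For the unfiltered data loaded above that is 100000 / (943 × 1682), roughly 0.063. A minimal sketch on made-up data follows; `toy` is a stand-in for `df_treated`, and the code assumes at most one rating per (user, movie) pair.

```python
import pandas as pd

# Toy stand-in for `df_treated` (values are made up).
toy = pd.DataFrame({
    'user_id':  [1, 1, 2, 3],
    'movie_id': [10, 20, 10, 30],
    'rating':   [4, 5, 3, 2],
})

n_users = toy['user_id'].nunique()    # rows of the user-item matrix
n_movies = toy['movie_id'].nunique()  # columns of the user-item matrix
n_ratings = len(toy)                  # filled cells (one rating per user-movie pair)

density = n_ratings / (n_users * n_movies)
sparsity = 1 - density
print('Density of rating matrix  =', density)
print('Sparsity of rating matrix =', sparsity)
```

This avoids the cell-by-cell loop and does not depend on the genre or demographic columns.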


# Divide data into train and test

In [21]:
# Divide data into train and test
train_df, test_df = train_test_split(df_treated, test_size = 0.25, random_state=0)
print(train_df.head(5))

       user_id  movie_id  rating  unix_timestamp         movie title  \
50587      622       433       4       882669886     Heathers (1989)   
2713        11       746       4       891905032  Real Genius (1985)   
87485      521       183       3       884477630        Alien (1979)   
97915      639       673       4       891239406    Cape Fear (1962)   
19223      303       268       5       879466166  Chasing Amy (1997)   

      release date  unknown  Action  Adventure  Animation    ...     Mystery  \
50587  01-Jan-1989        0       0          0          0    ...           0   
2713   01-Jan-1985        0       0          0          0    ...           0   
87485  01-Jan-1979        0       1          0          0    ...           0   
97915  01-Jan-1962        0       0          0          0    ...           0   
19223  01-Jan-1997        0       0          0          0    ...           0   

       Romance  Sci-Fi  Thriller  War  Western  age  sex  occupation  zip_code  
50587

# Build the User-Item Rating Matrices

In [22]:
# We want the ratings matrix to have one row per user and one column per movie.
# Pivot train_df to get that shape and call the result R_train; unrated cells are filled with 0.
R_train = train_df.pivot(index = 'user_id', columns ='movie_id', values = 'rating').fillna(0)
R_train.tail()

movie_id,1,2,3,4,5,6,7,8,9,10,...,1670,1672,1674,1675,1676,1678,1679,1680,1681,1682
user_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
938,0.0,0.0,0.0,0.0,0.0,0.0,4.0,0.0,3.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
939,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
940,0.0,0.0,0.0,2.0,0.0,0.0,4.0,5.0,3.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
942,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
943,0.0,5.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [23]:
R_test = test_df.pivot(index = 'user_id', columns ='movie_id', values = 'rating').fillna(0)
R_test.tail()

movie_id,1,2,3,4,5,6,7,8,9,10,...,1648,1649,1659,1661,1663,1664,1665,1671,1673,1677
user_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
938,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
939,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
940,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
942,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
943,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
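
The train and test pivots above end with different movie columns (for example 1682 versus 1677), so row and column positions in `R_test` do not necessarily refer to the same users and movies as in `R_train`. One way to keep them comparable, sketched below on made-up data, is to reindex the test pivot onto the training index and columns. The `toy_*` names are hypothetical; with this approach, test ratings for users or movies that never appear in training are dropped.

```python
import pandas as pd

# Toy stand-ins for train_df and test_df (values are made up).
toy_train = pd.DataFrame({
    'user_id':  [1, 1, 2, 3],
    'movie_id': [10, 20, 10, 30],
    'rating':   [4, 5, 3, 2],
})
toy_test = pd.DataFrame({
    'user_id':  [1, 3],
    'movie_id': [30, 10],
    'rating':   [1, 5],
})

R_train_toy = toy_train.pivot(index='user_id', columns='movie_id',
                              values='rating').fillna(0)

# Pivot the test ratings, then force the same row and column labels as the
# training matrix so that position (i, j) means the same user and movie in both.
R_test_toy = (toy_test.pivot(index='user_id', columns='movie_id', values='rating')
                      .reindex(index=R_train_toy.index, columns=R_train_toy.columns)
                      .fillna(0))
print(R_test_toy)
```

After this step, `R_test_toy.values` has the same shape as `R_train_toy.values`, which is what a position-based comparison against a prediction matrix relies on.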


# Apply User Based and Item Based Collaborative Filtering

In [24]:
from sklearn.metrics.pairwise import pairwise_distances

In [25]:
R_train_matrix = R_train.values
R_train_matrix

array([[ 5.,  3.,  0., ...,  0.,  0.,  0.],
       [ 4.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       ..., 
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  5.,  0., ...,  0.,  0.,  0.]])

In [26]:
R_test_matrix = R_test.values
R_test_matrix

array([[ 0.,  0.,  4., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       ..., 
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.]])

In [27]:
item_distance = pairwise_distances(R_train_matrix.T, metric = 'cosine')
item_distance

array([[ 0.        ,  0.6717021 ,  0.74702869, ...,  1.        ,
         0.94386207,  0.94386207],
       [ 0.6717021 ,  0.        ,  0.80058697, ...,  1.        ,
         0.90991438,  0.90991438],
       [ 0.74702869,  0.80058697,  0.        , ...,  1.        ,
         1.        ,  1.        ],
       ..., 
       [ 1.        ,  1.        ,  1.        , ...,  0.        ,
         1.        ,  1.        ],
       [ 0.94386207,  0.90991438,  1.        , ...,  1.        ,
         0.        ,  1.        ],
       [ 0.94386207,  0.90991438,  1.        , ...,  1.        ,
         1.        ,  0.        ]])

In [30]:
user_distance = pairwise_distances(R_train_matrix, metric = 'cosine')
user_distance

array([[ 0.        ,  0.86610644,  0.95680343, ...,  0.71316318,
         0.84621759,  0.67803134],
       [ 0.86610644,  0.        ,  0.92814401, ...,  0.79878981,
         0.87815924,  0.90144842],
       [ 0.95680343,  0.92814401,  0.        , ...,  0.83818632,
         0.86505954,  0.98040334],
       ..., 
       [ 0.71316318,  0.79878981,  0.83818632, ...,  0.        ,
         0.78445416,  0.79858813],
       [ 0.84621759,  0.87815924,  0.86505954, ...,  0.78445416,
         0.        ,  0.82152505],
       [ 0.67803134,  0.90144842,  0.98040334, ...,  0.79858813,
         0.82152505,  0.        ]])

In [28]:
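def predict(ratings, distance, type = 'user'):
    # Note: `distance` is the cosine distance matrix, so larger weights go to less similar users/items.
    if type == 'user':
        # Mean rating per user, computed over all movies (unrated cells count as 0).
        mean_user_rating = ratings.mean(axis=1)
        # Centre each user's ratings on their own mean.
        ratings_diff = (ratings - mean_user_rating[:,np.newaxis])
        # Weighted average of all users' deviations, added back to the user's mean.
        pred = mean_user_rating[:,np.newaxis] + distance.dot(ratings_diff)/np.array([np.abs(distance).sum(axis=1)]).T
    elif type == 'item':
        # Weighted average of the user's ratings across items, normalised per item.
        pred = ratings.dot(distance)/np.array([np.abs(distance).sum(axis=1)])
    return pred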
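
Neighbourhood-based collaborative filtering is more commonly weighted by similarity than by distance, so that closer neighbours contribute more. The sketch below shows a user-based variant with cosine similarity computed directly in NumPy. The function names and the `toy` matrix are illustrative assumptions, not part of the notebook. Like the function above, it treats unrated cells as 0 when taking means, and it assumes every user has at least one rating so that no weight sum is zero.

```python
import numpy as np

def cosine_similarity_rows(matrix):
    """Cosine similarity between every pair of rows of `matrix`."""
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    norms[norms == 0] = 1.0          # avoid dividing by zero for all-zero rows
    unit = matrix / norms
    return unit @ unit.T

def predict_user_based(ratings, similarity):
    """Mean-centred, similarity-weighted user-based prediction."""
    mean_user_rating = ratings.mean(axis=1, keepdims=True)
    ratings_diff = ratings - mean_user_rating
    weights = np.abs(similarity).sum(axis=1, keepdims=True)
    return mean_user_rating + similarity.dot(ratings_diff) / weights

# Toy user-item matrix: 3 users x 3 movies, 0 means "not rated".
toy = np.array([[5., 3., 0.],
                [4., 0., 0.],
                [0., 0., 5.]])

sim = cosine_similarity_rows(toy)
pred = predict_user_based(toy, sim)
print(pred)
```

With scikit-learn's cosine metric, the equivalent similarity is `1 - pairwise_distances(..., metric='cosine')`.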

In [29]:
item_prediction = predict(R_train_matrix,item_distance, type='item')
item_prediction

array([[ 0.3747279 ,  0.39407858,  0.4195023 , ...,  0.45770436,
         0.44485955,  0.44232533],
       [ 0.08705461,  0.10244006,  0.09978951, ...,  0.10370439,
         0.10497377,  0.10615034],
       [ 0.07284153,  0.07695396,  0.07342543, ...,  0.07082701,
         0.07526532,  0.07668024],
       ..., 
       [ 0.13909567,  0.15081265,  0.15825125, ...,  0.16885379,
         0.16746836,  0.16756075],
       [ 0.14112745,  0.15037401,  0.16023543, ...,  0.16569418,
         0.16248426,  0.16450132],
       [ 0.20172302,  0.1962471 ,  0.22130822, ...,  0.25278391,
         0.24345212,  0.24547449]])

In [31]:
user_prediction = predict(R_train_matrix,user_distance, type='user')
user_prediction

array([[ 1.71314976,  0.61039331,  0.48470993, ...,  0.28073717,
         0.28033684,  0.2802746 ],
       [ 1.41791484,  0.31351697,  0.13793753, ..., -0.08631806,
        -0.084839  , -0.08452403],
       [ 1.44446403,  0.28033775,  0.10885485, ..., -0.11698873,
        -0.11486191, -0.11454503],
       ..., 
       [ 1.49867054,  0.35323125,  0.20800015, ..., -0.0121028 ,
        -0.01137414, -0.01117617],
       [ 1.49214967,  0.35102265,  0.20742835, ..., -0.01729973,
        -0.01673384, -0.01639317],
       [ 1.52968177,  0.39239947,  0.27853985, ...,  0.0752654 ,
         0.0750997 ,  0.07535622]])

# Evaluation

In [32]:
from sklearn.metrics import mean_squared_error
from math import sqrt
def RMSE(predict, nh):
    predict = predict[nh.nonzero()].flatten()
    nh = nh[nh.nonzero()].flatten()
    return sqrt(mean_squared_error(predict,nh))

In [33]:
print('RMSE User Based Train Data',RMSE(user_prediction,R_train_matrix))
print('RMSE Item Based Train Data',RMSE(item_prediction,R_train_matrix))
print('RMSE User Based Test Data',RMSE(user_prediction,R_test_matrix))
print('RMSE Item Based Test Data',RMSE(item_prediction,R_test_matrix))

RMSE User Based Train Data 3.085479208132152
RMSE User Based Test Data 3.200577782598785


For user-based collaborative filtering, the RMSE on the train and test data is close (about 3.09 versus 3.20). The item-based figures recorded for this cell were obtained by scoring `user_prediction` a second time, so they were identical to the user-based ones and are not shown; the item-based RMSE has to be computed from `item_prediction` before the two methods can be compared.

# Function for Recommendation

In [91]:
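# Return the movies with the highest predicted rating that the specified user hasn't already rated.
# Takes the specified user's row from the predictions matrix.
def recommend_movies(predictions_df, userID, movies_df, original_ratings_df, num_recommendations=5):
    
    # Get and sort the user's predictions.
    # Rows of the pivoted matrix follow the sorted user ids of the training ratings, and some
    # ids are missing after filtering, so userID - 1 is not a reliable row position.
    # Assumes original_ratings_df is the frame the matrix was pivoted from and contains userID.
    user_row_number = int(np.searchsorted(np.sort(original_ratings_df.user_id.unique()), userID))
    sorted_user_predictions = predictions_df.iloc[user_row_number].sort_values(ascending=False)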
    
    # Get the user's data and merge in the movie information.
    user_data = original_ratings_df[original_ratings_df.user_id == (userID)]
    #Added title and genres
    user_full = (pd.merge(user_data, movies_df[['movie_id']], left_index=True, right_index=True, how='left')
                     .sort_values(['rating'], ascending=False))

    print ('User {0} has already rated {1} movies.'.format(userID, user_full.shape[0]))
    print ('Recommending the highest {0} predicted ratings movies not already rated.'.format(num_recommendations))
    
    keys = ['user_id']
    i1 = movies_df.set_index(keys).index
    i2 = user_full.set_index(keys).index
    df_temp = movies_df[~i1.isin(i2)]

    # Recommend the highest predicted rating movies that the user hasn't seen yet.
    recommendations = (df_temp.
         merge(pd.DataFrame(sorted_user_predictions).reset_index(), how = 'left',
               left_on = 'movie_id',
               right_on = 'movie_id').
         rename(columns = {user_row_number: 'Predictions'}).
         sort_values('Predictions', ascending = False).
                       iloc[:num_recommendations, :-1]
                      )

    return user_full, recommendations, sorted_user_predictions, user_data, user_full
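
A more compact way to express the same idea is sketched below: look up the user's predicted scores by label, drop the movies the user has already rated, and keep the top `n`. This is an illustrative alternative rather than the function used in this lab. It assumes the predictions DataFrame is indexed by `user_id` (for example `pd.DataFrame(item_prediction, index=R_train.index, columns=R_train.columns)`), and `recommend_top_n` and the `toy_*` frames are hypothetical names.

```python
import pandas as pd

def recommend_top_n(predictions_df, user_id, ratings_df, titles_df, n=5):
    """Top-n unseen movies for `user_id`, ranked by predicted score."""
    # Movies this user has already rated.
    already_rated = ratings_df.loc[ratings_df['user_id'] == user_id, 'movie_id'].tolist()
    # Predicted scores for this user, looked up by user id rather than by position.
    user_scores = predictions_df.loc[user_id]
    # Drop the rated movies and keep the n highest predictions.
    candidates = user_scores[~user_scores.index.isin(already_rated)]
    top = candidates.sort_values(ascending=False).head(n)
    result = top.rename('predicted_rating').rename_axis('movie_id').reset_index()
    # Attach the titles for readability.
    return result.merge(titles_df, on='movie_id', how='left')

# Toy data (made up): two users, three movies.
toy_predictions = pd.DataFrame(
    [[4.2, 1.0, 3.5], [2.0, 4.8, 0.5]],
    index=pd.Index([5, 7], name='user_id'),
    columns=pd.Index([10, 20, 30], name='movie_id'),
)
toy_ratings = pd.DataFrame({'user_id': [5, 7], 'movie_id': [10, 20], 'rating': [5, 4]})
toy_titles = pd.DataFrame({'movie_id': [10, 20, 30], 'movie title': ['A', 'B', 'C']})

top_for_5 = recommend_top_n(toy_predictions, 5, toy_ratings, toy_titles, n=2)
print(top_for_5)
```

In the toy data, user 5 has already rated movie 10, so it is excluded even though it has the highest predicted score.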

In [92]:
item_df_pred = pd.DataFrame(data = item_prediction, columns = R_train.columns)
user_df_pred = pd.DataFrame(data = user_prediction, columns = R_train.columns)

# Item Based Recommendation for User 5

In [98]:
past_rated, recommend, sorted_user_predict, user_data, user_full = recommend_movies(item_df_pred,5, df_treated, train_df, 5)

User 5 has already rated 132 movies.
Recommending the highest 5 predicted ratings movies not already rated.


In [99]:
past_rated.head(5)

Unnamed: 0,user_id,movie_id_x,rating,unix_timestamp,movie title,release date,unknown,Action,Adventure,Animation,...,Romance,Sci-Fi,Thriller,War,Western,age,sex,occupation,zip_code,movie_id_y
76171,5,181,5,875635757,Return of the Jedi (1983),14-Mar-1997,0,1,1,0,...,1,1,0,1,0,33,F,other,15213,181
2026,5,382,5,875636587,"Adventures of Priscilla, Queen of the Desert, ...",01-Jan-1994,0,0,0,0,...,0,0,0,0,0,33,F,other,15213,382
23878,5,257,5,875635239,Men in Black (1997),04-Jul-1997,0,1,1,0,...,0,1,0,0,0,33,F,other,15213,257
15667,5,433,5,875636655,Heathers (1989),01-Jan-1989,0,0,0,0,...,0,0,0,0,0,33,F,other,15213,433
26821,5,101,5,878844510,Heavy Metal (1981),08-Mar-1981,0,1,1,1,...,0,1,0,0,0,33,F,other,15213,101


# User Based Recommendation For User 5

In [100]:
past_rated, recommend, sorted_user_predict, user_data, user_full = recommend_movies(user_df_pred,5, df_treated, train_df, 5)

User 5 has already rated 132 movies.
Recommending the highest 5 predicted ratings movies not already rated.


In [101]:
past_rated.head(5)

Unnamed: 0,user_id,movie_id_x,rating,unix_timestamp,movie title,release date,unknown,Action,Adventure,Animation,...,Romance,Sci-Fi,Thriller,War,Western,age,sex,occupation,zip_code,movie_id_y
76171,5,181,5,875635757,Return of the Jedi (1983),14-Mar-1997,0,1,1,0,...,1,1,0,1,0,33,F,other,15213,181
2026,5,382,5,875636587,"Adventures of Priscilla, Queen of the Desert, ...",01-Jan-1994,0,0,0,0,...,0,0,0,0,0,33,F,other,15213,382
23878,5,257,5,875635239,Men in Black (1997),04-Jul-1997,0,1,1,0,...,0,1,0,0,0,33,F,other,15213,257
15667,5,433,5,875636655,Heathers (1989),01-Jan-1989,0,0,0,0,...,0,0,0,0,0,33,F,other,15213,433
26821,5,101,5,878844510,Heavy Metal (1981),08-Mar-1981,0,1,1,1,...,0,1,0,0,0,33,F,other,15213,101


# Insights
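1. User-based collaborative filtering reaches an RMSE of about 3.09 on the training data and 3.20 on the test data. The item-based figures were obtained by scoring `user_prediction` again, so this lab does not yet show how the accuracy of the two methods differs on this dataset.
2. User-based collaborative filtering is expected to have higher latency, although this notebook does not time either method.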