# Netflix
Netflix is the world's leading streaming entertainment service, with hundreds of millions of paying customers worldwide.

# Problem Statement
Content platforms like Netflix have a huge content library with a vast variety of titles, so users often have difficulty finding content they are likely to enjoy.
That's where recommender systems come into play.
A recommender system automatically ranks high-quality content by its relevance to a user's preferences, which leads to a better user experience.


# Dataset
Netflix Prize data 
https://www.kaggle.com/netflix-inc/netflix-prize-data


# Import Libraries (Modules)

In [None]:
!pip install surprise

Collecting surprise
  Downloading https://files.pythonhosted.org/packages/61/de/e5cba8682201fcf9c3719a6fdda95693468ed061945493dea2dd37c5618b/surprise-0.1-py2.py3-none-any.whl
Collecting scikit-surprise
  Downloading https://files.pythonhosted.org/packages/97/37/5d334adaf5ddd65da99fc65f6507e0e4599d092ba048f4302fe8775619e8/scikit-surprise-1.1.1.tar.gz (11.8MB)
Building wheels for collected packages: scikit-surprise
  Building wheel for scikit-surprise (setup.py) ... done
  Created wheel for scikit-surprise: filename=scikit_surprise-1.1.1-cp37-cp37m-linux_x86_64.whl size=1617570 sha256=bb4da2703bcf1618ce4d6c32956c69dc55298484828cbe6d6150c1489e2139be
  Stored in directory: /root/.cache/pip/wheels/78/9c/3d/41b419c9d2aff5b6e2b4c0fc8d25c538202834058f9ed110d0
Successfully built scikit-surprise
Installing collected packages: scikit-surprise, surprise
Successfully installed scikit-surprise-1.1.1 surprise-0.1


In [None]:
import numpy as np
import pandas as pd
# import matplotlib.pyplot as plt
# import seaborn as sns

# Read Data

Store the data as a zip file so that this dataset occupies significantly less storage space.

Unzip the data with a single statement and access the files directly from the working directory.

In [None]:
!unzip -q "/content/drive/MyDrive/Data Science/Projects/Netflix Recommander System/Netflix Prize.zip"

For now, read only the user ratings data from the directory. We use the ratings for only the first 4,499 movies (about 24M rows) to build the recommendation system, so that low-RAM instances can handle the work.

In [None]:
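# Each rating line is "user,rating,date"; keep only the first two columns
df = pd.read_csv('/content/combined_data_1.txt', header=None, names=['UserId', 'Rating'], usecols=[0, 1])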
df

Unnamed: 0,UserId,Rating
0,1:,
1,1488844,3.0
2,822109,5.0
3,885013,4.0
4,30878,4.0
...,...,...
24058258,2591364,2.0
24058259,1791000,2.0
24058260,512536,5.0
24058261,988963,3.0


Here 1: is a movie id, not a user id. Movie ids sitting in the user column make it hard to analyze the ratings users gave to a particular movie, so the dataframe needs a separate movie id column.

A fair amount of preprocessing is needed before this user ratings data can be used for a recommendation system.

# Preprocessing data for Recommendation

Create a new movieid column filled with zeros.

In [None]:
df['movieid'] = 0
df

Unnamed: 0,UserId,Rating,movieid
0,1:,,0
1,1488844,3.0,0
2,822109,5.0,0
3,885013,4.0,0
4,30878,4.0,0
...,...,...,...
24058258,2591364,2.0,0
24058259,1791000,2.0,0
24058260,512536,5.0,0
24058261,988963,3.0,0


In [None]:
# Rows with a null rating are movie headers: their UserId field holds the movie id
df_movie = df.loc[df.Rating.isnull()].copy()
df_movie['movieid'] = df_movie['UserId']
df_movie

Unnamed: 0,UserId,Rating,movieid
0,1:,,1:
548,2:,,2:
694,3:,,3:
2707,4:,,4:
2850,5:,,5:
...,...,...,...
24046714,4495:,,4495:
24047329,4496:,,4496:
24056849,4497:,,4497:
24057564,4498:,,4498:


In this dataframe movieid is stored as an object (a string such as 1:), so create a function to convert it into an integer.

In [None]:
# Convert a movie id string such as '1:' to the integer 1
def movie_id(string):
  string = string[:-1]  # Remove the trailing ':' from the movie id
  return int(string)

This is the most important part of preprocessing the data for the recommendation system.

In [None]:
# Apply the function above to the movieid column
df_movie['movieid'] = df_movie['movieid'].apply(movie_id)

# Merge the separate dataframe back into the main dataframe (aligned on index):
# movie header rows get their movie id, every other row gets NaN
df['movieid'] = df_movie['movieid']

# Forward-fill so each rating row inherits the movie id of the header above it
df['movieid'] = df['movieid'].ffill()
df.movieid = df.movieid.astype(int)

# Drop the movie header rows: they are the only rows with a null Rating,
# so dropping rows with null values removes exactly those rows
df.dropna(axis=0, inplace=True)

# The recommender module expects columns in the order
# ['userID', 'itemID', 'rating'], so reorder them
df = df.reindex(columns=['UserId', 'movieid', 'Rating'])

# This is the format the recommendation system needs
df

Unnamed: 0,UserId,movieid,Rating
1,1488844,1,3.0
2,822109,1,5.0
3,885013,1,4.0
4,30878,1,4.0
5,823519,1,3.0
...,...,...,...
24058258,2591364,4499,2.0
24058259,1791000,4499,2.0
24058260,512536,4499,5.0
24058261,988963,4499,3.0
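
The forward-fill step is easier to follow on a tiny example. The sketch below is not the notebook's own code: it applies the same idea to a hypothetical five-row sample, and the helper name `attach_movie_ids` and the sample values are assumptions made for illustration.

```python
import pandas as pd

def attach_movie_ids(raw):
    """Turn 'N:' header rows into a movieid column on the rating rows."""
    is_header = raw['Rating'].isna()
    # Keep the id text only on header rows, e.g. '1:' -> '1'; rating rows become NaN
    header_ids = raw['UserId'].where(is_header).str.rstrip(':')
    out = raw.copy()
    # Forward-fill so each rating row inherits the id of the header above it
    out['movieid'] = pd.to_numeric(header_ids).ffill().astype(int)
    # Drop the header rows and put the columns in user, item, rating order
    return out.loc[~is_header, ['UserId', 'movieid', 'Rating']].reset_index(drop=True)

# Hypothetical sample in the same layout as the raw dataframe
toy = pd.DataFrame({
    'UserId': ['1:', '1488844', '822109', '2:', '2059652'],
    'Rating': [None, 3.0, 5.0, None, 4.0],
})
print(attach_movie_ids(toy))
```

The first two rating rows end up with movieid 1 and the last one with movieid 2, which mirrors what the cell above does on the full 24M-row dataframe.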


Read the movie titles data

In [None]:
df_titles = pd.read_csv('/content/movie_titles.csv', encoding="ISO-8859-1", names=['movieid', 'Year', 'Name'])

Set movieid as the index to make later lookups easier, since this dataframe only holds movie titles.

In [None]:
df_titles.set_index('movieid', inplace = True)
df_titles.head()

Unnamed: 0_level_0,Year,Name
movieid,Unnamed: 1_level_1,Unnamed: 2_level_1
1,2003.0,Dinosaur Planet
2,2004.0,Isle of Man TT 2004 Review
3,1997.0,Character
4,1994.0,Paula Abdul's Get Up & Dance
5,2004.0,The Rise and Fall of ECW


To give users high-quality recommendations, the system should recommend only popular movies, so remove movies that were rated by few users. Here a movie counts as non-popular if its number of ratings falls below the 95th percentile of rating counts.

In [None]:
movie_ratings = df.groupby('movieid')['Rating'].agg(['count', 'mean'])  # number of ratings and mean rating per movie
movie_ratings.index = movie_ratings.index.map(int)
non_popular_threshold = round(movie_ratings['count'].quantile(0.95),0) 
non_popular_movies = movie_ratings[movie_ratings['count'] < non_popular_threshold].index
non_popular_movies 

Int64Index([   1,    2,    3,    4,    5,    6,    7,    8,    9,   10,
            ...
            4490, 4491, 4492, 4493, 4494, 4495, 4496, 4497, 4498, 4499],
           dtype='int64', name='movieid', length=4274)

To get high-quality reference users for collaborative filtering, the system should rely only on users who actively watch content and rate many movies. Users whose rating count falls below the 95th percentile are treated as less active.

In [None]:
df_users = df.groupby('UserId')['Rating'].agg(['count', 'mean'])
df_users.index = df_users.index.map(int)
less_users_threshold = round(df_users['count'].quantile(0.95),0)
df_less_active_users = df_users[df_users['count'] < less_users_threshold].index
df_less_active_users

Int64Index([     10, 1000004, 1000027, 1000033, 1000035, 1000038, 1000051,
            1000057,  100006, 1000062,
            ...
             999935,   99994,  999944,  999945,  999949,  999964,  999972,
             999977,  999984,  999988],
           dtype='int64', name='UserId', length=447138)

In [None]:
non_popular_threshold

27366.0

In [None]:
less_users_threshold

190.0

Drop the rows that fall below these thresholds.

In [None]:
print('Before dropping less active rows', df.shape)

df = df[~df['movieid'].isin(non_popular_movies)]
df = df[~df['UserId'].isin(df_less_active_users)]

print('After dropping less active rows', df.shape)

df

Before dropping less active rows (24053764, 3)
After dropping less active rows (14372438, 3)


Unnamed: 0,UserId,movieid,Rating
52551,1392773,28,4.0
52552,1697479,28,4.0
52553,1990901,28,5.0
52554,2626356,28,5.0
52555,1402412,28,2.0
...,...,...,...
24018724,480064,4488,1.0
24018725,1021220,4488,4.0
24018726,2186555,4488,4.0
24018727,833254,4488,3.0
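
The two filters above follow the same pattern: count ratings per id, take a quantile of those counts as the threshold, and keep only the ids at or above it. A minimal sketch of that pattern as one reusable helper is shown below; the name `keep_frequent` and the toy data are assumptions, not part of the notebook.

```python
import pandas as pd

def keep_frequent(ratings, column, q=0.95):
    """Keep rows whose `column` value is rated at least as often as the q-quantile of counts."""
    counts = ratings[column].value_counts()
    threshold = round(counts.quantile(q), 0)
    frequent = counts[counts >= threshold].index
    return ratings[ratings[column].isin(frequent)]

# Toy data: movie 1 has three ratings, movie 2 has two, movie 3 has one
toy_ratings = pd.DataFrame({
    'UserId': ['a', 'b', 'c', 'a', 'b', 'a'],
    'movieid': [1, 1, 1, 2, 2, 3],
    'Rating': [4.0, 5.0, 3.0, 2.0, 4.0, 1.0],
})
# With q=0.5 the threshold is the median count (2), so movie 3 is dropped
popular_only = keep_frequent(toy_ratings, 'movieid', q=0.5)
print(popular_only)
```

Two details are worth keeping in mind when reusing such a helper. Calling it once per column recomputes the user counts after the movie filter, whereas the notebook computes both thresholds on the unfiltered data. Also, `isin` only matches values of the same type, so ids stored as strings never match ids stored as integers.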


# Recommender system

Use the well-known SVD algorithm from the Surprise library to build the recommendation system; refer to its documentation for details.
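
As a reminder of what this model computes (following the Surprise documentation for `SVD`; the symbols below are introduced only for this explanation), the predicted rating of user $u$ for movie $i$ is

$$
\hat{r}_{ui} = \mu + b_u + b_i + q_i^{\top} p_u
$$

where $\mu$ is the global mean rating, $b_u$ and $b_i$ are the user and movie biases, and $p_u$ and $q_i$ are the learned latent factor vectors. The parameters are fitted by stochastic gradient descent on the regularized squared error

$$
\sum_{r_{ui} \in R_{\text{train}}} \left(r_{ui} - \hat{r}_{ui}\right)^2 + \lambda \left(b_u^2 + b_i^2 + \lVert q_i \rVert^2 + \lVert p_u \rVert^2\right)
$$

For a user or movie that did not appear in the training data, the corresponding bias and factor vector are taken to be zero, so the prediction falls back on the remaining terms.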

In [None]:
from surprise import Reader, Dataset, SVD
from surprise.model_selection import cross_validate
from surprise import accuracy

It will take around 60-90 minutes.

In [None]:
# Takes about 1 hour 20 minutes

reader = Reader()

# getting full dataset
data = Dataset.load_from_df(df[['UserId', 'movieid', 'Rating']], reader)

algo = SVD()
cross_validate(algo, data, measures=['RMSE', 'MAE'])

{'fit_time': (746.9048941135406,
  787.6050469875336,
  789.5430340766907,
  792.3407363891602,
  788.5767765045166),
 'test_mae': array([0.70620804, 0.70530168, 0.70516181, 0.70443728, 0.70546429]),
 'test_rmse': array([0.91228935, 0.91092952, 0.9110382 , 0.91060588, 0.91143263]),
 'test_time': (71.643545627594,
  80.41635513305664,
  75.1009590625763,
  80.00788640975952,
  75.77366232872009)}
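
To make the two reported measures concrete, here is a small sketch of how RMSE and MAE are defined, followed by the average of the per-fold RMSE values printed above. The helper names `rmse` and `mae` and the two-rating toy input are assumptions for illustration; Surprise computes these measures internally.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error: penalizes large misses more strongly."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))

def mae(actual, predicted):
    """Mean absolute error: average size of the miss in rating stars."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

# Toy check: true ratings 3 and 5, predictions 4 and 3 (errors of 1 and 2 stars)
print(rmse([3, 5], [4, 3]), mae([3, 5], [4, 3]))

# Per-fold RMSE values copied from the cross-validation output above
fold_rmse = [0.91228935, 0.91092952, 0.9110382, 0.91060588, 0.91143263]
mean_rmse = sum(fold_rmse) / len(fold_rmse)
print(round(mean_rmse, 4))
```

The folds agree closely with each other, so the average of roughly 0.911 is a stable summary of this run.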

A Colab instance is not sufficient to improve the RMSE further. This run already pushes the instance to its limit, even after removing as much data from inactive users and non-popular movies as possible.

In [None]:
# Takes 15-20 minutes
trainset = data.build_full_trainset()
algo.fit(trainset)

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x7f32b2de2390>

Recommend movies to users

In [None]:
def recommendation(Userid, rating):
  '''
  Example: recommendation(1488844, 5)

  Prints the movies the user gave the given rating (here 5 stars) and
  returns all movies sorted by predicted rating for that user.
  '''
  
  df_user = df[(df['UserId']==str(Userid)) & 
               (df['Rating']==rating)].set_index('movieid')
  df_user = df_user.join(df_titles)['Name']
  print('---Movies Seen by User---')
  print(df_user)  # movies the user rated with the given rating

  df_user = df_titles.copy().reset_index()
  df_user['Predicted_Rating'] = df_user['movieid'].apply(lambda x: algo.predict(str(Userid), x).est)
  df_user_recommendations = df_user.sort_values('Predicted_Rating', ascending=False)
  print('\n---Recommended Movies to User---')

  return df_user_recommendations

In [None]:
recommendation(1488844, 5) 

---Movies Seen by User---
movieid
143                                              The Game
191                                      X2: X-Men United
468                               The Matrix: Revolutions
571                                       American Beauty
607                                                 Speed
658                         Robin Hood: Prince of Thieves
798                                                  Jaws
1180                                     A Beautiful Mind
1590                                      Life as a House
1625                          Aliens: Collector's Edition
1798                                        Lethal Weapon
1905    Pirates of the Caribbean: The Curse of the Bla...
2192                                        The Hurricane
2252                                Bram Stoker's Dracula
2372                                 The Bourne Supremacy
2430                           Alien: Collector's Edition
2452        Lord of the Rings: The Fel

Unnamed: 0,movieid,Year,Name,Predicted_Rating
1797,1798,1987.0,Lethal Weapon,4.582067
3816,3817,1994.0,Stargate,4.571523
1809,1810,1998.0,U.S. Marshals,4.530583
3609,3610,1992.0,Lethal Weapon 3,4.520544
3961,3962,2003.0,Finding Nemo (Widescreen),4.476690
...,...,...,...,...
3755,3756,2002.0,About Schmidt,2.656447
1304,1305,2003.0,Thirteen,2.655991
412,413,2002.0,Igby Goes Down,2.573105
2959,2960,2004.0,The Ladykillers,2.446850


The recommendation list above includes Lethal Weapon 3, which the user has not seen yet and which is recommended because the user has already seen Lethal Weapon.

Given the user's 5-star ratings, these recommendations are sensible, and they are listed in descending order of predicted rating. The list contains both movies the user has already seen and movies they have not seen yet.
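
Since the returned list mixes seen and unseen movies, a natural follow-up is to keep only the unseen ones and take the top few. The sketch below is one possible way to do that, not part of the original notebook; the helper name `top_n_unseen` is hypothetical, and the toy rows are copied from the output table above.

```python
import pandas as pd

def top_n_unseen(recommendations, seen_movie_ids, n=10):
    """Return the n best-predicted movies the user has not rated yet."""
    unseen = recommendations[~recommendations['movieid'].isin(seen_movie_ids)]
    return unseen.sort_values('Predicted_Rating', ascending=False).head(n)

# First four rows of the recommendation output shown above
recs = pd.DataFrame({
    'movieid': [1798, 3817, 1810, 3610],
    'Name': ['Lethal Weapon', 'Stargate', 'U.S. Marshals', 'Lethal Weapon 3'],
    'Predicted_Rating': [4.582067, 4.571523, 4.530583, 4.520544],
})
# The user already rated Lethal Weapon (movieid 1798), so it is filtered out
top_two = top_n_unseen(recs, seen_movie_ids={1798}, n=2)
print(top_two)
```

In the notebook, the set of seen ids could be built from the ratings dataframe, for example `set(df.loc[df['UserId'] == str(user_id), 'movieid'])`, where `user_id` stands for whichever user is being served.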