<a href="https://colab.research.google.com/github/hari-chintaparthi/Netflix-Recommendation-Project/blob/main/Netflix_Recommendation_project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 🎬 Netflix Recommendation System – Data Summary

Welcome to the **Netflix Recommendation System Project**!  
This project uses two datasets, **Customers** and **Movies**, to build a personalized movie recommendation model.  
The goal is to analyze how users rate different movies and predict what they might enjoy next.  

---

## 📂 Datasets Overview

### 1️⃣ Customers Dataset  
Contains information about how each customer rated different movies.

**Columns:**
- `Cust_ID` → Unique ID of each customer. In the raw file this column also holds movie header rows such as `1:`, which mark the movie that the following ratings belong to
- `Ratings` → User's rating score (from 1 to 5)

🧩 **Purpose:**  
Helps understand user behavior and preferences for recommendation modeling.

---

### 2️⃣ Movies Dataset  
Provides details about each movie available on the platform.

**Columns:**
- `Movie_ID` → Unique identifier for each movie  
- `Year` → Release year of the movie  
- `Name` → Title of the movie  

🎥 **Purpose:**  
Adds movie-level information, useful for merging and displaying meaningful recommendations.

---

## 🔗 Data Relationship  
Both datasets are connected through the **`Movie_ID`** column.  
Merging them allows us to link customer ratings with actual movie names and release years.
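
A minimal sketch of this relationship, using a few rows copied from the outputs shown later in this notebook. The frame names `ratings` and `movies` are chosen here for illustration only:

```python
import pandas as pd

# A few rating rows: one row per (customer, movie) rating.
ratings = pd.DataFrame({
    "Cust_ID": [1488844, 822109, 712664],
    "Ratings": [3.0, 5.0, 5.0],
    "Movie_ID": [1, 1, 3],
})

# The matching movie rows, keyed by Movie_ID.
movies = pd.DataFrame({
    "Movie_ID": [1, 3],
    "Year": [2003.0, 1997.0],
    "Name": ["Dinosaur Planet", "Character"],
})

# An inner join on the shared key attaches a title and year to every rating.
merged = pd.merge(movies, ratings, on="Movie_ID")
print(merged)
```

Because the join is an inner join, a rating whose `Movie_ID` is missing from the movie table would be dropped from the result.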

---

## 🧠 Next Steps
1. Load both datasets into Pandas DataFrames.  
2. Clean and preprocess the data (handle missing values, duplicates, etc.).  
3. Merge datasets using the `Movie_ID`.  
4. Perform exploratory data analysis (EDA).  
5. Build and evaluate recommendation models (Collaborative Filtering / Content-Based).  

---

💡 *This section is the dataset summary and introduction for the Colab notebook and GitHub project.*


In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

In [None]:
# This is the dataset drive link.
# https://drive.google.com/drive/folders/1KNfqcr2GdD76GjlfqISWt51v1oXu_twH?usp=sharing

In [None]:
# !pip install numpy==1.26.4
# !pip install scikit-surprise

In [None]:
df = pd.read_csv("/content/drive/MyDrive/Kaggle/Netflix_data/Copy of combined_data_1.txt.zip", header = None, names = ['Cust_ID', 'Ratings'],usecols=[0,1])

In [None]:
df.head()

Unnamed: 0,Cust_ID,Ratings
0,1:,
1,1488844,3.0
2,822109,5.0
3,885013,4.0
4,30878,4.0


In [None]:
df.shape  # the dataset is huge: about 24 million rows (2.4 crore)

(24058263, 2)

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 24058263 entries, 0 to 24058262
Data columns (total 2 columns):
 #   Column   Dtype  
---  ------   -----  
 0   Cust_ID  object 
 1   Ratings  float64
dtypes: float64(1), object(1)
memory usage: 367.1+ MB


In [None]:
# Movie count: each movie has one header row (e.g. "1:") with no rating,
# so the number of missing ratings equals the number of movies.
movie_count = df['Ratings'].isna().sum()
movie_count

4499

In [None]:
# Customer count: unique values in Cust_ID, minus the movie header rows stored in the same column
customer_count = df['Cust_ID'].nunique()-movie_count
customer_count

470758

In [None]:
# Ratings count: total number of ratings given by customers.
# count() already skips the header rows (their rating is NaN), so nothing needs to be subtracted.
ratings_count = df['Ratings'].count()
ratings_count

24053764

In [None]:
# Star count: number of ratings for each star value (1 star to 5 stars)
stars_count = df['Ratings'].value_counts()
stars_count

Ratings,count
4.0,8085741
3.0,6904181
5.0,5506583
2.0,2439073
1.0,1118186


In [None]:
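# The customers data has no separate column for the movie ID.
# Each block of ratings starts with a header row such as "1:", so we carry
# the most recent header's ID forward onto every row that follows it.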

movie_list=[]

for customer in df['Cust_ID']:
  if ':' in customer:
    movie_id=int(customer.replace(':',''))

  movie_list.append(movie_id)
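
The loop above visits every row in Python, which is slow on roughly 24 million rows. A vectorized alternative is sketched below on a tiny made-up sample laid out like the raw file. The sample text `raw_text` and its second movie block are assumptions for illustration, not data from the notebook:

```python
import io
import pandas as pd

# Tiny sample in the raw layout: a "movie_id:" header row, then that movie's ratings.
raw_text = "1:\n1488844,3\n822109,5\n2:\n885013,4\n"
df = pd.read_csv(io.StringIO(raw_text), header=None, names=["Cust_ID", "Ratings"])

# Header rows are the ones with no rating.
is_header = df["Ratings"].isna()

# Take the ID from header rows only, then forward-fill it onto the rating rows.
movie_id = df["Cust_ID"].where(is_header).str.rstrip(":")
df["Movie_ID"] = movie_id.ffill().astype(int)

# Drop the header rows and convert customer IDs to integers.
df = df[~is_header].astype({"Cust_ID": int})
print(df)
```

This does the extraction, the null drop, and the integer conversion in one pass, so it would stand in for the loop and the two cleaning cells that follow it.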


In [None]:
# Add the Movie_ID column to the customers data

df['Movie_ID'] = movie_list
df.head()

Unnamed: 0,Cust_ID,Ratings,Movie_ID
0,1:,,1
1,1488844,3.0,1
2,822109,5.0,1
3,885013,4.0,1
4,30878,4.0,1


In [None]:
# Drop Nulls
df.dropna(inplace=True)

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 24053764 entries, 1 to 24058262
Data columns (total 3 columns):
 #   Column    Dtype  
---  ------    -----  
 0   Cust_ID   object 
 1   Ratings   float64
 2   Movie_ID  int64  
dtypes: float64(1), int64(1), object(1)
memory usage: 734.1+ MB


In [None]:
# convert cust id col to int datatype

df['Cust_ID'] = df['Cust_ID'].astype(int)

In [None]:
# Cleaned customers data

df.head()

Unnamed: 0,Cust_ID,Ratings,Movie_ID
1,1488844,3.0,1
2,822109,5.0,1
3,885013,4.0,1
4,30878,4.0,1
5,823519,3.0,1


In [None]:
# The dataset is huge, so we reduce noise by dropping sparsely rated movies and inactive customers.
# Keep a movie only if its number of ratings is at or above the 60th percentile of per-movie counts.
# Keep a customer only if their number of ratings is at or above the 60th percentile of per-customer counts.
# In other words, roughly the top 40% of movies and of customers by rating count are kept for training.

In [None]:
movie_ratings_counts = df['Movie_ID'].value_counts()
movie_ratings_counts

Movie_ID,count
1905,193941
2152,162597
3860,160454
4432,156183
571,154832
...,...
4294,44
915,43
3656,42
4338,39


In [None]:
# bench_mark1 is the 60th percentile of ratings per movie: movies below it are removed, the rest are kept
bench_mark1 = round(movie_ratings_counts.quantile(0.6),0)
bench_mark1

908.0

In [None]:
# IDs of the movies that received fewer ratings than bench_mark1
drop_movie_index = movie_ratings_counts[movie_ratings_counts<bench_mark1].index
drop_movie_index

Index([1598, 1733, 1647, 4099, 1616, 1446,  263, 4259,  160, 1988,
       ...
       1858, 4035, 3693, 2805,  820, 4294,  915, 3656, 4338, 4362],
      dtype='int64', name='Movie_ID', length=2699)

In [None]:
cust_ratings_counts = df['Cust_ID'].value_counts()
cust_ratings_counts

Cust_ID,count
305344,4467
387418,4422
2439493,4195
1664010,4019
2118461,3769
...,...
1300341,1
2550360,1
11848,1
930788,1


In [None]:
# bench_mark2 is the 60th percentile of ratings per customer: customers below it are removed, the rest are kept

bench_mark2 = round(cust_ratings_counts.quantile(0.6),0)
bench_mark2

36.0

In [None]:
# Customers with at least 36 ratings are kept.
# The customers listed below have fewer and will be dropped.

drop_cust_id = cust_ratings_counts[cust_ratings_counts<bench_mark2].index
drop_cust_id

Index([2194851,  600295, 1739398, 1157368,  532108, 2157249,  256134,  640441,
       1272324, 1346990,
       ...
       1969065,  899932,  611596, 2147176,  811650, 1300341, 2550360,   11848,
        930788,  594210],
      dtype='int64', name='Cust_ID', length=282042)

In [None]:
# Cleaned df after removing rarely rated movies and customers with few ratings

df = df[~df['Movie_ID'].isin(drop_movie_index)]
df = df[~df['Cust_ID'].isin(drop_cust_id)]
df

Unnamed: 0,Cust_ID,Ratings,Movie_ID
696,712664,5.0,3
697,1331154,4.0,3
698,2632461,3.0,3
699,44937,5.0,3
700,656399,4.0,3
...,...,...,...
24056842,1055714,5.0,4496
24056843,2643029,4.0,4496
24056844,267802,4.0,4496
24056845,1559566,3.0,4496
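
The percentile filter can be checked by hand on a small made-up table. The sketch below mirrors the steps above (counts, 60th-percentile cutoff, `isin` filter) but skips the rounding of the cutoffs. The frame `toy` and its values are assumptions for illustration:

```python
import pandas as pd

# Made-up ratings: movie 10 has 4 ratings, movie 20 has 2, movie 30 has 1.
toy = pd.DataFrame({
    "Cust_ID":  [1, 2, 3, 4, 1, 2, 1],
    "Movie_ID": [10, 10, 10, 10, 20, 20, 30],
    "Ratings":  [5.0, 4.0, 3.0, 4.0, 2.0, 5.0, 1.0],
})

# Both sets of counts come from the unfiltered table, as in the notebook.
movie_counts = toy["Movie_ID"].value_counts()
cust_counts = toy["Cust_ID"].value_counts()

# 60th percentile of the counts (linear interpolation by default).
movie_cutoff = movie_counts.quantile(0.6)
cust_cutoff = cust_counts.quantile(0.6)

# Keep only IDs whose count reaches the cutoff.
keep_movies = movie_counts[movie_counts >= movie_cutoff].index
keep_custs = cust_counts[cust_counts >= cust_cutoff].index
filtered = toy[toy["Movie_ID"].isin(keep_movies) & toy["Cust_ID"].isin(keep_custs)]
print(movie_cutoff, cust_cutoff)
print(filtered)
```

Here only movie 10 and customers 1 and 2 pass their cutoffs, so two rows remain.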


In [None]:
# Load the movie details: ID, release year, and title

movie_title=pd.read_csv('/content/drive/MyDrive/Kaggle/Netflix_data/Copy of movie_titles.csv',encoding='ISO-8859-1',header=None,names=['Movie_ID','Year','Name'],usecols=[0,1,2])

In [None]:
movie_title.head()

Unnamed: 0,Movie_ID,Year,Name
0,1,2003.0,Dinosaur Planet
1,2,2004.0,Isle of Man TT 2004 Review
2,3,1997.0,Character
3,4,1994.0,Paula Abdul's Get Up & Dance
4,5,2004.0,The Rise and Fall of ECW


In [None]:
from surprise import Reader, Dataset, SVD
from surprise.model_selection import cross_validate,train_test_split

🧩 1. Surprise:

1. Surprise (also called Scikit-Surprise) is a Python library used to build recommendation systems.
It helps predict what a user will like based on their past ratings or preferences.

2. Surprise is a tool that helps computers recommend items (like movies, books, or products) to people by learning from ratings given by other users.

3. **Example:**
If you rated 5 movies, Surprise can predict how much you might like a new movie and recommend it to you.


🔁 2. Cross-Validation (in Machine Learning)

1. Cross-validation is a technique to test how well a machine learning model works.
It divides the data into parts, some for training the model and others for testing it, and repeats this process several times to get a more reliable performance score.

2. Cross-validation checks if your model is good and reliable, not just lucky on one dataset.

3. **Example:** If you split your data into 5 parts (called 5-fold cross-validation):
    1. Train on 4 parts
    2. Test on 1 part
    3. Repeat 5 times with different parts
    4. Average the results to get a fair accuracy score.
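
The four steps above can be sketched in plain Python. The helper `k_fold_indices` is a hypothetical name used only here, and unlike most library implementations it does not shuffle the rows first:

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of k contiguous folds."""
    indices = list(range(n_samples))
    # Spread any remainder so fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

# 10 rows split into 5 folds: each fold trains on 8 rows and tests on the other 2.
for fold, (train, test) in enumerate(k_fold_indices(10, 5), start=1):
    print(f"Fold {fold}: train on {len(train)} rows, test on rows {test}")
```

Every row lands in exactly one test fold, which is what makes the averaged score a fair summary of the whole dataset.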

In [None]:
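# Ratings run from 1 to 5
reader = Reader(rating_scale=(1,5))
# Columns must be in the order user, item, rating; only the first 169,583 rows are loaded
data = Dataset.load_from_df(df[['Cust_ID','Movie_ID','Ratings']][:169583],reader)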
data

<surprise.dataset.DatasetAutoFolds at 0x7f615aa5b680>

Reader in Surprise is a helper that specifies how to interpret your data file or DataFrame, especially:
* Which values represent the ratings (like 1–5 or 0–10)
* What format your data columns follow (like user ID, item ID, rating)

In [None]:
model_svd=SVD()
model_svd

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x7f615a9f0d40>

### `cross_validate` Parameters
* **algo:** The algorithm you want to test (e.g., SVD())
* **data:** The dataset you loaded (using Dataset.load_from_df() or built-in datasets)
* **measures:** What metrics to calculate (like RMSE, MAE)
* **cv:** Number of folds (how many times to split data)
* **verbose:** If True, prints the results

In [None]:
svd_results = cross_validate(model_svd,data,measures=['RMSE','MAE'],cv=3,verbose=True)
print(svd_results)

Evaluating RMSE, MAE of algorithm SVD on 3 split(s).

                  Fold 1  Fold 2  Fold 3  Mean    Std     
RMSE (testset)    0.9885  0.9926  0.9902  0.9905  0.0017  
MAE (testset)     0.7807  0.7924  0.7918  0.7883  0.0054  
Fit time          2.31    3.48    2.37    2.72    0.54    
Test time         0.54    0.41    0.29    0.41    0.10    
{'test_rmse': array([0.9885364 , 0.99261745, 0.99024549]), 'test_mae': array([0.78073048, 0.79242104, 0.79175528]), 'fit_time': (2.3118934631347656, 3.475088119506836, 2.366422176361084), 'test_time': (0.5360629558563232, 0.4056367874145508, 0.2857017517089844)}


In [None]:
svd_results

{'test_rmse': array([0.9885364 , 0.99261745, 0.99024549]),
 'test_mae': array([0.78073048, 0.79242104, 0.79175528]),
 'fit_time': (2.3118934631347656, 3.475088119506836, 2.366422176361084),
 'test_time': (0.5360629558563232, 0.4056367874145508, 0.2857017517089844)}

In [None]:
# The scores are stable across the three folds:
# RMSE is about 0.99 (std 0.0017) and MAE about 0.79, both measured in rating stars, not percentages.
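
For reference, a minimal sketch of how the two metrics and the fold summary are computed. The `actual` and `predicted` ratings are made up for illustration; the per-fold RMSE values are copied from the `cross_validate` output above:

```python
import math

def rmse(actual, predicted):
    """Root mean squared error: penalizes large misses more heavily."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def mae(actual, predicted):
    """Mean absolute error: the average size of a miss, in rating stars."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

# Made-up true and predicted ratings, for illustration only.
actual = [3.0, 5.0, 4.0, 4.0]
predicted = [3.5, 4.0, 4.0, 2.0]
print(rmse(actual, predicted), mae(actual, predicted))

# Per-fold RMSE values from the cross-validation run.
fold_rmse = [0.9885364, 0.99261745, 0.99024549]
mean_rmse = sum(fold_rmse) / len(fold_rmse)
std_rmse = math.sqrt(sum((x - mean_rmse) ** 2 for x in fold_rmse) / len(fold_rmse))
print(round(mean_rmse, 4), round(std_rmse, 4))  # 0.9905 0.0017, matching the Mean and Std columns
```

Both metrics are in the same units as the ratings, so an RMSE near 0.99 means a typical miss of about one star.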

In [None]:
# Merge df with movie_title on Movie_ID
df_merged=pd.merge(movie_title,df,on='Movie_ID')
df_merged

Unnamed: 0,Movie_ID,Year,Name,Cust_ID,Ratings
0,3,1997.0,Character,712664,5.0
1,3,1997.0,Character,1331154,4.0
2,3,1997.0,Character,2632461,3.0
3,3,1997.0,Character,44937,5.0
4,3,1997.0,Character,656399,4.0
...,...,...,...,...,...
19695831,4496,1993.0,Farewell My Concubine,1055714,5.0
19695832,4496,1993.0,Farewell My Concubine,2643029,4.0
19695833,4496,1993.0,Farewell My Concubine,267802,4.0
19695834,4496,1993.0,Farewell My Concubine,1559566,3.0


In [None]:
# Look at the rating history of one customer (Cust_ID 387418) before checking the model's recommendations

user_387418=df_merged[df_merged['Cust_ID']==387418]
user_387418

Unnamed: 0,Movie_ID,Year,Name,Cust_ID,Ratings
1389,3,1997.0,Character,387418,2.0
2405,5,2004.0,The Rise and Fall of ECW,387418,1.0
3251,6,1997.0,Sick,387418,2.0
12599,8,2004.0,What the #$*! Do We Know!?,387418,1.0
16605,16,1996.0,Screamers,387418,2.0
...,...,...,...,...,...
19665188,4489,1961.0,Mysterious Island,387418,1.0
19672457,4490,2004.0,Ned Kelly,387418,3.0
19680684,4492,2004.0,Club Dread,387418,2.0
19686451,4493,2003.0,Ju-on: The Grudge,387418,1.0


In [None]:
# Drop the movies that received too few ratings

# .copy() gives an independent frame, so adding a column later does not act on a slice of movie_title
cleaned_movie_title=movie_title[~movie_title['Movie_ID'].isin(drop_movie_index)].copy()
cleaned_movie_title

Unnamed: 0,Movie_ID,Year,Name
2,3,1997.0,Character
4,5,2004.0,The Rise and Fall of ECW
5,6,1997.0,Sick
7,8,2004.0,What the #$*! Do We Know!?
15,16,1996.0,Screamers
...,...,...,...
17765,17766,2002.0,Where the Wild Things Are and Other Maurice Se...
17766,17767,2004.0,Fidel Castro: American Experience
17767,17768,2000.0,Epoch
17768,17769,2003.0,The Company


In [None]:
# Predict a rating for customer 387418 on every remaining movie
estimated_rating=[]
for movie_id in cleaned_movie_title['Movie_ID']:
  rating=model_svd.predict(387418,movie_id).est
  estimated_rating.append(rating)

cleaned_movie_title['predicted_ratings']=estimated_rating
cleaned_movie_title

Unnamed: 0,Movie_ID,Year,Name,predicted_ratings
2,3,1997.0,Character,2.657704
4,5,2004.0,The Rise and Fall of ECW,2.125466
5,6,1997.0,Sick,1.655243
7,8,2004.0,What the #$*! Do We Know!?,1.000000
15,16,1996.0,Screamers,1.915646
...,...,...,...,...
17765,17766,2002.0,Where the Wild Things Are and Other Maurice Se...,2.516202
17766,17767,2004.0,Fidel Castro: American Experience,2.516202
17767,17768,2000.0,Epoch,2.516202
17768,17769,2003.0,The Company,2.516202
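
Many rows above share the same estimate (2.516202). One plausible explanation, assuming those movies were absent from the rows the model was trained on: Surprise's SVD estimates a rating as the global mean plus a user bias, an item bias, and the dot product of the user and item factors, and it treats the bias and factors of an unknown item as zero. The sketch below illustrates that behaviour with made-up parameter values; it is not the library's code:

```python
# Made-up learned parameters, for illustration only.
global_mean = 3.6
user_bias = {387418: -1.1}
item_bias = {3: 0.2}
user_factors = {387418: [0.1, -0.3]}
item_factors = {3: [0.5, 0.4]}

def predict(user_id, movie_id, low=1.0, high=5.0):
    """Biased matrix-factorization estimate; unknown users or items contribute zero."""
    est = global_mean + user_bias.get(user_id, 0.0) + item_bias.get(movie_id, 0.0)
    if user_id in user_factors and movie_id in item_factors:
        est += sum(p * q for p, q in zip(user_factors[user_id], item_factors[movie_id]))
    # Clip the estimate to the rating scale (1 to 5 in this notebook).
    return min(high, max(low, est))

print(predict(387418, 3))      # seen movie: every term contributes
print(predict(387418, 17769))  # unseen movie: global mean + user bias only
print(predict(387418, 17768))  # another unseen movie: the same estimate
```

If this is what happens here, unseen movies cannot be told apart for a given user, so only movies present in the training rows can be ranked meaningfully.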


In [None]:

# Sort the movies by predicted rating, highest first
recommended_movies=cleaned_movie_title.sort_values('predicted_ratings',ascending=False)

In [None]:
# Top 5 recommended movies for user 387418,
# based on ratings predicted from the user's previous rating history

top_5 = recommended_movies.head(5)['Name'].to_list()
print('These are the top five recommended movies for this user_387418 :- ',top_5)


These are the top five recommended movies for this user_387418 :-  ['Aqua Teen Hunger Force: Vol. 1', 'Inspector Morse 31: Death Is Now My Neighbour', "Something's Gotta Give", 'Immortal Beloved', "ABC Primetime: Mel Gibson's The Passion of the Christ"]


In [None]:
# A function that prints the top 5 recommended movies for a given customer ID


def top_5_recommendations(user_id):
  # Predict a rating for every remaining movie for this user
  estimated_rating=[model_svd.predict(user_id,movie_id).est
                    for movie_id in cleaned_movie_title['Movie_ID']]

  # assign() returns a new frame, so the shared movie table is not modified
  ranked=cleaned_movie_title.assign(predicted_ratings=estimated_rating)

  # Rank all movies by predicted rating, highest first
  recommended_movies=ranked.sort_values('predicted_ratings',ascending=False)

  top_5 = recommended_movies.head(5)['Name'].to_list()
  print(f'These are the top five recommended movies for this user_{user_id} :- \n\n{top_5}')
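
A possible variation, sketched under assumptions: return the titles instead of printing them, and skip movies the user has already rated. The names `top_n`, `predict_fn`, and `fake_scores` are hypothetical, and the stand-in predictor replaces the trained model so the example runs on its own:

```python
import pandas as pd

def top_n(movies, predict_fn, user_id, n=5, seen_ids=()):
    """Return the n highest-scoring unseen movie titles without modifying `movies`."""
    candidates = movies[~movies["Movie_ID"].isin(list(seen_ids))]
    scores = [predict_fn(user_id, movie_id) for movie_id in candidates["Movie_ID"]]
    ranked = candidates.assign(predicted_ratings=scores)
    return ranked.nlargest(n, "predicted_ratings")["Name"].to_list()

# Movie rows copied from movie_title.head(); the scores are made up.
movies = pd.DataFrame({
    "Movie_ID": [1, 2, 3, 4, 5],
    "Name": ["Dinosaur Planet", "Isle of Man TT 2004 Review", "Character",
             "Paula Abdul's Get Up & Dance", "The Rise and Fall of ECW"],
})
fake_scores = {1: 3.0, 2: 5.0, 3: 2.0, 4: 4.0, 5: 1.0}

def fake_predict(user_id, movie_id):
    # Stand-in for model_svd.predict(user_id, movie_id).est
    return fake_scores[movie_id]

print(top_n(movies, fake_predict, user_id=387418, n=2, seen_ids=[3]))
```

With the real model, `predict_fn` could wrap `model_svd.predict(...).est`, and `seen_ids` could come from the user's rows in `df_merged`.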


In [None]:
# top 5 movies for the user id 387418

top_5_recommendations(387418)

These are the top five recommended movies for this user_387418 :- 

['Aqua Teen Hunger Force: Vol. 1', 'Inspector Morse 31: Death Is Now My Neighbour', "Something's Gotta Give", 'Immortal Beloved', "ABC Primetime: Mel Gibson's The Passion of the Christ"]


In [None]:
# top 5 movies for the user id 1488844
top_5_recommendations(1488844)

These are the top five recommended movies for this user_1488844 :- 

['Inspector Morse 31: Death Is Now My Neighbour', "ABC Primetime: Mel Gibson's The Passion of the Christ", 'The Rise and Fall of ECW', 'Aqua Teen Hunger Force: Vol. 1', 'Character']


In [None]:
# top 5 movies for the user id 2143489
top_5_recommendations(2143489)

These are the top five recommended movies for this user_2143489 :- 

["ABC Primetime: Mel Gibson's The Passion of the Christ", 'The Rise and Fall of ECW', 'Aqua Teen Hunger Force: Vol. 1', 'Lilo and Stitch', 'Inspector Morse 31: Death Is Now My Neighbour']


In [None]:
# top 5 movies for the user id 1854303
top_5_recommendations(1854303)

These are the top five recommended movies for this user_1854303 :- 

["ABC Primetime: Mel Gibson's The Passion of the Christ", 'The Rise and Fall of ECW', 'Inspector Morse 31: Death Is Now My Neighbour', 'Immortal Beloved', 'Aqua Teen Hunger Force: Vol. 1']


In [None]:
# top 5 movies for the user id 1562675
top_5_recommendations(1562675)

These are the top five recommended movies for this user_1562675 :- 

["ABC Primetime: Mel Gibson's The Passion of the Christ", 'Inspector Morse 31: Death Is Now My Neighbour', 'Aqua Teen Hunger Force: Vol. 1', 'Immortal Beloved', "Something's Gotta Give"]
