# **Introduction**
  **Hey there, movie buffs! Tired of scrolling endlessly to find something good to watch?   Ever feel like the OTT platform just gets you?**

  **Well, buckle up because we're about to crack the code on recommendations!   Join us at 10:10 AM to dive into the world of how these platforms know your taste in movies.  We'll be spilling the tea on those eerily perfect suggestions.**

**Link to the datasource: https://drive.google.com/drive/folders/1lgXm5Lyqjdkisn1G9c-MqaeLT4boNIzg?usp=sharing**

<hr>

# **Building the Recommendation Engine 🎞️**

**Ever scrolled through Amazon and felt like they can read your mind? That's the magic of recommendation systems!  Basically, it's a super-powered filter that picks up on what you search for or buy, then uses that info to suggest similar things you might like. It's like having a personal shopping buddy who remembers your interests and whispers "Hey, check this out!" whenever you're browsing. Pretty cool, right?**

**Ever notice how YouTube seems to know exactly what you want to watch next? That's because it's got a recommendation system working behind the scenes, like a super-smart friend suggesting videos based on what you've watched before. It's the same deal with Netflix: it learns your taste in movies and genres, then whispers in your ear (well, the recommendation bar) with suggestions you might love.**

**This recommendation system magic isn't just for entertainment. It's everywhere!  Scrolling through Facebook or Instagram? Boom, recommendations for new friends and accounts to follow pop up.  Shopping online?  Amazon, BigBasket, and other sites use your past searches and purchases to show you ads and products that might catch your eye. It's like having a personal shopping buddy who remembers what you like and says "Hey, check this out!"**

**So, the next time you see those eerily perfect recommendations, remember: it's not magic, it's just a clever system that helps you discover new things and maybe even find that perfect product (or next binge-worthy video).**

<hr>

**Types of Recommendation Systems**
  * **`Demographic Filtering`: Imagine you walk into a movie store blindfolded. The salesperson, armed with only your age and maybe your favorite color, recommends the "blockbuster hits" everyone's raving about. That's kind of how demographic filtering works. It uses broad categories like age, gender, or location to suggest movies that are generally popular with similar groups. While it can be a good starting point, it doesn't account for your unique tastes. It's like getting a generic recommendation instead of a friend suggesting a hidden gem they know you'll love.**


  * **`Content Based Filtering`: Ever notice how after watching a funny cat video on YouTube, you get bombarded with suggestions for more feline frolics? That's content-based filtering at work! This system is like a detective, looking at the clues (things like genre, director, or actors for movies) to find items that are similar to what you liked before. The idea is that if you enjoyed something, you'd probably enjoy something else with similar characteristics. It's a good way to discover hidden gems within a category you already love, but it might not introduce you to entirely new things outside your comfort zone.**

  * **`Collaborative Filtering`: Imagine you're at a party and hit it off with someone who has amazing taste in movies. They rave about this hidden gem you've never heard of, and you know you gotta check it out because you trusted their other picks. That's the magic of collaborative filtering! Unlike content-based systems that focus on the movie itself, this one is all about finding users with similar tastes to you. It's like having a secret network of movie buddies who recommend things they know you'll love, based on what they've enjoyed themselves. Pretty cool, right?**
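
**To make the "users with similar tastes" idea concrete, here is a minimal sketch on a made-up ratings matrix (the numbers are illustrative, not taken from the dataset). It uses cosine similarity, which is one common way to compare users and not necessarily what any real platform uses. Treating unrated movies as 0 is a simplification.**

```python
import numpy as np

# Toy ratings matrix (rows = users, columns = movies); 0 means "not rated".
# All values are made up for illustration.
ratings = np.array([
    [5, 4, 0, 1],   # user 0
    [4, 5, 1, 0],   # user 1: taste close to user 0
    [1, 0, 5, 4],   # user 2: taste far from user 0
], dtype=float)

def cosine_similarity(a, b):
    """Cosine of the angle between two rating vectors (1.0 = same direction)."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

sim_01 = cosine_similarity(ratings[0], ratings[1])
sim_02 = cosine_similarity(ratings[0], ratings[2])

print(f"user 0 vs user 1: {sim_01:.2f}")
print(f"user 0 vs user 2: {sim_02:.2f}")
```

**User 1 comes out far more similar to user 0 than user 2 does, so a collaborative filter would lean on user 1's ratings when suggesting movies to user 0.**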

**Assumption**
  * **We assume that all the ratings for a given movie appear on the rows that follow its `x:` marker row (where `x` is the movie ID), up to the next marker.**

  * **There are multiple customers, and a single customer can have multiple ratings, one for each movie they rated.**

<hr>

# **Loading the libraries**

In [1]:
import numpy as np
import pandas as pd

<hr>

# **Loading the dataset**

In [4]:
df = pd.read_csv("/content/drive/MyDrive/Datasets/Netflix/Data for Files/NetflixData.txt", names = ["CustID", "Ratings"], usecols = [0, 1], header = None)

**Data Inspection**

In [5]:
df.head()

Unnamed: 0,CustID,Ratings
0,1:,
1,1488844,3.0
2,822109,5.0
3,885013,4.0
4,30878,4.0


**These are the ratings given by the users. Note that the `CustID` column holds two kinds of values: movie markers such as `1:` (with a `NaN` rating) and the IDs of the customers who rated that movie.**

<hr>

**Shape Inspection**

In [6]:
a = df.shape
print(f"Rows: {a[0]} and Columns: {a[1]}")

Rows: 24058263 and Columns: 2


**We are working with a dataset of roughly 24 million records.**

<hr>

**Finding the relevant information in the data**

**How many movies do we have in total? Each movie marker row has a `NaN` rating, so counting the nulls in `Ratings` counts the movies.**

In [10]:
total_movie_count = df.isnull().sum()["Ratings"]

In [11]:
print(f"Total number of movies that are present are: {total_movie_count}")

Total number of movies that are present are: 4499


**Customer Details**

In [13]:
# How many customers do we have? Movie marker rows (e.g. "1:") also sit in CustID, so subtract them
customer_count = df["CustID"].nunique() - total_movie_count

In [14]:
print(f"Total customer count in the dataset after excluding the movie marker rows: {customer_count}")

Total customer count in the dataset after excluding the movie marker rows: 470758


In [15]:
df_nan = pd.DataFrame(pd.isnull(df["Ratings"]))

In [16]:
df_nan

Unnamed: 0,Ratings
0,True
1,False
2,False
3,False
4,False
...,...
24058258,False
24058259,False
24058260,False
24058261,False


In [17]:
df_nan = df_nan[df_nan["Ratings"] == True]

df_nan

Unnamed: 0,Ratings
0,True
548,True
694,True
2707,True
2850,True
...,...
24046714,True
24047329,True
24056849,True
24057564,True


In [21]:
df.iloc[24057834]

CustID     4499:
Ratings      NaN
Name: 24057834, dtype: object

**How many ratings are we dealing with?**

In [27]:
total_ratings = df["CustID"].count() - total_movie_count

In [28]:
total_ratings

24053764
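
**With the three counts above we can get a rough feel for how sparse the data is. The sketch below assumes `customer_count` really is the number of distinct customers, and treats every (customer, movie) pair as a rating that could have existed.**

```python
# Counts reported by the cells above
total_ratings = 24_053_764
customer_count = 470_758
total_movie_count = 4_499

# Every (customer, movie) pair is a potential rating
possible_pairs = customer_count * total_movie_count
density = total_ratings / possible_pairs

print(f"Possible pairs: {possible_pairs:,}")
print(f"Density: {density:.2%}")
```

**Only about 1 percent of the possible ratings exist, which is why collaborative filtering has to estimate the missing ones rather than look them up.**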

<hr>

**Individual Ratings**

In [29]:
df["Ratings"].value_counts()

Ratings
4.0    8085741
3.0    6904181
5.0    5506583
2.0    2439073
1.0    1118186
Name: count, dtype: int64

<hr>

# **Segregation of the data**

In [30]:
# Copy of my original data
temp = df.copy()

In [37]:
# Tracks the movie whose block of ratings we are currently inside
current_movie_id = None
# Stores the movie ID for every row, in order
movie_ids = []

# Walk through the CustID column row by row
for cust_id in temp['CustID']:
    # Marker rows have the form "X:" and start a new movie block
    if ':' in cust_id:
        # Strip the ':' and convert the movie ID to an integer
        current_movie_id = int(cust_id.replace(':', ''))
    # Every row (marker or rating) gets the current movie ID
    movie_ids.append(current_movie_id)

# Add the movie ID as a new column
temp['MovieID'] = movie_ids

# Drop the marker rows (their rating is NaN), keeping only the real customer ratings
temp = temp[temp['Ratings'].notna()]

In [38]:
temp.head()

Unnamed: 0,CustID,Ratings,MovieID
1,1488844,3.0,1
2,822109,5.0,1
3,885013,4.0,1
4,30878,4.0,1
5,823519,3.0,1
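
**Looping over 24 million rows in Python can be slow. As an alternative, the same marker-to-column step can be sketched with vectorized pandas calls. The tiny `raw` frame below is a made-up stand-in that only mimics the layout of the real file, and the names `is_marker` and `tidy` are illustrative.**

```python
import pandas as pd

# Tiny stand-in for the raw file: "N:" rows mark the start of movie N.
# The rows are illustrative, not a faithful excerpt of the dataset.
raw = pd.DataFrame({
    "CustID": ["1:", "1488844", "822109", "2:", "885013"],
    "Ratings": [None, 3.0, 5.0, None, 4.0],
})

# True on marker rows such as "1:"
is_marker = raw["CustID"].str.endswith(":")

# Keep the numeric ID on marker rows only, then carry it forward to the rating rows
movie_id = (
    pd.to_numeric(raw["CustID"].str.rstrip(":"))
    .where(is_marker)
    .ffill()
    .astype(int)
)

# Attach the column and drop the marker rows
tidy = raw.assign(MovieID=movie_id)[~is_marker].reset_index(drop=True)
print(tidy)
```

**This relies on the first row being a marker, which holds for the file shown above; otherwise the forward fill would leave missing values at the top.**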


In [39]:
df = temp.copy()

<hr>

# **Preparation for Collaborative Filtering**

* **We cannot drop duplicates on any single column, because the same customer legitimately appears many times (once per movie rated), and so does each movie. Instead, we filter the data using benchmarks (thresholds).**

* **We need two benchmarks for this dataset:**
  * **Customers who rarely rate: inactive users, or possibly fake or dummy accounts. They may watch movies but leave few ratings, so they tell us little and we remove them.**
  * **Movies with few ratings: these are probably less popular, so we will not recommend them and remove them from the list.**

<hr>

**Benchmark 1 - Removing the least rated movies**

In [44]:
movie_list = df.groupby("MovieID")["Ratings"].agg(["count"])

In [50]:
movie_list

Unnamed: 0_level_0,count
MovieID,Unnamed: 1_level_1
1,547
2,145
3,2012
4,142
5,1140
...,...
4495,614
4496,9519
4497,714
4498,269


In [46]:
movie_list["count"].quantile(0.7)

1798.6

In [47]:
benchmark_movies = round(movie_list["count"].quantile(0.7), 0)

benchmark_movies

1799.0

In [51]:
# Index of the movies whose rating count is below the benchmark; we will use this list to drop those movies
drop_list_movies = movie_list[movie_list["count"] < benchmark_movies].index

In [53]:
drop_list_movies

Index([   1,    2,    4,    5,    6,    7,    9,   10,   11,   12,
       ...
       4484, 4486, 4487, 4489, 4491, 4494, 4495, 4497, 4498, 4499],
      dtype='int64', name='MovieID', length=3149)

<hr>

**Benchmark 2 - Least active customers**

In [54]:
cust_list = df.groupby("CustID")["Ratings"].agg(["count"])

In [55]:
cust_list

Unnamed: 0_level_0,count
CustID,Unnamed: 1_level_1
10,49
1000004,1
1000027,30
1000033,101
1000035,20
...,...
999964,48
999972,35
999977,14
999984,38


In [56]:
# Benchmark for customers: the 70th percentile of ratings per customer; anyone below it counts as least active
benchmark_customer = round(cust_list["count"].quantile(0.7), 0)

In [57]:
benchmark_customer

52.0

In [58]:
# Drop list of the least active customers
drop_list_cust = cust_list[cust_list["count"] < benchmark_customer].index

In [59]:
drop_list_cust

Index(['10', '1000004', '1000027', '1000035', '1000038', '1000051', '1000057',
       '100006', '100007', '1000072',
       ...
       '999932', '999935', '99994', '999945', '999949', '999964', '999972',
       '999977', '999984', '999988'],
      dtype='object', name='CustID', length=327300)

<hr>

**Let's remove both groups from the data**

In [60]:
df.columns

Index(['CustID', 'Ratings', 'MovieID'], dtype='object')

In [79]:
(df["CustID"].unique())

array(['712664', '1331154', '2632461', ..., '605364', '2076092',
       '2507614'], dtype=object)

In [61]:
# Remove the low-count movies from the data using the drop list,
# i.e. keep only the rows whose MovieID is not in drop_list_movies
df = df[~df["MovieID"].isin(drop_list_movies)]

In [63]:
df = df[~df["CustID"].isin(drop_list_cust)]

In [64]:
print(f"Rows: {df.shape[0]} and columns: {df.shape[1]}")

Rows: 17337458 and columns: 3
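
**The two benchmarks follow the same recipe: count the ratings per ID, take the 0.7 quantile of those counts as a threshold, and drop everything below it. That removes roughly 70 percent of the IDs (3149 of the 4499 movies above). Here is a small sketch of the recipe on made-up data. The helper name `low_count_ids` and the `toy` frame are illustrative and not part of the notebook.**

```python
import pandas as pd

def low_count_ids(frame, key, q=0.7):
    """Return the IDs in `key` whose rating count is below the q-th quantile of counts."""
    counts = frame.groupby(key)["Ratings"].count()
    threshold = round(counts.quantile(q))
    return counts[counts < threshold].index

# Made-up ratings: customer "a" is the most active, movie 1 the most rated
toy = pd.DataFrame({
    "CustID":  ["a", "a", "a", "b", "b", "c", "a", "b", "c", "d"],
    "MovieID": [1, 2, 3, 1, 2, 1, 4, 3, 2, 1],
    "Ratings": [5.0, 4.0, 3.0, 4.0, 2.0, 5.0, 1.0, 3.0, 4.0, 2.0],
})

# Both drop lists are computed on the unfiltered data, as in the notebook
drop_movies = low_count_ids(toy, "MovieID")
drop_custs = low_count_ids(toy, "CustID")

filtered = toy[~toy["MovieID"].isin(drop_movies) & ~toy["CustID"].isin(drop_custs)]
print(filtered)
```

**Because both lists come from the original counts, a movie that passes the benchmark can still end up with fewer ratings than the threshold once the inactive customers are removed.**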


In [85]:
df.head()

Unnamed: 0,CustID,Ratings,MovieID
696,712664,5.0,3
697,1331154,4.0,3
698,2632461,3.0,3
699,44937,5.0,3
700,656399,4.0,3


In [89]:
df["CustID"].dtype

dtype('O')

<hr>

# **Import the secondary data (Holding the movie names)**

In [70]:
movies_df = pd.read_csv("/content/drive/MyDrive/Datasets/Netflix/Data for Files/NetflixMovieData.csv", names = ["MovieID", "Year", "Name"], usecols = [0, 1, 2], header = None)

movies_df = movies_df.set_index("MovieID")

In [71]:
movies_df.head()

Unnamed: 0_level_0,Year,Name
MovieID,Unnamed: 1_level_1,Unnamed: 2_level_1
1,2003.0,Dinosaur Planet
2,2004.0,Isle of Man TT 2004 Review
3,1997.0,Character
4,1994.0,Paula Abdul's Get Up & Dance
5,2004.0,The Rise and Fall of ECW


<hr>

# **Recommendation System using SVD**

**Package: scikit-surprise**

In [72]:
!pip install scikit-surprise

Collecting scikit-surprise
  Downloading scikit_surprise-1.1.4.tar.gz (154 kB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Building wheels for collected packages: scikit-surprise
  Building wheel for scikit-surprise (pyproject.toml) ...

In [73]:
from surprise import Reader, Dataset, SVD
from surprise.model_selection import cross_validate

In [74]:
# Reader tells Surprise how to parse the ratings; its default rating scale is 1 to 5, which matches our data
reader = Reader()
# The Reader holds no data itself; it is passed to Dataset.load_from_df in the next cell

In [75]:
# To keep the model light we use only the first 100,000 rows (1 lakh); the rows are ordered by movie, so this slice covers only the first few movies
data = Dataset.load_from_df(df[["CustID", "MovieID", "Ratings"]][:100000], reader)

In [76]:
data

<surprise.dataset.DatasetAutoFolds at 0x7fac0c2b8220>

<hr>

**Model Building**

In [77]:
model = SVD()
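
**For intuition: Surprise's `SVD` predicts a rating as the global mean plus a user bias, an item bias, and the dot product of a user factor vector and an item factor vector. The sketch below is a toy NumPy version of that idea trained with stochastic gradient descent. It is not the library's code, and the data and hyperparameters are made up.**

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up (user, item, rating) triples with 0-based indices
triples = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0),
           (1, 2, 1.0), (2, 1, 2.0), (2, 2, 5.0)]
n_users, n_items, n_factors = 3, 3, 2
lr, reg, n_epochs = 0.01, 0.02, 200   # illustrative hyperparameters

mu = np.mean([r for _, _, r in triples])        # global mean rating
b_u = np.zeros(n_users)                         # user biases
b_i = np.zeros(n_items)                         # item biases
P = rng.normal(0, 0.1, (n_users, n_factors))    # user factors
Q = rng.normal(0, 0.1, (n_items, n_factors))    # item factors

def predict(u, i):
    """Estimated rating: mean + user bias + item bias + factor interaction."""
    return mu + b_u[u] + b_i[i] + Q[i] @ P[u]

def rmse():
    """Root mean squared error on the training triples."""
    return np.sqrt(np.mean([(r - predict(u, i)) ** 2 for u, i, r in triples]))

rmse_before = rmse()
for _ in range(n_epochs):
    for u, i, r in triples:
        err = r - predict(u, i)
        b_u[u] += lr * (err - reg * b_u[u])
        b_i[i] += lr * (err - reg * b_i[i])
        # Both factor vectors are updated from their pre-update values
        P[u], Q[i] = (P[u] + lr * (err * Q[i] - reg * P[u]),
                      Q[i] + lr * (err * P[u] - reg * Q[i]))
rmse_after = rmse()

print(f"training RMSE: {rmse_before:.3f} -> {rmse_after:.3f}")
```

**The training error shrinks as the biases and factors adapt to the known ratings; the fitted pieces are then combined to estimate ratings that were never given.**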

In [78]:
# cross_validate splits the data into 4 folds and reports the test RMSE on each, which estimates how well the model predicts unseen ratings
cross_validate(model, data, measures = ["RMSE"], cv = 4)

{'test_rmse': array([0.99456951, 1.00268851, 1.00014923, 0.99026191]),
 'fit_time': (1.6247625350952148,
  1.6647472381591797,
  3.769357681274414,
  2.0972983837127686),
 'test_time': (0.410754919052124,
  0.9401366710662842,
  0.32293033599853516,
  0.14821910858154297)}
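
**To read this output, it helps to average the four fold scores and to recall how RMSE is computed. The fold values below are copied from the output above; the `actual` and `predicted` arrays are made up purely to show the formula.**

```python
import numpy as np

# Per-fold test RMSE copied from the cross_validate output
test_rmse = np.array([0.99456951, 1.00268851, 1.00014923, 0.99026191])

print(f"mean RMSE: {test_rmse.mean():.4f}")
print(f"std  RMSE: {test_rmse.std():.4f}")

# RMSE by hand on made-up numbers: square the errors, average them, take the root
actual = np.array([5.0, 3.0, 4.0])
predicted = np.array([4.0, 3.0, 2.0])
rmse = np.sqrt(np.mean((actual - predicted) ** 2))
print(f"toy RMSE: {rmse:.3f}")
```

**An RMSE close to 1.0 suggests that, on this 100,000-row sample, an estimate is typically off by about one star on the 1 to 5 scale.**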

In [106]:
# Movies that customer 1331154 rated 5 stars (this customer's favourites)
data_1331154 = df[(df["CustID"] == "1331154") & (df["Ratings"] == 5.0)]

In [107]:
data_1331154

Unnamed: 0,CustID,Ratings,MovieID
458308,1331154,5.0,143
1184450,1331154,5.0,270
1991774,1331154,5.0,361
2369367,1331154,5.0,457
2600328,1331154,5.0,482
3417458,1331154,5.0,658
4029215,1331154,5.0,763
5646194,1331154,5.0,1144
7075510,1331154,5.0,1425
7423467,1331154,5.0,1476


In [84]:
movies_df

Unnamed: 0_level_0,Year,Name
MovieID,Unnamed: 1_level_1,Unnamed: 2_level_1
1,2003.0,Dinosaur Planet
2,2004.0,Isle of Man TT 2004 Review
3,1997.0,Character
4,1994.0,Paula Abdul's Get Up & Dance
5,2004.0,The Rise and Fall of ECW
...,...,...
17766,2002.0,Where the Wild Things Are and Other Maurice Se...
17767,2004.0,Fidel Castro: American Experience
17768,2000.0,Epoch
17769,2003.0,The Company


In [99]:
# Build the list of movies a user could possibly watch; we will recommend movies to the user from this list based on their behaviour
list_of_infinite_possibility = movies_df.copy()

In [100]:
list_of_infinite_possibility.reset_index(inplace = True)

In [101]:
# Keep only the movies that survived the movie benchmark. Using this list and the user's data,
# we will try to pick the next recommendations for the user
list_of_infinite_possibility = list_of_infinite_possibility[~list_of_infinite_possibility["MovieID"].isin(drop_list_movies)]

list_of_infinite_possibility

Unnamed: 0,MovieID,Year,Name
2,3,1997.0,Character
7,8,2004.0,What the #$*! Do We Know!?
15,16,1996.0,Screamers
16,17,2005.0,7 Seconds
17,18,1994.0,Immortal Beloved
...,...,...,...
17764,17766,2002.0,Where the Wild Things Are and Other Maurice Se...
17765,17767,2004.0,Fidel Castro: American Experience
17766,17768,2000.0,Epoch
17767,17769,2003.0,The Company


In [111]:
# Estimate a score for every candidate movie
list_of_infinite_possibility["Estimate Score"] = list_of_infinite_possibility["MovieID"].apply(lambda x: model.predict("1331154", x).est)
# model.predict returns a prediction for the given user (1331154) and movie; .est is the estimated rating

In [112]:
list_of_infinite_possibility = list_of_infinite_possibility.sort_values("Estimate Score", ascending = False)

In [113]:
list_of_infinite_possibility

Unnamed: 0,MovieID,Year,Name,Estimate Score
2,3,1997.0,Character,4.328682
17,18,1994.0,Immortal Beloved,4.018907
29,30,2003.0,Something's Gotta Give,3.840619
27,28,2002.0,Lilo and Stitch,3.772996
10462,10464,1995.0,Tenchi Muyo! Ryo Ohki,3.717107
...,...,...,...,...
17762,17764,1998.0,Shakespeare in Love,3.717107
7,8,2004.0,What the #$*! Do We Know!?,3.668707
16,17,2005.0,7 Seconds,3.259428
15,16,1996.0,Screamers,3.124936
