# Netflix Recommendation System

## Project Purpose:

The purpose of the project is to develop a movie recommendation system for Netflix based on the given dataset. The project aims to provide users with personalized recommendations for movies in the Comedy genre that have a high rating and a significant number of votes. By analyzing the dataset and applying filters based on genre and votes, the project identifies the top-rated comedy movies that meet the specified criteria. The purpose is to help users discover and explore popular comedy movies on Netflix, enhancing their viewing experience and assisting them in finding movies that align with their preferences.

The project utilizes various techniques and libraries to implement the movie recommendation system. 

The techniques used as follows:

- Data Cleaning: The script performs data cleaning operations to preprocess the dataset and ensure its quality for analysis. Techniques such as dropping unnecessary columns, removing duplicates, and converting data types are employed.


- Data Filtering: The script filters the dataset based on the specified genre ("Comedy") to focus on movies within that category. It uses the str.contains() method to identify rows that contain the desired genre.


- Data Manipulation: The script calculates a weighted average score for each movie, considering the number of votes and the overall mean rating. This manipulation allows for a more personalized ranking of movies. Additionally, the script uses numpy (np) library to calculate the quantile for setting the votes threshold.


- Data Sorting and Display: The script sorts the movies based on their weighted average scores in descending order and displays the top recommendations. The pandas library is used for sorting the DataFrame and the IPython.display library is used to present the recommendations in a visually appealing manner.


The project relies on the following libraries:

- pandas (pd): Used for data manipulation and analysis, including reading the dataset from a CSV file, dropping columns, handling missing values, and performing sorting operations.
- numpy (np): Utilized for numerical computations, specifically calculating the quantile for determining the votes threshold.
- IPython.display: Employed for displaying the top recommendations in a tabular format.

These techniques and libraries enable the project to process and analyze the movie dataset, filter movies based on genre and votes, calculate a weighted average score, and present the top recommendations to the users.

In [None]:
import pandas as pd  # Importing the pandas library for data manipulation and analysis
from IPython.display import display  # Importing display capabilities from IPython
import numpy as np  # Importing the NumPy library for numerical computations

In [9]:
pd.set_option("display.max_rows", None)  # Setting the option to display all rows of a DataFrame

In [65]:
df = None  # Initializing a variable df with a value of None
# Note: The DataFrame df is not yet defined or assigned at this point.
# Further code is needed to process or analyze data using the df DataFrame.

In [125]:
minVotes = None  # Initializing minVotes variable

def readFile(path):
    global df
    df = pd.read_csv(path)  # Reading a CSV file and assigning the result to the global variable df

def cleanDataset():
    global df
    # Dropping unnecessary columns from the DataFrame
    df = df.drop(['year'], axis=1)
    df = df.drop(['certificate'], axis=1)
    df = df.drop(['duration'], axis=1)
    df = df.drop(['description'], axis=1)
    df = df.drop(['stars'], axis=1)
    # Removing duplicate rows based on the 'title' column
    df = df.drop_duplicates(subset=['title'])
    # Cleaning the 'votes' column by removing commas and converting the values to float
    df['votes'] = df['votes'].str.replace(',', '').astype(float)
    # Filling any missing values in the DataFrame with 0
    df = df.fillna(0)

def filterGenre(genre):
    global df
    # Filtering the DataFrame to include only rows that have the specified genre in the 'genre' column
    df = df[df['genre'].str.contains(genre, case=False).fillna(False)]

def threshVotes(thrs):
    global df
    global minVotes
    # Calculating the minimum number of votes based on the specified threshold using numpy's quantile function
    minVotes = np.quantile(df['votes'], thrs)
    # Removing rows from the DataFrame where the number of votes is less than the minimum votes threshold
    df = df.drop(df[df.votes < minVotes].index)

def weightedAvgScore():
    global df
    weightedAvg = []
    mean = df['rating'].mean()
    df = df.reset_index(drop=True)
    # Calculating the weighted average score for each row in the DataFrame
    for i in range(0, len(df['rating'])):
        res = (df['votes'][i] / (df['votes'][i] + minVotes) * df['rating'][i]) + (minVotes / (df['votes'][i] + minVotes)) * mean
        weightedAvg.append(res)
    df["weightedAvg"] = weightedAvg  # Adding the calculated weighted average scores to a new column 'weightedAvg'

def sortNscores(n):
    # Sorting the DataFrame based on the 'weightedAvg' column in descending order and selecting the top n rows
    sort = df.sort_values('weightedAvg', ascending=False).head(n)
    display(sort)  # Displaying the sorted DataFrame

def runRecommenderSystem(genre, votesThrs, numberOfReturnedData):
    # Running the recommender system by executing a series of functions in a specific order
    readFile("NetflixDatasetMovies.csv")
    cleanDataset()
    filterGenre(genre)
    threshVotes(votesThrs)
    weightedAvgScore()
    sortNscores(numberOfReturnedData)

In [126]:
runRecommenderSystem("Drama", 0.8, 20)

Unnamed: 0,title,genre,rating,votes,weightedAvg
4,Breaking Bad,"Crime, Drama, Thriller",9.5,1831340.0,9.475954
53,Sherlock,"Crime, Drama, Mystery",9.1,913816.0,9.060775
33,The Lord of the Rings: The Return of the King,"Action, Adventure, Drama",9.0,1819157.0,8.98116
167,Death Note,"Animation, Crime, Drama",9.0,316300.0,8.896898
2,Better Call Saul,"Crime, Drama",8.9,501384.0,8.837303
36,Fargo,"Crime, Drama, Thriller",8.9,369918.0,8.816149
9,The Lord of the Rings: The Fellowship of the Ring,"Action, Adventure, Drama",8.8,1844055.0,8.783529
41,The Lord of the Rings: The Two Towers,"Action, Adventure, Drama",8.8,1642708.0,8.781534
498,Leyla and Mecnun,"Adventure, Comedy, Drama",9.1,93632.0,8.77698
23,Black Mirror,"Drama, Mystery, Sci-Fi",8.8,535782.0,8.744737


In [213]:
class netflixTopRecommenderSystem:
    def __init__(self, datasetPath, genre, votesThrs, topN):
        # Constructor for the netflixTopRecommenderSystem class
        self._datasetPath = datasetPath  # Path to the dataset file
        self._genre = genre  # Genre for filtering the recommendations
        self._votesThrs = votesThrs  # Votes threshold for filtering the recommendations
        self._topN = topN  # Number of top recommendations to display
        self.df = None  # DataFrame to store the dataset
        self._minVotes = None  # Minimum number of votes based on the threshold

        self.colDrop = ['year', 'certificate', 'duration', 'description', 'stars']  # Columns to be dropped from the dataset
        self.colTitle = {"votes": "votes", "genre": "genre", "title": "title", "rating": "rating"}  # Column names dictionary

    def readFile(self):
        # Read the dataset file into the DataFrame
        self.df = pd.read_csv(self._datasetPath)

    def cleanDataset(self):
        # Clean the dataset by dropping unnecessary columns and removing duplicates
        for item in self.colDrop:
            self.df = self.df.drop([item], axis=1)

        self.df = self.df.drop_duplicates(subset=[self.colTitle["title"]])

        if type(self.df[self.colTitle["votes"]][0]) == str:
            # Clean the 'votes' column by removing commas and converting values to float if necessary
            self.df[self.colTitle["votes"]] = self.df[self.colTitle["votes"]].str.replace(',', '').astype(float)

        self.df = self.df.fillna(0)

    def filterGenre(self):
        # Filter the dataset to include only rows with the specified genre
        self.df = self.df[self.df[self.colTitle["genre"]].str.contains(self._genre, case=False).fillna(False)]

    def threshVotes(self):
        # Set the minimum number of votes based on the threshold and remove rows with fewer votes
        self._minVotes = np.quantile(self.df[self.colTitle["votes"]], self._votesThrs)
        self.df = self.df.drop(self.df[self.df.votes < self._minVotes].index)

    def weightedAvgScore(self):
        # Calculate the weighted average score for each row in the dataset
        weightedAvg = []
        mean = self.df[self.colTitle["rating"]].mean()
        self.df = self.df.reset_index(drop=True)

        for i in range(0, len(self.df[self.colTitle["rating"]])):
            res = (self.df[self.colTitle["votes"]][i] / (self.df[self.colTitle["votes"]][i] + self._minVotes) * self.df[self.colTitle["rating"]][i]) + (self._minVotes / (self.df[self.colTitle["rating"]][i] + self._minVotes)) * mean
            weightedAvg.append(res)
        self.df["weightedAvg"] = weightedAvg

    def sortNscores(self):
        # Sort the dataset based on the weighted average score and display the top recommendations
        sort = self.df.sort_values('weightedAvg', ascending=False).head(self._topN)
        display(sort)

    def run(self):
        # Run the recommender system by executing a series of methods in a specific order
        self.readFile()
        self.cleanDataset()
        self.filterGenre()
        self.threshVotes()
        self.weightedAvgScore()
        self.sortNscores()
        

In [214]:
path = "NetflixDatasetMovies.csv"
best5Comedy = netflixTopRecommenderSystem(path, "Comedy", 0.8, 20)

# Modifying class attributes for column drop and column titles
best5Comedy.colDrop = ['year', 'certificate', 'duration', 'description', 'stars']
best5Comedy.colTitle= {"votes": "votes", "genre": "genre", "title":"title", "rating": "rating"}

# Running the recommender system
best5Comedy.run()


Unnamed: 0,title,genre,rating,votes,weightedAvg
1,Rick and Morty,"Animation, Adventure, Comedy",9.2,502160.0,15.936105
3,Friends,"Comedy, Romance",8.9,979424.0,15.71961
8,Seinfeld,Comedy,8.9,314089.0,15.54797
27,South Park,"Animation, Comedy",8.7,366394.0,15.388565
29,Arrested Development,Comedy,8.7,302834.0,15.344859
5,Modern Family,"Comedy, Drama, Romance",8.5,423963.0,15.221511
7,Suits,"Comedy, Drama",8.5,405863.0,15.213583
25,BoJack Horseman,"Animation, Comedy, Drama",8.8,152649.0,15.199459
275,Leyla and Mecnun,"Adventure, Comedy, Drama",9.1,93632.0,15.183357
4,Shameless,"Comedy, Drama",8.6,239541.0,15.182946


## **Conclusion:**

The Python script implements a Netflix movie recommender system for the Comedy genre. By analyzing the dataset, filtering by genre, applying a votes threshold, and calculating a weighted average score, the script provides insights into the top-rated comedy movies on Netflix. These recommendations can assist users in discovering highly-rated movies in the Comedy genre that have received a significant number of votes. The script demonstrates the application of data cleaning, filtering, and ranking techniques to derive meaningful insights from a given dataset.

**Analytics Insights:**

- Dataset Source: The script utilizes a dataset sourced from a CSV file containing information about movies on Netflix.

- Data Cleaning: The script performs data cleaning operations to prepare the dataset for analysis. It drops unnecessary columns such as year, certificate, duration, description, and stars. Duplicate movie titles are removed to ensure uniqueness in the dataset. The 'votes' column is cleaned by removing commas and converting the values to float data type.

- Filtering by Genre: The script filters the dataset based on a specified genre. In this case, the genre chosen is "Comedy". Only movies that have the genre "Comedy" (case-insensitive) in their genre column are retained for further analysis.

- Votes Threshold: A votes threshold is applied to filter out movies with a lower number of votes. The threshold is set at 0.8, which means that movies with votes below the 80th percentile will be excluded from the recommendations.

- Weighted Average Score: The script calculates a weighted average score for each movie in the dataset. The weighted average takes into account the number of votes and the overall mean rating of the movies.

- Top Recommendations: The script sorts the movies based on their weighted average scores in descending order and displays the top recommendations. The number of top recommendations shown is set to 20.