# Project: Movie Recommender System

## Overview
Create a recommender system based on the ReelGood Data, utilizing only Python in Google Colab Jupyter notebooks. This system should not require any external data or tools.

## Objectives
- Build a tool that asks the user for specific inputs about their movie preferences.
- Use the ReelGood dataset to generate movie or show recommendations based on these inputs.


In [5]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from ast import literal_eval
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.metrics.pairwise import linear_kernel, cosine_similarity
from nltk.stem.snowball import SnowballStemmer
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.corpus import wordnet

In [6]:
import pandas as pd
import warnings

warnings.filterwarnings('ignore')  # Suppress all warnings
df = pd.read_csv('Reel Good Data (Title+Service+Genre+Tag List).csv')

# Recommender System

The simple recommender offers general recommendations to every user based on the popularity and critical acclaim of TV shows and movies. This model does not provide personalized recommendations but instead sorts titles based on metrics such as ratings and a proxy measure for popularity, assumed from the "Service" data. The principle behind this system is that titles which are more popular and critically acclaimed are likely to be appreciated by the average viewer.


## Implementation

The implementation of this simple recommender involves calculating a weighted rating for each title based on its IMDB rating and its availability across various streaming platforms (as a measure of popularity). The weighted rating formula takes into account both the average rating of the title and the global average rating across all titles, adjusted by the title’s relative popularity.

## Weighted Rating Formula

The weighted rating (WR) combines a title's individual quality with its popularity to derive a score that ranks titles more effectively. The formula used is as follows:

$$
\text{Weighted Rating (WR)} = \left(\frac{v}{v + m} \cdot R\right) + \left(\frac{m}{m + v} \cdot C\right)
$$

where:
- \( v \) is the number of platforms on which the title is available, serving as a proxy for popularity.
- \( m \) is the minimum number of platforms required for a title to be included in the rankings, determined as the 95th percentile of \( v \).
- \( R \) is the average rating of the title (from IMDB).
- \( C \) is the mean rating across all titles (the average IMDB rating).

This formula balances widespread appeal with quality: as \( v \) grows relative to \( m \), WR approaches the title's own rating \( R \), while titles available on few platforms are pulled toward the global mean \( C \).
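
To make the behaviour of the formula concrete, the sketch below evaluates it for two hypothetical titles that share the same rating but differ in platform count. The values chosen for `C` and `m` are made up for illustration and are not taken from the dataset.

```python
def weighted_rating(v, m, R, C):
    """Blend a title's own rating R with the global mean C, weighted by v versus m."""
    return (v / (v + m)) * R + (m / (v + m)) * C


# Hypothetical numbers, chosen only to illustrate the behaviour
C = 6.0  # assumed global mean rating
m = 2    # assumed platform threshold

low = weighted_rating(v=1, m=m, R=9.0, C=C)    # title available on 1 platform
high = weighted_rating(v=18, m=m, R=9.0, C=C)  # title available on 18 platforms

print(round(low, 2), round(high, 2))
```

With these assumed numbers the title on one platform scores about 7.0 and the title on 18 platforms about 8.7, so the more widely available title keeps more of its own rating.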


In [8]:
df['IMDB'] = pd.to_numeric(df['IMDB'], errors='coerce')
df['Service Count'] = df['Service'].apply(lambda x: len(x.split(', ')) if pd.notnull(x) else 0)

C = df['IMDB'].mean()
m = df['Service Count'].quantile(0.95)

def weighted_rating(x, m, C):
    v = x['Service Count']
    R = x['IMDB']
    return (v/(v+m) * R) + (m/(m+v) * C)

df['wr'] = df.apply(lambda x: weighted_rating(x, m, C), axis=1)

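# Collapse repeated rows into one row per title and release year
agg_data = df.groupby(['Title', 'Released Year']).agg({
    'wr': 'max',             # best weighted rating among the title's rows
    'Service Count': 'sum',  # total platform count across the title's rows
    'IMDB': 'mean',
}).reset_index()

# Keep only titles that meet the platform threshold m
qualified_IMDB = agg_data[agg_data['Service Count'] >= m]

top_10_IMDB = qualified_IMDB.sort_values('wr', ascending=False).head(10)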

In [9]:
top_10_IMDB.head()

|  | Title | Released Year | wr | Service Count | IMDB |
|---|---|---|---|---|---|
| 16897 | Eco-Terrorist: Battle for Our Planet | 2019 | 8.059343 | 1 | 10.0 |
| 8546 | Bluey | 2018 | 7.909343 | 15 | 9.7 |
| 66143 | You May Be Pretty, But I Am Beautiful: The Adr... | 2019 | 7.909343 | 2 | 9.7 |
| 60558 | This Happened: Claudia Brücken Live at the Scala | 2012 | 7.859343 | 2 | 9.6 |
| 51786 | The Curators of Dixon School | 2012 | 7.859343 | 5 | 9.6 |
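
The rows above include a title with a Service Count of 1, so the threshold `m` cannot have been larger than 1 in this run. A minimal sketch of how that can happen, assuming (hypothetically) that almost every row lists a single service: the 95th percentile of such a column is 1, and the `>= m` filter then keeps everything.

```python
import pandas as pd

# Hypothetical per-row service counts: 99 rows with one service, one row with three
service_count = pd.Series([1] * 99 + [3])

m = service_count.quantile(0.95)          # 95th percentile of the counts, here 1.0
share_kept = (service_count >= m).mean()  # fraction of rows that pass the filter, here 1.0

print(m, share_kept)
```

When the counts are this skewed, the percentile threshold does not separate popular titles from obscure ones, which may explain why little-known titles with very high ratings reach the top of the list.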


## Weighted Rating Formula (ReelGood Score)

The same weighted rating (WR) is now computed from the ReelGood score instead of the IMDB rating, with platform counts taken from the "Where to Watch" column. The formula is unchanged:

$$
\text{Weighted Rating (WR)} = \left(\frac{v}{v + m} \cdot R\right) + \left(\frac{m}{m + v} \cdot C\right)
$$

where:
- \( v \) is the number of platforms on which the title is available, serving as a proxy for popularity.
- \( m \) is the minimum number of platforms required for a title to be included in the rankings, determined as the 95th percentile of \( v \).
- \( R \) is the average rating of the title (from ReelGood).
- \( C \) is the mean rating across all titles (the average ReelGood rating).

This formula ensures that both popular and highly rated titles are recommended, balancing widespread appeal with quality.


In [10]:
df['Service Count'] = df['Where to Watch'].apply(lambda x: len(x.split(', ')) if pd.notnull(x) else 0) 

C = df['ReelGood'].mean()
m = df['Service Count'].quantile(0.95)

def weighted_rating(x, m, C):
    v = x['Service Count']
    R = x['ReelGood']
    return (v/(v+m) * R) + (m/(m+v) * C)

df['wr'] = df.apply(lambda x: weighted_rating(x, m, C), axis=1)
agg_data = df.groupby(['Title', 'Released Year']).agg({
    'wr': 'max',
    'Service Count': 'sum',
    'ReelGood': 'mean',
}).reset_index()

qualified_ReelGood = agg_data[agg_data['Service Count'] >= m]

top_10_ReelGood = qualified_ReelGood.sort_values('wr', ascending=False).head(10)

In [11]:
top_10_ReelGood.head()

|  | Title | Released Year | wr | Service Count | ReelGood |
|---|---|---|---|---|---|
| 58386 | The Silence of the Lambs | 1991 | 75.581241 | 75 | 97.0 |
| 36668 | Night of the Living Dead | 1968 | 74.556244 | 432 | 86.0 |
| 21825 | Good Will Hunting | 1997 | 73.914575 | 5 | 94.0 |
| 61703 | Train to Busan | 2016 | 73.523117 | 162 | 90.0 |
| 5371 | Attack on Titan | 2013 | 72.803463 | 135 | 92.0 |


## Conclusion

This simple recommender system effectively highlights top titles based on their popularity and critical ratings. By leveraging the availability data from "Where to Watch" as a proxy for popularity, this model can recommend titles that are widely regarded and potentially of higher interest to the average viewer. The next steps could involve refining the metrics for popularity or integrating additional data sources for a more comprehensive view.
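
One possible refinement of the popularity metric is sketched here as an assumption, not as the notebook's method. The cells above compute `wr` per row and only afterwards sum `Service Count` per title, so the `v` used in the formula and the count used in the filter can differ. The variant below aggregates first and then computes the rating from the summed count. The toy rows and titles are hypothetical.

```python
import pandas as pd

# Hypothetical rows: "Title A" appears twice with different service lists
rows = pd.DataFrame({
    "Title": ["Title A", "Title A", "Title B"],
    "Released Year": [2000, 2000, 2010],
    "Where to Watch": ["Netflix, Hulu", "Prime Video", "Netflix"],
    "ReelGood": [80.0, 80.0, 90.0],
})

# Count services per row, as the notebook does
rows["Service Count"] = rows["Where to Watch"].apply(
    lambda s: len(s.split(", ")) if pd.notnull(s) else 0
)

# Aggregate first: one row per title with its total platform count
agg = rows.groupby(["Title", "Released Year"], as_index=False).agg(
    {"Service Count": "sum", "ReelGood": "mean"}
)

# Then derive C, m and the weighted rating from the aggregated values
C = agg["ReelGood"].mean()
m = agg["Service Count"].quantile(0.95)
v = agg["Service Count"]
agg["wr"] = v / (v + m) * agg["ReelGood"] + m / (v + m) * C

print(agg)
```

Computed this way, the same platform count drives both the weighting and the `>= m` filter.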


# Title-Based Recommendations

In [12]:
import warnings

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

warnings.filterwarnings('ignore')  # Suppress all warnings
df = pd.read_csv('Reel Good Data (Title+Service+Genre+Tag List).csv')

df['Service Count'] = df['Where to Watch'].apply(lambda x: len(x.split(', ')) if pd.notnull(x) else 0)

C = df['ReelGood'].mean()
m = df['Service Count'].quantile(0.95)

def weighted_rating(x, m, C):
    v = x['Service Count']
    R = x['ReelGood']
    return (v/(v+m) * R) + (m/(m+v) * C)

df['wr'] = df.apply(lambda x: weighted_rating(x, m, C), axis=1)

agg_data = df.groupby(['Title', 'Released Year']).agg({
    'wr': 'max',
    'Service Count': 'sum',
    'ReelGood': 'mean',
}).reset_index()

# Copy the filtered rows so the 'similarity' column can be added without writing to a view
qualified_ReelGood = agg_data[agg_data['Service Count'] >= m].copy()

tfidf = TfidfVectorizer(stop_words='english')
tfidf_matrix = tfidf.fit_transform(qualified_ReelGood['Title'])

user_title = input("Enter a title you like: ")

title_idx = tfidf.transform([user_title]).toarray()
cos_sim = cosine_similarity(title_idx, tfidf_matrix)

qualified_ReelGood['similarity'] = cos_sim[0]

top_3_ReelGood = qualified_ReelGood.sort_values(['similarity', 'wr'], ascending=[False, False]).head(3)

print(top_3_ReelGood[['Title', 'Released Year', 'wr', 'Service Count', 'ReelGood']])

Enter a title you like: Breaking bad
                 Title  Released Year         wr  Service Count  ReelGood
9193      Breaking Bad           2008  65.871862              4     100.0
9204       Breaking In           2011  53.871862             24      64.0
9213  Breaking Through           2013  42.871862             24      31.0
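
The run above matches titles on shared words: every result contains "Breaking". The sketch below illustrates that idea with plain cosine similarity over raw word counts. It is a simplification and an assumption for illustration: unlike `TfidfVectorizer(stop_words='english')`, it applies no IDF weighting and removes no stop words, so its scores will not equal the notebook's. The helper name `cosine` is hypothetical.

```python
import math
from collections import Counter


def cosine(a: str, b: str) -> float:
    """Cosine similarity between two titles, using lower-cased word counts."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[w] * cb[w] for w in ca)
    norm = math.sqrt(sum(c * c for c in ca.values())) * math.sqrt(
        sum(c * c for c in cb.values())
    )
    return dot / norm if norm else 0.0


query = "Breaking bad"
titles = ["Breaking Bad", "Breaking In", "Good Will Hunting"]

# Score each candidate title against the query
scores = {t: cosine(query, t) for t in titles}
for title, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{title}: {score:.2f}")
```

An exact match scores about 1.0, a title sharing one of two words about 0.5, and a title with no shared words 0.0. Titles that tie on similarity are then ordered by `wr` in the notebook's `sort_values` call.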


# Genre-Based Recommendations

In [13]:
import pandas as pd
import warnings

warnings.filterwarnings('ignore')  
df = pd.read_csv('Reel Good Data (Title+Service+Genre+Tag List).csv')

In [14]:
warnings.filterwarnings('ignore') 
df['Service Count'] = df['Where to Watch'].apply(lambda x: len(set(x.split(', '))) if pd.notnull(x) else 0)

df = df.drop_duplicates()

aggregated = df.groupby(['Title', 'Released Year', 'Type']).agg({
    'Service Count': 'sum',  
    'ReelGood': 'mean',  
    'IMDB': 'mean',  
    'Genre': lambda x: ', '.join(set(x)), 
}).reset_index()

aggregated['Genre'] = aggregated['Genre'].str.split(', ')
gen_df = aggregated.explode('Genre')

def build_chart(genre, percentile=0.85):
    # Rows whose genre label contains the requested genre (case-insensitive)
    df_genre = gen_df[gen_df['Genre'].str.contains(genre, case=False, na=False)]
    C = df_genre['ReelGood'].mean()                     # mean score within the genre
    m = df_genre['Service Count'].quantile(percentile)  # platform threshold within the genre

    # Copy the filtered rows so the new 'wr' column is not written to a view
    qualified = df_genre[(df_genre['Service Count'] >= m) & (df_genre['ReelGood'].notnull())].copy()
    v = qualified['Service Count']
    qualified['wr'] = (v / (v + m) * qualified['ReelGood']) + (m / (m + v) * C)

    return qualified.sort_values('wr', ascending=False).head(250)
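
The `str.split` and `explode` calls before `build_chart` are what let a title with several genres appear in each of those genres' charts. A small sketch with hypothetical titles and genre strings shows the reshaping:

```python
import pandas as pd

# Hypothetical titles and comma-separated genre strings
toy = pd.DataFrame({
    "Title": ["Title A", "Title B"],
    "Genre": ["Crime, Drama", "Romance"],
})

toy["Genre"] = toy["Genre"].str.split(", ")  # each cell becomes a list of genres
exploded = toy.explode("Genre")              # one row per (title, genre) pair

print(exploded)
```

"Title A" now occupies two rows, one per genre, and both rows keep the original index value 0.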

In [15]:
top_romance_movies = build_chart('Romance').head(15)
top_romance_movies.head()

|  | Title | Released Year | Type | Service Count | ReelGood | IMDB | Genre | wr |
|---|---|---|---|---|---|---|---|---|
| 30431 | Let the Right One In | 2008 | movies | 60 | 89.0 | 7.9 | Romance | 85.44169 |
| 11005 | Charade | 1963 | movies | 231 | 84.0 | 7.9 | Romance | 83.13566 |
| 20224 | Friends | 1994 | tv | 18 | 94.0 | 8.9 | Romance | 82.964647 |
| 24067 | His Girl Friday | 1940 | movies | 288 | 83.0 | 7.9 | Romance | 82.323645 |
| 54392 | The Illusionist | 2006 | movies | 84 | 84.0 | 7.6 | Romance | 81.723906 |


In [16]:
top_crime_movies = build_chart('Crime').head(15)
top_crime_movies.head()

|  | Title | Released Year | Type | Service Count | ReelGood | IMDB | Genre | wr |
|---|---|---|---|---|---|---|---|---|
| 9514 | Brooklyn Nine-Nine | 2013 | tv | 63 | 91.0 | 8.4 | Crime | 85.324531 |
| 52804 | The Fall | 2013 | tv | 96 | 89.0 | 8.2 | Crime | 85.280227 |
| 53461 | The Girl with the Dragon Tattoo | 2009 | movies | 120 | 87.0 | 7.8 | Crime | 84.12347 |
| 55923 | The Man from Nowhere | 2010 | movies | 120 | 87.0 | 7.8 | Crime | 84.12347 |
| 15220 | Dexter | 2006 | tv | 45 | 91.0 | 8.6 | Crime | 83.477181 |


In [17]:
import pandas as pd

warnings.filterwarnings('ignore') 
df['Service Count'] = df['Where to Watch'].apply(lambda x: len(set(x.split(', '))) if pd.notnull(x) else 0)

df['Tag'] = df['Tag'].apply(lambda x: ', '.join(x) if isinstance(x, list) else x)

df = df.drop_duplicates()

aggregated = df.groupby(['Title', 'Released Year', 'Type']).agg({
    'Service Count': 'sum',  
    'ReelGood': 'mean',  
    'IMDB': 'mean',  
    'Genre': lambda x: ', '.join(set(x)), 
}).reset_index()

aggregated['Genre'] = aggregated['Genre'].str.split(', ')
gen_df = aggregated.explode('Genre')

def build_chart():
    genre_input = input("Enter the genre you are interested in: ")
    # regex=False matches the input literally, so characters such as '(' cannot break the search
    df_genre = gen_df[gen_df['Genre'].str.contains(genre_input, case=False, na=False, regex=False)]
    if df_genre.empty:
        print("No titles found for this genre.")
        return None
    C = df_genre['ReelGood'].mean()
    m = df_genre['Service Count'].quantile(0.85)

    # Copy the filtered rows so the new 'wr' column is not written to a view
    qualified = df_genre[(df_genre['Service Count'] >= m) & (df_genre['ReelGood'].notnull())].copy()
    v = qualified['Service Count']
    qualified['wr'] = (v / (v + m) * qualified['ReelGood']) + (m / (m + v) * C)

    return qualified.sort_values('wr', ascending=False).head(3)


recommendations = build_chart()
if recommendations is not None:
    print(recommendations[['Title', 'IMDB', 'ReelGood', 'Genre']])


Enter the genre you are interested in: Crime
                                 Title  IMDB  ReelGood  Genre
9514                Brooklyn Nine-Nine   8.4      91.0  Crime
52804                         The Fall   8.2      89.0  Crime
53461  The Girl with the Dragon Tattoo   7.8      87.0  Crime
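
`str.contains` interprets its pattern as a regular expression unless `regex=False` is passed, which matters when the pattern comes from free-form user input. The sketch below shows the difference on a hypothetical list of genre labels. The series uses `dtype=object` on the assumption that this matches how pandas has typically stored text columns read from a CSV file.

```python
import re

import pandas as pd

# Hypothetical genre labels, stored as Python strings
genres = pd.Series(["Crime", "Sci-Fi", "Romance"], dtype=object)

user_input = "("  # a stray character typed by mistake

# Default behaviour: the input is compiled as a regular expression and fails
try:
    genres.str.contains(user_input, case=False, na=False)
    regex_failed = False
except re.error:
    regex_failed = True

# Literal matching: the same input is searched for as plain text
literal_match = genres.str.contains(user_input, case=False, na=False, regex=False)
normal_match = genres.str.contains("crime", case=False, na=False, regex=False)

print(regex_failed, literal_match.tolist(), normal_match.tolist())
```

With literal matching, an unexpected character yields no matches instead of an exception, so the "empty result" branch of the function can handle it.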
