# Instructions to run the Jupyter notebooks

    1. Go to https://grouplens.org/datasets/movielens/20m/ and download the file ml-20m.zip from the page.
    2. Unzip ALL the files and extract them to the same folder as the notebooks.
    3. Go to https://www.kaggle.com/datasets/juzershakir/tmdb-movies-dataset and download the file from the page.
    4. Unzip the file and copy tmdb_movies_data.csv to the same folder as the notebooks.
    5. Make sure the libraries below are installed. pickle and collections (which provides Counter) ship with Python and need no installation; csr_matrix is part of scipy.sparse, and sklearn is installed as scikit-learn.
        •	numpy
        •	pandas
        •	matplotlib
        •	seaborn
        •	scipy
        •	scikit-learn
        •	fuzzywuzzy
        •	dash
        •	requests

    6. Run the notebook Movie Recommendation.ipynb first.
    7. Make sure the files below have been created in the notebook folder after the run.

        •	user_rating.pkl
        •	movie_list.pkl
        •	cba_similarity.pkl
        •	movie_recommendations_knn.csv
        •	grouped_user_filtered_df.csv
        •	unique_genres.pkl
        •	unique_cast.pkl
        •	user_semantics.csv
        
    
    8. Now run the next notebook, Dashboard.ipynb. This will create the dashboard.
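
Steps 5 and 7 can be checked with a short script along the following lines. This is a minimal sketch and not part of the original notebooks: the helper names `missing_modules` and `missing_files` are hypothetical, the module names are the importable names of the libraries listed above, and the file names are copied from step 7.

```python
import importlib.util
from pathlib import Path

# Importable module names for the libraries in step 5
# (pickle and collections ship with Python itself).
REQUIRED_MODULES = [
    "numpy", "pandas", "matplotlib", "seaborn", "scipy",
    "sklearn", "fuzzywuzzy", "dash", "requests",
]

# Files that step 7 expects after running Movie Recommendation.ipynb.
EXPECTED_FILES = [
    "user_rating.pkl", "movie_list.pkl", "cba_similarity.pkl",
    "movie_recommendations_knn.csv", "grouped_user_filtered_df.csv",
    "unique_genres.pkl", "unique_cast.pkl", "user_semantics.csv",
]

def missing_modules(names):
    """Return the module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]

def missing_files(names, folder="."):
    """Return the file names that do not exist inside `folder`."""
    return [name for name in names if not (Path(folder) / name).is_file()]

if __name__ == "__main__":
    print("Missing modules:", missing_modules(REQUIRED_MODULES))
    print("Missing files:", missing_files(EXPECTED_FILES))
```

Both lists should come back empty before Dashboard.ipynb is started.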


# Movie Recommendation System


<div style="color:black;
           display:fill;
           border-radius:5px;
           background-color:Beige;
           font-size:110%;
           letter-spacing:0.5px">

<p style="padding: 10px;
              color:black;">Hello, <br>
    

In this notebook we will create two recommendation models:

1. Item-based recommendation using KNN after SVD (matrix factorization) on implicit feedback.
2. Cold-start problem resolution using content-based recommendation.
<br>
</p>
</div>

<div style="color:black;
           display:fill;
           border-radius:5px;
           background-color:Beige;
           font-size:110%;
           letter-spacing:0.5px">
<p style="padding: 10px;
              color:black;">       

We need to download the data files from the links below:

- Go to https://grouplens.org/datasets/movielens/20m/ and download the file **ml-20m.zip** from the page.
- Unzip **ALL** the files and extract them to the same folder as the notebook.
- Go to https://www.kaggle.com/datasets/juzershakir/tmdb-movies-dataset and download the file from the page.
- Unzip the file and copy **tmdb_movies_data.csv** to the same folder as the notebook.

</p>
</div>

In [1]:
# importing the libraries needed for the work.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import pickle

In [2]:
# loading the dataset from the files movies.csv and ratings.csv
movies = pd.read_csv('movies.csv')
ratings = pd.read_csv('ratings.csv')

In [3]:
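# Removing the release year, e.g. " (1995)", and the special characters from the title
movies['title'] = movies['title'].str.replace(r'\s\(\d{4}\)', '', case=False, regex=True)
# regex=False makes every character, including '$', an explicit literal match across pandas versions
movies['title'] = movies['title'].str.replace(',', ' ', regex=False).str.replace('&', ' ', regex=False).str.replace('$', ' ', regex=False).str.replace("'", '', regex=False)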

In [4]:
# Here we analyse the data and find its characteristics
n_ratings = len(ratings)
n_movies = ratings['movieId'].nunique()
n_users = ratings['userId'].nunique()

print(f"Number of ratings: {n_ratings}")
print(f"Number of unique movieId's: {n_movies}")
print(f"Number of unique users: {n_users}")
print(f"Average number of ratings per user: {round(n_ratings/n_users, 2)}")
print(f"Average number of ratings per movie: {round(n_ratings/n_movies, 2)}")

Number of ratings: 20001539
Number of unique movieId's: 26744
Number of unique users: 138497
Average number of ratings per user: 144.42
Average number of ratings per movie: 747.89


In [5]:
# So here we will define the Bayesian Average function.

movie_stats = ratings.groupby('movieId')['rating'].agg(['count', 'mean'])
C = movie_stats['count'].mean()
m = movie_stats['mean'].mean()

print(f"Average number of ratings for a given movie: {C:.2f}")
print(f"Average rating for a given movie: {m:.2f}")

def bayesian_avg(ratings):
    # Bayesian average: (C*m + sum of ratings) / (C + number of ratings)
    avg = (C*m+ratings.sum())/(C+ratings.count())
    return round(avg, 3)

Average number of ratings for a given movie: 747.89
Average rating for a given movie: 3.13
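
To see why this formula helps, here is a small worked sketch with made-up numbers. The prior of 10 pseudo-ratings at 3.0 stars and the two rating lists are hypothetical, chosen only to keep the arithmetic easy; the function name `bayesian_average` is also illustrative and takes the prior as explicit arguments instead of the notebook's global `C` and `m`.

```python
def bayesian_average(ratings, prior_count, prior_mean):
    """Shrink a movie's mean rating toward prior_mean using prior_count pseudo-ratings."""
    return (prior_count * prior_mean + sum(ratings)) / (prior_count + len(ratings))

# Hypothetical prior: 10 pseudo-ratings at 3.0 stars.
few_ratings = [5.0, 5.0]       # raw mean 5.0 from only two ratings
many_ratings = [4.5] * 100     # raw mean 4.5 from one hundred ratings

print(round(bayesian_average(few_ratings, 10, 3.0), 3))   # 3.333
print(round(bayesian_average(many_ratings, 10, 3.0), 3))  # 4.364
```

The movie with only two perfect ratings is pulled most of the way back to the prior, while the widely rated movie keeps a score close to its raw mean.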


In [6]:
# Now we will create a column with the Bayesian average for each movie
bayesian_avg_ratings = ratings.groupby('movieId')['rating'].agg(bayesian_avg).reset_index()
bayesian_avg_ratings.columns = ['movieId', 'bayesian_avg']
movie_stats = movie_stats.merge(bayesian_avg_ratings, on='movieId')
movie_stats

Unnamed: 0,movieId,count,mean,bayesian_avg
0,1,49701,3.921269,3.910
1,2,22245,3.211958,3.209
2,3,12737,3.151095,3.150
3,4,2756,2.861393,2.919
4,5,12163,3.064417,3.068
...,...,...,...,...
26739,131254,1,4.000000,3.134
26740,131256,1,4.000000,3.134
26741,131258,1,2.500000,3.132
26742,131260,1,3.000000,3.133


In [7]:
# Add the movie title column to the dataset and sort by Bayesian average.
movie_stats = movie_stats.merge(movies[['movieId', 'title']])
movie_stats.sort_values('bayesian_avg',ascending=False)

Unnamed: 0,movieId,count,mean,bayesian_avg,title
315,318,63370,4.446978,4.432,Shawshank Redemption The
843,858,41357,4.364690,4.343,Godfather The
49,50,47008,4.334358,4.316,Usual Suspects The
523,527,50056,4.310183,4.293,Schindlers List
1195,1221,27398,4.275641,4.245,Godfather: Part II The
...,...,...,...,...,...
2298,2383,2155,1.794896,2.140,Police Academy 6: City Under Siege
1648,1707,2697,1.819985,2.105,Home Alone 3
1694,1760,2658,1.770316,2.070,Spice World
1506,1556,5326,1.912317,2.063,Speed 2: Cruise Control


<div style="color:black;
           display:fill;
           border-radius:5px;
           background-color:Beige;
           font-size:110%;
           letter-spacing:0.5px">
<p style="padding: 10px;
              color:black; font-weight:bold;">       
##### Using the Bayesian average, we see that `The Shawshank Redemption`, `The Godfather`, and `The Usual Suspects` are the most highly rated movies. This ranking makes much more sense than one based on the raw mean, where a movie with only a handful of high ratings can outrank widely rated, critically acclaimed films.
</p>
</div>


**A look at Movie Genres**

The `title` column was already cleaned above, so the movies dataset needs one more cleaning step:

- `genres` is expressed as a string with a pipe `|` separating each genre. We will manipulate this string into a list, which will make it much easier to analyze.

In [8]:
# We will try to do the analysis on Genres of films
movies['genres'] = movies['genres'].apply(lambda x: x.split("|"))
movies.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story,"[Adventure, Animation, Children, Comedy, Fantasy]"
1,2,Jumanji,"[Adventure, Children, Fantasy]"
2,3,Grumpier Old Men,"[Comedy, Romance]"
3,4,Waiting to Exhale,"[Comedy, Drama, Romance]"
4,5,Father of the Bride Part II,[Comedy]
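
Once the genres are lists, counting how often each genre occurs becomes straightforward. The sketch below uses `collections.Counter` on three hypothetical pipe-separated strings in the same format as the `genres` column; it does not touch the notebook's `movies` dataframe.

```python
from collections import Counter

# Hypothetical sample in the same pipe-separated format as the `genres` column.
raw_genres = [
    "Adventure|Animation|Children|Comedy|Fantasy",
    "Adventure|Children|Fantasy",
    "Comedy|Romance",
]

# Same split as in the cell above, applied to plain strings.
genre_lists = [g.split("|") for g in raw_genres]

# Flatten the lists and count each genre.
genre_counts = Counter(genre for genres in genre_lists for genre in genres)

print(genre_counts.most_common(3))
```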




<div style="color:black;
           display:fill;
           border-radius:5px;
           background-color:Beige;
           font-size:110%;
           letter-spacing:0.5px">
<p style="padding: 10px;
              color:black; font-weight:bold;">
We will create a sparse matrix from the ratings dataframe so that the data can be used by our recommendation models.

The function below builds the sparse matrix from a pandas dataframe with 3 columns (userId, movieId, rating).

It returns the following items:

- X: sparse matrix
- user_mapper: dict that maps user id's to user indices
- user_inv_mapper: dict that maps user indices to user id's
- movie_mapper: dict that maps movie id's to movie indices
- movie_inv_mapper: dict that maps movie indices to movie id's
</p>
</div>



In [9]:
from scipy.sparse import csr_matrix

def create_X(df):
    """
    Builds a sparse user-item matrix from a dataframe with userId, movieId and rating columns.

    Returns: X, user_mapper, movie_mapper, user_inv_mapper, movie_inv_mapper
    """
    M = df['userId'].nunique()   # number of users (rows)
    N = df['movieId'].nunique()  # number of movies (columns)

    # id -> matrix index
    user_mapper = dict(zip(np.unique(df["userId"]), list(range(M))))
    movie_mapper = dict(zip(np.unique(df["movieId"]), list(range(N))))

    # matrix index -> id
    user_inv_mapper = dict(zip(list(range(M)), np.unique(df["userId"])))
    movie_inv_mapper = dict(zip(list(range(N)), np.unique(df["movieId"])))

    # row and column index of every rating
    user_index = [user_mapper[i] for i in df['userId']]
    item_index = [movie_mapper[i] for i in df['movieId']]

    X = csr_matrix((df["rating"], (user_index,item_index)), shape=(M,N))

    return X, user_mapper, movie_mapper, user_inv_mapper, movie_inv_mapper

X, user_mapper, movie_mapper, user_inv_mapper, movie_inv_mapper = create_X(ratings)
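
The mapping from ids to matrix indices is easier to follow on a tiny example. The sketch below builds the same kind of matrix from four hypothetical ratings. It swaps in `np.unique(..., return_inverse=True)` to get the indices in one vectorised call, which is an alternative to the list comprehensions in `create_X`, not the notebook's own code; all `*_toy` names are illustrative.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical ratings: (userId, movieId, rating)
user_ids = np.array([10, 10, 20, 30])
movie_ids = np.array([100, 200, 100, 300])
ratings_toy = np.array([4.0, 5.0, 3.0, 2.0])

# return_inverse gives, for every rating, the position of its id
# in the sorted array of unique ids.
unique_users, user_index = np.unique(user_ids, return_inverse=True)
unique_movies, movie_index = np.unique(movie_ids, return_inverse=True)

X_toy = csr_matrix(
    (ratings_toy, (user_index, movie_index)),
    shape=(len(unique_users), len(unique_movies)),
)

# id -> row index, the same role as user_mapper in create_X
user_mapper_toy = dict(zip(unique_users.tolist(), range(len(unique_users))))

print(X_toy.toarray())
print(user_mapper_toy)
```

Each row is a user, each column is a movie, and every cell without a rating is stored implicitly as zero.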

In [10]:
X.shape

(138497, 26744)

In [11]:
# Here we compute the share of user-movie cells that hold a rating (the matrix density, printed below as "sparsity")

n_total = X.shape[0]*X.shape[1]
n_ratings = X.nnz
sparsity = n_ratings/n_total
print(f"Matrix sparsity: {round(sparsity*100,2)}%")

Matrix sparsity: 0.54%


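As we can see, only about 0.54% of the cells in the matrix hold a rating. Strictly speaking this figure is the matrix density; the matrix itself is about 99.46% sparse. The matrix can still be used for the model.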
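
The 0.54% figure is the share of filled cells. A minimal sketch of the two related quantities, using the counts printed earlier in this notebook (the variable names here are illustrative and chosen so they do not overwrite the notebook's own):

```python
# Counts taken from the outputs above.
num_users, num_movies, num_ratings = 138497, 26744, 20001539

density = num_ratings / (num_users * num_movies)  # share of cells with a rating
sparsity = 1 - density                            # share of empty cells

print(f"Density:  {density * 100:.2f}%")
print(f"Sparsity: {sparsity * 100:.2f}%")
```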

In [12]:
movie_titles = dict(zip(movies['movieId'], movies['title']))

## First Recommendation System - with KNN using Matrix Factorization (SVD)


<div style="color:black;
           display:fill;
           border-radius:5px;
           background-color:Beige;
           font-size:110%;
           letter-spacing:0.5px">
<p style="padding: 10px;color:black; font-weight:bold;">
Recommendations can suffer when the utility matrix we created above is very sparse. To address this, we apply matrix factorization.

Matrix factorization (MF)  is a linear algebra technique that can help us discover latent features underlying the interactions between users and movies. These latent features give a more compact representation of user tastes and item descriptions. MF is particularly useful for very sparse data and can enhance the quality of recommendations. The algorithm works by factorizing the original user-item matrix into two factor matrices:

- user-factor matrix (n_users, k)
- item-factor matrix (k, n_items)

We are reducing the dimensions of our original matrix into "taste" dimensions. We cannot interpret what each latent feature $k$ represents. However, we could imagine that one latent feature may represent users who like romantic comedies from the 1990s, while another latent feature may represent movies which are independent foreign language films.
</p>
</div>
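
The shapes of the two factor matrices can be illustrated on a small dense example. This sketch uses NumPy's full SVD truncated to `k` components, which is a stand-in for illustration and not the `TruncatedSVD` call used later in this notebook. The 4 x 5 matrix is hypothetical and is built from two hidden "taste" dimensions, so two latent features reproduce it exactly.

```python
import numpy as np

# Hypothetical 4-user x 5-movie matrix with rank 2 by construction.
user_tastes = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.5]])
movie_traits = np.array([[5.0, 4.0, 0.0, 1.0, 3.0],
                         [0.0, 1.0, 5.0, 4.0, 2.0]])
R = user_tastes @ movie_traits             # shape (4, 5)

k = 2
U, s, Vt = np.linalg.svd(R, full_matrices=False)
user_factors = U[:, :k] * s[:k]            # shape (n_users, k)
item_factors = Vt[:k, :]                   # shape (k, n_items)

print(user_factors.shape, item_factors.shape)
print(np.allclose(user_factors @ item_factors, R))
```

On real rating data the reconstruction is only approximate, and the value of `k` controls how much detail is kept.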

In [13]:
## Using the k-nearest-neighbours (kNN) algorithm to find recommendations

from sklearn.neighbors import NearestNeighbors

def find_similar_movies(movie_id, X, movie_mapper, movie_inv_mapper, k, metric='euclidean'):
    """
    Finds k-nearest neighbours for a given movie id.
    
    Args:
        movie_id: id of the movie of interest
        X: user-item utility matrix (any matrix with one column per movie)
        movie_mapper: dict that maps movie id's to movie indices
        movie_inv_mapper: dict that maps movie indices to movie id's
        k: number of similar movies to retrieve
        metric: distance metric for kNN calculations
    
    Output: returns list of k similar movie ID's
    """
    X = X.T  # one row per movie
    neighbour_ids = []
    
    movie_ind = movie_mapper[movie_id]
    movie_vec = X[movie_ind]
    if isinstance(movie_vec, (np.ndarray)):
        movie_vec = movie_vec.reshape(1,-1)
    # use k+1 since kNN output includes the movieId of interest
    kNN = NearestNeighbors(n_neighbors=k+1, algorithm="brute", metric=metric)
    kNN.fit(X)
    neighbour = kNN.kneighbors(movie_vec, return_distance=False)
    # position 0 is normally the query movie itself, so keep positions 1..k
    for i in range(1, k+1):
        n = neighbour.item(i)
        neighbour_ids.append(movie_inv_mapper[n])
    return neighbour_ids
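
What the brute-force search does with the cosine metric can be sketched in a few lines of NumPy. This is a hand-rolled illustration, not the scikit-learn implementation: the function name `cosine_knn` and the four 2-feature item vectors are hypothetical, and the code assumes no item vector is all zeros.

```python
import numpy as np

def cosine_knn(item_vectors, query_index, k):
    """Return the indices of the k items most similar to item `query_index`."""
    norms = np.linalg.norm(item_vectors, axis=1, keepdims=True)
    unit = item_vectors / norms                 # assumes no all-zero rows
    similarity = unit @ unit[query_index]       # cosine similarity to the query
    similarity[query_index] = -np.inf           # never return the query itself
    return np.argsort(-similarity)[:k].tolist()

# Items 0 and 1 point in nearly the same direction, as do items 2 and 3.
items = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])

print(cosine_knn(items, query_index=0, k=2))
```

Excluding the query explicitly avoids relying on it always appearing first in the neighbour list.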

In [14]:
## Using truncated SVD to reduce each movie to 20 latent features

from sklearn.decomposition import TruncatedSVD

svd = TruncatedSVD(n_components=20, n_iter=10)
Q = svd.fit_transform(X.T)
Q.shape

(26744, 20)

## Second Recommendation System -  Content Based Recommendation System


<div style="color:black;
           display:fill;
           border-radius:5px;
           background-color:Beige;
           font-size:110%;
           letter-spacing:0.5px">
<p style="padding: 10px; color:black; font-weight:bold;">
Handling the Cold Start Problem with Content-Based Filtering

Collaborative filtering relies solely on user-item interactions within the utility matrix. The issue with this approach is that brand new users or items with no interactions get excluded from the recommendation system. This is called the "cold start" problem. Content-based filtering is a way to handle this problem by generating recommendations based on user and item features.

</p>
</div>


<div style="color:black;
           display:fill;
           border-radius:5px;
           background-color:Beige;
           font-size:110%;
           letter-spacing:0.5px">
<p style="padding: 10px; color:black; font-weight:bold;">
We will also use the TMDB dataset here to get more details on the movies, as the MovieLens dataset has fewer details. The file **links.csv** links the MovieLens dataset to the TMDB dataset.

</p>
</div>

In [15]:
movielens = pd.read_csv('movies.csv') #this file comes from movielens dataset
tmdb = pd.read_csv('tmdb_movies_data.csv')
link = pd.read_csv('links.csv') #this file comes from movielens dataset
tags = pd.read_csv('tags.csv') #this file comes from movielens dataset


<div style="color:black;
           display:fill;
           border-radius:5px;
           background-color:Beige;
           font-size:110%;
           letter-spacing:0.5px">
<p style="padding: 10px; color:black; font-weight:bold;">
Below we carry out the data preparation steps needed for modelling.

</p>
</div>

In [16]:
movielens_link = pd.merge(movielens, link, on='movieId', how='left')
tmdb = tmdb.rename(columns={'id': 'tmdbId'})
movielens_tmdb = pd.merge(movielens_link,tmdb,on ='tmdbId')
movielens_tmdb.drop(columns=['imdb_id','imdbId','popularity','budget','revenue','original_title','homepage','runtime','genres_y','production_companies','release_date','vote_count','vote_average','release_year','budget_adj','revenue_adj'], inplace=True)
movielens_tmdb.rename(columns={"genres_x": "genre"}, inplace=True)
agg_tags = tags.groupby('movieId')['tag'].apply(list).reset_index()

In [17]:
movie_cba = pd.merge(movielens_tmdb,agg_tags,on='movieId')
movie_cba = movie_cba.rename(columns={'title_y': 'title'})
movie_cba['combined_data'] = movie_cba.apply(lambda row: f"{row['genre']} {row['overview']} {row['tag']} {row['cast']} {row['director']} {row['tagline']} {row['keywords']}", axis=1)
movie_cba = movie_cba.drop(columns = ['overview','tag'])
movie_cba['tmdbId'] = movie_cba['tmdbId'].astype(int)
def format_genres(genre):
    return ', '.join(genre.split('|'))
movie_cba['genre'] = movie_cba['genre'].apply(format_genres)
movie_cba = movie_cba.rename(columns={'genre': 'genres'})
movie_cba['cast'] = movie_cba['cast'].str.split('|').str[:4].str.join(', ')
movie_cba['title'] = movie_cba['title'].str.replace(r'\s\(\d{4}\)', '', case=False, regex=True)
movie_cba = movie_cba.drop(columns = ['keywords','tagline'])
movie_cba['combined_data'] = movie_cba['combined_data'].str.replace(r'[\[\]\|,]', ' ', regex=True)
# regex=False makes every character, including '$', an explicit literal match across pandas versions
movie_cba['title'] = movie_cba['title'].str.replace(',', ' ', regex=False).str.replace('&', ' ', regex=False).str.replace('$', ' ', regex=False).str.replace("'", '', regex=False).str.replace('/', ' ', regex=False)
movieids_cba = movie_cba['movieId'].tolist()

In [18]:
# Creating the count-vector matrix from the combined text of each movie
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

cv = CountVectorizer(max_features=5000,stop_words='english')
vector = cv.fit_transform(movie_cba['combined_data']).toarray()
# Computing the pairwise cosine similarity between all movies
similarity = cosine_similarity(vector)
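
The idea behind the count vectors and cosine similarity can be shown with the standard library alone. This is a simplified sketch: unlike `CountVectorizer` it applies no stop-word removal or feature cap, and the function name `cosine_similarity_bow` and the three text strings are hypothetical stand-ins for `combined_data` values.

```python
import math
from collections import Counter

def cosine_similarity_bow(text_a, text_b):
    """Cosine similarity between the word-count vectors of two strings."""
    counts_a = Counter(text_a.lower().split())
    counts_b = Counter(text_b.lower().split())
    # Words missing from counts_b count as zero.
    dot = sum(counts_a[word] * counts_b[word] for word in counts_a)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical combined text for three movies.
toy_story = "animation comedy toys pixar"
toy_story_2 = "animation comedy toys sequel"
heat = "crime thriller heist"

print(round(cosine_similarity_bow(toy_story, toy_story_2), 2))  # 0.75
print(cosine_similarity_bow(toy_story, heat))                   # 0.0
```

Movies that share most of their descriptive words score close to 1, and movies with no words in common score 0.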

In [19]:
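def recommend(movie):
    """Prints the titles of the 10 movies most similar to `movie` (an exact title in movie_cba)."""
    index = movie_cba[movie_cba['title'] == movie].index[0]
    # sort all movies by similarity to the chosen one, most similar first
    distances = sorted(list(enumerate(similarity[index])),reverse=True,key = lambda x: x[1])
    # skip position 0, which is normally the movie itself
    for i in distances[1:11]:
        print(movie_cba.iloc[i[0]].title)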

<div style="color:black;
           display:fill;
           border-radius:5px;
           background-color:Beige;
           font-size:110%;
           letter-spacing:0.5px">
<p style="padding: 10px; color:black; font-weight:bold;">
Below we export the files needed to build the dashboard for the recommendation system.
</p>
</div>

In [20]:
# Create an empty list to store the recommendation data
recommendation_data = []

# Iterate over movieIds in movieids_cba
for i in movieids_cba:
    # Skip movie IDs that have no ratings and therefore no entry in movie_mapper
    if i not in movie_mapper:
        continue

    similar_movies = find_similar_movies(i, Q.T, movie_mapper, movie_inv_mapper, metric='cosine', k=10)
    movie_title = movie_titles[i]

    # Store the recommendations as a comma-separated string within quotes
    recommendations = ', '.join([f'"{movie_titles[movie_id]}"' for movie_id in similar_movies])

    # Create a dictionary with the recommendation data
    recommendation_dict = {
        'MovieId': i,
        'MovieName': movie_title,
        'Recommendations': recommendations
    }

    # Append the dictionary to the recommendation_data list
    recommendation_data.append(recommendation_dict)

# Create a DataFrame from the recommendation_data list
recommendation_df = pd.DataFrame(recommendation_data)
recommendation_df.to_csv('movie_recommendations_knn.csv', index=False)
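
Because the `Recommendations` column stores the titles as one quoted, comma-separated string, a consumer has to split it again. One possible way, sketched here with the standard `csv` module; the helper name `parse_recommendations` and the example string are hypothetical, and this is not necessarily how Dashboard.ipynb reads the file.

```python
import csv

def parse_recommendations(cell):
    """Split a '"Title A", "Title B"' string into a list of titles."""
    # skipinitialspace lets the reader accept the space after each comma.
    return next(csv.reader([cell], skipinitialspace=True))

example = '"Toy Story 2", "Monsters  Inc.", "Bugs Life A"'
print(parse_recommendations(example))
```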

In [21]:
with open('movie_list.pkl','wb') as f:
    pickle.dump(movie_cba, f)
with open('cba_similarity.pkl','wb') as f:
    pickle.dump(similarity, f)
with open('user_rating.pkl','wb') as f:
    pickle.dump(ratings, f)

In [22]:
movie_ratings = ratings.merge(movies, on='movieId')
filtered_movie_ratings = movie_ratings[movie_ratings['movieId'].isin(movieids_cba)]
# Group by 'userId' and aggregate each user's 'movieId' values into a list
grouped_user_id = filtered_movie_ratings.groupby('userId')['movieId'].agg(list).reset_index()

# Name the columns and count the number of movies rated by each user
grouped_user_id.columns = ['userId', 'movieId']
grouped_user_id['movie_num'] = grouped_user_id['movieId'].apply(len)
#grouped_user_filtered_df = grouped_user_id[grouped_user_id['movie_num'] > 400]
# Keep every user with at least one rated movie
grouped_user_filtered_df = grouped_user_id[grouped_user_id['movie_num'] > 0]
grouped_user_filtered_df.to_csv('grouped_user_filtered_df.csv', index=False)

In [23]:
df = pd.DataFrame({'genres': movie_cba['genres']})
genres_split = df['genres'].str.split(', ')
unique_genres = []
for genres_list in genres_split.dropna():  # dropna() removes rows where 'genre' is NaN
    unique_genres.extend(genres_list)
unique_genres = [genre.strip() for genre in unique_genres]
unique_genres = set(unique_genres) - {'(no genres listed)'}
with open('unique_genres.pkl', 'wb') as f:
    pickle.dump(unique_genres, f)


In [24]:
movie_cba['cast'] = movie_cba['cast'].apply(lambda x: x.split(', ')[:3] if isinstance(x, str) else [])
all_cast = [actor for sublist in movie_cba['cast'].tolist() for actor in sublist]
unique_cast = set(all_cast)
with open('unique_cast.pkl', 'wb') as f:
    pickle.dump(unique_cast, f)
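
The pickled files can be read back with `pickle.load`. The round trip below is a minimal sketch on a hypothetical stand-in set written to a temporary folder, so it does not touch the real output files; how Dashboard.ipynb actually loads them is not shown in this notebook.

```python
import pickle
import tempfile
from pathlib import Path

unique_genres_toy = {"Comedy", "Drama", "Romance"}  # hypothetical stand-in

with tempfile.TemporaryDirectory() as folder:
    path = Path(folder) / "unique_genres.pkl"
    # Write the object, then read it back from the same file.
    with open(path, "wb") as f:
        pickle.dump(unique_genres_toy, f)
    with open(path, "rb") as f:
        loaded = pickle.load(f)

print(sorted(loaded))
```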


<div style="color:black;
           display:fill;
           border-radius:5px;
           background-color:Beige;
           font-size:110%;
           letter-spacing:0.5px">
<p style="padding: 10px;color:black; font-weight:bold;">

<strong>Below we create a user semantics file containing the movies each user rated, whether they liked them, and a review derived from the rating.</strong>
</p>
</div>

In [25]:
user_sem = ratings
user_semantics = pd.merge(grouped_user_filtered_df, user_sem, on='userId')
user_semantics.drop('movieId_x',axis=1,inplace=True)
user_semantics.drop('movie_num',axis=1,inplace=True)
user_semantics.drop('timestamp',axis=1,inplace=True)
user_semantics = user_semantics.rename(columns={'movieId_y': 'movieId'})

In [26]:
user_semantics['liked(Yes_No)'] = user_semantics['rating'].apply(lambda x: 'Yes' if x >= 4 else 'No')
user_semantics['reviews'] = user_semantics['rating'].apply(lambda x: 'Excellent movie with well played cast and well written script' if x == 5 else ('Good movie' if x == 4.5 else ('Good movie' if x == 4 else ('Average movie' if x == 3.5 or x == 3 else 'Bad or worst movie I have ever seen'))))
user_semantics.to_csv('user_semantics.csv', index=False)
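
The nested conditional above can be hard to read. The sketch below expresses the same rating-to-text mapping as two small functions; the names `liked` and `rating_to_review` are hypothetical, and the thresholds and review strings are copied from the cell above.

```python
def liked(rating):
    """'Yes' for ratings of 4 stars or more, otherwise 'No'."""
    return "Yes" if rating >= 4 else "No"

def rating_to_review(rating):
    """Map a star rating to the synthetic review text used in user_semantics."""
    if rating == 5:
        return "Excellent movie with well played cast and well written script"
    if rating in (4, 4.5):
        return "Good movie"
    if rating in (3, 3.5):
        return "Average movie"
    return "Bad or worst movie I have ever seen"

for stars in (5, 4.5, 3, 1.5):
    print(stars, liked(stars), rating_to_review(stars))
```

Either function could be passed to `apply` on the `rating` column in place of the lambdas.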


<div style="color:black;
           display:fill;
           border-radius:5px;
           background-color:Beige;
           font-size:110%;
           letter-spacing:0.5px">
<p style="padding: 10px;color:black; font-weight:bold;">

<strong>Now we will run the Dashboard.ipynb</strong>
</p>
</div>