# Hybrid Recommendation System using Collaborative Filtering and Content-Based Filtering

**Stevens Institute of Technology**
<br>**CS 541-A**
<br>**Group no. 9**
<br>**Team members**
- Thanapoom Phatthanaphan
- Chandini Rayadurgam
- Shreya Daniel Jacob

**Abstract**

This research takes movie recommendation as its application domain and aims to create a hybrid recommendation system that blends collaborative filtering and content-based filtering techniques. Such a system can provide personalized movie recommendations to users, a common practice among movie streaming platforms and other entertainment providers. The goal is to enhance users' movie-watching experience by suggesting movies that match their preferences, leading to greater satisfaction and engagement.

**Introduction**

Movie recommendation systems have become increasingly popular in recent years, with the rise of online streaming platforms such as Netflix and Amazon Prime. These systems aim to suggest movies that users are likely to enjoy based on their past preferences.

However, traditional movie recommendation systems that use either collaborative filtering (CF) or content-based filtering (CBF) alone have several limitations that can negatively affect the quality and diversity of recommendations. CF relies on user behavior data to recommend movies, which leads to the cold start problem, where new users or movies have no historical data to rely on. This technique also often suffers from popularity bias, where popular movies receive more recommendations regardless of their quality. CBF, on the other hand, recommends movies based on their features, which can lead to limited diversity in recommendations and a lack of exposure to new and different types of movies.

To address these limitations, this research proposes a hybrid movie recommendation system that combines CF and CBF techniques. We evaluate the performance of the proposed hybrid system and compare it to traditional CF and CBF techniques. The MovieLens 1M dataset from GroupLens Research will be used to train and evaluate the system. This dataset contains about 1 million ratings from 6,040 users on approximately 3,900 movies, making it a rich source of data for building a movie recommendation system.

The proposed system will utilize the strengths of both CF and CBF techniques to provide more accurate and diverse recommendations. CF will be used to identify similar users and movies based on past behavior, while CBF will be used to analyze movie attributes such as genre, director, and cast to make recommendations. By combining these two techniques, we hope to overcome some of the limitations of each method and provide a more comprehensive and personalized recommendation system.


**Problem Statement**

Building a movie recommendation system involves challenges such as data sparsity, the cold start problem, and lack of diversity. These issues can make it difficult to capture user preferences accurately and to provide recommendations that are relevant and diverse.

- **Data sparsity**: Data sparsity occurs when there is a limited amount of data available for some movies, which can make it challenging to provide accurate recommendations.


- **Cold start problem**: The cold start problem arises when there is a lack of data available for new users, which can make it difficult to provide personalized recommendations.


- **Lack of diversity**: The lack of diversity issue can arise when the system tends to recommend similar movies, which can limit the user's exposure to new and different content.
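
To make the data sparsity point concrete, the sketch below measures the fraction of user-movie pairs that have no rating in a small ratings table. The helper name `rating_sparsity` and the toy data are illustrative assumptions and are not part of the project code; the column names follow the ones used in this notebook.

```python
import pandas as pd

def rating_sparsity(ratings: pd.DataFrame) -> float:
    """Fraction of (user, movie) pairs with no rating (hypothetical helper)."""
    n_users = ratings["UserID"].nunique()
    n_movies = ratings["MovieID"].nunique()
    n_rated = len(ratings.drop_duplicates(["UserID", "MovieID"]))
    return 1.0 - n_rated / (n_users * n_movies)

# Toy data: 3 users x 4 movies, only 5 of the 12 possible pairs are rated
toy = pd.DataFrame({
    "UserID":  [1, 1, 2, 3, 3],
    "MovieID": [10, 20, 20, 30, 40],
    "Ratings": [5, 3, 4, 2, 5],
})
sparsity = rating_sparsity(toy)
print(f"Sparsity: {sparsity:.3f}")
```

The same function could be applied to the full ratings table once it is loaded, to see how sparse the real user-movie matrix is.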


**Artificial Intelligence Techniques** 

- **Collaborative Filtering**
    
    Collaborative filtering utilizes past user behavior and preferences to recommend movies to users with similar tastes. This technique can help address the cold start problem by identifying similar users who have rated the same movies and providing recommendations based on their preferences. It can also address data sparsity by filling in missing data through predictions based on the preferences of similar users. Collaborative filtering can also help increase diversity by identifying less popular movies that are similar to those the user has rated positively. 


- **Content-based Filtering**

    Content-based filtering, on the other hand, utilizes information about the movie's characteristics to recommend movies to users. This technique can help address the cold start problem by recommending movies based on the user's preferences for certain genres or actors. It can also address data sparsity by providing recommendations based on a movie's specific attributes, such as plot, genre, or director. Content-based filtering can also increase diversity by recommending movies that have similar attributes but may not be as popular as those the user has already seen. 


- **Hybrid system**

    To address the challenges of data sparsity, cold start problem, and lack of diversity effectively, a hybrid system that combines collaborative filtering and content-based filtering can be used. This approach leverages the strengths of both techniques and can help mitigate the limitations of each. A hybrid system can provide more accurate recommendations, increase diversity, and handle data sparsity and the cold start problem more effectively by incorporating information about both user preferences and movie attributes.
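
As a rough illustration of why combining the two signals can help with cold start, the sketch below compares movies using ratings alone and then using ratings concatenated with genre flags. The matrices, the `cosine_sim` helper, and the two-genre encoding are toy assumptions chosen for readability, not the project's data or implementation.

```python
import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity that returns 0.0 when either vector is all zeros."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

# Toy movie-by-user rating matrix (3 movies x 4 users), already scaled to [0, 1]
item_ratings = np.array([
    [1.0, 0.8, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],   # new movie: no ratings yet (cold start)
    [0.0, 0.0, 0.6, 1.0],
])
# Toy one-hot genre features (columns: Comedy, Drama)
item_genres = np.array([
    [1.0, 0.0],
    [1.0, 0.0],
    [0.0, 1.0],
])

# Ratings only: the unrated movie is a zero vector, so it matches nothing
cf_only = cosine_sim(item_ratings[1], item_ratings[0])

# Hybrid: append genre features so the unrated movie is still comparable
hybrid = np.hstack([item_ratings, item_genres])
hybrid_same_genre = cosine_sim(hybrid[1], hybrid[0])
hybrid_other_genre = cosine_sim(hybrid[1], hybrid[2])

print(cf_only, hybrid_same_genre, hybrid_other_genre)
```

In this toy setup the unrated comedy becomes similar to the rated comedy only once the genre columns are appended, which is the intuition behind the combined matrices built later in the notebook.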

### 1) Import the necessary libraries

In [1]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_percentage_error, mean_squared_error
from sklearn.decomposition import TruncatedSVD

### 2) Import the dataset

In [2]:
column_list_ratings = ["UserID", "MovieID", "Ratings","Timestamp"]
ratings_data  = pd.read_csv('ratings.dat',sep='::',names = column_list_ratings, engine='python')
column_list_movies = ["MovieID","Title","Genres"]
movies_data = pd.read_csv('movies.dat',sep = '::',names = column_list_movies, engine='python', encoding = 'latin-1')
column_list_users = ["UserID","Gender","Age","Occupation","Zixp-code"]
user_data = pd.read_csv("users.dat",sep = "::",names = column_list_users, engine='python')

In [3]:
ratings_data.head()

,UserID,MovieID,Ratings,Timestamp
0,1,1193,5,978300760
1,1,661,3,978302109
2,1,914,3,978301968
3,1,3408,4,978300275
4,1,2355,5,978824291


In [4]:
ratings_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000209 entries, 0 to 1000208
Data columns (total 4 columns):
 #   Column     Non-Null Count    Dtype
---  ------     --------------    -----
 0   UserID     1000209 non-null  int64
 1   MovieID    1000209 non-null  int64
 2   Ratings    1000209 non-null  int64
 3   Timestamp  1000209 non-null  int64
dtypes: int64(4)
memory usage: 30.5 MB


In [5]:
movies_data.head()

,MovieID,Title,Genres
0,1,Toy Story (1995),Animation|Children's|Comedy
1,2,Jumanji (1995),Adventure|Children's|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama
4,5,Father of the Bride Part II (1995),Comedy


In [6]:
movies_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3883 entries, 0 to 3882
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   MovieID  3883 non-null   int64 
 1   Title    3883 non-null   object
 2   Genres   3883 non-null   object
dtypes: int64(1), object(2)
memory usage: 91.1+ KB


In [7]:
user_data.head()

,UserID,Gender,Age,Occupation,Zixp-code
0,1,F,1,10,48067
1,2,M,56,16,70072
2,3,M,25,15,55117
3,4,M,45,7,2460
4,5,M,25,20,55455


In [8]:
user_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6040 entries, 0 to 6039
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   UserID      6040 non-null   int64 
 1   Gender      6040 non-null   object
 2   Age         6040 non-null   int64 
 3   Occupation  6040 non-null   int64 
 4   Zixp-code   6040 non-null   object
dtypes: int64(3), object(2)
memory usage: 236.1+ KB


### 3) Combine all 3 datasets into 1 dataset and clean it to contain only the necessary variables

In [9]:
data = pd.merge(pd.merge(ratings_data, user_data), movies_data)
data.head()

,UserID,MovieID,Ratings,Timestamp,Gender,Age,Occupation,Zixp-code,Title,Genres
0,1,1193,5,978300760,F,1,10,48067,One Flew Over the Cuckoo's Nest (1975),Drama
1,2,1193,5,978298413,M,56,16,70072,One Flew Over the Cuckoo's Nest (1975),Drama
2,12,1193,4,978220179,M,25,12,32793,One Flew Over the Cuckoo's Nest (1975),Drama
3,15,1193,4,978199279,M,25,7,22903,One Flew Over the Cuckoo's Nest (1975),Drama
4,17,1193,5,978158471,M,50,1,95350,One Flew Over the Cuckoo's Nest (1975),Drama
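
The nested `pd.merge` call joins on whatever column names the tables happen to share. A slightly more defensive sketch, shown here on toy tables, names the join keys explicitly and uses the `validate` argument of `DataFrame.merge` to check that each user and each movie appears only once on the right-hand side. The toy rows and the variable names are assumptions for illustration.

```python
import pandas as pd

ratings = pd.DataFrame({"UserID": [1, 2], "MovieID": [10, 10], "Ratings": [5, 4]})
users = pd.DataFrame({"UserID": [1, 2], "Gender": ["F", "M"]})
movies = pd.DataFrame({"MovieID": [10], "Title": ["Toy Movie (1995)"], "Genres": ["Comedy"]})

merged = (
    ratings
    # Many ratings per user, exactly one row per user in `users`
    .merge(users, on="UserID", how="inner", validate="many_to_one")
    # Many ratings per movie, exactly one row per movie in `movies`
    .merge(movies, on="MovieID", how="inner", validate="many_to_one")
)
print(merged)
```

If a key were duplicated in `users` or `movies`, the merge would raise an error instead of silently multiplying rows.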


In [10]:
# Extract genres into separate one-hot columns
genres = data['Genres'].str.split('|', expand=True)
# Stack to one genre per row, one-hot encode, then sum back per original row
genre_df = pd.get_dummies(genres.stack()).groupby(level=0).sum()
data = pd.concat([data, genre_df], axis=1)
data.head()


,UserID,MovieID,Ratings,Timestamp,Gender,Age,Occupation,Zixp-code,Title,Genres,...,Film-Noir,Horror,Musical,Mystery,Romance,RomanceCarriers Are Waiting,Sci-Fi,Thriller,War,Western
0,1,1193,5,978300760,F,1,10,48067,One Flew Over the Cuckoo's Nest (1975),Drama,...,0,0,0,0,0,0,0,0,0,0
1,2,1193,5,978298413,M,56,16,70072,One Flew Over the Cuckoo's Nest (1975),Drama,...,0,0,0,0,0,0,0,0,0,0
2,12,1193,4,978220179,M,25,12,32793,One Flew Over the Cuckoo's Nest (1975),Drama,...,0,0,0,0,0,0,0,0,0,0
3,15,1193,4,978199279,M,25,7,22903,One Flew Over the Cuckoo's Nest (1975),Drama,...,0,0,0,0,0,0,0,0,0,0
4,17,1193,5,978158471,M,50,1,95350,One Flew Over the Cuckoo's Nest (1975),Drama,...,0,0,0,0,0,0,0,0,0,0
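
For reference, pandas also offers `Series.str.get_dummies`, which splits a delimited string column and one-hot encodes it in a single step. The sketch below applies it to three rows taken from the `movies_data.head()` output; whether it is preferable on the full merged table is a judgment call, and the variable names here are illustrative.

```python
import pandas as pd

movies = pd.DataFrame({
    "MovieID": [1, 2, 3],
    "Genres": [
        "Animation|Children's|Comedy",
        "Adventure|Children's|Fantasy",
        "Comedy|Romance",
    ],
})

# One column per distinct genre, 1 where the movie has that genre, else 0
genre_flags = movies["Genres"].str.get_dummies(sep="|")
movies_encoded = pd.concat([movies, genre_flags], axis=1)
print(movies_encoded)
```

Encoding the genres once per movie and merging the result afterwards would also avoid repeating the same work for every rating row.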


In [11]:
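# Categorize the age range into 6 groups: 0-10 years, 11-20 years, 21-30 years, 31-40 years, 41-50 years, >50 years
# Note: MovieLens 1M stores Age as a bracket code (1, 18, 25, 35, 45, 50, 56), not an exact age, so these labels are approximate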
def get_age_group(age):
    if age <= 10:
        return 1
    elif age <= 20:
        return 2
    elif age <= 30:
        return 3
    elif age <= 40:
        return 4
    elif age <= 50:
        return 5
    else:
        return 6

def get_gender_group(gender):
    if gender == 'M':
        return 1
    else:
        return 2

data['Age_range'] = pd.cut(data['Age'], bins=[-1, 10, 20, 30, 40, 50, 60], labels=['0-10 years', '11-20 years', '21-30 years', '31-40 years', '41-50 years', '>50 years'])
data['Age_group'] = data['Age'].apply(get_age_group)
data['Gender_group'] = data['Gender'].apply(get_gender_group)
data.head()

,UserID,MovieID,Ratings,Timestamp,Gender,Age,Occupation,Zixp-code,Title,Genres,...,Mystery,Romance,RomanceCarriers Are Waiting,Sci-Fi,Thriller,War,Western,Age_range,Age_group,Gender_group
0,1,1193,5,978300760,F,1,10,48067,One Flew Over the Cuckoo's Nest (1975),Drama,...,0,0,0,0,0,0,0,0-10 years,1,2
1,2,1193,5,978298413,M,56,16,70072,One Flew Over the Cuckoo's Nest (1975),Drama,...,0,0,0,0,0,0,0,>50 years,6,1
2,12,1193,4,978220179,M,25,12,32793,One Flew Over the Cuckoo's Nest (1975),Drama,...,0,0,0,0,0,0,0,21-30 years,3,1
3,15,1193,4,978199279,M,25,7,22903,One Flew Over the Cuckoo's Nest (1975),Drama,...,0,0,0,0,0,0,0,21-30 years,3,1
4,17,1193,5,978158471,M,50,1,95350,One Flew Over the Cuckoo's Nest (1975),Drama,...,0,0,0,0,0,0,0,41-50 years,5,1


In [12]:
dataset = data.drop(labels=['Timestamp', 'Age', 'Occupation', 'Zixp-code', 'Title', 'Genres'], axis=1)
dataset.head()

,UserID,MovieID,Ratings,Gender,Action,Adventure,Animation,Children's,Comedy,Crime,...,Mystery,Romance,RomanceCarriers Are Waiting,Sci-Fi,Thriller,War,Western,Age_range,Age_group,Gender_group
0,1,1193,5,F,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0-10 years,1,2
1,2,1193,5,M,0,0,0,0,0,0,...,0,0,0,0,0,0,0,>50 years,6,1
2,12,1193,4,M,0,0,0,0,0,0,...,0,0,0,0,0,0,0,21-30 years,3,1
3,15,1193,4,M,0,0,0,0,0,0,...,0,0,0,0,0,0,0,21-30 years,3,1
4,17,1193,5,M,0,0,0,0,0,0,...,0,0,0,0,0,0,0,41-50 years,5,1


In [13]:
data.describe()

,UserID,MovieID,Ratings,Timestamp,Age,Occupation,Action,Adventure,Animation,Children's,...,Musical,Mystery,Romance,RomanceCarriers Are Waiting,Sci-Fi,Thriller,War,Western,Age_group,Gender_group
count,1000209.0,1000209.0,1000209.0,1000209.0,1000209.0,1000209.0,1000209.0,1000209.0,1000209.0,1000209.0,...,1000209.0,1000209.0,1000209.0,1000209.0,1000209.0,1000209.0,1000209.0,1000209.0,1000209.0,1000209.0
mean,3024.512,1865.54,3.581564,972243700.0,29.73831,8.036138,0.2574032,0.133925,0.04328395,0.07217092,...,0.04152432,0.0401696,0.1470343,0.0004579043,0.1572611,0.1896404,0.06851268,0.02067868,3.38955,1.246389
std,1728.413,1096.041,1.117102,12152560.0,11.75198,6.531336,0.4372036,0.3405719,0.2034957,0.2587708,...,0.1994996,0.1963569,0.3541403,0.02139381,0.364047,0.3920166,0.2526237,0.1423063,1.145793,0.4309076
min,1.0,1.0,1.0,956703900.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0
25%,1506.0,1030.0,3.0,965302600.0,25.0,2.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0,1.0
50%,3070.0,1835.0,4.0,973018000.0,25.0,7.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0,1.0
75%,4476.0,2770.0,4.0,975220900.0,35.0,14.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0,1.0
max,6040.0,3952.0,5.0,1046455000.0,56.0,20.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,6.0,2.0


In [14]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000209 entries, 0 to 1000208
Data columns (total 32 columns):
 #   Column                       Non-Null Count    Dtype   
---  ------                       --------------    -----   
 0   UserID                       1000209 non-null  int64   
 1   MovieID                      1000209 non-null  int64   
 2   Ratings                      1000209 non-null  int64   
 3   Timestamp                    1000209 non-null  int64   
 4   Gender                       1000209 non-null  object  
 5   Age                          1000209 non-null  int64   
 6   Occupation                   1000209 non-null  int64   
 7   Zixp-code                    1000209 non-null  object  
 8   Title                        1000209 non-null  object  
 9   Genres                       1000209 non-null  object  
 10  Action                       1000209 non-null  uint8   
 11  Adventure                    1000209 non-null  uint8   
 12  Animation                   

### 4) Create the matrix for building Collaborative Filtering and Content-Based Filtering

**Matrix for Collaborative Filtering**

In [15]:
# Create pivot table between User ID and Movie ID, having Ratings as values
cf_ratings = pd.pivot_table(data=dataset, values='Ratings', index='UserID', columns='MovieID', fill_value=0)
cf_ratings

UserID \ MovieID,1,2,3,4,5,6,7,8,9,10,...,3943,3944,3945,3946,3947,3948,3949,3950,3951,3952
1,5,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,0,0,0,0,0,2,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6036,0,0,0,2,0,3,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6037,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6038,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6039,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
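
One thing to keep in mind with `fill_value=0` is that an unrated movie and a rating of zero become indistinguishable in the matrix. The sketch below, on a toy table, keeps the missing entries as `NaN` first so that a mask of observed ratings can be derived before filling. The variable names are illustrative assumptions.

```python
import pandas as pd

toy = pd.DataFrame({
    "UserID":  [1, 1, 2, 3],
    "MovieID": [10, 20, 20, 30],
    "Ratings": [5, 3, 4, 2],
})

# Without fill_value, unrated pairs stay NaN and can be told apart from real ratings
ratings_nan = toy.pivot_table(values="Ratings", index="UserID", columns="MovieID")
rated_mask = ratings_nan.notna()
ratings_zero = ratings_nan.fillna(0)

# Mean over rated movies only versus mean over the zero-filled row
user_mean_rated = ratings_nan.mean(axis=1)
user_mean_with_zeros = ratings_zero.mean(axis=1)
print(user_mean_rated.loc[1], user_mean_with_zeros.loc[1])
```

For user 1 the mean over rated movies is 4.0, while the zero-filled row averages to about 2.67, which shows how the filled zeros can distort statistics computed on the matrix.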


In [16]:
# Create pivot table between User ID and Gender
added_cbf_user_info = pd.pivot_table(data=dataset, values='Gender_group', index='UserID', columns='Gender', fill_value=0)

# Create pivot table between User ID and Age
added_cbf_age = pd.pivot_table(data=dataset, values='Age_group', index='UserID', columns='Age_range', fill_value=0)

# Merge the two dataframes together
added_cbf_user_info[['0-10 years', '11-20 years', '21-30 years', '31-40 years', '41-50 years', '>50 years']] = added_cbf_age
added_cbf_user_info

UserID,F,M,0-10 years,11-20 years,21-30 years,31-40 years,41-50 years,>50 years
1,2,0,1,0,0,0,0,0
2,0,1,0,0,0,0,0,6
3,0,1,0,0,3,0,0,0
4,0,1,0,0,0,0,5,0
5,0,1,0,0,3,0,0,0
...,...,...,...,...,...,...,...,...
6036,2,0,0,0,3,0,0,0
6037,2,0,0,0,0,0,5,0
6038,2,0,0,0,0,0,0,6
6039,2,0,0,0,0,0,5,0


**Matrix for Content-Based Filtering**

In [17]:
# Create pivot table between Movie ID (rows) and User ID (columns), having Ratings as values
added_cf_ratings = pd.pivot_table(data=dataset, values='Ratings', index='MovieID', columns='UserID', fill_value=0)
added_cf_ratings

MovieID \ UserID,1,2,3,4,5,6,7,8,9,10,...,6031,6032,6033,6034,6035,6036,6037,6038,6039,6040
1,5,0,0,0,0,4,0,4,5,5,...,0,4,0,0,4,0,0,0,0,3
2,0,0,0,0,0,0,0,0,0,5,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
4,0,0,0,0,0,0,0,3,0,0,...,0,0,0,0,2,2,0,0,0,0
5,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3948,0,0,0,0,0,0,0,0,3,4,...,0,0,0,0,0,0,0,0,0,0
3949,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3950,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3951,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [18]:
# Delete the unnecessary columns for CBF
dataset_cbf = dataset.drop(labels=['Ratings', 'Gender', 'Age_range', 'Age_group', 'Gender_group'], axis=1)

# Create a dataframe that contains only Movie ID and Genres
cbf_dataframe_original = dataset_cbf.drop(labels=['UserID'], axis=1)

# Delete duplicate rows
cbf_dataframe_unique = cbf_dataframe_original.drop_duplicates()

# Sort the dataframe by Movie ID
cbf_dataframe_unique_sorted = cbf_dataframe_unique.sort_values(by=['MovieID']).reset_index(drop=True).set_index('MovieID')

cbf_genres = cbf_dataframe_unique_sorted
cbf_genres

MovieID,Action,Adventure,Animation,Children's,Comedy,Crime,Documentary,Drama,Fantasy,Film-Noir,Horror,Musical,Mystery,Romance,RomanceCarriers Are Waiting,Sci-Fi,Thriller,War,Western
1,0,0,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,0,1,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0
4,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0
5,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3948,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3949,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0
3950,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0
3951,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0


### 5) Normalization using MinMaxScaler, and Matrix Factorization (SVD)

In [19]:
# Scale every column of the matrix to the [0, 1] range
def min_max_normalized(dataset):
    
    scaler = MinMaxScaler()
    return scaler.fit_transform(dataset)

# Matrix Factorization - SVD
def matrix_factorization_svd(dataset, rank=1000):
    
    # Thin SVD: U is (m, r), s holds r singular values, Vt is (r, n), with r = min(m, n)
    U, s, Vt = np.linalg.svd(dataset, full_matrices=False)
        
    # Reconstruct the matrix from the `rank` largest singular values (1,000 by default)
    svd = U[:, :rank] @ np.diag(s[:rank]) @ Vt[:rank, :]
    return svd
        
# Normalize the dataset
cf_ratings_norm = min_max_normalized(cf_ratings)
cf_ratings_norm_df = pd.DataFrame(cf_ratings_norm, index=cf_ratings.index, columns=cf_ratings.columns)
added_cbf_user_info_norm = min_max_normalized(added_cbf_user_info)
added_cbf_user_info_norm_df = pd.DataFrame(added_cbf_user_info_norm, index=added_cbf_user_info.index, columns=added_cbf_user_info.columns)
added_cf_ratings_norm = min_max_normalized(added_cf_ratings)
added_cf_ratings_norm_df = pd.DataFrame(added_cf_ratings_norm, index=added_cf_ratings.index, columns=added_cf_ratings.columns)
cbf_genres_norm = min_max_normalized(cbf_genres)
cbf_genres_norm_df = pd.DataFrame(cbf_genres_norm, index=cbf_genres.index, columns=cbf_genres.columns)

cf_with_added_cbf = cf_ratings_norm_df.copy()
cf_with_added_cbf[['F', 'M', '0-10 years', '11-20 years', '21-30 years', '31-40 years', '41-50 years', '>50 years']] = added_cbf_user_info_norm_df
svd_cf_with_added_cbf = matrix_factorization_svd(cf_with_added_cbf)
cbf_with_added_cf = added_cf_ratings_norm_df.copy()
cbf_with_added_cf[['Action', 
                      'Adventure', 
                      'Animation', 
                      "Children's",  
                      'Comedy', 
                      'Crime', 
                      'Documentary', 
                      'Drama', 
                      'Fantasy', 
                      'Film-Noir', 
                      'Horror', 
                      'Musical', 
                      'Mystery', 
                      'Romance', 
                      'RomanceCarriersAre Waiting', 
                      'Sci-Fi', 
                      'Thriller', 
                      'War', 
                      'Western'
                     ]] = cbf_genres_norm_df
svd_cbf_with_added_cf = matrix_factorization_svd(cbf_with_added_cf)


# Factorize the matrices of ratings for Collaborative Filtering and Content-Based Filtering
# svd_cf_ratings = matrix_factorization_svd(cf_ratings_norm)
# svd_added_cf_ratings = matrix_factorization_svd(added_cf_ratings_norm)
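
The reconstruction step above is a truncated (low-rank) SVD approximation. The sketch below repeats the idea on a small random matrix and checks a standard property: the Frobenius-norm error of the rank-k approximation equals the square root of the sum of the squared discarded singular values. The function name `low_rank_approximation`, the matrix size, and `k=2` are assumptions for illustration.

```python
import numpy as np

def low_rank_approximation(matrix: np.ndarray, k: int) -> np.ndarray:
    """Rebuild `matrix` from its k largest singular values (illustrative helper)."""
    U, s, Vt = np.linalg.svd(matrix, full_matrices=False)
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

rng = np.random.default_rng(0)
toy = rng.random((6, 4))

approx = low_rank_approximation(toy, k=2)

# Compare the actual error with the energy in the discarded singular values
singular_values = np.linalg.svd(toy, compute_uv=False)
error = np.linalg.norm(toy - approx)                     # Frobenius norm
expected_error = np.sqrt(np.sum(singular_values[2:] ** 2))
print(error, expected_error)
```

A check like this can help when choosing the rank: the smaller the discarded singular values, the less information the approximation loses.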

In [20]:
cbf_with_added_cf

MovieID,1,2,3,4,5,6,7,8,9,10,...,Film-Noir,Horror,Musical,Mystery,Romance,RomanceCarriersAre Waiting,Sci-Fi,Thriller,War,Western
1,1.0,0.0,0.0,0.0,0.0,0.8,0.0,0.8,1.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.6,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3948,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.6,0.8,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3949,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3950,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3951,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
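
The values 0.6 and 0.8 in the table above come from min-max scaling, which rescales each column independently as (x - min) / (max - min). The toy sketch below reproduces this with `MinMaxScaler` and with the formula written out; the two-column matrix is an assumption for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Rows are movies, columns are users; 0 stands for "not rated"
ratings = np.array([
    [5.0, 0.0],
    [0.0, 3.0],
    [4.0, 0.0],
])

scaled = MinMaxScaler().fit_transform(ratings)

# Same computation written out: each column uses its own min and max
col_min = ratings.min(axis=0)
col_max = ratings.max(axis=0)
manual = (ratings - col_min) / (col_max - col_min)
print(scaled)
```

Because the scaling is per column, a rating of 3 maps to 1.0 in the second column (where 3 is the maximum) but would map to 0.6 in the first column (where the maximum is 5), so scaled values are only comparable within a column.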


In [21]:
svd_cf_with_added_cbf

array([[ 1.01131537,  0.03104115, -0.04930724, ...,  0.00451675,
         0.00650818,  0.02244134],
       [ 0.01747884,  0.02495976, -0.02457337, ..., -0.00871467,
         0.00603662,  1.02423351],
       [-0.00979233,  0.00560454, -0.02275317, ...,  0.00158187,
        -0.00247754, -0.02862639],
       ...,
       [ 0.02013241, -0.03096764,  0.0331155 , ...,  0.01113451,
        -0.00469227,  0.97300975],
       [ 0.0149028 , -0.04761904, -0.01416837, ...,  0.01624688,
         0.99591503,  0.02860947],
       [ 0.5906503 , -0.01135106,  0.00645543, ..., -0.02618756,
         0.00891023,  0.01099049]])

In [22]:
svd_cf_with_added_cbf.shape

(6040, 3714)

In [23]:
svd_cbf_with_added_cf

array([[ 1.00156561,  0.01765711, -0.00801841, ..., -0.01979324,
        -0.02767514,  0.00890088],
       [ 0.03925   ,  0.01962865,  0.00962283, ..., -0.02685445,
        -0.0019201 ,  0.02244818],
       [-0.03598736, -0.03201922, -0.02334132, ..., -0.0426134 ,
        -0.02695314,  0.06986472],
       ...,
       [-0.01854802,  0.00560091,  0.01508797, ...,  0.05898926,
         0.03264483, -0.00962334],
       [-0.00503418, -0.01673467,  0.00901846, ...,  0.00390473,
         0.00408   , -0.02564217],
       [-0.05308743, -0.03629341, -0.01658168, ...,  1.06719969,
         0.05290692,  0.00257441]])

In [24]:
svd_cbf_with_added_cf.shape

(3706, 6059)

### 6) Build Hybrid Movie Recommendation System using Collaborative Filtering and Content-Based Filtering

In [25]:
def get_cosine_similarity(data, data_get_index, var_id):
    
    row_num = data_get_index.index.get_loc(var_id)
    rows = data[row_num, :]
    cols = data
    cos_sim = cosine_similarity([rows], cols)
    return cos_sim

# Build Hybrid system in Collaborative Filtering part (find the most similar user)
def top_similar_user(svd_hybrid_system, user_id, cf_ratings):
    
    # Find cosine similarity based on user's ratings, gender, and age
    svd_hybrid_system_cos_sim = get_cosine_similarity(svd_hybrid_system, cf_ratings, user_id)
    
    # Sort in descending order; position 0 is assumed to be the user itself, so take position 1
    new_cf_cos_sim_sorted = np.argsort(-svd_hybrid_system_cos_sim)[0]
    return cf_ratings.index[new_cf_cos_sim_sorted[1]]

# Get the top rated movies that the most similar user liked
def top_rated_movies_of_similar_users(data, top_sim_user_id, top_n=5):
    
    rated_movies = data[data['UserID'] == top_sim_user_id][['MovieID', 'Title', 'Ratings']].values
        
    # Sort in descending order of rating
    rated_movies_sorted = rated_movies[np.argsort(-rated_movies[:, 2])]
        
    # Keep the top_n highest-rated movies as rows of (MovieID, Title, Ratings)
    return rated_movies_sorted[:top_n]

# Build Hybrid system in Content-Based Filtering part
# Find the movies most similar to a movie that the most similar user liked,
# then recommend them to a particular user
def top_similar_movies(svd_hybrid_system, user_id, movie_id, added_cf_ratings, data, movies_data, top_n=5):
    
    # Create array of movies that the user has never seen
    # (computed here but not yet used to filter the results below)
    movies_user_seen = movies_seen(data, user_id)
    movies_user_not_seen = movies_not_seen(movies_data, movies_user_seen)
    
    # Find cosine similarity based on movie's ratings and genres
    svd_hybrid_system_cos_sim = get_cosine_similarity(svd_hybrid_system, added_cf_ratings, movie_id)  
    # Skip position 0 (assumed to be the movie itself) and keep the next top_n
    new_cbf_cos_sim_sorted = np.argsort(-svd_hybrid_system_cos_sim)[0][1:top_n + 1]
    
    # Map row positions back to Movie IDs
    top_sim_movie_id = [added_cf_ratings.index[index] for index in new_cbf_cos_sim_sorted]
    return top_sim_movie_id

def movies_seen(dataset, user_id):
    
    movies_user_seen = dataset[dataset['UserID'] == user_id]['MovieID'].values
    return movies_user_seen
    
def movies_not_seen(movies_data, movies_user_seen):
    
    movies_user_not_seen = movies_data[~movies_data['MovieID'].isin(movies_user_seen)]['MovieID'].values
    return movies_user_not_seen

# Define User ID of a particular user that we would like to recommend movies to
user_id = 5
top_sim_user_id = top_similar_user(svd_cf_with_added_cbf, user_id, cf_ratings)
# Use a separate variable name so the function is not overwritten when the cell is re-run
top_rated_movies = top_rated_movies_of_similar_users(data, top_sim_user_id)

# Use the similar user's top rated movie as the seed for the content-based step
movie_id = top_rated_movies[0][0]
top_5_sim_movie_id = top_similar_movies(svd_cbf_with_added_cf, 
                                        user_id, 
                                        movie_id, 
                                        added_cf_ratings, 
                                        data, 
                                        movies_data
                                       )

# Print the list of recommended movies
print(f"Recommended movies from the most similar user of user ID {user_id}: ")
for i, movie in enumerate(top_rated_movies[:, 1]):
    print(f"{i + 1}. {movie}")

print(f"\nRecommended movies that are the most similar movies to movie ID {movie_id}: ")
for i, movie in enumerate(top_5_sim_movie_id):
    title = movies_data.loc[movies_data['MovieID'] == movie, 'Title'].iloc[0]
    print(f"{i + 1}. {title}")

Recommended movies from the most similar user of user ID 5: 
1. Three Kings (1999)
2. Bridge on the River Kwai, The (1957)
3. Four Days in September (1997)
4. Malcolm X (1992)
5. Raging Bull (1980)

Recommended movies that are the most similar movies to movie ID 2890: 
1. Pulp Fiction (1994)
2. Fargo (1996)
3. Good Will Hunting (1997)
4. Truman Show, The (1998)
5. Usual Suspects, The (1995)
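
The neighbour search above takes the rows with the highest cosine similarity and skips the first position on the assumption that it is the query row itself. A minimal sketch of the same idea on a toy feature matrix is shown below; it excludes the query row explicitly, which may be more robust when another row is identical to the query. The function name `top_k_similar` and the feature values are assumptions for illustration.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def top_k_similar(features: np.ndarray, row: int, k: int = 2) -> list:
    """Positions of the k rows most similar to `row` (illustrative helper)."""
    sims = cosine_similarity(features[row:row + 1], features)[0]
    sims[row] = -np.inf              # exclude the query row explicitly
    return np.argsort(-sims)[:k].tolist()

features = np.array([
    [1.0, 0.0, 1.0],
    [0.9, 0.1, 1.0],
    [0.0, 1.0, 0.0],
    [1.0, 0.0, 0.0],
])
neighbours = top_k_similar(features, row=0, k=2)
print(neighbours)
```

Here row 1 points in almost the same direction as row 0 and row 3 shares one of its two non-zero components, so they are returned in that order.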


### 7) Evaluation

To evaluate the performance of our movie recommendation system, we used two metrics: Mean Absolute Percentage Error (MAPE) and Root Mean Squared Error (RMSE). These metrics were used to measure the accuracy of our system in predicting the average ratings of each user and movie.

The purpose of this evaluation was to determine the potential benefits of combining content-based filtering and collaborative filtering, with the aim of a hybrid system that can recommend movies to both existing and new users with a variety of preferences. Specifically, we wanted to determine whether adding content-based features to traditional collaborative filtering, and vice versa, would improve, degrade, or leave unchanged the system's performance in predicting the average ratings of each user and movie.

Understanding the impact of these combinations gives insight into the benefits and limitations of combining the two techniques, and guides improvements to the accuracy and effectiveness of our movie recommendation system.
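
For readers unfamiliar with the two metrics, the sketch below computes MAPE and RMSE by hand on a few made-up rating predictions. The helper names and the numbers are illustrative assumptions; they are not results from this project. Note that MAPE is returned here as a fraction rather than a percentage, which is also how scikit-learn's `mean_absolute_percentage_error` reports it.

```python
import numpy as np

def mape(y_true, y_pred) -> float:
    """Mean absolute percentage error as a fraction (assumes no zero targets)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs((y_true - y_pred) / y_true)))

def rmse(y_true, y_pred) -> float:
    """Root mean squared error."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Made-up average ratings and predictions, for illustration only
y_true = [4.0, 3.0, 5.0, 2.0]
y_pred = [3.5, 3.0, 4.0, 2.5]

mape_value = mape(y_true, y_pred)
rmse_value = rmse(y_true, y_pred)
print(f"MAPE: {mape_value:.5f}, RMSE: {rmse_value:.5f}")
```

MAPE weights errors relative to the true value, so the same absolute error counts more for a low average rating, while RMSE penalizes large absolute errors more heavily.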

**1) Dataframe preparation for evaluation**

In [26]:
# Get the mean of ratings of each user and movie
mean_user_ratings = data.pivot_table('Ratings','UserID',aggfunc='mean')
mean_movie_ratings = data.pivot_table('Ratings','MovieID',aggfunc='mean')

Dataframe for the traditional system, using Collaborative Filtering

In [27]:
cf_ratings_df = cf_ratings.copy()
cf_ratings_df['Average_rating'] = mean_user_ratings
cf_ratings_df

UserID,1,2,3,4,5,6,7,8,9,10,...,3944,3945,3946,3947,3948,3949,3950,3951,3952,Average_rating
1,5,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,4.188679
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,3.713178
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,3.901961
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,4.190476
5,0,0,0,0,0,2,0,0,0,0,...,0,0,0,0,0,0,0,0,0,3.146465
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6036,0,0,0,2,0,3,0,0,0,0,...,0,0,0,0,0,0,0,0,0,3.302928
6037,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,3.717822
6038,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,3.800000
6039,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,3.878049


Dataframe for the traditional system, using Content-Based Filtering

In [28]:
cbf_genres_df = cbf_genres.copy()
cbf_genres_df['Average_rating'] = mean_movie_ratings
cbf_genres_df

MovieID,Action,Adventure,Animation,Children's,Comedy,Crime,Documentary,Drama,Fantasy,Film-Noir,Horror,Musical,Mystery,Romance,RomanceCarriers Are Waiting,Sci-Fi,Thriller,War,Western,Average_rating
1,0,0,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,4.146846
2,0,1,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,3.201141
3,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,3.016736
4,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,2.729412
5,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,3.006757
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3948,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,3.635731
3949,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,4.115132
3950,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,3.666667
3951,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,3.900000


Dataframe for the hybrid system, using Collaborative Filtering with added Content-Based Filtering

In [29]:
cf_with_added_cbf['Average_rating'] = mean_user_ratings
cf_with_added_cbf

UserID,1,2,3,4,5,6,7,8,9,10,...,3952,F,M,0-10 years,11-20 years,21-30 years,31-40 years,41-50 years,>50 years,Average_rating
1,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,4.188679
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,3.713178
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,3.901961
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,4.190476
5,0.0,0.0,0.0,0.0,0.0,0.4,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,3.146465
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6036,0.0,0.0,0.0,0.4,0.0,0.6,0.0,0.0,0.0,0.0,...,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,3.302928
6037,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,3.717822
6038,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,3.800000
6039,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,3.878049


Dataframe for the hybrid system, using Content-Based Filtering with added Collaborative Filtering

In [30]:
cbf_with_added_cf['Average_rating'] = mean_movie_ratings
cbf_with_added_cf

MovieID,1,2,3,4,5,6,7,8,9,10,...,Horror,Musical,Mystery,Romance,RomanceCarriersAre Waiting,Sci-Fi,Thriller,War,Western,Average_rating
1,1.0,0.0,0.0,0.0,0.0,0.8,0.0,0.8,1.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.146846
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.201141
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,3.016736
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.6,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.729412
5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.006757
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3948,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.6,0.8,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.635731
3949,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.115132
3950,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.666667
3951,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.900000


**2) Evaluate the system**

**2.1 Collaborative Filtering with added Content-Based Filtering**
- Split the dataset into 70% as train data, and 30% as test data
- Decompose the data by using SVD
- Evaluate by implementing linear regression to predict the average rating of each user
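
The three steps above can be sketched on toy data as follows. This is a minimal illustration under stated assumptions, not the notebook's pipeline: it uses NumPy's exact SVD in place of scikit-learn's `TruncatedSVD` (which uses a randomized solver by default), and the helper names `fit_svd_regression` and `predict_svd_regression`, the toy matrix, and `k=5` are all hypothetical.

```python
import numpy as np

def fit_svd_regression(X_train, y_train, k):
    """Project onto the top-k right singular vectors, then fit least squares."""
    # As with TruncatedSVD, the data is not centered before decomposition.
    _, _, vt = np.linalg.svd(X_train, full_matrices=False)
    components = vt[:k]                         # shape (k, n_features)
    Z = X_train @ components.T                  # reduced training features
    Z1 = np.column_stack([Z, np.ones(len(Z))])  # append an intercept column
    coef, *_ = np.linalg.lstsq(Z1, y_train, rcond=None)
    return components, coef

def predict_svd_regression(X, components, coef):
    """Apply the training-set projection to new rows, then the linear model."""
    Z = X @ components.T
    return Z @ coef[:-1] + coef[-1]

# Hypothetical toy data: 100 "users" x 20 "movies", ratings 0-5 (0 = unrated).
rng = np.random.default_rng(42)
X = rng.integers(0, 6, size=(100, 20)).astype(float)
# Target: mean of each row's nonzero ratings.
y = X.sum(axis=1) / np.maximum((X > 0).sum(axis=1), 1)

# 70/30 split via a shuffled index.
idx = rng.permutation(len(X))
cut = int(0.7 * len(X))
train, test = idx[:cut], idx[cut:]

components, coef = fit_svd_regression(X[train], y[train], k=5)
preds = predict_svd_regression(X[test], components, coef)
```

The projection is learned from the training rows only and then reused on the test rows, which mirrors the `fit_transform` / `transform` pairing in the cells below.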

**Traditional Collaborative Filtering**

In [31]:
X1 = cf_ratings_df.drop('Average_rating', axis=1)
y1 = cf_ratings_df['Average_rating']
X1_train, X1_test, y1_train, y1_test = train_test_split(X1, y1, test_size=0.3, random_state=42)

# Perform Singular Value Decomposition (SVD)
svd_1 = TruncatedSVD(n_components=1000)
X1_train_svd = svd_1.fit_transform(X1_train)
X1_test_svd = svd_1.transform(X1_test)

# Train a linear regression model on the SVD-transformed data
lin_reg = LinearRegression()
lin_reg.fit(X1_train_svd, y1_train)

# Make predictions on the test set
y1_preds = lin_reg.predict(X1_test_svd)
y1_preds

array([3.76493508, 3.90442352, 3.84448978, ..., 3.45828435, 3.68249325,
       3.35715415])

**Hybrid system of Collaborative Filtering with added Content-Based Filtering**

In [32]:
X2 = cf_with_added_cbf.drop('Average_rating', axis=1)
y2 = cf_with_added_cbf['Average_rating']
X2_train, X2_test, y2_train, y2_test = train_test_split(X2, y2, test_size=0.3, random_state=42)

# Perform Singular Value Decomposition (SVD)
svd_2 = TruncatedSVD(n_components=1000)
X2_train_svd = svd_2.fit_transform(X2_train)
X2_test_svd = svd_2.transform(X2_test)

# Train a linear regression model on the SVD-transformed data
lin_reg = LinearRegression()
lin_reg.fit(X2_train_svd, y2_train)

# Make predictions on the test set
y2_preds = lin_reg.predict(X2_test_svd)
y2_preds



array([3.66376841, 3.84843758, 3.87539902, ..., 3.59262222, 3.62573634,
       3.08512435])

**Results comparison between traditional system (CF) and hybrid system (CF + CBF)**

In [33]:
# Compute Mean Absolute Percentage Error (MAPE)
mape_cf = mean_absolute_percentage_error(y1_test, y1_preds)
mape_cf_added_cbf = mean_absolute_percentage_error(y2_test, y2_preds)

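# Compute the squared-error metric reported in the 'RMSE' column.
# Note: mean_squared_error returns the mean squared error (MSE) by default,
# so these values are MSE; take their square root to obtain the RMSE.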
rmse_cf = mean_squared_error(y1_test, y1_preds)
rmse_cf_added_cbf = mean_squared_error(y2_test, y2_preds)

# Build the comparison table
df_results_1_2 = pd.DataFrame({'MAPE': [mape_cf, mape_cf_added_cbf], 'RMSE': [rmse_cf, rmse_cf_added_cbf]}, 
                          index=['CF', 'CF with added CBF']
                         )
df_results_1_2

System,MAPE,RMSE
CF,0.094686,0.184759
CF with added CBF,0.095688,0.185261


**2.2 Content-Based Filtering with added Collaborative Filtering**
- Split the dataset into 70% as train data, and 30% as test data
- Decompose the data by using SVD (hybrid system only; the traditional CBF model is fit on the genre columns directly)
- Evaluate by implementing linear regression to predict the average rating of each movie

**Traditional Content-Based Filtering**

In [34]:
X3 = cbf_genres_df.drop('Average_rating', axis=1)
y3 = cbf_genres_df['Average_rating']
X3_train, X3_test, y3_train, y3_test = train_test_split(X3, y3, test_size=0.3, random_state=42)

# Train a linear regression model directly on the genre features (no SVD step)
lin_reg = LinearRegression()
lin_reg.fit(X3_train, y3_train)

# Make predictions on the test set
y3_preds = lin_reg.predict(X3_test)
y3_preds

array([2.74896116, 3.34183192, 3.34183192, ..., 3.35065952, 3.34183192,
       3.34183192])

**Hybrid system of Content-Based Filtering with added Collaborative Filtering**

In [35]:
X4 = cbf_with_added_cf.drop('Average_rating', axis=1)
y4 = cbf_with_added_cf['Average_rating']
X4_train, X4_test, y4_train, y4_test = train_test_split(X4, y4, test_size=0.3, random_state=42)

# Perform Singular Value Decomposition (SVD)
svd_4 = TruncatedSVD(n_components=1000)
X4_train_svd = svd_4.fit_transform(X4_train)
X4_test_svd = svd_4.transform(X4_test)

# Train a linear regression model on the SVD-transformed data
lin_reg = LinearRegression()
lin_reg.fit(X4_train_svd, y4_train)

# Make predictions on the test set
y4_preds = lin_reg.predict(X4_test_svd)
y4_preds



array([2.87351046, 3.85234846, 3.6223249 , ..., 4.15415128, 3.67876046,
       2.91828228])

**Results comparison between traditional system (CBF) and hybrid system (CBF + CF)**

In [36]:
# Compute Mean Absolute Percentage Error (MAPE)
mape_cbf = mean_absolute_percentage_error(y3_test, y3_preds)
mape_cbf_added_cf = mean_absolute_percentage_error(y4_test, y4_preds)

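# Compute the squared-error metric reported in the 'RMSE' column.
# Note: mean_squared_error returns the mean squared error (MSE) by default,
# so these values are MSE; take their square root to obtain the RMSE.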
rmse_cbf = mean_squared_error(y3_test, y3_preds)
rmse_cbf_added_cf = mean_squared_error(y4_test, y4_preds)

# Build the comparison table
df_results_3_4 = pd.DataFrame({'MAPE': [mape_cbf, mape_cbf_added_cf], 'RMSE': [rmse_cbf, rmse_cbf_added_cf]}, 
                          index=['CBF', 'CBF with added CF']
                         )

df_results_3_4

System,MAPE,RMSE
CBF,0.17882,0.389887
CBF with added CF,0.117259,0.219607
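
One caveat when reading the `RMSE` columns: the cells above call `mean_squared_error`, which by default returns the mean squared error (MSE), not its square root. The sketch below uses only the standard library to show the difference on hypothetical toy ratings, and then converts the four reported values to RMSE. The helper names `mse` and `rmse` are our own.

```python
import math

def mse(y_true, y_pred):
    """Mean squared error: the average of squared residuals."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean squared error: the square root of the MSE."""
    return math.sqrt(mse(y_true, y_pred))

# Hypothetical toy ratings, for illustration only.
y_true = [4.0, 3.0, 5.0, 2.0]
y_pred = [3.5, 3.0, 4.0, 3.0]
toy_mse = mse(y_true, y_pred)    # (0.25 + 0 + 1 + 1) / 4 = 0.5625
toy_rmse = rmse(y_true, y_pred)  # sqrt(0.5625) = 0.75

# Values copied from the 'RMSE' columns of the two result tables above.
reported = {
    "CF": 0.184759,
    "CF with added CBF": 0.185261,
    "CBF": 0.389887,
    "CBF with added CF": 0.219607,
}
as_rmse = {name: math.sqrt(value) for name, value in reported.items()}
for name, value in as_rmse.items():
    print(f"{name}: {value:.3f}")
```

Because the square root is monotonic, the ranking of the four systems is the same under either metric; only the magnitudes change. Recent scikit-learn releases also provide `root_mean_squared_error`, which returns the RMSE directly.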


### 8) Conclusion

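In conclusion, our evaluation shows that the benefit of combining Collaborative Filtering (CF) and Content-Based Filtering (CBF) depends on the direction of the combination. Adding CF to CBF substantially improved the accuracy of predicting average movie ratings compared to the traditional CBF system alone: the MAPE fell by approximately 0.06 (from 0.179 to 0.117), and the squared-error metric reported in the RMSE column fell by approximately 0.17 (from 0.390 to 0.220). This column was computed with `mean_squared_error`, which returns the mean squared error by default, so it is an MSE rather than a true RMSE. Adding CBF to CF, in contrast, left the error essentially unchanged (MAPE of 0.0947 versus 0.0957).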
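
As a rough check on these figures, the absolute and relative changes can be recomputed from the two result tables. The snippet below is a small sketch using only the reported numbers; the helper `relative_change` and the dictionary layout are our own, and "RMSE column" refers to the column as labeled in the tables.

```python
def relative_change(baseline, hybrid):
    """Fractional change of the hybrid error relative to the baseline.

    Negative values mean the hybrid system has a lower error.
    """
    return (hybrid - baseline) / baseline

# (baseline, hybrid) pairs copied from the result tables.
results = {
    "CF -> CF with added CBF": {
        "MAPE": (0.094686, 0.095688),
        "RMSE column": (0.184759, 0.185261),
    },
    "CBF -> CBF with added CF": {
        "MAPE": (0.17882, 0.117259),
        "RMSE column": (0.389887, 0.219607),
    },
}

for name, metrics in results.items():
    for metric, (base, hyb) in metrics.items():
        change = relative_change(base, hyb)
        print(f"{name} | {metric}: {hyb - base:+.4f} ({change:+.1%})")
```

Reporting the relative change alongside the absolute one makes the two comparisons easier to weigh against each other, since the CF and CBF baselines start from quite different error levels.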

Therefore, **to build a movie recommendation system that can effectively recommend diverse types of movies to both old and new users, we recommend a hybrid approach that combines Collaborative Filtering and Content-Based Filtering**. This approach can provide a more comprehensive and accurate recommendation system for users with different preferences and help ensure better user satisfaction. However, further research is necessary to identify the optimal combination of recommendation techniques and to improve the accuracy of the hybrid system.