# **CAP6819 Advanced Internet System**
## Spring 2023 - Recommender System Project
## Nha Tran

## **Problem**: Design a program using any programming language (C, C++, Java, Python) for Content-Based Recommendation

**Implementation Plan**:

**1. Training set:**
- Read the dataset file `movies_2022_netflix.csv` 
- Randomly choose **10** movies to be in the **training set**
- Extract all the genres available in the dataset
- Convert genres to binary outputs
- Create a binary matrix 
- Calculate the user profile for later use

**2. Testing set:**
- Randomly choose **10** movies to be in the **testing set**
- Convert genres to binary outputs
- Create a binary matrix 
- Calculate the weighted movie matrix
- Get a recommendation score for each movie and sort the list in descending order
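
The two computations named in the plan can be written compactly. The symbols below are introduced here for illustration only and do not appear in the code: $M$ is the binary training matrix ($n$ movies by $G$ genres), $s_i$ is the IMDb score of training movie $i$, $C$ is the binary matrix of the candidate (testing) movies, $p_g$ is the user-profile weight of genre $g$, and $r_j$ is the recommendation score of candidate $j$. The factor of 10 is the scaling applied later in this notebook.

$$
p_g = \frac{\sum_{i=1}^{n} s_i \, M_{ig}}{\sum_{i=1}^{n} \sum_{h=1}^{G} s_i \, M_{ih}},
\qquad
r_j = 10 \sum_{g=1}^{G} C_{jg} \, p_g
$$

Because the denominator is the sum of all weighted entries, the profile weights $p_g$ add up to 1.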

**Import libraries**

In [2]:
import numpy as np
import pandas as pd
from IPython.display import display  # available by default in notebooks; needed when run as a script

### **Preparations for training set**

- Read the original dataset 
- Get to know the dataset
- Randomly choose 10 movies to be in the **training set**
- Convert genres to binary outputs
- Create a binary matrix
- Calculate the weighted movie matrix
- Calculate the user profile for later use

In [3]:
# Read the original dataset
df_org = pd.read_csv('movies_2022_netflix.csv')
df_org.head(5)

Unnamed: 0,title,genres,imdb_score
0,Major,"[action, drama]",9.1
1,Heartstopper,"[romance, drama]",8.7
2,Twenty Five Twenty One,"[drama, romance]",8.7
3,Alchemy of Souls,"[drama, action, thriller, scifi, comedy, fanta...",8.6
4,Our Blues,[drama],8.5


In [4]:
# Get to know the dataset
df_org.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 336 entries, 0 to 335
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   title       336 non-null    object 
 1   genres      336 non-null    object 
 2   imdb_score  336 non-null    float64
dtypes: float64(1), object(2)
memory usage: 8.0+ KB


**Note:**
- No values are missing, and `imdb_score` is numeric. The `genres` column is stored as a string and is parsed into a list below.
- The original dataset contains a total of 336 movies.



In [5]:
# Randomly sample 10 movies from the original dataset for the training set
# (no random_state is set, so each run draws a different sample)
df = df_org.sample(n=10)

In [6]:
# Parse each genre string, e.g. "[action, drama]", into a list of genre names
df_org['genres'] = df_org.genres.str.replace(" ", "")
df_org['genres'] = df_org.genres.apply(lambda x: x[1:-1].split(','))

# Collect the unique genres across all movies
genres = list(df_org["genres"].values)  # one list of genres per movie
genres = list(set(item.strip() for sublist in genres for item in sublist if item is not None))  # flatten, then deduplicate with set

# Print out the unique genres
print("Number of Genres: ", len(genres))
print("Genres:", genres)

Number of Genres:  19
Genres: ['romance', 'european', 'crime', 'horror', 'comedy', 'scifi', 'drama', 'reality', 'war', 'fantasy', 'thriller', 'history', 'documentation', 'family', 'western', 'sport', 'music', 'action', 'animation']


**Note:**

The dataset contains 19 unique genres. Their order comes from a Python `set`, so it can differ between runs; the order printed above is the column order used in the rest of this notebook.
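
As a minimal sketch of the same extraction step, the helper below parses the bracketed strings and sorts the unique genres so that the order is the same on every run. The sample rows and the names `parse_genres` and `unique_genres` are made up for illustration. One difference from the cell above: slicing and splitting an empty `"[]"` yields `['']`, so this sketch filters out empty names.

```python
import pandas as pd

# Hypothetical rows in the same "[a, b]" string format as the CSV column.
raw = pd.Series(["[action, drama]", "[drama]", "[romance, drama]", "[]"])


def parse_genres(text):
    """Turn '[a, b]' into ['a', 'b']; an empty '[]' becomes []."""
    inner = text.strip()[1:-1]
    return [g.strip() for g in inner.split(",") if g.strip()]


parsed = raw.apply(parse_genres)
# sorted() fixes the order; iterating a bare set of strings can vary between sessions.
unique_genres = sorted({g for row in parsed for g in row})
print(unique_genres)
```

Sorting changes the column order relative to the outputs shown in this notebook, so it is an alternative rather than a drop-in replacement.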

In [7]:
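# Convert genres to binary (0/1) columns.
# Note: df was sampled before the genres were parsed into lists, so x is still the
# raw string here and `in` is a substring test. It gives the right result only
# because no genre name is contained in another.
for genre in genres:
    df[genre] = df.genres.apply(lambda x: 1 if genre in x else 0)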

# Save in csv file
df.to_csv('training_binary.csv')
df

Unnamed: 0,title,genres,imdb_score,romance,european,crime,horror,comedy,scifi,drama,...,fantasy,thriller,history,documentation,family,western,sport,music,action,animation
126,The Gray Man,"[thriller, action]",6.6,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,1,0
10,Jana Gana Mana,"[drama, thriller, crime]",8.3,0,0,1,0,0,0,1,...,0,1,0,0,0,0,0,0,0,0
280,Love & Gelato,"[romance, comedy, drama]",5.1,1,0,0,0,1,0,1,...,0,0,0,0,0,0,0,0,0,0
68,Anek,"[action, thriller, drama]",7.1,0,0,0,0,0,0,1,...,0,1,0,0,0,0,0,0,1,0
246,"Love, Life & Everything in Between","[comedy, drama, romance]",5.6,1,0,0,0,1,0,1,...,0,0,0,0,0,0,0,0,0,0
52,Apollo 10½: A Space Age Childhood,"[animation, scifi, action, comedy, romance, fa...",7.3,1,0,0,1,1,1,1,...,0,0,0,0,1,0,0,0,1,1
118,The G Word with Adam Conover,"[documentation, comedy]",6.7,0,0,0,0,1,0,0,...,0,0,0,1,0,0,0,0,0,0
20,Viraata Parvam,"[romance, action, drama]",8.0,1,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,1,0
54,Taylor Tomlinson: Look at You,"[comedy, documentation]",7.3,0,0,0,0,1,0,0,...,0,0,0,1,0,0,0,0,0,0
9,Who Rules The World,"[drama, fantasy, romance]",8.3,1,0,0,0,0,0,1,...,1,0,0,0,0,0,0,0,0,0
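
The binary encoding can also be produced without an explicit loop. The sketch below assumes the `genres` column already holds Python lists; the toy frame and the names `toy`, `genre_columns`, and `one_hot` are hypothetical. It relies on `Series.str.join` and `Series.str.get_dummies` from pandas.

```python
import pandas as pd

# Hypothetical frame whose 'genres' column already holds Python lists.
toy = pd.DataFrame({
    "title": ["A", "B", "C"],
    "genres": [["action", "drama"], ["comedy"], ["drama", "comedy"]],
})
genre_columns = ["action", "comedy", "drama"]  # fixed, explicit column order

# Join each list with '|' and let pandas build the 0/1 indicator columns.
one_hot = toy["genres"].str.join("|").str.get_dummies(sep="|")
# reindex guarantees every expected genre column exists, even if it is all zeros.
one_hot = one_hot.reindex(columns=genre_columns, fill_value=0)

encoded = pd.concat([toy, one_hot], axis=1)
print(encoded)
```

Matching whole list items this way avoids any dependence on substring matches between genre names.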


In [8]:
# Create the binary movie matrix (one row per movie, one column per genre)
movie_matrix = df.iloc[:,3:].to_numpy()

# Get the IMDb scores as an array
score_list = np.array(df.imdb_score)

# Weight each movie's genre row by its IMDb score
multiply_score = movie_matrix * score_list[:, None]

# User profile = per-genre sum of the weighted matrix, normalized by the grand total
total_sum_score = np.sum(multiply_score)
user_profile = np.sum(multiply_score,axis=0)/total_sum_score

print(f"Number of genres: {len(user_profile)}")
print("\nUser Profile:")
print(np.around(user_profile,3))


Number of genres: 19

User Profile:
[0.151 0.    0.037 0.032 0.141 0.032 0.219 0.    0.    0.037 0.097 0.
 0.062 0.032 0.    0.    0.    0.128 0.032]
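
The profile calculation in the cell above can be wrapped in a small function and checked on numbers that are easy to verify by hand. This is a sketch: the function name `build_user_profile`, the zero-total guard, and the toy input are assumptions added for illustration.

```python
import numpy as np


def build_user_profile(binary_matrix, scores):
    """Rating-weighted genre sums, normalized so the profile sums to 1."""
    binary_matrix = np.asarray(binary_matrix, dtype=float)
    scores = np.asarray(scores, dtype=float)
    weighted = binary_matrix * scores[:, None]   # broadcast one score per row
    total = weighted.sum()
    if total == 0:
        raise ValueError("cannot normalize: all weighted entries are zero")
    return weighted.sum(axis=0) / total


# Hypothetical toy input: 2 movies, 3 genres.
# Weighted rows are [8, 0, 8] and [0, 4, 4]; column sums 8, 4, 12; total 24.
profile = build_user_profile([[1, 0, 1], [0, 1, 1]], [8.0, 4.0])
print(profile)
```

The same check applies to the real output above: drama appears in seven of the ten training movies, which is why it carries the largest weight (0.219).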


In [9]:
print("Original Movie Matrix: ")
display(pd.DataFrame(np.around(movie_matrix,3), columns = genres))
print("\nWeighted Movie Matrix: ")
display(pd.DataFrame(np.around(multiply_score,3), columns = genres))
print("\nUser Profile: ")
display(pd.DataFrame([np.around(user_profile,3)], columns = genres))

Original Movie Matrix: 


Unnamed: 0,romance,european,crime,horror,comedy,scifi,drama,reality,war,fantasy,thriller,history,documentation,family,western,sport,music,action,animation
0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0
1,0,0,1,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0
2,1,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,1,0
4,1,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0
5,1,0,0,1,1,1,1,0,0,0,0,0,0,1,0,0,0,1,1
6,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0
7,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0
8,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0
9,1,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0


Weighted Movie Matrix: 


Unnamed: 0,romance,european,crime,horror,comedy,scifi,drama,reality,war,fantasy,thriller,history,documentation,family,western,sport,music,action,animation
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,6.6,0.0,0.0,0.0,0.0,0.0,0.0,6.6,0.0
1,0.0,0.0,8.3,0.0,0.0,0.0,8.3,0.0,0.0,0.0,8.3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,5.1,0.0,0.0,0.0,5.1,0.0,5.1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,7.1,0.0,0.0,0.0,7.1,0.0,0.0,0.0,0.0,0.0,0.0,7.1,0.0
4,5.6,0.0,0.0,0.0,5.6,0.0,5.6,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,7.3,0.0,0.0,7.3,7.3,7.3,7.3,0.0,0.0,0.0,0.0,0.0,0.0,7.3,0.0,0.0,0.0,7.3,7.3
6,0.0,0.0,0.0,0.0,6.7,0.0,0.0,0.0,0.0,0.0,0.0,0.0,6.7,0.0,0.0,0.0,0.0,0.0,0.0
7,8.0,0.0,0.0,0.0,0.0,0.0,8.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,8.0,0.0
8,0.0,0.0,0.0,0.0,7.3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,7.3,0.0,0.0,0.0,0.0,0.0,0.0
9,8.3,0.0,0.0,0.0,0.0,0.0,8.3,0.0,0.0,8.3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


User Profile: 


Unnamed: 0,romance,european,crime,horror,comedy,scifi,drama,reality,war,fantasy,thriller,history,documentation,family,western,sport,music,action,animation
0,0.151,0.0,0.037,0.032,0.141,0.032,0.219,0.0,0.0,0.037,0.097,0.0,0.062,0.032,0.0,0.0,0.0,0.128,0.032



### **Preparations for testing set**

- Randomly choose **10** movies from the original dataset to be in the **testing set**
- Drop the `imdb_score` column
- Convert genres to binary outputs
- Create a binary matrix 
- Calculate the weighted movie matrix
- Get a recommendation score for each movie and sort the list in descending order


In [10]:
# Randomly sample 10 movies from the original dataset for the testing set
# (sampling from all of df_org means a movie can also appear in the training set)
df_test = df_org.sample(n=10)
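
If repeatable and non-overlapping samples are wanted, one option is to fix a seed and remove the training rows before drawing the testing set. The sketch below uses a made-up 30-row frame in place of `df_org`; the names `catalog`, `train`, `test`, and the seed value are assumptions.

```python
import pandas as pd

# Hypothetical stand-in for df_org with 30 rows.
catalog = pd.DataFrame({"title": [f"movie_{i}" for i in range(30)]})

SEED = 42  # arbitrary; any fixed integer makes the draw repeatable

train = catalog.sample(n=10, random_state=SEED)
# Drop the training rows first so no movie can appear in both sets.
test = catalog.drop(index=train.index).sample(n=10, random_state=SEED)

print(sorted(train.index), sorted(test.index))
```

Fixing the seed would change which movies are drawn, so the tables shown in this notebook would no longer match a fresh run.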

In [11]:
# Drop the score column
df_test.drop(columns=['imdb_score'],inplace=True)

In [12]:
# Convert genres to binary (0/1) columns
for genre in genres:
    df_test[genre] = df_test.genres.apply(lambda x: 1 if genre in x else 0)

# Save in csv file
df_test.to_csv('testing_binary.csv')
df_test

Unnamed: 0,title,genres,romance,european,crime,horror,comedy,scifi,drama,reality,...,fantasy,thriller,history,documentation,family,western,sport,music,action,animation
222,David Spade: Nothing Personal,"[comedy, documentation]",0,0,0,0,1,0,0,0,...,0,0,0,1,0,0,0,0,0,0
200,Donkeyhead,[drama],0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
161,As the Crow Flies,"[drama, comedy]",0,0,0,0,1,0,1,0,...,0,0,0,0,0,0,0,0,0,0
284,Snowflake Mountain,[reality],0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
140,"Mom, Dont Do That!","[comedy, drama, romance]",1,0,0,0,1,0,1,0,...,0,0,0,0,0,0,0,0,0,0
201,Adam by Eve: A Live in Animation,"[drama, animation, music]",0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,1,0,1
315,40 Years Young,[comedy],0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
255,Amandla,"[drama, crime]",0,0,1,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
160,Trust No One: The Hunt for the Crypto King,"[documentation, crime]",0,0,1,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0
150,My Fathers Violin,"[drama, music]",0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,1,0,0


In [13]:
# Create the binary movie matrix for the candidate movies
movie_matrix_test = df_test.iloc[:,2:].to_numpy()

# Weighted movie matrix = each genre column multiplied by its user-profile weight
weighted_movie_matrix = movie_matrix_test*user_profile

# Recommendation score for each movie = row sum of the weighted matrix, scaled by 10
recommendation_score = np.sum(weighted_movie_matrix,axis=1)*10

# Insert the score as the third column of the dataframe
df_test.insert(2, "score",recommendation_score)
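
The element-wise multiply followed by a row sum is the same operation as a matrix-vector product. The sketch below checks that on a made-up profile and three made-up candidates; the names `candidates`, `elementwise`, and `matvec` are hypothetical, and the factor of 10 is the scaling used in the cell above.

```python
import numpy as np

# Hypothetical profile over 3 genres and 3 candidate movies.
user_profile = np.array([0.5, 0.3, 0.2])
candidates = np.array([
    [1, 0, 0],
    [1, 1, 0],
    [0, 0, 0],   # shares no genre with the profile
])

# Element-wise form, as in the cell above.
elementwise = (candidates * user_profile).sum(axis=1) * 10
# Same numbers as a single matrix-vector product.
matvec = candidates @ user_profile * 10

print(matvec)
```

A candidate whose genres all have zero weight in the profile scores 0, which is what happens to "Snowflake Mountain" (`[reality]`) in the results.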

In [14]:
print("Candidate Movie Matrix: ")
display(pd.DataFrame(np.around(movie_matrix_test,3), columns = genres))
print("\nWeighted Movie matrix: ")
display(pd.DataFrame(np.around(weighted_movie_matrix,3), columns = genres))
print("\nRecommendation Score: ")
display(pd.DataFrame([np.around(recommendation_score,3)], columns = df_test['title']))

Candidate Movie Matrix: 


Unnamed: 0,romance,european,crime,horror,comedy,scifi,drama,reality,war,fantasy,thriller,history,documentation,family,western,sport,music,action,animation
0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0
1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0
4,1,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0
5,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,1
6,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0
7,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0
8,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0
9,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0


Weighted Movie matrix: 


Unnamed: 0,romance,european,crime,horror,comedy,scifi,drama,reality,war,fantasy,thriller,history,documentation,family,western,sport,music,action,animation
0,0.0,0.0,0.0,0.0,0.141,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.062,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.219,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.141,0.0,0.219,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.151,0.0,0.0,0.0,0.141,0.0,0.219,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,0.0,0.219,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.032
6,0.0,0.0,0.0,0.0,0.141,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,0.0,0.0,0.037,0.0,0.0,0.0,0.219,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,0.0,0.0,0.037,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.062,0.0,0.0,0.0,0.0,0.0,0.0
9,0.0,0.0,0.0,0.0,0.0,0.0,0.219,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Recommendation Score: 


title,David Spade: Nothing Personal,Donkeyhead,As the Crow Flies,Snowflake Mountain,"Mom, Dont Do That!",Adam by Eve: A Live in Animation,40 Years Young,Amandla,Trust No One: The Hunt for the Crypto King,My Fathers Violin
0,2.028,2.191,3.602,0.0,5.115,2.513,1.411,2.557,0.983,2.191



### **Results**

In [15]:
# Sort the dataframe according to the score
df_test_sorted = df_test.sort_values(by='score', ascending=False)

# Save in csv file
df_test_sorted[['title','genres','score']].to_csv('result.csv')
df_test_sorted[['title','genres','score']]

Unnamed: 0,title,genres,score
140,"Mom, Dont Do That!","[comedy, drama, romance]",5.114638
161,As the Crow Flies,"[drama, comedy]",3.602293
255,Amandla,"[drama, crime]",2.557319
201,Adam by Eve: A Live in Animation,"[drama, animation, music]",2.513228
200,Donkeyhead,[drama],2.191358
150,My Fathers Violin,"[drama, music]",2.191358
222,David Spade: Nothing Personal,"[comedy, documentation]",2.028219
315,40 Years Young,[comedy],1.410935
160,Trust No One: The Hunt for the Crypto King,"[documentation, crime]",0.983245
284,Snowflake Mountain,[reality],0.0
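
The sorted table contains a tie ("Donkeyhead" and "My Fathers Violin" both score 2.191358). When only the top few recommendations are needed, a secondary sort key keeps the order of tied movies stable. This is a sketch on made-up data; the cut-off `TOP_N` and the choice of title as tie-breaker are assumptions, not part of the project specification.

```python
import pandas as pd

# Hypothetical scored candidates, including a tie.
scored = pd.DataFrame({
    "title": ["B movie", "A movie", "C movie", "D movie"],
    "score": [2.19, 2.19, 5.11, 0.0],
})

TOP_N = 3  # assumed cut-off
top = (
    scored
    .sort_values(by=["score", "title"], ascending=[False, True])  # ties broken by title
    .head(TOP_N)
    .reset_index(drop=True)
)
print(top)
```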
