# Practice PS06: Recommendations engines (interactions-based)

For this assignment we will build and apply item-based and model-based collaborative filtering recommenders for movies.

Author: <font color="blue">Àlex Montoya Pérez</font>

E-mail: <font color="blue">alex.montoya.01@estudiant.upf.edu</font>

Date: <font color="blue">11/11/2023</font>

# **Google Colaboratory Setup & Imports**

I developed this lab in Google Colaboratory. Because the work relies on several files, I set up the environment as follows:


1.   Importing the drive module from the google.colab package.
2.   Mounting the Google Drive at the specified path (/content/drive).
3.   Changing the current working directory to the directory where I have all needed data /content/drive/MyDrive/MineriaDadesMasives/Labs/.

Verify that we are in the correct directory:


4.   Printing the current working directory path using !pwd.
5.   Listing the contents of the current directory using !ls.

In [1]:
from google.colab import drive
drive.mount('/content/drive')
#Here is how to change current working directory
#By default the current working directory is /content
%cd /content/drive/MyDrive/MineriaDadesMasives/Labs/
#Print path and content of the current directory
!pwd
!ls

Mounted at /content/drive
/content/drive/MyDrive/MineriaDadesMasives/Labs
/content/drive/MyDrive/MineriaDadesMasives/Labs
data					ps06_item_based_recsys.ipynb
old					ps07_outlier_analysis.ipynb
ps01_02_data_preparation_242873.ipynb	ps08_data_streams.ipynb
ps03_near_duplicates.ipynb		ps09_forecasting.ipynb
ps04_association_rules.ipynb		README.md
ps05_content_based_recsys_242873.ipynb


# 1. The Movies dataset

## 1.1. Load the input files

In [2]:
# Leave this code as-is

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from math import*
from scipy.sparse.linalg import svds
from sklearn.metrics.pairwise import linear_kernel

In [3]:
# Leave this code as-is

FILENAME_MOVIES = "data/movielens-25M-filtered/movies-2000s.csv"
FILENAME_RATINGS = "data/movielens-25M-filtered/ratings-2000s.csv"
FILENAME_TAGS = "data/movielens-25M-filtered/tags-2000s.csv"

In [4]:
# Leave this code as-is

movies = pd.read_csv(FILENAME_MOVIES,
                    sep=',',
                    engine='python',
                    encoding='latin-1',
                    names=['movie_id', 'title', 'genres'])
display(movies.head(5))

ratings_raw = pd.read_csv(FILENAME_RATINGS,
                    sep=',',
                    encoding='latin-1',
                    engine='python',
                    names=['user_id', 'movie_id', 'rating'])
display(ratings_raw.head(5))

Unnamed: 0,movie_id,title,genres
0,2769,"Yards, The (2000)",Crime|Drama
1,3177,Next Friday (2000),Comedy
2,3190,Supernova (2000),Adventure|Sci-Fi|Thriller
3,3225,Down to You (2000),Comedy|Romance
4,3228,Wirey Spindell (2000),Comedy


Unnamed: 0,user_id,movie_id,rating
0,4,1,3.0
1,4,260,3.5
2,4,296,4.0
3,4,541,4.5
4,4,589,4.0


## 1.2. Merge the data into a single dataframe

Both ratings_raw and movies have a "movie_id" column, so we can combine them on this common field. ratings_raw also has a "user_id" column, but no users table is loaded in this notebook, so only the merge with movies is performed.
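
As a minimal sketch of what an inner merge on "movie_id" does, the toy frames below (made-up rows, not the MovieLens data) show that a rating whose movie is missing from the movies table is dropped. This is the same effect that removes the ratings for movie_id 1, 260, 296, etc. seen in ratings_raw, since those movies are not in the 2000s movies file.

```python
import pandas as pd

# Toy data (hypothetical rows, for illustration only)
toy_movies = pd.DataFrame({
    "movie_id": [1, 2],
    "title": ["Movie A (2000)", "Movie B (2001)"],
})
toy_ratings = pd.DataFrame({
    "user_id": [10, 10, 11],
    "movie_id": [1, 3, 2],   # movie_id 3 has no entry in toy_movies
    "rating": [4.0, 2.5, 3.0],
})

# An inner join keeps only the ratings whose movie_id exists in both frames
toy_merged = pd.merge(toy_ratings, toy_movies, how="inner", on="movie_id")
print(toy_merged)
```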

In [5]:
#Merge both dataframes using pandas.merge
ratings = pd.merge(ratings_raw, movies, how = 'inner', on = 'movie_id')
display(ratings.head(5))

Unnamed: 0,user_id,movie_id,rating,title,genres
0,4,3624,2.5,Shanghai Noon (2000),Action|Adventure|Comedy|Western
1,152,3624,3.0,Shanghai Noon (2000),Action|Adventure|Comedy|Western
2,171,3624,3.5,Shanghai Noon (2000),Action|Adventure|Comedy|Western
3,276,3624,4.0,Shanghai Noon (2000),Action|Adventure|Comedy|Western
4,494,3624,3.5,Shanghai Noon (2000),Action|Adventure|Comedy|Western


## Find Movies

In [6]:
# Print the movie_id and title of every movie whose title contains the given keyword.
def find_movies(text, movies):
    # Check every movie
    for i in range(len(movies)):
        # Whether the title contains the input text
        if text in movies["title"][i]:
            # If it does, print the movie id and title
            print("movie_id: ", movies["movie_id"][i], ", title: ", movies["title"][i])

find_movies("Friday", movies)

movie_id:  3177 , title:  Next Friday (2000)
movie_id:  5874 , title:  Friday After Next (2002)
movie_id:  6593 , title:  Freaky Friday (2003)
movie_id:  7880 , title:  Friday Night (Vendredi Soir) (2002)
movie_id:  8937 , title:  Friday Night Lights (2004)
movie_id:  66783 , title:  Friday the 13th (2009)
movie_id:  97175 , title:  His Name Was Jason: 30 Years of Friday the 13th (2009)
movie_id:  121113 , title:  Shriek If You Know What I Did Last Friday the Thirteenth (2000)
movie_id:  133699 , title:  Black Friday (2004)
movie_id:  134649 , title:  Bad Hair Friday (2012)
movie_id:  161157 , title:  Friday (Pyatnitsa) (2016)
movie_id:  171951 , title:  Monster High: Friday Night Frights (2013)
movie_id:  192411 , title:  Freaky Friday (2018)
movie_id:  197903 , title:  Seven Days: Friday - Sunday (2015)


In [7]:
# LEAVE AS-IS

# For testing, this should print:
# movie_id:  4993, title: Lord of the Rings: The Fellowship of the Ring, The (2001)
# movie_id:  5952, title: Lord of the Rings: The Two Towers, The (2002)
# movie_id:  7153, title: Lord of the Rings: The Return of the King, The (2003)
find_movies("Lord of the Rings", movies)

movie_id:  4993 , title:  Lord of the Rings: The Fellowship of the Ring, The (2001)
movie_id:  5952 , title:  Lord of the Rings: The Two Towers, The (2002)
movie_id:  7153 , title:  Lord of the Rings: The Return of the King, The (2003)


In [8]:
# LEAVE AS-IS

def get_title(movie_id, movies):
    return movies[movies['movie_id'] == movie_id].title.iloc[0]

In [9]:
# LEAVE AS-IS

# For testing, should print "Lord of the Rings: The Return of the King, The (2003)")
print(get_title(7153, movies))

Lord of the Rings: The Return of the King, The (2003)


## 1.3. Count unique registers

In [10]:
#Count the number of unique users and unique movies in the ratings variable, print also the total number of movies in the movies variable
print("Number of users who have rated a movie: ", len(pd.unique(ratings.user_id)))
print("Number of movies that have been rated: ", len(pd.unique(ratings.movie_id)))
print("Total number of movies: ", len(pd.unique(movies.movie_id)))

Number of users who have rated a movie:  12676
Number of movies that have been rated:  2049
Total number of movies:  33168


# 2. Item-based Collaborative Filtering

## 2.1. Data pre-processing

### Rated Movies generation

In [11]:
#Delete the column genres from the dataset ratings, rated_movies columns --> user_id, movie_id, rating, title
rated_movies = ratings.drop(columns = ['genres'])
display(rated_movies.head(5))

Unnamed: 0,user_id,movie_id,rating,title
0,4,3624,2.5,Shanghai Noon (2000)
1,152,3624,3.0,Shanghai Noon (2000)
2,171,3624,3.5,Shanghai Noon (2000)
3,276,3624,4.0,Shanghai Noon (2000)
4,494,3624,3.5,Shanghai Noon (2000)


### Rating Summary

In [12]:
#Group the dataset by movie_id, keeping the first row of each group
ratings_summary = rated_movies.groupby('movie_id').first()
#Delete the user_id and rating columns
ratings_summary = ratings_summary.drop(columns = ['user_id', 'rating'])
#Compute the mean rating of each movie
ratings_mean = rated_movies.groupby('movie_id')['rating'].mean()
#Count how many ratings each movie has
ratings_count= rated_movies.groupby('movie_id')['rating'].count()
#Add the column ratings_mean with the mean rating of each movie
ratings_summary['ratings_mean'] = ratings_mean
#Add the column ratings_count with the number of ratings of each movie
ratings_summary['ratings_count'] = ratings_count

display(ratings_summary.head(5))

Unnamed: 0_level_0,title,ratings_mean,ratings_count
movie_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
2769,"Yards, The (2000)",3.122549,102
3177,Next Friday (2000),2.824,125
3190,Supernova (2000),2.395683,139
3225,Down to You (2000),2.577273,110
3228,Wirey Spindell (2000),2.5,2


### Top 5 highest rated movies (at least 2500 ratings)

In [13]:
# top 5 highest rated movies with at least 2500 ratings
top_rated = ratings_summary[ratings_count>=2500]
top_rated = top_rated.sort_values(by = 'ratings_mean', ascending = False)
display(top_rated.head(5))

Unnamed: 0_level_0,title,ratings_mean,ratings_count
movie_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
4226,Memento (2000),4.158512,4476
4973,"Amelie (Fabuleux destin d'Amélie Poulain, Le)...",4.097234,3687
4993,"Lord of the Rings: The Fellowship of the Ring,...",4.09253,5944
7153,"Lord of the Rings: The Return of the King, The...",4.08396,5449
5952,"Lord of the Rings: The Two Towers, The (2002)",4.083869,5449


### Difference between 2500 ratings and 3 ratings

In [14]:
# top 5 highest rated movies with at least 3 ratings
top_rated = ratings_summary[ratings_count>=3]
top_rated = top_rated.sort_values(by = 'ratings_mean', ascending = False)
display(top_rated.head(5))

Unnamed: 0_level_0,title,ratings_mean,ratings_count
movie_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
5082,"Rumor of Angels, A (2000)",4.666667,6
27764,2LDK (2003),4.5,3
31954,Beautiful City (Shah-re ziba) (2004),4.4,5
5224,Promises (2001),4.388889,18
6775,Life and Debt (2001),4.333333,3


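The fewer ratings a movie receives, the less reliable its mean rating becomes. With a threshold of 3 ratings, the top five are little-known titles with only 3 to 18 ratings each, whereas with a threshold of 2500 ratings the list contains widely seen films whose means are backed by thousands of users.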
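
The effect can be reproduced on a toy example. The sketch below uses made-up ratings and a hypothetical threshold `MIN_RATINGS`; it only illustrates how a single enthusiastic rating can top an unfiltered ranking.

```python
import pandas as pd

# Toy ratings (hypothetical): movie 1 has four ratings, movie 2 only one
toy = pd.DataFrame({
    "movie_id": [1, 1, 1, 1, 2],
    "rating":   [4.0, 3.5, 4.5, 4.0, 5.0],
})

# Mean and count of ratings per movie
summary = toy.groupby("movie_id")["rating"].agg(
    ratings_mean="mean", ratings_count="count"
)

MIN_RATINGS = 3  # hypothetical threshold, chosen for this toy data
reliable = summary[summary["ratings_count"] >= MIN_RATINGS]

print(summary.sort_values("ratings_mean", ascending=False))
print(reliable.sort_values("ratings_mean", ascending=False))
```

Without the threshold, movie 2 ranks first on the strength of one rating; with it, only movie 1 remains.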

## 2.2. Compute the user-movie matrix

### User Movie Matrix

In [15]:
# Generate a "user_movie" matrix by calling "pivot_table" on "rated_movies"

# Each row is a user_id and each column holds that user's rating for one movie_id.
user_movie = rated_movies.pivot_table(index = 'user_id', columns = 'movie_id', values = 'rating')
# Print the first 5 rows
display(user_movie.head(5))

movie_id,2769,3177,3190,3225,3228,3239,3273,3275,3276,3279,...,33138,33145,33148,33150,33152,33154,33158,33162,33164,33166
user_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
4,,,,,,,,,,,...,,,,,,,,,,
33,,,,,,,,,,,...,,,,,,,,,,
62,,,,,,,,4.5,,,...,,,,,,,,,,3.5
63,,,,,,,,,,,...,,,,,,,,,,
95,,,,,,,,3.5,,,...,,,,,,,,,,


**Why do you think the "user_movie" matrix has so many "NaN" values?**


There are two likely explanations:
*   Firstly, users do not rate every film they watch.
*   Secondly, each user has watched only a small fraction of the catalogue, so most user-movie pairs have no rating at all. A matrix in which most entries are missing is called a sparse matrix.
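
To quantify this, one can compute the fraction of empty cells. The sketch below does so on a toy ratings table (hypothetical values); the same three lines could be applied to `user_movie`, but no figure for the real matrix is claimed here.

```python
import pandas as pd

# Toy ratings (hypothetical): 3 users, 3 movies, 4 ratings
toy_ratings = pd.DataFrame({
    "user_id":  [1, 1, 2, 3],
    "movie_id": [10, 20, 20, 30],
    "rating":   [4.0, 3.0, 5.0, 2.5],
})
toy_user_movie = toy_ratings.pivot_table(
    index="user_id", columns="movie_id", values="rating"
)

# Sparsity = share of user-movie cells that hold no rating
n_cells = toy_user_movie.shape[0] * toy_user_movie.shape[1]
n_filled = int(toy_user_movie.notna().sum().sum())
sparsity = 1 - n_filled / n_cells
print(f"{n_filled} of {n_cells} cells filled, sparsity = {sparsity:.3f}")
```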

## 2.3. Explore some correlations in the user-movie matrix

### First 10 rows of the "ratings3" table

In [16]:
#Display the ratings each user gave to 3 different movies in ratings3

# Locate the movie_id of each movie
id_pivot = movies.loc[movies['title'] == 'Lord of the Rings: The Fellowship of the Ring, The (2001)']['movie_id'].to_list()[0]
id_m1 = movies.loc[movies['title'] == 'Finding Nemo (2003)']['movie_id'].to_list()[0]
id_m2 = movies.loc[movies['title'] == 'Talk to Her (Hable con Ella) (2002)']['movie_id'].to_list()[0]
# For each movie, keep only the users who rated it
s1 = user_movie[id_pivot].dropna()
s2 = user_movie[id_m1].dropna()
s3 = user_movie[id_m2].dropna()
# Combine these three series into a single dataframe and drop all rows containing a NaN
# (the axis is passed by keyword; the positional form dropna(0) is deprecated)
ratings3 = pd.concat([s1,s2,s3], axis = 1).dropna(axis = 0)
# Display the first 10 rows from this table.
display(ratings3.head(10))


Unnamed: 0_level_0,4993,6377,5878
user_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
859,3.0,4.0,5.0
1229,4.0,4.0,4.5
1281,3.0,2.5,3.0
1722,5.0,4.5,4.0
2004,4.5,3.0,3.5
4590,4.0,4.0,2.0
5052,2.0,4.0,4.0
5144,5.0,5.0,5.0
6497,3.5,3.5,3.5
8369,3.0,4.0,4.5


### Correlations between these three movies

In [17]:
#Check the similarity between each pair of these 3 movies
print("Similarity between 'Lord of the Rings: The Fellowship of the Ring, The (2001)' and 'Finding Nemo (2003)': ", ratings3[id_pivot].corr(ratings3[id_m1]))
print("Similarity between 'Lord of the Rings: The Fellowship of the Ring, The (2001)' and 'Talk to Her (Hable con Ella) (2002)': ",ratings3[id_pivot].corr(ratings3[id_m2]))
print("Similarity between 'Finding Nemo (2003)' and 'Talk to Her (Hable con Ella) (2002)': ", ratings3[id_m1].corr(ratings3[id_m2]))

Similarity between 'Lord of the Rings: The Fellowship of the Ring, The (2001)' and 'Finding Nemo (2003)':  0.3840549071566764
Similarity between 'Lord of the Rings: The Fellowship of the Ring, The (2001)' and 'Talk to Her (Hable con Ella) (2002)':  0.16240502267155424
Similarity between 'Finding Nemo (2003)' and 'Talk to Her (Hable con Ella) (2002)':  0.2042645045941218


The Pearson correlations show that "Lord of the Rings: The Fellowship of the Ring" is more similar to "Finding Nemo" (0.384) than to "Talk to Her" (0.162). "Finding Nemo" and "Talk to Her" have a correlation of 0.204.

However, none of these values is high, which indicates that the three movies still have distinct characteristics and audiences.
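
For reference, the value returned by `Series.corr` is the Pearson coefficient, $r = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2 \sum_i (y_i - \bar{y})^2}}$. The sketch below computes it by hand on the ten rows displayed above for movies 4993 and 6377. Since it uses only those ten users, the result differs from the full-sample value of 0.384.

```python
import numpy as np
import pandas as pd

# Ratings of the ten users shown in the ratings3 preview
x = pd.Series([3.0, 4.0, 3.0, 5.0, 4.5, 4.0, 2.0, 5.0, 3.5, 3.0])  # movie 4993
y = pd.Series([4.0, 4.0, 2.5, 4.5, 3.0, 4.0, 4.0, 5.0, 3.5, 4.0])  # movie 6377

# Pearson correlation computed from its definition
dx = x - x.mean()
dy = y - y.mean()
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# The same quantity as computed by pandas
r_pandas = x.corr(y)
print(r_manual, r_pandas)
```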

### Similar to Pivot Series

In [18]:
#Check correlation of each movie with the pivot movie.
similar_to_pivot = user_movie.corrwith(user_movie[id_pivot]).dropna()
display(similar_to_pivot.head(10))


movie_id
2769   -0.127515
3177    0.093221
3190    0.041206
3225    0.126600
3239    0.338378
3273    0.166968
3275    0.182484
3276    0.134264
3285    0.075311
3286    0.242781
dtype: float64

### Correlation with Pivot (more than 500 ratings)

In [19]:
#Add a column with the correlation computed before and display movies with more than 500 ratings
corr_with_pivot = pd.DataFrame(similar_to_pivot, columns = ['corr'])
corr_with_pivot = corr_with_pivot.join(ratings_summary)
corr_with_pivot = corr_with_pivot[corr_with_pivot['ratings_count']>500]
corr_with_pivot.sort_values('corr', ascending = False).head(10)

Unnamed: 0_level_0,corr,title,ratings_mean,ratings_count
movie_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
4993,1.0,"Lord of the Rings: The Fellowship of the Ring,...",4.09253,5944
5952,0.892103,"Lord of the Rings: The Two Towers, The (2002)",4.083869,5449
7153,0.892073,"Lord of the Rings: The Return of the King, The...",4.08396,5449
6539,0.377599,Pirates of the Caribbean: The Curse of the Bla...,3.779241,3950
8368,0.340934,Harry Potter and the Prisoner of Azkaban (2004),3.809971,2397
3578,0.337667,Gladiator (2000),3.95105,4811
3793,0.329686,X-Men (2000),3.556436,3535
4896,0.31918,Harry Potter and the Sorcerer's Stone (a.k.a. ...,3.678509,2843
3624,0.307471,Shanghai Noon (2000),3.297443,1017
31658,0.303898,Howl's Moving Castle (Hauru no ugoku shiro) (2...,4.064417,1141


In my view, the results are sensible: the two sequels come first (correlation of about 0.89), followed by fantasy and adventure films such as "Pirates of the Caribbean", "Harry Potter" and "Gladiator". These films share a target audience and genres with the pivot, so recommending them would be appropriate.

Raising the "ratings_count" threshold to a much larger value would exclude films with fewer viewers and could overlook genuinely similar movies. Setting it too low would instead admit movies with very few ratings, whose correlations are unreliable because they rest on a handful of users. A middle ground is needed to represent film similarity accurately.

## 2.4. Implement the item-based recommendations

### Correlations between columns in user_movie

In [20]:
#Compute correlation of each pair of movies
item_similarity = user_movie.corr()
display(item_similarity.head(10))

movie_id,2769,3177,3190,3225,3228,3239,3273,3275,3276,3279,...,33138,33145,33148,33150,33152,33154,33158,33162,33164,33166
movie_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
2769,1.0,0.115068,0.033721,-0.232268,,-0.5,0.197011,0.199514,0.250873,,...,0.37998,0.87831,,,,0.248126,0.1806095,-0.08557,-0.408248,0.105671
3177,0.115068,1.0,0.30382,0.559533,,,0.331191,0.167918,1.0,,...,0.546119,0.735767,-1.0,,,-0.221382,0.3174747,0.014735,0.661989,0.185654
3190,0.033721,0.30382,1.0,0.636361,,-0.014315,0.146042,0.394293,-0.290397,,...,0.246183,0.632026,,,,0.378181,0.1709261,0.022444,-0.07336,-0.054114
3225,-0.232268,0.559533,0.636361,1.0,,0.578414,0.347716,0.263671,-0.250313,,...,-0.300376,0.318377,,,,0.480173,0.7503063,0.536828,0.753141,0.098748
3228,,,,,1.0,,,,,,...,,,,,,,,,,
3239,-0.5,,-0.014315,0.578414,,1.0,0.180846,1.0,,,...,,,,,,1.0,,1.0,0.636285,0.8882
3273,0.197011,0.331191,0.146042,0.347716,,0.180846,1.0,0.105735,0.154371,,...,0.006774,0.409968,1.0,,,0.088405,0.07516779,0.143492,0.466705,0.084202
3275,0.199514,0.167918,0.394293,0.263671,,1.0,0.105735,1.0,0.485071,,...,-0.011426,0.279624,,,,0.075827,0.2994603,0.187713,0.285584,0.225317
3276,0.250873,1.0,-0.290397,-0.250313,,,0.154371,0.485071,1.0,,...,,0.29277,,,,0.0,-6.885311000000001e-17,-0.45553,0.5,-0.138013
3279,,,,,,,,,,1.0,...,,,,,,,,,,


### Correlations between columns in user_movie (at least 100 ratings)

In [21]:
#Same as before but with min 100 observations
item_similarity_min_ratings = user_movie.corr(min_periods = 100)
display(item_similarity_min_ratings.head(5))

movie_id,2769,3177,3190,3225,3228,3239,3273,3275,3276,3279,...,33138,33145,33148,33150,33152,33154,33158,33162,33164,33166
movie_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
2769,1.0,,,,,,,,,,...,,,,,,,,,,
3177,,1.0,,,,,,,,,...,,,,,,,,,,
3190,,,1.0,,,,,,,,...,,,,,,,,,,
3225,,,,1.0,,,,,,,...,,,,,,,,,,
3228,,,,,,,,,,,...,,,,,,,,,,
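
The effect of `min_periods` is easier to see on a small frame. In the sketch below (toy ratings, hypothetical values), a pair of columns with only two users in common gets a perfect but meaningless correlation by default, and NaN once at least three common ratings are required.

```python
import numpy as np
import pandas as pd

# Toy user-movie matrix (hypothetical): rows are users, columns are movies
toy = pd.DataFrame({
    "a": [5.0, 4.0, 3.0, 1.0, np.nan],
    "b": [4.0, 5.0, 2.0, 1.0, 3.0],
    "c": [1.0, np.nan, np.nan, 5.0, 2.0],
})

# Default: any pair with at least one common observation is evaluated
loose = toy.corr()
# Require at least 3 users who rated both movies
strict = toy.corr(min_periods=3)

print(loose)
print(strict)
```

Columns "a" and "c" share only two users, so their default correlation is determined by two points, and it is replaced by NaN in the stricter matrix.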


### User IDs who liked the three superhero movies and the three drama movies

In [22]:
# movie_id=5349: Spider-Man (2002)
# movie_id=3793: X-Men (2000)
# movie_id=6534: Hulk (2003)
for user_id, row in user_movie.iterrows():
    #Save in user_id_super the id of the first user who rated movies 5349, 3793 and 6534 above 4.5
    if(row[5349] > 4.5 and row[3793] > 4.5 and row[6534] > 4.5):
        user_id_super = user_id
        break
# movie_id=6870: Mystic River (2003)
# movie_id=5995: Pianist, The (2002)
# movie_id=3555: U-571 (2000)
for user_id, row in user_movie.iterrows():
    #Save in user_id_drama the id of the first user who rated movies 6870, 5995 and 3555 above 4.5
    if(row[6870] > 4.5 and row[5995] > 4.5 and row[3555] > 4.5):
        user_id_drama = user_id
        break

In [23]:
# Leave this code as-is

# Gets a list of watched movies for a user_id
def get_watched_movies(user_id, user_movie):
    return list(user_movie.loc[user_id].dropna().sort_values(ascending=False).index)

# Gets the rating a user_id has given to a movie_id
def get_rating(user_id, movie_id, user_movie):
    return user_movie[movie_id][user_id]

# Print watched movies
def print_watched_movies(user_id, user_movie, movies):
    for movie_id in get_watched_movies(user_id, user_movie):
        print("%d %.1f %s " %
          (movie_id, get_rating(user_id, movie_id, user_movie), get_title(movie_id, movies)))


In [24]:
# LEAVE AS-IS (TESTING CODE)

print_watched_movies(user_id_super, user_movie, movies)

5502 5.0 Signs (2002) 
5445 5.0 Minority Report (2002) 
6156 5.0 Shanghai Knights (2003) 
5952 5.0 Lord of the Rings: The Two Towers, The (2002) 
5944 5.0 Star Trek: Nemesis (2002) 
5816 5.0 Harry Potter and the Chamber of Secrets (2002) 
5618 5.0 Spirited Away (Sen to Chihiro no kamikakushi) (2001) 
5524 5.0 Blue Crush (2002) 
5480 5.0 Stuart Little 2 (2002) 
5459 5.0 Men in Black II (a.k.a. MIIB) (a.k.a. MIB 2) (2002) 
5420 5.0 Windtalkers (2002) 
4388 5.0 Scary Movie 2 (2001) 
5389 5.0 Spirit: Stallion of the Cimarron (2002) 
5349 5.0 Spider-Man (2002) 
5218 5.0 Ice Age (2002) 
5064 5.0 The Count of Monte Cristo (2002) 
4993 5.0 Lord of the Rings: The Fellowship of the Ring, The (2001) 
4973 5.0 Amelie (Fabuleux destin d'Amélie Poulain, Le) (2001) 
4896 5.0 Harry Potter and the Sorcerer's Stone (a.k.a. Harry Potter and the Philosopher's Stone) (2001) 
4886 5.0 Monsters, Inc. (2001) 
6186 5.0 Gods and Generals (2003) 
6333 5.0 X2: X-Men United (2003) 
6377 5.0 Finding Nemo (2003) 
6

In [25]:
# LEAVE AS-IS (TESTING CODE)

print_watched_movies(user_id_drama, user_movie, movies)

3967 5.0 Billy Elliot (2000) 
4014 5.0 Chocolat (2000) 
4034 5.0 Traffic (2000) 
5995 5.0 Pianist, The (2002) 
7147 5.0 Big Fish (2003) 
4995 5.0 Beautiful Mind, A (2001) 
3555 5.0 U-571 (2000) 
6870 5.0 Mystic River (2003) 
5991 5.0 Chicago (2002) 
8464 5.0 Super Size Me (2004) 
5669 5.0 Bowling for Columbine (2002) 
8622 5.0 Fahrenheit 9/11 (2004) 
30707 5.0 Million Dollar Baby (2004) 
6953 4.5 21 Grams (2003) 
5015 4.5 Monster's Ball (2001) 
5464 4.5 Road to Perdition (2002) 
3510 4.5 Frequency (2000) 
5989 4.5 Catch Me If You Can (2002) 
4022 4.0 Cast Away (2000) 
5010 4.0 Black Hawk Down (2001) 
5299 4.0 My Big Fat Greek Wedding (2002) 
3897 4.0 Almost Famous (2000) 
3755 4.0 Perfect Storm, The (2000) 
4308 4.0 Moulin Rouge (2001) 
4447 3.5 Legally Blonde (2001) 
4246 3.5 Bridget Jones's Diary (2001) 
4975 3.5 Vanilla Sky (2001) 
4019 3.5 Finding Forrester (2000) 
5377 3.5 About a Boy (2002) 
3948 3.5 Meet the Parents (2000) 
5956 3.0 Gangs of New York (2002) 
6281 3.0 Phone Booth

### Get Movies Relevance

In [26]:
def get_movies_relevance(user_id, user_movie, item_similarity_matrix):

    # Collect one weighted-similarity series per watched movie
    weighted_terms = []

    # Iterate through the movies the user has rated (unrated movies are skipped)
    for watched_movie, rating_given in user_movie.loc[user_id].dropna().items():

        # Obtain the vector containing the similarities of watched_movie
        # with all other movies in item_similarity_matrix
        similarities = item_similarity_matrix[watched_movie]

        # Multiply this vector by the given rating and store it
        weighted_terms.append(rating_given * similarities)

    # Concatenate all terms (Series.append was removed in pandas 2.0)
    movies_relevance = pd.concat(weighted_terms)

    # Compute the sum for each movie (NaN similarities are ignored)
    movies_relevance = movies_relevance.groupby(movies_relevance.index).sum()
    # Clear the index name so it does not clash with the 'movie_id' column below
    movies_relevance.index.name = None

    # Convert to a dataframe
    movies_relevance_df = movies_relevance.to_frame(name='relevance')
    movies_relevance_df['movie_id'] = movies_relevance_df.index

    return movies_relevance_df
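
The relevance of a movie j is the sum, over the movies i the user has rated, of rating(i) × similarity(i, j), with missing similarities ignored. As a hedged aside, the same quantity can be written as a matrix-vector product. The sketch below is an alternative formulation on toy data (hypothetical similarity values and ratings), not the function used in this notebook, and it assumes the similarity matrix is symmetric with the same labels as the rating vector.

```python
import numpy as np
import pandas as pd

movie_ids = [1, 2, 3]

# Toy symmetric similarity matrix; NaN means "not enough common ratings"
toy_similarity = pd.DataFrame(
    [[1.0, 0.5, np.nan],
     [0.5, 1.0, 0.2],
     [np.nan, 0.2, 1.0]],
    index=movie_ids, columns=movie_ids,
)

# Toy ratings of one user; NaN means the movie was not rated
toy_user_ratings = pd.Series([4.0, np.nan, 2.0], index=movie_ids)

# Treat missing similarities and missing ratings as zero contributions,
# then sum rating * similarity over the rated movies in one product
relevance = toy_similarity.fillna(0).dot(toy_user_ratings.fillna(0))
print(relevance)
```

For movie 2, for instance, the score is 0.5 × 4.0 + 0.2 × 2.0 = 2.4, which is what the loop-based version would also accumulate.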

### 5 Most Relevant Movies for Superheroes movies

In [27]:
relevance_hero = get_movies_relevance(user_id_super, user_movie, item_similarity_min_ratings)
movies_recommended_hero = pd.merge(relevance_hero, movies, how = 'inner', on = 'movie_id')
movies_recommended_hero = movies_recommended_hero.sort_values(by = 'relevance', ascending=False)
display(movies_recommended_hero.head(5))

Unnamed: 0,relevance,movie_id,title,genres
1472,189.170085,8644,"I, Robot (2004)",Action|Adventure|Sci-Fi|Thriller
663,181.63812,5459,Men in Black II (a.k.a. MIIB) (a.k.a. MIB 2) (...,Action|Comedy|Sci-Fi
85,176.650945,3753,"Patriot, The (2000)",Action|Drama|War
1414,172.899804,8361,"Day After Tomorrow, The (2004)",Action|Adventure|Drama|Sci-Fi|Thriller
310,172.700877,4310,Pearl Harbor (2001),Action|Drama|Romance|War


### 5 Most Relevant Movies for drama movies

In [28]:
relevance_drama = get_movies_relevance(user_id_drama, user_movie, item_similarity_min_ratings)
movies_recommended_drama = pd.merge(relevance_drama, movies, how = 'inner', on = 'movie_id')
movies_recommended_drama = movies_recommended_drama.sort_values(by = 'relevance', ascending=False)
display(movies_recommended_drama.head(5))

Unnamed: 0,relevance,movie_id,title,genres
1572,65.46137,8958,Ray (2004),Drama
195,63.007635,4019,Finding Forrester (2000),Drama
1055,61.354376,6565,Seabiscuit (2003),Drama
501,61.21305,4995,"Beautiful Mind, A (2001)",Drama|Romance
508,61.209632,5014,I Am Sam (2001),Drama


**Super User Accuracy:**


*   I, Robot (2004) --> Yes
*   Men in Black II (a.k.a. MIIB) (a.k.a. MIB 2) --> Yes
*   Patriot, The (2000) --> Yes
*   Day After Tomorrow, The (2004) --> No
*   Pearl Harbor (2001) --> Yes

I would recommend four of the five films to the superhero user. I judged each film by its genres and synopsis, accepting films associated with powers, action, thriller or sci-fi, especially when the synopsis resonates with superhero themes.

**Drama User Accuracy:**

*   Ray (2004) --> Yes
*   Finding Forrester (2000) --> Yes
*   Seabiscuit (2003) --> Yes
*   Beautiful Mind, A (2001) --> Yes
*   I Am Sam (2001) --> Yes

I would recommend all five films to the drama user. I applied the same criteria as before: I first checked that the genres were drama or romance, and then decided based on the synopsis.

### Get recommended movies

In [29]:
def get_recommended_movies(user_id, user_movie, item_similarity):
    relevant_movies = get_movies_relevance(user_id, user_movie, item_similarity)
    relevant_movies = relevant_movies.set_index('movie_id')
    movie_ids = get_watched_movies(user_id, user_movie)
    relevant_movies = relevant_movies.drop(movie_ids)
    return relevant_movies
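
The filtering step in this function only removes rows. The sketch below, on a toy relevance table with hypothetical scores, illustrates that dropping the watched ids leaves the scores of the remaining movies untouched.

```python
import pandas as pd

# Toy relevance table indexed by movie_id (hypothetical scores)
toy_relevance = pd.DataFrame(
    {"relevance": [9.0, 7.5, 6.0, 4.0]},
    index=pd.Index([1, 2, 3, 4], name="movie_id"),
)

# Ids of the movies this hypothetical user has already watched
watched = [1, 3]

# Drop the watched movies; the other rows keep their original scores
unseen = toy_relevance.drop(watched)
print(unseen.sort_values(by="relevance", ascending=False))
```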

### 10 Most recommended movies for the super users

In [30]:
relevant_hero_movies = get_recommended_movies(user_id_super, user_movie, item_similarity_min_ratings)
relevant_hero_movies = relevant_hero_movies.sort_values(by = 'relevance', ascending=False)
display(relevant_hero_movies.head(10))

Unnamed: 0_level_0,relevance
movie_id,Unnamed: 1_level_1
6365,166.866641
4018,165.338077
4025,163.032765
5507,161.080324
6378,155.293219
31685,154.993274
3948,150.570934
4369,148.949754
6934,148.394158
4963,148.251901


### 10 Most recommended movies for the drama users

In [31]:
relevant_drama_movies = get_recommended_movies(user_id_drama, user_movie, item_similarity_min_ratings)
relevant_drama_movies = relevant_drama_movies.sort_values(by = 'relevance', ascending=False)
display(relevant_drama_movies.head(10))

Unnamed: 0_level_0,relevance
movie_id,Unnamed: 1_level_1
8958,65.46137
6565,61.354376
5014,61.209632
7325,59.820898
7149,59.294621
4448,58.968024
7445,58.192646
5152,58.004447
3753,57.920754
4223,57.482846


**Do you think they are relevant? Why or why not? After removing the movies the user has already watched, are the relevance scores of the remaining items comparable to the previous lists that contained all relevant movies?**

Consistent with my assessment of the previous lists, these recommendations resemble the suggestions I would make to these users, so I consider them relevant. Excluding already-watched movies also improves the user experience, since recommending a film the user has already seen is pointless.

The relevance scores remain directly comparable, because removing the watched movies only filters rows and does not recompute any score. For the drama user, Ray (8958), Seabiscuit (6565) and I Am Sam (5014) keep the scores they had in the previous list (65.46, 61.35 and 61.21). For the superhero user, the top score falls from 189.17 to 166.87, which means that the five films at the top of the previous list were all films this user had already watched.

<font size="+2" color="#003300">I hereby declare that, except for the code provided by the course instructors, all of my code, report, and figures were produced by myself.</font>