# Practice PS06: Recommendations engines (interactions-based)

For this assignment we will build and apply item-based and model-based collaborative filtering recommenders for movies.

Author: <font color="blue">Miguel Rando</font>

E-mail: <font color="blue">miguel.rando01@estudiant.upf.edu</font>

Date: <font color="blue">8/11/24</font>

# 1. The Movies dataset

## 1.1. Load the input files

In [1]:
# Leave this code as-is

import numpy as np
import pandas as pd 
import matplotlib.pyplot as plt
from math import*
from scipy.sparse.linalg import svds
from sklearn.metrics.pairwise import linear_kernel

In [2]:
# Leave this code as-is

FILENAME_MOVIES = "movies-2000s.csv"
FILENAME_RATINGS = "ratings-2000s.csv"
FILENAME_TAGS = "tags-2000s.csv"

In [3]:
# Leave this code as-is

movies = pd.read_csv(FILENAME_MOVIES, 
                    sep=',', 
                    engine='python', 
                    encoding='latin-1',
                    names=['movie_id', 'title', 'genres'])
display(movies.head(5))

ratings_raw = pd.read_csv(FILENAME_RATINGS, 
                    sep=',', 
                    encoding='latin-1',
                    engine='python',
                    names=['user_id', 'movie_id', 'rating'])
display(ratings_raw.head(5))

Unnamed: 0,movie_id,title,genres
0,2769,"Yards, The (2000)",Crime|Drama
1,3177,Next Friday (2000),Comedy
2,3190,Supernova (2000),Adventure|Sci-Fi|Thriller
3,3225,Down to You (2000),Comedy|Romance
4,3228,Wirey Spindell (2000),Comedy


Unnamed: 0,user_id,movie_id,rating
0,4,1,3.0
1,4,260,3.5
2,4,296,4.0
3,4,541,4.5
4,4,589,4.0


## 1.2. Merge the data into a single dataframe

<font size="+1" color="red">Replace this cell with your code from the previous practice that joined these three dataframes using "merge" into a single dataframe named "ratings". Print the first 5 rows of the resulting dataframe, which should contain columns "user_id", "movie_id", "rating", "title", and "genres".</font>

In [4]:
ratings_selected = ratings_raw[['user_id', 'movie_id', 'rating']]
movies_selected = movies[['movie_id', 'title', 'genres']]

# Merge the two dataframes using only the selected columns
ratings = pd.merge(ratings_selected, movies_selected, on='movie_id')


display(ratings.head(5))

Unnamed: 0,user_id,movie_id,rating,title,genres
0,4,3624,2.5,Shanghai Noon (2000),Action|Adventure|Comedy|Western
1,152,3624,3.0,Shanghai Noon (2000),Action|Adventure|Comedy|Western
2,171,3624,3.5,Shanghai Noon (2000),Action|Adventure|Comedy|Western
3,276,3624,4.0,Shanghai Noon (2000),Action|Adventure|Comedy|Western
4,494,3624,3.5,Shanghai Noon (2000),Action|Adventure|Comedy|Western


<font size="+1" color="red">Replace this cell with your code from the previous practice for "find_movies" that list movies containing a keyword</font>

In [5]:
def find_movies(keyword, movies_df):
    # convert keyword to lowercase 
    keyword_lower = keyword.lower()
    
    # filter movies 
    matching_movies = movies_df[movies_df['title'].str.lower().str.contains(keyword_lower, na=False)]
    
    # print each matching movie ID and title
    print(matching_movies[['movie_id', 'title']].to_string(index=False, header=False))


In [6]:
# LEAVE AS-IS

# For testing, this should print 9 movies
find_movies("Spider-Man", movies)

  5349                                Spider-Man (2002)
  8636                              Spider-Man 2 (2004)
 52722                              Spider-Man 3 (2007)
 76709 Spider-Man: The Ultimate Villain Showdown (2002)
 95510                   Amazing Spider-Man, The (2012)
110553                  The Amazing Spider-Man 2 (2014)
122926                Untitled Spider-Man Reboot (2017)
195159         Spider-Man: Into the Spider-Verse (2018)
201773                 Spider-Man: Far from Home (2019)


In [7]:
# LEAVE AS-IS

def get_title(movie_id, movies):
    return movies[movies['movie_id'] == movie_id].title.iloc[0]

In [8]:
# LEAVE AS-IS

# For testing, should print "Spider-Man 2 (2004)"
print(get_title(8636, movies))

Spider-Man 2 (2004)


## 1.3. Count unique registers

<font size="+1" color="red">Replace this cell with your own code to indicate the number of unique users and unique movies in the "ratings" variable.</font>

In [9]:
num_unique_users = len(ratings['user_id'].unique())

num_rated_movies = len(ratings['movie_id'].unique())

total_movies = len(movies['movie_id'].unique())

print(f"Number of users who have rated a movie : {num_unique_users}")
print(f"Number of movies that have been rated  : {num_rated_movies}")
print(f"Total number of movies                 : {total_movies}")

Number of users who have rated a movie : 12676
Number of movies that have been rated  : 2049
Total number of movies                 : 33168


# 2. Item-based Collaborative Filtering

## 2.1. Data pre-processing

<font size="+1" color="red">Replace this cell with your code to generate "rated_movies" and print the first ten rows. This should have columns user_id, movie_id, rating, title</font>

In [10]:
rated_movies = ratings.drop(columns = 'genres')

print(rated_movies.head(5))

   user_id  movie_id  rating                 title
0        4      3624     2.5  Shanghai Noon (2000)
1      152      3624     3.0  Shanghai Noon (2000)
2      171      3624     3.5  Shanghai Noon (2000)
3      276      3624     4.0  Shanghai Noon (2000)
4      494      3624     3.5  Shanghai Noon (2000)


<font size="+1" color="red">Replace this cell with your code to generate "ratings_summary" and print the first 10 rows.</font>

In [11]:
# we only keep one row per film
ratings_summary = rated_movies.groupby('title').first()

# we compute the series for each film
ratings_mean = rated_movies.groupby('title')['rating'].mean()
ratings_count = rated_movies.groupby('title')['rating'].count()

# here we add the series as columns to the dataframe
ratings_summary['ratings_mean'] = ratings_mean
ratings_summary['ratings_count'] = ratings_count


# drop the per-row "rating" and "user_id" columns, which are no longer needed
ratings_summary = ratings_summary.drop(columns = ['rating', 'user_id'])


<font size="+1" color="red">Replace this cell with code to print the top 5 highest rated movies, considering only movies receiving at least 100 ratings.</font>

In [12]:
#  at least 100 ratings
ratings_100 = ratings_summary[ratings_summary.ratings_count >= 100]

# sort by decreasing value of ratings
ratings_sorted = ratings_100.sort_values(by='ratings_mean', ascending=False)

print(ratings_sorted.head(5))

                                                    movie_id  ratings_mean  \
title                                                                        
Spirited Away (Sen to Chihiro no kamikakushi) (...      5618      4.215216   
City of God (Cidade de Deus) (2002)                     6016      4.186592   
Memento (2000)                                          4226      4.158512   
Fog of War: Eleven Lessons from the Life of Rob...      7156      4.112013   
Amelie (Fabuleux destin d'Amélie Poulain, Le) ...       4973      4.097234   

                                                    ratings_count  
title                                                              
Spirited Away (Sen to Chihiro no kamikakushi) (...           2458  
City of God (Cidade de Deus) (2002)                          2133  
Memento (2000)                                               4476  
Fog of War: Eleven Lessons from the Life of Rob...            308  
Amelie (Fabuleux destin d'Amélie Poulain, Le

<font size="+1" color="red">Repeat this, but this time consider movies receiving at least 3 ratings.</font>

In [13]:
#  at least 3 ratings
ratings_3 = ratings_summary[ratings_summary.ratings_count >= 3]

# sort by decreasing value of ratings
ratings_sorted = ratings_3.sort_values(by='ratings_mean', ascending=False)

print(ratings_sorted.head(5))

                                                 movie_id  ratings_mean  \
title                                                                     
Rumor of Angels, A (2000)                            5082      4.666667   
2LDK (2003)                                         27764      4.500000   
Beautiful City (Shah-re ziba) (2004)                31954      4.400000   
Promises (2001)                                      5224      4.388889   
Surplus: Terrorized Into Being Consumers (2003)     31856      4.333333   

                                                 ratings_count  
title                                                           
Rumor of Angels, A (2000)                                    6  
2LDK (2003)                                                  3  
Beautiful City (Shah-re ziba) (2004)                         5  
Promises (2001)                                             18  
Surplus: Terrorized Into Being Consumers (2003)              3  


<font size="+1" color="red">Replace this cell with a brief commentary, in your own words, on what happens when the number of ratings is set to a small value.</font>

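With a low threshold on the number of ratings, the top of the list is taken over by movies whose mean is computed from only a handful of ratings. Each individual rating then carries a lot of weight, so a few enthusiastic users (for example, several ratings of 5) are enough to produce a very high mean. That is why the five movies with the best mean rating here have between 3 and 18 ratings each, and why their means are less reliable than those of movies with hundreds or thousands of ratings.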
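To make this effect concrete, the following minimal sketch uses made-up ratings (not taken from the dataset) to compare how far one additional 5-star rating moves the mean of a movie with 3 ratings and of a movie with 100 ratings. The helper name `mean_shift` and the two example lists are assumptions introduced only for illustration.

```python
import numpy as np

def mean_shift(existing_ratings, new_rating=5.0):
    """Return how much the mean changes when one new rating is added."""
    existing = np.asarray(existing_ratings, dtype=float)
    updated = np.append(existing, new_rating)
    return updated.mean() - existing.mean()

# Hypothetical movies: both currently average 3.0 stars
few_ratings = [3.0] * 3
many_ratings = [3.0] * 100

print(f"Shift with 3 ratings:   {mean_shift(few_ratings):+.3f}")
print(f"Shift with 100 ratings: {mean_shift(many_ratings):+.3f}")
```

With three ratings the single new vote moves the mean by half a star, while with one hundred ratings the change is about two hundredths of a star.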

## 2.2. Compute the user-movie matrix

<font size="+1" color="red">Replace this cell with code to generate a "user_movie" matrix by calling "pivot_table" on "rated_movies". Print the first 5 rows. It might take about one minute to compute, depending on your computer.</font>

In [14]:
user_movie = rated_movies.pivot_table( index = 'user_id', columns = 'movie_id', values = 'rating')

print(user_movie)

movie_id  2769   3177   3190   3225   3228   3239   3273   3275   3276   \
user_id                                                                   
4           NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN   
33          NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN   
62          NaN    NaN    NaN    NaN    NaN    NaN    NaN    4.5    NaN   
63          NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN   
95          NaN    NaN    NaN    NaN    NaN    NaN    NaN    3.5    NaN   
...         ...    ...    ...    ...    ...    ...    ...    ...    ...   
162488      NaN    NaN    NaN    NaN    NaN    NaN    NaN    4.5    NaN   
162513      NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN   
162527      NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN   
162533      NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN   
162536      NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN   

movie_id  3279   ...  33

<font size="+1" color="red">Replace this cell with a brief commentary indicating why you think the "user_movie" matrix has so many "NaN" values. What do we call this characteristic of user ratings in recommender systems?</font>

Because each user rates only a small fraction of the movies, most user-movie pairs have no rating, and most movies are not rated by the large majority of users.
This characteristic is called sparsity.
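Sparsity can be quantified as the fraction of empty cells in the user-movie matrix. The sketch below does this on a small toy table with hypothetical ratings; the names `toy_ratings` and `toy_user_movie` are assumptions for illustration, and the same expressions should apply to the real "user_movie" matrix.

```python
import pandas as pd

# Toy ratings table (hypothetical values, same column names as "rated_movies")
toy_ratings = pd.DataFrame({
    "user_id":  [1, 1, 2, 3, 3, 4],
    "movie_id": [10, 20, 10, 20, 30, 40],
    "rating":   [4.0, 3.5, 5.0, 2.0, 4.5, 3.0],
})

toy_user_movie = toy_ratings.pivot_table(index="user_id", columns="movie_id", values="rating")

n_cells = toy_user_movie.size                       # number of users x number of movies
n_filled = int(toy_user_movie.notna().sum().sum())  # cells that hold an actual rating
sparsity = 1 - n_filled / n_cells

print(f"{n_filled} ratings in {n_cells} cells, sparsity = {sparsity:.2%}")
```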

## 2.3. Explore some correlations in the user-movie matrix

<font size="+1" color="red">Replace this cell with code to compute and display the first 10 rows of the "ratings3" table as described above.</font>

In [15]:
find_movies("Lord of the Rings", movies)
find_movies("Finding Nemo", movies)
find_movies("Talk to Her", movies)

4993 Lord of the Rings: The Fellowship of the Ring, The (2001)
5952             Lord of the Rings: The Two Towers, The (2002)
7153     Lord of the Rings: The Return of the King, The (2003)
6377 Finding Nemo (2003)
5878 Talk to Her (Hable con Ella) (2002)


In [16]:
id_pivot = 4993
id_m1 = 6377
id_m2 = 5878


s1 = user_movie[id_pivot].dropna()
s2 = user_movie[id_m1].dropna()
s3 = user_movie[id_m2].dropna()

ratings3 = pd.concat([s1, s2, s3], axis=1)

ratings3 = ratings3.dropna()

print(ratings3.head(10))

         4993  6377  5878
user_id                  
859       3.0   4.0   5.0
1229      4.0   4.0   4.5
1281      3.0   2.5   3.0
1722      5.0   4.5   4.0
2004      4.5   3.0   3.5
4590      4.0   4.0   2.0
5052      2.0   4.0   4.0
5144      5.0   5.0   5.0
6497      3.5   3.5   3.5
8369      3.0   4.0   4.5


<font size="+1" color="red">Replace this cell with code to compute all correlations between these three movies, as described above.</font>

In [17]:
similarity_1 = ratings3[id_pivot].corr(ratings3[id_m1])  # Lord of the Rings and Finding Nemo
similarity_2 = ratings3[id_pivot].corr(ratings3[id_m2])  # Lord of the Rings and Talk to Her
similarity_3 = ratings3[id_m1].corr(ratings3[id_m2])  # Finding Nemo and Talk to Her


# print the similarities
print(f"Similarity between 'Lord of the Rings: The Fellowship of the Ring, The (2001)' and 'Finding Nemo (2003)': {similarity_1:.2f}")
print(f"Similarity between 'Lord of the Rings: The Fellowship of the Ring, The (2001)' and 'Talk to Her (Hable con Ella) (2002)': {similarity_2:.2f}")
print(f"Similarity between 'Finding Nemo (2003)' and 'Talk to Her (Hable con Ella) (2002)': {similarity_3:.2f}")

Similarity between 'Lord of the Rings: The Fellowship of the Ring, The (2001)' and 'Finding Nemo (2003)': 0.38
Similarity between 'Lord of the Rings: The Fellowship of the Ring, The (2001)' and 'Talk to Her (Hable con Ella) (2002)': 0.16
Similarity between 'Finding Nemo (2003)' and 'Talk to Her (Hable con Ella) (2002)': 0.20


<font size="+1" color="red">Replace this cell with a brief commentary on the correlations you find.</font>

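The highest correlation (0.38) is between The Lord of the Rings: The Fellowship of the Ring and Finding Nemo. This is plausible because both belong to the same broad genre (fiction), so people who like one of them tend to like the other. Talk to Her correlates only weakly with both of them (0.16 and 0.20).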

<font size="+1" color="red">Replace this cell with code to create a "similarity_to_pivot" series that contains the computed correlations, dropping the NaNs in the series.</font>

In [18]:
df_pivot = pd.DataFrame(user_movie[id_pivot].dropna()).rename(columns={id_pivot: "rating"})

# create an empty list to store the correlation results
correlations = []

for movie_id in user_movie.columns:
    if movie_id != id_pivot:  # skip the pivot movie itself
        # extract the ratings 
        df_other_movie = pd.DataFrame(user_movie[movie_id].dropna()).rename(columns={movie_id: "rating"})
        
        # Compute the correlation between the pivot movie and the current movie
        corr = df_pivot.corrwith(df_other_movie)["rating"]  # corrwith returns a Series indexed by column name, so select the "rating" entry
        
        # Store the result in the list 
        correlations.append((movie_id, corr))

        
similarity_to_pivot = pd.DataFrame(correlations, columns=["movie_id", "corr_with_pivot"])

print(similarity_to_pivot.head())


   movie_id  corr_with_pivot
0      2769        -0.127515
1      3177         0.093221
2      3190         0.041206
3      3225         0.126600
4      3228              NaN
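The loop above can probably be replaced by a single `corrwith` call, since pandas accepts a Series as the other operand and correlates every column with it using only the users who rated both movies. The sketch below shows this on a toy matrix with hypothetical ratings; `toy_user_movie` and `toy_pivot_id` are illustrative names, and the result is a Series with the NaNs dropped, which is the shape the prompt asks for.

```python
import numpy as np
import pandas as pd

# Toy user-movie matrix (hypothetical ratings); NaN means "not rated"
toy_user_movie = pd.DataFrame(
    {
        101: [5.0, 4.0, 1.0, 2.0, np.nan],
        102: [4.5, 4.0, 1.5, 2.5, 3.0],
        103: [1.0, 2.0, 5.0, 4.0, np.nan],
    },
    index=pd.Index([1, 2, 3, 4, 5], name="user_id"),
)
toy_pivot_id = 101

# Correlate every column with the pivot column, using only users who rated both
similarity = toy_user_movie.corrwith(toy_user_movie[toy_pivot_id])

# Remove the pivot itself and any movie without a defined correlation
similarity = similarity.drop(toy_pivot_id).dropna()
print(similarity.sort_values(ascending=False))
```

On the real data the equivalent call would presumably be `user_movie.corrwith(user_movie[id_pivot])`, followed by the same `drop` and `dropna`.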


<font size="+1" color="red">Replace this cell with code to create a "corr_with_pivot" dataframe as specified above, and to print the 20 movies (rated 500 times or more) with the highest correlation with the selected movie.</font>

In [19]:
# Reset the index so that "title" becomes a regular column
ratings_summary_reset = ratings_summary.reset_index()


# merge similarity_to_pivot with ratings_summary on movie_id
corr_with_pivot = ratings_summary_reset.merge(similarity_to_pivot, on="movie_id", how='left')

# keep only movies rated 500 times or more
corr_with_pivot = corr_with_pivot[corr_with_pivot['ratings_count'] >= 500]

# sort the dataframe in descending order
corr_with_pivot = corr_with_pivot.sort_values(by="corr_with_pivot", ascending=False)

print(corr_with_pivot.head(20))

                                                  title  movie_id  \
1113      Lord of the Rings: The Two Towers, The (2002)      5952   
1112  Lord of the Rings: The Return of the King, The...      7153   
1353                                  Open Range (2003)      6617   
536                           Dogtown and Z-Boyz (2001)      5325   
1418  Pirates of the Caribbean: The Curse of the Bla...      6539   
133                       Assault on Precinct 13 (2005)     31420   
1377                               Pacifier, The (2005)     32017   
721   Ghost in the Shell 2: Innocence (a.k.a. Innoce...     27728   
149                                  Bad Company (2002)      5414   
801     Harry Potter and the Prisoner of Azkaban (2004)      8368   
740                                    Gladiator (2000)      3578   
518                                     Dinosaur (2000)      3615   
582                               Ella Enchanted (2004)      7380   
2030                              

<font size="+1" color="red">Replace this cell with a brief commentary about the movies you see on this list. What happens if you set the condition on *ratings_count* to a much larger value? What happens if you set it to a much smaller value?</font>

Unsurprisingly, the movies with the highest correlation are the other two Lord of the Rings films. Most of the other movies in the list are also fiction and adventure films. If we raise the threshold on *ratings_count*, only very popular movies remain; if we lower it, lesser-known movies enter the list, and their correlations rest on few common ratings, so they are less reliable.

## 2.4. Implement the item-based recommendations

<font size="+1" color="red">Replace this cell with your code to compute all correlations between columns (movies) in the matrix user_movie. Store this in "item_similarity", and print the first 10 rows.</font>

<font size="+1" color="red">Replace this cell with your code to compute all correlations between columns (movies) in the matrix user_movie, but considering only movies having at least 100 ratings in common. Store this in "item_similarity_min_ratings"</font>
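One possible way to approach the two cells above is `DataFrame.corr`, which computes pairwise Pearson correlations between columns and takes a `min_periods` argument for the minimum number of common observations. The sketch below is a minimal illustration on a toy matrix with hypothetical ratings, using a threshold of 3 instead of 100 so that the effect is visible; the `toy_` names are assumptions and this is not the submitted solution.

```python
import numpy as np
import pandas as pd

# Toy user-movie matrix (hypothetical ratings); movie 103 has only two raters
toy_user_movie = pd.DataFrame(
    {
        101: [5.0, 4.0, 1.0, 2.0, 3.0],
        102: [4.5, 4.0, 1.5, 2.5, 3.0],
        103: [1.0, 2.0, np.nan, np.nan, np.nan],
    },
    index=pd.Index([1, 2, 3, 4, 5], name="user_id"),
)

# All pairwise Pearson correlations between columns (movies)
toy_item_similarity = toy_user_movie.corr()

# Same, but pairs with fewer than 3 common raters are set to NaN
# (the assignment asks for 100; 3 is used only to fit the toy data)
toy_item_similarity_min = toy_user_movie.corr(min_periods=3)

print(toy_item_similarity.round(2))
print(toy_item_similarity_min.round(2))
```

Without the threshold, movies 101 and 103 get a correlation of -1 from just two common raters; with the threshold that entry becomes NaN, which illustrates why a minimum number of common ratings makes the similarities more trustworthy.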

<font size="+1" color="red">Replace this cell with your code to find the userids of two example users: user_id_super (the one who liked the three superhero movies), and user_id_drama (the one who liked the three dramas)</font>

In [20]:
# Leave this code as-is

# Gets a list of watched movies for a user_id
def get_watched_movies(user_id, user_movie):
    return list(user_movie.loc[user_id].dropna().sort_values(ascending=False).index)
    
# Gets the rating a user_id has given to a movie_id
def get_rating(user_id, movie_id, user_movie):
    return user_movie[movie_id][user_id]

# Print watched movies
def print_watched_movies(user_id, user_movie, movies):
    for movie_id in get_watched_movies(user_id, user_movie):
        print("%d %.1f %s " %
          (movie_id, get_rating(user_id, movie_id, user_movie), get_title(movie_id, movies)))


In [21]:
# LEAVE AS-IS (TESTING CODE)

print_watched_movies(user_id_super, user_movie, movies)

NameError: name 'user_id_super' is not defined

In [None]:
# LEAVE AS-IS (TESTING CODE)

print_watched_movies(user_id_drama, user_movie, movies)

<font size="+1" color="red">Replace this cell with your code for "get_movies_relevance"</font>
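The specification of "get_movies_relevance" is not reproduced in this notebook, so the following is only a sketch under a stated assumption: the relevance of a movie for a user is taken to be the sum, over the movies the user has rated, of the similarity to that movie multiplied by the rating given. The function name `get_movies_relevance_sketch`, its signature, and the toy data are all hypothetical.

```python
import numpy as np
import pandas as pd

def get_movies_relevance_sketch(user_id, user_movie, item_similarity):
    """Assumed scoring rule: sum of (similarity x rating) over the user's rated movies."""
    user_ratings = user_movie.loc[user_id].dropna()
    # Keep one similarity column per movie the user has rated
    sims = item_similarity[user_ratings.index]
    # Weight each column by the user's rating, then add up each row (NaNs are skipped)
    relevance = sims.mul(user_ratings, axis=1).sum(axis=1)
    return relevance.sort_values(ascending=False)

# Toy data (hypothetical): user 1 rated movies 101 and 103, but not 102
toy_user_movie = pd.DataFrame(
    {101: [5.0], 102: [np.nan], 103: [1.0]},
    index=pd.Index([1], name="user_id"),
)
toy_item_similarity = pd.DataFrame(
    [[1.0, 0.9, -0.5],
     [0.9, 1.0, -0.4],
     [-0.5, -0.4, 1.0]],
    index=[101, 102, 103],
    columns=[101, 102, 103],
)

toy_relevance = get_movies_relevance_sketch(1, toy_user_movie, toy_item_similarity)
print(toy_relevance)
```

Note that under this rule the movies the user has already rated also receive a score, so they can appear at the top of the list.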

<font size="+1" color="red">Replace this cell with your code to obtain the 5 most relevant movies for the users user_id_super (who likes superhero movies) and user_id_drama (who likes dramas)</font>

<font size="+1" color="red">Replace this cell with a brief commentary on the movies you see on these lists. How many of them look relevant for the intended users? Feel free to use IMDB or Wikipedia to get info on these movies.</font>

<font size="-1" color="gray">All those trivial facts you learned about 1980s and 1990s pop culture were supposed to be useful one day; that day has arrived :-)</font>

<font size="+1" color="red">Replace this cell with your code implementing "get_recommended_movies"</font>
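As with the relevance function, the exact specification of "get_recommended_movies" is not shown here. A minimal sketch, assuming that recommendations are simply the most relevant movies after removing those the user has already rated, could look as follows; the function name `get_recommended_movies_sketch`, the `top_n` parameter, and the toy relevance scores are hypothetical.

```python
import numpy as np
import pandas as pd

def get_recommended_movies_sketch(user_id, user_movie, relevance, top_n=10):
    """Drop movies the user already rated and return the top_n remaining by relevance."""
    watched = user_movie.loc[user_id].dropna().index
    unseen = relevance.drop(index=watched, errors="ignore")
    return unseen.sort_values(ascending=False).head(top_n)

# Toy data (hypothetical): user 1 rated movies 101 and 103
toy_user_movie = pd.DataFrame(
    {101: [5.0], 102: [np.nan], 103: [1.0], 104: [np.nan]},
    index=pd.Index([1], name="user_id"),
)
# Hypothetical relevance scores, indexed by movie_id
toy_relevance = pd.Series({101: 4.5, 102: 4.1, 103: -1.5, 104: 0.3})

toy_recommendations = get_recommended_movies_sketch(1, toy_user_movie, toy_relevance)
print(toy_recommendations)
```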

<font size="+1" color="red">Replace this cell with your code to obtain the 10 most recommended movies for the users user_id_super and user_id_drama</font>

<font size="+1" color="red">Replace this cell with a brief commentary on these recommendations. Do you think they are relevant? Why or why not? After removing the movies the user has already watched, are the relevance scores of the remaining items comparable to the previous lists that contained all relevant movies?</font>

<font size="+2" color="#003300">I hereby declare that, except for the code provided by the course instructors, all of my code, report, and figures were produced by myself.</font>