# Practice PS06: Recommendations engines (interactions-based)

For this assignment we will build and apply item-based and model-based collaborative filtering recommenders for movies.

Author: <font color="blue">Nil Tomàs Plans</font>

E-mail: <font color="blue">nil.tomas01@estudiant.upf.edu</font>

Date: <font color="blue">10/11/2024</font>

# 1. The Movies dataset

## 1.1. Load the input files

In [1]:
# Leave this code as-is

import numpy as np
import pandas as pd 
import matplotlib.pyplot as plt
from math import*
from scipy.sparse.linalg import svds
from sklearn.metrics.pairwise import linear_kernel

In [2]:
# Leave this code as-is

FILENAME_MOVIES = "movies-2000s.csv"
FILENAME_RATINGS = "ratings-2000s.csv"
FILENAME_TAGS = "tags-2000s.csv"

In [3]:
# Leave this code as-is

movies = pd.read_csv(FILENAME_MOVIES, 
                    sep=',', 
                    engine='python', 
                    encoding='latin-1',
                    names=['movie_id', 'title', 'genres'])
display(movies.head(5))

ratings_raw = pd.read_csv(FILENAME_RATINGS, 
                    sep=',', 
                    encoding='latin-1',
                    engine='python',
                    names=['user_id', 'movie_id', 'rating'])
display(ratings_raw.head(5))

Unnamed: 0,movie_id,title,genres
0,2769,"Yards, The (2000)",Crime|Drama
1,3177,Next Friday (2000),Comedy
2,3190,Supernova (2000),Adventure|Sci-Fi|Thriller
3,3225,Down to You (2000),Comedy|Romance
4,3228,Wirey Spindell (2000),Comedy


Unnamed: 0,user_id,movie_id,rating
0,4,1,3.0
1,4,260,3.5
2,4,296,4.0
3,4,541,4.5
4,4,589,4.0


## 1.2. Merge the data into a single dataframe

<font size="+1" color="red">Replace this cell with your code from the previous practice that joined these three dataframes using "merge" into a single dataframe named "ratings". Print the first 5 rows of the resulting dataframe, which should contain columns "user_id", "movie_id", "rating", "title", and "genres".</font>

In [4]:
ratings=pd.merge(movies, ratings_raw, on="movie_id")
display(ratings.head())

Unnamed: 0,movie_id,title,genres,user_id,rating
0,2769,"Yards, The (2000)",Crime|Drama,1115,4.0
1,2769,"Yards, The (2000)",Crime|Drama,1209,2.0
2,2769,"Yards, The (2000)",Crime|Drama,2004,3.0
3,2769,"Yards, The (2000)",Crime|Drama,2502,4.0
4,2769,"Yards, The (2000)",Crime|Drama,2827,4.0


<font size="+1" color="red">Replace this cell with your code from the previous practice for "find_movies" that list movies containing a keyword</font>

In [5]:
# Code from the previous practice: "find_movies" lists the movies whose title contains a keyword
def find_movies(word, movies):
    # Turn each row into a dictionary to access its columns more easily
    for movie in movies.to_dict('records'):
        # Print the movie if the keyword appears in its title
        if word in movie['title']:
            print("movie_id: ",movie['movie_id']," title: ",movie['title'])

  

In [6]:
# LEAVE AS-IS

# For testing, this should print 9 movies
find_movies("Spider-Man", movies)

movie_id:  5349  title:  Spider-Man (2002)
movie_id:  8636  title:  Spider-Man 2 (2004)
movie_id:  52722  title:  Spider-Man 3 (2007)
movie_id:  76709  title:  Spider-Man: The Ultimate Villain Showdown (2002)
movie_id:  95510  title:  Amazing Spider-Man, The (2012)
movie_id:  110553  title:  The Amazing Spider-Man 2 (2014)
movie_id:  122926  title:  Untitled Spider-Man Reboot (2017)
movie_id:  195159  title:  Spider-Man: Into the Spider-Verse (2018)
movie_id:  201773  title:  Spider-Man: Far from Home (2019)


In [7]:
# LEAVE AS-IS

def get_title(movie_id, movies):
    return movies[movies['movie_id'] == movie_id].title.iloc[0]

In [8]:
# LEAVE AS-IS

# For testing, should print "Spider-Man 2 (2004)"
print(get_title(8636, movies))

Spider-Man 2 (2004)


## 1.3. Count unique registers

<font size="+1" color="red">Replace this cell with your own code to indicate the number of unique users and unique movies in the "ratings" variable.</font>

In [9]:
# Count the unique users and unique movies in the "ratings" variable
users=ratings['user_id'].unique()  # unique() returns the distinct values as an array
mov=ratings['movie_id'].unique()
cont_movies=movies['movie_id'].unique()
# The length of each array is the number of unique values
print("Number of users who have rated a movie: ",len(users))
print("Number of movies that have been rated: ",len(mov))
print("Total number of movies: ", len(cont_movies))


Number of users who have rated a movie:  12676
Number of movies that have been rated:  2049
Total number of movies:  33168


# 2. Item-based Collaborative Filtering

## 2.1. Data pre-processing

<font size="+1" color="red">Replace this cell with your code to generate "rated_movies" and print the first ten rows. This should have columns user_id, movie_id, rating, title</font>

In [10]:
#code to generate "rated_movies" and print the first ten rows. This should have columns user_id, movie_id, rating, title

rated_movies=ratings.drop('genres', axis='columns')
display(rated_movies.head(10))

Unnamed: 0,movie_id,title,user_id,rating
0,2769,"Yards, The (2000)",1115,4.0
1,2769,"Yards, The (2000)",1209,2.0
2,2769,"Yards, The (2000)",2004,3.0
3,2769,"Yards, The (2000)",2502,4.0
4,2769,"Yards, The (2000)",2827,4.0
5,2769,"Yards, The (2000)",6629,1.0
6,2769,"Yards, The (2000)",12435,4.0
7,2769,"Yards, The (2000)",13873,3.0
8,2769,"Yards, The (2000)",14799,3.0
9,2769,"Yards, The (2000)",15691,2.5


<font size="+1" color="red">Replace this cell with your code to generate "ratings_summary" and print the first 10 rows.</font>

In [11]:
#code to generate "ratings_summary" and print the first 10 rows.

#initialize the ratings_summary
ratings_summary=rated_movies.drop(['user_id', 'rating'], axis='columns').drop_duplicates()

#Compute two series: ratings_mean and ratings_count
ratings_mean = rated_movies.groupby('movie_id')['rating'].mean()
ratings_count = rated_movies.groupby('movie_id')['user_id'].count()

# Use .map(series) to add each series to the dataframe, matching on movie_id
ratings_summary['ratings_mean'] = ratings_summary['movie_id'].map(ratings_mean)
ratings_summary['ratings_count'] = ratings_summary['movie_id'].map(ratings_count)


display(ratings_summary.head(10))


Unnamed: 0,movie_id,title,ratings_mean,ratings_count
0,2769,"Yards, The (2000)",3.122549,102
102,3177,Next Friday (2000),2.824,125
227,3190,Supernova (2000),2.395683,139
366,3225,Down to You (2000),2.577273,110
476,3228,Wirey Spindell (2000),2.5,2
478,3239,Isn't She Great? (2000),1.947368,19
497,3273,Scream 3 (2000),2.444664,759
1256,3275,"Boondock Saints, The (2000)",3.870682,1071
2327,3276,Gun Shy (2000),3.33871,31
2358,3279,Knockout (2000),2.0,2
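
The same summary can also be built in a single step with named aggregation. The following is only a sketch on made-up ratings: the `toy_rated` dataframe and its values are hypothetical, and only the column names follow `rated_movies`.

```python
import pandas as pd

# Toy ratings (hypothetical values) with the same columns as rated_movies
toy_rated = pd.DataFrame({
    "movie_id": [1, 1, 1, 2, 2],
    "title": ["A (2000)"] * 3 + ["B (2001)"] * 2,
    "user_id": [10, 11, 12, 10, 13],
    "rating": [4.0, 3.0, 5.0, 2.0, 3.0],
})

# Named aggregation: one output column per (input column, function) pair
toy_summary = (toy_rated
               .groupby(["movie_id", "title"], as_index=False)
               .agg(ratings_mean=("rating", "mean"),
                    ratings_count=("rating", "count")))
print(toy_summary)
```

Grouping by both `movie_id` and `title` keeps the title in the result without a separate `drop_duplicates` step.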


<font size="+1" color="red">Replace this cell with code to print the top 5 highest rated movies, considering only movies receiving at least 100 ratings.</font>

In [12]:
new_ratings=ratings_summary[ratings_summary['ratings_count']>=100]  # keep movies with at least 100 ratings
sorted_ratings=new_ratings.sort_values(by='ratings_mean',ascending=False)  # sort by mean rating, highest first
display(sorted_ratings.head())  # show the top 5

Unnamed: 0,movie_id,title,ratings_mean,ratings_count
242492,5618,Spirited Away (Sen to Chihiro no kamikakushi) ...,4.215216,2458
279800,6016,City of God (Cidade de Deus) (2002),4.186592,2133
103127,4226,Memento (2000),4.158512,4476
361785,7156,Fog of War: Eleven Lessons from the Life of Ro...,4.112013,308
172739,4973,"Amelie (Fabuleux destin d'Amélie Poulain, Le)...",4.097234,3687


<font size="+1" color="red">Repeat this, but this time consider movies receiving at least 3 ratings.</font>

In [13]:
new_ratings=ratings_summary[ratings_summary['ratings_count']>=3]  # keep movies with at least 3 ratings
sorted_ratings=new_ratings.sort_values(by='ratings_mean',ascending=False)  # sort by mean rating, highest first
display(sorted_ratings.head())  # show the top 5

Unnamed: 0,movie_id,title,ratings_mean,ratings_count
197114,5082,"Rumor of Angels, A (2000)",4.666667,6
444858,27764,2LDK (2003),4.5,3
464942,31954,Beautiful City (Shah-re ziba) (2004),4.4,5
203989,5224,Promises (2001),4.388889,18
331467,6775,Life and Debt (2001),4.333333,3


<font size="+1" color="red">Replace this cell with a brief commentary, in your own words, on what happens when the number of ratings is set to a small value.</font>

<font size="+1" color="blue">Comparing the two dataframes, we can see that when the minimum number of ratings is very small, the top 5 by ratings_mean is dominated by movies that have only a handful of ratings (between 3 and 18) and means between about 4.3 and 4.7. This ranking does not reflect reality well: a mean computed from so few ratings is unreliable, since a few enthusiastic users are enough to push a movie to the top. A movie with more than 2,000 ratings and a slightly lower mean, such as those in the previous list, has received many more high ratings in absolute terms.</font>
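
One common way to temper this effect, shown here only as a sketch on made-up numbers, is a damped mean that pulls movies with few ratings toward the global average. The function name `damped_mean`, the prior weight `k`, and the toy values are illustrative assumptions and not part of the assignment.

```python
import pandas as pd

def damped_mean(summary, global_mean, k=50):
    """Shrink each movie's mean toward global_mean; k acts like k 'virtual' ratings."""
    n = summary["ratings_count"]
    return (n * summary["ratings_mean"] + k * global_mean) / (n + k)

# Toy data (hypothetical values, not taken from the dataset)
toy = pd.DataFrame({
    "title": ["Obscure film", "Popular film"],
    "ratings_mean": [4.8, 4.2],
    "ratings_count": [3, 2000],
})
toy["damped"] = damped_mean(toy, global_mean=3.3, k=50)
print(toy.sort_values("damped", ascending=False))
```

With these numbers the obscure film drops below the popular one, because its three ratings carry little weight against the 50 virtual ratings at the global mean.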

## 2.2. Compute the user-movie matrix

<font size="+1" color="red">Replace this cell with code to generate a "user_movie" matrix by calling "pivot_table" on "rated_movies". Print the first 5 rows. It might take about one minute to compute, depending on your computer.</font>

In [14]:

user_movie=rated_movies.pivot_table(index='user_id',columns='movie_id',values='rating')
print(user_movie.head())

movie_id  2769   3177   3190   3225   3228   3239   3273   3275   3276   \
user_id                                                                   
4           NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN   
33          NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN   
62          NaN    NaN    NaN    NaN    NaN    NaN    NaN    4.5    NaN   
63          NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN   
95          NaN    NaN    NaN    NaN    NaN    NaN    NaN    3.5    NaN   

movie_id  3279   ...  33138  33145  33148  33150  33152  33154  33158  33162  \
user_id          ...                                                           
4           NaN  ...    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN   
33          NaN  ...    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN   
62          NaN  ...    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN   
63          NaN  ...    NaN    NaN    NaN    NaN    NaN    NaN    NaN    N

<font size="+1" color="red">Replace this cell with a brief commentary indicating why you think the "user_movie" matrix has so many "NaN" values. What do we call this characteristic of user ratings in recommender systems?</font>

<font size="+1" color="blue">The matrix has so many NaN values because each user has rated only a small subset of all the movies, so most user-movie pairs have no rating. This characteristic is called sparsity: we say that the user-movie matrix is sparse.</font>
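
Sparsity can be quantified as the fraction of empty cells. The sketch below uses a small hypothetical matrix; the same two lines would apply to `user_movie`, which has 12676 rows and 2049 columns according to the counts above.

```python
import numpy as np
import pandas as pd

# Toy user-movie matrix (hypothetical ratings): rows are users, columns are movies
toy_user_movie = pd.DataFrame(
    [[4.0, np.nan, np.nan, 3.5],
     [np.nan, np.nan, 5.0, np.nan],
     [2.0, np.nan, np.nan, np.nan]],
    index=[1, 2, 3], columns=[10, 20, 30, 40])

n_cells = toy_user_movie.size                  # users x movies
n_rated = toy_user_movie.notna().sum().sum()   # cells that hold a rating
sparsity = 1 - n_rated / n_cells
print(f"{n_rated} of {n_cells} cells are filled; sparsity = {sparsity:.2%}")
```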

## 2.3. Explore some correlations in the user-movie matrix

<font size="+1" color="red">Replace this cell with code to compute and display the first 10 rows of the "ratings3" table as described above.</font>

In [15]:
titles=['Lord of the Rings: The Fellowship of the Ring, The (2001)','Finding Nemo (2003)','Talk to Her (Hable con Ella) (2002)']
def get_movies_id(title, movies):
    # Turn each row into a dictionary to access its columns more easily
    for movie in movies.to_dict('records'):
        # Return the id of the first movie whose title matches exactly
        if title == movie['title']:
            return movie['movie_id']


id_pivot=get_movies_id(titles[0],movies)
id_m1=get_movies_id(titles[1],movies)
id_m2=get_movies_id(titles[2],movies)


r1=user_movie[id_pivot].dropna()  # ratings of the pivot movie (titles[0])
r2=user_movie[id_m1].dropna()  # ratings of titles[1]
r3=user_movie[id_m2].dropna()  # ratings of titles[2]


ratings3 = pd.concat([r1, r2, r3], axis=1)
ratings3=ratings3.dropna()

print(ratings3.head(10))

         4993  6377  5878
user_id                  
859       3.0   4.0   5.0
1229      4.0   4.0   4.5
1281      3.0   2.5   3.0
1722      5.0   4.5   4.0
2004      4.5   3.0   3.5
4590      4.0   4.0   2.0
5052      2.0   4.0   4.0
5144      5.0   5.0   5.0
6497      3.5   3.5   3.5
8369      3.0   4.0   4.5


<font size="+1" color="red">Replace this cell with code to compute all correlations between these three movies, as described above.</font>

In [16]:
# Sample code snippet for calculating correlations between pairs of movie ratings
pairs = [[1, 2], [1, 3], [2, 3]]
for i, j in pairs:
    similarity=ratings3.iloc[:, i-1].corr(ratings3.iloc[:, j-1])
    print(f'Similarity between {titles[i-1]} and {titles[j-1]} : {similarity:.2f} ')


Similarity between Lord of the Rings: The Fellowship of the Ring, The (2001) and Finding Nemo (2003) : 0.38 
Similarity between Lord of the Rings: The Fellowship of the Ring, The (2001) and Talk to Her (Hable con Ella) (2002) : 0.16 
Similarity between Finding Nemo (2003) and Talk to Her (Hable con Ella) (2002) : 0.20 


<font size="+1" color="red">Replace this cell with a brief commentary on the correlations you find.</font>

<font size="+1" color="blue">All three correlations are positive, but weak to moderate. The highest one (0.38) is between The Lord of the Rings and Finding Nemo. This could mean that users who like The Lord of the Rings, an adventure and action movie, also tend to like Finding Nemo, another adventure movie aimed at a younger audience. Talk to Her is only weakly correlated with both (0.16 and 0.20).</font>
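
To make explicit what `Series.corr` computes, the following sketch implements the Pearson correlation by hand and compares it with pandas. It uses only the first five rows of `ratings3` shown above (columns 4993 and 6377), so the value it prints differs from the 0.38 obtained on the full table; the helper name `pearson` is an assumption made for this example.

```python
import numpy as np
import pandas as pd

def pearson(x, y):
    """Pearson correlation of two equally long numeric sequences."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx, dy = x - x.mean(), y - y.mean()
    return (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# First five rows of ratings3 (users 859, 1229, 1281, 1722, 2004)
lotr = pd.Series([3.0, 4.0, 3.0, 5.0, 4.5])
nemo = pd.Series([4.0, 4.0, 2.5, 4.5, 3.0])

manual = pearson(lotr, nemo)
print(round(manual, 4), round(lotr.corr(nemo), 4))
```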

<font size="+1" color="red">Replace this cell with code to create a "similarity_to_pivot" series that contains the computed correlations, dropping the NaNs in the series.</font>

In [17]:
# Extract the ratings of the pivot movie
df_pivot = pd.DataFrame(user_movie[id_pivot].dropna()).rename(columns={id_pivot: "rating"})

# Compute the correlation of the pivot movie with every other movie
similarity_to_pivot = []
for movie_id in user_movie.columns:
    if movie_id != id_pivot:
        df = pd.DataFrame(user_movie[movie_id].dropna()).rename(columns={movie_id: "rating"})
        # corrwith returns a Series indexed by column name, so select by label
        corr = df_pivot.corrwith(df)["rating"]
        similarity_to_pivot.append([movie_id, corr])


# Create the dataframe with both columns and drop the NaN values
similarity_to_pivot = pd.DataFrame(similarity_to_pivot, columns=["movie_id", "corr_with_pivot"])
similarity_to_pivot = similarity_to_pivot.dropna()


print(similarity_to_pivot.head())

   movie_id  corr_with_pivot
0      2769        -0.127515
1      3177         0.093221
2      3190         0.041206
3      3225         0.126600
5      3239         0.338378
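
As an alternative to the explicit loop, `DataFrame.corrwith` accepts a Series and correlates every column with it in one call. The sketch below runs on a small hypothetical matrix in which column 1 plays the role of the pivot movie; the names `toy`, `pivot`, and `sims` are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy user-movie matrix (hypothetical ratings); column 1 plays the pivot role
toy = pd.DataFrame({
    1: [5.0, 4.0, 1.0, 2.0, np.nan],
    2: [4.5, 4.0, 1.5, np.nan, 3.0],
    3: [1.0, 2.0, 5.0, 4.0, np.nan],
    4: [np.nan, np.nan, np.nan, np.nan, 2.0],  # no rater in common with the pivot
})
pivot = 1

# Correlate every column with the pivot column, then drop the pivot and the NaNs
sims = toy.corrwith(toy[pivot]).drop(pivot).dropna()
sims = sims.rename("corr_with_pivot").rename_axis("movie_id").reset_index()
print(sims)
```

Movie 4 shares no rater with the pivot, so its correlation is NaN and it is dropped, just as in the loop version.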


<font size="+1" color="red">Replace this cell with code to create a "corr_with_pivot" dataframe as specified above, and to print the 20 movies (rated 500 times or more) with the highest correlation with the selected movie.</font>

In [18]:
# Inner join of the two dataframes on movie_id
corr_with_pivot = pd.merge(similarity_to_pivot,ratings_summary, how='inner',on='movie_id')

corr_with_pivot =corr_with_pivot[corr_with_pivot['ratings_count']>=500]  # keep movies with at least 500 ratings
corr_with_pivot =corr_with_pivot.sort_values(by='corr_with_pivot',ascending=False)  # sort by correlation, highest first

display(corr_with_pivot.head(20))  # show the top 20

Unnamed: 0,movie_id,corr_with_pivot,title,ratings_mean,ratings_count
807,5952,0.892103,"Lord of the Rings: The Two Towers, The (2002)",4.083869,5449
1177,7153,0.892073,"Lord of the Rings: The Return of the King, The...",4.08396,5449
986,6539,0.377599,Pirates of the Caribbean: The Curse of the Bla...,3.779241,3950
1339,8368,0.340934,Harry Potter and the Prisoner of Azkaban (2004),3.809971,2397
55,3578,0.337667,Gladiator (2000),3.95105,4811
86,3793,0.329686,X-Men (2000),3.556436,3535
451,4896,0.31918,Harry Potter and the Sorcerer's Stone (a.k.a. ...,3.678509,2843
68,3624,0.307471,Shanghai Noon (2000),3.297443,1017
1774,31658,0.303898,Howl's Moving Castle (Hauru no ugoku shiro) (2...,4.064417,1141
591,5349,0.302174,Spider-Man (2002),3.457931,3209


<font size="+1" color="red">Replace this cell with a brief commentary about the movies you see on this list. What happens if you set the condition on *ratings_count* to a much larger value? What happens if you set it to a much smaller value?</font>

<font size="+1" color="blue">As expected, the movies most correlated with the pivot movie (The Lord of the Rings: The Fellowship of the Ring) are the other two movies of the trilogy, both with a correlation of about 0.89. They are followed by other fantasy, adventure, and action movies such as Pirates of the Caribbean and Harry Potter. The movies on this list are correlated because they belong to similar genres: users who rated The Lord of the Rings highly are more likely to rate similar movies highly as well. If we set the threshold on ratings_count to a much smaller value, for example 5, we get movies whose genres are unrelated to the pivot movie, so the filter no longer does its job of keeping only movies that a large number of users have rated. If instead we set it to a much larger value, for example 3000, the remaining movies resemble The Lord of the Rings more closely, which is what we were looking for.</font>

## 2.4. Implement the item-based recommendations

<font size="+1" color="red">Replace this cell with your code to compute all correlations between columns (movies) in the matrix user_movie. Store this in "item_similarity", and print the first 10 rows.</font>

In [19]:
item_similarity = user_movie.corr(method='pearson')
 
# Print the first 10 rows of the resulting matrix
print(item_similarity.head(10))

movie_id     2769      3177      3190      3225   3228      3239      3273   \
movie_id                                                                      
2769      1.000000  0.115068  0.033721 -0.232268    NaN -0.500000  0.197011   
3177      0.115068  1.000000  0.303820  0.559533    NaN       NaN  0.331191   
3190      0.033721  0.303820  1.000000  0.636361    NaN -0.014315  0.146042   
3225     -0.232268  0.559533  0.636361  1.000000    NaN  0.578414  0.347716   
3228           NaN       NaN       NaN       NaN    1.0       NaN       NaN   
3239     -0.500000       NaN -0.014315  0.578414    NaN  1.000000  0.180846   
3273      0.197011  0.331191  0.146042  0.347716    NaN  0.180846  1.000000   
3275      0.199514  0.167918  0.394293  0.263671    NaN  1.000000  0.105735   
3276      0.250873  1.000000 -0.290397 -0.250313    NaN       NaN  0.154371   
3279           NaN       NaN       NaN       NaN    NaN       NaN       NaN   

movie_id     3275      3276   3279   ...     33138 

<font size="+1" color="red">Replace this cell with your code to compute all correlations between columns (movies) in the matrix user_movie, but considering only movies having at least 100 ratings in common. Store this in "item_similarity_min_ratings"</font>

In [20]:
item_similarity_min_ratings = user_movie.corr(method='pearson', min_periods=100)

# Print the first 5 rows of the resulting matrix
print(item_similarity_min_ratings.head(5))

movie_id  2769   3177   3190   3225   3228   3239   3273   3275   3276   \
movie_id                                                                  
2769        1.0    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN   
3177        NaN    1.0    NaN    NaN    NaN    NaN    NaN    NaN    NaN   
3190        NaN    NaN    1.0    NaN    NaN    NaN    NaN    NaN    NaN   
3225        NaN    NaN    NaN    1.0    NaN    NaN    NaN    NaN    NaN   
3228        NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN   

movie_id  3279   ...  33138  33145  33148  33150  33152  33154  33158  33162  \
movie_id         ...                                                           
2769        NaN  ...    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN   
3177        NaN  ...    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN   
3190        NaN  ...    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN   
3225        NaN  ...    NaN    NaN    NaN    NaN    NaN    NaN    NaN    N
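
The effect of `min_periods` is easier to see on a small example. The sketch below uses a hypothetical matrix and a threshold of 3 instead of 100: movies "a" and "c" share only two raters, so their unrestricted correlation is a perfect 1 that carries little information, and the restricted version replaces it with NaN.

```python
import numpy as np
import pandas as pd

# Toy matrix (hypothetical ratings): "a" and "b" share four raters,
# while "a" and "c" share only two
toy = pd.DataFrame({
    "a": [5.0, 4.0, 2.0, 1.0, 3.0],
    "b": [4.0, 5.0, 1.0, 2.0, np.nan],
    "c": [np.nan, np.nan, np.nan, 2.0, 4.0],
})

loose = toy.corr(method="pearson")                   # any overlap is accepted
strict = toy.corr(method="pearson", min_periods=3)   # require 3 common raters

print(loose.loc["a", "c"])    # two points always lie on a line
print(strict.loc["a", "c"])   # too few common raters, so NaN
print(strict.loc["a", "b"])   # four common raters, so the value is kept
```

The diagonal is affected as well: a movie with fewer ratings than `min_periods` gets NaN even against itself, which matches the NaN on the diagonal for movie 3228 in the output above.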

<font size="+1" color="red">Replace this cell with your code to find the user ids of two example users: user_id_super (the one who liked the three superhero movies), and user_id_drama (the one who liked the three dramas)</font>

In [21]:
user_id_super=user_movie[(user_movie[5349]>=4.5)&(user_movie[3793]>=4.5)&(user_movie[6534]>=4.5)].index[0]

user_id_drama=user_movie[(user_movie[6870]>=4.5)&(user_movie[5995]>=4.5)&(user_movie[3555]>=4.5)].index[0]

print("User id's super:", user_id_super,'\n')
print("User id's drama:", user_id_drama)

User id's super: 5144 

User id's drama: 34336


In [22]:
# Leave this code as-is
# Gets a list of watched movies for a user_id
def get_watched_movies(user_id, user_movie):
    return list(user_movie.loc[user_id].dropna().sort_values( ascending=False).index)

# Gets the rating a user_id has given to a movie_id
def get_rating(user_id, movie_id, user_movie):
    return user_movie[movie_id][user_id]

# Print watched movies
def print_watched_movies(user_id, user_movie, movies):
    for movie_id in get_watched_movies(user_id, user_movie):
        print("%d %.1f %s " %
          (movie_id, get_rating(user_id, movie_id, user_movie), get_title(movie_id, movies)))


In [23]:
# LEAVE AS-IS (TESTING CODE)

print_watched_movies(user_id_super, user_movie, movies)

33166 5.0 Crash (2004) 
7360 5.0 Dawn of the Dead (2004) 
8622 5.0 Fahrenheit 9/11 (2004) 
4643 5.0 Planet of the Apes (2001) 
3994 5.0 Unbreakable (2000) 
4878 5.0 Donnie Darko (2001) 
6377 5.0 Finding Nemo (2003) 
8327 5.0 Dolls (2002) 
7572 5.0 Wit (2001) 
6380 5.0 Capturing the Friedmans (2003) 
7371 5.0 Dogville (2003) 
7361 5.0 Eternal Sunshine of the Spotless Mind (2004) 
4226 5.0 Memento (2000) 
8638 5.0 Before Sunset (2004) 
7153 5.0 Lord of the Rings: The Return of the King, The (2003) 
6534 5.0 Hulk (2003) 
6620 5.0 American Splendor (2003) 
4308 5.0 Moulin Rouge (2001) 
7090 5.0 Hero (Ying xiong) (2002) 
5669 5.0 Bowling for Columbine (2002) 
4370 5.0 A.I. Artificial Intelligence (2001) 
6711 5.0 Lost in Translation (2003) 
6874 5.0 Kill Bill: Vol. 1 (2003) 
6870 5.0 Mystic River (2003) 
4993 5.0 Lord of the Rings: The Fellowship of the Ring, The (2001) 
4642 5.0 Hedwig and the Angry Inch (2000) 
3949 5.0 Requiem for a Dream (2000) 
5679 5.0 Ring, The (2002) 
5902 5.0 Adapt

In [24]:
# LEAVE AS-IS (TESTING CODE)

print_watched_movies(user_id_drama, user_movie, movies)

3967 5.0 Billy Elliot (2000) 
4014 5.0 Chocolat (2000) 
4034 5.0 Traffic (2000) 
5995 5.0 Pianist, The (2002) 
7147 5.0 Big Fish (2003) 
4995 5.0 Beautiful Mind, A (2001) 
3555 5.0 U-571 (2000) 
6870 5.0 Mystic River (2003) 
5991 5.0 Chicago (2002) 
8464 5.0 Super Size Me (2004) 
5669 5.0 Bowling for Columbine (2002) 
8622 5.0 Fahrenheit 9/11 (2004) 
30707 5.0 Million Dollar Baby (2004) 
6953 4.5 21 Grams (2003) 
5015 4.5 Monster's Ball (2001) 
5464 4.5 Road to Perdition (2002) 
3510 4.5 Frequency (2000) 
5989 4.5 Catch Me If You Can (2002) 
4022 4.0 Cast Away (2000) 
5010 4.0 Black Hawk Down (2001) 
5299 4.0 My Big Fat Greek Wedding (2002) 
3897 4.0 Almost Famous (2000) 
3755 4.0 Perfect Storm, The (2000) 
4308 4.0 Moulin Rouge (2001) 
4447 3.5 Legally Blonde (2001) 
4246 3.5 Bridget Jones's Diary (2001) 
4975 3.5 Vanilla Sky (2001) 
4019 3.5 Finding Forrester (2000) 
5377 3.5 About a Boy (2002) 
3948 3.5 Meet the Parents (2000) 
5956 3.0 Gangs of New York (2002) 
6281 3.0 Phone Booth

<font size="+1" color="red">Replace this cell with your code for "get_movies_relevance"</font>

In [25]:
def get_movies_relevance(user_id, user_movie, item_similarity_matrix):
    
    # Create an empty series
    movies_relevance = pd.Series(dtype=float)
    
    # Iterate through the movies the user has watched
    for watched_movie in get_watched_movies(user_id, user_movie):
        
        # Obtain the rating given
        rating_given = get_rating(user_id,watched_movie,user_movie)
        
        # Obtain the vector containing the similarities of watched_movie
        # with all other movies in item_similarity_matrix
        similarities = item_similarity_matrix[watched_movie]
        
        # Multiply this vector by the given rating
        weighted_similarities = rating_given*similarities
        
        # Append these terms to movies_relevance
        movies_relevance = pd.concat([movies_relevance, weighted_similarities])
    
    # Compute the sum for each movie
    movies_relevance = movies_relevance.groupby(movies_relevance.index).sum()
    
    # Convert to a dataframe
    movies_relevance_df = pd.DataFrame(movies_relevance, columns=['relevance'])
    movies_relevance_df['movie_id'] = movies_relevance_df.index
    
    return movies_relevance_df
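
The relevance computed by this function is, for each candidate movie, the sum over the watched movies of the rating given multiplied by the similarity between the two movies. The following sketch reproduces that sum on a hypothetical 3x3 similarity matrix with a matrix product; `toy_similarity` and `toy_ratings` are made-up values, and `fillna(0)` mirrors the fact that the grouped sum skips NaN terms.

```python
import numpy as np
import pandas as pd

# Hypothetical similarity matrix between three movies (ids 1, 2, 3)
toy_similarity = pd.DataFrame(
    [[1.0, 0.8, -0.2],
     [0.8, 1.0, np.nan],
     [-0.2, np.nan, 1.0]],
    index=[1, 2, 3], columns=[1, 2, 3])

# Hypothetical user who rated movie 1 with 5.0 and movie 3 with 2.0
toy_ratings = pd.Series({1: 5.0, 3: 2.0})

# relevance(m) = sum over watched movies w of rating(w) * similarity(w, m);
# undefined similarities contribute nothing to the sum
toy_relevance = toy_similarity[toy_ratings.index].fillna(0).dot(toy_ratings)
print(toy_relevance)
```

For movie 2, for instance, the only defined term is 5.0 x 0.8, so its relevance is 4.0.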

<font size="+1" color="red">Replace this cell with your code to obtain the 5 most relevant movies for the users user_id_super (who likes superhero movies) and user_id_drama (who likes dramas)</font>

In [32]:
print("RELEVANT MOVIES FOR USER_ID_SUPER\n")

relevant_movies_superhero=get_movies_relevance(user_id_super,user_movie,item_similarity)  # use the function above
relevant_movies_superhero=relevant_movies_superhero.sort_values(by='relevance',ascending=False)  # sort by relevance, highest first
# Add the titles with get_title; they are needed to comment on the results
relevant_movies_superhero['title'] = relevant_movies_superhero['movie_id'].apply(lambda x: get_title(x, movies))

print(relevant_movies_superhero.head())



print("\nRELEVANT MOVIES FOR USER_ID_DRAMA\n")
relevant_movies_drama=get_movies_relevance(user_id_drama,user_movie,item_similarity)#use the function above
relevant_movies_drama=relevant_movies_drama.sort_values(by='relevance',ascending=False)#sort the values by relevance
relevant_movies_drama['title'] = relevant_movies_drama['movie_id'].apply(lambda x: get_title(x, movies))

print(relevant_movies_drama.head())

RELEVANT MOVIES FOR USER_ID_SUPER

       relevance  movie_id                                  title
7443  448.106701      7443         This So-Called Disaster (2003)
6688  433.983302      6688       Autumn Spring (Babí léto) (2001)
8895  422.849117      8895                             AKA (2002)
4173  419.304747      4173          When Brendan Met Trudy (2000)
6375  415.161506      6375  Gigantic (A Tale of Two Johns) (2002)

RELEVANT MOVIES FOR USER_ID_DRAMA

        relevance  movie_id                           title
7521   160.000000      7521                    Mercy (2000)
4449   154.388241      4449               Adanggaman (2000)
7443   146.447098      7443  This So-Called Disaster (2003)
31636  135.966211     31636              Bunker, The (2001)
27835  133.500000     27835          Agronomist, The (2003)


<font size="+1" color="red">Replace this cell with a brief commentary on the movies you see on these lists. How many of them look relevant for the intended users? Feel free to use IMDB or Wikipedia to get info on these movies.</font>

<font size="-1" color="gray">All those trivial facts you learned about 1980s and 1990s pop culture were supposed to be useful one day; that day has arrived :-)</font>

<font size="+1" color="blue">In the first list, for the user who is a fan of superhero movies, I would say the most relevant movies are AKA (an action and crime movie) and When Brendan Met Trudy, which is mostly a romantic movie but also includes some crime elements, since the protagonist falls in love with a thief.
The other movies are not very relevant because they are closer to documentaries, which are unlikely to interest this user.</font>

<font size="+1" color="blue">For the second user, a fan of dramas, I would say that he/she might like Mercy, an action-heavy movie that also has some dramatic elements, even though it is not a typical drama like Adanggaman, which the user will probably also like. The user might also enjoy The Bunker, which is about soldiers who get locked in a bunker; I imagine it has some very dramatic scenes, so it could be relevant. The other movies are not relevant for this user.</font>

<font size="+1" color="red">Replace this cell with your code implementing "get_recommended_movies"</font>

In [39]:
def get_recommended_movies(user_id,user_movie, item_similarity,movies):
    
    relevant_movies=get_movies_relevance(user_id,user_movie,item_similarity)#use the function above
    relevant_movies.set_index('movie_id', inplace=True)
    
    list_watched_movies = get_watched_movies(user_id, user_movie)

    relevant_movies = relevant_movies.drop(index=list_watched_movies, errors='ignore')#delete movies that the user has watched
    
    relevant_movies['title'] = relevant_movies.index.map(lambda x: get_title(x, movies))
    relevant_movies_sorted=relevant_movies.sort_values(by='relevance', ascending=False)
    # Return the recommended movies, most relevant first
    return relevant_movies_sorted
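
The filtering step can be checked in isolation on a small example. This sketch uses hypothetical relevance scores and a hypothetical list of watched movies; note that `errors="ignore"` lets `drop` skip ids, such as 99 here, that are not present in the index.

```python
import pandas as pd

# Hypothetical relevance scores indexed by movie_id
toy_relevance = pd.DataFrame(
    {"relevance": [4.6, 4.0, 1.0, 3.2]},
    index=pd.Index([1, 2, 3, 4], name="movie_id"))

# Movies the hypothetical user has already rated; 99 is absent from the scores
toy_watched = [1, 3, 99]

# Remove the watched movies, then rank the remaining ones by relevance
toy_recommended = (toy_relevance
                   .drop(index=toy_watched, errors="ignore")
                   .sort_values(by="relevance", ascending=False))
print(toy_recommended)
```

The scores of the remaining movies are unchanged by the filter; only the rows of watched movies disappear.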
    

<font size="+1" color="red">Replace this cell with your code to obtain the 10 most recommended movies for the users user_id_super and user_id_drama</font>

In [40]:
print("RECOMMENDED MOVIES FOR user_id_super\n")
rec_mov_super=get_recommended_movies(user_id_super,user_movie, item_similarity,movies)
print(rec_mov_super.head(10))

print("\nRECOMMENDED MOVIES FOR user_id_drama\n")
rec_mov_drama=get_recommended_movies(user_id_drama, user_movie ,item_similarity,movies)
print(rec_mov_drama.head(10))

RECOMMENDED MOVIES FOR user_id_super

           relevance                                              title
movie_id                                                               
7443      448.106701                     This So-Called Disaster (2003)
6688      433.983302                   Autumn Spring (Babí léto) (2001)
8895      422.849117                                         AKA (2002)
4173      419.304747                      When Brendan Met Trudy (2000)
6375      415.161506              Gigantic (A Tale of Two Johns) (2002)
6336      408.500000     Marooned in Iraq (Gomgashtei dar Aragh) (2002)
4150      408.500000                             Signs & Wonders (2001)
5806      394.993866                   Blackboards (Takhté Siah) (2000)
5575      393.628939  Alias Betty (Betty Fisher et autres histoires)...
5687      391.691250  Take Care of My Cat (Goyangileul butaghae) (2001)

RECOMMENDED MOVIES FOR user_id_drama

           relevance                                       

<font size="+1" color="red">Replace this cell with a brief commentary on these recommendations. Do you think they are relevant? Why or why not? After removing the movies the user has already watched, are the relevance scores of the remaining items comparable to the previous lists that contained all relevant movies?</font>

<font size="+1" color="blue">After comparing both lists, I observe that the movies appear in the same order and with the same scores in the relevance list and in the recommendation list, which means that none of the top-ranked movies had already been watched by the user. This makes sense, since the most relevant movies are the ones that should be recommended first. However, I do not think these recommendations are very relevant, because recommended movies should match the user's interests in genre and plot, and most of these do not. Removing the watched movies does not change the relevance score of the remaining ones, so the scores stay comparable; the lists would only differ if a watched movie had ranked among the top ones, because it would then be dropped.</font>

<font size="+2" color="#003300">I hereby declare that, except for the code provided by the course instructors, all of my code, report, and figures were produced by myself.</font>