# `AA Workshop 12` — Coding Challenge

Complete the tasks below to practice collaborative filtering techniques from `W12_Recommender_Systems.ipynb`.

Guidelines:
- Work in order. Run each cell after editing with Shift+Enter.
- Keep answers short; focus on making things work.
- If a step fails, read the error and fix it.

By the end you will have exercised:
- implementing item- and user-based approaches to predict ratings
- generating recommendations for a specific user

## Task 1 - Predict a specific rating

Let's apply what we learned about collaborative filtering. We will use the same datasets as in the workshop notebook, i.e. `ratings.csv` and `movies.csv` from https://grouplens.org/datasets/movielens/. As before, we only consider movies with five or more ratings.

The user with `userId = 15` has not yet rated the movie _Beauty and the Beast (1991)_. First, look at some of the movies this user has given the highest score (5). Then apply and compare the item-item and user-user approaches, each with Pearson correlation and cosine similarity as the similarity measure, to predict whether the user is likely to enjoy or dislike _Beauty and the Beast (1991)_. Given the user's most and least favorite movies, is the predicted rating what you expected?
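
To get a feel for how the two similarity measures can disagree before running them on the full data, here is a minimal sketch on two made-up users. The rating vectors and the helper names `cosine_sim` and `pearson_sim` are assumptions for illustration only; they are not part of the workshop notebook.

```python
import numpy as np

# Hypothetical ratings of two users for the same four movies (0 would mean "not rated").
a = np.array([5.0, 4.0, 5.0, 4.0])
b = np.array([4.0, 5.0, 4.0, 5.0])

def cosine_sim(u, v):
    """Cosine similarity on the raw rating vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def pearson_sim(u, v):
    """Pearson-style similarity: subtract each user's mean rating, compare co-rated items only."""
    both = (u != 0) & (v != 0)
    du = (u - u[u != 0].mean())[both]
    dv = (v - v[v != 0].mean())[both]
    return float(du @ dv / (np.linalg.norm(du) * np.linalg.norm(dv) + 1e-12))

print(f"cosine:  {cosine_sim(a, b):.3f}")
print(f"pearson: {pearson_sim(a, b):.3f}")
```

Both users rate everything highly, so cosine similarity is close to 1, yet their preferences relative to their own averages are exactly opposite, which Pearson correlation reports as roughly -1. Keep this in mind when the two measures give different predictions below.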

In [1]:
import numpy as np
import pandas as pd
import scipy.sparse as sp

# load data
df = pd.read_csv("../data/ratings.csv")
df_mov = pd.read_csv("../data/movies.csv", index_col="movieId")

In [2]:
# favorite movies
df_mov[df_mov.index.isin(df[(df["userId"] == 15) & (df["rating"] == 5)]["movieId"])].head(20)

movieId,title,genres
47,Seven (a.k.a. Se7en) (1995),Mystery|Thriller
50,"Usual Suspects, The (1995)",Crime|Mystery|Thriller
82,Antonia's Line (Antonia) (1995),Comedy|Drama
111,Taxi Driver (1976),Crime|Drama|Thriller
149,Amateur (1994),Crime|Drama|Thriller
246,Hoop Dreams (1994),Documentary
260,Star Wars: Episode IV - A New Hope (1977),Action|Adventure|Sci-Fi
293,Léon: The Professional (a.k.a. The Professiona...,Action|Crime|Drama|Thriller
296,Pulp Fiction (1994),Comedy|Crime|Drama|Thriller
306,Three Colors: Red (Trois couleurs: Rouge) (1994),Drama


In [3]:
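# build user-item matrix (rows = users, columns = movies, 0 = not rated)
# userId and movieId are 1-based, so shift them to 0-based indices
X = np.asarray(sp.coo_matrix((df["rating"], (df["userId"]-1, df["movieId"]-1))).todense())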

# only consider movies with at least 5 ratings
valid_movies = (X!=0).sum(axis=0) >= 5
movie_to_title = dict(zip(range(len(valid_movies)), df_mov.loc[np.where(valid_movies)[0]+1]["title"]))
X = X[:,valid_movies]

# position of Beauty and the Beast (1991)
for index, title in movie_to_title.items():
    if title == "Beauty and the Beast (1991)":
        print(index)

403


In [4]:
# validate that user 15 (i.e. index 14) has not rated this movie
print(X[14,403])

0.0


In [5]:
# compute user and item means
user_means = np.array([X[i,X[i,:]!=0].mean() for i in range(X.shape[0])])
movie_means = np.array([X[X[:,i]!=0,i].mean() for i in range(X.shape[1])])

# retrieve average rating for Beauty and the Beast (1991)
print(movie_means[403])

3.75


In [6]:
# define functions
def all_pearson(X, user_means, min_common_items=5):
    X_norm = (X - user_means[:,None])*(X != 0)
    X_col_norm = (X_norm**2) @ (X_norm != 0).T
    common_items = (X!=0).astype(float) @ (X!=0).T
    return (X_norm @ X_norm.T)/(np.sqrt(X_col_norm*X_col_norm.T)+1e-12) * (common_items >= min_common_items)

def all_cosine(X):
    x_norm = np.sqrt((X**2).sum(axis=1))
    return (X @ X.T) / np.outer(x_norm, x_norm)

def predict_user_user(X, W, user_means, i):
    """ Return predictions of X_(ij) for user i and all items j (one value per column of X). """
    return user_means[i] + (np.sum((X - user_means[:,None]) * (X != 0) * W[i,:,None], axis=0) /
                            (np.sum((X != 0) * np.abs(W[i,:,None]), axis=0) + 1e-12))

def predict_item_item(X, W, item_means, i):
    """ Return predictions of X_(ji) for item i and all users j (same formula applied to X.T). """
    return predict_user_user(X.T, W, item_means, i)
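
For reference, the prediction that `predict_user_user` computes for user $i$ and item $j$ can be written as a single formula, read directly off the code above. The symbols $\hat{x}_{ij}$, $\bar{x}_i$ and $w_{ik}$ are notation chosen here for illustration, not taken from the workshop notebook:

$$
\hat{x}_{ij} = \bar{x}_i + \frac{\sum_{k:\, x_{kj} \neq 0} w_{ik}\,\left(x_{kj} - \bar{x}_k\right)}{\sum_{k:\, x_{kj} \neq 0} \left|w_{ik}\right|}
$$

Here $\bar{x}_i$ is the mean rating of user $i$ (`user_means[i]`), $w_{ik}$ is the similarity between users $i$ and $k$ (`W[i,k]`), and the sums run over all users $k$ who rated item $j$. The `1e-12` in the code only guards against division by zero. `predict_item_item` applies the same formula to `X.T`, so items take the role of users and `item_means` the role of `user_means`.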

In [None]:
# predict rating
user_id = 14 # remember: 15-1
movie_id = 403

## user-user with Pearson correlation
W_user_pearson = all_pearson(X, user_means)
user_pearson_pred = predict_user_user(X, W_user_pearson, user_means, user_id)

## user-user with Cosine similarity
W_user_cosine = all_cosine(X)
user_cosine_pred = predict_user_user(X, W_user_cosine, user_means, user_id)

## item-item with Pearson correlation
W_item_pearson = all_pearson(X.T, movie_means)
item_pearson_pred = np.array([predict_item_item(X, W_item_pearson, movie_means, i) for i in range(X.shape[1])]).T

## item-item with Cosine similarity
W_item_cosine = all_cosine(X.T)
item_cosine_pred = np.array([predict_item_item(X, W_item_cosine, movie_means, i) for i in range(X.shape[1])]).T

print(f"Predicted rating for '{movie_to_title[movie_id]}' (user {user_id+1}):")
print(f"User-User (Pearson): {user_pearson_pred[movie_id]:.2f}")
print(f"User-User (Cosine): {user_cosine_pred[movie_id]:.2f}")
print(f"Item-Item (Pearson): {item_pearson_pred[user_id,movie_id]:.2f}")
print(f"Item-Item (Cosine): {item_cosine_pred[user_id,movie_id]:.2f}")

## Task 2 - Recommend five movies

Task 1 should have shown that _Beauty and the Beast (1991)_ is probably not the best recommendation for the user with `userId = 15`. Again, apply and compare the item-item and user-user approaches, each with Pearson correlation and cosine similarity as the similarity measure, and recommend the five movies with the highest predicted ratings.
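
The core of this task is picking the highest predictions among movies the user has not rated yet. A minimal sketch of that selection step on made-up numbers follows; the function name `top_k_unrated` and the toy arrays are assumptions for illustration, and the cells below solve the same problem with a pandas DataFrame instead.

```python
import numpy as np

def top_k_unrated(pred, rated_mask, k=5):
    """Return indices of the k highest predictions among items not yet rated."""
    # Push already-rated items to the bottom of the ranking.
    scores = np.where(rated_mask, -np.inf, pred)
    # Sort in descending order of predicted rating.
    order = np.argsort(-scores, kind="stable")
    n_unrated = int((~rated_mask).sum())
    return order[:min(k, n_unrated)]

# Hypothetical predictions for six movies; True marks movies the user already rated.
pred = np.array([4.8, 3.1, 4.9, 2.0, 4.5, 4.7])
rated = np.array([True, False, False, False, True, False])

top = top_k_unrated(pred, rated, k=3)
print(top)
```

Movies 0 and 4 have high predictions but are skipped because the user has already rated them, which mirrors the `rating.isna()` filter used in the last cell.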

In [None]:
# create summary df containing all true and predicted ratings
# join on movieId rather than title, since titles are not guaranteed to be unique
summary_df = pd.DataFrame(movie_to_title.items(), columns=["index", "title"])
summary_df["movieId"] = np.where(valid_movies)[0] + 1
summary_df = summary_df.merge(df.loc[df.userId == 15, ["movieId", "rating"]], on="movieId", how="left")
summary_df["user_pearson_pred"] = user_pearson_pred
summary_df["user_cosine_pred"] = user_cosine_pred
summary_df["item_pearson_pred"] = item_pearson_pred[user_id,:]
summary_df["item_cosine_pred"] = item_cosine_pred[user_id,:]
summary_df.head()

In [None]:
# return top five movie titles for each approach
for i in ["user_pearson_pred", "user_cosine_pred", "item_pearson_pred", "item_cosine_pred"]:
    print("\nTop 5 recommended movies based on", i)
    print(summary_df[summary_df.rating.isna()].sort_values(by=i, ascending=False)[["title", i]].head(5))