# Lab 8: Recommender System

In this assignment, we will implement user-based and item-based collaborative filtering (CF) and evaluate each on a held-out test set.

## 1. Dataset

In this assignment, we will use the MovieLens-100K dataset. It includes about 100,000 ratings from about 1,000 users on about 1,700 movies.

In [2]:
from math import sqrt
import pandas as pd
import numpy as np
import seaborn as sns
from matplotlib import pyplot as plt
from sklearn.metrics.pairwise import linear_kernel
from sklearn.neighbors import NearestNeighbors


# 1. load data
user_ratings_train = pd.read_csv('./ml-100k/u1.base',
                            sep='\t',names=['user_id','movie_id','rating'], usecols=[0,1,2])

user_ratings_test = pd.read_csv('./ml-100k/u1.test',
                            sep='\t',names=['user_id','movie_id','rating'], usecols=[0,1,2])

movie_info =  pd.read_csv('./ml-100k/u.item', 
                          sep='|', names=['movie_id','title'], usecols=[0,1],
                          encoding="ISO-8859-1")

user_ratings_train = pd.merge(movie_info, user_ratings_train)
user_ratings_test = pd.merge(movie_info, user_ratings_test)

# 2. get the rating matrix. Each row is a user, and each column is a movie.
user_ratings_train = user_ratings_train.pivot_table(index=['user_id'],
                                        columns=['title'],
                                        values='rating')

user_ratings_test = user_ratings_test.pivot_table(index=['user_id'],
                                        columns=['title'],
                                        values='rating')




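# 3. align both matrices to the same users (rows) and movies (columns);
#    entries missing from a split become NaN
user_ratings_train = user_ratings_train.reindex(
                            index=user_ratings_train.index.union(user_ratings_test.index), 
                            columns=user_ratings_train.columns.union(user_ratings_test.columns) )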

user_ratings_test = user_ratings_test.reindex(
                            index=user_ratings_train.index.union(user_ratings_test.index), 
                            columns=user_ratings_train.columns.union(user_ratings_test.columns) )

print(user_ratings_train.shape)
print(user_ratings_test.shape)

(943, 1664)
(943, 1664)


## Task 1. User-based CF

* Use Pearson correlation to compute the similarity between users.
* Based on the obtained similarity score, predict the ratings. You can use 5 nearest neighbors or 10 nearest neighbors.
* Compute MAE for the testing set.
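
The three steps above can be sketched on a small toy matrix. This is a minimal illustration under stated assumptions, not the reference solution: the names `toy_train`, `toy_test`, `predict_user_based`, and `mae_user_based` are hypothetical, only neighbors with positive similarity are used, predictions are mean-centered and clipped to the 1 to 5 rating scale, and the user's own mean is the fallback when no neighbor is available.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rating matrices: rows are users, columns are movies, NaN = unrated.
toy_train = pd.DataFrame(
    {"m1": [5.0, 5.0, 1.0, 4.0],
     "m2": [4.0, 5.0, 2.0, np.nan],
     "m3": [1.0, 2.0, 5.0, 1.0],
     "m4": [np.nan, 1.0, 4.0, 2.0]},
    index=["u1", "u2", "u3", "u4"],
)
toy_test = pd.DataFrame(np.nan, index=toy_train.index, columns=toy_train.columns)
toy_test.loc["u1", "m4"] = 1.0
toy_test.loc["u4", "m2"] = 5.0

# DataFrame.corr() correlates columns, so transpose to get user-by-user Pearson
# similarity (computed over the movies each pair of users has both rated).
user_sim = toy_train.T.corr(method="pearson")


def predict_user_based(train, sim, user, item, k=5):
    """Predict a rating from the k most similar users who rated `item`."""
    user_mean = train.loc[user].mean()
    # Candidate neighbors: other users who rated the item in the training set.
    raters = train[item].dropna().index.difference([user])
    sims = sim.loc[user, raters].dropna()
    # Assumption: keep only positively correlated neighbors.
    sims = sims[sims > 0].nlargest(k)
    if sims.empty:
        return float(user_mean)  # fallback: the user's own mean rating
    neighbors = train.loc[sims.index]
    # Deviation of each neighbor's rating from that neighbor's own mean.
    deviations = neighbors[item] - neighbors.mean(axis=1)
    pred = user_mean + (sims * deviations).sum() / sims.sum()
    return float(np.clip(pred, 1.0, 5.0))


def mae_user_based(train, test, sim, k=5):
    """Mean absolute error over every observed rating in `test`."""
    errors = []
    for user, row in test.iterrows():
        for item, true_rating in row.dropna().items():
            pred = predict_user_based(train, sim, user, item, k)
            errors.append(abs(pred - true_rating))
    return float(np.mean(errors))


print(predict_user_based(toy_train, user_sim, "u1", "m4", k=2))
print(mae_user_based(toy_train, toy_test, user_sim, k=2))
```

On the real data, the same functions could be called with `user_ratings_train`, `user_ratings_test`, and `user_ratings_train.T.corr(method='pearson')`, with `k=5` or `k=10`; the nested Python loops are slow, so a vectorized version may be preferable.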

In [6]:
# a little more information about the user ratings
print(user_ratings_train.info())

<class 'pandas.core.frame.DataFrame'>
Int64Index: 943 entries, 1 to 943
Columns: 1664 entries, 'Til There Was You (1997) to Á köldum klaka (Cold Fever) (1994)
dtypes: float64(1664)
memory usage: 12.0 MB
None


In [14]:
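# get the similarities for the training data
# note: DataFrame.corr() correlates columns, so this is a movie-by-movie matrix;
# user-by-user similarity needs the transpose: user_ratings_train.T.corr(method='pearson')
Xsim = user_ratings_train.corr(method='pearson')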

In [18]:
# visual inspection
Xsim

title,'Til There Was You (1997),1-900 (1994),101 Dalmatians (1996),12 Angry Men (1957),187 (1997),2 Days in the Valley (1996),"20,000 Leagues Under the Sea (1954)",2001: A Space Odyssey (1968),3 Ninjas: High Noon At Mega Mountain (1998),"39 Steps, The (1935)",...,Yankee Zulu (1994),Year of the Horse (1997),You So Crazy (1994),Young Frankenstein (1974),Young Guns (1988),Young Guns II (1990),"Young Poisoner's Handbook, The (1995)",Zeus and Roxanne (1997),unknown,Á köldum klaka (Cold Fever) (1994)
'Til There Was You (1997),1.000000,,,,-0.500000,,,,,,...,,,,,,,,,,
1-900 (1994),,1.0,,,,,,,,,...,,,,,,,,,,
101 Dalmatians (1996),,,1.000000,-0.054823,1.000000,0.062517,0.273662,-0.259677,,0.106600,...,,,,0.244725,0.267500,0.764706,0.327327,0.866025,,
12 Angry Men (1957),,,-0.054823,1.000000,0.577350,0.134840,0.255123,0.112088,,0.431507,...,,,,0.042439,-0.071429,-0.349215,-0.037878,,,
187 (1997),-0.500000,,1.000000,0.577350,1.000000,0.624511,,-0.554700,,,...,,1.0,,0.562500,-1.000000,,0.176777,,,
2 Days in the Valley (1996),,,0.062517,0.134840,0.624511,1.000000,0.175412,0.343562,,0.174078,...,,,,0.015761,0.208691,0.392232,0.211189,,,
"20,000 Leagues Under the Sea (1954)",,,0.273662,0.255123,,0.175412,1.000000,0.278498,,-0.037450,...,,,,-0.125693,0.349064,0.000000,0.674200,,,
2001: A Space Odyssey (1968),,,-0.259677,0.112088,-0.554700,0.343562,0.278498,1.000000,,0.434572,...,,,,-0.056968,-0.171345,-0.343297,-0.425812,,-1.0,
3 Ninjas: High Noon At Mega Mountain (1998),,,,,,,,,,,...,,,,,,,,,,
"39 Steps, The (1935)",,,0.106600,0.431507,,0.174078,-0.037450,0.434572,,1.000000,...,,,,0.137038,-0.078811,0.174078,1.000000,,,


In [None]:
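# Manual Pearson correlation between users (rows of the rating matrix).
# Each pair is compared only on the movies both users rated, so NaNs are skipped.
# Note: this double loop over all users is slow; it is written for clarity.
corr_list = []

# loop through user a
for i_user_a in user_ratings_train.index:
    # new list for user a's row
    row_user_a = []
    # get user a's ratings
    user_a = user_ratings_train.loc[i_user_a]
    # loop through user b
    for i_user_b in user_ratings_train.index:
        # get user b's ratings
        user_b = user_ratings_train.loc[i_user_b]
        # keep only the movies rated by both users
        both = user_a.notna() & user_b.notna()
        # subtract each user's mean over the co-rated movies
        a = user_a[both] - user_a[both].mean()
        b = user_b[both] - user_b[both].mean()
        # normalize the dot product; undefined (NaN) if either norm is zero
        denom = np.linalg.norm(a, ord=2) * np.linalg.norm(b, ord=2)
        user_ab_sim = a.dot(b) / denom if denom > 0 else np.nan
        row_user_a.append(user_ab_sim)
    # next i_user_b
    corr_list.append(row_user_a)
# next i_user_a

# wrap the result in a labeled user-by-user DataFrame
user_sim = pd.DataFrame(corr_list,
                        index=user_ratings_train.index,
                        columns=user_ratings_train.index)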

## Task 2. Item-based CF
* Use cosine similarity to get the similarity between different items.
* Based on the obtained similarity score, predict the ratings. You can use 5 nearest neighbors or 10 nearest neighbors.
* Compute MAE for the testing set.
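
As a starting point, the steps above can be sketched on a toy matrix. This is one possible approach under stated assumptions, not the expected solution: the names `toy_train`, `toy_test`, `item_cosine_similarity`, `predict_item_based`, and `mae_item_based` are hypothetical, missing ratings are treated as 0 when computing cosine similarity, and the prediction is a similarity-weighted average of the user's own ratings on the k most similar items.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rating matrices: rows are users, columns are movies, NaN = unrated.
toy_train = pd.DataFrame(
    {"m1": [5.0, 5.0, 1.0, 4.0],
     "m2": [4.0, 5.0, 2.0, np.nan],
     "m3": [1.0, 2.0, 5.0, 1.0],
     "m4": [np.nan, 1.0, 4.0, 2.0]},
    index=["u1", "u2", "u3", "u4"],
)
toy_test = pd.DataFrame(np.nan, index=toy_train.index, columns=toy_train.columns)
toy_test.loc["u1", "m4"] = 1.0
toy_test.loc["u4", "m2"] = 5.0


def item_cosine_similarity(train):
    """Cosine similarity between item (column) vectors."""
    # Assumption: missing ratings count as 0 in the item vectors.
    ratings = train.fillna(0.0).to_numpy()
    norms = np.linalg.norm(ratings, axis=0)
    norms[norms == 0] = 1.0  # avoid dividing by zero for items with no ratings
    sim = (ratings.T @ ratings) / np.outer(norms, norms)
    return pd.DataFrame(sim, index=train.columns, columns=train.columns)


def predict_item_based(train, sim, user, item, k=5):
    """Predict a rating from the user's ratings on the k most similar items."""
    # Items the user has rated in the training set, excluding the target item.
    rated = train.loc[user].dropna().index.difference([item])
    sims = sim.loc[item, rated]
    sims = sims[sims > 0].nlargest(k)
    if sims.empty:
        return float(train.loc[user].mean())  # fallback: the user's mean rating
    ratings = train.loc[user, sims.index]
    # Similarity-weighted average of the user's ratings on the neighbor items.
    return float((sims * ratings).sum() / sims.sum())


def mae_item_based(train, test, sim, k=5):
    """Mean absolute error over every observed rating in `test`."""
    errors = []
    for user, row in test.iterrows():
        for item, true_rating in row.dropna().items():
            pred = predict_item_based(train, sim, user, item, k)
            errors.append(abs(pred - true_rating))
    return float(np.mean(errors))


item_sim = item_cosine_similarity(toy_train)
print(predict_item_based(toy_train, item_sim, "u1", "m4", k=2))
print(mae_item_based(toy_train, toy_test, item_sim, k=2))
```

Filling missing ratings with 0 keeps the computation simple but treats "unrated" like a very low rating; mean-centering each item before computing the similarity is a common alternative.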

In [23]:
# your code