# CS 1656 – Introduction to Data Science 

## Instructor: Alexandros Labrinidis
### Teaching Assistants: Evangelos Karageorgos, Xiaoting Li, Gordon Lu
### Additional credits: Phuong Pham, Zuha Agha, Anatoli Shein
## Recitation 7: Collaborative Filtering & Similarity Metrics
---
In this recitation we will implement collaborative filtering for recommender systems as a hands-on exercise. We will also see how the choice of similarity metric changes the ratings that collaborative filtering predicts.

The packages you will need for this recitation are:

* pandas
* numpy
* scipy

Recall that the numpy package provides nd-arrays and operations for easily manipulating them.
Likewise, scipy provides an additional suite of useful mathematical functions and distributions for numpy arrays, including distance functions, which we will use in this recitation to measure similarity. We import only the distance functions we need for today's session, as shown below. Note that cityblock is just another name for the Manhattan distance metric seen in class.

In [1]:
import pandas as pd
import numpy as np
from scipy.spatial.distance import euclidean, cityblock, cosine
from scipy.stats import pearsonr
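
As a quick sanity check of what these distance functions return, here is a minimal sketch on two made-up vectors (the arrays `a` and `b` are hypothetical and not part of the recitation data). Note that scipy's `cosine` returns a cosine *distance*, that is, one minus the cosine similarity.

```python
import numpy as np
from scipy.spatial.distance import euclidean, cityblock, cosine

# Two hypothetical rating vectors, made up for illustration only.
a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])

d_euclidean = euclidean(a, b)  # sqrt(1^2 + 2^2 + 3^2)
d_manhattan = cityblock(a, b)  # |1| + |2| + |3|
d_cosine = cosine(a, b)        # 1 - cos(angle between a and b)

print(d_euclidean, d_manhattan, d_cosine)
```

Because `b` is a multiple of `a`, the two vectors point in the same direction and the cosine distance is zero (up to floating-point error), even though the Euclidean and Manhattan distances are not.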


## User-Based vs Item-Based Recommendation
There are two types of collaborative filtering methods: user-based and item-based.

User-based recommendation assumes that similar users give similar ratings to each item, whereas item-based recommendation assumes that similar items receive similar ratings from each user. You can think of them as duals of each other.

In this recitation, we will walk through a toy example for user-based recommendation and you will try out item-based recommendation later in one of your tasks. 
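
The duality can be sketched on a small made-up table: with movies as rows and users as columns, user-based similarity compares two columns, while item-based similarity compares two rows (equivalently, two columns of the transposed table). The `toy` dataframe and the `euclidean_similarity` helper are hypothetical, introduced only for illustration; the helper uses the same 1 / (1 + distance) weighting as the Euclidean section of this recitation.

```python
import numpy as np
import pandas as pd

# Hypothetical toy ratings: rows are movies, columns are users.
toy = pd.DataFrame(
    {"Ann": [5.0, 4.0, 1.0], "Bob": [4.0, 5.0, 2.0], "Cat": [1.0, 2.0, 5.0]},
    index=["Movie A", "Movie B", "Movie C"],
)

def euclidean_similarity(x, y):
    """Map a Euclidean distance to a similarity in (0, 1]."""
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return 1.0 / (1.0 + np.linalg.norm(diff))

# User-based: compare two users, i.e. two columns.
user_sim = euclidean_similarity(toy["Ann"], toy["Cat"])

# Item-based: compare two movies, i.e. two rows ...
item_sim = euclidean_similarity(toy.loc["Movie A"], toy.loc["Movie C"])
# ... which are the columns of the transposed table.
item_sim_via_transpose = euclidean_similarity(toy.T["Movie A"], toy.T["Movie C"])

print(user_sim, item_sim, item_sim_via_transpose)
```

The same similarity function serves both approaches; only the axis along which vectors are taken changes.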

## Data Input

In [None]:
df = pd.read_csv('http://data.cs1656.org/movies_example.csv')
df

### Accessing rows in dataframe

Two ways to access the rows of a dataframe are shown below.

In [None]:
# An equality test on a column returns a Series of booleans
df['Name'] == 'The Matrix'

In [None]:
# First way to access rows
df[df['Name'] == 'The Matrix']

In [None]:
# Second way
df.iloc[0]

### Missing values in data frame

To exclude missing values (NaNs) in a dataframe, we can use the notnull() function.

In [None]:
df['Frank'].notnull()

In [None]:
df['Elaine'].notnull()

You can also perform logical operations on the boolean Series returned as shown below,

In [None]:
df['Frank'].notnull() & df['Elaine'].notnull()

You can also select the subset of rows and columns where the boolean value is True.

In [None]:
df_notmissing = df[['Frank','Elaine']][df['Frank'].notnull() & df['Elaine'].notnull()]
df_notmissing
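
For comparison, the same selection can be written with pandas' `dropna`. The sketch below uses a small made-up dataframe (`toy` is hypothetical, standing in for the movie ratings table) and checks that the two approaches give the same result.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the movie ratings table.
toy = pd.DataFrame({
    "Name": ["Movie A", "Movie B", "Movie C", "Movie D"],
    "Elaine": [4.0, np.nan, 3.0, 5.0],
    "Frank": [5.0, 2.0, np.nan, 4.0],
})

# Boolean-mask approach, as in the cell above.
mask = toy["Frank"].notnull() & toy["Elaine"].notnull()
by_mask = toy[["Frank", "Elaine"]][mask]

# Alternative: drop every row in which either column is NaN.
by_dropna = toy[["Frank", "Elaine"]].dropna()

print(by_mask.equals(by_dropna))
```

Either form keeps only the movies that both users have rated, which is what the similarity computations need.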

## Similarity Metrics & Predicted Ratings
Different metrics can be used to measure how similar two users are. In this recitation, we will use Euclidean distance, Manhattan distance, the Pearson correlation coefficient, and cosine distance.

### Euclidean 

In [None]:
sim_weights = {}
for user in df.columns[1:-1]:
    df_subset = df[['Frank',user]][df['Frank'].notnull() & df[user].notnull()]
    dist = euclidean(df_subset['Frank'], df_subset[user])
    sim_weights[user] = 1.0 / (1.0 + dist)
print ("similarity weights: %s" % sim_weights)
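
The weight formula used in the cell above, 1 / (1 + dist), turns a distance into a similarity: a distance of zero gives the maximum weight of 1, and larger distances give weights that shrink toward 0. A minimal sketch with made-up distances (the helper name `distance_to_similarity` is hypothetical):

```python
def distance_to_similarity(dist):
    """Convert a non-negative distance into a weight in (0, 1]."""
    return 1.0 / (1.0 + dist)

# Made-up distances, for illustration only.
for dist in [0.0, 1.0, 3.0, 9.0]:
    print(dist, distance_to_similarity(dist))
```

Adding 1 to the denominator also avoids a division by zero when two users have identical ratings.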

Now let's find the predicted rating of 'Frank' for 'The Matrix'. We can get all ratings for a movie by accessing a row of the dataframe using iloc, as shown earlier. We keep only the rating columns we need, selected by the slice [1:-1], which drops the first column, 'Name', and the last column, 'Frank'.

In [None]:
ratings = df.iloc[0][1:-1]
ratings

Now we compute the predicted rating as a weighted average: we multiply each user's similarity weight by that user's rating for 'The Matrix', sum the products, and divide by the sum of the weights.

In [None]:
predicted_rating = 0.0
weights_sum = 0.0
for user in df.columns[1:-1]:
    predicted_rating += ratings[user] * sim_weights[user]
    weights_sum += sim_weights[user]

predicted_rating /= weights_sum
print ("predicted rating: %f" % predicted_rating)
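
The loop above can also be written as a small reusable function. The sketch below is one possible variant, not the recitation's own code: it additionally skips users whose rating or weight is missing, which is an assumption that becomes relevant on data with many NaNs. The function name `weighted_rating` and the toy inputs are hypothetical.

```python
import numpy as np
import pandas as pd

def weighted_rating(ratings, weights):
    """Weighted average of ratings, ignoring users with a missing rating or weight."""
    ratings = pd.Series(ratings, dtype=float)
    weights = pd.Series(weights, dtype=float).reindex(ratings.index)
    usable = ratings.notnull() & weights.notnull()
    if not usable.any() or weights[usable].sum() == 0:
        return np.nan  # nothing to average over
    return (ratings[usable] * weights[usable]).sum() / weights[usable].sum()

# Hypothetical ratings and weights, made up for illustration.
toy_ratings = {"Ann": 5.0, "Bob": 3.0, "Cat": np.nan}
toy_weights = {"Ann": 0.5, "Bob": 0.25, "Cat": 0.25}
toy_prediction = weighted_rating(toy_ratings, toy_weights)
print(toy_prediction)
```

Here only Ann and Bob contribute, so the prediction is (5 × 0.5 + 3 × 0.25) / (0.5 + 0.25).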

### Manhattan (Cityblock)

We now repeat the same procedure for finding the predicted rating, this time using the cityblock distance.

In [None]:
sim_weights = {}
for user in df.columns[1:-1]:
    df_subset = df[['Frank',user]][df['Frank'].notnull() & df[user].notnull()]
    dist = cityblock(df_subset['Frank'], df_subset[user])
    sim_weights[user] = 1.0 / (1.0 + dist)
print ("similarity weights: %s" % sim_weights)

predicted_rating = 0.0
weights_sum = 0.0
ratings = df.iloc[0][1:-1]
for user in df.columns[1:-1]:
    predicted_rating += ratings[user] * sim_weights[user]
    weights_sum += sim_weights[user]

predicted_rating /= weights_sum
print ("predicted rating: %f" % predicted_rating)

### Pearson Correlation Coefficient

In [None]:
sim_weights = {}
for user in df.columns[1:-1]:
    df_subset = df[['Frank',user]][df['Frank'].notnull() & df[user].notnull()]
    sim_weights[user] = pearsonr(df_subset['Frank'], df_subset[user])[0]
print ("similarity weights: %s" % sim_weights)

predicted_rating = 0.0
weights_sum = 0.0
ratings = df.iloc[0][1:-1]
for user in df.columns[1:-1]:
    predicted_rating += ratings[user] * sim_weights[user]
    weights_sum += sim_weights[user]

predicted_rating /= weights_sum
print ("predicted rating: %s" % predicted_rating)

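Why NaN?
The Pearson correlation divides by the standard deviation of each user's ratings, so it is undefined (a division by zero) whenever one of the two users gave the same rating to every movie they share. pearsonr returns NaN in that case, and a single NaN weight turns the whole weighted sum into NaN. Recomputing the prediction while skipping users whose weight is NaN gives the following.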

In [None]:
predicted_rating = 0.0
weights_sum = 0.0
ratings = df.iloc[0][1:-1]
for user in df.columns[1:-1]:
    if (not np.isnan(sim_weights[user])):
        predicted_rating += ratings[user] * sim_weights[user]
        weights_sum += sim_weights[user]

predicted_rating /= weights_sum
print ("predicted rating: %f" % predicted_rating)
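
To see where a NaN weight can come from, here is a minimal sketch with made-up ratings in which one user rated every shared movie identically. The arrays are hypothetical; the warning filter is there only because SciPy warns about constant input.

```python
import warnings
import numpy as np
from scipy.stats import pearsonr

# Hypothetical ratings: the second user gave every shared movie the same score.
frank = np.array([5.0, 3.0, 4.0])
constant_user = np.array([4.0, 4.0, 4.0])

with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # silence the constant-input warning
    r = pearsonr(frank, constant_user)[0]

# A single NaN weight contaminates any sum it is added to.
contaminated_sum = 0.0 + 5.0 * r

print(np.isnan(r), np.isnan(contaminated_sum))
```

This is why the corrected loop checks `np.isnan` before adding a user's contribution.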

## Tasks
For your tasks, use the movie ratings data we collected from a previous class in movie_class_responses.csv. You will predict the missing movie ratings of a student based on other students with similar tastes. The first column, 'Alias', is the name of the movie, while all other columns are the user names of students. The ratings range from 1 to 5, and there are many missing values (missing movie ratings).

In [None]:
df = pd.read_csv('http://data.cs1656.org/movie_class_responses.csv')
df

**Task 1: User-based Recommendation with Cosine Metric**

For a specified user, calculate ALL missing movie ratings using user-based recommendation with Cosine Metric.
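
One detail worth keeping in mind for the cosine metric: scipy's `cosine` function returns a cosine *distance*, so a similarity can be obtained as one minus that value. The sketch below checks this against the textbook formula on two made-up vectors (`u` and `v` are hypothetical). Using the cosine similarity directly as the weight is one possible choice, not a prescribed solution.

```python
import numpy as np
from scipy.spatial.distance import cosine

# Hypothetical rating vectors over the movies two users have both rated.
u = np.array([5.0, 3.0, 4.0])
v = np.array([4.0, 2.0, 5.0])

cos_distance = cosine(u, v)
cos_similarity = 1.0 - cos_distance

# Textbook definition: dot product divided by the product of the norms.
manual_similarity = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

print(cos_similarity, manual_similarity)
```

As in the earlier sections, the vectors passed to `cosine` should contain only the entries where both ratings are present.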

**Task 2: Item-based Recommendation with Cosine Metric**

Repeat the task above using item-based recommendation instead of user-based recommendation. To calculate a missing movie rating with item-based recommendation, you need to find the similarity between movies instead of between users. In other words, you measure the similarity between the movie whose rating is missing and the movies that the user has already rated. Then compute a weighted average of the user's ratings for those movies, using the movie similarities as weights, to obtain the predicted rating. You need to predict ALL missing movie ratings for the user.

**Task 3: User-based Recommendation with Cosine Metric (Top 10 Similar Users)**

Repeat Task 1, but compute the weighted average using only the 10 most similar users instead of all users.
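
Selecting the most similar users from a dictionary of weights, such as the `sim_weights` dictionaries built earlier, could be sketched as follows. The helper `top_k_weights` and the toy weights are hypothetical; dropping NaN weights before ranking is an assumption made here so that the sort order is well defined.

```python
import math

def top_k_weights(sim_weights, k=10):
    """Keep the k largest similarity weights, dropping NaN weights first."""
    valid = {user: w for user, w in sim_weights.items() if not math.isnan(w)}
    ranked = sorted(valid.items(), key=lambda item: item[1], reverse=True)
    return dict(ranked[:k])

# Made-up weights, for illustration only.
toy_weights = {"Ann": 0.9, "Bob": 0.2, "Cat": 0.7, "Dan": float("nan")}
top2 = top_k_weights(toy_weights, k=2)
print(top2)
```

The resulting dictionary can then replace the full set of weights in the weighted-average loop.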