# Collaborative Filtering for Recommender Systems

## Types of Collaborative Filtering

**User-Based Collaborative Filtering** (or _user-item filtering_) takes a particular user, finds users who are similar to that user based on their ratings, and recommends items that those similar users liked.

>- *User-Item Collaborative Filtering*: “Users who are similar to you also liked …”

In contrast, **Item-Based Collaborative Filtering** (or _item-item filtering_) takes an item, finds users who liked that item, and finds other items that those users or similar users also liked. It takes items as input and outputs other items as recommendations.

>- *Item-Item Collaborative Filtering*: “Users who liked this item also liked …”
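
To make the item-item idea concrete, here is a minimal sketch; it is not the method used in the rest of this post. It assumes a small made-up ratings matrix (rows are users, columns are items, 0 means "not rated") and a hypothetical helper `most_similar_items` that compares items by the cosine similarity of their rating columns.

```python
import numpy as np

# hypothetical ratings: rows = users, columns = items, 0 = not rated
ratings = np.array([
    [5, 4, 0],
    [4, 5, 1],
    [0, 1, 5],
])
items = ["item_a", "item_b", "item_c"]

def most_similar_items(ratings, item_idx):
    # cosine similarity between the chosen item's column and every item column
    cols = ratings.T.astype(float)
    norms = np.linalg.norm(cols, axis=1)
    sims = cols @ cols[item_idx] / (norms * norms[item_idx])
    # order by similarity (highest first) and drop the item itself
    order = np.argsort(-sims)
    return [int(i) for i in order if i != item_idx]

ranked = most_similar_items(ratings, 0)
print([items[i] for i in ranked])
```

Here `item_b` comes out ahead of `item_c` because the users who rated `item_a` highly gave `item_b` similar scores.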

<br>

This post focuses mainly on **User-Based Collaborative Filtering**.

## Similarity

### Cosine Similarity

A similarity measure commonly used in recommender systems is **cosine similarity**, where the ratings are seen as vectors in ``n``-dimensional space and the similarity is calculated from the angle between these vectors.

- The smaller the angle ($\theta$), the more closely the vectors point in the same direction.
- The smaller the angle ($\theta$), the greater the similarity.

$$A \cdot B = ||A|| \cdot ||B|| \cdot \cos\theta$$

$$\text{similarity} = \cos(\theta) = \dfrac{A \cdot B}{||A|| \cdot ||B||} = \dfrac{\sum_{i=1}^{n} A_i B_i}{{\sqrt{\sum_{i=1}^{n}(A_i)^2}} \cdot {\sqrt{\sum_{i=1}^{n}(B_i)^2}} }$$
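
The relationship between the angle and the similarity can be checked numerically. The sketch below uses two hypothetical helpers, `cosine_sim` and `angle_deg`, and a few made-up 2-D vectors: vectors pointing the same way give an angle near 0°, perpendicular vectors 90°, and opposite vectors 180°.

```python
import numpy as np

def cosine_sim(a, b):
    # dot product divided by the product of the vector lengths
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def angle_deg(a, b):
    # clip guards against tiny floating-point overshoot outside [-1, 1]
    return float(np.degrees(np.arccos(np.clip(cosine_sim(a, b), -1.0, 1.0))))

print(angle_deg([1, 2], [2, 4]))    # same direction
print(angle_deg([1, 0], [0, 1]))    # perpendicular
print(angle_deg([1, 2], [-1, -2]))  # opposite direction
```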



* * * *

### practice cosine similarity with numpy

In [1]:
# import libraries
import numpy as np
import pandas as pd
from scipy import spatial

In [2]:
# sample data
vector_1 = np.array([1,2,3,4,5])
vector_2 = np.array([5,4,3,2,1])
vector_3 = np.array([11,19,28,32,47])

In [5]:
# numerator: inner (dot) product of the two vectors
np.dot(vector_1, vector_2)

# denominator: product of the two vector lengths (norms)
np.sqrt(np.sum(vector_1 * vector_1)) * np.sqrt(np.sum(vector_2 * vector_2))

# cosine similarity = numerator / denominator
np.dot(vector_1, vector_2) / (np.sqrt(np.sum(vector_1 * vector_1)) * np.sqrt(np.sum(vector_2 * vector_2)))

0.63636363636363635

### practice cosine similarity with scipy

$$\text{cosine distance}(A, B) = 1 - \dfrac{\sum_{i=1}^{n} A_i B_i}{{\sqrt{\sum_{i=1}^{n}(A_i)^2}} \cdot {\sqrt{\sum_{i=1}^{n}(B_i)^2}} }$$

$$\text{cosine similarity}(A, B) = 1 - \left( 1 - \dfrac{\sum_{i=1}^{n} A_i B_i}{{\sqrt{\sum_{i=1}^{n}(A_i)^2}} \cdot {\sqrt{\sum_{i=1}^{n}(B_i)^2}} } \right)$$



In [18]:
1 - spatial.distance.cosine(vector_1, vector_2)

0.63636363636363635
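
The sample data also defines `vector_3`, which is not compared above. As a further sketch (using a hypothetical NumPy helper `cosine_sim` rather than scipy), `vector_3` points in nearly the same direction as `vector_1`, so its similarity should be close to 1, and scaling a vector should not change the similarity at all.

```python
import numpy as np

vector_1 = np.array([1, 2, 3, 4, 5])
vector_2 = np.array([5, 4, 3, 2, 1])
vector_3 = np.array([11, 19, 28, 32, 47])

def cosine_sim(a, b):
    # dot product divided by the product of the vector lengths
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

sim_12 = cosine_sim(vector_1, vector_2)           # reversed order of values
sim_13 = cosine_sim(vector_1, vector_3)           # roughly a scaled copy of vector_1
sim_scaled = cosine_sim(vector_1, 10 * vector_1)  # exact scaled copy

print(sim_12, sim_13, sim_scaled)
```

Cosine similarity depends only on direction, which is why a user who rates everything roughly ten times higher still counts as similar.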

* * * *

### Recommender System and Cosine Similarity

We'll build a simple user-based recommender system using cosine similarity, and then evaluate the system using MSE, RMSE, and MAE.

#### Create Sample Data
- Movie recommendation system based on scores from users

In [12]:
# sample dataset matrix

columns = ["movie_1","movie_2","movie_3","movie_4","movie_5"]
index = ["user_1", "user_2", "user_3", "user_4"]
data = np.array([
    [5,3,0,0,2],
    [2,0,0,1,4],
    [0,0,4,3,1],
    [4,0,4,5,0],
])

In [13]:
sample_df = pd.DataFrame(data, columns=columns, index=index)
sample_df

| | movie_1 | movie_2 | movie_3 | movie_4 | movie_5 |
|---|---|---|---|---|---|
| user_1 | 5 | 3 | 0 | 0 | 2 |
| user_2 | 2 | 0 | 0 | 1 | 4 |
| user_3 | 0 | 0 | 4 | 3 | 1 |
| user_4 | 4 | 0 | 4 | 5 | 0 |


#### cosine similarity function

In [14]:
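def cosine_similarity(vector_1, vector_2):
    
    # convert to NumPy arrays first (recent pandas versions no longer provide Series.nonzero())
    vector_1, vector_2 = np.asarray(vector_1), np.asarray(vector_2)
    
    # positions where vector_1 is non-zero (i.e. rated);
    # only vector_1's zeros are filtered, so the result is not symmetric
    idx = vector_1.nonzero()[0]
    
    # keep only those positions in both vectors
    vector_1, vector_2 = vector_1[idx], vector_2[idx]
    
    # cosine similarity = 1 - cosine distance
    return 1 - spatial.distance.cosine(vector_1, vector_2)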

In [15]:
cosine_similarity(sample_df.loc['user_1'], sample_df.loc['user_2'])

0.65292862509901051

#### similarity matrix function

In [17]:
def similarity_matrix(sample_df, similarity_func):
    
    # save index
    index = sample_df.index
    
    # transpose the dataframe
    # index : movie
    # columns : user
    df = sample_df.T
    
    # calculate the similarity between every pair of users and collect it into a matrix
    matrix = []
    
    for idx_1, value_1 in df.items():
    
        # save row
        row = []
    
        for idx_2, value_2 in df.items():
    
            # similarity between two user data
            row.append(similarity_func(value_1, value_2))
        
        matrix.append(row)
        
    return pd.DataFrame(matrix, columns=index, index=index)
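
The nested loop above calls `similarity_func` once per pair of users. As a rough sketch of an alternative (not the notebook's code), the same one-sided cosine similarity can be computed with matrix products in NumPy. The function name `one_sided_similarity_matrix` is hypothetical, and the sketch assumes every pair of users shares at least one rated movie (otherwise it would divide by zero).

```python
import numpy as np

# the sample ratings: rows = user_1..user_4, columns = movie_1..movie_5
ratings = np.array([
    [5, 3, 0, 0, 2],
    [2, 0, 0, 1, 4],
    [0, 0, 4, 3, 1],
    [4, 0, 4, 5, 0],
], dtype=float)

def one_sided_similarity_matrix(ratings):
    rated = (ratings != 0).astype(float)        # rated[i, k] = 1 if user i rated movie k
    dot = ratings @ ratings.T                   # zeros in row i drop user i's unrated movies
    norm_i = np.linalg.norm(ratings, axis=1)    # length of row i (zeros add nothing)
    norm_j = np.sqrt(rated @ (ratings ** 2).T)  # length of row j over movies rated by user i
    return dot / (norm_i[:, None] * norm_j)

sim = one_sided_similarity_matrix(ratings)
print(np.round(sim, 6))
```

Entry `sim[i, j]` corresponds to `cosine_similarity(user_i, user_j)`, so rows and columns line up with the dataframe returned by `similarity_matrix`.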

In [11]:
sm_df = similarity_matrix(sample_df, cosine_similarity)
sm_df

| | user_1 | user_2 | user_3 | user_4 |
|---|---|---|---|---|
| user_1 | 1.0 | 0.652929 | 0.324443 | 0.811107 |
| user_2 | 0.729397 | 1.0 | 0.483046 | 0.443039 |
| user_3 | 0.196116 | 0.332956 | 1.0 | 0.949474 |
| user_4 | 0.529813 | 0.770054 | 0.82121 | 1.0 |
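
Note that the matrix above is not symmetric: row user_1 / column user_2 is 0.652929, while row user_2 / column user_1 is 0.729397. This follows from `cosine_similarity` keeping only the positions where its first argument is non-zero. The sketch below reproduces both values with a NumPy helper `one_sided`, and contrasts them with `co_rated`, a hypothetical variant (an assumption, not the post's method) that keeps only movies rated by both users and is therefore symmetric.

```python
import numpy as np

user_1 = np.array([5, 3, 0, 0, 2], dtype=float)
user_2 = np.array([2, 0, 0, 1, 4], dtype=float)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def one_sided(a, b):
    # as in the post: keep only positions where the FIRST vector is non-zero
    keep = a != 0
    return cosine(a[keep], b[keep])

def co_rated(a, b):
    # alternative: keep only positions rated by BOTH users
    keep = (a != 0) & (b != 0)
    return cosine(a[keep], b[keep])

print(one_sided(user_1, user_2), one_sided(user_2, user_1))
print(co_rated(user_1, user_2), co_rated(user_2, user_1))
```

The co-rated variant has its own weakness: when two users share only one rated movie, the similarity is always 1.0, so in practice it is usually combined with a minimum-overlap rule.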


#### Choose the user who will receive recommendations and how many similar users to consider

In [19]:
user, closer_count = "user_1", 2

#### drop "user_1" data

In [20]:
ms_df = sm_df.drop(user)
ms_df

| | user_1 | user_2 | user_3 | user_4 |
|---|---|---|---|---|
| user_2 | 0.729397 | 1.0 | 0.483046 | 0.443039 |
| user_3 | 0.196116 | 0.332956 | 1.0 | 0.949474 |
| user_4 | 0.529813 | 0.770054 | 0.82121 | 1.0 |


#### sort by similarity

In [22]:
ms_df = ms_df.sort_values(user, ascending=False)
ms_df

| | user_1 | user_2 | user_3 | user_4 |
|---|---|---|---|---|
| user_2 | 0.729397 | 1.0 | 0.483046 | 0.443039 |
| user_4 | 0.529813 | 0.770054 | 0.82121 | 1.0 |
| user_3 | 0.196116 | 0.332956 | 1.0 | 0.949474 |


#### filter two users with high similarity

In [23]:
ms_df = ms_df[:closer_count]
ms_df

| | user_1 | user_2 | user_3 | user_4 |
|---|---|---|---|---|
| user_2 | 0.729397 | 1.0 | 0.483046 | 0.443039 |
| user_4 | 0.529813 | 0.770054 | 0.82121 | 1.0 |


#### get data of user_2 and user_4 from the sample data

In [24]:
sample_df.loc[ms_df.index]

| | movie_1 | movie_2 | movie_3 | movie_4 | movie_5 |
|---|---|---|---|---|---|
| user_2 | 2 | 0 | 0 | 1 | 4 |
| user_4 | 4 | 0 | 4 | 5 | 0 |


### Mean score functions

In [26]:
def mean_score(sample_df, sm_df, target, closer_count):
    
    # similarities of every other user to the target, most similar first
    ms_df = sm_df.drop(target)
    ms_df = ms_df.sort_values(target, ascending=False)
    ms_df = ms_df[target][:closer_count]

    # ratings of the `closer_count` most similar users
    ms_df = sample_df.loc[ms_df.index]

    # result: the target's actual ratings and the neighbours' mean rating per movie
    pred_df = pd.DataFrame(columns=sample_df.columns)
    pred_df.loc["user"] = sample_df.loc[target]
    pred_df.loc["mean"] = ms_df.mean()
    return pred_df

### get result
- sample_df : sample dataframe
- sm_df : similarity matrix dataframe

In [28]:
target, closer_count = "user_1", 2
pred_df = mean_score(sample_df, sm_df, target, closer_count)
pred_df

| | movie_1 | movie_2 | movie_3 | movie_4 | movie_5 |
|---|---|---|---|---|---|
| user | 5 | 3 | 0 | 0 | 2 |
| mean | 3 | 0 | 2 | 3 | 2 |
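
`mean_score` gives both neighbours equal weight. A common variation, shown here only as a sketch and not as part of the original notebook, weights each neighbour's ratings by its similarity to the target. The weights are the user_1 column values for user_2 and user_4 from the similarity matrix.

```python
import numpy as np

# ratings of the two nearest neighbours of user_1
neighbour_ratings = np.array([
    [2, 0, 0, 1, 4],   # user_2
    [4, 0, 4, 5, 0],   # user_4
], dtype=float)

# similarity of user_2 and user_4 to user_1
weights = np.array([0.729397, 0.529813])

plain_mean = neighbour_ratings.mean(axis=0)
weighted_mean = np.average(neighbour_ratings, axis=0, weights=weights)

print(plain_mean)
print(np.round(weighted_mean, 3))
```

Because user_2 is the closer neighbour, the weighted scores move toward user_2's ratings: movie_4 drops a little and movie_5 rises. Like the plain mean, this sketch still treats a 0 (not rated) as an actual score.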


### return the movies the user hasn't watched (or rated), sorted by mean score

In [29]:
recommand_df = pred_df.T
recommand_df

| | user | mean |
|---|---|---|
| movie_1 | 5 | 3 |
| movie_2 | 3 | 0 |
| movie_3 | 0 | 2 |
| movie_4 | 0 | 3 |
| movie_5 | 2 | 2 |


In [30]:
recommand_df = recommand_df[recommand_df["user"] == 0]
recommand_df

| | user | mean |
|---|---|---|
| movie_3 | 0 | 2 |
| movie_4 | 0 | 3 |


In [31]:
recommand_df = recommand_df.sort_values("mean", ascending=False)
print(list(recommand_df.index))
recommand_df

['movie_4', 'movie_3']


| | user | mean |
|---|---|---|
| movie_4 | 0 | 3 |
| movie_3 | 0 | 2 |


#### Now we have the final result
- user_1 hasn't rated movie_3 and movie_4
- according to the mean score from the most similar users (user_2 and user_4), 
    - movie_4 has the higher score, so it is recommended first.
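
The steps above can be condensed into one function. The following is a NumPy-only sketch under the same assumptions as the walkthrough (one-sided cosine similarity, plain mean of the `k` nearest users, 0 meaning "not rated"); the names `one_sided_cosine` and `recommend` are hypothetical and do not appear in the notebook.

```python
import numpy as np

movies = ["movie_1", "movie_2", "movie_3", "movie_4", "movie_5"]
users = ["user_1", "user_2", "user_3", "user_4"]
ratings = np.array([
    [5, 3, 0, 0, 2],
    [2, 0, 0, 1, 4],
    [0, 0, 4, 3, 1],
    [4, 0, 4, 5, 0],
], dtype=float)

def one_sided_cosine(a, b):
    # cosine similarity over the positions where `a` is non-zero
    keep = a != 0
    a, b = a[keep], b[keep]
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def recommend(ratings, target, k):
    others = [u for u in range(len(ratings)) if u != target]
    # similarity of every other user to the target (the target's column in sm_df)
    others.sort(key=lambda u: one_sided_cosine(ratings[u], ratings[target]), reverse=True)
    neighbours = others[:k]
    # plain mean of the neighbours' ratings
    mean = ratings[neighbours].mean(axis=0)
    # movies the target has not rated, highest mean score first
    unrated = [m for m in range(ratings.shape[1]) if ratings[target, m] == 0]
    return sorted(unrated, key=lambda m: mean[m], reverse=True)

result = [movies[m] for m in recommend(ratings, users.index("user_1"), 2)]
print(result)  # ['movie_4', 'movie_3']
```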

## Performance Evaluation

In the dataframe, the 'user' row holds the actual ratings and the 'mean' row holds the predicted ratings. We can evaluate performance by measuring the error between these two vectors. The resulting score can serve as a benchmark for comparing later versions (trained on more data, using a better algorithm, etc.).

In [40]:
pred_df

| | movie_1 | movie_2 | movie_3 | movie_4 | movie_5 |
|---|---|---|---|---|---|
| user | 5 | 3 | 0 | 0 | 2 |
| mean | 3 | 0 | 2 | 3 | 2 |


### Mean Squared Error

$$ MSE = \dfrac{\sum_{i=1}^{n}(y_i - \hat{y_i})^2}{n} \quad (y_i : \text{real value},\; \hat{y_i}: \text{predicted value})$$

In [32]:
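def mse(value, pred):
    
    # keep only the movies the user actually rated (non-zero ratings);
    # convert to NumPy arrays first (recent pandas versions no longer provide Series.nonzero())
    value, pred = np.asarray(value, dtype=float), np.asarray(pred, dtype=float)
    idx = value.nonzero()[0]
    value, pred = value[idx], pred[idx]
    
    # calculate according to the formula
    return sum((value - pred)**2) / len(idx)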

mse(pred_df.loc["user"], pred_df.loc["mean"])

4.333333333333333
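
As a check by hand: user_1 rated movie_1, movie_2 and movie_5 (the zero entries are dropped), so only those three columns of the 'user' and 'mean' rows enter the formula:

$$ MSE = \dfrac{(5-3)^2 + (3-0)^2 + (2-2)^2}{3} = \dfrac{4 + 9 + 0}{3} = \dfrac{13}{3} \approx 4.333 $$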

#### Evaluation on total user data

In [33]:
def evaluate(df, sm_df, closer_count, algorithm):
    
    # user list
    users = df.index
    evaluate_list = []
    
    # compute the chosen metric for every user
    for target in users:
        result_df = mean_score(df, sm_df, target, closer_count)
        evaluate_list.append(algorithm(result_df.loc["user"], result_df.loc["mean"]))

    # return the average of the per-user scores
    return np.average(evaluate_list)

In [34]:
evaluate(sample_df, sm_df, 2, mse)

4.5

### Root Mean Squared Error

$$ RMSE = \sqrt{\dfrac{\sum_{i=1}^{n}(y_i - \hat{y_i})^2}{n}} \quad (y_i : \text{real value},\; \hat{y_i}: \text{predicted value})$$

In [35]:
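def rmse(value, pred):
    
    # keep only the movies the user actually rated (non-zero ratings);
    # convert to NumPy arrays first (recent pandas versions no longer provide Series.nonzero())
    value, pred = np.asarray(value, dtype=float), np.asarray(pred, dtype=float)
    idx = value.nonzero()[0]
    value, pred = value[idx], pred[idx]
    
    # calculate according to the formula
    return np.sqrt(sum((value - pred)**2) / len(idx))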

rmse(pred_df.loc["user"], pred_df.loc["mean"])

2.0816659994661326

In [36]:
evaluate(sample_df, sm_df, 2, rmse)

2.0677918275480169

### Mean Absolute Error

$$ MAE = \dfrac{\sum_{i=1}^{n}|e_i|}{n} $$

$$ e_i = y_i - \hat{y_i}\;\;\;(y_i : \text{real value}, \hat{y_i}: \text{predicted value})$$
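
In this formula the absolute value is applied to each error before summing. The short sketch below, using made-up actual and predicted ratings, shows why the order matters: taking the absolute value after summing lets positive and negative errors cancel each other out.

```python
import numpy as np

# illustrative actual and predicted ratings whose errors have mixed signs
actual = np.array([2.0, 1.0, 4.0])
predicted = np.array([4.5, 2.5, 1.0])
errors = actual - predicted            # [-2.5, -1.5, 3.0]

mae = np.mean(np.abs(errors))          # absolute value per element, then average
abs_of_mean = np.abs(np.mean(errors))  # errors cancel first: this is not MAE

print(mae, abs_of_mean)
```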

In [37]:
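# MAE on one user
def mae(value, pred):
    
    # keep only the movies the user actually rated (non-zero ratings);
    # convert to NumPy arrays first (recent pandas versions no longer provide Series.nonzero())
    value, pred = np.asarray(value, dtype=float), np.asarray(pred, dtype=float)
    idx = value.nonzero()[0]
    value, pred = value[idx], pred[idx]
    
    # take the absolute value of each error before summing, as in the formula
    return sum(np.absolute(value - pred)) / len(idx)

In [38]:
mae(pred_df.loc['user'], pred_df.loc['mean'])

1.6666666666666667

In [39]:
round(evaluate(sample_df, sm_df, 2, mae), 4)

1.8333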