# Collaborative Filtering

Sometimes called user-to-user recommendation. This approach to the recommendation problem answers the question "Users like you also like ..." by recommending items that similar users rated highly.

Resources:
- https://towardsdatascience.com/various-implementations-of-collaborative-filtering-100385c6dfe0

In [1]:
from itertools import combinations

import numpy as np
import pandas as pd
from scipy import spatial
from sklearn.decomposition import NMF

## Similarity Measure of Users

We can predict user u's rating for item i by taking a weighted sum of the ratings that all other users (u′) gave to item i, where each weight is the similarity between that user and user u. The sum is then normalized by the total absolute similarity.

We use Cosine Similarity in this example, but this isn't the only measure (https://surprise.readthedocs.io/en/stable/similarities.html)
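To make the weighted sum concrete, here is a minimal sketch on a tiny, hypothetical ratings matrix (three users, three items, 0 meaning "not rated"). The toy data and the helper name `cosine_similarity` are assumptions for illustration and are not part of the MovieLens example.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two rating vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical toy ratings: rows are users, columns are items, 0 = not rated.
ratings = np.array([
    [5.0, 3.0, 0.0],  # target user u (has not rated item 2)
    [4.0, 3.0, 4.0],  # neighbour v1
    [1.0, 0.0, 2.0],  # neighbour v2
])

u, i = 0, 2
others = [v for v in range(len(ratings)) if v != u]

# Similarity between the target user and every other user.
sims = np.array([cosine_similarity(ratings[u], ratings[v]) for v in others])

# Ratings the other users gave to item i.
neighbor_ratings = ratings[others, i]

# Similarity-weighted sum, normalized by the total absolute similarity.
prediction = np.dot(sims, neighbor_ratings) / np.sum(np.abs(sims))
print(prediction)
```

Because v1 is more similar to the target user than v2, the prediction lands between the two neighbours' ratings (2 and 4) but closer to v1's rating of 4.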



In [2]:
# Import Movielens small dataset
data = pd.read_csv('../data/movielens_small/ratings.csv')

# Let's create a user-item matrix
df = data.drop(['timestamp'], axis=1)
df = df.pivot(index='userId', columns='movieId', values='rating').fillna(0)

df.head()

movieId,1,2,3,4,5,6,7,8,9,10,...,193565,193567,193571,193573,193579,193581,193583,193585,193587,193609
userId,,,,,,,,,,,,,,,,,,,,,
1,4.0,0.0,4.0,0.0,0.0,4.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [3]:
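def cosine_sim(a, b):
    """
    Cosine similarity of vectors a and b.

    Note: slow when called once per user pair in a Python loop.
    """
    # scipy returns the cosine *distance*, so subtract it from 1
    return 1 - spatial.distance.cosine(a, b)
    #return np.dot(a, b) / (np.linalg.norm(a)*np.linalg.norm(b))  # Slower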


    
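# Pre-calculate user-to-user similarity scores.
# The matrix starts as float zeros; the diagonal stays 0 so a user never counts as their own neighbour.
user_sim = pd.DataFrame(0.0, index=df.index.values, columns=df.index.values)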
user_comb = list(combinations(df.index.values, 2))
for u, v in user_comb:
    # Get similarity
    u_arr = df.loc[u].values
    v_arr = df.loc[v].values
    sim = cosine_sim(u_arr, v_arr)
    
    # Similarity is symmetric, so fill both (u, v) and (v, u)
    user_sim.loc[u, v] = sim
    user_sim.loc[v, u] = sim

# For example, user 1 and movie 4
u = 1
i = 4

# Note:
# To produce recommendations you would score every item for the user and then rank by score.
# However, that is expensive and there are already optimized solutions for doing that, so we are just doing one item.

# Calculate the average rating weighted by similarity, then normalize.
# Caveat: unrated entries are stored as 0, so users who never rated item i pull the estimate toward 0.
#rating_hat = np.sum([user_sim.loc[u, v]*df.loc[v, i] for v in df.index.values if u != v])
rating_hat = np.dot(user_sim[u], df.loc[:, i])
#norm = np.sum([np.absolute(user_sim.loc[u, v]) for v in df.index.values if u != v])
norm = np.sum(np.absolute(user_sim[u]))
rating_hat_norm = rating_hat / norm

rating_hat_norm

0.027651548451892778
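The pairwise loop above is slow, and the estimate is very small because users who never rated the item still count in the normalization. A possible alternative, sketched here with NumPy only, computes all similarities with one matrix product and averages only over users who actually rated the item. The function names `user_similarity_matrix` and `predict_rating` and the toy data are hypothetical, and treating 0 as "not rated" is an assumption carried over from the `fillna(0)` step.

```python
import numpy as np
import pandas as pd

def user_similarity_matrix(ratings: pd.DataFrame) -> pd.DataFrame:
    """All pairwise cosine similarities between rows, with a zeroed diagonal."""
    values = ratings.to_numpy(dtype=float)
    norms = np.linalg.norm(values, axis=1, keepdims=True)
    norms[norms == 0] = 1.0          # avoid dividing by zero for empty rows
    unit = values / norms            # each row scaled to unit length
    sim = unit @ unit.T              # dot products of unit vectors = cosines
    np.fill_diagonal(sim, 0.0)       # a user is not their own neighbour
    return pd.DataFrame(sim, index=ratings.index, columns=ratings.index)

def predict_rating(ratings: pd.DataFrame, sim: pd.DataFrame, user, item) -> float:
    """Similarity-weighted mean over the users who actually rated `item`."""
    rated = ratings[item] > 0                       # assumption: 0 means "not rated"
    weights = sim.loc[user, rated.to_numpy()]
    denom = weights.abs().sum()
    if denom == 0:
        return float("nan")                         # no usable neighbours
    return float((weights * ratings.loc[rated, item]).sum() / denom)

# Hypothetical toy data: index = user ids, columns = item ids.
toy = pd.DataFrame(
    [[5.0, 3.0, 0.0], [4.0, 3.0, 4.0], [1.0, 0.0, 2.0], [0.0, 4.0, 0.0]],
    index=[1, 2, 3, 4],
    columns=[10, 20, 30],
)
toy_sim = user_similarity_matrix(toy)
toy_pred = predict_rating(toy, toy_sim, user=1, item=30)
print(toy_pred)
```

With this variant, user 4 (who never rated item 30) no longer affects the result, so the prediction stays on the same scale as the ratings themselves.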

## Matrix Factorization

Matrix factorization decomposes the user-item matrix into user and item embeddings and then uses those embeddings to predict ratings for (user, item) pairs. Non-negative Matrix Factorization (NMF) is one method of doing that.

Additional Matrix Factorization Methods
- Probabilistic Matrix Factorization (PMF) (Python Package - Surprise http://surpriselib.com/)
- Singular Value Decomposition (SVD) (Python Package - Surprise http://surpriselib.com/)
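To show what the factorization does, here is a minimal sketch of NMF using the classic multiplicative update rule in plain NumPy. This is not the solver scikit-learn uses internally; the function name `nmf_multiplicative`, the iteration count, and the toy matrix are assumptions for illustration. As in the notebook, zeros are treated as real values rather than as missing ratings.

```python
import numpy as np

def nmf_multiplicative(V, n_components, n_iter=200, seed=0, eps=1e-9):
    """Factor a non-negative matrix V into W @ H with multiplicative updates."""
    rng = np.random.default_rng(seed)
    n_users, n_items = V.shape
    W = rng.random((n_users, n_components))   # user embeddings
    H = rng.random((n_components, n_items))   # item embeddings
    for _ in range(n_iter):
        # Each update rescales entries by a non-negative ratio,
        # so W and H never become negative.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Hypothetical toy user-item matrix (5 users, 4 items).
V = np.array([
    [5, 3, 0, 1],
    [4, 0, 0, 1],
    [1, 1, 0, 5],
    [1, 0, 0, 4],
    [0, 1, 5, 4],
], dtype=float)

W_toy, H_toy = nmf_multiplicative(V, n_components=2)
V_hat = W_toy @ H_toy   # reconstructed matrix, including estimates for the zeros

print(np.round(V_hat, 2))
print("reconstruction error:", np.linalg.norm(V - V_hat))
```

The entries of `V_hat` at positions where `V` is 0 are the model's estimates for those (user, item) pairs, which is the same idea the scikit-learn `NMF` cells below apply to the full MovieLens matrix.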

In [4]:
# Import Movielens small dataset
data = pd.read_csv('../data/movielens_small/ratings.csv')

# Let's create a user-item matrix
df = data.drop(['timestamp'], axis=1)
df = df.pivot(index='userId', columns='movieId', values='rating').fillna(0)

df.head()

movieId,1,2,3,4,5,6,7,8,9,10,...,193565,193567,193571,193573,193579,193581,193583,193585,193587,193609
userId,,,,,,,,,,,,,,,,,,,,,
1,4.0,0.0,4.0,0.0,0.0,4.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [5]:
model = NMF(n_components=25, init='random', random_state=0)
W = model.fit_transform(df)
H = model.components_

In [6]:
print(W.shape)
print(W)

(610, 25)
[[0.25289547 0.         0.         ... 0.01260417 0.         0.        ]
 [0.         0.         0.20028585 ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.         0.01435548 0.        ]
 ...
 [0.19460835 0.154867   0.         ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.         8.29225927 0.        ]]


In [7]:
print(H.shape)
print(H)

(25, 9724)
[[0.         0.08769692 0.06663176 ... 0.         0.         0.        ]
 [0.11264899 0.08577695 0.12754013 ... 0.         0.         0.        ]
 [0.         0.28529324 0.         ... 0.00469229 0.00469229 0.        ]
 ...
 [1.0496118  0.65715279 0.58232139 ... 0.         0.         0.        ]
 [0.59217965 0.01705675 0.         ... 0.         0.         0.00588314]
 [0.         1.31053144 0.         ... 0.01212665 0.01212665 0.        ]]


In [8]:
# We can now use the W (user) and H (item) embeddings to estimate ratings for unrated items

# For example: user 1 (row 0 of W) and item 4 (column 3 of H)

np.dot(W[0], H[:, 3])

0.023073640044190838

In [9]:
# ... or we can simply calculate the whole matrix

df_hat = pd.DataFrame(np.dot(W, H), index=df.index.values, columns=df.columns.values)
df_hat

userId,1,2,3,4,5,6,7,8,9,10,...,193565,193567,193571,193573,193579,193581,193583,193585,193587,193609
1,2.502286,1.063538,1.021043,0.023074,0.259647,1.742920,0.394722,0.019058,0.122244,1.681873,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.007025
2,0.300478,0.095122,0.000000,0.000000,0.019522,0.081120,0.000000,0.000000,0.000000,0.028265,...,0.002595,0.002225,0.002966,0.002966,0.002595,0.002966,0.002595,0.002595,0.002595,0.008072
3,0.089395,0.039485,0.060553,0.000000,0.000767,0.050289,0.010517,0.000000,0.000809,0.061269,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000084
4,1.922919,0.345081,0.229664,0.060621,0.308501,0.912845,0.322713,0.017734,0.135406,0.244940,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
5,0.935526,0.997915,0.291425,0.088075,0.331870,0.455761,0.370419,0.109717,0.079032,1.352347,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.001391
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
606,1.875818,0.000000,0.000000,0.019654,0.019261,0.000000,2.672544,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
607,1.729569,1.100820,0.375177,0.036632,0.147921,1.578620,0.172289,0.074629,0.018680,1.913489,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.001648
608,3.655791,1.381658,1.156974,0.057763,0.308798,2.743766,0.422516,0.358971,0.153624,4.086133,...,0.000026,0.000022,0.000029,0.000029,0.000026,0.000029,0.000026,0.000026,0.000026,0.000000
609,0.517193,0.743038,0.143971,0.056574,0.159381,0.241274,0.178848,0.078200,0.000000,1.068756,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.001612


In [10]:
# Now we can pull a single user and sort their items to get recommendations.

# Note: This will include items the user has already rated. For example, user 1 already rated item 1196 a 5.

print("Top 5")
user_1_hat = df_hat.loc[1]
user_1_hat.sort_values(ascending=False)[:5]

Top 5


1196    5.663548
260     5.336826
1210    4.970881
589     4.948825
2571    4.923245
Name: 1, dtype: float64

In [11]:
# ... we can filter out already reviewed items
user_1 = df.loc[1]
# A rating of 0 marks an item the user has not reviewed, so keep only those
unreviewed_items_idx = user_1[user_1 == 0].index.values

print("Top 5 Not Reviewed")
user_1_hat = df_hat.loc[1]
user_1_hat.filter(items=unreviewed_items_idx).sort_values(ascending=False)[:5]

Top 5 Not Reviewed


589     4.948825
1200    4.729397
1036    3.586060
2762    3.454161
1374    3.185944
Name: 1, dtype: float64
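The two steps above (score every item, then drop the ones already rated) can be wrapped in a small helper. This is only a sketch: the function name `recommend_top_n` and the toy Series are hypothetical, and it assumes, as the notebook does, that a rating of 0 means "not rated".

```python
import pandas as pd

def recommend_top_n(observed: pd.Series, predicted: pd.Series, n: int = 5) -> pd.Series:
    """Highest predicted scores among items the user has not rated yet."""
    unrated = observed.index[observed == 0]      # assumption: 0 means "not rated"
    return predicted.loc[unrated].nlargest(n)

# Hypothetical data for one user: index = item ids.
observed = pd.Series({10: 5.0, 20: 0.0, 30: 0.0, 40: 3.0})
predicted = pd.Series({10: 4.8, 20: 1.2, 30: 3.9, 40: 2.7})

top = recommend_top_n(observed, predicted, n=2)
print(top)
```

With the notebook's variables, a call such as `recommend_top_n(df.loc[1], df_hat.loc[1])` should give the same kind of result as the cell above.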

## Classification / Regression from Matrix Factorization Embeddings

We can take these embeddings and turn them into columnar input for a more traditional classification or regression task. This approach generally works better than matrix factorization alone when the data is very sparse and imbalanced. State-of-the-art recommendation systems generally use deep learning, where many types of collaborative and content data can be concatenated together.
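As a rough sketch of the "columnar input" idea, each observed (user, item) pair can become one row made of the user's embedding followed by the item's embedding, with the rating as the target. Everything below is hypothetical: the random embeddings stand in for `W` and `H`, the ratings are made up, and ordinary least squares via `np.linalg.lstsq` is only a placeholder for whatever classifier or regressor you would actually use.

```python
import numpy as np

def build_features(W, H, pairs):
    """One row per (user_pos, item_pos): user embedding followed by item embedding."""
    return np.array([np.concatenate([W[u], H[:, i]]) for u, i in pairs])

rng = np.random.default_rng(0)
W_demo = rng.random((4, 3))   # hypothetical user embeddings: 4 users, 3 factors
H_demo = rng.random((3, 5))   # hypothetical item embeddings: 3 factors, 5 items

# Observed (user position, item position) pairs and their made-up ratings.
pairs = [(0, 0), (0, 2), (1, 1), (1, 4), (2, 3), (2, 0), (3, 4), (3, 0), (3, 2), (0, 4)]
y = np.array([4.0, 3.5, 5.0, 2.5, 2.0, 4.0, 1.0, 4.5, 3.0, 2.0])

X = build_features(W_demo, H_demo, pairs)        # shape: (n_pairs, 2 * n_factors)
X_bias = np.hstack([X, np.ones((len(X), 1))])    # add an intercept column

# Placeholder model: linear regression fitted by least squares.
coef, *_ = np.linalg.lstsq(X_bias, y, rcond=None)
y_fit = X_bias @ coef
print(X.shape, y_fit.shape)
```

Note that the pairs index positions in `W` and `H`, not raw `userId` or `movieId` values. Content features (for example genre or user demographics) could be appended as extra columns in the same way.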
