# Course Recommender: Collaborative-Filtering Methods

This project implements and deploys an AI course recommender system using [Streamlit](https://streamlit.io/). It was inspired by the [IBM Machine Learning Professional Certificate](https://www.coursera.org/professional-certificates/ibm-machine-learning) offered by IBM & Coursera. In the last course of the certificate, Machine Learning Capstone, a similar application is built; check my [class notes](https://github.com/mxagar/machine_learning_ibm/tree/main/06_Capstone_Project/06_Capstone_Recommender_System.md) for more information.

This notebook researches and implements different **Collaborative-Filtering** recommender systems. You can open it in Google Colab:

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/mxagar/course_recommender_streamlit/blob/main/notebooks/04_Collaborative_RecSys.ipynb)

The following is implemented in this notebook:
  
  - A dense user-item ratings table (one row per rating) is converted to a sparse user-item matrix (one row per user, one column per item).
  - Item-based and user-based k-NN search is applied to find the closest items/users given their similarity, and a rating is predicted as a similarity-weighted sum. This is done both with the [surprise](https://surprise.readthedocs.io/en/stable/index.html) library and manually.
  - Non-Negative Matrix Factorization (NMF) is used to decompose a ratings table into lower-rank matrices which encode latent features. This is done both with the [surprise](https://surprise.readthedocs.io/en/stable/index.html) library and with Scikit-Learn.

Table of contents:

- [Part 1: k-NN](#Part-1:-k-NN)
    - [1. Convert Dense Ratings Matrix to Sparse User-Item Matrix](#1.-Convert-Dense-Ratings-Matrix-to-Sparse-User-Item-Matrix)
    - [2. Item-Based Collaborative Filtering: k-NN with Surprise Library](#2.-Item-Based-Collaborative-Filtering:-k-NN-with-Suprise-Library)
    - [3. User-Based Collaborative Filtering: k-NN Manually](#3.-User-Based-Collaborative-Filtering:-k-NN-Manually)
- [Part 2: Non-Negative Matrix Factorization (NMF)](#Part-2:-Non-Negative-Matrix-Factorization-(NMF))
    - [1. NMF with the Surprise Library](#1.-NMF-with-the-Suprise-Library)
    - [2. NMF with Scikit-Learn](#2.-NMF-with-Scikit-Learn)

# Basic Definitions

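**User-based** collaborative filtering: a user similarity matrix is computed (user row against user row, with cosine similarity), and the k most similar users are then used to estimate a rating as a similarity-weighted sum, where $N^k_i(u)$ denotes the k users most similar to $u$ who have rated item $i$: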

$$\hat{r}_{ui} = \frac{
\sum\limits_{v \in N^k_i(u)} \text{similarity}(u, v) \cdot r_{vi}}
{\sum\limits_{v \in N^k_i(u)} \text{similarity}(u, v)}$$

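**Item-based** collaborative filtering: an item similarity matrix is computed (item column against item column, with cosine similarity), and the k most similar items are then used to estimate a rating as a similarity-weighted sum, where $N^k_u(i)$ denotes the k items most similar to $i$ that user $u$ has rated: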

$$\hat{r}_{ui} = \frac{
\sum\limits_{j \in N^k_u(i)} \text{similarity}(i, j) \cdot r_{uj}}
{\sum\limits_{j \in N^k_u(i)} \text{similarity}(i, j)}$$
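
As a minimal sketch of the user-based formula, the following NumPy function estimates a single rating from a small, made-up ratings matrix. The function name `predict_user_based`, the toy data, and the convention that 0 means "not rated" are assumptions for illustration; they are not taken from the notebook cells.

```python
import numpy as np

def predict_user_based(ratings, u, i, k=2):
    """Estimate r_ui from the k users most similar to u who rated item i."""
    # Cosine similarity between user u and every user (row against row)
    norms = np.linalg.norm(ratings, axis=1)
    sims = ratings @ ratings[u] / (norms * norms[u])
    # Candidate neighbors: users other than u who rated item i (0 = not rated)
    candidates = [v for v in range(ratings.shape[0]) if v != u and ratings[v, i] > 0]
    # N_i^k(u): keep the k candidates with the highest similarity to u
    neighbors = sorted(candidates, key=lambda v: sims[v], reverse=True)[:k]
    if not neighbors:
        return None  # nobody else rated the item: the prediction is undefined
    weights = sims[neighbors]
    if weights.sum() == 0:
        return None  # all neighbors are orthogonal to u
    # Similarity-weighted average of the neighbors' ratings for item i
    return float(weights @ ratings[neighbors, i] / weights.sum())

# Toy ratings matrix: rows are users, columns are items, 0 means "not rated"
toy_ratings = np.array([
    [3.0, 2.0, 0.0],
    [3.0, 2.0, 3.0],
    [2.0, 0.0, 2.0],
    [0.0, 3.0, 2.0],
])
r_hat = predict_user_based(toy_ratings, u=0, i=2, k=2)
print(r_hat)
```

The item-based variant follows the same pattern with the roles of rows and columns swapped, for example by passing the transposed matrix.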


In [66]:
import numpy as np
import pandas as pd
import math

# Part 1: k-NN

## 1. Convert Dense Ratings Matrix to Sparse User-Item Matrix

In [67]:
rating_df = pd.read_csv("../data/ratings.csv")

In [68]:
rating_df.head()

Unnamed: 0,user,item,rating
0,1889878,CC0101EN,3.0
1,1342067,CL0101EN,3.0
2,1990814,ML0120ENv3,3.0
3,380098,BD0211EN,3.0
4,779563,DS0101EN,3.0


In [69]:
rating_sparse_df = rating_df.pivot(index='user', columns='item', values='rating').fillna(0).reset_index().rename_axis(index=None, columns=None)
rating_sparse_df.head()

Unnamed: 0,user,AI0111EN,BC0101EN,BC0201EN,BC0202EN,BD0101EN,BD0111EN,BD0115EN,BD0121EN,BD0123EN,...,SW0201EN,TA0105,TA0105EN,TA0106EN,TMP0101EN,TMP0105EN,TMP0106,TMP107,WA0101EN,WA0103EN
0,2,0.0,3.0,0.0,0.0,3.0,2.0,0.0,2.0,2.0,...,0.0,2.0,0.0,3.0,0.0,2.0,2.0,0.0,3.0,0.0
1,4,0.0,0.0,0.0,0.0,2.0,2.0,2.0,2.0,2.0,...,0.0,2.0,0.0,0.0,0.0,2.0,2.0,0.0,2.0,2.0
2,5,2.0,2.0,2.0,0.0,2.0,0.0,0.0,0.0,2.0,...,0.0,0.0,2.0,2.0,2.0,2.0,2.0,2.0,0.0,2.0
3,7,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,8,0.0,0.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
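
To make the effect of the pivot easier to see, here is a small sketch that applies the same chain of calls to a hypothetical four-row ratings table. The names `toy_long_df` and `toy_wide_df` and the toy ratings are made up for illustration.

```python
import pandas as pd

# Hypothetical long-format ratings: one row per (user, item, rating) triple
toy_long_df = pd.DataFrame({
    "user": [2, 2, 4, 5],
    "item": ["BC0101EN", "BD0101EN", "BD0101EN", "AI0111EN"],
    "rating": [3.0, 3.0, 2.0, 2.0],
})

# One row per user, one column per item; missing ratings become 0
toy_wide_df = (
    toy_long_df
    .pivot(index="user", columns="item", values="rating")
    .fillna(0)
    .reset_index()
    .rename_axis(index=None, columns=None)
)
print(toy_wide_df)

# Fraction of user-item cells that hold an actual rating
density = (toy_wide_df.iloc[:, 1:] > 0).to_numpy().mean()
print(f"Filled cells: {density:.0%}")
```

Note that `pivot` raises an error if the same (user, item) pair appears more than once, so duplicated ratings would need to be aggregated first.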


## 2. Item-Based Collaborative Filtering: k-NN with Suprise Library

In [None]:
from surprise import KNNBasic
from surprise import Dataset, Reader
from surprise.model_selection import train_test_split
from surprise import accuracy

In [16]:
# Surprise is a Recommender System Engine similar to Scikit-Learn
# https://surprise.readthedocs.io/en/stable/getting_started.html

In [18]:
# Surprise loads the ratings from a file with one (user, item, rating) triple per line.
# NOTE: despite the file name, the dense table is saved here, not the pivoted one
rating_df.to_csv("../data/ratings_sparse.csv", index=False)
# Read the course rating dataset with columns user item rating
reader = Reader(
        line_format='user item rating', sep=',', skip_lines=1, rating_scale=(2, 3))

course_dataset = Dataset.load_from_file("../data/ratings_sparse.csv", reader=reader)

In [19]:
trainset, testset = train_test_split(course_dataset, test_size=.3)

In [20]:
print(f"Total {trainset.n_users} users and {trainset.n_items} items in the trainingset")

Total 31330 users and 125 items in the trainingset


In [21]:
# Define a KNNBasic() model
# more KNN model hyperparameters can be found here:
# https://surprise.readthedocs.io/en/stable/knn_inspired.html
sim_option = {
    'name': 'cosine',
    'user_based': False,
    #'user_based': True,
}

# Train the KNNBasic model on the trainset, and predict ratings for the testset
algo = KNNBasic(sim_options=sim_option)
algo.fit(trainset)

# Then compute RMSE
predictions = algo.test(testset)
accuracy.rmse(predictions)

Computing the cosine similarity matrix...
Done computing similarity matrix.
RMSE: 0.1939


0.19387109865857158

In [22]:
type(predictions)

list

In [23]:
# We could parse the Prediction objects to a dataframe
predictions[0]

Prediction(uid='993479', iid='BD0123EN', r_ui=3.0, est=3, details={'actual_k': 17, 'was_impossible': False})
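
The parsing mentioned in the comment above could look roughly like the following sketch. So that it runs without Surprise installed, it uses a stand-in `Prediction` namedtuple with the same fields as the object printed above; the second and third toy predictions are invented. With the real `predictions` list, the same `pd.DataFrame` call is expected to work, but that is an assumption.

```python
from collections import namedtuple

import numpy as np
import pandas as pd

# Stand-in with the same fields as the Prediction object printed above;
# it only exists so that this sketch runs without Surprise installed
Prediction = namedtuple("Prediction", ["uid", "iid", "r_ui", "est", "details"])
toy_predictions = [
    Prediction("993479", "BD0123EN", 3.0, 3.0, {"actual_k": 17, "was_impossible": False}),
    Prediction("1342067", "CL0101EN", 3.0, 2.6, {"actual_k": 40, "was_impossible": False}),
    Prediction("380098", "BD0211EN", 2.0, 2.2, {"actual_k": 12, "was_impossible": False}),
]

# One row per prediction, one column per field
pred_df = pd.DataFrame(toy_predictions, columns=list(Prediction._fields))
pred_df["error"] = pred_df["est"] - pred_df["r_ui"]

# RMSE recomputed by hand from the error column
rmse = float(np.sqrt((pred_df["error"] ** 2).mean()))
print(pred_df[["uid", "iid", "r_ui", "est", "error"]])
print(f"RMSE: {rmse:.4f}")
```

A table like this makes it easy to inspect the largest errors per user or per course.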

## 3. User-Based Collaborative Filtering: k-NN Manually

In [73]:
## STEPS:
## - Calculate the similarity between two users using their rating history (the row vectors of interaction matrix)
## - Build a similarity matrix for each pair of users with the training dataset
## - For each user, find its k nearest neighbors in the sim matrix
## - For each rating in the test dataset, estimate its rating using the KNN collaborative filtering equations shown before
## - Calculate RMSE for the entire test dataset

In [74]:
from scipy.spatial.distance import cosine

In [75]:
# Extract user list
user_list = rating_sparse_df['user'].to_list()[:50] # only 50 users taken, otherwise it takes very long
user_list_df = pd.DataFrame(data=user_list, columns = ['user'])
# Empty similarity matrix
user_sim = np.zeros((len(user_list), len(user_list)))
# Compare users pairwise
for i, this_user in enumerate(user_list):
    # ravel() flattens the (1, n_items) slice into the 1-D vector that cosine() expects
    this_user_ratings = rating_sparse_df[rating_sparse_df.user == this_user].iloc[:,1:].values.ravel()
    for j, other_user in enumerate(user_list):
        other_user_ratings = rating_sparse_df[rating_sparse_df.user == other_user].iloc[:,1:].values.ravel()
        similarity = 1 - cosine(this_user_ratings, other_user_ratings)
        # FIXME: matrix is symmetric, only half needs to be computed!
        user_sim[i,j] = similarity
# Assemble similarity dataframe
user_sim_df = pd.DataFrame(data=user_sim, columns=user_list)
user_sim_df = pd.concat([user_list_df, user_sim_df], axis=1)
# Set index to user to sort similar users easily later on
user_sim_df.set_index(user_sim_df['user'], inplace=True, drop=True)
user_sim_df.drop(['user'], axis=1, inplace=True)

In [76]:
user_sim_df.head(10)

Unnamed: 0_level_0,2,4,5,7,8,9,12,16,17,19,...,55,56,57,58,59,60,61,62,63,64
user,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
2,1.0,0.56997,0.546903,0.140028,0.188639,0.339486,0.339486,0.39085,0.093352,0.165025,...,0.19803,0.256718,0.140028,0.353156,0.140028,0.19803,0.585837,0.19803,0.427793,0.093352
4,0.56997,1.0,0.353281,0.150756,0.261116,0.429058,0.333712,0.207514,0.150756,0.213201,...,0.213201,0.301511,0.150756,0.35528,0.150756,0.213201,0.826153,0.213201,0.657952,0.150756
5,0.546903,0.353281,1.0,0.0,0.075165,0.123508,0.247016,0.209072,0.0,0.184115,...,0.092057,0.065094,0.130189,0.177627,0.130189,0.0,0.299187,0.092057,0.255686,0.0
7,0.140028,0.150756,0.0,1.0,0.0,0.316228,0.316228,0.344124,0.0,0.0,...,0.0,0.5,0.0,0.0,0.0,0.0,0.176777,0.0,0.218218,0.0
8,0.188639,0.261116,0.075165,0.0,1.0,0.365148,0.182574,0.19868,0.0,0.0,...,0.0,0.288675,0.0,0.143223,0.0,0.0,0.306186,0.0,0.251976,0.57735
9,0.339486,0.429058,0.123508,0.316228,0.365148,1.0,0.3,0.217643,0.0,0.223607,...,0.223607,0.474342,0.316228,0.156893,0.316228,0.0,0.447214,0.223607,0.345033,0.316228
12,0.339486,0.333712,0.247016,0.316228,0.182574,0.3,1.0,0.290191,0.0,0.0,...,0.0,0.158114,0.0,0.11767,0.0,0.0,0.33541,0.0,0.345033,0.0
16,0.39085,0.207514,0.209072,0.344124,0.19868,0.217643,0.290191,1.0,0.0,0.243332,...,0.243332,0.344124,0.344124,0.170733,0.344124,0.0,0.243332,0.243332,0.225282,0.0
17,0.093352,0.150756,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
19,0.165025,0.213201,0.184115,0.0,0.0,0.223607,0.0,0.243332,0.0,1.0,...,0.5,0.353553,0.707107,0.175412,0.707107,0.0,0.25,0.5,0.154303,0.0
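
The pairwise loop becomes slow for more than a few users. One possible alternative, sketched here on made-up data, is to normalize every row once and obtain all cosine similarities with a single matrix product. The names `toy_users`, `toy_ratings` and `toy_sim_df` are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical ratings: rows are users, columns are items, 0 = not rated
toy_users = [2, 4, 5]
toy_ratings = np.array([
    [3.0, 0.0, 3.0, 2.0],
    [0.0, 0.0, 2.0, 2.0],
    [2.0, 2.0, 0.0, 0.0],
])

# Scale each row to unit length; the dot products of unit vectors
# are then exactly the cosine similarities of the original rows
norms = np.linalg.norm(toy_ratings, axis=1, keepdims=True)
norms[norms == 0] = 1.0  # avoid dividing by zero for users without ratings
unit_rows = toy_ratings / norms
toy_sim = unit_rows @ unit_rows.T

# Users as both index and columns, so neighbors can be sorted per column
toy_sim_df = pd.DataFrame(toy_sim, index=toy_users, columns=toy_users)
print(toy_sim_df.round(3))
```

The result is symmetric by construction, so no half of the matrix is computed twice in Python code.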


In [77]:
# For user n=7, which are the closest k=3 users?
n = 7
u = user_list[n]
k = 3
user_sim_df[u].sort_values(ascending=False)[1:(k+1)] # first user is the user herself, discarded

user
2     0.390850
33    0.344124
40    0.344124
Name: 16, dtype: float64

In [78]:
# For user n=7, what are her predicted ratings using the k=3 closest neighbors?
n = 7
u = user_list[n]
k = 3
item_list = rating_sparse_df.columns[1:]
neighbor_users = list(user_sim_df[u].sort_values(ascending=False)[1:(k+1)].index)
neighbor_weights = list(user_sim_df[u].sort_values(ascending=False)[1:(k+1)].values)
estimated_ratings = [0]*len(item_list)
for i, item in enumerate(item_list):
    n_rated = 0 # number of neighbors who rated this item
    summed_weights = 0
    for j, user in enumerate(neighbor_users):
        r = rating_sparse_df.loc[rating_sparse_df.user==user, item].values[0]
        if r > 0:
            n_rated += 1
            estimated_ratings[i] += r*neighbor_weights[j]
            summed_weights += neighbor_weights[j]
    if n_rated > 0:
        estimated_ratings[i] /= summed_weights

In [79]:
# FIXME: `user` is the last neighbor left over from the loop above;
# the ratings of the target user `u` are probably intended here
true_ratings = rating_sparse_df.loc[rating_sparse_df.user==user].values[0,1:]

In [80]:
rmse = np.sqrt(np.sum(np.square(np.array(estimated_ratings)-true_ratings))/len(estimated_ratings))

In [81]:
rmse

1.8445474891761775
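
The RMSE above is taken over all course columns, including those in which the stored 0 only means "not rated" or "no estimate available". As a sketch of an alternative, the hypothetical helper `rmse_on_rated` below restricts the error to the items that have both a true rating and an estimate; the toy vectors are made up.

```python
import numpy as np

def rmse_on_rated(true_ratings, estimated_ratings):
    """RMSE over the items that have both a true rating and an estimate (> 0)."""
    true_ratings = np.asarray(true_ratings, dtype=float)
    estimated_ratings = np.asarray(estimated_ratings, dtype=float)
    # Keep only positions where both values are real ratings, not placeholders
    mask = (true_ratings > 0) & (estimated_ratings > 0)
    if not mask.any():
        return float("nan")  # nothing to compare
    diff = estimated_ratings[mask] - true_ratings[mask]
    return float(np.sqrt(np.mean(diff ** 2)))

# Made-up example: 0 marks "not rated" (true) or "no estimate" (estimated)
toy_true = [3.0, 0.0, 2.0, 0.0, 3.0]
toy_estimated = [2.5, 2.0, 2.0, 0.0, 0.0]
toy_rmse = rmse_on_rated(toy_true, toy_estimated)
print(toy_rmse)
```

Only the first and third positions enter the computation here, because the other positions lack either a true rating or an estimate.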

## 4. Item-Based Collaborative Filtering: k-NN Manually

In [101]:
# Same procedure as in the user-based case, but comparing columns (items) instead of rows (users)
# Extract item list
item_list = rating_sparse_df.columns[1:]
item_list_df = pd.DataFrame(data=item_list, columns = ['item'])
# Empty similarity matrix
item_sim = np.zeros((len(item_list), len(item_list)))
# Compare items pairwise
for i, this_item in enumerate(item_list):
    this_item_ratings = rating_sparse_df[this_item].values
    for j, other_item in enumerate(item_list):
        other_item_ratings = rating_sparse_df[other_item].values
        similarity = 1 - cosine(this_item_ratings, other_item_ratings)
        # FIXME: matrix is symmetric, only half needs to be computed!
        item_sim[i,j] = similarity
# Assemble similarity dataframe
item_sim_df = pd.DataFrame(data=item_sim, columns=item_list)
item_sim_df = pd.concat([item_list_df, item_sim_df], axis=1)

In [102]:
# Set index to item to sort similar items easily later on
#item_sim_df.set_index(item_sim_df['item'], inplace=True, drop=True)
#item_sim_df.drop(['item'], axis=1, inplace=True)

In [104]:
item_sim_df

Unnamed: 0,item,AI0111EN,BC0101EN,BC0201EN,BC0202EN,BD0101EN,BD0111EN,BD0115EN,BD0121EN,BD0123EN,...,SW0201EN,TA0105,TA0105EN,TA0106EN,TMP0101EN,TMP0105EN,TMP0106,TMP107,WA0101EN,WA0103EN
0,AI0111EN,1.000000,0.105708,0.093261,0.158023,0.081823,0.071958,0.050822,0.035848,0.060667,...,0.101413,0.015380,0.085273,0.004320,0.024105,0.001451,0.005567,0.018910,0.010874,0.003331
1,BC0101EN,0.105708,1.000000,0.381801,0.262163,0.384545,0.312488,0.150896,0.080234,0.113649,...,0.120289,0.086303,0.160478,0.044557,0.005751,0.024492,0.005313,0.004512,0.205472,0.011523
2,BC0201EN,0.093261,0.381801,1.000000,0.338096,0.165998,0.139687,0.091785,0.054832,0.086070,...,0.100588,0.023002,0.083121,0.010301,0.009832,0.006362,0.002271,0.007713,0.072589,0.004415
3,BC0202EN,0.158023,0.262163,0.338096,1.000000,0.111778,0.099261,0.065476,0.055781,0.075561,...,0.068099,0.009607,0.058334,0.010508,0.000000,0.006883,0.000000,0.000000,0.017575,0.000000
4,BD0101EN,0.081823,0.384545,0.165998,0.111778,1.000000,0.733963,0.418689,0.277713,0.201113,...,0.086475,0.181796,0.190855,0.104190,0.008269,0.070181,0.011458,0.010541,0.287737,0.037563
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
121,TMP0105EN,0.001451,0.024492,0.006362,0.006883,0.070181,0.084228,0.105991,0.128590,0.147617,...,0.003618,0.102654,0.008691,0.139820,0.060193,1.000000,0.062554,0.023610,0.073011,0.043666
122,TMP0106,0.005567,0.005313,0.002271,0.000000,0.011458,0.010107,0.015328,0.013629,0.013379,...,0.000000,0.012734,0.003335,0.015918,0.115470,0.062554,1.000000,0.090582,0.014469,0.055844
123,TMP107,0.018910,0.004512,0.007713,0.000000,0.010541,0.000000,0.000000,0.000000,0.032821,...,0.000000,0.000000,0.011327,0.021629,0.392232,0.023610,0.090582,1.000000,0.000000,0.054198
124,WA0101EN,0.010874,0.205472,0.072589,0.017575,0.287737,0.247177,0.125282,0.092818,0.085882,...,0.066273,0.270450,0.061226,0.173706,0.000000,0.073011,0.014469,0.000000,1.000000,0.118429


In [107]:
item_sim_df.loc[item_sim_df.item=="AI0111EN"]["BC0101EN"].values

array([0.10570823])
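
Beyond looking up a single pair, an item similarity table can be used to list the courses closest to a given one. The sketch below assumes a table with the same layout as `item_sim_df` (an `item` column followed by one similarity column per item); the helper `most_similar_items` and the four-course toy table with rounded, illustrative values are hypothetical.

```python
import pandas as pd

# Hypothetical item-item similarity table in the same layout as item_sim_df:
# an 'item' column followed by one similarity column per item
toy_item_sim_df = pd.DataFrame({
    "item": ["AI0111EN", "BC0101EN", "BC0201EN", "BD0101EN"],
    "AI0111EN": [1.0, 0.11, 0.09, 0.08],
    "BC0101EN": [0.11, 1.0, 0.38, 0.38],
    "BC0201EN": [0.09, 0.38, 1.0, 0.17],
    "BD0101EN": [0.08, 0.38, 0.17, 1.0],
})

def most_similar_items(sim_df, item, k=2):
    """Return the k items most similar to `item`, excluding the item itself."""
    sims = sim_df.set_index("item")[item].drop(item)
    return sims.sort_values(ascending=False).head(k)

top_items = most_similar_items(toy_item_sim_df, "BC0201EN", k=2)
print(top_items)
```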

# Part 2: Non-Negative Matrix Factorization (NMF)

## 1. NMF with the Suprise Library

In [5]:
from surprise import NMF
from surprise import Dataset, Reader
from surprise.model_selection import train_test_split
from surprise import accuracy

In [6]:
ratings_df = pd.read_csv("../data/ratings.csv")

In [7]:
ratings_df.head()

Unnamed: 0,user,item,rating
0,1889878,CC0101EN,3.0
1,1342067,CL0101EN,3.0
2,1990814,ML0120ENv3,3.0
3,380098,BD0211EN,3.0
4,779563,DS0101EN,3.0


In [8]:
# Dense -> Sparse ratings
# NOTE: Surprise reads the dense (user, item, rating) table directly;
# the pivoted table is only needed for Scikit-Learn later on
ratings_sparse_df = ratings_df.pivot(index='user', columns='item', values='rating').fillna(0).reset_index().rename_axis(index=None, columns=None)
ratings_sparse_df.head()

Unnamed: 0,user,AI0111EN,BC0101EN,BC0201EN,BC0202EN,BD0101EN,BD0111EN,BD0115EN,BD0121EN,BD0123EN,...,SW0201EN,TA0105,TA0105EN,TA0106EN,TMP0101EN,TMP0105EN,TMP0106,TMP107,WA0101EN,WA0103EN
0,2,0.0,3.0,0.0,0.0,3.0,2.0,0.0,2.0,2.0,...,0.0,2.0,0.0,3.0,0.0,2.0,2.0,0.0,3.0,0.0
1,4,0.0,0.0,0.0,0.0,2.0,2.0,2.0,2.0,2.0,...,0.0,2.0,0.0,0.0,0.0,2.0,2.0,0.0,2.0,2.0
2,5,2.0,2.0,2.0,0.0,2.0,0.0,0.0,0.0,2.0,...,0.0,0.0,2.0,2.0,2.0,2.0,2.0,2.0,0.0,2.0
3,7,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,8,0.0,0.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [9]:
# Read the course rating dataset with columns user item rating
# Surprise expects the dense format: one (user, item, rating) triple per line
reader = Reader(
        line_format='user item rating', sep=',', skip_lines=1, rating_scale=(2, 3))
course_dataset = Dataset.load_from_file("../data/ratings.csv", reader=reader)

In [10]:
trainset, testset = train_test_split(course_dataset, test_size=.3)

In [11]:
print(f"Total {trainset.n_users} users and {trainset.n_items} items in the trainingset")

Total 31348 users and 126 items in the trainingset


In [12]:
algo = NMF(n_factors=15, n_epochs=50, verbose=True, random_state=123)
algo.fit(trainset)
predictions = algo.test(testset)
accuracy.rmse(predictions)

Processing epoch 0
Processing epoch 1
Processing epoch 2
Processing epoch 3
Processing epoch 4
Processing epoch 5
Processing epoch 6
Processing epoch 7
Processing epoch 8
Processing epoch 9
Processing epoch 10
Processing epoch 11
Processing epoch 12
Processing epoch 13
Processing epoch 14
Processing epoch 15
Processing epoch 16
Processing epoch 17
Processing epoch 18
Processing epoch 19
Processing epoch 20
Processing epoch 21
Processing epoch 22
Processing epoch 23
Processing epoch 24
Processing epoch 25
Processing epoch 26
Processing epoch 27
Processing epoch 28
Processing epoch 29
Processing epoch 30
Processing epoch 31
Processing epoch 32
Processing epoch 33
Processing epoch 34
Processing epoch 35
Processing epoch 36
Processing epoch 37
Processing epoch 38
Processing epoch 39
Processing epoch 40
Processing epoch 41
Processing epoch 42
Processing epoch 43
Processing epoch 44
Processing epoch 45
Processing epoch 46
Processing epoch 47
Processing epoch 48
Processing epoch 49
RMSE: 0.21

0.21235480559689332
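
To illustrate what the factorization encodes, the following sketch predicts ratings as dot products of non-negative user and item factor vectors, which is the basic (unbiased) NMF prediction rule. The factor values are made up and are not the ones learned by `algo`; the clipping step mirrors the rating scale (2, 3) passed to the `Reader`.

```python
import numpy as np

# Made-up non-negative latent factors for 2 users and 3 items (3 factors each)
user_factors = np.array([
    [0.9, 0.1, 0.4],
    [0.2, 1.1, 0.3],
])
item_factors = np.array([
    [2.0, 0.5, 1.0],
    [0.3, 2.1, 0.8],
    [1.2, 1.0, 1.5],
])

# Predicted rating of user u for item i: dot product of their factor vectors
u, i = 0, 2
r_hat_ui = float(user_factors[u] @ item_factors[i])

# All predictions at once: (n_users, n_factors) @ (n_factors, n_items)
R_hat = user_factors @ item_factors.T
# Clip the raw estimates to the rating scale (2, 3)
R_hat_clipped = np.clip(R_hat, 2, 3)
print(r_hat_ui)
print(R_hat_clipped)
```

With only two allowed rating values, many raw estimates end up clipped to one of the bounds, which partly explains the low RMSE values in this dataset.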

## 2. NMF with Scikit-Learn

In [13]:
from sklearn.decomposition import NMF
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

In [14]:
# Scikit-learn uses the SPARSE representation
X_train, X_test = train_test_split(
    ratings_sparse_df.iloc[:,1:],
    test_size=0.3, # portion of dataset to allocate to test set
    random_state=42 # we are setting the seed here, ALWAYS DO IT!
)

In [15]:
nmf = NMF(n_components=15, init='random', random_state=818)
W = nmf.fit_transform(X_train) # (X.shape[0]=n_samples, n_components)
H = nmf.components_ # (n_components, X.shape[1]=n_features)
X_hat = W@H

In [16]:
W.shape

(23730, 15)

In [17]:
H.shape

(15, 126)

In [18]:
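# The RMSE is worse than with Surprise, using a default configuration.
# NOTE: the values are not directly comparable, because here the zeros
# (missing ratings) also count as targets to be reconstructed.
# In scikit-learn >= 1.4, root_mean_squared_error() replaces squared=False
print('RMSE: ', mean_squared_error(X_train, X_hat, squared=False))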

RMSE:  0.3360953022513968


In [19]:
# IMPORTANT: The fitted NMF model has constant H components,
# but W is different for each input data X.
W_test = nmf.transform(X_test)

In [20]:
W_test.shape

(10171, 15)

In [21]:
X_test_hat = W_test@H

In [22]:
# The test RMSE is close to the train RMSE, so the fixed H components generalize to unseen users
print('RMSE: ', mean_squared_error(X_test, X_test_hat, squared=False))

RMSE:  0.3373680254871046
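
One way to turn a reconstruction such as `X_test_hat` into recommendations is to rank, for each user, the courses that had no rating in the input by their reconstructed score. The sketch below shows this for a single made-up row; the helper `recommend_top_n` and the toy values are hypothetical.

```python
import numpy as np

def recommend_top_n(x_row, x_hat_row, item_names, n=2):
    """Rank the items a user has not rated (value 0) by reconstructed score."""
    x_row = np.asarray(x_row, dtype=float)
    scores = np.asarray(x_hat_row, dtype=float).copy()
    scores[x_row > 0] = -np.inf  # never recommend an already rated course
    best = np.argsort(scores)[::-1][:n]  # indices of the n highest scores
    return [(item_names[j], float(scores[j])) for j in best if np.isfinite(scores[j])]

# Made-up example: one user's original row and its NMF reconstruction
toy_items = ["AI0111EN", "BC0101EN", "BD0101EN", "WA0101EN"]
toy_x = [0.0, 3.0, 0.0, 0.0]
toy_x_hat = [0.4, 2.8, 1.9, 0.1]
recommendations = recommend_top_n(toy_x, toy_x_hat, toy_items, n=2)
print(recommendations)
```

Applied to the notebook data, the rows would come from `X_test` and `X_test_hat`, and the item names from the columns of the pivoted table.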
