<h1 align="center">Recommender Systems - Collaborative Filtering in Python</h1>

Collaborative filtering is a widely used approach to designing recommender systems. Collaborative filtering methods collect and analyze a large amount of information on users’ behaviors, activities, or preferences, and predict what a user will like based on their similarity to other users. <br><br>
A key advantage of the collaborative filtering approach is that it does not rely on machine-analyzable content, so it can accurately recommend complex items such as movies without requiring an "understanding" of the item itself. Collaborative filtering assumes that people who agreed in the past will agree in the future, and that they will like items similar to those they liked in the past.<br><br>
This notebook contains implementations of user-user and item-item collaborative filtering algorithms.<br><br>

<h1>Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Imports" data-toc-modified-id="Imports-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Imports</a></span></li><li><span><a href="#Load-the-data" data-toc-modified-id="Load-the-data-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Load the data</a></span></li><li><span><a href="#Data-Exploration" data-toc-modified-id="Data-Exploration-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Data Exploration</a></span></li><li><span><a href="#User---User-Collaborative-Filtering" data-toc-modified-id="User---User-Collaborative-Filtering-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>User - User Collaborative Filtering</a></span></li><li><span><a href="#Item---Item-Collaborative-Filtering" data-toc-modified-id="Item---Item-Collaborative-Filtering-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Item - Item Collaborative Filtering</a></span></li><li><span><a href="#Results" data-toc-modified-id="Results-6"><span class="toc-item-num">6&nbsp;&nbsp;</span>Results</a></span><ul class="toc-item"><li><span><a href="#Baseline" data-toc-modified-id="Baseline-6.1"><span class="toc-item-num">6.1&nbsp;&nbsp;</span>Baseline</a></span></li><li><span><a href="#User-User-Collaborative-Filtering" data-toc-modified-id="User-User-Collaborative-Filtering-6.2"><span class="toc-item-num">6.2&nbsp;&nbsp;</span>User-User Collaborative Filtering</a></span></li><li><span><a href="#Item-Item-Collaborative-Filtering" data-toc-modified-id="Item-Item-Collaborative-Filtering-6.3"><span class="toc-item-num">6.3&nbsp;&nbsp;</span>Item-Item Collaborative Filtering</a></span></li></ul></li><li><span><a href="#Drawbacks-of-Collaborative-Filtering" data-toc-modified-id="Drawbacks-of-Collaborative-Filtering-7"><span class="toc-item-num">7&nbsp;&nbsp;</span>Drawbacks of Collaborative Filtering</a></span></li><li><span><a href="#References" data-toc-modified-id="References-8"><span 
class="toc-item-num">8&nbsp;&nbsp;</span>References</a></span></li></ul></div>

## Imports

In [1]:
import pickle
import numpy as np
import pandas as pd
from sortedcontainers import SortedList

FILE_PATH = "./data/Movielens/"

## Load the data
The original dataset has been preprocessed to keep only the top users and items.<br>
Please refer to the preprocessing notebook in the repo for more details.

In [2]:
## user_to_item_map={}  ## Key:= User_id, Value:= [list of items] 
## item_to_user_map={}  ## Key:= item_id, Value:=[list of users] 
## train_ratings={}     ## Key:= (User_id, item_id) Value:=Rating 
## test_ratings={}      ## Key:= (User_id, item_id) Value:=Rating
## user_statistics={}   ## Key:= User_id, Value:=(Avg_rating, Norm_of_deviations)
## item_statistics={}   ## Key:= item_id, Value:=(Avg_rating, Norm_of_deviations)

with open(FILE_PATH+'user_to_item_map.pkl', 'rb') as fp:
    user_to_item_map=pickle.load(fp)

with open(FILE_PATH+'item_to_user_map.pkl', 'rb') as fp:
    item_to_user_map=pickle.load(fp)

with open(FILE_PATH+'train_ratings.pkl', 'rb') as fp:
    train_ratings=pickle.load(fp)

with open(FILE_PATH+'test_ratings.pkl', 'rb') as fp:
    test_ratings=pickle.load(fp)

with open(FILE_PATH+'user_statistics.pkl', 'rb') as fp:
    user_statistics=pickle.load(fp)
    
with open(FILE_PATH+'item_statistics.pkl', 'rb') as fp:
    item_statistics=pickle.load(fp)
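
The statistics dictionaries come from the preprocessing notebook, which is not shown here. As a rough sketch of what `(Avg_rating, Norm_of_deviations)` could mean, the hypothetical helper below assumes the norm is the Euclidean norm of a user's deviations from their own average rating. Both `compute_user_statistics` and the toy ratings are illustrative and not part of the original code.

```python
import numpy as np
from collections import defaultdict

def compute_user_statistics(ratings):
    """Hypothetical helper: map user_id -> (avg_rating, norm_of_deviations).

    `ratings` has the same layout as train_ratings: {(user_id, item_id): rating}.
    """
    per_user = defaultdict(list)
    for (user, _item), rating in ratings.items():
        per_user[user].append(rating)

    stats = {}
    for user, values in per_user.items():
        values = np.array(values, dtype=float)
        avg = values.mean()
        # Assumed definition: Euclidean norm of the deviations from the user's mean
        norm = np.sqrt(np.sum((values - avg) ** 2))
        stats[user] = (avg, norm)
    return stats

# Toy data, not taken from the MovieLens files
toy_ratings = {(0, 0): 4.0, (0, 1): 2.0, (1, 0): 5.0, (1, 1): 3.0, (1, 2): 4.0}
toy_stats = compute_user_statistics(toy_ratings)
print(toy_stats)
```

The same computation grouped by item instead of user would give the item-side statistics, under the same assumption.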

## Data Exploration

In [3]:
n_users=len(user_to_item_map)
n_items=len(item_to_user_map)
matrix_size=n_users*n_items
n_ratings=len(train_ratings)+len(test_ratings)

print("Number of unique users:",n_users)
print("Number of unique items:",n_items)
print("Total ratings in the dataset:",n_ratings)

print("User-Item matrix size:",matrix_size)
print("User-Item matrix empty percentage:",(matrix_size-n_ratings)*100/(matrix_size))

Number of unique users: 200
Number of unique items: 600
Total ratings in the dataset: 95871
User-Item matrix size: 120000
User-Item matrix empty percentage: 20.1075


## User - User Collaborative Filtering

In [4]:
def initialize_user_neighbours(n_neighbours,min_common_items,k):
    ## Key: User_id, Value: List of (similarity,user_id) of max length n_neighbours 
    ## (Sorted by absolute similarity in decreasing order)
    user_neighbours={i:SortedList(key=lambda x: -abs(x[0])) for i in range(n_users)}

    for user1 in range(n_users):
        for user2 in range(user1+1,n_users):

            ## Find the common items between user1 and user2
            common_items=user_to_item_map[user1].intersection(user_to_item_map[user2])

            if len(common_items) <= min_common_items: continue

            u1_ratings = np.array( [ train_ratings[(user1,item)] for item in common_items ] )
            u2_ratings = np.array( [ train_ratings[(user2,item)] for item in common_items ] )

            u1_avg, u1_norm = user_statistics[user1]
            u2_avg, u2_norm = user_statistics[user2]

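            ## Pearson-style similarity: deviations are summed over the common items,
            ## while the averages and norms are the precomputed per-user statistics
            numerator=sum((u1_ratings-u1_avg)*(u2_ratings-u2_avg))
            similarity=numerator/(u1_norm*u2_norm)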

            user_neighbours[user1].add((similarity,user2))
            user_neighbours[user2].add((similarity,user1))

            ## If a list exceeds n_neighbours, drop the entry with the lowest absolute similarity
            if len(user_neighbours[user1])>n_neighbours:
                user_neighbours[user1].pop()
            if len(user_neighbours[user2])>n_neighbours:
                user_neighbours[user2].pop()

        if user1%k==0: 
            print("Calculated neighbours for user: ",user1)
    return user_neighbours
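
To make the similarity computation concrete, here is a small self-contained sketch on toy data. The function `deviation_similarity` and the three toy users are hypothetical. The sketch assumes each user's average and norm are computed over that user's whole rating profile, and it adds a zero-norm guard that the notebook code does not have.

```python
import numpy as np

def deviation_similarity(ratings_a, ratings_b):
    """Similarity between two users given as {item_id: rating} dicts."""
    common = set(ratings_a) & set(ratings_b)
    if not common:
        return 0.0

    a_all = np.array(list(ratings_a.values()), dtype=float)
    b_all = np.array(list(ratings_b.values()), dtype=float)
    a_avg, b_avg = a_all.mean(), b_all.mean()

    # Norms of deviations over each user's full profile (assumed definition)
    a_norm = np.sqrt(np.sum((a_all - a_avg) ** 2))
    b_norm = np.sqrt(np.sum((b_all - b_avg) ** 2))
    if a_norm == 0 or b_norm == 0:
        return 0.0  # a user who rates everything the same carries no signal

    # Numerator only uses the items both users have rated
    numerator = sum((ratings_a[i] - a_avg) * (ratings_b[i] - b_avg) for i in common)
    return numerator / (a_norm * b_norm)

alice = {0: 5.0, 1: 3.0, 2: 1.0}
bob = {0: 4.0, 1: 3.0, 2: 2.0}    # same taste as alice, milder ratings
carol = {0: 1.0, 1: 3.0, 2: 5.0}  # opposite taste

sim_ab = deviation_similarity(alice, bob)
sim_ac = deviation_similarity(alice, carol)
print(sim_ab, sim_ac)  # close to +1 and -1 respectively
```

Because all three toy users rated exactly the same items, the value coincides with the Pearson correlation. When the overlap is only partial, the numerator shrinks while the norms do not, so pairs with few common items get a smaller absolute similarity.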

In [5]:
def user_user_predict(user,item,user_neighbours):
    
    numerator=denominator=0

    for similarity,user2 in user_neighbours[user]:
        if (user2,item) in train_ratings:
            numerator+=similarity*(train_ratings[(user2,item)]-user_statistics[user2][0])
            denominator+=abs(similarity)
            
    pred=user_statistics[user][0]  ## Use average rating of the user
    
    if denominator>0:
        pred+=(numerator/denominator) ## Adding the weighted deviations of neighbours
        
    ## Clip the prediction to the range [0.5, 5]
    pred=max(0.5,pred)
    pred=min(5,pred)
    
    return pred
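
In formula form, the two user-user cells above compute the following. The notation is introduced here for illustration and does not appear in the notebook: $\bar{r}_u$ and $\sigma_u$ stand for the two entries of `user_statistics[u]`, $I_u$ for the set of items rated by user $u$, and $N_i(u)$ for the stored neighbours of $u$ that have a training rating for item $i$.

$$
w_{uv} = \frac{\sum_{i \in I_u \cap I_v} (r_{ui} - \bar{r}_u)\,(r_{vi} - \bar{r}_v)}{\sigma_u \, \sigma_v}
$$

$$
\hat{r}_{ui} = \bar{r}_u + \frac{\sum_{v \in N_i(u)} w_{uv}\,(r_{vi} - \bar{r}_v)}{\sum_{v \in N_i(u)} |w_{uv}|}
$$

If no neighbour has rated item $i$, the prediction is simply $\bar{r}_u$. In both cases the result is clipped to $[0.5, 5]$.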

## Item - Item Collaborative Filtering

In [6]:
def initialize_item_neighbours(n_neighbours,min_common_users,k):
    ## Key: item_id, Value: List of (similarity,item_id) of max length n_neighbours 
    ## (Sorted by absolute similarity in decreasing order)
    item_neighbours={i:SortedList(key=lambda x: -abs(x[0])) for i in range(n_items)}

    for item1 in range(n_items):
        for item2 in range(item1+1,n_items):

            ## Find the common users between item1 and item2
            common_users=item_to_user_map[item1].intersection(item_to_user_map[item2])

            if len(common_users) <= min_common_users: continue

            m1_ratings = np.array( [ train_ratings[(user,item1)] for user in common_users ] )
            m2_ratings = np.array( [ train_ratings[(user,item2)] for user in common_users ] )

            m1_avg, m1_norm = item_statistics[item1]
            m2_avg, m2_norm = item_statistics[item2]

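            ## Pearson-style similarity: deviations are summed over the common users,
            ## while the averages and norms are the precomputed per-item statistics
            numerator=sum((m1_ratings-m1_avg)*(m2_ratings-m2_avg))
            similarity=numerator/(m1_norm*m2_norm)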

            item_neighbours[item1].add((similarity,item2))
            item_neighbours[item2].add((similarity,item1))

            ## If a list exceeds n_neighbours, drop the entry with the lowest absolute similarity
            if len(item_neighbours[item1])>n_neighbours:
                item_neighbours[item1].pop()
            if len(item_neighbours[item2])>n_neighbours:
                item_neighbours[item2].pop()

        if item1%k==0: 
            print("Calculated neighbours for item: ",item1)
            
    return item_neighbours

In [7]:
def item_item_predict(user,item,item_neighbours):
    
    numerator=denominator=0

    for similarity,item2 in item_neighbours[item]:
        if (user,item2) in train_ratings:
            numerator+=similarity*(train_ratings[(user,item2)]-item_statistics[item2][0])
            denominator+=abs(similarity)
    
    pred=item_statistics[item][0]  ## Start from the average rating of the item
    
    if denominator>0:
        pred+=(numerator/denominator) ## Average rating + weighted deviation of neighbours
        
    ## Clip the prediction to the range [0.5, 5]
    pred=max(0.5,pred)
    pred=min(5,pred)
    
    return pred
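
The prediction step can be traced by hand on a tiny example. The sketch below mirrors the weighted-deviation logic of `item_item_predict` but takes its inputs directly, so it runs without the pickled data. The function name `weighted_deviation_predict` and all numbers are made up for illustration.

```python
def weighted_deviation_predict(base_avg, neighbours, lo=0.5, hi=5.0):
    """Predict a rating from (similarity, neighbour_rating, neighbour_avg) triples."""
    numerator = denominator = 0.0
    for similarity, rating, avg in neighbours:
        numerator += similarity * (rating - avg)  # signed, weighted deviation
        denominator += abs(similarity)

    pred = base_avg
    if denominator > 0:
        pred += numerator / denominator
    return min(hi, max(lo, pred))  # clip to the allowed rating range

# The target item averages 3.5; the user has rated two of its neighbours.
neighbours = [
    (0.8, 5.0, 4.0),   # similar item, rated 1.0 above that item's average
    (-0.4, 2.0, 3.0),  # dissimilar item, rated 1.0 below that item's average
]
pred = weighted_deviation_predict(3.5, neighbours)
print(pred)
```

Both neighbours push the prediction upwards: liking a similar item and disliking a dissimilar one are treated as the same kind of evidence, which is why the neighbour lists are ranked by absolute similarity rather than by signed similarity.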

## Results
The mean squared error (MSE) is reported for the train and test sets.

### Baseline
The baseline predicts every rating as the average rating of the item.

In [8]:
train_errors=[(item_statistics[m][0]-r)**2 for (u,m),r in train_ratings.items()]
test_errors=[(item_statistics[m][0]-r)**2 for (u,m),r in test_ratings.items()]

print("Train error:",np.mean(train_errors))
print("Test error:",np.mean(test_errors))

Train error: 0.8238406990373789
Test error: 0.8135730078827685


### User-User Collaborative Filtering

In [9]:
min_common_items=5   ## Only compare users who share more than min_common_items rated items
n_neighbours=25      ## For each user, keep the n_neighbours users with the highest absolute similarity
k=100                ## Print progress every kth user
user_neighbours=initialize_user_neighbours(n_neighbours,min_common_items,k)

Calculated neighbours for user:  0
Calculated neighbours for user:  100


In [10]:
train_errors=[(user_user_predict(u,m,user_neighbours)-r)**2 for (u,m),r in train_ratings.items()]
test_errors=[(user_user_predict(u,m,user_neighbours)-r)**2 for (u,m),r in test_ratings.items()]

print("Train error:",np.mean(train_errors))
print("Test error:",np.mean(test_errors))

Train error: 0.5895553103600147
Test error: 0.6194113057221694


### Item-Item Collaborative Filtering

In [11]:
min_common_users=5   ## Only compare items that share more than min_common_users raters
n_neighbours=25      ## For each item, keep the n_neighbours items with the highest absolute similarity
k=80                 ## Print progress every kth item
item_neighbours=initialize_item_neighbours(n_neighbours,min_common_users,k)

Calculated neighbours for item:  0
Calculated neighbours for item:  80
Calculated neighbours for item:  160
Calculated neighbours for item:  240
Calculated neighbours for item:  320
Calculated neighbours for item:  400
Calculated neighbours for item:  480
Calculated neighbours for item:  560


In [12]:
train_errors=[(item_item_predict(u,m,item_neighbours)-r)**2 for (u,m),r in train_ratings.items()]
test_errors=[(item_item_predict(u,m,item_neighbours)-r)**2 for (u,m),r in test_ratings.items()]

print("Train error:",np.mean(train_errors))
print("Test error:",np.mean(test_errors))

Train error: 0.4051205463827412
Test error: 0.5708141774284932
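
To compare the three approaches on one scale, the test MSE values printed above can be converted to RMSE, which is in the same units as the ratings, and to a relative reduction over the baseline. The snippet below only re-uses the reported numbers; the variable names are illustrative.

```python
import math

# Test-set MSE values copied from the outputs above
test_mse = {
    "baseline": 0.8135730078827685,
    "user-user": 0.6194113057221694,
    "item-item": 0.5708141774284932,
}

rmse = {name: math.sqrt(mse) for name, mse in test_mse.items()}
reduction = {name: 1 - mse / test_mse["baseline"] for name, mse in test_mse.items()}

for name in test_mse:
    print(f"{name:10s} RMSE={rmse[name]:.3f}  MSE reduction vs baseline={reduction[name]:.1%}")
```

On these numbers, user-user filtering lowers the test MSE by roughly 24% relative to the baseline and item-item filtering by roughly 30%. The gap between train and test error is larger for item-item (about 0.41 vs 0.57), which suggests it fits the training data more closely.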


## Drawbacks of Collaborative Filtering

Collaborative filtering approaches often suffer from three problems: cold start, scalability, and sparsity.<br><br>
(i) Cold start: These systems often require a large amount of existing data on a user in order to make accurate recommendations.<br><br>
(ii) Scalability: In many of the environments in which these systems make recommendations, there are millions of users and products. Thus, a large amount of computation power is often necessary to calculate recommendations.<br><br>
(iii) Sparsity: The number of items on major e-commerce sites is extremely large, and even the most active users will have rated only a small subset of the catalogue. As a result, even the most popular items have very few ratings. The preprocessed dataset used here is an exception: only about 20% of its user-item matrix is empty.<br>
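
One simple way to soften the cold-start problem is to fall back on coarser estimates whenever a user or an item has no history. The sketch below is only an illustration of that idea: `cold_start_predict`, its arguments, and the toy values are all hypothetical and are not part of this notebook.

```python
def cold_start_predict(user, item, user_avgs, item_avgs, global_avg, cf_predict=None):
    """Hypothetical fallback chain for users or items without rating history."""
    known_user = user in user_avgs
    known_item = item in item_avgs

    if known_user and known_item and cf_predict is not None:
        return cf_predict(user, item)  # enough history: use collaborative filtering
    if known_item:
        return item_avgs[item]         # new user: fall back to the item's average
    if known_user:
        return user_avgs[user]         # new item: fall back to the user's average
    return global_avg                  # nothing known: use the global average

# Toy averages, not taken from the dataset
user_avgs = {0: 3.8}
item_avgs = {10: 4.2}

new_user_pred = cold_start_predict(99, 10, user_avgs, item_avgs, global_avg=3.5)
new_item_pred = cold_start_predict(0, 77, user_avgs, item_avgs, global_avg=3.5)
unknown_pred = cold_start_predict(99, 77, user_avgs, item_avgs, global_avg=3.5)
print(new_user_pred, new_item_pred, unknown_pred)
```

A fallback like this keeps the system able to return some prediction, but it does not personalize anything for a new user; that still requires collecting ratings or using other information about the user or item.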

## References

1. https://en.wikipedia.org/wiki/Recommender_system <br>
2. https://www.kaggle.com/grouplens/movielens-20m-dataset <br>
3. https://www.udemy.com/recommender-systems/ <br>