<h1 align="center">Recommender Systems - Collaborative Filtering in Python</h1>

One widely used approach to designing recommender systems is collaborative filtering. Collaborative filtering methods collect and analyze a large amount of information on users' behaviors, activities, or preferences, and predict what a user will like based on their similarity to other users. <br><br>
A key advantage of the collaborative filtering approach is that it does not rely on machine-analyzable content, so it can accurately recommend complex items such as movies without requiring an "understanding" of the item itself. Collaborative filtering assumes that people who agreed in the past will agree in the future, and that they will like items similar to those they liked in the past.<br><br>
This notebook contains implementations of user-user and item-item collaborative filtering algorithms for the MovieLens dataset.<br><br>

## Imports

In [1]:
import pickle
import numpy as np
import pandas as pd
from sortedcontainers import SortedList

## Load the data
Original Dataset - https://www.kaggle.com/grouplens/movielens-20m-dataset<br>
The original dataset has been preprocessed to keep only the top users and the top movies.

In [2]:
## user_to_movie_map={}  ## Key:= User_id, Value:= [list of movies] 
## movie_to_user_map={}  ## Key:= Movie_id, Value:=[list of users] 
## train_ratings={}      ## Key:= (User_id, Movie_id) Value:=Rating 
## test_ratings={}       ## Key:= (User_id, Movie_id) Value:=Rating 
## user_statistics={}    ## Key:= User_id, Values:=(Avg_rating, Norm_of_deviations) 
## movie_statistics={}   ## Key:= Movie_id, Values:=(Avg_rating, Norm_of_deviations)

with open('./data/user_to_movie_map.pkl', 'rb') as fp:
    user_to_movie_map=pickle.load(fp)

with open('./data/movie_to_user_map.pkl', 'rb') as fp:
    movie_to_user_map=pickle.load(fp)

with open('./data/train_ratings.pkl', 'rb') as fp:
    train_ratings=pickle.load(fp)

with open('./data/test_ratings.pkl', 'rb') as fp:
    test_ratings=pickle.load(fp)

with open('./data/user_statistics.pkl', 'rb') as fp:
    user_statistics=pickle.load(fp)
    
with open('./data/movie_statistics.pkl', 'rb') as fp:
    movie_statistics=pickle.load(fp)
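
The pickled files are built outside this notebook, so the exact preprocessing is not shown. As a rough sketch of what `user_statistics` presumably holds, the snippet below derives a per-user average rating and the Euclidean norm of the deviations from that average. The helper name `build_user_statistics` and the toy ratings are hypothetical, and the definition of "norm of deviations" used here is an assumption.

```python
import math
from collections import defaultdict

def build_user_statistics(ratings):
    """Map each user id to (average rating, L2 norm of deviations from that average).

    `ratings` maps (user_id, movie_id) -> rating, mirroring the layout of train_ratings.
    This is an assumed reconstruction, not the preprocessing actually used.
    """
    per_user = defaultdict(list)
    for (user, _movie), rating in ratings.items():
        per_user[user].append(rating)

    stats = {}
    for user, values in per_user.items():
        avg = sum(values) / len(values)
        norm = math.sqrt(sum((v - avg) ** 2 for v in values))
        stats[user] = (avg, norm)
    return stats

# Toy data: two users, three movies
toy_ratings = {(0, 0): 5.0, (0, 1): 3.0, (1, 0): 4.0, (1, 1): 4.0, (1, 2): 1.0}
toy_user_statistics = build_user_statistics(toy_ratings)
```

The same function applied to ratings keyed by movie instead of user would give the analogue of `movie_statistics`.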

## Data Exploration

In [3]:
n_users=len(user_to_movie_map)
n_movies=len(movie_to_user_map)
matrix_size=n_users*n_movies
n_ratings=len(train_ratings)+len(test_ratings)

print("Number of unique users:",n_users)
print("Number of unique movies:",n_movies)
print("Total ratings in the dataset:",n_ratings)

print("User-Item matrix size:",matrix_size)
print("User-Item matrix empty percentage:",(matrix_size-n_ratings)*100/(matrix_size))

Number of unique users: 300
Number of unique movies: 800
Total ratings in the dataset: 179576
User-Item matrix size: 240000
User-Item matrix empty percentage: 25.176666666666666


## User - User Collaborative Filtering

In [4]:
min_common_movies=5  ## Skip pairs of users unless they share more than min_common_movies movies
n_neighbours=25      ## For each user, keep the n_neighbours users with the highest absolute similarity
k=30                 ## Print progress every kth iteration

In [5]:
## Key: User_id, Value: List of (similarity,user_id) of max length n_neighbours 
## (Sorted by absolute similarity in decreasing order)
user_neighbours={i:SortedList(key=lambda x: -abs(x[0])) for i in range(n_users)}

for user1 in range(n_users):
    for user2 in range(user1+1,n_users):
        
        ## Find the common movies between user1 and user2
        common_movies=user_to_movie_map[user1].intersection(user_to_movie_map[user2])
        
        if len(common_movies) <= min_common_movies: continue
        
        u1_ratings = np.array( [ train_ratings[(user1,movie)] for movie in common_movies ] )
        u2_ratings = np.array( [ train_ratings[(user2,movie)] for movie in common_movies ] )
        
        u1_avg, u1_norm = user_statistics[user1]
        u2_avg, u2_norm = user_statistics[user2]
        
        numerator=sum((u1_ratings-u1_avg)*(u2_ratings-u2_avg))
        similarity=numerator/(u1_norm*u2_norm)
        
        user_neighbours[user1].add((similarity,user2))
        user_neighbours[user2].add((similarity,user1))
        
        ## If the lists contain more than the allowed users, pop items
        if len(user_neighbours[user1])>n_neighbours:
            user_neighbours[user1].pop()
        if len(user_neighbours[user2])>n_neighbours:
            user_neighbours[user2].pop()
            
    if user1%k==0: 
        print("Calculated neighbours for user: ",user1)

Calculated neighbours for user:  0
Calculated neighbours for user:  30
Calculated neighbours for user:  60
Calculated neighbours for user:  90
Calculated neighbours for user:  120
Calculated neighbours for user:  150
Calculated neighbours for user:  180
Calculated neighbours for user:  210
Calculated neighbours for user:  240
Calculated neighbours for user:  270
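
Written out, the loop above computes a Pearson-style weight for each pair of users. The notation is introduced here for illustration only: $\Psi_u$ is the set of movies rated by user $u$, $r_{uj}$ is the rating user $u$ gave movie $j$, and $\bar{r}_u$ is the user's average rating. The form of $\sigma_u$ assumes that the precomputed norm in `user_statistics` is taken over all of the user's ratings rather than only the common movies.

$$
w_{uv} = \frac{\sum_{j \in \Psi_u \cap \Psi_v} (r_{uj} - \bar{r}_u)(r_{vj} - \bar{r}_v)}{\sigma_u \, \sigma_v},
\qquad
\sigma_u = \sqrt{\sum_{j \in \Psi_u} (r_{uj} - \bar{r}_u)^2}
$$

Under that assumption the weight stays within $[-1, 1]$ and shrinks toward zero for pairs of users with little overlap, since the numerator sums over fewer terms than the norms do.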


In [6]:
def user_user_predict(user,movie):
    
    numerator=denominator=0

    for similarity,user2 in user_neighbours[user]:
        if (user2,movie) in train_ratings:
            numerator+=similarity*(train_ratings[(user2,movie)]-user_statistics[user2][0])
            denominator+=abs(similarity)
            
    pred=user_statistics[user][0]  ## Use average rating of the user
    
    if denominator>0:
        pred+=(numerator/denominator) ## Adding the weighted deviations of neighbours
        
    pred=max(0.5,pred)
    pred=min(5,pred)
    
    return pred
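
To see what the prediction rule does with signed weights, here is a small self-contained sketch of the same arithmetic on made-up numbers. The function `predict_from_neighbours` and the toy neighbour list are hypothetical and exist only for this illustration; the rating bounds 0.5 and 5 are taken from the function above.

```python
def predict_from_neighbours(user_avg, neighbours, lo=0.5, hi=5.0):
    """Average rating plus the similarity-weighted deviations of the neighbours.

    `neighbours` is a list of (similarity, neighbour_rating, neighbour_avg) tuples
    for the neighbours who rated the target movie.
    """
    numerator = sum(w * (rating - avg) for w, rating, avg in neighbours)
    denominator = sum(abs(w) for w, _rating, _avg in neighbours)

    pred = user_avg
    if denominator > 0:
        pred += numerator / denominator

    # Clamp to the valid rating range
    return min(hi, max(lo, pred))

# One similar neighbour who liked the movie more than usual (+1.0 deviation),
# one dissimilar neighbour who liked it less than usual (-1.0 deviation).
toy_neighbours = [(0.8, 5.0, 4.0), (-0.4, 2.0, 3.0)]
toy_prediction = predict_from_neighbours(3.5, toy_neighbours)
```

Both neighbours push the prediction upward here: a negative similarity multiplied by a negative deviation contributes positively, which is why the numerator keeps the sign of the weight while the denominator uses its absolute value.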

## Item - Item Collaborative Filtering

In [7]:
min_common_users=5   ## Skip pairs of movies unless they share more than min_common_users users
n_neighbours=25      ## For each movie, keep the n_neighbours movies with the highest absolute similarity
k=80                 ## Print progress every kth iteration

In [8]:
## Key: movie_id, Value: List of (similarity,movie_id) of max length n_neighbours 
## (Sorted by absolute similarity in decreasing order)
movie_neighbours={i:SortedList(key=lambda x: -abs(x[0])) for i in range(n_movies)}

for movie1 in range(n_movies):
    for movie2 in range(movie1+1,n_movies):
        
        ## Find the common users between movie1 and movie2
        common_users=movie_to_user_map[movie1].intersection(movie_to_user_map[movie2])
        
        if len(common_users) <= min_common_users: continue
        
        m1_ratings = np.array( [ train_ratings[(user,movie1)] for user in common_users ] )
        m2_ratings = np.array( [ train_ratings[(user,movie2)] for user in common_users ] )
        
        m1_avg, m1_norm = movie_statistics[movie1]
        m2_avg, m2_norm = movie_statistics[movie2]
        
        numerator=sum((m1_ratings-m1_avg)*(m2_ratings-m2_avg))
        similarity=numerator/(m1_norm*m2_norm)
        
        movie_neighbours[movie1].add((similarity,movie2))
        movie_neighbours[movie2].add((similarity,movie1))
        
        ## If the lists contain more than the allowed movies, pop items
        if len(movie_neighbours[movie1])>n_neighbours:
            movie_neighbours[movie1].pop()
        if len(movie_neighbours[movie2])>n_neighbours:
            movie_neighbours[movie2].pop()
            
    if movie1%k==0: 
        print("Calculated neighbours for movie: ",movie1)

Calculated neighbours for movie:  0
Calculated neighbours for movie:  80
Calculated neighbours for movie:  160
Calculated neighbours for movie:  240
Calculated neighbours for movie:  320
Calculated neighbours for movie:  400
Calculated neighbours for movie:  480
Calculated neighbours for movie:  560
Calculated neighbours for movie:  640
Calculated neighbours for movie:  720
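
Both neighbour loops keep only the `n_neighbours` entries with the largest absolute similarity by trimming a `SortedList` after each insertion. As a point of comparison, and not what this notebook does, the same selection can be sketched with the standard library's `heapq.nlargest` once all candidate pairs for an item have been collected. The helper `top_neighbours` and the sample similarities are hypothetical.

```python
import heapq

def top_neighbours(candidates, n):
    """Return the n (similarity, item_id) pairs with the largest |similarity|, strongest first."""
    return heapq.nlargest(n, candidates, key=lambda pair: abs(pair[0]))

# Made-up (similarity, item_id) candidates for a single movie
candidates = [(0.10, 7), (-0.65, 3), (0.40, 9), (-0.05, 1), (0.90, 4)]
strongest = top_neighbours(candidates, 3)
```

Strongly negative similarities are kept alongside strongly positive ones, matching the `-abs(x[0])` sort key used above. The trade-off is memory: this variant holds every candidate for an item before selecting, whereas trimming after each insertion never stores more than `n_neighbours + 1` entries per item.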


In [9]:
def item_item_predict(user,movie):
    
    numerator=denominator=0

    for similarity,movie2 in movie_neighbours[movie]:
        if (user,movie2) in train_ratings:
            numerator+=similarity*(train_ratings[(user,movie2)]-movie_statistics[movie2][0])
            denominator+=abs(similarity)
    
    pred=movie_statistics[movie][0]  ## Use average rating of a movie
    
    if denominator>0:
        pred+=(numerator/denominator) ## Average rating + weighted deviation of neighbours
        
    pred=max(0.5,pred)
    pred=min(5,pred)
    
    return pred

## Results
The mean squared error (MSE) is reported for the train and test sets. The baseline predicts each movie's average rating from `movie_statistics`.
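
For reference, MSE is the average of the squared differences between predicted and actual ratings. The helper below is an illustrative sketch with a hypothetical name, `mean_squared_error`; the cells in this section compute the same quantity inline with a list comprehension and `np.mean`.

```python
def mean_squared_error(predictions, targets):
    """Average of squared differences between predicted and actual ratings."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not predictions:
        raise ValueError("at least one rating is required")
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(predictions)

# Squared errors: 0.25, 0.0, 1.0
toy_mse = mean_squared_error([3.5, 4.0, 2.0], [4.0, 4.0, 1.0])
```

Because errors are squared, a prediction that is off by one star costs four times as much as one that is off by half a star.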

### Baseline

In [10]:
train_errors=[(movie_statistics[m][0]-r)**2 for (u,m),r in train_ratings.items()]
test_errors=[(movie_statistics[m][0]-r)**2 for (u,m),r in test_ratings.items()]

print("Train error:",np.mean(train_errors))
print("Test error:",np.mean(test_errors))

Train error: 0.8138737033960305
Test error: 0.8205727179945912


### User-User Collaborative Filtering

In [11]:
train_errors=[(user_user_predict(u,m)-r)**2 for (u,m),r in train_ratings.items()]
test_errors=[(user_user_predict(u,m)-r)**2 for (u,m),r in test_ratings.items()]

print("Train error:",np.mean(train_errors))
print("Test error:",np.mean(test_errors))

Train error: 0.5848141948510613
Test error: 0.6216076548853222


### Item-Item Collaborative Filtering

In [12]:
train_errors=[(item_item_predict(u,m)-r)**2 for (u,m),r in train_ratings.items()]
test_errors=[(item_item_predict(u,m)-r)**2 for (u,m),r in test_ratings.items()]

print("Train error:",np.mean(train_errors))
print("Test error:",np.mean(test_errors))

Train error: 0.42304816865788514
Test error: 0.5642476318528216
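
To put the three test errors side by side, the sketch below computes how much each method lowers the test MSE relative to the baseline. The numbers are copied from the outputs printed in this section; the helper `relative_reduction` is a hypothetical name used only here.

```python
# Test MSE values copied from the printed outputs in this section
baseline_test_mse = 0.8205727179945912
user_user_test_mse = 0.6216076548853222
item_item_test_mse = 0.5642476318528216

def relative_reduction(baseline, model):
    """Fraction by which `model` lowers the error compared with `baseline`."""
    return (baseline - model) / baseline

user_user_gain = relative_reduction(baseline_test_mse, user_user_test_mse)
item_item_gain = relative_reduction(baseline_test_mse, item_item_test_mse)

print(f"User-user test MSE is {user_user_gain:.1%} below the baseline")
print(f"Item-item test MSE is {item_item_gain:.1%} below the baseline")
```

On this dataset that works out to roughly 24% for user-user and 31% for item-item filtering.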


## Drawbacks of Collaborative Filtering

Collaborative filtering approaches often suffer from three problems: cold start, scalability, and sparsity.<br><br>
(i) Cold start: These systems often require a large amount of existing data on a user in order to make accurate recommendations.<br><br>
(ii) Scalability: In many of the environments in which these systems make recommendations, there are millions of users and products. Thus, a large amount of computation power is often necessary to calculate recommendations.<br><br>
(iii) Sparsity: The number of items sold on major e-commerce sites is extremely large. The most active users will only have rated a small subset of the overall database. Thus, even the most popular items have very few ratings.<br>
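
To make the scalability point concrete: the neighbour searches in this notebook compare every pair of users (or movies), so the number of similarity computations grows quadratically with the number of entities. The sketch below counts those pairs; `pairwise_comparisons` is a hypothetical helper, and the one-million figure is an illustrative assumption rather than a property of the dataset used here.

```python
from math import comb

def pairwise_comparisons(n_entities):
    """Number of unordered pairs visited by an all-pairs neighbour search."""
    return comb(n_entities, 2)

user_pairs = pairwise_comparisons(300)          # users in this notebook
movie_pairs = pairwise_comparisons(800)         # movies in this notebook
large_scale_pairs = pairwise_comparisons(10**6) # hypothetical site with a million users

print("User pairs:", user_pairs)
print("Movie pairs:", movie_pairs)
print("Pairs for one million users:", large_scale_pairs)
```

The 300 users here need 44,850 comparisons and the 800 movies need 319,600, which is manageable. A million users would need about 5 × 10^11, which is why large systems typically rely on approximate or precomputed neighbour search instead of an exhaustive double loop.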

## References

1. https://en.wikipedia.org/wiki/Recommender_system <br>
2. https://www.kaggle.com/grouplens/movielens-20m-dataset <br>
3. https://www.udemy.com/recommender-systems/ <br>