# Collaborative Filtering Recommender System based on Cosine Similarity

### Purpose
To build a working cosine similarity model based on collaborative filtering.

### Methodology
This notebook assumes that the model receives a pre-processed dataset of user-item interactions. For simplicity, it uses the [small MovieLens dataset](https://surprise.readthedocs.io/en/stable/dataset.html) (ML-100K).

### Author Information
Nishant Aswani (@niniack)


### Setup (Imports)

In [1]:
# Data manipulation
import pandas as pd
import numpy as np
from lenskit import batch, topn, util
from lenskit import crossfold as xf
from lenskit.algorithms import Recommender, Predictor, als, basic, user_knn
from scipy import sparse
from sklearn.metrics.pairwise import cosine_similarity
from lenskit.data import sparse_ratings

# Dataset
from lenskit.datasets import ML100K
movielens = ML100K('../ml-100k')

# Options for pandas
pd.options.display.max_columns = 50
pd.options.display.max_rows = 30

# Visualizations and debugging
import plotly.graph_objs as go
from pprintpp import pprint as pp
import logging

# Autoreload extension
if 'autoreload' not in get_ipython().extension_manager.loaded:
    %load_ext autoreload
    
%autoreload 2

### Downloading ML100K Dataset

In [2]:
# %%!
# wget -q -O ml-100k.zip http://files.grouplens.org/datasets/movielens/ml-100k.zip

## Note: `unzip -f` only freshens files that already exist on disk, so a plain unzip is used here
# unzip -q ml-100k.zip

### Data Exploration

The lenskit ML100K dataset provides the following: movies, ratings, users

In [3]:
ratings = movielens.ratings
ratings.head()

,user,item,rating,timestamp
0,196,242,3.0,881250949
1,186,302,3.0,891717742
2,22,377,1.0,878887116
3,244,51,2.0,880606923
4,166,346,1.0,886397596


In [4]:
len(ratings)

100000

In [5]:
users = movielens.users
users.head()

user,age,gender,occupation,zip
1,24,M,technician,85711
2,53,F,other,94043
3,23,M,writer,32067
4,24,M,technician,43537
5,33,F,other,15213


In [6]:
len(users)

943

In [7]:
movies = movielens.movies
movies.head()

item,title,release,vidrelease,imdb,unknown,Action,Adventure,Animation,Children's,Comedy,Crime,Documentary,Drama,Fantasy,Film-Noir,Horror,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
1,Toy Story (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Toy%20Story%2...,0,0,0,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0
2,GoldenEye (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?GoldenEye%20(...,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
3,Four Rooms (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Four%20Rooms%...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
4,Get Shorty (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Get%20Shorty%...,0,1,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0
5,Copycat (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Copycat%20(1995),0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,1,0,0


In [8]:
len(movies)

1682

### Testing The "Most Popular Item Recommendation" System

The popular recommender calculates a score for each item in the ratings matrix. When provided a user to recommend items to, the model returns the top scoring n items that the given user has not previously rated.
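
The scoring rule described above can be sketched in plain pandas. This is a minimal illustration on a made-up `toy_ratings` frame, assuming the score is simply the number of ratings an item has received; `popular_recommend` is a hypothetical helper and not part of LensKit.

```python
import pandas as pd

# Made-up user-item interactions with the same columns as the ML100K ratings frame
toy_ratings = pd.DataFrame({
    "user":   [1, 1, 2, 2, 2, 3, 3, 4, 4],
    "item":   [10, 20, 10, 20, 30, 10, 40, 10, 30],
    "rating": [4.0, 3.0, 5.0, 2.0, 4.0, 3.0, 5.0, 4.0, 2.0],
})

def popular_recommend(ratings, user, n):
    """Return the n most-rated items that `user` has not rated yet."""
    # Score each item by how many users rated it
    scores = ratings.groupby("item")["user"].count()
    # Drop the items the user has already rated
    seen = ratings.loc[ratings["user"] == user, "item"]
    unseen_scores = scores.drop(index=seen.unique())
    # Keep the n highest scores
    top = unseen_scores.sort_values(ascending=False).head(n)
    return top.rename("score").reset_index()

toy_recs = popular_recommend(toy_ratings, user=1, n=2)
print(toy_recs)
```

Item 10 has the most ratings in the toy data, but user 1 has already rated it, so it is excluded from that user's recommendations.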

In [9]:
# Initializing and "training" the popular recommender
algo_popular = basic.Popular()
algo_popular.fit(ratings)

<lenskit.algorithms.basic.Popular at 0x7fa80844a430>

In [10]:
# Recommend the top 10 most popular items for UserID 20
algo_popular.recommend(20, 10)

,item,score
0,258,509.0
1,100,508.0
2,294,485.0
3,286,481.0
4,300,431.0
5,127,413.0
6,56,394.0
7,7,392.0
8,237,384.0
9,117,378.0


In [11]:
pop = ratings.groupby('item').user.count()
pop.sort_values(ascending=False)

item
50      583
258     509
100     508
181     507
294     485
       ... 
1576      1
1577      1
1348      1
1579      1
1682      1
Name: user, Length: 1682, dtype: int64

The highest recommendation, item 258, to user 20 is not the item with the highest score. However, it is the highest scoring item that user 20 has never rated.

In [12]:
ratings.loc[(ratings['user'] == 20) & (ratings['item'] == 258)]

,user,item,rating,timestamp


In [13]:
ratings.loc[(ratings['user'] == 31)].set_index('item')

item,user,rating,timestamp
886,31,2.0,881547877
484,31,5.0,881548030
682,31,2.0,881547834
302,31,4.0,881547719
135,31,4.0,881548030
...,...,...,...
519,31,4.0,881548053
1022,31,5.0,881547814
1019,31,5.0,881548082
611,31,4.0,881548111


## User-User Collaborative Filtering Algorithm

The goal of the user-user cosine approach is to let the model be updated at each iteration rather than retrained from scratch. This saves computation and lets the dynamics experiments run much faster.
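
One way such an incremental update could look is sketched below. It assumes a small dense rating matrix with unrated entries stored as 0 and a precomputed user-user similarity matrix; when one user adds a rating, only that user's row and column are recomputed. The helper names (`full_similarity`, `update_user`) are hypothetical, and this is not the implementation used later in the notebook.

```python
import numpy as np

def full_similarity(R):
    """Cosine similarity between all rows of R (users with no ratings get 0)."""
    norms = np.linalg.norm(R, axis=1)
    norms[norms == 0] = 1.0  # avoid division by zero for empty rows
    normalized = R / norms[:, None]
    return normalized @ normalized.T

def update_user(R, S, user, item, rating):
    """Record a new rating in place and refresh only the affected row/column of S."""
    R[user, item] = rating
    norms = np.linalg.norm(R, axis=1)
    norms[norms == 0] = 1.0
    # Similarity of the updated user to every user: one matrix-vector product
    row = (R @ R[user]) / (norms * norms[user])
    S[user, :] = row
    S[:, user] = row
    return S

# Toy data: 4 users x 5 items, 0 means "not rated"
R = np.array([
    [5, 0, 3, 0, 1],
    [4, 0, 0, 2, 0],
    [0, 2, 4, 0, 5],
    [1, 5, 0, 4, 0],
], dtype=float)
S = full_similarity(R)

# The user in row 1 rates the item in column 2 with a 5
S = update_user(R, S, user=1, item=2, rating=5.0)
```

Because the other users' rows are unchanged, every entry of `S` outside row and column 1 is still valid, so the updated matrix matches a full recomputation.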

### Testing out the Scikit-learn Cosine Similarity function with Dummy Data

In [14]:
A =  np.array([[0, 1, 0, 0, 1], [0, 0, 1, 1, 1],[1, 1, 0, 1, 0]])
print(A)

[[0 1 0 0 1]
 [0 0 1 1 1]
 [1 1 0 1 0]]


In [15]:
A[0].reshape(1,-1)

array([[0, 1, 0, 0, 1]])

In [16]:
A[1].reshape(1,-1)

array([[0, 0, 1, 1, 1]])

In [17]:
similarities = cosine_similarity(A[0].reshape(1,-1), A[1].reshape(1,-1))
similarities

array([[0.40824829]])
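
As a sanity check, the same value can be reproduced by hand from the definition of cosine similarity, the dot product divided by the product of the vector norms. The sketch below uses only NumPy; the variable names are illustrative.

```python
import numpy as np

# First two rows of the dummy matrix A used above
a = np.array([0, 1, 0, 0, 1])
b = np.array([0, 0, 1, 1, 1])

# Only the last position is non-zero in both vectors, so the dot product is 1
dot = a @ b
# The norms are sqrt(2) and sqrt(3)
norm_a = np.linalg.norm(a)
norm_b = np.linalg.norm(b)

manual_similarity = dot / (norm_a * norm_b)  # equals 1 / sqrt(6)
print(manual_similarity)
```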

### Testing out the Scikit-learn Cosine Similarity function with ML100K

In [18]:
%%time
user1_ratings = ratings[ratings['user'].values == 99].set_index('item')
user2_ratings = ratings[ratings['user'].values == 100].set_index('item')

CPU times: user 1.18 ms, sys: 748 µs, total: 1.93 ms
Wall time: 1.53 ms


In [19]:
user1_ratings.index.values

array([   4,  268,   79,  111, 1016,  873,  403,  246,  274,   50,  354,
        742,  181,  182,  597,  410,  168,  204,  895,  508,  546,  682,
        124,  421,  741, 1052,  118,  203, 1119,  346,  258,  123,  196,
        232,   11,  963,  685,    7,   56,  363,  237,  369,  676, 1067,
        100,  322,  331,  312,  245,   69,  265,  926,   12,  255,    3,
        871,  288,  827,  402,  107,   66,  762,  628,  328,  413,  694,
        409,  748,  978,  406,  472,  273,  338,  815,   28,   64,  342,
         92,  751,  829,  275,  619,  315,  125,  845,  358,  345,  201,
        121,  475,   22,  300,  471,  405,  456,  238,    1,  595,  117,
        367,  763,   25,  591,  332,  240, 1079, 1132,  975,  173,  210,
        105,  313,  147,  931,  651,  120,  282,  290,  780,  116, 1048,
        310,  172,  326,  678,  276,  789,   98,  294,  544,  174,  329,
       1047,  473,  433,  348])

In [20]:
user2_ratings.index.values

array([ 344,  354,  268,  321,  355,  750,  266,  288,  302,  340,  689,
        905,  289,  691,  316, 1236,  342,  990,  333,  752,  323,  348,
        313,  292, 1238,  879,  300,  328, 1235, 1237,  678,  286,  908,
        690,  874,  880,  349,  310,  347, 1234,  270, 1233,  326,  269,
        258,  900,  886,  294,  272,  881,  895,  892,  887,  885,  346,
        751,  271,  898,  315])

In [31]:
%%time
merge = pd.merge(user1_ratings, user2_ratings, how="inner", left_index=True, right_index=True)

CPU times: user 2 ms, sys: 0 ns, total: 2 ms
Wall time: 1.75 ms


In [32]:
%%time
df2 = user2_ratings[user2_ratings.index.isin(user1_ratings.index)]
user1_ratings.merge(df2, left_index=True, right_index=True)

CPU times: user 2.29 ms, sys: 0 ns, total: 2.29 ms
Wall time: 2.07 ms


item,user_x,rating_x,timestamp_x,user_y,rating_y,timestamp_y
268,99,3.0,885678247,100,3.0,891374982
354,99,2.0,888469332,100,2.0,891375260
895,99,3.0,885678304,100,2.0,891375212
346,99,4.0,885678415,100,3.0,891375630
258,99,5.0,885678696,100,4.0,891374675
288,99,4.0,885678247,100,2.0,891374603
328,99,4.0,885678696,100,4.0,891375212
342,99,1.0,885678348,100,3.0,891375454
751,99,4.0,885678397,100,4.0,891374868
315,99,4.0,885678479,100,5.0,891375557


In [33]:
%%time
user2_ratings.join(user1_ratings, how="inner", lsuffix='rating', rsuffix='rating')

CPU times: user 1.86 ms, sys: 86 µs, total: 1.95 ms
Wall time: 1.7 ms


item,userrating,ratingrating,timestamprating,userrating,ratingrating,timestamprating
354,100,2.0,891375260,99,2.0,888469332
268,100,3.0,891374982,99,3.0,885678247
288,100,2.0,891374603,99,4.0,885678247
342,100,3.0,891375454,99,1.0,885678348
348,100,3.0,891375630,99,4.0,886518562
313,100,5.0,891374706,99,5.0,885678348
300,100,4.0,891375112,99,4.0,885678397
328,100,4.0,891375212,99,4.0,885678696
678,100,3.0,891375428,99,2.0,885678479
310,100,3.0,891375522,99,3.0,885678348


In [34]:
user2_ratings.join(user1_ratings, how="inner", lsuffix='_x', rsuffix='_y')

item,user_x,rating_x,timestamp_x,user_y,rating_y,timestamp_y
354,100,2.0,891375260,99,2.0,888469332
268,100,3.0,891374982,99,3.0,885678247
288,100,2.0,891374603,99,4.0,885678247
342,100,3.0,891375454,99,1.0,885678348
348,100,3.0,891375630,99,4.0,886518562
313,100,5.0,891374706,99,5.0,885678348
300,100,4.0,891375112,99,4.0,885678397
328,100,4.0,891375212,99,4.0,885678696
678,100,3.0,891375428,99,2.0,885678479
310,100,3.0,891375522,99,3.0,885678348


#### In ndarray representation

In [35]:
%%time
similarities = cosine_similarity(np.array(merge['rating_x']).reshape(1,-1), np.array(merge['rating_y']).reshape(1,-1))
similarities

CPU times: user 803 µs, sys: 38 µs, total: 841 µs
Wall time: 785 µs


array([[0.96814626]], dtype=float32)
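
Note that the value above is computed only over the items both users rated. The `CosinSimilarityEfficient` class in this notebook instead compares full rating vectors in which unrated items are 0, and the two conventions generally give different numbers. The sketch below illustrates the difference on made-up ratings; `u` and `v` are hypothetical users (not users 99 and 100) and `cosine` is an illustrative plain-NumPy helper.

```python
import numpy as np
import pandas as pd

def cosine(a, b):
    """Cosine similarity between two 1-D arrays."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Made-up ratings indexed by item ID
u = pd.Series({1: 5.0, 2: 3.0, 3: 4.0}, name="u")
v = pd.Series({2: 3.0, 3: 4.0, 4: 5.0}, name="v")

# Convention 1: keep only the co-rated items (inner join on the item index)
both = pd.concat([u, v], axis=1, join="inner")
sim_corated = cosine(both["u"].to_numpy(), both["v"].to_numpy())

# Convention 2: align on all items and fill unrated entries with 0
dense = pd.concat([u, v], axis=1).fillna(0.0)
sim_dense = cosine(dense["u"].to_numpy(), dense["v"].to_numpy())

print(sim_corated, sim_dense)
```

On the co-rated items the two toy users agree exactly, so the first value is 1, while the zero-filled vectors also penalize the items only one of them rated.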

## Building the CosinSimilarity Class

In [43]:
%%time
# Instantiate object
num_users = 5 # set to len(users)
algo_cosin = CosinSimilarity(num_users)

# Reduce the candidate space + build user-user cosine similarity matrix
algo_cosin.fit(ratings)

# Ask for rating predictions
algo_cosin.recommend(1)

None
CPU times: user 29.1 ms, sys: 48 µs, total: 29.2 ms
Wall time: 28 ms


In [44]:
class CosinSimilarityEfficient(Recommender, Predictor):
    """
    Recommend new items by finding the users most similar to the given user, measured by cosine similarity
    
    Args:
        selector(CandidateSelector):
            The candidate selector to use. If ``None``, uses a new
            :class:`UnratedItemCandidateSelector`.

    """
    def __init__(self, min_neighbors=1, min_sim=0, selector=None):
        
        # Set selector
        if selector is None:
            self.selector = basic.UnratedItemCandidateSelector()
        else:
            self.selector = selector
    
        # Set parameters
        self.min_neighbors = min_neighbors
        self.min_sim = min_sim
        
        # Enable logging (stored on the instance so other methods can use it)
        self._logger = logging.getLogger(__name__)
            
    def __str__(self):
        return 'CosinSimilarity'
        
    def _get_user_ratings(self, user):
        
        if self.user_index_ is None:
            self._logger.warning('User %d has no ratings', user)
            return None
        
        # Dense 1 x n_items row, with unrated items filled with 0
        user_pos = self.user_index_.get_loc(user)
        ratings = self.rating_matrix_.getrow(user_pos).toarray()
        return ratings
    
    def _agg_weighted_average(self):
        pass
        
    ## Input the ratings matrix
    def fit(self, ratings, **kwargs):
        
        # Get sparse representation in CSR format
        uir, users, items = sparse_ratings(ratings, scipy=True)
        
        # Reduce candidate space to unseen items
        self.selector.fit(ratings)
        
        # Store ratings
        self.rating_matrix_ = uir
        self.user_index_ = users
        self.item_index_ = items
    
    ## Provide a recommendation of top "n" movies given "user"
    ## The recommender uses the UnratedItemCandidateSelector by default and uses the ratings matrix 
    ## it was originally fit on
    def recommend(self, user_id, top_n = 10, candidates=None, ratings=None):
        
        # Reduce candidate space and store candidates with item ID
        if candidates is None:
            candidates = self.selector.candidates(user_id, ratings)  
        
        # Get ratings of user given user ID
        user_ratings = self._get_user_ratings(user_id)
       
        # Assert that ratings is a dense vector, filling unrated items with 0
        assert len(user_ratings[0]) == len(self.item_index_)
        
        # Instantiate similarity vector
        # Similarities are stored by user ID, which assumes IDs run from 1 to the number of users
        similarity_vector = np.zeros((len(self.user_index_) + 1,), dtype=float)
        
        # Calculate similarity between the target user and every user
        for i in range(1, len(self.user_index_) + 1):
            neighbor_ratings = self._get_user_ratings(i)
            # cosine_similarity returns a 1x1 array, so extract the scalar
            similarity_vector[i] = cosine_similarity(user_ratings, neighbor_ratings)[0, 0]
            
        # Predict ratings and scores for all unseen items
        prediction_score_df = self.predict_for_user(user_id, similarity_vector, candidates)
        
        return(prediction_score_df)
        
    def predict_for_user(self, user, similarity_vector, items):
        
        # Instantiate ratings and item_popularity vectors
        predicted_ratings = np.zeros(len(items), dtype=float)
        item_popularity = np.zeros(len(items), dtype=float)

        # For each unseen item
        for i in range(len(items)):
            
            # Item position given item i ID
            item_pos = self.item_index_.get_loc(items[i])
            
            # Get item column
            i_col = self.rating_matrix_.getcol(item_pos)
            
            # User positions that rated item i
            i_raters_pos = i_col.nonzero()[0]
            
            # Store popularity of item based on number of total ratings 
            item_popularity[i] = i_col.count_nonzero()

            # Ratings of users that rated item i
            i_ratings = self.rating_matrix_.getcol(item_pos).data
            
            # Assert that number of ratings is equal to number of users that rated
            assert len(i_raters_pos) == len(i_ratings)
            
            # Obtain user IDs from user positions
            i_raters_id = self.user_index_[i_raters_pos].values
            
            # Similarity values with all users that rated item i
            i_raters_similarities = similarity_vector[i_raters_id]
            
            # Calculate floored ratings and scores
            sum_of_product_of_similarities_and_ratings = np.multiply(i_raters_similarities, i_ratings).sum()
            sum_of_all_similarities = i_raters_similarities.sum()
            predicted_ratings[i] = sum_of_product_of_similarities_and_ratings//sum_of_all_similarities
        
        # minmax scale the popularity of each item
        normalized_popularity = np.interp(item_popularity, (item_popularity.min(), item_popularity.max()), (0, +1))
        score = np.multiply(normalized_popularity, predicted_ratings)
        
        results = {'predicted_ratings':predicted_ratings, 'normalized_popularity':normalized_popularity, 'score':score}
        return pd.DataFrame(results, index=items)
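

The per-item loop in `predict_for_user` computes a similarity-weighted average of the ratings each candidate item received. Assuming the ratings sit in a SciPy CSR matrix with unrated entries left out, the same quantity can be sketched for all items at once with two sparse matrix-vector products. The function `weighted_average_predictions` and the toy data are hypothetical; unlike the class above, this sketch indexes similarities by row position rather than user ID and does not floor the result.

```python
import numpy as np
from scipy import sparse

def weighted_average_predictions(rating_matrix, similarities):
    """Similarity-weighted mean rating per item.

    rating_matrix: CSR matrix of shape (users, items), unrated entries not stored.
    similarities:  1-D array with one similarity per user (row position).
    """
    # Numerator: sum over raters of similarity * rating, for every item
    numerator = rating_matrix.T @ similarities
    # Denominator: sum of similarities over the users who rated each item
    rated_mask = rating_matrix.copy()
    rated_mask.data = np.ones_like(rated_mask.data)
    denominator = rated_mask.T @ similarities
    # Items whose raters have zero total similarity get NaN
    with np.errstate(divide="ignore", invalid="ignore"):
        return np.where(denominator > 0, numerator / denominator, np.nan)

# Toy data: 3 users x 3 items (zero entries are not stored by csr_matrix)
toy_matrix = sparse.csr_matrix(np.array([
    [5.0, 0.0, 3.0],
    [4.0, 2.0, 0.0],
    [0.0, 4.0, 0.0],
]))
toy_similarities = np.array([1.0, 0.5, 0.0])

toy_predictions = weighted_average_predictions(toy_matrix, toy_similarities)
print(toy_predictions)
```

For the first toy item, the two raters contribute (1.0 * 5 + 0.5 * 4) / (1.0 + 0.5), which is about 4.67.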


In [45]:
%%time
# Instantiate object
algo_cosin = CosinSimilarityEfficient()

# Fit the candidate selector and store the sparse ratings matrix
algo_cosin.fit(ratings)

# Recommend items to user
recs = algo_cosin.recommend(1)


CPU times: user 1.42 s, sys: 0 ns, total: 1.42 s
Wall time: 1.42 s


In [46]:
recs.sort_values(
    by=["predicted_ratings", "normalized_popularity"],
    ascending=False
)[["predicted_ratings", "normalized_popularity"]].head(20)

,predicted_ratings,normalized_popularity
1500,5.0,0.002066
814,5.0,0.0
1201,5.0,0.0
1599,5.0,0.0
1653,5.0,0.0
313,4.0,0.721074
318,4.0,0.613636
302,4.0,0.61157
357,4.0,0.543388
483,4.0,0.5


# References
Relevant references:
1. https://realpython.com/build-recommendation-engine-collaborative-filtering/
1. https://lkpy.readthedocs.io/en/stable/GettingStarted.html#
1. https://github.com/lenskit/lkpy/tree/main/lenskit/algorithms
1. https://link.springer.com/book/10.1007%2F978-3-319-29659-3

# Appendix

Legacy implementation

In [47]:
class CosinSimilarity(Recommender, Predictor):
    """
    Recommend new items by finding the users most similar to the given user, measured by cosine similarity
    
    Args:
        selector(CandidateSelector):
            The candidate selector to use. If ``None``, uses a new
            :class:`UnratedItemCandidateSelector`.

    """
    def __init__(self, num_users, selector = None):
        
        ## Set selector
        if selector is None:
            self.selector = basic.UnratedItemCandidateSelector()
        else:
            self.selector = selector
        
        ## Set number of users in original matrix
        self.num_users = num_users
        
        ## Instantiate sparse matrix with diagonal
        self.sim_matrix = sparse.eye(num_users)
        
        ## Convert diagonal to lil
        self.sim_matrix = self.sim_matrix.tolil()
        
    ## Input the ratings matrix
    def fit(self, ratings, **kwargs):
        
        # Store each user's rating df in dictionary for quick lookup
        self.df_dict = {}
        for i in range(1, self.num_users+1):
            self.df_dict[i] = ratings[ratings['user'].values == i].set_index('item')
            self.df_dict[i].sort_index(inplace=True)
        
        # Populate sparse matrix for all users with cos sim
        for i in range(1, self.num_users+1):
            for j in range (i+1, self.num_users+1):
                user_x = self.df_dict[i]
                user_y = self.df_dict[j]

                ## METHOD1: using inner join (marginally faster)
                merge = user_y.join(user_x, how="inner", lsuffix='_x', rsuffix='_y')
                    
                ## METHOD2: using inner merge (marginally slower)
                # merge = pd.merge(user_x, user_y, how="inner", left_index=True, right_index=True)

                if (len(merge) > 5):
                    similarity = cosine_similarity(np.array(merge['rating_x']).reshape(1,-1), np.array(merge['rating_y']).reshape(1,-1))
                    self.sim_matrix[i-1, j-1] = similarity
                else:
                    self.sim_matrix[i-1, j-1] = 0
        
        # Reduce candidate space to unseen items
        self.selector.fit(ratings)
    
    ## Provide a recommendation of top "n" movies given "user"
    ## The recommender uses the UnratedItemCandidateSelector by default and uses the ratings matrix 
    ## it was originally fit on
    def recommend(self, user, n=5, candidates=None, ratings=None):
        
        # Obtain reduced candidate space
        if candidates is None:
            candidates = self.selector.candidates(user, ratings)   
        
        # Obtain predictions from reduced candidate space
        self.predict_for_user(user, candidates, ratings)
        
        pass 
    
    def predict_for_user(self, user, items, ratings=None):
        
        print(ratings)
    
    def __str__(self):
        return 'CosinSimilarity'