# Collaborative Filtering Recommender System Based on Cosine Similarity

### Purpose
To build a working collaborative filtering model based on cosine similarity.

### Methodology
This notebook assumes that the model receives a pre-processed dataset of user-item interactions. For simplicity, it uses the [small movielens dataset](https://surprise.readthedocs.io/en/stable/dataset.html) (ML100K), loaded through LensKit.

### Author Information
Nishant Aswani (@niniack)


### Setup (Imports)

In [40]:
# Data manipulation
import pandas as pd
import numpy as np
from lenskit import batch, topn, util
from lenskit import crossfold as xf
from lenskit.algorithms import Recommender, Predictor, als, basic, user_knn
from scipy import sparse
from sklearn.metrics.pairwise import cosine_similarity

# Dataset
from lenskit.datasets import ML100K
movielens = ML100K('ml-100k')

# Options for pandas
pd.options.display.max_columns = 50
pd.options.display.max_rows = 30

# Visualizations
import plotly.graph_objs as go

# Autoreload extension
if 'autoreload' not in get_ipython().extension_manager.loaded:
    %load_ext autoreload
    
%autoreload 2

### Downloading ML100K Dataset

In [2]:
# %%!
# wget -q -O ml-100k.zip http://files.grouplens.org/datasets/movielens/ml-100k.zip

## This unzip method may not work!
# unzip -f ml-100k.zip

### Data Exploration

The LensKit `ML100K` dataset object exposes three tables: `ratings`, `users`, and `movies`.

In [3]:
ratings = movielens.ratings
ratings.head()

Unnamed: 0,user,item,rating,timestamp
0,196,242,3.0,881250949
1,186,302,3.0,891717742
2,22,377,1.0,878887116
3,244,51,2.0,880606923
4,166,346,1.0,886397596


In [4]:
len(ratings)

100000

In [5]:
users = movielens.users
users.head()

user,age,gender,occupation,zip
1,24,M,technician,85711
2,53,F,other,94043
3,23,M,writer,32067
4,24,M,technician,43537
5,33,F,other,15213


In [6]:
len(users)

943

In [7]:
movies = movielens.movies
movies.head()

item,title,release,vidrelease,imdb,unknown,Action,Adventure,Animation,Children's,Comedy,Crime,Documentary,Drama,Fantasy,Film-Noir,Horror,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
1,Toy Story (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Toy%20Story%2...,0,0,0,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0
2,GoldenEye (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?GoldenEye%20(...,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
3,Four Rooms (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Four%20Rooms%...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
4,Get Shorty (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Get%20Shorty%...,0,1,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0
5,Copycat (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Copycat%20(1995),0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,1,0,0


In [8]:
len(movies)

1682

### Testing The "Most Popular Item Recommendation" System

The popular recommender scores each item in the ratings matrix by the number of ratings it has received. Given a user, the model returns the n highest-scoring items that the user has not previously rated.

In [9]:
# Initializing and "training" the popular recommender
algo_popular = basic.Popular()
algo_popular.fit(ratings)

<lenskit.algorithms.basic.Popular at 0x7fc7947b2fd0>

In [10]:
# Recommend the top 10 most popular items for UserID 20
algo_popular.recommend(20, 10)

Unnamed: 0,item,score
0,258,509.0
1,100,508.0
2,294,485.0
3,286,481.0
4,300,431.0
5,127,413.0
6,56,394.0
7,7,392.0
8,237,384.0
9,117,378.0


In [11]:
pop = ratings.groupby('item').user.count()
pop.sort_values(ascending=False)

item
50      583
258     509
100     508
181     507
294     485
       ... 
1576      1
1577      1
1348      1
1579      1
1682      1
Name: user, Length: 1682, dtype: int64

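The top recommendation for user 20, item 258 (509 ratings), is not the item with the highest overall score; that is item 50 (583 ratings). It is, however, the highest-scoring item that user 20 has never rated. The query in the next cell returns no rows, which confirms that user 20 has not rated item 258.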

In [12]:
ratings.loc[(ratings['user'] == 20) & (ratings['item'] == 258)]

Unnamed: 0,user,item,rating,timestamp
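
The behaviour observed in this section can be reproduced with plain pandas. The following is a minimal sketch on hypothetical toy data, not LensKit's implementation: the function name `most_popular_unrated` and the `toy_ratings` frame are assumptions introduced for illustration, and popularity is assumed to be the raw rating count, as in the `groupby` cell.

```python
import pandas as pd

# Hypothetical toy ratings with the same columns as the ML100K ratings frame
toy_ratings = pd.DataFrame({
    "user":   [1, 1, 2, 2, 3, 3, 3],
    "item":   [10, 20, 10, 30, 10, 20, 30],
    "rating": [4.0, 3.0, 5.0, 2.0, 1.0, 4.0, 5.0],
})


def most_popular_unrated(ratings, user, n):
    """Return the n most-rated items that `user` has not rated yet."""
    # Popularity score: number of ratings each item has received
    counts = ratings.groupby("item")["user"].count()
    # Items the user has already rated are removed from the candidates
    seen = ratings.loc[ratings["user"] == user, "item"].unique()
    candidates = counts.drop(index=seen)
    return candidates.sort_values(ascending=False).head(n)


# User 1 has rated items 10 and 20, so only item 30 remains as a candidate
top_for_user_1 = most_popular_unrated(toy_ratings, user=1, n=2)
print(top_for_user_1)
```

Item 10 is the most-rated item in the toy data, yet it is never recommended to user 1, which mirrors how item 50 is skipped for user 20.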


## User-User Collaborative Filtering Algorithm

The goal of the user-user cosine similarity approach is to allow the model to be updated at each iteration rather than retrained from scratch. This should reduce computational cost and allow the dynamics experiments to run much faster.
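
One way such an incremental update could work is sketched below, under the assumption that ratings are held in a dense users x items NumPy array with 0 meaning "not rated". The helper names `cosine_matrix` and `update_user` are hypothetical and not part of LensKit. When a single user adds a rating, only that user's row and column of the similarity matrix change, so the remaining entries can be reused.

```python
import numpy as np


def cosine_matrix(R):
    """Full user-user cosine similarity for a users x items matrix."""
    norms = np.linalg.norm(R, axis=1)
    norms[norms == 0] = 1.0  # avoid division by zero for users with no ratings
    unit = R / norms[:, None]
    return unit @ unit.T


def update_user(R, S, user, item, rating):
    """Record one new rating and refresh only that user's similarities in place."""
    R[user, item] = rating
    norms = np.linalg.norm(R, axis=1)
    norms[norms == 0] = 1.0
    # Similarity of the updated user to every user (including themselves)
    row = (R @ R[user]) / (norms * norms[user])
    S[user, :] = row
    S[:, user] = row
    return S


R = np.array([[5.0, 0.0, 3.0],
              [4.0, 2.0, 0.0],
              [0.0, 1.0, 5.0]])
S = cosine_matrix(R)

# User 1 rates item 2; only row 1 and column 1 of S are recomputed
update_user(R, S, user=1, item=2, rating=4.0)
incremental_matches_full = np.allclose(S, cosine_matrix(R))
print(incremental_matches_full)
```

The final check compares the incrementally updated matrix against a full recomputation, which is a convenient sanity test while developing the real model.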

### Testing out the Scikit-learn Cosine Similarity function

In [36]:
A =  np.array([[0, 1, 0, 0, 1], [0, 0, 1, 1, 1],[1, 1, 0, 1, 0]])
print(A)

[[0 1 0 0 1]
 [0 0 1 1 1]
 [1 1 0 1 0]]


In [54]:
A[0].reshape(1,-1)

array([[0, 1, 0, 0, 1]])

In [55]:
A[1].reshape(1,-1)

array([[0, 0, 1, 1, 1]])

In [58]:
similarities = cosine_similarity(A[0].reshape(1,-1), A[1].reshape(1,-1))
similarities

array([[0.40824829]])
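
As a cross-check, the same value follows directly from the definition cos(a, b) = (a · b) / (‖a‖ ‖b‖). The short sketch below recomputes it with NumPy only; the variable names are chosen for illustration.

```python
import numpy as np

a = np.array([0, 1, 0, 0, 1])
b = np.array([0, 0, 1, 1, 1])

# The vectors share exactly one nonzero position, so the dot product is 1
dot = a @ b
# ||a|| = sqrt(2) and ||b|| = sqrt(3)
norm_product = np.linalg.norm(a) * np.linalg.norm(b)

manual_similarity = dot / norm_product  # 1 / sqrt(6)
print(manual_similarity)
```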

### Implementing the CosinSimilarity Algorithm Class

In [81]:
class CosinSimilarity(Recommender, Predictor):
    """
    Recommend new items by finding the users most similar to the given user, measured by cosine similarity
    
    Args:
        selector(CandidateSelector):
            The candidate selector to use. If ``None``, uses a new
            :class:`UnratedItemCandidateSelector`.

    """
    def __init__(self, selector = None):
        if selector is None:
            self.selector = basic.UnratedItemCandidateSelector()
        else:
            self.selector = selector
    
    ## Input the ratings matrix
    def fit(self, ratings, **kwargs):
        self.selector.fit(ratings)
    
    ## Provide a recommendation of top "n" movies given "user"
    ## The recommender uses the UnratedItemCandidateSelector by default and uses the ratings matrix 
    ## it was originally fit on
    def recommend(self, user, n=None, candidates=None, ratings=None):
        
        # Obtain reduced candidate space
        if candidates is None:
            candidates = self.selector.candidates(user, ratings)   
        
        # Obtain predictions from reduced candidate space
        # (predict_for_user is still a stub, so its result is not used yet)
        self.predict_for_user(user, candidates, ratings)
        

        return candidates
    
    def predict_for_user(self, user, items, ratings=None):
        pass
    
    def __str__(self):
        return 'CosinSimilarity'

In [82]:
# Instantiate object
algo_cosin = CosinSimilarity()

# Fit the candidate selector, which later reduces the candidate space to all items that the user has not yet rated
algo_cosin.fit(ratings)

# 
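
The `predict_for_user` method is left as a stub in the class. One possible way to fill it in is sketched below as a standalone function on hypothetical toy data. This is an assumption about the intended design, not the author's implementation: the name `predict_with_cosine`, the neighbourhood size `k`, and the choice to treat unrated items as 0 when computing cosine similarity are all introduced here for illustration. Each candidate item is scored as the similarity-weighted mean of the ratings given by the most similar users who rated it.

```python
import numpy as np
import pandas as pd


def predict_with_cosine(ratings, user, items, k=2):
    """Score `items` for `user` as a similarity-weighted mean of neighbour ratings."""
    # Pivot long-format ratings into a users x items matrix (0 = not rated)
    matrix = ratings.pivot_table(index="user", columns="item",
                                 values="rating", fill_value=0.0)
    values = matrix.to_numpy(dtype=float)
    norms = np.linalg.norm(values, axis=1)
    norms[norms == 0] = 1.0  # guard against users without ratings

    # Cosine similarity between the target user and every other user
    target = matrix.index.get_loc(user)
    sims = pd.Series((values @ values[target]) / (norms * norms[target]),
                     index=matrix.index).drop(index=user)

    scores = {}
    for item in items:
        if item not in matrix.columns:
            scores[item] = np.nan  # nobody has rated this item
            continue
        # Ratings of this item by other users (assumes real ratings are > 0)
        rated = matrix[item][matrix[item] > 0].drop(index=user, errors="ignore")
        neighbours = sims.loc[rated.index].nlargest(k)
        weight = neighbours.sum()
        if weight > 0:
            scores[item] = (neighbours * rated.loc[neighbours.index]).sum() / weight
        else:
            scores[item] = np.nan
    return pd.Series(scores, name="prediction")


# Hypothetical toy data: user 1 has not rated item 30
toy = pd.DataFrame({
    "user":   [1, 1, 2, 2, 2, 3, 3],
    "item":   [10, 20, 10, 20, 30, 10, 30],
    "rating": [5.0, 3.0, 4.0, 2.0, 5.0, 1.0, 2.0],
})
predictions = predict_with_cosine(toy, user=1, items=[30])
print(predictions)
```

In the toy data, user 2 is more similar to user 1 than user 3 is, so the prediction for item 30 lands closer to user 2's rating of 5 than to user 3's rating of 2. Inside the class, the same logic would need access to the ratings seen in `fit`, which the current `fit` method does not store.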


# References
Relevant references:
1. https://realpython.com/build-recommendation-engine-collaborative-filtering/
1. https://lkpy.readthedocs.io/en/stable/GettingStarted.html#
1. https://github.com/lenskit/lkpy/tree/main/lenskit/algorithms