# Neighborhood-Based Collaborative Filtering
This is an end-to-end demonstration of neighborhood-based collaborative filtering experiments using Recsys Lab. It covers:
- Fetching the data source
- Converting the dataset to an interaction matrix
- Splitting the dataset into training, validation, and test sets
- Subsampling the training dataset
- Normalizing the ratings data
- Computing user and item similarity matrices
- Making predictions
- Evaluating the performance of the model using metrics

## DataSource
For this experiment, we will use the MovieLens 25M Dataset from the GroupLens website. 

In [1]:
import pandas as pd

from recsys.datasource.movielens import MovieLens25M
from recsys.dataset.movielens import MovieLens
from recsys.dataprep.split import TemporalSplitter

In [2]:
# The fetch_data method returns the ratings interaction dataframe.
datasource = MovieLens25M()
ratings = datasource.fetch_data()

## Dataset
Interaction data is encapsulated in a MovieLens Dataset object, which provides access methods for analyzing the data and converting it to sparse row and column representations. Let's briefly examine the dataset's summary statistics.

In [3]:
dataset = MovieLens(
    name="movielens25m_raw_interaction_matrix",
    description="Movielens25M Interaction Matrix from Raw Data",
    filepath="data/movielens25m/raw/interaction.pkl",
    data=ratings,
)
dataset.summary()

| statistic | movielens25m_raw_interaction_matrix |
| --- | --- |
| nrows | 25000095.0 |
| ncols | 4.0 |
| n_users | 162541.0 |
| n_items | 59047.0 |
| max_ratings_per_user | 32202.0 |
| mean_ratings_per_user | 153.81 |
| min_ratings_per_user | 20.0 |
| max_ratings_per_item | 81491.0 |
| mean_ratings_per_item | 423.39 |
| min_ratings_per_item | 1.0 |


The statistics above describe the interactions of roughly 162,500 users with about 59,000 movies. Every user has rated at least 20 films, whereas ratings per movie range from 1 to more than 81,000, with a mean of about 423. The most distinguishing feature of this dataset is its extreme sparsity: 25,000,095 ratings are spread over 162,541 × 59,047 possible user-item pairs, so only about 0.26% of the interaction matrix is observed, a sparsity of roughly 99.7%.
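To make the sparse representation and the sparsity figure concrete, here is a minimal sketch that is independent of Recsys Lab. It assumes a long-format ratings frame with the standard MovieLens column names (`userId`, `movieId`, `rating`); the helper names `to_csr` and `sparsity` are hypothetical and are not part of the library's API.

```python
import pandas as pd
from scipy.sparse import csr_matrix


def to_csr(ratings, user_col="userId", item_col="movieId", rating_col="rating"):
    """Build a users x items CSR matrix from a long-format ratings frame.

    Hypothetical helper; column names are assumed, not taken from Recsys Lab.
    """
    # Map raw ids to contiguous 0-based row and column indices.
    users = ratings[user_col].astype("category")
    items = ratings[item_col].astype("category")
    rows = users.cat.codes.to_numpy()
    cols = items.cat.codes.to_numpy()
    values = ratings[rating_col].to_numpy(dtype=float)
    shape = (len(users.cat.categories), len(items.cat.categories))
    # Note: SciPy sums duplicate (row, col) entries in this constructor.
    return csr_matrix((values, (rows, cols)), shape=shape)


def sparsity(n_ratings, n_users, n_items):
    """Fraction of user-item pairs that have no observed rating."""
    return 1.0 - n_ratings / (n_users * n_items)


# Toy data: 3 users, 3 movies, 4 observed ratings.
toy = pd.DataFrame(
    {
        "userId": [1, 1, 2, 3],
        "movieId": [10, 20, 10, 30],
        "rating": [4.0, 3.5, 5.0, 2.0],
    }
)
toy_csr = to_csr(toy)        # row-oriented: fast access to a user's ratings
toy_csc = toy_csr.tocsc()    # column-oriented: fast access to an item's ratings
toy_sparsity = sparsity(toy_csr.nnz, *toy_csr.shape)

# The same formula applied to the counts reported by dataset.summary().
movielens_sparsity = sparsity(25_000_095, 162_541, 59_047)
print(f"MovieLens 25M sparsity: {movielens_sparsity:.2%}")
```

The row-oriented form suits user-based neighborhoods and the column-oriented form suits item-based ones, since each makes slicing along its own axis cheap.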

## Dataset Split
Next, we consider how to split the data to support principled model selection and evaluation. Splitting strategies vary widely across the literature, and this variation is an important confounding variable when comparing reported state-of-the-art recommender system rankings. For this exercise, we adopt a global temporal splitting strategy: a fixed point in time is chosen, interactions before it are used for training, and interactions after it are held out. We use an 80/10/10 split for the training, validation, and test sets, respectively. Splitting by time rather than at random reduces leakage of future interactions into training and more closely models how recommender systems are used in industry.
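The idea behind a global temporal split can be sketched in a few lines of pandas. This is an illustration under stated assumptions, not the implementation of `TemporalSplitter`: the function name `global_temporal_split` and the `timestamp` column name are hypothetical, and cut points are chosen by row count after sorting, so interactions sharing a timestamp at a boundary may land on either side.

```python
import pandas as pd


def global_temporal_split(ratings, train_size=0.8, validation_size=0.1,
                          time_col="timestamp"):
    """Split interactions chronologically into train/validation/test.

    Hypothetical helper: sorts all interactions by time, then cuts the
    ordered frame at the requested cumulative proportions.
    """
    ordered = ratings.sort_values(time_col, kind="stable").reset_index(drop=True)
    n = len(ordered)
    # round() guards against float error such as 0.8 + 0.1 != 0.9 exactly.
    train_end = int(round(n * train_size))
    validation_end = int(round(n * (train_size + validation_size)))
    return {
        "train": ordered.iloc[:train_end],
        "validation": ordered.iloc[train_end:validation_end],
        "test": ordered.iloc[validation_end:],
    }


# Toy data: 10 interactions with distinct, shuffled timestamps.
toy_ratings = pd.DataFrame(
    {
        "userId": [1, 2, 1, 3, 2, 3, 1, 2, 3, 1],
        "movieId": [10, 10, 20, 20, 30, 30, 40, 40, 50, 50],
        "rating": [4.0, 3.0, 5.0, 2.5, 4.5, 3.5, 1.0, 4.0, 3.0, 5.0],
        "timestamp": [105, 101, 109, 103, 110, 102, 107, 104, 108, 106],
    }
)
toy_splits = global_temporal_split(toy_ratings)
split_sizes = {name: len(frame) for name, frame in toy_splits.items()}
print(split_sizes)
```

Because the cut points are global rather than per user, some users and items in the later splits may never appear in the training set, which is part of what makes this setting more realistic and harder than a random split.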

In [4]:
splitter = TemporalSplitter(
    directory="data/movielens25m/split/",
    dataset=MovieLens,
    train_size=0.8,
    validation_size=0.1,
    test_size=0.1,
    force=True,
)
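# Calling the splitter returns the splits keyed by name.
splits = splitter(dataset=dataset)
train = splits['train'].summary()
validation = splits['validation'].summary()
test = splits['test'].summary()
# Place the per-split summaries side by side for comparison.
summary = pd.concat([train, validation, test], axis=1)
summary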

| statistic | train | validation | test |
| --- | --- | --- | --- |
| nrows | 20000076.0 | 2500009.0 | 2500009.0 |
| ncols | 4.0 | 4.0 | 4.0 |
| n_users | 137883.0 | 18003.0 | 18442.0 |
| n_items | 34461.0 | 37221.0 | 49405.0 |
| max_ratings_per_user | 12097.0 | 6947.0 | 13158.0 |
| mean_ratings_per_user | 145.05 | 138.87 | 135.56 |
| min_ratings_per_user | 1.0 | 1.0 | 1.0 |
| max_ratings_per_item | 67782.0 | 8643.0 | 8267.0 |
| mean_ratings_per_item | 580.37 | 67.17 | 50.6 |
| min_ratings_per_item | 1.0 | 1.0 | 1.0 |
