# Recommending movies using collaborative filtering

In [1]:
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')

## Data

To build a recommender system, we need data to learn from. Specifically, we need a dataset of **ratings** that different **users** assigned to different **items**, i.e., the movies. Let's start by loading the data and looking at the first few rows.

In [2]:
ratings = pd.read_csv(
    '../datasets/u.data',
    delimiter='\t',
    header=None,
    names=['user_id', 'item_id', 'rating', 'timestamp']
) 

# We don't need the timestamp column, so we drop it
ratings.drop('timestamp', axis=1, inplace=True)

ratings.head()

|   | user_id | item_id | rating |
|---|---------|---------|--------|
| 0 | 196     | 242     | 3      |
| 1 | 186     | 302     | 3      |
| 2 | 22      | 377     | 1      |
| 3 | 244     | 51      | 2      |
| 4 | 166     | 346     | 1      |


## Exercise 1.1: Similarity-based

**Question 1:** Implement the similarity-based algorithm given by the formulas of Exercise 1.1.
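
The formulas of Exercise 1.1 are not reproduced in this notebook, so the following is only a minimal sketch under an assumption: that the similarity is the Pearson correlation over co-rated items and that the prediction is the usual mean-centred, similarity-weighted average. The names `toy`, `pearson_sim` and `predict` are hypothetical, and the toy ratings are made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up ratings in the same long format as `ratings` (assumption: toy data).
toy = pd.DataFrame({
    'user_id': [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 4, 4, 4],
    'item_id': [10, 20, 30, 40, 10, 20, 30, 50, 10, 20, 30, 10, 20, 50],
    'rating':  [5, 3, 4, 4, 4, 2, 3, 5, 1, 5, 2, 5, 3, 4],
})

def pearson_sim(df, u, v):
    """Pearson correlation of users u and v over the items both have rated."""
    ru = df[df.user_id == u].set_index('item_id')['rating']
    rv = df[df.user_id == v].set_index('item_id')['rating']
    common = ru.index.intersection(rv.index)
    if len(common) < 2:
        return 0.0  # not enough overlap to estimate a correlation
    du = ru.loc[common] - ru.loc[common].mean()
    dv = rv.loc[common] - rv.loc[common].mean()
    denom = np.sqrt((du ** 2).sum() * (dv ** 2).sum())
    return float((du * dv).sum() / denom) if denom > 0 else 0.0

def predict(df, u, i):
    """Mean of user u plus the similarity-weighted deviations of other raters of item i."""
    mean_u = df.loc[df.user_id == u, 'rating'].mean()
    raters = df[(df.user_id != u) & (df.item_id == i)]
    num = den = 0.0
    for v, r_vi in zip(raters.user_id, raters.rating):
        s = pearson_sim(df, u, v)
        mean_v = df.loc[df.user_id == v, 'rating'].mean()
        num += s * (r_vi - mean_v)
        den += abs(s)
    return mean_u + num / den if den > 0 else mean_u

print(pearson_sim(toy, 1, 2), predict(toy, 1, 50))
```

If your formulas differ (for example, cosine similarity or user means computed over all rated items), only the bodies of the two functions need to change.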

**Question 2:** Let's now use the similarity function to find similar users. Given a user `u`, find the
- user that is the most similar (positively correlated) to `u`;
- user that is the least similar (negatively correlated) to `u`;
- user that is weakly correlated to `u`.

What can you say about the influence that these three users will have on the ratings of user `u`?

In [2]:
u = 1
...

print('User positively correlated: user', ..., 'with similarity =', ...)
print('User negatively correlated: user', ..., 'with similarity =', ...)
print('User weakly correlated: user', ..., 'with similarity =', ...)

User positively correlated: user Ellipsis with similarity = Ellipsis
User negatively correlated: user Ellipsis with similarity = Ellipsis
User weakly correlated: user Ellipsis with similarity = Ellipsis
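
One possible way to fill in the cell above, sketched on made-up data and assuming Pearson correlation as the similarity: pivot the ratings into a user-by-item table and let pandas compute all pairwise correlations at once. The frame `toy` and the variable names are hypothetical.

```python
import pandas as pd

# Made-up ratings (assumption: toy data, not taken from u.data).
toy = pd.DataFrame({
    'user_id': [1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4],
    'item_id': [10, 20, 30, 40, 10, 20, 30, 10, 20, 30, 10, 20, 30],
    'rating':  [5, 3, 4, 4, 4, 2, 3, 1, 5, 2, 3, 3, 5],
})

# Rows are users, columns are items, NaN marks a missing rating.
matrix = toy.pivot(index='user_id', columns='item_id', values='rating')

# DataFrame.corr correlates columns, so transpose to correlate users.
# Each pair is computed only over the items both users rated.
sims = matrix.T.corr(min_periods=2)

u = 1
s = sims[u].drop(u).dropna()   # similarities of u to every other user
most = s.idxmax()              # most positively correlated
least = s.idxmin()             # most negatively correlated
weakest = s.abs().idxmin()     # correlation closest to zero

print('User positively correlated: user', most, 'with similarity =', s[most])
print('User negatively correlated: user', least, 'with similarity =', s[least])
print('User weakly correlated: user', weakest, 'with similarity =', s[weakest])
```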


**Question 3:** Use your implementation to predict an unknown rating of the dataset. What is the run time of your implementation? Can you think of ways to speed it up?
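
As a rough illustration of how run time can be measured and reduced, the sketch below times a naive double loop over user pairs against the vectorised `DataFrame.corr` on synthetic random ratings. It assumes Pearson correlation as the similarity; `sim_loop` and the matrix sizes are hypothetical choices, and the timings depend on the machine.

```python
import time
import numpy as np
import pandas as pd

# Synthetic user-by-item matrix with about half of the ratings missing.
rng = np.random.default_rng(0)
n_users, n_items = 30, 40
dense = rng.integers(1, 6, size=(n_users, n_items)).astype(float)
dense[rng.random(dense.shape) < 0.5] = np.nan
matrix = pd.DataFrame(dense, index=range(1, n_users + 1))

def sim_loop(matrix):
    """Pairwise Pearson similarity with an explicit loop over all user pairs."""
    users = matrix.index
    out = pd.DataFrame(np.nan, index=users, columns=users)
    for a in users:
        for b in users:
            both = matrix.loc[a].notna() & matrix.loc[b].notna()
            if both.sum() >= 2:
                out.loc[a, b] = matrix.loc[a, both].corr(matrix.loc[b, both])
    return out

t0 = time.perf_counter()
slow = sim_loop(matrix)
t_loop = time.perf_counter() - t0

t0 = time.perf_counter()
fast = matrix.T.corr(min_periods=2)   # same quantity, computed in compiled code
t_vec = time.perf_counter() - t0

print(f'loop: {t_loop:.4f}s, vectorised: {t_vec:.4f}s')
```

Other options along the same lines include caching each user's mean rating and computing the similarities once up front instead of once per prediction.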

## Exercise 1.2: Model-based

To make our lives easier, we will now use the [Surprise](https://surprise.readthedocs.io) Python package, which implements a variety of collaborative filtering algorithms. To install it, run `pip install scikit-surprise`. On Windows, you can use conda instead: open Anaconda Navigator, go to Environments, click the arrow next to the base (root) environment, click "Open Terminal", and run `conda install -y -c conda-forge scikit-surprise`.

In [None]:
import surprise

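First, we need to convert our dataset to a format that Surprise can work with. Conceptually, this is a matrix whose rows represent users and whose columns represent movies, where each cell holds the rating that the corresponding user gave to the corresponding movie. Most cells are empty because each user rated only a small fraction of the movies, so Surprise stores only the observed ratings in a `Trainset` object rather than filling the missing cells with zeros.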

In [None]:
from surprise import Dataset, Reader
dataset = Dataset.load_from_df(ratings, reader=Reader(rating_scale=(1, 5)))
X = dataset.build_full_trainset()

**Question 4:** Now use the non-negative matrix factorization implementation of this package to predict the rating for a given user and movie.

Hint: you'll need the [surprise.prediction_algorithms.matrix_factorization.NMF](https://surprise.readthedocs.io/en/stable/matrix_factorization.html#surprise.prediction_algorithms.matrix_factorization.NMF) class.

**Question 5:** What is the effect of the number of latent factors? Is there an optimal number of factors?

Hint: the [surprise.model_selection.search.GridSearchCV](https://surprise.readthedocs.io/en/stable/model_selection.html#surprise.model_selection.search.GridSearchCV) class makes it easy to compare different parameter settings. Note that `GridSearchCV.fit` expects a `Dataset` object instead of a `Trainset` object.