# Introduction to Recommender Systems

<p align="center">
    <img width="721" alt="cover-image" src="https://user-images.githubusercontent.com/49638680/204351915-373011d3-75ac-4e21-a6df-99cd1c552f2c.png">
</p>

---

# KNN Recommendations

In this lecture, we are going to see a first example of a non-personalised recommender system.
We have already seen that non-personalised recommendations exploit the distance between items: they recommend the items that are "close" to the previously liked ones.

The issue is that we never defined what "_close_" or "_far_" means in this context.

The main idea behind the algorithm is to define vectors of items (in our case, movies) in order to _learn_ which movies are _closer_ to the last appreciated one. This approach goes under the name of _item-based collaborative filtering_. Later in the course, we are going to build a first example of personalised recommendations with _user-based collaborative filtering_.

Now, let's briefly review the $k$-nn algorithm before applying it to building recommendations.

## KNN algorithm

$k$-nn is, from a certain perspective, probably the simplest machine learning algorithm.
Indeed, it is non-parametric, meaning that the algorithm does not have to estimate any parameter in order to _learn_.
It is based on _distance_: it estimates the target value of a new point from the labels or values of its $k$ nearest neighbours.

### Recall: Machine Learning

The general idea of machine learning is to get a model to learn trends from historical data on any topic and be able to reproduce those trends on comparable data in the future. Here is a diagram outlining the basic machine learning process:

<p align="center">
    <img width="699" alt="image" src="https://files.realpython.com/media/knn_01_MLgeneral_wide.74e5e2dc1094.png">
</p>

This graph is a visual representation of a machine learning model that is fitted onto historical data. On the left are the original observations with three variables: height, width, and shape. The shapes are stars, crosses, and triangles.

The shapes are located in different areas of the graph. On the right, you see how those original observations have been translated to a decision rule. For a new observation, you need to know the width and the height to determine in which square it falls. The square in which it falls, in turn, defines which shape it is most likely to have.

Many different models could be used for this task. 

A **model** is a mathematical formula that can be used to describe data points. 
One example is the linear model, which uses a linear function defined by the formula 

$$y = \beta_0 + \beta_1 x\, .$$

**Fitting** a model means finding the optimal values of the model's parameters using some algorithm.
In the linear model, the parameters are $\beta_0$ and $\beta_1$. 
Luckily, you won’t have to invent such estimation algorithms to get started. 
They’ve already been discovered by great mathematicians.

Once the model is estimated, it becomes a mathematical formula in which you can fill in values for your independent variables to make predictions for your target variable. From a high-level perspective, that’s all that happens!
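As a small illustration of fitting and predicting with the linear model, here is a minimal sketch that estimates $\beta_0$ and $\beta_1$ by least squares with `np.polyfit`. The data points are invented for the example and are not part of the lecture's dataset.

```python
import numpy as np

# Toy data generated from y = 1 + 2x (illustrative values, no noise)
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 1.0 + 2.0 * x

# Fitting: least-squares estimate of the two parameters.
# np.polyfit returns the coefficients from highest degree to lowest.
beta_1, beta_0 = np.polyfit(x, y, deg=1)

# Prediction: plug a new value of x into the estimated formula
x_new = 5.0
y_pred = beta_0 + beta_1 * x_new
print(f"beta_0={beta_0:.2f}, beta_1={beta_1:.2f}, prediction={y_pred:.2f}")
```

The estimated parameters recover the intercept and slope used to generate the data, and the prediction is just the formula evaluated at the new point.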

### Distinguishing Features of $k$-NN

Let's have a look at the noteworthy features of $k$-NN.

#### kNN is a Supervised Machine Learning Algorithm

The $k$-NN algorithm is a supervised machine learning model. 
That means it predicts a target variable using one or multiple independent variables.

In particular, it expects data in the form of pairs $(x, y)$, where $x$ are commonly known as _features_ and $y$ is called the _target variable_.

#### kNN is non-parametric

As mentioned above, for knn there are no parameters to estimate.
It is an algorithm purely based on distances, meaning that all _features_ count in the same way.

### The algorithm in steps

1. Define $k$
2. Define a distance metric — _e.g._ Euclidean distance ($2$-norm distance).
3. For a new data point, find the $k$ nearest training points.
4. Here it depends whether we have a classification or a regression problem.
    - For classification, we combine neighbour classes in some way — usually _voting_ — to get a predicted class.
    - For regression we combine neighbour labels by their _average_ or _median_ to get the predicted value.
 
Let's explore these steps in detail.


The kNN algorithm is a little bit atypical as compared to other machine learning algorithms. As you saw earlier, each machine learning model has its specific formula that needs to be estimated. The specificity of the k-Nearest Neighbours algorithm is that this formula is computed not at the moment of fitting but rather at the moment of prediction. This is not the case for most other models.

When a new data point arrives, the kNN algorithm, as the name indicates, will start by finding the nearest neighbours of this new data point. Then it takes the values of those neighbours and uses them as a prediction for the new data point.

<p align="center">
    <img width="699" alt="image" src="https://cdn.analyticsvidhya.com/wp-content/uploads/2018/03/knn3.png">
</p>

As an intuitive example of why this works, think of your neighbours. Your neighbours are often relatively similar to you. They are probably in the same socioeconomic class as you. Maybe they have the same type of work as you, maybe their children go to the same school as yours, and so on. But for some tasks, this kind of approach is not as useful. For instance, it would not make any sense to look at your neighbour’s favourite color to predict yours.

Hence, the $k$NN algorithm is based on the assumption that you can predict the target of a data point from the targets of its nearest neighbours.

#### “Nearest” means we need a distance

The kNN algorithm is built on the concept of _distance_. This concept has a very precise definition in mathematics.

A _distance_ $\delta$ is a function 

$$ \delta : \Omega \times \Omega \longrightarrow \mathbb{R} $$ 

such that it satisfies the following properties:

1. Non-negativity, $\delta(x, y) \geq 0\, , \forall x, y \in \Omega$
2. Separation, $\delta(x, y) = 0 \Leftrightarrow x = y\, , \forall x, y \in \Omega$
3. Symmetry, $\delta(x, y) = \delta(y, x)\, , \forall x, y \in \Omega$
4. Triangular Inequality, $\delta(x, z) \leq \delta(x, y) + \delta(y, z) \, , \forall x, y, z \in \Omega$

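Examples of distances when $\Omega$ is a real vector space include (among others) the [Euclidean distance](https://en.wikipedia.org/wiki/Euclidean_distance), the [Manhattan distance](https://en.wikipedia.org/wiki/Taxicab_geometry) and the [Cosine distance](https://en.wikipedia.org/wiki/Cosine_similarity). Strictly speaking, the cosine distance does not satisfy all the properties above (for example, two parallel vectors of different lengths have cosine distance $0$ without being equal), but it is widely used as a measure of dissimilarity.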
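To get a feeling for how these three distances differ, the sketch below evaluates them on two toy vectors using the ready-made functions in `scipy.spatial.distance`. The vectors are arbitrary illustrative values.

```python
import numpy as np
from scipy.spatial import distance

# Two orthogonal unit vectors (illustrative values)
x = np.array([1.0, 0.0])
y = np.array([0.0, 1.0])

d_euclid = distance.euclidean(x, y)      # length of the straight segment between x and y
d_manhattan = distance.cityblock(x, y)   # sum of absolute coordinate differences
d_cosine = distance.cosine(x, y)         # 1 - cosine of the angle between x and y

print(d_euclid, d_manhattan, d_cosine)
```

The same pair of points gets three different "distances", which is why the choice of metric matters for deciding who the nearest neighbours are.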

##### Exercise

> *Implement Euclidean, Cosine and Manhattan distances in Python making use of numpy.*

---

### The algorithm

Let's have a look at the drawing here.

<p align="center">
    <img width="881" alt="image" src="https://user-images.githubusercontent.com/49638680/159125703-a6f683d0-5a03-43e2-9ae5-293c86fe4eb7.png">
</p>

Roughly speaking: we can look at the data points nearest to the green circle (the new sample $x$), in this case using Euclidean distance, and make a prediction.
So if we look at only three neighbours ($k = 3$) we can say that it belongs to class $1$, whereas if we look at the $7$ nearest neighbours ($k = 7$) we can say it belongs to class $2$.
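The effect of $k$ on the prediction can be reproduced with a small sketch. The points below are invented so that the two closest neighbours of the new sample belong to class $1$, while most of the slightly farther ones belong to class $2$; scikit-learn's `KNeighborsClassifier` is used for brevity.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Toy training set (illustrative coordinates), new sample at the origin
X_train = np.array([
    [0.5, 0.0], [0.0, 0.6],                              # class 1, very close
    [-0.7, 0.0],                                         # class 2, close
    [0.0, -1.0], [1.1, 0.0], [-1.2, 0.0], [0.0, 1.3],    # class 2, a bit further
    [3.0, 3.0], [3.0, -3.0],                             # class 1, far away
])
y_train = np.array([1, 1, 2, 2, 2, 2, 2, 1, 1])
x_new = np.array([[0.0, 0.0]])

# Same data, same new point, two different values of k
pred_k3 = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train).predict(x_new)[0]
pred_k7 = KNeighborsClassifier(n_neighbors=7).fit(X_train, y_train).predict(x_new)[0]

print(f"k=3 -> class {pred_k3}, k=7 -> class {pred_k7}")
```

With $k = 3$ the vote is two against one in favour of class $1$; with $k = 7$ the five class-$2$ points outvote the two class-$1$ points.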

### Find the $k$ Nearest Neighbours

Now that you have a way to compute the distance from any point to any point, you can use this to find the nearest neighbours of a point on which you want to make a prediction.

You need to find a number of neighbours, and that number is given by $k$. The minimum value of $k$ is of course $1$. This means using only one neighbour for the prediction. The maximum is the number of data points that you have. This means using all neighbours. The value of $k$ is something that the user defines; you will see a lot of these quantities from now on. They are called _hyperparameters_.
Cross-validation procedures and optimization tools can help you choose them, as you will see in the next lectures.
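As a preview of how a hyperparameter such as $k$ can be chosen, here is a minimal sketch based on scikit-learn's `cross_val_score`. The Iris dataset and the list of candidate values are stand-ins chosen for the example; they are not part of this lecture's data.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Small built-in dataset, used here only as a placeholder
X, y = load_iris(return_X_y=True)

candidate_ks = [1, 3, 5, 7, 9, 11]
mean_scores = {}
for k in candidate_ks:
    model = KNeighborsClassifier(n_neighbors=k)
    # 5-fold cross validation: average accuracy over the held-out folds
    mean_scores[k] = cross_val_score(model, X, y, cv=5).mean()

# Keep the value of k with the best average validation accuracy
best_k = max(mean_scores, key=mean_scores.get)
print(mean_scores)
print(f"best k: {best_k}")
```

The idea is simply to try several values of $k$ and keep the one that performs best on data the model has not seen during fitting.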

Now, to find the nearest neighbours of a point $x$ in NumPy, we simply compute the distance between $x$ and every training point. As you have seen, the distances are defined on the vectors of the independent variables.

Once you have the array of distances, it is enough to sort it in increasing order and pick the first $k$ elements.
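The procedure just described can be sketched in a few lines of NumPy. The function name `k_nearest` and the toy points are hypothetical, and the Euclidean distance is used only as an example metric.

```python
import numpy as np

def k_nearest(X_train: np.ndarray, x: np.ndarray, k: int):
    """Return indices and distances of the k training points closest to x."""
    # Euclidean distance between x and every row of X_train
    distances = np.linalg.norm(X_train - x, axis=1)
    # Sort by increasing distance and keep the first k positions
    nearest_idx = np.argsort(distances)[:k]
    return nearest_idx, distances[nearest_idx]

# Toy training points (illustrative values)
X_train = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [5.0, 5.0]])
idx, dist = k_nearest(X_train, np.array([0.9, 1.2]), k=2)
print(idx, dist)
```

For large datasets a full sort is more work than needed; `np.argpartition` can select the $k$ smallest distances without ordering the rest.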

### Combining $k$ Nearest Neighbours labels

Now, to produce predictions we need to find a way to assign a _value_ $\hat{y}$ to the new point $x$, based on the $k$-nearest neighbours we just found.

#### Classification
If we are in a classification problem, $y_i$ are discrete values, representing classes. One method to assign a class to the new point is a procedure called _voting_.

##### Voting
**Majority Voting**: After you take the $k$ nearest neighbors, you take a “vote” of those neighbours’ classes. The new data point is classified with whatever the majority class of the neighbours is. If you are doing binary classification, it is recommended that you use an odd number of neighbors to avoid tied votes. However, in a multi-class problem, it is harder to avoid ties. A common solution to this is to decrease $k$ until the tie is broken.
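A minimal sketch of majority voting, including the tie-breaking rule of decreasing $k$, could look as follows. The function name `majority_vote` is hypothetical, and the labels are assumed to be ordered from the nearest neighbour to the farthest.

```python
from collections import Counter

def majority_vote(neighbour_labels):
    """Most common label among the neighbours; ties are broken by shrinking k."""
    labels = list(neighbour_labels)  # assumed sorted from nearest to farthest
    while labels:
        counts = Counter(labels).most_common()
        # A single class, or a strict winner: the vote is decided
        if len(counts) == 1 or counts[0][1] > counts[1][1]:
            return counts[0][0]
        # Tie: drop the farthest neighbour, i.e. decrease k by one
        labels.pop()
    raise ValueError("at least one neighbour is required")

print(majority_vote(["a", "b", "b"]))       # clear majority
print(majority_vote(["a", "b", "b", "a"]))  # tie with k = 4, decided with k = 3
```

In the worst case the loop ends with $k = 1$, where the single nearest neighbour decides the class.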

**Distance Weighting**: Instead of counting every neighbour's vote equally, you weight each vote by the distance of that neighbour from the new data point. A common choice is the inverse squared distance,
$$w_i := \dfrac{1}{\delta(x, x_i)^2}\, .$$
For classification, the new data point is assigned to the class with the largest total weight among its neighbours. For numeric targets, the prediction is the weighted average
$$\hat{y} = \dfrac{\sum_i w_i y_i}{\sum_i w_i}\, .$$
Not only does this decrease the chances of ties, but it also reduces the effect of outliers.
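Distance-weighted voting can be sketched as below, using inverse squared distance weights. The function name `weighted_vote` and the small constant `eps` (added to avoid dividing by zero when a neighbour coincides with the new point) are assumptions of this example.

```python
import numpy as np

def weighted_vote(neighbour_labels, neighbour_distances, eps=1e-12):
    """Return the class with the largest total inverse-squared-distance weight."""
    labels = np.asarray(neighbour_labels)
    distances = np.asarray(neighbour_distances, dtype=float)
    # Closer neighbours get larger weights
    weights = 1.0 / (distances ** 2 + eps)
    classes = np.unique(labels)
    # Sum the weights of the neighbours belonging to each class
    totals = np.array([weights[labels == c].sum() for c in classes])
    return classes[np.argmax(totals)]

# One very close neighbour of class 1 against two farther neighbours of class 2
winner = weighted_vote([1, 2, 2], [0.1, 1.0, 1.0])
print(winner)
```

Here plain majority voting would pick class $2$, while the weighted vote follows the single neighbour that is much closer to the new point.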

#### Regression
If, on the other hand, we are in a regression problem, $y_i$ are continuous values. We can predict the new value $\hat{y}$ by combining the $y_i$ of the neighbours.

**Median**: We take the median value out of the $k$-nearest neighbours.

**Weighted average**: The weights are defined as above and we calculate $\hat{y}$ as the weighted average of $y_i$.
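Both options can be sketched with NumPy as follows. The function name `knn_regress`, the `eps` constant and the toy data are assumptions made for the example.

```python
import numpy as np

def knn_regress(X_train, y_train, x, k, method="median", eps=1e-12):
    """Predict a continuous value from the k nearest neighbours of x."""
    distances = np.linalg.norm(X_train - x, axis=1)
    idx = np.argsort(distances)[:k]
    if method == "median":
        return float(np.median(y_train[idx]))
    # Weighted average with inverse squared distance weights
    weights = 1.0 / (distances[idx] ** 2 + eps)
    return float(np.sum(weights * y_train[idx]) / np.sum(weights))

# Toy one-dimensional data (illustrative values)
X_train = np.array([[1.0], [2.0], [3.0], [10.0]])
y_train = np.array([10.0, 20.0, 30.0, 100.0])
x_new = np.array([2.5])

pred_median = knn_regress(X_train, y_train, x_new, k=3, method="median")
pred_weighted = knn_regress(X_train, y_train, x_new, k=3, method="weighted")
print(pred_median, pred_weighted)
```

The weighted prediction is pulled towards the targets of the two closest points, while the median ignores how far each neighbour is.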

#### Bonus: Radius Neighbours

This is the same idea as a $k$ nearest neighbour classifier, but instead of finding the $k$ nearest neighbours, you find all the neighbours within a given radius. Setting the radius requires some domain knowledge; if your points are closely packed together, you might want to use a smaller radius to avoid having nearly every point vote.
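A minimal sketch of the radius-based variant is shown below. The function name `radius_vote` and the toy points are hypothetical; scikit-learn also offers a `RadiusNeighborsClassifier` built on the same idea.

```python
from collections import Counter

import numpy as np

def radius_vote(X_train, y_train, x, radius):
    """Majority class among the training points within `radius` of x."""
    distances = np.linalg.norm(X_train - x, axis=1)
    inside = distances <= radius
    if not inside.any():
        return None  # no neighbour falls inside the radius
    return Counter(y_train[inside].tolist()).most_common(1)[0][0]

# Toy data (illustrative values)
X_train = np.array([[0.0, 0.0], [0.5, 0.0], [0.0, 0.4], [3.0, 3.0]])
y_train = np.array(["a", "a", "b", "b"])
x_new = np.array([0.1, 0.1])

print(radius_vote(X_train, y_train, x_new, radius=1.0))
print(radius_vote(X_train, y_train, x_new, radius=0.01))
```

Unlike $k$-NN, the number of voters changes from point to point, and it can even be zero, a case that has to be handled explicitly.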

### KNN for recommendations

Now, we have to apply this nice algorithm to recommend movies to users.

As said, $k$-nn does not make any assumptions about the data distribution; it relies on feature similarity.
When $k$-nn makes inference about a movie, it calculates the “distance” between the target movie and every other movie in its database, then ranks the distances and returns the top $k$ nearest neighbour movies as the most similar movie recommendations.

#### The code 👨‍💻

Finally, we are going to get our hands on some real code to build our $k$-nn based recommender system.

First, let's import necessary libraries and data as in the previous lecture.

```python
# Import libraries
import numpy as np
import pandas as pd
from typing import List
from scipy.sparse import csr_matrix

from sklearn.neighbors import NearestNeighbors

from utils.data_utils import load_data

import matplotlib.pyplot as plt

# set plot size
plt.rcParams["figure.figsize"] = (20, 13)
%matplotlib inline
%config InlineBackend.figure_format = "retina"
```

```python
# Import data from movielens dataset

df_rating, df_rating_test, df_users, df_items, df_matrix, n_users, n_items = load_data()
```

Let's have a look at the size of our rating matrix.

```python
print(f"Shape of rating matrix: {df_matrix.values.shape}")
```

```
Shape of rating matrix: (1650, 943)
```


These are high numbers (and we are using a small version of the dataset). We definitely do not want to feed our $k$-nn algorithm with such a big dense matrix. Furthermore, the rating dataframe is full of NaNs.

Hence, for more efficient calculation and a smaller memory footprint, we need to transform the values of the dataframe into a _scipy sparse matrix_.

```python
mat_movie_features = csr_matrix(df_matrix.fillna(0).values)
```
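To see why the sparse format helps, the sketch below compares the memory used by a dense array and by its CSR version. The matrix is randomly generated with roughly 5% of the entries filled; its size and density are assumptions for the example and do not come from the MovieLens data.

```python
import numpy as np
from scipy.sparse import csr_matrix

rng = np.random.default_rng(0)

# Synthetic "rating" matrix: mostly zeros, about 5% of entries in 1..5
dense = np.zeros((1000, 500))
mask = rng.random(dense.shape) < 0.05
dense[mask] = rng.integers(1, 6, size=mask.sum())

sparse = csr_matrix(dense)

# CSR stores only the non-zero values plus two index arrays
dense_bytes = dense.nbytes
sparse_bytes = sparse.data.nbytes + sparse.indices.nbytes + sparse.indptr.nbytes
print(f"dense: {dense_bytes} bytes, sparse: {sparse_bytes} bytes")
```

The sparse version keeps exactly the same information, but only pays for the entries that are actually rated.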

Note that this did not change the shape of the matrix: our training data still has a very high dimensionality.

$k$-NN’s performance will suffer from the [curse of dimensionality](https://en.wikipedia.org/wiki/Curse_of_dimensionality) if it uses _Euclidean distance_ as its distance metric.
Roughly speaking, Euclidean distance is unhelpful in high dimensions because all vectors are almost equidistant from the search query vector (the target movie’s features).
Instead, we will use _cosine similarity_ for nearest neighbour search.
Luckily, scikit-learn provides a ready-to-use nearest neighbour implementation that supports the cosine metric, and we are about to use it.
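The "almost equidistant" effect can be observed numerically. The sketch below draws random points uniformly in the unit cube and measures how much the farthest point differs from the nearest one, relative to the nearest distance. The function name `relative_contrast`, the number of points and the dimensions tried are all choices made for this example.

```python
import numpy as np

def relative_contrast(dim: int, n_points: int = 500, seed: int = 0) -> float:
    """(max distance - min distance) / min distance from a random query point."""
    rng = np.random.default_rng(seed)
    points = rng.random((n_points, dim))
    query = rng.random(dim)
    d = np.linalg.norm(points - query, axis=1)
    return (d.max() - d.min()) / d.min()

# The contrast shrinks as the dimension grows
for dim in (2, 10, 100, 1000):
    print(dim, round(relative_contrast(dim), 3))
```

When this ratio gets close to zero, the nearest and the farthest points are at nearly the same Euclidean distance, so "nearest" carries little information.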

There is also another popular approach to handle nearest neighbour search in high dimensional data, [locality sensitive hashing](https://en.wikipedia.org/wiki/Locality-sensitive_hashing), which we will not cover in this lecture.

##### Exercise (hard)

Try to implement locality sensitive hashing for knn. [This blog post](https://towardsdatascience.com/locality-sensitive-hashing-for-music-search-f2f1940ace23) might be useful.

#### The model

Let's define the model: it will be _trained_ on movie vectors and will be called to give recommendations.

```python
model_knn = NearestNeighbors(
    metric="cosine", algorithm="brute", n_neighbors=20, n_jobs=-1
)
```

Finally, we are ready to provide recommendations given the last well-rated movie, or the user's favourite movie.

In order to do so, we can define a function.

```python
def make_recommendations(fav_movie_id: int, n_recommendations: int) -> pd.DataFrame:
    """
    Function to make top n movie recommendations.

    Parameters
    ----------
    fav_movie_id : int
        id of user input movie

    n_recommendations : int
        number of recommendations to provide

    Returns
    -------
    pd.DataFrame
        The favourite movie followed by the top recommendations,
        collected in a dataframe sorted by increasing distance.
    """
    # fit the model
    model_knn.fit(mat_movie_features)

    # build favourite movie vector
    fav_vec = df_matrix.fillna(0).loc[fav_movie_id].values.reshape(1, -1)

    # inference: ask for one extra neighbour, since the favourite movie
    # itself comes back at distance 0
    distances, indices = model_knn.kneighbors(
        fav_vec, n_neighbors=n_recommendations + 1
    )

    ## Method 1: Sort list of raw idx of recommendations
    # raw_recommends = \
    #        sorted(
    #            list(
    #                zip(
    #                    indices.squeeze().tolist(),
    #                    distances.squeeze().tolist()
    #                )
    #            ),
    #            key=lambda x: x[1]
    #        )[:0:-1]

    # Method 2: Sort dataframe of raw idx of recommendations
    df_res = df_items.iloc[indices[0]].copy()
    df_res["distances"] = distances[0]
    # sort_values returns a new dataframe, so the result must be assigned
    df_res = df_res.sort_values(by="distances")

    return df_res
```

```python
make_recommendations(fav_movie_id=3, n_recommendations=10)
```

| MovieId | Title | Date | VideoReleaseDate | Url | unknown | Action | Adventure | Animation | Children | Comedy | ... | Film-Noir | Horror | Musical | Mystery | Romance | Sci-Fi | Thriller | War | Western | distances |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3 | Four Rooms (1995) | 01-Jan-1995 | | `http://us.imdb.com/M/title-exact?Four%20Rooms%...` | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0.0 |
| 761 | Nick of Time (1995) | 01-Jan-1995 | | `http://us.imdb.com/M/title-exact?Nick%20of%20T...` | 0 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0.591842 |
| 249 | Austin Powers: International Man of Mystery (1... | 02-May-1997 | | `http://us.imdb.com/M/title-exact?Austin%20Powe...` | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.609912 |
| 410 | Kingpin (1996) | 12-Jul-1996 | | `http://us.imdb.com/M/title-exact?Kingpin%20(1996)` | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.611404 |
| 42 | Clerks (1994) | 01-Jan-1994 | | `http://us.imdb.com/M/title-exact?Clerks%20(1994)` | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.616329 |
| 822 | Faces (1968) | 01-Jan-1968 | | `http://us.imdb.com/M/title-exact?Faces%20(1968)` | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.626278 |
| 67 | Ace Ventura: Pet Detective (1994) | 01-Jan-1994 | | `http://us.imdb.com/M/title-exact?Ace%20Ventura...` | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.630763 |
| 829 | Fled (1996) | 19-Jul-1996 | | `http://us.imdb.com/M/title-exact?Fled%20(1996)` | 0 | 1 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.641586 |
| 475 | Trainspotting (1996) | 19-Jul-1996 | | `http://us.imdb.com/Title?Trainspotting+(1996)` | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.644772 |
| 33 | Desperado (1995) | 01-Jan-1995 | | `http://us.imdb.com/M/title-exact?Desperado%20(...` | 0 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0.648739 |


##### Exercises

1. Implement a full object that contains `make_recommendations` as a method. The object will be initialised with the data paths and will build the model; it will also have a method to map a title to its movie id, and will provide recommendations with (a small edit of) the function defined above.
2. Modify the algorithm of the previous point so that it never recommends a movie the user has already rated.
3. Use the object you built above to create a recommender system, that looks at your last rated movie (whose rating is above $4$) and recommends the _closest_ movies to that one.

### Going further

A further improvement of this system is to calculate the _center of appreciation_ of each user, by averaging the vectors of that user's favourite movies. For example, one can average the vectors of all the movies to which a user gave a rating greater than $4$ and then find the $10$ movies closest to this point.
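A possible sketch of this idea on a tiny, invented movie-by-user rating matrix is given below. The function name `recommend_from_centre`, the toy ratings and the choice of excluding the movies that define the centre are assumptions made for the example; the cosine metric and `NearestNeighbors` follow the lecture.

```python
import pandas as pd
from sklearn.neighbors import NearestNeighbors

# Toy rating matrix (rows: movies, columns: users); 0 means "not rated"
ratings = pd.DataFrame(
    [[5, 4, 0, 1],
     [5, 5, 0, 0],
     [4, 5, 1, 0],
     [0, 1, 5, 4],
     [1, 0, 4, 5]],
    index=["m1", "m2", "m3", "m4", "m5"],
    columns=["u1", "u2", "u3", "u4"],
)

def recommend_from_centre(ratings, user, n_recommendations=2, threshold=4):
    """Rank movies by cosine distance from the average vector of the user's favourites."""
    liked = ratings.index[ratings[user] > threshold]
    if len(liked) == 0:
        raise ValueError(f"user {user} has no rating above {threshold}")
    # Centre of appreciation: mean of the favourite movies' vectors
    centre = ratings.loc[liked].mean(axis=0).to_numpy().reshape(1, -1)
    model = NearestNeighbors(metric="cosine", algorithm="brute").fit(ratings.to_numpy())
    distances, indices = model.kneighbors(centre, n_neighbors=len(ratings))
    ranked = pd.Series(distances[0], index=ratings.index[indices[0]])
    # Do not recommend the movies that were used to build the centre
    return ranked.drop(liked).head(n_recommendations)

result = recommend_from_centre(ratings, "u1")
print(result)
```

Compared with starting from a single favourite movie, the centre summarises several liked movies at once, so one unusual favourite has less influence on the recommendations.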

<details>
<summary>Click to expand and see the solutions!</summary>

#### Exercise 1
The aim is to implement a class that contains `make_recommendations` as a method. The object will be initialised with the data paths and will build the model; it will also have a method to map a title to its movie id, and will provide recommendations with (a small edit of) the function defined above.

An option might be the following.
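```python
class MovieRecommender:
    """
    A movie recommender system using KNN.

    Attributes
    ----------
    movies_df : pd.DataFrame
        DataFrame containing movie information.
    ratings_df : pd.DataFrame
        DataFrame containing user ratings.
    df_matrix : pd.DataFrame
        Movie-by-user rating matrix, with missing ratings filled with 0.
    model_knn : NearestNeighbors
        The KNN model.
    mat_movie_features : csr_matrix
        Sparse matrix of movie features.

    Methods
    -------
    make_recommendations(fav_movie_id, n_recommendations):
        Makes movie recommendations based on a favorite movie.
    get_movie_id(title):
        Retrieves the movie ID for a given title.
    """

    def __init__(self, movies_path: str, ratings_path: str):
        """
        Initializes the recommender system with data paths and builds the model.

        Parameters
        ----------
        movies_path : str
            Path to the movies data file.
        ratings_path : str
            Path to the ratings data file.
        """
        # Assumed columns: 'movieId' and 'title' in the movies file,
        # 'userId', 'movieId' and 'rating' in the ratings file.
        self.movies_df = pd.read_csv(movies_path)
        self.ratings_df = pd.read_csv(ratings_path)
        # Rows are movies, columns are users, as in the lecture's df_matrix
        self.df_matrix = self.ratings_df.pivot_table(
            index='movieId', columns='userId', values='rating'
        ).fillna(0)
        self.mat_movie_features = csr_matrix(self.df_matrix.values)
        self.model_knn = NearestNeighbors(metric='cosine', algorithm='brute', n_neighbors=20, n_jobs=-1)
        # Fit once here instead of at every call
        self.model_knn.fit(self.mat_movie_features)

    def make_recommendations(self, fav_movie_id: int, n_recommendations: int) -> pd.DataFrame:
        """
        Function to make top n movie recommendations based on a favorite movie.

        Parameters
        ----------
        fav_movie_id : int
            ID of the user's favorite movie.
        n_recommendations : int
            Number of recommendations to provide.

        Returns
        -------
        pd.DataFrame
            The top n recommendations.
        """
        fav_vec = self.df_matrix.loc[[fav_movie_id]].values
        distances, indices = self.model_knn.kneighbors(fav_vec, n_neighbors=n_recommendations + 1)
        # Map matrix row positions back to movie IDs and attach movie information
        df_res = pd.DataFrame({
            'movieId': self.df_matrix.index[indices[0]],
            'distances': distances[0],
        }).merge(self.movies_df, on='movieId', how='left')
        # Drop the favourite movie itself and keep the closest ones
        df_res = df_res[df_res['movieId'] != fav_movie_id]
        return df_res.sort_values(by='distances').head(n_recommendations)

    def get_movie_id(self, title: str) -> int:
        """
        Retrieves the movie ID for a given title.

        Parameters
        ----------
        title : str
            The movie title (or part of it).

        Returns
        -------
        int
            The ID of the first movie whose title contains `title`.
        """
        # regex=False so that titles with parentheses, e.g. "Clerks (1994)", match literally
        matches = self.movies_df[self.movies_df['title'].str.contains(title, case=False, na=False, regex=False)]
        if matches.empty:
            raise ValueError(f"No movie found with title containing '{title}'")
        return matches['movieId'].values[0]
```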
#### Exercise 2
We now need to modify the algorithm of the previous point so that it never recommends a movie the user has already rated. This involves checking the user's ratings and filtering out any movies that appear in their ratings history.

```python
class MovieRecommender:
    def __init__(self, movies_path: str, ratings_path: str):
        """
        Initializes the recommender system with data paths and fits the model.
        """
        # Assumed columns: 'movieId' in the movies file,
        # 'userId', 'movieId', 'rating' and 'timestamp' in the ratings file.
        self.movies_df = pd.read_csv(movies_path)
        self.ratings_df = pd.read_csv(ratings_path)
        # Movie-by-user rating matrix, missing ratings filled with 0
        self.df_matrix = self.ratings_df.pivot_table(
            index='movieId', columns='userId', values='rating'
        ).fillna(0)
        self.mat_movie_features = csr_matrix(self.df_matrix.values)
        self.model_knn = NearestNeighbors(metric='cosine', algorithm='brute', n_neighbors=20, n_jobs=-1)
        self.model_knn.fit(self.mat_movie_features)

    def make_recommendations(self, fav_movie_id: int, user_id: int, n_recommendations: int) -> pd.DataFrame:
        """
        Makes movie recommendations based on a favorite movie, excluding movies the user has already rated.
        """
        user_rated_movies = self.ratings_df.loc[self.ratings_df['userId'] == user_id, 'movieId'].unique()

        # The favourite movie is itself a rated movie, so it cannot be removed
        # from the matrix before the search. Instead, ask for enough neighbours
        # to still have n_recommendations left after dropping the rated ones.
        n_neighbors = min(
            n_recommendations + len(user_rated_movies) + 1,
            self.mat_movie_features.shape[0],
        )
        fav_vec = self.df_matrix.loc[[fav_movie_id]].values
        distances, indices = self.model_knn.kneighbors(fav_vec, n_neighbors=n_neighbors)

        # Map matrix row positions back to original movie IDs
        neighbours = pd.DataFrame({
            'movieId': self.df_matrix.index[indices[0]],
            'distance': distances[0],
        })
        # Exclude the favourite itself and everything the user has already rated
        keep = ~neighbours['movieId'].isin(user_rated_movies) & (neighbours['movieId'] != fav_movie_id)
        recommended_movies = neighbours[keep].merge(self.movies_df, on='movieId', how='left')

        return recommended_movies.sort_values(by='distance', ascending=True).head(n_recommendations)

    def get_user_last_highly_rated_movie(self, user_id: int) -> int:
        """
        Retrieves the last highly-rated movie (rating above 4) by a user.
        """
        highly_rated = self.ratings_df[
            (self.ratings_df['userId'] == user_id) & (self.ratings_df['rating'] > 4)
        ].sort_values(by='timestamp', ascending=False)
        if highly_rated.empty:
            raise ValueError(f"User {user_id} has no movie rated above 4")
        return highly_rated['movieId'].values[0]

# Usage example (assuming paths, user_id, and n_recommendations are defined):
# recommender = MovieRecommender(movies_path, ratings_path)
# last_highly_rated_movie_id = recommender.get_user_last_highly_rated_movie(user_id)
# recommendations = recommender.make_recommendations(last_highly_rated_movie_id, user_id, n_recommendations)
# print(recommendations)
```

#### Exercise 3

Now it is just a matter of using the class implemented above.

```python
movies_path = 'path/to/your/movies.csv'  # Update this path
ratings_path = 'path/to/your/ratings.csv'  # Update this path
user_id = 1  # Example user ID
n_recommendations = 5  # Number of recommendations

recommender = MovieRecommender(movies_path, ratings_path)

# Last movie the user rated above 4, used as the starting point
fav_movie_id = recommender.get_user_last_highly_rated_movie(user_id)
# Closest movies to it, excluding the ones the user has already rated
recommendations = recommender.make_recommendations(fav_movie_id, user_id, n_recommendations)

print("Recommendations based on your last highly-rated movie:")
print(recommendations)
```

</details>
