# Collaborative Filtering

The work in this notebook was inspired by this [article](https://realpython.com/build-recommendation-engine-collaborative-filtering/).

This is probably the most common technique used for building recommender engines. It performs better as more data on users is gathered.

## What is Collaborative Filtering?

The aim of this technique is to identify items that a given user may like based on the items that similar users have liked. A large group of users is usually searched to find users similar to the one in question. In a typical CF scenario we have a list of $m$ users $U = \{u_1, u_2, …, u_m\}$ and a list of $n$ items $I = \{i_1, i_2, …, i_n\}$. Each user $u_i$ has a list of items $I_{ui}$ about which the user has expressed an opinion, for example by rating or liking an item. These opinions can be explicit, where the user has rated an item on a scale, or implicit, where they are derived from purchase records, analysis of behaviour on the site, etc. For a given user $u_a$ (the active user), the task of collaborative filtering can be one of the following:
 - **Prediction**: Predict a numerical value expressing how much the active user is likely to like an item $i_j$ they have not yet rated ($i_j \notin I_{ua}$). The predicted value is on the same scale as the ratings the user has already given.
 - **Recommendation**: Produce a list of $N$ items that the active user is predicted to like the most ($I_r \subset I$). The list cannot contain items the user has already rated, i.e. $I_r \cap I_{ua} = \emptyset$. This is known as top-N recommendation.
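
The difference between the two tasks can be sketched in a few lines of Python. The predicted scores below are made-up placeholder values rather than the output of a real model, and `top_n` is a hypothetical helper name used only for illustration.

```python
# Hypothetical predicted ratings for one active user (made-up numbers).
predicted = {"i1": 4.2, "i2": 3.1, "i3": 4.8, "i4": 2.0, "i5": 3.9}

# Items the active user has already rated (the set I_ua).
already_rated = {"i1", "i4"}


def top_n(predicted, already_rated, n):
    """Return the n unrated items with the highest predicted rating."""
    candidates = {item: score for item, score in predicted.items()
                  if item not in already_rated}
    return sorted(candidates, key=candidates.get, reverse=True)[:n]


# Prediction task: a single score for one item the user has not rated.
prediction_for_i3 = predicted["i3"]

# Recommendation task: the top-N items, excluding anything already rated.
recommendations = top_n(predicted, already_rated, n=2)
print(recommendations)  # ['i3', 'i5']
```

Filtering out `already_rated` before sorting is what enforces $I_r \cap I_{ua} = \emptyset$.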

CF algorithms represent the entire dataset as an $m \times n$ user-item matrix, $A$. Each entry $a_{ij}$ in $A$ represents the preference (or rating) of the $i$th user for the $j$th item. A number of algorithms have been developed to perform CF; they fall into two broad categories: memory based and model based.
- **Memory Based:** Uses the user-item interactions directly to generate predictions. Find known users, also called neighbours, that have a similar history to the user in question, then use these neighbours to produce a prediction.
- **Model Based:** A model of user ratings is developed first, and recommendations are produced from that model. Algorithms in this category take a probabilistic approach and compute the expected value of a user's rating for an item, given the user’s ratings of other items.



## The Dataset

The dataset usually consists of a sparse matrix where the rows are users, the columns are items and the cells hold the value of an interaction, e.g. purchase (yes / no) or rating (1 - 5). These are **explicit** ratings, where the user has directly interacted with the item. There are also **implicit** ratings, such as viewing an item or adding it to a wish list.

A typical matrix would look like this:

||$i_1$|$i_2$|$i_3$|$i_4$|$i_5$|
|-----|-----|-----|-----|-----|-----|
|$u_1$|2|5||1||
|$u_2$||5|1|1||
|$u_3$|5|||3|1|
|$u_4$|4||3|4||
|$u_5$|2|5||5||

This dataset has 5 users who have each rated some of the 5 items. Many cells in the table are empty, which is typical for this kind of data; in real datasets the large majority of cells are empty.
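
As a minimal sketch, the table above could be held in a NumPy array, assuming `NaN` is used to mark a missing rating (one possible convention; sparse matrix formats are another). The variable names are illustrative only.

```python
import numpy as np

nan = np.nan

# Rows are users u_1..u_5, columns are items i_1..i_5; NaN means "not rated".
ratings = np.array([
    [2,   5,   nan, 1,   nan],
    [nan, 5,   1,   1,   nan],
    [5,   nan, nan, 3,   1],
    [4,   nan, 3,   4,   nan],
    [2,   5,   nan, 5,   nan],
])

rated = ~np.isnan(ratings)   # boolean mask of observed ratings
density = rated.mean()       # fraction of cells that hold a rating

print(f"{rated.sum()} of {ratings.size} cells rated, density = {density:.2f}")
# 15 of 25 cells rated, density = 0.60
```

Keeping the mask separate from the values makes it easy to restrict later calculations to ratings that actually exist.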

## Steps Involved in Collaborative Filtering

1. Find similar users or items
2. Predict the rating / value of the items that a user has not yet rated

How do you accomplish the above? The following questions need to be answered:
- How do you determine which users are similar to others?
- Given that you know which users are similar, how do you determine the rating a user would give to an item?
- How is the accuracy of the prediction measured?

Collaborative filtering is a family of methods with numerous ways of finding similar users or items and multiple ways of calculating ratings. This means the first two questions above don't have a single answer. How do you know which will work best? The best thing to do is to try a few and select the one that performs best on your data.

The data used in collaborative filtering and the subsequent calculations don't take demographic data such as age or gender into account. Predictions are calculated only from the implicit or explicit information provided by the users.

Like the other two questions, the measurement of accuracy also has numerous answers. A common method is root mean square error (RMSE): make predictions for ratings where the true value is known, take the difference between each predicted and known value as the error, square all the errors, calculate their mean and then take the square root. Another common method is mean absolute error (MAE), which is the mean of the magnitudes of the errors.
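
Both metrics can be computed directly from a list of known ratings and the corresponding predictions. The numbers below are made up purely to illustrate the arithmetic.

```python
import math

# Hypothetical held-out ratings and a model's predictions for them.
actual = [4, 3, 5, 2]
predicted = [3.5, 3.0, 4.0, 3.0]

errors = [p - a for p, a in zip(predicted, actual)]

# RMSE: square the errors, take the mean, then the square root.
rmse = math.sqrt(sum(e ** 2 for e in errors) / len(errors))

# MAE: mean of the absolute errors.
mae = sum(abs(e) for e in errors) / len(errors)

print(rmse, mae)  # 0.75 0.625
```

Because the errors are squared before averaging, RMSE penalises large errors more heavily than MAE does, which is why it is the larger of the two here.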

## Item-Based Collaborative Filtering

Collaborative filtering works by estimating users' preferences for items. In the user-based approach, when a new user joins your dataset they are matched against existing users to discover neighbours, which are other users who have historically had similar taste to the new user. Items that these neighbours liked are then recommended to the new user. This user-based collaborative filtering approach has been shown to be effective in both research and practice. There are, however, challenges that face this sort of recommender system:
- Scalability: The algorithms can search tens of thousands of potential neighbours in real time, but modern systems need to search millions of potential neighbours.
- Quality: The quality of recommendations needs to improve. Users need recommendations they can trust; otherwise they will refuse to use a recommender that is consistently inaccurate for them.

A bottleneck for user-based collaborative filtering algorithms is the search for neighbours among a large population of users. Item-based algorithms avoid this bottleneck by exploring the relationships between items first, rather than the relationships between users. Recommendations for a user are computed by finding items that are similar to other items the user has liked. The basic intuition is that a user would be interested in purchasing items that are similar to the items the user liked earlier and would tend to avoid items that are similar to the items the user didn’t like earlier.

The item-based approach looks at the set of items the target user has rated, computes how similar each one is to the target item $i$ and then selects the $k$ most similar items $\{i_1, i_2, …, i_k\}$, along with their similarities $\{s_{i1}, s_{i2}, …, s_{ik}\}$. With the similar items found, the prediction is computed as the weighted average of the target user’s ratings on these similar items, using the similarities as weights.
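
The weighted-average step can be sketched as follows. This is an illustration under stated assumptions, not a definitive implementation: `predict_rating` is a hypothetical helper, the ratings and similarities are made-up values, and dividing by the sum of absolute similarities is one common normalisation choice.

```python
import numpy as np


def predict_rating(user_ratings, similarities, k):
    """Weighted average of the user's ratings on the k most similar items.

    user_ratings[j] -- the target user's rating of item j (rated items only)
    similarities[j] -- similarity between item j and the target item
    """
    user_ratings = np.asarray(user_ratings, dtype=float)
    similarities = np.asarray(similarities, dtype=float)

    # Indices of the k items most similar to the target item.
    top_k = np.argsort(similarities)[::-1][:k]
    weights = similarities[top_k]

    denom = np.abs(weights).sum()
    if denom == 0:
        return float("nan")  # no usable neighbours
    return float(weights @ user_ratings[top_k] / denom)


# Made-up example: the user has rated four items; the similarities are
# between each of those items and the target item.
prediction = predict_rating(user_ratings=[5, 3, 4, 1],
                            similarities=[0.9, 0.2, 0.6, 0.1],
                            k=2)
print(round(prediction, 2))  # 4.6
```

With $k = 2$ only the items with similarities 0.9 and 0.6 are used, giving $(0.9 \times 5 + 0.6 \times 4) / (0.9 + 0.6) = 4.6$.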

## Similarity Computation
The basic idea behind similarity computation is to identify similar items / users:
- Users: Find similar users to the target user
- Items: Find similar items to the target item. In this instance we need to isolate the users who have co-rated the items, i.e. users who have rated both $Item_a$ and $Item_b$. These co-ratings are used to calculate the item similarity.

There are numerous ways to calculate similarity between items / users. Some of the more common methods are cosine-based, correlation-based and adjusted cosine similarity.
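
Isolating co-ratings can be sketched with a boolean mask. The matrix below reuses the example table from the dataset section, with `NaN` assumed to mark a missing rating; `co_ratings` is a hypothetical helper name.

```python
import numpy as np

nan = np.nan

# Example user-item matrix (rows: users u_1..u_5, columns: items i_1..i_5).
ratings = np.array([
    [2,   5,   nan, 1,   nan],
    [nan, 5,   1,   1,   nan],
    [5,   nan, nan, 3,   1],
    [4,   nan, 3,   4,   nan],
    [2,   5,   nan, 5,   nan],
])


def co_ratings(ratings, a, b):
    """Return the ratings of items a and b from users who rated both."""
    both = ~np.isnan(ratings[:, a]) & ~np.isnan(ratings[:, b])
    return ratings[both, a], ratings[both, b]


# Items i_1 and i_4 (column indices 0 and 3): u_2 is dropped because
# they have not rated i_1.
ra, rb = co_ratings(ratings, 0, 3)
print(ra, rb)  # [2. 5. 4. 2.] [1. 3. 4. 5.]
```

The two returned vectors have the same length and are aligned by user, so they can be passed straight to a similarity function.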

### Cosine-based Similarity

Users / items are thought of as vectors, and the similarity between two of them is measured by computing the cosine of the angle between the two vectors. Formally, this is given as:

\begin{equation}
\mathrm{sim}(i, j) = \cos(\overrightarrow{i}, \overrightarrow{j}) = \frac{\overrightarrow{i} \cdot \overrightarrow{j}}{\lVert\overrightarrow{i}\rVert_2 \, \lVert\overrightarrow{j}\rVert_2}
\end{equation}
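
A minimal NumPy sketch of this formula is shown below. The function name `cosine_similarity` and the example vectors are illustrative, and returning 0 for an all-zero vector is a convention assumed here to avoid division by zero.

```python
import numpy as np


def cosine_similarity(i, j):
    """Cosine of the angle between two rating vectors of equal length."""
    i = np.asarray(i, dtype=float)
    j = np.asarray(j, dtype=float)
    denom = np.linalg.norm(i) * np.linalg.norm(j)
    if denom == 0:
        return 0.0  # assumed convention for all-zero vectors
    return float(i @ j / denom)


# Ratings of two items by the same four users (made-up values).
a = [2, 5, 4, 2]
b = [1, 3, 4, 5]

sim = cosine_similarity(a, b)
print(round(sim, 4))  # 0.8602
```

Here the dot product is 43 and the norms are $7$ and $\sqrt{51}$, so the similarity is $43 / (7\sqrt{51}) \approx 0.8602$. The vectors must contain no missing values, so in practice they would be restricted to co-rated entries first.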
