# Demystifying Recommendations: Their Nature and Purpose

Ever wondered why Netflix suggests certain movies or TV shows for your next binge? Or why your Spotify playlist seems to read your mind? Magic? Not quite. In both instances, the power of machine learning-based recommendation models is at work.

These models discern patterns in your viewing or listening habits and then find movies, TV shows, or songs that bear similarity to your tastes. Following this, they curate a list of recommendations.

## The Rationale Behind Recommendations

The primary role of a recommendation system is to guide users in uncovering captivating content within an extensive collection. Consider platforms like Netflix, with its vast repertoire of movies and TV shows, or Spotify, a platform that hosts millions of tracks. These platforms are updated regularly with new content. So, how can users discover fresh and intriguing content amidst this sea of choices?

While search functions serve as a means to directly access content, they rely on the user knowing what they are looking for. This is where a recommendation engine stands out. It has the ability to surface items that the user might not have thought to seek out independently, thus enriching their overall experience.

## Unraveling the Lingo of Recommendation Systems


In our journey to understand recommendation systems, it's vital to get acquainted with some key terms and concepts. Here's your cheat sheet:

1. **Items (or Documents)**: These are the entities that the system is recommending. If we're talking about Netflix, items would be the movies and TV shows. For Spotify, the items would be songs or playlists.

2. **Query (or Context)**: This represents the information the system uses to generate recommendations. Queries could include a mix of the following:

   - User information or the user's ID.
   - Items that the user has previously interacted with.  

3. **Embedding**: This is a transformation from a discrete set (like the set of queries or items to recommend) to a vector space, known as the embedding space. The effectiveness of many recommendation systems hinges on learning suitable embedding representations for queries and items.

4. **Recommendation Systems**: These are algorithms designed to suggest products, services, or information to users based on analysis of data.

5. **Collaborative Filtering**: This technique uses other users' habits to recommend products to a user. The underlying assumption is that if users A and B have similar tastes on one issue, they are likely to have similar opinions on others.

6. **Content-Based Filtering**: This technique recommends items by comparing the content of the items to a user's profile. Each item is represented by a set of descriptors, such as the words in a document.

7. **User-Item Matrix**: A matrix used in collaborative filtering where each row represents a user, and each column represents an item. The entries can be explicit data (like ratings) or implicit data (like number of purchases).

8. **Cold Start Problem**: This issue arises when the system cannot draw any inferences for users or items about which it has not yet gathered sufficient information. This is a common problem in recommendation systems, particularly for new users (user cold-start) or new items (item cold-start).

9. **Implicit and Explicit Feedback**: Explicit feedback is input directly provided by the user (like ratings), while implicit feedback is gathered from user actions (like browsing history).
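To make the user-item matrix a little more concrete, here is a minimal sketch using pandas. The interaction log, its column names, and its values are all made up for illustration; a real system would read them from its own logs.

```python
import pandas as pd

# Hypothetical explicit feedback: one row per (user, item, rating) event.
ratings_log = pd.DataFrame({
    "user":   ["ann", "ann", "bob", "bob", "cat"],
    "item":   ["Movie A", "Movie B", "Movie A", "Movie C", "Movie B"],
    "rating": [5, 3, 4, 2, 1],
})

# Rows are users, columns are items; pairs with no rating stay NaN.
user_item = ratings_log.pivot(index="user", columns="item", values="rating")
print(user_item)

# Most entries are usually unobserved. A new user or item would have
# no observed entries at all, which is the cold-start situation.
n_missing = int(user_item.isna().sum().sum())
print("unobserved user-item pairs:", n_missing)
```

With implicit feedback, the same layout would hold counts or 0/1 flags (plays, clicks, purchases) instead of ratings.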

## Recommendation systems

Recommendation systems are your secret sauce for a personalized user experience, nudging users towards content they'll adore. They're the backbone of e-commerce, streaming, and social platforms, guiding choices by harnessing the power of data.

Their magic lies in learning from past user behaviors. Your behavior, fused with insights from others' interactions, shapes what you see next.

Commonly cited figures attest to their influence: 40% of Google Play app installs, 60% of YouTube watch time, 35% of Amazon purchases, and 75% of Netflix watches are recommendation-driven.

At the heart of it, these systems pivot around two key strategies: Content-Based Filtering, which aligns recommendations with a user's known preferences, and Collaborative Filtering, which draws on collective user insights to predict individual interests. Each strategy, with its unique strengths, plays a pivotal role in the art of making spot-on suggestions.

| Type | Definition | Example |
|---|---|---|
| Content-Based Filtering | Uses similarity between items to recommend items similar to what the user likes. | If user A watches two cute cat videos, then the system can recommend cute animal videos to that user. |
| Collaborative Filtering | Uses similarities between queries and items simultaneously to provide recommendations. | If user A is similar to user B, and user B likes video 1, then the system can recommend video 1 to user A (even if user A hasn’t seen any videos similar to video 1). |

### Content-based Filtering

Content-based filtering uses item characteristics to suggest items that are similar to what a user likes, based on their prior actions or explicit feedback.

To illustrate content-based filtering, let's craft some features for Spotify. Consider a feature matrix where each row signifies a song, and each column represents a feature. These features could include genres (like pop, rock, or jazz), the artist, the album, and many others. For the sake of simplicity, let's assume this feature matrix is binary: a non-zero value means the song possesses that feature.

We also represent the user in the same feature space. Some user-related features could be explicitly provided by the user. For instance, a user might select "Rock music" in their profile. Other features can be implicit, based on the songs they've previously listened to. For example, the user may have frequently played songs by the artist 'Queen'.

The goal of the model is to recommend songs that would resonate with this user. To do this, you'd first choose a similarity metric (like the dot product). Then, you'd set up the system to score each candidate song according to this similarity metric. It's crucial to note that the recommendations are user-specific, as the model hasn't used any information about other users.

Consider the following songs:

- "Bohemian Rhapsody" by Queen
- "Imagine" by John Lennon
- "Sweet Child o' Mine" by Guns N' Roses
- "Stairway to Heaven" by Led Zeppelin
- "Bad Guy" by Billie Eilish

Let's also consider three music genres: Rock, Pop, and Classic Rock. We can create a binary feature matrix to represent these songs and their genres, where '1' represents the presence of a feature and '0' represents the absence of it.

| Song                     | Artist          | Rock | Pop | Classic Rock |
|--------------------------|-----------------|------|-----|--------------|
| Bohemian Rhapsody        | Queen           | 1    | 0   | 1            |
| Imagine                  | John Lennon     | 0    | 1   | 0            |
| Sweet Child o' Mine      | Guns N' Roses   | 1    | 0   | 1            |
| Stairway to Heaven       | Led Zeppelin    | 1    | 0   | 1            |
| Bad Guy                  | Billie Eilish   | 0    | 1   | 0            |

Now, consider a user who has a preference for Rock and Classic Rock and has recently listened to songs by Queen and Led Zeppelin. We can represent this user's profile in the same feature space, treating each artist as its own binary feature alongside the genres:

| User Profile | Queen | John Lennon | Guns N' Roses | Led Zeppelin | Billie Eilish | Rock | Pop | Classic Rock |
|--------------|-------|-------------|---------------|--------------|---------------|------|-----|--------------|
| User 1       | 1     | 0           | 0             | 1            | 0             | 1    | 0   | 1            |

Using content-based filtering, we would calculate the similarity between the user profile and each song in the feature matrix. This could be done with the cosine similarity, dot product, or any other similarity measure.

For simplicity, let's use the dot product. The dot product between two vectors is the sum of the products of their corresponding entries. For binary vectors, it effectively counts the number of features they have in common. A higher dot product means more common features, and hence a higher similarity.

The dot product between the user profile and each song will be as follows:

- User 1 and "Bohemian Rhapsody": 3 (Queen, Rock, Classic Rock)
- User 1 and "Imagine": 0
- User 1 and "Sweet Child o' Mine": 2 (Rock, Classic Rock)
- User 1 and "Stairway to Heaven": 3 (Led Zeppelin, Rock, Classic Rock)
- User 1 and "Bad Guy": 0

So, the system would recommend "Bohemian Rhapsody" and "Stairway to Heaven" to the user, as these songs have the highest similarity scores.
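The scoring can be sketched in a few lines of NumPy. This sketch assumes each artist is one-hot encoded as its own column next to the three genre columns, so that songs and the user live in the same eight-dimensional feature space; the variable names are illustrative.

```python
import numpy as np

# Columns: Queen, John Lennon, Guns N' Roses, Led Zeppelin, Billie Eilish,
#          Rock, Pop, Classic Rock (artists one-hot encoded, an assumption).
song_titles = ["Bohemian Rhapsody", "Imagine", "Sweet Child o' Mine",
               "Stairway to Heaven", "Bad Guy"]
song_features = np.array([
    [1, 0, 0, 0, 0, 1, 0, 1],
    [0, 1, 0, 0, 0, 0, 1, 0],
    [0, 0, 1, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 0, 1, 0, 1],
    [0, 0, 0, 0, 1, 0, 1, 0],
])
user_profile = np.array([1, 0, 0, 1, 0, 1, 0, 1])

# One dot product per song: counts the features shared with the user.
content_scores = song_features @ user_profile

# Rank songs from best to worst match.
for i in np.argsort(-content_scores, kind="stable"):
    print(f"{song_titles[i]}: {content_scores[i]}")
```

Because every feature here is binary, each score is simply the number of features a song shares with the user profile.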

### Collaborative Filtering

Think about a situation where you have a bunch of users and a bunch of items, and you're trying to figure out which items would most likely be of interest to which users. You see this problem everywhere, from recommending movies on Netflix and highlighting relevant content on a homepage to deciding what posts to show in a social media feed, and beyond. There's a nifty solution to this problem that's called collaborative filtering.

Here's how collaborative filtering works: it looks at what items the current user has interacted with or liked, finds other users who have interacted with or liked similar items, and then recommends other items that those users have interacted with or liked.

To give you a concrete example, let's say you've been watching a ton of sci-fi action movies from the 70s on Netflix. Now, Netflix might not have these specific details about the films you watched, but it can see that other people who watched the same movies you did also tend to watch other sci-fi action films from the 70s. The interesting part is that to use this approach, we don't really need to know anything about the movies themselves, except for who likes to watch them.
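As a rough illustration of this "people who watched what you watched" idea, here is a minimal neighbor-based sketch. The watch matrix is made up, and cosine similarity between users is just one reasonable choice of similarity measure, not the only one.

```python
import numpy as np

# Hypothetical watch matrix: rows are users, columns are movies, 1 = watched.
watched = np.array([
    [1, 1, 0, 0],   # current user
    [1, 1, 1, 0],   # overlaps heavily with the current user
    [0, 0, 0, 1],   # no overlap
])
current = 0

# Cosine similarity between the current user and every user.
norms = np.linalg.norm(watched, axis=1)
similarity = watched @ watched[current] / (norms * norms[current])
similarity[current] = 0.0  # ignore the user's similarity to themselves

# Score each movie by the similarity-weighted votes of the other users,
# then mask out movies the current user has already seen.
movie_scores = similarity @ watched
movie_scores[watched[current] == 1] = -np.inf
recommended = int(np.argmax(movie_scores))
print("recommend movie index:", recommended)
```

Note that nothing in this sketch describes the movies themselves; the recommendation comes purely from who watched what.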

Collaborative filtering is actually a way to solve a broader class of problems that doesn't necessarily involve users and products. In fact, we usually talk about items instead of products in the context of collaborative filtering. These items could be anything from links that people click on, to diagnoses selected for patients, and more.

Now, the cornerstone of this whole approach is something called latent factors. In our Netflix example, we started with the assumption that you're into old, action-packed sci-fi movies. But you never actually told Netflix that you like these types of movies. And Netflix didn't have to add columns to its movie database saying which films are of these types. But, there must be some latent (or hidden) concept of sci-fi, action, and film age, and these concepts are likely relevant to the movie watching choices of at least some people.

Consider the following movies:

- "The Last Skywalker"
- "Casablanca"
- "Avengers: Endgame"
- "The Godfather"
- "Toy Story 4"

Let's also consider three movie categories: Science Fiction (Sci-Fi), Action, and Age (where a positive value represents an old movie and a negative value represents a new movie). We can create a feature matrix to represent these movies and their categories as follows:

| Movie                 | Sci-Fi | Action | Age |
|-----------------------|--------|--------|-----|
| The Last Skywalker    | 0.98   | 0.9    | -0.9|
| Casablanca            | -0.99  | -0.3   | 0.8 |
| Avengers: Endgame     | 0.9    | 0.95   | -0.85|
| The Godfather         | -0.8   | -0.4   | 0.9 |
| Toy Story 4           | 0.85   | 0.8    | -0.75|

Now, consider a user who enjoys modern sci-fi action movies. We can represent this user's profile in the same feature space:

| User Profile | Sci-Fi | Action | Age |
|--------------|--------|--------|-----|
| User 1       | 0.9    | 0.8    | -0.6|

Using the dot product calculation for each movie:

- User 1 and "The Last Skywalker": $0.9 \times 0.98 + 0.8 \times 0.9 + (-0.6) \times (-0.9) = 2.142$
- User 1 and "Casablanca": $0.9 \times (-0.99) + 0.8 \times (-0.3) + (-0.6) \times 0.8 = -1.611$
- User 1 and "Avengers: Endgame": $0.9 \times 0.9 + 0.8 \times 0.95 + (-0.6) \times (-0.85) = 2.08$
- User 1 and "The Godfather": $0.9 \times (-0.8) + 0.8 \times (-0.4) + (-0.6) \times 0.9 = -1.58$
- User 1 and "Toy Story 4": $0.9 \times 0.85 + 0.8 \times 0.8 + (-0.6) \times (-0.75) = 1.855$

These calculations will give you the match between the user's preferences and each movie. The higher the score, the better the match, and hence the more likely the user is to enjoy the movie. 

In [None]:
import numpy as np

# Define the feature vectors for the movies
last_skywalker = np.array([0.98, 0.9, -0.9])
casablanca = np.array([-0.99, -0.3, 0.8])
avengers_endgame = np.array([0.9, 0.95, -0.85])
the_godfather = np.array([-0.8, -0.4, 0.9])
toy_story_4 = np.array([0.85, 0.8, -0.75])

# Define the user profile
user1 = np.array([0.9, 0.8, -0.6])

# Calculate the dot product between the user profile and each movie
score_last_skywalker = np.dot(user1, last_skywalker)
score_casablanca = np.dot(user1, casablanca)
score_avengers_endgame = np.dot(user1, avengers_endgame)
score_the_godfather = np.dot(user1, the_godfather)
score_toy_story_4 = np.dot(user1, toy_story_4)

score_last_skywalker, score_casablanca, score_avengers_endgame, score_the_godfather, score_toy_story_4

In [None]:
import pandas as pd

# Create a dictionary with the movie names and their corresponding scores
data = {
    'Movie': ['The Last Skywalker', 'Casablanca', 'Avengers: Endgame', 'The Godfather', 'Toy Story 4'],
    'Score': [score_last_skywalker, score_casablanca, score_avengers_endgame, score_the_godfather, score_toy_story_4]
}

# Create a pandas DataFrame from the dictionary
df_scores = pd.DataFrame(data)

df_scores

The positive scores indicate a good match between the user's preferences and the movie, while the negative scores indicate a mismatch. The higher the score, the better the match.

So, according to these scores, the user is most likely to enjoy "The Last Skywalker", followed by "Avengers: Endgame" and "Toy Story 4". The user is least likely to enjoy "Casablanca" and "The Godfather".

In practice, we often don't know what the latent factors are or how to score them for each user and item. We also don't know how many latent factors there should be. However, we can learn them from the data.

The idea is to start with random values for the user and item factors, and then iteratively adjust these values in a way that minimizes the difference between the predicted and actual ratings. This process is known as matrix factorization, and it can be accomplished using techniques like singular value decomposition (SVD) or algorithms like stochastic gradient descent (SGD).

In this approach, each user and each item is represented by a vector of latent factors. The dot product of a user vector and an item vector gives the predicted rating for that item by that user. The goal of the learning process is to find the values for these vectors that give the best predictions.

### Diving Into the Embedding Space (Latent-Factors)

Both content-based and collaborative filtering techniques translate each item and each query (or context) into an embedding vector in a shared embedding space, $E = \mathbb{R}^d$. Typically, this space is low-dimensional (meaning, $d$ is significantly smaller than the corpus size) and encapsulates some latent structure of the item or query set. Items bearing similarity, like movies typically watched by the same user, find themselves in close proximity in the embedding space. This idea of "closeness" is determined by a similarity measure.
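Two common similarity measures are the dot product and cosine similarity. The sketch below compares them on made-up three-dimensional embeddings; the helper name `cosine_similarity` and the vectors are illustrative only.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical 3-dimensional embeddings (d = 3).
query = np.array([0.9, 0.8, -0.6])
item_close = np.array([0.45, 0.4, -0.3])   # same direction, half the length
item_far = np.array([-0.8, -0.4, 0.9])     # roughly opposite direction

print("dot:   ", query @ item_close, query @ item_far)
print("cosine:", cosine_similarity(query, item_close),
      cosine_similarity(query, item_far))
```

The dot product grows with vector length as well as alignment, whereas cosine similarity depends only on direction, so the two can rank items differently.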

#### Matrix Factorization

Matrix factorization is a simple embedding model. Given the feedback matrix $A \in \mathbb{R}^{m \times n}$, where $m$ is the number of users (or queries) and $n$ is the number of items, the model learns:

- A user embedding matrix $U \in \mathbb{R}^{m \times d}$ where row $i$ is the embedding for user $i$.
- An item embedding matrix $V \in \mathbb{R}^{n \times d}$ where row $j$ is the embedding for item $j$.

It learns a dense representation (embedding) for both users and items in a shared low-dimensional space. This space is of dimension $d$, which is typically much smaller than $m$ (the number of users) or $n$ (the number of items). 

The key idea is that the interaction between a user and an item can be modeled as the dot product of their respective embeddings:


$$A_{ij} \approx U_i \cdot V_j^T$$


where $A_{ij}$ is the element in the $i$-th row and $j$-th column of the feedback matrix $A$, $U_i$ is the $i$-th row of the user embedding matrix $U$, and $V_j$ is the $j$-th row of the item embedding matrix $V$. 
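A quick shape check may help here. The sizes below are toy values chosen for illustration, and the embeddings are random rather than learned.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, d = 4, 6, 2            # users, items, embedding dimension (toy sizes)

U = rng.normal(size=(m, d))  # user embeddings, one row per user
V = rng.normal(size=(n, d))  # item embeddings, one row per item

A_hat = U @ V.T              # all predicted entries at once, shape (m, n)

# Entry (i, j) is the dot product of user i's and item j's embeddings.
i, j = 1, 3
assert np.isclose(A_hat[i, j], U[i] @ V[j])
print(A_hat.shape)
```

The model stores only $(m + n) \times d$ numbers, yet it can produce a prediction for every one of the $m \times n$ user-item pairs.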

### Learning the Latent Factors


**1.** The first step is to randomly initialize some parameters known as latent factors for each user and movie. These latent factors can be represented as vectors $U_i$ and $V_j$ for user $i$ and movie $j$, respectively. Picture a crosstab with users as rows and movies as columns: each user's vector sits beside its row, each movie's vector sits above its column, and every cell holds the dot product of the two.

**2.** The second step is to calculate our predictions by taking the dot product of each movie's latent factors with each user's latent factors. If we let $P_{ij}$ represent the predicted rating for user $i$ and movie $j$, we can calculate all predictions as follows:

$$
P_{ij} = U_i \cdot V_j^T
$$

The product will be high if the user's preferences and the movie's characteristics align, and low if they don't.

**3.** The third step is to calculate our loss using a loss function. In this case, we're using mean squared error (MSE), which measures the average of the squares of the differences between the predicted and actual ratings. This can be represented as follows:

$$
L = \frac{1}{N} \sum_{(i,j) \in \text{observed}} (A_{ij} - P_{ij})^2
$$

where $N$ is the number of observed ratings, $A_{ij}$ is the actual rating for user $i$ and movie $j$, and $P_{ij}$ is the predicted rating.

**4.** With these in place, we optimize our parameters (latent factors) using stochastic gradient descent (SGD) to minimize the loss. This process calculates the gradient of the loss with respect to the latent factors, and then adjusts the latent factors by taking a step in the direction of steepest descent. This process, repeated many times, reduces the loss and thus improves the quality of our recommendations. The update rule for SGD can be represented as follows:

$$
U_i = U_i - \alpha \frac{\partial L}{\partial U_i}, \quad V_j = V_j - \alpha \frac{\partial L}{\partial V_j}
$$

where $\alpha$ is the learning rate, and $\frac{\partial L}{\partial U_i}$ and $\frac{\partial L}{\partial V_j}$ are the gradients of the loss with respect to $U_i$ and $V_j$, respectively.
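The four steps can be sketched in plain NumPy as follows. The ratings, sizes, learning rate, and epoch count are all made up for illustration, and each update uses the gradient of a single squared error (the stochastic part of SGD) rather than of the full mean.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical observed ratings: (user index i, movie index j, rating A_ij).
observed = [(0, 0, 5.0), (0, 1, 1.0), (1, 0, 4.0),
            (1, 2, 2.0), (2, 1, 5.0), (2, 2, 4.0)]
num_users, num_movies, d = 3, 3, 2
alpha = 0.05                                     # learning rate

# Step 1: random initialisation of the latent factors.
U = rng.normal(scale=0.1, size=(num_users, d))
V = rng.normal(scale=0.1, size=(num_movies, d))

def mse(U, V):
    """Step 3: mean squared error over the observed ratings only."""
    return np.mean([(a - U[i] @ V[j]) ** 2 for i, j, a in observed])

initial_loss = mse(U, V)
for epoch in range(500):
    for idx in rng.permutation(len(observed)):
        i, j, a = observed[idx]
        err = a - U[i] @ V[j]                    # Step 2: A_ij - P_ij
        grad_u = -2 * err * V[j]                 # d(err^2)/dU_i
        grad_v = -2 * err * U[i]                 # d(err^2)/dV_j
        U[i] -= alpha * grad_u                   # Step 4: SGD update
        V[j] -= alpha * grad_v
final_loss = mse(U, V)
print(f"loss: {initial_loss:.3f} -> {final_loss:.3f}")
```

Both gradients are computed before either vector is updated, so each step uses the same prediction error for the user and the movie.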

# Fastai

In [None]:
from fastai.collab import *
from fastai.tabular.all import *
path = untar_data(URLs.ML_100k)

In [None]:
ratings = pd.read_csv(path/'u.data', delimiter='\t', header=None,
                      names=['user','movie','rating','timestamp'])
ratings.head(10)

In [None]:
ratings.drop(['timestamp'], axis=1, inplace=True)
ratings.head(10)

In [None]:
movies = pd.read_csv(path/'u.item',  delimiter='|', encoding='latin-1',
                     usecols=(0,1), names=('movie','title'), header=None)
movies.head(10)

In [None]:
last_skywalker = np.array([0.98,0.9,-0.9])
user1 = np.array([0.9,0.8,-0.6])

In [None]:
(last_skywalker*user1).sum()

In [None]:
casablanca = np.array([-0.99,-0.3,0.8])

In [None]:
(casablanca*user1).sum()

## 1. Learning the Latent Factors

**1.** The first step involves randomly initializing some parameters known as latent factors for each user and movie. Picture a table (or crosstab) with users as rows and movies as columns: the latent factors sit next to each user and movie, and each cell holds the dot product of the corresponding pair.

**2.** The second step is to calculate our predictions by taking the dot product of each movie with each user. The product will be high if the user's preferences and the movie's characteristics match, and low if they don't.

**3.** The third step is to calculate our loss using a loss function. In this case, we're using mean squared error, which measures how far the predictions are from the actual ratings.

**4.** With these in place, we optimize our parameters (latent factors) using stochastic gradient descent to minimize the loss. This process calculates the match between each movie and each user, compares it to the actual rating, calculates the derivative, and adjusts the weights using the learning rate. This process, repeated multiple times, reduces the loss and thus improves the quality of recommendations.

In machine learning, we often use a technique called one-hot encoding to represent categorical data. This technique, however, can consume a lot of memory and time, so we use an embedding instead.

An embedding is a computational shortcut for multiplying by a one-hot-encoded vector. It uses an integer to index directly into a matrix and return one row (a vector of numbers), which gives the same result as the matrix multiplication would. The matrix we index into is called the embedding matrix.

In computer vision, each pixel in an image is represented by three numbers: its RGB values. This is a straightforward way to characterize a pixel.

Characterizing complex data, like a user's movie preferences, isn't that simple. A user's preferences can be influenced by factors like genre, dialogue, action content, or specific actors.

Instead of manually assigning numbers to these factors, we let our machine learning model learn them by analyzing user-movie interactions.

To do this, we assign each user and each movie a random vector of a certain length (5 in this case), and these vectors become learnable parameters. The model then adjusts them as it learns from the data.

Initially, these randomly chosen numbers don't have any meaning, but by the end of training, they do. The model can pick up on important features, such as the difference between blockbuster and independent cinema, or between action movies and romance.

With these concepts understood, we are ready to build our model from scratch.

In [None]:
ratings = ratings.merge(movies)
ratings.head(10)

In [None]:
dls = CollabDataLoaders.from_df(ratings, item_name='title', bs=64)
dls.show_batch()

In [None]:
dls.classes.keys()

In [None]:
dls.classes['user']

In [None]:
dls.classes['title'][:10]

In [None]:
n_users  = len(dls.classes['user'])
n_movies = len(dls.classes['title'])
n_factors = 5

user_factors = torch.randn(n_users, n_factors)
movie_factors = torch.randn(n_movies, n_factors)

In [None]:
n_users, n_movies

In [None]:
user_factors.shape, movie_factors.shape

In [None]:
one_hot_3 = one_hot(3, n_users).float()
one_hot_3.shape

In [None]:
one_hot_3

In [None]:
user_factors.shape

In [None]:
user_factors.t() @ one_hot_3

In [None]:
user_factors[3]

In [None]:
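# One-hot vectors for the first five users, built as a torch tensor
# so they can be multiplied with the torch tensor user_factors
one_hot_ = torch.eye(5, n_users)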
one_hot_.shape

In [None]:
user_factors.t() @ one_hot_.T

In [None]:
user_factors[0]

In [None]:
one_hot_

In [None]:
class DotProduct(Module):
    def __init__(self, n_users, n_movies, n_factors):
        self.user_factors = Embedding(n_users, n_factors)
        self.movie_factors = Embedding(n_movies, n_factors)
        
    def forward(self, x):
        users = self.user_factors(x[:,0])
        movies = self.movie_factors(x[:,1])
        return (users * movies).sum(dim=1)

In [None]:
x,y = dls.one_batch()
x.shape, y.shape

In [None]:
x[:10]

In [None]:
x[:, 0][:10]

In [None]:
x[:, 1][:10]

In [None]:
(x[:, 0][:10] * x[:, 1][:10]).sum()

In [None]:
model = DotProduct(n_users, n_movies, 50)
learn = Learner(dls, model, loss_func=MSELossFlat())

In [None]:
learn.fit_one_cycle(5, 5e-3)

In [None]:
class DotProduct(Module):
    def __init__(self, n_users, n_movies, n_factors, y_range=(0,5.5)):
        self.user_factors = Embedding(n_users, n_factors)
        self.movie_factors = Embedding(n_movies, n_factors)
        self.y_range = y_range
        
    def forward(self, x):
        users = self.user_factors(x[:,0])
        movies = self.movie_factors(x[:,1])
        return sigmoid_range((users * movies).sum(dim=1), *self.y_range)

In [None]:
model = DotProduct(n_users, n_movies, 50)
learn = Learner(dls, model, loss_func=MSELossFlat())

In [None]:
learn.fit_one_cycle(5, 5e-3)

The dot-product model has a limitation: it relies solely on the user and movie latent factors, which are the characteristics the model has learned.

As a result, it cannot account for the tendency of some users to rate more positively or negatively than others. Nor can it account for the inherent quality of movies: some are just generally liked or disliked, regardless of their specific characteristics.

For instance, if the model characterizes a movie as very sci-fi, very action-oriented, and very new, it has no way to capture whether the movie is generally well-liked. These characteristics describe the movie's genre and style, not its overall quality or general reception.

Adding biases addresses this problem. A bias is a learned term that shifts our predictions by a constant value. Here, we can have a bias for each user (representing their general positivity or negativity) and for each movie (representing its general quality). By adding these biases to the dot product, the model should be able to make more accurate predictions.

So, the model architecture needs to be adjusted to include a bias for each user and each movie.
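In symbols, the adjusted prediction might be written as follows, where $b_i$ (the bias of user $i$) and $c_j$ (the bias of movie $j$) are notation introduced here for illustration:

$$
P_{ij} = U_i \cdot V_j^T + b_i + c_j
$$

For instance, a generous rater with $b_i = 0.4$ scoring a widely liked movie with $c_j = 0.6$ gets a raw prediction one point higher than the dot product alone would give, before any squashing of the output into the rating range.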

In [None]:
class DotProductBias(Module):
    def __init__(self, n_users, n_movies, n_factors, y_range=(0,5.5)):
        self.user_factors = Embedding(n_users, n_factors)
        self.user_bias = Embedding(n_users, 1)
        self.movie_factors = Embedding(n_movies, n_factors)
        self.movie_bias = Embedding(n_movies, 1)
        self.y_range = y_range
        
    def forward(self, x):
        users = self.user_factors(x[:,0])
        movies = self.movie_factors(x[:,1])
        res = (users * movies).sum(dim=1, keepdim=True)
        res += self.user_bias(x[:,0]) + self.movie_bias(x[:,1])
        return sigmoid_range(res, *self.y_range)

In [None]:
model = DotProductBias(n_users, n_movies, 50)  # use the model with bias terms defined above
learn = Learner(dls, model, loss_func=MSELossFlat())
learn.fit_one_cycle(5, 5e-3)

### Using weight decay
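Weight decay (also called L2 regularization) discourages large parameter values by penalizing the sum of squared parameters, scaled by a coefficient `wd`. The sketch below shows the idea as an explicit addition to the loss; the function name and toy numbers are illustrative, and fastai applies weight decay inside the optimizer step rather than by literally changing the loss like this.

```python
import numpy as np

def loss_with_weight_decay(base_loss, params, wd):
    """Add wd * (sum of squared parameters) to an already computed loss."""
    penalty = sum(float((p ** 2).sum()) for p in params)
    return base_loss + wd * penalty

# Toy parameters standing in for embedding matrices.
toy_params = [np.array([[1.0, -2.0]]), np.array([[0.5], [0.5]])]
penalised = loss_with_weight_decay(base_loss=0.9, params=toy_params, wd=0.1)
print(penalised)
```

A larger `wd` pushes the embeddings toward smaller values, which tends to reduce overfitting at the cost of a somewhat higher training loss.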

In [None]:
model = DotProductBias(n_users, n_movies, n_factors=50)
learn = Learner(dls, model, loss_func=MSELossFlat())
learn.fit_one_cycle(5, 4e-3, wd=0.3)

In [None]:
model

In [None]:
model.user_factors

## Creating Our Own Embedding Module

In [None]:
def create_params(size):
    return nn.Parameter(torch.zeros(*size).normal_(0, 0.01))

In [None]:
class DotProductBias(Module):
    def __init__(self, n_users, n_movies, n_factors, y_range=(0,5.5)):
        self.user_factors = create_params([n_users, n_factors])
        self.user_bias = create_params([n_users])
        self.movie_factors = create_params([n_movies, n_factors])
        self.movie_bias = create_params([n_movies])
        self.y_range = y_range
        
    def forward(self, x):
        users = self.user_factors[x[:,0]]
        movies = self.movie_factors[x[:,1]]
        res = (users*movies).sum(dim=1)
        res += self.user_bias[x[:,0]] + self.movie_bias[x[:,1]]
        return sigmoid_range(res, *self.y_range)

In [None]:
model = DotProductBias(n_users, n_movies, 50)
learn = Learner(dls, model, loss_func=MSELossFlat())
learn.fit_one_cycle(5, 5e-3, wd=0.1)

##  Interpreting Embeddings and Biases


In [None]:
movie_bias = learn.model.movie_bias.squeeze()
idxs = movie_bias.argsort()[:5]
[dls.classes['title'][i] for i in idxs]

In [None]:
movie_bias[idxs]

In [None]:
idxs = movie_bias.argsort(descending=True)[:5]
[dls.classes['title'][i] for i in idxs]

In [None]:
movie_bias[idxs]

In [None]:
# Representation of movies based on two of the strongest PCA components
g = ratings.groupby('title')['rating'].count()
top_movies = g.sort_values(ascending=False).index.values[:1000]
top_idxs = tensor([learn.dls.classes['title'].o2i[m] for m in top_movies])
movie_w = learn.model.movie_factors[top_idxs].cpu().detach()
movie_pca = movie_w.pca(3)
fac0,fac1,fac2 = movie_pca.t()
idxs = list(range(50))
X = fac0[idxs]
Y = fac2[idxs]
plt.figure(figsize=(12,12))
plt.scatter(X, Y)
for i, x, y in zip(top_movies[idxs], X, Y):
    plt.text(x,y,i, color=np.random.rand(3)*0.7, fontsize=11)
plt.show()

In [None]:
learn = collab_learner(dls, n_factors=50, y_range=(0, 5.5))
learn.fit_one_cycle(5, 5e-3, wd=0.1)

### Embedding Distance

In [None]:
movie_factors = learn.model.i_weight.weight  # item embedding matrix of the collab_learner model
idx = dls.classes['title'].o2i['Silence of the Lambs, The (1991)']

In [None]:
movie_factors.shape, idx

In [None]:
distances = nn.CosineSimilarity(dim=1)(movie_factors, movie_factors[idx][None])
idx = distances.argsort(descending=True)[1]
dls.classes['title'][idx]

In [None]:
movie_factors = learn.model.i_weight.weight
idx = dls.classes['title'].o2i['Silence of the Lambs, The (1991)']
distances = nn.CosineSimilarity(dim=1)(movie_factors, movie_factors[idx][None])
idx = distances.argsort(descending=True)[:5]
dls.classes['title'][idx]

### Bootstrapping a Collaborative Filtering Model

The biggest challenge with using collaborative filtering models in practice is the bootstrapping problem. The most extreme version of this problem is when you have no users, and therefore no history to learn from. What products do you recommend to your very first user?

But even if you are a well-established company with a long history of user transactions, you still have the question: what do you do when a new user signs up? And indeed, what do you do when you add a new product to your portfolio? There is no magic solution to this problem, and really the solutions that we suggest are just variations of "use your common sense." You could assign new users the mean of all of the embedding vectors of your other users, but this has the problem that that particular combination of latent factors may not be at all common (for instance, the average for the science-fiction factor may be high, and the average for the action factor may be low, but it is not that common to find people who like science-fiction without action). Better would probably be to pick some particular user to represent average taste.
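Both options can be sketched with NumPy. The embeddings below are made up, with two hypothetical latent factors (sci-fi and action), and picking the user nearest the mean is just one possible way to choose a representative user.

```python
import numpy as np

# Hypothetical learned user embeddings: columns are (sci-fi, action) factors.
learned_user_embeddings = np.array([
    [ 0.9,  0.8],
    [ 0.8,  0.9],
    [-0.7, -0.9],
    [ 1.0, -0.8],
])

# Option 1: the mean of all user embeddings.
mean_embedding = learned_user_embeddings.mean(axis=0)

# Option 2: the existing user closest to that mean, so the new user
# starts from a combination of factors that some real user actually has.
dists = np.linalg.norm(learned_user_embeddings - mean_embedding, axis=1)
representative = learned_user_embeddings[np.argmin(dists)]

print("mean:", mean_embedding)
print("representative user:", representative)
```

In this toy data the mean has a high sci-fi factor and a neutral action factor, a profile that none of the four users actually has, while the representative user's profile is one that really occurs.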

Better still is to use a tabular model based on user meta data to construct your initial embedding vector. When a user signs up, think about what questions you could ask them that could help you to understand their tastes. Then you can create a model where the dependent variable is a user’s embedding vector, and the independent variables are the results of the questions that you ask them, along with their signup metadata. We will see in the next section how to create these kinds of tabular models. (You may have noticed that when you sign up for services such as Pandora and Netflix, they tend to ask you a few questions about what genres of movie or music you like; this is how they come up with your initial collaborative filtering recommendations.)

One thing to be careful of is that a small number of extremely enthusiastic users may end up effectively setting the recommendations for your whole user base. This is a very common problem, for instance, in movie recommendation systems. People that watch anime tend to watch a whole lot of it, and don’t watch very much else, and spend a lot of time putting their ratings on websites. As a result, anime tends to be heavily overrepresented in a lot of "best ever movies" lists. In this particular case, it can be fairly obvious that you have a problem of representation bias, but if the bias is occurring in the latent factors then it may not be obvious at all.

Such a problem can change the entire makeup of your user base, and the behavior of your system. This is particularly true because of positive feedback loops. If a small number of your users tend to set the direction of your recommendation system, then they are naturally going to end up attracting more people like them to your system. And that will, of course, amplify the original representation bias. This type of bias has a natural tendency to be amplified exponentially. You may have seen examples of company executives expressing surprise at how their online platforms rapidly deteriorated in such a way that they expressed values at odds with the values of the founders. In the presence of these kinds of feedback loops, it is easy to see how such a divergence can happen both quickly and in a way that is hidden until it is too late.

In a self-reinforcing system like this, we should probably expect these kinds of feedback loops to be the norm, not the exception. Therefore, you should assume that you will see them, plan for that, and identify up front how you will deal with these issues. Try to think about all of the ways in which feedback loops may be represented in your system, and how you might be able to identify them in your data. In the end, this is coming back to our original advice about how to avoid disaster when rolling out any kind of machine learning system. It’s all about ensuring that there are humans in the loop; that there is careful monitoring, and a gradual and thoughtful rollout.

Our dot product model works quite well, and it is the basis of many successful real-world recommendation systems. This approach to collaborative filtering is known as probabilistic matrix factorization (PMF). Another approach, which generally works similarly well given the same data, is deep learning.

### Deep Learning for Collaborative Filtering

In [None]:
embs = get_emb_sz(dls)
embs

In [None]:
class CollabNN(Module):
    def __init__(self, user_sz, item_sz, y_range=(0,5.5), n_act=100):
        self.user_factors = Embedding(*user_sz)
        self.item_factors = Embedding(*item_sz)
        self.layers = nn.Sequential(
            nn.Linear(user_sz[1]+item_sz[1], n_act),
            nn.ReLU(),
            nn.Linear(n_act, 1))
        self.y_range = y_range
        
    def forward(self, x):
        embs = self.user_factors(x[:,0]),self.item_factors(x[:,1])
        x = self.layers(torch.cat(embs, dim=1))
        return sigmoid_range(x, *self.y_range)

In [None]:
model = CollabNN(*embs)
model

In [None]:
learn = Learner(dls, model, loss_func=MSELossFlat())
learn.fit_one_cycle(5, 5e-3, wd=0.01)

# Colab: Build a Movie Recommendation System