In [None]:
!pip install fastai
import pandas as pd
import numpy as np
import torch
import fastai

import sklearn
from sklearn.decomposition import PCA

import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
print(dict(
    torch=torch.__version__, 
    fastai=fastai.__version__, 
    pandas=pd.__version__, 
    numpy=np.__version__))

# Embeddings

## What are they?
---

**An embedding may be thought of as a combination of one-hot-encoding + dimensionality reduction.**

The basic idea, inspired by the [distributional hypothesis](https://aclweb.org/aclwiki/Distributional_Hypothesis) for words, is to map similar categories to nearby locations in a low-dimensional latent space:

![](https://developers.google.com/static/machine-learning/crash-course/images/linear-relationships.svg)

Embeddings are a critical tool for making neural networks (NNs) efficient, especially when it comes to tabular data. Well-designed embeddings allow NNs to be as powerful as tree ensemble methods for tabular data.  When modeling categorical variables with several levels, we often one-hot-encode (OHE) them.  For tree ensembles, there is often no evidence that OHE actually improves performance, so you can just keep the categorical variable as a raw input.  For other models, like NNs, you cannot do that.  When a table contains raw data (text or images) or has very high cardinality categorical features, it is generally recommended to use a NN instead of a random forest, for example.  This is because [OHE creates an enormous, sparse matrix](https://developers.google.com/machine-learning/crash-course/embeddings/categorical-input-data), which is not ideal.  Converting this high-dimensional encoding to a sensible, lower-dimensional one is the point of embeddings.

**An Embedding layer can be understood as a lookup table that maps from integer indices (which stand for specific words, for example) to dense vectors (their embeddings).**

Although they are often used for NLP, embeddings also make sense when modeling other things that should display similarity, such as chemistry.  The [Similar Property Principle](https://chem.libretexts.org/Courses/Intercollegiate_Courses/Cheminformatics/05%3A_5._Quantitative_Structure_Property_Relationships/5.02%3A_Similar-Structure_Similar-Property_Principle) is analogous to the distributional hypothesis in that it posits that similar chemical structures have similar properties or behavior.  Transformers often use embeddings internally; these are often used for transfer learning, which relies on pretraining to map the input to a fixed output format, which is then fine-tuned and fit against some other desired target.  The pretraining and fine-tuning tasks are similar in nature.  For example, [MOFTransformer](https://chemrxiv.org/engage/chemrxiv/article-details/634fbf8a4a18764f58e9fda5) and [MOFormer](https://pubs.acs.org/doi/full/10.1021/jacs.2c11420) use this approach to make predictions about the properties of metal organic frameworks.

Practically, this also has applications in chemometric modeling, for tools like [PLS-DA](/examples/common_chemometrics/) which use OHE to map categories to the vertices of polytopes.  The vertices are equidistant, but this is not always sensible.  Some categories are naturally more similar than others and it would be ideal to place them closer together.

## How do I get one?
---
In NNs, an embedding layer is exactly equivalent to placing an ordinary linear layer (NN) after an OHE layer or transformation. [Entity embedding](https://paperswithcode.com/paper/entity-embeddings-of-categorical-variables) is essentially just a separate NN after an OHE transformation for each categorical variable.  This transforms each category (entity) into its own embedding.
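
This equivalence can be checked numerically. The following is a minimal sketch using NumPy rather than a real NN layer; the sizes, the weight matrix `W`, and the category indices are made up for illustration, and the linear layer is assumed to have no bias term.

```python
import numpy as np

rng = np.random.default_rng(0)

n_categories, n_factors = 5, 3                   # hypothetical sizes
W = rng.normal(size=(n_categories, n_factors))   # embedding matrix = weights of a bias-free linear layer

idx = np.array([2, 0, 4])                        # integer-encoded categories for three samples

# Route 1: one-hot encode, then multiply by the weight matrix (a linear layer without bias)
one_hot = np.eye(n_categories)[idx]              # shape (3, 5), a single 1 per row
via_linear = one_hot @ W                         # shape (3, 3)

# Route 2: embedding lookup, i.e. simply select the matching rows of W
via_lookup = W[idx]

print(np.allclose(via_linear, via_lookup))
```

The lookup gives the same result without ever building the sparse one-hot matrix, which is why embedding layers are implemented as table lookups in practice.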

From [Google](https://developers.google.com/machine-learning/crash-course/embeddings/obtaining-embeddings): 
> "In general, when you have sparse data (or dense data that you'd like to embed), you can create an embedding unit that is just a special type of hidden unit of size d. This embedding layer can be combined with any other features and hidden layers. As in any DNN, the final layer will be the loss that is being optimized. For example, let's say we're performing collaborative filtering, where the goal is to predict a user's interests from the interests of other users. We can model this as a supervised learning problem by randomly setting aside (or holding out) a small number of the movies that the user has watched as the positive labels, and then optimize a softmax loss."

![](https://developers.google.com/static/machine-learning/crash-course/images/EmbeddingExample3-1.svg)

Embeddings can also be achieved via matrix factorization (see below) so they are not restricted to use with NNs, but for many practical reasons NNs are either used to obtain them in the first place or naturally contain layers which function as embeddings.  Consequently, embeddings are often discussed within the context of NNs.  

In practice, embeddings are most useful for natural language processing (NLP) and many embeddings are pre-trained and open source so they can be used off-the-shelf for NLP tasks.  Google discusses some, like [word2vec](https://en.wikipedia.org/wiki/Word2vec), [here](https://developers.google.com/machine-learning/crash-course/embeddings/obtaining-embeddings).

In general you can:
1. use an embedding trained separately
2. train an embedding (as a layer) directly in a NN

The advantage of (1) is that it can be easy to use with arbitrary models, and essentially amounts to preprocessing your data; the drawback is that you don't have any control over the embedding, so it may not be optimal for your task.  (2) is simpler to program and start with, but it can take longer to train the final model since you need to learn the embedding, too.

[Fastai](https://docs.fast.ai/) has a [tool for heuristically suggesting](https://docs.fast.ai/tabular.model.html#get_emb_sz) the best size for your embeddings. This is just a suggestion, but a good starting point.
~~~code
get_emb_sz(my_dataloader)
~~~
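
To the best of our knowledge, the heuristic behind this suggestion is fastai's `emb_sz_rule`, a simple rule of thumb based on the number of categories. The re-implementation below is a sketch under that assumption (the name `suggested_emb_sz` is hypothetical); check the linked documentation for the version you have installed.

```python
def suggested_emb_sz(n_cat: int) -> int:
    # Assumed re-implementation of fastai's rule of thumb:
    # the size grows sub-linearly with the number of categories and is capped at 600
    return min(600, round(1.6 * n_cat ** 0.56))

for n_cat in (10, 100, 1000, 100000):
    print(n_cat, suggested_emb_sz(n_cat))
```

The sub-linear growth means that even very high cardinality variables get a comparatively small embedding, in contrast to OHE where the width equals the number of categories.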

## Building an embedding
---
These are some examples from Google using TensorFlow to build word embeddings:
* [Training word embeddings](https://www.tensorflow.org/text/guide/word_embeddings)
* [word2vec tutorial](https://www.tensorflow.org/tutorials/text/word2vec)

One very nice tool is the [Embedding Projector]() which helps visualize these embeddings to give intuition about your embedding and its "quality."

![](https://www.tensorflow.org/static/text/guide/images/embedding.jpg)


# Collaborative Filtering

Collaborative filtering uses embeddings to find latent factors connecting categorical or labeled inputs and outputs (for example, usernames vs. movie titles).  The hypothesis is that there is a structured latent space defined by these embeddings in which similar points reflect similar users or movies.

[Fastai](https://docs.fast.ai/) has a lightweight, fast [collab_learner](https://docs.fast.ai/collab.html) which is a good tool for a first pass if you don't want to dive too deeply into the details.

## Probabilistic Matrix Factorization Approach

This is from ["Deep learning for coders with fastai & pytorch"](https://www.amazon.com/Deep-Learning-Coders-fastai-PyTorch/dp/1492045527) (Chapter 8).

Here we are trying to find the latent factors controlling the connection between users and the movies they like.

We will use a very simple metric (the dot product) to define matches / similarity. This approach to collaborative filtering is called [probabilistic matrix factorization](https://towardsdatascience.com/probabilistic-matrix-factorization-b7852244a321). The basic assumption is that the response (here, a movie rating from 1-5) is the dot product of the user's characteristic vector with the movie's characteristic vector; these vectors are the embeddings!
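
As a minimal sketch of this assumption, the snippet below scores user/movie pairs with a dot product. The sizes are made up and the embeddings are random; in practice they are learned by minimizing the error on the observed ratings.

```python
import numpy as np

rng = np.random.default_rng(0)

n_users, n_movies, n_factors = 4, 6, 3                  # hypothetical toy sizes
U = rng.normal(scale=0.1, size=(n_users, n_factors))    # user embeddings (one row per user)
M = rng.normal(scale=0.1, size=(n_movies, n_factors))   # movie embeddings (one row per movie)

# Predicted score for a single (user, movie) pair: the dot product of the two embeddings
user, movie = 1, 4
score = U[user] @ M[movie]

# All pairs at once: the full (n_users x n_movies) score matrix is the product of the two factor matrices
scores = U @ M.T

print(scores.shape, np.isclose(scores[user, movie], score))
```

This is where the name "matrix factorization" comes from: the (mostly unobserved) rating matrix is approximated by the product of two much smaller matrices.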

Here is a nice introduction from [Google](https://developers.google.com/machine-learning/recommendation/collaborative/basics).  One thing Google points out is that you can improve things by [weighting observed](https://developers.google.com/machine-learning/recommendation/collaborative/matrix) entries differently than unobserved ones during training (which will make sense soon).  The example below is a simpler implementation which does not do this.
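
For reference only, the weighted objective mentioned above can be sketched as follows; the fastai models used in the rest of this section do not use it. The function name, the weight `w0`, and the toy ratings are hypothetical, and unobserved entries are assumed to be treated as zeros.

```python
import numpy as np

def weighted_mf_loss(A, observed, U, V, w0=0.1):
    """Squared error on observed entries plus a down-weighted penalty on unobserved ones (treated as zeros)."""
    pred = U @ V.T
    obs_term = np.sum((A - pred)[observed] ** 2)
    unobs_term = np.sum(pred[~observed] ** 2)
    return obs_term + w0 * unobs_term

rng = np.random.default_rng(0)
A = np.array([[5.0, 0.0, 1.0],
              [0.0, 4.0, 0.0]])       # toy ratings; 0.0 marks "not rated"
observed = A > 0                       # boolean mask of observed entries
U = rng.normal(scale=0.1, size=(2, 2))
V = rng.normal(scale=0.1, size=(3, 2))

print(weighted_mf_loss(A, observed, U, V, w0=0.0))  # only observed entries contribute
print(weighted_mf_loss(A, observed, U, V, w0=0.1))  # unobserved entries contribute, but weakly
```

Setting `w0=0` recovers a loss that only looks at observed ratings, which is closer in spirit to what the simpler implementation in this section does.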

Some [limitations](https://developers.google.com/machine-learning/recommendation/dnn/softmax), which can be overcome by using a NN (next section), include:
> 1. This only works on the training data - you cannot make predictions on new, unseen data.  In principle, if you had another model (e.g., NN) to predict the user embedding based on other features you should be able to use the filtering model to make predictions, but this adds another layer - using a NN directly essentially accomplishes the same thing in one step.
> 2. "Popular items tend to be recommended for everyone, especially when using dot product as a similarity measure." User-specific interests are not well-tailored.

In [None]:
from fastai.collab import *
from fastai.tabular.all import *

In [None]:
path = untar_data(URLs.ML_100k)
ratings = pd.read_csv(path/'u.data', delimiter='\t', header=None, names=['user', 'movie', 'rating', 'timestamp'])
movies = pd.read_csv(path/'u.item', delimiter='|', encoding='latin-1', usecols=(0,1), names=('movie', 'title'), header=None)

In [None]:
movies.head()

In [None]:
ratings.head()

In [None]:
print(
    ratings['rating'].min(),
    ratings['rating'].max())

In [None]:
ratings = ratings.merge(movies) # merge based on movie (common column)

In [None]:
dls = CollabDataLoaders.from_df(ratings, item_name='title', bs=64) # Make a dataloader from the dataframe
dls.show_batch()

In [None]:
n_users = len(dls.classes['user'])
n_movies = len(dls.classes['title'])

In [None]:
lower = 1
upper = 10
x = torch.tensor(np.linspace(-10, +10, 1000))
y = sigmoid_range(x, lower, upper) # Squashes x smoothly into the interval (lower, upper)

plt.plot(x,y)
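
`sigmoid_range` is used below to keep predictions inside a sensible rating range. A minimal NumPy sketch of the same idea, assuming the usual rescaled-sigmoid definition (the function name `sigmoid_range_np` is hypothetical and this is not fastai's own code):

```python
import numpy as np

def sigmoid_range_np(x, low, high):
    # Standard logistic sigmoid, rescaled from (0, 1) to (low, high)
    return 1.0 / (1.0 + np.exp(-x)) * (high - low) + low

x = np.linspace(-10, 10, 5)
print(sigmoid_range_np(x, 0, 5.5))
```

Because the sigmoid only approaches its limits asymptotically, a range of exactly (1, 5) would make the extreme ratings unreachable; this is one reason for choosing a slightly wider `y_range` such as (0, 5.5).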

In [None]:
# Let's do this using fastai's built in Embedding module

# We are going to predict the user ratings (1-5) - heuristically increasing the scale (y_range) a bit seems to help

class DotProductBias(Module):
  def __init__(self, n_users, n_movies, n_factors, y_range=(0, 5.5)):
    self.user_factors = Embedding(n_users, n_factors)
    self.user_bias = Embedding(n_users, 1)
    self.movie_factors = Embedding(n_movies, n_factors)
    self.movie_bias = Embedding(n_movies, 1)
    self.y_range = y_range

  def forward(self, x):
    users = self.user_factors(x[:,0])
    movies = self.movie_factors(x[:,1])
    # A dot product reflects the similarity between 2 vectors so is a convenient "matching" approach
    res = (users * movies).sum(dim=1, keepdim=True) + self.user_bias(x[:,0]) + self.movie_bias(x[:,1])
    return sigmoid_range(res, *self.y_range)

model = DotProductBias(n_users, n_movies, 50)
learn = Learner(dls, model, loss_func=MSELossFlat()) 

In [None]:
learn.fit_one_cycle(5, 5e-3, wd=0.1) # Use weight decay (l2 norm) to regularize

In [None]:
# To be a little more transparent we can make our own "Embedding" module
def create_params(size):
  # nn.Parameter tells pytorch this is a trainable parameter
  # This also randomizes them
  return nn.Parameter(torch.zeros(*size).normal_(0, 0.01))

class ManualDotProductBias(Module):
  def __init__(self, n_users, n_movies, n_factors, y_range=(0,5.5)):
    self.user_factors = create_params([n_users, n_factors])
    self.user_bias = create_params([n_users])
    self.movie_factors = create_params([n_movies, n_factors])
    self.movie_bias = create_params([n_movies])
    self.y_range = y_range

  def forward(self, x):
    users = self.user_factors[x[:,0]]
    movies = self.movie_factors[x[:,1]]
    # A dot product reflects the similarity between 2 vectors so is a convenient "matching" approach
    res = (users * movies).sum(dim=1) + self.user_bias[x[:,0]] + self.movie_bias[x[:,1]]
    return sigmoid_range(res, *self.y_range)

model = ManualDotProductBias(n_users, n_movies, 50)
learn = Learner(dls, model, loss_func=MSELossFlat(), wd=0.1) # Use weight decay (l2 norm) to regularize

In [None]:
learn.fit_one_cycle(5, 5e-3, wd=0.1) # Use weight decay (l2 norm) to regularize

In [None]:
# The bias tells us what movies are just "bad" - i.e., even if the dot product is high (movie well matched to user's general taste), a low bias means they probably still won't like it
# Conversely, high biases can show movies that are very "good" and even users who don't normally like a certain type of movie will probably still enjoy it.

movie_bias = learn.model.movie_bias.squeeze()
idxs = movie_bias.argsort()[:5]
bad_movies = [dls.classes['title'][i] for i in idxs]

idxs = movie_bias.argsort(descending=True)[:5]
good_movies = [dls.classes['title'][i] for i in idxs]

In [None]:
bad_movies

In [None]:
good_movies

In [None]:
# We can look at the latent space

pca = PCA(n_components=2)

factors = learn.model.movie_factors.detach().cpu().numpy()
look_at = 20 # Look at the "best" and "worst" N movies

# Index directly with the sorted indices so that row i of each projection
# corresponds to idxs_best[i] / idxs_worst[i] (these are used for the plot labels)
idxs_best = movie_bias.argsort(descending=True)[:look_at].tolist()
idxs_worst = movie_bias.argsort(descending=False)[:look_at].tolist()

# Fit a single PCA on both groups so they share the same 2D coordinate system
pca.fit(factors[idxs_best + idxs_worst])
low_d_best = pca.transform(factors[idxs_best])
low_d_worst = pca.transform(factors[idxs_worst])

In [None]:
plt.figure(figsize=(10,10))
plt.plot(low_d_best[:,0], low_d_best[:,1], 'go')
for i in range(len(low_d_best)):
  plt.text(low_d_best[i,0], low_d_best[i,1], dls.classes['title'][idxs_best[i]])

plt.plot(low_d_worst[:,0], low_d_worst[:,1], 'ro')
for i in range(len(low_d_worst)):
  plt.text(low_d_worst[i,0], low_d_worst[i,1], dls.classes['title'][idxs_worst[i]])

In [None]:
# Fastai does the same thing

learn = collab_learner(dls, n_factors=50, y_range=(0, 5.5))
learn.fit_one_cycle(5, 5e-3, wd=0.1)

In [None]:
learn.model

In [None]:
movie_bias = learn.model.i_bias.weight.squeeze()
idxs = movie_bias.argsort(descending=True)[:5]
[dls.classes['title'][i] for i in idxs]

Importantly, the latent space can be used to define distances: similar things should (we hope) be close in this space, so we can base recommendations, or authentication/predictions, on similarity to a known exemplar.
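
A minimal sketch of such a similarity search, using cosine similarity on a made-up set of 2D "embeddings" (the helper `most_similar` is hypothetical):

```python
import numpy as np

def most_similar(factors, query_idx, k=3):
    """Return the indices of the k rows of `factors` closest to row `query_idx` by cosine similarity."""
    unit = factors / np.linalg.norm(factors, axis=1, keepdims=True)  # normalise every embedding to unit length
    sims = unit @ unit[query_idx]                                    # cosine similarity to the query
    order = np.argsort(-sims)                                        # most similar first
    return [int(i) for i in order if i != query_idx][:k]

# Toy embeddings: rows 0 and 1 point in nearly the same direction, row 2 is orthogonal, row 3 is opposite
factors = np.array([[1.0, 0.0],
                    [0.9, 0.1],
                    [0.0, 1.0],
                    [-1.0, 0.0]])

print(most_similar(factors, query_idx=0, k=2))  # [1, 2]
```

With the fastai model above, the learned movie embeddings should be available as `learn.model.i_weight.weight` (an assumption based on the printed model); after `.detach().cpu().numpy()` they could be passed in as `factors`.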

## Deep Neural Network Approach

This is from ["Deep learning for coders with fastai & pytorch"](https://www.amazon.com/Deep-Learning-Coders-fastai-PyTorch/dp/1492045527) (Chapter 8).

Now, we will use a neural network to predict the latent similarity.  This is achieved by concatenating the factors of the movies and the users to create a single input; a NN then predicts the "degree of recommendation / compatibility" based on that combined input.

Google also has a nice example [here](https://developers.google.com/machine-learning/recommendation/dnn/softmax) which does multiclass prediction (the probability of liking different videos).  The example here predicts a single scalar (the user rating), so we use MSELoss instead of categorical cross entropy, for example, but the point is that this can be adapted.

In addition to overcoming the shortcomings of the PMF approach above, NNs also enable you to concatenate other continuous variables with the input to these layers to create [wide and deep NNs](https://medium.com/analytics-vidhya/wide-deep-learning-for-recommender-systems-dc99094fc291) as explained in this [paper](https://arxiv.org/abs/1606.07792) from Google. Also see [here](https://developers.google.com/machine-learning/recommendation/dnn/softmax).
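
The concatenation step can be sketched in NumPy as follows. All sizes, the random weights, and the two extra continuous features are made up for illustration; this shows only how additional inputs are appended to the embeddings, not a full wide and deep architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

batch = 4
user_emb = rng.normal(size=(batch, 8))    # looked-up user embeddings (hypothetical width 8)
item_emb = rng.normal(size=(batch, 12))   # looked-up item embeddings (hypothetical width 12)
cont = rng.normal(size=(batch, 2))        # extra continuous features (hypothetical, already standardised)

# Single input to the network: everything concatenated along the feature axis
x = np.concatenate([user_emb, item_emb, cont], axis=1)

# One hidden ReLU layer followed by a scalar output
W1, b1 = rng.normal(scale=0.1, size=(x.shape[1], 16)), np.zeros(16)
W2, b2 = rng.normal(scale=0.1, size=(16, 1)), np.zeros(1)
hidden = np.maximum(x @ W1 + b1, 0.0)
out = hidden @ W2 + b2

print(x.shape, out.shape)
```

Unlike the dot product model, the two embeddings do not need to have the same width here, since they are concatenated rather than multiplied element-wise.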

In [None]:
from fastai.collab import *
from fastai.tabular.all import *

In [None]:
path = untar_data(URLs.ML_100k)
ratings = pd.read_csv(path/'u.data', delimiter='\t', header=None, names=['user', 'movie', 'rating', 'timestamp'])
movies = pd.read_csv(path/'u.item', delimiter='|', encoding='latin-1', usecols=(0,1), names=('movie', 'title'), header=None)
ratings = ratings.merge(movies) # merge based on movie (common column)
dls = CollabDataLoaders.from_df(ratings, item_name='title', bs=64) # Make a dataloader from the dataframe

In [None]:
dls.classes.keys()

In [None]:
get_emb_sz(dls) # fastai has a heuristic tool to recommend the size of your embedding for each 'class'

In [None]:
??get_emb_sz

In [None]:
class CollabNN(Module):
  def __init__(self, user_sz, item_sz, y_range=(0,5.5), n_act=100):
    self.user_factors = Embedding(*user_sz)
    self.item_factors = Embedding(*item_sz)
    self.layers = nn.Sequential(
        nn.Linear(user_sz[1] + item_sz[1], n_act),
        nn.ReLU(),
        nn.Linear(n_act, 1) # Output a single scalar = user rating
    )
    self.y_range = y_range

  def forward(self, x):
    embs = self.user_factors(x[:,0]), self.item_factors(x[:,1])
    x = self.layers(torch.cat(embs, dim=1))
    return sigmoid_range(x, *self.y_range)

In [None]:
model = CollabNN(*get_emb_sz(dls))

In [None]:
dls.items

In [None]:
learn = Learner(dls, model, loss_func=MSELossFlat())
learn.fit_one_cycle(5, 5e-3, wd=0.01)

In [None]:
# fastai has a shortcut function for this which uses get_emb_sz automatically

learn = collab_learner(dls, 
                       use_nn=True, # Use NN instead of PMF
                       y_range=(0,5.5), 
                       layers=[100, 50] # You can also adjust the sizes of the hidden layers
                       )
learn.fit_one_cycle(5, 5e-3, wd=0.1)