We now discuss replacing the dot-product model with a deep-learning model. Users and movies no longer need to be represented by latent factors of the same dimension, but it is still useful to represent each of them by an embedding.

More generally, it is useful to represent categorical data by an embedding. Recall that a categorical feature with \(n\) values has a one-hot representation as \(n\) binary features. This makes it amenable to regression techniques, but that is a lot of features, and we are not using the dynamic range \((0,1)\) very well, since each feature only ever takes the value 0 or 1. As an intermediate approach, we can learn an embedding with a smaller number of real-valued factors, which makes better use of the dynamic range and also lets the model learn relationships between the categorical values through their embeddings.

`fastai` provides `get_emb_sz`, which takes a `Tabular` object, such as a `TabularDataLoaders`, and looks at its categorical features. For each such feature, it heuristically suggests an embedding dimensionality based on the number of values in the category. It returns these suggestions as a list of shapes, one per categorical feature, where each shape is that of a matrix defining a linear map from each categorical value to the vector that is its embedding. (Being a linear map, it is equivalent to a fully connected layer applied to one-hot inputs, with no bias and no activation.)
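The equivalence between an embedding lookup and a linear map on one-hot inputs can be checked numerically. The following is a minimal sketch using NumPy rather than `fastai`; the sizes, the random weight matrix, and all variable names are hypothetical and chosen only for illustration.

```python
import numpy as np

# Hypothetical sizes: a category with 5 values embedded in 3 dimensions.
n_values, emb_dim = 5, 3

# Stand-in for a learned embedding matrix (random here, for illustration only).
rng = np.random.default_rng(0)
weights = rng.normal(size=(n_values, emb_dim))

# A small batch of categorical values, encoded as integer indices.
indices = np.array([0, 3, 3, 1])

# One-hot encode the batch: each row has a single 1 in the column of its index.
one_hot = np.eye(n_values)[indices]

# Linear map applied to the one-hot inputs (no bias, no activation) ...
via_matmul = one_hot @ weights
# ... versus simply selecting the corresponding rows of the matrix.
via_lookup = weights[indices]

print(np.allclose(via_matmul, via_lookup))
```

The row lookup gives the same result as the matrix product while never materialising the one-hot vectors, which is why embeddings are typically implemented as an indexed lookup.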

To demonstrate this, we first repeat the work in previous notebooks up to defining the dataloaders.

In [1]:
from fastai.collab import *
from fastai.tabular.all import *

# Dataset preparation
path = untar_data(URLs.ML_100k)

ratings = pd.read_csv(path/'u.data', delimiter='\t', header=None,
                      names=['user_id', 'movie_id', 'rating', 'timestamp'])
movies = pd.read_csv(path/'u.item',  delimiter='|', encoding='latin-1',
                     usecols=(0, 1), names=('movie_id', 'title'), header=None)
ratings = ratings.merge(movies)
dls = CollabDataLoaders.from_df(ratings, item_name='title', bs=64)


`dls` is a `TabularDataLoaders` object, so it can be passed to `get_emb_sz`.

In [2]:
type(dls)

fastai.tabular.data.TabularDataLoaders

And we do just that.

In [3]:
embs = get_emb_sz(dls)
embs

[(944, 74), (1665, 102)]
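
The suggested sizes above are consistent with a simple power-law rule in the number of category values. The sketch below re-implements that rule in plain Python, assuming the heuristic is `min(600, round(1.6 * n_cat**0.56))`, which is what `fastai`'s `emb_sz_rule` uses in the versions I am aware of; check your installed version before relying on it. The function here is a stand-alone copy for illustration, not the library's own code.

```python
def emb_sz_rule(n_cat):
    # Assumed heuristic: grow the embedding size sublinearly with the number
    # of category values, capped at 600 dimensions.
    return min(600, round(1.6 * n_cat ** 0.56))

# The category sizes reported by get_emb_sz above.
sizes = [(n_cat, emb_sz_rule(n_cat)) for n_cat in (944, 1665)]
print(sizes)  # [(944, 74), (1665, 102)]
```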