In [None]:
!pip install fastai
import pandas as pd
import numpy as np
import torch
import fastai

In [None]:
print(dict(
    torch=torch.__version__, 
    fastai=fastai.__version__, 
    pandas=pd.__version__, 
    numpy=np.__version__))

# Embeddings

Embeddings are a critical tool for making neural networks efficient, especially on tabular data. They allow neural networks (NNs) to be as powerful as tree ensemble methods on this kind of data.

When a table contains raw data (text or images) or has categorical features with very high cardinality, it is generally recommended to use an NN instead of, for example, a random forest.
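
To make the idea concrete, here is a minimal sketch of what an embedding layer does, using PyTorch's `nn.Embedding`. The sizes (`n_categories`, `n_factors`) and the index values are hypothetical and chosen only for illustration. The sketch shows that an embedding lookup gives the same result as one-hot encoding the category and multiplying by a weight matrix, just without building the large one-hot matrix.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Hypothetical sizes, chosen only for illustration
n_categories, n_factors = 5, 3
emb = torch.nn.Embedding(n_categories, n_factors)

# Integer-encoded values of a categorical feature
idx = torch.tensor([0, 3, 3, 1])

# Direct lookup: picks one row of the weight matrix per index
looked_up = emb(idx)

# Equivalent but more expensive: one-hot encode, then matrix-multiply
one_hot = F.one_hot(idx, num_classes=n_categories).float()
via_matmul = one_hot @ emb.weight

print(looked_up.shape)
print(torch.allclose(looked_up, via_matmul))
```

The rows of `emb.weight` are ordinary trainable parameters, so the network learns a dense vector for each category during training.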

In [None]:
import torch

# Collaborative Filtering

Collaborative filtering uses embeddings to find latent factors connecting categorical or labeled inputs and outputs (for example, usernames vs. movie titles).  

Approximately:
* Regression model ~ Continuous input -> continuous output
* Classification model ~ Continuous input -> categorical/labeled output 
* Collaborative filtering ~ Categorical/labeled input -> categorical/labeled output

## Example

This is from ["Deep Learning for Coders with fastai & PyTorch"](https://www.amazon.com/Deep-Learning-Coders-fastai-PyTorch/dp/1492045527) (Chapter 8).

Here we are trying to find the latent factors controlling the connection between users and the movies they like.
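
One common way to model latent factors is a dot product between a user embedding and a movie embedding. The following is a rough sketch of that idea in plain PyTorch, not necessarily what fastai builds internally: it omits bias terms, and the class name `DotProduct`, the sizes, and the `y_range` default are assumptions made for illustration.

```python
import torch
from torch import nn

class DotProduct(nn.Module):
    """Predict a rating as the dot product of user and movie latent factors."""
    def __init__(self, n_users, n_movies, n_factors, y_range=(0.0, 5.5)):
        super().__init__()
        self.user_factors = nn.Embedding(n_users, n_factors)
        self.movie_factors = nn.Embedding(n_movies, n_factors)
        self.y_range = y_range  # assumed rating range, for illustration

    def forward(self, x):
        # x has shape (batch, 2): column 0 is the user index, column 1 the movie index
        users = self.user_factors(x[:, 0])
        movies = self.movie_factors(x[:, 1])
        raw = (users * movies).sum(dim=1)
        # Squash the raw score into the rating range with a scaled sigmoid
        lo, hi = self.y_range
        return torch.sigmoid(raw) * (hi - lo) + lo

# Hypothetical sizes and indices
model = DotProduct(n_users=10, n_movies=20, n_factors=4)
batch = torch.tensor([[0, 5], [3, 19], [9, 0]])
preds = model(batch)
print(preds.shape)
```

After training, each dimension of the learned factors can be read as a latent trait: a user and a movie whose factors point in similar directions produce a high predicted rating.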

In [None]:
from fastai.collab import *
from fastai.tabular.all import *

In [None]:
path = untar_data(URLs.ML_100k)
ratings = pd.read_csv(path/'u.data', delimiter='\t', header=None, names=['user', 'movie', 'rating', 'timestamp'])
movies = pd.read_csv(path/'u.item', delimiter='|', encoding='latin-1', usecols=(0,1), names=('movie', 'title'), header=None)

In [None]:
movies.head()

In [None]:
ratings.head()

In [None]:
ratings = ratings.merge(movies, on='movie')  # join on the shared 'movie' column to attach a title to each rating
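
As a small illustration of what this merge does, here is a sketch on two toy dataframes. The names `ratings_toy` and `movies_toy` and their contents are made up for this example; only the column layout mirrors the real data.

```python
import pandas as pd

# Toy stand-ins for the real tables (values are invented)
ratings_toy = pd.DataFrame({'user': [1, 2, 3],
                            'movie': [10, 11, 10],
                            'rating': [4, 5, 3]})
movies_toy = pd.DataFrame({'movie': [10, 11],
                           'title': ['Movie A', 'Movie B']})

# Inner join (the default) on the shared 'movie' column:
# every rating row gains the title of the movie it refers to
merged = ratings_toy.merge(movies_toy, on='movie')
print(merged)
```

Because the join is an inner join, a rating whose movie id is missing from the movies table would be dropped from the result.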

In [None]:
dls = CollabDataLoaders.from_df(ratings, item_name='title', bs=64) # Make a dataloader from the dataframe
dls.show_batch()