# Deep Learning for Tabular Data

Tabular modeling takes data in table form. The goal is to estimate the value in one column from the values in the other columns. One example of tabular data is the Rossmann Store Sales dataset from https://www.kaggle.com/c/rossmann-store-sales/data, which provides historical sales data for 1,115 Rossmann stores. The task is to forecast the "Sales" column for the test set.

Most of the fields are self-explanatory.

    Id - an Id that represents a (Store, Date) duple within the test set
    Store - a unique Id for each store
    Sales - the turnover for any given day (this is what you are predicting)
    Customers - the number of customers on a given day
    Open - an indicator for whether the store was open: 0 = closed, 1 = open
    StateHoliday - indicates a state holiday. Normally all stores, with few exceptions, are closed on state holidays. Note that all schools are closed on public holidays and weekends. a = public holiday, b = Easter holiday, c = Christmas, 0 = None
    SchoolHoliday - indicates if the (Store, Date) was affected by the closure of public schools
    StoreType - differentiates between 4 different store models: a, b, c, d
    Assortment - describes an assortment level: a = basic, b = extra, c = extended
    CompetitionDistance - distance in meters to the nearest competitor store
    CompetitionOpenSince[Month/Year] - gives the approximate year and month of the time the nearest competitor was opened
    Promo - indicates whether a store is running a promo on that day
    Promo2 - Promo2 is a continuing and consecutive promotion for some stores: 0 = store is not participating, 1 = store is participating
    Promo2Since[Year/Week] - describes the year and calendar week when the store started participating in Promo2
    PromoInterval - describes the consecutive intervals Promo2 is started, naming the months the promotion is started anew. E.g. "Feb,May,Aug,Nov" means each round starts in February, May, August, November of any given year for that store


In [1]:
import numpy as np
import pandas as pd
import torch
import matplotlib.pyplot as plt
from torch.utils.data import Dataset, DataLoader
from pandas.api.types import is_numeric_dtype, is_categorical_dtype
from sklearn.preprocessing import LabelEncoder
from sklearn.utils.validation import check_is_fitted
from sklearn.utils import column_or_1d
from tqdm.notebook import trange, tqdm
import warnings
warnings.filterwarnings('ignore')
torch.__version__

'1.3.1'

## Data visualization and preprocessing

Before modeling, we first need to visualize and preprocess our data: check that each column has the proper type and handle missing values, if any.

We use pandas to load and interact with our data. 

In [2]:
data = pd.read_csv('train.csv', low_memory=False)
n = len(data)
# Let's only use a small random portion of the data (3,000 rows), kept in file order
idx = np.sort(np.random.permutation(n)[:3000])
data = data.iloc[idx]

`head()` is nice to look at the first few rows in a dataset and at the column names. This already gives us an intuition of what kind of data we can expect.

In [3]:
data.head()

Unnamed: 0,Store,DayOfWeek,Date,Sales,Customers,Open,Promo,StateHoliday,SchoolHoliday
315,316,5,2015-07-31,11678,927,1,1,0,1
556,557,5,2015-07-31,5417,752,1,1,0,1
760,761,5,2015-07-31,12761,1203,1,1,0,1
777,778,5,2015-07-31,7341,804,1,1,0,0
797,798,5,2015-07-31,9268,997,1,1,0,1


`describe()` displays a few summary statistics for the numerical columns.

In [4]:
data.describe()

Unnamed: 0,Store,DayOfWeek,Sales,Customers,Open,Promo,SchoolHoliday
count,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0
mean,564.0,3.961,5778.641667,639.087,0.830667,0.384667,0.179
std,320.077351,1.996866,3897.991649,477.320096,0.375109,0.486598,0.383416
min,1.0,1.0,0.0,0.0,0.0,0.0,0.0
25%,289.0,2.0,3707.5,399.0,1.0,0.0,0.0
50%,569.0,4.0,5709.5,613.0,1.0,0.0,0.0
75%,836.0,6.0,7812.5,840.0,1.0,1.0,0.0
max,1115.0,7.0,28682.0,4394.0,1.0,1.0,1.0


`dtypes` returns the data type of each column.

In [5]:
data.dtypes

Store             int64
DayOfWeek         int64
Date             object
Sales             int64
Customers         int64
Open              int64
Promo             int64
StateHoliday     object
SchoolHoliday     int64
dtype: object

In [6]:
data.StateHoliday.unique()

array(['0', 'a', 'b', 'c'], dtype=object)

As we can see, some columns contain numerical data while others contain string values. The numerical data can be fed directly to our neural network (with some optional preprocessing), but the other columns need to be converted to numbers. Since the values in those columns correspond to different categories, we call these categorical variables. The numerical ones are called continuous variables.

### Continuous variables
Continuous variables are numerical data, such as "age", that can be fed directly to the model, since you can add and multiply them directly.

### Categorical variables
Categorical variables contain a number of discrete levels, such as "sex", for which addition and multiplication don't have meaning (even if they're stored as numbers).
In other words, they are input features that represent one or more discrete items from a finite set of choices.
Categorical data is most efficiently represented via sparse tensors, which are tensors with very few non-zero elements.

In order to use such representations within a machine learning system, we need a way to represent each sparse vector as a vector of numbers so that semantically similar items are close together in the vector space. Text is the classic example, where every word is a category. How do you represent a word as a vector of numbers?

The simplest way is to define a giant input layer with a node for every word in your vocabulary, or at least a node for every word that appears in your data. If 500,000 unique words appear in your data, you could represent a word with a length 500,000 vector and assign each word to a slot in the vector.

If you assign "horse" to index 1247, then to feed "horse" into your network you might copy a 1 into the 1247th input node and 0s into all the rest. This sort of representation is called a one-hot encoding, because only one index has a non-zero value.
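As a minimal sketch of one-hot encoding, assuming a toy five-word vocabulary instead of the 500,000-word one described above (the `vocab` list and the `one_hot` helper are illustrative names, not part of this notebook):

```python
import numpy as np

# Hypothetical toy vocabulary; a real one would hold every word in the data.
vocab = ["cat", "dog", "horse", "mouse", "zebra"]
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Return a vector with a 1 at the word's index and 0s elsewhere."""
    vec = np.zeros(len(vocab), dtype=np.float32)
    vec[word_to_index[word]] = 1.0
    return vec

horse_vec = one_hot("horse")
print(horse_vec)  # [0. 0. 1. 0. 0.]
```

The vector length grows with the vocabulary, which is why this representation becomes impractical for very large category sets.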

More typically your vector might contain counts of the words in a larger chunk of text. This is known as a "bag of words" representation. In a bag-of-words vector, several of the 500,000 nodes would have non-zero value.
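A bag-of-words vector can be sketched in the same spirit. The example below assumes the same kind of toy vocabulary and simple whitespace tokenization; the `bag_of_words` helper is a hypothetical name used only for illustration:

```python
from collections import Counter
import numpy as np

# Hypothetical toy vocabulary, for illustration only.
vocab = ["cat", "dog", "horse", "mouse", "zebra"]
word_to_index = {w: i for i, w in enumerate(vocab)}

def bag_of_words(text):
    """Count how often each vocabulary word occurs in the text."""
    counts = Counter(text.lower().split())
    vec = np.zeros(len(vocab), dtype=np.int64)
    for word, count in counts.items():
        if word in word_to_index:   # words outside the vocabulary are ignored
            vec[word_to_index[word]] = count
    return vec

bow = bag_of_words("horse dog horse zebra")
print(bow)  # [0 1 2 0 1]
```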

But however you determine the non-zero values, one-node-per-word gives you very sparse input vectors: very large vectors with relatively few non-zero values. Sparse representations have a couple of problems that can make it hard for a model to learn effectively: huge input vectors mean a huge number of weights to learn, and the vectors carry no notion of similarity between items.

The solution to this problem is to use *embeddings*, which translate large sparse vectors into a lower-dimensional space that preserves semantic relationships. The next section looks at embeddings more closely.


### What is an embedding?


An embedding is a relatively low-dimensional space into which you can translate high-dimensional vectors.
In other words, it represents a categorical feature as a continuous-valued feature. For example, you can represent the words in an English sentence in either of the following two ways:

* As a million-element (high-dimensional) sparse vector in which all elements are integers. Each cell in the vector represents a separate English word; the value in a cell represents the number of times that word appears in a sentence. Since a single English sentence is unlikely to contain more than 50 words, nearly every cell in the vector will contain a 0. The few cells that aren't 0 will contain a low integer (usually 1) representing the number of times that word appeared in the sentence.
* As a several-hundred-element (low-dimensional) dense vector in which each element holds a floating-point value. This is an embedding.

Ideally, an embedding captures some of the semantics of the input by placing semantically similar inputs close together in the embedding space. An embedding can be learned and reused across models.
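To make this concrete, here is a small sketch using PyTorch's `nn.Embedding`. The sizes (7 categories, 4 dimensions) are chosen only for illustration. It shows that an embedding lookup gives the same result as multiplying a one-hot vector by the embedding weight matrix, without ever building the sparse vector:

```python
import torch
from torch import nn

torch.manual_seed(0)
n_categories, emb_dim = 7, 4          # e.g. 7 days of the week, 4-dim embedding
embedding = nn.Embedding(n_categories, emb_dim)

idx = torch.tensor([4, 0, 4])         # three integer-encoded category values
dense = embedding(idx)                # shape (3, 4): one row per input

# The same result via an explicit one-hot matrix product
one_hot = nn.functional.one_hot(idx, num_classes=n_categories).float()
via_matmul = one_hot @ embedding.weight

print(dense.shape)                          # torch.Size([3, 4])
print(torch.allclose(dense, via_matmul))    # True
```

The rows of `embedding.weight` are ordinary trainable parameters, so they are updated by backpropagation together with the rest of the network.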



A key technique for making the most of deep learning for tabular data is to use embeddings for your categorical variables. This approach allows relationships between categories to be captured.

## Pipeline
A typical pipeline for tabular (structured) data is as follows:
* Categorify
* Fill missing values
* Normalize continuous variables

These steps are necessary in order to get good results with your model. Their implementation is left to the reader as an exercise.
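As a possible starting point, the three steps could be sketched as follows on a made-up toy frame. The column values, the `"missing"` placeholder label, the median fill and the `_na` indicator column are all assumptions made for illustration, not choices made by this notebook. Missing values are filled first here so that categorical codes are never negative:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data (values are made up for illustration)
toy = pd.DataFrame({
    "StoreType": ["a", "b", None, "a"],
    "CompetitionDistance": [1270.0, np.nan, 570.0, 14130.0],
})
cat_cols, cont_cols = ["StoreType"], ["CompetitionDistance"]

# 1. Fill missing values: a placeholder label for categories, and the median
#    plus an indicator column for continuous variables
for col in cat_cols:
    toy[col] = toy[col].fillna("missing")
for col in cont_cols:
    toy[col + "_na"] = toy[col].isna().astype(int)
    toy[col] = toy[col].fillna(toy[col].median())

# 2. Categorify: map each category to an integer code
for col in cat_cols:
    toy[col] = toy[col].astype("category").cat.codes

# 3. Normalize: zero mean and unit variance for continuous variables
for col in cont_cols:
    toy[col] = (toy[col] - toy[col].mean()) / toy[col].std()

print(toy)
```

In a real project the median, mean and standard deviation would normally be computed on the training set only and then reused for the validation and test sets.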

The columns below take only a handful of distinct values, so they represent categorical features. We therefore convert them into categorical variables.

In [7]:
data.DayOfWeek.unique()

array([5, 4, 3, 2, 1, 7, 6])

In [8]:
data.Open.unique()

array([1, 0])

In [9]:
data.Promo.unique()

array([1, 0])

In [10]:
data.StateHoliday.unique()

array(['0', 'a', 'b', 'c'], dtype=object)

In [11]:
data.SchoolHoliday.unique()

array([1, 0])

In [12]:
data.StateHoliday = data.StateHoliday.astype(str)
data.SchoolHoliday = data.SchoolHoliday.astype(str)
data.DayOfWeek = data.DayOfWeek.astype(str)
data.Open  = data.Open.astype(str)
data.Promo = data.Promo.astype(str)

In [13]:
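# StateHoliday was cast to str above, so this comparison with the integer 0
# matches no rows and the label '0' is left unchanged.
data.loc[data['StateHoliday'] == 0, 'StateHoliday'] = 'zero'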

Determine categorical variables in our data and return their names and number of unique categories.

In [14]:
def categorize(X):
    '''
    Determine categorical variables in X and return
    their names and number of unique categories.
    :param X: input DataFrame
    :return: list of tuples
    '''
    cat_sz = [(col, X[col].unique().shape[0]) for col in X.columns
              if X[col].dtype == 'object']

    return cat_sz

In [15]:
cat_sz = categorize(data)
cat_sz

[('DayOfWeek', 7),
 ('Date', 908),
 ('Open', 2),
 ('Promo', 2),
 ('StateHoliday', 4),
 ('SchoolHoliday', 2)]

Determine the embedding dimensions for categorical variables. If embedding dimensions are not provided, the function falls back on a rule of thumb: half the number of categories (rounded up), capped at `max_dim`.

In [16]:
def pick_emb_dim(cat_sz, max_dim=50, emb_dims=None, include_unseen=False):
    '''
    Determine the embedding dimensions for categorical variables.
    If embedding dimensions are not provided, will use a rule of thumb.
    :param cat_sz: list of tuples
    :param max_dim: maximum embedding dimension
    :param emb_dims: array-like of embedding dimensions,
                     same length as cat_sz
    :param include_unseen: optional, add extra category for 'unseen'
    :return: dictionary of categorical variables for Embedder
    '''
    if emb_dims is None:
        emb_sz = {var: (input_dim, min(max_dim, (input_dim + 1) // 2))
                  for var, input_dim in cat_sz}
    else:
        emb_sz = {c[0]: (c[1], emb_dim)
                  for c, emb_dim in zip(cat_sz, emb_dims)
                  }

    if include_unseen:
        emb_sz = {var: (sz[0] + 1, sz[1]) for var, sz in emb_sz.items()}

    return emb_sz


In [17]:
emb_sz = pick_emb_dim(cat_sz)
emb_sz

{'DayOfWeek': (7, 4),
 'Date': (908, 50),
 'Open': (2, 1),
 'Promo': (2, 1),
 'StateHoliday': (4, 2),
 'SchoolHoliday': (2, 1)}

Let's encode categorical variables as integers.

In [18]:
class SafeLabelEncoder(LabelEncoder):
    """An extension of LabelEncoder that will
    not throw an exception for unseen data, but will
    instead return a default value of len(labels)
    Attributes
    ----------
    classes_ : the classes that are encoded
    """

    def transform(self, y):

        check_is_fitted(self, 'classes_')
        y = column_or_1d(y, warn=True)

        unseen = len(self.classes_)

        e = np.array([
                     np.searchsorted(self.classes_, x)
                     if x in self.classes_ else unseen
                     for x in y
                     ])

        if unseen in e:
            self.classes_ = np.array(self.classes_.tolist() + ['unseen'])

        return e
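The idea behind this encoder can be sketched without scikit-learn. The version below is a simplified NumPy illustration, not the class above: `fit_classes` and `encode_with_unseen` are hypothetical helper names. Known labels map to their position in the sorted class array, and any other label maps to the default index `len(classes)`:

```python
import numpy as np

def fit_classes(values):
    """Sorted unique labels, as LabelEncoder stores in classes_."""
    return np.unique(values)

def encode_with_unseen(values, classes):
    """Map known labels to their sorted position and anything else to len(classes)."""
    values = np.asarray(values)
    pos = np.searchsorted(classes, values)
    pos = np.clip(pos, 0, len(classes) - 1)      # keep indices valid for the check below
    known = classes[pos] == values
    return np.where(known, pos, len(classes))

classes = fit_classes(["0", "a", "b", "0", "a"])
codes = encode_with_unseen(["a", "0", "c", "b"], classes)
print(classes)  # ['0' 'a' 'b']
print(codes)    # [1 0 3 2]
```

Because unseen labels receive index `len(classes)`, the embedding table needs one extra row for them, which is what `include_unseen=True` in `pick_emb_dim` reserves.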

In [19]:
def encode_categorical(X,
                       categorical_vars=None,
                       copy=True):
    '''
    Encode categorical variables as integers.
    :param X: input DataFrame
    :param categorical_vars: optional, list of categorical variables
    :param copy: optional, whether to modify a copy
    :return: DataFrame, LabelEncoders
    '''
    df = X.copy() if copy else X
    encoders = {}

    if categorical_vars is None:
        categorical_vars = [col for col in df.columns
                            if df[col].dtype == 'object']

    for var in categorical_vars:
        encoders[var] = SafeLabelEncoder()
        encoders[var].fit(df[var])
        df.loc[:, var] = encoders[var].transform(df.loc[:, var])

    return df, encoders

In [20]:
df, encoders = encode_categorical(data)

In [21]:
df

Unnamed: 0,Store,DayOfWeek,Date,Sales,Customers,Open,Promo,StateHoliday,SchoolHoliday
315,316,4,907,11678,927,1,1,0,1
556,557,4,907,5417,752,1,1,0,1
760,761,4,907,12761,1203,1,1,0,1
777,778,4,907,7341,804,1,1,0,0
797,798,4,907,9268,997,1,1,0,1
...,...,...,...,...,...,...,...,...,...
1015700,721,2,1,6353,760,1,0,0,1
1015722,743,2,1,3106,425,1,0,0,1
1016001,1022,2,1,5584,697,1,0,0,1
1016465,371,1,0,0,0,0,0,1,1


In [22]:
encoders

{'DayOfWeek': SafeLabelEncoder(),
 'Date': SafeLabelEncoder(),
 'Open': SafeLabelEncoder(),
 'Promo': SafeLabelEncoder(),
 'StateHoliday': SafeLabelEncoder(),
 'SchoolHoliday': SafeLabelEncoder()}

List the column names of your categorical features.

In [23]:
categorical_features = [c[0] for c in cat_sz]
categorical_features

['DayOfWeek', 'Date', 'Open', 'Promo', 'StateHoliday', 'SchoolHoliday']

In [24]:
output_feature = "Sales"

The following implementation uses pytorch.

## Dataset

Our dataset separates the `Sales` column from the other features because it is the value we want to predict (the dependent variable).

In [25]:
class TabularDataset(Dataset):
    def __init__(self, data, cat_cols=None, output_col=None):
        """
        Characterizes a Dataset for PyTorch

        Parameters
        ----------

        data: pandas data frame
          The data frame object for the input data. It must
          contain all the continuous, categorical and the
          output columns to be used.

        cat_cols: List of strings
          The names of the categorical columns in the data.
          These columns will be passed through the embedding
          layers in the model. These columns must be
          label encoded beforehand. 

        output_col: string
          The name of the output variable column in the data
          provided.
        """

        self.n = data.shape[0]

        if output_col:
            self.y = data[output_col].astype(np.float32).values.reshape(-1, 1)
        else:
            self.y =  np.zeros((self.n, 1))

        self.cat_cols = cat_cols if cat_cols else []
        self.cont_cols = [col for col in data.columns
                          if col not in self.cat_cols + [output_col]]

        if self.cont_cols:
            self.cont_X = data[self.cont_cols].astype(np.float32).values
        else:
            self.cont_X = np.zeros((self.n, 1))

        if self.cat_cols:
            self.cat_X = data[cat_cols].astype(np.int32).values
        else:
            self.cat_X =  np.zeros((self.n, 1))

    def __len__(self):
        """
        Denotes the total number of samples.
        """
        return self.n

    def __getitem__(self, idx):
        """
        Generates one sample of data.
        """
        return [self.y[idx], self.cont_X[idx], self.cat_X[idx]]

In [26]:
ds = TabularDataset(df, cat_cols=categorical_features, output_col=output_feature)

In [27]:
ds[0]

[array([11678.], dtype=float32),
 array([316., 927.], dtype=float32),
 array([  4, 907,   1,   1,   0,   1], dtype=int32)]
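When such a dataset is wrapped in a `DataLoader`, the three arrays of each sample are stacked into three batched tensors. The sketch below uses a hypothetical `ToyTabular` dataset with random values, purely to show the resulting shapes and dtypes:

```python
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader

class ToyTabular(Dataset):
    """Hypothetical stand-in returning (target, continuous, categorical) per row."""
    def __init__(self, n=10):
        rng = np.random.default_rng(0)
        self.y = rng.random((n, 1)).astype(np.float32)
        self.cont = rng.random((n, 2)).astype(np.float32)
        self.cat = rng.integers(0, 4, size=(n, 6)).astype(np.int32)

    def __len__(self):
        return len(self.y)

    def __getitem__(self, idx):
        return [self.y[idx], self.cont[idx], self.cat[idx]]

loader = DataLoader(ToyTabular(), batch_size=4, shuffle=False)
y, x_cont, x_cat = next(iter(loader))
print(y.shape, x_cont.shape, x_cat.shape)
# torch.Size([4, 1]) torch.Size([4, 2]) torch.Size([4, 6])
print(x_cat.dtype)  # torch.int32
```

Note that the target keeps its trailing dimension of size 1, and that the categorical codes arrive as `int32`, so they need a cast such as `.long()` before being passed to `nn.Embedding`.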

## Model

In [28]:
from torch import nn
class TabularModel(nn.Module):
    "Basic model for tabular data"
    
    def __init__(self, emb_szs, n_cont, out_sz, layers, drops, y_range):
        """
            Parameters
            ----------

            emb_szs: List of two element tuples
              This list will contain a two element tuple for each
              categorical feature. The first element of a tuple will
              denote the number of unique values of the categorical
              feature. The second element will denote the embedding
              dimension to be used for that feature.

            n_cont: Integer
              The number of continuous features in the data.

            layers: List of integers.
              The size of each linear layer. The length will be equal
              to the total number of linear layers in the network.

            out_sz: Integer
              The size of the final output.

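            drops: List of floats
              The dropouts to be used after each linear layer.

            y_range: Sequence of two numbers, or None
              Lower and upper bound of the output. If given, the
              final sigmoid output is rescaled to this range.
        """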
        super().__init__()
        self.embeds = nn.ModuleList([nn.Embedding(ni, nf) for ni,nf in emb_szs])
        self.bn_cont = nn.BatchNorm1d(n_cont)
        n_emb = sum(e.embedding_dim for e in self.embeds)
        self.n_emb,self.n_cont, self.y_range = n_emb,n_cont, y_range
        final_act = None if y_range is None else nn.Sigmoid()
        sizes = [n_emb + n_cont] + layers + [out_sz]
        actns = [nn.ReLU(inplace=True)] * (len(sizes)-2) + [final_act]
        layers = []
        for i,(n_in,n_out,dp,act) in enumerate(zip(sizes[:-1],sizes[1:],[0.]+drops,actns)):
            layers += [nn.BatchNorm1d(n_in), nn.Dropout(dp), nn.Linear(n_in, n_out)]
            # Skip the activation when none is requested (y_range is None)
            if act is not None:
                layers.append(act)
        self.layers = nn.Sequential(*layers)
    
    def forward(self, x_cat, x_cont):
        if self.n_emb != 0:
            x = [e(x_cat[:,i].long()) for i,e in enumerate(self.embeds)]
            x = torch.cat(x, 1)
        if self.n_cont != 0:
            x_cont = self.bn_cont(x_cont)
            x = torch.cat([x, x_cont], 1) if self.n_emb != 0 else x_cont
        x = self.layers(x)
        if self.y_range is not None: x = (self.y_range[1] - self.y_range[0]) * x + self.y_range[0]
        return x.squeeze()

In [29]:
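# y_range is expressed in log space, so it is only meaningful if the model is
# trained on log(Sales); the dataset above holds raw Sales values.
max_log_y = np.log(np.max(df['Sales']))
y_range = torch.tensor([0, max_log_y*1.2]).float().to(torch.device("cuda:1"))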
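The role of `y_range` can be illustrated with plain NumPy. The sketch below rescales a sigmoid to a `(low, high)` interval, as the model's final step does. It assumes that the target is log-transformed, and the use of `log1p` (to tolerate zero sales on closed days) is an assumption made here, not something this notebook does:

```python
import numpy as np

def scaled_sigmoid(x, low, high):
    """Squash x into (low, high) with a sigmoid, as in the model's final step."""
    return (high - low) / (1.0 + np.exp(-x)) + low

sales = np.array([0.0, 5417.0, 11678.0])   # example values (zero = closed store)
log_sales = np.log1p(sales)                # log1p tolerates zero sales; an assumption here
low, high = 0.0, log_sales.max() * 1.2

raw_outputs = np.array([-10.0, 0.0, 10.0])
bounded = scaled_sigmoid(raw_outputs, low, high)
print(bounded.min() > low, bounded.max() < high)  # True True
```

A range built this way only helps when the targets live on the same log scale. Predictions would then be mapped back to sales with `np.expm1`.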

## Train our network

In [30]:
from torch import optim
from torch.nn import functional as F
lr = 1e-3
epochs = 10
bs = 64
device = torch.device("cuda:1")
train_dl = DataLoader(ds, batch_size=bs, shuffle=True, num_workers=1)
model = TabularModel(emb_szs=[e for i,e in emb_sz.items()], n_cont=len(ds.cont_cols), out_sz=1, layers=[1000,500], drops=[0.001,0.01], y_range=y_range).to(device)
opt = optim.SGD(model.parameters(), lr=lr)
loss_func = nn.MSELoss()

This is the topology of our network. As we can see each categorical variable has its own embedding.

In [31]:
model

TabularModel(
  (embeds): ModuleList(
    (0): Embedding(7, 4)
    (1): Embedding(908, 50)
    (2): Embedding(2, 1)
    (3): Embedding(2, 1)
    (4): Embedding(4, 2)
    (5): Embedding(2, 1)
  )
  (bn_cont): BatchNorm1d(2, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (layers): Sequential(
    (0): BatchNorm1d(61, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (1): Dropout(p=0.0, inplace=False)
    (2): Linear(in_features=61, out_features=1000, bias=True)
    (3): ReLU(inplace=True)
    (4): BatchNorm1d(1000, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (5): Dropout(p=0.001, inplace=False)
    (6): Linear(in_features=1000, out_features=500, bias=True)
    (7): ReLU(inplace=True)
    (8): BatchNorm1d(500, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (9): Dropout(p=0.01, inplace=False)
    (10): Linear(in_features=500, out_features=1, bias=True)
    (11): Sigmoid()
  )
)

In [32]:
for epoch in tqdm(range(epochs)):
    for y, x_cont, x_cat in train_dl:

        y = y.to(device).squeeze(-1)  # (bs, 1) -> (bs,) to match the model output
        x_cat = x_cat.to(device)
        x_cont = x_cont.to(device)
        # Forward Pass
        pred = model(x_cat, x_cont)
        loss = loss_func(pred, y)

        # Backward Pass and Optimization
        loss.backward()
        opt.step()
        opt.zero_grad()
    print(loss)  # loss of the last batch in the epoch

tensor(56245004., device='cuda:1', grad_fn=<MseLossBackward>)
tensor(36936156., device='cuda:1', grad_fn=<MseLossBackward>)
tensor(48797868., device='cuda:1', grad_fn=<MseLossBackward>)
tensor(51558412., device='cuda:1', grad_fn=<MseLossBackward>)
tensor(50960380., device='cuda:1', grad_fn=<MseLossBackward>)
tensor(33605044., device='cuda:1', grad_fn=<MseLossBackward>)
tensor(50288356., device='cuda:1', grad_fn=<MseLossBackward>)
tensor(59564748., device='cuda:1', grad_fn=<MseLossBackward>)
tensor(50550820., device='cuda:1', grad_fn=<MseLossBackward>)
tensor(59472228., device='cuda:1', grad_fn=<MseLossBackward>)
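One thing worth checking in a loop of this kind is that predictions and targets have the same shape. The small sketch below, with made-up values, shows what `nn.MSELoss` computes when a prediction of shape `(N,)` is compared with a target of shape `(N, 1)`: the two are broadcast to an `(N, N)` grid, so every prediction is compared with every target.

```python
import warnings
import torch
from torch import nn

loss_func = nn.MSELoss()
pred = torch.tensor([1.0, 2.0, 3.0])            # shape (3,), e.g. after squeeze()
target = torch.tensor([[1.0], [2.0], [3.0]])    # shape (3, 1), as a batch of y

with warnings.catch_warnings():
    warnings.simplefilter("ignore")             # PyTorch warns about the size mismatch
    broadcast_loss = loss_func(pred, target)    # compares every pred with every target

matched_loss = loss_func(pred, target.squeeze(-1))
print(broadcast_loss.item())  # about 1.333, although every prediction is exact
print(matched_loss.item())    # 0.0
```

Since `warnings.filterwarnings('ignore')` is set at the top of this notebook, the size-mismatch warning would not be displayed here.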



# Exercise

* Improve our results by using a proper data preprocessing pipeline (normalization, etc.)
* Plot the learned embeddings
* Apply the model to another problem, such as predicting a categorical feature
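For the embedding plot, one possible approach is to project each embedding matrix onto its first two principal components and draw a scatter plot. The sketch below uses a random matrix as a stand-in for trained weights, and the `project_2d` helper is a hypothetical name:

```python
import numpy as np

def project_2d(weights):
    """Project embedding rows onto their first two principal components."""
    centered = weights - weights.mean(axis=0, keepdims=True)
    # Rows of vt are the principal directions, ordered by explained variance
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T

# Random stand-in for a trained embedding matrix of shape (7, 4);
# with a real model one might use model.embeds[0].weight.detach().cpu().numpy()
rng = np.random.default_rng(0)
weights = rng.normal(size=(7, 4))
coords = project_2d(weights)
print(coords.shape)  # (7, 2)

# Optional plotting, with one labelled point per category:
# import matplotlib.pyplot as plt
# plt.scatter(coords[:, 0], coords[:, 1])
# for i, (x, y) in enumerate(coords):
#     plt.annotate(str(i), (x, y))
# plt.show()
```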

# Resources

* Wide & Deep Learning for Recommender Systems https://arxiv.org/abs/1606.07792
* Entity Embeddings of Categorical Variables https://arxiv.org/abs/1604.06737
* Artificial Neural Networks Applied to Taxi Destination Prediction https://arxiv.org/abs/1508.00021

# References

* https://www.manning.com/books/deep-learning-with-structured-data
* https://towardsdatascience.com/the-right-way-to-use-deep-learning-for-tabular-data-entity-embedding-b5c4aaf1423a
* https://towardsdatascience.com/structured-deep-learning-b8ca4138b848
* https://towardsdatascience.com/deep-learning-structured-data-8d6a278f3088
* https://medium.com/@markryan_69718/deep-learning-on-structured-data-part-1-7f08584b9883