## The `deeptabular` component

In the previous notebook I described the linear model (`Wide`) and the standard text and image classification and regression models (`DeepText` and `DeepImage`) that can be used as the `wide`, `deeptext` and `deepimage` components respectively when building a `WideDeep` model.

In this notebook I will describe the 3 models (or architectures) available in `pytorch-widedeep` that can be used as the `deeptabular` model. Note that the `deeptabular` model alone is what would normally be referred to as Deep Learning for tabular data. As I mentioned in previous notebooks, each component can be used independently. Therefore, if you wanted to use `deeptabular` alone, it is perfectly possible. There are just a couple of simple requirements that will be covered in a later notebook.

The 3 models available in `pytorch-widedeep` as the `deeptabular` are:

1. `TabMlp`
2. `TabResnet`
3. `TabTransformer`

Let's have a closer look at the 3 of them.

## 1. `TabMlp`

`TabMlp` is the simplest architecture and is very similar to the tabular model available in the fantastic fastai library. In fact, the implementation of the dense layers of the MLP is mostly identical to the one in that library.

The figure below illustrates the `TabMlp` architecture:

<img src="../docs/figures/tabmlp_arch.png" width="300" align="center"/>

The dashed-border boxes indicate that these components are optional. For example, we could use `TabMlp` without categorical components, or without continuous components, if we wanted.

In [1]:
import torch
from pytorch_widedeep.models import TabMlp


In [2]:
?TabMlp

Let's have a look at a model and one example:

In [5]:
colnames = ['a', 'b', 'c', 'd', 'e']
X_tab = torch.cat((torch.empty(5, 4).random_(4), torch.rand(5, 1)), axis=1)
embed_input = [(u,i,j) for u,i,j in zip(colnames[:4], [4]*4, [8]*4)]
column_idx = {k:v for v,k in enumerate(colnames)}
tabmlp = TabMlp(mlp_hidden_dims=[8,4], continuous_cols=['e'], column_idx=column_idx, 
                embed_input=embed_input, batchnorm_cont=True)
out = tabmlp(X_tab)

In [6]:
tabmlp

TabMlp(
  (embed_layers): ModuleDict(
    (emb_layer_a): Embedding(5, 8, padding_idx=0)
    (emb_layer_b): Embedding(5, 8, padding_idx=0)
    (emb_layer_c): Embedding(5, 8, padding_idx=0)
    (emb_layer_d): Embedding(5, 8, padding_idx=0)
  )
  (embedding_dropout): Dropout(p=0.1, inplace=False)
  (norm): BatchNorm1d(1, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (tab_mlp): MLP(
    (mlp): Sequential(
      (dense_layer_0): Sequential(
        (0): Dropout(p=0.1, inplace=False)
        (1): Linear(in_features=33, out_features=8, bias=True)
        (2): ReLU(inplace=True)
      )
      (dense_layer_1): Sequential(
        (0): Dropout(p=0.1, inplace=False)
        (1): Linear(in_features=8, out_features=4, bias=True)
        (2): ReLU(inplace=True)
      )
    )
  )
)

Note that the input dimension of the MLP is `33`: `32` from the embeddings and `1` for the continuous feature. Before we move on, it is worth commenting on an aspect that applies to the three models discussed here. The `TabPreprocessor` included in this package gives the user the possibility of standardising the input via `sklearn`'s `StandardScaler`. Alternatively, or in addition to it, it is possible to add a `BatchNorm1d` layer to normalise continuous columns within `TabMlp`. To do so, simply set the `batchnorm_cont` parameter to `True` when defining the model, as indicated in the example above.
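The `33` can be reproduced by hand. The following is a minimal sketch of that arithmetic, assuming (as the printed model suggests) that the MLP input is simply the concatenation of all embedding vectors and the continuous columns. The helper name `mlp_input_dim` is hypothetical and is not part of `pytorch-widedeep`.

```python
colnames = ["a", "b", "c", "d", "e"]
# (column name, number of unique values, embedding dim), as in the example above
embed_input = [(c, 4, 8) for c in colnames[:4]]
continuous_cols = ["e"]


def mlp_input_dim(embed_input, continuous_cols):
    """Hypothetical helper: width of the concatenated [embeddings, continuous] tensor."""
    emb_dim = sum(dim for _, _, dim in embed_input)
    return emb_dim + len(continuous_cols)


input_dim = mlp_input_dim(embed_input, continuous_cols)
print(input_dim)  # 4 * 8 + 1 = 33
```

Dropping the continuous column would leave only the `32` dimensions coming from the embeddings.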

I will insist on the following point here and in the next sections. Note that `TabMlp` (or any of the wide and deep components) does not build the connection with the final neuron(s). This is done by the ``WideDeep`` class, which collects all wide and deep components and connects them to the output neuron(s).

For example:

In [7]:
from pytorch_widedeep.models import WideDeep

In [8]:
wd_model = WideDeep(deeptabular=tabmlp, pred_dim=1)

In [9]:
wd_model

WideDeep(
  (deeptabular): Sequential(
    (0): TabMlp(
      (embed_layers): ModuleDict(
        (emb_layer_a): Embedding(5, 8, padding_idx=0)
        (emb_layer_b): Embedding(5, 8, padding_idx=0)
        (emb_layer_c): Embedding(5, 8, padding_idx=0)
        (emb_layer_d): Embedding(5, 8, padding_idx=0)
      )
      (embedding_dropout): Dropout(p=0.1, inplace=False)
      (norm): BatchNorm1d(1, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (tab_mlp): MLP(
        (mlp): Sequential(
          (dense_layer_0): Sequential(
            (0): Dropout(p=0.1, inplace=False)
            (1): Linear(in_features=33, out_features=8, bias=True)
            (2): ReLU(inplace=True)
          )
          (dense_layer_1): Sequential(
            (0): Dropout(p=0.1, inplace=False)
            (1): Linear(in_features=8, out_features=4, bias=True)
            (2): ReLU(inplace=True)
          )
        )
      )
    )
    (1): Linear(in_features=4, out_features=1, bias=True)
  )
)


Voilà.
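Conceptually, and judging only from the printed model above, the component and a final `Linear` layer end up wrapped in a `Sequential`. Below is a minimal plain-PyTorch sketch of that idea. `StandInComponent` is a hypothetical placeholder for `TabMlp`, not a library class, and this is not the actual `WideDeep` implementation.

```python
import torch
from torch import nn


class StandInComponent(nn.Module):
    """Hypothetical stand-in for a deeptabular component with output width 4."""

    def __init__(self, input_dim=5, output_dim=4):
        super().__init__()
        self.output_dim = output_dim
        self.net = nn.Sequential(nn.Linear(input_dim, output_dim), nn.ReLU())

    def forward(self, X):
        return self.net(X)


component = StandInComponent()
pred_dim = 1
# Attach the prediction head: last hidden width -> pred_dim
model_with_head = nn.Sequential(component, nn.Linear(component.output_dim, pred_dim))

X = torch.rand(5, 5)
preds = model_with_head(X)
print(preds.shape)  # torch.Size([5, 1])
```

The component itself stays unaware of the prediction task; only the last `Linear` layer depends on `pred_dim`.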

## 2. `TabResnet`

`TabResnet` is very similar to `TabMlp`, but the embeddings (or the concatenation of embeddings and continuous features) are passed through a series of Resnet blocks built with dense layers. This is probably the most flexible `deeptabular` component in terms of the many variants one can define via the parameters. Let's have a look at the architecture:

<img src="../docs/figures/tabresnet_arch.png" width="300" align="center"/>

The dashed-border boxes indicate that the component is optional, and the dashed lines indicate the different paths or connections present depending on which components we decide to include. For example, we could choose to concatenate the continuous features, normalised or not via a `BatchNorm1d` layer, with the embeddings and pass the result of that concatenation through the series of Resnet blocks. Alternatively, we might prefer to concatenate the continuous features with the result of passing the embeddings through the Resnet blocks. Another optional component is the MLP before the output neuron(s). If no MLP is present, the output from the Resnet blocks, or the result of concatenating that output with the continuous features (normalised or not), will be connected directly to the output neuron(s).

Each Resnet block is composed of the following operations:
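The two paths differ only in where the continuous features enter. The sketch below tracks tensor widths and nothing else. It assumes the behaviour described in the paragraph above, and `resnet_and_next_input_dims` is a hypothetical helper, not a library function.

```python
def resnet_and_next_input_dims(embed_dims, n_cont, block_out_dim, concat_cont_first):
    """Return (width entering the Resnet blocks, width passed on to the MLP or output)."""
    emb_dim = sum(embed_dims)
    if concat_cont_first:
        # continuous features are concatenated with the embeddings before the blocks
        return emb_dim + n_cont, block_out_dim
    # continuous features are concatenated with the output of the blocks
    return emb_dim, block_out_dim + n_cont


# four categorical columns embedded in 8 dims each, one continuous column,
# and Resnet blocks whose last layer has 16 units
dims_first = resnet_and_next_input_dims([8] * 4, 1, 16, concat_cont_first=True)
dims_after = resnet_and_next_input_dims([8] * 4, 1, 16, concat_cont_first=False)
print(dims_first)  # (33, 16)
print(dims_after)  # (32, 17)
```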

<img src="../docs/figures/resnet_block.png" width="350" align="center"/>

For more details see [`pytorch_widedeep/models/tab_resnet.BasicBlock`](https://github.com/jrzaurin/pytorch-widedeep/blob/master/pytorch_widedeep/models/tab_resnet.py). 
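For readers who prefer code to diagrams, here is a minimal sketch of a dense residual block in plain PyTorch. It is an assumption-laden illustration, not the library's `BasicBlock`: the class name `DenseResidualBlock`, the `resize` projection used when the input and output widths differ, and the exact ordering of operations are my own choices.

```python
import torch
from torch import nn


class DenseResidualBlock(nn.Module):
    """Illustrative dense residual block (hypothetical, not the library code)."""

    def __init__(self, inp, out, dropout=0.1):
        super().__init__()
        self.lin1 = nn.Linear(inp, out)
        self.bn1 = nn.BatchNorm1d(out)
        self.leaky_relu = nn.LeakyReLU()
        self.dp = nn.Dropout(dropout)
        self.lin2 = nn.Linear(out, out)
        self.bn2 = nn.BatchNorm1d(out)
        # Assumption: project the skip connection when the widths differ
        self.resize = None
        if inp != out:
            self.resize = nn.Sequential(nn.Linear(inp, out), nn.BatchNorm1d(out))

    def forward(self, x):
        identity = x if self.resize is None else self.resize(x)
        out = self.leaky_relu(self.bn1(self.lin1(x)))
        out = self.dp(out)
        out = self.bn2(self.lin2(out))
        out = out + identity  # the residual (skip) connection
        return self.leaky_relu(out)


same_width = DenseResidualBlock(16, 16)
changes_width = DenseResidualBlock(32, 16)
y_same = same_width(torch.rand(5, 16))
y_changed = changes_width(torch.rand(5, 32))
print(y_same.shape, y_changed.shape)
```

The skip connection is what distinguishes this from two stacked dense layers: the block only has to learn a correction to its input.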

Let's have a look at an example now:

In [10]:
from pytorch_widedeep.models import TabResnet

In [11]:
?TabResnet

In [12]:
X_tab = torch.cat((torch.empty(5, 4).random_(4), torch.rand(5, 1)), axis=1)
colnames = ['a', 'b', 'c', 'd', 'e']
embed_input = [(u,i,j) for u,i,j in zip(colnames[:4], [4]*4, [8]*4)]
column_idx = {k:v for v,k in enumerate(colnames)}
tabresnet = TabResnet(blocks_dims=[16,16,16], 
                  column_idx=column_idx, 
                  embed_input=embed_input,
                  continuous_cols = ['e'],
                  batchnorm_cont = True,
                  concat_cont_first = False, 
                  mlp_hidden_dims = [16, 4],
                  mlp_dropout = 0.5)
out = tabresnet(X_tab)

In [13]:
tabresnet

TabResnet(
  (embed_layers): ModuleDict(
    (emb_layer_a): Embedding(5, 8, padding_idx=0)
    (emb_layer_b): Embedding(5, 8, padding_idx=0)
    (emb_layer_c): Embedding(5, 8, padding_idx=0)
    (emb_layer_d): Embedding(5, 8, padding_idx=0)
  )
  (embedding_dropout): Dropout(p=0.1, inplace=False)
  (norm): BatchNorm1d(1, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (tab_resnet_blks): DenseResnet(
    (dense_resnet): Sequential(
      (lin1): Linear(in_features=32, out_features=16, bias=True)
      (bn1): BatchNorm1d(16, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (block_0): BasicBlock(
        (lin1): Linear(in_features=16, out_features=16, bias=True)
        (bn1): BatchNorm1d(16, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (leaky_relu): LeakyReLU(negative_slope=0.01, inplace=True)
        (dp): Dropout(p=0.1, inplace=False)
        (lin2): Linear(in_features=16, out_features=16, bias=True)
        (bn2): Batch

As we can see, first the embeddings are concatenated (resulting in a tensor of dim ($*$, 32)) and projected (or resized, which happens in `lin1` and `bn1`) to the input dimension of the Resnet blocks (16). Then we have the two Resnet blocks defined by the sequence `[INP1 (16) -> OUT1 == INP2 (16) -> OUT2 (16)]`. Finally, since `concat_cont_first = False`, the output from the Resnet blocks is concatenated with the continuous features and passed to the MLP.

As I mentioned earlier, note that `TabResnet` does not build the connection to the output neuron(s). This is done by the ``WideDeep`` class, which collects all wide and deep components and connects them to the output neuron(s).

## 3. `TabTransformer`

Details on this architecture can be found in [TabTransformer: Tabular Data Modeling Using Contextual Embeddings](https://arxiv.org/pdf/2012.06678.pdf). Also, there are so many variants and details that I thought it deserves its own post. Therefore, if you want to dive properly into the use of the Transformer for tabular data, I recommend reading the paper and the post (probably in that order).

In general terms, `TabTransformer` takes the embeddings from the categorical columns and passes them through a Transformer encoder. The result is concatenated with the normalised continuous features and then passed through an MLP. Let's have a look:

<img src="../docs/figures/tabtransformer_arch.png" width="300" align="center"/>


The dashed-border boxes indicate that the component is optional. In terms of the Transformer block, I am sure that at this stage the reader has seen every possible diagram of the Transformer, its multi-head attention etc., so I thought about drawing something that more closely resembles the actual execution/code for each block.

<img src="../docs/figures/transformer_block.png" width="600" align="center"/>

Note that this implementation assumes that the so-called `inner-dim` (aka the projection dimension) is the same as the `dimension of the model` or, in this case, the embedding dimension. Relaxing this assumption is relatively easy and programmatically would involve including one more parameter in the `TabTransformer` class. For now, and consistent with other Transformer implementations, I will assume `inner-dim = dimension of the model`. Also, and again consistent with other implementations, I assume that the Keys, Queries and Values are of the same `dim`.

Enough writing, let's have a look at the code:
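The two assumptions can be made concrete with a small multi-head self-attention sketch in plain PyTorch. This is an illustration under stated assumptions, not the library's attention class: the name `TinySelfAttention`, the choice of 8 heads and the use of a single input projection of width `3 * embed_dim` for Queries, Keys and Values are mine.

```python
import torch
import torch.nn.functional as F
from torch import nn


class TinySelfAttention(nn.Module):
    """Illustrative multi-head self-attention with inner-dim == embed_dim."""

    def __init__(self, embed_dim=32, n_heads=8, dropout=0.1):
        super().__init__()
        assert embed_dim % n_heads == 0, "embed_dim must be divisible by n_heads"
        self.n_heads = n_heads
        self.head_dim = embed_dim // n_heads
        # Queries, Keys and Values share the same dim, so one projection yields all three
        self.inp_proj = nn.Linear(embed_dim, embed_dim * 3)
        self.out_proj = nn.Linear(embed_dim, embed_dim)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        b, s, e = x.shape  # (batch, number of categorical columns, embed_dim)
        q, k, v = self.inp_proj(x).chunk(3, dim=-1)

        def split_heads(t):
            # (b, s, e) -> (b, n_heads, s, head_dim)
            return t.reshape(b, s, self.n_heads, self.head_dim).transpose(1, 2)

        q, k, v = split_heads(q), split_heads(k), split_heads(v)
        scores = q @ k.transpose(-2, -1) / self.head_dim ** 0.5
        weights = self.dropout(F.softmax(scores, dim=-1))
        out = (weights @ v).transpose(1, 2).reshape(b, s, e)
        return self.out_proj(out)


attn = TinySelfAttention()
x_emb = torch.rand(5, 4, 32)  # 5 rows, 4 embedded categorical columns, 32 dims each
attn_out = attn(x_emb)
print(attn_out.shape)  # torch.Size([5, 4, 32])
```

Because the inner dimension equals the model dimension, the output has the same shape as the input and the blocks can be stacked without any extra resizing.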

In [14]:
from pytorch_widedeep.models import TabTransformer

In [None]:
?TabTransformer

In [17]:
X_tab = torch.cat((torch.empty(5, 4).random_(4), torch.rand(5, 1)), axis=1)
colnames = ['a', 'b', 'c', 'd', 'e']
embed_input = [(u,i) for u,i in zip(colnames[:4], [4]*4)]
continuous_cols = ['e']
column_idx = {k:v for v,k in enumerate(colnames)}
tab_transformer = TabTransformer(column_idx=column_idx, embed_input=embed_input, continuous_cols=continuous_cols)
out = tab_transformer(X_tab) 

In [18]:
tab_transformer

TabTransformer(
  (embed_layers): ModuleDict(
    (emb_layer_a): Embedding(5, 32, padding_idx=0)
    (emb_layer_b): Embedding(5, 32, padding_idx=0)
    (emb_layer_c): Embedding(5, 32, padding_idx=0)
    (emb_layer_d): Embedding(5, 32, padding_idx=0)
  )
  (embedding_dropout): Dropout(p=0.1, inplace=False)
  (tab_transformer_blks): Sequential(
    (block0): TransformerEncoder(
      (self_attn): MultiHeadedAttention(
        (dropout): Dropout(p=0.1, inplace=False)
        (inp_proj): Linear(in_features=32, out_features=96, bias=True)
        (out_proj): Linear(in_features=32, out_features=32, bias=True)
      )
      (feed_forward): PositionwiseFF(
        (w_1): Linear(in_features=32, out_features=128, bias=True)
        (w_2): Linear(in_features=128, out_features=32, bias=True)
        (dropout): Dropout(p=0.1, inplace=False)
        (activation): GELU()
      )
      (attn_addnorm): AddNorm(
        (dropout): Dropout(p=0.1, inplace=False)
        (ln): LayerNorm((32,), eps=1e-05, e

Note that I have used the parameters that are suggested in the paper, or those used in the AutoGluon implementation.

Finally, and as I mentioned earlier, note that `TabTransformer` does not build the connection to the output neuron(s). This is done by the ``WideDeep`` class, which collects all wide and deep components and connects them to the output neuron(s).