## The `deeptabular` component

In the previous notebook I described the linear model (`Wide`) and the standard text and image classification and regression models (`DeepText` and `DeepImage`) that can be used as the `wide`, `deeptext` and `deepimage` components respectively when building a `WideDeep` model. 

In this notebook I will describe the different models (or architectures) available in `pytorch-widedeep` that can be used as the `deeptabular` model. Note that the `deeptabular` model alone is what would normally be referred to as Deep Learning for tabular data. As I mentioned in previous notebooks, each component can be used independently. Therefore, it is perfectly possible to use `deeptabular` alone. There are just a couple of simple requirements that will be covered in a later notebook.

The models available in `pytorch-widedeep` as the `deeptabular` component are:

1. `TabMlp`
2. `TabResnet`
3. `Tabnet`
4. `TabTransformer`
5. `FT-Transformer` (which is a simple variation of the `TabTransformer`)
6. `SAINT`

Let's have a close look at all six of them.

## 1. `TabMlp`

`TabMlp` is the simplest architecture and is very similar to the tabular model available in the fantastic fastai library. In fact, the implementation of the dense layers of the MLP is mostly identical to the one in that library.

The figure below illustrates the `TabMlp` architecture:

<img src="../docs/figures/tabmlp_arch.png" width="300" align="center"/>

The dashed-border boxes indicate that these components are optional. For example, we could use `TabMlp` without categorical components, or without continuous components, if we wanted.

In [1]:
import torch
from pytorch_widedeep.models import TabMlp

In [2]:
?TabMlp

Let's have a look at a model through one example:

In [3]:
colnames = ['a', 'b', 'c', 'd', 'e']
X_tab = torch.cat((torch.empty(5, 4).random_(4), torch.rand(5, 1)), axis=1)
embed_input = [(u,i,j) for u,i,j in zip(colnames[:4], [4]*4, [8]*4)]
column_idx = {k:v for v,k in enumerate(colnames)}
tabmlp = TabMlp(mlp_hidden_dims=[8,4], continuous_cols=['e'], column_idx=column_idx, 
                embed_input=embed_input, cont_norm_layer="batchnorm")
out = tabmlp(X_tab)

In [4]:
tabmlp

TabMlp(
  (embed_layers): ModuleDict(
    (emb_layer_a): Embedding(5, 8, padding_idx=0)
    (emb_layer_b): Embedding(5, 8, padding_idx=0)
    (emb_layer_c): Embedding(5, 8, padding_idx=0)
    (emb_layer_d): Embedding(5, 8, padding_idx=0)
  )
  (embedding_dropout): Dropout(p=0.1, inplace=False)
  (cont_norm): BatchNorm1d(1, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (tab_mlp): MLP(
    (mlp): Sequential(
      (dense_layer_0): Sequential(
        (0): Dropout(p=0.1, inplace=False)
        (1): Linear(in_features=33, out_features=8, bias=True)
        (2): ReLU(inplace=True)
      )
      (dense_layer_1): Sequential(
        (0): Dropout(p=0.1, inplace=False)
        (1): Linear(in_features=8, out_features=4, bias=True)
        (2): ReLU(inplace=True)
      )
    )
  )
)

Note that the input dimension of the MLP is `33`: `32` from the embeddings (four categorical columns with an embedding dimension of `8` each) and `1` for the continuous feature. Before we move on, it is worth commenting on an aspect that applies to all models discussed here. The `TabPreprocessor` included in this package gives the user the possibility of standardising the input via `sklearn`'s `StandardScaler`. Alternatively, or in addition, it is possible to add a continuous normalization layer (`BatchNorm1d` or `LayerNorm`). To do so, simply set the `cont_norm_layer` parameter as indicated in the example above. See also the docs.
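The input dimension quoted above can be reproduced by hand. The following is a minimal sketch in plain Python, assuming the same `(column name, number of unique values, embedding dim)` tuples used in the example; the variable names `embed_dim_total` and `mlp_input_dim` are mine and are not part of the library.

```python
# Same toy setup as in the example above: four categorical columns
# embedded with dim 8, plus one continuous column.
colnames = ["a", "b", "c", "d", "e"]
embed_input = [(name, 4, 8) for name in colnames[:4]]
continuous_cols = ["e"]

# Each categorical column contributes its embedding dim ...
embed_dim_total = sum(dim for _, _, dim in embed_input)
# ... and each continuous column contributes a single feature.
mlp_input_dim = embed_dim_total + len(continuous_cols)

print(embed_dim_total, mlp_input_dim)  # 32 33
```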

I will repeat the following point in this and the next sections: `TabMlp` (or any of the wide and deep components) does not build the connection with the final neuron(s). This is done by the ``WideDeep`` class, which collects all wide and deep components and connects them to the output neuron(s).

For example:

In [5]:
from pytorch_widedeep.models import WideDeep

In [6]:
wd_model = WideDeep(deeptabular=tabmlp, pred_dim=1)

In [7]:
wd_model

WideDeep(
  (deeptabular): Sequential(
    (0): TabMlp(
      (embed_layers): ModuleDict(
        (emb_layer_a): Embedding(5, 8, padding_idx=0)
        (emb_layer_b): Embedding(5, 8, padding_idx=0)
        (emb_layer_c): Embedding(5, 8, padding_idx=0)
        (emb_layer_d): Embedding(5, 8, padding_idx=0)
      )
      (embedding_dropout): Dropout(p=0.1, inplace=False)
      (cont_norm): BatchNorm1d(1, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (tab_mlp): MLP(
        (mlp): Sequential(
          (dense_layer_0): Sequential(
            (0): Dropout(p=0.1, inplace=False)
            (1): Linear(in_features=33, out_features=8, bias=True)
            (2): ReLU(inplace=True)
          )
          (dense_layer_1): Sequential(
            (0): Dropout(p=0.1, inplace=False)
            (1): Linear(in_features=8, out_features=4, bias=True)
            (2): ReLU(inplace=True)
          )
        )
      )
    )
    (1): Linear(in_features=4, out_features=1, bias=True)
  )
)

Voilà!

## 2. `TabResnet`

`TabResnet` is very similar to `TabMlp`, but the embeddings (or the concatenation of embeddings and continuous features) are passed through a series of Resnet blocks built with dense layers. This is probably the most flexible `deeptabular` component in terms of the many variants one can define via the parameters. Let's have a look at the architecture:

<img src="../docs/figures/tabresnet_arch.png" width="300" align="center"/>

The dashed-border boxes indicate that the component is optional, and the dashed lines indicate the different paths or connections present depending on which components we decide to include. For example, we could choose to concatenate the continuous features, normalized or not via a `BatchNorm1d` layer, with the embeddings and pass the result of that concatenation through the series of Resnet blocks. Alternatively, we might prefer to concatenate the continuous features with the result of passing the embeddings through the Resnet blocks. Another optional component is the MLP before the output neuron(s). If no MLP is present, the output from the Resnet blocks, or the result of concatenating that output with the continuous features (normalised or not), will be connected directly to the output neuron(s).

Each Resnet block comprises the following operations:

<img src="../docs/figures/resnet_block.png" width="350" align="center"/>

For more details see [`pytorch_widedeep/models/tab_resnet.BasicBlock`](https://github.com/jrzaurin/pytorch-widedeep/blob/master/pytorch_widedeep/models/tab_resnet.py). 
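For readers without the figure at hand, a dense residual block of this kind can be sketched in plain PyTorch. This is only an illustration under my own assumptions: the class name `DenseResBlock` is hypothetical, the order of operations follows the layer names visible in the printouts in this notebook (`lin1`, `bn1`, `leaky_relu`, `dp`, `lin2`, `bn2`), and the linked source remains the reference for what the library actually does.

```python
import torch
from torch import nn


class DenseResBlock(nn.Module):
    """Hypothetical dense residual block (illustrative, not the library's class)."""

    def __init__(self, inp_dim: int, out_dim: int, dropout: float = 0.1):
        super().__init__()
        self.lin1 = nn.Linear(inp_dim, out_dim)
        self.bn1 = nn.BatchNorm1d(out_dim)
        self.act = nn.LeakyReLU()
        self.dp = nn.Dropout(dropout)
        self.lin2 = nn.Linear(out_dim, out_dim)
        self.bn2 = nn.BatchNorm1d(out_dim)
        # Project the skip connection only when the dimensions differ.
        self.resize = (
            nn.Identity() if inp_dim == out_dim else nn.Linear(inp_dim, out_dim)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        identity = self.resize(x)
        out = self.dp(self.act(self.bn1(self.lin1(x))))
        out = self.bn2(self.lin2(out))
        # Residual (skip) connection followed by the final non-linearity.
        return self.act(out + identity)


block = DenseResBlock(16, 16)
out = block(torch.rand(5, 16))
print(out.shape)  # torch.Size([5, 16])
```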

Let's have a look at an example now:

In [8]:
from pytorch_widedeep.models import TabResnet

In [9]:
?TabResnet

In [10]:
X_tab = torch.cat((torch.empty(5, 4).random_(4), torch.rand(5, 1)), axis=1)
colnames = ['a', 'b', 'c', 'd', 'e']
embed_input = [(u,i,j) for u,i,j in zip(colnames[:4], [4]*4, [8]*4)]
column_idx = {k:v for v,k in enumerate(colnames)}
tabresnet = TabResnet(blocks_dims=[16,16,16], 
                  column_idx=column_idx, 
                  embed_input=embed_input,
                  continuous_cols = ['e'],
                  cont_norm_layer = "layernorm",
                  concat_cont_first = False, 
                  mlp_hidden_dims = [16, 4],
                  mlp_dropout = 0.5)
out = tabresnet(X_tab)

In [11]:
tabresnet

TabResnet(
  (embed_layers): ModuleDict(
    (emb_layer_a): Embedding(5, 8, padding_idx=0)
    (emb_layer_b): Embedding(5, 8, padding_idx=0)
    (emb_layer_c): Embedding(5, 8, padding_idx=0)
    (emb_layer_d): Embedding(5, 8, padding_idx=0)
  )
  (embedding_dropout): Dropout(p=0.1, inplace=False)
  (cont_norm): LayerNorm((1,), eps=1e-05, elementwise_affine=True)
  (tab_resnet_blks): DenseResnet(
    (dense_resnet): Sequential(
      (lin1): Linear(in_features=32, out_features=16, bias=True)
      (bn1): BatchNorm1d(16, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (block_0): BasicBlock(
        (lin1): Linear(in_features=16, out_features=16, bias=True)
        (bn1): BatchNorm1d(16, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (leaky_relu): LeakyReLU(negative_slope=0.01, inplace=True)
        (dp): Dropout(p=0.1, inplace=False)
        (lin2): Linear(in_features=16, out_features=16, bias=True)
        (bn2): BatchNorm1d(16, eps=1e-05, 

As we can see, first the embeddings are concatenated (resulting in a tensor of dim ($*$, 32)) and projected (or resized, which happens in `lin1` and `bn1`) to the input dimension of the first Resnet block (16). Then we have the two Resnet blocks defined by the sequence `[INP1 (16) -> OUT1 == INP2 (16) -> OUT2 (16)]`. Finally, since `concat_cont_first = False`, the output from the Resnet blocks is concatenated with the continuous feature and passed to the MLP.
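The dimensions in this walk-through can be traced with a few lines of plain Python. This is a sketch under the assumption, taken from the description of the architecture above, that with `concat_cont_first = False` the continuous feature is appended to the output of the last Resnet block before the MLP; the helper names are mine.

```python
embed_dims = [8, 8, 8, 8]      # four categorical columns, embedding dim 8 each
n_continuous = 1               # column 'e'
blocks_dims = [16, 16, 16]
mlp_hidden_dims = [16, 4]

# Concatenated embeddings: this is the input to the `lin1` projection.
embeddings_out = sum(embed_dims)

# Consecutive pairs of `blocks_dims` give the (input, output) dims of each block.
block_io = list(zip(blocks_dims[:-1], blocks_dims[1:]))

# Assumption: continuous features join after the blocks (concat_cont_first=False).
mlp_input_dim = blocks_dims[-1] + n_continuous

print(embeddings_out)        # 32
print(block_io)              # [(16, 16), (16, 16)]
print(mlp_input_dim)         # 17
print(mlp_hidden_dims[-1])   # 4
```

The last value, `4`, is the dimension that `WideDeep` would then connect to the output neuron(s).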

As I mentioned earlier, note that `TabResnet` does not build the connection to the output neuron(s). This is done by the ``WideDeep`` class, which collects all wide and deep components and connects them to the output neuron(s).

##  3. `Tabnet`

Details on this architecture can be found in [TabNet: Attentive Interpretable Tabular Learning](https://arxiv.org/pdf/1908.07442.pdf). This is not a simple algorithm. Therefore, I strongly recommend reading the paper.

In general terms, `Tabnet` takes the embeddings from the categorical columns and the continuous columns (standardised or not) and passes them through a series of `Steps`. Each `Step` involves a so-called Attentive Transformer and a Feature Transformer, combined with masking and a `ReLU` non-linearity. This is shown in the figure below, taken directly from the paper. The part of the diagram drawn as $[FC \rightarrow out]$ corresponds to what I draw in other figures as $[MLP \rightarrow output \space neuron]$.

<img src="../docs/figures/tabnet_arch_1.png" width="600" align="center"/>
<img src="../docs/figures/tabnet_arch_2.png" width="600" align="center"/>

Note that in the paper the authors use an encoder-decoder architecture and apply a routine that involves unsupervised pre-training plus supervised fine-tuning. However, the authors found that unsupervised pre-training is useful when the data size is very small and/or there is a large number of unlabeled observations. This result is consistent with those obtained by subsequent papers using the same approach.

`pytorch-widedeep` was conceived as a library to use wide and deep models with tabular data, images and text for supervised learning (regression or classification). Therefore, I decided to implement only the encoder architecture for this model, as well as for the transformer-based models.

If you want more details on each component I recommend reading the paper and having a look at the implementation by the team at [dreamquark-ai](https://github.com/dreamquark-ai/tabnet). In fact, and let me make this clear, **the Tabnet implementation in this package is mostly a copy and paste of the one in dreamquark-ai's library**. I have simply adapted it to work with wide and deep models and added a few extras, such as being able to add dropout in the GLU blocks or to not use Ghost batch normalization.

Enough writing, let's have a look at the code:

In [12]:
from pytorch_widedeep.models import TabNet

In [13]:
?TabNet

In [14]:

X_tab = torch.cat((torch.empty(5, 4).random_(4), torch.rand(5, 1)), axis=1)
colnames = ['a', 'b', 'c', 'd', 'e']
embed_input = [(u,i,j) for u,i,j in zip(colnames[:4], [4]*4, [8]*4)]
column_idx = {k:v for v,k in enumerate(colnames)}
tabnet = TabNet(
        column_idx=column_idx,
        embed_input=embed_input,
        continuous_cols=['e'],
        cont_norm_layer = "batchnorm",
        ghost_bn = False,
    )
out = tabnet(X_tab)

In [15]:
tabnet

TabNet(
  (embed_and_cont): EmbeddingsAndContinuous(
    (embed_layers): ModuleDict(
      (emb_layer_a): Embedding(5, 8, padding_idx=0)
      (emb_layer_b): Embedding(5, 8, padding_idx=0)
      (emb_layer_c): Embedding(5, 8, padding_idx=0)
      (emb_layer_d): Embedding(5, 8, padding_idx=0)
    )
    (embedding_dropout): Dropout(p=0.0, inplace=False)
    (cont_norm): Identity()
  )
  (tabnet_encoder): TabNetEncoder(
    (initial_bn): BatchNorm1d(33, eps=1e-05, momentum=0.01, affine=True, track_running_stats=True)
    (initial_splitter): FeatTransformer(
      (shared): GLU_Block(
        (glu_layers): ModuleList(
          (0): GLU_Layer(
            (fc): Linear(in_features=33, out_features=32, bias=False)
            (bn): BatchNorm1d(32, eps=1e-05, momentum=0.02, affine=True, track_running_stats=True)
            (dp): Dropout(p=0.0, inplace=False)
          )
          (1): GLU_Layer(
            (fc): Linear(in_features=16, out_features=32, bias=False)
            (bn): BatchNorm
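
One detail of the printout is worth spelling out: the first `GLU_Layer` maps 33 features to 32, while the next one expects 16 inputs. This is consistent with a gated linear unit, which splits the output of the linear layer into two halves and uses one half to gate the other. The following is a minimal sketch in plain PyTorch that leaves out the batch normalization and dropout shown in the printout; the variable names are mine and this is not the library's code.

```python
import torch
from torch import nn
import torch.nn.functional as F

fc = nn.Linear(33, 32, bias=False)   # same shapes as the first `fc` above
x = torch.rand(5, 33)

h = fc(x)                            # (5, 32)
a, b = h.chunk(2, dim=-1)            # two halves of 16 features each
gated = a * torch.sigmoid(b)         # (5, 16): one half gates the other

# PyTorch's built-in GLU computes the same thing.
same_as_builtin = torch.allclose(gated, F.glu(h, dim=-1))
print(h.shape, gated.shape, same_as_builtin)
```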

##  4 and 5. `TabTransformer` and the `Feature-Tokenizer Transformer`

Details on the `TabTransformer` can be found in [TabTransformer: Tabular Data Modeling Using Contextual Embeddings](https://arxiv.org/pdf/2012.06678.pdf). The `FT-Transformer` is a variant introduced in the following two papers: [SAINT: Improved Neural Networks for Tabular Data via Row Attention and Contrastive Pre-Training](https://arxiv.org/pdf/2106.01342.pdf) and [Revisiting Deep Learning Models for Tabular Data](https://arxiv.org/pdf/2106.11959.pdf). The name itself (`FT-Transformer`) was first used in the latter, but the variant (which I will explain in a second) was already introduced in the `SAINT` paper.

In general terms, the `TabTransformer` takes the embeddings from the categorical columns and passes them through a Transformer encoder. The output is concatenated with the normalised continuous features and then passed through an MLP. Let's have a look:

<img src="../docs/figures/tabtransformer_arch.png" width="300" align="center"/>


The dashed-border boxes indicate that the component is optional. In terms of the Transformer block, I am sure that at this stage the reader has seen every possible diagram of The Transformer, its multi-head attention, etc., so I thought about drawing something that more closely resembles the actual execution/code for each block.

<img src="../docs/figures/transformer_block.png" width="600" align="center"/>

Note that this implementation assumes that the so-called `inner-dim` (aka the projection dimension) is the same as the `dimension of the model` or, in this case, the embedding dimension. Relaxing this assumption is relatively easy and programmatically would involve adding one more parameter to the `TabTransformer` class. For now, and consistent with other Transformer implementations, I will assume `inner-dim = dimension of the model`. Also, and again consistent with other implementations, I assume that the Keys, Queries and Values are of the same `dim`.
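To make these two assumptions concrete, here is a sketch of multi-head self-attention in plain PyTorch where Queries, Keys and Values all have the model dimension and are produced by a single fused projection. The shapes mirror the `inp_proj` (32 to 96) and `out_proj` (32 to 32) layers that appear in the printouts in this notebook. The number of heads (8) and all variable names are my own choices for illustration, and dropout is omitted; this is not the library's code.

```python
import torch
from torch import nn

batch, n_tokens, d_model, n_heads = 5, 4, 32, 8
head_dim = d_model // n_heads

# inner-dim == d_model, so Q, K and V together need 3 * d_model outputs.
inp_proj = nn.Linear(d_model, 3 * d_model)
out_proj = nn.Linear(d_model, d_model)

x = torch.rand(batch, n_tokens, d_model)
q, k, v = inp_proj(x).chunk(3, dim=-1)          # each (5, 4, 32)


def split_heads(t: torch.Tensor) -> torch.Tensor:
    # (batch, tokens, d_model) -> (batch, heads, tokens, head_dim)
    return t.reshape(batch, n_tokens, n_heads, head_dim).transpose(1, 2)


q, k, v = split_heads(q), split_heads(k), split_heads(v)

# Scaled dot-product attention, computed independently for each head.
scores = (q @ k.transpose(-2, -1)) / head_dim ** 0.5    # (5, 8, 4, 4)
attn = scores.softmax(dim=-1)

# Merge the heads back and apply the output projection.
ctx = (attn @ v).transpose(1, 2).reshape(batch, n_tokens, d_model)
out = out_proj(ctx)
print(out.shape)  # torch.Size([5, 4, 32])
```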

The architecture of the `FT-Transformer` is identical to that of the `TabTransformer`, with the exception that the continuous cols are each passed through a 1-layer MLP, with or without an activation function (referred to in the figure below as `Cont Embeddings`), before being concatenated with the categorical embeddings.
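One way to read "each continuous col is passed through a 1-layer MLP" is that every continuous column gets its own weight and bias vector that maps its scalar value to a vector of the embedding dimension, so that it can be treated as one more token alongside the categorical embeddings. The sketch below follows that reading; the class name `ContEmbeddingsSketch` is hypothetical and this is not the library's implementation.

```python
import torch
from torch import nn


class ContEmbeddingsSketch(nn.Module):
    """Hypothetical per-column 1-layer MLP for continuous features."""

    def __init__(self, n_cont: int, embed_dim: int, activation=None):
        super().__init__()
        # One weight vector and one bias vector per continuous column.
        self.weight = nn.Parameter(torch.randn(n_cont, embed_dim))
        self.bias = nn.Parameter(torch.zeros(n_cont, embed_dim))
        self.act = activation if activation is not None else nn.Identity()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, n_cont) -> (batch, n_cont, embed_dim)
        return self.act(x.unsqueeze(-1) * self.weight + self.bias)


cat_embeddings = torch.rand(5, 4, 32)    # stand-in for 4 embedded categorical cols
x_cont = torch.rand(5, 1)                # one continuous col
cont_embeddings = ContEmbeddingsSketch(1, 32, nn.ReLU())(x_cont)

# Continuous "tokens" sit next to the categorical ones.
tokens = torch.cat([cat_embeddings, cont_embeddings], dim=1)
print(cont_embeddings.shape, tokens.shape)
```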

<img src="../docs/figures/ft_transformer_arch.png" width="300" align="center"/>

Using the `FT-Transformer` with `pytorch-widedeep` simply requires setting the param `embed_continuous` to `True`. In addition, I have also added the possibility of pooling all outputs from the transformer blocks using the `[CLS]` token. Otherwise, all the outputs from the transformer blocks will be concatenated. Look at some of the other example notebooks for more details.
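The difference between the two pooling options is easiest to see in terms of shapes. The sketch below uses a random tensor as a stand-in for the output of the transformer blocks and assumes, purely for illustration, that the `[CLS]` token sits at position 0 of the sequence; the variable names are mine.

```python
import torch

batch, n_features, d_model = 5, 5, 32

# Option 1: a [CLS] token is prepended, so the blocks return n_features + 1 tokens.
blocks_out_with_cls = torch.rand(batch, n_features + 1, d_model)
pooled = blocks_out_with_cls[:, 0, :]          # keep only the [CLS] position

# Option 2: no [CLS] token, all token outputs are concatenated (flattened).
blocks_out = torch.rand(batch, n_features, d_model)
concatenated = blocks_out.flatten(start_dim=1)

print(pooled.shape)        # torch.Size([5, 32])
print(concatenated.shape)  # torch.Size([5, 160])
```

With the `[CLS]` token, the size of the representation passed downstream does not grow with the number of columns, whereas the concatenated output grows linearly with it.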

In [16]:
from pytorch_widedeep.models import TabTransformer

In [17]:
?TabTransformer

In [18]:
X_tab = torch.cat((torch.empty(5, 4).random_(4), torch.rand(5, 1)), axis=1)
colnames = ['a', 'b', 'c', 'd', 'e']
embed_input = [(u,i) for u,i in zip(colnames[:4], [4]*4)]
continuous_cols = ['e']
column_idx = {k:v for v,k in enumerate(colnames)}
tab_transformer = TabTransformer(column_idx=column_idx, embed_input=embed_input, continuous_cols=continuous_cols)
out = tab_transformer(X_tab) 

In [19]:
tab_transformer

TabTransformer(
  (cat_embed): Embedding(17, 32, padding_idx=0)
  (embedding_dropout): Dropout(p=0.1, inplace=False)
  (cont_norm): Identity()
  (transformer_blks): Sequential(
    (block0): TransformerEncoder(
      (self_attn): MultiHeadedAttention(
        (dropout): Dropout(p=0.1, inplace=False)
        (inp_proj): Linear(in_features=32, out_features=96, bias=True)
        (out_proj): Linear(in_features=32, out_features=32, bias=True)
      )
      (ff): PositionwiseFF(
        (w_1): Linear(in_features=32, out_features=128, bias=True)
        (w_2): Linear(in_features=128, out_features=32, bias=True)
        (dropout): Dropout(p=0.1, inplace=False)
        (activation): GELU()
      )
      (attn_addnorm): AddNorm(
        (dropout): Dropout(p=0.1, inplace=False)
        (ln): LayerNorm((32,), eps=1e-05, elementwise_affine=True)
      )
      (ff_addnorm): AddNorm(
        (dropout): Dropout(p=0.1, inplace=False)
        (ln): LayerNorm((32,), eps=1e-05, elementwise_affine=True)
 

In [20]:
ft_transformer = TabTransformer(
    column_idx=column_idx, 
    embed_input=embed_input, 
    continuous_cols=continuous_cols,
    embed_continuous=True,
    embed_continuous_activation="relu",
)
out = ft_transformer(X_tab) 

In [21]:
ft_transformer

TabTransformer(
  (cat_embed): Embedding(17, 32, padding_idx=0)
  (embedding_dropout): Dropout(p=0.1, inplace=False)
  (cont_norm): Identity()
  (cont_embed): ContinuousEmbeddings(
    (act_fn): ReLU(inplace=True)
  )
  (transformer_blks): Sequential(
    (block0): TransformerEncoder(
      (self_attn): MultiHeadedAttention(
        (dropout): Dropout(p=0.1, inplace=False)
        (inp_proj): Linear(in_features=32, out_features=96, bias=True)
        (out_proj): Linear(in_features=32, out_features=32, bias=True)
      )
      (ff): PositionwiseFF(
        (w_1): Linear(in_features=32, out_features=128, bias=True)
        (w_2): Linear(in_features=128, out_features=32, bias=True)
        (dropout): Dropout(p=0.1, inplace=False)
        (activation): GELU()
      )
      (attn_addnorm): AddNorm(
        (dropout): Dropout(p=0.1, inplace=False)
        (ln): LayerNorm((32,), eps=1e-05, elementwise_affine=True)
      )
      (ff_addnorm): AddNorm(
        (dropout): Dropout(p=0.1, inplace=

Finally, and as I mentioned earlier, note that `TabTransformer` class does not build the connection to the output neuron(s). This is done by the ``WideDeep`` class, which collects all wide and deep components and connects them to the output neuron(s).

##  6. `SAINT`


Details on `SAINT` (Self-Attention and Intersample Attention Transformer) can be found in [SAINT: Improved Neural Networks for Tabular Data via Row Attention and Contrastive Pre-Training](https://arxiv.org/pdf/2106.01342.pdf). The main contribution of the `SAINT` model is the addition of an intersample attention block.


<img src="../docs/figures/saint_arch.png" width="300" align="center"/>

In case you wonder what this mysterious "inter-sample attention" is: it is simply the exact same mechanism as the well-known self-attention, but instead of features attending to each other, here observations/rows attend to each other. If you want to understand the advantages of this mechanism in more detail, I strongly encourage you to read the paper. Effectively, all one needs to do is reshape the input tensors of the transformer blocks and "off we go".
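The reshape trick can be sketched in a few lines of PyTorch. To keep the example short, the attention function below is single-head and has no learned projections, which is a simplification of mine; the point is only to show which axis plays the role of the sequence in each case. The batch size of 6 is arbitrary and chosen so that rows and features are easy to tell apart.

```python
import torch


def self_attention(x: torch.Tensor) -> torch.Tensor:
    # Simplified scaled dot-product attention over the second-to-last axis
    # (single head, no learned projections).
    scores = x @ x.transpose(-2, -1) / x.shape[-1] ** 0.5
    return scores.softmax(dim=-1) @ x


batch, n_features, d_model = 6, 5, 32
x = torch.rand(batch, n_features, d_model)

# Standard self-attention: within each row, the 5 features attend to each other.
col_out = self_attention(x)                          # (6, 5, 32)

# Intersample attention: flatten each row into a single token and treat the
# batch as the sequence, so that the 6 rows attend to each other.
rows = x.reshape(1, batch, n_features * d_model)     # (1, 6, 160)
row_out = self_attention(rows).reshape(batch, n_features, d_model)

print(col_out.shape, rows.shape, row_out.shape)
```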

`pytorch-widedeep`'s implementation is partially based on the [original code release](https://github.com/somepago/saint) (and the word "*partially*" is well used here, in the sense that there are notable differences, but in essence it is the same implementation described in the paper).

Let's have a look at some code:

In [22]:
from pytorch_widedeep.models import SAINT

In [23]:
?SAINT

In [24]:
X_tab = torch.cat((torch.empty(5, 4).random_(4), torch.rand(5, 1)), axis=1)
colnames = ['a', 'b', 'c', 'd', 'e']
embed_input = [(u,i) for u,i in zip(colnames[:4], [4]*4)]
continuous_cols = ['e']
column_idx = {k:v for v,k in enumerate(colnames)}
saint = SAINT(
    column_idx=column_idx, 
    embed_input=embed_input, 
    continuous_cols=continuous_cols,
    embed_continuous=True,
    embed_continuous_activation="leaky_relu",
)
out = saint(X_tab) 

In [25]:
saint

SAINT(
  (cat_embed): Embedding(17, 32, padding_idx=0)
  (embedding_dropout): Dropout(p=0.1, inplace=False)
  (cont_norm): LayerNorm((1,), eps=1e-05, elementwise_affine=True)
  (cont_embed): ContinuousEmbeddings(
    (act_fn): LeakyReLU(negative_slope=0.01, inplace=True)
  )
  (transformer_blks): Sequential(
    (block0): SaintEncoder(
      (self_attn): MultiHeadedAttention(
        (dropout): Dropout(p=0.1, inplace=False)
        (inp_proj): Linear(in_features=32, out_features=96, bias=True)
        (out_proj): Linear(in_features=32, out_features=32, bias=True)
      )
      (self_attn_ff): PositionwiseFF(
        (w_1): Linear(in_features=32, out_features=128, bias=True)
        (w_2): Linear(in_features=128, out_features=32, bias=True)
        (dropout): Dropout(p=0.1, inplace=False)
        (activation): GELU()
      )
      (self_attn_addnorm): AddNorm(
        (dropout): Dropout(p=0.1, inplace=False)
        (ln): LayerNorm((32,), eps=1e-05, elementwise_affine=True)
      )
    