## The FineTune/Warm Up option

Let's place ourselves in two possible scenarios. 

1. We have already trained a model and want to transfer what it has learned (you know... transfer learning) to another dataset, or we have simply received new data and do not want to train each component from scratch. In both cases, we want to load the pre-trained weights and fine-tune.

2. We just want to "warm up" the model components individually before the joint training begins.

This can be done with the `finetune` set of parameters (all of which are also aliased as `warmup` parameters). There are 3 fine-tuning routines:

1. Fine-tune all trainable layers at once with a triangular one-cycle learning rate (referred to as slanted triangular learning rates in Howard & Ruder 2018)
2. Gradual fine-tuning inspired by the work of [Felbo et al., 2017](https://arxiv.org/abs/1708.00524)
3. Gradual fine-tuning based on the work of [Howard & Ruder 2018](https://arxiv.org/abs/1801.06146)

Currently, fine-tuning is only supported without a fully connected head, i.e. if `deephead=None`. In addition, the `Felbo` and `Howard` routines only apply, of course, to the `deeptabular`, `deeptext` and `deepimage` models. The `wide` component can also be fine-tuned, but only in an "all at once" mode.

### Fine-tune or warm-up all at once

Here, the model components will be trained for `finetune_epochs` using a triangular one-cycle learning rate (slanted triangular learning rate) ranging from `finetune_max_lr/10` to `finetune_max_lr` (`finetune_max_lr` defaults to 0.01). The learning rate increases during the first 10% of the training steps and then decreases during the remaining 90%.

Here all trainable layers are fine-tuned.
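
The schedule described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the library's implementation: the function name `slanted_triangular_lr`, its arguments, linear interpolation between the two rates, and the step count of 382 are all hypothetical choices made for this example.

```python
def slanted_triangular_lr(step, total_steps, max_lr=0.01, up_fraction=0.1, ratio=10.0):
    """Learning rate at `step` (0-based) of a one-cycle triangular schedule.

    The rate rises linearly from max_lr / ratio to max_lr during the first
    `up_fraction` of the steps, then falls linearly back to max_lr / ratio.
    """
    base_lr = max_lr / ratio
    steps_up = max(1, round(total_steps * up_fraction))
    steps_down = max(1, total_steps - steps_up)
    if step <= steps_up:
        frac = step / steps_up
    else:
        frac = max(0.0, 1.0 - (step - steps_up) / steps_down)
    return base_lr + (max_lr - base_lr) * frac


# Illustrative numbers only: 382 optimizer steps in total.
total_steps = 382
schedule = [slanted_triangular_lr(s, total_steps) for s in range(total_steps + 1)]
peak_step = schedule.index(max(schedule))
print(f"start={schedule[0]:.4f} peak={max(schedule):.4f} at step {peak_step} end={schedule[-1]:.4f}")
```

With these numbers the rate peaks after roughly 10% of the steps and spends the rest of the run decaying, which is what gives the schedule its "slanted" shape.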

Let's look at an example.

In [1]:
import numpy as np
import pandas as pd
import torch

from pytorch_widedeep import Trainer
from pytorch_widedeep.preprocessing import WidePreprocessor, TabPreprocessor
from pytorch_widedeep.models import Wide, TabMlp, TabResnet, WideDeep
from pytorch_widedeep.metrics import Accuracy



In [2]:
df = pd.read_csv('data/adult/adult.csv.zip')
# For convenience, we'll replace '-' with '_'
df.columns = [c.replace("-", "_") for c in df.columns]
#binary target
df['income_label'] = (df["income"].apply(lambda x: ">50K" in x)).astype(int)
df.drop('income', axis=1, inplace=True)
df.head()

| Unnamed: 0 | age | workclass | fnlwgt | education | educational_num | marital_status | occupation | relationship | race | gender | capital_gain | capital_loss | hours_per_week | native_country | income_label |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 25 | Private | 226802 | 11th | 7 | Never-married | Machine-op-inspct | Own-child | Black | Male | 0 | 0 | 40 | United-States | 0 |
| 1 | 38 | Private | 89814 | HS-grad | 9 | Married-civ-spouse | Farming-fishing | Husband | White | Male | 0 | 0 | 50 | United-States | 0 |
| 2 | 28 | Local-gov | 336951 | Assoc-acdm | 12 | Married-civ-spouse | Protective-serv | Husband | White | Male | 0 | 0 | 40 | United-States | 1 |
| 3 | 44 | Private | 160323 | Some-college | 10 | Married-civ-spouse | Machine-op-inspct | Husband | Black | Male | 7688 | 0 | 40 | United-States | 1 |
| 4 | 18 | ? | 103497 | Some-college | 10 | Never-married | ? | Own-child | White | Female | 0 | 0 | 30 | United-States | 0 |


In [3]:
wide_cols = ['education', 'relationship','workclass','occupation','native_country','gender']
crossed_cols = [('education', 'occupation'), ('native_country', 'occupation')]
cat_embed_cols = [('education',16), ('relationship',8), ('workclass',16), ('occupation',16),('native_country',16)]
continuous_cols = ["age","hours_per_week"]
target_col = 'income_label'

In [4]:
# TARGET
target = df[target_col].values

# WIDE
wide_preprocessor = WidePreprocessor(wide_cols=wide_cols, crossed_cols=crossed_cols)
X_wide = wide_preprocessor.fit_transform(df)

# DEEP
tab_preprocessor = TabPreprocessor(embed_cols=cat_embed_cols, continuous_cols=continuous_cols)
X_tab = tab_preprocessor.fit_transform(df)

In [5]:
wide = Wide(wide_dim=np.unique(X_wide).shape[0], pred_dim=1)
deeptabular = TabMlp(mlp_hidden_dims=[64,32], 
                   column_idx=tab_preprocessor.column_idx,
                   embed_input=tab_preprocessor.embeddings_input,
                   continuous_cols=continuous_cols
                    )
model = WideDeep(wide=wide, deeptabular=deeptabular)

In [6]:
trainer = Trainer(model, objective="binary", metrics=[Accuracy])

Up to this point, the code is identical to the code in notebook `03_Binary_Classification_with_Defaults`. Let's first train the model as usual and save it; we will then fine-tune it via the `finetune` (or `warmup`) parameters.

In [7]:
trainer.fit(X_wide=X_wide, X_tab=X_tab, target=target, n_epochs=2, val_split=0.2, batch_size=256)

epoch 1: 100%|██████████| 153/153 [00:04<00:00, 36.06it/s, loss=0.529, metrics={'acc': 0.7448}]
valid: 100%|██████████| 39/39 [00:00<00:00, 68.26it/s, loss=0.389, metrics={'acc': 0.8176}]
epoch 2: 100%|██████████| 153/153 [00:03<00:00, 39.18it/s, loss=0.401, metrics={'acc': 0.8122}]
valid: 100%|██████████| 39/39 [00:00<00:00, 116.68it/s, loss=0.368, metrics={'acc': 0.8272}]


In [8]:
trainer.save(path="models_dir/", save_state_dict=True, model_filename="model_1.pt")

Now time goes by... and we want to fine-tune the model on a new dataset (for example, a dataset that is identical to the one used to train the previous model, but for another country).

Here I will use the same dataset just for illustration purposes, but the flow would be identical for that new dataset.

In [9]:
wide_1 = Wide(wide_dim=np.unique(X_wide).shape[0], pred_dim=1)
deeptabular_1 = TabMlp(mlp_hidden_dims=[64,32], 
                   column_idx=tab_preprocessor.column_idx,
                   embed_input=tab_preprocessor.embeddings_input,
                   continuous_cols=continuous_cols)
model_1 = WideDeep(wide=wide_1, deeptabular=deeptabular_1)

In [10]:
model_1.load_state_dict(torch.load("models_dir/model_1.pt"))

<All keys matched successfully>

In [11]:
trainer_1 = Trainer(model_1, objective="binary", metrics=[Accuracy])

In [12]:
trainer_1.fit(
    X_wide=X_wide, 
    X_tab=X_tab, 
    target=target, 
    finetune=True, 
    finetune_epochs=2, 
    n_epochs=2, 
    batch_size=256)

epoch 1:   3%|▎         | 5/191 [00:00<00:03, 47.72it/s, loss=0.794, metrics={'acc': 0.5348}]

Training wide for 2 epochs


epoch 1: 100%|██████████| 191/191 [00:02<00:00, 67.54it/s, loss=0.504, metrics={'acc': 0.7554}]
epoch 2: 100%|██████████| 191/191 [00:02<00:00, 70.24it/s, loss=0.386, metrics={'acc': 0.79}]  
epoch 1:   4%|▎         | 7/191 [00:00<00:03, 60.96it/s, loss=0.39, metrics={'acc': 0.7909}] 

Training deeptabular for 2 epochs


epoch 1: 100%|██████████| 191/191 [00:03<00:00, 62.41it/s, loss=0.369, metrics={'acc': 0.8028}]
epoch 2: 100%|██████████| 191/191 [00:03<00:00, 59.52it/s, loss=0.352, metrics={'acc': 0.8107}]
epoch 1:   3%|▎         | 5/191 [00:00<00:04, 43.10it/s, loss=0.363, metrics={'acc': 0.8418}]

Fine-tuning of individual components completed. Training the whole model for 2 epochs


epoch 1: 100%|██████████| 191/191 [00:04<00:00, 39.91it/s, loss=0.352, metrics={'acc': 0.8378}]
epoch 2: 100%|██████████| 191/191 [00:04<00:00, 43.80it/s, loss=0.344, metrics={'acc': 0.8419}]


Note that, as described above in scenario 2, we can also use this simply to warm up the model components before the joint training begins:

In [13]:
wide = Wide(wide_dim=np.unique(X_wide).shape[0], pred_dim=1)
deeptabular = TabMlp(mlp_hidden_dims=[128, 32], 
                   column_idx=tab_preprocessor.column_idx,
                   embed_input=tab_preprocessor.embeddings_input,
                   continuous_cols=continuous_cols)
model = WideDeep(wide=wide, deeptabular=deeptabular)

In [14]:
trainer_2 = Trainer(model, objective="binary", metrics=[Accuracy])

In [15]:
trainer_2.fit(
    X_wide=X_wide, 
    X_tab=X_tab, 
    target=target, 
    val_split=0.1, 
    warmup=True, 
    warmup_epochs=2, 
    n_epochs=2, 
    batch_size=256
)

epoch 1:   3%|▎         | 6/172 [00:00<00:02, 58.53it/s, loss=0.988, metrics={'acc': 0.4435}]

Training wide for 2 epochs


epoch 1: 100%|██████████| 172/172 [00:02<00:00, 73.06it/s, loss=0.54, metrics={'acc': 0.7276}] 
epoch 2: 100%|██████████| 172/172 [00:02<00:00, 75.57it/s, loss=0.389, metrics={'acc': 0.7736}]
epoch 1:   3%|▎         | 6/172 [00:00<00:02, 55.48it/s, loss=0.582, metrics={'acc': 0.7728}]

Training deeptabular for 2 epochs


epoch 1: 100%|██████████| 172/172 [00:02<00:00, 58.52it/s, loss=0.392, metrics={'acc': 0.7881}]
epoch 2: 100%|██████████| 172/172 [00:02<00:00, 58.26it/s, loss=0.353, metrics={'acc': 0.8}]   
epoch 1:   2%|▏         | 4/172 [00:00<00:04, 38.87it/s, loss=0.337, metrics={'acc': 0.8589}]

Fine-tuning of individual components completed. Training the whole model for 2 epochs


epoch 1: 100%|██████████| 172/172 [00:04<00:00, 42.81it/s, loss=0.355, metrics={'acc': 0.8366}]
valid: 100%|██████████| 20/20 [00:00<00:00, 89.21it/s, loss=0.35, metrics={'acc': 0.8356}] 
epoch 2: 100%|██████████| 172/172 [00:04<00:00, 41.35it/s, loss=0.346, metrics={'acc': 0.8381}]
valid: 100%|██████████| 20/20 [00:00<00:00, 87.63it/s, loss=0.349, metrics={'acc': 0.8373}]


### Fine-tune Gradually: The "felbo"  and the "howard" routines

The Felbo routine can be illustrated as follows:

<p align="center">
  <img width="600" src="../docs/figures/felbo_routine.png">
</p>

**Figure 1.** The figure can be described as follows: fine-tune (or train) the last layer for one epoch using a one-cycle triangular learning rate. Then fine-tune the next deeper layer for one epoch, with a learning rate that is a factor of 2.5 lower than the previous learning rate (the 2.5 factor is fixed), while freezing the already warmed-up layer(s). Repeat until all individual layers are warmed up. Then warm up for one last epoch with all warmed-up layers trainable. The vanishing color gradient in the figure attempts to illustrate the decreasing learning rate.

Note that this is not identical to the fine-tuning routine described in Felbo et al., 2017, which is why I used the word 'inspired'.

The Howard routine can be illustrated as follows:

<p align="center">
  <img width="600" src="../docs/figures/howard_routine.png">
</p>

**Figure 2.** The figure can be described as follows: fine-tune (or train) the last layer for one epoch using a one-cycle triangular learning rate. Then fine-tune the next deeper layer for one epoch, with a learning rate that is a factor of 2.5 lower than the previous learning rate (the 2.5 factor is fixed), while keeping the already warmed-up layer(s) trainable. Repeat. The vanishing color gradient in the figure attempts to illustrate the decreasing learning rate.

Note that I write "*fine-tune (or train) the last layer for one epoch [...]*". However, in practice the user has to specify the order of the layers to be fine-tuned. This is another reason why I wrote that the fine-tuning routines I have implemented are **inspired** by the work of Felbo and Howard and not identical to their implementations.
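
To make the difference between the two routines concrete, here is a minimal sketch that only builds the per-stage plan (which layers are trainable and at which learning rate). It is not the library's code. The function name `gradual_finetune_plan` and the layer names are hypothetical, the rate of layer `i` is taken as `max_lr / 2.5**i` following the figure descriptions, and keeping each layer's own rate whenever it is trainable again in a later stage is an assumption.

```python
def gradual_finetune_plan(layer_names, max_lr=0.01, routine="felbo", factor=2.5):
    """Return one {layer_name: learning_rate} dict per fine-tuning stage.

    `layer_names` must be given in the order in which the layers should be
    fine-tuned (typically from the last layer to the first). Layers missing
    from a stage's dict are frozen during that stage.
    """
    if routine not in ("felbo", "howard"):
        raise ValueError("routine must be 'felbo' or 'howard'")
    # Each layer's rate is `factor` times lower than the previous layer's.
    lrs = {name: max_lr / factor**i for i, name in enumerate(layer_names)}
    stages = []
    for i, name in enumerate(layer_names):
        if routine == "felbo":
            # Only the current layer is trainable; warmed-up layers are frozen.
            stages.append({name: lrs[name]})
        else:
            # The current layer and all previously tuned layers stay trainable.
            stages.append({n: lrs[n] for n in layer_names[: i + 1]})
    if routine == "felbo":
        # Final pass with every warmed-up layer trainable at once.
        stages.append(dict(lrs))
    return stages


layers = ["linear", "block_1", "block_0"]  # hypothetical names, last layer first
felbo = gradual_finetune_plan(layers, routine="felbo")
howard = gradual_finetune_plan(layers, routine="howard")
for name, plan in (("felbo", felbo), ("howard", howard)):
    print(name, [list(stage) for stage in plan])
```

For three layers, the Felbo-style plan has four stages (three single-layer stages plus one with everything trainable), while the Howard-style plan has three stages with a growing set of trainable layers.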

The `felbo` and `howard` routines can be accessed via the `finetune` parameters.

We need to explicitly indicate:

1. That we want to fine-tune

2. The components that we want to individually fine-tune 

3. In case of gradual fine-tuning, the routine ("felbo" or "howard")

4. The layers we want to fine-tune. 

For example

In [16]:
wide = Wide(wide_dim=np.unique(X_wide).shape[0], pred_dim=1)
deeptabular = TabResnet(
    blocks_dims=[128, 64, 32], 
    column_idx=tab_preprocessor.column_idx,
    embed_input=tab_preprocessor.embeddings_input,
    continuous_cols=continuous_cols)
model = WideDeep(wide=wide, deeptabular=deeptabular)

In [17]:
model

WideDeep(
  (wide): Wide(
    (wide_linear): Embedding(797, 1, padding_idx=0)
  )
  (deeptabular): Sequential(
    (0): TabResnet(
      (embed_layers): ModuleDict(
        (emb_layer_education): Embedding(17, 16, padding_idx=0)
        (emb_layer_native_country): Embedding(43, 16, padding_idx=0)
        (emb_layer_occupation): Embedding(16, 16, padding_idx=0)
        (emb_layer_relationship): Embedding(7, 8, padding_idx=0)
        (emb_layer_workclass): Embedding(10, 16, padding_idx=0)
      )
      (embedding_dropout): Dropout(p=0.1, inplace=False)
      (cont_norm): BatchNorm1d(2, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (tab_resnet_blks): DenseResnet(
        (dense_resnet): Sequential(
          (lin1): Linear(in_features=74, out_features=128, bias=True)
          (bn1): BatchNorm1d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
          (block_0): BasicBlock(
            (lin1): Linear(in_features=128, out_features=64, bias=True

Let's first train as usual:

In [18]:
trainer_3 = Trainer(model, objective="binary", metrics=[Accuracy])

In [19]:
trainer_3.fit(X_wide=X_wide, X_tab=X_tab, target=target, val_split=0.1, n_epochs=2, batch_size=256)

epoch 1: 100%|██████████| 172/172 [00:05<00:00, 29.00it/s, loss=0.453, metrics={'acc': 0.7787}]
valid: 100%|██████████| 20/20 [00:00<00:00, 90.03it/s, loss=0.363, metrics={'acc': 0.8282}]
epoch 2: 100%|██████████| 172/172 [00:05<00:00, 32.24it/s, loss=0.371, metrics={'acc': 0.8262}]
valid: 100%|██████████| 20/20 [00:00<00:00, 88.22it/s, loss=0.351, metrics={'acc': 0.8356}]


In [20]:
trainer_3.save(path="models_dir", save_state_dict=True, model_filename="model_3.pt")

Now we are going to fine-tune the model components, and in the case of the `deeptabular` component, we will fine-tune the resnet-blocks and the linear layer but NOT the embeddings. 

For this, we need to access the model component's children: ``deeptabular`` $\rightarrow$ ``tab_resnet`` $\rightarrow$ ``dense_resnet`` $\rightarrow$ ``blocks``

In [21]:
wide_3 = Wide(wide_dim=np.unique(X_wide).shape[0], pred_dim=1)
deeptabular_3 = TabResnet(
    blocks_dims=[128, 64, 32], 
    column_idx=tab_preprocessor.column_idx,
    embed_input=tab_preprocessor.embeddings_input,
    continuous_cols=continuous_cols)
model_3 = WideDeep(wide=wide_3, deeptabular=deeptabular_3)

In [22]:
model_3.load_state_dict(torch.load("models_dir/model_3.pt"))

<All keys matched successfully>

In [23]:
tab_lin_layers = list(model_3.deeptabular.children())[1]

In [24]:
tab_deep_layers = list(
    list(list(list(model_3.deeptabular.children())[0].children())[3].children())[
        0
    ].children()
)[::-1][:2]
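
The nested `list(...children())[...]` calls above can be hard to read. As a minimal sketch, a small helper can follow a path of child indices instead. The helper `descend` is hypothetical and not part of `pytorch_widedeep`; it relies only on the `children()` method that `torch.nn.Module` provides. The `Node` class is a stand-in used purely to keep the example self-contained, and the toy tree is assumed to mirror the layout used in the cells above.

```python
class Node:
    """Hypothetical stand-in for a module: only `children()` is needed."""

    def __init__(self, name, *kids):
        self.name = name
        self._kids = list(kids)

    def children(self):
        return iter(self._kids)


def descend(module, path):
    """Follow a sequence of child indices, e.g. (0, 3, 0), starting at `module`."""
    for idx in path:
        module = list(module.children())[idx]
    return module


# Toy tree, assumed to mirror the component layout used above.
dense_resnet = Node(
    "dense_resnet", Node("lin1"), Node("bn1"), Node("block_0"), Node("block_1")
)
tab_resnet = Node(
    "tab_resnet",
    Node("embed_layers"),
    Node("embedding_dropout"),
    Node("cont_norm"),
    Node("tab_resnet_blks", dense_resnet),
)
deeptabular = Node("deeptabular", tab_resnet, Node("linear"))

# Output layer of the component (second child of the top-level container).
head = descend(deeptabular, (1,))
# Last two children of dense_resnet, deepest block first.
blocks = list(descend(deeptabular, (0, 3, 0)).children())[::-1][:2]
print(head.name, [b.name for b in blocks])
```

Writing the path as a tuple makes it easier to see which level of the hierarchy each index refers to, and to adjust it if the component's structure changes.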

In [25]:
tab_layers = [tab_lin_layers] + tab_deep_layers

In [26]:
tab_layers

[Linear(in_features=32, out_features=1, bias=True),
 BasicBlock(
   (lin1): Linear(in_features=64, out_features=32, bias=True)
   (bn1): BatchNorm1d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
   (leaky_relu): LeakyReLU(negative_slope=0.01, inplace=True)
   (dp): Dropout(p=0.1, inplace=False)
   (lin2): Linear(in_features=32, out_features=32, bias=True)
   (bn2): BatchNorm1d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
   (resize): Sequential(
     (0): Linear(in_features=64, out_features=32, bias=True)
     (1): BatchNorm1d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
   )
 ),
 BasicBlock(
   (lin1): Linear(in_features=128, out_features=64, bias=True)
   (bn1): BatchNorm1d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
   (leaky_relu): LeakyReLU(negative_slope=0.01, inplace=True)
   (dp): Dropout(p=0.1, inplace=False)
   (lin2): Linear(in_features=64, out_features=64, bias=True)
   (bn2): BatchN

And now simply

In [27]:
trainer_4 = Trainer(model_3, objective="binary", metrics=[Accuracy])

In [None]:
trainer_4.fit(
    X_wide=X_wide, 
    X_tab=X_tab, 
    target=target, 
    val_split=0.1, 
    finetune=True, 
    finetune_epochs=2, 
    finetune_deeptabular_gradual=True,
    finetune_deeptabular_layers = tab_layers,
    finetune_deeptabular_max_lr = 0.01,
    n_epochs=2,
    batch_size=256
)

epoch 1:   5%|▍         | 8/172 [00:00<00:02, 68.51it/s, loss=0.767, metrics={'acc': 0.5605}]

Training wide for 2 epochs


epoch 1: 100%|██████████| 172/172 [00:02<00:00, 75.72it/s, loss=0.489, metrics={'acc': 0.7523}]
epoch 2: 100%|██████████| 172/172 [00:02<00:00, 64.95it/s, loss=0.383, metrics={'acc': 0.7876}]
epoch 1:   2%|▏         | 3/172 [00:00<00:07, 22.26it/s, loss=0.402, metrics={'acc': 0.788}] 

Training deeptabular, layer 1 of 3


epoch 1: 100%|██████████| 172/172 [00:08<00:00, 20.71it/s, loss=0.385, metrics={'acc': 0.7986}]
epoch 1:   0%|          | 0/172 [00:00<?, ?it/s]

Training deeptabular, layer 2 of 3


epoch 1: 100%|██████████| 172/172 [00:13<00:00, 13.08it/s, loss=0.369, metrics={'acc': 0.8058}]
epoch 1:   2%|▏         | 3/172 [00:00<00:07, 21.56it/s, loss=0.355, metrics={'acc': 0.806}]

Training deeptabular, layer 3 of 3


epoch 1: 100%|██████████| 172/172 [00:07<00:00, 22.34it/s, loss=0.361, metrics={'acc': 0.8108}]
epoch 1:   1%|          | 2/172 [00:00<00:09, 17.34it/s, loss=0.334, metrics={'acc': 0.8581}]

Fine-tuning of individual components completed. Training the whole model for 2 epochs


epoch 1: 100%|██████████| 172/172 [00:16<00:00, 10.31it/s, loss=0.353, metrics={'acc': 0.8366}]
valid: 100%|██████████| 20/20 [00:00<00:00, 40.53it/s, loss=0.345, metrics={'acc': 0.8405}]
epoch 2:  89%|████████▉ | 153/172 [00:06<00:00, 28.91it/s, loss=0.342, metrics={'acc': 0.8399}]

Finally, there is one more use case I would like to consider: the case where we train only one component and we just want to fine-tune it and stop the training afterwards, since there is no joint training. This is as simple as:

In [None]:
deeptabular = TabMlp(mlp_hidden_dims=[200, 100], 
                   column_idx=tab_preprocessor.column_idx,
                   embed_input=tab_preprocessor.embeddings_input,
                   continuous_cols=continuous_cols)
model = WideDeep(deeptabular=deeptabular)

In [None]:
trainer_5 = Trainer(model, objective="binary", metrics=[Accuracy])

In [None]:
trainer_5.fit(X_wide=X_wide, X_tab=X_tab, target=target, val_split=0.1, n_epochs=1, batch_size=256)

In [None]:
trainer_5.save(path="models_dir", save_state_dict=True, model_filename="model_5.pt")

In [None]:
deeptabular_5 = TabMlp(mlp_hidden_dims=[200, 100], 
                   column_idx=tab_preprocessor.column_idx,
                   embed_input=tab_preprocessor.embeddings_input,
                   continuous_cols=continuous_cols)
model_5 = WideDeep(deeptabular=deeptabular_5)

In [None]:
model_5.load_state_dict(torch.load("models_dir/model_5.pt"))

...time goes by...

In [None]:
trainer_6 = Trainer(model_5, objective="binary", metrics=[Accuracy])

In [None]:
trainer_6.fit(
    X_wide=X_wide, 
    X_tab=X_tab, 
    target=target, 
    val_split=0.1, 
    finetune=True, 
    finetune_epochs=2,
    finetune_max_lr=0.01,
    stop_after_finetuning=True,
    batch_size=256
)

In [None]:
import shutil

shutil.rmtree("models_dir/")