## Binary Classification with different optimizers, schedulers, etc.

In this notebook we will use the Adult Census dataset. Download the data from [here](https://www.kaggle.com/wenruliu/adult-income-dataset/downloads/adult.csv/2).

In [1]:
import numpy as np
import pandas as pd
import torch

from pytorch_widedeep import Trainer
from pytorch_widedeep.preprocessing import WidePreprocessor, TabPreprocessor
from pytorch_widedeep.models import Wide, TabMlp, WideDeep
from pytorch_widedeep.metrics import Accuracy, Recall

In [2]:
df = pd.read_csv('data/adult/adult.csv.zip')
df.head()

Unnamed: 0,age,workclass,fnlwgt,education,educational-num,marital-status,occupation,relationship,race,gender,capital-gain,capital-loss,hours-per-week,native-country,income
0,25,Private,226802,11th,7,Never-married,Machine-op-inspct,Own-child,Black,Male,0,0,40,United-States,<=50K
1,38,Private,89814,HS-grad,9,Married-civ-spouse,Farming-fishing,Husband,White,Male,0,0,50,United-States,<=50K
2,28,Local-gov,336951,Assoc-acdm,12,Married-civ-spouse,Protective-serv,Husband,White,Male,0,0,40,United-States,>50K
3,44,Private,160323,Some-college,10,Married-civ-spouse,Machine-op-inspct,Husband,Black,Male,7688,0,40,United-States,>50K
4,18,?,103497,Some-college,10,Never-married,?,Own-child,White,Female,0,0,30,United-States,<=50K


In [3]:
# For convenience, we'll replace '-' with '_'
df.columns = [c.replace("-", "_") for c in df.columns]
# binary target
df['income_label'] = (df["income"].apply(lambda x: ">50K" in x)).astype(int)
df.drop('income', axis=1, inplace=True)
df.head()

Unnamed: 0,age,workclass,fnlwgt,education,educational_num,marital_status,occupation,relationship,race,gender,capital_gain,capital_loss,hours_per_week,native_country,income_label
0,25,Private,226802,11th,7,Never-married,Machine-op-inspct,Own-child,Black,Male,0,0,40,United-States,0
1,38,Private,89814,HS-grad,9,Married-civ-spouse,Farming-fishing,Husband,White,Male,0,0,50,United-States,0
2,28,Local-gov,336951,Assoc-acdm,12,Married-civ-spouse,Protective-serv,Husband,White,Male,0,0,40,United-States,1
3,44,Private,160323,Some-college,10,Married-civ-spouse,Machine-op-inspct,Husband,Black,Male,7688,0,40,United-States,1
4,18,?,103497,Some-college,10,Never-married,?,Own-child,White,Female,0,0,30,United-States,0


### Preparing the data

Have a look at notebooks one and two if you want a good understanding of the next few lines of code (although this is not required to use the package).

In [4]:
wide_cols = ['education', 'relationship','workclass','occupation','native_country','gender']
crossed_cols = [('education', 'occupation'), ('native_country', 'occupation')]
cat_embed_cols = [('education',16), ('relationship',8), ('workclass',16), ('occupation',16),('native_country',16)]
continuous_cols = ["age","hours_per_week"]
target_col = 'income_label'

In [5]:
# TARGET
target = df[target_col].values

# WIDE
wide_preprocessor = WidePreprocessor(wide_cols=wide_cols, crossed_cols=crossed_cols)
X_wide = wide_preprocessor.fit_transform(df)

# DEEP
tab_preprocessor = TabPreprocessor(embed_cols=cat_embed_cols, continuous_cols=continuous_cols)
X_tab = tab_preprocessor.fit_transform(df)

In [6]:
print(X_wide)
print(X_wide.shape)

[[  1  17  23 ...  89  91 316]
 [  2  18  23 ...  89  92 317]
 [  3  18  24 ...  89  93 318]
 ...
 [  2  20  23 ...  90 103 323]
 [  2  17  23 ...  89 103 323]
 [  2  21  29 ...  90 115 324]]
(48842, 8)


In [7]:
print(X_tab)
print(X_tab.shape)

[[ 1.          1.          1.         ...  1.         -0.99512893
  -0.03408696]
 [ 2.          2.          1.         ...  1.         -0.04694151
   0.77292975]
 [ 3.          2.          2.         ...  1.         -0.77631645
  -0.03408696]
 ...
 [ 2.          4.          1.         ...  1.          1.41180837
  -0.03408696]
 [ 2.          1.          1.         ...  1.         -1.21394141
  -1.64812038]
 [ 2.          5.          7.         ...  1.          0.97418341
  -0.03408696]]
(48842, 7)


As you can see, the data for a wide and deep model can be prepared in just a few lines of code.

Let's now see how to use `WideDeep` with varying parameters.

### Dropout and Batchnorm

In [8]:
?TabMlp

In [9]:
wide = Wide(wide_dim=np.unique(X_wide).shape[0], pred_dim=1)
# We can add dropout and batchnorm to the dense layers, as well as choose the order of the operations
deeptabular = TabMlp(column_idx=tab_preprocessor.column_idx,
                   mlp_hidden_dims=[64,32], 
                   mlp_dropout=[0.5, 0.5], 
                   mlp_batchnorm=True, 
                   mlp_linear_first = True,
                   embed_input=tab_preprocessor.embeddings_input,
                   continuous_cols=continuous_cols)
model = WideDeep(wide=wide, deeptabular=deeptabular)

In [10]:
model

WideDeep(
  (wide): Wide(
    (wide_linear): Embedding(797, 1, padding_idx=0)
  )
  (deeptabular): Sequential(
    (0): TabMlp(
      (embed_layers): ModuleDict(
        (emb_layer_education): Embedding(17, 16, padding_idx=0)
        (emb_layer_native_country): Embedding(43, 16, padding_idx=0)
        (emb_layer_occupation): Embedding(16, 16, padding_idx=0)
        (emb_layer_relationship): Embedding(7, 8, padding_idx=0)
        (emb_layer_workclass): Embedding(10, 16, padding_idx=0)
      )
      (embedding_dropout): Dropout(p=0.0, inplace=False)
      (tab_mlp): MLP(
        (mlp): Sequential(
          (dense_layer_0): Sequential(
            (0): Linear(in_features=74, out_features=64, bias=False)
            (1): ReLU(inplace=True)
            (2): BatchNorm1d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
            (3): Dropout(p=0.5, inplace=False)
          )
          (dense_layer_1): Sequential(
            (0): Linear(in_features=64, out_features=32, b
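
The `in_features=74` of the first dense layer in the printed model can be checked by hand: the MLP input is the concatenation of all categorical embeddings plus the continuous columns. A minimal sketch of that arithmetic, reusing the column definitions from above (the names `embed_total` and `mlp_input_dim` are ours, for illustration only, and are not part of the library):

```python
# Column definitions copied from the "Preparing the data" cell
cat_embed_cols = [
    ("education", 16),
    ("relationship", 8),
    ("workclass", 16),
    ("occupation", 16),
    ("native_country", 16),
]
continuous_cols = ["age", "hours_per_week"]

# Sum of the embedding dimensions of all categorical columns
embed_total = sum(dim for _, dim in cat_embed_cols)

# Each continuous column contributes one extra input feature
mlp_input_dim = embed_total + len(continuous_cols)

print(embed_total, mlp_input_dim)
```

The same count explains the shape of `X_tab`: five label-encoded categorical columns plus two continuous columns give the 7 columns shown earlier.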

We can use different initializers, optimizers and learning rate schedulers for each branch of the model.

###  Optimizers, LR schedulers, Initializers and Callbacks

In [11]:
from pytorch_widedeep.initializers import KaimingNormal, XavierNormal
from pytorch_widedeep.callbacks import ModelCheckpoint, LRHistory, EarlyStopping
from pytorch_widedeep.optim import RAdam

In [12]:
# Optimizers
wide_opt = torch.optim.Adam(model.wide.parameters(), lr=0.03)
deep_opt = RAdam(model.deeptabular.parameters(), lr=0.01)
# LR Schedulers
wide_sch = torch.optim.lr_scheduler.StepLR(wide_opt, step_size=3)
deep_sch = torch.optim.lr_scheduler.StepLR(deep_opt, step_size=5)

The component-dependent settings must be passed as dictionaries, while general settings are simply lists.

In [13]:
# Component-dependent settings as Dict
optimizers = {'wide': wide_opt, 'deeptabular':deep_opt}
schedulers = {'wide': wide_sch, 'deeptabular':deep_sch}
initializers = {'wide': KaimingNormal, 'deeptabular':XavierNormal}
# General settings as List
callbacks = [LRHistory(n_epochs=10), EarlyStopping, ModelCheckpoint(filepath='model_weights/wd_out')]
metrics = [Accuracy, Recall]
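
To illustrate why per-component optimizers are keyed by branch name, here is a minimal plain-PyTorch sketch. It is not the library's internal implementation: `wide_branch` and `deep_branch` are hypothetical stand-ins for the `Wide` and `TabMlp` components, both use `Adam` for simplicity, and we assume the branch outputs are summed before the loss, as in the wide and deep architecture. The point is only that each optimizer owns a disjoint set of parameters.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Hypothetical stand-ins for the wide and deeptabular branches
wide_branch = nn.Linear(4, 1)
deep_branch = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 1))

# One optimizer per branch, keyed by branch name
optimizers = {
    "wide": torch.optim.Adam(wide_branch.parameters(), lr=0.03),
    "deeptabular": torch.optim.Adam(deep_branch.parameters(), lr=0.01),
}

# Toy binary-classification batch
X = torch.randn(32, 4)
y = torch.randint(0, 2, (32, 1)).float()

# Snapshot the parameters before any update
wide_before = [p.detach().clone() for p in wide_branch.parameters()]
deep_before = [p.detach().clone() for p in deep_branch.parameters()]

# Joint forward pass: the logits of both branches are summed (assumption)
logits = wide_branch(X) + deep_branch(X)
loss = nn.functional.binary_cross_entropy_with_logits(logits, y)
loss.backward()

# Stepping only the "wide" optimizer updates only the wide parameters
optimizers["wide"].step()

wide_changed = any(
    not torch.equal(before, p) for before, p in zip(wide_before, wide_branch.parameters())
)
deep_unchanged = all(
    torch.equal(before, p) for before, p in zip(deep_before, deep_branch.parameters())
)
print(wide_changed, deep_unchanged)
```

Because the parameter sets do not overlap, each branch can be trained with its own learning rate and schedule without affecting the other.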

In [14]:
trainer = Trainer(model, 
                  objective='binary', 
                  optimizers=optimizers, 
                  lr_schedulers=schedulers,
                  initializers=initializers,
                  callbacks=callbacks,
                  metrics=metrics
                 )

In [16]:
trainer.fit(X_wide=X_wide, X_tab=X_tab, target=target, n_epochs=10, batch_size=256, val_split=0.2)

epoch 1: 100%|██████████| 153/153 [00:03<00:00, 46.93it/s, loss=0.597, metrics={'acc': 0.7751, 'rec': 0.4646}]
valid: 100%|██████████| 39/39 [00:00<00:00, 115.54it/s, loss=0.365, metrics={'acc': 0.7871, 'rec': 0.4839}]
epoch 2: 100%|██████████| 153/153 [00:03<00:00, 48.61it/s, loss=0.373, metrics={'acc': 0.8258, 'rec': 0.5525}]
valid: 100%|██████████| 39/39 [00:00<00:00, 126.36it/s, loss=0.354, metrics={'acc': 0.8282, 'rec': 0.5622}]
epoch 3: 100%|██████████| 153/153 [00:03<00:00, 46.11it/s, loss=0.356, metrics={'acc': 0.8329, 'rec': 0.5595}]
valid: 100%|██████████| 39/39 [00:00<00:00, 114.20it/s, loss=0.351, metrics={'acc': 0.8343, 'rec': 0.5672}]
epoch 4: 100%|██████████| 153/153 [00:03<00:00, 45.97it/s, loss=0.346, metrics={'acc': 0.8371, 'rec': 0.574}] 
valid: 100%|██████████| 39/39 [00:00<00:00, 107.73it/s, loss=0.349, metrics={'acc': 0.8374, 'rec': 0.5691}]
epoch 5: 100%|██████████| 153/153 [00:03<00:00, 46.22it/s, loss=0.345, metrics={'acc': 0.8384, 'rec': 0.571}] 
valid: 100%|█

Among its many methods and attributes, the `Trainer` exposes the `history` and `lr_history` attributes:

In [18]:
print(trainer.history)

{'train_loss': [0.5969724349336687, 0.3732765291640961, 0.35611476909880546, 0.3463761859080371, 0.34545664167871665, 0.34359567286142334, 0.3418893502428641, 0.34155894767225176, 0.33982853737531926, 0.3406260426527534], 'train_acc': [0.7750876564379495, 0.8257620351649476, 0.8328513295626135, 0.8371253806976685, 0.838405036726128, 0.8396590996340184, 0.8404013001305249, 0.8408363831802012, 0.8417321424001228, 0.8412970593504466], 'train_rec': [0.46464863419532776, 0.5524654984474182, 0.5595250725746155, 0.5739651322364807, 0.5709701776504517, 0.5702214241027832, 0.569151759147644, 0.5730024576187134, 0.5743929743766785, 0.5785645246505737], 'val_loss': [0.3653175265361101, 0.35369565853705776, 0.3509304424126943, 0.34885044204883087, 0.3483696549366682, 0.34747094985766286, 0.34690968424845964, 0.34662194358996856, 0.3462058512064127, 0.34638020854729873], 'val_acc': [0.7871094549772737, 0.8282011383645224, 0.8343229187993939, 0.8373735719258015, 0.8386839195774128, 0.839809999590516

In [19]:
print(trainer.lr_history)

{'lr_wide_0': [0.03, 0.03, 0.03, 0.003, 0.003, 0.003, 0.00030000000000000003, 0.00030000000000000003, 0.00030000000000000003, 3.0000000000000004e-05], 'lr_deeptabular_0': [0.01, 0.01, 0.01, 0.01, 0.01, 0.001, 0.001, 0.001, 0.001, 0.001]}


We can see that the learning rate is multiplied by 0.1 (the default `gamma` of `StepLR`) every `step_size` epochs. Note that the keys of the dictionary have a suffix `_0`. This is because if you pass different parameter groups to the torch optimizers, these will also be recorded. We'll see this in the `Regression` notebook.
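
The recorded values follow the usual step-decay rule, `lr = base_lr * gamma ** (epoch // step_size)`. The sketch below reproduces the two schedules in pure Python. The helper `step_lr` is ours, for illustration only. Small floating-point differences from the values above are expected, because PyTorch applies the decay multiplicatively at each step rather than through this closed form.

```python
def step_lr(base_lr, step_size, n_epochs, gamma=0.1):
    """Closed-form step decay: multiply by gamma every step_size epochs."""
    return [base_lr * gamma ** (epoch // step_size) for epoch in range(n_epochs)]


# Same settings as the wide and deeptabular schedulers above
wide_lrs = step_lr(0.03, step_size=3, n_epochs=10)
deep_lrs = step_lr(0.01, step_size=5, n_epochs=10)

for epoch, (w, d) in enumerate(zip(wide_lrs, deep_lrs), start=1):
    print(f"epoch {epoch:2d}: wide lr={w:.6f}  deeptabular lr={d:.6f}")
```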

By now you should have a good idea of how to use the package. Before leaving this notebook, note that the `Trainer` class comes with a useful method to "rescue" the learned embeddings. For example, let's say we want to use the embeddings learned for the different levels of the categorical feature `education`:

In [20]:
trainer.get_embeddings(col_name='education', cat_encoding_dict=tab_preprocessor.label_encoder.encoding_dict)

{'11th': array([-0.04348978,  0.45022243,  0.18464865,  0.06992336, -0.34323215,
         0.20416489, -0.310977  ,  0.02868079,  0.35072434, -0.16913618,
         0.36890852, -0.11342919,  0.12130431,  0.40069175,  0.13254811,
         0.11155065], dtype=float32),
 'HS-grad': array([ 0.02742365, -0.06904709,  0.07933907, -0.28882706,  0.05550014,
        -0.2779799 , -0.17582755, -0.01317326, -0.36928716,  0.31297338,
        -0.26600584, -0.24179696,  0.29325986, -0.41419625,  0.11772554,
        -0.14493649], dtype=float32),
 'Assoc-acdm': array([-0.22545657, -0.09746448,  0.25342208,  0.04131372, -0.0871982 ,
         0.13787483,  0.25658223,  0.08221256, -0.5716772 , -0.08774945,
        -0.55335015,  0.0017394 ,  0.02402168, -0.06060516, -0.13931672,
         0.01669413], dtype=float32),
 'Some-college': array([ 0.17594509,  0.1466892 , -0.68583024,  0.00980275, -0.03707155,
         0.28042212, -0.34241527,  0.4351751 ,  0.25033662, -0.03148668,
        -0.321893  ,  0.40399942, 
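
Once extracted, the embeddings are ordinary vectors and can be compared or reused elsewhere. As a minimal sketch, the snippet below computes the cosine similarity between two education levels. The two vectors are copied by hand from the output above into plain Python lists (the method itself returns NumPy arrays), and `cosine_similarity` is a helper written for this illustration, not a library function.

```python
import math

# Vectors copied from the get_embeddings output above
embeddings = {
    "11th": [
        -0.04348978, 0.45022243, 0.18464865, 0.06992336, -0.34323215,
        0.20416489, -0.310977, 0.02868079, 0.35072434, -0.16913618,
        0.36890852, -0.11342919, 0.12130431, 0.40069175, 0.13254811,
        0.11155065,
    ],
    "HS-grad": [
        0.02742365, -0.06904709, 0.07933907, -0.28882706, 0.05550014,
        -0.2779799, -0.17582755, -0.01317326, -0.36928716, 0.31297338,
        -0.26600584, -0.24179696, 0.29325986, -0.41419625, 0.11772554,
        -0.14493649,
    ],
}


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


sim = cosine_similarity(embeddings["11th"], embeddings["HS-grad"])
print(f"cosine similarity between '11th' and 'HS-grad': {sim:.4f}")
```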