# Binary Classification with different optimizers, schedulers, etc.

In this notebook we will use the Adult Census dataset. Download the data from [here](https://www.kaggle.com/wenruliu/adult-income-dataset/downloads/adult.csv/2).

In [1]:
import numpy as np
import pandas as pd
import torch

from pytorch_widedeep import Trainer
from pytorch_widedeep.preprocessing import WidePreprocessor, TabPreprocessor
from pytorch_widedeep.models import Wide, TabMlp, WideDeep
from pytorch_widedeep.metrics import Accuracy, Recall



In [2]:
df = pd.read_csv("data/adult/adult.csv.zip")
df.head()

| Unnamed: 0 | age | workclass | fnlwgt | education | educational-num | marital-status | occupation | relationship | race | gender | capital-gain | capital-loss | hours-per-week | native-country | income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 25 | Private | 226802 | 11th | 7 | Never-married | Machine-op-inspct | Own-child | Black | Male | 0 | 0 | 40 | United-States | <=50K |
| 1 | 38 | Private | 89814 | HS-grad | 9 | Married-civ-spouse | Farming-fishing | Husband | White | Male | 0 | 0 | 50 | United-States | <=50K |
| 2 | 28 | Local-gov | 336951 | Assoc-acdm | 12 | Married-civ-spouse | Protective-serv | Husband | White | Male | 0 | 0 | 40 | United-States | >50K |
| 3 | 44 | Private | 160323 | Some-college | 10 | Married-civ-spouse | Machine-op-inspct | Husband | Black | Male | 7688 | 0 | 40 | United-States | >50K |
| 4 | 18 | ? | 103497 | Some-college | 10 | Never-married | ? | Own-child | White | Female | 0 | 0 | 30 | United-States | <=50K |


In [3]:
# For convenience, we'll replace '-' with '_'
df.columns = [c.replace("-", "_") for c in df.columns]
# binary target
df["income_label"] = (df["income"].apply(lambda x: ">50K" in x)).astype(int)
df.drop("income", axis=1, inplace=True)
df.head()

| Unnamed: 0 | age | workclass | fnlwgt | education | educational_num | marital_status | occupation | relationship | race | gender | capital_gain | capital_loss | hours_per_week | native_country | income_label |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 25 | Private | 226802 | 11th | 7 | Never-married | Machine-op-inspct | Own-child | Black | Male | 0 | 0 | 40 | United-States | 0 |
| 1 | 38 | Private | 89814 | HS-grad | 9 | Married-civ-spouse | Farming-fishing | Husband | White | Male | 0 | 0 | 50 | United-States | 0 |
| 2 | 28 | Local-gov | 336951 | Assoc-acdm | 12 | Married-civ-spouse | Protective-serv | Husband | White | Male | 0 | 0 | 40 | United-States | 1 |
| 3 | 44 | Private | 160323 | Some-college | 10 | Married-civ-spouse | Machine-op-inspct | Husband | Black | Male | 7688 | 0 | 40 | United-States | 1 |
| 4 | 18 | ? | 103497 | Some-college | 10 | Never-married | ? | Own-child | White | Female | 0 | 0 | 30 | United-States | 0 |


### Preparing the data

Have a look at notebooks one and two if you want a good understanding of the next few lines of code (although this is not required to use the package).

In [4]:
wide_cols = [
    "education",
    "relationship",
    "workclass",
    "occupation",
    "native_country",
    "gender",
]
crossed_cols = [("education", "occupation"), ("native_country", "occupation")]
cat_embed_cols = [
    ("education", 16),
    ("relationship", 8),
    ("workclass", 16),
    ("occupation", 16),
    ("native_country", 16),
]
continuous_cols = ["age", "hours_per_week"]
target_col = "income_label"

In [5]:
# TARGET
target = df[target_col].values

# WIDE
wide_preprocessor = WidePreprocessor(wide_cols=wide_cols, crossed_cols=crossed_cols)
X_wide = wide_preprocessor.fit_transform(df)

# DEEP
tab_preprocessor = TabPreprocessor(
    embed_cols=cat_embed_cols, continuous_cols=continuous_cols
)
X_tab = tab_preprocessor.fit_transform(df)

In [6]:
print(X_wide)
print(X_wide.shape)

[[  1  17  23 ...  89  91 316]
 [  2  18  23 ...  89  92 317]
 [  3  18  24 ...  89  93 318]
 ...
 [  2  20  23 ...  90 103 323]
 [  2  17  23 ...  89 103 323]
 [  2  21  29 ...  90 115 324]]
(48842, 8)
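The eight columns of `X_wide` correspond to the six `wide_cols` plus the two `crossed_cols`, and every (column, value) pair is mapped to its own integer, which is why `np.unique(X_wide).shape[0]` is later used as `wide_dim`. The following is a simplified sketch of that idea on three rows taken from `df.head()`. The `encode_wide` helper is hypothetical and is not the library's implementation; the separators used to build the crossed features are assumptions.

```python
# Simplified illustration of wide-style encoding; NOT the WidePreprocessor code.
rows = [
    {"education": "11th", "occupation": "Machine-op-inspct", "gender": "Male"},
    {"education": "HS-grad", "occupation": "Farming-fishing", "gender": "Male"},
    {"education": "Assoc-acdm", "occupation": "Protective-serv", "gender": "Male"},
]
toy_wide_cols = ["education", "occupation", "gender"]
toy_crossed_cols = [("education", "occupation")]


def encode_wide(rows, wide_cols, crossed_cols):
    """Hypothetical helper: add crossed features, then label-encode with one shared vocabulary."""
    expanded = []
    for row in rows:
        new_row = {c: row[c] for c in wide_cols}
        for cols in crossed_cols:
            # A crossed feature is the combination of the values of its columns
            new_row["_".join(cols)] = "-".join(row[c] for c in cols)
        expanded.append(new_row)

    # One vocabulary shared by all columns; indices start at 1 so that 0 stays free
    vocab = {}
    for col in expanded[0]:
        for row in expanded:
            key = (col, row[col])
            if key not in vocab:
                vocab[key] = len(vocab) + 1

    encoded = [[vocab[(col, row[col])] for col in row] for row in expanded]
    return encoded, vocab


X_wide_toy, vocab = encode_wide(rows, toy_wide_cols, toy_crossed_cols)
print(X_wide_toy)   # one integer per column, unique across the whole matrix
print(len(vocab))   # number of distinct (column, value) pairs
```

Because indices are never shared between columns, a single embedding lookup over the whole vocabulary can act as the linear (wide) component.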


In [7]:
print(X_tab)
print(X_tab.shape)

[[ 1.          1.          1.         ...  1.         -0.99512893
  -0.03408696]
 [ 2.          2.          1.         ...  1.         -0.04694151
   0.77292975]
 [ 3.          2.          2.         ...  1.         -0.77631645
  -0.03408696]
 ...
 [ 2.          4.          1.         ...  1.          1.41180837
  -0.03408696]
 [ 2.          1.          1.         ...  1.         -1.21394141
  -1.64812038]
 [ 2.          5.          7.         ...  1.          0.97418341
  -0.03408696]]
(48842, 7)
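The seven columns of `X_tab` are the five label-encoded `cat_embed_cols` followed by the two `continuous_cols`, and the last two columns appear to be standardized. As a minimal sketch, assuming plain zero-mean, unit-variance scaling, this is what that transformation looks like on the five rows shown by `df.head()`. The numbers will not match the output above, because there the statistics are computed over the full dataset, and the `standardize` helper is hypothetical.

```python
from statistics import fmean, pstdev

# Values taken from the first five rows of the dataframe
ages = [25, 38, 28, 44, 18]
hours_per_week = [40, 50, 40, 40, 30]


def standardize(values):
    """Hypothetical helper: z = (x - mean) / std, using the population std."""
    mean = fmean(values)
    std = pstdev(values)
    return [(v - mean) / std for v in values]


age_z = standardize(ages)
hours_z = standardize(hours_per_week)
print([round(z, 3) for z in age_z])
print([round(z, 3) for z in hours_z])
```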


As you can see, preparing the data for a wide and deep model takes just a few lines of code.

Let's now see how to use `WideDeep` with varying parameters.

### Dropout and Batchnorm

In [8]:
# ?TabMlp

In [9]:
wide = Wide(wide_dim=np.unique(X_wide).shape[0], pred_dim=1)
# We can add dropout and batchnorm to the dense layers, as well as choose the order of the operations
deeptabular = TabMlp(
    column_idx=tab_preprocessor.column_idx,
    mlp_hidden_dims=[64, 32],
    mlp_dropout=[0.5, 0.5],
    mlp_batchnorm=True,
    mlp_linear_first=True,
    embed_input=tab_preprocessor.embeddings_input,
    continuous_cols=continuous_cols,
)
model = WideDeep(wide=wide, deeptabular=deeptabular)

In [10]:
model

WideDeep(
  (wide): Wide(
    (wide_linear): Embedding(797, 1, padding_idx=0)
  )
  (deeptabular): Sequential(
    (0): TabMlp(
      (cat_embed_and_cont): CatEmbeddingsAndCont(
        (embed_layers): ModuleDict(
          (emb_layer_education): Embedding(17, 16, padding_idx=0)
          (emb_layer_native_country): Embedding(43, 16, padding_idx=0)
          (emb_layer_occupation): Embedding(16, 16, padding_idx=0)
          (emb_layer_relationship): Embedding(7, 8, padding_idx=0)
          (emb_layer_workclass): Embedding(10, 16, padding_idx=0)
        )
        (embedding_dropout): Dropout(p=0.1, inplace=False)
        (cont_norm): BatchNorm1d(2, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      )
      (tab_mlp): MLP(
        (mlp): Sequential(
          (dense_layer_0): Sequential(
            (0): Linear(in_features=74, out_features=64, bias=False)
            (1): ReLU(inplace=True)
            (2): BatchNorm1d(64, eps=1e-05, momentum=0.1, affine=True, track_ru
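
The `in_features=74` of the first dense layer in the printout above can be checked by hand: the MLP receives the concatenation of all categorical embeddings plus the continuous columns. A small sketch of that arithmetic, reusing the column definitions from earlier in the notebook:

```python
cat_embed_cols = [
    ("education", 16),
    ("relationship", 8),
    ("workclass", 16),
    ("occupation", 16),
    ("native_country", 16),
]
continuous_cols = ["age", "hours_per_week"]

# Sum of the embedding dimensions of all categorical columns
embed_dim_total = sum(dim for _, dim in cat_embed_cols)
# Each continuous column contributes one extra input feature
mlp_input_dim = embed_dim_total + len(continuous_cols)

print(embed_dim_total, mlp_input_dim)  # 72 74
```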

We can use different initializers, optimizers and learning rate schedulers for each branch (component) of the model.

### Optimizers, LR schedulers, Initializers and Callbacks

In [11]:
from pytorch_widedeep.initializers import KaimingNormal, XavierNormal
from pytorch_widedeep.callbacks import ModelCheckpoint, LRHistory, EarlyStopping
from pytorch_widedeep.optim import RAdam

In [12]:
# Optimizers
wide_opt = torch.optim.Adam(model.wide.parameters(), lr=0.03)
deep_opt = RAdam(model.deeptabular.parameters(), lr=0.01)
# LR Schedulers
wide_sch = torch.optim.lr_scheduler.StepLR(wide_opt, step_size=3)
deep_sch = torch.optim.lr_scheduler.StepLR(deep_opt, step_size=5)

The component-dependent settings must be passed as dictionaries, while general settings are simply lists.

In [13]:
# Component-dependent settings as Dict
optimizers = {"wide": wide_opt, "deeptabular": deep_opt}
schedulers = {"wide": wide_sch, "deeptabular": deep_sch}
initializers = {"wide": KaimingNormal, "deeptabular": XavierNormal}
# General settings as List
callbacks = [
    LRHistory(n_epochs=10),
    EarlyStopping,
    ModelCheckpoint(filepath="model_weights/wd_out"),
]
metrics = [Accuracy, Recall]

In [14]:
trainer = Trainer(
    model,
    objective="binary",
    optimizers=optimizers,
    lr_schedulers=schedulers,
    initializers=initializers,
    callbacks=callbacks,
    metrics=metrics,
)

In [15]:
trainer.fit(
    X_wide=X_wide,
    X_tab=X_tab,
    target=target,
    n_epochs=10,
    batch_size=256,
    val_split=0.2,
)

epoch 1: 100%|██████████| 153/153 [00:03<00:00, 42.78it/s, loss=0.562, metrics={'acc': 0.7779, 'rec': 0.488}] 
valid: 100%|██████████| 39/39 [00:00<00:00, 54.81it/s, loss=0.374, metrics={'acc': 0.8363, 'rec': 0.5684}]
epoch 2: 100%|██████████| 153/153 [00:03<00:00, 44.03it/s, loss=0.373, metrics={'acc': 0.8277, 'rec': 0.5535}]
valid: 100%|██████████| 39/39 [00:00<00:00, 108.54it/s, loss=0.359, metrics={'acc': 0.8361, 'rec': 0.5915}]
epoch 3: 100%|██████████| 153/153 [00:03<00:00, 41.40it/s, loss=0.354, metrics={'acc': 0.8354, 'rec': 0.5686}]
valid: 100%|██████████| 39/39 [00:00<00:00, 100.84it/s, loss=0.355, metrics={'acc': 0.8378, 'rec': 0.5346}]
epoch 4: 100%|██████████| 153/153 [00:03<00:00, 43.49it/s, loss=0.346, metrics={'acc': 0.8381, 'rec': 0.5653}]
valid: 100%|██████████| 39/39 [00:00<00:00, 117.29it/s, loss=0.352, metrics={'acc': 0.8388, 'rec': 0.5633}]
epoch 5: 100%|██████████| 153/153 [00:03<00:00, 39.83it/s, loss=0.343, metrics={'acc': 0.8396, 'rec': 0.5669}]
valid: 100%|██

Model weights after training corresponds to the those of the final epoch which might not be the best performing weights. Usethe 'ModelCheckpoint' Callback to restore the best epoch weights.


Among its many methods and attributes, the `Trainer` exposes the `history` and `lr_history` attributes:

In [16]:
print(trainer.history)

{'train_loss': [0.5623695554296955, 0.3727661143330967, 0.3543393321676192, 0.3463186333382052, 0.34326155766162997, 0.34202106482063244, 0.34081082270036334, 0.34090089836930915, 0.3412071953411975, 0.3405635129002964], 'train_acc': [0.7778517134594221, 0.8277071123282062, 0.8353594553783943, 0.8381235123998669, 0.83960791339288, 0.8405804519745093, 0.8406572313362168, 0.8396590996340184, 0.840682824456786, 0.8404268932510941], 'train_rec': [0.4879666268825531, 0.5535351634025574, 0.5686169862747192, 0.5653011202812195, 0.5669055581092834, 0.5757834911346436, 0.5663707256317139, 0.5730024576187134, 0.5709701776504517, 0.5751417279243469], 'val_loss': [0.374390076368283, 0.35924087579433733, 0.354536472986906, 0.35208039711683226, 0.35081761387678295, 0.3504261534947615, 0.350106044457509, 0.34991710613935423, 0.35027056473952073, 0.34997811913490295], 'val_acc': [0.8363189681646023, 0.8361142389190296, 0.8377520728836114, 0.8387757191114751, 0.8387757191114751, 0.8400040945849114, 0.8
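
Since the weights after training are those of the final epoch, it can be useful to check which epoch actually had the lowest validation loss. A minimal sketch, assuming the `val_loss` values are copied by hand from the history printed above (in a live session one would read `trainer.history["val_loss"]` instead):

```python
# Validation losses copied from the printed history above
val_loss = [
    0.374390076368283, 0.35924087579433733, 0.354536472986906,
    0.35208039711683226, 0.35081761387678295, 0.3504261534947615,
    0.350106044457509, 0.34991710613935423, 0.35027056473952073,
    0.34997811913490295,
]

# Index of the smallest validation loss; epochs are reported 1-based
best_idx = min(range(len(val_loss)), key=val_loss.__getitem__)
best_epoch = best_idx + 1

print(f"best epoch: {best_epoch}, val_loss: {val_loss[best_idx]:.4f}")
# best epoch: 8, val_loss: 0.3499
```

Here the best validation loss occurs at epoch 8 rather than at the final epoch, although the difference is very small.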

In [17]:
print(trainer.lr_history)

{'lr_wide_0': [0.03, 0.03, 0.03, 0.003, 0.003, 0.003, 0.00030000000000000003, 0.00030000000000000003, 0.00030000000000000003, 3.0000000000000004e-05], 'lr_deeptabular_0': [0.01, 0.01, 0.01, 0.01, 0.01, 0.001, 0.001, 0.001, 0.001, 0.001]}


We can see that the learning rate decreases by a factor of 0.1 (the default `gamma` of `StepLR`) every `step_size` epochs. Note that the keys of the dictionary have a suffix `_0`. This is because if you pass different parameter groups to the torch optimizers, these will also be recorded. We'll see this in the `Regression` notebook.
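
The recorded values can be reproduced with the closed-form step decay rule, `lr = base_lr * gamma ** (epoch // step_size)`. The sketch below is plain Python and does not call PyTorch; `step_lr_schedule` is a hypothetical helper written only for illustration. The last digits may differ slightly from the recorded history (for example `0.00030000000000000003`), since floating-point rounding depends on how the product is computed.

```python
def step_lr_schedule(base_lr, step_size, n_epochs, gamma=0.1):
    """Hypothetical helper: learning rate used during each epoch under step decay."""
    return [base_lr * gamma ** (epoch // step_size) for epoch in range(n_epochs)]


# Same settings as the optimizers and schedulers defined above
wide_lrs = step_lr_schedule(base_lr=0.03, step_size=3, n_epochs=10)
deep_lrs = step_lr_schedule(base_lr=0.01, step_size=5, n_epochs=10)

print([f"{lr:.0e}" for lr in wide_lrs])
print([f"{lr:.0e}" for lr in deep_lrs])
```

The wide component is decayed three times within the 10 epochs (after epochs 3, 6 and 9), while the deep component is decayed only once (after epoch 5).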

By now you should have a good idea of how to use the package.

Before we leave this notebook, note that the `Trainer` class comes with a possibly useful method, `get_embeddings`, which I intend to deprecate in favor of `Tab2Vec`. This method is designed to "rescue" the learned embeddings. For example, let's say I want to use the embeddings learned for the different levels of the categorical feature `education`:

In [18]:
trainer.get_embeddings(
    col_name="education", cat_encoding_dict=tab_preprocessor.label_encoder.encoding_dict
)



{'11th': array([-0.3475832 ,  0.34912273, -0.11974874,  0.14691196, -0.22545682,
        -0.3613695 , -0.00136127, -0.0563265 ,  0.3466888 ,  0.11706785,
        -0.01166581, -0.01369573, -0.17875178,  0.18713965,  0.2914308 ,
        -0.198182  ], dtype=float32),
 'HS-grad': array([ 0.09942148, -0.33260158,  0.2164713 , -0.2940495 ,  0.22636804,
         0.12042803, -0.07338171,  0.17187971, -0.12905738,  0.3129245 ,
        -0.31488863, -0.17345233,  0.32477817,  0.00439972,  0.39258945,
        -0.14481816], dtype=float32),
 'Assoc-acdm': array([-0.00751864, -0.1771137 ,  0.06895561, -0.21083945,  0.23953192,
        -0.6551445 ,  0.01284237, -0.0050387 , -0.07738334,  0.00540992,
         0.0681937 ,  0.05531053, -0.4259041 , -0.1871334 , -0.04381247,
         0.32671115], dtype=float32),
 'Some-college': array([ 0.01929094,  0.10994322,  0.36765632, -0.23809849,  0.10644584,
        -0.19297272, -0.39444843,  0.32810718, -0.05060181,  0.4375799 ,
         0.34009618, -0.30499312, 