## 1. Simple Binary Classification with defaults

In this notebook we will use the Adult Census dataset. Download the data from [here](https://www.kaggle.com/wenruliu/adult-income-dataset/downloads/adult.csv/2).

In [1]:
import numpy as np
import pandas as pd
import torch

from pytorch_widedeep.preprocessing import WidePreprocessor, DeepPreprocessor
from pytorch_widedeep.models import Wide, DeepDense, WideDeep
from pytorch_widedeep.metrics import BinaryAccuracy

In [2]:
df = pd.read_csv('data/adult/adult.csv.zip')
df.head()

| | age | workclass | fnlwgt | education | educational-num | marital-status | occupation | relationship | race | gender | capital-gain | capital-loss | hours-per-week | native-country | income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 25 | Private | 226802 | 11th | 7 | Never-married | Machine-op-inspct | Own-child | Black | Male | 0 | 0 | 40 | United-States | <=50K |
| 1 | 38 | Private | 89814 | HS-grad | 9 | Married-civ-spouse | Farming-fishing | Husband | White | Male | 0 | 0 | 50 | United-States | <=50K |
| 2 | 28 | Local-gov | 336951 | Assoc-acdm | 12 | Married-civ-spouse | Protective-serv | Husband | White | Male | 0 | 0 | 40 | United-States | >50K |
| 3 | 44 | Private | 160323 | Some-college | 10 | Married-civ-spouse | Machine-op-inspct | Husband | Black | Male | 7688 | 0 | 40 | United-States | >50K |
| 4 | 18 | ? | 103497 | Some-college | 10 | Never-married | ? | Own-child | White | Female | 0 | 0 | 30 | United-States | <=50K |


In [3]:
# For convenience, we'll do some mild preprocessing
df.columns = [c.replace("-", "_") for c in df.columns]
df['age_buckets'] = pd.cut(df.age, bins=[16, 25, 30, 35, 40, 45, 50, 55, 60, 91], labels=np.arange(9))
df['income_label'] = (df["income"].apply(lambda x: ">50K" in x)).astype(int)
df.drop('income', axis=1, inplace=True)
df.head()

| | age | workclass | fnlwgt | education | educational_num | marital_status | occupation | relationship | race | gender | capital_gain | capital_loss | hours_per_week | native_country | age_buckets | income_label |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 25 | Private | 226802 | 11th | 7 | Never-married | Machine-op-inspct | Own-child | Black | Male | 0 | 0 | 40 | United-States | 0 | 0 |
| 1 | 38 | Private | 89814 | HS-grad | 9 | Married-civ-spouse | Farming-fishing | Husband | White | Male | 0 | 0 | 50 | United-States | 3 | 0 |
| 2 | 28 | Local-gov | 336951 | Assoc-acdm | 12 | Married-civ-spouse | Protective-serv | Husband | White | Male | 0 | 0 | 40 | United-States | 1 | 1 |
| 3 | 44 | Private | 160323 | Some-college | 10 | Married-civ-spouse | Machine-op-inspct | Husband | Black | Male | 7688 | 0 | 40 | United-States | 4 | 1 |
| 4 | 18 | ? | 103497 | Some-college | 10 | Never-married | ? | Own-child | White | Female | 0 | 0 | 30 | United-States | 0 | 0 |
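
As a quick check of the bucketing step, the sketch below reproduces the `age_buckets` values of the first five rows using only the standard library. It assumes the default right-closed intervals of `pd.cut`, so `(16, 25]` maps to bucket 0, `(25, 30]` to bucket 1, and so on. The helper name `age_bucket` is hypothetical and not part of any library.

```python
from bisect import bisect_left

# Bin edges copied from the preprocessing cell
BINS = [16, 25, 30, 35, 40, 45, 50, 55, 60, 91]

def age_bucket(age):
    """Return the index of the right-closed interval (lo, hi] that contains age."""
    # pd.cut would yield NaN for out-of-range values; this sketch raises instead
    if not BINS[0] < age <= BINS[-1]:
        raise ValueError(f"age {age} falls outside the bins")
    # bisect_left finds the first edge >= age; the bucket index is one less
    return bisect_left(BINS, age) - 1

ages = [25, 38, 28, 44, 18]  # ages of the first five rows
buckets = [age_bucket(a) for a in ages]
print(buckets)
```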


### 1.1 Preparing the data 

The so-called `Wide` part of the model is simply a set of one-hot encoded features connected to the output neuron(s) through a linear layer.
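
To make the idea concrete, here is a minimal sketch of how wide-style inputs could be built by hand: each categorical value, and each crossed pair of values, gets its own 0/1 column. This is an illustration only and not how `WidePreprocessor` is implemented; the helper `one_hot_rows` and the toy rows are hypothetical.

```python
def one_hot_rows(rows, cols, crossed_cols=()):
    """One-hot encode `cols`, plus one column per observed crossed pair."""
    def tokens(row):
        # One "column_value" token per plain feature and per crossed feature
        toks = [f"{c}_{row[c]}" for c in cols]
        toks += [f"{a}_{b}_{row[a]}-{row[b]}" for a, b in crossed_cols]
        return toks

    # Vocabulary: every token seen in the data, in sorted order
    vocab = sorted({t for row in rows for t in tokens(row)})
    index = {t: i for i, t in enumerate(vocab)}

    encoded = []
    for row in rows:
        vec = [0] * len(vocab)
        for t in tokens(row):
            vec[index[t]] = 1
        encoded.append(vec)
    return vocab, encoded

rows = [
    {"education": "11th", "occupation": "Machine-op-inspct"},
    {"education": "HS-grad", "occupation": "Farming-fishing"},
    {"education": "Some-college", "occupation": "Machine-op-inspct"},
]
vocab, X = one_hot_rows(rows, ["education", "occupation"],
                        crossed_cols=[("education", "occupation")])
print(len(vocab))
print(X[0])
```

Each row ends up with exactly one active column per plain feature and one per crossed feature, which is why the real `X_wide` is so sparse.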

The so-called `Deep` part, which I will refer to here as `DeepDense` (since there will also be `DeepText` and `DeepImage`), is a set of continuous numerical features, concatenated with embedding representations of the categorical features, passed through a series of dense layers.
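
The input to the dense layers can be pictured as follows. This is a minimal sketch with toy sizes, assuming each categorical value has already been label-encoded to an integer; the names `make_embedding` and `deep_input` are hypothetical, and the random vectors stand in for embeddings that would be learned during training.

```python
import random

random.seed(0)

def make_embedding(n_levels, dim):
    """A lookup table with one randomly initialised vector per category level."""
    return [[random.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_levels)]

# Toy tables: 4 education levels with 3-dim vectors, 3 relationship levels with 2-dim vectors
embeddings = {
    "education": make_embedding(4, 3),
    "relationship": make_embedding(3, 2),
}

def deep_input(cat_codes, continuous):
    """Concatenate the embedding vectors of the categorical codes with the continuous values."""
    vec = []
    for col, code in cat_codes.items():
        vec.extend(embeddings[col][code])  # embedding lookup by integer code
    vec.extend(continuous)                 # continuous features are appended as-is
    return vec

x = deep_input({"education": 2, "relationship": 0}, continuous=[-0.995, -0.034])
print(len(x))  # 3 + 2 embedding values plus 2 continuous values
```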

With this in mind, we prepare the data as follows:

In [4]:
wide_cols = ['age_buckets', 'education', 'relationship','workclass','occupation',
    'native_country','gender']
crossed_cols = [('education', 'occupation'), ('native_country', 'occupation')]
cat_embed_cols = [('education',16), ('relationship',8), ('workclass',16),
    ('occupation',16),('native_country',16)]
continuous_cols = ["age","hours_per_week"]
target_col = 'income_label'

In [5]:
# TARGET
target = df[target_col].values

# WIDE
preprocess_wide = WidePreprocessor(wide_cols=wide_cols, crossed_cols=crossed_cols)
X_wide = preprocess_wide.fit_transform(df)

# DEEP
preprocess_deep = DeepPreprocessor(embed_cols=cat_embed_cols, continuous_cols=continuous_cols)
X_deep = preprocess_deep.fit_transform(df)

In [6]:
print(X_wide)
print(X_wide.shape)

[[1. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 1. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [1. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]
(48842, 805)


In [7]:
print(X_deep)
print(X_deep.shape)

[[ 0.          0.          0.         ...  0.         -0.99512893
  -0.03408696]
 [ 1.          1.          0.         ...  0.         -0.04694151
   0.77292975]
 [ 2.          1.          1.         ...  0.         -0.77631645
  -0.03408696]
 ...
 [ 1.          3.          0.         ...  0.          1.41180837
  -0.03408696]
 [ 1.          0.          0.         ...  0.         -1.21394141
  -1.64812038]
 [ 1.          4.          6.         ...  0.          0.97418341
  -0.03408696]]
(48842, 7)


### 1.2 Define the model

In [8]:
wide = Wide(wide_dim=X_wide.shape[1], output_dim=1)
deepdense = DeepDense(hidden_layers=[64,32], 
                      deep_column_idx=preprocess_deep.deep_column_idx,
                      embed_input=preprocess_deep.embeddings_input,
                      continuous_cols=continuous_cols)
model = WideDeep(wide=wide, deepdense=deepdense)

In [9]:
model

WideDeep(
  (wide): Wide(
    (wide_linear): Linear(in_features=805, out_features=1, bias=True)
  )
  (deepdense): Sequential(
    (0): DeepDense(
      (embed_layers): ModuleDict(
        (emb_layer_education): Embedding(16, 16)
        (emb_layer_native_country): Embedding(42, 16)
        (emb_layer_occupation): Embedding(15, 16)
        (emb_layer_relationship): Embedding(6, 8)
        (emb_layer_workclass): Embedding(9, 16)
      )
      (dense): Sequential(
        (dense_layer_0): Sequential(
          (0): Linear(in_features=74, out_features=64, bias=True)
          (1): LeakyReLU(negative_slope=0.01, inplace=True)
          (2): Dropout(p=0.0, inplace=False)
        )
        (dense_layer_1): Sequential(
          (0): Linear(in_features=64, out_features=32, bias=True)
          (1): LeakyReLU(negative_slope=0.01, inplace=True)
          (2): Dropout(p=0.0, inplace=False)
        )
      )
    )
    (1): Linear(in_features=32, out_features=1, bias=True)
  )
)
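
The `in_features=74` of the first dense layer in the printout can be checked by hand: it should equal the sum of the embedding dimensions plus the number of continuous columns. A short sketch of that arithmetic, using the column definitions from the data-preparation cell:

```python
# (column, embedding dimension) pairs copied from the data-preparation cell
cat_embed_cols = [('education', 16), ('relationship', 8), ('workclass', 16),
                  ('occupation', 16), ('native_country', 16)]
continuous_cols = ["age", "hours_per_week"]

# Each categorical column contributes its embedding dimension,
# each continuous column contributes a single value
embed_total = sum(dim for _, dim in cat_embed_cols)
first_layer_in = embed_total + len(continuous_cols)
print(embed_total, first_layer_in)
```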

As you can see, the model is not particularly complex. In mathematical terms (Eq 3 in the [original paper](https://arxiv.org/pdf/1606.07792.pdf)):

$$
pred = \sigma(W^{T}_{wide}[x, \phi(x)] + W^{T}_{deep}a_{deep}^{(l_f)} +  b) 
$$ 


The architecture above outputs the first and second terms inside the parentheses. `WideDeep` then adds them and applies an activation function (`sigmoid` in this case). For more details, please refer to the paper.
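
For a single example, the equation amounts to summing two linear outputs and squashing the result. The sketch below uses toy numbers and plain Python rather than the library's code; in the printed architecture each branch carries its own bias, which is folded here into a single `b` as in the equation. The function name `wide_deep_pred` is hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def wide_deep_pred(x_wide, a_deep, w_wide, w_deep, b):
    """sigmoid(w_wide . x_wide + w_deep . a_deep + b) for a single example."""
    wide_logit = dot(w_wide, x_wide)  # first term: linear model on [x, phi(x)]
    deep_logit = dot(w_deep, a_deep)  # second term: linear layer on the last hidden activations
    return sigmoid(wide_logit + deep_logit + b)

# Toy numbers, purely illustrative
x_wide = [1.0, 0.0, 1.0]   # one-hot and crossed features
a_deep = [0.5, -1.0]       # output of the last hidden layer
w_wide = [0.2, -0.4, 0.1]
w_deep = [0.6, 0.3]
b = 0.0

p = wide_deep_pred(x_wide, a_deep, w_wide, w_deep, b)
print(round(p, 4))
```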

### 1.3 Compiling and Running/Fitting

Once the model is built, we just need to compile it and run it.

In [10]:
model.compile(method='binary', metrics=[BinaryAccuracy])

In [11]:
model.fit(X_wide=X_wide, X_deep=X_deep, target=target, n_epochs=5, batch_size=256, val_split=0.2)

epoch 1: 100%|██████████| 153/153 [00:01<00:00, 88.02it/s, loss=0.404, metrics={'acc': 0.812}] 
valid: 100%|██████████| 39/39 [00:00<00:00, 140.39it/s, loss=0.356, metrics={'acc': 0.8165}]
epoch 2: 100%|██████████| 153/153 [00:01<00:00, 91.37it/s, loss=0.348, metrics={'acc': 0.8357}]
valid: 100%|██████████| 39/39 [00:00<00:00, 122.80it/s, loss=0.35, metrics={'acc': 0.8361}]
epoch 3: 100%|██████████| 153/153 [00:01<00:00, 89.79it/s, loss=0.343, metrics={'acc': 0.8382}]
valid: 100%|██████████| 39/39 [00:00<00:00, 97.31it/s, loss=0.348, metrics={'acc': 0.8382}]
epoch 4: 100%|██████████| 153/153 [00:01<00:00, 96.99it/s, loss=0.34, metrics={'acc': 0.8398}]  
valid: 100%|██████████| 39/39 [00:00<00:00, 137.56it/s, loss=0.348, metrics={'acc': 0.8395}]
epoch 5: 100%|██████████| 153/153 [00:01<00:00, 97.95it/s, loss=0.338, metrics={'acc': 0.8411}] 
valid: 100%|██████████| 39/39 [00:00<00:00, 133.39it/s, loss=0.347, metrics={'acc': 0.8406}]
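
The `153` and `39` in the progress bars are the number of batches per epoch. Here is a quick sanity check of those counts; the rounding is an assumption (validation size rounded up, final partial batch kept), not something taken from the library's source.

```python
import math

n_rows = 48842     # number of rows, from X_wide.shape
val_split = 0.2
batch_size = 256

# Assumption: the validation size is rounded up and the rest is used for training
n_val = math.ceil(n_rows * val_split)
n_train = n_rows - n_val

# The last, smaller batch still counts as one step
train_steps = math.ceil(n_train / batch_size)
val_steps = math.ceil(n_val / batch_size)
print(n_train, n_val, train_steps, val_steps)
```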


As you can see, you can run a wide and deep model in slightly over a dozen lines of code:

```
wide_cols = ['age_buckets', 'education', 'relationship','workclass','occupation', 'native_country','gender']
crossed_cols = [('education', 'occupation'), ('native_country', 'occupation')]
cat_embed_cols = [('education',16), ('relationship',8), ('workclass',16),
    ('occupation',16),('native_country',16)]
continuous_cols = ["age","hours_per_week"]
target_col = 'income_label'
target = df[target_col].values
preprocess_wide = WidePreprocessor(wide_cols=wide_cols, crossed_cols=crossed_cols)
X_wide = preprocess_wide.fit_transform(df)
preprocess_deep = DeepPreprocessor(embed_cols=cat_embed_cols, continuous_cols=continuous_cols)
X_deep = preprocess_deep.fit_transform(df)
wide = Wide(wide_dim=X_wide.shape[1], output_dim=1)
deepdense = DeepDense(hidden_layers=[64,32], 
                      deep_column_idx=preprocess_deep.deep_column_idx,
                      embed_input=preprocess_deep.embeddings_input,
                      continuous_cols=continuous_cols)
model = WideDeep(wide=wide, deepdense=deepdense)
model.compile(method='binary', metrics=[BinaryAccuracy])
model.fit(X_wide=X_wide, X_deep=X_deep, target=target, n_epochs=5, batch_size=256, val_split=0.2)
```

Let's now see how to use `WideDeep` with varying parameters.

## 2. Binary Classification with varying parameters

In [13]:
wide = Wide(wide_dim=X_wide.shape[1], output_dim=1)
# We can add dropout and batchnorm to the dense layers
deepdense = DeepDense(hidden_layers=[64,32], dropout=[0.5, 0.5], batchnorm=True,
                      deep_column_idx=preprocess_deep.deep_column_idx,
                      embed_input=preprocess_deep.embeddings_input,
                      continuous_cols=continuous_cols)
model = WideDeep(wide=wide, deepdense=deepdense)

In [21]:
model

WideDeep(
  (wide): Wide(
    (wide_linear): Linear(in_features=805, out_features=1, bias=True)
  )
  (deepdense): Sequential(
    (0): DeepDense(
      (embed_layers): ModuleDict(
        (emb_layer_education): Embedding(16, 16)
        (emb_layer_native_country): Embedding(42, 16)
        (emb_layer_occupation): Embedding(15, 16)
        (emb_layer_relationship): Embedding(6, 8)
        (emb_layer_workclass): Embedding(9, 16)
      )
      (dense): Sequential(
        (dense_layer_0): Sequential(
          (0): Linear(in_features=74, out_features=64, bias=True)
          (1): BatchNorm1d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
          (2): LeakyReLU(negative_slope=0.01, inplace=True)
          (3): Dropout(p=0.5, inplace=False)
        )
        (dense_layer_1): Sequential(
          (0): Linear(in_features=64, out_features=32, bias=True)
          (1): BatchNorm1d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
          (2): LeakyRe

We can use different initializers, optimizers and learning rate schedulers for each `branch` of the model.

In [22]:
from pytorch_widedeep.initializers import KaimingNormal, XavierNormal
from pytorch_widedeep.callbacks import ModelCheckpoint, LRHistory, EarlyStopping
from pytorch_widedeep.optim import RAdam

In [27]:
# Optimizers
wide_opt = torch.optim.Adam(model.wide.parameters())
deep_opt = RAdam(model.deepdense.parameters())
# LR Schedulers
wide_sch = torch.optim.lr_scheduler.StepLR(wide_opt, step_size=3)
deep_sch = torch.optim.lr_scheduler.StepLR(deep_opt, step_size=5)

Components that are specific to a model branch must be passed as dictionaries keyed by branch name, while general components are simply passed as lists.

In [28]:
optimizers = {'wide': wide_opt, 'deepdense':deep_opt}
schedulers = {'wide': wide_sch, 'deepdense':deep_sch}
initializers = {'wide': KaimingNormal, 'deepdense':XavierNormal}
callbacks = [LRHistory(n_epochs=10), EarlyStopping, ModelCheckpoint(filepath='model_weights/wd_out')]
metrics = [BinaryAccuracy]

In [29]:
model.compile(method='binary', optimizers=optimizers, lr_schedulers=schedulers, 
              initializers=initializers,
              callbacks=callbacks,
              metrics=metrics)

In [30]:
model.fit(X_wide=X_wide, X_deep=X_deep, target=target, n_epochs=10, batch_size=256, val_split=0.2)

epoch 1: 100%|██████████| 153/153 [00:01<00:00, 77.95it/s, loss=0.618, metrics={'acc': 0.6819}]
valid: 100%|██████████| 39/39 [00:00<00:00, 141.69it/s, loss=0.417, metrics={'acc': 0.7088}]
epoch 2: 100%|██████████| 153/153 [00:02<00:00, 73.45it/s, loss=0.453, metrics={'acc': 0.7875}]
valid: 100%|██████████| 39/39 [00:00<00:00, 141.03it/s, loss=0.375, metrics={'acc': 0.7954}]
epoch 3: 100%|██████████| 153/153 [00:02<00:00, 73.42it/s, loss=0.407, metrics={'acc': 0.8123}]
valid: 100%|██████████| 39/39 [00:00<00:00, 140.30it/s, loss=0.362, metrics={'acc': 0.8159}]
epoch 4: 100%|██████████| 153/153 [00:02<00:00, 76.28it/s, loss=0.385, metrics={'acc': 0.8197}]
valid: 100%|██████████| 39/39 [00:00<00:00, 124.08it/s, loss=0.357, metrics={'acc': 0.8224}]
epoch 5: 100%|██████████| 153/153 [00:02<00:00, 76.28it/s, loss=0.374, metrics={'acc': 0.8247}]
valid: 100%|██████████| 39/39 [00:00<00:00, 136.63it/s, loss=0.355, metrics={'acc': 0.8261}]
epoch 6: 100%|██████████| 153/153 [00:02<00:00, 71.99it

In [31]:
dir(model)

['__call__',
 '__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattr__',
 '__getattribute__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__setstate__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_activation_fn',
 '_apply',
 '_backend',
 '_backward_hooks',
 '_buffers',
 '_construct',
 '_forward_hooks',
 '_forward_pre_hooks',
 '_get_name',
 '_load_from_state_dict',
 '_load_state_dict_pre_hooks',
 '_loss_fn',
 '_lr_scheduler_step',
 '_modules',
 '_named_members',
 '_parameters',
 '_predict',
 '_register_load_state_dict_pre_hook',
 '_register_state_dict_hook',
 '_save_to_state_dict',
 '_slow_forward',
 '_state_dict_hooks',
 '_tracing_name',
 '_train_val_split',
 '_training_step',
 '_validation_step',
 '_version',
 'add_module',
 'apply',
 'batch_size',
 'buffers',
 'call

You can see that, among many methods and attributes, we have the `history` and `lr_history` attributes.

In [32]:
model.history.epoch

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

In [33]:
model.history._history

{'train_loss': [0.6175944029895308,
  0.45259543707947325,
  0.4069893552976496,
  0.38474080356117946,
  0.3735188660668392,
  0.37029798708710016,
  0.36808953779974796,
  0.3682427739395815,
  0.36711323962492104,
  0.36631106590133866],
 'train_acc': [0.6819,
  0.7875,
  0.8123,
  0.8197,
  0.8247,
  0.8273,
  0.8292,
  0.8272,
  0.8289,
  0.8269],
 'val_loss': [0.4171598324408898,
  0.3752642174561818,
  0.36203143841181046,
  0.35711457561223936,
  0.3552806025896317,
  0.35450722697453624,
  0.3542046585144141,
  0.3536104666881072,
  0.3536183413786766,
  0.35328011482189864],
 'val_acc': [0.7088,
  0.7954,
  0.8159,
  0.8224,
  0.8261,
  0.8284,
  0.8301,
  0.8284,
  0.8297,
  0.8283]}

In [34]:
model.lr_history

{'lr_wide_0': [0.001,
  0.001,
  0.001,
  0.0001,
  0.0001,
  0.0001,
  1.0000000000000003e-05,
  1.0000000000000003e-05,
  1.0000000000000003e-05,
  1.0000000000000002e-06],
 'lr_deepdense_0': [0.001,
  0.001,
  0.001,
  0.001,
  0.001,
  0.0001,
  0.0001,
  0.0001,
  0.0001,
  0.0001]}

We can see that the learning rate is multiplied by 0.1 (the default) every `step_size` epochs. Note that the keys of the dictionary have a suffix `_0`. This is because, if you pass different parameter groups to the torch optimizers, the learning rate of each group is also recorded. We'll see this in the `Regression` notebook.
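
The recorded values follow a simple step decay, which can be reproduced in closed form. This is a sketch of the schedule only, not of the `StepLR` implementation; the function name `step_lr` is hypothetical, and the base learning rate of `0.001` is read off the recorded history above.

```python
def step_lr(base_lr, step_size, n_epochs, gamma=0.1):
    """Learning rate per epoch when it is multiplied by gamma every step_size epochs."""
    return [base_lr * gamma ** (epoch // step_size) for epoch in range(n_epochs)]

wide_lrs = step_lr(base_lr=0.001, step_size=3, n_epochs=10)
deep_lrs = step_lr(base_lr=0.001, step_size=5, n_epochs=10)
print(wide_lrs)
print(deep_lrs)
```

The results may differ from the recorded history in the last floating-point digits (for example `1.0000000000000003e-05`), presumably because the scheduler applies the multiplication step by step rather than in closed form.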

By now you should have a good idea of how to use the package. Before leaving this notebook, note that the `WideDeep` class comes with a useful method to "rescue" the learned embeddings. For example, suppose I want to use the embeddings learned for the different levels of the categorical feature `education`:

In [36]:
model.get_embeddings(col_name='education', cat_encoding_dict=preprocess_deep.encoding_dict)

{'11th': array([-0.08673421, -0.20611182,  0.59797794,  0.20942385, -0.1655451 ,
        -0.35815904, -0.18617755, -0.03302404,  0.35232136,  0.04572977,
         0.3746049 , -0.17023513, -0.28413588, -0.0887626 , -0.131666  ,
        -0.00224428], dtype=float32),
 'HS-grad': array([ 0.52553725,  0.02418377,  0.01348715,  0.08806376,  0.15482062,
         0.3125567 , -0.04339381,  0.20687783, -0.13482526,  0.24505831,
        -0.01041402, -0.01626713, -0.25524345,  0.0543058 , -0.00122186,
         0.17666139], dtype=float32),
 'Assoc-acdm': array([ 0.203534  ,  0.04936524,  0.03697095,  0.08620811, -0.22591479,
         0.03520108,  0.05318382, -0.15239732, -0.27075678, -0.12126133,
         0.2704458 , -0.15518834,  0.24027239,  0.05669733, -0.3112209 ,
         0.03612295], dtype=float32),
 'Some-college': array([-0.08446952,  0.36122507, -0.33226198,  0.35516104, -0.09688068,
         0.03067617, -0.24075967,  0.06579152, -0.27671552,  0.57062864,
         0.27405295,  0.04844938, 