# Deep Networks with Pytorch-Tabular

- branch: master
- toc: true 
- badges: false
- comments: false
- sticky_rank: 5
- author: Huon Fraser
- categories: [mangoes]

In [1]:
#collapse-hide
import sys
sys.path.append('/notebooks/Mangoes/src/')
model_path  = '../models/'

import pathlib
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
from tqdm.notebook import tqdm

from codetiming import Timer
from sklearn.model_selection import GroupKFold
from scikit_models import *
from skopt.space import Real, Integer
from lwr import LocalWeightedRegression
from sklearn.pipeline import Pipeline

import warnings
warnings.filterwarnings('ignore')

In [2]:
#collapse-hide
mangoes=load_mangoes()

train_data,test_data = train_test_split(mangoes)
train_X, train_y, train_cat = X_y_cat(train_data,min_X=684,max_X=990)
test_X, test_y, test_cat = X_y_cat(test_data,min_X=684,max_X=990)
nrow,ncol=train_X.shape
groups = train_cat['Pop']
splitter=GroupKFold()

PyTorch, the neural-network platform of choice for this project, requires users to define their own training and validation loops. While this provides excellent flexibility, higher-level APIs like [pytorch-lightning](https://www.pytorchlightning.ai/) and [fastai](https://www.fast.ai/) cut out much of the boilerplate and integrate functionality like callbacks and logging. The [pytorch-tabular](https://pytorch-tabular.readthedocs.io/en/latest/) library builds upon PyTorch Lightning to provide an API for dealing with tabular data.

In this notebook, we work through building an MLP model. In the next notebook, we will cover training and testing in a manner that is consistent with our earlier sklearn models.

In [3]:
#collapse-hide
import torch
import torch.nn as nn
import torch.nn.functional as F
from omegaconf import DictConfig
from typing import Dict
from dataclasses import dataclass, field

from pytorch_tabular import TabularModel
from pytorch_tabular.config import DataConfig,OptimizerConfig, TrainerConfig, ExperimentConfig,ModelConfig
from pytorch_tabular.models import BaseModel
from collections import OrderedDict
from pytorch_tabular.models import CategoryEmbeddingModelConfig

## Defining a Model

Within pytorch-tabular, custom networks can be defined by extending the BaseModel class, which in turn extends the PyTorch Lightning LightningModule. All hyperparameters are passed in at initialisation through the config parameter and are accessible from self.hparams once super().\_\_init\_\_() has been called.
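
As a rough illustration of how a dataclass-based config becomes a flat set of hyperparameters, the sketch below uses only the standard library. `BaseConfigSketch` and `MLPConfigSketch` are hypothetical stand-ins, not pytorch-tabular classes, and the real library does more work (for example, adding values inferred from the data) before the settings reach self.hparams.

```python
from dataclasses import dataclass, asdict


@dataclass
class BaseConfigSketch:
    """Hypothetical stand-in for a library-provided model config."""
    task: str = "regression"
    learning_rate: float = 1e-3


@dataclass
class MLPConfigSketch(BaseConfigSketch):
    """Adds one user-controlled hyperparameter on top of the base fields."""
    width: int = 10


# Override only the field we care about; the rest keep their defaults.
config = MLPConfigSketch(width=32)

# A flat mapping like this is, roughly, what a model reads its settings from.
hparams = asdict(config)
print(hparams)
```

The point of the pattern is that adding a new hyperparameter only requires a new field on the config subclass; the model then looks it up by name.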

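Our first step is to write our MLP. We define our network in the \_build_network method: two linear layers (weight matrices with biases) separated by a ReLU activation. The input and output sizes are controlled by hyperparameters inferred from the data, while the width of the hidden layer is a user-controlled parameter. Note that the ReLU line is commented out in the cell below, so the network as run is two stacked linear layers, which is equivalent to a single affine map; uncomment that line to get the nonlinearity.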

We are also required to define the forward method, which controls how data is passed through our network. Its input, x, is a dictionary with continuous and categorical features split into x["continuous"] and x["categorical"]. The output of forward must be a dictionary with the predictions stored under the key "logits". This is messy (and unclear in the documentation). Hopefully, it will be improved in future iterations of the library.

In [4]:
@dataclass
class MLPConfig(ModelConfig):
    width: int = 10
    
class MLP(BaseModel):
    def __init__(
        self,
        config: DictConfig,
        **kwargs
    ):
        super().__init__(config, **kwargs)

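    def _build_network(self):
        # continuous_dim and output_dim are inferred from the data; width comes from MLPConfig
        layers = OrderedDict({'layer_1': nn.Linear(self.hparams["continuous_dim"], self.hparams["width"]),
                              # 'act_1': nn.ReLU(),
                              'layer_2': nn.Linear(self.hparams["width"], self.hparams["output_dim"])
        })
        self.model = nn.Sequential(layers)

    def forward(self, x):
        # x is a dict; this model only uses the continuous features
        x = x["continuous"]
        # call the module itself rather than .forward() so that PyTorch hooks run
        y_hat = self.model(x)
        # predictions must be returned under the "logits" key
        return {'logits': y_hat}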
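
To make the dictionary-in, dictionary-out contract concrete, here is a minimal sketch in plain PyTorch that mimics it without pytorch-tabular. `DictMLPSketch` is a hypothetical name, the sketch includes the ReLU activation described in the text, and the feature count of 103 is an assumption chosen for illustration (it is not stated in this notebook, although it is consistent with the 1.1 K parameters reported in the training summary further down).

```python
from collections import OrderedDict

import torch
import torch.nn as nn


class DictMLPSketch(nn.Module):
    """Two linear layers with a ReLU in between; dict in, dict out."""

    def __init__(self, continuous_dim: int, width: int, output_dim: int):
        super().__init__()
        self.model = nn.Sequential(OrderedDict([
            ("layer_1", nn.Linear(continuous_dim, width)),
            ("act_1", nn.ReLU()),
            ("layer_2", nn.Linear(width, output_dim)),
        ]))

    def forward(self, x):
        # Only the continuous features are used; categorical ones are ignored.
        y_hat = self.model(x["continuous"])
        return {"logits": y_hat}


# Assumed sizes: 103 input features, hidden width 10, one regression target.
net = DictMLPSketch(continuous_dim=103, width=10, output_dim=1)
batch = {"continuous": torch.randn(32, 103), "categorical": torch.empty(32, 0)}
out = net(batch)

# Weights plus biases of both linear layers; the ReLU has no parameters.
n_params = sum(p.numel() for p in net.parameters())
print(out["logits"].shape, n_params)
```

Counting parameters this way is a quick sanity check that the inferred input and output sizes are what you expect before starting a long training run.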

## Configurations

Data is expected to be a single pd.DataFrame including both X and y. This is a departure from the sklearn approach, and in the future, we'll work on a fix for this.
For the time being, we merge our X and y and define as lists the names of our categorical columns, numerical columns, dates, and targets. We pass this metadata into a DataConfig object, which handles loading and transforming data for us. 

Similarly, we initialise a TrainerConfig and an OptimizerConfig, which between them define all the hyperparameters controlling training. We also define a ModelConfig (here, our MLPConfig), which specifies the parameters that determine how our model is built.

In [5]:
from copy import deepcopy

num_col_names = train_X.columns.tolist()
cat_col_names = []

train_Xy = deepcopy(train_X)
test_Xy = deepcopy(test_X)
train_Xy['target']=train_y
test_Xy['target']=test_y

data_config = DataConfig(
    target=['target'], #target should always be a list. Multi-targets are only supported for regression. Multi-Task Classification is not implemented
    continuous_cols=num_col_names,
    categorical_cols=cat_col_names,
)
trainer_config = TrainerConfig(
    auto_lr_find=True, # Runs the LRFinder to automatically derive a learning rate
    batch_size=32,
    max_epochs=100,
    gpus=-1, # index of the GPU to use. -1 means all available GPUs; None means CPU
)

optimizer_config = OptimizerConfig()
model_config = MLPConfig(task="regression",
                         learning_rate=1e-3)  # starting value; with auto_lr_find=True the training log shows it being replaced
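
One detail worth keeping in mind when attaching the target column as above: if the target is a pandas Series, pandas aligns it to the DataFrame by index label, not by position. The sketch below uses made-up toy data (column names and values are purely illustrative, not taken from the mango dataset) to show the difference, and uses `DataFrame.assign` as one alternative to deepcopy followed by item assignment.

```python
import pandas as pd

# Toy stand-ins for train_X and train_y; values are invented for illustration.
X = pd.DataFrame({"684": [0.1, 0.2, 0.3], "687": [0.4, 0.5, 0.6]},
                 index=[10, 11, 12])
y_aligned = pd.Series([1.0, 2.0, 3.0], index=[10, 11, 12])
y_reset = pd.Series([1.0, 2.0, 3.0])  # default index 0, 1, 2

# Matching index labels: every row receives its target.
Xy_ok = X.assign(target=y_aligned)

# Mismatched labels: alignment finds no matches and fills with NaN.
Xy_bad = X.assign(target=y_reset)

# Passing a plain array side-steps alignment and assigns by position.
Xy_pos = X.assign(target=y_reset.to_numpy())

print(Xy_ok["target"].tolist())
print(Xy_bad["target"].isna().all())
print(Xy_pos["target"].tolist())
```

`assign` returns a new DataFrame and leaves `X` untouched, so no explicit copy is needed. Checking the merged frame for NaN targets before training is a cheap guard against a silent misalignment.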

All the pieces are assembled in the TabularModel class. We pass in each config object along with model_callable, a reference to our MLP class.

In [6]:
tabular_model = TabularModel(
    data_config=data_config,
    model_config=model_config,
    optimizer_config=optimizer_config,
    trainer_config=trainer_config,
    model_callable = MLP
)

To train a TabularModel we call fit, passing in our training data (as a pd.DataFrame) and, optionally, validation data.

In [7]:
tabular_model.fit(train=train_Xy, validation=None)

Global seed set to 42
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name  | Type       | Params
-------------------------------------
0 | model | Sequential | 1.1 K 
1 | loss  | MSELoss    | 0     
-------------------------------------
1.1 K     Trainable params
0         Non-trainable params
1.1 K     Total params
0.004     Total estimated model params size (MB)
Global seed set to 42


Finding best initial lr:   0%|          | 0/100 [00:00<?, ?it/s]

LR finder stopped early after 99 steps due to diverging loss.
Restored states from the checkpoint file at /notebooks/Mangoes/_notebooks/lr_find_temp_model.ckpt
Learning rate set to 0.15848931924611143
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name  | Type       | Params
-------------------------------------
0 | model | Sequential | 1.1 K 
1 | loss  | MSELoss    | 0     
-------------------------------------
1.1 K     Trainable params
0         Non-trainable params
1.1 K     Total params
0.004     Total estimated model params size (MB)


Validation sanity check: 0it [00:00, ?it/s]

Global seed set to 42


Training: 98it [00:00, ?it/s]

Validating: 0it [00:00, ?it/s]

Then, to evaluate the model, we call evaluate, passing in the test data.

In [8]:
tabular_model.evaluate(test_Xy) 

LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]


Testing: 0it [00:00, ?it/s]

--------------------------------------------------------------------------------
DATALOADER:0 TEST RESULTS
{'test_mean_squared_error': 1.9310595989227295,
 'test_mean_squared_error_0': 1.9310595989227295}
--------------------------------------------------------------------------------


[{'test_mean_squared_error': 1.9310595989227295,
  'test_mean_squared_error_0': 1.9310595989227295}]
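
The reported metric is a mean squared error. If the earlier scikit-learn results were reported as RMSE (an assumption here; this notebook does not say), taking the square root puts the two on the same scale. A minimal sketch using only the standard library:

```python
import math

# Value copied from the evaluate() output above.
test_mse = 1.9310595989227295

# RMSE is in the same units as the target, which makes it easier to compare.
test_rmse = math.sqrt(test_mse)
print(round(test_rmse, 3))
```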

Notice that there is a disconnect between the PyTorch-Tabular and scikit-learn APIs. In the next part, we will work on an implementation that works the same way for both scikit-learn and PyTorch-Tabular models. We will also introduce cross-validation for our neural networks.