# Deep Networks with Pytorch-Tabular

- branch: master
- toc: true 
- badges: false
- comments: false
- sticky_rank: 5
- author: Huon Fraser
- categories: [mangoes]

In [1]:
#collapse-hide
import sys
sys.path.append('/notebooks/Mangoes/src/')
model_path  = '../models/'

import pathlib
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
from tqdm.notebook import tqdm

from codetiming import Timer
from sklearn.model_selection import GroupKFold
from scikit_models import *
from skopt.space import Real, Integer
from lwr import LocalWeightedRegression
from sklearn.pipeline import Pipeline

import warnings
warnings.filterwarnings('ignore')


In [None]:
#collapse-hide
mangoes=load_mangoes()

train_data,test_data = train_test_split(mangoes)
train_X, train_y, train_cat = X_y_cat(train_data,min_X=684,max_X=990)
test_X, test_y, test_cat = X_y_cat(test_data,min_X=684,max_X=990)
nrow,ncol=train_X.shape
groups = train_cat['Pop']
splitter=GroupKFold()

PyTorch, the neural-network platform of choice for this project, requires users to define their own training and validation loops. While this provides excellent flexibility, higher-level APIs like [pytorch-lightning](https://www.pytorchlightning.ai/) and [fastai](https://www.fast.ai/) cut out much of the boilerplate and integrate functionality like callbacks and logging. The [pytorch-tabular](https://pytorch-tabular.readthedocs.io/en/latest/) library extends the pytorch-lightning API to work better with tabular data.
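To give a sense of the boilerplate involved, here is a minimal sketch of a hand-written training and validation loop in plain PyTorch. The data is synthetic and every name (`X_train`, `val_losses`, the layer sizes, the learning rate) is an assumption made for illustration; it is not the mango data or the configuration used later in this notebook.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Synthetic regression data (hypothetical, stands in for a real dataset)
X = torch.randn(200, 5)
true_w = torch.tensor([[1.0], [-2.0], [0.5], [0.0], [3.0]])
y = X @ true_w + 0.1 * torch.randn(200, 1)
X_train, y_train = X[:160], y[:160]
X_val, y_val = X[160:], y[160:]

model = nn.Sequential(nn.Linear(5, 10), nn.ReLU(), nn.Linear(10, 1))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.MSELoss()

val_losses = []
for epoch in range(50):
    # Training loop: one pass over the training set in mini-batches of 32
    model.train()
    for start in range(0, len(X_train), 32):
        xb = X_train[start:start + 32]
        yb = y_train[start:start + 32]
        optimizer.zero_grad()
        loss = loss_fn(model(xb), yb)
        loss.backward()
        optimizer.step()

    # Validation loop: no gradients are tracked
    model.eval()
    with torch.no_grad():
        val_losses.append(loss_fn(model(X_val), y_val).item())

print(f"validation MSE: first epoch {val_losses[0]:.3f}, last epoch {val_losses[-1]:.3f}")
```

Callbacks, logging, checkpointing and device handling would all have to be added to this loop by hand, which is the part that the higher-level libraries take over.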

In this notebook we work through building an MLP model, and then training and testing it in a manner that is consistent with our earlier sklearn models.

In [None]:
#collapse-hide
import torch
import torch.nn as nn
import torch.nn.functional as F
from omegaconf import DictConfig
from typing import Dict
from dataclasses import dataclass, field

from pytorch_tabular import TabularModel
from pytorch_tabular.config import DataConfig,OptimizerConfig, TrainerConfig, ExperimentConfig,ModelConfig
from pytorch_tabular.models import BaseModel
from collections import OrderedDict
from pytorch_tabular.models import CategoryEmbeddingModelConfig


## Defining a Model

Within pytorch-tabular, custom networks can be defined by extending the BaseModel class, which in turn extends the pytorch-lightning LightningModule. All hyperparameters are passed in at initialisation through the config parameter and are accessible from self.hparams once super().\_\_init\_\_() has been called.
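The config pattern itself is ordinary Python dataclass inheritance. The sketch below uses only the standard library; `BaseConfig` and `WidthConfig` are hypothetical stand-ins, not pytorch-tabular classes, and the two base fields are chosen only to mirror the arguments used later in this notebook.

```python
from dataclasses import dataclass, asdict

@dataclass
class BaseConfig:  # hypothetical stand-in for a library-provided config
    task: str = "regression"
    learning_rate: float = 1e-3

@dataclass
class WidthConfig(BaseConfig):
    # a user-defined hyperparameter added on top of the inherited fields
    width: int = 10

config = WidthConfig(width=32)
hparams = asdict(config)
print(hparams)  # {'task': 'regression', 'learning_rate': 0.001, 'width': 32}
```

Subclassing keeps the inherited fields and their defaults, so a custom model only has to declare the hyperparameters it adds.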

Our first step is to write our MLP. We define our network in the \_build_network function; it consists of two linear layers (weight matrices with biases) separated by a ReLU activation layer. The number of inputs and outputs is controlled by hyperparameters inferred from the data, while the width of the hidden layer is a user-controlled parameter.

We are also required to define the forward method, which controls how data is passed through our network. The input to this function, x, is a dictionary with continuous and categorical features split into x["continuous"] and x["categorical"]. The output of forward must be returned as a dictionary with the predictions stored under the key "logits". This is messy (and is unclear in their documentation). Hopefully this is something that will be improved in future iterations of this library.

In [None]:
@dataclass
class MLPConfig(ModelConfig):
    width: int = 10
    
class MLP(BaseModel):
    def __init__(
        self,
        config: DictConfig,
        **kwargs
    ):
        super().__init__(config, **kwargs)

    def _build_network(self):
        layers = OrderedDict({'layer_1':nn.Linear(self.hparams["continuous_dim"], self.hparams["width"]),
                              'act_1':nn.ReLU(),
                              'layer_2':nn.Linear(self.hparams["width"],self.hparams["output_dim"])
        })
        self.model = nn.Sequential(layers)
        
    def forward(self,x):
        x = x["continuous"]
        y_hat=  self.model.forward(x)
        return  {'logits':y_hat}
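The dictionary-in, dictionary-out contract can be checked without pytorch-tabular at all. The following is a minimal sketch using a plain `nn.Module`; `DictMLP`, the batch size of 32 and the 100 input features are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class DictMLP(nn.Module):  # hypothetical stand-in, not a pytorch-tabular class
    def __init__(self, continuous_dim, width, output_dim):
        super().__init__()
        self.model = nn.Sequential(
            nn.Linear(continuous_dim, width),
            nn.ReLU(),
            nn.Linear(width, output_dim),
        )

    def forward(self, x):
        # only the continuous features are used; predictions go under "logits"
        return {"logits": self.model(x["continuous"])}

# a dummy batch: 32 rows, 100 continuous features, no categorical features
batch = {
    "continuous": torch.randn(32, 100),
    "categorical": torch.empty(32, 0, dtype=torch.long),
}
out = DictMLP(continuous_dim=100, width=10, output_dim=1)(batch)
print(out["logits"].shape)  # torch.Size([32, 1])
```

Running a dummy batch like this is a quick way to catch shape mistakes before handing the model to a trainer.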

## Configurations

Data is expected to be a single pd.DataFrame including both X and y. This is a departure from the sklearn approach, and in the future we'll work on a fix for this.
For the time being, we merge our X and y and define the names of our categorical and numerical columns. We pass this metadata into a DataConfig object, which handles loading and transforming data for us.

Similarly, we also define a TrainerConfig object and an OptimizerConfig object, which together define all the hyperparameters controlling training.

In [None]:
num_col_names = train_X.columns.tolist()
cat_col_names = []

# copy the features so that adding the target column leaves train_X / test_X untouched
train_Xy = train_X.copy()
test_Xy = test_X.copy()
train_Xy['target']=train_y
test_Xy['target']=test_y

data_config = DataConfig(
    target=['target'], #target should always be a list. Multi-targets are only supported for regression. Multi-Task Classification is not implemented
    continuous_cols=num_col_names,
    categorical_cols=cat_col_names,
)
trainer_config = TrainerConfig(
    auto_lr_find=False, # Runs the LRFinder to automatically derive a learning rate
    batch_size=32,
    max_epochs=100,
    gpus=-1, #index of the GPU to use. -1 means all available GPUs, None, means CPU
)

model_config = MLPConfig(task="regression",
                         learning_rate = 1e-3)

optimizer_config = OptimizerConfig()
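The merge of X and y into one frame could be wrapped in a small helper, so that sklearn-style X and y are kept separate and only combined at the point where a single DataFrame is required. This is a sketch, not part of the project code: `merge_X_y`, the column names and the values are all hypothetical.

```python
import numpy as np
import pandas as pd

def merge_X_y(X: pd.DataFrame, y, target: str = "target") -> pd.DataFrame:
    """Return a copy of X with y appended as a column named `target`."""
    Xy = X.copy()
    # np.ravel accepts a Series, a one-column DataFrame or an array.
    # Values are assigned by position, so X and y must be in the same row order.
    Xy[target] = np.ravel(y)
    return Xy

# hypothetical miniature data: two spectral columns and one target
X = pd.DataFrame({"684": [0.1, 0.2, 0.3], "687": [0.4, 0.5, 0.6]})
y = pd.DataFrame({"y": [14.2, 15.1, 16.3]})

Xy = merge_X_y(X, y)
print(list(Xy.columns))  # ['684', '687', 'target']
```

Because the helper copies X first, the original feature frame never gains a target column by accident.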

All the pieces are assembled in the TabularModel class. As well as our Config classes, we also define the model_callable to be a reference to our MLP class.

In [None]:

tabular_model = TabularModel(
    data_config=data_config,
    model_config=model_config,
    optimizer_config=optimizer_config,
    trainer_config=trainer_config,
    model_callable = MLP
)

In [None]:
tabular_model.fit(train=train_Xy)

In [None]:
tabular_model.evaluate(test_Xy)

In [None]:
#todo check outputs of tabular_model.evaluate

## Cross-Validation

So far we have implemented a minimum-viable model and evaluation. We now wrap this into a cross-validation framework.

First we need to consider a key design choice. For our sklearn models, cross-validation led to building multiple versions of a model with slightly different parameters. For neural networks, where optimisation follows a highly stochastic gradient descent path, there is no guarantee that the models from different cross-validation folds are similar, or that they resemble the final model trained on all the data. After running cross-validation, we let the user decide how to build the final model: None, to build no final model and save time; "All", to train a model on the whole training set; or "Ensemble", to build an ensemble from the cross-validation fold models.
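The difference between the "All" and "Ensemble" options can be sketched with NumPy alone. To keep the example small and deterministic, the models here are ordinary least-squares fits rather than neural networks, and the folds are a plain random K-fold split rather than GroupKFold; the data and all names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 4))
y = X @ np.array([1.0, -1.0, 0.5, 2.0]) + 0.1 * rng.normal(size=120)
X_new = rng.normal(size=(10, 4))  # unseen rows to predict

def fit_linear(X, y):
    # least-squares coefficients, standing in for "training a model"
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

# five folds; each fold model is trained on the other four
folds = np.array_split(rng.permutation(len(X)), 5)
fold_coefs = []
for k in range(len(folds)):
    train_idx = np.concatenate([f for j, f in enumerate(folds) if j != k])
    fold_coefs.append(fit_linear(X[train_idx], y[train_idx]))

# "Ensemble": average the predictions of the fold models, row by row
fold_preds = np.stack([X_new @ c for c in fold_coefs], axis=1)  # shape (10, 5)
ensemble_pred = fold_preds.mean(axis=1)

# "All": refit a single model on the whole training set
all_pred = X_new @ fit_linear(X, y)

print(np.abs(ensemble_pred - all_pred).max())
```

For a convex problem like this one the two options give nearly identical predictions. The concern raised above is that independently trained neural networks can differ far more from fold to fold, which is what makes the choice matter.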

We define our ensemble implementation below. At the moment we just pass in each model. In the future we may instead pass in the location of a save file for each model, or a single model and a list of locations to load weights from.

In [None]:
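@dataclass
class EnsembleConfig(ModelConfig):
    # dataclasses reject mutable defaults, so use a default_factory
    models: list = field(default_factory=list)

class Ensemble(BaseModel):
    def __init__(
        self,
        config: DictConfig,
        **kwargs
    ):
        super().__init__(config, **kwargs)

    def _build_network(self):
        # the fitted fold models are passed in through the config
        self.models = self.hparams["models"]
        
    def forward(self,x):
        n_rows = x["continuous"].shape[0]
        y_hats = torch.zeros(n_rows, len(self.models), device=x["continuous"].device)
        for i,model in enumerate(self.models):
            # each member receives the same input dictionary as the ensemble
            y_hats[:,i] = model.forward(x)['logits'].flatten()
        # average across models, keeping one prediction per row
        y_hat = torch.mean(y_hats, dim=1, keepdim=True)
        return  {'logits':y_hat}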
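The averaging step of an ensemble's forward pass can be checked in isolation. In this sketch the three prediction tensors are made-up values standing in for the "logits" of three fold models on a batch of four rows.

```python
import torch

# predictions from three hypothetical fold models, each of shape (4, 1)
fold_preds = [
    torch.tensor([[1.0], [2.0], [3.0], [4.0]]),
    torch.tensor([[1.2], [2.2], [3.2], [4.2]]),
    torch.tensor([[0.8], [1.8], [2.8], [3.8]]),
]

# concatenate to shape (n_rows, n_models): one column per model
y_hats = torch.cat(fold_preds, dim=1)

# averaging over dim=1 keeps one prediction per row
y_hat = y_hats.mean(dim=1, keepdim=True)
print(y_hat.shape)  # torch.Size([4, 1])

# without dim, the mean collapses the whole batch to a single scalar
print(y_hats.mean().shape)  # torch.Size([])
```

The mean therefore has to be taken along the model dimension; a mean over all elements would return the same single number for every row in the batch.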

In [2]:
from copy import deepcopy
from sklearn.metrics import r2_score, mean_squared_error
from sklearn.model_selection import GroupKFold, KFold

def cross_validate(model,X,y,splitter=GroupKFold(),groups=None,plot=False,save_loc=None,final_model="Ensemble"): #Ensemble, "All", None
    #combine X and y
    Xy = deepcopy(X)
    Xy["target"]=y
    
    preds = None
    ys = None
    models = []
    for fold, (inds1,inds2) in enumerate(splitter.split(X,y,groups)):
        
        model.fit(X.iloc[inds1,:],y.iloc[inds1,:])
        pred = model.predict(X.iloc[inds2,:])

        if preds is None:
            preds = pred
            ys = y.iloc[inds2,:]
        else:
            preds = np.concatenate((preds,pred),axis=0)
            ys = np.concatenate((ys,y.iloc[inds2,:]),axis=0)
            
        if final_model == "Ensemble":
            models.append(deepcopy(model))
            

    r2 = r2_score(ys,preds)
    mse = mean_squared_error(ys,preds)

    if plot:
        ys = ys.flatten()
        preds = preds.flatten()

        m, b = np.polyfit(ys, preds, 1)
        fig, ax = plt.subplots()

        ls = np.linspace(min(ys),max(ys))
        ax.plot(ls,ls*m+b,color = "black", label = r"$\hat{y}$ = "+f"{m:.4f}y + {b:.4f}")
        ax.scatter(x=ys,y=preds,label = r"$R^2$" + f"={r2:.4f}")

        ax.set_xlabel('True Values')
        ax.set_ylabel('Predicted Values')
        ax.legend(bbox_to_anchor=(0.5,1))
        if save_loc is not None:
            fig.savefig(save_loc)
            
    if final_model == "Ensemble":
        #create new tabular model, passing in same configs but with an ensemble
        pass #todo: not implemented yet, so the model from the last fold is returned
    elif final_model == "All": #train final model on all data
        model = model.fit(X,y)
    else: #ignore training the final model, for computation saving purposes 
        model = None
    
    return model, mse
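The core of this function, pooling out-of-fold predictions from a grouped split and scoring them once, can be tried on synthetic data. The sketch below uses sklearn's `LinearRegression` as a stand-in model; the data, the ten groups and the variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import GroupKFold

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=100)
groups = np.repeat(np.arange(10), 10)  # ten hypothetical populations of ten rows

oof_pred = np.empty_like(y)  # out-of-fold predictions, filled fold by fold
for train_idx, val_idx in GroupKFold(n_splits=5).split(X, y, groups):
    # GroupKFold never puts the same group on both sides of a split
    assert set(groups[train_idx]).isdisjoint(groups[val_idx])
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    oof_pred[val_idx] = model.predict(X[val_idx])

# every row was predicted exactly once by a model that never saw its group
mse = mean_squared_error(y, oof_pred)
r2 = r2_score(y, oof_pred)
print(f"pooled out-of-fold MSE: {mse:.4f}, R2: {r2:.4f}")
```

Writing predictions into a preallocated array by index keeps them aligned with y, which avoids having to concatenate the folds in order.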


In [None]:
def evaluate(model,train_X,train_y,test_X,test_y,plot=False,save_loc=None,log=True):
    test_y=test_y.values.flatten()
    model.fit(train_X,train_y)
    preds = model.predict(test_X)

    r2 = r2_score(test_y,preds)
    mse = mean_squared_error(test_y,preds)

    if log:
        print(f"Test set MSE: {mse:.4f}")

    if plot:
        preds=preds.flatten()

        m, b = np.polyfit(test_y, preds, 1)
        fig, ax = plt.subplots()

        ls = np.linspace(min(test_y),max(test_y))
        ax.plot(ls,ls*m+b,color = "black", label = r"$\hat{y}$ = "+f"{m:.4f}y + {b:.4f}")
        ax.scatter(x=test_y,y=preds,label = r"$R^2$" + f"={r2:.4f}")

        ax.set_xlabel('True Values')
        ax.set_ylabel('Predicted Values')
        ax.legend(bbox_to_anchor=(0.5,1))
        if save_loc is not None:
            fig.savefig(save_loc)
    return model, mse 
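The black line drawn in these plots is a degree-one fit of the predictions against the true values. A small sketch with made-up numbers shows how to read its slope and intercept: predictions that are compressed towards the centre of the data give a slope below one.

```python
import numpy as np

y_true = np.array([10.0, 12.0, 14.0, 16.0, 18.0])
# hypothetical predictions, compressed towards the mean of y_true (14)
y_pred = 0.8 * y_true + 2.8

# np.polyfit returns the coefficients highest degree first: slope, then intercept
m, b = np.polyfit(y_true, y_pred, 1)
print(f"{m:.4f}y + {b:.4f}")  # 0.8000y + 2.8000
```

A perfect model would give a slope of 1 and an intercept of 0, so the fitted line would coincide with the diagonal.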