In this example, we will learn more about the internals of torchtable and how to construct custom fields.

In [1]:
import numpy as np
import pandas as pd

# Preparing the data

We'll be using the same data that we used in 01_introduction.

In [2]:
df = pd.read_csv("application_train.csv")
test_df = pd.read_csv("application_test.csv")
df = df[df.columns[:15]]
# select the same columns from the test set (which has no TARGET column)
test_df = test_df[[x for x in df.columns[:15] if x != "TARGET"]]

In [3]:
from sklearn.model_selection import train_test_split
train_df, val_df = train_test_split(df, random_state=22)

# Feature Engineering

Suppose we want to engineer a feature that represents whether a user is a single mother. This is not possible with the regular `NumericField` or `CategoricalField`. Thankfully, torchtable recognizes that feature engineering can be critical to success on tabular datasets, so it provides rich support for custom fields. Let's see how we can construct one.

In [5]:
from torchtable.field import Field

We'll use the number of children, gender, and family status of the user to check if they are a single mother.

In [6]:
def is_single_mother(row, **kwargs):
    # has at least one child, is female, and is not married
    return ((row["CNT_CHILDREN"] > 0) &
            (row["CODE_GENDER"] == "F") &
            (row["NAME_FAMILY_STATUS"] != "Married"))

The function must accept the `train` keyword argument (absorbed here by `**kwargs`), which specifies whether the data being passed should be used to fit the function. Since our function is stateless, we can simply ignore this argument.
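For contrast, here is a minimal sketch of a stateful feature function that does use `train`. `MeanCenter` is a hypothetical helper, not part of torchtable, and it works on plain Python lists to stay self-contained. The only assumption carried over from the text is that the function is called with `train=True` on data it should fit on and `train=False` otherwise.

```python
class MeanCenter:
    """Hypothetical stateful feature function: subtracts the training mean."""

    def __init__(self):
        self.mean = None  # learned from the training data

    def __call__(self, values, train=True, **kwargs):
        values = [float(v) for v in values]
        if train:
            # Fit step: remember the mean of the training data only.
            self.mean = sum(values) / len(values)
        if self.mean is None:
            raise RuntimeError("call with train=True before transforming other data")
        # Transform step: reuse the stored mean for any split.
        return [v - self.mean for v in values]


center = MeanCenter()
train_out = center([1.0, 2.0, 3.0], train=True)  # fits mean = 2.0
val_out = center([4.0], train=False)             # reuses the fitted mean
print(train_out, val_out)
```

The point of the flag is that validation and test data are transformed with statistics learned from the training split only. The rest of this section continues with the stateless `is_single_mother`.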

To construct a custom field, all we need to do is pass this feature extraction function to the `Field` constructor, along with whether the field is categorical or continuous. For custom categorical fields, if we want to use automatic model construction, we also need to pass the cardinality explicitly.

In [7]:
custom_field = Field(is_single_mother, categorical=True, continuous=False, cardinality=2)

To test if the field works as expected, try calling its `transform` method on some sample input.

In [8]:
custom_field.transform(df[["CNT_CHILDREN", "CODE_GENDER", "NAME_FAMILY_STATUS"]].head(3))

0    False
1    False
2    False
dtype: bool

Now, this is all fine, but what if we want to do something more complex? For instance, suppose we wanted to allocate a single category to each *combination* of gender and whether the user has children. We could write a single function to handle this, but torchtable makes such composite pipelines easy to write using `Operator`s.


We will now take a quick detour to explain `Operator`s in torchtable, then use these to create a more advanced feature engineering pipeline.

### Operators

In [9]:
from torchtable.operator import LambdaOperator, Categorize

Operators are, as the name suggests, single operations on input data. Take the following simple example:

In [10]:
op1 = LambdaOperator(lambda x: x + 1)
op1(10)

11

There is nothing complex going on here. Where `Operator`s really shine is when they are chained to each other. Take the following example where we add 1 to an input, then multiply it by 3:

In [11]:
op2 = LambdaOperator(lambda x: x * 3)
op1 > op2  # chaining: op1's output is now fed into op2
op2(10)

33

As you can see, chaining op1 and op2 changes op2 into a composite operation. This flexibility makes it easy to write complex pipelines in an intuitive manner.
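To make the chaining behavior concrete, here is a minimal sketch of how a `>` operator of this kind could be built in plain Python. `ToyOperator` is a hypothetical stand-in written for illustration; it reproduces the behavior shown above but is not torchtable's actual implementation.

```python
class ToyOperator:
    """Toy stand-in for an operator; not torchtable's implementation."""

    def __init__(self, func):
        self.func = func
        self.before = None  # operator whose output feeds into this one

    def __gt__(self, other):
        # `a > b` registers `a` as the upstream step of `b` and returns `b`.
        other.before = self
        return other

    def __call__(self, x):
        if self.before is not None:
            x = self.before(x)  # run the upstream chain first
        return self.func(x)


add_one = ToyOperator(lambda x: x + 1)
times_three = ToyOperator(lambda x: x * 3)
add_one > times_three  # mutates times_three so that add_one runs first

print(add_one(10))      # 11: add_one has no upstream step
print(times_three(10))  # 33: (10 + 1) * 3
```

In this sketch the downstream operator is mutated in place, which is why calling the second operator after chaining runs the whole pipeline.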

Now, let's get back to the earlier example. We will first want to convert the gender and children status of the user into a single column, then categorize it. We can use the `operator.Categorize` operator to help us here.

In [12]:
gender_child_status = LambdaOperator(lambda x: x.apply(
    lambda row: f"{row['CODE_GENDER']}_{row['CNT_CHILDREN'] > 0}",
    axis=1)
)
ctgrz = Categorize(handle_unk=False)
custom_field2 = Field(
    gender_child_status > ctgrz, categorical=True, continuous=False,
    cardinality=6, # three gender categories * two child statuses
)

In [13]:
custom_field2.transform(df[["CNT_CHILDREN", "CODE_GENDER"]].head(3))

0    0
1    1
2    0
dtype: int64
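The two steps in this pipeline (build a combined key, then map each key to an integer) can be sketched without torchtable. `ToyCategorize` and the three rows below are made up for illustration, and assigning codes in order of first appearance is an assumption; the library's `Categorize` may order categories differently.

```python
class ToyCategorize:
    """Hypothetical vocabulary-based categorizer, for illustration only."""

    def __init__(self):
        self.vocab = {}  # category -> integer code

    def __call__(self, values, train=True):
        if train:
            for value in values:
                # Assign the next free integer to each unseen category.
                self.vocab.setdefault(value, len(self.vocab))
        # Raises KeyError for categories that were never seen during fitting.
        return [self.vocab[value] for value in values]


# Made-up rows using the same column names as the dataset.
rows = [
    {"CODE_GENDER": "M", "CNT_CHILDREN": 0},
    {"CODE_GENDER": "F", "CNT_CHILDREN": 0},
    {"CODE_GENDER": "M", "CNT_CHILDREN": 2},
]
# Step 1: combine gender and child status into a single key per row.
keys = [f"{r['CODE_GENDER']}_{r['CNT_CHILDREN'] > 0}" for r in rows]
# Step 2: map each key to an integer code.
categorize = ToyCategorize()
codes = categorize(keys, train=True)

print(keys)                   # ['M_False', 'F_False', 'M_True']
print(codes)                  # [0, 1, 2]
print(len(categorize.vocab))  # 3 distinct combinations in this toy data
```

The size of the vocabulary is what the `cardinality` argument describes: the number of distinct codes an embedding layer has to cover.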

# Building and Training a Model

The rest is virtually the same as in the first tutorial. We do the hard work of designing an appropriate pipeline, and torchtable handles the rest for us.

In [14]:
import torch
import torch.nn as nn
from torchtable.field import NumericField, CategoricalField
from torchtable.dataset import TabularDataset
from torchtable.loader import DefaultLoader
from torchtable.model import BatchHandlerModel

In [15]:
train_ds, val_ds, test_ds = TabularDataset.from_dfs(train_df, val_df=val_df, test_df=test_df, fields={
    ("CNT_CHILDREN", "CODE_GENDER", "NAME_FAMILY_STATUS"): custom_field, # Our custom field!
    ("CNT_CHILDREN", "CODE_GENDER"): custom_field2,
    "SK_ID_CURR": None,
    "TARGET": NumericField(normalization=None, fill_missing=None, is_target=True),
    "NAME_CONTRACT_TYPE": CategoricalField(),
    "CODE_GENDER": CategoricalField(),
    "FLAG_OWN_CAR": CategoricalField(),
    "FLAG_OWN_REALTY": CategoricalField(),
    "CNT_CHILDREN": [NumericField(normalization="MinMax"), CategoricalField(handle_unk=True)],
    "AMT_INCOME_TOTAL": NumericField(normalization="RankGaussian"),
    "AMT_CREDIT": NumericField(normalization="Gaussian"),
    "AMT_ANNUITY": NumericField(normalization="Gaussian"),
    "AMT_GOODS_PRICE": NumericField(normalization="Gaussian"),
    "NAME_TYPE_SUITE": CategoricalField(),
    "NAME_INCOME_TYPE": CategoricalField(handle_unk=True),
    "NAME_EDUCATION_TYPE": CategoricalField(),
    "NAME_FAMILY_STATUS": CategoricalField(),
})

The following columns are missing from the fields list: {'SK_ID_CURR'}
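The `normalization` strings passed to `NumericField` above name common rescaling schemes. The sketch below shows what these names usually mean, using only the standard library and made-up values. It is an assumption that torchtable follows these definitions; its exact formulas (for example, how ties or missing values are handled) may differ.

```python
from statistics import NormalDist, mean, pstdev


def min_max(values):
    # Rescale linearly so the smallest value maps to 0 and the largest to 1.
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def gaussian(values):
    # Standardize: subtract the mean and divide by the standard deviation.
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


def rank_gaussian(values):
    # Replace each value by its rank, map ranks into (0, 1), then apply the
    # inverse normal CDF so the result is roughly normally distributed.
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for rank, idx in enumerate(order):
        ranks[idx] = rank
    n = len(values)
    return [NormalDist().inv_cdf((r + 0.5) / n) for r in ranks]


incomes = [10.0, 20.0, 30.0, 1000.0]  # made-up values with one outlier
mm = min_max(incomes)
z = gaussian(incomes)
rg = rank_gaussian(incomes)
print(mm)
print(z)
print(rg)
```

With a single large outlier, min-max scaling squeezes the remaining values close to 0, while the rank-based transform depends only on ordering. That is one plausible reason to pick a rank-based scheme for a heavy-tailed column such as income.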


In [16]:
train_dl, val_dl, test_dl = DefaultLoader.from_datasets(train_ds, (32, 32, 128),  val_ds=val_ds, test_ds=test_ds)

In [17]:
class SampleModel(nn.Module):
    def __init__(self, ds):
        super().__init__()
        self.batch_handler = BatchHandlerModel.from_dataset(ds)
        self.l1 = nn.Linear(self.batch_handler.out_dim(), 32)
        self.l2 = nn.Linear(32, 1)
    
    def forward(self, x):
        x = self.batch_handler(x)
        x = self.l1(x)
        x = torch.relu(x)
        x = self.l2(x)
        return x

In [18]:
model = SampleModel(train_ds)

In [19]:
import tqdm

loss_func = nn.BCEWithLogitsLoss()
opt = torch.optim.Adam(model.parameters())

for epoch in range(1, 3):
    running_loss = 0.0
    model.train() # turn on training mode
    for x, y in tqdm.tqdm(train_dl): # thanks to our wrapper, we can intuitively iterate over our data!
        opt.zero_grad()

        preds = model(x)
        loss = loss_func(preds, y["TARGET"].unsqueeze(1))
        loss.backward()
        opt.step()
        
        running_loss += loss.item() * len(y["TARGET"])
        
    epoch_loss = running_loss / len(train_ds)
    
    # calculate the validation loss for this epoch
    val_loss = 0.0
    model.eval() # turn on evaluation mode
    with torch.no_grad(): # gradients are not needed for validation
        for x, y in val_dl:
            preds = model(x)
            loss = loss_func(preds, y["TARGET"].unsqueeze(1))
            val_loss += loss.item() * len(y["TARGET"])

    val_loss /= len(val_ds)
    print('Epoch: {}, Training Loss: {:.4f}, Validation Loss: {:.4f}'.format(epoch, epoch_loss, val_loss))

100%|██████████| 7208/7208 [00:26<00:00, 268.32it/s]
  0%|          | 0/7208 [00:00<?, ?it/s]

Epoch: 1, Training Loss: 0.2771, Validation Loss: 0.2693


100%|██████████| 7208/7208 [00:26<00:00, 268.83it/s]


Epoch: 2, Training Loss: 0.2750, Validation Loss: 0.2687
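A note on the loss bookkeeping in the loop: `nn.BCEWithLogitsLoss` averages over the batch by default, so each batch loss is multiplied by the batch size before the total is divided by the dataset size. The sketch below, using made-up per-sample losses, shows why this matters when the final batch is smaller than the others.

```python
# Made-up per-sample losses; the last batch is smaller than the first.
batches = [[0.2, 0.4, 0.6, 0.8], [1.0]]

# A mean-reduced loss reports one average per batch.
batch_means = [sum(b) / len(b) for b in batches]
n_samples = sum(len(b) for b in batches)

# Weight each batch mean by its size, as the training loop does.
weighted = sum(m * len(b) for m, b in zip(batch_means, batches)) / n_samples
# Averaging the batch means directly over-weights the short batch.
naive = sum(batch_means) / len(batch_means)
# Reference value: the mean over all individual samples.
true_mean = sum(sum(b) for b in batches) / n_samples

print(weighted, naive, true_mean)
```

The size-weighted average matches the per-sample mean, whereas the plain average of batch means does not.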
