First of all, thanks for taking an interest in torchtable! In this notebook, we will be going through a simple example illustrating how to apply torchtable to a dataset on Kaggle.

In [1]:
import numpy as np
import pandas as pd

# Preparing the data

First, we will download the sample data for this example. If you have not already done so, install the Kaggle CLI with

`$ pip install kaggle`

In this example, we will be using the data from the [home credit default risk competition](https://www.kaggle.com/c/home-credit-default-risk), a competition for predicting which users will default on a home loan.

We will use the Kaggle CLI to download the data.

In [None]:
!kaggle competitions download -c home-credit-default-risk

In [None]:
!unzip application_train.csv.zip

In [None]:
!unzip application_test.csv.zip

Now, let's read the data and take a look at it.

In [2]:
df = pd.read_csv("application_train.csv")
test_df = pd.read_csv("application_test.csv")

In [3]:
df.shape, test_df.shape

((307511, 122), (48744, 121))

In [4]:
df.head(2)

|   | SK_ID_CURR | TARGET | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | ... | FLAG_DOCUMENT_18 | FLAG_DOCUMENT_19 | FLAG_DOCUMENT_20 | FLAG_DOCUMENT_21 | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100002 | 1 | Cash loans | M | N | Y | 0 | 202500.0 | 406597.5 | 24700.5 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 1 | 100003 | 0 | Cash loans | F | N | N | 0 | 270000.0 | 1293502.5 | 35698.5 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |


There are a lot of columns (122 in the training data). For now, we will keep only the first 15 to make this example easier to follow. In future examples, we will see how torchtable can automate feature processing for us.

In [5]:
df = df[df.columns[:15]]
test_df = test_df[[c for c in df.columns[:15] if c != "TARGET"]]

In [6]:
from sklearn.model_selection import train_test_split

In [7]:
train_df, val_df = train_test_split(df, random_state=22)

# Constructing the dataset

At the heart of torchtable is the `TabularDataset`. We'll see how to use it through a simple example.

In [9]:
from torchtable import *
from torchtable.field import *
from torchtable.dataset import TabularDataset

In torchtable, we can declaratively define how each column in the dataset should be processed. First, let's check which columns we have.

In [10]:
train_df.columns

Index(['SK_ID_CURR', 'TARGET', 'NAME_CONTRACT_TYPE', 'CODE_GENDER',
       'FLAG_OWN_CAR', 'FLAG_OWN_REALTY', 'CNT_CHILDREN', 'AMT_INCOME_TOTAL',
       'AMT_CREDIT', 'AMT_ANNUITY', 'AMT_GOODS_PRICE', 'NAME_TYPE_SUITE',
       'NAME_INCOME_TYPE', 'NAME_EDUCATION_TYPE', 'NAME_FAMILY_STATUS'],
      dtype='object')

In [11]:
train_ds, val_ds, test_ds = TabularDataset.from_dfs(train_df, val_df=val_df, test_df=test_df, fields={
    "SK_ID_CURR": None,
    "TARGET": NumericField(normalization=None, fill_missing=None, is_target=True),
    "NAME_CONTRACT_TYPE": CategoricalField(),
    "CODE_GENDER": CategoricalField(),
    "FLAG_OWN_CAR": CategoricalField(),
    "FLAG_OWN_REALTY": CategoricalField(),
    "CNT_CHILDREN": [NumericField(normalization="MinMax"), CategoricalField(handle_unk=True)],
    "AMT_INCOME_TOTAL": NumericField(normalization="RankGaussian"),
    "AMT_CREDIT": NumericField(normalization="Gaussian"),
    "AMT_ANNUITY": NumericField(normalization="Gaussian"),
    "AMT_GOODS_PRICE": NumericField(normalization="Gaussian"),
    "NAME_TYPE_SUITE": CategoricalField(),
    "NAME_INCOME_TYPE": CategoricalField(handle_unk=True),
    "NAME_EDUCATION_TYPE": CategoricalField(),
    "NAME_FAMILY_STATUS": CategoricalField(),
})

The following columns are missing from the fields list: {'SK_ID_CURR'}


There's a lot of information here, so let's pick it apart piece by piece.

First off, we're calling the `TabularDataset.from_dfs` method, which constructs one dataset each for the train, validation, and test dataframes with (virtually) the same processing.

The most important part of the above code is the **fields** dictionary. For each column in the input, we are mapping a field or fields. Each field represents a collection of processing steps to apply to each column. We'll discuss this in more depth later and in subsequent example notebooks.

When we map a column to `None`, that column is ignored. The SK_ID_CURR column is unlikely to help us during training, so we'll drop it for now. The same can be accomplished by leaving "SK_ID_CURR" out of the fields dictionary, but it is best practice to map ignored columns to `None` explicitly. This distinguishes columns you chose to ignore from columns you simply forgot.
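To make the distinction between ignored and forgotten columns concrete, here is a minimal sketch of a check you could run on your own fields dictionary before building a dataset. `check_fields` is a hypothetical helper written for this example, not part of torchtable, and placeholder strings stand in for real field objects.

```python
def check_fields(columns, fields):
    """Split columns into processed, explicitly ignored, and forgotten."""
    processed = [c for c in columns if fields.get(c) is not None]
    ignored = [c for c in columns if c in fields and fields[c] is None]
    forgotten = [c for c in columns if c not in fields]
    return processed, ignored, forgotten


columns = ["SK_ID_CURR", "TARGET", "CODE_GENDER", "AMT_CREDIT"]
fields = {
    "SK_ID_CURR": None,            # explicitly ignored
    "TARGET": "numeric",           # placeholder for a real field object
    "CODE_GENDER": "categorical",  # placeholder for a real field object
    # "AMT_CREDIT" has no entry: it was forgotten
}

processed, ignored, forgotten = check_fields(columns, fields)
print("processed:", processed)
print("ignored:  ", ignored)
print("forgotten:", forgotten)
```

Only the explicit `None` lets a check like this tell the two cases apart.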

### Understanding fields

Now let's look more deeply into the fields. As you will see, there are two main types of fields: **Numeric** and **Categorical**. 

In [12]:
train_ds.fields

{'TARGET': NumericField[TARGET],
 'NAME_CONTRACT_TYPE': CategoricalField[NAME_CONTRACT_TYPE],
 'CODE_GENDER': CategoricalField[CODE_GENDER],
 'FLAG_OWN_CAR': CategoricalField[FLAG_OWN_CAR],
 'FLAG_OWN_REALTY': CategoricalField[FLAG_OWN_REALTY],
 'CNT_CHILDREN': [NumericField[CNT_CHILDREN/_0],
  CategoricalField[CNT_CHILDREN/_1]],
 'AMT_INCOME_TOTAL': NumericField[AMT_INCOME_TOTAL],
 'AMT_CREDIT': NumericField[AMT_CREDIT],
 'AMT_ANNUITY': NumericField[AMT_ANNUITY],
 'AMT_GOODS_PRICE': NumericField[AMT_GOODS_PRICE],
 'NAME_TYPE_SUITE': CategoricalField[NAME_TYPE_SUITE],
 'NAME_INCOME_TYPE': CategoricalField[NAME_INCOME_TYPE],
 'NAME_EDUCATION_TYPE': CategoricalField[NAME_EDUCATION_TYPE],
 'NAME_FAMILY_STATUS': CategoricalField[NAME_FAMILY_STATUS]}

Numeric fields represent columns that should be treated as continuous numbers. 

In contrast, Categorical fields represent columns whose values represent some discrete category. 

Numeric fields are normalized and have their missing values filled, while categorical fields are mapped to discrete integer ids. You can specify the behavior of both fields in detail by setting various parameters (e.g. handle_unk: whether to allow categories that are unseen in the train set). See documentation for more details.
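As a rough illustration of what these two kinds of processing amount to, the sketch below fits statistics on the training values and then applies them to unseen values. It is plain Python written for this notebook, not torchtable's implementation. The function names, the choice of the median as fill value, and the convention that id 0 is reserved for unseen categories are all assumptions.

```python
import statistics


def fit_gaussian(train_values):
    """Learn mean, std, and a fill value from the training column only."""
    observed = [v for v in train_values if v is not None]
    mean = statistics.mean(observed)
    std = statistics.pstdev(observed)
    fill = statistics.median(observed)  # assumption: fill missing values with the median
    return mean, std, fill


def apply_gaussian(values, mean, std, fill):
    """Fill missing values, then standardize with the training statistics."""
    filled = [fill if v is None else v for v in values]
    return [(v - mean) / std for v in filled]


def fit_vocab(train_values, handle_unk=False):
    """Map each category seen in training to an integer id."""
    offset = 1 if handle_unk else 0  # assumption: id 0 is reserved for unseen categories
    vocab = {}
    for v in train_values:
        vocab.setdefault(v, len(vocab) + offset)
    return vocab


def apply_vocab(values, vocab, handle_unk=False):
    if handle_unk:
        return [vocab.get(v, 0) for v in values]
    return [vocab[v] for v in values]  # raises KeyError on an unseen category


# Hypothetical toy columns
train_credit = [100.0, 200.0, None, 300.0]
stats = fit_gaussian(train_credit)
val_scaled = apply_gaussian([200.0, None], *stats)

train_income_type = ["Working", "Pensioner", "Working"]
vocab = fit_vocab(train_income_type, handle_unk=True)
val_ids = apply_vocab(["Pensioner", "Student"], vocab, handle_unk=True)

print(val_scaled)  # [0.0, 0.0]
print(val_ids)     # [2, 0]
```

The key point is that the statistics and the vocabulary come from the training data and are reused unchanged on the validation and test data.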

You can also specify multiple fields for a single column. In this case, we treat the number of children both as a numeric field and as a categorical field.

In [13]:
train_ds.fields["CNT_CHILDREN"]

[NumericField[CNT_CHILDREN/_0], CategoricalField[CNT_CHILDREN/_1]]
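The sketch below shows the idea of giving one column two views, using a handful of made-up child counts. It is a plain-Python illustration under stated assumptions (min-max scaling for the numeric view, ids starting at 1 with 0 reserved for unseen values), not torchtable's actual code.

```python
def min_max_scale(values, lo, hi):
    """Rescale values linearly so that lo maps to 0 and hi maps to 1."""
    return [(v - lo) / (hi - lo) for v in values]


children = [0, 1, 0, 2, 4]  # hypothetical child counts

# Numeric view: the count as a continuous value in [0, 1]
numeric_view = min_max_scale(children, min(children), max(children))

# Categorical view: each distinct count becomes its own id
# (assumption: id 0 is kept free for counts never seen in training)
vocab = {v: i + 1 for i, v in enumerate(sorted(set(children)))}
categorical_view = [vocab[v] for v in children]

# Each example ends up with one output per field, in field order
rows = [[n, c] for n, c in zip(numeric_view, categorical_view)]
print(rows)
```

The numeric view preserves ordering and distance between counts, while the categorical view lets a model learn a separate embedding for each count.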

There is a special keyword, `is_target`, that marks a field as a target field (i.e. the label in supervised learning). This does not affect the processing in any way, but it matters for how the train and test datasets are handled. To see why, let's take a look at the fields in the test set.

In [14]:
test_ds.fields

{'NAME_CONTRACT_TYPE': CategoricalField[NAME_CONTRACT_TYPE],
 'CODE_GENDER': CategoricalField[CODE_GENDER],
 'FLAG_OWN_CAR': CategoricalField[FLAG_OWN_CAR],
 'FLAG_OWN_REALTY': CategoricalField[FLAG_OWN_REALTY],
 'CNT_CHILDREN': [NumericField[CNT_CHILDREN/_0],
  CategoricalField[CNT_CHILDREN/_1]],
 'AMT_INCOME_TOTAL': NumericField[AMT_INCOME_TOTAL],
 'AMT_CREDIT': NumericField[AMT_CREDIT],
 'AMT_ANNUITY': NumericField[AMT_ANNUITY],
 'AMT_GOODS_PRICE': NumericField[AMT_GOODS_PRICE],
 'NAME_TYPE_SUITE': CategoricalField[NAME_TYPE_SUITE],
 'NAME_INCOME_TYPE': CategoricalField[NAME_INCOME_TYPE],
 'NAME_EDUCATION_TYPE': CategoricalField[NAME_EDUCATION_TYPE],
 'NAME_FAMILY_STATUS': CategoricalField[NAME_FAMILY_STATUS]}

You'll notice that the TARGET field is missing. This is because we set `is_target=True` in the field for TARGET. Target fields should not be present in test sets, and we handle this automatically behind the scenes.
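Conceptually, this amounts to filtering the field mapping on the `is_target` flag. The sketch below uses a hypothetical `ToyField` dataclass to show the idea. It is not torchtable's implementation.

```python
from dataclasses import dataclass


@dataclass
class ToyField:
    """Hypothetical stand-in for a field object."""
    name: str
    is_target: bool = False


train_fields = {
    "TARGET": ToyField("TARGET", is_target=True),
    "CODE_GENDER": ToyField("CODE_GENDER"),
    "AMT_CREDIT": ToyField("AMT_CREDIT"),
}


def fields_for_test(fields):
    """Keep every field except the targets, which a test set does not contain."""
    return {name: f for name, f in fields.items() if not f.is_target}


test_fields = fields_for_test(train_fields)
print(list(test_fields))  # ['CODE_GENDER', 'AMT_CREDIT']
```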

To further deepen our understanding, let's take a look at what's inside the dataset. PyTorch datasets can be indexed like a list, so we'll check the first example.

In [15]:
train_ds[0]

{'TARGET': 0,
 'NAME_CONTRACT_TYPE': 0,
 'CODE_GENDER': 0,
 'FLAG_OWN_CAR': 0,
 'FLAG_OWN_REALTY': 0,
 'CNT_CHILDREN': [0.05263157891966759, 2],
 'AMT_INCOME_TOTAL': -1.1287163254695256,
 'AMT_CREDIT': -1.2731855305872526,
 'AMT_ANNUITY': -1.5177575690066853,
 'AMT_GOODS_PRICE': -1.2502974913704445,
 'NAME_TYPE_SUITE': 0,
 'NAME_INCOME_TYPE': 1,
 'NAME_EDUCATION_TYPE': 0,
 'NAME_FAMILY_STATUS': 2}

For columns with multiple fields, there are multiple outputs as well. As you can see, none of these values are tensors yet. Preprocessing the data is the job of the TabularDataset. Now, let's prepare the data to be fed into the model.

# Preparing the DataLoader

We can't use the dataset for training yet, since we haven't converted the inputs into minibatches of tensors. Thankfully, this functionality is also provided within torchtable in the form of the `DefaultLoader`.

In [16]:
from torchtable.loader import DefaultLoader

A `DefaultLoader` can be created with an API similar to that of the `TabularDataset`. Here, the tuple `(32, 32, 128)` gives the batch sizes for the train, validation, and test loaders.

In [17]:
train_dl, val_dl, test_dl = DefaultLoader.from_datasets(train_ds, (32, 32, 128),  val_ds=val_ds, test_ds=test_ds)

Let's take a look at what a batch looks like.

In [18]:
next(iter(train_dl))

({'NAME_CONTRACT_TYPE': tensor([1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,
          0, 0, 0, 0, 0, 0, 1, 0]),
  'CODE_GENDER': tensor([0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0,
          1, 0, 1, 0, 0, 1, 1, 0]),
  'FLAG_OWN_CAR': tensor([0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0,
          0, 1, 0, 0, 0, 1, 0, 0]),
  'FLAG_OWN_REALTY': tensor([1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1,
          1, 0, 0, 0, 0, 0, 0, 0]),
  'CNT_CHILDREN': [tensor([0.2105, 0.0000, 0.0000, 0.0000, 0.0526, 0.1053, 0.0526, 0.0526, 0.0000,
           0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.1053, 0.0000, 0.0526,
           0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000,
           0.1053, 0.0000, 0.0000, 0.0000, 0.1053]),
   tensor([5, 1, 1, 1, 2, 3, 2, 2, 1, 1, 1, 1, 1, 1, 1, 3, 1, 2, 1, 1, 1, 1, 1, 1,
           1, 1, 1, 3, 1, 1, 1, 3])],
  'AMT_INCOME_TOT

As you can see, the values are now all conveniently converted to tensors. The actual processing to convert examples to batches of tensors is specified within each field.

The batch is a tuple consisting of two dictionaries. The first is the dictionary of inputs, and the second is the dictionary of outputs/targets. The targets are automatically discovered by looking for fields with `is_target` set to True.
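To illustrate the shape of a batch, here is a minimal sketch of collating a list of examples into an `(inputs, targets)` pair. It uses plain Python lists in place of tensors to stay dependency-free. The `collate` function and its `target_names` argument are hypothetical and only mimic the structure shown above.

```python
def collate(examples, target_names):
    """Group per-example dicts into (inputs, targets) dicts of columns."""
    inputs, targets = {}, {}
    for example in examples:
        for name, value in example.items():
            bucket = targets if name in target_names else inputs
            if isinstance(value, list):
                # A column with several fields yields one batch column per field
                columns = bucket.setdefault(name, [[] for _ in value])
                for column, v in zip(columns, value):
                    column.append(v)
            else:
                bucket.setdefault(name, []).append(value)
    return inputs, targets


examples = [
    {"TARGET": 0, "CODE_GENDER": 0, "CNT_CHILDREN": [0.0, 1]},
    {"TARGET": 1, "CODE_GENDER": 1, "CNT_CHILDREN": [0.25, 2]},
]
inputs, targets = collate(examples, target_names={"TARGET"})
print(inputs)
print(targets)
```

In the real loader each of these lists would be a tensor, and multi-field columns such as CNT_CHILDREN would hold a list of tensors, as in the output above.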

# Preparing the model

Now, let's prepare the actual model to train. Our model will be simple: it embeds all categories into continuous space and concatenates the embeddings with the numerical values. It then passes the resulting feature vector through two linear layers with a ReLU in between.

The most cumbersome part of this process is embedding all the features and concatenating them. To make this process easier, torchtable provides the option for constructing a model that does all this processing for you just by passing a dataset. The functionality is provided through the `BatchHandlerModel`.

In [19]:
import torch
import torch.nn as nn
from torchtable.model import BatchHandlerModel

In [20]:
BatchHandlerModel.from_dataset(train_ds)

BatchHandlerModel(
  (embs): ModuleList(
    (0): Embedding(2, 1)
    (1): Embedding(3, 3)
    (2): Embedding(2, 1)
    (3): Embedding(2, 1)
    (4): Embedding(17, 50)
    (5): Embedding(8, 28)
    (6): Embedding(9, 36)
    (7): Embedding(5, 10)
    (8): Embedding(6, 15)
  )
)

Now, using this feature, we can easily construct the full model.

In [21]:
class SampleModel(nn.Module):
    def __init__(self, ds):
        super().__init__()
        self.batch_handler = BatchHandlerModel.from_dataset(ds)
        self.l1 = nn.Linear(self.batch_handler.out_dim(), 32)
        self.l2 = nn.Linear(32, 1)
    
    def forward(self, x):
        x = self.batch_handler(x)
        x = self.l1(x)
        x = torch.relu(x)
        x = self.l2(x)
        return x

In [22]:
model = SampleModel(train_ds)

In [23]:
model

SampleModel(
  (batch_handler): BatchHandlerModel(
    (embs): ModuleList(
      (0): Embedding(2, 1)
      (1): Embedding(3, 3)
      (2): Embedding(2, 1)
      (3): Embedding(2, 1)
      (4): Embedding(17, 50)
      (5): Embedding(8, 28)
      (6): Embedding(9, 36)
      (7): Embedding(5, 10)
      (8): Embedding(6, 15)
    )
  )
  (l1): Linear(in_features=150, out_features=32, bias=True)
  (l2): Linear(in_features=32, out_features=1, bias=True)
)
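The `in_features=150` of `l1` can be accounted for from the printout: the nine embedding widths plus one value for each of the five non-target numeric fields (CNT_CHILDREN, AMT_INCOME_TOTAL, AMT_CREDIT, AMT_ANNUITY, AMT_GOODS_PRICE). The short calculation below checks this. The embedding-width rule at the end, `min(50, n * (n - 1) // 2)`, is only a guess that happens to fit these nine layers. It is inferred from the output, not taken from the torchtable documentation.

```python
# (num_embeddings, embedding_dim) pairs copied from the printed model
embeddings = [(2, 1), (3, 3), (2, 1), (2, 1), (17, 50), (8, 28), (9, 36), (5, 10), (6, 15)]
num_numeric_inputs = 5  # one scalar per non-target numeric field

embedding_width = sum(dim for _, dim in embeddings)
total_width = embedding_width + num_numeric_inputs
print(embedding_width, total_width)  # 145 150

# Guessed rule (an assumption inferred from the printout, not documented behaviour)
guessed_dims = [min(50, n * (n - 1) // 2) for n, _ in embeddings]
print(guessed_dims == [dim for _, dim in embeddings])  # True
```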

Let's check that it can really handle the batch inputs.

In [24]:
model(next(iter(train_dl))[0])

tensor([[-0.0458],
        [-0.1506],
        [-0.0568],
        [ 0.0862],
        [ 0.0961],
        [-0.2124],
        [-0.0518],
        [-0.0706],
        [-0.1359],
        [-0.1723],
        [-0.1166],
        [-0.1334],
        [-0.1867],
        [ 0.1361],
        [-0.2082],
        [-0.0599],
        [-0.1815],
        [-0.0142],
        [-0.0444],
        [-0.0167],
        [-0.1260],
        [-0.0708],
        [-0.1113],
        [-0.0235],
        [-0.0471],
        [-0.0607],
        [-0.1434],
        [ 0.1893],
        [-0.1661],
        [-0.0162],
        [-0.0161],
        [-0.0705]], grad_fn=<AddmmBackward>)

Success!

# Training the model

Torchtable does not provide any in-house training functions, but writing the training loop is relatively simple. 

In [25]:
import tqdm

In [26]:
epochs = 10
loss_func = nn.BCEWithLogitsLoss()
opt = torch.optim.Adam(model.parameters())

In [27]:
for epoch in range(1, epochs + 1):
    running_loss = 0.0
    model.train()  # turn on training mode
    for x, y in tqdm.tqdm(train_dl):  # each batch is an (inputs, targets) tuple of dictionaries
        opt.zero_grad()

        preds = model(x)
        loss = loss_func(preds, y["TARGET"].unsqueeze(1))
        loss.backward()
        opt.step()

        # weight the batch-mean loss by the batch size so the epoch average is per example
        running_loss += loss.item() * len(y["TARGET"])

    epoch_loss = running_loss / len(train_ds)

    # calculate the validation loss for this epoch
    val_loss = 0.0
    model.eval()  # turn on evaluation mode
    with torch.no_grad():  # gradients are not needed for validation
        for x, y in val_dl:
            preds = model(x)
            loss = loss_func(preds, y["TARGET"].unsqueeze(1))
            val_loss += loss.item() * len(y["TARGET"])

    val_loss /= len(val_ds)
    print('Epoch: {}, Training Loss: {:.4f}, Validation Loss: {:.4f}'.format(epoch, epoch_loss, val_loss))

100%|██████████| 7208/7208 [00:24<00:00, 298.25it/s]
Epoch: 1, Training Loss: 0.2768, Validation Loss: 0.2706

100%|██████████| 7208/7208 [00:24<00:00, 298.49it/s]
Epoch: 2, Training Loss: 0.2748, Validation Loss: 0.2688

100%|██████████| 7208/7208 [00:24<00:00, 294.20it/s]
Epoch: 3, Training Loss: 0.2743, Validation Loss: 0.2689

100%|██████████| 7208/7208 [00:23<00:00, 303.33it/s]
Epoch: 4, Training Loss: 0.2739, Validation Loss: 0.2688

100%|██████████| 7208/7208 [00:24<00:00, 293.09it/s]
Epoch: 5, Training Loss: 0.2737, Validation Loss: 0.2682

100%|██████████| 7208/7208 [00:24<00:00, 296.43it/s]
Epoch: 6, Training Loss: 0.2734, Validation Loss: 0.2697

100%|██████████| 7208/7208 [00:24<00:00, 293.28it/s]
Epoch: 7, Training Loss: 0.2732, Validation Loss: 0.2678

100%|██████████| 7208/7208 [00:25<00:00, 283.26it/s]
Epoch: 8, Training Loss: 0.2729, Validation Loss: 0.2691

100%|██████████| 7208/7208 [00:24<00:00, 289.60it/s]
Epoch: 9, Training Loss: 0.2729, Validation Loss: 0.2677

100%|██████████| 7208/7208 [00:23<00:00, 304.42it/s]
Epoch: 10, Training Loss: 0.2726, Validation Loss: 0.2686


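The training loss decreases steadily (from 0.2768 to 0.2726 over ten epochs), so it looks like it's working! Note that the validation loss drops after the first epoch and then fluctuates between roughly 0.268 and 0.270, so most of the gain comes early.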
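As a sanity check on the progress bars, the 7208 iterations per epoch can be reproduced from the dataset size. The calculation below assumes scikit-learn's default `test_size` of 0.25 for `train_test_split` (no size was passed above), a training batch size of 32, and a final partial batch that is kept rather than dropped.

```python
import math

n_rows = 307511          # rows in application_train.csv
test_fraction = 0.25     # scikit-learn default when no size is given
batch_size = 32          # first entry of (32, 32, 128)

n_val = math.ceil(n_rows * test_fraction)
n_train = n_rows - n_val
batches_per_epoch = math.ceil(n_train / batch_size)  # assumes the last partial batch is kept

print(n_train, n_val, batches_per_epoch)  # 230633 76878 7208
```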

# Summary

The basic flow in torchtable can be summarized as follows:

1. Determine what kind of preprocessing to apply to each column.
2. Construct a dataset.
3. Construct a data loader from the dataset.
4. Construct a model using the dataset.
5. Train the model.

In future examples, we will cover how to create custom Fields to apply arbitrary preprocessing, how to create loaders that perform some special processing, and many more topics!