# Speeding up fastai Tabular with NumPy

> Speeding up fastai tabular training by 40%

- toc: true
- badges: true
- comments: true
- image: images/chart-preview.png

---
This blog post is also a Jupyter notebook that can be run from top to bottom. The code snippets below can be run in any Jupyter environment. This post was written using:

* `fastai2`: 0.0.16
* `fastcore`: 0.1.16

---

# What is this article?

In this article, we're going to dive deep into the `fastai` `DataLoader` and how to integrate it with NumPy. The end result? Speeding up tabular training by 40% (to the point where almost half the time per epoch is just the time to train the model itself).

## What is `fastai` Tabular? A TL;DR

When working with tabular data, `fastai` has introduced a powerful tool to help with preprocessing your data: `TabularPandas`. It's super helpful because you can have everything in one place, encode and decode all of your tables at once, and the memory overhead on top of your `Pandas` dataframe can be very small. Let's look at an example of it.

First let's import the tabular module:

In [None]:
from fastai2.tabular.all import *

For our particular tests today, we'll be using the `ADULT_SAMPLE` dataset, where we need to identify if a particular individual makes above or below $50,000. Let's grab the data:

In [None]:
path = untar_data(URLs.ADULT_SAMPLE)

And now we can open it in `Pandas`:

In [None]:
df = pd.read_csv(path/'adult.csv')

In [None]:
df.head()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,salary
0,49,Private,101320,Assoc-acdm,12.0,Married-civ-spouse,,Wife,White,Female,0,1902,40,United-States,>=50k
1,44,Private,236746,Masters,14.0,Divorced,Exec-managerial,Not-in-family,White,Male,10520,0,45,United-States,>=50k
2,38,Private,96185,HS-grad,,Divorced,,Unmarried,Black,Female,0,0,32,United-States,<50k
3,38,Self-emp-inc,112847,Prof-school,15.0,Married-civ-spouse,Prof-specialty,Husband,Asian-Pac-Islander,Male,0,0,40,United-States,>=50k
4,42,Self-emp-not-inc,82297,7th-8th,,Married-civ-spouse,Other-service,Wife,Black,Female,0,0,50,United-States,<50k


Now that we have our `DataFrame`, let's fit it into a `TabularPandas` object for preprocessing. To do so, we need to declare the following:

* `procs` (pre-processing our data, such as normalization and converting categorical values to numbers)
* `cat_names` (categorical variables)
* `cont_names` (continuous variables)
* `y_names` (our y columns)

For our case, these look like so:

In [None]:
cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race']
cont_names = ['age', 'fnlwgt', 'education-num']
procs = [Categorify, FillMissing, Normalize]
y_names = 'salary'
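To make the three `procs` a little more concrete, here is a minimal NumPy sketch of the idea behind each one. It is not `fastai`'s implementation: the toy columns and variable names are made up for illustration, and details such as which code is reserved for missing categories are assumptions.

```python
import numpy as np

# Toy columns standing in for one categorical and one continuous variable.
workclass = np.array(["Private", "State-gov", "Private", "Self-emp"], dtype=object)
education_num = np.array([12.0, np.nan, 9.0, 15.0])

# Categorify (sketch): map each category to an integer code,
# leaving 0 free for missing/unseen values (an assumed convention).
vocab = sorted(set(workclass))
codes = np.array([vocab.index(v) + 1 for v in workclass])

# FillMissing (sketch): remember which rows were missing, then fill with the median.
was_missing = np.isnan(education_num)
filled = np.where(was_missing, np.nanmedian(education_num), education_num)

# Normalize (sketch): shift to zero mean and scale to unit standard deviation.
normed = (filled - filled.mean()) / filled.std()

print(codes, was_missing, filled, normed)
```

In the real pipeline the statistics are learned from the data rather than hard-coded, and the "was missing" flag is what shows up as an extra `_na` column in the encoded table.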

We'll also need to tell `TabularPandas` how we want to split our data. We'll hold out a random 20% of the rows for validation:

In [None]:
splits = RandomSplitter()(range_of(df))
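As a rough picture of what a random splitter does, the sketch below shuffles the row indices and cuts off a fraction for validation. `random_split` is a hypothetical helper written with NumPy for illustration, not the `fastai` function itself.

```python
import numpy as np

def random_split(n_items, valid_pct=0.2, seed=None):
    "Hypothetical helper: shuffle row indices and hold out `valid_pct` of them"
    rng = np.random.default_rng(seed)
    idxs = rng.permutation(n_items)
    cut = int(valid_pct * n_items)
    return idxs[cut:], idxs[:cut]   # (train indices, valid indices)

train_idx, valid_idx = random_split(1000, seed=42)
print(len(train_idx), len(valid_idx))  # 800 200
```

The two index arrays never overlap, which is what lets the validation set act as unseen data.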

Now let's make a `TabularPandas`!

In [None]:
to = TabularPandas(df, procs=procs, cat_names=cat_names, cont_names=cont_names,
                   y_names=y_names, splits=splits)

Now all of our data is pre-processed, and we can grab the raw values if we wanted to, say, use them with `XGBoost`:

In [None]:
to.train.xs.iloc[:3]

Unnamed: 0,workclass,education,marital-status,occupation,relationship,race,education-num_na,age,fnlwgt,education-num
3337,8,13,5,11,2,2,1,-1.071338,-0.39674,1.53295
13162,8,12,3,12,1,5,1,0.030019,0.46965,-0.417817
10215,5,10,1,5,2,5,1,-0.557371,0.392678,1.142797


And it's fully encoded! Now that we're a bit familiar with `TabularPandas`, let's do some speed tests!

## The Baseline

For our tests, we'll run 4 different tests:
1. One batch of the training data
2. Iterating over the entire training dataset
3. Iterating over the entire validation set
4. Fitting for 10 epochs (GPU only)

And for each of these we will compare the times on the CPU and the GPU. 

### CPU

First let's time grabbing the first batch. This matters because each time we start iterating over the training `DataLoader`, we shuffle our data, which can add some time:

In [None]:
dls = to.dataloaders(bs=128, device='cpu')

To test our times, we'll use `%%timeit`. It runs the code in a cell for a number of loops, repeats that a few times, and reports the best result. For iterating over the entire `DataLoader` we'll look at the time per batch as well.
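Outside of a notebook, a similar "best of a few repeats" measurement can be sketched with the standard library's `timeit` module. The helper names and the workload below are hypothetical and only illustrate the bookkeeping.

```python
import timeit

def best_ms_per_loop(fn, number=100, repeat=3):
    "Hypothetical helper: time `number` calls of `fn`, `repeat` times, keep the best"
    totals = timeit.repeat(fn, number=number, repeat=repeat)  # seconds per group of calls
    return min(totals) / number * 1000                        # milliseconds per call

def per_batch_ms(total_ms, n_batches):
    "Turn the time for one full pass into an average time per batch"
    return total_ms / n_batches

data = list(range(10_000))
ms = best_ms_per_loop(lambda: sum(data))
print(f"{ms:.4f} ms per loop")
print(per_batch_ms(600, 200))  # illustrative numbers, not measurements
```

Taking the minimum rather than the mean filters out runs that were slowed down by unrelated activity on the machine.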

First, a batch from training:

In [None]:
%%timeit
_ = next(iter(dls.train))

10 loops, best of 3: 18.3 ms per loop


Now the validation:

In [None]:
%%timeit
_ = next(iter(dls.valid))

100 loops, best of 3: 3.37 ms per loop


Alright, so first we can see that our shuffling function is adding almost 15 milliseconds to our time, something we can improve on! Let's then go through the entire `DataLoader`:

In [None]:
%%timeit
for _ in dls.train:
    _

1 loop, best of 3: 661 ms per loop


Now let's get an average time per batch:

In [None]:
print(661/len(dls.train))

3.2561576354679804


About 3.25 milliseconds per batch on the training dataset. Let's look at the validation:

In [None]:
%%timeit
for _ in dls.valid:
    _

10 loops, best of 3: 159 ms per loop


In [None]:
print(159/len(dls.valid))

3.1176470588235294


And about 3.11 milliseconds per batch on the validation, so once the shuffle is out of the way the per-batch cost is about the same. Now let's compare some GPU times:

### GPU

In [None]:
dls = to.dataloaders(bs=128, device='cuda')

In [None]:
%%timeit
_ = next(iter(dls.train))

100 loops, best of 3: 18.8 ms per loop


In [None]:
%%timeit
_ = next(iter(dls.valid))

100 loops, best of 3: 3.49 ms per loop


So first, grabbing just one batch, we can see it added about half a millisecond on the training and about 0.1 milliseconds on the validation, so we're not utilizing the GPU for this process much (which makes sense, `TabularPandas` is *CPU* bound). And now let's iterate:

In [None]:
%%timeit
for _ in dls.train:
    _

1 loop, best of 3: 693 ms per loop


In [None]:
print(693/len(dls.train))

3.413793103448276


In [None]:
%%timeit
for _ in dls.valid:
    _

10 loops, best of 3: 163 ms per loop


In [None]:
print(163/len(dls.valid))

3.196078431372549


And we can see a little bit more time being added here as well. Now that we have those baselines, let's fit for ten epochs real quick:

In [None]:
learn = tabular_learner(dls, layers=[200,100], metrics=accuracy)

In [None]:
%%time
learn.fit(10, 1e-2)

epoch,train_loss,valid_loss,accuracy,time
0,0.377574,0.364423,0.833999,00:02
1,0.356772,0.357792,0.835688,00:02
2,0.358388,0.358207,0.833692,00:02
3,0.352414,0.352521,0.840602,00:02
4,0.349441,0.35007,0.840756,00:02
5,0.347263,0.358235,0.84137,00:02
6,0.346777,0.352908,0.838606,00:02
7,0.352095,0.352776,0.839681,00:02
8,0.347428,0.348187,0.840909,00:02
9,0.346684,0.352819,0.835074,00:02


CPU times: user 22.2 s, sys: 263 ms, total: 22.4 s
Wall time: 22.9 s


After fitting, we got about 22.9 seconds in total and ~2.29 seconds per epoch! Now that we have our baselines, let's try to speed that up!

## Bringing in `NumPy`

### The `Dataset`

While speeding everything up, I wanted to keep `TabularPandas` as it is, as it's a great way to pre-process your data! So instead we'll create a new `Dataset` class where we will convert our `TabularPandas` object into `NumPy` arrays. Why is that important? `NumPy` is a super-fast library that pushes as much work as it can into optimized C code, which is *leagues* faster than doing the same work in pure Python. Let's build our `Dataset`!

We'll want it to keep the `cats`, `conts`, and `ys` from our `TabularPandas` object separate. We can call `to_numpy()` on all of them because each is simply stored as a `DataFrame`! Finally, to deal with categorical versus continuous variables, we'll cast our `cats` to `np.long` and our `conts` to `np.float32` (our `ys` come out as `np.int8`, but this is because we're doing classification):

In [None]:
class TabDataset():
    "A `NumPy` dataset from a `TabularPandas` object"
    def __init__(self, to):
        self.cats = to.cats.to_numpy().astype(np.long)
        self.conts = to.conts.to_numpy().astype(np.float32)
        self.ys = to.ys.to_numpy()
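The dtype choices can be checked on a couple of toy arrays. This is only a sketch: the values are made up, and because `np.long` is not available under that name in every NumPy release, it spells the integer type as `np.int64`, which is an assumption about the intended width.

```python
import numpy as np

# Categorical codes end up as integer indices, so they are cast to a 64-bit integer type.
cats = np.array([[8, 13], [5, 10]]).astype(np.int64)

# Continuous values default to float64 in NumPy; float32 halves the memory.
conts64 = np.array([[-1.07, 0.47], [0.03, -0.42]])
conts32 = conts64.astype(np.float32)

print(cats.dtype, conts64.dtype, conts32.dtype)
print(conts64.nbytes, conts32.nbytes)
```

Casting once, when the dataset is built, means no per-batch type conversion is needed later.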

Great! Now we need a few more bits for everything to work! For our `Dataset` to function, we need to be able to gather values from it each time we index into it. We use the `__getitem__` method to do so! For our particular problem, we need it to return some `cats`, `conts`, and our `ys`. And to save more time we'll return a whole *batch* of values:

In [None]:
class TabDataset():
    "A `NumPy` dataset from a `TabularPandas` object"
    def __init__(self, to):
        self.cats = to.cats.to_numpy().astype(np.long)
        self.conts = to.conts.to_numpy().astype(np.float32)
        self.ys = to.ys.to_numpy()

    def __getitem__(self, idx):
        idx = idx[0]
        return self.cats[idx:idx+self.bs], self.conts[idx:idx+self.bs], self.ys[idx:idx+self.bs]

You'll notice we don't explicitly pass in a batch size, so where is that coming from? This is added when we build our `DataLoader`, as we'll see later. Let's finish up our `Dataset` class by adding in an option to get the length of the dataset (we'll do the length of our categorical table in this case).

In [None]:
class TabDataset():
    "A `NumPy` dataset from a `TabularPandas` object"
    def __init__(self, to):
        self.cats = to.cats.to_numpy().astype(np.long)
        self.conts = to.conts.to_numpy().astype(np.float32)
        self.ys = to.ys.to_numpy()

    def __getitem__(self, idx):
        idx = idx[0]
        return self.cats[idx:idx+self.bs], self.conts[idx:idx+self.bs], self.ys[idx:idx+self.bs]

    def __len__(self): return len(self.cats)
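To see the batch-slicing behaviour without a `TabularPandas` object, here is a small stand-in built from plain arrays. `ToyBatchDataset` and its data are hypothetical; the indexing logic mirrors the class above, with the batch size passed in directly for convenience.

```python
import numpy as np

class ToyBatchDataset:
    "Hypothetical stand-in for `TabDataset`, built from plain arrays"
    def __init__(self, cats, conts, ys, bs):
        self.cats, self.conts, self.ys, self.bs = cats, conts, ys, bs

    def __getitem__(self, idx):
        start = idx[0]            # only the first index of the batch is used
        stop = start + self.bs    # a stop past the end is fine: NumPy just returns fewer rows
        return self.cats[start:stop], self.conts[start:stop], self.ys[start:stop]

    def __len__(self): return len(self.cats)

n = 10
ds = ToyBatchDataset(
    cats=np.arange(n * 2).reshape(n, 2),
    conts=np.linspace(0, 1, n * 3, dtype=np.float32).reshape(n, 3),
    ys=np.arange(n, dtype=np.int8).reshape(n, 1),
    bs=4,
)

cats, conts, ys = ds[[8, 9]]    # last batch: only two rows are left
print(cats.shape, conts.shape, ys.ravel())
```

Because the batch is a slice, the returned arrays are views into the stored data rather than copies.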

And now we can make some `Datasets`!

In [None]:
train_ds = TabDataset(to.train)
valid_ds = TabDataset(to.valid)

We can look at some data real quick if we want to as well! First we need to assign a batch size:

In [None]:
train_ds.bs = 3

And now let's look at some data:

In [None]:
train_ds[[3]]

(array([[ 5, 10,  5,  5,  2,  5,  1],
        [ 2, 16,  3,  5,  1,  3,  1],
        [ 5, 16,  3,  5,  1,  5,  1]]),
 array([[-0.9979143 ,  0.07715245,  1.1427965 ],
        [ 0.8376807 ,  1.4486277 , -0.02766372],
        [ 1.4984949 , -1.4280752 , -0.02766372]], dtype=float32),
 array([[0],
        [0],
        [1]], dtype=int8))

We can see that we output what could be considered a batch of data! The only thing missing is to make it into a tensor! Fantastic! Now let's build the `DataLoader`, as there are some pieces in it that we need, so simply having this `Dataset` won't be enough.

### The `DataLoader`

Now to build our `DataLoader`, we're going to want to modify 4 particular functions:

1. `create_item`
2. `create_batch`
3. `get_idxs`
4. `shuffle_fn`

Each of these plays a particular role. First let's look at our template:

In [None]:
class TabDataLoader(DataLoader):
    def __init__(self, dataset, bs=1, num_workers=0, device='cuda', shuffle=False, **kwargs):
        "A `DataLoader` based on a `TabDataset`"
        super().__init__(dataset, bs=bs, num_workers=num_workers, shuffle=shuffle, 
                         device=device, drop_last=shuffle, **kwargs)
        self.dataset.bs=bs

As you can see, our `__init__` will build a `DataLoader`, and we keep track of our `Dataset` and set the `Dataset`'s batch size here as well:

In [None]:
dl = TabDataLoader(train_ds, bs=3)

In [None]:
dl.dataset.bs

3

In [None]:
dl.dataset[[0]]

(array([[ 8, 13,  5, 11,  2,  2,  1],
        [ 8, 12,  3, 12,  1,  5,  1],
        [ 5, 10,  1,  5,  2,  5,  1]]),
 array([[-1.071338  , -0.39674038,  1.5329499 ],
        [ 0.03001888,  0.46965045, -0.41781712],
        [-0.5573715 ,  0.39267784,  1.1427965 ]], dtype=float32),
 array([[0],
        [0],
        [0]], dtype=int8))

And we can see that we grab everything as normal in the `Dataset`! Great! Now let's work on `create_item` and `create_batch`. `create_item` is very simple: the item is already built when we index into the dataset, so we just pass it on. `create_batch` is also very simple. We'll take some indices, grab the matching rows from our `Dataset`, and convert them all to `Tensors`!

In [None]:
class TabDataLoader(DataLoader):
    def __init__(self, dataset, bs=1, num_workers=0, device='cuda', shuffle=False, **kwargs):
        "A `DataLoader` based on a `TabDataset`"
        super().__init__(dataset, bs=bs, num_workers=num_workers, shuffle=shuffle, 
                         device=device, drop_last=shuffle, **kwargs)
        self.dataset.bs=bs
    
    def create_item(self, s): return s

    def create_batch(self, b):
        cat, cont, y = self.dataset[b]
        return tensor(cat).to(self.device), tensor(cont).to(self.device), tensor(y).to(self.device)

Now we're almost done. The last two missing pieces are `get_idxs` and `shuffle_fn`. These are needed because after each epoch we shuffle the dataset, and we need to get a list of indices for our `DataLoader` to use! To save on time, we can shuffle the interior dataset itself instead of the indices! The major benefit is that each batch can then be grabbed by slicing (consecutive idxs) instead of array indexing (non-consecutive idxs). Let's look at what that looks like:

In [None]:
class TabDataLoader(DataLoader):
    def __init__(self, dataset, bs=1, num_workers=0, device='cuda', shuffle=False, **kwargs):
        "A `DataLoader` based on a `TabDataset`"
        super().__init__(dataset, bs=bs, num_workers=num_workers, shuffle=shuffle, 
                         device=device, drop_last=shuffle, **kwargs)
        self.dataset.bs=bs
    
    def create_item(self, s): return s

    def create_batch(self, b):
        "Create a batch of data"
        cat, cont, y = self.dataset[b]
        return tensor(cat).to(self.device), tensor(cont).to(self.device), tensor(y).to(self.device)

    def get_idxs(self):
        "Get index's to select"
        idxs = Inf.count if self.indexed else Inf.nones
        if self.n is not None: idxs = list(range(len(self.dataset)))
        return idxs

    def shuffle_fn(self):
        "Shuffle the interior dataset"
        rng = np.random.permutation(len(self.dataset))
        self.dataset.cats = self.dataset.cats[rng]
        self.dataset.conts = self.dataset.conts[rng]
        self.dataset.ys = self.dataset.ys[rng]
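The difference between "shuffle once, then slice" and "index with shuffled indices on every batch" can be seen on toy arrays. This is a NumPy-only sketch with made-up data; the way indices are chunked below is a simplification of what the parent `DataLoader` does, and it keeps the short final chunk.

```python
import numpy as np

rng = np.random.default_rng(0)
n, bs = 10, 4
cats = np.arange(n * 2).reshape(n, 2)   # row i is [2i, 2i+1]
ys = np.arange(n).reshape(n, 1)         # row i is [i], so alignment is easy to check

# Shuffle once per epoch: one permutation applied to every array keeps rows aligned.
perm = rng.permutation(n)
cats_shuf, ys_shuf = cats[perm], ys[perm]

# Consecutive indices are handed out in chunks of `bs`; only the first one is needed.
chunks = [list(range(i, min(i + bs, n))) for i in range(0, n, bs)]
batches = [(cats_shuf[c[0]:c[0] + bs], ys_shuf[c[0]:c[0] + bs]) for c in chunks]

# For comparison: indexing with shuffled indices copies rows for every batch.
fancy_batch = cats[perm[:bs]]

print(np.shares_memory(batches[0][0], cats_shuf))  # True: a slice is a view
print(np.shares_memory(fancy_batch, cats))         # False: fancy indexing copies
```

Both approaches yield the same rows for the first batch; the slice simply avoids a copy each time a batch is requested, at the cost of one full copy per shuffle.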

And now we have all the pieces we need to build a `DataLoader` with `NumPy`! We'll examine its speed now and then we'll build some convenience functions later. First let's build the `Datasets`:

In [None]:
train_ds = TabDataset(to.train)
valid_ds = TabDataset(to.valid)

And then the `DataLoader`:

In [None]:
train_dl = TabDataLoader(train_ds, device='cpu', shuffle=True, bs=128)
valid_dl = TabDataLoader(valid_ds, device='cpu', bs=128)

And now let's grab some CPU timings similar to what we did before:

In [None]:
%%timeit
_ = next(iter(train_dl))

1000 loops, best of 3: 669 µs per loop


In [None]:
%%timeit
_ = next(iter(valid_dl))

1000 loops, best of 3: 300 µs per loop


**Right** away we can see that we are *leagues* faster than the previous version. Shuffling only added ~370 *microseconds*, and the first training batch now takes about 4% of the time it did before (669 µs versus 18.3 ms)! Now let's iterate over the entire `DataLoader`:

In [None]:
%%timeit
for _ in train_dl:
    _

10 loops, best of 3: 31.8 ms per loop


In [None]:
print(31.8/len(train_dl))

0.1566502463054187


In [None]:
%%timeit
for _ in valid_dl:
    _

100 loops, best of 3: 8.07 ms per loop


In [None]:
print(8.07/len(valid_dl))

0.15823529411764706


And as we can see, each individual batch of data takes about 0.158 milliseconds! Yet again, that's about 5% of the baseline time per batch, quite a decrease! So we have **successfully** decreased the time! Let's look at the GPU now:

In [None]:
train_dl = TabDataLoader(train_ds, device='cuda', shuffle=True, bs=128)
valid_dl = TabDataLoader(valid_ds, device='cuda', bs=128)

In [None]:
%%timeit
_ = next(iter(train_dl))

1000 loops, best of 3: 835 µs per loop


In [None]:
%%timeit
_ = next(iter(valid_dl))

1000 loops, best of 3: 451 µs per loop


In [None]:
%%timeit
for _ in train_dl:
    _

10 loops, best of 3: 51.5 ms per loop


In [None]:
print(51.5/len(train_dl))

0.2536945812807882


In [None]:
%%timeit
for _ in valid_dl:
    _

100 loops, best of 3: 12.8 ms per loop


In [None]:
print(12.8/len(valid_dl))

0.25098039215686274


As we can see, moving the tensors over to `cuda` adds a little bit of time. You could save a *little* bit more by converting first, but as this should be separate from the dataset I decided to just keep it here. Now that we have all the steps, we can finally take a look at training! First let's build a quick helper class to make `DataLoaders` similar to what `fastai`'s `tabular_learner` would be expecting:

In [None]:
class TabDataLoaders(DataLoaders):
    def __init__(self, to, bs=64, val_bs=None, shuffle_train=True, device='cpu', **kwargs):
        train_ds = TabDataset(to.train)
        valid_ds = TabDataset(to.valid)
        val_bs = bs if val_bs is None else val_bs
        train = TabDataLoader(train_ds, bs=bs, shuffle=shuffle_train, device=device, **kwargs)
        valid = TabDataLoader(valid_ds, bs=val_bs, shuffle=False, device=device, **kwargs)
        super().__init__(train, valid, device=device, **kwargs)

In [None]:
dls = TabDataLoaders(to, bs=128, device='cuda')

And now we can build our model and train! We need to build our own `TabularModel` here, so we'll need to grab the size of our embeddings and build a `Learner`. For simplicity we'll still use `TabularPandas` to get those sizes:

In [None]:
emb_szs = get_emb_sz(to)
net = TabularModel(emb_szs, n_cont=3, out_sz=2, layers=[200,100]).cuda()
learn = Learner(dls, net, metrics=accuracy, loss_func=CrossEntropyLossFlat())
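For a feel of where the embedding sizes come from: to my understanding, `fastai`'s default rule of thumb grows the embedding width slowly with the number of categories and caps it at 600. The function below is a hypothetical re-statement of that heuristic, so treat the exact constants as an assumption and check them against your installed version.

```python
def emb_size_rule(n_categories):
    "Hypothetical re-statement of the rule of thumb: slow growth with cardinality, capped at 600"
    return min(600, round(1.6 * n_categories ** 0.56))

for n in (2, 10, 100, 10_000):
    print(n, emb_size_rule(n))
```

Each categorical column gets its own (cardinality, width) pair, which is the list handed to `TabularModel`.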

And now let's train!

In [None]:
%%time
learn.fit(10, 1e-2)

epoch,train_loss,valid_loss,accuracy,time
0,0.369785,0.358381,0.837531,00:01
1,0.359938,0.354405,0.840602,00:01
2,0.353965,0.35438,0.837838,00:01
3,0.350551,0.355998,0.837684,00:01
4,0.349042,0.357085,0.838606,00:01
5,0.347858,0.354116,0.839988,00:01
6,0.344613,0.352649,0.840448,00:01
7,0.343187,0.351604,0.840909,00:01
8,0.342587,0.353344,0.841523,00:01
9,0.342127,0.355749,0.841216,00:01


CPU times: user 13.4 s, sys: 203 ms, total: 13.6 s
Wall time: 13.8 s


As you can see, we cut the training time by about 40% (from 22.9 seconds down to 13.8)! So we saw a *tremendous* speed up! Let's quickly revisit all of the times and results in a pretty table.

## Results

|  | CPU? | First Batch | Per Batch | Per Epoch | Ten Epochs |
|:-------:|:----:|:-------------------------------:|:-----------------------------:|-----------|------------|
| fastai2 | Yes | 18.3ms (train) 3.37ms (valid) | 3.25ms (train) 3.11ms (valid) |  |  |
|  | No | 18.8ms (train) 3.49ms (valid) | 3.41ms (train) 3.19ms (valid) | 2.29s | 22.9s |
| NumPy | Yes | 0.669ms (train) 0.3ms (valid) | 0.15ms (train) 0.15ms (valid) |  |  |
|  | No | 0.835ms (train) 0.451ms (valid) | 0.25ms (train) 0.25ms (valid) | 1.38s | 13.8s |
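The relative savings can be worked out directly from the GPU rows of the table. This is a small arithmetic sketch; the dictionary keys and the helper name are just labels chosen for illustration.

```python
# GPU rows of the results table: (baseline, NumPy version)
timings = {
    "first train batch (ms)": (18.8, 0.835),
    "per train batch (ms)": (3.41, 0.25),
    "ten epochs (s)": (22.9, 13.8),
}

def reduction_pct(old, new):
    "Percentage of the original time that was removed"
    return 100 * (1 - new / old)

summary = {name: reduction_pct(old, new) for name, (old, new) in timings.items()}
for name, pct in summary.items():
    print(f"{name}: {pct:.1f}% less time")
```

Data loading itself shrinks by well over 90%, while end-to-end training shrinks by roughly 40%, because the model's forward and backward passes take the same time as before.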

So in summary, we first sped up the time to grab a single batch of data by converting everything from `Pandas` to `NumPy`. Afterwards we made a custom `DataLoader` that could handle these `NumPy` arrays and deliver the speedup we saw! I hope this article helps you better understand how the interior `DataLoader` can be integrated with `NumPy`, and that it helps you speed up your tabular training!

* Small note: `show_batch()` etc. will *not* work with this particular code base; this is simply a proof of concept.