# A walk with fastinference - Part 1
> A guided walkthrough of the `fastinference` module

- toc: true
- badges: true
- comments: true
- image: images/chart-preview.png

---
This blog post is also a Jupyter notebook that you can run from the top down, and its code snippets can be run in any environment. Below are the versions of `fastai2`, `fastcore`, and `fastinference` I was running at the time of writing:

* `fastai2`: 0.0.17
* `fastcore`: 0.1.18
* `fastinference`: 0.0.4
---

## `fastinference`, what is it and why do I need it?

Over the last few months I've been trying to speed up inference for `fastai`, and more and more I noticed that I was using the same "realm" of functions! So I decided to fit them into a cohesive library, with some special perks along with it. Along with the speed-up modules, this library has some interpretability modules as well as ONNX support. Each will get its own article to walk you through it. Today we'll be covering the `fastai` speed-ups!

## Starting with `Vision`

Let's begin with a vision problem. We'll use our *very* familiar `PETs` dataset and quickly train a model. What we'll be focusing on is two very specific functions, `.predict` and `.get_preds`, so first let's prepare:

  * I will be skipping the explanation for this part. If you'd like an in-depth walkthrough of the API, see my article [here](https://muellerzr.github.io/fastblog/datablock/2020/03/21/DataBlockAPI.html).

In [0]:
from fastai2.vision.all import *

In [5]:
path = untar_data(URLs.PETS)
pat = r'([^/]+)_\d+.*$'
splitter = RandomSplitter(valid_pct=0.2, seed=42)
item_tfms = [Resize(224, method='crop')]
batch_tfms=[*aug_transforms(size=256), Normalize.from_stats(*imagenet_stats)]
fnames = get_image_files(path/'images')

In [0]:
dblock = DataBlock(blocks=(ImageBlock, CategoryBlock),
                   get_items=get_image_files,
                   splitter=RandomSplitter(),
                   get_y=RegexLabeller(pat=pat),
                   item_tfms=item_tfms,
                   batch_tfms=batch_tfms)

In [0]:
pets_dls = dblock.dataloaders(path/'images')

Now that we have our `DataLoaders`, we'll use a regular `resnet34` so we can see the speed:

In [0]:
learn = cnn_learner(pets_dls, resnet34, metrics=accuracy)

In [0]:
learn.fine_tune(1)

| epoch | train_loss | valid_loss | accuracy | time |
|---|---|---|---|---|
| 0 | 1.499775 | 0.304153 | 0.901218 | 00:53 |

| epoch | train_loss | valid_loss | accuracy | time |
|---|---|---|---|---|
| 0 | 0.453589 | 0.239638 | 0.924222 | 00:55 |


In [0]:
learn.save('pets')

Now we'll time a `get_preds` and a `predict` on both `CPU` and `CUDA`:

In [0]:
%%time
# CUDA:
preds = learn.get_preds()

CPU times: user 990 ms, sys: 326 ms, total: 1.32 s
Wall time: 10 s


In [0]:
%%time
# CUDA:
pred = learn.predict(fnames[0])

CPU times: user 72.3 ms, sys: 140 ms, total: 212 ms
Wall time: 253 ms


In [0]:
learn.dls.device = 'cpu'

In [0]:
%%time
# CPU:
preds = learn.get_preds()

* Note: I skipped this, as it takes ~4:54 (minutes:seconds)

In [0]:
%%time
# CPU:
preds = learn.predict(fnames[0])

CPU times: user 243 ms, sys: 9.13 ms, total: 252 ms
Wall time: 263 ms


Okay, not too bad, right? But can we make it *better*? Let's bring in `fastinference`:

In [0]:
from fastinference.inference import *

Now, those of you who are Python-savvy will notice that we never use a new name from that import. Why? The import overrides `.predict` and `.get_preds`. As such, no adjustments are needed except for rebuilding the `Learner`!

In [0]:
pets_dls.device= 'cuda'

In [0]:
learn = cnn_learner(pets_dls, resnet34, metrics=accuracy).load('pets')

Let's try our timings again:

In [0]:
%%time
# CUDA:
preds = learn.get_preds()

CPU times: user 649 ms, sys: 206 ms, total: 855 ms
Wall time: 9.83 s


In [0]:
%%time
# CUDA:
pred = learn.predict(fnames[0])

CPU times: user 32.1 ms, sys: 2.01 ms, total: 34.1 ms
Wall time: 34.7 ms


So as you saw, we only shaved ~0.2 seconds off `get_preds`, but `predict` dropped by roughly 220 milliseconds (253 ms down to 34.7 ms)! It now takes only about 14% of the original time!
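For anyone who wants to check the arithmetic, here is a small sketch that derives the saving from the two CUDA `predict` wall times reported above. The variable names are just for illustration:

```python
# Wall times copied from the CUDA `predict` cells above, in milliseconds.
baseline_ms = 253.0   # before importing fastinference
patched_ms = 34.7     # after importing fastinference

saved_ms = baseline_ms - patched_ms               # absolute saving
fraction_of_baseline = patched_ms / baseline_ms   # share of the original time
speedup = baseline_ms / patched_ms                # "times faster"

print(f"saved {saved_ms:.1f} ms")
print(f"{fraction_of_baseline:.1%} of the original time")
print(f"{speedup:.1f}x faster")
```

These are single `%%time` runs, so treat the ratio as a rough indication rather than a benchmark.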

In [0]:
learn.dls.device = 'cpu'
learn.model.to('cpu');

In [0]:
%%time
# CPU:
preds = learn.predict(fnames[0])

CPU times: user 229 ms, sys: 6.8 ms, total: 236 ms
Wall time: 239 ms


*And* a roughly 20 millisecond reduction on the CPU! Not bad at all! So what changed? `get_preds` now looks very reminiscent of a `PyTorch` loop, with no `fastai` parts. This is intentional, as now we can blend the two together.

Okay... you've sped it up, what about these "Quality of Life" improvements you mentioned? Let's take a look:

## Quality of Life Improvements

The `get_preds` and `predict` adjustments don't end there. We'll start with the major changes to `get_preds`:


### `get_preds`

`get_preds` allows for any of the usual `fastai` parameters, such as `ds_idx`, `dl`, and any other `**kwargs` you may want. What has changed is that we now have three other parameters:
  * `raw_outs`
  * `decoded_loss`
  * `fully_decoded`



`raw_outs` lets you choose whether or not to apply your loss function's `activation` (the default is `False`, so the activation is applied unless you ask otherwise). Let's see what that means from a code perspective!

Here is our `CrossEntropyLossFlat`, the loss function for our model:

In [0]:
class CrossEntropyLossFlat(BaseLoss):
    "Same as `nn.CrossEntropyLoss`, but flattens input and target."
    y_int = True
    def __init__(self, *args, axis=-1, **kwargs): super().__init__(nn.CrossEntropyLoss, *args, axis=axis, **kwargs)
    def decodes(self, x):    return x.argmax(dim=self.axis)
    def activation(self, x): return F.softmax(x, dim=self.axis)

As you can see, we have an `activation` that applies a `softmax`. Let's disable it:

In [0]:
learn.dls.device = 'cuda'
learn.model.to('cuda')
preds = learn.get_preds(raw_outs=True)

So now, if we look at the second item returned:

In [0]:
preds[1][0]

array([ 1.0002726 , -0.24976277, -1.058503  , -3.0604253 , -1.0200855 ,
       -2.0843894 , -0.5362588 , -3.2754717 , -1.7710001 , -2.2774923 ,
       -1.7333597 , -4.1619153 ,  0.9354493 ,  3.4019513 , 12.7702875 ,
        7.132491  ,  2.0602374 , -3.1376252 ,  1.4149432 ,  5.9531183 ,
        4.3040857 ,  1.0570182 , -2.5935738 , -1.620131  , -0.4416199 ,
        2.527862  , -0.9089688 ,  0.51930547, -3.564454  , -3.3547723 ,
        6.0538034 , -1.4624698 , -4.14351   , -1.7813267 , -0.72445834,
        2.1540437 ,  0.07214043], dtype=float32)

We have the non-scaled results, versus:

In [0]:
preds = learn.get_preds(raw_outs=False)

In [0]:
preds[1][0]

array([7.68456448e-06, 2.20158699e-06, 9.80627874e-07, 1.32458638e-07,
       1.01903447e-06, 3.51534027e-07, 1.65314952e-06, 1.06828480e-07,
       4.80917777e-07, 2.89803666e-07, 4.99364774e-07, 4.40260024e-08,
       7.20222897e-06, 8.48506534e-05, 9.93737578e-01, 3.53840739e-03,
       2.21797090e-05, 1.22617607e-07, 1.16334395e-05, 1.08795962e-03,
       2.09144753e-04, 8.13323913e-06, 2.11266979e-07, 5.59232149e-07,
       1.81724408e-06, 3.54032127e-05, 1.13879651e-06, 4.75048228e-06,
       8.00172444e-08, 9.86841755e-08, 1.20320532e-03, 6.54732105e-07,
       4.48438442e-08, 4.75977203e-07, 1.36954998e-06, 2.43610393e-05,
       3.03764227e-06], dtype=float32)

So if you want to apply a threshold (on a sigmoid output, say) when labelling at inference time, something you may not have done during training (such as letting your model say "I do not know"), you can do this directly with your outputs!
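As a rough illustration of that idea, here is a pure-Python sketch that takes raw outputs, applies a softmax (matching the `activation` of `CrossEntropyLossFlat` above rather than a sigmoid), and falls back to an "unknown" label when the top probability is low. The `label_or_unknown` helper, the `0.9` threshold, and the class names are all assumptions made up for this example:

```python
import math


def softmax(logits):
    # Subtract the max before exponentiating for numerical stability.
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def label_or_unknown(logits, vocab, threshold=0.9):
    # `threshold` and the "unknown" label are illustrative choices.
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return vocab[best] if probs[best] >= threshold else "unknown"


vocab = ["cat", "dog", "bird"]  # hypothetical class names
confident = label_or_unknown([0.1, 6.0, 0.2], vocab)
unsure = label_or_unknown([1.0, 1.2, 0.9], vocab)
print(confident, unsure)
```

The first set of raw outputs has one clear winner, so it gets a class name; the second is nearly flat, so it is labelled "unknown".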

Now what is in slot `[0]` I hear you ask?

In [0]:
preds[0][:5]

It's our decoded classes! Right away! This is extremely helpful for people who want to align their outputs. This relies on your `dls.vocab`, so let's see what happens if we don't have a `vocab`:

In [0]:
vocab = learn.dls.vocab
learn.dls.vocab = None

In [0]:
preds = learn.get_preds(raw_outs=False)

In [0]:
preds[0][:5]

tensor([14, 12, 19, 31,  4], device='cuda:0')

You can see that instead it's an `argmax`'d tensor of our previous probabilities, run through the `decodes` of the loss function!
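To make the two behaviours concrete, here is a small pure-Python sketch of decoding with and without a vocab. The `decode_preds` helper, the probabilities, and the class names are made up for illustration and are not the `fastinference` implementation:

```python
def decode_preds(probs, vocab=None):
    # Index of the highest probability in each row (an argmax).
    idxs = [max(range(len(row)), key=row.__getitem__) for row in probs]
    # With a vocab, map indices to class names; otherwise keep the indices.
    return [vocab[i] for i in idxs] if vocab is not None else idxs


probs = [[0.1, 0.7, 0.2], [0.8, 0.1, 0.1]]  # made-up probabilities
vocab = ["beagle", "pug", "sphynx"]         # hypothetical vocab

with_vocab = decode_preds(probs, vocab)
without_vocab = decode_preds(probs)
print(with_vocab, without_vocab)
```

With a vocab you get class names back; without one you get the bare class indices, just like the tensor above.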

Alright, anything else new with `get_preds`? We'll move on to a tabular model for our `fully_decoded`:

In [0]:
from fastai2.tabular.all import *

In [9]:
path = untar_data(URLs.ADULT_SAMPLE)

In [0]:
df = pd.read_csv(path/'adult.csv')
splits = RandomSplitter()(range_of(df))
cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race']
cont_names = ['age', 'fnlwgt', 'education-num']
procs = [Categorify, FillMissing, Normalize]
y_names = 'salary'

In [0]:
dls = TabularPandas(df, procs=procs, cat_names=cat_names, cont_names=cont_names,
                   y_names=y_names, splits=splits).dataloaders(bs=512)

In [14]:
learn = tabular_learner(dls, layers=[200,100], metrics=accuracy)
learn.fit(1)

| epoch | train_loss | valid_loss | accuracy | time |
|---|---|---|---|---|
| 0 | 0.380894 | 0.452286 | 0.820639 | 00:00 |


Now we can make a `test_dl` to work with our `get_preds` too, let's look at one for `fully_decoded`:

In [0]:
test_dl = learn.dls.test_dl(df.iloc[:3])

In [0]:
preds = learn.get_preds(dl=test_dl, fully_decoded=True)

So first we have our classes and probabilities again:

In [0]:
preds[0], preds[1]

(['>=50k', '>=50k', '<50k'], array([[0.48675266, 0.51324725],
        [0.41627738, 0.58372265],
        [0.72597593, 0.27402407]], dtype=float32))

But now we have a third item:

In [0]:
preds[2].show()

| | workclass | education | marital-status | occupation | relationship | race | education-num_na | age | fnlwgt | education-num | salary |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Private | Assoc-acdm | Married-civ-spouse | #na# | Wife | White | False | 49.0 | 101319.997692 | 12.0 | >=50k |
| 1 | Private | Masters | Divorced | Exec-managerial | Not-in-family | White | False | 44.0 | 236746.000628 | 14.0 | >=50k |
| 2 | Private | HS-grad | Divorced | #na# | Unmarried | Black | True | 38.0 | 96185.000326 | 10.0 | <50k |


Which is our `DataFrame` with the inputs and our outputs! This works for vision as well; however, it's easier to show with tabular. Now let's move on to `.predict`:

### `.predict`

`predict` has some new bits too. Specifically, we can pass in `with_input` to possibly return our inputs, similar to what I showed above:

In [0]:
name, probs, row = learn.predict(df.iloc[0], with_input=True)

In [0]:
name, probs

(['>=50k'], array([[0.48675266, 0.51324725]], dtype=float32))

In [0]:
row.show()

| | workclass | education | marital-status | occupation | relationship | race | education-num_na | age | fnlwgt | education-num | salary |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Private | Assoc-acdm | Married-civ-spouse | #na# | Wife | White | False | 49.0 | 101319.997692 | 12.0 | >=50k |


And this of course will also work for vision! Lastly, I want to cover the `ONNX` support, as it can be **directly integrated into `fastai`**.

### ONNX

ONNX is an open format for exchanging models. By exporting to it you can potentially speed up inference by relying on C++ rather than Python. Exporting from `fastai` isn't easy, or at least it wasn't! Let's import `fastinference.onnx` and look at our tabular model one more time:

In [0]:
from fastinference.onnx import *

All we need to do is call `learn.to_onnx` and pass in a `fname`. This **both** exports to the ONNX format and exports our model. We'll see why in a moment:

In [0]:
learn.to_onnx('tabular')

That's it! Note, however, that some models may not be `ONNX` compatible (such as the UNET), and currently only a single output is supported; multiple outputs will be supported soon. Now, what can we do from there?

Let's load our model into a `fastONNX` model:

In [0]:
tab_inf = fastONNX('tabular')

Next, we'll try to time how the two different `predict` methods stack up. First, our improved version:

In [17]:
%%time
_ = learn.predict(df.iloc[0])

CPU times: user 37.9 ms, sys: 2.61 ms, total: 40.5 ms
Wall time: 39 ms


And now for our ONNX version. We'll want to pass in a *raw* batch of data to our model, so let's grab the first batch:

In [0]:
test_dl.bs = 1

In [0]:
batch = next(iter(test_dl))

In [22]:
batch

(tensor([[5, 8, 3, 0, 6, 5, 1]], device='cuda:0'),
 tensor([[ 0.7635, -0.8441,  0.7524]], device='cuda:0'),
 tensor([[1]], device='cuda:0', dtype=torch.int8))

And now let's `.predict`:

In [24]:
%%timeit
_ = tab_inf.predict(batch[:2])

The slowest run took 31.68 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 241 µs per loop


As you can see, *lightning* fast! And that's all that's needed! <s>Eventually I may convert this into a more `fastai`-like scenario, but for now this is the framework as it lays. </s>
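One caveat when reading the two numbers side by side: the `learn.predict` figure comes from a single `%%time` run on a raw row, while the ONNX figure is a `%%timeit` best-of-3 on an already-processed batch, so they are not strictly like for like. If you want to take comparable measurements outside of a notebook, a minimal sketch with the standard library's `timeit` module could look like this. The `fake_predict` function is a hypothetical stand-in for whichever `predict` call you are measuring:

```python
import timeit


def fake_predict(batch):
    # Stand-in workload; swap in the real call you want to time.
    return [x * 2 for x in batch]


batch = list(range(100))

# Three timing rounds of `number` calls each; taking the minimum mirrors
# the "best of 3" figure that %%timeit reports.
number = 1000
rounds = timeit.repeat(lambda: fake_predict(batch), repeat=3, number=number)
best_per_call_us = min(rounds) / number * 1e6
print(f"best of 3: {best_per_call_us:.2f} µs per loop")
```

Timing both methods with the same helper and the same input keeps the comparison fair.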

I also couldn't leave well enough alone. You have the full inference capability inside of this module. Let's build a `test_dl` again and run `get_preds` as an example:

In [0]:
dl = tab_inf.test_dl(df.iloc[:100])

In [27]:
%%time
preds = tab_inf.get_preds(dl=dl)

CPU times: user 8.34 ms, sys: 0 ns, total: 8.34 ms
Wall time: 9.71 ms


Just to compare with our original:

In [0]:
dl.device = 'cuda'

In [32]:
%%time
preds = learn.get_preds(dl=dl)

CPU times: user 10.6 ms, sys: 0 ns, total: 10.6 ms
Wall time: 11.7 ms


We shaved off about 2 milliseconds too!



In the next article I'll be showing how to utilize `SHAP`, and then the last will include `ClassConfusion`. Thanks for reading!