# 01 - Introduction to `fastinference`

What is `fastinference`?

  * Inference module for `fastai`
  * Allows for faster inference, more verbose prediction methods, and some model interpretability modules
  
What will this tutorial cover? (NB 01 - n):
  * *Why* these speedups work
  * What they really are
  * Other modules that are also packed away in `fastinference`, such as `fastshap` and `ClassConfusion`, among a few others

In [1]:
!pip install fastai2 fastinference --quiet

## Tabular Example

First, let's look at a tabular example (using `ADULT_SAMPLE`):

In [2]:
from fastai2.tabular.all import *

In [3]:
path = untar_data(URLs.ADULT_SAMPLE)
df = pd.read_csv(path/'adult.csv')
splits = RandomSplitter()(range_of(df))
cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race']
cont_names = ['age', 'fnlwgt', 'education-num']
procs = [Categorify, FillMissing, Normalize]
y_names = 'salary'

In [4]:
to = TabularPandas(df, procs=procs, cat_names=cat_names, cont_names=cont_names,
                   y_names=y_names, splits=splits)

In [5]:
dls = to.dataloaders()

We'll build a quick learner then check the speed of 2 items:

* `learn.predict()`
* `learn.get_preds()` on 100 rows

In [6]:
learn = tabular_learner(dls, layers=[200,100])

All of the tests below are run with `CUDA` enabled:

In [6]:
%%time
out = learn.predict(df.iloc[0])

CPU times: user 196 ms, sys: 83.9 ms, total: 280 ms
Wall time: 280 ms


In [9]:
test_dl = learn.dls.test_dl(df.iloc[:100])

In [8]:
%%time
out = learn.get_preds(dl=test_dl)

CPU times: user 16.1 ms, sys: 5.26 ms, total: 21.4 ms
Wall time: 21.9 ms


So as you can see, both take a bit of time. Next, we'll look at a *raw* `PyTorch` inference loop:

In [9]:
cat, cont, y = next(iter(test_dl))

In [10]:
%%timeit
learn.model.eval()
with torch.no_grad():
    out = learn.model(cat, cont)

520 µs ± 749 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)


Whoa! Wait, hold on. That's *half* a *millisecond*! Why?

Well, we haven't quite done everything the same yet. `predict` gives us probabilities and the class label and whatnot. So, let's recreate this here:

In [11]:
learn.model.eval()
with torch.no_grad():
    out = learn.model(cat, cont)

First, we need to turn our raw outputs into probabilities with a `softmax`, and then into class indices with an `argmax`. Both steps come from our `loss_func`, through its `activation` and `decodes` methods:

In [12]:
out[:5]

tensor([[-0.0088,  0.0212],
        [ 0.0103,  0.0407],
        [-0.0072,  0.0120],
        [-0.0084,  0.0354],
        [-0.0128,  0.0131]], device='cuda:0')

In [13]:
learn.loss_func.activation(out[:5])

tensor([[0.4925, 0.5075],
        [0.4924, 0.5076],
        [0.4952, 0.5048],
        [0.4890, 0.5110],
        [0.4935, 0.5065]], device='cuda:0')

In [14]:
learn.loss_func.decodes(learn.loss_func.activation(out[:5]))

tensor([1, 1, 1, 1, 1], device='cuda:0')

Can we skip the `activation`? Since it's a softmax, which preserves the ordering of the outputs, yes, as long as we just want the class:

In [15]:
learn.loss_func.decodes(out[:5])

tensor([1, 1, 1, 1, 1], device='cuda:0')
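The reason the shortcut above is safe can be checked by hand. The following is a minimal, dependency-free sketch (plain Python rather than `PyTorch`, with hypothetical helper names `softmax` and `argmax`) that applies both steps to the rounded raw outputs printed earlier and confirms the predicted class is the same with or without the softmax:

```python
import math

def softmax(row):
    """Numerically stable softmax over one row of raw model outputs."""
    shift = max(row)
    exps = [math.exp(v - shift) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def argmax(row):
    """Index of the largest value in the row."""
    return max(range(len(row)), key=row.__getitem__)

# Raw outputs copied (rounded) from the `out[:5]` cell above.
logits = [
    [-0.0088, 0.0212],
    [0.0103, 0.0407],
    [-0.0072, 0.0120],
    [-0.0084, 0.0354],
    [-0.0128, 0.0131],
]

probs = [softmax(row) for row in logits]
classes_from_probs = [argmax(row) for row in probs]
classes_from_logits = [argmax(row) for row in logits]

print(classes_from_probs)   # [1, 1, 1, 1, 1]
print(classes_from_logits)  # [1, 1, 1, 1, 1]
```

Softmax only rescales each row so that it sums to 1; it never changes which entry is largest, so skipping it saves a little work when only the class is needed.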

In [16]:
learn.predict(df.iloc[0])

(   workclass  education  marital-status  occupation  relationship  race  \
 0        5.0        8.0             3.0         0.0           6.0   5.0   
 
    education-num_na       age    fnlwgt  education-num  salary  
 0               1.0  0.774021 -0.833759       0.753626     1.0  ,
 tensor(1),
 tensor([0.4925, 0.5075]))

Now let's time our new `predict` pipeline:

In [17]:
%%timeit
learn.model.eval()
with torch.no_grad():
    out = learn.model(cat, cont)
raw_probs = learn.loss_func.activation(out)
classes = learn.loss_func.decodes(raw_probs)

553 µs ± 1.76 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


We're still in microseconds, while `learn.predict` took milliseconds. Why is it so much slower? The `Callback` system is the answer, and it is why I made `fastinference`. Let's import `fastinference` and see what it gives us:

In [7]:
from fastinference.inference import *

In [19]:
%%timeit
out = learn.predict(df.iloc[0])

25.4 ms ± 35.6 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
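To get a rough feel for why routing every step through callbacks costs time, here is a toy sketch in plain Python. It is not fastai's callback system: `CountingCallback`, `callback_loop`, the event names, and the stand-in `model` are all hypothetical, and the only point is that notifying several callbacks of several events per batch adds work that a bare loop does not do.

```python
import timeit

# Hypothetical event names, fired once per batch.
EVENTS = ("before_batch", "after_pred", "after_loss", "after_batch")

class CountingCallback:
    """Hypothetical callback: does nothing except count the events it sees."""
    def __init__(self):
        self.events = 0

    def __call__(self, event_name):
        self.events += 1

def model(x):
    """Stand-in for a forward pass."""
    return x * 2

def bare_loop(batches):
    """Call the model directly on every batch."""
    return [model(b) for b in batches]

def callback_loop(batches, callbacks):
    """Same work, but every callback is notified of every event on every batch."""
    outs = []
    for b in batches:
        for cb in callbacks:
            cb(EVENTS[0])
        out = model(b)
        for name in EVENTS[1:]:
            for cb in callbacks:
                cb(name)
        outs.append(out)
    return outs

batches = list(range(1000))
callbacks = [CountingCallback() for _ in range(5)]

runs = 50
bare = timeit.timeit(lambda: bare_loop(batches), number=runs)
with_cbs = timeit.timeit(lambda: callback_loop(batches, callbacks), number=runs)
print(f"bare loop:      {bare * 1e3 / runs:.3f} ms per pass")
print(f"with callbacks: {with_cbs * 1e3 / runs:.3f} ms per pass")
```

Both loops return identical results; the second simply does extra bookkeeping around each batch. The absolute numbers depend on the machine, so none are quoted here.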


In [20]:
%%timeit
dl = learn.dls.test_dl(df.iloc[:1])

14.6 ms ± 14 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


So roughly 14.6 ms of the 25.4 ms goes to building our `test_dl`, leaving about 10 ms for everything else, and `learn.predict` is now well under the 280 ms we measured at the start. What about `get_preds`?

In [21]:
%%timeit
preds = learn.get_preds(dl=test_dl)

8.78 ms ± 14.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


That's less than half of the 21.9 ms from before! But what else is different about these methods? Let's compare the source code:

In [22]:
# fastai
def get_preds(self, ds_idx=1, dl=None, with_input=False, with_decoded=False, with_loss=False, act=None,
                  inner=False, reorder=True, **kwargs):
    if dl is None: dl = self.dls[ds_idx].new(shuffled=False, drop_last=False)
    if reorder and hasattr(dl, 'get_idxs'):
        idxs = dl.get_idxs()
        dl = dl.new(get_idxs = _ConstantFunc(idxs))
    cb = GatherPredsCallback(with_input=with_input, with_loss=with_loss, **kwargs)
    ctx_mgrs = [self.no_logging(), self.added_cbs(cb), self.no_mbar()]
    if with_loss: ctx_mgrs.append(self.loss_not_reduced())
    with ExitStack() as stack:
        for mgr in ctx_mgrs: stack.enter_context(mgr)
        self(event.begin_epoch if inner else _before_epoch)
        self._do_epoch_validate(dl=dl)
        self(event.after_epoch if inner else _after_epoch)
        if act is None: act = getattr(self.loss_func, 'activation', noop)
        res = cb.all_tensors()
        pred_i = 1 if with_input else 0
        if res[pred_i] is not None:
            res[pred_i] = act(res[pred_i])
            if with_decoded: res.insert(pred_i+2, getattr(self.loss_func, 'decodes', noop)(res[pred_i]))
        if reorder and hasattr(dl, 'get_idxs'): res = nested_reorder(res, tensor(idxs).argsort())
        return tuple(res)
    self._end_cleanup()

Looks pretty sophisticated. Why? It relies on the `Callback` system and `Schedulers`. What does mine do?

In [25]:
# mine
@patch
def get_preds(x:Learner, ds_idx=1, dl=None, raw_outs=False, decoded_loss=True, fully_decoded=False,
             **kwargs):
    "Get predictions with possible decoding"
    inps, outs, dec_out, raw = [], [], [], []
    if dl is None: dl = x.dls[ds_idx].new(shuffle=False, drop_last=False)
    is_multi=False
    if x.dls.n_inp > 1:
        is_multi=True
        [inps.append([]) for _ in range(x.dls.n_inp)]
    x.model.eval()
    for batch in dl:
        with torch.no_grad():
            if is_multi:
                for i in range(x.dls.n_inp):
                    inps[i].append(batch[i])
            else:
                inps.append(batch[:x.dls.n_inp])
            if decoded_loss or fully_decoded:
                out = x.model(*batch[:x.dls.n_inp])
                raw.append(out)
                dec_out.append(x.loss_func.decodes(out))
            else:
                raw.append(x.model(*batch[:x.dls.n_inp]))
    raw = torch.cat(raw, dim=0).cpu().numpy()
    if fully_decoded or decoded_loss:
        dec_out = torch.cat(dec_out, dim=0)
    if not raw_outs:
        try: outs.insert(0, x.loss_func.activation(tensor(raw)).numpy())
        except: outs.insert(0, dec_out)
    else:
        outs.insert(0, raw)
    if fully_decoded: outs = _fully_decode(x.dls, inps, outs, dec_out, is_multi)
    if decoded_loss: outs = _decode_loss(x.dls.vocab, dec_out, outs)
    return outs

Not so complicated now, is it? Let's time it once more, setting everything to `False` except `raw_outs`:

In [10]:
%%timeit
_ = learn.get_preds(dl=test_dl, raw_outs=True, decoded_loss=False)

7.52 ms ± 76.6 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
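Collecting the timings reported so far in one place makes the comparison easier to read. The snippet below only does arithmetic on numbers copied from the cell outputs above. It assumes the earlier figures are plain `fastai` and the later ones are `fastinference`, and note that the 280 ms and 21.9 ms figures come from single `%%time` runs rather than `%%timeit` averages, so the ratios are indicative only.

```python
# Timings in milliseconds, copied from the cell outputs above.
timings_ms = {
    "predict": {"before": 280.0, "after": 25.4},
    "get_preds (100 rows)": {"before": 21.9, "after": 8.78},
}
raw_get_preds_ms = 7.52  # get_preds with raw_outs=True, decoded_loss=False

# Ratio of the earlier timing to the later one for each method.
speedups = {name: t["before"] / t["after"] for name, t in timings_ms.items()}
for name, ratio in speedups.items():
    t = timings_ms[name]
    print(f"{name}: {t['before']} ms -> {t['after']} ms ({ratio:.1f}x)")

# Time attributable to the activation and decoding steps in get_preds.
decode_cost_ms = timings_ms["get_preds (100 rows)"]["after"] - raw_get_preds_ms
print(f"activation + decoding: about {decode_cost_ms:.2f} ms")
```

On these numbers, the activation and decoding steps account for a little over a millisecond of the `get_preds` call; the rest is the batch loop itself.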


Great! So now let's take a look inside `get_preds` and `predict`, specifically what they output:

In [16]:
name, preds, inp = learn.predict(df.iloc[0], with_input=True)

In [17]:
name, preds

(['>=50k'], array([[0.46804622, 0.5319538 ]], dtype=float32))

In [19]:
inp.show()

|  | workclass | education | marital-status | occupation | relationship | race | education-num_na | age | fnlwgt | education-num | salary |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Private | Assoc-acdm | Married-civ-spouse | #na# | Wife | White | False | 49.0 | 101320.00122 | 12.0 | >=50k |


So as you can see we get the class name (no longer just the class index), the probabilities, and the actual input (if `with_input=True`). What about `get_preds`?

In [21]:
name, preds = learn.get_preds(dl=test_dl)

In [22]:
name[:3], preds[:3]

(['>=50k', '>=50k', '>=50k'],
 array([[0.46804622, 0.5319538 ],
        [0.4700674 , 0.52993256],
        [0.4769747 , 0.5230253 ]], dtype=float32))

Same thing, though we have a few more parameters:

In [26]:
def get_preds(x:Learner, ds_idx=1, dl=None, raw_outs=False, decoded_loss=True, fully_decoded=False,
              **kwargs): ...  # signature only; the full body is shown above

So what does all of this mean? `raw_outs` controls whether we skip the `softmax` and get the raw model outputs back, `decoded_loss` runs our `argmax` and returns the class names, and `fully_decoded` also returns our inputs:

In [27]:
name, preds = learn.get_preds(dl=test_dl, raw_outs=True)

In [28]:
name[:3], preds[:3]

(['>=50k', '>=50k', '>=50k'],
 array([[-0.08579029,  0.04219937],
        [-0.05968028,  0.06019334],
        [-0.0669051 ,  0.02526121]], dtype=float32))

In [30]:
preds = learn.get_preds(dl=test_dl, decoded_loss=False)

In [32]:
preds[0][:3]

array([[0.46804622, 0.5319538 ],
       [0.4700674 , 0.52993256],
       [0.4769747 , 0.5230253 ]], dtype=float32)

In [33]:
name, preds, inp = learn.get_preds(dl=test_dl, decoded_loss=True, fully_decoded=True)

In [34]:
name[:3], preds[:3]

(['>=50k', '>=50k', '>=50k'],
 array([[0.46804622, 0.5319538 ],
        [0.4700674 , 0.52993256],
        [0.4769747 , 0.5230253 ]], dtype=float32))

In [35]:
inp[:3]

|  | workclass | education | marital-status | occupation | relationship | race | education-num_na | age | fnlwgt | education-num | salary |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5.0 | 8.0 | 3.0 | 0.0 | 6.0 | 5.0 | 1.0 | 0.758773 | -0.84007 | 0.755614 | 1.0 |
| 1 | 5.0 | 13.0 | 1.0 | 5.0 | 2.0 | 5.0 | 1.0 | 0.392702 | 0.444001 | 1.540485 | 1.0 |
| 2 | 5.0 | 12.0 | 1.0 | 0.0 | 5.0 | 3.0 | 2.0 | -0.046583 | -0.888759 | -0.029257 | 1.0 |
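The effect of the three flags on what comes back can be summarised with a toy function. This is a hedged sketch, not the `fastinference` implementation: `toy_get_preds` is a hypothetical stand-in written in plain Python, the `VOCAB` order is an assumption about `ADULT_SAMPLE`, and the `inputs` are placeholders. It only mirrors the return shapes seen in the cells above.

```python
import math

VOCAB = ["<50k", ">=50k"]  # assumed class order for ADULT_SAMPLE

def softmax(row):
    """Numerically stable softmax over one row of raw model outputs."""
    shift = max(row)
    exps = [math.exp(v - shift) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def toy_get_preds(logits, inputs, raw_outs=False, decoded_loss=True, fully_decoded=False):
    """Hypothetical stand-in that mirrors the return shapes shown above."""
    # raw_outs=True skips the activation and hands back the raw outputs.
    preds = logits if raw_outs else [softmax(row) for row in logits]
    outs = [preds]
    # fully_decoded appends the inputs after the predictions.
    if fully_decoded:
        outs.append(inputs)
    # decoded_loss puts the class names first.
    if decoded_loss:
        names = [VOCAB[max(range(len(row)), key=row.__getitem__)] for row in logits]
        outs.insert(0, names)
    return outs

# Raw outputs copied from the `raw_outs=True` cell above.
logits = [
    [-0.08579029, 0.04219937],
    [-0.05968028, 0.06019334],
    [-0.0669051, 0.02526121],
]
inputs = ["row 0", "row 1", "row 2"]  # placeholder inputs

names, probs = toy_get_preds(logits, inputs)
_, raw = toy_get_preds(logits, inputs, raw_outs=True)
(only_probs,) = toy_get_preds(logits, inputs, decoded_loss=False)
names_full, probs_full, inp = toy_get_preds(logits, inputs, fully_decoded=True)

print(names)
print(probs[0])
```

Note that with `decoded_loss=False` the result is still a one-element list, which is why the cell above indexes it with `preds[0]`.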


This will also work for any other application in the `fastai` library, in the same