# fastai's Mid-Level API

In [0]:
from fastai.text.all import *

---

## Transforms

In [0]:
# Writing your own transforms

def f(x:int): return x+1
tfm = Transform(f)
tfm(2),tfm(2.0)

(3, 2.0)

In [0]:
@Transform
def f(x:int): return x+1

f(2),f(2.0)

(3, 2.0)
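The outputs above show that the function is only applied when the input matches its type annotation: `2` is incremented, while `2.0` passes through unchanged. The idea can be sketched in plain Python as follows. `typed_transform` and `add_one` are hypothetical names written for this illustration; this is not how fastai implements type dispatch.

```python
import inspect

def typed_transform(func):
    "Hypothetical helper: apply `func` only when the input matches its first annotated type."
    params = list(inspect.signature(func).parameters.values())
    expected = params[0].annotation  # e.g. `int` for `def f(x: int)`

    def wrapper(x):
        # No annotation means the function applies to every input
        if expected is inspect.Parameter.empty:
            return func(x)
        # Inputs of any other type pass through unchanged
        return func(x) if isinstance(x, expected) else x

    return wrapper

@typed_transform
def add_one(x: int): return x + 1

results = add_one(2), add_one(2.0)
print(results)  # (3, 2.0)
```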

In [0]:
Transform??

 "Delegates (`__call__`,`decode`,`setup`) to (`encodes`,`decodes`,`setups`) if `split_idx` matches"

If you need either `setup` or `decode`, you will need to subclass `Transform`. In the subclass, implement the transformation itself in `encodes`, then (optionally) the setup behavior in `setups` and the decoding behavior in `decodes`:

In [0]:
class NormalizeMean(Transform):
    def setups(self, items): self.mean = sum(items)/len(items)
    def encodes(self, x): return x-self.mean
    def decodes(self, x): return x+self.mean

Here `NormalizeMean` will initialize some state during the setup (the mean of all elements passed), then the transformation is to subtract that mean. For decoding purposes, we implement the reverse of that transformation by adding the mean. Here is an example of `NormalizeMean` in action:

In [0]:
tfm = NormalizeMean()
# `setup` is defined on `Transform`; it calls the `setups` method of our `NormalizeMean` subclass
tfm.setup([1,2,3,4,5])
start = 2
y = tfm(start)
z = tfm.decode(y)
tfm.mean,y,z

(3.0, -1.0, 2.0)
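The delegation at work here (`setup` calling `setups`, a call invoking `encodes`, and `decode` calling `decodes`) can be illustrated with a small base class. This is only a sketch of the pattern: `MiniTransform` and `MiniNormalizeMean` are hypothetical stand-ins, and fastai's real `Transform` also handles type dispatch and `split_idx`.

```python
class MiniTransform:
    "Hypothetical stand-in for the delegation pattern; not fastai's `Transform`."
    def setup(self, items=None): return self.setups(items)
    def __call__(self, x): return self.encodes(x)
    def decode(self, x): return self.decodes(x)
    # Defaults: no state to learn, identity in both directions
    def setups(self, items): pass
    def encodes(self, x): return x
    def decodes(self, x): return x

class MiniNormalizeMean(MiniTransform):
    def setups(self, items): self.mean = sum(items) / len(items)
    def encodes(self, x): return x - self.mean
    def decodes(self, x): return x + self.mean

norm = MiniNormalizeMean()
norm.setup([1, 2, 3, 4, 5])   # runs `setups`, which stores the mean
y = norm(2)                   # runs `encodes`
z = norm.decode(y)            # runs `decodes`
print(norm.mean, y, z)        # 3.0 -1.0 2.0
```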

---

## Pipeline

To compose several transforms together, fastai provides `Pipeline`. We define a `Pipeline` by passing it a list of `Transform`s; it will then compose the transforms inside it. When you call a `Pipeline` on an object, it will automatically call the transforms inside, in order:

In [0]:
path = untar_data(URLs.IMDB)
dls = DataBlock(
    blocks=(TextBlock.from_folder(path),CategoryBlock),
    get_y = parent_label,
    get_items=partial(get_text_files, folders=['train', 'test']),
    splitter=GrandparentSplitter(valid_name='test')
).dataloaders(path)

In [0]:
files = get_text_files(path, folders = ['train', 'test'])
txts = L(o.open().read() for o in files[:2000])

tok = Tokenizer.from_folder(path)
tok.setup(txts)
toks = txts.map(tok)


num = Numericalize()
num.setup(toks)
nums = toks.map(num)

In [0]:
tfms = Pipeline([tok, num])
t = tfms(txts[0]); t[:20]

tensor([   2,   18,  287,   20,   27,   13,    9,  180,    8,    0,    8,  116,
         168,   14,  168,   10,    8,    9, 1196,  257])

In [0]:
tfms.decode(t)[:100]

'xxbos i watched this movie and the original xxmaj xxunk xxmaj way back to back . xxmaj the differenc'
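As a rough picture of what happens in the two cells above, the sketch below composes two toy transforms in order and decodes by walking them in reverse. `MiniPipeline`, `Lowercase`, and `SplitWords` are hypothetical names used only for illustration; they are not fastai classes.

```python
class Lowercase:
    def __call__(self, s): return s.lower()
    def decode(self, s): return s            # lowercasing cannot be undone

class SplitWords:
    def __call__(self, s): return s.split()
    def decode(self, words): return " ".join(words)

class MiniPipeline:
    "Hypothetical stand-in: encode in order, decode in reverse order."
    def __init__(self, tfms): self.tfms = list(tfms)
    def __call__(self, x):
        for t in self.tfms: x = t(x)
        return x
    def decode(self, x):
        for t in reversed(self.tfms): x = t.decode(x)
        return x

pipe = MiniPipeline([Lowercase(), SplitWords()])
encoded = pipe("I watched this movie")
decoded = pipe.decode(encoded)
print(encoded)  # ['i', 'watched', 'this', 'movie']
print(decoded)  # i watched this movie
```

Note that decoding does not have to recover the exact input: here the capital "I" is lost, just as the decoded IMDb text above keeps its `xxbos` and `xxmaj` tokens.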

The only part that doesn't work the same way as in `Transform` is the setup. To properly set up a `Pipeline` of `Transform`s on some data, you need to use a `TfmdLists`.

---

## TfmdLists 

Your data is usually a set of raw items (like filenames, or rows in a dataframe) to which you want to apply a succession of transformations. We just saw that fastai represents a succession of transformations with a `Pipeline`. The class that groups this pipeline together with your raw items is called `TfmdLists`.

In [0]:
tls = TfmdLists(files, [Tokenizer.from_folder(path), Numericalize])

At initialization, the `TfmdLists` will automatically call the `setup` method of each transform in order, providing each one not with the raw items but with the items transformed by all the previous `Transform`s. We can get the result of our pipeline on any raw element just by indexing into the `TfmdLists`:

In [0]:
t = tls[0]; t[:20]

tensor([    2,    19,   323,    20,    30,    12,     9,   231,     8, 17081,
            8,   117,   164,    15,   164,    10,     8,     9,  1553,   223])

In [0]:
tls.decode(t)[:100]

'xxbos i watched this movie and the original xxmaj carlitos xxmaj way back to back . xxmaj the differ'

In [0]:
# `TfmdLists` also has a `show` method, which decodes an item and displays it
tls.show(t)

xxbos i watched this movie and the original xxmaj carlitos xxmaj way back to back . xxmaj the difference between the two is disgusting . xxmaj now i know that people are going to say that the prequel was made on a small budget but that never had anything to do with a bad script . xxmaj now maybe it 's just me , but i always thought that a prequel was made to go set up the other movie , starring key characters and maybe filling in a bit about life that we did n't know . xxmaj rise to xxmaj power is just a movie that has xxmaj carlito 's name . xxmaj there should have been at least a few characters from the original movie , the ending makes no sense in relation to the original . xxmaj in the end of this movie he retires with his sweet heart but how the hell do we get him coming out of prison in the next movie ? xxmaj and his woman is n't even the same woman that he talks about as his only love in the original . i would say the movie is mildly entertaining in its self , with a few decent 
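The setup order described earlier, where each transform is set up on the output of the transforms before it, can be sketched in plain Python. `setup_in_order`, `SplitWords`, and `BuildVocab` are hypothetical names for illustration and do not reflect fastai's actual implementation.

```python
class SplitWords:
    def setup(self, items): pass             # nothing to learn from the data
    def __call__(self, s): return s.split()

class BuildVocab:
    def setup(self, items):
        # `items` are token lists here, not the raw strings
        self.seen = items
        words = sorted({w for toks in items for w in toks})
        self.vocab = {w: i for i, w in enumerate(words)}
    def __call__(self, toks): return [self.vocab[w] for w in toks]

def setup_in_order(raw_items, tfms):
    "Hypothetical helper: set up each transform on the output of the previous ones."
    items = list(raw_items)
    for t in tfms:
        t.setup(items)                 # sees items as transformed so far
        items = [t(o) for o in items]  # then transforms them for the next step
    return items

raw = ["a b", "b c"]
vocab_tfm = BuildVocab()
out = setup_in_order(raw, [SplitWords(), vocab_tfm])
print(vocab_tfm.seen)  # [['a', 'b'], ['b', 'c']]
print(out)             # [[0, 1], [1, 2]]
```

This is why numericalization can build its vocabulary from tokens: by the time it is set up, the tokenizer has already been applied.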

`TfmdLists` is named with an "s" because it can handle a training and a validation set through its `splits` argument. You just need to pass the indices of the elements that are in the training set and the indices of those that are in the validation set:

In [0]:
cut = int(len(files)*0.8)
splits = [list(range(cut)), list(range(cut,len(files)))]
tls = TfmdLists(files, [Tokenizer.from_folder(path), Numericalize], 
                splits=splits)

In [0]:
Path.BASE_PATH = path

You can then access them through the `train` and `valid` attributes:

In [0]:
tls.valid[0]

TensorText([    2,     7,  1261,    37,   347,     5,   155,   733,    34,    16,
           13,   842,   447,    14,     8,   976,  1511,    10,     8,    17,
           23,    60,   297,     0,    50,    99,    18,   117,   144,    81,
         1691,    69,    81,  1345,    11,  3708,     0,  5346,     5,   155,
          238,   142,   297,     7,   102,   460,    44,   330,    18,    13,
         1523,    10,     8,    69,   152,   297,  1304,  4006,    11,    50,
           39,    73,    15,    95,    70, 14332,    75,    24,  4101,    94,
           78,     9,    32,  2236,   201,    15,   141,    11,  1458,     9,
          594, 23508,    62,    14,    44,    14,   278,    22,  1518,    22,
           11,     9,  3868,  1108,    15,  1019,     9,  5308,    18,   266,
           24,    83,    47,  2566,     7,   516,    44,    14,     9,   102,
          460,    54,     8,     9,  3468,   128,  6741,   116,   897,     9,
        22665,  3156,    14,   298,   152,   361,     0,  10
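The `splits` passed to `TfmdLists` are just two lists of indices into the raw items. A minimal sketch of how such index lists select a training and a validation subset, using a hypothetical `subset` helper and made-up file names:

```python
toy_items = ["f0", "f1", "f2", "f3", "f4"]      # made-up stand-ins for file names
cut = int(len(toy_items) * 0.8)                  # first 80% go to training
toy_splits = [list(range(cut)), list(range(cut, len(toy_items)))]

def subset(items, idxs):
    "Hypothetical helper: pick the items listed in one split."
    return [items[i] for i in idxs]

train_items = subset(toy_items, toy_splits[0])
valid_items = subset(toy_items, toy_splits[1])
print(toy_splits)   # [[0, 1, 2, 3], [4]]
print(train_items)  # ['f0', 'f1', 'f2', 'f3']
print(valid_items)  # ['f4']
```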

In [0]:
tls_y = TfmdLists(files, [parent_label, Categorize()])
tls_y[0]

TensorCategory(0)

But then we end up with two separate objects for our inputs and targets, which is not what we want. This is where `Datasets` comes to the rescue.

---

## Datasets 

`Datasets` will apply two (or more) pipelines in parallel to the same raw object and build a tuple with the result. Like `TfmdLists`, it will automatically do the setup for us, and when we index into a `Datasets`, it will return us a tuple with the results of each pipeline:

In [0]:
x_tfms = [Tokenizer.from_folder(path), Numericalize]
y_tfms = [parent_label, Categorize()]
dsets = Datasets(files, [x_tfms, y_tfms])
x,y = dsets[0]
x[:20],y

(tensor([    2,    19,   323,    20,    30,    12,     9,   231,     8, 17081,
             8,   117,   164,    15,   164,    10,     8,     9,  1553,   223]),
 TensorCategory(0))
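The core idea, several pipelines fed the same raw item with their results gathered into a tuple, can be sketched without fastai. `MiniDatasets`, `run`, `stem_words`, `parent_name`, and the file names below are all hypothetical and exist only for this illustration.

```python
from pathlib import Path

def run(pipeline, item):
    "Apply each function of `pipeline` to `item`, in order."
    for f in pipeline: item = f(item)
    return item

class MiniDatasets:
    "Hypothetical stand-in: one tuple element per pipeline, all fed the same raw item."
    def __init__(self, items, pipelines): self.items, self.pipelines = items, pipelines
    def __len__(self): return len(self.items)
    def __getitem__(self, i):
        return tuple(run(p, self.items[i]) for p in self.pipelines)

def parent_name(p): return Path(p).parent.name      # label taken from the folder name
def stem_words(p): return Path(p).stem.split("_")   # toy "tokenizer" on the file name

labels = {"neg": 0, "pos": 1}
toy_files = ["train/pos/great_movie.txt", "train/neg/bad_script.txt"]
toy_dsets = MiniDatasets(toy_files, [[stem_words], [parent_name, labels.get]])
toy_x, toy_y = toy_dsets[0]
print(toy_x, toy_y)  # ['great', 'movie'] 1
```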

In [0]:
# with splits

x_tfms = [Tokenizer.from_folder(path), Numericalize]
y_tfms = [parent_label, Categorize()]
dsets = Datasets(files, [x_tfms, y_tfms], splits=splits)
x,y = dsets.valid[0]
x[:20],y

(tensor([   2,    7, 1261,   37,  347,    5,  155,  733,   34,   16,   13,  842,
          447,   14,    8,  976, 1511,   10,    8,   17]),
 TensorCategory(1))

In [0]:
t = dsets.valid[0]
dsets.decode(t)

('xxbos xxup nobody ( 1 xxrep 3 9 ) is a fantastic piece of xxmaj japanese noir . xxmaj it \'s about three xxunk who get in way over their heads when their innocent , drunken xxunk p xxrep 3 * off three xxup other guys one night in a bar . xxmaj when these three mysterious strangers , who are up to much more deviant no - goodness than even the film allows us to know , beat the living daylights out of one of our " heroes " , the trio decides to return the favour in kind - only they accidentally xxup kill one of the other guys ! xxmaj the remaining two baddies then begin the systematic destruction of everything these poor xxunk hold dear , including their fast - dwindling sanity . xxmaj xxunk xxmaj video \'s xxup dvd sleeve features a critic quote calling the film " a paranoid street crime freakout ! " or some such , and the term more than applies here . xxmaj brooding , tense , very violent and low - key ( but still pretty slick ) , shot largely at night with many deliberately vague mom

The last step is to convert your `Datasets` object to a `DataLoaders`, which can be done with the `dataloaders` method. Here we need to pass along special arguments to take care of the padding problem (as we saw in the last chapter). This needs to happen just before we batch the elements, so we pass it to `before_batch`: 

In [0]:
dls = dsets.dataloaders(bs=64, before_batch=pad_input)

`dataloaders` directly calls `DataLoader` on each subset of our `Datasets`. fastai's `DataLoader` extends the PyTorch class of the same name and is responsible for collating the items from our datasets into batches. It has many points of customization, but the most important ones to know are:

- `after_item`: applied on each item after grabbing it inside the dataset. This is the equivalent of the `item_tfms` in `DataBlock`.
- `before_batch`: applied on the list of items before they are collated. This is the ideal place to pad items to the same size.
- `after_batch`: applied on the batch as a whole after its construction. This is the equivalent of the `batch_tfms` in `DataBlock`.
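To make the `before_batch` step concrete, here is a minimal sketch of padding a list of `(x, y)` samples to a common length before they are collated. `pad_to_longest` is a hypothetical helper, not fastai's `pad_input`, and the padding index `1` is an assumption of this sketch.

```python
def pad_to_longest(samples, pad_idx=1):
    "Hypothetical helper: pad every input in a list of (x, y) samples to the longest x."
    max_len = max(len(x) for x, _ in samples)
    # Append `pad_idx` to each input until it reaches `max_len`; targets are untouched
    return [(x + [pad_idx] * (max_len - len(x)), y) for x, y in samples]

samples = [([2, 19, 323], 0), ([2, 7], 1), ([2, 7, 1261, 37], 1)]
padded = pad_to_longest(samples)
print(padded)
# [([2, 19, 323, 1], 0), ([2, 7, 1, 1], 1), ([2, 7, 1261, 37], 1)]
```

Once every input has the same length, the items can be stacked into a single batch tensor, which is why the padding has to happen before collation rather than after it.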

To conclude, here is the full code necessary to prepare the data for text classification:

In [0]:
tfms = [[Tokenizer.from_folder(path), Numericalize], [parent_label, Categorize]]
files = get_text_files(path, folders = ['train', 'test'])
splits = GrandparentSplitter(valid_name='test')(files)
dsets = Datasets(files, tfms, splits=splits)
dls = dsets.dataloaders(dl_type=SortedDL, before_batch=pad_input)

The two differences from what we had above are the use of `GrandparentSplitter` to split our training and validation data, and the `dl_type` argument. The latter tells `dataloaders` to use the `SortedDL` class of `DataLoader` instead of the usual one. `SortedDL` constructs batches by putting samples of roughly the same length together.

This `Datasets`-based code does exactly the same thing as the `DataBlock` we used earlier:

In [0]:
path = untar_data(URLs.IMDB)
dls = DataBlock(
    blocks=(TextBlock.from_folder(path),CategoryBlock),
    get_y = parent_label,
    get_items=partial(get_text_files, folders=['train', 'test']),
    splitter=GrandparentSplitter(valid_name='test')
).dataloaders(path)
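To see why grouping samples of similar length (the job of `SortedDL` described above) is useful, the sketch below counts how many padding tokens are needed with and without sorting. The lengths are made up, and `batches` and `padding_needed` are hypothetical helpers; the real `SortedDL` involves more than a plain sort.

```python
def batches(lengths, bs):
    "Group consecutive sample lengths into batches of size `bs`."
    return [lengths[i:i + bs] for i in range(0, len(lengths), bs)]

def padding_needed(batch_list):
    "Count pad tokens needed to bring every sample up to the longest in its batch."
    return sum(max(b) - n for b in batch_list for n in b)

lengths = [5, 120, 7, 118, 6, 121]   # token counts of six hypothetical reviews
unsorted_pad = padding_needed(batches(lengths, bs=3))
sorted_pad = padding_needed(batches(sorted(lengths), bs=3))
print(unsorted_pad, sorted_pad)  # 346 7
```

With short and long reviews mixed in each batch, most of the batch is padding; after sorting, almost none is.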

---