# Data

A **Transform** is an object that:
- behaves like a function
- has an optional setup method that initializes some inner state
- has an optional decode method that reverses the function (this reversal may not be perfect)

These steps are needed for most data preprocessing tasks, so fastai provides a class that encapsulates them.

In general, our data is always a tuple (input,target) (sometimes with more than one input or more than one target).
A special behavior of Transforms is that they always get applied over tuples. 

For example, when applying a resize transform to an item, we don't want to resize the tuple as a whole; instead, we want to resize the input (if applicable) and the target (if applicable) separately.

It's the same for batch transforms that do data augmentation: when the input is an image and the target is a segmentation mask, the transform needs to be applied (the same way) to the input and the target.
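
The tuple behavior described above can be sketched in plain Python. This is only an illustration, not fastai's implementation: `apply_over_tuple` and `double_if_number` are hypothetical names, and "applicable" is reduced here to a simple type check.

```python
from typing import Any, Callable


def apply_over_tuple(fn: Callable[[Any], Any], item: Any) -> Any:
    """Apply `fn` to each element of a tuple, or to `item` itself otherwise."""
    if isinstance(item, tuple):
        return tuple(fn(o) for o in item)
    return fn(item)


def double_if_number(o: Any) -> Any:
    """Hypothetical transform: it only applies to numbers, the rest passes through."""
    if isinstance(o, (int, float)):
        return o * 2
    return o


sample = (3, "cat")  # an (input, target) tuple
result = apply_over_tuple(double_if_number, sample)
print(result)
```

The input is transformed while the target, to which the transform does not apply, is returned unchanged.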

In [None]:
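class Transform:
    def setups(self, items): ...   # optional: initialize inner state from the data
    def encodes(self, x): ...      # the function applied when the transform is called
    def decodes(self, x): ...      # optional: reverse `encodes` (possibly imperfectly)

tfm = Transform()
tfm.setup([...])       # runs setups
x2 = tfm(x1)           # runs encodes
x1 = tfm.decode(x2)    # runs decodes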
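
To make the three roles concrete, here is a minimal stand-alone sketch that does not use fastai at all. `ToyCategorize` is a hypothetical class that mimics the setup / call / decode life cycle by mapping labels to integer ids and back.

```python
class ToyCategorize:
    """Toy stand-in for a Transform: maps labels to integer ids and back."""

    def __init__(self):
        self.vocab = []  # inner state, filled in by setup
        self.o2i = {}

    def setup(self, items):
        # Build the vocabulary from the data the transform will see.
        self.vocab = sorted(set(items))
        self.o2i = {label: i for i, label in enumerate(self.vocab)}

    def __call__(self, x):
        # Encode: the object behaves like a function.
        return self.o2i[x]

    def decode(self, x):
        # Reverse the encoding (in this toy case the reversal is exact).
        return self.vocab[x]


tfm = ToyCategorize()
tfm.setup(["dog", "cat", "dog", "bird"])
x2 = tfm("dog")
x1 = tfm.decode(x2)
print(x2, x1)
```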

To compose several transforms together, fastai provides the **Pipeline** class.

We define a Pipeline by passing it a list of Transforms; it will then compose the transforms inside it. When you call Pipeline on an object, it will automatically call the transforms inside, in order.

The only part that doesn't work the same way as in Transform is the setup. To properly set up a Pipeline of Transforms on some data, you need to use a TfmdLists.

In [None]:
tfms = Pipeline([tfm1, tfm2])
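
The composition can be sketched without fastai as follows. `ToyPipeline`, `AddOne` and `TimesTwo` are hypothetical names; the sketch assumes that decoding walks the transforms in reverse order, which is the natural way to undo a composition.

```python
class ToyPipeline:
    """Toy composition of transforms: call in order, decode in reverse order."""

    def __init__(self, tfms):
        self.tfms = list(tfms)

    def __call__(self, x):
        for t in self.tfms:
            x = t(x)
        return x

    def decode(self, x):
        for t in reversed(self.tfms):
            x = t.decode(x)
        return x


class AddOne:
    def __call__(self, x): return x + 1
    def decode(self, x): return x - 1


class TimesTwo:
    def __call__(self, x): return x * 2
    def decode(self, x): return x // 2


pipe = ToyPipeline([AddOne(), TimesTwo()])
y = pipe(3)         # (3 + 1) * 2
x = pipe.decode(y)  # undo TimesTwo first, then AddOne
print(y, x)
```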

Your data is usually a set of raw items (like filenames, or rows in a DataFrame) to which you want to apply a succession of transformations. We just saw that a succession of transformations is represented by a Pipeline in fastai. 

The class that groups together this Pipeline with your raw items is called **TfmdLists**.

At initialization, the TfmdLists will automatically call the setup method of each Transform in order, providing them not with the raw items but the items transformed by all the previous Transforms in order. 

We can get the result of our Pipeline on any raw element just by indexing into the TfmdLists.

The TfmdLists is named with an "s" because it can handle a training and a validation set with a splits argument. You just need to pass the indices of which elements are in the training set, and which are in the validation set. You can then access them through the train and valid attributes.

In [None]:
cut = int(len(items)*0.8)
splits = [list(range(cut)), list(range(cut,len(items)))]

tls = TfmdLists(items, [tfm1, tfm2], splits=splits)
x2 = tls.train[0]
x1 = tls.decode(x2)
tls.show(x1)
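
The setup order is the subtle part, so here is a simplified sketch of it in plain Python. `ToyTfmdLists`, `Lower` and `ToyVocab` are hypothetical names, and for simplicity the sketch sets up on all items rather than on a particular split.

```python
class Lower:
    """Stateless toy transform."""
    def setup(self, items): pass
    def __call__(self, x): return x.lower()


class ToyVocab:
    """Stateful toy transform: its setup needs the already-lowercased items."""
    def setup(self, items):
        self.vocab = sorted(set(items))
        self.o2i = {o: i for i, o in enumerate(self.vocab)}
    def __call__(self, x): return self.o2i[x]


class ToyTfmdLists:
    def __init__(self, items, tfms, splits):
        self.items, self.splits, self.tfms = items, splits, []
        for t in tfms:
            # Each transform is set up on the output of the transforms before it.
            t.setup([self._apply(o) for o in items])
            self.tfms.append(t)

    def _apply(self, x):
        for t in self.tfms:
            x = t(x)
        return x

    def __getitem__(self, i): return self._apply(self.items[i])

    @property
    def train(self): return [self[i] for i in self.splits[0]]

    @property
    def valid(self): return [self[i] for i in self.splits[1]]


items = ["Dog", "cat", "DOG", "Bird", "CAT"]
cut = int(len(items) * 0.8)
splits = [list(range(cut)), list(range(cut, len(items)))]
tls = ToyTfmdLists(items, [Lower(), ToyVocab()], splits)
print(tls.train, tls.valid)
```

Because `ToyVocab` is set up on lowercased items, its vocabulary has three entries rather than five.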

If we build one TfmdLists for the inputs and another for the targets, however, we end up with two separate objects, which is not what we want.

**Datasets** will apply two (or more) pipelines in parallel to the same raw object and build a tuple with the result.

Like TfmdLists, it will automatically do the setup.

Like TfmdLists, we can pass along splits to split our data between training and validation sets.

When we index into a Datasets, it returns a tuple with the results of each pipeline.

It can also decode any processed tuple or show it directly.

In [None]:
dsets = Datasets(items, [x_tfms, y_tfms], splits=splits)
x,y = dsets.valid[0]
dsets.decode((x,y))
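
The idea of running several pipelines in parallel over the same raw item can be sketched like this. `ToyDatasets` and `run` are hypothetical names, the file names are made up, and each pipeline is just a list of plain functions.

```python
def run(pipeline, x):
    """Apply every function of `pipeline` to `x`, in order."""
    for f in pipeline:
        x = f(x)
    return x


class ToyDatasets:
    """Toy version: one tuple element per pipeline, all fed the same raw item."""

    def __init__(self, items, tfms):
        self.items, self.tfms = items, tfms

    def __len__(self): return len(self.items)

    def __getitem__(self, i):
        return tuple(run(p, self.items[i]) for p in self.tfms)


items = ["cat_01.jpg", "dog_07.jpg"]
x_tfms = [str.upper]                        # stand-in for "open the image"
y_tfms = [lambda name: name.split("_")[0]]  # stand-in for "extract the label"
dsets = ToyDatasets(items, [x_tfms, y_tfms])
x, y = dsets[1]
print(x, y)
```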

The last step is to convert our Datasets object to a **DataLoaders**, which can be done with the dataloaders method. 

dataloaders directly calls DataLoader on each subset of our Datasets. fastai's DataLoader extends the PyTorch class of the same name and is responsible for collating the items from our datasets into batches.

It has a lot of points of customization, but the most important ones are:
- after_item : Applied on each item after grabbing it inside the dataset.
- before_batch : Applied on the list of items before they are collated. This is the ideal place to pad items to the same size.
- after_batch : Applied on the batch as a whole after its construction.
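
The order in which these three hooks run can be sketched as follows. This is a plain-Python illustration with lists instead of tensors; `make_batch`, `collate` and `pad_to_longest` are hypothetical helpers, not fastai functions.

```python
def pad_to_longest(samples, pad_value=0):
    """Pad every sample (a list) to the length of the longest one."""
    longest = max(len(s) for s in samples)
    return [s + [pad_value] * (longest - len(s)) for s in samples]


def collate(items):
    """Stand-in for collation: requires equal sizes, then stacks into a 2-D list."""
    assert len({len(o) for o in items}) == 1, "items must have the same size"
    return [list(o) for o in items]


def make_batch(dataset, indices, after_item, before_batch, after_batch):
    items = [after_item(dataset[i]) for i in indices]  # once per item
    items = before_batch(items)                        # on the list, before collation
    batch = collate(items)
    return after_batch(batch)                          # on the batch as a whole


dataset = [[1, 2, 3], [4], [5, 6]]
batch = make_batch(
    dataset,
    indices=[0, 1, 2],
    after_item=lambda o: [v * 10 for v in o],
    before_batch=pad_to_longest,
    after_batch=lambda b: [[v + 1 for v in row] for row in b],
)
print(batch)
```

Without the padding step in `before_batch`, the `collate` stand-in would fail because the items have different lengths.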

The dl_type argument tells dataloaders to use the SortedDL subclass of DataLoader instead of the default one. SortedDL builds batches from samples of roughly the same length.

In [None]:
dls = dsets.dataloaders(dl_type=SortedDL, before_batch=pad_input)
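
Grouping samples of similar length can be sketched as below. The function name `length_sorted_batches` is hypothetical, and sorting from longest to shortest is an assumption made for this sketch; it only illustrates the idea, not the actual SortedDL code.

```python
def length_sorted_batches(samples, bs):
    """Return batches of indices so that each batch holds samples of similar length."""
    order = sorted(range(len(samples)), key=lambda i: len(samples[i]), reverse=True)
    return [order[i:i + bs] for i in range(0, len(order), bs)]


samples = ["aaaa", "b", "ccc", "dd", "eeeee", "f"]
batches = length_sorted_batches(samples, bs=2)
print(batches)
```

Each batch then needs very little padding, since its samples differ in length by at most one here.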

In [2]:
class TransformBlock():
    "A basic wrapper that links defaults transforms for the data block API"
    
    def __init__(self, type_tfms=None, item_tfms=None, batch_tfms=None, dl_type=None, dls_kwargs=None):
        self.type_tfms  =            L(type_tfms)
        self.item_tfms  = ToTensor + L(item_tfms)
        self.batch_tfms =            L(batch_tfms)
        self.dl_type    =              dl_type
        # Avoid a shared mutable default: build a fresh dict when none is given
        self.dls_kwargs = {} if dls_kwargs is None else dls_kwargs

In [3]:
def CategoryBlock(vocab=None, sort=True, add_na=False):
    "`TransformBlock` for single-label categorical targets"
    type_tfms = Categorize(vocab=vocab, sort=sort, add_na=add_na)
    return TransformBlock(type_tfms=type_tfms)

In [4]:
def MultiCategoryBlock(encoded=False, vocab=None, add_na=False):
    "`TransformBlock` for multi-label categorical targets"
    if encoded:
        type_tfms = EncodedMultiCategorize(vocab=vocab)
    else:
        type_tfms = [MultiCategorize(vocab=vocab, add_na=add_na), OneHotEncode]
    return TransformBlock(type_tfms=type_tfms)

In [5]:
def RegressionBlock(n_out=None):
    "`TransformBlock` for float targets"
    type_tfms = RegressionSetup(c=n_out)
    return TransformBlock(type_tfms=type_tfms)
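
The pattern in the cells above (a block is a bundle of default transforms, and each `...Block` function is a small factory for one) can be sketched in plain Python. `ToyBlock`, `toy_category_block` and `apply_all` are hypothetical names used only for this illustration.

```python
class ToyBlock:
    """Toy bundle of default transforms for one element of the (input, target) tuple."""

    def __init__(self, type_tfms=None, item_tfms=None, batch_tfms=None):
        self.type_tfms = list(type_tfms or [])
        self.item_tfms = list(item_tfms or [])
        self.batch_tfms = list(batch_tfms or [])


def toy_category_block(vocab):
    """Factory: a block whose type transform maps a label to its index in `vocab`."""
    o2i = {o: i for i, o in enumerate(vocab)}
    return ToyBlock(type_tfms=[o2i.__getitem__])


def apply_all(tfms, x):
    for f in tfms:
        x = f(x)
    return x


# One block per tuple element: a made-up input block and a category target block.
blocks = [ToyBlock(type_tfms=[str.strip]), toy_category_block(["cat", "dog"])]
raw = (" img.png ", "dog")
encoded = tuple(apply_all(b.type_tfms, o) for b, o in zip(blocks, raw))
print(encoded)
```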

In [None]:
class DataBlock():
    "Generic container to quickly build `Datasets` and `DataLoaders`"
       
    # Where each piece ends up (schematic, not executable):
    #   source               => Datasets.items
    #   get_items            => Datasets.items
    #   splitter             => Datasets.splits
    #
    #   blocks = (TransformBlock,TransformBlock)*
    #   n_inp                => Datasets.n_inp
    #
    #   getters              => Datasets.tfms
    #   type_tfms*           => Datasets.tfms
    #
    #   default_item_tfms*   => DataLoaders.after_item
    #   item_tfms            => DataLoaders.after_item
    #
    #   default_batch_tfms*  => DataLoaders.after_batch
    #   batch_tfms           => DataLoaders.after_batch
    #
    #   dl_type = TfmdDL*    => Datasets.dl_type -> dataloaders
    #   dls_kwargs*          => DataLoaders.kwargs
    
    def __init__(self, blocks=None, dl_type=None, getters=None, n_inp=None, item_tfms=None, batch_tfms=None):
        # Properties initialized by blocks
        blocks = L(b() if callable(b) else b for b in blocks)
        self.type_tfms = blocks.attrgot('type_tfms', L())
        
        self.default_item_tfms  = _merge_tfms(*blocks.attrgot('item_tfms',  L()))
        self.default_batch_tfms = _merge_tfms(*blocks.attrgot('batch_tfms', L()))
        
        for b in blocks: 
            if getattr(b, 'dl_type', None) is not None: self.dl_type = b.dl_type
        if dl_type is not None: self.dl_type = dl_type
        self.dataloaders = delegates(self.dl_type.__init__)(self.dataloaders)
        self.dls_kwargs = merge(*blocks.attrgot('dls_kwargs', {}))
        
        # Pipeline
        self.n_inp = ifnone(n_inp, max(1, len(blocks)-1))
        self.getters = ifnone(getters, [noop]*len(self.type_tfms))
        if self.get_x:
            self.getters[:self.n_inp] = L(self.get_x)
        if self.get_y:
            self.getters[self.n_inp:] = L(self.get_y)
        self.new(item_tfms, batch_tfms)
        
    def new(self, item_tfms=None, batch_tfms=None):
        "Create a new `DataBlock` with other `item_tfms` and `batch_tfms`"
        self.item_tfms  = _merge_tfms(self.default_item_tfms,  item_tfms)
        self.batch_tfms = _merge_tfms(self.default_batch_tfms, batch_tfms)
        
    def datasets(self, source, verbose=False):
        "Create a `Datasets` object from `source`"
        self.source = source                     
        items = (self.get_items or noop)(source)
        splits = (self.splitter or RandomSplitter())(items)  
        return Datasets(items, tfms=self._combine_type_tfms(), splits=splits, dl_type=self.dl_type, n_inp=self.n_inp, verbose=verbose)
    
    def dataloaders(self, source, path='.', verbose=False, **kwargs):
        dsets = self.datasets(source, verbose=verbose)  # forward verbose so it is not silently dropped
        kwargs = {**self.dls_kwargs, **kwargs, 'verbose': verbose}
        return dsets.dataloaders(path=path, after_item=self.item_tfms, after_batch=self.batch_tfms, **kwargs)
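
Two small rules in the class above are easy to miss: how `n_inp` defaults from the number of blocks, and which keyword arguments win in `dataloaders`. The sketch below restates both in plain Python; `default_n_inp`, `split_getters` and `merge_dl_kwargs` are hypothetical helper names that mirror the expressions used in the cell.

```python
def default_n_inp(n_blocks, n_inp=None):
    """All blocks but the last are inputs, with at least one input."""
    return n_inp if n_inp is not None else max(1, n_blocks - 1)


def split_getters(getters, n_inp):
    """The first `n_inp` getters build the inputs, the rest build the targets."""
    return getters[:n_inp], getters[n_inp:]


def merge_dl_kwargs(block_kwargs, call_kwargs, verbose=False):
    """Later entries win: call-time kwargs override the defaults from the blocks."""
    return {**block_kwargs, **call_kwargs, "verbose": verbose}


n = default_n_inp(3)
x_getters, y_getters = split_getters(["get_a", "get_b", "get_c"], n)
merged = merge_dl_kwargs({"bs": 64, "shuffle": True}, {"bs": 16})
print(n, x_getters, y_getters, merged)
```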