# Fastai version 2

This notebook covers transforms and pipelines in fastai version 2. Almost everything is just a function, and we apply those functions to a set of inputs. How these functions are managed and combined is the main concern of this version of the fastai library.
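
To make the idea concrete, here is a minimal sketch in plain Python of chaining functions and applying them to a set of inputs. The `compose` helper and the two toy transforms are hypothetical names made up for this illustration; this is not how fastai's own pipeline classes are implemented.

```python
from functools import reduce

def compose(*funcs):
    """Return a function that applies `funcs` from left to right."""
    def _pipeline(x):
        return reduce(lambda acc, f: f(acc), funcs, x)
    return _pipeline

# Hypothetical toy transforms, for illustration only
def strip(s):
    return s.strip()

def lower(s):
    return s.lower()

pipe = compose(strip, lower)
result = [pipe(s) for s in ["  Cat ", "DOG"]]
print(result)  # ['cat', 'dog']
```

Each transform stays a small, testable function, and the pipeline only decides the order in which they run.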

In [6]:
#export
from fastai2.torch_basics import *
from fastai2.data.core import *
from fastai2.data.load import *
from fastai2.data.external import *
from fastai2.data.transforms import *

In [2]:
from nbdev.showdoc import *

# Helper functions for processing data and basic transforms
> Functions for getting, splitting and labelling data

# Get, split and label

For most data sources, we need functions to get a list of items, split them into train / valid sets, and label them. Fastai provides functions to make each of these steps easy (especially when combined with `fastai.data.blocks`).

In [4]:
path = untar_data(URLs.MNIST_TINY,dest='/media/puneet/Data')
path.ls()

(#5) [/media/puneet/Data/mnist_tiny/labels.csv,/media/puneet/Data/mnist_tiny/models,/media/puneet/Data/mnist_tiny/test,/media/puneet/Data/mnist_tiny/train,/media/puneet/Data/mnist_tiny/valid]

In [8]:
get_files(path, extensions='.png',recurse=True)

(#1428) [/media/puneet/Data/mnist_tiny/test/1503.png,/media/puneet/Data/mnist_tiny/test/1605.png,/media/puneet/Data/mnist_tiny/test/1883.png,/media/puneet/Data/mnist_tiny/test/2032.png,/media/puneet/Data/mnist_tiny/test/205.png,/media/puneet/Data/mnist_tiny/test/2642.png,/media/puneet/Data/mnist_tiny/test/3515.png,/media/puneet/Data/mnist_tiny/test/3848.png,/media/puneet/Data/mnist_tiny/test/3878.png,/media/puneet/Data/mnist_tiny/test/4605.png...]

In [11]:
#hide
test_eq(len(get_files(path, extensions='.png', recurse=True, folders='train')), 709)
test_eq(len(get_files(path, extensions='.png', recurse=True)), 1428)
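
The "get" step can be sketched with the standard library alone. The helper below, `list_files`, is a hypothetical stand-in written for this note (it is not fastai's `get_files`); it filters by extension and optionally recurses, and it builds a small temporary folder so it runs anywhere.

```python
import tempfile
from pathlib import Path

def list_files(root, extensions=None, recurse=True):
    """Return files under `root`, optionally filtered by extension."""
    root = Path(root)
    candidates = root.rglob("*") if recurse else root.glob("*")
    exts = {e.lower() for e in extensions} if extensions else None
    return [p for p in candidates
            if p.is_file() and (exts is None or p.suffix.lower() in exts)]

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "train").mkdir()
    (root / "train" / "1.png").touch()
    (root / "train" / "notes.txt").touch()
    (root / "2.png").touch()
    # Store relative paths as strings so the result is easy to compare
    found = sorted(p.relative_to(root).as_posix()
                   for p in list_files(root, extensions=[".png"]))
    top_only = sorted(p.name
                      for p in list_files(root, extensions=[".png"], recurse=False))

print(found)     # ['2.png', 'train/1.png']
print(top_only)  # ['2.png']
```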


In fastai, functions whose names start with a capital letter (for example `RandomSplitter`) return another function. Calling one generates a new, configured function that the library then uses.
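
This "function that returns a function" pattern is an ordinary Python closure. A minimal sketch, using a made-up `Multiplier` factory that is not part of fastai:

```python
def Multiplier(factor):
    """Capitalized factory: configure once, get back a plain function."""
    def _inner(x):
        # `factor` is remembered from the enclosing call
        return x * factor
    return _inner

double = Multiplier(2)
triple = Multiplier(3)
print(double(5), triple(5))  # 10 15
```

The configuration (`factor`) is fixed when the factory is called, and the returned function only needs the data.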

# Split

The next set of functions is used to *split* data into training and validation sets. Each function returns two lists: the indices (or masks) for the training set and for the validation set.

### Random Splitter

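`RandomSplitter` returns a function that splits the dataset randomly. With the default settings, about 20% of the items go to the validation set, and passing a `seed` makes the split reproducible.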

In [12]:
src = list(range(30))
f = RandomSplitter(seed=42)
trn, val = f(src)
assert 0< len(trn) < len(src)
assert all(o not in val for o in trn)
test_eq(len(trn), len(src)- len(val))
test_eq(f(src)[0],trn)

In [15]:
len(trn), len(val)

(24, 6)
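
A rough sketch of the same idea in plain Python is shown here. It assumes a 20% validation share and uses the standard `random` module, so the actual indices will differ from what fastai produces; `random_split_sketch` is a hypothetical name used only in this note.

```python
import random

def random_split_sketch(valid_pct=0.2, seed=None):
    """Return a function that splits item indices at random."""
    def _inner(items):
        rng = random.Random(seed)       # a fresh generator makes each call reproducible
        idxs = list(range(len(items)))
        rng.shuffle(idxs)
        cut = int(valid_pct * len(items))
        return idxs[cut:], idxs[:cut]   # (train indices, valid indices)
    return _inner

src = list(range(30))
f = random_split_sketch(seed=42)
trn, val = f(src)
print(len(trn), len(val))  # 24 6
```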

### Index Splitter

`IndexSplitter` needs the list of indices that belong to the validation set. All remaining indices go to the training set.

In [18]:
items = list(range(10))
splitter = IndexSplitter([3,7,9])
test_eq(splitter(items), [[0,1,2,4,5,6,8],[3,7,9]])
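
The logic can be sketched in a few lines of plain Python. `index_split_sketch` is a hypothetical helper for illustration, not the library's implementation.

```python
def index_split_sketch(valid_idx):
    """Return a function that sends `valid_idx` to valid and the rest to train."""
    def _inner(items):
        valid = set(valid_idx)
        train = [i for i in range(len(items)) if i not in valid]
        return train, list(valid_idx)
    return _inner

items = list(range(10))
train_idx, valid_idx = index_split_sketch([3, 7, 9])(items)
print(train_idx, valid_idx)  # [0, 1, 2, 4, 5, 6, 8] [3, 7, 9]
```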

### Mask Splitter
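`MaskSplitter` uses `mask2idx` to convert a boolean mask into a list of indices, then calls the index splitter on them. Items whose mask value is `True` go to the validation set.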

In [27]:
items = list(range(6))
splitter = MaskSplitter([True, False, True,False,True,False])
test_eq(splitter(items),[[1,3,5],[0,2,4]])
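
A small sketch of the mask-to-indices step, followed by the split. Both helper names are made up for this note and only mirror the behaviour seen in the cell above.

```python
def mask_to_indices(mask):
    """Positions where the mask is truthy."""
    return [i for i, flag in enumerate(mask) if flag]

def mask_split_sketch(mask):
    """Return a function that sends masked positions to valid, the rest to train."""
    def _inner(items):
        valid = mask_to_indices(mask)
        valid_set = set(valid)
        train = [i for i in range(len(items)) if i not in valid_set]
        return train, valid
    return _inner

mask = [True, False, True, False, True, False]
train_idx, valid_idx = mask_split_sketch(mask)(list(range(6)))
print(train_idx, valid_idx)  # [1, 3, 5] [0, 2, 4]
```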

# Label

In [29]:
df = pd.DataFrame({'a': 'a b c d'.split(), 'b': ['1 2', '0', '', '1 2 3']})


In [30]:
df.head()

|   | a | b |
|---|---|---|
| 0 | a | 1 2 |
| 1 | b | 0 |
| 2 | c |  |
| 3 | d | 1 2 3 |


In [32]:
list(df.itertuples())

[Pandas(Index=0, a='a', b='1 2'),
 Pandas(Index=1, a='b', b='0'),
 Pandas(Index=2, a='c', b=''),
 Pandas(Index=3, a='d', b='1 2 3')]

### ColReader
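As I understand it, `ColReader` returns a function that reads the value of a given column from a row of a pandas DataFrame (for example, a row produced by `df.itertuples()`). This is how you apply a function to each row of a DataFrame and get values back.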

In [36]:
f = ColReader('a')
test_eq([f(o) for o in df.itertuples()],['a','b','c','d'])

# We can add prefix and suffix
f = ColReader('a',pref='0',suff='1')
test_eq([f(o) for o in df.itertuples()],['0a1','0b1','0c1','0d1'])
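
The behaviour above can be imitated with named tuples, which is the shape of row that `itertuples()` yields. This is only a sketch: `col_reader_sketch` is a hypothetical helper, and its `delim` parameter is an assumption added here to show how a column such as `b` could be split into several labels.

```python
from collections import namedtuple

# Stand-in rows with the same fields as df.itertuples() in the cells above
Row = namedtuple("Row", ["Index", "a", "b"])
rows = [Row(0, "a", "1 2"), Row(1, "b", "0"), Row(2, "c", ""), Row(3, "d", "1 2 3")]

def col_reader_sketch(col, pref="", suff="", delim=None):
    """Return a function that reads column `col` from a row."""
    def _inner(row):
        value = getattr(row, col)
        if delim is not None:
            # An empty string means "no labels", not one empty label
            return value.split(delim) if value else []
        return f"{pref}{value}{suff}"
    return _inner

plain = [col_reader_sketch("a")(r) for r in rows]
wrapped = [col_reader_sketch("a", pref="0", suff="1")(r) for r in rows]
multi = [col_reader_sketch("b", delim=" ")(r) for r in rows]
print(plain)    # ['a', 'b', 'c', 'd']
print(wrapped)  # ['0a1', '0b1', '0c1', '0d1']
print(multi)    # [['1', '2'], ['0'], [], ['1', '2', '3']]
```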


### Categorize

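Use `CategoryMap` to build the set of categories (the vocabulary) from a collection of labels. Its `o2i` attribute maps each category to its integer index.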

In [37]:
t = CategoryMap([4,2,3,4])


In [38]:
t

(#3) [2,3,4]

In [39]:
t.o2i

{2: 0, 3: 1, 4: 2}

In [41]:
t = CategoryMap(pd.Series([4,2,3,4]), sort=False)
test_eq(t, [4,2,3])
test_eq(t.o2i, {4:0,2:1,3:2})

In [42]:
col = pd.Series(pd.Categorical(['M','H','L','M'], categories=['H','M','L'], ordered=True))
t = CategoryMap(col)
test_eq(t, ['H','M','L'])
test_eq(t.o2i, {'H':0,'M':1,'L':2})

# A categorical pandas Series exposes its categories, in their defined order, through the `.cat` accessor:
# col.cat.categories
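
The vocabulary and the `o2i` mapping seen above can be reproduced with a short plain-Python sketch. `build_vocab` is a hypothetical helper for illustration and does not handle the pandas categorical case.

```python
def build_vocab(labels, sort=True):
    """Return (unique labels, label -> index mapping)."""
    unique = list(dict.fromkeys(labels))  # de-duplicate, keeping first-seen order
    if sort:
        unique = sorted(unique)
    o2i = {label: i for i, label in enumerate(unique)}
    return unique, o2i

vocab, o2i = build_vocab([4, 2, 3, 4])
print(vocab, o2i)  # [2, 3, 4] {2: 0, 3: 1, 4: 2}

vocab_unsorted, o2i_unsorted = build_vocab([4, 2, 3, 4], sort=False)
print(vocab_unsorted, o2i_unsorted)  # [4, 2, 3] {4: 0, 2: 1, 3: 2}
```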

### Multicategorize

`MultiCategorize` is useful when each item has a list of labels rather than a single label. It assigns a unique integer to each distinct label and converts every item's labels to those integers.

`MultiCategorize` is a very interesting transform: it works with almost every data type, and hence it does not restrict the type of the incoming data.

In [44]:
cat = MultiCategorize()
tds = DataSource([['b', 'c'],['a'],['a', 'c'],[]],  tfms=[cat])


In [57]:
cat = MultiCategorize()

tds = DataSource([[1.0,2.0],[]], tfms=[cat])

In [63]:
tds = DataSource([['b', 'c'], ['a'], ['a', 'c'], []], [[MultiCategorize(), OneHotEncode()]])
tds[0],tds[1],tds[2]

((tensor([0., 1., 1.]),), (tensor([1., 0., 0.]),), (tensor([1., 0., 1.]),))
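
The two steps, mapping labels to integers and then one-hot encoding them, can be sketched with plain lists instead of tensors. The helper names are hypothetical, and the sorted vocabulary is an assumption that matches the output above.

```python
def multi_categorize(items):
    """Map every label to an integer index in a sorted vocabulary."""
    vocab = sorted({label for labels in items for label in labels})
    o2i = {label: i for i, label in enumerate(vocab)}
    encoded = [[o2i[label] for label in labels] for labels in items]
    return vocab, encoded

def one_hot(indices, n_classes):
    """Vector of length `n_classes` with 1.0 at each given index."""
    vec = [0.0] * n_classes
    for i in indices:
        vec[i] = 1.0
    return vec

items = [["b", "c"], ["a"], ["a", "c"], []]
vocab, encoded = multi_categorize(items)
hot = [one_hot(idxs, len(vocab)) for idxs in encoded]
print(vocab)    # ['a', 'b', 'c']
print(encoded)  # [[1, 2], [0], [0, 2], []]
print(hot[:3])  # [[0.0, 1.0, 1.0], [1.0, 0.0, 0.0], [1.0, 0.0, 1.0]]
```

An item with no labels simply becomes an all-zero vector.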

If you call `decode` with a plain tensor, it still works.

Note that when decoding, even if we pass a plain tensor, it is first cast to `TensorMultiCategory`, because the plain tensor type is a superclass of `TensorMultiCategory`.

In [83]:
tds.decode([tensor([False, True, True])]), tds.decode([tensor([1,0,1])])

(((#2) [b,c],), ((#2) [a,c],))

Internally, the tensor is converted to `TensorMultiCategory` first, so passing a `TensorMultiCategory` explicitly gives the same result.

In [80]:
tds.decode([TensorMultiCategory(tensor([False, True, True]))])

((#2) [b,c],)
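
Decoding is the reverse lookup: keep the vocabulary entries whose position is switched on. A minimal sketch with plain lists, assuming the vocabulary `['a', 'b', 'c']` from the earlier cells (`decode_one_hot` is a hypothetical helper):

```python
def decode_one_hot(vec, vocab):
    """Return the labels whose position in `vec` is truthy."""
    return [label for flag, label in zip(vec, vocab) if flag]

vocab = ["a", "b", "c"]
from_bools = decode_one_hot([False, True, True], vocab)
from_ints = decode_one_hot([1, 0, 1], vocab)
print(from_bools, from_ints)  # ['b', 'c'] ['a', 'c']
```

Because only truthiness matters, boolean and 0/1 inputs decode the same way, which matches the two calls shown above.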

In [82]:
# Get the Python class of an encoded item and its underlying torch tensor type

type(tds[0][0]), tds[0][0].type()

(fastai2.torch_core.TensorMultiCategory, 'torch.FloatTensor')