# Notebook 3a: Tabular Data

This notebook goes over how the new API operates on tabular data with the standard API. Notebook 3b will go over utilizing RAPIDS.

First, let's install the library again. We won't need Pillow for this one.

In [1]:
!pip3 install torch===1.3.0 torchvision===0.4.1 -f https://download.pytorch.org/whl/torch_stable.html
!pip install git+https://github.com/fastai/fastai_dev > /dev/null

Looking in links: https://download.pytorch.org/whl/torch_stable.html
Collecting torch===1.3.0
  Downloading https://files.pythonhosted.org/packages/ae/05/50a05de5337f7a924bb8bd70c6936230642233e424d6a9747ef1cfbde353/torch-1.3.0-cp36-cp36m-manylinux1_x86_64.whl (773.1MB)
     |████████████████████████████████| 773.1MB 24kB/s
Collecting torchvision===0.4.1
  Downloading https://files.pythonhosted.org/packages/fc/23/d418c9102d4054d19d57ccf0aca18b7c1c1f34cc0a136760b493f78ddb06/torchvision-0.4.1-cp36-cp36m-manylinux1_x86_64.whl (10.1MB)
     |████████████████████████████████| 10.1MB 29.4MB/s
Installing collected packages: torch, torchvision
  Found existing installation: torch 1.3.0+cu100
    Uninstalling torch-1.3.0+cu100:
      Successfully uninstalled torch-1.3.0+cu100
  Found existing installation: torchvision 0.4.1+cu100
    Uninstalling torchvision-0.4.1+cu100:
      Successfully uninstalled torchvision-0.4.1+cu100
Successfully installed torch-1.3.0 torchvisio

To use the tabular library, we need to import its `core` module. Along with this, we will need some code borrowed from [notebook 41](github/fastai/fastai_dev/blob/master/dev/41_tabular_model.ipynb) to build our `Learner`.

In [0]:
from fastai2.torch_basics import *
from fastai2.basics import *
from fastai2.tabular.core import *

We'll be using the `ADULT_SAMPLE` `DataFrame` as usual, with our old variable setup.

In [0]:
path = untar_data(URLs.ADULT_SAMPLE)
df = pd.read_csv(path/'adult.csv')

In [0]:
cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race']
cont_names = ['age', 'fnlwgt', 'education-num']
procs = [Categorify, FillMissing, Normalize]

Now let's get into the new stuff. Before, we had something like the following to create a `TabularList`:

In [0]:
### DO NOT RUN! JUST FOR SHOW OF HOW THE 1.0 API LOOKED ###
data = (TabularList.from_df(df, path=path, cat_names=cat_names, cont_names=cont_names, procs=procs)
                           .split_by_idx(list(range(800,1000)))
                           .label_from_df(cols=dep_var)
                           .databunch())

Where essentially we build our `TabularList`, then choose how to split, then label, then databunch it. Quite a convoluted setup there. Let's see how the new API looks and handles it!

We can still use our old procs, but now let's introduce you to the `RandomSplitter`. Calling `RandomSplitter()` gives us a function that randomly splits a set of indices 80/20 into training and validation. We then call that function on the range of indices we'd like to split.

We'll use the `range_of` function to grab the range of indices our `DataFrame` has.

In [0]:
splits = RandomSplitter()(range_of(df))

But what is `range_of` doing?

In [6]:
rang = range_of(df); 
print(rang[:10], rang[-10:])

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9] [32551, 32552, 32553, 32554, 32555, 32556, 32557, 32558, 32559, 32560]


And we can see that `RandomSplitter` then randomly split our indices into two lists! (The first value in each, after the `#`, is the length of that list.)

In [7]:
splits

((#26049) [9386,14109,10181,7992,22718,22572,25177,6876,11364,18308...],
 (#6512) [23673,24687,1608,28424,4770,6716,19149,31006,1982,14778...])

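So `range_of` simply gives us a list of every index our `DataFrame` has in it, and `RandomSplitter` shuffles those indices into a training list and a validation list.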
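To make the mechanics concrete, here is a minimal pure-Python sketch of the idea: shuffle every index, then cut off the first 20% for validation. `random_split` is a hypothetical stand-in written for illustration, not the library's implementation, and the `0.2` validation fraction and the seed are assumptions.

```python
import random

def random_split(n_items, valid_pct=0.2, seed=None):
    """Shuffle the indices 0..n_items-1 and cut them into (train, valid) lists."""
    rng = random.Random(seed)
    idxs = list(range(n_items))      # plays the same role as range_of(df)
    rng.shuffle(idxs)
    cut = int(valid_pct * n_items)   # number of validation items
    return idxs[cut:], idxs[:cut]

# 32561 is the number of rows in the DataFrame (indices 0..32560 above)
train_idx, valid_idx = random_split(32561, seed=42)
print(len(train_idx), len(valid_idx))
```

With 32,561 rows this yields 26,049 training and 6,512 validation indices, the same lengths reported in the `splits` output above.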

Great! So what's next? 

Now we can create a `TabularPandas` object! Think of it like our `TabularList` with a few more parameters. We pass in the `DataFrame`, our preprocessing steps (`procs`), our categorical and continuous variables, our `y` variable, and how we want to split our data!

In [0]:
to = TabularPandas(df, procs=procs, cat_names=cat_names, cont_names=cont_names, y_names="salary",
                   splits=splits)

Along with this, there is an optional `is_y_cat` parameter, which determines whether the target is treated as categorical (classification) or not (regression).

So what is this `TabularPandas` object? Think of it like an enhanced Pandas `DataFrame`! We can use it a bit like a regular one, yet it's already split and prepared to databunch!

In [10]:
to.head()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,salary,age_na,fnlwgt_na,education-num_na
9386,0.104522,6,1.469168,15,1.928576,3,11,1,5,Male,0,0,50,United-States,1,1,1,1
14109,0.691071,5,-0.323401,16,-0.029405,5,4,2,3,Female,0,0,40,United-States,0,1,1,1
10181,1.204301,5,-0.129143,11,2.320172,3,11,1,5,Male,0,1977,55,?,1,1,1,1
7992,1.27762,5,0.727911,10,1.145383,5,11,2,5,Female,0,0,45,Mexico,0,1,1,1
22718,1.204301,3,-1.503648,10,1.145383,7,11,2,5,Female,0,0,40,United-States,1,1,1,1
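The integer codes and scaled floats in the output above come from the `procs`. As a rough sketch of what `Categorify` and `Normalize` do conceptually (this is not the library's code: the helper names, the toy data, the `#na#` placeholder, and the use of the population standard deviation are all assumptions), the statistics are fitted on training values and then applied to any row:

```python
from statistics import mean, pstdev

def fit_categorify(train_values, na_token="#na#"):
    """Map each category seen in training to an integer code; 0 is kept for unseen/missing."""
    classes = [na_token] + sorted(set(train_values))
    return {c: i for i, c in enumerate(classes)}

def fit_normalize(train_values):
    """Return (mean, std) computed on the training values only."""
    return mean(train_values), pstdev(train_values)

# Toy training columns, made up for illustration
train_workclass = ["Private", "State-gov", "Private", "Self-emp-inc"]
train_age = [30.0, 40.0, 50.0]

codes = fit_categorify(train_workclass)
mu, sigma = fit_normalize(train_age)

# Apply the fitted transforms to new values
encoded = [codes.get(v, 0) for v in ["Private", "Never-seen"]]
scaled = [(a - mu) / sigma for a in [40.0, 50.0]]
print(encoded, scaled)
```

A category never seen during fitting falls back to code `0`, and a value equal to the training mean is scaled to `0.0`.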


## DataBunch

We can create our `DataBunch` object a few different ways. The first I'll show you is very high-level and relies on the defaults. Our `to` object already holds the train and validation sets, so the last step is to simply `.databunch()` it!

### Method 1: Straight

In [0]:
dbunch = to.databunch()

In [12]:
dbunch.show_batch()

Unnamed: 0,age,fnlwgt,education-num,workclass,education,marital-status,occupation,relationship,race,age_na,fnlwgt_na,education-num_na,salary
0,30.0,29521.994539,10.0,Private,Some-college,Married-civ-spouse,Sales,Wife,White,False,False,False,<50k
1,30.0,467107.990769,14.0,Private,Masters,Married-civ-spouse,Prof-specialty,Husband,White,False,False,False,>=50k
2,21.0,119703.999997,10.0,Private,Some-college,Never-married,Sales,Unmarried,White,False,False,False,<50k
3,21.999999,54824.997941,9.0,Private,HS-grad,Never-married,Sales,Not-in-family,White,False,False,False,<50k
4,68.000001,270338.996935,6.0,?,10th,Married-civ-spouse,?,Husband,White,False,False,False,<50k
5,46.0,459189.009275,10.0,Private,Some-college,Married-civ-spouse,Craft-repair,Husband,White,False,False,False,>=50k
6,21.0,56582.000487,7.0,Private,11th,Never-married,Other-service,Own-child,White,False,False,False,<50k
7,37.0,173779.999996,10.0,State-gov,Some-college,Divorced,Prof-specialty,Unmarried,White,False,False,False,<50k
8,48.0,266336.998972,9.0,?,HS-grad,Married-civ-spouse,?,Husband,White,False,False,False,<50k
9,25.0,263772.99881,10.0,Private,Some-college,Married-civ-spouse,Adm-clerical,Wife,White,False,False,False,>=50k


### Method 2: With Two DataLoaders

We can also create our two `DataLoaders` ourselves (a train and a valid). One great reason to do it this way is that we can pass a different batch size into each `TabDataLoader`, along with changing options like `shuffle` and `drop_last` (at the bottom I'll show why that's **super** cool).

So how do we use it? Our train and validation data live in `to.train` and `to.valid` right now, so we specify those along with our options. When you make a training `DataLoader`, you want both `shuffle` and `drop_last` to be `True`.

In [0]:
trn_dl = TabDataLoader(to.train, bs=64, shuffle=True, drop_last=True)
val_dl = TabDataLoader(to.valid, bs=128)
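As a quick illustration of what `bs` and `drop_last` mean for the number of batches, here is a small arithmetic sketch. `n_batches` is a hypothetical helper, not part of the library; the item counts are the split lengths shown earlier.

```python
import math

def n_batches(n_items, bs, drop_last=False):
    """Number of batches produced for `n_items` at batch size `bs`."""
    return n_items // bs if drop_last else math.ceil(n_items / bs)

n_train, n_valid = 26049, 6512   # lengths of the two splits

train_batches = n_batches(n_train, 64, drop_last=True)
valid_batches = n_batches(n_valid, 128)
dropped = n_train - train_batches * 64   # training rows skipped in an epoch
print(train_batches, valid_batches, dropped)
```

With `drop_last=True` the incomplete final training batch is discarded, so every training batch has exactly 64 rows. Because the training loader is shuffled, a different row is left out each epoch.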

Since our validation dataset is much smaller, we can have a larger batch size here. Now let's create a `DataBunch`

In [0]:
dbunch = DataBunch(trn_dl, val_dl)

In [15]:
dbunch.show_batch()

Unnamed: 0,age,fnlwgt,education-num,workclass,education,marital-status,occupation,relationship,race,age_na,fnlwgt_na,education-num_na,salary
0,33.0,117982.996926,9.0,Private,HS-grad,Never-married,Adm-clerical,Not-in-family,White,False,False,False,<50k
1,18.0,270543.997075,8.0,Private,12th,Never-married,Adm-clerical,Own-child,White,False,False,False,<50k
2,32.0,285131.000281,12.0,?,Assoc-acdm,Never-married,?,Unmarried,White,False,False,False,<50k
3,48.0,349151.005482,9.0,Private,HS-grad,Married-civ-spouse,Craft-repair,Husband,White,False,False,False,<50k
4,37.0,76766.99715,15.0,State-gov,Prof-school,Married-civ-spouse,Prof-specialty,Wife,White,False,False,False,>=50k
5,31.0,304212.000533,10.0,Self-emp-inc,Some-college,Never-married,Exec-managerial,Own-child,White,False,False,False,<50k
6,20.0,211293.000575,10.0,Private,Some-college,Never-married,Sales,Own-child,Black,False,False,False,<50k
7,51.0,87205.001406,13.0,Private,Bachelors,Divorced,Prof-specialty,Unmarried,White,False,False,False,<50k
8,27.0,214384.999293,9.0,Federal-gov,HS-grad,Never-married,Adm-clerical,Not-in-family,Black,False,False,False,<50k
9,26.0,89389.000849,10.0,Private,Some-college,Divorced,Prof-specialty,Not-in-family,White,False,False,False,<50k


# Training

Great! Let's train a model. I'm going to put the code we need in a separate section here, but nothing has changed since 1.0 in terms of the embedding rule and model generation.

### Source Code

In [0]:
def emb_sz_rule(n_cat): 
    "Rule of thumb to pick embedding size corresponding to `n_cat`"
    return min(600, round(1.6 * n_cat**0.56))

def _one_emb_sz(classes, n, sz_dict=None):
    "Pick an embedding size for `n` depending on `classes` if not given in `sz_dict`."
    sz_dict = ifnone(sz_dict, {})
    n_cat = len(classes[n])
    sz = sz_dict.get(n, int(emb_sz_rule(n_cat)))  # rule of thumb
    return n_cat,sz

def get_emb_sz(to, sz_dict=None):
    "Get default embedding size from `TabularPreprocessor` `proc` or the ones in `sz_dict`"
    return [_one_emb_sz(to.procs.classes, n, sz_dict) for n in to.cat_names]

class TabularModel(Module):
    "Basic model for tabular data."
    def __init__(self, emb_szs, n_cont, out_sz, layers, ps=None, embed_p=0., y_range=None, use_bn=True, bn_final=False):
        ps = ifnone(ps, [0]*len(layers))
        if not is_listy(ps): ps = [ps]*len(layers)
        self.embeds = nn.ModuleList([Embedding(ni, nf) for ni,nf in emb_szs])
        self.emb_drop = nn.Dropout(embed_p)
        self.bn_cont = nn.BatchNorm1d(n_cont)
        n_emb = sum(e.embedding_dim for e in self.embeds)
        self.n_emb,self.n_cont,self.y_range = n_emb,n_cont,y_range
        sizes = [n_emb + n_cont] + layers + [out_sz]
        actns = [nn.ReLU(inplace=True) for _ in range(len(sizes)-2)] + [None]
        _layers = [BnDropLin(sizes[i], sizes[i+1], bn=use_bn and i!=0, p=p, act=a)
                       for i,(p,a) in enumerate(zip([0.]+ps,actns))]
        if bn_final: _layers.append(nn.BatchNorm1d(sizes[-1]))
        self.layers = nn.Sequential(*_layers)
    
    def forward(self, x_cat, x_cont):
        if self.n_emb != 0:
            x = [e(x_cat[:,i]) for i,e in enumerate(self.embeds)]
            x = torch.cat(x, 1)
            x = self.emb_drop(x)
        if self.n_cont != 0:
            x_cont = self.bn_cont(x_cont)
            x = torch.cat([x, x_cont], 1) if self.n_emb != 0 else x_cont
        x = self.layers(x)
        if self.y_range is not None:
            x = (self.y_range[1]-self.y_range[0]) * torch.sigmoid(x) + self.y_range[0]
        return x
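The last step of `forward` is only used when `y_range` is given, which is typical for regression. A minimal sketch of that scaling in plain Python (the helper name and the example range `(0, 5)` are assumptions for illustration):

```python
import math

def scale_to_range(x, y_range):
    """Squash a raw output `x` into the open interval (lo, hi) with a sigmoid."""
    lo, hi = y_range
    return (hi - lo) / (1 + math.exp(-x)) + lo

# Very negative, zero, and very positive raw outputs
outs = [scale_to_range(x, (0.0, 5.0)) for x in (-10.0, 0.0, 10.0)]
print(outs)
```

A raw output of `0` lands at the midpoint of the range, and extreme values approach the bounds without ever reaching them.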

## Building the model

Eventually something similar to `tabular_learner` will appear, but for the time being we need to build the model ourselves. We do this by calling `TabularModel` and passing in our embedding matrix sizes, how many continuous variables we have, our number of outputs, and our layer sizes.

We can gather our embedding matrix sizes by calling `get_emb_sz` and passing in our `TabularPandas` object.

In [0]:
emb_szs = get_emb_sz(to)

In [18]:
emb_szs

[(10, 6), (17, 8), (8, 5), (16, 8), (7, 5), (6, 4), (2, 2), (2, 2), (3, 3)]

We can grab our number of continuous variables by taking the length of `cont_names` on our `TabularPandas` object as well.

In [19]:
cont_len = len(to.cont_names); cont_len

3
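As a sanity check, the embedding sizes can be reproduced in plain Python from the rule of thumb in `emb_sz_rule`. The cardinalities below are copied from the first element of each pair in `emb_szs`; the last three entries presumably correspond to the `_na` indicator columns added by `FillMissing`, which is why there are nine embeddings for six categorical names.

```python
def emb_sz_rule(n_cat):
    "Same rule of thumb as in the source code cell."
    return min(600, round(1.6 * n_cat**0.56))

cardinalities = [10, 17, 8, 16, 7, 6, 2, 2, 3]  # first element of each emb_szs pair
sizes = [emb_sz_rule(n) for n in cardinalities]

n_cont = 3                                # number of continuous variables
in_features = sum(sizes) + n_cont         # n_emb + n_cont
print(sizes, in_features)
```

The sizes match the second element of each pair in `emb_szs`, and the total of 46 is the number of inputs the first linear layer receives (`n_emb + n_cont` in `TabularModel`).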

Now that we have these, let's create our model! We'll use a simple `[200, 100]` layer setup like Jeremy has in his lectures. We'll also want our output size to be `2`, as this is binary classification (above or below $50k).

In [0]:
net = TabularModel(emb_szs, cont_len, 2, [200,100])

In [21]:
net

TabularModel(
  (embeds): ModuleList(
    (0): Embedding(10, 6)
    (1): Embedding(17, 8)
    (2): Embedding(8, 5)
    (3): Embedding(16, 8)
    (4): Embedding(7, 5)
    (5): Embedding(6, 4)
    (6): Embedding(2, 2)
    (7): Embedding(2, 2)
    (8): Embedding(3, 3)
  )
  (emb_drop): Dropout(p=0.0, inplace=False)
  (bn_cont): BatchNorm1d(3, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (layers): Sequential(
    (0): BnDropLin(
      (0): Linear(in_features=46, out_features=200, bias=True)
      (1): ReLU(inplace=True)
    )
    (1): BnDropLin(
      (0): BatchNorm1d(200, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (1): Linear(in_features=200, out_features=100, bias=True)
      (2): ReLU(inplace=True)
    )
    (2): BnDropLin(
      (0): BatchNorm1d(100, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (1): Linear(in_features=100, out_features=2, bias=True)
    )
  )
)

Now let's create an optimizer instance, our `Learner` object, and start training!

In [0]:
opt_func = partial(Adam, wd=0.01, eps=1e-5)
learn = Learner(dbunch, net, CrossEntropyLossFlat(), opt_func=opt_func, metrics=accuracy)

In [23]:
learn.fit(1)

(#5) [0,0.37234917283058167,0.35412687063217163,0.8353808522224426,00:10]


Awesome! We get ~83.5% accuracy! We can call `learn.show_results` to take a look at a `DataFrame` that shows our predictions (something new!)

In [0]:
#export
@typedispatch
def show_results(x:Tabular, y:Tabular, samples, outs, ctxs=None, max_n=10, **kwargs):
    df = x.all_cols[:max_n]
    df[to.y_names+'_pred'] = y[to.y_names][:max_n].values
    display_df(df)

And if you notice *how* they did it, it looks just like adding a column to a `DataFrame`!

In [26]:
learn.show_results()

Unnamed: 0,age,fnlwgt,education-num,workclass,education,marital-status,occupation,relationship,race,age_na,fnlwgt_na,education-num_na,salary,salary_pred
0,46.0,207300.999331,9.0,Private,HS-grad,Never-married,Adm-clerical,Not-in-family,White,False,False,False,<50k,<50k
1,37.0,419052.993722,9.0,Federal-gov,HS-grad,Divorced,Craft-repair,Not-in-family,White,False,False,False,<50k,<50k
2,61.0,195518.999819,14.0,Local-gov,Masters,Never-married,Prof-specialty,Unmarried,White,False,False,False,<50k,<50k
3,52.0,334273.003392,13.0,Self-emp-not-inc,Bachelors,Married-civ-spouse,Prof-specialty,Husband,White,False,False,False,>=50k,>=50k
4,55.0,227855.999994,13.0,Self-emp-inc,Bachelors,Married-civ-spouse,Exec-managerial,Husband,White,False,False,False,>=50k,>=50k
5,17.0,168202.999963,4.0,Private,7th-8th,Never-married,Farming-fishing,Other-relative,Other,False,False,False,<50k,<50k
6,45.0,213140.000239,6.0,Private,10th,Married-civ-spouse,Craft-repair,Husband,White,False,False,False,<50k,<50k
7,49.0,203039.000464,7.0,State-gov,11th,Married-civ-spouse,Adm-clerical,Husband,White,False,False,False,<50k,<50k
8,54.0,123010.998709,9.0,Self-emp-not-inc,HS-grad,Married-civ-spouse,Craft-repair,Husband,White,False,False,False,>=50k,<50k
9,33.0,171214.999849,14.0,Private,Masters,Never-married,Adm-clerical,Own-child,White,False,False,False,<50k,<50k


# That Cool Bit I Mentioned Earlier

One neat thing we can do now is have labeled test sets, and it's easy to do! Let's create a labeled test set from the data we used earlier (in practice you'd want to use a separate labeled test set!)

We're going to create a `TabularPandas` object like before (using the whole `DataFrame`), and then we can create a `DataLoader` like before too, setting both `shuffle` and `drop_last` to `False`.

In [0]:
to_test = TabularPandas(df, procs, cat_names, cont_names, y_names="salary")
test_dl = TabDataLoader(to_test, shuffle=False, drop_last=False)

And now we can pass in any `DataLoader` right into `learn.get_preds()` **or** `learn.validate()`!

In [29]:
learn.validate(dl=test_dl)

(#2) [0.35304903984069824,0.836675763130188]

If you're worried about whether it's actually working, let's get our predictions and check them ourselves with `get_preds`.

In [0]:
preds = learn.get_preds(dl=test_dl)

In [31]:
accuracy(preds[0], preds[1])

tensor(0.8367)

You can see that they line up perfectly!
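For intuition, the accuracy metric boils down to taking the highest-scoring class in each row and comparing it with the target. Here is a small pure-Python sketch of that idea (the helper and the toy predictions are made up for illustration; the library's `accuracy` operates on tensors):

```python
def accuracy_by_hand(preds, targets):
    """Fraction of rows whose highest-scoring class equals the target."""
    correct = 0
    for row, target in zip(preds, targets):
        predicted = max(range(len(row)), key=row.__getitem__)  # argmax over classes
        correct += predicted == target
    return correct / len(targets)

# Toy two-class scores and labels
toy_preds = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4], [0.3, 0.7]]
toy_targets = [0, 1, 1, 1]
acc = accuracy_by_hand(toy_preds, toy_targets)
print(acc)
```

Three of the four toy rows are classified correctly, giving an accuracy of `0.75`.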

Thanks for reading, and I hope you enjoy the v2 library as much as I do :)