# Notebook 3a: Tabular Data

This notebook goes over how the new API handles tabular data with the standard API. Notebook 3b will go over using RAPIDS.

First, let's install the library again (if you have not already):

In [0]:
#import os
#!pip install -q torch torchvision feather-format kornia pyarrow Pillow wandb nbdev fastprogress --upgrade 
#!pip install -q git+https://github.com/fastai/fastcore  --upgrade
#!pip install -q git+https://github.com/fastai/fastai2 --upgrade
#os._exit(00)


To use the tabular libraries, we need to import the `tabular` modules and the callbacks (to use `one_cycle`).

In [0]:
from fastai2.basics import *
from fastai2.tabular.all import *
from fastai2.callback.all import *

We'll be using the ADULT sample `dataframe` as per usual, with our old variable setup.

In [3]:
path = untar_data(URLs.ADULT_SAMPLE)
df = pd.read_csv(path/'adult.csv')

We'll want to declare our categorical variables and continuous variables, as well as the pre-processors for our data (`Categorify`, `FillMissing`, and `Normalize`).

In [0]:
cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race']
cont_names = ['age', 'fnlwgt', 'education-num']
procs = [Categorify, FillMissing, Normalize]
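To get a feel for what these three pre-processors do conceptually, here is a minimal pure-Python sketch on a couple of made-up values. It is not fastai's implementation: the toy data, the variable names, and the use of the population standard deviation are all assumptions made for illustration.

```python
from statistics import mean, median, pstdev

# Hypothetical toy column values; None marks a missing entry.
workclass = ["Private", "State-gov", None, "Private"]
age = [39.0, None, 50.0, 28.0]

# Categorify (sketch): map each category to an integer code, keeping 0 for missing.
vocab = ["#na#"] + sorted({v for v in workclass if v is not None})
codes = [vocab.index(v) if v is not None else 0 for v in workclass]

# FillMissing (sketch): fill gaps with the median and record where they were.
fill_value = median(v for v in age if v is not None)
age_na = [v is None for v in age]
age_filled = [fill_value if v is None else v for v in age]

# Normalize (sketch): subtract the mean and divide by the standard deviation.
mu, sigma = mean(age_filled), pstdev(age_filled)
age_norm = [(v - mu) / sigma for v in age_filled]

print(codes, age_na, age_norm)
```

The `#na#` category and the `_na` indicator idea mirror what shows up in the batches later in this notebook.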

Now let's get into the new stuff. So before, we had something like the following to create a `TabularList`

In [0]:
### DO NOT RUN! JUST FOR SHOW OF HOW THE 1.0 API LOOKED ###
dep_var = 'salary'  # the column we want to predict
data = (TabularList.from_df(df, path=path, cat_names=cat_names, cont_names=cont_names, procs=procs)
                           .split_by_idx(list(range(800,1000)))
                           .label_from_df(cols=dep_var)
                           .databunch())

Essentially, we build our `TabularList`, choose how to split it, label it, and finally turn it into a data object for training. Quite a convoluted setup. Let's see how the new API handles it!

We can still use our old procs, but now let's introduce you to the `RandomSplitter`. By default it splits our dataframe's indices randomly 80/20. We first call it to build the splitter, then call the result on the range of indices we'd like to split.

We'll use the `range_of` function to grab the range of indices our `dataframe` has.

In [0]:
splits = RandomSplitter()(range_of(df))

But what is `range_of` doing?

In [7]:
rang = range_of(df); 
print(rang[:10], rang[-10:])

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9] [32551, 32552, 32553, 32554, 32555, 32556, 32557, 32558, 32559, 32560]
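As a rough mental model, `range_of` behaves like the helper below. This is a sketch under the assumption that it simply lists every positional index of the collection; `range_of_sketch` and `rows` are hypothetical names, not part of the library.

```python
def range_of_sketch(collection):
    """Return every positional index of `collection` as a list."""
    return list(range(len(collection)))

rows = ["a", "b", "c", "d"]  # hypothetical stand-in for a dataframe's rows
indices = range_of_sketch(rows)
print(indices)
```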


And we can see that `RandomSplitter` randomly split our indices into two lists! (The first value shown, after the `#`, is the length of each list.)

In [8]:
splits

((#26049) [26310,10505,30679,32520,9169,30975,15283,28455,1688,29727...],
 (#6512) [1500,8126,3549,18958,4257,26991,3518,8588,26729,23538...])

Each one is a list of row indices from our dataframe: the first for training, the second for validation.
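If you're curious how a random 80/20 split of indices can work, here is a small standard-library sketch. It is not the `RandomSplitter` source; `random_split_sketch`, `valid_pct`, and `seed` are names chosen for this illustration.

```python
import random

def random_split_sketch(indices, valid_pct=0.2, seed=None):
    """Shuffle `indices` and cut them into (train, valid) lists."""
    rng = random.Random(seed)
    shuffled = list(indices)
    rng.shuffle(shuffled)
    cut = int(valid_pct * len(shuffled))  # size of the validation part
    return shuffled[cut:], shuffled[:cut]

# 32561 is the number of rows in the dataframe used in this notebook.
train_idx, valid_idx = random_split_sketch(range(32561), seed=42)
print(len(train_idx), len(valid_idx))
```

With 32,561 rows, a 20% cut gives 6,512 validation indices and 26,049 training indices, the same sizes as the two lists shown above.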

Great! So what's next? 

Now we can create a `TabularPandas` object! Think of it like our `TabularList` with a few more parameters. We pass in the `dataframe`, our preprocessor steps (`procs`), our categorical and continuous variables, our `y` variable, and how we want to split our data!

In [0]:
to = TabularPandas(df, procs=procs, cat_names=cat_names, cont_names=cont_names, y_names="salary",
                   splits=splits)

Along with this there is an optional `is_y_cat` argument, which determines whether you want a regression problem or not.

So what is this `TabularPandas` object? Think of it as an enhanced Pandas `DataFrame`! We can use it a bit like a regular one, yet it's already split and ready to be turned into `DataLoaders`!

In [12]:
to.iloc[0:10]

            age  workclass    fnlwgt  ...  age_na  fnlwgt_na  education-num_na
26310  0.032767          5  1.498836  ...       1          1                 1
10505 -0.699518          5 -0.579685  ...       1          1                 1
30679 -1.285347          5  0.179223  ...       1          1                 1
32520  2.595767          7 -0.764931  ...       1          1                 1
9169  -0.113690          5 -1.305196  ...       1          1                 1
30975 -1.578261          6 -1.213851  ...       1          1                 1
15283 -0.699518          5 -0.205346  ...       1          1                 1
28455  0.325681          5 -1.074966  ...       1          1                 1
1688  -1.138890          5 -0.265252  ...       1          1                 1
29727  0.325681          5 -1.254105  ...       1          1                 1

[10 rows x 18 columns]

## DataLoaders

We can create our `DataLoaders` object a few different ways. The first I'll show you is very high-level and relies on defaults. Our `to` object has the training and validation sets in it, so the last step is simply to call `.dataloaders()` on it!

### Method 1: Straight

In [0]:
dls = to.dataloaders()

In [14]:
dls.show_batch()

| | workclass | education | marital-status | occupation | relationship | race | age_na | fnlwgt_na | education-num_na | age | fnlwgt | education-num | salary |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | State-gov | Masters | Married-civ-spouse | Prof-specialty | Husband | White | False | False | False | 41.0 | 36998.995377 | 14.0 | >=50k |
| 1 | Private | 1st-4th | Married-spouse-absent | Farming-fishing | Not-in-family | White | False | False | False | 47.0 | 343578.993892 | 2.0 | <50k |
| 2 | Private | 10th | Never-married | Farming-fishing | Own-child | White | False | False | False | 19.000001 | 304469.005575 | 6.0 | <50k |
| 3 | Self-emp-inc | HS-grad | Separated | Adm-clerical | Not-in-family | White | False | False | False | 27.0 | 233724.000337 | 9.0 | <50k |
| 4 | Private | Bachelors | Never-married | Prof-specialty | Not-in-family | Black | False | False | False | 51.0 | 182186.999699 | 13.0 | <50k |
| 5 | Self-emp-inc | Masters | Married-civ-spouse | Exec-managerial | Husband | White | False | False | False | 51.0 | 304955.000236 | 14.0 | >=50k |
| 6 | Private | 11th | Never-married | Other-service | Other-relative | White | False | False | False | 26.0 | 206599.999843 | 7.0 | <50k |
| 7 | Self-emp-inc | Some-college | Married-civ-spouse | Sales | Husband | White | False | False | False | 49.0 | 431244.998361 | 10.0 | >=50k |
| 8 | Self-emp-not-inc | Bachelors | Married-civ-spouse | Exec-managerial | Husband | White | False | False | False | 60.0 | 282065.998362 | 13.0 | >=50k |
| 9 | Private | 11th | Never-married | Transport-moving | Not-in-family | White | False | False | False | 22.0 | 72309.997199 | 7.0 | <50k |


### Method 2: With Two DataLoaders

We can also create the two `DataLoader`s ourselves (a train and a valid). One great reason to do it this way is that we can pass a different batch size into each `TabDataLoader`, along with changing options like `shuffle` and `drop_last` (at the bottom I'll show why that's **super** cool).

So how do we use it? Our train and validation data live in `to.train` and `to.valid` right now, so we specify that along with our options. When you make a training `DataLoader`, you want both `shuffle` and `drop_last` to be `True`.

In [0]:
trn_dl = TabDataLoader(to.train, bs=64, shuffle=True, drop_last=True)
val_dl = TabDataLoader(to.valid, bs=128)
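To make `shuffle` and `drop_last` concrete, here is a plain-Python sketch of how a loader might group indices into batches. It only illustrates the general semantics of those two options, not how `TabDataLoader` works internally; `batch_indices_sketch` and `seed` are hypothetical names.

```python
import random

def batch_indices_sketch(n_items, bs, shuffle=False, drop_last=False, seed=None):
    """Group the indices 0..n_items-1 into batches of size `bs`."""
    idxs = list(range(n_items))
    if shuffle:
        # Visit the rows in a random order
        random.Random(seed).shuffle(idxs)
    batches = [idxs[i:i + bs] for i in range(0, n_items, bs)]
    if drop_last and batches and len(batches[-1]) < bs:
        # Discard the final, smaller batch so every batch has the same size
        batches.pop()
    return batches

# Same sizes as the train/valid sets and batch sizes used in this notebook
train_batches = batch_indices_sketch(26049, bs=64, shuffle=True, drop_last=True, seed=0)
valid_batches = batch_indices_sketch(6512, bs=128)
print(len(train_batches), len(valid_batches), len(valid_batches[-1]))
```

In this sketch the training side ends up with 407 full batches (the single leftover row is dropped), while the validation side keeps all 51 batches, the last one holding 112 rows, so no validation row is skipped.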

Since our validation dataset is much smaller, we can have a larger batch size here. Now let's create a `DataLoaders`

In [0]:
dls = DataLoaders(trn_dl, val_dl)

In [17]:
dls.show_batch()

| | workclass | education | marital-status | occupation | relationship | race | age_na | fnlwgt_na | education-num_na | age | fnlwgt | education-num | salary |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Private | Bachelors | Never-married | Sales | Own-child | White | False | False | False | 34.0 | 134736.998545 | 13.0 | >=50k |
| 1 | Private | Assoc-acdm | Divorced | Prof-specialty | Unmarried | White | False | False | False | 32.0 | 203673.999285 | 12.0 | <50k |
| 2 | Private | Bachelors | Never-married | Exec-managerial | Own-child | Black | False | False | False | 24.0 | 205844.000652 | 13.0 | <50k |
| 3 | Private | 9th | Never-married | Other-service | Own-child | White | False | False | False | 22.0 | 251072.998172 | 5.0 | <50k |
| 4 | Private | Some-college | Never-married | Adm-clerical | Own-child | Black | False | False | False | 22.0 | 229456.000066 | 10.0 | <50k |
| 5 | ? | Some-college | Widowed | ? | Not-in-family | White | False | False | False | 68.999999 | 628796.98995 | 10.0 | <50k |
| 6 | Private | HS-grad | Divorced | Adm-clerical | Not-in-family | White | False | False | False | 26.0 | 152435.999637 | 9.0 | <50k |
| 7 | Local-gov | Assoc-acdm | Never-married | #na# | Not-in-family | White | False | False | True | 29.0 | 419721.99378 | 10.0 | <50k |
| 8 | Private | Some-college | Divorced | Exec-managerial | Unmarried | White | False | False | False | 33.0 | 268451.000311 | 10.0 | <50k |
| 9 | Private | HS-grad | Married-civ-spouse | Adm-clerical | Husband | White | False | False | False | 45.0 | 153141.001267 | 9.0 | <50k |


## Training

Great! Let's train a model!

We can gather our embedding sizes by calling `get_emb_sz` and passing in our `TabularPandas` object. Each tuple is (number of categories, embedding width).

In [0]:
emb_szs = get_emb_sz(to)

In [19]:
emb_szs

[(10, 6), (17, 8), (8, 5), (16, 8), (7, 5), (6, 4), (2, 2), (2, 2), (3, 3)]
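Where do those widths come from? To my knowledge fastai picks them with a rule of thumb based on the number of categories. The sketch below assumes that rule is `min(600, round(1.6 * n_cat ** 0.56))`; treat the formula and the name `emb_sz_rule_sketch` as assumptions, although it does reproduce the sizes printed above.

```python
def emb_sz_rule_sketch(n_cat):
    """Assumed rule of thumb: embedding width grows slowly with cardinality."""
    return min(600, round(1.6 * n_cat ** 0.56))

# Category counts taken from the first element of each tuple shown above
cardinalities = [10, 17, 8, 16, 7, 6, 2, 2, 3]
emb_szs_sketch = [(n, emb_sz_rule_sketch(n)) for n in cardinalities]
print(emb_szs_sketch)
```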

We can grab the number of continuous variables from the `cont_names` attribute of our `TabularPandas` object as well.

In [20]:
cont_len = len(to.cont_names); cont_len

3

Now that we have these, let's create our model! We'll use a simple `[200, 100]` layer setup like Jeremy has in his lectures. We'll also want our output size to be `2`, as this is binary classification (above or below $50k).

In [0]:
net = TabularModel(emb_szs, cont_len, 2, [200,100])

In [22]:
net

TabularModel(
  (embeds): ModuleList(
    (0): Embedding(10, 6)
    (1): Embedding(17, 8)
    (2): Embedding(8, 5)
    (3): Embedding(16, 8)
    (4): Embedding(7, 5)
    (5): Embedding(6, 4)
    (6): Embedding(2, 2)
    (7): Embedding(2, 2)
    (8): Embedding(3, 3)
  )
  (emb_drop): Dropout(p=0.0, inplace=False)
  (bn_cont): BatchNorm1d(3, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (layers): Sequential(
    (0): LinBnDrop(
      (0): BatchNorm1d(46, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (1): Linear(in_features=46, out_features=200, bias=False)
      (2): ReLU(inplace=True)
    )
    (1): LinBnDrop(
      (0): BatchNorm1d(200, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (1): Linear(in_features=200, out_features=100, bias=False)
      (2): ReLU(inplace=True)
    )
    (2): LinBnDrop(
      (0): Linear(in_features=100, out_features=2, bias=True)
    )
  )
)
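One detail worth checking in that printout is the `46` going into the first `Linear` layer. A quick bit of arithmetic shows it matches the total embedding width plus the continuous variables. The variable names here are just for illustration.

```python
# Embedding sizes and continuous-variable count as printed earlier in this notebook
emb_szs_shown = [(10, 6), (17, 8), (8, 5), (16, 8), (7, 5), (6, 4), (2, 2), (2, 2), (3, 3)]
n_cont = 3

n_emb = sum(width for _, width in emb_szs_shown)  # concatenated embedding outputs
n_in = n_emb + n_cont                             # plus the continuous columns
print(n_emb, n_in)
```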

Now that we know what's going on in the background, don't we have like a `tabular_learner` or something? Yes, yes we do!

In [0]:
learn = tabular_learner(dls, [200,100], metrics=accuracy)

In [24]:
learn.fit(1)

| epoch | train_loss | valid_loss | accuracy | time |
|---|---|---|---|---|
| 0 | 0.362509 | 0.369712 | 0.828471 | 00:06 |


Awesome! We get ~83% accuracy!

## That Cool Bit I Mentioned Earlier

One neat thing we can do now is have labeled test sets, and it's easy to do! Let's create one from the data we already have (in practice you'd want a separate labeled test set here!).

We're going to create a `TabularPandas` object like before (using the whole `DataFrame`, with no splits), and then we can create a `DataLoader` like before too, setting both `shuffle` and `drop_last` to `False`.

In [0]:
to_test = TabularPandas(df, procs, cat_names, cont_names, y_names="salary")
test_dl = TabDataLoader(to_test, shuffle=False, drop_last=False)

And now we can pass in any `DataLoader` right into `learn.get_preds()` **or** `learn.validate()`!

In [29]:
learn.validate(dl=test_dl)

(#2) [0.3529963791370392,0.8352323174476624]

If you're worried about whether it's actually working, let's get our predictions and check them ourselves with `get_preds`.

In [30]:
preds = learn.get_preds(dl=test_dl)

In [31]:
accuracy(preds[0], preds[1])

tensor(0.8352)

You can see that they line up perfectly!
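For reference, an accuracy check like this boils down to taking the most likely class per row and comparing it with the target. Here is a plain-Python sketch on made-up numbers; `accuracy_sketch` and the toy values are hypothetical, and it works on lists rather than tensors.

```python
def accuracy_sketch(probs, targets):
    """Fraction of rows whose highest-scoring class equals the target."""
    # argmax over each row of class scores
    predicted = [max(range(len(row)), key=row.__getitem__) for row in probs]
    correct = sum(int(p == t) for p, t in zip(predicted, targets))
    return correct / len(targets)

# Made-up class probabilities for four rows and their true labels
toy_probs = [[0.9, 0.1], [0.3, 0.7], [0.6, 0.4], [0.2, 0.8]]
toy_targets = [0, 1, 1, 1]
toy_accuracy = accuracy_sketch(toy_probs, toy_targets)
print(toy_accuracy)
```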

Thanks for reading, and I hope you enjoy the v2 library as much as I do :)