# Notebook 3a: Tabular Data

This notebook goes over how the new library handles tabular data through its standard API; notebook 3b will cover using RAPIDS.

First let's install the library again. We won't need Pillow for this notebook, although the install command below still includes it.

In [1]:
!pip install torch torchvision feather-format kornia pyarrow Pillow wandb --upgrade 
!pip install git+https://github.com/fastai/fastprogress  --upgrade
!pip install git+https://github.com/fastai/fastai_dev    


To use the tabular libraries, we need to import the `core` module. We will also need some code borrowed from [notebook 41](github/fastai/fastai_dev/blob/master/dev/41_tabular_model.ipynb) to build our `Learner`.

In [0]:
from fastai2.basics import *
from fastai2.tabular.core import *
from fastai2.tabular.model import *
from fastai2.callback.all import *

We'll be using the ADULT sample `DataFrame` as usual, with our old variable setup

In [2]:
path = untar_data(URLs.ADULT_SAMPLE)
df = pd.read_csv(path/'adult.csv')

In [0]:
cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race']
cont_names = ['age', 'fnlwgt', 'education-num']
procs = [Categorify, FillMissing, Normalize]

Now let's get into the new stuff. Before, in the 1.0 API, we had something like the following to create a `TabularList`:

In [0]:
### DO NOT RUN! JUST FOR SHOW OF HOW THE 1.0 API LOOKED ###
data = (TabularList.from_df(df, path=path, cat_names=cat_names, cont_names=cont_names, procs=procs)
                           .split_by_idx(list(range(800,1000)))
                           .label_from_df(cols=dep_var)
                           .databunch())

Essentially we build our `TabularList`, then choose how to split, then label, then turn it into a `DataBunch`. Quite a convoluted setup there. Let's see how the new API looks and handles it!

We can still use our old procs, but now let's meet `RandomSplitter`. By default it randomly splits our `DataFrame`'s indexes 80/20 into training and validation sets. Calling `RandomSplitter()` returns a function, which we then call on the range of indexes we'd like to split.

We'll use the `range_of` function to grab the range of indexes our `DataFrame` has

In [0]:
splits = RandomSplitter()(range_of(df))

But what is `range_of` doing?

In [5]:
rang = range_of(df); 
print(rang[:10], rang[-10:])

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9] [32551, 32552, 32553, 32554, 32555, 32556, 32557, 32558, 32559, 32560]


And we can see that `RandomSplitter` randomly split our indexes into two lists! (The `#` value at the start of each one is the length of that list.)

In [6]:
splits

((#26049) [3666,10658,4062,19410,13138,31455,9928,15554,29856,3015...],
 (#6512) [24965,9066,5774,32419,7612,10884,4617,26938,5591,28001...])

So `range_of(df)` is simply a list of every row index in our `DataFrame`, and `splits` divides those indexes into a training list and a validation list.
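
To make the splitting step less magical, here is a minimal pure-Python sketch of what a random 80/20 splitter does. The helper name `random_split` and its `seed` argument are made up for illustration, and this is not the library's actual implementation, just the same idea: shuffle the indexes, then cut off the first 20% as validation.

```python
import random

def random_split(indices, valid_pct=0.2, seed=None):
    """Shuffle `indices` and cut them into (train, valid) lists."""
    rng = random.Random(seed)
    shuffled = list(indices)
    rng.shuffle(shuffled)
    cut = int(valid_pct * len(shuffled))  # size of the validation set
    return shuffled[cut:], shuffled[:cut]

n_rows = 32561  # row count implied by the `range_of(df)` output (last index 32560)
train_idx, valid_idx = random_split(range(n_rows), seed=42)
print(len(train_idx), len(valid_idx))  # 26049 6512
```

The two lengths match the `#26049` and `#6512` shown in the `splits` output, which is a good sign that the real splitter cuts the indexes the same way.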

Great! So what's next? 

Now we can create a `TabularPandas` object! Think of it like our `TabularList` with a few more parameters. We pass in the `DataFrame`, our preprocessing steps (`procs`), our categorical and continuous variable names, our `y` variable, and how we want to split our data!

In [0]:
help(TabularPandas)  # show the constructor signature and docstring

In [0]:
to = TabularPandas(df, procs=procs, cat_names=cat_names, cont_names=cont_names, y_names="salary",
                   splits=splits, block_y=CategoryBlock)

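Along with this there is an optional `block_y`, which determines whether the problem is treated as classification or regression. Since we are predicting categories, we pass `CategoryBlock`. If we wanted to do regression we would use `type_y=Float` (note that this argument name differs from the `block_y` used above, so check the signature in your installed version).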

So what is this `TabularPandas` object? Think of it as an enhanced Pandas `DataFrame`! We can use it much like a regular one, yet it's already split and ready to be turned into a `DataBunch`!

In [0]:
to.head()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,salary,age_na,fnlwgt_na,education-num_na
9386,0.104522,6,1.469168,15,1.928576,3,11,1,5,Male,0,0,50,United-States,1,1,1,1
14109,0.691071,5,-0.323401,16,-0.029405,5,4,2,3,Female,0,0,40,United-States,0,1,1,1
10181,1.204301,5,-0.129143,11,2.320172,3,11,1,5,Male,0,1977,55,?,1,1,1,1
7992,1.27762,5,0.727911,10,1.145383,5,11,2,5,Female,0,0,45,Mexico,0,1,1,1
22718,1.204301,3,-1.503648,10,1.145383,7,11,2,5,Female,0,0,40,United-States,1,1,1,1
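
Notice that the categorical columns are now integer codes and the continuous columns are centred around zero. Here is a rough pure-Python sketch of the idea behind `Categorify` and `Normalize`. The helper names are made up, reserving code 0 for missing values and using the population standard deviation are assumptions, and the real procs do more (for example, they fit their statistics on the training split).

```python
from statistics import mean, pstdev

def categorify(values, na_token="#na#"):
    """Map each category to an integer code; code 0 is kept for missing values."""
    vocab = [na_token] + sorted({v for v in values if v is not None})
    lookup = {v: i for i, v in enumerate(vocab)}
    return [lookup.get(v, 0) for v in values], vocab

def normalize(values):
    """Rescale values to zero mean and unit standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

codes, vocab = categorify(["Private", "State-gov", None, "Private"])
print(vocab)  # ['#na#', 'Private', 'State-gov']
print(codes)  # [1, 2, 0, 1]

scaled = normalize([20.0, 30.0, 40.0])
print(scaled)  # symmetric around 0.0
```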


## DataBunch

We can create our `DataBunch` object a few different ways. The first I'll show you is very high-level and relies on the defaults. Our `to` object already holds the training and validation sets, so the last step is to simply call `.databunch()` on it!

### Method 1: Straight

In [0]:
dbunch = to.databunch()

In [9]:
dbunch.show_batch()

Unnamed: 0,workclass,education,marital-status,occupation,relationship,race,age_na,fnlwgt_na,education-num_na,age,fnlwgt,education-num,salary
0,Private,9th,Married-civ-spouse,Transport-moving,Husband,White,False,False,False,37.0,278632.001158,5.0,<50k
1,Private,Bachelors,Married-civ-spouse,Prof-specialty,Husband,White,False,False,False,30.0,159187.001329,13.0,>=50k
2,Local-gov,Bachelors,Married-civ-spouse,Prof-specialty,Husband,White,False,False,False,44.0,203760.999509,13.0,>=50k
3,Private,Some-college,Never-married,Sales,Own-child,White,False,False,False,19.0,343199.996435,10.0,<50k
4,Private,Bachelors,Never-married,#na#,Own-child,White,False,False,False,22.0,103761.99866,13.0,<50k
5,Private,9th,Married-civ-spouse,Transport-moving,Husband,White,False,False,False,32.0,217459.999192,5.0,<50k
6,Private,HS-grad,Never-married,Craft-repair,Not-in-family,White,False,False,False,42.0,397345.998114,9.0,<50k
7,State-gov,Bachelors,Divorced,Protective-serv,Not-in-family,White,False,False,False,45.0,271962.003024,13.0,<50k
8,Private,Some-college,Married-civ-spouse,Exec-managerial,Husband,White,False,False,False,43.0,273230.002524,10.0,>=50k
9,Federal-gov,Some-college,Married-civ-spouse,Craft-repair,Husband,White,False,False,False,47.0,168190.999857,10.0,<50k


### Method 2: With Two DataLoaders

We can also create our `DataLoaders` ourselves (one for training and one for validation). One great reason to do it this way is that we can pass a different batch size to each `TabDataLoader`, along with changing options like `shuffle` and `drop_last` (at the bottom I'll show why that's **super** cool)

So how do we use it? Our training and validation data live in `to.train` and `to.valid` right now, so we pass those in along with our options. When you make a training `DataLoader`, you want both `shuffle` and `drop_last` to be `True`

In [0]:
trn_dl = TabDataLoader(to.train, bs=64, shuffle=True, drop_last=True)
val_dl = TabDataLoader(to.valid, bs=128)
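
If `shuffle` and `drop_last` feel abstract, here is a small pure-Python sketch of what those options mean for batching. `iter_batches` is a made-up helper, not part of the library; it only yields lists of row indexes, using the split sizes we saw earlier.

```python
import random

def iter_batches(n_items, bs, shuffle=False, drop_last=False, seed=None):
    """Yield lists of indexes, mimicking the `bs`/`shuffle`/`drop_last` options."""
    idxs = list(range(n_items))
    if shuffle:
        random.Random(seed).shuffle(idxs)
    for start in range(0, n_items, bs):
        batch = idxs[start:start + bs]
        if drop_last and len(batch) < bs:
            break  # discard the final, incomplete batch
        yield batch

train_batches = list(iter_batches(26049, bs=64, shuffle=True, drop_last=True, seed=0))
valid_batches = list(iter_batches(6512, bs=128))
print(len(train_batches), len(valid_batches))  # 407 51
print(len(valid_batches[-1]))                  # 112
```

With 26,049 training rows and a batch size of 64, the leftover batch would hold a single row, which batch norm layers generally can't handle in training mode. That is one practical reason to drop it. For validation we keep every row, so the last batch is simply smaller.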

Validation doesn't need to store gradients, so we can afford a larger batch size there. Now let's create a `DataBunch`

In [0]:
dbunch = DataBunch(trn_dl, val_dl)

In [12]:
dbunch.show_batch()

Unnamed: 0,workclass,education,marital-status,occupation,relationship,race,age_na,fnlwgt_na,education-num_na,age,fnlwgt,education-num,salary
0,Private,Assoc-acdm,Never-married,Craft-repair,Not-in-family,White,False,False,False,39.0,114078.998347,12.0,<50k
1,Private,HS-grad,Married-civ-spouse,Other-service,Husband,White,False,False,False,34.0,261418.002062,9.0,<50k
2,Private,HS-grad,Never-married,Machine-op-inspct,Not-in-family,White,False,False,False,22.0,324921.995714,9.0,<50k
3,Self-emp-not-inc,Some-college,Married-civ-spouse,Prof-specialty,Wife,White,False,False,False,51.0,268638.999576,10.0,<50k
4,Private,11th,Married-civ-spouse,Machine-op-inspct,Husband,White,False,False,False,32.0,195576.000038,7.0,<50k
5,Private,HS-grad,Never-married,Handlers-cleaners,Own-child,White,False,False,False,20.0,451995.998245,9.0,<50k
6,Private,Bachelors,Married-civ-spouse,Exec-managerial,Husband,Black,False,False,False,41.0,118618.998681,13.0,<50k
7,Private,HS-grad,Married-civ-spouse,Machine-op-inspct,Husband,Black,False,False,False,37.0,178136.000088,9.0,<50k
8,Private,HS-grad,Married-spouse-absent,Craft-repair,Not-in-family,White,False,False,False,29.0,308943.995217,9.0,<50k
9,Self-emp-inc,Bachelors,Married-civ-spouse,Exec-managerial,Husband,White,False,False,False,45.0,188329.999952,13.0,<50k


# Training

Great! Let's train a model. 

## Building the model

Eventually something similar to `tabular_learner` will appear, but for the time being we need to build the model ourselves. We do this by calling `TabularModel` and passing in the embedding sizes, how many continuous variables we have, our number of outputs, and our layer sizes.

We can get the embedding sizes by calling `get_emb_sz` and passing in our `TabularPandas` object. Each tuple is (number of categories, embedding width).

In [0]:
emb_szs = get_emb_sz(to)

In [14]:
emb_szs

[(10, 6), (17, 8), (8, 5), (16, 8), (7, 5), (6, 4), (2, 2), (2, 2), (3, 3)]
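
Where do those widths come from? As far as I can tell they follow a simple rule of thumb based on the number of categories. The sketch below is my reading of that heuristic, so treat the exact formula as an assumption rather than the library's guaranteed behaviour; it does reproduce every pair shown above.

```python
def emb_sz_rule(n_cat):
    """Heuristic embedding width for a categorical variable with `n_cat` levels."""
    return min(600, round(1.6 * n_cat ** 0.56))

# Cardinalities taken from the first element of each tuple in `emb_szs`
cardinalities = [10, 17, 8, 16, 7, 6, 2, 2, 3]
sizes = [(n, emb_sz_rule(n)) for n in cardinalities]
print(sizes)
# [(10, 6), (17, 8), (8, 5), (16, 8), (7, 5), (6, 4), (2, 2), (2, 2), (3, 3)]
```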

We can get our number of continuous variables by taking the length of the `cont_names` attribute of our `TabularPandas` object

In [15]:
cont_len = len(to.cont_names); cont_len

3

Now that we have these, let's create our model! We'll use a simple `[200, 100]` layer setup like Jeremy has in his lectures. We'll also want our number of outputs to be `2`, as this is binary classification (above or below $50k)

In [0]:
net = TabularModel(emb_szs, cont_len, 2, [200,100])

In [17]:
net

TabularModel(
  (embeds): ModuleList(
    (0): Embedding(10, 6)
    (1): Embedding(17, 8)
    (2): Embedding(8, 5)
    (3): Embedding(16, 8)
    (4): Embedding(7, 5)
    (5): Embedding(6, 4)
    (6): Embedding(2, 2)
    (7): Embedding(2, 2)
    (8): Embedding(3, 3)
  )
  (emb_drop): Dropout(p=0.0, inplace=False)
  (bn_cont): BatchNorm1d(3, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (layers): Sequential(
    (0): LinBnDrop(
      (0): BatchNorm1d(46, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (1): Linear(in_features=46, out_features=200, bias=False)
      (2): ReLU(inplace=True)
    )
    (1): LinBnDrop(
      (0): BatchNorm1d(200, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (1): Linear(in_features=200, out_features=100, bias=False)
      (2): ReLU(inplace=True)
    )
    (2): LinBnDrop(
      (0): Linear(in_features=100, out_features=2, bias=True)
    )
  )
)
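
If you're wondering where `in_features=46` in the first linear layer comes from, it's just the embedding widths added up plus the continuous variables. A quick check, using the values printed earlier:

```python
emb_szs = [(10, 6), (17, 8), (8, 5), (16, 8), (7, 5), (6, 4), (2, 2), (2, 2), (3, 3)]
n_cont = 3

n_emb = sum(width for _, width in emb_szs)  # total width of all embeddings
in_features = n_emb + n_cont                # embeddings are concatenated with the continuous inputs
print(n_emb, in_features)  # 43 46
```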

Now let's create an optimizer instance, our `Learner` object, and start training!

In [0]:
opt_func = partial(Adam, wd=0.01, eps=1e-5)
learn = Learner(dbunch, net, CrossEntropyLossFlat(), opt_func=opt_func, metrics=accuracy)

In [22]:
learn.fit_one_cycle(1)

epoch,train_loss,valid_loss,accuracy,time
0,0.35218,0.352489,0.841523,00:08


Awesome! We get ~84% accuracy! We can call `learn.show_results` to take a look at a `DataFrame` that shows our predictions (something new!)

And if you look at *how* the predictions are displayed, it looks just like adding a column to a `DataFrame`!

# That Cool Bit I Mentioned Earlier

One neat thing we can do now is have labeled test sets, and it's easy to do! Let's create one from data we already have (in practice you'd want a separate labeled test set that the model has never seen).

We're going to create a `TabularPandas` object like before, this time using the whole `DataFrame` without any splits, and then create a `DataLoader` from it, setting both `shuffle` and `drop_last` to `False`

In [0]:
to_test = TabularPandas(df, procs, cat_names, cont_names, y_names="salary")
test_dl = TabDataLoader(to_test, shuffle=False, drop_last=False)

And now we can pass in any `DataLoader` right into `learn.get_preds()` **or** `learn.validate()`!

In [27]:
learn.validate(dl=test_dl)

(#2) [0.34851083159446716,0.837136447429657]

If you're worried about whether it's actually working, let's get our predictions and check them ourselves with `get_preds`

In [28]:
preds = learn.get_preds(dl=test_dl)

In [29]:
accuracy(preds[0], preds[1])

tensor(0.8371)

You can see that they line up perfectly!
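
For the curious, here is a tiny pure-Python sketch of what an accuracy check like this boils down to: pick the highest-scoring class in each row and compare it with the target. The helper name and the toy numbers are made up for illustration; they are not outputs from the model above.

```python
def accuracy_from_scores(scores, targets):
    """Fraction of rows whose highest-scoring class equals the target."""
    predicted = [max(range(len(row)), key=row.__getitem__) for row in scores]
    correct = sum(p == t for p, t in zip(predicted, targets))
    return correct / len(targets)

# Toy values, not taken from the notebook's model
scores = [[0.9, 0.1], [0.3, 0.7], [0.6, 0.4], [0.2, 0.8]]
targets = [0, 1, 1, 1]
acc = accuracy_from_scores(scores, targets)
print(acc)  # 0.75
```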

Thanks for reading, and I hope you enjoy the v2 library as much as I do :)