# 02 Tabular Classification

In this notebook we will quickly show how to perform binary classification on tabular data.

## Installing the library

First we'll install the library as we did in the previous notebook:

In [1]:
!pip install light-the-torch >> /.tmp
!ltt install torch torchvision >> /.tmp
!pip install fastai nbdev --upgrade >> /.tmp

## Importing the Library and Getting the Data

Then we'll import the `tabular` library:

In [2]:
from fastai.tabular.all import *

We will download the `ADULT_SAMPLE` dataset and load it into a `pandas` `DataFrame`:

In [3]:
path = untar_data(URLs.ADULT_SAMPLE)
df = pd.read_csv(path/'adult.csv')

In [4]:
df.head()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,salary
0,49,Private,101320,Assoc-acdm,12.0,Married-civ-spouse,,Wife,White,Female,0,1902,40,United-States,>=50k
1,44,Private,236746,Masters,14.0,Divorced,Exec-managerial,Not-in-family,White,Male,10520,0,45,United-States,>=50k
2,38,Private,96185,HS-grad,,Divorced,,Unmarried,Black,Female,0,0,32,United-States,<50k
3,38,Self-emp-inc,112847,Prof-school,15.0,Married-civ-spouse,Prof-specialty,Husband,Asian-Pac-Islander,Male,0,0,40,United-States,>=50k
4,42,Self-emp-not-inc,82297,7th-8th,,Married-civ-spouse,Other-service,Wife,Black,Female,0,0,50,United-States,<50k


The goal with this dataset is to predict whether a person's salary is at least 50k (`>=50k`) or below it (`<50k`), given information such as their age, education, and occupation. To work with it in the `fastai` library we need to define the categorical and continuous variables, the preprocessors we want to use, and our y variable.

For tabular problems these preprocessors (or `Transforms`) are `Categorify`, `FillMissing`, and `Normalize`:

In [6]:
cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race']
cont_names = ['age', 'fnlwgt', 'education-num']
procs = [Categorify, FillMissing, Normalize]
y_names = 'salary'
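
To get a feel for what these three preprocessors do, here is a minimal plain-Python sketch applied to two columns from the five rows shown above. The helper names (`categorify`, `fill_missing`, `normalize`) are hypothetical and this is not how `fastai` implements them; it only assumes the usual behaviour of mapping categories to integer codes with a reserved `#na#` entry, filling missing values with the median while recording where they were, and rescaling to zero mean and unit standard deviation:

```python
from statistics import mean, median, pstdev

def categorify(values):
    """Map each category to an integer code, reserving 0 for missing values."""
    vocab = ["#na#"] + sorted({v for v in values if v is not None})
    codes = {name: i for i, name in enumerate(vocab)}
    return [codes[v] if v is not None else 0 for v in values], vocab

def fill_missing(values):
    """Replace missing entries with the median and record where they were."""
    fill = median(v for v in values if v is not None)
    was_missing = [v is None for v in values]
    return [fill if v is None else v for v in values], was_missing

def normalize(values):
    """Rescale a column to zero mean and unit (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

# Toy columns copied from the first five rows of `df.head()`; None marks a missing cell
occupation = [None, "Exec-managerial", None, "Prof-specialty", "Other-service"]
education_num = [12.0, 14.0, None, 15.0, None]

occ_codes, occ_vocab = categorify(occupation)
filled, education_num_na = fill_missing(education_num)
scaled = normalize(filled)

print(occ_vocab)         # category names, with "#na#" first
print(occ_codes)         # integer code for each row
print(education_num_na)  # True where the value had been missing
```

The `education_num_na` flags play the same role as the `education-num_na` column that appears when we show a processed row later in this notebook.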

Next we'll declare how to split the data. Similar to the Pets example, we'll define a splitter and then pass it a range of indices to use:

In [7]:
splits = RandomSplitter(valid_pct=0.2, seed=42)(range_of(df))

In [8]:
splits

((#26049) [14860,30337,20177,23289,27833,28222,10731,775,9068,15458...],
 (#6512) [26459,14308,30129,7933,23325,4624,24726,19755,24006,30826...])
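
Conceptually, a random splitter just shuffles the indices and cuts off a fraction for validation. The sketch below is a plain-Python illustration under that assumption; `random_split` is a hypothetical helper, and because it uses Python's own random number generator it will not reproduce the exact indices above, only the sizes. The total of 32561 rows is taken from the two lengths in the output (26049 + 6512):

```python
import random

def random_split(n_items, valid_pct=0.2, seed=None):
    """Shuffle the indices 0..n_items-1 and hold out valid_pct of them."""
    rng = random.Random(seed)
    indices = list(range(n_items))
    rng.shuffle(indices)
    cut = int(valid_pct * n_items)
    return indices[cut:], indices[:cut]  # (train, valid)

train_idx, valid_idx = random_split(32561, valid_pct=0.2, seed=42)
print(len(train_idx), len(valid_idx))  # → 26049 6512
```

Fixing the seed means the same rows land in the validation set on every run, which keeps results comparable between experiments.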

And now we will load everything into the `TabularDataLoaders` API:

In [10]:
dls = TabularDataLoaders.from_df(df, procs=procs, 
                                 cat_names=cat_names, 
                                 cont_names=cont_names, 
                                 splits=splits, y_names=y_names)

We can also load this into the mid-level `TabularPandas` API:
> This has the benefit of being usable across other libraries; read more [here](https://github.com/muellerzr/Practical-Deep-Learning-for-Coders-2.0/blob/master/Tabular%20Notebooks/02_Ensembling.ipynb).

In [12]:
to = TabularPandas(df, procs=procs, cat_names=cat_names, cont_names=cont_names,
                   y_names=y_names, splits=splits)

And then we can load it into `DataLoaders`:

In [13]:
dls = to.dataloaders(bs=512)

## Making a `Learner`

Similar to our `cnn_learner`, we have a `tabular_learner`, which will generate a fully-connected neural network usable for our data. The sizes of the hidden layers are set by `layers`:

In [14]:
learn = tabular_learner(dls, layers=[200,100], metrics=accuracy)
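
To see how `layers` shapes the network, we can sketch the widths of the successive linear layers in plain Python. This assumes each categorical column gets its own embedding, that the embeddings are concatenated with the continuous columns to form the input, and that the embedding width follows the rule of thumb in `emb_size` below (which, to the best of our knowledge, matches the library's default heuristic). The helper names and the cardinalities are hypothetical, chosen only for illustration:

```python
def emb_size(n_categories):
    """Rule-of-thumb embedding width that grows slowly with cardinality."""
    return min(600, round(1.6 * n_categories ** 0.56))

def layer_dims(cardinalities, n_cont, layers, n_out):
    """Widths of the successive linear layers, input first and output last."""
    n_emb = sum(emb_size(c) for c in cardinalities)
    return [n_emb + n_cont] + list(layers) + [n_out]

# Hypothetical number of distinct values per categorical column (not read from the data)
cardinalities = [10, 17, 8, 16, 7, 6]

# 3 continuous columns, hidden layers of 200 and 100 units, 2 output classes
dims = layer_dims(cardinalities, n_cont=3, layers=[200, 100], n_out=2)
print(dims)
```

Whatever the input width turns out to be, `layers=[200,100]` gives two hidden layers of 200 and 100 units, followed by an output layer with one unit per class.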

To show how easy it is to use new optimizers, we'll go ahead and show an example with the `ranger` optimizer, which combines Lookahead with RAdam:

In [15]:
learn = tabular_learner(dls, layers=[200,100], opt_func=ranger, metrics=accuracy)

And finally we'll train:

In [16]:
learn.fit_flat_cos(5, 4e-3)

epoch,train_loss,valid_loss,accuracy,time
0,0.430128,0.461389,0.814496,00:00
1,0.378331,0.354911,0.835842,00:00
2,0.358553,0.348953,0.837531,00:00
3,0.349968,0.351114,0.838913,00:00
4,0.343157,0.34673,0.839834,00:00
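
The name `fit_flat_cos` describes the learning-rate schedule: the rate stays flat for most of training and then follows a cosine curve down to a very small value. Here is a minimal sketch of such a schedule in plain Python. The function `flat_cos_lr` is hypothetical, and the defaults `pct_start=0.75` and `div_final=1e5` are assumptions about the library's defaults rather than values taken from this notebook:

```python
import math

def flat_cos_lr(progress, lr_max, pct_start=0.75, div_final=1e5):
    """Learning rate at a given fraction of training (progress in [0, 1])."""
    if progress < pct_start:
        return lr_max  # flat phase
    # Fraction of the annealing phase completed so far
    t = (progress - pct_start) / (1 - pct_start)
    lr_min = lr_max / div_final
    # Cosine interpolation from lr_max (t=0) down to lr_min (t=1)
    return lr_min + (lr_max - lr_min) * (1 + math.cos(math.pi * t)) / 2

# Learning rate at a few points of a run with the 4e-3 used above
for progress in (0.0, 0.5, 0.75, 0.9, 1.0):
    print(f"{progress:.2f} -> {flat_cos_lr(progress, 4e-3):.8f}")
```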


## Inference

Just as with the vision example, we can pass in a row and perform inference:

In [18]:
row, idx, probs = learn.predict(df.iloc[0])

The input row:

In [19]:
row.show()

Unnamed: 0,workclass,education,marital-status,occupation,relationship,race,education-num_na,age,fnlwgt,education-num,salary
0,Private,Assoc-acdm,Married-civ-spouse,#na#,Wife,White,False,49.0,101320.001323,12.0,>=50k


The probabilities:

In [20]:
probs

tensor([0.4805, 0.5195])

The class index:

In [21]:
idx

tensor(1)

Which we can convert back to a name easily:

In [23]:
dls.vocab[idx]

'>=50k'
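
The three outputs fit together simply: the class index is the position of the largest probability, and the vocab maps that position back to a label. A small plain-Python sketch using the values printed above, where the class order in `vocab` is an assumption consistent with index 1 being `'>=50k'`:

```python
probs = [0.4805, 0.5195]    # values from the `probs` output above
vocab = ["<50k", ">=50k"]   # assumed class order

# The predicted class is the index of the largest probability
idx = max(range(len(probs)), key=probs.__getitem__)
label = vocab[idx]
print(idx, label)  # → 1 >=50k
```

Note that the two probabilities are close here, so the model is only slightly more confident in `>=50k` than in `<50k` for this row.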