# Tabular models

In [2]:
from fastai.tabular import *

Tabular data should be in a Pandas `DataFrame`.

In [2]:
path = untar_data(URLs.ADULT_SAMPLE)
df = pd.read_csv(path/'adult.csv')

Downloading http://files.fast.ai/data/examples/adult_sample.tgz


In [3]:
dep_var = 'salary'
cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race']
cont_names = ['age', 'fnlwgt', 'education-num']
procs = [FillMissing, Categorify, Normalize]

`procs` is short for preprocessors: transformations that are applied to the data once, ahead of time, rather than during training.

The fastai library provides several of them. The ones we use here are:

- `FillMissing`: look for missing values and fill them in some way.

- `Categorify`: find the categorical variables and turn them into Pandas categories.

- `Normalize`: normalize the continuous variables ahead of time by subtracting their mean and dividing by their standard deviation, so that each one ends up with mean zero and standard deviation one.
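To make the three steps listed above concrete, here is a minimal sketch in plain Python. It is not the fastai implementation: the toy rows, the helper names (`fill_missing`, `categorify`, `normalize`), the median fill strategy, and the use of the population standard deviation are all assumptions made for illustration. The `_na` flag mirrors the `education-num_na` column that shows up in the `show_batch` output.

```python
from statistics import mean, median, pstdev

# Toy rows standing in for the DataFrame (values are made up for illustration).
rows = [
    {"workclass": "Private", "education-num": 9.0},
    {"workclass": "Local-gov", "education-num": None},
    {"workclass": "Private", "education-num": 13.0},
    {"workclass": "Self-emp", "education-num": 10.0},
]

def fill_missing(rows, col):
    """Fill missing values with the median (an assumed strategy) and flag them."""
    fill = median(r[col] for r in rows if r[col] is not None)
    for r in rows:
        r[col + "_na"] = r[col] is None  # remember which values were missing
        if r[col] is None:
            r[col] = fill
    return fill

def categorify(rows, col):
    """Give every distinct category an integer code, sorted for determinism."""
    categories = sorted({r[col] for r in rows})
    codes = {c: i for i, c in enumerate(categories)}
    for r in rows:
        r[col + "_code"] = codes[r[col]]
    return categories

def normalize(rows, col):
    """Subtract the mean and divide by the (population) standard deviation."""
    values = [r[col] for r in rows]
    mu, sigma = mean(values), pstdev(values)
    for r in rows:
        r[col] = (r[col] - mu) / sigma
    return mu, sigma

fill = fill_missing(rows, "education-num")
categories = categorify(rows, "workclass")
mu, sigma = normalize(rows, "education-num")
```

The order matters in this sketch: missing values are filled before normalizing, otherwise the mean and standard deviation could not be computed over the whole column.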

In [4]:
test = TabularList.from_df(df.iloc[800:1000].copy(), path=path, cat_names=cat_names, cont_names=cont_names)

In [5]:
data = (TabularList.from_df(df, path=path, cat_names=cat_names, cont_names=cont_names, procs=procs)
                           .split_by_idx(list(range(800,1000)))
                           .label_from_df(cols=dep_var)
                           .add_test(test)
                           .databunch())

Here we build the databunch, which bundles the training, validation, and test sets in one object, so the preprocessing is applied consistently across all three. Normalization is a good example: it has to use the mean and standard deviation of the training set only, not statistics computed over the *entire* dataset, and that detail is handled for us.
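The point about training-set statistics can be sketched in a few lines of plain Python. This is only an illustration under assumed toy values, not what fastai runs internally: the ages and the validation indices are made up, and `valid_idx` merely plays the role that `split_by_idx` plays in the cell above.

```python
from statistics import mean, pstdev

# Made-up ages; the last two rows are held out for validation.
ages = [30.0, 50.0, 30.0, 50.0, 70.0, 20.0]
valid_idx = {4, 5}

train = [a for i, a in enumerate(ages) if i not in valid_idx]
valid = [a for i, a in enumerate(ages) if i in valid_idx]

# Statistics are computed on the training rows only...
mu, sigma = mean(train), pstdev(train)

# ...and then reused unchanged for the validation rows.
train_norm = [(a - mu) / sigma for a in train]
valid_norm = [(a - mu) / sigma for a in valid]

# Using the whole dataset instead would leak validation data into the statistics.
leaky_mu = mean(ages)
```

In this toy data the training mean is 40 while the mean over all rows is higher, so normalizing with the full dataset would shift every value and let the held-out rows influence the training inputs.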

In [6]:
data.show_batch(rows=10)

workclass,education,marital-status,occupation,relationship,race,education-num_na,age,fnlwgt,education-num,target
Local-gov,HS-grad,Married-civ-spouse,Craft-repair,Husband,White,False,0.9098,-0.7707,-0.4224,<50k
Local-gov,7th-8th,Divorced,Handlers-cleaners,Not-in-family,White,False,0.3968,-1.5024,-2.3781,<50k
Private,Bachelors,Married-civ-spouse,Prof-specialty,Husband,White,False,2.0826,-0.9516,1.1422,>=50k
Private,Some-college,Divorced,Exec-managerial,Not-in-family,White,False,-0.4828,0.5833,-0.0312,>=50k
Private,HS-grad,Married-civ-spouse,Transport-moving,Husband,Black,False,-0.8493,1.4244,-0.4224,<50k
Private,HS-grad,Married-civ-spouse,Transport-moving,Husband,White,False,0.2502,1.4686,-0.4224,>=50k
Self-emp-not-inc,Some-college,Never-married,Prof-specialty,Own-child,White,False,-1.3624,-0.0137,-0.0312,<50k
Private,Some-college,Never-married,Adm-clerical,Own-child,White,False,-1.1425,-0.4452,-0.0312,<50k
Self-emp-not-inc,HS-grad,Married-civ-spouse,Farming-fishing,Husband,White,False,0.3968,-0.1635,-0.4224,>=50k
Private,HS-grad,Married-civ-spouse,Craft-repair,Husband,White,False,0.8365,-1.4648,-0.4224,<50k


In [7]:
learn = tabular_learner(data, layers=[200,100], metrics=accuracy)

In [3]:
doc(tabular_learner)

In [8]:
learn.fit(1, 1e-2)

epoch,train_loss,valid_loss,accuracy,time
0,0.363733,0.401158,0.78,02:56


## Inference

In [9]:
row = df.iloc[0]

In [15]:
learn.predict(row)

(Category tensor(1), tensor(1), tensor([0.2739, 0.7261]))

In [20]:
data.classes

['<50k', '>=50k']
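Putting the last two outputs together: the tuple returned by `learn.predict` appears to hold the predicted category, its class index, and the per-class probabilities, and `data.classes` gives the label for each index. The sketch below, which copies the values printed above into plain Python lists (an assumption made so it runs without fastai), shows how the index maps back to a label.

```python
# Values copied from the outputs above; plain lists stand in for tensors.
classes = ['<50k', '>=50k']
probs = [0.2739, 0.7261]

# The predicted class is the index with the highest probability.
pred_idx = max(range(len(probs)), key=probs.__getitem__)
pred_label = classes[pred_idx]

print(f"predicted {pred_label} with probability {probs[pred_idx]:.4f}")
```

So for the first row of the DataFrame the model predicts `>=50k`, with a probability of about 0.73.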