# Tabular models
## Our goal is to build a model that predicts one of two salary categories
## $[$Salary $\lt\$50K$, Salary $\ge \$50K]$ 

## for each adult in the fastai `ADULT_SAMPLE` dataset, given several categorical and continuous variables, which we define below.

In [2]:
from fastai.tabular import *

Tabular data should be in a pandas `DataFrame`.

### Download the data set. 

In [3]:
path = untar_data(URLs.ADULT_SAMPLE)
df = pd.read_csv(path/'adult.csv')

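### We shall predict `salary`, the dependent variable, from nine predictors (independent variables): the six categorical variables in `cat_names` and the three continuous variables in `cont_names`.
#### The preprocessing steps are specified in `procs`: `FillMissing` imputes missing values, `Categorify` encodes the categorical variables as integers, and `Normalize` standardizes the continuous variables.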

In [4]:
dep_var = 'salary'
cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race']
cont_names = ['age', 'fnlwgt', 'education-num']
procs = [FillMissing, Categorify, Normalize]
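
To make the preprocessing steps more concrete, here is a minimal plain-Python sketch of what `Categorify` and `Normalize` do conceptually. It is not fastai's implementation: the helper names `categorify` and `normalize` and the toy values are assumptions for illustration, and fastai's exact conventions (for example, which integer code is reserved for unknown categories) may differ.

```python
from statistics import mean, stdev

def categorify(values):
    """Map each distinct category to an integer code (hypothetical helper)."""
    # Sort the categories so that the mapping is deterministic.
    codes = {cat: i for i, cat in enumerate(sorted(set(values)))}
    return [codes[v] for v in values], codes

def normalize(values):
    """Rescale values to zero mean and unit standard deviation (hypothetical helper)."""
    # Sample standard deviation, which is also the default of pandas' std().
    mu, sigma = mean(values), stdev(values)
    return [(v - mu) / sigma for v in values]

workclass = ['Private', 'State-gov', 'Private', 'Self-emp-inc']  # toy column
ages = [20, 36, 44, 52]                                          # toy column

workclass_codes, mapping = categorify(workclass)
ages_scaled = normalize(ages)

print(mapping)          # category -> integer code
print(workclass_codes)  # integer-encoded column
print(ages_scaled)      # standardized column
```

After normalization the negative values mark below-average ages, which is why the `age` column in a processed batch no longer looks like an age in years.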

### The `test` data consists of rows 800 through 999 of the dataset (200 examples).

In [5]:
test = TabularList.from_df(df.iloc[800:1000].copy(), path=path, cat_names=cat_names, cont_names=cont_names)

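### Create a databunch from the processed training, validation, and test data
#### Rows 800 through 999 are held out as the labeled validation set; the same rows, without labels, are added as the test set.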

In [6]:
data = (TabularList.from_df(df, path=path, cat_names=cat_names, cont_names=cont_names, procs=procs)
                           .split_by_idx(list(range(800,1000)))
                           .label_from_df(cols=dep_var)
                           .add_test(test)
                           .databunch())
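
The index-based split can be pictured with a small plain-Python sketch: rows whose positions appear in the index list form the validation set, and every other row stays in the training set. The helper `split_by_index` is hypothetical and only mimics the idea behind `split_by_idx`; the row count of 1200 is an arbitrary toy value.

```python
def split_by_index(n_rows, valid_idx):
    """Return (training positions, validation positions) for n_rows rows."""
    valid = set(valid_idx)
    train_positions = [i for i in range(n_rows) if i not in valid]
    valid_positions = [i for i in range(n_rows) if i in valid]
    return train_positions, valid_positions

train_positions, valid_positions = split_by_index(1200, range(800, 1000))

print(len(train_positions), len(valid_positions))
print(valid_positions[0], valid_positions[-1])
```

Rows after position 999, if any, remain in the training set, because only the listed positions are held out.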

### Inspect the data
#### Note that an extra predictor, `education-num_na`, was automatically created and added to the data. It is a boolean flag indicating whether the `education-num` field was missing and was therefore imputed (`FillMissing` uses the column median by default).

In [13]:
data.show_batch()

| workclass | education | marital-status | occupation | relationship | race | education-num_na | age | fnlwgt | education-num | target |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Private | Some-college | Never-married | Adm-clerical | Own-child | White | True | -1.4357 | -0.7734 | -0.0312 | <50k |
| Private | HS-grad | Never-married | Machine-op-inspct | Own-child | White | False | -0.9226 | 0.1039 | -0.4224 | <50k |
| Private | Some-college | Never-married | Other-service | Not-in-family | White | False | -1.2891 | 0.9961 | -0.0312 | <50k |
| State-gov | Doctorate | Married-civ-spouse | Prof-specialty | Husband | White | False | 2.1559 | 0.4377 | 2.3157 | >=50k |
| Self-emp-not-inc | Bachelors | Married-civ-spouse | Farming-fishing | Husband | White | False | 0.4701 | -0.384 | 1.1422 | <50k |
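
The `education-num_na` column can be understood through a minimal sketch of median imputation with a missing-value flag. This is an illustration in plain Python, not fastai's code; the helper `fill_missing_with_flag` and the toy values are hypothetical.

```python
from statistics import median

def fill_missing_with_flag(values):
    """Replace None with the median of the observed values and
    return a parallel list of booleans marking the imputed entries."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    filled = [fill if v is None else v for v in values]
    was_missing = [v is None for v in values]
    return filled, was_missing

education_num = [12.0, 14.0, None, 15.0, None, 9.0]  # toy column with gaps
filled, education_num_na = fill_missing_with_flag(education_num)

print(filled)
print(education_num_na)
```

Keeping the flag lets a model learn from the fact that a value was missing, which the imputed number alone would hide.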


### Create and fit a tabular learner model

In [14]:
learn = tabular_learner(data, layers=[200,100], metrics=accuracy)

#### Fit one epoch

In [15]:
learn.fit(1, 1e-2)

| epoch | train_loss | valid_loss | accuracy | time |
| --- | --- | --- | --- | --- |
| 0 | 0.366718 | 0.383264 | 0.82 | 00:58 |


#### Fit 5 more epochs

In [18]:
learn.fit(5, 1e-2)

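| epoch | train_loss | valid_loss | accuracy | time |
| --- | --- | --- | --- | --- |
| 0 | 0.359574 | 0.393915 | 0.815 | 00:58 |
| 1 | 0.363905 | 0.375155 | 0.83 | 00:57 |
| 2 | 0.367971 | 0.38925 | 0.82 | 00:57 |
| 3 | 0.364362 | 0.378294 | 0.845 | 00:56 |
| 4 | 0.356256 | 0.352762 | 0.845 | 00:57 |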
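
The `accuracy` column is the fraction of validation examples whose highest-probability class matches the label. A minimal sketch in plain Python follows; the function `accuracy_from_probs` and the four toy examples are assumptions for illustration, not fastai's implementation.

```python
def accuracy_from_probs(probs, labels):
    """Fraction of rows where the argmax of the probabilities equals the label."""
    predicted = [max(range(len(p)), key=p.__getitem__) for p in probs]
    correct = sum(int(pred == y) for pred, y in zip(predicted, labels))
    return correct / len(labels)

probs = [[0.9, 0.1], [0.3, 0.7], [0.6, 0.4], [0.2, 0.8]]  # toy predictions
labels = [0, 1, 1, 1]                                     # toy ground truth

toy_accuracy = accuracy_from_probs(probs, labels)
print(toy_accuracy)

# With 200 validation rows, each correct prediction is worth 1/200 = 0.005,
# so a reported accuracy of 0.845 corresponds to 169 correct predictions.
n_correct_at_0845 = round(0.845 * 200)
print(n_correct_at_0845)
```

This also explains why the reported accuracies move in steps of 0.005: the validation set holds only 200 rows.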


## Inference
### Our model predicts probabilities for the categories [`Salary < 50,000`, `Salary >= 50,000`]
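
For a two-class classifier like this one, probabilities are typically obtained by applying a softmax to the network's two raw outputs. The sketch below shows that step in plain Python; the function name `softmax` and the logit values are made up for illustration and are not taken from the trained model.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to one."""
    # Subtract the maximum for numerical stability; the result is unchanged.
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [1.2, -0.4]      # hypothetical raw outputs for [<50k, >=50k]
probs = softmax(logits)

print(probs)
print(sum(probs))
```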

### Let's use our model to make predictions for the first 10 data points
#### First look at the data:

In [25]:
rows = df.iloc[0:10]
rows

|  | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | salary |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 49 | Private | 101320 | Assoc-acdm | 12.0 | Married-civ-spouse |  | Wife | White | Female | 0 | 1902 | 40 | United-States | >=50k |
| 1 | 44 | Private | 236746 | Masters | 14.0 | Divorced | Exec-managerial | Not-in-family | White | Male | 10520 | 0 | 45 | United-States | >=50k |
| 2 | 38 | Private | 96185 | HS-grad |  | Divorced |  | Unmarried | Black | Female | 0 | 0 | 32 | United-States | <50k |
| 3 | 38 | Self-emp-inc | 112847 | Prof-school | 15.0 | Married-civ-spouse | Prof-specialty | Husband | Asian-Pac-Islander | Male | 0 | 0 | 40 | United-States | >=50k |
| 4 | 42 | Self-emp-not-inc | 82297 | 7th-8th |  | Married-civ-spouse | Other-service | Wife | Black | Female | 0 | 0 | 50 | United-States | <50k |
| 5 | 20 | Private | 63210 | HS-grad | 9.0 | Never-married | Handlers-cleaners | Own-child | White | Male | 0 | 0 | 15 | United-States | <50k |
| 6 | 49 | Private | 44434 | Some-college | 10.0 | Divorced |  | Other-relative | White | Male | 0 | 0 | 35 | United-States | <50k |
| 7 | 37 | Private | 138940 | 11th | 7.0 | Married-civ-spouse |  | Husband | White | Male | 0 | 0 | 40 | United-States | <50k |
| 8 | 46 | Private | 328216 | HS-grad | 9.0 | Married-civ-spouse | Craft-repair | Husband | White | Male | 0 | 0 | 40 | United-States | >=50k |
| 9 | 36 | Self-emp-inc | 216711 | HS-grad |  | Married-civ-spouse |  | Husband | White | Male | 99999 | 0 | 50 | ? | >=50k |


### Now make the predictions
#### Note that each prediction consists of three objects: a `Category`, a tensor holding the class index (0 or 1) of that `Category`, and a tensor of two elements giving the probability of each class. The probabilities in each tensor should, of course, sum to one.

In [42]:
predictions = [learn.predict(df.iloc[index]) for index in range(10)]
predictions

[(Category <50k, tensor(0), tensor([0.5177, 0.4823])),
 (Category >=50k, tensor(1), tensor([0.4129, 0.5871])),
 (Category <50k, tensor(0), tensor([0.9748, 0.0252])),
 (Category >=50k, tensor(1), tensor([0.1384, 0.8616])),
 (Category <50k, tensor(0), tensor([0.8651, 0.1349])),
 (Category <50k, tensor(0), tensor([0.9974, 0.0026])),
 (Category <50k, tensor(0), tensor([0.9437, 0.0563])),
 (Category <50k, tensor(0), tensor([0.8724, 0.1276])),
 (Category <50k, tensor(0), tensor([0.5859, 0.4141])),
 (Category <50k, tensor(0), tensor([0.6552, 0.3448]))]
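
As a quick sanity check, the printed probabilities can be copied into plain Python lists to confirm that each pair sums to one and that the class index is the position of the larger probability. The variable names are illustrative; the numbers are the rounded values from the output above.

```python
# (class index, [P(<50k), P(>=50k)]) copied from the printed predictions.
printed = [
    (0, [0.5177, 0.4823]),
    (1, [0.4129, 0.5871]),
    (0, [0.9748, 0.0252]),
    (1, [0.1384, 0.8616]),
    (0, [0.8651, 0.1349]),
    (0, [0.9974, 0.0026]),
    (0, [0.9437, 0.0563]),
    (0, [0.8724, 0.1276]),
    (0, [0.5859, 0.4141]),
    (0, [0.6552, 0.3448]),
]

for class_idx, probs in printed:
    # Even after rounding to four decimals, each pair should sum to one.
    assert abs(sum(probs) - 1.0) < 1e-3
    # The predicted class is the index of the larger probability.
    assert class_idx == probs.index(max(probs))

# The confidence of each prediction is the larger of its two probabilities.
confidences = [max(probs) for _, probs in printed]
print(confidences)
```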

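### Note that samples 1 and 3 are predicted to be in the upper salary class and all the others in the lower class. Compared with the true `salary` values shown earlier, samples 0, 8, and 9 are misclassified.
#### Note also that six of the ten predictions are fairly confident (larger probability above 0.85), while samples 0, 1, 8, and 9 are much closer to 0.5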

In [44]:
salary_classes = [predictions[index][0] for index in range(10)]
salary_classes

[Category <50k,
 Category >=50k,
 Category <50k,
 Category >=50k,
 Category <50k,
 Category <50k,
 Category <50k,
 Category <50k,
 Category <50k,
 Category <50k]
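
Comparing these predicted classes with the true `salary` values in the ten rows displayed earlier shows how the model did on this small sample. The lists below are transcribed by hand from the two outputs; the variable names are illustrative.

```python
# True labels from the `salary` column of rows 0 to 9.
actual = ['>=50k', '>=50k', '<50k', '>=50k', '<50k',
          '<50k', '<50k', '<50k', '>=50k', '>=50k']
# Predicted classes from `salary_classes`.
predicted = ['<50k', '>=50k', '<50k', '>=50k', '<50k',
             '<50k', '<50k', '<50k', '<50k', '<50k']

# Positions where the prediction disagrees with the true label.
misses = [i for i, (a, p) in enumerate(zip(actual, predicted)) if a != p]
n_correct = len(actual) - len(misses)

print(f'{n_correct} of {len(actual)} correct; misclassified rows: {misses}')
```

All three misses are rows whose true class is `>=50k`, and they are among the four least confident predictions. Because rows 0 to 9 are part of the training set rather than the held-out rows 800 to 999, this count is not an estimate of how well the model generalizes.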