This post is a tutorial on working with tabular data using FastAI. One of FastAI's biggest contributions to working with tabular data is how easy it makes using embeddings for categorical variables. I have found that embeddings for categorical variables produce significantly better models than the alternatives (e.g. one-hot encoding), and that the combination of embeddings and neural networks reaches very high performance on tabular data.
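
To make the contrast between the two encodings concrete, here is a minimal sketch in plain Python (no FastAI). The category list, the embedding width of 2, and the random initial vectors are all illustrative assumptions; in a real model the embedding vectors are learned during training.

```python
import random

# Hypothetical category vocabulary, for illustration only.
categories = ["Private", "State-gov", "Self-emp-not-inc", "Federal-gov"]
index_of = {cat: i for i, cat in enumerate(categories)}

def one_hot(cat):
    """One-hot: a vector as wide as the vocabulary, with a single 1."""
    vec = [0.0] * len(categories)
    vec[index_of[cat]] = 1.0
    return vec

# Embedding: a small table with one dense row per category.
# The values are random here; a network would learn them.
rng = random.Random(0)
embedding_dim = 2
embedding_table = [[rng.uniform(-1, 1) for _ in range(embedding_dim)]
                   for _ in categories]

def embed(cat):
    """Embedding lookup: pick the row that belongs to this category."""
    return embedding_table[index_of[cat]]

print(one_hot("State-gov"))  # width grows with the number of categories
print(embed("State-gov"))    # width stays fixed at embedding_dim
```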

In [1]:
from fastai.tabular.all import *
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

We'll use the [UCI Adult Data Set](https://archive.ics.uci.edu/ml/datasets/Adult), where the task is to predict whether a person makes over 50k a year. FastAI makes downloading the dataset easy.

In [2]:
path = untar_data(URLs.ADULT_SAMPLE)

Once it's downloaded we can load it into a DataFrame.

In [3]:
df = pd.read_csv(path/'adult.csv')

Machine learning practitioners often work with datasets that have already been split into train and test sets. In this case we have all of the data, but I am going to split it into train and test sets to simulate a pre-defined split.

## Part I

In [4]:
train_df, test_df = train_test_split(df, random_state=42)

Let's take a look at the data.

In [5]:
train_df.head(10)

| | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | salary |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 29 | 42 | Private | 70055 | 11th | 7.0 | Married-civ-spouse | | Husband | White | Male | 0 | 0 | 45 | United-States | <50k |
| 12181 | 25 | Private | 253267 | Some-college | 10.0 | Married-civ-spouse | Adm-clerical | Husband | Black | Male | 0 | 1902 | 36 | United-States | >=50k |
| 18114 | 53 | Self-emp-not-inc | 145419 | 1st-4th | 2.0 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 7688 | 0 | 67 | Italy | >=50k |
| 4278 | 37 | State-gov | 354929 | Assoc-acdm | 12.0 | Divorced | Protective-serv | Not-in-family | Black | Male | 0 | 0 | 38 | United-States | <50k |
| 12050 | 25 | Private | 404616 | Masters | 14.0 | Married-civ-spouse | Farming-fishing | Not-in-family | White | Male | 0 | 0 | 99 | United-States | >=50k |
| 14371 | 20 | Private | 303565 | Some-college | 10.0 | Never-married | Handlers-cleaners | Own-child | Black | Male | 0 | 0 | 40 | Germany | <50k |
| 32541 | 24 | Private | 241857 | Some-college | 10.0 | Never-married | Adm-clerical | Not-in-family | Black | Female | 0 | 0 | 35 | United-States | <50k |
| 3362 | 48 | Private | 398843 | Some-college | 10.0 | Separated | Sales | Unmarried | Black | Female | 0 | 0 | 35 | United-States | <50k |
| 19009 | 46 | Private | 109227 | Some-college | 10.0 | Divorced | Exec-managerial | Unmarried | White | Female | 0 | 0 | 70 | United-States | <50k |
| 16041 | 26 | Private | 171114 | Bachelors | 13.0 | Never-married | Exec-managerial | Own-child | White | Female | 0 | 0 | 40 | United-States | <50k |


In [55]:
train_df.describe()

| | age | fnlwgt | education-num | capital-gain | capital-loss | hours-per-week |
|---|---|---|---|---|---|---|
| count | 24420.0 | 24420.0 | 24057.0 | 24420.0 | 24420.0 | 24420.0 |
| mean | 38.578911 | 189536.7 | 10.058361 | 1066.490254 | 86.502457 | 40.393366 |
| std | 13.69662 | 104313.5 | 2.580948 | 7243.366967 | 400.848415 | 12.380526 |
| min | 17.0 | 12285.0 | 1.0 | 0.0 | 0.0 | 1.0 |
| 25% | 28.0 | 118305.2 | 9.0 | 0.0 | 0.0 | 40.0 |
| 50% | 37.0 | 178482.5 | 10.0 | 0.0 | 0.0 | 40.0 |
| 75% | 48.0 | 236642.0 | 12.0 | 0.0 | 0.0 | 45.0 |
| max | 90.0 | 1455435.0 | 16.0 | 99999.0 | 4356.0 | 99.0 |


In [56]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 24420 entries, 29 to 23654
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   age             24420 non-null  int64  
 1   workclass       24420 non-null  object 
 2   fnlwgt          24420 non-null  int64  
 3   education       24420 non-null  object 
 4   education-num   24057 non-null  float64
 5   marital-status  24420 non-null  object 
 6   occupation      24031 non-null  object 
 7   relationship    24420 non-null  object 
 8   race            24420 non-null  object 
 9   sex             24420 non-null  object 
 10  capital-gain    24420 non-null  int64  
 11  capital-loss    24420 non-null  int64  
 12  hours-per-week  24420 non-null  int64  
 13  native-country  24420 non-null  object 
 14  salary          24420 non-null  object 
dtypes: float64(1), int64(5), object(9)
memory usage: 3.0+ MB


In [57]:
train_df['salary'].value_counts()

<50k     18537
>=50k     5883
Name: salary, dtype: int64

The first thing to note is that there is missing data: `education-num` and `occupation` both have fewer non-null values than there are rows. We'll have to deal with that; fortunately FastAI has tools that make this easy. Also, we have both continuous and categorical data. We'll split those apart so we can put the categorical data through embeddings. Finally, the data is imbalanced (roughly 76% `<50k` to 24% `>=50k`), but not so much that we need to directly compensate. The network should be able to deal with this.

In [6]:
continuous_vars, categorical_vars = cont_cat_split(train_df)

The `cont_cat_split` function usually works well, but I always double check the results to see that they make sense.

In [7]:
train_df[continuous_vars].nunique()

age                  72
fnlwgt            17545
education-num        16
capital-gain        116
capital-loss         90
hours-per-week       93
dtype: int64

In [8]:
train_df[categorical_vars].nunique()

workclass          9
education         16
marital-status     7
occupation        15
relationship       6
race               5
sex                2
native-country    41
salary             2
dtype: int64
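
For intuition about what is being double-checked here, the following is a simplified, pure-Python sketch of the kind of heuristic `cont_cat_split` applies: float columns are treated as continuous, integer columns are continuous only when they have many distinct values, and everything else is categorical. The helper name `split_columns`, the toy columns, and the threshold of 20 are assumptions for illustration, not FastAI's actual implementation.

```python
def split_columns(columns, max_card=20):
    """Split column names into (continuous, categorical) lists.

    `columns` maps a column name to a list of its values (None = missing).
    """
    continuous, categorical = [], []
    for name, values in columns.items():
        present = [v for v in values if v is not None]
        all_float = all(isinstance(v, float) for v in present)
        all_int = all(isinstance(v, int) and not isinstance(v, bool)
                      for v in present)
        if all_float or (all_int and len(set(present)) > max_card):
            continuous.append(name)
        else:
            categorical.append(name)
    return continuous, categorical

toy = {
    "age": list(range(17, 60)),           # integers, many distinct values
    "education-num": [7.0, 10.0, None],   # floats, with a missing value
    "sex": ["Male", "Female", "Male"],    # strings
    "salary": ["<50k", ">=50k", "<50k"],  # the target is a string column too
}
continuous, categorical = split_columns(toy)
print(continuous)
print(categorical)
```

A string-valued target such as `salary` ends up in the categorical list under this kind of rule, which is one reason the result is worth inspecting.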

Note that the variable we're trying to predict, `salary`, is in the list of categorical variables. That's fine, we'll just need to tell `TabularPandas` that it's the `y_names` variable.

In [9]:
y_names = 'salary'

Let's think about the data. One thing that sticks out to me is that `native-country` has 41 unique values in the train set. This means there's a good chance a new `native-country` will show up in the test set (or after we deploy the model!). This will be a problem if we use embeddings. There are ways to deal with unknown categories and embeddings, but it's easiest to simply drop the variable.

In [10]:
categorical_vars.remove('native-country')

In [11]:
categorical_vars

['workclass',
 'education',
 'marital-status',
 'occupation',
 'relationship',
 'race',
 'sex',
 'salary']
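
Whether a column is likely to contain unseen values can be checked by comparing the sets of values in the two splits. A minimal sketch in plain Python follows; the two value lists are made up for illustration, and with the DataFrames above one would compare `train_df['native-country']` and `test_df['native-country']` in the same way.

```python
# Made-up values standing in for one categorical column in each split.
train_values = ["United-States", "Italy", "Germany", "United-States"]
test_values = ["United-States", "Mexico", "Germany"]

# Categories that appear in the test split but never in the train split.
unseen = sorted(set(test_values) - set(train_values))
print(unseen)
```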

Now we need to decide what preprocessing we need to do. We noted there is missing data, so we'll need to use `FillMissing` to clean that up. Also, we should always `Normalize` the data. Finally, we'll use `Categorify` to transform the categorical variables to be similar to `pd.Categorical`.

In [12]:
preprocessing = [Categorify, FillMissing, Normalize]
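
As a rough mental model of these three steps, here is a pure-Python sketch that applies each one to a toy column. It assumes the usual behavior (integer codes with 0 reserved for missing values, median filling plus a "was missing" flag, and standardization to zero mean and unit variance); the function names are hypothetical and FastAI's own implementations differ in detail.

```python
from statistics import mean, median, pstdev

def categorify(values):
    """Map each category to an integer code; 0 is reserved for missing."""
    vocab = sorted({v for v in values if v is not None})
    code = {cat: i + 1 for i, cat in enumerate(vocab)}
    return [code.get(v, 0) for v in values]

def fill_missing(values):
    """Replace None with the median and record which entries were missing."""
    fill = median(v for v in values if v is not None)
    was_missing = [v is None for v in values]
    return [fill if v is None else v for v in values], was_missing

def normalize(values):
    """Shift to zero mean and scale to unit (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

occupation = ["Sales", None, "Adm-clerical", "Sales"]
education_num = [7.0, None, 10.0, 13.0]

codes = categorify(occupation)
filled, was_missing = fill_missing(education_num)
scaled = normalize(filled)
print(codes)
print(filled, was_missing)
print(scaled)
```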

We've already split our data because we're simulating that it's already been split for us. But we will still need to pass a splitter to `TabularPandas`, so we'll make one that puts everything in the train set and nothing in the validation set.

In [13]:
def no_split(obj):
    """
    Put everything in the train set
    """
    return list(range(len(obj))), []

In [14]:
splits = no_split(range_of(train_df))

Now we need to create a `TabularPandas` for our data. A `TabularPandas` is a wrapper for a pandas DataFrame where the continuous, categorical, and dependent variables are known. FastAI uses lots of inheritance, and the inheritances aren't always intuitive to me, so it's good to look at the method resolution order to get a sense of what the class is supposed to do. You can do so like this:

In [15]:
TabularPandas.__mro__

(fastai.tabular.core.TabularPandas,
 fastai.tabular.core.Tabular,
 fastcore.foundation.CollBase,
 fastcore.basics.GetAttr,
 fastai.data.core.FilteredBase,
 object)

In [16]:
df_wrapper = TabularPandas(train_df, procs=preprocessing, cat_names=categorical_vars, cont_names=continuous_vars,
                   y_names=y_names, splits=splits)

Let's look at some examples to make sure they look right. All the data should be ready for deep learning.

In [17]:
df_wrapper.train.xs.iloc[:5]

| | workclass | education | marital-status | occupation | relationship | race | sex | salary | education-num_na | age | fnlwgt | education-num | capital-gain | capital-loss | hours-per-week |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 29 | 5 | 2 | 3 | 0 | 1 | 5 | 2 | 0 | 1 | 0.249781 | -1.145433 | -1.193564 | -0.14724 | -0.215803 | 0.372095 |
| 12181 | 5 | 16 | 3 | 2 | 1 | 3 | 2 | 1 | 1 | -0.991426 | 0.610962 | -0.022444 | -0.14724 | 4.52923 | -0.354868 |
| 18114 | 7 | 4 | 3 | 5 | 1 | 5 | 2 | 1 | 1 | 1.052916 | -0.422942 | -3.145431 | 0.914167 | -0.215803 | 2.149115 |
| 4278 | 8 | 8 | 1 | 12 | 2 | 3 | 2 | 0 | 1 | -0.11528 | 1.585564 | 0.758303 | -0.14724 | -0.215803 | -0.193321 |
| 12050 | 5 | 13 | 3 | 6 | 2 | 5 | 2 | 1 | 1 | -0.991426 | 2.061897 | 1.539049 | -0.14724 | -0.215803 | 4.733873 |


We can see that the continuous variables are all normalized. This looks good! Now let's create the `DataLoaders`.

In [18]:
dls = df_wrapper.dataloaders(bs=128)

Let's look at our data to make sure it looks right.

In [34]:
batch = next(iter(dls.train))

We are expecting three objects in each batch: the categorical variables, the continuous variables, and the labels. Let's take a look.

In [36]:
len(batch)

3

In [37]:
cat_vars, cont_vars, labels = batch

In [38]:
cat_vars[:5]

tensor([[ 5, 12,  1,  9,  2,  5,  2,  1,  1],
        [ 5,  5,  5, 13,  4,  2,  1,  0,  1],
        [ 5,  9,  5, 14,  4,  5,  2,  0,  1],
        [ 1, 15,  3,  1,  1,  5,  2,  1,  1],
        [ 3, 15,  1, 11,  5,  3,  1,  0,  1]])

In [39]:
cont_vars[:5]

tensor([[-0.6994, -0.9743, -0.4128,  1.3052, -0.2158, -0.0318],
        [-0.8454, -0.7565, -2.7551, -0.1472, -0.2158, -1.6472],
        [-0.6994, -0.3367,  0.3679, -0.1472, -0.2158, -0.0318],
        [ 2.0751, -0.3081,  1.9294,  0.7388, -0.2158, -2.4550],
        [ 0.4688, -0.5462,  1.9294, -0.1472,  4.0902, -0.0318]])

In [40]:
labels[:5]

tensor([[1],
        [0],
        [0],
        [1],
        [0]], dtype=torch.int8)

Looks good!

Now we make a learner. This data isn't very complex so we'll use a relatively small model for it.

In [19]:
learn = tabular_learner(dls, layers=[20,10], metrics=accuracy)

Let's fit the model.

In [21]:
learn.fit(4, 1e-2)

| epoch | train_loss | valid_loss | accuracy | time |
|---|---|---|---|---|
| 0 | 0.006316 | | | 00:01 |
| 1 | 0.00022 | | | 00:01 |
| 2 | 2.8e-05 | | | 00:01 |
| 3 | 9e-06 | | | 00:01 |


  warn("Your generator is empty.")


We didn't get any `valid_loss` or `accuracy` because we didn't pass a validation set. Normally we would use a validation set, but in this case we're treating the held-out data as a test set instead.

Now we can save the model.

In [None]:
learn.save('my_tabular_model')

## Part II

To fully simulate this being a separate test set, I'm going to reload the model from the saved weights. Note that we would normally have to create a `learn` object before loading the weights. In this case we'll reuse the same `learn` as before.

In [22]:
learn.load('my_tabular_model')

<fastai.tabular.learner.TabularLearner at 0x7f9828f20820>

Let's look at the model and make sure it loaded correctly.

In [23]:
learn.summary()

TabularModel (Input shape: 128 x 9)
Layer (type)         Output Shape         Param #    Trainable 
                     128 x 6             
Embedding                                 60         True      
____________________________________________________________________________
                     128 x 8             
Embedding                                 136        True      
____________________________________________________________________________
                     128 x 5             
Embedding                                 40         True      
____________________________________________________________________________
                     128 x 8             
Embedding                                 128        True      
____________________________________________________________________________
                     128 x 5             
Embedding                                 35         True      
______________________________________________________________
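
The parameter counts in the embedding rows can be sanity-checked by hand: an embedding table has one row per category code and one column per embedding dimension. The sketch below assumes that the summary lists the embeddings in the same order as `categorical_vars`, and that each column gets one extra code on top of the unique values counted earlier (for missing or unknown values). Both are assumptions, but together they reproduce the five counts visible in the summary.

```python
# (column, unique values in train, embedding width, params reported by summary)
embeddings = [
    ("workclass",       9, 6,  60),
    ("education",      16, 8, 136),
    ("marital-status",  7, 5,  40),
    ("occupation",     15, 8, 128),
    ("relationship",    6, 5,  35),
]

for name, n_unique, width, reported in embeddings:
    rows = n_unique + 1  # assumed: one extra code for missing/unknown values
    print(f"{name}: {rows} x {width} = {rows * width} (summary: {reported})")
```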

Looks good. Let's look at the test data.

In [25]:
test_df.head()

| | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | salary |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 14160 | 30 | Private | 81282 | HS-grad | 9.0 | Never-married | Other-service | Unmarried | White | Female | 0 | 0 | 40 | United-States | <50k |
| 27048 | 38 | Federal-gov | 172571 | Some-college | 10.0 | Divorced | Adm-clerical | Not-in-family | White | Male | 0 | 0 | 40 | United-States | >=50k |
| 28868 | 40 | Private | 223548 | HS-grad | 9.0 | Married-civ-spouse | Adm-clerical | Husband | White | Male | 0 | 0 | 40 | Mexico | <50k |
| 5667 | 28 | Local-gov | 191177 | Masters | 14.0 | Married-civ-spouse | Prof-specialty | Wife | White | Female | 0 | 0 | 20 | United-States | >=50k |
| 7827 | 31 | Private | 210562 | HS-grad | 9.0 | Married-civ-spouse | Transport-moving | Husband | White | Male | 0 | 0 | 65 | United-States | <50k |


Because the data is imbalanced we'll have to adjust our baseline. A completely "dumb" classifier that only guesses the most common class will be right more than 50% of the time. Let's see what that percentage is.

In [62]:
test_df['salary'].value_counts()

<50k     6183
>=50k    1958
Name: salary, dtype: int64

In [61]:
test_df['salary'].value_counts().iloc[0] / test_df['salary'].value_counts().sum()

0.7594890062645867

OK, so about 76% is the baseline that we have to beat.

The data looks like we expected. Now we follow a process similar to the one we used for the training data.

In [29]:
test_splits = no_split(range_of(test_df))

In [30]:
test_df_wrapper = TabularPandas(test_df, preprocessing, categorical_vars, continuous_vars, splits=test_splits, y_names=y_names)

Now we can turn that into a `DataLoaders` object.

> Note: If your test set size isn't divisible by your batch size, you'll need to set `drop_last` explicitly (I pass `drop_last=False` so the final partial batch is kept). If I don't, I get an error, although I've only noticed this happening with the test set.

In [31]:
test_dls = test_df_wrapper.dataloaders(128, drop_last=False)
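
To see what `drop_last` changes, the batch arithmetic can be worked out in plain Python. This sketch uses the 8,141 test rows counted above and the batch size of 128; the helper `batch_sizes` is hypothetical and only mimics how a loader would cut the data into batches.

```python
def batch_sizes(n_items, batch_size, drop_last):
    """Return the size of each batch a loader would produce."""
    full, remainder = divmod(n_items, batch_size)
    sizes = [batch_size] * full
    if remainder and not drop_last:
        sizes.append(remainder)  # keep the final, smaller batch
    return sizes

n_test = 6183 + 1958  # rows in the test split, from the value counts above

kept = batch_sizes(n_test, 128, drop_last=False)
dropped = batch_sizes(n_test, 128, drop_last=True)
print(len(kept), sum(kept))        # every row ends up in some batch
print(len(dropped), sum(dropped))  # the last partial batch is discarded
```

With `drop_last=True`, the 77 rows in the final partial batch would never receive a prediction, which matters when every test row has to be scored.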

Now we've got everything in place to make predictions.

In [41]:
preds, ground_truth = learn.get_preds(dl=test_dls.train)

Let's see what they look like.

In [43]:
preds[:5]

tensor([[9.9986e-01, 1.3628e-04],
        [1.0685e-03, 9.9893e-01],
        [9.9973e-01, 2.6673e-04],
        [1.8451e-04, 9.9982e-01],
        [9.9965e-01, 3.4817e-04]])

In [44]:
ground_truth[:5]

tensor([[0],
        [1],
        [0],
        [1],
        [0]], dtype=torch.int8)

How you convert the model output into a predicted label depends on your last layer. In this case we have a probability for each class, so to get the final prediction we take an argmax. Had there been just one value in the last layer, you could extract the label with `np.rint(preds)`.

You can check that these are probabilities by confirming that each row of predictions sums to 1.

In [45]:
preds.sum(1)

tensor([1.0000, 1.0000, 1.0000,  ..., 1.0000, 1.0000, 1.0000])

In [46]:
torch.argmax(preds, dim=1)

tensor([0, 1, 0,  ..., 0, 0, 1])
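
Both conversions mentioned above can be sketched without any libraries. The two-column rows are the first five predictions printed earlier; the single-output values are hypothetical, included only to illustrate the rounding case.

```python
# First five rows of `preds` from above: one probability per class.
two_class_preds = [
    [9.9986e-01, 1.3628e-04],
    [1.0685e-03, 9.9893e-01],
    [9.9973e-01, 2.6673e-04],
    [1.8451e-04, 9.9982e-01],
    [9.9965e-01, 3.4817e-04],
]

# Argmax: the predicted label is the index of the largest probability.
labels_from_argmax = [row.index(max(row)) for row in two_class_preds]
print(labels_from_argmax)

# Hypothetical single-output model: one probability of the positive class.
single_output_preds = [0.02, 0.97, 0.40, 0.81]

# Thresholding at 0.5 plays the role of rounding to the nearest integer.
labels_from_rounding = [1 if p >= 0.5 else 0 for p in single_output_preds]
print(labels_from_rounding)
```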

Let's see what our final accuracy is on the test set.

In [47]:
accuracy_score(ground_truth, torch.argmax(preds, dim=1))

1.0

Wow! 100% accuracy. This actually makes me suspicious that there was a problem somewhere, such as data leakage. I don't think there was though. But I also don't think that the variable we're predicting - whether the salary is over 50k or not - would be possible to predict with this accuracy. Is there a bug in the code? If you see it, let me know.
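
One candidate worth checking, judging only from the outputs printed above: `salary` is still in `categorical_vars` when that list is passed as `cat_names`, and a `salary` column appears among the inputs in `df_wrapper.train.xs`. If the target is also an input feature, the network can simply copy it, which would be consistent with both the near-zero training loss and the perfect test accuracy. A minimal guard in plain Python might look like this (the helper name `find_leaked_targets` is made up for this sketch):

```python
def find_leaked_targets(feature_names, target_names):
    """Return the target columns that also appear among the input features."""
    if isinstance(target_names, str):
        targets = [target_names]
    else:
        targets = list(target_names)
    return [name for name in targets if name in feature_names]

# The categorical feature list as printed earlier in the post.
categorical_vars = ['workclass', 'education', 'marital-status', 'occupation',
                    'relationship', 'race', 'sex', 'salary']
y_names = 'salary'

leaked = find_leaked_targets(categorical_vars, y_names)
print(leaked)

# Dropping the leaked columns gives a feature list without the target.
safe_categorical_vars = [name for name in categorical_vars if name not in leaked]
print(safe_categorical_vars)
```

If that is the cause, removing `salary` from the feature list before building `TabularPandas` and retraining should give a more believable score, though that would need to be confirmed by rerunning the notebook.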