# Census Model Deployment to Heroku Using FastAPI

In this project, a simple census dataset is used to build a model pipeline, train it, and deploy it to Heroku using FastAPI. The dataset contains 32,561 entries, one per person, each with 14 features (age, education, etc.); the model infers the salary range of an entry (`<=50K` or `>50K`). See the co-located [`README.md`](README.md) for more information.

This notebook is a playground where different data processing and modeling techniques are tested.

You can open this notebook on Google Colab. Note, however, that you need to upload the dataset to run the notebook there.

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/mxagar/census_model_deployment_fastapi/blob/master/census_notebook.ipynb)

Table of contents:

- [1. Load and Explore Dataset](#1.-Load-and-Explore-Dataset)
- [2. Data Processing Pipeline](#2.-Data-Processing-Pipeline)
- [3. Model Definition and Training](#3.-Model-Definition-and-Training)
- [4. Model Evaluation](#4.-Model-Evaluation)
- [5. Extra Tests](#5.-Extra-Tests)

## 1. Load and Explore Dataset

In [1]:
import itertools
import pickle
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [2]:
df = pd.read_csv('./data/census.csv')

In [3]:
df.head()

,age,workclass,fnlgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,salary
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


In [4]:
# No missing values
# We just need to
# - encode categoricals
# - binarize label/target
# - scale numericals
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 15 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   age              32561 non-null  int64 
 1    workclass       32561 non-null  object
 2    fnlgt           32561 non-null  int64 
 3    education       32561 non-null  object
 4    education-num   32561 non-null  int64 
 5    marital-status  32561 non-null  object
 6    occupation      32561 non-null  object
 7    relationship    32561 non-null  object
 8    race            32561 non-null  object
 9    sex             32561 non-null  object
 10   capital-gain    32561 non-null  int64 
 11   capital-loss    32561 non-null  int64 
 12   hours-per-week  32561 non-null  int64 
 13   native-country  32561 non-null  object
 14   salary          32561 non-null  object
dtypes: int64(6), object(9)
memory usage: 3.7+ MB


In [5]:
df.describe()

,age,fnlgt,education-num,capital-gain,capital-loss,hours-per-week
count,32561.0,32561.0,32561.0,32561.0,32561.0,32561.0
mean,38.581647,189778.4,10.080679,1077.648844,87.30383,40.437456
std,13.640433,105550.0,2.57272,7385.292085,402.960219,12.347429
min,17.0,12285.0,1.0,0.0,0.0,1.0
25%,28.0,117827.0,9.0,0.0,0.0,40.0
50%,37.0,178356.0,10.0,0.0,0.0,40.0
75%,48.0,237051.0,12.0,0.0,0.0,45.0
max,90.0,1484705.0,16.0,99999.0,4356.0,99.0


In [6]:
# Column names have spaces?
df.columns

Index(['age', ' workclass', ' fnlgt', ' education', ' education-num',
       ' marital-status', ' occupation', ' relationship', ' race', ' sex',
       ' capital-gain', ' capital-loss', ' hours-per-week', ' native-country',
       ' salary'],
      dtype='object')

In [7]:
# Remove the spaces from the column names
# IMPORTANT: the same renaming is needed in production, too!
df = df.rename(columns={col_name: col_name.replace(' ', '') for col_name in df.columns})

In [8]:
# Column names don't have spaces now
df.columns

Index(['age', 'workclass', 'fnlgt', 'education', 'education-num',
       'marital-status', 'occupation', 'relationship', 'race', 'sex',
       'capital-gain', 'capital-loss', 'hours-per-week', 'native-country',
       'salary'],
      dtype='object')
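
The rename above cleans the header only; the string values in the CSV are preceded by the same space. A hedged alternative is to let the parser skip the whitespace after each delimiter at load time via the `skipinitialspace` option of `pd.read_csv`, which handles header and values alike. The sketch below uses a tiny in-memory CSV with made-up rows instead of the real file, and the name `df_demo` is only for illustration. Note that adopting this would change the label strings, so any pickled artifacts would have to be regenerated.

```python
import io

import pandas as pd

# Toy CSV in the same ", "-separated layout as census.csv (made-up rows)
raw_csv = "age, workclass, salary\n39, State-gov, <=50K\n50, Private, >50K\n"

# skipinitialspace=True skips the spaces that follow each delimiter
df_demo = pd.read_csv(io.StringIO(raw_csv), skipinitialspace=True)

print(list(df_demo.columns))
print(df_demo["salary"].tolist())
```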

In [9]:
# Drop duplicates
# IMPORTANT: We need to do that in production, too!
df = df.drop_duplicates().reset_index(drop=True)

In [10]:
# 24 duplicate rows were removed: 32561 -> 32537
df.shape

(32537, 15)

In [11]:
# The target classes are imbalanced
df.salary.value_counts()

 <=50K    24698
 >50K      7839
Name: salary, dtype: int64
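
To put the imbalance in numbers, here is a minimal sketch that takes the counts printed above and computes the share of the positive class, as well as the accuracy a trivial predictor that always answers `<=50K` would reach. It suggests why accuracy alone would be a weak metric for this dataset.

```python
# Class counts copied from the value_counts() output above
class_counts = {"<=50K": 24698, ">50K": 7839}
n_total = sum(class_counts.values())

# Share of the positive (>50K) class
positive_rate = class_counts[">50K"] / n_total

# Accuracy of a trivial model that always predicts the majority class
majority_baseline_accuracy = max(class_counts.values()) / n_total

print(f"positive rate: {positive_rate:.3f}")
print(f"majority-class accuracy: {majority_baseline_accuracy:.3f}")
```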

## 2. Data Processing Pipeline

In [12]:
target = "salary"
categorical_features = list(df.drop(target, axis=1).select_dtypes(include = ['object']))
numerical_features = list(df.select_dtypes(include = ['float', 'int']))

In [13]:
categorical_features

['workclass',
 'education',
 'marital-status',
 'occupation',
 'relationship',
 'race',
 'sex',
 'native-country']

In [14]:
numerical_features

['age',
 'fnlgt',
 'education-num',
 'capital-gain',
 'capital-loss',
 'hours-per-week']

In [15]:
len(categorical_features)+len(numerical_features)

14

In [16]:
# Import all necessary tools
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler, LabelBinarizer
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.compose import ColumnTransformer

In [17]:
# Define processing for categorical columns
# handle_unknown="ignore": the one-hot encoder needs to be able to deal with categories not seen during fit!
categorical_transformer = make_pipeline(
    # A string placeholder keeps the columns uniformly string-typed for the encoder
    SimpleImputer(strategy="constant", fill_value="missing"),
    OneHotEncoder(sparse_output=False, handle_unknown="ignore")
)

In [18]:
# Define processing for numerical columns
numerical_transformer = make_pipeline(
    SimpleImputer(strategy="median"),
    StandardScaler()
)
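
As a quick sanity check of the numerical track, the sketch below runs the same two steps on a toy column with one missing value: the gap is filled with the median of the observed values, and the result is then scaled to zero mean and unit variance. The array `toy_column` is made up for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# One numerical column with a missing entry (toy values)
toy_column = np.array([[1.0], [np.nan], [3.0]])

# Same steps as the numerical transformer: median imputation, then scaling
toy_transformer = make_pipeline(SimpleImputer(strategy="median"), StandardScaler())
toy_scaled = toy_transformer.fit_transform(toy_column)

# The imputed entry equals the median (2.0), which is also the mean, so it maps to 0
print(toy_scaled.ravel())
```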

In [19]:
# Put the 2 tracks together into one pipeline using the ColumnTransformer
# This also drops the columns that we are not explicitly transforming
feature_processor = ColumnTransformer(
    transformers=[
        ("num", numerical_transformer, numerical_features),
        ("cat", categorical_transformer, categorical_features),
    ],
    remainder="drop",  # This drops the columns that we do not transform
)

In [20]:
# Get a list of the columns we used
features = list(itertools.chain.from_iterable([x[2] for x in feature_processor.transformers]))

In [21]:
len(features)

14

In [22]:
X = df[features]
y = df[target]

In [23]:
X_train, X_test, y_train, y_test = train_test_split(
    X,  # predictive variables
    y,  # target
    test_size=0.2,  # portion of the dataset allocated to the test set
    random_state=42,  # fixed seed for reproducible splits
    stratify=y  # keep the class ratios in both splits
)
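
The effect of `stratify` can be checked on a toy target. In the sketch below, 25% of the made-up labels are positive, and both splits keep that ratio. The names `X_toy` and `y_toy` are illustrative only.

```python
from sklearn.model_selection import train_test_split

# Toy data: 100 samples, 25 of them positive
y_toy = [0] * 75 + [1] * 25
X_toy = [[i] for i in range(len(y_toy))]

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# Positive share in the train and test splits
print(sum(y_tr) / len(y_tr), sum(y_te) / len(y_te))
```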

In [24]:
X_train_transformed = feature_processor.fit_transform(X_train)

In [25]:
target_processor = LabelBinarizer()
y_train_transformed = target_processor.fit_transform(y_train).ravel()

In [26]:
# Save processors and additional data
processing_parameters = dict()
processing_parameters['features'] = features
processing_parameters['target'] = target
processing_parameters['categorical_features'] = categorical_features
processing_parameters['numerical_features'] = numerical_features
processing_parameters['feature_processor'] = feature_processor
processing_parameters['target_processor'] = target_processor

with open('artifacts/processing_parameters.pickle', 'wb') as f:  # wb: write bytes
    pickle.dump(processing_parameters, f)

# Load again (round-trip test)
with open('artifacts/processing_parameters.pickle', 'rb') as f:  # rb: read bytes
    processing_parameters = pickle.load(f)

# Unpack the loaded values
features = processing_parameters['features']
target = processing_parameters['target']
categorical_features = processing_parameters['categorical_features']
numerical_features = processing_parameters['numerical_features']
feature_processor = processing_parameters['feature_processor']
target_processor = processing_parameters['target_processor']

## 3. Model Definition and Training

In [27]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

In [28]:
# Random forest classifier
estimator = RandomForestClassifier(random_state=42)

# Define grid search: parameters to try, number of cross-validation folds, scoring metric
param_grid = {
    'n_estimators': [100, 150, 200],
    'max_features': ['sqrt', 'log2'],
    'criterion': ['gini', 'entropy'],
    'max_depth': [None]+[n for n in range(5,20,5)]
}

# Grid search
search = GridSearchCV(estimator=estimator,
                      param_grid=param_grid,
                      cv=3,
                      scoring='roc_auc')

# Find best hyperparameters and best estimator pipeline
search.fit(X_train_transformed, y_train_transformed)
rfc = search.best_estimator_
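
To get a feeling for the cost of this search, the sketch below counts the hyperparameter combinations in the grid and the resulting number of model fits. It only re-declares the grid from the cell above under a different name (`demo_param_grid`) and does not train anything.

```python
from itertools import product

# Same grid as above, re-declared so this cell is self-contained
demo_param_grid = {
    "n_estimators": [100, 150, 200],
    "max_features": ["sqrt", "log2"],
    "criterion": ["gini", "entropy"],
    "max_depth": [None] + list(range(5, 20, 5)),
}
cv_folds = 3

# Every combination of values is one candidate
candidates = list(product(*demo_param_grid.values()))
n_candidates = len(candidates)

# Each candidate is fitted once per fold; the final refit of the best one is extra
n_cv_fits = n_candidates * cv_folds

print(n_candidates, n_cv_fits)
```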

In [29]:
print(search.best_score_)
print(search.best_params_)

0.9139884699718301
{'criterion': 'gini', 'max_depth': 15, 'max_features': 'sqrt', 'n_estimators': 200}


In [30]:
# Save model
with open('artifacts/model.pickle', 'wb') as f:  # wb: write bytes
    pickle.dump(rfc, f)
# Load again (test)
with open('artifacts/model.pickle', 'rb') as f:  # rb: read bytes
    rfc = pickle.load(f)

## 4. Model Evaluation

In [31]:
from sklearn.metrics import fbeta_score, precision_score, recall_score, roc_auc_score

In [32]:
X_test_transformed = feature_processor.transform(X_test)
y_test_transformed = target_processor.transform(y_test).ravel()

In [33]:
preds = rfc.predict(X_test_transformed)
probs = rfc.predict_proba(X_test_transformed)[:, 1]

In [34]:
fbeta = fbeta_score(y_test_transformed, preds, beta=1, zero_division=1)
precision = precision_score(y_test_transformed, preds, zero_division=1)
recall = recall_score(y_test_transformed, preds, zero_division=1)
roc_auc = roc_auc_score(y_test_transformed, probs)

In [35]:
print(f"fbeta = {fbeta}")
print(f"precision = {precision}")
print(f"recall = {recall}")
print(f"roc_auc = {roc_auc}")

fbeta = 0.6847110460863205
precision = 0.8027444253859348
recall = 0.5969387755102041
roc_auc = 0.920798949640585


## 5. Extra Tests

In [36]:
import yaml

In [37]:
config = dict()
with open('config.yaml') as f:
    config = yaml.safe_load(f)
print(config)

{'data_path': './data/census.csv', 'test_size': 0.2, 'random_seed': 42, 'target': 'salary', 'features': {'numerical': ['age', 'fnlgt', 'education_num', 'capital_gain', 'capital_loss', 'hours_per_week'], 'categorical': ['workclass', 'education', 'marital_status', 'occupation', 'relationship', 'race', 'sex', 'native_country']}, 'random_forest_parameters': {'n_estimators': 100, 'criterion': 'gini', 'max_depth': 13, 'min_samples_split': 2, 'min_samples_leaf': 1, 'min_weight_fraction_leaf': 0.0, 'max_features': 'auto', 'max_leaf_nodes': None, 'min_impurity_decrease': 0.0, 'bootstrap': True, 'oob_score': False, 'n_jobs': None, 'random_state': 42, 'verbose': 0, 'warm_start': False, 'class_weight': 'balanced', 'ccp_alpha': 0.0, 'max_samples': None}, 'random_forest_grid_search': {'hyperparameters': {'n_estimators': [100, 150, 200], 'max_features': ['sqrt', 'log2'], 'criterion': ['gini', 'entropy'], 'max_depth': [5, 10, 15]}, 'cv': 3, 'scoring': 'roc_auc'}, 'model_artifact': './exported_artifact

In [40]:
# Install created library/package
!pip install --upgrade .

Processing /Users/mxagar/nexo/git_repositories/census_model_deployment_fastapi
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: census-salary
  Building wheel for census-salary (setup.py) ... done
  Created wheel for census-salary: filename=census_salary-0.1.0-py3-none-any.whl size=6144 sha256=ee81f050bb3ee7d0544dbd919aa21d9bfc72b8003e9d425afae054416a833fe5
  Stored in directory: /private/var/folders/06/wdqtkk796gjfxfq9063zphx40000gn/T/pip-ephem-wheel-cache-e20dcnvd/wheels/8a/3a/7d/c496210767a1dc8b82ed069ba03ced1af4ea2f3cfc458ea059
Successfully built census-salary
Installing collected packages: census-salary
  Attempting uninstall: census-salary
    Found existing installation: census-salary 0.1.0
    Uninstalling census-salary-0.1.0:
      Successfully uninstalled census-salary-0.1.0
Successfully installed census-salary-0.1.0


In [41]:
# Run usage example
import pandas as pd
import census_salary as cs

# Train, if not trained yet
model, processing_parameters, config, test_scores = cs.train_pipeline(config_filename='config.yaml')
print("Test scores: ")
print(test_scores)

# Load pipeline, if training performed in another execution/session
model, processing_parameters, config = cs.load_pipeline(config_filename='config.yaml')

# Get and check the data
df = pd.read_csv('./data/census.csv') # original training dataset: features & target
df, _ = cs.validate_data(df=df) # columns renamed, duplicates dropped, etc.
X = df.drop("salary", axis=1) # optional
X = X.iloc[:100, :] # we take a sample

# Predict salary (values already decoded)
pred = cs.predict(X, model, processing_parameters)
print("Prediction: ")
print(pred)


TRAINING
Running setup...
Running data processing...
Running model fit...
Persisting pipeline: model + processing...
Running evaluation with test split...
Training successfully finished! Check exported artifacts.

Test scores: 
{'precision': 0.5711906744379683, 'recall': 0.875, 'fbeta': 0.691183879093199, 'roc_auc': 0.918021681091465}
Loading pipeline: model + processing parameters + config...
Prediction: 
[' <=50K' ' >50K' ' <=50K' ' <=50K' ' >50K' ' >50K' ' <=50K' ' >50K'
 ' >50K' ' >50K' ' >50K' ' >50K' ' <=50K' ' <=50K' ' >50K' ' <=50K'
 ' <=50K' ' <=50K' ' <=50K' ' >50K' ' >50K' ' <=50K' ' <=50K' ' <=50K'
 ' <=50K' ' >50K' ' <=50K' ' >50K' ' <=50K' ' >50K' ' <=50K' ' <=50K'
 ' <=50K' ' <=50K' ' <=50K' ' <=50K' ' <=50K' ' <=50K' ' >50K' ' >50K'
 ' <=50K' ' >50K' ' >50K' ' <=50K' ' <=50K' ' >50K' ' >50K' ' <=50K'
 ' >50K' ' <=50K' ' <=50K' ' <=50K' ' >50K' ' >50K' ' <=50K' ' >50K'
 ' <=50K' ' <=50K' ' >50K' ' <=50K' ' >50K' ' <=50K' ' >50K' ' >50K'
 ' <=50K' ' >50K' ' <=50K' ' >50K'