# From Notebook to ModelOp Center:
## Training, Evaluating, and Conforming a Model for Deployment


In this notebook, we demonstrate the process of 
1. training a model, 
2. evaluating its performance, 
3. saving it for later use,
4. and conforming it to MOC standards.

More specifically, we will train a logistic regression classifier on the German Credit Data dataset.

**I - Model Training**

Let's begin by loading relevant libraries. We will need `sklearn` for model training, and `aequitas` for bias detection.

In [1]:
import csv
import json
import pickle
import numpy as np
import pandas as pd

from aequitas.bias import Bias
from aequitas.group import Group
from aequitas.preprocessing import preprocess_input_df

from sklearn import set_config
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, GridSearchCV, RepeatedStratifiedKFold
from sklearn.metrics import make_scorer, accuracy_score, precision_score, recall_score, \
                            f1_score, fbeta_score, confusion_matrix

# set_config(display='diagram')

The **German Credit Data** dataset can be found here: https://archive.ics.uci.edu/ml/datasets/statlog+(german+credit+data). Download it and load it from a *CSV* file. For our purposes, the dataset has been modified slightly to include an `id` column and a `gender` column (engineered from `status_sex`, and used to demonstrate bias). The target variable is under `label`. We have mapped the labels `[1,2]` to `[0,1]`, where `1` indicates the positive class (loan default).

In [2]:
data = pd.read_csv("german_credit_data.csv")
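
The modifications described above are not shown in this notebook. As a rough sketch of how they might be reproduced, the snippet below derives `gender` and the 0/1 labels from the raw UCI encodings. The toy rows are made up, and the assumption that the raw columns are named `status_sex` and `label` is ours; the code-to-sex mapping follows the UCI attribute description.

```python
import pandas as pd

# Toy rows in the raw UCI encoding (values are illustrative only).
raw = pd.DataFrame({
    "status_sex": ["A91", "A92", "A93", "A94", "A92"],
    "label": [1, 2, 1, 1, 2],
})

# UCI attribute description: A91, A93, A94 are male; A92 and A95 are female.
SEX_CODES = {"A91": "male", "A92": "female", "A93": "male",
             "A94": "male", "A95": "female"}

prepared = raw.assign(
    id=range(len(raw)),                      # simple running identifier
    gender=raw["status_sex"].map(SEX_CODES),
    label=raw["label"].map({1: 0, 2: 1}),    # 1 (good) -> 0, 2 (bad) -> 1
)
print(prepared)
```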

The aequitas library requires the true label to be encoded as 'label_value', so let us rename that column.

In [4]:
data = data.rename(columns={"label":"label_value"})

In [5]:
data.columns.values

array(['id', 'duration_months', 'credit_amount', 'installment_rate',
       'present_residence_since', 'age_years', 'number_existing_credits',
       'checking_status', 'credit_history', 'purpose', 'savings_account',
       'present_employment_since', 'debtors_guarantors', 'property',
       'installment_plans', 'housing', 'job', 'number_people_liable',
       'telephone', 'foreign_worker', 'gender', 'label_value'],
      dtype=object)

Let's look at some data:

In [6]:
data.head()

Unnamed: 0,id,duration_months,credit_amount,installment_rate,present_residence_since,age_years,number_existing_credits,checking_status,credit_history,purpose,...,debtors_guarantors,property,installment_plans,housing,job,number_people_liable,telephone,foreign_worker,gender,label_value
0,0,6,1169,4,4,67,2,A11,A34,A43,...,A101,A121,A143,A152,A173,1,A192,A201,male,0
1,1,48,5951,2,2,22,1,A12,A32,A43,...,A101,A121,A143,A152,A173,1,A191,A201,female,1
2,2,12,2096,2,3,49,1,A14,A34,A46,...,A101,A121,A143,A152,A172,2,A191,A201,male,0
3,3,42,7882,2,4,45,1,A11,A32,A42,...,A103,A122,A143,A153,A173,2,A191,A201,male,0
4,4,24,4870,3,4,53,2,A11,A33,A40,...,A101,A124,A143,A153,A173,2,A191,A201,male,1


Not all numeric columns need to be considered as numeric features. For example, `number_people_liable` only has two unique discrete values:

In [7]:
data.number_people_liable.value_counts()

1    845
2    155
Name: number_people_liable, dtype: int64

We may therefore treat it as a categorical feature. Note, however, that we may need to reconsider this option if more values appear in testing phases.

In [8]:
data.number_people_liable = data.number_people_liable.astype('object')

Before proceeding any further with model development, let us split the original dataset into two sets: a **baseline** set that will be used as a reference set, and a **sample** set which will mimic input data to the model once the model is in use.

In [9]:
df_baseline, df_sample = train_test_split(data, train_size=0.8, random_state=0)

df_baseline.to_json('df_baseline.json', orient='records', lines=True)
df_sample.to_json('df_sample.json', orient='records', lines=True)
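
With `orient='records'` and `lines=True`, each row is written as one JSON object per line. The sketch below illustrates the round trip on a made-up three-column frame, keeping the text in memory instead of writing a file.

```python
import json
from io import StringIO

import pandas as pd

df = pd.DataFrame({
    "id": [0, 1],
    "credit_amount": [1169, 5951],
    "gender": ["male", "female"],
})

# Without a path, to_json returns the serialized text.
text = df.to_json(orient="records", lines=True)
first_record = json.loads(text.splitlines()[0])
print(first_record)

# Reading the same text back yields the original rows.
restored = pd.read_json(StringIO(text), orient="records", lines=True)
print(restored)
```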

We will train a **Logistic Regression** classifier. Since our data contains categorical features, we will need to start our pipeline with an encoder.

In [10]:
pipeline = make_pipeline(
    OneHotEncoder(
        handle_unknown='ignore', 
        sparse=True
    ),
    LogisticRegression(
        max_iter=1000,
        random_state=0
    )
)
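
To see what `handle_unknown='ignore'` does, here is a small sketch on made-up data. Note that the encoder is applied to every column it receives, so a numeric column is one-hot encoded as well; a value not seen during fitting (the invented code `A999`, or the amount `2500`) is encoded as all zeros rather than raising an error.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame({
    "housing": ["A151", "A152", "A153"],
    "credit_amount": [1000, 2000, 3000],
})
new = pd.DataFrame({
    "housing": ["A152", "A999"],      # "A999" is a made-up, unseen category
    "credit_amount": [2000, 2500],    # 2500 was not seen during fitting
})

encoder = OneHotEncoder(handle_unknown="ignore").fit(train)
encoded = encoder.transform(new).toarray()  # transform returns a sparse matrix by default

print(encoder.categories_)
print(encoded)
```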

**Logistic Regression** has multiple parameters which can be tuned. Among these are `C`, `solver`, and `class_weight`, which will be optimized by **GridSearchCV**. We provide GridSearchCV a list of values for each of these parameters.

In [11]:
parameters = dict(
    logisticregression__C=np.logspace(-4, 4, 50), # Inverse of regularization strength
    logisticregression__solver=['liblinear', 'lbfgs', 'newton-cg'],
    logisticregression__class_weight=['balanced', None]
)
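
It can help to know how large the search space is before fitting. A quick sketch using scikit-learn's `ParameterGrid` on the same dictionary:

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

parameters = dict(
    logisticregression__C=np.logspace(-4, 4, 50),
    logisticregression__solver=['liblinear', 'lbfgs', 'newton-cg'],
    logisticregression__class_weight=['balanced', None],
)

# Number of distinct parameter combinations: 50 * 3 * 2
n_candidates = len(ParameterGrid(parameters))
print(n_candidates)
```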

Our data still contains non-predictive features, such as `id`, `label_value` (the target), and `gender` (excluded to remove explicit bias). We remove these below.


In [12]:
predictive_features = [
    f for f in list(data.columns.values) 
    if f not in ['id', 'label_value', 'gender']
]

As a sanity check, let us see which features are automatically encoded as **numeric**, and which are encoded as **categorical**.

In [13]:
categorical_features = [
    f for f in list(data.select_dtypes(include=['category', 'object'])) 
    if f in predictive_features
]

numerical_features = [
    f for f in predictive_features 
    if f not in categorical_features
]

**Categorical features**:

In [14]:
print(categorical_features)

['checking_status', 'credit_history', 'purpose', 'savings_account', 'present_employment_since', 'debtors_guarantors', 'property', 'installment_plans', 'housing', 'job', 'number_people_liable', 'telephone', 'foreign_worker']


**Numerical features**:

In [15]:
print(numerical_features)

['duration_months', 'credit_amount', 'installment_rate', 'present_residence_since', 'age_years', 'number_existing_credits']


Everything looks good; let us proceed with training. We need to specify **predictive** and **response** variables for each of the training and test sets. We set these by filtering the baseline and sample sets.

In [16]:
X_train = df_baseline[predictive_features]
X_test = df_sample[predictive_features]

y_train = df_baseline['label_value']
y_test = df_sample['label_value']

X_train.to_json('X_train.json', orient='records', lines=True)
X_test.to_json('X_test.json', orient='records', lines=True)

We may now fit the classifier to the training data. Since "it is worse to classify a customer as good when they are bad, than it is to classify a customer as bad when they are good", we will use an **F_beta metric**, with `beta=2`, to judge the performance of our model.

In [17]:
clf_GS = GridSearchCV(
    estimator=pipeline, 
    param_grid=parameters,
    n_jobs=-1,
    scoring=make_scorer(fbeta_score, beta=2),
    cv=RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=0)
)
clf_GS.fit(X_train, y_train)

GridSearchCV(cv=RepeatedStratifiedKFold(n_repeats=3, n_splits=10, random_state=0),
             error_score=nan,
             estimator=Pipeline(memory=None,
                                steps=[('onehotencoder',
                                        OneHotEncoder(categories='auto',
                                                      drop=None,
                                                      dtype=<class 'numpy.float64'>,
                                                      handle_unknown='ignore',
                                                      sparse=True)),
                                       ('logisticregression',
                                        LogisticRegression(C=1.0,
                                                           class_weight=None,
                                                           dual=False,
                                                           fit_intercept=...
       3.39322177e+02, 4.94171336e+02, 7.19685673e+02, 1.048
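
This search is fairly expensive. As a back-of-the-envelope sketch, the number of model fits is the number of cross-validation folds times the number of parameter combinations (plus one final refit on the full training set):

```python
from sklearn.model_selection import RepeatedStratifiedKFold

cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=0)

n_folds = cv.get_n_splits()      # 10 splits repeated 3 times
n_candidates = 50 * 3 * 2        # size of the parameter grid defined earlier
total_fits = n_folds * n_candidates
print(n_folds, total_fits)
```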

In [19]:
clf_GS.best_estimator_

Pipeline(memory=None,
         steps=[('onehotencoder',
                 OneHotEncoder(categories='auto', drop=None,
                               dtype=<class 'numpy.float64'>,
                               handle_unknown='ignore', sparse=True)),
                ('logisticregression',
                 LogisticRegression(C=0.0020235896477251557,
                                    class_weight='balanced', dual=False,
                                    fit_intercept=True, intercept_scaling=1,
                                    l1_ratio=None, max_iter=1000,
                                    multi_class='auto', n_jobs=None,
                                    penalty='l2', random_state=0,
                                    solver='lbfgs', tol=0.0001, verbose=0,
                                    warm_start=False))],
         verbose=False)

Let's examine the parameters of the best estimator more carefully.

In [20]:
clf_GS.best_params_

{'logisticregression__C': 0.0020235896477251557,
 'logisticregression__class_weight': 'balanced',
 'logisticregression__solver': 'lbfgs'}

It appears that the best logistic regression classifier uses `solver='lbfgs'` and `class_weight='balanced'`, with a fairly strong regularization (`C` ≈ 0.002). This classifier achieved the best mean cross-validated F2 score:

In [21]:
clf_GS.best_score_

0.6720094431830962

**II - Model Evaluation**

Before saving our trained model for further use, let's look at some performance metrics. We will evaluate the model on both the training and test sets; we would like to see a stable performance.

For repeatability, let's define a function which computes multiple metrics at a time:

In [22]:
def compute_metrics(y, y_preds):
    """
    A function to evaluate a classification model
    
    param: y: true (actual) labels
    param: y_preds: predicted labels (as scored by model)
    
    return: multiple classification performance metrics
    """
    
    return [
        accuracy_score(y, y_preds),
        precision_score(y, y_preds),
        recall_score(y, y_preds),
        f1_score(y, y_preds),
        fbeta_score(y, y_preds, beta=2),
    ]

Let us now compute predictions on both training and test sets:

In [23]:
y_test_preds = clf_GS.best_estimator_.predict(X_test)
y_train_preds = clf_GS.best_estimator_.predict(X_train)

We will display performance metrics in a DataFrame:

In [24]:
performance_df = pd.DataFrame(
    columns=['Accuracy', 'Precision', 'Recall', 'F1 score', 'F2 Score'],
    index=['Training Set', 'Test Set']
)

In [25]:
performance_df.loc['Training Set',:] = compute_metrics(y=y_train, y_preds=y_train_preds)
performance_df.loc['Test Set',:] = compute_metrics(y=y_test, y_preds=y_test_preds)

Here's how our model performed:

In [26]:
performance_df

| | Accuracy | Precision | Recall | F1 score | F2 Score |
|---|---|---|---|---|---|
| Training Set | 0.725 | 0.53125 | 0.772727 | 0.62963 | 0.708333 |
| Test Set | 0.665 | 0.451613 | 0.724138 | 0.556291 | 0.646154 |
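
As a cross-check, the F2 column can be recomputed from the precision and recall columns. The sketch below uses the general F-beta formula, (1 + β²)·P·R / (β²·P + R), with the rounded values from the table, so small rounding differences are expected.

```python
def f_beta(precision, recall, beta=2.0):
    """F-beta score from precision and recall."""
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Precision and recall as reported in the table above
train_f2 = f_beta(0.53125, 0.772727)
test_f2 = f_beta(0.451613, 0.724138)
print(round(train_f2, 4), round(test_f2, 4))
```

With `beta=2`, recall is weighted more heavily than precision, which matches the stated preference for catching bad customers.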


While it's good to see that the performance on the training set is not too far off from the performance on the test set, further model improvements are needed to achieve better F2 scores. For now, we will make do with this model and use it to produce new predictions.

**III - Saving and Loading the Trained Model**

Now that the model is **trained** and **evaluated**, we save it in a binary format. It will then be loaded and used to make new predictions.

In [27]:
# Use a context manager so the file handle is closed after writing
with open("logreg_classifier.pickle", 'wb') as f:
    pickle.dump(clf_GS.best_estimator_, f)

The model is reloaded on-demand as follows:

In [58]:
# Use a context manager so the file handle is closed after reading
with open("logreg_classifier.pickle", 'rb') as f:
    logreg_classifier = pickle.load(f)

Predictions are produced on-demand by calling the `predict()` method:

In [59]:
new_preds = logreg_classifier.predict(X_test)
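
A quick way to gain confidence in a saved model is to check that the restored object predicts exactly as the original did. The sketch below does this in memory with a toy classifier (the data is made up). Keep in mind that the scikit-learn documentation cautions that a pickled estimator may not load correctly under a different scikit-learn version, so the scoring environment should match the training environment.

```python
import pickle

import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy data, for illustration only
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])
model = LogisticRegression().fit(X, y)

# Serialize to bytes and restore, without touching the file system
payload = pickle.dumps(model)
restored = pickle.loads(payload)

same_predictions = np.array_equal(model.predict(X), restored.predict(X))
print(same_predictions)
```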

**IV - Evaluating Bias on Protected Classes**

Since `gender` is a protected class, we have excluded it from the list of predictive features. However, this does not guarantee that the model is not implicitly biased, as `gender` could potentially be inferred from other features. It is therefore imperative that we evaluate our model for bias.

To that end, let us produce some predictions and append them to our labeled baseline and sample sets.

In [60]:
df_baseline_scored = df_baseline.copy(deep=True)
df_baseline_scored["score"] = logreg_classifier.predict(
    df_baseline[predictive_features])

df_sample_scored = df_sample.copy(deep=True)
df_sample_scored["score"] = logreg_classifier.predict(
    df_sample[predictive_features])

Let's save these two DataFrames before proceeding further:

In [61]:
df_baseline_scored.to_json('df_baseline_scored.json', orient='records', lines=True)
df_sample_scored.to_json('df_sample_scored.json', orient='records', lines=True)

Now, we call the aequitas preprocessing function on our datasets, filtered to the features we care about: `score` (prediction), `label_value` (true label), and `gender` (protected class):

In [62]:
df_baseline_scored_processed, _ = preprocess_input_df(
    df_baseline_scored.loc[:,['score', 'label_value', 'gender']]
)
df_sample_scored_processed, _ = preprocess_input_df(
    df_sample_scored.loc[:,['score', 'label_value', 'gender']]
)

Let's start by computing some `Group` Metrics:

In [63]:
g_baseline, g_sample = Group(), Group()
xtab_baseline, _ = g_baseline.get_crosstabs(df_baseline_scored_processed)
xtab_sample, _ = g_sample.get_crosstabs(df_sample_scored_processed)

In [64]:
absolute_metrics_baseline = g_baseline.list_absolute_metrics(xtab_baseline)
absolute_metrics_sample = g_sample.list_absolute_metrics(xtab_sample)

Here are the absolute metrics, computed on baseline and sample sets, respectively:

In [65]:
xtab_baseline[['attribute_name', 'attribute_value'] + absolute_metrics_baseline].round(2)

Unnamed: 0,attribute_name,attribute_value,tpr,tnr,for,fdr,fpr,fnr,npv,precision,ppr,pprev,prev
0,gender,female,0.8,0.67,0.14,0.43,0.33,0.2,0.86,0.57,0.34,0.5,0.35
1,gender,male,0.76,0.72,0.12,0.49,0.28,0.24,0.88,0.51,0.66,0.42,0.28


In [66]:
xtab_sample[['attribute_name', 'attribute_value'] + absolute_metrics_sample].round(2)

Unnamed: 0,attribute_name,attribute_value,tpr,tnr,for,fdr,fpr,fnr,npv,precision,ppr,pprev,prev
0,gender,female,0.68,0.7,0.2,0.45,0.3,0.32,0.8,0.55,0.33,0.43,0.35
1,gender,male,0.76,0.61,0.12,0.6,0.39,0.24,0.88,0.4,0.67,0.48,0.26


We can also add some raw counts (group sizes) as follows:

In [67]:
xtab_baseline[[col for col in xtab_baseline.columns if col not in absolute_metrics_baseline]]

Unnamed: 0,model_id,score_threshold,k,attribute_name,attribute_value,pp,pn,fp,fn,tn,tp,group_label_pos,group_label_neg,group_size,total_entities
0,0,binary 0/1,352,gender,female,118,120,51,17,103,67,84,154,238,800
1,0,binary 0/1,352,gender,male,234,328,114,38,290,120,158,404,562,800


In [68]:
xtab_sample[[col for col in xtab_sample.columns if col not in absolute_metrics_sample]]

Unnamed: 0,model_id,score_threshold,k,attribute_name,attribute_value,pp,pn,fp,fn,tn,tp,group_label_pos,group_label_neg,group_size,total_entities
0,0,binary 0/1,93,gender,female,31,41,14,8,33,17,25,47,72,200
1,0,binary 0/1,93,gender,male,62,66,37,8,58,25,33,95,128,200
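
The absolute metrics shown earlier follow directly from these raw counts. As a sketch, here is the `female` row of the baseline set recomputed by hand, using the counts from the baseline table above (`k` is the total number of predicted positives across all groups):

```python
# Raw counts for gender=female on the baseline set (from the table above)
tp, fp, tn, fn = 67, 51, 103, 17
k = 352                      # predicted positives across all groups

pp, pn = tp + fp, tn + fn    # predicted positive / predicted negative
group_size = pp + pn

female_metrics = {
    "tpr": tp / (tp + fn),
    "tnr": tn / (tn + fp),
    "for": fn / pn,
    "fdr": fp / pp,
    "fpr": fp / (fp + tn),
    "fnr": fn / (fn + tp),
    "npv": tn / pn,
    "precision": tp / pp,
    "ppr": pp / k,
    "pprev": pp / group_size,
    "prev": (tp + fn) / group_size,
}
rounded = {name: round(value, 2) for name, value in female_metrics.items()}
print(rounded)
```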


That's it for `Group` metrics. Let's move on to `Bias` metrics.

In [69]:
b_baseline, b_sample = Bias(), Bias()

bdf_baseline = b_baseline.get_disparity_predefined_groups(
    xtab_baseline, 
    original_df=df_baseline_scored_processed, 
    ref_groups_dict={'gender':'male'}, alpha=0.05, mask_significance=True
)

bdf_sample = b_sample.get_disparity_predefined_groups(
    xtab_sample, 
    original_df=df_sample_scored_processed, 
    ref_groups_dict={'gender':'male'}, alpha=0.05, mask_significance=True
)

get_disparity_predefined_group()
get_disparity_predefined_group()


We can now compute **disparity** metrics as follows:

In [70]:
calculated_disparities_baseline = b_baseline.list_disparities(bdf_baseline)
calculated_disparities_sample = b_sample.list_disparities(bdf_sample)

disparity_metrics_df_baseline = bdf_baseline[
    ['attribute_name', 'attribute_value'] + \
        calculated_disparities_baseline
    ]
disparity_metrics_df_sample = bdf_sample[
    ['attribute_name', 'attribute_value'] + \
        calculated_disparities_sample
    ]

Here are the computed disparity metrics on baseline and sample sets, respectively:

In [71]:
disparity_metrics_df_baseline

Unnamed: 0,attribute_name,attribute_value,ppr_disparity,pprev_disparity,precision_disparity,fdr_disparity,for_disparity,fpr_disparity,fnr_disparity,tpr_disparity,tnr_disparity,npv_disparity
0,gender,female,0.504274,1.190763,1.107203,0.887154,1.222807,1.173616,0.841479,1.050198,0.931751,0.970805
1,gender,male,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


In [72]:
disparity_metrics_df_sample

Unnamed: 0,attribute_name,attribute_value,ppr_disparity,pprev_disparity,precision_disparity,fdr_disparity,for_disparity,fpr_disparity,fnr_disparity,tpr_disparity,tnr_disparity,npv_disparity
0,gender,female,0.5,0.888889,1.36,0.756757,1.609756,0.764807,1.32,0.8976,1.150037,0.915896
1,gender,male,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
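
Each disparity is the ratio of a group's metric to the same metric for the reference group (`male` here), which is why the reference row is all ones. As a sketch, a few of the baseline disparities can be recomputed from the raw baseline counts reported earlier:

```python
# Raw baseline counts per group (from the group-size table earlier)
female = {"tp": 67, "fp": 51, "tn": 103, "fn": 17}
male = {"tp": 120, "fp": 114, "tn": 290, "fn": 38}
k = 352  # predicted positives across all groups

def rates(c):
    """A few absolute metrics computed from one group's confusion counts."""
    pp = c["tp"] + c["fp"]
    size = pp + c["tn"] + c["fn"]
    return {
        "ppr": pp / k,
        "pprev": pp / size,
        "precision": c["tp"] / pp,
        "fpr": c["fp"] / (c["fp"] + c["tn"]),
    }

female_rates, male_rates = rates(female), rates(male)
disparities = {
    f"{name}_disparity": female_rates[name] / male_rates[name]
    for name in female_rates
}
print({name: round(value, 6) for name, value in disparities.items()})
```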


Some of the disparity metrics above are worrisome! We might need to retrain the model, possibly with better feature engineering. That's an exercise for a later time.
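
To make "worrisome" a little more concrete, one option is to flag every disparity that falls outside a tolerance band around 1. The band of 0.8 to 1.25 used below is an assumption on our part (it mirrors the common "four-fifths" rule of thumb) and is not a setting taken from this notebook or from MOC. The values are the `female` row of the sample-set table above.

```python
# Disparities for gender=female on the sample set (from the table above)
sample_female = {
    "ppr_disparity": 0.5,
    "pprev_disparity": 0.888889,
    "precision_disparity": 1.36,
    "fdr_disparity": 0.756757,
    "for_disparity": 1.609756,
    "fpr_disparity": 0.764807,
    "fnr_disparity": 1.32,
    "tpr_disparity": 0.8976,
    "tnr_disparity": 1.150037,
    "npv_disparity": 0.915896,
}

LOW, HIGH = 0.8, 1.25  # assumed tolerance band, not an official threshold

flagged = {
    name: value for name, value in sample_female.items()
    if not LOW <= value <= HIGH
}
print(flagged)
```

Since the sample set contains only 72 female records, these ratios are noisy and should be read with that in mind.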

**V - Conforming Model Code to MOC Requirements**

Conformance is best demonstrated through an example. Let's look at the code below:

In [73]:
import pandas as pd
import numpy as np
import pickle
import copy
from aequitas.preprocessing import preprocess_input_df
from aequitas.group import Group
from aequitas.bias import Bias


# modelop.init
def begin():
    
    global logreg_classifier
    
    # load pickled logistic regression model
    with open("logreg_classifier.pickle", "rb") as f:
        logreg_classifier = pickle.load(f)

    
# modelop.score
def action(data):
    
    # Turn data into DataFrame
    data = pd.DataFrame([data])
    
    # There are only two unique values in data.number_people_liable.
    # Treat it as a categorical feature
    data.number_people_liable = data.number_people_liable.astype('object')

    predictive_features = [
        'duration_months', 'credit_amount', 'installment_rate',
        'present_residence_since', 'age_years', 'number_existing_credits',
        'checking_status', 'credit_history', 'purpose', 'savings_account',
        'present_employment_since', 'debtors_guarantors', 'property',
        'installment_plans', 'housing', 'job', 'number_people_liable',
        'telephone', 'foreign_worker'
    ]
    
    data["predicted_score"] = logreg_classifier.predict(data[predictive_features])
    
    # MOC expects the action function to be a *yield* function
    yield data.to_dict(orient="records")


# modelop.metrics
def metrics(data):
    
    data = pd.DataFrame(data)

    # To measure Bias towards gender, filter DataFrame
    # to "score", "label_value" (ground truth), and
    # "gender" (protected attribute)
    data_scored = data[["score", "label_value", "gender"]]

    # Process DataFrame
    data_scored_processed, _ = preprocess_input_df(data_scored)

    # Group Metrics
    g = Group()
    xtab, _ = g.get_crosstabs(data_scored_processed)

    # Absolute metrics, such as 'tpr', 'tnr','precision', etc.
    absolute_metrics = g.list_absolute_metrics(xtab)

    # DataFrame of calculated absolute metrics for each sample population group
    absolute_metrics_df = xtab[
        ['attribute_name', 'attribute_value'] + absolute_metrics].round(2)

    # For example:
    """
        attribute_name  attribute_value     tpr     tnr  ... precision
    0   gender          female              0.60    0.88 ... 0.75
    1   gender          male                0.49    0.90 ... 0.64
    """

    # Bias Metrics
    b = Bias()

    # Disparities calculated in relation to gender, with "male" as the reference group
    bias_df = b.get_disparity_predefined_groups(
        xtab,
        original_df=data_scored_processed,
        ref_groups_dict={'gender': 'male'},
        alpha=0.05, mask_significance=True
    )

    # Disparity metrics added to bias DataFrame
    calculated_disparities = b.list_disparities(bias_df)

    disparity_metrics_df = bias_df[
        ['attribute_name', 'attribute_value'] + calculated_disparities]

    # For example:
    """
        attribute_name	attribute_value    ppr_disparity   precision_disparity
    0   gender          female             0.714286        1.41791
    1   gender          male               1.000000        1.000000
    """

    output_metrics_df = disparity_metrics_df # or absolute_metrics_df

    # Output a JSON object of calculated metrics
    
    # MOC expects the metrics function to be a *yield* function
    yield output_metrics_df.to_dict(orient="records")

There are four main sections that are standard to almost any model in MOC:
1. Library imports
2. `init` function
3. `score` function
4. `metrics` function

**Library** imports are always at the top. We don't need to include all libraries that we used for training and model evaluation. We just need the libraries for processing and scoring.

The **`init`** function runs once per deployment, and is used to load and persist into memory any variable that needs to be accessed at scoring time. For example, the init function is where we load the saved model binary. We make the variable global so it can be accessed from the scoring function.

The **`score`** function is the function that runs anytime we make a scoring (prediction) request. This is where we put our prediction code. We have to remember to include any steps that were not captured by the pipeline, such as feature engineering or re-encoding.

The **`metrics`** function is where model evaluation is carried out. In our example, this is the place where we replicate the calculations of Group and/or Bias metrics.
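
To illustrate the division of labor between `init` and `score`, here is a toy driver. It is purely hypothetical and is not the MOC runtime: `run`, the `threshold` stand-in for a loaded model, and the two input records are all invented for illustration. The point is only that `begin()` is called once, while `action()` is a generator invoked for each incoming record.

```python
import json

def begin():
    """Stand-in for the init function: load state once into a global."""
    global threshold
    threshold = 5000  # hypothetical stand-in for a loaded model

def action(record):
    """Stand-in for the score function: yield one scored record."""
    scored = dict(record)
    scored["predicted_score"] = int(record["credit_amount"] > threshold)
    yield scored

def run(lines):
    """Hypothetical driver: begin() once, then action() per JSON line."""
    begin()
    for line in lines:
        yield from action(json.loads(line))

outputs = list(run([
    '{"id": 0, "credit_amount": 1169}',
    '{"id": 1, "credit_amount": 5951}',
]))
print(outputs)
```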

Let us test our source code to see if we missed anything. We will load input data and scored input data to test both the scoring and metrics functions:

In [74]:
test_sample = pd.read_json('df_baseline.json', lines=True, orient='records')
metrics_sample = pd.read_json('df_baseline_scored.json', lines=True, orient='records')

Let's check that the **`init`** function can load the trained model binary:

In [75]:
begin()

No errors from the **`init`** function. Let us now call the **`score`** function on input data:

In [76]:
scores = next(action(test_sample.iloc[0]))

In [77]:
pd.DataFrame(scores).head()

Unnamed: 0,id,duration_months,credit_amount,installment_rate,present_residence_since,age_years,number_existing_credits,checking_status,credit_history,purpose,...,property,installment_plans,housing,job,number_people_liable,telephone,foreign_worker,gender,label_value,predicted_score
0,687,36,2862,4,3,30,1,A12,A33,A40,...,A124,A143,A153,A173,1,A191,A201,male,0,1


We have scores! Last but not least, let's call the **`metrics`** function on scored data:

In [None]:
bias = next(metrics(metrics_sample))

In [79]:
pd.DataFrame(bias).head()

Unnamed: 0,attribute_name,attribute_value,ppr_disparity,pprev_disparity,precision_disparity,fdr_disparity,for_disparity,fpr_disparity,fnr_disparity,tpr_disparity,tnr_disparity,npv_disparity
0,gender,female,0.504274,1.190763,1.107203,0.887154,1.222807,1.173616,0.841479,1.050198,0.931751,0.970805
1,gender,male,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


**Perfect!**