# Modeling

In this notebook, I used 5-fold cross-validation to compare the performance of four different models. I tried to keep the model types as diverse as possible:

- k-nearest neighbors (non-parametric)
- logistic regression (linear)
- random forest (tree + bagging)
- gradient boosting (tree + boosting)

Further hyperparameter tuning was performed for the most promising model.

## Setup

Load dependencies.

In [1]:
import pickle

import numpy as np
import pandas as pd
import scipy.stats  # import the submodule explicitly; `scipy.stats.*` is used below
from sklearn.compose import make_column_transformer
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, RandomizedSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

Load datasets.

In [2]:
train = pd.read_csv('../data/processed/train.csv')
test = pd.read_csv('../data/processed/test.csv')

In [3]:
X_train, y_train = train.drop(columns=['claim_number', 'fraud']), train['fraud']
X_test = test.drop(columns=['claim_number'])

In [4]:
categorical_features = X_train.columns[X_train.dtypes == object].tolist()
column_transformer = make_column_transformer(
    (OneHotEncoder(), categorical_features),
    remainder="passthrough",
)
standard_scaler = StandardScaler()
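
To make the preprocessing step concrete, here is a minimal sketch of the same one-hot-plus-passthrough transformer on a toy frame. The toy data, the `handle_unknown="ignore"` option, and `sparse_threshold=0` are assumptions added for illustration and are not part of the notebook's setup. With its default settings, `OneHotEncoder` raises an error at transform time when it meets a category it did not see during fitting, which can happen in a cross-validation fold or on the test set.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data; column names and values are illustrative only.
toy = pd.DataFrame({
    "accident_site": ["Highway", "Local", "Parking Lot", "Local"],
    "age_of_vehicle": [3.0, 7.0, 1.0, 5.0],
})
toy_categorical = toy.select_dtypes(exclude="number").columns.tolist()

toy_transformer = make_column_transformer(
    # Assumption: ignore unseen categories instead of raising an error.
    (OneHotEncoder(handle_unknown="ignore"), toy_categorical),
    remainder="passthrough",
    sparse_threshold=0,  # always return a dense array so it is easy to inspect
)
encoded = toy_transformer.fit_transform(toy)
print(encoded.shape)  # one-hot columns come first, then the passthrough column

# A category that was not present during fitting is encoded as all zeros.
unseen = pd.DataFrame({"accident_site": ["Intersection"], "age_of_vehicle": [2.0]})
encoded_unseen = toy_transformer.transform(unseen)
print(encoded_unseen)
```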

## Model Selection

In [5]:
model_dict = {
    "k-nearest-neighbors": KNeighborsClassifier(),
    "logistic-regression": LogisticRegression(),
    "random-forest": RandomForestClassifier(),
    "gradient-boost": GradientBoostingClassifier(),
}

In [6]:
np.random.seed(9394)

records = []
for name, model in model_dict.items():
    if name in ('k-nearest-neighbors', 'logistic-regression'):
        steps = [column_transformer, standard_scaler, model]
    else:
        steps = [column_transformer, model]
    pipeline = make_pipeline(*steps)
    cv_scores = cross_val_score(pipeline, X_train, y_train, cv=5, scoring='roc_auc')
    records.append({
        'name': name,
        'mean_cv_score': np.mean(cv_scores),
        'std_cv_score': np.std(cv_scores),
    })
pd.DataFrame.from_records(records)

| | name | mean_cv_score | std_cv_score |
|---|---|---|---|
| 0 | k-nearest-neighbors | 0.560769 | 0.013607 |
| 1 | logistic-regression | 0.704148 | 0.012404 |
| 2 | random-forest | 0.686333 | 0.005503 |
| 3 | gradient-boost | 0.717776 | 0.011023 |


Gradient boosting had the best performance (mean ROC AUC of 0.718), narrowly ahead of logistic regression (0.704). I tried to see if I could get even better performance from it using some hyperparameter tuning.
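
As a side note on reproducibility, passing `cv=5` for a classifier uses stratified folds without shuffling, and the global NumPy seed only affects estimators that draw from NumPy's global random state. The sketch below shows one way to pin both the folds and the model, using synthetic data from `make_classification` as a stand-in for the claims data. All names in it (`X_demo`, `demo_pipeline`, and so on) are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced binary problem standing in for the fraud data.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, weights=[0.85, 0.15], random_state=0
)

# An explicit splitter makes the fold assignment reproducible and shuffled.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Scaling lives inside the pipeline, so it is refit on each training fold.
demo_pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
demo_scores = cross_val_score(demo_pipeline, X_demo, y_demo, cv=cv, scoring="roc_auc")

summary = {"mean": demo_scores.mean(), "std": demo_scores.std()}
print(summary)
```

Keeping the scaler inside the pipeline, as the loop above also does, avoids leaking statistics from the validation fold into training.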

## Hyperparameter Tuning

Here are the hyperparameters we can tune for `GradientBoostingClassifier`:

- `learning_rate`
- `n_estimators`
- `subsample`
- `min_samples_split`
- `min_samples_leaf`
- `max_depth`
- `max_features`
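
The search space for these hyperparameters is defined with `scipy.stats` distributions, so it helps to know what each one actually samples. The sketch below draws from two of the distributions used in this notebook. The `loguniform` alternative for `learning_rate` is my own suggestion rather than something the notebook uses; it concentrates draws on small values, which is often where useful learning rates sit.

```python
import scipy.stats

# Distributions as used in the search: continuous uniform on [0, 1]
# and integers from 10 up to (but not including) 100.
learning_rate_dist = scipy.stats.uniform()
n_estimators_dist = scipy.stats.randint(10, 100)

# Assumption: a log-uniform alternative for learning_rate (not in the notebook).
log_lr_dist = scipy.stats.loguniform(1e-3, 1.0)

lr_draws = learning_rate_dist.rvs(size=1000, random_state=0)
n_draws = n_estimators_dist.rvs(size=1000, random_state=0)
log_lr_draws = log_lr_dist.rvs(size=1000, random_state=0)

print("uniform range:", lr_draws.min(), lr_draws.max())
print("randint range:", n_draws.min(), n_draws.max())
print("loguniform range:", log_lr_draws.min(), log_lr_draws.max())
```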

In [7]:
np.random.seed(123)

pipeline = make_pipeline(column_transformer, GradientBoostingClassifier())
distributions = {
    'gradientboostingclassifier__learning_rate': scipy.stats.uniform(),
    'gradientboostingclassifier__n_estimators': scipy.stats.randint(10, 100),
    'gradientboostingclassifier__subsample': scipy.stats.uniform(),
    'gradientboostingclassifier__min_samples_split': scipy.stats.uniform(),
    'gradientboostingclassifier__min_samples_leaf': scipy.stats.randint(1, 10),
    'gradientboostingclassifier__max_depth': scipy.stats.randint(1, 5),
    'gradientboostingclassifier__max_features': scipy.stats.randint(1, 20),
}

hparam_tuner = RandomizedSearchCV(pipeline, distributions, n_iter=50, cv=5, scoring='roc_auc')
hparam_tuner = hparam_tuner.fit(X_train, y_train)

In [8]:
pd.DataFrame(
    hparam_tuner.cv_results_,
    columns=[
        'param_gradientboostingclassifier__learning_rate',
        'param_gradientboostingclassifier__n_estimators',
        'param_gradientboostingclassifier__subsample',
        'param_gradientboostingclassifier__min_samples_split',
        'param_gradientboostingclassifier__min_samples_leaf',
        'param_gradientboostingclassifier__max_depth',
        'param_gradientboostingclassifier__max_features',
        'mean_test_score',
        'std_test_score',
        'rank_test_score',
    ],
).rename(
    columns={
        'param_gradientboostingclassifier__learning_rate': 'learning_rate',
        'param_gradientboostingclassifier__n_estimators': 'n_estimators',
        'param_gradientboostingclassifier__subsample': 'subsample',
        'param_gradientboostingclassifier__min_samples_split': 'min_samples_split',
        'param_gradientboostingclassifier__min_samples_leaf': 'min_samples_leaf',
        'param_gradientboostingclassifier__max_depth': 'max_depth',
        'param_gradientboostingclassifier__max_features': 'max_features',
    }
).sort_values(by=['rank_test_score'])

| | learning_rate | n_estimators | subsample | min_samples_split | min_samples_leaf | max_depth | max_features | mean_test_score | std_test_score | rank_test_score |
|---|---|---|---|---|---|---|---|---|---|---|
| 3 | 0.440257 | 71 | 0.849432 | 0.531828 | 3 | 1 | 8 | 0.721221 | 0.011237 | 1 |
| 7 | 0.317285 | 53 | 0.495492 | 0.250455 | 7 | 1 | 8 | 0.718594 | 0.010972 | 2 |
| 26 | 0.664872 | 35 | 0.438214 | 0.403355 | 9 | 2 | 11 | 0.71846 | 0.01274 | 3 |
| 47 | 0.293152 | 49 | 0.792208 | 0.651643 | 9 | 1 | 9 | 0.7166 | 0.009147 | 4 |
| 36 | 0.305229 | 68 | 0.806604 | 0.411569 | 6 | 2 | 2 | 0.715922 | 0.010207 | 5 |
| 45 | 0.322568 | 29 | 0.634442 | 0.629728 | 2 | 2 | 17 | 0.71376 | 0.013558 | 6 |
| 28 | 0.337066 | 31 | 0.440643 | 0.157035 | 5 | 1 | 10 | 0.71244 | 0.012263 | 7 |
| 2 | 0.627317 | 78 | 0.631792 | 0.398044 | 5 | 2 | 15 | 0.711341 | 0.012896 | 8 |
| 32 | 0.565011 | 92 | 0.832837 | 0.263981 | 8 | 1 | 1 | 0.708463 | 0.01282 | 9 |
| 34 | 0.965698 | 93 | 0.769397 | 0.588017 | 1 | 2 | 11 | 0.707308 | 0.011313 | 10 |
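
The top-ranked row can also be read directly from the fitted search object through `best_params_` and `best_score_`. Here is a small sketch on synthetic data, with a deliberately tiny search so it runs quickly. The data, the reduced search space, and the explicit `random_state` arguments are assumptions for illustration.

```python
import numpy as np
import scipy.stats
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for the training data.
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)

search = RandomizedSearchCV(
    GradientBoostingClassifier(random_state=0),
    {
        "learning_rate": scipy.stats.uniform(0.01, 0.5),  # uniform on [0.01, 0.51]
        "n_estimators": scipy.stats.randint(10, 30),
        "max_depth": scipy.stats.randint(1, 4),
    },
    n_iter=5,
    cv=3,
    scoring="roc_auc",
    random_state=0,  # fixes the sampled candidates without a global seed
)
search.fit(X_demo, y_demo)

best_params = search.best_params_  # the rank-1 configuration
best_score = search.best_score_    # its mean cross-validated ROC AUC
print(best_params, best_score)
```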


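The best configuration reached a mean ROC AUC of 0.721, a small gain over the untuned 0.718 that is well within one standard deviation across folds (about 0.011). Use the best model to make predictions on the test set.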

In [9]:
best_model = hparam_tuner.best_estimator_
probs = best_model.predict_proba(X_test)
df = pd.DataFrame({'claim_number': test['claim_number'], 'fraud': probs[:, 1]})
df.to_csv("../submission.csv", index=False)

Save the best model for deployment.

In [10]:
with open('../models/best_model.pickle', 'wb') as f:
    pickle.dump(best_model, f)
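
For completeness, here is a sketch of the round trip: saving a fitted model and loading it back, then checking that the restored copy gives the same probabilities. It uses a small hypothetical model and a temporary directory rather than the notebook's `best_model` and path. Two general caveats apply to pickled scikit-learn models: load them with the same library version used to save them, and only unpickle files from a trusted source.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical stand-in for the tuned pipeline.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 3))
y_demo = (X_demo[:, 0] > 0).astype(int)
demo_model = LogisticRegression().fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_model.pickle")
    with open(path, "wb") as f:
        pickle.dump(demo_model, f)
    with open(path, "rb") as f:
        restored_model = pickle.load(f)

# The restored model should reproduce the original predictions.
same_predictions = np.allclose(
    demo_model.predict_proba(X_demo), restored_model.predict_proba(X_demo)
)
print(same_predictions)
```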

## Feature Importance

In [11]:
def add_dummies(df, categorical_features):
    dummies = pd.get_dummies(df[categorical_features])
    res = pd.concat([dummies, df], axis=1)
    res = res.drop(categorical_features, axis=1)
    return res.columns

feature_names = add_dummies(X_train, categorical_features)
importances = best_model.steps[-1][1].feature_importances_
pd.DataFrame(
    {'feature_name': feature_names, 'importance': importances}
).sort_values(by=['importance', 'feature_name'], ascending=False)

| | feature_name | importance |
|---|---|---|
| 26 | past_num_of_claims | 0.147125 |
| 23 | annual_income | 0.139702 |
| 6 | accident_site_Parking Lot | 0.127698 |
| 24 | high_education_ind | 0.107904 |
| 27 | witness_present_ind | 0.085897 |
| 30 | claim_est_payout | 0.081348 |
| 21 | marital_status | 0.062094 |
| 22 | safty_rating | 0.05514 |
| 25 | address_change_ind | 0.047272 |
| 31 | age_of_vehicle | 0.037471 |
