# Modeling

In this notebook, I used a 5-fold cross-validation to examine the performance four different models. I tried to keep the model types as diverse as possible:

- k-nearest neighbors (non-parametric)
- logistic regression (linear)
- random forest (tree + bagging)
- gradient boosting (tree + boosting)

## Setup

Import libraries.

In [19]:
import numpy as np
import pandas as pd
from imblearn.pipeline import make_pipeline
from imblearn.over_sampling import SMOTE
from sklearn.compose import make_column_transformer
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_score, GridSearchCV, RandomizedSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from xgboost import XGBClassifier

Load datasets.

In [20]:
df_train = pd.read_csv("../data/processed/train.csv")
df_val = pd.read_csv("../data/processed/val.csv")
df_test = pd.read_csv("../data/processed/test.csv")

In [21]:
X_train = df_train.drop(columns=["claim_number", "fraud"])
y_train = df_train["fraud"]
X_val = df_val.drop(columns=["claim_number", "fraud"])
y_val = df_val["fraud"]
X_test = df_test.drop(columns=["claim_number"])

## Model Selection

In [22]:
categorical_features = X_train.columns[X_train.dtypes == object].tolist()
column_transformer = make_column_transformer(
    (OneHotEncoder(), categorical_features),
    remainder="passthrough",
)
scaler = StandardScaler()

Some numerical features are measured in very different scales, so they should be re-scaled.

In [26]:
def modeling(X_train, y_train, X_val, y_val, steps):
    pipeline = make_pipeline(*steps)
    pipeline.fit(X_train, y_train)
    y_val_pred = pipeline.predict_proba(X_val)[:, 1]
    metric = roc_auc_score(y_val, y_val_pred)
    if isinstance(pipeline._final_estimator, RandomizedSearchCV) or isinstance(pipeline._final_estimator, GridSearchCV):
        print(f"Best params: {pipeline._final_estimator.best_params_}")
    print(f"AUC score: {metric}")
    return pipeline

### K-Nearest Neighbor

KNN has two hyperparameters: the number of neighbors, and whether all points in each neighborhood are weighted equally or weighted by the inverse of their distance. Since the number of hyperparameters is small. A grid search is used to find the optimal hyperparameter values.

In [27]:
param_grid = {
    "n_neighbors": [5, 10, 25, 50],
    "weights": ["uniform", "distance"],
}

knn_clf = GridSearchCV(
    KNeighborsClassifier(),
    param_grid=param_grid,
    n_jobs=-1,
    cv=5,
    scoring="roc_auc",
)

knn_pipeline = modeling(X_train, y_train, X_val, y_val, [column_transformer, scaler, knn_clf])

Best params: {'n_neighbors': 10, 'weights': 'distance'}
AUC score: 0.628334760961194


### Logistic Regression

For logistic regression, there is no hyperparameter to tune. I choose to impose an L1 penalty because that will enable variable selection.

In [29]:
lr_clf = LogisticRegression(penalty="l1", solver="liblinear")
lr_pipeline = modeling(X_train, y_train, X_val, y_val, [column_transformer, scaler, lr_clf])

AUC score: 0.7155579407942609


In [260]:
def add_dummies(df, categorical_features):
    dummies = pd.get_dummies(df[categorical_features])
    res = pd.concat([dummies, df], axis=1)
    res = res.drop(categorical_features, axis=1)
    return res.columns

feature_names = add_dummies(X_train, categorical_features)

pd.DataFrame({
    "feature_name": feature_names,
    "coefficient": lr_pipeline._final_estimator.coef_[0]
}).sort_values(by="coefficient", ascending=False)

Unnamed: 0,feature_name,importance
16,annual_income,0.730238
19,past_num_of_claims,0.284873
18,address_change_ind,0.196455
24,age_of_vehicle,0.141075
28,longitude,0.136628
4,accident_site_Highway,0.094844
0,gender_F,0.051312
3,living_status_Rent,0.040397
22,policy_report_filed_ind,0.039673
9,channel_Phone,0.023839


### XGBoost

Since there are many hyperparameters in XGBoost, I decide to use a randomized search for hyperparameter tuning.

In [31]:
param_grid = {
    "max_depth": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "learning_rate": [0.001, 0.01, 0.1, 0.2, 0.3],
    "subsample": [0.5, 0.6, 0.7, 0.8, 0.9, 1.0],
    "colsample_bytree": [0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0],
    "colsample_bylevel": [0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0],
    "min_child_weight": [0.5, 1.0, 3.0, 5.0, 7.0, 10.0],
    "gamma": [0, 0.25, 0.5, 1.0],
    "n_estimators": [10, 20, 40, 60, 80, 100, 150, 200]
}

xgb_clf = RandomizedSearchCV(
    XGBClassifier(),
    param_distributions=param_grid,
    n_iter=50,
    n_jobs=-1,
    cv=5,
    random_state=23,
    scoring="roc_auc",
)

xgb_pipeline = modeling(X_train, y_train, X_val, y_val, [column_transformer, scaler, xgb_clf])

Best params: {'subsample': 0.7, 'n_estimators': 100, 'min_child_weight': 7.0, 'max_depth': 1, 'learning_rate': 0.3, 'gamma': 0.25, 'colsample_bytree': 1.0, 'colsample_bylevel': 0.8}
AUC score: 0.730342970504935


Although the class imbalance is not very serious in this dataset, I want to see if using SMOTE to synthesize new examples for the minority class can improve the predictive performance. However, it seems that using SMOTE only worsens the performance.

In [32]:
sampler = SMOTE(random_state=42)
xgb_pipeline_smote = modeling(X_train, y_train, X_val, y_val, [column_transformer, scaler, sampler, xgb_clf])

Best params: {'subsample': 1.0, 'n_estimators': 200, 'min_child_weight': 0.5, 'max_depth': 10, 'learning_rate': 0.1, 'gamma': 0.25, 'colsample_bytree': 0.5, 'colsample_bylevel': 0.6}
AUC score: 0.696632985585841


Save the XGBoost model (without SMOTE), since it has the best performance.

In [262]:
best_model = xgb_pipeline._final_estimator.best_estimator_
steps = [column_transformer, scaler, best_model]
pipeline = make_pipeline(*steps)
y_test_pred = pipeline.predict_proba(X_test)[:, 1]

df = pd.DataFrame({
    "claim_number": df_test["claim_number"],
    "fraud": y_test_pred
})
df.to_csv("../data/submission/submission.csv", index=False)

Look at the feature importance.

In [263]:
pd.DataFrame({
    "feature_name": feature_names,
    "importance": best_model.feature_importances_
}).sort_values(by="importance", ascending=False)

Unnamed: 0,feature_name,importance
6,accident_site_Parking Lot,0.104387
17,high_education_ind,0.093526
19,past_num_of_claims,0.089551
20,witness_present_ind,0.0796
13,age_of_driver,0.077286
18,address_change_ind,0.067784
14,marital_status,0.063995
16,annual_income,0.05885
15,safty_rating,0.049129
24,age_of_vehicle,0.048446
