# Car Insurance Claims Classifier Project
This project builds a classifier on a car insurance dataset from Kaggle to identify drivers who are likely to file an insurance claim.  

> ## Table of Contents
> 1. Data Acquisition
> 2. Data Preparation  
>     a. Train and Test Sets  
>     b. Transformation Pipelines  
> 3. Data Transformation  
> 4. Models  
> 5. Evaluation
> 6. Fine Tuning

In [None]:
import sys
!{sys.executable} -m pip install pandas numpy scikit-learn matplotlib

In [None]:
import pandas as pd
from pathlib import Path
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, cross_val_score, cross_val_predict
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.neighbors import KNeighborsRegressor
from sklearn.linear_model import SGDClassifier, LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score
import numpy as np

%matplotlib inline
RANDOM_STATE=2021

### 1. Data Acquisition
The data used for this project is available on [this Kaggle page](https://www.kaggle.com/sagnik1511/car-insurance-data). After saving the dataset locally, we load it from the filesystem into a DataFrame.

In [None]:
def get_data():
    # Load the locally saved dataset
    file_path = Path('./Car_Insurance_Claim.csv')
    df = pd.read_csv(file_path)

    df = df.drop(['ID'], axis=1)  # drop the row identifier

    # Convert binary 0/1 columns to booleans (anything other than 1.0 becomes False)
    df['VEHICLE_OWNERSHIP'] = (df['VEHICLE_OWNERSHIP']==1.0)
    df['MARRIED'] = (df['MARRIED']==1.0)
    df['CHILDREN'] = (df['CHILDREN']==1.0)
    df['OUTCOME'] = (df['OUTCOME']==1.0)

    # Treat the postal code as a categorical string rather than a number
    df['POSTAL_CODE'] = df['POSTAL_CODE'].astype(str)
    return df
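One detail of the `== 1.0` conversion is worth checking on a toy column: a missing value compares as `False`, so it becomes indistinguishable from a genuine 0. The sketch below uses a made-up `MARRIED` series (the values are illustrative, not taken from the dataset) and counts missing entries before converting.

```python
import numpy as np
import pandas as pd

# Hypothetical 0/1 indicator column with one missing entry.
married = pd.Series([1.0, 0.0, np.nan], name="MARRIED")

n_missing = int(married.isna().sum())  # count missing values before converting
as_bool = married == 1.0               # NaN == 1.0 evaluates to False

print(n_missing)
print(as_bool.tolist())
```

If a binary column turns out to contain missing values, it may be worth deciding explicitly how to handle them instead of relying on the comparison.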

In [None]:
df = get_data()

### 2. Data Preparation

In [1]:
def split_data_set(df):
    # Hold out 20% of the rows as a test set
    train, test = train_test_split(df, test_size=.2, random_state=RANDOM_STATE)

    X_train = train.drop(['OUTCOME'], axis=1)  # features only
    y_train = (train['OUTCOME']==1.0).values   # labels as a flat boolean ndarray

    X_test = test.drop(['OUTCOME'], axis=1)
    y_test = (test['OUTCOME']==1.0).values

    return X_train, y_train, X_test, y_test

In [None]:
#get split train and test data
X_train, y_train, X_test, y_test = split_data_set(df)
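If the claim classes turn out to be imbalanced, one optional variation is a stratified split, which keeps the class proportions equal in both subsets. This is a sketch on synthetic labels, not the split used in this notebook; `stratify` is a real `train_test_split` parameter, while the data is invented.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced labels: 80 negatives and 20 positives (illustrative only).
y = np.array([False] * 80 + [True] * 20)
X = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=2021, stratify=y
)

# Both subsets keep the 20% positive rate of the full data.
print(y_tr.mean(), y_te.mean())
```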

### 3. Data Transformation
Next, we'll write a custom transformer that selects columns from a `pd.DataFrame`.  Then, we'll build numerical and categorical pipelines for the respective data types.

Below are the transforms that will be applied:
    
**Categorical Variables**
- One hot encoding

**Numerical Variables**
- Impute missing values with the median of that column
- Standardize each column by subtracting its mean and dividing by its standard deviation 
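To make these transforms concrete, here is a small NumPy-only sketch of what each one computes on toy values. It mirrors the arithmetic of median imputation, standardization with the population standard deviation, and one hot encoding; the arrays are invented for illustration.

```python
import numpy as np

# Numerical column with one missing value.
x = np.array([1.0, 2.0, np.nan, 4.0, 100.0])

median = np.nanmedian(x)                       # median of the observed values
x_imputed = np.where(np.isnan(x), median, x)   # fill the gap with the median

mean = x_imputed.mean()
std = x_imputed.std()                          # population standard deviation (ddof=0)
z = (x_imputed - mean) / std                   # z = (x - mean) / std

# Categorical column: one indicator column per distinct category.
vehicle = np.array(["sedan", "sports car", "sedan"])
categories = np.unique(vehicle)                # sorted unique categories
one_hot = (vehicle[:, None] == categories[None, :]).astype(int)

print(median, mean)
print(z.round(3))
print(one_hot)
```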

In [None]:
class DataFrameSelector(BaseEstimator, TransformerMixin): 
    def __init__(self, attribute_names):
        self.attribute_names = attribute_names 
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X[self.attribute_names].values
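As a quick check of what the selector returns, the sketch below repeats the class so that it runs on its own and applies it to a toy frame. The column names `a`, `b`, and `c` are placeholders.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class DataFrameSelector(BaseEstimator, TransformerMixin):
    def __init__(self, attribute_names):
        self.attribute_names = attribute_names
    def fit(self, X, y=None):
        return self  # nothing to learn; selection is stateless
    def transform(self, X):
        return X[self.attribute_names].values  # ndarray of the chosen columns

toy = pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, 4.0], "c": ["x", "y"]})
selected = DataFrameSelector(["a", "b"]).fit_transform(toy)

print(type(selected).__name__, selected.shape)
```

The result is a plain NumPy array, so column names are no longer available to later pipeline steps.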
    

In [None]:
def build_pipeline(df):
    #numerical pipeline
    numerics = ['int64', 'float64']
    non_numerics = ['object', 'bool']
    
    num_cols = df.select_dtypes(include=numerics).columns
    cat_cols = df.select_dtypes(include=non_numerics).columns
    
    num_pipeline = Pipeline([
         ('selector', DataFrameSelector(num_cols)), #select num cols
         ('imputer', SimpleImputer(strategy="median")), #impute the missing values and fill with median
         ('std_scaler', StandardScaler()), # z = (x-u)/std(x)   
    ])
    
    cat_pipeline = Pipeline([
         ('selector', DataFrameSelector(cat_cols)),
         ('one_hot_encoder', OneHotEncoder()),
    ])
        
    full_pipeline = FeatureUnion(transformer_list=[
         ("num_pipeline", num_pipeline),
         ("cat_pipeline", cat_pipeline),
    ])
    return full_pipeline
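For comparison, the same kind of preprocessing can be sketched with scikit-learn's `ColumnTransformer`, which routes columns to transformers without a custom selector. This is an alternative formulation, not the pipeline used in this notebook; the toy column names are invented, and `handle_unknown="ignore"` is an added assumption so that categories unseen during fitting do not raise an error.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy frame; the column names and values are illustrative, not taken from the dataset.
toy = pd.DataFrame({
    "score": [0.6, np.nan, 0.4, 0.8],
    "mileage": [12000.0, 9000.0, np.nan, 15000.0],
    "vehicle": ["sedan", "sports", "sedan", "sedan"],
    "married": [True, False, True, False],
})
num_cols = ["score", "mileage"]
cat_cols = ["vehicle", "married"]

preprocess = ColumnTransformer([
    ("num", Pipeline([
        ("imputer", SimpleImputer(strategy="median")),
        ("std_scaler", StandardScaler()),
    ]), num_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
])

prepared = preprocess.fit_transform(toy)
print(prepared.shape)  # 2 scaled numeric columns + 2 + 2 one hot columns
```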

In [None]:
#instantiate the transformer pipeline
pipeline = build_pipeline(X_train)

In [None]:
X_train_prepared = pipeline.fit_transform(X_train) # fit the transformers on the training set and transform it
X_test_prepared = pipeline.transform(X_test) # reuse the medians, scales, and categories learned from the training set
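Preprocessing statistics are normally learned from the training set alone and then applied unchanged to the test set. The toy sketch below shows why: a scaler refit on the test data places the same values on a different scale than the one the model was trained on. The numbers are invented for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

train = np.array([[0.0], [2.0], [4.0]])
test = np.array([[2.0], [6.0]])

scaler = StandardScaler().fit(train)    # mean and std come from the training data
test_scaled = scaler.transform(test)    # test data is only transformed

# Refitting on the test set uses the test mean and std instead.
refit_scaled = StandardScaler().fit_transform(test)

print(scaler.mean_)
print(test_scaled.ravel())
print(refit_scaled.ravel())
```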

### 4. Models

We're going to fit each of the models on the full training set first, and then obtain cross-validated predictions for each model to evaluate how well it generalizes.

First, we instantiate each of the types of models that we want to compare.

In [None]:
from sklearn.neighbors import KNeighborsClassifier # classifier counterpart of the KNN regressor

sgd_clf = SGDClassifier(random_state=RANDOM_STATE) #random state for reproducibility since SGD is stochastic
knn_clf = KNeighborsClassifier(n_neighbors=1) # each sample takes the label of its single nearest neighbor
logit_clf = LogisticRegression()
linear_svm_clf = LinearSVC(C=1, loss="hinge", random_state=RANDOM_STATE, max_iter=10000) #random state for reproducibility

Next, we'll train each model on the training set.

In [None]:
sgd_clf.fit(X_train_prepared, y_train)
knn_clf.fit(X_train_prepared, y_train)
logit_clf.fit(X_train_prepared, y_train)
linear_svm_clf.fit(X_train_prepared, y_train)

Now, we can get predictions for the training set from each model.

In [None]:
y_train_pred_sgd = sgd_clf.predict(X_train_prepared)
y_train_pred_knn = knn_clf.predict(X_train_prepared)
y_train_pred_logit = logit_clf.predict(X_train_prepared)
y_train_pred_lin_svm = linear_svm_clf.predict(X_train_prepared)

Next, we get the predictions for each model using 3-fold cross validation.

In [None]:
y_train_pred_sgd_cv = cross_val_predict(sgd_clf, X_train_prepared, y_train, cv=3)
y_train_pred_knn_cv = cross_val_predict(knn_clf, X_train_prepared, y_train, cv=3, method="predict")
y_train_pred_logit_cv = cross_val_predict(logit_clf, X_train_prepared, y_train, cv=3, method="predict")
y_train_pred_lin_svm_cv = cross_val_predict(linear_svm_clf, X_train_prepared, y_train, cv=3, method="predict")
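To show what `cross_val_predict` is doing, the sketch below reproduces it by hand on synthetic data: each sample's prediction comes from a clone of the model that never saw that sample during fitting. It assumes the default behavior for classifiers, where an integer `cv` means stratified folds without shuffling.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_predict

# Synthetic stand-in for the prepared training data.
X, y = make_classification(n_samples=150, n_features=5, random_state=0)
clf = LogisticRegression()

manual = np.empty_like(y)
for train_idx, held_out_idx in StratifiedKFold(n_splits=3).split(X, y):
    fold_clf = clone(clf).fit(X[train_idx], y[train_idx])      # fit on two folds
    manual[held_out_idx] = fold_clf.predict(X[held_out_idx])   # predict the third

same = np.array_equal(manual, cross_val_predict(clf, X, y, cv=3))
print(same)
```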

### 5. Evaluation

Now that all of the models are trained, we can get the evaluation metrics for both sets of predictions.

In [None]:
def get_model_metrics(y_train, y_train_pred):
    confusion = confusion_matrix(y_train, y_train_pred)
    accuracy = accuracy_score(y_train, y_train_pred)
    precision = precision_score(y_train, y_train_pred)
    recall = recall_score(y_train, y_train_pred)
    f1 = f1_score(y_train, y_train_pred)
    
    return [confusion, accuracy, precision, recall, f1]
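For reference, each of these metrics can be derived from the four cells of a binary confusion matrix. The worked example below uses hypothetical counts (not results from this dataset) laid out the way scikit-learn orders them.

```python
# Hypothetical confusion-matrix counts, in scikit-learn's layout:
# [[tn, fp],
#  [fn, tp]]
tn, fp, fn, tp = 50, 10, 5, 35

accuracy = (tp + tn) / (tp + tn + fp + fn)   # share of all predictions that are correct
precision = tp / (tp + fp)                   # share of predicted claims that are real claims
recall = tp / (tp + fn)                      # share of real claims that were predicted
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(accuracy, round(precision, 3), recall, round(f1, 3))
```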

In [None]:
#metrics for models trained with the training set
sgd_metrics = get_model_metrics(y_train, y_train_pred_sgd)
knn_metrics = get_model_metrics(y_train, y_train_pred_knn)
logit_metrics = get_model_metrics(y_train, y_train_pred_logit)
linear_svm_metrics = get_model_metrics(y_train, y_train_pred_lin_svm)

#metrics for models trained with CV on the training set
sgd_cv_metrics = get_model_metrics(y_train, y_train_pred_sgd_cv)
knn_cv_metrics = get_model_metrics(y_train, y_train_pred_knn_cv)
logit_cv_metrics = get_model_metrics(y_train, y_train_pred_logit_cv)
linear_svm_cv_metrics = get_model_metrics(y_train, y_train_pred_lin_svm_cv)

And now, we can print the results and compare the models' performances.

In [None]:
def print_model_metrics(title, metrics):
    conf, acc, prec, rec, f1 = metrics
    print(f"Performance metrics for {title}:\n")
    print(conf, "\n")
    print("Accuracy: ", acc)
    print("Precision: ", prec)
    print("Recall: ", rec)
    print("F1 Score: ", f1, "\n\n")

In [None]:
print_model_metrics("SGD Classifier-No CV", sgd_metrics)
print_model_metrics("SGD Classifier-With CV", sgd_cv_metrics)

As we can see here, the SGD classifier is generalizing well, since we aren't seeing much difference between its performance on the cross-validated predictions and on the training set.  Let's see if any of the other models has better predictive power.  Next up is the KNN classifier.

In [None]:
print_model_metrics("KNN Classifier-No CV", knn_metrics)
print_model_metrics("KNN Classifier-With CV", knn_cv_metrics)

Here, we can see that the KNN model is badly overfitting and won't generalize to unseen data.  We can consider simplifying the model, for example by increasing `n_neighbors` or removing some features, or opt to eliminate this model entirely.
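With a single neighbor, every training point is its own nearest neighbor, so training accuracy is perfect by construction and says nothing about generalization. The sketch below illustrates this on synthetic data with some label noise and compares a few hypothetical values of `n_neighbors`; the scores it prints are not results for the insurance dataset.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data with 10% randomly flipped labels.
X, y = make_classification(n_samples=300, n_features=8, flip_y=0.1, random_state=0)

for k in (1, 5, 15):
    knn = KNeighborsClassifier(n_neighbors=k)
    train_acc = knn.fit(X, y).score(X, y)              # accuracy on data the model has seen
    cv_acc = cross_val_score(knn, X, y, cv=3).mean()   # accuracy on held-out folds
    print(k, round(train_acc, 3), round(cv_acc, 3))

train_acc_k1 = KNeighborsClassifier(n_neighbors=1).fit(X, y).score(X, y)
```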

Next, let's check out how the logistic regression classifier fared.

In [None]:
print_model_metrics("Logit Classifier-No CV", logit_metrics)
print_model_metrics("Logit Classifier-With CV", logit_cv_metrics)

Good news!  The logit classifier is generalizing well, ***and***  performs better on all evaluation metrics than the SGD classifier!  This looks like a winner thus far, but we still aren't achieving particularly high accuracy.  Let's see whether a linear support vector machine can make better predictions.

In [None]:
print_model_metrics("Linear SVM Classifier-No CV", linear_svm_metrics)
print_model_metrics("Linear SVM Classifier-With CV", linear_svm_cv_metrics)

This is performing similarly to (but slightly better than) the logistic regression classifier.  Let's take a closer look at the most promising models and see if we can't find better hyperparameters.

### 6. Fine Tuning


Now that we've identified some promising models, we can begin fine tuning using GridSearchCV, which trains models with different combinations of hyperparameters. Here we're just adjusting the regularization coefficient, C, for the linear support vector machine model from above.

In [None]:
params = {"C":[1e-3, 1e-2, 1e-1, 1, 10, 100, 1e3, 1e4]}
grid_search = GridSearchCV(linear_svm_clf, params, cv=3, scoring="f1", refit=True) # refit=True retrains the best estimator on the whole training set
grid_search.fit(X_train_prepared, y_train)
grid_search.best_params_ #show the best hyperparams from the grid
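Beyond `best_params_`, the fitted search object keeps the score of every candidate in `cv_results_`, which helps judge how sensitive the model is to C. The sketch below runs a small search on synthetic data; the grid and dataset are placeholders, and `dual=True` is set explicitly only to keep the example independent of version-specific defaults.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC

# Synthetic stand-in for the prepared training data.
X, y = make_classification(n_samples=200, n_features=6, random_state=0)

grid = {"C": [0.01, 0.1, 1, 10]}
search = GridSearchCV(
    LinearSVC(dual=True, max_iter=10000, random_state=0),
    grid, cv=3, scoring="f1",
)
search.fit(X, y)

# One mean cross-validated F1 score per candidate value of C.
for params, score in zip(search.cv_results_["params"],
                         search.cv_results_["mean_test_score"]):
    print(params, round(score, 3))

print(search.best_params_)
```

Because `refit` defaults to `True`, `search.best_estimator_` is already fitted on all of `X` once `fit` returns.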

At first, we started with order-of-magnitude differences in the hyperparameter grid.  Now, let's get a little more precise with our search grid by choosing closer values of C to test.

In [None]:
params = {"C":[75, 90, 100, 110, 125, 150]}
grid_search = GridSearchCV(linear_svm_clf, params, cv=3, scoring="f1", refit=True) # refit=True retrains the best estimator on the whole training set
grid_search.fit(X_train_prepared, y_train)
grid_search.best_params_ #show the best hyperparams from the grid

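We can see that the value of 100 found by the coarse search is still the optimal value of C.  Now, we can get the best estimator from the grid search. Because `refit=True`, it has already been retrained on the whole training set, so the explicit `fit` call below simply repeats that step.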

In [None]:
best_svm_clf = grid_search.best_estimator_
best_svm_clf.fit(X_train_prepared, y_train)

Get the predictions on the training set, both directly and using CV.

In [None]:
y_train_pred_best_svm = best_svm_clf.predict(X_train_prepared)
y_train_pred_best_svm_cv = cross_val_predict(best_svm_clf, X_train_prepared, y_train, cv=3, method="predict")

Get evaluation metrics for the best SVM classifier from the grid search.

In [None]:
best_svm_metrics = get_model_metrics(y_train, y_train_pred_best_svm) # predictions from the tuned model, not the original SVM
best_svm_cv_metrics = get_model_metrics(y_train, y_train_pred_best_svm_cv)

In [None]:
print_model_metrics("Best Linear SVM Classifier-No CV", best_svm_metrics)
print_model_metrics("Best Linear SVM Classifier-With CV", best_svm_cv_metrics)

As we can see, there's very similar performance on the training set and cross-validated models, suggesting a high degree of generalizability.  Now, all that remains is to test the performance on the test set.

In [None]:
y_test_pred = best_svm_clf.predict(X_test_prepared)
test_metrics = get_model_metrics(y_test, y_test_pred)
print_model_metrics("Linear SVM--Test Set", test_metrics)

These are encouraging results!  We're correctly predicting 85.5% of all outcomes (accuracy)! Of the customers we classified as likely to file a claim, 77.6% did file claims (precision).  Of all of the customers who filed claims, we correctly identified 72.4% (recall).
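The precision and recall quoted above also determine the F1 score on the test set. As a quick worked calculation from those two reported figures:

```python
precision = 0.776  # reported test-set precision
recall = 0.724     # reported test-set recall

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 3))  # → 0.749
```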