# IART - 2ND GROUP PROJECT (CAR INSURANCE CLAIM PREDICTION)

## Introduction

- We were asked to build a supervised machine learning model to predict whether a policyholder will file a claim in the next 6 months.

- To do that, we had a database containing information such as policy tenure, age of the car, age of the car owner, population density of the city, make and model of the car, power, engine type, etc. Some of these attributes might be relevant for the model's prediction.

- First, we had to sanitize the database, removing outliers and irrelevant attributes.

## Database

- First, we inspected and described the database, to learn which attributes it had and what values they took.

In [None]:
import pandas as pd

car_data = pd.read_csv("data/train.csv")

# Learn more about the training data attributes
print(car_data.head())
print(car_data.describe())

- Next, we removed the attributes that were not relevant for the prediction, such as the policy ID and several car equipment and specification columns, and saved and printed the result to check that the database was correct.

In [None]:
# Drop columns that are not useful for the prediction
car_data = car_data.drop(columns=['policy_id', 'is_adjustable_steering', 'is_tpms',
                                  'is_parking_camera', 'rear_brakes_type', 'cylinder',
                                  'transmission_type', 'turning_radius', 'is_rear_window_washer',
                                  'is_power_door_locks', 'is_central_locking',
                                  'is_day_night_rear_view_mirror'])

# Check transformed data
car_data.to_csv('./data/cleansed_car_data.csv', index=False)
print(car_data.head())

- After that, we checked for null values. Fortunately, our database did not have any.

In [None]:
n_of_null_values = car_data.isnull().sum()
print(n_of_null_values)

- Besides that, we removed duplicated rows, so that we would not use repeated information.

In [None]:
# Remove duplicate entries
car_data = car_data.drop_duplicates()

- Then, we label encoded the categorical (string) columns, mapping each category to an integer, so that the data would be better suited for the training models.

In [None]:
from sklearn import preprocessing

# Label encode the categorical columns (LabelEncoder maps each category to an integer),
# to better suit the training models
for column in car_data.columns:
    if car_data[column].dtype == 'object':  # String-valued (categorical) column
        label_encoder = preprocessing.LabelEncoder()
        car_data[column] = label_encoder.fit_transform(car_data[column])
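
- As a small illustration of what `LabelEncoder` does to one column, here is a sketch with made-up fuel-type values (they are assumptions for the example, not taken from the dataset). Each distinct string is replaced by its position in the sorted list of classes:

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical fuel-type values, for illustration only
fuel_types = ["Petrol", "Diesel", "CNG", "Petrol"]

encoder = LabelEncoder()
codes = encoder.fit_transform(fuel_types)

# classes_ holds the sorted unique labels; each code is an index into it
print(encoder.classes_.tolist())  # ['CNG', 'Diesel', 'Petrol']
print(codes.tolist())             # [2, 1, 0, 2]
```

- Note that the resulting integers carry an artificial order (here CNG < Diesel < Petrol), which some models may interpret as meaningful.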

- Afterwards, we checked whether the values of the age-related columns and of the policy tenure were between 0 and 1, since the dataset description indicated that all age columns were normalized. Only the rows with an out-of-range policy tenure were removed.

In [None]:
# Count the values not between 0 and 1; only out-of-range policy tenures are removed
subset = car_data[~car_data['age_of_policyholder'].between(0, 1)]
print("\nNumber of entries where policy holder age not between 0 and 1: "+str(len(subset)))
# car_data = car_data[car_data['age_of_policyholder'].between(0, 1)]

subset = car_data[~car_data['age_of_car'].between(0, 1)]
print("\nNumber of entries where age of car not between 0 and 1: "+str(len(subset)))
# car_data = car_data[car_data['age_of_car'].between(0, 1)]

subset = car_data[~car_data['policy_tenure'].between(0, 1)]
print("\nNumber of entries where policy tenure not between 0 and 1: "+str(len(subset)))
car_data = car_data[car_data['policy_tenure'].between(0, 1)]
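
- The range check relies on `Series.between`, which is inclusive on both ends by default. A minimal sketch with hypothetical values (not taken from the dataset):

```python
import pandas as pd

# Hypothetical normalized values, two of them outside the valid range
ages = pd.Series([0.0, 0.5, 1.0, 1.2, -0.1])

in_range = ages.between(0, 1)   # inclusive on both ends by default
print(in_range.tolist())        # [True, True, True, False, False]
print(int((~in_range).sum()))   # 2
```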

- Next, we removed outliers using the Z-score and interquartile range (IQR) methods.

In [None]:
import numpy as np

def detect_outliers_zscore(data):
    outliers = []
    threshold = 3
    mean = np.mean(data)
    std = np.std(data)
    for idx, value in enumerate(data):
        z_score = (value - mean) / std
        if np.abs(z_score) > threshold:
            outliers.append(idx)
    return outliers


def detect_outliers_iqr(data):
    outliers = []
    # np.percentile does not need sorted input; sorting here would break the
    # correspondence between positions and DataFrame rows
    q1 = np.percentile(data, 25)
    q3 = np.percentile(data, 75)
    IQR = q3 - q1
    lower_bound = q1 - (1.5 * IQR)
    upper_bound = q3 + (1.5 * IQR)
    for idx, value in enumerate(data):
        if value < lower_bound or value > upper_bound:
            outliers.append(idx)
    return outliers

In [None]:
# Detect and remove outliers using Z-Score and Inter Quartile Range
car_data = car_data.reset_index(drop=True)

for column in ['population_density', 'fuel_type', 'displacement', 'length','width','height','gross_weight']:
    outliers = detect_outliers_zscore(car_data[column])
    print("\nNumber of Z-score outliers in "+column+" column: "+str(len(outliers)))
    # drop() returns a new DataFrame, so the result must be assigned back;
    # resetting the index keeps positions and index labels aligned
    car_data = car_data.drop(outliers).reset_index(drop=True)

    outliers = detect_outliers_iqr(car_data[column])
    print("Number of IQR outliers in "+column+" column: "+str(len(outliers)))
    car_data = car_data.drop(outliers).reset_index(drop=True)
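
- The same two rules can also be written as boolean masks, which stay aligned with the rows of the DataFrame. The sketch below is an alternative formulation on a hypothetical column (the values and the `cleaned` name are assumptions for the example), not the code used in this project:

```python
import pandas as pd

# Hypothetical column: 19 ordinary values plus one extreme value at position 3
values = list(range(10, 29))
values.insert(3, 200)
df = pd.DataFrame({"gross_weight": values})
col = df["gross_weight"]

# Z-score rule: more than 3 population standard deviations from the mean
z_mask = ((col - col.mean()) / col.std(ddof=0)).abs() > 3

# IQR rule: outside [q1 - 1.5 * IQR, q3 + 1.5 * IQR]
q1, q3 = col.quantile(0.25), col.quantile(0.75)
iqr = q3 - q1
iqr_mask = (col < q1 - 1.5 * iqr) | (col > q3 + 1.5 * iqr)

# Boolean masks keep their row alignment, so filtering removes the right rows
cleaned = df[~(z_mask | iqr_mask)].reset_index(drop=True)
print(df.index[z_mask].tolist(), df.index[iqr_mask].tolist())
print(len(df), len(cleaned))
```

- With this toy column, both rules flag only the row holding the value 200.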

- To finish up, we scaled the data so that all columns would have values between 0 and 1.

In [None]:
from sklearn.preprocessing import MinMaxScaler

# Scale data
scaler = MinMaxScaler()
scaled_data = scaler.fit_transform(car_data)
car_data = pd.DataFrame(scaled_data, columns=car_data.columns)
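
- With its default settings, `MinMaxScaler` maps each column to the range 0 to 1 using x' = (x - min) / (max - min). A minimal sketch of that formula on hypothetical values:

```python
import numpy as np

# Hypothetical raw column values
x = np.array([2.0, 4.0, 6.0, 10.0])

# Min-max scaling: x' = (x - min) / (max - min)
x_scaled = (x - x.min()) / (x.max() - x.min())
print(x_scaled.tolist())  # [0.0, 0.25, 0.5, 1.0]
```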

- After all these steps, the database was cleansed. Here is a plot of it (only some columns included):

In [None]:
import seaborn as sb
import matplotlib.pyplot as plt

sb.pairplot(car_data[['policy_tenure','age_of_car','age_of_policyholder','area_cluster', 'is_claim']], hue='is_claim')
plt.show()

- We suspected that the target classes were imbalanced. To check this, we first separated the data into target and feature variables:

In [None]:
# Separate the data into target and feature variables
car_data_target = car_data['is_claim']
car_data_features = car_data.drop('is_claim', axis=1)

- In order to check the class imbalance, we counted the number of 1's and compared it with the number of 0's.
- With this, we concluded that there was an imbalance in the target variable, due to the overwhelming amount of 0's compared to 1's.


In [None]:
# Check for class imbalance - the results indicate that there is a clear imbalance in the target variable, an overwhelming amount of 0's compared to 1's
print(car_data_target.value_counts(normalize=True))

- We also split the data into training and testing sets (70% and 30%, respectively), providing a seed so that the split is reproducible.


In [None]:
from sklearn import model_selection

# Separate data into training and testing, provide seed for consistency
car_data_features_train, car_data_features_test, car_data_target_train, car_data_target_test = model_selection.train_test_split(car_data_features, car_data_target, test_size = 0.3, random_state = 1)
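
- Since the target is imbalanced, one option (not used in this project) is to pass `stratify` to `train_test_split`, so that both splits keep about the same share of each class. A hedged sketch on a hypothetical target:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced target: 90 zeros and 10 ones
y = np.array([0] * 90 + [1] * 10)
X = np.arange(100).reshape(-1, 1)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y
)

# With stratify=y both splits keep roughly the original 10% share of ones
print(y_train.mean(), y_test.mean())
```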

- In order to attenuate the class imbalance, we decided to use class weights. In this case, 'balanced' == n_samples / (n_classes * np.bincount(is_claim)).

In [None]:
import numpy as np
from sklearn.utils import class_weight

# Correct class imbalance using class weights. 'balanced' == n_samples / (n_classes * np.bincount(is_claim))
weights = class_weight.compute_class_weight(class_weight = 'balanced', classes = np.unique(car_data_target_train), y = car_data_target_train)
print(weights[0], weights[1])
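
- To make the 'balanced' formula concrete, here is a worked sketch on a hypothetical target with 9 negatives and 1 positive (the values and the `example_weights` name are assumptions for the example). The rare class receives a much larger weight:

```python
import numpy as np

# Hypothetical target with 9 negatives and 1 positive
y = np.array([0, 0, 0, 0, 0, 0, 0, 0, 0, 1])

n_samples = len(y)
n_classes = len(np.unique(y))

# 'balanced' heuristic: n_samples / (n_classes * np.bincount(y))
example_weights = n_samples / (n_classes * np.bincount(y))

# Class 0: 10 / (2 * 9), class 1: 10 / (2 * 1)
print(example_weights)
```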

## Algorithms for the model

- After the database clean-up, we had to decide which algorithm our model should use. For that, we trained 3 separate algorithms (Decision Trees, K-NN and Logistic Regression) and chose the one with the best results. We chose these algorithms because they are commonly used in classification problems.

- The choice of the algorithm was based on the results for different performance measures, such as accuracy, precision, recall and F1 score. We also plotted a confusion matrix and a ROC curve for each algorithm.

### Decision Trees

- For this part, we used a Decision Tree classifier to predict the target attribute in the training data. We evaluated different parameter combinations of the classifier and chose the best one. The code snippet below shows what was done.

In [None]:
# Here we create a classifier and a grid with the parameter combinations to try.

from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold, learning_curve

dtc = DecisionTreeClassifier()

grid = {
    'criterion': ['gini', 'entropy'],
    'splitter': ['best', 'random'],
    'max_depth': [1, 2, 3, 4, 5],
    'max_features': [1, 2, 3, 4]
}

# The grid search, with the help of the stratified K-fold, will show the best parameters
# to use for the Decision Tree

cross_validation = StratifiedKFold(n_splits=10)

gs_dt_cv = GridSearchCV(dtc, param_grid=grid, cv=cross_validation)
gs_dt_cv.fit(car_data_features_train, car_data_target_train)

print('For Decision Tree:')
print('Best score: {}'.format(gs_dt_cv.best_score_))
print('Best parameters: {}'.format(gs_dt_cv.best_params_))
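
- When no `scoring` argument is given, `GridSearchCV` ranks the candidates with the estimator's default score, which is accuracy for classifiers. On imbalanced data, one alternative (not what was done in this project) is to rank by macro-averaged F1. The sketch below uses synthetic data from `make_classification` as a stand-in for the real dataset, and the names `X_demo`, `y_demo` and `search` are assumptions for the example:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data (about 90% / 10%), standing in for the real dataset
X_demo, y_demo = make_classification(
    n_samples=500, n_features=8, weights=[0.9, 0.1], random_state=1
)

search = GridSearchCV(
    DecisionTreeClassifier(class_weight="balanced", random_state=1),
    param_grid={"max_depth": [1, 2, 3, 4, 5]},
    cv=StratifiedKFold(n_splits=5),
    scoring="f1_macro",  # rank candidates by macro-averaged F1 instead of accuracy
)
search.fit(X_demo, y_demo)

print(search.best_params_, round(search.best_score_, 3))
```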


- After that, we used the classifier with the best parameters and measured its performance on the training data. Here are the results:

In [None]:
from sklearn import metrics
from sklearn.ensemble import VotingClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, ConfusionMatrixDisplay, RocCurveDisplay
import numpy as np
import matplotlib.pyplot as plt

# This part will show all the performance measures (either graphics, or analytics) of the Decision
# Tree Classifier with the best parameters.

print('For decision tree:')

dtc = DecisionTreeClassifier(
    criterion='gini', max_depth=5, max_features=3, splitter='best', class_weight='balanced', 
    random_state=1)

dtc.fit(car_data_features_train, car_data_target_train)

dtc.score(car_data_features_train, car_data_target_train)

# Graphical performance measures for Decision Tree (ROC and Confusion Matrix)

dtc_pred = dtc.predict(car_data_features_train)

cmat = confusion_matrix(car_data_target_train, dtc_pred, labels=dtc.classes_)

disp = ConfusionMatrixDisplay(confusion_matrix=cmat, display_labels=dtc.classes_)

disp.plot()

plt.show()

target_pred_proba = dtc.predict_proba(car_data_features_train)[::,1]

fpr, tpr, _ = metrics.roc_curve(car_data_target_train ,target_pred_proba)

plt.plot(fpr,tpr)
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()

# Analytical performance measures for Decision Tree

accuracy = accuracy_score(car_data_target_train,dtc_pred)

precision = precision_score(car_data_target_train,dtc_pred, average="macro")

recall = recall_score(car_data_target_train,dtc_pred, average="macro")

f1 = f1_score(car_data_target_train,dtc_pred, average="macro")

dt_performance = [f1, fpr, tpr, precision, recall, accuracy]

print(f"Accuracy: {accuracy:.5f}")

print(f"Precision: {precision:.5f}")

print(f"Recall: {recall:.5f}")

print(f"F1 score: {f1:.5f}")

- In fact, the performance measures are low, especially the F1 score. The confusion matrix confirms this: many of the errors are incorrect predictions that the policyholder will not file a claim.

### K-NN

- We did the same for the K-NN algorithm:

In [None]:
from sklearn.neighbors import KNeighborsClassifier

# Similarly to what was done for the Decision Tree, we generate all the possible
# KNeighbors classifiers, varying their parameters.

knc = KNeighborsClassifier()

knn_grid = {
    'weights': ['uniform', 'distance'],
    'n_neighbors': [10, 25, 50],
    'n_jobs': [10, 25, 50],  # Note: n_jobs only sets the number of parallel jobs, not the predictions
}

# After that, the grid search will select the K-NN classifier with the best parameters,
# with the help of the cross-validation created before.

gs_kn_cv = GridSearchCV(knc, param_grid=knn_grid, cv=cross_validation)
gs_kn_cv.fit(car_data_features_train, car_data_target_train)

print('For K-NN')
print('Best score: {}'.format(gs_kn_cv.best_score_))
print('Best parameters: {}'.format(gs_kn_cv.best_params_))

- Here is its performance:

In [None]:
# This part will show all the performance measures (either graphics, or analytics) of the
# K-NN Classifier with the best parameters.

print('For K-NN:')

knc = KNeighborsClassifier(n_jobs=10, n_neighbors=25, weights='uniform')

knc.fit(car_data_features_train, car_data_target_train)

knc.score(car_data_features_train, car_data_target_train)

knn_pred = knc.predict(car_data_features_train)

# Graphical performance measures for K-NN (ROC and Confusion Matrix)

cmat = confusion_matrix(car_data_target_train, knn_pred, labels=knc.classes_)

disp = ConfusionMatrixDisplay(confusion_matrix=cmat, display_labels=knc.classes_)

disp.plot()

plt.show()

target_pred_proba = knc.predict_proba(car_data_features_train)[::,1]

fpr, tpr, _ = metrics.roc_curve(car_data_target_train ,target_pred_proba)

plt.plot(fpr,tpr)
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()

# Analytical performance measures for K-NN

accuracy = accuracy_score(car_data_target_train,knn_pred)

precision = precision_score(car_data_target_train,knn_pred, average="macro")

recall = recall_score(car_data_target_train, knn_pred, average="macro")

f1 = f1_score(car_data_target_train,knn_pred, average="macro")

knn_performance = [f1, fpr, tpr, precision, recall, accuracy]

print(f"Accuracy: {accuracy:.5f}")

print(f"Precision: {precision:.5f}")

print(f"Recall: {recall:.5f}")

print(f"F1 score: {f1:.5f}")

- Since the performance measures are slightly better for K-NN, they could give us the (wrong) idea that the K-NN model had a better performance. However, thanks to the confusion matrix plots, we concluded that the model was always predicting that a policyholder will not file a claim. This leads to the best possible values for true negatives and false positives (maximum and 0, respectively), but the false negatives also reach their maximum possible value. As a result, the model does not fulfil the objective, which is to predict when a policyholder will file a claim.
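
- The effect of a model that always predicts "no claim" can be reproduced with a few lines. The labels below are hypothetical (95 policyholders without a claim and 5 with one), chosen only to show how accuracy can stay high while no claim is ever detected:

```python
import numpy as np

# Hypothetical labels: 95 policyholders without a claim, 5 with a claim
y_true = np.array([0] * 95 + [1] * 5)
y_pred = np.zeros_like(y_true)  # a model that always predicts "no claim"

tn = int(np.sum((y_true == 0) & (y_pred == 0)))
fp = int(np.sum((y_true == 0) & (y_pred == 1)))
fn = int(np.sum((y_true == 1) & (y_pred == 0)))
tp = int(np.sum((y_true == 1) & (y_pred == 1)))

demo_accuracy = (tp + tn) / len(y_true)
demo_recall_claim = tp / (tp + fn)  # share of real claims that were detected

print(tn, fp, fn, tp)                    # 95 0 5 0
print(demo_accuracy, demo_recall_claim)  # 0.95 0.0
```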

### Logistic Regression

- We did the same for Logistic Regression. In this part, some errors appeared, because not every combination of `penalty` and `l1_ratio` values in the grid is accepted by the solver. Here are the results:

In [None]:
from sklearn.linear_model import LogisticRegression

# Similarly to what was done for the Decision Tree and the K-NN, we generate all the possible
# Logistic Regression classifiers, varying the parameters of the grid.

lrc = LogisticRegression()

lrc_grid = {
    'penalty': ['l1', 'l2', 'elasticnet'],
    'l1_ratio': [1, 0, 0.2, 0.5],
    'tol': [0.2, 0.5]
}

# The grid search with cross-validation will select the Logistic Regression classifier with
# the best parameters to predict.

gs_lrc_cv = GridSearchCV(lrc, param_grid=lrc_grid,cv=cross_validation)
gs_lrc_cv.fit(car_data_features_train, car_data_target_train)

print('For Logistic Regression:')
print('Best score: {}'.format(gs_lrc_cv.best_score_))
print('Best parameters: {}'.format(gs_lrc_cv.best_params_))

- Here is the Logistic Regression's performance:

In [None]:
print('For Logistic Regression:')

# This part will show all the performance measures (either graphics, or analytics) of the
# Logistic Regression Classifier with the best parameters.

lrc = LogisticRegression(penalty='l2', tol=0.2, class_weight='balanced', random_state=1)

lrc.fit(car_data_features_train, car_data_target_train)

lrc.score(car_data_features_train, car_data_target_train)

log_pred = lrc.predict(car_data_features_train)

# Graphical performance measures for Logistic Regression (ROC and Confusion Matrix)

cmat = confusion_matrix(car_data_target_train, log_pred, labels=lrc.classes_)

disp = ConfusionMatrixDisplay(confusion_matrix=cmat, display_labels=lrc.classes_)

disp.plot()

plt.show()

target_pred_proba = lrc.predict_proba(car_data_features_train)[::,1]

fpr, tpr, _ = metrics.roc_curve(car_data_target_train ,target_pred_proba)

plt.plot(fpr,tpr)
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()

# Analytical performance measures for Logistic Regression

accuracy = accuracy_score(car_data_target_train,log_pred)

precision = precision_score(car_data_target_train,log_pred, average="macro")

recall = recall_score(car_data_target_train, log_pred, average="macro")

f1 = f1_score(car_data_target_train, log_pred, average="macro")

lr_performance = [f1, fpr, tpr, precision, recall, accuracy]

print(f"Accuracy: {accuracy:.5f}")

print(f"Precision: {precision:.5f}")

print(f"Recall: {recall:.5f}")

print(f"F1 score: {f1:.5f}")

- As we can see, the scores are low, and the model often predicts that a policyholder will not file a claim when they actually will (false negatives). However, this model had the best performance of the 3 used, since it had better performance measures than the Decision Tree (DT) and K-NN, and a confusion matrix similar to the DT one.

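- After that, we combined the Decision Tree and K-NN classifiers in a soft-voting ensemble, passing the class weights computed earlier as estimator weights, and measured its accuracy:

In [None]:
# The voting classifier averages the class probabilities predicted by each classifier
# (soft voting), weighted by `weights`. Note that the values passed here are the class
# weights computed earlier, so they act as per-classifier weights, not per-class weights.
# In this case, all of them have the same average accuracy.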

ensemble = VotingClassifier([('dt', dtc), ('knn',knc)], voting='soft', weights=[weights[0], weights[1]])

ensemble.fit(car_data_features_train, car_data_target_train)

pred = ensemble.predict(car_data_features_train)

score = accuracy_score(car_data_target_train, pred)

print('Weighted Avg Accuracy: %.3f' % (score*100))
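
- In soft voting, the `weights` argument of `VotingClassifier` is applied per estimator: the predicted class probabilities are combined with a weighted average before the most probable class is picked. The sketch below reproduces that arithmetic with NumPy for a single sample; the probabilities and weights are hypothetical:

```python
import numpy as np

# Hypothetical class probabilities [P(0), P(1)] from two models for one sample
proba_model_a = np.array([0.9, 0.1])
proba_model_b = np.array([0.4, 0.6])
stacked = np.vstack([proba_model_a, proba_model_b])

# Equal estimator weights: plain average of the probabilities
equal = np.average(stacked, axis=0, weights=[1, 1])

# Unequal estimator weights: model B counts five times as much as model A
skewed = np.average(stacked, axis=0, weights=[1, 5])

print(equal, int(np.argmax(equal)))
print(skewed, int(np.argmax(skewed)))
```

- With equal weights the averaged probabilities favour class 0, while the skewed weights let model B dominate and the prediction flips to class 1.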

- After analysing the results of the measures and plots, we decided to use the Logistic Regression model for our prediction.

### Learning Curve

- After that, we computed the learning curve for the Logistic Regression, in order to know at which point, if any, it starts to overfit.

In [None]:
#Learning Curve
car_train_sizes, car_train_scores, car_test_scores = learning_curve(lrc,
                car_data_features_train, car_data_target_train, scoring = 'accuracy', n_jobs = 1, random_state=1)

train_mean = np.mean(car_train_scores , axis = 1)
train_std = np.std(car_train_scores, axis=1)

test_mean = np.mean(car_test_scores, axis=1)
test_std = np.std(car_test_scores, axis=1)

plt.subplots(1, figsize=(10,10))
plt.plot(car_train_sizes, train_mean, '--', color="#111111",  label="Training score")
plt.plot(car_train_sizes, test_mean, color="#111111", label="Cross-validation score")

plt.fill_between(car_train_sizes, train_mean - train_std, train_mean + train_std, color="#DDDDDD")
plt.fill_between(car_train_sizes , test_mean - test_std, test_mean + test_std, color="#DDDDDD")

plt.title("Learning Curve")
plt.xlabel("Training Set Size")
plt.ylabel("Accuracy Score")
plt.legend(loc="best")
plt.tight_layout()
plt.show()




## Final analysis and conclusions

- With the results of the previous analysis, we decided to use the Logistic Regression, since it has the best performance (it has a better accuracy than the Decision Tree and also has a stable learning curve), even though K-NN has a better accuracy (K-NN always predicts 0, so it has no false positives, but we know it is not a good prediction).

- So let's do the same, but using the test data.

In [None]:
# We evaluated a Logistic Regression classifier on the test data, with the same parameters
# as the one used before. The model is fitted on the training data only, so that the test
# data stays unseen during training. We also printed the performance measures.

lrc = LogisticRegression(penalty='l2', tol=0.2, class_weight='balanced', random_state=1)

lrc.fit(car_data_features_train, car_data_target_train)

lrc.score(car_data_features_test, car_data_target_test)

log_pred = lrc.predict(car_data_features_test)

# Created the Confusion Matrix and the ROC.

cmat = confusion_matrix(car_data_target_test, log_pred, labels=lrc.classes_)

disp = ConfusionMatrixDisplay(confusion_matrix=cmat, display_labels=lrc.classes_)

disp.plot()

plt.show()

target_pred_proba = lrc.predict_proba(car_data_features_test)[::,1]

fpr, tpr, _ = metrics.roc_curve(car_data_target_test ,target_pred_proba)

plt.plot(fpr,tpr)
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()

# Created the analytical performance measures.

accuracy = accuracy_score(car_data_target_test,log_pred)

precision = precision_score(car_data_target_test,log_pred, average="macro")

recall = recall_score(car_data_target_test, log_pred, average="macro")

f1 = f1_score(car_data_target_test, log_pred, average="macro")

print(f"Accuracy: {accuracy:.5f}")

print(f"Precision: {precision:.5f}")

print(f"Recall: {recall:.5f}")

print(f"F1 score: {f1:.5f}")

- As expected, we concluded that the model behaves similarly on the test data and on the training data. However, having a reasonable accuracy does not mean having a good prediction model (as we can see from the value of the F1 score).

- Due to the huge number of false negatives, we can say that our model is good at predicting when a policyholder will not file a claim, but we cannot rely on it to predict when they will.