# **Binary Classification of Insurance Cross Selling**

In this analysis, we are developing a model for binary classification in the field of insurance cross-selling. The goal is to determine the likelihood that a customer is interested in an additional insurance product. To achieve this, we will employ various classification algorithms to attain the best possible prediction accuracy.

We plan to evaluate multiple models, including Light Gradient Boosting Machine, Gradient Boosting Classifier, Ada Boost Classifier, and Random Forest. These models will be assessed using metrics such as accuracy, AUC, recall, precision, F1-score, Kappa, and MCC. Through this comparison, we aim to gain a comprehensive understanding of the strengths and weaknesses of each algorithm and identify which model is most suitable for identifying potential cross-sales.
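To make the listed metrics concrete, the following minimal sketch computes them from the four cells of a confusion matrix. The helper name `binary_metrics` and the example counts are illustrative assumptions, not part of the original notebook; PyCaret reports these values on its own.

```python
import math


def binary_metrics(tp, fp, fn, tn):
    """Compute common binary-classification metrics from confusion-matrix counts."""
    n = tp + fp + fn + tn
    accuracy = (tp + tn) / n
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0

    # Cohen's kappa: observed agreement corrected for agreement expected by chance
    p_yes = ((tp + fp) / n) * ((tp + fn) / n)
    p_no = ((tn + fn) / n) * ((tn + fp) / n)
    p_expected = p_yes + p_no
    kappa = (accuracy - p_expected) / (1 - p_expected) if p_expected != 1 else 0.0

    # Matthews correlation coefficient: defined as 0 when any marginal is empty
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0

    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "kappa": kappa,
        "mcc": mcc,
    }


# Hypothetical counts: 1,000 customers, 122 of them interested,
# and a model that predicts "not interested" for everyone.
always_negative = binary_metrics(tp=0, fp=0, fn=122, tn=878)
# Hypothetical perfect classifier on the same customers.
perfect = binary_metrics(tp=122, fp=0, fn=0, tn=878)

print(always_negative)
print(perfect)
```

In this made-up case the always-negative model still scores a high accuracy while recall, F1, Kappa, and MCC are all zero, which is why accuracy alone is a weak guide on an imbalanced target.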

**Data Dictionary**

| Column Name             | Description                                                                  |
|-------------------------|------------------------------------------------------------------------------|
| ID                      | Unique identifier for the individual                                         |
| Gender                  | Gender of the individual                                                     |
| Age                     | Age of the individual                                                        |
| Driving License         | Indicates if the individual has a driving license                            |
| Region Code             | Code representing the region                                                 |
| Previously Insured      | Indicates if the individual was previously insured                           |
| Vehicle Age             | Age of the vehicle                                                           |
| Vehicle Damage          | Indicates if the vehicle has sustained damage                                |
| Annual Premium          | Annual insurance premium amount                                              |
| Policy Sales Channel    | Channel through which the policy was sold                                    |
| Vintage                 | Number of days the customer has been associated with the company             |
| Response                | Target: 1 if the customer is interested in the additional insurance, else 0  |

<img src='https://pfst.cf2.poecdn.net/base/image/33d6b5db3db39301f6ad6e6a95a607aa282e4977960cf405f1cae653c48414de?w=1024&h=768&pmaid=259373161' width='800'>
<a href='https://www.kaggle.com/datasets/bhavikjikadara/car-price-prediction-dataset' target=_blank>
Click here for the dataset </a>

**Import libraries**

In [19]:
#!pip install pycaret

In [4]:
import pandas as pd
from pycaret.classification import *

import warnings
warnings.filterwarnings('ignore')

**Load the data**

In [5]:
test=pd.read_csv('test.csv')
train=pd.read_csv('train.csv')

In [6]:
df = pd.concat([test, train], ignore_index=True)

**EDA - Exploratory Data Analysis**

In [7]:
df.head()

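|   | id       | Gender | Age | Driving_License | Region_Code | Previously_Insured | Vehicle_Age | Vehicle_Damage | Annual_Premium | Policy_Sales_Channel | Vintage | Response |
|---|----------|--------|-----|-----------------|-------------|--------------------|-------------|----------------|----------------|----------------------|---------|----------|
| 0 | 11504798 | Female | 20  | 1               | 47.0        | 0.0                | < 1 Year    | No             | 2630.0         | 160.0                | 228.0   | NaN      |
| 1 | 11504799 | Male   | 47  | 1               | 28.0        | 0.0                | 1-2 Year    | Yes            | 37483.0        | 124.0                | 123.0   | NaN      |
| 2 | 11504800 | Male   | 47  | 1               | 43.0        | 0.0                | 1-2 Year    | Yes            | 2630.0         | 26.0                 | 271.0   | NaN      |
| 3 | 11504801 | Female | 22  | 1               | 47.0        | 1.0                | < 1 Year    | No             | 24502.0        | 152.0                | 115.0   | NaN      |
| 4 | 11504802 | Male   | 51  | 1               | 19.0        | 0.0                | 1-2 Year    | No             | 34115.0        | 124.0                | 148.0   | NaN      |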


In [8]:
df.shape

(706274, 12)

In [9]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 706274 entries, 0 to 706273
Data columns (total 12 columns):
 #   Column                Non-Null Count   Dtype  
---  ------                --------------   -----  
 0   id                    706274 non-null  int64  
 1   Gender                706274 non-null  object 
 2   Age                   706274 non-null  int64  
 3   Driving_License       706274 non-null  int64  
 4   Region_Code           706274 non-null  float64
 5   Previously_Insured    706273 non-null  float64
 6   Vehicle_Age           706273 non-null  object 
 7   Vehicle_Damage        706273 non-null  object 
 8   Annual_Premium        706273 non-null  float64
 9   Policy_Sales_Channel  706273 non-null  float64
 10  Vintage               706273 non-null  float64
 11  Response              298515 non-null  float64
dtypes: float64(6), int64(3), object(3)
memory usage: 64.7+ MB


In [10]:
df.isnull().sum()

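| Column               | Missing values |
|----------------------|----------------|
| id                   | 0              |
| Gender               | 0              |
| Age                  | 0              |
| Driving_License      | 0              |
| Region_Code          | 0              |
| Previously_Insured   | 1              |
| Vehicle_Age          | 1              |
| Vehicle_Damage       | 1              |
| Annual_Premium       | 1              |
| Policy_Sales_Channel | 1              |
| Vintage              | 1              |
| Response             | 407759         |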


**Data Preparation**

In [11]:
# Check for missing values in the target variable
missing_values = df['Response'].isna().sum()
print(f"Missing values in the target variable: {missing_values}")

# Remove rows with missing values in the target variable
df = df.dropna(subset=['Response'])

Missing values in the target variable: 407759
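Because the unlabeled test rows were concatenated with the training rows, dropping rows with a missing target effectively recovers the labeled portion of the data. The toy sketch below illustrates that pattern with made-up values and hypothetical variable names (`train_toy`, `test_toy`). It also resets the index and casts the target to integer, two optional steps the original notebook does not perform.

```python
import pandas as pd

# Toy stand-ins for train.csv and test.csv (values are made up)
train_toy = pd.DataFrame({"id": [1, 2, 3], "Age": [25, 40, 31], "Response": [0, 1, 0]})
test_toy = pd.DataFrame({"id": [4, 5], "Age": [52, 23]})  # no Response column

# Same order as in the notebook: test rows first, then train rows
combined = pd.concat([test_toy, train_toy], ignore_index=True)

# The test rows have no label, so they show up as missing targets
n_missing = int(combined["Response"].isna().sum())

# Keep labeled rows only, then tidy the index and the target dtype
labeled = (
    combined.dropna(subset=["Response"])
    .reset_index(drop=True)
    .astype({"Response": int})
)

print(f"Missing values in the target variable: {n_missing}")
print(labeled)
```

Resetting the index gives the remaining rows a contiguous 0-based index, which keeps later joins between predictions and the original rows easy to reason about.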


In [13]:
# Set up the PyCaret environment
clf = setup(data=df, target='Response', session_id=123,
            categorical_features=['Gender', 'Driving_License', 'Region_Code', 'Vehicle_Age', 'Vehicle_Damage', 'Policy_Sales_Channel'],
            numeric_features=['Age', 'Annual_Premium', 'Vintage'],
            ignore_features=['id'])

|   | Description                 | Value        |
|---|-----------------------------|--------------|
| 0 | Session id                  | 123          |
| 1 | Target                      | Response     |
| 2 | Target type                 | Binary       |
| 3 | Original data shape         | (298515, 12) |
| 4 | Transformed data shape      | (298515, 13) |
| 5 | Transformed train set shape | (208960, 13) |
| 6 | Transformed test set shape  | (89555, 13)  |
| 7 | Ignore features             | 1            |
| 8 | Numeric features            | 3            |
| 9 | Categorical features        | 6            |


In [14]:
# Compare models
best_model = compare_models()

|          | Model                           | Accuracy | AUC    | Recall | Prec.  | F1     | Kappa  | MCC    | TT (Sec) |
|----------|---------------------------------|----------|--------|--------|--------|--------|--------|--------|----------|
| lightgbm | Light Gradient Boosting Machine | 0.8796   | 0.8703 | 0.0541 | 0.5737 | 0.0988 | 0.0794 | 0.1487 | 122.307  |
| gbc      | Gradient Boosting Classifier    | 0.878    | 0.8648 | 0.0027 | 0.5156 | 0.0055 | 0.0042 | 0.0306 | 3.937    |
| ada      | Ada Boost Classifier            | 0.878    | 0.8605 | 0.0087 | 0.505  | 0.0172 | 0.0131 | 0.0536 | 1.315    |
| ridge    | Ridge Classifier                | 0.8779   | 0.8457 | 0.0    | 0.0    | 0.0    | 0.0    | 0.0    | 0.256    |
| dummy    | Dummy Classifier                | 0.8779   | 0.5    | 0.0    | 0.0    | 0.0    | 0.0    | 0.0    | 0.242    |
| lr       | Logistic Regression             | 0.8776   | 0.8428 | 0.0005 | 0.1674 | 0.001  | 0.0002 | 0.0023 | 2.476    |
| lda      | Linear Discriminant Analysis    | 0.8765   | 0.8457 | 0.0133 | 0.3438 | 0.0257 | 0.0167 | 0.0467 | 0.475    |
| rf       | Random Forest Classifier        | 0.8686   | 0.8443 | 0.1381 | 0.3916 | 0.2041 | 0.15   | 0.1746 | 4.077    |
| et       | Extra Trees Classifier          | 0.8636   | 0.8353 | 0.1563 | 0.3634 | 0.2186 | 0.1566 | 0.1735 | 3.495    |
| svm      | SVM - Linear Kernel             | 0.8631   | 0.4999 | 0.024  | 0.1296 | 0.0381 | 0.0052 | 0.0061 | 2.246    |
| knn      | K Neighbors Classifier          | 0.8613   | 0.6249 | 0.074  | 0.2602 | 0.1152 | 0.0646 | 0.08   | 0.811    |
| dt       | Decision Tree Classifier        | 0.827    | 0.6086 | 0.3197 | 0.3025 | 0.3108 | 0.212  | 0.2121 | 0.389    |

In [15]:
# Create and train a model (e.g., Random Forest)
model = create_model('rf')

| Fold | Accuracy | AUC    | Recall | Prec.  | F1     | Kappa  | MCC    |
|------|----------|--------|--------|--------|--------|--------|--------|
| 0    | 0.8687   | 0.8428 | 0.1408 | 0.3936 | 0.2074 | 0.1529 | 0.1773 |
| 1    | 0.8682   | 0.8403 | 0.1403 | 0.3896 | 0.2063 | 0.1515 | 0.1752 |
| 2    | 0.8713   | 0.8469 | 0.1427 | 0.4203 | 0.2131 | 0.1611 | 0.1894 |
| 3    | 0.8687   | 0.843  | 0.1337 | 0.3902 | 0.1991 | 0.1459 | 0.1711 |
| 4    | 0.8705   | 0.8482 | 0.1478 | 0.4147 | 0.2179 | 0.1643 | 0.1906 |
| 5    | 0.8681   | 0.847  | 0.1333 | 0.3842 | 0.1979 | 0.1441 | 0.1684 |
| 6    | 0.8681   | 0.8497 | 0.1282 | 0.3807 | 0.1918 | 0.1388 | 0.1635 |
| 7    | 0.8689   | 0.8468 | 0.1392 | 0.3953 | 0.2059 | 0.1519 | 0.1769 |
| 8    | 0.8662   | 0.8384 | 0.1301 | 0.3652 | 0.1919 | 0.1365 | 0.1584 |
| 9    | 0.867    | 0.8402 | 0.1446 | 0.382  | 0.2098 | 0.153  | 0.1748 |
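The per-fold scores can be condensed into a mean and a spread to judge how stable the Random Forest is across folds. The sketch below does this with the standard library, using values copied from the fold table. The `summarize` helper is a hypothetical name, and the sample standard deviation is used here as an assumption, which may differ from the convention PyCaret applies in its own summary rows.

```python
from statistics import mean, stdev

# Values copied from the 10-fold output of create_model('rf')
accuracy_by_fold = [0.8687, 0.8682, 0.8713, 0.8687, 0.8705,
                    0.8681, 0.8681, 0.8689, 0.8662, 0.8670]
auc_by_fold = [0.8428, 0.8403, 0.8469, 0.8430, 0.8482,
               0.8470, 0.8497, 0.8468, 0.8384, 0.8402]
recall_by_fold = [0.1408, 0.1403, 0.1427, 0.1337, 0.1478,
                  0.1333, 0.1282, 0.1392, 0.1301, 0.1446]


def summarize(values):
    """Return (mean, sample standard deviation), both rounded to 4 decimals."""
    return round(mean(values), 4), round(stdev(values), 4)


for name, values in [("Accuracy", accuracy_by_fold),
                     ("AUC", auc_by_fold),
                     ("Recall", recall_by_fold)]:
    fold_mean, fold_std = summarize(values)
    print(f"{name:>8}: mean={fold_mean:.4f}, std={fold_std:.4f}")
```

The means reproduce the Random Forest row of the model comparison (accuracy 0.8686, AUC 0.8443, recall 0.1381), and the small spread indicates that the cross-validated estimates are stable.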



In [16]:
# Evaluate the model
evaluate_model(model)


In [17]:
# Score the model on df, which at this point holds only the labeled rows (not the unlabeled test set)
predictions = predict_model(model, data=df)

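|   | Model                    | Accuracy | AUC    | Recall | Prec.  | F1     | Kappa   | MCC     |
|---|--------------------------|----------|--------|--------|--------|--------|---------|---------|
| 0 | Random Forest Classifier | 0.8123   | 0.4996 | 0.0856 | 0.1208 | 0.1002 | -0.0012 | -0.0012 |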


**Results**

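The comparison ranks the Light Gradient Boosting Machine first, with a cross-validated accuracy of 0.8796 and an AUC of 0.8703. The accuracy figure needs context: the Dummy Classifier, which never predicts a positive case, already reaches 0.8779, so accuracy mostly reflects the class imbalance. AUC is the more informative metric here, and LightGBM leads on it, ahead of the Gradient Boosting Classifier (0.8648) and the Ada Boost Classifier (0.8605). At the default decision threshold, however, LightGBM recovers only about 5% of the interested customers (recall 0.0541, precision 0.5737), so it ranks customers well but does not yet identify positive cases reliably. The Gradient Boosting and Ada Boost classifiers both reach an accuracy of 0.8780 and the Ridge Classifier 0.8779, all with even lower recall.

The ten cross-validation folds of the Random Forest vary little (accuracy between 0.8662 and 0.8713, AUC between 0.8384 and 0.8497), which points to stable estimates. Its score from `predict_model` on `df` is much weaker, with an accuracy of 0.8123 and an AUC of 0.4996, which is no better than chance and far below its cross-validated AUC of 0.8443. A gap of this size is unexpected and should be investigated before the Random Forest is judged on that figure. Within the cross-validated comparison, the Decision Tree has the lowest accuracy (0.8270) but the highest recall (0.3197).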
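A high AUC paired with very low recall, as seen for several models in the comparison, usually means that a model ranks customers reasonably well but the default 0.5 cut-off rarely labels anyone as positive. The sketch below uses made-up scores and labels (not output from any model in this notebook) to show how lowering the threshold trades precision for recall. The function name `precision_recall_at` is hypothetical.

```python
def precision_recall_at(threshold, scores, labels):
    """Precision and recall when every score >= threshold is predicted positive."""
    predicted = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for p, y in zip(predicted, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(predicted, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(predicted, labels) if p == 0 and y == 1)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Made-up predicted probabilities and true labels for ten customers
scores = [0.62, 0.41, 0.35, 0.30, 0.22, 0.18, 0.12, 0.08, 0.05, 0.03]
labels = [1, 1, 0, 1, 0, 1, 0, 0, 0, 0]

default_cut = precision_recall_at(0.5, scores, labels)
lower_cut = precision_recall_at(0.2, scores, labels)

print(f"threshold 0.5: precision={default_cut[0]:.2f}, recall={default_cut[1]:.2f}")
print(f"threshold 0.2: precision={lower_cut[0]:.2f}, recall={lower_cut[1]:.2f}")
```

In this toy data the lower threshold raises recall from 0.25 to 0.75 while precision drops from 1.00 to 0.60. Whether such a trade is worthwhile depends on the cost of contacting an uninterested customer relative to missing an interested one.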


**Conclusion**

In summary, the Light Gradient Boosting Machine is the strongest of the compared models for binary classification in insurance cross-selling, leading on both accuracy and AUC. Companies that want to target offers at interested customers could benefit from it, provided the low recall at the default threshold is addressed, for example by tuning the decision threshold or by handling the class imbalance. Future work could also integrate additional features and improve data quality to further enhance predictive performance.