# 2. Data
The dataset was generated by a deep learning model trained on the Health Insurance Cross Sell Prediction dataset.

There is a version of it available on Kaggle: https://www.kaggle.com/competitions/playground-series-s4e7/data

# 3. Evaluation 
If we can reach more than 80% accuracy at predicting the `Response` column, we'll pursue the project.

# 4. Features
Little is specified about the features of the dataset, so we'll explore them throughout this notebook.

# Preparing basic tools and datasets


In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

In [3]:
# Loading the data
test_data = pd.read_csv('playground-series-s4e7/test.csv')
train_data = pd.read_csv('playground-series-s4e7/train.csv')

# Data Exploration

In [5]:
train_data.shape

(11504798, 12)

In [6]:
train_data.head()

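|   | id | Gender | Age | Driving_License | Region_Code | Previously_Insured | Vehicle_Age | Vehicle_Damage | Annual_Premium | Policy_Sales_Channel | Vintage | Response |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | Male | 21 | 1 | 35.0 | 0 | 1-2 Year | Yes | 65101.0 | 124.0 | 187 | 0 |
| 1 | 1 | Male | 43 | 1 | 28.0 | 0 | > 2 Years | Yes | 58911.0 | 26.0 | 288 | 1 |
| 2 | 2 | Female | 25 | 1 | 14.0 | 1 | < 1 Year | No | 38043.0 | 152.0 | 254 | 0 |
| 3 | 3 | Female | 35 | 1 | 1.0 | 0 | 1-2 Year | Yes | 2630.0 | 156.0 | 76 | 0 |
| 4 | 4 | Female | 36 | 1 | 15.0 | 1 | 1-2 Year | No | 31951.0 | 152.0 | 294 | 0 |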


In [7]:
train_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11504798 entries, 0 to 11504797
Data columns (total 12 columns):
 #   Column                Dtype  
---  ------                -----  
 0   id                    int64  
 1   Gender                object 
 2   Age                   int64  
 3   Driving_License       int64  
 4   Region_Code           float64
 5   Previously_Insured    int64  
 6   Vehicle_Age           object 
 7   Vehicle_Damage        object 
 8   Annual_Premium        float64
 9   Policy_Sales_Channel  float64
 10  Vintage               int64  
 11  Response              int64  
dtypes: float64(3), int64(6), object(3)
memory usage: 1.0+ GB
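
The `info()` output above reports more than 1 GB for about 11.5 million rows. If memory becomes a constraint, one option (not used in this notebook) is to downcast the numeric columns to smaller dtypes. The sketch below runs on a made-up three-row frame, and `downcast_numeric` is a hypothetical helper name, not part of pandas. Note that `float32` has less precision than `float64`, so this is only safe when the values do not need it.

```python
import pandas as pd

def downcast_numeric(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with integer and float columns stored in smaller dtypes."""
    out = df.copy()
    for col in out.columns:
        if pd.api.types.is_integer_dtype(out[col]):
            out[col] = pd.to_numeric(out[col], downcast="integer")
        elif pd.api.types.is_float_dtype(out[col]):
            out[col] = pd.to_numeric(out[col], downcast="float")
    return out

# Made-up sample using column names from the dataset
sample = pd.DataFrame({
    "Age": [21, 43, 25],
    "Region_Code": [35.0, 28.0, 14.0],
    "Annual_Premium": [65101.0, 58911.0, 38043.0],
})
small = downcast_numeric(sample)

print(sample.dtypes)
print(small.dtypes)
```

On the real data, comparing `train_data.memory_usage(deep=True).sum()` before and after would show how much is saved.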


In [8]:
# Are there any missing values in the train dataset?
train_data.isna().sum()

id                      0
Gender                  0
Age                     0
Driving_License         0
Region_Code             0
Previously_Insured      0
Vehicle_Age             0
Vehicle_Damage          0
Annual_Premium          0
Policy_Sales_Channel    0
Vintage                 0
Response                0
dtype: int64

In [9]:
# Are there any missing values in the test dataset?
test_data.isna().sum()

id                      0
Gender                  0
Age                     0
Driving_License         0
Region_Code             0
Previously_Insured      0
Vehicle_Age             0
Vehicle_Damage          0
Annual_Premium          0
Policy_Sales_Channel    0
Vintage                 0
dtype: int64

In [10]:
train_data['Vehicle_Age'].head()

0     1-2 Year
1    > 2 Years
2     < 1 Year
3     1-2 Year
4     1-2 Year
Name: Vehicle_Age, dtype: object

The `Vehicle_Age` column holds text categories that contain the symbols `>` and `<`. We'll map these categories to ordered integers in both the train and test data sets.

# Preparing Data for Modelling

In [14]:
# Mapping the Vehicle_Age categories to ordered integers on both data sets
vehicle_age_mapping = {'< 1 Year': 1, '1-2 Year': 2, '> 2 Years': 3}

train_data['Vehicle_Age'] = train_data['Vehicle_Age'].replace(vehicle_age_mapping)
test_data['Vehicle_Age'] = test_data['Vehicle_Age'].replace(vehicle_age_mapping)
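
An alternative way to apply the same dictionary is `Series.map`, which turns any category missing from the mapping into `NaN` so that unexpected values are easy to spot. This is a minimal sketch on a made-up series, not the code the notebook ran:

```python
import pandas as pd

vehicle_age_mapping = {'< 1 Year': 1, '1-2 Year': 2, '> 2 Years': 3}

# Made-up sample with the three categories seen in the data
sample = pd.Series(['1-2 Year', '> 2 Years', '< 1 Year'], name='Vehicle_Age')

# map() looks each value up in the dictionary; unknown values become NaN
encoded = sample.map(vehicle_age_mapping)

# Check that every category was covered by the mapping
assert encoded.notna().all()
print(encoded.tolist())
```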



In [16]:
train_data.tail()

|   | id | Gender | Age | Driving_License | Region_Code | Previously_Insured | Vehicle_Age | Vehicle_Damage | Annual_Premium | Policy_Sales_Channel | Vintage | Response |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 11504793 | 11504793 | 1 | 48 | 1 | 6.0 | 0 | 2 | Yes | 27412.0 | 26.0 | 218 | 0 |
| 11504794 | 11504794 | 0 | 26 | 1 | 36.0 | 0 | 1 | Yes | 29509.0 | 152.0 | 115 | 1 |
| 11504795 | 11504795 | 0 | 29 | 1 | 32.0 | 1 | 1 | No | 2630.0 | 152.0 | 189 | 0 |
| 11504796 | 11504796 | 0 | 51 | 1 | 28.0 | 0 | 2 | Yes | 48443.0 | 26.0 | 274 | 1 |
| 11504797 | 11504797 | 1 | 25 | 1 | 28.0 | 1 | 1 | No | 32855.0 | 152.0 | 189 | 0 |


In [17]:
# Converting the remaining text columns into numbers
from sklearn.preprocessing import LabelEncoder
for column in train_data.columns:
    if train_data[column].dtype == 'object':
        # Fit on the training column and reuse the same mapping for the test column
        label_encoder = LabelEncoder()
        train_data[column] = label_encoder.fit_transform(train_data[column])
        test_data[column] = label_encoder.transform(test_data[column])
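
One thing to watch with label encoding is that the train and test sets need the same category-to-integer mapping. The sketch below uses made-up values (the test list deliberately lacks one category) to compare an encoder fitted once on the training values with one refitted on the test values alone:

```python
from sklearn.preprocessing import LabelEncoder

# Made-up columns: the test values happen to lack the 'Female' category
train_gender = ['Male', 'Female', 'Male']
test_gender = ['Male', 'Male']

# Fit once on the training values, then reuse the fitted encoder
shared = LabelEncoder().fit(train_gender)
test_shared = shared.transform(test_gender)

# Refitting on the test values alone assigns codes from scratch
test_refit = LabelEncoder().fit_transform(test_gender)

print(list(shared.classes_))  # categories in the order of their integer codes
print(test_shared.tolist())   # 'Male' keeps the code learned from the training values
print(test_refit.tolist())    # 'Male' gets a different code when refitted
```

When both sets contain exactly the same categories, the two approaches give the same codes. Reusing the fitted encoder also raises an error on categories unseen during fitting, which surfaces mismatches early.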

In [18]:
train_data.head()

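|   | id | Gender | Age | Driving_License | Region_Code | Previously_Insured | Vehicle_Age | Vehicle_Damage | Annual_Premium | Policy_Sales_Channel | Vintage | Response |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 21 | 1 | 35.0 | 0 | 2 | 1 | 65101.0 | 124.0 | 187 | 0 |
| 1 | 1 | 1 | 43 | 1 | 28.0 | 0 | 3 | 1 | 58911.0 | 26.0 | 288 | 1 |
| 2 | 2 | 0 | 25 | 1 | 14.0 | 1 | 1 | 0 | 38043.0 | 152.0 | 254 | 0 |
| 3 | 3 | 0 | 35 | 1 | 1.0 | 0 | 2 | 1 | 2630.0 | 156.0 | 76 | 0 |
| 4 | 4 | 0 | 36 | 1 | 15.0 | 1 | 2 | 0 | 31951.0 | 152.0 | 294 | 0 |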


In [19]:
test_data.head()

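|   | id | Gender | Age | Driving_License | Region_Code | Previously_Insured | Vehicle_Age | Vehicle_Damage | Annual_Premium | Policy_Sales_Channel | Vintage |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 11504798 | 0 | 20 | 1 | 47.0 | 0 | 1 | 0 | 2630.0 | 160.0 | 228 |
| 1 | 11504799 | 1 | 47 | 1 | 28.0 | 0 | 2 | 1 | 37483.0 | 124.0 | 123 |
| 2 | 11504800 | 1 | 47 | 1 | 43.0 | 0 | 2 | 1 | 2630.0 | 26.0 | 271 |
| 3 | 11504801 | 0 | 22 | 1 | 47.0 | 1 | 1 | 0 | 24502.0 | 152.0 | 115 |
| 4 | 11504802 | 1 | 51 | 1 | 19.0 | 0 | 2 | 0 | 34115.0 | 124.0 | 148 |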


# Modelling

In [23]:
# Importing required tools
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.metrics import precision_score, recall_score, f1_score
from sklearn.metrics import RocCurveDisplay

In [24]:
# Splitting training data into X and y
X = train_data.drop('Response', axis=1)
y = train_data['Response']

In [25]:
X_train, X_test, y_train, y_test = train_test_split(X,
                                                   y,
                                                   test_size=0.2)
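
The split above is random and unseeded, so the exact rows in each set change from run to run. A possible variation (an assumption, not what this notebook ran) is to pass `random_state` for reproducibility and `stratify` to keep the class proportions equal in both sets. A minimal sketch on made-up data with 10% positives:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 100 rows, 10 of them in the positive class
X_toy = np.arange(100).reshape(-1, 1)
y_toy = np.array([1] * 10 + [0] * 90)

# stratify keeps the share of each class the same in both splits;
# random_state makes the split repeatable
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=42
)

print(len(y_tr), len(y_te))
print(y_tr.mean(), y_te.mean())  # share of positives in each split
```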

Now that our data is split into training and test sets, it's time to build a machine learning model.

We'll train it (find the patterns) on the training set.

And we'll test it (use the patterns) on the test set.

#### After some research, I decided to use CatBoostClassifier as the model to train.

In [28]:
# Installing the model's package
!pip install catboost

Collecting catboost
  Downloading catboost-1.2.7-cp312-cp312-macosx_11_0_universal2.whl.metadata (1.2 kB)
Collecting graphviz (from catboost)
  Downloading graphviz-0.20.3-py3-none-any.whl.metadata (12 kB)
Collecting plotly (from catboost)
  Downloading plotly-5.24.0-py3-none-any.whl.metadata (7.3 kB)
Collecting tenacity>=6.2.0 (from plotly->catboost)
  Downloading tenacity-9.0.0-py3-none-any.whl.metadata (1.2 kB)
Downloading catboost-1.2.7-cp312-cp312-macosx_11_0_universal2.whl (27.0 MB)
Downloading graphviz-0.20.3-py3-none-any.whl (47 kB)
Downloading plotly-5.24.0-py3-none-any.whl (19.0 MB)
Downloading tenacity-9.0.0-py3-none-any.whl (28 kB)
Installing collected packages: tenacity, graphviz, plotly, cat

In [29]:
# Importing model to the notebook
from catboost import CatBoostClassifier

In [30]:
# Initializing and training model
model = CatBoostClassifier()
model.fit(X_train, y_train)

Learning rate set to 0.5
0:	learn: 0.2977030	total: 791ms	remaining: 13m 10s
1:	learn: 0.2751333	total: 1.35s	remaining: 11m 13s
2:	learn: 0.2695888	total: 1.84s	remaining: 10m 11s
3:	learn: 0.2665882	total: 2.14s	remaining: 8m 52s
4:	learn: 0.2654297	total: 2.47s	remaining: 8m 10s
5:	learn: 0.2642373	total: 3.01s	remaining: 8m 18s
6:	learn: 0.2635975	total: 3.44s	remaining: 8m 8s
7:	learn: 0.2630496	total: 3.85s	remaining: 7m 57s
8:	learn: 0.2626122	total: 4.17s	remaining: 7m 39s
9:	learn: 0.2623616	total: 4.51s	remaining: 7m 26s
10:	learn: 0.2618936	total: 4.84s	remaining: 7m 15s
11:	learn: 0.2616260	total: 5.13s	remaining: 7m 2s
12:	learn: 0.2611806	total: 5.49s	remaining: 6m 56s
13:	learn: 0.2610024	total: 5.91s	remaining: 6m 56s
14:	learn: 0.2608959	total: 6.26s	remaining: 6m 51s
15:	learn: 0.2607365	total: 6.63s	remaining: 6m 47s
16:	learn: 0.2605628	total: 7.03s	remaining: 6m 46s
17:	learn: 0.2603867	total: 7.43s	remaining: 6m 45s
18:	learn: 0.2602363	total: 7.76s	remaining: 6m 

<catboost.core.CatBoostClassifier at 0x154fd56d0>

In [31]:
# Evaluating the model
model.score(X_test, y_test)

0.8808797197691398

In [37]:
# Classification report
y_pred = model.predict(X_test)
print("Classification Report:")
print(classification_report(y_test, y_pred))
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))

Classification Report:
              precision    recall  f1-score   support

           0       0.89      0.99      0.94   2017823
           1       0.57      0.13      0.21    283137

    accuracy                           0.88   2300960
   macro avg       0.73      0.56      0.57   2300960
weighted avg       0.85      0.88      0.85   2300960

Confusion Matrix:
[[1989650   28173]
 [ 245918   37219]]
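
To put these numbers in context, the figures in the report can be recomputed directly from the confusion matrix. The sketch below assumes scikit-learn's layout (rows are actual classes, columns are predicted classes) and copies the four counts from the output above. It also computes the accuracy of always predicting class 0, which comes to roughly 0.877 here, so the model's 0.881 accuracy is only slightly above that baseline while recall for class 1 stays low.

```python
# Counts copied from the confusion matrix above (rows = actual, columns = predicted)
tn, fp = 1989650, 28173
fn, tp = 245918, 37219

total = tn + fp + fn + tp

accuracy = (tn + tp) / total   # share of all predictions that are correct
baseline = (tn + fp) / total   # accuracy of always predicting class 0
precision = tp / (tp + fp)     # of predicted positives, how many are real
recall = tp / (tp + fn)        # of real positives, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy : {accuracy:.4f}")
print(f"baseline : {baseline:.4f}")
print(f"precision: {precision:.2f}")
print(f"recall   : {recall:.2f}")
print(f"f1       : {f1:.2f}")
```

With classes this imbalanced, accuracy alone may not be the most informative way to judge the 80% target. The recall and F1 for class 1 show how many interested customers the model actually finds.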
