# **Problem Statement**
---

The insurance industry is undergoing a transformative phase, with machine learning (ML) becoming integral to customer service and fraud detection. In this context, this thesis applies imbalanced-learning and sampling techniques in supervised learning to develop a predictive model for an insurance company specializing in Health Insurance. The objective is to predict whether policyholders from the previous year will also be interested in Vehicle Insurance, enabling more targeted marketing and improved customer satisfaction.

RESEARCH OBJECTIVES:

- Data Preparation and Exploration:
  - Collect and preprocess insurance data, emphasizing the handling of class imbalance between customers interested and not interested in Vehicle Insurance.
  - Explore the dataset to gain insights into customer behaviors and identify potential features that influence insurance product adoption.
- Imbalance Learning Techniques:
  - Investigate various imbalance learning techniques.
- Supervised Learning Models:
  - Implement and compare supervised learning algorithms for predicting customer interest in Vehicle Insurance.
- Sampling Strategies:
  - Analyze the impact of different sampling strategies on model performance.
- Model Evaluation and Interpretability:
  - Evaluate model performance using appropriate metrics and assess the interpretability of the models to gain insights into customer preferences.
- Recommendation System Integration:
  - Explore the feasibility of integrating the model into the insurance company's recommendation system to offer tailored insurance products to customers.
- Ethical Considerations:
  - Address ethical considerations related to data privacy and fairness in insurance product recommendations, ensuring responsible AI practices.

EXPECTED OUTCOMES:
- Develop a predictive model capable of identifying policyholders likely to be interested in Vehicle Insurance, improving customer targeting and retention.
- Determine which sampling strategies and imbalance learning techniques are most effective for handling imbalanced insurance data.
- Assess the potential business impact of the predictive model on the insurance company, including increased cross-selling and customer satisfaction.
- Establish an ethical framework for deploying the model, addressing fairness and privacy concerns in the insurance industry.


### Import Libraries

In [1]:
# Import Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from faker import Faker


### Load the Dataset

In [2]:
# Load Dataset
df = pd.read_csv('C:\\Users\\USER-PC\\Desktop\\ThesisCode\\Data\\Raw\\aug_train.csv')


### Exploring the Dataset: First View

In [3]:
# Dataset First Look
df.head()

Unnamed: 0,id,Gender,Age,Driving_License,Region_Code,Previously_Insured,Vehicle_Age,Vehicle_Damage,Annual_Premium,Policy_Sales_Channel,Vintage,Response
0,167647,Male,22,1,7.0,1,< 1 Year,No,2630.0,152.0,16,0
1,17163,Male,42,1,28.0,0,1-2 Year,Yes,43327.0,26.0,135,0
2,32023,Female,66,1,33.0,0,1-2 Year,Yes,35841.0,124.0,253,0
3,87447,Female,22,1,33.0,0,< 1 Year,No,27645.0,152.0,69,0
4,501933,Male,28,1,46.0,1,< 1 Year,No,29023.0,152.0,211,0


### Understanding the variables

In [4]:
# Dataset Columns
df.columns

Index(['id', 'Gender', 'Age', 'Driving_License', 'Region_Code',
       'Previously_Insured', 'Vehicle_Age', 'Vehicle_Damage', 'Annual_Premium',
       'Policy_Sales_Channel', 'Vintage', 'Response'],
      dtype='object')

In [5]:
# Dataset Describe
df.describe()

Unnamed: 0,id,Age,Driving_License,Region_Code,Previously_Insured,Annual_Premium,Policy_Sales_Channel,Vintage,Response
count,382154.0,382154.0,382154.0,382154.0,382154.0,382154.0,382154.0,382154.0,382154.0
mean,234392.953477,38.545691,0.998108,26.406867,0.489182,30711.271362,111.939812,154.189429,0.163811
std,139527.487326,15.226897,0.043455,13.181241,0.499884,17061.595532,54.286511,83.735107,0.370104
min,1.0,20.0,0.0,0.0,0.0,2630.0,1.0,10.0,0.0
25%,115006.25,25.0,1.0,15.0,0.0,24546.0,26.0,81.0,0.0
50%,230461.5,36.0,1.0,28.0,0.0,31692.0,145.0,154.0,0.0
75%,345434.75,49.0,1.0,35.0,1.0,39447.75,152.0,227.0,0.0
max,508145.0,85.0,1.0,52.0,1.0,540165.0,163.0,299.0,1.0
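
The quartiles reported above are enough for a rough outlier check on `Annual_Premium`. The sketch below applies Tukey's 1.5 * IQR rule to summary values copied from the table; the rule and its 1.5 multiplier are a common convention assumed here, not a method stated in this notebook.

```python
# Summary values for Annual_Premium, copied from df.describe() above.
q1, q3 = 24546.0, 39447.75
minimum, maximum = 2630.0, 540165.0

# Tukey's rule (assumed convention): flag values beyond 1.5 * IQR from the quartiles.
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

print(f"IQR: {iqr:.2f}")
print(f"Fences: [{lower_fence:.3f}, {upper_fence:.3f}]")
print("Max is a high outlier:", maximum > upper_fence)
print("Min is a low outlier:", minimum < lower_fence)
```

With these values the upper fence is about 61,800, so the maximum of 540,165 lies far outside it, while the minimum of 2,630 stays inside the lower fence.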


Notes :

- Gender: No descriptive statistics are provided, but this is a categorical variable typically indicating male or female. There are no numerical summaries since it's not a quantitative variable.
- Age: The age of individuals ranges from 20 to 85 years. The average age is approximately 38.5 years.
- Driving_License: Almost all individuals (99.81%) hold a driving license: the variable is binary (0 or 1) and its mean is 0.998.
- Previously_Insured: Indicates whether the individual was previously insured (1) or not (0). About 48.9% of the individuals were previously insured, which is close to an even split.
- Vehicle_Age: This is a categorical variable indicating the age of the vehicle. Categories like "< 1 Year" and "1-2 Year" suggest the data classifies vehicles based on their age range.
- Vehicle_Damage: Another categorical variable indicating whether the vehicle has been damaged (Yes) or not (No).
- Annual_Premium: The average annual premium paid by individuals is about 30,711, with a large standard deviation of 17,061, indicating a wide variation in premium amounts. The minimum premium is 2,630, and the maximum is a quite high 540,165.
- Policy_Sales_Channel: Likely categorical or discrete numeric representing different sales channels. The mean sales channel code is around 111.94 with a range from 1 to 163, which may correspond to various sales channels through which the insurance policies are sold.
- Vintage: This seems to represent how long (in days) the individual has been with the insurance company. The mean is 154 days, with a range from 10 to 299 days.
- Response: A binary variable indicating whether the individual was interested (1) or not interested (0) in the vehicle insurance offered. The mean of 0.164 shows that about 16.4% of individuals were interested, which can be read as the conversion rate and confirms that the two classes are imbalanced.
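
The class imbalance implied by the `Response` mean can be quantified with simple arithmetic. The following sketch uses only the mean copied from the summary table; the variable names are illustrative.

```python
# Share of positive responses, copied from the mean of 'Response' in df.describe().
positive_rate = 0.163811
negative_rate = 1.0 - positive_rate

# Imbalance ratio: "not interested" customers per "interested" customer.
imbalance_ratio = negative_rate / positive_rate

# Accuracy of a trivial model that always predicts 0 (not interested).
majority_baseline_accuracy = negative_rate

print(f"Imbalance ratio (0:1): {imbalance_ratio:.2f}:1")
print(f"Majority-class baseline accuracy: {majority_baseline_accuracy:.1%}")
```

Roughly five uninterested customers per interested one means that a model which always predicts 0 already reaches about 83.6% accuracy, so accuracy alone says little about how well the minority class is detected.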

In [6]:
# Dataset Describe (categorical columns)
df.describe(include='O')

Unnamed: 0,Gender,Vehicle_Age,Vehicle_Damage
count,382154,382154,382154
unique,2,3,2
top,Male,1-2 Year,No
freq,205603,200176,198501


Notes : 
- This dataset contains more males than females (205,603 of 382,154 records are male).
- Slightly more than half of the customers own vehicles that are between one and two years old.
- Slightly more than half of the customers (198,501 of 382,154) have not had their vehicle damaged before.
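
The shares behind these notes follow directly from the `freq` and `count` rows of the summary above. A minimal sketch of that arithmetic, with the values copied from the table and the dictionary layout chosen only for illustration:

```python
# Values copied from df.describe(include='O') above.
total_rows = 382154
top_categories = {
    "Gender": ("Male", 205603),
    "Vehicle_Age": ("1-2 Year", 200176),
    "Vehicle_Damage": ("No", 198501),
}

# Share of the most frequent category in each column.
shares = {col: freq / total_rows for col, (_, freq) in top_categories.items()}

for col, (label, _) in top_categories.items():
    print(f"{col}: {label!r} accounts for {shares[col]:.1%} of rows")
```

Each top category covers only a little more than half of the rows, so none of these three features is dominated by a single value.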

### Anonymization process

To protect personal data while keeping the dataset useful for analysis, anonymization focuses on the two most sensitive elements: the personal identifier and age. The `id` column is replaced with random UUIDs (pseudonymization), and exact ages are generalized into three age groups before the `Age` column is dropped. This keeps the features needed for modeling while reducing the risk of re-identification.


In [7]:
# Initialize Faker
fake = Faker()

# Anonymizing 'id' by replacing it with a pseudonymized version, if exists
if 'id' in df.columns:
    id_map = {original_id: fake.uuid4() for original_id in df['id']}
    df['id'] = df['id'].map(id_map)
    
# Generalizing 'Age' into age groups, if 'Age' exists
if 'Age' in df.columns:
    # Bands: 18-30 -> YoungAge, 31-60 -> MiddleAge, 61+ -> OldAge (ages below 18 become NaN)
    df['Age_Group'] = pd.cut(df['Age'], bins=[17, 30, 60, np.inf],
                             labels=['YoungAge', 'MiddleAge', 'OldAge'])
    # Remove the exact 'Age' column from the DataFrame
    df.drop('Age', axis=1, inplace=True)
# Save the anonymized DataFrame to a new CSV file
anonymized_file_path = 'C:\\Users\\USER-PC\\Desktop\\ThesisCode\\Data\\Anonymized\\aug_train_anonymized.csv'
df.to_csv(anonymized_file_path, index=False)

print(f"Anonymized dataset saved to {anonymized_file_path}")


Anonymized dataset saved to C:\Users\USER-PC\Desktop\ThesisCode\Data\Anonymized\aug_train_anonymized.csv
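
The age bands applied above (18 to 30, 31 to 60, and 61 or older) can be spot-checked at their boundaries. The sketch below is pure Python; `age_group` is a hypothetical helper that mirrors the banding rule, and rejecting ages below 18 is an assumption made here rather than behavior taken from the notebook.

```python
from bisect import bisect_left

# Inclusive upper bounds of the first two bands; anything above 60 is 'OldAge'.
UPPER_BOUNDS = [30, 60]
LABELS = ["YoungAge", "MiddleAge", "OldAge"]


def age_group(age: int) -> str:
    """Map an age to its band (hypothetical helper mirroring the rule above)."""
    if age < 18:
        # Assumption: ages below 18 are rejected rather than assigned a band.
        raise ValueError(f"age {age} is below the minimum of 18")
    return LABELS[bisect_left(UPPER_BOUNDS, age)]


# Spot-check both sides of each boundary.
for age in (18, 30, 31, 60, 61, 85):
    print(age, age_group(age))
```

Checking the values on either side of 30 and 60 guards against off-by-one mistakes in the comparison operators.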


Data Anonymized


In [8]:
# Load the anonymized dataset and see the first rows
da = pd.read_csv('C:\\Users\\USER-PC\\Desktop\\ThesisCode\\Data\\Anonymized\\aug_train_anonymized.csv')
da.head(10)

Unnamed: 0,id,Gender,Driving_License,Region_Code,Previously_Insured,Vehicle_Age,Vehicle_Damage,Annual_Premium,Policy_Sales_Channel,Vintage,Response,Age_Group
0,ca21a51d-28e2-47bd-8dae-6e249b5d458d,Male,1,7.0,1,< 1 Year,No,2630.0,152.0,16,0,YoungAge
1,327b5be5-8ff2-4f50-98d5-91b1817cf447,Male,1,28.0,0,1-2 Year,Yes,43327.0,26.0,135,0,MiddleAge
2,5a43fb76-7db1-435a-8954-754cc9da0fb2,Female,1,33.0,0,1-2 Year,Yes,35841.0,124.0,253,0,OldAge
3,c5aa742b-9b6b-495a-9dcb-1a41b85f8812,Female,1,33.0,0,< 1 Year,No,27645.0,152.0,69,0,YoungAge
4,61e8d449-5da0-4f6e-bba1-77bf812e9bf7,Male,1,46.0,1,< 1 Year,No,29023.0,152.0,211,0,YoungAge
5,b1c47b5d-41e0-493d-87ab-f16afa5de8c7,Female,1,25.0,1,< 1 Year,No,27954.0,152.0,23,0,YoungAge
6,18e3aa78-72fd-4888-9f22-71cb7540fb46,Male,1,8.0,0,1-2 Year,Yes,2630.0,26.0,209,0,MiddleAge
7,95a17bb8-46e4-49d3-a69a-6120429fbbd2,Male,1,28.0,1,1-2 Year,No,2630.0,26.0,51,0,MiddleAge
8,3c8c15aa-6638-499c-93db-bcebdcfbc4cb,Female,1,28.0,0,1-2 Year,Yes,55873.0,124.0,262,0,MiddleAge
9,76c7d89d-c8a8-4725-82b6-4a9f021d7930,Male,1,28.0,0,1-2 Year,Yes,27801.0,122.0,217,1,MiddleAge


In [9]:
# Check that every 'id' looks like a UUID: 36 characters long and containing hyphens.
uuid_check = da['id'].apply(lambda x: len(x) == 36 and '-' in x).all()
print("UUID Verification:", "Passed" if uuid_check else "Failed")

UUID Verification: Passed
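
The length-and-hyphen test above is a heuristic: any 36-character string containing a hyphen would pass it. A stricter check can parse each value with the standard-library `uuid` module. This is a minimal sketch; `is_uuid4` is a hypothetical helper name, and requiring version 4 assumes the identifiers were generated as UUID4 values.

```python
import uuid


def is_uuid4(value: str) -> bool:
    """Return True if value is a canonical, hyphenated version-4 UUID string."""
    try:
        parsed = uuid.UUID(value)
    except (ValueError, AttributeError, TypeError):
        return False
    # uuid.UUID also accepts braces, 'urn:uuid:' prefixes and missing hyphens,
    # so compare against the canonical form to reject those variants.
    return parsed.version == 4 and str(parsed) == value.lower()


print(is_uuid4("ca21a51d-28e2-47bd-8dae-6e249b5d458d"))  # id from the first row above
print(is_uuid4("------------------------------------"))  # 36 characters with hyphens, not a UUID
```

Applied to the dataset, the same helper could replace the lambda, for example `da['id'].apply(is_uuid4).all()`.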


In [10]:
print("\nAnonymized Age Group Distribution:")
print(da['Age_Group'].value_counts())


Anonymized Age Group Distribution:
Age_Group
MiddleAge    179389
YoungAge     162433
OldAge        40332
Name: count, dtype: int64
