<h1 align="center">Car Insurance Analysis</h1>


### Description
This analysis focuses on predicting whether a policyholder is likely to file a claim within the next six months, leveraging a dataset with diverse features related to policyholders and their vehicles.

### Objective
The goal is to develop a predictive model to assist in identifying high-risk policyholders, enabling more informed decisions in insurance risk management.


| **Feature**                          | **Description**                                                                                   |
|--------------------------------------|---------------------------------------------------------------------------------------------------|
| **policy_id**                        | Unique identifier of the policyholder                                                             |
| **policy_tenure**                    | Time period of the policy                                                                         |
| **age_of_car**                       | Normalized age of the car in years                                                                |
| **age_of_policyholder**              | Normalized age of policyholder in years                                                           |
| **area_cluster**                     | Area cluster of the policyholder                                                                  |
| **population_density**               | Population density of the city (Policyholder City)                                                |
| **make**                             | Encoded Manufacturer/company of the car                                                           |
| **segment**                          | Segment of the car (A/ B1/ B2/ C1/ C2)                                                            |
| **model**                            | Encoded name of the car                                                                           |
| **fuel_type**                        | Type of fuel used by the car                                                                      |
| **max_torque**                       | Maximum Torque generated by the car (Nm@rpm)                                                      |
| **max_power**                        | Maximum Power generated by the car (bhp@rpm)                                                      |
| **engine_type**                      | Type of engine used in the car                                                                    |
| **airbags**                          | Number of airbags installed in the car                                                            |
| **is_esc**                           | Boolean flag indicating whether Electronic Stability Control (ESC) is present in the car or not   |
| **is_adjustable_steering**           | Boolean flag indicating whether the steering wheel of the car is adjustable or not                |
| **is_tpms**                          | Boolean flag indicating whether Tyre Pressure Monitoring System (TPMS) is present in the car or not|
| **is_parking_sensors**               | Boolean flag indicating whether parking sensors are present in the car or not                     |
| **is_parking_camera**                | Boolean flag indicating whether the parking camera is present in the car or not                   |
| **rear_brakes_type**                 | Type of brakes used in the rear of the car                                                        |
| **displacement**                     | Engine displacement of the car (cc)                                                               |
| **cylinder**                         | Number of cylinders present in the engine of the car                                              |
| **transmission_type**                | Transmission type of the car                                                                      |
| **gear_box**                         | Number of gears in the car                                                                        |
| **steering_type**                    | Type of the power steering present in the car                                                     |
| **turning_radius**                   | The space a vehicle needs to make a certain turn (Meters)                                         |
| **length**                           | Length of the car (Millimetres)                                                                   |
| **width**                            | Width of the car (Millimetres)                                                                    |
| **height**                           | Height of the car (Millimetres)                                                                   |
| **gross_weight**                     | The maximum allowable weight of the fully-loaded car, including passengers, cargo, and equipment (Kg)|
| **is_front_fog_lights**              | Boolean flag indicating whether front fog lights are available in the car or not                  |
| **is_rear_window_wiper**             | Boolean flag indicating whether the rear window wiper is available in the car or not              |
| **is_rear_window_washer**            | Boolean flag indicating whether the rear window washer is available in the car or not             |
| **is_rear_window_defogger**          | Boolean flag indicating whether the rear window defogger is available in the car or not           |
| **is_brake_assist**                  | Boolean flag indicating whether the brake assistance feature is available in the car or not       |
| **is_power_door_locks**              | Boolean flag indicating whether a power door lock is available in the car or not                  |
| **is_central_locking**               | Boolean flag indicating whether the central locking feature is available in the car or not        |
| **is_power_steering**                | Boolean flag indicating whether power steering is available in the car or not                     |
| **is_driver_seat_height_adjustable** | Boolean flag indicating whether the height of the driver seat is adjustable or not                |
| **is_day_night_rear_view_mirror**    | Boolean flag indicating whether day & night rearview mirror is present in the car or not          |
| **is_ecw**                           | Boolean flag indicating whether Engine Check Warning (ECW) is available in the car or not         |
| **is_speed_alert**                   | Boolean flag indicating whether the speed alert system is available in the car or not             |
| **ncap_rating**                      | Safety rating given by NCAP (out of 5)                                                            |
| **is_claim**                         | Outcome: Boolean flag indicating whether the policyholder filed a claim in the next 6 months or not|


# Import necessary libraries

In [None]:
import pandas as pd
import numpy as np
from datasist.structdata import detect_outliers
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots

# Data Reading & Understanding

In [None]:
# Load the dataset
df = pd.read_csv(r"C:\Users\Lenovo\Desktop\final_project\car_insurance\Sourse\modified_train_subset.csv")

In [None]:
# Display the first 10 rows of the dataset and its shape
print(df.shape)
df.head(10)

In [None]:
# Get information about the dataset 
df.info()

In [None]:
# Describe the dataset 
df.describe()

In [None]:
# Get number of unique values in each column and their unique values 
for col in df.columns:
    print('Column Name: ',col)
    print(df[col].nunique())
    print('-'*30)
    print(df[col].unique())
    print('='*30)

# Data Cleaning & Preprocessing 

In [None]:
# Replace '?' with NaN in the entire DataFrame
df.replace('?', np.nan, inplace=True)

In [None]:
# Check for missing values in each column
df.isna().sum()

In [None]:
# Plot missing values 
na_counts = df.isna().sum()
fig = px.bar(x=na_counts.values, y=na_counts.index, orientation='h', title='Missing Values', labels={'x':'Count', 'y':'Column'})
fig.update_layout(template='plotly_white')
fig.show()

In [None]:
# Check for duplicates
df.duplicated().sum()

In [None]:
# Drop duplicates and reset the index
df.drop_duplicates(inplace=True)
df.reset_index(drop=True, inplace=True)
df.duplicated().sum()

In [None]:
# Drop rows with a missing value in the target column
df.dropna(subset=['is_claim'], inplace=True)
df.reset_index(drop=True, inplace=True)

# Drop the unnecessary policy_id column
df = df.drop('policy_id', axis=1)

In [None]:
# Cast columns to appropriate numeric dtypes
# - astype('float') converts the column to float
# - pd.to_numeric(..., errors='coerce') converts non-numeric values to NaN
# - round() rounds to the nearest integer; 'Int64' is a nullable integer dtype that can hold missing values

df['policy_tenure'] = df['policy_tenure'].astype('float')
df['age_of_policyholder'] = df['age_of_policyholder'].astype('float')
df['make'] = pd.to_numeric(df['make'], errors='coerce').round().astype('Int64')
df['population_density'] = pd.to_numeric(df['population_density'], errors='coerce').round().astype('Int64')
df['cylinder'] = pd.to_numeric(df['cylinder'], errors='coerce').round().astype('Int64')
df['length'] = df['length'].astype('float')
df['height'] = df['height'].astype('float')
df['gross_weight'] = df['gross_weight'].astype('float')
df['is_claim'] = pd.to_numeric(df['is_claim'], errors='coerce').round().astype('Int64')


In [None]:
# Extract the leading numeric value (including decimals) from the 'max_torque' and 'max_power' columns
# r'(\d+\.?\d*)' is a raw string so the backslashes are not treated as string escapes;
# expand=False returns a Series, which is then converted to float
df['max_torque'] = df['max_torque'].str.extract(r'(\d+\.?\d*)', expand=False).astype(float)
df['max_power'] = df['max_power'].str.extract(r'(\d+\.?\d*)', expand=False).astype(float)
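
As a quick illustration of what the pattern in the cell above captures, the sketch below applies the same regular expression with Python's `re` module to made-up strings in the `Nm@rpm` and `bhp@rpm` formats listed in the feature table. The sample strings and the helper name `leading_number` are assumptions for demonstration only, not values from the dataset.

```python
import re

# Same pattern as in the cell above: digits, optionally followed by a decimal part
NUMBER_PATTERN = re.compile(r'(\d+\.?\d*)')

def leading_number(text):
    """Return the first number found in `text` as a float, or None if there is none."""
    match = NUMBER_PATTERN.search(text)
    return float(match.group(1)) if match else None

# Hypothetical sample strings (not taken from the dataset)
torque = leading_number('60Nm@3500rpm')
power = leading_number('40.36bhp@6000rpm')
missing = leading_number('unknown')
print(torque, power, missing)
```

Only the first number in each string is kept, so the rpm part after the `@` is discarded.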


In [None]:
# Scale the 'age_of_car', 'age_of_policyholder' and 'policy_tenure' columns by 100
# The normalized values are multiplied by 100 for easier analysis and interpretation.
df['age_of_car'] = df['age_of_car']*100
df['age_of_policyholder'] = df['age_of_policyholder']*100
df['policy_tenure'] = df['policy_tenure']*100

In [None]:
# columns that contain missing values
missing_cols = df.columns[df.isna().any()].tolist()
missing_cols

In [None]:
# Deal with missing values in numerical columns
numerical_cols = df.select_dtypes(include=['int64', 'float64']).columns           # selecting only numerical columns
df[numerical_cols] = df[numerical_cols].fillna(df[numerical_cols].median())       # filling missing values with median to be on the safer side in case of outliers
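
To show why the median is the more cautious fill value when outliers are present, here is a minimal sketch on a made-up column. The numbers are hypothetical and only the standard library is used; it is an illustration of the reasoning, not a replacement for the pandas cell above.

```python
from statistics import mean, median

# Hypothetical column with one missing value (None) and one extreme value
values = [10.0, 12.0, 11.0, None, 13.0, 500.0]
observed = [v for v in values if v is not None]

fill_mean = mean(observed)      # pulled upward by the extreme value 500.0
fill_median = median(observed)  # stays close to the bulk of the data

# Fill the missing entry with the median
filled = [fill_median if v is None else v for v in values]
print(fill_mean, fill_median, filled)
```

Here the mean is dominated by the single extreme value, while the median stays at a typical value for the column.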

In [None]:
# Deal with missing values in categorical columns 
categorical_cols = df.select_dtypes(include=['object']).columns                         # selecting only categorical columns
df[categorical_cols] = df[categorical_cols].fillna(df[categorical_cols].mode().iloc[0])  # filling missing values with mode 

In [None]:
# check if there are outliers in the numerical columns
cols = ['policy_tenure', 'age_of_car', 'age_of_policyholder','max_torque', 'max_power']
for col in cols:
    fig = px.box(df, x=col)
    fig.show()

In [None]:
# Detect and remove outliers using the detect_outliers function from the datasist.structdata module
outliers_index = detect_outliers(df, 0, cols)
print(len(outliers_index))
df.drop(outliers_index, inplace=True)
df.reset_index(drop=True, inplace=True)
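
For readers without `datasist` installed, the sketch below shows one common way to flag outlier rows, Tukey's 1.5 × IQR rule. It assumes this is close to what `detect_outliers` does; that is an assumption, not something confirmed by this notebook. The function name `iqr_outlier_rows` and the toy data are hypothetical.

```python
from collections import Counter
from statistics import quantiles

def iqr_outlier_rows(columns, n=0, k=1.5):
    """Return row indices that fall outside the IQR fences in more than `n` columns.

    `columns` maps a column name to a list of numbers (all lists the same length).
    """
    hits = Counter()
    for name, values in columns.items():
        # Quartiles with linear interpolation between data points
        q1, _, q3 = quantiles(values, n=4, method='inclusive')
        step = k * (q3 - q1)
        low, high = q1 - step, q3 + step
        for idx, value in enumerate(values):
            if value < low or value > high:
                hits[idx] += 1
    return sorted(idx for idx, count in hits.items() if count > n)

# Hypothetical toy data: row 5 is extreme in column 'a' only
toy = {
    'a': [1, 2, 3, 4, 5, 100],
    'b': [10, 11, 12, 13, 14, 15],
}
flagged = iqr_outlier_rows(toy, n=0)
print(flagged)
```

With `n=0`, a row is flagged as soon as it is an outlier in a single column; raising `n` makes the rule stricter about which rows are dropped.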


In [None]:
# Show shape after removing outliers and cleaning the data
df.shape

In [None]:
# Show descriptive statistics after cleaning the data 
df.describe().T

In [None]:
# Save the cleaned data to a CSV file (uncomment to run)
# df.to_csv('cleaned_data.csv', index=False)

# Exploratory Data Analysis

In [None]:
# Count the number of occurrences for each value in the 'is_claim' column
is_claim_count = df['is_claim'].value_counts()
is_claim_count

In [None]:
# Create a pie chart to visualize the distribution of claims in the dataset
px.pie(df, names='is_claim', color_discrete_sequence=px.colors.sequential.Cividis, template='plotly_dark')       # The chart uses a color sequence from Plotly's 'Cividis' palette and a dark theme

The pie chart visualizes the distribution of claims within the dataset:

- **No Claims (0):** 53,701 records (approximately 93.6%)
- **Claims (1):** 3,670 records (approximately 6.4%)

**Key Insights:**

- **Imbalance:** The dataset is highly imbalanced, with a significantly higher proportion of records indicating no claims compared to those indicating claims.
- **Impact on Analysis:** This imbalance may affect the performance of predictive models, making it important to consider techniques for handling class imbalance, such as resampling or using algorithms designed to address imbalanced data.
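
One simple way to account for this imbalance, shown here as a minimal sketch rather than the approach used in this project, is to weight each class inversely to its frequency. The snippet uses the counts reported above and the common "balanced" heuristic `n_samples / (n_classes * class_count)`; the variable names are illustrative.

```python
# Class counts reported above for `is_claim`
counts = {0: 53701, 1: 3670}

n_samples = sum(counts.values())
n_classes = len(counts)

# "Balanced" heuristic: rarer classes receive proportionally larger weights
class_weights = {
    label: n_samples / (n_classes * count)
    for label, count in counts.items()
}

# Share of records that are claims
minority_share = counts[1] / n_samples
print(class_weights, round(minority_share, 3))
```

Weights of this kind can be passed to estimators that accept per-class weights, as an alternative to resampling the training data.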


In [None]:
# Define the columns
cols = ['policy_tenure', 'age_of_car', 'age_of_policyholder', 'population_density', 'make',
        'airbags', 'displacement', 'cylinder', 'gear_box', 'length', 'width', 'height',
        'gross_weight','ncap_rating','turning_radius']

# Number of columns to plot
num_columns = len(cols) 

# Determine the number of rows and columns needed for the subplots grid
num_cols = 3  # Number of columns in the grid
num_rows = (num_columns + num_cols - 1) // num_cols  # Calculate number of rows required

# Create a subplot figure
fig = make_subplots(rows=num_rows, cols=num_cols, subplot_titles=cols)

# Loop through each column and add a histogram to the subplot grid
for i, col in enumerate(cols):
    row = i // num_cols + 1  # Calculate the row index
    col_pos = i % num_cols + 1  # Calculate the column index
    
    # Add histogram for each column
    fig.add_trace(go.Histogram(x=df[col], name=col), row=row, col=col_pos)

# Update layout for the entire figure
fig.update_layout(height=1500, width=1000, showlegend=False, title_text='Histograms of DataFrame Columns')

# Adjust spacing between subplots
fig.update_layout(margin=dict(l=10, r=10, t=50, b=10), template='plotly_dark')

# Display the plot
fig.show()


1. **Policy Activation Duration**:
   - The majority of people opt for insurance immediately when they purchase a car.

2. **Age of Policyholders**:
   - Most policyholders have a normalized age between 0.3 and 0.4, which corresponds to 30 to 40 after the ×100 rescaling applied during preprocessing.

3. **Car Make Preferences**:
   - Most people prefer cars from 'Make 1', followed by 'Make 3'.
   - The least preferred makes are 'Make 2', 'Make 4', and 'Make 5'.

4. **Airbags in Cars**:
   - Most cars have "2" airbags (40,000+), followed by "6" airbags (approximately 17,000).
   - The "1" airbag option is present in almost 1,000 cars, while no cars have "3", "4", or "5" airbags.

5. **NCAP Rating of Cars**:
   - Most cars have an NCAP rating of "2" (20,000+), followed by a rating of "0" (approximately 19,000).
   - Cars with ratings of "4" and "5" are considered the safest (2,500 cars each).


In [None]:
# Define the columns that contain 'is_'
cols_contain_is = ['is_esc', 'is_adjustable_steering', 'is_tpms', 'is_parking_sensors',
                   'is_parking_camera', 'is_front_fog_lights', 'is_rear_window_wiper',
                   'is_rear_window_washer', 'is_rear_window_defogger', 'is_brake_assist',
                   'is_power_door_locks', 'is_central_locking', 'is_power_steering',
                   'is_driver_seat_height_adjustable', 'is_day_night_rear_view_mirror',
                   'is_ecw', 'is_speed_alert']

# Number of columns for subplots
cols = 3
rows = (len(cols_contain_is) + cols - 1) // cols  # Calculate the number of rows needed

# Create a subplot grid
fig = make_subplots(rows=rows, cols=cols, subplot_titles=cols_contain_is)

# Loop through each column in cols_contain_is and add a histogram to the subplot
for i, col in enumerate(cols_contain_is):
    row = (i // cols) + 1
    col_num = (i % cols) + 1
    histogram = px.histogram(df, x=col, color='is_claim', template='plotly_dark', nbins=2, barmode='group')
    
    # Add trace to the corresponding subplot
    for trace in histogram['data']:
        fig.add_trace(trace, row=row, col=col_num)
    
    # Update x-axis to display only 0 and 1
    fig.update_xaxes(
        tickvals=[0, 1], 
        ticktext=['No', 'Yes'], 
        row=row, 
        col=col_num
    )

# Apply dark mode to the layout
fig.update_layout(
    template='plotly_dark',  # Apply dark template
    height=rows * 200, 
    width=cols * 300, 
    title_text="Histograms of Features with 'is_' Prefix Colored by 'is_claim'",
    showlegend=False,
    paper_bgcolor='black',  # Background color of the entire figure
    plot_bgcolor='black',   # Background color of the plotting area
    font=dict(color='white')  # Set text color to white for visibility
)

# Show the figure
fig.show()


In [None]:
# Define the columns
cols = ['area_cluster', 'population_density', 'model', 'fuel_type', 'airbags', 'displacement', 'cylinder', 'transmission_type', 'gear_box', 'steering_type', 'turning_radius']

# Determine the number of rows and columns needed for the subplots grid
num_cols = 2  # Number of columns in the grid
num_rows = (len(cols) + num_cols - 1) // num_cols  # Calculate number of rows required

# Create a subplot figure
fig = make_subplots(rows=num_rows, cols=num_cols, subplot_titles=[f'Histogram for {col}' for col in cols])

# Loop through each column and add a histogram to the subplot grid
for i, col in enumerate(cols):
    row = i // num_cols + 1  # Calculate the row index
    col_pos = i % num_cols + 1  # Calculate the column index
    
    # Create the histogram using Plotly Express
    fig_px = px.histogram(df, x=col, template='plotly_dark', barmode='group', color='is_claim')
    
    # Convert the Plotly Express figure to a graph object trace and add to subplots
    for trace in fig_px['data']:
        fig.add_trace(trace, row=row, col=col_pos)
    
    # Update layout for each subplot
    fig.update_xaxes(title_text=col, row=row, col=col_pos, tickangle=90)
    fig.update_yaxes(title_text='Count', row=row, col=col_pos)

# Update layout for the entire figure
fig.update_layout(height=1500, width=1000, showlegend=False, title_text='Histograms for Selected Columns', 
                  template='plotly_dark')

# Display the plot
fig.show()

1. **Highest Claim Count in Area C8**:
   - The highest number of claims, approximately 1,000, comes from area C8.

2. **High Claims Among Specific Car Models (M1, M4, M6)**:
   - Owners of models M1, M4, and M6 have the highest claims, with about 1,000 claims each.

3. **Zero Claims Without Speed Alert System**:
   - There are zero claims where the speed alert system isn't present in the car.
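
The insights above are based on raw claim counts, which partly reflect how many policies fall in each group. As a complementary check, the sketch below compares claim rates per group instead. The toy records and the helper name `claim_rate_by_group` are hypothetical and do not reproduce figures from the dataset.

```python
from collections import defaultdict

def claim_rate_by_group(records, group_key, target_key='is_claim'):
    """Return the fraction of records with a claim for each value of `group_key`."""
    totals = defaultdict(int)
    claims = defaultdict(int)
    for row in records:
        totals[row[group_key]] += 1
        claims[row[group_key]] += row[target_key]
    return {group: claims[group] / totals[group] for group in totals}

# Hypothetical toy records: C8 has more claims in total, C1 has the higher rate
rows = (
    [{'area_cluster': 'C8', 'is_claim': 1}] * 10
    + [{'area_cluster': 'C8', 'is_claim': 0}] * 190
    + [{'area_cluster': 'C1', 'is_claim': 1}] * 3
    + [{'area_cluster': 'C1', 'is_claim': 0}] * 17
)
rates = claim_rate_by_group(rows, 'area_cluster')
print(rates)
```

In this toy example the group with the most claims is not the one with the highest claim rate, which is why both views are worth checking before drawing conclusions.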

In [None]:
# Number of bottom features to display
num_bottom_features = 10

# Calculate the absolute correlation of each numeric feature with 'is_claim' and keep the weakest ones
correlation_values = df.corr(numeric_only=True)['is_claim'].abs().sort_values(ascending=True)[:num_bottom_features]

# Get the bottom feature names
bottom_features = correlation_values.index

# Subset the dataframe to include only these features
subset_df = df[bottom_features]

# Calculate the correlation matrix for the subset
correlation_matrix = subset_df.corr()

# Create the heatmap using Plotly
fig = px.imshow(correlation_matrix, 
                text_auto=True,  # Automatically show values in the heatmap cells
                color_continuous_scale='RdBu_r',  # Color scale (red to blue reversed)
                labels={'color': 'Correlation'},  # Label for the color bar
                title=f'Bottom {num_bottom_features} Correlation Heatmap')

# Update layout to improve appearance
fig.update_layout(height=600, width=800, template='plotly_dark')

# Display the heatmap
fig.show()
