# Data Preparation & Cleaning

We began by cleaning and preparing the dataset so that we could draw meaningful insights from it and answer the question we posed.

#### Question: Create a customer segmentation campaign to label customers based on their risk profiles (e.g. riskier customers, less risky customers, normal drivers).

#### Dataset: [Vehicle Insurance Policy 2020](https://www.kaggle.com/datasets/lakshmanraj/vehicle-insurance-policy?select=Vehicle_policies_2020.csv)

## Table of Contents:

1. Dropping NaNs & Unused Columns
2. Splitting Dataset into Two
3. Removing Outliers

In [1]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt

In [2]:
# import CSV file
data = pd.read_csv('Vehicle_policies_2020.csv')
data

Unnamed: 0,pol_number,pol_eff_dt,gender,agecat,date_of_birth,credit_score,area,traffic_index,veh_age,veh_body,veh_value,claim_office,numclaims,claimcst0,annual_premium
0,43124327,12/30/2020,F,4.0,7/12/1968,381.0,D,133.6,2,HBACK,1.331,,0,0.0,716.53
1,21919609,12/30/2020,F,2.0,11/5/1982,549.0,D,163.6,1,UTE,3.740,,0,0.0,716.53
2,72577057,12/30/2020,M,2.0,11/26/1983,649.0,B,117.5,4,COUPE,0.880,,0,0.0,716.53
3,92175225,12/30/2020,M,4.0,11/2/1960,743.0,B,100.7,3,SEDAN,1.045,,0,0.0,716.53
4,66223239,12/30/2020,F,4.0,1/4/1968,817.0,C,115.5,4,HBACK,0.473,,0,0.0,716.53
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
60387,73248694,1/2/2020,F,5.0,5/10/1956,809.0,C,145.5,4,HBACK,0.891,,0,0.0,716.53
60388,71411764,1/2/2020,M,4.0,3/22/1961,681.0,D,86.3,1,HBACK,1.881,,0,0.0,716.53
60389,89353155,1/2/2020,M,4.0,9/29/1965,773.0,F,110.0,1,STNWG,5.170,,0,0.0,716.53
60390,40916605,1/2/2020,M,3.0,8/1/1978,714.0,B,,1,HBACK,1.903,,0,0.0,716.53


In [3]:
data.shape

(60392, 15)

In [4]:
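# without parentheses this displays the bound method (and the DataFrame's repr);
# call data.info() to get the per-column dtype and non-null summary instead
data.info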

<bound method DataFrame.info of        pol_number  pol_eff_dt gender  agecat date_of_birth  credit_score area  \
0        43124327  12/30/2020      F     4.0     7/12/1968         381.0    D   
1        21919609  12/30/2020      F     2.0     11/5/1982         549.0    D   
2        72577057  12/30/2020      M     2.0    11/26/1983         649.0    B   
3        92175225  12/30/2020      M     4.0     11/2/1960         743.0    B   
4        66223239  12/30/2020      F     4.0      1/4/1968         817.0    C   
...           ...         ...    ...     ...           ...           ...  ...   
60387    73248694    1/2/2020      F     5.0     5/10/1956         809.0    C   
60388    71411764    1/2/2020      M     4.0     3/22/1961         681.0    D   
60389    89353155    1/2/2020      M     4.0     9/29/1965         773.0    F   
60390    40916605    1/2/2020      M     3.0      8/1/1978         714.0    B   
60391    33623054    1/2/2020      F     3.0    12/12/1973           NaN    D
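
The difference between referencing `info` and calling it can be sketched on a toy DataFrame. The values below are hypothetical; only the column names are borrowed from the dataset. Capturing the summary in a string buffer is optional and is used here only so the result can be inspected.

```python
import io

import pandas as pd

# Toy frame with made-up values; only the column names come from the dataset
toy = pd.DataFrame({
    "gender": ["F", "M", "F"],
    "credit_score": [381.0, None, 649.0],
})

# `toy.info` (no parentheses) is only a reference to the method.
# `toy.info()` runs it and reports dtypes and non-null counts per column.
buffer = io.StringIO()
toy.info(buf=buffer)  # write the summary into the buffer instead of stdout
summary = buffer.getvalue()
print(summary)
```

In a notebook, calling `toy.info()` with no arguments prints the same summary directly below the cell.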

## 1. Dropping NaNs & Unused Columns
First, we noticed that our dataset contains NaNs, which mark fields that have no recorded value. Since we only want records in which every field is filled in, we drop every row that contains a NaN. This shrinks the dataset from 60,392 rows to 8,123. In the preview above, claim_office is empty for each displayed policy with numclaims of 0, so most of the removed rows appear to be claim-free policies.

In [5]:
# drop all the NaN values
data = data.dropna()

# reset the index of the rows of the DataFrame
data = data.reset_index(drop=True)

print(f"The shape of the new dataset: {data.shape}")

The shape of the new dataset: (8123, 15)


In [6]:
# checking if NaNs exist in our dataset after dropping
data.isnull().values.any()

False
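
Before dropping rows, it can help to see which columns the NaNs come from. The sketch below uses a toy DataFrame with hypothetical values (the claim office label "B" is made up) and is not the notebook's own code; it counts NaNs per column and compares a full `dropna()` with one restricted to a chosen column.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values; only the column names come from the dataset
toy = pd.DataFrame({
    "credit_score": [381.0, np.nan, 649.0, 743.0],
    "claim_office": [np.nan, np.nan, "B", np.nan],
    "numclaims": [0, 0, 1, 0],
})

# number of NaNs in each column
nan_counts = toy.isna().sum()
print(nan_counts)

# dropping on every column keeps only fully populated rows
rows_all = len(toy.dropna())

# dropping on a subset ignores NaNs in the other columns
rows_subset = len(toy.dropna(subset=["credit_score"]))

print(f"rows kept (all columns): {rows_all}, rows kept (credit_score only): {rows_subset}")
```

Whether a subset-based drop is appropriate depends on which columns the later analysis actually needs.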

In [7]:
# drop columns that are not used in the analysis
data = data.drop(columns=["pol_number", "pol_eff_dt", "claim_office", "annual_premium"])
data

Unnamed: 0,gender,agecat,date_of_birth,credit_score,area,traffic_index,veh_age,veh_body,veh_value,numclaims,claimcst0
0,M,5.0,8/7/1957,584.0,C,105.0,3,SEDAN,1.5290,1,1120.833360
1,M,3.0,4/25/1977,396.0,E,25.5,4,UTE,1.6500,2,4548.075015
2,F,4.0,6/14/1965,347.0,B,136.3,3,SEDAN,1.6170,1,2265.262185
3,F,2.0,4/11/1980,431.0,C,111.0,1,HBACK,2.4178,3,6616.971570
4,F,4.0,11/16/1968,798.0,B,129.6,4,SEDAN,1.1880,2,225.489518
...,...,...,...,...,...,...,...,...,...,...,...
8118,M,6.0,1/26/1933,637.0,B,109.0,2,HBACK,1.2760,1,467.640398
8119,M,4.0,9/20/1968,443.0,C,106.5,4,HBACK,0.9350,1,804.425046
8120,M,4.0,3/13/1968,582.0,C,132.0,4,SEDAN,0.3850,1,490.897473
8121,M,1.0,4/19/1994,610.0,A,76.4,3,HBACK,1.1220,1,402.791766


## 2. Splitting Dataset into Two
For our analysis, we split the dataset into two DataFrames according to the machine learning technique that each set of variables is used in:

- reg_df: variables used in Linear Regression
- clust_df: variables used in Clustering

From this point on, further data cleaning and preparation is done separately for these two DataFrames.

Variables used in Linear Regression (3 in total):

gender, agecat, traffic_index

Variables used in Clustering (10 in total):

gender, agecat, credit_score, area, traffic_index, veh_age, veh_body, veh_value, numclaims, claimcst0


In [11]:
reg_df = data[[
    'gender', 
    'agecat',
    'traffic_index',
]]

clust_df = data[[
    'gender',
    'agecat',
    'credit_score',
    'area',
    'traffic_index',
    'veh_age',
    'veh_body',
    'veh_value',
    'numclaims',
    'claimcst0'
]]
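
If the two subsets are going to be modified independently later on, one common precaution is to take an explicit copy when selecting the columns. The sketch below shows this on a toy DataFrame with hypothetical values; `reg_cols` and `reg_toy` are illustrative names, and the rescaling step is only there to show that the parent frame is left untouched.

```python
import pandas as pd

# Toy frame with made-up values; only the column names come from the dataset
toy = pd.DataFrame({
    "gender": ["M", "F"],
    "agecat": [5.0, 2.0],
    "traffic_index": [105.0, 111.0],
    "veh_value": [1.529, 2.4178],
})

reg_cols = ["gender", "agecat", "traffic_index"]

# .copy() makes the subset an independent DataFrame
reg_toy = toy[reg_cols].copy()

# modifying the subset does not change the parent frame
reg_toy["traffic_index"] = reg_toy["traffic_index"] / 100

print(toy["traffic_index"].tolist())
print(reg_toy["traffic_index"].tolist())
```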

In [10]:
reg_df.info

<bound method DataFrame.info of      gender  agecat  traffic_index
0         M     5.0          105.0
1         M     3.0           25.5
2         F     4.0          136.3
3         F     2.0          111.0
4         F     4.0          129.6
...     ...     ...            ...
8118      M     6.0          109.0
8119      M     4.0          106.5
8120      M     4.0          132.0
8121      M     1.0           76.4
8122      M     4.0           63.0

[8123 rows x 3 columns]>

In [12]:
clust_df.info

<bound method DataFrame.info of      gender  agecat  credit_score area  traffic_index  veh_age veh_body  \
0         M     5.0         584.0    C          105.0        3    SEDAN   
1         M     3.0         396.0    E           25.5        4      UTE   
2         F     4.0         347.0    B          136.3        3    SEDAN   
3         F     2.0         431.0    C          111.0        1    HBACK   
4         F     4.0         798.0    B          129.6        4    SEDAN   
...     ...     ...           ...  ...            ...      ...      ...   
8118      M     6.0         637.0    B          109.0        2    HBACK   
8119      M     4.0         443.0    C          106.5        4    HBACK   
8120      M     4.0         582.0    C          132.0        4    SEDAN   
8121      M     1.0         610.0    A           76.4        3    HBACK   
8122      M     4.0         602.0    A           63.0        2      UTE   

      veh_value  numclaims    claimcst0  
0        1.5290          

## 3. Removing Outliers
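Outliers inflate the variability in the data, which reduces statistical power, so excluding them can make genuine effects easier to detect. We use the 1.5 × IQR rule: for each variable, values that lie more than 1.5 times the interquartile range below the first quartile or above the third quartile are dropped. Note that the filter is applied to data; reg_df and clust_df were created before this step, so they need to be rebuilt from the filtered data to reflect it.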

In [14]:
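def removeOutliers(df, var):
    """Return the rows of df whose `var` value lies strictly inside the 1.5 * IQR fences."""
    q1 = df[var].quantile(0.25)  # first quartile
    q3 = df[var].quantile(0.75)  # third quartile
    iqr = q3 - q1                # interquartile range
    lower = q1 - 1.5 * iqr       # lower fence
    upper = q3 + 1.5 * iqr       # upper fence

    # keep only the values strictly between the two fences
    return df[(df[var] > lower) & (df[var] < upper)]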
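
The arithmetic behind the fences can be traced on a small, made-up series. The numbers below are hypothetical and chosen so that one value (100) is clearly extreme; the quantiles use pandas' default linear interpolation.

```python
import pandas as pd

# Hypothetical values; 100 is an artificial outlier
values = pd.Series([10, 12, 13, 14, 15, 16, 18, 100], name="toy_var")

q1 = values.quantile(0.25)   # 12.75
q3 = values.quantile(0.75)   # 16.5
iqr = q3 - q1                # 3.75
lower = q1 - 1.5 * iqr       # 7.125
upper = q3 + 1.5 * iqr       # 22.125

# keep values strictly inside the fences, as in the rule described above
kept = values[(values > lower) & (values < upper)]
print(kept.tolist())  # [10, 12, 13, 14, 15, 16, 18]
```

Only the value 100 falls outside the interval (7.125, 22.125), so it is the only one removed.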

In [15]:
data = removeOutliers(data, "credit_score")
data = removeOutliers(data, "traffic_index")
data = removeOutliers(data, "veh_value")
data

Unnamed: 0,gender,agecat,date_of_birth,credit_score,area,traffic_index,veh_age,veh_body,veh_value,numclaims,claimcst0
0,M,5.0,8/7/1957,584.0,C,105.0,3,SEDAN,1.5290,1,1120.833360
1,M,3.0,4/25/1977,396.0,E,25.5,4,UTE,1.6500,2,4548.075015
2,F,4.0,6/14/1965,347.0,B,136.3,3,SEDAN,1.6170,1,2265.262185
3,F,2.0,4/11/1980,431.0,C,111.0,1,HBACK,2.4178,3,6616.971570
4,F,4.0,11/16/1968,798.0,B,129.6,4,SEDAN,1.1880,2,225.489518
...,...,...,...,...,...,...,...,...,...,...,...
8118,M,6.0,1/26/1933,637.0,B,109.0,2,HBACK,1.2760,1,467.640398
8119,M,4.0,9/20/1968,443.0,C,106.5,4,HBACK,0.9350,1,804.425046
8120,M,4.0,3/13/1968,582.0,C,132.0,4,SEDAN,0.3850,1,490.897473
8121,M,1.0,4/19/1994,610.0,A,76.4,3,HBACK,1.1220,1,402.791766
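
When the filter is applied to several columns in sequence, it can be useful to record how many rows each step removes and to reset the index afterwards. The sketch below is one way to do that on toy data; `iqr_filter` is a hypothetical helper that mirrors the 1.5 × IQR rule, and the veh_value of 30.0 is an artificial outlier.

```python
import pandas as pd


def iqr_filter(df, var):
    """Hypothetical helper: keep rows whose `var` lies strictly inside the 1.5 * IQR fences."""
    q1 = df[var].quantile(0.25)
    q3 = df[var].quantile(0.75)
    iqr = q3 - q1
    return df[(df[var] > q1 - 1.5 * iqr) & (df[var] < q3 + 1.5 * iqr)]


# Toy frame; the last veh_value (30.0) is made up to act as an outlier
toy = pd.DataFrame({
    "credit_score": [584.0, 396.0, 347.0, 431.0, 798.0, 637.0, 443.0, 582.0],
    "veh_value": [1.529, 1.65, 1.617, 2.4178, 1.188, 1.276, 0.935, 30.0],
})

removed = {}          # rows removed at each step, keyed by column name
filtered = toy
for col in ["credit_score", "veh_value"]:
    before = len(filtered)
    filtered = iqr_filter(filtered, col)
    removed[col] = before - len(filtered)

# filtering leaves gaps in the index, so renumber the rows
filtered = filtered.reset_index(drop=True)

print(removed)
print(filtered.shape)
```

Because each step recomputes the quartiles on the already filtered data, the order of the columns can affect which rows are removed.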


In [17]:
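# export the cleaned DataFrame to a CSV file without the index column
# (the file name matches the raw input file, so writing into the same folder would overwrite it)
data.to_csv("/Users/nataliecje/Desktop/SC1015 Lab/Mini Project/Vehicle_policies_2020.csv", index=False)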
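
A quick round trip can confirm that an exported file reads back as expected. The sketch below writes a toy DataFrame to a temporary directory under a hypothetical file name (`vehicle_policies_cleaned.csv`, not the name used in the notebook) and reloads it.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy frame with made-up values; only the column names come from the dataset
cleaned = pd.DataFrame({"gender": ["M", "F"], "agecat": [5.0, 2.0]})

with tempfile.TemporaryDirectory() as tmp_dir:
    # hypothetical output name, distinct from the raw input file
    out_path = Path(tmp_dir) / "vehicle_policies_cleaned.csv"
    cleaned.to_csv(out_path, index=False)  # index=False avoids an extra unnamed column
    reloaded = pd.read_csv(out_path)

same_columns = list(reloaded.columns) == list(cleaned.columns)
print(same_columns, reloaded.shape)
```

Writing with `index=False` matters here: without it, the row index is saved as an extra unnamed column that reappears when the file is read back.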