<center><img src="car.jpg" alt="Parked car" width="400" height="300"></center>

Insurance companies invest a lot of [time and money](https://www.accenture.com/_acnmedia/pdf-84/accenture-machine-leaning-insurance.pdf) into optimizing their pricing and accurately estimating the likelihood that customers will make a claim. In many countries it is a legal requirement to have car insurance in order to drive a vehicle on public roads, so the market is very large!

Knowing all of this, On the Road car insurance have requested your services in building a model to predict whether a customer will make a claim on their insurance during the policy period. As they have very little expertise and infrastructure for deploying and monitoring machine learning models, they've asked you to use simple Logistic Regression, identifying the single feature that results in the best performing model, as measured by accuracy.

They have supplied you with their customer data as a csv file called `car_insurance.csv`, along with a table detailing the column names and descriptions below.



## The dataset

| Column | Description |
|--------|-------------|
| `id` | Unique client identifier |
| `age` | Client's age: <br> <ul><li>`0`: 16-25</li><li>`1`: 26-39</li><li>`2`: 40-64</li><li>`3`: 65+</li></ul> |
| `gender` | Client's gender: <br> <ul><li>`0`: Female</li><li>`1`: Male</li></ul> |
| `driving_experience` | Years the client has been driving: <br> <ul><li>`0`: 0-9</li><li>`1`: 10-19</li><li>`2`: 20-29</li><li>`3`: 30+</li></ul> |
| `education` | Client's level of education: <br> <ul><li>`0`: No education</li><li>`1`: High school</li><li>`2`: University</li></ul> |
| `income` | Client's income level: <br> <ul><li>`0`: Poverty</li><li>`1`: Working class</li><li>`2`: Middle class</li><li>`3`: Upper class</li></ul> |
| `credit_score` | Client's credit score (between zero and one) |
| `vehicle_ownership` | Client's vehicle ownership status: <br><ul><li>`0`: Does not own their vehicle (paying off finance)</li><li>`1`: Owns their vehicle</li></ul> |
| `vehicle_year` | Year of vehicle registration: <br><ul><li>`0`: Before 2015</li><li>`1`: 2015 or later</li></ul> |
| `married` | Client's marital status: <br><ul><li>`0`: Not married</li><li>`1`: Married</li></ul> |
| `children` | Client's number of children |
| `postal_code` | Client's postal code | 
| `annual_mileage` | Number of miles driven by the client each year |
| `vehicle_type` | Type of car: <br> <ul><li>`0`: Sedan</li><li>`1`: Sports car</li></ul> |
| `speeding_violations` | Total number of speeding violations received by the client | 
| `duis` | Number of times the client has been caught driving under the influence of alcohol |
| `past_accidents` | Total number of previous accidents the client has been involved in |
| `outcome` | Whether the client made a claim on their car insurance (response variable): <br><ul><li>`0`: No claim</li><li>`1`: Made a claim</li></ul> |

In [295]:
# Import required modules
import pandas as pd
import numpy as np
from statsmodels.formula.api import logit

# Start coding!
df = pd.read_csv('car_insurance.csv')
df.head()

Unnamed: 0,id,age,gender,driving_experience,education,income,credit_score,vehicle_ownership,vehicle_year,married,children,postal_code,annual_mileage,vehicle_type,speeding_violations,duis,past_accidents,outcome
0,569520,3,0,0-9y,high school,upper class,0.629027,1.0,after 2015,0.0,1.0,10238,12000.0,sedan,0,0,0,0.0
1,750365,0,1,0-9y,none,poverty,0.357757,0.0,before 2015,0.0,0.0,10238,16000.0,sedan,0,0,0,1.0
2,199901,0,0,0-9y,high school,working class,0.493146,1.0,before 2015,0.0,0.0,10238,11000.0,sedan,0,0,0,0.0
3,478866,0,1,0-9y,university,working class,0.206013,1.0,before 2015,0.0,1.0,32765,11000.0,sedan,0,0,0,0.0
4,731664,1,1,10-19y,none,working class,0.388366,1.0,before 2015,0.0,0.0,32765,12000.0,sedan,2,0,1,1.0


In [296]:
df['driving_experience'].unique()

array(['0-9y', '10-19y', '20-29y', '30y+'], dtype=object)

In [297]:
df.replace({'0-9y': 0, '10-19y': 1, '20-29y': 2, '30y+': 3},inplace=True);

In [298]:
df['driving_experience'].unique()

array([0, 1, 2, 3])

In [299]:
df.replace(['none', 'high school', 'university'], [0, 1, 2], inplace=True);

In [300]:
df.replace(['poverty', 'working class', 'middle class', 'upper class'], [0, 1, 2, 3], inplace = True);

In [301]:
df.replace(['before 2015', 'after 2015'], [0, 1], inplace=True);

In [302]:
df.replace(['sedan', 'sports car'], [0, 1], inplace=True);
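
The `replace` calls above scan every column of `df`, so a label is recoded wherever it appears. A more targeted variant, sketched here on a toy frame with made-up rows, maps each column with its own dictionary. The `toy`, `encodings` and `encoded` names are illustrative and not part of the original notebook.

```python
import pandas as pd

# Toy rows that mimic the raw string columns (values are made up)
toy = pd.DataFrame({
    "driving_experience": ["0-9y", "30y+", "10-19y"],
    "education": ["none", "university", "high school"],
    "vehicle_type": ["sedan", "sports car", "sedan"],
})

# One mapping per column, matching the codes in the dataset table
encodings = {
    "driving_experience": {"0-9y": 0, "10-19y": 1, "20-29y": 2, "30y+": 3},
    "education": {"none": 0, "high school": 1, "university": 2},
    "vehicle_type": {"sedan": 0, "sports car": 1},
}

encoded = toy.copy()
for column, mapping in encodings.items():
    # Series.map yields NaN for any label missing from the mapping,
    # which makes unexpected categories easy to spot
    encoded[column] = toy[column].map(mapping)

print(encoded)
```

Because each mapping is tied to one column, a misspelled or unexpected category shows up as a missing value instead of silently staying a string.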

In [303]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 18 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   id                   10000 non-null  int64  
 1   age                  10000 non-null  int64  
 2   gender               10000 non-null  int64  
 3   driving_experience   10000 non-null  int64  
 4   education            10000 non-null  int64  
 5   income               10000 non-null  int64  
 6   credit_score         9018 non-null   float64
 7   vehicle_ownership    10000 non-null  float64
 8   vehicle_year         10000 non-null  int64  
 9   married              10000 non-null  float64
 10  children             10000 non-null  float64
 11  postal_code          10000 non-null  int64  
 12  annual_mileage       9043 non-null   float64
 13  vehicle_type         10000 non-null  int64  
 14  speeding_violations  10000 non-null  int64  
 15  duis                 10000 non-null  

Inspecting the rows with null values in `credit_score` and `annual_mileage`.

In [304]:
df[df['credit_score'].isna()]

Unnamed: 0,id,age,gender,driving_experience,education,income,credit_score,vehicle_ownership,vehicle_year,married,children,postal_code,annual_mileage,vehicle_type,speeding_violations,duis,past_accidents,outcome
17,24851,0,1,0,0,0,,0.0,0,1.0,0.0,32765,12000.0,0,0,0,0,1.0
23,217,0,1,0,0,0,,0.0,0,0.0,0.0,10238,17000.0,0,0,0,0,0.0
37,511757,2,0,1,0,2,,1.0,0,1.0,1.0,10238,11000.0,0,2,0,1,0.0
38,429947,3,1,3,2,3,,0.0,1,0.0,1.0,10238,12000.0,1,6,0,5,0.0
47,921097,2,0,2,2,3,,1.0,1,1.0,1.0,92101,11000.0,0,3,0,2,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9952,870405,2,0,1,2,3,,1.0,1,1.0,1.0,32765,5000.0,0,1,0,0,0.0
9967,27406,1,0,1,1,2,,0.0,0,0.0,0.0,92101,13000.0,0,1,0,0,0.0
9981,366048,1,1,0,1,1,,1.0,0,0.0,1.0,10238,11000.0,0,0,0,0,0.0
9985,595418,0,1,0,1,1,,1.0,0,0.0,1.0,10238,11000.0,0,0,0,0,0.0


In [305]:
df[df['annual_mileage'].isna()]

Unnamed: 0,id,age,gender,driving_experience,education,income,credit_score,vehicle_ownership,vehicle_year,married,children,postal_code,annual_mileage,vehicle_type,speeding_violations,duis,past_accidents,outcome
13,569640,0,0,0,2,3,0.591260,1.0,0,0.0,1.0,10238,,0,0,0,0,0.0
15,906223,1,0,0,1,3,0.762798,0.0,1,1.0,0.0,10238,,0,0,0,0,0.0
16,517747,3,1,3,2,3,0.796175,1.0,0,1.0,1.0,32765,,0,10,2,1,0.0
18,104086,1,0,0,2,3,0.680594,1.0,0,0.0,1.0,32765,,0,0,0,0,1.0
58,123941,2,0,2,2,2,0.570157,1.0,1,1.0,1.0,10238,,0,0,0,0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9959,163840,2,1,2,0,0,0.203950,0.0,0,0.0,1.0,10238,,0,2,1,2,1.0
9969,355304,2,1,2,2,3,0.545855,1.0,0,1.0,1.0,32765,,0,12,2,0,0.0
9977,794068,3,1,0,0,3,0.710640,1.0,1,0.0,1.0,32765,,0,0,0,0,0.0
9988,479789,1,1,1,1,0,,0.0,0,0.0,0.0,10238,,0,1,0,2,1.0


As a first attempt, we'll treat null values in `credit_score` and `annual_mileage` as 0.

In [306]:
df.fillna(0, inplace=True)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 18 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   id                   10000 non-null  int64  
 1   age                  10000 non-null  int64  
 2   gender               10000 non-null  int64  
 3   driving_experience   10000 non-null  int64  
 4   education            10000 non-null  int64  
 5   income               10000 non-null  int64  
 6   credit_score         10000 non-null  float64
 7   vehicle_ownership    10000 non-null  float64
 8   vehicle_year         10000 non-null  int64  
 9   married              10000 non-null  float64
 10  children             10000 non-null  float64
 11  postal_code          10000 non-null  int64  
 12  annual_mileage       10000 non-null  float64
 13  vehicle_type         10000 non-null  int64  
 14  speeding_violations  10000 non-null  int64  
 15  duis                 10000 non-null  
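
Filling with 0 treats a missing value as a real measurement of zero, which may distort a single-feature model built on either column. One common alternative is to fill each column with its own mean. The sketch below uses made-up values rather than the real data, and the `toy`, `cols` and `filled` names are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Made-up values with one gap in each column
toy = pd.DataFrame({
    "credit_score": [0.6, np.nan, 0.4],
    "annual_mileage": [12000.0, 16000.0, np.nan],
})

cols = ["credit_score", "annual_mileage"]
filled = toy.copy()
# toy[cols].mean() is a Series indexed by column name, so fillna
# applies each column's own mean to that column's missing rows
filled[cols] = toy[cols].fillna(toy[cols].mean())

print(filled)
```

Whether mean imputation actually improves accuracy on this dataset would need to be checked by rerunning the models.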

Looking for correlations between columns that could help simplify the model.

In [307]:
corr_metrics = df.corr()
corr_metrics.style.background_gradient()

Unnamed: 0,id,age,gender,driving_experience,education,income,credit_score,vehicle_ownership,vehicle_year,married,children,postal_code,annual_mileage,vehicle_type,speeding_violations,duis,past_accidents,outcome
id,1.0,0.013512,-0.007343,0.004298,-0.005429,0.000423,0.008453,0.009197,-0.003281,0.014826,0.001233,0.006038,-0.001871,0.006243,0.008156,0.009268,0.001831,-0.010506
age,0.013512,1.0,0.005929,0.707393,0.260793,0.654771,0.318332,0.27214,0.23539,0.384759,0.383708,0.008553,-0.153108,-0.008463,0.458413,0.281937,0.431061,-0.448463
gender,-0.007343,0.005929,1.0,0.007511,0.079606,0.026456,-0.04409,0.007385,0.010674,0.008393,-0.00264,-0.001996,-0.014867,-2.2e-05,0.202095,0.094202,0.223202,0.107208
driving_experience,0.004298,0.707393,0.007511,1.0,0.180533,0.459883,0.216103,0.202788,0.164915,0.269942,0.277546,0.006443,-0.094342,-0.008554,0.637306,0.399398,0.604699,-0.497431
education,-0.005429,0.260793,0.079606,0.180533,1.0,0.563786,0.260339,0.236347,0.203394,0.195583,0.123735,0.020813,-0.06603,-0.003194,0.140876,0.08931,0.124718,-0.189357
income,0.000423,0.654771,0.026456,0.459883,0.563786,1.0,0.470163,0.424349,0.359052,0.394274,0.29165,0.021101,-0.141532,-0.010253,0.310474,0.193068,0.287915,-0.422996
credit_score,0.008453,0.318332,-0.04409,0.216103,0.260339,0.470163,1.0,0.20312,0.175559,0.183019,0.136631,-0.002716,-0.062591,-0.011112,0.124184,0.076998,0.112708,-0.198353
vehicle_ownership,0.009197,0.27214,0.007385,0.202788,0.236347,0.424349,0.20312,1.0,0.158579,0.175626,0.12599,-0.004866,-0.059863,0.005647,0.133868,0.086567,0.119521,-0.378921
vehicle_year,-0.003281,0.23539,0.010674,0.164915,0.203394,0.359052,0.175559,0.158579,1.0,0.129638,0.105189,0.006958,-0.034059,-0.025185,0.1027,0.049981,0.097587,-0.294178
married,0.014826,0.384759,0.008393,0.269942,0.195583,0.394274,0.183019,0.175626,0.129638,1.0,0.287009,0.012045,-0.271263,0.006905,0.218855,0.12084,0.215269,-0.262104
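
Since the goal is a single-feature model, the part of the matrix that matters most is the `outcome` column. The following is a minimal sketch, on toy data with made-up values, of ranking features by the absolute value of their correlation with the target. The `toy` and `ranking` names are illustrative.

```python
import pandas as pd

# Made-up rows: one informative feature, one weak feature, and the target
toy = pd.DataFrame({
    "driving_experience": [0, 0, 1, 2, 3, 3],
    "gender": [0, 1, 0, 1, 0, 1],
    "outcome": [1, 1, 1, 0, 0, 0],
})

ranking = (
    toy.corr()["outcome"]          # correlation of every column with the target
    .drop("outcome")               # the target correlates perfectly with itself
    .abs()                         # strength matters here, not direction
    .sort_values(ascending=False)
)

print(ranking)
```

Correlation is only a hint here: the brief asks for the best feature as measured by accuracy, so the ranking still has to be confirmed by fitting the models.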


### Splitting our data

In [308]:
from sklearn.model_selection import train_test_split

X = df.drop('outcome', axis=1)
y = df.outcome

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 10)

In [309]:
X.columns

Index(['id', 'age', 'gender', 'driving_experience', 'education', 'income',
       'credit_score', 'vehicle_ownership', 'vehicle_year', 'married',
       'children', 'postal_code', 'annual_mileage', 'vehicle_type',
       'speeding_violations', 'duis', 'past_accidents'],
      dtype='object')

In [310]:
X_train[[X.columns[0]]].values

array([[ 71781],
       [ 15689],
       [274700],
       ...,
       [205125],
       [533665],
       [390289]])

In [311]:
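from sklearn.linear_model import LogisticRegression

feature = []
accuracy = []
# Fit one single-feature logistic regression per column and record its test accuracy
for column in X.columns:
    clf = LogisticRegression(random_state = 10)
    clf.fit(X_train[[column]].values, y_train.values)
    feature.append(column)
    accuracy.append(clf.score(X_test[[column]].values, y_test))

# Pick the feature with the highest test accuracy
best_accuracy = max(accuracy)
best_feature = feature[accuracy.index(best_accuracy)]
print(best_feature, best_accuracy)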

driving_experience 0.7804


In [312]:
features_df = pd.DataFrame(list(zip(feature, accuracy)), columns = ['feature', 'accuracy'])
features_df

Unnamed: 0,feature,accuracy
0,id,0.6868
1,age,0.7784
2,gender,0.6868
3,driving_experience,0.7804
4,education,0.6868
5,income,0.738
6,credit_score,0.6456
7,vehicle_ownership,0.7352
8,vehicle_year,0.6868
9,married,0.6868
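
Several features score exactly 0.6868. An identical score across unrelated features usually suggests that the model predicts the same class for every row, so it is worth comparing against a majority-class baseline. The sketch below uses scikit-learn's `DummyClassifier` on made-up labels. It is not part of the original notebook, and `X_toy`, `y_toy` and `baseline` are illustrative names.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Made-up labels: six "no claim" rows and two "claim" rows
y_toy = np.array([0, 0, 0, 1, 0, 1, 0, 0])
# The dummy model ignores the features, so a column of zeros is enough
X_toy = np.zeros((len(y_toy), 1))

# Always predict the most frequent class seen during fit
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_toy, y_toy)
baseline_accuracy = baseline.score(X_toy, y_toy)

print(baseline_accuracy)
```

On the real data, fitting the baseline on `y_train` and scoring it on `y_test` would show whether 0.6868 is simply the share of the majority class in the test set.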


In [313]:
# Row label of the feature with the highest accuracy
best_index = features_df['accuracy'].idxmax()
best_index

3

In [314]:
best_feature = features_df.iloc[best_index,0]
best_accuracy = features_df.iloc[best_index,1]

best_feature_df = pd.DataFrame([[best_feature, best_accuracy]], columns = ['best_feature', 'best_accuracy'])
best_feature_df

Unnamed: 0,best_feature,best_accuracy
0,driving_experience,0.7804
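
The notebook imports `logit` from statsmodels but fits the models with scikit-learn. For comparison, here is a hedged sketch of scoring a single feature with `logit` and its confusion matrix. It runs on made-up rows, uses the full toy frame rather than a train/test split, and the `toy`, `model`, `conf_matrix` and `accuracy` names are assumptions.

```python
import pandas as pd
from statsmodels.formula.api import logit

# Made-up rows: claims become rarer as driving experience increases
toy = pd.DataFrame({
    "driving_experience": [0] * 4 + [1] * 4 + [2] * 4 + [3] * 4,
    "outcome": [1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0],
})

# Fit a one-feature logistic regression; disp=0 silences the optimizer log
model = logit("outcome ~ driving_experience", data=toy).fit(disp=0)

# pred_table() returns a 2x2 confusion matrix at a 0.5 threshold:
# rows are actual classes, columns are predicted classes
conf_matrix = model.pred_table()
accuracy = (conf_matrix[0, 0] + conf_matrix[1, 1]) / conf_matrix.sum()

print(accuracy)
```

Note that this measures in-sample accuracy, so it is not directly comparable to the held-out scores in `features_df`.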
