In this notebook, we create the clean_data.csv dataset
- It also documents the methodology for the cleaning, feature engineering, wrangling, and preprocessing
- RUN ORDER = 1/5

In [2]:
import pandas as pd
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.linear_model import LogisticRegression
import warnings
warnings.filterwarnings('ignore')
from sklearn.metrics import classification_report, confusion_matrix
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

In [3]:
df = pd.read_csv("data/insurance_train.csv")

- Derive LC, HALC, and made_claim (CS) from the base features
- LC is X.15 divided by X.16
- HALC is LC times X.18
- made_claim is a binary feature: 1 if X.17 is not 0, else 0

In [5]:
df['LC'] = df['X.15'] / df['X.16']
df['HALC'] = df['LC'] * df['X.18']
df['made_claim'] = np.where(df['X.17'] != 0, 1, 0)

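- If LC or HALC is NaN, we convert it to 0
- NaN appears when X.15 and X.16 are both 0, because 0/0 is NaN in pandas. A nonzero value divided by 0 gives inf instead, which fillna does not change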

In [7]:
df['LC'] = df['LC'].fillna(0)
df['HALC'] = df['HALC'].fillna(0)
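
To make the division behaviour concrete, here is a minimal sketch on made-up numbers (the `num` and `den` columns are hypothetical stand-ins for X.15 and X.16, not the real data). It shows which rows become NaN, which become inf, and one possible way to zero out both.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for X.15 (numerator) and X.16 (denominator); values are made up.
toy = pd.DataFrame({"num": [10.0, 0.0, 5.0], "den": [2.0, 0.0, 0.0]})

ratio = toy["num"] / toy["den"]
# Row 0: 10 / 2 is an ordinary value
# Row 1: 0 / 0 is NaN
# Row 2: 5 / 0 is inf, not NaN

# fillna(0) alone only changes the NaN row, so inf is converted to NaN first.
cleaned = ratio.replace([np.inf, -np.inf], np.nan).fillna(0)
print(cleaned.tolist())
```

Whether inf should also become 0 is a modelling decision. The notebook itself only fills NaN.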

- Drop X.1, X.15, X.16, X.17, and X.18
- X.15 to X.18 are the base features used to engineer the new features
- Dropping them reduces potential multicollinearity

In [9]:
df = df.drop(columns=['X.1', 'X.15', 'X.16', 'X.17', 'X.18'])

- We rename X.27 to is_petrol
- This makes it more interpretable for humans
- We also convert P to 1 and D to 0
- A binary numeric feature is easier for the model to use

In [11]:
df.rename(columns={'X.27': 'is_petrol'}, inplace=True)

In [12]:
df['is_petrol'] = df['is_petrol'].map({'P': 1, 'D': 0})
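
One thing to keep in mind with `map` and a dictionary: any value that is not a key in the dictionary becomes NaN. The sketch below uses a made-up series (the `"E"` code is hypothetical) to show how both missing and unexpected fuel codes end up as NaN after mapping.

```python
import pandas as pd

# Made-up fuel codes: None is a missing value, "E" is a hypothetical unexpected code.
fuel = pd.Series(["P", "D", None, "P", "E"])

# Count every distinct raw value, including missing ones, before mapping.
print(fuel.value_counts(dropna=False))

# Values that are not keys in the dict ("E" and None) are mapped to NaN.
mapped = fuel.map({"P": 1, "D": 0})
print(mapped.isna().sum())
```

Checking the raw value counts before mapping is a cheap way to see whether the NaN values come from missing entries or from codes other than P and D.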

In [13]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 37451 entries, 0 to 37450
Data columns (total 26 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   X.2         37451 non-null  object 
 1   X.3         37451 non-null  object 
 2   X.4         37451 non-null  object 
 3   X.5         37451 non-null  object 
 4   X.6         37451 non-null  object 
 5   X.7         37451 non-null  int64  
 6   X.8         37451 non-null  int64  
 7   X.9         37451 non-null  int64  
 8   X.10        37451 non-null  int64  
 9   X.11        37451 non-null  int64  
 10  X.12        37451 non-null  int64  
 11  X.13        37451 non-null  int64  
 12  X.14        37451 non-null  float64
 13  X.19        37451 non-null  int64  
 14  X.20        37451 non-null  int64  
 15  X.21        37451 non-null  int64  
 16  X.22        37451 non-null  int64  
 17  X.23        37451 non-null  int64  
 18  X.24        37451 non-null  int64  
 19  X.25        37451 non-nul

We rename the remaining columns to make them interpretable for humans

In [15]:
df = df.rename(columns={
    'X.2': 'pol_start', 
    'X.3': 'last_renewal',
    'X.4': 'next_renewal',
    'X.5': 'DOB',
    'X.6': 'license_issue_date',
    'X.7': 'is_channel_broker',
    'X.8': 'total_pol_year',
    'X.9': 'total_pol_held',
    'X.10': 'max_policies',
    'X.11': 'max_products',
    'X.12': 'canceled_policies',
    'X.13': 'is_halfyearly',
    'X.14': 'net_premium',
    'X.19': 'risk_type',
    'X.20': 'is_urban',
    'X.21': 'is_multidriver',
    'X.22': 'regis_year',
    'X.23': 'horsepower',
    'X.24': 'cylinder_cap',
    'X.25': 'market_value',
    'X.26': 'door_count',
    'X.28': 'vehicle_weight',
})

- Convert the date features into integer day, month, and year columns
- This granularity lets the models pick up patterns tied to the day, month, or year
- Integers are also needed because datetime values are difficult for the models to deal with

In [17]:
dates_to_convert = ['pol_start', 'last_renewal', 'next_renewal', 'license_issue_date']

In [18]:
for i in dates_to_convert:
    df[i] = pd.to_datetime(df[i], format='%d/%m/%Y')
    df[f'{i}_day'] = df[i].dt.day
    df[f'{i}_month'] = df[i].dt.month
    df[f'{i}_year'] = df[i].dt.year

In [19]:
df = df.drop(columns=dates_to_convert)
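
As a small illustration of the conversion above, here is the same idea on two made-up dates. The explicit day-first format matters: `05/11/2015` is read as 5 November, not 11 May.

```python
import pandas as pd

# Made-up policy start dates in day/month/year form.
toy = pd.DataFrame({"pol_start": ["05/11/2015", "31/01/2018"]})

col = "pol_start"
parsed = pd.to_datetime(toy[col], format="%d/%m/%Y")

# Split each parsed date into three integer columns.
parts = pd.DataFrame({
    f"{col}_day": parsed.dt.day,
    f"{col}_month": parsed.dt.month,
    f"{col}_year": parsed.dt.year,
})
print(parts)
```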

- We compute Age as the difference between 31/12/2019 and the DOB
- Because this data was last collected on 31/12/2019, we assume Age can be approximated relative to that date
- We then drop DOB to reduce multicollinearity
- To reduce noise and variance, we create age buckets
  - 24 and below is a youngin
  - 25 to 39 is an adult
  - 40 to 64 is a middle-aged person
  - 65 and above is an oldie
- This also increases interpretability, if we care about that
  

In [21]:
df['DOB'] = pd.to_datetime(df['DOB'], format='%d/%m/%Y')
reference_date = pd.to_datetime('31/12/2019', format='%d/%m/%Y')
df['Age'] = (reference_date - df['DOB']).dt.days // 365

In [22]:
df = df.drop(columns='DOB')

In [23]:
df['is_youngin'] = (df['Age'] <= 24).astype(int)
df['is_adult'] = ((df['Age'] >= 25) & (df['Age'] <= 39)).astype(int)
df['is_middleaged'] = ((df['Age'] >= 40) & (df['Age'] <= 64)).astype(int)
df['is_old'] = (df['Age'] >= 65).astype(int)

In [24]:
df = df.drop(columns='Age')
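
The four bucket columns can also be built in one step. The sketch below is an alternative, not what the notebook runs: it uses `pd.cut` with the same boundaries on made-up ages, then one-hot encodes the result. For integer ages it should give the same flags as the comparisons above.

```python
import numpy as np
import pandas as pd

# Made-up ages that sit on each bucket boundary.
ages = pd.Series([18, 24, 25, 39, 40, 64, 65, 80], name="Age")

# Right-closed intervals: (-inf, 24], (24, 39], (39, 64], (64, inf)
bins = [-np.inf, 24, 39, 64, np.inf]
labels = ["is_youngin", "is_adult", "is_middleaged", "is_old"]

bucket = pd.cut(ages, bins=bins, labels=labels)

# One 0/1 column per bucket; every row belongs to exactly one bucket.
dummies = pd.get_dummies(bucket).astype(int)
print(dummies)
```

Keeping the boundaries in one list makes it harder for the bullet text and the code to drift apart.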

- is_petrol still has NaN values, so we fill them with logistic regression imputation. We assume each NaN must be P or D. (It could be something else, but that something else would just harm the data.)

In [26]:
# dealing with NaN in is_petrol
df_known = df[df['is_petrol'].notna()]
df_missing = df[df['is_petrol'].isna()]

In [27]:
df_known['is_petrol'].value_counts()

is_petrol
0.0    23074
1.0    13784
Name: count, dtype: int64

In [28]:
df_missing['is_petrol'].value_counts(dropna=False)

is_petrol
NaN    593
Name: count, dtype: int64

In [29]:
X_train = df_known.drop(columns='is_petrol')
y_train = df_known['is_petrol']

In [30]:
logreg = LogisticRegression(max_iter=1000)
logreg.fit(X_train, y_train)

In [31]:
pred_probs = logreg.predict_proba(df_missing[X_train.columns])[:, 1]
pred_classes = (pred_probs >= 0.5).astype(int)

In [32]:
df.loc[df['is_petrol'].isna(), 'is_petrol'] = pred_classes
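
Here is the imputation idea in isolation, as a minimal sketch on synthetic data. The `horsepower` values and the rule linking them to `is_petrol` are made up for illustration. The steps are the same as above: fit on rows where the label is known, predict the rows where it is missing, and write the predictions back.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 200

# Synthetic data: one numeric feature and a made-up rule for the fuel label.
toy = pd.DataFrame({"horsepower": rng.normal(100, 20, n)})
toy["is_petrol"] = (toy["horsepower"] < 100).astype(float)
toy.loc[toy.index[:10], "is_petrol"] = np.nan  # hide 10 labels

features = ["horsepower"]
known = toy[toy["is_petrol"].notna()]
missing = toy[toy["is_petrol"].isna()]

model = LogisticRegression(max_iter=1000)
model.fit(known[features], known["is_petrol"])

# predict() returns the more probable class, i.e. a 0.5 probability cut-off.
toy.loc[missing.index, "is_petrol"] = model.predict(missing[features])
print(toy["is_petrol"].value_counts())
```

Writing back through `missing.index` ties each prediction to its row explicitly. Note that the fit only works if every feature column is numeric and free of NaN.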

In [33]:
df['is_petrol'].value_counts()

is_petrol
0.0    23104
1.0    14347
Name: count, dtype: int64

- IMPORTANT FOR MODEL:
  - Move LC, HALC, and made_claim to the end of the data set
  - When making predictions, the columns must be consistent 

In [35]:
temp = df.pop('LC')
df['LC'] = temp

temp = df.pop('HALC')
df['HALC'] = temp

temp = df.pop('made_claim')
df['made_claim'] = temp
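
The same reordering can be written as a single column selection. A minimal sketch on a made-up frame (the `a` and `b` columns are placeholders for the real features):

```python
import pandas as pd

# Made-up frame with the three target columns scattered among the features.
toy = pd.DataFrame({"LC": [0.0], "a": [1], "HALC": [0.0], "b": [2], "made_claim": [0]})

targets = ["LC", "HALC", "made_claim"]
features = [c for c in toy.columns if c not in targets]

# Features keep their original order; targets go last in a fixed order.
toy = toy[features + targets]
print(list(toy.columns))
```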

- Now we want to predict whether LC is 0 or nonzero
- First, we create is_LC: 1 if LC is not 0, else 0
- We drop HALC, LC, and made_claim
- We then split the data
- After splitting, we apply SMOTE to balance is_LC, because there are far more 0s than 1s
- SMOTE reduces bias toward the majority class
- We apply SMOTE only to the training data (not the test data) to prevent data leakage
  - Data leakage makes the evaluation scores look better than the model's real performance on unseen data
  - In other words, it's not good

In [37]:
df_is_LC = df.copy()

In [38]:
df_is_LC['is_LC'] = (df_is_LC['LC'] != 0).astype(int)

In [39]:
df_is_LC = df_is_LC.drop(columns=['HALC', 'LC', 'made_claim'])

In [40]:
X = df_is_LC.drop(columns=['is_LC'])
y = df_is_LC['is_LC']

In [41]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

In [42]:
smote = SMOTE(random_state=42)
X_train_smote, y_train_smote = smote.fit_resample(X_train, y_train)
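
For intuition about what SMOTE adds to the training set, the sketch below shows the core interpolation idea in plain NumPy. It is a simplified illustration on made-up points, not imblearn's implementation, and `synth_point` is a hypothetical helper: pick a minority sample, pick one of its nearest minority neighbours, and create a new point somewhere on the line between them.

```python
import numpy as np

rng = np.random.default_rng(42)

# Made-up minority-class points in a 2-D feature space.
minority = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.5]])

def synth_point(X, i, k, rng):
    """Interpolate between X[i] and one of its k nearest minority neighbours."""
    dists = np.linalg.norm(X - X[i], axis=1)
    neighbours = np.argsort(dists)[1:k + 1]  # position 0 is the point itself
    j = rng.choice(neighbours)
    gap = rng.uniform(0.0, 1.0)  # how far to move from X[i] toward X[j]
    return X[i] + gap * (X[j] - X[i])

# Create five synthetic points, cycling through the minority samples.
new_points = np.array(
    [synth_point(minority, i % len(minority), 2, rng) for i in range(5)]
)
print(new_points)
```

Because each new point is an interpolation, a 0/1 column such as is_petrol can take fractional values in the synthetic rows.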

- After applying SMOTE to balance the training data, we train a Random Forest Classifier to predict whether an instance has an LC of 0 or a nonzero value
- We get an accuracy of 89% and a ROC AUC of 81%
- Tuning may be required
- We then export the model as temp.pkl
  - temp.pkl will be used as a preliminary model to predict 0 vs. nonzero for LC
  - If 1 is predicted, we take those instances and pass them to another model that will be built in building_RF_for_LC
- Finally, we export the clean data to be used to build the other models
  - These other models will predict a continuous value for LC, a continuous value for HALC, and a binary label for whether a claim has ever been made

In [44]:
rf = RandomForestClassifier(
    n_estimators=100, random_state=42
)

In [45]:
rf.fit(X_train_smote, y_train_smote)

- Accuracy: 89%
- ROC AUC: 81%
- My thoughts: accuracy can be misleading here because the data is heavily skewed, so it is not a good representation of the model
- The ROC AUC is actually pretty good in my opinion. With an AUC of 81%, the model learned meaningful patterns even though the minority class is only about 11% of the data.

In [47]:
y_pred = rf.predict(X_test)
y_proba = rf.predict_proba(X_test)[:, 1]

In [48]:
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

[[6452  212]
 [ 624  203]]
              precision    recall  f1-score   support

           0       0.91      0.97      0.94      6664
           1       0.49      0.25      0.33       827

    accuracy                           0.89      7491
   macro avg       0.70      0.61      0.63      7491
weighted avg       0.87      0.89      0.87      7491
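
The numbers in the report can be re-derived by hand from the confusion matrix, which also shows why accuracy is a weak signal here. The sketch below uses the four counts printed above (rows are true classes, columns are predicted classes) and compares the model's accuracy with a baseline that always predicts 0.

```python
# Counts taken from the confusion matrix above: [[TN, FP], [FN, TP]]
tn, fp, fn, tp = 6452, 212, 624, 203
total = tn + fp + fn + tp

accuracy = (tn + tp) / total
precision_1 = tp / (tp + fp)   # of the predicted 1s, how many were right
recall_1 = tp / (tp + fn)      # of the true 1s, how many were found
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

minority_share = (fn + tp) / total      # fraction of test rows with is_LC = 1
baseline_accuracy = (tn + fp) / total   # accuracy of always predicting 0

print(f"accuracy={accuracy:.4f} baseline={baseline_accuracy:.4f}")
print(f"precision={precision_1:.2f} recall={recall_1:.2f} f1={f1_1:.2f}")
```

The always-0 baseline scores slightly higher on accuracy than the model, while finding none of the claims. The class 1 recall is the more telling number.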



In [50]:
fpr, tpr, thresholds = roc_curve(y_test, y_proba)
auc_score = roc_auc_score(y_test, y_proba)
auc_score

0.80863064693834
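
One way to read this number: ROC AUC is the probability that a randomly chosen positive gets a higher score than a randomly chosen negative. The sketch below computes that directly on made-up labels and scores, counting ties as half.

```python
import numpy as np

# Made-up labels and predicted probabilities.
y_true = np.array([0, 0, 1, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.2, 0.8, 0.7])

pos = scores[y_true == 1]
neg = scores[y_true == 0]

# Compare every positive score with every negative score.
diff = pos[:, None] - neg[None, :]
auc = ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size
print(auc)
```

Here 8 of the 9 positive/negative pairs are ranked correctly. By the same reading, an AUC of about 0.81 means roughly four out of five such pairs are ordered correctly, regardless of the 0.5 cut-off used for the class predictions.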

In [51]:
import joblib
joblib.dump(rf, 'temp.pkl')

['temp.pkl']
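
To show how the saved classifier is meant to be used later, here is a rough sketch of the two-stage idea on synthetic data. Everything in it is an assumption for illustration: the data is random, `stage1` stands in for the model saved as temp.pkl (which would be loaded with `joblib.load`), and `stage2` is a hypothetical regressor standing in for the model built in building_RF_for_LC.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

rng = np.random.default_rng(0)

# Synthetic features, a made-up "has a claim" flag, and a made-up claim amount.
X = rng.normal(size=(300, 3))
has_claim = (X[:, 0] > 0.5).astype(int)
amount = np.where(has_claim == 1, 100 + 50 * X[:, 1], 0.0)

# Stage 1: zero vs. nonzero. Stage 2: amount, trained on nonzero rows only.
stage1 = RandomForestClassifier(n_estimators=20, random_state=0).fit(X, has_claim)
stage2 = RandomForestRegressor(n_estimators=20, random_state=0).fit(
    X[has_claim == 1], amount[has_claim == 1]
)

def predict_two_stage(X_new):
    """Predict 0 where stage 1 says 0, otherwise use the stage 2 estimate."""
    flag = stage1.predict(X_new)
    out = np.zeros(len(X_new))
    mask = flag == 1
    if mask.any():
        out[mask] = stage2.predict(X_new[mask])
    return out

preds = predict_two_stage(X[:10])
print(preds)
```

With this design, any claim that stage 1 misses is predicted as exactly 0, so the recall of the first model caps what the second model can contribute.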

In [52]:
# df_is_LC.to_csv("df_is_LC.csv", index=False)

### EXPORTING clean_data.csv

In [54]:
df.to_csv("clean_data.csv", index=False)