<a href="https://www.kaggle.com/code/datascientistsohail/onehotencoding-ics-classification?scriptVersionId=188509708" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

## Machine Learning Classification for Insurance Cross-Selling

In this notebook, we tackle a binary classification problem in insurance cross-selling. The categorical features are one-hot encoded so that the model receives purely numeric input. Our model of choice is the `LGBMClassifier`, LightGBM's gradient-boosted tree classifier, known for its speed and strong performance on tabular data. To check how well the model generalizes, we use stratified 5-fold cross-validation: the training set is split into five parts, the model is trained on four of them and validated on the fifth, and the held-out part rotates until every part has been used for validation once.
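To make the fold rotation concrete, here is a minimal sketch of stratified 5-fold splitting on a toy label vector. The toy data (90 negatives, 10 positives) and the names `y_toy`, `X_toy`, and `fold_summary` are invented for illustration and are not part of the competition dataset.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy, imbalanced labels (assumption: 10% positives, purely illustrative).
y_toy = np.array([0] * 90 + [1] * 10)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)  # placeholder feature column

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

fold_summary = []
for fold, (trn_idx, val_idx) in enumerate(skf.split(X_toy, y_toy)):
    # Count positives in the held-out part of this fold.
    n_pos = int(y_toy[val_idx].sum())
    fold_summary.append((len(trn_idx), len(val_idx), n_pos))
    print(f"fold {fold}: train={len(trn_idx)} valid={len(val_idx)} positives in valid={n_pos}")
```

Each row is used for validation exactly once, and stratification keeps the share of positives the same in every held-out part, which matters when the target is imbalanced.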

### Import necessary packages and libraries

In [1]:
import numpy as np 
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
from sklearn.metrics import roc_auc_score
from lightgbm import LGBMClassifier
from sklearn.model_selection import StratifiedKFold

### Read Datasets

In [2]:
df = pd.read_csv('/kaggle/input/playground-series-s4e7/train.csv', index_col = 'id')
df_test = pd.read_csv('/kaggle/input/playground-series-s4e7/test.csv', index_col = 'id')
submission = pd.read_csv('/kaggle/input/playground-series-s4e7/sample_submission.csv')

In [3]:
df.head()

id,Gender,Age,Driving_License,Region_Code,Previously_Insured,Vehicle_Age,Vehicle_Damage,Annual_Premium,Policy_Sales_Channel,Vintage,Response
0,Male,21,1,35.0,0,1-2 Year,Yes,65101.0,124.0,187,0
1,Male,43,1,28.0,0,> 2 Years,Yes,58911.0,26.0,288,1
2,Female,25,1,14.0,1,< 1 Year,No,38043.0,152.0,254,0
3,Female,35,1,1.0,0,1-2 Year,Yes,2630.0,156.0,76,0
4,Female,36,1,15.0,1,1-2 Year,No,31951.0,152.0,294,0


In [4]:
df_test.head()

id,Gender,Age,Driving_License,Region_Code,Previously_Insured,Vehicle_Age,Vehicle_Damage,Annual_Premium,Policy_Sales_Channel,Vintage
11504798,Female,20,1,47.0,0,< 1 Year,No,2630.0,160.0,228
11504799,Male,47,1,28.0,0,1-2 Year,Yes,37483.0,124.0,123
11504800,Male,47,1,43.0,0,1-2 Year,Yes,2630.0,26.0,271
11504801,Female,22,1,47.0,1,< 1 Year,No,24502.0,152.0,115
11504802,Male,51,1,19.0,0,1-2 Year,No,34115.0,124.0,148


In [5]:
df.shape, df_test.shape

((11504798, 11), (7669866, 10))

In [6]:
total_columns = len(df.columns)
print(total_columns)

11


### Observe Categorical Columns

In [7]:
obj_cols = [c for c in df.columns if df[c].dtype == "object"]
obj_cols

['Gender', 'Vehicle_Age', 'Vehicle_Damage']

In [8]:
df[obj_cols].head()

id,Gender,Vehicle_Age,Vehicle_Damage
0,Male,1-2 Year,Yes
1,Male,> 2 Years,Yes
2,Female,< 1 Year,No
3,Female,1-2 Year,Yes
4,Female,1-2 Year,No


In [9]:
Vehicle_Age_set = set(df["Vehicle_Age"])
Vehicle_Age_set

{'1-2 Year', '< 1 Year', '> 2 Years'}

In [10]:
Gender_set = set(df["Gender"])
Gender_set

{'Female', 'Male'}

In [11]:
Vehicle_Damage_set = set(df["Vehicle_Damage"])
Vehicle_Damage_set

{'No', 'Yes'}
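The three per-column checks above could also be done in a single pass. A minimal sketch follows, using a small hand-built frame that copies the first five rows shown earlier; the names `toy`, `cat_cols`, and `levels` are illustrative. Selecting with `select_dtypes(exclude="number")` is an assumption made here so that the check does not depend on how a given pandas version labels text columns.

```python
import pandas as pd

# Hand-built stand-in for the first rows of the training frame.
toy = pd.DataFrame({
    "Gender": ["Male", "Male", "Female", "Female", "Female"],
    "Vehicle_Age": ["1-2 Year", "> 2 Years", "< 1 Year", "1-2 Year", "1-2 Year"],
    "Vehicle_Damage": ["Yes", "Yes", "No", "Yes", "No"],
    "Age": [21, 43, 25, 35, 36],
})

# Pick every non-numeric column, then collect its sorted distinct values.
cat_cols = toy.select_dtypes(exclude="number").columns.tolist()
levels = {c: sorted(toy[c].unique().tolist()) for c in cat_cols}

for col, values in levels.items():
    print(col, values)
```

On the full training set, `unique()` avoids building a Python set from millions of values, and the low number of levels per column is what keeps one-hot encoding cheap here.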

In [12]:
target = df.Response.values
df = df.drop(['Response'], axis = "columns")
df.head()

id,Gender,Age,Driving_License,Region_Code,Previously_Insured,Vehicle_Age,Vehicle_Damage,Annual_Premium,Policy_Sales_Channel,Vintage
0,Male,21,1,35.0,0,1-2 Year,Yes,65101.0,124.0,187
1,Male,43,1,28.0,0,> 2 Years,Yes,58911.0,26.0,288
2,Female,25,1,14.0,1,< 1 Year,No,38043.0,152.0,254
3,Female,35,1,1.0,0,1-2 Year,Yes,2630.0,156.0,76
4,Female,36,1,15.0,1,1-2 Year,No,31951.0,152.0,294


In [13]:
num_cols = [c for c in df.columns if c not in obj_cols]
len(num_cols)

7

In [14]:
print(obj_cols)
print('*'*60)
print(num_cols)

['Gender', 'Vehicle_Age', 'Vehicle_Damage']
************************************************************
['Age', 'Driving_License', 'Region_Code', 'Previously_Insured', 'Annual_Premium', 'Policy_Sales_Channel', 'Vintage']


In [15]:
df[obj_cols].head()

id,Gender,Vehicle_Age,Vehicle_Damage
0,Male,1-2 Year,Yes
1,Male,> 2 Years,Yes
2,Female,< 1 Year,No
3,Female,1-2 Year,Yes
4,Female,1-2 Year,No


### Apply OneHotEncoding 

In [16]:
encoder = OneHotEncoder(sparse_output = False)
encoded_df = pd.DataFrame(encoder.fit_transform(df[obj_cols]))
encoded_test = pd.DataFrame(encoder.transform(df_test[obj_cols]))

encoded_df.index = df.index
encoded_test.index = df_test.index

X = pd.concat([df[num_cols], encoded_df], axis =1 )
X_test = pd.concat([df_test[num_cols], encoded_test], axis =1)

In [17]:
X.head()

id,Age,Driving_License,Region_Code,Previously_Insured,Annual_Premium,Policy_Sales_Channel,Vintage,0,1,2,3,4,5,6
0,21,1,35.0,0,65101.0,124.0,187,0.0,1.0,1.0,0.0,0.0,0.0,1.0
1,43,1,28.0,0,58911.0,26.0,288,0.0,1.0,0.0,0.0,1.0,0.0,1.0
2,25,1,14.0,1,38043.0,152.0,254,1.0,0.0,0.0,1.0,0.0,1.0,0.0
3,35,1,1.0,0,2630.0,156.0,76,1.0,0.0,1.0,0.0,0.0,0.0,1.0
4,36,1,15.0,1,31951.0,152.0,294,1.0,0.0,1.0,0.0,0.0,1.0,0.0


In [18]:
X_test.head()

id,Age,Driving_License,Region_Code,Previously_Insured,Annual_Premium,Policy_Sales_Channel,Vintage,0,1,2,3,4,5,6
11504798,20,1,47.0,0,2630.0,160.0,228,1.0,0.0,0.0,1.0,0.0,1.0,0.0
11504799,47,1,28.0,0,37483.0,124.0,123,0.0,1.0,1.0,0.0,0.0,0.0,1.0
11504800,47,1,43.0,0,2630.0,26.0,271,0.0,1.0,1.0,0.0,0.0,0.0,1.0
11504801,22,1,47.0,1,24502.0,152.0,115,1.0,0.0,0.0,1.0,0.0,1.0,0.0
11504802,51,1,19.0,0,34115.0,124.0,148,0.0,1.0,1.0,0.0,0.0,1.0,0.0


### Model LGBMClassifier in Cross-Validation

In [19]:
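splits = 5
# Accumulator for fold-averaged class probabilities on the test set (columns: class 0, class 1)
test_preds = np.zeros((X_test.shape[0], 2))
scores = []
# Stratified folds keep the positive rate roughly equal across splits
folds = StratifiedKFold(n_splits = splits, shuffle = True, random_state = 42)

params = {
    'force_row_wise': True,
    # add other necessary parameters
}

for fold, (trn_idx, val_idx) in enumerate(folds.split(X, target)):
    X_train, X_valid = X.iloc[trn_idx], X.iloc[val_idx]
    y_train, y_valid = target[trn_idx], target[val_idx]

    # Train a fresh model on this fold's training part
    lgbm_model = LGBMClassifier(**params)
    lgbm_model.fit(X_train, y_train)

    # Positive-class probabilities for the held-out part
    y_pred = lgbm_model.predict_proba(X_valid)[:,1]

    score = roc_auc_score(y_valid, y_pred)
    print('Fold Score: ', score)

    scores.append(score)

    # Each fold contributes 1/splits of the final test prediction
    test_preds += lgbm_model.predict_proba(X_test) / splits

# Mean validation ROC AUC across folds
print(np.mean(scores))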

[LightGBM] [Info] Number of positive: 1132047, number of negative: 8071791
[LightGBM] [Info] Total Bins 741
[LightGBM] [Info] Number of data points in the train set: 9203838, number of used features: 14
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.122997 -> initscore=-1.964348
[LightGBM] [Info] Start training from score -1.964348
Fold Score:  0.8754937440329258
[LightGBM] [Info] Number of positive: 1132047, number of negative: 8071791
[LightGBM] [Info] Total Bins 746
[LightGBM] [Info] Number of data points in the train set: 9203838, number of used features: 14
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.122997 -> initscore=-1.964348
[LightGBM] [Info] Start training from score -1.964348
Fold Score:  0.8754672800994766
[LightGBM] [Info] Number of positive: 1132047, number of negative: 8071791
[LightGBM] [Info] Total Bins 741
[LightGBM] [Info] Number of data points in the train set: 9203838, number of used features: 14
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.122997 ->
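Two numbers in the output above can be checked by hand. The sketch below first recomputes `pavg` and `initscore` from the positive and negative counts in the log, taking the initial score to be the log-odds of the positive class, and then computes ROC AUC on a tiny made-up example as the share of (positive, negative) pairs that are ranked correctly. The helper `pairwise_auc` and the toy arrays are illustrative; the quadratic pairwise version is only meant for small inputs.

```python
import math
import numpy as np
from sklearn.metrics import roc_auc_score

# Counts copied from the LightGBM log above.
n_pos, n_neg = 1132047, 8071791
p_avg = n_pos / (n_pos + n_neg)             # share of positives in the training part
init_score = math.log(p_avg / (1 - p_avg))  # log-odds of the positive class
print(round(p_avg, 6), round(init_score, 6))

def pairwise_auc(y_true, y_score):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score, dtype=float)
    pos = y_score[y_true == 1]
    neg = y_score[y_true == 0]
    diff = pos[:, None] - neg[None, :]  # every positive score against every negative score
    return float(((diff > 0) + 0.5 * (diff == 0)).mean())

# Made-up labels and scores, small enough to verify by hand.
y_toy = [0, 0, 1, 1]
s_toy = [0.1, 0.4, 0.35, 0.8]
print(pairwise_auc(y_toy, s_toy), roc_auc_score(y_toy, s_toy))
```

Because ROC AUC depends only on how the scores rank the rows, it is not distorted by the roughly 12% positive rate the way plain accuracy would be.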

### Submission

In [20]:
test_preds.shape

(7669866, 2)
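The two columns of `test_preds` hold the fold-averaged probabilities for class 0 and class 1. As a small sketch of why adding `predict_proba(...) / splits` in each fold yields the mean over folds, the example below uses random stand-ins for the per-fold outputs; `fold_probs`, `running`, and `direct_mean` are illustrative names and no model is trained.

```python
import numpy as np

rng = np.random.default_rng(0)
splits = 5
n_rows = 4

# Stand-ins for each fold model's predict_proba output: each row sums to 1.
fold_probs = []
for _ in range(splits):
    p1 = rng.uniform(size=n_rows)
    fold_probs.append(np.column_stack([1 - p1, p1]))

# Running accumulation, as in the cross-validation loop.
running = np.zeros((n_rows, 2))
for probs in fold_probs:
    running += probs / splits

# The same result computed in one step.
direct_mean = np.mean(fold_probs, axis=0)
print(np.allclose(running, direct_mean))
```

Accumulating in place avoids keeping five full-size prediction arrays in memory, which is convenient with a test set of this size.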

In [21]:
submission.head()

,id,Response
0,11504798,0.5
1,11504799,0.5
2,11504800,0.5
3,11504801,0.5
4,11504802,0.5


In [22]:
predicted_responses = test_preds[:, 1]

In [23]:
predicted_responses.shape

(7669866,)

In [24]:
submission['Response'] = predicted_responses

In [25]:
submission.head()

,id,Response
0,11504798,0.014618
1,11504799,0.411443
2,11504800,0.253371
3,11504801,0.000225
4,11504802,0.035656


In [26]:
submission.to_csv('submission.csv', index = False)
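Before uploading, it can help to run a quick sanity check on the frame that was written. The helper below is a hypothetical sketch, not part of the original notebook: `check_submission` and the toy frames are invented names, and the expected column layout (`id`, `Response`) is taken from the sample submission shown above.

```python
import numpy as np
import pandas as pd

def check_submission(sub, test_ids):
    """Hypothetical helper: return a list of problems found in a submission frame."""
    problems = []
    if list(sub.columns) != ["id", "Response"]:
        problems.append("unexpected columns")
        return problems
    # Same number of rows and the same ids, in the same order, as the test set.
    if len(sub) != len(test_ids) or not (sub["id"].to_numpy() == np.asarray(test_ids)).all():
        problems.append("ids do not match the test set")
    if sub["Response"].isna().any():
        problems.append("missing predictions")
    elif not sub["Response"].between(0.0, 1.0).all():
        problems.append("probabilities outside [0, 1]")
    return problems

# Toy frames built from the first rows shown above.
test_ids = [11504798, 11504799, 11504800, 11504801, 11504802]
good_sub = pd.DataFrame({
    "id": test_ids,
    "Response": [0.014618, 0.411443, 0.253371, 0.000225, 0.035656],
})
bad_sub = good_sub.assign(Response=[0.014618, 1.2, 0.253371, 0.000225, 0.035656])

print(check_submission(good_sub, test_ids))
print(check_submission(bad_sub, test_ids))
```

In the notebook itself, the equivalent call would pass `submission` and `df_test.index`, assuming the sample submission lists ids in the same order as the test file.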