# <p style="font-family: Garamond; font-size: 33px; word-spacing: 10px; padding: 25px; text-align: center; color: #ffffff; border-radius: 25px;  font-weight: bold; background-color: #06066F">BINARY CLASSIFICATION OF INSURANCE CROSS SELLING WITH CATBOOST</p>

**Context:** An insurance company that already provides health insurance to its customers needs a model to predict whether last year's policyholders would also be interested in purchasing vehicle insurance.  

The **objective of this competition is to predict which customers respond positively to an automobile insurance offer** based on 10 factors:   
- Gender. Gender of the customer
- Age. Age of the customer
- Driving_License. 0: Customer does not have a driving license, 1: Customer has a driving license
- Region_Code. Unique code for the region of the customer
- Previously_Insured. 1: Customer already has Vehicle Insurance, 0: Customer doesn't have Vehicle Insurance
- Vehicle_Age. Age of the Vehicle
- Vehicle_Damage. 1: Customer got his/her vehicle damaged in the past. 0: Customer didn't get his/her vehicle damaged in the past.
- Annual_Premium. The amount the customer needs to pay as premium in the year
- Policy_Sales_Channel. Anonymized code for the channel used to reach the customer, i.e., different agents, over mail, over phone, in person, etc.
- Vintage. Number of days the customer has been associated with the company.


**Response.** 1: Customer is interested, 0: Customer is not interested

The datasets for this competition (both train and test) were generated from a deep learning model trained on the [Health Insurance Cross Sell Prediction Data](https://www.kaggle.com/datasets/annantkumarsingh/health-insurance-cross-sell-prediction-data/data).

**Submissions are evaluated using area under the ROC curve.**  
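
As a reminder of what this metric measures, the sketch below computes the AUC on a tiny made-up example and checks it against its pairwise interpretation: the fraction of (positive, negative) pairs in which the positive sample receives the higher score. The labels and scores are illustrative assumptions, not competition data.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Toy labels and predicted probabilities (made up for illustration).
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

auc = roc_auc_score(y_true, y_score)

# Pairwise view: share of (positive, negative) pairs ranked correctly.
pos = y_score[y_true == 1]
neg = y_score[y_true == 0]
pairwise = np.mean([p > n for p in pos for n in neg])

print(auc, pairwise)
```

Because only the ranking of the scores matters, any strictly increasing transformation of the predictions leaves the AUC unchanged.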


**In this notebook we'll use a CatBoost classifier.**

Full description available at [Health Insurance Cross Sell Prediction](https://www.kaggle.com/datasets/anmolkumar/health-insurance-cross-sell-prediction/data)

EDA available [here](https://www.kaggle.com/code/marcelamanzosagez/classification-of-insurance-cross-selling-eda).

# <p style="font-family: Garamond; font-size: 25px; word-spacing: 10px; padding: 15px; text-align: center; color: #ffffff; border-radius: 15px;  font-weight: bold; background-color: #06066F;">LIBRARIES</p>

In [1]:
import numpy as np
import pandas as pd

from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import OrdinalEncoder, FunctionTransformer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import StratifiedKFold
from sklearn.compose import ColumnTransformer
from sklearn.base import OneToOneFeatureMixin, BaseEstimator, TransformerMixin

from catboost import CatBoostClassifier

# Remove the limits on the number of columns and rows displayed on the screen
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import os
import gc
import warnings
warnings.filterwarnings("ignore")
    
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
        
from IPython.display import Markdown as md

/kaggle/input/playground-series-s4e7/sample_submission.csv
/kaggle/input/playground-series-s4e7/train.csv
/kaggle/input/playground-series-s4e7/test.csv


In [2]:
# Helper functions

import random as py_random

def reset_random_seeds(seed=42):
    ''' Set all seeds for random numbers to get reproducibility.'''
    
    os.environ['PYTHONHASHSEED'] = str(seed)
    np.random.seed(seed)
    py_random.seed(seed)


reset_random_seeds()

# <p style="font-family: Garamond; font-size: 25px; word-spacing: 10px; padding: 15px; text-align: center; color: #ffffff; border-radius: 15px;  font-weight: bold; background-color: #06066F;">DATA LOADING</p>

In [3]:
train = pd.read_csv('/kaggle/input/playground-series-s4e7/train.csv', index_col='id')
test  = pd.read_csv('/kaggle/input/playground-series-s4e7/test.csv', index_col='id')

initial_features = train.columns[:-1].to_list()

nominal_features     = ['Region_Code', 'Policy_Sales_Channel']
ordinal_features     = ['Vehicle_Age']
binary_features      = ['Gender', 'Driving_License', 'Previously_Insured', 'Vehicle_Damage']
categorical_features = nominal_features + ordinal_features + binary_features
discrete_features    = ['Age', 'Vintage']
continuous_features  = ['Annual_Premium']
numerical_features   = discrete_features + continuous_features

display(md(f'''
**Train shape = {train.shape} (including target)**           
Number of missing values in train set = {train.isna().sum().sum()}   

**Test shape = {test.shape}**     
Number of missing values in test set = {test.isna().sum().sum()}   

**Number of features = {len(initial_features)}**   
- **Categorical = {len(categorical_features)} ( nominal = {len(nominal_features)} -- ordinal = {len(ordinal_features)} -- binary = {len(binary_features)} )**   
- **Numerical   = {len(numerical_features)} ( discrete = {len(discrete_features)} -- continuous = {len(continuous_features)} )**   

**Features:**   
- **Nominal:** {nominal_features}    
- **Ordinal:** {ordinal_features}   
- **Binary:** {binary_features}   
- **Discrete:** {discrete_features}   
- **Continuous:** {continuous_features}   

**train dataset:**   
'''))
train.head()


**Train shape = (11504798, 11) (including target)**           
Number of missing values in train set = 0   

**Test shape = (7669866, 10)**     
Number of missing values in test set = 0   

**Number of features = 10**   
- **Categorical = 7 ( nominal = 2 -- ordinal = 1 -- binary = 4 )**   
- **Numerical   = 3 ( discrete = 2 -- continuous = 1 )**   

**Features:**   
- **Nominal:** ['Region_Code', 'Policy_Sales_Channel']    
- **Ordinal:** ['Vehicle_Age']   
- **Binary:** ['Gender', 'Driving_License', 'Previously_Insured', 'Vehicle_Damage']   
- **Discrete:** ['Age', 'Vintage']   
- **Continuous:** ['Annual_Premium']   

**train dataset:**   


| id | Gender | Age | Driving_License | Region_Code | Previously_Insured | Vehicle_Age | Vehicle_Damage | Annual_Premium | Policy_Sales_Channel | Vintage | Response |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Male | 21 | 1 | 35.0 | 0 | 1-2 Year | Yes | 65101.0 | 124.0 | 187 | 0 |
| 1 | Male | 43 | 1 | 28.0 | 0 | > 2 Years | Yes | 58911.0 | 26.0 | 288 | 1 |
| 2 | Female | 25 | 1 | 14.0 | 1 | < 1 Year | No | 38043.0 | 152.0 | 254 | 0 |
| 3 | Female | 35 | 1 | 1.0 | 0 | 1-2 Year | Yes | 2630.0 | 156.0 | 76 | 0 |
| 4 | Female | 36 | 1 | 15.0 | 1 | 1-2 Year | No | 31951.0 | 152.0 | 294 | 0 |


In [4]:
# Re-encode nominal and binary features as integer codes

ordinal_encoder = {}
for col in ['Region_Code', 'Policy_Sales_Channel', 'Gender', 'Vehicle_Damage']:
    ordinal_encoder[col] = OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=train[col].nunique(), dtype=int)
    train[col] = ordinal_encoder[col].fit_transform(train[col].array.reshape(-1, 1))
    test[col] = ordinal_encoder[col].transform(test[col].array.reshape(-1, 1))  

col = 'Vehicle_Age'
categories = [['< 1 Year', '1-2 Year', '> 2 Years']]
ordinal_encoder[col] = OrdinalEncoder(categories=categories, handle_unknown='use_encoded_value', unknown_value=train[col].nunique(), dtype=int) 
train[col] = ordinal_encoder[col].fit_transform(train[col].array.reshape(-1, 1))
test[col] = ordinal_encoder[col].transform(test[col].array.reshape(-1, 1))  

# Cast 'Annual_Premium' from float to integer
train['Annual_Premium'] = train['Annual_Premium'].astype(int)
test['Annual_Premium']  = test['Annual_Premium'].astype(int)

gc.collect()

train.info()

<class 'pandas.core.frame.DataFrame'>
Index: 11504798 entries, 0 to 11504797
Data columns (total 11 columns):
 #   Column                Dtype
---  ------                -----
 0   Gender                int64
 1   Age                   int64
 2   Driving_License       int64
 3   Region_Code           int64
 4   Previously_Insured    int64
 5   Vehicle_Age           int64
 6   Vehicle_Damage        int64
 7   Annual_Premium        int64
 8   Policy_Sales_Channel  int64
 9   Vintage               int64
 10  Response              int64
dtypes: int64(11)
memory usage: 1.0 GB
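
The encoders above are configured with `handle_unknown='use_encoded_value'`, so a category that appears only in the test set is mapped to a reserved code instead of raising an error. The sketch below illustrates this on a toy column; the `'3+ Years'` label is a hypothetical unseen value, not one that occurs in the competition data.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy frames; '3+ Years' is a hypothetical category absent from the training data.
toy_train = pd.DataFrame({'Vehicle_Age': ['< 1 Year', '1-2 Year', '> 2 Years', '1-2 Year']})
toy_test = pd.DataFrame({'Vehicle_Age': ['1-2 Year', '3+ Years']})

enc = OrdinalEncoder(
    categories=[['< 1 Year', '1-2 Year', '> 2 Years']],  # explicit order -> codes 0, 1, 2
    handle_unknown='use_encoded_value',
    unknown_value=3,  # reserved code for anything not listed above
    dtype=int,
)

train_codes = enc.fit_transform(toy_train).ravel()
test_codes = enc.transform(toy_test).ravel()

print(train_codes, test_codes)
```

Using the number of known categories as `unknown_value`, as the notebook does with `train[col].nunique()`, keeps the reserved code just past the last valid one.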


# <p style="font-family: Garamond; font-size: 25px; word-spacing: 10px; padding: 15px; text-align: center; color: #ffffff; border-radius: 15px;  font-weight: bold; background-color: #06066F;">FEATURE ENGINEERING</p>

The idea is to derive continuous features from the categorical features "Region_Code" and "Policy_Sales_Channel" using the **Weight of Evidence** transformation. The original columns are kept alongside the encoded ones.

**Weight of Evidence (WoE)** [1] replaces each category of a categorical variable with the logarithm of the ratio between that category's share of events and its share of non-events (sign conventions vary between sources). The encoded value is therefore a monotonic function of the event rate observed in the category, which makes it a compact numerical summary of how strongly the category separates the two outcomes. It is widely used in credit scoring and risk modeling.

References:

[1] Weight of Evidence (WOE) and Information Value (IV) Explained, https://www.listendata.com/2015/03/weight-of-evidence-woe-and-information.html
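
To make the transformation concrete, here is a minimal sketch on made-up data. It computes the per-category log-odds (events over non-events, the ordering used in this notebook's `calculate_woe`) and the share-based WoE, and shows that the two differ only by a constant that depends on the overall class balance. The column name `channel` and all values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy data (made up): one categorical feature and a binary target.
df = pd.DataFrame({
    'channel':  ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B'],
    'Response': [1, 1, 0, 0, 1, 0, 0, 0],
})

stats = df.groupby('channel')['Response'].agg(['count', 'sum'])
events = stats['sum']
non_events = stats['count'] - stats['sum']

# Per-category log-odds of the target.
log_odds = np.log(events / non_events)

# Share-based WoE: log of (share of all events) / (share of all non-events).
woe = np.log((events / events.sum()) / (non_events / non_events.sum()))

# Constant offset between the two, fixed by the overall class balance.
offset = np.log(non_events.sum() / events.sum())

print(pd.DataFrame({'log_odds': log_odds, 'woe': woe}))
```

Since the offset is the same for every category, a tree-based model sees the same ordering and the same split points (up to a shift) with either variant.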

In [5]:
class TargetEncodingTransformer(TransformerMixin, BaseEstimator):
    '''Add target encoding for categorical features.
    Transform categorical features into continuous ones using the "weight of evidence" transformation.
    '''
    
    def __init__(self, with_sum_woes=False, suffix='_WoE'):
        self.with_sum_woes = with_sum_woes
        self.suffix = suffix
        
    def calculate_woe(self, X, y):
        '''Calculate the per-category log-odds of the target.

        This equals the Weight of Evidence (WoE) up to an additive constant,
        log(total_non_events / total_events), which is omitted here.
        '''
        eps = 1e-10  # small value to avoid log(0) and division by zero
        grouped = y.groupby(X).agg(['count', 'sum'])
        grouped['non_event'] = grouped['count'] - grouped['sum']
        grouped['woe'] = np.log((grouped['sum'] + eps) / (grouped['non_event'] + eps))
        return grouped['woe'].to_dict()
        
    def fit(self, X, y=None):
        self.target_enc_dict = {}
        for col in X.columns:
            self.target_enc_dict[col] = self.calculate_woe(X[col], y) 
        return self
    
    def transform(self, X):  
        df = pd.DataFrame(index=X.index)
        if self.with_sum_woes: df['WoEs'] = 0.0
        for col in X.columns:
            df[col+self.suffix] = X[col].map(self.target_enc_dict[col])
            if self.with_sum_woes: df['WoEs'] = df['WoEs'] + df[col+self.suffix]
        self.columns = df.columns.to_list()
        return df
    
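    def get_feature_names_out(self, input_features=None):
        # input_features is accepted for scikit-learn API compatibility and is not used
        return self.columns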
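
Because `transform` relies on `Series.map`, a category that was absent from the data passed to `fit` has no entry in the dictionary and is encoded as `NaN`. A minimal sketch with made-up encoding values (the dictionary and region codes are hypothetical):

```python
import pandas as pd

# Hypothetical encoding learned on a training fold (values are made up).
woe_by_region = {28: 0.21, 8: -1.35}

# Region 99 never appeared in the training fold.
regions = pd.Series([28, 8, 99], name='Region_Code')
encoded = regions.map(woe_by_region)

print(encoded)
```

CatBoost accepts missing values in numerical features, so such rows do not break training. Whether that default handling is desirable is a modeling choice.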

# <p style="font-family: Garamond; font-size: 25px; word-spacing: 10px; padding: 15px; text-align: center; color: #ffffff; border-radius: 15px;  font-weight: bold; background-color: #06066F;">MODELING</p>

In [6]:
def train_model(features, n_folds, cv_seed, model_params, target='Response'):
    
    TransformerID = FunctionTransformer(lambda x:x)
    
    valid_scores     = [] # oof scores 
    test_pred        = np.zeros(len(test), dtype=float) # record predictions for test dataset

    skf = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=cv_seed)

    for fold, (train_index, valid_index) in enumerate(skf.split(train, train[target])):
        
        # skf.split yields positional indices; map them to index labels before using .loc
        X_train = train.loc[train.index[train_index], features]
        y_train = train[target].iloc[train_index]
        X_valid = train.loc[train.index[valid_index], features]
        y_valid = train[target].iloc[valid_index]
        
        # Add encoding features
        preprocessor = ColumnTransformer([
            ('tenc', TargetEncodingTransformer(suffix='_WoE'), ['Region_Code', 'Policy_Sales_Channel']),
            ('id_fun', TransformerID, ['Region_Code', 'Policy_Sales_Channel'])
        ],
        remainder='passthrough').set_output(transform='pandas')
        
        X_train = preprocessor.fit_transform(X_train, y_train)
        X_valid = preprocessor.transform(X_valid)
        X_test  = preprocessor.transform(test)
        
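        # Every column except the two WoE encodings is passed to CatBoost as categorical
        # (with the default feature list this includes Age, Annual_Premium and Vintage)
        new_categorical_features = [col for col in X_train.columns if col not in ['tenc__Region_Code_WoE', 'tenc__Policy_Sales_Channel_WoE']] 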
        X_train[new_categorical_features] = X_train[new_categorical_features].astype('category')
        X_valid[new_categorical_features] = X_valid[new_categorical_features].astype('category')
        X_test[new_categorical_features]  = X_test[new_categorical_features].astype('category')

        model = CatBoostClassifier(**model_params)
        model.fit(X_train, y_train, eval_set=[(X_valid, y_valid)], cat_features=new_categorical_features)
        valid_score = model.best_score_['validation'][model_params['eval_metric']]
        valid_scores.append(valid_score)
        
        # predict test set
        test_pred = test_pred + model.predict_proba(X_test)[:, 1]
        
        del X_train, y_train, X_valid, y_valid
        gc.collect()
        
        print(f'fold {fold+1}/{n_folds} - AUC validation set = {valid_score}')

    print(f'\nmean AUC validation set = {np.mean(valid_scores)}')
    
    test_pred = test_pred / n_folds

    return test_pred

In [7]:
%%time

# Parameters copied from https://www.kaggle.com/code/darkdevil18/0-89698-ps4e7-are-you-insured?scriptVersionId=189390714&cellId=47
model_params = {
    'loss_function': 'Logloss',
    'eval_metric': 'AUC',
    'class_names': [0, 1],
    'learning_rate': 0.075,
    'iterations': 10000,
    'depth': 9,
    'random_strength': 0,
    'l2_leaf_reg': 0.5,
    'max_leaves': 512,
    'fold_permutation_block': 64,
    'task_type': 'GPU',
    'random_seed': 42,
    'verbose': 500,
    'early_stopping_rounds':500,
    'allow_writing_files': False
}

test_pred = train_model(features=initial_features, n_folds=5, cv_seed=42, model_params=model_params)

Default metric period is 5 because AUC is/are not implemented for GPU


0:	test: 0.8749228	best: 0.8749228 (0)	total: 14s	remaining: 1d 14h 54m 46s
500:	test: 0.8942842	best: 0.8942842 (500)	total: 7m 48s	remaining: 2h 27m 58s
1000:	test: 0.8946763	best: 0.8946763 (1000)	total: 15m 3s	remaining: 2h 15m 18s
1500:	test: 0.8948261	best: 0.8948261 (1500)	total: 22m 16s	remaining: 2h 6m 5s
2000:	test: 0.8949012	best: 0.8949027 (1988)	total: 29m 33s	remaining: 1h 58m 11s
2500:	test: 0.8949186	best: 0.8949236 (2375)	total: 36m 49s	remaining: 1h 50m 25s
bestTest = 0.8949236274
bestIteration = 2375
Shrink model to first 2376 iterations.
fold 1/5 - AUC validation set = 0.8949236273765564


Default metric period is 5 because AUC is/are not implemented for GPU


0:	test: 0.8741687	best: 0.8741687 (0)	total: 1.45s	remaining: 4h 2m 18s
500:	test: 0.8938780	best: 0.8938780 (500)	total: 7m 29s	remaining: 2h 21m 58s
1000:	test: 0.8942742	best: 0.8942742 (1000)	total: 14m 43s	remaining: 2h 12m 25s
1500:	test: 0.8943881	best: 0.8943881 (1500)	total: 21m 53s	remaining: 2h 4m
2000:	test: 0.8944353	best: 0.8944362 (1993)	total: 29m 9s	remaining: 1h 56m 32s
2500:	test: 0.8944582	best: 0.8944618 (2444)	total: 36m 21s	remaining: 1h 49m 1s
3000:	test: 0.8944542	best: 0.8944649 (2639)	total: 43m 28s	remaining: 1h 41m 23s
bestTest = 0.8944648504
bestIteration = 2639
Shrink model to first 2640 iterations.
fold 2/5 - AUC validation set = 0.8944648504257202


Default metric period is 5 because AUC is/are not implemented for GPU


0:	test: 0.8741781	best: 0.8741781 (0)	total: 1.46s	remaining: 4h 2m 38s
500:	test: 0.8940715	best: 0.8940715 (500)	total: 7m 37s	remaining: 2h 24m 42s
1000:	test: 0.8945269	best: 0.8945269 (1000)	total: 14m 46s	remaining: 2h 12m 48s
1500:	test: 0.8946704	best: 0.8946704 (1500)	total: 22m	remaining: 2h 4m 34s
2000:	test: 0.8947283	best: 0.8947285 (1998)	total: 29m 15s	remaining: 1h 56m 58s
2500:	test: 0.8947335	best: 0.8947414 (2190)	total: 36m 25s	remaining: 1h 49m 12s
bestTest = 0.8947413564
bestIteration = 2190
Shrink model to first 2191 iterations.
fold 3/5 - AUC validation set = 0.8947413563728333


Default metric period is 5 because AUC is/are not implemented for GPU


0:	test: 0.8742979	best: 0.8742979 (0)	total: 1.33s	remaining: 3h 41m 4s
500:	test: 0.8938781	best: 0.8938781 (500)	total: 7m 31s	remaining: 2h 22m 32s
1000:	test: 0.8943288	best: 0.8943288 (1000)	total: 14m 44s	remaining: 2h 12m 35s
1500:	test: 0.8944678	best: 0.8944683 (1480)	total: 21m 56s	remaining: 2h 4m 14s
2000:	test: 0.8945459	best: 0.8945475 (1994)	total: 29m 10s	remaining: 1h 56m 38s
2500:	test: 0.8945721	best: 0.8945730 (2456)	total: 36m 22s	remaining: 1h 49m 4s
3000:	test: 0.8945709	best: 0.8945777 (2792)	total: 43m 39s	remaining: 1h 41m 50s
3500:	test: 0.8945718	best: 0.8945795 (3156)	total: 50m 51s	remaining: 1h 34m 24s
bestTest = 0.8945795298
bestIteration = 3156
Shrink model to first 3157 iterations.
fold 4/5 - AUC validation set = 0.8945795297622681


Default metric period is 5 because AUC is/are not implemented for GPU


0:	test: 0.8748752	best: 0.8748752 (0)	total: 1.33s	remaining: 3h 41m 20s
500:	test: 0.8946410	best: 0.8946410 (500)	total: 7m 38s	remaining: 2h 24m 59s
1000:	test: 0.8950649	best: 0.8950649 (1000)	total: 14m 51s	remaining: 2h 13m 37s
1500:	test: 0.8951876	best: 0.8951879 (1499)	total: 22m 1s	remaining: 2h 4m 44s
2000:	test: 0.8952496	best: 0.8952509 (1997)	total: 29m 16s	remaining: 1h 57m 3s
2500:	test: 0.8952574	best: 0.8952574 (2500)	total: 36m 30s	remaining: 1h 49m 28s
3000:	test: 0.8952511	best: 0.8952631 (2567)	total: 43m 47s	remaining: 1h 42m 7s
bestTest = 0.8952631354
bestIteration = 2567
Shrink model to first 2568 iterations.
fold 5/5 - AUC validation set = 0.895263135433197

mean AUC validation set = 0.894794499874115
CPU times: user 6h 20min 47s, sys: 20min 7s, total: 6h 40min 55s
Wall time: 4h 27min 36s
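
The per-fold scores in the log above can be summarized by their mean and spread, which gives a rough sense of how much the validation AUC varies with the split. A small sketch using the `bestTest` values copied from the log (the sample standard deviation is an added summary, not something the notebook computes):

```python
import numpy as np

# bestTest values copied from the training log, folds 1 to 5.
fold_auc = np.array([
    0.8949236274,
    0.8944648504,
    0.8947413564,
    0.8945795298,
    0.8952631354,
])

fold_mean = fold_auc.mean()
fold_std = fold_auc.std(ddof=1)  # sample standard deviation across folds

print(f'mean AUC = {fold_mean:.6f}, std = {fold_std:.6f}')
```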


# <p style="font-family: Garamond; font-size: 25px; word-spacing: 10px; padding: 15px; text-align: center; color: #ffffff; border-radius: 15px;  font-weight: bold; background-color: #06066F;">SUBMISSION</p>

In [8]:
submission = pd.read_csv('/kaggle/input/playground-series-s4e7/sample_submission.csv')
submission['Response'] = test_pred
display(submission.head(10))
submission.to_csv("submission.csv", index=False)

| | id | Response |
|---|---|---|
| 0 | 11504798 | 0.005925 |
| 1 | 11504799 | 0.658852 |
| 2 | 11504800 | 0.23986 |
| 3 | 11504801 | 6.6e-05 |
| 4 | 11504802 | 0.198479 |
| 5 | 11504803 | 6.5e-05 |
| 6 | 11504804 | 0.101406 |
| 7 | 11504805 | 0.003469 |
| 8 | 11504806 | 1.2e-05 |
| 9 | 11504807 | 0.000183 |
