# Kaggle Tabular Playground Series - Sep 2021

The aim of this project was to take part in the Kaggle "Tabular Playground Series - Sep 2021" prediction competition, which asked participants to predict the probability of a customer making a claim on an insurance policy. I chose it as my capstone project because it let me put my knowledge and skills into practice while also exploring new topics and learning new modelling techniques.

The best solution I achieved was a voting classifier built from the three highest-scoring models I was able to train on the dataset. It showed good predictive performance both in-sample and out-of-sample, with AUC scores in excess of 83% when the model fitted on the entire dataset was used to predict labels for the validation set. The other metrics reported alongside AUC were also high, which supports the choice of this model.

The project concluded with the submission of the test predictions to the competition submission page for scoring. My best predictions placed me in the top 37% of participants.

# Importing packages and functions

In [1]:
%%capture 
#avoids printing long console output
!pip install pycaret[full] #installing the pycaret module and its sub-modules (only run in GPU mode)

In [2]:
## The magic four
import pandas as pd
import numpy as np 
import seaborn as sns 
import matplotlib.pyplot as plt 

#Scaler 
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler

#garbage collection (clear up some RAM)
import gc

#imputer
from sklearn.impute import SimpleImputer

#pycaret (only run in GPU mode)
from pycaret.classification import *

#cuDF
import cudf

%matplotlib inline

In [3]:
#this is an aesthetic choice: it suppresses the many warnings that some functions and commands produce
#it significantly declutters the notebook
import warnings
warnings.filterwarnings('ignore')

# Importing data

In [4]:
#importing data and setting index column
#using cuDF to dramatically speed up data import
train = cudf.read_csv('../input/tabular-playground-series-sep-2021/train.csv', index_col='id')
test = cudf.read_csv('../input/tabular-playground-series-sep-2021/test.csv', index_col='id')
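
The import above relies on cuDF, which is only available on a CUDA-enabled machine. A minimal sketch of a fallback is shown here, assuming a hypothetical helper `read_csv_indexed` (not part of the original notebook) and a tiny made-up CSV in place of the competition files:

```python
import os
import tempfile

import pandas as pd

try:
    import cudf  # GPU DataFrame library; only importable on CUDA-enabled machines
    HAS_CUDF = True
except ImportError:
    cudf = None
    HAS_CUDF = False


def read_csv_indexed(path, index_col="id"):
    """Hypothetical helper: read with cuDF when available, otherwise with pandas.

    Always returns a pandas DataFrame indexed by `index_col`.
    """
    if HAS_CUDF:
        return cudf.read_csv(path, index_col=index_col).to_pandas()
    return pd.read_csv(path, index_col=index_col)


# Tiny made-up file so the sketch runs without the competition data
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.csv")
    pd.DataFrame(
        {"id": [10, 11], "f1": [0.5, None], "claim": [0, 1]}
    ).to_csv(demo_path, index=False)
    demo = read_csv_indexed(demo_path)

print(demo)
```

Returning pandas in both branches keeps the rest of the pipeline identical whether or not a GPU is present.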

# Exploratory Data Analysis

In [5]:
train.head()

In [6]:
test.head()

In [7]:
train.shape

In [8]:
train.dtypes

In [9]:
train.info()

In [10]:
train.describe()

In [11]:
test.shape

In [12]:
train.isnull().sum()

In [13]:
test.isnull().sum()

In [14]:
'''
#correlation heatmap
plt.figure(figsize = (30,30))
corrplot = sns.heatmap(train.corr(), square = True)

corrplot.figure.savefig('corrplot.png')
'''

# Data Cleaning

In [15]:
#defining the train and test datasets
#transforming cuDF to pandas DataFrames for compatibility
X_train = train.to_pandas()
y_train = X_train.pop('claim')
X_test = test.to_pandas()

#saving the index of the test dataset for later use
idx = X_test.index

## Filling in null values by column mean

In [16]:
#saving a copy of column headings
train_cols = X_train.columns
test_cols = X_test.columns

In [17]:
#fills null value in each column with column mean
SI = SimpleImputer(strategy = 'mean')
X_train_fill = SI.fit_transform(X_train)
X_train_fill = pd.DataFrame(X_train_fill, columns = train_cols)

In [18]:
#fill null values in test and set index from original dataset
X_test_fill = SI.fit_transform(X_test)
X_test_fill = pd.DataFrame(X_test_fill, columns = test_cols)
X_test_fill.set_index(idx, inplace = True)
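
The cells above call `fit_transform` on both datasets, so each one is filled with its own column means. A common alternative, sketched below on made-up toy frames, is to learn the means from the training data only and reuse them for the test data. This illustrates that convention and is not the approach the notebook took:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy data (made up for illustration)
toy_train = pd.DataFrame({"f1": [1.0, np.nan, 3.0], "f2": [10.0, 20.0, np.nan]})
toy_test = pd.DataFrame({"f1": [np.nan, 100.0], "f2": [np.nan, 200.0]}, index=[7, 8])

imputer = SimpleImputer(strategy="mean")

# fit_transform learns the column means from train and fills train
toy_train_fill = pd.DataFrame(
    imputer.fit_transform(toy_train), columns=toy_train.columns, index=toy_train.index
)

# transform reuses the train means for test instead of recomputing them
toy_test_fill = pd.DataFrame(
    imputer.transform(toy_test), columns=toy_test.columns, index=toy_test.index
)

print(toy_test_fill)
```

Passing `index=` when rebuilding the DataFrame preserves the original row labels, which avoids a separate `set_index` step.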

In [19]:
gc.collect()

In [20]:
#adding additional features to both train and test
X_train_fill['n_missing'] = X_train.isnull().sum(axis=1).astype(int)
X_train_fill['std'] = X_train_fill[train_cols].std(axis=1)
X_train_fill['avg'] = X_train_fill[train_cols].mean(axis=1)
X_train_fill['max'] = X_train_fill[train_cols].max(axis=1)
X_train_fill['min'] = X_train_fill[train_cols].min(axis=1)

X_test_fill['n_missing'] = X_test.isnull().sum(axis=1).astype(int) 
X_test_fill['std'] = X_test_fill[test_cols].std(axis=1)
X_test_fill['avg'] = X_test_fill[test_cols].mean(axis=1)
X_test_fill['max'] = X_test_fill[test_cols].max(axis=1)
X_test_fill['min'] = X_test_fill[test_cols].min(axis=1)
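
Because the same five features are added to train and test, the logic could be wrapped in a single function. The sketch below assumes a hypothetical helper `add_row_stats` and toy data; it mirrors the cell above but is not taken from the original notebook:

```python
import numpy as np
import pandas as pd


def add_row_stats(filled, raw):
    """Hypothetical helper: append row-wise summary features.

    `raw` still contains NaNs (used for the missing-value count);
    `filled` is the imputed copy with the same columns and index.
    """
    feature_cols = list(filled.columns)  # freeze the list before adding columns
    out = filled.copy()
    out["n_missing"] = raw.isnull().sum(axis=1).astype(int)
    out["std"] = filled[feature_cols].std(axis=1)
    out["avg"] = filled[feature_cols].mean(axis=1)
    out["max"] = filled[feature_cols].max(axis=1)
    out["min"] = filled[feature_cols].min(axis=1)
    return out


# Toy data (made up for illustration)
raw_toy = pd.DataFrame({"f1": [1.0, np.nan], "f2": [3.0, 4.0], "f3": [5.0, 6.0]})
filled_toy = raw_toy.fillna(raw_toy.mean())
toy_features = add_row_stats(filled_toy, raw_toy)

print(toy_features)
```

Computing every statistic from the frozen `feature_cols` list ensures the new columns never feed into each other.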

In [21]:
#updated list of column headings
train_cols = X_train_fill.columns
test_cols = X_test_fill.columns

# Scaler

In [22]:
#scaling train
scaler = RobustScaler()

scaled_X_train = scaler.fit_transform(X_train_fill)
scaled_X_train = pd.DataFrame(scaled_X_train, columns = train_cols)

In [23]:
#scaling test
scaled_X_test = scaler.transform(X_test_fill)
scaled_X_test = pd.DataFrame(scaled_X_test, columns = test_cols)
scaled_X_test.set_index(idx, inplace = True)
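
With its default settings, `RobustScaler` centres each column on its median and divides by the interquartile range, both learned from the data passed to `fit`. The toy example below (made-up numbers) reproduces the transformation by hand to show what is applied to the test set:

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Toy data (made up); 100.0 plays the role of an outlier
toy_train = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])
toy_test = np.array([[3.0], [5.0]])

scaler = RobustScaler()  # defaults: centre on the median, scale by the 25th-75th percentile range
scaled_train = scaler.fit_transform(toy_train)
scaled_test = scaler.transform(toy_test)  # reuses the statistics learned from train

# The same transformation written out manually
median = np.median(toy_train, axis=0)
q25, q75 = np.percentile(toy_train, [25, 75], axis=0)
manual_test = (toy_test - median) / (q75 - q25)

print(scaled_test.ravel(), manual_test.ravel())
```

The outlier barely affects the median or the interquartile range, which is the usual argument for this scaler over `StandardScaler` on heavy-tailed features.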

In [24]:
#to reduce dataset size in RAM
scaled_X_train = scaled_X_train.astype(np.float32)
scaled_X_test = scaled_X_test.astype(np.float32)
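
The effect of the downcast can be checked directly. This small sketch uses random toy data (not the competition data) to compare the memory footprint before and after, and to measure the rounding error introduced:

```python
import numpy as np
import pandas as pd

# Toy frame of float64 values (made up for illustration)
toy = pd.DataFrame(np.random.default_rng(0).normal(size=(1000, 5)))

bytes_64 = int(toy.memory_usage(index=False).sum())
toy_32 = toy.astype(np.float32)
bytes_32 = int(toy_32.memory_usage(index=False).sum())

# Largest absolute difference introduced by the lower precision
max_abs_error = float((toy - toy_32.astype(np.float64)).abs().to_numpy().max())

print(bytes_64, bytes_32, max_abs_error)
```

Halving the storage costs some precision, which is usually negligible for features that have already been scaled.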

In [25]:
#adding a column to say if the row contains nulls

scaled_X_train['any_missing'] = X_train_fill['n_missing'] > 0
scaled_X_test['any_missing'] = X_test_fill['n_missing'] > 0

scaled_X_train['any_missing'] = scaled_X_train['any_missing'].astype(np.int8)
scaled_X_test['any_missing'] = scaled_X_test['any_missing'].astype(np.int8)

gc.collect()

In [26]:
#eliminate unnecessary objects to reduce RAM usage
#a NameError here means the cell was already run and the objects are gone
try:
    del test, train, scaler, SI, X_train, X_test, X_train_fill, X_test_fill
except NameError:
    print('already dropped!')
finally:
    gc.collect()

# PyCaret Classification

In [27]:
# pycaret wants a single dataframe which includes the target column
# try-except so that re-running the cell after scaled_X_train was deleted does not fail
try:
    clf_data = scaled_X_train.copy()
    clf_data = clf_data.join(y_train)
    del scaled_X_train
except NameError:
    pass
finally:
    gc.collect()

## Setup

In [28]:
#setting up the pipeline
clf = setup(data = clf_data, #DataFrame
            target = 'claim', #specify the target column
            data_split_stratify = True, #stratify by target value
            fold = 5, #use 5-fold cross-validation
            use_gpu = True, #use GPU acceleration
            n_jobs = -1, #use maximum number of threads
            silent = True #execute without need for confirmation
           )

In [29]:
#this lists all models that can be run
models()

## Model comparison

In [30]:
#comparing all models
#excluded some models that did not benefit from GPU acceleration and took too long to run (ada, gbc)
#excluded SVM since it does not support AUC score
#excluded some models that were run independently (dt, et)
compare_models(exclude = ['dt','ada','gbc','et','svm'], sort = 'AUC')
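
Conceptually, the comparison ranks candidate models by their cross-validated AUC. The sketch below shows that idea with plain scikit-learn on synthetic data. The three candidate models are arbitrary choices for illustration, and this is not PyCaret's internal implementation:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB

# Synthetic binary classification data standing in for the real dataset
X_toy, y_toy = make_classification(n_samples=500, n_features=10, random_state=0)

# Arbitrary candidates, chosen only for illustration
candidates = {
    "lr": LogisticRegression(max_iter=1000),
    "nb": GaussianNB(),
    "rf": RandomForestClassifier(n_estimators=50, random_state=0),
}

# 5-fold stratified cross-validation, matching fold = 5 in the setup cell
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
mean_auc = {
    name: cross_val_score(model, X_toy, y_toy, cv=cv, scoring="roc_auc").mean()
    for name, model in candidates.items()
}

# Sort by mean AUC, best first, and keep the top two
ranking = sorted(mean_auc, key=mean_auc.get, reverse=True)
top2 = ranking[:2]

print(ranking, mean_auc)
```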

## Voting (Blending) Classifier

In [31]:
#trains all available models and returns the top 3 by AUC score
top3 = compare_models(n_select = 3, exclude = ['dt','ada','gbc','et','svm'], sort = 'AUC')

In [32]:
#lists top 3 models and their details
top3

In [33]:
#hyperparameter tuning
tuned_top3 = [tune_model(i, choose_better = True, optimize = 'AUC') for i in top3]
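
The idea behind `choose_better = True` is to keep the tuned model only if it beats the baseline on the chosen metric. A rough scikit-learn analogue is sketched below on synthetic data. The logistic regression, its search range and the number of iterations are assumptions for illustration, not the settings PyCaret used in this notebook:

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold, cross_val_score

# Synthetic data standing in for the real dataset
X_toy, y_toy = make_classification(n_samples=400, n_features=10, random_state=0)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Baseline: default hyperparameters, scored by cross-validated AUC
base = LogisticRegression(max_iter=1000)
base_auc = cross_val_score(base, X_toy, y_toy, cv=cv, scoring="roc_auc").mean()

# Random search over the regularisation strength (illustrative range)
search = RandomizedSearchCV(
    base,
    param_distributions={"C": loguniform(1e-3, 1e2)},
    n_iter=10,
    scoring="roc_auc",
    cv=cv,
    random_state=0,
).fit(X_toy, y_toy)

# Keep the tuned model only if it improves on the baseline AUC;
# note that `base` itself is still unfitted if it is the one chosen
chosen = search.best_estimator_ if search.best_score_ > base_auc else base

print(base_auc, search.best_score_)
```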

In [34]:
#voting classifier on baseline models
basic_blend = blend_models(top3)

In [35]:
#voting classifier on tuned models
tuned_blend = blend_models(tuned_top3)
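
A voting classifier with soft voting averages the class probabilities predicted by its members. The sketch below demonstrates this with scikit-learn's `VotingClassifier` on synthetic data. The member models are arbitrary and the example is an illustration of the technique rather than a reproduction of the blend trained above:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic data standing in for the real dataset
X_toy, y_toy = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_toy, y_toy, test_size=0.3, stratify=y_toy, random_state=0
)

# Arbitrary member models, chosen only for illustration
members = [
    ("lr", LogisticRegression(max_iter=1000)),
    ("nb", GaussianNB()),
    ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
]

# Soft voting: the blend's probabilities are the mean of the members' probabilities
blend = VotingClassifier(estimators=members, voting="soft").fit(X_tr, y_tr)
blend_proba = blend.predict_proba(X_va)

# The same average computed by hand from the fitted members
member_proba = np.mean([est.predict_proba(X_va) for est in blend.estimators_], axis=0)

print(np.allclose(blend_proba, member_proba))
```

Averaging probabilities tends to smooth out the individual models' errors, which is why a blend can match or exceed its best member.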

In [36]:
#predict baseline model on validation set
predict_model(basic_blend);

In [37]:
#predict tuned model on validation set
predict_model(tuned_blend);

In [38]:
#finalise model
#fits the model onto the complete dataset including the test/hold-out sample
final_model = finalize_model(basic_blend)

In [39]:
#finalise the tuned blend as well so the two can be compared
final_tuned_model = finalize_model(tuned_blend)

In [40]:
#predict using entire dataset (train and validation)
predict_model(final_model);

In [41]:
#tuned model does not perform better so simpler model will be used
predict_model(final_tuned_model);

In [42]:
#predictions on the unseen test data
unseen_predictions_best = predict_model(final_model, data = scaled_X_test, raw_score = True, round = 6)
unseen_predictions_best.head()

In [43]:
#model plotting
#Area under the ROC curve
plot_model(final_model, plot = 'auc')
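
The area under the ROC curve can be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counting as half. The toy example below (made-up labels and scores) checks this against scikit-learn's `roc_auc_score`:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Made-up labels and predicted scores
y_true = np.array([0, 0, 1, 1, 0, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7])

auc = roc_auc_score(y_true, scores)

# Compare every positive score with every negative score
pos = scores[y_true == 1]
neg = scores[y_true == 0]
diff = pos[:, None] - neg[None, :]
pairwise_auc = ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size

print(auc, pairwise_auc)
```

This interpretation explains why AUC depends only on the ordering of the predictions and not on their absolute values.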

In [44]:
#precision-recall plot
plot_model(final_model, plot = 'pr')

In [45]:
#confusion matrix
plot_model(final_model, plot = 'confusion_matrix')

In [46]:
#save model and parameters
save_model(final_model, 'Final model')

# Submission

In [47]:
#check size of prediction vector
assert(len(idx)==len(unseen_predictions_best))

#create DataFrame with index of test dataset and predictions from the model
sub = pd.DataFrame(list(zip(idx, unseen_predictions_best.Score_1)),columns = ['id', 'claim'])

#DataFrame to csv to be submitted
sub.to_csv('submission.csv', index = False)

#print DataFrame contents for final inspection
print(sub)
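
The submission step can also be wrapped in a small function that validates the predictions before writing them out. The sketch below assumes a hypothetical helper `build_submission` and made-up ids and probabilities. The `id` and `claim` column names follow the cell above:

```python
import numpy as np
import pandas as pd


def build_submission(ids, probabilities, target="claim"):
    """Hypothetical helper: pair test ids with predicted probabilities.

    Raises ValueError if the lengths differ or if any value is not a
    valid probability.
    """
    probs = np.asarray(probabilities, dtype=float)
    if len(ids) != len(probs):
        raise ValueError("ids and probabilities must have the same length")
    if np.isnan(probs).any() or probs.min() < 0.0 or probs.max() > 1.0:
        raise ValueError("probabilities must lie in [0, 1] and contain no NaN")
    return pd.DataFrame({"id": np.asarray(ids), target: probs})


# Made-up ids and scores for illustration
toy_ids = pd.Index([100, 101, 102], name="id")
toy_probs = [0.12, 0.87, 0.5]
toy_sub = build_submission(toy_ids, toy_probs)

print(toy_sub)
```

Raising an error rather than relying on `assert` keeps the check active even when Python is run with assertions disabled.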