### Tabular Playground Series - September 2021

https://www.kaggle.com/competitions/tabular-playground-series-sep-2021/overview

For this competition, you will predict whether a customer made a claim upon an insurance policy. The ground truth claim is binary valued, but a prediction may be any number from 0.0 to 1.0, representing the probability of a claim. The features in this dataset have been anonymized and may contain missing values.

### Setup

In [None]:
from catboost import CatBoostClassifier, Pool, cv
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
sns.set(rc = {'figure.figsize':(10, 6)})
from sklearn.impute import SimpleImputer
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import MinMaxScaler

from ml_utils.preprocess.missing import check_missingness
from ml_utils.preprocess.pipeline import preprocessing_pipeline

### Load data

In [None]:
pipeline = {}

train = pd.read_csv("Data/train.csv")
train.drop('id', axis=1, inplace=True)
test = pd.read_csv("Data/test.csv")

print(f"Train shape {train.shape}")
print(f"Test shape {test.shape}")

In [None]:
train.head()

### Missingness

In [None]:
train_missingness = check_missingness(train)
train_missingness

118 of the columns have some missing data. All of these have low levels of missingness (around 1.5%).
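
`check_missingness` comes from the local `ml_utils` package and isn't shown in this notebook. As a rough plain-pandas stand-in, assuming it returns the fraction of missing values per column, something like the following would give the same kind of summary (`missing_fraction` is a hypothetical name, and the toy frame stands in for `train`):

```python
import numpy as np
import pandas as pd


def missing_fraction(df: pd.DataFrame) -> pd.Series:
    """Fraction of missing values per column, largest first."""
    return df.isna().mean().sort_values(ascending=False)


# Toy frame standing in for `train`
toy = pd.DataFrame({
    "f1": [1.0, np.nan, 3.0, 4.0],
    "f2": [1.0, 2.0, 3.0, 4.0],
})

toy_missingness = missing_fraction(toy)

# Number of columns with at least one missing value
n_cols_with_missing = int((toy_missingness > 0).sum())
```

On the real data, `missing_fraction(train)` could be passed straight to `sns.histplot` in the same way as `train_missingness`.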

In [None]:
sns.histplot(train_missingness)

In [None]:
# For now, impute missing values with the column mean
def impute_missing(df: pd.DataFrame) -> pd.DataFrame:
    # Note: the imputer is refit on whichever frame is passed in
    col_names = df.columns
    df = pd.DataFrame(
        SimpleImputer(strategy="mean").fit_transform(df),
        columns=col_names
    )
    return df

pipeline["impute"] = (impute_missing, None)

train = impute_missing(train)
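
`impute_missing` fits a fresh `SimpleImputer` on whatever frame it is given, so any other frame passed through it is imputed with its own column means. An alternative, sketched here on toy data, is to fit once on the training features and reuse the fitted imputer on the test set. `fit_imputer` and `apply_imputer` are hypothetical helpers, not part of `ml_utils`.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer


def fit_imputer(df: pd.DataFrame) -> SimpleImputer:
    """Learn per-column means from the training features only."""
    return SimpleImputer(strategy="mean").fit(df)


def apply_imputer(imputer: SimpleImputer, df: pd.DataFrame) -> pd.DataFrame:
    """Fill missing values using the means learned at fit time."""
    return pd.DataFrame(
        imputer.transform(df), columns=df.columns, index=df.index
    )


# Toy frames standing in for the train and test features
toy_train = pd.DataFrame({"f1": [1.0, 3.0, np.nan], "f2": [10.0, np.nan, 30.0]})
toy_test = pd.DataFrame({"f1": [np.nan, 5.0], "f2": [np.nan, 7.0]})

imputer = fit_imputer(toy_train)
toy_train_imp = apply_imputer(imputer, toy_train)
toy_test_imp = apply_imputer(imputer, toy_test)  # uses the train means
```

With missingness this low the difference is likely small, but reusing the training statistics keeps train and test preprocessing consistent.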


### Basic EDA


The training dataset is quite large, with almost 1 million rows. Take a random sample of roughly 10% of the data for analysis, then train on all of the data.

In [None]:
train_sample = train.sample(100000)

In [None]:
def get_col_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Per-column variance, mean and median after min-max scaling."""

    # Min-max scale so variances are comparable across columns
    df = pd.DataFrame(MinMaxScaler().fit_transform(df), columns=df.columns)

    summary_fns = {
        "variance": np.var,
        "mean": np.mean,
        "median": np.median,
    }

    summaries = []
    for name, fn in summary_fns.items():
        summaries.append(df.apply(fn).to_frame(name=name))

    summary_df = pd.concat(summaries, axis=1)

    # Ignore var3 as this has already been analysed/dealt with
    # (this filter has no effect if there is no "var3" column)
    summary_df = summary_df[~summary_df.index.isin(["var3"])]

    return summary_df

col_summary = get_col_summary(train_sample)

In [None]:
# Distribution of column variances
sns.histplot(col_summary["variance"], bins=20)

In [None]:
# Distribution of column means
sns.histplot(col_summary["mean"], bins=20)

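It looks like the data has already been normalised, so continue to use all features for now. Note that these summaries are computed after min-max scaling, so this is worth confirming against the raw columns.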
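
`get_col_summary` min-max scales every column before computing its statistics, so the histograms above describe the scaled columns. Here is a small sketch on synthetic data (the column names are made up) of how scaling puts features with very different raw scales onto a common [0, 1] range:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)

# Two synthetic features on very different raw scales
toy = pd.DataFrame({
    "small_scale": rng.normal(0.0, 1.0, size=1_000),
    "large_scale": rng.normal(5_000.0, 1_000.0, size=1_000),
})

scaled = pd.DataFrame(MinMaxScaler().fit_transform(toy), columns=toy.columns)

# Population variance (ddof=0), matching np.var in get_col_summary
raw_var = toy.var(ddof=0)
scaled_var = scaled.var(ddof=0)
```

After scaling, every column lies in [0, 1] and its variance can be at most 0.25, whatever the raw scale was. Looking at the unscaled columns directly, for example with `train_sample.describe()`, is one way to see the original ranges.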

**Class imbalance**

In [None]:
train_sample['claim'].value_counts() / len(train_sample)

### Outliers

Many of the features aren't Gaussian, so z-scores can't be used reliably to flag outliers.

~~Instead use interquartile range.~~

Some of the distributions are extremely skewed, and a lot of the data gets flagged as outliers by the interquartile range rule.

**Come back to this later; potentially look at whether removing the outliers improves that feature's predictive power.**

In [None]:
# x = train_sample['f112']
# tol = 1.5

# uq = np.percentile(x, 75)
# lq = np.percentile(x, 25)
# iqr = uq - lq

# lower_lim = lq - (iqr * tol)
# upper_lim = uq + (iqr * tol)
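
To illustrate why the interquartile range rule flags so many points on skewed features, here is a minimal sketch on synthetic data. `iqr_outlier_mask` is a hypothetical helper using the same 1.5 × IQR tolerance as the commented-out cell above, and the log-normal sample merely stands in for a skewed feature.

```python
import numpy as np


def iqr_outlier_mask(x: np.ndarray, tol: float = 1.5) -> np.ndarray:
    """Boolean mask of points outside [lq - tol * iqr, uq + tol * iqr]."""
    lq, uq = np.percentile(x, [25, 75])
    iqr = uq - lq
    lower_lim = lq - iqr * tol
    upper_lim = uq + iqr * tol
    return (x < lower_lim) | (x > upper_lim)


rng = np.random.default_rng(0)
gaussian = rng.normal(size=100_000)
skewed = rng.lognormal(mean=0.0, sigma=1.5, size=100_000)

# Share of points flagged as outliers in each sample
gaussian_rate = iqr_outlier_mask(gaussian).mean()
skewed_rate = iqr_outlier_mask(skewed).mean()
```

On the skewed sample the flagged share is noticeably larger than on the Gaussian one, because the long right tail sits well beyond the upper limit. A log or rank transform before applying the rule might be worth trying on the real features.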

### Preprocess data

In [None]:
test = preprocessing_pipeline(test, pipeline)

### Model

The dataset is very large. To avoid the long training times that come with comparing many algorithms, I'll be using CatBoost exclusively.

Main reasons:
 - It avoids excessive training times on slow sklearn models, which gradient boosting will likely outperform anyway
 - Performance gains are likely to be far greater from tuning a single algorithm and improving the quality of the features

In [None]:
X = train.drop("claim", axis=1)
y = train["claim"]

# Initialise baseline catboost model with default parameters
cb = CatBoostClassifier()


In [None]:
N_CV = 5

kfold = StratifiedKFold(n_splits=N_CV)

train_score = []
valid_score = []
test_set_preds = []

for fold, (train_idx, valid_idx) in enumerate(kfold.split(X, y)):
    print(f"Running fold: {fold}...")
    # Use positional indexing for both features and target
    X_train, y_train = X.iloc[train_idx], y.iloc[train_idx]
    X_valid, y_valid = X.iloc[valid_idx], y.iloc[valid_idx]
    
    cb.fit(X_train, y_train)
    
    y_hat_train = cb.predict_proba(X_train)[:, 1]
    y_hat_valid = cb.predict_proba(X_valid)[:, 1]
    
    train_score.append(roc_auc_score(y_train, y_hat_train))
    valid_score.append(roc_auc_score(y_valid, y_hat_valid))
    
    # Fold prediction on test set, using only the columns the model was trained on
    # (test may still contain `id`, which was dropped from train)
    y_hat = cb.predict_proba(test[X.columns])[:, 1]
    test_set_preds.append(y_hat)

In [None]:
results_df = pd.DataFrame({
    "fold": list(range(N_CV)),
    "train": train_score,
    "valid": valid_score
})
# The scores are ROC AUC values, not accuracy
results_df = pd.melt(results_df, id_vars="fold", var_name="set", value_name="auc")
print(f"Mean train score {np.mean(train_score)}")
print(f"Mean valid score {np.mean(valid_score)}")
sns.barplot(data=results_df, x="fold", y="auc", hue="set")

### Optuna parameter tuning

### Final predictions and submission
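
The CV loop collects one vector of test-set probabilities per fold in `test_set_preds`. One common way to turn these into a single prediction is to average them across folds. Below is a minimal sketch on toy arrays; `average_fold_predictions` is a hypothetical helper, and the `id` and `claim` column names are an assumption that should be checked against the competition's sample submission.

```python
import numpy as np
import pandas as pd


def average_fold_predictions(fold_preds: list) -> np.ndarray:
    """Average per-fold probability vectors into one prediction per row."""
    return np.mean(np.vstack(fold_preds), axis=0)


# Toy stand-ins for `test["id"]` and `test_set_preds`
toy_ids = pd.Series([101, 102, 103], name="id")
toy_fold_preds = [
    np.array([0.2, 0.8, 0.5]),
    np.array([0.4, 0.6, 0.5]),
]

submission = pd.DataFrame({
    "id": toy_ids,
    "claim": average_fold_predictions(toy_fold_preds),  # assumed column name
})

# Writing the file is left commented out in this sketch
# submission.to_csv("submission.csv", index=False)
```

Averaging probabilities is only one option; averaging ranks is sometimes used instead when the metric is ROC AUC.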