# Tabular Playground Series March 2021

The aim is to predict a binary target from 19 categorical features (cat0-cat18) and 11 continuous features (cont0-cont10) in a tabular dataset (train.csv).

This notebook will use a neural network. 

### Import all the necessary libraries

In [1]:
import warnings

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import tensorflow as tf
from numpy.random import seed
from category_encoders import MEstimateEncoder
from sklearn.feature_selection import mutual_info_classif
from sklearn.metrics import roc_curve, auc
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, StandardScaler
from tensorflow import keras
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import BatchNormalization, Dense, Dropout
# Legacy scikit-learn wrapper (deprecated in newer releases in favour of SciKeras)
from tensorflow.keras.wrappers.scikit_learn import KerasClassifier

warnings.filterwarnings('ignore')

### Load the data

Run some basic checks on the data shape and missing values. We assume there is no risk of target leakage (i.e., no dataset variables created or updated after the target value is realized).

In [3]:
train = pd.read_csv('../input/tabular-playground-series-mar-2021/train.csv')

In [None]:
train.head()

In [None]:
train.shape

In [None]:
# Check for missing data
train.isnull().sum()

We have no missing data. Next, separate the target column from the predictors and split off a validation set before performing any preprocessing, so that nothing learned during preprocessing is fitted on validation rows (train-test contamination).

In [None]:
# Separate target from predictors
X = train.copy()
ID = X.pop('id')
y = X.pop('target')

# Break off validation set from training data
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=0)

### Preprocess categorical features

Neural networks operate on numerical inputs, not text, so the categorical features in our dataset must be turned into numbers by feature encoding. This dataset contains 19 categorical features (cat0-cat18).

First, check the cardinality of each feature. One-hot encoding a high-cardinality feature would generate many new columns, so categorical features with more than 50 unique values are target encoded instead. Features with only 2 unique values are label encoded, and the rest are one-hot encoded.

In [None]:
cat_columns = ['cat' + str(i) for i in range(19)]
train_X[cat_columns].nunique().sort_values()
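
The rule described above can be sketched as a small helper. `choose_encoding` and the example cardinalities are hypothetical and only illustrate the thresholds stated in the text (2 and 50); they are not the actual counts in train.csv.

```python
def choose_encoding(n_unique, high_cardinality=50):
    """Pick an encoding from a feature's number of unique values."""
    if n_unique <= 2:
        return 'label'      # binary feature: a single 0/1 column is enough
    if n_unique > high_cardinality:
        return 'target'     # one-hot would create too many columns
    return 'onehot'

# Made-up cardinalities for illustration only
example_cardinality = {'feat_a': 2, 'feat_b': 4, 'feat_c': 84}
plan = {col: choose_encoding(n) for col, n in example_cardinality.items()}
print(plan)
```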

**Label encode:**

In [None]:
le_cols = ['cat0', 'cat11', 'cat12', 'cat13', 'cat14']

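label_encoder = LabelEncoder()
for col in le_cols:
    # Refit the encoder on each training column, then reuse that mapping
    # for the same column of the validation set
    train_X[col] = label_encoder.fit_transform(train_X[col])
    val_X[col] = label_encoder.transform(val_X[col])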

**One-hot encode:**

In [None]:
oh_cols = ['cat1', 'cat2', 'cat3', 'cat4', 'cat6', 'cat9', 'cat15', 'cat16',
           'cat17', 'cat18']

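# Note: scikit-learn >= 1.2 renames `sparse` to `sparse_output` and
# replaces get_feature_names() with get_feature_names_out()
onehot = OneHotEncoder(handle_unknown='ignore', sparse=False)
oh_train_X = pd.DataFrame(onehot.fit_transform(train_X[oh_cols]))
oh_val_X = pd.DataFrame(onehot.transform(val_X[oh_cols]))

# The encoder returns plain arrays, so restore the original index
oh_train_X.index = train_X.index
oh_val_X.index = val_X.index

# Name each new column after its source feature and category
oh_train_X.columns = onehot.get_feature_names(oh_cols)
oh_val_X.columns = onehot.get_feature_names(oh_cols)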

num_train_X = train_X.drop(oh_cols, axis=1)
num_val_X = val_X.drop(oh_cols, axis=1)

train_X = pd.concat([oh_train_X, num_train_X], axis=1)
val_X = pd.concat([oh_val_X, num_val_X], axis=1)
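
The effect of `handle_unknown='ignore'` can be illustrated without scikit-learn. The sketch below is a hand-rolled one-hot encoder (`fit_categories` and `one_hot` are hypothetical helpers, not library functions): categories are learned from the training values only, and a category that first appears in the validation data is encoded as a row of zeros instead of raising an error.

```python
def fit_categories(values):
    """Learn the sorted list of categories from the training values."""
    return sorted(set(values))

def one_hot(values, categories):
    """Encode each value as a 0/1 row; unseen values become all zeros."""
    return [[1 if v == c else 0 for c in categories] for v in values]

train_values = ['A', 'B', 'A', 'C']
val_values = ['B', 'D']           # 'D' never appears in the training data

categories = fit_categories(train_values)
encoded_val = one_hot(val_values, categories)
print(categories)
print(encoded_val)
```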

**Target encode:**

In [None]:
# Break off 'encoding' set from training data
X_encode, pretrain_X_trimmed, y_encode, train_y_trimmed = train_test_split(train_X, train_y, train_size=0.2, random_state=0)

encoder = MEstimateEncoder(cols=['cat5', 'cat7', 'cat8', 'cat10'], m=7.0) 

# Fit the encoder on the encoding split
encoder.fit(X_encode, y_encode)

# Encode to create the final training data
train_X_trimmed = encoder.transform(pretrain_X_trimmed)

val_X = encoder.transform(val_X)
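
To make the target encoding step more concrete, here is a minimal pure-Python sketch of the m-estimate formula, which shrinks each category's target mean toward the global mean. It is a simplified illustration, not the `category_encoders` implementation; the helper names are hypothetical, and the fallback to the global mean for unseen categories is an assumption of this sketch.

```python
from collections import defaultdict

def m_estimate_fit(categories, targets, m=7.0):
    """Return (mapping, prior): a smoothed target mean per category."""
    prior = sum(targets) / len(targets)          # global target mean
    count = defaultdict(int)
    positives = defaultdict(int)
    for cat, t in zip(categories, targets):
        count[cat] += 1
        positives[cat] += t
    # Shrink each category mean toward the prior; m acts like m extra
    # "virtual" rows that carry the global mean
    mapping = {cat: (positives[cat] + prior * m) / (count[cat] + m)
               for cat in count}
    return mapping, prior

def m_estimate_transform(categories, mapping, prior):
    """Encode values; categories not seen during fitting get the prior."""
    return [mapping.get(cat, prior) for cat in categories]

cats = ['a', 'a', 'a', 'b']
ys = [1, 1, 0, 0]
mapping, prior = m_estimate_fit(cats, ys, m=1.0)
encoded = m_estimate_transform(['a', 'b', 'z'], mapping, prior)
print(mapping)
print(encoded)
```

A larger `m` pulls rare categories more strongly toward the global mean, which is why the encoder above is fitted on a separate split rather than on the rows it later transforms.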

### Preprocess numerical features

Neural networks generally train better on scaled features whose values are not too far from 0; unscaled variables can make learning slow or unstable. Standardize the continuous features, fitting the scaler on the training data only.

In [None]:
cont_columns = ['cont' + str(i) for i in range(11)]

scaler = StandardScaler()
train_X_scaled = pd.DataFrame(scaler.fit_transform(train_X_trimmed[cont_columns]),
                              index=train_X_trimmed.index, columns=cont_columns)
val_X_scaled = pd.DataFrame(scaler.transform(val_X[cont_columns]),
                            index=val_X.index, columns=cont_columns)

train_X_trimmed.drop(cont_columns, axis=1, inplace=True)
val_X.drop(cont_columns, axis=1, inplace=True)

train_X_scaled = pd.concat([train_X_scaled, train_X_trimmed], axis=1)
val_X_scaled = pd.concat([val_X_scaled, val_X], axis=1)
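
As a rough illustration of what standardization does, the sketch below computes the mean and population standard deviation from made-up training values and reuses them for the validation values. The helper names are hypothetical; in the notebook, `StandardScaler` does this for every continuous column.

```python
from statistics import fmean, pstdev

def fit_scaler(values):
    """Learn mean and population standard deviation from training values."""
    return fmean(values), pstdev(values)

def transform(values, mean, std):
    """Standardize values with previously learned statistics."""
    return [(v - mean) / std for v in values]

train_values = [2.0, 4.0, 6.0, 8.0]
val_values = [5.0, 10.0]

mean, std = fit_scaler(train_values)             # training data only
train_scaled = transform(train_values, mean, std)
val_scaled = transform(val_values, mean, std)    # reuse training statistics
print(train_scaled)
print(val_scaled)
```

The scaled training values have mean 0 and standard deviation 1, while the validation values generally do not, because they are measured against the training statistics.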

### Select subset of features using mutual information

Use mutual information (MI) to select a subset of features for training our neural network. A high MI score means the feature is a strong indicator of the target, while MI = 0 means the feature is independent of the target. Because MI scores each feature on its own, a feature with a low score may still be useful through interactions with other features. Here, we will simply remove features with low MI scores.

In [None]:
def make_mi_scores(X, y):
    # The MI estimate involves randomness, so fix the seed for repeatable scores
    mi_scores = mutual_info_classif(X, y, random_state=0)
    mi_scores = pd.Series(mi_scores, name="MI Scores", index=X.columns)
    mi_scores = mi_scores.sort_values(ascending=False)
    return mi_scores

mi_scores = make_mi_scores(train_X_scaled, train_y_trimmed)
mi_scores[::3]  # Show a few features with their MI scores
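
For intuition about the two extremes mentioned above, the following sketch computes MI exactly for small discrete sequences. It is an illustration with made-up data and a hypothetical `mutual_information` helper; `mutual_info_classif` instead estimates MI from samples (using a nearest-neighbour method for continuous features), so its scores are noisy approximations.

```python
from collections import Counter
from math import log

def mutual_information(x, y):
    """MI (in nats) between two discrete sequences of equal length."""
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    return sum((c / n) * log((c / n) / ((px[a] / n) * (py[b] / n)))
               for (a, b), c in pxy.items())

target = [0, 0, 1, 1]
informative = ['u', 'u', 'v', 'v']   # determines the target exactly
independent = ['u', 'v', 'u', 'v']   # carries no information about it

mi_informative = mutual_information(informative, target)
mi_independent = mutual_information(independent, target)
print(mi_informative)
print(mi_independent)
```

The informative feature reaches the maximum possible score for a balanced binary target (ln 2), and the independent feature scores 0.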

In [None]:
# Keep only columns with MI scores greater than 0.0002
cols_new = mi_scores[mi_scores > 0.0002].index

new_train_X = train_X_scaled[cols_new]
new_val_X = val_X_scaled[cols_new]

### Build the neural network

The model has 2 hidden layers and is tuned with GridSearchCV to find the best number of hidden nodes and dropout rate. A dropout layer follows the first hidden layer, and each hidden layer is followed by a batch normalization layer. The hidden layers use the *ReLU* activation function, and the model is trained with the *Adam* optimizer. The output layer instead uses a *sigmoid* activation function with a *binary cross-entropy* loss, since this is a binary classification problem.


In [None]:
def create_model(nodes=100, act='relu', opt='Adam', dr=0.0):

    # Set random seed
    seed(0)
    tf.random.set_seed(0)
    
    model = Sequential()
    
    model.add(Dense(nodes, activation=act, input_dim=new_train_X.shape[1]))
    model.add(Dropout(dr))       # Add dropout, default none (dr=0.0)
    model.add(BatchNormalization())         # Add batch normalization
    model.add(Dense(nodes, activation=act))
    model.add(BatchNormalization())         # Add batch normalization
    model.add(Dense(1, activation='sigmoid'))        # Output layer
    
    model.compile(optimizer=opt, loss='binary_crossentropy', metrics=['accuracy'])

    return model

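# Stop training once val_loss (the default monitored quantity) has not
# improved by at least min_delta for `patience` consecutive epochs
early_stopping = keras.callbacks.EarlyStopping(
    patience=10,
    min_delta=0.001,
    restore_best_weights=True,
)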
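
The sigmoid output and binary cross-entropy loss used in the model can be written out by hand. This is a small sketch with made-up numbers, not what Keras executes internally (Keras also clips probabilities for numerical stability, which is omitted here).

```python
from math import exp, log

def sigmoid(z):
    """Squash a raw output into a probability between 0 and 1."""
    return 1.0 / (1.0 + exp(-z))

def binary_cross_entropy(y_true, y_prob):
    """Mean negative log-likelihood of the true labels."""
    return -sum(y * log(p) + (1 - y) * log(1 - p)
                for y, p in zip(y_true, y_prob)) / len(y_true)

logits = [2.0, -1.0, 0.0]            # made-up raw outputs of the last layer
labels = [1, 0, 1]
probs = [sigmoid(z) for z in logits]
loss = binary_cross_entropy(labels, probs)
print(probs)
print(loss)
```

Confident correct predictions contribute little to the loss, while a prediction of 0.5 contributes ln 2 regardless of the label.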

In [None]:
model = create_model()
model.summary()  # summary() prints directly; wrapping it in print() also prints "None"

We will first test the neural network with the default values given in `create_model`, using early stopping to limit overfitting.

In [None]:
training = model.fit(new_train_X, train_y_trimmed, 
                     validation_data=(new_val_X, val_y),
                     epochs=1000, batch_size=32, 
                     callbacks=[early_stopping], verbose=0)
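
The role of `patience` and `min_delta` can be sketched with a simplified, pure-Python version of patience-based early stopping. This is an illustration of the idea rather than the Keras implementation; the `early_stopping_epoch` helper and the loss values are made up.

```python
def early_stopping_epoch(val_losses, patience=10, min_delta=0.001):
    """Return (stop_epoch, best_epoch) for a sequence of validation losses."""
    best_loss = float('inf')
    best_epoch = 0
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:     # a large enough improvement
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:             # no improvement for `patience` epochs
                return epoch, best_epoch
    return len(val_losses) - 1, best_epoch

losses = [0.70, 0.60, 0.55, 0.56, 0.5495, 0.57, 0.58]
stop_epoch, best_epoch = early_stopping_epoch(losses, patience=3)
print(stop_epoch, best_epoch)
```

Here the drop from 0.55 to 0.5495 is smaller than `min_delta`, so it does not count as an improvement. Training stops at epoch 5, and with `restore_best_weights=True` the weights from epoch 2 would be kept.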

In [None]:
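# The model is compiled with metrics=['accuracy'], so the history keys are
# 'accuracy' and 'val_accuracy'. This is the mean over all training epochs.
val_acc = np.mean(training.history['val_accuracy'])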
print("\n%s: %.2f%%" % ('val_acc', val_acc*100))

In [None]:
# History keys follow the metric name passed to compile(): 'accuracy'
plt.plot(training.history['accuracy'])
plt.plot(training.history['val_accuracy'])
plt.title('model accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(['train', 'validation'], loc='upper left')
plt.show()

Define the grid search parameters. Ideally, a larger grid would be searched, but it is kept small here to save time.

In [None]:
model = KerasClassifier(build_fn=create_model, verbose=0)

nodes = [25, 50]
drops = [0.0, 0.1]
param_grid = dict(batch_size=[32], epochs=[100], nodes=nodes, dr=drops)

# Search the grid
grid = GridSearchCV(estimator=model, param_grid=param_grid, verbose=2)
grid_result = grid.fit(new_train_X, train_y_trimmed, 
                       validation_data=(new_val_X, val_y), callbacks=[early_stopping])
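
To see how much work this small grid implies, the sketch below enumerates the candidate combinations with `itertools.product`. It only mirrors how a grid is expanded and does not train anything; the fold count assumes GridSearchCV's default of 5-fold cross-validation, since `cv` is not set above.

```python
from itertools import product

param_grid = {'batch_size': [32], 'epochs': [100],
              'nodes': [25, 50], 'dr': [0.0, 0.1]}

# Every combination of the listed values is one candidate model
names = list(param_grid)
candidates = [dict(zip(names, values))
              for values in product(*param_grid.values())]

n_folds = 5      # assumed: GridSearchCV's default number of folds
n_fits = len(candidates) * n_folds
print(len(candidates), 'candidates x', n_folds, 'folds =', n_fits, 'fits')
for candidate in candidates:
    print(candidate)
```

With the default `refit=True`, the best candidate is then trained once more on the full training data passed to `fit`.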

Print the best results of GridSearchCV.

In [None]:
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))

Now, fit the model with the best hyperparameters found by GridSearchCV and evaluate the area under the ROC curve (AUC) of the predicted probabilities against the observed target.

In [None]:
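# Rebuild the model with the best grid search values
# (the original run used nodes=50, dr=0.1)
best = grid_result.best_params_
model = create_model(nodes=best['nodes'], dr=best['dr'])

training = model.fit(new_train_X, train_y_trimmed,
                     validation_data=(new_val_X, val_y),
                     epochs=100, batch_size=32,
                     callbacks=[early_stopping])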

In [None]:
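# Predicted probabilities of the positive class, flattened to 1-D
preds = model.predict(new_val_X).ravel()
# The second return value is the true positive rate (recall)
false_positive_rate, recall, thresholds = roc_curve(val_y, preds)
roc_auc = auc(false_positive_rate, recall)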
print(roc_auc)
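
The AUC also has a direct interpretation: it is the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below computes it that way on made-up data; `pairwise_auc` is a hypothetical helper, and this quadratic-time version is for illustration only.

```python
def pairwise_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    # A tie between a positive and a negative score counts as half a win
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]     # made-up predicted probabilities
example_auc = pairwise_auc(y_true, scores)
print(example_auc)
```

Three of the four positive-negative pairs are ordered correctly, giving an AUC of 0.75. A value of 0.5 corresponds to random ranking and 1.0 to a perfect ranking.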