# XGBoost TPS May 2022

This notebook applies simple feature engineering and trains a gradient boosting model (XGBoost).

Kaggle Kernels for inspiration:

- <a href="https://www.kaggle.com/code/cv13j0/tps-may22-eda-gbdt">TPS-MAY22, EDA + GBDT</a>

- <a href="https://www.kaggle.com/code/ambrosm/tpsmay22-gradient-boosting-quickstart/#Three-diagrams-for-model-evaluation">TPSMAY22 Gradient-Boosting Quickstart</a>

- <a href="https://www.kaggle.com/code/ambrosm/tpsmay22-eda-which-makes-sense">TPSMAY22 EDA which makes sense</a>

- <a href="https://www.kaggle.com/code/kellibelcher/tps-may-2022-eda-lgbm-neural-networks">TPS May 2022 | EDA, LGBM & Neural Networks</a>

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns
import matplotlib.pyplot as plt
import string
import math
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
from xgboost import XGBClassifier
from sklearn.metrics import roc_auc_score, roc_curve
import warnings
warnings.filterwarnings('ignore')

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# **Read data**

In [None]:
print('Train data:')
train = pd.read_csv('/kaggle/input/tabular-playground-series-may-2022/train.csv')
print('Shape of train data: ' + str(train.shape))
print()
train.head(5)

In [None]:
print('Test data:')
test = pd.read_csv('/kaggle/input/tabular-playground-series-may-2022/test.csv')
print('Shape of test data: ' + str(test.shape))
print()
test.head(5)

# **Explore data**

In [None]:
print(train['target'].value_counts())

The training examples are almost evenly split between the two target classes.
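
As a rough illustration of how the balance could be quantified, the sketch below computes class shares with `value_counts(normalize = True)`. The toy series is made up for the example and is not the competition data.

```python
import pandas as pd

# Hypothetical toy labels standing in for train['target']
toy_target = pd.Series([0, 1, 0, 1, 1, 0, 0, 1, 0, 0], name = 'target')

# normalize = True returns the share of each class instead of raw counts
shares = toy_target.value_counts(normalize = True)
print(shares)
```

On the real data the same call would be `train['target'].value_counts(normalize = True)`.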

In [None]:
train.info()

In [None]:
train.isnull().sum()

There are no null values in the training data.

In [None]:
train.describe()

In [None]:
train.nunique().sort_values(ascending = True)

# Categorical features

In [None]:
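# Integer columns are treated as categorical. 'id' is only a row identifier and is skipped;
# 'target' is kept so that it shows up in the correlation heatmap below.
categorical_feats = [col for col in train.columns
                     if train[col].dtype == 'int64' and col != 'id']

train[categorical_feats].sample(10)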

In [None]:
for feature in categorical_feats:
    print('Possible values for ' + feature)
    print(train[feature].unique())
    print()

In [None]:
fig, ax = plt.subplots(figsize = (50, 50))
sns.heatmap(train[categorical_feats].corr(), annot = True, fmt = '.2f', ax = ax)
plt.show()

# Float features

In [None]:
float_feats = []
for col in train.columns:
    if train[col].dtype == 'float64':
        float_feats.append(col)

print(float_feats)

In [None]:
# https://stackoverflow.com/questions/20174468/how-to-create-subplots-of-pictures-made-with-the-hist-function-in-matplotlib-p
fig, axs = plt.subplots(4, 4, figsize=(18, 18))
axs = axs.ravel()

for ind, ax in enumerate(axs):
    ax.hist(train[float_feats[ind]], density = True, bins = 100)
    ax.set_title(f'Train: {float_feats[ind]}, Std. dev: {train[float_feats[ind]].std():.1f}')
plt.show()    

The features appear to be normally distributed. The standard deviation of features f_00 through f_06 is 1, that of f_19 through f_26 lies between 2.3 and 2.5, and that of f_28 is 238.8.

In [None]:
fig, axs = plt.subplots(4, 4, figsize=(14, 24))
axs = axs.ravel()

for ind, ax in enumerate(axs):
    ax.boxplot(train[train.target == 0][float_feats[ind]], positions = [0], widths = 0.7)
    ax.boxplot(train[train.target == 1][float_feats[ind]], positions = [1], widths = 0.7)
    ax.set_title(f'{float_feats[ind]}')
plt.show()    

- No single float feature clearly separates the two target classes.

In [None]:
fig, ax = plt.subplots(figsize = (50, 50))
sns.heatmap(train[float_feats + ['target']].corr(), annot = True, fmt = '.2f', ax = ax)
plt.show()

- f_28 seems to be correlated with features f_00, f_01, f_02, f_03, f_04, f_05, f_06.
- Features f_19 through f_26 are slightly correlated with each other.
- Target is not strongly correlated with any feature.
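
Reading a large annotated heatmap by eye is error-prone, so the strongest pairs can also be listed directly. The following is a minimal sketch on synthetic data. The columns `a`, `b` and `c` are hypothetical, and on the notebook's data the input would presumably be `train[float_feats + ['target']]`.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data: 'c' is built from 'a', 'b' is independent noise
rng = np.random.default_rng(0)
a = rng.normal(size = 1000)
toy = pd.DataFrame({
    'a': a,
    'b': rng.normal(size = 1000),
    'c': 3 * a + rng.normal(size = 1000),
})

corr = toy.corr()

# Keep only the upper triangle (k = 1 skips the diagonal) so each pair appears once
mask = np.triu(np.ones(corr.shape, dtype = bool), k = 1)
pairs = corr.where(mask).stack().dropna()

# Sort pairs by absolute correlation, strongest first
pairs = pairs.sort_values(key = np.abs, ascending = False)
print(pairs)
```

The result is a Series indexed by column pairs, which is easier to scan than a 50 x 50 inch figure when only the strongest relationships matter.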

In [None]:
fig, ax = plt.subplots(figsize = (50, 50))
sns.heatmap(train[float_feats].corr(), annot = True, fmt = '.2f', ax = ax)
plt.show()

In [None]:
sns.scatterplot(x = train['f_21'], y = train['f_02'], hue = train['target'])
plt.title("Scatter plot for f_02 and f_21")
plt.show()

In [None]:
# Sum of f_02 and f_21 for every row, kept next to the target
sum_02_21 = pd.DataFrame({
    'sum': train['f_02'] + train['f_21'],
    'target': train['target'],
})

# print(sum_02_21[(sum_02_21['target'] == 1) & (sum_02_21['sum'] >= 5.1)].sort_values(by = ['sum']))

# Rows of class 0 whose sum is at least 5.1, sorted by the sum
print(sum_02_21[(sum_02_21['target'] == 0) & (sum_02_21['sum'] >= 5.1)].sort_values(by = ['sum']))

In [None]:
print(sum_02_21.columns)

# String feature

In [None]:
train['f_27'].str.len().value_counts()

- Length of all the strings in f_27 is 10.

In [None]:
print('No. of unique strings')
print(train['f_27'].nunique())
print()

print('Difference between train and test')
print(len(set(test['f_27']).difference(set(train['f_27']))))

- Most of the strings in f_27 are unique. The most frequent string, BBBBBBCJBC, occurs only 12 times.
- There are 440526 strings that are present in the test data but not in the training data. Thus, we cannot use f_27 as a categorical feature.
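
The number of unique unseen strings and the share of test rows they affect are two different quantities. The sketch below shows both on short made-up strings (not values taken from f_27).

```python
# Hypothetical toy values standing in for train['f_27'] and test['f_27']
train_strings = ['ABABAB', 'BBBBBB', 'ABABAB', 'CCCCCC']
test_strings = ['ABABAB', 'DDDDDD', 'EEEEEE', 'DDDDDD']

# Unique strings that appear in test but never in train
unseen = set(test_strings) - set(train_strings)

# Number of test rows whose string was never seen during training
unseen_rows = sum(s in unseen for s in test_strings)

print(f'Unique unseen strings: {len(unseen)}')
print(f'Share of test rows affected: {unseen_rows / len(test_strings):.2f}')
```

A model that treats the whole string as a category has nothing to go on for such rows, which is why the string is better split into per-character features.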

In [None]:
print('Top 20 frequent strings')

train.f_27.value_counts()[:20].sort_values().plot(kind = 'barh', figsize = (15, 15), colormap = 'Paired')

In [None]:
for charind in range(10):
    print(f'Position {charind + 1}:')
    char_group = train.groupby(train['f_27'].str.get(charind))
    char_info = pd.DataFrame({'Length': char_group.size(), 'Prob': char_group.target.mean().round(2)})
    print(char_info)
    print()

- Positions 1, 3 and 6 are binary: they contain only A or B.
- A and B are the two most frequent characters in f_27.
- Position 8 contains characters from A to T.
- All positions other than 1, 3, 6 and 8 contain only characters from A to O.
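
Observations like these can also be checked programmatically by collecting the set of characters seen at each position. A minimal sketch on made-up four-character strings (the real strings have length 10):

```python
# Hypothetical toy strings standing in for train['f_27']
toy = ['AAAB', 'BCAD', 'ABBA', 'BDAC']

# zip(*toy) groups the characters by position across all strings
alphabets = [sorted(set(chars)) for chars in zip(*toy)]

# Positions (1-based, as in the notes above) that only ever hold two characters
binary_positions = [pos + 1 for pos, alphabet in enumerate(alphabets) if len(alphabet) == 2]

for pos, alphabet in enumerate(alphabets, start = 1):
    print(f'Position {pos}: {"".join(alphabet)}')
print('Binary positions:', binary_positions)
```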

# Features on the basis of f_27

In [None]:
for ite in range(10):
    train['char_' + str(ite)] = train['f_27'].str.get(ite).apply(ord) - ord('A')
    test['char_' + str(ite)] = test['f_27'].str.get(ite).apply(ord) - ord('A')


In [None]:
train['unique_letters'] = train['f_27'].apply(lambda s: len(set(s)))
test['unique_letters'] = test['f_27'].apply(lambda s: len(set(s)))
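
To make the two cells above concrete, here is what they compute for a single string, using BBBBBBCJBC (the most frequent training string) as input. This is a plain-Python sketch of the same idea, not the pandas code itself.

```python
s = 'BBBBBBCJBC'

# One ordinal code per position: A -> 0, B -> 1, C -> 2, ...
codes = [ord(ch) - ord('A') for ch in s]

# Number of distinct characters in the string
unique_letters = len(set(s))

print(codes)           # [1, 1, 1, 1, 1, 1, 2, 9, 1, 2]
print(unique_letters)  # 3
```

Each string therefore contributes ten integer columns (`char_0` to `char_9`) plus one count column.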

In [None]:
exclude_feats = ['id', 'f_27', 'target']
features = [feature for feature in train.columns if feature not in exclude_feats]


# XGBoost model

In [None]:
xgb_params = {
              'n_estimators'     : 8192,
              'min_child_weight' : 96,
              'max_depth'        : 6,
              'learning_rate'    : 0.15,
              'subsample'        : 0.95,
              'colsample_bytree' : 0.95,
              'reg_lambda'       : 1.50,
              'reg_alpha'        : 1.50,
              'gamma'            : 1.50,
              'max_bin'          : 512,
              'random_state'     : 46,
              'objective'        : 'binary:logistic',
              'tree_method'      : 'gpu_hist',
             }

In [None]:
scores, predictions = [], []

kf = KFold(n_splits = 5)

for fold, (train_ind, cv_ind) in enumerate(kf.split(train)):
    print('Train fold ' + str(fold))
    
    X_train, y_train = train.iloc[train_ind][features], train.iloc[train_ind]['target']
    X_cv, y_cv = train.iloc[cv_ind][features], train.iloc[cv_ind]['target']

    mdl = XGBClassifier(**xgb_params)
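    # Note: recent XGBoost releases expect eval_metric and early_stopping_rounds
    # in the XGBClassifier constructor rather than in fit().
    mdl.fit(X_train, y_train, eval_set = [(X_cv, y_cv)], eval_metric = ['auc'], early_stopping_rounds = 256, verbose = 0)

    # Predict on the DataFrame (not .values) so the input matches the training format
    y_cv_pred = mdl.predict_proba(X_cv)[:, 1]
    score = roc_auc_score(y_cv, y_cv_pred)

    scores.append(score)
    print(f"Fold {fold}, AUC = {score:.3f}")
    print()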
    
    test_pred = mdl.predict_proba(test[features])[:, 1]
    predictions.append(test_pred)

print('AUC ' + str(np.mean(scores)))
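
For intuition about the metric: ROC AUC equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as one half. The helper `pairwise_auc` below is a hypothetical name introduced for this sketch. It is quadratic in the number of rows and meant only for illustration, not as a replacement for `roc_auc_score`.

```python
def pairwise_auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half a win
    return wins / (len(pos) * len(neg))


# Made-up labels and scores: 3 of the 4 positive/negative pairs are ordered correctly
auc = pairwise_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(auc)  # 0.75
```

An AUC of 0.5 therefore corresponds to random ranking and 1.0 to a perfect ranking, independent of any classification threshold.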

# Submission

In [None]:
submission = pd.read_csv('/kaggle/input/tabular-playground-series-may-2022/sample_submission.csv')
submission.head(5)

In [None]:
submission['target'] = np.array(predictions).mean(axis = 0)
submission.to_csv('submission.csv', index = False)
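
The averaging step stacks the per-fold test predictions into a 2D array of shape (folds, rows) and takes the mean over the fold axis. A small sketch with made-up probabilities for three folds and four test rows:

```python
import numpy as np

# Hypothetical per-fold predictions, one array per fold
toy_predictions = [
    np.array([0.2, 0.9, 0.4, 0.1]),
    np.array([0.4, 0.7, 0.6, 0.3]),
    np.array([0.3, 0.8, 0.5, 0.2]),
]

stacked = np.array(toy_predictions)   # shape (3 folds, 4 rows)
blended = stacked.mean(axis = 0)      # one averaged probability per test row

print(stacked.shape)
print(blended)
```

Averaging over `axis = 0` keeps one value per test row, which is the shape the submission file expects.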

In [None]:
submission.head(5)