In [None]:
# -*- coding: utf-8 -*-
#  File: HW1_atom.ipynb
#  Project: 'OTUS Homework #1'
#  Created by Gennady Matveev (gm@og.ly) on 16-12-2021.

# **$Homework 1$**  


Goals:   
- Compare four gradient boosting algorithms: sklearn GBT, XGBoost, CatBoost, LightGBM
- Implement EDA, preprocessing, and feature engineering
- Tune hyperparameters

Means:  
- All meaningful programming will be done in ATOM  
    https://tvdboom.github.io/ATOM/about/

Dataset:
- Student Performance on an entrance examination Data Set  https://archive.ics.uci.edu/ml/datasets/Student+Performance+on+an+entrance+examination

Abbreviations:
- EDA: exploratory data analysis
- BO: Bayesian optimization
- FE: feature engineering
- DFS: deep feature synthesis

### Import libraries and setup notebook

In [None]:
import pandas as pd
import numpy as np
from atom import ATOMClassifier
from scipy.io import arff
import pandas_profiling
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import display, HTML
import warnings
# import sys
# sys.path.append('../src/')
# from utilities import *

display(HTML("<style>.container { width:80% !important; }</style>"))
%config InlineBackend.figure_format = 'retina'
# plt.rcParams['figure.figsize']=(10,5)
sns.set(rc = {'figure.figsize':(8,5)})
warnings.filterwarnings("ignore")

random_state = 17

### Load data

In [None]:
data, meta = arff.loadarff('../data/CEE_DATA.arff')
df = pd.DataFrame(data).applymap(lambda x: x.decode('utf-8'))

#### A small bit of preprocessing: XGBoost needs integer-encoded target labels

In [None]:
df["Performance"] = df["Performance"].map(
    {'Average': 0, 'Good': 1, 'Vg': 2, 'Excellent': 3})
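
The mapping above can be sanity-checked before training. `Series.map` turns any label that is missing from the dictionary into `NaN`, which tends to surface only later as a confusing error. The sketch below uses plain Python lists and assumes the four class names from the mapping above; `encode_target` is a hypothetical helper, not part of ATOM or XGBoost.

```python
# Class names in the order used by the mapping above (0..3).
LABELS = ["Average", "Good", "Vg", "Excellent"]
LABEL_TO_ID = {label: idx for idx, label in enumerate(LABELS)}


def encode_target(values):
    """Map class names to consecutive integers, failing loudly on unknown labels."""
    unknown = sorted(set(values) - set(LABEL_TO_ID))
    if unknown:
        raise ValueError(f"Unmapped target labels: {unknown}")
    return [LABEL_TO_ID[v] for v in values]


# Toy input, not taken from the dataset.
encoded = encode_target(["Good", "Average", "Excellent", "Vg"])
print(encoded)
```

Raising on an unknown label is a design choice: a silent `NaN` in the target is harder to trace than an early exception.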

####  Split target from features

In [None]:
X = df[df.columns[1:]]
y = df[df.columns[0]]

### EDA

#### Basic dataset information

In [None]:
print(meta)
df.head(3)

In [None]:
df.info()

In [None]:
X.describe().T

#### Target values distribution

In [None]:
# Pass the column by keyword; newer seaborn versions reject positional data arguments
sns.countplot(x='Performance', data=df)
plt.suptitle('Target distribution by classes')

print('Target distribution by classes:')
df["Performance"].value_counts(
    normalize=True).apply(lambda x: f'{x*100:.1f} %')

#### twelve_education and Caste features seem to be important

In [None]:
_ = sns.countplot(x=df['twelve_education'], hue=df['Performance'], data=df)
plt.suptitle('Target distribution:\n"twelve_education" feature')

In [None]:
_ = sns.countplot(x=df['Caste'], hue=df['Performance'], data=df)
plt.suptitle('Target distribution:\n"Caste" feature')

#### Run profile report on the dataset

In [None]:
X.profile_report()

Observations:  
- Dataset has no missing values
- Roughly 10% of rows are duplicated; this is ignored, as it may be a coincidence
- All features are categorical with smallish number of unique values
- Many features have very uneven distributions with predominance of a single class
- Target values are distributed fairly equally between classes
- A number of feature pairs exhibit sizable correlation, which may indicate collinearity
- Only a few features have significant correlation with the target
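
Since all features here are categorical, one common way to quantify the association between a feature and the target is Cramér's V, which is derived from the chi-squared statistic and ranges from 0 (no association) to 1 (perfect association). The sketch below is a minimal pure-Python version without bias correction; `cramers_v` is a hypothetical helper and is not necessarily how the profile report computes its correlations.

```python
import math
from collections import Counter


def cramers_v(xs, ys):
    """Cramér's V between two equally long sequences of categorical values."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    count_x = Counter(xs)
    count_y = Counter(ys)

    # Chi-squared statistic over the full contingency table.
    chi2 = 0.0
    for x, cx in count_x.items():
        for y, cy in count_y.items():
            expected = cx * cy / n
            observed = joint.get((x, y), 0)
            chi2 += (observed - expected) ** 2 / expected

    k = min(len(count_x), len(count_y)) - 1
    if k == 0:  # one of the variables is constant
        return 0.0
    return math.sqrt(chi2 / (n * k))


# Toy data, not taken from the dataset.
print(cramers_v(["a", "b", "a", "b"], [0, 1, 0, 1]))
print(cramers_v(["a", "a", "b", "b"], [0, 1, 0, 1]))
```

In the notebook this could be applied column by column, for example to `df[col]` against `df["Performance"]`, to rank features by their association with the target.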

### Pipeline

#### Initialise classifier

In [None]:
atom = ATOMClassifier(X, y, test_size=0.25, verbose=2,
                      warnings=False, random_state=random_state)

#### Preprocessing - encode features

In [None]:
atom.encode()

#### Set up a Decision Tree classifier as a baseline model and check its performance

In [None]:
atom.run(
    models='tree',
    metric = ["roc_auc_ovr", "f1_weighted"]
)  

In [None]:
# Remove Decision Tree from the pipeline
atom.delete('tree')

#### Run the pipeline with default hyperparameters

##### Choose models and metrics

In [None]:
models = ['GBM', 'XGB', 'CATB', 'LGB']
metric = ["roc_auc_ovr", "f1_weighted"]
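
Both metric names are scikit-learn scorer strings. `f1_weighted` averages the per-class F1 scores, weighting each class by its number of true samples, so larger classes count for more. The sketch below is a minimal pure-Python version of that weighted average for illustration only; `f1_weighted` here is a hypothetical helper, and the one-vs-rest ROC AUC is not reproduced.

```python
from collections import Counter


def f1_weighted(y_true, y_pred):
    """Support-weighted mean of per-class F1 scores."""
    support = Counter(y_true)
    total = 0.0
    for cls, n_cls in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
        fn = n_cls - tp
        # F1 = 2*TP / (2*TP + FP + FN); the denominator is positive because n_cls > 0.
        f1 = 2 * tp / (2 * tp + fp + fn)
        total += n_cls * f1
    return total / len(y_true)


# Toy labels, not taken from the dataset.
print(round(f1_weighted([0, 0, 1, 1], [0, 1, 1, 1]), 4))
```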

In [None]:
atom.run(
    models=models,
    metric=metric
)

#### Check train and test metrics

Todo: transform atom.results

In [None]:
atom.plot_results(figsize=(8,5))
atom.results[[ "metric_train", "metric_test"]].applymap(lambda x: (round(x[0],4), round(x[1],4)))

Observations:  
- All estimators perform better than the baseline model - sanity check passed
- So far GBM and CatBoost seem more promising than XGBoost and LightGBM
- XGBoost and LightGBM are most likely overfitting
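
The overfitting remark rests on the gap between train and test scores. A minimal sketch of making that gap explicit follows, assuming each model's scores are stored as one `(roc_auc_ovr, f1_weighted)` tuple for train and one for test, as the rounding code above suggests. `flatten_scores` is a hypothetical helper, and the numbers in the example are placeholders, not results from this notebook.

```python
METRICS = ("roc_auc_ovr", "f1_weighted")


def flatten_scores(results):
    """Turn {model: (train_tuple, test_tuple)} into flat rows with a train-test gap."""
    rows = {}
    for model, (train, test) in results.items():
        row = {}
        for name, tr, te in zip(METRICS, train, test):
            row[f"{name}_train"] = round(tr, 4)
            row[f"{name}_test"] = round(te, 4)
            # A large positive gap hints at overfitting.
            row[f"{name}_gap"] = round(tr - te, 4)
        rows[model] = row
    return rows


# Placeholder scores for a hypothetical model "A".
table = flatten_scores({"A": ((1.0, 0.9), (0.75, 0.5))})
for model, row in table.items():
    print(model, row)
```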

#### Check feature importance for one of the estimators

In [None]:
# atom.bar_plot('LGB', show=20, figsize=(8,10)) # <-- doesn't work in mybinder JupyterLab, check why

#### Bayesian optimization of hyperparameters

In [None]:
atom.branch = "hp_bo"
atom.run(
    models=["GBM", "XGB", "CATB", "LGB"],
    metric=metric,
    n_calls=7,
    n_initial_points=3,
    bo_params={
        "base_estimator": "RF", "max_time": 10000,
    },
    n_bootstrap=5, verbose=1
)

Todo: transform atom.results into a more readable format

In [None]:
atom.plot_results(figsize=(8,5))
atom.results

Observation:  
- BO leads to models performing more or less on par, with XGBoost leading in ROC_AUC and LightGBM in F1_weighted 
- Overall model performance did not improve after H/P BO with n_calls=25, n_initial_points=10;  
    it takes several hundred BO calls to reach ROC_AUC 0.8 and F1_weighted 0.51, probably due  
    to the high dimensionality of the H/P space
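
The dimensionality argument can be illustrated with simple arithmetic: the number of hyperparameter combinations grows multiplicatively with the number of tuned parameters, so a small call budget samples only a tiny fraction of the space. The sketch below assumes a purely hypothetical discretised space of 8 hyperparameters with 5 candidate values each; the real search spaces of the four models differ and include continuous ranges.

```python
import math


def grid_size(choices_per_param):
    """Number of combinations in a discretised hyperparameter grid."""
    return math.prod(choices_per_param)


def coverage(n_calls, choices_per_param):
    """Fraction of the grid visited if every call tries a distinct combination."""
    return n_calls / grid_size(choices_per_param)


# Hypothetical space: 8 hyperparameters, 5 candidate values each.
hypothetical_space = [5] * 8
print(grid_size(hypothetical_space))
print(f"{coverage(25, hypothetical_space):.4%}")
```

BO does not search blindly, so coverage understates what it can achieve, but the calculation gives a rough sense of why a few dozen calls may not be enough.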

### Feature engineering 

#### DFS in a separate pipeline branch

In [None]:
atom.verbose = 1
atom.branch = "fe"
atom.feature_generation("dfs", n_features=100, operators=["add", "sub", "mul"])
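
To give an intuition for what the `add`, `sub` and `mul` operators produce, the sketch below combines every unordered pair of numeric columns with each operator. It is a simplified picture in pure Python, not ATOM's actual DFS implementation; `pairwise_features` and the toy columns are hypothetical.

```python
import operator
from itertools import combinations

OPERATORS = {"add": operator.add, "sub": operator.sub, "mul": operator.mul}


def pairwise_features(columns, operators=("add", "sub", "mul")):
    """Build one new column per (unordered column pair, operator)."""
    new = {}
    for (name_a, a), (name_b, b) in combinations(columns.items(), 2):
        for op in operators:
            fn = OPERATORS[op]
            new[f"{op}({name_a}, {name_b})"] = [fn(x, y) for x, y in zip(a, b)]
    return new


# Toy columns, not taken from the dataset.
toy = {"a": [1, 2], "b": [3, 4], "c": [5, 6]}
features = pairwise_features(toy)
print(len(features))
print(features["mul(a, b)"])
```

With 3 columns and 3 operators this already yields 9 new features, which shows how quickly the feature count grows and why a selection step follows.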

#### Feature selection: remove collinear features and use RFECV to reduce the feature count

In [None]:
atom.feature_selection(
    strategy="RFECV",
    solver="RF",
    n_features=50,
    scoring="logloss",
    max_correlation=0.98,
)
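
The `max_correlation` threshold targets near-duplicate features. A minimal sketch of the idea follows: walk through the columns in order and keep a column only if its absolute Pearson correlation with every column kept so far stays below the threshold. This is a simplified greedy filter in pure Python, not ATOM's exact procedure; `pearson` and `drop_collinear` are hypothetical helpers, and the RFECV step is not reproduced.

```python
import math


def pearson(a, b):
    """Pearson correlation of two equally long numeric sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:  # constant column: correlation undefined, treat as 0
        return 0.0
    return cov / math.sqrt(var_a * var_b)


def drop_collinear(columns, max_correlation=0.98):
    """Return names of columns kept after greedy removal of collinear ones."""
    kept = []
    for name, values in columns.items():
        if all(abs(pearson(values, columns[k])) < max_correlation for k in kept):
            kept.append(name)
    return kept


# Toy columns: "b" is an exact multiple of "a" and gets dropped.
toy = {"a": [1, 2, 3, 4], "b": [2, 4, 6, 8], "c": [1, -1, 1, -1]}
print(drop_collinear(toy))
```

Which member of a correlated pair survives depends on column order here; a real implementation may use a different tie-breaking rule.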

In [None]:
# After applying RFECV, plot the score per number of features
atom.plot_rfecv()

#### Run models with new set of features

In [None]:
# Check models' performance now
# Add a tag to each model's acronym so that the previous run is not overwritten

atom.run("GBM_fe")
atom.run("XGB_fe")
atom.run("CATB_fe")
atom.run("LGB_fe")

#### Compare intermediate results

In [None]:
atom.plot_results()
atom.results

Observation:
- Feature engineering doesn't show much promise: the earlier models, run with default parameters and without FE, perform better on both metrics

#### Run FE-modified models and tune hyperparameters with Bayesian optimization

Todo: eliminate line wrap in atom.run output

In [None]:
atom.branch = "fe_bo"

atom.run(
    models=["GBM_fe", "XGB_fe", "CATB_fe", "LGB_fe"],
    metric=metric,
    n_calls=7,
    n_initial_points=3,
    bo_params={
        "base_estimator": "RF", "max_time": 10000,
    },
    n_bootstrap=5, verbose=1
)

#### Compare results - final table

In [None]:
atom.plot_results(figsize=(8,5))
atom.results

Observation:  
- And the winner is ...
- ...on BO and FE

#### Learning curves

In [None]:
atom.plot_bo(figsize=(16,5))

In [None]:
# with atom.canvas(2, 2, title="Models evaluation"):
#     model_list = ["XGB","LGB"]
#     for m in model_list:
#         atom.plot_evals(m, title=f"{m}", figsize=(8,5)) # <-- doesn't work in mybinder JupyterLab, check why

#### One model exploration - show-off

In [None]:
atom.plot_confusion_matrix("CATB", figsize=(8,5))

In [None]:
atom.plot_feature_importance("CATB", 12, figsize=(8,5))

In [None]:
atom.plot_probabilities('CATB', figsize=(8,5))

### Final thoughts

Despite the apparent failure of hyperparameter optimization and feature engineering on this  
particular dataset, all gradient boosted models in question showed robust performance.  

Future steps:  
- reduce the number of hyperparameters being tuned (impossible now for technical reasons)
- drastically increase the number of calls for BO (200-500)
- give the models a try on a different dataset
- implement similar pipeline logic in sklearn (maybe)
- CatBoost crashes ATOM in many function calls, so not all useful features of the  
    library could be explored; try running it in a different environment (conda may be the culprit)