<a href="https://colab.research.google.com/github/kankkw/229352-StatisticalLearning/blob/main/Lab06.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Statistical Learning for Data Science 2 (229352)
#### Instructor: Donlapark Ponnoprat

#### [Course website](https://donlapark.pages.dev/229352/)

## Lab #6

## Boosted tree models on a simulated dataset

- [AdaBoostClassifier documentation](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostClassifier.html#sklearn-ensemble-adaboostclassifier)
- [XGBClassifier documentation](https://xgboost.readthedocs.io/en/stable/python/python_api.html#xgboost.XGBClassifier)
- [LGBMClassifier documentation](https://lightgbm.readthedocs.io/en/latest/pythonapi/lightgbm.LGBMClassifier.html#lightgbm-lgbmclassifier)
- [GridSearchCV documentation](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html)


- [Data](https://github.com/donlapark/ds352-labs/raw/main/Lab06-data.zip)


Run `GridSearchCV` for each of the three models linked above (`AdaBoostClassifier`, `XGBClassifier`, and `LGBMClassifier`) on the provided training set (`X_train.csv` and `y_train.csv`).

1. Evaluate these models on the test set (`X_test.csv` and `y_test.csv`). **Keep searching (using cross-validation) until you find a model that achieves an out-of-fold accuracy above 0.83 (read the out-of-fold accuracy from the fitted search object's `best_score_` attribute).**

2. Report the test accuracy of your best model.

3. For each model, plot the feature importances.

For `AdaBoostClassifier`, feature importances are available in the `feature_importances_` attribute after fitting the model.

For `XGBClassifier` and `LGBMClassifier`, feature importances can be plotted with each library's own `plot_importance` function. Here is a minimal example in XGBoost:

In [None]:
import numpy as np
import pandas as pd

from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score

from sklearn.ensemble import AdaBoostClassifier
from xgboost import XGBClassifier, plot_importance
from lightgbm import LGBMClassifier

import matplotlib.pyplot as plt

In [None]:
from sklearn import datasets


iris = datasets.load_iris()
X = iris.data
y = iris.target

In [None]:
from sklearn.ensemble import AdaBoostClassifier


ab = AdaBoostClassifier()
ab.fit(X, y)
ab.feature_importances_

In [None]:
from xgboost import XGBClassifier, plot_importance


model = XGBClassifier()
model.fit(X, y)
plot_importance(model);

In [None]:
X

In [None]:
from xgboost import plot_tree

plot_tree(model, num_trees=4);

In [None]:
import pandas as pd

X_train = pd.read_csv('X_train.csv', header=None)
X_train

In [None]:
y_train = pd.read_csv('y_train.csv', header=None).values.ravel()
y_train

In [None]:
X_test = pd.read_csv('X_test.csv', header=None)
y_test = pd.read_csv('y_test.csv', header=None).values.ravel()

In [None]:
ada = AdaBoostClassifier(random_state=42)

param_grid_ada = {
    'n_estimators': [50, 100, 200],
    'learning_rate': [0.05, 0.1, 0.5]
}

grid_ada = GridSearchCV(
    ada,
    param_grid_ada,
    cv=5,
    scoring='accuracy',
    n_jobs=-1
)

grid_ada.fit(X_train, y_train)

In [None]:
grid_ada.best_score_, grid_ada.best_params_
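
When the best score falls short of the target, it helps to look at every parameter combination rather than only the winner. The sketch below assumes a small synthetic dataset from `make_classification` as a stand-in for the lab files; `X_demo`, `y_demo`, `demo_grid`, and the grid values are hypothetical. It lists the mean out-of-fold accuracy per combination and checks that `best_score_` is the mean score of the top-ranked row.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the lab data (an assumption, not the real files).
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

demo_grid = GridSearchCV(
    AdaBoostClassifier(random_state=0),
    {"n_estimators": [10, 30], "learning_rate": [0.1, 1.0]},
    cv=5,
    scoring="accuracy",
)
demo_grid.fit(X_demo, y_demo)

# One row per parameter combination, best-ranked first.
results = pd.DataFrame(demo_grid.cv_results_).sort_values("rank_test_score")
results = results[["params", "mean_test_score", "std_test_score"]]
print(results)

# best_score_ is the mean out-of-fold score of the best combination.
best_mean = demo_grid.cv_results_["mean_test_score"][demo_grid.best_index_]
print(best_mean == demo_grid.best_score_)
```

If the best row sits at the edge of the grid (for example, the largest `n_estimators` tried), extending the grid in that direction is a reasonable next search.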

In [None]:
ada_best = grid_ada.best_estimator_
# Store the test accuracy so the summary table at the end can use it.
ada_test_acc = accuracy_score(y_test, ada_best.predict(X_test))

ada_test_acc

In [None]:
plt.bar(range(X_train.shape[1]), ada_best.feature_importances_)
plt.title("AdaBoost Feature Importances")
plt.show()
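
A bar chart indexed by position can be hard to read when there are many features. As one possible variation, the sketch below sorts the importances and labels each bar with its column index. It assumes synthetic data from `make_classification` in place of the lab files, and the names `ada_demo` and `importances` are illustrative.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

# Synthetic stand-in for the lab data (an assumption, not the real files).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=3, random_state=0
)
X_demo = pd.DataFrame(X_demo)  # integer column labels, as with read_csv(header=None)

ada_demo = AdaBoostClassifier(n_estimators=50, random_state=0).fit(X_demo, y_demo)

# Index the importances by column label and sort ascending, so that the
# horizontal bar chart shows the most important feature at the top.
importances = pd.Series(ada_demo.feature_importances_, index=X_demo.columns).sort_values()
ax = importances.plot.barh()
ax.set_xlabel("Importance")
ax.set_ylabel("Feature (column index)")
```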

In [None]:
xgb = XGBClassifier(
    objective='binary:logistic',
    eval_metric='logloss',
    random_state=42
)

param_grid_xgb = {
    'n_estimators': [100, 200],
    'max_depth': [3, 5],
    'learning_rate': [0.05, 0.1]
}

grid_xgb = GridSearchCV(
    xgb,
    param_grid_xgb,
    cv=5,
    scoring='accuracy',
    n_jobs=-1
)

grid_xgb.fit(X_train, y_train)

In [None]:
grid_xgb.best_score_, grid_xgb.best_params_

In [None]:
xgb_best = grid_xgb.best_estimator_
xgb_test_acc = accuracy_score(y_test, xgb_best.predict(X_test))

xgb_test_acc

In [None]:
plot_importance(xgb_best)
plt.title("XGBoost Feature Importances")
plt.show()

In [None]:
lgbm = LGBMClassifier(random_state=42)

param_grid_lgbm = {
    'n_estimators': [100, 200],
    'learning_rate': [0.05, 0.1],
    'num_leaves': [31, 50]
}

grid_lgbm = GridSearchCV(
    lgbm,
    param_grid_lgbm,
    cv=5,
    scoring='accuracy',
    n_jobs=-1
)

grid_lgbm.fit(X_train, y_train)

In [None]:
grid_lgbm.best_score_, grid_lgbm.best_params_

In [None]:
lgbm_best = grid_lgbm.best_estimator_
lgbm_test_acc = accuracy_score(y_test, lgbm_best.predict(X_test))

lgbm_test_acc

In [None]:
import lightgbm  # xgboost's plot_importance does not accept LightGBM models

lightgbm.plot_importance(lgbm_best)
plt.title("LightGBM Feature Importances")
plt.show()

In [None]:
import pandas as pd

summary = pd.DataFrame({
    "Model": ["AdaBoost", "XGBoost", "LightGBM"],
    "CV Accuracy": [
        grid_ada.best_score_,
        grid_xgb.best_score_,
        grid_lgbm.best_score_
    ],
    "Test Accuracy": [
        ada_test_acc,
        xgb_test_acc,
        lgbm_test_acc
    ]
})

summary