<a href="https://colab.research.google.com/github/kankkw/229352-StatisticalLearning/blob/main/Lab04.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Statistical Learning for Data Science 2 (229352)
#### Instructor: Donlapark Ponnoprat

#### [Course website](https://donlapark.pages.dev/229352/)

## Lab #5

#### Load data at: https://donlapark.pages.dev/229352/heart_disease.csv

* Decision tree ([documentation](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html))
* Random hyperparameter search using cross-validation ([documentation](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html))
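The cells in this lab use `GridSearchCV`, which tries every combination in a fixed grid. The random search linked above instead samples a fixed number of combinations from distributions. The following is a minimal sketch of that alternative, assuming synthetic data from `make_classification` in place of `heart_disease.csv`; the names `X_demo`, `y_demo`, `param_distributions` and `random_search` are illustrative and not part of the lab.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; the lab itself uses heart_disease.csv).
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

# Distributions to sample from, instead of a fixed grid of values.
# randint(low, high) draws integers from low up to high - 1.
param_distributions = {
    "max_depth": randint(2, 13),
    "min_samples_split": randint(2, 11),
    "min_samples_leaf": randint(1, 6),
}

random_search = RandomizedSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_distributions,
    n_iter=10,           # number of sampled hyperparameter combinations
    scoring="f1_macro",
    cv=5,
    random_state=0,      # makes the sampled combinations reproducible
)
random_search.fit(X_demo, y_demo)

print(random_search.best_params_)
print(random_search.best_score_)
```

With `n_iter=10`, only 10 combinations are evaluated regardless of how large the search space is, which is the main reason to prefer random search when a full grid would be too expensive.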

In [None]:
import pandas as pd
import graphviz

from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline

from sklearn.tree import DecisionTreeClassifier, export_graphviz
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier

# import data
data = pd.read_csv("heart_disease.csv", na_values="?")
data.head()

In [None]:

# split into X and y
y = data["label"]
X = data.drop("label", axis=1)

# split into training and test sets (fixed seed so the reported scores are reproducible)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# impute missing values
imputer = SimpleImputer(strategy="mean")
X_train = imputer.fit_transform(X_train)
X_test = imputer.transform(X_test)

# Create a decision tree
clf = DecisionTreeClassifier()

![5CV](https://scikit-learn.org/stable/_images/grid_search_cross_validation.png)
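The diagram above shows 5-fold cross-validation: the training set is cut into five folds, and each fold takes one turn as the validation set. As a rough sketch of what `cv=5` computes for a single hyperparameter setting, the snippet below scores one decision tree on five stratified folds. It assumes synthetic data from `make_classification`; `X_demo`, `y_demo` and `fold_scores` are illustrative names.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; the lab itself uses heart_disease.csv).
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

# For a classifier, an integer cv in scikit-learn means stratified folds,
# so an explicit StratifiedKFold(5) mirrors what cv=5 does.
cv = StratifiedKFold(n_splits=5)

# One score per fold: train on four folds, validate on the remaining one.
fold_scores = cross_val_score(
    DecisionTreeClassifier(max_depth=3, random_state=0),
    X_demo,
    y_demo,
    scoring="f1_macro",
    cv=cv,
)

print(fold_scores)
print(fold_scores.mean())  # a grid search compares settings by this mean
```

A grid search repeats this procedure for every hyperparameter combination and keeps the one with the highest mean validation score.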

In [None]:
from sklearn.metrics import accuracy_score, f1_score

pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="mean")),
    ("model", DecisionTreeClassifier(random_state=42))
])

param_grid = {
    "model__max_depth": [3, 6, 9, 12],
    "model__min_samples_split": [2, 5, 10],
    "model__min_samples_leaf": [1, 3, 5]
}

grid_dt = GridSearchCV(
    pipe,
    param_grid,
    scoring="f1_macro",
    cv=5
)

grid_dt.fit(X_train, y_train)

y_pred_dt = grid_dt.best_estimator_.predict(X_test)

dt_acc = accuracy_score(y_test, y_pred_dt)
dt_f1  = f1_score(y_test, y_pred_dt, average="macro")

dt_acc, dt_f1

In [None]:
plot_data = export_graphviz(
    grid_dt.best_estimator_["model"],
    out_file=None,
    filled=True,
    rounded=True,
    feature_names=list(X.columns),  # feature columns in training order, wherever "label" sits
    class_names=["0", "1"]
)

graph = graphviz.Source(plot_data)
graph

## Bagged decision trees
* Bagging classifier ([documentation](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.BaggingClassifier.html))
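Bagging fits each tree on a bootstrap sample of the training set, so every tree leaves some rows unseen. Those out-of-bag rows can act as a built-in validation set. The sketch below is one way to see this, assuming synthetic data from `make_classification`; `X_demo`, `y_demo` and `bag_demo` are illustrative names, and the out-of-bag estimate is not part of the lab's required workflow.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; the lab itself uses heart_disease.csv).
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

bag_demo = BaggingClassifier(
    DecisionTreeClassifier(random_state=0),  # base estimator, passed positionally
    n_estimators=50,   # number of bootstrap samples, one tree per sample
    oob_score=True,    # score each row using only trees that did not train on it
    random_state=0,
)
bag_demo.fit(X_demo, y_demo)

print(len(bag_demo.estimators_))  # one fitted tree per bootstrap sample
print(bag_demo.oob_score_)        # out-of-bag accuracy estimate
```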

In [None]:
base_tree = DecisionTreeClassifier(random_state=42)

bag_clf = BaggingClassifier(
    estimator=base_tree,
    random_state=42
)

pipe_bag = Pipeline([
    ("imputer", SimpleImputer(strategy="mean")),
    ("model", bag_clf)
])

param_bag = {
    "model__n_estimators": [50, 100, 200],
    "model__estimator__max_depth": [3, 6, None],
    "model__max_samples": [0.6, 0.8, 1.0]
}

grid_bag = GridSearchCV(
    pipe_bag,
    param_bag,
    scoring="f1_macro",
    cv=5
)

grid_bag.fit(X_train, y_train)

y_pred_bag = grid_bag.best_estimator_.predict(X_test)

bag_acc = accuracy_score(y_test, y_pred_bag)
bag_f1  = f1_score(y_test, y_pred_bag, average="macro")

bag_acc, bag_f1

## Random forest classifier
* Random forest ([documentation](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html))

In [None]:
rf_clf = RandomForestClassifier(random_state=42)

pipe_rf = Pipeline([
    ("imputer", SimpleImputer(strategy="mean")),
    ("model", rf_clf)
])

param_rf = {
    "model__n_estimators": [100, 200],
    "model__max_depth": [5, 10, None],
    "model__max_features": ["sqrt", "log2"]
}

grid_rf = GridSearchCV(
    pipe_rf,
    param_rf,
    scoring="f1_macro",
    cv=5
)

grid_rf.fit(X_train, y_train)

y_pred_rf = grid_rf.best_estimator_.predict(X_test)

rf_acc = accuracy_score(y_test, y_pred_rf)
rf_f1  = f1_score(y_test, y_pred_rf, average="macro")

rf_acc, rf_f1

#### Exercise
1. Study the hyperparameters of three models: [Decision tree](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html), [Bagged Decision Trees](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.BaggingClassifier.html) and [Random Forest](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html).
2. For each model, use a pipeline with grid-search cross-validation over multiple hyperparameters to find the best model.
   * Decision tree: choose at least 3 hyperparameters
   * Bagged decision trees: choose at least 3 hyperparameters
   * Random forest: choose at least 3 hyperparameters
3. For each model, compute the `f1_macro` and `accuracy` scores on the test set.
   * What is your best model?
   * Plot the best tree model
   * What hyperparameters did you choose? (Explain in words, not in `sklearn`'s parameter names.)
   * What are the best values of your hyperparameters?
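Step 3 asks for the same two test-set metrics for every model, so a small helper can avoid repeating the scoring code. The sketch below is one possible shape for it; the function name `evaluate` is hypothetical, and the synthetic data from `make_classification` stands in for the heart disease data.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def evaluate(model, X_test, y_test):
    """Return test-set accuracy and macro-averaged F1 for a fitted model."""
    y_pred = model.predict(X_test)
    return {
        "accuracy": accuracy_score(y_test, y_pred),
        "f1_macro": f1_score(y_test, y_pred, average="macro"),
    }


# Synthetic stand-in data (an assumption; the lab itself uses heart_disease.csv).
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

tree_demo = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_tr, y_tr)
scores_demo = evaluate(tree_demo, X_te, y_te)
print(scores_demo)
```

A fitted `GridSearchCV` (with the default `refit=True`) also exposes `predict`, so the same helper could be called on a fitted search object directly.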

What is your best model?

The random forest achieves the highest `f1_macro` score on the test set; the comparison table below lists the scores for all three models.

In [None]:
results = pd.DataFrame({
    "Model": ["Decision Tree", "Bagged Trees", "Random Forest"],
    "Accuracy": [dt_acc, bag_acc, rf_acc],
    "F1_macro": [dt_f1, bag_f1, rf_f1]
})

results

What hyperparameters did you choose?

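For the decision tree, I tuned the maximum tree depth, the minimum number of samples required to split a node, and the minimum number of samples required in each leaf. All three limit how far the tree can grow, which controls model complexity and helps prevent overfitting.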

For bagged decision trees, I tuned the number of trees, the maximum depth of each tree, and the fraction of the training set drawn for each bootstrap sample, in order to improve stability and reduce variance.

For the random forest model, I tuned the number of trees, tree depth, and the number of features considered at each split to reduce correlation between trees and improve generalization.
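To make the complexity argument concrete for one of these hyperparameters, the sketch below grows trees at three depth limits and records their size and their training and test accuracy. It covers tree depth only and assumes synthetic data from `make_classification`; `depth_report` and the other names are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; the lab itself uses heart_disease.csv).
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

depth_report = {}
for max_depth in [1, 3, None]:  # None lets the tree grow until its leaves are pure
    tree = DecisionTreeClassifier(max_depth=max_depth, random_state=0).fit(X_tr, y_tr)
    depth_report[max_depth] = {
        "depth": tree.get_depth(),
        "leaves": tree.get_n_leaves(),
        "train_acc": tree.score(X_tr, y_tr),
        "test_acc": tree.score(X_te, y_te),
    }

for max_depth, row in depth_report.items():
    print(max_depth, row)
```

The unrestricted tree fits the training set perfectly, so any gap between its training and test accuracy is a direct view of overfitting; the depth limit trades some training accuracy for a smaller tree.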

Best hyperparameter values

In [None]:
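# A notebook cell displays only its last expression, so collect all three results in one dict.
{
    "Decision Tree": grid_dt.best_params_,
    "Bagged Trees": grid_bag.best_params_,
    "Random Forest": grid_rf.best_params_,
}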