# Week 10: Advanced Machine Learning Techniques

**Course:** Big Data Analytics  
**Topic:** Ensemble Methods, Feature Engineering, Pipeline, Hyperparameter Tuning, Class Imbalance

---

### Description
This practicum covers advanced machine learning techniques:
- Ensemble methods: Random Forest, AdaBoost, Gradient Boosting, XGBoost, Stacking
- Feature engineering and feature selection
- Reproducible ML pipelines
- Hyperparameter tuning with Grid Search and Random Search
- Handling class imbalance with SMOTE

In [None]:
# Install dependencies (run once)
!pip install xgboost imbalanced-learn --quiet

# ============================================================
# Import Libraries
# ============================================================
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

# Datasets
from sklearn.datasets import load_breast_cancer, make_classification

# Preprocessing & Model Selection
from sklearn.model_selection import (
    train_test_split, cross_val_score, GridSearchCV, RandomizedSearchCV,
    StratifiedKFold
)
from sklearn.preprocessing import StandardScaler, OneHotEncoder, LabelEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer

# Feature Selection
from sklearn.feature_selection import SelectKBest, f_classif, SelectFromModel
from sklearn.decomposition import PCA

# Classifiers
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import (
    RandomForestClassifier, AdaBoostClassifier,
    GradientBoostingClassifier, StackingClassifier
)
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from xgboost import XGBClassifier

# Metrics
from sklearn.metrics import (
    accuracy_score, classification_report, confusion_matrix,
    f1_score, roc_auc_score, ConfusionMatrixDisplay
)

# Class imbalance
from imblearn.over_sampling import SMOTE

# Settings
plt.style.use('seaborn-v0_8-whitegrid')
sns.set_palette('husl')
np.random.seed(42)

print('All libraries imported successfully!')

## 1. Ensemble Methods

We compare three main ensemble methods on the **Breast Cancer** dataset:
- **Random Forest** (bagging)
- **AdaBoost** (boosting with adaptive sample weights)
- **Gradient Boosting** (boosting by fitting residuals)
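
To make "adaptive weights" concrete, here is a minimal sketch of one round of the classic textbook (discrete) AdaBoost sample-weight update. The function name `adaboost_reweight` and the toy inputs are hypothetical, and scikit-learn's `AdaBoostClassifier` may differ in its exact formula, so treat this as an illustration of the idea rather than the library's implementation.

```python
import math

def adaboost_reweight(weights, is_wrong):
    """One round of the classic discrete AdaBoost sample-weight update."""
    total = sum(weights)
    # Weighted error rate of the current weak learner
    err = sum(w for w, wrong in zip(weights, is_wrong) if wrong) / total
    # Learner weight: larger when the weak learner is more accurate
    alpha = 0.5 * math.log((1 - err) / err)
    # Up-weight mistakes, down-weight correct predictions
    new_w = [w * math.exp(alpha if wrong else -alpha)
             for w, wrong in zip(weights, is_wrong)]
    norm = sum(new_w)
    return [w / norm for w in new_w], alpha

# Hypothetical toy round: 10 samples, the weak learner misclassifies 2 of them
weights = [0.1] * 10
is_wrong = [False] * 8 + [True] * 2
new_weights, alpha = adaboost_reweight(weights, is_wrong)
print(f'alpha = {alpha:.3f}')
print('weight on misclassified samples:', round(sum(new_weights[8:]), 3))
```

With this update rule the misclassified samples end up carrying half of the total weight, so the next weak learner is forced to focus on them.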

In [None]:
# ============================================================
# Ensemble Methods Comparison – Breast Cancer Dataset
# ============================================================

# Load data
bc = load_breast_cancer()
X_bc = pd.DataFrame(bc.data, columns=bc.feature_names)
y_bc = pd.Series(bc.target)

print('=== Breast Cancer Dataset ===')
print(f'Shape: {X_bc.shape}')
print(f'Classes: {bc.target_names}')
print(f'Class distribution: {dict(zip(bc.target_names, np.bincount(y_bc)))}')

# Split
X_tr, X_te, y_tr, y_te = train_test_split(X_bc.values, y_bc.values,
                                            test_size=0.2, random_state=42, stratify=y_bc.values)

# Define ensemble models
ensemble_models = {
    'Random Forest':      RandomForestClassifier(n_estimators=100, random_state=42),
    'AdaBoost':           AdaBoostClassifier(n_estimators=100, random_state=42),
    'Gradient Boosting':  GradientBoostingClassifier(n_estimators=100, random_state=42),
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
ensemble_results = {}

print('\n=== Ensemble Methods – 5-Fold Cross-Validation ===')
print(f'{"Model":<22} {"CV Mean":>9} {"CV Std":>8} {"Test Acc":>10} {"F1":>8} {"AUC":>8}')
print('-' * 70)

for name, model in ensemble_models.items():
    cv_scores = cross_val_score(model, X_bc.values, y_bc.values, cv=cv, scoring='accuracy')
    model.fit(X_tr, y_tr)
    y_pred = model.predict(X_te)
    y_prob = model.predict_proba(X_te)[:, 1]
    test_acc = accuracy_score(y_te, y_pred)
    f1 = f1_score(y_te, y_pred)
    auc = roc_auc_score(y_te, y_prob)
    ensemble_results[name] = {'cv': cv_scores, 'test_acc': test_acc, 'f1': f1, 'auc': auc}
    print(f'{name:<22} {cv_scores.mean():>9.4f} {cv_scores.std():>8.4f} {test_acc:>10.4f} {f1:>8.4f} {auc:>8.4f}')

# Visualization
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Box plot of CV scores
cv_data = [v['cv'] for v in ensemble_results.values()]
bp = axes[0].boxplot(cv_data, labels=list(ensemble_results.keys()),
                     patch_artist=True, notch=False)
colors_box = ['#4CAF50', '#FF9800', '#2196F3']
for patch, color in zip(bp['boxes'], colors_box):
    patch.set_facecolor(color)
    patch.set_alpha(0.7)
axes[0].set_title('CV Accuracy Distribution', fontsize=12, fontweight='bold')
axes[0].set_ylabel('Accuracy')
axes[0].set_ylim(0.88, 1.02)
axes[0].tick_params(axis='x', rotation=15)

# Grouped bar chart: test metrics
metrics_df = pd.DataFrame(
    {name: [v['test_acc'], v['f1'], v['auc']] for name, v in ensemble_results.items()},
    index=['Accuracy', 'F1-Score', 'ROC-AUC']
)
metrics_df.plot(kind='bar', ax=axes[1], color=colors_box, alpha=0.8, edgecolor='white')
axes[1].set_title('Test Set Metrics Comparison', fontsize=12, fontweight='bold')
axes[1].set_ylabel('Score')
axes[1].set_ylim(0.88, 1.02)
axes[1].tick_params(axis='x', rotation=0)
axes[1].legend(loc='lower right')

plt.tight_layout()
plt.show()

## 2. XGBoost

**XGBoost** is an optimized implementation of Gradient Boosting that supports regularization, native handling of missing values, and parallel computation.  
It is often a top choice in ML competitions.
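
As a rough illustration of what "regularization" means here, the sketch below computes the optimal value of a single leaf, w* = -G / (H + lambda), where G and H are the sums of first- and second-order gradients of the loss over the samples in the leaf. It assumes log loss and only the L2 term (`reg_lambda` in `XGBClassifier`); the learning rate, `gamma`, and L1 penalty are ignored, and the helper `leaf_weight` and its toy inputs are hypothetical.

```python
def leaf_weight(y_true, y_prob, reg_lambda):
    """Optimal leaf value w* = -G / (H + lambda) for log loss."""
    # First- and second-order gradients of log loss w.r.t. the raw score
    grads = [p - y for y, p in zip(y_true, y_prob)]
    hess = [p * (1 - p) for p in y_prob]
    return -sum(grads) / (sum(hess) + reg_lambda)

# Hypothetical leaf holding three positive samples, all currently predicted at p = 0.5
y_leaf = [1, 1, 1]
p_leaf = [0.5, 0.5, 0.5]
for lam in (0.0, 1.0, 10.0):
    print(f'lambda = {lam:>4}: leaf weight = {leaf_weight(y_leaf, p_leaf, lam):.3f}')
```

A larger `reg_lambda` shrinks the leaf value toward zero, which makes each tree's contribution more conservative.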

In [None]:
# ============================================================
# XGBoost Classifier
# ============================================================

xgb_clf = XGBClassifier(
    n_estimators=200,
    learning_rate=0.1,
    max_depth=4,
    subsample=0.8,
    colsample_bytree=0.8,
    eval_metric='logloss',
    random_state=42
)

xgb_clf.fit(
    X_tr, y_tr,
    eval_set=[(X_te, y_te)],
    verbose=False
)

y_pred_xgb  = xgb_clf.predict(X_te)
y_proba_xgb = xgb_clf.predict_proba(X_te)[:, 1]

print('=== XGBoost Classifier ===')
print(f'Test Accuracy : {accuracy_score(y_te, y_pred_xgb):.4f}')
print(f'F1 Score      : {f1_score(y_te, y_pred_xgb):.4f}')
print(f'ROC-AUC       : {roc_auc_score(y_te, y_proba_xgb):.4f}')
print('\nClassification Report:')
print(classification_report(y_te, y_pred_xgb, target_names=bc.target_names))

# Feature importance
fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# Top-15 important features
feat_imp = pd.Series(xgb_clf.feature_importances_, index=bc.feature_names)
top15 = feat_imp.nlargest(15)
top15.sort_values().plot(kind='barh', ax=axes[0], color='steelblue', alpha=0.8)
axes[0].set_title('XGBoost – Top 15 Feature Importance', fontsize=12, fontweight='bold')
axes[0].set_xlabel('Importance Score')

# Validation loss curve (logloss on the eval_set, i.e. the held-out test split)
evals_result = xgb_clf.evals_result()
val_loss = evals_result['validation_0']['logloss']
axes[1].plot(range(len(val_loss)), val_loss, 'b-', linewidth=2, label='Validation LogLoss')
axes[1].set_title('XGBoost – Validation LogLoss per Boosting Round', fontsize=12, fontweight='bold')
axes[1].set_xlabel('Boosting Round')
axes[1].set_ylabel('Log Loss')
axes[1].legend()

plt.tight_layout()
plt.show()

# Add XGBoost to comparison
print('\n=== Adding XGBoost to Comparison ===')
all_models = list(ensemble_results.keys()) + ['XGBoost']
all_aucs = [v['auc'] for v in ensemble_results.values()] + [roc_auc_score(y_te, y_proba_xgb)]
for m, a in zip(all_models, all_aucs):
    print(f'  {m:<22} AUC = {a:.4f}')

## 3. Feature Engineering & Selection

Feature selection picks the most relevant subset of the original features, whereas PCA extracts principal components, which are linear combinations of all features.  
We demonstrate **SelectKBest**, **SelectFromModel**, and **PCA**.
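
The cell that follows fits each selector on the full dataset before cross-validating, which is convenient for a demo but lets information from the validation folds influence the selection. A minimal sketch of a leak-free alternative, assuming the same Breast Cancer data and an arbitrary choice of `k=10` and 50 trees, wraps the selector in a `Pipeline` so it is re-fitted inside every fold:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

X, y = load_breast_cancer(return_X_y=True)

# The selector is a pipeline step, so it is re-fitted on each training fold
select_then_classify = Pipeline(steps=[
    ('select', SelectKBest(score_func=f_classif, k=10)),
    ('clf', RandomForestClassifier(n_estimators=50, random_state=42)),
])

scores = cross_val_score(select_then_classify, X, y, cv=5, scoring='accuracy')
print(f'Leak-free CV accuracy: {scores.mean():.4f} ± {scores.std():.4f}')
```

The same pattern applies to `SelectFromModel` and to `StandardScaler` followed by `PCA`: any step that learns from the data belongs inside the pipeline.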

In [None]:
# ============================================================
# Feature Selection & Extraction
# ============================================================

print(f'Original number of features: {X_bc.shape[1]}')
print()

# 1. SelectKBest (filter method)
k_best = 10
selector_kbest = SelectKBest(score_func=f_classif, k=k_best)
X_kbest = selector_kbest.fit_transform(X_bc.values, y_bc.values)
selected_kbest = X_bc.columns[selector_kbest.get_support()].tolist()
print(f'SelectKBest (k={k_best}): {X_kbest.shape[1]} features selected')
print(f'  Selected: {selected_kbest}')

# 2. SelectFromModel (embedded method) using Random Forest
rf_for_sel = RandomForestClassifier(n_estimators=100, random_state=42)
rf_for_sel.fit(X_bc.values, y_bc.values)
selector_sfm = SelectFromModel(rf_for_sel, threshold='mean', prefit=True)
X_sfm = selector_sfm.transform(X_bc.values)
selected_sfm = X_bc.columns[selector_sfm.get_support()].tolist()
print(f'\nSelectFromModel (RF): {X_sfm.shape[1]} features selected')
print(f'  Selected: {selected_sfm}')

# 3. PCA
scaler_pca = StandardScaler()
X_scaled = scaler_pca.fit_transform(X_bc.values)
pca = PCA(n_components=10)
X_pca = pca.fit_transform(X_scaled)
explained_var = pca.explained_variance_ratio_
print(f'\nPCA (10 components):')
print(f'  Total explained variance: {explained_var.sum()*100:.2f}%')
print(f'  Variance per component: {[f"{v*100:.1f}%" for v in explained_var[:5]]} ...')

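# Compare CV accuracy with different feature sets
# Caveat: the selectors and PCA above were fitted on the full dataset, so these
# scores are slightly optimistic; fit them inside a Pipeline for a leak-free estimate.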
clf_comp = RandomForestClassifier(n_estimators=100, random_state=42)
cv_full   = cross_val_score(clf_comp, X_bc.values, y_bc.values, cv=5).mean()
cv_kbest  = cross_val_score(clf_comp, X_kbest, y_bc.values, cv=5).mean()
cv_sfm    = cross_val_score(clf_comp, X_sfm, y_bc.values, cv=5).mean()
cv_pca    = cross_val_score(clf_comp, X_pca, y_bc.values, cv=5).mean()

print('\n=== CV Accuracy with Different Feature Sets ===')
feat_comparison = {
    f'All features ({X_bc.shape[1]})': cv_full,
    f'SelectKBest ({k_best})': cv_kbest,
    f'SelectFromModel ({len(selected_sfm)})': cv_sfm,
    'PCA (10 components)': cv_pca,
}
for method, score in feat_comparison.items():
    print(f'  {method:<35} Accuracy = {score:.4f}')

# Visualization
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# PCA scree plot (explained variance)
pca_full = PCA(n_components=20)
pca_full.fit(X_scaled)
cum_var = np.cumsum(pca_full.explained_variance_ratio_)
axes[0].bar(range(1, 21), pca_full.explained_variance_ratio_[:20],
            color='steelblue', alpha=0.8, label='Individual')
ax2 = axes[0].twinx()
ax2.plot(range(1, 21), cum_var[:20], 'ro-', linewidth=2, label='Cumulative')
ax2.axhline(y=0.95, color='green', linestyle='--', alpha=0.7, label='95% threshold')
ax2.set_ylabel('Cumulative Explained Variance', color='red')
ax2.set_ylim(0, 1.05)
axes[0].set_title('PCA Scree Plot', fontsize=12, fontweight='bold')
axes[0].set_xlabel('Principal Component')
axes[0].set_ylabel('Explained Variance Ratio', color='steelblue')
lines1, labels1 = axes[0].get_legend_handles_labels()
lines2, labels2 = ax2.get_legend_handles_labels()
axes[0].legend(lines1 + lines2, labels1 + labels2, loc='center right')

# Accuracy comparison
bar_data = list(feat_comparison.values())
bar_labels = list(feat_comparison.keys())
bar_colors = ['#4CAF50', '#2196F3', '#FF9800', '#9C27B0']
bars = axes[1].bar(range(len(bar_data)), bar_data, color=bar_colors, alpha=0.8,
                   edgecolor='white', linewidth=1.5)
axes[1].set_xticks(range(len(bar_data)))
axes[1].set_xticklabels(bar_labels, rotation=20, ha='right', fontsize=9)
axes[1].set_ylim(0.90, 1.02)
axes[1].set_title('Accuracy vs Feature Selection Method', fontsize=12, fontweight='bold')
axes[1].set_ylabel('Mean CV Accuracy')
for bar, val in zip(bars, bar_data):
    axes[1].text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.001,
                 f'{val:.4f}', ha='center', fontsize=9, fontweight='bold')

plt.tight_layout()
plt.show()

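## 4. ML Pipeline with sklearn

A **Pipeline** combines preprocessing and modeling in a single object. Because every transformer is fitted only on the data passed to `fit`, cross-validating the whole pipeline keeps preprocessing statistics from leaking out of the validation folds.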
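
The following minimal sketch illustrates the leakage point. It assumes a simple `StandardScaler` plus `LogisticRegression` pipeline (chosen here only for brevity) and checks which statistics the fitted scaler actually stores.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

pipe = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('clf', LogisticRegression(max_iter=1000)),
])
pipe.fit(X_train, y_train)  # the scaler is fitted on X_train only

# The fitted scaler stores training-set means, not full-dataset means
scaler = pipe.named_steps['scaler']
print('Uses training means: ', np.allclose(scaler.mean_, X_train.mean(axis=0)))
print('Uses full-data means:', np.allclose(scaler.mean_, X.mean(axis=0)))
print(f'Test accuracy: {pipe.score(X_test, y_test):.4f}')
```

At prediction time the pipeline applies the stored training statistics to the test rows, so the test set never influences preprocessing.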

In [None]:
# ============================================================
# ML Pipeline with ColumnTransformer
# ============================================================

# Create a mixed-type dataset (numeric + categorical simulation)
# Use breast_cancer numeric features + add a simulated categorical feature
X_mixed = X_bc.copy()
X_mixed['risk_category'] = pd.cut(
    X_bc['mean radius'], bins=3,
    labels=['low', 'medium', 'high']
).astype(str)

numeric_features = X_bc.columns.tolist()
categorical_features = ['risk_category']

print('=== Mixed-Type Dataset ===')
print(f'Numeric features: {len(numeric_features)}')
print(f'Categorical features: {categorical_features}')
print(f'\nCategory distribution:\n{X_mixed["risk_category"].value_counts()}')

# Build Pipeline
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore', sparse_output=False))
])

preprocessor = ColumnTransformer(transformers=[
    ('num', numeric_transformer, numeric_features),
    ('cat', categorical_transformer, categorical_features)
])

# Full pipeline: preprocessor + classifier
full_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(n_estimators=100, random_state=42))
])

# Split & Evaluate
X_mix_tr, X_mix_te, y_mix_tr, y_mix_te = train_test_split(
    X_mixed, y_bc.values, test_size=0.2, random_state=42, stratify=y_bc.values
)

full_pipeline.fit(X_mix_tr, y_mix_tr)
y_pred_pipe = full_pipeline.predict(X_mix_te)

print('\n=== Full Pipeline Results ===')
print(f'Accuracy: {accuracy_score(y_mix_te, y_pred_pipe):.4f}')
print(f'F1-Score: {f1_score(y_mix_te, y_pred_pipe):.4f}')
print('\nClassification Report:')
print(classification_report(y_mix_te, y_pred_pipe, target_names=bc.target_names))

# Pipeline diagram
print('\n=== Pipeline Structure ===')
print('Pipeline([')
print('  ("preprocessor", ColumnTransformer([')
print('    ("num", Pipeline([imputer, scaler]), numeric_features),')
print('    ("cat", Pipeline([imputer, onehot]), categorical_features)')
print('  ])),')
print('  ("classifier", RandomForestClassifier())')
print('])')

# CV on pipeline
cv_pipe_scores = cross_val_score(full_pipeline, X_mixed, y_bc.values, cv=5)
print(f'\n5-fold CV: {cv_pipe_scores.mean():.4f} ± {cv_pipe_scores.std():.4f}')

## 5. Hyperparameter Tuning

We use **GridSearchCV** (exhaustive search over every combination) and **RandomizedSearchCV** (a fixed number of random samples from the parameter distributions) on Random Forest.
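
To show why the two strategies differ in cost, the sketch below enumerates the candidates for the grid used in this section and compares that with a fixed random budget. It uses only the standard library and is a conceptual illustration, not how scikit-learn generates candidates internally; the budget of 10 draws is a hypothetical value, and the draws here are independent (repeats are possible).

```python
import itertools
import random

grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 5, 10],
    'min_samples_split': [2, 5],
    'max_features': ['sqrt', 'log2'],
}

# Grid search: the Cartesian product of all candidate values
names = list(grid)
grid_candidates = [dict(zip(names, combo))
                   for combo in itertools.product(*grid.values())]

# Random search: a fixed budget of independent draws (hypothetical budget of 10)
rng = random.Random(42)
n_iter = 10
random_candidates = [{name: rng.choice(values) for name, values in grid.items()}
                     for _ in range(n_iter)]

n_folds = 5
print(f'Grid search:   {len(grid_candidates)} candidates, {len(grid_candidates) * n_folds} model fits')
print(f'Random search: {len(random_candidates)} candidates, {len(random_candidates) * n_folds} model fits')
```

The grid's cost multiplies with every added parameter value, while the random budget stays fixed no matter how large the search space becomes.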

In [None]:
# ============================================================
# Hyperparameter Tuning: GridSearch vs RandomSearch
# ============================================================

X_ht = X_bc.values
y_ht = y_bc.values

# Base model
base_rf = RandomForestClassifier(random_state=42)
base_cv = cross_val_score(base_rf, X_ht, y_ht, cv=5).mean()
print(f'Baseline RF (default params) CV Accuracy: {base_cv:.4f}')

# ---- Grid Search ----
grid_param = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 5, 10],
    'min_samples_split': [2, 5],
    'max_features': ['sqrt', 'log2']
}

print(f'\nGrid Search – total combinations: {3*3*2*2} (n_est x depth x split x features)')
grid_search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid=grid_param,
    cv=5, scoring='accuracy',
    n_jobs=-1, verbose=0
)
grid_search.fit(X_ht, y_ht)
print(f'Grid Search – Best CV Accuracy: {grid_search.best_score_:.4f}')
print(f'Grid Search – Best Params: {grid_search.best_params_}')

# ---- Random Search ----
from scipy.stats import randint
rand_param = {
    'n_estimators': randint(50, 500),
    'max_depth': [None, 3, 5, 7, 10, 15, 20],
    'min_samples_split': randint(2, 20),
    'min_samples_leaf': randint(1, 10),
    'max_features': ['sqrt', 'log2', None]
}

rand_search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions=rand_param,
    n_iter=50, cv=5, scoring='accuracy',
    n_jobs=-1, random_state=42, verbose=0
)
rand_search.fit(X_ht, y_ht)
print(f'\nRandom Search (50 iter) – Best CV Accuracy: {rand_search.best_score_:.4f}')
print(f'Random Search – Best Params: {rand_search.best_params_}')

# Compare results
print('\n=== Tuning Results Summary ===')
print(f'{"Method":<25} {"Best Accuracy":>15} {"# Evaluations":>15}')
print('-' * 57)
print(f'{"Baseline (default)":<25} {base_cv:>15.4f} {"N/A":>15}')
print(f'{"Grid Search":<25} {grid_search.best_score_:>15.4f} {len(grid_search.cv_results_["params"]):>15}')
print(f'{"Random Search (50)":<25} {rand_search.best_score_:>15.4f} {rand_search.n_iter:>15}')

# Visualization: CV score distribution for GridSearch
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

gs_scores = grid_search.cv_results_['mean_test_score']
axes[0].hist(gs_scores, bins=15, color='steelblue', alpha=0.8, edgecolor='white')
axes[0].axvline(x=grid_search.best_score_, color='red', linestyle='--',
                linewidth=2, label=f'Best = {grid_search.best_score_:.4f}')
axes[0].set_title('Grid Search – Score Distribution', fontsize=12, fontweight='bold')
axes[0].set_xlabel('CV Mean Accuracy')
axes[0].set_ylabel('Count')
axes[0].legend()

rs_scores = rand_search.cv_results_['mean_test_score']
axes[1].hist(rs_scores, bins=15, color='darkorange', alpha=0.8, edgecolor='white')
axes[1].axvline(x=rand_search.best_score_, color='red', linestyle='--',
                linewidth=2, label=f'Best = {rand_search.best_score_:.4f}')
axes[1].set_title('Random Search – Score Distribution', fontsize=12, fontweight='bold')
axes[1].set_xlabel('CV Mean Accuracy')
axes[1].set_ylabel('Count')
axes[1].legend()

plt.tight_layout()
plt.show()

## 6. Handling Class Imbalance

Class imbalance biases a model toward the majority class.  
We compare **no treatment**, **SMOTE oversampling**, and **class_weight='balanced'**.
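
The core idea behind SMOTE is to create synthetic minority samples by interpolating between a minority point and one of its nearest minority neighbours. The sketch below is a simplified NumPy version of that idea; `smote_sketch`, its parameters, and the toy points are hypothetical, and the actual `imblearn` implementation handles more details (sampling strategy, neighbour search, edge cases).

```python
import numpy as np

def smote_sketch(X_min, n_new, k=3, seed=0):
    """Generate n_new synthetic minority samples by neighbour interpolation."""
    rng = np.random.default_rng(seed)
    # Pairwise Euclidean distances between minority samples
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]
    synthetic = np.empty((n_new, X_min.shape[1]))
    for i in range(n_new):
        base = rng.integers(len(X_min))          # random minority sample
        nn = neighbours[base, rng.integers(k)]   # one of its k nearest neighbours
        gap = rng.random()                       # interpolation factor in [0, 1)
        synthetic[i] = X_min[base] + gap * (X_min[nn] - X_min[base])
    return synthetic

# Hypothetical minority-class points in 2D
X_min = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.0], [2.5, 2.5], [3.0, 1.5]])
X_new = smote_sketch(X_min, n_new=10)
print('Synthetic samples shape:', X_new.shape)
```

Because each synthetic point lies on a segment between two real minority points, oversampling should be applied to the training split only, as the cell below does.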

In [None]:
# ============================================================
# Class Imbalance: SMOTE vs class_weight vs no treatment
# ============================================================

# Create highly imbalanced dataset
X_imb, y_imb = make_classification(
    n_samples=1000,
    n_features=20,
    n_informative=10,
    n_redundant=5,
    weights=[0.90, 0.10],   # 90% majority, 10% minority
    random_state=42
)

print('=== Imbalanced Dataset ===')
unique, counts = np.unique(y_imb, return_counts=True)
for cls, cnt in zip(unique, counts):
    print(f'  Class {cls}: {cnt} samples ({cnt/len(y_imb)*100:.1f}%)')

# Split
X_imb_tr, X_imb_te, y_imb_tr, y_imb_te = train_test_split(
    X_imb, y_imb, test_size=0.2, random_state=42, stratify=y_imb
)

# ---- Strategy 1: No treatment ----
clf_no_treat = RandomForestClassifier(n_estimators=100, random_state=42)
clf_no_treat.fit(X_imb_tr, y_imb_tr)
y_pred_no_treat = clf_no_treat.predict(X_imb_te)

# ---- Strategy 2: SMOTE ----
smote = SMOTE(random_state=42)
X_smote, y_smote = smote.fit_resample(X_imb_tr, y_imb_tr)
print(f'\nAfter SMOTE resampling:')
unique_s, counts_s = np.unique(y_smote, return_counts=True)
for cls, cnt in zip(unique_s, counts_s):
    print(f'  Class {cls}: {cnt} samples')

clf_smote = RandomForestClassifier(n_estimators=100, random_state=42)
clf_smote.fit(X_smote, y_smote)
y_pred_smote = clf_smote.predict(X_imb_te)

# ---- Strategy 3: class_weight='balanced' ----
clf_weighted = RandomForestClassifier(
    n_estimators=100, class_weight='balanced', random_state=42
)
clf_weighted.fit(X_imb_tr, y_imb_tr)
y_pred_weighted = clf_weighted.predict(X_imb_te)

# Compare results – focus on minority class (class 1) recall
from sklearn.metrics import precision_recall_fscore_support

def get_metrics(y_true, y_pred, label):
    """Return accuracy and per-class precision/recall/F1 as formatted strings."""
    acc = accuracy_score(y_true, y_pred)
    p, r, f, _ = precision_recall_fscore_support(y_true, y_pred, average=None)
    return {
        'Strategy': label,
        'Overall Accuracy': f'{acc:.4f}',
        'Majority Precision': f'{p[0]:.4f}',
        'Minority Precision': f'{p[1]:.4f}',
        'Majority Recall':    f'{r[0]:.4f}',
        'Minority Recall':    f'{r[1]:.4f}',
        'Minority F1':        f'{f[1]:.4f}'
    }

results_imb = [
    get_metrics(y_imb_te, y_pred_no_treat, 'No Treatment'),
    get_metrics(y_imb_te, y_pred_smote,    'SMOTE'),
    get_metrics(y_imb_te, y_pred_weighted, 'class_weight=balanced'),
]
df_imb = pd.DataFrame(results_imb).set_index('Strategy')
print('\n=== Class Imbalance Strategies Comparison ===')
print(df_imb.to_string())

# Visualization
fig, axes = plt.subplots(1, 3, figsize=(16, 5))

# Distribution comparison
dist_data = {
    'Original\nTrain': [np.sum(y_imb_tr == 0), np.sum(y_imb_tr == 1)],
    'After SMOTE': [np.sum(y_smote == 0), np.sum(y_smote == 1)],
}
x_pos = np.arange(len(dist_data))
width = 0.35
keys = list(dist_data.keys())
axes[0].bar(x_pos - width/2, [dist_data[k][0] for k in keys], width,
            label='Majority (Class 0)', color='steelblue', alpha=0.8)
axes[0].bar(x_pos + width/2, [dist_data[k][1] for k in keys], width,
            label='Minority (Class 1)', color='tomato', alpha=0.8)
axes[0].set_xticks(x_pos)
axes[0].set_xticklabels(keys)
axes[0].set_title('Class Distribution Before/After SMOTE', fontsize=11, fontweight='bold')
axes[0].set_ylabel('Sample Count')
axes[0].legend()

# Minority recall comparison
strategies = ['No Treatment', 'SMOTE', 'class_weight']
min_recalls = [
    float(df_imb.loc['No Treatment', 'Minority Recall']),
    float(df_imb.loc['SMOTE', 'Minority Recall']),
    float(df_imb.loc['class_weight=balanced', 'Minority Recall'])
]
axes[1].bar(strategies, min_recalls, color=['#FF5252', '#4CAF50', '#2196F3'],
            alpha=0.85, edgecolor='white', linewidth=1.5)
axes[1].set_ylim(0, 1.1)
axes[1].set_title('Minority Class Recall', fontsize=12, fontweight='bold')
axes[1].set_ylabel('Recall')
for i, val in enumerate(min_recalls):
    axes[1].text(i, val + 0.02, f'{val:.4f}', ha='center', fontweight='bold', fontsize=10)
axes[1].tick_params(axis='x', rotation=10)

# Minority F1 comparison
min_f1s = [
    float(df_imb.loc['No Treatment', 'Minority F1']),
    float(df_imb.loc['SMOTE', 'Minority F1']),
    float(df_imb.loc['class_weight=balanced', 'Minority F1'])
]
axes[2].bar(strategies, min_f1s, color=['#FF5252', '#4CAF50', '#2196F3'],
            alpha=0.85, edgecolor='white', linewidth=1.5)
axes[2].set_ylim(0, 1.1)
axes[2].set_title('Minority Class F1-Score', fontsize=12, fontweight='bold')
axes[2].set_ylabel('F1-Score')
for i, val in enumerate(min_f1s):
    axes[2].text(i, val + 0.02, f'{val:.4f}', ha='center', fontweight='bold', fontsize=10)
axes[2].tick_params(axis='x', rotation=10)

plt.tight_layout()
plt.show()

## 7. Stacking Classifier

**Stacking** uses the predictions of several base learners as input features for a meta-learner.  
The meta-learner learns how to weight each base model, so the stack can combine their strengths.
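
The mechanism can be sketched by hand with `cross_val_predict`: each base learner produces out-of-fold probabilities, and those columns become the training features of the meta-learner. This is a simplified illustration on hypothetical toy data with two arbitrarily chosen base learners; `StackingClassifier` additionally refits the base learners on the full training data for use at prediction time.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data, used only for illustration
X, y = make_classification(n_samples=300, n_features=10, random_state=42)

base_learners = {
    'rf': RandomForestClassifier(n_estimators=50, random_state=42),
    'dt': DecisionTreeClassifier(max_depth=3, random_state=42),
}

# Out-of-fold probabilities: each row is predicted by a model that never saw it
meta_features = np.column_stack([
    cross_val_predict(model, X, y, cv=5, method='predict_proba')[:, 1]
    for model in base_learners.values()
])

# The meta-learner is trained on the base learners' predictions, not on X
meta_learner = LogisticRegression(max_iter=1000).fit(meta_features, y)
print('Meta-feature matrix shape:', meta_features.shape)
print('Meta-learner coefficients:', dict(zip(base_learners, meta_learner.coef_[0].round(3))))
```

Using out-of-fold predictions matters: if the meta-learner were trained on in-sample predictions, it would overtrust base learners that overfit.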

In [None]:
# ============================================================
# Stacking Classifier
# ============================================================

# Define base learners (diverse models)
base_estimators = [
    ('rf',   RandomForestClassifier(n_estimators=100, random_state=42)),
    ('gb',   GradientBoostingClassifier(n_estimators=100, random_state=42)),
    ('dt',   DecisionTreeClassifier(max_depth=5, random_state=42)),
    ('svc',  SVC(probability=True, random_state=42)),
]

# Meta-learner
meta_learner = LogisticRegression(max_iter=1000, random_state=42)

# Build StackingClassifier
stacking_clf = StackingClassifier(
    estimators=base_estimators,
    final_estimator=meta_learner,
    cv=5,
    passthrough=False,
    n_jobs=-1
)

# Evaluate with 5-fold CV
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Compare all models
final_models = {
    'Decision Tree':      DecisionTreeClassifier(max_depth=5, random_state=42),
    'Random Forest':      RandomForestClassifier(n_estimators=100, random_state=42),
    'Gradient Boosting':  GradientBoostingClassifier(n_estimators=100, random_state=42),
    'XGBoost':            XGBClassifier(n_estimators=100, eval_metric='logloss', random_state=42),
    'Stacking':           stacking_clf,
}

print('=== Final Model Comparison (5-fold CV on Breast Cancer) ===')
print(f'{"Model":<22} {"Mean Accuracy":>15} {"Std":>8}')
print('-' * 47)

final_results = {}
for name, model in final_models.items():
    final_results[name] = cross_val_score(model, X_bc.values, y_bc.values, cv=cv, scoring='accuracy', n_jobs=-1)

# Mark the best model only after every model has been scored
best_name = max(final_results, key=lambda n: final_results[n].mean())
for name, scores in final_results.items():
    marker = ' ★' if name == best_name else ''
    print(f'{name:<22} {scores.mean():>15.4f} {scores.std():>8.4f}{marker}')

# Visualization
fig, axes = plt.subplots(1, 2, figsize=(15, 6))

# Box plot
names_f = list(final_results.keys())
scores_f = list(final_results.values())
bp = axes[0].boxplot(
    scores_f,
    labels=[n.replace(' ', '\n') for n in names_f],
    patch_artist=True,
    notch=False,
    medianprops={'color': 'red', 'linewidth': 2}
)
bp_colors = ['#FF9800', '#4CAF50', '#2196F3', '#9C27B0', '#F44336']
for patch, color in zip(bp['boxes'], bp_colors):
    patch.set_facecolor(color)
    patch.set_alpha(0.7)
axes[0].set_title('CV Score Distribution – All Models', fontsize=12, fontweight='bold')
axes[0].set_ylabel('Accuracy')
axes[0].set_ylim(0.88, 1.02)

# Mean accuracy bar chart
means_f = [s.mean() for s in scores_f]
stds_f  = [s.std() for s in scores_f]
bars = axes[1].bar(
    range(len(names_f)), means_f,
    color=bp_colors, alpha=0.85,
    yerr=stds_f, capsize=5,
    edgecolor='white', linewidth=1.5
)
axes[1].set_xticks(range(len(names_f)))
axes[1].set_xticklabels([n.replace(' ', '\n') for n in names_f], fontsize=9)
axes[1].set_ylim(0.88, 1.04)
axes[1].set_title('Mean Accuracy ± Std (5-fold CV)', fontsize=12, fontweight='bold')
axes[1].set_ylabel('Mean Accuracy')
best_idx = np.argmax(means_f)
for i, (bar, mean) in enumerate(zip(bars, means_f)):
    style = dict(fontweight='bold', fontsize=9)
    if i == best_idx:
        style['color'] = 'darkred'
    axes[1].text(bar.get_x() + bar.get_width()/2,
                 bar.get_height() + stds_f[i] + 0.002,
                 f'{mean:.4f}', ha='center', **style)

plt.tight_layout()
plt.show()

print('\n★ Best model:', names_f[np.argmax(means_f)])

## Practicum Assignments

Complete the following tasks and attach your results with a brief analysis:

---

**Task 1 – Ensemble Comparison**  
Use the `wine` dataset from sklearn. Compare `BaggingClassifier`, `AdaBoostClassifier`, `GradientBoostingClassifier`, and `XGBClassifier` using 10-fold CV. Create a radar chart (or bar chart) showing accuracy, precision, recall, and F1 for each model.

**Task 2 – Feature Engineering**  
Load the `diabetes` dataset from sklearn. Implement: (a) PolynomialFeatures with degree 2, (b) SelectKBest with k=5 (use `f_regression` as the score function, since this is a regression task), (c) PCA with enough n_components to explain 95% of the variance. Compare the prediction RMSE of `Ridge` regression across the three approaches.

**Task 3 – Complete Pipeline**  
Build an ML pipeline using the `titanic` dataset (from seaborn or another source). The pipeline must handle: missing values (imputer), numeric features (scaler), categorical features (OHE), and a classifier (your choice). Perform hyperparameter tuning with GridSearchCV.

**Task 4 – Class Imbalance Deep Dive**  
Create an imbalanced dataset with a 95:5 ratio using `make_classification`. Compare at least 5 strategies: no treatment, random oversampling, SMOTE, ADASYN, and undersampling (RandomUnderSampler). Plot a confusion matrix for each strategy and compare minority-class recall.

**Task 5 – Custom Stacking**  
Build a `StackingClassifier` with a combination of base estimators different from the example above (at least 4 models of different types). Try two different meta-learners. Compare the performance of the stack with the best base estimator using nested cross-validation.