# XGBoost - Advanced Hands-On Exercises

This notebook contains comprehensive hands-on exercises for understanding and mastering XGBoost.

## 📚 Learning Objectives
After completing these exercises, you will be able to:
1. Use XGBoost for classification and regression
2. Perform effective hyperparameter tuning
3. Interpret models with feature importance and SHAP
4. Handle overfitting with regularization and early stopping
5. Compare XGBoost with other algorithms
6. Optimize performance for ML competitions

## 🛠️ Setup and Library Imports

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
from sklearn.datasets import load_breast_cancer, fetch_california_housing
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
import xgboost as xgb
from xgboost import XGBClassifier, XGBRegressor
import warnings
warnings.filterwarnings('ignore')

# Set the plotting style
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")

print("Libraries imported successfully!")
print(f"XGBoost version: {xgb.__version__}")

## 📊 Exercise 1: XGBoost Classification from Scratch

**Task**: Implement an XGBoost classifier on the breast cancer dataset

In [None]:
# TODO: Load breast cancer dataset
data = load_breast_cancer()
X, y = data.data, data.target
feature_names = data.feature_names

print(f"Dataset shape: {X.shape}")
print(f"Classes: {data.target_names}")
print(f"Class distribution: {np.bincount(y)}")

# TODO: Split data
X_train, X_test, y_train, y_test = train_test_split(
    # Your code here
)

print(f"Training set: {X_train.shape}")
print(f"Test set: {X_test.shape}")
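
One possible way to fill in the split is sketched below. The 80/20 ratio, `random_state=42`, and the use of `stratify` are assumptions chosen for illustration, not requirements of the exercise.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X, y = data.data, data.target

# Hold out 20% of the rows for testing; stratify=y keeps the class
# ratio nearly identical in the training and test splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(f"Training set: {X_train.shape}")
print(f"Test set: {X_test.shape}")
print(f"Positive rate (train/test): {y_train.mean():.3f} / {y_test.mean():.3f}")
```

Stratification matters more on small or imbalanced datasets, where a purely random split can shift the class ratio noticeably.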

In [None]:
# TODO: Create and train an XGBoost classifier
xgb_clf = XGBClassifier(
    # Your hyperparameters here
    # Tip: start with the default parameters
    # Note: in XGBoost >= 2.0, early_stopping_rounds and eval_metric are set here, not in fit()
)

# TODO: Fit the model with an evaluation set for monitoring
xgb_clf.fit(
    # Your code here
    # Don't forget eval_set (needed for evals_result() and early stopping)
)

# TODO: Make predictions
y_pred = # Your code here
y_pred_proba = # Your code here

# TODO: Calculate metrics
accuracy = # Your code here
print(f"Accuracy: {accuracy:.4f}")

In [None]:
# TODO: Analyze feature importance
feature_importance = pd.DataFrame({
    'feature': feature_names,
    'importance': # Your code here
}).sort_values('importance', ascending=False)

print("Top 10 Most Important Features:")
print(feature_importance.head(10))

# TODO: Plot feature importance
plt.figure(figsize=(10, 8))
# Your plotting code here
plt.show()

### Learning Curve Analysis

In [None]:
# TODO: Plot the learning curve from evals_result
results = xgb_clf.evals_result()

plt.figure(figsize=(12, 4))

# Plot 1: Training vs Validation Loss
plt.subplot(1, 2, 1)
epochs = range(len(results['validation_0']['logloss']))
# Your plotting code here

# Plot 2: Overfitting Analysis
plt.subplot(1, 2, 2)
# Calculate and plot training-validation gap
# Your code here

plt.tight_layout()
plt.show()

## 📊 Exercise 2: Hyperparameter Tuning

**Task**: Tune hyperparameters to improve model performance

In [None]:
# TODO: Define parameter grid for tuning
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [3, 4, 5, 6],
    'learning_rate': [0.01, 0.1, 0.2],
    'subsample': [0.8, 0.9, 1.0],
    # Add more parameters
}

# TODO: Setup GridSearchCV
grid_search = GridSearchCV(
    # Your code here
)

# TODO: Fit grid search
print("🔍 Starting hyperparameter tuning...")
grid_search.fit(X_train, y_train)

print(f"✨ Best parameters: {grid_search.best_params_}")
print(f"🏆 Best CV score: {grid_search.best_score_:.4f}")

In [None]:
# TODO: Compare default vs tuned model
default_model = XGBClassifier(random_state=42)
tuned_model = grid_search.best_estimator_

# Train both models
# Your code here

# Compare performance
default_score = # Your code here
tuned_score = # Your code here

print(f"Default model accuracy: {default_score:.4f}")
print(f"Tuned model accuracy: {tuned_score:.4f}")
print(f"Improvement: {tuned_score - default_score:.4f}")

## 📊 Exercise 3: XGBoost Regression

**Task**: Use XGBoost for a regression task on the California housing dataset

In [None]:
# TODO: Load California housing dataset
housing = fetch_california_housing()
X_reg, y_reg = housing.data, housing.target

print(f"Housing dataset shape: {X_reg.shape}")
print(f"Features: {housing.feature_names}")
print(f"Target statistics:")
print(f"  Mean: {y_reg.mean():.2f} (target is in units of $100,000)")
print(f"  Std: {y_reg.std():.2f}")
print(f"  Range: {y_reg.min():.2f} - {y_reg.max():.2f}")

# TODO: Split data
X_train_reg, X_test_reg, y_train_reg, y_test_reg = train_test_split(
    # Your code here
)

In [None]:
# TODO: Create and train an XGBoost regressor
xgb_reg = XGBRegressor(
    # Your hyperparameters here
)

# TODO: Train with early stopping
xgb_reg.fit(
    # Your code here
)

# TODO: Make predictions and evaluate
y_pred_reg = # Your code here

# Calculate regression metrics
r2 = r2_score(y_test_reg, y_pred_reg)
rmse = np.sqrt(mean_squared_error(y_test_reg, y_pred_reg))
mae = mean_absolute_error(y_test_reg, y_pred_reg)

print(f"Regression Results:")
print(f"  R² Score: {r2:.4f}")
print(f"  RMSE: {rmse:.4f}")
print(f"  MAE: {mae:.4f}")

In [None]:
# TODO: Create diagnostic plots for regression
fig, axes = plt.subplots(2, 2, figsize=(15, 10))

# 1. Predictions vs Actual
# Your plotting code here

# 2. Residuals plot
# Your plotting code here

# 3. Feature importance
# Your plotting code here

# 4. Learning curve
# Your plotting code here

plt.tight_layout()
plt.show()

## 📊 Exercise 4: Cross-Validation and Model Validation

**Task**: Implement robust model validation

In [None]:
# TODO: Perform k-fold cross-validation
from sklearn.model_selection import StratifiedKFold

# Setup cross-validation
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# TODO: Cross-validate XGBoost
xgb_model = XGBClassifier(random_state=42)
cv_scores = cross_val_score(
    # Your code here
)

print(f"Cross-Validation Results:")
print(f"  Scores: {cv_scores}")
print(f"  Mean: {cv_scores.mean():.4f}")
print(f"  Std: {cv_scores.std():.4f}")
print(f"  Mean ± 2 std (rough spread across folds, not a formal CI): {cv_scores.mean() - 2*cv_scores.std():.4f} - {cv_scores.mean() + 2*cv_scores.std():.4f}")

In [None]:
# TODO: Compare with XGBoost native cross-validation
# Convert to DMatrix
dtrain = xgb.DMatrix(X_train, label=y_train)

# Parameters
params = {
    'objective': 'binary:logistic',
    'eval_metric': 'logloss',
    'max_depth': 6,
    'learning_rate': 0.1,
    'seed': 42
}

# TODO: XGBoost native CV
cv_results = xgb.cv(
    # Your parameters here
    # Tip: pass early_stopping_rounds so the returned rows end at the best iteration
)

print(f"XGBoost Native CV Results:")
print(f"  Best iteration: {len(cv_results)}")
print(f"  Best train score: {cv_results['train-logloss-mean'].iloc[-1]:.4f}")
print(f"  Best test score: {cv_results['test-logloss-mean'].iloc[-1]:.4f}")

## 📊 Exercise 5: Model Comparison

**Task**: Compare XGBoost with other ensemble algorithms

In [None]:
# TODO: Setup multiple models for comparison
models = {
    'XGBoost': XGBClassifier(random_state=42),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
    'Gradient Boosting': GradientBoostingClassifier(n_estimators=100, random_state=42),
    # Add more models if you want
}

results = {}  # Note: this rebinds `results` from the learning-curve cell in Exercise 1

# TODO: Train and evaluate each model
for name, model in models.items():
    print(f"Training {name}...")
    
    # Cross-validate
    cv_scores = cross_val_score(model, X_train, y_train, cv=5)
    
    # Train on full training set
    model.fit(X_train, y_train)
    
    # Predict on test set
    y_pred = model.predict(X_test)
    test_accuracy = accuracy_score(y_test, y_pred)
    
    results[name] = {
        'CV Mean': cv_scores.mean(),
        'CV Std': cv_scores.std(),
        'Test Accuracy': test_accuracy
    }

# Display results
results_df = pd.DataFrame(results).T
print("\nModel Comparison Results:")
print(results_df)

In [None]:
# TODO: Visualize model comparison
plt.figure(figsize=(12, 4))

# Plot 1: CV Scores
plt.subplot(1, 2, 1)
# Your plotting code here

# Plot 2: Test Accuracy
plt.subplot(1, 2, 2)
# Your plotting code here

plt.tight_layout()
plt.show()

## 📊 Exercise 6: Advanced - Handling Imbalanced Data

**Task**: Use XGBoost for imbalanced classification

In [None]:
# TODO: Create imbalanced dataset
from sklearn.datasets import make_classification

X_imbal, y_imbal = make_classification(
    n_samples=1000,
    n_features=20,
    n_informative=15,
    n_redundant=5,
    n_clusters_per_class=1,
    weights=[0.9, 0.1],  # 90% class 0, 10% class 1
    random_state=42
)

print(f"Imbalanced dataset:")
print(f"  Total samples: {len(y_imbal)}")
print(f"  Class distribution: {np.bincount(y_imbal)}")
print(f"  Class ratio: {np.bincount(y_imbal)[1]/np.bincount(y_imbal)[0]:.3f}")

X_train_imbal, X_test_imbal, y_train_imbal, y_test_imbal = train_test_split(
    X_imbal, y_imbal, test_size=0.2, random_state=42, stratify=y_imbal
)

In [None]:
# TODO: Train XGBoost with scale_pos_weight
# Calculate scale_pos_weight
scale_pos_weight = (y_train_imbal == 0).sum() / (y_train_imbal == 1).sum()
print(f"Scale pos weight: {scale_pos_weight:.2f}")

# Models to compare
models_imbal = {
    'XGBoost (default)': XGBClassifier(random_state=42),
    'XGBoost (balanced)': XGBClassifier(
        scale_pos_weight=scale_pos_weight,
        random_state=42
    ),
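    'XGBoost (class_weight)': XGBClassifier(
        # XGBClassifier has no class_weight parameter; to weight classes,
        # pass sample_weight to fit() instead. As written, this entry is
        # identical to the default model.
        random_state=42
    )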
}

# TODO: Train and evaluate models
from sklearn.metrics import precision_recall_fscore_support, roc_auc_score

imbal_results = {}

for name, model in models_imbal.items():
    print(f"\nTraining {name}...")
    
    model.fit(X_train_imbal, y_train_imbal)
    y_pred = model.predict(X_test_imbal)
    y_pred_proba = model.predict_proba(X_test_imbal)[:, 1]
    
    # Calculate metrics
    accuracy = accuracy_score(y_test_imbal, y_pred)
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_test_imbal, y_pred, average='binary'
    )
    auc = roc_auc_score(y_test_imbal, y_pred_proba)
    
    imbal_results[name] = {
        'Accuracy': accuracy,
        'Precision': precision,
        'Recall': recall,
        'F1': f1,
        'AUC': auc
    }
    
    print(f"  Accuracy: {accuracy:.4f}")
    print(f"  Precision: {precision:.4f}")
    print(f"  Recall: {recall:.4f}")
    print(f"  F1: {f1:.4f}")
    print(f"  AUC: {auc:.4f}")

# Display comparison
imbal_df = pd.DataFrame(imbal_results).T
print("\nImbalanced Data Results:")
print(imbal_df)

## 📊 Exercise 7: Feature Engineering for XGBoost

**Task**: Apply advanced feature engineering techniques

In [None]:
# TODO: Create feature interactions
from sklearn.preprocessing import PolynomialFeatures

# Using housing dataset
housing_df = pd.DataFrame(X_reg, columns=housing.feature_names)
housing_df['target'] = y_reg

print("Original features:")
print(housing_df.columns.tolist())

# TODO: Create domain-specific features
# Rooms per person (AveRooms is already the average rooms per household)
housing_df['rooms_per_person'] = housing_df['AveRooms'] / housing_df['AveOccup']

# Bedrooms ratio
housing_df['bedrooms_ratio'] = housing_df['AveBedrms'] / housing_df['AveRooms']

# Estimated number of households in the block group
# (the dataset has no 'HouseholdIncome' column; AveOccup is already population per household)
housing_df['households_est'] = housing_df['Population'] / housing_df['AveOccup']

# TODO: Add your own feature engineering ideas
# Your code here

print(f"\nAfter feature engineering: {len(housing_df.columns)-1} features")
print("New features created:")
new_features = [col for col in housing_df.columns if col not in housing.feature_names and col != 'target']
print(new_features)

In [None]:
# TODO: Compare original vs engineered features
X_original = housing_df[housing.feature_names]
X_engineered = housing_df.drop('target', axis=1)
y_target = housing_df['target']

# Split both datasets
X_orig_train, X_orig_test, y_orig_train, y_orig_test = train_test_split(
    X_original, y_target, test_size=0.2, random_state=42
)

X_eng_train, X_eng_test, y_eng_train, y_eng_test = train_test_split(
    X_engineered, y_target, test_size=0.2, random_state=42
)

# Train models
model_original = XGBRegressor(n_estimators=100, random_state=42)
model_engineered = XGBRegressor(n_estimators=100, random_state=42)

# TODO: Fit and evaluate both models
# Store the test-set R² scores in r2_original and r2_engineered
# Your code here

# Compare results
print("Feature Engineering Impact:")
print(f"  Original features R²: {r2_original:.4f}")
print(f"  Engineered features R²: {r2_engineered:.4f}")
print(f"  Improvement: {r2_engineered - r2_original:.4f}")

## 🎯 Challenge: Kaggle-Style Competition

**Task**: Combine all the techniques to achieve the best performance

In [None]:
# TODO: Build your best XGBoost model
# Combine all techniques you've learned:
# 1. Feature engineering
# 2. Hyperparameter tuning  
# 3. Cross-validation
# 4. Early stopping
# 5. Regularization

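def build_ultimate_xgboost_model(X, y, task_type='classification'):
    """
    Build an XGBoost model that combines the techniques above.

    Must return a scikit-learn compatible estimator (not None):
    cross_val_score below clones and refits it on each fold.
    """
    # Your implementation here
    pass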

# TODO: Test your ultimate model
# Use breast cancer dataset for final challenge
ultimate_model = build_ultimate_xgboost_model(X, y, 'classification')

# Evaluate with cross-validation
final_cv_scores = cross_val_score(ultimate_model, X, y, cv=10, scoring='accuracy')

print(f"🏆 Ultimate XGBoost Model Results:")
print(f"  CV Accuracy: {final_cv_scores.mean():.4f} ± {final_cv_scores.std():.4f}")
print(f"  CV Scores: {final_cv_scores}")

# Compare with baseline
baseline = XGBClassifier(random_state=42)
baseline_scores = cross_val_score(baseline, X, y, cv=10, scoring='accuracy')

print(f"\n📊 Baseline vs Ultimate:")
print(f"  Baseline: {baseline_scores.mean():.4f} ± {baseline_scores.std():.4f}")
print(f"  Ultimate: {final_cv_scores.mean():.4f} ± {final_cv_scores.std():.4f}")
print(f"  Improvement: {final_cv_scores.mean() - baseline_scores.mean():.4f}")

## 📝 Reflection and Best Practices

**Task**: Analyze the results and identify best practices

In [None]:
# TODO: Create comprehensive analysis
print("🎓 XGBoost Learning Summary")
print("=" * 50)

# Summarize all your findings
print("""
Key Insights from Exercises:

1. Feature Importance:
   - Most important features for breast cancer: [Your findings]
   - Most important features for housing prices: [Your findings]

2. Hyperparameter Impact:
   - Most influential parameters: [Your findings]
   - Best parameter combinations: [Your findings]

3. Model Performance:
   - XGBoost vs Random Forest: [Your findings]
   - XGBoost vs Gradient Boosting: [Your findings]

4. Advanced Techniques:
   - Feature engineering impact: [Your findings]
   - Cross-validation insights: [Your findings]
   - Imbalanced data handling: [Your findings]

5. Best Practices Discovered:
   - [Your best practice 1]
   - [Your best practice 2]
   - [Your best practice 3]
""")

# TODO: Create your XGBoost checklist
xgboost_checklist = """
🔧 XGBoost Implementation Checklist:

Data Preparation:
□ Handle missing values (XGBoost can handle them, but preprocessing might help)
□ Encode categorical variables
□ Feature scaling (usually unnecessary for tree-based models)
□ Create domain-specific features

Model Training:
□ Set appropriate objective function
□ Use eval_set for monitoring
□ Set early_stopping_rounds
□ Start with default parameters

Hyperparameter Tuning:
□ Tune n_estimators with early stopping
□ Adjust learning_rate and max_depth
□ Try different subsample values
□ Apply regularization (reg_alpha, reg_lambda)

Model Validation:
□ Use cross-validation for robust estimates
□ Plot learning curves
□ Check for overfitting
□ Validate on holdout test set

Model Interpretation:
□ Analyze feature importance
□ Use SHAP for detailed interpretation
□ Check partial dependence plots
□ Validate model makes sense
"""

print(xgboost_checklist)

## 🎯 Next Steps and Advanced Topics

After completing these exercises, explore the following advanced topics:

### 🚀 Advanced XGBoost:
1. **Multi-class classification** with objective='multi:softprob'
2. **Custom objective functions** and evaluation metrics
3. **GPU acceleration** with device='cuda' (XGBoost >= 2.0; tree_method='gpu_hist' is deprecated)
4. **Incremental learning** with xgb.train() (continue from an existing model via xgb_model)

### 🔄 Ensemble Methods:
1. **Stacking** XGBoost with other algorithms
2. **Blending** multiple XGBoost models
3. **Bayesian optimization** for hyperparameter tuning
4. **AutoML** with an XGBoost backend

### 📊 Production Deployment:
1. **Model serialization** and versioning
2. **Real-time prediction** APIs
3. **Model monitoring** and drift detection
4. **A/B testing** for model comparison

### 🏆 Competition Techniques:
1. **Feature selection** with recursive elimination
2. **Target encoding** for categorical features
3. **Pseudo-labeling** for semi-supervised learning
4. **Multi-level modeling** and meta-features

---
**🎉 Congratulations!** You have completed this comprehensive XGBoost tutorial!

**Next Algorithm**: Explore **LightGBM** and **CatBoost** as alternative gradient boosting implementations!