<a href="https://colab.research.google.com/github/keripikkaneboo/Machine-Learning/blob/main/07.%20Week%207/TugasWeek7_Booster%26BaggingClassification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [11]:
!pip install imbalanced-learn




In [12]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score
from xgboost import XGBClassifier
from imblearn.over_sampling import SMOTE

In [13]:
# Dataset URLs
datasets = {
    "HeartDisease": "https://raw.githubusercontent.com/farrelrassya/teachingMLDL/refs/heads/main/01.%20Machine%20Learning/01.%20Week%201/Dataset/HeartDisease.csv",
    "CitarumWater": "https://raw.githubusercontent.com/keripikkaneboo/Machine-Learning/refs/heads/main/02.%20Week%202/CitarumWater.csv",
    "Income": "https://raw.githubusercontent.com/farrelrassya/teachingMLDL/refs/heads/main/02.%20Deep%20Learning/Dataset/income.csv"
}

In [14]:
# Evaluate the models with improvements (class balancing and hyperparameter tuning)
def evaluate_models_improved(name, url):
    print(f"\nEvaluasi dataset: {name}")
    df = pd.read_csv(url)

    target_col = df.columns[-1]
    y = df[target_col]
    X = df.drop(columns=[target_col])

    # Label-encode categorical features
    for col in X.select_dtypes(include='object').columns:
        X[col] = LabelEncoder().fit_transform(X[col].astype(str))

    if y.dtype == 'object':
        y = LabelEncoder().fit_transform(y)

    # Shift labels so that the smallest label is zero
    y = y - y.min()

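    # Impute missing values with the column median, then standardize.
    # Note: both steps are fit on the full dataset, before the train/test split.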
    X = X.fillna(X.median(numeric_only=True))
    scaler = StandardScaler()
    X_scaled = pd.DataFrame(scaler.fit_transform(X), columns=X.columns)

    # Split data
    X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42, stratify=y)

    # Apply SMOTE to the training split only, to address class imbalance
    smote = SMOTE(random_state=42)
    X_train_bal, y_train_bal = smote.fit_resample(X_train, y_train)

    # Models with small hyperparameter grids for tuning
    models = {
        "Random Forest": GridSearchCV(RandomForestClassifier(random_state=42), {"n_estimators": [100], "max_depth": [5, None]}, cv=3, n_jobs=-1),
        "Gradient Boosting": GridSearchCV(GradientBoostingClassifier(random_state=42), {"n_estimators": [100], "learning_rate": [0.1, 0.01]}, cv=3, n_jobs=-1),
        "XGBoost": GridSearchCV(XGBClassifier(eval_metric='logloss', random_state=42), {"n_estimators": [100], "learning_rate": [0.1, 0.01]}, cv=3, n_jobs=-1)
    }

    results = []

    for model_name, model in models.items():
        model.fit(X_train_bal, y_train_bal)
        best_model = model.best_estimator_
        y_pred = best_model.predict(X_test)

        if hasattr(best_model, "predict_proba"):
            if len(np.unique(y)) == 2:
                y_proba = best_model.predict_proba(X_test)[:, 1]
                auc = roc_auc_score(y_test, y_proba)
            else:
                y_proba = best_model.predict_proba(X_test)
                auc = roc_auc_score(y_test, y_proba, multi_class='ovr', average='weighted')
        else:
            auc = np.nan

        results.append({
            "Dataset": name,
            "Model": model_name,
            "Akurasi": round(accuracy_score(y_test, y_pred), 4),
            "Presisi": round(precision_score(y_test, y_pred, average='weighted', zero_division=0), 4),
            "Recall": round(recall_score(y_test, y_pred, average='weighted', zero_division=0), 4),
            "F1 Score": round(f1_score(y_test, y_pred, average='weighted', zero_division=0), 4),
            "AUC": round(auc, 4)
        })

    return pd.DataFrame(results)

# Evaluate all datasets
all_results = pd.concat([evaluate_models_improved(name, url) for name, url in datasets.items()])
all_results


Evaluasi dataset: HeartDisease

Evaluasi dataset: CitarumWater

Evaluasi dataset: Income


| | Dataset | Model | Akurasi | Presisi | Recall | F1 Score | AUC |
|---|---|---|---|---|---|---|---|
| 0 | HeartDisease | Random Forest | 0.5246 | 0.503 | 0.5246 | 0.5129 | 0.8152 |
| 1 | HeartDisease | Gradient Boosting | 0.5082 | 0.5188 | 0.5082 | 0.5133 | 0.7698 |
| 2 | HeartDisease | XGBoost | 0.5082 | 0.5184 | 0.5082 | 0.5128 | 0.7807 |
| 0 | CitarumWater | Random Forest | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| 1 | CitarumWater | Gradient Boosting | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| 2 | CitarumWater | XGBoost | 0.996 | 0.996 | 0.996 | 0.9959 | 1.0 |
| 0 | Income | Random Forest | 0.5287 | 0.5122 | 0.5287 | 0.5163 | 0.7303 |
| 1 | Income | Gradient Boosting | 0.5339 | 0.5293 | 0.5339 | 0.5154 | 0.7525 |
| 2 | Income | XGBoost | 0.5422 | 0.5322 | 0.5422 | 0.5155 | 0.7552 |
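
A note on reading the results: the Recall column matches Akurasi in every row. This is what one would expect from `average='weighted'`, because weighting each class's recall by its support reduces algebraically to overall accuracy. The pure-Python sketch below illustrates the identity on made-up labels; the function names and the data are hypothetical and are not part of the notebook.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


def weighted_recall(y_true, y_pred):
    """Per-class recall, averaged with weights equal to each class's support."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for label, n in support.items():
        hits = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        # (n / total) is the class weight, (hits / n) is the class recall
        score += (n / total) * (hits / n)
    return score


# Hypothetical three-class labels, made up for illustration only
y_true = [0, 0, 0, 0, 1, 1, 1, 2, 2, 2]
y_pred = [0, 0, 1, 2, 1, 1, 0, 2, 0, 2]

print(round(weighted_recall(y_true, y_pred), 4), accuracy(y_true, y_pred))  # 0.6 0.6
```

Because the two columns carry the same information under this averaging, a macro average (`average='macro'`) may be a more informative choice when per-class behaviour matters.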


---

##  1. **Accuracy**

**Equation**:

$$
\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}
$$

**Explanation**:

* Measures the share of correct predictions out of all predictions.
* Suitable **when the dataset is balanced**.
* **Misleading** on imbalanced data: predicting “negative” for every sample yields 95% accuracy when only 5% of the samples are “positive”.
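
The imbalance caveat can be checked with a few lines of arithmetic. The sketch below assumes a hypothetical set of 100 samples with 5 positives and a model that always predicts "negative"; the counts are illustrative and do not come from the datasets above.

```python
def accuracy(tp, tn, fp, fn):
    """Accuracy from confusion-matrix counts: (TP + TN) / all samples."""
    return (tp + tn) / (tp + tn + fp + fn)


# Hypothetical counts: 95 negatives, 5 positives, model predicts "negative" for all
tp, fn = 0, 5    # every positive is missed
tn, fp = 95, 0   # every negative is trivially correct

acc = accuracy(tp, tn, fp, fn)
print(acc)  # 0.95, even though no positive sample was detected
```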

---

##  2. **Precision**

**Equation**:

$$
\text{Precision} = \frac{TP}{TP + FP}
$$

**Explanation**:

* Of all samples predicted **positive**, measures how many are actually positive.
* Useful when **false positives are costly**, for example in spam detection (an important email should not be wrongly flagged as spam).
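
As a small worked example, consider a hypothetical spam filter that flags 20 emails, of which 15 really are spam. The helper below is a sketch, not the scikit-learn implementation; returning 0.0 when nothing is predicted positive is a choice made here to mirror the `zero_division=0` argument used in the notebook.

```python
def precision(tp, fp):
    """TP / (TP + FP); returns 0.0 when no sample was predicted positive."""
    predicted_positive = tp + fp
    return tp / predicted_positive if predicted_positive else 0.0


# Hypothetical counts: 20 emails flagged as spam, 15 of them truly spam
spam_precision = precision(tp=15, fp=5)
print(spam_precision)  # 0.75, so one in four flagged emails was legitimate
```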

---

##  3. **Recall (Sensitivity)**

**Equation**:

$$
\text{Recall} = \frac{TP}{TP + FN}
$$

**Explanation**:

* Of all samples that are actually **positive**, measures how many the model correctly identifies.
* Useful when **false negatives are costly**, for example in cancer detection (over-diagnosing is preferable to missing a case).
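
Recall can also be computed directly from label lists by counting true positives and false negatives. The following sketch uses made-up binary labels, and `recall_from_labels` is a hypothetical helper name rather than a library function.

```python
def recall_from_labels(y_true, y_pred, positive=1):
    """TP / (TP + FN) for the chosen positive label; 0.0 if there are no positives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    return tp / (tp + fn) if (tp + fn) else 0.0


# Hypothetical labels: 4 actual positives, the model catches 3 of them
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

rec = recall_from_labels(y_true, y_pred)
print(rec)  # 0.75
```

The single false positive in `y_pred` does not affect recall; it would lower precision instead.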

---

##  4. **F1 Score**

**Equation**:

$$
\text{F1} = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
$$

**Explanation**:

* The **harmonic mean** of precision and recall.
* Suitable for **imbalanced datasets** as a compromise between precision and recall.
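
The harmonic mean is dominated by the smaller of its two inputs, which is why F1 penalizes a model that scores well on only one of precision and recall. A minimal sketch with illustrative values (not taken from the results above):

```python
def f1_score_from(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Illustrative values: perfect precision but very low recall
p, r = 1.0, 0.1

arithmetic_mean = (p + r) / 2
f1 = f1_score_from(p, r)
print(round(arithmetic_mean, 4))  # 0.55
print(round(f1, 4))               # 0.1818
```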

---

## 5. **AUC (Area Under the Curve)**

**Explanation**:

* Measures the **area under the ROC curve** (Receiver Operating Characteristic).
* The ROC curve plots **TPR (Recall)** against **FPR (False Positive Rate)**:

  $$
  \text{FPR} = \frac{FP}{FP + TN}
  $$
* AUC = 1: perfect prediction.
* AUC = 0.5: the model is no better than random guessing.
* AUC < 0.5: the model is worse than random guessing.
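
An equivalent way to read binary AUC is as the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, with ties counted as half. The sketch below computes it by comparing every positive/negative pair; the scores are made up, and `auc_pairwise` is a hypothetical helper rather than the routine `roc_auc_score` uses internally.

```python
def auc_pairwise(y_true, scores):
    """Binary AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))


# Made-up scores for four samples
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

print(auc_pairwise(y_true, scores))  # 0.75
```

The pairwise loop is quadratic in the number of samples, so it is meant for building intuition on small inputs rather than for evaluating real datasets.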

---

## 6. **ROC Curve**

* A plot with:

  * **x-axis**: False Positive Rate (FPR)
  * **y-axis**: True Positive Rate (Recall)

* The curve of a good model **approaches the top-left corner** of the ROC plot.
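
To make the construction concrete, the sketch below builds ROC points by sweeping a decision threshold over the distinct scores and then integrates the curve with the trapezoidal rule. The data and helper names are hypothetical, and the code assumes binary labels with at least one sample of each class.

```python
def roc_points(y_true, scores):
    """(FPR, TPR) points obtained by lowering the threshold over distinct scores."""
    n_pos = sum(1 for t in y_true if t == 1)
    n_neg = len(y_true) - n_pos
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for t, s in zip(y_true, scores) if s >= thr and t == 1)
        fp = sum(1 for t, s in zip(y_true, scores) if s >= thr and t == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points


def trapezoid_area(points):
    """Area under a piecewise-linear curve given as (x, y) points sorted by x."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Made-up scores for four samples
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

points = roc_points(y_true, scores)
print(points)                  # [(0.0, 0.0), (0.0, 0.5), (0.5, 0.5), (0.5, 1.0), (1.0, 1.0)]
print(trapezoid_area(points))  # 0.75
```

A curve that hugs the top-left corner reaches a high TPR while the FPR is still low, which is what pushes the area toward 1.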