# Week 9: Machine Learning for Big Data

**Course:** Big Data Analytics  
**Topic:** Supervised & Unsupervised Machine Learning  

---

### Description
This lab applies Machine Learning algorithms with scikit-learn, covering:
- Linear Regression
- Classification (Decision Tree, Random Forest, Logistic Regression)
- Clustering (K-Means)
- Model evaluation and comparison

In [None]:
# ============================================================
# Import Libraries
# ============================================================
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib
import seaborn as sns

# Sklearn - Datasets
from sklearn.datasets import load_iris, make_regression, make_classification

# Sklearn - Preprocessing & Model Selection
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold
from sklearn.preprocessing import StandardScaler, label_binarize
from sklearn.metrics import (
    mean_squared_error, r2_score,
    accuracy_score, classification_report, confusion_matrix,
    roc_curve, auc, ConfusionMatrixDisplay
)

# Sklearn - Algorithms
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.ensemble import RandomForestClassifier
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Plot settings
plt.style.use('seaborn-v0_8-whitegrid')
sns.set_palette('husl')
np.random.seed(42)

print('Libraries loaded successfully!')
print(f'NumPy: {np.__version__}')
print(f'Pandas: {pd.__version__}')
print(f'Matplotlib: {matplotlib.__version__}')
print(f'Seaborn: {sns.__version__}')

## 1. Dataset: Iris & Synthetic Data

We will use two datasets:
1. **Iris dataset** – a classic classification dataset (150 samples, 4 features, 3 classes)
2. **Synthetic Regression Data** – synthetic data for demonstrating regression

In [None]:
# ============================================================
# Load Iris Dataset
# ============================================================
iris = load_iris()
X_iris = pd.DataFrame(iris.data, columns=iris.feature_names)
y_iris = pd.Series(iris.target, name='species')
species_names = iris.target_names

print('=== Iris Dataset ===')
print(f'Shape: {X_iris.shape}')
print(f'Features: {list(X_iris.columns)}')
print(f'Classes: {list(species_names)}')
print(f'Class distribution:\n{y_iris.value_counts().rename(dict(enumerate(species_names)))}')
print()
print(X_iris.head())

# ============================================================
# Generate Synthetic Regression Data
# ============================================================
X_reg, y_reg = make_regression(
    n_samples=300, n_features=1, noise=20, random_state=42
)
print('\n=== Synthetic Regression Data ===')
print(f'X shape: {X_reg.shape}, y shape: {y_reg.shape}')

# Visualize datasets
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Regression data scatter
axes[0].scatter(X_reg, y_reg, alpha=0.6, color='steelblue', edgecolors='white', s=60)
axes[0].set_title('Synthetic Regression Data', fontsize=13, fontweight='bold')
axes[0].set_xlabel('Feature X')
axes[0].set_ylabel('Target y')

# Iris scatter plot (sepal features only)
colors = ['#e41a1c', '#377eb8', '#4daf4a']
for i, (cls, color) in enumerate(zip(species_names, colors)):
    mask = y_iris == i
    axes[1].scatter(
        X_iris.loc[mask, 'sepal length (cm)'],
        X_iris.loc[mask, 'sepal width (cm)'],
        label=cls, alpha=0.7, color=color, s=60
    )
axes[1].set_title('Iris Dataset (Sepal Features)', fontsize=13, fontweight='bold')
axes[1].set_xlabel('Sepal Length (cm)')
axes[1].set_ylabel('Sepal Width (cm)')
axes[1].legend()

plt.tight_layout()
plt.show()

## 2. Linear Regression

**Linear Regression** models the relationship between the input feature and a continuous target as a linear function:  
`ŷ = β₀ + β₁x`

The model is trained by minimizing the **Mean Squared Error (MSE)** on the training data.
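
As a minimal sketch of what minimizing the MSE means for a single feature, the least-squares estimates of β₀ and β₁ can be computed by hand. The helper names `fit_simple_ols` and `mse` and the toy data are illustrative assumptions, not part of the lab code; `LinearRegression` solves the same least-squares problem with a more general solver.

```python
# Illustrative sketch (not part of the lab code): simple least squares by hand.

def fit_simple_ols(xs, ys):
    """Return (beta0, beta1) minimizing the MSE of y ≈ beta0 + beta1 * x."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Slope = sum of cross-deviations / sum of squared x-deviations
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    beta1 = sxy / sxx
    # The fitted line always passes through (x_mean, y_mean)
    beta0 = y_mean - beta1 * x_mean
    return beta0, beta1


def mse(xs, ys, beta0, beta1):
    """Mean squared error of the line beta0 + beta1 * x on the given points."""
    return sum((y - (beta0 + beta1 * x)) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Hypothetical toy data lying exactly on y = 1 + 2x
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
beta0, beta1 = fit_simple_ols(xs, ys)
print(f'beta0={beta0:.2f}, beta1={beta1:.2f}, MSE={mse(xs, ys, beta0, beta1):.2f}')
```

Any other line, for example `mse(xs, ys, 0.0, 2.0)`, gives a larger error on these points, which is the sense in which the fitted coefficients are the minimizers.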

In [None]:
# ============================================================
# Linear Regression
# ============================================================

# Train-test split
X_train_r, X_test_r, y_train_r, y_test_r = train_test_split(
    X_reg, y_reg, test_size=0.2, random_state=42
)
print(f'Training samples: {len(X_train_r)}, Test samples: {len(X_test_r)}')

# Fit model
lr_model = LinearRegression()
lr_model.fit(X_train_r, y_train_r)

# Predict
y_pred_train = lr_model.predict(X_train_r)
y_pred_test  = lr_model.predict(X_test_r)

# Metrics
train_mse  = mean_squared_error(y_train_r, y_pred_train)
test_mse   = mean_squared_error(y_test_r, y_pred_test)
train_rmse = np.sqrt(train_mse)
test_rmse  = np.sqrt(test_mse)
train_r2   = r2_score(y_train_r, y_pred_train)
test_r2    = r2_score(y_test_r, y_pred_test)

print('\n=== Linear Regression Results ===')
print(f'Intercept (β₀): {lr_model.intercept_:.4f}')
print(f'Coefficient (β₁): {lr_model.coef_[0]:.4f}')
print(f'\n{"Metric":<12} {"Train":>10} {"Test":>10}')
print('-' * 34)
print(f'{"MSE":<12} {train_mse:>10.2f} {test_mse:>10.2f}')
print(f'{"RMSE":<12} {train_rmse:>10.2f} {test_rmse:>10.2f}')
print(f'{"R²":<12} {train_r2:>10.4f} {test_r2:>10.4f}')

# Visualization
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Scatter + regression line
x_line = np.linspace(X_reg.min(), X_reg.max(), 200).reshape(-1, 1)
y_line = lr_model.predict(x_line)

axes[0].scatter(X_train_r, y_train_r, alpha=0.5, label='Train', color='steelblue', s=40)
axes[0].scatter(X_test_r,  y_test_r,  alpha=0.7, label='Test',  color='orange',   s=60, marker='D')
axes[0].plot(x_line, y_line, color='red', linewidth=2.5, label='Regression Line')
axes[0].set_title('Linear Regression Fit', fontsize=13, fontweight='bold')
axes[0].set_xlabel('Feature X')
axes[0].set_ylabel('Target y')
axes[0].legend()

# Residuals plot
residuals = y_test_r - y_pred_test
axes[1].scatter(y_pred_test, residuals, alpha=0.7, color='purple', s=50)
axes[1].axhline(y=0, color='red', linewidth=2, linestyle='--')
axes[1].set_title('Residuals Plot (Test Set)', fontsize=13, fontweight='bold')
axes[1].set_xlabel('Predicted Values')
axes[1].set_ylabel('Residuals')

plt.tight_layout()
plt.show()

## 3. Classification with Decision Tree

**Decision Tree** recursively splits the data on the most informative feature at each node.  
It is easy to interpret: the entire decision logic can be visualized.
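
One common way to score how informative a split is uses Gini impurity, which is the default criterion of `DecisionTreeClassifier`. The sketch below is an illustration only: the helpers `gini` and `split_gain` and the six toy labels are assumptions made for this example, not scikit-learn internals.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def split_gain(parent, left, right):
    """Impurity decrease obtained by splitting `parent` into `left` and `right`."""
    n = len(parent)
    weighted = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(parent) - weighted


# Hypothetical node with two classes, three samples each
parent = ['setosa'] * 3 + ['versicolor'] * 3

# A split that separates the classes perfectly
good = split_gain(parent, ['setosa'] * 3, ['versicolor'] * 3)

# A split that leaves both children mixed
poor = split_gain(
    parent,
    ['setosa', 'versicolor', 'setosa'],
    ['versicolor', 'setosa', 'versicolor'],
)

print(f'gain of clean split: {good:.4f}')
print(f'gain of mixed split: {poor:.4f}')
```

A tree builder evaluates candidate thresholds on every feature and keeps the split with the largest impurity decrease, then repeats the procedure inside each child node.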

In [None]:
# ============================================================
# Decision Tree Classifier
# ============================================================

# Split data
X_train_c, X_test_c, y_train_c, y_test_c = train_test_split(
    X_iris.values, y_iris.values, test_size=0.2, random_state=42, stratify=y_iris.values
)

# Train Decision Tree
dt_clf = DecisionTreeClassifier(max_depth=4, random_state=42)
dt_clf.fit(X_train_c, y_train_c)
y_pred_dt = dt_clf.predict(X_test_c)

# Metrics
print('=== Decision Tree Classifier ===')
print(f'Accuracy: {accuracy_score(y_test_c, y_pred_dt):.4f}')
print('\nClassification Report:')
print(classification_report(y_test_c, y_pred_dt, target_names=species_names))

# Visualizations
fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# Confusion Matrix
cm = confusion_matrix(y_test_c, y_pred_dt)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=species_names)
disp.plot(ax=axes[0], colorbar=False, cmap='Blues')
axes[0].set_title('Confusion Matrix – Decision Tree', fontsize=12, fontweight='bold')

# Decision Tree visualization
plot_tree(
    dt_clf,
    feature_names=iris.feature_names,
    class_names=species_names,
    filled=True,
    rounded=True,
    ax=axes[1],
    fontsize=8
)
axes[1].set_title('Decision Tree Structure (max_depth=4)', fontsize=12, fontweight='bold')

plt.tight_layout()
plt.show()

## 4. Classification with Random Forest

**Random Forest** is an ensemble of many decision trees, each trained on a bootstrap sample of the training data.  
A key advantage is **Feature Importance**, which shows how much each feature contributes to the model.
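
The bootstrap step can be sketched with the standard library alone. The helpers `bootstrap_indices` and `majority_vote` below are hypothetical names used for illustration. Note that scikit-learn's `RandomForestClassifier` combines trees by averaging their predicted class probabilities rather than by the hard vote shown here.

```python
import random


def bootstrap_indices(n_samples, rng):
    """Draw n_samples row indices with replacement (one bootstrap sample)."""
    return [rng.randrange(n_samples) for _ in range(n_samples)]


def majority_vote(predictions):
    """Return the most frequent label among per-tree predictions (ties broken arbitrarily)."""
    return max(set(predictions), key=predictions.count)


rng = random.Random(42)
n_samples = 1000
idx = bootstrap_indices(n_samples, rng)

# Rows never drawn are "out-of-bag" for this tree; for large n the
# expected share approaches 1/e, about 0.368
oob_fraction = 1 - len(set(idx)) / n_samples
print(f'Out-of-bag fraction: {oob_fraction:.3f}')

# Hypothetical predictions from three trees for a single sample
vote = majority_vote(['versicolor', 'virginica', 'versicolor'])
print(f'Ensemble prediction: {vote}')
```

Because every tree sees a different resample, the trees make partly independent errors, and combining them tends to reduce variance compared with a single tree.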

In [None]:
# ============================================================
# Random Forest Classifier
# ============================================================

rf_clf = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42)
rf_clf.fit(X_train_c, y_train_c)
y_pred_rf = rf_clf.predict(X_test_c)

print('=== Random Forest Classifier ===')
print(f'Accuracy: {accuracy_score(y_test_c, y_pred_rf):.4f}')
print('\nClassification Report:')
print(classification_report(y_test_c, y_pred_rf, target_names=species_names))

# Comparison: DT vs RF
print('\n=== Comparison: Decision Tree vs Random Forest ===')
print(f'{"":<20} {"Decision Tree":>15} {"Random Forest":>15}')
print('-' * 52)
print(f'{"Accuracy":<20} {accuracy_score(y_test_c, y_pred_dt):>15.4f} {accuracy_score(y_test_c, y_pred_rf):>15.4f}')

# Visualizations
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Feature importance
importances = rf_clf.feature_importances_
feature_names = [name.replace(' (cm)', '') for name in iris.feature_names]
sorted_idx = np.argsort(importances)

colors_imp = ['#2196F3' if i != sorted_idx[-1] else '#FF5722' for i in range(len(importances))]
bars = axes[0].barh(
    [feature_names[i] for i in sorted_idx],
    importances[sorted_idx],
    color=[colors_imp[i] for i in sorted_idx]
)
axes[0].set_title('Random Forest – Feature Importance', fontsize=12, fontweight='bold')
axes[0].set_xlabel('Importance Score')
for bar, val in zip(bars, importances[sorted_idx]):
    axes[0].text(bar.get_width() + 0.005, bar.get_y() + bar.get_height()/2,
                 f'{val:.3f}', va='center', fontsize=10)

# DT vs RF accuracy comparison bar chart
models = ['Decision Tree', 'Random Forest']
accuracies = [
    accuracy_score(y_test_c, y_pred_dt),
    accuracy_score(y_test_c, y_pred_rf)
]
bar_colors = ['#FF9800', '#4CAF50']
axes[1].bar(models, accuracies, color=bar_colors, width=0.5, edgecolor='white', linewidth=1.5)
axes[1].set_ylim(0.8, 1.05)
axes[1].set_title('Accuracy: DT vs Random Forest', fontsize=12, fontweight='bold')
axes[1].set_ylabel('Accuracy')
for i, acc in enumerate(accuracies):
    axes[1].text(i, acc + 0.005, f'{acc:.4f}', ha='center', fontweight='bold', fontsize=11)

plt.tight_layout()
plt.show()

## 5. Logistic Regression

**Logistic Regression** uses the sigmoid function to produce class probabilities.  
Here we set up a binary problem: **setosa vs. non-setosa**.  
We evaluate it with the **ROC Curve** and **AUC Score**.
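
The sigmoid maps a linear score `z = w·x + b` to a probability between 0 and 1. A minimal sketch follows; `predict_label` is a hypothetical helper and the scores are made-up values, not outputs of the model trained in this lab.

```python
import math


def sigmoid(z):
    """Map a real-valued score z to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_label(z, threshold=0.5):
    """Predict class 1 when the probability reaches the threshold, else class 0."""
    return int(sigmoid(z) >= threshold)


# Made-up linear scores: strongly negative, neutral, strongly positive
for z in (-4.0, 0.0, 4.0):
    print(f'z={z:+.1f}  p(class 1)={sigmoid(z):.4f}  label={predict_label(z)}')
```

Moving `threshold` away from 0.5 trades false positives against true positives; sweeping it across all values is what traces out the ROC curve.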

In [None]:
# ============================================================
# Logistic Regression – Binary: setosa vs rest
# ============================================================

# Create binary target: setosa (0) vs non-setosa (1)
y_binary = (y_iris.values != 0).astype(int)  # 0=setosa, 1=non-setosa

X_train_b, X_test_b, y_train_b, y_test_b = train_test_split(
    X_iris.values, y_binary, test_size=0.2, random_state=42, stratify=y_binary
)

# Scale features
scaler = StandardScaler()
X_train_b_sc = scaler.fit_transform(X_train_b)
X_test_b_sc  = scaler.transform(X_test_b)

# Train Logistic Regression
log_reg = LogisticRegression(random_state=42, max_iter=1000)
log_reg.fit(X_train_b_sc, y_train_b)

y_pred_lr   = log_reg.predict(X_test_b_sc)
y_proba_lr  = log_reg.predict_proba(X_test_b_sc)[:, 1]

print('=== Logistic Regression (Binary Classification) ===')
print(f'Accuracy: {accuracy_score(y_test_b, y_pred_lr):.4f}')
print('\nClassification Report:')
print(classification_report(y_test_b, y_pred_lr, target_names=['setosa', 'non-setosa']))

# ROC Curve
fpr, tpr, thresholds = roc_curve(y_test_b, y_proba_lr)
roc_auc = auc(fpr, tpr)
print(f'ROC-AUC Score: {roc_auc:.4f}')

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# ROC Curve
axes[0].plot(fpr, tpr, color='darkorange', lw=2.5, label=f'ROC Curve (AUC = {roc_auc:.4f})')
axes[0].plot([0, 1], [0, 1], color='navy', lw=1.5, linestyle='--', label='Random Classifier')
axes[0].fill_between(fpr, tpr, alpha=0.1, color='darkorange')
axes[0].set_xlim([0.0, 1.0])
axes[0].set_ylim([0.0, 1.05])
axes[0].set_xlabel('False Positive Rate', fontsize=11)
axes[0].set_ylabel('True Positive Rate', fontsize=11)
axes[0].set_title('ROC Curve – Logistic Regression', fontsize=12, fontweight='bold')
axes[0].legend(loc='lower right', fontsize=10)

# Probability distribution
setosa_proba    = y_proba_lr[y_test_b == 0]
nonsetosa_proba = y_proba_lr[y_test_b == 1]
axes[1].hist(setosa_proba,    bins=10, alpha=0.7, color='steelblue', label='Setosa (actual)')
axes[1].hist(nonsetosa_proba, bins=10, alpha=0.7, color='tomato',    label='Non-setosa (actual)')
axes[1].axvline(x=0.5, color='black', linestyle='--', label='Threshold = 0.5')
axes[1].set_title('Predicted Probability Distribution', fontsize=12, fontweight='bold')
axes[1].set_xlabel('Predicted Probability (non-setosa)')
axes[1].set_ylabel('Count')
axes[1].legend()

plt.tight_layout()
plt.show()

## 6. Clustering with K-Means

**K-Means** groups the data into K clusters by assigning each point to its nearest centroid.  
We use the **Elbow Method** and the **Silhouette Score** to choose K.
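
One K-Means iteration consists of an assignment step and an update step. The sketch below runs a single iteration on four made-up 2-D points. The function names and starting centroids are assumptions for illustration; `KMeans` in scikit-learn adds smarter initialization and repeats the loop until convergence.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def assign(points, centroids):
    """Assignment step: index of the nearest centroid for every point."""
    return [
        min(range(len(centroids)), key=lambda c: squared_distance(p, centroids[c]))
        for p in points
    ]


def update(points, labels, centroids):
    """Update step: move each centroid to the mean of its points (kept if it has none)."""
    new_centroids = []
    for c, old in enumerate(centroids):
        members = [p for p, lab in zip(points, labels) if lab == c]
        if not members:
            new_centroids.append(old)
            continue
        new_centroids.append(tuple(sum(dim) / len(members) for dim in zip(*members)))
    return new_centroids


def inertia(points, labels, centroids):
    """Within-cluster sum of squared distances (WCSS)."""
    return sum(squared_distance(p, centroids[lab]) for p, lab in zip(points, labels))


# Hypothetical data: two obvious groups and two rough starting centroids
points = [(1.0, 1.0), (1.0, 2.0), (8.0, 8.0), (9.0, 8.0)]
centroids = [(0.0, 0.0), (10.0, 10.0)]

labels = assign(points, centroids)
centroids = update(points, labels, centroids)
print('labels:', labels)
print('centroids:', centroids)
print('inertia:', inertia(points, labels, centroids))
```

The inertia computed here is the same quantity plotted by the Elbow Method: it always decreases as K grows, so the point where the decrease flattens is of interest rather than the minimum.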

In [None]:
# ============================================================
# K-Means Clustering
# ============================================================

X_cluster = X_iris.values

# Elbow Method & Silhouette Score
inertias      = []
silhouette_scores = []
K_range = range(2, 9)

for k in K_range:
    km = KMeans(n_clusters=k, random_state=42, n_init=10)
    labels = km.fit_predict(X_cluster)
    inertias.append(km.inertia_)
    silhouette_scores.append(silhouette_score(X_cluster, labels))

print('=== K-Means: Elbow & Silhouette Analysis ===')
print(f'  {"K":>3}  {"Inertia":>12}  {"Silhouette":>12}')
print('  ' + '-' * 30)
for k, inertia, sil in zip(K_range, inertias, silhouette_scores):
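    # K=3 is chosen from the elbow and the three known species;
    # the silhouette score alone typically peaks at K=2 on Iris
    marker = ' ← chosen' if k == 3 else ''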
    print(f'  {k:>3}  {inertia:>12.2f}  {sil:>12.4f}{marker}')

# Final K-Means with K=3
best_k = 3
km_final = KMeans(n_clusters=best_k, random_state=42, n_init=10)
cluster_labels = km_final.fit_predict(X_cluster)
final_silhouette = silhouette_score(X_cluster, cluster_labels)

print(f'\nFinal K-Means (K={best_k})')
print(f'  Inertia (WCSS): {km_final.inertia_:.2f}')
print(f'  Silhouette Score: {final_silhouette:.4f}')

# Visualizations
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

# Elbow plot
axes[0].plot(list(K_range), inertias, 'bo-', linewidth=2.5, markersize=8)
axes[0].axvline(x=best_k, color='red', linestyle='--', alpha=0.7, label=f'K={best_k} (elbow)')
axes[0].set_title('Elbow Method', fontsize=12, fontweight='bold')
axes[0].set_xlabel('Number of Clusters K')
axes[0].set_ylabel('Inertia (WCSS)')
axes[0].legend()

# Silhouette score plot
axes[1].plot(list(K_range), silhouette_scores, 'rs-', linewidth=2.5, markersize=8)
axes[1].axvline(x=best_k, color='blue', linestyle='--', alpha=0.7, label=f'K={best_k}')
axes[1].set_title('Silhouette Score vs K', fontsize=12, fontweight='bold')
axes[1].set_xlabel('Number of Clusters K')
axes[1].set_ylabel('Silhouette Score')
axes[1].legend()

# Cluster visualization
cluster_colors = ['#e41a1c', '#377eb8', '#4daf4a']
for c in range(best_k):
    mask = cluster_labels == c
    axes[2].scatter(
        X_cluster[mask, 0], X_cluster[mask, 1],
        label=f'Cluster {c+1}', alpha=0.7,
        color=cluster_colors[c], s=60
    )
# Plot centroids
centroids = km_final.cluster_centers_
axes[2].scatter(
    centroids[:, 0], centroids[:, 1],
    marker='X', s=200, color='black',
    zorder=10, label='Centroids'
)
axes[2].set_title(f'K-Means Clusters (K={best_k})', fontsize=12, fontweight='bold')
axes[2].set_xlabel('Sepal Length (cm)')
axes[2].set_ylabel('Sepal Width (cm)')
axes[2].legend()

plt.tight_layout()
plt.show()

## 7. Model Evaluation & Comparison

We compare all classifiers using **5-fold Cross-Validation** for a more reliable evaluation.
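
To make the fold construction concrete, here is a simplified sketch of stratified splitting. The `stratified_folds` helper is hypothetical and deals indices out round-robin per class without shuffling, whereas `StratifiedKFold(shuffle=True)` also randomizes the order within each class.

```python
from collections import defaultdict


def stratified_folds(labels, n_splits):
    """Assign each sample index to a fold so class proportions stay similar across folds."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(n_splits)]
    for indices in by_class.values():
        # Deal the indices of each class round-robin over the folds
        for position, idx in enumerate(indices):
            folds[position % n_splits].append(idx)
    return folds


# Same class layout as Iris: 3 classes with 50 samples each
labels = [0] * 50 + [1] * 50 + [2] * 50
folds = stratified_folds(labels, n_splits=5)

for k, test_idx in enumerate(folds):
    counts = [sum(labels[i] == c for i in test_idx) for c in (0, 1, 2)]
    print(f'fold {k}: test size={len(test_idx)}, per-class counts={counts}')
```

Each fold serves once as the test set while the remaining four form the training set, so every sample is used for testing exactly once and the reported mean accuracy averages over five train/test splits.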

In [None]:
# ============================================================
# Cross-Validation Comparison
# ============================================================

classifiers = {
    'Decision Tree (d=4)': DecisionTreeClassifier(max_depth=4, random_state=42),
    'Random Forest':       RandomForestClassifier(n_estimators=100, random_state=42),
    'Logistic Regression': LogisticRegression(max_iter=1000, random_state=42),
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
results = {}

print('=== 5-Fold Cross-Validation Results ===')
print(f'{"Model":<25} {"Mean Acc":>10} {"Std":>8} {"Min":>8} {"Max":>8}')
print('-' * 62)

for name, clf in classifiers.items():
    scores = cross_val_score(clf, X_iris.values, y_iris.values, cv=cv, scoring='accuracy')
    results[name] = scores
    print(f'{name:<25} {scores.mean():>10.4f} {scores.std():>8.4f} {scores.min():>8.4f} {scores.max():>8.4f}')

# Visualization
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Box plot of CV scores
names  = list(results.keys())
scores = list(results.values())
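# Labels are set on the axis (the boxplot `labels` argument is deprecated in newer Matplotlib);
# notches are omitted because each box is built from only 5 fold scores
bp = axes[0].boxplot(scores, patch_artist=True)
axes[0].set_xticklabels([n.split(' (')[0] for n in names])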
box_colors = ['#FF9800', '#4CAF50', '#2196F3']
for patch, color in zip(bp['boxes'], box_colors):
    patch.set_facecolor(color)
    patch.set_alpha(0.7)
axes[0].set_title('Cross-Validation Score Distribution', fontsize=12, fontweight='bold')
axes[0].set_ylabel('Accuracy')
axes[0].set_ylim(0.85, 1.02)
axes[0].tick_params(axis='x', rotation=15)

# Bar chart of mean accuracy
means = [s.mean() for s in scores]
stds  = [s.std() for s in scores]
short_names = [n.split(' (')[0] for n in names]
bars = axes[1].bar(short_names, means, color=box_colors, yerr=stds,
                   capsize=6, edgecolor='white', linewidth=1.5, alpha=0.85)
axes[1].set_ylim(0.8, 1.05)
axes[1].set_title('Mean Accuracy Comparison (5-fold CV)', fontsize=12, fontweight='bold')
axes[1].set_ylabel('Mean Accuracy')
axes[1].tick_params(axis='x', rotation=15)
for bar, mean in zip(bars, means):
    axes[1].text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.005,
                 f'{mean:.4f}', ha='center', fontweight='bold', fontsize=10)

plt.tight_layout()
plt.show()

# Summary DataFrame
summary_df = pd.DataFrame({
    'Model': names,
    'Mean CV Accuracy': [s.mean() for s in scores],
    'Std': [s.std() for s in scores],
    'Min': [s.min() for s in scores],
    'Max': [s.max() for s in scores]
}).set_index('Model').round(4)

print('\n=== Summary Table ===')
print(summary_df.to_string())

## Lab Assignments

Complete the following tasks and submit your results together with a brief analysis:

---

**Task 1 – Advanced Regression**  
Load the California Housing dataset with `fetch_california_housing` from `sklearn.datasets`. Train `LinearRegression` and evaluate it with MSE, RMSE, MAE, and R². Compare the results with `Ridge` and `Lasso` regression. Which features have the most influence?

**Task 2 – Multi-class Classification**  
Use the `digits` dataset (handwritten digit recognition, `load_digits` in sklearn). Train `RandomForestClassifier` with several values of `n_estimators` (50, 100, 200, 500). Plot accuracy against the number of estimators. Visualize several digits together with their predictions.

**Task 3 – Comparing Classification Algorithms**  
Load the `breast_cancer` dataset (`load_breast_cancer` in sklearn). Compare `DecisionTreeClassifier`, `RandomForestClassifier`, and `LogisticRegression` using 10-fold cross-validation. Create a confusion matrix heatmap for each model.

**Task 4 – Advanced Clustering**  
Use `make_blobs` to generate synthetic data with 5 clusters and 2 features. Apply K-Means, DBSCAN (`from sklearn.cluster import DBSCAN`), and Agglomerative Clustering. Compare the results using the Silhouette Score and visualize the three results side by side.

**Task 5 – Basic Pipeline**  
Build a `sklearn.pipeline.Pipeline` that chains `StandardScaler` → `PCA(n_components=2)` → `RandomForestClassifier` on the Iris dataset. Evaluate it with 5-fold cross-validation. Compare the results with a Random Forest without preprocessing.