# ML Workshop

This notebook walks through a complete machine learning workflow on the Iris dataset.
- Outline:
    - Utilities: dataset loader, plotting helpers, pipeline builder
    - Model presets: several classifiers and a regression model
    - Worked example: quick exploration, train/test split, scaling, train SVM & LogisticRegression, evaluate

- How to use this notebook:
    - Read the inline comments and the Markdown prompts printed before each major step.
    - Try swapping datasets (e.g., 'wine', 'digits') or models from MODEL_OPTIONS.
    - Use build_classification_pipeline to create pipelines that impute, scale, and optionally apply PCA before the classifier.
    - Exercises: hyperparameter tuning with GridSearchCV, cross-validation, plotting learning curves.

# Importing packages

In [None]:
# Imports for the utilities and the worked example (Iris) in this notebook.
# Students: read the short Markdown prompts before each step to understand it,
# then modify the dataset, models, or parameters as exercises.

from IPython.display import Markdown, display
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings

from sklearn import datasets
from sklearn.model_selection import train_test_split, learning_curve
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA

from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

from sklearn.metrics import (
    accuracy_score,
    classification_report,
    confusion_matrix,
)

In [None]:
# Set up environment / display preferences
# You do not need to modify or study this cell
warnings.filterwarnings("ignore")
sns.set(style="whitegrid")
pd.set_option('display.max_columns', 50)
RANDOM_STATE = 42


# Helper functions

These functions are already implemented. \
You will use them later in the notebook.

In [None]:
# Helper: load popular small datasets by name
def load_dataset(name="iris"):
    """
    Return (X, y, feature_names, target_names, dataset_object)
    Supported names: 'iris' (default), 'wine', 'digits', 'breast_cancer', 'diabetes'
    Note: 'diabetes' is a regression dataset.
    Use this function when you want to quickly swap datasets for experiments.
    """
    name = name.lower()
    if name == "iris":
        ds = datasets.load_iris()
    elif name == "wine":
        ds = datasets.load_wine()
    elif name in ("digits", "digit"):
        ds = datasets.load_digits()
    elif name in ("breast_cancer", "breastcancer", "cancer"):
        ds = datasets.load_breast_cancer()
    elif name in ("diabetes",):
        ds = datasets.load_diabetes()
    else:
        raise ValueError(f"Unknown dataset: {name}")

    X = ds.data
    y = ds.target
    # Fall back to generated names if the dataset object lacks feature_names or target_names
    feature_names = getattr(ds, "feature_names", [f"f{i}" for i in range(X.shape[1])])
    target_names = getattr(ds, "target_names", np.unique(y).astype(str))
    return X, y, feature_names, target_names, ds

# Helper: quick pairplot (uses first 4 features by default to keep it readable)
def quick_pairplot(X, y, feature_names, target_names=None, max_features=4):
    """
    Quick visual diagnostic: pairwise scatter plots for up to `max_features`.
    - X: feature array (n_samples, n_features)
    - y: labels (n_samples,)
    - feature_names: list of feature names (length n_features)
    - target_names: optional list of class names for coloring
    Use this when you want to inspect feature separation or spot outliers.
    """
    df = pd.DataFrame(X[:, :max_features], columns=feature_names[:max_features])
    df['target'] = y
    if target_names is not None:
        # map numeric targets to their names for nicer legends
        df['target_name'] = df['target'].map(lambda t: target_names[t] if t < len(target_names) else str(t))
        sns.pairplot(df, hue='target_name', corner=True)
    else:
        sns.pairplot(df, hue='target', corner=True)
    plt.suptitle("Pairplot (first {} features)".format(min(max_features, X.shape[1])), y=1.02)
    plt.show()

# Helper: plot confusion matrix nicely
def plot_confusion(cm, labels, ax=None, cmap="Blues", title="Confusion Matrix"):
    """
    Nicely formatted confusion matrix using seaborn heatmap.
    - cm: confusion matrix (2D array)
    - labels: display labels for axes
    """
    if ax is None:
        fig, ax = plt.subplots(figsize=(5,4))
    sns.heatmap(cm, annot=True, fmt='d', cmap=cmap, xticklabels=labels, yticklabels=labels, ax=ax)
    ax.set_xlabel("Predicted")
    ax.set_ylabel("True")
    ax.set_title(title)
    plt.tight_layout()

# Helper: plot learning curve
def plot_learning_curve(estimator, X, y, scoring=None, cv=5, train_sizes=np.linspace(0.1, 1.0, 5), title="Learning Curve"):
    """
    Plot a learning curve (training vs cross-validation score) as training data size increases.
    Useful to diagnose high variance (overfitting) vs high bias (underfitting).
    """
    train_sizes, train_scores, test_scores = learning_curve(
        estimator, X, y, cv=cv, train_sizes=train_sizes, scoring=scoring, n_jobs=-1
    )
    train_scores_mean = np.mean(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    plt.figure(figsize=(6,4))
    plt.plot(train_sizes, train_scores_mean, 'o-', label="Training score")
    plt.plot(train_sizes, test_scores_mean, 'o-', label="Cross-validation score")
    plt.xlabel("Training set size")
    plt.ylabel("Score")
    plt.title(title)
    plt.legend()
    plt.grid(True)
    plt.show()

# Helper: generic pipeline builder for classification
def build_classification_pipeline(model, use_pca=False):
    """
    Build a standard pipeline:
    - Impute missing values (mean)
    - Standard Scaler
    - Optional PCA (2 components for quick visualization)
    - Classifier (model)
    Use this function to ensure consistent preprocessing across experiments.
    """
    steps = [
        ("imputer", SimpleImputer(strategy="mean")),
        ("scaler", StandardScaler()),
    ]
    if use_pca:
        steps.append(("pca", PCA(n_components=2, random_state=RANDOM_STATE)))
    steps.append(("clf", model))
    return Pipeline(steps)

# Quick list of model choices for students to try
MODEL_OPTIONS = {
    "svm_linear": SVC(kernel="linear", probability=True, random_state=RANDOM_STATE),
    "svm_rbf": SVC(kernel="rbf", probability=True, random_state=RANDOM_STATE),
    "logistic": LogisticRegression(max_iter=500, random_state=RANDOM_STATE),
    "decision_tree": DecisionTreeClassifier(random_state=RANDOM_STATE),
    "knn": KNeighborsClassifier(),
    "linear_regression": LinearRegression(),  # for regression tasks
}

**Machine Learning workshop helper loaded.**
- Use `load_dataset(name)` to load data. Examples: `'iris'`, `'wine'`, `'digits'`, `'breast_cancer'`, `'diabetes'`.
- Use `quick_pairplot(X, y, feature_names)` to visualize feature relationships (uses seaborn).
- Use `build_classification_pipeline(model, use_pca=False)` to get a safe training pipeline.
- Use `plot_learning_curve(estimator, X, y, cv=5)` to inspect learning behavior.
- `MODEL_OPTIONS` contains several model presets you can try.

Exercises / TODO:
- Try different datasets and compare model performance.
- Tune hyperparameters using `GridSearchCV`.
- Try regression on `'diabetes'` with `linear_regression`.
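
The regression exercise above could be approached along these lines. This is a minimal sketch, not the workshop's reference solution: it loads the data with scikit-learn's `load_diabetes` directly instead of the `load_dataset` helper so that it runs on its own, and the names `X_reg`, `y_reg`, and `reg` are illustrative assumptions.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Diabetes is a regression dataset: the target is a continuous value.
X_reg, y_reg = load_diabetes(return_X_y=True)

# No stratify here: stratification applies to class labels, not continuous targets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_reg, y_reg, test_size=0.2, random_state=42
)

# Scale the features, then fit an ordinary least-squares model.
reg = Pipeline([
    ("scaler", StandardScaler()),
    ("reg", LinearRegression()),
])
reg.fit(X_tr, y_tr)

# Evaluate on the held-out test set with regression metrics.
y_hat = reg.predict(X_te)
mse = mean_squared_error(y_te, y_hat)
r2 = r2_score(y_te, y_hat)
print(f"MSE: {mse:.1f}  R^2: {r2:.3f}")
```

Accuracy and confusion matrices do not apply to regression, which is why this sketch reports mean squared error and R^2 instead.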

# Exercises — Start Here

## Worked Example: Iris Dataset

We load Iris, inspect basic statistics, visualize, preprocess, train two classifiers, and compare results.

In [None]:
# 1. Load a Dataset
# We'll start with the classic Iris dataset.
# Note: swap load_iris() below for load_wine() or load_digits() to try other datasets.

dataset = datasets.load_iris()

In [None]:
X = dataset.data
y = dataset.target
feature_names = dataset.feature_names
target_names = dataset.target_names

In [None]:
# TODO: Print a dataset description: number of samples, features, and classes, plus feature names and target names.
# Write your code here:


### Data Preview

In [None]:
# 2. Explore the Data
# TODO: Create a DataFrame from X and y, print the first 5 rows and the class distribution.
# Fill in the blanks (___) below.
df = pd.DataFrame(___, ___)
df['target'] = ___
print("First 5 rows of the dataset:")
print(df.___())
print('Class distribution')
print(df['___'].___())

### Scatter plot of first two features
This plot shows how the first two features separate the classes. Try other pairs or `quick_pairplot()`.

In [None]:
# 3. Visualize the Data
# TODO: complete the scatter plot code below by filling in the blanks with the loop variables.
# You can also try other pairs of features by changing the column indices in [:, 0] and [:, 1]
plt.figure(figsize=(8,6))
for i, target_name in enumerate(target_names):
    plt.scatter(X[y == ___, 0], X[y == ___, 1], label=___)
plt.xlabel(feature_names[0])
plt.ylabel(feature_names[1])
plt.legend()
plt.title("Iris Dataset - Feature Scatter Plot")
plt.show()

# TODO: use quick pairplot by uncommenting:
# quick_pairplot(X, y, feature_names, target_names, max_features=4)

### Preprocessing: train/test split and scaling

We will stratify the split to preserve class proportions in both sets. Scaling matters for SVM and for other algorithms that are sensitive to feature magnitudes, such as k-nearest neighbors.
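
To illustrate both ideas without giving away the Iris exercise, here is a small sketch on the Wine dataset. It assumes a 20% test split and uses illustrative variable names (`X_demo`, `y_demo`, and so on) that are not part of the notebook.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = load_wine(return_X_y=True)

# stratify=y_demo keeps the class proportions (almost) identical in train and test.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

full_prop = np.bincount(y_demo) / len(y_demo)
test_prop = np.bincount(y_te) / len(y_te)
print("Class proportions (full):", np.round(full_prop, 3))
print("Class proportions (test):", np.round(test_prop, 3))

# Fit the scaler on the training data only, then reuse its statistics on the
# test data, so no information from the test set leaks into preprocessing.
scaler_demo = StandardScaler()
X_tr_scaled = scaler_demo.fit_transform(X_tr)
X_te_scaled = scaler_demo.transform(X_te)

print("Train feature means after scaling:", np.round(X_tr_scaled.mean(axis=0), 6))
```

The scaled test features will generally not have exactly zero mean, because they are standardized with the training set's mean and standard deviation.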

In [None]:
# 4. Prepare the Data
# Split into train and test sets
# TODO: fill in the blanks below.
# Make sure to specify test_size (e.g., 0.2 for 20% test) and to stratify on the labels.
X_train, X_test, y_train, y_test = train_test_split(__, __, test_size=__, stratify=__, random_state=RANDOM_STATE)


In [None]:
# 5. Standardize features
# TODO: fill in the blanks below.
# Fit StandardScaler on the training data only, then transform both train and test.
# Call the methods fit_transform and transform appropriately.
scaler = StandardScaler()
X_train_scaled = scaler.__(__) # call fit_transform on training data
X_test_scaled = scaler.__(__) # call transform on test data


### Train classifiers: SVM (linear) and Logistic Regression

Change hyperparameters or swap models to see how performance changes.
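
One way to compare such changes systematically is to loop over a few candidate models and score each with cross-validation. The sketch below is self-contained and only a starting point: the candidate names and hyperparameter values are arbitrary illustrations, and it wraps each model in its own scaler-plus-classifier pipeline rather than using the notebook's helper.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = load_iris(return_X_y=True)

# Illustrative candidates: two values of C for a linear SVM, plus two other models.
candidates = {
    "svm_linear_C=0.1": SVC(kernel="linear", C=0.1, random_state=42),
    "svm_linear_C=10": SVC(kernel="linear", C=10, random_state=42),
    "knn_k=5": KNeighborsClassifier(n_neighbors=5),
    "decision_tree": DecisionTreeClassifier(random_state=42),
}

cv_means = {}
for name, model in candidates.items():
    # Scaling lives inside the pipeline, so it is refit on each training fold.
    pipe = Pipeline([("scaler", StandardScaler()), ("clf", model)])
    scores = cross_val_score(pipe, X_demo, y_demo, cv=5)
    cv_means[name] = scores.mean()
    print(f"{name:>18}: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Putting the scaler inside the pipeline means each fold is scaled using only its own training portion, which gives a fairer comparison than scaling the whole dataset up front.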

In [None]:
# 6. Train a Model - Support Vector Machine (SVM)
svm_clf = SVC(kernel='linear', random_state=RANDOM_STATE)
svm_clf.fit(X_train_scaled, y_train)


In [None]:
# 7. Train a Model - Logistic Regression
logreg_clf = LogisticRegression(max_iter=200, random_state=RANDOM_STATE)
logreg_clf.fit(X_train_scaled, y_train)

### Evaluation: accuracy and classification reports

In [None]:
# 8. Evaluate the Models
svm_pred = svm_clf.predict(X_test_scaled)
logreg_pred = logreg_clf.predict(X_test_scaled)

print("\nSVM Accuracy:", accuracy_score(y_test, svm_pred))
print("Logistic Regression Accuracy:", accuracy_score(y_test, logreg_pred))

print("\nSVM Classification Report:")
print(classification_report(y_test, svm_pred, target_names=target_names))

print("\nLogistic Regression Classification Report:")
print(classification_report(y_test, logreg_pred, target_names=target_names))

### Confusion Matrices

Inspect which classes get confused with each other: rows are true classes and columns are predicted classes. Use `plot_confusion()` for a nicer heatmap.
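
As a quick reading aid, the toy example below builds a confusion matrix from made-up labels (they are not results from this notebook) and derives accuracy and per-class recall from it.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up labels for three classes (0, 1, 2), for illustration only.
y_true_toy = [0, 0, 1, 1, 2, 2, 2]
y_pred_toy = [0, 1, 1, 1, 2, 2, 0]

cm_toy = confusion_matrix(y_true_toy, y_pred_toy)
print(cm_toy)
# [[1 1 0]
#  [0 2 0]
#  [1 0 2]]

# Diagonal entries are correct predictions; off-diagonal entries are confusions.
accuracy_toy = np.trace(cm_toy) / cm_toy.sum()

# Recall per class: correct predictions divided by the number of true samples (row sum).
recall_toy = cm_toy.diagonal() / cm_toy.sum(axis=1)
print("Accuracy:", round(accuracy_toy, 3))
print("Recall per class:", np.round(recall_toy, 3))
```

Here the entry in row 0, column 1 says that one sample of true class 0 was predicted as class 1.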

In [None]:
# 9. Confusion Matrix Visualization
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
cm_svm = confusion_matrix(y_test, svm_pred)
cm_logreg = confusion_matrix(y_test, logreg_pred)

# Using imshow is fine but plot_confusion gives a nicer output; showing both approaches is educational.
axes[0].imshow(cm_svm, cmap='Blues')
axes[0].set_title('SVM Confusion Matrix')
axes[0].set_xlabel('Predicted')
axes[0].set_ylabel('True')

axes[1].imshow(cm_logreg, cmap='Greens')
axes[1].set_title('Logistic Regression Confusion Matrix')
axes[1].set_xlabel('Predicted')
axes[1].set_ylabel('True')

for ax, cm in zip(axes, [cm_svm, cm_logreg]):
    for i in range(cm.shape[0]):
        for j in range(cm.shape[1]):
            ax.text(j, i, cm[i, j], ha='center', va='center', color='red')

plt.tight_layout()
plt.show()

# Also demonstrate the helper heatmap for one of them:
display(Markdown("Helper heatmap (SVM):"))
plot_confusion(cm_svm, target_names, title="SVM Confusion Matrix (heatmap)")



### Summary comparison



In [None]:
# 10. Compare Models
print("SVM vs Logistic Regression:")
print(f"SVM Accuracy: {accuracy_score(y_test, svm_pred):.2f}")
print(f"Logistic Regression Accuracy: {accuracy_score(y_test, logreg_pred):.2f}")


## Next Steps / Exercises - On Your Own

- Try `load_dataset('wine')` or `load_dataset('digits')` and repeat the workflow.
- Use `build_classification_pipeline(MODEL_OPTIONS['svm_rbf'], use_pca=True)` and evaluate it with cross-validation (e.g. `cross_val_score`).
- Perform grid search over hyperparameters (e.g. `C` and `kernel` for SVC) using `GridSearchCV`.
- Plot learning curves with `plot_learning_curve()` to diagnose bias/variance.
- For regression: load `'diabetes'` and try `linear_regression` with mean squared error and R^2 metrics.
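
As a starting point for the grid search exercise, the sketch below tunes `C` and `kernel` for an SVC inside a pipeline. It is one possible setup, not the workshop's reference solution: the grid values, the 5-fold setting, and the step names `scaler` and `clf` are assumptions chosen for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X_gs, y_gs = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_gs, y_gs, test_size=0.2, stratify=y_gs, random_state=42
)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", SVC(random_state=42)),
])

# Pipeline parameters are addressed as <step name>__<parameter name>.
param_grid = {
    "clf__C": [0.1, 1, 10],
    "clf__kernel": ["linear", "rbf"],
}

# Every combination is scored with 5-fold cross-validation on the training set.
search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X_tr, y_tr)

print("Best parameters:", search.best_params_)
print(f"Best CV accuracy: {search.best_score_:.3f}")

# The best pipeline is refit on all training data; score it once on the test set.
test_acc = search.score(X_te, y_te)
print(f"Test accuracy: {test_acc:.3f}")
```

Keeping the test set out of the search and scoring it only once at the end avoids an optimistic estimate from tuning on the same data used for evaluation.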