# 02_classical_ml_system_design.ipynb

## Week 2: Classical Machine Learning & Intro to System Design Fundamentals

### Notebook Overview
This notebook covers:
- **Supervised Learning (Regression & Classification)**: Linear Regression, Ridge, Lasso, Logistic Regression, Decision Trees, Random Forests, SVM
- **Hyperparameter Tuning**: GridSearchCV, RandomizedSearchCV
- **Unsupervised Learning**: K-Means, PCA
- **Basic ML System Design**: Data ingestion, feature store concept, offline vs. online training, model hosting

**By the end of this notebook, you should be able to:**
1. Implement classical ML algorithms in scikit-learn.
2. Evaluate models with appropriate metrics and tune hyperparameters.
3. Understand basic ML system design considerations for scaling.
4. Complete a mini-project pipeline that includes data loading, cleaning, model training, and hyperparameter tuning.

---
## 1. Monday: Supervised Learning (Regression)

### Topics:
- **Linear Regression** (Ordinary Least Squares)
- **Regularization** (Ridge, Lasso)

### Notebook Tasks:
1. Use the **California Housing** dataset to demonstrate regression (Boston Housing is no longer an option: `load_boston` was removed in scikit-learn 1.2).
2. Compare MSE, MAE, and R² metrics.
3. Observe the effects of **regularization**.

---
## 2. Tuesday: Supervised Learning (Classification)

### Topics:
- **Logistic Regression**, **Decision Trees**, **Random Forests**, **SVM**
- Performance metrics: Accuracy, Precision, Recall, F1, Confusion Matrix

### Notebook Tasks:
1. Use **Iris** or **Titanic** dataset for classification.
2. Train at least two different classifiers.
3. Evaluate each with relevant metrics; visualize confusion matrices.

---
## 3. Wednesday: Hyperparameter Tuning & Model Selection

### Topics:
- **GridSearchCV**, **RandomizedSearchCV**, cross-validation
- Model selection & avoiding overfitting

### Notebook Tasks:
1. Choose one model (e.g., Random Forest or SVM) and apply `GridSearchCV`.
2. Compare performance with default hyperparameters vs. tuned hyperparameters.
3. Document the best params and final performance.

---
## 4. Thursday: Unsupervised Learning & Basic ML System Design

### Topics:
- **K-Means** clustering
- **PCA** for dimensionality reduction
- **Basic ML system design**: data ingestion, feature engineering, model hosting (conceptual)

### Notebook Tasks:
1. Implement K-Means on a small dataset (e.g., Iris ignoring labels).
2. Use PCA for visualization in 2D or 3D.
3. Write a short outline of how you’d design the system if the data were too large to fit in memory (e.g., streaming ingestion or incremental training with partial fitting).

---
## 5. Friday: Mini-Project – Classical ML Pipeline

### Objective
Build a small end-to-end pipeline that combines:
1. Data Loading & Cleaning
2. Feature Engineering (optional, if relevant)
3. Model Training (Regression or Classification)
4. Hyperparameter Tuning
5. Model Evaluation
6. Brief mention of how you’d **deploy** or **serve** this model in a real-world system.

### Industry Context
- This pipeline reflects a typical workflow: data is prepared, models are trained, and the chosen model is eventually served in production.
- At scale, teams must consider data versioning, automation (CI/CD), and monitoring.

---
## 6. Weekend: Consolidation
- Review the classical ML techniques and ensure all code runs smoothly.
- Solidify understanding of hyperparameter tuning.
- **ADHD Tip**: Break tasks down into small chunks each day. Celebrate each milestone!

### Next Steps Preview
In **Week 3**, we’ll shift focus to **Deep Learning** basics with an intro to neural networks, CNNs, and best practices.

---

## Practical Implementation Sections

Below, you'll find skeleton code cells for each day of the week. Feel free to split or reorganize them as you see fit. Insert your own code, results, visualizations, and notes.

---

### 1. Monday: Regression (Linear, Ridge, Lasso)

In [None]:
# TODO: 1.1 Load the California Housing dataset (downloaded on first use;
# Boston Housing's load_boston was removed in scikit-learn 1.2)
from sklearn.datasets import fetch_california_housing
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

housing = fetch_california_housing()
X = pd.DataFrame(housing.data, columns=housing.feature_names)
y = housing.target

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# TODO: 1.2 Train LinearRegression, Ridge, Lasso and compare metrics
models = {
    'Linear': LinearRegression(),
    'Ridge': Ridge(alpha=1.0),
    'Lasso': Lasso(alpha=0.1)
}

for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    mae = mean_absolute_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)
    print(f"{name} -> MSE: {mse:.3f}, MAE: {mae:.3f}, R2: {r2:.3f}")

**Observations**:
- TODO: Write notes on which model performed best, how regularization impacted results, etc.
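To make the effect of regularization easier to observe, here is a minimal sketch that sweeps `alpha` for Ridge and Lasso and records how the coefficients react. It is an illustration under assumptions rather than part of the assignment: the data comes from `make_regression` (synthetic, so the cell runs offline), features are standardized first, and the names `X_demo`, `alphas`, `ridge_norms`, and `lasso_zero_counts` are made up for this example.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): 10 features, only 4 carry signal.
X_demo, y_demo = make_regression(
    n_samples=500, n_features=10, n_informative=4, noise=10.0, random_state=42
)
# Standardize so the penalty treats every feature on the same scale.
X_scaled = StandardScaler().fit_transform(X_demo)

alphas = [0.01, 0.1, 1.0, 10.0, 100.0]
ridge_norms = []        # L2 norm of the Ridge coefficients per alpha
lasso_zero_counts = []  # number of Lasso coefficients that are exactly zero

for alpha in alphas:
    ridge = Ridge(alpha=alpha).fit(X_scaled, y_demo)
    lasso = Lasso(alpha=alpha, max_iter=10_000).fit(X_scaled, y_demo)
    ridge_norms.append(float(np.linalg.norm(ridge.coef_)))
    lasso_zero_counts.append(int(np.sum(lasso.coef_ == 0)))
    print(
        f"alpha={alpha:>6}: Ridge ||coef||={ridge_norms[-1]:.2f}, "
        f"Lasso zero coefs={lasso_zero_counts[-1]}"
    )
```

Ridge shrinks all coefficients toward zero as `alpha` grows, while Lasso can set some of them exactly to zero, which is why it is sometimes used as a rough feature selector. The same sweep can be repeated on the California Housing split above.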

### 2. Tuesday: Classification (Logistic Regression, Decision Tree, Random Forest, SVM)

In [None]:
# TODO: 2.1 Load Iris or Titanic dataset
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
import seaborn as sns
import matplotlib.pyplot as plt

iris = load_iris()
X = iris.data
y = iris.target

# stratify=y keeps the three classes balanced in the small (30-sample) test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# TODO: 2.2 Train multiple classifiers
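# Fix random_state on the randomized models so results are reproducible
classifiers = {
    'LogisticRegression': LogisticRegression(max_iter=200),
    'DecisionTree': DecisionTreeClassifier(random_state=42),
    'RandomForest': RandomForestClassifier(random_state=42),
    'SVM': SVC()
}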

for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    preds = clf.predict(X_test)
    acc = accuracy_score(y_test, preds)
    print(f"{name} Accuracy: {acc:.3f}")

# TODO: 2.3 Choose one model and visualize the confusion matrix
chosen_model = classifiers['RandomForest']
preds = chosen_model.predict(X_test)
cm = confusion_matrix(y_test, preds)
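# Rows of cm are true labels, columns are predicted labels
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=iris.target_names, yticklabels=iris.target_names)
plt.xlabel("Predicted label")
plt.ylabel("True label")
plt.title("Confusion Matrix - RandomForest")
plt.show()

print(classification_report(y_test, preds, target_names=iris.target_names))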

**Observations**:
- TODO: Compare results among classifiers.
- Which performed best? Why do you think so?
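When comparing classifiers, it can help to see where precision, recall, and F1 come from. The sketch below derives them per class from a confusion matrix and checks the result against scikit-learn's own `precision_recall_fscore_support`. It is a self-contained illustration: the split parameters and the variable names (`X_m`, `clf_m`, `cm_m`, and so on) are assumptions made for this example, not values from the cells above.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, precision_recall_fscore_support
from sklearn.model_selection import train_test_split

X_m, y_m = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_m, y_m, test_size=0.3, stratify=y_m, random_state=0
)
clf_m = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
preds_m = clf_m.predict(X_te)

# Rows are true classes, columns are predicted classes.
cm_m = confusion_matrix(y_te, preds_m)
true_pos = np.diag(cm_m)

precision = true_pos / cm_m.sum(axis=0)  # TP / everything predicted as the class
recall = true_pos / cm_m.sum(axis=1)     # TP / everything that truly is the class
f1 = 2 * precision * recall / (precision + recall)

# Reference values from scikit-learn (per class, no averaging).
ref_precision, ref_recall, ref_f1, _ = precision_recall_fscore_support(y_te, preds_m)

for i in range(len(true_pos)):
    print(f"class {i}: precision={precision[i]:.3f}, "
          f"recall={recall[i]:.3f}, f1={f1[i]:.3f}")
```

Note that this manual version divides by zero if a class is never predicted; the library function handles that case for you.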

### 3. Wednesday: Hyperparameter Tuning & Model Selection

In [None]:
# TODO: 3.1 GridSearchCV or RandomizedSearchCV example
from sklearn.model_selection import GridSearchCV

param_grid = {
    'n_estimators': [50, 100],
    'max_depth': [None, 2, 5]
}
# Fix random_state so repeated runs of the search give the same result
grid_search = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=3, scoring='accuracy')
grid_search.fit(X_train, y_train)

print("Best Params:", grid_search.best_params_)
print("Best Score:", grid_search.best_score_)

# Evaluate on test set
best_model = grid_search.best_estimator_
test_preds = best_model.predict(X_test)
test_acc = accuracy_score(y_test, test_preds)
print("Test Accuracy with tuned params:", test_acc)
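The default-versus-tuned comparison from the task list, together with `RandomizedSearchCV`, might look roughly like the sketch below. Treat it as one possible setup: the parameter ranges, `n_iter=10`, and the names `baseline`, `search`, and `param_distributions` are assumptions chosen for illustration, and on a dataset as small as Iris the tuned model will not necessarily beat the default.

```python
from scipy.stats import randint
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, train_test_split

X_r, y_r = load_iris(return_X_y=True)
X_r_train, X_r_test, y_r_train, y_r_test = train_test_split(
    X_r, y_r, test_size=0.2, stratify=y_r, random_state=42
)

# Baseline: default hyperparameters.
baseline = RandomForestClassifier(random_state=42).fit(X_r_train, y_r_train)
baseline_acc = baseline.score(X_r_test, y_r_test)

# Randomized search samples n_iter combinations instead of trying every one.
param_distributions = {
    "n_estimators": randint(50, 201),   # integers in [50, 200]
    "max_depth": [None, 2, 3, 5],       # lists are sampled uniformly
    "min_samples_leaf": randint(1, 6),  # integers in [1, 5]
}
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions,
    n_iter=10,
    cv=3,
    scoring="accuracy",
    random_state=42,
)
search.fit(X_r_train, y_r_train)
tuned_acc = search.score(X_r_test, y_r_test)  # uses the refit best estimator

print("Best params:", search.best_params_)
print(f"Default test accuracy: {baseline_acc:.3f}")
print(f"Tuned test accuracy:   {tuned_acc:.3f}")
```

Randomized search is usually preferred over a full grid when the search space is large, because its cost is set by `n_iter` rather than by the number of grid points.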

**Notes**:
- Write about how cross-validation works.
- Document best params and any noticeable improvements over the default model.
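As a starting point for the cross-validation notes, here is a small sketch of what `cv=3` does under the hood: the data is split into folds, the model is trained on all folds but one, and scored on the held-out fold, once per fold. The manual loop is then compared with `cross_val_score`. The variable names (`X_k`, `folds`, `manual_scores`) are illustrative assumptions, and stratified, shuffled folds are a choice made for this example.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

X_k, y_k = load_iris(return_X_y=True)
model_k = RandomForestClassifier(n_estimators=50, random_state=42)
folds = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)

manual_scores = []
for train_idx, val_idx in folds.split(X_k, y_k):
    # A fresh, unfitted copy per fold so no information leaks between folds.
    fold_model = clone(model_k)
    fold_model.fit(X_k[train_idx], y_k[train_idx])
    # score() returns accuracy for classifiers.
    manual_scores.append(fold_model.score(X_k[val_idx], y_k[val_idx]))

library_scores = cross_val_score(model_k, X_k, y_k, cv=folds)

print("Manual fold accuracies: ", np.round(manual_scores, 3))
print("cross_val_score results:", np.round(library_scores, 3))
print(f"Mean CV accuracy: {np.mean(manual_scores):.3f}")
```

`GridSearchCV` runs this loop for every hyperparameter combination and reports the mean validation score as `best_score_`.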

### 4. Thursday: Unsupervised Learning & Basic ML System Design (Conceptual)

In [None]:
# TODO: 4.1 K-Means clustering with Iris (ignoring labels)
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Let's reuse the Iris data but drop the labels for clustering
X_iris = iris.data

# n_init is set explicitly because its default changed across scikit-learn versions
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
kmeans_labels = kmeans.fit_predict(X_iris)

print("Cluster labels:", kmeans_labels[:10])

# 4.2 PCA for visualization
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_iris)

plt.figure(figsize=(8,5))
plt.scatter(X_pca[:,0], X_pca[:,1], c=kmeans_labels, cmap='viridis')
plt.title("K-Means Clusters in PCA-Reduced Space")
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.show()

# TODO: 4.3 Basic ML system design outline
system_design_notes = """
Imagine we have a huge streaming dataset. We might:
- Use a message queue (e.g., Kafka) to handle incoming data.
- Store raw data in a data lake (S3, HDFS).
- Perform feature engineering offline, then train in a distributed environment (Spark, Dask, or HPC cluster).
- Deploy the trained model behind a microservice (FastAPI, Docker).
- Log requests and feedback for continuous updates.
"""
print(system_design_notes)
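For the clustering and PCA steps above, two quick sanity checks are often useful: whether the chosen number of clusters is reasonable, and how much variance the 2D projection keeps. The sketch below uses the silhouette score and `explained_variance_ratio_` for this. The range of `k` values and the names `silhouettes` and `explained` are assumptions made for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score

X_u, _ = load_iris(return_X_y=True)  # labels are ignored on purpose

# Silhouette score per candidate k (closer to 1 means tighter, better-separated clusters).
silhouettes = {}
for k in range(2, 7):
    labels_k = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X_u)
    silhouettes[k] = silhouette_score(X_u, labels_k)
    print(f"k={k}: silhouette={silhouettes[k]:.3f}")

# Fraction of the total variance captured by each of the two principal components.
pca_u = PCA(n_components=2).fit(X_u)
explained = pca_u.explained_variance_ratio_
print(f"Variance kept by PC1 + PC2: {explained.sum():.3f}")
```

If the two components keep most of the variance, the 2D scatter plot is a fair picture of the cluster structure; if not, clusters that overlap in the plot may still be separated in the original feature space.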

**Conceptual Discussion**:
- Summarize how you’d handle data ingestion if the dataset were 100 GB or arriving in real time.
- Briefly discuss partial fitting or online learning for extremely large data.
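Partial fitting can be sketched with estimators that expose `partial_fit`, such as `SGDClassifier` and `StandardScaler`. The example below simulates a stream by feeding fixed-size batches from a synthetic dataset; the dataset, the batch size of 1,000, and the names `X_stream`, `online_clf`, and `holdout_acc` are assumptions for illustration. In a real system the batches would come from files or a message queue instead of an in-memory array.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for data that would not fit in memory (assumption).
X_big, y_big = make_classification(
    n_samples=20_000, n_features=20, n_informative=10,
    n_clusters_per_class=1, random_state=42,
)
X_stream, y_stream = X_big[:16_000], y_big[:16_000]
X_holdout, y_holdout = X_big[16_000:], y_big[16_000:]

scaler = StandardScaler()
online_clf = SGDClassifier(random_state=42)  # linear model trained with SGD
classes = np.unique(y_big)  # partial_fit needs every class label on the first call

batch_size = 1_000
for start in range(0, len(X_stream), batch_size):
    X_batch = X_stream[start:start + batch_size]
    y_batch = y_stream[start:start + batch_size]
    # Update the running mean/variance, then the model, one batch at a time.
    scaler.partial_fit(X_batch)
    online_clf.partial_fit(scaler.transform(X_batch), y_batch, classes=classes)

holdout_acc = online_clf.score(scaler.transform(X_holdout), y_holdout)
print(f"Holdout accuracy after one pass over the stream: {holdout_acc:.3f}")
```

Only one batch is held by the training loop at a time, so memory use depends on the batch size rather than on the total amount of data.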


### 5. Friday: Mini-Project – Classical ML Pipeline

**Objective**: Combine data loading, cleaning, feature engineering (optional), model training, hyperparameter tuning, and evaluation into one workflow.

**Suggested Steps**:
1. Choose a dataset (Iris, Titanic, or another interesting one).
2. Clean data if necessary.
3. Split into train/test.
4. Train 1–2 models.
5. Tune hyperparameters.
6. Evaluate final model(s).
7. Briefly mention deployment/serving ideas.


In [None]:
# TODO: 5.1 Example skeleton code for a mini-project pipeline
def run_classical_ml_pipeline(dataset_df):
    # 1. Data cleaning
    # 2. Feature engineering
    # 3. Train/test split
    # 4. Modeling
    # 5. Hyperparameter tuning
    # 6. Evaluation
    pass

# fill in with your own logic
print("Mini-project pipeline placeholder.")
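One way the skeleton could be filled in is sketched below, using Iris as the dataset. This is an illustrative version under assumptions, not a reference solution: the function name `run_pipeline_sketch`, the `target_col` parameter, the choice of logistic regression, and the `C` grid are all hypothetical choices made for this example.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


def run_pipeline_sketch(dataset_df: pd.DataFrame, target_col: str) -> dict:
    # 1. Data cleaning: drop exact duplicates and rows with missing values.
    clean_df = dataset_df.drop_duplicates().dropna()
    X = clean_df.drop(columns=[target_col])
    y = clean_df[target_col]

    # 3. Train/test split, stratified so class proportions are preserved.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=42
    )

    # 2 + 4. Scaling and the model live in one Pipeline, so the scaler is
    # fit only on the training portion of each cross-validation fold.
    pipe = Pipeline([
        ("scale", StandardScaler()),
        ("clf", LogisticRegression(max_iter=1000)),
    ])

    # 5. Hyperparameter tuning over the regularization strength C.
    search = GridSearchCV(
        pipe, {"clf__C": [0.01, 0.1, 1.0, 10.0]}, cv=5, scoring="accuracy"
    )
    search.fit(X_train, y_train)

    # 6. Evaluation on the untouched test set.
    test_acc = accuracy_score(y_test, search.predict(X_test))
    return {
        "best_params": search.best_params_,
        "cv_accuracy": search.best_score_,
        "test_accuracy": test_acc,
    }


iris_df = load_iris(as_frame=True).frame  # features plus a "target" column
results = run_pipeline_sketch(iris_df, target_col="target")
print(results)
```

Wrapping preprocessing and the model in a single `Pipeline` also means the whole object can be saved and served as one unit later.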

### Industry Context
In real projects, you’ll often have multiple data sources, need to track data versions, and iterate rapidly. This pipeline forms the foundation for production-level ML systems.
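As a first step toward serving, a trained pipeline is commonly saved to disk and reloaded by the serving process. The sketch below does a round trip with `joblib`, which is installed alongside scikit-learn. The file name `iris_pipeline.joblib` and the variable names are hypothetical, and a temporary directory stands in for wherever model artifacts would actually be stored.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_s, y_s = load_iris(return_X_y=True)
trained = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
trained.fit(X_s, y_s)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = Path(tmp_dir) / "iris_pipeline.joblib"  # hypothetical artifact name
    joblib.dump(trained, model_path)   # training side: persist the fitted pipeline
    served = joblib.load(model_path)   # serving side: load it back

# The reloaded pipeline should reproduce the original predictions exactly.
same_predictions = np.array_equal(trained.predict(X_s), served.predict(X_s))
print("Predictions match after reload:", same_predictions)
```

Files written this way should be loaded only from trusted sources and with a scikit-learn version compatible with the one used for training, which is one reason model and environment versioning matter in production.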


### 6. Weekend: Consolidation
- **Review** classical ML techniques.
- Ensure your code is clean and well-documented.
- Reflect on hyperparameter tuning outcomes.
- **ADHD Tip**: Keep tasks small and well-defined.


## Additional Resources (Optional)
- [scikit-learn User Guide](https://scikit-learn.org/stable/user_guide.html)
- [Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow (A. Géron)](https://www.oreilly.com/library/view/hands-on-machine-learning/9781492032632/)
- [System Design for ML (various blog posts & tutorials)](https://github.com/puncsky/system-design)


# End of Week 2 Notebook

---
In **Week 3**, we’ll dive into **Deep Learning** fundamentals, building your first neural networks in TensorFlow or PyTorch!
