<a href="https://colab.research.google.com/github/lubaochuan/ml_python/blob/main/chapter3_assignment2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


# Classification Assignment

**Reference:** Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow (3rd ed.)  
**Chapter:** 3 – Classification

---
### Student Information
- **Name:**  
- **Date:**  



## Objectives
- Build a binary classifier
- Compute and interpret precision, recall, and F1 score
- Analyze a confusion matrix
- Reflect on metric choice for imbalanced data


In [None]:
import numpy as np
import matplotlib.pyplot as plt

from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import (
    precision_score, recall_score, f1_score,
    confusion_matrix, ConfusionMatrixDisplay
)

np.random.seed(42)


## Dataset: MNIST (5 vs Not-5)

We convert MNIST into a binary classification task:
- Positive class: digit 5
- Negative class: all other digits


In [None]:

# Load MNIST dataset
mnist = fetch_openml("mnist_784", version=1, as_frame=False)
X, y = mnist.data, mnist.target.astype(int)

# Binary labels
y_binary = (y == 5)

# Train/test split (stratified)
X_train, X_test, y_train, y_test = train_test_split(
    X, y_binary,
    test_size=0.2,
    stratify=y_binary,
    random_state=42
)

print("Training samples:", X_train.shape[0])
print("Test samples:", X_test.shape[0])
print("Positive class ratio (train):", y_train.mean())
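
The positive class ratio printed above should come out at roughly 9%, which is worth keeping in mind when reading any accuracy figure. As a minimal sketch (using hypothetical toy labels rather than MNIST itself), a "classifier" that never predicts the positive class still scores high accuracy while finding none of the positives:

```python
import numpy as np

# Hypothetical toy labels: 9 positives out of 100, roughly the share of 5s in MNIST
y_true = np.zeros(100, dtype=bool)
y_true[:9] = True

# A baseline that always predicts the negative class ("not 5")
y_always_negative = np.zeros_like(y_true)

# Accuracy counts every correct prediction, including the easy negatives
accuracy = (y_always_negative == y_true).mean()

# Recall only looks at the actual positives
true_positives = np.sum(y_always_negative & y_true)
recall = true_positives / y_true.sum()

print(f"Accuracy: {accuracy:.2f}")  # 0.91
print(f"Recall:   {recall:.2f}")    # 0.00
```

This is one reason the rest of the assignment reports precision, recall, and F1 instead of accuracy alone.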



## Train a Logistic Regression Classifier


In [None]:
# Scale the data
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

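# Train the classifier; max_iter is raised so the solver has room to converge
model = LogisticRegression(max_iter=1000, random_state=42)
model.fit(X_train_scaled, y_train)

# Predict labels (True = digit 5) for the held-out test set
y_pred = model.predict(X_test_scaled)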
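
The scale-then-fit steps above can also be bundled into a single estimator. The sketch below is one possible alternative, not part of the assignment: it uses scikit-learn's `make_pipeline` on a small synthetic dataset (an assumption, chosen so the example runs quickly without downloading MNIST). The variable names are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for MNIST (an assumption, used only to keep the sketch fast)
X_demo, y_demo = make_classification(n_samples=1000, n_features=20, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

# fit() learns the scaling statistics from the training data only,
# then trains the classifier on the scaled training data
pipe = make_pipeline(
    StandardScaler(),
    LogisticRegression(max_iter=1000, random_state=42),
)
pipe.fit(X_tr, y_tr)

# predict() reuses the training-set mean and variance to scale the test data
y_pred_pipe = pipe.predict(X_te)
print("Test accuracy:", pipe.score(X_te, y_te))
```

Keeping the scaler inside the pipeline makes it harder to fit it on test data by accident, which would leak information from the test set into training.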


## Model Evaluation
Compute precision, recall, and F1 score on the test set, treating digit 5 as the positive class.
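
As a reminder of what these three metrics mean, here is a small sketch that computes them by hand from true positive, false positive, and false negative counts. The counts are hypothetical and chosen only for illustration.

```python
# Hypothetical counts, chosen only for illustration
tp, fp, fn = 80, 10, 20

# Precision: of everything predicted positive, what fraction was correct?
precision_manual = tp / (tp + fp)

# Recall: of all actual positives, what fraction did the model find?
recall_manual = tp / (tp + fn)

# F1: harmonic mean of precision and recall
f1_manual = 2 * precision_manual * recall_manual / (precision_manual + recall_manual)

print(f"Precision: {precision_manual:.3f}")  # 0.889
print(f"Recall:    {recall_manual:.3f}")     # 0.800
print(f"F1 Score:  {f1_manual:.3f}")         # 0.842
```

Because F1 is a harmonic mean, it stays low unless both precision and recall are reasonably high.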


In [None]:

precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

print(f"Precision: {precision:.3f}")
print(f"Recall:    {recall:.3f}")
print(f"F1 Score:  {f1:.3f}")

## Confusion Matrix


In [None]:

cm = confusion_matrix(y_test, y_pred)
# Label the axes so the two classes are easy to tell apart
ConfusionMatrixDisplay(cm, display_labels=["Not 5", "5"]).plot()
plt.title("Confusion Matrix")
plt.show()
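
To read the matrix, recall that scikit-learn puts true labels on the rows and predicted labels on the columns. For a binary problem the four cells can be unpacked with `ravel()`. The sketch below uses a hypothetical set of ten toy labels, not the MNIST results.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical toy labels (True = positive class)
y_true_toy = np.array([0, 0, 0, 0, 1, 1, 1, 0, 1, 0], dtype=bool)
y_pred_toy = np.array([0, 0, 1, 0, 1, 0, 1, 0, 1, 0], dtype=bool)

cm_toy = confusion_matrix(y_true_toy, y_pred_toy)

# Row 0 = actual negatives, row 1 = actual positives
# Column 0 = predicted negative, column 1 = predicted positive
tn, fp, fn, tp = cm_toy.ravel()

print(cm_toy)
print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")  # TN=5, FP=1, FN=1, TP=3
```

Precision uses the right-hand column (TP and FP), while recall uses the bottom row (TP and FN).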


## Reflection Questions

Answer in complete sentences.

1. Why is accuracy misleading for imbalanced datasets?
2. What does precision measure?
3. Why might precision be more important than recall in spam detection?
4. Why might recall be more important than precision in medical screening?
5. How does class imbalance affect accuracy?

👉 **Your answer:**  


## Optional Extension (Bonus)
- Try using `class_weight="balanced"` in Logistic Regression.
- Observe how precision and recall change.
- Explain the trade-off you observe.
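
One way to set up the comparison is sketched below on a synthetic imbalanced dataset (an assumption, used so the example runs quickly; the dataset parameters and variable names are hypothetical). For the assignment itself, apply the same idea to the scaled MNIST data and describe what you see.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic data with about 10% positives (a stand-in for the 5 vs not-5 task)
X_imb, y_imb = make_classification(
    n_samples=5000, n_features=20, weights=[0.9, 0.1],
    class_sep=0.8, random_state=42,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_imb, y_imb, test_size=0.2, stratify=y_imb, random_state=42
)

# Train once with default weights and once with class_weight="balanced"
results = {}
for weight in (None, "balanced"):
    clf = LogisticRegression(max_iter=1000, class_weight=weight, random_state=42)
    clf.fit(X_tr, y_tr)
    pred = clf.predict(X_te)
    results[weight] = (
        precision_score(y_te, pred, zero_division=0),
        recall_score(y_te, pred),
    )
    print(f"class_weight={weight}: "
          f"precision={results[weight][0]:.3f}, recall={results[weight][1]:.3f}")
```

With `class_weight="balanced"`, errors on the rare class are weighted more heavily during training, so the model typically predicts the positive class more often.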
