# Classification Problems in Machine Learning

## What is Classification?

Classification is a supervised learning task where the goal is to predict discrete categories or classes for input data. Unlike regression (which predicts continuous values), classification assigns each input to one of a predefined set of classes.

## Types of Classification

**Binary Classification**: Two possible classes (e.g., spam/not spam, positive/negative)

**Multi-class Classification**: More than two mutually exclusive classes (e.g., cat/dog/bird) <- **This is what Practical 1 is about!**

**Multi-label Classification**: Multiple non-exclusive labels can apply simultaneously (e.g., a movie can be both "action" and "comedy")
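
In practice, the three settings differ mainly in the shape of the target array. The following is a minimal sketch with made-up labels; the class names and arrays are illustrative assumptions and are not taken from Practical 1.

```python
import numpy as np

# Binary: one 0/1 label per sample.
y_binary = np.array([1, 0, 1, 1])

# Multi-class: one integer class index per sample; classes are mutually exclusive.
classes = ["cat", "dog", "bird"]
y_multiclass = np.array([0, 2, 1, 0])  # cat, bird, dog, cat

# Multi-label: one 0/1 indicator per (sample, label); a row may contain several 1s.
labels = ["action", "comedy", "drama"]
y_multilabel = np.array([
    [1, 1, 0],  # action and comedy
    [0, 0, 1],  # drama only
    [1, 0, 1],  # action and drama
    [0, 1, 0],  # comedy only
])

print(y_binary.shape, y_multiclass.shape, y_multilabel.shape)
print([classes[i] for i in y_multiclass])
```

The binary and multi-class targets are one-dimensional, while the multi-label target needs one column per label.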

## Example: Email Spam Detection

Consider building a spam filter for emails. Each email needs to be classified as either "spam" or "not spam" based on features (matrix $X$) like:
- Presence of certain keywords ("free", "winner", "click here")
- Sender reputation
- Email length
- Number of links
- Semantic embeddings

Given a training dataset of labeled emails, a classification model learns patterns that distinguish spam from legitimate emails, then predicts the class for new, unseen emails.


For instance, the feature matrix $X$ and label vector $y$ for 3 emails might look like this (the first row of $X$ only names the columns and is not a data row):

$$
X = \begin{bmatrix}
\text{keyword\_count} & \text{sender\_rep} & \text{length} & \text{links} \\
5 & 0.2 & 150 & 8  \\
0 & 0.9 & 450 & 1  \\
2 & 0.1 & 75 & 12 
\end{bmatrix}, \quad y = \begin{bmatrix}
1 \\
0 \\
1
\end{bmatrix}
$$

Here each data row represents one email, the columns are the features [keyword count, sender reputation, length, number of links], and $y$ indicates whether each email is spam (1) or not spam (0).
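
As a rough sketch, the same toy data could be stored as NumPy arrays like this. The variable names `feature_names` and `spam_mean` are introduced here for illustration only.

```python
import numpy as np

# Column order: keyword_count, sender_rep, length, links
feature_names = ["keyword_count", "sender_rep", "length", "links"]

X = np.array([
    [5, 0.2, 150, 8],
    [0, 0.9, 450, 1],
    [2, 0.1, 75, 12],
])
y = np.array([1, 0, 1])  # 1 = spam, 0 = not spam

n_samples, n_features = X.shape
print(f"{n_samples} emails, {n_features} features")

# Boolean indexing with y selects the rows that belong to one class.
spam_mean = X[y == 1].mean(axis=0)
print(dict(zip(feature_names, spam_mean)))
```

The column names live outside the array, so `X` has shape `(3, 4)`: one row per email and one column per feature.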

## Key Metrics for Classification

### Confusion Matrix

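A confusion matrix breaks predictions down by actual class (rows) and predicted class (columns). Note that scikit-learn's `confusion_matrix` sorts labels in ascending order, so with 0/1 labels it returns `[[TN, FP], [FN, TP]]` rather than the positive-first layout used here: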

|                | Predicted Positive | Predicted Negative |
|----------------|-------------------|-------------------|
| **Actually Positive** | True Positive (TP) | False Negative (FN) |
| **Actually Negative** | False Positive (FP) | True Negative (TN) |

### Precision

Precision measures the accuracy of positive predictions:

$$
\text{Precision} = \frac{TP}{TP + FP}
$$

**Interpretation**: "Of all emails we flagged as spam, how many were actually spam?"

### Recall (Sensitivity)

Recall measures how many actual positives we found:

$$
\text{Recall} = \frac{TP}{TP + FN}
$$

**Interpretation**: "Of all actual spam emails, how many did we catch?"

### F1 Score

F1 score is the harmonic mean of precision and recall, providing a single balanced metric:

$$
F1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
$$

**Why harmonic mean?** It is pulled toward the smaller of the two values, so a model cannot score well by maximizing one metric at the expense of the other. A model with 100% recall but 10% precision has an F1 score of only about 0.18.
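
To see the effect of the harmonic mean, the short sketch below compares it with the arithmetic mean for the 100% recall, 10% precision case.

```python
precision, recall = 0.10, 1.00

arithmetic_mean = (precision + recall) / 2
f1 = 2 * precision * recall / (precision + recall)

print(f"Arithmetic mean: {arithmetic_mean:.3f}")
print(f"F1 (harmonic):   {f1:.3f}")
```

The arithmetic mean comes out at 0.55, which would make the model look mediocre, while the harmonic mean stays close to the weaker of the two values.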




In [10]:
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score, classification_report

# Example: Email spam classification results
# 1 = spam, 0 = not spam
y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0])
y_pred = np.array([1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 0])

# Confusion matrix
cm = confusion_matrix(y_true, y_pred)
print("Confusion Matrix:")
print(cm)
print("\nBreakdown:")
print(f"True Negatives (TN):  {cm[0, 0]}")
print(f"False Positives (FP): {cm[0, 1]}")
print(f"False Negatives (FN): {cm[1, 0]}")
print(f"True Positives (TP):  {cm[1, 1]}")

# Calculate metrics
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)

print(f"\nPrecision: {precision:.3f}")
print(f"Recall:    {recall:.3f}")
print(f"F1 Score:  {f1:.3f}")

# Comprehensive report
print("\nClassification Report:")
print(classification_report(y_true, y_pred, target_names=['Not Spam', 'Spam']))

Confusion Matrix:
[[6 1]
 [2 6]]

Breakdown:
True Negatives (TN):  6
False Positives (FP): 1
False Negatives (FN): 2
True Positives (TP):  6

Precision: 0.857
Recall:    0.750
F1 Score:  0.800

Classification Report:
              precision    recall  f1-score   support

    Not Spam       0.75      0.86      0.80         7
        Spam       0.86      0.75      0.80         8

    accuracy                           0.80        15
   macro avg       0.80      0.80      0.80        15
weighted avg       0.81      0.80      0.80        15



In [11]:
# Manual calculation to understand the formulas
TP = np.sum((y_true == 1) & (y_pred == 1))
FP = np.sum((y_true == 0) & (y_pred == 1))
FN = np.sum((y_true == 1) & (y_pred == 0))
TN = np.sum((y_true == 0) & (y_pred == 0))

precision_manual = TP / (TP + FP) if (TP + FP) > 0 else 0
recall_manual = TP / (TP + FN) if (TP + FN) > 0 else 0
f1_manual = 2 * (precision_manual * recall_manual) / (precision_manual + recall_manual) if (precision_manual + recall_manual) > 0 else 0

print(f"Manual Precision: {precision_manual:.3f}")
print(f"Manual Recall:    {recall_manual:.3f}")
print(f"Manual F1:        {f1_manual:.3f}")

Manual Precision: 0.857
Manual Recall:    0.750
Manual F1:        0.800
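
The classification report printed earlier also contains a "Not Spam" row, an accuracy line, and two averages that the manual calculation does not cover. The following sketch reproduces them from the four confusion-matrix counts; the variable names are introduced here for illustration.

```python
# Counts from the confusion matrix of the 15-email example
TP, FP, FN, TN = 6, 1, 2, 6

accuracy = (TP + TN) / (TP + FP + FN + TN)

# "Spam" row: spam (1) is the positive class
spam_precision = TP / (TP + FP)
spam_recall = TP / (TP + FN)

# "Not Spam" row: not spam (0) is treated as the positive class,
# so TN swaps roles with TP and FN swaps roles with FP
ham_precision = TN / (TN + FN)
ham_recall = TN / (TN + FP)

# Support = number of true samples in each class
spam_support, ham_support = TP + FN, TN + FP

# Macro average: plain mean over classes; weighted average: mean weighted by support
macro_precision = (spam_precision + ham_precision) / 2
weighted_precision = (
    spam_support * spam_precision + ham_support * ham_precision
) / (spam_support + ham_support)

print(f"Accuracy:           {accuracy:.2f}")
print(f"Not Spam precision: {ham_precision:.2f}, recall: {ham_recall:.2f}")
print(f"Macro precision:    {macro_precision:.2f}")
print(f"Weighted precision: {weighted_precision:.2f}")
```

Rounded to two decimals, these values match the corresponding entries of the report (0.80, 0.75 and 0.86, 0.80, 0.81).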


## When to Use Which Metric?

**Precision is critical when**: False positives are costly
- Example: Medical diagnosis that leads to invasive treatment - you don't want to tell healthy patients they're sick

**Recall is critical when**: False negatives are costly
- Example: Fraud detection - you don't want to miss actual fraud cases

**F1 Score is useful when**: You need a balance between precision and recall, especially with imbalanced datasets, where accuracy can look high simply because the majority class dominates

## Trade-offs

There's often a trade-off between precision and recall. By adjusting the classification threshold, you can:
- **Increase precision** → Lower recall (stricter about positive predictions)
- **Increase recall** → Lower precision (more lenient about positive predictions)

Comparing the F1 score across thresholds is one way to pick an operating point that balances these two metrics.
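
The threshold effect can be sketched with a few made-up scores. The scores, labels, and the helper `precision_recall_at` below are hypothetical and only serve to illustrate the trade-off on toy data.

```python
# Hypothetical spam probabilities from some model, with the true labels
demo_scores = [0.95, 0.85, 0.70, 0.60, 0.40, 0.30, 0.20, 0.10]
demo_labels = [1, 1, 0, 1, 1, 0, 0, 0]


def precision_recall_at(labels, scores, threshold):
    """Predict 1 when score >= threshold, then return (precision, recall)."""
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for t, p in zip(labels, preds) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(labels, preds) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(labels, preds) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return precision, recall


for threshold in (0.8, 0.5, 0.25):
    p, r = precision_recall_at(demo_labels, demo_scores, threshold)
    print(f"threshold={threshold:.2f}  precision={p:.3f}  recall={r:.3f}")
```

On this toy data, lowering the threshold from 0.8 to 0.25 raises recall from 0.5 to 1.0 while precision falls from 1.0 to about 0.667. On real data precision does not always move monotonically with the threshold, but the overall tension is the same.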

# Confidence-Weighted F1 Score 

For a classifier that outputs predicted labels $\hat{y}_i$ and confidence scores $c_i \in [0,1]$ for each sample $i$, the symmetric confidence-weighted precision and recall (symmetric because negative predictions are weighted by their confidence as well as positive ones) are:

$$
\text{Precision}_{\text{weighted}} = \frac{\sum_{i=1}^{n} c_i \cdot \mathbb{1}(\hat{y}_i = 1 \land y_i = 1)}{\sum_{i=1}^{n} c_i \cdot \mathbb{1}(\hat{y}_i = 1 \land y_i = 1) + \sum_{i=1}^{n} c_i \cdot \mathbb{1}(\hat{y}_i = 1 \land y_i = 0)}
$$

$$
\text{Recall}_{\text{weighted}} = \frac{\sum_{i=1}^{n} c_i \cdot \mathbb{1}(\hat{y}_i = 1 \land y_i = 1)}{\sum_{i=1}^{n} c_i \cdot \mathbb{1}(\hat{y}_i = 1 \land y_i = 1) + \sum_{i=1}^{n} c_i \cdot \mathbb{1}(\hat{y}_i = 0 \land y_i = 1)}
$$

Or more concisely:

$$
\text{Precision}_{\text{weighted}} = \frac{\text{Weighted TP}}{\text{Weighted TP} + \text{Weighted FP}}
$$

$$
\text{Recall}_{\text{weighted}} = \frac{\text{Weighted TP}}{\text{Weighted TP} + \text{Weighted FN}}
$$

The confidence-weighted F1 score is then:

$$
F1_{\text{weighted}} = 2 \cdot \frac{\text{Precision}_{\text{weighted}} \cdot \text{Recall}_{\text{weighted}}}{\text{Precision}_{\text{weighted}} + \text{Recall}_{\text{weighted}}}
$$

**Where:**
- $n$ is the number of samples
- $y_i$ is the true label for sample $i$
- $\hat{y}_i$ is the predicted label for sample $i$
- $c_i$ is the confidence score for sample $i$
- $\mathbb{1}(\cdot)$ is the indicator function
- Weighted TP = $\sum_{i=1}^{n} c_i \cdot \mathbb{1}(\hat{y}_i = 1 \land y_i = 1)$
- Weighted FP = $\sum_{i=1}^{n} c_i \cdot \mathbb{1}(\hat{y}_i = 1 \land y_i = 0)$
- Weighted FN = $\sum_{i=1}^{n} c_i \cdot \mathbb{1}(\hat{y}_i = 0 \land y_i = 1)$

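**Key Properties:**
- Every prediction that counts toward TP, FP, or FN is weighted by its confidence score; true negatives do not appear in the formulas, so their confidence has no effect
- High-confidence errors are penalized more heavily than low-confidence errors
- Low-confidence true positives contribute less to weighted TP, so the metric rewards models that are confident when correct and uncertain when wrong
- Unlike an asymmetric version that weights only positive predictions, false negatives are also weighted by the confidence of the negative prediction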
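
The text does not say where the confidence scores $c_i$ come from. One common choice, assumed here purely for illustration, is to take the probability the model assigns to the class it predicted. The probabilities and variable names below are hypothetical.

```python
import numpy as np

# Hypothetical predicted probabilities of the positive class (1)
proba = np.array([0.9, 0.2, 0.95, 0.4, 0.1, 0.85, 0.75, 0.95])

# Predict the positive class when its probability is at least 0.5
pred_from_proba = (proba >= 0.5).astype(int)

# Assumption: confidence = probability assigned to the predicted class
conf_from_proba = np.where(pred_from_proba == 1, proba, 1 - proba)

print("Predicted labels:", pred_from_proba)
print("Confidence:      ", conf_from_proba)
```

Under this assumption a binary classifier's confidence never drops below 0.5, so the weights span a narrower range than the full $[0,1]$ interval allowed by the definition.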

In [12]:
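def confidence_weighted_f1(y_true, y_pred, confidence):
    """Return the symmetric confidence-weighted F1 for binary 0/1 labels.

    Each TP, FP, and FN contributes its confidence score instead of a count of 1.
    """
    # Accept lists as well as arrays
    y_true, y_pred, confidence = (np.asarray(a) for a in (y_true, y_pred, confidence))

    # Weight all predictions, not just positives
    weighted_tp = np.sum(confidence * (y_pred == 1) * (y_true == 1))
    weighted_fp = np.sum(confidence * (y_pred == 1) * (y_true == 0))
    weighted_fn = np.sum(confidence * (y_pred == 0) * (y_true == 1))

    # Fall back to 0 when a denominator is zero
    precision = weighted_tp / (weighted_tp + weighted_fp) if (weighted_tp + weighted_fp) > 0 else 0
    recall = weighted_tp / (weighted_tp + weighted_fn) if (weighted_tp + weighted_fn) > 0 else 0

    f1 = 2 * (precision * recall) / (precision + recall) if (precision + recall) > 0 else 0
    return f1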

In [16]:
# Example data
y_true = np.array([1, 0, 1, 1, 0, 1, 0, 1])
y_pred = np.array([1, 0, 1, 0, 0, 1, 1, 1])

# Sanity check: with every confidence equal to 1, the weighted score should match the standard F1
confidence = np.array([1, 1, 1, 1, 1, 1, 1, 1])

f1_weighted = confidence_weighted_f1(y_true, y_pred, confidence)

print("True labels:     ", y_true)
print("Predicted labels:", y_pred)
print("Confidence:      ", confidence)
print(f"\nConfidence-weighted F1 score: {f1_weighted:.4f}")

# Compare with standard F1 (all confidences = 1)
from sklearn.metrics import f1_score
f1_standard = f1_score(y_true, y_pred)
print(f"Standard F1 score:            {f1_standard:.4f}")

True labels:      [1 0 1 1 0 1 0 1]
Predicted labels: [1 0 1 0 0 1 1 1]
Confidence:       [1 1 1 1 1 1 1 1]

Confidence-weighted F1 score: 0.8000
Standard F1 score:            0.8000
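
The match is not specific to a confidence of 1: any constant confidence cancels out of the ratios. The sketch below checks this with a compact re-implementation, `weighted_f1`, a hypothetical helper that uses the algebraically equivalent form $F1 = 2\,TP / (2\,TP + FP + FN)$ so that the block runs on its own.

```python
import numpy as np


def weighted_f1(y_true, y_pred, confidence):
    """Confidence-weighted F1 written as 2*TP / (2*TP + FP + FN)."""
    tp = np.sum(confidence * ((y_pred == 1) & (y_true == 1)))
    fp = np.sum(confidence * ((y_pred == 1) & (y_true == 0)))
    fn = np.sum(confidence * ((y_pred == 0) & (y_true == 1)))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom > 0 else 0.0


labels_true = np.array([1, 0, 1, 1, 0, 1, 0, 1])
labels_pred = np.array([1, 0, 1, 0, 0, 1, 1, 1])

# The same constant confidence for every sample should leave the score unchanged
for c in (1.0, 0.5, 0.1):
    score = weighted_f1(labels_true, labels_pred, np.full(len(labels_true), c))
    print(f"constant confidence {c:.1f} -> weighted F1 {score:.4f}")
```

Only differences in confidence between samples change the score; the absolute scale of the confidences does not.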


In [17]:
# Same data, but the false negative (index 3) now has confidence 0.5
y_true = np.array([1, 0, 1, 1, 0, 1, 0, 1])
y_pred = np.array([1, 0, 1, 0, 0, 1, 1, 1])
confidence = np.array([1, 1, 1, 0.5, 1, 1, 1, 1])

f1_weighted = confidence_weighted_f1(y_true, y_pred, confidence)

print("True labels:     ", y_true)
print("Predicted labels:", y_pred)
print("Confidence:      ", confidence)
print(f"\nConfidence-weighted F1 score: {f1_weighted:.4f}")

# Compare with standard F1 (all confidences = 1)
from sklearn.metrics import f1_score
f1_standard = f1_score(y_true, y_pred)
print(f"Standard F1 score:            {f1_standard:.4f}")

True labels:      [1 0 1 1 0 1 0 1]
Predicted labels: [1 0 1 0 0 1 1 1]
Confidence:       [1.  1.  1.  0.5 1.  1.  1.  1. ]

Confidence-weighted F1 score: 0.8421
Standard F1 score:            0.8000
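
The increase from 0.8000 to 0.8421 can be traced by hand. The four true positives and the single false positive keep a confidence of 1, while the single false negative now counts only 0.5:

$$
\text{Weighted TP} = 4, \quad \text{Weighted FP} = 1, \quad \text{Weighted FN} = 0.5
$$

$$
\text{Precision}_{\text{weighted}} = \frac{4}{4 + 1} = 0.8, \quad \text{Recall}_{\text{weighted}} = \frac{4}{4 + 0.5} \approx 0.8889
$$

$$
F1_{\text{weighted}} = 2 \cdot \frac{0.8 \cdot 0.8889}{0.8 + 0.8889} \approx 0.8421
$$

Only the recall changes, because the down-weighted sample is a false negative and false negatives appear only in the recall denominator.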


In [15]:
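# Same data, with lower confidence on one true positive (index 0), the false negative (index 3), and the false positive (index 6)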
y_true = np.array([1, 0, 1, 1, 0, 1, 0, 1])
y_pred = np.array([1, 0, 1, 0, 0, 1, 1, 1])
confidence = np.array([0.9, 1, 1, 0.8, 1, 1, 0.8, 1])

f1_weighted = confidence_weighted_f1(y_true, y_pred, confidence)

print("True labels:     ", y_true)
print("Predicted labels:", y_pred)
print("Confidence:      ", confidence)
print(f"\nConfidence-weighted F1 score: {f1_weighted:.4f}")

# Compare with standard F1 (all confidences = 1)
from sklearn.metrics import f1_score
f1_standard = f1_score(y_true, y_pred)
print(f"Standard F1 score:            {f1_standard:.4f}")

True labels:      [1 0 1 1 0 1 0 1]
Predicted labels: [1 0 1 0 0 1 1 1]
Confidence:       [0.9 1.  1.  0.8 1.  1.  0.8 1. ]

Confidence-weighted F1 score: 0.8298
Standard F1 score:            0.8000
