## 7. The Bayes Classifier for Minimum Error

In the previous sections, we discussed general principles of prediction and loss functions, primarily in the context of regression where the target variable $\mathbf{y}$ is continuous. Now, we turn our attention specifically to **classification problems**.

In classification, we are given a feature vector $\mathbf{x}$ and our goal is to assign it to one of $K$ discrete classes, which we can denote as $C_1, C_2, \dots, C_K$. Let $t$ represent the true class label for a given $\mathbf{x}$ (so $t \in \{C_1, \dots, C_K\}$). We want to find a decision rule, or a classifier, $f(\mathbf{x})$, that takes an input $\mathbf{x}$ and assigns it to one of the $K$ classes.

A common goal in classification is to **minimize the probability of misclassification**. This means we want to choose $f(\mathbf{x})$ such that the probability $p(f(\mathbf{x}) \neq t)$, taken over the joint distribution of $\mathbf{x}$ and $t$, is as small as possible.

The decision rule that minimizes this misclassification probability is known as the **Bayes classifier**. For each input $\mathbf{x}$, the Bayes classifier assigns $\mathbf{x}$ to the class $C_k$ that has the highest posterior probability $p(C_k|\mathbf{x})$. That is:

$$
f_{\text{Bayes}}(\mathbf{x}) = \underset{C_k \in \{C_1, \dots, C_K\}}{\mathrm{argmax}} \ p(C_k | \mathbf{x})
$$

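This is also known as the **Maximum A Posteriori (MAP)** decision rule for the class label. To see why it is optimal, note that for a given $\mathbf{x}$, assigning it to class $C_k$ is wrong with probability $1 - p(C_k|\mathbf{x})$, so choosing the class with the largest posterior minimizes the error probability at that $\mathbf{x}$. Applying this rule at every $\mathbf{x}$ therefore yields the lowest possible misclassification rate, which is called the **Bayes error rate**.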

We can denote the ideal mapping corresponding to the Bayes classifier as $f^{\bullet}(\mathbf{x})$.
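The argmax rule can be made concrete on a toy problem where the posteriors are known exactly. The sketch below is only an illustration: the two 1-D Gaussian classes, their means, standard deviations and equal priors are assumptions chosen for convenience, and the helper names (`posteriors`, `bayes_classifier`) are hypothetical. It computes $p(C_k|\mathbf{x})$ with Bayes' theorem, applies the argmax, and compares a Monte Carlo estimate of the Bayes error rate with the closed-form value for this particular setup.

```python
import numpy as np
from math import erf, sqrt

# Assumed toy problem (not from the text): two classes with known
# 1-D Gaussian class-conditional densities and known priors.
PRIORS = np.array([0.5, 0.5])   # p(C_1), p(C_2)
MEANS = np.array([-1.0, 1.0])   # class-conditional means
STDS = np.array([1.0, 1.0])     # class-conditional standard deviations


def gaussian_pdf(x, mean, std):
    """Density of N(mean, std^2) evaluated at x."""
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))


def posteriors(x):
    """Return p(C_k | x) for every class; result has shape (len(x), K)."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    joint = gaussian_pdf(x[:, None], MEANS, STDS) * PRIORS  # p(x | C_k) p(C_k)
    return joint / joint.sum(axis=1, keepdims=True)         # divide by p(x)


def bayes_classifier(x):
    """Index k of the class with the highest posterior probability."""
    return np.argmax(posteriors(x), axis=1)


# Monte Carlo estimate of the Bayes error rate: sample (x, t) from the
# true joint distribution and count how often the Bayes classifier is wrong.
rng = np.random.default_rng(0)
n_samples = 200_000
t = rng.choice(len(PRIORS), size=n_samples, p=PRIORS)
x = rng.normal(MEANS[t], STDS[t])
mc_error = np.mean(bayes_classifier(x) != t)

# For this symmetric setup the decision boundary is x = 0, so the Bayes
# error equals the standard normal CDF evaluated at -1.
exact_error = 0.5 * (1.0 + erf(-1.0 / sqrt(2.0)))

print(f"Monte Carlo Bayes error: {mc_error:.4f}")
print(f"Closed-form Bayes error: {exact_error:.4f}")
```

Even with the true posteriors in hand, the error does not reach zero here, because the two class-conditional densities overlap.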

### 7.1 The Challenge: Unknown Posterior Probabilities

The main challenge is that the true posterior probabilities $p(C_k|\mathbf{x})$ are generally unknown in practice. We typically only have a training dataset $\mathcal{D}_{\text{train}} = \{(\mathbf{x}_n, t_n)\}_{n=1}^D$, where $t_n$ is the true class label for $\mathbf{x}_n$.

Therefore, we aim to find an **approximate classifier**, let's call it $f^*(\mathbf{x})$, based on the training data. There are two main approaches to this:

1.  **Discriminative Approach:**
    *   Model the posterior probabilities $p(C_k|\mathbf{x})$ directly.
    *   Examples: Logistic Regression, Neural Networks (with softmax output).
    *   Once $p(C_k|\mathbf{x})$ is estimated, the MAP rule is applied.

2.  **Generative Approach:**
    *   Model the class-conditional densities $p(\mathbf{x}|C_k)$ and the class priors $p(C_k)$.
    *   Then, use Bayes' theorem to find the posterior probabilities:
        $$
        p(C_k|\mathbf{x}) = \frac{p(\mathbf{x}|C_k)p(C_k)}{p(\mathbf{x})} = \frac{p(\mathbf{x}|C_k)p(C_k)}{\sum_{j=1}^K p(\mathbf{x}|C_j)p(C_j)}
        $$
    *   Examples: Naive Bayes, Gaussian Discriminant Analysis.
    *   Generative models can also be used to generate synthetic data points $\mathbf{x}$ by first sampling a class from $p(C_k)$ and then sampling $\mathbf{x}$ from $p(\mathbf{x}|C_k)$.

The methods we will explore in this series, such as k-Nearest Neighbors, often take a more direct, data-driven approach: they approximate the decision rule without explicitly forming full probabilistic models for these distributions.
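The two approaches listed above can be contrasted in a few lines. The following is a minimal sketch under stated assumptions, not a reference implementation: the training data are two synthetic 2-D Gaussian blobs invented for the example, the generative model uses per-class Gaussians with diagonal covariance (a Naive Bayes style simplification), and the discriminative model is scikit-learn's `LogisticRegression` with default settings. Names such as `generative_posterior` are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Assumed synthetic training data: two 2-D Gaussian blobs, one per class.
n_per_class = 200
X = np.vstack([
    rng.normal([-1.0, 0.0], 1.0, size=(n_per_class, 2)),
    rng.normal([1.5, 0.5], 1.0, size=(n_per_class, 2)),
])
t = np.repeat([0, 1], n_per_class)

# --- Generative approach: estimate p(C_k) and p(x | C_k), then use Bayes' theorem ---
classes = np.unique(t)
priors = np.array([np.mean(t == k) for k in classes])              # p(C_k)
means = np.array([X[t == k].mean(axis=0) for k in classes])        # per-class means
variances = np.array([X[t == k].var(axis=0) for k in classes])     # per-class, per-feature


def generative_posterior(X_new):
    """p(C_k | x) from diagonal-Gaussian class-conditionals and class priors."""
    X_new = np.atleast_2d(X_new)
    diff = X_new[:, None, :] - means[None, :, :]                   # shape (N, K, d)
    log_lik = -0.5 * np.sum(diff ** 2 / variances
                            + np.log(2.0 * np.pi * variances), axis=2)
    log_joint = log_lik + np.log(priors)                           # log p(x | C_k) p(C_k)
    log_joint -= log_joint.max(axis=1, keepdims=True)              # numerical stability
    joint = np.exp(log_joint)
    return joint / joint.sum(axis=1, keepdims=True)                # normalize by p(x)


# --- Discriminative approach: model p(C_k | x) directly ---
disc = LogisticRegression().fit(X, t)

# Apply the MAP rule to both posterior estimates.
X_new = np.array([[-2.0, 0.0], [2.0, 0.5]])
gen_post = generative_posterior(X_new)
disc_post = disc.predict_proba(X_new)
gen_pred = classes[np.argmax(gen_post, axis=1)]
disc_pred = disc.classes_[np.argmax(disc_post, axis=1)]

print("Generative posteriors:\n", gen_post.round(3))
print("Discriminative posteriors:\n", disc_post.round(3))
print("MAP predictions:", gen_pred, disc_pred)
```

Both routes end with the same MAP step; they differ only in which distributions are estimated from the training data.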

### 7.2 Evaluating Classifier Performance: Classification Errors

When we learn a classifier $f^*(\mathbf{x})$ from training data, we need ways to evaluate its performance.

*   **Training Data vs. Test Data:**
    *   The data used to learn $f^*(\mathbf{x})$ is the **training data**, $\mathcal{D}_{\text{train}} = \{(\mathbf{x}_n, t_n)\}_{n=1}^D$.
    *   To assess how well the classifier generalizes to new, unseen data, we use a separate **test data** set, $\mathcal{D}_{\text{test}} = \{(\mathbf{x}_m, t_m)\}_{m=1}^M$, which is assumed to be drawn from the same underlying distribution as the training data but was not used during the learning process.

*   **Empirical Error on Training Data (Training Error Rate):**
    This measures the fraction of misclassifications on the training set:
    $$
    R_{\text{emp}}^{\text{train}}(f^*) = \frac{1}{D} \sum_{n=1}^{D} I(f^*(\mathbf{x}_n) \neq t_n)
    $$
    where $I(\cdot)$ is the **indicator function**, defined as:
    $$
    I(\text{condition}) = \begin{cases} 1 & \text{if condition is true} \\ 0 & \text{if condition is false} \end{cases}
    $$
    The training error is often an overly optimistic estimate of how the classifier will perform on new data, as $f^*(\mathbf{x})$ was optimized using this very data.

*   **Empirical Error on Test Data (Test Error Rate):**
    This measures the fraction of misclassifications on the test set:
    $$
    R_{\text{emp}}^{\text{test}}(f^*) = \frac{1}{M} \sum_{m=1}^{M} I(f^*(\mathbf{x}_m) \neq t_m)
    $$
    The test error provides a more realistic estimate of the classifier's performance on unseen data.

*   **Generalization Error (True Error Rate):**
    The ultimate measure of a classifier's performance is its **generalization error**, $R(f^*)$, which is the expected error over the true underlying data distribution $p(\mathbf{x}, t)$:
    $$
    R(f^*) = \mathbb{E}_{\mathbf{x},t}[I(f^*(\mathbf{x}) \neq t)]
    $$
    This can also be expressed as the probability of making an erroneous decision:
    $$
    R(f^*) = p(f^*(\mathbf{x}) \neq t)
    $$
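    In practice, we cannot compute $R(f^*)$ exactly because $p(\mathbf{x}, t)$ is unknown. Because the test set is drawn from the same distribution and plays no part in learning $f^*$, the test error $R_{\text{emp}}^{\text{test}}(f^*)$ serves as an unbiased estimate of the generalization error.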

*   **Generalization and Overfitting:**
    A good learning algorithm should not only achieve a low training error but also generalize well to new data, meaning the test error should also be low and not significantly higher than the training error.
    If a classifier performs very well on the training data (low $R_{\text{emp}}^{\text{train}}$) but poorly on the test data (high $R_{\text{emp}}^{\text{test}}$), it is said to have **overfit** the training data. This means it has learned the noise or specific idiosyncrasies of the training set rather than the true underlying patterns.

    **Example: A Memorizing Classifier and Overfitting**
    Consider a classifier $f_{\text{mem}}(\mathbf{x})$ that perfectly memorizes the training data:
    $$
    f_{\text{mem}}(\mathbf{x}) = \begin{cases} t_n & \text{if } \mathbf{x} = \mathbf{x}_n \text{ for some } (\mathbf{x}_n, t_n) \in \mathcal{D}_{\text{train}} \\ \text{a randomly chosen class} & \text{if } \mathbf{x} \text{ is not in } \mathcal{D}_{\text{train}} \end{cases}
    $$
    For this classifier:
    *   The training error $R_{\text{emp}}^{\text{train}}(f_{\text{mem}})$ will be 0, provided no two identical training inputs $\mathbf{x}_n$ carry conflicting labels (otherwise it will be slightly above 0).
    *   However, its test error $R_{\text{emp}}^{\text{test}}(f_{\text{mem}})$ can be very poor, especially if the input space is continuous or high-dimensional: new feature vectors $\mathbf{x}_m$ are unlikely to exactly match any of the training vectors $\mathbf{x}_n$, so most test predictions are random guesses. This is a classic example of overfitting.

The crucial aspect of machine learning is to develop classifiers that generalize well to unseen data, capturing the underlying structure of the data distribution rather than just memorizing the training examples.
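The empirical error formulas and the memorizing classifier can be sketched directly. The example below is illustrative only: the data-generating process (uniform inputs on the unit square, a linear class boundary, 10% label noise) is an assumption made up for the demonstration, and `MemorizingClassifier`, `sample_data` and `empirical_error` are hypothetical names rather than part of any library.

```python
import numpy as np

rng = np.random.default_rng(0)


def sample_data(n):
    """Assumed toy distribution: x ~ Uniform[0, 1]^2, label 1 if x0 + x1 > 1,
    with each label flipped independently with probability 0.1."""
    X = rng.uniform(size=(n, 2))
    t = (X.sum(axis=1) > 1.0).astype(int)
    flip = rng.uniform(size=n) < 0.1
    return X, np.where(flip, 1 - t, t)


def empirical_error(predictions, targets):
    """Mean of the indicator I(f(x) != t) over a data set."""
    return np.mean(predictions != targets)


class MemorizingClassifier:
    """Returns the stored label for inputs seen in training and a
    uniformly random class for every other input."""

    def __init__(self, n_classes, rng):
        self.n_classes = n_classes
        self.rng = rng
        self.table = {}

    def fit(self, X, t):
        # Exact-match lookup table from input vector to label.
        self.table = {tuple(x): int(label) for x, label in zip(X, t)}
        return self

    def predict(self, X):
        preds = []
        for x in X:
            key = tuple(x)
            if key in self.table:
                preds.append(self.table[key])
            else:
                preds.append(int(self.rng.integers(self.n_classes)))
        return np.array(preds)


X_train, t_train = sample_data(500)
X_test, t_test = sample_data(2000)

f_mem = MemorizingClassifier(n_classes=2, rng=rng).fit(X_train, t_train)
train_error = empirical_error(f_mem.predict(X_train), t_train)
test_error = empirical_error(f_mem.predict(X_test), t_test)

print(f"Training error: {train_error:.3f}")
print(f"Test error:     {test_error:.3f}")
```

Because the inputs are continuous, test points essentially never coincide with training points, so the test error sits near the chance level for two classes while the training error is zero.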

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# --- Helper function to plot decision boundaries (from scikit-learn documentation) ---
def plot_decision_boundary(clf, X, y, ax, title):
    h = .02  # step size in the mesh
    # create a mesh to plot in
    x_min, x_max = X[:, 0].min() - .5, X[:, 0].max() + .5
    y_min, y_max = X[:, 1].min() - .5, X[:, 1].max() + .5
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                         np.arange(y_min, y_max, h))

    # Plot the decision boundary. For that, we will assign a color to each
    # point in the mesh [x_min, x_max]x[y_min, y_max].
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])

    # Put the result into a color plot
    Z = Z.reshape(xx.shape)
    ax.contourf(xx, yy, Z, cmap=plt.cm.RdYlBu, alpha=0.8)

    # Plot also the training points
    ax.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.RdYlBu, edgecolors='k')
    ax.set_xlim(xx.min(), xx.max())
    ax.set_ylim(yy.min(), yy.max())
    ax.set_xticks(())
    ax.set_yticks(())
    ax.set_title(title)

# --- 1. Generate Synthetic Data ---
# make_moons is good for showing non-linear boundaries
# noise=0.6 is large relative to the moons' size, so the classes overlap heavily
# and even a well-chosen classifier will have a clearly non-zero error
X, y = make_moons(n_samples=300, noise=0.6, random_state=42)

# --- 2. Split into Training and Test sets ---
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42)

# --- 3. Define Models ---
# Model 1: A "reasonable" classifier (proxy for a good model)
# Using KNeighborsClassifier with a moderate k
reasonable_clf = KNeighborsClassifier(n_neighbors=15)

# Model 2: A classifier prone to overfitting
# KNeighborsClassifier with k=1 (memorizes training data)
overfitting_clf = KNeighborsClassifier(n_neighbors=1)

# (Optional) Model 3: A simple linear model that might underfit
# simple_clf = LogisticRegression(solver='liblinear', random_state=42) # Example

classifiers = {
    "Reasonable (k=15 NN)": reasonable_clf,
    "Overfitting (k=1 NN)": overfitting_clf,
    # "Simple Linear (Logistic Regression)": simple_clf
}

# --- 4. Train and Evaluate ---
header = f"{'Classifier':<35} | {'Train Accuracy':<15} | {'Test Accuracy':<15} | {'Train Error':<15} | {'Test Error':<15}"
print(header)
print("-" * len(header))  # separator matches the header width

fig, axes = plt.subplots(1, len(classifiers), figsize=(5 * len(classifiers), 5))
if len(classifiers) == 1: # Ensure axes is iterable
    axes = [axes]

for i, (name, clf) in enumerate(classifiers.items()):
    # Train the classifier
    clf.fit(X_train, y_train)

    # Predictions
    y_train_pred = clf.predict(X_train)
    y_test_pred = clf.predict(X_test)

    # Calculate accuracy
    train_accuracy = accuracy_score(y_train, y_train_pred)
    test_accuracy = accuracy_score(y_test, y_test_pred)

    # Calculate error rates (1 - accuracy)
    train_error = 1 - train_accuracy
    test_error = 1 - test_accuracy

    print(f"{name:<35} | {train_accuracy:^15.3f} | {test_accuracy:^15.3f} | {train_error:^15.3f} | {test_error:^15.3f}")

    # Plot decision boundary
    plot_decision_boundary(clf, X_train, y_train, axes[i], f"{name}\nTrain Acc: {train_accuracy:.2f}, Test Acc: {test_accuracy:.2f}")

plt.tight_layout()
plt.show()

print("\nDiscussion:")
print(" - The 'Reasonable (k=15 NN)' classifier typically shows a better balance: its training and test accuracies tend to be close, which indicates good generalization.")
print(" - The 'Overfitting (k=1 NN)' classifier achieves very high (often perfect) accuracy on the training data.")
print("   However, its test accuracy is noticeably lower. This gap between training and test performance is a hallmark of overfitting.")
print("   The decision boundary for k=1 NN will likely be very complex and 'jagged', trying to perfectly separate every training point.")
print(" - If a 'Simple Linear' model were used on these non-linear 'moons' data, it would likely show underfitting:")
print("   both training and test accuracy would be relatively low because the model is too simple to capture the data's structure.")