# 🐧 Case Study: Penguin Species Classification with GDA

---

## When Your Data is Gaussian: The Power of Generative Models

The Palmer Penguins dataset is a modern alternative to Iris. It covers three penguin species from Antarctica – Adelie, Chinstrap, and Gentoo – measured at Palmer Station.

> **The Challenge**: Classify penguin species based on physical measurements like bill length, bill depth, flipper length, and body mass.

### Why GDA?

Gaussian Discriminant Analysis is a **generative model**. Instead of directly learning a decision boundary (like logistic regression does), we model **how the data is generated** for each class.

The core idea? Each class has its own Gaussian distribution:

$$P(x|y=k) = \mathcal{N}(x;\, \mu_k, \Sigma_k)$$

Then we use Bayes' rule to flip it around:

$$P(y=k|x) \propto P(x|y=k) \cdot P(y=k)$$

**That's the key insight** – we model the *likelihood* of observations given each class, then use priors to compute posteriors. It's probability theory in action!
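
As a minimal sketch of this likelihood-times-prior step, the snippet below applies Bayes' rule to a single flipper-length measurement. The two class names, their means, standard deviations, and priors are hypothetical values chosen for illustration, not estimates from the dataset.

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of a univariate Gaussian N(mean, std^2) evaluated at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

# Hypothetical class parameters (illustrative only, not fitted to the penguins)
params = {
    "short_flipper": {"mean": 190.0, "std": 6.0, "prior": 0.6},
    "long_flipper": {"mean": 215.0, "std": 6.0, "prior": 0.4},
}

def posterior(x):
    """Bayes' rule: P(y=k|x) = P(x|y=k) P(y=k) / sum_j P(x|y=j) P(y=j)."""
    joint = {k: gaussian_pdf(x, p["mean"], p["std"]) * p["prior"]
             for k, p in params.items()}
    evidence = sum(joint.values())  # P(x), the normalizing constant
    return {k: v / evidence for k, v in joint.items()}

post = posterior(200.0)
for name, prob in post.items():
    print(f"P({name} | x=200) = {prob:.3f}")
```

The normalizing constant $P(x)$ is the same for every class, which is why the classifier only needs the proportional form above to pick a winner.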

### LDA vs QDA

- **LDA (Linear Discriminant Analysis)**: Assumes all classes share the same covariance matrix $\Sigma$. This gives us **linear** decision boundaries.
- **QDA (Quadratic Discriminant Analysis)**: Each class gets its own covariance $\Sigma_k$. This gives us **quadratic** (curved) decision boundaries.
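
A short derivation sketch of why the shared covariance leads to linear boundaries. Taking the log of the Gaussian density and dropping terms that do not depend on $k$, the score for class $k$ is:

$$\delta_k(x) = -\tfrac{1}{2} \log|\Sigma_k| - \tfrac{1}{2} (x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k) + \log P(y=k)$$

If every class shares $\Sigma_k = \Sigma$, the terms $-\tfrac{1}{2}\log|\Sigma|$ and $-\tfrac{1}{2} x^T \Sigma^{-1} x$ are identical across classes and cancel when scores are compared, leaving:

$$\delta_k(x) = x^T \Sigma^{-1} \mu_k - \tfrac{1}{2} \mu_k^T \Sigma^{-1} \mu_k + \log P(y=k)$$

This is linear in $x$, so the boundary $\delta_k(x) = \delta_j(x)$ is a hyperplane. With per-class $\Sigma_k$ the quadratic term survives, which gives QDA its curved boundaries.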

When should you use which? We'll explore that in this case study!

In [None]:
# ============================================================
# 📦 Setup & Data Loading
# ============================================================
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import multivariate_normal
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis, QuadraticDiscriminantAnalysis
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
import warnings
warnings.filterwarnings('ignore')

# Style setup
plt.style.use('seaborn-v0_8-whitegrid')
np.random.seed(42)

# Load the Palmer Penguins dataset
df = sns.load_dataset('penguins').dropna()  # Drop rows with missing values

print("🐧 Loaded Palmer Penguins Dataset!")
print(f"   {len(df)} penguins from Palmer Station, Antarctica")
print(f"\n   Species: {list(df['species'].unique())}")
print(f"   Islands: {list(df['island'].unique())}")
print(f"\n📊 Feature Statistics:")
print(df.describe().round(2))

In [None]:
# ============================================================
# 🔍 Exploratory Data Analysis
# ============================================================
# Let's visualize how well-separated the species are!

# Prepare the numeric features
feature_cols = ['bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'body_mass_g']

fig, axes = plt.subplots(2, 2, figsize=(12, 10))

for ax, col in zip(axes.flat, feature_cols):
    for species in df['species'].unique():
        data = df[df['species'] == species][col]
        ax.hist(data, alpha=0.5, label=species, bins=20, density=True)
    ax.set_xlabel(col.replace('_', ' ').title())
    ax.set_ylabel('Density')
    ax.legend()
    ax.set_title(f'Distribution of {col.replace("_", " ").title()}')

plt.suptitle('Feature Distributions by Species', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

print("💡 Key Observation: Look how each feature has different distributions per species!")
print("   This is exactly what GDA models – each class has its own Gaussian distribution.")

In [None]:
# ============================================================
# 📊 Pairplot: The Big Picture
# ============================================================
# This is the classic way to see class separability

g = sns.pairplot(df, vars=feature_cols, hue='species', 
                 diag_kind='kde', plot_kws={'alpha': 0.6, 's': 50})
g.fig.suptitle('Pairplot: Penguin Species Separability', y=1.02, fontsize=14, fontweight='bold')
plt.show()

print("🎯 What to look for:")
print("   • Clear separation between clusters → GDA will work well")
print("   • Elliptical clouds → Gaussian assumption is reasonable")
print("   • Similar spread within each class → LDA might be sufficient")
print("   • Different spreads per class → QDA might help")

In [None]:
## 🧮 GDA From Scratch: Building the Generative Model

Alright, now for the fun part – let's implement GDA ourselves!

### The Math Behind It

For each class $k$, we estimate:
1. **Prior probability**: $\phi_k = P(y=k) = \frac{n_k}{n}$
2. **Class mean**: $\mu_k = \frac{1}{n_k} \sum_{i: y^{(i)}=k} x^{(i)}$
3. **Class covariance**: $\Sigma_k = \frac{1}{n_k} \sum_{i: y^{(i)}=k} (x^{(i)} - \mu_k)(x^{(i)} - \mu_k)^T$

For **LDA**, we pool all class covariances into one shared $\Sigma$:
$$\Sigma = \sum_k \phi_k \cdot \Sigma_k$$

For **QDA**, each class keeps its own $\Sigma_k$.
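
The pooling formula can be checked numerically. The sketch below uses synthetic data (three made-up classes with two features, not the penguin measurements) and confirms that the prior-weighted sum of per-class MLE covariances equals the total within-class scatter divided by $n$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: three classes, two features (illustrative, not the penguin data)
class_sizes = [30, 50, 20]
class_centers = [0.0, 3.0, 6.0]
X = np.vstack([rng.normal(loc=c, scale=1.0, size=(m, 2))
               for c, m in zip(class_centers, class_sizes)])
y = np.repeat([0, 1, 2], class_sizes)
n = len(y)

# Pooled covariance as a prior-weighted sum of per-class MLE covariances
pooled = np.zeros((2, 2))
for k in np.unique(y):
    X_k = X[y == k]
    diff = X_k - X_k.mean(axis=0)
    sigma_k = diff.T @ diff / len(X_k)   # Sigma_k, dividing by n_k
    pooled += (len(X_k) / n) * sigma_k   # phi_k * Sigma_k

# Equivalent form: center each sample by its own class mean, then divide by n
class_means = np.array([X[y == k].mean(axis=0) for k in np.unique(y)])
centered = X - class_means[y]
within_scatter = centered.T @ centered / n

print(np.allclose(pooled, within_scatter))
```

The two forms agree because each $\phi_k \Sigma_k$ term is the class scatter matrix divided by $n$ once the $n_k$ factors cancel.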

### The Discriminant Function

To classify a new point $x$, we compute the **log-posterior** for each class:

$$\log P(y=k|x) = \log P(x|y=k) + \log P(y=k) + \text{const}$$

The class with the highest score wins!
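
If actual posterior probabilities are wanted rather than only the winning class, the unnormalized log-scores can be normalized with the log-sum-exp trick. This is a sketch under assumptions: the helper name `log_scores_to_posteriors` and the example scores are hypothetical and are not part of the implementation that follows.

```python
import numpy as np

def log_scores_to_posteriors(log_scores):
    """Turn rows of log P(x|y=k) + log P(y=k) into posteriors P(y=k|x)."""
    log_scores = np.asarray(log_scores, dtype=float)
    # Subtract the row maximum so np.exp never underflows to all zeros
    shifted = log_scores - log_scores.max(axis=1, keepdims=True)
    probs = np.exp(shifted)
    return probs / probs.sum(axis=1, keepdims=True)

# Hypothetical log-scores for two samples and three classes
scores = np.array([[-1050.0, -1052.0, -1060.0],
                   [-3.0, -1.0, -2.0]])
posteriors = log_scores_to_posteriors(scores)
predicted = posteriors.argmax(axis=1)
print(posteriors.round(3))
print(predicted)
```

Normalizing does not change the argmax, so predictions are the same whether they are taken from the raw log-scores or from the posteriors.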

# ============================================================
# 🔧 GDA Implementation From Scratch
# ============================================================

class GDA:
    """
    Gaussian Discriminant Analysis (LDA or QDA)
    
    This is a generative classifier that models:
    - P(x|y=k) as a multivariate Gaussian
    - P(y=k) as the class prior
    
    Then uses Bayes' rule for classification.
    """
    
    def __init__(self, shared_cov=True):
        """
        Parameters:
        -----------
        shared_cov : bool
            If True, use shared covariance (LDA)
            If False, use class-specific covariance (QDA)
        """
        self.shared_cov = shared_cov
        self.classes_ = None
        self.priors_ = {}      # P(y=k)
        self.means_ = {}       # μ_k for each class
        self.covs_ = {}        # Σ_k for each class
        self.shared_cov_ = None  # Shared Σ for LDA
    
    def fit(self, X, y):
        """Fit the GDA model by estimating parameters from data."""
        X = np.array(X)
        y = np.array(y)
        
        self.classes_ = np.unique(y)
        n_samples = len(y)
        
        # Step 1: Estimate parameters for each class
        for c in self.classes_:
            X_c = X[y == c]  # Samples belonging to class c
            n_c = len(X_c)
            
            # Prior: P(y=k) = n_k / n
            self.priors_[c] = n_c / n_samples
            
            # Mean: μ_k = average of class samples
            self.means_[c] = X_c.mean(axis=0)
            
            # Covariance: Σ_k = (X_c - μ_k)^T (X_c - μ_k) / n_c
            # Using n_c (not n_c-1) for MLE estimate
            diff = X_c - self.means_[c]
            self.covs_[c] = (diff.T @ diff) / n_c
        
        # Step 2: For LDA, compute pooled covariance
        if self.shared_cov:
            self.shared_cov_ = np.zeros_like(self.covs_[self.classes_[0]])
            for c in self.classes_:
                # Weight each class covariance by its prior
                self.shared_cov_ += self.priors_[c] * self.covs_[c]
        
        return self
    
    def _compute_log_likelihood(self, X, class_label):
        """Compute log P(x|y=k) for each sample."""
        mean = self.means_[class_label]
        cov = self.shared_cov_ if self.shared_cov else self.covs_[class_label]
        
        # Add small regularization for numerical stability
        cov_reg = cov + 1e-6 * np.eye(cov.shape[0])
        
        # Log of multivariate Gaussian PDF
        try:
            rv = multivariate_normal(mean=mean, cov=cov_reg)
            return rv.logpdf(X)
        except (np.linalg.LinAlgError, ValueError):
            # Fallback for singular or non-positive-definite covariance matrices
            return np.full(len(X), -np.inf)
    
    def predict_log_proba(self, X):
        """Compute log posterior for each class."""
        X = np.array(X)
        log_posteriors = np.zeros((len(X), len(self.classes_)))
        
        for i, c in enumerate(self.classes_):
            # log P(y=k|x) ∝ log P(x|y=k) + log P(y=k)
            log_likelihood = self._compute_log_likelihood(X, c)
            log_prior = np.log(self.priors_[c])
            log_posteriors[:, i] = log_likelihood + log_prior
        
        return log_posteriors
    
    def predict(self, X):
        """Predict class labels."""
        log_proba = self.predict_log_proba(X)
        return self.classes_[np.argmax(log_proba, axis=1)]
    
    def score(self, X, y):
        """Return accuracy score."""
        return accuracy_score(y, self.predict(X))

print("✅ GDA class implemented!")
print("   • shared_cov=True  → LDA (Linear Discriminant Analysis)")
print("   • shared_cov=False → QDA (Quadratic Discriminant Analysis)")

# ============================================================
# 🧪 Test Our Implementation
# ============================================================

# Prepare the data
X = df[feature_cols].values
y = df['species'].values

# Encode labels to integers for consistency
le = LabelEncoder()
y_encoded = le.fit_transform(y)
class_names = le.classes_

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y_encoded, test_size=0.3, random_state=42, stratify=y_encoded
)

print(f"📊 Data Split:")
print(f"   Training: {len(X_train)} samples")
print(f"   Testing:  {len(X_test)} samples")

# Train our GDA models
lda_scratch = GDA(shared_cov=True)   # LDA
qda_scratch = GDA(shared_cov=False)  # QDA

lda_scratch.fit(X_train, y_train)
qda_scratch.fit(X_train, y_train)

# Evaluate
lda_acc = lda_scratch.score(X_test, y_test)
qda_acc = qda_scratch.score(X_test, y_test)

print(f"\n🎯 Test Accuracy (From Scratch):")
print(f"   LDA: {lda_acc*100:.1f}%")
print(f"   QDA: {qda_acc*100:.1f}%")

In [None]:
# ============================================================
# 📐 Visualize the Learned Distributions
# ============================================================
# Let's see what our model actually learned!

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# Use two features for 2D visualization
feat1, feat2 = 0, 1  # bill_length_mm vs bill_depth_mm

for ax, (title, model) in zip(axes, [
    ('LDA (Shared Covariance)', lda_scratch),
    ('QDA (Class-Specific Covariance)', qda_scratch),
    ('Data Points', None)
]):
    # Plot data points
    for i, c in enumerate(model.classes_ if model else np.unique(y_train)):
        mask = y_train == c
        ax.scatter(X_train[mask, feat1], X_train[mask, feat2], 
                  alpha=0.6, label=class_names[c], s=40)
    
    if model:
        # Plot class means
        for c in model.classes_:
            mean = model.means_[c]
            ax.scatter(mean[feat1], mean[feat2], marker='X', s=200, 
                      c='black', edgecolors='white', linewidths=2)
        
        # Draw covariance ellipses
        from matplotlib.patches import Ellipse
        for c in model.classes_:
            mean = model.means_[c]
            cov_2d = (model.shared_cov_ if model.shared_cov else model.covs_[c])[[feat1, feat2]][:, [feat1, feat2]]
            
            # Eigenvalue decomposition for ellipse
            eigenvalues, eigenvectors = np.linalg.eigh(cov_2d)
            angle = np.degrees(np.arctan2(eigenvectors[1, 0], eigenvectors[0, 0]))
            width, height = 2 * 2 * np.sqrt(eigenvalues)  # 2 std deviations
            
            ellipse = Ellipse(xy=(mean[feat1], mean[feat2]), 
                            width=width, height=height, angle=angle,
                            fill=False, linewidth=2, linestyle='--')
            ax.add_patch(ellipse)
    
    ax.set_xlabel(feature_cols[feat1].replace('_', ' ').title())
    ax.set_ylabel(feature_cols[feat2].replace('_', ' ').title())
    ax.set_title(title)
    ax.legend(loc='upper right')

plt.tight_layout()
plt.show()

print("💡 The ellipses show 2σ contours of the fitted Gaussians.")
print("   Notice: LDA has identical ellipse shapes (shared Σ), QDA allows different shapes!")

In [None]:
# ============================================================
# 🎨 Decision Boundary Visualization
# ============================================================
# Let's see where LDA and QDA draw the lines!

def plot_decision_boundary(model, X, y, title, ax, feature_idx=(0, 1)):
    """Plot decision boundary for 2 features."""
    f1, f2 = feature_idx
    
    # Create mesh grid
    x_min, x_max = X[:, f1].min() - 1, X[:, f1].max() + 1
    y_min, y_max = X[:, f2].min() - 1, X[:, f2].max() + 1
    xx, yy = np.meshgrid(np.linspace(x_min, x_max, 200),
                         np.linspace(y_min, y_max, 200))
    
    # For prediction, we need all 4 features - use mean for the others
    mesh_points = np.c_[xx.ravel(), yy.ravel()]
    X_full = np.zeros((len(mesh_points), X.shape[1]))
    X_full[:, f1] = mesh_points[:, 0]
    X_full[:, f2] = mesh_points[:, 1]
    # Use training means for other features
    for i in range(X.shape[1]):
        if i not in [f1, f2]:
            X_full[:, i] = X[:, i].mean()
    
    Z = model.predict(X_full).reshape(xx.shape)
    
    # Plot
    ax.contourf(xx, yy, Z, alpha=0.3, cmap='viridis')
    ax.contour(xx, yy, Z, colors='black', linewidths=0.5)
    
    for i in np.unique(y):
        mask = y == i
        ax.scatter(X[mask, f1], X[mask, f2], label=class_names[i], 
                  alpha=0.7, edgecolors='white', s=50)
    
    ax.set_xlabel(feature_cols[f1].replace('_', ' ').title())
    ax.set_ylabel(feature_cols[f2].replace('_', ' ').title())
    ax.set_title(title)
    ax.legend()

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

plot_decision_boundary(lda_scratch, X_train, y_train, 
                       'LDA: Linear Decision Boundaries', axes[0])
plot_decision_boundary(qda_scratch, X_train, y_train, 
                       'QDA: Quadratic Decision Boundaries', axes[1])

plt.tight_layout()
plt.show()

print("🔍 Key Difference:")
print("   • LDA: Straight lines separate classes (linear boundaries)")
print("   • QDA: Curved boundaries can better capture non-linear separation")
print("\n   When classes have different covariance structures, QDA shines!")

## 🔬 LDA vs QDA: When Does Each Shine?

Here's the trade-off:

| Aspect | LDA | QDA |
|--------|-----|-----|
| **Parameters** | Fewer (shared Σ) | More (per-class Σ_k) |
| **Boundary** | Linear | Quadratic |
| **Bias** | Higher | Lower |
| **Variance** | Lower | Higher |
| **Best When** | Small data, similar class spreads | Large data, different class spreads |

### The Bias-Variance Trade-off

- **LDA** makes a stronger assumption (shared covariance) → more bias, less variance
- **QDA** is more flexible → less bias, more variance

**Rule of thumb**: If you have limited data or classes look similarly shaped, go with LDA. If you have plenty of data and classes clearly have different shapes, try QDA.
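
To make the "fewer vs. more parameters" row concrete, here is a small counting sketch. The helper `gda_param_count` is a hypothetical name introduced for illustration; it counts free parameters for class means, symmetric covariance matrices, and priors (which sum to one).

```python
def gda_param_count(n_features, n_classes, shared_cov):
    """Count free parameters in a GDA model."""
    means = n_classes * n_features
    # A symmetric d x d covariance has d(d+1)/2 free entries
    cov_single = n_features * (n_features + 1) // 2
    covs = cov_single if shared_cov else n_classes * cov_single
    priors = n_classes - 1  # priors sum to 1, so one is determined by the rest
    return means + covs + priors

# Four features and three species, as in this case study
lda_params = gda_param_count(4, 3, shared_cov=True)
qda_params = gda_param_count(4, 3, shared_cov=False)
print(f"LDA: {lda_params} parameters, QDA: {qda_params} parameters")
```

The gap grows quadratically with the number of features, which is one reason QDA needs more data per class to estimate its covariances reliably.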

In [None]:
# ============================================================
# 📊 Compare with sklearn Implementation
# ============================================================
# Let's verify our implementation matches sklearn!

from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA_sklearn
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis as QDA_sklearn
from sklearn.linear_model import LogisticRegression

# Initialize models
models = {
    'LDA (sklearn)': LDA_sklearn(),
    'QDA (sklearn)': QDA_sklearn(),
    'LDA (scratch)': lda_scratch,
    'QDA (scratch)': qda_scratch,
    'Logistic Regression': LogisticRegression(max_iter=1000)  # multinomial by default; `multi_class` is deprecated in recent sklearn
}

print("📊 Model Comparison (Test Accuracy):")
print("=" * 50)

results = {}
for name, model in models.items():
    if 'scratch' not in name:
        model.fit(X_train, y_train)
    
    train_acc = model.score(X_train, y_train)
    test_acc = model.score(X_test, y_test)
    results[name] = {'train': train_acc, 'test': test_acc}
    
    print(f"{name:25s} | Train: {train_acc*100:5.1f}% | Test: {test_acc*100:5.1f}%")

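print("=" * 50)

# Measure agreement instead of asserting it: sklearn normalizes its covariance
# estimates differently, so predictions can differ on borderline points.
lda_agree = np.mean(models['LDA (sklearn)'].predict(X_test) == lda_scratch.predict(X_test))
qda_agree = np.mean(models['QDA (sklearn)'].predict(X_test) == qda_scratch.predict(X_test))
print(f"\n🔎 Prediction agreement with sklearn | LDA: {lda_agree*100:.1f}% | QDA: {qda_agree*100:.1f}%")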

In [None]:
# ============================================================
# 🔄 Cross-Validation: A More Robust Comparison
# ============================================================

print("📊 5-Fold Cross-Validation Results:")
print("=" * 55)

cv_models = {
    'LDA': LDA_sklearn(),
    'QDA': QDA_sklearn(),
    'Logistic Regression': LogisticRegression(max_iter=1000)  # multinomial by default; `multi_class` is deprecated in recent sklearn
}

cv_results = {}
for name, model in cv_models.items():
    scores = cross_val_score(model, X, y_encoded, cv=5)
    cv_results[name] = scores
    print(f"{name:25s} | Accuracy: {scores.mean()*100:5.1f}% (± {scores.std()*100:.1f}%)")

print("=" * 55)

# Visualize
fig, ax = plt.subplots(figsize=(10, 5))
positions = np.arange(len(cv_results))
bp = ax.boxplot(list(cv_results.values()), positions=positions, widths=0.6, patch_artist=True)

colors = ['#3498db', '#e74c3c', '#2ecc71']
for patch, color in zip(bp['boxes'], colors):
    patch.set_facecolor(color)
    patch.set_alpha(0.7)

ax.set_xticklabels(list(cv_results.keys()))
ax.set_ylabel('Accuracy')
ax.set_title('Cross-Validation Accuracy Comparison')
ax.set_ylim([0.85, 1.02])
ax.axhline(y=1.0, color='gray', linestyle='--', alpha=0.5)

plt.tight_layout()
plt.show()

print("\n💡 Insight: Compare the means and spreads above. On well-separated data like this, all three methods tend to perform similarly.")
print("   GDA's advantage comes when you want to understand the underlying distributions.")

In [None]:
# ============================================================
# 📈 Confusion Matrix & Detailed Analysis
# ============================================================

# Fit final LDA model on all training data
lda_final = LDA_sklearn()
lda_final.fit(X_train, y_train)
y_pred = lda_final.predict(X_test)

# Confusion matrix
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Confusion Matrix
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', 
            xticklabels=class_names, yticklabels=class_names, ax=axes[0])
axes[0].set_xlabel('Predicted')
axes[0].set_ylabel('Actual')
axes[0].set_title('Confusion Matrix (LDA)')

# Classification Report as heatmap
report = classification_report(y_test, y_pred, target_names=class_names, output_dict=True)
report_df = pd.DataFrame(report).iloc[:3, :3].T  # precision, recall, f1 for each class
sns.heatmap(report_df, annot=True, fmt='.2f', cmap='Greens', ax=axes[1])
axes[1].set_title('Classification Metrics by Class')

plt.tight_layout()
plt.show()

print("\n📋 Detailed Classification Report:")
print(classification_report(y_test, y_pred, target_names=class_names))

In [None]:
# ============================================================
# 🧠 LDA for Dimensionality Reduction
# ============================================================
# A bonus feature of LDA: it can project data to lower dimensions!

lda_transform = LDA_sklearn(n_components=2)
X_lda = lda_transform.fit_transform(X, y_encoded)

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# LDA projection
for i, species in enumerate(class_names):
    mask = y_encoded == i
    axes[0].scatter(X_lda[mask, 0], X_lda[mask, 1], 
                   label=species, alpha=0.7, s=60, edgecolors='white')

axes[0].set_xlabel('LDA Component 1')
axes[0].set_ylabel('LDA Component 2')
axes[0].set_title('LDA Projection: 4D → 2D')
axes[0].legend()

# explained variance ratio
explained_var = lda_transform.explained_variance_ratio_
axes[1].bar(['Component 1', 'Component 2'], explained_var * 100, color=['#3498db', '#e74c3c'])
axes[1].set_ylabel('Explained Variance (%)')
axes[1].set_title('Variance Explained by Each LDA Component')
for i, v in enumerate(explained_var):
    axes[1].text(i, v*100 + 1, f'{v*100:.1f}%', ha='center', fontweight='bold')

plt.tight_layout()
plt.show()

print("💡 LDA's Hidden Power: Dimensionality Reduction!")
print(f"   • Component 1 explains {explained_var[0]*100:.1f}% of between-class variance")
print(f"   • Component 2 explains {explained_var[1]*100:.1f}% of between-class variance")
print("\n   Unlike PCA (which maximizes total variance), LDA maximizes CLASS SEPARATION!")

## 🎓 Conclusion: Generative vs Discriminative Models

### What We Learned

1. **GDA is a generative model** – it learns the distribution of each class $P(x|y)$, then uses Bayes' rule
2. **LDA** assumes shared covariance → linear boundaries, fewer parameters, lower variance
3. **QDA** allows per-class covariance → quadratic boundaries, more flexible, higher variance
4. **LDA doubles as dimensionality reduction** – finding projections that maximize class separation
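
One detail behind point 4: with $K$ classes, LDA can yield at most $K-1$ useful projection directions, because the between-class scatter matrix has rank at most $K-1$. The sketch below illustrates this with a plain NumPy version of Fisher's criterion on synthetic data. The class means are made up, and this is not how sklearn computes its projection internally.

```python
import numpy as np

rng = np.random.default_rng(1)
K, d = 3, 4  # three classes, four features (same shape as the penguin problem)

# Hypothetical class means, chosen so they are not collinear
means = np.array([[0.0, 0.0, 0.0, 0.0],
                  [3.0, 0.0, 1.0, 0.0],
                  [0.0, 3.0, 0.0, 1.0]])
X = np.vstack([rng.normal(loc=means[k], scale=1.0, size=(40, d)) for k in range(K)])
y = np.repeat(np.arange(K), 40)

overall_mean = X.mean(axis=0)
S_w = np.zeros((d, d))  # within-class scatter
S_b = np.zeros((d, d))  # between-class scatter
for k in range(K):
    X_k = X[y == k]
    mu_k = X_k.mean(axis=0)
    S_w += (X_k - mu_k).T @ (X_k - mu_k)
    gap = (mu_k - overall_mean)[:, None]
    S_b += len(X_k) * (gap @ gap.T)

# Discriminant directions are eigenvectors of S_w^{-1} S_b
eigvals = np.linalg.eigvals(np.linalg.solve(S_w, S_b)).real
n_informative = int(np.sum(eigvals > 1e-8 * eigvals.max()))
print(f"Non-zero discriminant directions: {n_informative} (K - 1 = {K - 1})")
```

This is why a three-species problem with four features projects to at most two LDA components.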

### When to Use GDA?

| Use GDA When... | Consider Alternatives When... |
|-----------------|------------------------------|
| Data is roughly Gaussian | Data has complex, non-Gaussian shapes |
| You want interpretable parameters | You just need predictions |
| Dataset is small | Dataset is large (discriminative models shine) |
| Classes have similar spreads (LDA) or clearly different spreads (QDA) | Decision boundary is highly non-linear |

### The Generative Modeling Perspective

The beautiful thing about GDA is that it models **how data is generated**:

$$\text{Sample species } k \sim P(y) \;\rightarrow\; \text{Sample features } x \sim \mathcal{N}(\mu_k, \Sigma_k) \;\rightarrow\; \text{New penguin}$$

This generative story helps us:
- **Detect outliers** (low probability under all classes)
- **Generate synthetic data** (sample from the fitted Gaussians)
- **Handle missing data** (marginalize over missing features)
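
The first two uses can be sketched in a few lines. The parameters below are hypothetical stand-ins for a fitted two-class, two-feature model. The outlier check uses the Mahalanobis distance to the nearest class mean as a simple proxy for "low probability under all classes", which is a swapped-in simplification rather than a full density threshold.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical fitted parameters: two classes, two features (illustrative only)
priors = np.array([0.6, 0.4])
means = np.array([[39.0, 18.0], [47.0, 15.0]])
covs = np.array([[[7.0, 1.0], [1.0, 1.5]],
                 [[9.0, 1.5], [1.5, 1.0]]])

def sample(n_samples):
    """Ancestral sampling: draw a class from the prior, then features from its Gaussian."""
    labels = rng.choice(len(priors), size=n_samples, p=priors)
    features = np.array([rng.multivariate_normal(means[k], covs[k]) for k in labels])
    return features, labels

def min_mahalanobis(x):
    """Smallest Mahalanobis distance from x to any class mean."""
    distances = []
    for mu, cov in zip(means, covs):
        diff = x - mu
        distances.append(np.sqrt(diff @ np.linalg.solve(cov, diff)))
    return min(distances)

X_new, y_new = sample(500)
typical = min_mahalanobis(np.array([39.0, 18.0]))  # exactly at a class mean
unusual = min_mahalanobis(np.array([80.0, 5.0]))   # far from both classes
print(f"typical: {typical:.2f}, unusual: {unusual:.2f}")
```

A point whose distance to every class is large is poorly explained by all of the fitted Gaussians, so it is a reasonable candidate to flag for review.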

> *"Sometimes the best way to classify is to first understand how each class generates its data."*

---

🐧 **Fun Fact**: The Palmer Penguins dataset was collected by Dr. Kristen Gorman at Palmer Station, Antarctica. It's become the modern, more ethically sourced alternative to the classic Iris dataset!