# Cluster-Distance Feature Boost for Binary Perceptron

---

## 🧩 Problem Statement

**What problem are we solving?**

Imagine a teacher trying to identify students in "Class A" using basic info like height and weight. A simple classifier (Perceptron) struggles with just these 2 features.

**The Solution:** Add "distance-to-centroid" features from K-Means clustering. These tell us how close each student is to the "average" of each class.

---

## 🪜 Steps to Solve

1. Generate synthetic data with 3 clusters (900 points)
2. Create binary labels (cluster 0 = 1, others = 0)
3. Standardize features (mean=0, std=1)
4. Fit K-Means to find 3 cluster centers
5. Compute distance-to-centroid features
6. Train baseline Perceptron (2 original features)
7. Train enhanced Perceptron (2 original + 3 distance features)
8. Compare metrics over 5 random splits

---

## 🎯 Expected Output

| Metric | Baseline | Enhanced | Improvement (percentage points) |
|--------|----------|----------|---------------------------------|
| Accuracy | ~57% | ~92% | +35 |
| Precision | ~32% | ~90% | +58 |
| Recall | ~61% | ~89% | +28 |
| ROC AUC | ~49% | ~98% | +49 |

**Success Criteria:** At least one metric improves by ≥5 percentage points.

---

## Section 1: Importing Libraries

### 2.1 What: Import numpy
NumPy is a library for working with arrays and mathematical operations.

### 2.2 Why:
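We need arrays to store data efficiently. Looping over Python lists is slow for math operations; on large arrays NumPy is often one to two orders of magnitude faster because the work runs in compiled C code. The exact speedup depends on the operation and the array size.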

### 2.3 When:
Always import at the start of any data science or machine learning project.

### 2.4 Where:
Every ML/Data Science project uses numpy for numerical operations.

### 2.5 How:
```python
import numpy as np  # 'np' is the standard abbreviation
```

### 2.6 Internal Working:
NumPy stores data in contiguous memory blocks and uses compiled C code for operations, making it much faster than Python loops.
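To make the speed difference concrete, here is a minimal timing sketch. The array size and the sum-of-squares task are illustrative choices, not part of the experiment, and the measured times will vary by machine.

```python
import time
import numpy as np

values = list(range(1_000_000))                  # plain Python list
array = np.arange(1_000_000, dtype=np.int64)     # same numbers in one contiguous block

# Pure-Python loop: the interpreter handles one element at a time
start = time.perf_counter()
loop_total = sum(v * v for v in values)
loop_seconds = time.perf_counter() - start

# Vectorized: a single call that runs in compiled code
start = time.perf_counter()
numpy_total = int(np.dot(array, array))
numpy_seconds = time.perf_counter() - start

print(f"Python loop: {loop_seconds:.4f} s")
print(f"NumPy:       {numpy_seconds:.4f} s")
print(f"Same result: {loop_total == numpy_total}")
```

The explicit `dtype=np.int64` keeps the sum of squares from overflowing on platforms where NumPy's default integer is 32-bit.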

### 2.7 Output:
No visible output - just makes numpy available as 'np' for the rest of the notebook.

In [None]:
# Import numpy for numerical operations (arrays, math)
import numpy as np

### 2.1 What: Import make_blobs
A function to generate synthetic clustered data.

### 2.2 Why:
We need test data with known cluster structure. Real-world data is messy, but synthetic data lets us understand concepts clearly first. It's like practicing with training wheels!

### 2.3 When:
For learning, testing algorithms, or when real data isn't available.

### 2.4 Where:
Tutorials, prototyping, algorithm comparisons.

### 2.5 How:
```python
from sklearn.datasets import make_blobs
X, y = make_blobs(n_samples=100, centers=3)
```

### 2.6 Internal Working:
1. Picks k center points at random in feature space (unless you pass explicit coordinates)
2. Assigns roughly n_samples/k points to each center
3. Draws each point from an isotropic Gaussian around its center, with spread set by cluster_std
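The three steps above can be imitated by hand with NumPy. This is only a sketch of the idea, not scikit-learn's implementation: the seed, sample count, and variable names are made up for illustration, and the real `make_blobs` also shuffles the rows by default.

```python
import numpy as np

rng = np.random.default_rng(0)          # hypothetical seed
n_samples, k, cluster_std = 90, 3, 1.0  # illustrative sizes

# 1. Place k centers uniformly at random in a box
#    (make_blobs uses the box (-10, 10) by default)
centers = rng.uniform(-10.0, 10.0, size=(k, 2))

# 2-3. Draw n_samples // k points per center from an isotropic Gaussian
points, labels = [], []
for label, center in enumerate(centers):
    points.append(rng.normal(loc=center, scale=cluster_std, size=(n_samples // k, 2)))
    labels.append(np.full(n_samples // k, label))

X_demo = np.vstack(points)        # feature array
y_demo = np.concatenate(labels)   # cluster label per row
print(X_demo.shape, np.bincount(y_demo))
```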

### 2.7 Output:
Returns X (features array) and y (cluster labels).

In [None]:
# Import make_blobs to generate synthetic clustered data
from sklearn.datasets import make_blobs

### 2.1 What: Import StandardScaler
A tool to normalize features to mean=0 and standard deviation=1.

### 2.2 Why:
Different features have different scales (e.g., age in years, salary in thousands). Scaling makes them comparable. Both Perceptron and K-Means are sensitive to feature scales!

**Analogy:** It's like converting all currencies to dollars before comparing prices.

### 2.3 When:
Before training most ML models, especially distance-based ones.

### 2.4 Where:
Almost every ML pipeline includes scaling as a preprocessing step.

### 2.5 How:
```python
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
```

### 2.6 Internal Working:
For each value: `z = (value - mean) / std_deviation`
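
As a quick check of the formula, the sketch below standardizes a tiny made-up table (ages and salaries are hypothetical values) with plain NumPy. It uses the population standard deviation (`ddof=0`), which is the convention `StandardScaler` follows.

```python
import numpy as np

# Hypothetical toy data: column 0 is age in years, column 1 is salary
X_demo = np.array([[25.0, 40_000.0],
                   [35.0, 60_000.0],
                   [45.0, 80_000.0]])

mean = X_demo.mean(axis=0)   # per-column mean
std = X_demo.std(axis=0)     # per-column population std (ddof=0)
X_z = (X_demo - mean) / std  # z = (value - mean) / std_deviation

print(X_z)
print("means:", X_z.mean(axis=0), "stds:", X_z.std(axis=0))
```

After the transform both columns live on the same scale, even though salary was originally thousands of times larger than age.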

### 2.7 Output:
Transformed data where each feature has mean≈0, std≈1.

In [None]:
# Import StandardScaler to normalize features
from sklearn.preprocessing import StandardScaler

### 2.1 What: Import KMeans
A clustering algorithm that groups data into k clusters.

### 2.2 Why:
We want to find natural groupings in data and use distances to these group centers as new features. This is the key to "boosting" our classifier!

**Analogy:** Finding the "center" of each friend group in a school cafeteria.

### 2.3 When:
When you suspect data has natural clusters/groups.

### 2.4 Where:
Customer segmentation, image compression, feature engineering.

### 2.5 How:
```python
kmeans = KMeans(n_clusters=3)
kmeans.fit(X)
```

### 2.6 Internal Working:
1. Initialize k random centroids
2. Assign each point to nearest centroid
3. Move centroids to mean of assigned points
4. Repeat until convergence
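
The four steps above are known as Lloyd's algorithm. A bare-bones NumPy sketch follows; the function name `lloyd_kmeans`, its defaults, and the toy data are assumptions for illustration. scikit-learn's `KMeans` adds smarter initialization (k-means++ by default) and multiple restarts.

```python
import numpy as np

def lloyd_kmeans(X, k, n_iter=100, seed=0):
    """Plain Lloyd's algorithm; `seed` and `n_iter` are illustrative defaults."""
    rng = np.random.default_rng(seed)
    # 1. Initialize centroids as k distinct data points
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # 2. Assign each point to its nearest centroid
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # 3. Move each centroid to the mean of its points (keep it if the cluster is empty)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # 4. Stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

# Two well-separated toy groups
X_demo = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                   [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
centroids, labels = lloyd_kmeans(X_demo, k=2)
print(centroids)
print(labels)
```

Which group ends up as cluster 0 depends on the random initialization, so only the grouping is meaningful, not the label numbers.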

### 2.7 Output:
Cluster labels and centroid locations.

In [None]:
# Import KMeans for clustering
from sklearn.cluster import KMeans

### 2.1 What: Import Perceptron
The simplest neural network - just weights + bias.

### 2.2 Why:
It's a great baseline to show improvement from feature engineering. If we can boost a simple model, the technique works!

**Analogy:** A teacher with one simple rule: "If score > 50, pass. Otherwise, fail."

### 2.3 When:
As a baseline, for linearly separable data, for teaching.

### 2.4 Where:
First step in learning neural networks, simple classification.

### 2.5 How:
```python
model = Perceptron()
model.fit(X, y)
```

### 2.6 Internal Working:
Learns weights w such that `sign(w·x + b)` predicts the class.
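
A minimal sketch of the classic perceptron update rule is shown below, assuming labels encoded as -1 and +1. The function `train_perceptron` and the toy data are hypothetical; scikit-learn's `Perceptron` is trained with stochastic gradient descent and differs in details such as shuffling and stopping criteria.

```python
import numpy as np

def train_perceptron(X, y, n_epochs=20, learning_rate=1.0):
    """Classic perceptron rule; labels are expected to be -1 or +1."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(n_epochs):
        for x_i, y_i in zip(X, y):
            # A point is misclassified when y_i * (w·x_i + b) <= 0
            if y_i * (np.dot(w, x_i) + b) <= 0:
                w += learning_rate * y_i * x_i   # nudge the boundary toward the point's class
                b += learning_rate * y_i
    return w, b

# Hypothetical linearly separable toy data
X_demo = np.array([[2.0, 1.0], [3.0, 2.0], [-1.0, -2.0], [-2.0, -1.0]])
y_demo = np.array([1, 1, -1, -1])

w, b = train_perceptron(X_demo, y_demo)
predictions = np.sign(X_demo @ w + b)
print("weights:", w, "bias:", b)
print("predictions:", predictions)
```

Weights only change on mistakes, so once every training point is on the correct side the model stops updating. On data that is not linearly separable, the updates never settle, which is why the baseline in this notebook struggles.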

### 2.7 Output:
Trained model that can predict class labels.

In [None]:
# Import Perceptron - the simplest neural network
from sklearn.linear_model import Perceptron

### 2.1 What: Import train_test_split
A function to divide data into training and test sets.

### 2.2 Why:
We need separate data to train and evaluate the model. Training and testing on the same data gives false confidence (overfitting). It's like a student memorizing answers vs understanding concepts!

### 2.3 When:
Always before training any ML model.

### 2.4 Where:
Every ML project with supervised learning.

### 2.5 How:
```python
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)
```

### 2.6 Internal Working:
Randomly shuffles data, then splits at the specified ratio (e.g., 75% train, 25% test).
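
The shuffle-then-cut idea can be sketched in a few lines of NumPy. The helper `shuffle_split` and its seed are made up for illustration; `train_test_split` offers more options (for example stratification) that this sketch leaves out.

```python
import numpy as np

def shuffle_split(X, y, test_size=0.25, seed=0):
    """Shuffle row indices, then cut them into train and test parts."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(X))
    n_test = int(np.ceil(test_size * len(X)))   # the test count is rounded up
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    # Index X and y with the same indices so rows and labels stay aligned
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

X_demo = np.arange(20).reshape(10, 2)   # row i is [2i, 2i + 1]
y_demo = np.arange(10)                  # label i belongs to row i
X_tr, X_te, y_tr, y_te = shuffle_split(X_demo, y_demo)
print(X_tr.shape, X_te.shape)
```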

### 2.7 Output:
Four arrays: training features, test features, training labels, test labels.

In [None]:
# Import train_test_split to divide data
from sklearn.model_selection import train_test_split

### 2.1 What: Import classification metrics
Functions to measure how good our classifier is.

### 2.2 Why:
We need to measure performance. Different metrics tell us different things:
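- **Accuracy**: Of all predictions, what fraction is correct?
- **Precision**: Of the points predicted positive, how many are actually positive?
- **Recall**: Of the actual positives, how many did we find?
- **ROC AUC**: How well do the model's scores rank positives above negatives?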

### 2.3 When:
After making predictions on test data.

### 2.4 Where:
Every classification problem needs evaluation metrics.

### 2.5 How:
```python
accuracy_score(y_true, y_pred)
precision_score(y_true, y_pred)
```

### 2.6 Internal Working:
Compares predictions to actual labels, calculates ratios.
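
To show the ratios involved, here is a hand computation on a small made-up example (all labels, predictions, and scores are hypothetical). ROC AUC is computed as the fraction of positive/negative pairs in which the positive gets the higher score, with ties counted as half.

```python
import numpy as np

y_true = np.array([1, 1, 1, 0, 0, 0, 0, 0])                       # hypothetical labels
y_scores = np.array([0.9, 0.8, 0.3, 0.6, 0.2, 0.1, 0.4, 0.05])    # hypothetical scores
y_pred = (y_scores > 0.5).astype(int)                             # threshold at 0.5

tp = np.sum((y_pred == 1) & (y_true == 1))   # true positives
fp = np.sum((y_pred == 1) & (y_true == 0))   # false positives
fn = np.sum((y_pred == 0) & (y_true == 1))   # false negatives
tn = np.sum((y_pred == 0) & (y_true == 0))   # true negatives

accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)
recall = tp / (tp + fn)

# ROC AUC: compare every positive score with every negative score
pos, neg = y_scores[y_true == 1], y_scores[y_true == 0]
diff = pos[:, None] - neg[None, :]
roc_auc = (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / diff.size

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} roc_auc={roc_auc:.3f}")
```

Note that ROC AUC needs continuous scores rather than hard 0/1 predictions, which is why the experiment later passes `decision_function` output to `roc_auc_score`.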

### 2.7 Output:
Numbers between 0 and 1 (higher is better).

In [None]:
# Import classification metrics
from sklearn.metrics import accuracy_score, precision_score, recall_score, roc_auc_score

---

## Section 2: Generate Synthetic Data

### 2.1 What: Create blob data with make_blobs
We're generating 900 data points grouped into 3 clusters.

### 2.2 Why:
We need structured data where we know the "ground truth". This helps us understand if our algorithm is working correctly.

**Analogy:** Creating a practice test where we already know all the answers.

### 2.3 When:
At the start of the project, before any training.

### 2.4 Where:
Any clustering or classification tutorial.

### 2.5 How:
```python
X, cluster_ids = make_blobs(n_samples=900, centers=3, ...)
```

### 2.6 Internal Working:
1. Randomly picks 3 center points
2. Generates 300 points around each center
3. Adds Gaussian noise based on cluster_std

### 2.7 Output:
- X: (900, 2) array of coordinates
- cluster_ids: (900,) array of cluster labels (0, 1, or 2)

### make_blobs Parameter Explanation

| Parameter | Value | Explanation |
|-----------|-------|-------------|
| `n_samples` | 900 | Total number of data points (300 per cluster) |
| `centers` | 3 | Number of cluster centers (groups) |
| `cluster_std` | [1.0, 1.2, 1.4] | Spread of each cluster (higher = more spread) |
| `random_state` | 12 | Seed for reproducibility (same data every run) |

In [None]:
# Generate synthetic blob data with 3 clusters
X, cluster_ids = make_blobs(
    n_samples=900,      # 900 total points (300 per cluster)
    centers=3,          # 3 cluster centers
    cluster_std=[1.0, 1.2, 1.4],  # Different spreads per cluster
    random_state=12,    # For reproducibility
)

print(f"X shape: {X.shape}")
print(f"First 5 points:\n{X[:5]}")
print(f"\nCluster IDs: {np.unique(cluster_ids)} (0, 1, 2)")

### 2.1 What: Create binary labels
Convert cluster labels (0, 1, 2) to binary (1, 0, 0).

### 2.2 Why:
We want a binary classification problem: "Is this point in cluster 0 or not?"

**Analogy:** Asking "Is this student in Class A?" instead of "Which of the 3 classes is this student in?"

### 2.3 When:
When converting multi-class to binary problem.

### 2.4 Where:
One-vs-all classification, binary problems.

### 2.5 How:
```python
y = (cluster_ids == 0).astype(int)
```

### 2.6 Internal Working:
- `cluster_ids == 0` gives True/False array
- `.astype(int)` converts True→1, False→0
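
A tiny illustration of the two steps, using a made-up array of cluster IDs:

```python
import numpy as np

cluster_ids_demo = np.array([0, 2, 1, 0, 1, 2])   # hypothetical cluster labels

mask = cluster_ids_demo == 0    # boolean array: True where the cluster is 0
y_demo = mask.astype(int)       # True becomes 1, False becomes 0

print(mask)
print(y_demo)
```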

### 2.7 Output:
300 points with label 1 and 600 with label 0, because make_blobs splits the 900 samples evenly across the 3 centers.

In [None]:
# Create binary labels: 1 if cluster 0, else 0
y = (cluster_ids == 0).astype(int)

print(f"Binary labels: {np.unique(y)} (0 or 1)")
print(f"Class 1 count: {sum(y)} (cluster 0 points)")
print(f"Class 0 count: {len(y) - sum(y)} (clusters 1 and 2)")

---

## Section 3: Visualize the Data

Let's see what our data looks like! This helps us understand why a simple Perceptron might struggle.

In [None]:
import matplotlib.pyplot as plt

# Create a scatter plot
plt.figure(figsize=(10, 6))

# Plot class 1 (cluster 0) in blue
plt.scatter(X[y==1, 0], X[y==1, 1], c='blue', label='Class 1 (cluster 0)', alpha=0.6)

# Plot class 0 (clusters 1, 2) in red
plt.scatter(X[y==0, 0], X[y==0, 1], c='red', label='Class 0 (clusters 1, 2)', alpha=0.6)

plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('Original Data: Binary Classification Problem')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

---

## Section 4: Run the Full Experiment

Now we'll run the complete experiment over 5 random splits to compare the baseline vs enhanced Perceptron.

### Helper Function: Standardize Features

This function scales features to have mean=0 and std=1.

In [None]:
def standardize_features(X_train, X_test):
    """
    Standardize features to mean=0, std=1.
    
    IMPORTANT: Fit on training data, transform both!
    This prevents data leakage.
    """
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)  # Learn from train
    X_test_scaled = scaler.transform(X_test)        # Apply to test
    return X_train_scaled, X_test_scaled

### Helper Function: Create Distance Features

This is the KEY function! It uses K-Means to compute distance-to-centroid features.

In [None]:
def create_distance_features(X_train, X_test, n_clusters=3):
    """
    Create distance-to-centroid features using K-Means.
    
    For each point, computes distance to each of the k cluster centers.
    This gives us k new features!
    """
    # Fit K-Means on training data only
    kmeans = KMeans(n_clusters=n_clusters, random_state=12, n_init=10)
    kmeans.fit(X_train)
    
    # Transform to get distances (this is the magic!)
    train_distances = kmeans.transform(X_train)
    test_distances = kmeans.transform(X_test)
    
    return train_distances, test_distances
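
To see what `kmeans.transform` returns, the sketch below compares it with a manual Euclidean distance computation. The toy data and seed are hypothetical; the K-Means settings mirror the helper above.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Hypothetical toy data: three loose groups in 2D
X_demo = np.vstack([rng.normal(loc=c, scale=0.5, size=(20, 2))
                    for c in ([0, 0], [5, 5], [0, 5])])

kmeans = KMeans(n_clusters=3, random_state=12, n_init=10).fit(X_demo)

# Manual Euclidean distance from every point to every centroid
manual = np.linalg.norm(
    X_demo[:, None, :] - kmeans.cluster_centers_[None, :, :], axis=2
)

print(manual.shape)                                   # one column per centroid
print(np.allclose(manual, kmeans.transform(X_demo)))  # do the two computations agree?
```

Each row of the result holds one distance per centroid, so with 3 clusters every point gains 3 new features. The column order follows K-Means' own centroid numbering, which need not match the original cluster IDs.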

### Helper Function: Train and Evaluate

Train a Perceptron and compute all metrics.

In [None]:
def train_and_evaluate(X_train, X_test, y_train, y_test):
    """
    Train Perceptron and return all metrics.
    """
    # Create and train Perceptron
    model = Perceptron(random_state=12, max_iter=1000, tol=1e-3)
    model.fit(X_train, y_train)
    
    # Predictions
    y_pred = model.predict(X_test)
    y_scores = model.decision_function(X_test)
    
    # Calculate metrics
    return {
        'accuracy': accuracy_score(y_test, y_pred),
        'precision': precision_score(y_test, y_pred, zero_division=0),
        'recall': recall_score(y_test, y_pred, zero_division=0),
        'roc_auc': roc_auc_score(y_test, y_scores)
    }

### Run the Experiment: 5 Random Splits

In [None]:
# Storage for metrics
baseline_metrics = {'accuracy': [], 'precision': [], 'recall': [], 'roc_auc': []}
enhanced_metrics = {'accuracy': [], 'precision': [], 'recall': [], 'roc_auc': []}

print("Running 5 random splits...\n")

for split_idx in range(5):
    # 1. Split data (different random state each time)
    random_state = 42 + split_idx
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=random_state
    )
    
    # 2. Standardize features
    X_train_scaled, X_test_scaled = standardize_features(X_train, X_test)
    
    # 3. Create distance features
    train_dist, test_dist = create_distance_features(X_train_scaled, X_test_scaled)
    
    # 4. Create enhanced feature sets (original + distance)
    X_train_enhanced = np.column_stack([X_train_scaled, train_dist])
    X_test_enhanced = np.column_stack([X_test_scaled, test_dist])
    
    # 5. Train and evaluate BASELINE (original 2 features)
    baseline_result = train_and_evaluate(
        X_train_scaled, X_test_scaled, y_train, y_test
    )
    
    # 6. Train and evaluate ENHANCED (5 features)
    enhanced_result = train_and_evaluate(
        X_train_enhanced, X_test_enhanced, y_train, y_test
    )
    
    # Collect metrics
    for metric in baseline_metrics:
        baseline_metrics[metric].append(baseline_result[metric])
        enhanced_metrics[metric].append(enhanced_result[metric])
    
    print(f"Split {split_idx + 1}: Baseline Acc={baseline_result['accuracy']:.3f}, "
          f"Enhanced Acc={enhanced_result['accuracy']:.3f}")

---

## Section 5: Results - Metric Comparison Table

In [None]:
# Calculate averages
baseline_avg = {k: np.mean(v) for k, v in baseline_metrics.items()}
enhanced_avg = {k: np.mean(v) for k, v in enhanced_metrics.items()}

# Print formatted results
print("\n" + "=" * 60)
print("RESULTS: AVERAGED OVER 5 RANDOM SPLITS")
print("=" * 60)

print(f"\n{'Metric':<12} {'Baseline':>12} {'Enhanced':>12} {'Improvement':>14}")
print("-" * 52)

for metric in ['accuracy', 'precision', 'recall', 'roc_auc']:
    base = baseline_avg[metric]
    enh = enhanced_avg[metric]
    improvement = (enh - base) * 100
    marker = " [OK]" if improvement >= 5 else ""
    print(f"{metric.upper():<12} {base:>12.4f} {enh:>12.4f} {improvement:>+11.2f} pp{marker}")

print("-" * 52)

---

## Section 6: Why Distance Features Help (200-Word Explanation)

The enhanced model outperforms the baseline because **distance-to-centroid features capture CLUSTER GEOMETRY** that the original 2D features cannot express.

In the original feature space, the Perceptron tries to draw a single linear boundary (hyperplane) to separate class 1 (cluster 0) from class 0 (clusters 1, 2). However, cluster 0 may be positioned such that a simple line cannot cleanly separate it from the overlapping regions of clusters 1 and 2.

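By adding distance features, we transform each point into a **5D space** (2 original coordinates + 3 distances). Below, "centroid 0" means the K-Means centroid that lands on cluster 0; K-Means numbers its centroids arbitrarily, so its column index may differ. In this space: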
- Points **CLOSE** to cluster 0's center have **SMALL** distance to centroid 0
- Points **FAR** from cluster 0's center have **LARGE** distance to centroid 0

This **BOUNDARY SHIFT** is critical: in the enhanced space, the decision boundary can now leverage "closeness to cluster 0" as a feature. Points with small distance-to-centroid-0 are highly likely to be class 1, regardless of their original x,y position.

The **cluster geometry** (tight cluster 0 with std=1.0 vs spread clusters 1,2 with std=1.2,1.4) means that distance-to-centroid-0 becomes a strong signal for class membership. The Perceptron's linear boundary in this enriched space effectively creates a **NON-LINEAR boundary** in the original 2D space.
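
The claim that a linear rule on a distance feature acts as a non-linear boundary in 2D can be illustrated with a deliberately simple sketch. The data, the centroid, and the radius are all hypothetical and chosen so that class 1 sits inside a circle.

```python
import numpy as np

rng = np.random.default_rng(0)
X_demo = rng.uniform(-3.0, 3.0, size=(500, 2))   # hypothetical 2D points
centroid = np.array([0.0, 0.0])                  # hypothetical cluster center
radius = 1.5                                     # hypothetical class radius

# Ground truth: class 1 lies inside a circle, which no single straight line in 2D can isolate
y_demo = (np.linalg.norm(X_demo - centroid, axis=1) < radius).astype(int)

# Add the distance-to-centroid feature
d = np.linalg.norm(X_demo - centroid, axis=1)

# A linear rule on the new feature alone: predict 1 when w * d + b > 0
w, b = -1.0, radius
y_pred = (w * d + b > 0).astype(int)

print(f"Accuracy of the linear rule on d: {(y_pred == y_demo).mean():.2f}")
```

The rule `w * d + b > 0` is a plain threshold on one feature, yet in the original coordinates it traces a circle around the centroid. The experiment's enhanced Perceptron has the same option available through its three distance columns.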

---

## Section 7: Success Criteria Check

In [None]:
# Check if we meet success criteria
print("\n" + "=" * 60)
print("SUCCESS CRITERIA CHECK")
print("=" * 60)

improvements = {
    metric: (enhanced_avg[metric] - baseline_avg[metric]) * 100
    for metric in baseline_avg
}

success = any(imp >= 5 for imp in improvements.values())

if success:
    print("\n[OK] SUCCESS: At least one metric improved by >=5 percentage points!\n")
    for metric, imp in improvements.items():
        if imp >= 5:
            print(f"   - {metric.upper()}: +{imp:.2f} pp")
else:
    print("\n[X] FAILED: No metric improved by >=5 percentage points.")

---

## Summary

| What We Did | Why It Worked |
|-------------|---------------|
| Added distance-to-centroid features | Captured cluster geometry |
| Used K-Means transform() | Computed distances efficiently |
| Combined original + distance features | Gave Perceptron more information |
| Linear boundary in 5D | = Non-linear boundary in 2D |

**Key Takeaway:** Feature engineering can dramatically improve even simple classifiers!