# 🧩 K-Means Feature Augmentation + Perceptron Cross-Validation

## 🧩 Problem Statement

**What problem are we solving?**

We want to answer an important question in machine learning: **Can we improve a simple classifier by adding cluster information as new features?**

Think of it this way:
- You're trying to predict if a wine belongs to a specific type (Class 0) or not.
- The classifier only knows chemical measurements (like alcohol content, color intensity, etc.).
- But what if we tell the classifier: *"Hey, this wine is similar to wines in Group 2, and very different from wines in Group 0"*?
- Would that extra information help make better predictions?

**Why does this matter?**

In real-world machine learning:
- Raw features alone might not be enough
- **"Feature Engineering"** (creating new features) can boost performance
- But it also adds complexity, so we need to prove it's worth it!

**Real-World Relevance:**
- **Customer Segmentation:** Group customers into types (budget, premium, luxury), then use that grouping to predict churn
- **Medical Diagnosis:** Cluster patients by symptoms, use cluster info to improve disease prediction
- **Fraud Detection:** Group transactions into patterns, use pattern membership to detect anomalies

---

## 🪜 Steps to Solve the Problem

**High-Level Approach:**

1. **Load Data:** Get the Wine dataset (178 samples, 13 features, 3 classes)
2. **Create Binary Labels:** Convert 3-class problem → 2-class (Class 0 vs Others)
3. **Set Up Cross-Validation:** Use 5-fold stratified splitting (fair evaluation)
4. **For Each Fold:**
   - **Standardize** features (fit on train only!)
   - **Cluster** training data with K-Means (k=4)
   - **Augment** features: Add cluster membership + distances to centroids
   - Train **Baseline** Perceptron (original 13 features)
   - Train **Enhanced** Perceptron (21 augmented features)
   - Compare metrics (Accuracy, F1, Average Precision)
5. **Statistical Testing:** Check if differences are significant or just luck
6. **Recommendation:** Should we use this in production?
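
The per-fold steps above can be sketched roughly as follows. This is a minimal sketch, not the notebook's final code: the helper name `augment`, the `random_state=42` seeds, and the choice to report only accuracy are assumptions made for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_wine
from sklearn.linear_model import Perceptron
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import StandardScaler

K = 4  # number of clusters, as in step 4

X, y = load_wine(return_X_y=True)
y_bin = (y == 0).astype(int)  # Class 0 vs Others


def augment(kmeans, X_scaled):
    """Hypothetical helper: append one-hot membership and centroid distances."""
    one_hot = np.eye(kmeans.n_clusters)[kmeans.predict(X_scaled)]
    distances = kmeans.transform(X_scaled)  # shape (n_samples, k)
    return np.hstack([X_scaled, one_hot, distances])


cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
fold_scores = []  # (baseline_accuracy, enhanced_accuracy) per fold

for train_idx, test_idx in cv.split(X, y_bin):
    # Standardize: statistics come from the training fold only.
    scaler = StandardScaler().fit(X[train_idx])
    X_tr, X_te = scaler.transform(X[train_idx]), scaler.transform(X[test_idx])

    # Cluster the training fold, then augment both train and test with it.
    kmeans = KMeans(n_clusters=K, n_init=10, random_state=42).fit(X_tr)
    X_tr_aug, X_te_aug = augment(kmeans, X_tr), augment(kmeans, X_te)

    baseline = Perceptron(random_state=42).fit(X_tr, y_bin[train_idx])
    enhanced = Perceptron(random_state=42).fit(X_tr_aug, y_bin[train_idx])

    fold_scores.append((
        accuracy_score(y_bin[test_idx], baseline.predict(X_te)),
        accuracy_score(y_bin[test_idx], enhanced.predict(X_te_aug)),
    ))

print(np.array(fold_scores).mean(axis=0))  # mean baseline and enhanced accuracy
```

The augmented matrix has 13 + 4 + 4 = 21 columns, matching the feature count quoted for the Enhanced pipeline.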

**Plain-English Reasoning:**

Imagine you're a teacher trying to predict which students will pass:
- **Baseline:** You only know exam scores (13 subjects)
- **Enhanced:** You also know student "type" (Nerd, Athlete, Artist, Socialite) and how similar each student is to each type
- Question: Does knowing the "type" help you predict better?

---

## 🎯 Expected Output (OVERALL)

**What will we get at the end?**

1. **Cross-Validation Metrics Table** (CSV file)
   - Shows Accuracy, F1, Average Precision for both pipelines
   - 5 rows (one per fold) + summary row with Mean ± Std

2. **Comparison Bar Plot** (PNG image)
   - Visual comparison of Baseline vs Enhanced
   - Error bars showing variability

3. **Statistical Test Results** (printed output)
   - p-values for each metric
   - **p < 0.05** = difference is significant! 🎉
   - **p ≥ 0.05** = difference could be random 🤷

4. **Executive Summary** (400-450 word TXT file)
   - Professional recommendation
   - "Should we use K-Means augmentation in production?"
   - Considers statistical significance AND practical complexity

**Success Criteria:**
- Enhanced improves at least 2 metrics, OR
- We have evidence-based reasons why it doesn't improve
- Summary references statistical significance clearly
- Discussion of operational impact (complexity, runtime, etc.)

**Sample Interpretation:**
- If Enhanced Accuracy = 0.95, Baseline = 0.88, p = 0.02:
  - **Interpretation:** "Enhanced is 7 percentage points better, and this is statistically significant (p=0.02 < 0.05). The improvement is unlikely to be luck alone!"
- If Enhanced F1 = 0.91, Baseline = 0.90, p = 0.45:
  - **Interpretation:** "Enhanced is slightly better, but p=0.45 means this could easily be random chance. No strong evidence of improvement."
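
To make the p-value reading concrete, here is a small sketch of how such a comparison could be computed. It assumes a paired t-test (`scipy.stats.ttest_rel`), which is one plausible choice rather than a confirmed part of this project, and the per-fold scores are invented purely for illustration.

```python
import numpy as np
from scipy.stats import ttest_rel

# Hypothetical per-fold accuracies, invented for illustration only.
baseline_acc = np.array([0.88, 0.90, 0.86, 0.89, 0.87])
enhanced_acc = np.array([0.95, 0.94, 0.93, 0.96, 0.94])

# Paired test: both pipelines are scored on the same five folds.
t_stat, p_value = ttest_rel(enhanced_acc, baseline_acc)

mean_gain = (enhanced_acc - baseline_acc).mean()
verdict = "significant" if p_value < 0.05 else "could be random"
print(f"mean gain = {mean_gain:.3f}, p = {p_value:.4f} -> {verdict}")
```

With only five folds, a test like this has limited power, so a large p-value means "no strong evidence", not "proven equal".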

---

Now let's dive into the code! 🚀

---

# 📦 SECTION 1: IMPORTS

Before we can solve any problem, we need to bring in the tools (libraries) we'll use.

Think of this like going to a toolbox:
- **NumPy** = Calculator for arrays and numbers
- **Pandas** = Excel for Python
- **Matplotlib** = Drawing tools for charts
- **Scikit-learn** = Machine learning toolkit

---

## 📊 Import NumPy - The Calculator for Arrays

### 2.1 What does this line do?
Imports NumPy library and gives it a short name `np` so we can use it easily.

### 2.2 Why is NumPy used?
- Python lists are slow for math operations
- NumPy arrays are often 10-100x faster for vectorized math (the core loops are written in C)
- We need it for:
  - Array operations (like adding all elements)
  - Mathematical functions (mean, standard deviation, square root)
  - Distance calculations (Euclidean distance)

**Is this the only way?**
- You could use pure Python lists and loops, but it would be MUCH slower
- For 178 samples it's okay, but for 1 million samples, NumPy is essential
- **Why NumPy is better:** Speed + less code + built-in functions
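
As a quick illustration of "less code", the sketch below computes the same Euclidean distance twice, once with a pure-Python loop and once with NumPy. The two sample vectors are made up for the example.

```python
import math
import numpy as np

a = [1.0, 2.0, 3.0]
b = [4.0, 6.0, 3.0]

# Pure Python: explicit loop over coordinate pairs.
dist_loop = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# NumPy: the same calculation as one vectorized expression.
dist_numpy = np.sqrt(np.sum((np.array(a) - np.array(b)) ** 2))

print(dist_loop, dist_numpy)  # both 5.0
```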

### 2.3 When to use NumPy?
- When working with numbers, matrices, or mathematical operations
- Almost always in data science and machine learning

### 2.4 Where is NumPy used in real projects?
- Image processing (images are arrays of pixels)
- Financial analysis (stock prices, calculations)
- Scientific computing (physics simulations)
- Machine learning (all data is stored as arrays)

### 2.5 How to use NumPy?
**Syntax:**
```python
import numpy as np
```

**Example:**
```python
# Create an array
arr = np.array([1, 2, 3, 4, 5])

# Calculate mean
mean = np.mean(arr)  # Result: 3.0

# Square all elements
squared = arr ** 2  # Result: [1, 4, 9, 16, 25]
```

### 2.6 How does NumPy work internally?
- NumPy stores data in contiguous memory blocks (all together, not scattered)
- Operations are done in compiled C code (not slow Python loops)
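- Can use SIMD (Single Instruction Multiple Data) instructions where the CPU and build support them
- Example: Adding two arrays of 1000 elements is a single vectorized call (`a + b`) that runs a compiled loop, not 1000 separate Python-level operations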

### 2.7 Output with sample example
```python
import numpy as np
print(np.__version__)  # Shows NumPy version, e.g., '1.24.3'
print(type(np))        # Shows: <class 'module'>
```

No visible output when importing, but `np` is now available to use throughout the code.

In [None]:
import numpy as np

## 📋 Import Pandas - Excel for Python

### 2.1 What does this line do?
Imports the Pandas library with the standard nickname `pd`.

### 2.2 Why is Pandas used?
- NumPy is great for arrays, but Pandas is great for **tables with labels**
- We need it to:
  - Create tables (DataFrames) for our metrics
  - Save results to CSV files
  - Display data in a nice, readable format
  - Handle row/column names (like Excel)

**Is this the only way?**
- Could use pure NumPy arrays, but no column names or labels
- Could write CSV manually with Python's `csv` module, but much more code
- **Why Pandas is better:** Built-in CSV support, pretty printing, column names

### 2.3 When to use Pandas?
- When you have tabular data (rows and columns with labels)
- When you need to save/load CSV, Excel files
- When you want to display results in a table format

### 2.4 Where is Pandas used in real projects?
- Data analysis (like Excel but programmable)
- Business reports (sales data, metrics dashboards)
- Data cleaning (removing duplicates, handling missing values)
- Financial modeling (stock portfolios, risk analysis)

### 2.5 How to use Pandas?
**Syntax:**
```python
import pandas as pd
```

**Example:**
```python
# Create a DataFrame (table)
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'Score': [88, 92, 85]
})

# Save to CSV
df.to_csv('results.csv', index=False)

# Display table
print(df)
```

### 2.6 How does Pandas work internally?
- Built on top of NumPy (uses NumPy arrays underneath)
- Adds labels (row index + column names) to NumPy arrays
- Provides convenient methods for common operations
- Example: `df.mean()` calculates mean of each column automatically
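
The relationship between a DataFrame and its underlying NumPy data can be seen in a short sketch. The fold metrics below are invented for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical fold metrics, invented for illustration only.
metrics = pd.DataFrame(
    {"accuracy": [0.90, 0.92, 0.88], "f1": [0.85, 0.89, 0.84]},
    index=["fold_1", "fold_2", "fold_3"],
)

values = metrics.to_numpy()     # the underlying data as a plain NumPy array
column_means = metrics.mean()   # one mean per column, labeled by column name

print(type(values), values.shape)
print(column_means)
```

The labeled means agree with `values.mean(axis=0)`; Pandas just keeps the column names attached.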

### 2.7 Output with sample example
```python
import pandas as pd
print(pd.__version__)  # Shows Pandas version, e.g., '2.0.3'
```

No visible output when importing, but now we can create DataFrames!

In [None]:
import pandas as pd

## 📊 Import Matplotlib - The Plotting Library

### 2.1 What does this line do?
Imports the `pyplot` module from Matplotlib and nicknames it `plt`.

### 2.2 Why is Matplotlib used?
- We need to **visualize** our results (a picture is worth 1000 numbers!)
- Will create a bar chart comparing baseline vs enhanced metrics
- Humans understand charts faster than tables of numbers
- We'll save the plot as a PNG file for the report

**Is this the only way?**
- Alternatives: Seaborn (prettier but built on Matplotlib), Plotly (interactive)
- **Why Matplotlib is better here:** Most versatile, works everywhere, standard in ML

### 2.3 When to use Matplotlib?
- Whenever you need to create charts, graphs, or visualizations
- Line plots, scatter plots, bar charts, histograms, etc.

### 2.4 Where is Matplotlib used in real projects?
- Research papers (all figures and charts)
- Business dashboards (sales trends, KPI visualizations)
- Machine learning (loss curves, confusion matrices)
- Scientific visualization (experiment results)

### 2.5 How to use Matplotlib?
**Syntax:**
```python
import matplotlib.pyplot as plt
```

**Example:**
```python
# Create simple bar chart
plt.bar(['Baseline', 'Enhanced'], [0.85, 0.92])
plt.ylabel('Accuracy')
plt.title('Model Comparison')
plt.savefig('comparison.png')
plt.show()
```

### 2.6 How does Matplotlib work internally?
- Creates a "figure" object (like a canvas)
- Adds "axes" (the actual plot area)
- Renders graphics using a backend (Agg for PNG files, TkAgg for interactive windows)
- Can save as vector formats (PDF, SVG) or raster (PNG, JPG)
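
The figure/axes/backend pieces can be made explicit with the object-oriented interface. This sketch assumes the non-interactive Agg backend and writes to a temporary directory; the scores, error bars, and file name are hypothetical.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # non-interactive backend: renders to files, opens no window
import matplotlib.pyplot as plt

# Hypothetical mean scores and standard deviations, for illustration only.
labels = ["Baseline", "Enhanced"]
means = [0.85, 0.92]
stds = [0.03, 0.02]

fig, ax = plt.subplots(figsize=(4, 3))       # fig = canvas, ax = plot area
ax.bar(labels, means, yerr=stds, capsize=4)  # yerr draws the error bars
ax.set_ylabel("Accuracy")
ax.set_title("Model Comparison")

out_path = os.path.join(tempfile.gettempdir(), "comparison_sketch.png")
fig.savefig(out_path, dpi=100)               # raster output written by the backend
plt.close(fig)
```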

### 2.7 Output with sample example
```python
import matplotlib.pyplot as plt
# No output when importing, but plotting functions are now available
```

After import, we can create plots!

In [None]:
import matplotlib.pyplot as plt

## 🍷 Import load_wine - The Wine Dataset

### 2.1 What does this line do?
Imports the `load_wine` function from scikit-learn's datasets module.

### 2.2 Why is this used?
- This is our **DATA SOURCE**!
- The Wine dataset contains:
  - 178 samples (wines)
  - 13 features (chemical measurements like alcohol, acidity, color)
  - 3 classes (types of wine: 0, 1, 2)
- It's built into scikit-learn, so no download needed
- Perfect for learning and experiments

**Is this the only way?**
- Could load from CSV file using `pd.read_csv()`
- Could download from UCI Machine Learning Repository
- **Why load_wine is better here:** Instant access, standardized format, clean data

### 2.3 When to use load_wine?
- When practicing classification algorithms
- When you need a small, clean, multi-class dataset
- For educational demonstrations

### 2.4 Where is this dataset used?
- Machine learning courses (like this one!)
- Algorithm benchmarking
- Research on classification methods
- Wine research (the data come from a chemical analysis of wines grown in the same region of Italy from three different cultivars)

### 2.5 How to use load_wine?
**Syntax:**
```python
from sklearn.datasets import load_wine
```

**Example:**
```python
wine = load_wine()
X = wine.data          # Features (178, 13)
y = wine.target        # Labels (178,) with values 0, 1, 2
names = wine.feature_names  # Names of the 13 features

# Print first sample
print(X[0])  # Array of 13 chemical measurements
print(y[0])  # Class label (0, 1, or 2)
```

### 2.6 How does load_wine work internally?
- The data is embedded in scikit-learn's installation files
- When called, it loads from disk into memory
- Returns a "Bunch" object (dictionary-like)
- Keys: `data`, `target`, `feature_names`, `DESCR` (description)
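
A short sketch of working with the returned Bunch, including the "Class 0 vs Others" relabeling this project needs. The variable name `y_binary` is an assumption for illustration.

```python
import numpy as np
from sklearn.datasets import load_wine

wine = load_wine()

# A Bunch supports both attribute and dictionary-style access.
assert wine.data is wine["data"]

X = wine.data
y_binary = (wine.target == 0).astype(int)  # 1 = Class 0, 0 = Others

print(X.shape)                # (178, 13)
print(np.bincount(y_binary))  # counts for Others (label 0) and Class 0 (label 1)
```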

### 2.7 Output with sample example
```python
from sklearn.datasets import load_wine
wine = load_wine()
print(wine.data.shape)     # (178, 13) - 178 samples, 13 features
print(wine.target.shape)   # (178,) - 178 labels
print(wine.target[:5])     # [0 0 0 0 0] - first 5 labels
```

In [None]:
from sklearn.datasets import load_wine

## 📏 Import StandardScaler - Feature Normalization

### 2.1 What does this line do?
Imports the `StandardScaler` class from scikit-learn's preprocessing module.

### 2.2 Why is StandardScaler used?
**The Problem:**
- The wine dataset has features on different scales:
  - Alcohol: 11-14% (range of ~3)
  - Proline: 278-1680 (range of ~1400)
- K-Means uses **Euclidean distance**: √((x₁-x₂)² + (y₁-y₂)²)
- If proline ranges from 278-1680, it will DOMINATE the distance calculation!
- Alcohol (range 3) would barely matter

**The Solution:**
- StandardScaler transforms each feature to have:
  - Mean = 0
  - Standard Deviation = 1
- Now all features contribute equally to distance calculations

**Real-life analogy:**
Imagine comparing students:
- Student A: Math=90, English=85, Science=88
- Student B: Math=80, English=82, Science=85
- But Math and English are scored out of 100, while Science is scored out of 1000 (unfair!)
- StandardScaler would put every subject on the same scale (mean 0, standard deviation 1), so no subject dominates just because of its units

**Is this the only way?**
- **MinMaxScaler:** Scales to [0, 1] range
  - Use when: You want bounded values (e.g., image pixels 0-255 → 0-1)
- **RobustScaler:** Uses median and IQR (less sensitive to outliers)
  - Use when: Your data has extreme outliers
- **Why StandardScaler is better here:**
  - K-Means relies on Euclidean distance, so giving every feature unit variance keeps any single feature from dominating
  - Wine data doesn't have extreme outliers
  - A common default for distance-based algorithms

### 2.3 When to use StandardScaler?
- **Almost always** before K-Means, KNN, SVM (distance- or margin-based algorithms)
- **Almost always** before Perceptron, Logistic Regression (gradient-based training converges better on scaled features)
- **NOT NEEDED** for tree-based models (Decision Trees, Random Forest): they split by thresholds, not distances

### 2.4 Where is StandardScaler used in real projects?
- Customer segmentation (clustering on age, income, purchases: all different scales)
- Recommendation systems (collaborative filtering with different feature types)
- Medical diagnosis (lab values: blood pressure 120, glucose 100, cholesterol 200, all on different scales)

### 2.5 How to use StandardScaler?
**Syntax:**
```python
from sklearn.preprocessing import StandardScaler
```

**Example:**
```python
from sklearn.preprocessing import StandardScaler
import numpy as np

# Sample data (before scaling)
X = np.array([[1, 1000],
              [2, 2000],
              [3, 3000]])

scaler = StandardScaler()
scaler.fit(X)  # Learn mean and std from data
X_scaled = scaler.transform(X)  # Apply transformation

print("Before scaling:")
print(X)
# [[   1 1000]
#  [   2 2000]
#  [   3 3000]]

print("\nAfter scaling:")
print(X_scaled.round(2))
# [[-1.22 -1.22]
#  [ 0.    0.  ]
#  [ 1.22  1.22]]

# Check: mean is now ~0, std is now ~1
print("\nMean:", X_scaled.mean(axis=0))  # approximately [0. 0.]
print("Std:", X_scaled.std(axis=0))      # approximately [1. 1.]
```

### 2.6 How does StandardScaler work internally?
**Step-by-step process:**

1. **During `.fit(X_train)`:**
   - Calculate mean: μ = (x₁ + x₂ + ... + xₙ) / n
   - Calculate standard deviation: σ = √(Σ(xᵢ - μ)² / n)
   - Store μ and σ for each feature

2. **During `.transform(X)`:**
   - For each feature: x_scaled = (x - μ) / σ
   - This centers data (mean=0) and scales variance (std=1)

**Why fit on train only?**
- In production, we won't know test data's mean/std
- Must use training mean/std to transform test data
- Otherwise we "leak" information from test to train
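
The "fit on train only" rule can be sketched with a single hold-out split. The 80/20 split and `random_state=42` are assumptions for illustration; the project itself uses 5-fold cross-validation.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X_train, X_test = train_test_split(X, test_size=0.2, stratify=y, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std from train only
X_test_scaled = scaler.transform(X_test)        # reuse the training mean/std

print(X_train_scaled.mean(axis=0).round(2))  # zeros, up to rounding
print(X_test_scaled.mean(axis=0).round(2))   # near zero, but generally not exactly
```

The test-set means are not exactly zero, and that is expected: the test data is treated like unseen production data.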

**Math Example:**
```
Feature: [100, 200, 300]
Mean μ = 200
Std σ = 81.65

Transform:
100 → (100 - 200) / 81.65 = -1.22
200 → (200 - 200) / 81.65 =  0.00
300 → (300 - 200) / 81.65 = +1.22
```
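
The same arithmetic can be checked in a few lines. This sketch applies the formula by hand and compares it with `StandardScaler`; the variable names are illustrative.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

x = np.array([100.0, 200.0, 300.0])

mu = x.mean()
sigma = x.std()  # NumPy's default ddof=0 divides by n, as in the formula above
manual = (x - mu) / sigma

# StandardScaler expects a 2D array: one column per feature.
from_sklearn = StandardScaler().fit_transform(x.reshape(-1, 1)).ravel()

print(round(sigma, 2))  # 81.65
print(manual.round(2))  # -1.22, 0.0, 1.22
```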

### 2.7 Output with sample example
```python
from sklearn.preprocessing import StandardScaler
# No output when importing, but now we can create scalers
```

In [None]:
from sklearn.preprocessing import StandardScaler

## 🎯 Import KMeans - The Clustering Algorithm

### 2.1 What does this line do?
Imports the `KMeans` class from scikit-learn's cluster module.

### 2.2 Why is KMeans used?
**What is K-Means?**
- An **unsupervised** algorithm that groups data into k clusters
- "Unsupervised" means it doesn't need labels: it finds patterns on its own!
- We'll use it to find k=4 natural groups in the wine data

**Why do we need clustering here?**
- We want to create NEW features from these clusters:
  1. **One-hot membership:** "This wine belongs to cluster 2"
  2. **Distances to centroids:** "This wine is very close to cluster 2 center, far from cluster 0"
- These new features give the Perceptron more information!
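
Both kinds of cluster feature can be built from a fitted `KMeans` model. The sketch below uses a tiny made-up 2D dataset with k=2 so the shapes are easy to follow; on the wine data with k=4 the same steps would add 4 + 4 columns.

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]], dtype=float)

kmeans = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X)

labels = kmeans.predict(X)                   # hard assignment per sample
one_hot = np.eye(kmeans.n_clusters)[labels]  # (6, 2): membership flags
distances = kmeans.transform(X)              # (6, 2): distance to each centroid

X_augmented = np.hstack([X, one_hot, distances])
print(X_augmented.shape)  # (6, 6): 2 original + 2 one-hot + 2 distances
```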

**Real-life analogy:**
Imagine grouping students:
- Give K-Means student grades → it finds 4 types: Nerds, Athletes, Artists, Socialites
- Now when predicting "will pass exam?", you can say:
  - "Student is a Nerd (cluster 0)"
  - "Student is 90% similar to Nerd type, 10% to Athlete type"
- This helps the predictor make better decisions!

**Is this the only way?**
- **Hierarchical Clustering:** Builds a tree of clusters (good for dendrograms)
  - Use when: You want to see cluster hierarchy, not sure about k
- **DBSCAN:** Finds arbitrary-shaped clusters, handles noise
  - Use when: Clusters aren't spherical, lots of outliers
- **Gaussian Mixture Models (GMM):** Probabilistic clustering
  - Use when: You want soft assignments ("30% cluster A, 70% cluster B")
- **Why K-Means is better here:**
  - Fast and simple (scales to large datasets)
  - Works well when clusters are roughly spherical (assumed here for the standardized wine data)
  - Problem specifies k=4, and K-Means takes k directly (DBSCAN, by contrast, has no k parameter)
  - Gives hard assignments (easier to create one-hot features)

### 2.3 When to use K-Means?
- When you want to find k groups in unlabeled data
- When clusters are roughly circular/spherical
- When you know (or can estimate) k
- When speed is important (K-Means is very fast)

### 2.4 Where is K-Means used in real projects?
- **Customer Segmentation:** Group customers by behavior (high spenders, bargain hunters, etc.)
- **Image Compression:** Group similar colors together (reduce 16 million colors to 16)
- **Document Clustering:** Group news articles by topic
- **Anomaly Detection:** Samples far from all centroids are outliers
- **Feature Engineering:** (Like this project!) Use cluster info as new features

### 2.5 How to use K-Means?
**Syntax:**
```python
from sklearn.cluster import KMeans
```

**Example:**
```python
from sklearn.cluster import KMeans
import numpy as np

# Sample data: 2D points
X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]])

# Create KMeans model with k=2 clusters (n_init=10 restarts, best run kept)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)

# Fit model (find centroids)
kmeans.fit(X)

# Get cluster assignments
labels = kmeans.labels_
print("Cluster assignments:", labels)
# Output: e.g. [0 0 0 1 1 1] - the first 3 points share one cluster, the next 3 the other
# (which group is numbered 0 is arbitrary, so [1 1 1 0 0 0] is equally valid)

# Get centroids
print("Centroids:", kmeans.cluster_centers_)
# Output: the two cluster centers, [1. 2.] and [10. 2.], in label order

# Predict cluster for new data
new_point = [[0, 0]]
print("New point cluster:", kmeans.predict(new_point))
# Output: the label of the x=1 group, since (0, 0) is closest to centroid [1, 2]
```

### 2.6 How does K-Means work internally?
**The Algorithm (Lloyd's Algorithm):**

1. **Initialization:**
   - Randomly place k centroids in the data space
   - (Or use a smarter method like k-means++, which is scikit-learn's default)

2. **Assignment Step:**
   - For each data point:
     - Calculate distance to each centroid
     - Assign to nearest centroid
   - Example: point (5, 5) is closer to centroid at (6, 6) than (1, 1) → assign to cluster 1

3. **Update Step:**
   - For each cluster:
     - Calculate mean of all assigned points
     - Move centroid to this mean position
   - Example: Cluster has points [(1,2), (2,3), (3,4)] → new centroid at (2, 3)

4. **Repeat:**
   - Keep alternating Assignment and Update
   - Stop when centroids don't move (converged)
   - Or when max iterations reached
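
The four steps above can be sketched in plain NumPy. This is a teaching sketch, not scikit-learn's implementation: the function name `lloyd_kmeans` is hypothetical, and initialization is simplified to two hand-picked data points so the run is deterministic.

```python
import numpy as np


def lloyd_kmeans(X, init_centroids, max_iter=100):
    """Minimal Lloyd's algorithm; returns (centroids, labels)."""
    centroids = np.asarray(init_centroids, dtype=float).copy()
    for _ in range(max_iter):
        # Assignment step: distance from every point to every centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)

        # Update step: move each centroid to the mean of its points.
        new_centroids = centroids.copy()
        for j in range(len(centroids)):
            members = X[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                new_centroids[j] = members.mean(axis=0)

        # Converged: centroids stopped moving.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels


X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]], dtype=float)

# Initialization here simply picks two data points (an assumption for the demo).
centroids, labels = lloyd_kmeans(X, init_centroids=X[[0, 3]])
print(centroids)  # rows [1. 2.] and [10. 2.]
print(labels)     # [0 0 0 1 1 1]
```

A different starting pair, such as two points from the same group, can converge to a worse split, which is why multiple restarts help.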

**Visual Example:**
```
Iteration 0: Random centroids
  ●(centroid) ..(points)

Iteration 1: Assign points to nearest centroid
  ●.. (cluster A)
     ●.. (cluster B)

Iteration 2: Move centroids to cluster means
  .●. (centroid moved to middle of cluster A)
     .●. (centroid moved to middle of cluster B)
     
... repeat until stable
```

**Why n_init=10?**
- K-Means can get stuck in local optima (bad starting centroids)
- Solution: Run 10 times with different random starts
- Keep the best result (lowest "inertia" = sum of squared distances to centroids)

**What is inertia?**
- Sum of squared distances from each point to its centroid
- Inertia = Σ(distance to centroid)²
- Lower inertia = tighter, better clusters
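
Inertia can be recomputed by hand and compared with the value scikit-learn reports. This is a small sketch on made-up 2D points; `n_init=10` is passed explicitly to match the discussion above.

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]], dtype=float)

# n_init=10: run ten initializations and keep the lowest-inertia result.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X)

# Inertia by hand: squared distance from each point to its assigned centroid.
assigned_centroids = kmeans.cluster_centers_[kmeans.labels_]
manual_inertia = np.sum((X - assigned_centroids) ** 2)

print(manual_inertia, kmeans.inertia_)  # the two values should agree
```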

### 2.7 Output with sample example
```python
from sklearn.cluster import KMeans
import numpy as np

X = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]])
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
kmeans.fit(X)

print("Labels:", kmeans.labels_)
# Output: e.g. [0 0 0 1 1 1] (cluster numbering is arbitrary)
# Means: first 3 points share one cluster, next 3 share the other

print("Centroids shape:", kmeans.cluster_centers_.shape)
# Output: (2, 2)
# Means: 2 centroids, each is a 2D point
```

In [None]:
from sklearn.cluster import KMeans