## Part 1: The k-NN Classifier for Categorical Outcomes

### Example: Riding Mowers

### 🏢 Business Context

Recall the riding mower marketing problem:
- **Goal**: Predict which households will buy a riding mower
- **Features**: Income ($1000s), Lot Size (1000s sq ft)
- **Business value**: Target marketing to likely buyers

**Why k-NN here?**
- Relationship might be non-linear
- "Similar households (in income and lot size) have similar buying behavior"
- Small dataset (24 observations) → k-NN works well

### Step 1: Prepare Data (Train/Holdout Split)

In [None]:
library(ggplot2)
library(ggrepel)
library(caret)
library(mlba)

mowers.df <- mlba::RidingMowers
set.seed(35)

idx <- sample(nrow(mowers.df), 0.6*nrow(mowers.df))
train.df <- mowers.df[idx, ]
holdout.df <- mowers.df[-idx, ]

## new household to classify
new.df <- data.frame(Income = 60, Lot_Size = 20)

cat("Training set:", nrow(train.df), "observations\n")
cat("Holdout set:", nrow(holdout.df), "observations\n")
cat("New household: Income =", new.df$Income, ", Lot Size =", new.df$Lot_Size)

### 📋 Understanding the Data Split

- **Training set** (about 60%, 14 of 24 households): Used to find neighbors
- **Holdout set** (the remaining 10 households): For evaluation (not used in this example)
- **New household**: Income = $60K, Lot = 20,000 sq ft
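
The split sizes follow from how `sample()` handles a fractional size. The short check below is a sketch in base R that uses only the row count (24), not the actual data. It shows that `0.6 * 24 = 14.4` is truncated to 14 sampled rows, which leaves 10 for the holdout set.

```r
n <- 24                      # number of households in the dataset
set.seed(35)

# 0.6 * 24 = 14.4; sample() truncates a fractional size to an integer
idx <- sample(n, 0.6 * n)

n_train   <- length(idx)     # rows that go to the training set
n_holdout <- n - n_train     # rows left for the holdout set

cat("Training rows:", n_train, "| Holdout rows:", n_holdout, "\n")
```

If you prefer the size to be explicit, wrapping it as `floor(0.6 * n)` or `round(0.6 * n)` documents the intent in the code itself.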

**Question**: Will this household buy a riding mower?

### Visualizing the Data and New Point

In [None]:
g <- ggplot(mapping=aes(x=Income, y=Lot_Size, shape=Ownership, color=Ownership, fill=Ownership)) +
  geom_point(data=train.df, size=4) +
  geom_text_repel(aes(label=rownames(train.df)), data=train.df, show.legend = FALSE) +
  geom_point(data=cbind(new.df, Ownership='New'),  size=5) +
  scale_shape_manual(values = c(18, 15, 21)) +
  scale_color_manual(values = c('black', 'darkorange', 'steelblue')) +
  scale_fill_manual(values = c('black', 'darkorange', 'lightblue'))

g

### 📋 Reading the k-NN Visualization

**What you see**:
- **Orange points**: Non-owners
- **Blue points**: Owners  
- **Black diamond**: New household (unknown ownership)
- **Numbers**: Row indices from training data

**Visual inspection**:
- New point (60, 20) is surrounded by mostly **blue (owner)** points
- Nearest neighbors appear to be owners
- **Intuition**: This household will likely buy a mower

**Key insight**: k-NN formalizes this visual intuition with distance calculations.

---

## Part 2: Building the k-NN Model

### 🏢 Business Context: Choosing k

The parameter **k** (number of neighbors) is critical:

| k Value | Behavior | Business Impact |
|---------|----------|----------------|
| **k = 1** | Very sensitive to noise | Over-responds to outliers |
| **k = 3-7** | Balanced | Good starting point |
| **k = large** | Over-smoothed | Misses local patterns |

**Trade-off**:
- **Small k** → Flexible, but noisy (overfitting)
- **Large k** → Stable, but too simple (underfitting)

### Training k-NN with k=3

In [None]:
library(caret)
# train k-NN model with k=3
model <- train(Ownership ~ ., data=train.df,
               method="knn",  # specify the model
               preProcess=c("center", "scale"),  # normalize data
               tuneGrid=expand.grid(k=3),
               trControl=trainControl(method="none"))
model

### 📋 Understanding the Model Output

**Key components**:
- **k-Nearest Neighbors**: Model type
- **14 samples, 2 predictors, 2 classes**: Data summary (the 14 training households)
- **Pre-processing: centered (2), scaled (2)**: Features were normalized
- **k = 3**: Using 3 nearest neighbors

### ⚠️ CRITICAL: Why Normalize?

**Problem without normalization**:

```
Household A: Income = 60 ($60K),  Lot = 20 (20K sq ft)
Household B: Income = 65 ($65K),  Lot = 18 (18K sq ft)

Distance = √[(65-60)² + (18-20)²]
         = √[25 + 4] = √29 ≈ 5.4

Income difference: 5
Lot difference: 2

PROBLEM: Income dominates distance just because of scale!
But 2K sq ft difference might be MORE important than $5K income.
```

**Solution: Normalize** (mean=0, sd=1 for both variables)

```
After normalization:
Income: (60 - mean) / sd
Lot:    (20 - mean) / sd

Now both contribute equally to distance calculation.
```

**Best practice**: **Always** use `preProcess=c("center", "scale")` for k-NN!
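
To make the effect concrete, here is a minimal base R sketch that recomputes the distance between households A and B before and after standardizing. The means and standard deviations (`mu`, `sigma`) are made-up illustration values, not statistics computed from `RidingMowers`.

```r
# Households A and B from the example above
a <- c(Income = 60, Lot_Size = 20)
b <- c(Income = 65, Lot_Size = 18)

# Raw Euclidean distance: the income gap (5) outweighs the lot-size gap (2)
raw_dist <- sqrt(sum((a - b)^2))

# Hypothetical training-set means and standard deviations (assumed values)
mu    <- c(Income = 68, Lot_Size = 19)
sigma <- c(Income = 20, Lot_Size = 2)

# z-score both households, then recompute the distance
a_z <- (a - mu) / sigma
b_z <- (b - mu) / sigma
scaled_dist <- sqrt(sum((a_z - b_z)^2))

# Per-feature contribution to the squared distance after scaling
contrib <- (a_z - b_z)^2

print(round(raw_dist, 2))
print(round(scaled_dist, 2))
print(contrib)
```

Under these assumed statistics, lot size contributes more to the standardized distance than income does, which reverses the raw-scale picture. This is roughly what `preProcess=c("center", "scale")` does for you, using the training data's own means and standard deviations.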

### Making Predictions

In [None]:
# predict new data point
predict(model, new.df)

### 📋 Interpreting the Prediction

**Output**: a single class label, either "Owner" or "Nonowner"

**What happened behind the scenes**:
1. Normalized the new household's features with the training means and standard deviations
2. Calculated the distance to all 14 training observations
3. Found the 3 nearest neighbors
4. Checked their classes → majority vote
5. Returned the winning class

**Business action**: If "Owner" → Add to marketing campaign!
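
The steps above can be sketched from scratch in base R. This is an illustrative re-implementation, not caret's internal code. The function name `knn_predict` and the toy households are hypothetical and are not the `RidingMowers` data.

```r
# Minimal k-NN classifier: standardize, measure distance, take a majority vote.
knn_predict <- function(train_x, train_y, new_x, k = 3) {
  # Standardize with the training means and standard deviations
  mu    <- colMeans(train_x)
  sigma <- apply(train_x, 2, sd)
  train_z <- scale(train_x, center = mu, scale = sigma)
  new_z   <- (unlist(new_x)[colnames(train_x)] - mu) / sigma

  # Euclidean distance from the new point to every training row
  dists <- sqrt(rowSums(sweep(train_z, 2, new_z)^2))

  # Labels of the k closest rows, then the most frequent label
  nearest <- order(dists)[seq_len(k)]
  votes   <- table(train_y[nearest])
  list(class = names(votes)[which.max(votes)], neighbors = nearest)
}

# Hypothetical toy households (not the RidingMowers data)
toy_x <- data.frame(Income   = c(85, 90, 75, 40, 45, 50),
                    Lot_Size = c(22, 21, 20, 16, 17, 15))
toy_y <- c("Owner", "Owner", "Owner", "Nonowner", "Nonowner", "Nonowner")

result <- knn_predict(toy_x, toy_y, data.frame(Income = 80, Lot_Size = 21), k = 3)
print(result$class)
print(result$neighbors)
```

Returning the neighbor indices alongside the class keeps the prediction explainable: you can look up exactly which rows cast the votes.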

### Identifying the Nearest Neighbors

Let's see **which specific households** influenced the prediction:

In [None]:
# determine nearest neighbors to new data point
train.norm.df <- predict(model$preProcess, train.df)
new.norm.df <- predict(model$preProcess, new.df)
distances <- apply(train.norm.df[, 1:2], 1,
                   function(d){ sqrt(sum((d - new.norm.df)^2)) })
rownames(train.df)[order(distances)][1:3]

### 📋 Understanding Nearest Neighbors

**Output**: Row indices of 3 closest households

**What this tells you**:
- These are the 3 most similar households from training data
- Their ownership status determined the prediction
- You can examine these specific cases to understand the prediction

**Business value**:
- **Explainability**: "We predict Owner because households #4, #8, #12 are very similar and all own mowers"
- **Stakeholder confidence**: Show actual comparable cases
- **Quality check**: If neighbors seem unreasonable, investigate data issues

### 🏢 Comparing to Real Estate Comps

This works much like real estate "comparables":

```
PROPERTY VALUATION
─────────────────────────────────────────────────
Subject Property: 60K income, 20K sq ft lot

Comparable 1: 58K income, 19K sq ft → Owns mower
Comparable 2: 62K income, 21K sq ft → Owns mower
Comparable 3: 59K income, 20K sq ft → Owns mower

Conclusion: Subject likely to own mower
```

Stakeholders understand this reasoning!

---

## Part 3: Choosing the Optimal k

### 🏢 Business Context: Hyperparameter Tuning

How do we know k=3 is the best choice? We need to **test different values** and see which performs best.

**Challenge**: Our dataset is small (24 observations)
- The 60% training split leaves only 14 observations
- Can't afford to hold out more data for validation

**Solution**: **Leave-One-Out Cross-Validation (LOOCV)**

### How LOOCV Works

```
For each observation in training set:
  1. Remove it temporarily ("leave one out")
  2. Train model on remaining 13 observations
  3. Predict the removed observation
  4. Check if prediction is correct
  
Repeat for all 14 observations → 14 predictions
Accuracy = % correct predictions
```

**Advantage**: Uses ALL data for both training and validation (no waste!)
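
The loop can also be written out by hand. The base R sketch below runs LOOCV for several values of k on a small hypothetical dataset. The data and the helper names `classify_one` and `loocv_accuracy` are illustrative assumptions, not part of caret or `mlba`. Standardization is recomputed inside each fold so the held-out row never influences the scaling.

```r
# Hypothetical toy data: 4 owners and 4 nonowners, well separated
toy <- data.frame(
  Income    = c(85, 90, 75, 95, 40, 45, 50, 35),
  Lot_Size  = c(22, 21, 20, 23, 16, 17, 15, 14),
  Ownership = c(rep("Owner", 4), rep("Nonowner", 4)),
  stringsAsFactors = FALSE
)

# Classify one held-out row using the remaining rows and k neighbors
classify_one <- function(train, test_row, k) {
  x  <- as.matrix(train[, c("Income", "Lot_Size")])
  mu <- colMeans(x)
  s  <- apply(x, 2, sd)
  z  <- scale(x, center = mu, scale = s)
  new_z <- (unlist(test_row[, c("Income", "Lot_Size")]) - mu) / s
  d  <- sqrt(rowSums(sweep(z, 2, new_z)^2))
  votes <- table(train$Ownership[order(d)[seq_len(k)]])
  names(votes)[which.max(votes)]
}

# LOOCV accuracy for a single k: hold out each row exactly once
loocv_accuracy <- function(data, k) {
  hits <- vapply(seq_len(nrow(data)), function(i) {
    classify_one(data[-i, ], data[i, ], k) == data$Ownership[i]
  }, logical(1))
  mean(hits)
}

k_grid <- c(1, 3, 5, 7)
acc <- vapply(k_grid, function(k) loocv_accuracy(toy, k), numeric(1))
names(acc) <- paste0("k=", k_grid)
print(acc)
```

On this toy data, k = 1, 3, and 5 classify every held-out row correctly. With k = 7, every remaining row votes, the held-out row's own class is always outnumbered 3 to 4, and accuracy falls to zero. That is an extreme version of the over-smoothing that large k causes.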

### Testing Multiple k Values

In [None]:
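# use leave-one-out cross-validation for small dataset
# ("LOOCV" is the spelling used in caret's documentation; `number` only applies
# to k-fold CV and bootstrap resampling, so it is dropped here)
trControl <- trainControl(method="LOOCV", allowParallel=TRUE)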
model <- train(Ownership ~ ., data=train.df,
               method="knn",
               preProcess=c("center", "scale"),
               tuneGrid=expand.grid(k=seq(1, 13, 2)),
               trControl=trControl)
model

### 📋 Interpreting the Tuning Results

**What you see**:
- **tuneGrid**: Tested k = 1, 3, 5, 7, 9, 11, 13
- **Accuracy** for each k value
- **Best k**: caret selects the k with the highest accuracy and reports it in the last line of the output

**How to read the results** (a typical pattern; your actual numbers will differ):

| k | Accuracy | Interpretation |
|---|----------|----------------|
| 1 | Lower | Too sensitive to outliers |
| 3-7 | **Highest** | Sweet spot ✓ |
| 11-13 | Lower | Over-smoothing |

### 📊 Typical k Selection Pattern

```
ACCURACY
  ^
  │     ╱‾╲
  │    ╱   ╲___
  │   ╱        ╲___
  │  ╱             ╲___
  │ ╱                  ╲
  └─────────────────────────> k
    1  3  5  7  9  11  13
    
    Low k:  Overfitting (noisy)
    Mid k:  Just right ✓
    High k: Underfitting (too simple)
```

**Business decision**: Use the k with highest cross-validation accuracy.

### Training Final Model with Optimal k

In [None]:
# Use the best k found from cross-validation (assume k=7 was best)
model <- train(Ownership ~ ., data=mowers.df,
               method="knn",
               preProcess=c("center", "scale"),
               tuneGrid=expand.grid(k=7),
               trControl=trainControl(method="none"))
predict(model, new.df)

### 📋 Why Retrain on Full Data?

**After finding optimal k**:
1. We used the training set (14 obs) to find the best k (assumed here to be k=7)
2. Now retrain on **ALL data** (24 obs) with k=7
3. More data → Better final model

**Workflow**:
```
Step 1: Use subset + LOOCV → Find best k
Step 2: Use full data + best k → Final model
Step 3: Predict new cases
```

**Business rationale**: We already paid the cost of tuning (finding k). Now maximize model quality by using all available data.

---

## Part 4: Examining Neighbors for Business Insights

### 🏢 Business Context: Building Trust

Stakeholders often ask: **"Why did you classify this household as an Owner?"**

With k-NN, you can show the **actual comparable cases**. This builds trust and helps validate the model.

### Finding All Neighbors Within k

In [None]:
# Examine the closest 8 neighbors (one more than k=7, for context).
# The final model was fit on mowers.df, so search all 24 households.
all.norm.df <- predict(model$preProcess, mowers.df)
new.norm.df <- predict(model$preProcess, new.df)
distances <- apply(all.norm.df[, 1:2], 1,
                   function(d){ sqrt(sum((d - new.norm.df)^2)) })
mowers.df[order(distances)[1:8],]

### 📋 Interpreting the Neighbors Table

**What you see**: 8 closest households sorted by distance (closest first)

**Columns**:
- **Income**: Annual household income ($1000s)
- **Lot_Size**: Property size (1000s sq ft)
- **Ownership**: Actual ownership status

**How to use this**:

1. **Check consistency**: Are most neighbors the same class?
   - If 7/8 are "Owner" → High confidence
   - If 4/8 are "Owner" → Low confidence (borderline case)

2. **Examine outliers**: Is there an unexpected neighbor?
   - Very different Income/Lot but still "close"?
   - Might indicate data quality issue or interesting pattern

3. **Business reasoning**: Show to stakeholders
   - "These 7 similar households all own mowers"
   - "Comparable cases support our classification"

### 🏢 Real-World Application: Loan Approval Explanation

```
LOAN APPLICATION DECISION
─────────────────────────────────────────────────────────
Applicant: $60K income, $20K savings, 680 credit score
Decision: APPROVED

Rationale:
We examined 7 similar applicants:
  - Similar income (58-62K)
  - Similar savings (18-22K)
  - Similar credit (670-690)
  
6 out of 7 successfully repaid loans → Low risk
1 defaulted (outlier, had additional risk factors)

Recommendation: Approve with standard terms
```

This is **much more convincing** than "the algorithm said yes."

### Investigating Borderline Cases

**When neighbors disagree** (e.g., 4 Owner, 3 Nonowner):

| Action | Purpose |
|--------|----------|
| **Collect more data** | New household might genuinely be borderline |
| **Flag for review** | Human expert should examine |
| **Use probabilities** | Report confidence (e.g., 57% Owner) |
| **Gather more features** | Need additional info to differentiate |

**Business value**: Uncertainty quantification helps avoid costly mistakes.
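
One simple way to report confidence is the share of neighbors that voted for the winning class. The base R sketch below uses hypothetical neighbor labels and an assumed review threshold of 0.70. Both are placeholders to adapt to your own data and costs.

```r
# Hypothetical labels of the 7 nearest neighbors for one new household
neighbor_labels <- c("Owner", "Owner", "Nonowner", "Owner",
                     "Nonowner", "Owner", "Nonowner")

# Share of votes per class, the winning class, and its share
vote_share <- prop.table(table(neighbor_labels))
predicted  <- names(vote_share)[which.max(vote_share)]
confidence <- max(vote_share)

# Flag predictions whose winning share falls below a review threshold
review_threshold <- 0.70   # assumed cutoff; set it from business costs
needs_review <- confidence < review_threshold

print(round(vote_share, 2))
cat("Predicted:", predicted, "| share:", round(confidence, 2),
    "| review:", needs_review, "\n")
```

With a caret model, `predict(model, new.df, type = "prob")` returns class probabilities directly, so the same threshold logic can be applied to its output.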

---

## Summary: Key Takeaways

### 🔧 Essential k-NN Functions

#### Building k-NN Models
| Task | Function | Critical Parameters |
|------|----------|--------------------|
| Train k-NN | `train(y ~ ., data, method='knn')` | Always include `preProcess` |
| Normalize | `preProcess=c('center', 'scale')` | **REQUIRED** for k-NN |
| Set k | `tuneGrid=expand.grid(k=3)` | Choose via cross-validation |
| Tune k | `trControl=trainControl(method='LOOCV')` | For small datasets |

#### Making Predictions
| Task | Function | Output |
|------|----------|--------|
| Predict class | `predict(model, newdata)` | Class labels |
| Find neighbors | Compute distances manually | Explainability |

---

### 🎯 k-NN Best Practices Checklist

✅ **ALWAYS normalize features** with `preProcess=c('center', 'scale')`
   - k-NN is distance-based → scale matters!
   - Without normalization, the feature with the largest numeric range dominates the distance

✅ **Choose k via cross-validation**
   - Don't guess k arbitrarily
   - Use LOOCV for small datasets
   - Use k-fold CV for larger datasets

✅ **Start with odd k** to avoid ties
   - k=3, 5, 7 are good starting points
   - Avoids 50/50 splits in binary classification

✅ **Examine nearest neighbors** for explainability
   - Show stakeholders comparable cases
   - Validate predictions make sense
   - Identify borderline/uncertain cases

✅ **Watch for curse of dimensionality**
   - k-NN degrades with many features
   - Consider PCA for dimension reduction first
   - Or use feature selection

✅ **Consider computational cost**
   - Prediction requires distance to ALL training points
   - Slow for large datasets (100K+ rows)
   - May need approximate methods or different algorithm

---

### 📊 Choosing k: Guidelines

| k Range | Behavior | When to Use |
|---------|----------|-------------|
| **k = 1** | Memorizes training data | Almost never (overfits) |
| **k = 3-7** | Balanced, flexible | **Start here** ✓ |
| **k = √n** | Rule of thumb | Medium datasets |
| **k = n/10** | Conservative | Noisy data |
| **k = large** | Over-smoothed | Very clean data only |

**Best practice**: Let cross-validation decide, but sanity-check the result.

---

### 🏢 Business Value Summary

| k-NN Feature | Business Benefit | Example |
|--------------|------------------|----------|
| **Similarity-based** | Intuitive explanations | "Similar customers bought this" |
| **Non-linear** | Captures complex patterns | Flexible decision boundaries |
| **Lazy learning** | No training time | Instant model updates |
| **Explainable neighbors** | Stakeholder trust | Show comparable cases |
| **Probability estimates** | Risk quantification | Confidence scores |

### 🚨 Common k-NN Pitfalls

| Mistake | Consequence | Fix |
|---------|-------------|-----|
| **Forget to normalize** | Wrong predictions | Always `preProcess=c('center','scale')` |
| **Arbitrary k** | Suboptimal performance | Use cross-validation |
| **Too many features** | Distances meaningless | PCA or feature selection |
| **Categorical features** | Distance undefined | Use dummy encoding |
| **Large dataset** | Slow predictions | Consider alternatives |
| **Outliers in data** | Skew normalization | Remove or transform |
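
As a sketch of the dummy-encoding fix, base R's `model.matrix()` can expand a factor into 0/1 indicator columns that a distance calculation can use. The `Region` column and the three households below are hypothetical.

```r
# Hypothetical households with a made-up categorical Region column
homes <- data.frame(
  Income = c(60, 85, 45),
  Region = factor(c("North", "South", "West"))
)

# "~ . - 1" drops the intercept, so Region gets one indicator column per level
encoded <- model.matrix(~ . - 1, data = homes)
print(encoded)
```

Each row now has exactly one `Region` indicator set to 1. Whether to rescale the indicator columns alongside the numeric features is a modeling choice that affects how much weight the category carries in the distance.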

---

### 📊 k-NN vs. Other Classifiers

**When k-NN wins**:
- ✅ Non-linear relationships
- ✅ Need to show comparable cases
- ✅ Small to medium data
- ✅ Continuous features

**When to use alternatives**:
- Large data (100K+ rows) → **Decision Trees, Random Forest**
- Need interpretable rules → **Decision Trees, Logistic Regression**
- High dimensions (100+ features) → **Regularized Logistic, SVM**
- Real-time predictions → **Pre-trained models (LDA, Logistic)**
- Linear relationships → **Logistic Regression, LDA**

---

### 💡 Advanced k-NN Techniques

| Technique | Purpose | When to Use |
|-----------|---------|-------------|
| **Weighted k-NN** | Closer neighbors count more | When distance matters |
| **Distance metrics** | Manhattan, Euclidean, etc. | Different data types |
| **Feature weighting** | Some features more important | Domain knowledge available |
| **Approximate NN** | Speed up large datasets | Production systems |
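
As an illustration of the first row, the sketch below compares a plain majority vote with an inverse-distance weighted vote in base R. Inverse-distance weighting is one common scheme among several. The function name `weighted_vote`, the labels, and the distances are all hypothetical.

```r
# Distance-weighted vote: each neighbor's vote counts 1 / distance.
weighted_vote <- function(labels, dists, eps = 1e-8) {
  w <- 1 / (dists + eps)              # eps guards against division by zero
  scores <- tapply(w, labels, sum)    # total weight per class
  names(scores)[which.max(scores)]
}

# Hypothetical 5 nearest neighbors: two very close owners, three farther nonowners
labels <- c("Owner", "Owner", "Nonowner", "Nonowner", "Nonowner")
dists  <- c(0.2, 0.3, 1.0, 1.1, 1.2)

majority <- names(which.max(table(labels)))   # unweighted majority vote
weighted <- weighted_vote(labels, dists)      # inverse-distance weighted vote
print(c(majority = majority, weighted = weighted))
```

Here the plain vote picks "Nonowner" (3 of 5), while the weighted vote picks "Owner" because the two owners are much closer to the new point.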

---

### 📚 Connection to Other Modules

- **Module 1.1**: Train/test splits apply to k-NN
- **Module 2.1 (LDA)**: Compare linear vs. non-linear boundaries
- **Module 2.2 (PCA)**: Reduce dimensions before k-NN
- **Module 3**: Distance metrics used in both clustering and k-NN
- **Module 5**: Clustering also groups observations by distance, but without class labels (unsupervised)

---

### 🎓 k-NN Interview Questions

**For stakeholders**:
- "How does k-NN work?" → "It finds similar historical cases and uses their outcomes"
- "Why this prediction?" → Show the k nearest neighbors as evidence
- "How confident are you?" → "X out of k neighbors agree"

**For technical teams**:
- "Did you normalize?" → Yes, with `preProcess=c('center','scale')`
- "How did you choose k?" → LOOCV over k = 1, 3, ..., 13; we kept the k with the highest accuracy
- "Computational cost?" → O(n) distance computations per prediction, manageable for our data size

---

**Next Steps**:
1. Apply k-NN to your own classification problems
2. Always normalize features first!
3. Use cross-validation to find optimal k
4. Show nearest neighbors to stakeholders for explainability
5. Compare k-NN performance to logistic regression and decision trees