# Chapter 6: k-Nearest Neighbors (k-NN)

In [None]:
# Import required libraries
# Your code here

## 6.1 Introduction & Motivation

Now that we've explored two classification models, it's time to add a third one: k-NN, or *k-nearest neighbors*. The model boils down to one simple question: **Of the *k* data points closest to me, which label occurs most often?**

This model is easy to understand but can be tricky to tune, especially when it comes to finding the optimal value for k. In this exercise, you will put the model to the test, compare it to the ones you already know, and explore new ways to get the best evaluation results!

**Key insight**: k-NN is a "lazy learning" algorithm. It doesn't build an explicit model during training; it simply stores the training examples and predicts by comparing each new point to them.
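To make the "lazy" idea concrete, here is a minimal sketch of k-NN using only the Python standard library. The class name `TinyKNN`, the toy points, and the labels are all made up for illustration; this is not scikit-learn's implementation, and it assumes Euclidean distance and a plain majority vote.

```python
from collections import Counter
from math import dist


class TinyKNN:
    """Illustrative k-NN classifier (hypothetical helper, not scikit-learn's API)."""

    def __init__(self, k=3):
        self.k = k

    def fit(self, X, y):
        # "Training" only stores the examples; no parameters are learned.
        self.X = list(X)
        self.y = list(y)
        return self

    def predict_one(self, point):
        # Sort the stored examples by Euclidean distance to the query point.
        nearest = sorted(zip(self.X, self.y), key=lambda pair: dist(pair[0], point))
        # Majority vote among the labels of the k closest examples.
        votes = Counter(label for _, label in nearest[: self.k])
        return votes.most_common(1)[0][0]


# Toy 2-D points: two loose clusters labelled 'R' and 'M'.
X_toy = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (1.0, 1.0), (0.9, 1.1), (1.1, 0.9)]
y_toy = ["R", "R", "R", "M", "M", "M"]

model = TinyKNN(k=3).fit(X_toy, y_toy)
print(model.predict_one((0.15, 0.15)))  # query near the first cluster
print(model.predict_one((0.95, 1.0)))   # query near the second cluster
```

Note that all the work happens at prediction time: every query is compared against every stored example, which is why k-NN predictions get slower as the training set grows.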

## 6.2 Problem Setting

After a war, dangerous equipment is often left scattered across fields. Over time, these items become buried in the earth and forgotten. Years later, civilians sometimes discover still-live explosives, leading to dangerous situations. Here in Flanders, for example, farmers still find live mines from WWI when plowing their fields.

To combat this and remove as many dangerous items as possible, governments can take sonar scans of the ground. The dataset we are going to explore today contains a series of objects, each scanned 60 times with sonar from different angles. We are trying to predict whether a scanned object is an actual mine ('M') or a rock shaped like a mine ('R'). These values are stored in the 'Material' column.

**Real-world context**: This is a classic example of a problem where a false negative (missing an actual mine) is far more costly than a false positive (flagging a rock as a mine).

## 6.3 Model

First, let's examine the structure and characteristics of our sonar data:

In [None]:
# Load the Sonar.csv dataset and examine its structure
# Your code here

##### Question 1: Try to plot a heatmap to further explore the data. Do you encounter any errors? Why does this happen? How can you solve this using a method we've already seen in previous chapters?

**Hint**: Think about what type of data correlation matrices can work with and what the 'Material' column contains.

In [None]:
# Try to create a heatmap - you should encounter an error
# Your code here

**Explain the error you encountered and how to solve it:**

*Your explanation here*

In [None]:
# Fix the Material column by encoding categorical values as numerical
# Your code here

In [None]:
# Now create the heatmap successfully
# Your code here

##### Question 2: Analyze the data by examining the heatmap. Which variables would you expect to have the highest impact on Material prediction, and which ones would you expect to have the lowest impact? Explain your reasoning.

**Analysis tip**: Look at the row (or column) of the heatmap labeled Material, which shows each variable's correlation with the target. Strong positive or negative correlations suggest higher predictive power.

**Your analysis:**

Variables with high correlation to Material:
- *Your answer here*

Variables with low correlation to Material:
- *Your answer here*

Reasoning:
*Your explanation here*

##### Question 3: Build and train your k-NN model. Make sure to:
- Keep some data aside for testing (use train-test split)
- Exclude the target column ('Material') from your training features
- Choose an appropriate test size and random state for reproducibility

**Reminder**: The features (X) should contain all sonar measurements, while the target (y) should contain only the Material labels.

In [None]:
# Define features (X) and target (y)
# Your code here

In [None]:
# Split the data and train the k-NN model
# Your code here

##### Question 4: Predict the materials for your test data. Analyze the distribution of predictions:
- How many rocks are predicted in the test set?
- How many mines are predicted in the test set?
- Does this distribution seem reasonable given the problem context?

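**Analysis tip**: `predict` returns a NumPy array, which has no `.value_counts()` method. Wrap your predictions in a pandas Series first, e.g. `pd.Series(y_pred).value_counts()` if your predictions are stored in `y_pred`, to get a quick summary of the distribution.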
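As a small illustration of counting predicted labels, the sketch below uses a hand-written array in place of real model output. The name `y_pred_toy` and its values are hypothetical; the point is only that a NumPy array needs to be wrapped in a pandas Series before `.value_counts()` is available.

```python
import numpy as np
import pandas as pd

# Hypothetical predictions, standing in for the output of model.predict(X_test).
y_pred_toy = np.array(["M", "R", "M", "M", "R", "M"])

# NumPy arrays have no .value_counts(); a pandas Series does.
counts = pd.Series(y_pred_toy).value_counts()
print(counts)
```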

In [None]:
# Make predictions and analyze distribution
# Your code here

**Your analysis:**

Number of rocks predicted: *Your answer here*

Number of mines predicted: *Your answer here*

Analysis of distribution: *Your interpretation here*

## 6.4 Model Evaluation

Of course, predictions mean little until we know how accurate they are. Let's evaluate our model's performance using various metrics:

##### Question 5: Evaluate your model's performance by calculating accuracy and precision. Is your model performing well? Provide a detailed analysis of what these metrics tell you about your model's effectiveness.

**Key definitions**:
- **Accuracy**: Overall percentage of correct predictions
- **Precision**: Of all samples predicted as positive (mines), what fraction were actually mines?
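The two definitions above can be written out directly. The sketch below is a plain-Python illustration on made-up labels, assuming mines are encoded as 1 and rocks as 0; the helper names are hypothetical. In practice, scikit-learn's `accuracy_score` and `precision_score` compute the same quantities.

```python
def accuracy(y_true, y_pred):
    # Fraction of predictions that match the true label.
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def precision(y_true, y_pred, positive=1):
    # Among samples predicted positive, the fraction that truly are positive.
    predicted_pos = [t for t, p in zip(y_true, y_pred) if p == positive]
    if not predicted_pos:
        return 0.0  # convention chosen here when nothing is predicted positive
    return sum(t == positive for t in predicted_pos) / len(predicted_pos)


# Toy labels: 1 = mine, 0 = rock (an assumed encoding).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]

print(accuracy(y_true, y_pred))   # 6 of 8 correct -> 0.75
print(precision(y_true, y_pred))  # 3 of 4 predicted mines are real -> 0.75
```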

In [None]:
# Calculate accuracy and precision
# Your code here

**Your analysis:**

Accuracy: *Your result and interpretation here*

Precision: *Your result and interpretation here*

Overall performance assessment: *Your evaluation here*

##### Question 6: Create and analyze a confusion matrix to visually confirm your previous findings. What patterns do you observe? How does this matrix support or contradict your accuracy and precision calculations?

**Confusion matrix reminder**: 
- Diagonal elements = correct predictions
- Off-diagonal elements = incorrect predictions
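- For binary classification, scikit-learn's `confusion_matrix` puts actual classes in rows and predicted classes in columns, with labels in sorted order: [[TN, FP], [FN, TP]]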
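To see that layout in practice, the following sketch builds a confusion matrix for a handful of made-up labels. It assumes mines are encoded as 1 and rocks as 0, and it uses toy values rather than the sonar data.

```python
from sklearn.metrics import confusion_matrix

# Toy labels: 1 = mine, 0 = rock (an assumed encoding).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

cm = confusion_matrix(y_true, y_pred)  # rows = actual, columns = predicted
tn, fp, fn, tp = cm.ravel()            # flatten row by row: TN, FP, FN, TP
print(cm)
print(tn, fp, fn, tp)
```

Here the single missed mine shows up in the bottom-left cell (FN), and the two rocks flagged as mines appear in the top-right cell (FP).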

In [None]:
# Create and visualize confusion matrix
# Your code here

**Your analysis of the confusion matrix:**

Patterns observed: *Your observations here*

How it supports/contradicts previous calculations: *Your analysis here*

##### Question 7: Calculate the additional evaluation metrics we've studied (recall, specificity, F1-score). Analyze each metric - are they satisfactory? In this specific mine detection dataset, would you prefer high recall or high specificity? Justify your choice with the real-world implications.

*Remember: We're working with a binary classification problem, not multiclass!*

**Critical thinking**: Consider the consequences of false positives vs. false negatives in a mine detection scenario.
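scikit-learn has no dedicated specificity function, so one common workaround is to compute recall for the negative class. The sketch below shows this on made-up labels, again assuming mines are 1 and rocks are 0; treat it as one possible approach rather than the only one.

```python
from sklearn.metrics import f1_score, recall_score

# Toy labels: 1 = mine, 0 = rock (an assumed encoding).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

recall = recall_score(y_true, y_pred)                    # TP / (TP + FN)
specificity = recall_score(y_true, y_pred, pos_label=0)  # TN / (TN + FP)
f1 = f1_score(y_true, y_pred)                            # harmonic mean of precision and recall

print(round(recall, 3), round(specificity, 3), round(f1, 3))
```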

In [None]:
# Calculate recall, specificity, and F1-score
# Your code here

**Your analysis:**

Recall: *Your result and interpretation here*

Specificity: *Your result and interpretation here*

F1-score: *Your result and interpretation here*

Which metric is more important for mine detection? *Your choice and justification here*

Real-world implications: *Your explanation of consequences here*

## 6.5 Questions

##### Question 8: By default, scikit-learn's `KNeighborsClassifier` uses k=5 (`n_neighbors=5`) as the number of nearest neighbors. Find the optimal value by plotting accuracy across a range of possible k values. What patterns do you observe?

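**Hyperparameter tuning**: k is a hyperparameter, a setting chosen before training rather than learned from the data. Searching for the value that gives the best evaluation score is a key step in optimizing any model.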
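A sweep over k can be sketched as a simple loop. The example below runs on synthetic data from `make_classification`, not on the sonar set, and the range 1 to 20, the test size, and the random seeds are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (not the sonar set): 200 samples, 10 features.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

k_values = list(range(1, 21))
accuracies = []
for k in k_values:
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    accuracies.append(knn.score(X_test, y_test))  # score() returns accuracy

# First k that reaches the highest accuracy on this particular split.
best_k = k_values[accuracies.index(max(accuracies))]
print(best_k, max(accuracies))
```

Passing `k_values` and `accuracies` to a line plot gives the accuracy curve. Keep in mind that the best k found this way depends on the particular train-test split.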

In [None]:
# Test different k values and plot accuracy
# Your code here

**Your observations:**

Optimal k value: *Your finding here*

Patterns observed: *Your analysis here*

##### Question 9: Repeat the analysis for the metric you want to maximize (recall, based on your previous analysis). Does your preferred value for k change? Which k value would be 'best' overall for this dataset?

**Strategy**: Since recall is most important for mine detection, we should optimize for recall rather than just accuracy.

In [None]:
# Test different k values optimizing for recall
# Your code here

**Your analysis:**

Optimal k for recall: *Your finding here*

Does k change when optimizing for recall vs accuracy? *Your comparison here*

Best overall k value: *Your recommendation with justification*

##### Question 10: Train logistic regression, decision tree, and random forest models on this same data. Optimize each model using techniques from previous lessons to find the best possible performance. Compare all models - which do you prefer overall?

**Comparative analysis**: This will help you understand which algorithm works best for this specific problem and dataset characteristics.

In [None]:
# Import additional model classes
# Your code here

In [None]:
# Optimize logistic regression
# Your code here

In [None]:
# Optimize decision tree
# Your code here

In [None]:
# Optimize random forest
# Your code here

In [None]:
# Compare all models with optimal parameters
# Your code here

**Your model comparison:**

Best performing model: *Your choice here*

Justification: *Why this model works best for this dataset*

Performance summary: *Brief comparison of all models*

##### Question 11: Our current evaluation might be biased: the single test set we're using may not represent the entire dataset well, which could unfairly favor one model. One way to address this is **cross-validation**. Research this method and implement 10-fold cross-validation. Does the outcome you found above change?

**Cross-validation benefits**:
- More robust evaluation by using multiple train/test splits
- Reduces dependence on a single random split
- Provides better estimate of model generalization performance
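As a rough sketch of what 10-fold cross-validation looks like in scikit-learn, the example below scores a k-NN classifier on synthetic data. The data, the choice of `n_neighbors=5`, and the accuracy metric are all assumptions made for illustration; they are not results for the sonar dataset.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (not the sonar set): 200 samples, 10 features.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# cv=10 splits the data into 10 folds; each fold serves once as the test set.
scores = cross_val_score(
    KNeighborsClassifier(n_neighbors=5), X, y, cv=10, scoring="accuracy"
)
print(scores.mean(), scores.std())
```

The mean summarizes overall performance, while the standard deviation hints at how much the score depends on which samples end up in the test fold.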

In [None]:
# Implement 10-fold cross-validation for all models
# Your code here

**Your cross-validation analysis:**

Best model with cross-validation: *Your finding here*

Did the outcome change? *Comparison with previous results*

Why cross-validation provides better evaluation: *Your explanation here*

Final recommendation: *Which model to use in production and why*