# K-Nearest Neighbors: Classification and Regression by Similarity

Welcome to the seventh notebook in our **Machine Learning Basics for Beginners** series! After exploring decision trees, let's dive into **K-Nearest Neighbors (KNN)**, a simple yet powerful supervised learning algorithm used for both classification and regression. KNN makes predictions based on the idea of similarity—finding the closest examples in the data.

**What You'll Learn in This Notebook:**
- What K-Nearest Neighbors is and when to use it.
- How KNN works in simple terms.
- A hands-on example of classifying flowers based on measurements using KNN.
- An interactive exercise to adjust the number of neighbors and see how predictions change.
- Visualizations to understand the concept of nearest neighbors.

Let's get started!

## 1. What is K-Nearest Neighbors?

**K-Nearest Neighbors (KNN)** is a supervised learning algorithm that makes predictions based on the similarity of data points. It’s often called a "lazy learning" algorithm because it doesn’t build a model during training—it simply stores the data and makes decisions at prediction time.

- **Goal**: For a new data point, find the `K` closest points (neighbors) in the training data and use their labels or values to make a prediction.
- **When to Use It**: Use KNN for classification (e.g., identifying categories like spam or not spam) or regression (e.g., predicting a numerical value like house price) when you have labeled data and believe similar inputs should have similar outputs. It works well with small to medium-sized datasets.
- **Examples**:
  - Classifying a flower as a specific species based on petal and sepal measurements.
  - Predicting a house price based on the prices of nearby houses with similar features.
  - Recommending movies by finding users with similar tastes.

**Analogy**: Imagine you’re trying to decide if a new friend likes action movies. You look at your 5 closest friends (based on age or other traits). If most of them like action movies, you guess your new friend will too. KNN works the same way—guessing based on the "nearest" examples.

## 2. How Does K-Nearest Neighbors Work?

KNN is straightforward and intuitive. Here’s how it works step by step:

1. **Store the Data**: During training, KNN simply stores all the training data points and their labels (no complex model is built).
2. **Choose K**: Decide on a number `K`, which is how many neighbors to consider when making a prediction. Common choices are 3, 5, or 7; with two classes, an odd `K` means the vote can never end in a tie.
3. **Measure Distance**: For a new data point, calculate the "distance" to all points in the training data to find the closest ones. Distance is often measured using Euclidean distance, the straight-line distance between two points on a graph. For points `(x1, y1)` and `(x2, y2)` it is `sqrt((x1 - x2)^2 + (y1 - y2)^2)`.
4. **Find K Nearest Neighbors**: Identify the `K` training points closest to the new point.
5. **Make a Prediction**:
   - For **classification**: Take a majority vote among the `K` neighbors’ labels (e.g., if 3 out of 5 neighbors are "spam," predict "spam").
   - For **regression**: Take the average of the `K` neighbors’ values (e.g., average their house prices).
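
The five steps above can be sketched in a few lines of NumPy. This is a minimal illustration rather than how any particular library implements KNN; the function name `knn_classify` and the toy points are made up for this example.

```python
from collections import Counter

import numpy as np


def knn_classify(X_train, y_train, query, k=3):
    """Predict a label for `query` by majority vote among its k nearest training points."""
    # Step 3: Euclidean distance from the query to every stored training point
    distances = np.sqrt(((X_train - query) ** 2).sum(axis=1))
    # Step 4: indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Step 5 (classification): majority vote among the neighbors' labels
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0], nearest, distances[nearest]


# Step 1: "training" is just keeping the data around (hypothetical toy values)
X_train = np.array([[1.0, 1.0], [1.2, 0.8], [3.0, 3.2], [3.1, 2.9], [0.9, 1.1]])
y_train = np.array(["A", "A", "B", "B", "A"])

# Step 2: choose K, then classify a new point
label, nearest, dists = knn_classify(X_train, y_train, np.array([1.5, 1.2]), k=3)
print(label)    # majority label among the 3 nearest points
print(nearest)  # row indices of those neighbors in X_train
print(dists)    # their distances to the query
```

Note that this sketch breaks a tied vote by whichever label `Counter` happens to list first; real implementations may use a different rule.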

**Analogy**: Think of KNN as asking for advice from your closest friends. If you’re deciding what restaurant to try, you ask your 3 nearest friends (by location or taste) for their favorite nearby spot. If most say "Pizza Place," you go there. KNN predicts by consulting the nearest data points in a similar way.
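
The regression rule from step 5 (average the neighbors' values) can be tried with scikit-learn's `KNeighborsRegressor`. The house sizes and prices below are made-up numbers chosen only to keep the arithmetic easy to follow.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Hypothetical toy data: house size in square meters -> price in thousands
sizes = np.array([[50], [60], [80], [100], [120]])
prices = np.array([150.0, 180.0, 240.0, 300.0, 360.0])

# With the default uniform weights, the prediction is the plain mean
# of the K nearest neighbors' target values
regressor = KNeighborsRegressor(n_neighbors=3)
regressor.fit(sizes, prices)

query = np.array([[70]])
predicted_price = regressor.predict(query)[0]

# Cross-check by hand: the 3 sizes closest to 70 are 60, 80 and 50
_, neighbor_idx = regressor.kneighbors(query)
manual_mean = prices[neighbor_idx[0]].mean()

print(predicted_price)  # mean of 180, 240 and 150
print(manual_mean)      # same value, computed manually
```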

**Key Parameter**: The choice of `K` matters. A small `K` (like 1) can be too sensitive to noise (overfitting), while a large `K` might smooth over important differences (underfitting).
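
To see both failure modes side by side, here is a small sketch on made-up data: one "B" point is deliberately planted inside the "A" cluster to act as noise, and class "A" has one more point than class "B" overall. The data and variable names are assumptions for illustration only.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Made-up toy data: five "A" points, three "B" points, plus one noisy "B" inside the "A" cluster
X_toy = np.array([
    [1.0, 1.0], [1.2, 1.1], [0.9, 1.3], [1.1, 0.8], [1.3, 1.2],  # class A
    [3.0, 3.0], [3.2, 3.1], [2.9, 3.3],                          # class B
    [1.05, 1.05],                                                # noisy B
])
y_toy = np.array(["A"] * 5 + ["B"] * 3 + ["B"])

queries = np.array([
    [1.08, 1.08],  # sits in the A cluster, right next to the noisy point
    [3.0, 3.1],    # sits in the B cluster
])

results = {}
for k in (1, 3, 9):
    clf = KNeighborsClassifier(n_neighbors=k).fit(X_toy, y_toy)
    results[k] = clf.predict(queries).tolist()
    print(f"K={k}: {results[k]}")
# K=1: ['B', 'B']  -> the first query is misled by the single noisy point
# K=3: ['A', 'B']  -> both queries follow their local cluster
# K=9: ['A', 'A']  -> every vote includes all 9 points, so the overall majority class always wins
```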

## 3. Example: Classifying Flowers with KNN

Let’s see KNN in action with a simplified dataset inspired by the famous Iris dataset. We’ll classify flowers into two species (Setosa or Versicolor) based on two features: petal length and petal width.

**Dataset** (simplified):
- Petal Length (cm): 1.4, 1.5, 4.5, 4.0, 1.3
- Petal Width (cm): 0.2, 0.2, 1.5, 1.3, 0.3
- Species (Label): Setosa, Setosa, Versicolor, Versicolor, Setosa

We’ll use Python’s `scikit-learn` library to create a KNN model, train it on this data, and predict the species of a new flower. Focus on the steps and output, not the code details.

**Instructions**: Run the code below to see how KNN classifies flowers and visualizes the nearest neighbors.

In [None]:
# Import necessary libraries
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier

# Our small dataset
X = np.array([[1.4, 0.2], [1.5, 0.2], [4.5, 1.5], [4.0, 1.3], [1.3, 0.3]])  # Features: petal length, petal width
y = np.array(['Setosa', 'Setosa', 'Versicolor', 'Versicolor', 'Setosa'])  # Labels: species

# Create and train the KNN model with K=3
model = KNeighborsClassifier(n_neighbors=3)
model.fit(X, y)

# Predict for a new flower with petal length=2.0 cm, petal width=0.5 cm
new_flower = np.array([[2.0, 0.5]])
prediction = model.predict(new_flower)[0]
print(f"New Flower (petal length=2.0 cm, petal width=0.5 cm): Predicted as {prediction}")

# Get the indices of the 3 nearest neighbors
distances, indices = model.kneighbors(new_flower)
print(f"Distances to 3 nearest neighbors: {distances[0]}")
print(f"Indices of 3 nearest neighbors: {indices[0]}")
print(f"Species of 3 nearest neighbors: {y[indices[0]]}")

# Visualize the data and nearest neighbors
# Create a mesh grid for decision boundary
x_min, x_max = X[:, 0].min() - 0.5, X[:, 0].max() + 0.5
y_min, y_max = X[:, 1].min() - 0.2, X[:, 1].max() + 0.2
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.1), np.arange(y_min, y_max, 0.05))
Z = model.predict(np.c_[xx.ravel(), yy.ravel()])
# Encode Setosa as 1 so it lands on the blue end of RdYlBu, matching the blue Setosa dots
Z = np.array([1 if z == 'Setosa' else 0 for z in Z]).reshape(xx.shape)

# Plot decision boundary and points
plt.contourf(xx, yy, Z, alpha=0.3, cmap=plt.cm.RdYlBu)
plt.scatter(X[y == 'Setosa'][:, 0], X[y == 'Setosa'][:, 1], color='blue', label='Setosa', alpha=0.8)
plt.scatter(X[y == 'Versicolor'][:, 0], X[y == 'Versicolor'][:, 1], color='red', label='Versicolor', alpha=0.8)
plt.scatter(new_flower[0][0], new_flower[0][1], color='green', marker='x', s=200, label='New Flower')
# Highlight the nearest neighbors
for idx in indices[0]:
    plt.scatter(X[idx, 0], X[idx, 1], color='yellow', edgecolor='black', s=150, alpha=0.5, label='Neighbor' if idx == indices[0][0] else '')
plt.xlabel('Petal Length (cm)')
plt.ylabel('Petal Width (cm)')
plt.title('K-Nearest Neighbors (K=3): Flower Classification')
plt.legend()
plt.grid(True)
plt.show()

print("Look at the plot above:")
print("- Blue dots are Setosa flowers.")
print("- Red dots are Versicolor flowers.")
print("- The colored background shows the decision regions based on K=3 nearest neighbors.")
print("- The green 'X' is the new flower being classified.")
print("- Yellow highlighted points are the 3 nearest neighbors used for the prediction.")
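
A prediction hides how close the vote was. As a small sketch, scikit-learn's `predict_proba` reports the share of the `K` neighbors that voted for each class. The second query flower below (3.0 cm, 0.9 cm) is a hypothetical in-between flower added for illustration.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Same small dataset as in the example above
X = np.array([[1.4, 0.2], [1.5, 0.2], [4.5, 1.5], [4.0, 1.3], [1.3, 0.3]])
y = np.array(['Setosa', 'Setosa', 'Versicolor', 'Versicolor', 'Setosa'])
model = KNeighborsClassifier(n_neighbors=3).fit(X, y)

# Two query flowers: the one from the example, and a hypothetical in-between flower
queries = np.array([[2.0, 0.5], [3.0, 0.9]])
vote_shares = model.predict_proba(queries)

print(model.classes_)  # column order of vote_shares
for flower, shares in zip(queries, vote_shares):
    readable = {str(c): round(float(s), 2) for c, s in zip(model.classes_, shares)}
    print(flower.tolist(), readable)
```

For the first flower all three neighbors are Setosa, while for the second the vote is split two to one in favor of Versicolor.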

## 4. Interactive Exercise: Adjust K and Predict

Now it’s your turn to experiment with KNN! In this exercise, you’ll choose the number of neighbors (`K`), add one labeled flower to the dataset, and then pick an unlabeled flower to classify, so you can see how the prediction changes.

**Instructions**:
- Run the code below.
- Enter a value for `K` (number of neighbors, e.g., 1, 3, 5).
- Add a new flower by specifying petal length, petal width, and species.
- Specify a flower to predict by entering its petal length and width.
- Observe how the prediction and nearest neighbors change with different `K` values and data.

In [None]:
# Interactive exercise for KNN
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier

print("Welcome to the 'Adjust K and Predict' Exercise!")
print("You’ll adjust the number of neighbors (K), add a new flower, and predict the species of another flower.")

# Original dataset
X = np.array([[1.4, 0.2], [1.5, 0.2], [4.5, 1.5], [4.0, 1.3], [1.3, 0.3]])
y = np.array(['Setosa', 'Setosa', 'Versicolor', 'Versicolor', 'Setosa'])

# Ask user for K value
try:
    k = int(input("Enter the number of neighbors (K, e.g., 3): "))
    if k < 1 or k > len(X):
        raise ValueError(f"K must be between 1 and {len(X)}")  # no trailing period: the handler below adds one
except ValueError as e:
    print(f"Invalid input: {e}. Defaulting to K=3.")
    k = 3

# Ask user to add a new data point
try:
    new_length = float(input("Enter petal length for new flower (cm, e.g., 2.5): "))
    new_width = float(input("Enter petal width for new flower (cm, e.g., 0.5): "))
    new_species = input("Enter species for new flower (Setosa/Versicolor): ").strip().capitalize()
    if new_species not in ['Setosa', 'Versicolor']:
        raise ValueError("Species must be Setosa or Versicolor")  # no trailing period: the handler below adds one
    X = np.vstack([X, [new_length, new_width]])
    y = np.append(y, new_species)
    print(f"Added flower: petal length={new_length} cm, petal width={new_width} cm, species={new_species}.")
except ValueError as e:
    print(f"Invalid input: {e}. Using original data without changes.")

# Train the model with updated data
model = KNeighborsClassifier(n_neighbors=k)
model.fit(X, y)

# Ask user for a new flower to predict (fall back to a default on invalid input)
try:
    predict_length = float(input("Enter petal length to predict species (cm, e.g., 2.0): "))
    predict_width = float(input("Enter petal width to predict species (cm, e.g., 0.5): "))
except ValueError:
    predict_length, predict_width = 2.0, 0.5
    print("Invalid input. Defaulting to petal length=2.0 cm, petal width=0.5 cm.")

# Predict once, whichever values we ended up with
new_flower = np.array([[predict_length, predict_width]])
prediction = model.predict(new_flower)[0]
print(f"Predicted species for flower (petal length={predict_length} cm, petal width={predict_width} cm): {prediction}")
distances, indices = model.kneighbors(new_flower)
print(f"Distances to {k} nearest neighbors: {distances[0]}")
print(f"Species of {k} nearest neighbors: {y[indices[0]]}")

# Visualize the updated data and nearest neighbors
x_min, x_max = X[:, 0].min() - 0.5, X[:, 0].max() + 0.5
y_min, y_max = X[:, 1].min() - 0.2, X[:, 1].max() + 0.2
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.1), np.arange(y_min, y_max, 0.05))
Z = model.predict(np.c_[xx.ravel(), yy.ravel()])
# Encode Setosa as 1 so it lands on the blue end of RdYlBu, matching the blue Setosa dots
Z = np.array([1 if z == 'Setosa' else 0 for z in Z]).reshape(xx.shape)

plt.contourf(xx, yy, Z, alpha=0.3, cmap=plt.cm.RdYlBu)
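# Plot the 5 original flowers, then the user's flower only if one was actually added
# (if the input above was invalid, X still holds just the original data)
n_orig = 5
X_orig, y_orig = X[:n_orig], y[:n_orig]
plt.scatter(X_orig[y_orig == 'Setosa'][:, 0], X_orig[y_orig == 'Setosa'][:, 1], color='blue', label='Original Setosa', alpha=0.8)
plt.scatter(X_orig[y_orig == 'Versicolor'][:, 0], X_orig[y_orig == 'Versicolor'][:, 1], color='red', label='Original Versicolor', alpha=0.8)
if len(X) > n_orig:
    plt.scatter(X[-1, 0], X[-1, 1], color='orange', label=f"Your Added Data ({y[-1]})", alpha=0.8)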
plt.scatter(new_flower[0][0], new_flower[0][1], color='green', marker='x', s=200, label='Prediction')
# Highlight the nearest neighbors
for idx in indices[0]:
    plt.scatter(X[idx, 0], X[idx, 1], color='yellow', edgecolor='black', s=150, alpha=0.5, label='Neighbor' if idx == indices[0][0] else '')
plt.xlabel('Petal Length (cm)')
plt.ylabel('Petal Width (cm)')
plt.title(f'K-Nearest Neighbors (K={k}): Flower Classification')
plt.legend()
plt.grid(True)
plt.show()

print("Look at the plot above:")
print("- Blue dots are original Setosa flowers.")
print("- Red dots are original Versicolor flowers.")
print("- Orange dot is the flower data you added.")
print("- The colored background shows the decision regions based on K nearest neighbors.")
print("- The green 'X' is the flower being classified.")
print(f"- Yellow highlighted points are the {k} nearest neighbors used for the prediction.")

## 5. Key Considerations for K-Nearest Neighbors

KNN is simple and intuitive, but it comes with some considerations to keep in mind:

- **Choosing K**: The value of `K` significantly affects predictions. A small `K` can be noisy and overfit (too sensitive to individual points), while a large `K` can underfit (oversmoothing by considering too many distant points).
- **Computationally Expensive**: Since KNN stores all training data and, in its simplest form, computes the distance to every training point for each prediction, prediction time and memory use grow with the size of the dataset. It’s not ideal for very large datasets.
- **Sensitive to Feature Scaling**: If features are on different scales (e.g., one feature in meters, another in centimeters), the feature with the larger numeric range dominates the distance calculation. Features usually need to be normalized or standardized first.
- **Curse of Dimensionality**: KNN struggles in high-dimensional spaces (many features) because distances become less meaningful as dimensions increase.
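
The scaling issue is easy to reproduce. In the sketch below, the second feature ("stem height in mm") is a hypothetical, uninformative measurement invented for this example; because its numbers are much larger than petal length in cm, it dominates the unscaled distances. Wrapping the classifier in a pipeline with `StandardScaler` is one common way to put the features on equal footing.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Columns: petal length (cm), hypothetical stem height (mm, unrelated to species)
X_mixed = np.array([
    [1.4, 300], [1.5, 500], [1.3, 400],  # Setosa: short petals
    [4.5, 320], [4.0, 520], [4.2, 420],  # Versicolor: long petals
])
y_mixed = np.array(['Setosa'] * 3 + ['Versicolor'] * 3)

# A flower with a long petal (Versicolor-like) and a stem height of 300 mm
query = np.array([[4.1, 300]])

# Without scaling, stem height (hundreds of mm) swamps petal length (a few cm)
unscaled = KNeighborsClassifier(n_neighbors=3).fit(X_mixed, y_mixed)
unscaled_pred = unscaled.predict(query)[0]

# With standardization, both features contribute on a comparable scale
scaled = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3)).fit(X_mixed, y_mixed)
scaled_pred = scaled.predict(query)[0]

print(f"Unscaled prediction: {unscaled_pred}")  # Setosa
print(f"Scaled prediction:   {scaled_pred}")    # Versicolor
```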

**Analogy**: KNN is like asking for opinions from nearby friends. If you ask too few (small K), one odd opinion can mislead you. If you ask too many (large K), you might dilute good advice. Also, if your friends are far away or speak different "languages" (unscaled features), their advice might not make sense.

Despite these limitations, KNN is a great starting point for understanding similarity-based learning and can be effective for small, well-scaled datasets.
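
One common way to pick `K` in practice is to compare a few candidates with cross-validation. The sketch below assumes the full Iris dataset that ships with scikit-learn (`load_iris`), which is larger than the five-flower toy data used in this notebook; the candidate list is an arbitrary choice for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: use scikit-learn's bundled Iris data (150 flowers, 4 features, 3 species)
X_iris, y_iris = load_iris(return_X_y=True)

candidate_ks = [1, 3, 5, 7, 9, 11]
mean_scores = {}
for k in candidate_ks:
    # Scale inside the pipeline so each fold is standardized using only its own training part
    pipeline = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    scores = cross_val_score(pipeline, X_iris, y_iris, cv=5)  # 5-fold accuracy
    mean_scores[k] = scores.mean()
    print(f"K={k:>2}: mean accuracy = {mean_scores[k]:.3f}")

best_k = max(mean_scores, key=mean_scores.get)
print(f"Best K among the candidates: {best_k}")
```

The differences between neighboring `K` values are often small, so treat the "best" value as a reasonable starting point rather than a definitive answer.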

## 6. Key Takeaways

- **K-Nearest Neighbors (KNN)** is a supervised learning algorithm for classification and regression, predicting based on the similarity (distance) to nearby data points.
- It works by finding the `K` closest training points to a new point and using majority vote (classification) or average (regression) for prediction.
- Use it for tasks like species classification or price prediction when similarity matters and datasets are small to medium-sized.
- Be aware of limitations: choosing the right `K` is crucial, it’s slow for large data, and it requires scaled features and low-dimensional data for best results.

You’ve now learned a fundamental similarity-based algorithm! KNN introduces the concept of distance and neighbor-based learning, which is a building block for other methods in machine learning.

**What's Next?**
Move on to **Notebook 8: Support Vector Machines** to learn about a powerful algorithm for classification that finds the best boundary to separate classes. See you there!