# K-nearest Neighbors Regression — Warmup Activity

This warm-up activity demonstrates how the k-nearest neighbors regression algorithm predicts the value at unknown points.

We begin, as usual, by importing the Python libraries we need and setting a few Matplotlib plotting options.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from datascience import Table
%matplotlib inline

# Plot defaults
plt.rcParams['figure.figsize'] = (5, 5)
plt.rcParams['axes.grid'] = True

## Create a small data set
The data set has 150 points with two features, x1 and x2, each drawn uniformly at random between -3 and 3. The target y is a smooth function of the two features plus random noise.

In [None]:
# rng stands for random number generator
rng = np.random.default_rng(42)

# --- Make a 2-D feature dataset with a smooth target y + noise ---
# Think of (x1, x2) as two measurable traits; y is a quantity we want to predict.
n = 150
x1 = rng.uniform(-3, 3, size=n)
x2 = rng.uniform(-3, 3, size=n)

# Smooth underlying function + noise
y_true = np.sin(x1) + 0.5 * x2
y = y_true + rng.normal(0, 0.35, size=n)

data = Table().with_columns(
    "x1", x1,
    "x2", x2,
    "y",  y
)
data

## Data characteristics

In [None]:
data.stats()

In [None]:
data.hist("y")

## Student Challenge #1
What is the standard deviation of the data? (add code cells as needed)

## Split the data
In Lab 10 we learn that:
"A key concept in machine learning is using a subset of a dataset to train an algorithm to make estimates on a separate set of test data. The quality of the machine learning and algorithm can be assessed based on the accuracy of the predictions made on test data. Many times there are also parameters sometimes termed hyper-parameters which can be optimized through an iterative approach on test or validation data. In practice a dataset is randomly split into training and test sets using sampling."

In this case we will use 70% of the data to train the model and then try to predict the other 30%. 

## Student Challenge 1
Explain the trade-offs involved in deciding what fraction of the data to use for training and what fraction to use for testing.

In [None]:
# Train/Test split (simple holdout)
perm = rng.permutation(n)
split = int(n * 0.7)
train_idx, test_idx = perm[:split], perm[split:]

train = data.take(train_idx)
test  = data.take(test_idx)

print(f"Train: {train.num_rows}, Test: {test.num_rows}")

In [None]:
# Visualize training targets (color by y)
fig, ax = plt.subplots(figsize=(6, 6))
plt.scatter(train.column("x1"), train.column("x2"),
            c=train.column("y"), s=30, cmap="viridis")
plt.colorbar(label="y (target)")
plt.title("Training data: color = target y")
plt.xlabel("x1"); plt.ylabel("x2")

# --- Make it square (1:1 aspect ratio) ---
ax.set_aspect('equal', adjustable='box')

plt.show()

The values of y are color-coded. The data do not follow any simple linear pattern. Note, however, that higher values of y tend to be located in the upper right quadrant, associated with higher values of x1 and x2, while lower values of y are located in the lower left quadrant.

## Making a prediction using the nearest neighbors
Now suppose we wanted to predict the value of y at a point where we have no data. For example, at

x1=0.5, x2=0.5

What would we do? 

In [None]:
# Visualize training targets (color by y)
fig, ax = plt.subplots(figsize=(6, 6))
plt.scatter(train.column("x1"), train.column("x2"),
            c=train.column("y"), s=30, cmap="viridis")

# --- Target point (the one we’ll predict) ---
target = np.array([0.5, 0.5])
plt.scatter(target[0], target[1], c='black', s=120, marker='X', label='Target point')
plt.colorbar(label="y (target)")
plt.title("Training data: X marks the target point")
plt.xlabel("x1"); plt.ylabel("x2")

# --- Make it square (1:1 aspect ratio) ---
ax.set_aspect('equal', adjustable='box')

plt.show()

## Helper Functions
The functions below are the heart of the k-nearest neighbor regression method.
We need functions to:

* Find the distance from the unknown point to all of the known points
* Return the closest k points
* Predict the value at an unknown point by averaging the k closest neighbors
* Calculate the error of our predictions by using the training data to predict the test values
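
In symbols, the steps above can be written as follows. The notation is an assumption made here for illustration and does not come from the lab: $N_k(x)$ is the set of the $k$ training points closest to $x$, and $m$ is the number of test points.

$$d(p, q) = \sqrt{(p_1 - q_1)^2 + (p_2 - q_2)^2}$$

$$\hat{y}(x) = \frac{1}{k} \sum_{i \in N_k(x)} y_i$$

$$\mathrm{RMSE} = \sqrt{\frac{1}{m} \sum_{j=1}^{m} \left(y_j - \hat{y}_j\right)^2}$$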

In [None]:
def distance(p, q):
    """Euclidean distance between two 2-D points p and q."""
    dx = p[0] - q[0]
    dy = p[1] - q[1]
    return (dx*dx + dy*dy)**0.5

def knn_indices(point, X_train, k):
    """
    Return indices of the k nearest neighbors in X_train to 'point'.
    """
    dists = []
    for i, p in enumerate(X_train):
        dists.append((distance(point, p), i))
    dists.sort(key=lambda t: t[0])     # sort by distance
    idxs = [i for (_, i) in dists[:k]]
    return idxs

def predict_knn(point, X_train, y_train, k):
    """
    Predict numeric target for a single point = average of the k nearest y's.
    """
    idxs = knn_indices(point, X_train, k)
    return float(np.mean(y_train[idxs]))

def rmse(y_true, y_pred):
    """
    Find the root mean square error of the predictions.
    """
    return float(np.sqrt(np.mean((y_true - y_pred)**2)))
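
To check the idea by hand, here is a minimal sketch on a tiny made-up data set. The toy points, the query point, and the names `knn_predict_sketch` and `rmse_sketch` are all hypothetical and chosen so they do not overwrite the helper functions above. The sketch uses NumPy's vectorized distance calculation rather than the explicit loop, but it should give the same kind of result: the average of the k closest y values.

```python
import numpy as np

def knn_predict_sketch(point, X, y, k):
    """Average the y values of the k rows of X closest to point."""
    dists = np.linalg.norm(X - point, axis=1)       # Euclidean distance to every row
    nearest = np.argsort(dists, kind="stable")[:k]  # indices of the k smallest distances
    return float(np.mean(y[nearest]))

def rmse_sketch(y_true, y_pred):
    """Root mean square error between two arrays of equal length."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Hypothetical toy training data: four points with known y values
X_toy = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [3.0, 3.0]])
y_toy = np.array([1.0, 2.0, 3.0, 10.0])
query = np.array([0.2, 0.1])

pred_k1 = knn_predict_sketch(query, X_toy, y_toy, 1)  # closest point only: 1.0
pred_k3 = knn_predict_sketch(query, X_toy, y_toy, 3)  # (1 + 2 + 3) / 3 = 2.0
pred_k4 = knn_predict_sketch(query, X_toy, y_toy, 4)  # (1 + 2 + 3 + 10) / 4 = 4.0

# Errors of 1 and 2 give sqrt((1 + 4) / 2) = sqrt(2.5)
toy_rmse = rmse_sketch(np.array([1.0, 3.0]), np.array([2.0, 5.0]))

print(pred_k1, pred_k3, pred_k4, round(toy_rmse, 3))  # 1.0 2.0 4.0 1.581
```

Notice how the far-away point with y = 10 pulls the k=4 prediction upward even though it is nowhere near the query point.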

To find the value at x1=0.5, x2=0.5 using the five nearest neighbors (k=5), we first locate those neighbors, as shown below.

In [None]:
# Convert your existing training Table to arrays
X_train = np.column_stack([train.column("x1"), train.column("x2")])
y_train = train.column("y")

# Convert the test Table to NumPy arrays as well
X_test  = np.column_stack([test.column("x1"),  test.column("x2")])
y_test  = test.column("y")

# create the target point (x1, x2)
target = np.array([0.5, 0.5])

# Number of neighbors
k = 5

# Compute distances from all training points
dists = np.array([distance(p, target) for p in X_train])

# Get indices of k nearest neighbors
nearest_idx = np.argsort(dists)[:k]
r = dists[nearest_idx[-1]]  # distance to kth neighbor (for circle)

# --- Plot ---
fig, ax = plt.subplots(figsize=(6, 6))
sc = plt.scatter(X_train[:, 0], X_train[:, 1], c=y_train, cmap='viridis', s=40, label='Training data')
plt.colorbar(sc, label='Target value (y)')

# Highlight the k nearest neighbors
plt.scatter(X_train[nearest_idx, 0], X_train[nearest_idx, 1],
            edgecolor='red', facecolor='none', s=120, linewidth=2, label=f'{k} nearest neighbors')

# Plot the target point itself
plt.scatter(target[0], target[1], c='black', s=150, marker='X', label='Target point')

# Circle showing the distance to the kth neighbor
circle = plt.Circle(target, r, color='red', fill=False, linestyle='--', linewidth=1.8)
plt.gca().add_artist(circle)

plt.xlabel('x1'); plt.ylabel('x2')
plt.title(f'KNN Visualization (k={k})\nTarget x1={target[0]}, x2={target[1]} and its nearest neighbors')
plt.legend()
plt.grid(True, alpha=0.3)

# --- Make it square (1:1 aspect ratio) ---
ax.set_aspect('equal', adjustable='box')

plt.show()

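# Print out neighbor coordinates, y values, and distances
neighbor_info = np.column_stack([X_train[nearest_idx], y_train[nearest_idx], dists[nearest_idx]])
print("Nearest neighbors (x1, x2, y, distance):\n", np.round(neighbor_info, 3))

# The prediction is the average of the neighbors' y values
print("Predicted y at target:", round(predict_knn(target, X_train, y_train, k), 3))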

## Student Challenge 2
In plain words, what does KNN regression do to make a prediction? (Write a paragraph or more.)

## What's the best choice k, the number of neighbors?
As quoted above from Lab 10:
"Many times there are also parameters sometimes termed hyper-parameters which can be optimized through an iterative approach on test or validation data."

As implemented here, k-nearest neighbors regression has one hyperparameter: k.

The optimal number of neighbors in k-nearest neighbors regression depends on the data set. One way to find a good choice is to try different values of k, use the training data to predict the test data for each, and compare the resulting RMSE.

Let's try a variety of k's and see which works best for our data set.

In [None]:
# Validation curve: RMSE vs k

# Convert training and test Tables to NumPy arrays for features and targets
X_train = np.column_stack([train.column("x1"), train.column("x2")])
y_train = np.array(train.column("y"))

X_test  = np.column_stack([test.column("x1"),  test.column("x2")])
y_test  = np.array(test.column("y"))

# Try a range of k values and compute RMSE on the test set
Ks = [1, 3, 5, 7, 9, 11, 15, 21, 31]
rmses = []

for K in Ks:
    y_pred_K = np.array([predict_knn(pt, X_train, y_train, K) for pt in X_test])
    rmses.append(rmse(y_test, y_pred_K))

plt.plot(Ks, rmses, marker="o")
plt.title("Validation curve: RMSE vs k")
plt.xlabel("k (neighbors)")
plt.ylabel("Test RMSE (lower is better)")
plt.grid(True, alpha=0.3)
plt.show()


## Student Challenge 3

What is the best choice of k for this data set?  How much do you think it matters if you choose a slightly suboptimal value of k?

## Looking at the full prediction space
To get a full sense of the k-nearest neighbors regression predictions, we can use the algorithm to predict y values over a dense grid of points and draw contours of the predicted values.

In [None]:
# Make a grid over the feature space
g = 50
gx = np.linspace(-3, 3, g)
gy = np.linspace(-3, 3, g)
GX, GY = np.meshgrid(gx, gy)
grid_pts = np.column_stack([GX.ravel(), GY.ravel()])

# Predict on grid
k_map = 9  # pick a moderate k for a smooth map
grid_pred = np.array([predict_knn(pt, X_train, y_train, k_map) for pt in grid_pts]).reshape(g, g)

# Plot the prediction field + training points
plt.figure(figsize=(6,5))
cn = plt.contourf(GX, GY, grid_pred, levels=20, cmap="viridis")
plt.colorbar(cn, label="Predicted ŷ")
plt.scatter(X_train[:,0], X_train[:,1], c="white", s=10, alpha=0.6, label="train pts")
plt.title(f"KNN Regression Prediction Map (k={k_map})")
plt.xlabel("x1"); plt.ylabel("x2"); plt.legend()
plt.show()

## Student Challenge 4
Based on the figure above, roughly what would be the predicted value at x1=1.0, x2=0?

## Student Challenge 5
In this exercise, we use two features, x1 and x2, to predict the value of our target, y. Do you think this method could be generalized to predict a target value from more than two features? How would that work?
(Write a paragraph).