# Assignment-1 (KNN Classifier)--PartB
## Breast cancer wisconsin dataset
In this assignment, we will build a KNN classifier that takes a feature vector as input and outputs a label, 0 or 1.

The breast cancer dataset contains 569 samples with 30 real, positive features (including cancer mass attributes like mean radius, mean texture, mean perimeter, et cetera). Of the samples, 212 are labeled “malignant” and 357 are labeled “benign”. 
You can find more details in: https://scikit-learn.org/stable/datasets/index.html#breast-cancer-dataset

In [2]:
import numpy as np
import time
from sklearn.metrics import accuracy_score
from sklearn.datasets import load_breast_cancer

In [3]:
## Load the training set
data = load_breast_cancer()
X = data.data
y = data.target


In [4]:
## print some statistics on the dataset
### TODO YOUR CODE ###

print("Total number of samples:", X.shape[0])

print("Number of features per sample:", X.shape[1])

print("Total number of classes:", len(data.target_names))

print("Number of samples in each class:")
for i, class_name in enumerate(data.target_names):
    print(f"  {class_name}: {sum(y == i)}")


Total number of samples: 569
Number of features per sample: 30
Total number of classes: 2
Number of samples in each class:
  malignant: 212
  benign: 357


## Splitting the data into training, validation, and test sets

In [5]:
from sklearn.model_selection import train_test_split
### TODO YOUR CODE ###
# Set random seed
random_seed = 777

# Split into training set (70%) and temporary set (test and validation) (30%)
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.3, random_state=random_seed)

# Split the temporary set into validation set (15%) and test set (15%)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=random_seed)

# Verify the sizes of each set
print(f"Training set size: {X_train.shape[0]} samples")
print(f"Validation set size: {X_val.shape[0]} samples")
print(f"Test set size: {X_test.shape[0]} samples")




Training set size: 398 samples
Validation set size: 85 samples
Test set size: 86 samples


## Scaling the features

In [6]:
from sklearn.preprocessing import StandardScaler

# Create a StandardScaler instance
scaler = StandardScaler()

# Fit the scaler on the training data and transform it
X_train = scaler.fit_transform(X_train)

# Use the same scaler to transform the validation and test sets (without fitting again)
X_val = scaler.transform(X_val)
X_test = scaler.transform(X_test)


## Nearest neighbor classification with L2 distance

To compute nearest neighbors in our data set, we need to first be able to compute distances between data points. A natural distance function is _Euclidean distance_: for two vectors $x, y \in \mathbb{R}^d$, their Euclidean distance is defined as 
$$\|x - y\| = \sqrt{\sum_{i=1}^d (x_i - y_i)^2}.$$
Often we omit the square root, and simply compute _squared Euclidean distance_:
$$\|x - y\|^2 = \sum_{i=1}^d (x_i - y_i)^2.$$
For the purposes of nearest neighbor computations, the two are equivalent: for three vectors $x, y, z \in \mathbb{R}^d$, we have $\|x - y\| \leq \|x - z\|$ if and only if $\|x - y\|^2 \leq \|x - z\|^2$.
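
As a quick sanity check of this equivalence, the sketch below compares the nearest neighbor found with the Euclidean distance and with its square. The arrays `points` and `query` are made-up illustrative values, not samples from the dataset.

```python
import numpy as np

# Illustrative data (assumption: not taken from the breast cancer dataset)
points = np.array([[1.0, 2.0], [4.0, 6.0], [0.5, 0.5]])
query = np.array([1.0, 1.0])

# Squared Euclidean distance from the query to every point
sq_dist = np.sum((points - query) ** 2, axis=1)
# Euclidean distance is the square root of the squared distance
dist = np.sqrt(sq_dist)

# The square root is monotonic, so both rankings pick the same neighbor
nn_sq = int(np.argmin(sq_dist))
nn_l2 = int(np.argmin(dist))
print(nn_sq, nn_l2)
```

Skipping the square root saves a little computation when only the ranking of neighbors matters.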

## 1. Nearest neighbor classification with L2 distance

Write a function, **NN_L2**, which takes as input the training data (`trainx` and `trainy`) and the test points (`evalx`) and predicts labels for these test points using 1-NN classification. These labels should be returned in a `numpy` array with one entry per test point. For **NN_L2**, the L2 norm should be used as the distance metric.

In [7]:
def NN_L2(trainx, trainy, evalx):
    # inputs: trainx, trainy, evalx <-- as defined above
    # output: an np.array of the predicted labels for evalx
    
    ### START CODE HERE ###
    predictions = []
    
    # Iterate over each point in evalx
    for eval_point in evalx:
        # Calculate the L2 (Euclidean) distance between the eval_point and all training points
        distances = np.sqrt(np.sum((trainx - eval_point)**2, axis=1))
        
        # Find the index of the nearest neighbor (smallest distance)
        nearest_neighbor_idx = np.argmin(distances)
        
        # Get the label of the nearest neighbor and append it to predictions
        predictions.append(trainy[nearest_neighbor_idx])
    
    # Convert predictions list to a numpy array and return
    return np.array(predictions)
    ### END CODE HERE ###

## 2. K-Nearest neighbor classification with L2 distance

Write a function, **KNN_L2**, which takes as input the training data (`trainx` and `trainy`), the test points (`evalx`), and the value of **K** (integer) and predicts labels for these test points using K-NN classification. These labels should be returned in a `numpy` array with one entry per test point.

In [8]:
def KNN_L2(trainx, trainy, evalx, K):
    # output: an np.array of the predicted values for evalx 
    
    ### START CODE HERE ###
    # Initialize an empty array to store the predictions
    predictions = np.empty(evalx.shape[0], dtype=trainy.dtype)
    
    # Iterate over each point in evalx
    for i, eval_point in enumerate(evalx):
        # Calculate the L2 (Euclidean) distance between the eval_point and all training points
        distances = np.sqrt(np.sum((trainx - eval_point)**2, axis=1))
        
        # Find the indices of the K nearest neighbors
        nearest_neighbor_idxs = np.argsort(distances)[:K]
        
        # Get the labels of the K nearest neighbors
        nearest_neighbor_labels = trainy[nearest_neighbor_idxs]
        
        # Predict the label by taking the most common label among the nearest neighbors
        unique_labels, label_counts = np.unique(nearest_neighbor_labels, return_counts=True)
        most_common_label = unique_labels[np.argmax(label_counts)]
        
        # Store the predicted label in the predictions array
        predictions[i] = most_common_label
    
    return predictions
    ### END CODE HERE ###

## 3. Nearest neighbor classification with L1 distance

We now compute nearest neighbors using the L1 distance (sometimes called *Manhattan Distance*).

Write a function, **NN_L1**, which again takes as input the arrays `trainx`, `trainy`, and `evalx`, and predicts labels for the test points using 1-nearest neighbor classification. For **NN_L1**, the L1 distance metric should be used. As before, the predicted labels should be returned in a `numpy` array with one entry per test point.

Notice that **NN_L1** and **NN_L2** may well produce different predictions on the test set.
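
To see how the two metrics can disagree, here is a minimal sketch with two hypothetical training points and a query at the origin. The names `train_pts` and `query` are illustrative only.

```python
import numpy as np

# Illustrative points (assumption: not taken from the dataset)
train_pts = np.array([[2.0, 2.0], [0.0, 3.0]])
query = np.array([0.0, 0.0])

# L1 distances: |2| + |2| = 4 and |0| + |3| = 3
l1 = np.sum(np.abs(train_pts - query), axis=1)
# L2 distances: sqrt(8) (about 2.83) and 3
l2 = np.sqrt(np.sum((train_pts - query) ** 2, axis=1))

nn_l1 = int(np.argmin(l1))  # the second point is closest under L1
nn_l2 = int(np.argmin(l2))  # the first point is closest under L2
print(nn_l1, nn_l2)
```

The diagonal point is penalized more under L1 because both coordinate differences add up in full, whereas L2 favors it over the axis-aligned point.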

In [9]:
def NN_L1(trainx, trainy, evalx):
    # inputs: trainx, trainy, evalx <-- as defined above
    # output: an np.array of the predicted labels for evalx
    
    # Get the number of test points
    n_test = evalx.shape[0]
    
    # Initialize an array to store predictions
    predictions = np.zeros(n_test, dtype=trainy.dtype)
    
    # Iterate through each test point
    for i in range(n_test):
        # Compute L1 distances between the test point and all training points
        distances = np.sum(np.abs(trainx - evalx[i]), axis=1)
        
        # Find the index of the nearest neighbor
        nearest_neighbor_index = np.argmin(distances)
        
        # Assign the label of the nearest neighbor to the test point
        predictions[i] = trainy[nearest_neighbor_index]
    
    return predictions

## 4. K-Nearest neighbor classification with L1 distance

Write a function, **KNN_L1**, which takes as input the training data (`trainx` and `trainy`), the test points (`evalx`), and the value of **K** (integer) and predicts labels for these test points using K-NN classification and L1 distance metric. These labels should be returned in a `numpy` array with one entry per test point.

In [10]:
def KNN_L1(trainx, trainy, evalx, K):
    predictions = []
    
    # Iterate over each point in evalx
    for eval_point in evalx:
        # Calculate the L1 (Manhattan) distance between the eval_point and all training points
        distances = np.sum(np.abs(trainx - eval_point), axis=1)
        
        # Find the indices of the K nearest neighbors
        nearest_neighbor_indices = np.argsort(distances)[:K]
        
        # Get the labels of the K nearest neighbors
        nearest_neighbor_labels = trainy[nearest_neighbor_indices]
        
        # Predict the label by majority voting
        prediction = np.bincount(nearest_neighbor_labels).argmax()
        predictions.append(prediction)
    
    # Convert predictions list to a numpy array and return
    return np.array(predictions)

## 5. K-Nearest neighbor classifier

Write a function, **KNN**, which takes as input the training data (`trainx` and `trainy`), the test points (`evalx`), the value of **K** (integer), and a parameter for deciding the distance metric to be used (for example 1 for L1 and 2 for L2) and predicts labels for these test points using KNN classification. These labels should be returned in a `numpy` array with one entry per test point.

In [11]:
def KNN(trainx, trainy, evalx, K, metric=2):
    """
    K-Nearest Neighbor classifier

    Parameters:
    trainx (np.array): Training data features
    trainy (np.array): Training data labels
    evalx (np.array): Evaluation data features
    K (int): Number of neighbors to consider
    metric (int): Distance metric to use (1 for L1, 2 for L2)

    Returns:
    np.array: Predicted labels for evaluation data
    """
    if metric not in [1, 2]:
        raise ValueError("Invalid metric. Use 1 for L1 distance or 2 for L2 distance.")

    if metric == 1:
        return KNN_L1(trainx, trainy, evalx, K)
    else:  # metric == 2
        return KNN_L2(trainx, trainy, evalx, K)

## 6. Putting it all together

Write code that allows you to select the hyper-parameters (distance measure and the value of K) by calling the KNN classifier with different values of K and either the L1 or the L2 distance measure. Make sure that you set the hyper-parameters using the validation set and not the test set. You need to systematically try different values of K in conjunction with a distance measure, tabulate the results (you can do that by creating a separate cell and documenting them there), and note down the best hyper-parameter settings.

In [28]:
### START CODE HERE ###
# Define the range of K values to test
k_values = range(1, 16, 2)  # Testing K values from 1 to 15, stepping by 2

# Define the distance metrics to test
distance_metrics = [1, 2]  # 1 for L1, 2 for L2

# Variables to store the best hyper-parameters and their corresponding accuracy
best_k = None
best_dist_metric = None
best_val_accuracy = 0
best_test_accuracy = 0

# Dictionary to store results for documentation
results = {}

# Iterate over all combinations of K and distance metrics
for k in k_values:
    for dist_metric in distance_metrics:
        # Predict on the validation and test sets using the current combination of K and distance metric
        val_predictions = KNN(X_train, y_train, X_val, k, dist_metric)
        test_predictions = KNN(X_train, y_train, X_test, k, dist_metric)
        
        # Calculate the accuracy on the validation and test sets
        val_accuracy = np.mean(y_val == val_predictions)
        test_accuracy = np.mean(y_test == test_predictions)
        
        # Store the result
        results[(k, dist_metric)] = (val_accuracy, test_accuracy)
        
        # Update best parameters if the current validation accuracy is better
        if val_accuracy > best_val_accuracy:
            best_val_accuracy = val_accuracy
            best_test_accuracy = test_accuracy
            best_k = k
            best_dist_metric = dist_metric

# Print the best hyper-parameters and their corresponding accuracies
print(f"Best K: {best_k}")
print(f"Best Distance Metric: {'L1' if best_dist_metric == 1 else 'L2'}")
print(f"Best Validation Accuracy: {best_val_accuracy:.4f}")
print(f"Test Accuracy for Best K and Distance Metric: {best_test_accuracy:.4f}")

# Print the tabulated results for reference
print("\nResults for all K and distance metrics combinations:")
for (k, dist_metric), (val_accuracy, test_accuracy) in results.items():
    print(f"K: {k}, Distance Metric: {'L1' if dist_metric == 1 else 'L2'}, Validation Accuracy: {val_accuracy:.4f}, Test Accuracy: {test_accuracy:.4f}")
### END CODE HERE ###


Best K: 3
Best Distance Metric: L1
Best Validation Accuracy: 0.9882
Test Accuracy for Best K and Distance Metric: 0.9535

Results for all K and distance metrics combinations:
K: 1, Distance Metric: L1, Validation Accuracy: 0.9765, Test Accuracy: 0.9419
K: 1, Distance Metric: L2, Validation Accuracy: 0.9647, Test Accuracy: 0.9535
K: 3, Distance Metric: L1, Validation Accuracy: 0.9882, Test Accuracy: 0.9535
K: 3, Distance Metric: L2, Validation Accuracy: 0.9765, Test Accuracy: 0.9651
K: 5, Distance Metric: L1, Validation Accuracy: 0.9882, Test Accuracy: 0.9651
K: 5, Distance Metric: L2, Validation Accuracy: 0.9765, Test Accuracy: 0.9651
K: 7, Distance Metric: L1, Validation Accuracy: 0.9765, Test Accuracy: 0.9651
K: 7, Distance Metric: L2, Validation Accuracy: 0.9647, Test Accuracy: 0.9767
K: 9, Distance Metric: L1, Validation Accuracy: 0.9882, Test Accuracy: 0.9651
K: 9, Distance Metric: L2, Validation Accuracy: 0.9765, Test Accuracy: 0.9651
K: 11, Distance Metric: L1, Validation Accura

In [None]:
### START CODE HERE ###
# Define the range of K values to test
k_values = range(1, 16, 2)  # Testing odd K values from 1 to 15

# Define the distance metrics to test
distance_metrics = [1, 2]  # 1 for L1, 2 for L2

# Variables to store the best hyper-parameters and their corresponding accuracy
best_k = None
best_dist_metric = None
best_accuracy = 0

# Dictionary to store results for documentation
results = {}

# Iterate over all combinations of K and distance metrics
for k in k_values:
    for dist_metric in distance_metrics:
        # Predict on the validation set using the current combination of K and distance metric
        val_predictions = KNN(X_train, y_train, X_val, k, dist_metric)
        test_predictions = KNN(X_train, y_train, X_test, k, dist_metric)
        # Calculate the accuracy on the validation set
        val_accuracy = np.mean(y_val == val_predictions)
        test_accuracy = np.mean(y_test == test_predictions)
        
        # Store the result
        results[(k, dist_metric)] = val_accuracy
        
        # Update best parameters if the current accuracy is better
        if val_accuracy > best_accuracy:
            best_accuracy = val_accuracy
            best_k = k
            best_dist_metric = dist_metric

# Print the best hyper-parameters and their corresponding accuracy
print(f"Best K: {best_k}")
print(f"Best Distance Metric: {'L1' if best_dist_metric == 1 else 'L2'}")
print(f"Best Validation Accuracy: {best_accuracy:.4f}")


# Print the tabulated results for reference
print("\nResults for all K and distance metrics combinations:")
for (k, dist_metric), accuracy in results.items():
    print(f"K: {k}, Distance Metric: {'L1' if dist_metric == 1 else 'L2'}, Validation Accuracy: {accuracy:.4f}")
### END CODE HERE ###


## 7. Test errors and the confusion matrix

Once the hyper-parameters have been selected, we perform a final evaluation on the test set and record the error rates. Also, write a function, **confusion**, which takes as input the true labels for the test set (that is, `testy`) as well as the predicted labels and returns the confusion matrix. The confusion matrix should be a `np.array`.

**Note:** Record the cpu time it takes to perform the evaluation on the test set using functions like **time.time()**.
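
Note that `time.time()` measures wall-clock time. If CPU time specifically is wanted, the standard library also offers `time.process_time()`, and `time.perf_counter()` is a higher-resolution wall clock. The sketch below shows both; the helper name `timed` is hypothetical and not part of the assignment.

```python
import time

def timed(fn, *args, **kwargs):
    """Hypothetical helper: run fn and return (result, wall_seconds, cpu_seconds)."""
    wall_start = time.perf_counter()   # high-resolution wall clock
    cpu_start = time.process_time()    # CPU time used by this process
    result = fn(*args, **kwargs)
    cpu_elapsed = time.process_time() - cpu_start
    wall_elapsed = time.perf_counter() - wall_start
    return result, wall_elapsed, cpu_elapsed

# Usage with a stand-in workload; any callable can be timed the same way
result, wall_s, cpu_s = timed(sum, range(100_000))
print(f"wall: {wall_s:.4f}s, cpu: {cpu_s:.4f}s")
```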

In [13]:
def confusion(testy, testy_fit):
    # inputs: the correct labels, the fitted KNN labels 
    # output: a 10x10 np.array representing the confusion matrix
    
    ### START CODE HERE ###
    # Initialize a 10x10 numpy array with zeros
    conf_matrix = np.zeros((10, 10), dtype=int)
    
    # Populate the confusion matrix
    for true, pred in zip(testy, testy_fit):
        conf_matrix[true][pred] += 1
    
    return conf_matrix
    ### END CODE HERE ###

In [14]:
# Code for performing the final evaluation on the test set and generating the confusion matrix.
### START CODE HERE ###
start_time = time.time()  # Record the start time

# Predict on the test set using the best K and distance metric
test_predictions = KNN(X_train, y_train, X_test, best_k, best_dist_metric)

end_time = time.time()  # Record the end time
evaluation_time = end_time - start_time  # Calculate evaluation time

# Calculate the error rate
error_rate = 1 - accuracy_score(y_test, test_predictions)

# Print the evaluation results
print(f"Evaluation Time: {evaluation_time:.4f} seconds")
print(f"Test Error Rate: {error_rate:.4f}")

# Get the confusion matrix for the test set
conf_matrix = confusion(y_test, test_predictions)

# Print the confusion matrix
print("\nConfusion Matrix:")
print(conf_matrix)

### END CODE HERE ###

Evaluation Time: 0.0040 seconds
Test Error Rate: 0.0465

Confusion Matrix:
[[26  3  0  0  0  0  0  0  0  0]
 [ 1 56  0  0  0  0  0  0  0  0]
 [ 0  0  0  0  0  0  0  0  0  0]
 [ 0  0  0  0  0  0  0  0  0  0]
 [ 0  0  0  0  0  0  0  0  0  0]
 [ 0  0  0  0  0  0  0  0  0  0]
 [ 0  0  0  0  0  0  0  0  0  0]
 [ 0  0  0  0  0  0  0  0  0  0]
 [ 0  0  0  0  0  0  0  0  0  0]
 [ 0  0  0  0  0  0  0  0  0  0]]
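
Because this task has only two classes, only the top-left 2x2 block of the 10x10 matrix above is populated. One possible alternative, sketched below, sizes the matrix from the labels instead. The function name `confusion_n` and its `n_classes` parameter are hypothetical, and the sketch assumes labels are integers starting at 0.

```python
import numpy as np

def confusion_n(true_labels, pred_labels, n_classes=None):
    """Hypothetical variant of `confusion` sized by the number of classes.

    Rows index the true label, columns index the predicted label.
    """
    true_labels = np.asarray(true_labels, dtype=int)
    pred_labels = np.asarray(pred_labels, dtype=int)
    if n_classes is None:
        # Assumption: labels are integers 0..n_classes-1
        n_classes = int(max(true_labels.max(), pred_labels.max())) + 1
    conf_matrix = np.zeros((n_classes, n_classes), dtype=int)
    # np.add.at accumulates correctly even when (true, pred) pairs repeat
    np.add.at(conf_matrix, (true_labels, pred_labels), 1)
    return conf_matrix

# Small made-up example
y_true = np.array([0, 0, 1, 1, 1])
y_pred = np.array([0, 1, 1, 1, 0])
cm = confusion_n(y_true, y_pred)
print(cm)
```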


## 8. Preparing the assignment report

You need to record your answers for the following questions:
1. What is the error rate on the validation set for NN_L2?
2. What is the best error rate on the validation set for KNN_L2?
3. What is the error rate on the validation set for NN_L1?
4. What is the best error rate on the validation set for KNN_L1?
5. What is the error rate on the test set?
6. Do you need to normalize data, in general, when using KNN?
7. Do you need to normalize data when using KNN for the given problem? Explain why?

### 1. What is the error rate on the validation set for NN_L2?

In [15]:
accuracy = np.mean(NN_L2(X_train, y_train, X_val) == y_val)
error_rate = 1 - accuracy
print(f"Error Rate: {error_rate:.4f}")

Error Rate: 0.0353


### 2. What is the best error rate on the validation set for KNN_L2?


As shown before, the best accuracy was at K = 3 using L2, so the error rate is 1 - 0.9412 = 0.0588.

### 3. What is the error rate on the validation set for NN_L1?

In [16]:
accuracy = np.mean(NN_L1(X_train, y_train, X_val) == y_val)
error_rate = 1 - accuracy
print(f"Error Rate: {error_rate:.4f}")

Error Rate: 0.0235


### 4. What is the best error rate on the validation set for KNN_L1?

Best K: 6
Best Distance Metric: L1
Best error rate: 1 - 0.9647 = 0.0353

### 5. What is the error rate on the test set?

In [21]:
accuracy = np.mean(NN_L1(X_train, y_train, X_test) == y_test)
error_rate = 1 - accuracy
print(f"NN_L1 Error Rate: {error_rate:.4f}")

accuracy = np.mean(NN_L2(X_train, y_train, X_test) == y_test)
error_rate = 1 - accuracy
print(f"NN_L2 Error Rate: {error_rate:.4f}")


accuracy = np.mean(KNN(X_train, y_train, X_test, 5, metric = 1)== y_test)
error_rate = 1 - accuracy
print(f"KNN_L1 Error Rate: {error_rate:.4f}")

accuracy = np.mean(KNN(X_train, y_train, X_test, 11 , metric = 2) == y_test)
error_rate = 1 - accuracy
print(f"KNN_L2 Error Rate: {error_rate:.4f}")


NN_L1 Error Rate: 0.0581
NN_L2 Error Rate: 0.0465
KNN_L1 Error Rate: 0.0349
KNN_L2 Error Rate: 0.0349


### 6. Do you need to normalize data, in general, when using KNN?

In general, it is better to normalize the data when using KNN, for the following reasons:

* KNN relies on distance calculations between data points.
* Features with larger scales can dominate the distance calculations.
* Normalization ensures that all features contribute comparably to the distance metric.
* It prevents features with larger numerical ranges from having undue influence on the results.

There is one case where normalizing may not be a good idea: when a feature is genuinely more important and its larger values reflect that. Normalizing removes this dominance and can hurt performance.

Otherwise, normalizing the data helps KNN treat all features fairly and usually improves accuracy.
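
The effect of feature scale on the nearest neighbor can be illustrated with a small sketch. The four training rows and the query below are made-up values (one feature in the thousands, one fractional), and the standardization is done by hand with the training mean and standard deviation, which is the same transformation `StandardScaler` applies by default.

```python
import numpy as np

# Illustrative data (assumption: not taken from the dataset)
train = np.array([[1000.0, 0.10],
                  [1010.0, 0.90],
                  [ 990.0, 0.50],
                  [1020.0, 0.50]])
query = np.array([1008.0, 0.12])

# Nearest neighbor on raw features: the large-scale feature dominates
raw_dist = np.sqrt(np.sum((train - query) ** 2, axis=1))
raw_nn = int(np.argmin(raw_dist))

# Standardize with statistics computed on the training rows only
mean = train.mean(axis=0)
std = train.std(axis=0)
train_scaled = (train - mean) / std
query_scaled = (query - mean) / std

# Nearest neighbor on standardized features: both features matter
scaled_dist = np.sqrt(np.sum((train_scaled - query_scaled) ** 2, axis=1))
scaled_nn = int(np.argmin(scaled_dist))
print(raw_nn, scaled_nn)
```

On the raw values the query is matched to the second row, which is close only in the large feature. After standardization it is matched to the first row, which is close in both features.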

### 7. Do you need to normalize data when using KNN for the given problem? Explain why?

Yes. As the sample below shows, the feature values of a single row of X range from small fractions (around 1e-3) to the thousands. When measuring distances, the features with large values would dominate those with small values, which is why we need to normalize the data.

In [6]:
X[0]

array([1.799e+01, 1.038e+01, 1.228e+02, 1.001e+03, 1.184e-01, 2.776e-01,
       3.001e-01, 1.471e-01, 2.419e-01, 7.871e-02, 1.095e+00, 9.053e-01,
       8.589e+00, 1.534e+02, 6.399e-03, 4.904e-02, 5.373e-02, 1.587e-02,
       3.003e-02, 6.193e-03, 2.538e+01, 1.733e+01, 1.846e+02, 2.019e+03,
       1.622e-01, 6.656e-01, 7.119e-01, 2.654e-01, 4.601e-01, 1.189e-01])

## 9. Extra Stuff

You are invited to try some more ideas as extra work like:
1. Implementing weighted KNN where the vote of a neighbour is scaled down based on its distance from the test point.
2. Implement L_infinity distance measure and use it for classification.
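
For reference, the two ideas can be written as follows. The L-infinity (Chebyshev) distance is standard. The inverse-distance weight with a small constant $\varepsilon$ is one common choice and is an assumption here, not something the assignment prescribes. The symbols $x^{(j)}$ and $t_j$ (the $j$-th training point and its label) and $N_K(x)$ (the indices of the $K$ nearest training points to $x$) are introduced only for this illustration.

$$\|x - y\|_\infty = \max_{1 \leq i \leq d} |x_i - y_i|$$

$$w_j = \frac{1}{\|x - x^{(j)}\| + \varepsilon}, \qquad \hat{t}(x) = \arg\max_{c} \sum_{j \in N_K(x)} w_j \, \mathbf{1}[t_j = c]$$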

In [23]:
def weighted_KNN(trainx, trainy, evalx, K, dist_metric=2):
    # inputs: 
    # trainx (training features), 
    # trainy (training labels), 
    # evalx (evaluation features), 
    # K (number of neighbors), 
    # dist_metric (1 for L1, 2 for L2)
    
    predictions = np.empty(evalx.shape[0], dtype=int)
    
    if dist_metric == 1:
        distances = np.sum(np.abs(trainx[np.newaxis, :] - evalx[:, np.newaxis]), axis=2)
    elif dist_metric == 2:
        distances = np.sqrt(np.sum((trainx[np.newaxis, :] - evalx[:, np.newaxis])**2, axis=2))
    else:
        raise ValueError("Invalid distance metric! Use 1 for L1 and 2 for L2.")
    
    for i in range(evalx.shape[0]):
        # Find the indices of the K nearest neighbors
        nearest_neighbor_idxs = np.argsort(distances[i])[:K]
        
        # Get the distances of the K nearest neighbors
        nearest_distances = distances[i][nearest_neighbor_idxs]
        
        # Calculate the weights (inverse of the distances)
        weights = 1 / (nearest_distances + 1e-5)  # Add a small value to avoid division by zero
        
        # Get the labels of the K nearest neighbors
        nearest_labels = trainy[nearest_neighbor_idxs]
        
        # Calculate the weighted votes
        unique_labels = np.unique(nearest_labels)
        weighted_votes = np.zeros(len(unique_labels))
        
        for idx, label in enumerate(unique_labels):
            label_weights = weights[nearest_labels == label]
            weighted_votes[idx] = np.sum(label_weights)
        
        # Predict the label with the maximum weighted vote
        predictions[i] = unique_labels[np.argmax(weighted_votes)]
    
    return predictions

In [39]:
# Test Weighted KNN with K=4 and L2 distance
weighted_predictions = weighted_KNN(X_train, y_train, X_val, K=4, dist_metric=2)
weighted_val_accuracy = accuracy_score(y_val, weighted_predictions)
print(f"Validation Accuracy for Weighted KNN (K=4, L2): {weighted_val_accuracy:.4f}")

# Test Weighted KNN with K=5 and L1 distance
weighted_predictions_l1 = weighted_KNN(X_train, y_train, X_val, K=5, dist_metric=1)
weighted_val_accuracy_l1 = accuracy_score(y_val, weighted_predictions_l1)
print(f"Validation Accuracy for Weighted KNN (K=5, L1): {weighted_val_accuracy_l1:.4f}")

Validation Accuracy for Weighted KNN (K=4, L2): 0.9412
Validation Accuracy for Weighted KNN (K=5, L1): 0.9529


In [25]:
# Assume the best model is Weighted KNN with K=5 and L1
# Test on the test set
weighted_test_predictions = weighted_KNN(X_train, y_train, X_test, K=5, dist_metric=1)
weighted_test_accuracy = accuracy_score(y_test, weighted_test_predictions)
print(f"Test Accuracy for Weighted KNN (K=5, L1): {weighted_test_accuracy:.4f}")

# Generate confusion matrix for the test set
conf_matrix_test = confusion(y_test, weighted_test_predictions)
print("\nConfusion Matrix for Test Set (Weighted KNN, K=5, L1):")
print(conf_matrix_test)

Test Accuracy for Weighted KNN (K=5, L1): 0.9302

Confusion Matrix for Test Set (Weighted KNN, K=5, L1):
[[23  6  0  0  0  0  0  0  0  0]
 [ 0 57  0  0  0  0  0  0  0  0]
 [ 0  0  0  0  0  0  0  0  0  0]
 [ 0  0  0  0  0  0  0  0  0  0]
 [ 0  0  0  0  0  0  0  0  0  0]
 [ 0  0  0  0  0  0  0  0  0  0]
 [ 0  0  0  0  0  0  0  0  0  0]
 [ 0  0  0  0  0  0  0  0  0  0]
 [ 0  0  0  0  0  0  0  0  0  0]
 [ 0  0  0  0  0  0  0  0  0  0]]


In [40]:
def KNN_L_infinity(trainx, trainy, evalx, K):
    # inputs: 
    # trainx (training features), 
    # trainy (training labels), 
    # evalx (evaluation features), 
    # K (number of neighbors)
    
    # Initialize an empty array to store the predictions
    predictions = np.empty(evalx.shape[0], dtype=int)
    
    # Calculate L-infinity (Chebyshev) distance between each eval point and all train points
    distances = np.max(np.abs(trainx[np.newaxis, :] - evalx[:, np.newaxis]), axis=2)
    
    # Find the indices of the K nearest neighbors for each evaluation point
    nearest_neighbor_idxs = np.argsort(distances, axis=1)[:, :K]
    
    # Use fancy indexing to get the labels of the K nearest neighbors
    nearest_neighbor_labels = trainy[nearest_neighbor_idxs]
    
    # For each evaluation point, find the most common label among the K neighbors using numpy functions
    for i in range(evalx.shape[0]):
        labels, counts = np.unique(nearest_neighbor_labels[i], return_counts=True)
        most_common_label = labels[np.argmax(counts)]
        predictions[i] = most_common_label
    
    return predictions

In [49]:
# Test L-infinity KNN with K=5
linfinity_predictions = KNN_L_infinity(X_train, y_train, X_val, K=5)
linfinity_val_accuracy = accuracy_score(y_val, linfinity_predictions)
print(f"Validation Accuracy for KNN with L-infinity (K=5): {linfinity_val_accuracy:.4f}")

Validation Accuracy for KNN with L-infinity (K=5): 0.9294


In [50]:
K_best = 5

# Test KNN with L-infinity distance on the test set
linfinity_test_predictions = KNN_L_infinity(X_train, y_train, X_test, K=K_best)
linfinity_test_accuracy = accuracy_score(y_test, linfinity_test_predictions)
print(f"Test Accuracy for KNN with L-infinity (K={K_best}): {linfinity_test_accuracy:.4f}")

# Generate confusion matrix for the test set
conf_matrix_test_linfinity = confusion(y_test, linfinity_test_predictions)
print("\nConfusion Matrix for Test Set (KNN with L-infinity, K=5):")
print(conf_matrix_test_linfinity)

Test Accuracy for KNN with L-infinity (K=5): 0.9186

Confusion Matrix for Test Set (KNN with L-infinity, K=5):
[[24  5  0  0  0  0  0  0  0  0]
 [ 2 55  0  0  0  0  0  0  0  0]
 [ 0  0  0  0  0  0  0  0  0  0]
 [ 0  0  0  0  0  0  0  0  0  0]
 [ 0  0  0  0  0  0  0  0  0  0]
 [ 0  0  0  0  0  0  0  0  0  0]
 [ 0  0  0  0  0  0  0  0  0  0]
 [ 0  0  0  0  0  0  0  0  0  0]
 [ 0  0  0  0  0  0  0  0  0  0]
 [ 0  0  0  0  0  0  0  0  0  0]]


## Points to remember

You need to keep in mind the following points:
1. Use numpy arrays and numpy libraries for efficient computations. 
2. Vectorize the code wherever possible instead of using explicit loops. This will significantly speed-up your code.
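
As an illustration of the second point, here is a minimal sketch of a fully vectorized KNN that computes all pairwise distances with broadcasting and counts votes without a Python loop. The function name `knn_vectorized` is hypothetical, and the sketch assumes integer class labels starting at 0. It builds an array of shape (n_eval, n_train, n_features), which is fine for a dataset of this size but may be too large for much bigger ones.

```python
import numpy as np

def knn_vectorized(trainx, trainy, evalx, K, metric=2):
    """Hypothetical vectorized KNN; assumes integer labels 0..C-1."""
    # Pairwise differences via broadcasting: (n_eval, n_train, n_features)
    diff = evalx[:, np.newaxis, :] - trainx[np.newaxis, :, :]
    if metric == 1:
        distances = np.abs(diff).sum(axis=2)
    elif metric == 2:
        # Squared L2 distance is enough for ranking neighbors
        distances = (diff ** 2).sum(axis=2)
    else:
        raise ValueError("Use 1 for L1 or 2 for L2.")
    # argpartition puts the K smallest distances first (in no particular order)
    idx = np.argpartition(distances, K - 1, axis=1)[:, :K]
    neighbor_labels = trainy[idx]  # shape (n_eval, K)
    n_classes = int(trainy.max()) + 1
    # Count the votes per class for every evaluation point at once
    votes = (neighbor_labels[:, :, np.newaxis] == np.arange(n_classes)).sum(axis=1)
    # Vote ties go to the smallest label
    return votes.argmax(axis=1)

# Small made-up example: two well-separated clusters
trainx = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]])
trainy = np.array([0, 0, 1, 1, 1])
evalx = np.array([[0.2, 0.3], [5.2, 5.1]])
preds = knn_vectorized(trainx, trainy, evalx, K=3, metric=2)
print(preds)
```

When several training points are at exactly the same distance, `np.argpartition` may select a different subset than a full `np.argsort`, so predictions can differ from the loop-based versions in such edge cases.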