In [4]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from collections import Counter

# Audio Machine Learning - Formative Task - Exercise 1
## 1 - k-Nearest Neighbour Classifier

### 1.1 - Dataset

The dataset $D = \{(x^{(i)}, y^{(i)})\}^N_{i=1}$ contains $N$ labelled datapoints. Each datapoint consists of a feature vector, $x^{(i)}$, and the corresponding class label, $y^{(i)}$.


$x^{(i)}$ denotes the feature vector of the i-th datapoint in the dataset, and $y^{(i)}$ denotes the i-th label.

Each feature vector $x^{(i)}$ consists of $j$ features, and $x^{(i)}_n$ denotes its n-th element, for $n = 1, \dots, j$. Each $y^{(i)}$ is a single class label.
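
As an illustration, a dataset of this form can be stored as two NumPy arrays. This is a minimal sketch: the values are a made-up toy example with $N = 4$ datapoints and 2 features each, and the names `X` and `y` are assumptions chosen for this snippet.

```python
import numpy as np

# Hypothetical toy dataset: N = 4 datapoints, each with 2 features
X = np.array([[1.0, 4.0],
              [2.0, 4.0],
              [1.0, 2.0],
              [4.0, 5.0]])                            # feature array, shape (N, 2)
y = np.array(["white", "black", "white", "white"])    # label array, shape (N,)

print(X.shape)      # (4, 2)
print(y.shape)      # (4,)
print(X[1], y[1])   # feature vector and label of the second datapoint
```

Row `i` of `X` plays the role of one feature vector, and entry `i` of `y` is its label.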

### 1.2 - Distance Metric

We want to classify a new datapoint, $q$, which consists of $j$ features but for which the class label is unknown. We can approach this by measuring the Euclidean distance between $q$ and each of the $N$ datapoints in the dataset $D$. The Euclidean distance between $q$ and a feature vector $x^{(i)}$ is given by:

Euclidean Distance = $\sqrt{\sum\limits_{n=1}^{j} (x^{(i)}_n - q_n)^2}$

Once we've calculated the distance between $q$ and each of the $N$ datapoints in the dataset, we can rank each datapoint $x^{(i)}$ by its distance to the query point $q$.
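
The distance calculation and ranking can be sketched with NumPy broadcasting, so that no explicit loop over the datapoints is needed. The data values and the names `X`, `q`, `distances` and `ranking` are assumptions made for this illustration only.

```python
import numpy as np

# Hypothetical toy data: 4 datapoints with 2 features each, and one query point
X = np.array([[1.0, 4.0], [2.0, 4.0], [1.0, 2.0], [4.0, 5.0]])
q = np.array([1.0, 3.0])

# Broadcasting subtracts q from every row of X; summing over axis 1
# adds up the squared differences for each datapoint
distances = np.sqrt(np.sum((X - q) ** 2, axis=1))

# argsort returns the dataset indices ordered from nearest to farthest;
# a stable sort keeps equal distances in their original order
ranking = np.argsort(distances, kind="stable")

print(distances)
print(ranking)  # [0 2 1 3]
```

Here the first and third datapoints are both at distance 1 from `q`, so they occupy the first two places in the ranking.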

### 1.3 - Classification

To carry out classification, the $k$ datapoints that are closest to the query $q$ are selected, where $k$ is a user-defined constant. The most frequently occurring label amongst the selected $k$ datapoints is the k-nearest neighbour algorithm's predicted label for the query $q$.
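
The voting step can be sketched as follows, assuming the distances to the query have already been computed. The distance values, labels and variable names are hypothetical and serve only to illustrate the majority vote.

```python
from collections import Counter

import numpy as np

# Hypothetical distances from a query to 4 datapoints, with their labels
distances = np.array([1.0, 1.5, 0.5, 3.6])
labels = np.array(["white", "black", "white", "white"])
k = 3

# Indices of the k smallest distances, nearest first
nearest = np.argsort(distances)[:k]

# Count the labels among the k neighbours and take the most frequent one
votes = Counter(labels[nearest].tolist())
predicted_label = votes.most_common(1)[0][0]

print(predicted_label)  # white
```

With an even $k$ the vote can be tied. In this sketch `Counter.most_common` orders equal counts by first occurrence, so a tie goes to the label of the nearer neighbour.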

## 2 - Implementation

### 2.1 - Task 1 - Implementing the KNN Classifier

Your task is to implement the k-nearest neighbour classifier, using the Euclidean Distance. It should work for a dataset with any number of training examples, N, and any number of features.  You should use the Python class template provided below:

In [57]:


# define the KNNClassifier class
class KNNClassifier:

    # The KNNClassifier should have the method 'fit', which takes the features and labels from Dataset D as its arguments and saves them for classification.
    # If the KNNClassifier already has a dataset saved, it should be replaced by whatever dataset is passed to the 'fit' method

    def __init__(self):
        pass

    
    # features_ should be a numpy array of shape (N, j)
    # labels_ should be a numpy array of shape (N)

    def fit(self, features_, labels_):
        """ 
        
        Args:
            features_ (np.array): Feature array with shape (N, j)
            labels_ (np.array): Label array with shape (N,)

        Returns:
            KNNClassifier: The fitted classifier (self)

        """
               
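        # Convert the inputs to numpy arrays, replacing any previously saved dataset
        self.features = np.array(features_)
        self.labels = np.array(labels_)

        # Check the dimensionality before indexing the shape, so a 1D input gives a clear error
        self.dimFeatures = np.ndim(self.features)
        if self.dimFeatures != 2:
            raise ValueError(f"Expected 2D array (N, j), got {self.dimFeatures}D array")

        # Every datapoint needs exactly one label
        if len(self.features) != len(self.labels):
            raise ValueError("Features and labels must have the same number of rows")

        # Number of features per datapoint, used to validate queries in 'predict'
        self.shape_features = np.shape(self.features)[1]

        return self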
    

        
    # The KNNClassifier should have the method 'predict', which takes a query vector q, and the parameter k. The 'predict' method should implement 
    # the KNN algorithm, returning the classifier's predicted class label for query q. If an invalid query is made, i.e. the number of features in q is
    # incorrect, the predict function should inform the user.
    def predict(self, q, k):
        """

        Args:
            q (np.array): The query vector with shape (1,j)
            k (int): Number of neighbors

        Returns:

            The predicted class label (same type as the labels passed to 'fit')

        """

        # Inform the user if the query does not have shape (1, j)
        if np.shape(q) != (1, self.shape_features):
            raise ValueError(f"q has the incorrect shape {np.shape(q)}, the expected shape is (1, {self.shape_features})")

        self.q = np.array(q)  # The query: the new datapoint whose label we want to predict
        self.k = k  # The number of nearest neighbours used for the vote

        # Euclidean distance from q to every datapoint, computed by broadcasting; shape (N,)
        distance = np.sqrt(np.sum((self.features - self.q) ** 2, axis=1))

        # Indices of the k smallest distances; argpartition does not sort them, which is fine for a vote.
        # Partitioning at k - 1 (rather than k) keeps k == N valid.
        arg_order = np.argpartition(distance, k - 1)[:k]

        # Find the labels of the k nearest neighbours
        label_sort_k = self.labels[arg_order]

        # most_common(1) returns a list containing one (label, count) tuple
        best_label = Counter(label_sort_k).most_common(1)

        # Extract the label from that tuple with [0][0]
        return best_label[0][0]






In [6]:

feat = [[1, 4],[2,4],[1,2],[4,5]]
labels = ["white","black","white","white"]
q = [[1,3]]

classifier = KNNClassifier()

fit_model = classifier.fit(features_=feat, labels_=labels)

predict_model = fit_model.predict(q, 3)

print(predict_model)




The best label for q is 'white'


You should aim to implement the predict function without using any 'for' loops.

All methods should have clear comments describing what each line of code does. The methods should also have comments describing the expected data types
and dimensions of the input arguments.

## 3 - Testing

### 3.1 - Testing the KNN Classifier
Section 3.1 contains code to generate training examples and test your classifier. You don't have to implement anything here but you may find it useful for testing purposes.

The functions below generate data from two different classes, as well as the corresponding class labels.

In [13]:
# These functions generate data from two different classes
def class_1_generator(num_samples):
    return np.random.randn(num_samples, 2) + 2.5, np.ones(num_samples, dtype=np.uint32)

def class_2_generator(num_samples):
    return np.random.randn(num_samples, 2) + 5, 2*np.ones(num_samples, dtype=np.uint32)
    
# This function generates a dataset, and returns a tuple (X, Y), where X contains N datapoints each consisting of j features, shape [N, j], and Y contains the
# corresponding class labels, shape [N]. The argument 'examples_per_class' determines how many datapoints from each class to include in the dataset.
def generate_dataset(examples_per_class):
    class_1s = class_1_generator(examples_per_class)
    class_2s = class_2_generator(examples_per_class)
    dataset = (np.concatenate([class_1s[0], class_2s[0]]), np.concatenate([class_1s[1], class_2s[1]]))
    return dataset

The code below creates a Model object from your KNNClassifier class.

In [14]:
Model = KNNClassifier()

The code below generates a training dataset and passes it to the KNNClassifier.fit() method, which you implemented in section 2.1.

In [15]:
Train_Dataset = generate_dataset(100)

In [16]:
Model.fit(features_=Train_Dataset[0], labels_=Train_Dataset[1])

<__main__.KNNClassifier at 0x1dfd15519d0>

The code below generates a test dataset, and then passes a single query to the KNNClassifier using the .predict() method.

In [17]:
Test_Dataset = generate_dataset(25)

In [18]:
prediction = Model.predict(q=Test_Dataset[0][0:1,:], k=5)

Now print the prediction of the model:

In [19]:
print(f'Prediction for query point {Test_Dataset[0][0,:]} is class: {prediction}. The true label is {Test_Dataset[1][0]}')

Prediction for query point [3.38742632 2.78310759] is class: The best label for q is '1'. The true label is 1


### 3.2 - Task 2 - Test your classifier

Your task is to use your KNN Classifier to classify all the examples in 'Test_Dataset' (created in the previous section).

In [None]:
Model = KNNClassifier()
Model.fit(features_=Train_Dataset[0], labels_=Train_Dataset[1])

q_values = np.array(Test_Dataset[0])
label_values = np.array(Test_Dataset[1])


correct_count = 0

for q, l in zip( q_values, label_values):

    # Need to make sure that q has a size of 1 x 2
    pred = Model.predict( [q] , 5)
    actual = l

    if pred == actual:
        a = "correct"
        correct_count += 1
    else:
        a = "false"    

    print( f"{a}: The prediction is {pred}, the actual value is {actual}" )

percentage_accuracy =  ( correct_count / np.shape(q_values)[0] ) * 100

print(f"The overall accuracy is {percentage_accuracy} %")




correct: The prediction is 1, the actual value is 1
correct: The prediction is 1, the actual value is 1
correct: The prediction is 1, the actual value is 1
correct: The prediction is 1, the actual value is 1
correct: The prediction is 1, the actual value is 1
false: The prediction is 2, the actual value is 1
correct: The prediction is 1, the actual value is 1
correct: The prediction is 1, the actual value is 1
correct: The prediction is 1, the actual value is 1
correct: The prediction is 1, the actual value is 1
correct: The prediction is 1, the actual value is 1
correct: The prediction is 1, the actual value is 1
correct: The prediction is 1, the actual value is 1
correct: The prediction is 1, the actual value is 1
correct: The prediction is 1, the actual value is 1
correct: The prediction is 1, the actual value is 1
correct: The prediction is 1, the actual value is 1
correct: The prediction is 1, the actual value is 1
correct: The prediction is 1, the actual value is 1
correct: The p

You should then compare the predictions to the true class labels from 'Test_Dataset'. 
The accuracy is given by the ratio of the number of correct predictions to the total number of samples input to the classifier.
Calculate the accuracy of the classifier, for k=1, k=2 and k=5.
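
One way the comparison across $k$ values might look is sketched below. To stay self-contained, this sketch does not use the `KNNClassifier` class. Instead it defines two hypothetical helpers, `knn_predict_all` and `make_dataset`, and uses a fixed random seed, all of which are assumptions made for this illustration. The accuracy values it prints depend on the randomly generated data.

```python
import numpy as np

def knn_predict_all(train_X, train_y, test_X, k):
    """Hypothetical helper: predict a label for every row of test_X."""
    # Pairwise differences, shape (number of test points, number of training points, features)
    diffs = test_X[:, None, :] - train_X[None, :, :]
    distances = np.sqrt(np.sum(diffs ** 2, axis=2))
    # Indices of the k nearest training points for each test point
    nearest = np.argsort(distances, axis=1)[:, :k]
    neighbour_labels = train_y[nearest]
    # Majority vote per test point; labels are assumed to be small non-negative integers
    return np.array([np.bincount(row).argmax() for row in neighbour_labels])

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

def make_dataset(n_per_class):
    """Hypothetical helper: two Gaussian classes, similar to the generators in section 3.1."""
    X = np.concatenate([rng.standard_normal((n_per_class, 2)) + 2.5,
                        rng.standard_normal((n_per_class, 2)) + 5.0])
    y = np.concatenate([np.full(n_per_class, 1), np.full(n_per_class, 2)])
    return X, y

train_X, train_y = make_dataset(100)
test_X, test_y = make_dataset(25)

# Accuracy = number of correct predictions / total number of test samples
accuracies = {}
for k in (1, 2, 5):
    predictions = knn_predict_all(train_X, train_y, test_X, k)
    accuracies[k] = np.mean(predictions == test_y)
    print(f"k={k}: accuracy = {accuracies[k]:.2%}")
```

For $k=2$ the two neighbours can disagree. In this sketch `np.bincount(...).argmax()` resolves such a tie in favour of the smaller label value, which is a design choice of the sketch rather than part of the k-nearest neighbour algorithm itself.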


In [51]:

q_values.size


100