In [71]:
%%markdown
**1. Part 1: Implementation and evaluation of K-NN in pure Python
(using basic data structures and functions in Python)**  
1.1. Function _loadDataset_:  
  + Loads the data from a CSV file and splits it into a training set and a test set.
  + Input: 
    - _filePath_: path to the CSV file
    - _trainPercentages_: fraction of the rows assigned to the training set (e.g. 0.66); the remaining rows form the test set
  + Output:
    - _dataset_: a dict with 4 entries: **x_train** contains the features of the training set, **y_train** contains the class labels (flower species) of the training set and, similarly, **x_test** contains the features of the test set and **y_test** contains the class labels of the test set.
  + Uses **random.shuffle()**: because the rows in the CSV file are sorted by class label, they must be shuffled before the dataset is divided into a training set and a test set.  
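
As a minimal sketch of the shuffle-then-split step described above, the snippet below uses a small made-up dataset (the rows, the fixed seed and the variable names are assumptions for illustration, not the Iris data):

```python
import random

# Hypothetical toy rows: two features followed by an integer class label,
# sorted by label in the same way as the CSV file described above.
rows = [[1.0, 2.0, 0], [1.1, 2.1, 0],
        [5.0, 6.0, 1], [5.1, 6.1, 1],
        [9.0, 9.5, 2], [9.1, 9.6, 2]]

random.seed(0)           # fixed seed only so that the sketch is repeatable
shuffled = rows[:]       # work on a copy so the sorted original is untouched
random.shuffle(shuffled)

train_fraction = 0.66
split_point = int(len(shuffled) * train_fraction)  # int(3.96) -> 3
train, test = shuffled[:split_point], shuffled[split_point:]

# Separate the features from the class label (last column).
x_train = [row[:-1] for row in train]
y_train = [row[-1] for row in train]
x_test = [row[:-1] for row in test]
y_test = [row[-1] for row in test]

print(len(train), len(test))  # 3 3
```

Without the shuffle, the first 66% of a label-sorted file would contain few or no rows of the last class, so the classifier could not predict that class.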

1.2. Function _calculateDistances_:
  + Calculates the distance between a test object and every instance in the training dataset
  + Input:
    - _xTrain_: all features from training dataset
    - _testObject_: test object from x_test
  + Output:
    - _distances_: list of Euclidean distances
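
To make the distance step concrete, here is a small sketch on made-up 2-D points. The helper name `euclidean` and the data are illustrative assumptions; the notebook's own function uses an index loop with `pow`, which computes the same value:

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length feature lists."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

# Hypothetical training features and test object.
x_train = [[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]]
test_object = [0.0, 0.0]

# One distance per training instance, in the same order as x_train.
distances = [euclidean(x, test_object) for x in x_train]
print(distances)  # [0.0, 5.0, 10.0]
```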

1.3. Function _findNeighbourLabels_:
  + Returns the class labels of the *k* nearest neighbours of a test object
  + Input:
    - _xTrain_: all features from training set
    - _yTrain_: class labels of the training set
    - _testObject_: test object from x_test
    - *k*: number of nearest neighbours
  + Output:
    - _neighbourLabels_: list of class labels
  + Calls **calculateDistances** to get the list of distances, sorts the indexes of that list by distance and takes the first *k* of them. Returns the *k* labels in yTrain at those indexes.  
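
The index-sorting step can be sketched on a made-up list of distances (the numbers and labels below are assumptions chosen for illustration):

```python
# Hypothetical distances from one test object to four training instances,
# together with the labels of those training instances.
distances = [2.5, 0.3, 1.7, 0.9]
y_train = [0, 1, 2, 1]
k = 3

# Sort the positions 0..n-1 by the distance stored at each position.
sorted_index = sorted(range(len(distances)), key=distances.__getitem__)

# The first k positions belong to the k nearest training instances.
neighbour_labels = [y_train[i] for i in sorted_index[:k]]

print(sorted_index)      # [1, 3, 2, 0]
print(neighbour_labels)  # [1, 1, 2]
```

Sorting the indexes rather than the distances themselves keeps the link between each distance and its label in `y_train`.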

1.4. Function _predictLabel_:
  + Finds the most common label among the *k* nearest neighbours
  + Input:
    - _neighbourLabels_: list of class labels of the *k* nearest neighbours
  + Output:
    - _mostCommon_: the most common label
  + Uses a dict to count how often each label occurs in **neighbourLabels**, then returns the key with the largest count.
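
A minimal sketch of the counting step follows. It is a variation, not the notebook's exact code: it builds the dict with `dict.get` instead of pre-filling the keys 0, 1 and 2, so it does not assume which labels exist. The input list is made up.

```python
# Hypothetical labels of the 3 nearest neighbours.
neighbour_labels = [1, 2, 1]

# Count the occurrences of each label.
label_count = {}
for label in neighbour_labels:
    label_count[label] = label_count.get(label, 0) + 1

# The key with the largest count is the predicted label.
most_common = max(label_count, key=label_count.get)

print(label_count)  # {1: 2, 2: 1}
print(most_common)  # 1
```

When two labels have the same count, `max` returns the one it encounters first in the dict, so ties are broken by insertion order.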

1.5. Function _predictTestSet_:
  + Predicts the class label of every instance in the test set
  + Input:
    - _xTrain_: all features from training set
    - _yTrain_: class labels of the training set
    - _xTest_: all features from test set
    - *k*: number of nearest neighbours
  + Output:
    - _predictLabels_: list of predicted labels, one per instance in the test set
  + For each instance in the test set, calls **findNeighbourLabels** to find the class labels of its *k* nearest neighbours, then calls **predictLabel** to predict the class label of that test instance.

1.6. Function _evaluate_:
  + Evaluates the accuracy of the predictions as the ratio of correct predictions to all predictions made.
  + Input:
    - _predictions_: list of predicted class labels of all instances in the test set
    - _yTest_: list of true class labels of the test set
  + Output:
    - _accuracy_: ratio of correct predictions to all predictions made
  + Counts the positions at which **predictions** and **yTest** hold the same value, then divides that count by the number of instances in the test set.
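
The accuracy calculation can be illustrated on two short made-up lists (the values are assumptions for illustration only):

```python
# Hypothetical predicted and true labels for five test instances.
predictions = [0, 1, 2, 1, 0]
y_test = [0, 1, 1, 1, 0]

# Count the positions where prediction and true label agree.
correct = sum(1 for p, t in zip(predictions, y_test) if p == t)

accuracy = correct / len(y_test)
print(correct, accuracy)  # 4 0.8
```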

1.7. Function _main_:
  + Calls **loadDataset**, **predictTestSet** and **evaluate**, in that order, to get the final result.
  + Prints the result as the percentage of correct predictions

In [26]:
%%prun -s cumulative -l 30
import csv
import random
import math

def loadDataset(filePath, trainPercentages):
    dataset = {'x_train': [], 'y_train': [], 'x_test': [], 'y_test': []}
    with open(filePath, newline='') as file:
        reader = csv.reader(file)
        data = list(reader)
        random.shuffle(data)
        length = len(data)
        splitPoint = int(length*trainPercentages)
        for i in range(length):
            features = [float(data[i][j]) for j in range(len(data[i]) - 1)]
            label = int(data[i][-1])
            if i < splitPoint:
                dataset['x_train'].append(features)
                dataset['y_train'].append(label)
            else:
                dataset['x_test'].append(features)
                dataset['y_test'].append(label)
            
    return dataset            

def calculateDistances(xTrain, testObject):
    length = len(testObject)
    distances = []
    for x in xTrain:
        sigma = 0
        for i in range(length):
            sigma += pow(x[i] - testObject[i], 2)
        distances.append(math.sqrt(sigma))
    
    return distances

def findNeighbourLabels(xTrain, yTrain, testObject, k):
    distances = calculateDistances(xTrain, testObject)
    sortedIndex = sorted(range(len(distances)), key=distances.__getitem__)
    neighbourLabels = [yTrain[sortedIndex[i]] for i in range(k)]
    return neighbourLabels

def predictLabel(neighbourLabels):
    labelCount = {0: 0, 1: 0, 2: 0}
    for i in neighbourLabels:
        labelCount[i] += 1
    mostCommon = max(labelCount.keys(), key=lambda k: labelCount[k])
    return mostCommon

def predictTestSet(xTrain, yTrain, xTest, k):
    predictLabels = [predictLabel(findNeighbourLabels(xTrain, yTrain, x, k)) for x in xTest]
    return predictLabels

def evaluate(predictions, yTest):
    accuracy = 0
    length = len(yTest)
    for i in range(length):
        if predictions[i] == yTest[i]:
            accuracy += 1
    accuracy = round(accuracy/length, 2)        
    return accuracy     

def main():
    dataset = loadDataset('IRIS.csv', 0.66)
    predictions = predictTestSet(dataset['x_train'], dataset['y_train'], 
                                 dataset['x_test'], 3)
    accuracy = evaluate(predictions, dataset['y_test'])
    print("Accuracy: {0}% correct".format(round(accuracy*100)))  # round, not int: int(0.29*100) gives 28
    
main()    

Accuracy: 98% correct
 

In [64]:
%%markdown
**Using the %timeit magic command to measure the execution time of each function**

In [5]:
%timeit loadDataset('IRIS.csv', 0.66)

439 µs ± 9.12 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


In [6]:
dataset = loadDataset('IRIS.csv', 0.66)
%timeit predictTestSet(dataset['x_train'], dataset['y_train'], dataset['x_test'], 3)

6.8 ms ± 113 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [7]:
predictions = predictTestSet(dataset['x_train'], dataset['y_train'], 
                             dataset['x_test'], 3)
%timeit evaluate(predictions, dataset['y_test'])

4.18 µs ± 115 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)


In [8]:
%timeit findNeighbourLabels(dataset['x_train'], dataset['y_train'], dataset['x_test'][0], 3)

132 µs ± 2.08 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)


In [9]:
%timeit calculateDistances(dataset['x_train'], dataset['x_test'][0])

118 µs ± 411 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)


In [10]:
neighbourLabels = findNeighbourLabels(dataset['x_train'], dataset['y_train'], dataset['x_test'][0], 3)
%timeit predictLabel(neighbourLabels)

1.24 µs ± 32.4 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)


In [82]:
%%markdown
****
**2. Part 2: Implementation and evaluation of K-NN in NumPy (using
advanced data structures; NumPy and its functions).**  
2.1. Function _npLoadDataset_:  
  + Loads the data from a CSV file and splits it into a training set and a test set.
  + Input: 
    - _filePath_: path to the CSV file
    - _trainPercentages_: fraction of the rows assigned to the training set (e.g. 0.66); the remaining rows form the test set
  + Output:
    - _dataset_: a dict with 4 entries: **x_train** contains the features of the training set, **y_train** contains the class labels (flower species) of the training set and, similarly, **x_test** contains the features of the test set and **y_test** contains the class labels of the test set.
  + Uses **np.random.shuffle**: because the rows in the CSV file are sorted by class label, they must be shuffled before the dataset is divided into a training set and a test set.  

2.2. Function _npCalculateDistances_:
  + Calculates the distance between a test object and every instance in the training dataset
  + Input:
    - _xTrain_: all features from training dataset
    - _testObject_: test object from x_test
  + Output:
    - _distances_: array of Euclidean distances
  + With NumPy arrays, we can use universal functions (np.subtract, np.square, np.sqrt ...) to apply each calculation to a whole array at once, without writing for loops. This makes the calculation much faster than the pure Python version (see the timings in Part 3).
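
As a minimal sketch of the loop-free distance calculation, the snippet below uses made-up 2-D points (the data and variable names are assumptions). Subtracting a 1-D array of shape `(2,)` from a 2-D array of shape `(3, 2)` relies on NumPy broadcasting, which applies the subtraction to every row:

```python
import numpy as np

# Hypothetical training features (3 instances, 2 features) and test object.
x_train = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])
test_object = np.array([0.0, 0.0])

diff = x_train - test_object          # broadcast: one row of differences per instance
squared = np.square(diff)             # element-wise square
distances = np.sqrt(np.sum(squared, axis=1))  # sum over features, then square root

print(distances.tolist())  # [0.0, 5.0, 10.0]
```

`axis=1` sums across the feature columns, so the result has one distance per training instance.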

2.3. Function _npFindNeighbourLabels_:
  + Returns an array with the class labels of the *k* nearest neighbours of a test object
  + Input:
    - _xTrain_: all features from training set
    - _yTrain_: class labels of the training set
    - _testObject_: test object from x_test
    - *k*: number of nearest neighbours
  + Output:
    - _neighbourLabels_: NumPy array of class labels
  + Calls **npCalculateDistances** to get an array of distances, uses **np.argsort** to get the indexes that would sort that array, then slices the result to keep the first *k* indexes. Returns the *k* labels in yTrain at those indexes.  
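
The `np.argsort` step can be sketched on a made-up array of distances (values and labels are assumptions for illustration):

```python
import numpy as np

# Hypothetical distances to four training instances and their labels.
distances = np.array([2.5, 0.3, 1.7, 0.9])
y_train = np.array([0, 1, 2, 1])
k = 3

# argsort returns the indexes that would sort the array in ascending order.
neighbours_index = np.argsort(distances)[:k]

# Indexing an array with an array of indexes picks those elements.
neighbour_labels = y_train[neighbours_index]

print(neighbours_index.tolist())  # [1, 3, 2]
print(neighbour_labels.tolist())  # [1, 1, 2]
```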

2.4. Function _npPredictLabel_:
  + Finds the most common label among the *k* nearest neighbours
  + Input:
    - _neighbourLabels_: array of class labels of the *k* nearest neighbours
  + Output:
    - _mostCommon_: the most common label
  + Uses **np.bincount** to count the occurrences of each value in the array **neighbourLabels**, then uses **np.argmax** to find the index of the largest count, which is the most common label.
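
A short sketch of the `np.bincount` and `np.argmax` combination, on a made-up label array:

```python
import numpy as np

# Hypothetical labels of the 3 nearest neighbours (non-negative integers).
neighbour_labels = np.array([1, 2, 1])

# counts[i] is the number of times the label i occurs.
counts = np.bincount(neighbour_labels)

# The index of the largest count is the most common label.
most_common = int(np.argmax(counts))

print(counts.tolist())  # [0, 2, 1]
print(most_common)      # 1
```

This works because the labels are small non-negative integers, which `np.bincount` requires; with a tie, `np.argmax` returns the smallest of the tied labels.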

2.5. Function _npPredictTestSet_:
  + Predicts the class label of every instance in the test set
  + Input:
    - _xTrain_: all features from training set
    - _yTrain_: class labels of the training set
    - _xTest_: all features from test set
    - *k*: number of nearest neighbours
  + Output:
    - _predictLabels_: list of predicted labels, one per instance in the test set
  + For each instance in the test set, calls **npFindNeighbourLabels** to find the class labels of its *k* nearest neighbours, then calls **npPredictLabel** to predict the class label of that test instance.

2.6. Function _npEvaluate_:
  + Evaluates the accuracy of the predictions as the ratio of correct predictions to all predictions made.
  + Input:
    - _predictions_: predicted class labels of all instances in the test set (a list or array)
    - _yTest_: array of true class labels of the test set
  + Output:
    - _accuracy_: ratio of correct predictions to all predictions made
  + Compares **predictions** and **yTest** element by element and uses **np.sum** to count the positions where they are equal, then divides that count by the number of instances in the test set.
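
The element-wise comparison can be illustrated on two short made-up arrays (the values are assumptions for illustration only):

```python
import numpy as np

# Hypothetical predicted and true labels for five test instances.
predictions = np.array([0, 1, 2, 1, 0])
y_test = np.array([0, 1, 1, 1, 0])

# The comparison yields a boolean array; summing it counts the True values.
matches = predictions == y_test
correct = int(np.sum(matches))

accuracy = correct / y_test.size
print(matches.tolist())  # [True, True, False, True, True]
print(accuracy)          # 0.8
```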

2.7. Function _npMain_:
  + Calls **npLoadDataset**, **npPredictTestSet** and **npEvaluate**, in that order, to get the final result.
  + Prints the result as the percentage of correct predictions

In [17]:
%%prun -s cumulative -l 30
import pandas as pd
import numpy as np

def npLoadDataset(filePath, trainPercentages):
    df = pd.read_csv(filePath, header=None)
    data = df.values
    np.random.shuffle(data)
    dataset = {}
    splitPoint = int(data.shape[0]*trainPercentages)
    dataset['x_train'] = data[:splitPoint, :4]
    dataset['y_train'] = data[:splitPoint, 4].astype(int)
    dataset['x_test'] = data[splitPoint:, :4]
    dataset['y_test'] = data[splitPoint:, 4].astype(int)
    return dataset
    
def npCalculateDistances(xTrain, testObject):
    subtract = np.subtract(xTrain, testObject)
    square = np.square(subtract)
    distances = np.sqrt(np.sum(square, axis=1))
    return distances
    
def npFindNeighbourLabels(xTrain, yTrain, testObject, k):
    distances = npCalculateDistances(xTrain, testObject)
    sortedIndex = np.argsort(distances)
    neighboursIndex = sortedIndex[:k]
    neighbourLabels = yTrain[neighboursIndex]
    return neighbourLabels

def npPredictLabel(neighbourLabels):
    counts = np.bincount(neighbourLabels)
    mostCommon = np.argmax(counts)
    return mostCommon

def npPredictTestSet(xTrain, yTrain, xTest, k):
    predictLabels = [npPredictLabel(npFindNeighbourLabels(xTrain, yTrain, x, k)) for x in xTest]
    return predictLabels

def npEvaluate(predictions, yTest):
    accuracy = np.sum(predictions == yTest)
    length = yTest.size
    return round(accuracy/length, 2) 
    
def npMain():
    dataset = npLoadDataset('IRIS.csv', 0.66)
    predictions = npPredictTestSet(dataset['x_train'], dataset['y_train'], 
                                   dataset['x_test'], 3)
    accuracy = npEvaluate(predictions, dataset['y_test'])
    print("Accuracy: {0}% correct".format(round(accuracy*100)))  # round, not int: int(0.29*100) gives 28
    
npMain()    

Accuracy: 98% correct
 

In [73]:
%%markdown
**Using the %timeit magic command to measure the execution time of each function**

In [279]:
%timeit pd.read_csv('IRIS.csv', header=None)

1.26 ms ± 10.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


In [278]:
%timeit npLoadDataset('IRIS.csv', 0.66)

1.49 ms ± 19.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


In [4]:
dataset = npLoadDataset('IRIS.csv', 0.66)
%timeit npPredictTestSet(dataset['x_train'], dataset['y_train'], dataset['x_test'], 3)

873 µs ± 15.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


In [5]:
predictions = npPredictTestSet(dataset['x_train'], dataset['y_train'], dataset['x_test'], 3)
%timeit npEvaluate(predictions, dataset['y_test'])

14.1 µs ± 64.6 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)


In [9]:
%timeit npFindNeighbourLabels(dataset['x_train'], dataset['y_train'], dataset['x_test'][0], 3)

13.3 µs ± 236 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)


In [6]:
%timeit npCalculateDistances(dataset['x_train'], dataset['x_test'][0])

8.63 µs ± 118 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)


In [10]:
neighbourLabels = npFindNeighbourLabels(dataset['x_train'], dataset['y_train'], dataset['x_test'][0], 3)
%timeit npPredictLabel(neighbourLabels)

1.42 µs ± 7.41 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)


In [39]:
%%markdown
******
**3. Part 3: Compare the performance of the implemented algorithms based on
their execution times.**
  + **_Profiling of the implementation in part 1:_**
  ```
  32073 function calls in 0.241 seconds

   Ordered by: cumulative time
   List reduced from 48 to 30 due to restriction <30>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.241    0.241 {built-in method builtins.exec}
        1    0.000    0.000    0.240    0.240 <string>:2(<module>)
        1    0.000    0.000    0.240    0.240 <string>:63(main)
        1    0.000    0.000    0.231    0.231 <string>:50(predictTestSet)
        1    0.001    0.001    0.231    0.231 <string>:51(<listcomp>)
       51    0.001    0.000    0.228    0.004 <string>:37(findNeighbourLabels)
       51    0.115    0.002    0.225    0.004 <string>:26(calculateDistances)
    20196    0.073    0.000    0.073    0.000 {built-in method builtins.pow}
     5349    0.019    0.000    0.019    0.000 {method 'append' of 'list' objects}
     5049    0.018    0.000    0.018    0.000 {built-in method math.sqrt}
        1    0.003    0.003    0.009    0.009 <string>:6(loadDataset)
        1    0.001    0.001    0.003    0.003 random.py:261(shuffle)
      149    0.002    0.000    0.003    0.000 random.py:223(_randbelow)
       51    0.001    0.000    0.002    0.000 <string>:43(predictLabel)
       51    0.001    0.000    0.001    0.000 {built-in method builtins.max}
       51    0.001    0.000    0.001    0.000 {built-in method builtins.sorted}
      255    0.001    0.000    0.001    0.000 {built-in method builtins.len}
      150    0.001    0.000    0.001    0.000 <string>:15(<listcomp>)
      209    0.001    0.000    0.001    0.000 {method 'getrandbits' of '_random.Random' objects}
        1    0.000    0.000    0.001    0.001 {built-in method builtins.print}
        2    0.000    0.000    0.001    0.000 iostream.py:366(write)
      153    0.001    0.000    0.001    0.000 <string>:47(<lambda>)
        3    0.000    0.000    0.001    0.000 iostream.py:195(schedule)
      149    0.000    0.000    0.000    0.000 {method 'bit_length' of 'int' objects}
        3    0.000    0.000    0.000    0.000 socket.py:333(send)
       51    0.000    0.000    0.000    0.000 <string>:40(<listcomp>)
        2    0.000    0.000    0.000    0.000 iostream.py:313(_schedule_flush)
       51    0.000    0.000    0.000    0.000 {method 'keys' of 'dict' objects}
        1    0.000    0.000    0.000    0.000 {built-in method io.open}
        3    0.000    0.000    0.000    0.000 threading.py:1104(is_alive) 
  ```
  <br/>
  + **_Profiling of the implementation in part 2:_**
  ```
  3334 function calls (3298 primitive calls) in 0.030 seconds

   Ordered by: cumulative time
   List reduced from 365 to 30 due to restriction <30>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.030    0.030 {built-in method builtins.exec}
        1    0.000    0.000    0.029    0.029 <string>:2(<module>)
        1    0.000    0.000    0.029    0.029 <string>:44(npMain)
        1    0.000    0.000    0.022    0.022 <string>:5(npLoadDataset)
        1    0.000    0.000    0.020    0.020 parsers.py:542(parser_f)
        1    0.000    0.000    0.020    0.020 parsers.py:414(_read)
        1    0.000    0.000    0.018    0.018 parsers.py:1029(read)
        1    0.000    0.000    0.017    0.017 frame.py:334(__init__)
        1    0.000    0.000    0.017    0.017 frame.py:426(_init_dict)
      3/2    0.000    0.000    0.008    0.004 series.py:165(__init__)
        1    0.000    0.000    0.007    0.007 frame.py:7308(_arrays_to_mgr)
        1    0.000    0.000    0.007    0.007 <string>:35(npPredictTestSet)
        1    0.000    0.000    0.007    0.007 <string>:36(<listcomp>)
        9    0.000    0.000    0.006    0.001 base.py:4897(_ensure_index)
        1    0.000    0.000    0.005    0.005 series.py:283(_init_dict)
       51    0.001    0.000    0.004    0.000 <string>:23(npFindNeighbourLabels)
        1    0.000    0.000    0.004    0.004 internals.py:4869(create_block_manager_from_arrays)
      607    0.002    0.000    0.003    0.000 {built-in method builtins.isinstance}
        2    0.000    0.000    0.003    0.001 {pandas._libs.lib.clean_index_list}
        1    0.000    0.000    0.003    0.003 internals.py:4880(form_blocks)
      4/3    0.000    0.000    0.003    0.001 base.py:250(__new__)
      8/2    0.000    0.000    0.003    0.001 <frozen importlib._bootstrap>:966(_find_and_load)
        2    0.000    0.000    0.002    0.001 base.py:2430(equals)
       51    0.001    0.000    0.002    0.000 <string>:17(npCalculateDistances)
      8/2    0.000    0.000    0.002    0.001 <frozen importlib._bootstrap>:936(_find_and_load_unlocked)
      6/2    0.000    0.000    0.002    0.001 <frozen importlib._bootstrap>:211(_call_with_frames_removed)
      6/2    0.000    0.000    0.002    0.001 {built-in method builtins.__import__}
       51    0.001    0.000    0.002    0.000 <string>:30(npPredictLabel)
      102    0.001    0.000    0.002    0.000 fromnumeric.py:50(_wrapfunc)
        2    0.000    0.000    0.002    0.001 missing.py:376(array_equivalent)
  ```
<br/>
  + **_The results from profiling and from the %timeit magic command show that:_**
    - Both implementations classify well (above 90% correct, typically 96% or better), but the NumPy implementation is much faster overall (0.030 s vs 0.241 s under the profiler).
    - In the NumPy implementation, the distances between a test object and all instances in the training set are calculated in 8.63 µs, about 14 times faster than in the implementation without NumPy (118 µs). The reason is that NumPy applies fast operations to an entire **ndarray** at once, so we do not have to write loops, which are slow in pure Python. 
    - The distance calculation dominates the run time of the pure Python version (in the Part 1 profile, calculateDistances accounts for 0.225 s of the 0.241 s total). Because this step is so much faster with NumPy, the functions that find the neighbour labels and predict the whole test set, and therefore the entire NumPy implementation, are also much faster than the versions without NumPy.
    - Not every step is faster: loading the data with pandas takes longer than with the csv module (1.49 ms vs 439 µs), and npEvaluate is slower than evaluate (14.1 µs vs 4.18 µs), but this does not change the overall result.
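
To make the comparison easier to read, the sketch below computes the speed-up ratios from the timings reported in this notebook. The numbers are copied from the %timeit and %prun outputs; the dictionary layout and names are only an illustrative assumption:

```python
# (pure Python time, NumPy time) in seconds, copied from the outputs above.
timings = {
    "calculateDistances": (118e-6, 8.63e-6),
    "findNeighbourLabels": (132e-6, 13.3e-6),
    "predictTestSet": (6.8e-3, 873e-6),
    "loadDataset": (439e-6, 1.49e-3),
    "evaluate": (4.18e-6, 14.1e-6),
    "whole run (%prun)": (0.241, 0.030),
}

# A ratio above 1 means the NumPy version is faster, below 1 means slower.
speedups = {name: pure / fast for name, (pure, fast) in timings.items()}

for name, ratio in speedups.items():
    print(f"{name}: {ratio:.1f}x")
```

The ratios for calculateDistances (13.7x), findNeighbourLabels (9.9x) and predictTestSet (7.8x) are well above 1, while loadDataset and evaluate come out at about 0.3x, that is, slower with pandas and NumPy.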
