# The Nearest Neighbors Algorithm in Python #

Now we will try KNN on real music data by rewriting it in Python.
The dataset we will start with is the [University of Iowa Musical Instrument Samples Dataset](http://theremin.music.uiowa.edu/index.html). The first two cells in this notebook load the dataset for you. Make sure you have downloaded the `Iowa_MIS_dataset.mat` file and put it in the same directory as this notebook.

In [1]:
import numpy as np
import scipy.io
from scipy.stats import mode

In [2]:
Iowa_MIS_dataset = scipy.io.loadmat('Iowa_MIS_dataset.mat')
data = Iowa_MIS_dataset['dat_all']

The variable `data` has shape (660, 88201) and is already shuffled (it's always good to double-check, just to make sure). The number of rows is the number of datapoints in the dataset, and the number of columns minus one is the dimensionality of each datapoint. The rightmost column of the matrix contains the label of each datapoint. Possible labels are:
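As a quick sanity check of this layout, the sketch below builds a small synthetic array with the same structure (features in every column but the last, labels in the last column) and counts the datapoints per label. The array `toy` and its sizes are made up for illustration; with the real dataset you would presumably run the same `np.unique` call on `data[:, -1]`.

```python
import numpy as np

# Hypothetical stand-in for `data`: 6 datapoints, 4 features, label in the last column.
rng = np.random.default_rng(0)
toy_features = rng.standard_normal((6, 4))
toy_labels = np.array([0, 1, 1, 2, 2, 2], dtype=float)
toy = np.column_stack([toy_features, toy_labels])

n_points, n_cols = toy.shape
n_dims = n_cols - 1  # the last column holds labels, not features

# Count how many datapoints carry each label.
labels, counts = np.unique(toy[:, -1].astype(int), return_counts=True)
label_counts = dict(zip(labels.tolist(), counts.tolist()))
print(n_points, n_dims, label_counts)
```

Checking the per-label counts this way also tells you whether the classes are balanced, which matters later when you compare accuracy against chance.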

0 'Bass'

1 'Bassoon'

2 'Cello'

3 'Clarinet'

4 'Flute'

5 'Guitar'

6 'Horn'

7 'Sax'

8 'Trombone'

9 'Trumpet'

10 'Viola'

11 'Violin'

Next, separate the datapoints into training (~80% of the data), validation (~10%), and test (~10%) sets (the procedure should be very similar to what you did in Julia).
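One possible way to sketch such a split is to shuffle row indices rather than the rows themselves, which leaves the original array untouched and makes the split reproducible through a seed. The helper `split_indices` and its parameters are hypothetical names introduced here, not part of the notebook.

```python
import numpy as np

def split_indices(n, train_frac=0.8, val_frac=0.1, seed=0):
    """Return shuffled, non-overlapping index arrays for train/val/test."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n)
    n_train = int(train_frac * n)
    n_val = int(val_frac * n)
    train_idx = order[:n_train]
    val_idx = order[n_train:n_train + n_val]
    test_idx = order[n_train + n_val:]  # whatever remains goes to the test set
    return train_idx, val_idx, test_idx

train_idx, val_idx, test_idx = split_indices(660)
print(len(train_idx), len(val_idx), len(test_idx))
```

With index arrays like these, the feature and label blocks could be selected as `data[train_idx, :-1]` and `data[train_idx, -1]`, and likewise for the other two sets.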

In [3]:
# general data parameters
N = data.shape[0]
D = data.shape[1]-1
C = 12

# your code here:
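# data[:] is a view, not a copy, so the shuffle below reorders the loaded array in place
shuffled_data = data[:]
np.random.shuffle(shuffled_data)
del data  # drops only the old name; shuffled_data still refers to the array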

In [4]:
train_size = int(0.8 * N)
val_size = int(0.1 * N)
test_size = int(0.1 * N)
print(shuffled_data.shape)
x_tr = shuffled_data[:train_size, :-1]
y_tr = shuffled_data[:train_size, D]
print("Train Shapes: X: " + str(x_tr.shape) + ", Y: " + str(y_tr.shape))

x_vl = shuffled_data[train_size:train_size + val_size, :-1]
y_vl = shuffled_data[train_size:train_size + val_size, D]
print("Val Shapes: X: " + str(x_vl.shape) + ", Y: " + str(y_vl.shape))

x_ts = shuffled_data[train_size + val_size:, :-1]
y_ts = shuffled_data[train_size + val_size:, D]
print("Test Shapes: X: " + str(x_ts.shape) + ", Y: " + str(y_ts.shape))

(660, 88201)
Train Shapes: X: (528, 88200), Y: (528,)
Val Shapes: X: (66, 88200), Y: (66,)
Test Shapes: X: (66, 88200), Y: (66,)


In [8]:
for k in range(1,20,2):

    num_correct = 0
    for i in range(x_vl.shape[0]):

        # calculate the L1 distance between this validation point
        # and every point in the training set
        # your code here:
        # Julia equivalent: distances = sum(abs(x_tr .- reshape(x_vl[i,:], 1, size(x_tr)[2])), 2);
        distances = np.sum(np.abs(x_tr - x_vl[i,:]), axis=1)
        
        # obtain the indices that would sort the array in ascending order
        # your code here:
        KNN_index = np.argsort(distances)[0:k]

        # obtain the labels for the KNNs using their indices
        # your code here:
        KNNs = y_tr[KNN_index]

        # have the neighbors vote [hint: use the scipy function 'mode']
        # your code here:
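        predicted_label = mode(KNNs)
        #print(predicted_label)

        # np.atleast_1d covers both SciPy behaviors: .mode is a length-1 array
        # in older releases and a scalar for 1-D input in newer ones
        if int(np.atleast_1d(predicted_label.mode)[0]) == int(y_vl[i]):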
            num_correct += 1

    print('accuracy with ', k, 'nearest neighbors: ', float(num_correct)/x_vl.shape[0])        

accuracy with  1 nearest neighbors:  0.16666666666666666
accuracy with  3 nearest neighbors:  0.15151515151515152
accuracy with  5 nearest neighbors:  0.07575757575757576
accuracy with  7 nearest neighbors:  0.10606060606060606
accuracy with  9 nearest neighbors:  0.06060606060606061
accuracy with  11 nearest neighbors:  0.06060606060606061
accuracy with  13 nearest neighbors:  0.045454545454545456
accuracy with  15 nearest neighbors:  0.045454545454545456
accuracy with  17 nearest neighbors:  0.06060606060606061
accuracy with  19 nearest neighbors:  0.06060606060606061
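The body of the loop above can also be packaged as a small function, which makes it easier to test on data where the right answer is obvious. The sketch below is one way to do that; the name `knn_predict` is hypothetical, and it swaps `scipy.stats.mode` for `np.bincount`, which assumes the labels are non-negative integers (true for the 0 to 11 labels used here).

```python
import numpy as np

def knn_predict(x_train, y_train, x_query, k):
    """Predict the label of one query point by majority vote of its k nearest
    training points under the L1 (Manhattan) distance."""
    distances = np.sum(np.abs(x_train - x_query), axis=1)
    nearest = np.argsort(distances)[:k]
    votes = y_train[nearest].astype(int)
    # np.bincount counts votes per label; argmax breaks ties toward the smallest label.
    return int(np.argmax(np.bincount(votes)))

# Tiny made-up example: two clusters on a line.
x_train = np.array([[0.0], [0.1], [0.2], [5.0], [5.1]])
y_train = np.array([0, 0, 0, 1, 1])
pred_low = knn_predict(x_train, y_train, np.array([0.05]), k=3)
pred_high = knn_predict(x_train, y_train, np.array([4.9]), k=3)
print(pred_low, pred_high)
```

On the toy data the query near 0 should be assigned to the first cluster and the query near 5 to the second, which is a quick way to confirm the distance and voting steps behave as intended before running them on the full dataset.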


What is the best accuracy that you obtained? Was it above chance?
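To judge "above chance", it helps to put a number on chance. The sketch below computes two reference points: uniform guessing over the `C = 12` classes, which is the right comparison only if the classes are roughly balanced (an assumption, since the class counts are not stated here), and a majority-class baseline computed from the labels. The function `majority_baseline` and the toy label arrays are made up for illustration.

```python
import numpy as np

C = 12  # number of instrument classes listed above

# If every class were equally likely, uniform random guessing would be
# right 1/C of the time.
uniform_chance = 1.0 / C

def majority_baseline(y_train, y_eval):
    """Accuracy of always predicting the most frequent training label."""
    majority = np.argmax(np.bincount(y_train.astype(int)))
    return float(np.mean(y_eval.astype(int) == majority))

# Made-up labels, only to show the call.
y_train_toy = np.array([0, 0, 0, 1, 2])
y_eval_toy = np.array([0, 1, 0, 2])
toy_baseline = majority_baseline(y_train_toy, y_eval_toy)
print(round(uniform_chance, 4), toy_baseline)
```

On the real data you would call `majority_baseline(y_tr, y_vl)` and compare both numbers with the accuracies printed above.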

How could you make this model better?