# HW01 - Test and Train Split; K-Nearest Neighbors

Name: Maida Raza

Tufts University

CS 135

# Q1 - Test and Train Split

(10 pts.) The first part of the code asks you to write a function to split data into a training set and a testing set (a very common task in ML). Your code will use basic NumPy array indexing and random number generation to take an input array containing L instances of F-dimensional data, and divide it into two mutually exclusive arrays of size M (for training) and N (for testing). As part of its input, the function takes parameter frac_test, specifying the overall fraction of the data-set to use for testing purposes. It will use this fraction to determine the size N, rounding up to the nearest whole number: N = ⌈frac_test × L⌉. The function will also use NumPy functions like shuffle or permutation for doing random sampling of the data, so that the test/train instances are uniformly selected from the data-set: https://numpy.org/doc/stable/reference/random/legacy.html

(Note that this links to a library that has been generally replaced by something a bit newer in the latest version of NumPy, but is still supported. Your code will function correctly if you use the legacy version, or the newer NumPy features.) Furthermore, we want the results of the function to be reproducible, for scientific purposes. That means that there should be a way to specify a source of randomness (a seed) such that it is possible to duplicate any random selection. In NumPy, this is generally handled by using a RandomState instance, or via an integer seed. See the linked discussion from the source code comments for insight into using such seeds in your code.
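As a minimal sketch of the two ideas above (the size formula and seeded randomness), the snippet below computes N for a made-up example and checks that the same integer seed reproduces the same permutation. The values `L = 10`, `frac_test = 0.25`, and the seed `42` are illustrative assumptions, not part of the assignment; `np.random.default_rng` is the newer NumPy generator mentioned in the note.

```python
import numpy as np

# Illustrative values (assumptions, not from the assignment).
L, frac_test = 10, 0.25
N = int(np.ceil(frac_test * L))  # ceil(2.5) = 3 rows for testing
M = L - N                        # the remaining 7 rows for training

# Legacy API: two RandomState objects built from the same seed
# produce the same permutation of the row indices.
perm_a = np.random.RandomState(42).permutation(L)
perm_b = np.random.RandomState(42).permutation(L)
same_legacy = np.array_equal(perm_a, perm_b)

# Newer Generator API: the same property holds, although the
# permutation itself generally differs from the legacy one.
perm_c = np.random.default_rng(42).permutation(L)
perm_d = np.random.default_rng(42).permutation(L)
same_new = np.array_equal(perm_c, perm_d)

# Slicing one permutation gives two disjoint index sets.
test_idx, train_idx = perm_a[:N], perm_a[N:]
print(N, M, same_legacy, same_new)
```

Because the test and train indices are slices of a single permutation, no row can land in both sets, and every row lands in exactly one.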

In [5]:

import numpy as np

def split_into_train_and_test(x_all_LF, frac_test=0.5, random_state=None):
    # frac_test is the fraction of rows held out for testing (default: half).
    # If random_state is an integer seed or a RandomState, the split is reproducible.
    L = x_all_LF.shape[0]  # shape[0] is the number of rows; shape[1] is the number of columns
    N = int(np.ceil(frac_test*L))

    if isinstance(random_state, np.random.RandomState):
        rng = random_state
    else:
        rng = np.random.RandomState(seed=random_state)

    permuted_index = rng.permutation(L)
    test_array = permuted_index[:N]
    train_array = permuted_index[N:]

    x_test = x_all_LF[test_array]
    x_train = x_all_LF[train_array]

    # Return training rows first, matching the function name.
    return x_train, x_test
    
data = np.array([[2,3,4],[9,7,0]])

rows = np.random.randint(0,2, size = 2)
cols = np.random.randint(0,3, size = 2)

split_into_train_and_test(data[rows, cols])

(array([7]), array([7]))

In [3]:
import numpy as np

def split_into_train_and_test(x_all_LF, frac_test=0.4, random_state=None):
    L = x_all_LF.shape[0]  # "L instances" means the number of rows
    N = int(np.ceil(frac_test*L))
    print(N)
    
    # Shuffle the row indices. Note: random_state is ignored in this version,
    # so the split is not reproducible from run to run.
    shuffle_indices = np.random.permutation(L)
    test_indices = shuffle_indices[:N]
    train_indices = shuffle_indices[N:]

    test_data = x_all_LF[test_indices]
    train_data = x_all_LF[train_indices]

    return train_data, test_data

data = np.array([[1,2,3,4],[5,6,7,8],[9,10,11,12]])

split_into_train_and_test(data, frac_test=0.4, random_state=43)
    

2


(array([[1, 2, 3, 4]]),
 array([[ 9, 10, 11, 12],
        [ 5,  6,  7,  8]]))

# Q2 - K-Nearest Neighbors

(10 pts.) The other function you are to complete finds nearest neighbors of given data instances. That is, given a set of F-dimensional data of size N, and Q query instances (also F-dimensional), we want to compute the K closest vectors found in the data, for some integer value K, and for each query instance (for a total of Q × K neighbors).

Your function will take in a data-set (a 2-dimensional array of size N × F) and a query-set (a 2-dimensional array of size Q × F), and return a 3-dimensional array (of size Q × K × F), where each row (indexed by Q) consists of the K nearest neighbors of the corresponding query vector. These neighbors should appear in order, closest to least-close.

Notes: it is possible that there will be ties among neighbors. If this occurs, such ties can be broken however you like (randomly or not). Again, be sure that your code uses only basic Python and functions from NumPy. Do not call functions from libraries like sklearn.
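One way to picture the Q × K × F output is a vectorized sketch using NumPy broadcasting. The function name `knn_vectorized` is hypothetical, and using squared Euclidean distance with a stable sort (so ties go to the data row with the lower index) is one assumed choice among the tie-breaking options allowed above.

```python
import numpy as np

def knn_vectorized(data_NF, query_QF, K=1):
    """Hypothetical vectorized variant; returns an array of shape (Q, K, F)."""
    data_NF = np.asarray(data_NF, dtype=float)
    query_QF = np.asarray(query_QF, dtype=float)
    # diff_QNF[q, n] is query q minus data row n, built by broadcasting.
    diff_QNF = query_QF[:, np.newaxis, :] - data_NF[np.newaxis, :, :]
    # Squared Euclidean distance; skipping the square root keeps the ordering.
    dist_QN = np.sum(diff_QNF ** 2, axis=2)
    # A stable sort breaks ties in favor of the lower data index.
    order_QN = np.argsort(dist_QN, axis=1, kind="stable")
    # Indexing with a (Q, K) integer array yields a (Q, K, F) result.
    return data_NF[order_QN[:, :K]]

data = [[2, 3], [5, 7], [0, 7], [15, 11]]
query = [[8, 10], [11, 13]]
neighbors = knn_vectorized(data, query, K=2)
print(neighbors.shape)  # (2, 2, 2)
```

For the query [8, 10], the squared distances to the four data rows are 85, 18, 73, and 50, so its two nearest neighbors are [5, 7] followed by [15, 11].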

In [None]:
#input will be query (QxF) and data (NxF)
#output will be (QxKxF)
#the smaller the distance, the closer two vectors

import numpy as np

def calc_k_nearest_neighbors(data_NF, query_QF, K=1):
    data_NF = np.array(data_NF)
    query_QF = np.array(query_QF)
    
    all_neighbors = []  # one list of K neighbor vectors per query
    
    for q in range(len(query_QF)):
        distance = []
        for d in range(len(data_NF)):
            # Squared Euclidean distance; the square root is skipped because it does not change the ordering
            temp_distance = np.sum(np.square(data_NF[d] - query_QF[q]))
            distance.append((d, temp_distance))

        distance.sort(key = lambda x: x[1])
        k_nearest = distance[:K]

        neighbors_for_q = []
        for item in k_nearest:
            index = item[0]  # position of this neighbor in data_NF
            neighbors_for_q.append(data_NF[index])
                
        all_neighbors.append(neighbors_for_q)
    return np.array(all_neighbors)  # shape (Q, K, F)

data = [[2,3],[5,7],[0,7],[15,11]]

query = [[8,10],[11,13]]

calc_k_nearest_neighbors(data, query, K=1)

array([[[ 5,  7]],

       [[15, 11]]])