### Used Euclidean (Cartesian) distance as the similarity measure to predict gender for the evaluation data from the corresponding training data, for K = 1, 3, and 5. The intermediate steps are included.

In [1]:
import numpy as np
import pandas as pd

In [2]:
df= pd.read_csv('data.csv')
df1 = pd.read_csv('Test.csv')
samples = df1.values
k_values = [1,3,5]

In [3]:
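def cartesian_distance(sample, inputs):
    """Euclidean distance from `sample` to every row of `inputs`."""

    # broadcasting subtracts every training row from the single sample
    diff = sample - inputs
    # sum the squared differences across the feature columns of each row
    sum_pow = np.sum(np.power(diff, 2), axis=1)

    return np.power(sum_pow, 0.5)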
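
To make the distance step concrete, here is a minimal sketch of the same row-wise Euclidean distance on a tiny hand-made array. The name `euclidean_distance` and the toy values are assumptions for illustration only; they are not taken from `data.csv`.

```python
import numpy as np

def euclidean_distance(sample, inputs):
    """Distance from one sample (shape (d,)) to each row of inputs (shape (n, d))."""
    diff = inputs - sample                      # broadcasting: sample is subtracted from every row
    return np.sqrt(np.sum(diff ** 2, axis=1))   # one distance per training row

# Hypothetical toy data chosen so the distances are easy to check by hand
train = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])
query = np.array([0.0, 0.0])

distances = euclidean_distance(query, train)
print(distances)  # the three distances are 0, 5 and 10
```

The sketch uses `np.sqrt`, which expects a numeric array; the notebook's `np.power(sum_pow, 0.5)` is kept as is because it also works when the sample is passed in as an object-typed row.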

In [4]:
def classify(k, sorted_labels):
    """Majority vote over the k nearest labels (labels carry a leading space; ties go to ' W')."""

    k_neighbors = sorted_labels[:k]
    men_occurrences = np.count_nonzero(k_neighbors == ' M')
    women_occurrences = np.count_nonzero(k_neighbors == ' W')

    return ' M' if men_occurrences > women_occurrences else ' W'
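
The voting step can be checked in isolation. The sketch below repeats the same majority rule under the hypothetical name `majority_vote` and applies it to a made-up list of labels that is assumed to be already sorted from nearest to farthest.

```python
import numpy as np

def majority_vote(k, sorted_labels):
    """Return the more frequent label among the first k entries (ties go to ' W')."""
    k_neighbors = np.asarray(sorted_labels)[:k]
    men = np.count_nonzero(k_neighbors == ' M')
    women = np.count_nonzero(k_neighbors == ' W')
    return ' M' if men > women else ' W'

# Hypothetical neighbor labels, nearest first (note the leading space in each label)
labels = np.array([' W', ' M', ' M', ' W', ' W'])

votes = {k: majority_vote(k, labels) for k in (1, 3, 5)}
print(votes)
```

With these labels the vote is `' W'` for k = 1, `' M'` for k = 3 and `' W'` for k = 5, which shows how the prediction for one sample can flip as k changes. Using odd values of k avoids ties when there are only two classes.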
    

### Implemented the KNN algorithm. The implementation works with different training data sets and different values of K, and accepts an input data point for the prediction.

In [5]:
def KNN_classification(sample, k, df):

    labels = df['Class'].values
    inputs = df.drop('Class', axis=1).values

    # get the Euclidean distance from the sample to each training data point
    cart_distance = cartesian_distance(sample, inputs)

    # stack into a 2-row array: row 0 holds the distances, row 1 the corresponding labels
    labeled_cart = np.vstack((cart_distance, labels))

    # transpose to (distance, label) pairs and sort them by ascending distance
    sorted_cart = labeled_cart.T[labeled_cart.T[:, 0].argsort()]
    sorted_labels = sorted_cart.T[1]

    return classify(k, sorted_labels)
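
As an alternative way to read the function above, the sketch below keeps distances and labels in separate arrays and uses the sorted indices to pick the k nearest labels. The name `knn_predict`, the `label_col` parameter and the toy DataFrame are assumptions for illustration. Its tie handling also differs from `classify`: `np.unique` sorts the labels, so a tie goes to the alphabetically first one.

```python
import numpy as np
import pandas as pd

def knn_predict(sample, k, df, label_col='Class'):
    """Predict the majority label among the k training rows nearest to sample."""
    labels = df[label_col].to_numpy()
    inputs = df.drop(label_col, axis=1).to_numpy(dtype=float)
    sample = np.asarray(sample, dtype=float)

    distances = np.sqrt(np.sum((inputs - sample) ** 2, axis=1))
    nearest = np.argsort(distances, kind='stable')[:k]   # indices of the k closest rows

    values, counts = np.unique(labels[nearest], return_counts=True)
    return values[np.argmax(counts)]

# Hypothetical training data in the same column layout as the notebook (without Age)
toy = pd.DataFrame({
    'Height': [1.60, 1.62, 1.80, 1.82, 1.78],
    'Weight': [55.0, 58.0, 80.0, 85.0, 78.0],
    'Class':  [' W', ' W', ' M', ' M', ' M'],
})

pred = knn_predict([1.61, 56.0], 3, toy)
print(pred)
```

For this query the three nearest rows are two `' W'` rows and one `' M'` row, so the prediction is `' W'`.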

In [6]:
for sample in samples:
    print("sample:{}".format(sample))
    for k in k_values:
        print("\tk:{}".format(k))
        prediction_1 = KNN_classification(sample, k, df)
        print("\tClass predicted is {} for k:{} neighbors".format(prediction_1, k))
        

sample:[ 1.61159968 72.74989648 25.        ]
	k:1
	Class predicted is  W for k:1 neighbors
	k:3
	Class predicted is  M for k:3 neighbors
	k:5
	Class predicted is  W for k:5 neighbors
sample:[ 1.51334854 65.4026277  20.        ]
	k:1
	Class predicted is  W for k:1 neighbors
	k:3
	Class predicted is  W for k:3 neighbors
	k:5
	Class predicted is  W for k:5 neighbors
sample:[ 1.65552675 63.48427979 31.        ]
	k:1
	Class predicted is  W for k:1 neighbors
	k:3
	Class predicted is  W for k:3 neighbors
	k:5
	Class predicted is  W for k:5 neighbors
sample:[ 1.59412216 70.02069521 23.        ]
	k:1
	Class predicted is  W for k:1 neighbors
	k:3
	Class predicted is  W for k:3 neighbors
	k:5
	Class predicted is  W for k:5 neighbors


### Evaluated the performance of the KNN algorithm with a leave-one-out evaluation routine. In leave-one-out validation, each data point is removed from the training set in turn, the algorithm is trained on the remaining data, and it is then tested on the removed point to see whether the predicted label matches. Repeating this for every data point gives an estimate of the percentage of erroneous predictions and thus a measure of the algorithm's accuracy on the given data. Applied leave-one-out validation with the KNN algorithm to the dataset for K = 1, 3, and 5 and reported the results.

In [7]:
data2c = pd.read_csv('data2c.csv')

In [8]:
for k in k_values:
    count = 0
    
    for index, test_sample in data2c.iterrows():
        
        sample = test_sample.values[:3]
        target = test_sample.values[3]
        prediction = KNN_classification(sample, k, data2c.drop(index))
        if target == prediction:
            count = count + 1

    print("KNN Accuracy using k:{}".format(k))
    print("{}/{} correct predictions using all features".format(count, data2c.shape[0]))

KNN Accuracy using k:1
73/120 correct predictions using all features
KNN Accuracy using k:3
75/120 correct predictions using all features
KNN Accuracy using k:5
80/120 correct predictions using all features
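
The leave-one-out loop can also be wrapped in a reusable function. The following is a sketch under stated assumptions: `nearest_vote` and `leave_one_out` are hypothetical names, the toy DataFrame is made up, and the compact classifier breaks ties differently from `classify` (it picks the alphabetically first label).

```python
import numpy as np
import pandas as pd

def nearest_vote(sample, k, df, label_col='Class'):
    """Hypothetical compact KNN: majority label among the k nearest rows."""
    labels = df[label_col].to_numpy()
    inputs = df.drop(label_col, axis=1).to_numpy(dtype=float)
    distances = np.sqrt(np.sum((inputs - sample) ** 2, axis=1))
    nearest = np.argsort(distances, kind='stable')[:k]
    values, counts = np.unique(labels[nearest], return_counts=True)
    return values[np.argmax(counts)]

def leave_one_out(df, k, label_col='Class'):
    """Return (correct, total) when each row is predicted from all the other rows."""
    correct = 0
    for index, row in df.iterrows():
        sample = row.drop(label_col).to_numpy(dtype=float)   # features of the held-out row
        held_out_label = row[label_col]
        prediction = nearest_vote(sample, k, df.drop(index), label_col)
        correct += int(prediction == held_out_label)
    return correct, len(df)

# Hypothetical, well-separated toy data: three ' W' rows and three ' M' rows
toy = pd.DataFrame({
    'Height': [1.60, 1.62, 1.58, 1.80, 1.82, 1.78],
    'Weight': [55.0, 58.0, 54.0, 80.0, 85.0, 78.0],
    'Class':  [' W', ' W', ' W', ' M', ' M', ' M'],
})

results = {k: leave_one_out(toy, k) for k in (1, 3, 5)}
print(results)
```

On this toy set k = 1 and k = 3 classify all six points correctly, while k = 5 gets none right: once a point is held out, only two rows of its own class remain, so the other class always wins the vote. This only illustrates the mechanics and says nothing about `data2c.csv`.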


### Repeated the prediction and validation using KNN when the age data is removed (i.e. when only the height and weight features are used as part of the distance calculation in the KNN algorithm). Reported the results and compared the performance without the age attribute. Discussed the results.

In [9]:
#Remove age feature

data2c_wo_age = data2c.drop('Age', axis=1)
data2c_wo_age.head(1)

|   | Height   | Weight    | Class |
|---|----------|-----------|-------|
| 0 | 1.581431 | 81.535494 | M     |


In [10]:
for k in k_values:
    count = 0
    
    for index, test_sample in data2c_wo_age.iterrows():
        
        sample = test_sample.values[:2]
        target = test_sample.values[2]
        prediction = KNN_classification(sample, k, data2c_wo_age.drop(index))
        if target == prediction:
            count = count + 1
            

    print("KNN Accuracy using k:{}".format(k))
    print("{}/{} correct predictions without Age feature".format(count, data2c_wo_age.shape[0]))

KNN Accuracy using k:1
80/120 correct predictions without Age feature
KNN Accuracy using k:3
86/120 correct predictions without Age feature
KNN Accuracy using k:5
77/120 correct predictions without Age feature


#### Report of the results

* KNN Performance using k:1 <br>73/120 correct predictions using all features
* KNN Performance using k:3 <br>75/120 correct predictions using all features
* KNN Performance using k:5 <br>80/120 correct predictions using all features

* KNN Performance using k:1 <br>80/120 correct predictions without Age feature
* KNN Performance using k:3 <br>86/120 correct predictions without Age feature
* KNN Performance using k:5 <br>77/120 correct predictions without Age feature
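
To compare the two settings on a common scale, the reported counts can be converted to percentages. The snippet below is a small arithmetic sketch; the counts are copied from the outputs above, and the variable names are chosen here for illustration.

```python
# Correct predictions out of 120, copied from the leave-one-out outputs above
correct_all = {1: 73, 3: 75, 5: 80}       # all features
correct_no_age = {1: 80, 3: 86, 5: 77}    # Age feature removed
total = 120

summary = {}
for k in (1, 3, 5):
    acc_all = 100 * correct_all[k] / total
    acc_no_age = 100 * correct_no_age[k] / total
    summary[k] = (acc_all, acc_no_age)
    print(f"k={k}: all features {acc_all:.1f}%, "
          f"without Age {acc_no_age:.1f}%, "
          f"difference {acc_no_age - acc_all:+.1f} points")
```

This gives roughly 60.8%, 62.5% and 66.7% with all features, and 66.7%, 71.7% and 64.2% without Age, for k = 1, 3 and 5 respectively.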

Comparing the results, excluding the Age feature improves accuracy for k = 1 and k = 3 (80 vs. 73 and 86 vs. 75 correct predictions out of 120), whereas for k = 5 the opposite holds (77 vs. 80). With all features, accuracy rises as k increases, from 73 correct predictions at k = 1 to 80 at k = 5. Without Age, accuracy is higher at the low values of k and peaks at k = 3, but drops at k = 5, which suggests that with this little data a larger k makes it harder for the algorithm to perform well. Because KNN is a non-parametric, discriminative classifier whose predictions depend directly on the training data, the data provided appears to be insufficient to get good results.