# Session 1
## Experiment 1
### Lab

In this experiment, we will use the fruit data set that we explored earlier and learn how a simple k-nearest-neighbour (kNN) classifier works.

Let us consider a simple situation. Given some data about a fruit, we want to label it automatically.

Fruits are characterized by 
 * weight in grams as a float
 * colour as an integer
     - 1 $\rightarrow$ red
     - 2 $\rightarrow$ orange
     - 3 $\rightarrow$ yellow
     - 4 $\rightarrow$ green
     - 5 $\rightarrow$ blue
     - 6 $\rightarrow$ purple
 * label as a string
     - "A" $\rightarrow$ Apple
     - "B" $\rightarrow$ Banana
     
We are given some sample data such as (303, 3, "A"), meaning that a fruit weighing 303 grams with yellow colour is an apple. A set of such *training examples* is given in "01-train.csv". This has a small set of 17 **labeled** examples.

We are given a set of **test** data where only weight and colour are given, e.g. (373, 1). We should design a simple Nearest Neighbour classifier that will find the fruit label, i.e. "A" or "B", meaning Apple or Banana.

We have 102 such test cases. We are also given additional files which have the correct labels for all the 102 test cases. If your predicted label is correct, you have done well!

Here are the details of all the files:
  * 01-train.csv $\Rightarrow$ The original input data. 
    - 18 lines
    - the first line is a header
    - each of the remaining 17 lines has three pieces of data:
       * weight in grams :: float
       * colour code :: 1, 2, 3, 4, 5 
       * label :: "A", "B"
  * 01-test1.csv $\Rightarrow$ The first test data set.
    - 31 lines
    - the first line is a header
    - each of the remaining 30 lines has two pieces of data
       * weight in grams :: float
       * colour code :: 1, 2, 3, 4, 5
  * 01-test1-labels.csv $\Rightarrow$ The labels for test data set above. That is, each line has just the correct label.
  * 01-test1-labelled.csv $\Rightarrow$ The above two files combined. 
  * 01-test2.csv $\Rightarrow$ The second test data set. Similar to the first data set, except that it has 73 lines.
  * 01-test2-labels.csv $\Rightarrow$ The labels for test data set above. That is, each line has just the correct label.
  * 01-test2-labelled.csv $\Rightarrow$ The above two files combined. 

Do similar fruits lie close together in the (weight, colour) feature space? Let us plot the training data to find out, and then add one test sample, (373, 1), in black.

In [None]:
# Let us first read the data from the file and do a quick visualization
import pandas as pd
import matplotlib.pyplot as plt
train = pd.read_csv("../Datasets/01-train.csv")

In [None]:
apples = train[train.Label == "A"]
bananas = train[train.Label == "B"]
plt.plot(apples.Weight, apples.Colour, "ro")
plt.plot(bananas.Weight, bananas.Colour, "y+")
plt.xlabel("Weight -- in grams")
plt.ylabel("Colour -- r-o-y-g-b-p")
plt.legend(["Apples", "Bananas"])
plt.show()

In [None]:
plt.plot(apples.Weight, apples.Colour, "ro")
plt.plot(bananas.Weight, bananas.Colour, "y+")
plt.xlabel("Weight -- in grams")
plt.ylabel("Colour -- r-o-y-g-b-p")
plt.legend(["Apples", "Bananas"])
plt.plot([373], [1], "ko")
plt.show()


From the visualization alone, we can infer that the unknown fruit is likely to be an apple. 

Instead of eyeballing the points one at a time as above, the job now is to use a kNN classifier with, say, $k = 3$ and the *Euclidean* distance to determine the correct labels for the 30 data points in the file "01-test1.csv".

Let us first write a distance function to calculate the *Euclidean* distance between two fruits.

In [None]:
import math
def dist(a, b):
    sqSum = 0
    for i in range(len(a)):
        sqSum += (a[i] - b[i]) ** 2
    return math.sqrt(sqSum)
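
As a quick sanity check of the distance computation, here is a small self-contained sketch applied to the two fruits quoted earlier in the text, (303, 3) and (373, 1). The helper name `euclidean` is ours, introduced only for this illustration; it is meant to compute the same quantity as `dist`.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length feature tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

fruit_a = (303, 3)   # (weight in grams, colour code)
fruit_b = (373, 1)

# 70 grams apart in weight and 2 apart in colour: sqrt(70**2 + 2**2) = sqrt(4904)
d = euclidean(fruit_a, fruit_b)
print(d)
```

The result is only slightly above 70: the weight difference contributes 4900 to the sum of squares and the colour difference only 4, so with weights in grams the distance is dominated by weight.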


Now let us write code to find the $k$ nearest neighbours of a given fruit

In [None]:
def kNN(k, train, given):
    distances = []
    for t in train.values:              
        # loop over all training samples
        distances.append((dist(t[:2], given), t[2])) 
        # compute and store distances with respect to each training sample
    distances.sort()            
    return distances[:k]    # return first k samples = nearest  k distances to the given sample
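
The same idea can be sketched without sorting the whole list. The version below is only an illustration under our own assumptions: the name `k_nearest` and the toy samples are hypothetical (they are not rows from "01-train.csv"), and the samples are plain tuples rather than a DataFrame. It uses `heapq.nsmallest` with a key on the distance, so labels are never compared when two distances tie.

```python
import heapq
import math

def k_nearest(k, samples, given):
    """Return the k (distance, label) pairs closest to `given`.

    `samples` is a list of (weight, colour, label) tuples.
    """
    scored = [
        (math.hypot(s[0] - given[0], s[1] - given[1]), s[2])
        for s in samples
    ]
    # Select the k smallest distances; only the distance is used for ordering
    return heapq.nsmallest(k, scored, key=lambda pair: pair[0])

# Hypothetical toy data, made up for this example
toy_train = [
    (300, 1, "A"),
    (310, 2, "A"),
    (120, 3, "B"),
    (130, 3, "B"),
    (350, 1, "A"),
]

nearest = k_nearest(3, toy_train, (340, 1))
print(nearest)
```

For a training set this small the difference from a full sort is negligible; the sketch mainly shows that only the $k$ smallest distances are needed.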

In [None]:
print(kNN(5, train, (340, 1)))
print(kNN(5, train, (373, 1)))

As you can see above, the 3 (and 5) nearest neighbours of the fruit with the characteristics (373, 1) are all Apples (label "A"), which is what we saw visually when we plotted the point as a black spot in the chart. Of course, we want a function that returns the majority label rather than reading it off the printed list, so we have written one. It uses `collections.Counter`, a very useful class from the Python standard library. More details are at:

 * https://docs.python.org/3/library/collections.html#collections.Counter 

In [None]:
import collections
def kNNmax(k, train, given):
    tally = collections.Counter()
    for nn in kNN(k, train, given):
        tally[nn[-1]] += 1  # one vote for this neighbour's label (safe for multi-character labels too)
    return tally.most_common(1)[0]
print(kNNmax(5, train, (340, 1)))
print(kNNmax(7, train, (340, 1)))

This shows that of the five nearest neighbours to (340, 1), four are Apples, and of the seven nearest, five are Apples.
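
The majority vote itself can be illustrated in isolation. In the sketch below the list of neighbour labels is hypothetical (chosen to mirror a four-to-one vote, not read from the data files); it shows what `Counter.most_common(1)` returns and one detail of `Counter.update` that is easy to miss.

```python
from collections import Counter

# Hypothetical labels of the five nearest neighbours (not taken from the data files)
neighbour_labels = ["A", "A", "B", "A", "A"]

tally = Counter(neighbour_labels)
label, votes = tally.most_common(1)[0]   # (most frequent label, its count)
print(label, votes)   # A 4

# Counter.update expects an iterable of items; a bare string is split into characters
by_letter = Counter()
by_letter.update("Apple")
by_word = Counter()
by_word.update(["Apple"])
print(by_letter["p"], by_word["Apple"])   # 2 1
```

With one-letter labels such as "A" and "B", counting by character and counting by item give the same result, which is why the distinction does not show up in this experiment.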

Now let us load the test data and find the labels for all of them 

In [None]:
testData1 = pd.read_csv('../Datasets/01-test1.csv').values
test1labels = pd.read_csv('../Datasets/01-test1-labels.csv').values
for t,t1 in zip(testData1,test1labels):
    print(t, kNNmax(1, train, t)[0], t1)
    


Let us count how many are correct, instead of displaying the results

In [None]:
testData = pd.read_csv('../Datasets/01-test1.csv').values
testResults = pd.read_csv('../Datasets/01-test1-labels.csv').values.flatten()
results = []
    
for i, t in enumerate(testData):
    results.append(kNNmax(3, train, t)[0] == testResults[i])

print(results.count(True), "are correct")

In [None]:
len(testData)

**Exercise 1** :: Find the accuracy of your prediction -- percentage of the samples that are correctly predicted.

In [None]:
print("Accuracy is ",results.count(True)/len(testData) * 100 , "%")
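
The accuracy calculation can also be written as a small reusable helper. This is a minimal sketch under our own assumptions: the name `accuracy_percent` and the example label lists are hypothetical and are not taken from the test files.

```python
def accuracy_percent(predicted, actual):
    """Percentage of positions where the predicted label equals the actual label."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return 100.0 * correct / len(actual)

# Hypothetical predictions and ground truth, made up for this example
predicted = ["A", "B", "A", "A"]
actual = ["A", "B", "B", "A"]

acc = accuracy_percent(predicted, actual)
print("Accuracy is", acc, "%")   # Accuracy is 75.0 %
```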

**Exercise 2** :: Predict the labels for the larger file "01-test2.csv" that has 72 data points


In [None]:
testData2 = pd.read_csv('../Datasets/01-test2.csv').values
for t in testData2:
    print(t,kNNmax(3,train,t)[0])


**Exercise 3** :: Find the accuracy of the prediction by comparing with the correct labels in "01-test2-labels.csv" (the same labels also appear in "01-test2-labelled.csv").

In [None]:
testResults2 = pd.read_csv('../Datasets/01-test2-labels.csv').values.flatten()
results2 = []
for i,t in enumerate(testData2):
    results2.append(kNNmax(3,train,t)[0] == testResults2[i])
print(results2.count(True), " are correct")
print("Accuracy is ",results2.count(True)/len(testData2)*100, " %")

**Exercise 4** :: Repeat the above experiment with $k = 5$ and $k = 7$. Which $k$ is better, and why?

**Exercise 5** :: Repeat the above experiment with $k = 17$. What do you think is happening?


In [None]:
def accuracy(train,test,testResults,k):
    results = []
    for i,t in enumerate(test):
        results.append(kNNmax(k,train,t)[0] == testResults[i])
        
    print(results.count(True), " are correct")
    print("Accuracy is ",results.count(True)/len(test)*100, " %")
    
    
accuracy(train,testData2,testResults2,5)
accuracy(train,testData2,testResults2,7)
accuracy(train,testData2,testResults2,17)
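
One way to reason about the $k = 17$ case: the training file has 17 examples, so with $k = 17$ every training example is a "neighbour" of every test point, and the vote no longer depends on the test point at all. The sketch below illustrates this effect on hypothetical toy data of size 5 (the samples and the name `predict` are ours, not from the data files).

```python
import math
from collections import Counter

# Hypothetical toy training set: three apples, two bananas
toy_train = [
    (300, 1, "A"),
    (310, 2, "A"),
    (350, 1, "A"),
    (120, 3, "B"),
    (130, 3, "B"),
]

def predict(k, samples, given):
    """Majority label among the k samples nearest to `given`."""
    nearest = sorted(
        samples,
        key=lambda s: math.hypot(s[0] - given[0], s[1] - given[1]),
    )[:k]
    return Counter(s[2] for s in nearest).most_common(1)[0][0]

banana_like = (125, 3)   # sits right between the two bananas

small_k = predict(3, toy_train, banana_like)
full_k = predict(len(toy_train), toy_train, banana_like)
print(small_k, full_k)   # B A
```

With $k$ equal to the size of the training set, the classifier returns the majority class of the training data for every input, regardless of where the input lies.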

**Exercise 6** :: If the weights are in kilograms, that is, if every value in the weight column is divided by 1000, what is the accuracy for $k = 3$?


In [None]:
TraininKG = pd.read_csv("../Datasets/01-train.csv")
TraininKG['Weight'] = TraininKG['Weight']/1000
TestData2inKG = pd.read_csv('../Datasets/01-test2.csv')
TestData2inKG['Weight'] = TestData2inKG['Weight']/1000
TestData2inKG = TestData2inKG.values

In [None]:
accuracy(TraininKG,TestData2inKG,testResults2,3)
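
To see why a change of units can alter which neighbours are nearest, consider the following sketch. All three fruits are hypothetical values chosen for illustration, and `euclidean` and `to_kg` are our own helper names. In grams the weight difference dominates the distance; in kilograms the colour difference dominates instead.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length feature tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def to_kg(fruit):
    """Convert the weight of a (weight, colour) pair from grams to kilograms."""
    weight, colour = fruit
    return (weight / 1000, colour)

# Hypothetical fruits as (weight in grams, colour code)
query = (150, 3)
close_in_weight = (160, 1)   # 10 g away, colour differs by 2
same_colour = (250, 3)       # 100 g away, same colour

grams_prefers_weight = (
    euclidean(query, close_in_weight) < euclidean(query, same_colour)
)
kg_prefers_weight = (
    euclidean(to_kg(query), to_kg(close_in_weight))
    < euclidean(to_kg(query), to_kg(same_colour))
)
print(grams_prefers_weight, kg_prefers_weight)   # True False
```

The nearest neighbour flips purely because of the unit change, so any accuracy difference in this exercise comes from the relative scale of the two features, not from new information.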

In [None]:
### The weight conversion can also be done while reading the file,
### using the `converters` argument of pandas.read_csv

def convert_weight(weight):
    # converters receive the raw cell text; parse it as a float
    # (int() would fail on values such as "303.5") and convert grams to kg
    return float(weight) / 1000

Traininkgs = pd.read_csv("../Datasets/01-train.csv", converters={"Weight": convert_weight})
Testdata2inkgs = pd.read_csv("../Datasets/01-test2.csv", converters={"Weight": convert_weight})

**Exercise 7** :: Modify the distance function to ignore the colour feature. Calculate the accuracy on "01-test1.csv"


In [None]:
import math
def dist(a, b):
    sqSum = 0
    for i in range(len(a)-1):
        sqSum += (a[i] - b[i]) ** 2
    return math.sqrt(sqSum)

In [None]:
testData = pd.read_csv('../Datasets/01-test1.csv').values
testResults = pd.read_csv('../Datasets/01-test1-labels.csv').values.flatten()
results = []
    
for i, t in enumerate(testData):
    results.append(kNNmax(3, train, t)[0] == testResults[i])

print(results.count(True), "are correct")

In [None]:
results.count(True)/len(testData) * 100

#### Result: Accuracy is 100% for all test data

**Exercise 8** :: If we use the square of the Euclidean distance as the distance function, does it affect the accuracy?

In [None]:
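def dist(a, b):
    # Squared Euclidean distance: same sum as before, without the square root.
    # Note: as in Exercise 7, range(len(a)-1) skips the last feature (colour);
    # use range(len(a)) to include it.
    sqSum = 0
    for i in range(len(a)-1):
        sqSum += (a[i] - b[i]) ** 2
    return sqSum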

In [None]:
testData = pd.read_csv('../Datasets/01-test1.csv').values
testResults = pd.read_csv('../Datasets/01-test1-labels.csv').values.flatten()
results = []
    
for i, t in enumerate(testData):
    results.append(kNNmax(3, train, t)[0] == testResults[i])

print(results.count(True), "are correct")

In [None]:
results.count(True)/len(testData) * 100

#### Result : No effect
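
The reason is that the square root is strictly increasing on non-negative numbers, so ordering points by squared distance gives the same order as ordering them by distance, and the $k$ nearest neighbours are unchanged. The sketch below checks this on hypothetical points (made up for the example, not taken from the data files); `sq_dist` is our own helper name.

```python
import math

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length feature tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Hypothetical (weight, colour) points and query
points = [(300, 1), (310, 2), (120, 3), (130, 3), (350, 1)]
query = (340, 1)

by_squared = sorted(points, key=lambda p: sq_dist(p, query))
by_euclidean = sorted(points, key=lambda p: math.sqrt(sq_dist(p, query)))

print(by_squared == by_euclidean)   # True
```

Skipping the square root therefore only saves a little computation; it cannot change which labels are predicted.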

**Exercise 9** :: If we use the sum of the absolute differences as the distance metric instead of the Euclidean distance, how does that affect the accuracy?

**Result:** Slight change in accuracy if the colour feature is included in the dist function.
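
The sum of absolute differences is often called the Manhattan distance. A minimal sketch of such a distance function is given below; the names `manhattan` and `euclidean` and the three example points are our own, introduced for illustration. Unlike squaring, this change of metric can reorder neighbours, which is why the accuracy may shift.

```python
import math

def manhattan(a, b):
    """Sum of absolute feature differences between two equal-length tuples."""
    return sum(abs(x - y) for x, y in zip(a, b))

def euclidean(a, b):
    """Euclidean distance, for comparison."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# The two fruits quoted in the text: |303 - 373| + |3 - 1| = 72
print(manhattan((303, 3), (373, 1)))   # 72

# Hypothetical points where the two metrics disagree about which is nearer
origin = (0, 0)
diagonal = (3, 3)   # Euclidean about 4.24, Manhattan 6
on_axis = (5, 0)    # Euclidean 5, Manhattan 5

euclid_picks_diagonal = euclidean(origin, diagonal) < euclidean(origin, on_axis)
manhattan_picks_diagonal = manhattan(origin, diagonal) < manhattan(origin, on_axis)
print(euclid_picks_diagonal, manhattan_picks_diagonal)   # True False
```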

## Acknowledgment
This experiment is based on the blog post http://www.jiaaro.com/KNN-for-humans. 

## Summary
In the above experiment, we find that a simple nearest neighbour method can successfully predict labels with a small number of labelled examples. But we also see that the results can go really wrong if we make some wrong choices (like weight in kg, or a very large $k$). This should remind you that practical expertise and experimental skills will become equally important as we move forward.