# Exercise 3: Using kNN for classification of Wheat Seeds

# Task 1: Get the data and load it to an appropriate data structure

The dataset can be downloaded from: https://archive.ics.uci.edu/dataset/236/seeds

For example, you can download it with curl: `curl -o seeds.zip https://archive.ics.uci.edu/static/public/236/seeds.zip`

In a next step, load the data into a pandas DataFrame. If you encounter any issues, think about how you could solve them.
You can load the data directly with `pd.read_csv(..., sep='\t')`, as the file is tab-separated.
Or you can loop over an open file handle and append the data to a list, which you can then convert to a pandas DataFrame.
```python
data = []
with open('seeds_dataset.txt') as f:
    pass  # placeholder: parse each line here and append the fields to data
```
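The file-handle approach could look like the sketch below. It reads from an in-memory sample (the first two rows shown in this exercise, with a doubled tab inserted on purpose as an assumed example of irregular spacing), so it runs without the downloaded file. To use the real data, replace the `io.StringIO` object with `open('seeds_dataset.txt')`. The helper name `parse_rows` is hypothetical.

```python
import io

import pandas as pd

COLUMNS = ["A", "P", "compactness", "length of kernel", "width of kernel",
           "asymmetry coeff", "length of kernel groove", "varieties"]


def parse_rows(handle):
    """Collect one list of floats per non-empty line of a whitespace-separated file."""
    data = []
    for line in handle:
        fields = line.split()  # splits on any run of tabs or spaces
        if fields:             # skip blank lines
            data.append([float(value) for value in fields])
    return data


# In-memory stand-in for the file; the second row contains a doubled tab on purpose.
sample = io.StringIO(
    "15.26\t14.84\t0.871\t5.763\t3.312\t2.221\t5.22\t1\n"
    "14.88\t14.57\t0.8811\t\t5.554\t3.333\t1.018\t4.956\t1\n"
)

sample_df = pd.DataFrame(parse_rows(sample), columns=COLUMNS)
sample_df["varieties"] = sample_df["varieties"].astype(int)  # class labels are integers
print(sample_df)
```

Because `str.split()` without arguments treats any run of whitespace as one separator, a row with repeated tabs still yields exactly eight fields.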

In [70]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [71]:
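# Split on any run of whitespace rather than on a single tab.
# The raw string avoids an invalid-escape warning for '\s'.
df = pd.read_csv('seeds_dataset.txt', sep=r'\s+', names=["A", "P", "compactness", "length of kernel",
                                                         "width of kernel", "asymmetry coeff",
                                                         "length of kernel groove", "varieties"])
print(df)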

         A      P  compactness  length of kernel  width of kernel   
0    15.26  14.84       0.8710             5.763            3.312  \
1    14.88  14.57       0.8811             5.554            3.333   
2    14.29  14.09       0.9050             5.291            3.337   
3    13.84  13.94       0.8955             5.324            3.379   
4    16.14  14.99       0.9034             5.658            3.562   
..     ...    ...          ...               ...              ...   
205  12.19  13.20       0.8783             5.137            2.981   
206  11.23  12.88       0.8511             5.140            2.795   
207  13.20  13.66       0.8883             5.236            3.232   
208  11.84  13.21       0.8521             5.175            2.836   
209  12.30  13.34       0.8684             5.243            2.974   

     asymmetry coeff  length of kernel groove  varieties  
0              2.221                    5.220          1  
1              1.018                    4.956        

## Task 1.1: Rename the columns to something more descriptive (see the dataset description for that)
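One way to do the renaming is `DataFrame.rename` with a mapping from old to new names. The sketch below uses a small stand-in DataFrame built from the first two rows of the data. The names `area` and `perimeter` follow the dataset description, while `asymmetry coefficient` and `variety` are only suggestions.

```python
import pandas as pd

# Small stand-in for the full DataFrame (values taken from the first two rows).
small_df = pd.DataFrame({
    "A": [15.26, 14.88],
    "P": [14.84, 14.57],
    "asymmetry coeff": [2.221, 1.018],
    "varieties": [1, 1],
})

# Map only the columns that should change; all other columns keep their names.
new_names = {
    "A": "area",
    "P": "perimeter",
    "asymmetry coeff": "asymmetry coefficient",
    "varieties": "variety",
}

# rename returns a new DataFrame and leaves small_df untouched.
renamed = small_df.rename(columns=new_names)
print(renamed.columns.tolist())
```

If you rename the columns of `df` itself, remember to use the new label name wherever later cells refer to `'varieties'`.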

In [72]:
df.head()


| | A | P | compactness | length of kernel | width of kernel | asymmetry coeff | length of kernel groove | varieties |
|---|---|---|---|---|---|---|---|---|
| 0 | 15.26 | 14.84 | 0.871 | 5.763 | 3.312 | 2.221 | 5.22 | 1 |
| 1 | 14.88 | 14.57 | 0.8811 | 5.554 | 3.333 | 1.018 | 4.956 | 1 |
| 2 | 14.29 | 14.09 | 0.905 | 5.291 | 3.337 | 2.699 | 4.825 | 1 |
| 3 | 13.84 | 13.94 | 0.8955 | 5.324 | 3.379 | 2.259 | 4.805 | 1 |
| 4 | 16.14 | 14.99 | 0.9034 | 5.658 | 3.562 | 1.355 | 5.175 | 1 |


In [73]:
df.describe()

| | A | P | compactness | length of kernel | width of kernel | asymmetry coeff | length of kernel groove | varieties |
|---|---|---|---|---|---|---|---|---|
| count | 210.0 | 210.0 | 210.0 | 210.0 | 210.0 | 210.0 | 210.0 | 210.0 |
| mean | 14.847524 | 14.559286 | 0.870999 | 5.628533 | 3.258605 | 3.700201 | 5.408071 | 2.0 |
| std | 2.909699 | 1.305959 | 0.023629 | 0.443063 | 0.377714 | 1.503557 | 0.49148 | 0.818448 |
| min | 10.59 | 12.41 | 0.8081 | 4.899 | 2.63 | 0.7651 | 4.519 | 1.0 |
| 25% | 12.27 | 13.45 | 0.8569 | 5.26225 | 2.944 | 2.5615 | 5.045 | 1.0 |
| 50% | 14.355 | 14.32 | 0.87345 | 5.5235 | 3.237 | 3.599 | 5.223 | 2.0 |
| 75% | 17.305 | 15.715 | 0.887775 | 5.97975 | 3.56175 | 4.76875 | 5.877 | 3.0 |
| max | 21.18 | 17.25 | 0.9183 | 6.675 | 4.033 | 8.456 | 6.55 | 3.0 |


# Task 1.2: Look at the data and get a feeling for it
Use
```python
df.head()
df.describe()
```
or plot some histograms of the dataset to get a feeling for the data.
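A minimal sketch of the histogram option, using `DataFrame.hist`, is shown below. It builds a stand-in DataFrame from the five rows printed by `df.head()` so that it runs on its own; with the real data you would call `df.hist(...)` directly. The bin count and figure size are arbitrary choices, and the `Agg` backend line is only needed outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

columns = ["A", "P", "compactness", "length of kernel", "width of kernel",
           "asymmetry coeff", "length of kernel groove", "varieties"]

# Stand-in data: the five rows shown by df.head().
head_df = pd.DataFrame(
    [[15.26, 14.84, 0.871, 5.763, 3.312, 2.221, 5.22, 1],
     [14.88, 14.57, 0.8811, 5.554, 3.333, 1.018, 4.956, 1],
     [14.29, 14.09, 0.905, 5.291, 3.337, 2.699, 4.825, 1],
     [13.84, 13.94, 0.8955, 5.324, 3.379, 2.259, 4.805, 1],
     [16.14, 14.99, 0.9034, 5.658, 3.562, 1.355, 5.175, 1]],
    columns=columns,
)

# One histogram per numeric column, arranged in a grid of subplots.
axes = head_df.hist(bins=5, figsize=(10, 8))
plt.tight_layout()

# Number of samples per class, useful for spotting class imbalance.
class_counts = head_df["varieties"].value_counts()
print(class_counts)
```

On the full dataset, the per-class counts show whether the three varieties are equally represented, which matters when interpreting accuracy later on.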

# Task 2: Implement the kNN algorithm using sklearn
Split the data into a training and a test set. Use the kNN algorithm from sklearn to classify the data.
Use accuracy (the fraction of correctly classified instances) as a metric to evaluate the performance of the algorithm.
Use 20% of the data as test set and the rest as training set.
Example of how to split the data:
```python
import numpy as np
from sklearn.model_selection import train_test_split
X, y = np.arange(10).reshape((5, 2)), range(5)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)
```
More details can be found here: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html


# Task 2.1 Split the data
Use 20% of the data as test set and the rest as training set.
Questions to ask yourself:
- Why do we need to split the data?
- What is the random state?
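To explore the second question, the sketch below splits a stand-in array of 210 row indices (the dataset has 210 rows) several times. It illustrates that `random_state` seeds the shuffling: the same seed reproduces the same split, while a different seed will in general give a different one. The variable names are illustrative only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

indices = np.arange(210)  # stand-in for the 210 rows of the dataset

# Same seed twice, then a different seed.
train_a, test_a = train_test_split(indices, test_size=0.2, random_state=42)
train_b, test_b = train_test_split(indices, test_size=0.2, random_state=42)
train_c, test_c = train_test_split(indices, test_size=0.2, random_state=0)

same_seed_matches = np.array_equal(test_a, test_b)
different_seed_matches = np.array_equal(test_a, test_c)

print(len(train_a), len(test_a))   # sizes of the training and test sets
print(same_seed_matches)           # whether the two seed-42 splits are identical
print(different_seed_matches)      # whether seed 0 happened to give the same split
```

Training and test indices never overlap, which is the point of the split: accuracy is measured on samples the model has not seen during training.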

In [74]:
from sklearn.model_selection import train_test_split

X, y = df.drop('varieties', axis=1), df['varieties']
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


# Task 2.2 Implement the kNN algorithm and predict the test set samples with the trained model
```python
from sklearn.neighbors import KNeighborsClassifier
neigh = KNeighborsClassifier(n_neighbors=3)
```
Use the fit and predict functions of the model to train and predict the data.

In [75]:
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(x_train, y_train)
y_predicted = knn.predict(x_test)
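To see what happens conceptually during prediction, here is a minimal NumPy sketch of a kNN classifier with Euclidean distance and majority voting. It is an illustration on toy data, not sklearn's implementation, and the function name `knn_predict` is hypothetical. Ties are broken in favour of the class that appears first among the nearest neighbours, which is an assumption of this sketch.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, X_query, k=3):
    """Predict a label for each query point by majority vote of its k nearest neighbours."""
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train)
    X_query = np.asarray(X_query, dtype=float)

    predictions = []
    for query in X_query:
        # Euclidean distance from the query point to every training point
        distances = np.linalg.norm(X_train - query, axis=1)
        # Indices of the k closest training points, nearest first
        nearest = np.argsort(distances)[:k]
        # Majority vote among the neighbours' labels
        votes = Counter(y_train[nearest].tolist())
        predictions.append(votes.most_common(1)[0][0])
    return np.array(predictions)


# Toy data: two well-separated clusters with labels 1 and 2.
X_toy = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                  [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
y_toy = np.array([1, 1, 1, 2, 2, 2])

toy_pred = knn_predict(X_toy, y_toy, [[0.1, 0.1], [5.0, 5.1]], k=3)
print(toy_pred)
```

Note that "training" here only stores the data; all the work happens at prediction time, which is why kNN is often called a lazy learner.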


# Task 2.3 Check the accuracy of the model using your own implementation of the accuracy function (fraction of correctly classified examples). 
Now compare the accuracy of the model for different values of k (e.g. 1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21). You can use a loop for that.

In [76]:
def accuracy(y_true, y_pred):
    correct = 0
    num_of_samples = len(y_true)

    if num_of_samples != len(y_pred):
        raise ValueError("Incompatible lengths of the input arrays")
    
    for i in range(num_of_samples):
        if y_true.iloc[i] == y_pred[i]:
            correct += 1
    return correct / num_of_samples
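As an alternative to the explicit loop, the same fraction can be computed with a vectorised comparison. The sketch below assumes both inputs can be converted to NumPy arrays of equal shape; the name `accuracy_vectorized` is illustrative.

```python
import numpy as np


def accuracy_vectorized(y_true, y_pred):
    """Fraction of positions where the predicted label equals the true label."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    if y_true.shape != y_pred.shape:
        raise ValueError("Incompatible shapes of the input arrays")
    # The comparison yields a boolean array; its mean is the fraction of True values.
    return float(np.mean(y_true == y_pred))


# Three of the four labels match.
example_accuracy = accuracy_vectorized([1, 2, 3, 3], [1, 2, 2, 3])
print(example_accuracy)
```

Converting with `np.asarray` also makes the function independent of the pandas index, so it accepts lists, arrays, and Series alike.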


In [77]:
from sklearn.metrics import accuracy_score

print(accuracy(y_test, y_predicted))
print(accuracy_score(y_test, y_predicted))


0.8809523809523809
0.8809523809523809
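The comparison over different values of k asked for in Task 2.3 could be sketched as follows. The helper `accuracy_per_k` is a hypothetical name, and the demo runs on synthetic stand-in data from `make_blobs` (210 samples, 7 features, 3 classes) so that it is self-contained. The accuracies it prints say nothing about the seeds dataset.

```python
from sklearn.datasets import make_blobs
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier


def accuracy_per_k(x_train, x_test, y_train, y_test, k_values):
    """Train one kNN model per k and return a dict mapping k to test accuracy."""
    results = {}
    for k in k_values:
        model = KNeighborsClassifier(n_neighbors=k)
        model.fit(x_train, y_train)
        results[k] = accuracy_score(y_test, model.predict(x_test))
    return results


# Synthetic stand-in data with the same shape as the seeds features (assumption).
X_demo, y_demo = make_blobs(n_samples=210, centers=3, n_features=7, random_state=42)
xd_train, xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42)

k_values = range(1, 22, 2)  # 1, 3, 5, ..., 21
demo_results = accuracy_per_k(xd_train, xd_test, yd_train, yd_test, k_values)

for k, acc in demo_results.items():
    print(f"k={k:2d}  accuracy={acc:.3f}")
```

With the exercise data, call `accuracy_per_k(x_train, x_test, y_train, y_test, k_values)` using the split from Task 2.1.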
