# 1. Developing base functions with iris data

## Context

This notebook works through a from-scratch implementation of the base functions for a simple k-nearest neighbors (kNN) predictive model on quantitative vector data.

## Work outline

I am following the tutorial found at [this_site](https://machinelearningmastery.com/tutorial-to-implement-k-nearest-neighbors-in-python-from-scratch/) and walking through a simple example of predicting iris species using the well-known [iris_petal_dataset](https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data).

### Steps

1. import module from local filepath; load local data file
2. test euclidean_distance base function with example rows from data
3. test find_k_neighbors base function with fake query against cleaned dataset
4. test predict_category_from_knn 

## Result

The base functions appear to work well in the light interactive testing here. I also have some early unittest.TestCase classes written in a test module that show the functions are behaving as expected.

## Next steps

Decide whether I want to test the scratch implementation against scikit-learn or another third-party implementation. **At this point I will build on the "scratch implementation" and extend it to predicting a continuous output variable.**

1. Run through a notebook that continues the matplotlib visualization workflow on the Iris data
2. Extend knn_base with functions that predict continuous variables from datasets

## 1. import module from local filepath; load local data file

Also clean up the dataset to prepare it for the functions under test.

In [1]:
import csv
with open('data/iris_data.csv', 'r') as f:
    lines = csv.reader(f)
    dataset = list(lines)

In [2]:
dataset[1]

['4.9', '3.0', '1.4', '0.2', 'Iris-setosa']

In [3]:
len(dataset)

150

In [4]:
from collections import Counter

In [5]:
Counter([row[4] for row in dataset])

Counter({'Iris-setosa': 50, 'Iris-versicolor': 50, 'Iris-virginica': 50})

In [6]:
len([i for row in dataset for i in row])

750

In [7]:
Counter([type(i) for row in dataset for i in row])

Counter({str: 750})

In [8]:
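# Convert the four measurement columns to float; keep the species label as a string.
cleaned_dataset = []
for row in dataset:
    new_row = []
    for i in range(len(row)):
        if i < 4:
            new_row.append(float(row[i]))  # numeric feature
        else:
            new_row.append(row[i])  # category label
    cleaned_dataset.append(new_row)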

In [9]:
print(cleaned_dataset[1])
print(cleaned_dataset[2])

[4.9, 3.0, 1.4, 0.2, 'Iris-setosa']
[4.7, 3.2, 1.3, 0.2, 'Iris-setosa']


## 2. test euclidean_distance base function with example rows from data

In [10]:
from knn_base import euclidean_distance

In [11]:
euclidean_distance(cleaned_dataset[1], cleaned_dataset[2], 3)

0.30000000000000016
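
The source of knn_base is not shown in this notebook, so the following is only a minimal sketch of what a function with this call signature might do, assuming the third argument is the number of leading values in each row to compare. The name `euclidean_distance_sketch` is hypothetical and is not the knn_base implementation.

```python
import math


def euclidean_distance_sketch(row_a, row_b, length):
    """Euclidean distance over the first `length` values of two rows.

    Hypothetical stand-in for knn_base.euclidean_distance; the real
    function may differ.
    """
    squared_diffs = ((row_a[i] - row_b[i]) ** 2 for i in range(length))
    return math.sqrt(sum(squared_diffs))


# The same two rows used in the cell above, compared on the first 3 values.
row_1 = [4.9, 3.0, 1.4, 0.2, 'Iris-setosa']
row_2 = [4.7, 3.2, 1.3, 0.2, 'Iris-setosa']
distance = euclidean_distance_sketch(row_1, row_2, 3)
print(distance)
```

Under that assumption the result is approximately 0.3, in line with the output above; the trailing digits come from floating-point rounding.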

## 3. test find_k_neighbors base function with fake query against cleaned dataset

In [12]:
from knn_base import find_k_neighbors

In [13]:
find_k_neighbors([4.5, 3, 1.6, 0.3], cleaned_dataset, 7, 4)

[[4.6, 3.1, 1.5, 0.2, 'Iris-setosa'],
 [4.4, 2.9, 1.4, 0.2, 'Iris-setosa'],
 [4.7, 3.2, 1.6, 0.2, 'Iris-setosa'],
 [4.6, 3.2, 1.4, 0.2, 'Iris-setosa'],
 [4.8, 3.1, 1.6, 0.2, 'Iris-setosa'],
 [4.4, 3.0, 1.3, 0.2, 'Iris-setosa'],
 [4.8, 3.0, 1.4, 0.3, 'Iris-setosa']]
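
For reference, one plausible way to get this behavior is to rank every row by its distance to the query and keep the first k. The sketch below assumes the argument order (query, dataset, k, vector_length) seen in the call above; the `_sketch` names and the toy rows are hypothetical, not taken from knn_base.

```python
import math


def distance_sketch(row_a, row_b, length):
    """Euclidean distance over the first `length` values (assumed metric)."""
    return math.sqrt(sum((row_a[i] - row_b[i]) ** 2 for i in range(length)))


def find_k_neighbors_sketch(query, dataset, k, vector_length):
    """Return the k rows of `dataset` closest to `query`.

    Hypothetical stand-in for knn_base.find_k_neighbors.
    """
    ranked = sorted(
        dataset,
        key=lambda row: distance_sketch(query, row, vector_length),
    )
    return ranked[:k]


# Toy rows: two numeric features followed by a label.
toy_rows = [
    [1.0, 1.0, 'a'],
    [5.0, 5.0, 'b'],
    [1.2, 0.9, 'a'],
    [4.8, 5.1, 'b'],
]
nearest = find_k_neighbors_sketch([1.1, 1.0], toy_rows, 2, 2)
print(nearest)
```

Sorting the whole dataset costs O(n log n) per query, which is fine for 150 rows; a heap-based selection would be the usual alternative for larger data.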

## 4. test predict_category_from_knn 

In [14]:
from knn_base import predict_category_from_knn
from knn_base import calc_category_frequency

In [15]:
calc_category_frequency(cleaned_dataset,vector_length=4)

Counter({'Iris-setosa': 50, 'Iris-versicolor': 50, 'Iris-virginica': 50})

In [16]:
predict_category_from_knn([4.5, 3, 1.6, 0.3], cleaned_dataset, 7, 4)

[['Iris-setosa', 7]]
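
The output above reads as "all 7 nearest neighbors are Iris-setosa". A minimal majority-vote sketch that produces output of the same shape is shown below, assuming the label sits at index `vector_length` in each row. The `_sketch` names and toy rows are hypothetical, and the real predict_category_from_knn may handle ties or ordering differently.

```python
import math
from collections import Counter


def find_k_neighbors_sketch(query, dataset, k, vector_length):
    """Return the k rows closest to `query` by Euclidean distance (assumed)."""
    def dist(row):
        return math.sqrt(
            sum((query[i] - row[i]) ** 2 for i in range(vector_length))
        )
    return sorted(dataset, key=dist)[:k]


def predict_category_sketch(query, dataset, k, vector_length):
    """Majority vote over the labels of the k nearest rows.

    Returns a list holding one [label, count] pair, or an empty list
    when there are no neighbors. Hypothetical stand-in only.
    """
    neighbors = find_k_neighbors_sketch(query, dataset, k, vector_length)
    votes = Counter(row[vector_length] for row in neighbors)
    return [[label, count] for label, count in votes.most_common(1)]


# Toy rows: two numeric features followed by a label.
toy_rows = [
    [1.0, 1.0, 'a'],
    [1.2, 0.9, 'a'],
    [0.9, 1.1, 'a'],
    [5.0, 5.0, 'b'],
    [4.8, 5.1, 'b'],
]
prediction = predict_category_sketch([1.0, 1.0], toy_rows, 3, 2)
print(prediction)
```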

## Looks like everything is in working order here

The obvious next step seems to be: how can I extend this to predict a continuous value?
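
One common way to do this, sketched here as an assumption rather than a decided design, is to replace the majority vote with the mean of the neighbors' target values. The function names, the position of the target at index `vector_length`, and the toy rows are all hypothetical.

```python
import math


def find_k_neighbors_sketch(query, dataset, k, vector_length):
    """Return the k rows closest to `query` by Euclidean distance (assumed)."""
    def dist(row):
        return math.sqrt(
            sum((query[i] - row[i]) ** 2 for i in range(vector_length))
        )
    return sorted(dataset, key=dist)[:k]


def predict_value_sketch(query, dataset, k, vector_length):
    """Predict a continuous target as the mean over the k nearest rows.

    Assumes each row stores its numeric target at index `vector_length`.
    """
    neighbors = find_k_neighbors_sketch(query, dataset, k, vector_length)
    if not neighbors:
        raise ValueError("need at least one neighbor to make a prediction")
    targets = [row[vector_length] for row in neighbors]
    return sum(targets) / float(len(targets))


# Toy rows: two numeric features followed by a numeric target.
toy_rows = [
    [1.0, 1.0, 10.0],
    [1.2, 0.9, 12.0],
    [5.0, 5.0, 50.0],
    [4.8, 5.1, 48.0],
]
predicted = predict_value_sketch([1.1, 1.0], toy_rows, 2, 2)
print(predicted)
```

With the Iris data, this could be tried by treating one measurement column as the target and the remaining columns as features; weighting neighbors by inverse distance is a frequent refinement.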