## PROBLEM DESCRIPTION

We consider three species of iris flower: setosa, versicolor, and virginica. The species differ in the measurements of their petals and sepals.

<img src="iris.png" width="500"/>

For each of 150 iris flowers, we have the following data:
* Sepal length in cm
* Sepal width in cm
* Petal length in cm
* Petal width in cm
* Species: setosa, versicolor, virginica

These data can be found in the files `iris_features.csv` and `iris_labels.csv`. CSV stands for "comma-separated values": the data is stored as plain text, with each value separated from the next by a comma.

Part of the data is shown below:

<img src="data.png" width="350"/>

The figure below shows a scatter plot of the iris flowers based on sepal length and sepal width, colored by species.

<img src="plot.png" width="500"/>

## EXERCISE GOAL

Suppose we have a new iris flower with the following measurements:
* Sepal length: 5.0 cm
* Sepal width: 2.5 cm

This new flower is represented by the black point shown in the figure below:

<center><img src="punt.png" width="750"/></center>

Using the techniques we discussed in the previous two sessions, we want to predict the species of this new iris flower based on the existing data.

## WORKFLOW

1. Import the features and labels from the CSV files.
    * Skip the header row.
    * Remove newline characters.
    * Split the feature values by commas.
    * Convert the feature values to floats.
2. The formula for Euclidean distance between two points ($x_1$, $y_1$) and ($x_2$, $y_2$) is:

    $$
    d = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2}
    $$

    Calculate the Euclidean distance between the new flower (its given sepal length and width) and every flower in the dataset. Save these distances in a list `distances`.
3. Find the smallest distance and return the species of that flower as the predicted species.

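As a worked example of the formula in step 2, take the new flower from the exercise goal, (5.0, 2.5), and a hypothetical dataset flower with sepal length 5.1 cm and sepal width 3.5 cm. These second values are chosen for illustration only and are not read from the files.

$$
d = \sqrt{(5.1 - 5.0)^2 + (3.5 - 2.5)^2} = \sqrt{0.01 + 1.00} = \sqrt{1.01} \approx 1.005
$$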
## SOLUTION

1. Import the features and labels from the CSV files. Skip the header row.

In [None]:
# (a) Read iris features from iris_features.csv


# Skip the header


# Process each line to convert it into a list of floats


# (b) Read iris labels from iris_labels.csv


# Skip the header


# Process each label to remove any trailing newline characters

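One possible way to fill in this cell is sketched below. The helper names `parse_features` and `parse_labels` are hypothetical, and the header names and sample rows are stand-ins so that the sketch runs without the CSV files. It assumes each file has exactly one header row.

```python
def parse_features(lines):
    """Convert CSV lines (header first) into a list of lists of floats."""
    rows = []
    for line in list(lines)[1:]:      # skip the header row
        line = line.strip()           # remove the newline character
        if line:                      # ignore empty lines
            rows.append([float(value) for value in line.split(",")])
    return rows


def parse_labels(lines):
    """Convert label lines (header first) into a list of strings."""
    return [line.strip() for line in list(lines)[1:] if line.strip()]


# Stand-in lines (assumed layout), used here instead of the real files
sample_feature_lines = [
    "sepal_length,sepal_width,petal_length,petal_width\n",
    "5.1,3.5,1.4,0.2\n",
    "5.5,2.3,4.0,1.3\n",
]
sample_label_lines = ["species\n", "setosa\n", "versicolor\n"]

features = parse_features(sample_feature_lines)
labels = parse_labels(sample_label_lines)
print(features)
print(labels)
```

Because an open file can be iterated line by line, the same helpers should also work on the real data, for example `with open("iris_features.csv") as f: features = parse_features(f)`.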

2. Calculate the Euclidean distance between the new flower (its given sepal length and width) and every flower in the dataset. Save these distances in a list `distances`.

In [None]:
# Define a sepal length and width
sep_len = 5.0
sep_wid = 2.5

# Calculate the Euclidean distances
# (a) create an empty list distances
distances = []
# (b) loop over each row in features

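The loop could be completed along the following lines. This is a minimal sketch: the three rows in `features` are stand-ins so the code runs on its own, and it assumes that sepal length is the first column and sepal width the second.

```python
import math

# Stand-in rows; in the notebook, `features` comes from step 1.
# Assumed column order: sepal length, sepal width, petal length, petal width.
features = [
    [5.1, 3.5, 1.4, 0.2],
    [5.5, 2.3, 4.0, 1.3],
    [6.3, 3.3, 6.0, 2.5],
]

sep_len = 5.0
sep_wid = 2.5

distances = []
for row in features:
    # Euclidean distance using only the first two columns
    d = math.sqrt((row[0] - sep_len) ** 2 + (row[1] - sep_wid) ** 2)
    distances.append(d)

print(distances)
```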

3. Find the smallest distance and return the species of that flower as the predicted species.

In [None]:
# Find the nearest neighbor, i.e. the minimum distance and its index

print("Minimum distance:")
print("Index of nearest neighbor:")

# Final step: predict the species of the new iris sample based on the nearest neighbor


# Print the predicted species
print("Predicted species:")

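A possible completion of step 3 is sketched here. The values in `distances` and `labels` are stand-ins for illustration, so the printed species says nothing about the real dataset. The variable names `min_distance`, `min_index`, and `predicted_species` are assumptions.

```python
# Stand-in values; in the notebook these come from steps 1 and 2
distances = [1.005, 0.539, 1.526]
labels = ["setosa", "versicolor", "virginica"]

# Smallest distance and the position where it occurs
min_distance = min(distances)
min_index = distances.index(min_distance)

# The label at the same position belongs to the nearest flower
predicted_species = labels[min_index]

print("Minimum distance:", min_distance)
print("Index of nearest neighbor:", min_index)
print("Predicted species:", predicted_species)
```

Note that `list.index` returns the first match, so if two flowers are equally close, this sketch picks the one that appears earlier in the data.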
## SOLUTION WITH NUMPY

Using NumPy, we can replace the explicit loop with array operations, which shortens the code and typically runs faster.

1. Import the features and labels from the CSV files.

In [None]:
import numpy as np

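One way to load the data is `np.loadtxt`, sketched below. To keep the example runnable without the CSV files, it reads from in-memory `io.StringIO` objects whose contents are stand-ins. The header names and the single-column layout of the labels file are assumptions.

```python
import io

import numpy as np

# In-memory stand-ins for the two files (assumed layout)
features_file = io.StringIO(
    "sepal_length,sepal_width,petal_length,petal_width\n"
    "5.1,3.5,1.4,0.2\n"
    "5.5,2.3,4.0,1.3\n"
)
labels_file = io.StringIO("species\nsetosa\nversicolor\n")

# skiprows=1 skips the header; delimiter="," splits the values
features = np.loadtxt(features_file, delimiter=",", skiprows=1)
labels = np.loadtxt(labels_file, dtype=str, skiprows=1)

print(features.shape)
print(labels)
```

With the real data, the file names `"iris_features.csv"` and `"iris_labels.csv"` can be passed in place of the `StringIO` objects.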

2. The formula for Euclidean distance between two points ($x_1$, $y_1$) and ($x_2$, $y_2$) is:

    $$
    d = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2}
    $$

    Calculate the Euclidean distance between the new flower (its given sepal length and width) and every flower in the dataset.


3. Find the smallest distance and return the species of that flower as the predicted species.

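Steps 2 and 3 might look as follows with NumPy. This is a sketch under assumptions: `features` and `labels` hold stand-in values, the first two columns are taken to be sepal length and sepal width, and the names `new_flower`, `nearest`, and `predicted_species` are illustrative.

```python
import numpy as np

# Stand-in arrays; in the notebook these come from step 1
features = np.array([
    [5.1, 3.5, 1.4, 0.2],
    [5.5, 2.3, 4.0, 1.3],
    [6.3, 3.3, 6.0, 2.5],
])
labels = np.array(["setosa", "versicolor", "virginica"])

new_flower = np.array([5.0, 2.5])  # sepal length, sepal width

# Broadcasting subtracts new_flower from every row at once
differences = features[:, :2] - new_flower
distances = np.sqrt((differences ** 2).sum(axis=1))

# argmin returns the index of the smallest distance
nearest = np.argmin(distances)
predicted_species = labels[nearest]

print("Minimum distance:", distances[nearest])
print("Predicted species:", predicted_species)
```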
### Extension
Suppose we want to use all four features (sepal length, sepal width, petal length, petal width) to calculate the Euclidean distance. Only the distance calculation step needs to change.

The formula for Euclidean distance between two points ($x_1$, $y_1$, $z_1$, $w_1$) and ($x_2$, $y_2$, $z_2$, $w_2$) becomes:

$$
d = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2 + (z_1 - z_2)^2 + (w_1 - w_2)^2}
$$

We can implement this in the code by considering all four features when calculating the distances. The rest of the code remains the same.


In [None]:
sep_len = 5.0
sep_wid = 2.5
pet_len = 3.5
pet_wid = 1.0

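A sketch of the four-feature version follows. The rows in `features` and `labels` are stand-ins, and the column order (sepal length, sepal width, petal length, petal width) is an assumption. Compared with the two-feature case, the only difference is that all columns enter the subtraction.

```python
import numpy as np

# Stand-in arrays; in the notebook these come from step 1
features = np.array([
    [5.1, 3.5, 1.4, 0.2],
    [5.5, 2.3, 4.0, 1.3],
    [6.3, 3.3, 6.0, 2.5],
])
labels = np.array(["setosa", "versicolor", "virginica"])

sep_len = 5.0
sep_wid = 2.5
pet_len = 3.5
pet_wid = 1.0
new_flower = np.array([sep_len, sep_wid, pet_len, pet_wid])

# Sum the squared differences over all four columns
distances = np.sqrt(((features - new_flower) ** 2).sum(axis=1))

nearest = np.argmin(distances)
print("Predicted species:", labels[nearest])
```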