# Project 4 Part B:  Classification

## Part 1: Exploring the dataset
We will first take a look at a subset of the provided data. Our data is stored in a CSV (comma-separated values) file. Pandas makes it easy to load and preview datasets in this format. _Make sure that the data file is in the same directory as this notebook._
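
As a minimal sketch of what loading and previewing looks like, the example below reads a tiny CSV from an in-memory string instead of a file, so it runs without `iris_dataset.csv`. The two data rows and the name `df_toy` are made up for illustration; only the column names mirror the project's dataset.

```python
import io

import pandas as pd

# Illustrative CSV text; these rows are not the contents of iris_dataset.csv
csv_text = """sepal_length,sepal_width,petal_area,species
5.9,3.0,9.2,virginica
6.4,2.9,5.6,versicolor
"""

# pd.read_csv accepts a file path or any file-like object
df_toy = pd.read_csv(io.StringIO(csv_text))

print(df_toy.head())   # preview the first rows
print(df_toy.dtypes)   # column types inferred by pandas
```

With a real file, the `io.StringIO(...)` argument would simply be replaced by the path to the CSV.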

In [2]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

from mpl_toolkits.mplot3d import Axes3D

%matplotlib qt

In [3]:
dataset_path = "iris_dataset.csv"
# TODO 1: Read the data file with function pd.read_csv() and display the resulting DataFrame. 
#         See Lecture 17 notebook for an example.  
df=pd.read_csv(dataset_path)
display(df)

Unnamed: 0,sepal_length,sepal_width,petal_area,species
0,3.0,2.9,-0.9,versicolor
1,5.9,3.0,9.2,virginica
2,6.4,2.9,5.6,versicolor
3,6.8,2.8,6.7,versicolor
4,7.6,3.0,13.9,virginica
...,...,...,...,...
179,6.9,3.1,7.4,versicolor
180,-0.8,2.7,8.4,virginica
181,4.9,2.5,7.7,virginica
182,5.5,4.2,0.3,setosa


In [4]:
# TODO 2: Filter out the data points with any negative values,
#         name the resulting DataFrame `fdf`, and
#         display fdf.

fdf=df[(df['sepal_length']>=0)&(df['sepal_width']>=0)&(df['petal_area']>=0)]
display(fdf)

Unnamed: 0,sepal_length,sepal_width,petal_area,species
1,5.9,3.0,9.2,virginica
2,6.4,2.9,5.6,versicolor
3,6.8,2.8,6.7,versicolor
4,7.6,3.0,13.9,virginica
6,6.9,3.1,11.3,virginica
...,...,...,...,...
178,5.5,3.5,0.3,setosa
179,6.9,3.1,7.4,versicolor
181,4.9,2.5,7.7,virginica
182,5.5,4.2,0.3,setosa


### TODO 3: Answer the questions by writing text in this cell

1. What is the shape of the **original** dataset (i.e., how many rows and columns of data are there)?
183 rows, 4 columns
2. After filtering, what is the shape?
149 rows, 4 columns
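
One way to double-check answers like these is to read the `.shape` attribute before and after filtering. The sketch below uses a small made-up table (not the iris data), and `select_dtypes` is just one possible way to keep the rows whose numeric columns are all non-negative; the names `toy`, `numeric`, and `filtered` are hypothetical.

```python
import pandas as pd

# Toy data for illustration only; these values are not from iris_dataset.csv
toy = pd.DataFrame({
    "sepal_length": [5.9, -0.8, 6.4],
    "sepal_width": [3.0, 2.7, 2.9],
    "petal_area": [9.2, 8.4, -0.9],
    "species": ["virginica", "virginica", "versicolor"],
})

# Keep only the rows whose numeric columns are all non-negative
numeric = toy.select_dtypes(include="number")
filtered = toy[(numeric >= 0).all(axis=1)]

print(toy.shape)       # (rows, columns) before filtering
print(filtered.shape)  # (rows, columns) after filtering
```

Compared with listing every column by hand, this form keeps working if more numeric columns are added later.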

## Part 2: Visualization

Now that the dataset has been pre-processed, we convert the _filtered_ data to a **NumPy array** and assign it to the variable `data`.  This will not be a numeric 2-d array like those we have seen in the past, because the last column of the data contains _strings_ and not numbers.  NumPy automatically chooses the type `object`, which can accommodate both numbers and strings, for the array.  For our purpose of creating visualizations and later performing a prediction, the type does not matter.  Just know that for any row `r`, the values in `data[r][0]`, `data[r][1]`, and `data[r][2]` are each a number while `data[r][3]` is a string.

Run the code box below and use this `data` array for the rest of the project.
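
To see why the array ends up with type `object`, here is a small sketch on a made-up two-row table (the values are illustrative, not taken from the dataset). It also shows one way, assumed here rather than required by the project, to get a purely numeric array by slicing off the label column and converting to `float`.

```python
import numpy as np
import pandas as pd

# Illustrative rows only
toy = pd.DataFrame({
    "sepal_length": [5.9, 6.4],
    "sepal_width": [3.0, 2.9],
    "petal_area": [9.2, 5.6],
    "species": ["virginica", "versicolor"],
})

arr = toy.to_numpy()
print(arr.dtype)        # object, because numbers and strings are mixed
print(type(arr[0][0]))  # a float
print(type(arr[0][3]))  # a str

# Slice off the label column and convert the features to a numeric array
features = arr[:, 0:3].astype(float)
print(features.dtype)   # float64
```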

In [5]:
# Convert filtered data (assuming it is called `fdf`) to a NumPy array called `data`
data = fdf.to_numpy()
print(np.shape(data))

# Print the values of the first row as an example
for k in range(4):
    print(data[0][k])

(149, 4)
5.9
3.0
9.2
virginica


In [6]:
# TODO 4: Complete the following function to return a unique color name for each species
def species_to_color(sp):
    """
    Returns (string) the name of a color for representing a species.  
    
    Parameter sp: (string) the name of the species, one of "setosa", "versicolor", "virginica".
    
    The returned string is one of "blue", "green", "red", "cyan", "magenta", "yellow", or "black".
    Each species should be represented by a different color.
    """
    if sp=="setosa":
        return "blue"
    elif sp=="versicolor":
        return "green"
    else:
        return "red"

In [7]:
# TODO 5: Create a 2D scatter plot of the filtered data with 
#         sepal_length on the x-axis and sepal_width on the y-axis.
plt.close('all')        # close all currently open figure windows
plt.figure()            # creates a new figure window
ax= plt.gca()           # gets current axes
xpos=0 # running maximum of the x values, used below to position the legend text
ypos=0 # running maximum of the y values, used the same way
rows,cols=np.shape(data)
plt.xlabel('sepal length (cm)')
plt.ylabel('sepal width (cm)')
for i in range(rows):
    xdat=data[i][0]
    if xdat>xpos:
        xpos=xdat
    ydat=data[i][1]
    if ydat>ypos:
        ypos=ydat
    plt.scatter(xdat, ydat, c = species_to_color(data[i][3]))
for species_name in ["setosa" , "versicolor" , "virginica" ]:
    plt.text(xpos -1, ypos, species_name, c = species_to_color(species_name))
    ypos = ypos - .1

In [8]:
# TODO 6: Create a 3D scatter plot of sepal_length, sepal_width, and petal_area

# Create a figure with axes for a 3D plot
fig = plt.figure()
ax = plt.axes(projection="3d")

# Add your code below
ax.set_xlabel('sepal length (cm)')
ax.set_ylabel('sepal width (cm)')
ax.set_zlabel('petal area (cm^2)')
for i in range(rows):
    xdat=data[i][0]
    ydat=data[i][1]
    zdat=data[i][2]
    ax.scatter(xdat, ydat,zdat, c = species_to_color(data[i][3]))

## Part 3: Performing predictions

In [9]:
# TODO 7: Complete the function below
def get_averages(dataset, target_species):
    """
    Returns the averages of sepal length, sepal width, and petal area for a target species
    
    Parameters:
        dataset: the 2-d array storing the sepal length, sepal width, petal area, and species
        
        target_species: (string) the target species whose average feature values are to be 
            calculated.  One of "setosa", "versicolor", "virginica".
    
    Returns a tuple (average_sepal_length, average_sepal_width, average_petal_area) 
        for the target species 
    """
    sepal_length = dataset[:, 0]  
    sepal_width = dataset[:, 1] 
    petal_area = dataset[:, 2] 
    species = dataset[:, -1]
    
    ### Do NOT modify the code above.  Add your code BELOW.
    sepal_length_sum=0
    sepal_width_sum=0
    petal_area_sum=0
    count=0  # number of rows that belong to target_species
    for i in range(len(species)):
        # Only accumulate the rows whose label matches the target species
        if species[i]==target_species:
            sepal_length_sum+=sepal_length[i]
            sepal_width_sum+=sepal_width[i]
            petal_area_sum+=petal_area[i]
            count+=1
    average_sepal_length=sepal_length_sum/count
    average_sepal_width=sepal_width_sum/count
    average_petal_area=petal_area_sum/count
    return (average_sepal_length, average_sepal_width, average_petal_area)

In [10]:
# Display the average sepal length, average sepal width, average petal area for each species
print(f"setosa: {get_averages(data, 'setosa')}")
print(f"versicolor: {get_averages(data, 'versicolor')}")
print(f"virginica: {get_averages(data, 'virginica')}")


In [13]:
# TODO 8: Implement the distance function to compute the Euclidean distance
#         between two data points
import math
def dist(p, q):
    """ 
    Returns the Euclidean distance between the point p and the point q
    
    Parameters p, q: each is a length-3 array representing the feature values
        sepal length, sepal width, and petal area
    """
    distance=math.sqrt((p[0]-q[0])**2+(p[1]-q[1])**2+(p[2]-q[2])**2)
    return distance
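
As a quick sanity check for a distance function like the one above, Python's standard library (3.8 and later) provides `math.dist`, which computes the same Euclidean distance. The helper name `dist_manual` and the sample points below are made up for illustration.

```python
import math

def dist_manual(p, q):
    """Euclidean distance between two length-3 points, written out term by term."""
    return math.sqrt((p[0] - q[0])**2 + (p[1] - q[1])**2 + (p[2] - q[2])**2)

# Made-up points whose difference is (3, 4, 0), so the distance is 5
p = [1.0, 2.0, 3.0]
q = [4.0, 6.0, 3.0]

print(dist_manual(p, q))  # 5.0
print(math.dist(p, q))    # 5.0
```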

In [14]:
# TODO 9: Implement the nearest-neighbor algorithm
def nearest_neighbor(newpoint, dataset):
    """
    Returns a tuple of the feature values and label of the point in `dataset` that is
        nearest to `newpoint` 
    
    Parameters:
        newpoint: length 3 array of the feature values (sepal length, sepal width, 
            and petal area) of a point
        
        dataset: the 2-d array storing the sepal length, sepal width, petal area, and species
        
    Returns a tuple (values, label) of the point in `dataset` that is nearest to `newpoint`,
        where values is a length 3 array storing sepal length, sepal width, and petal area, 
        and label is the species (string).
    """
    
    allvalues = dataset[:, 0:-1]  # all feature values of the data in dataset (3 columns)
    labels = dataset[:, -1]       # all the species labels of the data in the dataset

    ### Do NOT modify the code above.  Add your code BELOW.
    # Initialize min_dis, values, and label from the first point in the dataset as a placeholder
    min_dis=dist(newpoint,allvalues[0])
    values=allvalues[0]
    label=labels[0]
    # Loop over the rows of `dataset` itself rather than relying on the global `rows`
    for i in range(len(allvalues)):
        dis=dist(newpoint,allvalues[i])
        if dis<min_dis:
            min_dis=dis
            values=allvalues[i]
            label=labels[i]
    return (values, label)
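
The same search can also be expressed with the built-in `min` and a key function. This is a minimal sketch on a made-up list of labeled points, not the project's dataset; `nearest_neighbor_sketch` and `toy_points` are hypothetical names, and `math.dist` requires Python 3.8 or later.

```python
import math

# Made-up labeled points: (sepal_length, sepal_width, petal_area, species)
toy_points = [
    (5.0, 3.4, 0.3, "setosa"),
    (6.4, 2.9, 5.6, "versicolor"),
    (7.6, 3.0, 13.9, "virginica"),
]

def nearest_neighbor_sketch(newpoint, points):
    """Return (values, label) of the row in `points` closest to `newpoint`."""
    # min() scans every row and keeps the one with the smallest distance
    best = min(points, key=lambda row: math.dist(newpoint, row[:3]))
    return best[:3], best[3]

values, label = nearest_neighbor_sketch([5, 3, 2], toy_points)
print(values, label)
```

If two rows are equally close, `min` keeps the first one it encounters, which matches the behavior of a loop that only updates on a strictly smaller distance.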

In [18]:
# TODO 10: Given a new data point, predict its species by calling your
#   nearest_neighbor function.
#   Print the predicted label (species).
#   OPTIONAL: Additionally draw a 3-d scatter plot of the dataset, the new
#             data point, and a line connecting the new data point to its
#             nearest neighbor in the dataset. Add a title to show the
#             predicted label.

# New data point
test_point = np.array([5,3,2])

### Do NOT modify the code above. 
#   Add code BELOW to predict and print the species of `test_point`.  
(measures,species)=nearest_neighbor(test_point,data)
print(f'The predicted species of the new data point is {species}.')

The predicted species of the new data point is setosa.
