# Setup
I use Visual Studio Code with Jupyter Notebook and run everything locally on my computer.

The libraries I used:
- pandas for data handling
- matplotlib and seaborn for visualizations
- scikit-learn for splitting data into training and test sets, building the model, and calculating metrics

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

print("Libraries loaded successfully")

# Data Preparation
I use the Iris dataset from D2L. It is a CSV file with 150 samples (50 samples for each flower species).

In this step, I load the Iris CSV file and make sure the column names match the expected features and target.

In [None]:
# Load the Iris dataset from CSV
iris = pd.read_csv('dataset/Iris.csv')

# Display the first 5 rows of the dataset to make sure it's loaded correctly
print(iris.head(5))

# Data Exploration

## Display basic statistics
Because SepalLengthCm, SepalWidthCm, PetalLengthCm, and PetalWidthCm are numeric features, I print the min, max, mean, median, and standard deviation to see their typical values, their range, and how much they vary.

For the species, I count each class to see how many samples are in each group.

### SepalLengthCm statistics

In [None]:
print("Min:", iris["SepalLengthCm"].min())
print("Max:", iris["SepalLengthCm"].max())
print("Mean:", iris["SepalLengthCm"].mean())
print("Median:", iris["SepalLengthCm"].median())
print("Standard Deviation:", iris["SepalLengthCm"].std())

### SepalWidthCm statistics

In [None]:
print("Min:", iris["SepalWidthCm"].min())
print("Max:", iris["SepalWidthCm"].max())
print("Mean:", iris["SepalWidthCm"].mean())
print("Median:", iris["SepalWidthCm"].median())
print("Standard Deviation:", iris["SepalWidthCm"].std())

### PetalLengthCm statistics

In [None]:
print("Min:", iris["PetalLengthCm"].min())
print("Max:", iris["PetalLengthCm"].max())
print("Mean:", iris["PetalLengthCm"].mean())
print("Median:", iris["PetalLengthCm"].median())
print("Standard Deviation:", iris["PetalLengthCm"].std())

### PetalWidthCm statistics

In [None]:
print("Min:", iris["PetalWidthCm"].min())
print("Max:", iris["PetalWidthCm"].max())
print("Mean:", iris["PetalWidthCm"].mean())
print("Median:", iris["PetalWidthCm"].median())
print("Standard Deviation:", iris["PetalWidthCm"].std())
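The four cells above repeat the same five calls for each feature. As a minimal sketch, the same statistics can be collected in one table with `DataFrame.agg`. To stay self-contained, this example assumes scikit-learn's bundled copy of Iris (loaded with `load_iris`) and renames its columns to match the CSV. The names `iris_demo`, `numeric_cols`, and `summary` are only for illustration, and the bundled values may differ slightly from the CSV file used in this notebook.

```python
import pandas as pd
from sklearn.datasets import load_iris

# Assumption: use scikit-learn's bundled Iris data instead of dataset/Iris.csv
# and rename the columns so they match the CSV column names.
raw = load_iris(as_frame=True).frame
iris_demo = raw.rename(columns={
    "sepal length (cm)": "SepalLengthCm",
    "sepal width (cm)": "SepalWidthCm",
    "petal length (cm)": "PetalLengthCm",
    "petal width (cm)": "PetalWidthCm",
})

numeric_cols = ["SepalLengthCm", "SepalWidthCm", "PetalLengthCm", "PetalWidthCm"]

# One row per statistic, one column per feature
summary = iris_demo[numeric_cols].agg(["min", "max", "mean", "median", "std"])
print(summary.round(2))
```

With the notebook's own data, the same call would be `iris[numeric_cols].agg([...])`.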

### Species statistics

In [None]:
print(iris['Species'].value_counts())

## Visualizations

### Histograms of features
Because SepalLengthCm, SepalWidthCm, PetalLengthCm, and PetalWidthCm are numeric features, I visualize them by plotting histograms.

In [None]:
iris[['SepalLengthCm','SepalWidthCm','PetalLengthCm','PetalWidthCm']].hist(bins=15, figsize=(10, 6))
plt.suptitle("Histograms of features")
plt.show()

### Sepal length vs Sepal width
The diagram compares sepal length and sepal width, colored by species.

In [None]:
sns.scatterplot(data=iris, x='SepalLengthCm', y='SepalWidthCm', hue='Species')
plt.title("Sepal Length vs Sepal Width by Species")
plt.show()

### Petal length vs Petal width
The diagram compares petal length and petal width, colored by species.

In [None]:
sns.scatterplot(data=iris, x='PetalLengthCm', y='PetalWidthCm', hue='Species')
plt.title("Petal Length vs Petal Width by Species")
plt.show()

# Data Preprocessing

### See missing values

In [None]:
print(iris.isnull().sum())

### Drop rows with missing values

In [None]:
iris = iris.dropna()

### Convert Species into numbers because the model needs numeric input

Iris-setosa: 0

Iris-versicolor: 1

Iris-virginica: 2

In [None]:
target_map = {'Iris-setosa': 0, 'Iris-versicolor': 1, 'Iris-virginica': 2}
iris['Species_num'] = iris['Species'].map(target_map)
print(iris.head(150)) 
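One thing to watch with `Series.map` and a dictionary: any label that is not a key in the dictionary becomes `NaN` instead of raising an error. The sketch below shows a quick check for that. The mini-frame `demo` is hypothetical, and its last label is misspelled on purpose.

```python
import pandas as pd

# Hypothetical mini-frame; the last label is misspelled on purpose
demo = pd.DataFrame({
    "Species": ["Iris-setosa", "Iris-versicolor", "Iris-virginica", "Iris-virginca"]
})

target_map = {"Iris-setosa": 0, "Iris-versicolor": 1, "Iris-virginica": 2}
demo["Species_num"] = demo["Species"].map(target_map)

# Rows whose label was not found in target_map end up as NaN
unmapped = demo[demo["Species_num"].isna()]
print(unmapped)
```

If `unmapped` is empty on the real data, every species name was converted.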

### Split the data into training and testing sets
I use 80% of the data for training and 20% for testing because I want to give the model enough data to learn patterns while keeping 20% to check how well it performs on unseen data.

In [None]:

features = iris[['SepalLengthCm','SepalWidthCm','PetalLengthCm','PetalWidthCm']]
target = iris['Species_num']
features_train, features_test, target_train, target_test = train_test_split(features, target, test_size=0.2, random_state=0)
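A random 80/20 split does not guarantee that the three species appear equally often in the test set. As a sketch of one possible variation (not what this notebook uses), `train_test_split` accepts a `stratify` argument that keeps the class proportions the same in both parts. The example assumes scikit-learn's bundled Iris data so it can run on its own.

```python
from collections import Counter

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Assumption: bundled Iris data instead of the CSV file
X, y = load_iris(return_X_y=True)

# stratify=y keeps the 50/50/50 class balance in both train and test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

test_counts = Counter(y_test.tolist())
print(test_counts)
```

With 50 samples per class and a 30-sample test set, each class contributes 10 test samples.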

# Model Building
I chose K-Nearest Neighbors because it is easy to understand, classifies samples based on similarity, and works well for small datasets like Iris.

I set `n_neighbors = 5` when defining the KNN model because it is more stable than `3` or `1` (too small). The model looks at more neighbors before making a decision, which reduces the effect of noise while still keeping good accuracy.

I will also do a hyperparameter tuning step later to see how different values of k affect performance.

In [None]:
knn = KNeighborsClassifier(n_neighbors = 5)
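To make the idea of "looking at more neighbors" concrete, here is a minimal from-scratch sketch of KNN voting. It is only an illustration and not how scikit-learn implements `KNeighborsClassifier`. The function `knn_predict` and the toy points are hypothetical. One point of class 1 is placed inside the class 0 cluster to act as noise.

```python
from collections import Counter

import numpy as np

def knn_predict(train_X, train_y, query, k):
    """Predict the label of `query` by majority vote of its k nearest points."""
    # Euclidean distance from the query to every training point
    distances = np.linalg.norm(train_X - query, axis=1)
    # Indices of the k closest training points
    nearest = np.argsort(distances)[:k]
    # Majority vote among their labels
    votes = Counter(train_y[nearest].tolist())
    return votes.most_common(1)[0][0]

# Hypothetical toy data: three points of class 0, two of class 1,
# plus one noisy class-1 point (1.6, 1.6) sitting near the class-0 cluster
train_X = np.array([[1.0, 1.0], [1.2, 0.9], [0.9, 1.1],
                    [3.0, 3.0], [3.1, 2.9], [1.6, 1.6]])
train_y = np.array([0, 0, 0, 1, 1, 1])
query = np.array([1.5, 1.5])

for k in (1, 3, 5):
    print("k =", k, "->", knn_predict(train_X, train_y, query, k))
```

With `k = 1` the prediction follows the single noisy neighbor (class 1), while with `k = 3` or `k = 5` the surrounding class 0 points outvote it.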

# Model Training

Train the model using the training dataset and then predict on the test set.

In [None]:
knn.fit(features_train, target_train)

target_prediction = knn.predict(features_test)

# Evaluation

Accuracy shows overall performance

Confusion matrix and classification report show how well the model did for each flower class

### Accuracy

In [None]:
accuracy = accuracy_score(target_test, target_prediction)
print(accuracy)

### Confusion matrix

In [None]:
conf_matrix = confusion_matrix(target_test, target_prediction)
print(conf_matrix)
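The printed matrix has no labels, so it helps to remember that in scikit-learn the rows are the true classes and the columns are the predicted classes. A small sketch of how the matrix could be wrapped in a labeled DataFrame is shown here. The `y_true` and `y_pred` lists are made-up values for illustration, not results from this notebook.

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

# Hypothetical labels, only to show the layout
y_true = [0, 0, 1, 1, 2, 2, 2]
y_pred = [0, 0, 1, 2, 2, 2, 1]
names = ["Iris-setosa", "Iris-versicolor", "Iris-virginica"]

cm = confusion_matrix(y_true, y_pred, labels=[0, 1, 2])

# Rows = true class, columns = predicted class
cm_df = pd.DataFrame(
    cm,
    index=[f"true {n}" for n in names],
    columns=[f"pred {n}" for n in names],
)
print(cm_df)
```

Values on the diagonal are correct predictions, and anything off the diagonal is a mistake between two specific species.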

### Classification report

In [None]:
class_report = classification_report(target_test, target_prediction, target_names=['Iris-setosa', 'Iris-versicolor', 'Iris-virginica'])
print(class_report)

# Hyperparameter Tuning

I chose to tune the `n_neighbors` hyperparameter in KNN because it decides how many nearby points the model looks at to make a prediction. That makes it one of the most important parameters of the model.

I will test small `k` (`1` and `3`) and large `k` (`15` and `20`) to see what happens.

In [None]:
for k in [1, 3, 15, 20]:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(features_train, target_train)
    target_prediction = knn.predict(features_test)

    accuracy = accuracy_score(target_test, target_prediction)
    conf_matrix = confusion_matrix(target_test, target_prediction)
    class_report = classification_report(target_test, target_prediction, target_names=['Iris-setosa', 'Iris-versicolor', 'Iris-virginica'])

    print("With k =",k)
    print("")
    print("Accuracy =", accuracy)
    print("Confusion matrix")
    print(conf_matrix)
    print("Classification report")
    print(class_report)
    print("")

After experimenting with 5 different values of `k` (`1`, `3`, `5`, `15`, `20`), I see these results:

k = 1: The model works well because the Iris data is very clean, so the closest neighbor of each test sample has the correct label.

k = 3, 5: The model is still very good, but it makes a small mistake because the 3 or 5 nearest neighbors sometimes disagree.

k = 15, 20: The model looks at more neighbors, and it is still perfect on the test set because the Iris dataset is well separated.
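These comparisons come from a single test split of 30 samples, so one or two predictions can change the ranking of the `k` values. A possible way to check this, sketched below under the assumption that scikit-learn's bundled Iris data stands in for the CSV, is 5-fold cross-validation with `cross_val_score`. The dictionary `cv_means` is an illustrative name, and the scores it produces are not the ones reported in this notebook.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Assumption: bundled Iris data instead of the CSV file
X, y = load_iris(return_X_y=True)

cv_means = {}
for k in [1, 3, 5, 15, 20]:
    # Each k is scored on 5 different train/test folds instead of one split
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5)
    cv_means[k] = scores.mean()
    print(f"k = {k}: mean accuracy = {cv_means[k]:.3f}")
```

Averaging over several folds gives a steadier picture of how each `k` behaves than one split alone.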

In this assignment, I learned many machine learning concepts, from data preparation and preprocessing to model selection, training, evaluation, and hyperparameter tuning. It also helped me practice using Jupyter Notebook.

I put my thinking and explanations into each step.

Here is what I found after using the Iris dataset and K-Nearest Neighbors in this assignment:

The Iris dataset is clean and well structured. After cleaning the data, the KNN model gave very high accuracy (97%–100%). The confusion matrix showed only a few small mistakes. I tried different values of `k` and found that `k = 3` and `k = 5` performed slightly worse than `k = 1` or larger values like `15` and `20`.

However, because the Iris dataset is small (150 samples), these results cannot show how the model would work on larger and more complex data. With more data, the best value of `k` might change, and the accuracy might not stay this high.
