# Assignment 3: Iris Classification 🌸

## 📚 Learning Objectives
- Explore and visualize the **Iris dataset**.
- Train a **K-Nearest Neighbors (KNN)** classifier.
- Evaluate model performance and make predictions.

## Part 1: Data Loading and Exploration (20 marks)

### Q1 (5 marks)
Load the Iris dataset using `sklearn.datasets.load_iris()`. Print the feature names, target names, and the shape of the data.

In [None]:
from sklearn.datasets import load_iris
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Load dataset
iris = load_iris()
X = iris.data
y = iris.target

print("Feature Names:", iris.feature_names)
print("Target Names:", iris.target_names)
print("Data Shape:", X.shape)

### Q2 (5 marks)
**Question:** What type of problem is this (classification or regression)? Why?

**Answer:**
This is a **classification** problem.
The target variable (species) is **categorical**: it takes one of three discrete labels (setosa, versicolor, virginica) rather than a continuous value, so the task is to assign each flower to a class.
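The categorical nature of the target can also be checked directly. The snippet below is a minimal sketch that reloads the dataset on its own; `type_of_target` is a scikit-learn utility, and the variable names `labels`, `counts`, and `target_kind` are chosen here for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.utils.multiclass import type_of_target

iris = load_iris()

# The target holds a small set of integer class labels, not continuous values.
labels, counts = np.unique(iris.target, return_counts=True)
print("Distinct labels:", labels)
print("Samples per label:", counts)

# scikit-learn's own heuristic for what kind of target this is.
target_kind = type_of_target(iris.target)
print("Target type:", target_kind)
```

A target made of a few discrete labels, reported as `multiclass`, is consistent with treating this as classification rather than regression.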

### Q3 (5 marks)
Display the first 5 rows of the data as a pandas DataFrame, using the feature names as column headers.

In [None]:
df = pd.DataFrame(X, columns=iris.feature_names)
df['species'] = iris.target_names[y]  # map integer labels to species names
df.head()

### Q4 (5 marks)
Report the class distribution (counts for each species).

In [None]:
print(df['species'].value_counts())

## Part 2: Visualization (20 marks)

### Q5 (10 marks)
Create a pair plot (scatter plot matrix) for the four features using `seaborn` or `matplotlib`. Color the points by species.

In [None]:
sns.pairplot(df, hue='species', palette='viridis')
plt.show()

### Q6 (10 marks)
**Question:** Based on your plot, which two features appear most useful to separate the classes? Explain in one short sentence.

**Answer:**
**Petal length** and **petal width** appear most useful: setosa forms a fully separate cluster on these two features, and versicolor and virginica overlap only slightly.
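One way to back up the visual impression is to compare the per-species range of the two petal features. This is a rough sketch, not part of the required answer; it rebuilds its own DataFrame, and `df_check`, `petal_cols`, and `petal_ranges` are names introduced here for illustration.

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
df_check = pd.DataFrame(iris.data, columns=iris.feature_names)
df_check["species"] = iris.target_names[iris.target]

# Minimum and maximum of each petal feature within each species.
petal_cols = ["petal length (cm)", "petal width (cm)"]
petal_ranges = df_check.groupby("species")[petal_cols].agg(["min", "max"])
print(petal_ranges)
```

If one species' maximum falls below another's minimum, that feature alone separates the two; where the ranges overlap, some points from the two species share the same values.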

## Part 3: Train-Test Split and KNN Modeling (30 marks)

### Q7 (10 marks)
Split the data into training (75%) and testing (25%) sets. Stratify on the target (`stratify=y`) and use `random_state=42`. Print the shapes of the resulting sets.

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, stratify=y, random_state=42)

print("Training shape:", X_train.shape)
print("Testing shape:", X_test.shape)
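To see what `stratify` does, the class counts in each subset can be compared. The sketch below repeats the same split so that it runs on its own; `train_counts` and `test_counts` are illustrative names.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Count how many samples of each class landed in each subset.
train_counts = np.bincount(y_train)
test_counts = np.bincount(y_test)
print("Train class counts:", train_counts)
print("Test class counts: ", test_counts)
```

With stratification, the three classes should appear in near-equal numbers in both subsets, mirroring the balanced full dataset.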

### Q8 (10 marks)
Train a `KNeighborsClassifier` with `n_neighbors=3`. Fit it on the training data and report the test accuracy.

In [None]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)

y_pred = knn.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Test Accuracy (k=3): {accuracy:.4f}")

### Q9 (10 marks)
Compare the accuracy for `n_neighbors` = 1, 3, and 5. Which value gives the best accuracy and why?

In [None]:
for k in [1, 3, 5]:
    knn_k = KNeighborsClassifier(n_neighbors=k)
    knn_k.fit(X_train, y_train)
    acc = knn_k.score(X_test, y_test)
    print(f"Accuracy for k={k}: {acc:.4f}")

**Answer:**
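On this particular split, all three values are likely to score very high (at or near 1.0) because the Iris classes are well separated, so any difference between them may come down to one or two of the 38 test samples. In general, **k=3 or k=5** is preferred over k=1: with k=1 each prediction depends on a single training point, which overfits and is sensitive to noisy samples, whereas a larger k averages the vote over several neighbors.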
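Because a single small test set can hide real differences between values of k, a cross-validated comparison on the training data may be more informative. The following is a minimal sketch, assuming 5-fold cross-validation (a choice made here, not one the assignment specifies); `cv_means` is an illustrative name.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Average accuracy over 5 folds of the training set for each k.
cv_means = {}
for k in [1, 3, 5]:
    scores = cross_val_score(
        KNeighborsClassifier(n_neighbors=k), X_train, y_train, cv=5
    )
    cv_means[k] = scores.mean()
    print(f"k={k}: mean CV accuracy = {scores.mean():.4f} (std {scores.std():.4f})")
```

Selecting k this way leaves the test set untouched until the final evaluation.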

## Part 4: Prediction (10 marks)

### Q10 (10 marks)
Use your trained classifier (with `k=3`) to predict the class of a new sample with features `[5.0, 2.9, 1.0, 0.2]`. Display the predicted class name.

In [None]:
new_sample = [[5.0, 2.9, 1.0, 0.2]]
prediction_idx = knn.predict(new_sample)[0]
predicted_species = iris.target_names[prediction_idx]

print(f"Predicted Class: {predicted_species} 🌸")
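To see why the classifier chose this class, it can help to inspect the neighbor votes behind the prediction. The sketch below refits the same k=3 model so that it runs on its own; `proba`, `distances`, `indices`, and `neighbor_species` are names introduced here for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.25, stratify=iris.target, random_state=42
)
knn = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)

new_sample = np.array([[5.0, 2.9, 1.0, 0.2]])

# Fraction of the 3 nearest neighbors that voted for each class.
proba = knn.predict_proba(new_sample)[0]
for name, p in zip(iris.target_names, proba):
    print(f"{name}: {p:.2f}")

# Distances to, and species of, the 3 nearest training points.
distances, indices = knn.kneighbors(new_sample)
neighbor_species = iris.target_names[y_train[indices[0]]]
print("Neighbor distances:", np.round(distances[0], 3))
print("Neighbor species:  ", neighbor_species)
```

With the default uniform weights, `predict_proba` reports the share of neighbors in each class, so a unanimous vote shows up as 1.00 for a single species.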