<a href="https://colab.research.google.com/github/lovnishverma/Python-Getting-Started/blob/main/Iris_NumPy.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 🌸 Iris Dataset Analysis Using NumPy

This notebook explains **every step** of a simple machine learning workflow using **only NumPy**.

You will learn:
- What the Iris dataset is
- How data is stored in NumPy arrays
- Why normalization is needed
- How a classifier works internally


## 1. Import Required Libraries

- **NumPy**: numerical computing
- **scikit-learn (`sklearn`)**: used only to load the dataset, not for any modeling


In [1]:
import numpy as np
from sklearn.datasets import load_iris

np.set_printoptions(precision=3, suppress=True)

## 2. Load the Iris Dataset

- 150 samples (flowers)
- 4 features per sample
- 3 output classes


In [2]:
iris = load_iris()

X = iris.data      # Feature matrix (150 x 4)
y = iris.target    # Labels (150,)

print("X shape:", X.shape)
print("y shape:", y.shape)
print("Feature names:", iris.feature_names)
print("Class names:", iris.target_names)

X shape: (150, 4)
y shape: (150,)
Feature names: ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
Class names: ['setosa' 'versicolor' 'virginica']


## 3. Inspect the Data

Look at the first few rows to understand the structure.

In [3]:
print(X[:5])
print(y[:5])

[[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]
 [4.6 3.1 1.5 0.2]
 [5.  3.6 1.4 0.2]]
[0 0 0 0 0]
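The labels are plain integers, so it can help to map them back to class names and count how many samples each class has. The sketch below uses small stand-in arrays (`target_names` and `y_demo` are illustrative, not the notebook's variables); in the notebook the same indexing should work with `iris.target_names` and `y`.

```python
import numpy as np

# Stand-in values; in the notebook these would be iris.target_names and y.
target_names = np.array(["setosa", "versicolor", "virginica"])
y_demo = np.array([0, 0, 1, 2, 1])

# Indexing an array with an integer array looks up one name per label.
label_names = target_names[y_demo]
print(label_names)

# np.bincount counts how often each non-negative integer label occurs.
class_counts = np.bincount(y_demo)
print(class_counts)
```

Applied to the real labels, `np.bincount(y)` is a quick way to confirm that the classes are balanced before splitting the data.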


## 4. Basic Statistics

We compute the **mean** and **standard deviation** of each feature (each column of `X`).
These statistics are used for normalization in the next step.

In [4]:
mean = np.mean(X, axis=0)
std = np.std(X, axis=0)

for i, name in enumerate(iris.feature_names):
    print(name, "→ mean:", round(mean[i], 2), "std:", round(std[i], 2))

sepal length (cm) → mean: 5.84 std: 0.83
sepal width (cm) → mean: 3.06 std: 0.43
petal length (cm) → mean: 3.76 std: 1.76
petal width (cm) → mean: 1.2 std: 0.76
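The `axis=0` argument is what makes these statistics per-feature rather than per-sample. A minimal sketch on a toy matrix (the array `A` is made up for illustration) shows the difference, and also that `np.std` defaults to the population standard deviation (`ddof=0`):

```python
import numpy as np

# Toy matrix: 3 samples (rows) x 2 features (columns).
A = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0]])

col_mean = A.mean(axis=0)   # collapses the rows: one value per feature, shape (2,)
row_mean = A.mean(axis=1)   # collapses the columns: one value per sample, shape (3,)

col_std_pop = A.std(axis=0)             # ddof=0: divide by N (NumPy's default)
col_std_sample = A.std(axis=0, ddof=1)  # ddof=1: divide by N - 1

print("per-feature mean:", col_mean)
print("per-sample mean:", row_mean)
print("population std:", col_std_pop)
print("sample std:", col_std_sample)
```

With 150 samples the two standard deviations differ only slightly, so either choice is reasonable for normalization as long as it is used consistently.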


## 5. Normalize the Features (Z-score)

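Z-score normalization subtracts each feature's mean and divides by its standard deviation, so every feature ends up with mean 0 and standard deviation 1. This matters for a distance-based classifier: without it, features with a larger spread (here petal length, std 1.76) would dominate the Euclidean distance.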

In [5]:
X_norm = (X - mean) / std

print(X_norm[:5])

[[-0.901  1.019 -1.34  -1.315]
 [-1.143 -0.132 -1.34  -1.315]
 [-1.385  0.328 -1.397 -1.315]
 [-1.507  0.098 -1.283 -1.315]
 [-1.022  1.249 -1.34  -1.315]]
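One way to sanity-check the normalization is to confirm that every column of the result has mean 0 and standard deviation 1. The sketch below does this on synthetic data (`X_demo` is a hypothetical stand-in with two features on very different scales); the same two checks could be run on `X_norm`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 100 samples, 2 features with very different scales.
X_demo = rng.normal(loc=[5.0, 300.0], scale=[0.5, 40.0], size=(100, 2))

mu = X_demo.mean(axis=0)
sigma = X_demo.std(axis=0)

# Broadcasting applies the per-feature mean and std to every row.
Z = (X_demo - mu) / sigma

print("column means:", Z.mean(axis=0))
print("column stds:", Z.std(axis=0))
```

The printed means are zero only up to floating-point rounding, which is why `np.allclose` is a better check than `==` here.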


## 6. Train–Test Split (Manual)

- 80% training data
- 20% testing data


In [6]:
np.random.seed(42)
indices = np.random.permutation(len(X_norm))
split = int(0.8 * len(X_norm))

train_idx = indices[:split]
test_idx = indices[split:]

X_train, X_test = X_norm[train_idx], X_norm[test_idx]
y_train, y_test = y[train_idx], y[test_idx]

print("Train samples:", X_train.shape[0])
print("Test samples:", X_test.shape[0])

Train samples: 120
Test samples: 30
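Note that this notebook normalizes with the mean and standard deviation of all 150 samples before splitting, so the test samples contribute to those statistics. A common alternative is to split first and compute the statistics from the training rows only. The sketch below shows that ordering on synthetic data; `X_raw` and the `*_demo` names are illustrative and not part of the notebook.

```python
import numpy as np

rng = np.random.default_rng(42)

# Stand-in for an unnormalized feature matrix (50 samples, 2 features).
X_raw = rng.normal(loc=[5.0, 3.0], scale=[0.8, 0.4], size=(50, 2))

# Shuffle the row indices, then cut them 80/20.
indices = rng.permutation(len(X_raw))
split = int(0.8 * len(X_raw))
train_idx, test_idx = indices[:split], indices[split:]

# Statistics come from the training rows only.
train_mean = X_raw[train_idx].mean(axis=0)
train_std = X_raw[train_idx].std(axis=0)

# The same training statistics are applied to both splits.
X_train_demo = (X_raw[train_idx] - train_mean) / train_std
X_test_demo = (X_raw[test_idx] - train_mean) / train_std

print("Train:", X_train_demo.shape, "Test:", X_test_demo.shape)
```

On a dataset as small and well-behaved as Iris the difference is likely minor, but this ordering keeps the test set fully unseen during preprocessing.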


## 7. Nearest Centroid Classifier

Each class is represented by its **centroid**: the **average feature vector** of the training samples that belong to that class.

In [7]:
centroids = np.array([
    X_train[y_train == 0].mean(axis=0),
    X_train[y_train == 1].mean(axis=0),
    X_train[y_train == 2].mean(axis=0)
])

print("Centroids:\n", centroids)

Centroids:
 [[-1.039  0.869 -1.302 -1.251]
 [ 0.094 -0.657  0.288  0.183]
 [ 0.926 -0.186  1.019  1.116]]
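The three centroid rows above are written out by hand. A slightly more general sketch loops over whatever labels are present, so it does not hard-code the number of classes. The toy arrays `X_demo` and `y_demo` are made up so the result is easy to check by eye.

```python
import numpy as np

# Toy data: two classes with two samples each.
X_demo = np.array([[0.0, 0.0],
                   [2.0, 0.0],
                   [10.0, 10.0],
                   [12.0, 14.0]])
y_demo = np.array([0, 0, 1, 1])

# np.unique returns the sorted distinct labels.
classes = np.unique(y_demo)

# Boolean mask y_demo == c selects the rows of class c; mean(axis=0) averages them.
centroids_demo = np.array([X_demo[y_demo == c].mean(axis=0) for c in classes])

print(centroids_demo)
```

Row `i` of the result is the centroid of `classes[i]`, which matches the layout of `centroids` in the cell above when the labels are 0, 1, 2.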


## 8. Prediction Function

We compute the **Euclidean distance** from every sample to every centroid and assign each sample the class of its nearest centroid.

In [8]:
def predict(X, centroids):
    distances = np.linalg.norm(
        X[:, None, :] - centroids[None, :, :],
        axis=2
    )
    return np.argmin(distances, axis=1)
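The `None` indexing in `predict` relies on broadcasting, which can be hard to read at first. The sketch below repeats the same computation step by step on toy arrays (`X_demo` and `centroids_demo` are illustrative) and notes the shape at each stage.

```python
import numpy as np

X_demo = np.array([[0.0, 0.0],
                   [3.0, 4.0],
                   [9.0, 9.0]])            # shape (3, 2): 3 samples, 2 features
centroids_demo = np.array([[0.0, 0.0],
                           [10.0, 10.0]])  # shape (2, 2): 2 centroids, 2 features

# (3, 1, 2) - (1, 2, 2) broadcasts to (3, 2, 2):
# one difference vector for every (sample, centroid) pair.
diff = X_demo[:, None, :] - centroids_demo[None, :, :]

# Taking the norm over the feature axis leaves a (3, 2) distance table.
distances = np.linalg.norm(diff, axis=2)

# For each sample (row), pick the column index of the smallest distance.
labels = np.argmin(distances, axis=1)

print(diff.shape, distances.shape)
print(distances)
print(labels)
```

For example, the second sample `[3, 4]` is at distance 5 from the first centroid, which is closer than the second centroid, so it receives label 0.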

## 9. Model Evaluation

Accuracy = correct predictions / total predictions

In [9]:
y_pred = predict(X_test, centroids)
accuracy = np.mean(y_pred == y_test)

print("Model Accuracy:", accuracy)

Model Accuracy: 0.8666666666666667
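A single accuracy number does not show which classes get confused with each other. One NumPy-only way to see that is a confusion matrix, sketched below on made-up labels (`y_true_demo` and `y_pred_demo` are illustrative, not the notebook's results); the same code could be applied to `y_test` and `y_pred`.

```python
import numpy as np

# Made-up labels for illustration only.
y_true_demo = np.array([0, 0, 1, 1, 2, 2])
y_pred_demo = np.array([0, 0, 1, 2, 2, 1])

n_classes = 3
confusion = np.zeros((n_classes, n_classes), dtype=int)

# Rows are true classes, columns are predicted classes.
# np.add.at increments correctly even when an index pair repeats.
np.add.at(confusion, (y_true_demo, y_pred_demo), 1)

# Fraction of each true class that was predicted correctly.
per_class_recall = confusion.diagonal() / confusion.sum(axis=1)

# Overall accuracy is the diagonal sum over the total count.
accuracy_demo = confusion.trace() / confusion.sum()

print(confusion)
print("Per-class recall:", per_class_recall)
print("Accuracy:", accuracy_demo)
```

Off-diagonal entries show which pairs of classes are mistaken for one another.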


## ⤵ Summary

- Dataset: Iris
- ML Type: Distance-based classification
- Tools: NumPy for all computation (scikit-learn only to load the data)
- Purpose: Learn ML fundamentals clearly
