# Task 5: Exploratory Data Analysis (EDA) — Iris Dataset

**Tools:** Python, pandas, matplotlib  
**Dataset:** `iris_dataset.csv` (included)  
**Deliverables:** This notebook + `EDA_Report_Iris.pdf`

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from pandas.plotting import scatter_matrix

# Load dataset
df = pd.read_csv('iris_dataset.csv')
df.head()

## 1. Structure & Summary

In [None]:
df.info()

In [None]:
df.describe(include='all')

In [None]:
df['species'].value_counts()

## 2. Distributions — Histograms

In [None]:
for col in df.columns[:-1]:
    plt.figure()
    df[col].plot(kind='hist', bins=20, alpha=0.8, edgecolor='black', title=f'Histogram: {col}')
    plt.xlabel(col); plt.ylabel('Count'); plt.show()

**Observations:**  
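- Petal length and petal width are clearly bimodal: a cluster of small values sits apart from the rest, with a visible gap between the two modes.  
- Sepal length and sepal width show no comparable gap; their values form one broad, overlapping distribution.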
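
The pooled histograms above do not show which species produce each mode. One way to check is to overlay a histogram per species on shared bins. This is a minimal sketch: the helper name `hist_by_species` and its parameters are hypothetical, and it assumes the label column is called `species`, as in the cells above.

```python
import matplotlib.pyplot as plt
import pandas as pd


def hist_by_species(df, col, label_col="species", bins=20):
    """Overlay one histogram per species for a single feature column."""
    fig, ax = plt.subplots()
    # Use the same value range for every group so the bin edges line up.
    lo, hi = df[col].min(), df[col].max()
    for name, group in df.groupby(label_col):
        ax.hist(group[col], bins=bins, range=(lo, hi), alpha=0.6,
                edgecolor="black", label=str(name))
    ax.set_xlabel(col)
    ax.set_ylabel("Count")
    ax.set_title(f"Histogram by species: {col}")
    ax.legend(title=label_col)
    return ax
```

In this notebook it could be called as `hist_by_species(df, 'petal_length')` followed by `plt.show()`.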

## 3. Boxplots by Species

In [None]:
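species = df['species'].unique()  # computed once, in order of first appearance
for col in df.columns[:-1]:
    plt.figure()
    # One array of values per species
    data = [df.loc[df['species'] == s, col].values for s in species]
    plt.boxplot(data)
    # Set tick labels explicitly: the `labels` keyword of boxplot is deprecated in Matplotlib 3.9+
    plt.xticks(range(1, len(species) + 1), species)
    plt.title(f'Boxplot by species: {col}')
    plt.xlabel('species'); plt.ylabel(col); plt.show()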

**Observations:**  
- Petal length/width show strong separation (setosa < versicolor < virginica).  
- Sepal width tends to be slightly higher for setosa; sepal length higher for versicolor/virginica.
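
The ordering read off the boxplots can also be checked numerically by comparing per-species medians. The sketch below is illustrative only: `species_medians` and `species_order` are hypothetical helper names, and the code assumes every column other than `species` is numeric.

```python
import pandas as pd


def species_medians(df, label_col="species"):
    """Median of every feature column, one row per species."""
    features = df.drop(columns=[label_col])
    return features.groupby(df[label_col]).median()


def species_order(df, col, label_col="species"):
    """Species names sorted from lowest to highest median of `col`."""
    medians = species_medians(df, label_col)[col]
    return medians.sort_values().index.tolist()
```

For example, `species_order(df, 'petal_length')` returns the species from smallest to largest median petal length, which can be compared with the ordering stated above.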

## 4. Correlation Heatmap

In [None]:
corr = df.drop(columns=['species']).corr()
plt.figure(figsize=(6,5))
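# Diverging colormap on a fixed [-1, 1] scale so that 0 maps to the neutral color
plt.imshow(corr, interpolation='nearest', cmap='coolwarm', vmin=-1, vmax=1)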
plt.xticks(range(len(corr.columns)), corr.columns, rotation=45, ha='right')
plt.yticks(range(len(corr.columns)), corr.columns)
plt.title('Correlation Heatmap (features only)')
for i in range(len(corr.columns)):
    for j in range(len(corr.columns)):
        plt.text(j, i, f"{corr.iloc[i, j]:.2f}", ha='center', va='center')
plt.colorbar()
plt.tight_layout()
plt.show()
corr

**Observations:**  
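- Petal length and petal width are very strongly correlated (~0.96).  
- Sepal width is negatively correlated with the petal features; in the standard Iris data these coefficients are moderate (roughly -0.4) rather than weak, while its correlation with sepal length is close to zero (about -0.1).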
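
Reading the strongest relationships off a heatmap is easy to get wrong, so it can help to list the feature pairs ranked by absolute correlation. This is a minimal sketch; `ranked_pairs` is a hypothetical helper, and it assumes the only non-numeric column is `species`.

```python
import numpy as np
import pandas as pd


def ranked_pairs(df, label_col="species"):
    """Return each unordered feature pair once, sorted by |correlation|."""
    corr = df.drop(columns=[label_col]).corr()
    # Keep only the strict upper triangle (k=1 skips the diagonal of 1.0s),
    # so each pair appears once instead of twice.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack().dropna()
    return pairs.sort_values(key=lambda s: s.abs(), ascending=False)
```

Calling `ranked_pairs(df)` gives a Series indexed by `(feature_a, feature_b)`; the first entry is the most strongly correlated pair, whatever its sign.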

## 5. Pairwise Relationships — Scatter Matrix

In [None]:
axs = scatter_matrix(df.drop(columns=['species']), diagonal='hist', figsize=(6,6))
plt.tight_layout(); plt.show()

**Observations:**  
- Petal features show tight linear relationships; sepal features have broader dispersion.

## 6. Key Scatter Plots Colored by Species

In [None]:
def scatter_by_species(df, x, y):
    for s in df['species'].unique():
        sub = df[df['species']==s]
        plt.scatter(sub[x], sub[y], label=s, alpha=0.8)
    plt.xlabel(x); plt.ylabel(y); plt.title(f'{x} vs {y} by species'); plt.legend()
    plt.show()

pairs = [('sepal_length','sepal_width'), ('petal_length','petal_width'), ('sepal_length','petal_length')]
for x,y in pairs:
    scatter_by_species(df, x, y)

## 7. Summary of Findings
- Balanced classes (50 each).  
- Petal measurements offer strong class separation and are highly correlated.  
- Sepal measurements are less strongly related to each other and overlap more across species.  
- `petal_length` and `petal_width` are the most informative features for telling the species apart.
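
The claim about informative features can be given a rough numeric footing by comparing, for each feature, how far apart the species means are relative to the spread within each species. The ratio below is a simple illustrative score, not a formal test and not something computed in the cells above; `separation_ratio` is a hypothetical name, and the code assumes all columns except `species` are numeric.

```python
import pandas as pd


def separation_ratio(df, label_col="species"):
    """Variance of the species means divided by the mean within-species variance."""
    grouped = df.drop(columns=[label_col]).groupby(df[label_col])
    between = grouped.mean().var()   # spread of the per-species means
    within = grouped.var().mean()    # average spread inside a species
    return (between / within).sort_values(ascending=False)
```

Features with a larger ratio separate the species more cleanly; `separation_ratio(df)` returns them in descending order.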