# Iris Dataset Classification Analysis

## Overview
The Iris dataset is one of the most famous datasets in machine learning and statistics. It contains measurements of 150 iris flowers from three different species:
- Iris setosa
- Iris versicolor
- Iris virginica

## Dataset Details
- **Samples**: 150 (50 per class)
- **Features**: 4 numerical features
- **Target**: 3 classes (species)
- **Task**: Multi-class classification

## Features
1. Sepal length (cm)
2. Sepal width (cm)
3. Petal length (cm)
4. Petal width (cm)

## Step 1: Import Required Libraries
We'll import all necessary libraries for data loading, visualization, and machine learning.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.decomposition import PCA

# Set random seed for reproducibility
np.random.seed(42)

# Configure plotting
plt.style.use('default')
sns.set_palette('husl')
plt.rcParams['figure.figsize'] = (10, 6)

## Step 2: Load and Explore the Dataset
Load the Iris dataset and examine its structure and basic statistics.

In [None]:
# Load the Iris dataset
iris = load_iris()

# Create a DataFrame for easier manipulation
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['target'] = iris.target
df['species'] = df['target'].map({0: 'setosa', 1: 'versicolor', 2: 'virginica'})

print("Dataset Shape:", df.shape)
print("\nFirst 5 rows:")
print(df.head())

print("\nDataset Info:")
df.info()  # info() prints its report directly and returns None, so it is not wrapped in print()

print("\nClass Distribution:")
print(df['species'].value_counts())

## Step 3: Statistical Summary
Examine the statistical properties of each feature.

In [None]:
# Statistical summary
print("Statistical Summary:")
print(df.describe())

# Check for missing values
print("\nMissing Values:")
print(df.isnull().sum())

# Summary by species
print("\nSummary by Species:")
print(df.groupby('species')[iris.feature_names].describe())  # features only; 'target' is constant within each species

## Step 4: Data Visualization
Create various visualizations to understand the data distribution and relationships.

In [None]:
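# Pairplot to show relationships between features
# pairplot creates its own figure, so a preceding plt.figure() call would only leave an empty figure behind.
# The integer 'target' column is dropped so it is not plotted as a fifth feature.
sns.pairplot(df.drop(columns='target'), hue='species', diag_kind='hist')
plt.suptitle('Iris Dataset - Pairwise Feature Relationships', y=1.02)
plt.show()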

In [None]:
# Box plots for each feature by species
fig, axes = plt.subplots(2, 2, figsize=(15, 10))
features = iris.feature_names

for i, feature in enumerate(features):
    row, col = i // 2, i % 2
    sns.boxplot(data=df, x='species', y=feature, ax=axes[row, col])
    axes[row, col].set_title(f'{feature} by Species')

plt.tight_layout()
plt.show()

In [None]:
# Correlation matrix
plt.figure(figsize=(10, 8))
correlation_matrix = df[iris.feature_names].corr()
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', center=0)
plt.title('Feature Correlation Matrix')
plt.show()
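The heatmap can be complemented by a ranked list of feature pairs. The following is a minimal sketch that reloads the data so it runs on its own; the names `features_df`, `corr`, and `pairs` are illustrative and not part of the cells above.

```python
from itertools import combinations

import pandas as pd
from sklearn.datasets import load_iris

iris_data = load_iris()
features_df = pd.DataFrame(iris_data.data, columns=iris_data.feature_names)
corr = features_df.corr()  # Pearson correlation between the four features

# Each unordered feature pair once, sorted by absolute correlation (strongest first)
pairs = [(a, b, corr.loc[a, b]) for a, b in combinations(corr.columns, 2)]
pairs.sort(key=lambda item: abs(item[2]), reverse=True)

for first, second, r in pairs:
    print(f"{first} vs {second}: r = {r:+.3f}")
```

Strongly correlated pairs carry overlapping information, which is worth keeping in mind when reading feature importance scores later in the analysis.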

## Step 5: Feature Engineering and Preprocessing
Prepare the data for machine learning models.

In [None]:
# Separate features and target
X = iris.data
y = iris.target

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)

print(f"Training set size: {X_train.shape[0]}")
print(f"Test set size: {X_test.shape[0]}")
print(f"Feature dimensions: {X_train.shape[1]}")

# Scale the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print("\nFeature scaling completed.")
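To make the scaling step concrete, the sketch below reproduces `StandardScaler` by hand: each feature is centered on the training mean and divided by the training standard deviation, and the same training statistics are reused for the test set. It repeats the split with the same parameters so it is self-contained; the `*_manual` and `scaler_demo` names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

# Statistics come from the training set only
train_mean = X_tr.mean(axis=0)
train_std = X_tr.std(axis=0)  # population std (ddof=0), matching StandardScaler

# z = (x - mean) / std, applied with the training statistics to both sets
X_tr_manual = (X_tr - train_mean) / train_std
X_te_manual = (X_te - train_mean) / train_std

scaler_demo = StandardScaler().fit(X_tr)
print("Train matches StandardScaler:", np.allclose(X_tr_manual, scaler_demo.transform(X_tr)))
print("Test matches StandardScaler:", np.allclose(X_te_manual, scaler_demo.transform(X_te)))
print("Test-set feature means after scaling:", X_te_manual.mean(axis=0).round(3))
```

The scaled test-set means are generally close to zero but not exactly zero, because the test set is transformed with statistics it did not contribute to.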

## Step 6: Model Training and Evaluation
Train multiple classification models and compare their performance.

In [None]:
# Initialize models
models = {
    'Logistic Regression': LogisticRegression(random_state=42),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
    'SVM': SVC(random_state=42)
}

# Train and evaluate models
results = {}

for name, model in models.items():
    print(f"\n=== {name} ===")
    
    # Train the model
    model.fit(X_train_scaled, y_train)
    
    # Make predictions
    y_pred = model.predict(X_test_scaled)
    
    # Calculate accuracy
    accuracy = accuracy_score(y_test, y_pred)
    print(f"Test Accuracy: {accuracy:.4f}")
    
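    # Cross-validation on the training set (cross_val_score fits clones, so the model trained above is unchanged).
    # Caveat: the scaler was fit on the full training set, so each validation fold has influenced the scaling statistics.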
    cv_scores = cross_val_score(model, X_train_scaled, y_train, cv=5)
    print(f"Cross-validation Accuracy: {cv_scores.mean():.4f} (+/- {cv_scores.std() * 2:.4f})")
    
    # Store results
    results[name] = {
        'model': model,
        'accuracy': accuracy,
        'predictions': y_pred,
        'cv_mean': cv_scores.mean(),
        'cv_std': cv_scores.std()
    }
    
    # Classification report
    print("\nClassification Report:")
    print(classification_report(y_test, y_pred, target_names=iris.target_names))
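One way to keep the scaling step inside each cross-validation fold is to wrap the scaler and the classifier in a pipeline and cross-validate on the unscaled training data. This is a sketch of that alternative, not the approach used in the cell above; it is shown for logistic regression only, and the names `cv_pipeline` and `pipeline_cv_scores` are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_all, y_all = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_all, y_all, test_size=0.3, random_state=42, stratify=y_all
)

# The scaler is re-fit on the training portion of every fold,
# so the validation portion never influences the scaling statistics.
cv_pipeline = make_pipeline(StandardScaler(), LogisticRegression(random_state=42))
pipeline_cv_scores = cross_val_score(cv_pipeline, X_tr, y_tr, cv=5)

print(f"Pipeline CV accuracy: {pipeline_cv_scores.mean():.4f} "
      f"(std {pipeline_cv_scores.std():.4f})")
```

On a dataset this small and clean the difference from scaling once up front is usually negligible, but the pipeline form carries over safely to noisier problems.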

## Step 7: Confusion Matrix Visualization
Visualize the confusion matrices for each model.

In [None]:
# Plot confusion matrices
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

for i, (name, result) in enumerate(results.items()):
    cm = confusion_matrix(y_test, result['predictions'])
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', ax=axes[i],
                xticklabels=iris.target_names, yticklabels=iris.target_names)
    axes[i].set_title(f'{name}\nAccuracy: {result["accuracy"]:.4f}')
    axes[i].set_xlabel('Predicted')
    axes[i].set_ylabel('Actual')

plt.tight_layout()
plt.show()
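The numbers in a confusion matrix map directly onto per-class precision and recall. The sketch below derives both from the matrix and checks them against scikit-learn's own metrics. It trains a fresh logistic regression pipeline so it runs independently of the cells above; `cm_demo` and the `*_per_class` names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

iris_cm = load_iris()
X_tr, X_te, y_tr, y_te = train_test_split(
    iris_cm.data, iris_cm.target, test_size=0.3, random_state=42, stratify=iris_cm.target
)
clf = make_pipeline(StandardScaler(), LogisticRegression(random_state=42)).fit(X_tr, y_tr)
y_hat = clf.predict(X_te)

# Rows are actual classes, columns are predicted classes
cm_demo = confusion_matrix(y_te, y_hat)
true_positives = np.diag(cm_demo)

recall_per_class = true_positives / cm_demo.sum(axis=1)     # correct / all actual members of the class
precision_per_class = true_positives / cm_demo.sum(axis=0)  # correct / all predictions of the class

for name, p, r in zip(iris_cm.target_names, precision_per_class, recall_per_class):
    print(f"{name:<12} precision={p:.3f} recall={r:.3f}")

print("Matches sklearn recall:",
      np.allclose(recall_per_class, recall_score(y_te, y_hat, average=None)))
```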

## Step 8: Principal Component Analysis (PCA)
Perform dimensionality reduction on the scaled training data and visualize it in 2D.

In [None]:
# Apply PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_train_scaled)

# Create DataFrame for PCA results
pca_df = pd.DataFrame(X_pca, columns=['PC1', 'PC2'])
pca_df['species'] = y_train
pca_df['species_name'] = pca_df['species'].map({0: 'setosa', 1: 'versicolor', 2: 'virginica'})

# Plot PCA results
plt.figure(figsize=(10, 8))
for species in ['setosa', 'versicolor', 'virginica']:
    mask = pca_df['species_name'] == species
    plt.scatter(pca_df.loc[mask, 'PC1'], pca_df.loc[mask, 'PC2'],
                label=species, alpha=0.7, s=50)

plt.xlabel(f'First Principal Component (Explained Variance: {pca.explained_variance_ratio_[0]:.2%})')
plt.ylabel(f'Second Principal Component (Explained Variance: {pca.explained_variance_ratio_[1]:.2%})')
plt.title('Iris Dataset - PCA Visualization')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

print(f"Total explained variance: {pca.explained_variance_ratio_.sum():.2%}")
print(f"Principal Component 1 explains {pca.explained_variance_ratio_[0]:.2%} of variance")
print(f"Principal Component 2 explains {pca.explained_variance_ratio_[1]:.2%} of variance")
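The explained variance ratios come from the eigenvalues of the covariance matrix of the scaled features. The sketch below computes them by hand and compares the result with scikit-learn's `PCA` fitted with all four components. It repeats the split and scaling so it is self-contained; `manual_ratio`, `pca_full`, and `cumulative` are illustrative names.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X_all, y_all = load_iris(return_X_y=True)
X_tr, _, y_tr, _ = train_test_split(
    X_all, y_all, test_size=0.3, random_state=42, stratify=y_all
)
X_std = StandardScaler().fit_transform(X_tr)

# Eigenvalues of the covariance matrix are the variances along the principal axes
cov = np.cov(X_std, rowvar=False)
eigenvalues = np.linalg.eigvalsh(cov)[::-1]  # eigvalsh returns ascending order, so reverse it
manual_ratio = eigenvalues / eigenvalues.sum()

pca_full = PCA().fit(X_std)  # keep all four components
print("Matches sklearn PCA:", np.allclose(manual_ratio, pca_full.explained_variance_ratio_))

cumulative = np.cumsum(manual_ratio)
for k, total in enumerate(cumulative, start=1):
    print(f"First {k} component(s): {total:.2%} of variance")
```

The cumulative totals show how much is lost by keeping only two components for the plot.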

## Step 9: Feature Importance Analysis
Analyze feature importance using Random Forest.

In [None]:
# Get feature importance from Random Forest
rf_model = results['Random Forest']['model']
feature_importance = pd.DataFrame({
    'feature': iris.feature_names,
    'importance': rf_model.feature_importances_
}).sort_values('importance', ascending=False)

# Plot feature importance
plt.figure(figsize=(10, 6))
sns.barplot(data=feature_importance, x='importance', y='feature')
plt.title('Feature Importance (Random Forest)')
plt.xlabel('Importance Score')
plt.tight_layout()
plt.show()

print("Feature Importance Ranking:")
for i, row in feature_importance.iterrows():
    print(f"{row['feature']}: {row['importance']:.4f}")
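Impurity-based importances are computed from the training data and can be skewed when features are correlated, as the two petal measurements are. Permutation importance on the held-out test set is a common cross-check. The sketch below is an addition to the analysis, not part of the original workflow. It trains its own forest on unscaled features (tree splits do not depend on feature scale), and `perm` and `rf_demo` are illustrative names.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

iris_pi = load_iris()
X_tr, X_te, y_tr, y_te = train_test_split(
    iris_pi.data, iris_pi.target, test_size=0.3, random_state=42, stratify=iris_pi.target
)

rf_demo = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_tr, y_tr)

# Shuffle one feature at a time on the test set and measure the drop in accuracy
perm = permutation_importance(rf_demo, X_te, y_te, n_repeats=20, random_state=42)

for idx in perm.importances_mean.argsort()[::-1]:
    print(f"{iris_pi.feature_names[idx]}: "
          f"{perm.importances_mean[idx]:.4f} +/- {perm.importances_std[idx]:.4f}")
```

When two features are strongly correlated, shuffling one of them may cost little accuracy because the model can fall back on the other, so a low permutation score does not by itself mean the feature is uninformative.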

## Step 10: Model Comparison Summary
Compare all models and summarize findings.

In [None]:
# Create comparison DataFrame
comparison = pd.DataFrame({
    'Model': list(results.keys()),
    'Test Accuracy': [results[name]['accuracy'] for name in results.keys()],
    'CV Mean': [results[name]['cv_mean'] for name in results.keys()],
    'CV Std': [results[name]['cv_std'] for name in results.keys()]
})

comparison = comparison.sort_values('Test Accuracy', ascending=False)
print("Model Performance Comparison:")
print(comparison.to_string(index=False))

# Plot model comparison
plt.figure(figsize=(10, 6))
x_pos = np.arange(len(comparison))
plt.bar(x_pos, comparison['Test Accuracy'], alpha=0.7, color='skyblue', label='Test Accuracy')
plt.errorbar(x_pos, comparison['CV Mean'], yerr=comparison['CV Std'], 
             fmt='ro', capsize=5, label='CV Mean ± Std')

plt.xlabel('Models')
plt.ylabel('Accuracy')
plt.title('Model Performance Comparison')
plt.xticks(x_pos, comparison['Model'], rotation=45)
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

## Key Findings and Conclusions

### Dataset Characteristics:
- The Iris dataset is well-balanced with 50 samples per class
- Petal length and petal width separate the classes much more clearly than the sepal measurements
- The dataset contains no missing values

### Model Performance:
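- All three models typically reach test accuracy above 95% because the classes are easy to separate: setosa is linearly separable from the other two species, while versicolor and virginica overlap only slightly
- Because the problem is this simple, Iris is considered a "toy" dataset in machine learning
- The test set holds only 45 samples, so a single misclassification shifts test accuracy by about 2.2 percentage points; compare the cross-validation mean and spread with the test accuracy before ranking the models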
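How separable the classes are can be checked directly on the raw features. The sketch below tests whether a single petal length threshold isolates setosa, and whether the value ranges of versicolor and virginica overlap on each feature. It is a simple range check under these assumptions rather than a formal test of linear separability, and the variable names are illustrative.

```python
from sklearn.datasets import load_iris

iris_sep = load_iris()
X_sep, y_sep = iris_sep.data, iris_sep.target

# Setosa (class 0) vs. the rest on petal length (column 2)
petal_length = X_sep[:, 2]
setosa_max = petal_length[y_sep == 0].max()
others_min = petal_length[y_sep != 0].min()
setosa_separable = setosa_max < others_min
print(f"Setosa max petal length: {setosa_max}, other species min: {others_min}")
print("A single petal length threshold isolates setosa:", setosa_separable)

# Versicolor (class 1) vs. virginica (class 2): do their ranges overlap on each feature?
range_overlaps = {}
for j, name in enumerate(iris_sep.feature_names):
    versicolor = X_sep[y_sep == 1, j]
    virginica = X_sep[y_sep == 2, j]
    range_overlaps[name] = bool(
        versicolor.max() >= virginica.min() and virginica.max() >= versicolor.min()
    )
    print(f"{name}: versicolor/virginica ranges overlap = {range_overlaps[name]}")
```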

### Feature Insights:
- Petal length and petal width are typically the most important features for classification
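- PCA on the standardized features shows that the first two components capture most of the variance, so the 2D projection loses little information
- Setosa forms a clearly distinct cluster in feature space, while versicolor and virginica lie close together and overlap slightly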

### Practical Applications:
- Excellent dataset for learning classification concepts
- Demonstrates the effectiveness of simple linear models on well-separated data
- Useful for testing new algorithms and visualization techniques