# Data Analysis Assignment: Pandas and Matplotlib
**Analyzing the Iris Dataset**

This notebook demonstrates comprehensive data analysis using pandas for data manipulation and matplotlib/seaborn for visualization.

## Assignment Overview
- **Objective**: Load, analyze, and visualize the Iris dataset
- **Tools**: pandas, matplotlib, seaborn
- **Dataset**: Iris flower measurements dataset

In [None]:
# Import necessary libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from sklearn.datasets import load_iris
import warnings
warnings.filterwarnings('ignore')

# Set plotting style
plt.style.use('seaborn-v0_8')
sns.set_palette('husl')
plt.rcParams['figure.figsize'] = (10, 6)

print('📊 Libraries imported successfully!')

## Task 1: Load and Explore the Dataset

We'll use the famous Iris dataset, which contains measurements of iris flowers from three different species.

In [None]:
# Load the Iris dataset
try:
    iris_sklearn = load_iris()
    iris_df = pd.DataFrame(data=iris_sklearn.data, columns=iris_sklearn.feature_names)
    iris_df['species'] = pd.Categorical.from_codes(iris_sklearn.target, iris_sklearn.target_names)
    print('✅ Dataset loaded successfully!')
    print(f'Dataset shape: {iris_df.shape}')
except Exception as e:
    print(f'❌ Error loading dataset: {e}')

In [None]:
# Display first few rows
print('First 10 rows of the dataset:')
iris_df.head(10)

In [None]:
# Explore dataset structure
print('Dataset Information:')
iris_df.info()  # info() prints its report directly and returns None
print('\nDataset shape:', iris_df.shape)
print('Column names:', list(iris_df.columns))

In [None]:
# Check for missing values
print('Missing Values Check:')
missing_values = iris_df.isnull().sum()
print(missing_values)
print(f'\nTotal missing values: {missing_values.sum()}')

if missing_values.sum() == 0:
    print('✅ No missing values found - dataset is clean!')
else:
    print('⚠️ Missing values detected - cleaning required')
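
The check above only detects missing values. Because Iris has none, the cleaning branch never runs. As a hedged illustration of what that cleaning could look like, the sketch below uses a small made-up frame (`toy_df` and its values are hypothetical, not Iris data) and fills each gap with the median of its own species, with dropping rows shown as an alternative.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with gaps; these values are NOT from the Iris dataset
toy_df = pd.DataFrame({
    'species': ['setosa', 'setosa', 'setosa',
                'virginica', 'virginica', 'virginica'],
    'petal length (cm)': [1.4, np.nan, 1.6, 5.5, 6.0, np.nan],
})

# Option 1: fill each gap with the median of its own species.
# transform('median') returns a Series aligned with toy_df's index.
group_median = toy_df.groupby('species')['petal length (cm)'].transform('median')
toy_clean = toy_df.copy()
toy_clean['petal length (cm)'] = toy_clean['petal length (cm)'].fillna(group_median)

# Option 2: drop every row that has a missing value
toy_dropped = toy_df.dropna()

print(toy_clean)
print(f'Rows kept after dropna: {len(toy_dropped)}')
```

Group-wise filling keeps all rows and respects species differences, while `dropna()` is simpler but discards data. Which one fits depends on how many values are missing and why.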

## Task 2: Basic Data Analysis

Let's compute basic statistics and perform groupby operations to understand the data better.

In [None]:
# Compute basic statistics
print('Basic Statistics of Numerical Columns:')
basic_stats = iris_df.describe()
basic_stats

In [None]:
# Group analysis by species
print('Mean values by species:')
species_means = iris_df.groupby('species').mean()
species_means

In [None]:
# Count by species
print('Sample count by species:')
species_counts = iris_df.groupby('species').size()
print(species_counts)

# More detailed statistics
print('\nDetailed statistics by species:')
species_detailed = iris_df.groupby('species').agg({
    'sepal length (cm)': ['mean', 'std', 'min', 'max'],
    'petal length (cm)': ['mean', 'std', 'min', 'max']
})
species_detailed

### Key Findings from Analysis

1. **Dataset Quality**: 150 samples with 4 numerical features, no missing values
2. **Species Distribution**: Equal distribution with 50 samples per species
3. **Feature Variation**: Petal measurements show more variation than sepal measurements
4. **Species Differences**: Clear distinctions between species, especially in petal dimensions
5. **Setosa Distinctiveness**: Setosa species clearly separable from the other two
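
The findings on class balance and feature variation can be checked directly. The following is a minimal sketch, assuming the scikit-learn copy of Iris used earlier in this notebook. The names `df`, `feature_std`, and `feature_cv` are introduced here for illustration only.

```python
import pandas as pd
from sklearn.datasets import load_iris

# Rebuild the frame so this cell runs on its own
iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['species'] = iris.target_names[iris.target]  # plain string labels

# Class balance: number of samples per species
species_counts = df['species'].value_counts()

# Spread of each feature: standard deviation and coefficient of variation
feature_std = df[iris.feature_names].std()
feature_cv = feature_std / df[iris.feature_names].mean()

print(species_counts)
print(pd.DataFrame({'std': feature_std, 'cv': feature_cv}).round(3))
```

The coefficient of variation (standard deviation divided by mean) is included because the features have different typical sizes, so raw standard deviations alone can be misleading.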

## Task 3: Data Visualization

This section creates four types of visualization (line chart, bar chart, histogram, and scatter plot) to reveal patterns in the data.

In [None]:
# 1. Line Chart - Sepal length trends across samples
plt.figure(figsize=(12, 6))

for species in iris_df['species'].unique():
    species_data = iris_df[iris_df['species'] == species]
    plt.plot(species_data.index, species_data['sepal length (cm)'], 
             marker='o', markersize=4, linewidth=2, label=species.title(), alpha=0.8)

plt.title('Sepal Length Trends Across Sample Index by Species', fontsize=16, fontweight='bold')
plt.xlabel('Sample Index', fontsize=12)
plt.ylabel('Sepal Length (cm)', fontsize=12)
plt.legend(title='Species', fontsize=10)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print('📈 Line chart shows sepal length by sample index; rows are ordered by species, so each line spans one block of 50 samples')

In [None]:
# 2. Bar Chart - Mean measurements comparison
fig, axes = plt.subplots(2, 2, figsize=(15, 10))
features = ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']

for i, feature in enumerate(features):
    ax = axes[i//2, i%2]
    means = iris_df.groupby('species')[feature].mean()
    bars = means.plot(kind='bar', ax=ax, color=['#FF6B6B', '#4ECDC4', '#45B7D1'], alpha=0.8)
    
    ax.set_title(f'Mean {feature.title()} by Species', fontweight='bold')
    ax.set_ylabel(f'{feature.title()}')
    ax.set_xlabel('Species')
    ax.tick_params(axis='x', rotation=45)
    ax.grid(True, alpha=0.3)
    
    # Add value labels on bars
    for bar in bars.patches:
        height = bar.get_height()
        ax.text(bar.get_x() + bar.get_width()/2., height + 0.05,
                f'{height:.2f}', ha='center', va='bottom', fontweight='bold')

plt.tight_layout()
plt.show()

print('📊 Bar charts reveal significant differences in mean measurements between species')

In [None]:
# 3. Histogram - Distribution of petal length
plt.figure(figsize=(12, 8))

# Overall histogram
plt.subplot(2, 2, (1, 2))
plt.hist(iris_df['petal length (cm)'], bins=20, alpha=0.7, color='skyblue', edgecolor='black')
plt.title('Overall Distribution of Petal Length', fontweight='bold', fontsize=14)
plt.xlabel('Petal Length (cm)')
plt.ylabel('Frequency')
plt.grid(True, alpha=0.3)

# Species-specific histograms
colors = ['#FF6B6B', '#4ECDC4', '#45B7D1']
species_list = iris_df['species'].unique()

for i, (species, color) in enumerate(zip(species_list, colors)):
    plt.subplot(2, 3, i+4)
    species_data = iris_df[iris_df['species'] == species]['petal length (cm)']
    plt.hist(species_data, bins=10, alpha=0.8, color=color, edgecolor='black')
    plt.title(f'{species.title()}', fontweight='bold')
    plt.xlabel('Petal Length (cm)')
    plt.ylabel('Frequency')
    plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print('📈 The overall histogram is bimodal: setosa forms the low peak, versicolor and virginica the high one')

In [None]:
# 4. Scatter Plot - Sepal Length vs Petal Length relationship
plt.figure(figsize=(12, 8))

colors = ['#FF6B6B', '#4ECDC4', '#45B7D1']
species_list = iris_df['species'].unique()

for species, color in zip(species_list, colors):
    species_data = iris_df[iris_df['species'] == species]
    plt.scatter(species_data['sepal length (cm)'], species_data['petal length (cm)'],
                c=color, label=species.title(), alpha=0.7, s=60, edgecolors='black', linewidth=0.5)

plt.title('Sepal Length vs Petal Length by Species', fontweight='bold', fontsize=16)
plt.xlabel('Sepal Length (cm)', fontsize=12)
plt.ylabel('Petal Length (cm)', fontsize=12)
plt.legend(title='Species', fontsize=10)
plt.grid(True, alpha=0.3)

# Add correlation coefficient
correlation = iris_df['sepal length (cm)'].corr(iris_df['petal length (cm)'])
plt.text(0.05, 0.95, f'Overall Correlation: {correlation:.3f}', 
         transform=plt.gca().transAxes, fontsize=12, 
         bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.8))

plt.tight_layout()
plt.show()

print('🔍 Scatter plot reveals strong positive correlation and clear species clustering')
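
The overall correlation shown on the scatter plot pools all three species, so part of it reflects differences between species rather than the relationship inside any one of them. The sketch below, which assumes the same scikit-learn copy of Iris and uses illustrative names (`df`, `within_corr`), computes the correlation separately per species for comparison.

```python
import pandas as pd
from sklearn.datasets import load_iris

# Rebuild the frame so this cell runs on its own
iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['species'] = iris.target_names[iris.target]  # plain string labels

x_col, y_col = 'sepal length (cm)', 'petal length (cm)'

# Pearson correlation over all 150 samples
overall_corr = df[x_col].corr(df[y_col])

# Pearson correlation computed inside each species separately
within_corr = {
    name: group[x_col].corr(group[y_col])
    for name, group in df.groupby('species')
}

print(f'Overall: {overall_corr:.3f}')
for name, value in within_corr.items():
    print(f'{name}: {value:.3f}')
```

If a within-species value is much lower than the pooled one, the pooled figure is driven largely by the offset between species clusters.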

## Additional Analysis: Correlation Matrix and Pair Plot

In [None]:
# Correlation matrix heatmap
plt.figure(figsize=(10, 8))
correlation_matrix = iris_df.select_dtypes(include=[np.number]).corr()
sns.heatmap(correlation_matrix, annot=True, cmap='RdYlBu_r', center=0,
            square=True, linewidths=0.5, fmt='.3f')
plt.title('Correlation Matrix of Iris Features', fontweight='bold', fontsize=16)
plt.tight_layout()
plt.show()

print('🌡️ Correlation matrix reveals strong relationships between petal measurements')

In [None]:
# Pair plot for comprehensive view
# pairplot creates its own figure, so a separate plt.figure() call would only leave an empty figure behind
pair_plot = sns.pairplot(iris_df, hue='species', markers=['o', 's', '^'], 
                        diag_kind='hist', palette='husl')
pair_plot.fig.suptitle('Pairwise Relationships in Iris Dataset', 
                       fontweight='bold', fontsize=16, y=1.02)
plt.show()

print('🎯 Pair plot provides comprehensive view of all feature relationships')

## Summary and Conclusions

### Key Insights Discovered:

1. **Data Quality**: The Iris dataset is exceptionally clean with no missing values and balanced classes

2. **Species Separation**: 
   - Setosa is clearly separable from the other species
   - Versicolor and Virginica show some overlap but distinct patterns
   - Petal measurements are most discriminative

3. **Feature Relationships**:
   - Strong positive correlation between petal length and petal width (0.963)
   - Sepal length correlates strongly with petal length (0.872) and petal width (0.818)
   - Sepal width correlates weakly and negatively with the other features (roughly -0.12 to -0.43)

4. **Distribution Patterns**:
   - Bimodal distribution in petal measurements reflects the gap between setosa and the other two species
   - Roughly bell-shaped distributions within each species (judged visually; no normality test was run)

5. **Classification Potential**: The clear separation suggests high classification accuracy is achievable
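
The separation claims above can be made concrete by comparing petal length ranges per species. This is a minimal sketch under the assumption that the scikit-learn copy of Iris is used. The names `petal_ranges`, `setosa_separable`, and `others_overlap` are illustrative.

```python
import pandas as pd
from sklearn.datasets import load_iris

# Rebuild the frame so this cell runs on its own
iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['species'] = iris.target_names[iris.target]  # plain string labels

# Smallest and largest petal length observed for each species
petal_ranges = df.groupby('species')['petal length (cm)'].agg(['min', 'max'])

# Setosa is separable on this feature if its largest value lies below
# the smallest value of the other two species
setosa_max = petal_ranges.loc['setosa', 'max']
others_min = petal_ranges.drop(index='setosa')['min'].min()
setosa_separable = setosa_max < others_min

# Versicolor and virginica overlap if their ranges intersect
others_overlap = (petal_ranges.loc['versicolor', 'max']
                  >= petal_ranges.loc['virginica', 'min'])

print(petal_ranges)
print(f'Setosa separable by petal length alone: {setosa_separable}')
print(f'Versicolor/virginica ranges overlap: {others_overlap}')
```

A range comparison on a single feature is a coarse check. It shows whether one threshold is enough, not how well a trained classifier would perform.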

### Technical Skills Demonstrated:
- ✅ Pandas data loading and exploration
- ✅ Missing value detection (none found, so no handling was required)
- ✅ Groupby operations and aggregations
- ✅ Statistical analysis with describe()
- ✅ Multiple visualization techniques
- ✅ Plot customization and styling
- ✅ Error handling and data validation