# Unit 12 – Data Visualization & Reporting

## 🎯 Learning Objectives

By the end of this unit, you will be able to:

1. Understand the fundamentals of data visualization and why it matters
2. Create various types of plots using **Matplotlib**
3. Build beautiful statistical visualizations with **Seaborn**
4. Choose the appropriate chart type for different data scenarios
5. Tell compelling stories with data through effective visualization

---

## 📚 Table of Contents

1. [Visualization Fundamentals](#1-visualization-fundamentals)
2. [Matplotlib Basics](#2-matplotlib-basics)
3. [Seaborn Overview](#3-seaborn-overview)
4. [Choosing the Right Chart](#4-choosing-the-right-chart)
5. [Storytelling with Data](#5-storytelling-with-data)
6. [Practice Projects](#6-practice-projects)

---

# 1. Visualization Fundamentals

## Why Data Visualization Matters

Data visualization is the graphical representation of information and data. It transforms raw numbers into visual stories that our brains can process much faster than tables of numbers.

### The Power of Visualization

**Consider this:** A well-designed chart can communicate in seconds an insight that might take minutes to explain in words. (The often-quoted figure that the brain processes visuals "60,000 times faster" than text has no traceable research source, so treat it as a slogan rather than a measurement.)

### Key Principles of Effective Visualization

| Principle | Description | Example |
|-----------|-------------|----------|
| **Clarity** | The message should be immediately clear | Avoid 3D effects that distort perception |
| **Accuracy** | Data should be represented truthfully | Start y-axis at 0 for bar charts |
| **Efficiency** | Maximize data-ink ratio | Remove unnecessary gridlines and decorations |
| **Aesthetics** | Visual appeal increases engagement | Use consistent color palettes |

### The Grammar of Graphics

Every visualization consists of these components:

- **Data**: The information you want to visualize
- **Aesthetics**: How data maps to visual properties (position, color, size)
- **Geometries**: The visual elements (points, lines, bars)
- **Scales**: How data values translate to visual values
- **Labels & Annotations**: Context and explanation
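
The components listed above map fairly directly onto Matplotlib calls. The following is a minimal sketch rather than a canonical recipe; the study-hours data, variable names, and axis limits are made up for illustration.

```python
import matplotlib.pyplot as plt
import numpy as np

# Data: a small synthetic table with three variables (illustrative only)
rng = np.random.default_rng(0)
hours = rng.uniform(0, 10, 30)                  # hours studied
score = 50 + 4 * hours + rng.normal(0, 5, 30)   # exam score
section = rng.integers(0, 3, 30)                # class section 0..2

fig, ax = plt.subplots(figsize=(6, 4))

# Aesthetics + geometry: point geometry, with x/y position and color
# each encoding one variable
points = ax.scatter(hours, score, c=section, cmap='viridis', s=60)

# Scales: how data values translate to positions on each axis
ax.set_xlim(0, 10)
ax.set_ylim(30, 110)

# Labels & annotations: the context a reader needs
ax.set_xlabel('Hours studied')
ax.set_ylabel('Exam score')
ax.set_title('Score vs. study time')
fig.colorbar(points, ax=ax, label='Section')
```

In a notebook the figure renders at the end of the cell; in a script, call `plt.show()` afterwards.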

In [None]:
# First, let's install and import the necessary libraries
# !pip install matplotlib seaborn pandas numpy

import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np

# Set default style for better-looking plots
plt.style.use('seaborn-v0_8-whitegrid')
sns.set_palette("husl")

# Set random seed for reproducibility
np.random.seed(42)

print("✅ Libraries imported successfully!")
print(f"Matplotlib version: {plt.matplotlib.__version__}")
print(f"Seaborn version: {sns.__version__}")

### 🏭 Real-World Example: Why Visualization Beats Tables

Let's demonstrate with **Anscombe's Quartet** - four datasets that have nearly identical statistical properties but look completely different when visualized.

In [None]:
# Anscombe's Quartet - Famous example showing importance of visualization
anscombe = sns.load_dataset('anscombe')

# Calculate statistics for each dataset
stats = anscombe.groupby('dataset').agg({
    'x': ['mean', 'std'],
    'y': ['mean', 'std']
}).round(2)

print("📊 Statistics look almost identical for all 4 datasets:")
print(stats)
print("\n🤔 But are they really the same? Let's visualize...")

In [None]:
# Visualize Anscombe's Quartet
fig, axes = plt.subplots(2, 2, figsize=(12, 10))
fig.suptitle("Anscombe's Quartet: Same Statistics, Different Stories!", fontsize=14, fontweight='bold')

datasets = ['I', 'II', 'III', 'IV']
colors = ['#e74c3c', '#3498db', '#2ecc71', '#9b59b6']

for ax, dataset, color in zip(axes.flat, datasets, colors):
    data = anscombe[anscombe['dataset'] == dataset]
    ax.scatter(data['x'], data['y'], s=80, color=color, alpha=0.7, edgecolors='white', linewidth=2)
    ax.set_title(f'Dataset {dataset}', fontsize=12, fontweight='bold')
    ax.set_xlabel('X')
    ax.set_ylabel('Y')
    ax.set_xlim(3, 20)
    ax.set_ylim(3, 14)
    
    # Add regression line
    z = np.polyfit(data['x'], data['y'], 1)
    p = np.poly1d(z)
    ax.plot(data['x'].sort_values(), p(data['x'].sort_values()), 
            color=color, linestyle='--', alpha=0.5, linewidth=2)

plt.tight_layout()
plt.show()

print("\n💡 KEY INSIGHT: Always visualize your data! Statistics alone can be misleading.")

---

# 2. Matplotlib Basics

**Matplotlib** is the foundational plotting library in Python. It provides complete control over every aspect of your visualizations.

## Understanding the Matplotlib Architecture

```
Figure (the entire window/page)
│
├── Axes (individual plot area)
│   ├── Title
│   ├── X-axis (with label, ticks)
│   ├── Y-axis (with label, ticks)
│   ├── Plot elements (lines, bars, scatter)
│   └── Legend
│
└── Additional Axes (for subplots)
```
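
The hierarchy in the diagram can be inspected on a live figure. This is a small sketch with throwaway data, meant only to show where each piece lives on the `Figure` and `Axes` objects.

```python
import matplotlib.pyplot as plt

# One Figure containing two Axes
fig, axes = plt.subplots(1, 2, figsize=(8, 3))

axes[0].plot([1, 2, 3], [2, 4, 3], label='series')
axes[0].set_title('Left')
axes[0].legend()

axes[1].bar(['a', 'b'], [3, 5])
axes[1].set_title('Right')

# The Figure keeps a list of its Axes; each Axes owns its title,
# axis objects, and the artists drawn on it
print(len(fig.axes))                               # Axes in this Figure
print(axes[0].get_title())                         # title of the first Axes
print(type(axes[0].xaxis).__name__)                # the x-axis object
print(len(axes[0].lines), len(axes[1].patches))    # line and bar artists
```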

## Two Approaches to Matplotlib

1. **Pyplot Interface** (Quick and simple): `plt.plot()`
2. **Object-Oriented Interface** (More control): `fig, ax = plt.subplots()`
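
To make the difference concrete, here is the same kind of plot drawn both ways. It is a minimal sketch; the sine and cosine curves are placeholder data.

```python
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 2 * np.pi, 100)

# 1. Pyplot interface: functions act on an implicit "current" figure and axes
plt.figure(figsize=(5, 3))
plt.plot(x, np.sin(x))
plt.title('pyplot interface')
implicit_ax = plt.gca()  # the Axes pyplot created behind the scenes

# 2. Object-oriented interface: keep explicit handles and call their methods
fig, ax = plt.subplots(figsize=(5, 3))
ax.plot(x, np.cos(x))
ax.set_title('object-oriented interface')
```

The pyplot style is convenient for a quick single plot. With several subplots, explicit `ax` handles make it unambiguous which plot each call modifies, which is why the examples in this unit mostly use the object-oriented style.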

## 2.1 Line Plots

**Best for:** Time series data, trends over continuous intervals, showing change

### 🏢 Real-World Use Case: Company Revenue Over Time

In [None]:
# Create sample data: Monthly revenue for a tech startup (2024-2025)
months = pd.date_range('2024-01', periods=24, freq='MS')  # 'MS' = month start; the 'M' alias is deprecated in pandas 2.2+
revenue = [45000, 52000, 48000, 61000, 67000, 72000, 78000, 85000, 91000, 98000, 105000, 125000,
           132000, 145000, 158000, 172000, 185000, 195000, 210000, 225000, 242000, 265000, 285000, 310000]
expenses = [40000, 45000, 43000, 50000, 55000, 58000, 62000, 68000, 72000, 78000, 82000, 95000,
            100000, 108000, 115000, 125000, 132000, 140000, 150000, 160000, 172000, 185000, 195000, 210000]

# Create the line plot
fig, ax = plt.subplots(figsize=(14, 6))

# Plot revenue and expenses
ax.plot(months, revenue, marker='o', linewidth=2.5, markersize=6, 
        color='#27ae60', label='Revenue', linestyle='-')
ax.plot(months, expenses, marker='s', linewidth=2.5, markersize=6, 
        color='#e74c3c', label='Expenses', linestyle='--')

# Fill the profit area
ax.fill_between(months, expenses, revenue, alpha=0.3, color='#27ae60', label='Profit')

# Customization
ax.set_title('🚀 TechStartup Inc. - Financial Performance (2024-2025)', fontsize=14, fontweight='bold', pad=15)
ax.set_xlabel('Month', fontsize=11)
ax.set_ylabel('Amount ($)', fontsize=11)
ax.legend(loc='upper left', frameon=True, fancybox=True, shadow=True)
ax.grid(True, alpha=0.3)

# Format y-axis to show currency
ax.yaxis.set_major_formatter(plt.FuncFormatter(lambda x, p: f'${x/1000:.0f}K'))

# Add annotation for milestone
ax.annotate('Break-even\nAchieved! 🎉', xy=(months[5], revenue[5]), xytext=(months[3], 85000),
            fontsize=10, ha='center',
            arrowprops=dict(arrowstyle='->', color='gray', lw=1.5),
            bbox=dict(boxstyle='round,pad=0.3', facecolor='yellow', alpha=0.7))

plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

In [None]:
# Multiple line styles and markers demonstration
fig, ax = plt.subplots(figsize=(12, 5))

x = np.linspace(0, 10, 50)

# Different line styles
styles = [
    ('-', 'o', 'Solid + Circle'),
    ('--', 's', 'Dashed + Square'),
    ('-.', '^', 'Dash-dot + Triangle'),
    (':', 'D', 'Dotted + Diamond')
]

for i, (linestyle, marker, label) in enumerate(styles):
    ax.plot(x, np.sin(x + i*0.5), linestyle=linestyle, marker=marker, 
            markevery=5, label=label, linewidth=2, markersize=8)

ax.set_title('📖 Matplotlib Line Styles & Markers Reference', fontsize=12, fontweight='bold')
ax.legend(loc='upper right')
ax.set_xlabel('X values')
ax.set_ylabel('Y values')
plt.tight_layout()
plt.show()

## 2.2 Bar Charts

**Best for:** Comparing discrete categories, showing rankings, part-to-whole relationships

### 🛒 Real-World Use Case: E-commerce Sales by Product Category

In [None]:
# E-commerce sales data
categories = ['Electronics', 'Clothing', 'Home & Garden', 'Books', 'Sports', 'Toys', 'Beauty']
sales_q1 = [125000, 98000, 76000, 45000, 67000, 52000, 89000]
sales_q2 = [142000, 105000, 82000, 48000, 78000, 61000, 95000]

x = np.arange(len(categories))
width = 0.35

fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# Grouped Bar Chart
ax1 = axes[0]
bars1 = ax1.bar(x - width/2, sales_q1, width, label='Q1 2025', color='#3498db', edgecolor='white')
bars2 = ax1.bar(x + width/2, sales_q2, width, label='Q2 2025', color='#e74c3c', edgecolor='white')

ax1.set_title('📊 Quarterly Sales Comparison by Category', fontsize=12, fontweight='bold')
ax1.set_xlabel('Product Category')
ax1.set_ylabel('Sales ($)')
ax1.set_xticks(x)
ax1.set_xticklabels(categories, rotation=45, ha='right')
ax1.legend()
ax1.yaxis.set_major_formatter(plt.FuncFormatter(lambda x, p: f'${x/1000:.0f}K'))

# Add value labels on bars
for bar in bars1:
    height = bar.get_height()
    ax1.annotate(f'${height/1000:.0f}K', xy=(bar.get_x() + bar.get_width() / 2, height),
                 xytext=(0, 3), textcoords="offset points", ha='center', va='bottom', fontsize=8)

# Horizontal Bar Chart (for rankings)
ax2 = axes[1]
total_sales = [q1 + q2 for q1, q2 in zip(sales_q1, sales_q2)]
sorted_data = sorted(zip(categories, total_sales), key=lambda x: x[1])
cats_sorted, sales_sorted = zip(*sorted_data)

colors = plt.cm.RdYlGn(np.linspace(0.2, 0.8, len(cats_sorted)))
bars = ax2.barh(cats_sorted, sales_sorted, color=colors, edgecolor='white')

ax2.set_title('🏆 Total Sales Ranking (Q1 + Q2)', fontsize=12, fontweight='bold')
ax2.set_xlabel('Total Sales ($)')
ax2.xaxis.set_major_formatter(plt.FuncFormatter(lambda x, p: f'${x/1000:.0f}K'))

# Add value labels
for bar, val in zip(bars, sales_sorted):
    ax2.text(val + 5000, bar.get_y() + bar.get_height()/2, f'${val/1000:.0f}K',
             va='center', fontsize=9)

plt.tight_layout()
plt.show()

In [None]:
# Stacked Bar Chart - Market Share Analysis
companies = ['Apple', 'Samsung', 'Xiaomi', 'Oppo', 'Others']
years = ['2021', '2022', '2023', '2024', '2025']

# Market share data (percentage)
market_data = {
    'Apple':   [23, 24, 26, 27, 28],
    'Samsung': [21, 22, 21, 20, 19],
    'Xiaomi':  [14, 15, 14, 15, 16],
    'Oppo':    [10, 11, 12, 12, 13],
    'Others':  [32, 28, 27, 26, 24]
}

fig, ax = plt.subplots(figsize=(12, 7))

colors = ['#1f77b4', '#ff7f0e', '#2ca02c', '#d62728', '#9467bd']
bottom = np.zeros(len(years))

for company, color in zip(companies, colors):
    values = market_data[company]
    ax.bar(years, values, bottom=bottom, label=company, color=color, edgecolor='white', width=0.6)
    
    # Add percentage labels in the middle of each segment
    for i, (y, v) in enumerate(zip(years, values)):
        if v > 5:  # Only show label if segment is large enough
            ax.text(i, bottom[i] + v/2, f'{v}%', ha='center', va='center', 
                   fontsize=9, color='white', fontweight='bold')
    
    bottom += values

ax.set_title('📱 Global Smartphone Market Share (2021-2025)', fontsize=14, fontweight='bold')
ax.set_xlabel('Year', fontsize=11)
ax.set_ylabel('Market Share (%)', fontsize=11)
ax.legend(loc='upper left', bbox_to_anchor=(1, 1))
ax.set_ylim(0, 105)

plt.tight_layout()
plt.show()

## 2.3 Scatter Plots

**Best for:** Showing relationships between two variables, identifying correlations, outlier detection

### 🏠 Real-World Use Case: Real Estate Price Analysis

In [None]:
# Generate realistic real estate data
np.random.seed(42)
n_houses = 150

# House features
sqft = np.random.normal(2000, 500, n_houses).clip(800, 4000)
bedrooms = np.random.choice([2, 3, 4, 5], n_houses, p=[0.15, 0.40, 0.35, 0.10])
age = np.random.exponential(15, n_houses).clip(0, 50)

# Price with some realistic relationships
base_price = 100000
price = (base_price + 
         sqft * 150 + 
         bedrooms * 25000 - 
         age * 2000 + 
         np.random.normal(0, 30000, n_houses))

# Create DataFrame
housing_df = pd.DataFrame({
    'sqft': sqft,
    'bedrooms': bedrooms,
    'age': age,
    'price': price
})

# Create multi-dimensional scatter plot
fig, axes = plt.subplots(1, 3, figsize=(16, 5))

# Plot 1: Price vs Square Footage (color by bedrooms)
scatter1 = axes[0].scatter(housing_df['sqft'], housing_df['price']/1000, 
                           c=housing_df['bedrooms'], cmap='viridis', 
                           s=80, alpha=0.7, edgecolors='white')
axes[0].set_xlabel('Square Footage', fontsize=11)
axes[0].set_ylabel('Price ($K)', fontsize=11)
axes[0].set_title('🏠 Price vs Size (colored by bedrooms)', fontsize=11, fontweight='bold')
cbar1 = plt.colorbar(scatter1, ax=axes[0])
cbar1.set_label('Bedrooms')

# Add trend line
z = np.polyfit(housing_df['sqft'], housing_df['price']/1000, 1)
p = np.poly1d(z)
x_line = np.linspace(housing_df['sqft'].min(), housing_df['sqft'].max(), 100)
axes[0].plot(x_line, p(x_line), 'r--', linewidth=2, alpha=0.7, label='Trend')
axes[0].legend()

# Plot 2: Price vs Age (color by size)
scatter2 = axes[1].scatter(housing_df['age'], housing_df['price']/1000, 
                           c=housing_df['sqft'], cmap='plasma', 
                           s=80, alpha=0.7, edgecolors='white')
axes[1].set_xlabel('Age (years)', fontsize=11)
axes[1].set_ylabel('Price ($K)', fontsize=11)
axes[1].set_title('📉 Price vs Age (colored by sqft)', fontsize=11, fontweight='bold')
cbar2 = plt.colorbar(scatter2, ax=axes[1])
cbar2.set_label('Square Feet')

# Plot 3: Bubble chart - Price vs Sqft, bubble size = age
scatter3 = axes[2].scatter(housing_df['sqft'], housing_df['price']/1000,
                           s=housing_df['age']*10, alpha=0.5, 
                           c=housing_df['bedrooms'], cmap='coolwarm',
                           edgecolors='gray', linewidth=0.5)
axes[2].set_xlabel('Square Footage', fontsize=11)
axes[2].set_ylabel('Price ($K)', fontsize=11)
axes[2].set_title('🔵 Bubble Chart (size = age)', fontsize=11, fontweight='bold')
cbar3 = plt.colorbar(scatter3, ax=axes[2])
cbar3.set_label('Bedrooms')

plt.tight_layout()
plt.show()

# Print correlation
print("\n📊 Correlation Analysis:")
print(housing_df[['sqft', 'bedrooms', 'age', 'price']].corr()['price'].sort_values(ascending=False))

## 2.4 Other Essential Matplotlib Charts

In [None]:
# Pie Charts and Donut Charts
fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# Data: Website Traffic Sources
sources = ['Organic Search', 'Direct', 'Social Media', 'Email', 'Referral', 'Paid Ads']
traffic = [35, 25, 18, 12, 6, 4]
colors = ['#2ecc71', '#3498db', '#e74c3c', '#f39c12', '#9b59b6', '#1abc9c']
explode = (0.05, 0, 0, 0, 0, 0)  # Emphasize largest slice

# Pie Chart
axes[0].pie(traffic, labels=sources, autopct='%1.1f%%', startangle=90,
            colors=colors, explode=explode, shadow=True,
            textprops={'fontsize': 10})
axes[0].set_title('🌐 Website Traffic Sources\n(Pie Chart)', fontsize=12, fontweight='bold')

# Donut Chart (more modern look)
wedges, texts, autotexts = axes[1].pie(traffic, labels=sources, autopct='%1.1f%%', 
                                        startangle=90, colors=colors, 
                                        wedgeprops=dict(width=0.5), pctdistance=0.75)
axes[1].set_title('🍩 Website Traffic Sources\n(Donut Chart)', fontsize=12, fontweight='bold')

# Add center text for donut
axes[1].text(0, 0, '100K\nVisitors', ha='center', va='center', fontsize=14, fontweight='bold')

plt.tight_layout()
plt.show()

In [None]:
# Histograms - Understanding Data Distribution
# Real-World Use Case: Employee Salary Distribution

np.random.seed(42)

# Generate salary data for different departments
engineering_salaries = np.random.normal(95000, 15000, 200)
sales_salaries = np.random.normal(75000, 20000, 150)
hr_salaries = np.random.normal(65000, 10000, 50)

fig, axes = plt.subplots(1, 3, figsize=(16, 5))

# Simple Histogram
axes[0].hist(engineering_salaries, bins=20, color='#3498db', edgecolor='white', alpha=0.7)
axes[0].axvline(engineering_salaries.mean(), color='red', linestyle='--', linewidth=2, label=f'Mean: ${engineering_salaries.mean():,.0f}')
axes[0].set_title('👨‍💻 Engineering Salaries', fontsize=11, fontweight='bold')
axes[0].set_xlabel('Salary ($)')
axes[0].set_ylabel('Frequency')
axes[0].legend()
axes[0].xaxis.set_major_formatter(plt.FuncFormatter(lambda x, p: f'${x/1000:.0f}K'))

# Overlapping Histograms (Comparison)
axes[1].hist(engineering_salaries, bins=20, alpha=0.6, label='Engineering', color='#3498db')
axes[1].hist(sales_salaries, bins=20, alpha=0.6, label='Sales', color='#e74c3c')
axes[1].hist(hr_salaries, bins=20, alpha=0.6, label='HR', color='#2ecc71')
axes[1].set_title('📊 Salary Distribution by Department', fontsize=11, fontweight='bold')
axes[1].set_xlabel('Salary ($)')
axes[1].set_ylabel('Frequency')
axes[1].legend()
axes[1].xaxis.set_major_formatter(plt.FuncFormatter(lambda x, p: f'${x/1000:.0f}K'))

# Histogram with KDE (Kernel Density Estimation)
all_salaries = np.concatenate([engineering_salaries, sales_salaries, hr_salaries])
axes[2].hist(all_salaries, bins=30, density=True, alpha=0.6, color='#9b59b6', edgecolor='white')

# Add KDE line
from scipy import stats
kde = stats.gaussian_kde(all_salaries)
x_range = np.linspace(all_salaries.min(), all_salaries.max(), 100)
axes[2].plot(x_range, kde(x_range), 'r-', linewidth=2, label='KDE')

axes[2].set_title('📈 Company-Wide Salary Distribution', fontsize=11, fontweight='bold')
axes[2].set_xlabel('Salary ($)')
axes[2].set_ylabel('Density')
axes[2].legend()
axes[2].xaxis.set_major_formatter(plt.FuncFormatter(lambda x, p: f'${x/1000:.0f}K'))

plt.tight_layout()
plt.show()

In [None]:
# Box Plots - Understanding Data Spread and Outliers
# Real-World Use Case: Customer Satisfaction Scores by Region

np.random.seed(42)

# Generate satisfaction scores (1-10 scale)
regions = ['North America', 'Europe', 'Asia Pacific', 'Latin America', 'Middle East']
data = {
    'North America': np.random.normal(7.5, 1.2, 100).clip(1, 10),
    'Europe': np.random.normal(8.0, 0.8, 100).clip(1, 10),
    'Asia Pacific': np.random.normal(7.0, 1.5, 100).clip(1, 10),
    'Latin America': np.random.normal(6.5, 1.8, 100).clip(1, 10),
    'Middle East': np.random.normal(7.2, 1.3, 100).clip(1, 10)
}

fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# Standard Box Plot
bp1 = axes[0].boxplot([data[r] for r in regions], patch_artist=True)  # tick labels are set below; the `labels` argument is deprecated in Matplotlib 3.9+
colors = ['#3498db', '#e74c3c', '#2ecc71', '#f39c12', '#9b59b6']
for patch, color in zip(bp1['boxes'], colors):
    patch.set_facecolor(color)
    patch.set_alpha(0.7)

axes[0].set_title('📦 Customer Satisfaction by Region\n(Box Plot)', fontsize=11, fontweight='bold')
axes[0].set_ylabel('Satisfaction Score (1-10)')
axes[0].set_xticklabels(regions, rotation=30, ha='right')
axes[0].axhline(y=7, color='green', linestyle='--', alpha=0.7, label='Target (7.0)')
axes[0].legend()

# Violin Plot (shows distribution shape)
parts = axes[1].violinplot([data[r] for r in regions], showmeans=True, showmedians=True)
for i, pc in enumerate(parts['bodies']):
    pc.set_facecolor(colors[i])
    pc.set_alpha(0.7)

axes[1].set_title('🎻 Customer Satisfaction by Region\n(Violin Plot)', fontsize=11, fontweight='bold')
axes[1].set_ylabel('Satisfaction Score (1-10)')
axes[1].set_xticks(range(1, len(regions) + 1))
axes[1].set_xticklabels(regions, rotation=30, ha='right')

plt.tight_layout()
plt.show()

print("\n📊 Summary Statistics:")
for region in regions:
    print(f"  {region}: Mean={np.mean(data[region]):.2f}, Median={np.median(data[region]):.2f}, Std={np.std(data[region]):.2f}")

## 2.5 Matplotlib Subplots and Layouts

In [None]:
# Creating Complex Layouts with GridSpec
from matplotlib.gridspec import GridSpec

# Sample data for dashboard
np.random.seed(42)
months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun']
sales = [120, 150, 180, 140, 200, 220]
customers = [800, 950, 1100, 900, 1300, 1400]
categories = ['Product A', 'Product B', 'Product C', 'Product D']
cat_sales = [35, 30, 20, 15]

# Create figure with custom layout
fig = plt.figure(figsize=(14, 10))
gs = GridSpec(3, 3, figure=fig, hspace=0.3, wspace=0.3)

# Large plot spanning 2 columns
ax1 = fig.add_subplot(gs[0, :2])
ax1.plot(months, sales, marker='o', linewidth=2, markersize=10, color='#3498db')
ax1.fill_between(months, sales, alpha=0.3, color='#3498db')
ax1.set_title('📈 Monthly Sales Trend', fontweight='bold')
ax1.set_ylabel('Sales ($K)')

# Small pie chart
ax2 = fig.add_subplot(gs[0, 2])
ax2.pie(cat_sales, labels=categories, autopct='%1.0f%%', colors=['#e74c3c', '#3498db', '#2ecc71', '#f39c12'])
ax2.set_title('🥧 Sales by Product', fontweight='bold')

# Bar chart spanning full width
ax3 = fig.add_subplot(gs[1, :])
x = np.arange(len(months))
width = 0.35
ax3.bar(x - width/2, sales, width, label='Sales ($K)', color='#3498db')
ax3.bar(x + width/2, [c/10 for c in customers], width, label='Customers (×10)', color='#e74c3c')
ax3.set_xticks(x)
ax3.set_xticklabels(months)
ax3.legend()
ax3.set_title('📊 Sales vs Customer Count', fontweight='bold')

# Three small plots at the bottom
ax4 = fig.add_subplot(gs[2, 0])
ax4.hist(np.random.normal(100, 20, 500), bins=20, color='#9b59b6', edgecolor='white')
ax4.set_title('📉 Order Value Distribution', fontweight='bold')

ax5 = fig.add_subplot(gs[2, 1])
ax5.scatter(np.random.rand(50)*100, np.random.rand(50)*100, c=np.random.rand(50), cmap='viridis', s=100, alpha=0.6)
ax5.set_title('🔵 Customer Segments', fontweight='bold')

ax6 = fig.add_subplot(gs[2, 2])
ax6.barh(['Excellent', 'Good', 'Average', 'Poor'], [45, 35, 15, 5], color=['#2ecc71', '#3498db', '#f39c12', '#e74c3c'])
ax6.set_title('⭐ Customer Ratings', fontweight='bold')

fig.suptitle('🎯 Sales Dashboard - H1 2025', fontsize=16, fontweight='bold', y=1.02)  # data covers Jan-Jun, i.e. the first half of the year
plt.tight_layout()
plt.show()

---

# 3. Seaborn Overview

**Seaborn** is built on top of Matplotlib and provides a high-level interface for creating attractive statistical graphics. It integrates closely with Pandas DataFrames.

## Key Advantages of Seaborn:

- Beautiful default styles
- Built-in themes and color palettes
- Easy statistical visualizations
- Direct DataFrame integration
- Automatic handling of categorical data
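
As a quick illustration of the DataFrame integration and categorical handling listed above, the sketch below draws a grouped bar chart from column names alone. The small DataFrame is made up so the example runs without downloading a dataset; its column names are illustrative, not part of any Seaborn dataset.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up data: two bills per day/time combination
df = pd.DataFrame({
    'day':  ['Thu'] * 4 + ['Fri'] * 4 + ['Sat'] * 4,
    'time': ['Lunch', 'Lunch', 'Dinner', 'Dinner'] * 3,
    'bill': [14.0, 16.5, 22.5, 24.0,
             12.0, 13.5, 25.0, 27.5,
             18.0, 19.0, 30.5, 33.0],
})

fig, ax = plt.subplots(figsize=(6, 4))

# Column names go straight in: Seaborn groups by `day`, splits by `time`,
# averages `bill` within each group, assigns colors, and builds the legend
returned_ax = sns.barplot(data=df, x='day', y='bill', hue='time', ax=ax)

# Axes-level Seaborn functions return the Matplotlib Axes they drew on,
# so ordinary Matplotlib methods still apply
ax.set_ylabel('Bill ($)')
ax.set_title('Average bill by day and time')
```

Producing the same chart in plain Matplotlib would mean computing the group means and bar offsets by hand, as in the grouped bar chart of section 2.2.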

In [None]:
# Load built-in datasets for examples
tips = sns.load_dataset('tips')
iris = sns.load_dataset('iris')
titanic = sns.load_dataset('titanic')

print("📊 Available Datasets Loaded:")
print(f"\n1. Tips Dataset (Restaurant data): {tips.shape[0]} records")
print(tips.head())
print(f"\n2. Iris Dataset (Flower measurements): {iris.shape[0]} records")
print(iris.head())

## 3.1 Seaborn Distribution Plots

In [None]:
# Distribution Plots with Seaborn
fig, axes = plt.subplots(2, 2, figsize=(14, 12))

# 1. Histogram with KDE overlay (histplot)
sns.histplot(tips['total_bill'], kde=True, ax=axes[0, 0], color='#3498db')
axes[0, 0].set_title('📊 Distribution of Total Bills', fontweight='bold')
axes[0, 0].set_xlabel('Total Bill ($)')

# 2. KDE Plot by Category
sns.kdeplot(data=tips, x='total_bill', hue='time', ax=axes[0, 1], fill=True, alpha=0.5)
axes[0, 1].set_title('📈 Bill Distribution: Lunch vs Dinner', fontweight='bold')
axes[0, 1].set_xlabel('Total Bill ($)')

# 3. Box Plot by Category
sns.boxplot(data=tips, x='day', y='total_bill', hue='sex', ax=axes[1, 0], palette='Set2')
axes[1, 0].set_title('📦 Bills by Day and Gender', fontweight='bold')
axes[1, 0].set_xlabel('Day of Week')
axes[1, 0].set_ylabel('Total Bill ($)')

# 4. Violin Plot
sns.violinplot(data=tips, x='day', y='tip', hue='smoker', split=True, ax=axes[1, 1], palette='muted')
axes[1, 1].set_title('🎻 Tip Distribution: Smokers vs Non-Smokers', fontweight='bold')
axes[1, 1].set_xlabel('Day of Week')
axes[1, 1].set_ylabel('Tip ($)')

plt.tight_layout()
plt.show()

## 3.2 Seaborn Relationship Plots

In [None]:
# Relationship Plots with Seaborn
fig, axes = plt.subplots(2, 2, figsize=(14, 12))

# 1. Scatter Plot with Regression Line
sns.regplot(data=tips, x='total_bill', y='tip', ax=axes[0, 0], 
            scatter_kws={'alpha': 0.6}, line_kws={'color': 'red'})
axes[0, 0].set_title('📈 Tip vs Total Bill (with regression)', fontweight='bold')

# 2. Scatter Plot with Categories
sns.scatterplot(data=tips, x='total_bill', y='tip', hue='time', style='smoker', 
                size='size', sizes=(50, 200), ax=axes[0, 1], palette='deep')
axes[0, 1].set_title('🔵 Multi-dimensional Scatter Plot', fontweight='bold')

# 3. Scatter plot colored by species (sns.jointplot is figure-level, so it cannot draw into this grid)
sns.scatterplot(data=iris, x='sepal_length', y='sepal_width', hue='species', 
                ax=axes[1, 0], palette='viridis', s=80, alpha=0.7)
axes[1, 0].set_title('🌸 Iris: Sepal Length vs Width', fontweight='bold')

# 4. Strip Plot (Categorical Scatter)
sns.stripplot(data=tips, x='day', y='total_bill', hue='sex', 
              dodge=True, ax=axes[1, 1], alpha=0.7, palette='Set1')
axes[1, 1].set_title('📍 Individual Bills by Day and Gender', fontweight='bold')

plt.tight_layout()
plt.show()

In [None]:
# Pair Plot - Visualizing all pairwise relationships
# Real-World Use Case: Exploring relationships in the Iris dataset

g = sns.pairplot(iris, hue='species', diag_kind='kde', 
                 plot_kws={'alpha': 0.6, 's': 80, 'edgecolor': 'white'},
                 palette='husl')
g.fig.suptitle('🌺 Iris Dataset: Pairwise Feature Relationships', y=1.02, fontsize=14, fontweight='bold')
plt.show()

print("\n💡 INSIGHT: Pair plots reveal that setosa is easily separable from the other species,")
print("   while versicolor and virginica have more overlap.")

## 3.3 Seaborn Categorical Plots

In [None]:
# Categorical Plots with Seaborn
fig, axes = plt.subplots(2, 2, figsize=(14, 12))

# 1. Count Plot (Frequency of categories)
sns.countplot(data=titanic, x='class', hue='survived', ax=axes[0, 0], palette='RdYlGn')
axes[0, 0].set_title('🚢 Titanic Survival by Class', fontweight='bold')
axes[0, 0].legend(title='Survived', labels=['No', 'Yes'])

# 2. Bar Plot with Error Bars (Shows mean + confidence interval)
sns.barplot(data=tips, x='day', y='total_bill', hue='sex', ax=axes[0, 1], palette='pastel')
axes[0, 1].set_title('📊 Average Bill by Day and Gender', fontweight='bold')

# 3. Point Plot (Good for showing trends)
sns.pointplot(data=tips, x='day', y='total_bill', hue='time', ax=axes[1, 0],
              markers=['o', 's'], linestyles=['-', '--'], palette='dark')
axes[1, 0].set_title('📍 Average Bill Trends: Lunch vs Dinner', fontweight='bold')

# 4. Swarm Plot (Non-overlapping categorical scatter)
sns.swarmplot(data=tips, x='day', y='tip', hue='time', ax=axes[1, 1], palette='Set2', size=6)
axes[1, 1].set_title('🐝 Tip Distribution by Day (Swarm Plot)', fontweight='bold')

plt.tight_layout()
plt.show()

## 3.4 Heatmaps and Correlation Matrices

### 🏥 Real-World Use Case: Hospital Performance Metrics

In [None]:
# Create sample hospital data
np.random.seed(42)

departments = ['Cardiology', 'Neurology', 'Orthopedics', 'Pediatrics', 'Oncology', 'Emergency']
metrics = ['Patient Satisfaction', 'Wait Time Score', 'Treatment Success', 'Staff Efficiency', 'Cost Efficiency']

# Generate performance scores (0-100)
performance_data = np.random.randint(60, 100, size=(len(departments), len(metrics)))
performance_df = pd.DataFrame(performance_data, index=departments, columns=metrics)

fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# 1. Performance Heatmap
sns.heatmap(performance_df, annot=True, fmt='d', cmap='RdYlGn', 
            ax=axes[0], linewidths=0.5, cbar_kws={'label': 'Score'})
axes[0].set_title('🏥 Hospital Department Performance Scores', fontsize=12, fontweight='bold')
axes[0].set_xlabel('Performance Metrics')
axes[0].set_ylabel('Departments')

# 2. Correlation Heatmap
# Using tips dataset for correlation example
tips_numeric = tips[['total_bill', 'tip', 'size']].copy()
tips_numeric['tip_pct'] = tips['tip'] / tips['total_bill'] * 100
correlation = tips_numeric.corr()

mask = np.triu(np.ones_like(correlation, dtype=bool))  # Upper triangle mask
sns.heatmap(correlation, annot=True, fmt='.2f', cmap='coolwarm', 
            ax=axes[1], mask=mask, center=0, linewidths=0.5,
            square=True, cbar_kws={'label': 'Correlation'})
axes[1].set_title('🔗 Correlation Matrix: Tips Dataset', fontsize=12, fontweight='bold')

plt.tight_layout()
plt.show()

In [None]:
# Clustered Heatmap - Shows hierarchical clustering
# Real-World Use Case: Customer Segmentation Based on Behavior

np.random.seed(42)

# Generate customer behavior data
n_customers = 20
behaviors = ['Purchase Frequency', 'Avg Order Value', 'Website Visits', 
             'Email Opens', 'Support Tickets', 'Return Rate']

# Create customer IDs
customer_ids = [f'Customer_{i:03d}' for i in range(1, n_customers + 1)]

# Generate behavioral scores
behavior_data = np.random.rand(n_customers, len(behaviors)) * 100
behavior_df = pd.DataFrame(behavior_data, index=customer_ids, columns=behaviors)

# Create clustered heatmap
g = sns.clustermap(behavior_df, cmap='viridis', figsize=(12, 10),
                   standard_scale=1, linewidths=0.5,
                   cbar_kws={'label': 'Standardized Score'})
g.fig.suptitle('👥 Customer Segmentation: Behavioral Clustering', fontsize=14, fontweight='bold', y=1.02)
plt.show()

print("\n💡 INSIGHT: Clustered heatmaps automatically group similar customers and behaviors,")
print("   making it easy to identify customer segments and behavioral patterns.")

## 3.5 FacetGrid - Multi-plot Layouts

In [None]:
# FacetGrid - Creating multiple plots based on categorical variables

# Example 1: Scatter plots faceted by time and smoker status
g = sns.FacetGrid(tips, col='time', row='smoker', height=4, aspect=1.2)
g.map_dataframe(sns.scatterplot, x='total_bill', y='tip', hue='day', palette='deep', alpha=0.7)
g.add_legend(title='Day')
g.fig.suptitle('💰 Tips vs Bills: Segmented Analysis', y=1.02, fontsize=14, fontweight='bold')
plt.show()

In [None]:
# Example 2: Distribution plots faceted by species
g = sns.FacetGrid(iris, col='species', height=4, aspect=1)
g.map(sns.histplot, 'petal_length', kde=True, color='#3498db')
g.fig.suptitle('🌸 Petal Length Distribution by Species', y=1.02, fontsize=14, fontweight='bold')
plt.show()

---

# 4. Choosing the Right Chart

Selecting the appropriate visualization is crucial for effective communication. Use this guide:

## Chart Selection Decision Tree

```
What do you want to show?
│
├── COMPARISON
│   ├── Among items → Bar Chart (vertical or horizontal)
│   ├── Over time → Line Chart
│   └── Many categories → Grouped/Stacked Bar
│
├── DISTRIBUTION
│   ├── Single variable → Histogram, Box Plot
│   ├── Multiple groups → Multiple Histograms, Violin Plot
│   └── Show shape → KDE Plot
│
├── RELATIONSHIP
│   ├── Two variables → Scatter Plot
│   ├── Three+ variables → Bubble Chart, Color-coded Scatter
│   └── Many variables → Pair Plot, Heatmap
│
├── COMPOSITION
│   ├── Static → Pie/Donut Chart (≤7 categories)
│   └── Over time → Stacked Area/Bar Chart
│
└── TREND
    ├── Time-based → Line Chart
    └── With uncertainty → Line + Confidence Band
```
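
The last branch of the tree, a line with a confidence band, can be sketched with `fill_between`. The data below is synthetic, and the band uses a normal approximation (mean ± 1.96 standard errors) as a simplifying assumption; with only 20 samples per day a t-based interval would be slightly wider.

```python
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(1)
days = np.arange(30)

# Synthetic data: 20 noisy repeated measurements per day around a rising trend
samples = 100 + 2 * days + rng.normal(0, 10, size=(20, 30))

mean = samples.mean(axis=0)
sem = samples.std(axis=0, ddof=1) / np.sqrt(samples.shape[0])  # standard error
lower = mean - 1.96 * sem   # approximate 95% band (normal approximation)
upper = mean + 1.96 * sem

fig, ax = plt.subplots(figsize=(8, 4))
ax.plot(days, mean, color='#3498db', linewidth=2, label='Mean')
ax.fill_between(days, lower, upper, color='#3498db', alpha=0.25,
                label='Approx. 95% CI')
ax.set_xlabel('Day')
ax.set_ylabel('Value')
ax.set_title('Trend with uncertainty band')
ax.legend()
```

Drawing the band in the same hue as the line, with low alpha, keeps the estimate and its uncertainty visually tied together without hiding the trend.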

In [None]:
# Visual Reference: Choosing the Right Chart

fig, axes = plt.subplots(2, 4, figsize=(18, 10))
fig.suptitle('📊 Chart Type Reference Guide', fontsize=16, fontweight='bold', y=1.02)

np.random.seed(42)

# 1. Bar Chart - Comparison
categories = ['A', 'B', 'C', 'D']
values = [23, 45, 56, 78]
axes[0, 0].bar(categories, values, color='#3498db', edgecolor='white')
axes[0, 0].set_title('📊 Bar Chart\n"Comparison among items"', fontweight='bold')

# 2. Line Chart - Trends
x = np.arange(10)
y = np.cumsum(np.random.randn(10))
axes[0, 1].plot(x, y, marker='o', color='#e74c3c', linewidth=2)
axes[0, 1].set_title('📈 Line Chart\n"Trends over time"', fontweight='bold')

# 3. Scatter Plot - Relationship
x = np.random.rand(50) * 100
y = x * 0.7 + np.random.randn(50) * 15
axes[0, 2].scatter(x, y, alpha=0.6, color='#2ecc71', s=60)
axes[0, 2].set_title('🔵 Scatter Plot\n"Relationships"', fontweight='bold')

# 4. Pie Chart - Composition
sizes = [35, 30, 20, 15]
axes[0, 3].pie(sizes, labels=['A', 'B', 'C', 'D'], autopct='%1.0f%%', 
               colors=['#3498db', '#e74c3c', '#2ecc71', '#f39c12'])
axes[0, 3].set_title('🥧 Pie Chart\n"Part-to-whole"', fontweight='bold')

# 5. Histogram - Distribution
data = np.random.normal(50, 15, 500)
axes[1, 0].hist(data, bins=20, color='#9b59b6', edgecolor='white')
axes[1, 0].set_title('📉 Histogram\n"Distribution"', fontweight='bold')

# 6. Box Plot - Distribution + Outliers
data = [np.random.normal(0, std, 100) for std in range(1, 5)]
bp = axes[1, 1].boxplot(data, patch_artist=True)
colors = ['#3498db', '#e74c3c', '#2ecc71', '#f39c12']
for patch, color in zip(bp['boxes'], colors):
    patch.set_facecolor(color)
axes[1, 1].set_title('📦 Box Plot\n"Spread & Outliers"', fontweight='bold')

# 7. Heatmap - Matrix Data
matrix = np.random.rand(5, 5)
im = axes[1, 2].imshow(matrix, cmap='YlOrRd')
axes[1, 2].set_title('🔥 Heatmap\n"Matrix values"', fontweight='bold')

# 8. Area Chart - Cumulative Trends
x = np.arange(10)
y1 = np.random.randint(20, 50, 10)
y2 = np.random.randint(10, 30, 10)
# stackplot draws B on top of A, so the top edge shows the cumulative total
axes[1, 3].stackplot(x, y1, y2, labels=['A', 'B'], alpha=0.5)
axes[1, 3].legend()
axes[1, 3].set_title('📊 Area Chart\n"Cumulative trends"', fontweight='bold')

plt.tight_layout()
plt.show()

## Common Visualization Mistakes to Avoid

| ❌ Mistake | ✅ Better Practice |
|-----------|--------------------|
| Using 3D charts for 2D data | Stick to 2D visualizations |
| Pie charts with many categories | Use bar charts for >7 categories |
| Truncating y-axis in bar charts | Start y-axis at 0 |
| Rainbow color scales for sequential data | Use sequential color palettes |
| Too many elements on one chart | Simplify or use multiple charts |
| Missing axis labels/titles | Always include clear labels |

In [None]:
# Demonstration: Good vs Bad Practices

fig, axes = plt.subplots(2, 2, figsize=(14, 12))

# Example data
companies = ['Company A', 'Company B']
profits = [100, 105]  # Very similar values

# BAD: Truncated Y-axis (misleading!)
axes[0, 0].bar(companies, profits, color=['#e74c3c', '#3498db'])
axes[0, 0].set_ylim(98, 106)  # Truncated!
axes[0, 0].set_title('❌ BAD: Truncated Y-Axis\n(Makes 5% difference look huge!)', fontweight='bold', color='red')
axes[0, 0].set_ylabel('Profit ($M)')

# GOOD: Full Y-axis
axes[0, 1].bar(companies, profits, color=['#e74c3c', '#3498db'])
axes[0, 1].set_ylim(0, 120)  # Starts at 0
axes[0, 1].set_title('✅ GOOD: Y-Axis Starting at 0\n(Shows true proportion)', fontweight='bold', color='green')
axes[0, 1].set_ylabel('Profit ($M)')

# BAD: Too many pie slices
many_categories = [f'Cat {i}' for i in range(12)]
many_values = np.random.randint(5, 15, 12)
axes[1, 0].pie(many_values, labels=many_categories, autopct='%1.0f%%')
axes[1, 0].set_title('❌ BAD: Pie Chart with 12 Categories\n(Hard to read!)', fontweight='bold', color='red')

# GOOD: Horizontal bar chart for many categories
sorted_idx = np.argsort(many_values)
axes[1, 1].barh([many_categories[i] for i in sorted_idx], 
               [many_values[i] for i in sorted_idx],
               color=plt.cm.viridis(np.linspace(0.2, 0.8, 12)))
axes[1, 1].set_title('✅ GOOD: Horizontal Bar Chart\n(Easy comparison!)', fontweight='bold', color='green')
axes[1, 1].set_xlabel('Value')

plt.tight_layout()
plt.show()
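
To see why the truncated axis misleads, compare the ratio of the bar heights as drawn with the ratio of the underlying values. The sketch below does that arithmetic for the two companies above; `drawn_ratio` is a hypothetical helper name used only for illustration.

```python
def drawn_ratio(smaller, larger, axis_min=0.0):
    """Ratio of the two bar heights as drawn when the y-axis starts at axis_min."""
    return (larger - axis_min) / (smaller - axis_min)


profits = [100, 105]

true_ratio = drawn_ratio(*profits)                    # y-axis starts at 0
truncated_ratio = drawn_ratio(*profits, axis_min=98)  # y-axis starts at 98

print(f"Axis from 0:  taller bar is {true_ratio:.2f}x the shorter one")
print(f"Axis from 98: taller bar is {truncated_ratio:.2f}x the shorter one")
```

With the axis starting at 98, a 5% difference is drawn as a bar 3.5 times taller than its neighbor.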

---

# 5. Storytelling with Data

Effective data visualization isn't just about creating charts; it's about telling a compelling story that drives action.

## The Data Storytelling Framework

```
1. CONTEXT → Set the scene
   "What's the current situation?"

2. CONFLICT → Identify the problem/opportunity
   "What needs to change?"

3. RESOLUTION → Present the insight
   "What should we do about it?"
```

## Key Principles:

1. **Lead with the insight** - Don't make readers search for it
2. **Use strategic emphasis** - Draw attention to what matters
3. **Remove clutter** - Every element should earn its place
4. **Tell one story per chart** - Don't overwhelm
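
One way to apply the first principle, leading with the insight, is to derive the chart headline from the data so the claim cannot drift from the numbers. The sketch below is a minimal example under that assumption; `insight_title` is a hypothetical helper and the figures are made-up sample values.

```python
def insight_title(actual, target):
    """Build a headline stating how many periods beat their target."""
    beats = sum(a > t for a, t in zip(actual, target))
    return f"We beat targets in {beats} of {len(actual)} quarters"


# Made-up sample values for illustration
actual = [120, 135, 128, 150]
target = [125, 130, 130, 140]

print(insight_title(actual, target))
```

The returned string can be passed straight to `ax.set_title(...)`, so the headline updates whenever the data does.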

In [None]:
# Storytelling Example: Quarterly Business Review
# Scenario: Presenting sales performance to stakeholders

# Create comprehensive sales data
np.random.seed(42)

quarters = ['Q1 2024', 'Q2 2024', 'Q3 2024', 'Q4 2024', 'Q1 2025']
actual_sales = [850, 920, 980, 1150, 1280]  # Growing trend
target_sales = [900, 950, 1000, 1050, 1100]  # Targets
competitor_sales = [800, 850, 920, 1000, 1050]  # Competitor

fig, axes = plt.subplots(2, 2, figsize=(16, 12))
fig.suptitle('📊 Quarterly Business Review: Sales Performance Story', 
             fontsize=16, fontweight='bold', y=1.02)

# Panel 1: The Big Picture - We're beating targets!
ax1 = axes[0, 0]
x = np.arange(len(quarters))
width = 0.35

bars1 = ax1.bar(x - width/2, actual_sales, width, label='Actual', color='#27ae60', edgecolor='white')
bars2 = ax1.bar(x + width/2, target_sales, width, label='Target', color='#bdc3c7', edgecolor='white')

# Highlight the key insight
for i, (actual, target) in enumerate(zip(actual_sales, target_sales)):
    if actual > target:
        ax1.annotate('✓', xy=(i - width/2, actual + 20), fontsize=16, ha='center', color='green')

ax1.set_title('🎯 Story 1: Exceeding Targets\n"We beat targets in the last 2 of 5 quarters"', 
              fontweight='bold', fontsize=11)
ax1.set_xticks(x)
ax1.set_xticklabels(quarters)
ax1.set_ylabel('Sales ($K)')
ax1.legend()
ax1.set_ylim(0, 1500)

# Panel 2: The Trend - Growth is accelerating
ax2 = axes[0, 1]
ax2.plot(quarters, actual_sales, marker='o', linewidth=3, markersize=10, 
         color='#3498db', label='Our Sales')
ax2.plot(quarters, competitor_sales, marker='s', linewidth=2, markersize=8, 
         color='#e74c3c', linestyle='--', alpha=0.7, label='Competitor')

# Annotate the gap
ax2.annotate('Gap widening!\n+$230K', xy=(4, 1280), xytext=(3.5, 1380),
             fontsize=10, ha='center', color='#27ae60', fontweight='bold',
             arrowprops=dict(arrowstyle='->', color='#27ae60'))

ax2.fill_between(quarters, competitor_sales, actual_sales, alpha=0.2, color='#27ae60')
ax2.set_title('📈 Story 2: Outpacing Competition\n"We\'re growing faster than competitors"', 
              fontweight='bold', fontsize=11)
ax2.set_ylabel('Sales ($K)')
ax2.legend()
ax2.set_ylim(700, 1500)

# Panel 3: The Breakdown - Where growth is coming from
ax3 = axes[1, 0]
products = ['Product A', 'Product B', 'Product C', 'Product D']
q1_sales = [300, 250, 200, 100]
q5_sales = [450, 400, 280, 150]  # Q1 2025
growth = [(q5 - q1) / q1 * 100 for q1, q5 in zip(q1_sales, q5_sales)]

colors = ['#27ae60' if g > 40 else '#3498db' for g in growth]
bars = ax3.barh(products, growth, color=colors, edgecolor='white')

# Add value labels
for bar, g in zip(bars, growth):
    ax3.text(bar.get_width() + 2, bar.get_y() + bar.get_height()/2, 
             f'+{g:.0f}%', va='center', fontweight='bold')

ax3.axvline(x=40, color='red', linestyle='--', alpha=0.7, label='40% growth target')
ax3.set_title('🚀 Story 3: Growth Drivers\n"Products A & B are our growth engines"', 
              fontweight='bold', fontsize=11)
ax3.set_xlabel('YoY Growth (%)')
ax3.legend()

# Panel 4: The Call to Action
ax4 = axes[1, 1]
ax4.axis('off')

summary_text = """
📋 KEY TAKEAWAYS

✅ Exceeded targets in the last 2 of 5 quarters
✅ Growing 20 points faster than main competitor (51% vs. 31%)
✅ Products A & B driving 70% of growth

🎯 RECOMMENDED ACTIONS

1. Increase marketing spend on Products A & B
2. Investigate Product D underperformance
3. Set aggressive Q2 2025 target: $1,400K

💰 PROJECTED IMPACT

If trends continue: +$500K annual revenue
"""

ax4.text(0.1, 0.9, summary_text, transform=ax4.transAxes, fontsize=12,
         verticalalignment='top', fontfamily='monospace',
         bbox=dict(boxstyle='round', facecolor='#ecf0f1', edgecolor='#bdc3c7', alpha=0.8))

ax4.set_title('📝 Story 4: The Call to Action', fontweight='bold', fontsize=11)

plt.tight_layout()
plt.show()

## Using Color Strategically

In [None]:
# Strategic use of color to emphasize insights

fig, axes = plt.subplots(1, 3, figsize=(16, 5))

# Data: Monthly performance
months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
performance = [85, 82, 88, 91, 78, 72, 75, 89, 94, 97, 93, 99]
target = 85

# Strategy 1: Gray out context, highlight the insight
colors1 = ['#e74c3c' if p < target else '#bdc3c7' for p in performance]
colors1[-1] = '#27ae60'  # Highlight best month

axes[0].bar(months, performance, color=colors1, edgecolor='white')
axes[0].axhline(y=target, color='#3498db', linestyle='--', linewidth=2, label=f'Target ({target}%)')
axes[0].set_title('Strategy 1: Highlight Exceptions\n"4 months below target, December was best"', fontweight='bold')
axes[0].set_ylabel('Performance Score')
axes[0].legend()

# Strategy 2: Sequential color to show magnitude
norm = plt.Normalize(min(performance), max(performance))
colors2 = plt.cm.viridis(norm(performance))  # viridis: a perceptually uniform sequential colormap

axes[1].bar(months, performance, color=colors2, edgecolor='white')
axes[1].set_title('Strategy 2: Sequential Colors\n"Color intensity shows performance level"', fontweight='bold')
axes[1].set_ylabel('Performance Score')

# Strategy 3: Categorical colors for groups
quarters = [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4]
quarter_colors = {1: '#3498db', 2: '#e74c3c', 3: '#2ecc71', 4: '#f39c12'}
colors3 = [quarter_colors[q] for q in quarters]

axes[2].bar(months, performance, color=colors3, edgecolor='white')
axes[2].set_title('Strategy 3: Categorical Colors\n"Group by quarters for comparison"', fontweight='bold')
axes[2].set_ylabel('Performance Score')

# Add legend
import matplotlib.patches as mpatches
legend_patches = [mpatches.Patch(color=c, label=f'Q{q}') for q, c in quarter_colors.items()]
axes[2].legend(handles=legend_patches, loc='lower right')

plt.tight_layout()
plt.show()
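
When values are judged against a reference point such as a target, a diverging colormap centered on that point is another option. The sketch below assumes Matplotlib's `TwoSlopeNorm` together with the `RdYlGn` colormap; the scores and target repeat the sample values used above, and `bar_colors` is a name chosen here for illustration.

```python
import matplotlib.pyplot as plt
from matplotlib.colors import TwoSlopeNorm

performance = [85, 82, 88, 91, 78, 72, 75, 89, 94, 97, 93, 99]
target = 85

# Map the target to the colormap's midpoint: scores below target shade
# toward red, scores above target shade toward green.
norm = TwoSlopeNorm(vmin=min(performance), vcenter=target, vmax=max(performance))
bar_colors = plt.cm.RdYlGn(norm(performance))  # array of RGBA rows, one per month

print(bar_colors.shape)
```

The resulting array can be passed as `color=bar_colors` to `ax.bar(...)`. A diverging scale like this only makes sense when the midpoint has meaning; for plain magnitude, a sequential colormap remains the safer choice.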

## Adding Annotations for Context

In [None]:
# Creating annotated visualizations that tell a complete story

# Scenario: Stock price analysis with key events
np.random.seed(42)

# Generate stock price data
dates = pd.date_range('2024-01-01', periods=365, freq='D')
price = 100 + np.cumsum(np.random.randn(365) * 2)

# Add some events
events = {
    '2024-03-15': ('Product Launch', 'green'),
    '2024-06-01': ('CEO Resignation', 'red'),
    '2024-08-20': ('Strong Earnings', 'green'),
    '2024-11-15': ('Market Expansion', 'blue')
}

fig, ax = plt.subplots(figsize=(16, 8))

# Plot stock price
ax.plot(dates, price, linewidth=2, color='#2c3e50', alpha=0.8)
ax.fill_between(dates, price.min(), price, alpha=0.1, color='#3498db')

# Add moving average
ma_30 = pd.Series(price).rolling(30).mean()
ax.plot(dates, ma_30, linewidth=2, color='#e74c3c', linestyle='--', label='30-day MA', alpha=0.7)

# Annotate key events
for date_str, (label, color) in events.items():
    event_date = pd.Timestamp(date_str)
    idx = (dates == event_date).argmax()
    event_price = price[idx]
    
    ax.annotate(label, 
                xy=(event_date, event_price), 
                xytext=(event_date + pd.Timedelta(days=20), event_price + 15),
                fontsize=10, fontweight='bold', color=color,
                arrowprops=dict(arrowstyle='->', color=color, lw=1.5),
                bbox=dict(boxstyle='round,pad=0.3', facecolor='white', edgecolor=color, alpha=0.9))
    
    ax.axvline(x=event_date, color=color, linestyle=':', alpha=0.5)
    ax.scatter([event_date], [event_price], color=color, s=100, zorder=5, edgecolors='white', linewidth=2)

# Add summary box
summary = f"""📊 Performance Summary
Start: ${price[0]:.2f}
End: ${price[-1]:.2f}
Change: {((price[-1]/price[0])-1)*100:+.1f}%
Max: ${price.max():.2f}
Min: ${price.min():.2f}"""

ax.text(0.02, 0.98, summary, transform=ax.transAxes, fontsize=10,
        verticalalignment='top', fontfamily='monospace',
        bbox=dict(boxstyle='round', facecolor='#ecf0f1', edgecolor='#bdc3c7', alpha=0.9))

ax.set_title('📈 TechCorp (TECH) Stock Price Analysis - 2024\nKey Events & Performance', 
             fontsize=14, fontweight='bold', pad=15)
ax.set_xlabel('Date', fontsize=11)
ax.set_ylabel('Stock Price ($)', fontsize=11)
ax.legend(loc='upper right')
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()
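
Annotations do not have to be placed by hand. The sketch below shows one way to locate and label the highest point of a series automatically; `annotate_peak` is a hypothetical helper, and the prices are a synthetic random walk generated only for illustration.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt


def annotate_peak(ax, dates, values, label="Peak"):
    """Mark the highest point of a series and return its (date, value)."""
    idx = int(np.argmax(values))
    peak_date, peak_value = dates[idx], values[idx]
    ax.annotate(f"{label}: ${peak_value:.2f}",
                xy=(peak_date, peak_value),
                xytext=(0, 25), textcoords="offset points",  # label sits 25 pt above the point
                ha="center", fontweight="bold",
                arrowprops=dict(arrowstyle="->"))
    return peak_date, peak_value


# Synthetic random-walk prices for illustration
rng = np.random.default_rng(42)
dates = pd.date_range("2024-01-01", periods=120, freq="D")
price = 100 + np.cumsum(rng.normal(0, 2, len(dates)))

fig, ax = plt.subplots(figsize=(10, 4))
ax.plot(dates, price, color="#2c3e50")
peak_date, peak_value = annotate_peak(ax, dates, price)
ax.set_ylabel("Stock Price ($)")
```

Placing the label with `textcoords="offset points"` keeps it a fixed distance from the marker regardless of the axis scale.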

---

# 6. Practice Projects

## Project 1: Visual Exploratory Data Analysis (EDA)

### 🏪 Scenario: Retail Store Sales Analysis

In [None]:
# Generate comprehensive retail dataset
np.random.seed(42)

n_records = 1000

# Create realistic retail data
retail_data = pd.DataFrame({
    'transaction_id': range(1, n_records + 1),
    'date': pd.date_range('2024-01-01', periods=n_records, freq='4h'),  # lowercase 'h': the 'H' alias is deprecated in pandas 2.2+
    'store': np.random.choice(['Downtown', 'Mall', 'Suburb', 'Airport'], n_records, p=[0.3, 0.35, 0.25, 0.1]),
    'category': np.random.choice(['Electronics', 'Clothing', 'Food', 'Home', 'Beauty'], n_records, p=[0.2, 0.25, 0.3, 0.15, 0.1]),
    'payment_method': np.random.choice(['Card', 'Cash', 'Mobile'], n_records, p=[0.6, 0.25, 0.15]),
    'items': np.random.poisson(3, n_records) + 1,
    'amount': np.random.exponential(75, n_records) + 10,
    'customer_age': np.random.normal(38, 12, n_records).clip(18, 75).astype(int),
    'is_member': np.random.choice([True, False], n_records, p=[0.4, 0.6])
})

# Add derived columns
retail_data['day_of_week'] = retail_data['date'].dt.day_name()
retail_data['hour'] = retail_data['date'].dt.hour
retail_data['month'] = retail_data['date'].dt.month_name()
retail_data['avg_item_value'] = retail_data['amount'] / retail_data['items']

print("🛒 Retail Dataset Generated!")
print(f"Shape: {retail_data.shape}")
print("\nSample records:")
retail_data.head()

In [None]:
# Complete EDA Dashboard

fig = plt.figure(figsize=(18, 14))
gs = GridSpec(3, 3, figure=fig, hspace=0.35, wspace=0.3)

# 1. Sales by Store (Pie)
ax1 = fig.add_subplot(gs[0, 0])
store_sales = retail_data.groupby('store')['amount'].sum()
colors = ['#3498db', '#e74c3c', '#2ecc71', '#f39c12']
ax1.pie(store_sales, labels=store_sales.index, autopct='%1.1f%%', colors=colors, explode=[0.05, 0, 0, 0])
ax1.set_title('🏪 Sales Distribution by Store', fontweight='bold')

# 2. Category Performance (Horizontal Bar)
ax2 = fig.add_subplot(gs[0, 1])
cat_sales = retail_data.groupby('category')['amount'].sum().sort_values()
bars = ax2.barh(cat_sales.index, cat_sales.values, color=plt.cm.viridis(np.linspace(0.2, 0.8, len(cat_sales))))
ax2.set_title('📦 Sales by Category', fontweight='bold')
ax2.set_xlabel('Total Sales ($)')
for bar in bars:
    ax2.text(bar.get_width() + 500, bar.get_y() + bar.get_height()/2, 
             f'${bar.get_width()/1000:.1f}K', va='center', fontsize=9)

# 3. Daily Sales Pattern (Line)
ax3 = fig.add_subplot(gs[0, 2])
day_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
daily_sales = retail_data.groupby('day_of_week')['amount'].mean().reindex(day_order)
ax3.plot(range(7), daily_sales.values, marker='o', linewidth=2, markersize=8, color='#9b59b6')
ax3.fill_between(range(7), daily_sales.values, alpha=0.3, color='#9b59b6')
ax3.set_xticks(range(7))
ax3.set_xticklabels(['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun'])
ax3.set_title('📅 Average Sales by Day', fontweight='bold')
ax3.set_ylabel('Avg Sale ($)')

# 4. Hourly Sales Heatmap by Store
ax4 = fig.add_subplot(gs[1, :2])
hourly_store = retail_data.pivot_table(values='amount', index='store', columns='hour', aggfunc='mean')
sns.heatmap(hourly_store, cmap='YlOrRd', ax=ax4, annot=False, cbar_kws={'label': 'Avg Sale ($)'})
ax4.set_title('🕐 Hourly Sales Pattern by Store (Heatmap)', fontweight='bold')
ax4.set_xlabel('Hour of Day')

# 5. Transaction Amount Distribution
ax5 = fig.add_subplot(gs[1, 2])
sns.histplot(retail_data['amount'], kde=True, ax=ax5, color='#3498db', bins=30)
ax5.axvline(retail_data['amount'].median(), color='red', linestyle='--', label=f'Median: ${retail_data["amount"].median():.0f}')
ax5.set_title('💵 Transaction Amount Distribution', fontweight='bold')
ax5.set_xlabel('Amount ($)')
ax5.legend()

# 6. Member vs Non-Member Spending
ax6 = fig.add_subplot(gs[2, 0])
sns.boxplot(data=retail_data, x='is_member', y='amount', ax=ax6, palette='Set2')
ax6.set_xticks([0, 1])  # fix tick positions before relabeling (order: False, True)
ax6.set_xticklabels(['Non-Member', 'Member'])
ax6.set_title('🎫 Member vs Non-Member Spending', fontweight='bold')
ax6.set_ylabel('Transaction Amount ($)')

# 7. Age vs Spending (Scatter)
ax7 = fig.add_subplot(gs[2, 1])
sns.scatterplot(data=retail_data.sample(200), x='customer_age', y='amount', 
                hue='category', ax=ax7, alpha=0.7, s=60)
ax7.set_title('👥 Age vs Spending by Category', fontweight='bold')
ax7.set_xlabel('Customer Age')
ax7.set_ylabel('Amount ($)')
ax7.legend(loc='upper right', fontsize=8)

# 8. Payment Methods (Count)
ax8 = fig.add_subplot(gs[2, 2])
payment_counts = retail_data['payment_method'].value_counts()
bars = ax8.bar(payment_counts.index, payment_counts.values, 
               color=['#3498db', '#2ecc71', '#f39c12'], edgecolor='white')
ax8.set_title('💳 Payment Method Distribution', fontweight='bold')
ax8.set_ylabel('Transaction Count')
for bar in bars:
    ax8.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 10, 
             f'{bar.get_height():.0f}', ha='center', fontsize=10, fontweight='bold')

fig.suptitle('🛒 Retail Store Analytics Dashboard - Complete EDA', fontsize=16, fontweight='bold', y=1.01)
plt.tight_layout()
plt.show()

## Project 2: Creating an Executive Summary Report

In [None]:
# Executive Summary: Key Insights Report

fig = plt.figure(figsize=(16, 12))

# Custom layout for executive report
gs = GridSpec(3, 4, figure=fig, hspace=0.4, wspace=0.3,
              height_ratios=[1, 1.2, 1])

# KPI Cards (Top Row)
# The change figures (↑ 12%, etc.) are hard-coded placeholders, not computed from retail_data
kpis = [
    ('Total Revenue', f'${retail_data["amount"].sum()/1000:.1f}K', '↑ 12%', '#27ae60'),
    ('Transactions', f'{len(retail_data):,}', '↑ 8%', '#3498db'),
    ('Avg Order Value', f'${retail_data["amount"].mean():.2f}', '↑ 5%', '#9b59b6'),
    ('Member Rate', f'{retail_data["is_member"].mean()*100:.1f}%', '↑ 3%', '#f39c12')
]

for i, (title, value, change, color) in enumerate(kpis):
    ax = fig.add_subplot(gs[0, i])
    ax.set_xlim(0, 1)
    ax.set_ylim(0, 1)
    ax.axis('off')
    
    # Background box
    rect = plt.Rectangle((0.05, 0.05), 0.9, 0.9, facecolor=color, alpha=0.15, 
                         edgecolor=color, linewidth=2, transform=ax.transAxes)
    ax.add_patch(rect)
    
    ax.text(0.5, 0.75, title, ha='center', va='center', fontsize=11, color='gray', fontweight='bold')
    ax.text(0.5, 0.45, value, ha='center', va='center', fontsize=20, color=color, fontweight='bold')
    ax.text(0.5, 0.15, change, ha='center', va='center', fontsize=12, color='green', fontweight='bold')

# Main Chart: Revenue Trend
ax_main = fig.add_subplot(gs[1, :3])
monthly_revenue = retail_data.groupby(retail_data['date'].dt.to_period('M'))['amount'].sum()
monthly_revenue.index = monthly_revenue.index.astype(str)
ax_main.bar(monthly_revenue.index, monthly_revenue.values, color='#3498db', edgecolor='white', alpha=0.7)
ax_main.plot(monthly_revenue.index, monthly_revenue.values, marker='o', color='#e74c3c', linewidth=2, markersize=8)
ax_main.set_title('📈 Monthly Revenue Trend', fontweight='bold', fontsize=12)
ax_main.set_ylabel('Revenue ($)')
ax_main.tick_params(axis='x', rotation=45)
ax_main.yaxis.set_major_formatter(plt.FuncFormatter(lambda x, p: f'${x/1000:.0f}K'))

# Side Chart: Top Performers
ax_side = fig.add_subplot(gs[1, 3])
store_perf = retail_data.groupby('store')['amount'].sum().sort_values(ascending=True)
colors_side = ['#bdc3c7', '#bdc3c7', '#bdc3c7', '#27ae60']  # Highlight top
ax_side.barh(store_perf.index, store_perf.values, color=colors_side)
ax_side.set_title('🏆 Store Rankings', fontweight='bold', fontsize=12)
ax_side.set_xlabel('Total Revenue ($)')

# Bottom Row: Detailed Insights
# Category mix
ax_bottom1 = fig.add_subplot(gs[2, 0:2])
cat_by_store = retail_data.groupby(['store', 'category'])['amount'].sum().unstack()
cat_by_store.plot(kind='bar', stacked=True, ax=ax_bottom1, colormap='viridis', edgecolor='white')
ax_bottom1.set_title('📊 Category Mix by Store', fontweight='bold', fontsize=11)
ax_bottom1.set_ylabel('Revenue ($)')
ax_bottom1.legend(title='Category', bbox_to_anchor=(1.02, 1), fontsize=8)
ax_bottom1.tick_params(axis='x', rotation=0)

# Key Findings Text Box
ax_text = fig.add_subplot(gs[2, 2:])
ax_text.axis('off')

# Illustrative wording: the figures below are placeholders, not computed from retail_data
findings = """
📋 KEY FINDINGS & RECOMMENDATIONS

✅ STRENGTHS:
• Mall store leads revenue (+35% vs avg)
• Food category shows strong performance
• Weekend sales 40% higher than weekdays

⚠️ OPPORTUNITIES:
• Airport store underperforming
• Beauty category needs attention
• Member conversion rate below target

🎯 ACTIONS:
1. Increase Airport marketing budget
2. Launch Beauty category promotion
3. Implement member referral program
"""

ax_text.text(0.05, 0.95, findings, transform=ax_text.transAxes, fontsize=10,
            verticalalignment='top', fontfamily='sans-serif',
            bbox=dict(boxstyle='round', facecolor='#f8f9fa', edgecolor='#dee2e6', alpha=0.9))

fig.suptitle('📊 EXECUTIVE SUMMARY: Retail Performance Report', 
             fontsize=16, fontweight='bold', y=1.02)
plt.tight_layout()
plt.show()
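
The findings box above uses fixed text. A report is easier to trust when such statements are computed from the data. The sketch below shows one way to do that under stated assumptions: a tiny made-up DataFrame stands in for `retail_data`, and `weekend_uplift` and `leader_vs_average` are hypothetical helper names.

```python
import pandas as pd


def weekend_uplift(df):
    """Percent difference between mean weekend and mean weekday transaction amount."""
    is_weekend = df["date"].dt.dayofweek >= 5  # Saturday=5, Sunday=6
    weekend_mean = df.loc[is_weekend, "amount"].mean()
    weekday_mean = df.loc[~is_weekend, "amount"].mean()
    return (weekend_mean / weekday_mean - 1) * 100


def leader_vs_average(df, group_col="store"):
    """Return the top group by revenue and its percent lead over the group average."""
    totals = df.groupby(group_col)["amount"].sum()
    lead = (totals.max() / totals.mean() - 1) * 100
    return totals.idxmax(), lead


# Tiny made-up sample standing in for retail_data (Fri, Sat, Sun, Mon)
sample = pd.DataFrame({
    "date": pd.to_datetime(["2024-01-05", "2024-01-06", "2024-01-07", "2024-01-08"]),
    "store": ["Mall", "Mall", "Airport", "Downtown"],
    "amount": [100.0, 150.0, 90.0, 60.0],
})

uplift = weekend_uplift(sample)
store, lead = leader_vs_average(sample)
print(f"- Weekend sales {uplift:+.0f}% vs weekdays")
print(f"- {store} store leads revenue ({lead:+.0f}% vs avg)")
```

Calling the same helpers on `retail_data` would produce findings that match whatever the charts actually show.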

## Saving Visualizations

In [None]:
# How to save visualizations in different formats

# Create a sample plot to save
fig, ax = plt.subplots(figsize=(10, 6))
ax.bar(['A', 'B', 'C', 'D'], [25, 40, 30, 55], color='#3498db')
ax.set_title('Sample Chart for Export', fontweight='bold')
ax.set_ylabel('Values')

# Save in different formats
# Uncomment these lines to actually save:

# fig.savefig('chart.png', dpi=300, bbox_inches='tight')  # High-res PNG
# fig.savefig('chart.pdf', bbox_inches='tight')           # PDF for print
# fig.savefig('chart.svg', bbox_inches='tight')           # SVG for web
# fig.savefig('chart.jpg', dpi=150, pil_kwargs={'quality': 95})  # JPEG (quality is passed via pil_kwargs)

print("📁 Common export formats:")
print("  • PNG (dpi=300): Best for presentations and documents")
print("  • PDF: Best for print quality and scalability")
print("  • SVG: Best for web and interactive applications")
print("  • JPEG: Smaller file size, lossy compression")

plt.show()
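
To export the same figure in several formats at once, the per-format options can be kept in a dictionary and applied in a loop. This is a minimal sketch: it writes to a temporary folder so nothing in your project is overwritten, and `export_options` is a name introduced here for illustration. It assumes a recent Matplotlib, where JPEG quality is passed through `pil_kwargs`.

```python
import tempfile
from pathlib import Path

import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(10, 6))
ax.bar(['A', 'B', 'C', 'D'], [25, 40, 30, 55], color='#3498db')
ax.set_title('Sample Chart for Export', fontweight='bold')
ax.set_ylabel('Values')

# Per-format options; JPEG quality is handed to Pillow via pil_kwargs
export_options = {
    'png': dict(dpi=300),
    'pdf': dict(),
    'svg': dict(),
    'jpg': dict(dpi=150, pil_kwargs={'quality': 95}),
}

out_dir = Path(tempfile.mkdtemp())  # temporary output folder
saved = []
for ext, options in export_options.items():
    path = out_dir / f'chart.{ext}'
    fig.savefig(path, bbox_inches='tight', **options)
    saved.append(path)
    print(f'Saved {path.name} ({path.stat().st_size:,} bytes)')

plt.close(fig)  # release the figure once all files are written
```

Replace `out_dir` with a real folder, such as `Path('reports')`, once the output looks right.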

---

# 📝 Summary & Key Takeaways

## What We Learned:

### 1. Visualization Fundamentals
- Data visualization transforms numbers into insights
- Follow principles: Clarity, Accuracy, Efficiency, Aesthetics
- Always visualize data; summary statistics alone can be misleading

### 2. Matplotlib
- Foundation of Python visualization
- Line plots for trends, Bar charts for comparisons, Scatter plots for relationships
- Customizable through object-oriented interface

### 3. Seaborn
- Built on Matplotlib with beautiful defaults
- Excellent for statistical visualizations
- Direct DataFrame integration

### 4. Chart Selection
- Match chart type to your data and message
- Avoid common mistakes (3D effects, truncated axes)
- Simplicity often beats complexity

### 5. Storytelling with Data
- Lead with insights, not just data
- Use color strategically
- Add context through annotations
- Every visualization should answer: "So what?"

---

## 🏋️ Practice Exercises

1. **Basic**: Create a line chart showing temperature variations over a week
2. **Intermediate**: Build a dashboard analyzing your personal expenses
3. **Advanced**: Create an interactive storytelling visualization for a topic you're passionate about

---

## 📚 Additional Resources

- [Matplotlib Documentation](https://matplotlib.org/stable/contents.html)
- [Seaborn Tutorial](https://seaborn.pydata.org/tutorial.html)
- [Storytelling with Data (book)](https://www.storytellingwithdata.com/)
- [Python Graph Gallery](https://www.python-graph-gallery.com/)