# Data Visualization

**Course Duration**: 3 hours  
**Learning Outcomes**:
- Understand data storytelling and effective data visualization principles
- Create professional plots using Matplotlib
- Analyze distributions and relationships using Seaborn

## ⏱️ HOUR 1: Data Storytelling & Matplotlib Foundations

**Duration**: 60 minutes
- 5 min: Setup and imports
- 20 min: Recap activity
- 40 min: Data storytelling concepts

### 🎯 Why Visualize Data?

Making informative visualizations is one of the most important tasks in data analysis. Visualization serves multiple critical purposes:

1. **Exploratory Data Analysis (EDA)**
   - Identify outliers and anomalies
   - Discover patterns and relationships
   - Detect data quality issues requiring transformation

2. **Communication to Non-Technical Audiences**
   - Make complex findings accessible
   - Support storytelling with evidence
   - Facilitate decision-making

3. **Interactive Visualization for the Web**
   - Build dashboards and applications
   - Enable real-time exploration
   - Share insights dynamically

**Key Principle**: A good visualization reveals truth and tells a story.

### Getting Started: Imports & Setup

We'll import three essential libraries for data visualization:
- **NumPy**: Numerical computing and random data generation
- **Pandas**: Data manipulation and analysis
- **Matplotlib**: Low-level plotting library for full control

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Set random seed for reproducibility
np.random.seed(42)

print('✓ Libraries imported successfully')

### Understanding Matplotlib

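**Matplotlib** is a multi-platform data visualization library built on NumPy arrays and designed to work with the broader scientific Python stack, including Pandas. It offers two interfaces: the quick pyplot functions (such as `plt.plot()`) and an object-oriented API in which you call methods on Figure and Axes objects.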

**Key Hierarchy**:
- **Figure**: The top-level container for all plot elements
- **Axes**: The plot area where data is rendered
- **Artists**: Individual elements (lines, patches, text, etc.)
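
To make the hierarchy concrete, here is a minimal sketch that builds one figure and checks how the objects nest. The data and the title text are made up for illustration.

```python
import matplotlib.pyplot as plt
from matplotlib.artist import Artist

# One Figure containing one Axes
fig, ax = plt.subplots()

# Plotting calls return the Artists they create (here, a single line)
(line,) = ax.plot([0, 1, 2], [0, 1, 4])
title = ax.set_title('Hierarchy demo')

print(fig.axes == [ax])      # the Figure keeps a list of its Axes
print(line in ax.lines)      # the Axes keeps track of its lines
print(isinstance(line, Artist), isinstance(title, Artist))
```

The Figure and the Axes are themselves Artists, which is why the same kinds of setter methods appear at every level.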

**Why Matplotlib?**
- Full control over every plot element
- Foundation for higher-level libraries (like Seaborn)
- Publication-quality output
- Extensive documentation and community support

### Your First Plot: Simple Line Plot

The simplest way to create a plot is using the pyplot interface (`plt`). The default plot type is a line plot.

In [None]:
# Create simple data
data = np.arange(10)  # Numbers 0 to 9

# Create the simplest possible plot
plt.plot(data)
plt.title('My First Plot')
plt.xlabel('Index')
plt.ylabel('Value')
plt.show()

**What happened?**
- `np.arange(10)`: Creates array [0, 1, 2, ..., 9]
- `plt.plot(data)`: Creates a line plot using values as y-axis (index as x-axis)
- `plt.show()`: Displays the plot
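
One way to confirm the "index as x-axis" behavior is to ask the line object which coordinates it stored. This is a small sketch using the same `np.arange(10)` data; the variable name `line` is only for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

data = np.arange(10)

fig, ax = plt.subplots()
(line,) = ax.plot(data)      # only y-values are passed

print(line.get_xdata())      # x-values that Matplotlib filled in: 0, 1, ..., 9
print(line.get_ydata())      # the values we passed
```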

### Plotting Pandas Series vs DataFrame

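Pandas objects have built-in plotting methods. When you plot a **Series**, the index becomes the x-axis and the values become the y-axis. A **DataFrame** works the same way, except that each column is drawn as its own line against the shared index.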

In [None]:
# Create a Pandas Series with custom index
data_series = pd.Series(np.arange(5), index=np.arange(2, 12, 2))
print('Series:')
print(data_series)

# Plot the series
# Note: Index (2, 4, 6, 8, 10) appears on x-axis
data_series.plot(figsize=(10, 4), title='Series Plot')
plt.ylabel('Value')
plt.show()
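
For comparison with the Series example, the sketch below plots a small DataFrame. The column names `linear` and `squared` and the name `df_demo` are hypothetical and chosen only for this illustration.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical two-column DataFrame using the same index as the Series example
df_demo = pd.DataFrame(
    {'linear': np.arange(5), 'squared': np.arange(5) ** 2},
    index=np.arange(2, 12, 2),
)

# DataFrame.plot() returns the Matplotlib Axes it drew on
ax = df_demo.plot(figsize=(10, 4), title='DataFrame Plot')
ax.set_ylabel('Value')

print(len(ax.get_lines()))                                   # one line per column
print([t.get_text() for t in ax.get_legend().get_texts()])   # legend built from column names
```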

## ⏱️ HOUR 2: Mastering Matplotlib

**Duration**: 60 minutes
- 30 min: Code-along 1 - Figures, subplots, and basic plots
- 10 min: Break
- 20 min: Code-along 2 - Styling and customization

### Code-along 1: Figures and Subplots

**Learning Objective**: Understand Matplotlib's architecture and create multi-panel layouts

A plot lives inside a **Figure** object. Subplots (called **axes** in Matplotlib) are the actual plot areas.

In [None]:
# Step 1: Create a figure (empty canvas)
fig = plt.figure(figsize=(10, 6))
print(f'Figure created: {fig}')

In [None]:
# Step 2: Create a subplot (axes)
# add_subplot(rows, cols, position)
ax = fig.add_subplot(1, 1, 1)  # 1 row, 1 col, 1st position

# Step 3: Plot data on the axes
data = np.random.standard_normal(50).cumsum()
ax.plot(data, color='black', linewidth=2)
ax.set_title('Single Subplot Example')
ax.set_xlabel('Time')
ax.set_ylabel('Cumulative Return')

fig

**Key Concept**:
- `fig`: The container for all plot elements
- `ax`: The plot area where we draw data
- Most customization (colors, labels, titles) happens on the `ax` object; figure-level settings such as the overall size or `suptitle()` belong to `fig`

### Creating Multiple Subplots

You can create multiple plots in a grid layout by changing the subplot position.

In [None]:
# Create a 2x2 grid of subplots
fig = plt.figure(figsize=(12, 8))

# Position 1: Top-left (2 rows, 2 cols, position 1)
ax1 = fig.add_subplot(2, 2, 1)
ax1.plot(np.random.standard_normal(50).cumsum(), color='black')
ax1.set_title('Plot 1: Line Plot')

# Position 2: Top-right
ax2 = fig.add_subplot(2, 2, 2)
ax2.scatter(np.arange(30), np.arange(30) + 3 * np.random.standard_normal(30))
ax2.set_title('Plot 2: Scatter Plot')

# Position 3: Bottom-left
ax3 = fig.add_subplot(2, 2, 3)
ax3.hist(np.random.standard_normal(100), bins=20, color='black', alpha=0.7)
ax3.set_title('Plot 3: Histogram')

# Position 4: Bottom-right
ax4 = fig.add_subplot(2, 2, 4)
ax4.plot(np.random.standard_normal(50).cumsum(), color='black', linestyle='--', linewidth=2)
ax4.set_title('Plot 4: Dashed Line')

# Adjust spacing
fig.tight_layout()
fig

**Tips**:
- `figsize=(12, 8)`: Width=12 inches, Height=8 inches
- Each `add_subplot()` call adds one Axes at the given grid position (positions count from 1, left to right, then top to bottom)
- `fig.tight_layout()`: Automatically adjusts spacing

### Convenience Method: `plt.subplots()`

For faster creation, use `plt.subplots()`, which creates the figure and all of its Axes in one call and returns both (the Axes come back as a NumPy array).

In [None]:
# Create 2x2 grid in one line
fig, axes = plt.subplots(2, 2, figsize=(12, 8))

# axes is a 2D array, access by [row, col]
print(f'axes shape: {axes.shape}')
print(f'Accessing axes[0, 0]: {axes[0, 0]}')
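
The shape of the returned `axes` depends on the grid you request, which is a common source of indexing errors. The sketch below compares a few cases; the variable names are illustrative.

```python
import matplotlib.pyplot as plt

fig1, ax_single = plt.subplots()                     # no grid: a single Axes object
fig2, axes_row = plt.subplots(1, 3)                  # one row: 1D array
fig3, axes_grid = plt.subplots(2, 2)                 # full grid: 2D array
fig4, axes_2d = plt.subplots(1, 3, squeeze=False)    # squeeze=False: always 2D

print(type(ax_single).__name__)
print(axes_row.shape, axes_grid.shape, axes_2d.shape)

# .flat visits every Axes in row-major order, whatever the grid shape
for i, ax in enumerate(axes_grid.flat, start=1):
    ax.set_title(f'Panel {i}')
```

Iterating with `.flat` is an alternative to nested `for i` / `for j` loops when each panel is filled the same way.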

In [None]:
# Plot different data in each subplot
fig, axes = plt.subplots(2, 2, figsize=(12, 8), sharex=True, sharey=True)

# Create a loop to fill all subplots
for i in range(2):
    for j in range(2):
        ax = axes[i, j]
        # Plot histogram with random data
        ax.hist(np.random.standard_normal(500), bins=50, color='black', alpha=0.5)
        ax.set_title(f'Distribution {i*2 + j + 1}')

fig.suptitle('Multiple Histograms', fontsize=14, y=1.00)
fig.tight_layout()
fig

**Parameters**:
- `sharex=True`: All subplots share the same x-axis limits and ticks, so rescaling one rescales all
- `sharey=True`: All subplots share the same y-axis limits and ticks, which makes heights directly comparable
- `fig.suptitle()`: Title for the entire figure
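
A quick way to see what sharing means in practice is to change the limits on one panel and read them back from another. This is a minimal sketch with made-up data.

```python
import matplotlib.pyplot as plt

fig, (ax_left, ax_right) = plt.subplots(1, 2, sharex=True, sharey=True)
ax_left.plot([0, 1, 2, 3], [0, 1, 4, 9])
ax_right.plot([0, 1, 2, 3], [9, 4, 1, 0])

# Changing the limits on one panel updates the other as well
ax_left.set_xlim(0, 2)
ax_left.set_ylim(0, 5)

print(ax_right.get_xlim())
print(ax_right.get_ylim())
```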

### Code-along 2: Styling and Customization

**Learning Objective**: Customize plots with colors, markers, and line styles

The `plot()` function accepts many styling parameters to make plots more informative and visually appealing.

In [None]:
# Create data
data = np.random.standard_normal(50).cumsum()

# Create figure with plot
fig, ax = plt.subplots(figsize=(10, 6))

# Plot with styling
ax.plot(data, 
        color='green',           # Line color
        linestyle='--',          # Dashed line
        marker='o',              # Circle markers
        markersize=4,            # Marker size
        alpha=0.7,               # Transparency
        label='Data')

ax.set_title('Customized Line Plot')
ax.set_xlabel('Time')
ax.set_ylabel('Value')
ax.legend()  # Show legend
ax.grid(True, alpha=0.3)  # Add grid

fig

**Style Options**:
- `color`: 'blue', 'g', '#008000', or RGB tuple
- `linestyle`: '-' (solid), '--' (dashed), '-.' (dash-dot), ':' (dotted)
- `marker`: 'o' (circle), 's' (square), '^' (triangle), etc.
- `linewidth`: Thickness of line (default=1.5)
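
The sketch below draws one line per color format and linestyle listed above, then converts two of the specs to see how they relate. Note that the short code `'g'` is a very slightly different shade from the named color `'green'`, which matches `'#008000'`.

```python
import matplotlib.colors as mcolors
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(6, 3))

# Four horizontal lines, each with a different color spec and linestyle
ax.plot([0, 1], [3, 3], color='green', linestyle='-', label="name: 'green'")
ax.plot([0, 1], [2, 2], color='g', linestyle='--', label="short code: 'g'")
ax.plot([0, 1], [1, 1], color='#008000', linestyle='-.', label="hex: '#008000'")
ax.plot([0, 1], [0, 0], color=(0.0, 0.5, 0.0), linestyle=':', linewidth=3, label='RGB tuple')
ax.legend()

print(mcolors.to_hex('green'))   # named color converted to a hex string
print(mcolors.to_rgb('g'))       # short code converted to an RGB tuple
```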

### Quick Styling with Format Strings

You can also use a concise format string: `[marker][linestyle][color]`

In [None]:
fig, ax = plt.subplots(figsize=(10, 6))

# Format string examples
data1 = np.random.standard_normal(50).cumsum()
data2 = np.random.standard_normal(50).cumsum()

ax.plot(data1, 'o-r', label='Red circles with line')     # marker-linestyle-color
ax.plot(data2, 's--b', label='Blue squares dashed')      # marker-linestyle-color

ax.set_title('Format String Examples')
ax.legend()
ax.grid(True, alpha=0.3)

fig
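
A format string is shorthand for the keyword arguments shown earlier. The sketch below draws the same made-up data both ways and compares the resulting line properties.

```python
import matplotlib.pyplot as plt
from matplotlib.colors import to_rgba

fig, ax = plt.subplots()
y = [1, 3, 2, 4]

(short,) = ax.plot(y, 's--b')                                   # format string
(long,) = ax.plot(y, marker='s', linestyle='--', color='b')     # explicit keywords

for line in (short, long):
    print(line.get_marker(), line.get_linestyle(), to_rgba(line.get_color()))
```

Format strings are convenient for quick exploration; explicit keywords tend to be easier to read in code you intend to keep.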

### Customizing Ticks, Labels, and Limits

Control exactly where ticks appear and what labels they display.

In [None]:
# Create data
rng = np.random.RandomState(42)
data = rng.standard_normal(100).cumsum()

fig, ax = plt.subplots(figsize=(12, 6))

# Plot
ax.plot(data, color='black', linewidth=1.5)

# Set x-axis ticks and labels
ticks = [0, 25, 50, 75, 100]
labels = ['Start', 'Q1', 'Mid', 'Q3', 'End']
ax.set_xticks(ticks)
ax.set_xticklabels(labels, rotation=45)

# Set axis limits
# Note: fixed limits hide any values that fall outside them
ax.set_xlim(-5, 105)
ax.set_ylim(-5, 10)

# Labels and title
ax.set_xlabel('Time Period', fontsize=12)
ax.set_ylabel('Cumulative Value', fontsize=12)
ax.set_title('Customized Axes', fontsize=14, fontweight='bold')

fig

**Key Methods**:
- `set_xticks()`: Set where ticks appear
- `set_xticklabels()`: Set what the ticks display
- `set_xlim()` / `set_ylim()`: Set axis ranges
- `rotation`: Angle for tick labels (helpful when crowded)

### Adding Annotations

Highlight important points with arrows, text, and annotations.

In [None]:
fig, ax = plt.subplots(figsize=(12, 6))

# Plot data
data = np.random.standard_normal(100).cumsum()
ax.plot(data, color='black', linewidth=2)

# Find the maximum value
max_idx = data.argmax()
max_val = data[max_idx]

# Add annotation with arrow
ax.annotate(
    'Peak Value',                           # Text
    xy=(max_idx, max_val),                 # Point to annotate
    xytext=(max_idx - 20, max_val + 2),    # Text position
    arrowprops=dict(                        # Arrow properties
        facecolor='red', 
        headwidth=8, 
        width=2,
        headlength=6
    ),
    fontsize=11,
    color='red'
)

# Add text at minimum
min_idx = data.argmin()
min_val = data[min_idx]
ax.text(min_idx, min_val - 1.5, 'Lowest Point', 
        fontsize=10, ha='center', color='blue')

# Add horizontal line at zero
ax.axhline(y=0, color='gray', linestyle=':', alpha=0.5)

ax.set_title('Annotated Plot')
ax.grid(True, alpha=0.3)

fig

## ⏱️ HOUR 3: Statistical Graphics with Seaborn

**Duration**: 60 minutes
- 50 min: Code-along 3 - Statistical visualizations
- 10 min: Q&A and resources

### Introduction to Seaborn

**Seaborn** is a Python library that builds on Matplotlib to provide a higher-level interface for statistical data visualization. It automatically handles data aggregation and uses attractive default styling.

**Key Advantages**:
- Automatic aggregation and statistical estimation
- Beautiful default aesthetics
- Tight integration with Pandas
- Simplified multi-variable visualization

In [None]:
import seaborn as sns

# Set style for better-looking plots
sns.set_style('whitegrid')

print('✓ Seaborn imported and configured')

In [None]:
# Load example dataset
tips = sns.load_dataset('tips')
print(f'Dataset shape: {tips.shape}')
print('\nFirst few rows:')
tips.head()

### Bar Plots with Aggregation

**Use Case**: Summarize one variable for each category

Seaborn's `barplot` automatically aggregates the data: by default it plots the mean of each category together with a bootstrapped 95% confidence interval.

In [None]:
# Create tip percentage column (tip as a percentage of the total bill)
tips['tip_pct'] = 100 * tips['tip'] / tips['total_bill']

# Simple bar plot: average tip percentage by day
fig, ax = plt.subplots(figsize=(10, 6))
sns.barplot(x='day', y='tip_pct', data=tips, ax=ax, color='steelblue')
ax.set_title('Average Tipping Percentage by Day of Week')
ax.set_ylabel('Tip Percentage')
ax.set_xlabel('Day')
fig

**Interpretation**:
- Bar height = Mean (average) tip percentage
- Vertical lines on the bars = 95% confidence interval for the mean
- Wider interval = more uncertainty about the mean (more variable data or fewer observations)
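
To see roughly what a bar and its interval summarize, the sketch below computes a group mean and a percentile bootstrap interval by hand with pandas and NumPy. The toy data and the helper `bootstrap_ci` are hypothetical, and Seaborn's own implementation may differ in its details.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the tips dataset
toy = pd.DataFrame({
    'day': ['Thur', 'Thur', 'Thur', 'Fri', 'Fri', 'Fri'],
    'tip_pct': [10.0, 15.0, 20.0, 12.0, 18.0, 30.0],
})

def bootstrap_ci(values, n_boot=1000, level=95, seed=0):
    """Percentile bootstrap interval for the mean (illustrative helper)."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    # Resample with replacement and record the mean of each resample
    resamples = rng.choice(values, size=(n_boot, len(values)), replace=True)
    boot_means = resamples.mean(axis=1)
    tail = (100 - level) / 2
    low, high = np.percentile(boot_means, [tail, 100 - tail])
    return low, high

rows = []
for day, group in toy.groupby('day', sort=False)['tip_pct']:
    low, high = bootstrap_ci(group)
    rows.append({'day': day, 'mean': group.mean(), 'ci_low': low, 'ci_high': high})

summary = pd.DataFrame(rows).set_index('day')
print(summary)
```

With only three observations per group the intervals are wide, which illustrates why small groups produce long error bars.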

In [None]:
# Add a categorical dimension with hue
fig, ax = plt.subplots(figsize=(12, 6))
sns.barplot(x='day', y='tip_pct', hue='sex', data=tips, ax=ax)
ax.set_title('Tipping Percentage by Day and Gender')
ax.set_ylabel('Tip Percentage')
ax.set_xlabel('Day')
fig

**New Parameter**:
- `hue='sex'`: Creates separate bars by gender (another categorical dimension)

### Analyzing Distributions

**Use Cases**: Understand the shape and spread of data

Key distribution characteristics:
- **Central tendency**: Where most data clusters
- **Spread**: How dispersed the data is
- **Shape**: Symmetry, skewness, modality
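
These three characteristics can also be summarized numerically before plotting. The sketch below uses a small made-up sample (not the tips data) with one large value to show how the mean, median, and skewness respond.

```python
import pandas as pd

# Hypothetical right-skewed sample: most values are small, one is large
bills = pd.Series([8, 10, 11, 12, 13, 15, 18, 45], name='total_bill')

print('mean  :', bills.mean())      # central tendency, pulled up by the large value
print('median:', bills.median())    # central tendency, robust to the large value
print('std   :', bills.std())       # spread
print('skew  :', bills.skew())      # shape: positive means a longer right tail
```

A mean well above the median is a quick hint that the distribution has a long right tail.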

#### Histograms: Bar Chart of Distribution

Divides data into bins and shows frequency.

In [None]:
# Simple histogram
fig, ax = plt.subplots(figsize=(10, 6))
ax.hist(tips['total_bill'], bins=30, color='steelblue', alpha=0.7, edgecolor='black')
ax.set_title('Distribution of Total Bill Amount')
ax.set_xlabel('Total Bill ($)')
ax.set_ylabel('Frequency')
ax.grid(True, alpha=0.3)
fig

**Parameters**:
- `bins=30`: Number of bins (more bins show finer detail but also more noise)
- `alpha=0.7`: Transparency
- `edgecolor`: Border color of bars

#### Kernel Density Estimation (KDE): Smooth Distribution Curve

Estimates a smooth probability density by placing a small kernel (typically a Gaussian bump) on each observation and adding the bumps together.

In [None]:
# Density plot (pandas relies on SciPy to compute the KDE)
fig, ax = plt.subplots(figsize=(10, 6))
tips['total_bill'].plot.density(ax=ax, linewidth=2, color='steelblue')
ax.set_title('Kernel Density Estimate of Total Bill')
ax.set_xlabel('Total Bill ($)')
# Reuse the curve that was just drawn instead of recomputing the KDE
kde_line = ax.get_lines()[0]
ax.fill_between(kde_line.get_xdata(), kde_line.get_ydata(),
                alpha=0.3, color='steelblue')
ax.grid(True, alpha=0.3)
fig
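
The idea behind a KDE can be sketched in a few lines of NumPy: place one Gaussian bump on each observation and average them. The function `gaussian_kde_sketch`, the sample values, and the hand-picked bandwidth are all assumptions for illustration; library implementations choose the bandwidth automatically.

```python
import numpy as np

def gaussian_kde_sketch(data, grid, bandwidth):
    """Evaluate a simple Gaussian KDE on grid (illustrative, fixed bandwidth)."""
    data = np.asarray(data, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # Distance from every grid point to every observation, in bandwidth units
    z = (grid[:, None] - data[None, :]) / bandwidth
    # One Gaussian bump per observation, evaluated on the grid
    bumps = np.exp(-0.5 * z**2) / np.sqrt(2 * np.pi)
    # Average the bumps and rescale so the curve integrates to 1
    return bumps.mean(axis=1) / bandwidth

sample = np.array([10.0, 12.0, 13.0, 20.0, 35.0])   # hypothetical bill amounts
grid = np.linspace(-20, 70, 901)
density = gaussian_kde_sketch(sample, grid, bandwidth=3.0)

# Riemann-sum approximation of the area under the curve (should be close to 1)
area = density.sum() * (grid[1] - grid[0])
print(round(area, 3))
```

A smaller bandwidth gives a bumpier curve that follows individual points; a larger one gives a smoother curve that can hide real structure.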

In [None]:
# Combine histogram and KDE with seaborn
fig, ax = plt.subplots(figsize=(10, 6))
sns.histplot(tips['total_bill'], bins=30, kde=True, ax=ax, color='steelblue', alpha=0.6)
ax.set_title('Total Bill Distribution (Histogram + KDE)')
ax.set_xlabel('Total Bill ($)')
fig

### Visualizing Relationships

**Use Case**: Explore how two continuous variables relate to each other

In [None]:
# Scatter plot with regression line
fig, ax = plt.subplots(figsize=(10, 6))
sns.regplot(x='total_bill', y='tip', data=tips, ax=ax, 
            scatter_kws={'alpha': 0.6, 's': 50},
            line_kws={'color': 'red', 'linewidth': 2})
ax.set_title('Relationship: Total Bill vs Tip Amount')
ax.set_xlabel('Total Bill ($)')
ax.set_ylabel('Tip ($)')
fig

**Interpretation**:
- Scatter points: Individual observations
- Red line: Best-fit linear regression
- Shaded area: 95% confidence interval of regression
- **Positive relationship**: Higher bills → higher tips
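
The fitted line can be reproduced numerically. The sketch below assumes a straight-line least-squares fit (which is what `regplot` draws by default) and uses made-up data that lies exactly on a known line, so the recovered slope and intercept are easy to check.

```python
import numpy as np

# Hypothetical data lying exactly on tip = 0.15 * bill + 0.5
bill = np.array([10.0, 20.0, 30.0, 40.0])
tip = 0.15 * bill + 0.5

# Degree-1 polynomial fit = ordinary least-squares straight line
slope, intercept = np.polyfit(bill, tip, deg=1)
print(round(slope, 3), round(intercept, 3))
```

The slope reads as "extra tip per extra dollar of bill", which is often easier to communicate than the plot alone.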

### Pairs Plot: All Relationships at Once

Visualize relationships among all numeric variables simultaneously.

In [None]:
# Create pairs plot
# This shows all pairwise relationships
numeric_cols = tips[['total_bill', 'tip', 'size']]
sns.pairplot(numeric_cols, diag_kind='kde', plot_kws={'alpha': 0.6})

**Layout**:
- **Diagonal**: Distribution of each variable (KDE)
- **Off-diagonal**: Scatter plot for each pair of variables
- **Mirrored**: The upper and lower triangles show the same pairs with the axes swapped

### Box Plots: Statistical Distributions by Category

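**Use Case**: Compare distributions across groups using the five-number summary

**Box plot components**:
- **Box**: Middle 50% of data (Q1 to Q3, the interquartile range or IQR)
- **Line inside the box**: Median (Q2)
- **Whiskers**: Extend to the most extreme values within 1.5 × IQR of the box (the default)
- **Points**: Values beyond the whiskers, drawn individually as potential outliers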
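
The components of a box plot can be computed directly. The sketch below assumes the common 1.5 × IQR whisker rule and uses made-up values with one extreme point.

```python
import numpy as np

# Hypothetical data with one extreme value
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 50])

q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1

# Fences: the furthest a whisker is allowed to reach
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers stop at the most extreme data points inside the fences
inside = values[(values >= lower_fence) & (values <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()

# Anything outside the fences is drawn as an individual point
outliers = values[(values < lower_fence) | (values > upper_fence)]

print(q1, median, q3)               # 3.25 5.5 7.75
print(whisker_low, whisker_high)    # 1 9
print(outliers)                     # [50]
```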

In [None]:
fig, ax = plt.subplots(figsize=(10, 6))
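# Note: seaborn 0.13+ warns when `palette` is passed without `hue`;
# on those versions, add hue='day' and legend=False for the same result
sns.boxplot(x='day', y='total_bill', data=tips, ax=ax, palette='Set2')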
ax.set_title('Total Bill Distribution by Day of Week')
ax.set_xlabel('Day')
ax.set_ylabel('Total Bill ($)')
fig

In [None]:
# Add hue for additional dimension
fig, ax = plt.subplots(figsize=(12, 6))
sns.boxplot(x='day', y='total_bill', hue='sex', data=tips, ax=ax, palette='Set2')
ax.set_title('Total Bill by Day and Gender')
ax.set_xlabel('Day')
ax.set_ylabel('Total Bill ($)')
fig

### Facet Grids: Multi-dimensional Analysis

**Use Case**: Show how a relationship changes across groups

Creates a grid of plots, splitting data by categorical variables.

In [None]:
# Create a facet grid
g = sns.catplot(
    x='day',                    # X-axis variable
    y='tip_pct',               # Y-axis variable
    hue='sex',                 # Color by gender
    col='smoker',              # Separate columns by smoker status
    kind='bar',                # Bar plot type
    data=tips,
    height=5,
    aspect=1.2
)
g.set_titles('Smoker: {col_name}')
g.set_axis_labels('Day', 'Tip Percentage')

**Layout**:
- **Columns** (`col='smoker'`): Separate plots for non-smokers and smokers
- **Hue** (`hue='sex'`): Different colors for male/female
- **X-axis** (`x='day'`): Days of week on each plot

## 🎓 Summary & Closure

### Key Takeaways

1. **Data Visualization is Communication**
   - Choose plots that reveal truth in data
   - Customize for clarity and impact
   - Make context and interpretation obvious

2. **Matplotlib is the Foundation**
   - Figures → Axes → Artists hierarchy
   - Object-oriented API gives full control
   - Many high-level libraries build on it

3. **Seaborn Simplifies Statistical Graphics**
   - Handles aggregation automatically
   - Beautiful defaults
   - Perfect for exploratory analysis

### Best Practices

✓ **DO**:
- Start with simple plots to understand your data
- Add meaningful titles, labels, and legends
- Use colors meaningfully, not decoratively
- Let the data speak for itself

✗ **DON'T**:
- Use 3D plots unless essential
- Clutter plots with unnecessary elements
- Use rainbow color schemes (perceptually uneven and hard to read for colorblind viewers)
- Distort axes to exaggerate effects

### Resources & References

**Official Documentation**:
- [Matplotlib Documentation](https://matplotlib.org/)
- [Seaborn Documentation](https://seaborn.pydata.org/)
- [Pandas Plotting](https://pandas.pydata.org/docs/user_guide/visualization.html)

**Learning Resources**:
- Matplotlib Gallery: https://matplotlib.org/stable/gallery/
- Seaborn Example Gallery: https://seaborn.pydata.org/examples.html

**Books**:
- "Python for Data Analysis" by Wes McKinney
- "Fundamentals of Data Visualization" by Claus Wilke (free online)

---

## 📚 Optional/Advanced Topics

The following sections are optional and can be explored for deeper understanding.

### Advanced: Pandas `.plot()` Method

Pandas objects have convenient `.plot()` methods that use Matplotlib under the hood.

In [None]:
# Create sample DataFrame
df = pd.DataFrame(
    np.random.uniform(size=(6, 4)),
    index=['Group 1', 'Group 2', 'Group 3', 'Group 4', 'Group 5', 'Group 6'],
    columns=pd.Index(['A', 'B', 'C', 'D'], name='Category')
)

print(df)

In [None]:
# Vertical bar plot
df.plot.bar(figsize=(12, 6), alpha=0.7)
plt.title('Bar Plot of DataFrame')
plt.ylabel('Value')
plt.tight_layout()
plt.show()

In [None]:
# Horizontal stacked bar plot
df.plot.barh(stacked=True, figsize=(10, 6), alpha=0.7)
plt.title('Stacked Horizontal Bar Plot')
plt.xlabel('Value')
plt.tight_layout()
plt.show()

### Advanced: Global Configuration with rcParams

Customize default Matplotlib behavior globally.

In [None]:
# View current configuration
print('Current Figure Size:', plt.rcParams['figure.figsize'])
print('Current Font Size:', plt.rcParams['font.size'])
print('Current Line Width:', plt.rcParams['lines.linewidth'])

In [None]:
# Customize global settings
plt.rc('figure', figsize=(12, 6))
plt.rc('font', size=11)
plt.rc('lines', linewidth=2)
plt.rc('legend', framealpha=0.9, fontsize='small')

# Verify changes
print('Updated Figure Size:', plt.rcParams['figure.figsize'])
print('Updated Font Size:', plt.rcParams['font.size'])
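
Global changes persist for the rest of the session. When a setting should apply to a single figure only, `plt.rc_context()` offers a temporary alternative, sketched below with arbitrary example values.

```python
import matplotlib.pyplot as plt

before = plt.rcParams['lines.linewidth']

# Settings inside the with-block are restored automatically afterwards
with plt.rc_context({'lines.linewidth': 4, 'font.size': 14}):
    inside = plt.rcParams['lines.linewidth']
    fig, ax = plt.subplots()
    (thick,) = ax.plot([0, 1, 2], [0, 1, 0])

after = plt.rcParams['lines.linewidth']
print(before, inside, after)
```

To undo earlier `plt.rc()` calls entirely, `plt.rcdefaults()` resets every setting to Matplotlib's built-in defaults.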

### Advanced: Saving High-Quality Figures

Export plots for publications and presentations.

In [None]:
# Create example plot
fig, ax = plt.subplots(figsize=(10, 6))
data = np.random.standard_normal(100).cumsum()
ax.plot(data, linewidth=2, color='steelblue')
ax.set_title('Example Figure for Saving', fontsize=14, fontweight='bold')
ax.set_xlabel('Time')
ax.set_ylabel('Value')
ax.grid(True, alpha=0.3)

# Save with different options (uncomment to write files)
# fig.savefig('plot.png', dpi=150, bbox_inches='tight', facecolor='white')
# fig.savefig('plot.pdf', dpi=300, bbox_inches='tight')  # vector format: dpi only affects rasterized elements
# fig.savefig('plot.svg', dpi=150, bbox_inches='tight')  # vector format: dpi only affects rasterized elements

print('Save options:')
print('- dpi=150: Adequate for screens and slides')
print('- dpi=300: Common choice for print publication')
print('- bbox_inches="tight": Trim extra white space around the figure')
print('- facecolor="white": Ensure white background')
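
For raster formats, the pixel size of the saved image is the figure size in inches multiplied by `dpi`. The sketch below checks this by saving to an in-memory buffer instead of a file; the helper `png_size` is hypothetical, and `bbox_inches='tight'` is omitted because it would crop the image and change the size.

```python
import io
import struct
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(4, 3))
ax.plot([0, 1, 2], [0, 1, 4])

def png_size(fig, dpi):
    """Render fig to an in-memory PNG and return (width, height) in pixels."""
    buffer = io.BytesIO()
    fig.savefig(buffer, format='png', dpi=dpi)
    data = buffer.getvalue()
    # Bytes 16-24 of a PNG file hold the image width and height (big-endian)
    return struct.unpack('>II', data[16:24])

print(png_size(fig, dpi=100))   # (400, 300)
print(png_size(fig, dpi=300))   # (1200, 900)
```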