# Week 5 Lab: Data Visualization Techniques

**Estimated Time:** 60–90 minutes  
**Objective:** Master data visualization by choosing the right chart types, applying design principles, ensuring accessibility, and building interactive charts — using Philippine regional economic data throughout.

In this lab, you will:
- Build comparison, distribution, and relationship charts with Matplotlib
- Apply Tufte's data-ink ratio principle to clean up cluttered charts
- Spot and fix common chart mistakes (pie chart abuse, rainbow colors)
- Use colorblind-safe palettes and redundant encoding for accessibility
- Create small multiples and multi-panel dashboards with Seaborn
- Write an annotated, publication-quality data story figure
- Build interactive charts with Plotly Express

---

## Setup

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.graph_objects as go

plt.style.use('seaborn-v0_8-whitegrid')
sns.set_palette('colorblind')
pd.set_option('display.max_columns', None)
np.random.seed(42)

print('✓ Libraries imported successfully!')

## Dataset: Philippine Regional Economic Indicators (PSA 2023)

We will use this dataset across all 7 parts. Take a moment to understand its structure.

In [None]:
data = {
    'Region': ['NCR', 'CAR', 'Region I', 'Region II', 'Region III',
               'CALABARZON', 'MIMAROPA', 'Region V', 'Region VI',
               'Region VII', 'Region VIII', 'Region IX', 'Region X',
               'Region XI', 'Region XII', 'BARMM'],
    'Population_M': [13.48, 1.80, 5.30, 3.68, 12.42, 16.20, 3.23,
                     6.08, 7.95, 8.08, 4.55, 3.87, 5.02, 5.24, 4.93, 4.40],
    'GDP_Billions': [6842, 352, 418, 310, 1876, 2821, 286, 462,
                     1054, 1342, 284, 318, 724, 1021, 485, 312],
    'Poverty_Rate': [3.2, 14.8, 11.2, 14.1, 10.5, 8.1, 22.4, 28.3,
                     20.1, 18.7, 30.5, 33.8, 25.6, 16.3, 29.1, 63.0],
    'Literacy_Rate': [99.2, 96.8, 97.5, 96.2, 97.8, 98.1, 95.3, 96.0,
                      96.5, 97.1, 95.8, 94.2, 96.0, 97.0, 95.5, 82.5],
    'Tourism_Arrivals_K': [4200, 180, 320, 150, 580, 890, 420, 310,
                           650, 1200, 210, 120, 280, 380, 160, 85],
    'Island_Group': ['Luzon', 'Luzon', 'Luzon', 'Luzon', 'Luzon',
                     'Luzon', 'Luzon', 'Luzon', 'Visayas', 'Visayas',
                     'Visayas', 'Mindanao', 'Mindanao', 'Mindanao',
                     'Mindanao', 'Mindanao']
}

df = pd.DataFrame(data)
print(f'Dataset: {df.shape[0]} regions × {df.shape[1]} columns')
df

---
## Part 1: Comparison Charts

Comparison charts answer the question: **"Which categories are largest / smallest / most different?"**

The key decision is *horizontal vs vertical bars* — use horizontal when category names are long or ranking matters top-to-bottom.

### Exercise 1.1: Ranked Horizontal Bar Chart

Create a **horizontal bar chart** of GDP by region, sorted highest to lowest.

- Sort the DataFrame by `GDP_Billions` ascending (so the top bar is highest in `barh`)
- Use `ax.barh()` with `color='steelblue'`
- x-axis label: `'GDP (Billion PHP)'`
- Title: `'Philippine GDP by Region (PSA 2023)'`
- After the chart, print the name and GDP of the highest and lowest regions
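
A minimal sketch of why the ascending sort matters, using a made-up three-row table (the `toy` frame and its `Item`/`Value` columns are illustrative, not part of the lab data). `barh` draws the first row at the bottom, so the last row of an ascending sort ends up on top.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical toy data, not the lab dataset
toy = pd.DataFrame({'Item': ['A', 'B', 'C'], 'Value': [20, 30, 10]})
toy_sorted = toy.sort_values('Value')  # ascending: C (10), A (20), B (30)

fig, ax = plt.subplots(figsize=(5, 3))
bars = ax.barh(toy_sorted['Item'], toy_sorted['Value'], color='steelblue')

# barh places the first row at the bottom and the last row at the top
top_bar = max(bars, key=lambda b: b.get_y())
print('Width of the top bar:', top_bar.get_width())
```

Sorting descending instead would flip the ranking so that the smallest bar sits on top.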

In [None]:
# TODO: Sort by GDP_Billions ascending
df_sorted = None  # Your code here

fig, ax = plt.subplots(figsize=(10, 7))

# TODO: Create horizontal bar chart
# Hint: ax.barh(df_sorted['Region'], df_sorted['GDP_Billions'], color='steelblue')
# Your code here

ax.set_xlabel('GDP (Billion PHP)')
ax.set_title('Philippine GDP by Region (PSA 2023)', fontweight='bold')
plt.tight_layout()
plt.show()

# TODO: Print top and bottom region by GDP
# Your code here

### Exercise 1.2: Grouped Bar Chart

Compare **Poverty Rate vs Literacy Rate** side-by-side for the 8 most populous regions.

- Select top 8 by `Population_M`
- Use `np.arange(len(top8))` for x positions and `width = 0.35`
- Poverty bars: `color='#EF4444'`; Literacy bars: `color='#2563EB'`
- Add a legend and rotate x-tick labels 45°

In [None]:
top8 = df.nlargest(8, 'Population_M')

fig, ax = plt.subplots(figsize=(12, 6))
x = np.arange(len(top8))
width = 0.35

# TODO: Two side-by-side bar groups
# Hint: ax.bar(x - width/2, top8['Poverty_Rate'],  width, label='Poverty Rate (%)',  color='#EF4444')
#        ax.bar(x + width/2, top8['Literacy_Rate'], width, label='Literacy Rate (%)', color='#2563EB')
# Your code here

# TODO: Set x-tick labels to region names, rotated 45°
# Hint: ax.set_xticks(x); ax.set_xticklabels(top8['Region'], rotation=45, ha='right')
# Your code here

ax.set_ylabel('Rate (%)')
ax.set_title('Poverty Rate vs Literacy Rate — Top 8 Regions by Population', fontweight='bold')
ax.legend()
plt.tight_layout()
plt.show()

---
## Part 2: Distribution Charts

Distribution charts answer: **"How is this variable spread? Where is the center? Are there outliers?"**

Histograms show shape; box plots show spread, quartiles, and outliers. Use them together for a complete picture.

### Exercise 2.1: Histogram with KDE Overlay

Visualize the distribution of **Poverty Rate** across all 16 regions.

- `sns.histplot()` with `kde=True`, `bins=8`, `color='steelblue'`
- Add a dashed red vertical line at the **mean**
- Add a dashed green vertical line at the **median**
- Label both lines in the legend
- After the chart, print mean and median and explain in one sentence why they differ
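
To see why a mean and a median can disagree, here is a small sketch on invented numbers (the `values` series is hypothetical). One extreme observation pulls the mean upward while the median barely moves.

```python
import pandas as pd

# Hypothetical values: five similar observations plus one extreme one
values = pd.Series([10, 12, 14, 16, 18, 80])

mean_val = values.mean()      # pulled upward by the extreme value
median_val = values.median()  # middle of the sorted values, barely affected

print(f'Mean: {mean_val:.1f}, Median: {median_val:.1f}')
print('Right-skewed' if mean_val > median_val else 'Not right-skewed')
```

A mean well above the median is a common sign of a right-skewed distribution or a high outlier.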

In [None]:
fig, ax = plt.subplots(figsize=(10, 6))

# TODO: Histogram with KDE
# Your code here

# TODO: Mean and median vertical lines
mean_pov   = None  # Your code here
median_pov = None  # Your code here
# ax.axvline(mean_pov,   color='red',   linestyle='--', label=f'Mean: {mean_pov:.1f}%')
# ax.axvline(median_pov, color='green', linestyle='--', label=f'Median: {median_pov:.1f}%')
# Your code here

ax.set_xlabel('Poverty Rate (%)')
ax.set_ylabel('Count')
ax.set_title('Distribution of Regional Poverty Rates', fontweight='bold')
ax.legend()
plt.tight_layout()
plt.show()

# TODO: Print mean, median, and 1-sentence explanation
# Your code here

### Exercise 2.2: Box Plots by Island Group

Compare poverty rate distributions across **Luzon, Visayas, and Mindanao**.

- `sns.boxplot()` with `x='Island_Group'`, `y='Poverty_Rate'`, `order=['Luzon','Visayas','Mindanao']`
- Overlay `sns.stripplot()` with `color='black'`, `size=6`, `jitter=True`, `alpha=0.7`
- After the chart, print the **median** poverty rate for each island group
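
The numbers a box plot draws can be computed directly. This sketch uses invented values and the conventional 1.5 × IQR rule, which is the default whisker setting in Matplotlib and seaborn box plots. The variable names are illustrative.

```python
import pandas as pd

values = pd.Series([2, 4, 6, 8, 10, 40])  # hypothetical values

q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1

# Points beyond these fences are drawn as individual outlier markers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = values[(values < lower_fence) | (values > upper_fence)]
print(f'Q1={q1}, Q3={q3}, IQR={iqr}')
print('Outliers:', outliers.tolist())
```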

In [None]:
fig, ax = plt.subplots(figsize=(8, 6))
order = ['Luzon', 'Visayas', 'Mindanao']

# TODO: Box plot
# Your code here

# TODO: Strip plot overlay
# Your code here

ax.set_ylabel('Poverty Rate (%)')
ax.set_title('Poverty Rate Distribution by Island Group', fontweight='bold')
plt.tight_layout()
plt.show()

# TODO: Print median poverty rate per island group
# Hint: df.groupby('Island_Group')['Poverty_Rate'].median().sort_values(ascending=False)
# Your code here

---
## Part 3: Relationship Charts

Relationship charts answer: **"Is there a pattern between two variables?"**

Scatter plots reveal direction and strength; heatmaps show all pairwise correlations at once.

### Exercise 3.1: Scatter Plot with Regression Line

Explore the relationship between **GDP and Tourism Arrivals**.

- `sns.scatterplot()` with `hue='Island_Group'`, `s=120`
- Overlay `sns.regplot()` with `scatter=False`, `color='gray'`, dashed line
- Label axes and add a title
- After the chart, compute and print the Pearson correlation coefficient
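
As a reminder of what the coefficient measures, this sketch computes Pearson's r by hand and compares it with `Series.corr` on invented paired values (`x` and `y` are hypothetical).

```python
import pandas as pd

# Hypothetical paired measurements
x = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
y = pd.Series([2.0, 4.0, 5.0, 4.0, 5.0])

# Pearson r = covariance / (std_x * std_y), all with sample (ddof=1) statistics
r_manual = x.cov(y) / (x.std() * y.std())
r_pandas = x.corr(y)  # pandas computes Pearson r by default

print(f'manual: {r_manual:.4f}, pandas: {r_pandas:.4f}')
```

Pearson's r only captures linear association, and a single extreme point can dominate it, so it is worth reading alongside the scatter plot rather than instead of it.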

In [None]:
fig, ax = plt.subplots(figsize=(10, 7))

# TODO: Scatter plot colored by Island_Group
# Your code here

# TODO: Regression line overlay
# Hint: sns.regplot(data=df, x='GDP_Billions', y='Tourism_Arrivals_K',
#                    scatter=False, color='gray',
#                    line_kws={'linestyle': '--', 'linewidth': 1.5}, ax=ax)
# Your code here

ax.set_xlabel('GDP (Billion PHP)')
ax.set_ylabel('Tourism Arrivals (Thousands)')
ax.set_title('GDP vs Tourism Arrivals by Region', fontweight='bold')
plt.tight_layout()
plt.show()

# TODO: Print Pearson r between GDP_Billions and Tourism_Arrivals_K
# Hint: df['GDP_Billions'].corr(df['Tourism_Arrivals_K'])
# Your code here

### Exercise 3.2: Correlation Heatmap

Visualize all pairwise correlations in one chart.

- Compute `.corr()` on numeric columns only
- `sns.heatmap()` with `annot=True`, `cmap='coolwarm'`, `center=0`, `fmt='.2f'`, `linewidths=0.5`
- After the chart, identify and print the **strongest positive** and **strongest negative** correlation pair
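
One possible way to pull the strongest pairs out of a correlation matrix, sketched on an invented three-column table (`toy` and its columns are hypothetical). Masking everything except the cells above the diagonal removes self-correlations and the mirrored duplicate of each pair.

```python
import numpy as np
import pandas as pd

# Hypothetical toy table with three numeric columns
toy = pd.DataFrame({
    'a': [1, 2, 3, 4, 5],
    'b': [2, 1, 4, 3, 6],   # loosely rises with a
    'c': [9, 7, 6, 3, 1],   # falls as a rises
})
corr = toy.corr()

# True only above the diagonal (k=1 excludes the diagonal itself)
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(upper).stack()   # Series indexed by (row, column)

print('Strongest positive:', pairs.idxmax(), round(pairs.max(), 3))
print('Strongest negative:', pairs.idxmin(), round(pairs.min(), 3))
```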

In [None]:
fig, ax = plt.subplots(figsize=(8, 6))

# TODO: Correlation matrix
corr = None  # Your code here

# TODO: Heatmap
# Your code here

ax.set_title('Correlation Matrix — Philippine Regional Indicators', fontweight='bold')
plt.tight_layout()
plt.show()

# TODO: Print strongest positive and negative correlation pairs (exclude self-correlations)
# Your code here

---
## Part 4: Design Principles

Knowing *which* chart to use is only half the job. This part applies **Tufte's data-ink ratio** and demonstrates the most common chart mistakes covered in the lecture.

### Exercise 4.1: Tufte Data-Ink Clean-Up

The first cell produces a deliberately cluttered bar chart. Run it first, then complete the second cell to produce a clean version.

Your clean chart must:
1. Use a single muted color `'#4a90d9'`
2. Remove all four spines
3. Remove x-axis ticks — add **direct value labels** instead
4. Remove gridlines
5. Use an **active title** that states the insight (not just the variable name)
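
The five requirements can be sketched on invented data as follows. This version uses `ax.bar_label` (available in Matplotlib 3.4 and later) for the direct labels, as an alternative to placing each label with `ax.text`. The labels, values, and title are hypothetical.

```python
import matplotlib.pyplot as plt

# Hypothetical toy data
labels = ['X', 'Y', 'Z']
values = [120, 340, 560]

fig, ax = plt.subplots(figsize=(6, 3))
bars = ax.barh(labels, values, color='#4a90d9')

# Remove non-data ink: spines, x ticks, gridlines
for spine in ax.spines.values():
    spine.set_visible(False)
ax.set_xticks([])
ax.grid(False)

# Direct labels replace the axis the reader would otherwise look up
value_labels = ax.bar_label(bars, fmt='%d', padding=4)

# Active title: states the finding instead of naming the variable
ax.set_title('Z is more than four times X', fontweight='bold')
fig.tight_layout()
```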

In [None]:
# --- BAD CHART: Run this to see the problems ---
df_top5 = df.nlargest(5, 'GDP_Billions').sort_values('GDP_Billions')

fig, ax = plt.subplots(figsize=(9, 4))
ax.barh(df_top5['Region'], df_top5['GDP_Billions'],
        color=['red', 'orange', 'yellow', 'green', 'purple'])
ax.set_title('GDP', fontsize=10)
ax.set_xlabel('billions')
ax.grid(True, linestyle='-', linewidth=2)
plt.tight_layout()
plt.show()
print('Problems: rainbow colors, heavy grid, vague title, no direct labels')

In [None]:
# --- GOOD CHART: Your Tufte clean-up ---
fig, ax = plt.subplots(figsize=(9, 4))

# TODO: Single-color bars
bars = None  # ax.barh(df_top5['Region'], df_top5['GDP_Billions'], color='#4a90d9')
# Your code here

# TODO: Remove all four spines
# Hint: for spine in ax.spines.values(): spine.set_visible(False)
# Your code here

# TODO: Remove x-axis ticks
# Your code here

# TODO: Direct value labels on each bar
# Hint: for bar in bars:
#           ax.text(bar.get_width() + 50, bar.get_y() + bar.get_height()/2,
#                   f'{bar.get_width():,.0f}B', va='center', fontsize=10)
# Your code here

# TODO: Active title that states the insight
ax.set_title('', fontweight='bold')  # Replace empty string with your active title
ax.grid(False)
plt.tight_layout()
plt.show()

### Exercise 4.2: Bad Chart vs Good Chart — The Pie Problem

Create a side-by-side comparison:

- **Left (bad):** `ax.pie()` with all 16 regions, `autopct='%1.0f%%'`
- **Right (good):** top 5 regions + an `'Others'` bar (sum of remaining 11), horizontal, sorted ascending; use gray for the Others bar

In a comment, write one sentence explaining what makes the bar chart cognitively easier to read.
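
Collapsing a long tail into a single `'Others'` category is a small pandas operation. This sketch does it on invented totals, keeping the top 2 instead of the top 5 (`sales` and its labels are hypothetical).

```python
import pandas as pd

# Hypothetical category totals
sales = pd.Series({'A': 50, 'B': 30, 'C': 10, 'D': 5, 'E': 3, 'F': 2})

top_n = sales.nlargest(2)                 # keep the two largest categories
others_total = sales.sum() - top_n.sum()  # everything else, as one number

# 'Others' goes first so a horizontal bar chart draws it at the bottom
collapsed = pd.concat([pd.Series({'Others': others_total}),
                       top_n.sort_values()])
print(collapsed)
```

The collapsed series keeps the grand total, so no value is silently dropped from the chart.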

In [None]:
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6))

# LEFT: Bad — pie with 16 slices
# TODO: ax1.pie(df['GDP_Billions'], labels=df['Region'], autopct='%1.0f%%')
# Your code here
ax1.set_title('Bad: Pie Chart (16 slices)', fontweight='bold', color='#dc2626')

# RIGHT: Good — top 5 + Others
top5   = None  # df.nlargest(5, 'GDP_Billions')
others = None  # df['GDP_Billions'].sum() - top5['GDP_Billions'].sum()

# TODO: Build bar chart (top5 sorted ascending + 'Others' at bottom)
# Your code here

ax2.set_xlabel('GDP (Billion PHP)')
ax2.set_title('Good: Bar Chart (Top 5 + Others)', fontweight='bold', color='#16a34a')
plt.tight_layout()
plt.show()

# Why the bar chart is easier:
# ...

---
## Part 5: Color & Accessibility

About 8% of males have some form of color vision deficiency. These exercises apply two key fixes: colorblind-safe palettes and redundant encoding.

### Exercise 5.1: Compare Three Color Palettes

Create the **same scatter plot** three times side-by-side, each with a different palette:
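1. `palette=None` (the current default color cycle; because Setup calls `sns.set_palette('colorblind')`, this matches panel 2 unless you pass `'tab10'`, Matplotlib's stock cycle)
2. `palette='colorblind'` (seaborn's 10-color colorblind-friendly palette)
3. `palette='viridis'` (perceptually uniform, but sequential, so it implies an order the island groups do not have)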

All three: `GDP_Billions` vs `Poverty_Rate`, colored by `Island_Group`, `s=100`.
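
A quick way to check what a palette setting resolves to. This is a sketch that repeats the `sns.set_palette('colorblind')` call from Setup so it runs on its own. For a categorical `hue` with only a few levels, seaborn falls back to the current default color cycle when `palette=None`, so the comparison below shows whether that cycle differs from the named `'colorblind'` palette.

```python
import numpy as np
import seaborn as sns

sns.set_palette('colorblind')  # same call as in Setup

current_default = sns.color_palette()         # the current default cycle
colorblind = sns.color_palette('colorblind')  # the named palette

same = np.allclose(current_default, colorblind)
print('Default cycle equals colorblind palette:', same)
print('Colors in palette:', len(colorblind))
print(colorblind.as_hex()[:3])  # hex codes of the first three colors
```

If the two match, a panel drawn with `palette=None` will look identical to one drawn with `palette='colorblind'`.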

In [None]:
fig, axes = plt.subplots(1, 3, figsize=(16, 5))

palettes = [
    ('Default',         'tab10'),  # Matplotlib's stock cycle; None would reuse the 'colorblind' palette from Setup
    ('Colorblind-safe', 'colorblind'),
    ('Viridis',         'viridis'),
]

for ax, (name, palette) in zip(axes, palettes):
    # TODO: sns.scatterplot with hue='Island_Group', the given palette, s=100
    # Your code here

    ax.set_title(f'{name} Palette', fontweight='bold')
    ax.set_xlabel('GDP (Billion PHP)')
    ax.set_ylabel('Poverty Rate (%)')

plt.suptitle('Same Data — Three Color Palettes', fontweight='bold', y=1.02)
plt.tight_layout()
plt.show()

### Exercise 5.2: Redundant Encoding (Color + Shape)

**Redundant encoding** uses two visual channels to convey the same grouping — so the chart remains readable in grayscale or for colorblind viewers.

Create two side-by-side scatter plots:
- **Left:** `hue='Island_Group'` only (color, no shape distinction)
- **Right:** `hue='Island_Group'` AND `style='Island_Group'` — `s=150`, `palette='colorblind'`
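
For intuition about what `style=` adds, this sketch builds the same kind of redundant encoding by hand in plain Matplotlib, giving each group both its own color and its own marker. The `toy` data, group names, and marker choices are hypothetical. The hex codes are taken from the Okabe-Ito colorblind-safe palette.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical toy data with three groups
toy = pd.DataFrame({
    'x': [1, 2, 3, 4, 5, 6],
    'y': [3, 1, 4, 1, 5, 9],
    'group': ['p', 'p', 'q', 'q', 'r', 'r'],
})
markers = {'p': 'o', 'q': 's', 'r': '^'}                    # one shape per group
colors = {'p': '#0072B2', 'q': '#E69F00', 'r': '#009E73'}   # one color per group

fig, ax = plt.subplots(figsize=(5, 4))
for name, sub in toy.groupby('group'):
    ax.scatter(sub['x'], sub['y'], marker=markers[name],
               color=colors[name], s=120, label=name)
ax.legend(title='group')
```

Because the marker shape carries the same information as the color, the groups remain separable even if the colors are lost.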

In [None]:
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 6))

# LEFT: Color only
# TODO: Your code here
ax1.set_title('Color Only', fontweight='bold')
ax1.set_xlabel('GDP (Billion PHP)')
ax1.set_ylabel('Poverty Rate (%)')

# RIGHT: Color + Shape (redundant encoding)
# TODO: Your code here
ax2.set_title('Color + Shape (Redundant Encoding)', fontweight='bold')
ax2.set_xlabel('GDP (Billion PHP)')
ax2.set_ylabel('Poverty Rate (%)')

plt.tight_layout()
plt.show()
print('The right chart stays readable in grayscale — groups are still distinguishable by shape.')

---
## Part 6: Small Multiples & Dashboards

Small multiples apply the same chart type across sub-groups, allowing easy comparison without visual clutter. Dashboards combine multiple chart types into one coordinated view.

### Exercise 6.1: FacetGrid — GDP Distribution by Island Group

Create a FacetGrid showing the GDP distribution for each island group in a separate panel.

- `sns.FacetGrid(df, col='Island_Group', col_order=['Luzon','Visayas','Mindanao'], height=4)`
- Map `sns.histplot` with `bins=8, color='steelblue'`
- Set column titles and axis labels
- After the grid, print which island group has the highest GDP **variance**
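
Per-group variance is a one-line `groupby`. This sketch uses an invented two-group table (`toy` is hypothetical) where both groups share the same mean but differ in spread.

```python
import pandas as pd

# Hypothetical toy data: both groups have mean 20
toy = pd.DataFrame({
    'group': ['p', 'p', 'p', 'q', 'q', 'q'],
    'value': [10, 20, 30, 19, 20, 21],
})

# .var() returns the sample variance (ddof=1) by default
variances = toy.groupby('group')['value'].var()
print(variances)
print('Highest variance:', variances.idxmax())
```

With very small groups the sample variance is a rough estimate, so treat comparisons between groups of three to eight rows as indicative only.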

In [None]:
# TODO: Create FacetGrid
# Hint:
# g = sns.FacetGrid(df, col='Island_Group',
#                    col_order=['Luzon', 'Visayas', 'Mindanao'], height=4)
# g.map(sns.histplot, 'GDP_Billions', bins=8, color='steelblue')
# g.set_titles(col_template='{col_name}')
# g.set_axis_labels('GDP (Billion PHP)', 'Count')
# Your code here

plt.suptitle('GDP Distribution by Island Group', fontweight='bold', y=1.02)
plt.tight_layout()
plt.show()

# TODO: Print GDP variance per island group
# Your code here

### Exercise 6.2: 2×2 Dashboard

Build a coordinated 2×2 dashboard with four panels:

| Panel | Chart | Variables |
|-------|-------|-----------|
| Top-left | Horizontal bar | Top 5 GDP |
| Top-right | Scatter | GDP vs Poverty, colored by Island_Group |
| Bottom-left | Box plot | Poverty Rate by Island Group |
| Bottom-right | Line chart | NCR and CALABARZON GDP trend (2018–2023) |

Main title: `'Philippine Economy at a Glance'`

In [None]:
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# --- Top-left: Horizontal bar (Top 5 GDP) ---
ax = axes[0, 0]
# TODO: Your code here
ax.set_title('Top 5 Regions by GDP', fontweight='bold')

# --- Top-right: Scatter (GDP vs Poverty) ---
ax = axes[0, 1]
# TODO: sns.scatterplot with hue='Island_Group', legend=False
# Your code here
ax.set_title('GDP vs Poverty Rate', fontweight='bold')

# --- Bottom-left: Box plot (Poverty by Island Group) ---
ax = axes[1, 0]
# TODO: sns.boxplot with order=['Luzon','Visayas','Mindanao']
# Your code here
ax.set_title('Poverty Distribution by Island Group', fontweight='bold')

# --- Bottom-right: Line chart (GDP over time) ---
ax = axes[1, 1]
years   = [2018, 2019, 2020, 2021, 2022, 2023]
ncr_gdp = [5800, 6100, 5200, 5500, 6200, 6842]
cal_gdp = [2200, 2400, 2100, 2300, 2600, 2821]
# TODO: Plot both lines with markers and labels
# Your code here
ax.set_xlabel('Year')
ax.set_ylabel('GDP (Billion PHP)')
ax.set_title('GDP Trend 2018–2023', fontweight='bold')
ax.legend()

fig.suptitle('Philippine Economy at a Glance', fontsize=16, fontweight='bold')
plt.tight_layout()
plt.show()

---
## Part 7: Plotly Interactive Visualization + Data Story

Parts 1–6 used Matplotlib and Seaborn, which produce static charts. This part introduces **Plotly** (covered in the lecture) for interactive, web-ready charts — and closes with a publication-quality data story figure.

### Exercise 7.1: Plotly Bar Chart with Hover

Recreate the GDP bar chart from Part 1 using `plotly.express`. Notice:
- Hover tooltips appear **for free** — no extra code needed
- Clicking legend items hides/shows island groups
- Zoom and pan are built-in

Use `px.bar()` with `orientation='h'`, `color='Island_Group'`, `color_discrete_sequence=px.colors.qualitative.Safe`.

In [None]:
# TODO: Sort df by GDP_Billions descending
df_px = None  # Your code here

# TODO: Plotly horizontal bar chart
# Hint:
# fig = px.bar(df_px, x='GDP_Billions', y='Region', orientation='h',
#               color='Island_Group',
#               title='Philippine GDP by Region — Hover to Explore',
#               labels={'GDP_Billions': 'GDP (Billion PHP)', 'Region': ''},
#               color_discrete_sequence=px.colors.qualitative.Safe)
# fig.update_layout(yaxis={'categoryorder': 'total ascending'})  # ranks bars by total, largest on top
fig = None  # Your code here

# TODO: fig.show()

### Exercise 7.2: Plotly Bubble Chart

Recreate the GDP vs Tourism scatter as a **bubble chart** where bubble size encodes `Population_M`.

- `px.scatter()` with `size='Population_M'`, `hover_data=['Region', 'Poverty_Rate']`
- After running, hover over BARMM and NCR — describe in a comment what contrast you notice

In [None]:
# TODO: Plotly bubble scatter
# Hint:
# fig = px.scatter(df, x='GDP_Billions', y='Tourism_Arrivals_K',
#                   color='Island_Group', size='Population_M',
#                   hover_data=['Region', 'Poverty_Rate'],
#                   title='GDP vs Tourism (Bubble = Population)',
#                   labels={'GDP_Billions': 'GDP (B PHP)',
#                           'Tourism_Arrivals_K': 'Tourism Arrivals (Thousands)'},
#                   color_discrete_sequence=px.colors.qualitative.Safe)
fig = None  # Your code here

# TODO: fig.show()

# Hover observation:
# BARMM vs NCR contrast: ...

### Exercise 7.3: Annotated Data Story Figure

Choose one insight from the data and create a **publication-quality Matplotlib figure** that communicates it clearly.

Requirements:
- Active title that **states the insight** (not just the variable name)
- At least one `ax.annotate()` with an arrow pointing to the key data point
- Clean Tufte-style design (no unnecessary spines, no heavy gridlines)
- Source note at the bottom right: `Source: PSA 2023`
- Colorblind-safe colors
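
The requirements above can be sketched on an invented series (the years, values, and wording are hypothetical). `xy` is the data point the arrow targets, `xytext` is where the label sits, and the source note uses axes-fraction coordinates so it stays anchored below the plot.

```python
import matplotlib.pyplot as plt

years = [2019, 2020, 2021, 2022]   # hypothetical series
values = [50, 35, 42, 55]

fig, ax = plt.subplots(figsize=(6, 4))
ax.plot(years, values, marker='o', color='#0072B2')
ax.set_xticks(years)
ax.set_ylim(25, 60)  # leave room below the line for the label

# Arrow runs from the label (xytext) to the data point (xy)
note = ax.annotate('Lowest point: 35',
                   xy=(2020, 35), xytext=(2020.5, 28),
                   arrowprops=dict(arrowstyle='->', color='#D55E00', lw=1.5),
                   fontsize=10, color='#D55E00')

# Source note in axes-fraction coordinates, bottom right
ax.text(0.99, -0.15, 'Source: hypothetical data', transform=ax.transAxes,
        ha='right', fontsize=9, color='gray', style='italic')

for side in ('top', 'right'):
    ax.spines[side].set_visible(False)
ax.set_title('Values dipped in 2020 before recovering', fontweight='bold')
fig.tight_layout()
```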

**Example insights:**
- NCR's GDP is more than double that of CALABARZON, the next-largest region
- BARMM's poverty rate (63.0%) is nearly 20× NCR's (3.2%)
- Mindanao's median poverty rate (29.1%) is more than double Luzon's (12.65%)
- Negative trend: regions with higher GDP tend to have lower poverty rates
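
Before a claim goes into an active title, it is worth checking in code. This sketch uses two invented groups (`toy`, `p`, and `q` are hypothetical). A statement of the form "every value in one group exceeds every value in another" holds only if the first group's minimum is above the second group's maximum. A claim about typical values compares medians instead.

```python
import pandas as pd

# Hypothetical toy data with two groups
toy = pd.DataFrame({
    'group': ['p', 'p', 'p', 'q', 'q', 'q'],
    'rate':  [12.0, 18.0, 25.0, 5.0, 9.0, 14.0],
})
p = toy.loc[toy['group'] == 'p', 'rate']
q = toy.loc[toy['group'] == 'q', 'rate']

# "Every p exceeds every q" requires min(p) > max(q)
every_p_above_q = p.min() > q.max()

# A weaker claim about typical values compares the medians
median_ratio = p.median() / q.median()

print('Every p above every q?', every_p_above_q)
print(f'Median ratio p/q: {median_ratio:.1f}')
```

Here the strict claim fails because of a single overlapping value, while the median comparison still supports a defensible title.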

In [None]:
fig, ax = plt.subplots(figsize=(10, 6))

# TODO: Your visualization code here

# TODO: Annotation with arrow
# Hint: ax.annotate('Key insight text',
#                    xy=(x_data, y_data), xytext=(x_text, y_text),
#                    arrowprops=dict(arrowstyle='->', color='red', lw=1.5),
#                    fontsize=11, fontweight='bold', color='red')
# Your code here

# TODO: Source note
# Hint: ax.text(0.99, -0.10, 'Source: PSA 2023', transform=ax.transAxes,
#               fontsize=9, ha='right', color='gray', style='italic')
# Your code here

plt.tight_layout()
plt.show()

### Exercise 7.4: Data Story Caption

Write a **2–3 sentence caption** for your figure above.

A strong data story caption:
1. States the main finding
2. Provides context (why it matters)
3. Suggests an implication or action

**Your Caption:**

[Write your 2–3 sentence data story caption here]

---
## Reflection Questions

### Question 1

For each of the three chart families in Parts 1–3 (comparison, distribution, relationship), state the data question it answers and give one example from this lab where a *different* chart type would have failed.

**Your Answer:** [Your answer here]

### Question 2

In Exercise 4.1 you applied Tufte's data-ink ratio. Name **two specific changes** you made and explain the cognitive benefit of each.

**Your Answer:** [Your answer here]

### Question 3

About 8% of males have some form of color vision deficiency. Describe one **real-world scenario** (outside this lab) where failing to use redundant encoding could cause a serious problem.

**Your Answer:** [Your answer here]

### Question 4

When is a FacetGrid (small multiples) a better choice than a single grouped chart, and when is it worse? Give specific criteria.

**Your Answer:** [Your answer here]

### Question 5

You used both Matplotlib/Seaborn (Parts 1–6) and Plotly (Part 7) for the same data. For each tool, describe **one context** where it is clearly the better choice and explain why.

**Your Answer:** [Your answer here]

---

## Congratulations!

You've completed the Week 5 Lab on Data Visualization Techniques.

**Key Skills Practiced:**
- Horizontal bar charts and grouped bar charts (Matplotlib)
- Histograms with KDE overlay and box plots with strip plot overlay (Seaborn)
- Scatter plots with regression lines and correlation heatmaps
- Tufte data-ink clean-up: removing spines, direct labels, active titles
- Pie chart vs bar chart comparison
- Colorblind-safe palettes and redundant color + shape encoding
- FacetGrid small multiples and 2×2 coordinated dashboards
- Annotated publication-quality data story figure
- Plotly Express: interactive bar chart, bubble scatter with hover data

**Remember:** Check the **solution notebook** if you need help!

**Next Steps:** Week 6 Lab — Building Interactive Dashboards

---

*CMSC 178DA — Data Analytics | University of the Philippines Cebu*