# Week 5 Lab: Data Visualization Techniques

**Estimated Time:** 30-60 minutes  
**Objective:** Apply the visualization principles from lecture by choosing appropriate chart types, improving chart design, and building interactive visualizations.

In this lab, you will:
- Choose the right chart type for each data question
- Apply Tufte's data-ink ratio principle to improve chart clarity
- Use colorblind-safe palettes and redundant encoding for accessibility
- Build interactive charts with Plotly

---

## Setup

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
import seaborn as sns
import plotly.express as px

plt.style.use('seaborn-v0_8-whitegrid')
sns.set_palette('colorblind')
pd.set_option('display.max_columns', None)
np.random.seed(42)

# Philippine Regional Economic Indicators (PSA 2023)
data = {
    'Region': ['NCR', 'CAR', 'Region I', 'Region II', 'Region III',
               'CALABARZON', 'MIMAROPA', 'Region V', 'Region VI',
               'Region VII', 'Region VIII', 'Region IX', 'Region X',
               'Region XI', 'Region XII', 'BARMM'],
    'Population_M': [13.48, 1.80, 5.30, 3.68, 12.42, 16.20, 3.23,
                     6.08, 7.95, 8.08, 4.55, 3.87, 5.02, 5.24, 4.93, 4.40],
    'GDP_Billions': [6842, 352, 418, 310, 1876, 2821, 286, 462,
                     1054, 1342, 284, 318, 724, 1021, 485, 312],
    'Poverty_Rate': [3.2, 14.8, 11.2, 14.1, 10.5, 8.1, 22.4, 28.3,
                     20.1, 18.7, 30.5, 33.8, 25.6, 16.3, 29.1, 63.0],
    'Literacy_Rate': [99.2, 96.8, 97.5, 96.2, 97.8, 98.1, 95.3, 96.0,
                      96.5, 97.1, 95.8, 94.2, 96.0, 97.0, 95.5, 82.5],
    'Tourism_Arrivals_K': [4200, 180, 320, 150, 580, 890, 420, 310,
                           650, 1200, 210, 120, 280, 380, 160, 85],
    'Island_Group': ['Luzon', 'Luzon', 'Luzon', 'Luzon', 'Luzon',
                     'Luzon', 'Luzon', 'Luzon', 'Visayas', 'Visayas',
                     'Visayas', 'Mindanao', 'Mindanao', 'Mindanao',
                     'Mindanao', 'Mindanao']
}

df = pd.DataFrame(data)
print(f'Dataset loaded: {df.shape[0]} regions, {df.shape[1]} columns')
df.head()

---
## Part 1: Choosing the Right Chart

The lecture introduced a chart taxonomy based on the **data question** you are answering. Each exercise here pairs a specific question with the appropriate chart type.

### Exercise 1.1 — Comparison: Horizontal Bar Chart

**Data question:** *Which regions have the highest GDP?*

When comparing many categories with long names, a **horizontal bar chart** is clearer than a vertical one — labels don't overlap and ranking is easy to read top-to-bottom.

- Sort `df` by `GDP_Billions` in ascending order (so the largest bar ends up at the top)
- Use `ax.barh()` with `color='steelblue'`
- Label x-axis: "GDP (Billion PHP)"
- Title: "Philippine GDP by Region (PSA 2023)"
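
Why an ascending sort puts the largest bar on top can be seen on a small example. The sketch below uses hypothetical toy data (the `toy` DataFrame and its column names are not part of the lab dataset) and assumes only that `barh` draws rows from the bottom of the axes upward.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical toy data, not taken from the lab's dataset.
toy = pd.DataFrame({'city': ['A', 'B', 'C', 'D'],
                    'value': [30, 80, 10, 55]})

# barh places the first row at the bottom of the axes,
# so an ascending sort leaves the largest value in the top position.
toy_sorted = toy.sort_values('value', ascending=True)

fig, ax = plt.subplots(figsize=(5, 3))
toy_bars = ax.barh(toy_sorted['city'], toy_sorted['value'], color='steelblue')
ax.set_xlabel('Value (toy units)')
plt.tight_layout()
plt.show()

# Bottom-to-top order of the bars as drawn
print(list(toy_sorted['city']))  # ['C', 'A', 'D', 'B']
```

The last row of the sorted frame (`B`, the largest value) ends up as the topmost bar.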

In [None]:
# TODO: Sort the DataFrame by GDP_Billions (ascending, so the longest bar is at top)
df_sorted = None  # Your code here

fig, ax = plt.subplots(figsize=(10, 7))

# TODO: Create the horizontal bar chart
# Hint: ax.barh(df_sorted['Region'], df_sorted['GDP_Billions'], color='steelblue')
# Your code here

ax.set_xlabel('GDP (Billion PHP)')
ax.set_title('Philippine GDP by Region (PSA 2023)', fontweight='bold')
plt.tight_layout()
plt.show()

### Exercise 1.2 — Distribution: Box Plot by Group

**Data question:** *How does poverty spread differ across island groups?*

A **box plot** shows median, quartiles, and outliers — perfect for comparing distributions between groups. Overlaying a strip plot shows individual data points, preventing the box from hiding small sample sizes.

- Use `sns.boxplot(data=df, x='Island_Group', y='Poverty_Rate', ax=ax)`
- Overlay `sns.stripplot()` with `color='black'`, `size=5`, `jitter=True`
- Title: "Poverty Rate by Island Group"
- After the chart: print which island group has the highest median poverty rate
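
To make the parts of a box plot concrete, here is a minimal sketch that computes them by hand for the eight Luzon poverty rates from the Setup cell. It assumes the common defaults (linearly interpolated quartiles and whiskers at 1.5 × IQR); the variable names are illustrative only.

```python
import pandas as pd

# Poverty rates of the eight Luzon regions, copied from the Setup cell.
luzon = pd.Series([3.2, 14.8, 11.2, 14.1, 10.5, 8.1, 22.4, 28.3])

# The box spans Q1 to Q3, with a line at the median.
q1, median, q3 = luzon.quantile([0.25, 0.50, 0.75])
iqr = q3 - q1

# Assumed default whisker rule: points beyond these fences are drawn as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = luzon[(luzon < lower_fence) | (luzon > upper_fence)]

print(f'Q1={q1:.2f}, median={median:.2f}, Q3={q3:.2f}, IQR={iqr:.2f}')
print(f'Fences: [{lower_fence:.2f}, {upper_fence:.2f}]')
print('Outliers:', outliers.tolist())
```

With only eight points per box (and even fewer in other groups), a single value can land outside the fences, which is one reason the strip plot overlay is useful.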

In [None]:
fig, ax = plt.subplots(figsize=(8, 6))

# TODO: Create the box plot
# Your code here

# TODO: Overlay individual data points
# Your code here

ax.set_title('Poverty Rate by Island Group', fontweight='bold')
ax.set_ylabel('Poverty Rate (%)')
plt.tight_layout()
plt.show()

# TODO: Print the island group with the highest median poverty rate
# Hint: df.groupby('Island_Group')['Poverty_Rate'].median().idxmax()
# Your code here

### Exercise 1.3 — Relationship: Scatter + Regression

**Data question:** *Is there a relationship between GDP and tourism arrivals?*

A **scatter plot** reveals relationships between two continuous variables. Adding a regression line shows the trend direction. Coloring by a third variable (Island Group) reveals sub-group patterns.

- Use `sns.scatterplot()` with `hue='Island_Group'` and `s=120`
- Overlay `sns.regplot()` with `scatter=False` and `color='gray'` for the overall trend line
- Label axes: "GDP (Billion PHP)" and "Tourism Arrivals (Thousands)"
- Title: "GDP vs Tourism Arrivals (by Island Group)"
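
A trend line is easier to interpret when you also look at the numbers behind it. The sketch below fits an ordinary least-squares line with NumPy, which should correspond to the straight-line fit that `sns.regplot()` draws by default, and then refits without NCR to show how much one extreme point can move the slope. The array names are illustrative.

```python
import numpy as np

# GDP (billion PHP) and tourism arrivals (thousands), copied from the Setup cell.
gdp = np.array([6842, 352, 418, 310, 1876, 2821, 286, 462,
                1054, 1342, 284, 318, 724, 1021, 485, 312], dtype=float)
tourism = np.array([4200, 180, 320, 150, 580, 890, 420, 310,
                    650, 1200, 210, 120, 280, 380, 160, 85], dtype=float)

# Least-squares line: tourism ≈ slope * gdp + intercept
slope, intercept = np.polyfit(gdp, tourism, deg=1)

# Pearson correlation coefficient between the two variables
r = np.corrcoef(gdp, tourism)[0, 1]
print(f'slope={slope:.3f}, intercept={intercept:.1f}, r={r:.3f}')

# Refit without NCR (index 0) to see how strongly it drives the trend
slope_without_ncr, _ = np.polyfit(gdp[1:], tourism[1:], deg=1)
print(f'slope without NCR={slope_without_ncr:.3f}')
```

If the two slopes differ noticeably, the overall trend line is leaning heavily on a single region.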

In [None]:
fig, ax = plt.subplots(figsize=(10, 7))

# TODO: Scatter plot colored by Island_Group
# Your code here

# TODO: Add overall regression line
# Hint: sns.regplot(data=df, x='GDP_Billions', y='Tourism_Arrivals_K',
#                    scatter=False, color='gray', line_kws={'linestyle': '--'}, ax=ax)
# Your code here

ax.set_xlabel('GDP (Billion PHP)')
ax.set_ylabel('Tourism Arrivals (Thousands)')
ax.set_title('GDP vs Tourism Arrivals (by Island Group)', fontweight='bold')
plt.tight_layout()
plt.show()

---
## Part 2: Design Principles

Knowing *which* chart to use is only half the work. The lecture covered how to make charts **honest and clear** — Tufte's data-ink ratio and common mistakes like truncated axes and chart-type mismatches.

### Exercise 2.1 — Tufte Clean-Up

Below is a deliberately cluttered chart. Your job is to produce a **clean version** applying Tufte's data-ink ratio:

1. Remove all four spines (`ax.spines[side].set_visible(False)`)
2. Remove x-axis ticks (`ax.set_xticks([])`)
3. Add direct value labels on each bar (no axis needed)
4. Use a single muted color (`'#4a90d9'`)

Run the first cell to see the bad chart, then complete the second cell to fix it.
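
Separately from the two exercise cells, the sketch below shows an alternative way to add direct labels: `Axes.bar_label()`, available in matplotlib 3.4 and later, instead of manual `ax.text()` calls. The names and values are hypothetical toy data, not the lab's GDP figures.

```python
import matplotlib.pyplot as plt

# Hypothetical toy values, not the lab's GDP figures.
names = ['North', 'South', 'East']
values = [1200, 450, 3100]

fig, ax = plt.subplots(figsize=(6, 2.5))
toy_container = ax.barh(names, values, color='#4a90d9')

# bar_label places one text label at the end of each bar;
# padding is the gap between bar and label, in points.
toy_labels = ax.bar_label(toy_container,
                          labels=[f'{v:,.0f}B' for v in values],
                          padding=4)

plt.tight_layout()
plt.show()
```

Either approach works for this exercise; `bar_label()` simply saves you from computing the label coordinates yourself.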

In [None]:
# Bad chart — intentionally cluttered
df_top5 = df.nlargest(5, 'GDP_Billions').sort_values('GDP_Billions')

fig, ax = plt.subplots(figsize=(9, 4))
ax.barh(df_top5['Region'], df_top5['GDP_Billions'],
        color=['red', 'blue', 'green', 'orange', 'purple'])
ax.set_title('GDP', fontsize=10)  # Uninformative title
ax.grid(True, which='both', linestyle='-', linewidth=1.5)
ax.set_xlabel('Billions')
plt.tight_layout()
plt.show()
print('↑ Bad chart: rainbow colors, heavy grid, vague title, no direct labels')

In [None]:
# TODO: Create a Tufte-style clean version
fig, ax = plt.subplots(figsize=(9, 4))

# TODO: Create horizontal bars with a single muted color
bars = None  # ax.barh(df_top5['Region'], df_top5['GDP_Billions'], color='#4a90d9')
# Your code here

# TODO: Remove all four spines
# Hint: for spine in ax.spines.values(): spine.set_visible(False)
# Your code here

# TODO: Remove x-axis ticks
# Hint: ax.set_xticks([])
# Your code here

# TODO: Add direct value labels on each bar
# Hint: for bar in bars:
#           ax.text(bar.get_width() + 50, bar.get_y() + bar.get_height()/2,
#                   f"{bar.get_width():,.0f}B", va='center', fontsize=10)
# Your code here

ax.set_title('Top 5 Philippine Regions by GDP (2023)', fontweight='bold')  # Active, informative title
ax.grid(False)
plt.tight_layout()
plt.show()

### Exercise 2.2 — Bad vs Good: The Pie Chart Problem

The lecture showed why pie charts fail with many categories — humans cannot accurately judge angles. Create a side-by-side comparison:

- **Left (bad):** `ax.pie()` with all 16 regions, default colors, and a label plus percentage on every slice
- **Right (good):** horizontal bar — top 5 regions + "Others" (sum of the rest), sorted ascending

After building both, write a 1-sentence comment below the cell explaining what makes the bar chart easier to read.
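
The "top N + Others" pattern can be sketched on toy data before you apply it to the regions. Everything below (the `sales` frame, its columns, and N = 3) is hypothetical and only illustrates the aggregation step.

```python
import pandas as pd

# Hypothetical toy data, unrelated to the lab's dataset.
sales = pd.DataFrame({'product': list('ABCDEFGH'),
                      'units': [50, 5, 120, 8, 75, 3, 40, 9]})

# Keep the three largest rows and collapse everything else into one total.
top = sales.nlargest(3, 'units')
others_total = sales['units'].sum() - top['units'].sum()

# Append the 'Others' row, then sort ascending so barh draws the largest on top.
others_row = pd.DataFrame({'product': ['Others'], 'units': [others_total]})
summary = (pd.concat([top, others_row], ignore_index=True)
             .sort_values('units'))

print(summary)
```

Eight categories become four bars, and the combined size of the small categories is still visible.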

In [None]:
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 6))

# LEFT: Bad — pie chart with 16 slices
# TODO: ax1.pie(df['GDP_Billions'], labels=df['Region'], autopct='%1.0f%%')
# Your code here
ax1.set_title('Bad: Pie Chart (16 slices)', fontweight='bold', color='#dc2626')

# RIGHT: Good — top 5 + Others bar chart
# TODO: Compute top 5 GDP rows, sum the rest as 'Others'
top5 = None  # df.nlargest(5, 'GDP_Billions')
others_gdp = None  # df['GDP_Billions'].sum() - top5['GDP_Billions'].sum()
# Your code here

# TODO: Build a small DataFrame for the bar chart:
# regions: list of top5 region names + ['Others']
# gdp: list of top5 GDP values + [others_gdp], sorted ascending for barh
# Your code here

# TODO: ax2.barh(...)
# Your code here

ax2.set_xlabel('GDP (Billion PHP)')
ax2.set_title('Good: Bar Chart (Top 5 + Others)', fontweight='bold', color='#16a34a')
plt.tight_layout()
plt.show()

# Your 1-sentence observation:
# The bar chart is easier to read because ...

---
## Part 3: Accessibility & Color

The lecture covered inclusive design: colorblind-safe palettes and redundant encoding (using both color AND shape to convey the same grouping).

### Exercise 3.1 — Colorblind-Safe Palette

Create the **same scatter plot** twice, side by side:
- **Left:** matplotlib's default colors (pass `palette='tab10'` explicitly, because the Setup cell already made `'colorblind'` the session default)
- **Right:** `palette='colorblind'` (seaborn's 10-color accessible palette)

Both plots: GDP vs Poverty Rate, colored by Island_Group.

Notice how the right panel's colors remain distinguishable for people with deuteranopia (red-green color blindness).
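
If you want to see exactly which colors the palette contains and how many there are, you can inspect it directly. This is a small sketch using seaborn's `color_palette()`; the variable name is illustrative.

```python
import seaborn as sns

# Load the named palette as a list of RGB tuples.
cb_palette = sns.color_palette('colorblind')

# Number of distinct colors available before the palette starts repeating.
print(len(cb_palette))

# Hex codes of the first three colors, the ones three hue levels would use.
print(cb_palette.as_hex()[:3])
```

Because `Island_Group` has three levels, only the first three palette entries appear in the scatter plot.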

In [None]:
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

# LEFT: Default colors
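# TODO: sns.scatterplot with palette='tab10' (matplotlib's default cycle), hue='Island_Group', s=100
# Setup already made 'colorblind' the default, so omitting palette would not show a difference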
# Your code here
ax1.set_title('Default Colors', fontweight='bold')
ax1.set_xlabel('GDP (Billion PHP)')
ax1.set_ylabel('Poverty Rate (%)')

# RIGHT: Colorblind-safe palette
# TODO: sns.scatterplot with palette='colorblind', hue='Island_Group', s=100
# Your code here
ax2.set_title('Colorblind-Safe Palette', fontweight='bold')
ax2.set_xlabel('GDP (Billion PHP)')
ax2.set_ylabel('Poverty Rate (%)')

plt.tight_layout()
plt.show()

### Exercise 3.2 — Redundant Encoding

Redundant encoding means conveying the same information through **multiple visual channels** — here, both color AND marker shape.

Even in grayscale printouts (or for people who cannot perceive color differences), the groups remain distinguishable by shape.

- Use `hue='Island_Group'` AND `style='Island_Group'` in `sns.scatterplot()`
- Set `s=150` and `palette='colorblind'`
- Title: "GDP vs Poverty (Color + Shape Encoding)"
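
One way to see why shape helps in grayscale is to estimate how light or dark each group's color would print. The sketch below converts the first three `'colorblind'` colors to a single gray level using the Rec. 601 luma weights, which is only an approximation of how a printer or viewer renders gray. The helper `luma()` and the group-to-color pairing are assumptions for illustration.

```python
import seaborn as sns

# Hue levels in order of first appearance in the lab's Island_Group column.
groups = ['Luzon', 'Visayas', 'Mindanao']
colors = sns.color_palette('colorblind', n_colors=len(groups))

def luma(rgb):
    """Approximate gray level (0 = black, 1 = white) using Rec. 601 weights."""
    r, g, b = rgb
    return 0.299 * r + 0.587 * g + 0.114 * b

# Groups whose gray levels are close together are hard to tell apart without shape.
gray_levels = {group: round(luma(color), 3)
               for group, color in zip(groups, colors)}
print(gray_levels)
```

If two groups end up with similar gray levels, marker shape is the only channel left to separate them on a black-and-white printout.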

In [None]:
fig, ax = plt.subplots(figsize=(9, 6))

# TODO: Scatter plot with both hue and style set to 'Island_Group'
# Hint: sns.scatterplot(data=df, x='GDP_Billions', y='Poverty_Rate',
#                        hue='Island_Group', style='Island_Group',
#                        s=150, palette='colorblind', ax=ax)
# Your code here

ax.set_xlabel('GDP (Billion PHP)')
ax.set_ylabel('Poverty Rate (%)')
ax.set_title('GDP vs Poverty (Color + Shape Encoding)', fontweight='bold')
plt.tight_layout()
plt.show()

---
## Part 4: Plotly — Interactive Visualization

The lecture introduced **Plotly** as the tool for interactive, web-ready charts. Unlike matplotlib, Plotly charts have hover tooltips, zoom, and pan built in, with no extra code needed.

This section uses `plotly.express` (abbreviated `px`), which follows the same grammar as seaborn: pass a DataFrame, specify column names for axes and color.

### Exercise 4.1 — Plotly Express Bar Chart

Recreate the GDP bar chart from Exercise 1.1 using Plotly. Notice that you get hover tooltips **for free** — hover over any bar to see the exact GDP value and region name.

- Use `px.bar()` with `orientation='h'` for horizontal
- Set `x='GDP_Billions'`, `y='Region'`, `color='Island_Group'`
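- Sort df by GDP descending before passing it to `px.bar`, then call `fig.update_yaxes(categoryorder='total ascending')` so the largest bar appears at the top (DataFrame order alone is not enough: Plotly draws categories from bottom to top, and `color=` splits the bars into one trace per island group)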
- Title: "Philippine GDP by Region (Interactive)"

In [None]:
# TODO: Sort df by GDP_Billions descending
df_plotly = None  # df.sort_values('GDP_Billions', ascending=False)
# Your code here

# TODO: Create Plotly horizontal bar chart
# Hint: fig = px.bar(df_plotly, x='GDP_Billions', y='Region', orientation='h',
#                     color='Island_Group',
#                     title='Philippine GDP by Region (Interactive)',
#                     labels={'GDP_Billions': 'GDP (Billion PHP)'},
#                     color_discrete_sequence=px.colors.qualitative.Safe)
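fig = None  # Your code here

# TODO: Keep the largest bar on top even though color= splits the bars into traces
# Hint: fig.update_yaxes(categoryorder='total ascending')
# Your code here

# TODO: Show the figure
# fig.show()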

### Exercise 4.2 — Plotly Scatter with Hover Data

Recreate the GDP vs Tourism scatter from Exercise 1.3 using Plotly. Add custom hover data so you can see the Region name and Poverty Rate when hovering (Population already appears in the tooltip because it sets the bubble size).

- `px.scatter()` with `x='GDP_Billions'`, `y='Tourism_Arrivals_K'`, `color='Island_Group'`
- Add `size='Population_M'` (a bubble chart: point size encodes an additional variable)
- Add `hover_data=['Region', 'Poverty_Rate']` so hover shows region name and poverty rate
- Title: "GDP vs Tourism Arrivals (Bubble = Population)"

In [None]:
# TODO: Create Plotly bubble scatter chart
# Hint: fig = px.scatter(df, x='GDP_Billions', y='Tourism_Arrivals_K',
#                         color='Island_Group', size='Population_M',
#                         hover_data=['Region', 'Poverty_Rate'],
#                         title='GDP vs Tourism Arrivals (Bubble = Population)',
#                         labels={'GDP_Billions': 'GDP (B PHP)',
#                                 'Tourism_Arrivals_K': 'Tourism (Thousands)'},
#                         color_discrete_sequence=px.colors.qualitative.Safe)
fig = None  # Your code here

# TODO: Show the figure
# fig.show()

# After running: hover over NCR to see its poverty rate and population.
# Which region has the largest bubble?

---
## Reflection Questions

### Question 1

For each of the three chart types in Part 1 (horizontal bar, box plot, scatter), explain what **data question** it is best suited to answer. Use specific examples from the exercises.

**Your Answer:**

[Your answer here]

### Question 2

In Exercise 2.1, you applied Tufte's data-ink ratio principle. Name **one specific change** you made (e.g., removing a spine, adding direct labels) and explain how it improved the chart's clarity.

**Your Answer:**

[Your answer here]

### Question 3

What is **one advantage** and **one limitation** of Plotly interactive charts compared to matplotlib static charts? Think about contexts like a formal report, a live presentation, or an online dashboard.

**Your Answer:**

[Your answer here]

---

## Congratulations!

You've completed the Week 5 Lab on Data Visualization Techniques.

**Key Skills Practiced:**
- Selecting chart types based on data question (comparison, distribution, relationship)
- Applying Tufte's data-ink ratio to remove chart clutter
- Identifying why pie charts fail with many categories
- Using colorblind-safe palettes and redundant encoding for accessibility
- Building interactive charts with Plotly Express (hover, size encoding)

**Remember:** Check the **solution notebook** if you need help!

**Next Steps:**
- Week 6 Lab: Building Interactive Dashboards (extends Plotly to multi-panel layouts)

---

*CMSC 178DA - Data Analytics | University of the Philippines Cebu*