# Week 5 Lab: Data Visualization Techniques — SOLUTION

**Estimated Time:** 30-60 minutes  
**Objective:** Apply the visualization principles from lecture by choosing appropriate chart types, improving chart design, and building interactive visualizations.

In this lab, you will:
- Choose the right chart type for each data question
- Apply Tufte's data-ink ratio principle to improve chart clarity
- Use colorblind-safe palettes and redundant encoding for accessibility
- Build interactive charts with Plotly

---

## Setup

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
import seaborn as sns
import plotly.express as px

plt.style.use('seaborn-v0_8-whitegrid')
sns.set_palette('colorblind')
pd.set_option('display.max_columns', None)
np.random.seed(42)

# Philippine Regional Economic Indicators (PSA 2023)
data = {
    'Region': ['NCR', 'CAR', 'Region I', 'Region II', 'Region III',
               'CALABARZON', 'MIMAROPA', 'Region V', 'Region VI',
               'Region VII', 'Region VIII', 'Region IX', 'Region X',
               'Region XI', 'Region XII', 'BARMM'],
    'Population_M': [13.48, 1.80, 5.30, 3.68, 12.42, 16.20, 3.23,
                     6.08, 7.95, 8.08, 4.55, 3.87, 5.02, 5.24, 4.93, 4.40],
    'GDP_Billions': [6842, 352, 418, 310, 1876, 2821, 286, 462,
                     1054, 1342, 284, 318, 724, 1021, 485, 312],
    'Poverty_Rate': [3.2, 14.8, 11.2, 14.1, 10.5, 8.1, 22.4, 28.3,
                     20.1, 18.7, 30.5, 33.8, 25.6, 16.3, 29.1, 63.0],
    'Literacy_Rate': [99.2, 96.8, 97.5, 96.2, 97.8, 98.1, 95.3, 96.0,
                      96.5, 97.1, 95.8, 94.2, 96.0, 97.0, 95.5, 82.5],
    'Tourism_Arrivals_K': [4200, 180, 320, 150, 580, 890, 420, 310,
                           650, 1200, 210, 120, 280, 380, 160, 85],
    'Island_Group': ['Luzon', 'Luzon', 'Luzon', 'Luzon', 'Luzon',
                     'Luzon', 'Luzon', 'Luzon', 'Visayas', 'Visayas',
                     'Visayas', 'Mindanao', 'Mindanao', 'Mindanao',
                     'Mindanao', 'Mindanao']
}

df = pd.DataFrame(data)
print(f'Dataset loaded: {df.shape[0]} regions, {df.shape[1]} columns')
df.head()

---
## Part 1: Choosing the Right Chart

The lecture introduced a chart taxonomy based on the **data question** you are answering. Each exercise here pairs a specific question with the appropriate chart type.

### Exercise 1.1 — Comparison: Horizontal Bar Chart

**Data question:** *Which regions have the highest GDP?*

When comparing many categories with long names, a **horizontal bar chart** is clearer than a vertical one — labels don't overlap and ranking is easy to read top-to-bottom.

- Sort `df` by `GDP_Billions` in ascending order (so the largest bar ends up at the top)
- Use `ax.barh()` with `color='steelblue'`
- Label x-axis: "GDP (Billion PHP)"
- Title: "Philippine GDP by Region (PSA 2023)"

In [None]:
# Sort ascending so NCR (highest) appears at the top in barh
df_sorted = df.sort_values('GDP_Billions', ascending=True)

fig, ax = plt.subplots(figsize=(10, 7))
ax.barh(df_sorted['Region'], df_sorted['GDP_Billions'], color='steelblue')

ax.set_xlabel('GDP (Billion PHP)')
ax.set_title('Philippine GDP by Region (PSA 2023)', fontweight='bold')
plt.tight_layout()
plt.show()

print(f"Highest GDP: {df.loc[df['GDP_Billions'].idxmax(), 'Region']} "
      f"({df['GDP_Billions'].max():,} B PHP)")
print(f"Lowest GDP:  {df.loc[df['GDP_Billions'].idxmin(), 'Region']} "
      f"({df['GDP_Billions'].min():,} B PHP)")
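
An alternative to sorting ascending, sketched here on the assumption that you would rather keep the DataFrame in rank order (largest first): plot it as-is and flip the y-axis with `ax.invert_yaxis()`. The three-row `demo` frame is a hypothetical subset of the lab data, used only to keep the example short.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical three-region subset of the lab data, for brevity
demo = pd.DataFrame({
    'Region': ['NCR', 'CALABARZON', 'Region III'],
    'GDP_Billions': [6842, 2821, 1876],
})

# Keep the data in rank order (largest first) ...
demo_desc = demo.sort_values('GDP_Billions', ascending=False)

fig, ax = plt.subplots(figsize=(8, 3))
ax.barh(demo_desc['Region'], demo_desc['GDP_Billions'], color='steelblue')

# ... then flip the axis so the first row is drawn at the top
ax.invert_yaxis()

ax.set_xlabel('GDP (Billion PHP)')
plt.tight_layout()
plt.show()
```

Both approaches give the same picture; the choice mostly depends on whether later code expects the DataFrame in ascending or descending order.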

### Exercise 1.2 — Distribution: Box Plot by Group

**Data question:** *How does poverty spread differ across island groups?*

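A **box plot** shows the median, quartiles, and outliers, which makes it well suited to comparing distributions between groups. Overlaying a strip plot shows the individual data points, so a box built from very few observations (Visayas has only three regions in this dataset) is not mistaken for a well-sampled distribution.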

- Use `sns.boxplot(data=df, x='Island_Group', y='Poverty_Rate', ax=ax)`
- Overlay `sns.stripplot()` with `color='black'`, `size=5`, `jitter=True`
- Title: "Poverty Rate by Island Group"
- After the chart: print which island group has the highest median poverty rate

In [None]:
fig, ax = plt.subplots(figsize=(8, 6))

sns.boxplot(data=df, x='Island_Group', y='Poverty_Rate', ax=ax)
sns.stripplot(data=df, x='Island_Group', y='Poverty_Rate',
              color='black', size=5, jitter=True, ax=ax)

ax.set_title('Poverty Rate by Island Group', fontweight='bold')
ax.set_ylabel('Poverty Rate (%)')
plt.tight_layout()
plt.show()

medians = df.groupby('Island_Group')['Poverty_Rate'].median()
print('Median poverty rates by island group:')
print(medians.sort_values(ascending=False))
print(f"\nHighest median: {medians.idxmax()} ({medians.max():.1f}%)")
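
To see what the boxes are drawing, the sketch below recomputes the quartiles and the 1.5 × IQR outlier fences by hand. It assumes the default whisker rule (`whis=1.5`) and linear-interpolated quantiles; `box_stats` is a hypothetical helper, not part of the lab template.

```python
import pandas as pd

# Poverty rates copied from the lab dataset, in island-group order
poverty = pd.DataFrame({
    'Island_Group': ['Luzon'] * 8 + ['Visayas'] * 3 + ['Mindanao'] * 5,
    'Poverty_Rate': [3.2, 14.8, 11.2, 14.1, 10.5, 8.1, 22.4, 28.3,
                     20.1, 18.7, 30.5,
                     33.8, 25.6, 16.3, 29.1, 63.0],
})

def box_stats(values: pd.Series) -> pd.Series:
    """Return the quantities a box plot draws, using the 1.5 * IQR rule."""
    q1, median, q3 = values.quantile([0.25, 0.50, 0.75])
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    n_outliers = int(((values < lower_fence) | (values > upper_fence)).sum())
    return pd.Series({'n': len(values), 'q1': q1, 'median': median,
                      'q3': q3, 'iqr': iqr, 'n_outliers': n_outliers})

# One row of statistics per island group
stats = pd.DataFrame({
    name: box_stats(group)
    for name, group in poverty.groupby('Island_Group')['Poverty_Rate']
}).T

print(stats.round(2))
```

The `n` column is the reason for the strip-plot overlay: the Visayas box is computed from three points, so its quartiles are interpolated rather than observed.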

### Exercise 1.3 — Relationship: Scatter + Regression

**Data question:** *Is there a relationship between GDP and tourism arrivals?*

A **scatter plot** reveals relationships between two continuous variables. Adding a regression line shows the trend direction. Coloring by a third variable (Island Group) reveals sub-group patterns.

- Use `sns.scatterplot()` with `hue='Island_Group'` and `s=120`
- Overlay `sns.regplot()` with `scatter=False` and `color='gray'` for the overall trend line
- Label axes: "GDP (Billion PHP)" and "Tourism Arrivals (Thousands)"
- Title: "GDP vs Tourism Arrivals (by Island Group)"

In [None]:
fig, ax = plt.subplots(figsize=(10, 7))

sns.scatterplot(data=df, x='GDP_Billions', y='Tourism_Arrivals_K',
                hue='Island_Group', s=120, ax=ax)

sns.regplot(data=df, x='GDP_Billions', y='Tourism_Arrivals_K',
            scatter=False, color='gray',
            line_kws={'linestyle': '--', 'linewidth': 1.5}, ax=ax)

ax.set_xlabel('GDP (Billion PHP)')
ax.set_ylabel('Tourism Arrivals (Thousands)')
ax.set_title('GDP vs Tourism Arrivals (by Island Group)', fontweight='bold')
plt.tight_layout()
plt.show()

corr = df['GDP_Billions'].corr(df['Tourism_Arrivals_K'])
print(f"Pearson r = {corr:.3f}")
print("Positive relationship: regions with higher GDP tend to record more tourist arrivals.")
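
Because NCR sits far from the other regions on both axes, it is worth checking how much a single point drives the correlation. The sketch below is one possible robustness check, not part of the original exercise: it recomputes Pearson's r without NCR and also computes a rank-based (Spearman) correlation. The helper names `pearson` and `spearman` are hypothetical.

```python
import pandas as pd

# GDP and tourism columns copied from the lab dataset
econ = pd.DataFrame({
    'Region': ['NCR', 'CAR', 'Region I', 'Region II', 'Region III',
               'CALABARZON', 'MIMAROPA', 'Region V', 'Region VI',
               'Region VII', 'Region VIII', 'Region IX', 'Region X',
               'Region XI', 'Region XII', 'BARMM'],
    'GDP_Billions': [6842, 352, 418, 310, 1876, 2821, 286, 462,
                     1054, 1342, 284, 318, 724, 1021, 485, 312],
    'Tourism_Arrivals_K': [4200, 180, 320, 150, 580, 890, 420, 310,
                           650, 1200, 210, 120, 280, 380, 160, 85],
})

def pearson(frame: pd.DataFrame) -> float:
    """Linear correlation between GDP and tourism arrivals."""
    return frame['GDP_Billions'].corr(frame['Tourism_Arrivals_K'])

def spearman(frame: pd.DataFrame) -> float:
    """Spearman's rho, computed as Pearson's r on the ranks."""
    return frame['GDP_Billions'].rank().corr(frame['Tourism_Arrivals_K'].rank())

without_ncr = econ[econ['Region'] != 'NCR']

print(f"Pearson r, all regions:  {pearson(econ):.3f}")
print(f"Pearson r, without NCR:  {pearson(without_ncr):.3f}")
print(f"Spearman rho, all:       {spearman(econ):.3f}")
```

If the three numbers differ noticeably, the headline r is being shaped by NCR, and the rank-based figure is the safer summary to report.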

---
## Part 2: Design Principles

Knowing *which* chart to use is only half the work. The lecture covered how to make charts **honest and clear** — Tufte's data-ink ratio and common mistakes like truncated axes and chart-type mismatches.

### Exercise 2.1 — Tufte Clean-Up

Below is a deliberately cluttered chart. Your job is to produce a **clean version** applying Tufte's data-ink ratio:

1. Remove all four spines
2. Remove x-axis ticks
3. Add direct value labels on each bar
4. Use a single muted color (`'#4a90d9'`)

Run the first cell to see the bad chart, then complete the second cell to fix it.

In [None]:
# Bad chart — intentionally cluttered
df_top5 = df.nlargest(5, 'GDP_Billions').sort_values('GDP_Billions')

fig, ax = plt.subplots(figsize=(9, 4))
ax.barh(df_top5['Region'], df_top5['GDP_Billions'],
        color=['red', 'blue', 'green', 'orange', 'purple'])
ax.set_title('GDP', fontsize=10)
ax.grid(True, which='both', linestyle='-', linewidth=1.5)
ax.set_xlabel('Billions')
plt.tight_layout()
plt.show()
print('↑ Bad chart: rainbow colors, heavy grid, vague title, no direct labels')

In [None]:
# Clean Tufte-style chart
fig, ax = plt.subplots(figsize=(9, 4))

bars = ax.barh(df_top5['Region'], df_top5['GDP_Billions'], color='#4a90d9')

# 1. Remove all four spines
for spine in ax.spines.values():
    spine.set_visible(False)

# 2. Remove x-axis ticks (direct labels replace them)
ax.set_xticks([])

# 3. Add direct value labels on each bar
for bar in bars:
    ax.text(bar.get_width() + 50,
            bar.get_y() + bar.get_height() / 2,
            f'{bar.get_width():,.0f}B',
            va='center', fontsize=10, color='#333333')

ax.set_title('Top 5 Philippine Regions by GDP (2023)', fontweight='bold')
ax.grid(False)
plt.tight_layout()
plt.show()
print('↑ Clean chart: single color, no spines, direct labels, informative title')
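
The manual `ax.text()` loop above can also be written with `Axes.bar_label`, which handles positioning for you. This is a sketch that assumes matplotlib 3.4 or newer (the release that added `bar_label`); the values are the top-5 GDP figures from the lab dataset.

```python
import matplotlib.pyplot as plt

# Top-5 GDP values copied from the lab dataset, sorted ascending for barh
regions = ['Region VI', 'Region VII', 'Region III', 'CALABARZON', 'NCR']
gdp = [1054, 1342, 1876, 2821, 6842]

fig, ax = plt.subplots(figsize=(9, 4))
bars = ax.barh(regions, gdp, color='#4a90d9')

# bar_label places one text label per bar; padding is measured in points
labels = ax.bar_label(bars, labels=[f'{v:,.0f}B' for v in gdp],
                      padding=4, fontsize=10, color='#333333')

# Same clean-up steps as the solution cell
for spine in ax.spines.values():
    spine.set_visible(False)
ax.set_xticks([])

plt.tight_layout()
plt.show()
```

Because `padding` is in points rather than data units, the gap between bar and label stays the same if the data range changes, unlike the fixed `+ 50` offset.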

### Exercise 2.2 — Bad vs Good: The Pie Chart Problem

The lecture showed why pie charts fail with many categories: people judge angles and areas far less accurately than lengths along a common baseline. Create a side-by-side comparison:

- **Left (bad):** `ax.pie()` with all 16 regions, default colors, region labels, and percentage annotations
- **Right (good):** horizontal bar — top 5 regions + "Others" (sum of the rest), sorted ascending

After building both, write a 1-sentence comment below the cell explaining what makes the bar chart easier to read.

In [None]:
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 6))

# LEFT: Bad — pie chart with 16 slices
ax1.pie(df['GDP_Billions'], labels=df['Region'], autopct='%1.0f%%')
ax1.set_title('Bad: Pie Chart (16 slices)', fontweight='bold', color='#dc2626')

# RIGHT: Good — top 5 + Others bar chart
top5 = df.nlargest(5, 'GDP_Billions')
others_gdp = df['GDP_Billions'].sum() - top5['GDP_Billions'].sum()

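# Sort the top 5 once (ascending), then append "Others" last so it is drawn
# as the topmost bar in a contrasting gray
top5_sorted = top5.sort_values('GDP_Billions')
bar_regions = list(top5_sorted['Region']) + ['Others']
bar_gdp = list(top5_sorted['GDP_Billions']) + [others_gdp]
colors = ['#4a90d9'] * 5 + ['#94a3b8']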

ax2.barh(bar_regions, bar_gdp, color=colors)
ax2.set_xlabel('GDP (Billion PHP)')
ax2.set_title('Good: Bar Chart (Top 5 + Others)', fontweight='bold', color='#16a34a')

plt.tight_layout()
plt.show()

# The bar chart is easier to read because bar lengths share a common baseline,
# allowing direct length comparison — unlike pie slice angles which humans
# judge poorly, especially when slices are similar in size.
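
The "top 5 + Others" grouping used above can be wrapped in a small reusable function. The sketch below is one way to do it; `top_n_with_others` is a hypothetical helper name, and the percentage shares are added only to show what the pie chart was trying to communicate.

```python
import pandas as pd

def top_n_with_others(values: pd.Series, n: int = 5,
                      label: str = 'Others') -> pd.Series:
    """Keep the n largest entries and sum the remainder into one row."""
    top = values.nlargest(n)
    remainder = values.drop(top.index).sum()
    return pd.concat([top, pd.Series({label: remainder})])

# GDP values copied from the lab dataset, indexed by region
gdp = pd.Series(
    [6842, 352, 418, 310, 1876, 2821, 286, 462,
     1054, 1342, 284, 318, 724, 1021, 485, 312],
    index=['NCR', 'CAR', 'Region I', 'Region II', 'Region III',
           'CALABARZON', 'MIMAROPA', 'Region V', 'Region VI',
           'Region VII', 'Region VIII', 'Region IX', 'Region X',
           'Region XI', 'Region XII', 'BARMM'],
)

grouped = top_n_with_others(gdp, n=5)

# Share of the national total, in percent
shares = (grouped / grouped.sum() * 100).round(1)
print(shares)
```

Passing `grouped.sort_values()` to `ax.barh()` would rank "Others" together with the named regions, while the solution cell keeps it as a separate gray bar; either is defensible as long as the bucket is clearly labeled.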

---
## Part 3: Accessibility & Color

The lecture covered inclusive design: colorblind-safe palettes and redundant encoding (using both color AND shape to convey the same grouping).

### Exercise 3.1 — Colorblind-Safe Palette

Create the **same scatter plot** twice, side by side:
- **Left:** default matplotlib colors
- **Right:** `palette='colorblind'`

Both: GDP vs Poverty Rate, colored by Island_Group.

In [None]:
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

# LEFT: Default colors (temporarily override the colorblind palette set in setup)
with plt.rc_context({'axes.prop_cycle': plt.cycler(color=['#1f77b4', '#ff7f0e', '#2ca02c'])}):
    sns.scatterplot(data=df, x='GDP_Billions', y='Poverty_Rate',
                    hue='Island_Group', s=100, ax=ax1)
ax1.set_title('Default Colors', fontweight='bold')
ax1.set_xlabel('GDP (Billion PHP)')
ax1.set_ylabel('Poverty Rate (%)')

# RIGHT: Colorblind-safe palette
sns.scatterplot(data=df, x='GDP_Billions', y='Poverty_Rate',
                hue='Island_Group', palette='colorblind', s=100, ax=ax2)
ax2.set_title('Colorblind-Safe Palette', fontweight='bold')
ax2.set_xlabel('GDP (Billion PHP)')
ax2.set_ylabel('Poverty Rate (%)')

plt.tight_layout()
plt.show()
print("The colorblind-safe palette uses colors that stay distinguishable")
print("for viewers with red-green color vision deficiency (e.g., deuteranopia).")
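
To compare the two palettes as values rather than by eye, you can print their hex codes. This sketch assumes the "default" colors are matplotlib's `tab10` cycle, which is what the left panel uses.

```python
import seaborn as sns

n_groups = 3  # Luzon, Visayas, Mindanao

# First three colors of each palette, as hex strings
default_hex = list(sns.color_palette('tab10', n_groups).as_hex())
safe_hex = list(sns.color_palette('colorblind', n_groups).as_hex())

for default, safe in zip(default_hex, safe_hex):
    print(f'default {default}   colorblind {safe}')
```

The hex codes can be pasted into a color-vision simulator to check how each palette looks under different types of color vision deficiency.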

### Exercise 3.2 — Redundant Encoding

Redundant encoding means conveying the same information through **multiple visual channels** — here, both color AND marker shape.

- Use `hue='Island_Group'` AND `style='Island_Group'`
- Set `s=150` and `palette='colorblind'`

In [None]:
fig, ax = plt.subplots(figsize=(9, 6))

sns.scatterplot(data=df, x='GDP_Billions', y='Poverty_Rate',
                hue='Island_Group', style='Island_Group',
                s=150, palette='colorblind', ax=ax)

ax.set_xlabel('GDP (Billion PHP)')
ax.set_ylabel('Poverty Rate (%)')
ax.set_title('GDP vs Poverty (Color + Shape Encoding)', fontweight='bold')
plt.tight_layout()
plt.show()

print("With redundant encoding, groups are distinguishable even in grayscale")
print("or for viewers with color vision deficiency.")
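
By default seaborn chooses the marker shapes itself. If you want the same shape for the same group across several figures, the `markers` parameter accepts an explicit mapping. The sketch below uses a hypothetical six-region subset of the lab data, and the particular shapes in `group_markers` are an arbitrary choice.

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Hypothetical six-region subset of the lab data, two regions per group
demo = pd.DataFrame({
    'GDP_Billions': [6842, 2821, 1054, 1342, 1021, 312],
    'Poverty_Rate': [3.2, 8.1, 20.1, 18.7, 16.3, 63.0],
    'Island_Group': ['Luzon', 'Luzon', 'Visayas', 'Visayas',
                     'Mindanao', 'Mindanao'],
})

# Explicit group-to-marker mapping (all filled markers)
group_markers = {'Luzon': 'o', 'Visayas': 's', 'Mindanao': '^'}

fig, ax = plt.subplots(figsize=(7, 5))
sns.scatterplot(data=demo, x='GDP_Billions', y='Poverty_Rate',
                hue='Island_Group', style='Island_Group',
                markers=group_markers, palette='colorblind',
                s=150, ax=ax)

ax.set_xlabel('GDP (Billion PHP)')
ax.set_ylabel('Poverty Rate (%)')
plt.tight_layout()
plt.show()
```

Fixing the mapping keeps a report consistent: a triangle means Mindanao in every chart, regardless of the order in which groups appear in the data.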

---
## Part 4: Plotly — Interactive Visualization

The lecture introduced **Plotly** as the tool for interactive, web-ready charts. Unlike matplotlib's static inline output, Plotly charts come with hover tooltips, zoom, and pan built in, with no extra code needed.

### Exercise 4.1 — Plotly Express Bar Chart

Recreate the GDP bar chart using Plotly. Hover over any bar to see the exact GDP value and region name.

In [None]:
df_plotly = df.sort_values('GDP_Billions', ascending=False)

fig = px.bar(
    df_plotly,
    x='GDP_Billions',
    y='Region',
    orientation='h',
    color='Island_Group',
    title='Philippine GDP by Region (Interactive)',
    labels={'GDP_Billions': 'GDP (Billion PHP)', 'Region': ''},
    color_discrete_sequence=px.colors.qualitative.Safe
)
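# color= splits the bars into one trace per island group, so enforce a single
# global ranking by bar total (largest bar at the top)
fig.update_layout(yaxis={'categoryorder': 'total ascending'})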
fig.show()

print("Hover over any bar to see exact GDP, region name, and island group.")

### Exercise 4.2 — Plotly Scatter with Hover Data

Bubble chart: GDP vs Tourism, sized by Population, with hover showing Region and Poverty Rate.

In [None]:
fig = px.scatter(
    df,
    x='GDP_Billions',
    y='Tourism_Arrivals_K',
    color='Island_Group',
    size='Population_M',
    hover_data=['Region', 'Poverty_Rate'],
    title='GDP vs Tourism Arrivals (Bubble = Population)',
    labels={
        'GDP_Billions': 'GDP (B PHP)',
        'Tourism_Arrivals_K': 'Tourism (Thousands)'
    },
    color_discrete_sequence=px.colors.qualitative.Safe
)
fig.show()

largest_bubble = df.loc[df['Population_M'].idxmax(), 'Region']
print(f"Largest bubble: {largest_bubble} ({df['Population_M'].max()} million people)")
print("Hover over NCR to see its poverty rate (3.2%) despite its massive GDP.")

---
## Reflection Questions

### Question 1

For each of the three chart types in Part 1 (horizontal bar, box plot, scatter), explain what **data question** it is best suited to answer.

**Model Answer:**

- **Horizontal bar chart** answers *"Which categories rank highest/lowest?"* — e.g., "Which Philippine regions have the highest GDP?" It works best when category names are long (they fit on the y-axis) and ranking order matters. Bar length is one of the most accurately perceived visual encodings.

- **Box plot** answers *"How spread out is this variable, and does the spread differ across groups?"* — e.g., "Does poverty vary more within Mindanao or within Luzon?" It simultaneously shows median, quartiles, and outliers — information that a simple mean bar would hide entirely.

- **Scatter plot** answers *"Is there a relationship between two continuous variables?"* — e.g., "Do regions with higher GDP attract more tourists?" The regression overlay confirms the trend direction, while color encoding reveals whether sub-groups (island groups) follow different patterns.

### Question 2

In Exercise 2.1, you applied Tufte's data-ink ratio principle. Name **one specific change** and explain how it improved clarity.

**Model Answer:**

Removing the x-axis ticks and replacing them with direct value labels (e.g., "6,842B" at the end of NCR's bar) improved clarity significantly. Readers no longer need to trace from the end of a bar down to the axis and interpolate a value; the exact number appears right where their eye already is. This replaces two cognitive steps (tracing and estimating) with one (reading). It also eliminates the x-axis tick marks and tick labels, ink that serves no purpose once direct labels are present. This is a textbook application of Tufte's principle that non-data ink should be erased wherever doing so costs the reader nothing.

### Question 3

What is **one advantage** and **one limitation** of Plotly interactive charts vs matplotlib static charts?

**Model Answer:**

**Advantage:** Plotly charts allow readers to actively explore the data — hovering reveals exact values, zooming into dense regions reveals hidden points, and clicking legend items hides/shows groups. In Exercise 4.2, a reader can hover over BARMM to instantly see its 63% poverty rate in context, without needing a separate annotation. For online dashboards (Week 6) and presentations where the audience has screen access, this interactivity replaces multiple static charts.

**Limitation:** Plotly requires JavaScript and a web browser to render. Charts cannot be embedded in a PDF, printed report, or academic paper without first exporting them to static images (using `fig.write_image()`, which in turn needs the `kaleido` package). For formal submissions such as a thesis chapter, a PSA report, or a printed conference poster, matplotlib's print-ready PNG/PDF output remains the more reliable choice. Interactive charts also add a package dependency (`plotly`) that is not always present in server or restricted environments.

---

## Congratulations!

You've completed the Week 5 Lab on Data Visualization Techniques.

**Key Skills Practiced:**
- Selecting chart types based on data question (comparison, distribution, relationship)
- Applying Tufte's data-ink ratio to remove chart clutter
- Identifying why pie charts fail with many categories
- Using colorblind-safe palettes and redundant encoding for accessibility
- Building interactive charts with Plotly Express (hover, bubble size encoding)

**Next Steps:**
- Week 6 Lab: Building Interactive Dashboards (extends Plotly to multi-panel layouts)

---

*CMSC 178DA - Data Analytics | University of the Philippines Cebu*