# Week 5 Lab: Data Visualization Techniques — SOLUTION

**Estimated Time:** 60-90 minutes  
**Objective:** Master data visualization by choosing the right chart types, applying design principles, ensuring accessibility, and building interactive charts.

---

## Setup

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.graph_objects as go

plt.style.use('seaborn-v0_8-whitegrid')
sns.set_palette('colorblind')
pd.set_option('display.max_columns', None)
np.random.seed(42)

print('✓ Libraries imported successfully!')

In [None]:
data = {
    'Region': ['NCR', 'CAR', 'Region I', 'Region II', 'Region III',
               'CALABARZON', 'MIMAROPA', 'Region V', 'Region VI',
               'Region VII', 'Region VIII', 'Region IX', 'Region X',
               'Region XI', 'Region XII', 'BARMM'],
    'Population_M': [13.48, 1.80, 5.30, 3.68, 12.42, 16.20, 3.23,
                     6.08, 7.95, 8.08, 4.55, 3.87, 5.02, 5.24, 4.93, 4.40],
    'GDP_Billions': [6842, 352, 418, 310, 1876, 2821, 286, 462,
                     1054, 1342, 284, 318, 724, 1021, 485, 312],
    'Poverty_Rate': [3.2, 14.8, 11.2, 14.1, 10.5, 8.1, 22.4, 28.3,
                     20.1, 18.7, 30.5, 33.8, 25.6, 16.3, 29.1, 63.0],
    'Literacy_Rate': [99.2, 96.8, 97.5, 96.2, 97.8, 98.1, 95.3, 96.0,
                      96.5, 97.1, 95.8, 94.2, 96.0, 97.0, 95.5, 82.5],
    'Tourism_Arrivals_K': [4200, 180, 320, 150, 580, 890, 420, 310,
                           650, 1200, 210, 120, 280, 380, 160, 85],
    'Island_Group': ['Luzon', 'Luzon', 'Luzon', 'Luzon', 'Luzon',
                     'Luzon', 'Luzon', 'Luzon', 'Visayas', 'Visayas',
                     'Visayas', 'Mindanao', 'Mindanao', 'Mindanao',
                     'Mindanao', 'Mindanao']
}

df = pd.DataFrame(data)
print(f'Dataset: {df.shape[0]} regions × {df.shape[1]} columns')
df

---
## Part 1: Comparison Charts

In [None]:
# Exercise 1.1: Ranked Horizontal Bar Chart
df_sorted = df.sort_values('GDP_Billions', ascending=True)

fig, ax = plt.subplots(figsize=(10, 7))
ax.barh(df_sorted['Region'], df_sorted['GDP_Billions'], color='steelblue')
ax.set_xlabel('GDP (Billion PHP)')
ax.set_title('Philippine GDP by Region (PSA 2023)', fontweight='bold')
plt.tight_layout()
plt.show()

top    = df.loc[df['GDP_Billions'].idxmax()]
bottom = df.loc[df['GDP_Billions'].idxmin()]
print(f'Highest: {top["Region"]} — PHP {top["GDP_Billions"]:,}B')
print(f'Lowest:  {bottom["Region"]} — PHP {bottom["GDP_Billions"]:,}B')
# Use the computed top region rather than hard-coding 'NCR'
print(f'{top["Region"]} is {top["GDP_Billions"]/bottom["GDP_Billions"]:.0f}× larger than {bottom["Region"]}')

In [None]:
# Exercise 1.2: Grouped Bar Chart
top8 = df.nlargest(8, 'Population_M')

fig, ax = plt.subplots(figsize=(12, 6))
x = np.arange(len(top8))
width = 0.35

ax.bar(x - width/2, top8['Poverty_Rate'],  width, label='Poverty Rate (%)',  color='#EF4444')
ax.bar(x + width/2, top8['Literacy_Rate'], width, label='Literacy Rate (%)', color='#2563EB')

ax.set_xticks(x)
ax.set_xticklabels(top8['Region'], rotation=45, ha='right')
ax.set_ylabel('Rate (%)')
ax.set_title('Poverty Rate vs Literacy Rate — Top 8 Regions by Population', fontweight='bold')
ax.legend()
plt.tight_layout()
plt.show()

---
## Part 2: Distribution Charts

In [None]:
# Exercise 2.1: Histogram with KDE Overlay
fig, ax = plt.subplots(figsize=(10, 6))
sns.histplot(df['Poverty_Rate'], kde=True, bins=8, color='steelblue', ax=ax)

mean_pov   = df['Poverty_Rate'].mean()
median_pov = df['Poverty_Rate'].median()

ax.axvline(mean_pov,   color='red',   linestyle='--', linewidth=2,
           label=f'Mean: {mean_pov:.1f}%')
ax.axvline(median_pov, color='green', linestyle='--', linewidth=2,
           label=f'Median: {median_pov:.1f}%')

ax.set_xlabel('Poverty Rate (%)')
ax.set_ylabel('Count')
ax.set_title('Distribution of Regional Poverty Rates', fontweight='bold')
ax.legend()
plt.tight_layout()
plt.show()

print(f'Mean: {mean_pov:.1f}%  |  Median: {median_pov:.1f}%')
print('Mean > Median because BARMM (63%) is a right-side outlier that pulls the average up — right-skewed distribution.')
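
One way to check the skew explanation is to recompute both statistics with BARMM left out. The sketch below copies the `Poverty_Rate` values from the dataset and uses only the standard library; the variable names (`without_barmm`, `gap_all`, `gap_trimmed`) are illustrative and not part of the lab.

```python
from statistics import mean, median

# Poverty_Rate values copied from the lab dataset; BARMM is the last entry (63.0).
poverty = [3.2, 14.8, 11.2, 14.1, 10.5, 8.1, 22.4, 28.3,
           20.1, 18.7, 30.5, 33.8, 25.6, 16.3, 29.1, 63.0]
without_barmm = poverty[:-1]

# Gap between mean and median, with and without the outlier
gap_all = mean(poverty) - median(poverty)
gap_trimmed = mean(without_barmm) - median(without_barmm)

print(f'All regions:   mean={mean(poverty):.1f}  median={median(poverty):.1f}  gap={gap_all:+.1f}')
print(f'Without BARMM: mean={mean(without_barmm):.1f}  median={median(without_barmm):.1f}  gap={gap_trimmed:+.1f}')
```

If the explanation holds, the mean-median gap should shrink noticeably once BARMM is removed, while the median barely moves.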

In [None]:
# Exercise 2.2: Box Plots by Island Group
fig, ax = plt.subplots(figsize=(8, 6))
order = ['Luzon', 'Visayas', 'Mindanao']

sns.boxplot(data=df, x='Island_Group', y='Poverty_Rate', order=order, ax=ax)
sns.stripplot(data=df, x='Island_Group', y='Poverty_Rate', order=order,
              color='black', size=6, jitter=True, alpha=0.7, ax=ax)

ax.set_ylabel('Poverty Rate (%)')
ax.set_title('Poverty Rate Distribution by Island Group', fontweight='bold')
plt.tight_layout()
plt.show()

print('Median poverty rates:')
print(df.groupby('Island_Group')['Poverty_Rate'].median().sort_values(ascending=False))
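
The points drawn beyond the whiskers follow the 1.5 × IQR rule, which is the default for Matplotlib and Seaborn box plots. As a minimal sketch, assuming the Mindanao values from the dataset, the same rule can be applied by hand to see which region gets flagged (the names `upper_fence` and `lower_fence` are illustrative):

```python
import numpy as np

# Mindanao Poverty_Rate values from the lab dataset
# (Region IX, Region X, Region XI, Region XII, BARMM).
mindanao = np.array([33.8, 25.6, 16.3, 29.1, 63.0])

q1, q3 = np.percentile(mindanao, [25, 75])
iqr = q3 - q1

# Points outside these fences are drawn as individual outlier markers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = mindanao[(mindanao < lower_fence) | (mindanao > upper_fence)]
print(f'Q1={q1:.1f}, Q3={q3:.1f}, IQR={iqr:.1f}')
print(f'Fences: [{lower_fence:.1f}, {upper_fence:.1f}]')
print('Flagged as outliers:', outliers)
```

With only five points per group the quartiles are coarse, which is one reason the strip plot overlay is useful alongside the boxes.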

---
## Part 3: Relationship Charts

In [None]:
# Exercise 3.1: Scatter Plot with Regression Line
fig, ax = plt.subplots(figsize=(10, 7))

sns.scatterplot(data=df, x='GDP_Billions', y='Tourism_Arrivals_K',
                hue='Island_Group', s=120, ax=ax)
sns.regplot(data=df, x='GDP_Billions', y='Tourism_Arrivals_K',
            scatter=False, color='gray',
            line_kws={'linestyle': '--', 'linewidth': 1.5}, ax=ax)

ax.set_xlabel('GDP (Billion PHP)')
ax.set_ylabel('Tourism Arrivals (Thousands)')
ax.set_title('GDP vs Tourism Arrivals by Region', fontweight='bold')
plt.tight_layout()
plt.show()

r = df['GDP_Billions'].corr(df['Tourism_Arrivals_K'])
print(f'Pearson r = {r:.3f} — strong positive relationship')

In [None]:
# Exercise 3.2: Correlation Heatmap
fig, ax = plt.subplots(figsize=(8, 6))

corr = df.select_dtypes(include=[np.number]).corr()
sns.heatmap(corr, annot=True, cmap='coolwarm', center=0,
            fmt='.2f', linewidths=0.5, ax=ax)

ax.set_title('Correlation Matrix — Philippine Regional Indicators', fontweight='bold')
plt.tight_layout()
plt.show()

pairs = corr.unstack()
pairs = pairs[pairs < 1.0].sort_values()
print(f'Strongest negative: {pairs.index[0]}  r={pairs.iloc[0]:.3f}')
print(f'Strongest positive: {pairs.index[-1]}  r={pairs.iloc[-1]:.3f}')
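
The `pairs < 1.0` filter drops the diagonal but still lists every pair twice, once as (A, B) and once as (B, A). A possible refinement, sketched here on three columns copied from the dataset, keeps only the entries above the diagonal so each pair appears once. The names `sample`, `mask`, and `unique_pairs` are illustrative.

```python
import numpy as np
import pandas as pd

# Three indicator columns copied from the lab dataset
sample = pd.DataFrame({
    'GDP_Billions': [6842, 352, 418, 310, 1876, 2821, 286, 462,
                     1054, 1342, 284, 318, 724, 1021, 485, 312],
    'Poverty_Rate': [3.2, 14.8, 11.2, 14.1, 10.5, 8.1, 22.4, 28.3,
                     20.1, 18.7, 30.5, 33.8, 25.6, 16.3, 29.1, 63.0],
    'Literacy_Rate': [99.2, 96.8, 97.5, 96.2, 97.8, 98.1, 95.3, 96.0,
                      96.5, 97.1, 95.8, 94.2, 96.0, 97.0, 95.5, 82.5],
})
corr = sample.corr()

# True only strictly above the diagonal (k=1), so each pair is kept once
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)

# Masked-out cells become NaN; dropna() removes them regardless of pandas version
unique_pairs = corr.where(mask).stack().dropna().sort_values()

print(f'Strongest negative: {unique_pairs.index[0]}  r={unique_pairs.iloc[0]:.3f}')
print(f'Strongest positive: {unique_pairs.index[-1]}  r={unique_pairs.iloc[-1]:.3f}')
```

For n numeric columns this yields n(n-1)/2 pairs, so the full six-column matrix in the exercise would give 15.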

---
## Part 4: Design Principles

In [None]:
# Exercise 4.1 — Bad chart
df_top5 = df.nlargest(5, 'GDP_Billions').sort_values('GDP_Billions')

fig, ax = plt.subplots(figsize=(9, 4))
ax.barh(df_top5['Region'], df_top5['GDP_Billions'],
        color=['red','orange','yellow','green','purple'])
ax.set_title('GDP', fontsize=10)
ax.set_xlabel('billions')
ax.grid(True, linestyle='-', linewidth=2)
plt.tight_layout()
plt.show()
print('Problems: rainbow colors, heavy grid, vague title, no direct labels')

In [None]:
# Exercise 4.1 — Tufte clean chart
fig, ax = plt.subplots(figsize=(9, 4))

bars = ax.barh(df_top5['Region'], df_top5['GDP_Billions'], color='#4a90d9')

for spine in ax.spines.values():
    spine.set_visible(False)
ax.set_xticks([])

for bar in bars:
    ax.text(bar.get_width() + 50,
            bar.get_y() + bar.get_height() / 2,
            f'{bar.get_width():,.0f}B',
            va='center', fontsize=10, color='#333333')

ax.set_title("NCR's GDP Is More Than Double CALABARZON's", fontweight='bold')
ax.grid(False)
plt.tight_layout()
plt.show()

In [None]:
# Exercise 4.2 — Pie vs Bar
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6))

ax1.pie(df['GDP_Billions'], labels=df['Region'], autopct='%1.0f%%')
ax1.set_title('Bad: Pie Chart (16 slices)', fontweight='bold', color='#dc2626')

top5   = df.nlargest(5, 'GDP_Billions')
others = df['GDP_Billions'].sum() - top5['GDP_Billions'].sum()
bar_regions = list(top5.sort_values('GDP_Billions')['Region']) + ['Others']
bar_gdp     = list(top5.sort_values('GDP_Billions')['GDP_Billions']) + [others]
bar_colors  = ['#4a90d9'] * 5 + ['#94a3b8']

ax2.barh(bar_regions, bar_gdp, color=bar_colors)
ax2.set_xlabel('GDP (Billion PHP)')
ax2.set_title('Good: Bar Chart (Top 5 + Others)', fontweight='bold', color='#16a34a')
plt.tight_layout()
plt.show()

# The bar chart is easier because bars share a common baseline so length comparisons
# are precise; pie slice angles are hard to estimate, especially for similar-sized slices.

---
## Part 5: Color & Accessibility

In [None]:
# Exercise 5.1: Three Palettes
fig, axes = plt.subplots(1, 3, figsize=(16, 5))

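# 'tab10' is Matplotlib's default cycle. palette=None would inherit the
# colorblind palette set by sns.set_palette() in Setup, making panels 1 and 2 identical.
palettes = [('Default', 'tab10'), ('Colorblind-safe', 'colorblind'), ('Viridis', 'viridis')]
for ax, (name, palette) in zip(axes, palettes):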
    sns.scatterplot(data=df, x='GDP_Billions', y='Poverty_Rate',
                    hue='Island_Group', palette=palette, s=100, ax=ax)
    ax.set_title(f'{name} Palette', fontweight='bold')
    ax.set_xlabel('GDP (Billion PHP)')
    ax.set_ylabel('Poverty Rate (%)')

plt.suptitle('Same Data — Three Color Palettes', fontweight='bold', y=1.02)
plt.tight_layout()
plt.show()

In [None]:
# Exercise 5.2: Redundant Encoding
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 6))

sns.scatterplot(data=df, x='GDP_Billions', y='Poverty_Rate',
                hue='Island_Group', palette='colorblind', s=150, ax=ax1)
ax1.set_title('Color Only', fontweight='bold')
ax1.set_xlabel('GDP (Billion PHP)')
ax1.set_ylabel('Poverty Rate (%)')

sns.scatterplot(data=df, x='GDP_Billions', y='Poverty_Rate',
                hue='Island_Group', style='Island_Group',
                palette='colorblind', s=150, ax=ax2)
ax2.set_title('Color + Shape (Redundant Encoding)', fontweight='bold')
ax2.set_xlabel('GDP (Billion PHP)')
ax2.set_ylabel('Poverty Rate (%)')

plt.tight_layout()
plt.show()

---
## Part 6: Small Multiples & Dashboards

In [None]:
# Exercise 6.1: FacetGrid
g = sns.FacetGrid(df, col='Island_Group',
                   col_order=['Luzon', 'Visayas', 'Mindanao'], height=4)
g.map(sns.histplot, 'GDP_Billions', bins=8, color='steelblue')
g.set_titles(col_template='{col_name}')
g.set_axis_labels('GDP (Billion PHP)', 'Count')

plt.suptitle('GDP Distribution by Island Group', fontweight='bold', y=1.02)
plt.tight_layout()
plt.show()

print('GDP variance by island group:')
print(df.groupby('Island_Group')['GDP_Billions'].var().sort_values(ascending=False))
print('Luzon has by far the highest variance — NCR (6,842B) vs MIMAROPA (286B)')

In [None]:
# Exercise 6.2: 2x2 Dashboard
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

ax = axes[0, 0]
top5_s = df.nlargest(5, 'GDP_Billions').sort_values('GDP_Billions')
ax.barh(top5_s['Region'], top5_s['GDP_Billions'], color='steelblue')
ax.set_xlabel('GDP (Billion PHP)')
ax.set_title('Top 5 Regions by GDP', fontweight='bold')

ax = axes[0, 1]
sns.scatterplot(data=df, x='GDP_Billions', y='Poverty_Rate',
                hue='Island_Group', s=80, ax=ax, legend=False)
ax.set_xlabel('GDP (B PHP)')
ax.set_ylabel('Poverty Rate (%)')
ax.set_title('GDP vs Poverty Rate', fontweight='bold')

ax = axes[1, 0]
sns.boxplot(data=df, x='Island_Group', y='Poverty_Rate',
            order=['Luzon', 'Visayas', 'Mindanao'], ax=ax)
ax.set_ylabel('Poverty Rate (%)')
ax.set_title('Poverty Distribution by Island Group', fontweight='bold')

ax = axes[1, 1]
years   = [2018, 2019, 2020, 2021, 2022, 2023]
ncr_gdp = [5800, 6100, 5200, 5500, 6200, 6842]
cal_gdp = [2200, 2400, 2100, 2300, 2600, 2821]
ax.plot(years, ncr_gdp, '-o', label='NCR',        color='steelblue', linewidth=2)
ax.plot(years, cal_gdp, '-s', label='CALABARZON', color='#7C3AED',   linewidth=2)
ax.set_xlabel('Year')
ax.set_ylabel('GDP (Billion PHP)')
ax.set_title('GDP Trend 2018–2023', fontweight='bold')
ax.legend()

fig.suptitle('Philippine Economy at a Glance', fontsize=16, fontweight='bold')
plt.tight_layout()
plt.show()

---
## Part 7: Plotly Interactive Visualization + Data Story

In [None]:
# Exercise 7.1: Plotly Bar Chart
df_px = df.sort_values('GDP_Billions', ascending=False)

fig = px.bar(
    df_px, x='GDP_Billions', y='Region', orientation='h',
    color='Island_Group',
    title='Philippine GDP by Region — Hover to Explore',
    labels={'GDP_Billions': 'GDP (Billion PHP)', 'Region': ''},
    color_discrete_sequence=px.colors.qualitative.Safe
)
fig.update_layout(yaxis={'categoryorder': 'total ascending'})
fig.show()

In [None]:
# Exercise 7.2: Plotly Bubble Chart
fig = px.scatter(
    df, x='GDP_Billions', y='Tourism_Arrivals_K',
    color='Island_Group', size='Population_M',
    hover_data=['Region', 'Poverty_Rate'],
    title='GDP vs Tourism (Bubble = Population)',
    labels={'GDP_Billions': 'GDP (B PHP)',
            'Tourism_Arrivals_K': 'Tourism Arrivals (Thousands)'},
    color_discrete_sequence=px.colors.qualitative.Safe
)
fig.show()

# BARMM: tiny GDP (312B), tiny tourism (85K), 63% poverty — worst-served region
# NCR: enormous GDP (6842B), high tourism (4200K), only 3.2% poverty

In [None]:
# Exercise 7.3: Annotated Data Story Figure
fig, ax = plt.subplots(figsize=(10, 6))

df_s = df.sort_values('Poverty_Rate', ascending=True)
bar_colors = ['#EF4444' if r == 'BARMM' else '#4a90d9' for r in df_s['Region']]
ax.barh(df_s['Region'], df_s['Poverty_Rate'], color=bar_colors)

for spine in ax.spines.values():
    spine.set_visible(False)
ax.grid(False)

barmm_pos = list(df_s['Region']).index('BARMM')
ax.annotate('BARMM: 63%\n(20× NCR)',
            xy=(63.0, barmm_pos), xytext=(42, barmm_pos - 3),
            arrowprops=dict(arrowstyle='->', color='#EF4444', lw=1.5),
            fontsize=11, fontweight='bold', color='#EF4444')

ax.set_xlabel('Poverty Rate (%)')
ax.set_title("BARMM's Poverty Rate Is 20× Higher Than NCR's",
             fontweight='bold', fontsize=13)
ax.text(0.99, -0.08, 'Source: PSA 2023', transform=ax.transAxes,
        fontsize=9, ha='right', color='gray', style='italic')

plt.tight_layout()
plt.show()

### Exercise 7.4: Data Story Caption — Model Answer

BARMM (Bangsamoro Autonomous Region in Muslim Mindanao) has the highest poverty rate in the Philippines at 63%, roughly 20 times NCR's 3.2% and nearly double that of the next-highest region (Region IX, 33.8%). This extreme disparity reflects decades of armed conflict, limited infrastructure investment, and restricted access to education and health services in the region. Sustained peace efforts and targeted government spending, particularly in education and livelihood programs, are essential prerequisites before BARMM can close this gap with the rest of the country.
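
The ratios quoted in a caption are worth recomputing from the data before publishing. A minimal sketch, using the `Poverty_Rate` values from the dataset (the dictionary and variable names are illustrative):

```python
# Poverty_Rate values from the lab dataset, keyed by region
poverty = {
    'NCR': 3.2, 'CAR': 14.8, 'Region I': 11.2, 'Region II': 14.1,
    'Region III': 10.5, 'CALABARZON': 8.1, 'MIMAROPA': 22.4, 'Region V': 28.3,
    'Region VI': 20.1, 'Region VII': 18.7, 'Region VIII': 30.5, 'Region IX': 33.8,
    'Region X': 25.6, 'Region XI': 16.3, 'Region XII': 29.1, 'BARMM': 63.0,
}

# Rank regions from highest to lowest poverty rate
ranked = sorted(poverty.items(), key=lambda kv: kv[1], reverse=True)
(top_region, top_rate), (second_region, second_rate) = ranked[0], ranked[1]
lowest_region, lowest_rate = ranked[-1]

ratio_to_lowest = top_rate / lowest_rate
ratio_to_second = top_rate / second_rate

print(f'{top_region} vs {lowest_region}: {ratio_to_lowest:.1f}x')
print(f'{top_region} vs {second_region}: {ratio_to_second:.2f}x')
```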

---
## Reflection Questions — Model Answers

**Q1 — Chart families:**
- Comparison (bar): *"Which categories are largest?"* A pie chart with 16 slices (Ex 4.2) would have failed — humans can't accurately compare angles.
- Distribution (histogram/boxplot): *"How is the variable spread?"* A bar of means would have hidden BARMM's outlier status in Ex 2.1.
- Relationship (scatter): *"Is there a pattern between two continuous variables?"* A bar chart can't encode two continuous variables simultaneously (Ex 3.1).

**Q2 — Two Tufte changes:**
1. Removing spines and gridlines eliminates non-data ink that competes visually with bars — the reader's eye goes directly to the data.
2. Direct value labels instead of x-axis ticks remove two cognitive steps (tracing + estimating) and replace them with one (reading).
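
Both changes can be bundled into a small reusable helper. This is only a sketch: `declutter` and its `fmt` and `pad` parameters are hypothetical names, not part of the lab or of Matplotlib, and the data are the top five GDP values from the dataset.

```python
import matplotlib.pyplot as plt


def declutter(ax, bars, fmt='{:,.0f}B', pad=50):
    """Hide spines, x ticks and grid, then label each horizontal bar directly."""
    for spine in ax.spines.values():
        spine.set_visible(False)
    ax.set_xticks([])
    ax.grid(False)
    for bar in bars:
        # Place the value just past the end of the bar, vertically centered
        ax.text(bar.get_width() + pad,
                bar.get_y() + bar.get_height() / 2,
                fmt.format(bar.get_width()),
                va='center')
    return ax


# Top 5 regions by GDP from the lab dataset, sorted ascending for barh
regions = ['Region VI', 'Region VII', 'Region III', 'CALABARZON', 'NCR']
gdp = [1054, 1342, 1876, 2821, 6842]

fig, ax = plt.subplots(figsize=(9, 4))
bars = ax.barh(regions, gdp, color='#4a90d9')
declutter(ax, bars)
```

In a notebook, follow the call with `plt.tight_layout()` and `plt.show()` as in Exercise 4.1. The `pad` value is in data units, so it would need adjusting for charts on a different scale.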

**Q3 — Real-world redundant encoding failure:**
A medical dashboard using only red/green to indicate patient vitals (normal/critical) would be unreadable by ~8% of male clinicians with red-green color blindness. In an emergency, a doctor unable to distinguish a critical alert from a normal reading could delay treatment — potentially fatal.

**Q4 — FacetGrid vs grouped chart:**
A FacetGrid works better when sub-groups have enough data to form meaningful distributions (Ex 6.1 shows Luzon's spread vs Mindanao's). It works worse when you need precise value comparisons across groups, when sub-groups are tiny (2-3 points each), or when you have many groups.
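
A quick group-size check before faceting can flag panels that will be too sparse to read as distributions. In the sketch below, the group labels mirror the `Island_Group` column of the dataset, and `MIN_POINTS = 5` is an arbitrary illustrative cutoff, not a rule from the lab.

```python
from collections import Counter

# Island_Group labels in the same proportions as the lab dataset
island_group = ['Luzon'] * 8 + ['Visayas'] * 3 + ['Mindanao'] * 5

counts = Counter(island_group)

# Hypothetical threshold: panels with fewer points are flagged as sparse
MIN_POINTS = 5
too_small = [group for group, n in counts.items() if n < MIN_POINTS]

for group, n in counts.items():
    note = ' (sparse panel)' if group in too_small else ''
    print(f'{group}: {n} regions{note}')
```

With the notebook's DataFrame, `df['Island_Group'].value_counts()` gives the same counts in one line.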

**Q5 — Matplotlib vs Plotly:**
- Matplotlib: formal academic papers, PDF reports, printed posters. `savefig()` produces publication-quality static output, whereas Plotly's interactivity only works in a browser or notebook renderer.
- Plotly: online dashboards and presentations where audiences explore data live (Week 6). The bubble chart encoded four variables and revealed per-region details on hover — something requiring separate annotations for every point in Matplotlib.

---

## Congratulations!

**Key Skills Practiced:**
- Horizontal bar and grouped bar charts (Matplotlib)
- Histograms with KDE overlay and box plots with strip plot overlay (Seaborn)
- Scatter plots with regression lines and correlation heatmaps
- Tufte data-ink clean-up: spines, direct labels, active titles
- Pie vs bar chart comparison
- Colorblind-safe palettes and redundant color + shape encoding
- FacetGrid small multiples and 2×2 coordinated dashboards
- Annotated publication-quality data story figure
- Plotly Express: interactive bar and bubble charts with hover data

**Next Steps:** Week 6 Lab — Building Interactive Dashboards

---

*CMSC 178DA — Data Analytics | University of the Philippines Cebu*