
# Multivariate Data

In this exercise, we will explore several ways to show more than a few variables in a single visualization.

In [2]:
import pandas as pd
import plotly.express as px

In [3]:
url = 'https://docs.google.com/spreadsheets/d/e/2PACX-1vQkC5sLOdpoyzxkMm3ax22OZIKZ99kUBa8AuiJG2xGSCnwgX28xSkoF6fCoR2WRyE0WTz4m-kQESChv/pub?gid=1808016370&single=true&output=csv'
who_df = pd.read_csv(url)
who_df.head()

Unnamed: 0,Country (location),ISO code,region,income group,year,Health Exp. (% of GDP),Health Exp. per Capita (USD),Gov. Health Exp. (USD),Private Health Exp. (USD),Out-of-Pocket Exp. per Capita (USD),"Gov. Health Exp. per Capita (USD, 2022 prices)",Value,Category_highlight
0,Algeria,DZA,AFR,Lower-middle,2000,3.214854,61.857853,103533.985,40261.19922,1485.909342,1022.24963,False,other
1,Algeria,DZA,AFR,Lower-middle,2001,3.536286,67.058594,123663.777,38492.03125,1646.495321,1146.437871,False,other
2,Algeria,DZA,AFR,Lower-middle,2002,3.441696,66.681633,126996.8608,41630.37109,1724.133123,1331.83535,False,other
3,Algeria,DZA,AFR,Lower-middle,2003,3.325694,75.951309,145057.4834,43985.0,1689.917331,1164.169817,False,other
4,Algeria,DZA,AFR,Lower-middle,2004,3.290305,92.68763,155499.6782,62326.91406,1676.443072,1202.531803,False,other


# Exercises



## EXERCISE 1

First, devise a question that specifically relates to a **categorical** column of your data, and share it here. _Try to pick (or create) a column with **fewer than 7 categories**_.

Select 3-4 **continuous** columns of your data, and present them in a [parallel-coordinates plot](https://plotly.com/python/parallel-coordinates-plot/). Color the lines using the categorical column of data. Try a few different color maps, and explain why you chose the one you do.

What insight (if any) can you gather from this visualization? Is there a "takeaway message"?
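As a rough sketch of the "pick (or create)" step above, the snippet below counts the categories in an existing column and derives a new low-cardinality column by binning a continuous one. The `toy` frame, its values, the bin edges, and the `spend_band` column are all hypothetical stand-ins, not part of the WHO data.

```python
import pandas as pd

# Hypothetical toy rows; the values are invented for illustration only.
toy = pd.DataFrame({
    "income group": ["Low", "Lower-middle", "Upper-middle", "High", "High"],
    "Health Exp. (% of GDP)": [4.0, 5.5, 6.0, 9.0, 11.0],
})

# Check that an existing categorical column has fewer than 7 categories.
n_groups = toy["income group"].nunique()
print(n_groups)

# Or create a categorical column by binning a continuous one.
# The bin edges are an assumption; pd.cut includes the right edge of each bin.
toy["spend_band"] = pd.cut(
    toy["Health Exp. (% of GDP)"],
    bins=[0, 5, 8, 100],
    labels=["up to 5%", "5 to 8%", "over 8%"],
)
print(toy["spend_band"].value_counts())
```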

In [22]:
import plotly.express as px

# `df` is used throughout the exercises; copy the WHO data loaded above
df = who_df.copy()

income_map = {
    "Low": 0,
    "Lower-middle": 1,
    "Upper-middle": 2,
    "High": 3
}
df["income_group_num"] = df["income group"].map(income_map)

fig = px.parallel_coordinates(
    df,
    dimensions=[
        "Health Exp. (% of GDP)",
        "Health Exp. per Capita (USD)",
        "Private Health Exp. (USD)",
        "Gov. Health Exp. per Capita (USD, 2022 prices)"
    ],
    color="income_group_num",
    color_continuous_scale=px.colors.diverging.Tealrose,
    labels={"income_group_num": "Income Group"}
)

fig.update_layout(title="Parallel Coordinates of Health Expenditure by Income Group")
fig.show()


In [36]:
# Filter dataset for year 2020
df_2020 = df[df["year"] == 2020].copy()

# Map income group again for the filtered data
df_2020["income_group_num"] = df_2020["income group"].map(income_map)

# Use df_2020 in the plot
fig = px.parallel_coordinates(
    df_2020,
    dimensions=[
        "Health Exp. (% of GDP)",
        "Health Exp. per Capita (USD)",
        "Private Health Exp. (USD)",
        "Gov. Health Exp. per Capita (USD, 2022 prices)"
    ],
    color="income_group_num",
    color_continuous_scale=px.colors.qualitative.Set1,  # qualitative palette; Plotly interpolates it as a continuous scale
    labels={"income_group_num": "Income Group"}
)
fig.update_layout(title="Parallel Coordinates of Health Expenditure by Income Group (2020)")
fig.show()

**Exercise 1 Explanation**

**Question**: How do the income groups (Low, Lower-middle, Upper-middle, High) compare in their health expenditure patterns?

**Design Choices**: I selected income group as the categorical column because its 4 categories keep the comparison clear and manageable. I chose 4 continuous variables that capture different aspects of national health spending: Health Exp. (% of GDP), Health Exp. per Capita (USD), Private Health Exp. (USD), and Gov. Health Exp. per Capita (USD, 2022 prices).

At first, I plotted the entire dataset across all years, but the figure was crowded and the variation was hard to interpret. To improve clarity, **I filtered the data to a single year (2020)**, which reduced overlap and made group differences more visible.

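**Color Choice**: I tested continuous colormaps like *Tealrose*, but they blended categories together. I then switched to the qualitative palette **Set1**, whose hues are more distinct. Because `px.parallel_coordinates` treats color as a continuous scale, the income groups are encoded as the numbers 0 to 3 and the palette is interpolated between those values rather than assigned one color per category.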


**Takeaway**
*   Higher-income groups show consistently larger per capita and government health expenditure, even though the share of GDP spent is not always higher.
*   Lower-income groups spend less in absolute terms, and their government spending is comparatively low, which may indicate heavier reliance on private or out-of-pocket spending.
*   Overall, income level strongly shapes health expenditure patterns, with wealthier groups investing more per capita in healthcare.
*   Filtering to one year (2020) gave a clearer picture than the full dataset, which was too cluttered.
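Visual takeaways like these can be cross-checked against per-group summary statistics. The sketch below shows one possible check, using group medians. The `toy` frame is hypothetical and its numbers are invented, so it illustrates the method only and says nothing about the actual WHO values.

```python
import pandas as pd

# Hypothetical toy rows standing in for df_2020 (values invented for illustration).
toy = pd.DataFrame({
    "income group": ["Low", "Low", "High", "High", "High"],
    "Health Exp. (% of GDP)": [5.0, 7.0, 6.0, 8.0, 10.0],
    "Health Exp. per Capita (USD)": [30.0, 50.0, 3000.0, 5000.0, 4000.0],
})

order = ["Low", "Lower-middle", "Upper-middle", "High"]
present = [g for g in order if g in set(toy["income group"])]

# Median of each variable per income group, listed from lowest to highest income.
medians = (
    toy.groupby("income group")[
        ["Health Exp. (% of GDP)", "Health Exp. per Capita (USD)"]
    ]
    .median()
    .reindex(present)
)
print(medians)
```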

## EXERCISE 2

Now, present the same information using [subplots](https://plotly.com/python/subplots/) of *other* visualization types (i.e., do not use more parallel-coordinates plots). Try to *minimize* the colors used in the subplots. E.g., maybe one of the subplots could be colorless.

Between this visualization and the ones before it, which do you think is most effective at communicating the takeaway message? Why?

In [37]:
import plotly.express as px
from plotly.subplots import make_subplots

# Create subplots: 2 rows, 2 columns
fig = make_subplots(
    rows=2, cols=2,
    subplot_titles=[
        "Health Exp. (% of GDP) by Income Group",
        "Health Exp. per Capita (USD) by Income Group",
        "Private Health Exp. (USD) by Income Group",
        "Gov. Health Exp. per Capita (USD, 2022 prices) by Income Group"
    ]
)

# 1. Health Exp. (% of GDP)
box1 = px.box(df_2020, x="income group", y="Health Exp. (% of GDP)")
for trace in box1.data:
    fig.add_trace(trace, row=1, col=1)

# 2. Health Exp. per Capita
box2 = px.box(df_2020, x="income group", y="Health Exp. per Capita (USD)")
for trace in box2.data:
    fig.add_trace(trace, row=1, col=2)

# 3. Private Health Exp.
box3 = px.box(df_2020, x="income group", y="Private Health Exp. (USD)")
for trace in box3.data:
    fig.add_trace(trace, row=2, col=1)

# 4. Gov. Health Exp. per Capita
box4 = px.box(df_2020, x="income group", y="Gov. Health Exp. per Capita (USD, 2022 prices)")
for trace in box4.data:
    fig.add_trace(trace, row=2, col=2)

# Update layout
fig.update_layout(
    height=800, width=1000,
    title="Boxplot Subplots of Health Expenditure Variables by Income Group (2020)",
    showlegend=False
)

fig.show()


In [38]:
import plotly.express as px
from plotly.subplots import make_subplots

fig = make_subplots(
    rows=2, cols=2,
    subplot_titles=[
        "Health Exp. (% of GDP)",
        "Health Exp. per Capita (USD)",
        "Private Health Exp. (USD)",
        "Gov. Health Exp. per Capita (USD, 2022 prices)"
    ]
)

variables = [
    "Health Exp. (% of GDP)",
    "Health Exp. per Capita (USD)",
    "Private Health Exp. (USD)",
    "Gov. Health Exp. per Capita (USD, 2022 prices)"
]

row, col = 1, 1
for var in variables:
    box = px.box(
        df_2020,
        x="income group",
        y=var,
        color_discrete_sequence=["gray"]  # make all boxplots gray
    )
    for trace in box.data:
        fig.add_trace(trace, row=row, col=col)

    col += 1
    if col == 3:  # move to next row
        row += 1
        col = 1

fig.update_layout(
    height=800, width=1000,
    title="Monochrome Boxplots of Health Expenditure Variables by Income Group (2020)",
    showlegend=False
)

fig.show()


**Exercise 2 Explanation**

**Design Choices**: I used boxplot subplots to present the same information as in Exercise 1. Each subplot shows one health expenditure variable by income group. I made two versions: one with default Plotly colors and one in monochrome gray.

**Reason for Design**: Boxplots are effective for showing distributions across categories, which makes them a good fit for comparing income groups. Subplots keep the variables separated but easy to view together, and the monochrome version reduces reliance on color so the focus stays on the data itself.

**Comparison with Exercise 1**:
*   The parallel coordinates plot (Exercise 1) showed all variables at once and revealed some relationships, but it was cluttered with overlapping lines.
*   The boxplot subplots (Exercise 2) deliver the message more clearly: high-income countries spend more per capita and on government health expenditure, while low-income groups spend much less.
*   Overall, the boxplots communicate the differences more clearly, while the parallel coordinates are better for exploring multivariate patterns.








## EXERCISE 3

*Recall the idea of enclosure and containment from the reading.*

Now, take a look at this article on [Horizontal and Vertical Lines and Rectangles](https://plotly.com/python/horizontal-vertical-shapes/) in Plotly. With your data, build a useful visualization which uses these methods.

Make sure to include annotation(s) appropriately.

In [57]:
import plotly.express as px

# Boxplot with all data points shown
fig = px.box(
    df_2020,
    x="income group",
    y="Health Exp. per Capita (USD)",
    points="all",  # show all individual data points
    hover_data=["Country (location)"],  # add country names to hover tooltips
    title="Health Exp. per Capita by Income Group (2020) with Highlighted Zone"
)

# Horizontal line at the overall median across all countries in 2020
median_value = df_2020["Health Exp. per Capita (USD)"].median()
fig.add_hline(
    y=median_value,
    line_dash="dot",
    annotation_text="Overall median spending",
    annotation_position="bottom right"
)

# Shaded rectangle enclosing the "High Spending Zone" (hard-coded 5,000 to 12,000 USD)
fig.add_hrect(
    y0=5000, y1=12000,
    fillcolor="lightgreen", opacity=0.2, line_width=0,
    annotation_text="High Spending Zone",
    annotation_position="top left"
)

fig.add_annotation(
    x="High", y=11000,
    text="High-income countries dominate here",
    showarrow=True,
    arrowhead=2
)

fig.show()


**Exercise 3 Explanation**

For this exercise, I built on the boxplot from Exercise 2 and added enclosure and annotation. A dotted horizontal line marks the overall median spending per capita, and a shaded rectangle encloses the “High Spending Zone” (5,000 to 12,000 USD per capita). An arrow annotation points out that high-income countries dominate this zone.