<a href="https://colab.research.google.com/github/mtazike/Visualization_Design_Exercise/blob/main/Week05.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

> **9/30/2024 clarifications in <font color='blue'>blue</font>**

# Markers and Channels

In this exercise, we will explore the effects of different markers and channel options on a visualization. We will also start to use the customization capabilities of Plotly.

In [None]:
import pandas as pd
import plotly.express as px

In [None]:
import pandas as pd

# paste your URL here
url = 'https://docs.google.com/spreadsheets/d/e/2PACX-1vQkC5sLOdpoyzxkMm3ax22OZIKZ99kUBa8AuiJG2xGSCnwgX28xSkoF6fCoR2WRyE0WTz4m-kQESChv/pub?gid=1808016370&single=true&output=csv'
who_df = pd.read_csv(url)
who_df.head()


Unnamed: 0,Country (location),ISO code,region,income group,year,Health Exp. (% of GDP),Health Exp. per Capita (USD),Gov. Health Exp. (USD),Private Health Exp. (USD),Out-of-Pocket Exp. per Capita (USD),"Gov. Health Exp. per Capita (USD, 2022 prices)",Value,Category_highlight
0,Algeria,DZA,AFR,Lower-middle,2000,3.214854,61.857853,103533.985,40261.19922,1485.909342,1022.24963,False,other
1,Algeria,DZA,AFR,Lower-middle,2001,3.536286,67.058594,123663.777,38492.03125,1646.495321,1146.437871,False,other
2,Algeria,DZA,AFR,Lower-middle,2002,3.441696,66.681633,126996.8608,41630.37109,1724.133123,1331.83535,False,other
3,Algeria,DZA,AFR,Lower-middle,2003,3.325694,75.951309,145057.4834,43985.0,1689.917331,1164.169817,False,other
4,Algeria,DZA,AFR,Lower-middle,2004,3.290305,92.68763,155499.6782,62326.91406,1676.443072,1202.531803,False,other


# Exercises

For the exercises below, you'll need to do the following:

1. Devise a particularly valuable question that your data might be able to answer. That is, think about the aspect of your data you think to be most valuable, and frame a useful question around it.
2. Based on your question, pick two *ordered* (numeric) variables in your data: one **primary variable** and one **secondary variable**. The primary variable should be more essential to the question asked.
3. Lastly, pick a separate ***categorical* variable** that might also relate to the question. <font color='blue'>*Note: See 5.5.2. in the reading: be careful not to plot too many category levels in your visualizations (e.g., a plot with more than 7 categories will be indiscriminable).*</font>
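One way to apply the category-level guideline before plotting is to count the distinct levels in each candidate column. The sketch below is only an illustration: `toy_df` is made-up data (not the WHO dataset), `count_levels` is a hypothetical helper, and the threshold of 7 is taken from the note above.

```python
import pandas as pd

MAX_LEVELS = 7  # guideline from the note above


def count_levels(df, columns):
    """Return the number of distinct non-null levels for each column."""
    return {col: int(df[col].nunique()) for col in columns}


# Made-up data for illustration only (not the WHO dataset)
toy_df = pd.DataFrame({
    "region": ["AFR", "AFR", "EUR", "AMR", "EUR", "SEAR"],
    "country": ["A", "B", "C", "D", "E", "F"],
})

levels = count_levels(toy_df, ["region", "country"])
for col, n in levels.items():
    verdict = "ok" if n <= MAX_LEVELS else "too many levels for one plot"
    print(f"{col}: {n} levels ({verdict})")
```

Running the same check on your own data frame before choosing a categorical variable shows quickly whether a column needs to be filtered or grouped first.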

## EXERCISE 1

*For this exercise, refer to sections 5.3 and 5.4 in the reading.*

With your primary variable as the focus, build a visualization which uses two different (unique) channels. E.g., a simple (unformatted) scatterplot would not work here because both the x-axis and the y-axis measure position. *Note: one of these channels will be higher on the ranked list in Figure 5.6 in the reading.*

- You are welcome to incorporate your secondary <font color='blue'>or</font> categorical variables here.
- <font color='blue'>**Do not use too many variables in a visualization**. Usually, 1-3 variables per plot is reasonable; otherwise, things get confusing.</font>
- <font color='darkred'>**Avoid** assigning the same variable to multiple channels.</font> (E.g., a $y$ variable should not also be assigned a color.)

Customize the markers in your visualization using `update_` or `add_` as described in the Plotly documentation.

In [None]:
import plotly.express as px

fig = px.scatter(
    who_df,
    x="Health Exp. per Capita (USD)",     # primary variable
    y="Health Exp. (% of GDP)",           # secondary variable
    color="region",                       # categorical variable
    hover_name="Country (location)",
    title="Health Expenditure Patterns Across Regions"
)

fig.show()

In [None]:
fig = px.scatter(
    who_df,
    x="Health Exp. per Capita (USD)",
    y="Health Exp. (% of GDP)",
    color="region",
    hover_name="Country (location)",
    title="Health Expenditure Patterns Across Regions"
)

# update methods
fig.update_layout(
    xaxis_title="Health Expenditure per Capita (USD)",
    yaxis_title="Health Expenditure (% of GDP)",
    legend_title="WHO Region",
    template="plotly_white"   # cleaner background
)

fig.update_traces(marker=dict(size=6, opacity=0.7, line=dict(width=1, color='black')))

fig.show()


In [None]:
fig = px.scatter(
    who_df,
    x="Health Exp. per Capita (USD)",
    y="Health Exp. (% of GDP)",
    color="region",
    hover_name="Country (location)",
    title="Health Expenditure Patterns Across Regions"
)

# update methods
fig.update_layout(
    xaxis_title="Health Expenditure per Capita (USD)",
    yaxis_title="Health Expenditure (% of GDP)",
    legend_title="WHO Region",
    template="plotly_white",          # cleaner background
    xaxis=dict(range=[0, 6000]),      # limit x-axis
    yaxis=dict(range=[0, 40])         # limit y-axis
)

fig.update_traces(marker=dict(size=8, opacity=0.7, line=dict(width=1, color='black')))

# add a mean line
mean_value = who_df["Health Exp. per Capita (USD)"].mean()
fig.add_shape(
    type="line",
    x0=mean_value, x1=mean_value,
    y0=0, y1=who_df["Health Exp. (% of GDP)"].max(),
    line=dict(color="green", dash="dot"),
    name="Mean line"
)


fig.show()


**Explanation of Exercise 1** <br>
This scatterplot uses point marks. The x-position channel encodes *Health Expenditure per Capita (USD)* (**primary variable**), and the y-position channel encodes *Health Expenditure (% of GDP)* (**secondary variable**). The **color hue** channel represents the **categorical variable** *region*, separating countries into WHO regions. A vertical dotted line marks the **mean value** of per capita spending across all rows, which helps contextualize the data. These choices follow the principles of expressiveness (ordered data for position, categorical data for color hue) and effectiveness (position is the most accurate channel for ordered data).

## EXERCISE 2

Using the exact same visualization you built in the previous exercise, just **swap the <font color='blue'>primary and secondary variables</font>.** I.e., make the secondary continuous variable your focus.

Is it clear that the first visualization is more effective for your question than the second? Why or why not, and what do you think?

In [None]:
import plotly.express as px

# scatterplot with swapped axes
fig2 = px.scatter(
    who_df,
    x="Health Exp. (% of GDP)",           # now primary
    y="Health Exp. per Capita (USD)",     # now secondary
    color="region",
    hover_name="Country (location)",
    title="Health Expenditure (% of GDP) vs. Per Capita Across Regions"
)

# update methods
fig2.update_layout(
    xaxis_title="Health Expenditure (% of GDP)",
    yaxis_title="Health Expenditure per Capita (USD)",
    legend_title="WHO Region",
    template="plotly_white",
    xaxis=dict(range=[0, 40]),     # adjust to spread points
    yaxis=dict(range=[0, 6000])    # adjust to spread points
)

fig2.update_traces(marker=dict(size=7, opacity=0.7, line=dict(width=1, color='black')))

# add_ method: vertical line at the mean of Health Exp. (% of GDP)
mean_value = who_df["Health Exp. (% of GDP)"].mean()
fig2.add_shape(
    type="line",
    x0=mean_value, x1=mean_value,
    y0=0, y1=who_df["Health Exp. per Capita (USD)"].max(),
    line=dict(color="red", dash="dot"),
    name="Mean line"
)

fig2.show()


<font color='darkblue'>
This scatterplot also uses point marks. The x-position channel now encodes *Health Expenditure (% of GDP)* (the new focus), while the y-position channel encodes *Health Expenditure per Capita (USD)*. The color hue channel continues to represent *region*. A vertical dotted line marks the mean share of GDP spent on health across all country-year rows, which helps contextualize variation across countries.

**Is it clear that the first visualization is more effective for your question than the second? Why or why not, and what do you think?**

<font color='darkblue'>
Yes, the first visualization is more effective for my question, although the difference is one of emphasis rather than channel accuracy. Both plots encode the two numeric variables with position on a common scale, which Munzner’s framework ranks as the most effective channel for ordered data, and that ranking applies to the x-axis and the y-axis alike. What changes is the reading: in Exercise 1, per capita spending runs along the x-axis, where the vertical mean line splits countries by the primary variable and comparisons across regions follow naturally from left to right. In Exercise 2, the mean line and the horizontal reading emphasize share of GDP instead, so the plot is less aligned with the original question, where per capita spending is the stronger focus.

</font>

## EXERCISE 3

Build two very different visualizations that showcase your **categorical** variable. Which one is most effective, and why?

In [None]:
import plotly.express as px

# Create the first box plot
fig3a = px.box(
    who_df,
    x="region",
    y="Health Exp. per Capita (USD)",
    color="region",
    hover_name="Country (location)",
    title="Distribution of Per Capita Health Expenditure by Region"
)

fig3a.update_layout(
    xaxis_title="WHO Region",
    yaxis_title="Health Expenditure per Capita (USD)",
    template="plotly_white",
    yaxis=dict(range=[0, 6000])    # zoom the y-axis; higher values fall outside the view
)
fig3a.show()

In [None]:
import plotly.express as px

# Group by region to get mean values
mean_by_region = who_df.groupby("region")["Health Exp. (% of GDP)"].mean().reset_index()

# Create the bar chart from the regional means
# (plotting who_df directly would stack every country-year row into each bar)
fig3b = px.bar(
    mean_by_region,
    x="region",
    y="Health Exp. (% of GDP)",
    color="region",
    title="Average Health Expenditure (% of GDP) by Region"
)

fig3b.update_traces(
    texttemplate='%{y:.2f}',      # show numeric y-values (2 decimals)
    textposition='outside'        # place labels outside bars
)

fig3b.update_layout(
    xaxis_title="WHO Region",
    yaxis_title="Average Health Expenditure (% of GDP)",
    template="plotly_white"
)

fig3b.show()

**3a (Box Plot):**
Shows the distribution of per capita health expenditure across regions. It highlights variability, medians, and outliers within each region.

**3b (Bar Chart):**
Shows the average health expenditure (% of GDP) by region. It makes regional comparisons clear with numeric labels for easy interpretation.
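To make the contrast between the two summaries concrete, the sketch below computes per-group statistics on made-up numbers (`toy_df` is illustrative only, not the WHO dataset). A bar of means reports only the `mean` column, while a box plot also conveys the median and the spread.

```python
import pandas as pd

# Made-up values for illustration only; AFR contains one large outlier
toy_df = pd.DataFrame({
    "region": ["AFR", "AFR", "AFR", "EUR", "EUR", "EUR"],
    "spend_per_capita": [40.0, 60.0, 500.0, 1800.0, 2000.0, 2200.0],
})

# One row per region: the mean (what a bar of averages shows)
# next to the median and range (what a box plot additionally reveals)
summary = (
    toy_df.groupby("region")["spend_per_capita"]
    .agg(["mean", "median", "min", "max"])
)
print(summary)
```

In this toy data the AFR mean (200) sits well above its median (60) because of a single high value, which is the kind of skew a bar of averages hides.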

<font color='darkblue'>
The bar chart (3b) is more effective for comparing regional averages because it provides clear numeric values and a straightforward comparison across regions. However, the box plot (3a) is valuable for showing distribution and variation, which the bar chart hides. Together, they provide complementary insights, but in terms of clarity and quick interpretation, I believe 3b is the more effective of the two.
