# CSE 512 Uncertainty Exercise

In [3]:
import altair as alt
import pandas as pd

In this exercise, you will design and assess visualizations of uncertain data.

## Instructions
We'll be working in teams of 2-3 for this exercise. Create a fork of this notebook by clicking on the fork button above (next to the share button). You will use your group's notebook to keep track of your work.

Next, click on the share button in your fork and select the option to save it as 'unlisted'. This will only allow people with access to the link to view the notebook.

**Note**: Make sure you submit your group's notebook *before* the end of lecture!

---
## Task 1: Experimental Results

Given experimental data, how might you best convey the results? Are there meaningful differences between conditions?

You're analyzing measurements of birds. Scientists have measured the length and depth of bird beaks on a number of islands. Based on other physical characteristics of the birds, the data have been grouped into two conditions (`A` and `B`).

_Note: To work with the data outside of Observable, download the "birds.json" file, available as a file attachment linked under the paperclip icon in the upper right._

In [5]:
birds_df = pd.read_csv("data/birds.csv")
birds_df.head()

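| Unnamed: 0 | condition | bill_length | bill_depth | sex |
| ---------: | :-------- | ----------: | ---------: | :----- |
| 0 | B | 39.1 | 18.7 | male |
| 1 | B | 39.5 | 17.4 | female |
| 2 | B | 40.3 | 18.0 | female |
| 3 | B | 36.7 | 19.3 | female |
| 4 | B | 39.3 | 20.6 | male |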


For your first analysis question, you want to assess whether `bill_length` varies significantly across conditions. Create two visualizations to compare `bill_length` by `condition`: one focused on the mean length, and another focused on the distribution of values.

### Convey the mean (average) lengths

Plot the average `bill_length` for each `condition`. Also include a measure of spread, such as the standard error of the mean or the interquartile range.

In [19]:
alt.Chart(
    birds_df,
).mark_boxplot().encode(
    x=alt.X("bill_length:Q", title="Bill Length (mm)"),
    y=alt.Y("condition:N", title="Condition"),
).properties(
    width=700,
    height=100,
    title="Bill Length by Condition (Median and IQR)",
)

What measure of spread did you choose, and why?

- _We chose a box plot because it automatically shows the interquartile range, which is our measure of spread. Note that the center line of a box plot marks the median, not the mean._
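
The prompt asks for the mean together with a measure of spread, which a box plot does not show directly. As a minimal sketch of one alternative, the standard error of the mean (SEM) can be computed as the sample standard deviation divided by the square root of the sample size. The helper name `mean_and_sem` is hypothetical, and the sample uses only the five condition `B` rows visible in the `head()` output above, so the numbers are illustrative rather than results for the full dataset.

```python
import math
import statistics


def mean_and_sem(values):
    """Return (mean, standard error of the mean) for a sequence of numbers."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two values to estimate spread")
    mean = statistics.fmean(values)
    # statistics.stdev is the sample standard deviation (n - 1 denominator).
    sem = statistics.stdev(values) / math.sqrt(n)
    return mean, sem


# Only the five condition-B rows shown by birds_df.head(); not the full data.
bill_length_b_head = [39.1, 39.5, 40.3, 36.7, 39.3]

mean_b, sem_b = mean_and_sem(bill_length_b_head)
print(f"mean = {mean_b:.2f} mm, SEM = {sem_b:.2f} mm")
```

In Altair, layering `mark_errorbar(extent='stderr')` over a point mark that encodes `mean(bill_length)` should draw a comparable interval directly from the data frame.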

### Convey the distribution

Now create a visualization intended to better convey the overall distribution of `bill_length` measurements, again divided by `condition`. Examples might include plotting raw values, histograms, or density plots.

In [20]:
alt.Chart(birds_df).mark_bar().encode(
    x=alt.X("bill_length:Q", bin=alt.Bin(maxbins=20), title="Bill Length (mm)"),
    y=alt.Y("count():Q", title="Count"),
    row=alt.Row("condition:N", title="Condition"),
).properties(
    width=600,
    height=200,
    title="Distribution of Bill Length by Condition",
)

Based on the charts you've created, how would you describe the difference between condition A and condition B?

- _Condition B is nearly bimodal, or at least right-skewed with a long tail._
- _Condition A is more tightly concentrated and has a higher mean than Condition B._

---
## Task 2: Assessing Correlation

Your next task is to assess the relationship between `bill_length` and `bill_depth`, again grouped by `condition`. Below is a scatter plot of the two variables, overlaid with a [linear regression model](https://vega.github.io/vega-lite/docs/regression.html) fit to _all_ of the data. Modify the chart to include an additional layer that shows regression fits _per condition_, with each regression line color-coded by `condition`.

In [31]:
base = alt.Chart(birds_df).encode(
    x=alt.X('bill_length:Q', scale=alt.Scale(zero=False)),
    y=alt.Y('bill_depth:Q', scale=alt.Scale(zero=False))
)

# Scatter plot layer
points = base.mark_circle().encode(
    color=alt.Color('condition:N')
)

# Regression line layer
regression_line = base.transform_regression(
    'bill_length', 'bill_depth'
).mark_line(color='black')


# Per-condition regression lines, colored with the same scale as the points
regression_line_by_condition = base.transform_regression(
    'bill_length', 'bill_depth', groupby=['condition']
).mark_line().encode(
    color=alt.Color('condition:N')
)


# Combine layers
chart = alt.layer(
    points,
    regression_line,
    regression_line_by_condition
).properties(
    width=700,
    height=700
)
chart

What do the different regression models (overall vs. subdivided) convey? Describe the trends and try to make sense of any potential contradictions you find.

- _The overall regression suggests a negative correlation between bill length and bill depth, but once we condition on `condition`, both groups show a positive correlation._
- _This apparent contradiction is a well-known phenomenon called Simpson's paradox._
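
To make the reversal concrete, here is a small sketch that fits least-squares slopes to made-up points (not the birds data). Each group has a positive slope, but because one group has longer and shallower bills overall, the pooled slope comes out negative. The helper `ols_slope` and all of the values are hypothetical and chosen only for illustration.

```python
def ols_slope(xs, ys):
    """Slope of the ordinary least-squares line of ys regressed on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


# Synthetic points, NOT the birds data: group A has longer, shallower bills.
group_a = {"length": [45, 47, 49, 51], "depth": [14.0, 14.5, 15.0, 15.5]}
group_b = {"length": [36, 38, 40, 42], "depth": [17.5, 18.0, 18.5, 19.0]}

slope_a = ols_slope(group_a["length"], group_a["depth"])
slope_b = ols_slope(group_b["length"], group_b["depth"])
slope_pooled = ols_slope(
    group_a["length"] + group_b["length"],
    group_a["depth"] + group_b["depth"],
)

print(f"A: {slope_a:+.3f}  B: {slope_b:+.3f}  pooled: {slope_pooled:+.3f}")
```

The within-group slopes describe how depth changes with length among similar birds, while the pooled slope mostly reflects the offset between the two groups.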

---
## Task 3: Election Uncertainty

Two politicians are locked in a tight race and vote counting is underway. How might you communicate the uncertainty and/or likely outcome of the race?

Mickey Mouse and Donald Duck are locked in a tight election to become mayor of Disneyland. With exactly 80% of the total vote counted, the state of the race is:

| Candidate    | Votes |
| :----------- | ----: |
| Mickey Mouse | 18,073 |
| Donald Duck  | 17,847 |

In [None]:
# Create the data manually
data = pd.DataFrame([
    {'Candidate': 'Mickey Mouse', 'Votes': 18073},
    {'Candidate': 'Donald Duck', 'Votes': 17847},
])

# Create the chart
chart = alt.Chart(data).mark_bar().encode(
    x=alt.X('Votes:Q'),
    y=alt.Y('Candidate:N',
            scale=alt.Scale(domain=['Mickey Mouse', 'Donald Duck']),
            title=None),
    color=alt.Color('Candidate:N',
                    scale=alt.Scale(domain=['Mickey Mouse', 'Donald Duck']),
                    legend=None)
).properties(
    width=560
).configure_view(
    stroke=None  # Removes the border around the chart
)
chart

The votes counted so far have come from all parts of Disneyland, and there is no evidence of any bias in the current counts relative to Disneyland's population.

Create a visualization that conveys not just the state of the race, but its associated uncertainty. You are free to apply statistical methods to the data (in which case you may wish to search for appropriate methods online), but you may also consider forms of uncertainty visualization that do _not_ require statistical modeling.

In [36]:
current_vote_total = data.Votes.sum()
# With 80% counted, the uncounted 20% equals 2/8 of the counted total.
remaining_votes = current_vote_total/8 * 2
remaining_votes

np.float64(8980.0)

In [37]:
remaining_votes/(remaining_votes + current_vote_total)

np.float64(0.2)
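
The remaining-vote count also gives a simple, model-free way to frame the race: what share of the uncounted votes would the trailing candidate need in order to win? The sketch below works through that arithmetic using only the vote counts given above; the variable names are illustrative.

```python
mickey_counted = 18073
donald_counted = 17847

counted = mickey_counted + donald_counted  # stated to be exactly 80% of the total
remaining = counted // 4                   # the uncounted 20% is one quarter of the counted 80%
margin = mickey_counted - donald_counted   # Mickey's current lead

# Donald wins outright only if his count d of the remaining votes satisfies
# donald_counted + d > mickey_counted + (remaining - d), i.e. 2*d > remaining + margin.
donald_needed = (remaining + margin) // 2 + 1
donald_needed_share = donald_needed / remaining

print(remaining, margin, donald_needed, f"{donald_needed_share:.2%}")
```

A chart could mark this threshold on the stacked "remaining votes" segment to show how far from an even split the uncounted ballots would need to be for the result to flip.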

In [57]:
# Create the data manually
data = pd.DataFrame([
    {'Candidate': 'Mickey Mouse', 'Votes': 18073},
    {'Candidate': 'Donald Duck', 'Votes': 17847},
])
data_unc = pd.DataFrame([
    {'Candidate': 'Mickey Mouse', 'Votes': 18073 + remaining_votes},
    {'Candidate': 'Donald Duck', 'Votes': 17847 + remaining_votes},
])

# Create the chart
chart = alt.Chart(data).mark_bar().encode(
    x=alt.X('Votes:Q'),
    y=alt.Y('Candidate:N',
            scale=alt.Scale(domain=['Mickey Mouse', 'Donald Duck']),
            title=None),
    color=alt.Color('Candidate:N',
                    scale=alt.Scale(domain=['Mickey Mouse', 'Donald Duck']),
                    legend=None)
).properties(
    width=800,
    height=200,
)

chart_unc = alt.Chart(data_unc).mark_bar().encode(
    x=alt.X('Votes:Q'),
    y=alt.Y('Candidate:N',
            scale=alt.Scale(domain=['Mickey Mouse', 'Donald Duck']),
            title=None),
    color=alt.Color('Candidate:N',
                    scale=alt.Scale(domain=['Mickey Mouse', 'Donald Duck']),
                    legend=None),
    opacity=alt.value(0.5)  # Set opacity to 50%
).properties(
    width=800,
    height=200,
)
final_chart = alt.layer(
    chart,
    chart_unc
).properties(
    title="Vote Counts with Uncertainty"
)
final_chart

In addition to your visualization, address the following questions:

What aspect of uncertainty are you attempting to convey with your image?

- _We are trying to convey the uncertainty in how the remaining votes will be split between the two candidates._

How well do you believe your image achieves this goal? Why?

- _We successfully visualize the remaining 20% of the vote that is yet to be counted. However, one weakness is that we don't visualize how those votes could be distributed between the two candidates. We only compare the possibilities that each candidate gets all of the remaining 20% or none of it._
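
One way to address that weakness is to simulate how the uncounted votes might split. The sketch below assumes the counted votes are an unbiased sample of all voters (as the prompt states), draws a plausible support level for Mickey from a Beta distribution with a uniform prior, and then approximates the binomial split of the remaining votes with a normal distribution. The function name, the prior, the normal approximation, and the seed are all assumptions made for illustration and are not part of the exercise.

```python
import math
import random


def simulate_mickey_win_probability(mickey, donald, remaining, n_sims=20_000, seed=0):
    """Estimate P(Mickey finishes ahead) by simulating the uncounted votes."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(n_sims):
        # Plausible true support for Mickey given the counted votes (uniform prior).
        p = rng.betavariate(mickey + 1, donald + 1)
        # Normal approximation to Binomial(remaining, p) for Mickey's uncounted votes.
        sd = math.sqrt(remaining * p * (1 - p))
        mickey_extra = round(rng.gauss(remaining * p, sd))
        mickey_extra = min(max(mickey_extra, 0), remaining)
        donald_extra = remaining - mickey_extra
        if mickey + mickey_extra > donald + donald_extra:
            wins += 1
    return wins / n_sims


p_mickey = simulate_mickey_win_probability(18073, 17847, remaining=8980)
print(f"Estimated probability that Mickey wins: {p_mickey:.3f}")
```

The simulated final margins could also be plotted, for example as a histogram or a quantile dot plot, to show the range of plausible outcomes rather than only the two extremes.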