<a href="https://colab.research.google.com/github/odu-cs625-datavis/public-fall24-mcw/blob/main/MultiView_in_Vega_Altair.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Multi-View Composition in Vega-Altair**

*Based on [Multi-View Composition notebook](https://observablehq.com/@uwdata/multi-view-composition) from @uwdata*

When visualizing a number of different data fields, we might be tempted to use as many visual encoding channels as we can: `x`, `y`, `color`, `size`, `shape`, and so on. However, as the number of encoding channels increases, a chart can rapidly become cluttered and difficult to read. An alternative to "over-loading" a single chart is to instead _compose multiple charts_ in a way that facilitates rapid comparisons.

In this notebook, we will examine a variety of operations for _multi-view composition_:

- layer: place compatible charts directly on top of each other,
- facet: partition data into multiple charts, organized in rows or columns,
- concatenate: position arbitrary charts within a shared layout, and
- repeat: take a base chart specification and apply it to multiple data fields.

We'll then look at how these operations form a _view composition algebra_, in which the operations can be combined to build a variety of complex multi-view displays.

In [2]:
!pip install altair==5.4.1

Collecting altair==5.4.1
  Downloading altair-5.4.1-py3-none-any.whl.metadata (9.4 kB)
Collecting narwhals>=1.5.2 (from altair==5.4.1)
  Downloading narwhals-1.14.0-py3-none-any.whl.metadata (7.4 kB)
Downloading altair-5.4.1-py3-none-any.whl (658 kB)
Downloading narwhals-1.14.0-py3-none-any.whl (213 kB)
Installing collected packages: narwhals, altair
  Attempting uninstall: altair
    Found existing installation: altair 4.2.2
    Uninstalling altair-4.2.2:
      Successfully uninstalled altair-4.2.2
Successfully installed altair-5.4.1 narwhals-1.14.0


In [3]:
import altair as alt
import pandas as pd
alt.__version__  # if this doesn't say '5.4.1', restart the runtime and re-run this cell

'5.4.1'

## **Weather Data**

We will be visualizing weather statistics for the U.S. cities of Seattle and New York. Let's load the dataset and peek at the first and last 10 rows:

In [4]:
weather = pd.read_csv("https://raw.githubusercontent.com/vega/vega-datasets/main/data/weather.csv")
weather.head(10)

| | location | date | precipitation | temp_max | temp_min | wind | weather |
|---|---|---|---|---|---|---|---|
| 0 | Seattle | 2012-01-01 | 0.0 | 12.8 | 5.0 | 4.7 | drizzle |
| 1 | Seattle | 2012-01-02 | 10.9 | 10.6 | 2.8 | 4.5 | rain |
| 2 | Seattle | 2012-01-03 | 0.8 | 11.7 | 7.2 | 2.3 | rain |
| 3 | Seattle | 2012-01-04 | 20.3 | 12.2 | 5.6 | 4.7 | rain |
| 4 | Seattle | 2012-01-05 | 1.3 | 8.9 | 2.8 | 6.1 | rain |
| 5 | Seattle | 2012-01-06 | 2.5 | 4.4 | 2.2 | 2.2 | rain |
| 6 | Seattle | 2012-01-07 | 0.0 | 7.2 | 2.8 | 2.3 | rain |
| 7 | Seattle | 2012-01-08 | 0.0 | 10.0 | 2.8 | 2.0 | sun |
| 8 | Seattle | 2012-01-09 | 4.3 | 9.4 | 5.0 | 3.4 | rain |
| 9 | Seattle | 2012-01-10 | 1.0 | 6.1 | 0.6 | 3.4 | rain |


In [5]:
weather.tail(10)

| | location | date | precipitation | temp_max | temp_min | wind | weather |
|---|---|---|---|---|---|---|---|
| 2912 | New York | 2015-12-22 | 4.8 | 15.6 | 11.1 | 3.8 | rain |
| 2913 | New York | 2015-12-23 | 29.5 | 17.2 | 8.9 | 4.5 | rain |
| 2914 | New York | 2015-12-24 | 0.5 | 20.6 | 13.9 | 4.9 | rain |
| 2915 | New York | 2015-12-25 | 2.5 | 17.8 | 11.1 | 0.9 | rain |
| 2916 | New York | 2015-12-26 | 0.3 | 15.6 | 9.4 | 4.8 | rain |
| 2917 | New York | 2015-12-27 | 2.0 | 17.2 | 8.9 | 5.5 | rain |
| 2918 | New York | 2015-12-28 | 1.3 | 8.9 | 1.7 | 6.3 | snow |
| 2919 | New York | 2015-12-29 | 16.8 | 9.4 | 1.1 | 5.3 | rain |
| 2920 | New York | 2015-12-30 | 9.4 | 10.6 | 5.0 | 3.0 | rain |
| 2921 | New York | 2015-12-31 | 1.5 | 11.1 | 6.1 | 5.5 | rain |


We will create multi-view displays to examine weather within and across the cities.

---
## **Layer**


One of the most common ways of combining multiple charts is to *layer* marks on top of each other. If the underlying scale domains are compatible, we can merge them to form _shared axes_. If either of the `x` or `y` encodings is not compatible, we might instead create a _dual-axis chart_, which overlays marks using separate scales and axes.

### **Shared Axes**

Let's start by plotting the minimum and maximum average temperatures per month:

In [6]:
alt.Chart(weather).mark_area(tooltip=True).encode(
    x=alt.X('month(date):T', title=None),
    y=alt.Y('average(temp_max):Q', title='Avg. Temperature °C'),
    y2='average(temp_min):Q'
)

_The plot shows us temperature ranges for each month over the entirety of our data. However, this is pretty misleading as it aggregates the measurements for both Seattle and New York!_

Let's **subdivide the data by location using a color encoding**, while also adjusting the mark opacity to accommodate overlapping areas:

In [7]:
alt.Chart(weather).mark_area(opacity=0.3, tooltip=True).encode(
    x=alt.X('month(date):T', title=None),
    y=alt.Y('average(temp_max):Q', title='Avg. Temperature °C'),
    y2='average(temp_min):Q',
    color=alt.Color('location:N')
)

_We can see that Seattle is more temperate: warmer in the winter, and cooler in the summer._

In this case we've created a layered chart without any special features by simply subdividing the area marks by color. While the chart above shows us the temperature ranges, we might also want to emphasize the middle of the range.

Let's create a **line chart showing the average temperature midpoint**. We'll use a `transform_calculate` to compute the midpoints between the minimum and maximum daily temperatures and create a *derived attribute* named `temp_mid`.

In [8]:
alt.Chart(weather).transform_calculate(
    temp_mid='(datum.temp_min + datum.temp_max) / 2'
).mark_line(tooltip=True).encode(
    x=alt.X('month(date):T', title=None),
    y=alt.Y('average(temp_mid):Q', title="Average Temperature Midpoint (°C)"),
    color=alt.Color('location:N')
)

We'd now like to combine these charts by **layering the midpoint lines over the range areas**. Using the `alt.layer()` function, we can create a new layered chart in which the first argument (`tempMinMax` below) is the first layer and the second argument (`tempMid`) is drawn on top of it. (Note that if we set a custom axis title within one of the layers, it will automatically be used as a shared axis title for all the layers.)

In [9]:
# Chart 1: Area chart for temp_min and temp_max
tempMinMax = alt.Chart(weather).mark_area(opacity=0.3, tooltip=True).encode(
    x=alt.X('month(date):T', title=None),
    y=alt.Y('average(temp_max):Q', title="Average Temperature Range (°C)"),
    y2='average(temp_min):Q',
    color=alt.Color('location:N')
)

# Chart 2: Line chart for temp_mid
tempMid = alt.Chart(weather).transform_calculate(
    temp_mid='(datum.temp_min + datum.temp_max) / 2'
).mark_line(tooltip=True).encode(
    x=alt.X('month(date):T'),
    y=alt.Y('average(temp_mid):Q'),
    color=alt.Color('location:N')
)

alt.layer(tempMinMax, tempMid)

**Q1:** _What happens if both layers have custom axis titles? Modify the code above to find out..._

When creating multiple views, we might find ourselves redundantly specifying the same data or transforms for multiple marks. If we want to, we can **move a shared definition to the `layer` level** for a more compact specification. Here the `temp_mid` calculation is applied once to the layered chart rather than inside the line chart:

In [10]:
# area chart (tempMinMax)
tempMinMax = alt.Chart(weather).mark_area(opacity=0.3, tooltip=True).encode(
    x=alt.X('month(date):T', title=None, axis=alt.Axis(format='%b')),
    y=alt.Y('average(temp_max):Q', title='Avg. Temperature °C'),
    y2='average(temp_min):Q',
    color=alt.Color('location:N')
)

# line chart (tempMid)
tempMid = alt.Chart(weather).mark_line(tooltip=True).encode(
    x=alt.X('month(date):T'),
    y=alt.Y('average(temp_mid):Q'),
    color=alt.Color('location:N')
)

alt.layer(tempMinMax, tempMid).transform_calculate(
    temp_mid='(datum.temp_min + datum.temp_max) / 2')

Note that the order of inputs to a layer matters, as subsequent layers will be drawn on top of earlier layers.

**Q2:** _Try swapping the order of the charts in the cells above. What happens? (Hint: look closely at the color of the `line` marks.)_

### **Dual-Axis Charts**

*MCW: I'm deleting these examples because dual-axis charts are easy to get wrong, so just don't use them.*

While dual-axis charts can be useful, _they are often prone to misinterpretation_, as the different units and axis scales may be incommensurate. Where feasible, consider transformations that map different data fields to shared units, for example showing [quantiles](https://en.wikipedia.org/wiki/Quantile) or relative percentage change.

---
## **Facet**

*Faceting* involves subdividing a dataset into groups and creating a separate plot for each group. In earlier notebooks, we learned how to create faceted charts using the `row` and `column` encoding channels. We'll first review those channels and then show how they are instances of the more general `facet` operation.

Let's start with a basic histogram of maximum temperature values in Seattle:

In [11]:
alt.Chart(weather).mark_bar(tooltip=True).transform_filter(
    alt.datum.location == 'Seattle'
).encode(
    x=alt.X('temp_max:Q', bin=True, title='Max Temperature (°C)'),
    y=alt.Y('count()')
)

_How does this temperature profile change based on the weather of a given day – that is, whether there was drizzle, fog, rain, snow, or sun?_

Let's use the `column` encoding channel to **facet the data by weather type**. We can also use `color` as a redundant encoding, using a customized color range:

In [12]:
colors = alt.Scale(
    domain=['drizzle', 'fog', 'rain', 'snow', 'sun'],
    range=['#aec7e8', '#c7c7c7', '#1f77b4', '#9467bd', '#e7ba52']
)

alt.Chart(weather).mark_bar(tooltip=True).transform_filter(
    alt.datum.location == 'Seattle'
).encode(
    x=alt.X('temp_max:Q', bin=True, title='Max Temp (°C)'),
    y=alt.Y('count()', title='number of days'),
    color=alt.Color('weather:N', scale=colors),  # use custom color scale
    column=alt.Column('weather:N')  # column facet
).properties(
    width=150,
    height=150
)

_Unsurprisingly, those rare snow days center on the coldest temperatures, followed by rainy and foggy days. Sunny days are warmer and, despite Seattle stereotypes, are the most plentiful. Though as any Seattleite can tell you, the drizzle occasionally comes, no matter the temperature!_

In addition to `row` and `column` encoding channels *within* a mark definition, we can take a basic mark definition and then apply faceting using an explicit `facet` function.

Let's **recreate the chart above, but this time using `facet`**. We start with the same basic histogram definition, but without the column channel. We can then invoke the `facet` method on that chart, specifying that we should facet into columns according to the `weather` field; Altair moves the chart's data up to the faceted chart for us. The `facet` method accepts both `row` and `column` parameters. The two can be used together to create a 2D grid of faceted plots.

Finally we include our filter transform, applying it to the top-level faceted chart. While we could apply the filter transform to the histogram definition as before, that is slightly less efficient. Rather than filter out "New York" values within each facet cell, applying the filter to the faceted chart lets Vega-Lite know that we can filter out those values up front, prior to the facet subdivision.

In [13]:
colors = alt.Scale(
    domain=['drizzle', 'fog', 'rain', 'snow', 'sun'],
    range=['#aec7e8', '#c7c7c7', '#1f77b4', '#9467bd', '#e7ba52']
)

alt.Chart(weather).mark_bar(tooltip=True).encode(
    alt.X('temp_max:Q', bin=True, title='Max Temp (°C)'),
    alt.Y('count():Q', title="number of days"),
    alt.Color('weather:N', scale=colors)
).properties(
    width=150,
    height=150
).facet(
    column=alt.Column('weather:N')
).transform_filter(
    alt.datum.location == 'Seattle'  # filter once, on the top-level faceted chart
)

Given all the extra code above, why would we want to use an explicit `facet` function? For basic charts, we should certainly use the `column` or `row` encoding channels if we can. However, using the `facet` function explicitly is useful if we want to facet composed views, such as layered charts.

Let's revisit our layered temperature plots from earlier. Instead of plotting data for New York and Seattle in the same plot, let's **break them up into separate facets**. The individual chart definitions are nearly the same as before: one area mark and one line mark. The one difference is that `temp_mid` is computed up front in pandas rather than with `transform_calculate`. We can layer the charts much as before, then invoke `facet` on the layered view, passing in the data and specifying `column` facets based on the `location` field:

In [14]:
# Calculate the midpoint temperature
weather['temp_mid'] = (weather['temp_min'] + weather['temp_max']) / 2

# Define the area chart (tempMinMax)
tempMinMax = alt.Chart(weather).mark_area(opacity=0.3, tooltip=True).encode(
    alt.X('month(date):T', title=None, axis=alt.Axis(format='%b')),
    alt.Y('average(temp_max):Q', title='Avg. Temperature (°C)'),
    alt.Y2('average(temp_min):Q'),
    alt.Color('location:N')
)

# Define the line chart (tempMid)
tempMid = alt.Chart(weather).mark_line(tooltip=True).encode(
    alt.X('month(date):T'),
    alt.Y('average(temp_mid):Q'),
    alt.Color('location:N')
)

alt.layer(tempMinMax, tempMid).facet(
    column=alt.Column('location:N')
).properties(
    data=weather
)

The faceted charts we have seen so far use the same axis domains across the facet cells. This default of using *shared* scales and axes aids accurate comparison of values. However, in some cases you may wish to **scale each chart independently**, for example if the range of values in the cells differs significantly.

*MCW: Here's another place where I'm removing the example.*

To borrow a cliché: just because you *can* do something, doesn't mean you *should*...

---
## **Concatenate**

Faceting creates [small multiple](https://en.wikipedia.org/wiki/Small_multiple) plots that show separate subdivisions of the data. However, we might wish to create a **multi-view display with different views of the *same* dataset (not subsets) or views involving *different* datasets**.

Vega-Lite provides *concatenation* functions to combine arbitrary charts into a composed chart. The `hconcat` function performs horizontal concatenation, while the `vconcat` function performs vertical concatenation.

Let's start with a basic line chart showing the average maximum temperature per month for both New York and Seattle, much like we've seen before:

In [15]:
alt.Chart(weather).mark_line(tooltip=True).encode(
    alt.X('month(date):T', title=None),
    alt.Y('average(temp_max):Q', title="Average Max Temperature (°C)"),
    alt.Color('location:N')
)

_What if we want to view not just temperature, but also precipitation and wind levels?_

Let's create a **concatenated chart consisting of three plots**. We'll start by defining a "base" chart definition that contains all the aspects that should be shared by our three plots. We can then modify this base chart to create customized variants, with different y-axis encodings for the `temp_max`, `precipitation`, and `wind` fields. We can then concatenate them using the `hconcat` function:

In [16]:
# Base chart with shared properties
base = alt.Chart(weather).mark_line(tooltip=True).encode(
    alt.X('month(date):T', title=None),
    alt.Color('location:N')
).properties(
    width=240,
    height=180
)

# Individual charts for temperature, precipitation, and wind
temp = base.encode(alt.Y('average(temp_max):Q', title="Average Max Temperature (°C)"))
# The dataset uses metric units: precipitation in mm, wind in m/s
precip = base.encode(alt.Y('average(precipitation):Q', title="Average Precipitation (mm)"))
wind = base.encode(alt.Y('average(wind):Q', title="Average Wind (m/s)"))

# Concatenate the charts horizontally
alt.hconcat(temp, precip, wind)

Vertical concatenation works similarly to horizontal concatenation.

**Q3:** _Using the `alt.vconcat` method, modify the code to use a vertical ordering instead of a horizontal ordering._

Finally, note that horizontal and vertical concatenation can be combined.

**Q4:** _What happens if you write something like `alt.vconcat(alt.hconcat(temp, precip), wind)`?_

As we will revisit later, concatenation functions let you combine any and all charts into a multi-view dashboard!



---


## **Repeat**



The concatenation functions above are quite general, allowing arbitrary charts to be composed. Nevertheless, the example above was still a bit verbose: we have three very similar charts, yet have to define them separately and then concatenate them.

For cases where only one or two variables are changing, the `repeat` function provides a convenient shortcut for creating multiple charts. Given a *template* specification with some free variables, the repeat function will then create a chart for each specified assignment of those variables.

Let's **recreate our concatenation example above using the `repeat` function**. The only aspect that changes across charts is the choice of data field for the `y` encoding channel. To create a template specification, we can use the *repeater variable* `alt.repeat('column')` as our y-axis field. This code simply states that we want to use the variable assigned to the `column` repeater, which organizes repeated charts in a horizontal direction.

We then invoke the `repeat` method, passing in data field names for each column:

In [17]:
alt.Chart(weather).mark_line(tooltip=True).encode(
    alt.X('month(date):T', title=None),
    alt.Y(alt.repeat('column'), aggregate='average'),
    alt.Color('location:N')
).properties(
    width=240,
    height=180
).repeat(
    column=['temp_max', 'precipitation', 'wind']
)

Repetition is supported for both columns and rows.

**Q5:** _What happens if you modify the code above to use `row` instead of `column`?_

We can also use `row` and `column` repetition together! One common visualization for exploratory data analysis is the [scatter plot matrix (or SPLOM)](https://en.wikipedia.org/wiki/Scatter_plot#Scatterplot_matrices). Given a collection of variables to inspect, a SPLOM provides a grid of all pairwise plots of those variables, allowing us to assess potential associations.

Let's **use the `repeat` function to create a SPLOM** for the `temp_max`, `precipitation`, and `wind` fields. We first create our template specification, with repeater variables for both the x- and y-axis data fields. We then invoke `repeat`, passing in arrays of field names to use for both `row` and `column`. Vega-Lite will then generate the [cross product (or, Cartesian product)](https://en.wikipedia.org/wiki/Cartesian_product) to create the full space of repeated charts.

_Looking at these plots, there does not appear to be a strong association between precipitation and wind, though we do see that extreme wind and precipitation events occur in similar temperature ranges (~5-15° C). However, this observation is not particularly surprising: if we revisit our histogram at the beginning of the facet section, we can plainly see that the days with maximum temperatures in the range of 5-15° C are the most commonly occurring._

In [18]:
alt.Chart(weather).transform_filter(
    alt.datum.location == 'Seattle'
).mark_circle(size=15, opacity=0.5, tooltip=True).encode(
    alt.X(alt.repeat('column'), type='quantitative'),
    alt.Y(alt.repeat('row'), type='quantitative')
).properties(
    width=150,
    height=150
).repeat(
    row=['temp_max', 'precipitation', 'wind'],
    column=['temp_max', 'precipitation', 'wind']
)

*Now modify the code above to get a better understanding of chart repetition.*

**Q6:** *Try adding another variable (`temp_min`) to the SPLOM.*

**Q7:** *What happens if you rearrange the order of the field names in either the `row` or `column` arguments to the `repeat` function?*

_Finally, to really appreciate what the `repeat` function provides, take a moment to imagine how you might recreate the SPLOM above using only `hconcat` and `vconcat`!_

---
## **A View Composition Algebra**


Together, the composition functions `layer`, `facet`, `concat`, and `repeat` form a *view composition algebra*: **the various operators can be combined to construct a variety of multi-view visualizations**.

As an example, let's start with **two basic charts: a histogram and a simple line (a single `rule` mark) showing a global average**.

In [19]:
basic1 = alt.Chart(weather).mark_bar(tooltip=True).transform_filter(
    alt.datum.location == 'Seattle'
).encode(
    alt.X('month(date):O'),  # month extracted as an ordinal value
    alt.Y('average(temp_max):Q')
)

basic2 = alt.Chart(weather).mark_rule(stroke='firebrick').transform_filter(
    alt.datum.location == 'Seattle'
).encode(
    alt.Y('average(temp_max):Q')
)

alt.hconcat(basic1, basic2)

We can combine the two charts using a `layer` function, and then `repeat` that layered chart to show **histograms with overlaid averages for multiple fields**:

In [20]:
base = alt.Chart(weather).transform_filter(alt.datum.location == "Seattle")

# bar chart layer
bars = base.mark_bar(tooltip=True).encode(
    alt.X('month(date):O', title='Month'),
    alt.Y(alt.repeat("column"), type='quantitative', aggregate='average')
)

# rule chart layer
rule = base.mark_rule(stroke='firebrick').encode(
    alt.Y(alt.repeat("column"), type='quantitative', aggregate='average')
)

alt.layer(bars, rule).properties(
    width=200,
    height=150
).repeat(
    column=['temp_max', 'precipitation', 'wind']
)

Focusing only on the multi-view composition operators, the model for the visualization above is:

~~~
 repeat(column: [...])
 |- layer
    |- bars
    |- rule
~~~

Now let's explore how we can apply *all* the operators within a **final dashboard that provides an overview of Seattle weather**. We'll combine the SPLOM and faceted histogram displays from earlier sections with the repeated histograms above:

In [21]:
base = alt.Chart(weather).transform_filter(alt.datum.location == "Seattle")

# Scatter Plot Matrix (SPLOM)
splom = base.mark_circle(size=15, opacity=0.5, tooltip=True).encode(
    alt.X(alt.repeat("column"), type='quantitative'),
    alt.Y(alt.repeat("row"), type='quantitative')
).properties(
    width=125,
    height=125
).repeat(
    row=['temp_max', 'precipitation', 'wind'],
    column=['temp_max', 'precipitation', 'wind']
)

# Date histograms
date_hist = alt.layer(base.mark_bar(tooltip=True).encode(
        alt.X('month(date):O', title='Month'),
        alt.Y(alt.repeat("row"), type='quantitative', aggregate='average')
    ),
    base.mark_rule(stroke='firebrick').encode(
        alt.Y(alt.repeat("row"), type='quantitative', aggregate='average')
    )
).properties(
    width=175,
    height=125
).repeat(
    row=['temp_max', 'precipitation', 'wind']
)

# Temperature histograms
temp_hist = base.mark_bar(tooltip=True).encode(
    alt.X('temp_max:Q', bin=True, title='Temperature (°C)'),
    alt.Y('count()', title='Count'),
    alt.Color('weather:N', scale=alt.Scale(
        domain=['drizzle', 'fog', 'rain', 'snow', 'sun'],
        range=['#aec7e8', '#c7c7c7', '#1f77b4', '#9467bd', '#e7ba52']
    ))
).properties(
    width=115,
    height=100
).facet(
    column='weather'
)

# Dashboard
alt.vconcat(
    alt.hconcat(splom, date_hist),  # SPLOM and date histograms
    temp_hist,  # Temperature histograms below
    title='Seattle Weather Dashboard'
).resolve_legend(
    color='independent'
).configure_axis(
    labelAngle=0
)

The full composition model for this dashboard is:

~~~
 vconcat
 |- hconcat
 |  |- repeat(row: [...], column: [...])
 |  |  |- splom base chart
 |  |- repeat(row: [...])
 |     |- layer
 |        |- dateHist base chart 1
 |        |- dateHist base chart 2
 |- facet(column: 'weather')
    |- tempHist base chart
~~~

_Phew!_ The dashboard also includes a **few customizations to improve the layout**:

- We adjust chart `width` and `height` properties to assist alignment and ensure the full visualization fits on the screen.
- We call `.resolve_legend(color='independent')` on the dashboard to ensure the color legend is associated directly with the colored histograms by temperature. Otherwise, the legend will resolve to the dashboard as a whole.
- We call `.configure_axis(labelAngle=0)` on the dashboard to ensure that no axis labels are rotated. This helps to ensure proper alignment among the scatter plots in the SPLOM and the histograms by month on the right.

_Try removing or modifying any of these adjustments and see how the dashboard layout responds!_

This dashboard can be reused to show data for other locations or from other datasets.

**Q8:** _Update the dashboard to show weather patterns for New York instead of Seattle._