
# Visual Encodings with Vega-Lite & Altair

The data and image models we learned about last time gave us a language for talking about visualization design. Today, we'll be more concrete by using those models to construct visualizations.

In particular, we'll be using [Altair](https://altair-viz.github.io/), which provides a Pythonic interface to the [Vega-Lite](https://vega.github.io/vega-lite/) visualization grammar. By "grammar", we mean that the *data*, *transformations*, *marks*, and *encoding channels* we learned about last time are all exposed as building blocks. Creating visualizations involves combining these building blocks and describing their properties, rather than writing for-loops and low-level drawing commands. This approach is known as *declarative specification*: describing *what* we want the visualization to look like, rather than *how* it should be computed and drawn.

To learn more about Vega-Lite and declarative specification, watch the [Vega-Lite presentation video from OpenVisConf 2017](https://www.youtube.com/watch?v=9uaHRWj04D4).

This notebook will guide you through the basic process of creating visualizations in Altair. We recommend keeping a reference to the [Altair documentation](https://altair-viz.github.io/) handy!

## Imports

To start, we must import the necessary libraries: Pandas for data frames and Altair for visualization.

In [0]:
import pandas as pd
import altair as alt

## Data

Data in Altair is built around the Pandas data frame, which consists of a set of named data *columns*. We will also regularly refer to data columns as data *fields*.

Altair makes it easy to load example datasets from the [vega-datasets](https://github.com/vega/vega-datasets) collection into a Pandas data frame. In this notebook, we will be visualizing global health and population data for a number of countries, over the time period of 1955 to 2005. The data was collected by the [Gapminder Foundation](https://www.gapminder.org/) and shared in [Hans Rosling's popular TED talk](https://www.youtube.com/watch?v=hVimVzgtD6w). If you haven't seen the talk, we encourage you to watch it!

In [0]:
from vega_datasets import data as vega_data
data = vega_data.gapminder()

How big is the data?

In [0]:
data.shape

(693, 6)

That's 693 rows, and 6 columns! 

We can also take a peek at the data to examine it in more detail:

In [0]:
data.head(5)

| | cluster | country | fertility | life_expect | pop | year |
|---|---|---|---|---|---|---|
| 0 | 0 | Afghanistan | 7.7 | 30.332 | 8891209 | 1955 |
| 1 | 0 | Afghanistan | 7.7 | 31.997 | 9829450 | 1960 |
| 2 | 0 | Afghanistan | 7.7 | 34.02 | 10997885 | 1965 |
| 3 | 0 | Afghanistan | 7.7 | 36.088 | 12430623 | 1970 |
| 4 | 0 | Afghanistan | 7.7 | 38.438 | 14132019 | 1975 |


So, for each `country` and `year` (in 5-year intervals), we have measures of fertility in terms of the number of children per woman (`fertility`), life expectancy in years (`life_expect`), and total population (`pop`).

We also see a `cluster` field with an integer code. What might this represent? We'll try and solve this mystery as we visualize the data!

We can also create a smaller data frame by filtering down the data like so:

In [0]:
data2000 = data.loc[data['year'] == 2000]

In [0]:
data2000.head(5)

| | cluster | country | fertility | life_expect | pop | year |
|---|---|---|---|---|---|---|
| 9 | 0 | Afghanistan | 7.4792 | 42.129 | 23898198 | 2000 |
| 20 | 3 | Argentina | 2.35 | 74.34 | 37497728 | 2000 |
| 31 | 3 | Aruba | 2.124 | 73.451 | 69539 | 2000 |
| 42 | 4 | Australia | 1.756 | 80.37 | 19164620 | 2000 |
| 53 | 1 | Austria | 1.382 | 78.98 | 8113413 | 2000 |


For more information about data frames, and some useful transformations to prepare Pandas data frames for plotting with Altair, see the [Specifying Data with Altair documentation](https://altair-viz.github.io/user_guide/data.html).

## The Chart Object

The fundamental object in Altair is the `Chart`, which takes a data frame as a single argument:

In [0]:
chart = alt.Chart(data2000)

So far, we have defined the `Chart` object and passed it the simple data frame we generated above. We have not yet told the chart to *do* anything with the data.

## Marks and Encodings

With a chart object in hand, we can now specify how we would like the data to be visualized. We first indicate what kind of graphical *mark* (geometric shape) we want to use to represent the data. We can set the `mark` attribute of the chart object using the `Chart.mark_*` methods.

For example, we can show the data as a point using `Chart.mark_point()`:

In [0]:
alt.Chart(data2000).mark_point()

Here the rendering consists of one point per row in the dataset, all plotted on top of each other, since we have not yet specified positions for these points.

To visually separate the points, we can map various *encoding channels*, or *channels* for short, to fields in the dataset. For example, we could *encode* the field `fertility` of the data with the `x` channel, which represents the x-axis position of the points. To specify this, use the `Chart.encode` method:

In [0]:
alt.Chart(data2000).mark_point().encode(
  x='fertility',
)

The `encode()` method builds a key-value mapping from encoding channels (such as `x`, `y`, `color`, `shape`, `size`, *etc.*) to fields in the dataset, accessed by field name.

For Pandas data frames, Altair automatically determines an appropriate data type for the mapped column. In this case, Altair uses the *quantitative* type for numerical values. 

But, we can also explicitly annotate data types -- necessary when we don't use Pandas, or when we want to override the default inference (e.g., treating numeric data as nominal).

For instance, here we've encoded the `cluster` field on the `y` channel. Though it is numeric data, we want to treat clusters as discrete values and so annotate it as an ordinal (`O`) data type. 

**What would have happened if we didn't include this annotation?** Try it out in the cell below!

In [0]:
alt.Chart(data2000).mark_point().encode(
    x='fertility',
    y='cluster:O'
)

## Activity: Practice Visual Encodings

For the following activities, you may find it useful to refer to Altair's documentation on [marks](https://altair-viz.github.io/user_guide/marks.html) and [visual encodings](https://altair-viz.github.io/user_guide/encoding.html).

### 1D or Univariate Visualizations

Univariate visualizations are a useful first step to get a summary of your data values and understand their distribution.

**Recreate the following visualization (called a "strip plot") of life expectancy in the year 2000.**

![A strip plot of life expectancy in the year 2000](https://drive.google.com/uc?export=view&id=17i0ucQrcD5JrtmRPqEy3dl7vSboP0G_1)

Depending on the type of mark used, strip plots may also be known as dot plots. 

Histograms are another type of univariate visualization, depicting the number of data values that fall within discrete groups.

**Recreate the following histogram showing the number of countries in each cluster in the year 2000.** _Hint_: A special `count()` function can be used instead of a field when specifying a visual encoding. 

![Cluster histogram](https://drive.google.com/uc?export=view&id=1Ds66sSGcvoZodJSpEvKd9aBTexurn_RB)



We can also produce histograms of quantitative fields by discretizing the data into _bins_. 

_Note:_ To produce this histogram, we'll need to use the more verbose form of defining encodings (e.g., `alt.X()` instead of `x=`), which allows us to specify [a binning strategy](https://altair-viz.github.io/user_guide/encoding.html#binning-and-aggregation).

**Recreate the following binned histogram of life expectancy in the year 2000:**

![Binned histogram of life expectancy](https://drive.google.com/uc?export=view&id=1dv1llnCxgR8opOAFooEUiLo0z_bRJBPh)

**Compare and contrast the two univariate summaries of life expectancy.**

What are the strengths of the strip plot vs. the binned histogram and vice-versa?

**Create 2 more univariate visualizations of the full dataset (`data`).** 

Experiment with a variety of mark types and encoding channels.

### 2D or Bivariate Visualizations

While univariate visualizations help us understand one single variable at a time, bivariate (or 2D) visualizations can help us identify the relationship between two variables in our dataset.

**Recreate the following bivariate visualizations (using the full dataset, `data`):**

![A scatterplot of life expectancy against fertility](https://drive.google.com/uc?export=view&id=11AINUEhqdei4Uz19j9qaDtOOeUFfsWUV)

Bivariate visualizations do not have to only use the `x` and `y` encoding channels. For example, **recreate the following bivariate chart:**

![A bubble chart of population against fertility](https://drive.google.com/uc?export=view&id=1Qmdix71Y6rzOxqTHcFlv0GJu3bmO7HiG)

And we can also explore non-quantitative fields as part of bivariate visualizations. For instance, **recreate the following line chart** of China's fertility over the years:

![A line chart of China's fertility](https://drive.google.com/uc?export=view&id=1UzX1glTA71N1lAQYdkJeN1X089y9V-Ku)



Similarly, **recreate the following dot plot of life expectancy over the years:**

![A dot plot of year against life expectancy](https://drive.google.com/uc?export=view&id=1I42Ns11TCKGV_E_A1iWWvMESqlYb-Azj)

And  **[layer](https://altair-viz.github.io/user_guide/compound_charts.html#layer-chart) a reference mean line**:

![Dot plot of life expectancy over years with a reference mean line](https://drive.google.com/uc?export=view&id=1GiJ8mmjXBfTAKiSM4Z0t3KWbuXji8ent)



We can also represent the above data distribution as a **two-dimensional histogram**:

![2D histogram of year x life expectancy](https://drive.google.com/uc?export=view&id=1DE98kz8_5goPMGLsmHv2inIKztg2LPKt)

Or, alternatively, as a **heatmap**:

![Heatmap of year x life expectancy](https://drive.google.com/uc?export=view&id=1LiabjR1aM6z8zVZ4z4KvmnALcGAAAZn0)

**Compare the *size* and *color*-based 2D histograms above. Which encoding do you think should be preferred? Why? In which plot can you more precisely compare the magnitude of individual values? In which plot can you more accurately see the overall density of values?**

**Create 2 more bivariate visualizations, exploring a range of marks and encoding channels.**

### N-D or Multivariate Visualizations



But, of course, we don't have to stop at two dimensions. 

**Recreate the following multi-line chart**. _Hint_: You might find the [`query`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.query.html) pandas function handy.

![Multi-line chart of China, India, and the United States' fertility over the years](https://drive.google.com/uc?export=view&id=1V6WS1TN2LdfafLgNZSXCE0AAzVWqZpUT)

Bubble charts are a popular means of visualizing 3+ dimensions. **Recreate the following bubble chart for the year 2000:**

![Gapminder bubble chart](https://drive.google.com/uc?export=view&id=1koeBxLHMZ-VTErxK2ndTDtBnhrQmWa6m)

So far, we've focused on displaying all variables within a single plot. However, we quickly approach the limits of this strategy. For example, if we encode any more data in the bubble chart above (e.g., using opacity or shape), we begin to make it more difficult for readers to decode insights from the visualization.

For multi-dimensional data, a promising approach is a family of techniques called _small multiple displays_: using a series of charts that use the same (or similar) visual encodings such that they can be easily compared.

One example of small multiples is the **Trellis plot**: the full dataset is partitioned by a nominal or ordinal variable, and each resulting subset of the data is displayed in its own subplot, with the subplots arrayed across [rows or columns](https://altair-viz.github.io/user_guide/encoding.html?highlight=facet#encoding-channels). These partitions are often called **[facets](https://altair-viz.github.io/user_guide/compound_charts.html#faceted-charts).**

**Recreate the following Trellis plot**, reusing our bubble chart visualization and partitioning our data by `year`.

![Trellis Bubble Chart](https://drive.google.com/uc?export=view&id=1uTAg17wsa2kLGyJNjfQ52_Dq9dDvEx5b)

_Hint: You may find it useful to specify a `width` and `height` in your `Chart` constructor to facilitate comparisons across subviews._

Another example of a small multiples display is called the **Scatterplot Matrix or SPLOM.** Instead of partitioning our data into subsets, the SPLOM [repeats](https://altair-viz.github.io/user_guide/compound_charts.html#repeated-charts) the full dataset in every subplot and visualizes one pair of variables at a time.

In [0]:
alt.Chart(data, width=175, height=175).mark_point().encode(
  alt.X(alt.repeat('column'), type='quantitative'),
  alt.Y(alt.repeat('row'), type='quantitative')
).repeat(
  row=['life_expect', 'fertility', 'pop'],
  column=['life_expect', 'fertility', 'pop']
)

**Create 2 more multivariate visualizations, exploring a range of marks and encoding channels.**

## Effective Visual Encoding

When we map data values to encoding channels, Altair/Vega-Lite makes a number of design decisions for us, including what colors and sizes are used for plotted points, the size of the visualization, and how the axis ticks are drawn.

To more effectively encode our data, we may wish to customize these defaults. And, in order to do so, we need to dig into _how_ visual encoding occurs. The workhorse that actually performs this mapping is called a _scale_: a function that takes a data value as input (the scale _domain_) and returns a visual value, such as a pixel position or RGB color, as output (the scale _range_). These scale functions are themselves visualized using _guides_: _axes_ visualize scales with spatial ranges and _legends_ visualize scales with color, size, or shape ranges.
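To make the idea of a scale as a function concrete, here is a minimal plain-Python sketch of a linear scale. It is an illustration only, not how Vega-Lite implements scales; the helper name `linear_scale` and the example domain and range (fertility values 0 to 8 mapped onto 0 to 400 pixels) are assumptions chosen for the demonstration.

```python
def linear_scale(domain, range_):
    """Build a function mapping data values (domain) to visual values (range)."""
    d0, d1 = domain
    r0, r1 = range_

    def scale(value):
        # Normalize the value to a 0..1 fraction of the domain,
        # then stretch that fraction across the range.
        fraction = (value - d0) / (d1 - d0)
        return r0 + fraction * (r1 - r0)

    return scale


# Hypothetical x-scale: fertility values 0..8 mapped to pixels 0..400.
x_scale = linear_scale(domain=(0, 8), range_=(0, 400))
midpoint_pixels = x_scale(4)
```

An axis is then a picture of this function: tick labels show domain values, and tick positions show the corresponding range values.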

Let's look again at our bubble chart for the year 2000:

![Gapminder bubble chart](https://drive.google.com/uc?export=view&id=1koeBxLHMZ-VTErxK2ndTDtBnhrQmWa6m)

By default, for linear quantitative scales, Altair/Vega-Lite includes zero to ensure a proper baseline for comparing ratio-valued data. However, in some cases, a zero baseline may be meaningless, or you may want to focus on interval comparisons.

**Use [`alt.Scale`](https://altair-viz.github.io/user_guide/generated/core/altair.Scale.html#altair.Scale) to disable automatic zero-inclusion** for both axes to produce the following visualization:

![Non-zero Gapminder bubble chart](https://drive.google.com/uc?export=view&id=1vsxVpeJoDpOnqhcCcCYIt48_9u4Bn53s)



Now the axis scales no longer include zero by default. Some padding still remains as the axis domain end points are automatically snapped to _nice_ numbers like multiples of 5 or 10. 

**What happens if you disable niceness?**

Next, let's take a look at our _size_ channel. It encodes population but the plotted sizes are dominated by two countries: India and China (add a `tooltip` encoding to confirm this).

One fix might be to have population values map to a wider _range_ of circle sizes. We can specify this by **customizing our size scale's `range` (e.g., `[0, 1000]`)**.

![Bubble Chart with larger bubbles](https://drive.google.com/uc?export=view&id=1lxYc4k1zXd2B0MgQgQRIZYtrTEaxiaDc)


Let's dive deeper into the distribution of populations in the year 2000. Here's a dot plot with tooltips enabled to help us identify which points correspond to which country:

In [0]:
alt.Chart(data2000).mark_circle().encode(
  alt.X('pop'),
  alt.Y('cluster:N'),
  alt.Color('cluster:N'),
  alt.Tooltip('country')
)

This dot plot isn't terribly informative right now because we have two outliers: the populations of India and China are orders of magnitude larger than those of most other countries.

So what are our options? 

One might be to just **adjust the domain of our x-scale** and **[clip the outliers](https://altair-viz.github.io/user_guide/customization.html?highlight=clip#adjusting-axis-limits)**:

![Clipped population dot plot](https://drive.google.com/uc?export=view&id=1em5qy126q1T_hfzZLupOCVAn0Pnr17eC)



**What are the pros and cons of this strategy?**

An alternate strategy might preserve the outliers but better depict that population values span several orders of magnitude. 

By default, Altair/Vega-Lite uses a `linear` mapping (`y = mx + b`) between the domain values (`pop`) and the range values (_pixels_). To get a better overview of this data, we may instead favor a **logarithmic (`log`) scale**:

![Log dot plot of population in 2000](https://drive.google.com/uc?export=view&id=1-sRZg_d6hf5Vesx4C9joBWOn3OFLK4X9)


Now, the domain values (`pop`) map to range values (_pixels_) via the function `y = m * log(x) + b`.

And, if we were only interested in demarcating every order of magnitude, we could **tweak [the number of x-axis ticks](https://altair-viz.github.io/user_guide/generated/core/altair.Axis.html?highlight=ticks)** for a less noisy display:

![Log dot plot of population with 5 x-axis ticks](https://drive.google.com/uc?export=view&id=1UkK5CcOJVvtQbfVM_bF2VP-RrMpwTjES)


Besides addressing data skew, log scales are also useful when we want to focus on **multiplicative factors**. In a standard linear scale, a visual (pixel) distance of 10 units might correspond to an *addition* of 10 units in the data domain. A logarithmic transform maps between multiplication and addition, such that `log(u) + log(v) = log(u*v)`. As a result, in a logarithmic scale, a fixed visual distance instead corresponds to *multiplication* by a fixed factor in the data domain (a factor of 10 per unit of `log(x)`, assuming a base 10 logarithm).
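The multiplicative behaviour can be checked with a few lines of plain Python. The sketch below is illustrative only: `log_position` is a hypothetical helper implementing `m * log10(x) + b`, and the choice of 100 pixels per factor of 10 is an arbitrary assumption.

```python
import math


def log_position(value, m=100.0, b=0.0):
    """Map a positive data value to a pixel position on a base-10 log scale."""
    return m * math.log10(value) + b


# Multiplying by 10 moves the same pixel distance, wherever we start.
step_1_to_10 = log_position(10) - log_position(1)
step_100_to_1000 = log_position(1000) - log_position(100)

# The same holds for any other factor, such as doubling.
double_from_100 = log_position(200) - log_position(100)
double_from_300 = log_position(600) - log_position(300)
```

On a linear scale the second step of each pair would be far longer than the first, because the absolute differences (9 vs. 900, and 100 vs. 300) are so different.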

For instance, if we want to visualize Google's stock price over time, the default (`linear`) scale allows us to see the absolute change:


In [0]:
stocks = vega_data.stocks()
goog = stocks.loc[stocks['symbol'] == 'GOOG']
alt.Chart(goog).mark_line().encode(
  alt.X('date'),
  alt.Y('price')
)

By changing `price` to use a `log` scale, we instead visualize the percentage change. Each unit of distance now reflects an equal percentage change (i.e., equal percentage gains take up the same amount of space). For example, the vertical distance between \$100 and \$200 is the same as the distance between \$300 and \$600, because each range represents a 100% increase.

In [0]:
alt.Chart(goog).mark_line().encode(
  alt.X('date'),
  alt.Y('price', scale=alt.Scale(type='log'))
)

Log scales, however, do come with some constraints:

*   Only positive numbers are supported, as the logarithm of zero and of negative numbers is undefined.
*   They require some level of audience familiarity -- your readers must understand that log scales depict changes in terms of orders of magnitude or percentage change.
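The first constraint is easy to see in plain Python, where `math.log10` raises a `ValueError` for zero and for negative inputs. The sketch below wraps that behaviour in a hypothetical helper, `safe_log10`; how Vega-Lite itself treats such values on a log scale is not shown here, so it is worth checking your data for non-positive values before switching scale types.

```python
import math


def safe_log10(value):
    """Return log10(value), or None if the value has no real logarithm."""
    try:
        return math.log10(value)
    except ValueError:
        # Raised by math.log10 for zero and for negative numbers.
        return None


positive_result = safe_log10(100)
zero_result = safe_log10(0)
negative_result = safe_log10(-5)
```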



## Submit

Remember to submit your notebook to LMOD by doing the following:

1. Click the _Share_ button in the upper right-hand corner.
2. Click _Get a Shareable Link_ in the dialog box, and double check that _Anyone with a link **can view**_ is selected.

## Acknowledgements

This notebook was adapted from material developed by Jeffrey Heer, Dominik Moritz, and Brock Craft.