<a href="https://colab.research.google.com/github/odu-cs625-datavis/public-fall24-mcw/blob/main/Marks_and_Channels_with_Vega_Altair.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Marks and Channels with Vega-Altair**

Based on the [Data Types, Graphical Marks, and Visual Encoding Channels tutorial](https://observablehq.com/@uwdata/data-types-graphical-marks-and-visual-encoding-channels) from the [UW Interactive Data Lab](https://idl.cs.washington.edu). This notebook focuses on basic marks and channels in Vega-Altair. See the full tutorial for more information and additional customizations.

Marks, channels, and data types are the building blocks of creating charts and visualizations. We will use this terminology in our coding examples.

From Chapter 5 in [VAD](https://www.cs.ubc.ca/~tmm/vadbook/):
* "*marks* are the basic geometric elements that depict items or links"
* "*channels* control their [marks] appearance"
* "the effectiveness of a channel for encoding data depends on its *type*"


References
* [Data Transformations](https://altair-viz.github.io/user_guide/transform/index.html)
* [Example Gallery](https://altair-viz.github.io/gallery/index.html)
* [Marks](https://altair-viz.github.io/user_guide/marks/index.html)
* [Encodings](https://altair-viz.github.io/user_guide/encodings/index.html)
* [Specifying Data](https://altair-viz.github.io/user_guide/data.html)

To work along with this tutorial, save a copy of this Colab Notebook in your Google Drive.

## **Data**

First, we'll import the dataset that we'll use for this tutorial.  We will visualize global health and population measures for countries of the world, recorded over the years 1955 to 2005. The data was collected by the [Gapminder Foundation](https://www.gapminder.org/) and shared in [Hans Rosling's popular TED talk](https://www.youtube.com/watch?v=hVimVzgtD6w). (If you haven't seen the talk, I encourage you to watch it!)

Before creating the visualizations, we first need to transform or filter the data. Altair offers two ways to do this, by applying:

1. Standard pandas data transformations *BEFORE* chart definition
2. Vega-Lite’s data transformation tools *WITHIN* chart definition

Often, the first option is preferable when the data source is already a dataframe, such as those from `vega-datasets`. Additionally, pandas provides greater flexibility in data manipulation compared to Vega-Lite. The second approach is useful when the data source is not a dataframe, such as a URL pointing to a JSON or CSV file, or when different views of the data are needed for compound charts.

We'll first install the [`vega-datasets`](https://github.com/vega/vega-datasets) library, and then import `altair` and the `data` module from `vega_datasets`. For this tutorial, we will use the first approach and load the `gapminder` dataset as a pandas dataframe.


In [1]:
pip install vega_datasets



In [2]:
import altair as alt
from vega_datasets import data as vega_data

data = vega_data.gapminder()
data

Unnamed: 0,year,country,cluster,pop,life_expect,fertility
0,1955,Afghanistan,0,8891209,30.332,7.7000
1,1960,Afghanistan,0,9829450,31.997,7.7000
2,1965,Afghanistan,0,10997885,34.020,7.7000
3,1970,Afghanistan,0,12430623,36.088,7.7000
4,1975,Afghanistan,0,14132019,38.438,7.7000
...,...,...,...,...,...,...
688,1985,Venezuela,3,16997509,70.190,3.6485
689,1990,Venezuela,3,19325222,71.150,3.2500
690,1995,Venezuela,3,21555902,72.146,2.9415
691,2000,Venezuela,3,23542649,72.766,2.7230


For each country and year (in 5-year intervals), we have measures of fertility in terms of the number of children per woman (`fertility`), life expectancy in years (`life_expect`), and total population (`pop`).


Next, we'll create subsets of the original dataframe `data` to use later. The first one contains only the rows for the year 2000.

In [3]:
data_2000 = data[data['year'] == 2000]
data_2000.head()

Unnamed: 0,year,country,cluster,pop,life_expect,fertility
9,2000,Afghanistan,0,23898198,42.129,7.4792
20,2000,Argentina,3,37497728,74.34,2.35
31,2000,Aruba,3,69539,73.451,2.124
42,2000,Australia,4,19164620,80.37,1.756
53,2000,Austria,1,8113413,78.98,1.382


This next one has data for countries labeled cluster 1 in the year 2000.

In [4]:
data_2000c1 = data_2000[data_2000['cluster'] == 1]
data_2000c1.head()

Unnamed: 0,year,country,cluster,pop,life_expect,fertility
53,2000,Austria,1,8113413,78.98,1.382
97,2000,Belgium,1,10263618,78.32,1.638
185,2000,Croatia,1,4410830,74.876,1.348
251,2000,Finland,1,5168595,78.37,1.754
262,2000,France,1,59381628,79.59,1.8833


We'll also create a dataset that has values only for the US.

In [5]:
data_US = data[data['country'] == 'United States' ]
data_US.head()

Unnamed: 0,year,country,cluster,pop,life_expect,fertility
671,1955,United States,3,165931000,69.49,3.706
672,1960,United States,3,180671000,70.21,3.314
673,1965,United States,3,194303000,70.76,2.545
674,1970,United States,3,205052000,71.34,2.016
675,1975,United States,3,215973000,73.38,1.788


Finally, we'll have a dataset with only a few countries.

In [6]:
countries = ["United States", "France", "Austria", "Brazil", "Germany"]
data_5countries = data[data['country'].isin(countries)]
data_5countries.head()

Unnamed: 0,year,country,cluster,pop,life_expect,fertility
44,1955,Austria,1,6946885,67.48,2.52
45,1960,Austria,1,7047437,69.54,2.78
46,1965,Austria,1,7270889,70.14,2.53
47,1970,Austria,1,7467086,70.63,2.02
48,1975,Austria,1,7578903,72.17,1.64


## **Vega-Altair**

In Vega-Altair, we first have to decide what type of *mark* will be used ([Vega-Altair mark properties](https://altair-viz.github.io/user_guide/marks/index.html)).

Then we use encodings (using the `encode()` method) to bind data fields to available encoding *channels* for that mark type.

In this notebook we'll examine the following encoding channels:

- `x`: Horizontal (x-axis) position of the mark.
- `y`: Vertical (y-axis) position of the mark.
- `size`: Size of the mark. May correspond to area or length, depending on the mark type.
- `color`: Mark color, specified as a [legal CSS color](https://developer.mozilla.org/en-US/docs/Web/CSS/color_value).
- `shape`: Plotting symbol shape for `point` marks.

For a complete list of available channels, see the [Vega-Altair encoding documentation](https://altair-viz.github.io/user_guide/encodings/index.html).

For the first example, we'll use point marks with the x channel mapped to `fertility` and the y channel mapped to `life_expect`.

In Vega-Altair, when you specify the data attribute name, you also specify its data type:
* nominal (`:N`) - same as categorical
* ordinal (`:O`) - same as ordered
* quantitative (`:Q`)
* temporal (`:T`) - time values, corresponding to Python [Date](https://docs.python.org/3/library/datetime.html#date-objects) values
  * Example temporal values include date strings such as `"2019-01-04"` and `"Jan 04 2019"`, as well as standardized date-times such as the [ISO date-time format](https://en.wikipedia.org/wiki/ISO_8601): `"2019-01-04T17:50:35.643Z"`. There are no temporal values in our global development dataset above, as the `year` field is encoded as an integer.
  * The temporal type in Vega-Altair supports reasoning about time units (year, month, day, hour, etc.), and provides methods for requesting specific time intervals. For more details about temporal data in Vega-Altair, see the [TimeUnit documentation](https://altair-viz.github.io/user_guide/transform/timeunit.html).

Since both `fertility` and `life_expect` are quantitative values, we use `:Q`.

In [7]:
alt.Chart(data_2000).mark_point().encode(
    x = 'fertility:Q',
    y = 'life_expect:Q'
)

**Q1:** *What happens to the chart above if you change `:Q` to `:O` for the `fertility` field on the x-axis? Why?*

**Create a new cell below and enter your code.**

## **Adding Channels**

We can apply additional channels to the marks by specifying them inside the `encode()` method.

The size encoding channel sets a mark's size or extent. The meaning of the channel can vary based on the mark type. For point marks, the size channel maps to the pixel area of the plotting symbol, such that the diameter of the point matches the square root of the size value.

A key thing to note is that setting color inside the `encode()` method maps the channel to a data field. Setting color as an argument to the mark method (e.g., `mark_point(color='blue')`) applies that color directly to every mark, so it is not tied to the data.

In [8]:
alt.Chart(data_2000).mark_point().encode(
    x = 'fertility:Q',
    y = 'life_expect:Q',
    color = 'cluster:N',
    size = 'pop:Q'
)

You can see the difference between *mapping* color and *setting* color here.

In [9]:
alt.Chart(data_2000).mark_point(color = 'blue').encode(
    x = 'fertility:Q',
    y = 'life_expect:Q',
    size = 'pop:Q'
)

**Q2:** *What happens if you set color in both the mark and encoding properties?*

**Create a new cell below and enter your code.**

## **Marks Other than Point**

We can apply the shape channel to point marks using `shape`.

In [10]:
alt.Chart(data_2000).mark_point().encode(
    x = 'fertility:Q',
    y = 'life_expect:Q',
    shape = 'cluster:N',
)

To use a line mark, we call the `mark_line()` method.

In [11]:
alt.Chart(data_US).mark_line().encode(
    x = 'year:O',
    y = 'fertility:Q',
)

**Q3:** *Experiment with different datatypes for year in the cell above. Can you explain the differences between using `:O`, `:N`, `:Q`, and `:T`? Why doesn't `:T` work?*

We can use color to split items by a categorical value. Below, we'll map the color channel to country name. With line marks, this gives a separate line in a different color for each country.

In [12]:
alt.Chart(data_5countries).mark_line().encode(
    x = 'year:O',
    y = 'fertility:Q',
    color = 'country:N'
)

Note that the colors used here are different hues, so they are appropriate for categorical (or nominal) data.

In Vega-Altair (unlike Python Seaborn), when we move from a line mark with `mark_line()` to an area mark with `mark_area()` with multiple categories, the default is to create a *stacked* chart. That doesn't make sense for this particular dataset, because adding the fertility rates of different countries together does not produce a meaningful total.

In [13]:
alt.Chart(data_5countries).mark_area().encode(
    x = 'year:O',
    y = 'fertility:Q',
    color = 'country:N'
)

Now, we'll create a bar chart. In VAD terms a bar is a line mark (Vega-Altair's method is `mark_bar()`). Each country name gets its own horizontal spatial region, and population sets the position on the vertical axis.

In [14]:
alt.Chart(data_2000c1).mark_bar().encode(
    x = 'country:N',
    y = 'pop:Q'
)

We can flip this so it's more readable and map the country name to the vertical spatial region, just by swapping the x and y parameters.

In [15]:
alt.Chart(data_2000c1).mark_bar().encode(
    y = 'country:N',
    x = 'pop:Q'
)

Instead of sorting by the categorical value, we'll usually want to sort by the quantitative value. We can do that here by adding the [`sort`](https://altair-viz.github.io/gallery/bar_chart_sorted.html) parameter to the y-axis encoding. The value `"-x"` says to sort by the attribute mapped to the x-axis, in descending order.

In [17]:
alt.Chart(data_2000c1).mark_bar().encode(
    y = alt.Y('country:N', sort="-x"),
    x = 'pop:Q'
)