<a href="https://colab.research.google.com/github/odu-cs625-datavis/public-fall23-mcw/blob/main/Marks_Channels_Seaborn_Objects.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Marks and Channels with Seaborn Objects Tutorial

Marks, channels, and data types are the building blocks of creating charts and visualizations. We will use this terminology in our coding examples.

From Chapter 5 in [VAD](https://www.cs.ubc.ca/~tmm/vadbook/):
* "*marks* are the basic geometric elements that depict items or links"
* "*channels* control their [marks] appearance"
* "the effectiveness of a channel for encoding data depends on its *type*"

References:
* A Quick Introduction to the Seaborn Objects System, https://www.sharpsightlabs.com/blog/seaborn-objects-introduction/
* The seaborn.objects interface, https://seaborn.pydata.org/tutorial/objects_interface.html
* Properties of Mark objects, https://seaborn.pydata.org/tutorial/properties.html

To work along with this tutorial, sign in to your Google account and File > Save a copy in Drive.

## Data

First, we'll import the dataset that we'll use for this tutorial.  We will visualize global health and population measures for countries of the world, recorded over the years 1955 to 2005. The data was collected by the [Gapminder Foundation](https://www.gapminder.org/) and shared in [Hans Rosling's popular TED talk](https://www.youtube.com/watch?v=hVimVzgtD6w). (If you haven't seen the talk, I encourage you to watch it!)

(The vega-datasets GitHub repo has lots of sample datasets that you can use to practice; see https://github.com/vega/vega-datasets/blob/main/SOURCES.md.)

We'll use Pandas to import the gapminder.json datafile into a dataframe.

In [None]:
import pandas as pd

In [None]:
df = pd.read_json('https://cdn.jsdelivr.net/npm/vega-datasets@2/data/gapminder.json')

In [None]:
print(df.head())

For each `country` and `year` (in 5-year intervals), we have measures of fertility in terms of the number of children per woman (`fertility`), life expectancy in years (`life_expect`), and total population (`pop`).

Next, we'll create some smaller dataframes that we'll use later. The one below should contain the data only for the year 2000.

In [None]:
df_2000 = df.query("year == 2000")
print(df_2000.head())

This one has data for countries labeled cluster 1 in the year 2000.

In [None]:
df_2000c1 = df_2000.query("cluster == 1")
print(df_2000c1.head())

We'll also create a dataset that has values only for the US.

In [None]:
df_US = df.query("country == 'United States'")
print(df_US.head())

Finally, we'll have a set with only a few countries.

In [None]:
df_5countries = df.query("country in ('United States', 'France', 'Austria', 'Brazil', 'Germany')")
print(df_5countries.head())

In [None]:
df_2000_5countries = df_5countries.query("year == 2000")
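
If you are more used to boolean indexing, the `query()` filters in this section can be cross-checked against equivalent masks. The sketch below uses a tiny hand-made table (`toy`, with invented values) rather than the real dataset, so it runs without a network connection.

```python
import pandas as pd

# Tiny stand-in table using some of the gapminder column names (values are made up)
toy = pd.DataFrame({
    "country": ["United States", "United States", "France", "France", "Peru"],
    "year": [1995, 2000, 1995, 2000, 2000],
    "cluster": [3, 3, 1, 1, 3],
    "fertility": [2.0, 2.0, 1.7, 1.9, 2.9],
})

# query() takes the filter as a string expression...
by_query = toy.query("year == 2000")
# ...and a boolean mask expresses the same filter
by_mask = toy[toy["year"] == 2000]
print(by_query.equals(by_mask))

# "country in (...)" corresponds to Series.isin()
few_query = toy.query("country in ('United States', 'France')")
few_mask = toy[toy["country"].isin(["United States", "France"])]
print(few_query.equals(few_mask))
```

Both forms keep the original row index, which is why `equals()` can compare them directly.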

## Seaborn's Objects Interface

In September 2022, the seaborn.objects interface was released. It is based on the [Grammar of Graphics](https://vita.had.co.nz/papers/layered-grammar.pdf), which is what R's [ggplot2](https://ggplot2.tidyverse.org/) is also based on. So, if you've created charts in R before, this should be a bit familiar. It also follows along with the VAD textbook's terminology of marks and channels.

First, we need to import the library:

In [None]:
import seaborn as sns
import seaborn.objects as so

Every Seaborn Objects chart uses the [`so.Plot()`](https://seaborn.pydata.org/generated/seaborn.objects.Plot.html) function.

In [None]:
so.Plot()

Note that inside the notebook environment, you don't need to explicitly call the [`show()`](https://seaborn.pydata.org/generated/seaborn.objects.Plot.show.html) function, but if you want to run these scripts locally, you'll need to add `.show()` as the final command: `so.Plot().show()`

`so.Plot()` takes several parameters, the most important of which are `data`, `x`, and `y`. As you might expect, `data` specifies the dataframe to use.

Most charts will map data to the position channels, so `x` specifies the attribute (column in the dataframe) to be mapped to the horizontal position *channel*, and `y` specifies the attribute to be mapped to the vertical position *channel*.

In [None]:
(so.Plot(data=df_2000,
         x='fertility',
         y='life_expect')
)

The axes are set, but there's nothing in the chart because no *mark* has been specified.  That's where the `add()` function comes in.

There are several types of marks available ([full list](https://seaborn.pydata.org/api.html#mark-objects)).  Some of the more common are:
* [`so.Dot()`](https://seaborn.pydata.org/generated/seaborn.objects.Dot.html)
* [`so.Line()`](https://seaborn.pydata.org/generated/seaborn.objects.Line.html)
* [`so.Bar()`](https://seaborn.pydata.org/generated/seaborn.objects.Bar.html)
* [`so.Area()`](https://seaborn.pydata.org/generated/seaborn.objects.Area.html)

In [None]:
(so.Plot(data=df_2000,
         x='fertility',
         y='life_expect')
  .add(so.Dot())
)

We can apply additional channels to the marks by specifying them as parameters to `so.Plot()`.

A key thing to note is that when you set `color` in the `Plot()` function, it *maps* the channel to the data. If you set the color in the `Dot()` function, it sets the color directly (i.e., it's not tied to the data).

In [None]:
(so.Plot(data=df_2000,
         x='fertility',
         y='life_expect',
         color='cluster',
         pointsize='pop')
  .add(so.Dot())
)

As you can tell, there's some customization that needs to be done, but we'll get to that later.

You can see the difference between *mapping* color and *setting* color here.

In [None]:
(so.Plot(data=df_2000,
         x='fertility',
         y='life_expect',
         pointsize='pop')
  .add(so.Dot(color="green"))
)

### Q1

*What happens if you set color in both `Plot()` and `Dot()`?*

Create a new cell below and write the code to find out.

We looked at using the following channels so far: position on the horizontal axis (`x`), position on the vertical axis (`y`), color (`color`), and area (`pointsize`).  Here we'll show examples of using some of the other channels.

We can apply the shape channel to point marks using the [`marker`](https://seaborn.pydata.org/tutorial/properties.html#marker) property.

In [None]:
(so.Plot(data=df_2000,
         x='fertility',
         y='life_expect',
         marker='cluster')
  .add(so.Dot())
)

We can use a line mark as a connector between points.  We can add multiple marks by including more `add()` functions (in this example, both points and lines).

In [None]:
(so.Plot(data=df_US, x='year', y='fertility')
  .add(so.Dot())
  .add(so.Line())
)

We can use color to split items by a categorical value.  Below, we'll map the color channel to country name and when we use lines, we'll get a different color line for each country.

In [None]:
(so.Plot(data=df_5countries, x='year', y='fertility',
         color='country')
  .add(so.Line())
)

Note that the colors used here are different *hues*, so they are appropriate for categorical data.

### Q2

Why were the colors different (using different saturation values rather than different hues) when `cluster` was the attribute that color was mapped to?

We can also show this same chart with a filled area.

In [None]:
(so.Plot(data=df_5countries, x='year', y='fertility',
         color='country')
  .add(so.Area())
)

Below we use a line mark with horizontal spatial region based on country name and position on the vertical axis based on population. Although we call it a line mark in VAD terminology, it's called a bar mark in Seaborn.

In [None]:
(so.Plot(data=df_2000c1, x='country', y='pop')
  .add(so.Bar())
)

We can flip this so it's more readable and map the country name to the vertical spatial region, just by swapping the x and y parameters.

In [None]:
(so.Plot(data=df_2000c1, y='country', x='pop')
  .add(so.Bar())
)

There are several statistical transformations that can be applied to data, using `Stat` objects.  We'll look at [`Agg()`](https://seaborn.pydata.org/generated/seaborn.objects.Agg.html#), which aggregates the values within each group separately. By default, it computes the mean.

In this example, we'll compute the mean fertility rate for each country over all the years.



In [None]:
(so.Plot(data=df_5countries, y='country', x='fertility')
  .add(so.Bar(), so.Agg())
)
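
As a rough check on what the default aggregation produces, the same per-group mean can be computed directly with pandas. The sketch uses a small invented table (`toy`) rather than the real data, so the numbers are easy to verify by hand.

```python
import pandas as pd

# Invented values: two rows for France, three for Brazil
toy = pd.DataFrame({
    "country": ["France", "France", "Brazil", "Brazil", "Brazil"],
    "fertility": [2.0, 1.8, 6.0, 4.0, 2.0],
})

# One mean per country, the quantity the bar lengths would encode
means = toy.groupby("country")["fertility"].mean()
print(means)
```

Running the equivalent `groupby` on `df_5countries` should give the bar lengths shown in the chart above. The `Agg` API page describes how to pass a different function if you want something other than the mean.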

## Extra Customizations

In this section, we'll look at some extra customizations for Seaborn charts.

* theme customization, using [`Plot.theme()`](https://seaborn.pydata.org/generated/seaborn.objects.Plot.theme.html)
  * the note on the API page says that currently the only valid argument is a dict of matplotlib [rc parameters](https://matplotlib.org/stable/tutorials/introductory/customizing.html)
  * these parameters can also include Seaborn [`axes_style`](https://seaborn.pydata.org/generated/seaborn.axes_style.html) parameters ([examples](https://seaborn.pydata.org/tutorial/aesthetics.html#seaborn-figure-styles))
* color palettes, using [`Plot.scale()`](https://seaborn.pydata.org/generated/seaborn.objects.Plot.scale.html)
* mark customization (edge color, thickness, fill color, size ranges)
* titles and labels, using [`Plot.label()`](https://seaborn.pydata.org/generated/seaborn.objects.Plot.label.html)
* axis label formatting
* chart size, using [`Plot.layout()`](https://seaborn.pydata.org/generated/seaborn.objects.Plot.layout.html)


In [None]:
from seaborn import axes_style

We'll look at most of these customizations with the same base chart.

In [None]:
(so.Plot(data=df_2000,
         x='fertility',
         y='life_expect',
         color='cluster',
         pointsize='pop')
  .add(so.Dot())
)

The customization below uses the `ticks` style, which has a white background with outside tick marks.

In [None]:
(so.Plot(data=df_2000,
         x='fertility',
         y='life_expect',
         color='cluster',
         pointsize='pop')
  .add(so.Dot())
  .theme({**axes_style("ticks")})  # change theme
)
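
The `{**axes_style("ticks")}` idiom works because `axes_style()` returns a dictionary of rc parameters, and unpacking it into a new dict leaves room to add more. The sketch below inspects that dictionary and builds a combined theme. The two extra `axes.spines.*` entries are standard matplotlib rc parameters chosen here purely as an example.

```python
from seaborn import axes_style

ticks = axes_style("ticks")

# The style is a dict of matplotlib rc parameters
print(isinstance(ticks, dict))
print(ticks["axes.facecolor"])

# Unpack the style and add extra rc parameters on top of it
custom_theme = {**ticks, "axes.spines.top": False, "axes.spines.right": False}
```

`custom_theme` could then be passed to a plot as `.theme(custom_theme)`.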

Scales, such as axis scales and color scales, are controlled with the [`Plot.scale()`](https://seaborn.pydata.org/generated/seaborn.objects.Plot.scale.html) function.

We'll first look at adjusting the colors used for the clusters. In [Q2](https://colab.research.google.com/drive/1u4Szlh00_FixXDyNdAKeDvQoTE1i7PMO#scrollTo=oJtbcb2ThOnz), we noted that the color scale used *color saturation* for the `cluster` attribute. *Color saturation* is a *magnitude channel* that is suitable for an *ordered attribute*. However, the cluster is the result of some grouping, so it's really a *categorical attribute*. Thus, we need to specify a color palette based on *color hue*, which is an *identity channel*.

Various color palettes are shown in the [Choosing color palettes tutorial](https://seaborn.pydata.org/tutorial/color_palettes.html). For this example, we'll pick the `tab10` palette.

In [None]:
(so.Plot(data=df_2000,
         x='fertility',
         y='life_expect',
         color='cluster',
         pointsize='pop')
  .add(so.Dot())
  .theme({**axes_style("ticks")})
  .scale(color="tab10")            # change color palette
)

In some of the smaller dots, it's difficult to see the color, so one thing we could add is a black border around the dots to see if that will help them stand out. For this, we can specify the [`edgecolor` parameter](https://seaborn.pydata.org/tutorial/properties.html#color-properties) to the mark.

In [None]:
(so.Plot(data=df_2000,
         x='fertility',
         y='life_expect',
         color='cluster',
         pointsize='pop')
  .add(so.Dot(edgecolor="black"))  # add border on dot
  .theme({**axes_style("ticks")})
  .scale(color="tab10")
)

Yet another option would be to use unfilled circles instead of filled circles to help distinguish between dots that might be overlapping. For this, we can use [`so.Dots()`](https://seaborn.pydata.org/generated/seaborn.objects.Dots.html) instead of `so.Dot()`.

In [None]:
(so.Plot(data=df_2000,
         x='fertility',
         y='life_expect',
         color='cluster',
         pointsize='pop')
  .add(so.Dots())                    # use open circles
  .theme({**axes_style("ticks")})
  .scale(color="tab10")
)

By default, the fill is a semi-transparent version of the edge color. But, in the same way that you can set the `edgecolor` for all dots, you can also set the `fillcolor`.

Also, with `so.Dots()`, we can set the `stroke` size to make the border thicker. Note that for some marks, including [`so.Dot()`](https://seaborn.pydata.org/generated/seaborn.objects.Dot.html), this thickness is controlled by the `edgewidth` property.  These differences are all named in the API documentation.

In [None]:
(so.Plot(data=df_2000,
         x='fertility',
         y='life_expect',
         color='cluster',
         pointsize='pop')
  .add(so.Dots(fillcolor="white", stroke=1.5))  # change fill color and border size
  .theme({**axes_style("ticks")})
  .scale(color="tab10")
)

Next, let's look at adjusting the circle sizes. Just as we changed the range of colors used for the clusters, we can adjust the range of sizes used for population using `Plot.scale()` to modify the `pointsize` channel.

In this first example, we'll just adjust the range for the [`pointsize`](https://seaborn.pydata.org/tutorial/properties.html#pointsize-property) channel to be the tuple `(1,20)`. Note that the documentation specifies these values as mark diameters (in points), while the scaling is applied so that data values are proportional to mark area.

In [None]:
(so.Plot(data=df_2000,
         x='fertility',
         y='life_expect',
         color='cluster',
         pointsize='pop')
  .add(so.Dots(fillcolor="white", stroke=1.5))
  .theme({**axes_style("ticks")})
  .scale(color="tab10",
         pointsize=(1,20))     # change circle size range
)

In addition to making the range of the circles larger, we can also make the legend look nicer.  

Since we want a bit more control, we'll use the [`so.Continuous()`](https://seaborn.pydata.org/generated/seaborn.objects.Continuous.html) scale. We can still specify the range as the `(1,20)` tuple, but we can also control the number of circle sizes shown in the legend using `tick()` and format the numbers shown using `label()`:

* `tick(upto=5)` says to choose "nice" locations (like even numbers) for the divisions, but don't create more than 5 divisions

* `label(unit="")` says to use [SI prefixes](https://en.wikipedia.org/wiki/Metric_prefix) with the given unit. Since population doesn't have a unit (like grams (g) for instance), we can specify no additional suffix by using `""`.

In [None]:
(so.Plot(data=df_2000,
         x='fertility',
         y='life_expect',
         color='cluster',
         pointsize='pop')
  .add(so.Dots(fillcolor="white", stroke=1.5))
  .theme({**axes_style("ticks")})
  .scale(color="tab10",
         # change circle size range, number of circles in legend, label SI units
         pointsize=so.Continuous((1,20)).tick(upto=5).label(unit=""))
)
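
To preview what SI-prefixed labels look like without drawing a chart, matplotlib's `EngFormatter` produces the same style of output. Whether seaborn uses exactly this formatter internally is an assumption here, so treat the sketch as an illustration of the notation rather than of seaborn's implementation. The sample values are arbitrary.

```python
from matplotlib.ticker import EngFormatter

# No unit and no separator, roughly what label(unit="") asks for
si = EngFormatter(unit="", sep="")

for value in (5_000, 250_000_000, 1_300_000_000):
    print(value, "->", si.format_eng(value))
```

Note that a billion is written with the SI prefix G (giga) rather than B, which is worth keeping in mind when reading a population legend.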

### Q3
Edit the cell above to experiment with different values for `upto`.  

Then, copy the code into a new cell below and replace `upto` with `count` and specify a number of circles to be shown in the legend.  

*What difference do you see with using `upto` vs. `count`?*

Now let's look at customizing the labels and titles on the chart using [`Plot.label()`](https://seaborn.pydata.org/generated/seaborn.objects.Plot.label.html). We can provide a label for each channel that we specify in `Plot()` as well as a `title` for the chart.



In [None]:
(so.Plot(data=df_2000,
         x='fertility',
         y='life_expect',
         color='cluster',
         pointsize='pop')
  .add(so.Dots(fillcolor="white", stroke=1.5))
  .theme({**axes_style("ticks")})
  .scale(color="tab10",
         pointsize=so.Continuous((1,20)).tick(upto=5).label(unit=""))
  .label(x="Fertility (children per woman)",       # set title, x, y, and legend labels
         y="Life Expectancy (years)",
         color="Region (cluster)",
         pointsize="Population",
         title="Countries with higher fertility tend to have lower life expectancy")
)

We can also use the `so.Continuous().label()` function to adjust how values on the x and y axes are labeled. For instance, in the bar chart showing population below, the population values are displayed in scientific notation by default.

In [None]:
(so.Plot(data=df_2000_5countries,
         x='country',
         y='pop')
  .add(so.Bar())
)

We can apply `so.Continuous().label()` to the y scale to adjust this display.

In [None]:
(so.Plot(data=df_2000_5countries,
         x='country',
         y='pop')
  .add(so.Bar())
  .scale(y=so.Continuous().label(unit=""))  # use SI units
)

If we want to use commas instead of SI units, we can use the `like=` parameter with a [formatter](https://docs.python.org/3/library/string.html#format-examples).

In [None]:
(so.Plot(data=df_2000_5countries,
         x='country',
         y='pop')
  .add(so.Bar())
  .scale(y=so.Continuous().label(like="{x:,.0f}"))  # use formatter (commas, 0 digits after the decimal)
)
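
The pattern passed to `like=` is an ordinary Python format string with a field named `x`, so it can be tried out in plain Python before it goes into a chart. The value below is arbitrary and not read from the dataset.

```python
pop = 281_421_906  # arbitrary example value

with_commas = "{x:,.0f}".format(x=pop)       # thousands separators, no decimals
scientific = "{x:.2e}".format(x=pop)         # scientific notation, 2 decimals
one_decimal = "{x:,.1f}".format(x=1234.56)   # separators plus one decimal place

print(with_commas)   # 281,421,906
print(scientific)    # 2.81e+08
print(one_decimal)   # 1,234.6
```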

Our final customization is to change the chart size. Let's make it wider so that it fills more of the screen.  This can be done with [`Plot.layout()`](https://seaborn.pydata.org/generated/seaborn.objects.Plot.layout.html).

In [None]:
(so.Plot(data=df_2000,
         x='fertility',
         y='life_expect',
         color='cluster',
         pointsize='pop')
  .add(so.Dots(fillcolor="white", stroke=1.5))
  .theme({**axes_style("ticks")})
  .scale(color="tab10",
         pointsize=so.Continuous((1,20)).tick(upto=5).label(unit=""))
  .label(x="Fertility (children per woman)",
         y="Life Expectancy (years)",
         color="Region (cluster)",
         pointsize="Population",
         title="Countries with higher fertility tend to have lower life expectancy")
  .layout(size=(12,6))           # change plot size
)