# Data Types, Graphical Marks, and Visual Encoding Channels

In [1]:
import pandas as pd
import altair as alt
from vega_datasets import data as vega_data

alt.renderers.enable('mimetype')

alt.__version__

'5.0.0rc1'

## Data


In [2]:
data = vega_data.gapminder()

Let's also create a smaller data frame, filtered down to values for the year 2000 only:

In [3]:
# data2000 = 
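
One way to build the filtered frame is a boolean mask on the `year` column. The sketch below uses a tiny hand-made frame (`toy`, with made-up rows rather than real Gapminder values) so it runs without loading the dataset. It assumes the real data has a numeric `year` column, as the later cells suggest.

```python
import pandas as pd

# Hypothetical stand-in for the gapminder frame; values are illustrative only.
toy = pd.DataFrame({
    'country': ['A', 'A', 'B', 'B'],
    'year': [1995, 2000, 1995, 2000],
    'pop': [10, 12, 20, 21],
})

# Keep only the rows whose year equals 2000.
toy2000 = toy.loc[toy['year'] == 2000]
print(toy2000)
```

Applied to the notebook's `data` frame, the same pattern would read `data.loc[data['year'] == 2000]`.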

### Data Types


#### Nominal (N)

#### Ordinal (O)

#### Quantitative (Q)

#### Temporal (T)

## Global Config vs. Local Config vs. Encoding


In [4]:
cars = vega_data.cars.url

alt.Chart(cars).mark_point().encode(
    x='Acceleration:Q',
    y='Horsepower:Q'
)


### Global Config


Use `configure_mark` to set a default opacity and color for every mark in the chart.

In [5]:
alt.Chart(cars).mark_point().encode(
    x='Acceleration:Q',
    y='Horsepower:Q'
).___(  # <--
    ___=0.2, # <--
    ___='red' # <--
)


### Local Config

In [None]:
alt.Chart(cars).mark_point(___, ___).encode( #<--
    x='Acceleration:Q',
    y='Horsepower:Q'
)

### Encoding

In [None]:
alt.Chart(cars).mark_point().encode(
    x='Acceleration:Q',
    y='Horsepower:Q',
    ___=alt.value(0.2), #<--
    ___=alt.value('red') #<--
)

Encoding settings will always override local or global configuration settings.

## Encodings

- In this notebook we'll examine the following encoding channels:

- `x`: Horizontal (x-axis) position of the mark.
- `y`: Vertical (y-axis) position of the mark.
- `size`: Size of the mark. May correspond to area or length, depending on the mark type.
- `color`: Mark color, specified as a [legal CSS color](https://developer.mozilla.org/en-US/docs/Web/CSS/color_value).
- `opacity`: Mark opacity, ranging from 0 (fully transparent) to 1 (fully opaque).
- `shape`: Plotting symbol shape for `point` marks.
- `tooltip`: Tooltip text to display upon mouse hover over the mark.
- `order`: Mark ordering, determines line/area point order and drawing order.
- `column`: Facet the data into horizontally-aligned subplots.
- `row`: Facet the data into vertically-aligned subplots.

### X


In [None]:
alt.Chart(___).mark_point().encode( #<--
    alt.X('___') #<--
)

### Y

- The `y` encoding channel sets a mark's vertical position (y-coordinate). 


- Add the `cluster` field using an ordinal (`O`) data type. 


In [None]:
alt.Chart(data2000).mark_point().encode(
    alt.X('___'), # <--
    alt.Y('___') # <--
)

- _What happens to the chart above if you swap the `O` and `Q` field types?_

- If we instead add the `life_expect` field as a quantitative (`Q`) variable, the result is a scatter plot with linear scales for both axes:

In [None]:
alt.Chart(data2000).mark_point().encode(
    alt.X('___'), # <--
    alt.Y('___') # <--
)

- To disable automatic inclusion of zero, configure the scale mapping using the encoding `scale` attribute:

In [None]:
alt.Chart(data2000).mark_point().encode(
    alt.X('fertility:Q')
        .scale(___), # <--
    alt.Y('life_expect:Q')
        .scale(___) # <--
)

### Size

- Let's augment our scatter plot by encoding population (`pop`) on the `size` channel.

In [None]:
alt.Chart(data2000).mark_point().encode(
    alt.X('fertility:Q'),
    alt.Y('life_expect:Q'),
    alt.Size('___') # <--
)


- Here we update the size encoding to range from 0 pixels (for zero values) to 1,000 pixels (for the maximum value in the scale domain):

In [None]:
alt.Chart(data2000).mark_point().encode(
    alt.X('fertility:Q'),
    alt.Y('life_expect:Q'),
    alt.Size('pop:Q')
        .scale(range=[___]) # <--
)

### Color and Opacity

- Here, we encode the `cluster` field using the `color` channel and a nominal (`N`) data type, resulting in a distinct hue for each cluster value. 

In [None]:
alt.Chart(data2000).mark_point().encode(
    alt.X('fertility:Q'),
    alt.Y('life_expect:Q'),
    alt.Size('pop:Q')
        .scale(range=[0,1000]),
    alt.Color('___') # <--
)

- If we prefer filled shapes, we can pass a `filled=True` parameter to the `mark_point` method:

In [None]:
alt.Chart(data2000).mark_point(___).encode( # <--
    alt.X('fertility:Q'),
    alt.Y('life_expect:Q'),
    alt.Size('pop:Q')
        .scale(range=[0,1000]),
    alt.Color('___') # <--
)

- By default, Altair uses a bit of transparency to help combat over-plotting. 


- We are free to further adjust the opacity, either by passing a default value to the `mark_*` method, or using a dedicated encoding channel.


- Here we demonstrate how to provide a constant value to an encoding channel instead of binding a data field (use 0.5):

In [None]:
alt.Chart(data2000).mark_point(filled=True).encode(
    alt.X('fertility:Q'),
    alt.Y('life_expect:Q'),
    alt.Size('pop:Q')
        .scale(range=[0,1000]),
    alt.Color('cluster:N'),
    alt.OpacityValue(___) # <--
)

### Shape

- The shape encoding channel should only be used with nominal data, as perceptual rank-order and magnitude comparisons are not supported.

Let's encode the `cluster` field using `shape` as well as `color`.

In [None]:
alt.Chart(data2000).mark_point(filled=True).encode(
    alt.X('fertility:Q'),
    alt.Y('life_expect:Q'),
    alt.Size('pop:Q')
        .scale(range=[0,1000]),
    alt.Color('cluster:N'),
    alt.OpacityValue(0.5),
    alt.Shape('___') # <--
)

### Tooltips & Ordering

- Let's add a tooltip encoding for the `country` field, then investigate which countries are being represented.

In [None]:
alt.Chart(data2000).mark_point(filled=True).encode(
    alt.X('fertility:Q'),
    alt.Y('life_expect:Q'),
    alt.Size('pop:Q')
        .scale(range=[0,1000]),
    alt.Color('cluster:N'),
    alt.OpacityValue(0.5),
    alt.Tooltip('___') # <--
)

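- The `order` encoding channel determines the order of data points, which affects both the order in which lines and areas connect their points and the order in which marks are drawn.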

- Let's order the values in descending rank order by the population (`pop`), ensuring that smaller circles are drawn later than larger circles:

In [None]:
alt.Chart(data2000).mark_point(filled=True).encode(
    alt.X('fertility:Q'),
    alt.Y('life_expect:Q'),
    alt.Size('pop:Q')
        .scale(range=[0,1000]),
    alt.Color('cluster:N'),
    alt.OpacityValue(0.5),
    alt.Tooltip('country:N'),
    alt.Order('___', sort='descending') # <--
)

- To show multiple values, we can provide the `tooltip` channel an array of encodings, one for each field we want to include:

In [None]:
alt.Chart(data2000).mark_point(filled=True).encode(
    alt.X('fertility:Q'),
    alt.Y('life_expect:Q'),
    alt.Size('pop:Q')
        .scale(range=[0,1000]),
    alt.Color('cluster:N'),
    alt.OpacityValue(0.5),
    alt.Order('pop:Q', sort='descending'),
    tooltip=[
        'country:N',
        '___:Q', # <--
        '___:Q' # <--
    ]
)

Now we can see multiple data fields upon mouse over!

### Column and Row Facets

- Here is a trellis plot that divides the data into one column per `cluster` value:

In [None]:
alt.Chart(data2000).mark_point(filled=True).encode(
    alt.X('fertility:Q'),
    alt.Y('life_expect:Q'),
    alt.Size('pop:Q', scale=alt.Scale(range=[0,1000])),
    alt.Color('cluster:N'),
    alt.OpacityValue(0.5),
    alt.Tooltip('country:N'),
    alt.Order('pop:Q', sort='descending'),
    alt.Column('___') # <--
)

The plot above does not fit on screen, making it difficult to compare all the sub-plots to each other! The next cell shrinks each sub-plot with `properties` and tidies the legends to save space.

In [None]:
alt.Chart(data2000).mark_point(filled=True).encode(
    alt.X('fertility:Q'),
    alt.Y('life_expect:Q'),
    alt.Size('pop:Q', scale=alt.Scale(range=[0,1000]),
             legend=alt.Legend(orient='bottom', titleOrient='left')), 
    alt.Color('cluster:N', legend=None), 
    alt.OpacityValue(0.5),
    alt.Tooltip('country:N'),
    alt.Order('pop:Q', sort='descending'),
    alt.Column('cluster:N')
).properties(___=135, ___=135) # <--

## Graphical Marks

- Our exploration of encoding channels above exclusively uses `point` marks to visualize the data.

- However, the `point` mark type is only one of the many geometric shapes that can be used to visually represent data. Altair includes a number of built-in mark types, including:

- `mark_area()` - Filled areas defined by a top-line and a baseline.
- `mark_bar()` - Rectangular bars.
- `mark_circle()` - Scatter plot points as filled circles.
- `mark_line()` - Connected line segments.
- `mark_point()` - Scatter plot points with configurable shapes.
- `mark_rect()` - Filled rectangles, useful for heatmaps.
- `mark_rule()` - Vertical or horizontal lines spanning the axis.
- `mark_square()` - Scatter plot points as filled squares.
- `mark_text()` - Scatter plot points represented by text.
- `mark_tick()` - Vertical or horizontal tick marks.

### Point Marks

Below is a dot plot of `fertility`, with the `cluster` field redundantly encoded using both the `y` and `shape` channels. 

In [None]:
alt.Chart(data2000).___().encode( # <--
    alt.X('fertility:Q'),
    alt.Y('cluster:N'),
    alt.Shape('cluster:N')
)

In addition to encoding channels, marks can be stylized by providing values to the `mark_*()` methods.

For example, point marks are drawn with stroked outlines by default, but can be drawn as filled shapes instead by passing `filled=True`.

Similarly, a default `size` sets the total pixel area of the point mark (use 100).

In [None]:
alt.Chart(data2000).mark_point(___, ___).encode(  # <--
    alt.X('fertility:Q'),
    alt.Y('cluster:N'),
    alt.Shape('cluster:N')
)

### Circle Marks

- The `circle` mark type is a convenient shorthand for `point` marks drawn as filled circles.

In [None]:
alt.Chart(data2000).___(size=100).encode( # <--
    alt.X('fertility:Q'),
    alt.Y('cluster:N'),
    alt.Shape('cluster:N')
)

### Square Marks

The `square` mark type is a convenient shorthand for `point` marks drawn as filled squares.

In [None]:
alt.Chart(data2000).___(size=100).encode( # <--
    alt.X('fertility:Q'),
    alt.Y('cluster:N'),
    alt.Shape('cluster:N')
)

### Tick Marks

- The `tick` mark type conveys a data point using a short line segment or "tick". 

In [None]:
alt.Chart(data2000).___().encode( # <--
    alt.X('fertility:Q'),
    alt.Y('cluster:N'),
    alt.Shape('cluster:N')
)

### Bar Marks

In [None]:
alt.Chart(data2000).___().encode( # <--
    alt.X('country:N'),
    alt.Y('pop:Q')
)

- Bars can also be stacked. 


- Let's change the `x` encoding to use the `cluster` field, and encode `country` using the `color` channel. 


- We'll also disable the legend (which would be very long with colors for all countries!) and use tooltips for the country name.

In [None]:
alt.Chart(data2000).mark_bar().encode(
    alt.X('___'), # <--
    alt.Y('pop:Q'),
    alt.Color('___', legend=None), # <--
    alt.Tooltip('___')  # <--
)

- The chart below uses the `x` (starting point) and `x2` (ending point) channels to show the range of life expectancies within each regional cluster. 


- Below we use the `min` and `max` aggregation functions to determine the end points of the range.

In [None]:
alt.Chart(data2000).mark_bar().encode(
    alt.X('___(life_expect):Q'), # <--
    alt.X2('___(life_expect):Q'), # <--
    alt.Y('cluster:N') 
)

### Line Marks

- The `line` mark type connects plotted points with line segments, for example so that a line's slope conveys information about the rate of change.

Let's plot a line chart of fertility per country over the years, using the full, unfiltered global development data frame. We'll again hide the legend and use tooltips instead.

In [None]:
alt.Chart(data).___().encode( # <--
    alt.X('year:O'),
    alt.Y('fertility:Q'),
    alt.Color('country:N', legend=None),
    alt.Tooltip('country:N')
).properties(
    width=400
)

- Let's change some of the default mark parameters to customize the plot. 

- `strokeWidth` to determine the thickness of the lines (use 3)

- `opacity` to add some transparency (use 0.5). 

- Let's use `'monotone'` interpolation to provide smooth lines that are also guaranteed not to inadvertently generate "false" minimum or maximum values as a result of the interpolation.

In [None]:
alt.Chart(data).mark_line(
    strokeWidth=___, # <--
    opacity=___, # <--
    interpolate='___' # <--
).encode(
    alt.X('year:O'),
    alt.Y('fertility:Q'),
    alt.Color('country:N', legend=None),
    alt.Tooltip('country:N')
).properties(
    width=400
)

#### Slope graph

- The `line` mark can also be used to create *slope graphs*, charts that highlight the change in value between two comparison points using line slopes.

Below let's create a slope graph comparing the populations of each country at minimum and maximum years in our full dataset: 1955 and 2005. We first create a new Pandas data frame filtered to those years, then use Altair to create the slope graph.

In [None]:
dataTime = data.loc[(data['year'] == ___) | (data['year'] == ___)] # <--

alt.Chart(dataTime).___(opacity=0.5).encode( # <--
    alt.X('year:O'),
    alt.Y('pop:Q'),
    alt.Color('country:N', legend=None),
    alt.Tooltip('country:N')
).properties(
    width={"step": ___} # adjust the step parameter, e.g. 50
)

### Area Marks

- The `area` mark type combines aspects of `line` and `bar` marks: it visualizes connections (slopes) among data points, but also shows a filled region, with one edge defaulting to a zero-valued baseline.

The chart below is an area chart of fertility over time for just the United States:

In [None]:
dataUS = data.loc[data['country'] == '___'] # <--

alt.Chart(dataUS).___().encode( # <--
    alt.X('year:O'),
    alt.Y('fertility:Q')
)

- Similar to `line` marks, `area` marks support an `interpolate` parameter.

In [None]:
alt.Chart(dataUS).mark_area(___='monotone').encode( # <--
    alt.X('year:O'),
    alt.Y('fertility:Q')
)

Similar to `bar` marks, `area` marks also support stacking. Here we create a new data frame with data for the three North American countries, then plot them using an `area` mark and a `color` encoding channel to stack by country.

In [None]:
dataNA = data.loc[  
    (data['country'] == 'United States') |
    (data['country'] == 'Canada') |
    (data['country'] == 'Mexico')
]

alt.Chart(dataNA).___().encode( # <--
    alt.X('year:O'),
    alt.Y('pop:Q'),
    alt.Color('country:N') 
)

By default, stacking is performed relative to a zero baseline. However, other `stack` options are available:

- `center` - to stack relative to a baseline in the center of the chart, creating a *streamgraph* visualization, and
- `normalize` - to normalize the summed data at each stacking point to 100%, enabling percentage comparisons.

Below we adapt the chart by setting the `y` encoding `stack` attribute to `center`. What happens if you instead set it to `normalize`?

In [None]:
alt.Chart(dataNA).mark_area().encode(
    alt.X('year:O'),
    alt.Y('pop:Q', stack='___'), # <--
    alt.Color('country:N')
)

- To disable stacking altogether, set the `stack` attribute to `None`.


- We can also add `opacity` as a default mark parameter to ensure we see the overlapping areas!

In [None]:
alt.Chart(dataNA).mark_area(___=0.5).encode( # <--
    alt.X('year:O'),
    alt.Y('pop:Q', stack=___), # <--
    alt.Color('country:N')
)

- The `area` mark type also supports data-driven baselines, with both the upper and lower series determined by data fields. 


- As with `bar` marks, we can use the `x` and `x2` (or `y` and `y2`) channels to provide end points for the area mark.



- The chart below visualizes the range of minimum and maximum fertility, per year, for North American countries:

In [None]:
alt.Chart(dataNA).mark_area().encode(
    alt.X('year:O'),
    alt.Y('___(fertility):Q'), # <--
    alt.___('___(fertility):Q') # <--
).properties(
    width={"step": ___} #<-- e.g., use 40
)

All the `area` mark examples above use a vertically oriented area. However, Altair and Vega-Lite support horizontal areas as well. Let's transpose the chart above, simply by swapping the `x` and `y` channels.

In [None]:
alt.Chart(dataNA).mark_area().encode(
    alt.___('year:O'), # <--
    alt.___('min(fertility):Q'), # <--
    alt.___('max(fertility):Q') # <--
).properties(
    height={"step": 40}  # year is now on the y-axis, so the step applies to height
)