# First steps with Altair
## Hands-on practical work

### Altair declarative visualization process:

- Get the data
- Store it into a *DataFrame*
- Call the *alt.Chart()* function with the data as a parameter
- Define the marks to be used (with *mark_bar*, *mark_point*...)
- Design the visualization by setting the visual variables with *.encode()*
- Optionally, lay out several charts together
- Optionally, define the interactions

# Basic example: A simple bar chart

In [None]:
#@title First example: a custom bar chart
import altair as alt
import pandas as pd

# Data
data = pd.DataFrame(
    {'Country': ['Germany', 'France', 'Italy', 'Spain', 'Poland', 'Romania',
                 'Netherlands', 'Belgium', 'Czech Republic', 'Greece'],
     'Population (Millions)': [83.155, 67.657, 59.236, 47.399, 37.840,
                               19.202, 17.475, 11.555, 10.702, 10.679]})

# Creating the chart, with mark_bar to define a bar chart
alt.Chart(data).mark_bar().encode(
    alt.X('Country:N'),                 # Categories in X axis
    alt.Y('Population (Millions):Q'),   # Length of the bars in Y
).properties(title = 'Population')      # Title of the chart

We can now change the configuration so that each bar is drawn in a different color. To do so, we change the parameters passed to the *encode* method. In this case, we ask Altair to set the color of the bars according to a variable, the country name. This can be done in a very simple way:

In [None]:
#@title Bar charts with colors per categories
import altair as alt
import pandas as pd

# Data
data = pd.DataFrame(
    {'Country': ['Germany', 'France', 'Italy', 'Spain', 'Poland', 'Romania',
                 'Netherlands', 'Belgium', 'Czech Republic', 'Greece'],
     'Population (Millions)': [83.155, 67.657, 59.236, 47.399, 37.840,
                               19.202, 17.475, 11.555, 10.702, 10.679]})

alt.Chart(data).mark_bar().encode(
    alt.X('Country:N'),
    alt.Y('Population (Millions):Q'),
    alt.Color('Country')                # Color depends on the country value
).properties(title = 'Population')

Altair provides a number of mark types. Note that the method names do not always map directly to the actual marks. For example, a line chart is created with mark_line, but the marks that respond directly to the X and Y encodings are points; the lines are drawn only to join those points.


In [None]:
#@title Scatterplot

import altair as alt
import pandas as pd

# Data of the different iPhone models with release year, and release price
data = {'name': ['iPhone', 'iPhone 3G', 'iPhone 3GS', 'iPhone 4', 'iPhone 4S',
                 'iPhone 5', 'iPhone 5S', 'iPhone 6', 'iPhone 6S', 'iPhone 7',
                 'iPhone X', 'iPhone XS', 'iPhone 11 Pro', 'iPhone 12 Pro'],
        'year': ['2007', '2008', '2009', '2010', '2011', '2012', '2013', '2014',
                 '2015', '2016', '2017', '2018', '2019', '2020'],
        'price': [499, 599, 699, 599, 649, 649, 649, 649, 699,
                  649, 999, 999, 1099, 1099]}

# Create a DataFrame with this data
df = pd.DataFrame(data)

alt.Chart(df).mark_point().encode(          # This chart uses points as marks
    alt.X('year:O'),
    alt.Y('price:Q'),
).properties(title = 'Apple smartphones')

# Marks
We can change the configuration of the marks. You have the following basic options: mark_line, mark_area, mark_circle... but also other types such as mark_rule, mark_tick, mark_trail, mark_text...
Test with those types...


# Exercises

Try simple charts (same data as above) with the following marks:

- Line mark
- Area mark
- Rule mark
- Circle mark



In [None]:

import altair as alt
import pandas as pd

# Data of the different iPhone models with release year, and release price
data = {'name': ['iPhone', 'iPhone 3G', 'iPhone 3GS', 'iPhone 4', 'iPhone 4S',
                 'iPhone 5', 'iPhone 5S', 'iPhone 6', 'iPhone 6S', 'iPhone 7',
                 'iPhone X', 'iPhone XS', 'iPhone 11 Pro', 'iPhone 12 Pro'],
        'year': ['2007', '2008', '2009', '2010', '2011', '2012', '2013', '2014',
                 '2015', '2016', '2017', '2018', '2019', '2020'],
        'price': [499, 599, 699, 599, 649, 649, 649, 649, 699,
                  649, 999, 999, 1099, 1099]}

# Create a DataFrame with this data
df = pd.DataFrame(data)

alt.Chart(df).mark_line().encode(          # This chart uses a line mark
    alt.X('year:O'),
    alt.Y('price:Q'),
).properties(title = 'Apple smartphones')

In [None]:

import altair as alt
import pandas as pd

# Data of the different iPhone models with release year, and release price
data = {'name': ['iPhone', 'iPhone 3G', 'iPhone 3GS', 'iPhone 4', 'iPhone 4S',
                 'iPhone 5', 'iPhone 5S', 'iPhone 6', 'iPhone 6S', 'iPhone 7',
                 'iPhone X', 'iPhone XS', 'iPhone 11 Pro', 'iPhone 12 Pro'],
        'year': ['2007', '2008', '2009', '2010', '2011', '2012', '2013', '2014',
                 '2015', '2016', '2017', '2018', '2019', '2020'],
        'price': [499, 599, 699, 599, 649, 649, 649, 649, 699,
                  649, 999, 999, 1099, 1099]}

# Create a DataFrame with this data
df = pd.DataFrame(data)

alt.Chart(df).mark_rule().encode(          # This chart uses rules as marks
    alt.X('year:O'),
    alt.Y('price:Q'),
).properties(title = 'Apple smartphones')

Now let's try to illustrate the smartphones chart with some text identifying the mobiles... This can be addressed in several ways.


In [None]:
#@title Text mark
import altair as alt
import pandas as pd

# Data of the different iPhone models with release year, and release price
data = {'name': ['iPhone', 'iPhone 3G', 'iPhone 3GS', 'iPhone 4', 'iPhone 4S',
                 'iPhone 5', 'iPhone 5S', 'iPhone 6', 'iPhone 6S', 'iPhone 7',
                 'iPhone X', 'iPhone XS', 'iPhone 11 Pro', 'iPhone 12 Pro'],
        'year': ['2007', '2008', '2009', '2010', '2011', '2012', '2013', '2014',
                 '2015', '2016', '2017', '2018', '2019', '2020'],
        'price': [499, 599, 699, 599, 649, 649, 649, 649, 699,
                  649, 999, 999, 1099, 1099]}

df = pd.DataFrame(data)

alt.Chart(df).mark_text().encode(
    alt.X('year:O'),
    alt.Y('price:Q'),
    alt.Text('name')
).properties(title = 'Apple smartphones')

This might become cluttered.
Maybe we can add the text only on hover... This could be achieved with a tooltip.

In [None]:
#@title Scatterplot with tooltip

import altair as alt
import pandas as pd

# Data of the different iPhone models with release year, and release price
data = {'name': ['iPhone', 'iPhone 3G', 'iPhone 3GS', 'iPhone 4', 'iPhone 4S',
                 'iPhone 5', 'iPhone 5S', 'iPhone 6', 'iPhone 6S', 'iPhone 7',
                 'iPhone X', 'iPhone XS', 'iPhone 11 Pro', 'iPhone 12 Pro'],
        'year': ['2007', '2008', '2009', '2010', '2011', '2012', '2013', '2014',
                 '2015', '2016', '2017', '2018', '2019', '2020'],
        'price': [499, 599, 699, 599, 649, 649, 649, 649, 699,
                  649, 999, 999, 1099, 1099]}

# Create a DataFrame with this data
df = pd.DataFrame(data)

alt.Chart(df).mark_point(
    shape = 'square', filled = True).encode( # Use filled points with square shape
    alt.X('year:O'),
    alt.Y('price:Q'),
    tooltip = 'name'                        # The name will appear on hover
).properties(title = 'Apple smartphones')



---



# Using the latest version of Altair

The latest version might not be installed by default. You can install a specific version with the following code:


In [None]:
!pip install altair==5.0.1

# Loading data

In [None]:
#@title Loading the Vega datasets

import altair as alt
import pandas as pd
from vega_datasets import data


As we have seen, data can be passed to Altair as DataFrames. We can also get several sample datasets from the Vega datasets repository, through the *vega_datasets* package imported above.

You can check the available datasets here: https://github.com/vega/vega-datasets/tree/main/data



For example, to load the stocks dataset and plot it as a line chart, we could do the following:


In [None]:
import altair as alt
import pandas as pd
from vega_datasets import data

df = data.stocks.url

alt.Chart(df).mark_line(
    ).encode(
    x='date:T',
    y='price:Q',
    color='symbol:N'
).transform_calculate(
    year='year(datum.date)'
    )

# transform_calculate adds a derived field (here, the year of each date), which can then be used for other things, such as filtering

Note that the Google stock price only appears from the mid-2000s onwards. We could filter out the data of the earlier years with a simple filtering operation:

# Simple filtering

We can filter the data using the *transform_filter()* method, which takes an expression built on *alt.datum* that defines the rows we want to keep.

In [None]:
#@title Example with transform_filter
import altair as alt
import pandas as pd
from vega_datasets import data

df = data.stocks.url

alt.Chart(df).mark_line(
    ).encode(
    x='date:T',
    y='price:Q',
    color='symbol:N'
).transform_calculate(
    year='year(datum.date)'
    ).transform_filter(       # Filtering per year
        alt.datum.year > 2005)



---



## Data input

Data in a DataFrame typically comes in one of two forms: wide or long.

### Wide form

In [None]:
wide_form = pd.DataFrame({'Date': ['2007-10-01', '2007-11-01', '2007-12-01'],
                          'AAPL': [189.95, 182.22, 198.08],
                          'AMZN': [89.15, 90.56, 92.64]})
wide_form

### Long form (the form preferred by Altair)

In [None]:
long_form = pd.DataFrame({'Date': ['2007-10-01', '2007-11-01', '2007-12-01',
                                   '2007-10-01', '2007-11-01', '2007-12-01'],
                          'company': ['AAPL', 'AAPL', 'AAPL',
                                      'AMZN', 'AMZN', 'AMZN'],
                          'price': [189.95, 182.22, 198.08,
                                    89.15, 90.56, 92.64]})
long_form



---



# Data types
Altair supports five data types:


*   **Quantitative:** a continuous real-valued quantity
*   **Ordinal**: a discrete ordered quantity
*   **Nominal:** a discrete unordered category
*   **Temporal:** a time or date value
*   **geojson:** a geographic shape


that can be specified explicitly (verbosely or shorthand) in the encode method of the chart.

In [None]:
#@title Explicit encoding in the functions
from vega_datasets import data

cars = data.cars.url

alt.Chart(cars).mark_point().encode(
    alt.X('Acceleration', type='quantitative'),
    alt.Y('Miles_per_Gallon', type='quantitative'),
    alt.Color('Origin', type='nominal')
)

This is equivalent to:



In [None]:
#@title Scatterplot
from vega_datasets import data

cars = data.cars.url

alt.Chart(cars).mark_point().encode(
    alt.X('Acceleration:Q'),
    alt.Y('Miles_per_Gallon:Q'),
    alt.Color('Origin:N')
)

And it is also equivalent to this notation:

In [None]:
#@title Scatterplot
from vega_datasets import data

cars = data.cars.url

alt.Chart(cars).mark_point().encode(
    x = 'Acceleration:Q',
    y = 'Miles_per_Gallon:Q',
    color = 'Origin:N'
)



---



# Channels configuration
## Visual properties
#### Color:

* **color**: default color of the mark
* **fill**: color that fills the mark (has higher precedence than color)
* **fillOpacity**: float indicating the opacity [0..1]
* **filled**: boolean indicating whether the mark is filled
* **opacity**: float indicating the overall opacity [0..1]
* **strokeOpacity**: float indicating the stroke opacity [0..1]

#### Shape and position:
* **height**: height of the marks
* **shape**: for point marks, the shape can be one of:
  * circle, square, cross, diamond, triangle-up, triangle-down, triangle-right, or triangle-left
  * other built-in shapes: arrow, wedge, triangle
  * a custom SVG path string (defined in a rectangle between -1 and 1)
* **size**: the size of the shape. For point, circle and square, it will be the pixel area of the marks.
* **x**: X coordinates of the marks, or width of horizontal bars (and area marks).
* **y**: Y coordinates of the marks, or height of vertical bars (and area marks).
* **x2**: X2 coordinates for ranged shapes (area, bar, rect, and rule)
* **y2**: Y2 coordinates for ranged shapes (area, bar, rect, and rule)
* **width**: width of the marks.

#### Other properties related to strokes:
* **stroke**: Default color for the stroke. It has higher precedence than default color (defined using config.color)
* **strokeDash**: An array of alternating stroke and space lengths, for creating dashed or dotted lines, that may depend on the encoding.
* **strokeWidth**: The width of the stroke, in pixels.
* **thickness**: thickness of the tick mark.
* **tooltip**: Tooltip text to show upon mouse hover over the object.

# Exercises

### Modify the last chart so that:


- Marks are triangles
- Marks are all crimson (color)
- Marks are all squares
- Marks are diamonds
- Change the opacity of the marks to 0.2
- Change the size of the marks to 50
- Change the opacity of the marks, and make it dependent on the y variable
- Change the opacity of the marks based on another variable


In [None]:
import altair as alt

from vega_datasets import data

cars = data.cars.url

alt.Chart(cars).mark_point(
    shape = 'triangle', filled = True, color = 'crimson',
    size = 50, opacity = 0.2
).encode(
    x = 'Acceleration:Q',
    y = 'Miles_per_Gallon:Q',
    opacity = 'Miles_per_Gallon:Q'
)

In [None]:
#@title Example
from vega_datasets import data
df = data.us_employment.url
alt.Chart(df).mark_line(
    strokeDash=[10,50],
    stroke = 'black',
    strokeWidth = 5,
    strokeOpacity = 0.5,
    size = 10,
).encode(
    x='month:T',
    y='government:Q'
)

# Exercises

*   Change the color of the line
*   Change the pattern



In [None]:
cars = data.cars.url

alt.Chart(cars).mark_point(shape = 'triangle',color='maroon').encode(
    x = 'Acceleration:Q',
    y = 'Miles_per_Gallon:Q',
    opacity = 'Miles_per_Gallon:Q'
)

In [None]:
alt.Chart(cars).mark_point(shape='square', fillOpacity=0.5).encode(
    x = 'Acceleration:Q',
    y = 'Miles_per_Gallon:Q',
    opacity = 'Miles_per_Gallon:Q'
)

In [None]:
alt.Chart(cars).mark_point(shape='arrow', size=200, stroke="red", fill="blue").encode(
    x = 'Acceleration:Q',
    y = 'Miles_per_Gallon:Q',
    opacity = 'Miles_per_Gallon:Q'
)

In [None]:
alt.Chart(cars).mark_point(shape='diamond', size=50).encode(
    x = 'Acceleration:Q',
    y = 'Miles_per_Gallon:Q',
    color = 'Miles_per_Gallon:Q',
    opacity = 'Horsepower:Q'
)



---



# Visualization techniques

# Bar charts

Data: One key one value

Marks: Lines

Tasks: Compare/lookup (really easy)

May scale to hundreds of elements

Guidelines:
- **Always** start at zero
- Make labels easy to read (horizontal if possible)
- Order based on data or labels
- By default, prefer neutral colors
- Gridlines if precision is required
- If data is ordered (e.g., in time), line charts are typically better
- Don't use hundreds of bars


In [None]:
#@title Horizontal bar chart
import altair as alt
import pandas as pd

data = pd.DataFrame({'Country': ['Germany', 'France', 'Italy', 'Spain', 'Poland', 'Romania', 'Netherlands', 'Belgium', 'Czech Republic', 'Greece'],
        'Population (Millions)': [83.155, 67.657, 59.236, 47.399, 37.840, 19.202, 17.475, 11.555, 10.702, 10.679]})



alt.Chart(data).mark_bar().encode(
    alt.X('Population (Millions):Q'),
    alt.Y('Country:N'),
).properties(title = 'Population')

# Grouped/Paired bar charts

Data: Two keys (categories), one value (quantitative)

Marks: Lines

Tasks: Compare/lookup (easier within group than between groups)

Scales less well than simple bar charts

Trends may be difficult to perceive, even with small groups

Guidelines:
- **Always** start at zero
- Same as bar charts
- Don't use them if one category is time

In [None]:
#@title Grouped bar chart

import altair as alt
from vega_datasets import data

source = data.barley() # Dimensions: year (2 values), site, yield, variety

alt.Chart(source).mark_bar().encode(
    x='year:O',
    y='sum(yield):Q',
    color='year:N',
    column='site:N' # To create the groups
)

# Stacked bar charts

Data: Two keys, one value

Tasks: Compare/lookup (not so easy)

Scales less well than bar charts

Main benefit is that we can see the total quantity

May become difficult to read

Guidelines:
- **Always** start at zero
- Same as bar charts
- Difficult to compare between groups
- Difficult to compare within groups
- Do not use when the total quantity does not make sense
- Use with few categories

In [None]:
#@title Stacked bar chart

import altair as alt
import pandas as pd
from vega_datasets import data

df = data.cars.url

alt.Chart(df).mark_bar().encode(
    x=alt.X('Miles_per_Gallon:Q', bin=True),
    y='count(Miles_per_Gallon):Q',
    color = 'Origin:N'
)

# Dot plot

AKA Cleveland dot plot

Mark: Point

Visual variable: X position

Guidelines:

- Do not need to start at zero
- Must be ordered by quantity
- Suitable when small differences must be shown
- If values are relevant, label axes suitably

In [None]:
#@title Dot plot

import altair as alt
import pandas as pd

data = {'Country': ['Hong Kong', 'Japan', 'Macao', 'Switzerland', 'Singapore', 'Italy', 'US', 'Vietnam'],
        'Life Expectancy': [85.29, 85.03, 84.68, 84.25, 84.07, 84.01, 79.11, 75.77]}

df = pd.DataFrame(data)

alt.Chart(df).mark_circle(opacity = 1).encode(
    alt.X('Life Expectancy:Q', scale=alt.Scale(domain=(70,90))),
    alt.Y('Country:N', sort='-x'),
).properties(title = 'Life expectancy')

# Scatterplots

Data: Two (quantitative) values, no keys

Mark: Point

Visual channel: Position in 2D plane

Task: Find correlations

Guidelines:
- Avoid clutter

In [None]:
#@title Example of a simple scatterplot

import altair as alt
import pandas as pd

data = {'name': ['iPhone', 'iPhone 3G', 'iPhone 3GS', 'iPhone 4', 'iPhone 4S', 'iPhone 5', 'iPhone 5S', 'iPhone 6', 'iPhone 6S', 'iPhone 7', 'iPhone X', 'iPhone XS', 'iPhone 11 Pro', 'iPhone 12 Pro'],
        'year': ['2007', '2008', '2009', '2010', '2011', '2012', '2013', '2014', '2015', '2016', '2017', '2018', '2019', '2020'],
        'price': [499, 599, 699, 599, 649, 649, 649, 649, 699, 649, 999, 999, 1099, 1099]}

df = pd.DataFrame(data)

alt.Chart(df).mark_circle().encode(
    alt.X('year:O'),
    alt.Y('price:Q'),
    alt.Tooltip('name')
).properties(title = 'Apple smartphones')

# Bubble charts
We can configure the size of the circles by encoding the units sold. This would yield a bubble chart.
Bubble charts can increase the number of variables we encode. The following are the most common visual variables used:
- Size
- Color

But we can use others, such as angle, shape, etc.

In [None]:
#@title Example of a bubble chart

import altair as alt
import pandas as pd

data = {'name': ['iPhone', 'iPhone 3G', 'iPhone 3GS', 'iPhone 4', 'iPhone 4S', 'iPhone 5', 'iPhone 5S', 'iPhone 6', 'iPhone 6S', 'iPhone 7', 'iPhone X', 'iPhone XS', 'iPhone 11 Pro', 'iPhone 12 Pro'],
        'year': ['2007', '2008', '2009', '2010', '2011', '2012', '2013', '2014', '2015', '2016', '2017', '2018', '2019', '2020'],
        'price': [499, 599, 699, 599, 649, 649, 649, 649, 699, 649, 999, 999, 1099, 1099],
        'units_sold': [1.39, 11.63, 20.73, 73.17, 37.04, 27.40, 51.03, 74.47, 78.29, 77.32, 52.22, 45.09, 18.68, 119]}

df = pd.DataFrame(data)

alt.Chart(df).mark_circle(opacity = 1).encode(
    alt.X('year:O'),
    alt.Y('price:Q'),
    alt.Size('units_sold:Q'),
    alt.Tooltip('name')
).properties(title = 'Apple smartphones')

# Line charts

Data: One key (quantitative or ordered), one value

Mark: Points (lines connect the points, but the marks are points)

Channels: Position in 2D, lines to connect

Task: Communicating trends



In [None]:
#@title Example of a line chart

import altair as alt
import pandas as pd

data = {'name': ['iPhone', 'iPhone 3G', 'iPhone 3GS', 'iPhone 4', 'iPhone 4S', 'iPhone 5', 'iPhone 5S', 'iPhone 6', 'iPhone 6S', 'iPhone 7', 'iPhone X', 'iPhone XS', 'iPhone 11 Pro', 'iPhone 12 Pro'],
        'year': ['2007', '2008', '2009', '2010', '2011', '2012', '2013', '2014', '2015', '2016', '2017', '2018', '2019', '2020'],
        'price': [499, 599, 699, 599, 649, 649, 649, 649, 699, 649, 999, 999, 1099, 1099]}

df = pd.DataFrame(data)

alt.Chart(df).mark_line().encode(
    alt.X('year:O'),
    alt.Y('price:Q'),
    alt.Tooltip('name')
).properties(title = 'Price of Apple smartphones')

In [None]:
#@title Another example, with more lines

import altair as alt
import pandas as pd
from vega_datasets import data

df = data.iowa_electricity()


alt.Chart(df).mark_line().encode(
    alt.X('year:T'),
    alt.Y('net_generation:Q'),
    color = 'source:N'
)


In [None]:
#@title Line chart after reshaping the data with melt

import altair as alt
import pandas as pd
from vega_datasets import data

wide_form = pd.DataFrame(data.jobs())

long_form = wide_form.melt(id_vars=['year','job','sex'] , value_vars='count')


alt.data_transformers.disable_max_rows()

alt.Chart(long_form).mark_line().encode(
    x = 'year:O',
    y = 'sum(value):Q',
    color = 'job'
)