In [1]:
import numpy as np
import pandas as pd
import altair as alt

# This line enables altair to run in your notebook with a live internet connection. 
alt.renderers.enable('default')

# Optionally, for offline rendering in Jupyter Notebook, you can use the notebook renderer:
# alt.renderers.enable('notebook')

RendererRegistry.enable('default')

# Altair: Statistical Visualization for Python

[Altair](http://github.com/altair-viz/altair/) provides a declarative Python API for statistical visualization, built on top of [Vega-Lite](http://vega.github.io/vega-lite/). For more complete documentation, see [Altair's Documentation](http://altair-viz.github.io/).

### About Altair
The key feature of this declarative approach is that the user is free to think about the data, rather than the mechanics of the visualization. Vega-Lite specifications are expressed in JavaScript Object Notation (JSON), a cross-platform format often used for storage of nested and/or hierarchical data. Altair builds a Python layer on top of this, so that rather than writing raw JSON strings the user can write declarative Python code.
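
To make the JSON layer concrete, the sketch below writes a small Vega-Lite-style specification by hand as a Python dictionary and serializes it with the standard `json` module. The column names and the CSV path are borrowed from the cars example used in this notebook. The keys Altair itself emits (for instance an additional `$schema` entry) can differ by version, so treat this as an illustration rather than Altair's literal output.

```python
import json

# A hand-written, Vega-Lite-style specification for a scatter plot.
# Assumption: 'data/cars_dataset.csv' is the CSV file used in this notebook.
spec = {
    "data": {"url": "data/cars_dataset.csv"},
    "mark": "circle",
    "encoding": {
        "x": {"field": "Horsepower", "type": "quantitative"},
        "y": {"field": "Miles_per_Gallon", "type": "quantitative"},
    },
}

# This is the kind of JSON string Altair saves you from writing by hand.
spec_json = json.dumps(spec, indent=2)
print(spec_json)
```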

This notebook highlights the various building blocks necessary to construct a visualization with Altair, as well as some cool usages. 

### Quick example
Let's start with a quick example. For this, we'll read in the cars dataset with Pandas and display a simple chart. 

In [2]:
cars = pd.read_csv('data/cars_dataset.csv')
cars.head()

Unnamed: 0,Name,Miles_per_Gallon,Cylinders,Displacement,Horsepower,Weight_in_lbs,Acceleration,Year,Origin
0,chevrolet chevelle malibu,18.0,8,307.0,130.0,3504,12.0,1970-01-01,USA
1,buick skylark 320,15.0,8,350.0,165.0,3693,11.5,1970-01-01,USA
2,plymouth satellite,18.0,8,318.0,150.0,3436,11.0,1970-01-01,USA
3,amc rebel sst,16.0,8,304.0,150.0,3433,12.0,1970-01-01,USA
4,ford torino,17.0,8,302.0,140.0,3449,10.5,1970-01-01,USA


In [3]:
alt.Chart(cars).mark_circle().encode(
    x = 'Horsepower',
    y = 'Miles_per_Gallon'
)

Unlike matplotlib, Altair expects the data passed to it to be in a specific format and *named*. This can be in the form of a URL string pointing to a JSON or CSV formatted text file, an `altair.Data` object, or - as is most common - a Pandas DataFrame. This is why we imported cars as a Pandas dataframe. 

NumPy arrays, which we could use with matplotlib, are not sufficient for use with Altair, as they are not a form of named data with labeled columns.


# Building blocks

Creating a new Altair plot consists of chaining various calls. In our example, we've already seen that different functions are chained to create the output chart. In this notebook, we will discuss Altair's key building blocks:

1. [The Chart object](#object)
2. [Marks](#marks)
3. [Data encodings](#encodings)
4. [Chart properties](#prop)
5. [Interactive](#interactive)

<a id='object'></a>
## 1. The Chart object

The fundamental object in Altair is the ``Chart``. It takes a single argument: the dataframe to be visualized. 

In [4]:
chart = alt.Chart(cars)

Fundamentally, a ``Chart`` is an object which knows how to emit a JSON dictionary representing the data and visualization encodings (see below), which can be sent to the notebook and rendered by the Vega-Lite JavaScript library.

Here is what that JSON looks like for the current chart (since the chart is not yet complete, we turn off chart validation):

In [None]:
chart.to_dict(validate=False)

At this point the specification contains only the data and the default configuration, but no visualization specification.

<a id='marks'></a>
## 2. Marks

Next, decide what kind of mark should represent the data. Each mark has its own `mark_*` method, whose arguments vary with the type of mark. The mark determines how each data point is drawn on the plot.

In the previous example, `mark_circle` was used to represent each data point as a circle. Other options include (but are not limited to): 
* `mark_area()`: a filled area plot
* `mark_bar()`: a bar plot
* `mark_line()`: a line plot
* `mark_circle()`: a scatter plot with filled circles
* `mark_point()`: a scatter plot with configurable point shapes 
* `mark_square()`: a scatter plot with filled squares


More types of marks can be found in the [Documentation](https://altair-viz.github.io/user_guide/marks.html). 

In [5]:
alt.Chart(cars).mark_square(
    color='red',
    opacity=.2
).encode(
    x='Horsepower',
    y='Miles_per_Gallon',
)

The mark property specifies exactly how the data should be represented on the plot, which is **independent** of the data.

### <mark>Exercise 1: Your first Altair plot</mark>

Create a scatterplot of `Weight_in_lbs` vs `Acceleration`. 

Let each data point be represented by an empty green circle with 0.4 opacity and a size of 50.

In [None]:
# %load answers/altair_ex_1.py

<a id='encodings'></a>
## 3. Data encodings
The next step is to add *visual encodings* (or *encodings* for short) to the chart. A visual encoding specifies how a given data column should be mapped onto the visual properties of the visualization.
Some of the more frequently used visual encodings are listed here:

* X: x-axis value
* Y: y-axis value
* Color: color of the mark
* Opacity: transparency/opacity of the mark
* Shape: shape of the mark
* Size: size of the mark

For a complete list of these encodings, see the [Encodings](https://altair-viz.github.io/user_guide/encoding.html) section of the documentation.

Visual encodings can be created with the `encode()` method of the `Chart` object.

In [None]:
alt.Chart(cars).mark_point().encode(
    x = 'Horsepower',
    y = 'Miles_per_Gallon'
)

This chart plots - as expected - Horsepower vs. Miles per Gallon. However, all points are of the same color. What if we want to distinguish them based on some other property? This can be easily changed: 

In [6]:
alt.Chart(cars).mark_point().encode(
    x='Horsepower',
    y='Miles_per_Gallon',
    color='Cylinders'
)

Something strange is happening here: the data in the `Cylinders` column is interpreted as a continuous scale between 3 (min. number of cylinders) and 8 (max. number of cylinders). That's not what we want!

The data type of a column can be set explicitly using a one-letter code attached to the column name with a colon:

|Data Type|Code|Description|
|---|---|---|
|quantitative|Q|Numerical quantity (real-valued)|
|nominal|N|Name / Unordered categorical|
|ordinal|O|Ordered categorical|
|temporal|T|Date/time|

In [9]:
alt.Chart(cars).mark_point().encode(
    x='Horsepower',
    y='Miles_per_Gallon',
    color='Cylinders:N',  # also try O and Q
)

Another neat addition based on the data is the **tooltip**, which shows you information from the dataframe upon hovering over the data point. 

In [12]:
alt.Chart(cars).mark_point().encode(
    x='Horsepower',
    y='Miles_per_Gallon',
    color='Cylinders:N',
    tooltip=['Origin', 'Year', 'Name']
)

***Important to note!***

`encode()` handles all plotting information based on the ***data***. E.g. when the colours in the plot are dependent on data, they are specified in `encode()`. If they are not (maybe you are just specifying one block colour) they are handled by `mark_*`.

This, for example, would not work: 

In [None]:
alt.Chart(cars).mark_point().encode(
    x='Horsepower',
    y='Miles_per_Gallon',
    color='red'
)

Because `'red'` is not one of the column names in the dataframe that was passed. 

Let's rewrite this so that it does what we want:

### <mark>Exercise 2: Encode your plot</mark>

Create a new scatter plot with acceleration vs. the weight. The color of the mark depends on the number of cylinders. Each mark should display the name of the car when you hover over it, and the shape of the mark should be determined by the country of origin. The size of the mark should be 100.

*Bonus*: Adjust the color of the marks so that they are not shades of blue, but distinct colors.

In [None]:
# %load answers/altair_ex_2.py

<a id='prop'></a>
## 4. Chart properties

Whereas marks and data encodings are concerned with the visual output of plot contents, `properties` allow you to alter or specify generic properties of the chart itself, such as the height and width:

In [13]:
alt.Chart(cars).mark_point().encode(
    x='Horsepower',
    y='Miles_per_Gallon',
    color='Origin'
).properties(
    width=400,
    height=200
)

It also allows you to define a *selection interval*. Depending on the channels you pass to `encodings` (x, y or both), you can select a specific section of the chart by dragging a box (or, for a single channel, a band along that axis).

In [14]:
interval = alt.selection_interval(encodings=['x', 'y'])

alt.Chart(cars).mark_point().encode(
    x='Horsepower',
    y='Miles_per_Gallon',
    color='Origin'
).properties(
    selection=interval
)

But that's not all! The neat thing about this is that it allows you to make properties of the chart conditional on this selection. For example: 

In [15]:
interval = alt.selection_interval(encodings=['x', 'y'])

alt.Chart(cars).mark_point().encode(
    x='Horsepower',
    y='Miles_per_Gallon',
    color=alt.condition(interval, 'Origin', alt.value('lightgray'))
).properties(
    selection=interval
)

In this example, all data points *within* the selection are shown in their true color. The data points outside the selection are displayed in gray. Note that we wrap the color lightgray in `alt.value` to prevent altair from interpreting this as a column name in the dataframe. 

Pretty neat, we'll see later on why this is useful.

<a id='interactive'></a>
## 5. Interactive
A chart in Altair can be made interactive by appending `.interactive()`. This lets you pan and zoom the chart, with the axes rescaling along with the data.

In [16]:
alt.Chart(cars).mark_point().encode(
    x='Horsepower',
    y='Miles_per_Gallon',
    color='Cylinders:O'
).properties(
    width=400,
    height=200
).interactive()

### <mark>Exercise 3: More plots</mark>

1. Create a scatter plot with 100 data points. All data points are x * 2. 
    * a. Let the color be dependent on x.
    * b. Let the color be dependent on x % 4. 
    * c. Alternate between the colors red and blue for each element (bonus) (hint: see FieldOneOfPredicate) 


2. Plot weight against miles per gallon for the cars dataset.
    * The chart should be 300 by 400
    * The name and year of the car are displayed as you hover over the elements. 
    * The color of the marks is gray if they fall outside of the selection -- if they fall within, it is based on the origin of the car. 
    * Make the selection based only on the weight of the car

In [None]:
# %load answers/altair_ex_3.py

# Operators & Combining Plots
Operators are a way to combine multiple plots. The most important ones are: 
* `|`: horizontally stacks plots. 
* `&`: vertically stacks plots. 
* `+`: layers two plots on top of each other. 

Let's first define a base chart, where `x` is not yet specified. 

In [17]:
base_chart = alt.Chart(cars).mark_point().encode(
    y='Miles_per_Gallon',
    color='Cylinders:O',
).properties(
    width=200,
    height=200
).interactive()

Define `x` as Horsepower and Acceleration respectively, and stack the two charts horizontally: 

In [18]:
base_chart.encode(x='Horsepower') | base_chart.encode(x='Acceleration')

Now let's stack the same charts vertically: 

In [19]:
base_chart.encode(x='Horsepower') & base_chart.encode(x='Acceleration')

Stacking interactive charts together yields some interesting results. 

In [None]:
base_chart = alt.Chart(cars).mark_point().encode(
    y='Miles_per_Gallon',
    color='Origin'
).interactive()

base_chart.encode(x = 'Horsepower') | base_chart.encode(x = 'Acceleration')

As we zoom in or out on the Miles_per_Gallon on one chart, it has a similar effect on the other chart. This has an interesting effect when we combine it with the interval selection we've seen previously. 

Every time the selection is moved around, the renderer is signalled that the selection has changed and what the points currently in the selection are. These points are then highlighted on *both* charts. 

In [20]:
interval = alt.selection_interval(encodings=['x', 'y'])

chart = alt.Chart(cars).mark_point().encode(
    x='Horsepower',
    y='Miles_per_Gallon',
    color=alt.condition(interval, 'Origin', alt.value('lightgray'))
).properties(selection=interval)

chart | chart.encode(x='Acceleration')

### <mark>Exercise 4: Combining skills</mark>

**Question 1:** 
Create two charts next to each other plotting horsepower vs. miles_per_gallon: one containing only USA data points, the other containing all data points. Add a selection interval and make the selected data points the color of their origin.

Say you create a selection box on the chart with USA data only. Then take a look at the highlighted data points in the chart with all data points. Does this behave as expected? Do the highlighted points all belong to USA-manufactured cars, or are all points that fall within that x-y range box included? Why do you think that is? 

**Question 2:**
Create a 2x2 grid with charts plotting horsepower vs. miles_per_gallon. First chart: all data, second chart: USA data only, third chart: Japanese cars only, fourth chart: European cars only. 
    

**Question 3:**
Create 4 interactive charts in a 2x2 grid where the color is based on the origin and the y-axis is miles per gallon. The tooltip provides information on the name and year the car was made. The x-axis for the 4 charts are, respectively: weight, horsepower, acceleration and displacement. 

Is there a difference in how you construct the four charts between question 2 and 3? 

In [None]:
# %load answers/altair_ex_4.py

# Saving your chart

Say we're happy with our chart, and we want to save it. There are a few options. 
* HTML
* JSON

Saving your chart is as easy as calling `.save` after your chart with the appropriate extension. 

There are other formats you can save in, some of which require the altair_saver package. For now let's just look at html and json.

In [None]:
chart

In [None]:
chart.save('chart.html')

In [None]:
print(chart.to_html())

In [None]:
print(chart.to_json()) 

# Limitation

As great as `altair` is for many purposes, there is a key limitation to working with it, and that is related to the size of the dataset.

In [21]:
data = pd.DataFrame({'x': range(1000)})
alt.Chart(data).mark_line().encode(x='x', y='x')

In [22]:
data = pd.DataFrame({'x': range(10000)})
alt.Chart(data).mark_line().encode(x='x', y='x')

MaxRowsError: The number of rows in your dataset is greater than the maximum allowed (5000). For information on how to plot larger datasets in Altair, see the documentation

alt.Chart(...)

When the number of rows in your dataset exceeds 5000, Altair raises a `MaxRowsError`. This is not because Altair cannot handle larger datasets, but because it is important for the user to think carefully about how large datasets are handled. Since the data is embedded in each chart's specification, Altair typically leads to relatively large notebooks, even with smaller datasets. It is quite easy to end up with very large notebooks if you make many visualisations of a large dataset, and this error is a way of preventing that.

There are a few ways to circumvent this, including disabling the error through `alt.data_transformers.disable_max_rows()`. You are, however, strongly discouraged from doing this as the performance will most likely suffer. 

# Summary
Altair is a great, intuitive, declarative alternative to `matplotlib` and the built-in plotting of `pandas`, with many advantages:

* Separation of concerns: <br><br>
    - Pass the data/dataframe with `alt.Chart()`
    - Determine the mark with `.mark_*()`
    - All things concerning your data are handled through `.encode()`
    - Use `.properties()` to set the non-data non-mark properties of your plot
    - Make your plot interactive with `.interactive()` <br><br>
* Easily combine plots with operators `|`, `&` and `+`
* Provides interactivity with tooltip and selection interval
* Save your plots to HTML or JSON for easy access for the front-end

*Limitations*: by default, Altair only accepts dataframes with at most 5000 rows. There are ways around this, but they are strongly discouraged, as they can easily lead to very large notebooks.
