# Introduction to Data Visualisation with Vega-Altair

Welcome to this three-hour course on data visualisation using Vega-Altair.

In this course, you'll learn how to use Vega-Altair, a powerful Python library, to create a variety of interactive and static data visualisations.

## What is Vega-Altair?

Vega-Altair is a declarative statistical visualisation library for Python. It is built on the Vega and Vega-Lite visualisation grammars, which let you describe a visualisation in a structured, high-level way.

### Why use Vega-Altair?

- **Declarative:** Instead of specifying *how* to draw a plot, you specify *what* you want to visualise, and Vega-Altair takes care of the details.
- **Interactive:** Vega-Altair allows for interactive plots that you can pan, zoom, and hover over to inspect the data.
- **Versatile:** You can create a wide variety of chart types, from basic scatter plots and line charts to more complex visualisations.
- **Integration with Pandas:** Vega-Altair works seamlessly with Pandas DataFrames, making it easy to visualise your data.

## Prerequisites
Before starting this course, you should have completed an introductory Python and Jupyter Notebook course, as well as a Pandas course. You should be familiar with:

* Basic Python syntax, such as variables, data types, and control flow.
* Using Jupyter Notebooks or Jupyter Lab.
* Pandas DataFrames, and data loading, manipulation and descriptive statistics.

Let's start!

## Installation and Setup in Jupyter Lab

Before we can use Vega-Altair, we need to install it. This guide is intended for people running a local installation of Python and using Jupyter Notebook or Jupyter Lab.

### Installing (or upgrading) Vega-Altair from PyPI
Vega-Altair is published on PyPI under the package name `altair` and can be installed with `pip`, Python’s package installer. The first command below installs the packages and the second upgrades an existing installation.

```text
! pip install altair vega
! pip install --upgrade altair vega
```

### ModuleNotFoundError

When using Vega-Altair, you may encounter a `ModuleNotFoundError`. Python raises this error when it cannot find a module that the code tries to import, for example because Vega-Altair itself, or one of the libraries it relies on, is not installed in the environment your notebook is running in. The libraries that a package needs in order to function are called its dependencies.

When you install Vega-Altair using pip, Python’s package installer, it automatically installs any required dependencies. However, optional dependencies are not installed by default and must be installed separately if needed.
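If the error appears even though you have installed the package, it can help to check which Python interpreter the notebook is using and whether that interpreter can find the module. The following is a minimal sketch using only the standard library; the helper name `is_installed` is made up for this illustration.

```python
import importlib.util
import sys


def is_installed(module_name):
    """Return True if the current interpreter can find the module."""
    return importlib.util.find_spec(module_name) is not None


# The interpreter used by this notebook or script
print(sys.executable)

# Report whether each package is visible to that interpreter
for name in ('altair', 'pandas'):
    print(name, 'found' if is_installed(name) else 'missing')
```

If a package is reported as missing, it was probably installed into a different Python environment than the one the notebook kernel runs in.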

### Enabling the Jupyter Lab extension
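To view the plots, you may need to enable the JupyterLab Vega extension, which makes Altair plots visible. Recent versions of JupyterLab generally render Altair charts without it, so this step is mainly relevant for older installations. You can enable the extension by running the following code in a code cell: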

```text
! jupyter labextension install @jupyter-vega/jupyterlab-vega
```

### Importing Vega-Altair
Even though we have now installed Vega-Altair, we still need to import it when we want to use it in our code. This is conventionally done at the top of the script (together with any other imports) to make it easier for future readers (including ourselves!) to see which packages are used.
Aliasing altair as `alt` is a widely adopted convention that simplifies the syntax for accessing its functionalities.

```python
import altair as alt
```

## Basic Syntax
Vega-Altair uses a consistent grammar to create visualisations. Here are the key components:

1. **`alt.Chart(data)`:** This creates a chart object that takes a Pandas DataFrame as input.
2. **`mark_*()`:** This specifies the type of mark to use for the chart (e.g. `mark_point()` for scatter plots, `mark_line()` for line plots, `mark_bar()` for bar charts).
3. **`.encode(x='column1', y='column2', ...)`:** This maps columns in the DataFrame to visual properties of the marks.

### Encodings
Encodings are a fundamental concept in Vega-Altair. They specify how the data columns are mapped to visual properties. Some common encodings include:

* **x:** The x-position of the mark on the plot.
* **y:** The y-position of the mark on the plot.
* **color:** The color of the mark.
* **size:** The size of the mark.
* **tooltip:** The content of the tooltip that appears when you hover over the mark.

Let's create a very simple plot with the Titanic data!

```python
import altair as alt
import pandas as pd

# Load the Titanic dataset from a CSV file into a DataFrame named 'data'.
data = pd.read_csv('titanic.csv')

# Preview the first rows to check the column names
# (only displayed when it is the last expression in a cell)
data.head()

# Create the plot
chart = alt.Chart(data).mark_point().encode(x='Age', y='Fare')

# Display the chart
chart
```

### Exercise
Change the code above so that it displays a line chart instead of a scatter plot.

````{admonition} Solution
:class: dropdown

```python
import altair as alt
import pandas as pd

# Assuming 'data' has been loaded as in the example above
chart = alt.Chart(data).mark_line().encode(
    x='Age',
    y='Fare'
)

chart
```
````

## Creating Basic Charts

Let's move on to create some more basic charts: scatter plots, line charts, bar charts and histograms. We will continue with the Titanic dataset.

### Scatter plots
A scatter plot is a graph in which the values of two variables are plotted along two axes. Scatter plots can be used to observe relationships between variables.

```python
alt.Chart(data).mark_point().encode(x='Age', y='Fare')
```

### Line charts
A line chart is a graph that displays information as a series of data points called 'markers' connected by straight line segments. Line charts are used to visualise trends over time or across other continuous variables.

```python
alt.Chart(data).mark_line().encode(x='Age', y='Fare')
```

### Bar charts
A bar chart is a graph that presents categorical data as rectangular bars whose heights or lengths are proportional to the values they represent. Bar charts are used to compare values across different categories.

```python
alt.Chart(data).mark_bar().encode(x='Pclass', y='count()')
```

Notice how `Pclass` is treated as numerical (quantitative) data.\
If we want to display it as ordinal data instead, we must append the ordinal type code `:O` to the column name.

```python
alt.Chart(data).mark_bar().encode(x='Pclass:O', y='count()')
```

### Histograms
Histograms are a type of bar chart that visualise the distribution of a single numerical variable. The data is split into bins, and the height of each bar shows how many data points fall within that bin.

```python
alt.Chart(data).mark_bar().encode(alt.X('Age', bin=True), y='count()')
```

### Changing Properties

You can customise plots in many different ways by adding extra parameters to the `.encode` method. Here are two common examples:

**Color:** The `color` encoding can be mapped to a column in your DataFrame to display different categories in different colors. A single fixed color can be set instead, either on the mark itself or as a constant value.

**Tooltip:** You can specify tooltips that appear when you hover over the marks.

```python
alt.Chart(data).mark_point().encode(
    x='Age',
    y='Fare',
    color=alt.Color('Pclass:O', scale=alt.Scale(scheme='redyellowgreen')),
    tooltip=['Pclass', 'Age', 'Fare']
)
```

### Data selection

Sometimes you only want to visualise some of your data. You can use the `transform_filter` method to select rows in your DataFrame before visualising.

```python
alt.Chart(data).mark_point().encode(
    x='Age',
    y='Fare',
    color='Sex',
    tooltip=['Name', 'Pclass', 'Age', 'Fare']
).transform_filter(alt.datum.Pclass == 1)
```

### Hands-on Exercise

Use the data above to create the following plots:

1. A scatter plot of Age against Fare, where the points are coloured by survival (the `Survived` column), and the tooltips show Survived, Age and Fare.
2. A bar chart of Pclass and the number of people in each class.
3. A histogram of Age.

Try experimenting with different properties.

````{admonition} Solution
:class: dropdown

1. Scatterplot of Age against Fare colored by Survival.
```python
scatter_plot = alt.Chart(data).mark_point().encode(
    x='Age',
    y='Fare',
    color='Survived:N',
    tooltip=['Survived', 'Age', 'Fare']
)
scatter_plot
```
2. Bar chart of Pclass with the count of people in each class.
```python
bar_chart = alt.Chart(data).mark_bar().encode(
    x='Pclass:N',
    y='count()'
)
bar_chart
```
3. Histogram of Age.
```python
histogram = alt.Chart(data).mark_bar().encode(
    x=alt.X('Age', bin=True),
    y='count()'
)
histogram
```
````

## Data Transformation
Sometimes the data needs to be transformed before visualising it. You have already seen one example with `transform_filter`. Here, we will also cover how to create new columns and how to do aggregations.
We will continue with the Titanic dataset.

### Filter rows
As already mentioned, you can filter rows with `transform_filter`. As an example, let us only look at the people who survived.

```python
alt.Chart(data).mark_point().encode(
    x='Age',
    y='Fare',
    color=alt.Color('Pclass:N', scale=alt.Scale(scheme='pastel1')),
).transform_filter(alt.datum.Survived == 1)
```

### Create new columns
You can create new columns based on your existing data by using the `transform_calculate` method. Here is an example where we create a new column called `Age_in_days`.

```python
alt.Chart(data).mark_point().encode(
    x='Age',
    y='Fare',
    color=alt.Color('Pclass:N', scale=alt.Scale(scheme='category10')),
    tooltip=[
        'Age',
        'Age_in_days:Q'
        # Alternative with a custom title:
        # alt.Tooltip('Age_in_days:Q', title='Age in Days')
    ]
).transform_calculate(
    Age_in_days='datum.Age * 365'
)
```

### Aggregations

Aggregations are used to compute summary statistics, such as the mean, median or sum of values of a column. Here, we will create a bar chart where each bar corresponds to a Pclass, and the bar represents the average Fare.

```python
alt.Chart(data).mark_bar().encode(
    x='Pclass:O',
    y='mean(Fare)'
)
```

### Hands-on Exercise
1. Create a scatter plot of `Age` and `Fare` but only for the people who did not survive.
2. Create a new column called `Family_size` which is the SibSp plus the Parch.
3. Create a bar plot showing the median Age for each Pclass.

````{admonition} Solution
:class: dropdown

1. Scatter plot for non-survivors.
```python
non_survivors_plot = alt.Chart(data).mark_point().encode(
    x='Age',
    y='Fare'
).transform_filter(
    alt.datum.Survived == 0
)
non_survivors_plot
```
2. Adding a new column `Family_size`.
```python
data['Family_size'] = data['SibSp'] + data['Parch']
# Show the new column
data[['Family_size']].head()
```
3. Bar plot of median Age for each Pclass.
```python
median_age_plot = alt.Chart(data).mark_bar().encode(
    x='Pclass:N',
    y='median(Age)'
)
median_age_plot
```
````

## Customization and Exploration

You can customise your charts in many ways. Here are some important examples:

### Titles and Axes Labels
You can add titles to the plots and axes labels to clarify what is visualised.

```python
alt.Chart(data).mark_point().encode(
    x=alt.X('Age', title='Age of passenger'),
    y=alt.Y('Fare', title='Fare'),
    color=alt.Color('Pclass:N', scale=alt.Scale(scheme='set1')),
).properties(
    title='Titanic passenger data'
)
```

### Adding Legends

Legends show what the colours or symbols in the plot mean. They are added automatically when you specify the `color` encoding, but you can customise them further.

```python
alt.Chart(data).mark_point().encode(
    x='Age',
    y='Fare',
    color=alt.Color('Pclass:N', scale=alt.Scale(scheme='tableau10'), title='Passenger Class')
)
```

```{admonition} Color Scales
:class: seealso

Did you notice how we have played with different color scales for the passenger class?\
You can read more about the `altair.Scale` [here](https://altair-viz.github.io/user_guide/generated/core/altair.Scale.html).
```

### Saving Charts

Once you have created a chart you like, you can save it to a file. Here is how you can save it as a JSON file.

```python
chart = alt.Chart(data).mark_point().encode(
    x='Age',
    y='Fare',
    color='Pclass'
)

chart.save('titanic_plot.json')
```

### How to find information about properties?

The documentation is very useful to learn about all the different parameters that you can specify. The Vega-Altair documentation can be found here:

[altair-viz.github.io](https://altair-viz.github.io/)

The documentation includes many examples of plots.

You can also use the help functionality within Python, such as `help(alt.Chart().encode)`.

### Hands-on Exercise

1. Change the axis labels to be more descriptive.
2. Change the legend title to something more specific.
3. Save one of the plots you have created so far.