In [None]:
import google.auth
import gcsfs

import pandas as pd
import numpy as np
import altair as alt
from altair_saver import save

from IPython.display import Image, display

from hospitalization import get_hospitalization_data

In [None]:
# get credentials
credentials, project_id = google.auth.default(
scopes=["https://www.googleapis.com/auth/cloud-platform"]
)
gcp_fs = gcsfs.GCSFileSystem()

# 0. Introduction

Why is altair cool?

Altair is a declarative visualization library based on Vega. The key idea is that the Altair grammar lets you declare links between _data columns_ and _visual encoding channels_.

What is Vega? Vega is a declarative language for creating, saving, and sharing interactive visualization designs. With Vega, you can describe the visual appearance and interactive behavior of a visualization in a JSON format. Altair is the Python API for Vega-Lite, a higher-level language built on top of Vega. Check out some beautiful and motivating Vega [examples](https://vega.github.io/vega/examples/).

Let's start with something really simple. I'm going to load a public data file from GCP, but if that doesn't work for you or the GCP file system is not set up, you can go to this [link](https://console.cloud.google.com/storage/browser/_details/covid-public-data/csv/incidence_20210225.csv), download the .csv, and use `pd.read_csv` to load it into a dataframe.
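If you go the download route, a small helper along these lines could stand in for the GCP cell. This is only a sketch: `load_incidence` is a hypothetical helper name, and the in-memory CSV below uses made-up rows so that the example runs without the real file. It assumes the file has the `date`, `region` and `pop_mid` columns used later in this notebook.

```python
import io

import pandas as pd


def load_incidence(path_or_buffer):
    """Read the incidence CSV with the first column as index and parse the dates."""
    df = pd.read_csv(path_or_buffer, index_col=0)
    df["date"] = pd.to_datetime(df["date"])
    return df


# Stand-in for the downloaded file; the rows are invented for illustration
sample_csv = io.StringIO(
    ",date,region,pop_mid\n"
    "0,2021-02-24,UK,9500\n"
    "1,2021-02-25,UK,9200\n"
)

source_local = load_incidence(sample_csv)
print(source_local.dtypes)
```

With the real download you would pass the path of the saved file instead, for example `source = load_incidence('incidence_20210225.csv')`, assuming you saved it next to the notebook.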

In [None]:
filename = 'covid-public-data/csv/incidence_20210225.csv'
with gcp_fs.open(filename) as file_obj:
    source = pd.read_csv(file_obj, index_col=0)

In [None]:
source['date'] = pd.to_datetime(source['date'])

I will create a smaller dataframe for UK incidence in the last few months.

In [None]:
UK_incidence = source[(source.region=='UK') & (source.date >= '2020-12-01')]

In [None]:
UK_incidence.head()

# 1. Building blocks for an altair chart ('10)
_built on this tutorial: https://altair-viz.github.io/getting_started/starting.html_

The fundamental object in Altair is the `Chart`, which takes a dataframe as a single argument:

In [None]:
chart = alt.Chart(UK_incidence)

Taking the chart object, we can now specify how we would like the data to be visualized. This is done by calling one of the `mark_*` methods of the chart object. The most commonly used marks are the following:

* `mark_area()`	A filled area plot
* `mark_bar()`	A bar plot
* `mark_circle()`	A scatter plot with filled circles
* `mark_line()`	A line plot
* `mark_point()`	A scatter plot with configurable point shapes
* `mark_rect()`	A filled rectangle, used for heatmaps
* `mark_rule()`	A vertical or horizontal line spanning the axis
* `mark_text()`	A scatter plot with points represented by text

Read more about the marks [here](https://altair-viz.github.io/user_guide/marks.html).

In [None]:
alt.Chart(UK_incidence).mark_bar()

This visualization consists of one bar per row in the dataset, all plotted on top of each other, since we have not yet specified positions for them.

To visually separate the points, we can map various [encoding channels](https://altair-viz.github.io/user_guide/encoding.html) to columns in the dataset. For example, we could encode the variable _date_ of the data with the x channel, which represents the x-axis position of the points. This can be done straightforwardly via the `Chart.encode()` method.

Some important encodings are:
* `x`
* `y`
* `color`
* `size`
* `column`
* `row`

In [None]:
alt.Chart(
    UK_incidence
).mark_bar().encode(
    x="date",
    y="pop_mid",
)

We can also swap the bars for a line simply by changing the mark type.

In [None]:
alt.Chart(
    UK_incidence
).mark_line().encode(
    x="date",
    y="pop_mid",
)

The `encode()` method builds a key-value mapping from encoding channels (such as x, y, color, shape, size, etc.) to columns in the dataset, accessed by column name.

The details of any mapping depend on the type of the data. Altair recognizes four main data types:
    
* quantitative	`Q`	a continuous real-valued quantity
* ordinal	`O`	a discrete ordered quantity
* nominal	`N`	a discrete unordered category
* temporal	`T`	a time or date value

If types are not specified for data input as a DataFrame, Altair defaults to quantitative for any numeric data, temporal for date/time data, and nominal for string data. Note that these defaults are by no means always the correct choice.

We can change the types by appending `:Q`/`:O`/`:N`/`:T` to the column names in the encoding specification. Take a look at how the chart changes when we change the type of the `date` column to ordinal! The data type also affects the way axis labels, color scales, etc. are handled.

In [None]:
alt.Chart(
    UK_incidence
).mark_bar().encode(
    x="date:O",
    y="pop_mid:Q",
)

Instead of using the string shorthand for the x and y encoding channels, you can construct the channel objects (`alt.X`, `alt.Y`) explicitly, which lets you customize their properties, such as titles and axis formats.

In [None]:
my_first_chart = alt.Chart(
    UK_incidence,
    title='My chart title'
).mark_bar().encode(
    x=alt.X("date", title = 'My date axis'),
    y=alt.Y("pop_mid", title = 'Y axis with the daily incidence', axis=alt.Axis(format='s'))
)
my_first_chart

Now you can save this chart by calling

In [None]:
my_first_chart.save('my_first_chart.png', scale_factor=4.0)

# 2. Tidy data, colors and facets ('10)

Altair works best with [tidy data](http://vita.had.co.nz/papers/tidy-data.html), also known as long format: _Tidy datasets are easy to manipulate, model and visualize, and have a specific structure: each variable is a column, each observation is a row, and each type of observational unit is a table._

The incidence data is already in long format, but we can convert it to wide to see the differences.

In [None]:
incidence_wide = pd.pivot_table(
    source,
    values='pop_mid',
    index=['date'],
    columns=['region'], aggfunc='sum'  # the string form avoids a deprecation warning in recent pandas
).reset_index()

incidence_wide

This sort of data would be much harder to use with Altair.
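If your own data arrives in wide format, pandas can reshape it back to long format with `melt`. The sketch below uses a small invented wide table. The column names `region` and `pop_mid` are chosen to mirror the incidence data.

```python
import pandas as pd

# Toy wide table, invented for illustration: one column per region
wide = pd.DataFrame({
    "date": pd.to_datetime(["2021-01-01", "2021-01-02"]),
    "England": [100, 120],
    "Wales": [10, 12],
})

# One row per (date, region) pair: the shape Altair expects
tidy = wide.melt(id_vars="date", var_name="region", value_name="pop_mid")
print(tidy)
```

After this step, `region` is an ordinary column again and can be mapped to an encoding channel such as `color` or `column`.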

Now we can explore how to take advantage of the tidy format to use colors to plot several variables simultaneously.

In [None]:
alt.Chart(
    source,
    title='My chart title'
).mark_line().encode(
    x="date",
    y="pop_mid",
    color='region'
)

Note that you can pass `color` as an encoding channel to the `.encode()` method, or, if you don't want to color by any of the variables, set a fixed color in the `mark_*` method.

We can also use facets (rows or columns) to plot several variables in separate subplots.

In [None]:
Wales_Scotland_England_incidence = source[source.region.isin(['Wales', 'Scotland', 'England'])]

alt.Chart(
    Wales_Scotland_England_incidence,
    title='My chart title'
).mark_bar(color='orange').encode(
    x='date',
    y='pop_mid',
    column='region'
)

In [None]:
Wales_Scotland_England_incidence = source[source.region.isin(['Wales', 'Scotland', 'England'])]

alt.Chart(
    Wales_Scotland_England_incidence,
    title='My chart title'
).mark_bar().encode(
    x='date',
    y='pop_mid',
    row='region'
)

If you don't need the subplots to share the same axis scale, you can add the `resolve_scale()` method at the end of your chart specification.

In [None]:
Wales_Scotland_England_incidence = source[source.region.isin(['Wales', 'Scotland', 'England']) & (source.date >= '2020-12-01')]

alt.Chart(
    Wales_Scotland_England_incidence,
    title='My chart title'
).mark_bar().encode(
    x='date',
    y='pop_mid',
    column='region'
).resolve_scale(y='independent')

One great thing in Altair is that you can easily combine multiple charts by using the following functions:
    
* `alt.hconcat()` for horizontal concatenation
* `alt.vconcat()` for vertical concatenation
* `alt.layer()` for layering

Suppose you want to see the UK incidence on a linechart and England incidence in barcharts. We can create the two charts separately and then combine them with `alt.hconcat`.

In [None]:
UK_line = alt.Chart(
    UK_incidence,
    title='UK incidence on a linechart'
).mark_line(color='red').encode(
    x=alt.X("date", title = 'My date axis'),
    y=alt.Y("pop_mid", title = 'incidence', axis=alt.Axis(format='s')),
)

England_bars = alt.Chart(
    Wales_Scotland_England_incidence[Wales_Scotland_England_incidence.region == 'England'],
    title='England incidence on a barchart'
).mark_bar().encode(
    x=alt.X("date", title = 'My date axis'),
    y=alt.Y("pop_mid", title = 'incidence', axis=alt.Axis(format='s')),
)

alt.hconcat(UK_line, England_bars)

Or you can also overlay the above two charts to see the UK incidence line and the England incidence bars on the same plot.

In [None]:
alt.layer(UK_line, England_bars)

Alternatively, we can also use the following shortcuts for concatenation:
    
* `alt.hconcat()` for horizontal concatenation --> `|`
* `alt.vconcat()` for vertical concatenation --> `&`
* `alt.layer()` for layering --> `+`

In [None]:
UK_line + England_bars

Another useful example where we might want to layer charts is adding confidence intervals.

In [None]:
UK_incidence.head()

In [None]:
base = alt.Chart(
    UK_incidence,
    title='UK incidence on a linechart'
)

UK_line = base.mark_line(color='red').encode(
    x=alt.X("date", title = 'My date axis'),
    y=alt.Y("pop_mid", title = 'incidence', axis=alt.Axis(format='s')),
)

UK_CI = base.mark_area(color='red', opacity=0.4).encode(
    x=alt.X("date", title = 'My date axis'),
    y='pop_low',
    y2='pop_up'
)

display(UK_line | UK_CI)

display(UK_line + UK_CI)

Another feature that comes in handy when building composite charts is Altair's [own syntax](https://altair-viz.github.io/user_guide/transform/index.html) for data transformations and filtering. You can do a lot without transforming your data in pandas first, but let's stick to the most basic operation: filtering.

In [None]:
my_basic_chart = alt.Chart(
    Wales_Scotland_England_incidence # passing the whole dataset
).mark_line().encode(
    x=alt.X("date", title = 'My date axis'),
    y=alt.Y("pop_mid", axis=alt.Axis(format='s')),
    color='region'
)

my_basic_chart.transform_filter(
    (alt.datum.region == 'Scotland') | (alt.datum.region == 'Wales')
).properties(
    title='I filtered Scotland and Wales only'
)


## Now that you know everything, let's try to create some charts! ('20)

In the next two cells, we load some data to explore.

In [None]:
filename = 'covid-public-data/csv/RevisedStats/prevalence_history_20210228.csv'
with gcp_fs.open(filename) as file_obj:
    prev_df = pd.read_csv(file_obj)
    
nhs_bed_occupancy = get_hospitalization_data()

In [None]:
prevalence_and_hospitalization = prev_df.merge(
    nhs_bed_occupancy,
    how='left',
    on = ['date', 'region']
)

prevalence_and_hospitalization_per_date = prevalence_and_hospitalization.groupby('date')[['active_cases', 'hospital_cases']].sum().reset_index()
prevalence_and_hospitalization_per_region = prevalence_and_hospitalization[
    prevalence_and_hospitalization.date >= '2021-02-01'
]

__Exercise 1__ Make a linechart of prevalence over time, where the lines are colored by region! (Use the dataset `prev_df`) <br>
__Exercise 2__ Make a barchart of prevalence over time in your three favorite regions where the bars are colored by the size of prevalence and regions are on separate plots (faceted by rows) (hint: use `.resolve_scale(y='independent')` to set different scales for the rows).<br>
__Exercise 3A__ Use the dataset `prevalence_and_hospitalization_per_region`, which contains the number of active cases and hospitalized cases for each date and region since the beginning of February. Plot the number of active cases against the number of hospitalized cases with a scatterplot, colored by region! <br>
__Exercise 3B__ Take the plot from the previous exercise and add a tooltip! You can use the encoding channel `tooltip` to reveal some information about the datapoints when you hover over them. In this case, you can show the date of each point. <br>

## Bonus: interactivity!

In [None]:
prev_hosp_inc = prevalence_and_hospitalization.merge(
    source.assign(date = source['date'].astype(str)),
    on = ['date', 'region'],
    how='left'
)

In [None]:
selection = alt.selection_multi(fields=['region'])
color = alt.condition(selection,
                      alt.Color('region:N', legend=None),
                      alt.value('lightgray'))

base = alt.Chart(prev_hosp_inc).mark_line().encode(
    x='date:T',
    color=color,
).properties(
    width=250,
    height=250
)

legend = alt.Chart(prev_hosp_inc).mark_bar().encode(
    y=alt.Y('region:N', axis=alt.Axis(orient='right')),
    color=color
).add_selection(
    selection
)

base.encode(y='active_cases') | base.encode(y='hospital_cases') | legend
