# Plotting with Hail

The [`hail.ggplot2`](https://hail.is/docs/0.2/ggplot2/index.html) module provides a set of functions for aggregating and plotting your data. It aims to mimic the API of [R's `ggplot2` library](https://ggplot2.tidyverse.org) as closely as possible, and it renders plots using the [Vega-Altair](https://altair-viz.github.io/) library.

This page explains the basics of the module and shows how to create some commonly used types of plot.

## Example Data

To build example plots, we first need example data. The following code uses [`hail.utils.range_table`](https://hail.is/docs/0.2/utils/index.html#hail.utils.range_table) to generate a 100-row table whose `idx` column contains the index of each row, then adds two columns, `idx_2` and `idx_3`, that multiply that index by 2 and 3 respectively:

In [1]:
import hail as hl

data = hl.utils.range_table(100)
data = data.annotate(idx_2=(data.idx * 2), idx_3=(data.idx * 3))
data.show()

Initializing Hail with default parameters...
Running on Apache Spark version 3.3.3
SparkUI available at http://192.168.1.14:4040
Welcome to
     __  __     <>__
    / /_/ /__  __/ /
   / __  / _ `/ / /
  /_/ /_/\_,_/_/_/   version 0.2.122-2a2fa74f2ad3
LOGGING: writing to /Users/irademac/src/hail/hail/python/hail/docs/tutorials/hail

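| idx | idx_2 | idx_3 |
|---|---|---|
| int32 | int32 | int32 |
| 0 | 0 | 0 |
| 1 | 2 | 3 |
| 2 | 4 | 6 |
| 3 | 6 | 9 |
| 4 | 8 | 12 |
| 5 | 10 | 15 |
| 6 | 12 | 18 |
| 7 | 14 | 21 |
| 8 | 16 | 24 |
| 9 | 18 | 27 |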
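
To check the expected values without starting a Hail session, the same arithmetic can be mirrored in plain Python. This is only an illustrative sketch: `rows` is a hypothetical list of dictionaries, not a Hail table, and nothing here calls Hail.

```python
# Plain-Python mirror of the example table (illustration only, no Hail required).
# Each dictionary plays the role of one table row.
rows = [{"idx": i, "idx_2": i * 2, "idx_3": i * 3} for i in range(100)]

# Print the first ten rows, which correspond to the rows displayed above.
for row in rows[:10]:
    print(row["idx"], row["idx_2"], row["idx_3"])
```

The real table is evaluated by Hail rather than held in a Python list, so this mirror is useful only for sanity-checking values.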


## Plot Objects

When creating a plot, we start out with a basic **plot object**, which wraps our data. We can create such an object using the [`ggplot`](https://hail.is/docs/0.2/ggplot2/index.html#hail.ggplot2.ggplot) function, and take a closer look using the [`pprint`](https://hail.is/docs/0.2/ggplot2/index.html#hail.ggplot2.pprint) function:

In [2]:
from hail.ggplot2 import ggplot
from hail.ggplot2.typecheck import pprint

plot = ggplot(data)
pprint(plot)

Plot(
    data = <hail.table.Table object at 0x165c8d5b0>,
    mapping = Mapping(
        x = None,
        y = None,
        rest = {}
    ),
    geoms = []
)


## Aesthetic Mappings

Next, we'll need to specify which values to plot along our x- and y-axes.

We'll use the [`aes`](https://hail.is/docs/0.2/ggplot2/index.html#hail.ggplot2.aes) function to create an **aesthetic mapping**, which links data to visual components of our plot. `aes` treats its first two positional arguments as `x` and `y`; any other aesthetics are passed as keyword arguments.

Let's map the `idx` column of our data to the x-axis, and the `idx_2` column to the y-axis:

In [3]:
from hail.ggplot2 import aes

plot += aes(data.idx, data.idx_2)
pprint(plot)

Plot(
    data = <hail.table.Table object at 0x165c8d5b0>,
    mapping = Mapping(
        x = <Int32Expression of type int32>,
        y = <Int32Expression of type int32>,
        rest = {}
    ),
    geoms = []
)
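
To make the positional-versus-keyword convention concrete, here is a minimal plain-Python sketch of a function with the same calling style. `toy_aes` is a hypothetical stand-in, not Hail's implementation, and `color` is used only as an example of an extra aesthetic name. The `x` / `y` / `rest` layout follows the printed `Mapping` above.

```python
from typing import Any, Dict, Optional


def toy_aes(x: Optional[Any] = None, y: Optional[Any] = None, **rest: Any) -> Dict[str, Any]:
    """Hypothetical stand-in for `aes`: x and y may be positional, the rest are keywords."""
    return {"x": x, "y": y, "rest": rest}


# The first two positional arguments become x and y.
positional = toy_aes("idx", "idx_2")

# Any aesthetic can also be named explicitly; unnamed ones stay None.
keyword = toy_aes(y="idx_3", color="idx_2")

print(positional)  # {'x': 'idx', 'y': 'idx_2', 'rest': {}}
print(keyword)     # {'x': None, 'y': 'idx_3', 'rest': {'color': 'idx_2'}}
```

Strings stand in for Hail expressions such as `data.idx` so that the sketch runs without Hail.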


## Layers

Like R's `ggplot2`, `hail.ggplot2` bases its approach to plotting on a [layered grammar of graphics](https://ggplot2-book.org/introduction.html#what-is-the-grammar-of-graphics).

A **layer** is a self-contained collection of visual characteristics, data, and/or aggregations over that data. We'll build up our plot from the data by adding different layers to it.

### Geoms

A **geom** is a type of layer that specifies which kind of plot we're making. For example, `geom_point` indicates that we're making a scatterplot:

In [4]:
from hail.ggplot2 import geom_point

plot += geom_point()
pprint(plot)

Plot(
    data = <hail.table.Table object at 0x165c8d5b0>,
    mapping = Mapping(
        x = <Int32Expression of type int32>,
        y = <Int32Expression of type int32>,
        rest = {}
    ),
    geoms = [
        GeomPoint(
            mapping = Mapping(
                x = None,
                y = None,
                rest = {}
            ),
            data = None
        )
    ]
)


Since we've now added information about what the plot will look like, we can display it using the `show` function:

In [5]:
from hail.ggplot2 import show

show(plot)

Plot(
    data = <hail.table.Table object at 0x165c8d5b0>,
    mapping = Mapping(
        x = <Int32Expression of type int32>,
        y = <Int32Expression of type int32>,
        rest = {}
    ),
    geoms = [
        GeomPoint(
            mapping = Mapping(
                x = None,
                y = None,
                rest = {}
            ),
            data = None
        )
    ]
)


We can also add multiple geoms to a single plot.

For example, we can plot a line on top of our scatterplot using `geom_line`. This line will still plot the values of the `idx` column along the x-axis, but use the `idx_3` column for the y-axis. We can override the `y` mapping for this geom by passing in its own `aes`:

In [6]:
from hail.ggplot2 import geom_line

plot += geom_line(aes(y=data.idx_3))
show(plot)

Plot(
    data = <hail.table.Table object at 0x165c8d5b0>,
    mapping = Mapping(
        x = <Int32Expression of type int32>,
        y = <Int32Expression of type int32>,
        rest = {}
    ),
    geoms = [
        GeomPoint(
            mapping = Mapping(
                x = None,
                y = None,
                rest = {}
            ),
            data = None
        ),
        GeomLine(
            mapping = Mapping(
                x = None,
                y = <Int32Expression of type int32>,
                rest = {}
            ),
            data = None
        )
    ]
)
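
One way to picture the override: each aesthetic that a geom's own mapping leaves as `None` falls back to the plot-level mapping. The sketch below models that rule in plain Python. `resolve_mapping` is a hypothetical helper, and the fallback rule is an assumption inferred from the printed output above, not a description of Hail's internals.

```python
from typing import Any, Dict


def resolve_mapping(plot_mapping: Dict[str, Any], geom_mapping: Dict[str, Any]) -> Dict[str, Any]:
    """Start from the plot-level mapping, then apply any aesthetic the geom sets itself."""
    resolved = dict(plot_mapping)
    for name, value in geom_mapping.items():
        if value is not None:  # None means "inherit from the plot"
            resolved[name] = value
    return resolved


# Strings stand in for Hail expressions such as `data.idx`.
plot_mapping = {"x": "idx", "y": "idx_2"}
point_mapping = {"x": None, "y": None}    # geom_point(): inherits everything
line_mapping = {"x": None, "y": "idx_3"}  # geom_line(aes(y=data.idx_3)): overrides y

print(resolve_mapping(plot_mapping, point_mapping))  # {'x': 'idx', 'y': 'idx_2'}
print(resolve_mapping(plot_mapping, line_mapping))   # {'x': 'idx', 'y': 'idx_3'}
```

Under this model, the scatterplot keeps using `idx_2` for its y-axis, while the line shares the same x-axis but draws `idx_3`.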


### Stats

TODO: You may have noticed that some geoms, such as `geom_histogram`, implicitly compute statistics about the data before rendering it. But what if you need to transform your data independently of a geom?

TODO: Stats are cached, so when you recompute them, the plot object will attempt to reuse the cached values of previously applied aggregations.

### Removing Layers

With R's `ggplot2`, if you add a layer to a plot object, [it can be tough to remove it](https://stackoverflow.com/questions/50434608/remove-geoms-from-an-existing-ggplot-chart).

However, `hail.ggplot2` plot objects keep track of each addition made to them, so we can use the `undo` function to roll back a single addition:

In [7]:
from hail.ggplot2 import undo

plot += geom_line()
show(plot)
plot = undo(plot)
show(plot)

Plot(
    data = <hail.table.Table object at 0x165c8d5b0>,
    mapping = Mapping(
        x = <Int32Expression of type int32>,
        y = <Int32Expression of type int32>,
        rest = {}
    ),
    geoms = [
        GeomPoint(
            mapping = Mapping(
                x = None,
                y = None,
                rest = {}
            ),
            data = None
        ),
        GeomLine(
            mapping = Mapping(
                x = None,
                y = <Int32Expression of type int32>,
                rest = {}
            ),
            data = None
        ),
        GeomLine(
            mapping = Mapping(
                x = None,
                y = None,
                rest = {}
            ),
            data = None
        )
    ]
)
Plot(
    data = <hail.table.Table object at 0x165c8d5b0>,
    mapping = Mapping(
        x = <Int32Expression of type int32>,
        y = <Int32Expression of type int32>,
        rest = {}
    ),
    geoms = [
        Geom

We can also specify the number of additions to undo via the `depth` keyword argument:

In [8]:
plot = plot + geom_line() + geom_point()
show(plot)
plot = undo(plot, depth=2)
show(plot)

Plot(
    data = <hail.table.Table object at 0x165c8d5b0>,
    mapping = Mapping(
        x = <Int32Expression of type int32>,
        y = <Int32Expression of type int32>,
        rest = {}
    ),
    geoms = [
        GeomPoint(
            mapping = Mapping(
                x = None,
                y = None,
                rest = {}
            ),
            data = None
        ),
        GeomLine(
            mapping = Mapping(
                x = None,
                y = <Int32Expression of type int32>,
                rest = {}
            ),
            data = None
        ),
        GeomLine(
            mapping = Mapping(
                x = None,
                y = None,
                rest = {}
            ),
            data = None
        ),
        GeomPoint(
            mapping = Mapping(
                x = None,
                y = None,
                rest = {}
            ),
            data = None
        )
    ]
)
Plot(
    data = <hail.table.Table object at 
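
The roll-back behavior described in this section can be modeled with a small history chain: every addition produces a new plot that remembers the plot it was built from, and undoing walks back along that chain. The following plain-Python sketch shows one possible design. `ToyPlot` and `toy_undo` are hypothetical names, and this is an assumption about how such a feature could work, not Hail's actual implementation.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class ToyPlot:
    """Hypothetical immutable plot: a tuple of layer names plus a link to its predecessor."""
    layers: Tuple[str, ...] = ()
    previous: Optional["ToyPlot"] = None

    def __add__(self, layer: str) -> "ToyPlot":
        # Adding a layer returns a new plot and records the old one as history.
        return ToyPlot(self.layers + (layer,), previous=self)


def toy_undo(plot: ToyPlot, depth: int = 1) -> ToyPlot:
    """Roll back `depth` additions by following the chain of predecessors."""
    for _ in range(depth):
        if plot.previous is None:
            raise ValueError("nothing left to undo")
        plot = plot.previous
    return plot


base = ToyPlot() + "point" + "line"
extended = base + "line" + "point"

print(toy_undo(extended).layers)           # ('point', 'line', 'line')
print(toy_undo(extended, depth=2).layers)  # ('point', 'line')
```

Because each plot in this model is immutable, undoing never mutates shared state; it simply returns an earlier object in the chain.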

## Titles

TODO

## Axis Labels

TODO

## Scales

TODO: what are scales

## Facets

TODO

## Examples

### Histogram

TODO: explain a bit about histograms

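Let's build a histogram of the `idx` column. This assumes that `geom_histogram` is exported from `hail.ggplot2` in the same way as the other geoms:

In [9]:
from hail.ggplot2 import geom_histogram  # assumed export location

plot2 = ggplot(data) + aes(data.idx)  # base plot with only the x-axis mapped
plot3 = plot2 + geom_histogram()
show(plot3)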

TODO: talk about default settings

We can specify the number of `bins` as an argument to `geom_histogram`:

In [None]:
plot4 = plot2 + geom_histogram(bins=50)
show(plot4)

### Cumulative Histogram

TODO

### 2D Histogram

TODO

### Scatter Plot

TODO

### QQ Plot

TODO: To create a quantile-quantile (QQ) plot, ...

### Manhattan Plot

TODO (use actual genetics data)

## Common Mistakes

TODO: Don't call the class constructors (such as `GeomPoint`) directly; use the corresponding functions (such as `geom_point`) instead.

TODO: example specifically demonstrating downsampling