# Aggregation

Let's start up Hail, import `matplotlib` and read our movie dataset.

In [None]:
import hail as hl
hl.init()

import matplotlib.pyplot as plt
import seaborn
seaborn.set()
%matplotlib inline

hl.utils.get_movie_lens('data/')
users = hl.read_table('data/users.ht')

# Exploring More Deeply

In the last section, we inspected the structure of the data and displayed a few example values.

How do we get a deeper feel for the data?  One of the most natural things to do is to create a summary of a large number of values.  For example, you could ask:

 - How many women and men are in the dataset?
 - What is the average age?  Youngest?  Oldest?
 - What are all the occupations that appear, and how many of each?

We can do these things with *aggregation*.  Aggregation combines many values together to create a summary.

# Aggregation

To start, we'll aggregate all the values in a table.  (Later, we'll learn how to aggregate over subsets.)

We can do this with the Table [aggregate](https://hail.is/docs/devel/hail.Table.html#hail.Table.aggregate) method.

A call to `aggregate` has two parts:

 - The expression you want to aggregate over (e.g. a field of a `Table`).
 - The *aggregator* that says how to combine the values into the summary.
 
Hail has a large suite of [aggregators](https://hail.is/docs/devel/aggregators.html) for summarizing data.  Let's see some in action!
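
As a rough mental model in plain Python (not Hail), the two parts can be pictured as a function that picks a value out of each row and a function that combines those values. The toy rows and the `aggregate` helper below are hypothetical stand-ins for illustration, not the Hail API.

```python
# Hypothetical toy rows standing in for a table; these are not the MovieLens data.
rows = [
    {"id": 1, "age": 24, "sex": "M"},
    {"id": 2, "age": 53, "sex": "F"},
    {"id": 3, "age": 23, "sex": "M"},
]


def aggregate(rows, expression, aggregator):
    """Evaluate `expression` on every row, then combine the results with `aggregator`."""
    return aggregator(expression(row) for row in rows)


# Part 1 is the expression (the age of each row); part 2 is the aggregator (max or sum).
oldest = aggregate(rows, lambda row: row["age"], max)
total_age = aggregate(rows, lambda row: row["age"], sum)
print(oldest, total_age)
```

Swapping the aggregator changes the summary without touching the expression, which is the separation the two-part call expresses.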

# Aggregators

Aggregators live in the `hl.agg` module.  The simplest aggregator is [hl.agg.count](https://hail.is/docs/devel/aggregators.html#hail.expr.aggregators.count).  It takes no arguments and returns the number of values aggregated.

In [None]:
users.aggregate(hl.agg.count())

In [None]:
users.count()

# Aggregators

[hl.agg.stats](https://hail.is/docs/devel/aggregators.html#hail.expr.aggregators.stats) computes summary statistics of a numeric expression, such as the mean, standard deviation, minimum and maximum.  There are also numeric aggregators for `mean`, `min`, `max`, `sum`, `product` and `array_sum`.

In [None]:
users.show()

In [None]:
users.aggregate(hl.agg.stats(users.age))
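
To make concrete what a statistics summary of this kind contains, here is a minimal plain-Python sketch over a handful of made-up ages. The key names and the choice of the population standard deviation (dividing by `n`) are assumptions for illustration, not a statement about how Hail computes its result.

```python
import math

ages = [24, 53, 23, 24, 33]  # hypothetical values, not taken from the dataset

n = len(ages)
total = sum(ages)
mean = total / n
# Assumption: population standard deviation (divide by n rather than n - 1).
stdev = math.sqrt(sum((age - mean) ** 2 for age in ages) / n)

summary = {
    "mean": mean,
    "stdev": stdev,
    "min": min(ages),
    "max": max(ages),
    "n": n,
    "sum": total,
}
print(summary)
```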

# Aggregators

`stats` and friends work great for numeric data.  What about other data types, like categorical data?

[hl.agg.counter](https://hail.is/docs/devel/aggregators.html#hail.expr.aggregators.counter) is modeled on Python's `collections.Counter`.

It counts the number of times each distinct value occurs in the collection of values being aggregated.

In [None]:
users.aggregate(hl.agg.counter(users.occupation))
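
For comparison, this is how the Python `Counter` that the aggregator is modeled on behaves. The occupation list is a small made-up sample, used only to show the shape of the result.

```python
from collections import Counter

# Hypothetical sample of categorical values.
occupations = ["writer", "engineer", "writer", "student", "engineer", "writer"]

# Counter maps each distinct value to the number of times it occurs.
counts = Counter(occupations)
print(counts)

# most_common(k) lists the k most frequent values with their counts.
top = counts.most_common(1)
print(top)
```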

# Filter

You can filter elements of a collection before aggregating it by using [hl.agg.filter](https://hail.is/docs/devel/aggregators.html#hail.expr.aggregators.filter).

In [None]:
users.aggregate(hl.agg.count(hl.agg.filter(users.sex == 'M', users.sex)))

# Filter

The argument to `filter` can be a boolean expression (like above) or a Python lambda that takes the values in the collection and returns a boolean (True if the value should be kept).  This mirrors the interface of Python's built-in `filter` function.

In [None]:
users.aggregate(hl.agg.count(hl.agg.filter(lambda sex: sex == 'M', users.sex)))
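
For reference, here is the built-in Python `filter` that this interface mirrors, applied to a small made-up list. The predicate comes first and the collection second, and only the values for which the predicate returns True are kept.

```python
sexes = ["M", "F", "M", "M", "F"]  # hypothetical values

# filter(predicate, iterable) lazily yields the items where predicate(item) is True.
males = list(filter(lambda sex: sex == "M", sexes))

# Counting the surviving values corresponds to filtering first, then counting.
n_males = len(males)
print(n_males)
```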

# Histograms

As we saw in the GWAS example, [hl.agg.hist](https://hail.is/docs/devel/aggregators.html#hail.expr.aggregators.hist) can be used to build a histogram over numeric data.

In [None]:
hist = users.aggregate(hl.agg.hist(users.age, 10, 70, 60))
hist

In [None]:
# Note: `hist` should carry its start and end, so that this could simply be hl.plot.hist(hist).
# For now, draw one bar per bin at each bin's left edge (the final edge is dropped).
plt.xlim(10, 70)
plt.bar(hist.bin_edges[:-1], hist.bin_freq)
plt.show()
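
The `bin_edges` and `bin_freq` fields used above can be understood through a plain-Python sketch of equal-width binning. The `histogram` helper below is hypothetical, and two of its behaviours are assumptions made for illustration: that the last bin includes the `end` value, and that values outside the range are tallied separately.

```python
def histogram(values, start, end, bins):
    """Count values into `bins` equal-width bins spanning [start, end]."""
    width = (end - start) / bins
    # bins + 1 edges delimit bins bins.
    bin_edges = [start + i * width for i in range(bins + 1)]
    bin_freq = [0] * bins
    n_smaller = n_larger = 0
    for value in values:
        if value < start:
            n_smaller += 1
        elif value > end:
            n_larger += 1
        else:
            # Assumption: a value equal to `end` falls into the last bin.
            index = min(int((value - start) / width), bins - 1)
            bin_freq[index] += 1
    return bin_edges, bin_freq, n_smaller, n_larger


ages = [9, 10, 18, 25, 25, 39, 70, 71]  # hypothetical values
edges, freq, below, above = histogram(ages, 10, 70, 6)
print(edges, freq, below, above)
```

There is always one more edge than there are bins, which is why the plotting code drops the final edge before pairing edges with frequencies.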

There are a few aggregators for collecting values.

 - `take` returns a few values.  It accepts an optional `ordering`.
 - `collect` returns all values.

In [None]:
users.aggregate(hl.agg.take(users.occupation, 5))

In [None]:
users.aggregate(hl.agg.take(users.age, 5, -users.age))
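
A plain-Python sketch of the idea behind an ordering: keep the `n` values whose ordering key is smallest. This reading of `ordering` (smallest key first, so negating the age yields the oldest users) is an assumption about the semantics, illustrated here with `heapq` on made-up ages rather than with Hail.

```python
import heapq

ages = [24, 53, 23, 24, 33, 42, 57]  # hypothetical values

# Without an ordering: simply the first five values encountered.
first_five = ages[:5]

# With an ordering key of -age: the five values with the smallest key,
# which are the five largest ages.
oldest_five = heapq.nsmallest(5, ages, key=lambda age: -age)
print(first_five, oldest_five)
```

Unlike collecting everything and sorting, this approach only ever needs to hold `n` candidates at a time.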

Warning!  Aggregators like `collect` and `counter` return Python objects, so they can fail with out-of-memory errors if you apply them to collections that are too large (e.g. all 50 trillion genotypes in the UK Biobank).