# Filtering and annotation

Let's load up our dataset again.

In [None]:
import hail as hl
import matplotlib.pyplot as plt
import seaborn
%matplotlib inline

hl.utils.get_movie_lens('data/')
users = hl.read_table('data/users.ht')

# Filter

You can filter the rows of a table with [filter](https://hail.is/docs/devel/hail.Table.html#hail.Table.filter).  `filter` takes a boolean expression and keeps the rows for which it evaluates to `True`.  It returns a new `Table`.

In [None]:
users.filter(users.occupation == 'programmer').count()

# Annotate

You can add new field(s) to a table with [annotate](https://hail.is/docs/devel/hail.Table.html#hail.Table.annotate).  Let's add a `cleaned_occupation` field that replaces the placeholder occupations `'other'` and `'none'` with missing values.

In [None]:
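# Summary statistics of age (mean, stdev, min, max, ...); not used in this cell.
stats = users.aggregate(hl.agg.stats(users.age))
# Occupation values that carry no information and should be treated as missing.
missing_occupations = hl.set(['other', 'none'])

# Copy occupation into a new field, replacing placeholder values with missing.
t = users.annotate(
    cleaned_occupation = hl.cond(missing_occupations.contains(users.occupation),
                                 hl.null('str'),
                                 users.occupation))
t.show()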

Note that `annotate` is functional: it doesn't change `users`, but returns a new table.  This is also true of `filter`.  In fact, all operations in Hail are functional.

In [None]:
users.describe()

However, we could have assigned the result of the annotation back to the Python variable `users` if we wanted it to appear that we had "added" `cleaned_occupation` to the users table.  But don't forget that `annotate` returns a new table.

There are two other annotation methods: [select](https://hail.is/docs/devel/hail.Table.html#hail.Table.select) and [transmute](https://hail.is/docs/devel/hail.Table.html#hail.Table.transmute).  `select` returns a table with an entirely new set of fields.  `transmute` adds the new fields and drops the existing fields referenced on the right-hand side, leaving unmentioned fields unchanged.  `transmute` is useful for transforming data into a new form.  How about some examples?

In [None]:
(users.select(len_occupation = hl.len(users.occupation))
 .describe())

In [None]:
(users.transmute(
    cleaned_occupation = hl.cond(missing_occupations.contains(users.occupation),
                                 hl.null(hl.tstr),
                                 users.occupation))
 .describe())
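To compare the three methods side by side, here is a toy model in plain Python, with a dict standing in for a single row.  It is only a sketch of how the field set changes, not how Hail implements these methods: the helper functions, the sample row, and the explicit `key` and `used` arguments are all invented for illustration (Hail works out which fields an expression reads on its own).  It also assumes that `select` retains key fields, as recent Hail releases do.

```python
# Toy model of how annotate, select, and transmute change a row's field set.
# Plain dicts stand in for rows; these helpers are hypothetical, not Hail APIs.

def annotate(row, **new_fields):
    """Keep every existing field and add (or overwrite) the new ones."""
    return {**row, **new_fields}

def select(row, key, **new_fields):
    """Keep only the key fields plus the new fields."""
    kept = {name: row[name] for name in key}
    return {**kept, **new_fields}

def transmute(row, used, **new_fields):
    """Like annotate, but drop the fields that the new expressions read."""
    kept = {name: value for name, value in row.items() if name not in used}
    return {**kept, **new_fields}

# An invented row shaped like the users table.
row = {'id': 1, 'age': 24, 'occupation': 'technician'}
n = len(row['occupation'])

annotated = annotate(row, len_occupation=n)
selected = select(row, key=['id'], len_occupation=n)
transmuted = transmute(row, used={'occupation'}, len_occupation=n)

print(sorted(annotated))   # ['age', 'id', 'len_occupation', 'occupation']
print(sorted(selected))    # ['id', 'len_occupation']
print(sorted(transmuted))  # ['age', 'id', 'len_occupation']
print(sorted(row))         # ['age', 'id', 'occupation'] (the input is unchanged)
```

All three helpers return a new dict and leave `row` alone, mirroring the functional behavior described above.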

Finally, you can add global fields with [annotate_globals](https://hail.is/docs/devel/hail.Table.html#hail.Table.annotate_globals).  Globals are useful for storing metadata about a dataset or storing small data structures like sets and maps.

In [None]:
t = users.annotate_globals(cohort = 5, cloudable = hl.set(['sample1', 'sample10', 'sample15']))
t.describe()

In [None]:
t.cloudable

In [None]:
t.cloudable.value

# Exercise

Try one of:

 - Z-score normalize the age field of `users`.
 - Convert `zip` to an integer.  Hints: Not all zipcodes are US zipcodes!  Use [hl.int32](https://hail.is/docs/devel/functions/constructors.html#hail.expr.functions.int32) to convert a string to an integer.  Use [StringExpression.matches](https://hail.is/docs/devel/expressions.html#hail.expr.expression.StringExpression.matches) to see if a string matches a regular expression.
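
As a reminder for the first option, the z-score of a value is its distance from the mean measured in standard deviations.  Writing $\mu$ and $\sigma$ for the mean and standard deviation of `age` (assuming the struct returned by `hl.agg.stats` in the Annotate section exposes them as `stats.mean` and `stats.stdev`), the normalized age of user $i$ is:

```latex
z_i = \frac{\mathrm{age}_i - \mu}{\sigma}
```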