## Introduction to the Expression Language

This notebook starts with the basics of the Hail expression language and builds up practical experience with its type system, syntax, and functionality. By the end of this notebook, we hope that you will be comfortable enough to start using the expression language to slice, dice, filter, and query genetic data. Those tasks are covered in the next notebook!

The best part about a Jupyter Notebook is that you don't just have to run what we've written - you can and **should** change the code and see what happens!

## Setup

Every Hail practical notebook starts the same way: import the necessary modules, and construct a [HailContext](https://hail.is/hail/hail.HailContext.html#hail.HailContext). This is the entry point for Hail functionality. This object also wraps a SparkContext, which can be accessed with `hc.sc`.

As always, visit the [documentation](https://hail.is/hail/api.html) on the Hail website for full reference.

In [None]:
from hail import *
hc = HailContext()

## Hail Expression Language

The Hail expression language is used everywhere in Hail: filtering conditions, describing covariates and phenotypes, storing summary statistics about variants and samples, generating synthetic data, plotting, exporting, and more. Expressions are written as Python strings and passed into various Hail methods like [filter_variants_expr](https://hail.is/hail/hail.VariantDataset.html#hail.VariantDataset.filter_variants_expr) and [linear regression](https://hail.is/hail/hail.VariantDataset.html#hail.VariantDataset.linreg).

The expression language is a programming language just like Python or R or Scala. While the syntax is different, programming experience will certainly translate. We have built the expression language with the hope that even people new to programming are able to use it to explore genetic data, even if this means copying motifs and expressions found in places like the [Hail discussion forum](http://discuss.hail.is/).

For learning purposes, `HailContext` contains the method [eval_expr_typed](https://hail.is/hail/hail.HailContext.html#hail.HailContext.eval_expr_typed). This method takes a Python string of Hail expr code, evaluates it, and returns a tuple with the result and the type.  We'll be using this method throughout the expression language tutorial.

## Hail Types


The Hail expression language is strongly typed, meaning that every _expression_ has an associated type.

Hail defines the following types:

Primitives:
 - [Int](https://hail.is/hail/types.html#int)
 - [Double](https://hail.is/hail/types.html#double)
 - [Float](https://hail.is/hail/types.html#float)
 - [Long](https://hail.is/hail/types.html#long)
 - [Boolean](https://hail.is/hail/types.html#boolean)
 - [String](https://hail.is/hail/types.html#string)
 
Compound Types:
 - [Array[T]](https://hail.is/hail/types.html#array)
 - [Set[T]](https://hail.is/hail/types.html#set)
 - [Dict[K, V]](https://hail.is/hail/types.html#dict)
 - [Aggregable[T]](https://hail.is/hail/types.html#aggregable)
 - [Struct](https://hail.is/hail/types.html#struct)
 
Genetic Types:
 - [Variant](https://hail.is/hail/types.html#variant)
 - [Locus](https://hail.is/hail/types.html#locus)
 - [AltAllele](https://hail.is/hail/types.html#altallele)
 - [Interval](https://hail.is/hail/types.html#interval)
 - [Genotype](https://hail.is/hail/types.html#genotype)
 - [Call](https://hail.is/hail/types.html#call)

## Primitive Types

Let's start with simple primitive types. Primitive types are a basic building block for any programming language - these are things like numbers and strings and boolean values. 

Hail expressions are passed as Python strings to Hail methods.

In [None]:
# the Boolean literals are 'true' and 'false'
hc.eval_expr_typed('true') 

The return value is `True`, not `true`.  Why?  When values are returned by Hail methods, they are returned as the corresponding Python value.

In [None]:
hc.eval_expr_typed('123')

In [None]:
hc.eval_expr_typed('123.45')

String literals are denoted with double-quotes. The 'u' preceding the printed result denotes a unicode string, and is safe to ignore.

In [None]:
hc.eval_expr_typed('"Hello, world"')

Primitive types support all the usual operations you'd expect.  For details, refer to the documentation on [operators](https://hail.is/hail/operators.html) and [types](https://hail.is/hail/types.html). Here are some examples.

In [None]:
hc.eval_expr_typed('3 + 8')

In [None]:
hc.eval_expr_typed('3.2 * 0.5')

In [None]:
hc.eval_expr_typed('3 ** 3')

In [None]:
hc.eval_expr_typed('25 ** 0.5')

In [None]:
hc.eval_expr_typed('true || false')

In [None]:
hc.eval_expr_typed('true && false')

## Missingness

As in R, any value in Hail can be missing.  Most operations, like addition, return missing if any of their inputs is missing.  There are a few special operations for manipulating missing values.  There is also a missing literal, but you have to specify its type.  Missing Hail values are converted to `None` in Python.

In [None]:
hc.eval_expr_typed('NA: Int') # missing Int

In [None]:
hc.eval_expr_typed('NA: Dict[String, Int]')

In [None]:
hc.eval_expr_typed('1 + NA: Int')

You can test missingness with `isDefined` and `isMissing`.

In [None]:
hc.eval_expr_typed('isDefined(1)')

In [None]:
hc.eval_expr_typed('isDefined(NA: Int)')

In [None]:
hc.eval_expr_typed('isMissing(NA: Double)')

`orElse` lets you convert missing to a default value, and `orMissing` lets you turn a value into missing based on a condition.

In [None]:
hc.eval_expr_typed('orElse(5, 2)')

In [None]:
hc.eval_expr_typed('orElse(NA: Int, 2)')

In [None]:
hc.eval_expr_typed('orMissing(true, 5)')

In [None]:
hc.eval_expr_typed('orMissing(false, 5)')
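For readers who think more easily in Python, the missingness rules above can be modelled with `None` standing in for a missing value. This is only a rough sketch of the semantics described in this section, not how Hail implements them; the helper names `add`, `or_else`, and `or_missing` are made up for illustration.

```python
# Plain-Python model of Hail missingness; None plays the role of NA.
# These helpers are illustrative stand-ins, not Hail functions.

def add(x, y):
    """Addition that returns missing (None) if either input is missing."""
    if x is None or y is None:
        return None
    return x + y

def or_else(x, default):
    """Model of orElse: fall back to `default` only when x is missing."""
    return default if x is None else x

def or_missing(predicate, value):
    """Model of orMissing: keep `value` when the predicate is true, else missing."""
    return value if predicate else None

print(add(1, None))          # None
print(or_else(5, 2))         # 5
print(or_else(None, 2))      # 2
print(or_missing(True, 5))   # 5
print(or_missing(False, 5))  # None
```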

## Let

You can assign a value to a variable with a `let` expression.  Here is an example.

In [None]:
hc.eval_expr_typed('let a = 5 in a + 1')

The variable, here `a`, is only visible in the body of the let, which is the expression following `in`.  You can assign multiple variables.  Variable assignments are separated by `and`.  Each variable is visible in the right-hand side of the following variables as well as the body of the let.  For example:

In [None]:
hc.eval_expr_typed('''
let a = 5
and b = a + 1
 in a * b
''')
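One way to picture the scoping rule is to rewrite the `let` above as nested Python functions, where each binding becomes a parameter that exists only inside the expression that follows it. This is an analogy for intuition, not a description of how Hail evaluates `let`.

```python
# let a = 5 and b = a + 1 in a * b, written as nested functions.
# `a` is visible while computing `b` and in the body; `b` only in the body.
result = (lambda a: (lambda b: a * b)(a + 1))(5)

print(result)  # 30
```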

## Conditionals

Unlike in many other languages, conditionals in Hail are expressions that return a value.  The arms of the conditional must have the same type.  The predicate must be of type Boolean.  If the predicate is missing, the value of the entire conditional is missing.  Here are some simple examples.

In [None]:
hc.eval_expr_typed('if (true) 1 else 2')

In [None]:
hc.eval_expr_typed('if (false) 1 else 2')

In [None]:
hc.eval_expr_typed('if (NA: Boolean) 1 else 2')
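The missing-predicate rule differs from Python, where `None` simply counts as false. The sketch below models the behaviour described above with a hypothetical helper `hail_if` (not part of Hail) and contrasts it with Python's own conditional expression.

```python
def hail_if(predicate, then_value, else_value):
    """Model of a Hail conditional: a missing predicate gives a missing result."""
    if predicate is None:
        return None
    return then_value if predicate else else_value

print(hail_if(True, 1, 2))   # 1
print(hail_if(False, 1, 2))  # 2
print(hail_if(None, 1, 2))   # None

# Plain Python treats None as false, so it picks the else arm instead.
print(1 if None else 2)      # 2
```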

The `if` and `else` branches need to return the same type. The below expression is invalid.

In [None]:
# Uncomment and run the below code to see the error message

# hc.eval_expr_typed('if (true) 1 else "two"')

## Compound Types

Hail has several compound types:
 - [Array[T]](https://hail.is/hail/types.html#array)
 - [Set[T]](https://hail.is/hail/types.html#set)
 - [Dict[K, V]](https://hail.is/hail/types.html#dict)
 - [Aggregable[T]](https://hail.is/hail/types.html#aggregable)
 - [Struct](https://hail.is/hail/types.html#struct)
 
`T`, `K` and `V` here mean any type, including other compound types.  Hail's `Array[T]` objects are similar to Python's lists, except they must be homogeneous: that is, each element must be of the same type.  Arrays are 0-indexed.  Here are some examples of simple array expressions.

Array literals are constructed with square brackets.

In [None]:
hc.eval_expr_typed('[1, 2, 3, 4, 5]')

Arrays are indexed with square brackets and support Python's slice syntax.

In [None]:
hc.eval_expr_typed('let a = [1, 2, 3, 4, 5] in a[0]')

In [None]:
hc.eval_expr_typed('let a = [1, 2, 3, 4, 5] in a[1:3]')

In [None]:
hc.eval_expr_typed('let a = [1, 2, 3, 4, 5] in a[1:]')

In [None]:
hc.eval_expr_typed('let a = [1, 2, 3, 4, 5] in a.length()')
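Since the indexing and slicing syntax is borrowed from Python, the same operations on a plain Python list are a handy reference for what to expect from the cells above. This assumes the Hail results mirror Python's, as the text suggests; run the Hail cells to confirm.

```python
a = [1, 2, 3, 4, 5]

print(a[0])    # 1
print(a[1:3])  # [2, 3]
print(a[1:])   # [2, 3, 4, 5]
print(len(a))  # 5, the Python counterpart of a.length()
```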

Arrays can be transformed with the functional operators `filter` and `map`.  These operations return a new array and never modify the original.

In [None]:
# keep the elements that are less than 10
hc.eval_expr_typed('let a = [1, 2, 22, 7, 10, 11] in a.filter(x => x < 10)')

In [None]:
# square the elements of an array
hc.eval_expr_typed('let a = [1, 2, 22, 7, 10, 11] in a.map(x => x * x)')

In [None]:
# combine the two: keep elements less than 10 and then square them
hc.eval_expr_typed('let a = [1, 2, 22, 7, 10, 11] in a.filter(x => x < 10).map(x => x * x)')

In the above filter / map expressions, you can see a strange syntax:

`x => x < 10`

This syntax is a [lambda function](https://en.wikipedia.org/wiki/Anonymous_function). The functions `filter` and `map` take _functions_ as arguments! A Hail lambda function takes the form:

`binding => expression`

That we named the binding 'x' in every example above is a point of preference, and no more. We can name the bindings anything we want.

In [None]:
# use 'foo' and 'bar' as bindings
hc.eval_expr_typed('let a = [1, 2, 22, 7, 10, 11] in a.filter(foo => foo < 10).map(bar => bar * bar)')

The full list of methods on arrays can be found [here](https://hail.is/hail/types.html#array-t).
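If the `filter` / `map` chain feels unfamiliar, it may help to see the same computation in plain Python, where `lambda x: x < 10` plays the role of `x => x < 10`. This is a Python analogy only; none of these calls are Hail functions.

```python
a = [1, 2, 22, 7, 10, 11]

# filter, then map, using Python lambdas
small = list(filter(lambda x: x < 10, a))
squares = list(map(lambda x: x * x, small))

# the same thing as a list comprehension
squares_again = [x * x for x in a if x < 10]

print(small)    # [1, 2, 7]
print(squares)  # [1, 4, 49]
print(a)        # [1, 2, 22, 7, 10, 11] (the original list is unchanged)
```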

## Numeric Arrays

Numeric arrays, like `Array[Int]` and `Array[Double]`, have additional operations like `max`, `mean`, `median`, and `sort`.  For a full list, see, for example, [Array[Int]](https://hail.is/hail/types.html#array-int).  Here are a few examples.

In [None]:
hc.eval_expr_typed('[1, 2, 22, 7, 10, 11].sum()')

In [None]:
hc.eval_expr_typed('[1, 2, 22, 7, 10, 11].max()')

In [None]:
hc.eval_expr_typed('[1, 2, 22, 7, 10, 11].mean()')

In [None]:
# take the square root of each element
hc.eval_expr_typed('let a = [1, 2, 22, 7, 10, 11] in a.map(x => x ** 0.5)')
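As a cross-check on the numbers, here are the same aggregations computed on a plain Python list. The code is pure Python and only mirrors the Hail expressions above; how Hail formats or types its results may differ.

```python
a = [1, 2, 22, 7, 10, 11]

total = sum(a)
largest = max(a)
mean = total / float(len(a))    # float() keeps this correct on Python 2 as well
roots = [x ** 0.5 for x in a]   # square root of each element

print(total)    # 53
print(largest)  # 22
print(mean)
print(roots)
```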

## Exercise

Write an expression that calculates the sum of the squared residuals (x - mean) of an array.

In [None]:
# Uncomment the below code by deleting the triple-quotes and write an expression to calculate the sum of the squared residuals.

"""
result, t = hc.eval_expr_typed('''
let a = [1, -2, 11, 3, -2]
and mean = <FILL IN>
in a.map(x => <FILL IN> ).sum() 
''')
"""

try:
    print('Your result: %s (%s)' % (result, t))
    print('Expected answer:  114.8 (Double)')
except NameError:
    print('### Remove the triple quotes around the above code to start the exercise ### ')

What if `a` contains a missing value (`NA: Int`)? Will your code still work?

## Structs

A `Struct` is a collection of named values known as fields.  Hail does not have tuples like Python.  Unlike arrays, the values can be heterogeneous.  Unlike `Dict`s, the set of names is part of the type and must be known statically.  `Struct`s are constructed with a syntax similar to Python's `dict` syntax.  `Struct` fields are accessed using the `.` syntax.

In [None]:
print(hc.eval_expr_typed('{gene: "ACBD", function: "LOF", nHet: 12}'))

In [None]:
hc.eval_expr_typed('let s = {gene: "ACBD", function: "LOF", nHet: 12} in s.gene')

In [None]:
hc.eval_expr_typed('let s = NA: Struct { gene: String, function: String, nHet: Int} in s.gene')
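A loose Python analogue of a `Struct` is a `namedtuple`: the field names are fixed up front and the values can have different types. The sketch below also models field access on a missing struct, on the assumption that it yields a missing value (run the last Hail cell above to check). `GeneSummary` and `get_field` are illustrative names, not part of Hail.

```python
from collections import namedtuple

# Field names are fixed when the type is defined, much like a Struct's schema.
GeneSummary = namedtuple('GeneSummary', ['gene', 'function', 'nHet'])

def get_field(struct, name):
    """Field access that yields missing (None) when the struct itself is missing."""
    if struct is None:
        return None
    return getattr(struct, name)

s = GeneSummary(gene='ACBD', function='LOF', nHet=12)

print(s.gene)                   # ACBD
print(get_field(s, 'nHet'))     # 12
print(get_field(None, 'gene'))  # None
```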

## Genetic Types

Hail contains several genetic types:
 - [Variant](https://hail.is/hail/types.html#variant)
 - [Locus](https://hail.is/hail/types.html#locus)
 - [AltAllele](https://hail.is/hail/types.html#altallele)
 - [Interval](https://hail.is/hail/types.html#interval)
 - [Genotype](https://hail.is/hail/types.html#genotype)
 - [Call](https://hail.is/hail/types.html#call)
 
These are designed to make it easy to manipulate genetic data. There are many built-in functions for asking common questions about these data types, like whether an alternate allele is a SNP, or the fraction of reads at a called genotype that belong to the reference allele.

## Demo variables

To explore these types and constructs, we have defined five representative variables which you can access in `eval_expr`:

In [None]:
# 'v' is used to indicate 'Variant' in Hail
hc.eval_expr_typed('v')

In [None]:
# 's' is used to refer to sample ID in Hail
hc.eval_expr_typed('s')

In [None]:
# 'g' is used to refer to the genotype in Hail
hc.eval_expr_typed('g')

In [None]:
# 'sa' is used to refer to sample annotations
hc.eval_expr_typed('sa')

The above output is a bit wordy. Let's try `'va'`:

In [None]:
# 'va' is used to refer to variant annotations
hc.eval_expr_typed('va')

**This is totally illegible.** `pprint` **can solve our problems!**

`pprint` is a Python standard library module that tries to print objects legibly. Let's try it out here:

In [None]:
from pprint import pprint

In [None]:
# 'va' is used to refer to variant annotations
pprint(hc.eval_expr_typed('va'))

You'll rarely need to construct a `Variant` or `Genotype` object inside the Hail expression language. More commonly, these objects will be provided to you as variables. In the remainder of this notebook, we will explore how to manipulate the demo variables. In the next notebook, we start using the expression language to annotate and filter a dataset.

First, a short demonstration of some of the methods accessible on `Variant` and `Genotype` objects:

In [None]:
hc.eval_expr_typed('v')

In [None]:
hc.eval_expr_typed('v.contig')

In [None]:
hc.eval_expr_typed('v.start')

In [None]:
hc.eval_expr_typed('v.ref')

In [None]:
hc.eval_expr_typed('v.altAlleles')

In [None]:
hc.eval_expr_typed('v.altAlleles.map(aa => aa.isSNP())')

In [None]:
hc.eval_expr_typed('v.altAlleles.map(aa => aa.isInsertion())')

In [None]:
hc.eval_expr_typed('g')

In [None]:
hc.eval_expr_typed('g.dp')

In [None]:
hc.eval_expr_typed('g.ad')

In [None]:
hc.eval_expr_typed('g.fractionReadsRef()')

In [None]:
hc.eval_expr_typed('g.isHet()')

## Wrangling complex nested types

Structs and Arrays allow arbitrarily deep grouping and nesting of values.

Remember the type of `sa`:

In [None]:
pprint(hc.eval_expr_typed('sa')[1])

Select elements of a `Struct` with a `'.'`. If we want to select `PC1` from the above type, we first index into the top-level struct with `covariates`, then select the field with `PC1`:

In [None]:
hc.eval_expr_typed('sa.covariates.PC1')

We can construct an array from the struct elements:

In [None]:
hc.eval_expr_typed('[sa.covariates.PC1, sa.covariates.PC2, sa.covariates.PC3]')

Now we'll use `va`. Here's the type of `va`:

In [None]:
pprint(hc.eval_expr_typed('va')[1])

This schema is somewhat representative of typical variant annotations: `AC`, `AN`, and `AF` are typically included in the `INFO` field of a VCF.

In [None]:
hc.eval_expr_typed('va.info.AF')

In [None]:
hc.eval_expr_typed('va.info.AF[1]')

AC and AF mean "allele count" and "allele frequency" and are "A-indexed", which means that there is one element per alternate allele. Perhaps we want to construct an array which contains each alternate allele and its count and frequency.

In [None]:
pprint(hc.eval_expr_typed('''range(v.altAlleles.length()).map(i => 
                      {allele: v.altAlleles[i], 
                       count: va.info.AC[i], 
                       frequency: va.info.AF[i]})'''))
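The `range(...).map(i => ...)` motif above walks several parallel arrays by index and bundles the i-th entries together. Here is the same idea in plain Python, using made-up alleles, counts, and frequencies rather than the demo values in `v` and `va`.

```python
# Hypothetical A-indexed data: one entry per alternate allele.
alt_alleles = ['T', 'TA']
allele_counts = [12, 3]
allele_freqs = [0.06, 0.015]

# Build one record per alternate allele by indexing the parallel lists.
records = [
    {'allele': alt_alleles[i],
     'count': allele_counts[i],
     'frequency': allele_freqs[i]}
    for i in range(len(alt_alleles))
]

for record in records:
    print(record)
```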

Now, let's manipulate the `va.transcripts` array. Here's what it looks like:

In [None]:
hc.eval_expr_typed('va.transcripts')

We'll start by pulling out just the gene field. Our result will be an `Array[String]`. We need to do this with the `map` function, to map each struct element of the array to its field `gene`.

In [None]:
hc.eval_expr_typed('va.transcripts.map(t => t.gene)')

Perhaps we just want the set of unique genes:

In [None]:
hc.eval_expr_typed('va.transcripts.map(t => t.gene).toSet()')

We can find the canonical transcript with `find`, which returns the first element where the predicate is true:

In [None]:
hc.eval_expr_typed('va.transcripts.find(t => t.canonical)')

However, `find` returns `None` if there isn't an element where the predicate is true:

In [None]:
hc.eval_expr_typed('va.transcripts.find(t => t.gene == "GENE5")')
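In Python terms, `find` behaves much like `next` over a generator with a default of `None`. The sketch below uses a hypothetical `find` helper and invented transcript records, not the demo data in `va.transcripts`.

```python
# Invented records for illustration only.
transcripts = [
    {'gene': 'GENE_A', 'canonical': False},
    {'gene': 'GENE_B', 'canonical': True},
    {'gene': 'GENE_C', 'canonical': True},
]

def find(items, predicate):
    """Return the first item satisfying the predicate, or None if there is none."""
    return next((item for item in items if predicate(item)), None)

first_canonical = find(transcripts, lambda t: t['canonical'])
no_match = find(transcripts, lambda t: t['gene'] == 'GENE_Z')

print(first_canonical['gene'])  # GENE_B
print(no_match)                 # None
```

Because the result can be missing, any code that uses it afterwards should be prepared to handle `None`.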

Now, we'll pull out all transcripts marked "MIS" (missense):

In [None]:
hc.eval_expr_typed('va.transcripts.filter(t => t.consequence == "MIS")')

Here's a bit of a complex motif - we can sort the transcripts by an arbitrary function. Here we'll sort so that `"LOF"` comes before `"MIS"`, and `"MIS"` comes before `"SYN"`.

In [None]:
hc.eval_expr_typed('''va.transcripts.sortBy(t => 
                        if (t.consequence == "LOF") 1 
                        else if (t.consequence == "MIS") 2 
                        else 3)''')

If we are interested in pulling out the worst-consequence transcript, we can use this sorting motif and then take the first element:

In [None]:
hc.eval_expr_typed('''va.transcripts.sortBy(t => 
                        if (t.consequence == "LOF") 1 
                        else if (t.consequence == "MIS") 2 
                        else 3)[0]''')
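The sort-then-take-first motif has a direct Python counterpart: `sorted` with a key function, followed by `[0]`. The records and the `severity` helper below are invented for illustration and are not the demo data.

```python
# Invented records for illustration only.
transcripts = [
    {'gene': 'GENE_A', 'consequence': 'SYN'},
    {'gene': 'GENE_B', 'consequence': 'MIS'},
    {'gene': 'GENE_C', 'consequence': 'LOF'},
    {'gene': 'GENE_D', 'consequence': 'MIS'},
]

RANK = {'LOF': 1, 'MIS': 2}

def severity(t):
    """Lower rank means worse consequence; anything else sorts last."""
    return RANK.get(t['consequence'], 3)

ordered = sorted(transcripts, key=severity)
worst = ordered[0]

print([t['gene'] for t in ordered])  # ['GENE_C', 'GENE_B', 'GENE_D', 'GENE_A']
print(worst['gene'])                 # GENE_C
```

In Python, `min(transcripts, key=severity)` gives the same element without sorting the whole list, which is worth knowing when only the worst record is needed.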

## Learn more!

- [Basic language constructs](https://hail.is/hail/language_constructs.html)

- [Operators](https://hail.is/hail/operators.html)

- [Functions](https://hail.is/hail/functions.html)

- [Types](https://hail.is/hail/types.html)

## Exercises

Uncomment the code blocks, fill them in, and run each block to check your answers.

In [None]:
def check(answer, answer_key):
    print('Your answer / type:')
    pprint(answer)
    print('')
    if (answer == answer_key):
        print('Correct!')
    else:
        print('Incorrect. Expected:')
        pprint(answer_key)

**Exercise 1: using** `filter` **and** `map` **to pull out the gene isoform for synonymous transcripts**

In [None]:
"""
result_1 = hc.eval_expr_typed(
'''
va.transcripts.filter(t => <FILL IN>)
  .map(t => <FILL IN>)
''')
"""
# check the answer
try:
    answer_key = [u'GENE1.1', u'GENE3.1', u'GENE3.2'], TArray(TString())
    check(result_1, answer_key)
except NameError:
    print('### Remove the triple quotes around the above code to start the exercise ### ')

**Exercise 2: using** `groupBy` **and** `mapValues` **to produce a mapping from gene to all observed consequences**

Remember: `<array>.toSet()` converts an array to a Set, the desired type of the dictionary value.

Hint: Once you've grouped by gene, you can fill in the `mapValues` step with `ts => ts` to see the type of `ts`. It's an `Array[Struct{...}]`. How do we pull just one field out?

In [None]:
"""
result_2 = hc.eval_expr_typed(
'''
  va.transcripts.groupBy(t => <FILL IN>)
    .mapValues(ts => <FILL IN>)
''')
"""

# check the answer
try:
    answer_key = {u'GENE1': {u'LOF', u'SYN'}, u'GENE2': {u'MIS'}, u'GENE3': {u'SYN'}}, TDict(TString(), TSet(TString()))
    check(result_2, answer_key)
except NameError:
    print('### Remove the triple quotes around the above code to start the exercise ### ')

**Exercise 3: Do the reverse: group** `va.transcripts` **by consequence, and produce a mapping from consequence to all genes with that consequence**

In [None]:
"""
result_3 = hc.eval_expr_typed(
'''
<FILL IN>
''')
"""

# check the answer
try:
    answer_key = {u'LOF': {u'GENE1'}, u'MIS': {u'GENE2'}, u'SYN': {u'GENE1', u'GENE3'}}, TDict(TString(), TSet(TString()))
    check(result_3, answer_key)
except NameError:
    print('### Remove the triple quotes around the above code to start the exercise ### ')