# Hail on Jupyter

From https://jupyter.org:

"The Jupyter Notebook is an open-source web application that allows you to create and share documents that contain live code, equations, visualizations and narrative text. Uses include: data cleaning and transformation, numerical simulation, statistical modeling, data visualization, machine learning, and much more."

In the last year, the Jupyter development team released JupyterLab, an integrated environment for data, code, and visualizations. If you've used RStudio, this is the closest thing that works in Python (and many other languages!).


Part of what we think is so exciting about Hail is that it has coincided with a larger shift in the data science community.

Three years ago, most computational biologists at Broad analyzed genetic data using command-line tools, and took advantage of research compute clusters by explicitly using scheduling frameworks like LSF or Sun Grid Engine.

Now, they have the option to use Hail in interactive Python notebooks backed by thousands of cores on public compute clouds like [Google Cloud](https://cloud.google.com/), [Amazon Web Services](https://aws.amazon.com/), or [Microsoft Azure](https://azure.microsoft.com/).

# Using Jupyter
### Running cells
Evaluate cells using `SHIFT + ENTER`. Select the next cell and run it.

In [1]:
print('Hello, world')

Hello, world


### Modes

Jupyter has two modes, a **navigation mode** and an **editor mode**.

#### Navigation mode:

 - <font color="blue"><strong>BLUE</strong></font> cell borders
 - `UP` / `DOWN` move between cells
 - `ENTER` while a cell is selected will move to **editor mode**.
 - Many letters are keyboard shortcuts! This is a common trap.
 
#### Editor mode:

 - <font color="green"><strong>GREEN</strong></font> cell borders
 - `UP` / `DOWN` move within cells before moving between cells.
 - `ESC` will return to **navigation mode**.
 - `SHIFT + ENTER` will evaluate a cell and return to **navigation mode**.

### Cell types

There are several types of cells in Jupyter notebooks. The two you will see here are **Markdown** (text) and **Code**.

In [2]:
# This is a code cell
my_variable = 5

**This is a markdown cell**, so even if something looks like code (as below), it won't get executed!

my_variable += 1

### Tips and tricks

Keyboard shortcuts:

 - `SHIFT + ENTER` to evaluate a cell
 - `ESC` to return to navigation mode
 - `y` to turn a markdown cell into code
 - `m` to turn a code cell into markdown
 - `a` to add a new cell **above** the currently selected cell
 - `b` to add a new cell **below** the currently selected cell
 - `d, d` (repeated) to delete the currently selected cell
 - `TAB` to activate code completion
 
To try this out, create a new cell below this one using `b`, and print `my_variable` by starting with `print(my` and pressing `TAB`!

# Set up our Python environment

In addition to Hail, we import a few methods from the Hail plotting library. We'll see examples soon!

In [3]:
import hail as hl
from hail.plot import output_notebook, show

Now we initialize Hail and set up plotting to display inline in the notebook.

In [4]:
hl.init()
output_notebook()

Running on Apache Spark version 2.4.0
SparkUI available at http://10.0.0.74:4042
Welcome to
     __  __     <>__
    / /_/ /__  __/ /
   / __  / _ `/ / /
  /_/ /_/\_,_/_/_/   version 0.2.35-577378849928
LOGGING: writing to /Users/kumar/Dropbox (Partners HealthCare)/HailTeam/Workshops/BroadE/hail-20200417-1705-0.2.35-577378849928.log


The workshop materials are designed to work on a small (~20MB) downsampled chunk of the public 1000 Genomes dataset.


It is possible to call command-line utilities from Jupyter by prefixing a line with a `!`:

In [5]:
! ls -1 resources/

1kg.fam
1kg.mt
1kg.vcf.bgz
1kg_annotations.txt
Icon?
ensembl_gene_annotations.txt
pca_scores.ht
post_qc.mt
true_pops.txt


# Part 1: Explore genetic data with Hail

#### Learning Objectives:

- To be comfortable exploring Hail data structures.
- To understand categories of functionality for performing QC.

### Import data from VCF

The [Variant Call Format (VCF)](https://en.wikipedia.org/wiki/Variant_Call_Format) is a common file format for representing genetic data collected on multiple individuals (samples).

Hail's [import_vcf](https://hail.is/docs/0.2/methods/impex.html#hail.methods.import_vcf) function can read this format.

However, VCF is a text format that is easy for humans to read, but very inefficient to process on a computer. 

The first thing we do is import the VCF file with `import_vcf` and convert it into Hail's native file format using the `write` method below. The resulting file is **much** faster to process because it is a partitioned binary format that Hail can read in parallel.
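
To get a feel for why a text format is slow to process, here is a minimal plain-Python sketch that parses a single VCF data line. The line itself is made up for illustration, and `parse_vcf_line` is a hypothetical helper, not part of Hail; a real parser would also handle headers, missing values and validation.

```python
# A made-up VCF data line (tab-separated). Columns follow the VCF layout:
# CHROM POS ID REF ALT QUAL FILTER INFO FORMAT, then one column per sample.
vcf_line = "1\t904165\t.\tG\tA\t52300\tPASS\tAN=5020\tGT:DP\t0/0:7\t0/1:12\t./.:."


def parse_vcf_line(line):
    """Split one VCF data line into a small dict (toy parser, no validation)."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, _id, ref, alt = fields[:5]
    format_keys = fields[8].split(":")
    # Every genotype is stored as text, so it has to be split and converted
    # again each time the file is read.
    samples = [dict(zip(format_keys, s.split(":"))) for s in fields[9:]]
    return {
        "locus": f"{chrom}:{pos}",
        "alleles": [ref] + alt.split(","),
        "samples": samples,
    }


record = parse_vcf_line(vcf_line)
print(record["locus"], record["alleles"], record["samples"][1])
```

With hundreds of samples and thousands of variants, this string splitting happens millions of times per pass over the data, which is the work the native format avoids.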

Let's read in a chunk of data from 1000 Genomes:

In [6]:
hl.import_vcf('resources/1kg.vcf.bgz', min_partitions=4).write('resources/1kg.mt', overwrite=True)

2020-04-17 17:06:06 Hail: INFO: Coerced sorted dataset
2020-04-17 17:06:22 Hail: INFO: wrote matrix table with 13033 rows and 343 columns in 4 partitions to resources/1kg.mt


### Read 1KG into Hail

We represent genetic data as a Hail [`MatrixTable`](https://hail.is/docs/0.2/overview/matrix_table.html), and name our variable `mt` to indicate this.

In [7]:
mt = hl.read_matrix_table('resources/1kg.mt')

### What is a `MatrixTable`?

Let's explore it!

In the description below, you can see fields with these types:
 - **numeric** types:
     - integers (`int32`, `int64`), e.g. `5`
     - floating point numbers (`float32`, `float64`), e.g. `5.5` or `3e-8`
 - **strings** (`str`), e.g. `"Foo"`
 - **boolean** values  (`bool`) e.g. `True`
 - **collections**:
     - arrays (`array`), e.g. `[1,1,2,3]`
     - sets (`set`), e.g. `{1,3}`
     - dictionaries (`dict`), e.g. `{'Foo': 5, 'Bar': 10}`
 - **genetic data types**:
     - loci (`locus`), e.g. `[GRCh37] 1:10000` or `[GRCh38] chr1:10024`
     - genotype calls (`call`), e.g. `0/2` or `1|0`
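
The genetic types are easier to remember once you see what their text forms encode. The sketch below is plain Python, not Hail: `parse_locus` and `parse_call` are hypothetical helpers written only for this illustration, and they assume well-formed, non-missing input.

```python
def parse_locus(text):
    """Split 'contig:position' into (contig, position)."""
    contig, position = text.rsplit(":", 1)
    return contig, int(position)


def parse_call(text):
    """Return (allele indices, phased) for a call such as '0/2' or '1|0'."""
    # '|' separates alleles of a phased call, '/' those of an unphased call.
    phased = "|" in text
    separator = "|" if phased else "/"
    alleles = [int(a) for a in text.split(separator)]
    return alleles, phased


print(parse_locus("chr1:10024"))
print(parse_call("0/2"))
print(parse_call("1|0"))
```

A call stores allele indices rather than bases: `0` is the reference allele, and `1`, `2`, ... index into the alternate alleles at that locus.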

In [8]:
mt.describe(widget=True)

VBox(children=(HBox(children=(Button(description='globals', layout=Layout(height='30px', width='65px'), style=…

Tab(children=(VBox(children=(HTML(value='<p><big>Global fields, with one value in the dataset.</big></p>\n<p>C…

### `show`

You can show individual fields like the sample ID (`s`), 

In [9]:
mt.s.show()

s
str
"""HG00096"""
"""HG00099"""
"""HG00105"""
"""HG00118"""
"""HG00129"""
"""HG00148"""
"""HG00177"""
"""HG00182"""
"""HG00242"""
"""HG00254"""


the locus (`locus`)

In [10]:
mt.locus.show()

locus,alleles
locus<GRCh37>,array<str>
1:904165,"[""G"",""A""]"
1:909917,"[""G"",""A""]"
1:986963,"[""C"",""T""]"
1:1509414,"[""AG"",""A""]"
1:1563691,"[""T"",""G""]"
1:1707740,"[""T"",""G""]"
1:2044130,"[""GTT"",""G""]"
1:2169908,"[""G"",""T""]"
1:2252970,"[""C"",""T""]"
1:2284195,"[""T"",""C""]"


or the called genotype (`GT`):

In [11]:
mt.GT.show()

locus,alleles,HG00096.GT,HG00099.GT,HG00105.GT,HG00118.GT,HG00129.GT,HG00148.GT,HG00177.GT,HG00182.GT,HG00242.GT,HG00254.GT,HG00265.GT,HG00271.GT,HG00274.GT
locus<GRCh37>,array<str>,call,call,call,call,call,call,call,call,call,call,call,call,call
1:904165,"[""G"",""A""]",0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0
1:909917,"[""G"",""A""]",0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0,
1:986963,"[""C"",""T""]",0/0,0/0,0/0,0/0,,0/0,,0/0,0/0,0/0,0/0,0/0,
1:1509414,"[""AG"",""A""]",0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0
1:1563691,"[""T"",""G""]",,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0
1:1707740,"[""T"",""G""]",0/1,0/1,0/1,0/0,0/0,0/1,0/1,0/1,0/0,0/1,0/0,0/0,0/0
1:2044130,"[""GTT"",""G""]",0/1,0/1,0/1,0/1,0/1,0/0,0/0,0/1,0/0,0/0,0/0,0/1,0/0
1:2169908,"[""G"",""T""]",0/0,0/0,0/1,0/0,0/0,0/0,0/0,0/0,0/0,0/0,,0/0,0/0
1:2252970,"[""C"",""T""]",0/0,,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0
1:2284195,"[""T"",""C""]",1/1,0/1,0/1,0/1,0/0,1/1,0/0,0/0,0/0,0/1,0/0,0/1,0/1


### `summarize`
`summarize` prints (potentially) useful information about any field or object:

In [12]:
mt.DP.summarize()

0,1
Non-missing,4399949 (98.43%)
Missing,70370 (1.57%)
Minimum,0
Maximum,150
Mean,7.26
Std Dev,4.40


In [13]:
mt.AD.summarize()

0,1
Non-missing,4399949 (98.43%)
Missing,70370 (1.57%)
Min Size,2
Max Size,2
Mean Size,2.00

0,1
Non-missing,8799898 (100.00%)
Missing,0
Minimum,0
Maximum,150
Mean,3.63
Std Dev,4.34


### `count`

`MatrixTable.count` returns a tuple with the number of rows (variants) and number of columns (samples).

In [14]:
mt.count()

(13033, 343)

### <font color="brightred"><strong>Exercise: </strong></font> explore other fields

Using `show` and `summarize`, explore some of the other fields.

To print fields inside the `info` structure, you must add another dot, e.g. `mt.info.AN`.

1. When using `show()`, what do you notice being printed alongside some of the fields?

2. Try replacing `show` with `take(5)`. How is that different?

*Jupyter tips*: 
1. You can tab complete field names, e.g. `mt.info.<TAB>` to see a list of attributes + methods.
2. The keyboard shortcut to add a new cell below the current one (while in navigation mode) is `b`.

### Hail has functions built for genetics

For example, `hl.summarize_variants` prints useful statistics about the genetic variants in the dataset. These are not part of the generic `summarize()` function, which must support all kinds of data, not just variant data!

In [15]:
hl.summarize_variants(mt)

Number of alleles,Count
2,13033

Allele type,Count
SNP,12800
Deletion,138
Insertion,95

Metric,Value
Transitions,9840.0
Transversions,2960.0
Ratio,3.32

Contig,Count
1,1084
2,1038
3,882
4,788
5,791
6,824
7,671
8,633
9,510
10,647


### Most of Hail's functionality is totally general-purpose!

Functions like `summarize_variants` are built out of Hail's general-purpose data manipulation functionality. We can use Hail to ask arbitrary questions about the data, in addition to built-in library functions:

In [16]:
mt.aggregate_rows(hl.agg.count_where(mt.alleles == ['A', 'T']))

150


Or if we had flight data:

```
data.aggregate(
    hl.agg.count_where(data.arrival_city == 'Denver')
)
```
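
If it helps to see the idea without Hail, `count_where` behaves like counting the rows for which a condition is true. Here is a rough plain-Python equivalent; the flight records are invented, and a list of dicts stands in for the hypothetical `data` table.

```python
# Hypothetical flight records standing in for the `data` table.
flights = [
    {"flight": "A1", "arrival_city": "Denver"},
    {"flight": "B2", "arrival_city": "Boston"},
    {"flight": "C3", "arrival_city": "Denver"},
]

# count_where: count the rows for which the predicate is true.
denver_arrivals = sum(1 for row in flights if row["arrival_city"] == "Denver")
print(denver_arrivals)
```

The difference is that Hail evaluates the same kind of expression in parallel across the partitions of the dataset instead of looping over a list in memory.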

The `counter` aggregator makes it possible to see distributions of categorical data, like alleles:

In [17]:
mt.aggregate_rows(hl.array(hl.agg.counter(mt.alleles)))

[(['A', 'AAAAAT'], 1),
 (['A', 'AAAAGAAAGAAAGAAAGAAAGAAAGAAAG'], 1),
 (['A', 'AAAAT'], 2),
 (['A', 'AAAGG'], 1),
 (['A', 'AATGTGAT'], 1),
 (['A', 'AC'], 4),
 (['A', 'ACT'], 1),
 (['A', 'AG'], 2),
 (['A', 'AT'], 6),
 (['A', 'ATTG'], 1),
 (['A', 'ATTT'], 1),
 (['A', 'ATTTCTTTCTTTCTTTCTTTCTTTCTTTC'], 1),
 (['A', 'C'], 526),
 (['A', 'G'], 2193),
 (['A', 'T'], 150),
 (['AAAAAG', 'A'], 1),
 (['AAAAC', 'A'], 1),
 (['AAAAT', 'A'], 1),
 (['AAAT', 'A'], 1),
 (['AAC', 'A'], 1),
 (['AAG', 'A'], 2),
 (['AAGCATC', 'A'], 1),
 (['AAT', 'A'], 1),
 (['AATATAG', 'A'], 1),
 (['AATG', 'A'], 1),
 (['AC', 'A'], 10),
 (['ACCTCAGTTCT', 'A'], 1),
 (['ACGTAG', 'A'], 1),
 (['ACT', 'A'], 3),
 (['ACTGT', 'A'], 1),
 (['AG', 'A'], 4),
 (['AGGTTTGCTGTATAGCCTCCATTAAAAAAATGAAGAGACCCAACTGATGCTTTAC', 'A'], 1),
 (['AGT', 'A'], 1),
 (['AT', 'A'], 4),
 (['ATAC', 'A'], 1),
 (['ATAGT', 'A'], 1),
 (['ATATATC', 'A'], 1),
 (['ATCT', 'A'], 1),
 (['ATG', 'A'], 1),
 (['ATGG', 'A'], 1),
 (['ATT', 'A'], 1),
 (['ATTGT', 'A'], 1),
 (['C

### Oops!

The insertions and deletions in this dataset are drowning out the interesting counts of the 12 possible SNPs. We can `filter` to SNPs and sort to uncover some interesting biology:

In [18]:
snp_counts = mt.aggregate_rows(
    hl.array(
        hl.agg.filter(
            hl.is_snp(mt.alleles[0], mt.alleles[1]),
            hl.agg.counter(mt.alleles))))
sorted(snp_counts,
       key=lambda x: -x[1])

[(['G', 'A'], 2775),
 (['C', 'T'], 2768),
 (['A', 'G'], 2193),
 (['T', 'C'], 2104),
 (['C', 'A'], 602),
 (['G', 'T'], 567),
 (['T', 'G'], 546),
 (['A', 'C'], 526),
 (['C', 'G'], 225),
 (['G', 'C'], 190),
 (['T', 'A'], 154),
 (['A', 'T'], 150)]
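
As a sanity check, these twelve counts should add up to the transition and transversion numbers that `hl.summarize_variants` reported earlier. The sketch below redoes that arithmetic in plain Python using the counts printed above; `is_transition` is a helper written for this illustration, not a Hail function.

```python
# SNP counts copied from the output above, keyed by (ref, alt).
snp_counts = {
    ("G", "A"): 2775, ("C", "T"): 2768, ("A", "G"): 2193, ("T", "C"): 2104,
    ("C", "A"): 602, ("G", "T"): 567, ("T", "G"): 546, ("A", "C"): 526,
    ("C", "G"): 225, ("G", "C"): 190, ("T", "A"): 154, ("A", "T"): 150,
}

PURINES = {"A", "G"}


def is_transition(ref, alt):
    """A transition swaps purine for purine or pyrimidine for pyrimidine."""
    return (ref in PURINES) == (alt in PURINES)


transitions = sum(n for (ref, alt), n in snp_counts.items() if is_transition(ref, alt))
transversions = sum(n for (ref, alt), n in snp_counts.items() if not is_transition(ref, alt))
ratio = transitions / transversions

print(transitions, transversions, round(ratio, 2))
```

The totals match the `Transitions`, `Transversions` and `Ratio` rows of the summary table, and together they account for all 12800 SNPs.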

### <font color="brightred"><strong>Question: </strong></font> Why do the counts come in pairs? Discuss with neighbors!

# Part 2: Annotation and quality control

## Integrate sample information

We're building toward a genome-wide association test in part 3, but we don't just need genetic data to do a GWAS -- we also need phenotype data! Luckily, our `hl.utils.get_1kg` function also downloaded some simulated phenotype data.

This is a text file:

In [19]:
! head resources/1kg_annotations.txt

s	population	super_population	is_female	purple_hair	caffeine_consumption	six_toes
HG00096	GBR	EUR	false	false	5.0746e+01	false
HG00097	GBR	EUR	true	false	5.0244e+01	false
HG00098	GBR	EUR	false	false	6.3758e+01	false
HG00099	GBR	EUR	true	false	5.3899e+01	false
HG00100	GBR	EUR	true	false	4.1456e+01	false
HG00101	GBR	EUR	false	false	5.4906e+01	false
HG00102	GBR	EUR	true	false	3.8281e+01	false
HG00103	GBR	EUR	false	false	3.8200e+01	false
HG00104	GBR	EUR	true	false	5.1852e+01	false


We can import it as a [Hail Table](https://hail.is/docs/0.2/overview/table.html) with [hl.import_table](https://hail.is/docs/0.2/methods/impex.html?highlight=import_table#hail.methods.import_table).

We call it "sa" for "sample annotations".

In [20]:
sa = hl.import_table('resources/1kg_annotations.txt', 
                      impute=True, 
                      key='s')

2020-04-17 17:06:48 Hail: INFO: Reading table to impute column types
2020-04-17 17:06:48 Hail: INFO: Finished type imputation
  Loading column 's' as type 'str' (imputed)
  Loading column 'population' as type 'str' (imputed)
  Loading column 'super_population' as type 'str' (imputed)
  Loading column 'is_female' as type 'bool' (imputed)
  Loading column 'purple_hair' as type 'bool' (imputed)
  Loading column 'caffeine_consumption' as type 'float64' (imputed)
  Loading column 'six_toes' as type 'bool' (imputed)


While we can see the names and types of fields in the logging messages, we can also `show` this table:

In [21]:
sa.show()

s,population,super_population,is_female,purple_hair,caffeine_consumption,six_toes
str,str,str,bool,bool,float64,bool
"""HG00096""","""GBR""","""EUR""",False,False,50.7,False
"""HG00097""","""GBR""","""EUR""",True,False,50.2,False
"""HG00098""","""GBR""","""EUR""",False,False,63.8,False
"""HG00099""","""GBR""","""EUR""",True,False,53.9,False
"""HG00100""","""GBR""","""EUR""",True,False,41.5,False
"""HG00101""","""GBR""","""EUR""",False,False,54.9,False
"""HG00102""","""GBR""","""EUR""",True,False,38.3,False
"""HG00103""","""GBR""","""EUR""",False,False,38.2,False
"""HG00104""","""GBR""","""EUR""",True,False,51.9,False
"""HG00105""","""GBR""","""EUR""",False,False,35.7,False


And we can `summarize` each field in `sa`:

In [22]:
sa.summarize()

2020-04-17 17:06:50 Hail: INFO: Coerced sorted dataset


0,1
Non-missing,3500 (100.00%)
Missing,0
Min Size,7
Max Size,7
Mean Size,7.00
Sample Values,"['HG00096', 'HG00097', 'HG00098', 'HG00099', 'HG00100']"

0,1
Non-missing,2819 (80.54%)
Missing,681 (19.46%)
Min Size,3
Max Size,3
Mean Size,3.00
Sample Values,"['GBR', 'GBR', 'GBR', 'GBR', 'GBR']"

0,1
Non-missing,2819 (80.54%)
Missing,681 (19.46%)
Min Size,3
Max Size,3
Mean Size,3.00
Sample Values,"['EUR', 'EUR', 'EUR', 'EUR', 'EUR']"

0,1
Non-missing,3500 (100.00%)
Missing,0
Counts,"{False: 1740, True: 1760}"

0,1
Non-missing,3500 (100.00%)
Missing,0
Counts,"{False: 2813, True: 687}"

0,1
Non-missing,3500 (100.00%)
Missing,0
Minimum,14.11
Maximum,93.59
Mean,46.15
Std Dev,10.59

0,1
Non-missing,3500 (100.00%)
Missing,0
Counts,"{False: 3425, True: 75}"


## Add sample metadata into our 1KG `MatrixTable`

It just takes one line:

In [23]:
mt = mt.annotate_cols(pheno = sa[mt.s])

### What's going on here?

This one line does a lot of work. To follow it, we need to understand a few pieces:

#### 1. `annotate` methods

In Hail, `annotate` methods refer to **adding new fields**. 

 - `MatrixTable`'s `annotate_cols` adds new column (**sample**) fields.
 - `MatrixTable`'s `annotate_rows` adds new row (**variant**) fields.
 - `MatrixTable`'s `annotate_entries` adds new entry (**genotype**) fields.
 - `Table`'s `annotate` adds new row fields.

In the above cell, we are adding a new column (**sample**) field called "pheno". This field should be the values in our table `sa` associated with the sample ID `s` in our `MatrixTable` - that is, this is performing a **join**.

Python uses square brackets to look up values in dictionaries:

    d = {'foo': 5, 'bar': 10}
    d['foo']

You should think of this in much the same way - for each column of `mt`, we are looking up the fields in `sa` using the sample ID `s`.
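
To make the dictionary analogy concrete, here is a small plain-Python sketch of the same lookup. It is only an analogy for what `annotate_cols` does, not Hail code: the two real sample IDs come from the table shown earlier, while `NA_HYPOTHETICAL` is an invented ID used to show what happens when there is no match.

```python
# A tiny stand-in for the `sa` table, keyed by sample ID.
sample_annotations = {
    "HG00096": {"population": "GBR", "super_population": "EUR"},
    "HG00099": {"population": "GBR", "super_population": "EUR"},
}

# Sample IDs of the matrix table columns; the last one is invented and has
# no row in the annotation table.
column_ids = ["HG00096", "HG00099", "NA_HYPOTHETICAL"]

# For each column, look up its annotations by sample ID. A sample without a
# matching row gets a missing value (None) instead of raising an error.
columns = [{"s": s, "pheno": sample_annotations.get(s)} for s in column_ids]

for column in columns:
    print(column)
```

Note the use of `.get` rather than square brackets: a plain `d[key]` raises `KeyError` on a missing key, whereas the join keeps every column and leaves `pheno` missing.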

Let's see where this ends up in the `MatrixTable`:

In [24]:
mt.describe(widget=True)

VBox(children=(HBox(children=(Button(description='globals', layout=Layout(height='30px', width='65px'), style=…

Tab(children=(VBox(children=(HTML(value='<p><big>Global fields, with one value in the dataset.</big></p>\n<p>C…

## Query the phenotype fields

What’s the fraction of samples with `purple_hair`?

In [25]:
mt.aggregate_cols(hl.agg.fraction(mt.pheno.purple_hair))

0.1457725947521866

How many people are in each self-reported major ancestry group?

In [26]:
mt.aggregate_cols(hl.agg.counter(mt.pheno.super_population))

{None: 56, 'AFR': 76, 'EAS': 75, 'AMR': 33, 'SAS': 62, 'EUR': 41}

### <font color="brightred"><strong>Exercise: </strong></font> Query some of these column fields using `mt.aggregate_cols`.

Some useful aggregators:
 - `hl.agg.counter`
 - `hl.agg.stats` (min, max, mean, etc)
 - `hl.agg.count_where`
 - `hl.agg.fraction`
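
If you are unsure what each aggregator computes, the following plain-Python sketch mimics them on a handful of made-up samples. It illustrates the ideas only; the values are invented and none of this is Hail code.

```python
from collections import Counter

# Toy column (sample) fields; the values are made up for illustration.
samples = [
    {"super_population": "EUR", "caffeine_consumption": 50.7, "purple_hair": False},
    {"super_population": "EUR", "caffeine_consumption": 63.8, "purple_hair": True},
    {"super_population": "AFR", "caffeine_consumption": 41.5, "purple_hair": False},
    {"super_population": None, "caffeine_consumption": 38.2, "purple_hair": False},
]

# counter: how many times each value occurs (missing values are counted too).
population_counts = Counter(s["super_population"] for s in samples)

# stats: summary statistics of a numeric field.
caffeine = [s["caffeine_consumption"] for s in samples]
caffeine_stats = {
    "min": min(caffeine),
    "max": max(caffeine),
    "mean": sum(caffeine) / len(caffeine),
}

# count_where and fraction: count the True values, then divide by the total.
n_purple = sum(1 for s in samples if s["purple_hair"])
fraction_purple = n_purple / len(samples)

print(population_counts, caffeine_stats, n_purple, fraction_purple)
```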


To get started: What is the maximum value of `caffeine_consumption`? How many total samples have `purple_hair` (we calculated the `fraction` above)?

You can also explore the **row** and **entry** fields using `mt.aggregate_rows` and `mt.aggregate_entries`.

## Sample QC

We'll start with examples of sample QC.

Hail has the function [hl.sample_qc](https://hail.is/docs/0.2/methods/genetics.html#hail.methods.sample_qc) to compute a list of useful statistics about samples from sequencing data. This function adds a new column field, `sample_qc`, with the computed statistics.

**Click the link** above to see the documentation, which lists the fields and their descriptions.

In [27]:
mt = hl.sample_qc(mt)

In [28]:
mt.describe(widget=True)

VBox(children=(HBox(children=(Button(description='globals', layout=Layout(height='30px', width='65px'), style=…

Tab(children=(VBox(children=(HTML(value='<p><big>Global fields, with one value in the dataset.</big></p>\n<p>C…

Hail includes a plotting library built on [bokeh](https://bokeh.pydata.org/en/latest/index.html) that makes it easy to visualize fields of Hail tables and matrix tables.

Let's plot `Mean DP` (`DP` = read depth) against `Call Rate` for each sample:

In [29]:
p = hl.plot.scatter(x=mt.sample_qc.dp_stats.mean,
                    y=mt.sample_qc.call_rate,
                    xlabel='Mean DP',
                    ylabel='Call Rate',
                    hover_fields={'ID': mt.s},
                    size=8)
show(p)

### <font color="brightred"><strong>Exercise: </strong></font> Plot some other fields!

Modify the cell above. Feel free to try `hl.plot.histogram` or `hl.plot.cdf` (which take a single numeric argument) as well.

### Filter columns using generated QC statistics

Before filtering samples, we should compute a raw sample count:

In [30]:
mt.count_cols()

343

`filter_cols` removes entire columns from the matrix table. Here, we keep columns (samples) where the `call_rate` is at least 95%:

In [31]:
mt = mt.filter_cols(mt.sample_qc.call_rate >= 0.95)


We can compute a final sample count:

In [32]:
mt.count_cols()

323

## Variant QC

Hail has the function [hl.variant_qc](https://hail.is/docs/0.2/methods/genetics.html#hail.methods.variant_qc) to compute a list of useful statistics about **variants** from sequencing data.

Once again, **click the link** above to see the documentation!

In [33]:
mt = hl.variant_qc(mt)

In [34]:
mt.describe(widget=True)

VBox(children=(HBox(children=(Button(description='globals', layout=Layout(height='30px', width='65px'), style=…

Tab(children=(VBox(children=(HTML(value='<p><big>Global fields, with one value in the dataset.</big></p>\n<p>C…

We can `show()` the computed information:

In [35]:
mt.variant_qc.show()

locus,alleles,variant_qc.dp_stats.mean,variant_qc.dp_stats.stdev,variant_qc.dp_stats.min,variant_qc.dp_stats.max,variant_qc.gq_stats.mean,variant_qc.gq_stats.stdev,variant_qc.gq_stats.min,variant_qc.gq_stats.max,variant_qc.AC,variant_qc.AF,variant_qc.AN,variant_qc.homozygote_count,variant_qc.call_rate,variant_qc.n_called,variant_qc.n_not_called,variant_qc.n_filtered,variant_qc.n_het,variant_qc.n_non_ref,variant_qc.het_freq_hwe,variant_qc.p_value_hwe
locus<GRCh37>,array<str>,float64,float64,float64,float64,float64,float64,float64,float64,array<int32>,array<float64>,int32,array<int32>,float64,int64,int64,int64,int64,int64,float64,float64
1:904165,"[""G"",""A""]",7.53,4.0,1.0,22.0,29.1,23.4,3.0,99.0,"[556,86]","[8.66e-01,1.34e-01]",642,"[250,15]",0.994,321,2,0,56,71,0.232,3.99e-05
1:909917,"[""G"",""A""]",6.46,3.93,1.0,23.0,19.6,12.2,3.0,68.0,"[623,5]","[9.92e-01,7.96e-03]",628,"[310,1]",0.972,314,9,0,3,4,0.0158,0.00797
1:986963,"[""C"",""T""]",5.99,4.25,1.0,33.0,18.7,13.0,2.0,99.0,"[601,1]","[9.98e-01,1.66e-03]",602,"[300,0]",0.932,301,22,0,1,1,0.00332,0.5
1:1509414,"[""AG"",""A""]",6.35,3.67,1.0,23.0,18.0,11.5,0.0,88.0,"[627,3]","[9.95e-01,4.76e-03]",630,"[312,0]",0.975,315,8,0,3,3,0.00949,0.502
1:1563691,"[""T"",""G""]",6.84,4.3,1.0,20.0,18.7,13.1,0.0,99.0,"[612,10]","[9.84e-01,1.61e-02]",622,"[303,2]",0.963,311,12,0,6,8,0.0317,0.000813
1:1707740,"[""T"",""G""]",8.22,4.39,1.0,26.0,35.7,27.2,3.0,99.0,"[527,117]","[8.18e-01,1.82e-01]",644,"[221,16]",0.997,322,1,0,85,101,0.298,0.0485
1:2044130,"[""GTT"",""G""]",6.43,3.6,1.0,23.0,39.4,31.1,0.0,99.0,"[533,111]","[8.28e-01,1.72e-01]",644,"[211,0]",0.997,322,1,0,111,111,0.286,1.19e-05
1:2169908,"[""G"",""T""]",7.39,4.49,1.0,28.0,22.1,13.6,3.0,84.0,"[626,2]","[9.97e-01,3.18e-03]",628,"[312,0]",0.972,314,9,0,2,2,0.00636,0.501
1:2252970,"[""C"",""T""]",6.79,4.55,1.0,28.0,20.3,14.7,2.0,99.0,"[618,2]","[9.97e-01,3.23e-03]",620,"[308,0]",0.96,310,13,0,2,2,0.00644,0.501
1:2284195,"[""T"",""C""]",7.75,4.19,1.0,22.0,36.8,29.3,0.0,99.0,"[441,195]","[6.93e-01,3.07e-01]",636,"[160,37]",0.985,318,5,0,121,158,0.426,0.0561
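
Several of these columns are simple functions of the genotype counts, which makes the table easier to read. The sketch below recomputes a few of them in plain Python for the first row (`1:904165`), assuming a biallelic site, diploid calls and no filtered entries; `variant_stats` is a hypothetical helper, not how Hail implements `variant_qc`.

```python
def variant_stats(n_hom_ref, n_het, n_hom_var, n_samples):
    """Recompute a few variant_qc-style fields from genotype counts."""
    n_called = n_hom_ref + n_het + n_hom_var
    allele_number = 2 * n_called          # two alleles per called diploid genotype
    alt_count = n_het + 2 * n_hom_var     # hets carry one alt allele, hom-vars two
    ref_count = allele_number - alt_count
    return {
        "n_called": n_called,
        "n_not_called": n_samples - n_called,
        "call_rate": n_called / n_samples,
        "AC": [ref_count, alt_count],
        "AN": allele_number,
        "AF": [ref_count / allele_number, alt_count / allele_number],
        "n_non_ref": n_het + n_hom_var,
    }


# First row above: homozygote_count = [250, 15], n_het = 56, 323 samples.
stats = variant_stats(n_hom_ref=250, n_het=56, n_hom_var=15, n_samples=323)
print(stats)
```

The results agree with the row shown above: `AC` of `[556, 86]`, `AN` of 642, a call rate that rounds to 0.994, and 71 non-reference genotypes.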


Metrics like `call_rate` are important for QC. Let's plot the cumulative distribution function of call rate per variant:

In [36]:
show(hl.plot.cdf(mt.variant_qc.call_rate))



Before filtering variants, we should compute a raw variant count:

In [37]:
# pre-qc variant count
mt.count_rows()

13033

`filter_rows` removes entire rows of the matrix table. Here, we keep rows where the `call_rate` is over 95%:

In [38]:
mt = mt.filter_rows(mt.variant_qc.call_rate > 0.95)
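
To see what this filter does to individual variants, here is a plain-Python sketch using the call rates of the first five rows of the `variant_qc` table above. It is only a model of the row filter, with a dict standing in for the matrix table rows.

```python
# Call rates copied from the first five rows of the variant_qc output above.
call_rates = {
    "1:904165": 0.994,
    "1:909917": 0.972,
    "1:986963": 0.932,
    "1:1509414": 0.975,
    "1:1563691": 0.963,
}

# Keep only the variants whose call rate is strictly above the threshold.
kept = [locus for locus, rate in call_rates.items() if rate > 0.95]
print(kept)
```

Only `1:986963` falls below the threshold, so it is the one variant of these five that the filter removes.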

After filtering, we can see more resolution of the top end of the call rate distribution:

In [39]:
show(hl.plot.cdf(mt.variant_qc.call_rate))

We can then compute the final variant and sample counts:

In [40]:
mt.count()

(12681, 323)

# Playing around for the Blog here

In [41]:
result_mt = mt.annotate_rows(gt_counter=hl.agg.counter(mt.GT))



In [46]:
result_mt = mt.annotate_rows(mean_allele_depth=hl.agg.mean(mt.AD)) 

TypeError: mean: parameter 'expr': expected expression of type float64, found <ArrayNumericExpression of type array<int32>>
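
The error message says that `hl.agg.mean` expects one number per entry, but `AD` is an array with one depth per allele. What we probably want is a mean per allele, computed element by element. The sketch below shows that computation in plain Python on made-up depths. In Hail itself, the aggregator documentation lists `hl.agg.array_agg` for element-wise aggregation; the exact call is not shown here and should be checked against the docs.

```python
# Toy AD values for one variant: one [ref_depth, alt_depth] pair per sample.
# The numbers are invented for illustration.
allele_depths = [[7, 0], [5, 3], [0, 8]]

# Transpose with zip so that each allele's depths are averaged separately.
mean_allele_depth = [sum(depths) / len(depths) for depths in zip(*allele_depths)]
print(mean_allele_depth)
```

The result has one mean per allele, so it keeps the same length as each `AD` array instead of collapsing to a single number.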

In [45]:
mt.rows().show()

locus,alleles,rsid,qual,filters,info.AC,info.AF,info.AN,info.BaseQRankSum,info.ClippingRankSum,info.DP,info.DS,info.FS,info.HaplotypeScore,info.InbreedingCoeff,info.MLEAC,info.MLEAF,info.MQ,info.MQ0,info.MQRankSum,info.QD,info.ReadPosRankSum,variant_qc.dp_stats.mean,variant_qc.dp_stats.stdev,variant_qc.dp_stats.min,variant_qc.dp_stats.max,variant_qc.gq_stats.mean,variant_qc.gq_stats.stdev,variant_qc.gq_stats.min,variant_qc.gq_stats.max,variant_qc.AC,variant_qc.AF,variant_qc.AN,variant_qc.homozygote_count,variant_qc.call_rate,variant_qc.n_called,variant_qc.n_not_called,variant_qc.n_filtered,variant_qc.n_het,variant_qc.n_non_ref,variant_qc.het_freq_hwe,variant_qc.p_value_hwe
locus<GRCh37>,array<str>,str,float64,set<str>,array<int32>,array<float64>,int32,float64,float64,int32,bool,float64,float64,float64,array<int32>,array<float64>,float64,int32,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,array<int32>,array<float64>,int32,array<int32>,float64,int64,int64,int64,int64,int64,float64,float64
1:904165,"[""G"",""A""]",,52300.0,,[518],[1.03e-01],5020,-3.39,-0.17,17827,False,2.23,,0.0988,[514],[1.02e-01],59.1,0,1.45,15.0,6.29,7.53,4.0,1.0,22.0,29.1,23.4,3.0,99.0,"[556,86]","[8.66e-01,1.34e-01]",642,"[250,15]",0.994,321,2,0,56,71,0.232,3.99e-05
1:909917,"[""G"",""A""]",,1580.0,,[18],[3.73e-03],4830,-1.48,0.126,14671,False,5.52,,-0.0005,[15],[3.11e-03],59.1,0,1.76,13.7,-1.43,6.46,3.93,1.0,23.0,19.6,12.2,3.0,68.0,"[623,5]","[9.92e-01,7.96e-03]",628,"[310,1]",0.972,314,9,0,3,4,0.0158,0.00797
1:1509414,"[""AG"",""A""]",,52.5,,[23],[4.65e-03],4952,-8.26,-6.83,19926,False,675.0,,-0.0264,[4],[8.08e-04],59.4,0,-10.7,0.3,8.69,6.35,3.67,1.0,23.0,18.0,11.5,0.0,88.0,"[627,3]","[9.95e-01,4.76e-03]",630,"[312,0]",0.975,315,8,0,3,3,0.00949,0.502
1:1563691,"[""T"",""G""]",,1090.0,,[64],[1.30e-02],4766,-38.7,-5.39,15357,False,1900.0,,0.027,[22],[4.62e-03],59.0,0,1.31,5.05,1.15,6.84,4.3,1.0,20.0,18.7,13.1,0.0,99.0,"[612,10]","[9.84e-01,1.61e-02]",622,"[303,2]",0.963,311,12,0,6,8,0.0317,0.000813
1:1707740,"[""T"",""G""]",,93500.0,,[997],[1.98e-01],5034,-40.4,-0.287,19902,False,3.31,,0.0387,[983],[1.95e-01],58.3,0,9.48,13.6,2.26,8.22,4.39,1.0,26.0,35.7,27.2,3.0,99.0,"[527,117]","[8.18e-01,1.82e-01]",644,"[221,16]",0.997,322,1,0,85,101,0.298,0.0485
1:2044130,"[""GTT"",""G""]",,86800.0,,[882],[1.76e-01],5018,16.5,-16.0,15043,False,627.0,,-0.205,[883],[1.76e-01],57.2,0,5.72,7.82,5.91,6.43,3.6,1.0,23.0,39.4,31.1,0.0,99.0,"[533,111]","[8.28e-01,1.72e-01]",644,"[211,0]",0.997,322,1,0,111,111,0.286,1.19e-05
1:2169908,"[""G"",""T""]",,528.0,,[11],[2.24e-03],4912,-4.06,0.661,16870,False,3.92,,-0.0056,[7],[1.43e-03],58.9,0,-0.312,9.11,-1.72,7.39,4.49,1.0,28.0,22.1,13.6,3.0,84.0,"[626,2]","[9.97e-01,3.18e-03]",628,"[312,0]",0.972,314,9,0,2,2,0.00636,0.501
1:2252970,"[""C"",""T""]",,736.0,,[6],[1.28e-03],4682,-1.22,1.79,14900,False,2.82,,-0.0082,[6],[1.28e-03],58.7,0,0.957,10.2,0.667,6.79,4.55,1.0,28.0,20.3,14.7,2.0,99.0,"[618,2]","[9.97e-01,3.23e-03]",620,"[308,0]",0.96,310,13,0,2,2,0.00644,0.501
1:2284195,"[""T"",""C""]",,142000.0,,[1559],[3.12e-01],4990,-46.0,0.35,18176,False,2.95,,0.0925,[1552],[3.11e-01],58.6,0,16.1,15.5,-0.682,7.75,4.19,1.0,22.0,36.8,29.3,0.0,99.0,"[441,195]","[6.93e-01,3.07e-01]",636,"[160,37]",0.985,318,5,0,121,158,0.426,0.0561
1:2686238,"[""A"",""T""]",,1100.0,,[40],[8.01e-03],4994,-3.32,0.64,7787,False,0.402,,0.263,[28],[5.61e-03],36.7,0,2.34,7.84,0.094,3.81,2.57,1.0,18.0,20.7,14.4,1.0,99.0,"[634,4]","[9.94e-01,6.27e-03]",638,"[316,1]",0.988,319,4,0,2,3,0.0125,0.00471


# Write QC'ed final dataset to disk

In [43]:
mt.write('output/post_qc.mt', overwrite=True)

2020-04-17 17:07:38 Hail: INFO: wrote matrix table with 12681 rows and 323 columns in 4 partitions to output/post_qc.mt
