# Phasing statistics

In [1]:
import hail as hl
hl.init(spark_conf={'spark.driver.memory': '10g'}, tmp_dir='/home/olavur/tmp')

Running on Apache Spark version 2.4.1
SparkUI available at http://hms-beagle-848846b477-48ks9:4040
Welcome to
     __  __     <>__
    / /_/ /__  __/ /
   / __  / _ `/ / /
  /_/ /_/\_,_/_/_/   version 0.2.61-3c86d3ba497a
LOGGING: writing to /home/olavur/experiments/2020-11-13_fargen1_exome_analysis/fargen-1-exome/notebooks/qc/hail-20210610-0857-0.2.61-3c86d3ba497a.log


In [2]:
from bokeh.io import show, output_notebook
from bokeh.layouts import gridplot
from bokeh.models.scales import LogScale
output_notebook()

In [3]:
import pandas as pd

## Import data

In [4]:
BASE_DIR = '/home/olavur/experiments/2020-11-13_fargen1_exome_analysis'

In [5]:
mt = hl.read_matrix_table(BASE_DIR + '/data/mt/high_quality_variants.mt/')

In [6]:
n_variants, n_samples = mt.count()
print('Number of variants: ' + str(n_variants))
print('Number of samples: ' + str(n_samples))

Number of variants: 1146382
Number of samples: 468


## Phasing statistics

### Phased heterozygotes

Calculate the number of phased heterozygotes per sample, and the fraction of each sample's heterozygotes that are phased.

In [7]:
# Get all heterozygotes.
het_mt = mt.filter_entries(mt.GT.is_het())

In [8]:
het_mt = het_mt.annotate_cols(n_phased_hets=hl.agg.count_where(het_mt.GT.phased), n_hets = hl.agg.count_where(het_mt.GT.is_het()))

In [9]:
het_mt = het_mt.annotate_cols(phased_hets_fraction = het_mt.n_phased_hets / het_mt.n_hets)

In [10]:
p = hl.plot.histogram(het_mt.phased_hets_fraction, title='Histogram of fraction of phased heterozygotes per sample')
show(p)
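To make the per-sample quantity concrete, here is a minimal pure-Python sketch of the same fraction, computed from VCF-style genotype strings. The sample names and genotypes are made up for illustration, and the helper functions are hypothetical stand-ins for `GT.is_het()` and `GT.phased`.

```python
# Illustrative genotype strings per sample (made-up data, not from the dataset).
# VCF convention: '|' separates phased alleles, '/' separates unphased alleles.
genotypes = {
    "sample_A": ["0|1", "1|0", "0/1", "1|1", "0/0"],
    "sample_B": ["0/1", "0/1", "0|1", "0|0"],
}


def is_het(gt):
    """True if the two alleles of a diploid call differ."""
    a, b = gt.replace("|", "/").split("/")
    return a != b


def is_phased(gt):
    """True if the call uses the phased separator."""
    return "|" in gt


def phased_het_fraction(calls):
    """Fraction of heterozygous calls that are phased; None if there are no hets."""
    hets = [gt for gt in calls if is_het(gt)]
    if not hets:
        return None
    return sum(is_phased(gt) for gt in hets) / len(hets)


fractions = {s: phased_het_fraction(calls) for s, calls in genotypes.items()}
print(fractions)
```

Homozygous calls such as `1|1` are ignored, which mirrors the `filter_entries` step above.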

### Phase block lengths

Calculate the lengths of the phase blocks. Phase blocks are defined by the `PS` 'phase set' tag on the genotypes. The `PS` tag is an integer equal to the position of the first variant in the phase block.

**NOTE:** we only look at heterozygotes, because the phase of a homozygous variant is trivially known. If we included homozygous variants, a phase block would in principle stretch from the first to the last homozygote on the chromosome.

#### Calculate phase block statistics

We will obtain the start and end positions of each phase block, defined as the positions of the first and last variants in each phase set. We will use these to calculate the phase block lengths. We will also calculate the number of variants in each phase block.

First, we make a table of entries, whose rows will be keyed by locus, alleles and sample. 

In [11]:
entries = het_mt.entries()

2021-06-10 08:57:27 Hail: WARN: entries(): Resulting entries table is sorted by '(row_key, col_key)'.
    To preserve row-major matrix table order, first unkey columns with 'key_cols_by()'


Group entries by phase set, chromosome and sample. This gives us a table with all genotype data grouped by these values.

In [12]:
ps_groups_ht = entries.group_by(PS_group=entries.PS, chrom_group=entries.locus.contig, sample_group=entries.s)

For each phase set in each sample, calculate the number of variants, the start and end positions, and the block length (end position minus start position).

In [13]:
ps_stats_ht = ps_groups_ht.aggregate(ps_start=hl.agg.min(entries.locus.position), ps_stop=hl.agg.max(entries.locus.position), ps_count=hl.agg.count())

ps_stats_ht = ps_stats_ht.annotate(ps_length = ps_stats_ht.ps_stop - ps_stats_ht.ps_start)

# FIXME: checkpoint only for testing, remove.
# Cache all operations by making a checkpoint.
ps_stats_ht = ps_stats_ht.checkpoint('/home/olavur/tmp/phasing_stats.ht', overwrite=True)

2021-06-10 09:01:43 Hail: INFO: Ordering unsorted dataset with network shuffle
2021-06-10 09:01:56 Hail: INFO: wrote table with 4017621 rows in 37 partitions to /home/olavur/tmp/phasing_stats.ht
    Total size: 71.18 MiB
    * Rows: 71.18 MiB
    * Globals: 11.00 B
    * Smallest partition: 76453 rows (1.32 MiB)
    * Largest partition:  160761 rows (2.81 MiB)
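The group-and-aggregate logic can be sketched in plain Python on a handful of made-up entries. This is only an illustration of the computation, not the Hail implementation. The tuples and the choice to drop entries with a missing `PS` are assumptions.

```python
from collections import defaultdict

# Illustrative heterozygous entries: (sample, chrom, position, PS).
# Made-up values; PS is None where no phase set is assigned (an assumption).
entries = [
    ("S1", "chr1", 1000, 1000),
    ("S1", "chr1", 1500, 1000),
    ("S1", "chr1", 2200, 1000),
    ("S1", "chr1", 9000, 9000),
    ("S1", "chr1", 5000, None),
    ("S2", "chr1", 1000, 1000),
    ("S2", "chr1", 1200, 1000),
]

# Collect positions per (phase set, chromosome, sample) group.
positions_by_block = defaultdict(list)
for sample, chrom, pos, ps in entries:
    if ps is None:
        continue  # skip entries that belong to no phase set
    positions_by_block[(ps, chrom, sample)].append(pos)

# Aggregate each group into start, stop, variant count and length.
ps_stats = {}
for key, positions in positions_by_block.items():
    start, stop = min(positions), max(positions)
    ps_stats[key] = {
        "ps_start": start,
        "ps_stop": stop,
        "ps_count": len(positions),
        "ps_length": stop - start,
    }

for key, stats in sorted(ps_stats.items()):
    print(key, stats)
```

The explicit `None` check matters because entries without a phase set could otherwise be pooled into one group per sample and chromosome, producing an artificially long "block". Whether such entries exist in `het_mt` has not been checked here.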


#### Sanity check

Let's investigate the phase blocks a bit, to check that they actually make sense.

First let's look into the **phase block positions**.

Below we see a histogram of the start position of phase blocks on chromosome 1 for four different samples.

Note that, with the exception of a region in the middle of the chromosome, the entire chromosome seems to be covered by phase blocks. Most likely, the gap in the middle is a "dark spot" in the genome assembly (possibly the centromeric region).

In [14]:
plot_list = []
for sample in ['FN001485', 'FN000020', 'FN000254', 'FN000182']:
    temp = ps_stats_ht.filter((ps_stats_ht.sample_group == sample) & (ps_stats_ht.chrom_group == 'chr1'))
    p = hl.plot.histogram(temp.ps_start, title=sample)
    plot_list.append(p)
show(gridplot(plot_list, ncols=2, plot_width=500, plot_height=400))

We make a similar plot for the phase block stop positions.

In [15]:
plot_list = []
for sample in ['FN001485', 'FN000020', 'FN000254', 'FN000182']:
    temp = ps_stats_ht.filter((ps_stats_ht.sample_group == sample) & (ps_stats_ht.chrom_group == 'chr1'))
    p = hl.plot.histogram(temp.ps_stop, title=sample)
    plot_list.append(p)
show(gridplot(plot_list, ncols=2, plot_width=500, plot_height=400))
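The visual check for uncovered regions could also be done numerically. The sketch below bins positions along a chromosome and reports the empty bins. The function name `empty_bins`, the toy positions and the toy chromosome length are all hypothetical.

```python
def empty_bins(positions, chrom_length, n_bins):
    """Return (start, end) intervals of equally sized bins that contain no position."""
    bin_size = chrom_length / n_bins
    counts = [0] * n_bins
    for pos in positions:
        # Clamp to the last bin so that pos == chrom_length does not overflow.
        idx = min(int(pos / bin_size), n_bins - 1)
        counts[idx] += 1
    return [
        (int(i * bin_size), int((i + 1) * bin_size))
        for i, count in enumerate(counts)
        if count == 0
    ]


# Toy block start positions on a toy chromosome of length 100 (made-up data).
starts = [5, 15, 25, 35, 75, 85, 95]
gaps = empty_bins(starts, chrom_length=100, n_bins=10)
print(gaps)
```

With real data, the positions would come from `ps_start` or `ps_stop` for one sample and chromosome, and the bin count would need to be chosen to suit the variant density.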

Now let's look into the **phase block lengths**.

The plots below show a histogram of the phase block lengths for four different samples. Blocks of length zero are excluded, because the x-axis is log10-scaled and log10(0) is undefined, but their count is included in each plot title.

In [16]:
plot_list = []
for sample in ['FN001485', 'FN000020', 'FN000254', 'FN000182']:
    # Calculate number of zero length phase blocks.
    ps_zero = ps_stats_ht.aggregate(hl.agg.filter(ps_stats_ht.sample_group == sample, hl.agg.count_where(ps_stats_ht.ps_length == 0)))

    temp = ps_stats_ht.filter((ps_stats_ht.sample_group == sample) & (ps_stats_ht.ps_length > 0))

    p = hl.plot.histogram(hl.log10(temp.ps_length), title='{s}. Zero length blocks: {n}'.format(s=sample, n=ps_zero))
    p.xaxis.axis_label = 'log10(Length)'
    
    plot_list.append(p)

In [17]:
show(gridplot(plot_list, ncols=2, plot_width=500, plot_height=400))
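As a small illustration of the zero-length handling, the following sketch separates zero-length blocks from the rest before taking logarithms, then bins the remaining lengths by order of magnitude. A zero-length block typically corresponds to a phase set containing a single heterozygous variant, so that start and stop coincide. The lengths are made up.

```python
import math
from collections import Counter

# Made-up phase block lengths in base pairs.
ps_lengths = [0, 0, 12, 150, 980, 1200, 45000, 0, 3]

# Count the zero-length blocks separately, since log10(0) is undefined.
n_zero = sum(1 for length in ps_lengths if length == 0)

# Log-transform only the strictly positive lengths.
log_lengths = [math.log10(length) for length in ps_lengths if length > 0]

# Coarse histogram: one bin per order of magnitude.
hist = Counter(math.floor(x) for x in log_lengths)

print("Zero length blocks:", n_zero)
for magnitude in sorted(hist):
    print("10^{m} to 10^{n}: {c}".format(m=magnitude, n=magnitude + 1, c=hist[magnitude]))
```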

In [18]:
ps_len_stats_ht = (ps_stats_ht.group_by(ps_stats_ht.chrom_group)
    .aggregate(stats = hl.agg.stats(ps_stats_ht.ps_length)))

In [19]:
ps_len_stats_pd = ps_len_stats_ht.to_pandas()

2021-06-10 09:02:07 Hail: INFO: Ordering unsorted dataset with network shuffle


In [20]:
ps_len_stats_pd.sort_values('chrom_group')

| | chrom_group | stats.mean | stats.stdev | stats.min | stats.max | stats.n | stats.sum |
|---|---|---|---|---|---|---|---|
| 0 | chr1 | 323845.6 | 8657725.0 | 0.0 | 248924232.0 | 385799 | 124939300000.0 |
| 1 | chr10 | 346446.7 | 6600036.0 | 0.0 | 133731379.0 | 190712 | 66071540000.0 |
| 2 | chr11 | 293977.7 | 6035545.0 | 0.0 | 134899987.0 | 231756 | 68131100000.0 |
| 3 | chr12 | 315498.3 | 6226102.0 | 0.0 | 133222988.0 | 213359 | 67314400000.0 |
| 4 | chr13 | 532234.1 | 7006704.0 | 0.0 | 96172914.0 | 87567 | 46606140000.0 |
| 5 | chr14 | 334960.1 | 5227407.0 | 0.0 | 88201934.0 | 130658 | 43765220000.0 |
| 6 | chr15 | 302622.7 | 4781546.0 | 0.0 | 82013819.0 | 136218 | 41222660000.0 |
| 7 | chr16 | 338206.0 | 5295094.0 | 0.0 | 90204734.0 | 134805 | 45591850000.0 |
| 8 | chr17 | 248385.1 | 4275575.0 | 0.0 | 83029187.0 | 175033 | 43475590000.0 |
| 9 | chr18 | 456314.2 | 5909913.0 | 0.0 | 80199903.0 | 85364 | 38952810000.0 |
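For reference, the summary fields in this table can be reproduced on a toy list with the standard library. The sketch assumes that `stdev` is the population standard deviation, and the lengths are made up. It also computes a median, which is not part of the table but tends to be more informative when, as here, the standard deviation is far larger than the mean.

```python
import statistics

# Made-up, strongly right-skewed block lengths.
ps_lengths = [0, 0, 10, 20, 970]

stats = {
    "mean": statistics.fmean(ps_lengths),
    # Assumption: population (not sample) standard deviation.
    "stdev": statistics.pstdev(ps_lengths),
    "min": min(ps_lengths),
    "max": max(ps_lengths),
    "n": len(ps_lengths),
    "sum": sum(ps_lengths),
}

# Not in the table above: the median is robust to a few very long blocks.
median = statistics.median(ps_lengths)

print(stats)
print("median:", median)
```

In this toy example a single long block pulls the mean far above the median, the same kind of skew that the large `stats.stdev` values suggest.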


In [21]:
p = hl.plot.scatter(ps_len_stats_ht.stats.mean, ps_len_stats_ht.stats.n, hover_fields={'Chrom': ps_len_stats_ht.chrom_group}, label=ps_len_stats_ht.chrom_group)
show(p)

2021-06-10 09:02:10 Hail: INFO: Ordering unsorted dataset with network shuffle


## Compound heterozygotes

In [22]:
# Get the gene name from the variant annotation.
# The annotation field is an array with one element per transcript at the site.
# The fields within each annotation are separated by a pipe ("|").
# split() takes a regular expression, so the pipe must be escaped; '\\|' avoids Python's invalid '\|' escape sequence.
het_mt = het_mt.annotate_rows(gene = het_mt.info.ANN.map(lambda x: x.split('\\|')[3]))

# We will only look at one of the genes, so we arbitrarily pick the first in the list.
het_mt = het_mt.annotate_rows(gene1 = het_mt.gene[0])
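The field extraction can be illustrated outside Hail with plain Python. The annotation strings below are hypothetical placeholders that only follow the pipe-separated layout. The sketch assumes, as the code above does, that the gene name is the fourth field (index 3).

```python
import re

# Hypothetical ANN entries for one site, one string per transcript.
# Field values are placeholders; only the pipe-separated layout matters.
ann = [
    "A|missense_variant|MODERATE|GENE_A|GENE_A_ID|transcript|TRANSCRIPT_1",
    "A|downstream_gene_variant|MODIFIER|GENE_B|GENE_B_ID|transcript|TRANSCRIPT_2",
]

GENE_NAME_INDEX = 3  # assumption: the gene name is the fourth field

# The delimiter is a regular expression, so the pipe is escaped.
genes = [re.split(r"\|", entry)[GENE_NAME_INDEX] for entry in ann]

# Arbitrarily keep the first gene, guarding against an empty annotation list.
gene1 = genes[0] if genes else None

print(genes, gene1)
```

Picking the first element is arbitrary, as noted above. A site whose transcripts map to different genes will be attributed to only one of them.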

In [23]:
(het_mt.group_rows_by(het_mt.gene1)
    .aggregate(phased_diploid_hets = hl.agg.filter((het_mt.GT.phased) & (het_mt.GT.ploidy == 2), hl.agg.count())))

<hail.matrixtable.MatrixTable at 0x7f53977deed0>

In [24]:
phased_hets_ht = het_mt.filter_entries(het_mt.GT.phased)

In [25]:
phased_hets_ht.group_rows_by(phased_hets_ht.gene1)

<hail.matrixtable.GroupedMatrixTable at 0x7f5397697ad0>

In [26]:
mt.GT[0].take(5)

[0, 0, 0, 0, 0]

In [27]:
lof_mt = mt.filter_rows(~hl.is_missing(mt.info.LOF))

In [28]:
lof_mt.info.LOF.take(5)

[['(OR4F5|OR4F5|1|1.00)'],
 ['(OR4F5|OR4F5|1|1.00)'],
 ['(OR4F5|OR4F5|1|1.00)'],
 ['(OR4F29|OR4F29|1|1.00)', '(OR4F3|OR4F3|1|1.00)', '(OR4F16|OR4F16|1|1.00)'],
 ['(OR4F29|OR4F29|1|1.00)', '(OR4F3|OR4F3|1|1.00)', '(OR4F16|OR4F16|1|1.00)']]
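The `LOF` strings shown above can be unpacked into named fields. The sketch below assumes the four fields are gene name, gene ID, number of transcripts in the gene, and fraction of transcripts affected. The `LofRecord` type and `parse_lof` helper are hypothetical names introduced for illustration.

```python
from typing import NamedTuple


class LofRecord(NamedTuple):
    gene_name: str
    gene_id: str
    n_transcripts: int
    fraction_affected: float


def parse_lof(entry):
    """Parse one '(name|id|n|fraction)' string into a LofRecord."""
    fields = entry.strip("()").split("|")
    if len(fields) != 4:
        raise ValueError("unexpected LOF entry: {!r}".format(entry))
    name, gene_id, n_transcripts, fraction = fields
    return LofRecord(name, gene_id, int(n_transcripts), float(fraction))


# One LOF array, copied from the output above.
lof = ['(OR4F29|OR4F29|1|1.00)', '(OR4F3|OR4F3|1|1.00)', '(OR4F16|OR4F16|1|1.00)']

records = [parse_lof(entry) for entry in lof]
lof_genes = [record.gene_name for record in records]

print(lof_genes)
```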