# Phasing statistics

In [1]:
import hail as hl
hl.init(spark_conf={'spark.driver.memory': '10g'}, tmp_dir='/home/olavur/tmp')

Running on Apache Spark version 2.4.1
SparkUI available at http://hms-beagle-848846b477-48ks9:4040
Welcome to
     __  __     <>__
    / /_/ /__  __/ /
   / __  / _ `/ / /
  /_/ /_/\_,_/_/_/   version 0.2.61-3c86d3ba497a
LOGGING: writing to /home/olavur/experiments/2020-11-13_fargen1_exome_analysis/fargen-1-exome/notebooks/phasing/hail-20210531-1111-0.2.61-3c86d3ba497a.log


In [196]:
from bokeh.io import show, output_notebook
from bokeh.layouts import gridplot
from bokeh.models.scales import LogScale
output_notebook()

In [8]:
import pandas as pd

## Import data

In [3]:
BASE_DIR = '/home/olavur/experiments/2020-11-13_fargen1_exome_analysis'

In [4]:
vcf_gz_path = '/data/projects/fargen_phase_1/data/multi_sample_data/phasing/outs/phased_merged.vcf.gz'

In [5]:
mt = hl.import_vcf(vcf_gz_path, force_bgz=True, reference_genome='GRCh38', array_elements_required=False)

In [6]:
n_sites, n_samples = mt.count()
print('Number of sites: {n}\nNumber of samples: {m}'.format(n=n_sites, m=n_samples))

2021-05-31 11:12:36 Hail: INFO: Coerced sorted dataset


Number of sites: 3110759
Number of samples: 472


## Filter variants

In [10]:
mt = mt.filter_rows(mt.filters.contains('VQSRTrancheSNP99.90to100.00') | mt.filters.contains('VQSRTrancheINDEL99.90to100.00'))

In [11]:
n_sites, n_samples = mt.count()
print('Number of sites: {n}\nNumber of samples: {m}'.format(n=n_sites, m=n_samples))

2021-05-31 11:14:39 Hail: INFO: Coerced sorted dataset


Number of sites: 617463
Number of samples: 472


## Write variants to file

In [12]:
mt.write(BASE_DIR + '/data/mt/phased_hq.mt', overwrite=True)

2021-05-31 11:15:03 Hail: INFO: Coerced sorted dataset
2021-05-31 11:16:31 Hail: INFO: wrote matrix table with 617463 rows and 472 columns in 37 partitions to /home/olavur/experiments/2020-11-13_fargen1_exome_analysis/data/mt/phased_hq.mt
    Total size: 724.91 MiB
    * Rows/entries: 724.91 MiB
    * Columns: 1.77 KiB
    * Globals: 11.00 B
    * Smallest partition: 14326 rows (16.44 MiB)
    * Largest partition:  18817 rows (21.81 MiB)


Read the written data back in, so that later steps start from the stored matrix table instead of repeating the import and filter.

In [15]:
mt = hl.read_matrix_table(BASE_DIR + '/data/mt/phased_hq.mt/')

In [16]:
n_sites, n_samples = mt.count()
print('Number of sites: {n}\nNumber of samples: {m}'.format(n=n_sites, m=n_samples))

Number of sites: 617463
Number of samples: 472


## Phasing statistics

### Phased heterozygotes

Calculate the number of phased heterozygotes per sample, and the fraction of each sample's heterozygotes that are phased.

In [237]:
# Get all heterozygotes.
het_mt = mt.filter_entries(mt.GT.is_het())

In [238]:
het_mt = het_mt.annotate_cols(n_phased_hets=hl.agg.count_where(het_mt.GT.phased), n_hets = hl.agg.count_where(het_mt.GT.is_het()))

In [239]:
het_mt = het_mt.annotate_cols(phased_hets_fraction = het_mt.n_phased_hets / het_mt.n_hets)

In [240]:
p = hl.plot.histogram(het_mt.phased_hets_fraction, title='Histogram of fraction of phased heterozygotes per sample')
show(p)
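
To make the per-sample fraction concrete, here is a minimal sketch in plain Python, independent of Hail. The sample names and genotype strings are made up for illustration; we assume VCF-style genotype strings, where `|` marks a phased call and `/` an unphased one.

```python
# Hypothetical toy data: VCF-style diploid genotype strings per sample.
# '|' separates alleles of a phased call, '/' those of an unphased call.
genotypes = {
    'sample_A': ['0|1', '1|0', '0/1', '1|1', '0/0'],
    'sample_B': ['0/1', '0/1', '0|1', '0|0'],
}


def is_het(gt):
    """Return True if the two alleles of a diploid call differ."""
    a, b = gt.replace('|', '/').split('/')
    return a != b


def phased_het_fraction(calls):
    """Fraction of heterozygous calls that are phased.

    Assumes the sample has at least one heterozygous call.
    """
    hets = [gt for gt in calls if is_het(gt)]
    n_phased = sum('|' in gt for gt in hets)
    return n_phased / len(hets)


fractions = {s: phased_het_fraction(calls) for s, calls in genotypes.items()}
print(fractions)
```

Homozygous calls such as `1|1` are dropped before counting, which mirrors filtering the entries to heterozygotes first.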

### Phase block lengths

Calculate the lengths of the phase blocks. Phase blocks are defined by the `PS` 'phase set' tag on the genotypes. The `PS` tag is an integer equal to the position of the first variant in the phase block.

**NOTE:** we only look at heterozygotes, because in principle the phase of any homozygous variant is known trivially. Therefore, if we included homozygous variants, each phase block would in principle stretch from the first to the last homozygote on the chromosome.
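
As a small illustration of how the `PS` tag defines blocks, the sketch below groups the heterozygous calls of one hypothetical sample by their phase set. The positions, genotypes and `PS` values are invented, and skipping unphased heterozygotes (which have no `PS`) is a choice made for this sketch only.

```python
from collections import defaultdict

# Hypothetical calls for one sample on one chromosome: (position, genotype, PS).
calls = [
    (1000, '0|1', 1000),
    (1500, '1|1', 1000),   # homozygous: phase is trivial, so it is ignored
    (2200, '1|0', 1000),
    (9000, '0|1', 9000),
    (9400, '1|0', 9000),
    (15000, '0/1', None),  # unphased heterozygote: no phase set, skipped here
]

# Map each PS value to the positions of the phased heterozygotes it contains.
blocks = defaultdict(list)
for pos, gt, ps in calls:
    a, b = gt.replace('|', '/').split('/')
    if a != b and ps is not None:
        blocks[ps].append(pos)

print(dict(blocks))
```

Each key is the position of the first variant in the block, which is why the first heterozygote of every block has a position equal to its own `PS` value.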

#### Calculate phase block statistics

We will obtain the start and end positions of each phase block, defined as the positions of the first and last variants in each phase set. We will use these to calculate the phase block lengths. We will also calculate the number of variants in each phase block.

First, we make a table of entries, whose rows will be keyed by locus, alleles and sample. 

In [286]:
entries = het_mt.entries()

Group entries by phase set, chromosome and sample. This gives us a table with all genotype data grouped by these values.

In [287]:
ps_groups_ht = entries.group_by(PS_group=entries.PS, chrom_group=entries.locus.contig, sample_group=entries.s)

For each phase set in each sample, calculate the number of variants, the start and end positions, and the length of the phase block.

In [289]:
ps_stats_ht = ps_groups_ht.aggregate(ps_start=hl.agg.min(entries.locus.position), ps_stop=hl.agg.max(entries.locus.position), ps_count=hl.agg.count())

ps_stats_ht = ps_stats_ht.annotate(ps_length = ps_stats_ht.ps_stop - ps_stats_ht.ps_start)

# FIXME: checkpoint only for testing, remove.
# Cache all operations by making a checkpoint.
ps_stats_ht = ps_stats_ht.checkpoint('/home/olavur/tmp/phasing_stats.ht')

2021-06-01 11:01:06 Hail: INFO: Ordering unsorted dataset with network shuffle
2021-06-01 11:01:10 Hail: INFO: wrote table with 1254275 rows in 37 partitions to /home/olavur/tmp/phasing_stats.ht
    Total size: 22.85 MiB
    * Rows: 22.85 MiB
    * Globals: 11.00 B
    * Smallest partition: 23977 rows (432.28 KiB)
    * Largest partition:  50145 rows (967.81 KiB)
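
The grouping and aggregation above can be sketched in plain Python on a handful of invented entries. This is only an illustration of the logic, not of how Hail executes it. Note that the same `PS` value can occur on different chromosomes and in different samples, which is why all three values are part of the group key.

```python
from collections import defaultdict

# Hypothetical entries for phased heterozygous calls: (sample, chrom, position, PS).
entries = [
    ('S1', 'chr1', 1000, 1000),
    ('S1', 'chr1', 2200, 1000),
    ('S1', 'chr1', 9000, 9000),
    ('S1', 'chr2', 1000, 1000),
    ('S1', 'chr2', 1300, 1000),
    ('S1', 'chr2', 1800, 1000),
    ('S2', 'chr1', 1000, 1000),
]

# Group positions by (PS, chromosome, sample).
groups = defaultdict(list)
for sample, chrom, pos, ps in entries:
    groups[(ps, chrom, sample)].append(pos)

# Per group: first and last position, number of variants, and block length.
ps_stats = {
    key: {
        'ps_start': min(positions),
        'ps_stop': max(positions),
        'ps_count': len(positions),
        'ps_length': max(positions) - min(positions),
    }
    for key, positions in groups.items()
}

for key, stats in sorted(ps_stats.items()):
    print(key, stats)
```

A group containing a single variant has `ps_start == ps_stop` and therefore a length of zero.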


#### Sanity check

Let's investigate the phase blocks a bit, to check that they actually make sense.

First let's look into the **phase block positions**.

Below we see a histogram of the start position of phase blocks on chromosome 1 for one particular sample. Below that we see the same plot for the end positions.

Note that, with the exception of a region in the middle of the chromosome, the entire chromosome seems to be covered by phase blocks.

In [314]:
temp = ps_stats_ht.filter((ps_stats_ht.sample_group == 'FN000001') & (ps_stats_ht.chrom_group == 'chr1'))
p = hl.plot.histogram(temp.ps_start)
show(p)

In [307]:
temp = ps_stats_ht.filter((ps_stats_ht.sample_group == 'FN000001') & (ps_stats_ht.chrom_group == 'chr1'))
p = hl.plot.histogram(temp.ps_stop)
show(p)
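
A gap like the one in the histograms can also be located numerically. The following is a minimal sketch with made-up start positions: sort the block starts and look for the largest distance between neighbours.

```python
# Hypothetical phase block start positions on one chromosome, for one sample.
starts = [120_000, 950_000, 1_800_000, 2_500_000, 9_700_000, 10_300_000]

# Distance between each pair of consecutive starts, with the pair itself.
ordered = sorted(starts)
gaps = [(b - a, a, b) for a, b in zip(ordered, ordered[1:])]

# The largest gap marks the region without any phase block start.
largest_gap, gap_from, gap_to = max(gaps)
print('Largest gap: {g} bp, between {a} and {b}'.format(g=largest_gap, a=gap_from, b=gap_to))
```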

Make the same plot for a few random samples. Note that we see a similar pattern across samples. Most likely, the gap in the middle is a "dark spot" in the genome assembly.

In [353]:
plot_list = []
for sample in ['FN001485', 'FN000020', 'FN000254', 'FN000182']:
    temp = ps_stats_ht.filter((ps_stats_ht.sample_group == sample) & (ps_stats_ht.chrom_group == 'chr1'))
    p = hl.plot.histogram(temp.ps_start, title=sample)
    plot_list.append(p)

In [354]:
show(gridplot(plot_list, ncols=2, plot_width=500, plot_height=400))

Also plot a few chromosomes for the same sample.

In [356]:
plot_list = []
for chrom in ['chr1', 'chr2', 'chrX', 'chrY']:
    temp = ps_stats_ht.filter((ps_stats_ht.sample_group == 'FN000001') & (ps_stats_ht.chrom_group == chrom))
    p = hl.plot.histogram(temp.ps_start, title=chrom)
    plot_list.append(p)

In [357]:
show(gridplot(plot_list, ncols=2, plot_width=500, plot_height=400))

Now let's look into the **phase block lengths**.

The plots below show a histogram of the phase block lengths for four different samples. The lengths are in base pairs and plotted on a log10 scale. Note that we've removed blocks with length zero, since their logarithm is undefined, but included their count in the title.

Most of the blocks are shorter than $10^2$ bp, which is quite poor. However, quite a few have a length between $10^4$ and $10^5$ bp, which is good as these blocks are able to cover most genes (**TODO:** how long is a typical human gene?).

In [367]:
plot_list = []
for sample in ['FN001485', 'FN000020', 'FN000254', 'FN000182']:
    # Calculate number of zero length phase blocks.
    ps_zero = ps_stats_ht.aggregate(hl.agg.filter(ps_stats_ht.sample_group == sample, hl.agg.count_where(ps_stats_ht.ps_length == 0)))

    temp = ps_stats_ht.filter((ps_stats_ht.sample_group == sample) & (ps_stats_ht.ps_length > 0))

    p = hl.plot.histogram(hl.log10(temp.ps_length), title='{s}. Zero length blocks: {n}'.format(s=sample, n=ps_zero))
    p.xaxis.axis_label = 'log10(Length)'
    
    plot_list.append(p)

In [368]:
show(gridplot(plot_list, ncols=2, plot_width=500, plot_height=400))
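
The handling of zero-length blocks can be sketched in plain Python on a few invented lengths: count the zeros separately, since `log10(0)` is undefined, and bin the remaining lengths by order of magnitude.

```python
import math
from collections import Counter

# Hypothetical phase block lengths in base pairs, for one sample.
lengths = [0, 0, 35, 80, 450, 12000, 52000, 0, 99000, 7]

# Zero-length blocks are counted separately and left out of the histogram.
n_zero = sum(1 for x in lengths if x == 0)
nonzero = [x for x in lengths if x > 0]

# Bin by order of magnitude: bin k holds lengths in [10^k, 10^(k+1)).
bins = Counter(math.floor(math.log10(x)) for x in nonzero)

print('Zero length blocks: {n}'.format(n=n_zero))
for k in sorted(bins):
    print('10^{k} to 10^{k1}: {c}'.format(k=k, k1=k + 1, c=bins[k]))
```

In this toy data, the blocks in bin 4 (between $10^4$ and $10^5$ bp) are the ones the text above considers long enough to be useful.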