# Phasing statistics

In [1]:
import hail as hl
hl.init(spark_conf={'spark.driver.memory': '10g'}, tmp_dir='/home/olavur/tmp')

Running on Apache Spark version 2.4.1
SparkUI available at http://hms-beagle-848846b477-48ks9:4040
Welcome to
     __  __     <>__
    / /_/ /__  __/ /
   / __  / _ `/ / /
  /_/ /_/\_,_/_/_/   version 0.2.61-3c86d3ba497a
LOGGING: writing to /home/olavur/experiments/2020-11-13_fargen1_exome_analysis/fargen-1-exome/notebooks/phasing/hail-20210531-1111-0.2.61-3c86d3ba497a.log


In [196]:
from bokeh.io import show, output_notebook
from bokeh.layouts import gridplot
from bokeh.models.scales import LogScale
output_notebook()

In [8]:
import pandas as pd

## Import data

In [3]:
BASE_DIR = '/home/olavur/experiments/2020-11-13_fargen1_exome_analysis'

In [4]:
vcf_gz_path = '/data/projects/fargen_phase_1/data/multi_sample_data/phasing/outs/phased_merged.vcf.gz'

In [5]:
mt = hl.import_vcf(vcf_gz_path, force_bgz=True, reference_genome='GRCh38', array_elements_required=False)

In [6]:
n_sites, n_samples = mt.count()
print('Number of sites: {n}\nNumber of samples: {m}'.format(n=n_sites, m=n_samples))

2021-05-31 11:12:36 Hail: INFO: Coerced sorted dataset


Number of sites: 3110759
Number of samples: 472


## Filter variants

In [10]:
mt = mt.filter_rows(mt.filters.contains('VQSRTrancheSNP99.90to100.00') | mt.filters.contains('VQSRTrancheINDEL99.90to100.00'))

In [11]:
n_sites, n_samples = mt.count()
print('Number of sites: {n}\nNumber of samples: {m}'.format(n=n_sites, m=n_samples))

2021-05-31 11:14:39 Hail: INFO: Coerced sorted dataset


Number of sites: 617463
Number of samples: 472


## Write variants to file

In [12]:
mt.write(BASE_DIR + '/data/mt/phased_hq.mt', overwrite=True)

2021-05-31 11:15:03 Hail: INFO: Coerced sorted dataset
2021-05-31 11:16:31 Hail: INFO: wrote matrix table with 617463 rows and 472 columns in 37 partitions to /home/olavur/experiments/2020-11-13_fargen1_exome_analysis/data/mt/phased_hq.mt
    Total size: 724.91 MiB
    * Rows/entries: 724.91 MiB
    * Columns: 1.77 KiB
    * Globals: 11.00 B
    * Smallest partition: 14326 rows (16.44 MiB)
    * Largest partition:  18817 rows (21.81 MiB)


Read the written matrix table back in, so that later steps start from the stored result instead of repeating the import and filtering.

In [15]:
mt = hl.read_matrix_table(BASE_DIR + '/data/mt/phased_hq.mt/')

In [16]:
n_sites, n_samples = mt.count()
print('Number of sites: {n}\nNumber of samples: {m}'.format(n=n_sites, m=n_samples))

Number of sites: 617463
Number of samples: 472


## Phasing statistics

### Phased heterozygotes

Calculate number of phased heterozygotes.

In [207]:
# Get all heterozygotes.
het_mt = mt.filter_entries(mt.GT.is_het())

In [208]:
# Per sample: count phased heterozygous calls and all heterozygous calls.
het_mt = het_mt.annotate_cols(
    n_phased_hets=hl.agg.count_where(het_mt.GT.phased),
    n_hets=hl.agg.count_where(het_mt.GT.is_het()),
)

In [209]:
het_mt = het_mt.annotate_cols(phased_hets_fraction = het_mt.n_phased_hets / het_mt.n_hets)

In [210]:
p = hl.plot.histogram(het_mt.phased_hets_fraction)
show(p)
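
To make the per-sample statistic concrete, here is a minimal pure-Python sketch of the same calculation on VCF-style `GT` strings. The function name `phased_het_fraction` and the toy calls are illustrative assumptions, not part of the analysis above; the sketch assumes diploid calls where `|` marks a phased genotype and `/` an unphased one.

```python
def phased_het_fraction(genotypes):
    """Return the fraction of heterozygous calls that are phased.

    `genotypes` is an iterable of VCF-style GT strings such as '0|1' or '0/1'.
    Returns None when there are no heterozygous calls.
    """
    n_hets = 0
    n_phased_hets = 0
    for gt in genotypes:
        phased = '|' in gt
        alleles = gt.replace('|', '/').split('/')
        # Skip missing calls and anything that is not diploid.
        if len(alleles) != 2 or '.' in alleles:
            continue
        # A heterozygote carries two different alleles.
        if alleles[0] != alleles[1]:
            n_hets += 1
            n_phased_hets += phased
    return n_phased_hets / n_hets if n_hets else None


# Hypothetical toy calls for one sample.
example_calls = ['0|1', '1|0', '0/1', '1|1', '0/0', './.', '1/2']
print(phased_het_fraction(example_calls))
```

In the toy input, four calls are heterozygous and two of those are phased.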

### Phase block lengths

Calculate the lengths of the phase blocks. Phase blocks are defined by the `PS` 'phase set' tag on the genotypes. The `PS` tag is an integer equal to the position of the first variant in the phase block.

Below we rename the phase sets so that they include the chromosome name. Otherwise, phase blocks on different chromosomes that start at the same position would share the same `PS` value.

**NOTE:** we only look at heterozygotes, because in principle the phase of any homozygous variant is known trivially. If we included homozygous variants, a phase block could in principle stretch from the first to the last homozygote on the chromosome.
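
As a rough illustration of why the chromosome name belongs in the phase set key, the sketch below summarises phase blocks for one sample from plain `(contig, position, PS)` tuples. The helper `phase_block_summary` and the toy data are hypothetical, and the block length is taken here as the span from the first to the last heterozygote in the block, which is one possible definition.

```python
from collections import defaultdict


def phase_block_summary(het_calls):
    """Summarise phase blocks from (contig, position, ps) tuples of one sample.

    `ps` is the PS value, or None for unphased calls.
    Returns {'contig:ps': (n_variants, span_in_bp)}.
    """
    positions = defaultdict(list)
    for contig, pos, ps in het_calls:
        if ps is None:
            # Unphased heterozygotes belong to no phase block.
            continue
        # Qualify the phase set with the contig so equal PS values on
        # different chromosomes stay separate.
        positions[f'{contig}:{ps}'].append(pos)
    return {key: (len(p), max(p) - min(p) + 1) for key, p in positions.items()}


# Hypothetical toy data: both chromosomes have a block starting at position 1000.
toy = [
    ('chr1', 1000, 1000), ('chr1', 1500, 1000), ('chr1', 2200, 1000),
    ('chr1', 9000, None),
    ('chr2', 1000, 1000), ('chr2', 1300, 1000),
]
summary = phase_block_summary(toy)
print(summary)
```

Keyed by `PS` alone, the two blocks in the toy data would be merged into a single block of five variants.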

In [59]:
sample_sub_exprs = hl.array(['FN001485', 'FN000020', 'FN000254'])
het_mt = het_mt.filter_cols(sample_sub_exprs.contains(het_mt.s))

In [60]:
n_sites, n_samples = het_mt.count()
print('Number of sites: {n}\nNumber of samples: {m}'.format(n=n_sites, m=n_samples))

Number of sites: 617463
Number of samples: 3


In [163]:
het_mt = het_mt.annotate_entries(ps_chr = het_mt.locus.contig + hl.str(':') + hl.str(het_mt.PS))

We calculate the number of variants in each phase block. We do this for a single sample here.

In [177]:
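# Count on the chromosome-qualified key, so that equal PS values on different chromosomes are not merged.
ps_count = het_mt.aggregate_entries(hl.agg.filter(het_mt.s == 'FN001485', hl.agg.counter(het_mt.ps_chr)))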

We don't care about the unphased variants, which have no phase set, so we remove the `None` key that collects them.

In [178]:
_ = ps_count.pop(None)
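
The same step on a plain Python dictionary looks like the sketch below. The values are made up for illustration. Passing a default to `pop` is a small defensive choice, since `pop(None)` raises `KeyError` when a sample has no unphased heterozygotes.

```python
from collections import Counter

# Hypothetical PS values for one sample; None marks an unphased call.
ps_values = [1000, 1000, None, 5000, None, 1000]

toy_counts = dict(Counter(ps_values))

# Remove the bucket of unphased calls; the default avoids a KeyError
# when there is no None key at all.
n_unphased = toy_counts.pop(None, 0)

print(toy_counts)
print(n_unphased)
```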

We can also group the rows by chromosome and then count the `PS` tags directly, since within one chromosome each phase block of a sample has its own `PS` value.

In [181]:
result = (het_mt.group_rows_by(chrom=het_mt.locus.contig)
    .aggregate(PS_count=hl.agg.counter(het_mt.PS)))
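
The grouped aggregation yields a matrix table with one row per chromosome and one entry per sample, each entry holding that sample's `PS` counts on that chromosome. A minimal pure-Python sketch of the same shape follows; the function `count_ps_by_chrom` and the sample names `S1` and `S2` are hypothetical.

```python
from collections import Counter, defaultdict


def count_ps_by_chrom(calls):
    """Count PS values per (contig, sample) from (sample, contig, ps) tuples."""
    counts = defaultdict(Counter)
    for sample, contig, ps in calls:
        if ps is None:
            # Skip unphased calls, which have no phase set.
            continue
        counts[(contig, sample)][ps] += 1
    return {key: dict(counter) for key, counter in counts.items()}


# Hypothetical toy calls.
toy_calls = [
    ('S1', 'chr1', 1000), ('S1', 'chr1', 1000), ('S1', 'chr1', None),
    ('S2', 'chr1', 1000),
    ('S1', 'chr2', 1000),
]
per_chrom = count_ps_by_chrom(toy_calls)
print(per_chrom)
```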

#### Construct phase set table