In [1]:
import hail as hl
hl.init(default_reference='GRCh38', spark_conf={'spark.driver.memory': '10g'})

Running on Apache Spark version 2.4.1
SparkUI available at http://hms-beagle-7889d4ff4c-z7fmq:4040
Welcome to
     __  __     <>__
    / /_/ /__  __/ /
   / __  / _ `/ / /
  /_/ /_/\_,_/_/_/   version 0.2.61-3c86d3ba497a
LOGGING: writing to /home/olavur/experiments/2020-11-13_fargen1_exome_analysis/fargen-1-exome/notebooks/hail-20210304-1042-0.2.61-3c86d3ba497a.log


In [2]:
from bokeh.io import show, output_notebook
from bokeh.layouts import gridplot
output_notebook()

Load variant data.

In [3]:
BASE_DIR = '/home/olavur/experiments/2020-11-13_fargen1_exome_analysis'
mt = hl.read_matrix_table(BASE_DIR + '/data/mt/variants.mt')

In [4]:
n_variants, n_samples = mt.count()
print('Number of variants: ' + str(n_variants))
print('Number of samples: ' + str(n_samples))

Number of variants: 3110759
Number of samples: 474


Collect QC metrics.

In [5]:
mt = hl.variant_qc(mt)
mt = hl.sample_qc(mt)

## VQSR filters

In [6]:
mt = mt.transmute_rows(filters=hl.delimit(mt.filters, ','))
mt.aggregate_rows(hl.agg.counter(mt.filters))

{'': 1332013,
 'VQSRTrancheINDEL99.90to100.00': 261474,
 'VQSRTrancheINDEL99.00to99.90': 630276,
 'VQSRTrancheSNP99.90to100.00': 355989,
 'VQSRTrancheSNP99.00to99.90': 531007}

We will only look at the **high-quality variants** from now on, so we filter out variants in the 99% to 99.9% and the 99.9% to 100% VQSR tranches, keeping only variants with no filter flag.
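As a quick sanity check on the numbers above, the per-filter counts should add up to the total number of variants, since each variant carries a single (possibly empty) filter string. The sketch below is plain Python, independent of Hail, with the counts copied by hand from the outputs above; the variable names are only illustrative.

```python
# Counts copied from the hl.agg.counter output above.
# The empty string means no filter flag was set on the variant.
filter_counts = {
    '': 1332013,
    'VQSRTrancheINDEL99.90to100.00': 261474,
    'VQSRTrancheINDEL99.00to99.90': 630276,
    'VQSRTrancheSNP99.90to100.00': 355989,
    'VQSRTrancheSNP99.00to99.90': 531007,
}
total_variants = 3110759  # from mt.count() before any filtering

# Every variant should fall into exactly one category.
assert sum(filter_counts.values()) == total_variants

n_pass = filter_counts['']
n_filtered = total_variants - n_pass
pass_fraction = n_pass / total_variants
print(f'Passing: {n_pass} ({pass_fraction:.1%}), filtered out: {n_filtered}')
```

The number of passing variants computed here should match the variant count reported after the filter is applied.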

In [7]:
mt = mt.filter_rows(mt.filters == '')

In [8]:
n_variants, n_samples = mt.count()
print('Number of variants: ' + str(n_variants))
print('Number of samples: ' + str(n_samples))

Number of variants: 1332013
Number of samples: 474


## Population-based variant filters

Filter out variants failing the Hardy-Weinberg equilibrium (HWE) test, that is, variants with $p<10^{-6}$. Variants with $p>10^{-6}$ are kept.

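**FIXME:** This step probably also removes multi-allelic sites: as far as I understand, `variant_qc` leaves `p_value_hwe` missing for multi-allelic variants, and `filter_rows` drops rows where the predicate is missing. I can do HWE filtering on multi-allelic sites by first splitting them with [split_multi()](https://hail.is/docs/0.2/methods/genetics.html#hail.methods.split_multi).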
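To make the direction of the HWE filter concrete, here is a minimal plain-Python sketch of an HWE test for a bi-allelic site. It uses the textbook chi-square goodness-of-fit approximation with one degree of freedom, which is an assumption made for illustration and is not how Hail computes `p_value_hwe`. The function name and the genotype counts are hypothetical.

```python
import math

def hwe_chi2_p_value(n_hom_ref, n_het, n_hom_var):
    """Approximate HWE p-value for a bi-allelic site (chi-square, 1 d.f.)."""
    n = n_hom_ref + n_het + n_hom_var
    if n == 0:
        return float('nan')  # no called genotypes, test undefined
    p = (2 * n_hom_ref + n_het) / (2 * n)  # reference allele frequency
    q = 1 - p
    # Expected genotype counts under Hardy-Weinberg equilibrium.
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_hom_ref, n_het, n_hom_var)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Survival function of the chi-square distribution with 1 d.f.
    return math.erfc(math.sqrt(chi2 / 2))

# Hypothetical genotype counts for 474 samples.
p_balanced = hwe_chi2_p_value(120, 234, 120)  # close to HWE proportions
p_no_hets = hwe_chi2_p_value(237, 0, 237)     # no heterozygotes at all

THRESHOLD = 1e-6
print('balanced site kept:', p_balanced > THRESHOLD)
print('no-het site kept:', p_no_hets > THRESHOLD)
```

A site is kept when its p-value is above the threshold, and removed when the p-value is below it, which is the same comparison used in the filter in this section.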

In [9]:
mt = mt.filter_rows(mt.variant_qc.p_value_hwe > 1e-6)

In [10]:
n_variants, n_samples = mt.count()
print('Number of variants: ' + str(n_variants))
print('Number of samples: ' + str(n_samples))

Number of variants: 1194405
Number of samples: 474


## Variant QC

In [11]:
p = hl.plot.histogram(mt.variant_qc.gq_stats.mean, range=(0,100), legend='Mean GQ per variant histogram')
p.plot_width = 800
p.plot_height = 500
show(p)

In [12]:
p = hl.plot.histogram(mt.variant_qc.dp_stats.mean, range=(0,100), legend='Mean DP per variant histogram')
p.plot_width = 800
p.plot_height = 500
show(p)

## Sample QC

In [13]:
p = hl.plot.histogram(mt.sample_qc.gq_stats.mean, range=(10,100), legend='Mean Sample GQ')
p.plot_width = 800
p.plot_height = 500
show(p)

In [14]:
p = hl.plot.histogram(mt.sample_qc.dp_stats.mean, range=(0,60), legend='Mean Sample DP')
p.plot_width = 800
p.plot_height = 500
show(p)

In [15]:
p = hl.plot.histogram(mt.sample_qc.r_het_hom_var, range=(1.3,4), legend='Het/hom rate')
p.plot_width = 800
p.plot_height = 500
show(p)

It looks like a few samples have a much higher het/hom rate than the rest. Let's check whether this is due to poor coverage.
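For reference, the het/hom rate plotted above is the ratio of heterozygous calls to homozygous-variant calls within a sample. The plain-Python sketch below shows that ratio for one sample, assuming bi-allelic genotypes encoded as alternate-allele counts (0, 1, 2) with `None` for missing calls. The function name and the example genotypes are hypothetical.

```python
from collections import Counter

def het_hom_var_ratio(genotypes):
    """Ratio of heterozygous (1) to homozygous-variant (2) calls in one sample."""
    counts = Counter(genotypes)  # missing calls (None) are counted but not used
    n_het = counts[1]
    n_hom_var = counts[2]
    # The ratio is undefined when the sample has no homozygous-variant calls.
    return n_het / n_hom_var if n_hom_var else float('nan')

# Hypothetical genotypes for a single sample.
sample_genotypes = [0, 1, 1, 1, 2, None, 1, 2]
ratio = het_hom_var_ratio(sample_genotypes)
print('het/hom rate:', ratio, '| flagged:', ratio > 3)
```

Samples are flagged in the next cell when this ratio exceeds 3.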

In [16]:
mt = mt.annotate_cols(high_hom_het=mt.sample_qc.r_het_hom_var > 3)

In [17]:
p = hl.plot.scatter(mt.sample_qc.dp_stats.mean, mt.sample_qc.r_het_hom_var, xlabel='DP mean', ylabel='het/hom rate',
                    hover_fields={'Sample': mt.s}, label=mt.high_hom_het)
p.plot_width = 600
p.plot_height = 600
show(p)

The plot above indicates that these samples have similar coverage to the other samples, so coverage doesn't explain the high het/hom rate.

We can also check the MultiQC reports of these samples (below), and we see that they are all of reasonable quality.

In [18]:
# Collect the IDs of the samples with a het/hom rate above 3.
high_hethom_samples = mt.filter_cols(mt.sample_qc.r_het_hom_var > 3).s.collect()

# Print the path to each sample's MultiQC report.
for sample in high_hethom_samples:
    print(f'/fargen/data/single_sample_data/{sample}/multiqc/multiqc_report.html')

/fargen/data/single_sample_data/FN000909/multiqc/multiqc_report.html
/fargen/data/single_sample_data/FN000940/multiqc/multiqc_report.html
/fargen/data/single_sample_data/FN001018/multiqc/multiqc_report.html
/fargen/data/single_sample_data/FN001019/multiqc/multiqc_report.html


In the genealogy summary file (`/fargen/fargen_phase_1_utils/multi_sample/joint_genotyping/metadata/genealogy/individuals_summary.csv`), it seems that all these samples have reasonably deep roots in the Faroes.

In [19]:
n_variants, n_samples = mt.count()
print('Number of variants: ' + str(n_variants))
print('Number of samples: ' + str(n_samples))

Number of variants: 1194405
Number of samples: 474


## Write variants to file

We have filtered out variants that don't pass the VQSR filters, and variants that fail the HWE test ($p<10^{-6}$). We write the resulting dataset to file.

In [20]:
mt.write(BASE_DIR + '/data/mt/high_quality_variants.mt', overwrite=True)

2021-03-04 10:46:42 Hail: INFO: wrote matrix table with 1194405 rows and 474 columns in 96 partitions to /home/olavur/experiments/2020-11-13_fargen1_exome_analysis/data/mt/high_quality_variants.mt
    Total size: 4.84 GiB
    * Rows/entries: 4.84 GiB
    * Columns: 49.41 KiB
    * Globals: 11.00 B
    * Smallest partition: 11386 rows (41.00 MiB)
    * Largest partition:  13123 rows (56.51 MiB)
