# Quality control

In [1]:
import hail as hl
hl.init(spark_conf={'spark.driver.memory': '10g'}, tmp_dir='/home/olavur/tmp')

2021-09-28 09:23:02 WARN  NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).


2021-09-28 09:23:03 WARN  Hail:37 - This Hail JAR was compiled for Spark 2.4.5, running with Spark 2.4.1.
  Compatibility is not guaranteed.
2021-09-28 09:23:04 WARN  Utils:66 - Service 'SparkUI' could not bind on port 4040. Attempting port 4041.


Running on Apache Spark version 2.4.1
SparkUI available at http://hms-beagle-848846b477-8tkk6:4041
Welcome to
     __  __     <>__
    / /_/ /__  __/ /
   / __  / _ `/ / /
  /_/ /_/\_,_/_/_/   version 0.2.61-3c86d3ba497a
LOGGING: writing to /home/olavur/experiments/2020-11-13_fargen1_exome_analysis/fargen-1-exome/notebooks/qc/hail-20210928-0923-0.2.61-3c86d3ba497a.log


In [2]:
from bokeh.io import show, output_notebook
from bokeh.layouts import gridplot
from bokeh.models.scales import LogScale
output_notebook()

In [3]:
import pandas as pd

**TODO:**

* Probably a good idea to filter out singletons here.

Load variant data.

In [43]:
BASE_DIR = '/home/olavur/experiments/2020-11-13_fargen1_exome_analysis'
mt = hl.read_matrix_table(BASE_DIR + '/data/mt/variants.mt')

Split multi-allelic variants into separate rows. This makes a lot of analyses much easier.

**FIXME:** while `split_multi_hts()` is really useful, it is causing me a lot of headaches in some other areas.

In [44]:
mt = hl.split_multi_hts(mt, permit_shuffle=True)
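To make the effect of splitting concrete, here is a minimal pure-Python sketch of what happens to one site. It is not Hail's implementation. It assumes, following my reading of the `split_multi_hts` documentation, that each alternate allele gets its own biallelic row and that genotype alleles other than the one in focus are downcoded to reference. The function name `split_site` is hypothetical, and the sketch ignores allele trimming and the other entry fields (AD, PL, and so on).

```python
def split_site(alleles, genotypes):
    """Split one multi-allelic site into biallelic rows.

    alleles:   [ref, alt1, alt2, ...]
    genotypes: list of (a, b) allele-index pairs, one per sample.
    Returns a list of (biallelic_alleles, downcoded_genotypes) tuples.
    """
    rows = []
    for i in range(1, len(alleles)):
        # Allele i becomes 1; every other allele is treated as reference (0).
        downcoded = [
            tuple(sorted(1 if a == i else 0 for a in gt))
            for gt in genotypes
        ]
        rows.append(([alleles[0], alleles[i]], downcoded))
    return rows


# One site with two alternate alleles and three samples.
rows = split_site(['A', 'C', 'G'], [(0, 1), (1, 2), (2, 2)])
for row_alleles, row_genotypes in rows:
    print(row_alleles, row_genotypes)
```

Note how the `1/2` genotype becomes `0/1` in both output rows. Side effects of this kind may be part of why splitting complicates some downstream analyses.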

In [45]:
n_variants, n_samples = mt.count()
print('Number of variants: ' + str(n_variants))
print('Number of samples: ' + str(n_samples))

Number of variants: 229834
Number of samples: 473


In [46]:
def variant_counts(mt):
    # Count number of variants, SNPs and indels. Only first allele in alternate allele list is considered.
    variant_counts_struct = mt.aggregate_rows(hl.struct(
        n_variants = hl.agg.count(),
        snps_fraction = hl.agg.count_where(hl.is_snp(mt.alleles[0], mt.alleles[1])) / hl.agg.count(),
        indels_fraction = hl.agg.count_where(hl.is_indel(mt.alleles[0], mt.alleles[1])) / hl.agg.count()))
    
    variant_counts_pd = pd.DataFrame(variant_counts_struct.values(), index=variant_counts_struct.keys(), columns=[''])
    return variant_counts_pd

In [47]:
variant_counts(mt)

| Statistic | Value |
|---|---|
| n_variants | 229834.0 |
| snps_fraction | 0.812073 |
| indels_fraction | 0.187927 |


## VQSR filters

We will remove variants not passing the VQSR filters. First, however, we will take a closer look at these values.

In [48]:
mt = mt.transmute_rows(filters=hl.delimit(mt.filters, ','))

In [49]:
mt.aggregate_rows(hl.agg.counter(mt.filters))

{'': 91718,
 None: 114917,
 'VQSRTrancheINDEL99.90to100.00': 53,
 'VQSRTrancheINDEL99.00to99.90': 1864,
 'VQSRTrancheSNP99.90to100.00': 548,
 'VQSRTrancheSNP99.00to99.90': 4018,
 'VQSRTrancheINDEL90.00to99.00': 4856,
 'VQSRTrancheSNP90.00to99.00': 11860}

We will calculate some mean QC statistics for each VQSR tranche.

In [50]:
# Calculate variant statistics.
mt = hl.variant_qc(mt)

# Get rows table.
rows_ht = mt.rows()

# Aggregate.
result = (rows_ht.group_by(rows_ht.filters)
         .aggregate(mean_gq = hl.agg.filter(~hl.is_nan(rows_ht.variant_qc.gq_stats.mean), hl.agg.mean(rows_ht.variant_qc.gq_stats.mean)),
                   mean_dp = hl.agg.filter(~hl.is_nan(rows_ht.variant_qc.dp_stats.mean), hl.agg.mean(rows_ht.variant_qc.dp_stats.mean)),
                   mean_af = hl.agg.filter(~hl.is_nan(rows_ht.variant_qc.AF[0]), hl.agg.mean(1 - rows_ht.variant_qc.AF[0])),
                   mean_vqslod = hl.agg.filter(~hl.is_nan(rows_ht.info.VQSLOD), hl.agg.mean(rows_ht.info.VQSLOD)),
                   n_variants = hl.agg.count()))

We convert the results to a Pandas dataframe.

In [51]:
vqsr_stats_pd = result.to_pandas()



Below we first print the statistics for the SNPs and then for the indels. The rows are sorted by mean VQSLOD. The row with an empty filter string (`filters == ''`) corresponds to all unfiltered variants, SNPs and indels combined, so it is identical in both tables.

In [52]:
vqsr_stats_pd[vqsr_stats_pd.filters.isin(['VQSRTrancheSNP90.00to99.00', 'VQSRTrancheSNP99.00to99.90', 'VQSRTrancheSNP99.90to100.00', ''])].sort_values('mean_vqslod')

| | filters | mean_gq | mean_dp | mean_af | mean_vqslod | n_variants |
|---|---|---|---|---|---|---|
| 6 | VQSRTrancheSNP99.90to100.00 | 67.016624 | 48.829652 | 0.58067 | -28.934914 | 548 |
| 5 | VQSRTrancheSNP99.00to99.90 | 77.295291 | 47.716385 | 0.528285 | -1.892502 | 4018 |
| 4 | VQSRTrancheSNP90.00to99.00 | 77.397878 | 41.130287 | 0.568779 | 1.408824 | 11860 |
| 0 | `''` | 81.534412 | 43.228872 | 0.523656 | 4.720143 | 91718 |


In [53]:
vqsr_stats_pd[vqsr_stats_pd.filters.isin(['VQSRTrancheINDEL90.00to99.00', 'VQSRTrancheINDEL99.00to99.90', 'VQSRTrancheINDEL99.90to100.00', ''])].sort_values('mean_vqslod')

| | filters | mean_gq | mean_dp | mean_af | mean_vqslod | n_variants |
|---|---|---|---|---|---|---|
| 3 | VQSRTrancheINDEL99.90to100.00 | 60.661981 | 28.263731 | 0.183336 | -92.501132 | 53 |
| 2 | VQSRTrancheINDEL99.00to99.90 | 73.732753 | 44.697718 | 0.199547 | -1.945062 | 1864 |
| 1 | VQSRTrancheINDEL90.00to99.00 | 70.528396 | 34.666267 | 0.262361 | 0.910408 | 4856 |
| 0 | `''` | 81.534412 | 43.228872 | 0.523656 | 4.720143 | 91718 |


We will only look at the **high-quality variants** from now on. So we remove variants in the 90% to 100% VQSR tranches, and keep only the unfiltered variants.

In [54]:
mt = mt.filter_rows(mt.filters == '')

We check how many variants we have remaining after filtering.
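The expected count can be read off the filter counter printed earlier. The sketch below simply re-enters those numbers by hand (it does not query the matrix table). It assumes that rows whose `filters` value is missing (`None`) are dropped by the `mt.filters == ''` predicate, which is how I understand `filter_rows` to treat missing values.

```python
# Counts copied from the output of hl.agg.counter(mt.filters) above.
filter_counts = {
    '': 91718,
    None: 114917,
    'VQSRTrancheINDEL99.90to100.00': 53,
    'VQSRTrancheINDEL99.00to99.90': 1864,
    'VQSRTrancheSNP99.90to100.00': 548,
    'VQSRTrancheSNP99.00to99.90': 4018,
    'VQSRTrancheINDEL90.00to99.00': 4856,
    'VQSRTrancheSNP90.00to99.00': 11860,
}

total = sum(filter_counts.values())   # all rows after splitting
kept = filter_counts['']              # rows with an empty filter string
removed = total - kept                # tranche-filtered rows plus rows with missing filters

print('Variants before filtering: ' + str(total))
print('Variants kept: ' + str(kept))
print('Variants removed: ' + str(removed))
```

Under that assumption the filter removes more than the tranche-flagged variants: the rows with a missing `filters` value are dropped as well.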

## Genotype QC

In [55]:
p = hl.plot.histogram(mt.GQ, range=(0, 100))
p.xaxis.axis_label = 'Genotype quality'
p.plot_width = 800
p.plot_height = 500
show(p)



Filter genotypes with low quality (GQ), using GQ > 20 for SNPs and GQ > 40 for indels.

After the filter some sites may have become invariant. These are removed later, in the "Variant counts" section.

NOTE: Many (or most) of the lower quality genotypes, i.e. the low-end tail of the histogram above, are indels.

In [56]:
# Calculate variant statistics.
mt = hl.variant_qc(mt)

mt = mt.filter_entries(hl.if_else(
    hl.is_snp(mt.alleles[0],mt.alleles[1]),
    mt.GQ > 20,
    mt.GQ > 40))
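The per-type threshold logic can be sketched in plain Python as follows. The helper `is_simple_snp` is a hypothetical, simplified stand-in for `hl.is_snp` (it only recognises a single base replaced by another single base), and `keep_genotype` is likewise an illustrative name. Note that, as in the Hail expression above, everything that is not classified as a SNP gets the stricter threshold, not only indels.

```python
def is_simple_snp(ref, alt):
    # Simplified stand-in for hl.is_snp: one base replaced by a different base.
    return len(ref) == 1 and len(alt) == 1 and ref != alt


def keep_genotype(ref, alt, gq, snp_min=20, other_min=40):
    # Strict inequalities, matching GQ > 20 for SNPs and GQ > 40 otherwise.
    threshold = snp_min if is_simple_snp(ref, alt) else other_min
    return gq > threshold


print(keep_genotype('A', 'G', 21))    # SNP just above its threshold
print(keep_genotype('A', 'AT', 30))   # insertion below the stricter threshold
```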

### Allelic balance

Note that since we've split the table, all rows are diallelic.

We compute the allelic balance as $AB = \frac{AD[1]}{DP}$. Note that DP is equivalent to $AD[0] + AD[1]$ for a diallelic site.

In [57]:
mt = mt.annotate_entries(AB = mt.AD[1] / mt.DP)

In [58]:
hets_mt = mt.filter_entries(mt.GT.is_het())
p = hl.plot.histogram(hets_mt.AB, range=(0, 1))
p.xaxis.axis_label = 'Allelic balance'
p.plot_width = 800
p.plot_height = 500
show(p)



Remove all heterozygous genotypes with allelic balance outside the open interval $]0.25;0.75[$. Homozygous genotypes are kept regardless of their allelic balance.

In [59]:
mt = mt.filter_entries(hl.if_else(
    mt.GT.is_het(),
    (mt.AB > 0.25) & (mt.AB < 0.75),
    True))
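As a worked example of the allelic balance filter, here is a small pure-Python sketch for single genotype calls. The function names are hypothetical, and treating a call with zero depth as failing the filter is my own choice for the sketch rather than something taken from the Hail code.

```python
def allelic_balance(ad, dp):
    # AB = AD[1] / DP; left undefined (None) when the depth is zero.
    return ad[1] / dp if dp else None


def keep_call(gt, ad, dp, low=0.25, high=0.75):
    is_het = gt[0] != gt[1]
    if not is_het:
        return True  # homozygous calls are never filtered on AB
    ab = allelic_balance(ad, dp)
    # Open interval: values equal to the bounds are rejected.
    return ab is not None and low < ab < high


print(keep_call((0, 1), [10, 10], 20))  # balanced het, AB = 0.5
print(keep_call((0, 1), [18, 2], 20))   # skewed het, AB = 0.1
print(keep_call((1, 1), [0, 20], 20))   # homozygous variant, AB = 1.0
```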

## Variant QC

### Variant call quality (QUAL)

In [60]:
p = hl.plot.histogram(mt.qual, legend='Variant call quality (QUAL)')
p.plot_width = 800
p.plot_height = 500
show(p)

In [61]:
p = hl.plot.histogram(hl.log10(mt.qual), legend='Variant call quality (log10 of QUAL)')
p.plot_width = 800
p.plot_height = 500
show(p)

### Hardy-Weinberg Equilibrium (HWE)

Remove variants that deviate significantly from HWE. We use a p-value threshold of $10^{-9}$ for SNPs and $10^{-6}$ for indels. Since our criterion for indels is 1000 times more stringent than for SNPs, we cannot expect the indels to carry information about population structure and inbreeding.

In [62]:
# Update variant statistics.
mt = hl.variant_qc(mt)

mt = mt.filter_rows(hl.if_else(
    hl.is_snp(mt.alleles[0], mt.alleles[1]),
    mt.variant_qc.p_value_hwe > 1e-9,
    mt.variant_qc.p_value_hwe > 1e-6))
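To illustrate what a deviation from HWE means, the sketch below computes the genotype counts expected under HWE and a chi-square statistic for the deviation. This is a swapped-in, simpler technique: as far as I know, Hail's `p_value_hwe` comes from an exact test, so the numbers here are not comparable to the thresholds above. The function names are hypothetical.

```python
def hwe_expected(n_hom_ref, n_het, n_hom_var):
    # Expected genotype counts n*p^2, 2*n*p*q, n*q^2 from the observed allele frequency.
    n = n_hom_ref + n_het + n_hom_var
    p = (2 * n_hom_ref + n_het) / (2 * n)  # reference allele frequency
    q = 1 - p
    return (n * p * p, 2 * n * p * q, n * q * q)


def hwe_chi2(n_hom_ref, n_het, n_hom_var):
    # Chi-square statistic: sum of (observed - expected)^2 / expected.
    observed = (n_hom_ref, n_het, n_hom_var)
    expected = hwe_expected(*observed)
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)


print(hwe_chi2(25, 50, 25))  # genotype counts exactly in HWE proportions
print(hwe_chi2(50, 0, 50))   # no heterozygotes at all: strong deviation
```

A complete absence of heterozygotes, as in the second call, is the kind of pattern that genotyping artefacts can produce, which is why variants with extreme deviations are removed.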

### Average variant GQ

In [63]:
p = hl.plot.histogram(mt.variant_qc.gq_stats.mean, range=(0,100), legend='Mean GQ per variant histogram')
p.plot_width = 800
p.plot_height = 500
show(p)



## Sample QC

Below we see histograms of sample mean genotype quality and genotype depth. Most samples seem to have good depth and quality, although the variation between samples is quite large. There are some samples with low depth and quality, but we will not worry about these.

In [64]:
mt = hl.sample_qc(mt)

In [65]:
p = hl.plot.histogram(mt.sample_qc.gq_stats.mean, range=(0,100), legend='Mean Sample GQ')
p.plot_width = 800
p.plot_height = 500
show(p)



In [66]:
p = hl.plot.histogram(mt.sample_qc.dp_stats.mean, range=(0,60), legend='Mean Sample DP')
p.plot_width = 800
p.plot_height = 500
show(p)



Below is a histogram of the ratio of heterozygous to homozygous-variant calls per sample (`r_het_hom_var`), referred to as the het/hom rate below.

In [67]:
p = hl.plot.histogram(mt.sample_qc.r_het_hom_var, legend='Het/hom rate')
p.plot_width = 800
p.plot_height = 500
show(p)
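For reference, the ratio plotted above can be sketched for a single sample in plain Python. The function name is hypothetical, and returning `None` when there are no homozygous-variant calls is a choice made for this sketch only.

```python
def het_hom_var_ratio(genotypes):
    """Ratio of heterozygous to homozygous-variant calls for one sample.

    genotypes: list of (a, b) allele-index pairs, 0 meaning reference.
    """
    n_het = sum(1 for a, b in genotypes if a != b)
    n_hom_var = sum(1 for a, b in genotypes if a == b and a > 0)
    return n_het / n_hom_var if n_hom_var else None


# Three het calls, two hom-var calls, and one hom-ref call (which is ignored).
sample = [(0, 1), (0, 1), (0, 1), (1, 1), (1, 1), (0, 0)]
ratio = het_hom_var_ratio(sample)
print(ratio)
```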



It looks like a few samples have a much higher het/hom rate than the rest. Let's check whether this is due to poor coverage.

In [68]:
het_hom_thres = 1.4
mt = mt.annotate_cols(high_hom_het=mt.sample_qc.r_het_hom_var > het_hom_thres)

In [69]:
p = hl.plot.scatter(mt.sample_qc.dp_stats.mean, mt.sample_qc.r_het_hom_var, xlabel='DP mean', ylabel='het/hom rate',
                    hover_fields={'Sample': mt.s}, label=mt.high_hom_het)
p.plot_width = 600
p.plot_height = 600
show(p)



The plot above indicates that these samples have similar coverage to the other samples, so coverage doesn't explain the high het/hom rate.

We can also check the MultiQC reports of these samples (file paths below), and we see that they are all of reasonable quality.

In [70]:
high_hethom_samples = mt.filter_cols(mt.sample_qc.r_het_hom_var > het_hom_thres).s.collect()

for sample in high_hethom_samples:
    print('/data/projects/fargen_phase_1/data/single_sample_data/{sample}/multiqc/multiqc_report.html'.format(sample=sample))



/data/projects/fargen_phase_1/data/single_sample_data/FN000538/multiqc/multiqc_report.html
/data/projects/fargen_phase_1/data/single_sample_data/FN000909/multiqc/multiqc_report.html
/data/projects/fargen_phase_1/data/single_sample_data/FN000940/multiqc/multiqc_report.html
/data/projects/fargen_phase_1/data/single_sample_data/FN001018/multiqc/multiqc_report.html
/data/projects/fargen_phase_1/data/single_sample_data/FN001019/multiqc/multiqc_report.html


In the genealogy summary file (`/fargen/fargen_phase_1_utils/multi_sample/joint_genotyping/metadata/genealogy/individuals_summary.csv`), it seems that all these samples have reasonably deep roots in the Faroes. This means we have no reason to suspect this difference is due to these samples being from different populations.

These five samples most likely have a high het/hom rate due to poor data quality. One potential reason for this is contamination of the sample in the lab.

We will **discard high het/hom rate samples**, as they may skew further analyses down the line.

In [71]:
mt = mt.filter_cols(mt.sample_qc.r_het_hom_var < het_hom_thres)

In [72]:
n_variants, n_samples = mt.count()
print('Number of variants: ' + str(n_variants))
print('Number of samples: ' + str(n_samples))



Number of variants: 88885
Number of samples: 468


## Variant counts

In [73]:
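# Remove sites that have become invariant after the genotype filters (GQ and AB).
# Note: variant_qc.AC was computed in the HWE step, i.e. before the high het/hom samples were removed.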
mt = mt.filter_rows(mt.variant_qc.AC[1] > 0)

In [74]:
variant_counts(mt)



| Statistic | Value |
|---|---|
| n_variants | 86300.0 |
| snps_fraction | 0.866095 |
| indels_fraction | 0.133905 |


## Write data to file

In [75]:
if False:
    mt.write(BASE_DIR + '/data/mt/high_quality_variants.mt', overwrite=True)

2021-09-28 09:57:56 Hail: INFO: Ordering unsorted dataset with network shuffle
    Total size: 585.32 MiB
    * Rows/entries: 585.27 MiB
    * Columns: 49.87 KiB
    * Globals: 11.00 B
    * Smallest partition: 10027 rows (68.56 MiB)
    * Largest partition:  11876 rows (78.96 MiB)


## Summary

In this notebook we have:

* Split multi-allelic sites
* Filtered variants failing VQSR filter
* Filtered genotypes by genotype quality (GQ)
    * GQ > 20 for SNPs
    * GQ > 40 for indels
* Filtered heterozygous genotypes with allelic balance outside the open interval $]0.25;0.75[$
* Filtered variants failing HWE filter
    * $p > 10^{-9}$ for SNPs
    * $p > 10^{-6}$ for indels
* Removed five samples with abnormally high het/hom rate (>1.4)
* Removed variants with alternate allele count equal to zero