# Quality control

In [1]:
import hail as hl
hl.init(spark_conf={'spark.driver.memory': '10g'}, tmp_dir='/home/olavur/tmp')

Running on Apache Spark version 2.4.1
SparkUI available at http://hms-beagle-848846b477-48ks9:4040
Welcome to
     __  __     <>__
    / /_/ /__  __/ /
   / __  / _ `/ / /
  /_/ /_/\_,_/_/_/   version 0.2.61-3c86d3ba497a
LOGGING: writing to /home/olavur/experiments/2020-11-13_fargen1_exome_analysis/fargen-1-exome/notebooks/qc/hail-20210614-1347-0.2.61-3c86d3ba497a.log


In [2]:
from bokeh.io import show, output_notebook
from bokeh.layouts import gridplot
from bokeh.models.scales import LogScale
output_notebook()

Load variant data.

In [3]:
BASE_DIR = '/home/olavur/experiments/2020-11-13_fargen1_exome_analysis'
mt = hl.read_matrix_table(BASE_DIR + '/data/mt/variants.mt')

In [4]:
n_variants, n_samples = mt.count()
print('Number of variants: ' + str(n_variants))
print('Number of samples: ' + str(n_samples))

Number of variants: 3110759
Number of samples: 472


Collect QC metrics on samples and variants, using built-in Hail methods.

In [5]:
mt = hl.variant_qc(mt)
mt = hl.sample_qc(mt)

## VQSR filters

We will remove variants that do not pass the VQSR filters. First, however, we will take a closer look at these values.

In [6]:
mt = mt.transmute_rows(filters=hl.delimit(mt.filters, ','))
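The cell above turns the `filters` collection into a single comma-delimited string, so a variant with no filters applied ends up with the empty string. A rough plain-Python analogue of that idea is sketched below; the function `delimit` is a stand-in written for illustration (not Hail's implementation), and the sorting is only there to make the toy output deterministic.

```python
def delimit(filters, delimiter=','):
    """Join a collection of filter names into one string (illustrative stand-in)."""
    return delimiter.join(sorted(filters))

# A variant that passed all filters has an empty collection of filter names.
passing = delimit(set())
# A variant flagged by one VQSR tranche filter.
failing = delimit({'VQSRTrancheSNP99.00to99.90'})

print(repr(passing), repr(failing))
```

This is why the later comparison against `''` selects the passing variants.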

We will calculate some mean QC statistics for each VQSR tranche.

In [7]:
rows_ht = mt.rows()
result = (rows_ht.group_by(rows_ht.filters)
         .aggregate(mean_gq = hl.agg.filter(~hl.is_nan(rows_ht.variant_qc.gq_stats.mean), hl.agg.mean(rows_ht.variant_qc.gq_stats.mean)),
                   mean_dp = hl.agg.filter(~hl.is_nan(rows_ht.variant_qc.dp_stats.mean), hl.agg.mean(rows_ht.variant_qc.dp_stats.mean)),
                   mean_af = hl.agg.filter(~hl.is_nan(rows_ht.variant_qc.AF[0]), hl.agg.mean(1 - rows_ht.variant_qc.AF[0])),
                   mean_vqslod = hl.agg.filter(~hl.is_nan(rows_ht.info.VQSLOD), hl.agg.mean(rows_ht.info.VQSLOD)),
                   n_variants = hl.agg.count()))
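The aggregation above drops NaN values before averaging within each filter group. The same idea in plain Python, as a minimal sketch on made-up rows; the helper `nan_safe_group_means` and the toy values are assumptions for illustration and are not part of Hail.

```python
import math
from collections import defaultdict

def nan_safe_group_means(rows, key, field):
    """Mean of `field` per value of `key`, ignoring NaN entries (hypothetical helper)."""
    groups = defaultdict(list)
    for row in rows:
        value = row[field]
        if not math.isnan(value):
            groups[row[key]].append(value)
    # Groups whose values are all NaN never get an entry, so no division by zero here.
    return {k: sum(v) / len(v) for k, v in groups.items()}

# Toy rows, invented for illustration only.
toy_rows = [
    {'filters': '', 'mean_gq': 50.0},
    {'filters': '', 'mean_gq': float('nan')},
    {'filters': '', 'mean_gq': 40.0},
    {'filters': 'VQSRTrancheSNP99.00to99.90', 'mean_gq': 30.0},
]
group_means = nan_safe_group_means(toy_rows, 'filters', 'mean_gq')
print(group_means)
```

Without the NaN check, a single NaN would turn the whole group mean into NaN.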

We convert the results to a Pandas dataframe.

In [8]:
vqsr_stats_pd = result.to_pandas()

2021-06-14 13:47:35 Hail: INFO: Coerced sorted dataset
2021-06-14 13:47:35 Hail: INFO: Coerced dataset with out-of-order partitions.


Below we first print the statistics for the SNPs and then for the indels. The rows are sorted by mean genotype quality. The row with an empty filter string (`filters == ''`) contains the variants that passed all filters.

In [9]:
vqsr_stats_pd[vqsr_stats_pd.filters.isin(['VQSRTrancheSNP99.00to99.90', 'VQSRTrancheSNP99.90to100.00', ''])].sort_values('mean_gq')

|   | filters | mean_gq | mean_dp | mean_af | mean_vqslod | n_variants |
|---|---|---|---|---|---|---|
| 4 | VQSRTrancheSNP99.90to100.00 | 45.259487 | 24.388426 | 0.012501 | -15.030785 | 355989 |
| 3 | VQSRTrancheSNP99.00to99.90 | 47.31921 | 18.250923 | 0.014949 | -1.060047 | 531007 |
| 0 |  | 50.08941 | 23.245864 | 0.073181 | inf | 1332013 |


In [10]:
vqsr_stats_pd[vqsr_stats_pd.filters.isin(['VQSRTrancheINDEL99.00to99.90', 'VQSRTrancheINDEL99.90to100.00', ''])].sort_values('mean_gq')

|   | filters | mean_gq | mean_dp | mean_af | mean_vqslod | n_variants |
|---|---|---|---|---|---|---|
| 1 | VQSRTrancheINDEL99.00to99.90 | 37.965855 | 17.087049 | 0.003321 | -1.531774 | 630276 |
| 2 | VQSRTrancheINDEL99.90to100.00 | 49.220394 | 26.10606 | 0.005978 | -5.601662 | 261474 |
| 0 |  | 50.08941 | 23.245864 | 0.073181 | inf | 1332013 |
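The `inf` reported as `mean_vqslod` for the passing variants is what a mean looks like when at least one value in the group is infinite, since a NaN check does not exclude infinities. Whether that is the cause here is an assumption; the sketch below uses made-up values and only shows the arithmetic, together with `math.isfinite` as one way to exclude such values.

```python
import math

# Made-up VQSLOD values; one is infinite and one is NaN.
vqslod = [2.5, 7.5, float('inf'), float('nan')]

# Dropping only NaN keeps the infinite value in the sum.
not_nan = [v for v in vqslod if not math.isnan(v)]
# Keeping only finite values drops both NaN and infinities.
finite = [v for v in vqslod if math.isfinite(v)]

mean_not_nan = sum(not_nan) / len(not_nan)
mean_finite = sum(finite) / len(finite)
print(mean_not_nan, mean_finite)
```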


We will only look at the **high-quality variants** from now on. So we remove variants in the 99% to 99.9% and the 99.9% to 100% VQSR tranches, and keep only the unfiltered variants.

In [11]:
mt = mt.filter_rows(mt.filters == '')

We check how many variants we have remaining after filtering.

In [12]:
n_variants, n_samples = mt.count()
print('Number of variants: ' + str(n_variants))
print('Number of samples: ' + str(n_samples))

Number of variants: 1332013
Number of samples: 472


## Variant QC

Below we make a histogram of variant mean genotype quality. We see that many variants have a mean GQ just shy of 100, which is quite good. Many others have a mean GQ between about 15 and 50.

In [13]:
p = hl.plot.histogram(mt.variant_qc.gq_stats.mean, range=(0,100), legend='Mean GQ per variant histogram')
p.plot_width = 800
p.plot_height = 500
show(p)

VQSR filtering takes many factors into account, so variants with low GQ may still pass. As an extra precaution, we will remove variants with mean GQ < 20.

In [14]:
mt = mt.filter_rows(mt.variant_qc.gq_stats.mean >= 20)
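Note that the filter acts on the mean GQ across samples for each variant, not on individual genotypes. A minimal plain-Python sketch of that logic, with invented variant names and GQ values:

```python
from statistics import mean

# Toy per-variant genotype qualities across samples (invented values).
gq_by_variant = {
    'variant_a': [99, 99, 96],
    'variant_b': [12, 25, 14],   # mean 17, below the threshold
    'variant_c': [20, 20, 20],   # mean exactly 20, kept because the test is >=
}
MIN_MEAN_GQ = 20  # same threshold as in the text

kept = {v: gqs for v, gqs in gq_by_variant.items() if mean(gqs) >= MIN_MEAN_GQ}
print(sorted(kept))
```

A kept variant can still contain individual low-GQ genotypes, and a removed variant can contain individual high-GQ ones, as `variant_b` shows.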

In [15]:
n_variants, n_samples = mt.count()
print('Number of variants: ' + str(n_variants))
print('Number of samples: ' + str(n_samples))

Number of variants: 1146382
Number of samples: 472


Let's look at the depth histogram below. It seems quite a lot of variants have low depth. However, these variants may very well still be reliable, so we will keep them.

In [16]:
p = hl.plot.histogram(mt.variant_qc.dp_stats.mean, range=(0,100), legend='Mean DP per variant histogram')
p.plot_width = 800
p.plot_height = 500
show(p)

At this point, we will refrain from using population filters such as minor allele frequency and Hardy-Weinberg equilibrium filters. We omit them because suitable thresholds can be quite context dependent.

## Sample QC

Below we see histograms of sample mean genotype quality and genotype depth. Most samples seem to have good depth and quality, although the deviation between samples is quite large. There are some samples with low depth and quality, but we will not worry about these.

In [17]:
p = hl.plot.histogram(mt.sample_qc.gq_stats.mean, range=(10,100), legend='Mean Sample GQ')
p.plot_width = 800
p.plot_height = 500
show(p)

In [18]:
p = hl.plot.histogram(mt.sample_qc.dp_stats.mean, range=(0,60), legend='Mean Sample DP')
p.plot_width = 800
p.plot_height = 500
show(p)

Below is a histogram of the ratio of heterozygous to homozygous-alternate calls per sample (`r_het_hom_var`), referred to here as the het/hom rate.

In [19]:
p = hl.plot.histogram(mt.sample_qc.r_het_hom_var, range=(1.3,4), legend='Het/hom rate')
p.plot_width = 800
p.plot_height = 500
show(p)
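For reference, the quantity plotted above is the number of heterozygous calls divided by the number of homozygous-alternate calls for a sample. A minimal sketch on invented genotype calls; the function name and the tuple encoding of genotypes are assumptions made for this example.

```python
def het_hom_var_ratio(genotypes):
    """Ratio of heterozygous calls to homozygous-alternate calls.

    `genotypes` is a list of (allele1, allele2) tuples, with 0 as the reference allele.
    Raises ZeroDivisionError if the sample has no homozygous-alternate calls.
    """
    n_het = sum(1 for a, b in genotypes if a != b)
    n_hom_var = sum(1 for a, b in genotypes if a == b and a != 0)
    return n_het / n_hom_var

# Invented calls for one sample: three het, two hom-alt, one hom-ref.
toy_calls = [(0, 1), (0, 1), (1, 1), (0, 0), (0, 1), (1, 1)]
ratio = het_hom_var_ratio(toy_calls)
print(ratio)
```

Homozygous-reference calls do not enter the ratio at all, so it reflects only the sites where the sample carries a non-reference allele.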

It looks like a few samples have a much higher het/hom rate than the rest. Let's check whether this is due to poor coverage.

In [20]:
mt = mt.annotate_cols(high_hom_het=mt.sample_qc.r_het_hom_var > 3)

In [21]:
p = hl.plot.scatter(mt.sample_qc.dp_stats.mean, mt.sample_qc.r_het_hom_var, xlabel='DP mean', ylabel='het/hom rate',
                    hover_fields={'Sample': mt.s}, label=mt.high_hom_het)
p.plot_width = 600
p.plot_height = 600
show(p)

The plot above indicates that these samples have coverage similar to the other samples, so coverage does not explain the high het/hom rate.

We can also check the MultiQC reports of these samples (file paths below), and we see that they are all of reasonable quality.

In [22]:
high_hethom_samples = mt.filter_cols(mt.sample_qc.r_het_hom_var > 3).s.collect()

for sample in high_hethom_samples:
    print('/data/projects/fargen_phase_1/data/single_sample_data/{sample}/multiqc/multiqc_report.html'.format(sample=sample))

/data/projects/fargen_phase_1/data/single_sample_data/FN000909/multiqc/multiqc_report.html
/data/projects/fargen_phase_1/data/single_sample_data/FN001019/multiqc/multiqc_report.html
/data/projects/fargen_phase_1/data/single_sample_data/FN000940/multiqc/multiqc_report.html
/data/projects/fargen_phase_1/data/single_sample_data/FN001018/multiqc/multiqc_report.html


In the genealogy summary file (`/fargen/fargen_phase_1_utils/multi_sample/joint_genotyping/metadata/genealogy/individuals_summary.csv`), it seems that all these samples have reasonably deep roots in the Faroes. This means we have no reason to suspect this difference is due to these samples being from different populations.

These four samples most likely have a high het/hom rate due to poor data quality. One potential reason for this is contamination of the sample in the lab.

We will **discard high het/hom rate samples**, as they may skew further analyses down the line.

In [23]:
mt = mt.filter_cols(mt.sample_qc.r_het_hom_var < 3)

In [24]:
n_variants, n_samples = mt.count()
print('Number of variants: ' + str(n_variants))
print('Number of samples: ' + str(n_samples))

Number of variants: 1146382
Number of samples: 468


## Write variants to file

In [25]:
mt.write(BASE_DIR + '/data/mt/high_quality_variants.mt', overwrite=True)

2021-06-14 13:50:21 Hail: INFO: wrote matrix table with 1146382 rows and 468 columns in 37 partitions to /home/olavur/experiments/2020-11-13_fargen1_exome_analysis/data/mt/high_quality_variants.mt
    Total size: 1.61 GiB
    * Rows/entries: 1.61 GiB
    * Columns: 49.98 KiB
    * Globals: 11.00 B
    * Smallest partition: 29355 rows (38.26 MiB)
    * Largest partition:  31925 rows (46.83 MiB)


## Summary

In this notebook we have:

* Removed variants failing the VQSR filters
* Removed variants with mean genotype quality < 20
* Removed four samples with abnormally high het/hom rate (>3)
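As a quick consistency check, the counts printed in this notebook can be tallied in plain Python. All numbers below are copied from the outputs above; the variable names are chosen for this sketch.

```python
# Counts copied from the outputs in this notebook.
n_total = 3110759
tranche_counts = {
    '': 1332013,
    'VQSRTrancheSNP99.00to99.90': 531007,
    'VQSRTrancheSNP99.90to100.00': 355989,
    'VQSRTrancheINDEL99.00to99.90': 630276,
    'VQSRTrancheINDEL99.90to100.00': 261474,
}
n_after_vqsr = 1332013
n_after_gq = 1146382
n_samples_before, n_samples_after = 472, 468

# The per-tranche counts should add up to the total number of variants.
tranches_sum_to_total = sum(tranche_counts.values()) == n_total

removed_by_vqsr = n_total - n_after_vqsr
removed_by_gq = n_after_vqsr - n_after_gq
removed_samples = n_samples_before - n_samples_after

print(tranches_sum_to_total, removed_by_vqsr, removed_by_gq, removed_samples)
```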