# Concordance between FarGen and gnomAD

**NOTE:** *Concordance* may not be the right word here, since it usually refers to genotype concordance. *Overlap* is probably more suitable.

In [1]:
import hail as hl
hl.init(spark_conf={'spark.driver.memory': '100g', 'spark.local.dir': '/home/olavur/tmp'})

Running on Apache Spark version 2.4.1
SparkUI available at http://hms-beagle-7889d4ff4c-6wxtc:4044
Welcome to
     __  __     <>__
    / /_/ /__  __/ /
   / __  / _ `/ / /
  /_/ /_/\_,_/_/_/   version 0.2.61-3c86d3ba497a
LOGGING: writing to /home/olavur/experiments/2020-11-13_fargen1_exome_analysis/fargen-1-exome/notebooks/gnomad_exome_sites/hail-20210312-1208-0.2.61-3c86d3ba497a.log


In [2]:
from bokeh.io import show, output_notebook
from bokeh.layouts import gridplot
from bokeh.models.scales import LogScale
output_notebook()

In [112]:
import pandas as pd
import numpy as np

In [3]:
BASE_DIR = '/home/olavur/experiments/2020-11-13_fargen1_exome_analysis'

## Load gnomAD exome sites data

In [4]:
gnomad_ht = hl.read_table(BASE_DIR + '/data/resources/gnomAD/gnomad.exomes.r2.1.1.sites.GRCh38.ht')

In [5]:
n_variants = gnomad_ht.count()
print('Number of variants: ' + str(n_variants))

Number of variants: 17204631


## Load FarGen data annotated with gnomAD data

In [10]:
fargen_mt = hl.read_matrix_table(BASE_DIR + '/data/mt/hq_gnomad_annotated.mt')

## Concordance

We want to compare the sites (rows) in the two datasets, so we extract the rows table from the FarGen matrix table and drop all non-key fields from both datasets. Calling `select()` with no arguments keeps only the key fields.

In [69]:
fargen_sites_ht = fargen_mt.rows()
fargen_sites_ht = fargen_sites_ht.select()

In [70]:
gnomad_sites_ht = gnomad_ht.select()

Count the number of sites in each dataset.

In [46]:
n_fargen_sites = fargen_sites_ht.count()
n_gnomad_sites = gnomad_sites_ht.count()

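Count the number of sites present in *either* dataset, the *union*. Note that the count reported below equals the sum of the two table sizes, which indicates that `union` keeps a site once per table when it occurs in both, rather than counting distinct sites.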

In [71]:
union_ht = fargen_sites_ht.union(gnomad_sites_ht)
n_union = union_ht.count()

Count the number of sites present in *both* datasets, the *intersection*.

In [49]:
intersection_ht = fargen_sites_ht.semi_join(gnomad_sites_ht)
n_intersection = intersection_ht.count()
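
As a rough illustration of what the semi-join does, the sketch below mimics it on plain Python tuples. The site keys are made up, and the `toy_` names are hypothetical and not part of the notebook. A semi-join keeps the left-hand rows whose key occurs in the right-hand table; its complement, an anti-join, keeps the rows that do not.

```python
# Toy (locus, alleles) keys; every value here is invented for illustration.
toy_fargen_keys = [
    ("chr1:100", ("A", "G")),
    ("chr1:200", ("C", "T")),
    ("chr2:50", ("G", "A")),
]
toy_gnomad_keys = [
    ("chr1:200", ("C", "T")),
    ("chr2:50", ("G", "A")),
    ("chr3:10", ("T", "C")),
]

toy_gnomad_set = set(toy_gnomad_keys)

# Semi-join: FarGen sites whose key is also present in gnomAD.
toy_shared = [k for k in toy_fargen_keys if k in toy_gnomad_set]

# Anti-join: FarGen sites whose key is absent from gnomAD.
toy_private = [k for k in toy_fargen_keys if k not in toy_gnomad_set]

print(len(toy_shared), len(toy_private))
```

Every FarGen site lands in exactly one of the two lists, so the two counts add up to the FarGen total.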

In [62]:
count_list = [n_fargen_sites, n_gnomad_sites, n_union, n_intersection]
index = ['FarGen', 'gnomAD', 'Union', 'Intersection']
pd.DataFrame(count_list, index=index, columns=['Variant count'])

| | Variant count |
|---|---|
| FarGen | 1194405 |
| gnomAD | 17204631 |
| Union | 18399036 |
| Intersection | 195491 |
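
The reported counts can be sanity-checked with a little arithmetic. The sketch below uses only the numbers from the table above. The `reported` dictionary and the derived names are hypothetical helpers, and the distinct-site estimate assumes that site keys are unique within each table.

```python
# Counts copied from the table above.
reported = {
    "FarGen": 1194405,
    "gnomAD": 17204631,
    "Union": 18399036,
    "Intersection": 195491,
}

# The reported union is exactly the sum of the two table sizes.
union_is_sum = reported["Union"] == reported["FarGen"] + reported["gnomAD"]

# Inclusion-exclusion estimate of the number of distinct sites
# (assumes no duplicate keys within either table).
n_distinct_sites = reported["FarGen"] + reported["gnomAD"] - reported["Intersection"]

# Share of each dataset that is found in the other.
frac_fargen_in_gnomad = reported["Intersection"] / reported["FarGen"]
frac_gnomad_in_fargen = reported["Intersection"] / reported["gnomAD"]

print(union_is_sum, n_distinct_sites)
print(f"{frac_fargen_in_gnomad:.3f} {frac_gnomad_in_fargen:.3f}")
```

Under that assumption, the number of distinct sites is 18203545, and roughly 16% of FarGen sites are also present in gnomAD.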


## Calculate site-frequency spectrum

Calculate a histogram of allele frequencies.

In [138]:
n_bins = 100
assert n_bins % 2 == 0, 'Number of bins must be an even number.'
hist_struct = fargen_mt.aggregate_rows(hl.agg.hist(fargen_mt.info.AF[0], 0, 1, n_bins))

Get the allele frequency bin edges. There are `n_bins + 1` of them.

In [139]:
allele_freq = hist_struct.bin_edges
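
To make the shape of the histogram result concrete, here is a minimal pure-Python sketch of an equal-width histogram. The function `toy_equal_width_hist` and its inputs are hypothetical. It is meant to mirror the `bin_edges` and `bin_freq` fields used here, not to reproduce Hail's implementation, and it assumes the last bin includes its right edge.

```python
def toy_equal_width_hist(values, start, end, n_bins):
    """Return (bin_edges, bin_freq) for equal-width bins on [start, end]."""
    width = (end - start) / n_bins
    bin_edges = [start + i * width for i in range(n_bins + 1)]
    bin_freq = [0] * n_bins
    for v in values:
        if v < start or v > end:
            continue  # values outside the range are ignored in this sketch
        # Values equal to `end` are assigned to the last bin.
        i = min(int((v - start) / width), n_bins - 1)
        bin_freq[i] += 1
    return bin_edges, bin_freq


# Made-up allele frequencies.
toy_afs = [0.01, 0.02, 0.30, 0.55, 0.99, 1.0]
toy_edges, toy_counts = toy_equal_width_hist(toy_afs, 0.0, 1.0, 4)
print(toy_edges)   # [0.0, 0.25, 0.5, 0.75, 1.0]
print(toy_counts)  # [2, 1, 1, 2]
```

Four bins yield five edges, which is why the edge array is one element longer than the count array.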

Convert the per-bin site counts to site frequencies, that is, the fraction of all binned sites that falls in each bin.

In [140]:
site_counts = np.array(hist_struct.bin_freq)
n_sites = sum(site_counts)
site_freq = site_counts / n_sites

Compute the folded site frequency by adding each bin to its mirror image around an allele frequency of 0.5.

In [141]:
half = n_bins // 2
folded_site_freq = site_freq[:half] + site_freq[:half-1:-1]
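
The slicing used for folding can be checked on a small example. The sketch below uses plain Python lists and made-up integer counts (the `toy_` names are hypothetical). With six bins, bin `i` is paired with bin `5 - i`.

```python
# Made-up per-bin site counts for 6 allele-frequency bins.
toy_counts = [40, 20, 10, 10, 12, 8]
toy_half = len(toy_counts) // 2

# The first half, in order: bins 0, 1, 2.
lower = toy_counts[:toy_half]
# The second half, reversed: bins 5, 4, 3.
upper_reversed = toy_counts[:toy_half - 1:-1]

# Add each bin to its mirror image.
toy_folded = [a + b for a, b in zip(lower, upper_reversed)]
print(toy_folded)  # [48, 32, 20]
```

Folding preserves the total, so the folded values sum to the same number as the unfolded ones.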

Make a Hail table out of the results.

In [142]:
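# Make a Hail table of allele frequencies (left bin edges) and folded site frequencies.
# zip() stops at the shorter sequence, so only the first `half` bin edges are used.
ffs_table = []
for af, ff in zip(allele_freq, folded_site_freq):
    row = {'af': float(af), 'ff': float(ff)}
    ffs_table.append(row)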
    
ht_ffs = hl.Table.parallelize(hl.literal(ffs_table, 'array<struct{af:float32,ff:float32}>'))
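
The row construction relies on `zip` stopping at the shorter input: there are `n_bins + 1` bin edges but only `n_bins / 2` folded values, so each folded value is paired with the left edge of its lower-frequency bin. A small sketch with made-up numbers (the `toy_` names are hypothetical):

```python
toy_n_bins = 4
# Five bin edges for four bins.
toy_edges = [i / toy_n_bins for i in range(toy_n_bins + 1)]
# Two folded values for four bins; the numbers are invented.
toy_folded = [0.7, 0.3]

# zip() truncates to the shorter sequence, so only the first two edges are used.
toy_rows = [{'af': af, 'ff': ff} for af, ff in zip(toy_edges, toy_folded)]
print(toy_rows)  # [{'af': 0.0, 'ff': 0.7}, {'af': 0.25, 'ff': 0.3}]
```

The remaining edges (0.5, 0.75 and 1.0 in this toy case) are silently dropped, which is the intended behaviour for a folded spectrum.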

Plot the folded site frequency spectrum (FFS).

In [143]:
p = hl.plot.scatter(ht_ffs.af, ht_ffs.ff,
                    xlabel='Minor allele frequency', ylabel='Fraction of sites', title='Site frequency spectrum (folded)',
                    collect_all=True)
p.plot_width = 800
p.plot_height = 400
show(p)