# Population structure - PCA of the FarGen cohort

In [1]:
import hail as hl
hl.init(spark_conf={'spark.driver.memory': '100g'}, tmp_dir='/home/olavur/tmp')

Running on Apache Spark version 2.4.1
SparkUI available at http://hms-beagle-848846b477-48ks9:4041
Welcome to
     __  __     <>__
    / /_/ /__  __/ /
   / __  / _ `/ / /
  /_/ /_/\_,_/_/_/   version 0.2.61-3c86d3ba497a
LOGGING: writing to /home/olavur/experiments/2020-11-13_fargen1_exome_analysis/fargen-1-exome/notebooks/qc/hail-20210604-1228-0.2.61-3c86d3ba497a.log


In [2]:
from bokeh.io import show, output_notebook
from bokeh.layouts import gridplot
from bokeh.models.scales import LogScale
output_notebook()

In [3]:
import pandas as pd

## Load FarGen data

Use high-quality variants.

In [4]:
BASE_DIR = '/home/olavur/experiments/2020-11-13_fargen1_exome_analysis'
mt = hl.read_matrix_table(BASE_DIR + '/data/mt/high_quality_variants.mt')

In [6]:
n_variants, n_samples = mt.count()
print('Number of variants: ' + str(n_variants))
print('Number of samples: ' + str(n_samples))

Number of variants: 1146382
Number of samples: 468


## Population filters

Apply HWE filter to remove potential genotyping errors. Note that if we use a stringent HWE filter we risk removing evidence of population structure from our data. Therefore, we shall use a lenient filter, only removing variants with very strong deviation from HWE at $p<10^{-9}$.

In [7]:
mt = mt.annotate_rows(hwe=hl.agg.hardy_weinberg_test(mt.GT))
mt = mt.filter_rows(mt.hwe.p_value > 1e-9)
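
To make the filter concrete, here is a minimal sketch of a Hardy-Weinberg test on genotype counts. It uses the simple chi-square approximation with one degree of freedom, which is an assumption for illustration and not necessarily the test that `hl.agg.hardy_weinberg_test` computes, so p-values may differ. The function name `hwe_chi_square` and the example counts are hypothetical.

```python
import math

def hwe_chi_square(n_hom_ref, n_het, n_hom_alt):
    """Chi-square HWE test (1 d.f.) from genotype counts.

    Returns (chi2, p_value). Illustrative approximation only.
    """
    n = n_hom_ref + n_het + n_hom_alt
    if n == 0:
        raise ValueError("no genotype calls")
    # Alternate allele frequency estimated from the counts.
    q = (2 * n_hom_alt + n_het) / (2 * n)
    p = 1 - q
    if p == 0 or q == 0:
        # Monomorphic site: no deviation from HWE is possible.
        return 0.0, 1.0
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_hom_ref, n_het, n_hom_alt)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of the chi-square distribution with 1 d.f.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# A site in perfect HWE is kept; a site with no heterozygotes is removed.
for counts in [(25, 50, 25), (50, 0, 50)]:
    chi2, p_value = hwe_chi_square(*counts)
    print(counts, "keep" if p_value > 1e-9 else "remove")
```

With a threshold as lenient as $10^{-9}$, only sites with a gross excess or deficit of heterozygotes are removed.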

To filter rare variants, we first must calculate the minor allele frequency.

In [8]:
# The number of called alleles at a site is the sum of the ploidy over all samples.
# With no missing genotype calls, this is twice the number of samples.
# Missing genotype calls do not contribute, so the number can be lower.
AN_exprs = hl.agg.sum(mt.GT.ploidy)
mt = mt.annotate_rows(AN=AN_exprs)

# Calculate the number of alternate alleles at each site.
AC_exprs = hl.agg.sum(mt.GT.n_alt_alleles())
mt = mt.annotate_rows(AC=AC_exprs)

# Calculate the alternate allele frequency.
mt = mt.annotate_rows(AF=mt.AC / mt.AN)
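
The same bookkeeping can be sketched in plain Python for a single site. This is only an illustration under the assumption of diploid, biallelic genotypes encoded as alternate allele counts; `allele_counts` and the example genotypes are hypothetical.

```python
def allele_counts(genotypes):
    """Return (AN, AC, AF) for one diploid, biallelic site.

    `genotypes` holds the number of alternate alleles per sample
    (0, 1 or 2), or None for a missing call.
    """
    called = [g for g in genotypes if g is not None]
    an = 2 * len(called)   # each called diploid genotype contributes 2 alleles
    ac = sum(called)       # alternate alleles among the called genotypes
    af = ac / an if an > 0 else None
    return an, ac, af

# Five samples, one missing call: AN is 8 rather than 10.
print(allele_counts([0, 1, 2, None, 1]))
```

Dividing by `AN` rather than by twice the number of samples keeps the frequency unbiased by missing calls.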

Remove variants with a minor allele frequency of 0.01 or lower. We remove only very rare variants, because common variants alone can be insufficient to describe fine-scale population structure.

In [9]:
maf_filter = 0.01
mt = mt.filter_rows((mt.AF > maf_filter) & (mt.AF < (1 - maf_filter)))
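
The two-sided condition on `AF` is equivalent to a single condition on the minor allele frequency, because the minor allele may be either the reference or the alternate allele. A small sketch, with the hypothetical helper `passes_maf_filter`:

```python
def passes_maf_filter(af, maf_filter=0.01):
    """True if the minor allele frequency is above `maf_filter`.

    `af` is the alternate allele frequency; the minor allele frequency
    is whichever of af and 1 - af is smaller.
    """
    maf = min(af, 1 - af)
    return maf > maf_filter

for af in (0.005, 0.3, 0.995):
    print(af, passes_maf_filter(af))
```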

## Filter indels

Remove all indels from the dataset.

**NOTE:** this code only works because there are only diallelic sites. If there were multi-allelic sites, I would have to check all allele pairs.

In [10]:
mt = mt.filter_rows(hl.is_snp(mt.alleles[0], mt.alleles[1]))
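
If multi-allelic sites were present, every alternate allele would have to be checked against the reference. The sketch below shows that logic in plain Python. It uses a simplified definition of a SNP (reference and alternate are different single nucleotides), which is an assumption and may not match `hl.is_snp` in every case; both function names are hypothetical.

```python
NUCLEOTIDES = set("ACGT")

def is_simple_snp(ref, alt):
    """True if ref and alt are different single nucleotides."""
    return (len(ref) == 1 and len(alt) == 1
            and ref in NUCLEOTIDES and alt in NUCLEOTIDES
            and ref != alt)

def all_alts_are_snps(alleles):
    """Check every alternate allele against the reference, alleles[0]."""
    ref, alts = alleles[0], alleles[1:]
    return len(alts) > 0 and all(is_simple_snp(ref, alt) for alt in alts)

print(all_alts_are_snps(["A", "G"]))        # biallelic SNP
print(all_alts_are_snps(["A", "G", "AT"]))  # second alternate is an insertion
```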

## LD pruning

Before we calculate LD, we remove multi-allelic sites.

**NOTE:** I can do LD pruning on multi-allelic sites using [split_multi()](https://hail.is/docs/0.2/methods/genetics.html#hail.methods.split_multi).

In [11]:
mt = mt.filter_rows(hl.len(mt.alleles) == 2)

Prune variants with $r^2 > 0.2$ within a 500 000 basepair window.

**NOTE:** this $r^2$ value is quite arbitrary. It is the default from the Hail method and I did not try other values.

In [12]:
pruned_variant_table = hl.ld_prune(mt.GT, r2=0.2, bp_window_size=500000)
mt = mt.filter_rows(hl.is_defined(pruned_variant_table[mt.row_key]))

2021-06-04 12:29:01 Hail: INFO: ld_prune: running local pruning stage with max queue size of 401850 variants
2021-06-04 12:29:15 Hail: INFO: wrote table with 84041 rows in 37 partitions to /home/olavur/tmp/ABC4ejVYKpuAu3E9HCuX9k
    Total size: 2.06 MiB
    * Rows: 2.06 MiB
    * Globals: 11.00 B
    * Smallest partition: 1048 rows (25.28 KiB)
    * Largest partition:  2875 rows (72.31 KiB)
2021-06-04 12:29:32 Hail: INFO: Wrote all 21 blocks of 84041 x 468 matrix with block size 4096.
2021-06-04 12:32:42 Hail: INFO: wrote table with 185 rows in 41 partitions to /home/olavur/tmp/RrRAxUWCQcO5Jpa0dEvHQn
    Total size: 856.50 KiB
    * Rows: 4.88 KiB
    * Globals: 851.62 KiB
    * Smallest partition: 0 rows (21.00 B)
    * Largest partition:  22 rows (461.00 B)
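
For reference, the quantity being thresholded is the squared correlation between the genotypes of two variants. The sketch below computes it in plain Python for two variants. It is an illustration only: `genotype_r2` is a hypothetical helper, and skipping samples with missing calls is an assumption that may differ from how `hl.ld_prune` treats missing data.

```python
def genotype_r2(x, y):
    """Squared Pearson correlation between two genotype vectors.

    Genotypes are alternate allele counts (0, 1, 2). Samples with a
    missing call (None) in either variant are skipped.
    """
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    n = len(pairs)
    if n < 2:
        return None
    mean_x = sum(a for a, _ in pairs) / n
    mean_y = sum(b for _, b in pairs) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in pairs)
    var_x = sum((a - mean_x) ** 2 for a, _ in pairs)
    var_y = sum((b - mean_y) ** 2 for _, b in pairs)
    if var_x == 0 or var_y == 0:
        return None  # a monomorphic variant has no defined correlation
    return cov * cov / (var_x * var_y)

variant_a = [0, 1, 2, 1, 0]
variant_b = [0, 1, 2, 1, 0]   # identical to variant_a
variant_c = [1, 1, 1, 0, 2]   # weakly correlated with variant_a
print(genotype_r2(variant_a, variant_b) > 0.2)  # one of the pair would be pruned
print(genotype_r2(variant_a, variant_c) > 0.2)  # both could be kept
```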


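Make a checkpoint, which writes the matrix table to disk so that the operations above are not recomputed. Set the condition to `True` to write the checkpoint; as written, the cell reads a previously written checkpoint.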

In [None]:
if False:
    mt = mt.checkpoint('/home/olavur/tmp/ld_pruned.ht')
else:
    mt = hl.read_matrix_table('/home/olavur/tmp/ld_pruned.ht')

In [14]:
n_variants, n_samples = mt.count()
print('Number of variants: ' + str(n_variants))
print('Number of samples: ' + str(n_samples))

Number of variants: 83887
Number of samples: 468


This may be a lot of variants to run a PCA on, but as mentioned previously, we do not want to remove too many rare variants, as these can hold a lot of information about fine-scale population structure.

## Annotate birth place region

Import data containing the birthplace ID of all FarGen participants. We will annotate the matrix table with these IDs.

First we read a table containing the birthplace but with the 'RIN' participant ID. We will have to map this 'RIN' ID to an 'FN' ID.

In [15]:
rin_birthplace_ht = hl.import_table(BASE_DIR + '/data/metadata/birthplace/rin_region.csv', delimiter=',')
# Rename "ind" to "rin".
# Convert the region variable to float.
rin_birthplace_ht = rin_birthplace_ht.transmute(rin=rin_birthplace_ht.ind, birthplace=hl.float64(rin_birthplace_ht.region))

rin_birthplace_ht = rin_birthplace_ht.key_by(rin_birthplace_ht.rin)

2021-06-04 12:45:17 Hail: INFO: Reading table without type imputation
  Loading field 'ind' as type str (not specified)
  Loading field 'region' as type str (not specified)


Import table with 'RIN' IDs and corresponding 'FN' IDs.

In [16]:
fargen_rin_ht = hl.import_table(BASE_DIR + '/data/metadata/birthplace/fargen_rin_samplename.csv', delimiter=',')
fargen_rin_ht = fargen_rin_ht.key_by(fargen_rin_ht.rin)

2021-06-04 12:45:17 Hail: WARN: Name collision: field 'sample' already in object dict. 
  This field must be referenced with __getitem__ syntax: obj['sample']
2021-06-04 12:45:17 Hail: INFO: Reading table without type imputation
  Loading field 'rin' as type str (not specified)
  Loading field 'sample' as type str (not specified)


Make a table with 'RIN', 'FN' and birthplace.

In [17]:
# Annotate the table with the birthplace by the samplenames.
samplename_birthplace_ht = rin_birthplace_ht.annotate(samplename=fargen_rin_ht[rin_birthplace_ht.rin].sample)
samplename_birthplace_ht = samplename_birthplace_ht.key_by(samplename_birthplace_ht.samplename)
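
The re-keying above amounts to joining two lookups on the 'RIN' ID. A plain-Python sketch with dictionaries is shown below. All IDs are made up for illustration, and unlike the Hail table, this sketch simply drops 'RIN' IDs that have no matching sample name.

```python
# Hypothetical example records; the real IDs live in the two CSV files.
rin_to_region = {"rin_001": 3.0, "rin_002": 1.0, "rin_003": 6.0}
rin_to_sample = {"rin_001": "FN000001", "rin_002": "FN000002"}

def birthplace_by_sample(rin_to_region, rin_to_sample):
    """Re-key the birthplace by sample name, dropping unmapped RIN IDs."""
    result = {}
    for rin, region in rin_to_region.items():
        sample = rin_to_sample.get(rin)
        if sample is not None:
            result[sample] = region
    return result

sample_to_birthplace = birthplace_by_sample(rin_to_region, rin_to_sample)
print(sample_to_birthplace)
# A sample that is absent from the mapping has no birthplace.
print(sample_to_birthplace.get("FN000003"))
```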

Finally, we can annotate the matrix table with the birthplace of each sample.

In [18]:
mt = mt.annotate_cols(birthplace = samplename_birthplace_ht[mt.s].birthplace)

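Count the number of samples in each region. Note that samples whose birthplace is unknown have a missing `birthplace` value (the last row of the output); region 6 is Suðuroy according to the region table at the end of this notebook.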

In [19]:
cols_ht = mt.cols()
result = (cols_ht.group_by(cols_ht.birthplace)
    .aggregate(count = hl.agg.count()))
result.to_pandas()

2021-06-04 12:45:20 Hail: WARN: cols(): Resulting column table is sorted by 'col_key'.
    To preserve matrix table column order, first unkey columns with 'key_cols_by()'
2021-06-04 12:45:20 Hail: WARN: Name collision: field 'count' already in object dict. 
  This field must be referenced with __getitem__ syntax: obj['count']
2021-06-04 12:45:20 Hail: INFO: Ordering unsorted dataset with network shuffle
2021-06-04 12:45:21 Hail: INFO: Ordering unsorted dataset with network shuffle
2021-06-04 12:45:23 Hail: INFO: Coerced sorted dataset
2021-06-04 12:45:23 Hail: INFO: Coerced dataset with out-of-order partitions.


birthplace | count
-----|-----
1.0 | 69
2.0 | 99
3.0 | 177
4.0 | 28
5.0 | 17
6.0 | 34
NaN | 44


## Filter related individuals

Estimate the relatedness between the samples with the PC-Relate method, using 2 principal components and a minimum individual-specific minor allele frequency of 0.001.

In [20]:
pc_rel = hl.pc_relate(mt.GT, min_individual_maf=0.001, k=2, statistics='kin')

2021-06-04 12:45:37 Hail: INFO: hwe_normalized_pca: running PCA using 83887 variants.
2021-06-04 12:45:38 Hail: INFO: pca: running PCA with 2 components...
2021-06-04 12:46:08 Hail: INFO: Wrote all 21 blocks of 83887 x 468 matrix with block size 4096.


Plot all the relatedness coefficients in a histogram to get an overview.

In [21]:
p = hl.plot.histogram(pc_rel.kin, title='Histogram of kinship coefficient')
show(p)

2021-06-04 12:51:17 Hail: INFO: wrote matrix with 3 rows and 83887 columns as 21 blocks of size 4096 to /home/olavur/tmp/pcrelate-write-read-CG9VBnJsl4RQj58eOSDn64.bm
2021-06-04 12:51:17 Hail: INFO: wrote matrix with 83887 rows and 468 columns as 21 blocks of size 4096 to /home/olavur/tmp/pcrelate-write-read-zN0wq5ThJ7qHVDzqnOapDK.bm
2021-06-04 12:51:50 Hail: INFO: wrote matrix with 468 rows and 468 columns as 1 block of size 4096 to /home/olavur/tmp/pcrelate-write-read-yvaT4AeXMF9y0bHS3rip1i.bm
2021-06-04 12:52:18 Hail: INFO: wrote matrix with 468 rows and 468 columns as 1 block of size 4096 to /home/olavur/tmp/pcrelate-write-read-RZyDsiW0tPoJhSqeZBbNuk.bm
2021-06-04 12:52:19 Hail: INFO: wrote matrix with 468 rows and 468 columns as 1 block of size 4096 to /home/olavur/tmp/pcrelate-write-read-oQLZAnN4nqyMpOek1eoZjo.bm
2021-06-04 12:52:19 Hail: INFO: Ordering unsorted dataset with network shuffle
2021-06-04 12:52:22 Hail: INFO: wrote matrix with 3 rows and 83887 columns as 21 blocks of

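**FIXME:** why did I previously use $2^{-3}$? The expected kinship coefficient is $2^{-2}$ for first-degree, $2^{-3}$ for second-degree and $2^{-4}$ for third-degree relatives, so the cutoff of $2^{-4}$ used below is more stringent than a second-degree cutoff and also catches some third-degree pairs.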

In [22]:
pairs = pc_rel.filter(pc_rel['kin'] > 2**(-4))
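
To put the cutoff in context, the sketch below tabulates the expected kinship coefficient per degree of relatedness, together with a lower bound for assigning a pair to that degree. The bounds are the geometric midpoints between consecutive expected values, a convention used by some relatedness tools; applying it to PC-Relate estimates is an assumption here, and both function names are hypothetical.

```python
def expected_kinship(degree):
    """Expected kinship coefficient for relatives of the given degree (1, 2, 3, ...)."""
    return 2.0 ** -(degree + 1)

def degree_lower_bound(degree):
    """Lower kinship bound for a degree: geometric midpoint to the next degree."""
    return 2.0 ** -(degree + 1.5)

for degree in (1, 2, 3):
    print(degree,
          round(expected_kinship(degree), 4),
          round(degree_lower_bound(degree), 4))
```

Under this convention, a kinship cutoff of $2^{-4}$ lies between the lower bounds for second-degree and third-degree relatives.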

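Then we compute a maximal independent set of samples, in which no two samples are related above the cutoff. With `keep=False`, Hail returns the complement of this set, that is, the samples to remove.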

In [23]:
related_samples_to_remove = hl.maximal_independent_set(pairs.i, pairs.j, keep=False)

2021-06-04 13:06:45 Hail: INFO: wrote matrix with 3 rows and 83887 columns as 21 blocks of size 4096 to /home/olavur/tmp/pcrelate-write-read-Nnp6aKsQsuNPSpAvD0NIVk.bm
2021-06-04 13:06:45 Hail: INFO: wrote matrix with 83887 rows and 468 columns as 21 blocks of size 4096 to /home/olavur/tmp/pcrelate-write-read-sHTpC1kGM5IJ7nYT13BsQm.bm
2021-06-04 13:07:14 Hail: INFO: wrote matrix with 468 rows and 468 columns as 1 block of size 4096 to /home/olavur/tmp/pcrelate-write-read-h636dKSqcJhWcLy84f6oQw.bm
2021-06-04 13:07:41 Hail: INFO: wrote matrix with 468 rows and 468 columns as 1 block of size 4096 to /home/olavur/tmp/pcrelate-write-read-38Kwo6HgytZeLq1H6R9uhV.bm
2021-06-04 13:07:41 Hail: INFO: wrote matrix with 468 rows and 468 columns as 1 block of size 4096 to /home/olavur/tmp/pcrelate-write-read-sCUMFr8GSXQOwQ903bMMso.bm
2021-06-04 13:07:41 Hail: INFO: Ordering unsorted dataset with network shuffle
2021-06-04 13:07:42 Hail: INFO: wrote table with 104 rows in 1 partition to /home/olavur/t

Now we filter these individuals from the matrix table.

In [24]:
mt = mt.filter_cols(hl.is_defined(related_samples_to_remove[mt.col_key]), keep=False)
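
The idea behind this step can be sketched with a greedy procedure: repeatedly drop the sample with the most relatives until no related pair remains. This is a simplified illustration and not necessarily the exact algorithm or tie-breaking used by `hl.maximal_independent_set`; `samples_to_remove` and the sample names are hypothetical.

```python
def samples_to_remove(pairs):
    """Greedily pick samples to drop so that no related pair remains."""
    # Build an adjacency map from the related pairs.
    neighbours = {}
    for i, j in pairs:
        if i == j:
            continue  # ignore self-pairs
        neighbours.setdefault(i, set()).add(j)
        neighbours.setdefault(j, set()).add(i)

    removed = set()
    while True:
        # Remaining sample with the highest number of relatives.
        candidate = max(neighbours, key=lambda s: len(neighbours[s]), default=None)
        if candidate is None or not neighbours[candidate]:
            break  # no related pairs are left
        for other in neighbours.pop(candidate):
            neighbours[other].discard(candidate)
        removed.add(candidate)
    return removed

related_pairs = [("A", "B"), ("A", "C"), ("D", "E")]
print(samples_to_remove(related_pairs))
```

Dropping the most connected sample first tends to keep more samples than dropping one member of each pair at random.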

Make another checkpoint of the matrix table. As written, the cell reads a previously written checkpoint; set the condition to `True` to write a new one.

In [37]:
if False:
    mt = mt.checkpoint('/home/olavur/tmp/rel_pruned.ht')
else:
    mt = hl.read_matrix_table('/home/olavur/tmp/rel_pruned.ht')

In [38]:
n_variants, n_samples = mt.count()
print('Number of variants: ' + str(n_variants))
print('Number of samples: ' + str(n_samples))

Number of variants: 83887
Number of samples: 389


## Compute PCA

In [39]:
eigenvalues, scores, loadings = hl.hwe_normalized_pca(mt.GT, k=4)

2021-06-04 13:17:04 Hail: INFO: hwe_normalized_pca: running PCA using 83887 variants.
2021-06-04 13:17:05 Hail: INFO: pca: running PCA with 4 components...
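
As a rough sketch of what "HWE-normalized" means, each variant's genotypes are centred by twice the allele frequency and scaled by the standard deviation expected under Hardy-Weinberg equilibrium before the PCA is run. The plain-Python function below illustrates this for one variant. It reflects my understanding of the normalisation rather than Hail's actual implementation, and `hwe_normalize` is a hypothetical name.

```python
import math

def hwe_normalize(genotypes, n_variants):
    """Standardise one variant's genotypes for an HWE-normalised PCA.

    `genotypes` are alternate allele counts (0, 1, 2) or None if missing.
    Missing calls become 0 after centring, i.e. they are mean-imputed.
    """
    called = [g for g in genotypes if g is not None]
    p = sum(called) / (2 * len(called))   # alternate allele frequency
    if p == 0 or p == 1:
        raise ValueError("monomorphic variants cannot be normalised")
    scale = math.sqrt(2 * p * (1 - p) * n_variants)
    return [0.0 if g is None else (g - 2 * p) / scale for g in genotypes]

print(hwe_normalize([0, 1, 2, None, 1], n_variants=1))
```

This scaling gives rare variants more weight per alternate allele than common ones, which is one reason the choice of MAF filter affects the result.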


In [54]:
mt = mt.annotate_cols(scores = scores[mt.s].scores)

In [41]:
p = hl.plot.scatter(mt.scores[0],
                    mt.scores[1],
                    label=hl.str(mt.birthplace),
                    title='PCA', xlabel='PC1', ylabel='PC2')
p.plot_width = 800
p.plot_height = 600
show(p)

## Filter outliers

In the PCA plot above, it seems we have a few outliers. These individuals may have partly or wholly non-Faroese ancestry, or they may be very closely related to one another. Either way, they prevent us from detecting possible population structure in the data.

As we see in the plot above, both PC 1 and 2 describe the variation between these outliers and the rest of the samples. So if we remove these outlier individuals, we may be able to detect population structure.

In [55]:
mt = mt.filter_cols(mt.scores[0] < 0.4)

Now we just do the PCA again, and see if PC 1 and 2 show signs of population structure.

In [56]:
eigenvalues, scores, loadings = hl.hwe_normalized_pca(mt.GT, k=4)

2021-06-04 13:23:04 Hail: INFO: hwe_normalized_pca: running PCA using 83887 variants.
2021-06-04 13:23:04 Hail: INFO: pca: running PCA with 4 components...


In [57]:
mt = mt.annotate_cols(scores = scores[mt.s].scores)

In [58]:
p = hl.plot.scatter(mt.scores[0],
                    mt.scores[1],
                    label=hl.str(mt.birthplace),
                    title='PCA', xlabel='PC1', ylabel='PC2')
p.plot_width = 800
p.plot_height = 600
show(p)

Region number | Region name(s)
-----|-----
1 | Norðoyggjar
2 | Eysturoy og Norðstreymoy
3 | Suðurstreymoy
4 | Vágar og Mykines
5 | Sandoy, Skúvoy, Stóra Dímun
6 | Suðuroy

From this PCA, it looks like there is very little population structure. However, I am not convinced: I see this as a failure to detect population structure rather than as evidence that none exists.