# Population structure - PCA of the FarGen cohort

In [2]:
import hail as hl
hl.init(spark_conf={'spark.driver.memory': '100g'}, tmp_dir='/home/olavur/tmp')

Running on Apache Spark version 2.4.1
SparkUI available at http://hms-beagle-848846b477-48ks9:4045
Welcome to
     __  __     <>__
    / /_/ /__  __/ /
   / __  / _ `/ / /
  /_/ /_/\_,_/_/_/   version 0.2.61-3c86d3ba497a
LOGGING: writing to /home/olavur/experiments/2020-11-13_fargen1_exome_analysis/fargen-1-exome/notebooks/qc/hail-20210628-1457-0.2.61-3c86d3ba497a.log


In [3]:
from bokeh.io import show, output_notebook
from bokeh.layouts import gridplot
from bokeh.models.scales import LogScale
output_notebook()

In [4]:
import pandas as pd

## Load FarGen data

Use high-quality variants.

In [5]:
BASE_DIR = '/home/olavur/experiments/2020-11-13_fargen1_exome_analysis'
mt = hl.read_matrix_table(BASE_DIR + '/data/mt/high_quality_variants.mt')

In [6]:
n_variants, n_samples = mt.count()
print('Number of variants: ' + str(n_variants))
print('Number of samples: ' + str(n_samples))

Number of variants: 1146382
Number of samples: 468


## Population filters

Apply a Hardy-Weinberg equilibrium (HWE) filter to remove potential genotyping errors. A stringent HWE filter risks removing evidence of population structure from the data, because population structure itself causes deviation from HWE. Therefore, we use a lenient filter, removing only variants with very strong deviation from HWE at $p<10^{-9}$.

In [6]:
mt = mt.annotate_rows(hwe=hl.agg.hardy_weinberg_test(mt.GT))
mt = mt.filter_rows(mt.hwe.p_value > 1e-9)
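
To illustrate why a stringent HWE filter can remove signal of population structure, the sketch below pools two hypothetical subpopulations that are each in HWE but have different allele frequencies. The pooled sample shows a heterozygote deficit and a small HWE p-value. For simplicity it uses a chi-square test rather than the test performed by `hl.agg.hardy_weinberg_test`, and the sample sizes and frequencies are made up.

```python
import math

def hwe_chi2_p_value(n_hom_ref, n_het, n_hom_alt):
    """Chi-square (1 df) HWE test from genotype counts; a simplification for illustration."""
    n = n_hom_ref + n_het + n_hom_alt
    p = (2 * n_hom_ref + n_het) / (2 * n)  # reference allele frequency
    q = 1 - p
    expected = [n * p * p, 2 * n * p * q, n * q * q]
    observed = [n_hom_ref, n_het, n_hom_alt]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of the chi-square distribution with 1 degree of freedom.
    return math.erfc(math.sqrt(chi2 / 2))

def hwe_counts(n, p):
    """Genotype counts for n samples in exact HWE with reference allele frequency p."""
    n_hom_ref = round(n * p * p)
    n_het = round(2 * n * p * (1 - p))
    return n_hom_ref, n_het, n - n_hom_ref - n_het

# Two hypothetical subpopulations, each in HWE, with different allele frequencies.
pop_a = hwe_counts(208, 0.25)
pop_b = hwe_counts(208, 0.75)
pooled = tuple(a + b for a, b in zip(pop_a, pop_b))

p_a = hwe_chi2_p_value(*pop_a)
p_pooled = hwe_chi2_p_value(*pooled)
print(p_a)       # no deviation within a subpopulation
print(p_pooled)  # strong deviation in the pooled sample
```

In this toy example the pooled p-value falls between $10^{-9}$ and $10^{-6}$, so the variant would survive the lenient filter used here but not a filter at, say, $p<10^{-6}$.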

To filter rare variants, we first must calculate the minor allele frequency.

In [7]:
# The number of alleles at a site is the sum of the ploidy of all called genotypes.
# With no missing calls, this is twice the number of samples.
# If there are missing genotype calls, the number of alleles will be less.
AN_exprs = hl.agg.sum(mt.GT.ploidy)
mt = mt.annotate_rows(AN=AN_exprs)

# Calculate the number of alternate alleles at each site.
AC_exprs = hl.agg.sum(mt.GT.n_alt_alleles())
mt = mt.annotate_rows(AC=AC_exprs)

# Calculate the alternate allele frequency.
mt = mt.annotate_rows(AF=mt.AC / mt.AN)
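
As a small worked example of the cell above, the sketch below computes `AN`, `AC` and `AF` for a single hypothetical diallelic site with one missing genotype call. The genotypes are made up and coded as alternate allele counts.

```python
# Genotypes for one hypothetical diallelic site, coded as alternate allele counts.
# None marks a missing genotype call.
genotypes = [0, 1, 2, None, 0, 1]

called = [g for g in genotypes if g is not None]
AN = 2 * len(called)   # two alleles per called diploid genotype
AC = sum(called)       # number of alternate alleles
AF = AC / AN           # alternate allele frequency

print(AN, AC, AF)  # 10 4 0.4
```

With six samples and one missing call, `AN` is 10 rather than 12, so the frequency is computed over called alleles only.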

Remove variants with a minor allele frequency of 0.01 or lower. We remove only very rare variants, because common variants alone can be insufficient to describe fine-scale population structure.

In [8]:
maf_filter = 0.01
mt = mt.filter_rows((mt.AF > maf_filter) & (mt.AF < (1 - maf_filter)))
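
The filter above is written as a two-sided condition on the alternate allele frequency. The sketch below checks, on a few made-up frequencies, that this is the same as a single condition on the minor allele frequency $\min(AF, 1 - AF)$. The function names are hypothetical.

```python
def passes_maf_filter(af, maf_filter=0.01):
    """Two-sided filter on the alternate allele frequency, as in the cell above."""
    return (af > maf_filter) and (af < 1 - maf_filter)

def minor_allele_frequency(af):
    """Frequency of the less common of the two alleles at a diallelic site."""
    return min(af, 1 - af)

# Both formulations agree on these example frequencies.
for af in [0.001, 0.02, 0.5, 0.98, 0.999]:
    assert passes_maf_filter(af) == (minor_allele_frequency(af) > 0.01)
```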

## Filter indels

Remove all indels from the dataset.

**NOTE:** this code only works because there are only diallelic sites. If there were multi-allelic sites, I would have to check all allele pairs.

In [9]:
mt = mt.filter_rows(hl.is_snp(mt.alleles[0], mt.alleles[1]))
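
To make the note above concrete, here is a plain Python sketch of what checking all allele pairs at a multi-allelic site could look like. The `is_snp` function below is a simplified stand-in written for this example, not Hail's implementation.

```python
def is_snp(ref, alt):
    """Simplified check: both alleles are single, different bases."""
    return len(ref) == 1 and len(alt) == 1 and ref != alt

def all_alts_are_snps(alleles):
    """alleles[0] is the reference allele; the rest are alternate alleles."""
    ref, alts = alleles[0], alleles[1:]
    return all(is_snp(ref, alt) for alt in alts)

print(all_alts_are_snps(['A', 'G']))        # True: diallelic SNP
print(all_alts_are_snps(['A', 'AT']))       # False: insertion
print(all_alts_are_snps(['A', 'G', 'AT']))  # False: multi-allelic site with an indel
```

Checking only `alleles[0]` and `alleles[1]` would wrongly keep the third site, which is why the filter in the cell above relies on all sites being diallelic.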

## LD pruning

Before we calculate LD, we remove multi-allelic sites.

**NOTE:** alternatively, I could split multi-allelic sites with [split_multi()](https://hail.is/docs/0.2/methods/genetics.html#hail.methods.split_multi) and run LD pruning on the result.

In [10]:
mt = mt.filter_rows(hl.len(mt.alleles) == 2)

Prune variants with $r^2 > 0.2$ within a 500 000 basepair window.

**NOTE:** this $r^2$ value is quite arbitrary. It is the default from the Hail method and I did not try other values.
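
The idea behind pruning on $r^2$ within a window can be sketched in plain Python. This is a simple greedy version for illustration only; `hl.ld_prune` works differently internally, and the positions and genotypes below are made up.

```python
def r_squared(x, y):
    """Squared Pearson correlation between two genotype vectors (alt allele counts).
    Assumes both variants are polymorphic, so the variances are non-zero."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov ** 2 / (var_x * var_y)

def greedy_ld_prune(variants, r2=0.2, bp_window_size=500000):
    """variants: list of (position, genotypes), sorted by position on one chromosome.
    Keep a variant unless it is correlated with an already kept variant in the window."""
    kept = []
    for pos, gts in variants:
        in_ld = any(pos - kept_pos <= bp_window_size and r_squared(gts, kept_gts) > r2
                    for kept_pos, kept_gts in kept)
        if not in_ld:
            kept.append((pos, gts))
    return kept

variants = [
    (1000,   [0, 1, 2, 0, 1, 2]),
    (2000,   [0, 1, 2, 0, 1, 1]),  # strongly correlated with the first variant
    (3000,   [2, 0, 1, 1, 0, 2]),  # uncorrelated with the first variant
    (900000, [0, 1, 2, 0, 1, 2]),  # identical to the first, but outside the window
]
pruned = greedy_ld_prune(variants)
print([pos for pos, _ in pruned])  # [1000, 3000, 900000]
```

The last variant is kept even though it is identical to the first, because the two are more than `bp_window_size` base pairs apart.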

In [11]:
pruned_variant_table = hl.ld_prune(mt.GT, r2=0.2, bp_window_size=500000)
mt = mt.filter_rows(hl.is_defined(pruned_variant_table[mt.row_key]))

2021-06-15 12:07:25 Hail: INFO: ld_prune: running local pruning stage with max queue size of 401850 variants
2021-06-15 12:07:39 Hail: INFO: wrote table with 84041 rows in 37 partitions to /home/olavur/tmp/Z1cRyg3S0j88Mrm40mXXiY
    Total size: 2.06 MiB
    * Rows: 2.06 MiB
    * Globals: 11.00 B
    * Smallest partition: 1048 rows (25.28 KiB)
    * Largest partition:  2875 rows (72.31 KiB)
2021-06-15 12:07:56 Hail: INFO: Wrote all 21 blocks of 84041 x 468 matrix with block size 4096.
2021-06-15 12:10:09 Hail: INFO: wrote table with 185 rows in 41 partitions to /home/olavur/tmp/7iwLMorU7fyciApDaFz59F
    Total size: 856.50 KiB
    * Rows: 4.88 KiB
    * Globals: 851.62 KiB
    * Smallest partition: 0 rows (21.00 B)
    * Largest partition:  22 rows (461.00 B)


Make a checkpoint, caching all operations done on the matrix table.

In [12]:
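# Change the condition to True to write the checkpoint; False reads the existing one.
# Note that the path ends in '.ht' although it holds a matrix table.
if False: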
    mt = mt.checkpoint('/home/olavur/tmp/ld_pruned.ht')
else:
    mt = hl.read_matrix_table('/home/olavur/tmp/ld_pruned.ht')

In [13]:
n_variants, n_samples = mt.count()
print('Number of variants: ' + str(n_variants))
print('Number of samples: ' + str(n_samples))

Number of variants: 83887
Number of samples: 468


This may be a lot of variants to run a PCA analysis on, but as mentioned previously, we do not want to remove too many rare variants as these can hold a lot of information about fine-scale population structure.

## Annotate birth place region

Import data containing the birthplace ID of all FarGen participants. We will annotate the matrix table with these IDs.

First we read a table containing the birthplace but with the 'RIN' participant ID. We will have to map this 'RIN' ID to an 'FN' ID.

In [14]:
rin_birthplace_ht = hl.import_table(BASE_DIR + '/data/metadata/birthplace/rin_region.csv', delimiter=',')
# Rename "ind" to "rin".
# Convert the region variable to float.
rin_birthplace_ht = rin_birthplace_ht.transmute(rin=rin_birthplace_ht.ind, birthplace=hl.float64(rin_birthplace_ht.region))

rin_birthplace_ht = rin_birthplace_ht.key_by(rin_birthplace_ht.rin)

2021-06-15 12:10:10 Hail: INFO: Reading table without type imputation
  Loading field 'ind' as type str (not specified)
  Loading field 'region' as type str (not specified)


Import table with 'RIN' IDs and corresponding 'FN' IDs.

In [15]:
fargen_rin_ht = hl.import_table(BASE_DIR + '/data/metadata/birthplace/fargen_rin_samplename.csv', delimiter=',')
fargen_rin_ht = fargen_rin_ht.key_by(fargen_rin_ht.rin)

2021-06-15 12:10:10 Hail: WARN: Name collision: field 'sample' already in object dict. 
  This field must be referenced with __getitem__ syntax: obj['sample']
2021-06-15 12:10:10 Hail: INFO: Reading table without type imputation
  Loading field 'rin' as type str (not specified)
  Loading field 'sample' as type str (not specified)


Make a table with 'RIN', 'FN' and birthplace.

In [16]:
# Annotate the table with the birthplace by the samplenames.
samplename_birthplace_ht = rin_birthplace_ht.annotate(samplename=fargen_rin_ht[rin_birthplace_ht.rin].sample)
samplename_birthplace_ht = samplename_birthplace_ht.key_by(samplename_birthplace_ht.samplename)

Finally, we can annotate the matrix table with the birthplace of each sample.

In [17]:
mt = mt.annotate_cols(birthplace = samplename_birthplace_ht[mt.s].birthplace)
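
The two-step mapping above (region keyed by 'RIN', re-keyed by sample name, then joined onto the samples) can be sketched with plain dictionaries. All IDs and regions below are hypothetical.

```python
# Hypothetical IDs and regions, for illustration only.
rin_to_region = {'rin_001': 1.0, 'rin_002': 3.0, 'rin_003': 6.0}
rin_to_sample = {'rin_001': 'FN000001', 'rin_002': 'FN000002'}  # rin_003 has no sample name

# Step 1: re-key birthplace by sample name, via the shared 'RIN' key.
sample_to_region = {
    rin_to_sample[rin]: region
    for rin, region in rin_to_region.items()
    if rin in rin_to_sample
}

# Step 2: annotate each sequenced sample; samples without a match get None (missing).
sequenced_samples = ['FN000001', 'FN000002', 'FN000099']
birthplace = {s: sample_to_region.get(s) for s in sequenced_samples}
print(birthplace)
```

A sample with no matching entry ends up with a missing birthplace, which is one way a missing category can appear when counting samples per region.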

Count the number of samples in each region. Note that `birthplace=6` means that we do not know the birthplace of the sample.

In [18]:
cols_ht = mt.cols()
result = (cols_ht.group_by(cols_ht.birthplace)
    .aggregate(count = hl.agg.count()))
result.to_pandas()

2021-06-15 12:10:10 Hail: WARN: cols(): Resulting column table is sorted by 'col_key'.
    To preserve matrix table column order, first unkey columns with 'key_cols_by()'
2021-06-15 12:10:10 Hail: WARN: Name collision: field 'count' already in object dict. 
  This field must be referenced with __getitem__ syntax: obj['count']
2021-06-15 12:10:11 Hail: INFO: Ordering unsorted dataset with network shuffle
2021-06-15 12:10:11 Hail: INFO: Ordering unsorted dataset with network shuffle
2021-06-15 12:10:14 Hail: INFO: Coerced sorted dataset
2021-06-15 12:10:14 Hail: INFO: Coerced dataset with out-of-order partitions.


birthplace | count
-----|-----
1.0 | 69
2.0 | 99
3.0 | 177
4.0 | 28
5.0 | 17
6.0 | 34
NaN | 44


## Filter related individuals

Estimate the relatedness between the samples with the PC-Relate method, using two principal components (`k=2`) and a minimum individual-specific minor allele frequency of 0.001.

In [19]:
pc_rel = hl.pc_relate(mt.GT, 0.001, k=2, statistics='kin')

2021-06-15 12:10:17 Hail: INFO: hwe_normalized_pca: running PCA using 83887 variants.
2021-06-15 12:10:18 Hail: INFO: pca: running PCA with 2 components...
2021-06-15 12:10:46 Hail: INFO: Wrote all 21 blocks of 83887 x 468 matrix with block size 4096.


Plot all the relatedness coefficients in a histogram to get an overview.

In [20]:
p = hl.plot.histogram(pc_rel.kin, title='Histogram of kinship coefficient')
show(p)

2021-06-15 12:10:47 Hail: INFO: wrote matrix with 3 rows and 83887 columns as 21 blocks of size 4096 to /home/olavur/tmp/pcrelate-write-read-e25n1fSr0R4vAaQL80t2MY.bm
2021-06-15 12:10:47 Hail: INFO: wrote matrix with 83887 rows and 468 columns as 21 blocks of size 4096 to /home/olavur/tmp/pcrelate-write-read-yLWrx5hKEPFZ5omcu7susr.bm
2021-06-15 12:11:18 Hail: INFO: wrote matrix with 468 rows and 468 columns as 1 block of size 4096 to /home/olavur/tmp/pcrelate-write-read-oschpShlfYlA2pdxvaP1Sq.bm
2021-06-15 12:11:49 Hail: INFO: wrote matrix with 468 rows and 468 columns as 1 block of size 4096 to /home/olavur/tmp/pcrelate-write-read-ddcESr9wsjbb10DVVCmF5s.bm
2021-06-15 12:11:49 Hail: INFO: wrote matrix with 468 rows and 468 columns as 1 block of size 4096 to /home/olavur/tmp/pcrelate-write-read-xbGOQ1MRpyQEcP5elmGJQC.bm
2021-06-15 12:11:49 Hail: INFO: Ordering unsorted dataset with network shuffle
2021-06-15 12:11:51 Hail: INFO: wrote matrix with 3 rows and 83887 columns as 21 blocks of

**FIXME:** why did I use $2^{-3}$? If this is like the kinship coefficients I'm used to, the cutoff for second degree relationships is $2^{-4}$.

In [21]:
pairs = pc_rel.filter(pc_rel['kin'] > 2**(-4))
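
To put the threshold in context, the sketch below lists the expected kinship coefficient for each degree of relationship, $2^{-(d+1)}$ for degree $d$, together with the lower cutoffs used by KING-style relationship inference, $2^{-(d+1.5)}$. This assumes the `kin` statistic is on the usual kinship scale, where first degree relatives have an expected value of 0.25.

```python
labels = ['duplicate/MZ twin', 'first degree', 'second degree', 'third degree']
expected_kinship = {}
lower_cutoff = {}
for degree, label in enumerate(labels):
    # Expected kinship halves with each additional degree of relationship.
    expected_kinship[label] = 2 ** -(degree + 1)
    # KING-style lower cutoff: geometric mean of the expected values
    # for this degree and the next one.
    lower_cutoff[label] = 2 ** -(degree + 1.5)
    print(f'{label}: expected {expected_kinship[label]:.4f}, '
          f'lower cutoff {lower_cutoff[label]:.4f}')

threshold = 2 ** -4
```

Under these conventions, $2^{-3}$ is the expected kinship for second degree relatives and $2^{-4}$ the expected kinship for third degree relatives, so a filter at `kin > 2**(-4)` keeps all pairs at second degree or closer plus part of the third degree pairs.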

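Then we find a maximal independent set of samples, in which no two samples are related. With `keep=False`, the method returns the samples outside this set, that is, the samples to remove.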
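
How related samples can be chosen for removal is sketched below with a simple greedy heuristic on a toy graph, where each pair is an edge between two related samples. This is an illustration under my own assumptions, not necessarily what `hl.maximal_independent_set` does internally, and the sample names are hypothetical.

```python
from collections import defaultdict

def samples_to_remove(pairs):
    """Greedy heuristic: repeatedly remove the sample with the most remaining
    relatives until no related pair is left. Returns the removed samples."""
    neighbours = defaultdict(set)
    for i, j in pairs:
        neighbours[i].add(j)
        neighbours[j].add(i)

    removed = set()
    while True:
        # Samples that still have at least one relative left in the data.
        candidates = [(len(rel), s) for s, rel in neighbours.items() if rel]
        if not candidates:
            break
        # Highest number of relatives first; ties are broken by sample name.
        _, worst = max(candidates)
        for other in neighbours[worst]:
            neighbours[other].discard(worst)
        neighbours[worst] = set()
        removed.add(worst)
    return removed

pairs = [('A', 'B'), ('A', 'C'), ('A', 'D'), ('E', 'F')]
print(sorted(samples_to_remove(pairs)))  # ['A', 'F']
```

Removing `A` resolves three related pairs at once, so only two samples are dropped instead of four.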

In [22]:
related_samples_to_remove = hl.maximal_independent_set(pairs.i, pairs.j, keep=False)

2021-06-15 12:12:52 Hail: INFO: wrote matrix with 3 rows and 83887 columns as 21 blocks of size 4096 to /home/olavur/tmp/pcrelate-write-read-4DyFtqloLa80JNpOymKr1V.bm
2021-06-15 12:12:52 Hail: INFO: wrote matrix with 83887 rows and 468 columns as 21 blocks of size 4096 to /home/olavur/tmp/pcrelate-write-read-x0s73o6PgkIQkMGPH22ChJ.bm
2021-06-15 12:13:19 Hail: INFO: wrote matrix with 468 rows and 468 columns as 1 block of size 4096 to /home/olavur/tmp/pcrelate-write-read-W1aTjFjpjea06QgzmXwKOL.bm
2021-06-15 12:13:49 Hail: INFO: wrote matrix with 468 rows and 468 columns as 1 block of size 4096 to /home/olavur/tmp/pcrelate-write-read-ZOHlNAYw1UEQHlw9433tOQ.bm
2021-06-15 12:13:49 Hail: INFO: wrote matrix with 468 rows and 468 columns as 1 block of size 4096 to /home/olavur/tmp/pcrelate-write-read-uvDZGcAdoQDE38iRQ0ZPDs.bm
2021-06-15 12:13:49 Hail: INFO: Ordering unsorted dataset with network shuffle
2021-06-15 12:13:50 Hail: INFO: wrote table with 104 rows in 1 partition to /home/olavur/t

Now we filter these individuals from the matrix table.

In [23]:
mt = mt.filter_cols(hl.is_defined(related_samples_to_remove[mt.col_key]), keep=False)

Make a checkpoint, caching all operations done on the matrix table.

In [33]:
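# Change the condition to True to write the checkpoint; False reads the existing one.
# Note that the path ends in '.ht' although it holds a matrix table.
if False: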
    mt = mt.checkpoint('/home/olavur/tmp/rel_pruned.ht')
else:
    mt = hl.read_matrix_table('/home/olavur/tmp/rel_pruned.ht')

In [34]:
n_variants, n_samples = mt.count()
print('Number of variants: ' + str(n_variants))
print('Number of samples: ' + str(n_samples))

Number of variants: 83887
Number of samples: 389


## Compute PCA

In [35]:
eigenvalues, scores, loadings = hl.hwe_normalized_pca(mt.GT, k=4)

2021-06-28 15:05:59 Hail: INFO: hwe_normalized_pca: running PCA using 83887 variants.
2021-06-28 15:06:00 Hail: INFO: pca: running PCA with 4 components...
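
The "HWE-normalized" part of this PCA refers to how genotypes are standardized before the decomposition. The sketch below shows that step for a single variant, assuming the standardization described in the Hail documentation for `hwe_normalized_pca`: center by $2p$, divide by $\sqrt{2p(1-p)m}$ for $m$ variants, and set missing calls to the mean. The genotypes are made up and the function name is hypothetical.

```python
import math

def hwe_normalize(genotypes, n_variants):
    """Standardize one variant's alt allele counts under HWE.
    Missing calls (None) are set to 0 after centering, i.e. imputed to the mean."""
    called = [g for g in genotypes if g is not None]
    p = sum(called) / (2 * len(called))              # alternate allele frequency
    scale = math.sqrt(2 * p * (1 - p) * n_variants)  # HWE standard deviation, scaled by m
    return [0.0 if g is None else (g - 2 * p) / scale for g in genotypes]

row = hwe_normalize([0, 1, 2, None, 1, 0], n_variants=1)
print(row)
```

Dividing by the expected standard deviation gives rarer variants more weight per alternate allele than common ones, which is one reason the choice of MAF filter affects the PCA.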


In [36]:
mt = mt.annotate_cols(scores = scores[mt.s].scores)

In [11]:
p = hl.plot.scatter(mt.scores[0],
                    mt.scores[1],
                    label=hl.str(mt.birthplace),
                    title='PCA', xlabel='PC1', ylabel='PC2')
p.plot_width = 800
p.plot_height = 600
show(p)

Above, the first two principal components are plotted against each other. Note that there are four outlier samples. These outliers may be due to the quality of these samples, so we shall investigate them more closely below.

As we see below, there seems to be nothing abnormal about these four samples. They may therefore potentially be *ancestry outliers*.

In [21]:
mt = mt.annotate_cols(pc1_outliers = (mt.scores[0] < -0.4))

In [22]:
exprs_list = [
    ('# heterozygotes', mt.sample_qc.n_het),
    ('Ti/Tv rate', mt.sample_qc.r_ti_tv),
    ('Call rate', mt.sample_qc.call_rate),
    ('# singletons', mt.sample_qc.n_singleton),
]
plot_list = []
for name, exprs in exprs_list:
    p = hl.plot.scatter(mt.sample_qc.dp_stats.mean, exprs, label=mt.pc1_outliers, xlabel='DP mean', ylabel=name, legend=False)
    p.plot_width = 800
    p.plot_height = 500
    plot_list.append(p)

In [23]:
show(gridplot(plot_list, ncols=2, plot_width=600, plot_height=400))

## Remove potential ancestry outliers from full dataset

In further analysis of the dataset, we do not want to include these four potential ancestry outliers (PAO). We will therefore load the full dataset, discard the PAOs, and write the data again.

In [52]:
# Potential ancestry outliers.
pao_list = mt.filter_cols(mt.scores[0] < -0.4).s.collect()

print('Number of potential ancestry outliers: {n}'.format(n=len(pao_list)))

Number of potential ancestry outliers: 4


In [50]:
# Read original matrix table.
full_mt = hl.read_matrix_table(BASE_DIR + '/data/mt/high_quality_variants.mt')

# Annotate with the list of potential ancestry outliers.
full_mt = full_mt.annotate_globals(pao_list = pao_list)

# Remove the potential ancestry outliers.
pao_removed_mt = full_mt.filter_cols(~full_mt.pao_list.contains(full_mt.s))

In [51]:
n_variants, n_samples = pao_removed_mt.count()
print('Number of variants: ' + str(n_variants))
print('Number of samples: ' + str(n_samples))

Number of variants: 1146382
Number of samples: 464


Recompute the variant QC annotations, since they change when samples are removed.

In [55]:
pao_removed_mt = hl.variant_qc(pao_removed_mt)

Write the resulting matrix table to file.

In [56]:
pao_removed_mt.write(BASE_DIR + '/data/mt/high_quality_variants_pao_removed.mt', overwrite=True)

2021-06-28 15:23:08 Hail: INFO: wrote matrix table with 1146382 rows and 464 columns in 37 partitions to /home/olavur/experiments/2020-11-13_fargen1_exome_analysis/data/mt/high_quality_variants_pao_removed.mt
    Total size: 1.60 GiB
    * Rows/entries: 1.60 GiB
    * Columns: 49.53 KiB
    * Globals: 39.00 B
    * Smallest partition: 29355 rows (37.99 MiB)
    * Largest partition:  31925 rows (46.47 MiB)


## PCA without outliers

These PCA outliers prevent us from detecting population structure in the PCA above. Whether they are ancestry outliers or something else is at play, we must remove them to be able to detect population structure.

In [24]:
mt = mt.filter_cols(mt.scores[0] > -0.4)

Now we just do the PCA again, and see if PC 1 and 2 show signs of population structure.

In [25]:
eigenvalues, scores, loadings = hl.hwe_normalized_pca(mt.GT, k=4)

2021-06-28 15:01:05 Hail: INFO: hwe_normalized_pca: running PCA using 83887 variants.
2021-06-28 15:01:06 Hail: INFO: pca: running PCA with 4 components...


In [26]:
mt = mt.annotate_cols(scores = scores[mt.s].scores)

In [27]:
p = hl.plot.scatter(mt.scores[0],
                    mt.scores[1],
                    label=hl.str(mt.birthplace),
                    title='PCA', xlabel='PC1', ylabel='PC2')
p.plot_width = 800
p.plot_height = 600
show(p)

Region number | Region name(s)
-----|-----
1 | Norðoyggjar
2 | Eysturoy og Norðstreymoy
3 | Suðurstreymoy
4 | Vágar og Mykines
5 | Sandoy, Skúvoy, Stóra Dímun
6 | Suðuroy

From this PCA, it looks like there is very little population structure. However, I am not convinced: I see this as a failure to detect population structure, not as evidence that there is none.

## Summary

In this notebook we have:

* Investigated population structure in the cohort
* Removed potential ancestry outliers from the data