# Principal Component Analysis

In [1]:
import hail as hl
hl.init(spark_conf={'spark.driver.memory': '100g'}, tmp_dir='/home/olavur/tmp')

Running on Apache Spark version 2.4.1
SparkUI available at http://hms-beagle-848846b477-48ks9:4040
Welcome to
     __  __     <>__
    / /_/ /__  __/ /
   / __  / _ `/ / /
  /_/ /_/\_,_/_/_/   version 0.2.61-3c86d3ba497a
LOGGING: writing to /home/olavur/experiments/2020-11-13_fargen1_exome_analysis/fargen-1-exome/notebooks/qc/hail-20210526-1347-0.2.61-3c86d3ba497a.log


In [2]:
from bokeh.io import show, output_notebook
from bokeh.layouts import gridplot
from bokeh.models.scales import LogScale
output_notebook()

In [3]:
import pandas as pd

## Load FarGen data

Use LD pruned diallelic sites.

In [4]:
BASE_DIR = '/home/olavur/experiments/2020-11-13_fargen1_exome_analysis'
mt = hl.read_matrix_table(BASE_DIR + '/data/mt/ld_pruned_diallelic_common.mt')

In [5]:
n_variants, n_samples = mt.count()
print('Number of variants: ' + str(n_variants))
print('Number of samples: ' + str(n_samples))

Number of variants: 134452
Number of samples: 474


### Add population filters

Use a Hardy-Weinberg equilibrium (HWE) filter and a minor allele frequency (MAF) filter. The LD-pruned data is already filtered with MAF > 0.01, but we will use a more stringent filter.

In [6]:
mt = mt.annotate_rows(hwe=hl.agg.hardy_weinberg_test(mt.GT))
mt = mt.filter_rows(mt.hwe.p_value > 1e-6)
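
To make the HWE filter concrete, here is a minimal plain-Python sketch of a Hardy-Weinberg test on genotype counts. It uses the textbook chi-square approximation with one degree of freedom, which is an assumption made for illustration and should not be read as the test Hail runs internally. The function name `hwe_chi2_p_value` and the toy counts are hypothetical.

```python
import math

def hwe_chi2_p_value(n_hom_ref, n_het, n_hom_var):
    """Chi-square (1 d.o.f.) test of Hardy-Weinberg equilibrium.

    Illustrative approximation only; not Hail's implementation.
    """
    n = n_hom_ref + n_het + n_hom_var
    p = (2 * n_hom_ref + n_het) / (2 * n)   # reference allele frequency
    q = 1 - p
    expected = [n * p * p, 2 * n * p * q, n * q * q]
    observed = [n_hom_ref, n_het, n_hom_var]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Survival function of a chi-square distribution with 1 degree of freedom.
    return math.erfc(math.sqrt(chi2 / 2))

# Genotype counts exactly at HWE proportions give a p-value of 1.
balanced = hwe_chi2_p_value(25, 50, 25)
# A site with no heterozygotes at all deviates strongly from HWE.
no_hets = hwe_chi2_p_value(50, 0, 50)
```

With the threshold used in the filter above, the first site would be kept and the second would be removed.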

Calculate allele frequencies.

In [7]:
# The allele number (AN) at a site is the sum of the ploidy of all called genotypes.
# With no missing calls, AN is twice the number of (diploid) samples.
# Missing genotype calls are skipped, so AN is lower at sites with missing data.
AN_exprs = hl.agg.sum(mt.GT.ploidy)
mt = mt.annotate_rows(AN=AN_exprs)

# Calculate the number of alternate alleles at each site.
AC_exprs = hl.agg.sum(mt.GT.n_alt_alleles())
mt = mt.annotate_rows(AC=AC_exprs)

# Calculate the alternate allele frequency.
mt = mt.annotate_rows(AF=mt.AC / mt.AN)
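
The same bookkeeping can be sketched for a single site in plain Python. The genotype list is hypothetical toy data, with `None` standing in for a missing call.

```python
# Genotypes at one site, coded as the number of alternate alleles per
# diploid sample; None marks a missing call (hypothetical toy data).
genotypes = [0, 1, 2, None, 1, 0]

called = [g for g in genotypes if g is not None]
AN = 2 * len(called)   # each called diploid genotype contributes 2 alleles
AC = sum(called)       # alternate alleles among the called genotypes
AF = AC / AN           # alternate allele frequency

print(AN, AC, AF)
```

Because the missing call is skipped, AN is 10 rather than 12 here, and AF is computed over called alleles only.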

**TODO:** Consider increasing the MAF threshold if a lot of variants still remain.

Remove variants with minor allele frequency under 0.05.

In [8]:
maf_filter = 0.05
mt = mt.filter_rows((mt.AF > maf_filter) & (mt.AF < (1 - maf_filter)))
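
The filter above is two-sided because AF is the alternate allele frequency, and the alternate allele is not always the minor one. A small sketch, with the hypothetical helper `passes_maf`, shows that the two-sided condition is the same as thresholding min(AF, 1 - AF):

```python
def passes_maf(af, threshold=0.05):
    """True if the minor allele frequency min(af, 1 - af) exceeds threshold."""
    return min(af, 1 - af) > threshold

# Compare with the two-sided condition on a few example frequencies.
for af in [0.01, 0.05, 0.2, 0.5, 0.96, 0.99]:
    two_sided = (af > 0.05) and (af < 1 - 0.05)
    assert passes_maf(af) == two_sided
```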

### Filter indels

Remove all indels from the dataset.

In [9]:
mt = mt.filter_rows(hl.is_snp(mt.alleles[0], mt.alleles[1]))

## Annotate birth place region

In [10]:
fargen_rin_ht = hl.import_table(BASE_DIR + '/data/metadata/birthplace/fargen_rin_samplename.csv', delimiter=',')
fargen_rin_ht = fargen_rin_ht.key_by(fargen_rin_ht.rin)

2021-05-26 13:47:22 Hail: WARN: Name collision: field 'sample' already in object dict. 
  This field must be referenced with __getitem__ syntax: obj['sample']
2021-05-26 13:47:22 Hail: INFO: Reading table without type imputation
  Loading field 'rin' as type str (not specified)
  Loading field 'sample' as type str (not specified)


In [11]:
fargen_rin_ht.describe()

----------------------------------------
Global fields:
    None
----------------------------------------
Row fields:
    'rin': str 
    'sample': str 
----------------------------------------
Key: ['rin']
----------------------------------------


In [12]:
rin_birthplace_ht = hl.import_table(BASE_DIR + '/data/metadata/birthplace/rin_region.csv', delimiter=',')
# Rename "ind" to "rin".
# Convert the region variable to float.
rin_birthplace_ht = rin_birthplace_ht.transmute(rin=rin_birthplace_ht.ind, birthplace=hl.float64(rin_birthplace_ht.region))

rin_birthplace_ht = rin_birthplace_ht.key_by(rin_birthplace_ht.rin)

2021-05-26 13:47:22 Hail: INFO: Reading table without type imputation
  Loading field 'ind' as type str (not specified)
  Loading field 'region' as type str (not specified)


In [13]:
rin_birthplace_ht.describe()

----------------------------------------
Global fields:
    None
----------------------------------------
Row fields:
    'rin': str 
    'birthplace': float64 
----------------------------------------
Key: ['rin']
----------------------------------------


In [14]:
# Annotate the birthplace table with the sample name, then key it by sample name.
samplename_birthplace_ht = rin_birthplace_ht.annotate(samplename=fargen_rin_ht[rin_birthplace_ht.rin].sample)
samplename_birthplace_ht = samplename_birthplace_ht.key_by(samplename_birthplace_ht.samplename)

In [15]:
samplename_birthplace_ht.describe()

----------------------------------------
Global fields:
    None
----------------------------------------
Row fields:
    'rin': str 
    'birthplace': float64 
    'samplename': str 
----------------------------------------
Key: ['samplename']
----------------------------------------


In [16]:
mt = mt.annotate_cols(birthplace = samplename_birthplace_ht[mt.s].birthplace)
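
The annotation above is a two-step lookup: RIN to sample name, then sample name to birthplace region. A minimal sketch with plain dictionaries illustrates the logic. All identifiers and values below are hypothetical toy data, not taken from the real metadata files.

```python
# Hypothetical toy versions of the two CSV files.
rin_to_sample = {'r1': 'FN000001', 'r2': 'FN000002'}   # fargen_rin_samplename.csv
rin_to_region = {'r1': '3', 'r2': '6', 'r3': '1'}      # rin_region.csv

# Re-key the region table by sample name, going through the shared 'rin' key.
sample_to_region = {
    rin_to_sample[rin]: float(region)
    for rin, region in rin_to_region.items()
    if rin in rin_to_sample   # a RIN without a sample name cannot be matched
}

# A sample that is absent from the metadata gets no birthplace.
unknown = sample_to_region.get('FN000099')
```

Samples without a match end up with a missing birthplace, which is worth keeping in mind when filtering on `birthplace` later.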

In [17]:
p = hl.plot.histogram(mt.birthplace)
show(p)

2021-05-26 13:47:25 Hail: INFO: Ordering unsorted dataset with network shuffle
2021-05-26 13:47:25 Hail: INFO: Ordering unsorted dataset with network shuffle
2021-05-26 13:47:29 Hail: INFO: Ordering unsorted dataset with network shuffle
2021-05-26 13:47:29 Hail: INFO: Ordering unsorted dataset with network shuffle


## Filter related individuals

Estimate the relatedness between the samples with the PC-Relate method, using a minimum individual-specific minor allele frequency of 0.001 and two principal components. Only the kinship statistic is computed.

In [18]:
pc_rel = hl.pc_relate(mt.GT, 0.001, k=2, statistics='kin')

2021-05-26 13:47:36 Hail: INFO: hwe_normalized_pca: running PCA using 39986 variants.
2021-05-26 13:47:40 Hail: INFO: pca: running PCA with 2 components...
2021-05-26 13:48:12 Hail: INFO: Wrote all 10 blocks of 39986 x 474 matrix with block size 4096.


Plot all the relatedness coefficients in a histogram to get an overview.

In [19]:
p = hl.plot.histogram(pc_rel.kin, title='Histogram of kinship coefficient')
show(p)

2021-05-26 13:48:13 Hail: INFO: wrote matrix with 3 rows and 39986 columns as 10 blocks of size 4096 to /home/olavur/tmp/pcrelate-write-read-pCDSvTTnu8oA89BdnEQFEJ.bm
2021-05-26 13:48:13 Hail: INFO: wrote matrix with 39986 rows and 474 columns as 10 blocks of size 4096 to /home/olavur/tmp/pcrelate-write-read-zcw7ERDoajuMEqMlvjqe3Q.bm
2021-05-26 13:48:26 Hail: INFO: wrote matrix with 474 rows and 474 columns as 1 block of size 4096 to /home/olavur/tmp/pcrelate-write-read-Q7TWY0nu8beBWorclHg1W3.bm
2021-05-26 13:48:37 Hail: INFO: wrote matrix with 474 rows and 474 columns as 1 block of size 4096 to /home/olavur/tmp/pcrelate-write-read-KEm2XOVAkCR7F7t5KEfL17.bm
2021-05-26 13:48:37 Hail: INFO: wrote matrix with 474 rows and 474 columns as 1 block of size 4096 to /home/olavur/tmp/pcrelate-write-read-axxkMgTg50prFkqAuy8Esu.bm
2021-05-26 13:48:38 Hail: INFO: Ordering unsorted dataset with network shuffle
2021-05-26 13:48:39 Hail: INFO: wrote matrix with 3 rows and 39986 columns as 10 blocks of

In [20]:
pairs = pc_rel.filter(pc_rel['kin'] > 2**(-3))
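
To put the threshold 2**(-3) = 0.125 in context, the sketch below lists the expected kinship coefficient per degree of relatedness, assuming no inbreeding. The helper `expected_kinship` is hypothetical and only encodes the standard halving per degree.

```python
def expected_kinship(degree):
    """Expected kinship coefficient for relatives of the given degree
    (1 = parent-child or full siblings, 2 = e.g. half siblings, ...),
    assuming no inbreeding."""
    return 0.5 ** (degree + 1)

threshold = 2 ** (-3)
for degree in (1, 2, 3):
    phi = expected_kinship(degree)
    print(degree, phi, phi > threshold)
```

The threshold equals the expected value for second-degree relatives. Since the estimates are noisy, first-degree pairs should be caught reliably, while second-degree pairs can fall on either side of the cut-off.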

Then we find a maximal independent set of unrelated samples. With `keep=False`, the function returns the complement of that set, that is, the samples to remove.

In [21]:
related_samples_to_remove = hl.maximal_independent_set(pairs.i, pairs.j, keep=False)

2021-05-26 13:49:15 Hail: INFO: wrote matrix with 3 rows and 39986 columns as 10 blocks of size 4096 to /home/olavur/tmp/pcrelate-write-read-XClyq2VyHdWUZEaMf2LuX6.bm
2021-05-26 13:49:15 Hail: INFO: wrote matrix with 39986 rows and 474 columns as 10 blocks of size 4096 to /home/olavur/tmp/pcrelate-write-read-mwqOQd0Y7zR0S1AG32UhVP.bm
2021-05-26 13:49:28 Hail: INFO: wrote matrix with 474 rows and 474 columns as 1 block of size 4096 to /home/olavur/tmp/pcrelate-write-read-Fdw974MIon5PuOa5y3kBU6.bm
2021-05-26 13:49:42 Hail: INFO: wrote matrix with 474 rows and 474 columns as 1 block of size 4096 to /home/olavur/tmp/pcrelate-write-read-9cOrFPitOWXxg3RmXSq1du.bm
2021-05-26 13:49:42 Hail: INFO: wrote matrix with 474 rows and 474 columns as 1 block of size 4096 to /home/olavur/tmp/pcrelate-write-read-2BYw022i6tlArLtJu9oPZx.bm
2021-05-26 13:49:42 Hail: INFO: Ordering unsorted dataset with network shuffle
2021-05-26 13:49:43 Hail: INFO: wrote table with 72 rows in 1 partition to /home/olavur/tm
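
The idea behind this step can be sketched with a simple greedy heuristic in plain Python: repeatedly drop the sample that takes part in the most remaining related pairs. This is an illustration under that assumption, not Hail's implementation, and unlike a true maximal independent set it does not guarantee that no removed sample could be added back. The function name and the toy pairs are hypothetical.

```python
def samples_to_remove(pairs):
    """Greedy heuristic: repeatedly drop the sample involved in the most
    remaining related pairs until no related pair is left."""
    neighbours = {}
    for i, j in pairs:
        neighbours.setdefault(i, set()).add(j)
        neighbours.setdefault(j, set()).add(i)

    removed = set()
    while any(neighbours.values()):
        # Pick the sample with the most relatives; break ties by name.
        worst = max(sorted(neighbours), key=lambda s: len(neighbours[s]))
        for other in neighbours.pop(worst):
            neighbours[other].discard(worst)
        removed.add(worst)
    return removed

# 'B' is related to both 'A' and 'C'; removing 'B' alone resolves both pairs.
pairs = [('A', 'B'), ('B', 'C'), ('D', 'E')]
removed = samples_to_remove(pairs)
```

Removing the best-connected sample first keeps more samples than dropping one member of every pair at random.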

Now we filter these individuals from the matrix table.

In [22]:
pruned_mt = mt.filter_cols(hl.is_defined(related_samples_to_remove[mt.col_key]), keep=False)

2021-05-26 13:49:44 Hail: WARN: cols(): Resulting column table is sorted by 'col_key'.
    To preserve matrix table column order, first unkey columns with 'key_cols_by()'


In [23]:
samples_before_prune = mt.count_cols()
samples_after_prune = pruned_mt.count_cols()
print('Samples before prune: {n}\nSamples after prune: {m}'.format(n=samples_before_prune, m=samples_after_prune))

2021-05-26 13:49:44 Hail: INFO: Coerced sorted dataset
2021-05-26 13:49:44 Hail: INFO: Ordering unsorted dataset with network shuffle


Samples before prune: 474
Samples after prune: 416


## Compute PCA

In [24]:
eigenvalues, scores, loadings = hl.hwe_normalized_pca(pruned_mt.GT, k=2)

2021-05-26 13:49:45 Hail: INFO: Coerced sorted dataset
2021-05-26 13:49:46 Hail: INFO: Ordering unsorted dataset with network shuffle
2021-05-26 13:49:48 Hail: INFO: hwe_normalized_pca: running PCA using 39986 variants.
2021-05-26 13:49:49 Hail: INFO: Coerced sorted dataset
2021-05-26 13:49:49 Hail: INFO: Ordering unsorted dataset with network shuffle
2021-05-26 13:49:51 Hail: INFO: pca: running PCA with 2 components...
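
For intuition about what "HWE-normalized" means, the sketch below standardises a tiny genotype matrix in plain Python. Each genotype is centred on its mean 2p and scaled by sqrt(2p(1-p)m), where p is the alternate allele frequency and m the number of retained variants. Missing calls are mean-imputed and monomorphic variants are dropped. This follows my reading of the Hail documentation and should be treated as an assumption. The function `hwe_normalize` and the toy matrix are hypothetical.

```python
import math

def hwe_normalize(genotypes):
    """genotypes[j][i] = alt-allele count (0, 1, 2) of sample i at variant j,
    or None if the call is missing. Returns the normalised matrix."""
    kept = []
    for row in genotypes:
        called = [g for g in row if g is not None]
        if not called:
            continue
        p = sum(called) / (2 * len(called))   # alternate allele frequency
        # Monomorphic variants cannot be normalised and are dropped.
        if 0 < p < 1:
            kept.append((row, p))
    m = len(kept)
    normalised = []
    for row, p in kept:
        scale = math.sqrt(2 * p * (1 - p) * m)
        # Missing calls are mean-imputed, i.e. become 0 after centring.
        normalised.append([0.0 if g is None else (g - 2 * p) / scale for g in row])
    return normalised

toy = [
    [0, 1, 2, 1],      # p = 0.5
    [0, 0, 0, 0],      # monomorphic: dropped
    [2, None, 0, 1],   # p = 0.5 among called genotypes
]
M = hwe_normalize(toy)
```

The principal components are then obtained from this normalised matrix, so rare and common variants contribute on a comparable scale.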


In [25]:
pruned_mt = pruned_mt.annotate_cols(scores = scores[pruned_mt.s].scores)

In [26]:
p = hl.plot.scatter(pruned_mt.scores[0],
                    pruned_mt.scores[1],
                    label=hl.str(pruned_mt.birthplace),
                    title='PCA', xlabel='PC1', ylabel='PC2')
p.plot_width = 800
p.plot_height = 600
show(p)

2021-05-26 13:50:04 Hail: INFO: Ordering unsorted dataset with network shuffle
2021-05-26 13:50:04 Hail: INFO: Ordering unsorted dataset with network shuffle
2021-05-26 13:50:05 Hail: INFO: Coerced sorted dataset
2021-05-26 13:50:05 Hail: INFO: Ordering unsorted dataset with network shuffle


## Filter outliers

In the PCA plot above, it seems we have a few outliers. These individuals may have full or partial non-Faroese ancestry, or they may be very closely related to each other. Either way, they prevent us from detecting possible population structure in the data.

As we see in the plot above, both PC 1 and 2 describe the variation between these outliers and the rest of the samples. So if we remove these outlier individuals, we may be able to detect population structure.

In [27]:
no_outerliers_mt = pruned_mt.filter_cols(pruned_mt.scores[1] < 0.4)
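
The cut-off of 0.4 on PC2 was read off the plot. As one possible data-driven alternative (not the rule used in this notebook), outliers could be flagged by their distance from the median in units of the median absolute deviation. The helper `flag_outliers`, the multiplier of 6, and the scores below are all hypothetical.

```python
import statistics

def flag_outliers(values, n_mad=6.0):
    """Indices of values further than n_mad median absolute deviations
    from the median."""
    median = statistics.median(values)
    mad = statistics.median(abs(v - median) for v in values)
    return [i for i, v in enumerate(values) if abs(v - median) > n_mad * mad]

# Hypothetical PC2 scores: most samples near zero, one far away.
pc2 = [0.01, -0.02, 0.00, 0.03, -0.01, 0.02, -0.03, 0.01, 0.00, -0.01, 0.02, 0.65]
outliers = flag_outliers(pc2)
```

A median-based rule is less affected by the outliers themselves than a rule based on the mean and standard deviation.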

Now we just do the PCA again, and see if PC 1 and 2 show signs of population structure.

In [28]:
eigenvalues, scores, loadings = hl.hwe_normalized_pca(no_outerliers_mt.GT, k=2)

2021-05-26 13:50:07 Hail: INFO: Coerced sorted dataset
2021-05-26 13:50:07 Hail: INFO: Ordering unsorted dataset with network shuffle
2021-05-26 13:50:08 Hail: INFO: hwe_normalized_pca: running PCA using 39986 variants.
2021-05-26 13:50:09 Hail: INFO: Coerced sorted dataset
2021-05-26 13:50:09 Hail: INFO: Ordering unsorted dataset with network shuffle
2021-05-26 13:50:12 Hail: INFO: pca: running PCA with 2 components...


In [29]:
no_outerliers_mt = no_outerliers_mt.annotate_cols(scores = scores[no_outerliers_mt.s].scores)

In [30]:
p = hl.plot.scatter(no_outerliers_mt.scores[0],
                    no_outerliers_mt.scores[1],
                    label=hl.str(no_outerliers_mt.birthplace),
                    title='PCA', xlabel='PC1', ylabel='PC2')
p.plot_width = 800
p.plot_height = 600
show(p)

2021-05-26 13:50:21 Hail: INFO: Ordering unsorted dataset with network shuffle
2021-05-26 13:50:21 Hail: INFO: Ordering unsorted dataset with network shuffle
2021-05-26 13:50:22 Hail: INFO: Coerced sorted dataset
2021-05-26 13:50:22 Hail: INFO: Ordering unsorted dataset with network shuffle


Region number | Region name(s)
-----|-----
1 | Norðoyggjar
2 | Eysturoy og Norðstreymoy
3 | Suðurstreymoy
4 | Vágar og Mykines
5 | Sandoy, Skúvoy, Stóra Dímun
6 | Suðuroy

From this PCA, it looks like there is very little population structure. I am not convinced, however: I see this as a failure to detect population structure, not as evidence that there is none.

## Remove the capital region

In [31]:
filter_central_mt = no_outerliers_mt.filter_cols(no_outerliers_mt.birthplace != 3)

In [32]:
eigenvalues, scores, loadings = hl.hwe_normalized_pca(filter_central_mt.GT, k=2)

2021-05-26 13:50:24 Hail: INFO: Ordering unsorted dataset with network shuffle
2021-05-26 13:50:24 Hail: INFO: Ordering unsorted dataset with network shuffle
2021-05-26 13:50:25 Hail: INFO: Coerced sorted dataset
2021-05-26 13:50:25 Hail: INFO: Ordering unsorted dataset with network shuffle
2021-05-26 13:50:28 Hail: INFO: hwe_normalized_pca: running PCA using 39986 variants.
2021-05-26 13:50:28 Hail: INFO: Ordering unsorted dataset with network shuffle
2021-05-26 13:50:28 Hail: INFO: Ordering unsorted dataset with network shuffle
2021-05-26 13:50:30 Hail: INFO: Coerced sorted dataset
2021-05-26 13:50:30 Hail: INFO: Ordering unsorted dataset with network shuffle
2021-05-26 13:50:32 Hail: INFO: pca: running PCA with 2 components...


In [33]:
filter_central_mt = filter_central_mt.annotate_cols(scores = scores[filter_central_mt.s].scores)

In [34]:
p = hl.plot.scatter(filter_central_mt.scores[0],
                    filter_central_mt.scores[1],
                    label=hl.str(filter_central_mt.birthplace),
                    title='PCA', xlabel='PC1', ylabel='PC2')
p.plot_width = 800
p.plot_height = 600
show(p)

2021-05-26 13:50:46 Hail: INFO: Ordering unsorted dataset with network shuffle
2021-05-26 13:50:46 Hail: INFO: Ordering unsorted dataset with network shuffle
2021-05-26 13:50:47 Hail: INFO: Coerced sorted dataset
2021-05-26 13:50:47 Hail: INFO: Ordering unsorted dataset with network shuffle
