# 1000 Genomes Single Chromosome PCA Example

*(In-class assignment extracted from: http://bwlewis.github.io/1000_genomes_examples/PCA.html)*

This example walks through a principal component analysis (PCA) of genomic variant data across one chromosome from 2,504 people from the 1000 Genomes Project. The example projects all of the variant data for one chromosome into a three-dimensional subspace, and then plots the result. I think the example is popular perhaps because it’s very effective at clustering people by ethnicity. It’s often used to illustrate “big data” analysis in genomics, even though the data are not particularly big. The point of this example is not to say that PCA on genomic variants is profound, but rather that it’s easy.

This step-by-step exercise will give you a sense of how populations from different continents of origin can be distinguished by a few principal components of their genetic variants. Even though we share more than 99% of our genome, there are sufficient population differences for a simple PCA to reveal them. However, the vast majority of these variants are likely not biologically or medically informative. **Note that this uses variants from a single chromosome only.**

The example uses:

- a very simple C parsing program to efficiently read variant data into an R sparse matrix
- the irlba package to efficiently compute principal components
- the threejs package to visualize the result

## Reading variant data into an R sparse matrix
This step assumes that you’ve downloaded and compiled the simple VCF parser and downloaded at least the chromosome 20 and phenotype data files from the 1000 Genomes Project. If you have not done so yet, run the following commands directly from the terminal.

In [None]:
# 1000 genomes example variant data file (chromosome 20)
wget ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20130502/ALL.chr20.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz

# 1000 genomes phenotype data file
wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/working/20130606_sample_info/20130606_g1k.ped

# Simple but fast parser program (after compilation you'll have a program called a.out)
wget https://raw.githubusercontent.com/bwlewis/1000_genomes_examples/master/parse.c
cc -O2 parse.c

- R alone could read and parse the VCF file; it would just take a while longer.
- All the remaining steps in this example run from R. 

## R packages installation

In [3]:
rm(list=ls())
gc()

| | used | (Mb) | gc trigger | (Mb) | max used | (Mb) |
|---|---|---|---|---|---|---|
| Ncells | 522176 | 27.9 | 1155967 | 61.8 | 641780 | 34.3 |
| Vcells | 991667 | 7.6 | 8388608 | 64.0 | 1754213 | 13.4 |


In [4]:
#install.packages("Matrix",repos="http://cran.us.r-project.org")
#install.packages("irlba", repos="http://cran.us.r-project.org")
#install.packages("threejs", repos="http://cran.us.r-project.org")
library("Matrix")
library("irlba")
library("threejs")

Loading required package: igraph

Attaching package: ‘igraph’

The following objects are masked from ‘package:stats’:

    decompose, spectrum

The following object is masked from ‘package:base’:

    union



Let’s read the variant data for chromosome 20 into an R sparse matrix. Note that we only care about the variant number and sample (person) number in this exercise and ignore everything else. Set the working directory to the one where you have downloaded the data. 

In [14]:
setwd("/home/ec2-user/SageMaker/")

In [15]:
p = pipe("zcat ALL.chr20.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz  | sed /^#/d  | cut  -f '10-' | ./a.out | cut -f '1-2'")

In [None]:
#this step takes some time
x = read.table(p, colClasses=c("integer","integer"), fill=TRUE, row.names=NULL)

In [None]:
dim( x )
x[1:10,]

In [None]:
# Convert to a sparse matrix of people (rows) x variant (columns)
?sparseMatrix

In [None]:
chr20 = sparseMatrix(i=x[,2], j=x[,1], x=1.0)
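
To make the conversion above easier to picture, here is a minimal sketch on a made-up table. The data frame `toy` and its column names are hypothetical and only mimic the two-column parser output (variant number, then sample number); nothing here is read from the real chromosome 20 file.

```r
library(Matrix)

# Hypothetical toy version of the parser output (an assumption for
# illustration): one row per observed (variant, sample) pair.
toy <- data.frame(variant = c(1L, 1L, 2L, 4L, 4L),
                  sample  = c(1L, 3L, 2L, 1L, 3L))

# People in rows, variants in columns. Each listed pair is stored as 1;
# every other entry is an implicit zero that takes no memory.
m <- sparseMatrix(i = toy$sample, j = toy$variant, x = 1.0)

print(dim(m))  # number of people, number of variants
print(m)       # dots in the printout are the implicit zeros
```

The matrix dimensions are inferred from the largest sample and variant numbers, which is also how the full `chr20` matrix gets its shape in the cell above.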

In [None]:
# Inspect the dimensions of this matrix
print(dim(chr20))

In [None]:
# Visualize a subset of the content of the matrix
chr20[461:470,1:10]

We’ve loaded a sparse matrix with 2,504 rows (people) by 1,812,841 columns (variants).

## Compute the three principal component vectors
The next step computes the first three principal component vectors using the irlba package, then plots them either as an interactive 3D scatterplot with the threejs package or as ordinary 2D plots with the plot() function.

In [None]:
?irlba

In [None]:
# takes some time
cm = colMeans(chr20)
p = irlba(chr20, nv=3, nu=3, tol=0.1, center=cm)
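
As I read the irlba documentation, passing the column means through `center` lets irlba work with the column-centered matrix without ever forming it densely, and the result is a truncated SVD of that centered matrix. The sketch below uses only base R on a small made-up dense matrix (the matrix `a` is hypothetical) to show the underlying idea: PCA scores are the left singular vectors of the centered data scaled by the singular values.

```r
set.seed(1)

# Hypothetical toy data (an assumption for illustration):
# 10 "people" by 4 "variants".
a <- matrix(rnorm(40), nrow = 10, ncol = 4)

# Subtract each column's mean, then take the full SVD.
centered <- sweep(a, 2, colMeans(a))
s <- svd(centered)

# PCA scores from the SVD: left singular vectors scaled by singular values.
scores_svd <- s$u %*% diag(s$d)

# Reference PCA from base R for comparison.
pc <- prcomp(a, center = TRUE, scale. = FALSE)

# The two agree up to the sign of each component.
diff_scores <- max(abs(abs(scores_svd) - abs(pc$x)))
diff_sdev   <- max(abs(s$d / sqrt(nrow(a) - 1) - pc$sdev))
print(diff_scores)
print(diff_sdev)
```

The plots in this example use `p$u` directly, which are the same scores with each component rescaled to unit length, so the cluster structure is unchanged and only the axis scales differ.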

In [None]:
plot(x = p$u[,1], y = p$u[,2], xlab = "PC 1", ylab = "PC 2")
plot(x = p$u[,1], y = p$u[,3], xlab = "PC 1", ylab = "PC 3")
plot(x = p$u[,2], y = p$u[,3], xlab = "PC 2", ylab = "PC 3")

In [None]:
scatterplot3js(p$u)

The data exhibit obvious groups, and those groups correspond to ethnicities. That can be illustrated by loading ancillary data from the 1000 genomes project that identifies the “superpopulation” of each sample.

In [5]:
# Read just the header of the chromosome file to obtain the sample identifiers

ids = readLines(pipe("zcat ALL.chr20.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz  | sed -n /^#CHROM/p | tr '\t' '\n' | tail -n +10"))

In [6]:
# Read the population code for each sample from the local pedigree file, ordered by ids
# (the commented-out line downloads the same file instead)
#ped = read.table(url("ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/working/20130606_sample_info/20130606_g1k.ped"),sep="\t",header=TRUE,row.names=2)[ids,6,drop=FALSE]
ped = read.table("/home/ec2-user/SageMaker/20130606_g1k.ped",sep="\t",header=TRUE,row.names=2)[ids,6,drop=FALSE]

In [7]:
# Download the subpopulation and superpopulation codes
# WARNING: These links occasionally change. Beware!
#pop = read.table("ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase3/20131219.populations.tsv",sep="\t",header=TRUE)
pop = read.table(url("ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/20131219.populations.tsv"),sep="\t",header=TRUE)
# The last rows of pop are summary data or otherwise not relevant, so keep only the 26 populations:
pop = pop[1:26,]
super = pop[,3]
names(super) = pop[,2]
super = factor(super)

In [8]:
# Map sample sub-populations to super-populations
ped$Superpopulation = super[as.character(ped$Population)]
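
The mapping above relies on indexing a named factor by character strings. Here is a minimal sketch with a hand-typed subset of codes (typed in for illustration, not read from the populations file) and a hypothetical vector `population` standing in for `ped$Population`:

```r
# A small hand-typed lookup (an assumption for illustration):
# names are population codes, values are super-population codes.
super <- c("EUR", "EUR", "AFR", "EAS")
names(super) <- c("GBR", "FIN", "YRI", "CHB")
super <- factor(super)

# Hypothetical per-sample population codes; "XXX" is deliberately unknown.
population <- c("YRI", "GBR", "CHB", "YRI", "XXX")

# Character indexing matches on names, so each sample picks up the
# super-population of its own population code. Unknown codes become NA.
mapped <- super[population]
print(mapped)
```

The `as.character()` call in the cell above matters when `ped$Population` is stored as a factor: indexing with a factor would use its integer codes as positions rather than matching on the code names.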

In [10]:
# Plot with colors corresponding to super populations
N = length(levels(super))
N
# Interactive plot
#scatterplot3js(p$u, col=rainbow(N)[ped$Superpopulation], size=0.5)

In [None]:
# Non-interactive plot
plot(x = p$u[,1], y = p$u[,2], col=rainbow(N)[ped$Superpopulation], xlab = "PC 1", ylab = "PC 2")
plot(x = p$u[,1], y = p$u[,3], col=rainbow(N)[ped$Superpopulation], xlab = "PC 1", ylab = "PC 3")
plot(x = p$u[,2], y = p$u[,3], col=rainbow(N)[ped$Superpopulation], xlab = "PC 2", ylab = "PC 3")
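
The colored plots above do not say which color belongs to which super-population. One possible way to add that, sketched on made-up data (the vectors `grp`, `pc1` and `pc2` are hypothetical stand-ins for `ped$Superpopulation` and the columns of `p$u`), is to build the palette once and reuse it in a legend:

```r
set.seed(2)

# Hypothetical stand-ins (assumptions for illustration only).
grp <- factor(c("AFR", "EUR", "EAS", "AFR", "EUR"))
pc1 <- rnorm(length(grp))
pc2 <- rnorm(length(grp))

# One color per factor level; indexing by the factor uses its integer
# codes, so every sample gets the color of its own group.
cols <- rainbow(nlevels(grp))
point_col <- cols[grp]

plot(x = pc1, y = pc2, col = point_col, pch = 19, xlab = "PC 1", ylab = "PC 2")
legend("topright", legend = levels(grp), col = cols, pch = 19)
```

Because the legend uses `levels(grp)` and `cols` in the same order, the labels and colors stay in step with the plotted points.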

In [None]:
ls()

In [None]:
Sys.time()
sessionInfo()