# Clustering analysis and PCA #

## Missing R packages ##

If a script in this section fails with an error saying that the gplots package is not installed, you have two options: (1) run the following command to add a pre-downloaded copy of the package to the **$R_LIBS** environment variable...

In [1]:
%%bash
export R_LIBS=$R_LIBS:/users/avanti/R/x86_64-unknown-linux-gnu-library/3.0/


...or (2) install the package locally by running install.packages("gplots"):

In [3]:
%%bash
R --no-save <<'EOF'
install.packages("gplots",repos='http://cran.us.r-project.org')
q()
EOF


R version 3.2.3 (2015-12-10) -- "Wooden Christmas-Tree"
Copyright (C) 2015 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

  Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

> install.packages("gplots",repos='http://cran.us.r-project.org')

The downloaded source packages are in
	‘/tmp/RtmpvpcHlO/downloaded_packages’
> q()


Installing package into ‘/users/annashch/R/x86_64-pc-linux-gnu-library/3.2’
(as ‘lib’ is unspecified)
trying URL 'http://cran.us.r-project.org/src/contrib/gplots_3.0.1.tar.gz'
Content type 'application/x-gzip' length 578626 bytes (565 KB)
downloaded 565 KB

* installing *source* package ‘gplots’ ...
** package ‘gplots’ successfully unpacked and MD5 sums checked
** R
** data
** inst
** preparing package for lazy loading
** help
*** installing help indices
** building package indices
** installing vignettes
** testing if installed package can be loaded
* DONE (gplots)
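
One caveat with option (1): each %%bash cell runs in its own subprocess, so a variable exported in one cell is not visible in later cells. A minimal sketch of a workaround, assuming the notebook runs on an IPython (Python) kernel, is to set the variable in a regular Python cell; processes started afterwards, including later %%bash cells, inherit it. The helper name `append_to_path_var` is made up for this illustration.

```python
import os

def append_to_path_var(name, new_dir, env=os.environ):
    """Append new_dir to a colon-separated variable, creating it if unset."""
    current = env.get(name, "")
    parts = [p for p in current.split(":") if p]
    if new_dir not in parts:  # avoid duplicates when the cell is re-run
        parts.append(new_dir)
    env[name] = ":".join(parts)
    return env[name]

# Path copied from the export command above; adjust it to your own setup.
append_to_path_var("R_LIBS", "/users/avanti/R/x86_64-unknown-linux-gnu-library/3.0/")
```

The same pattern can be used for other variables the cells rely on, such as SRC_DIR and WORK_DIR.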


## The process_peaks.sh script ##

Cluster analysis is a simple way to visualize patterns in the data. By clustering peaks according to their signal across different time points, we may find groups of peaks that behave similarly across these time points. By clustering samples according to their signal across peaks, we can perform a simple sanity check of data quality: samples from the same time point should cluster together.
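
To make the sample-level sanity check concrete, here is a small sketch in plain Python. It is not the clustering that the scripts perform; it only asks, for each sample, which other sample has the most strongly correlated signal, which should be a replicate from the same time point. The sample names and numbers are made up for illustration.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

def closest_sample(signal):
    """Map each sample name to the other sample it correlates with best."""
    best = {}
    for name, values in signal.items():
        others = [(pearson(values, v), other)
                  for other, v in signal.items() if other != name]
        best[name] = max(others)[1]
    return best

# Hypothetical fold changes over five peaks for two time points, two replicates each.
toy = {
    "t0_rep1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "t0_rep2": [1.1, 2.1, 2.9, 4.2, 4.8],
    "t4_rep1": [5.0, 4.0, 3.0, 2.0, 1.0],
    "t4_rep2": [4.9, 4.2, 3.1, 1.8, 1.2],
}
print(closest_sample(toy))
```

If a sample's best match comes from a different time point, that sample deserves a closer look.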

We have developed a script to perform all the following steps:

In [4]:
%%bash
${SRC_DIR}/process_peaks.sh



## Heatmap generation ##

Note that the last step in the script above is to generate heatmaps. This step may fail if X11 is not installed. One solution is to copy the foldChange.tab file to your local computer, and then run:

**[path/to/]visualize_clusters.R foldChange.tab foldChange.png**

Note that visualize_clusters.R is available from the GitHub repository of scripts: https://github.com/kundajelab/training_camp/blob/master/src/visualize_clusters.R

### Optional: use RStudio ###

Another way around the absence of X11 on these machines is to use RStudio, which is served from Wotan at wotan.stanford.edu:8787. You can reach it from your local machine with SSH port forwarding, which makes the address "localhost:8787" on your computer point to port 8787 on Wotan. Open a terminal window on your local machine and enter the following, substituting your own username:

In [9]:
%%bash
ssh -L localhost:8787:localhost:8787 [yourUsername]@wotan.stanford.edu



Now keep that terminal window open and go to localhost:8787 in your web browser to open RStudio. In the R shell, use

**setwd("/tc2015/[yourUsername]/results/signal")**

to change to the directory with the foldChange.tab file (replace [yourUsername] with your username), enter the command input_file="foldChange.tab", and then use

**source("/tc2015/[yourUsername]/src/training_camp/src/visualize_clusters_from_r_shell.R")**

to visualise the contents of foldChange.tab.

## Separating out wt and ko ##

The strong difference between wt and ko may make patterns in the heatmap hard to discern. One solution is to separate the wt and ko samples and call visualize_clusters.R on each set. You can do this with:

In [5]:
%%bash
perl -lane 'if ($. == 1) {@titleIdxs = grep {$F[$_] =~ /wt/} 0..$#F; @idxs = map {$_+1} @titleIdxs; print "@F[@titleIdxs]"} else {print "@F[0,@idxs]"}' foldChange.tab > wt_foldChange.tab
perl -lane 'if ($. == 1) {@titleIdxs = grep {$F[$_] =~ /ko/} 0..$#F; @idxs = map {$_+1} @titleIdxs; print "@F[@titleIdxs]"} else {print "@F[0,@idxs]"}' foldChange.tab > ko_foldChange.tab



Then use the steps you used to visualise the heatmaps, but supply **wt_foldChange.tab** or **ko_foldChange.tab** where you previously supplied **foldChange.tab**.
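
The perl one-liners above are dense, so here is a rough Python equivalent for readers who want to see the logic spelled out. It is a sketch under the assumption (implied by the +1 index shift in the perl code) that the header line has one fewer field than the data lines, because the first column of each data line holds the peak name. The function name `select_columns` and the example rows are hypothetical.

```python
def select_columns(lines, pattern):
    """Keep the peak-name column plus every sample column whose title contains pattern."""
    out = []
    keep = []
    for line_no, line in enumerate(lines):
        fields = line.split()
        if line_no == 0:
            # Header: remember which sample titles match, and print only those.
            keep = [i for i, title in enumerate(fields) if pattern in title]
            out.append(" ".join(fields[i] for i in keep))
        else:
            # Data: column 0 is the peak name, so sample i sits at index i + 1.
            out.append(" ".join([fields[0]] + [fields[i + 1] for i in keep]))
    return out

rows = ["wt_0h ko_0h wt_4h", "peak1 1.0 2.0 3.0", "peak2 4.0 5.0 6.0"]
print("\n".join(select_columns(rows, "wt")))
```

As in the perl version, fields are split on whitespace and written back separated by single spaces.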

## PCA ##

PCA (Principal Component Analysis) is a way to identify the primary directions of variation in the data. It can also be used for very coarse-grained clustering of samples; similar samples will have similar coordinates along the principal axes.

We will perform PCA on foldChange.tab. The first step is to clean up the column labels in foldChange.tab with the following perl one-liner:

In [6]:
%%bash 
cd $WORK_DIR/results/signal
perl -i".bak" -pe '$_ = $.==1 ? do {$_ =~ s/\/[^\s]+\///g; $_ =~ s/\"//g; $_ =~ s/\-/\./g; $_ =~ s/PooledReps_Sample/samp/g; $_} : $_' foldChange.tab

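
For reference, the four substitutions that the perl one-liner applies to the header line can be written out in Python as follows. This is only an illustration of what the clean-up does, not a replacement for the command above; the function name `clean_header` and the example header are hypothetical.

```python
import re

def clean_header(header):
    """Apply the same four clean-up steps as the perl one-liner to a header line."""
    header = re.sub(r"/\S+/", "", header)          # drop directory prefixes such as /path/to/
    header = header.replace('"', "")               # drop double quotes
    header = header.replace("-", ".")              # turn dashes into dots
    header = header.replace("PooledReps_Sample", "samp")  # shorten the sample prefix
    return header

example = '"/a/b/PooledReps_Sample1-wt.tab" "/c/d/PooledReps_Sample2-ko.tab"'
print(clean_header(example))
```

Note that the perl command edits foldChange.tab in place and keeps the original as foldChange.tab.bak.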


We will now do PCA. We treat each sample as a single point in a very high-dimensional space (where the dimensionality is equal to the number of peaks that vary), and then perform dimensionality reduction in this space (if you get an X11 error, log into RStudio and follow the instructions at the end):

In [10]:
%%bash
cd $WORK_DIR/results/signal
$SRC_DIR/doPCA.R foldChange.tab



This script will produce PCA_sdev.png, which shows the standard deviation of the data along each of the principal components. Since there are only 12 datapoints, the effective dimensionality of our data is at most 12, even though there are thousands of peaks; this is why there are only 12 PCs.
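
To give a feel for what a principal component is, the sketch below finds only the first component of a tiny made-up dataset using power iteration in plain Python. This is an illustrative technique chosen to keep the example dependency-free; it is not a description of how doPCA.R computes its results. The function names are hypothetical, and the code assumes the data are not constant.

```python
import math

def center_columns(rows):
    """Subtract each column's mean so every feature is centered on zero."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[v - m for v, m in zip(row, means)] for row in rows]

def first_pc(rows, iterations=200):
    """Return (sdev, direction, scores) for the first principal component.

    rows: one list of feature values per sample.
    """
    x = center_columns(rows)
    n, p = len(x), len(x[0])
    v = [float(j + 1) for j in range(p)]            # arbitrary starting direction
    norm = math.sqrt(sum(c * c for c in v))
    v = [c / norm for c in v]
    for _ in range(iterations):
        scores = [sum(a * b for a, b in zip(row, v)) for row in x]           # X v
        w = [sum(scores[i] * x[i][j] for i in range(n)) for j in range(p)]   # X^T X v
        norm = math.sqrt(sum(c * c for c in w))
        if norm == 0:
            break                                   # no variation along v
        v = [c / norm for c in w]
    scores = [sum(a * b for a, b in zip(row, v)) for row in x]
    sdev = math.sqrt(sum(s * s for s in scores) / (n - 1))
    return sdev, v, scores

# Three made-up samples that lie exactly on a line in a two-feature space.
sdev, direction, scores = first_pc([[0.0, 0.0], [1.0, 2.0], [2.0, 4.0]])
print(sdev, direction, scores)
```

Here `direction` plays the role of one rotation vector (one weight per feature), `scores` are the sample coordinates along that axis, and `sdev` is the kind of value plotted in PCA_sdev.png.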

It also produces PC_[x]_vs_[y].png for components 1..3. How do you interpret the different principal components?

Finally, it produces the files pc[x]_rotation.txt for components 1..3, which show the contribution of each peak to the direction of the principal component; these files give a sense of which peaks are critical in defining each principal component, and in which direction (positive or negative). One interesting analysis we can do with these files is to sort the peaks by their contribution to a principal component in ascending or descending order, map the peaks to their nearest genes, and then pass the ranked list to software such as GOrilla (http://cbl-gorilla.cs.technion.ac.il/), which accepts a ranked list of genes and reports which GO terms are overrepresented towards the top.

The following commands will sort the peaks by their contribution to each principal component and then map each peak to its nearest gene:



In [11]:
%%bash 
cd $WORK_DIR/results/signal
for pcFile in pc*_rotation.txt; do
    theBase=$(basename "${pcFile}")
    # -g sorts column 2 as floating-point numbers (handles values like 1e-05); r reverses the order
    sort -k 2,2g "$pcFile" > "ascending_${theBase}"
    sort -k 2,2gr "$pcFile" > "descending_${theBase}"
done
$SRC_DIR/mapToNearestPeak.py --sigPeakInputFiles ascending*.txt descending*.txt --peaks2genesFile /tc2015/avanti/results/peaks/peaks2genes.bed

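
The intent of the two sort commands is a numeric ordering of the second column, including values written in scientific notation. A minimal Python sketch of the same ranking is shown below, assuming each line of a rotation file holds a peak name followed by its contribution; the function name `rank_peaks` and the example values are hypothetical.

```python
def rank_peaks(lines, descending=False):
    """Return peak names ordered by their numeric contribution in column 2."""
    rows = []
    for line in lines:
        fields = line.split()
        try:
            rows.append((fields[0], float(fields[1])))
        except (IndexError, ValueError):
            continue  # skip header lines or lines without a numeric second column
    rows.sort(key=lambda row: row[1], reverse=descending)
    return [name for name, _ in rows]

example = ["peakA 0.5", "peakB -1e-3", "peakC 2e-1"]
print(rank_peaks(example))
print(rank_peaks(example, descending=True))
```

A plain text sort would misplace values such as -1e-3, which is why the comparison has to be numeric.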


You can then copy the mapped files to your local computer by running the following there:

In [13]:
%%bash 
scp [yourUsername]@wotan.stanford.edu:/tc2015/[yourUsername]/results/signal/nearestGenes*pc* .



## From RStudio ##

From RStudio, run
**setwd("/tc2015/[yourUsername]/results/signal")**
(replacing [yourUsername] with your username), set inpFile="foldChange.tab", and then call:

**source("/tc2015/[yourUsername]/src/training_camp/src/doPCA_from_r_shell.R")**
