# Download dbGaP phs000424.v7.p2 GTEx (version 7)
[phs000424.v7.p2](https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs000424.v7.p2)

This notebook outlines the steps taken to download the GTEx genotype and RNA-seq data. The genotype data was retrieved from the [dbGaP Authorized Access page](https://dbgap.ncbi.nlm.nih.gov/aa/wga.cgi?page=login). Users who wish to access controlled-access data must first apply for approval; Eric Johnson is the point of contact in this case. The RNA-seq data was retrieved from the [GTEx Portal](https://www.gtexportal.org/home/datasets).


**Author**: Jesse Marks

### Software and tools
The software and tools needed for this analysis are:
* [ascp for Linux](https://gist.github.com/mfansler/71f09c8b6c9a95ec4e759a8ffc488be3)
* [SRA toolkit](https://www.ncbi.nlm.nih.gov/sra/docs/toolkitsoft/)
* [Amazon Web Services: S3](https://aws.amazon.com/)

### Install Amazon Web Services Command Line Interface (AWS CLI)
The Amazon Web Services Command Line Interface (AWS CLI) needs to be installed in order to upload the data from a local machine to S3.

In [None]:
# Install pip, a package manager for python applications on local machine
curl "https://bootstrap.pypa.io/get-pip.py" -o "get-pip.py"
python get-pip.py

# Install awscli via pip
pip install awscli

# Verify installation - the output should resemble the commented line below
aws --version
# aws-cli/1.11.178 Python/2.7.13 CYGWIN_NT-10.0/2.9.0(0.318/5/3) botocore/1.7.36
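
Before continuing, it can help to confirm that the command-line tools used in this notebook are reachable. The sketch below is one way to do that; `require_cmd` is a hypothetical helper name, not part of any of these tools. Note that on EC2 `ascp` is called by its full path later in this notebook, so it may legitimately be reported as missing from the `PATH`.

```shell
# require_cmd NAME: succeed if NAME is on the PATH; otherwise print a message and fail.
# (require_cmd is a hypothetical helper used only for this sketch.)
require_cmd() {
    if command -v "$1" >/dev/null 2>&1; then
        return 0
    fi
    echo "missing required tool: $1" >&2
    return 1
}

# Report on each tool without stopping at the first one that is missing.
for tool in aws ascp vdb-config vdb-decrypt; do
    require_cmd "$tool" || true
done
```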

#### Configure AWS
More information about Amazon S3 can be found [here](http://docs.aws.amazon.com/AmazonS3/latest/dev/Welcome.html).

In [None]:
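# Local machine
# You will be prompted for the values shown in the comments below.
# Never commit real keys to a notebook; placeholders are shown here.
aws configure
# AWS Access Key ID [None]: <YOUR_ACCESS_KEY_ID>
# AWS Secret Access Key [None]: <YOUR_SECRET_ACCESS_KEY>
# Default region name [None]: us-east-1
# Default output format [None]: text  # could be json, text, or table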

### SRA toolkit Preparation
A guide explaining the steps to download dbGaP data using the SRA toolkit can be found [here](https://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=toolkit_doc&f=dbgap_use). 

Users who wish to access controlled-access data must first apply for approval (Eric Johnson is the point of contact in this case). Please review the process at the [dbGaP Authorized Access page](https://dbgap.ncbi.nlm.nih.gov/aa/wga.cgi?page=login).

**Note**: if you are downloading the data to your local machine, use the **command prompt** rather than **Cygwin** for the step that retrieves the repository key for the project.

    1) Click "get dbGaP repository key" under My Projects
    2) Using the command prompt, follow the steps for downloading dbGaP data in the link above

Also note that once the key has been set up for a specific research project, you will not need to get the key again for other studies that fall under that project.

These data will be downloaded to EC2 and then uploaded to S3 because of their size (~0.5 TB).

### Extend EBS Volume
Because we are dealing with large amounts of data, we need to download and process it on EC2 before pushing it to S3. We will have to extend the volume of our instance in this case. The following example demonstrates how an EBS volume can be modified from the command line using the AWS CLI. An example is detailed [here](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/cli-modify.html). To make use of the new storage capacity after modifying the EBS volume, you need to [extend the Linux file system after resizing](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/recognize-expanded-volume-linux.html).

In [None]:
## EC2 ##

# First configure AWS. You will be prompted for the values shown in the comments below;
# the keys and default region can be found in the config file on the cluster launcher.
aws configure
# AWS Access Key ID [None]: AKIAI44QH8DHBEXAMPLE
# AWS Secret Access Key [None]: je7MtGbClwBF/2Zp9Utk/h3yCo8nvbEXAMPLEKEY
# Default region name [None]: us-east-1
# Default output format [None]: text

# --size is the target size in GiB. Note: remove --dry-run to actually apply the change.
aws ec2 modify-volume --dry-run --volume-id vol-038921893392154fa --size 1800

# extend file system to the new volume capacity.
sudo resize2fs /dev/xvdb
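
After resizing, it is worth confirming that the file system really has room for the download (~0.5 TB) before starting the transfer. The following is a minimal sketch under that assumption; `has_free_gib` is a hypothetical helper, and the 500 GiB threshold is only an illustration based on the size quoted above.

```shell
# has_free_gib DIR N: succeed if the file system holding DIR has at least N GiB available.
# (hypothetical helper; `df -Pk` reports available space in 1024-byte blocks in column 4)
has_free_gib() (
    dir=$1
    need_gib=$2
    avail_kib=$(df -Pk "$dir" | awk 'NR == 2 { print $4 }')
    need_kib=$((need_gib * 1024 * 1024))
    [ "$avail_kib" -ge "$need_kib" ]
)

# Check the current directory; on EC2 this would be run from the download directory.
if has_free_gib . 500; then
    echo "at least 500 GiB free"
else
    echo "fewer than 500 GiB free; extend the volume before downloading"
fi
```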


## Download RNA-seq data to local machine and subset
For RNA-seq data, we download only the tissue types we need, since there is no sense storing data we won't use. The files we will keep, per Christina Markunas' request, are:

**Blood:**

* Whole_Blood
* Cells_EBV-transformed_lymphocytes
 

**Brain:**
* Brain_Amygdala
* Brain_Anterior_cingulate_cortex_BA24
* Brain_Caudate_basal_ganglia
* Brain_Cerebellar_Hemisphere
* Brain_Cerebellum
* Brain_Cortex
* Brain_Frontal_Cortex_BA9
* Brain_Hippocampus
* Brain_Hypothalamus
* Brain_Nucleus_accumbens_basal_ganglia
* Brain_Putamen_basal_ganglia
* Brain_Spinal_cord_cervical_c-1
* Brain_Substantia_nigra
 

**Lung:**
* Lung
 

**Heart:**
* Heart_Atrial_Appendage
* Heart_Left_Ventricle
* Artery_Aorta
* Artery_Coronary
* Artery_Tibial

These files can be found at the [GTEx Portal](https://www.gtexportal.org/home/datasets). The archive is listed under the `Single-Tissue cis-eQTL Data` section:

* `GTEx_Analysis_v7_eQTL_expression_matrices.tar.gz`

The data was downloaded to my local machine at the path:

`/cygdrive/c/Users/jmarks/Desktop/AWS/S3/GTEx_Analysis_v7_eQTL_expression_matrices`

In [None]:
## local machine (Cygwin) ##
cd ~/Desktop/AWS/S3

# untar data (the argument to -f must be the archive file, not the output directory)
tar -xvzf GTEx_Analysis_v7_eQTL_expression_matrices.tar.gz

# subset data
mkdir -p subset_GTEx_Analysis_v7_eQTL_expression_matrices/{Blood,Brain,Heart,Lung}
find GTEx_Analysis_v7_eQTL_expression_matrices/ -iname '*blood*' | xargs cp -t subset_GTEx_Analysis_v7_eQTL_expression_matrices/Blood
find GTEx_Analysis_v7_eQTL_expression_matrices/ -iname '*lymphocytes*' | xargs cp -t subset_GTEx_Analysis_v7_eQTL_expression_matrices/Blood
find GTEx_Analysis_v7_eQTL_expression_matrices/ -iname '*brain*' | xargs cp -t subset_GTEx_Analysis_v7_eQTL_expression_matrices/Brain
find GTEx_Analysis_v7_eQTL_expression_matrices/ -iname '*lung*' | xargs cp -t subset_GTEx_Analysis_v7_eQTL_expression_matrices/Lung
find GTEx_Analysis_v7_eQTL_expression_matrices/ -iname '*heart*' | xargs cp -t subset_GTEx_Analysis_v7_eQTL_expression_matrices/Heart
find GTEx_Analysis_v7_eQTL_expression_matrices/ -iname '*artery*' | xargs cp -t subset_GTEx_Analysis_v7_eQTL_expression_matrices/Heart
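
The six `find` commands above follow one pattern, so they can also be driven from a short list of group and pattern pairs. The sketch below is an alternative formulation, not the commands that were actually run; `subset_by_pattern` is a hypothetical helper, and it uses `find -exec cp` instead of `xargs cp -t` so that it does not depend on GNU `cp`.

```shell
# subset_by_pattern SRC DEST PATTERN: copy every regular file under SRC whose name
# matches PATTERN (case-insensitive) into DEST, creating DEST if needed.
# (hypothetical helper used only for this sketch)
subset_by_pattern() (
    src=$1
    dest=$2
    pattern=$3
    mkdir -p "$dest"
    find "$src" -type f -iname "$pattern" -exec cp {} "$dest"/ \;
)

src=GTEx_Analysis_v7_eQTL_expression_matrices
out=subset_GTEx_Analysis_v7_eQTL_expression_matrices

# group:pattern pairs mirroring the tissue groups listed above
if [ -d "$src" ]; then
    for pair in Blood:'*blood*' Blood:'*lymphocytes*' Brain:'*brain*' \
                Lung:'*lung*' Heart:'*heart*' Heart:'*artery*'; do
        group=${pair%%:*}      # text before the first colon
        pattern=${pair#*:}     # text after the first colon
        subset_by_pattern "$src" "$out/$group" "$pattern"
    done
fi
```

Adding another tissue group then only requires one more `group:pattern` entry in the list.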

## Download and Decrypt Genotype Data to EC2

In [None]:
## local machine (cygwin) ##

# uploading the key for decryption that was downloaded onto local machine from dbGaP
scp -i ~/.ssh/gwas_rsa prj_4984.ngc ec2-user@35.169.161.38:/shared/data/studies/ncbi

# login to EC2 instance (this instance was titled 428retirement)
ssh -i ~/.ssh/gwas_rsa ec2-user@35.169.161.38
#-------------------------------------------------------------------------------------------------------------------------------


## EC2 ##
cd /shared/data/studies/ncbi

# configure the toolkit; note that I specify the location of the key that I uploaded from the local machine
/shared/bioinformatics/software/third_party/sratoolkit.2.8.2-1-centos_linux64/bin/vdb-config --import vdb-passwd/prj_4984.ngc
# Expected output:
# Repository directory is: '/shared/data/studies/ncbi/dbGaP-4984'.

# download data from dbGaP
cd /shared/data/studies/ncbi/dbGaP-4984
"/home/ec2-user/.aspera/connect/bin/ascp" -QTr -l 300M -k 1 -i "/home/ec2-user/.aspera/connect/etc/asperaweb_id_dsa.openssh" -W A3D5B68A80BBA4951B29A86CABAC14BC287FB278DAFEF491EA304EFEACE1C2BB007F31D32001E4CE985BE5FC634F30842D dbtest@gap-upload.ncbi.nlm.nih.gov:data/instant/cloviseoj/59867 . &

# decrypt (run only after the download above has finished)
/shared/bioinformatics/software/third_party/sratoolkit.2.8.2-1-centos_linux64/bin/vdb-decrypt -q files/59867 &
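
Both commands above are launched in the background with `&`. If they are ever run from a single script rather than by hand, decryption has to wait until the transfer has finished. A minimal sketch of that ordering with the shell's `wait` builtin follows; `fetch_data` and `decrypt_data` are hypothetical stand-ins for the `ascp` and `vdb-decrypt` commands (and `payload.enc` is a made-up file name), so the block runs without network access.

```shell
workdir=$(mktemp -d)

# Stand-in for the ascp transfer (hypothetical): writes a file after a short delay.
fetch_data() {
    sleep 1
    echo "encrypted payload" > "$workdir/payload.enc"
}

# Stand-in for vdb-decrypt (hypothetical): fails unless the download is present.
decrypt_data() {
    [ -f "$workdir/payload.enc" ] || return 1
    mv "$workdir/payload.enc" "$workdir/payload"
}

fetch_data &             # start the transfer in the background
fetch_pid=$!

# Block until the transfer exits, and decrypt only if it exited successfully.
status=decrypt_failed
if wait "$fetch_pid"; then
    decrypt_data && status=decrypted
else
    status=download_failed
fi
echo "$status"
```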

## Compress files (gzip) and combine
The genotype data on EC2 and the RNA-seq data stored locally both need to be compressed. After compressing them, I copy the RNA-seq data from my local machine to EC2.

In [None]:
## Local machine (Cygwin) ##

# Path to the subsetted RNA-seq data
cd /cygdrive/c/Users/jmarks/Desktop/AWS/S3/subset_GTEx_Analysis_v7_eQTL_expression_matrices

# compress individual files (quote "$file" so unusual file names are handled safely)
for file in Blood/*; do gzip "$file"; done
for file in Brain/*; do gzip "$file"; done
for file in Lung/*; do gzip "$file"; done
for file in Heart/*; do gzip "$file"; done

# send to EC2
cd ../
scp -ri ~/.ssh/gwas_rsa subset_GTEx_Analysis_v7_eQTL_expression_matrices ec2-user@35.169.161.38:/shared/data/studies/ncbi/dbGaP-4984/files/59867
scp -i ~/.ssh/gwas_rsa igsr_samples.tsv ec2-user@35.169.161.38:/shared/data/ref_panels/1000G

# login to EC2 
ssh -i ~/.ssh/gwas_rsa ec2-user@35.169.161.38
#---------------------------------------------------------------------------------------------------------

## EC2 ##
cd /shared/data/studies/ncbi/dbGaP-4984/files/59867/PhenoGenotypeFiles/RootStudyConsentSet_phs000424.GTEx.v7.p2.c1.GRU
find PhenotypeFiles/ -type f ! -name "*.gz" -exec gzip {} \;
find StudyMetaFiles/ -type f ! -name "*.gz" -exec gzip {} \;
gzip Genotype/* &
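
Since the compressed files are what gets archived, it may be worth testing each `.gz` file before uploading. The sketch below uses `gzip -t`, which checks an archive's integrity without extracting it; `verify_gzip` is a hypothetical helper name.

```shell
# verify_gzip DIR: test every .gz file under DIR with `gzip -t` and print the path
# of each file that fails. Returns non-zero if any file fails or DIR is missing.
# (hypothetical helper used only for this sketch)
verify_gzip() {
    if [ ! -d "$1" ]; then
        echo "no such directory: $1" >&2
        return 2
    fi
    # `! -exec ... \;` is true only when gzip -t exits non-zero, so -print lists bad files.
    vg_bad=$(find "$1" -type f -name '*.gz' ! -exec gzip -t {} \; -print 2>/dev/null)
    if [ -n "$vg_bad" ]; then
        printf '%s\n' "$vg_bad"
        return 1
    fi
    return 0
}

# Example: check the genotype directory if it exists in the current location.
if [ -d Genotype ]; then
    verify_gzip Genotype || echo "some archives failed the integrity test" >&2
fi
```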

## Upload to S3

In [None]:
## EC2 ##
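
The upload commands are not recorded in this notebook. As a hedged sketch only, one common approach is `aws s3 sync`, previewed first with its `--dryrun` flag. The bucket name and prefix below are hypothetical placeholders, and `sync_to_s3` is a hypothetical helper that prints the preview command unless `DO_UPLOAD=yes` is set.

```shell
# Hypothetical destination; the real bucket and prefix are not given in this notebook.
S3_DEST="s3://example-bucket/gtex/phs000424.v7.p2"
SRC_DIR="/shared/data/studies/ncbi/dbGaP-4984/files/59867"

# sync_to_s3 SRC DEST: upload SRC to DEST when DO_UPLOAD=yes; otherwise only print
# the preview command. (--dryrun lists the transfers without performing them.)
sync_to_s3() {
    if [ "${DO_UPLOAD:-no}" = yes ]; then
        aws s3 sync "$1" "$2"
    else
        echo "aws s3 sync $1 $2 --dryrun"
    fi
}

sync_to_s3 "$SRC_DIR" "$S3_DEST"
```

Running the printed command on EC2 shows what would be transferred; setting `DO_UPLOAD=yes` performs the actual upload.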
