# Using dbGaP2x, an R package to explore and sort phenotypic data from dbGaP

## Introduction
### Load the package

In [1]:
devtools::install_github("hms-dbmi/sandboxR", force = TRUE)
library(sandboxR)

Downloading GitHub repo hms-dbmi/sandboxR@master
from URL https://api.github.com/repos/hms-dbmi/sandboxR/zipball/master
Installing sandboxR
'/opt/conda/lib/R/bin/R' --no-site-file --no-environ --no-save --no-restore  \
  --quiet CMD INSTALL  \
  '/tmp/RtmpkPeAnI/devtools277382b9326/hms-dbmi-sandboxR-97f6186'  \
  --library='/opt/conda/lib/R/library' --install-tests 



### List the functions provided by the package

In [2]:
lsf.str("package:sandboxR")

browse.dbgap : function (phs, jupyter = FALSE)  
browse.study : function (phs, jupyter = FALSE)  
consent.groups : function (phs)  
datatables.dict : function (phs)  
dbgap.data_dict : function (xml, dest)  
dbgap.decrypt : function (file, key = FALSE)  
dbgap.download : function (krt, key = FALSE)  
dbgap.expand : function (phs, files, destination, study_name = phs)  
dl.nhanes : function (destination = getwd())  
is.parent : function (phs)  
list.oldmaps : function (mappath)  
look.oldmap : function (mappath, old)  
MapToTree : function (mappath)  
n.pop : function (phs, consentgroups = TRUE, gender = TRUE)  
n.tables : function (phs)  
n.variables : function (phs)  
parent.study : function (phs)  
phs.version : function (phs)  
readytoload : function (mappath, short_name, long_name, phs)  
recover.map : function (oldcsv, mappath)  
search.dbgap : function (term, jupyter = FALSE)  
study.name : function (phs)  
sub.study : function (phs)  
table.expand : function (study_name, files, 

## 1. Search for dbGaP studies
### Let's try to explore the "Jackson Heart Study" cohort that exists on dbGaP.
###### The dbGaP search engine can be tricky, which is why we created the function search.dbgap(): it helps you find the studies related to your search term by opening the results in your web browser.
Note that if you run this function in a JupyterHub environment (jupyter = TRUE), it will return a URL instead, since JupyterHub doesn't have access to your local browser.

In [3]:
search.dbgap("Jackson", jupyter = TRUE)

#### dbGaP returns the list of studies related to your term. As you can see, there are 6 studies associated with the "Jackson Heart Study" (JHS). One of these studies is the main one, a.k.a. the "parent study", whereas the others are substudies. In this case, phs000286.v5.p1 is the parent study. First, we can use the phs.version() function to make sure that this is the latest version of the study. We can abbreviate the phs name by giving just the digits, or we can use the full dbGaP ID.

In [4]:
phs.version("286")
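
As a rough illustration of the abbreviation rule above, the sketch below expands a short ID into the canonical "phs" plus six digits form. `normalize_phs()` is a hypothetical helper written for this notebook, not a sandboxR function, and the package may handle IDs differently internally.

```r
# Hypothetical helper (not part of sandboxR): expand an abbreviated
# study ID such as "286" or "phs499" to the canonical "phs000286" form.
normalize_phs <- function(phs) {
  digits <- sub("^phs", "", phs, ignore.case = TRUE)  # drop the prefix if present
  digits <- sub("\\..*$", "", digits)                 # drop a ".v5.p1"-style suffix
  stopifnot(grepl("^[0-9]+$", digits))                # only digits should remain
  sprintf("phs%06d", as.integer(digits))              # left-pad to six digits
}

normalize_phs(c("286", "000286", "phs499", "phs000286.v5.p1"))
```

All four spellings used in this notebook should map to a single canonical ID per study, which is convenient when IDs are used as keys or folder names.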

##### The is.parent() function is useful for testing whether a study is a parent study or a substudy

In [5]:
is.parent("000286") # JHS main cohort
is.parent("phs499") # substudy "CARe" for JHS

#### If you don't know the parent study of a substudy, try parent.study()

In [6]:
parent.study("phs000499")

##### Conversely, use sub.study() to get the names and IDs of the substudies of a parent study

In [7]:
sub.study("286")

phs,name
phs000499.v3.p1,NHLBI Jackson Heart Study Candidate Gene Association Resource (CARe)
phs000498.v3.p1,Jackson Heart Study Allelic Spectrum Project
phs000402.v3.p1,NHLBI GO-ESP: Heart Cohorts Exome Sequencing Project (JHS)
phs001098.v1.p1,T2D-GENES Multi-Ethnic Exome Sequencing Study: Jackson Heart Study
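
The IDs in the output above follow the dbGaP accession pattern `phsNNNNNN.vX.pY` (study, version, participant set). A minimal sketch of splitting such IDs into their parts is shown below; `parse_accession()` is a hypothetical helper for illustration and is not provided by sandboxR.

```r
# Hypothetical helper (not part of sandboxR): split full accessions
# into study ID, version and participant-set number.
parse_accession <- function(id) {
  pattern <- "^(phs[0-9]{6})\\.v([0-9]+)\\.p([0-9]+)$"
  parts <- regmatches(id, regexec(pattern, id))
  rows <- lapply(parts, function(x) {
    # x[1] is the full match; x[2:4] are the captured groups
    # (NA when the ID does not match the pattern)
    data.frame(phs = x[2],
               version = as.integer(x[3]),
               participant_set = as.integer(x[4]),
               stringsAsFactors = FALSE)
  })
  do.call(rbind, rows)
}

parse_accession(c("phs000499.v3.p1", "phs001098.v1.p1"))
```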


##### If you want to get the name of a study from its dbGaP ID, use study.name()

In [8]:
study.name("286")

##### Finally, you can view your study on dbGaP with browse.dbgap().
##### If a website exists for this study, you can browse it using browse.study()

In [9]:
browse.dbgap("286", jupyter = TRUE)
browse.study("286", jupyter = TRUE)

## 2. Explore the characteristics of your study
##### Each dbGaP study can have multiple consent groups, each with its own specificities. Use consent.groups() to get the number and names of the consent groups in the study that you are exploring. Let's keep focusing on JHS.

In [10]:
JHS <- "phs000286"
consent.groups(JHS)

Unnamed: 0,shortName,longName
0,NRUP,"Subjects did not participate in the study, did not complete a consent document and are included only for the pedigree structure and/or genotype controls, such as HapMap subjects"
1,HMB-IRB-NPU,"Health/Medical/Biomedical (IRB, NPU)"
2,DS-FDO-IRB-NPU,"Disease-Specific (Focused Disease Only, IRB, NPU)"
3,HMB-IRB,Health/Medical/Biomedical (IRB)
4,DS-FDO-IRB,"Disease-Specific (Focused Disease Only, IRB)"


##### Use n.pop() to get the number of participants included in each group

In [11]:
n.pop(JHS)
n.pop(JHS, consentgroups = FALSE)

consent_group,male,female,total
HMB-IRB,1860,2504,4549
HMB-IRB-NPU,264,505,802
DS-FDO-IRB-NPU,63,107,180
HMB-IRB,784,1232,2131
DS-FDO-IRB,173,289,489
TOTAL,3144,4637,8151
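
Counts like these are often easier to compare as shares of the whole study. The sketch below assumes the result can be handled as a data frame with the columns shown above; the rows are a hand-copied subset of that output, and the column name `pct_of_study` is introduced here for illustration.

```r
# A few rows copied by hand from the n.pop() output above (assumed layout).
pop <- data.frame(
  consent_group = c("HMB-IRB-NPU", "DS-FDO-IRB-NPU", "DS-FDO-IRB", "TOTAL"),
  male   = c(264, 63, 173, 3144),
  female = c(505, 107, 289, 4637),
  total  = c(802, 180, 489, 8151),
  stringsAsFactors = FALSE
)

# Separate the summary row from the per-group rows
grand_total <- pop$total[pop$consent_group == "TOTAL"]
groups <- pop[pop$consent_group != "TOTAL", ]

# Percentage of the whole study population in each consent group
groups$pct_of_study <- round(100 * groups$total / grand_total, 1)
groups
```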


##### Use n.tables() and n.variables() to get the number of datatables in your study and the total number of variables
(n.variables goes into the study files to count the actual number of variables)

In [12]:
n.tables(JHS)
n.variables(JHS)

#### datatables.dict() returns a data frame with the IDs (phtxxxxxx), names and descriptions of the datatables in your study

In [13]:
tablesdict <- datatables.dict(JHS)
head(tablesdict)

pht,dt_study_name,dt_label
pht002539.v2,ESP_HeartGO_JHS_Subject_Phenotypes,"Subject ID, ESP cohort, target capture used in sequencing, sequence center, race, sex, affection status, family medical history of stroke, participant medical history of asthma and COPD, ankle brachial index, artery disease status, atrioventricular block, blood pressure, body weight, height and BMI, coronary artery calcium, EKG, Framingham Risk Score, intimal-medial thickness, laboratory tests including basophils, eosinophils, neutrophils, lymphocytes, lymphocytes, blood fasting insulin and glucose, level of C-reactive protein, LDL, HDL, triglycerides, uric acid, urinary creatinine, serum creatinine, menopause, MI, FEV1, FVC, stroke status, type 2 diabetes, Wolff-Parkinson-White pattern, hormone replacement therapy, and smoking status of subjects participated in the ""National Heart Lung and Blood Institute (NHLBI) GO-ESP: Heart Cohorts Component of the Exome Sequencing Project (JHS)"" sub study of the ""Jackson Heart Study (JHS) Cohort"" project."
pht001948.v1,CSTA,Agatston score of all coronary section among participants of the Jackson Heart Study including adult 35-84 years old African Americans.
pht001947.v1,CSIA,Approach to life B. Life style among participants of the Jackson Heart Study including adult 35-84 years old African Americans.
pht001968.v1,PPAA,Post physical activity monitoring among participants of the Jackson Heart Study including adult 35-84 years old African Americans.
pht001955.v1,ECHA,Echocardiographic abnormalities among participants of the Jackson Heart Study including adult 35-84 years old African Americans.
pht001952.v1,DPASS_DIET1,Dietary data (DPASS) among participants of the Jackson Heart Study including adult 35-84 years old African Americans.
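
Once the dictionary is a data frame, a keyword search over the labels is a quick way to locate a table of interest. The following is a minimal sketch using a small stand-in data frame with the same column names as the output above; `find_tables()` and `tablesdict_demo` are hypothetical names, and the labels are shortened by hand.

```r
# Stand-in for the data frame returned by datatables.dict(); the rows are
# copied from the output above, with labels shortened for readability.
tablesdict_demo <- data.frame(
  pht = c("pht001948.v1", "pht001955.v1", "pht001952.v1"),
  dt_study_name = c("CSTA", "ECHA", "DPASS_DIET1"),
  dt_label = c("Agatston score of all coronary section",
               "Echocardiographic abnormalities",
               "Dietary data (DPASS)"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: case-insensitive keyword search in the table labels
find_tables <- function(dict, keyword) {
  hits <- grepl(keyword, dict$dt_label, ignore.case = TRUE)
  dict[hits, c("pht", "dt_study_name")]
}

find_tables(tablesdict_demo, "echo")
```

The same call should work on the real `tablesdict` object, provided it keeps the `pht`, `dt_study_name` and `dt_label` columns.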


#### variables.dict() returns a data frame with the variable IDs (phvxxxxxxxx), their names in the study, the datatable they come from and their descriptions
(this may take even longer)

In [14]:
vardict <- variables.dict(JHS)
head(vardict)

dt_study_name,phv,var_name,var_desc
ESP_HeartGO_JHS_Subject_Phenotypes,phv00165323.v2,SUBJID,Subject ID
ESP_HeartGO_JHS_Subject_Phenotypes,phv00165322.v2,ESP_Cohort,Cohort name [JHS]
ESP_HeartGO_JHS_Subject_Phenotypes,phv00165324.v2,ESP_phenotype,"ESP Phenotype group (phenotype that the sample was selected to be sequenced for) [EOMI_Control (Early MI control), LDL_Low, LDL_High, BP_Low (low blood pressure); BP_High (high blood pressure); DPR (Deeply Phenotyped Reference); BMI_High]"
ESP_HeartGO_JHS_Subject_Phenotypes,phv00181282.v1,Sequence_center,"Indicates where the sample was sequence at [Broad, UW]"
ESP_HeartGO_JHS_Subject_Phenotypes,phv00181283.v1,Target,Indicates target capture used in sequencing
ESP_HeartGO_JHS_Subject_Phenotypes,phv00181284.v1,ESP_race_selfreport,Self report race [African American]
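
Both dictionaries share the `dt_study_name` column, so they can be joined to attach a datatable ID to every variable. The sketch below uses small stand-in data frames built from the outputs above, and assumes that `dt_study_name` is unique per datatable; `tables_demo`, `vars_demo` and `vars_with_pht` are names introduced for illustration.

```r
# Stand-ins for the two dictionaries, using values from the outputs above.
tables_demo <- data.frame(
  pht = "pht002539.v2",
  dt_study_name = "ESP_HeartGO_JHS_Subject_Phenotypes",
  stringsAsFactors = FALSE
)
vars_demo <- data.frame(
  dt_study_name = rep("ESP_HeartGO_JHS_Subject_Phenotypes", 3),
  phv = c("phv00165323.v2", "phv00165322.v2", "phv00181283.v1"),
  var_name = c("SUBJID", "ESP_Cohort", "Target"),
  stringsAsFactors = FALSE
)

# Attach the datatable ID to every variable via the shared table name
vars_with_pht <- merge(vars_demo, tables_demo, by = "dt_study_name")

# Number of variables per datatable
table(vars_with_pht$pht)
```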


Now that we have explored our datasets, let's use sandboxR to clean our variables and gather them into a tree that will be easier for researchers to use. Note that section 3 requires creating and moving a lot of files in your environment, so it is easier to run on your local computer than in the JupyterHub environment.

## 3. Extract your study
### 3.1. Export your data from dbGaP
To get your data from dbGaP, you need to request access and obtain a decryption key. This is done here: https://dbgap.ncbi.nlm.nih.gov/aa/wga.cgi?login=&page=login
### 3.2. Decrypt your files
We found that the dbGaP decryption system can be tricky, so we created dbgap.decrypt() to easily decrypt the files that you have downloaded. Note that the "file" argument can be a single file or a folder containing multiple encrypted files. Also, this function currently works only on macOS.
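
To illustrate the file-or-folder behaviour described above, here is a minimal sketch of how a path could be resolved into a list of encrypted files before decryption. `collect_encrypted()` is a hypothetical helper, not the actual dbgap.decrypt() implementation, and the `.ncbi_enc` extension is an assumption about how the downloaded files are named.

```r
# Hypothetical helper (not part of sandboxR): return the encrypted files
# behind a path that may be either a single file or a folder.
collect_encrypted <- function(path, pattern = "\\.ncbi_enc$") {
  if (dir.exists(path)) {
    # Folder: pick up every matching file, including those in subfolders
    list.files(path, pattern = pattern, full.names = TRUE, recursive = TRUE)
  } else if (file.exists(path)) {
    # Single file: use it as is
    path
  } else {
    stop("No such file or folder: ", path)
  }
}

# Small demonstration with empty placeholder files in a temporary folder
demo_dir <- file.path(tempdir(), "dbgap_demo")
dir.create(demo_dir, showWarnings = FALSE)
file.create(file.path(demo_dir, c("a.txt.ncbi_enc", "b.txt.ncbi_enc", "notes.txt")))

basename(collect_encrypted(demo_dir))
```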