# Case-study: use of PIC-SURE API to extract data from the Jackson Heart Study cohort

## INTRO - Install the required libraries

Here we set up the Jupyter Notebook environment. We also install the newly created package picsuRe, which facilitates the use of the PIC-SURE API.

In [None]:
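# Disable SSL certificate checks for all httr requests made in this session
httr::set_config(httr::config(ssl_verifypeer = 0L, ssl_verifyhost = 0L, ssl_verifystatus = 0L))
# Point R at the system tar binary if the TAR environment variable does not name an existing file
if (!file.exists(Sys.getenv("TAR")))  Sys.setenv(TAR = "/bin/tar")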

install.packages("devtools", repos = "https://cran.r-project.org")
install.packages("reticulate", repos = "https://cran.r-project.org")
install.packages("ggthemes", repos = "https://cran.r-project.org")
install.packages("rlang", repos = "https://cran.r-project.org")
install.packages("Rcpp", repos = "https://cran.r-project.org")
install.packages("ggplot2", repos = "https://cran.r-project.org")

library(devtools)
library(reticulate)
library(ggplot2)
library(ggthemes)

install_github("hms-dbmi/picsuRe")
install_github("kaz-yos/tableone")
library(picsuRe)
library(tableone)

## 1. Data extraction
`environment`: The URL of the environment

`key`: To authenticate with PIC-SURE, put your key or token in an otherwise empty text file in the top-level folder of your Jupyter Notebook. The key is read from that file, so it is not visible to anyone but you.

`variables`: A vector with the variables of interest. Each element can be a variable name or a path, and `*` can be used as a wildcard. If an element corresponds to a node, all the variables below that node are returned.
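To give an intuition for the wildcard, here is a minimal sketch of how a leading `*` could match the end of a longer path. The paths below are invented for illustration, and the matching is done with base R's `glob2rx` and `grepl`; this is an assumption about the general behavior, not the actual implementation inside picsuRe.

```r
# Hypothetical variable paths, invented for illustration (not real PIC-SURE output)
paths <- c(
  "Study A/01. Demographics/Gender",
  "Study A/02. Tobacco/Smoked at least 400 cigarettes",
  "Study A/02. Tobacco/Age when started smoking"
)

# Translate the wildcard pattern into a regular expression
pattern <- utils::glob2rx("*Smoked at least 400 cigarettes")

# Keep only the paths that end with the requested variable name
matched <- paths[grepl(pattern, paths)]
print(matched)
```

With a leading wildcard, the variable can be requested without spelling out the folders above it.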

In [None]:
env <- "https://topmed-dev.hms.harvard.edu"
key <- as.character(read.table("topmedkey.csv", sep=",")[1,1])

var <- c(Consent_groups = "The Jackson Heart Study - phs000286/00. population/consent_groups",
         Age = "Age (yrs) at baseline clinic visit",
         Gender = "01. Demographics/Gender",
         LV_thickness = "M-mode diastolic IV septum thickness in mm",
         Smoking = "*Smoked at least 400 cigarettes",
         Diastolic_BP = "*Diastolic (first BP)")

With the function `picsure`, we build our query and get the results back from the API. The output is a dataset with the variables of interest. By default, it returns all the patients that have at least one value for at least one of the requested variables.

In [None]:
demo <- picsure(env, key, var)
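The default inclusion rule described above (a patient is returned if at least one requested variable has a value) can be sketched on a toy data frame. The data and the column `patient_id` are invented for illustration, and the filter is written in base R; it only mimics the rule and is not how the API applies it.

```r
# Toy data standing in for an API result (all values are invented)
toy <- data.frame(
  patient_id = 1:4,
  Age        = c(54, NA, 61, NA),
  Gender     = c("Female", "Male", NA, NA),
  stringsAsFactors = FALSE
)

requested <- c("Age", "Gender")

# TRUE for patients with a value in at least one requested variable
has_any_value <- rowSums(!is.na(toy[, requested, drop = FALSE])) > 0

returned <- toy[has_any_value, ]
print(returned)
```

A consequence of this rule is that the returned dataset can still contain missing values in individual columns, which is why the next step filters on specific variables.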

For simplicity, we exclude the observations where "Gender" or "Smoking" is missing.

In [None]:
demo <- demo[!(demo$Gender == ""),]
demo <- demo[!(demo$Smoking == ""),]
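One caveat with filtering on `== ""`: if a column contains `NA` as well as empty strings, the comparison returns `NA` and the subset gains rows filled with `NA`. The sketch below shows this on invented data and uses a small helper, `is_present` (a hypothetical name), that treats both cases as missing. Whether the PIC-SURE result actually contains `NA` in these columns is an assumption to check on the real data.

```r
# Toy data with both empty strings and NA (all values are invented)
toy <- data.frame(
  Gender  = c("Female", "", NA, "Male"),
  Smoking = c("Yes", "No", "No", ""),
  stringsAsFactors = FALSE
)

# Comparing NA with "" gives NA, and an NA row index yields an all-NA row
naive <- toy[!(toy$Gender == ""), ]

# Hypothetical helper: TRUE only for values that are neither NA nor empty
is_present <- function(x) !is.na(x) & x != ""

clean <- toy[is_present(toy$Gender) & is_present(toy$Smoking), ]
print(naive)
print(clean)
```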

## 2. Use the data to make statistics
### 2.a. Summary statistics
Let's take a look at the characteristics of our population.

In [None]:
catVars <- c("Consent_groups", "Gender", "Smoking")
vars <- c("Consent_groups", "Gender", "Smoking", "LV_thickness", "Age", "Diastolic_BP")
paste("We have", nrow(demo), "patients in our population.")
"Table 1: Description of the population from the Jackson Heart Study"
CreateTableOne(vars, data = demo[,-1], factorVars = catVars, strata = c("Gender"), test = FALSE)

### 2.b. Comparison of a categorical variable with a continuous one.
#### 2.b.1. Comparison of age between men and women
We want to start by looking at the distribution of age in our population.

In [None]:
age <- demo$Age
summary(age)
hist(age,
     main="Histogram for Age", 
     xlab="Age of participants", 
     border="black", 
     col="sky blue",
     xlim=c(0,100),
     breaks=20,
     prob = TRUE)
# density() stops on missing values unless na.rm = TRUE
lines(density(age, na.rm = TRUE), col = "blue", lwd = 2)

The distribution of age among the participants is bimodal, with one mode around 50 years old and the other around 65 years old.

##### Now let's break down the distribution of age by gender categories.

In [None]:
Age <- demo$Age
Gender <- demo$Gender
demo2 <- droplevels(demo)

boxplot(Age~Gender,data=demo2, main="Age by Gender among the Jackson Heart Study Cohort", xlab="Gender", ylab="Age")

The distribution of age appears to be similar among men and women, with a median around 50 to 55 years old.
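A box plot draws the median of each group rather than the mean, so it can help to print both next to the figure. The following sketch does this with `tapply` on simulated ages; the numbers are invented and do not come from the JHS data.

```r
set.seed(1)

# Simulated ages for two groups (invented values, not JHS data)
toy <- data.frame(
  Gender = rep(c("Female", "Male"), each = 50),
  Age    = c(rnorm(50, mean = 55, sd = 12), rnorm(50, mean = 54, sd = 12))
)

# Median and mean of age within each gender category
medians <- tapply(toy$Age, toy$Gender, median)
means   <- tapply(toy$Age, toy$Gender, mean)

print(rbind(median = medians, mean = means))
```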

#### 2.b.2. Comparison of the sitting blood pressure among men and women from the JHS cohort
First, let's see the distribution of the diastolic blood pressure among the JHS cohort.

In [None]:
diastolic <- demo$Diastolic_BP
summary(diastolic)
hist(diastolic,
     main="Distribution of Diastolic blood pressure among the cohort",
     sub="-The dark line corresponds to a normal distribution-",
     xlab="Diastolic blood pressure (mmHg)", 
     ylab="Density",
     border="black", 
     col="wheat1",
     xlim=c(40,150),
     breaks=10,
     las = 1,
     prob = TRUE
    )
m <- mean(diastolic, na.rm = TRUE)
std <- sd(diastolic, na.rm = TRUE)
# Overlay the normal density that has the same mean and standard deviation
curve(dnorm(x, mean=m, sd=std), col="wheat4", lwd=3, add=TRUE, yaxt="n")

We can see that the distribution has a bell-shaped curve that is slightly left-skewed. However, our population contains enough patients for the central limit theorem to apply to our analysis.
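The central limit theorem concerns the distribution of the sample mean, not of the raw values. As a minimal illustration on simulated data (an exponential distribution chosen only because it is clearly skewed, not JHS data), the means of repeated samples are far more symmetric than the values they are computed from.

```r
set.seed(42)

# A clearly skewed simulated "population" (not JHS data)
population <- rexp(100000, rate = 1)

# Means of 2000 random samples of 200 values each
sample_means <- replicate(2000, mean(sample(population, 200)))

# Sample skewness: close to 0 for a symmetric distribution
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3

print(c(raw_values = skewness(population), sample_means = skewness(sample_means)))
```

This is why a t-test on group means can remain reasonable for a large sample even when the individual measurements are not normally distributed.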

Let's run a t-test to look for a significant difference in diastolic BP between men and women.

In [None]:
Gender <- demo$Gender
diastolic <- demo$Diastolic_BP
demo2 <- droplevels(demo)
t.test(diastolic~Gender)
boxplot(diastolic~Gender,data=demo2, main="Diastolic blood pressure by Gender", xlab="Gender", ylab="Diastolic blood pressure (mmHg)",   las = 1)

The p-value is lower than 0.05, therefore we can conclude that the diastolic blood pressure is statistically significantly lower among the female population of the Jackson cohort than among the male population. The difference is also visible in the box plots.
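The p-value alone does not say how large the difference is. The object returned by `t.test` also carries the group means and a confidence interval for their difference, as sketched here on simulated blood pressures (the means and standard deviations are invented, not estimates from the JHS cohort).

```r
set.seed(7)

# Simulated diastolic blood pressure in mmHg (invented values, not JHS data)
toy <- data.frame(
  Gender       = rep(c("Female", "Male"), each = 100),
  Diastolic_BP = c(rnorm(100, mean = 77, sd = 10), rnorm(100, mean = 81, sd = 10))
)

# Welch two-sample t-test (the default of t.test)
res <- t.test(Diastolic_BP ~ Gender, data = toy)

print(res$estimate)  # mean in each group
print(res$conf.int)  # 95% confidence interval for the Female minus Male difference
print(res$p.value)
```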

### 2.c. Comparison of 2 categorical variables
Let's analyze the tobacco epidemic among the Jackson Heart Study cohort subjects.
#### Firstly, we want to know the proportion of smokers, broken down by gender

In [None]:
demo <- demo[((demo$Gender == "Male" | demo$Gender == "Female")
            & (demo$Smoking == "Yes" | demo$Smoking == "No")),]
demo <- droplevels(demo)

Smokers <- demo$Smoking
Gender <- demo$Gender

TwoByTwo <- table(Gender, Smokers)
TwoByTwo
chisq.test(Gender, Smokers)
mosaicplot(TwoByTwo, color = TRUE, main = "Mosaic plot of smokers by gender categories")

The chi-square test above indicates a statistically significantly higher proportion of smokers among men than among women. The difference is also visible on the mosaic plot.
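To read the direction of the association directly, row proportions can be printed alongside the test, and the expected counts can be checked against the usual rule of thumb (all expected counts above 5). The counts below are invented for illustration and are not the JHS results.

```r
# Hypothetical 2x2 table of counts (invented, not JHS results)
tab <- matrix(c(300, 150, 100, 120), nrow = 2,
              dimnames = list(Gender  = c("Female", "Male"),
                              Smokers = c("No", "Yes")))

# Proportion of smokers and non-smokers within each gender (rows sum to 1)
row_props <- prop.table(tab, margin = 1)
print(round(row_props, 3))

# Chi-square test; for 2x2 tables R applies Yates' continuity correction by default
res <- chisq.test(tab)
print(res$expected)
print(res$p.value)
```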

## 3. Focus on myocardial hypertrophy
#### 3.1 Distribution
Histogram showing the distribution of the interventricular septum thickness measured in diastole (M-mode), that is, during ventricular relaxation.

In [None]:
demo <- demo[!(is.na(demo$LV_thickness)),]
hist(demo$LV_thickness,
     xlab="Septal thickness in mm",
     main = "Distribution of septum thickness among the JHS cohort",
     xlim=c(5,20),
     breaks=19)
abline(v=15,col="red")

The distribution among our population doesn't seem to fit a bell-shaped curve. Most values are concentrated on the left and there is a long right tail, so the distribution is right-skewed. The red line drawn at 15 mm represents the threshold above which myocardial hypertrophy is defined.
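The visual impression can be backed by two numbers: the sample skewness (positive when the right tail is longer) and the proportion of participants above the 15 mm threshold. The sketch below computes both on simulated, log-normally distributed values; the parameters are invented and the result says nothing about the actual JHS cohort.

```r
set.seed(3)

# Simulated septal thickness in mm with a right tail (invented, not JHS data)
thickness <- rlnorm(1000, meanlog = log(9), sdlog = 0.2)

# Threshold taken from the text above
threshold <- 15

# Share of simulated participants above the threshold
prop_above <- mean(thickness > threshold)

# Sample skewness: positive values indicate a longer right tail
skew <- mean((thickness - mean(thickness))^3) / sd(thickness)^3

print(c(proportion_above_15mm = prop_above, skewness = skew))
```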

#### 3.2 Comparison between male and female

We can now run a t-test in order to compare the wall thickness of the interventricular septum between females and males.

In [None]:
demo2 <- demo[(demo$Gender == "Female" | demo$Gender == "Male"),]

# Take both vectors from the same filtered data frame so that they stay aligned
LV_thickness <- demo2$LV_thickness
Gender <- demo2$Gender

summary(LV_thickness)

p <- ggplot(demo2, aes(x=Gender, y=LV_thickness, fill=Gender)) + geom_boxplot()
p + labs(subtitle="Wall thickness of interventricular septum by gender")

In [None]:
t.test(LV_thickness~Gender)

The p-value is lower than 0.05, therefore we can conclude that the interventricular septum thickness is statistically significantly lower among the female population of the Jackson cohort than among the male population.
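Because the septum thickness does not follow a bell-shaped curve, one possible sensitivity check is to run a rank-based test next to the t-test. The sketch below uses the Wilcoxon rank-sum test (`wilcox.test`), which is a different technique from the one used in this notebook, on simulated data with invented parameters. If both tests point the same way, the conclusion depends less on the normality assumption.

```r
set.seed(11)

# Simulated right-skewed thickness values in mm (invented, not JHS data)
toy <- data.frame(
  Gender       = rep(c("Female", "Male"), each = 150),
  LV_thickness = c(rlnorm(150, meanlog = log(8.5), sdlog = 0.2),
                   rlnorm(150, meanlog = log(9.5), sdlog = 0.2))
)

# Welch t-test, as in the analysis above
welch <- t.test(LV_thickness ~ Gender, data = toy)

# Wilcoxon rank-sum test, which does not assume normality
rank_sum <- wilcox.test(LV_thickness ~ Gender, data = toy)

print(c(welch = welch$p.value, wilcoxon = rank_sum$p.value))
```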

In [None]:
p <- ggplot(data=demo,aes(x=Age,y=LV_thickness))
p + theme_tufte(base_size=14) + stat_smooth(method='loess') + facet_grid(~Gender) + labs(subtitle="Wall thickness of interventricular septum by gender and age")