# Case study: using the PIC-SURE API to extract data from the CARDIA cohort

## INTRO - Install the required libraries

We install the newly created `picsuRe` package, which facilitates the use of the PIC-SURE API.

We also set up the Jupyter Notebook environment.

In [None]:
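# Disable SSL certificate verification (only acceptable for a trusted development server)
httr::set_config(httr::config(ssl_verifypeer = 0L, ssl_verifyhost = 0L, ssl_verifystatus  = 0L))
# Point TAR to /bin/tar when the TAR environment variable is unset or invalid
if (!file.exists(Sys.getenv("TAR")))  Sys.setenv(TAR = "/bin/tar")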

install.packages("devtools", repos = "http://cran.r-project.org")
install.packages("reticulate", repos = "http://cran.r-project.org")
install.packages("ggthemes", repos = "https://cran.cnr.berkeley.edu/")
install.packages("rlang", repos = "http://cran.r-project.org")
install.packages("Rcpp", repos = "http://cran.r-project.org")
install.packages("ggplot2", repos = "http://cran.r-project.org")

library(devtools)
library(reticulate)
library(ggplot2)
library(ggthemes)

install_github("hms-dbmi/picsuRe")
install_github("kaz-yos/tableone")
library(picsuRe)
library(tableone)

## 1. Data extraction
`environment`: The URL of the environment

`key`: To authenticate with PIC-SURE, put your key or token in an otherwise empty text file in the top-level folder of your Jupyter Notebook. The key is read from that file, so it is never displayed in the notebook and nobody but you can see it.

`variables`: A vector of the variables of interest. Each element can be a variable name or a full path, and `*` can be used as a wildcard. If an element corresponds to a node, all the variables below that node are returned.

In [None]:
env <- "https://topmed-dev.hms.harvard.edu"
key <- as.character(read.table("topmedkey.csv", sep=",")[1,1])

var <- c(Race = "Coronary Artery Risk Development in Young Adults Study Cohort - phs000285/01. demographics/Race (verified at exam 2)",
        Gender = "Coronary Artery Risk Development in Young Adults Study Cohort - phs000285/01. demographics/Sex (verified at exam 2)",
        Age = "Coronary Artery Risk Development in Young Adults Study Cohort - phs000285/01. demographics/Calculated age at exam 1",
        Septal_thickness_systole = "1990-1991 Year 5/02. Clinical data/Cardiology/Echocardiography/M-mode/M-mode: vent septal thickness - systole",
        Has_smoked_cigarettes = "1985-1986 Year 0/01. Medical history/Tobacco, alcohol and drug use/Tobacco use form/Cigarette smoking status",
        Has_smoked_cigars = "1985-1986 Year 0/01. Medical history/Tobacco, alcohol and drug use/Tobacco use form/03. Subject has smoked cigars")
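A key file that holds a single token can also be read as plain text instead of as a table. The following is a minimal sketch, assuming a one-line file; the temporary file and the value `not-a-real-token` are made up so that the snippet runs on its own and are not part of the notebook.

```r
# Hypothetical stand-in for the key file; in the notebook this would be the
# existing file in the top-level folder (for example "topmedkey.csv").
key_file <- tempfile(fileext = ".txt")
writeLines("  not-a-real-token  ", key_file)

# Read the first line and strip surrounding whitespace.
key_example <- trimws(readLines(key_file, n = 1, warn = FALSE))

# Fail early if the file was empty rather than sending an empty key to the API.
stopifnot(nzchar(key_example))

# Print only the length of the key, never the key itself.
nchar(key_example)
```

Trimming whitespace guards against a trailing space or newline, which is a common cause of failed authentication.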

With the `picsure` function, we build our query and retrieve the results from the API. The output is a dataset containing the variables of interest. By default, it returns all the patients who have a value for at least one of the variables.

In [None]:
demo <- picsure(env, key, var)

For simplicity, we exclude the observations where the smoking data are missing.

In [None]:
demo <- demo[!(demo$Has_smoked_cigarettes == ""),]
demo <- demo[!(demo$Has_smoked_cigars == ""),]
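One point to keep in mind with this kind of filter: in R, a comparison such as `x == ""` returns `NA` when `x` is `NA`, and indexing a data frame with `NA` produces rows filled with `NA`. The sketch below uses a made-up data frame (`toy` and the helper `has_value` are hypothetical, not CARDIA data) to show a filter that drops both empty strings and `NA` values.

```r
toy <- data.frame(
  id = 1:5,
  Has_smoked_cigarettes = c("Yes", "", "No", NA, "Yes"),
  Has_smoked_cigars     = c("No", "No", "", "Yes", "Yes"),
  stringsAsFactors = FALSE
)

# TRUE only when the value is present and is not an empty string.
has_value <- function(x) !is.na(x) & x != ""

# Keep the rows where both smoking variables have a value.
toy_complete <- toy[has_value(toy$Has_smoked_cigarettes) &
                    has_value(toy$Has_smoked_cigars), ]
toy_complete$id
```

Only the rows with a value in both columns remain.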

## 2. Use the data to compute statistics
### 2.a. Summary statistics
Let's take a look at the characteristics of our population.

In [None]:
catVars <- c("Race", "Gender", "Has_smoked_cigarettes", "Has_smoked_cigars")
vars <- c("Race", "Gender", "Age", "Septal_thickness_systole", "Has_smoked_cigarettes", "Has_smoked_cigars")

paste("We have", nrow(demo), "patients in our population.")
"Table 1: Description of the population from the CARDIA Study"
CreateTableOne(vars, data = demo[,-1], factorVars = catVars, strata = c("Gender"), test = FALSE)
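To make explicit what a stratified descriptive table contains, the same kind of summary can be approximated with base R. This is a rough sketch on simulated data (every value below is made up, and the output will not match `CreateTableOne` in layout): a mean and standard deviation per stratum for a continuous variable, and column percentages per stratum for a categorical one.

```r
set.seed(1)
toy <- data.frame(
  Gender = rep(c("Female", "Male"), each = 50),
  Age    = round(runif(100, min = 18, max = 30)),
  Has_smoked_cigars = sample(c("Yes", "No"), 100, replace = TRUE)
)

# Continuous variable: mean and standard deviation within each stratum.
age_by_gender <- aggregate(Age ~ Gender, data = toy,
                           FUN = function(x) c(mean = mean(x), sd = sd(x)))
age_by_gender

# Categorical variable: percentages within each stratum (columns sum to 100).
cigar_pct <- round(100 * prop.table(table(toy$Has_smoked_cigars, toy$Gender),
                                    margin = 2), 1)
cigar_pct
```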

### 2.b. Comparison of a categorical variable with a continuous one.
#### 2.b.1. Comparison of age between males and females
We start by looking at the distribution of age in our population.

In [None]:
Age <- demo$Age
summary(Age)
hist(Age,
     main="Distribution of the age at enrollment among the cohort",
     sub="-The dark line fits a normal distribution-",
     xlab="Age at enrollment (years)", 
     ylab="Frequency",
     border="black", 
     col="wheat1",
     xlim=c(0,40),
     ylim=c(0,0.13),
     breaks=20,
     las = 2,
     prob = TRUE
    )
m <- mean(Age, na.rm = TRUE)
std <- sd(Age, na.rm = TRUE)
# Overlay the normal density with the same mean and standard deviation;
# curve() evaluates the expression over its own variable x
curve(dnorm(x, mean=m, sd=std), col="wheat4", lwd=3, add=TRUE)

We can see that the distribution of age is not normal in our population.
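A histogram gives only a visual impression. One common complement is a Shapiro-Wilk test together with a normal Q-Q plot. The sketch below uses simulated ages (drawn uniformly between 18 and 30, an assumption made only so that the example runs without the CARDIA data); `age_sim` is a hypothetical name.

```r
set.seed(42)
age_sim <- runif(500, min = 18, max = 30)

# Shapiro-Wilk test: the null hypothesis is that the sample comes from a
# normal distribution, so a small p-value is evidence against normality.
sw <- shapiro.test(age_sim)
sw$p.value

# Q-Q plot: points far from the reference line indicate a departure from normality.
qqnorm(age_sim, main = "Normal Q-Q plot of simulated ages")
qqline(age_sim, col = "wheat4", lwd = 2)
```

`shapiro.test` accepts between 3 and 5000 non-missing values, so a larger cohort would need to be sampled or assessed with the Q-Q plot alone.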

### 2.c. Comparison of 2 categorical variables.
We want to know whether gender and cigar smoking are associated.

In [None]:
demo <- demo[((demo$Gender == "Female" | demo$Gender == "Male")
            & (demo$Has_smoked_cigars == "Yes" | demo$Has_smoked_cigars == "No")),]
demo <- droplevels(demo)

Cigars_smokers <- demo$Has_smoked_cigars
Gender <- demo$Gender

TwoByTwo <- table(Gender, Cigars_smokers)
TwoByTwo
chisq.test(Gender, Cigars_smokers)

In [None]:
mosaicplot(TwoByTwo, color = TRUE, main = "Mosaic plot of cigar smokers by gender")

The p-value is lower than 0.05, so we can conclude that the proportion of cigar smokers is significantly lower among women than among men in the CARDIA cohort. The difference is also clearly visible on the mosaic plot.
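To show what the chi-squared test computes, here is a worked sketch on a made-up 2x2 table (the counts are invented and do not come from CARDIA). It derives the expected counts under independence, computes the Pearson statistic by hand, and compares it with `chisq.test`.

```r
# Made-up counts: rows are gender, columns are cigar smoking status.
tab <- matrix(c(40, 60,
                15, 85),
              nrow = 2, byrow = TRUE,
              dimnames = list(Gender = c("Male", "Female"),
                              Cigars_smokers = c("Yes", "No")))

# Expected counts under independence: (row total * column total) / grand total.
expected <- outer(rowSums(tab), colSums(tab)) / sum(tab)

# Pearson chi-squared statistic and p-value with (2-1)*(2-1) = 1 degree of freedom.
chi2 <- sum((tab - expected)^2 / expected)
p_manual <- pchisq(chi2, df = 1, lower.tail = FALSE)

# chisq.test() applies Yates' continuity correction to 2x2 tables by default;
# correct = FALSE reproduces the uncorrected statistic computed above.
res <- chisq.test(tab, correct = FALSE)
c(manual = chi2, chisq.test = unname(res$statistic))
```

Because of the default continuity correction, a call such as `chisq.test(Gender, Cigars_smokers)` on two binary factors reports a slightly smaller statistic than the uncorrected formula.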

### 3. Focus on Myocardial hypertrophy
#### 3.1 Distribution
Histogram showing the distribution of the interventricular septum thickness measured during ventricular contraction (systole).

In [None]:
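# The extracted variable is Septal_thickness_systole (see the `var` vector above)
demo <- demo[!(is.na(demo$Septal_thickness_systole)),]
hist(demo$Septal_thickness_systole,
     xlab="Septal thickness at systole (cm)",
     main = "Distribution of septum thickness among the CARDIA cohort",
     xlim=c(0,3),
     breaks=20)
# Red line: 1.5 cm threshold for myocardial hypertrophy
abline(v=1.5,col="red")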

The distribution in our population does not seem to fit a bell-shaped curve. It is right-skewed: most values cluster at the low end, with a long right tail. The red line drawn at 1.5 cm (15 mm) represents the threshold above which myocardial hypertrophy is defined.
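The share of subjects above the 1.5 cm threshold can be counted directly. The sketch below does so on simulated measurements (a log-normal sample chosen only because it has a long right tail; the parameters and the name `thickness_sim` are assumptions, not estimates from CARDIA).

```r
set.seed(7)
# Simulated septal thickness in cm; a log-normal distribution has a long right tail.
thickness_sim <- rlnorm(1000, meanlog = log(1.1), sdlog = 0.2)

threshold_cm <- 1.5  # same threshold as the red line on the histogram

# Number and percentage of simulated subjects above the threshold.
n_above <- sum(thickness_sim > threshold_cm)
pct_above <- 100 * mean(thickness_sim > threshold_cm)
c(n_above = n_above, pct_above = round(pct_above, 1))

# Comparing mean and median is a quick check of skew: with a long right tail,
# the mean is typically larger than the median.
c(mean = mean(thickness_sim), median = median(thickness_sim))
```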

#### 3.2 Comparison between White and African-American subjects

We can now run a t-test to compare the wall thickness of the interventricular septum between White and African-American subjects.

In [None]:
# Keep only the two groups being compared and drop the unused factor levels
demo2 <- demo[(demo$Race == "White, not Hispanic" | demo$Race == "Black, not Hispanic"),]
demo2 <- droplevels(demo2)

# Take both vectors from demo2 so that they have the same length
Septal_thickness_systole <- demo2$Septal_thickness_systole
Race <- demo2$Race

summary(Septal_thickness_systole)

p <- ggplot(demo2, aes(x=Race, y=Septal_thickness_systole, fill=Race)) + geom_boxplot()
p + labs(subtitle="Wall thickness of the interventricular septum during systole (cm), by race")

In [None]:
t.test(Septal_thickness_systole~Race)

The p-value is lower than 0.05, so we can conclude that the interventricular septum is significantly thinner among White subjects of the CARDIA cohort than among African-American subjects.
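By default, `t.test` runs the Welch version of the test, which does not assume equal variances in the two groups. The sketch below reproduces its statistic by hand on simulated measurements (the two groups, their sizes, means and standard deviations are made up for illustration).

```r
set.seed(123)
# Simulated measurements in cm for two hypothetical groups.
group_a <- rnorm(80, mean = 1.20, sd = 0.20)
group_b <- rnorm(120, mean = 1.30, sd = 0.25)

# Welch statistic: difference in means divided by the unpooled standard error.
se <- sqrt(var(group_a) / length(group_a) + var(group_b) / length(group_b))
t_manual <- (mean(group_a) - mean(group_b)) / se

# t.test() uses the Welch (unequal variance) version unless var.equal = TRUE.
res <- t.test(group_a, group_b)
c(manual = t_manual, t.test = unname(res$statistic))
```

Since the septal thickness distribution is skewed, a rank-based alternative such as `wilcox.test` could be run alongside the t-test as a sensitivity check.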

In [None]:
# Use demo2 so that only the two compared groups appear in the facets
p <- ggplot(data=demo2,aes(x=Age,y=Septal_thickness_systole))
p + theme_tufte(base_size=14) + stat_smooth(method='loess') + facet_grid(~Race) + labs(subtitle="Wall thickness of interventricular septum by race and age")

We can see that, at a given age, the wall thickness of the interventricular septum during systole is higher in the Black population than in the White population, and that it increases with age.