# Case study: using the PIC-SURE API to extract data from the COPD cohort

## INTRO - Install the required libraries

We install the `picsuRe` package, which simplifies calls to the PIC-SURE API.

We also set up the Jupyter Notebook environment.

In [None]:
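# Disable SSL certificate checks for httr requests (only acceptable in a trusted environment)
httr::set_config(httr::config(ssl_verifypeer = 0L, ssl_verifyhost = 0L, ssl_verifystatus  = 0L))
# Point R at the system tar binary if the TAR environment variable does not name an existing file
if (!file.exists(Sys.getenv("TAR")))  Sys.setenv(TAR = "/bin/tar")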

install.packages("devtools", repos = "http://cran.r-project.org")
install.packages("reticulate", repos = "http://cran.r-project.org")
install.packages("ggthemes", repos = "https://cran.cnr.berkeley.edu/")
install.packages("rlang", repos = "http://cran.r-project.org")
install.packages("Rcpp", repos = "http://cran.r-project.org")
install.packages("ggplot2", repos = "http://cran.r-project.org")

library(devtools)
library(reticulate)
library(ggplot2)
library(ggthemes)

install_github("hms-dbmi/picsuRe")
install_github("kaz-yos/tableone")
library(picsuRe)
library(tableone)

## 1. Data extraction
`environment`: The URL of the environment.

`key`: To authenticate with PIC-SURE, put your key or token in an otherwise empty text file in the top-level folder of your Jupyter Notebook (the code below reads `key.csv`). The key is read from that file, so nobody but you sees it.

`variables`: A vector with the variables of interest. Each element can be a variable name or a path, and `*` can be used as a wildcard. If an element corresponds to a node, all the variables below that node are returned.
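
To give a feel for what a wildcard selects, here is a minimal sketch in base R. It does not call the PIC-SURE API, and the exact matching rules of the API are not specified here; the paths are hypothetical and `glob2rx()` is used only as a stand-in to illustrate shell-style wildcard matching.

```r
# Hypothetical variable paths, loosely modeled on the ones used in this notebook.
paths <- c(
  "01 Demographics/Gender",
  "01 Demographics/Race",
  "03 Clinical data/Smoking status"
)

# glob2rx() (from utils) turns a shell-style wildcard into a regular expression.
pattern <- glob2rx("01 Demographics/*")

# Keep only the paths that sit below the "01 Demographics" node.
matched <- paths[grepl(pattern, paths)]
print(matched)
```

In this sketch only the two demographics paths are kept, which mirrors the idea that a node or wildcard returns everything below it.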

In [None]:
env <- "https://copdgene.hms.harvard.edu"
key <- as.character(read.table("key.csv", sep=",")[1,1])

var1 <- "00 Affection status"
var2 <- "00 Consent groups"
var3 <- "01 Demographics/01 Demographics/Gender"
var4 <- "01 Demographics/Age at enrollment"
var5 <- "01 Demographics/01 Demographics/Race"
var6 <- "03 Clinical data/Respiratory disease form/05 Environmental exposures/01 Cigarette smoking/02 Do you now smoke cigarettes as of one month ago"
var7 <- "Oxygen saturation and therapy/05 Resting SaO2 in percent"


var <- c(var1, var2, var3, var4, var5, var6, var7)
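
The key file described above can also be read as plain text. The following is a minimal sketch, assuming the file holds the token on its first line; the temporary file and the token value are hypothetical and only make the example runnable on its own.

```r
# Hypothetical demonstration: write a fake token to a temporary file,
# then read it back the way a real key file would be read.
key_file <- tempfile(fileext = ".txt")
writeLines("  FAKE-TOKEN-123  ", key_file)

# readLines() returns one string per line; trimws() drops stray whitespace.
key_demo <- trimws(readLines(key_file, n = 1, warn = FALSE))

# Basic sanity check before sending the token anywhere.
stopifnot(nzchar(key_demo))
print(key_demo)

# Remove the temporary file.
unlink(key_file)
```

Trimming whitespace guards against a trailing newline or space in the file, which would otherwise become part of the token.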

With the function `picsure`, we build the query and retrieve the results from the API. The output is a dataset containing the variables of interest. By default, it returns every patient who has a value for at least one of the variables.

In [None]:
demo <- picsure(env, key, var, verbose = TRUE)

In [None]:
names(demo)[2]<-"Affection_status"
names(demo)[3]<-"Consent_groups"
names(demo)[7]<-"Do_you_now_smoke_cigarettes"
names(demo)[8]<-"Resting_SaO2_in_percent"
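
Renaming by position depends on the order in which the columns come back. As an alternative, here is a minimal sketch that renames by matching the old names; the toy data frame and its long column names are hypothetical and are not actual API output.

```r
# Toy data frame with hypothetical long column names (not actual API output).
toy <- data.frame(
  Patient_id = 1:3,
  `00 Affection status` = c("Case", "Control", "Case"),
  `05 Resting SaO2 in percent` = c(95, 98, 91),
  check.names = FALSE
)

# Map old names to new ones explicitly, so a change in column order cannot
# silently rename the wrong column.
new_names <- c(
  "00 Affection status" = "Affection_status",
  "05 Resting SaO2 in percent" = "Resting_SaO2_in_percent"
)
hits <- match(names(new_names), names(toy))
stopifnot(!anyNA(hits))  # fail loudly if an expected column is missing
names(toy)[hits] <- unname(new_names)
print(names(toy))
```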

## 2. Use the data to compute statistics
### 2.a. Summary statistics
Let's take a look at the characteristics of our population.

In [None]:
catVars <- c("Consent_groups", "Gender", "Affection_status", "Race", "Do_you_now_smoke_cigarettes")
vars <- c("Consent_groups", "Affection_status", "Race", "Do_you_now_smoke_cigarettes", "Age_at_enrollment", "Resting_SaO2_in_percent")

paste("We have", nrow(demo), "patients in our population.")
"Table 1: Description of the population from the COPD Study"
CreateTableOne(vars, data = demo[,-1], factorVars = catVars, strata = c("Gender"), test = FALSE)
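
To make the content of such a table concrete, here is a minimal base R sketch of the same kind of summaries: counts and percentages for a categorical variable and mean and standard deviation for a continuous one, each stratified by gender. The values are hypothetical; only the column names mirror the ones used above.

```r
# Toy data with hypothetical values; column names mirror the ones used above.
toy <- data.frame(
  Gender = c("Female", "Female", "Male", "Male", "Male"),
  Affection_status = c("Case", "Control", "Case", "Case", "Control"),
  Age_at_enrollment = c(60, 55, 70, 65, 50)
)

# Categorical variable: counts per stratum, then column percentages.
counts <- table(toy$Affection_status, toy$Gender)
percents <- round(100 * prop.table(counts, margin = 2), 1)

# Continuous variable: mean and standard deviation per stratum.
age_mean <- tapply(toy$Age_at_enrollment, toy$Gender, mean)
age_sd   <- tapply(toy$Age_at_enrollment, toy$Gender, sd)

print(counts)
print(percents)
print(rbind(mean = age_mean, sd = age_sd))
```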

### 2.b. Distribution of a continuous variable: age at enrollment

In [None]:
Age <- demo$Age_at_enrollment
summary(Age)
# prob = TRUE plots densities, so the y-axis shows a density rather than a count
hist(Age,
     main="Distribution of the age at enrollment among the cohort",
     sub="-The dark line fits a normal distribution-",
     xlab="Age at enrollment (years)", 
     ylab="Density",
     border="black", 
     col="wheat1",
     xlim=c(20,100),
     ylim=c(0,0.05),
     breaks=10,
     las = 2,
     prob = TRUE
    )
m <- mean(Age, na.rm = TRUE)
std <- sd(Age, na.rm = TRUE)
# curve() evaluates the expression over its own grid of x values, so x needs no prior definition
curve(dnorm(x, mean=m, sd=std), col="wheat4", lwd=3, add=TRUE)
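
The normal curve can only be overlaid because the histogram is drawn on the density scale. Here is a minimal sketch on simulated ages (hypothetical data, not taken from the cohort) that checks that the bars of a density histogram have a total area of 1, then draws the overlay.

```r
set.seed(42)
# Simulated ages (hypothetical data, not taken from the cohort).
age_sim <- rnorm(500, mean = 60, sd = 9)

# plot = FALSE returns the histogram object without drawing it.
h <- hist(age_sim, breaks = 10, plot = FALSE)

# On the density scale, bar heights times bar widths sum to 1,
# which is what makes the bars comparable to dnorm().
area <- sum(h$density * diff(h$breaks))
print(area)

# Draw the histogram on the density scale and overlay the fitted normal curve.
hist(age_sim, breaks = 10, prob = TRUE, col = "wheat1",
     main = "Simulated ages", xlab = "Age (years)", ylab = "Density")
curve(dnorm(x, mean = mean(age_sim), sd = sd(age_sim)),
      col = "wheat4", lwd = 3, add = TRUE)
```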

### 2.c. Comparison of two categorical variables: cases and smokers

In [None]:
# %in% returns FALSE for missing values, so rows with NA are dropped instead of becoming all-NA rows
demo <- demo[(demo$Affection_status %in% c("Case", "Control")
            & demo$Do_you_now_smoke_cigarettes %in% c("Yes", "No")),]
demo <- droplevels(demo)

Smokers <- demo$Do_you_now_smoke_cigarettes
Cases <- demo$Affection_status

table(Cases, Smokers)
chisq.test(Cases, Smokers)
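
To show what the test computes, here is a minimal sketch that derives the Pearson chi-square statistic by hand on a hypothetical 2x2 table (not results from the cohort) and compares it with `chisq.test`. Note that `chisq.test` applies Yates' continuity correction to 2x2 tables by default, so the comparison sets `correct = FALSE`.

```r
# Hypothetical 2x2 counts (not results from the cohort).
tab <- matrix(c(30, 20, 10, 40), nrow = 2, byrow = TRUE,
              dimnames = list(Cases = c("Case", "Control"),
                              Smokers = c("No", "Yes")))

# Expected counts under independence: row total * column total / grand total.
expected <- outer(rowSums(tab), colSums(tab)) / sum(tab)

# Pearson statistic without continuity correction.
stat_manual <- sum((tab - expected)^2 / expected)

# chisq.test() applies Yates' continuity correction to 2x2 tables by default;
# correct = FALSE reproduces the uncorrected statistic computed above.
res <- chisq.test(tab, correct = FALSE)
print(stat_manual)
print(res$statistic)
```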

### 2.d. Statistical analysis
We can now run a t-test to compare the resting SaO2 between cases and controls.

In [None]:
demo2 <- demo[demo$Affection_status %in% c("Case", "Control"),]

# Take both vectors from the same filtered data frame so that they stay aligned
Resting_SaO2_in_percent <- demo2$Resting_SaO2_in_percent
Affection_status <- demo2$Affection_status

summary(Resting_SaO2_in_percent)
t.test(Resting_SaO2_in_percent~Affection_status)
boxplot(Resting_SaO2_in_percent~Affection_status, main="Resting SaO2 in percent by Affection status", xlab="Affection status", ylab="Resting SaO2 in percent",   las = 1)
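
When called with a formula and no further arguments, `t.test` runs Welch's test (`var.equal = FALSE`), which does not assume equal variances in the two groups. Here is a minimal sketch that recomputes the Welch statistic by hand on hypothetical SaO2 values (not data from the cohort) and compares it with the output of `t.test`.

```r
# Hypothetical SaO2 values for two groups (not data from the cohort).
sao2  <- c(92, 94, 91, 95, 93, 97, 98, 96, 99, 97)
group <- factor(rep(c("Case", "Control"), each = 5))

# t.test() with a formula defaults to Welch's test (var.equal = FALSE).
res <- t.test(sao2 ~ group)

# Welch statistic by hand: difference in means over its standard error.
m <- tapply(sao2, group, mean)
v <- tapply(sao2, group, var)
n <- tapply(sao2, group, length)
t_manual <- (m[["Case"]] - m[["Control"]]) / sqrt(sum(v / n))

print(t_manual)
print(res$statistic)
```

The sign of the statistic follows the order of the factor levels, here Case minus Control.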