# PIC-SURE API use-case: Phenome-Wide analysis on Cure Sickle Cell data

This tutorial notebook aims to get you up and running quickly with the R PIC-SURE API. It covers the main functionalities of the API.

## PIC-SURE R API 
### What is PIC-SURE? 

<!--img src="./img/PIC-SURE_logo.png" width= "360px"> -->

Databases exposed through the PIC-SURE API encompass a wide variety of underlying architectures and data organizations. PIC-SURE hides this complexity and exposes the different databases in the same format, allowing researchers to focus on the analysis and medical insights, thus easing the process of reproducible science.

### More about PIC-SURE
PIC-SURE stands for Patient-centered Information Commons: Standardized Unification of Research Elements. The API is available in two programming languages, Python and R, allowing investigators to query databases in the same way using either language.

PIC-SURE is a large project of which the R/Python PIC-SURE API is only one brick. Among other things, PIC-SURE also offers a graphical user interface that allows research scientists to quickly learn about the variables and data available for a specific data source.

The Python API is actively developed by the Avillach-Lab at Harvard Medical School.

GitHub repos:
* https://github.com/hms-dbmi/pic-sure-python-adapter-hpds
* https://github.com/hms-dbmi/pic-sure-python-client



 -------   

# Getting your own user-specific security token

**Before running this notebook, please be sure to review the get_your_token.ipynb notebook. It explains how to get a security token, which is mandatory to access the databases.**

# Environment set-up

### Prerequisites
- R 3.5 or later

### Packages installation

In [22]:
list_packages <- c("ggrepel",
                   "jsonlite", 
                   "ggplot2",
                   "plyr",
                   "dplyr",
                   "tidyr",
                   "purrr",
                   "devtools")

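for (package in list_packages) {
    # Install the package only if it is not already available
    if (!package %in% rownames(installed.packages())) {
        install.packages(package, dependencies = TRUE)
    }
    # Attach the package; character.only is needed because the name is a string
    library(package, character.only = TRUE)
}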

Loading required package: ggplot2

------------------------------------------------------------------------------

You have loaded plyr after dplyr - this is likely to cause problems.
If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
library(plyr); library(dplyr)

------------------------------------------------------------------------------


Attaching package: ‘plyr’


The following objects are masked from ‘package:dplyr’:

    arrange, count, desc, failwith, id, mutate, rename, summarise,
    summarize



Attaching package: ‘purrr’


The following object is masked from ‘package:plyr’:

    compact


The following object is masked from ‘package:jsonlite’:

    flatten


Loading required package: usethis



#### Installing latest R PIC-SURE API libraries from github

In [23]:
devtools::install_github("hms-dbmi/pic-sure-r-client", force = TRUE)
devtools::install_github("hms-dbmi/pic-sure-r-adapter-hpds", force = TRUE)

Downloading GitHub repo hms-dbmi/pic-sure-r-client@master

✔  checking for file ‘/private/var/folders/hm/wn0bpy0j7vl2q9gqnhhccpph0000gn/T/RtmpxBo0KI/remotes72644ecf3788/hms-dbmi-pic-sure-r-client-d62fec5/DESCRIPTION’ (461ms)
─  preparing ‘picsure’:
✔  checking DESCRIPTION meta-information
─  checking for LF line-endings in source and make files and shell scripts
─  checking for empty or unneeded directories
─  building ‘picsure_0.1.0.tar.gz’

Downloading GitHub repo hms-dbmi/pic-sure-r-adapter-hpds@master

✔  checking for file ‘/private/var/folders/hm/wn0bpy0j7vl2q9gqnhhccpph0000gn/T/RtmpxBo0KI/remotes72645341cc24/hms-dbmi-pic-sure-r-adapter-hpds-353b541/DESCRIPTION’ (358ms)
─  preparing ‘hpds’:
✔  checking DESCRIPTION meta-information
─  checking for LF line-endings in source and make files and shell scripts
─  checking for empty or unneeded directories
─  building ‘hpds_0.1.0.tar.gz’



##### Loading user-defined functions

In [24]:
source("R_lib/utils.R")

## Connecting to a PIC-SURE network

Several pieces of information are needed to access data through the PIC-SURE API: a network URL, a resource ID, and a user security token that is specific to a given URL + resource.

In [29]:
PICSURE_network_URL <- "https://curesc.hms.harvard.edu/picsure"
resource_id <- "37663534-6161-3830-6264-323031316539"
token_file <- "token.txt"

In [35]:
my_token <- scan(token_file, what = "character")
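
A token file can easily carry stray whitespace or a trailing newline, or be missing altogether. The sketch below shows one defensive way to read it, assuming the token sits on the first line of the file. `read_token` is a hypothetical helper written for this illustration; it is not part of the `picsure` or `hpds` libraries.

```r
# Hypothetical helper (not part of picsure or hpds): read a token from a
# text file, assuming the token sits on the first line of that file.
read_token <- function(path) {
  if (!file.exists(path)) {
    stop("Token file not found: ", path)
  }
  lines <- readLines(path, warn = FALSE)
  # Strip surrounding whitespace from the first line, if there is one
  token <- if (length(lines) > 0) trimws(lines[1]) else ""
  if (!nzchar(token)) {
    stop("Token file is empty: ", path)
  }
  token
}

# Usage with a throwaway file standing in for token.txt
demo_file <- tempfile(fileext = ".txt")
writeLines("  my-example-token  ", demo_file)
demo_token <- read_token(demo_file)
```

Failing early with an explicit message may be easier to diagnose than an authentication error raised later by the server.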

In [36]:
connection <- picsure::connect(url = PICSURE_network_URL,
                                 token = my_token)

In [37]:
resource <- hpds::get.resource(connection,
                               resourceUUID = resource_id)

Two objects are created here: a `connection` and a `resource` object, using the `picsure` and `hpds` libraries, respectively.

As we will only be using a single resource, **the `resource` object is actually the only one we will need to proceed with data analysis hereafter** (FYI, the `connection` object is useful to get access to different databases stored in different resources).

It is connected to the specific data source ID we specified, and lets us query and retrieve data from this source.

## Getting help with the R PIC-SURE API

The `?` operator prints the help message for any PIC-SURE library function.

In [10]:
?hpds::get.resource

get.resource {hpds},R Documentation

argument,description
connection,A PIC-SURE connection object.
resourceUUID,The UUID identity of a Resource hosted via the PIC-SURE connection.
verbose,Flag to display additional runtime information.


## Using the *variables dictionary*

Once a connection to the desired resource has been established, we first need to get a quick idea of which variables are available in the database. To this end, we will use the dictionary functions of the `hpds` library, applied to the `resource` object.

A dictionary search retrieves the records matching a specific term, or information about all available variables, using the `find.in.dictionary()` function. For instance, looking for variables containing the term `Smoke` is done this way:

In [41]:
dictionary_search <- hpds::find.in.dictionary(resource, "Smoke")
hpds::extract.dataframe(dictionary_search) %>%
     tail()

,name,categorical,categoryValues,observationCount,patientCount,HpdsDataType,min,max
,<chr>,<lgl>,<chr>,<int>,<int>,<chr>,<dbl>,<dbl>
32,"\SAC\interimhx\45. During the past week, did someone smoke in the participants' presence at work?\",True,"No,Yes",230,222,phenotypes,,
33,\SIT - Cure Sickle Cell\04 - Arm 3: Observation: Non-Transfusion\M36 (Q12) Annual 3: END\S02r1 Demographic And Phenotypic Information (s02r1_demographic_and_phenotypic_information)\22. Does either the patient or the primary caretaker identify anyone living in the home who smokes or smoked tobacco products in the last 3 years either inside or outside the home?\,True,"No,Yes",16,16,phenotypes,,
34,"\SAC\ih_form\23. Since the last visit date, has the participant smoked cigarettes?\",True,"No,Yes",264,253,phenotypes,,
35,\SIT - Cure Sickle Cell\01 - Arm 4: Screening\02 - Re-screening 1\S02r1 Demographic And Phenotypic Information (s02r1_demographic_and_phenotypic_information)\22. Does either the patient or the primary caretaker identify anyone living in the home who smokes or smoked tobacco products in the last 3 years either inside or outside the home?\,True,No,3,3,phenotypes,,
36,\SAC\ccontrol\12. Has your child smoked cigarettes in the past year?\,True,No,27,27,phenotypes,,
37,"\SAC\interimhx\39. Since the last visit date, has the participant been exposed to second hand tobacco cigarette, pipe, or cigar smoke?\",True,"No,Yes",229,216,phenotypes,,


The object returned by `hpds::find.in.dictionary()` exposes the search result through four functions: `hpds::extract.count()`, `hpds::extract.keys()`, `hpds::extract.entries()`, and `hpds::extract.dataframe()`.

In [12]:
print(list("Count"   = hpds::extract.count(dictionary_search), 
           "Keys"    = hpds::extract.keys(dictionary_search)[1:5],
           "Entries" = hpds::extract.entries(dictionary_search)[1:5]))

$Count
[1] 31

$Keys
[1] "\\CIBMTR - Cure Sickle Cell Disease\\4 - Outcomes\\Z - Other outcomes\\Stroke post HCT\\"                                                                                                                                                                                                                                         
[2] "\\SIT - Cure Sickle Cell\\04 - Arm 3: Observation: Non-Transfusion\\M36 (Q12) Annual 3: END\\S02r1 Demographic And Phenotypic Information (s02r1_demographic_and_phenotypic_information)\\35B. If Yes  year of diagnosis of silent stroke (yyyy):\\"                                                                              
[3] "\\SIT - Cure Sickle Cell\\01 - Arm 4: Screening\\04 Pre-Randomization\\S13r0 Randomization Eligibility Form (s13r0_randomization_eligibility_form)\\04. Patient with a history of a focal neurologic event lasting more than 24 hours with medical documentation or a history of prior overt stroke.\\"                       

**`hpds::extract.dataframe()` returns the result of the dictionary search as a data.frame.**

The dictionary provides various information about the variables, such as:
- observationCount: number of entries with a non-null value
- categorical: type of the variable, True if categorical, False if continuous/numerical
- min/max: only provided for non-categorical variables
- HpdsDataType: 'phenotypes' or 'genotypes'. Currently, the API only exposes 'phenotypes' variables

Hence, it makes it possible to:
* Use the various variable information as criteria for variable selection.
* Use the `name` column of the data.frame to get the actual variable names, to be used in the query, as shown below.
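
As a rough illustration of using dictionary fields as selection criteria, the sketch below filters a small data.frame that mimics the columns returned by `hpds::extract.dataframe()`. The `toy_dict` object and the threshold of 100 observations are assumptions made for this example; the rows are modelled on the SAC dictionary output shown in this notebook.

```r
# Toy stand-in for the data.frame returned by hpds::extract.dataframe();
# only three columns are reproduced, for illustration.
toy_dict <- data.frame(
  name = c("\\SAC\\sleep\\Usual Wake Time\\",
           "\\SAC\\interimhx\\30a. How many times?\\",
           "\\SAC\\event\\5. Has the participant had a surgical procedure?\\"),
  categorical = c(TRUE, FALSE, TRUE),
  observationCount = c(390L, 29L, 122L),
  stringsAsFactors = FALSE
)

# Keep categorical variables with at least 100 observations
# (the threshold is arbitrary and chosen for the example).
keep <- toy_dict$categorical & toy_dict$observationCount >= 100

# The `name` column holds the full variable names expected by the query functions.
selected_names <- toy_dict$name[keep]
print(selected_names)
```

The same logical-mask pattern should carry over to the real dictionary data.frame, since it relies only on the column names shown in the output above.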


Variable names, as currently implemented in the API, aren't handy to use right away, for two reasons:
1. They are very long.
2. They contain backslashes, which require modification right after copy-pasting.

However, using the dictionary to select variables can help to deal with this. Let's say we want to retrieve every variable from the SAC study. One way to proceed is to retrieve the whole dictionary for those variables in the form of a data.frame, as below:

In [46]:
plain_variablesDict <- hpds::find.in.dictionary(resource, "SAC") %>% hpds::extract.dataframe()

Moreover, using the `hpds::find.in.dictionary` function without a search term returns every entry, as shown in the help documentation. *For now, this takes a long time in the R PIC-SURE API implementation, and it will probably be fixed in a later version of the API.*
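
Because a full dictionary retrieval can be slow, one possible workaround is to cache the resulting data.frame on disk and reload it in later sessions. The sketch below is only an assumption about how this could be done with base R's `saveRDS()` and `readRDS()`; `cached` and `slow_fetch` are hypothetical names, and the demo uses a stand-in function rather than a real API call.

```r
# Hypothetical helper: run `fetch()` once and reuse the saved result afterwards.
cached <- function(cache_file, fetch) {
  if (file.exists(cache_file)) {
    return(readRDS(cache_file))
  }
  result <- fetch()
  saveRDS(result, cache_file)
  result
}

# Demo with a stand-in for the slow call. In the notebook, `fetch` could wrap
# the dictionary search, e.g.
# function() hpds::extract.dataframe(hpds::find.in.dictionary(resource, "SAC"))
cache_file <- tempfile(fileext = ".rds")
n_calls <- 0
slow_fetch <- function() {
  n_calls <<- n_calls + 1
  data.frame(name = c("a", "b"), stringsAsFactors = FALSE)
}
first  <- cached(cache_file, slow_fetch)
second <- cached(cache_file, slow_fetch)  # read from disk, fetch is not called again
```

Note that a cached dictionary can become stale if the resource is updated, so deleting the cache file from time to time is advisable.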

In [47]:
plain_variablesDict[10:20,]

,name,categorical,categoryValues,observationCount,patientCount,HpdsDataType,min,max
,<chr>,<lgl>,<chr>,<int>,<int>,<chr>,<dbl>,<dbl>
10,"\SAC\semtel\6. Since the last study visit, has the participant had an attack of wheezing, coughing, shortness of breath or chest tightness after playing hard or exercising?\",True,"No,Yes",67,66,phenotypes,,
11,\SAC\phlebot\12. Comments on blood drawing/centrifuging\,True,"DNA sample obtained,Difficult blood draw pt c portacath clinic not open today minimal blood obtainted,Difficult veni puncture minimal blood obtained,Difficult venipencture 1cc obtained in tube only. pt did not want additional attempt to obtain more blood,Difficult venipuncture pt veins exhausted - will attempt at next clinic or research visit.,Difficult venipuncture recently d/c from hospital,Difficult venipuncture x 2 pt viens exhausted pt did not want another attempt. no blood obtained this visit. OK to try at subsequent clinic research visit per pt.,Difficult venipunture pt veins exausted two recent hospitalizations,Dificult blood draw minimal blood obtained,Hemalyzed,IgE was no necessary,LDH not obtained,Mother refused to because of religious beliefs,NONE,No LDH obtained because of Hemolyzed sample,No LDH was obtained,Port Accessed in Heme/onc clinic by rn's tolerated well, port flushed / deaccessed,Portacath accessed by clinic RN blood obtained.,centrifuge time missing,difficult Venipuncture parent only wanted one try this time. May attempt again.,difficult venipenture pt viens exhausted to recent surgery and pain admit no blood obtained. pt refused retry,difficult venipuncture pt Veins exhausted recent admit- will try at visit 2,difficult venipuncture, Very small veins. Will obtain cbc results from clinic visit today,difficult venipunture, no blood obtained, pt cooperative; will try at next clinic research visit,drawn from mediport on first attempt without incident; flush required for blood return,missing time,no Blood obtained pt became very tearful and upset. will attempt at next visit. 
CBC obtained at clinic visit today,no Virgina sample was sent because participant is Chronic tx,no blood obtained poor venous access will attempt to obtain at sebsequent visit.,no blood obtained will attempt at sebsequent visit,no blood obtained- difficult venipuncture will obtain univ of Virginia sample at clinic visit in a month,none,time missing,time missing from centrifuged,total amount not entered; draw tubes sent to a different lab for processing; main centrifuge being serviced,very low hematocrit",61,55,phenotypes,,
12,\SAC\sleep\Screaming in his/her sleep\,True,"Don't Know,Never (does not happen),No answer,Not Often (<1 night/day a week),Often (3 to 5 nights/days a week),Sometimes (1 to 2 nights/days a week)",197,197,phenotypes,,
13,\SAC\event\5. Has the participant had a surgical procedure?\,True,"No,Yes",122,102,phenotypes,,
14,"\SAC\ih_form\6. In the past MONTH, how often has the participant had cough, wheeze, shortness of breath, or chest tightness while exercising or playing?\",True,"10 or more times per month,2 or fewer times per month,3- 4 times per month,5 - 9 times per month",372,253,phenotypes,,
15,"\SAC\interimhx\12a. If Yes, reason for MRI (check all that apply): (choice=Seizure)\",True,"Checked,Unchecked",224,224,phenotypes,,
16,\SAC\sleep\Usual Wake Time\,True,"01:00,01:30,02:00,02:30,05:00,05:30,05:40,05:45,06:00,06:15,06:20,06:30,06:40,06:45,06:50,07:00,07:15,07:20,07:30,07:40,07:45,08:00,08:15,08:30,08:45,09:00,09:10,09:30,10:00,10:30,11:00,11:30,12:00,12:30,13:00,15:00,19:00,22:00",390,196,phenotypes,,
17,\SAC\interimhx\30a. How many times?\,False,,29,27,phenotypes,0.0,6.0
18,\SAC\ih_form\13. Has the participant had an attack of wheezing that has caused him/her to be short of breath\,True,"No,Yes",319,253,phenotypes,,
19,\SAC\ih_form\Trouble falling asleep?\,True,"Always. (6 to 7 nights a week.),Don't know.,Never. (Does not happen.),Not often. (< 1 night a week.),Often. (3 to 5 nights a week.),Sometimes. (1 to 2 nights a week.)",436,253,phenotypes,,



### Parsing variable names

Though the dictionary is helpful as is, we can use a simple function, `get_multiIndex_variablesDict`, defined in `R_lib/utils.R`, to add a little more information and make variable names easier to work with.

Although not an official feature of the API, this functionality illustrates how to quickly scan and select groups of related variables.

Printing part of the "parsed names" dictionary shows the tree-like organisation of the variables at a glance. Moreover, the original and simplified variable names are now stored in the "name" and "simplified_name" columns, respectively (the simplified variable name is simply the last component of the variable name, which usually says the most about what the variable represents).
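
To make the idea of parsing concrete, here is a minimal sketch of how backslash-separated names could be split into hierarchy levels with base R. `parse_variable_names` is a hypothetical function written for this illustration; the actual helper in `R_lib/utils.R` may be implemented differently and returns more columns.

```r
# Hypothetical sketch; the real helper lives in R_lib/utils.R and may differ.
parse_variable_names <- function(var_names) {
  # Split on backslashes and drop the empty pieces produced by the leading
  # separator.
  parts <- lapply(strsplit(var_names, "\\", fixed = TRUE),
                  function(p) p[nzchar(p)])
  n_levels <- max(lengths(parts))
  # Pad each path with NA so that every row has the same number of levels.
  padded <- lapply(parts, function(p) {
    c(p, rep(NA_character_, n_levels - length(p)))
  })
  levels <- do.call(rbind, padded)
  colnames(levels) <- paste0("level_", seq_len(n_levels) - 1)
  data.frame(levels,
             # The last component is used as the simplified name
             simplified_name = vapply(parts, function(p) p[length(p)],
                                      character(1)),
             name = var_names,
             stringsAsFactors = FALSE)
}

example_names <- c("\\SAC\\sleep\\Usual Wake Time\\",
                   "\\SAC\\interimhx\\Other? Specify:\\")
parsed <- parse_variable_names(example_names)
print(parsed[, c("level_0", "level_1", "simplified_name")])
```

In R source code each backslash of a variable name has to be written as `\\`, which is one reason why selecting names programmatically from the dictionary tends to be less error-prone than copy-pasting them.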

In [49]:
# Display the variables tree hierarchy from the variables name
variablesDict <- get_multiIndex_variablesDict(plain_variablesDict)
head(variablesDict)

level_0,level_1,level_2,level_3,level_4,simplified_name,name,observationCount,categorical,categoryValues,nb_modalities,min,max,HpdsDataType
<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<int>,<lgl>,<chr>,<int>,<dbl>,<dbl>,<chr>
SAC,asthma_medication,6c. Stop Date:,,,6c. Stop Date:,\SAC\asthma_medication\6c. Stop Date:\,8,True,"07/03/2011,10/02/2003,12/05/2012,13/02/2008,25/07/2012,29/08/2012,6/3/2013,9/11/2007",,,,phenotypes
SAC,interimhx,13b.The participant complain of a stiff neck or neck pain during headache?,,,13b.The participant complain of a stiff neck or neck pain during headache?,\SAC\interimhx\13b.The participant complain of a stiff neck or neck pain during headache?\,31,True,"Never,No answer,Once in a While,With most headaches",,,,phenotypes
SAC,semtel,"11. Since the last visit, has the participant had problems with allergies?",,,"11. Since the last visit, has the participant had problems with allergies?","\SAC\semtel\11. Since the last visit, has the participant had problems with allergies?\",67,True,"No,Yes",,,,phenotypes
SAC,missvisit,Assessments to be done at rescheduled visit (check all that apply): (choice=Polysomnogram),,,Assessments to be done at rescheduled visit (check all that apply): (choice=Polysomnogram),\SAC\missvisit\Assessments to be done at rescheduled visit (check all that apply): (choice=Polysomnogram)\,8,True,Unchecked,,,,phenotypes
SAC,interimhx,Other? Specify:,,,Other? Specify:,\SAC\interimhx\Other? Specify:\,1,True,Visual Eye spots,,,,phenotypes
SAC,interim_meds,2. Medication Start Date,,,2. Medication Start Date,\SAC\interim_meds\2. Medication Start Date\,20,True,"01/11/2010,02/07/2007,03/03/2008,06/12/2010,08/12/2008,09/07/2012,09/08/2010,10/07/2012,12/04/2007,12/09/2011,15/06/2009,15/11/2010,16/06/2008,17/06/2010,19/12/2011,20/04/2011,20/12/2010,21/09/2009,25/01/2010,26/01/2009",,,,phenotypes


Below is a simple example to illustrate how easy a parsed dictionary is to use. Let's say we are interested in every variable pertaining to the "Medical History" and "Medication History" subcategories.

In [None]:
# level_2 is the third hierarchy level; %in% treats missing levels as FALSE
mask_medication <- variablesDict[["level_2"]] %in% "Medication History"
mask_medical <- variablesDict[["level_2"]] %in% "Medical History"
medication_history_variables <- variablesDict[mask_medical | mask_medication, ]
medication_history_variables

Although simple, this approach can easily be combined with other filters to quickly select the variables needed.

## Querying and retrieving data

Besides the dictionary, the second cornerstone of the API is the set of `query` functions (`hpds::query.anyof`, `hpds::query.select`, `hpds::query.filter`, `hpds::query.require`). They are the entry point for retrieving data from the resource.

First, we need to create a query object.

In [43]:
my_query <- hpds::new.query(resource = resource)

The query object is then passed to the different query functions to build the query: `hpds::query.anyof`, `hpds::query.select`, `hpds::query.filter`, `hpds::query.require`. Each of these functions accepts a query object, a list of variable names, and optional additional parameters.

- The `query.select.add()` function accepts variable names as a string or a list of strings, and makes the query return all variables included in the list, without any record (i.e. subject/row) subsetting.

- The `query.require.add()` function accepts variable names as a string or a list of strings, and makes the query return all the variables passed, and only the records that contain no null values for those variables.

- The `query.anyof.add()` function accepts variable names as a string or a list of strings, and makes the query return all variables included in the list, and only the records that contain at least one non-null value for those variables.

- The `query.filter.add()` function accepts a single variable name as a string, plus additional values to filter on for that variable. The query returns this variable and only the records that match the filter.

All four functions can be combined when building a query. The records ultimately returned by the query must meet all the specified filters.
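
The record-subsetting rules described above can be mimicked locally on a toy data.frame. The sketch below is only an analogy in base R: it does not call the API, it is not how the server evaluates queries, and the `records` data and column names are made up for the example.

```r
# Toy records; NA marks a missing (null) value. All values are made up.
records <- data.frame(
  patient_id = 1:4,
  smoke      = c("Yes", NA, "No", NA),
  stroke     = c("Yes", "Yes", "No", NA),
  stringsAsFactors = FALSE
)

# "require": keep records with no missing value in the required variable.
required <- records[!is.na(records$smoke), ]

# "anyof": keep records with at least one non-missing value among the variables.
any_of <- records[rowSums(!is.na(records[, c("smoke", "stroke")])) > 0, ]

# "filter": keep records whose value matches the filter.
filtered <- records[records$stroke %in% "Yes", ]

# Combining constraints: a record must satisfy all of them.
combined <- records[!is.na(records$smoke) & records$stroke %in% "Yes", ]
print(combined)
```

Here the "require" rule keeps patients 1 and 3, the "filter" rule keeps patients 1 and 2, and only patient 1 satisfies both, which mirrors how combined constraints narrow the returned records.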

#### Building the query

In [50]:
# Selecting all variables from "SAC" study
mask_study <- variablesDict[["level_0"]] == "SAC"
varnames <- variablesDict[mask_study, "name"] %>% as.list()

In [51]:
mask_smoke <- variablesDict["simplified_name"] == "40. Does the participant LIVE with anyone who currently smokes cigarettes? (not only the place where you live most of the time, but any other place where you also spend the night on a regular basis i.e. Grandparents' house, mom's house or dad's house, etc.)"
smoke <- variablesDict[mask_smoke, "name"] 

mask_stroke <- variablesDict["simplified_name"] == "12. Has the participant ever had a diagnosis of a silent stroke?"
stroke <- variablesDict[mask_stroke, "name"] 
values_stroke <- variablesDict[mask_stroke, "categoryValues"]

In [52]:
hpds::query.require.add(my_query, keys = smoke)
hpds::query.filter.add(my_query, 
                       keys = stroke,
                       values="Yes")
hpds::query.select.add(my_query, keys = varnames)
my_df <- hpds::query.run(my_query, result.type = "dataframe")

“the condition has length > 1 and only the first element will be used”
“the condition has length > 1 and only the first element will be used”
“the condition has length > 1 and only the first element will be used”


## Retrieving the data

Once our query object is built, we use the `query.run` function to retrieve the data corresponding to our query.

In [None]:
my_df <- hpds::query.run(my_query, result.type = "dataframe")

In [None]:
dim(my_df)

In [None]:
head(my_df)

From this point, we can proceed with data management and analysis using any other R functions or libraries.