<img src="https://dbmi.hms.harvard.edu/sites/g/files/mcu781/files/hero-images/HMS_DBMI_Logo.svg" width= "550px">

# PIC-SURE API use-case: Phenome-Wide analysis on COPDgene data

## PIC-SURE python API 
### What is PIC-SURE? 

<!--img src="./img/PIC-SURE_logo.png" width= "360px"> -->

Databases exposed through the PIC-SURE API encompass a wide heterogeneity of architectures and data organizations underneath. PIC-SURE hides this complexity and exposes the different databases in the same format, allowing researchers to focus on the analysis and medical insights, thus easing the process of reproducible science.


### Why PIC-SURE? 
Because PIC-SURE exposes heterogeneous data in a single format, data scientists and clinical researchers are relieved of the complexity of integrating clinical and genomic data, and can focus on the analysis and medical insights. This, in turn, eases the process of reproducible science.

### More about
PIC-SURE stands for Patient-centered Information Commons: Standardized Unification of Research Elements. The API is available in two programming languages, Python and R, allowing investigators to query databases in the same way using either language.

PIC-SURE is a large project of which the R/Python API is only one component. Among its other components, PIC-SURE also offers a graphical user interface, allowing scientists to quickly learn about the variables and data available for a specific data source.

The Python/R API is actively developed by the Avillach-Lab at Harvard Medical School.

GitHub repositories:
* https://github.com/hms-dbmi/pic-sure-python-adapter-hpds
* https://github.com/hms-dbmi/pic-sure-python-client



## Phenome-Wide Association Studies (PheWAS)

### What is a PheWAS analysis?
A PheWAS analysis tests the association of an individual trait (in most cases a genomic variant, but not exclusively) against a wide variety of phenotypes. It is frequently used in the genomics field, sometimes in association with GWAS analyses (the reverse process, that is, testing the association of a phenotype against multiple genetic variants).

References:
- [*Denny et al.*, 2010](https://academic.oup.com/bioinformatics/article/26/9/1205/201211)
- [*Denny et al.*, 2017](https://www.annualreviews.org/doi/abs/10.1146/annurev-genom-090314-024956)

## COPDGene data

COPDGene is a case-control study that focuses on Chronic Obstructive Pulmonary Disease (COPD) and comprises linked genomic and clinical data. It is one of the databases integrated in BioData Catalyst alongside other projects.
Although genomic data are not yet available through the PIC-SURE API, COPDGene is well suited for such a use case because it provides a specific trait (namely the presence or absence of a COPD diagnosis) that is relevant to test against every other available phenotypic variable.

 -------   

# Environment set-up

For this notebook to be reproducible out of the box in any environment, here is the code to download and install the necessary packages for this analysis. 

### Prerequisites
- Python 3.6 or later (although earlier versions of Python 3 should work too)
- pip: the Python package manager, already available on most systems with a Python interpreter installed ([pip installation instructions](https://pip.pypa.io/en/stable/installing/))

### IPython magic command

The two lines of code below load the `autoreload` IPython extension. Although not necessary to execute the rest of the Notebook, it reloads every dependency each time Python code is executed, so that changes in external files imported into this Notebook (e.g. user-defined functions stored in a separate file) are taken into account without having to manually reload libraries. This turns out to be very handy when developing interactively. More about [IPython Magic commands](https://ipython.readthedocs.io/en/stable/interactive/magics.html).

In [None]:
%load_ext autoreload
%autoreload 2

In [None]:
# Useful to estimate execution time of the Notebook, see at the end
from datetime import datetime
then = datetime.now()

### Installation of external dependencies

Installation of dependencies using the pip package manager.
<!--`%pip` execute a shell command that install the required libraries for this notebook (FYI, `%pip` install libraries for the current kernel, as opposed to the `!pip` command which install packages for the python interpreter that launched the Notebook, [see](https://stackoverflow.com/questions/38368318/installing-a-pip-package-from-within-a-jupyter-notebook-not-working/50473278#50473278)). -->


<!-- *TODO: Test %pip in older version of python interpreter, for instance in auth0 oldest environment version* -->

In [None]:
!pip install numpy==1.17.3
!pip install matplotlib==3.1.1
!pip install pandas==0.25.3
!pip install scipy==1.3.1
!pip install tqdm==4.38.0
!pip install statsmodels==0.10.2

In [None]:
!pip install git+https://github.com/hms-dbmi/pic-sure-python-adapter-hpds.git
!pip install git+https://github.com/hms-dbmi/pic-sure-python-client.git 

Import all the external dependencies, along with the user-defined functions written in files located in the `python_lib` folder.

In [None]:
import json
from pprint import pprint

import pandas as pd
import numpy as np 
import matplotlib.pyplot as plt
from scipy import stats

import PicSureHpdsLib
import PicSureClient

from python_lib.utils import get_multiIndex_variablesTable, get_dic_renaming_vars, match_dummies_to_varNames, joining_variablesTable_onCol

In [None]:
print("NB: This Jupyter Notebook was written using the following PIC-SURE API versions:\n- PicSureClient: 0.1.0\n- PicSureHpdsLib: 1.1.0\n")
print("The PIC-SURE API library versions you have installed are: \n- PicSureClient: {0}\n- PicSureHpdsLib: {1}".format(PicSureClient.__version__, PicSureHpdsLib.__version__))

##### Set up the options for displaying tables and plots in this Notebook

In [None]:
# Pandas DataFrame display options
pd.set_option("display.max_rows", 435)

# Matplotlib parameters options
fig_size = plt.rcParams["figure.figsize"]
 
# Enlarge the default figure size to 14 x 8 inches
fig_size[0] = 14
fig_size[1] = 8
plt.rcParams["figure.figsize"] = fig_size

font = {'weight' : 'bold',
        'size'   : 12}

plt.rc('font', **font)

## Connecting to HPDS resources
Gaining access to any data source using the API requires two steps:
* Connecting to a PIC-SURE network
* Connecting to an HPDS-hosted resource. Indeed, a PIC-SURE network can host several different resources. However, the only resource we will be using hereafter is the COPDGene HPDS-hosted database instance.

### Connecting to a PIC-SURE network

To connect to a network, one needs two pieces of information:
- The URL of this network. This notebook example uses the COPDGene Dev environment, whose URL is https://copdgene-dev.hms.harvard.edu/psamaui/login/?redirection_url=/picsureui/
- An authorized individual user token to gain access to the resources of the network through the API.

We will need to get a token and feed it to the API. First, we will create a blank text file that will be used to store the token.

In [None]:
import os
token_path = "./tokens/copd.txt"
if not os.path.isdir("./tokens"):
    os.mkdir("./tokens")
if not os.path.isfile(token_path):
    open(token_path, "w+").close()

To actually get the token, proceed as follows:

1. In a web browser, open the COPDGene Dev User Interface: https://copdgene-dev.hms.harvard.edu/psamaui/login/?redirection_url=/picsureui/, and choose one of the available authentication methods to enter it.
2. In the user interface, click on USER PROFILE
3. Click again on USER PROFILE and then on the COPY button
4. Back in your Jupyter environment, paste the token into the newly created text file (`./tokens/copd.txt`).

![Getting authorization token](img/get_token_screen.png)

**The token is strictly personal**; be careful not to share it with anyone (this is why the `./tokens` directory is explicitly excluded in the `.gitignore` file).

Once the token has been copied into the prespecified file, we can read it and feed it to the API as follows.

In [None]:
with open(token_path) as f:
    user_token = f.read().strip()  # drop any trailing newline or spaces pasted with the token
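A token pasted into a text file often carries a trailing newline, and an empty file only fails later with a less obvious connection error. The helper below is a minimal sketch of a more defensive read; `read_token` is a hypothetical name used for illustration and is not part of the PIC-SURE libraries.

```python
import os


def read_token(path):
    """Return the token stored in `path`, without surrounding whitespace.

    Hypothetical helper, not part of the PIC-SURE libraries.
    """
    if not os.path.isfile(path):
        raise FileNotFoundError("Token file not found: {0}".format(path))
    with open(path) as f:
        token = f.read().strip()
    if not token:
        # Fail early with a clear message instead of a later HTTP error
        raise ValueError("Token file is empty: {0}".format(path))
    return token
```

With this helper, the cell above could be written as `user_token = read_token(token_path)`.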

In [None]:
PICSURE_network_URL = "https://copdgene-dev.hms.harvard.edu/picsure/"

Next, we will use the PicSureClient library to create the connection to a PIC-SURE network, as well as the PicSureHpdsLib library, which handles data extraction from an HPDS-hosted database.

In [None]:
client = PicSureClient.Client()
connection = client.connect(PICSURE_network_URL, user_token, allowSelfSignedSSL=True)
#connection.list()

### Connecting to the COPDGene resource

In [None]:
COPDGene_resource = "b6ef7b1a-56f6-11e9-8958-0242c0a83007"
adapter = PicSureHpdsLib.Adapter(connection)
resource = adapter.useResource(COPDGene_resource)

We have now created an object called `resource`, obtained from the `PicSureHpdsLib.Adapter` object through its `useResource()` method. It is connected to the specific resource we indicated, namely the COPDGene HPDS-hosted database in our case.

**This `resource` object is actually the only one we will need to proceed with our analysis thereafter**.

NB: As of 11/26/19, user tokens for accessing a PIC-SURE network have a very short validity period (they expire after about 20 minutes without any connection activity). If you get a connection error stating: `ERROR: HTTP response was bad [...] User is not authorized. [Token invalid or expired]`, please get a new token the same way as before, and update your `resource` object by re-executing the cells above. This is a known issue, and token lifetime will soon be extended to a duration suitable for conducting analyses.

#### Getting help with the PIC-SURE python API

Each object exposed by the PicSureHpdsLib library has a `help()` method. Calling it prints a help message about the object.

In [None]:
resource.help()

For instance, this output tells us that this `resource` object has 2 methods, and it gives insights into their function.

## Using the *variables dictionary*

Once connection to the desired resource has been established, we first need to get a grasp of which variables are available in the database. To this end, we will create a `dictionary` using the `resource` object.

The `dictionary` object offers the possibility to retrieve matching records according to a specific term, or to retrieve information about all available variables, using the `find()` method. For instance, looking for variables containing the term `pneumonia`:

In [None]:
dictionary = resource.dictionary()
lookup = dictionary.find("pneumonia")

Subsequently, the object returned by `dictionary.find()` exposes the search results through 4 different methods: `.count()`, `.keys()`, `.entries()`, and `.DataFrame()`.

In [None]:
pprint({"Count": lookup.count(), 
        "Keys": lookup.keys(),
        "Entries": lookup.entries()})

In [None]:
lookup.DataFrame()

**`.DataFrame()` appears as the most useful method for an end-user**. 

* Various criteria exposed in the dictionary (patientCount, variable type ...) can be subsequently used as selection criteria for variable selection.
* Row names of the DataFrame, which are the actual variable names, can be used in the query instead of typing the name of the variable directly in the source code.

Variable names, as currently implemented in the API, aren't very practical to use:
1. They are very long
2. They contain backslashes, which get in the way of copy-pasting.

However, using the dictionary to select variables helps to work around this pitfall. One handy way to proceed is to retrieve the whole dictionary as a pandas DataFrame, as below:

In [None]:
plain_variablesDict = resource.dictionary().find().DataFrame()

Indeed, calling the `find()` function without arguments returns every entry, as stated in the help below.

In [None]:
resource.dictionary().help()

In [None]:
plain_variablesDict

The dictionary currently returned by the API provides various information about the variables, such as:
- observationCount: number of entries with a non-null value
- categorical: type of the variable, True if categorical, False if continuous/numerical
- min/max: only provided for non-categorical variables
- HpdsDataType: 'phenotypes' or 'genotypes'. Currently, the COPDGene instance only contains 'phenotypes' variables
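These columns can be used directly as selection criteria with ordinary pandas boolean masks. The snippet below is a minimal sketch on a toy table: the variable names and values are made up, and only the column names mirror those listed above.

```python
import pandas as pd

# Toy stand-in for the dictionary DataFrame (made-up names and values)
toy_dict = pd.DataFrame(
    {
        "observationCount": [9500, 120, 8700, 4300],
        "categorical": [True, True, False, False],
        "min": [float("nan"), float("nan"), 0.0, 12.5],
        "max": [float("nan"), float("nan"), 100.0, 48.0],
        "HpdsDataType": ["phenotypes"] * 4,
    },
    index=["\\A\\smoker\\", "\\A\\rare flag\\", "\\B\\score\\", "\\B\\bmi\\"],
)

# Keep continuous variables with more than 4000 non-null observations
mask = (~toy_dict["categorical"]) & (toy_dict["observationCount"] > 4000)
selected = toy_dict.loc[mask].index.tolist()
print(selected)
```

The resulting list of row names is the kind of object that can then be passed to a query.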

#### Variable dictionary + pandas multiIndex

Though helpful, the plain dictionary can be made easier to work with: a simple user-defined function (`get_multiIndex_variablesTable`) adds a little more information and eases dealing with variable names. It takes advantage of the pandas MultiIndex functionality ([see the pandas official documentation on this topic](https://pandas.pydata.org/pandas-docs/stable/user_guide/advanced.html)).

Although not an official feature of the API, such functionality to quickly scan and select groups of related variables may be integrated at some point. For now, simply printing the multi-indexed variable dictionary lets us quickly see the tree-like organisation of the variables. Moreover, original and simplified variable names are now stored in the "varName" and "simplified_varName" columns, respectively.

In [None]:
variablesDict = get_multiIndex_variablesTable(plain_variablesDict)

In [None]:
variablesDict
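To give an idea of what such a function may do, here is a minimal sketch that turns backslash-delimited variable names into a pandas MultiIndex. It is an illustration only: the actual `get_multiIndex_variablesTable` in `python_lib` may work differently, `split_var_name` is a hypothetical helper, the first variable name is made up, and taking the last path element as the simplified name is an assumption.

```python
import pandas as pd


def split_var_name(var_name):
    """Split a backslash-delimited name into its non-empty levels (hypothetical helper)."""
    return [part for part in var_name.split("\\") if part]


names = [
    "\\01 Demographics\\Sex\\",  # made-up name, for illustration
    "\\03 Clinical data\\SF-36 form\\SF-36 Body Pain (BP) score\\",
]
levels = [split_var_name(name) for name in names]
depth = max(len(parts) for parts in levels)
# Pad shorter paths so that every variable has the same number of levels
padded = [parts + [""] * (depth - len(parts)) for parts in levels]

toy = pd.DataFrame(
    {
        "varName": names,
        # Assumption: the simplified name is the last element of the path
        "simplified_varName": [parts[-1] for parts in levels],
    },
    index=pd.MultiIndex.from_tuples([tuple(parts) for parts in padded]),
)
print(toy)
```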

In [None]:
# Now that we have seen what our entire dictionary looks like, we limit the number of lines displayed in future outputs
pd.set_option("display.max_rows", 50)

A simple example to illustrate the ease of use of a multi-indexed dictionary in this case:

In [None]:
idx = pd.IndexSlice
medication_history_variables = variablesDict.loc[idx[:,"Medication history"],:]
medication_history_variables

Although pretty simple, this selection can easily be combined with other filters to quickly select the necessary variables.
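As an illustration of such a combination, the sketch below selects one category with `pd.IndexSlice` and then applies a boolean mask on a toy multi-indexed table. All names and values are made up, apart from the two category names and column names taken from this notebook.

```python
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [
        ("Clinical data", "Medication history", "Inhaler use"),
        ("Clinical data", "Medication history", "Oral steroids"),
        ("Clinical data", "Respiratory disease form", "Asthma"),
        ("Imaging", "CT scan", "Emphysema score"),
    ]
)
toy = pd.DataFrame(
    {
        "categorical": [True, True, True, False],
        "observationCount": [5000, 300, 9000, 8000],
    },
    index=index,
).sort_index()  # MultiIndex slicing requires a sorted index

idx = pd.IndexSlice
# Every row whose second index level is "Medication history"
medication = toy.loc[idx[:, "Medication history"], :]
# Combine with a filter on a regular column
well_populated = medication.loc[medication["observationCount"] > 4000]
print(well_populated.index.get_level_values(2).tolist())
```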

In [None]:
query = resource.query()
gender_variables = resource.dictionary().find("Sex").keys()
query.select().add(gender_variables)
query.getCount()

## Querying the COPDGene HPDS database

Besides the dictionary, the second cornerstone of the API is the `query` object (exposed by a resource object).

In [None]:
query = resource.query()

The simplest usage of the query object is passing a variable name through the `select` method.

In [None]:
query.select().add("\\03 Clinical data\\SF-36 form\\SF-36 Body Pain (BP) score\\")
query.getResultsDataFrame()
query.getCount()

#### Selecting variables

The API provides many different methods: `select`, `require`, `anyof`, `filter`, and each of them can be combined with `add` and `delete` to create queries. Moreover, different results can be returned for a single query: `getCount`, `getResults` ... Information about each of these functions can be found using `help()`.

However, **a straightforward workflow is to select the desired variables using the dictionary, enter their names using `query.select().add()`, and then retrieve the data using the `query.getResultsDataFrame()` method**.

Let's say we are interested in the variables pertaining to the 'Respiratory disease form' category, and that we only want the categorical ones, with more than 4000 non-null values. One simple way to proceed is:

In [None]:
query = resource.query()
mask_cat = variablesDict["categorical"] == True
mask_count = variablesDict["observationCount"] > 4000
varnames = variablesDict.loc[idx[:, "Respiratory disease form"],:].loc[mask_cat & mask_count, "varName"]
query.select().add(varnames)
query_result = query.getResultsDataFrame()
query_result.head()

## PheWAS analysis

### Retrieving the relevant data

In a nutshell, a PheWAS analysis consists of two main steps:
- Running univariate tests against every phenotype variable
- Adjusting for multiple testing

In this example, we will select every phenotype variable available in the dictionary, except for the variables pertaining to the "Sub-study ESP LungGO COPDGene" category (a very small population compared to the COPDGene one).

In [None]:
mask_pheno = variablesDict["HpdsDataType"] == "phenotypes"
mask_substudy = variablesDict.index.get_level_values(0) != "Sub-study ESP LungGO COPDGene"
mask_vars = mask_pheno & mask_substudy
selected_vars = variablesDict.loc[mask_vars, "varName"].tolist()

In [None]:
query = resource.query()
query.select().add(selected_vars)
facts = query.getResultsDataFrame(selected_vars).set_index("Patient ID")
facts.head(5)

We check that our query ran as intended by looking at the number of rows and columns.

In [None]:
print("{0} rows, {1} columns".format(*facts.shape))

### Data-management

#### Selecting variables regarding their types

One important step in a PheWAS is to make the distinction between categorical and numerical variables. Again, this distinction is straightforward using the variables dictionary.

In [None]:
mask_categories = variablesDict.loc[mask_vars, "categorical"] == True
categorical_varnames = variablesDict.loc[mask_vars,:].loc[mask_categories, "varName"].tolist()
continuous_varnames = variablesDict.loc[mask_vars,:].loc[~mask_categories, "varName"].tolist()

#### Selecting the variant trait to study
Most PheWAS use a genetic variant to define the case-control subpopulations whose association with other phenotypes is tested. But the population doesn't have to be dichotomized using a genetic variant (see for example [*Neuraz et al.*, 2013](https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1003405)).

Here we will use the presence or absence of a COPD diagnosis as the variable to dichotomize the population in our subsequent analysis.

In [None]:
trait_name = variablesDict.loc[variablesDict["simplified_varName"] == "00 Affection status", "varName"].values[0]
categorical_varnames.remove(trait_name)

Then we select adequate subpopulations regarding the chosen variant values (i.e. keeping "Case" and "Control" individuals, thus discarding "Other", "Control, Exclusionary Disease", and null values).

In [None]:
mask_trait_name = facts[trait_name].isin(["Case", "Control"])
facts = facts.loc[mask_trait_name,:]
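# value_counts() sorts by frequency, so look counts up by label rather than by position
trait_counts = facts[trait_name].value_counts()
print("Control: {0} individuals\nCase: {1} individuals".format(trait_counts["Control"], trait_counts["Case"]))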

Next, we create dummy variables in order to carry out categorical univariate statistical tests, and we store their names in the dictionary alongside the corresponding original variables.

In [None]:
facts_dummies = pd.get_dummies(facts, columns=categorical_varnames, drop_first=True)

In [None]:
matching_dummies_varNames = match_dummies_to_varNames(facts.columns,
                                                      facts_dummies.columns,
                                                      columns=["varName", "dummies_varName"])

In [None]:
variablesDict = joining_variablesTable_onCol(variablesDict,
                                              matching_dummies_varNames,
                                              left_col="varName",
                                              right_col="varName",
                                              overwrite=False)

In [None]:
variablesDict.head()
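The dummy-encoding and name-matching steps above can be sketched on a toy table as follows. This is an illustration under assumptions: the toy columns are made up, and the matching relies on the `<original>_<level>` naming used by `pd.get_dummies`; the actual `match_dummies_to_varNames` helper in `python_lib` may proceed differently.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "smoker": ["yes", "no", "yes"],
        "gold_stage": ["1", "2", "3"],
        "age": [61, 58, 70],
    }
)
categorical = ["smoker", "gold_stage"]
# drop_first=True removes one level per variable, used as the reference level
dummies = pd.get_dummies(toy, columns=categorical, drop_first=True)

# Map each dummy column back to the variable it was derived from
rows = []
for original in toy.columns:
    if original in categorical:
        prefix = original + "_"
        matches = [col for col in dummies.columns if col.startswith(prefix)]
    else:
        matches = [original]  # untouched columns keep their name
    rows.extend((original, match) for match in matches)
mapping = pd.DataFrame(rows, columns=["varName", "dummies_varName"])
print(mapping)
```

Note that prefix matching can be ambiguous when one variable name is a prefix of another, so a real implementation would need to handle that case.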

## Univariate statistical tests

At this point, each variable present in the `facts_dummies` dataset will be tested against the selected trait (presence or absence of COPD).

Two different association tests will be carried out according to the variables' data types:
- Mann-Whitney U test for continuous ones
- Fisher exact test for categorical ones

### Quantitative variables: Mann-Whitney U test

In [None]:
grouped = facts_dummies.groupby(trait_name) 

dic_mannwhitneyu = {}
for var in continuous_varnames: 
    group1, group2 = [group[1].dropna() for group in grouped[var]]
    try:
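        # Two-sided test: the direction of the association is not known in advance
        dic_mannwhitneyu[var] = stats.mannwhitneyu(group1, group2, alternative="two-sided").pvalue
    except ValueError:
        dic_mannwhitneyu[var] = np.nan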
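To see the test in isolation, here is a minimal sketch on synthetic data, assuming a made-up `status` column and two made-up continuous variables. One variable is simulated with a clear shift between groups, the other without any.

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.RandomState(0)
toy = pd.DataFrame(
    {
        "status": ["Case"] * 50 + ["Control"] * 50,
        # Simulated with a different mean in each group
        "fev1": np.concatenate([rng.normal(2.0, 0.5, 50), rng.normal(3.0, 0.5, 50)]),
        # Simulated from the same distribution in both groups
        "height": rng.normal(170, 10, 100),
    }
)

pvalues = {}
for var in ["fev1", "height"]:
    case = toy.loc[toy["status"] == "Case", var].dropna()
    control = toy.loc[toy["status"] == "Control", var].dropna()
    pvalues[var] = stats.mannwhitneyu(case, control, alternative="two-sided").pvalue
print(pvalues)
```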

### Qualitative variables: Fisher Exact test

In [None]:
dummy_categorical_varnames = variablesDict.loc[variablesDict["varName"].isin(categorical_varnames),:]\
["dummies_varName"].values[:500]

In [None]:
# Fisher test for categorical variables
from tqdm import tqdm
dic_fisher = {}
try:
    for var in tqdm(dummy_categorical_varnames, position=0, leave=True):
        if not isinstance(var, str):
            print("skipping {0}".format(var))
            continue
        elif var not in facts_dummies.columns:
            print("skipping {0}, not in dataframe columns".format(var))
            continue        
        crosstab = pd.crosstab(facts_dummies[var], facts_dummies[trait_name])
        # Fisher's exact test needs a 2x2 table; skip variables with a single observed level
        if crosstab.shape != (2, 2):
            dic_fisher[var] = np.nan
        else:
            dic_fisher[var] = stats.fisher_exact(crosstab)[1]
except AttributeError:
    print("End of loop: tqdm AttributeError caught")
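For a single dummy variable, the test boils down to a 2x2 contingency table. The sketch below uses made-up data (a hypothetical `exposed` flag and a `status` column) to show the table and the resulting two-sided p-value.

```python
import pandas as pd
from scipy import stats

toy = pd.DataFrame(
    {
        "status": ["Case"] * 10 + ["Control"] * 10,
        # 8 of 10 cases are exposed, 2 of 10 controls are exposed
        "exposed": [1] * 8 + [0] * 2 + [1] * 2 + [0] * 8,
    }
)
# Rows: exposed (0, 1); columns: status (Case, Control)
table = pd.crosstab(toy["exposed"], toy["status"])
odds_ratio, pvalue = stats.fisher_exact(table)
print(table)
print("p-value: {0:.4f}".format(pvalue))
```

With these counts the two-sided p-value is about 0.023.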

### Results visualization

#### Univariate tests distribution

In [None]:
pd.Series([v for v in dic_mannwhitneyu.values()]).plot.hist(bins=30)
plt.suptitle("Distribution of individual p-values for Mann-Whitney U test",
             weight="bold",
            fontsize=15)

In [None]:
pd.Series([v for v in dic_fisher.values()]).plot.hist(bins=20)
plt.suptitle("Distribution of individual p-values for Fisher association test",
             weight="bold",
             fontsize=15)

#### Multiple hypotheses testing correction: Bonferroni Method

In order to handle the multiple comparison issue (the increase in the probability of "discovering" false statistical associations because of the number of tests performed), we will use the Bonferroni correction method. Although many other multiple comparison correction methods exist, Bonferroni is the most straightforward to use, because it doesn't require assumptions about correlation between variables. Other PheWAS analyses also use False Discovery Rate controlling procedures ([see](https://en.wikipedia.org/wiki/False_discovery_rate)).

In a nutshell, Bonferroni yields a corrected statistical significance threshold according to the number of tests performed. Every p-value below this threshold will be deemed statistically significant.
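As a small worked example with made-up p-values: with 4 tests and a nominal level of 0.05, the corrected threshold is 0.05 / 4 = 0.0125. Equivalently, each p-value can be multiplied by the number of tests (capped at 1) and compared to 0.05.

```python
import numpy as np

raw = np.array([0.0001, 0.004, 0.03, 0.2, np.nan])  # NaN: test could not be run
pvalues = raw[~np.isnan(raw)]   # only tests actually performed count
n_tests = pvalues.size
alpha = 0.05

threshold = alpha / n_tests                      # corrected significance threshold
adjusted = np.minimum(pvalues * n_tests, 1.0)    # Bonferroni-adjusted p-values
significant = pvalues < threshold

print(threshold)
print(adjusted)
print(significant)
```

Here only the first two p-values fall below the corrected threshold.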

In [None]:
# Merging pvalues from different tests
dic_pvalues = {**dic_mannwhitneyu, **dic_fisher}
df_pvalues = pd.DataFrame.from_dict(dic_pvalues, orient="index", columns=["pvalues"])\
.rename_axis("dummies_varName")\
.reset_index(drop=False)

# Adding pvalues results as a new column to variablesDict
variablesDict = joining_variablesTable_onCol(variablesDict,
                                              df_pvalues,
                                              left_col="dummies_varName",
                                              right_col="dummies_varName")

In [None]:
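# Number of tests actually performed (variables with a computed p-value)
n_tests = variablesDict["pvalues"].notna().sum()
adjusted_alpha = 0.05 / n_tests
# Bonferroni-adjusted p-values: multiply by the number of tests and cap at 1
variablesDict["p_adj"] = (variablesDict["pvalues"] * n_tests).clip(upper=1)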

In [None]:
variablesDict["pvalues"]

In [None]:
variablesDict['log_p'] = -np.log10(variablesDict['pvalues'])

In [None]:
pd.set_option('expand_frame_repr', False)

In [None]:
variablesDict = variablesDict.sort_index()
variablesDict["group"] = variablesDict.reset_index(level=1)["level_1"].values

### Manhattan plot

The classical summary visualisation of a PheWAS analysis is the Manhattan plot, which plots each of the tested phenotypes on the X-axis against the -log10 of its p-value on the Y-axis. Usually a horizontal line is drawn to represent the corrected level of significance calculated using an adequate multiple hypothesis correction method (Bonferroni in our case).
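Before plotting, the quantities behind such a figure can be computed on their own. The sketch below uses a toy table with made-up groups and p-values, and derives the x position of each phenotype, the tick position of each group, and the phenotypes above the Bonferroni line.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame(
    {
        "group": ["Clinical", "Clinical", "Clinical", "Imaging", "Imaging"],
        "pvalues": [1e-8, 0.2, 0.04, 1e-3, 0.5],
    }
)
toy["log_p"] = -np.log10(toy["pvalues"])
toy["ind"] = np.arange(1, len(toy) + 1)  # x position of each phenotype

# One tick per group, centred on the group's span of x positions
tick_pos = toy.groupby("group")["ind"].agg(lambda s: (s.iloc[0] + s.iloc[-1]) / 2)

# Bonferroni threshold expressed on the -log10 scale
threshold = -np.log10(0.05 / len(toy))
hits = toy.loc[toy["log_p"] > threshold]

print(tick_pos)
print(hits)
```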

In [None]:
mask = variablesDict["pvalues"].isna()
df_results = variablesDict.loc[~mask,:].copy().replace([np.inf, -np.inf], np.nan)
df_results["ind"] = np.arange(1, len(df_results)+1)
df_grouped = df_results.groupby(('group'))

# print(df_grouped.head(10))

fig = plt.figure()
ax = fig.add_subplot(111)
colors = plt.get_cmap('Set1')
x_labels = []
x_labels_pos = []

y_lims = (0,
          df_results["log_p"].max(skipna=True) + 20)
# Only points above the 6th largest -log10(p) get a text label (i.e. the top 5)
threshold_top_values = df_results["log_p"].sort_values(ascending=False).iloc[:6].iloc[-1]

for num, (name, group) in enumerate(df_grouped):
        group.plot(kind='scatter', x='ind', y='log_p',color=colors.colors[num % len(colors.colors)], ax=ax, s=20)
        x_labels.append(name)
        x_labels_pos.append((group['ind'].iloc[-1] - (group['ind'].iloc[-1] - group['ind'].iloc[0])/2)) # Set label in the middle
        
        pair_ind = 0 # Alternate the label shift to limit overlap between labels that are too close
        for n, row in group.iterrows():
            if pair_ind %2 == 0:
                shift = 1.1
            else:
                shift = -1.1
            if row["log_p"] > threshold_top_values:
                ax.text(row['ind'] + 3, row["log_p"] + 0.05 + shift, row["simplified_varName"], rotation=0, alpha=1, size=8, color="black")
                pair_ind += 1
                
ax.set_xticks(x_labels_pos)
ax.set_xticklabels(x_labels)
ax.set_xlim([0, len(df_results) +1])
ax.set_ylim(y_lims)
ax.set_ylabel('-log10(p-values)', style="italic")
ax.set_xlabel('Phenotypes')
ax.axhline(y=-np.log10(adjusted_alpha), linestyle=":", color="black")
plt.xticks(fontsize = 8,rotation=90)
plt.yticks(fontsize = 8)
plt.title("Statistical association between COPD status and phenotypes", 
          loc="center",
          style="oblique", 
          fontsize = 20,
         y=1)
xticks = ax.xaxis.get_major_ticks()
xticks[0].set_visible(False)

plt.show()

In [None]:
now = datetime.now()
elapsed = now - then
print(elapsed)