# Phenome-Wide analysis on COPDGene data: Python PIC-SURE API use-case

This notebook illustrates how to use the Python **PIC-SURE API** to select and query data from an HPDS-hosted database, using a simple PheWAS analysis as the use-case. It is intentionally straightforward and light on explanation. For a more step-by-step introduction to the Python PIC-SURE API, see the `PICSURE-API_101_PheWAS_example.ipynb` notebook.

**Before running this notebook, please be sure to get a user-specific security token. For more information on how to proceed, see the `HPDS_connection.ipynb` notebook.**

# Environment set-up

### System requirements
- python 3.6 or later
- pip & bash interpreter

### Installation of external dependencies

In [None]:
!pip install -r requirements.txt

In [None]:
import json
from pprint import pprint

import pandas as pd
import numpy as np 
import matplotlib.pyplot as plt
from scipy import stats

import PicSureHpdsLib
import PicSureClient

from python_lib.utils import get_multiIndex_variablesTable, get_dic_renaming_vars, match_dummies_to_varNames, joining_variablesTable_onCol
from python_lib.HPDS_connection_manager import tokenManager

In [None]:
print("NB: This Jupyter Notebook was written using the following PIC-SURE API versions:\n- PicSureHpdsLib: 1.1.0\n- PicSureClient: 0.1.0")
print("The PIC-SURE API library versions you have installed are:\n- PicSureHpdsLib: {0}\n- PicSureClient: {1}".format(PicSureHpdsLib.__version__, PicSureClient.__version__))

In [None]:
# Pandas DataFrame display options (full option name avoids ambiguous pattern matches)
pd.set_option("display.max_rows", 435)

# Matplotlib display parameters
fig_size = plt.rcParams["figure.figsize"]
fig_size[0] = 14
fig_size[1] = 8
plt.rcParams["figure.figsize"] = fig_size
font = {'weight' : 'bold',
        'size'   : 12}
plt.rc('font', **font)

## Connecting to a PIC-SURE network

In [None]:
PICSURE_network_URL = "https://biodatacatalyst.integration.hms.harvard.edu/picsure"
resource_id = "02e23f52-f354-4e8b-992c-d37c8b9ba140"
token_file = "tokens/copd.txt"

In [None]:
token = tokenManager(token_file).get_token()

In [None]:
client = PicSureClient.Client()
connection = client.connect(PICSURE_network_URL, token, allowSelfSignedSSL=True)
adapter = PicSureHpdsLib.Adapter(connection)
resource = adapter.useResource(resource_id)

## PheWAS analysis

In a nutshell, this PheWAS analysis consists of two main steps:
- Running univariate tests against every phenotype variable
- Adjusting for multiple testing

In this example, we will select every phenotype variable available in the dictionary, except for the variables pertaining to the "Sub-study ESP LungGO COPDGene" category (a very small and specific population compared to the COPDGene one).

In more detail, the analysis goes through the following steps:
- Retrieving the variables dictionary, using the PIC-SURE API dedicated methods
- From the info provided by the dictionary, retrieving the data in an exploitable format through PIC-SURE API calls
- Data management
- Running univariate tests against every phenotype variable
- Adjusting for multiple testing
- Plotting the results


This analysis is conducted using COPDGene Study data. The overall goal of the study is to detect the genetic factors underlying the development of Chronic Obstructive Pulmonary Disease (COPD); it enrolled more than 10,000 individuals ([more information on COPDGene Study](http://www.copdgene.org)).

### 1. Retrieving variable dictionary from HPDS Database

Requesting only COPDGene related variables

In [None]:
plain_variablesDict = resource.dictionary().find("Genetic Epidemiology of COPD (COPDGene)").DataFrame()

In [None]:
variablesDict = get_multiIndex_variablesTable(plain_variablesDict)

### 2. Selecting variables and retrieving data from HPDS

In [None]:
mask_pheno = variablesDict.index.get_level_values(1) == 'Subject Phenotype'
mask_status = variablesDict.index.get_level_values(2) == 'Affection status'
mask_vars = mask_pheno | mask_status
selected_vars = variablesDict.loc[mask_vars, "varName"].tolist()
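
The selection above relies on boolean masks built from individual levels of the dictionary's MultiIndex. The following is a minimal sketch of that pattern on a made-up table; the index level names, categories, and variable names are all hypothetical and only mimic the shape of `variablesDict`.

```python
import pandas as pd

# Toy stand-in for variablesDict: a 3-level index (study, category, sub-category).
# Every name and value below is invented for illustration.
toy_index = pd.MultiIndex.from_tuples(
    [
        ("COPDGene", "Subject Phenotype", "Demographics"),
        ("COPDGene", "Subject Phenotype", "Spirometry"),
        ("COPDGene", "Subject ID", "Affection status"),
        ("COPDGene", "Sample ID", "Body site"),
    ],
    names=["level_0", "level_1", "level_2"],
)
toy_dict = pd.DataFrame(
    {"varName": ["age", "fev1", "status", "body_site"]},
    index=toy_index,
)

# One boolean mask per index level, combined with a logical OR.
toy_mask_pheno = toy_dict.index.get_level_values(1) == "Subject Phenotype"
toy_mask_status = toy_dict.index.get_level_values(2) == "Affection status"
toy_selected = toy_dict.loc[toy_mask_pheno | toy_mask_status, "varName"].tolist()
print(toy_selected)
```

Rows matching either mask are kept, so the "Body site" row is the only one left out.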

In [None]:
print(selected_vars[1])

In [None]:
query = resource.query()
# Add the whole list of selected variables: the later steps need all of them as columns
query.select().add(selected_vars)
facts = query.getResultsDataFrame().set_index("Patient ID")

We check that the query ran as intended by looking at the number of rows and columns.

In [None]:
facts.head(5)

In [None]:
print("{0} rows, {1} columns".format(*facts.shape))

### 3. Data-management

#### Selecting variables regarding their types

One important step in a PheWAS is to distinguish between categorical and numerical variables. This distinction is straightforward using the variables dictionary.

In [None]:
mask_categories = variablesDict.loc[mask_vars, "categorical"] == True
categorical_varnames = variablesDict.loc[mask_vars,:].loc[mask_categories, "varName"].tolist()
continuous_varnames = variablesDict.loc[mask_vars,:].loc[~mask_categories, "varName"].tolist()

#### Selecting the dependent variable to study
Most PheWAS use a genetic variant to separate the population into cases and controls. However, the population doesn't have to be dichotomized using a genetic variant, and any phenotypic variable could be used to run a PheWAS analysis (see for example [*Neuraz et al.*, 2013](https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1003405)).

Here we will use the **COPD status** as the case-control variable to dichotomize the population in our analysis (i.e. the dependent variable against which the univariate association tests will be run).

In [None]:
dependent_var_name = variablesDict.loc[variablesDict["simplified_varName"] == "00 Affection status", "varName"].values[0]
categorical_varnames.remove(dependent_var_name)

Then we subset our population regarding the relevant values for the COPD diagnosis variable (i.e. keeping "Case" and "Control" individuals, thus discarding "Other", "Control, Exclusionary Disease", and null values).

In [None]:
mask_dependent_var_name = facts[dependent_var_name].isin(["Case", "Control"])
facts = facts.loc[mask_dependent_var_name,:]
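# value_counts() sorts by frequency, so look counts up by label rather than by position
status_counts = facts[dependent_var_name].value_counts()
print("Control: {0} individuals\nCase: {1} individuals".format(status_counts["Control"], status_counts["Case"]))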
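
As a small illustration of the subsetting step, the sketch below applies the same `isin` filter to a made-up status column (the column name `status` and the patient IDs are hypothetical). Counts are read by label, since `value_counts()` orders its result by frequency rather than by category name.

```python
import pandas as pd

# Invented status values, including the categories that get discarded.
toy_status = pd.DataFrame(
    {"status": ["Case", "Control", "Other", None, "Case", "Control, Exclusionary Disease"]},
    index=pd.Index([1, 2, 3, 4, 5, 6], name="Patient ID"),
)

# Keep only individuals labelled exactly "Case" or "Control"; nulls are dropped too.
toy_keep = toy_status["status"].isin(["Case", "Control"])
toy_subset = toy_status.loc[toy_keep, :]

toy_counts = toy_subset["status"].value_counts()
print("Control: {0}\nCase: {1}".format(toy_counts["Control"], toy_counts["Case"]))
```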

Next we create dummy variables so that univariate statistical tests can be carried out on categorical variables, and we store their names in the dictionary alongside the corresponding original variables.

In [None]:
facts_dummies = pd.get_dummies(facts, columns=categorical_varnames, drop_first=True)
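
To make the effect of `drop_first=True` concrete, here is a minimal sketch on a made-up table (the column names `age` and `smoking` are hypothetical). A categorical column with three levels becomes two binary columns, the first level in sorted order being the implicit reference.

```python
import pandas as pd

# Invented data: one continuous and one three-level categorical variable.
toy_facts = pd.DataFrame({
    "age": [61, 70, 55],
    "smoking": ["Current", "Former", "Never"],
})

# "Current" (first level alphabetically) is dropped and serves as the reference.
toy_dummies = pd.get_dummies(toy_facts, columns=["smoking"], drop_first=True)
print(toy_dummies.columns.tolist())
# → ['age', 'smoking_Former', 'smoking_Never']
```

Each dummy column name is the original column name followed by `_` and the level.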

In [None]:
matching_dummies_varNames = match_dummies_to_varNames(facts.columns,
                                                      facts_dummies.columns,
                                                      columns=["varName", "dummies_varName"])

In [None]:
variablesDict = joining_variablesTable_onCol(variablesDict,
                                              matching_dummies_varNames,
                                              left_col="varName",
                                              right_col="varName",
                                              overwrite=False)

In [None]:
variablesDict.head()

### 4. Univariate statistical tests

At this point, each variable present in the `facts_dummies` dataset will be tested against the selected dependent variable (i.e. presence or absence of COPD).

Two different association tests will be carried out, depending on the variable's data type:
- Mann-Whitney U test for continuous ones
- Fisher exact test for categorical ones
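
The two tests listed above can be sketched on a small made-up dataset as follows. The column names (`status`, `fev1`, `smoker_Yes`) and all values are invented for illustration. The `alternative="two-sided"` argument is spelled out as a precaution, since the default has differed across SciPy releases.

```python
import pandas as pd
from scipy import stats

# Invented data: a continuous measurement and a binary dummy for two groups.
toy = pd.DataFrame({
    "status": ["Case"] * 5 + ["Control"] * 5,
    "fev1": [1.2, 1.5, 1.1, 1.8, 1.4, 2.9, 3.1, 2.7, 3.4, 3.0],
    "smoker_Yes": [1, 1, 1, 1, 0, 0, 0, 1, 0, 0],
})

# Continuous variable: Mann-Whitney U test between cases and controls.
toy_cases = toy.loc[toy["status"] == "Case", "fev1"]
toy_controls = toy.loc[toy["status"] == "Control", "fev1"]
p_continuous = stats.mannwhitneyu(toy_cases, toy_controls, alternative="two-sided").pvalue

# Binary variable: Fisher exact test on the 2x2 contingency table.
toy_table = pd.crosstab(toy["smoker_Yes"], toy["status"])
odds_ratio, p_categorical = stats.fisher_exact(toy_table)

print("Mann-Whitney U p-value: {0:.4f}".format(p_continuous))
print("Fisher exact p-value: {0:.4f}".format(p_categorical))
```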

#### Numerical variables: Mann-Whitney U test

In [None]:
grouped = facts_dummies.groupby(dependent_var_name) 

dic_mannwhitneyu = {}
for var in continuous_varnames: 
    group1, group2 = [group[1].dropna() for group in grouped[var]]
    try:
        dic_mannwhitneyu[var] = stats.mannwhitneyu(group1, group2).pvalue
    except ValueError:
        dic_mannwhitneyu[var] = np.nan

#### Categorical variables: Fisher exact test

In [None]:
dummy_categorical_varnames = variablesDict.loc[variablesDict["varName"].isin(categorical_varnames),:]\
["dummies_varName"].values[:500]

In [None]:
# Fisher test for categorical variables
from tqdm import tqdm
dic_fisher = {}
try:
    for var in tqdm(dummy_categorical_varnames, position=0, leave=True):
        # Non-string entries (e.g. NaN) are variables without a matching dummy column
        if not isinstance(var, str):
            print("skipping {0}".format(var))
            continue
        elif var not in facts_dummies.columns:
            print("skipping {0}, not in dataframe columns".format(var))
            continue
        crosstab = pd.crosstab(facts_dummies[var], facts_dummies[dependent_var_name])
        # Fisher exact test needs a 2x2 table; anything else (e.g. a constant dummy) gets NaN
        if crosstab.shape != (2, 2):
            dic_fisher[var] = np.nan
        else:
            dic_fisher[var] = stats.fisher_exact(crosstab)[1]
except AttributeError:
    print("End of loop: tqdm AttributeError caught")

#### Univariate test p-values distribution

In [None]:
pd.Series([v for v in dic_mannwhitneyu.values()]).plot.hist(bins=30)
plt.suptitle("Distribution of individual p-values for Mann-Whitney U test",
             weight="bold",
            fontsize=15)

In [None]:
pd.Series([v for v in dic_fisher.values()]).plot.hist(bins=20)
# `size` and `fontsize` are aliases of the same property, so only one is passed
plt.suptitle("Distribution of individual p-values for Fisher association test",
             weight="bold",
             fontsize=15)

### 5. Multiple hypotheses testing correction: Bonferroni Method

To handle the multiple comparison issue (the increased probability of "discovering" false statistical associations because of the number of tests performed), we will use the Bonferroni correction method. Although many other multiple comparison procedures exist, Bonferroni is the most straightforward to use, because it doesn't require assumptions about correlation between variables. Other PheWAS analyses also use False Discovery Rate controlling procedures ([see the Wikipedia article on False Discovery Rate](https://en.wikipedia.org/wiki/False_discovery_rate)).

In a nutshell, Bonferroni computes a corrected statistical significance threshold according to the number of tests performed. Every p-value below this threshold is deemed statistically significant.
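
As a worked illustration of the correction, the sketch below applies it to four made-up p-values (the variable names and values are hypothetical). The threshold is the nominal alpha divided by the number of tests; equivalently, each p-value can be multiplied by the number of tests, capped at 1, and compared to the nominal alpha.

```python
# Invented p-values standing in for the output of the univariate tests.
toy_pvalues = {"var_a": 0.00001, "var_b": 0.004, "var_c": 0.03, "var_d": 0.6}

alpha = 0.05
n_tests = len(toy_pvalues)

# Corrected significance threshold: 0.05 / 4 = 0.0125
toy_adjusted_alpha = alpha / n_tests

# Equivalent view: adjusted p-values, capped at 1, compared to the nominal alpha.
toy_adjusted_p = {name: min(p * n_tests, 1.0) for name, p in toy_pvalues.items()}

toy_significant = [name for name, p in toy_pvalues.items() if p < toy_adjusted_alpha]
print(toy_adjusted_alpha)
print(toy_significant)
```

Here `var_c` would pass an uncorrected 0.05 threshold but is no longer significant after correction.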

In [None]:
# Merging pvalues from different tests
dic_pvalues = {**dic_mannwhitneyu, **dic_fisher}
df_pvalues = pd.DataFrame.from_dict(dic_pvalues, orient="index", columns=["pvalues"])\
.rename_axis("dummies_varName")\
.reset_index(drop=False)

# Adding pvalues results as a new column to variablesDict
variablesDict = joining_variablesTable_onCol(variablesDict,
                                              df_pvalues,
                                              left_col="dummies_varName",
                                              right_col="dummies_varName")

In [None]:
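# Bonferroni: divide alpha by the number of tests actually performed (non-null p-values)
n_tests = variablesDict["pvalues"].notna().sum()
adjusted_alpha = 0.05 / n_tests
# Equivalent adjusted p-values: multiply by the number of tests, capped at 1
variablesDict["p_adj"] = (variablesDict["pvalues"] * n_tests).clip(upper=1)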

In [None]:
variablesDict['log_p'] = -np.log10(variablesDict['pvalues'])

In [None]:
pd.set_option('expand_frame_repr', False)

In [None]:
variablesDict = variablesDict.sort_index()
variablesDict["group"] = variablesDict.reset_index(level=1)["level_1"].values

### 6. Results visualization: Manhattan plot

The Manhattan plot is the classical representation of PheWAS results. It plots each tested phenotypic variable on the X-axis against its *-log10(p-value)* on the Y-axis. The horizontal line represents the adjusted significance threshold.
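
Before plotting, it may help to see the quantities the plot displays. The sketch below computes them for four made-up p-values (all values are hypothetical): each point's position on the Y-axis, and the height of the horizontal threshold line.

```python
import math

# Invented p-values, one per tested variable.
toy_pvalues = [0.5, 0.01, 1e-8, 0.2]
toy_alpha = 0.05 / len(toy_pvalues)  # Bonferroni-adjusted threshold

# (x, y) coordinates: variable rank on X, -log10(p-value) on Y.
toy_points = [(i, -math.log10(p)) for i, p in enumerate(toy_pvalues, start=1)]

# Height of the horizontal significance line.
toy_threshold = -math.log10(toy_alpha)

# Points above the line are the statistically significant associations.
toy_above = [i for i, y in toy_points if y > toy_threshold]
print(toy_threshold)
print(toy_above)
```

Because of the negative logarithm, smaller p-values sit higher on the plot, so significant associations appear above the line.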

In [None]:
mask = variablesDict["pvalues"].isna()
df_results = variablesDict.loc[~mask,:].copy().replace([np.inf, -np.inf], np.nan)
df_results["ind"] = np.arange(1, len(df_results)+1)
df_grouped = df_results.groupby(('group'))

# print(df_grouped.head(10))

fig = plt.figure()
ax = fig.add_subplot(111)
colors = plt.get_cmap('Set1')
x_labels = []
x_labels_pos = []

y_lims = (0,
          df_results["log_p"].max(skipna=True) + 20)
# Sixth largest -log10(p): only points strictly above it get a text label
threshold_top_values = df_results["log_p"].sort_values(ascending=False).iloc[:6].iloc[-1]

for num, (name, group) in enumerate(df_grouped):
        group.plot(kind='scatter', x='ind', y='log_p',color=colors.colors[num % len(colors.colors)], ax=ax, s=20)
        x_labels.append(name)
        x_labels_pos.append((group['ind'].iloc[-1] - (group['ind'].iloc[-1] - group['ind'].iloc[0])/2)) # Set label in the middle
        
        pair_ind = 0 # Alternate the vertical shift of labels that might overlap because they are too close
        for n, row in group.iterrows():
            if pair_ind %2 == 0:
                shift = 1.1
            else:
                shift = -1.1
            if row["log_p"] > threshold_top_values:
                ax.text(row['ind'] + 3, row["log_p"] + 0.05 + shift, row["simplified_varName"], rotation=0, alpha=1, size=8, color="black")
                pair_ind += 1
                
ax.set_xticks(x_labels_pos)
ax.set_xticklabels(x_labels)
ax.set_xlim([0, len(df_results) +1])
ax.set_ylim(y_lims)
ax.set_ylabel('-log10(p-values)', style="italic")
ax.set_xlabel('Phenotypes')
ax.axhline(y=-np.log10(adjusted_alpha), linestyle=":", color="black")
plt.xticks(fontsize = 8,rotation=90)
plt.yticks(fontsize = 8)
plt.title("Statistical association between COPD status and phenotypes",
          loc="center",
          style="oblique", 
          fontsize = 20,
         y=1)
xticks = ax.xaxis.get_major_ticks()
xticks[0].set_visible(False)

plt.show()

Overall, it appears that most of the tested phenotype covariates are above the adjusted threshold of significant association. This is not surprising, given the nature of our dependent variable: many of those variables are by nature directly tied to COPD status.

This code can be reused directly with any other variable present in the variable dictionary: only the `dependent_var_name` value needs to be changed.