# Phenome-Wide analysis on TOPMed studies

This notebook illustrates how to use the Python **PIC-SURE API** to select and query data from an HPDS-hosted database, taking a simple PheWAS analysis as its use case. It is intentionally straightforward, and the explanations provided are only meant to guide the reader through the PheWAS analysis pipeline. For a more step-by-step introduction to the Python PIC-SURE API, see the `python_PICSURE-API_101_PheWAS_example.ipynb` notebook.

**Before running this notebook, please be sure to get a user-specific security token. For more information on how to proceed, see the `HPDS_connection.ipynb` notebook.**

# Environment set-up

### System requirements
- Python 3.6 or later
- pip & bash interpreter

### Installation of external dependencies

In [None]:
import sys
!{sys.executable} -m pip install -r requirements.txt

In [None]:
import json
from pprint import pprint

import pandas as pd
import numpy as np 
import matplotlib.pyplot as plt
from scipy import stats

import PicSureHpdsLib
import PicSureClient

from python_lib.utils import get_multiIndex_variablesDict, get_dic_renaming_vars, match_dummies_to_varNames, joining_variablesDict_onCol
from python_lib.HPDS_connection_manager import tokenManager

In [None]:
print("NB: This Jupyter Notebook was written using the following PIC-SURE API versions:\n- PicSureHpdsLib: 1.1.0\n- PicSureClient: 0.1.0")
print("Installed PIC-SURE API library versions:\n- PicSureHpdsLib: {0}\n- PicSureClient: {1}".format(PicSureHpdsLib.__version__, PicSureClient.__version__))

In [None]:
# Pandas DataFrame display options
pd.set_option("display.max_rows", 435)

# Matplotlib display parameters
fig_size = plt.rcParams["figure.figsize"]
fig_size[0] = 14
fig_size[1] = 8
plt.rcParams["figure.figsize"] = fig_size
font = {'weight' : 'bold',
        'size'   : 12}
plt.rc('font', **font)

## Connecting to a PIC-SURE network

In [None]:
PICSURE_network_URL = "https://biodatacatalyst.integration.hms.harvard.edu/picsure"
resource_id = "02e23f52-f354-4e8b-992c-d37c8b9ba140"
token_file = "token.txt"

In [None]:
token = tokenManager(token_file).get_token()

In [None]:
client = PicSureClient.Client()
connection = client.connect(PICSURE_network_URL, token, allowSelfSignedSSL=True)
adapter = PicSureHpdsLib.Adapter(connection)
resource = adapter.useResource(resource_id)

## PheWAS analysis

In a nutshell, this PheWAS analysis follows these steps:
- Retrieving the variables dictionary, using the dedicated PIC-SURE API methods
- Using the information provided by the dictionary to retrieve the desired variables and individuals in a usable format through PIC-SURE API calls
- Data management
- Running univariate tests against every phenotypic variable
- Accounting for the multiple hypothesis testing issue
- Plotting the results


This analysis is conducted using individuals enrolled in the COPDGene Study. The overall goal of this cohort is to detect the underlying genetic factors involved in developing Chronic Obstructive Pulmonary Disease (COPD). It currently includes more than 10,000 individuals ([more information on COPDGene Study](http://www.copdgene.org)).

### 1. Retrieving variables dictionary from HPDS Database

Retrieving variables dictionary only for the COPDGene study.

In [None]:
plain_variablesDict = resource.dictionary().find("Genetic Epidemiology of COPD (COPDGene)").DataFrame()

In [None]:
variablesDict = get_multiIndex_variablesDict(plain_variablesDict)
variablesDict.iloc[10:20,:]

### 2. Selecting variables and retrieving data from the database

Subsetting to keep only the phenotypic variables plus the "affection status", which will be used as the dependent variable for this illustrative use case.

In [None]:
mask_pheno = variablesDict.index.get_level_values(1) == 'Subject Phenotype'
mask_status = variablesDict.index.get_level_values(2) == 'Affection status'
mask_vars = mask_pheno | mask_status
variablesDict = variablesDict.loc[mask_vars,:]
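
The two masks above select rows by looking at individual levels of the dictionary's MultiIndex. The toy example below is only a sketch of that mechanism: the index tuples and `varName` values are made up for illustration, and it simply assumes, as the cell above does, that level 1 holds the table name and level 2 the variable name.

```python
import pandas as pd

# Hypothetical dictionary: the (study, table, variable) tuples are illustrative only
toy_index = pd.MultiIndex.from_tuples([
    ("Study A", "Subject Phenotype", "Age"),
    ("Study A", "Subject Phenotype", "Height"),
    ("Study A", "Subject", "Affection status"),
    ("Study A", "Sample", "Body site"),
])
toy_dict = pd.DataFrame({"varName": ["v_age", "v_height", "v_status", "v_site"]},
                        index=toy_index)

# Boolean masks built from single index levels, combined with a logical OR
toy_mask_pheno = toy_dict.index.get_level_values(1) == "Subject Phenotype"
toy_mask_status = toy_dict.index.get_level_values(2) == "Affection status"
toy_selected = toy_dict.loc[toy_mask_pheno | toy_mask_status, "varName"].tolist()
print(toy_selected)
```

Only the rows matching at least one of the two conditions are kept, which is how the phenotypic variables and the affection status end up in the same selection.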

In [None]:
selected_vars = variablesDict.loc[:, "varName"].tolist()

In [None]:
pprint(selected_vars[0:5])

Retrieving the data:

In [None]:
query = resource.query()
query.select().add(selected_vars)
facts = query.getResultsDataFrame()

In [None]:
status_var = variablesDict.loc[variablesDict.index.get_level_values(2) == 'Affection status', "varName"]
facts = facts.dropna(subset=status_var)\
.set_index("Patient ID")
mask_to_drop = variablesDict["simplified_varName"]\
.isin(["Dbgap_id", "De-identified site code", "A1AD: phenotype/genotype"])
variablesDict = variablesDict.loc[~mask_to_drop, :]
var_to_keep = variablesDict.loc[:, "varName"]
facts = facts.loc[:, var_to_keep]

In [None]:
print("{0} rows, {1} columns".format(*facts.shape))

Here is a sample of our dataset, with one row per patient and one column per variable:

In [None]:
facts.tail(5)

### 3. Data-management

#### Selecting variables regarding their types

One important step in a PheWAS is to distinguish between categorical and numerical variables. This distinction is straightforward using the variables dictionary.

In [None]:
mask_categories = variablesDict.loc[:, "categorical"] == True
categorical_varnames = variablesDict.loc[mask_categories, "varName"].tolist()
continuous_varnames = variablesDict.loc[~mask_categories, "varName"].tolist()
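
The distinction matters because a categorical variable cannot enter a regression model as raw labels. The sketch below, on a made-up two-column table (the column names are hypothetical), shows one common encoding: one indicator column per level, with the last level dropped so that it acts as the reference category.

```python
import pandas as pd

# Made-up data: one categorical and one continuous variable
toy = pd.DataFrame({
    "smoking_status": ["never", "former", "current", "never"],
    "fev1": [3.1, 2.4, 1.9, 3.3],
})

# One-hot encode the categorical column, then drop the last indicator column
# so that the remaining ones are not collinear with an intercept term
encoded = pd.get_dummies(toy, columns=["smoking_status"], dtype=float).iloc[:, :-1]
print(encoded.columns.tolist())
```

The continuous column is left untouched, while the three-level categorical column becomes two indicator columns.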

#### Selecting the dependent variable to study
Most PheWAS use a genetic variant to separate the population into cases and controls. However, the population does not have to be dichotomized using a genetic variant, and any phenotypic variable can be used to run a PheWAS analysis (see for example [*Neuraz et al.*, 2013](https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1003405)).

Here we will use the **COPD status** as the case-control variable to dichotomize the population in our analysis, and keep only the population subset containing relevant values for the COPD status (i.e. keeping "Case" and "Control" individuals, thus discarding "Other", "Control, Exclusionary Disease", and null values).

In [None]:
dependent_var_name = variablesDict.loc[variablesDict["simplified_varName"] == "Affection status", "varName"].values[0]
categorical_varnames.remove(dependent_var_name)

In [None]:
mask_dependent_var_name = facts[dependent_var_name].isin(["Case", "Control"])
facts = facts.loc[mask_dependent_var_name,:]\
             .astype({dependent_var_name: "category"})
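# value_counts() sorts by frequency, so look the counts up by label
status_counts = facts[dependent_var_name].value_counts()
print("Control: {0} individuals\nCase: {1} individuals".format(status_counts["Control"], status_counts["Case"]))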

### 4. Univariate statistical tests

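At this point, each variable present in the `facts` dataset is tested against the selected dependent variable (i.e. presence or absence of COPD).

For each variable, a logistic regression of the COPD status on that variable is fitted, and the p-value of the likelihood-ratio test (LLR) against the intercept-only model is recorded. The encoding depends on the variable data type:
- continuous variables enter the model as they are
- categorical variables are first converted to dummy (indicator) variables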
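
The loop in this section records the `llr_pvalue` of a logistic regression fitted with statsmodels. As a rough illustration of what that quantity is, the sketch below fits a one-predictor logistic regression by hand on synthetic data and compares it to the intercept-only model. The helper names `logit_loglik` and `llr_pvalue_one_predictor` are hypothetical and are not part of statsmodels. The hand-written Newton-Raphson fit stands in for `Logit.fit`, and the p-value formula is only valid for a single added predictor (one degree of freedom).

```python
import math
import numpy as np

def logit_loglik(X, y, n_iter=50):
    """Fit a logistic regression by Newton-Raphson and return its log-likelihood."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        gradient = X.T @ (y - p)
        hessian = X.T @ (X * (p * (1.0 - p))[:, None])
        beta = beta + np.linalg.solve(hessian, gradient)
    p = 1.0 / (1.0 + np.exp(-X @ beta))
    return float(np.sum(y * np.log(p) + (1.0 - y) * np.log(1.0 - p)))

def llr_pvalue_one_predictor(x, y):
    """Likelihood-ratio test of 'predictor + intercept' against 'intercept only'."""
    X_full = np.column_stack([x, np.ones_like(x)])
    X_null = np.ones((len(x), 1))
    llr = max(2.0 * (logit_loglik(X_full, y) - logit_loglik(X_null, y)), 0.0)
    # Survival function of a chi-squared distribution with 1 degree of freedom
    return math.erfc(math.sqrt(llr / 2.0))

# Synthetic data: y depends on x_signal but not on x_noise
rng = np.random.default_rng(0)
x_signal = rng.normal(size=500)
x_noise = rng.normal(size=500)
y = (rng.random(500) < 1.0 / (1.0 + np.exp(-1.5 * x_signal))).astype(float)

p_signal = llr_pvalue_one_predictor(x_signal, y)
p_noise = llr_pvalue_one_predictor(x_noise, y)
print("signal: {0:.2e}, noise: {1:.2e}".format(p_signal, p_noise))
```

A predictor that truly drives the outcome yields a very small p-value, while an unrelated predictor does not systematically do so.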

#### LLR result

In [None]:
from statsmodels.discrete.discrete_model import Logit

In [None]:
independent_varNames = variablesDict["varName"].tolist()
independent_varNames.remove(dependent_var_name)
dependent_var = facts[dependent_var_name].astype("category").cat.codes
dic_pvalues = {}
simple_index_variablesDict = variablesDict.set_index("varName", drop=True)

In [None]:
from scipy.linalg import LinAlgError
from statsmodels.tools.sm_exceptions import PerfectSeparationError
from tqdm import tqdm

In [None]:
for independent_varName in tqdm(independent_varNames, position=0, leave=True):
    # Keep only individuals with both the dependent and the independent variable
    matrix = facts.loc[:, [dependent_var_name, independent_varName]]\
                  .dropna(how="any")
    if matrix.shape[0] == 0:
        dic_pvalues[independent_varName] = np.nan
        continue
    if simple_index_variablesDict.loc[independent_varName, "categorical"]:
        # Dummy-encode categorical variables; the last level is dropped as the reference
        matrix = pd.get_dummies(matrix,
                                columns=[independent_varName],
                                drop_first=False,
                                dtype=float)\
                    .iloc[:, 0:-1]
    dependent_var = matrix[dependent_var_name].cat.codes
    independent_var = matrix.drop(dependent_var_name, axis=1)\
                            .assign(intercept = 1)
    model = Logit(dependent_var, independent_var)
    try:
        results = model.fit(disp=0)
        dic_pvalues[independent_varName] = results.llr_pvalue
    except (LinAlgError, PerfectSeparationError):
        # The model could not be fitted: no p-value for this variable
        dic_pvalues[independent_varName] = np.nan

#### p-values distribution (univariate tests)

In [None]:
pd.Series([v for v in dic_pvalues.values()]).plot.hist(bins=30)
plt.suptitle("Distribution of individual p-values",
             weight="bold",
            fontsize=15)

### 5. Multiple hypotheses testing correction: Bonferroni Method

To handle the multiple testing problem (the increased probability of "discovering" false statistical associations), we use the Bonferroni correction method. Although many other multiple-comparison procedures exist, Bonferroni is the most straightforward to use because it does not require any assumption about the correlation between variables. Other PheWAS analyses also use False Discovery Rate controlling procedures ([see the Wikipedia article](https://en.wikipedia.org/wiki/False_discovery_rate)).

In a nutshell, Bonferroni computes a corrected statistical significance threshold according to the number of tests performed. Every p-value below this threshold is deemed statistically significant.
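
As a small worked example of the correction, the sketch below applies it to four made-up p-values. The function name `bonferroni` and the toy values are illustrative assumptions, not part of the notebook's helper library.

```python
def bonferroni(pvalues, alpha=0.05):
    """Return the corrected threshold, the adjusted p-values and the significance flags."""
    n_tests = len(pvalues)
    threshold = alpha / n_tests                          # corrected significance threshold
    adjusted = [min(p * n_tests, 1.0) for p in pvalues]  # adjusted p-values, capped at 1
    significant = [p < threshold for p in pvalues]
    return threshold, adjusted, significant

# Made-up p-values from four hypothetical tests
toy_pvalues = [0.0001, 0.004, 0.03, 0.5]
threshold, adjusted, significant = bonferroni(toy_pvalues)
print(threshold)
print(significant)
```

Comparing a raw p-value to `alpha / n_tests` is equivalent to comparing the adjusted p-value `p * n_tests` to `alpha`. With four tests the threshold drops from 0.05 to 0.0125, so the third p-value (0.03) is no longer considered significant.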

In [None]:
%%capture
# Merging pvalues from different tests
df_pvalues = pd.DataFrame.from_dict(dic_pvalues, orient="index", columns=["pvalues"])\
.rename_axis("varName")\
.reset_index(drop=False)

# Adding pvalues results as a new column to variablesDict
variablesDict = joining_variablesDict_onCol(variablesDict,
                                              df_pvalues,
                                              left_col="varName",
                                              right_col="varName")

adjusted_alpha = 0.05/len(variablesDict["pvalues"])
# Bonferroni-adjusted p-values: multiply by the number of tests and cap at 1
variablesDict["p_adj"] = (variablesDict["pvalues"] * len(variablesDict["pvalues"])).clip(upper=1)
variablesDict['log_p'] = -np.log10(variablesDict['pvalues'])
variablesDict = variablesDict.sort_index()
variablesDict["group"] = variablesDict.reset_index(level=2)["level_2"].values

In [None]:
print("Bonferroni adjusted significance threshold: {0:.2E}".format(adjusted_alpha))

### 6. Result visualisation: Manhattan plot

The Manhattan plot is the classical way to display the results of a PheWAS analysis. It plots every tested phenotypic variable on the X-axis against its *-log10(p-value)* on the Y-axis. The horizontal line represents the adjusted significance threshold.
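
Before any drawing happens, a Manhattan plot boils down to three computations: an x position per variable (variables of the same group being contiguous), a y value equal to -log10 of the p-value, and the height of the threshold line. The sketch below runs them on made-up results. The group names, variable names and p-values are purely illustrative.

```python
import math

# Hypothetical results: (group, variable, p-value)
toy_results = [
    ("Spirometry", "FEV1", 1e-40),
    ("Spirometry", "FVC", 1e-12),
    ("Demographics", "Age", 2e-3),
    ("Medical History", "Asthma", 0.4),
    ("Medical History", "Diabetes", 0.04),
]
alpha = 0.05
threshold_y = -math.log10(alpha / len(toy_results))  # height of the horizontal line

# Sort by group so that each group occupies a contiguous range of x positions
points = []
for x, (group, name, p) in enumerate(sorted(toy_results, key=lambda r: r[0]), start=1):
    points.append({"group": group, "name": name, "x": x, "y": -math.log10(p)})

# One tick label per group, placed in the middle of the group's x range
tick_positions = {}
for group in dict.fromkeys(pt["group"] for pt in points):
    xs = [pt["x"] for pt in points if pt["group"] == group]
    tick_positions[group] = (xs[0] + xs[-1]) / 2

above_line = [pt["name"] for pt in points if pt["y"] > threshold_y]
print(tick_positions)
print(above_line)
```

The points whose y value exceeds `threshold_y` are the associations that remain significant after correction.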

In [None]:
mask = variablesDict["pvalues"].isna()
df_results = variablesDict.loc[~mask,:].copy().replace([np.inf, -np.inf], np.nan)
df_results = df_results.loc[~df_results["log_p"].isna().values,:]

# Adjustments specific to this plot, to make it look nicer
# (to adapt when changing the data or the dependent variable)
df_results = df_results.replace({"TLC": "Spirometry",
                                 "New Gold Classification": "Quantitative Analysis",
                                 "Other": "Demographics"})
group_order={'6MinWalk': 0,
 'CT Acquisition Parameters': 1,
 'CT Assessment Scoresheet': 2,
 'Demographics and Physical Characteristics': 3,
 'Eligibility Form': 10,
 'Longitudinal Analysis': 5,
 'Medical History': 4,
 'Medication History': 13,
 'Quantitative Analysis': 9,
 'Respiratory Disease': 6,
 'SF-36 Health Survey': 11,
 'Sociodemography and Administration': 12,
 'Spirometry': 7,
 'VIDA': 15}
df_results["group_order"] = df_results["group"].replace(group_order)
df_results = df_results.sort_values("group_order", ascending=True)
# regex=True is required for the pattern to be treated as a regular expression in recent pandas
df_results["simplified_varName"] = df_results["simplified_varName"].str.replace("[0-9]+[A-z]*", "", regex=True)
###


fig = plt.figure()
ax = fig.add_subplot(111)
colors = plt.get_cmap('Set1')
x_labels = []
x_labels_pos = []

y_lims = (0, df_results["log_p"].max(skipna=True) + 50)
# Sixth-largest -log10(p-value): only points strictly above it get a text label
threshold_top_values = df_results["log_p"].sort_values(ascending=False).iloc[0:6].iloc[-1]

df_results["ind"] = np.arange(1, len(df_results)+1)
df_grouped = df_results.groupby(('group'))
for num, (name, group) in enumerate(df_grouped):
    group.plot(kind='scatter', x='ind', y='log_p',color=colors.colors[num % len(colors.colors)], ax=ax, s=20)
    x_labels.append(name)
    x_labels_pos.append((group['ind'].iloc[-1] - (group['ind'].iloc[-1] - group['ind'].iloc[0])/2)) # Set label in the middle
    for n, row in group.iterrows():
        if row["log_p"] > threshold_top_values:
            ax.text(row['ind'] + 3, row["log_p"] + 0.05, row["simplified_varName"], rotation=0, alpha=1, size=8, color="black")
                
ax.set_xticks(x_labels_pos)
ax.set_xticklabels(x_labels)
ax.set_xlim([0, len(df_results) +1])
ax.set_ylim(y_lims)
ax.set_ylabel('-log10(p-values)', style="italic")
ax.set_xlabel('Phenotypes', fontsize=15)
ax.axhline(y=-np.log10(adjusted_alpha), linestyle=":", color="black", label="Adjusted Threshold")
plt.xticks(fontsize = 9,rotation=90)
plt.yticks(fontsize = 8)
plt.title("Statistical Association Between COPD Status and Phenotypes", 
          loc="center",
          style="oblique", 
          fontsize = 20,
         y=1)
xticks = ax.xaxis.get_major_ticks()
xticks[0].set_visible(False)
handles, labels = ax.get_legend_handles_labels()
plt.legend(handles = handles, labels = labels, loc = "upper left")
plt.show()

Overall, it appears that most of the tested phenotypic covariates lie above the adjusted significance threshold. This is not surprising given the nature of our dependent variable: many of those variables are by nature directly tied to the COPD status.

This code can be reused directly with any other variable present in the variables dictionary: only the `dependent_var_name` value needs to be changed.