# PheWAS analysis minimal reproducible example

### Installing dependencies

In [None]:
import sys
!{sys.executable} -m pip install -r requirements.txt
!{sys.executable} -m pip install --upgrade --force-reinstall git+https://github.com/hms-dbmi/pic-sure-python-adapter-hpds.git 
!{sys.executable} -m pip install --upgrade --force-reinstall git+https://github.com/hms-dbmi/pic-sure-python-client.git

In [None]:
import json
from pprint import pprint

import pandas as pd
import numpy as np 

import PicSureHpdsLib
import PicSureClient

### Connecting to a PIC-SURE resource

In [None]:
PICSURE_network_URL = "https://picsure.biodatacatalyst.nhlbi.nih.gov/picsure"
resource_id = "02e23f52-f354-4e8b-992c-d37c8b9ba140"
token_file = "token.txt"

In [None]:
# Read-only access is enough; strip the trailing newline that editors often add.
with open(token_file, "r") as f:
    token = f.read().strip()

In [None]:
from python_lib.wrappers import get_HPDS_connection, query_runner
from python_lib.descriptive_scripts import quality_filtering, get_study_variables_info

`get_HPDS_connection` (found in `python_lib/wrappers.py`) wraps the PIC-SURE API Python client calls that connect to HPDS and return a resource object.

In [None]:
resource = get_HPDS_connection(token,
                               PICSURE_network_URL,
                               resource_id)

In [None]:
variablesDict = resource.dictionary().find().DataFrame()

`variablesDict` is the entry point of the analysis: it provides the names of the variables to query. The idea is to query every variable available in the accessed BDC environment and run a PheWAS against one harmonized variable.

This process is done iteratively, in batches of "phenome" variables for which the statistical tests are computed.

For the sake of example, let's use the first 50 variable names from the dictionary as phenome variables, and the harmonized "Presence or absence of carotid plaque" variable as the response variable (the categorical variable against which multiple univariate statistical tests will be conducted, which is the critical part of a PheWAS analysis).

Using batches of 50 variables, the steps described below are repeated 911 times to cover the whole dictionary.
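The batching itself can be sketched as follows. This is a minimal illustration, not the pipeline's actual code: `make_batches` is a hypothetical helper, and the toy list of made-up names stands in for `variablesDict.index.tolist()`.

```python
from math import ceil


def make_batches(names, batch_size=50):
    """Yield consecutive slices of `names` holding at most `batch_size` items."""
    for start in range(0, len(names), batch_size):
        yield names[start:start + batch_size]


# Toy stand-in for variablesDict.index.tolist(); these names are made up.
all_names = ["\\study\\var_{0}\\".format(i) for i in range(120)]

batches = list(make_batches(all_names, batch_size=50))
n_batches = ceil(len(all_names) / 50)  # number of pipeline iterations needed
print(len(batches), [len(b) for b in batches])
```

Each batch would then play the role of `covariates` in one iteration of the pipeline; the last batch is simply shorter when the number of variables is not a multiple of the batch size.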

### Beginning of the iterative pipeline

In [None]:
covariates = variablesDict.index[0:50].tolist()

In [None]:
dependent_var_name = "\\DCC Harmonized data set\\02 - Atherosclerosis\\Presence or absence of carotid plaque.\\"

In [None]:
vars_to_query = covariates + [dependent_var_name]

`query_runner` is a thin wrapper around PIC-SURE API methods. Variable names can be passed to "select", "any_of", "require", or "filter".

In [None]:
study_df = query_runner(resource=resource,
             to_select=vars_to_query,
             result_type="DataFrame",
             low_memory=False, 
             timeout=500)
print("Shape of retrieved HPDS dataframe {0}".format(study_df.shape))        

Once the patient-level data have been retrieved, three steps follow:

1. Variables quality checking
2. Variable description (e.g. number of missing values)
3. PheWAS computation

Quality checking functions are found in the file `python_lib/descriptive_scripts.py`.
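The actual filtering rules live in that file and are not reproduced in this notebook. To give an idea of the kind of check involved, here is a hypothetical stand-in: the function name `drop_uninformative_columns`, its threshold, and the toy data are all assumptions made for illustration, not the behaviour of `quality_filtering`.

```python
import pandas as pd


def drop_uninformative_columns(df, max_missing_ratio=0.99):
    """Drop columns that are almost entirely missing or hold a single value."""
    keep = []
    for col in df.columns:
        missing_ratio = df[col].isna().mean()      # share of missing values
        n_distinct = df[col].nunique(dropna=True)  # distinct non-missing values
        if missing_ratio <= max_missing_ratio and n_distinct > 1:
            keep.append(col)
    return df[keep]


# Made-up data: only the last column carries usable information.
toy_df = pd.DataFrame({
    "all_missing": [None, None, None, None],
    "constant": ["a", "a", "a", "a"],
    "informative": [1.0, 2.0, None, 4.0],
})
kept_columns = drop_uninformative_columns(toy_df).columns.tolist()
print(kept_columns)
```

Columns like these two are worth removing before modelling because a regression cannot be estimated on a predictor that has no observed values or no variation.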


### Variables quality checking and filtering

In [None]:
filtered_df = quality_filtering(study_df)
print("shape filtered_df: {0}".format(filtered_df.shape))

### Study information

In [None]:
variables_info = get_study_variables_info(study_df, filtered_df)

In [None]:
independent_var_names = [var for var in covariates if var in filtered_df.columns]

### PheWAS analysis

The code below is implemented as a function called `PheWAS` in the script `python_lib/PheWAS_funcs.py`.


In [None]:
from statsmodels.discrete.discrete_model import Logit
from scipy.linalg import LinAlgError
from statsmodels.tools.sm_exceptions import PerfectSeparationError
from tqdm import tqdm
    
dic_pvalues = {}
dic_errors = {}
for independent_var_name in tqdm(independent_var_names, position=0, leave=True):
    subset_df = filtered_df.loc[:, [dependent_var_name, independent_var_name]]\
              .dropna(how="any")

    if subset_df.shape[0] == 0:
        dic_pvalues[independent_var_name] = np.nan
        dic_errors[independent_var_name] = "All NaN"
        continue

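    if subset_df[independent_var_name].dtype in ["object", "bool"]:
        # One-hot encode categorical predictors as floats (recent pandas
        # versions default to bool dummies) and drop the last level, which
        # serves as the reference category.
        subset_df = pd.get_dummies(subset_df,
                                   columns=[independent_var_name],
                                   drop_first=False,
                                   dtype=float)\
                      .iloc[:, 0:-1]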
    y = subset_df[dependent_var_name].astype("category").cat.codes
    X = subset_df.drop(dependent_var_name, axis=1)\
                            .assign(intercept = 1)
    model = Logit(y, X)
    try:
        results = model.fit(disp=0)
        params = results.params
        conf = np.exp(results.conf_int())  # 95% CI by default (alpha=0.05)
        conf['Odds Ratio'] = np.exp(params)
        conf.columns = ['2.5%', '97.5%', 'Odds Ratio']

        dic_pvalues[independent_var_name] = (results.llr_pvalue, conf)
    except (LinAlgError, PerfectSeparationError) as e:
        dic_pvalues[independent_var_name] = np.nan
        dic_errors[independent_var_name] = e

Once the odds ratios and p-values have been retrieved, we can go on with the rest of the analysis, e.g. correcting the p-values with a Bonferroni correction or plotting the results as a Manhattan plot.
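As an illustration of the correction step, here is a minimal Bonferroni sketch in plain Python. The helper name `bonferroni` and the p-values are hypothetical. It assumes a plain mapping from variable name to p-value rather than the `(p-value, confidence table)` tuples stored in `dic_pvalues`, and it excludes failed fits (NaN) from the number of tests, which is one possible convention among others.

```python
import math


def bonferroni(pvalues, alpha=0.05):
    """Return {name: (adjusted_p, is_significant)} for the non-NaN p-values."""
    tested = {k: p for k, p in pvalues.items() if not math.isnan(p)}
    m = len(tested)  # number of tests actually performed
    # Multiply each p-value by the number of tests, capping the result at 1.
    return {k: (min(p * m, 1.0), p * m < alpha) for k, p in tested.items()}


# Made-up p-values, for illustration only.
raw = {"var_a": 0.001, "var_b": 0.02, "var_c": 0.6, "var_d": float("nan")}
adjusted = bonferroni(raw)
print(adjusted)
```

Bonferroni is the most conservative of the common corrections; with tens of thousands of phenome variables, a less strict procedure may be preferable, but that choice is outside the scope of this example.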

This analysis is simplified and does not deal with more complicated statistical cases, such as a multicategorical response variable or multivariate regression, which will be necessary for the complete PheWAS analysis on TOPMed studies.