# PIC-SURE python API use-case: Phenome-Wide analysis on *NHLBI BioData Catalyst® (BDC)* studies

This notebook illustrates how to query data using the python **PIC-SURE API**, taking a simple PheWAS analysis as its use-case. It is intentionally straightforward, and the explanations provided are only aimed at guiding you through the PheWAS analysis process. For a more step-by-step introduction to the python PIC-SURE API, see the `1_PICSURE_API_101.ipynb` notebook.

**Before running this notebook, please be sure to review the "Get your security token" documentation, which exists in the [`README.md` file](../README.md). It explains how to get a security token, which is mandatory to use the PIC-SURE API.**

To set up your token file, be sure to run the [`Workspace_setup.ipynb` file](./Workspace_setup.ipynb).

# Environment set-up

### System requirements
- Python 3.6 or later
- pip python package manager, already available in most systems with a python interpreter installed

**Note that if you are using the dedicated PIC-SURE environment within the *BDC Powered by Seven Bridges* platform, the necessary packages have already been installed.**

### Install packages

In [None]:
import sys
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
# BDC Powered by Terra users uncomment the following line to specify package install location
# sys.path.insert(0, r"/home/jupyter/.local/lib/python3.7/site-packages")

In [None]:
!{sys.executable} -m pip install --upgrade --force-reinstall git+https://github.com/hms-dbmi/pic-sure-python-client.git
!{sys.executable} -m pip install --upgrade --force-reinstall git+https://github.com/hms-dbmi/pic-sure-python-adapter-hpds.git
!{sys.executable} -m pip install --upgrade --force-reinstall git+https://github.com/hms-dbmi/pic-sure-biodatacatalyst-python-adapter-hpds.git

In [None]:
import PicSureClient
import PicSureBdcAdapter

In [None]:
# Pandas DataFrame display options
pd.set_option("display.max_rows", 100)

# Matplotlib display parameters
plt.rcParams["figure.figsize"] = (14,8)
font = {'weight' : 'bold',
        'size'   : 12}
plt.rc('font', **font)

### Connecting to a PIC-SURE network

In [None]:
PICSURE_network_URL = "https://picsure.biodatacatalyst.nhlbi.nih.gov/picsure"
token_file = "token.txt"

with open(token_file, "r") as f:
    my_token = f.read().strip()  # strip() drops any trailing newline saved with the token
    
bdc = PicSureBdcAdapter.Adapter(PICSURE_network_URL, my_token)

# PheWAS analysis
*Note: This example is not meant to be publication-ready, but rather serve as a guide or starting point to perform PheWAS.*

This PheWAS analysis focuses on the TOPMed DCC Harmonized Variables. 
We leverage the harmonized variables to provide an example PheWAS focused on total cholesterol in FHS.
The PIC-SURE API is helpful in wrangling our phenotypic data. 

In a nutshell, this PheWAS analysis follows these steps:
1. Retrieving the variable dictionary, using the PIC-SURE API dedicated methods
2. Using the PIC-SURE API to select variables and retrieve data
3. Data management
4. Statistical analysis for each study and sex
5. Visualization of results in Manhattan Plot

With this, we are tackling two different analysis considerations of a PheWAS: 
1. Using multiple variables in a PheWAS. In this example, we are looking into sex differences of total cholesterol.
2. Harmonization and meta-analysis issues when using data from multiple studies or datasets

### 1. Retrieving variable dictionary from PIC-SURE
The first step to conducting the PheWAS is to retrieve information about the variables that will be used in the analysis. For this example, we will be using variables from the TOPMed Data Coordinating Center (DCC) Harmonized data set. 

The Data Harmonization effort aims to produce "a high quality, lasting resource of publicly available and thoroughly documented harmonized phenotype variables". The TOPMed DCC collaborates with Working Group members and phenotype experts on this endeavour. So far, 44 harmonized variables are accessible through the PIC-SURE API (in addition to the age at which each variable value was collected for a given subject).

Which phenotypic characteristics are included in the harmonized variables?

- Key NHLBI phenotypes
    - Blood cell counts
    - VTE
    - Atherosclerosis-related phenotypes
    - Lipids
    - Blood pressure
    
    
- Common covariates
    - Height
    - Weight
    - BMI
    - Smoking status
    - Race/ethnicity

More information about the variable harmonization process is available at https://www.nhlbiwgs.org/sites/default/files/pheno_harmonization_guidelines.pdf

Here, we find all 44 DCC harmonized variables. For more details on this process, see the `2_TOPMed_DCC_Harmonized_Variables_analysis` notebook.

In [None]:
harmonized_dictionary = bdc.useDictionary().dictionary().find('harmonized')
harmonized_dataframe = harmonized_dictionary.dataframe()
vars_to_remove = harmonized_dataframe.columnmeta_name.str.contains("age at measurement|harmonization unit")
# Use ~ (logical NOT) to invert the boolean mask; unary minus is not supported on boolean masks
harmonized_dataframe = harmonized_dataframe[~vars_to_remove]
harmonized_dataframe = harmonized_dataframe[harmonized_dataframe['studyId'] == "DCC Harmonized data set"]

print(harmonized_dataframe.shape)
harmonized_dataframe.head()
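
The filtering logic in the cell above can be tried without a PIC-SURE connection. The following is a minimal sketch using only the standard library. The rows are made-up stand-ins for dictionary entries (only the `columnmeta_name` and `studyId` field names and the exclusion pattern come from this notebook), and `keep_row` is a hypothetical helper, not part of the PIC-SURE API.

```python
import re

# Made-up stand-ins for rows of the variable dictionary (assumption)
toy_rows = [
    {"columnmeta_name": "toy_var_1", "studyId": "DCC Harmonized data set"},
    {"columnmeta_name": "age at measurement of toy_var_1", "studyId": "DCC Harmonized data set"},
    {"columnmeta_name": "harmonization unit for toy_var_1", "studyId": "DCC Harmonized data set"},
    {"columnmeta_name": "toy_var_2", "studyId": "some other study"},
]

# Same pattern as the one passed to str.contains in the cell above
exclude_pattern = re.compile("age at measurement|harmonization unit")


def keep_row(row):
    """Keep harmonized data set rows that are not companion variables."""
    is_companion = exclude_pattern.search(row["columnmeta_name"]) is not None
    return (not is_companion) and row["studyId"] == "DCC Harmonized data set"


kept_names = [row["columnmeta_name"] for row in toy_rows if keep_row(row)]
print(kept_names)
```

Only `toy_var_1` survives: the two companion rows match the exclusion pattern, and `toy_var_2` belongs to a different study.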

### 2. Using the PIC-SURE API to select variables and retrieve data
Now that we've retrieved the variable information, we need to select our variable of interest. In this example, we are interested in exploring the relationship between the harmonized variables and blood cholesterol. Specifically, we will find the HPDS path that contains "Blood mass concentration of total cholesterol".

In [None]:
# Retrieve the dependent variable - total cholesterol
cholesterol_variables = harmonized_dataframe[harmonized_dataframe.columnmeta_HPDS_PATH.str.contains('cholesterol')]

cholesterol_path = list(cholesterol_variables.columnmeta_HPDS_PATH)[0]
cholesterol_variables

In [None]:
# Create full list of concept paths with cholesterol_path removed
selected_vars = list(harmonized_dataframe.columnmeta_HPDS_PATH)
selected_vars.remove(cholesterol_path)
selected_vars

We are ready to create our query and retrieve the dataframe. This query will consist of two parts:
1. **Any record of `cholesterol_path`.** By performing an "any record of" filter on the `cholesterol_path`, we will filter out all participants that do not have total blood cholesterol measurements. This allows us to perform more meaningful statistical analysis on the data.
2. **Select all remaining harmonized variables (`selected_vars`).** We will then add all of the remaining harmonized variables to the query, which will allow us to retrieve this information.

In [None]:
# Initialize a query
authPicSure = bdc.useAuthPicSure()
myquery = authPicSure.query()
myquery.anyof().add(cholesterol_path)
myquery.select().add(selected_vars)
facts = myquery.getResultsDataFrame(low_memory = False)
facts.head()

### 3. Data management
Now that we have retrieved the data, we shall perform some data management steps to prepare for the statistical analysis. First, we will identify which variables are categorical and which are continuous using the `columnmeta_data_type` column of the harmonized dictionary. This is an example of how the PIC-SURE API greatly simplifies this step for the user, as categorizing variables can be tricky.

In [None]:
categorical_dataframe = harmonized_dataframe[harmonized_dataframe.columnmeta_data_type == "Categorical"]
categorical_paths = list(categorical_dataframe.columnmeta_HPDS_PATH)

continuous_dataframe = harmonized_dataframe[harmonized_dataframe.columnmeta_data_type == "Continuous"]
continuous_paths = list(continuous_dataframe.columnmeta_HPDS_PATH)

In [None]:
# remove cholesterol_path from continuous_paths
continuous_paths.remove(cholesterol_path) 

# remove subcohort concept path from categorical_paths
categorical_paths.remove("\\DCC Harmonized data set\\demographic\\subcohort_1\\")

To perform this PheWAS, we will frame two participant cohorts in the context of the dependent variable of interest. In this example, we are interested in blood cholesterol. However, `Blood mass concentration of total cholesterol` is a continuous variable. We shall convert this variable into a binary variable with two groups, Normal/Low and High cholesterol levels, by applying a [threshold of 200 mg/dL](https://www.mayoclinic.org/diseases-conditions/high-blood-cholesterol/diagnosis-treatment/drc-20350806). 

In [None]:
conditions = [
    list(facts[cholesterol_path] <= 200),
    list(facts[cholesterol_path] > 200)
]
outputs = [0, 1] 
# Note: 0 indicates Normal/Low cholesterol, while 1 indicates High cholesterol.

In [None]:
res = np.select(conditions, outputs)
# Assign the array directly so values are matched by position rather than by index label
facts['categorical_cholesterol'] = res
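
To make the thresholding rule concrete, here is a minimal standard-library sketch of the same dichotomization. The measurements are made up for illustration, and `cholesterol_status` and `THRESHOLD_MG_DL` are hypothetical names that do not appear elsewhere in this notebook.

```python
THRESHOLD_MG_DL = 200  # threshold used in this notebook


def cholesterol_status(value):
    """Return 0 for Normal/Low (<= threshold) and 1 for High (> threshold)."""
    return 1 if value > THRESHOLD_MG_DL else 0


toy_values = [150.0, 200.0, 200.1, 260.0]  # made-up measurements in mg/dL
toy_status = [cholesterol_status(v) for v in toy_values]
print(toy_status)  # [0, 0, 1, 1]
```

A value of exactly 200 falls in the Normal/Low group. A missing value would satisfy neither condition and would fall into group 0 by default, which is one reason the query only keeps participants with a cholesterol record.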

We will also specify the variable name for the covariate we are interested in, in this case Sex.

In [None]:
sex_path = list(facts.filter(regex = 'sex'))[0]
sex_path

We will also select our cohorts of interest. In this example, we are interested in participants from the Framingham Heart Study (FHS). We can utilize the `\\DCC Harmonized data set\\demographic\\subcohort_1\\` concept path in the DCC Harmonized data set to select the participants of interest.

In [None]:
fhs_subset = facts[facts['\\DCC Harmonized data set\\demographic\\subcohort_1\\'].str.contains('FHS') == True]

print(fhs_subset['\\DCC Harmonized data set\\demographic\\subcohort_1\\'].value_counts(), '\n')

Finally, we need to handle any columns that have multiple values per participant. In PIC-SURE data, multiple values are recorded using a tab, `\t`, as the delimiter. The functions below handle these by returning the mean of numeric values. For categorical values, they return the single string when all values agree, and otherwise the string representation of the list of values.

In [None]:
def process_mixed_cell(cell):
    # Split by tab
    items = str(cell).split('\t')
    # Try to convert all to float
    try:
        nums = [float(x) for x in items]
        return np.mean(nums)
    except ValueError:
        # Not all are numbers, treat as categorical
        first = items[0]
        if all(x == first for x in items):
            return first
        else:
            return f"{items}"

def process_tab_separated_columns(df, cols=None):
    df_processed = df.copy()
    if cols is None:
        # Try to infer columns with tabs (optional)
        cols = [col for col in df.columns if df[col].astype(str).str.contains('\t').any()]
    for col in cols:
        df_processed[col] = df_processed[col].apply(process_mixed_cell)
    return df_processed

In [None]:
fhs_subset = process_tab_separated_columns(fhs_subset)
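
The behavior of the cell-level function is easiest to see on a few inputs. The sketch below mirrors `process_mixed_cell` using only the standard library (`statistics.mean` in place of `np.mean`); `summarize_cell` is a hypothetical name and the input strings are made up.

```python
import statistics


def summarize_cell(cell):
    """Mirror of process_mixed_cell using only the standard library."""
    items = str(cell).split("\t")
    try:
        # Numeric case: every item must parse as a float
        nums = [float(x) for x in items]
        return statistics.mean(nums)
    except ValueError:
        # Categorical case: collapse identical values, otherwise keep the list as text
        if all(x == items[0] for x in items):
            return items[0]
        return f"{items}"


numeric_result = summarize_cell("120\t130")   # two numeric values
same_result = summarize_cell("Male\tMale")    # repeated categorical value
mixed_result = summarize_cell("a\tb")         # conflicting categorical values
print(numeric_result, same_result, mixed_result)
```

Conflicting categorical values become a new string category (here `"['a', 'b']"`), so they are treated as their own level in the categorical tests that follow.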

### 4. Statistical analysis
Two different association tests will be carried out according to the variables' data types:
- Logistic regression for continuous variables, using the `Logit` class from statsmodels
- Chi-squared test of independence for categorical variables, using the `chi2_contingency` function from scipy.stats

We will create two functions, `test_continuous` and `test_categorical`, to perform these statistical tests. 
An additional function, `check_vars`, will be used to check if the data passes some assumptions of these tests.

In [None]:
import statsmodels.discrete.discrete_model as sm
import statistics
from scipy.stats import chi2_contingency
import statsmodels.stats.multitest as smt

In [None]:
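def test_continuous(dependent_vec, independent_vec):
    # Logistic regression of the binary outcome on a single predictor.
    # Note: no intercept term is added to the model here.
    model = sm.Logit(dependent_vec, independent_vec, missing='drop')
    # .iloc[0] selects the p-value of the single predictor by position
    pval = model.fit().pvalues.iloc[0]
    return pval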

def test_categorical(dependent_vec, independent_vec):
    contingency_table = pd.crosstab(index=dependent_vec, columns=independent_vec)
    pval = chi2_contingency(contingency_table)[1]
    return pval

def check_vars(dependent_var, other_var, df, case_value, control_value):
    check_pass = False
    concept_vec = df.iloc[:,1].value_counts()
    if len(concept_vec) > 1:
        check_pass = True
    return check_pass
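
For intuition about the categorical test, the sketch below computes a Pearson chi-squared test by hand for a 2x2 contingency table, using only the standard library. It is an illustration under stated assumptions, not what `chi2_contingency` does internally: `chi2_2x2` is a hypothetical helper, the counts are made up, and no continuity correction is applied (scipy applies Yates' correction by default for 2x2 tables, so its p-value for the same table would be somewhat larger).

```python
import math


def chi2_2x2(a, b, c, d):
    """Pearson chi-squared test (no continuity correction) for [[a, b], [c, d]]."""
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    statistic = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # With 1 degree of freedom, the chi-squared survival function is erfc(sqrt(x / 2))
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


# Made-up counts: rows are outcome groups, columns are levels of a categorical variable
toy_statistic, toy_p_value = chi2_2x2(20, 10, 10, 20)
print(toy_statistic, toy_p_value)
```

A perfectly balanced table such as `chi2_2x2(10, 10, 10, 10)` gives a statistic of 0 and a p-value of 1, meaning no evidence of association.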

We wrap the previously created functions into one broad analysis function: `run_phewas`. This allows the user to run a variation of the analyses described here simply by modifying the calls to the function. The arguments are described below.
- `facts`: a dataframe representing the results from your PIC-SURE query. This dataframe can be filtered as needed.
- `dependent_var`: the column name corresponding to the name of your outcome variable. In this example, `'categorical_cholesterol'`.
- `continuous_varnames`: a vector containing all column names of continuous variables within the facts dataframe which you would like to test.
- `categorical_varnames`: a vector containing all column names of categorical variables within the facts dataframe which you would like to test.
- `case_value`: the value corresponding to the 'cases' in the dependent_var vector
- `control_value`: the value corresponding to the 'controls' in the dependent_var vector

In [None]:
def run_phewas(facts, dependent_var, continuous_varnames, categorical_varnames, case_value, control_value):
    results_df = pd.DataFrame(columns=['concept_code', 
                                       'simplified_varname', 
                                       'vartype',
                                       'pval',
                                       'n_cases',
                                       'n_controls',
                                       'var_cases',
                                       'var_controls'
                                      ])
    
    for other_var in continuous_varnames:
        df = facts[[dependent_var, other_var]]
        check_pass = check_vars(dependent_var, other_var, df, case_value, control_value)
        if check_pass:
            cases = df[df.iloc[:,0]==case_value]
            controls = df[df.iloc[:,0]==control_value]
            row_to_add = [other_var, # concept_code
                          other_var.split('\\')[-2], #simplified_varname
                          'continuous', #vartype
                          test_continuous(df.iloc[:,0], df.iloc[:,1]), #pval
                          len(cases), #n_cases
                          len(controls), #n_controls
                          statistics.variance(cases.iloc[:,1].dropna()), #var_cases
                          statistics.variance(controls.iloc[:,1].dropna()) #var_controls
                         ]
            results_df = pd.concat([results_df, pd.DataFrame(data=[row_to_add], columns=results_df.columns)], ignore_index=True)
            
    for other_var in categorical_varnames:
        df = facts[[dependent_var, other_var]]
        check_pass = check_vars(dependent_var, other_var, df, case_value, control_value)
        if check_pass:
            row_to_add = [other_var, #concept_code
                          other_var.split('\\')[-2], #simplified_varname
                          'categorical', #vartype
                          test_categorical(df.iloc[:,0], df.iloc[:,1]), #pval
                          len(df[df.iloc[:,0]==case_value]), #n_cases
                          len(df[df.iloc[:,0]==control_value]), #n_controls
                          np.nan, #var_cases
                          np.nan #var_controls
                         ]
            results_df = pd.concat([results_df, pd.DataFrame(data=[row_to_add], columns=results_df.columns)], ignore_index=True)
            
    return results_df

We then use our previously defined wrapper function to run the PheWAS twice:
- Testing all harmonized variables against cholesterol in females in the FHS study
- Testing all harmonized variables against cholesterol in males in the FHS study


In [None]:
fhs_female_df = run_phewas(fhs_subset[fhs_subset[sex_path] == 'Female'],
                                     "categorical_cholesterol", 
                                     continuous_paths, 
                                     categorical_paths, 
                                     case_value=1, 
                                     control_value=0)
fhs_female_df['sex'] = 'Female'
fhs_female_df['study'] = 'FHS'

fhs_male_df = run_phewas(fhs_subset[fhs_subset[sex_path] == 'Male'],
                                     "categorical_cholesterol", 
                                     continuous_paths, 
                                     categorical_paths, 
                                     case_value=1, 
                                     control_value=0)
fhs_male_df['sex'] = 'Male'
fhs_male_df['study'] = 'FHS'

In [None]:
combined_df = pd.concat([fhs_female_df, fhs_male_df])
combined_df = combined_df[combined_df.pval != 0] #Removing pvalues equal to 0
combined_df.head()

Because we are running many statistical tests, we need to perform a p-value adjustment. Here, we use the Holm-Bonferroni method with an alpha of 0.01.

In [None]:
combined_df['adj_pvalues'] = smt.multipletests(combined_df['pval'], alpha=0.01, method='holm')[1]
combined_df['log_adj_pvalues'] = -1*np.log10(combined_df['adj_pvalues'])
adjusted_alpha = -1*np.log10(0.01)
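
To show what the Holm-Bonferroni adjustment does to a set of p-values, here is a minimal standard-library sketch of the step-down procedure followed by the -log10 transform. `holm_adjust` is a hypothetical helper written for illustration, and the p-values are made up; the notebook itself relies on `multipletests` from statsmodels.

```python
import math


def holm_adjust(pvalues):
    """Return Holm-Bonferroni adjusted p-values in the original order."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The smallest p-value is multiplied by m, the next by m - 1, and so on
        candidate = min(1.0, (m - rank) * pvalues[idx])
        # Enforce monotonicity: an adjusted value never drops below an earlier one
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


toy_pvalues = [0.01, 0.04, 0.03, 0.005]  # made-up raw p-values
toy_adjusted = holm_adjust(toy_pvalues)
toy_scores = [-math.log10(p) for p in toy_adjusted]
toy_threshold = -math.log10(0.01)  # same significance line as adjusted_alpha
print(toy_adjusted, toy_scores, toy_threshold)
```

The adjusted values are approximately `[0.03, 0.06, 0.06, 0.02]`, so none of these toy tests clears the threshold of 2 on the -log10 scale. The transform is undefined for a p-value of exactly 0, which is consistent with the earlier step that removes such rows.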

### 5. Visualization of results in a Manhattan plot
We plot a Manhattan plot, commonly used in PheWAS analyses, to visualize our results. First we will organize our data for plotting.

In [None]:
# Identify and record categories for each concept code
def categorize_function(x):
    return(x.split('\\')[2])

combined_df['category'] = combined_df['concept_code'].apply(categorize_function)
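
The index positions used when splitting concept paths (`[2]` here for the category and `[-2]` in `run_phewas` for the simplified variable name) follow from the leading and trailing backslashes. A short sketch using the subcohort concept path from earlier in this notebook:

```python
# Concept path taken from this notebook; each "\\" in the literal is a single backslash
concept_path = "\\DCC Harmonized data set\\demographic\\subcohort_1\\"

parts = concept_path.split("\\")
# The leading and trailing backslashes produce empty strings at both ends
print(parts)

category = parts[2]             # third element: the phenotype category
simplified_varname = parts[-2]  # second to last element: the variable name
print(category, simplified_varname)
```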

In [None]:
plot_df = combined_df.sort_values(['category'])
plot_df.reset_index(inplace=True, drop=True)
plot_df['i'] = plot_df.index

The x axis represents each of the phenotypes tested, and the y axis represents their associated -log10 adjusted p-value. 

In [None]:
import seaborn as sns
import textwrap

In [None]:
# identify top 5 adjusted p value results for each study
# FHS
fhs_labels_df = plot_df[plot_df.study == 'FHS']
fhs_labels_df = fhs_labels_df[fhs_labels_df.log_adj_pvalues >= fhs_labels_df.log_adj_pvalues.sort_values(na_position = 'first').iloc[-5]]
fhs_labels_df['offset'] = [15, -15, -30, 0, -10]
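
One property of the threshold-based selection above is worth knowing: ties at the cutoff can return more than five rows, in which case a fixed five-element `offset` list no longer matches the number of rows. The standard-library sketch below contrasts the threshold approach with selecting exactly `k` items. The data are made up, and `top_by_threshold` and `top_exact` are hypothetical helpers.

```python
import heapq

# Made-up (name, -log10 adjusted p-value) pairs; note the four-way tie at 3.0
toy_points = [("a", 1.0), ("b", 3.0), ("c", 3.0), ("d", 5.0),
              ("e", 0.5), ("f", 3.0), ("g", 4.0), ("h", 3.0)]


def top_by_threshold(points, k):
    """Keep every point whose value is >= the k-th largest value."""
    cutoff = sorted(value for _, value in points)[-k]
    return [p for p in points if p[1] >= cutoff]


def top_exact(points, k):
    """Keep exactly k points with the largest values (ties broken arbitrarily)."""
    return heapq.nlargest(k, points, key=lambda p: p[1])


threshold_selection = top_by_threshold(toy_points, 5)
exact_selection = top_exact(toy_points, 5)
print(len(threshold_selection), len(exact_selection))
```

With these toy values the threshold approach keeps six points while the exact approach keeps five. Which behavior is preferable depends on whether tied phenotypes should all be labeled.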

In [None]:
# Generate Manhattan plot for FHS:

plot = sns.relplot(data=plot_df[plot_df.study == 'FHS'], 
                   x='i', y='log_adj_pvalues', aspect=3.7, style = 'sex',
                   hue='category', palette = 'bright')
groups=plot_df.groupby('category')['i'].median()
plot.ax.set_xlabel('Phenotype Category')
plot.ax.set_ylabel('-log10(adjusted p-value)')
plot.ax.set_xticks(groups)
#plot.ax.set_xticklabels(groups.index)
plot.ax.set_xticklabels('')
#plt.xticks(rotation = 10)
# label points on the plot
for x, y, z, offset in zip(fhs_labels_df['i'], fhs_labels_df['log_adj_pvalues'], fhs_labels_df['simplified_varname'], fhs_labels_df['offset']):
    plt.text(x = x, # x-coordinate position of data label
             y = y, # y-coordinate position of data label
             s = textwrap.fill(z, width=40, fix_sentence_endings=True, break_long_words=False)) #option for text wrapping, import textwrap
plt.axhline(y=adjusted_alpha, color='r', linestyle = 'dotted')
plot.fig.suptitle('Association between phenotypic variables and cholesterol level status within the Framingham Heart Study (FHS) cohort');

In [None]:
# View top 5 p-values for each study/sex combination
fhs_df = plot_df[plot_df.study == "FHS"]
print("FHS Females:")
fhs_females_df = fhs_df[fhs_df.sex == "Female"]
logpthresh = fhs_females_df.log_adj_pvalues.sort_values(na_position = 'first').iloc[-5]
fhs_females_df = fhs_females_df[fhs_females_df.log_adj_pvalues >= logpthresh]
print(fhs_females_df[['simplified_varname', 'pval', 'sex', 'study']])

print("FHS Males:")
fhs_males_df = fhs_df[fhs_df.sex == "Male"]
logpthresh = fhs_males_df.log_adj_pvalues.sort_values(na_position = 'first').iloc[-5]
fhs_males_df = fhs_males_df[fhs_males_df.log_adj_pvalues >= logpthresh]
print(fhs_males_df[['simplified_varname', 'pval', 'sex', 'study']])
