# Sample QC for GWAS analysis

## As-Is Software Disclaimer

This notebook is delivered "As-Is". Notwithstanding anything to the contrary, DNAnexus will have no warranty, support, liability or other obligations with respect to Materials provided hereunder.

[MIT License](https://github.com/dnanexus/UKB_RAP/blob/main/LICENSE) applies to this notebook.

## JupyterLab app details (launch configuration)

Recommended configuration
- Runtime: ~5 min
- Cluster configuration: `Single Node`
- Recommended instance: `mem1_ssd1_v2_x36`
- Cost: ~£0.3

## Dependencies

|Library |License|
|:------------- |:-------------|
|[pandas](https://pandas.pydata.org/) |[BSD-3](https://github.com/pandas-dev/pandas/blob/main/LICENSE)|
|[numpy](https://numpy.org/) |[BSD-3](https://github.com/numpy/numpy/blob/main/LICENSE.txt)|

## Introduction

This notebook:
- Loads cohorts created in Cohort Browser
- Performs sample QC
- Creates a file containing phenotype and covariate information needed for GWAS analysis

Data used for this notebook:
- Cohorts (`ischemic_cases`, `ischemic_controls`) created using Cohort Browser
- Dataset to retrieve phenotype data for both cohorts (`ischemia.pheno`)


## Prepare your environment

In [None]:
# Import packages
# dxpy allows python to interact with the platform storage
import dxpy
import numpy as np
import pandas as pd
import re
import subprocess
import glob
import os

In [None]:
imputation_folder = 'Imputation from genotype (GEL)'
imputation_field_id = '21008'
output_dir = '/Data/'

## Load dataset description containing phenotypic data

In [None]:
# Automatically discover dispensed dataset ID
dispensed_dataset = dxpy.find_one_data_object(
    typename='Dataset', name='app*.dataset', folder='/', name_mode='glob'
)
dispensed_dataset_id = dispensed_dataset['id']

In [None]:
# Get project ID
project_id = dxpy.find_one_project()['id']

In [None]:
dataset = (':').join([project_id, dispensed_dataset_id])

Using the `-ddd` parameter will extract 3 dictionary files associated with the dataset.

The 3 dictionary files that are returned include:
1. `entity_dictionary` that lists the tables (entities) available in the dataset. The table we're usually most interested in is the participant table, which contains the information about each participant.
2. `data_dictionary` that contains the different field names that we might want to include in our dataset.
3. `coding_dictionary` that contains a lookup for the values for some of the field names.
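
As a minimal sketch of how the data dictionary and the coding dictionary relate, the toy tables below look up the code meanings for one field. The column names (`name`, `title`, `coding_name`, `code`, `meaning`) and the coding label `data_coding_9` are assumptions made for illustration; check them against the files extracted in your own project.

```python
import pandas as pd

# Toy stand-ins for the extracted dictionary files (column names are assumptions)
data_dictionary = pd.DataFrame(
    {
        'entity': ['participant', 'participant', 'participant'],
        'name': ['eid', 'p31', 'p21022'],
        'title': ['Participant ID', 'Sex', 'Age at recruitment'],
        'coding_name': [None, 'data_coding_9', None],
    }
)
coding_dictionary = pd.DataFrame(
    {
        'coding_name': ['data_coding_9', 'data_coding_9'],
        'code': [0, 1],
        'meaning': ['Female', 'Male'],
    }
)


def describe_field(field_name):
    '''Return {code: meaning} for a coded field, or an empty dict for an uncoded one.'''
    row = data_dictionary.loc[data_dictionary['name'] == field_name].iloc[0]
    if pd.isna(row['coding_name']):
        return {}
    codes = coding_dictionary[coding_dictionary['coding_name'] == row['coding_name']]
    return {int(code): meaning for code, meaning in zip(codes['code'], codes['meaning'])}


sex_codes = describe_field('p31')
age_codes = describe_field('p21022')
print(sex_codes)
print(age_codes)
```

Fields without a coding, such as age, return an empty dictionary because their values are used as they are.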

In [None]:
# Note: This cell can only be run once. Otherwise, you'll need to delete the existing data tables in order to re-run
cmd = ['dx', 'extract_dataset', dataset, '-ddd', '--delimiter', ',']
subprocess.check_call(cmd)

## Load cohorts that were created in Cohort Browser

Cohorts were created in Cohort Browser. The `ischemic_cases` cohort was created by filtering on field [41270](https://biobank.ctsu.ox.ac.uk/crystal/field.cgi?id=41270) with the condition `I20-I25 Ischaemic heart disease`. The `ischemic_controls` cohort was created with the cohort compare feature: not in `ischemic_cases`.

Here we use `dxpy.find_data_objects` to look up the cohort records by name in the `/Cohorts` folder. Cohort records can also be loaded with the `load_cohort` function from dxdata. For more information about dxdata, see the associated [GitHub repository](https://github.com/dnanexus/OpenBio/blob/master/dxdata/getting_started_with_dxdata.ipynb).

In [None]:
# Discover cohort data
dispensed_control_id = list(
    dxpy.find_data_objects(
        typename='CohortBrowser',
        folder='/Cohorts',
        name_mode='exact',
        name='ischemic_controls',
    )
)[0]['id']

dispensed_case_id = list(
    dxpy.find_data_objects(
        typename='CohortBrowser',
        folder='/Cohorts',
        name_mode='exact',
        name='ischemic_cases',
    )
)[0]['id']
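
Indexing the search results with `[0]` raises a bare `IndexError` when a cohort is missing and silently picks the first record when several share a name. A small guard like the sketch below gives a clearer message. The helper `single_match_id` is hypothetical (it is not part of dxpy) and only assumes that each search result is a dictionary with an `'id'` key, as in the cell above.

```python
def single_match_id(results, name):
    '''Return the 'id' of the only search result, or raise a descriptive error.'''
    results = list(results)
    if len(results) != 1:
        raise ValueError(
            f'Expected exactly one object named {name!r}, found {len(results)}'
        )
    return results[0]['id']


# Toy search results shaped like the dictionaries used in the cell above
found = [{'id': 'record-XXXX', 'project': 'project-YYYY'}]
cohort_id = single_match_id(found, 'ischemic_cases')
print(cohort_id)
```

In the notebook, the generator returned by `dxpy.find_data_objects(...)` could be passed in place of `found`.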

In [None]:
control_dataset = (':').join([project_id, dispensed_control_id])
case_dataset = (':').join([project_id, dispensed_case_id])

## Retrieve phenotypic data

Specify the field IDs to retrieve, get the corresponding UKB RAP field names, and print the description table.

- `31` - [Sex](https://biobank.ndph.ox.ac.uk/ukb/field.cgi?id=31)
- `2966` - [Age high blood pressure diagnosed](https://biobank.ctsu.ox.ac.uk/crystal/field.cgi?id=2966)
- `22001` - [Genetic sex](https://biobank.ndph.ox.ac.uk/ukb/field.cgi?id=22001)
- `22006` - [Genetic ethnic grouping](https://biobank.ndph.ox.ac.uk/ukb/field.cgi?id=22006)
- `22019` - [Sex chromosome aneuploidy](https://biobank.ndph.ox.ac.uk/ukb/field.cgi?id=22019)
- `22021` - [Genetic kinship to other participants](https://biobank.ndph.ox.ac.uk/ukb/field.cgi?id=22021)
- `21022` - [Age at recruitment](https://biobank.ndph.ox.ac.uk/ukb/field.cgi?id=21022)
- `23104` - [Body mass index (BMI)](https://biobank.ndph.ox.ac.uk/ukb/field.cgi?id=23104)
- `20160` - [Ever smoked](https://biobank.ndph.ox.ac.uk/ukb/field.cgi?id=20160)
- `30760` - [HDL cholesterol](https://biobank.ndph.ox.ac.uk/ukb/field.cgi?id=30760)
- `30780` - [LDL direct](https://biobank.ndph.ox.ac.uk/ukb/field.cgi?id=30780)
- `22020` - [Used in genetic principal components](https://biobank.ndph.ox.ac.uk/ukb/field.cgi?id=22020)
- `22009` - [Genetic principal components](https://biobank.ndph.ox.ac.uk/showcase/field.cgi?id=22009)

In [None]:
field_ids = [
    '31',
    '2966',
    '22001',
    '22006',
    '22019',
    '22021',
    '21022',
    '23104',
    '20160',
    '30760',
    '30780',
    '22020',
    '22009'
]

In [None]:
path = os.getcwd()

In [None]:
data_dict_csv = glob.glob(os.path.join(path, '*.data_dictionary.csv'))[0]
data_dict_df = pd.read_csv(data_dict_csv)
data_dict_df.head()

The UKB participant tables have the following naming convention for the fields (or columns): `p<field id>_i<instance id>_a<array id>`.
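
To make the naming convention concrete, here is a minimal sketch that splits a column name into its field ID, instance, and array index. The helper `parse_field_name` is hypothetical and exists only for illustration. The instance and array parts are treated as optional, as they are in the regular expression used by the notebook.

```python
import re

# p<field id>, then an optional _i<instance id>, then an optional _a<array id>
FIELD_NAME = re.compile(
    r'^p(?P<field>\d+)(?:_i(?P<instance>\d+))?(?:_a(?P<array>\d+))?$'
)


def parse_field_name(column):
    '''Split a participant column name into (field ID, instance, array index).'''
    match = FIELD_NAME.match(column)
    if match is None:
        return None
    parts = match.groupdict()
    return (
        parts['field'],
        None if parts['instance'] is None else int(parts['instance']),
        None if parts['array'] is None else int(parts['array']),
    )


examples = ['p31', 'p2966_i0', 'p22009_a1', 'p20002_i0_a3', 'eid']
parsed = {name: parse_field_name(name) for name in examples}
for name, parts in parsed.items():
    print(name, parts)
```

Names that do not follow the convention, such as `eid`, return `None`.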

In [None]:
def fields_for_id(field_id):
    '''Collect all column names (e.g. 'p<field_id>_iYYY_aZZZ') for the given list of field IDs and return them as a comma-separated string to pass to extract_dataset'''
    field_names = ['eid']
    for _id in field_id:
        select_field_names = list(
            data_dict_df[
                data_dict_df.name.str.match(r'^p{}(_i\d+)?(_a\d+)?$'.format(_id))
            ].name.values
        )
        # Note: This conditional only limits how many columns are kept per field
        # It is not needed if you want every instance and array element
        # For the PCA field, keep only the first ten principal components
        if _id == '22009':
            field_names += select_field_names[:10]
        # For every other field except '2966', keep only the first instance
        elif _id != '2966' and len(select_field_names) > 1:
            field_names.append(select_field_names[0])
        else:
            field_names += select_field_names

    field_names = [f'participant.{f}' for f in field_names]
    return ','.join(field_names)

In [None]:
field_names = fields_for_id(field_ids)
field_names

### Select phenotypes for case and control cohort

In [None]:
# Load dataset
# Note: This cell can only be run once. Otherwise, you'll need to delete the existing data tables in order to re-run
# Note: There is no space separating the different fields
cmd = [
    'dx',
    'extract_dataset',
    control_dataset,
    '--fields',
    field_names,
    '--delimiter',
    ',',
    '--output',
    'control_dictionary.csv',
]
subprocess.check_call(cmd)

In [None]:
cmd = [
    'dx',
    'extract_dataset',
    case_dataset,
    '--fields',
    field_names,
    '--delimiter',
    ',',
    '--output',
    'case_dictionary.csv',
]
subprocess.check_call(cmd)

In [None]:
control_dict_csv = 'control_dictionary.csv'
control_df = pd.read_csv(control_dict_csv)
print(control_df.shape)
control_df.head()

In [None]:
# Strip the 'participant.' entity prefix from the column headers
control_df = control_df.rename(columns=lambda x: re.sub(r'^participant\.', '', x))
control_df.head()

In [None]:
case_dict_csv = 'case_dictionary.csv'
case_df = pd.read_csv(case_dict_csv)
print(case_df.shape)
case_df.head()

In [None]:
case_df = case_df.rename(columns=lambda x: re.sub(r'^participant\.', '', x))
case_df.head()

Create phenotype variable and concatenate cohorts into one dataframe.

In [None]:
case_df['ischemia_cc'] = 1
control_df['ischemia_cc'] = 0

In [None]:
df = pd.concat([case_df, control_df])

In [None]:
# Note: The counts should be consistent with the counts from the Cohort Browser
df.ischemia_cc.value_counts()
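
Besides comparing counts with the Cohort Browser, it can be worth confirming that no participant ends up in both cohorts. The sketch below shows one way to do this on toy data. The frames `case_toy` and `control_toy` are made up for illustration, and `ignore_index=True` is an optional choice that gives the combined frame a fresh index.

```python
import pandas as pd

# Toy cohorts standing in for case_df and control_df
case_toy = pd.DataFrame({'eid': [1, 2, 3], 'p31': [0, 1, 1]})
control_toy = pd.DataFrame({'eid': [4, 5], 'p31': [1, 0]})

case_toy['ischemia_cc'] = 1
control_toy['ischemia_cc'] = 0

# ignore_index=True renumbers the rows 0..n-1 instead of repeating each frame's index
combined = pd.concat([case_toy, control_toy], ignore_index=True)

# A participant present in both cohorts would appear twice with conflicting labels
overlap = set(case_toy['eid']) & set(control_toy['eid'])
if overlap:
    raise ValueError(f'{len(overlap)} participants are in both cohorts')

counts = combined['ischemia_cc'].value_counts().to_dict()
print(counts)
```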

Here is an example of the retrieved data.

*Note: this table contains synthetic data and is made for demonstration purposes only!*

|Row|eid|p31|p2966_i0|p2966_i1|p2966_i2|p2966_i3|p22001|p22006|p22019|p22021|p21022|p23104_i0|p20160_i0|p30760_i0|p30780_i0|p22020|ischemia_cc|
|:---|:--- |:---|:--- |:---|:---|:---|:---|:--- |:---|:---|:--- |:---|:--- |:---|:--- |:---|:---|
|0 |EID-XXXXXXX|0 |58.0|NA|NA|NA|0.0|1.0|NA|0.0|61.0|28.5|1.0|0.959|2.456|1.0|1|
|1 |EID-YYYYYYY|1 |NA  |NA|NA|NA|1.0|1.0|NA|0.0|51.0|24.5|1.0|1.159|2.719|1.0|1|
|2 |EID-ZZZZZZZ|1 |50.0|NA|NA|NA|1.0|1.0|NA|1.0|64.0|29.7|1.0|0.970|3.440|NA |1|

In [None]:
print(df.shape)
df.head()

## QC samples based on several conditions

In [None]:
df_qced = df[
    (df['p31'] == df['p22001'])  # Reported sex and genetic sex are the same
    & (df['p22006'] == 1)  # Only include Caucasian ancestry (In_white_british_ancestry_subset)
    & (df['p22019'].isnull())  # No sex chromosome aneuploidy
    & (df['p22020'] == 1)  # Participant was used to calculate PCA (only non-relatives were included)
].copy()

In [None]:
df_qced.ischemia_cc.value_counts()

In [None]:
df_qced.head()
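
When the QC step removes more samples than expected, it helps to see how many rows fail each condition. The sketch below builds one boolean mask per condition on toy data and counts the failures. The toy values and the mask names are illustrative, and each count is computed independently, so a sample that fails several conditions is counted once per condition.

```python
import numpy as np
import pandas as pd

# Toy data using the same columns as the QC cell above
toy = pd.DataFrame(
    {
        'p31': [0, 1, 1, 0, 1],
        'p22001': [0, 1, 0, 0, 1],
        'p22006': [1, 1, 1, np.nan, 1],
        'p22019': [np.nan, np.nan, np.nan, np.nan, 1],
        'p22020': [1, np.nan, 1, 1, 1],
    }
)

# One boolean column per QC condition (True means the sample passes)
mask_table = pd.DataFrame(
    {
        'sex_matches_genetic_sex': toy['p31'] == toy['p22001'],
        'in_ancestry_subset': toy['p22006'] == 1,
        'no_sex_aneuploidy': toy['p22019'].isnull(),
        'used_in_pca': toy['p22020'] == 1,
    }
)

attrition = (~mask_table).sum()  # samples failing each condition
keep = mask_table.all(axis=1)  # samples passing every condition
toy_qced = toy[keep].copy()

print(attrition)
print(len(toy_qced), 'of', len(toy), 'samples pass all conditions')
```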

## Prepare covariates for subsequent regression analysis

Combine all instances of the `Age high blood pressure diagnosed` (`2966`) field into a single boolean `hypertension` variable.

In [None]:
df_qced['hypertension'] = [
    np.nansum([float(row[i]) for i in [0, 1, 2, 3]]) > 0
    for row in df_qced[['p2966_i0', 'p2966_i1', 'p2966_i2', 'p2966_i3']].to_numpy()
]
df_qced['hypertension'] = df_qced['hypertension'].astype(int)
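
The rule in the cell above sets the flag to 1 when the sum of the four instances, ignoring missing values, is positive. As a sketch, the same rule can be written with a row-wise pandas sum and checked against the loop on toy data. The toy rows are made up. The last one uses a negative value to stand in for a special code such as "do not know", which is an assumption about how the field may be coded.

```python
import numpy as np
import pandas as pd

hbp_cols = ['p2966_i0', 'p2966_i1', 'p2966_i2', 'p2966_i3']
toy = pd.DataFrame(
    [
        [58.0, np.nan, np.nan, np.nan],  # age reported at one instance
        [np.nan, np.nan, np.nan, np.nan],  # never reported
        [np.nan, 45.0, 45.0, np.nan],  # reported at two instances
        [-1.0, np.nan, np.nan, np.nan],  # hypothetical special code only
    ],
    columns=hbp_cols,
)

# Loop form, as in the notebook cell
loop_flag = [
    int(np.nansum([float(value) for value in row]) > 0)
    for row in toy[hbp_cols].to_numpy()
]

# Vectorized form: DataFrame.sum skips NaN by default and returns 0 for all-NaN rows
vector_flag = (toy[hbp_cols].sum(axis=1) > 0).astype(int)

print(loop_flag)
print(vector_flag.tolist())
```

Both forms treat a row containing only a negative value as not hypertensive, because the sum is not positive.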

For the `Ever smoked` (`20160`) field, set all missing values to `0`.

In [None]:
df_qced['p20160_i0'] = df_qced['p20160_i0'].fillna(0)

For the remaining covariate columns (BMI, HDL cholesterol, LDL direct), replace missing values with the mean of that column.

In [None]:
for col in ['p23104_i0', 'p30760_i0', 'p30780_i0']:
    df_qced[col] = df_qced[col].fillna(df_qced[col].mean())

In [None]:
df_qced.isna().sum()
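
The two imputation rules can be sketched on toy data as follows, with missing-value counts taken before and after. The toy values are made up. Assigning the result of `fillna` back to the columns, rather than calling it with `inplace=True` on a selected column, is a deliberate choice here because it does not depend on how pandas handles in-place changes to a column selection.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame(
    {
        'p20160_i0': [1.0, np.nan, 0.0, 1.0],  # ever smoked
        'p23104_i0': [28.5, np.nan, 24.5, 31.0],  # BMI
        'p30760_i0': [0.9, 1.1, np.nan, 1.6],  # HDL cholesterol
    }
)

missing_before = toy.isna().sum()

# Rule 1: a missing 'ever smoked' value becomes 0
toy['p20160_i0'] = toy['p20160_i0'].fillna(0)

# Rule 2: missing continuous covariates are replaced by their column mean
mean_imputed_cols = ['p23104_i0', 'p30760_i0']
toy[mean_imputed_cols] = toy[mean_imputed_cols].fillna(toy[mean_imputed_cols].mean())

missing_after = toy.isna().sum()
print(missing_before.to_dict())
print(missing_after.to_dict())
```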

## Rename columns and organize them in a format suitable for PLINK and regenie

In [None]:
# Rename columns for better readability
df_qced = df_qced.rename(columns=lambda x: re.sub('p22009_a','pc',x))
df_qced = df_qced.rename(
    columns={
        'eid': 'IID',
        'p31': 'sex',
        'p21022': 'age',
        'p20160_i0': 'ever_smoked',
        'p23104_i0': 'bmi',
        'p30760_i0': 'hdl_cholesterol',
        'p30780_i0': 'ldl_cholesterol',
    }
)

# Add FID column -- required input format for regenie
df_qced['FID'] = df_qced['IID']

# Create a phenotype table from our QCed data
cols = [
        'FID',
        'IID',
        'sex',
        'age',
        'bmi',
        'ever_smoked',
        'hdl_cholesterol',
        'ldl_cholesterol',
        'hypertension',
        'ischemia_cc',
]
cols.extend([col for col in df_qced if 'pc' in col])
df_phenotype = df_qced[cols]

In [None]:
df_phenotype.head()
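
Before handing the table to downstream tools, a few structural checks can catch common problems. The sketch below assumes that the tool expects `FID` and `IID` as the first two columns and a binary phenotype coded 0/1. The helper `check_phenotype_table` is hypothetical and the toy tables are made up for illustration.

```python
import pandas as pd


def check_phenotype_table(table, phenotype='ischemia_cc'):
    '''Return a list of problems found in a phenotype table (empty if none).'''
    problems = []
    if list(table.columns[:2]) != ['FID', 'IID']:
        problems.append('first two columns should be FID and IID')
    if 'IID' in table and table['IID'].duplicated().any():
        problems.append('duplicate IID values')
    if not set(table[phenotype].dropna().unique()) <= {0, 1}:
        problems.append(f'{phenotype} should be coded 0/1')
    return problems


good = pd.DataFrame(
    {'FID': [1, 2], 'IID': [1, 2], 'sex': [0, 1], 'ischemia_cc': [1, 0]}
)
bad = pd.DataFrame({'IID': [1, 1], 'FID': [1, 1], 'ischemia_cc': [1, 2]})

good_problems = check_phenotype_table(good)
bad_problems = check_phenotype_table(bad)
print(good_problems)
print(bad_problems)
```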

## Select only samples that have imputed data available and save phenotype table as TSV

In [None]:
# Get imputed data
path_to_impute_file = f'/mnt/project/Bulk/Imputation/{imputation_folder}/ukb{imputation_field_id}_c1_b0_v1.sample'
sample_file = pd.read_csv(
    path_to_impute_file,
    delimiter=r'\s',
    header=0,
    names=['FID', 'IID', 'missing', 'sex'],
    engine='python',
)
# Intersect the phenotype file and the imputed .sample file
# to generate phenotype DataFrame for only samples included in the imputed data
ischemia_df = df_phenotype.join(
    sample_file.set_index('IID'), on='IID', rsuffix='_sample', how='inner'
)
# Drop redundant columns that came from the .sample file
ischemia_df = ischemia_df.drop(
    columns=['FID_sample', 'missing', 'sex_sample'],
    errors='ignore',
)
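
A `.sample` file in the Oxford format starts with two header lines: the column names and a line of column types. The sketch below parses a small in-memory example, skips the type line explicitly, and keeps only the phenotype rows whose `IID` appears in the sample file. The file content and IDs are made up, and `skiprows=[1]` is an alternative to relying on the inner join to discard the type line.

```python
import io

import pandas as pd

# Toy .sample content: a header line, a column-type line, then one row per sample
sample_text = '''ID_1 ID_2 missing sex
0 0 0 D
1000001 1000001 0 1
1000002 1000002 0 2
1000003 1000003 0 1
'''

sample_toy = pd.read_csv(
    io.StringIO(sample_text),
    sep=r'\s+',
    header=0,  # replace the original header with the names below
    names=['FID', 'IID', 'missing', 'sex'],
    skiprows=[1],  # drop the column-type line
)

pheno_toy = pd.DataFrame(
    {
        'FID': [1000001, 1000003, 1000009],
        'IID': [1000001, 1000003, 1000009],
        'ischemia_cc': [1, 0, 0],
    }
)

# Keep only participants that also have imputed data
merged = pheno_toy.merge(sample_toy[['IID']], on='IID', how='inner')
print(merged)
```

Merging on only the `IID` column of the sample file avoids creating suffixed duplicate columns that would need to be dropped afterwards.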

In [None]:
# Write the phenotype table to a tab-separated (TSV) file
ischemia_df.to_csv('ischemia_df.phe', sep='\t', na_rep='NA', index=False, quoting=3)

In [None]:
ischemia_df.head()

## Load file to project storage

In [None]:
%%bash -s "$output_dir"
# Upload the geno-pheno intersect phenotype file back to the RAP project
dx upload ischemia_df.phe -p --path $1 --brief

Here is an example of the phenotype file.

*Note: this table contains synthetic data and is made for demonstration purposes only!*

|Row|FID|IID|sex|age|bmi|ever_smoked|hypertension|hdl_cholesterol|ldl_cholesterol|ischemia_cc|
|:--- |:--- |:---|:--- |:---|:---|:---|:---|:--- |:---|:---|
|0|EID-XXXXXXX|1234567|0|61.0|28.5|0|1|0.959|2.456|1|
|1|EID-YYYYYYY|1234568|1|51.0|24.5|0|0|1.159|2.719|0|
|2|EID-ZZZZZZZ|1234569|0|60.0|23.8|1|1|1.579|3.396|0|

## Output files

- Table containing phenotype and covariates to be used in regenie GWAS analysis (`ischemia_df.phe`)