Do the initial setup. This involves the next four code blocks:
1. Import the required modules:
        * pyspark – allows for running on multiple nodes
        * dxpy – allows for querying DNAnexus-specific files
        * dxdata – allows for querying DNAnexus-specific databases
        * pandas – dataframe manipulation
2. Connect to the pyspark cluster to be able to pull UK Biobank data, which is structured in a MySQL-like database
3. Use dxpy to find the database ‘file’ in our current project
4. Load the database into this instance with the dxdata package

In [1]:
import pyspark
import dxpy
import dxdata
import pandas as pd
import numpy as np
import subprocess
import math
import csv

In [2]:
sc = pyspark.SparkContext()
spark = pyspark.sql.SparkSession(sc)

In [3]:
dispensed_dataset_id = dxpy.find_one_data_object(typename = "Dataset",
                                                name = "app*.dataset", folder = "/",
                                                name_mode = "glob") ["id"]

In [4]:
dataset = dxdata.load_dataset(id = dispensed_dataset_id)
dataset.entities

[<Entity "participant">,
 <Entity "death">,
 <Entity "death_cause">,
 <Entity "hesin">,
 <Entity "hesin_critical">,
 <Entity "hesin_delivery">,
 <Entity "hesin_diag">,
 <Entity "hesin_maternity">,
 <Entity "hesin_oper">,
 <Entity "hesin_psych">,
 <Entity "covid19_result_england">,
 <Entity "covid19_result_scotland">,
 <Entity "covid19_result_wales">,
 <Entity "gp_clinical">,
 <Entity "gp_registrations">,
 <Entity "gp_scripts">]

Now we get the participant SQL table, which gives the dxdata package the information
it needs to extract individual participant data.

In [5]:
participant = dataset["participant"]

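Now use the participant table to pull specific data fields from the UK Biobank database. Each field is requested by a name built from its UK Biobank field ID (e.g. field 22001, genetic sex, is p22001), with an _i or _a suffix for the instance or array index where needed. Here we need to extract several fields: genetic sex (22001), age at assessment (21003), genotyping array batch (22000), and the 40 genetic principal components (22009):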

In [6]:
search_fields = ['eid', 'p22001', 'p21003_i0', 'p22000']
rename_dict = {'p22001': '22001-0.0', 'p21003_i0': '21003-0.0', 'p22000': '22000-0.0'}
for PC in range(1,41):
    search_fields.append(f'p22009_a{PC}')
    rename_dict[f'p22009_a{PC}'] = f'22009-0.{PC}'
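
The naming pattern used in this cell can be captured in a pair of small helpers. This is a minimal sketch: `rap_name` and `legacy_name` are hypothetical helpers (not part of dxdata), and the pattern is inferred only from the names that appear in this notebook. Combining both suffixes as `_i<N>_a<N>` for fields that have an instance and an array index is an assumption.

```python
def rap_name(field_id, instance=None, array=None):
    """Build a RAP-style column name such as 'p21003_i0' or 'p22009_a1'."""
    name = f"p{field_id}"
    if instance is not None:
        name += f"_i{instance}"
    if array is not None:
        name += f"_a{array}"  # assumption: the array suffix follows the instance suffix
    return name


def legacy_name(field_id, instance=0, array=0):
    """Build the pre-RAP 'field-instance.array' name, e.g. '22009-0.1'."""
    return f"{field_id}-{instance}.{array}"


# Rebuild the same lists as the cell above from (field, instance, array) triples.
specs = [(22001, None, None), (21003, 0, None), (22000, None, None)]
specs += [(22009, None, pc) for pc in range(1, 41)]

fields = ["eid"] + [rap_name(*spec) for spec in specs]
renames = {rap_name(f, i, a): legacy_name(f, i or 0, a or 0) for f, i, a in specs}

print(fields[:5])
print(renames["p22009_a40"])
```

Keeping the field list as triples makes it easier to add further fields later without hand-writing both naming styles.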
    

Here we are actually extracting the per-individual values. 

In [7]:
df = participant.retrieve_fields(names = search_fields,
                                 coding_values = "keep",
                                 engine = dxdata.connect())

In [8]:
# Convert data frame to Pandas
df_pandas = df.toPandas()

Now we are going to get information on the WES batch for each study individual. This requires bcftools to be installed. To install bcftools in a pycharm spark instance, open a new terminal and type:

```
apt-get update
apt-get install bcftools
```

Also remember to change the file/project IDs below to your specific project.

In [9]:
# Define file IDs for each VCF we need:
vcf_200k = 'file-Fz7GfKjJfQZy0y652gXxxYJz'
vcf_450k = 'file-G56qJV8JykJqJ3p94qzbgFbq'
vcf_470k = 'file-G97fyZ8JykJV073Z34fYgYz9'

dxpy.download_dxfile(vcf_200k, '200k_chrY.vcf.gz', project='project-GFPBQv8J0zVvBX9XJyqyqYz1')
dxpy.download_dxfile(vcf_450k, '450k_chrY.vcf.gz', project='project-GFPBQv8J0zVvBX9XJyqyqYz1')
dxpy.download_dxfile(vcf_470k, '470k_chrY.vcf.gz', project='project-GFPBQv8J0zVvBX9XJyqyqYz1')

In [10]:
for version in ['200k', '450k', '470k']:
    cmd = f'bcftools query -l {version}_chrY.vcf.gz > {version}_samples.txt'
    # check=True raises CalledProcessError if bcftools fails (e.g. it is not installed)
    subprocess.run(cmd, shell=True, check=True)
    print(cmd)

bcftools query -l 200k_chrY.vcf.gz > 200k_samples.txt
bcftools query -l 450k_chrY.vcf.gz > 450k_samples.txt
bcftools query -l 470k_chrY.vcf.gz > 470k_samples.txt
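
If installing bcftools is inconvenient, the sample IDs can also be read straight from the VCF header. The following is a minimal sketch rather than the method used in this notebook: `vcf_sample_ids` is a hypothetical helper, and it assumes a standard VCF whose `#CHROM` header line lists the sample IDs from the tenth column onward. Python's gzip module can read bgzip-compressed files because they are valid multi-member gzip streams.

```python
import gzip


def vcf_sample_ids(path):
    """Return the sample IDs listed on the #CHROM header line of a gzipped VCF."""
    with gzip.open(path, "rt") as vcf:
        for line in vcf:
            if line.startswith("#CHROM"):
                # The first nine columns are fixed (CHROM ... FORMAT); samples follow.
                return line.rstrip("\n").split("\t")[9:]
            if not line.startswith("#"):
                break  # reached the records without finding a header line
    raise ValueError(f"no #CHROM header line found in {path}")


# Tiny self-made VCF for illustration only; the sample IDs are invented.
with gzip.open("toy_chrY.vcf.gz", "wt") as out:
    out.write("##fileformat=VCFv4.2\n")
    out.write("#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\t1000001\t1000002\n")
    out.write("chrY\t100\t.\tA\tG\t.\tPASS\t.\tGT\t0\t1\n")

print(vcf_sample_ids("toy_chrY.vcf.gz"))
```

Because only the header is read, this stays fast even for large VCFs.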


In [11]:
# Read one sample ID per line into a set, closing the file afterwards
def read_samples(path):
    with open(path, 'r') as sample_file:
        return {line.rstrip() for line in sample_file}

samples_470k = read_samples('470k_samples.txt')
print(len(samples_470k))

samples_450k = read_samples('450k_samples.txt')
print(len(samples_450k))

samples_200k = read_samples('200k_samples.txt')
print(len(samples_200k))

469835
454756
200643


Here we lightly reformat the pandas dataframe to get it into the shape we want. See the comments below for the specific changes.

In [12]:
# Set names to the old-school versions from the UKBB file pre-DNAnexus for compatibility
df_pandas = df_pandas.rename(columns = rename_dict)

# Cast sex to an int. I convert to a string to force into 'int' as numpy does not have an NA int dtype
df_pandas['22001-0.0'] = df_pandas['22001-0.0'].apply(lambda x: 'NA' if np.isnan(x) else str(int(x)))

# Decide which batch an individual belongs to
# Note that 50k is not here, yet, due to unavailability of any 50k data on the RAP
def decide_wes(eid):
    if eid in samples_200k:
        return '200k'
    elif eid in samples_450k:
        return '450k'
    elif eid in samples_470k:
        return '470k'
    else:
        return None

df_pandas['wes.batch'] = df_pandas['eid'].apply(decide_wes)
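
Because the checks in `decide_wes` run in order, an individual who appears in more than one sample list is labelled with the earliest release that contains them. The sketch below illustrates this with invented IDs; `earliest_batch` and the toy sets are hypothetical, and the nesting of the toy sets (each release containing the previous one) is an assumption made for the example.

```python
# Invented IDs for illustration; the real sample sets come from the bcftools output.
toy_200k = {"A"}
toy_450k = {"A", "B"}
toy_470k = {"A", "B", "C"}

# Ordered from earliest to latest release; the first match wins.
toy_batches = [("200k", toy_200k), ("450k", toy_450k), ("470k", toy_470k)]


def earliest_batch(eid, batches):
    """Return the label of the first sample set containing eid, else None."""
    for label, samples in batches:
        if eid in samples:
            return label
    return None


for eid in ["A", "B", "C", "D"]:
    print(eid, earliest_batch(eid, toy_batches))
```

Reordering the list would change the labels, so the release order is part of the logic, not just a matter of style.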


Here we read in the .fam file for the genetic data we just created, so that we can add a variable indicating whether genetic data is available for each individual (again, remember to update the file IDs):

In [13]:
fam_470k = 'file-GGJb0zQJ80QbJ1fgJQpFQpK6'

dxpy.download_dxfile(fam_470k, 'UKBB_470K_Autosomes_QCd.fam', project='project-GFPBQv8J0zVvBX9XJyqyqYz1')

In [14]:
fam_columns = ['FID','IID','FATHER_ID','MOTHER_ID','sex','phenotype']
genetics_samples = set()
with open('UKBB_470K_Autosomes_QCd.fam', 'r') as fam_file:
    fam_reader = csv.DictReader(fam_file, fieldnames=fam_columns, delimiter="\t")
    for indv in fam_reader:
        genetics_samples.add(indv['FID'])

# And add a column into the master covariate table indicating presence in this fam file:
df_pandas['genetics_qc_pass'] = df_pandas['eid'].isin(genetics_samples).astype(int)
df_pandas['genetics_qc_pass'].value_counts()


1    468519
0     33890
Name: genetics_qc_pass, dtype: int64
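
A PLINK .fam file has six columns and no header row, which is why the field names are passed explicitly to the reader. The following is a minimal sketch of the same membership check on invented data, using only the standard library; `toy_fam`, `toy_eids`, and the IDs are made up for illustration.

```python
import csv
import io

# Two invented samples in .fam layout: FID, IID, father, mother, sex, phenotype.
toy_fam = io.StringIO("1000001\t1000001\t0\t0\t1\t-9\n"
                      "1000003\t1000003\t0\t0\t2\t-9\n")

columns = ["FID", "IID", "FATHER_ID", "MOTHER_ID", "sex", "phenotype"]
reader = csv.DictReader(toy_fam, fieldnames=columns, delimiter="\t")
passed_qc = {row["FID"] for row in reader}

# Flag each participant ID as 1 (present in the .fam file) or 0 (absent).
toy_eids = ["1000001", "1000002", "1000003"]
flags = [int(eid in passed_qc) for eid in toy_eids]
print(flags)
```

Note that the IDs are compared as strings, so the participant IDs in the dataframe need to be strings as well for the membership test to match.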

We need to convert the array batch into a categorical string value to enable easy processing during association tests.

In [26]:
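# Positive codes become 'axiomN' and negative codes 'bileveN'; missing values stay missing so they are written as NA
df_pandas['22000-0.0'] = df_pandas['22000-0.0'].apply(lambda x: np.nan if np.isnan(x) else (f'axiom{x:0.0f}' if x > 0 else f'bileve{-x:0.0f}'))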
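
To make the mapping concrete, here is a standalone sketch of the rule applied to a few example codes. `label_array_batch` is a hypothetical name, the codes are chosen only for illustration, and returning None for a missing code is an assumption about the desired behaviour.

```python
import math


def label_array_batch(code):
    """Map a numeric batch code to 'axiom<N>' or 'bileve<N>', or None if missing."""
    if code is None or math.isnan(code):
        return None
    if code > 0:
        return f"axiom{code:0.0f}"
    return f"bileve{-code:0.0f}"


for code in [1.0, 95.0, -3.0, float("nan")]:
    print(code, "->", label_array_batch(code))
```

A named function like this can be passed directly to `Series.apply` in place of the lambda, which also makes the missing-value case easier to test.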

In [28]:
# Write an output CSV file:
df_pandas.to_csv('base_covariates.covariates', sep = "\t", na_rep="NA", index=False)

Make sure to upload the final file from a terminal:
dx upload base_covariates.covariates

**REMEMBER** This will upload to the container, NOT the project, so be sure to copy it into your project. No idea why it works this way...