Perform the initial setup, which spans the next four code blocks:
1. Import required modules:
        * pyspark – allows for distributed processing across multiple nodes
        * dxpy – allows for querying DNAnexus-specific files
        * dxdata – allows for querying DNAnexus-specific databases
        * pandas – dataframe manipulation
        * numpy, subprocess, math – general-purpose utilities also imported in the first cell
2. Connect to the Spark cluster so that we can pull UK Biobank data, which is stored in a MySQL-like database.
3. Use dxpy to find the database ‘file’ (a Dataset object) in our current project.
4. Load the database into this instance with the dxdata package.

In [1]:
import pyspark
import dxpy
import dxdata
import pandas as pd
import numpy as np
import subprocess
import math

In [2]:
sc = pyspark.SparkContext()
spark = pyspark.sql.SparkSession(sc)

In [3]:
dispensed_dataset_id = dxpy.find_one_data_object(typename = "Dataset",
                                                 name = "app*.dataset", folder = "/",
                                                 name_mode = "glob")["id"]

In [4]:
dataset = dxdata.load_dataset(id = dispensed_dataset_id)
dataset.entities

[<Entity "participant">,
 <Entity "death">,
 <Entity "death_cause">,
 <Entity "hesin">,
 <Entity "hesin_critical">,
 <Entity "hesin_delivery">,
 <Entity "hesin_diag">,
 <Entity "hesin_maternity">,
 <Entity "hesin_oper">,
 <Entity "hesin_psych">,
 <Entity "covid19_result_england">,
 <Entity "covid19_result_scotland">,
 <Entity "covid19_result_wales">,
 <Entity "gp_clinical">,
 <Entity "gp_registrations">,
 <Entity "gp_scripts">]

Next, get the participant SQL table, which gives the dxdata package the information it needs
to extract per-participant data.

In [5]:
participant = dataset["participant"]

Generate a set of all available field names to allow for easy querying (the built-in find_fields is _TERRIBLE_).

In [18]:
all_fields = participant.fields
all_names = set()

for field in all_fields:
    all_names.add(field.name)
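
A set like this can be queried in more than one way. A plain substring test such as `'40006' in name` is the simplest, but it would also match an unrelated name like `p140006`. The block below is a minimal sketch of a stricter lookup. It assumes that field names follow a `p<field ID>`, `p<field ID>_i<instance>` or `p<field ID>_i<instance>_a<array>` pattern; the `names_for_field` helper and the sample names are hypothetical and are not part of dxdata.

```python
import re

def names_for_field(field_id, names):
    """Return the sorted names in `names` that belong to one field ID.

    Assumption: names look like 'p40006', 'p40006_i0' or 'p40006_i0_a1'.
    """
    pattern = re.compile(rf"p{re.escape(str(field_id))}(_i\d+)?(_a\d+)?")
    # fullmatch rejects names that merely contain the ID, e.g. 'p140006'
    return sorted(name for name in names if pattern.fullmatch(name))

# Hypothetical sample standing in for all_names
sample_names = {"eid", "p40006_i0", "p40006_i1", "p140006", "p41270"}
print(names_for_field("40006", sample_names))
```

With the real `all_names` set, the returned list could be appended to `['eid']` to build a query, provided the naming assumption holds for the dataset at hand.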

Now use the participant table to pull specific data fields from the UK Biobank database. Each field is identified by its UK Biobank field ID (e.g. field 22001 is genetic sex). For each ID, we collect every name in all_names that contains it and pass those names to retrieve_fields. Here we need to extract several fields:

In [37]:
search_strings = [
    '40006', # Cancer ICD-10
    '40013', # Cancer ICD-9
    '41270', # HES ICD-10
    '41271', # HES ICD-9
    '40001', # Primary Death ICD-10
    '40002'  # Secondary Death ICD-10
]
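# Create the engine once and reuse it for every query
engine = dxdata.connect()

for search in search_strings:
    # Always include the participant ID, then every field name containing the field ID
    query = ['eid']
    for title in all_names:
        if search in title:
            query.append(title)

    df = participant.retrieve_fields(names = query,
                                     coding_values = "keep",
                                     engine = engine)
    # Convert the Spark DataFrame to pandas and write one TSV per field ID
    df_pandas = df.toPandas()
    df_pandas.to_csv(f'{search}_extracted.tsv', sep = "\t", na_rep="NA", index=False)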


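This loop is where the per-individual values are actually extracted: for each field ID, the matching columns are retrieved, converted to a pandas DataFrame, and written to a tab-separated file named <field ID>_extracted.tsv, with missing values recorded as NA.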