# HERMES 3.0 UK Biobank Phenotyping
#### Nicholas Sunderland - Mar 2025
#### nicholas.sunderland@bristol.ac.uk

## Setup
First open an R JupytrLab session on the UK Biobank RAP with standard settings. 

Open the terminal app and clone the heRmes Github repository into `/opt/notebooks` by running: 

`git clone https://github.com/nicksunderland/heRmes.git`

This notebook is located in the resulting heRmes folder at: 

`/opt/notebooks/heRmes/scripts/hermes3_phenotyping_ukbb_rap.ipynb

You need to set the project and record IDs below to your own project.

* **Project ID**: the `Project ID` is available from the 'Settings' tab after clicking on your project in the UK Biobak RAP.  
* **Record ID**: the record or dataset `ID` is available from the 'Manage' tab, where you select the checkbox to the left of your dispensed `.dataset` (e.g. app81499_20241105095754.dataset) which then displays the `ID` in the information pane on the right.

Running this script will extract all of the data and process the phenotypes, resulting in a new folder in your project called `hermes3_data`. The phenotype file is called `hermes3_phenotypes.tsv.gz` and a counts summary is in `hermes3_phenotype_summary.tsv`.

In [None]:
projectid <- "project-GvZyZ20J81vgPJGbJy8pgpyq"
recordid  <- "record-Gvb0Bg0Jfxfv0q8Fb2pXqKjg"

### Libraries

In [2]:
library(glue)
library(data.table)

## Data extraction
Given the large datasets we make use of the `dx run table-exporter` to extract the required phenotype data. The extraction function below will create a table-exportor job which you will be able to track in the 'Monitor' table on your RAP's homepage. The data will be extracted to your project into a folder called `hermes3_data`. The data is not immediately uploaded to this session, although we will import it later. 

To get the small data dictionaries locally in this session we use the `dx extract_dataset` function. 

### Download data dictionary

In [3]:
setwd("/opt/notebooks")
dataset <- glue("{projectid}:{recordid}")
cmd <- glue("dx extract_dataset {dataset} -ddd")
system(cmd)
dict_files <- list.files(pattern="codings|data_dictionary|entity_dictionary")
data_dict_file <- dict_files[grepl("data_dictionary", dict_files)]

#### Data dictionary filter function

In [4]:
#' @title filter_data_dict
#'
#' @param dict_path, str, path to the dataset.data_dictionary.csv
#' @param codes_str, list, list of lists representing UKBB column name, table entity, and search strategy list(name=, entity=, search=). 
#'   name must be a valid column name in the data_dictionary, entity a valid entity in the entity dictionary, and search either "matches"
#'   for exact matches, or starts with to match cases of multiple instances (repeated measures usually)
#'
#' @returns a filtered subset of the data_dictionary 
#'
filter_data_dict <- function(dict_path, codes_struc) {
    
    data_dict <- fread(dict_path)
    
    d <- lapply(codes_struc, function(x) {
        
        d0 <- data.table()
        if (x$search=="matches") {
            d0 <- data_dict[entity==x$entity & name==x$name]
        } else if (x$search=="startswith") {
            d0 <- data_dict[entity==x$entity & grepl(paste0("^", x$name), name)]
        }
        
        if (nrow(d0)==0) {
            cat(glue("Code [{x$name}] not found in data dictionary\n"))
            stop("Code not found error")
        }
        
        d0
        
    }) |> rbindlist(idcol = "item")
    
    return(d)
}

### Data extraction function

In [5]:
#' @title extract_data
#'
#' @param dataset, str, a valid dataset id - format "{projectid}:{recordid}" 
#' @param fields, str, vector of UK-BB format column names e.g. p31
#' @param entity, str, string of length one - the entity to extract from e.g. participants
#' @param output, str, the base name for the output file, no extension
#'
#' @returns NULL side effect is starting a table-exporter job which outputs the file to /hermes3_data directory in the RAP
#'
extract_data <- function(dataset, fields, entity, output) {
    
    field_str <- paste0('-ifield_names="', fields, '"', collapse=" ") 
    
    cmd <- glue(
      "dx run table-exporter ",
      "-idataset_or_cohort_or_dashboard={dataset} ",
      "-ioutput={output} ",
      "-ioutput_format=TSV ",
      "-iheader_style=FIELD-NAME ",
      "-icoding_option=RAW ",
      "{field_str} ",
      "-ientity={entity} ",
      "--destination hermes3_data/"
    )    

    o <- system(cmd, intern = TRUE)
    cat(o, sep = "\n")
}

### Define participant data

In [6]:
participant_codes= list(eid                = list(name="eid",       entity="participant", search="matches"),
                        reason_lost_fu     = list(name="p190",      entity="participant", search="matches"),
                        sex                = list(name="p31",       entity="participant", search="matches"),
                        age                = list(name="p21022",    entity="participant", search="matches"),
                        ethnicity          = list(name="p21000",    entity="participant", search="startswith"),
                        genetic_sex        = list(name="p22001",    entity="participant", search="matches"),
                        genetic_ethnicity  = list(name="p22006",    entity="participant", search="matches"),
                        pc1                = list(name="p22009_a1", entity="participant", search="matches"),
                        pc2                = list(name="p22009_a2", entity="participant", search="matches"),
                        pc3                = list(name="p22009_a3", entity="participant", search="matches"),
                        pc4                = list(name="p22009_a4", entity="participant", search="matches"),
                        pc5                = list(name="p22009_a5", entity="participant", search="matches"))

participant_data_dict = filter_data_dict(data_dict_file, participant_codes)
head(participant_data_dict, 3)

item,entity,name,type,primary_key_type,coding_name,concept,description,folder_path,is_multi_select,is_sparse_coding,linkout,longitudinal_axis_type,referenced_entity_field,relationship,title,units
<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<lgl>,<lgl>,<chr>,<chr>,<chr>,<chr>,<lgl>,<chr>,<chr>,<chr>,<chr>
eid,participant,eid,string,global,,,,Participant Information,,,,,,,Participant ID,
reason_lost_fu,participant,p190,integer,,data_coding_1965,,,Population characteristics > Ongoing characteristics,,,http://biobank.ctsu.ox.ac.uk/crystal/field.cgi?id=190,,,,Reason lost to follow-up,
sex,participant,p31,integer,,data_coding_9,,,Population characteristics > Baseline characteristics,,,http://biobank.ctsu.ox.ac.uk/crystal/field.cgi?id=31,,,,Sex,


### Define self-report illness data
Give there are lots of columns and extraction fails with so much data I do this in a loop. 

In [7]:
self_illness_codes=list(eid                = list(name="eid",    entity="participant", search="matches"),
                        self_rep_ill       = list(name="p20002", entity="participant", search="startswith"), # 0:3 instances
                        self_rep_ill_year  = list(name="p20008", entity="participant", search="startswith"), # 0:3 instances
                        self_rep_proc      = list(name="p20004", entity="participant", search="startswith"), # 0:3 instances
                        self_rep_proc_year = list(name="p20010", entity="participant", search="startswith")) # 0:3 instances

self_rep_data_dict = filter_data_dict(data_dict_file, self_illness_codes)
head(self_rep_data_dict, 3)

item,entity,name,type,primary_key_type,coding_name,concept,description,folder_path,is_multi_select,is_sparse_coding,linkout,longitudinal_axis_type,referenced_entity_field,relationship,title,units
<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<lgl>,<lgl>,<chr>,<chr>,<chr>,<chr>,<lgl>,<chr>,<chr>,<chr>,<chr>
eid,participant,eid,string,global,,,,Participant Information,,,,,,,Participant ID,
self_rep_ill,participant,p20002_i0_a0,integer,,data_coding_6,,,Assessment centre > Verbal interview > Medical conditions,,,http://biobank.ctsu.ox.ac.uk/crystal/field.cgi?id=20002,,,,"Non-cancer illness code, self-reported | Instance 0 | Array 0",
self_rep_ill,participant,p20002_i0_a1,integer,,data_coding_6,,,Assessment centre > Verbal interview > Medical conditions,,,http://biobank.ctsu.ox.ac.uk/crystal/field.cgi?id=20002,,,,"Non-cancer illness code, self-reported | Instance 0 | Array 1",


### Define HES inpatient data

In [8]:
hesin_to_extract = list(eid                = list(name="eid",       entity="hesin", search="matches"),
                        ins_index          = list(name="ins_index", entity="hesin", search="matches"),
                        epistart           = list(name="epistart",  entity="hesin", search="matches"),
                        admidate           = list(name="admidate",  entity="hesin", search="matches"))

hes_data_dict = filter_data_dict(data_dict_file, hesin_to_extract)
head(hes_data_dict, 4)

item,entity,name,type,primary_key_type,coding_name,concept,description,folder_path,is_multi_select,is_sparse_coding,linkout,longitudinal_axis_type,referenced_entity_field,relationship,title,units
<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<lgl>,<lgl>,<chr>,<chr>,<chr>,<chr>,<lgl>,<chr>,<chr>,<chr>,<chr>
eid,hesin,eid,string,,,,,,,,,,participant:eid,many_to_one,Participant ID,
ins_index,hesin,ins_index,integer,,,,,,,,,,,,Instance index,
epistart,hesin,epistart,date,,,,,,,,,,,,Episode start date,
admidate,hesin,admidate,date,,,,,,,,,,,,Date of admission to hospital,


### Define HES diagnoses data

In [9]:
hesdiag_to_extract=list(eid                = list(name="eid",        entity="hesin_diag", search="matches"),
                        ins_index          = list(name="ins_index",  entity="hesin_diag", search="matches"),
                        diag_icd9          = list(name="diag_icd9",  entity="hesin_diag", search="matches"),
                        diag_icd10         = list(name="diag_icd10", entity="hesin_diag", search="matches"))

hesdiag_data_dict = filter_data_dict(data_dict_file, hesdiag_to_extract)
head(hesdiag_data_dict, 4)

item,entity,name,type,primary_key_type,coding_name,concept,description,folder_path,is_multi_select,is_sparse_coding,linkout,longitudinal_axis_type,referenced_entity_field,relationship,title,units
<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<lgl>,<lgl>,<chr>,<chr>,<chr>,<chr>,<lgl>,<chr>,<chr>,<chr>,<chr>
eid,hesin_diag,eid,string,,,,,,,,,,,,Participant ID,
ins_index,hesin_diag,ins_index,integer,,,,,,,,,,,,Instance index,
diag_icd9,hesin_diag,diag_icd9,string,,data_coding_87,,,,,,,,,,Diagnoses - ICD9,
diag_icd10,hesin_diag,diag_icd10,string,,data_coding_19,,,,,,,,,,Diagnoses - ICD10,


### Define HES procedures data

In [10]:
hesproc_to_extract=list(eid                = list(name="eid",       entity="hesin_oper", search="matches"),
                        ins_index          = list(name="ins_index", entity="hesin_oper", search="matches"),
                        oper3              = list(name="oper3",     entity="hesin_oper", search="matches"),
                        oper4              = list(name="oper4",     entity="hesin_oper", search="matches"))
hesoper_data_dict = filter_data_dict(data_dict_file, hesproc_to_extract)
head(hesoper_data_dict, 4)

item,entity,name,type,primary_key_type,coding_name,concept,description,folder_path,is_multi_select,is_sparse_coding,linkout,longitudinal_axis_type,referenced_entity_field,relationship,title,units
<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<lgl>,<lgl>,<chr>,<chr>,<chr>,<chr>,<lgl>,<chr>,<chr>,<chr>,<chr>
eid,hesin_oper,eid,string,,,,,,,,,,,,Participant ID,
ins_index,hesin_oper,ins_index,integer,,,,,,,,,,,,Instance index,
oper3,hesin_oper,oper3,string,,data_coding_259,,,,,,,,,,Operative procedures - OPCS3,
oper4,hesin_oper,oper4,string,,data_coding_240,,,,,,,,,,Operative procedures - OPCS4,


## Run Table-Exporter extraction

In [11]:
data_file_paths <- list(
    demog = "/mnt/project/hermes3_data/data_participant.tsv",
    self  = "/mnt/project/hermes3_data/data_selfreportedillness.tsv",
    hesin = "/mnt/project/hermes3_data/data_hesin.tsv",
    diag  = "/mnt/project/hermes3_data/data_hesin_diag.tsv",
    oper  = "/mnt/project/hermes3_data/data_hesin_oper.tsv"
)

if (!file.exists(data_file_paths$demog)) {
    extract_data(dataset=dataset, fields=participant_data_dict$name, entity="participant", output = "data_participant")
}
if (!file.exists(data_file_paths$self)) {
    extract_data(dataset=dataset, fields=self_rep_data_dict$name,    entity="participant", output = "data_selfreportedillness")
}
if (!file.exists(data_file_paths$hesin)) {
    extract_data(dataset=dataset, fields=hes_data_dict$name,         entity="hesin",       output = "data_hesin")
}
if (!file.exists(data_file_paths$diag)) {
    extract_data(dataset=dataset, fields=hesdiag_data_dict$name,     entity="hesin_diag",  output = "data_hesin_diag")
}
if (!file.exists(data_file_paths$oper)) {
    extract_data(dataset=dataset, fields=hesoper_data_dict$name,     entity="hesin_oper",  output = "data_hesin_oper")
}

## Read in extracted data

In [12]:
counter <- 0
data_files <- list()
while(TRUE) {
    
    if (counter > 60) {
        cat("Waited 1 hour and files not extracted - aborting\n")
        stop("extract timeout error")
    }
    
    found <- sapply(data_file_paths, file.exists)
    
    if (!all(found)) {
        cat(glue('****\nWaiting for extraction - {counter} minutes elapsed\n****\n'), sep="\n")
        cat(glue('{ifelse(found,"","->")}{names(found)}: file_found={found} : {data_files}'), sep="\n")
        flush.console()
        Sys.sleep(60*5)
        counter <- counter + 1
    } else {
        cat(glue('Reading data files\n'), sep="\n")
        flush.console()
        for (i in seq_along(data_file_paths)) {
            f <- data_file_paths[[i]]
            n <- names(data_file_paths)[i]
            cat(glue('...{n}: {f}\n'), sep="\n")
            flush.console()
            data_files[[n]] <- fread(f)
        }
        break
    }
}

lapply(data_files, head, n = 5)

Reading data files
...demog: /mnt/project/hermes3_data/data_participant.tsv
...self: /mnt/project/hermes3_data/data_selfreportedillness.tsv
...hesin: /mnt/project/hermes3_data/data_hesin.tsv
...diag: /mnt/project/hermes3_data/data_hesin_diag.tsv
...oper: /mnt/project/hermes3_data/data_hesin_oper.tsv


eid,p190,p31,p21022,p21000_i0,p21000_i1,p21000_i2,p21000_i3,p22001,p22006,p22009_a1,p22009_a2,p22009_a3,p22009_a4,p22009_a5
<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
1062757,,0,62,1001,,,,0.0,1.0,-12.6693,4.30928,-2.23438,-0.109226,-7.364
2217356,,1,57,1001,,,,,,,,,,
3712401,,1,56,1001,,,,1.0,1.0,-13.7558,5.64773,-3.58187,7.74165,21.0285
1011090,,0,49,1001,,,,0.0,1.0,-12.3794,2.03865,-0.837131,-0.562303,2.60283
2874739,,0,59,1001,,,,0.0,1.0,-12.6667,4.18819,-1.96996,2.85875,0.6597

eid,p20002_i0_a0,p20002_i0_a1,p20002_i0_a2,p20002_i0_a3,p20002_i0_a4,p20002_i0_a5,p20002_i0_a6,p20002_i0_a7,p20002_i0_a8,⋯,p20010_i3_a22,p20010_i3_a23,p20010_i3_a24,p20010_i3_a25,p20010_i3_a26,p20010_i3_a27,p20010_i3_a28,p20010_i3_a29,p20010_i3_a30,p20010_i3_a31
<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,⋯,<lgl>,<lgl>,<lgl>,<lgl>,<lgl>,<lgl>,<lgl>,<lgl>,<lgl>,<lgl>
1000074,1351,,,,,,,,,⋯,,,,,,,,,,
1000194,1086,,,,,,,,,⋯,,,,,,,,,,
1000258,1065,1265.0,1465.0,,,,,,,⋯,,,,,,,,,,
1000280,1436,,,,,,,,,⋯,,,,,,,,,,
1000299,1154,,,,,,,,,⋯,,,,,,,,,,

eid,ins_index,epistart,admidate
<int>,<int>,<IDate>,<IDate>
3026745,17,2009-04-23,2009-04-23
5469223,105,2013-06-18,2013-06-18
2099936,32,2021-02-18,2021-02-18
5152448,0,1997-05-07,1997-05-07
5944709,19,2017-12-13,2017-12-12

eid,ins_index,diag_icd9,diag_icd10
<int>,<int>,<chr>,<chr>
2097360,0,,D140
2622045,9,,Z800
4723574,4,,N183
5697013,17,,Z888
5858023,42,,M059

eid,ins_index,oper3,oper4
<int>,<int>,<int>,<chr>
2940276,4,,W822
3772102,12,,K634
5994223,4,,E492
4168836,0,,Y819
4239222,0,,Y767


### Rename columns

In [13]:
rename_cols <- function(d, code_struc) {
    for (col in names(code_struc)) {
        if (code_struc[[col]]$search=="matches") {
            setnames(d, code_struc[[col]]$name, col)
        } else if (code_struc[[col]]$search=="startswith") {
            regex     <- paste0("^", code_struc[[col]]$name)
            matches   <- names(d)[grepl(regex, names(d))]
            new_names <- paste0(col, "_", 1:length(matches))
            setnames(d, matches, new_names)
        }
    }
    return(d)
}

data_files$demog <- rename_cols(data_files$demog, code_struc=participant_codes)
data_files$self  <- rename_cols(data_files$self,  code_struc=self_illness_codes)
data_files$hesin <- rename_cols(data_files$hesin, code_struc=hesin_to_extract)
data_files$diag  <- rename_cols(data_files$diag,  code_struc=hesdiag_to_extract)
data_files$oper  <- rename_cols(data_files$oper,  code_struc=hesproc_to_extract)

lapply(data_files, head, n = 5)

eid,reason_lost_fu,sex,age,ethnicity_1,ethnicity_2,ethnicity_3,ethnicity_4,genetic_sex,genetic_ethnicity,pc1,pc2,pc3,pc4,pc5
<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
1062757,,0,62,1001,,,,0.0,1.0,-12.6693,4.30928,-2.23438,-0.109226,-7.364
2217356,,1,57,1001,,,,,,,,,,
3712401,,1,56,1001,,,,1.0,1.0,-13.7558,5.64773,-3.58187,7.74165,21.0285
1011090,,0,49,1001,,,,0.0,1.0,-12.3794,2.03865,-0.837131,-0.562303,2.60283
2874739,,0,59,1001,,,,0.0,1.0,-12.6667,4.18819,-1.96996,2.85875,0.6597

eid,self_rep_ill_1,self_rep_ill_2,self_rep_ill_3,self_rep_ill_4,self_rep_ill_5,self_rep_ill_6,self_rep_ill_7,self_rep_ill_8,self_rep_ill_9,⋯,self_rep_proc_year_119,self_rep_proc_year_120,self_rep_proc_year_121,self_rep_proc_year_122,self_rep_proc_year_123,self_rep_proc_year_124,self_rep_proc_year_125,self_rep_proc_year_126,self_rep_proc_year_127,self_rep_proc_year_128
<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,⋯,<lgl>,<lgl>,<lgl>,<lgl>,<lgl>,<lgl>,<lgl>,<lgl>,<lgl>,<lgl>
1000074,1351,,,,,,,,,⋯,,,,,,,,,,
1000194,1086,,,,,,,,,⋯,,,,,,,,,,
1000258,1065,1265.0,1465.0,,,,,,,⋯,,,,,,,,,,
1000280,1436,,,,,,,,,⋯,,,,,,,,,,
1000299,1154,,,,,,,,,⋯,,,,,,,,,,

eid,ins_index,epistart,admidate
<int>,<int>,<IDate>,<IDate>
3026745,17,2009-04-23,2009-04-23
5469223,105,2013-06-18,2013-06-18
2099936,32,2021-02-18,2021-02-18
5152448,0,1997-05-07,1997-05-07
5944709,19,2017-12-13,2017-12-12

eid,ins_index,diag_icd9,diag_icd10
<int>,<int>,<chr>,<chr>
2097360,0,,D140
2622045,9,,Z800
4723574,4,,N183
5697013,17,,Z888
5858023,42,,M059

eid,ins_index,oper3,oper4
<int>,<int>,<int>,<chr>
2940276,4,,W822
3772102,12,,K634
5994223,4,,E492
4168836,0,,Y819
4239222,0,,Y767


## Data processing

### Read in the heRmes codes
ICD-9/10 coding is provided but we need to add the self reported codes from the UK-BB too.

In [14]:
codes <- fread(file.path("heRmes", "inst", "extdata", "hermes_3_codes", "hermes_3_codes.tsv"))
self_reported_codes <- list(
  list(name = "Heart Failure",                      code = "1076", code_type = "ukbb_self_reported_illness"),
  list(name = "Myocardial infarction",              code = "1075", code_type = "ukbb_self_reported_illness"),
  list(name = "Hypertrophic cardiomyopathy",        code = "1588", code_type = "ukbb_self_reported_illness"),
  list(name = "Coronary artery bypass grafting",    code = "1095", code_type = "ukbb_self_reported_procedure"),
  list(name = "Percutaneous coronary intervention", code = "1070", code_type = "ukbb_self_reported_procedure")
)
codes <- rbind(codes,
               data.table(Concept     = paste0(sapply(self_reported_codes, function(x) x$name), " Self Reported"),
                          Code        = sapply(self_reported_codes, function(x) x$code),
                          Source      = sapply(self_reported_codes, function(x) x$code_type),
                          Description = sapply(self_reported_codes, function(x) x$name)))
codes[, `:=`(code      = Code,
             code_type = fcase(Source=="ICD10", "icd10",
                               Source=="ICD9",  "icd9",
                               Source=="OPCS4", "opcs4",
                               Source=="ukbb_self_reported_illness", "ukbb_self_reported_illness",
                               Source=="ukbb_self_reported_procedure", "ukbb_self_reported_procedure"))]
codes <- codes[!is.na(code_type)] 
head(codes)

Concept,Code,Source,Description,code,code_type
<chr>,<chr>,<chr>,<chr>,<chr>,<chr>
Heart Failure,I110,ICD10,Hypertensive heart disease with heart failure,I110,icd10
Heart Failure,I130,ICD10,"Hypertensive heart and chronic kidney disease with heart failure and stage 1 through stage 4 chronic kidney disease, or unspecified chronic kidney disease",I130,icd10
Heart Failure,I132,ICD10,"Hypertensive heart and chronic kidney disease with heart failure and with stage 5 chronic kidney disease, or end stage renal disease",I132,icd10
Heart Failure,I50,ICD10,Heart failure,I50,icd10
Heart Failure,I500,ICD10,Congestive heart failure,I500,icd10
Heart Failure,I501,ICD10,"Left ventricular failure, unspecified",I501,icd10


### Clean up the cohort data

In [15]:
ethnicity_codes <- list(
  white                 = 1,
  british               = 1001,
  white_black_caribbean = 2001,
  indian                = 3001,
  caribbean             = 4001,
  mixed                 = 2,
  irish                 = 1002,
  white_black_african   = 2002,
  pakistani             = 3002,
  african               = 4002,
  asian_or_asian_british= 3,
  any_other_white       = 1003,
  white_asian           = 2003,
  bangladeshi           = 3003,
  any_other_black       = 4003,
  black_or_black_british= 4,
  any_other_mixed       = 2004,
  any_other_asian       = 3004,
  chinese               = 5,
  other_ethnic_group    = 6)

data_files$demog[, ethnicity := fcoalesce(.SD), .SDcols = names(data_files$demog)[grepl("^ethnicity_[0-9]$", names(data_files$demog))]]

data_files$demog <- data_files$demog[, 
    list(eid               = eid,
         reason_lost_fu    = reason_lost_fu,
         age               = as.integer(age),
         sex               = factor(sex, levels = 0:1, labels = c("female", "male")),
         ethnicity         = factor(ethnicity, levels = unlist(ethnicity_codes), labels = names(ethnicity_codes)),
         ethnicity_group   = factor(sub("([0-9])00[0-9]", "\\1", ethnicity), levels = unlist(ethnicity_codes), labels = names(ethnicity_codes)),
         genetic_sex       = factor(genetic_sex, levels = 0:1, labels = c("female", "male")),
         genetic_ethnicity = factor(genetic_ethnicity, levels = 1, labels = c("caucasian")), 
         pc1               = pc1,
         pc2               = pc2,
         pc3               = pc3,
         pc4               = pc4,
         pc5               = pc5)]

# check
stopifnot("Failed to parse some date of births" = all(!is.na(data_files$demog$dob)))
stopifnot("some ages / dob indicate cohort age <37, is this right?" = all(data_files$demog$dob <= as.Date("1972-01-01")))

### Self-report illness codes to long

In [16]:
self_rep_code_cols <- grep("self_rep_ill_[0-9]+",      names(data_files$self), value = TRUE)
self_rep_year_cols <- grep("self_rep_ill_year_[0-9]+", names(data_files$self), value = TRUE)
data_files$self[, (self_rep_code_cols) := lapply(.SD, as.character), .SDcols = self_rep_code_cols]
data_files$self[, (self_rep_year_cols) := lapply(.SD, as.numeric),   .SDcols = self_rep_year_cols]
data_files$self_illness <- data.table::melt(data_files$self,
                                            id.vars = "eid",
                                            measure = patterns("self_rep_ill_[0-9]+", "self_rep_ill_year_[0-9]+"),
                                            variable.name = "element",
                                            value.name = c("code", "year"),
                                            na.rm = TRUE)
data_files$self_illness <- data_files$self_illness[year != -1 & year != -3] # unknown / prefer not to answer
data_files$self_illness[, `:=`(date      = lubridate::ymd(paste0(as.character(floor(year)), "-01-01")) + lubridate::days(as.integer(365.25 * (year - floor(year)))),
                               year      = NULL,
                               element   = NULL,
                               code      = as.character(code),
                               code_type = "ukbb_self_reported_illness")]

# check self report illness table
stopifnot("unable to parse dates for self-reported illness codes" = all(!is.na(data_files$self_illness$date)))
stopifnot("are you sure something happened before 1900?" = all(data_files$self_illness$date > as.Date("1900-01-01")))

### Self-report procedure codes to long

In [17]:
self_rep_proc_code_cols <- grep("self_rep_proc_[0-9]+",      names(data_files$self), value = TRUE)
self_rep_proc_year_cols <- grep("self_rep_proc_year_[0-9]+", names(data_files$self), value = TRUE)
data_files$self[, (self_rep_proc_code_cols) := lapply(.SD, as.character), .SDcols = self_rep_proc_code_cols]
data_files$self[, (self_rep_proc_year_cols) := lapply(.SD, as.numeric),   .SDcols = self_rep_proc_year_cols]
data_files$self_oper <- data.table::melt(data_files$self,
                                         id.vars = "eid",
                                         measure = patterns("self_rep_proc_[0-9]+", "self_rep_proc_year_[0-9]+"),
                                         variable.name = "element",
                                         value.name = c("code", "year"),
                                         na.rm = TRUE)
data_files$self_oper <- data_files$self_oper[year != -1 & year != -3] # unknown / prefer not to answer
data_files$self_oper[, `:=`(date      = lubridate::ymd(paste0(as.character(floor(year)), "-01-01")) + lubridate::days(as.integer(365.25 * (year - floor(year)))),
                            year      = NULL,
                            element   = NULL,
                            code      = as.character(code),
                            code_type = "ukbb_self_reported_procedure")]

# check self report illness table
stopifnot("unable to parse dates for self-reported procedure codes" = all(!is.na(data_files$self_oper$date)))
stopifnot("are you sure something happened before 1900?" = all(data_files$self_oper$date > as.Date("1900-01-01")))

### Inpatient diagnosis codes

In [18]:
data_files$hesin[is.na(epistart) | epistart == "", epistart := admidate]
data_files$diag[data_files$hesin, date := as.Date(i.epistart), on = c("eid", "ins_index")]
data_files$diag[diag_icd9 == "", diag_icd9 := NA_character_]
data_files$diag[diag_icd10 == "", diag_icd10 := NA_character_]
data_files$diag <- data.table::melt(data_files$diag,
                                    id.vars = c("eid", "date"),
                                    measure.vars  = c("diag_icd9", "diag_icd10"),
                                    variable.name = "code_type",
                                    value.name = "code",
                                    na.rm = TRUE)
data_files$diag[, code_type := data.table::fcase(code_type == "diag_icd9", "icd9",
                                                 code_type == "diag_icd10", "icd10")]

### Inpatient procedure codes

In [19]:
data_files$oper[data_files$hesin, date := as.Date(i.epistart), on = c("eid", "ins_index")]
data_files$oper[oper3 == "", oper3 := NA_character_]
data_files$oper[oper4 == "", oper4 := NA_character_]
data_files$oper <- data.table::melt(data_files$oper,
                                    id.vars = c("eid", "date"),
                                    measure.vars  = c("oper3", "oper4"),
                                    variable.name = "code_type",
                                    value.name = "code",
                                    na.rm = TRUE)
data_files$oper[, code_type := data.table::fcase(code_type == "oper3", "opcs3",
                                                 code_type == "oper4", "opcs4")]

“'measure.vars' [oper3, oper4, ...] are not all of the same type. By order of hierarchy, the molten data value column will be of type 'character'. All measure variables not of type 'character' will be coerced too. Check DETAILS in ?melt.data.table for more on coercion.”


### Combine all codes
Keep only unique codes per individuals at the code's first occurance.

In [20]:
combined <- rbind(data_files$selfs_illness, data_files$self_oper, data_files$diag, data_files$oper)
combined <- codes[combined, on = c("code" = "code", "code_type" = "code_type"), allow.cartesian = TRUE]
combined <- combined[!is.na(Concept)]
combined <- combined[combined[, .I[which.min(date)], by = c("eid", "Concept")]$V1]

### Annotate the cohort with the codes

In [21]:
cohort <- data_files$demog
concepts <- unique(codes$Concept)
for (g in concepts) {

  col_name <- tolower(gsub(" ", "_", gsub("[()]","",g)))
  cohort[combined[Concept == g], paste0(col_name, c("", "_first_date")) := list(TRUE, as.Date(i.date)), on = "eid"]
  cohort[is.na(get(col_name)), (col_name) := FALSE]

}

### Remove withdrawals

In [22]:
cat(glue('{cohort[reason_lost_fu==5, .N]} withdrawals to remove'), sep="\n")
cohort <- cohort[is.na(reason_lost_fu) | reason_lost_fu!=5] # 5 - Participant has withdrawn consent for future linkage

158 withdrawals to remove


## Run phenotyping

In [23]:
# any ischaemic ICD codes
ischaemic_cols <- c("myocardial_infarction", "coronary_artery_bypass_grafting", "percutaneous_coronary_intervention", "thrombolysis_coronary", "ischaemic_cardiomyopathy")
cohort[, ischaemic := rowSums(.SD) > 0, .SDcols = ischaemic_cols]
cohort[, ischaemic_first_date := do.call(pmin, c(.SD, na.rm = TRUE)), .SDcols = paste0(ischaemic_cols, "_first_date")]

# combined NICM ICD codes
nicm_cols <- c("dilated_cardiomyopathy", "dilated_cardiomyopathy_associated_with", "left_ventricular_systolic_dysfunction")
cohort[, nicm_comb := rowSums(.SD) > 0, .SDcols = nicm_cols]
cohort[, nicm_comb_first_date := do.call(pmin, c(.SD, na.rm = TRUE)), .SDcols = paste0(nicm_cols, "_first_date")]

# any ischaemic self reported codes
self_isch_cols <- c("myocardial_infarction_self_reported", "coronary_artery_bypass_grafting_self_reported", "percutaneous_coronary_intervention_self_reported")
cohort[, self_isch := rowSums(.SD) > 0, .SDcols = self_isch_cols]
cohort[, self_isch_first_date := do.call(pmin, c(.SD, na.rm = TRUE)), .SDcols = paste0(self_isch_cols, "_first_date")]

# any self reported HF codes
self_hf_col = c("heart_failure_self_reported")
cohort[, self_hf := rowSums(.SD) > 0, .SDcols = self_hf_col]
cohort[, self_hf_first_date := do.call(pmin, c(.SD, na.rm = TRUE)), .SDcols = paste0(self_hf_col, "_first_date")]

# any self reported HCM codes
self_hcm_col = c("hypertrophic_cardiomyopathy_self_reported")
cohort[, self_hcm := rowSums(.SD) > 0, .SDcols = self_hcm_col]
cohort[, self_hcm_first_date := do.call(pmin, c(.SD, na.rm = TRUE)), .SDcols = paste0(self_hcm_col, "_first_date")]

# HF exclusions
cohort[, hf_exclude := congenital_heart_disease==TRUE |  # all congenital heart disease
                       (heart_failure==FALSE & (self_hf==TRUE | ischaemic==TRUE | self_isch==TRUE)) | # non-HF but with ischaemic ICD history or self reported heart failure or ischaemic history
                       (heart_failure==TRUE  & (ischaemic==TRUE & ischaemic_first_date > heart_failure_first_date))] # HF but with first ischaemic event after the HF diagnosis
# pheno 1
cohort[, pheno1 := congenital_heart_disease==FALSE & heart_failure==TRUE]

# pheno 2
cohort[, pheno2 := hf_exclude==FALSE & heart_failure==TRUE & ischaemic==TRUE]

# pheno 3
cohort[, pheno3 := hf_exclude==FALSE & heart_failure==TRUE & ischaemic==FALSE]

# HF controls
cohort[, hf_control := hf_exclude==FALSE & pheno1==FALSE & pheno2==FALSE & pheno3==FALSE]

# DCM exclusions
cohort[, cm_exclude := congenital_heart_disease==TRUE |  # all congenital heart disease
                       hypertrophic_cardiomyopathy==TRUE | self_hcm==TRUE | # all HCM
                       restrictive_cardiomyopathy==TRUE] # all RCM

# pheno 4
cohort[, pheno4 := cm_exclude==FALSE &
                   !(dilated_cardiomyopathy==TRUE  & (ischaemic==TRUE & ischaemic_first_date <= dilated_cardiomyopathy_first_date)) & # DCM but with first ischaemic event prior to the DCM diagnosis
                   dilated_cardiomyopathy==TRUE]

# pheno 5
cohort[, pheno5 := cm_exclude==FALSE &
         !(nicm_comb==TRUE & (ischaemic==TRUE & ischaemic_first_date <= nicm_comb_first_date)) &  # NICM but with first ischaemic event prior to the NICM diagnosis
         nicm_comb==TRUE]

# CM controls
cohort[, cm_control := cm_exclude==FALSE &
                       pheno4==FALSE &
                       pheno5==FALSE &
                       ischaemic==FALSE &
                       self_isch==FALSE]


# check HF phenotyping
base_cols <- c("eid", "age", "sex", "ethnicity", "ethnicity_group","genetic_sex", "genetic_ethnicity", paste0("pc",1:5))
sum_cols <- names(cohort)[!names(cohort) %in% base_cols
                          &
                          !grepl("date", names(cohort))]
summary <- data.table (name = c("total", sum_cols), sex = "all", N = c(nrow(cohort), cohort[, .(sapply(.SD, sum)), .SDcols = sum_cols]$V1))
summary <- rbind(summary,
                 data.table(name = rep(sum_cols, 2), cohort[, .(N = sapply(.SD, sum)), .SDcols = sum_cols, by = "sex"]), fill=TRUE)
summary

# write summary
fwrite(dcast(summary, name ~ sex, value.var = "N"),
       file = "hermes3_phenotype_summary.tsv",
       sep  = "\t")

cat("Total =", summary[name=="total",N], "; Sum (exc/crtl/1) = ", sum(summary[name%in%c("hf_exclude",paste0("pheno1"),"hf_control"), N]), "\n")
cat("Total =", summary[name=="total",N], "; Sum (exc/crtl/2/3) = ", sum(summary[name%in%c("hf_exclude",paste0("pheno",2:3),"hf_control"), N]), "\n")
cat("Total =", summary[name=="total",N], "; Sum (exc/crtl/5) = ", sum(summary[name%in%c("cm_exclude",paste0("pheno5"),"cm_control"), N]), "\n")

# write out
fwrite(cohort[, mget(c(base_cols[base_cols != "eid"],
                       paste0("pheno", 1:3), "hf_exclude", "hf_control",
                       paste0("pheno", 4:5), "cm_exclude", "cm_control"))],
       file = "hermes3_phenotypes.tsv.gz",
       sep  = "\t")

name,sex,N
<chr>,<fct>,<int>
total,all,501976
reason_lost_fu,all,
heart_failure,all,20156
dilated_cardiomyopathy,all,1597
dilated_cardiomyopathy_associated_with,all,1707
left_ventricular_systolic_dysfunction,all,10373
myocardial_infarction,all,26911
coronary_artery_bypass_grafting,all,7505
percutaneous_coronary_intervention,all,17933
thrombolysis_coronary,all,61


Total = 501976 ; Sum (exc/crtl/1) =  1007346 
Total = 501976 ; Sum (exc/crtl/2/3) =  1003952 
Total = 501976 ; Sum (exc/crtl/5) =  935210 


### Copy output to project

In [24]:
o <- system("dx upload hermes3_phenotype_summary.tsv hermes3_phenotypes.tsv.gz --destination hermes3_data", intern = TRUE)
cat(o, sep = "\n")

ID                                file-Gz5p3g0J81vjbxfqzZFj19Jb
Class                             file
Project                           project-GvZyZ20J81vgPJGbJy8pgpyq
Folder                            /hermes3_data
Name                              hermes3_phenotype_summary.tsv
State                             closing
Visibility                        visible
Types                             -
Properties                        -
Tags                              -
Outgoing links                    -
Created                           Sat Mar  8 01:48:12 2025
Created by                        nicholas.sunderland
 via the job                      job-Gz5kGkjJ81vb42BJyBfzb670
Last modified                     Sat Mar  8 01:48:13 2025
Media type                        
archivalState                     "live"
cloudAccount                      "cloudaccount-dnanexus"
ID                                file-Gz5p3g8J81vfXJ28pz0K29GQ
Class                             file
Project           

# End