## Using `xvhelper` to download and decode pheno datasets

If we are running JupyterLab or RStudio on the RAP/DNAnexus platform, we can use `xvhelper` to automate downloading and decoding the phenotype data.

In [1]:
library(xvhelper)

## Start with the Datasets

We can first use `find_all_datasets()` to return a `data.frame` of all datasets available in our project.

In [2]:
datasets <- find_all_datasets()
datasets

id,name,project
<chr>,<chr>,<chr>
record-G406j8j0x8kzxv3G08k64gVV,apollo_ukrap_synth_pheno_100k,project-GY19Qz00Yq34kBPz8jj0XKg0


Dispensed datasets are normally named according to the following convention (the synthetic dataset in the table above does not follow it):

`{application_id}_{date_dispensed}.dataset`

We will use the latest dataset, which is the top row. Our project contains only one dataset, but a project with multiple dispensals will contain several. We can also use `find_dataset_id()`, which returns the most recently dispensed dataset:

In [3]:
ds_id <- find_dataset_id()
ds_id
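The `dx extract_dataset` calls logged later in this section address the dataset in the form `project-xxxx:record-yyyy`. As a minimal sketch, the same identifier can be assembled by hand from the `datasets` table. The helper `make_project_record()` is hypothetical and not part of `xvhelper`; the table is a toy copy of the one shown above.

```r
# Toy copy of the `datasets` table shown above (one row in this project).
datasets <- data.frame(
  id      = "record-G406j8j0x8kzxv3G08k64gVV",
  name    = "apollo_ukrap_synth_pheno_100k",
  project = "project-GY19Qz00Yq34kBPz8jj0XKg0",
  stringsAsFactors = FALSE
)

# Hypothetical helper (not part of xvhelper): join the project and record IDs
# into the "project-xxxx:record-yyyy" form used in the dx commands.
make_project_record <- function(project, id) {
  paste(project, id, sep = ":")
}

project_record <- make_project_record(datasets$project, datasets$id)
project_record
```

Because `paste()` is vectorised, the same helper would return one identifier per row for a project with several datasets.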

Now that we have our project/dataset ID, we can use it to grab metadata. We'll first fetch the dictionaries for our dataset.

In [4]:
get_dictionaries(ds_id)

→ running dx extract_dataset project-GY19Qz00Yq34kBPz8jj0XKg0:record-G406j8j0x8kzxv3G08k64gVV --dump-dataset-dictionary

✔ Data dictionary is downloaded as /opt/notebooks/apollo_ukrap_synth_pheno_100k.dataset.data_dictionary.csv

✔ Coding dictionary is downloaded as /opt/notebooks/apollo_ukrap_synth_pheno_100k.dataset.codings.csv

✔ Entity dictionary is downloaded as /opt/notebooks/apollo_ukrap_synth_pheno_100k.entity_dictionary.csv



Now that we have the dictionary files in our JupyterLab/RStudio storage, we can build the coding table from the coding and data dictionaries. We'll use this table for decoding.

In [5]:
codings <- get_coding_table(ds_id)
head(codings)

“One or more parsing issues, call `problems()` on your data frame for details,
e.g.:
  dat <- vroom(...)
  problems(dat)”


title,ent_field,entity,name,coding_name,code,meaning,is_sparse_coding,is_multi_select
<chr>,<glue>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>
Workplace very hot | Array 24,participant.p22608_a24,participant,p22608_a24,data_coding_493,-121,Do not know,,
Workplace very hot | Array 24,participant.p22608_a24,participant,p22608_a24,data_coding_493,-131,Sometimes,,
Workplace very hot | Array 24,participant.p22608_a24,participant,p22608_a24,data_coding_493,-141,Often,,
Workplace very hot | Array 24,participant.p22608_a24,participant,p22608_a24,data_coding_493,0,Rarely/never,,
Ever taken oral contraceptive pill | Instance 1,participant.p2784_i1,participant,p2784_i1,data_coding_100349,1,Yes,,
Ever taken oral contraceptive pill | Instance 1,participant.p2784_i1,participant,p2784_i1,data_coding_100349,-1,Do not know,,
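To make the role of the coding table concrete, here is a minimal base R sketch of a single-value lookup: each raw code in a column is replaced by the `meaning` that the table lists for that `ent_field`. This is not how `xvhelper` implements decoding. The helper `decode_column()` is hypothetical, and the two `participant.p31` rows are an assumption inferred from the decoded output later in this section (1 is Male, 0 is Female).

```r
# Toy coding table using the same column names as `codings`.
# The participant.p31 rows are assumed, inferred from the decoded output
# shown later in this document (1 -> Male, 0 -> Female).
toy_codings <- data.frame(
  ent_field = c("participant.p31", "participant.p31"),
  code      = c("1", "0"),
  meaning   = c("Male", "Female"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: replace raw codes in one column with their meanings.
decode_column <- function(values, field, coding_table) {
  rows <- coding_table[coding_table$ent_field == field, ]
  # match() returns NA for codes missing from the table, so unknown codes
  # decode to NA instead of raising an error.
  rows$meaning[match(as.character(values), rows$code)]
}

decode_column(c(1, 0, 1, 0), "participant.p31", toy_codings)
```

Codes are compared as character strings because the `code` column of the coding table is of type `<chr>`.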


In the next step, we'll need a list of fields. We can browse the fields available in the dataset with `explore_field_list()`:

In [6]:
explore_field_list(ds_id)

“One or more parsing issues, call `problems()` on your data frame for details,
e.g.:
  dat <- vroom(...)
  problems(dat)”


## Extracting Data

Now that we have the dataset ID, we can extract the data into our JupyterLab/RStudio storage. By default, `extract_data()` saves the data as a CSV file in the current working directory.

In [7]:
fields <- c("participant.eid", "participant.p31", "participant.p41202")
extract_data(ds_id, fields)

→ running dx extract_dataset project-GY19Qz00Yq34kBPz8jj0XKg0:record-G406j8j0x8kzxv3G08k64gVV --fields participant.eid,participant.p31,participant.p41202 -o apollo_ukrap_synth_pheno_100k.data.csv

✔ data is now extracted to /opt/notebooks/apollo_ukrap_synth_pheno_100k.data.csv



Let's read the data file in.

In [8]:
#| message: false

data <- readr::read_csv("apollo_ukrap_synth_pheno_100k.data.csv", show_col_types = FALSE)
head(data)

participant.eid,participant.p31,participant.p41202
<chr>,<dbl>,<chr>
sample_100_101,1,"[""K297"",""I802"",""K29"",""Block K20-K31"",""Chapter XI"",""I80"",""Block I80-I89"",""Chapter IX""]"
sample_100_11,0,"[""I251"",""K409"",""I25"",""Block I20-I25"",""Chapter IX"",""K40"",""Block K40-K46"",""Chapter XI""]"
sample_100_110,1,"[""M5456"",""K635"",""S7200"",""K083"",""M545"",""M54"",""Block M50-M54"",""Chapter XIII"",""K63"",""Block K55-K64"",""Chapter XI"",""S720"",""S72"",""Block S70-S79"",""Chapter XIX"",""K08"",""Block K00-K14""]"
sample_100_116,0,"[""R073"",""Z099"",""I839"",""Z305"",""R07"",""Block R00-R09"",""Chapter XVIII"",""Z09"",""Block Z00-Z13"",""Chapter XXI"",""I83"",""Block I80-I89"",""Chapter IX"",""Z30"",""Block Z30-Z39""]"
sample_100_124,0,"[""K298"",""M2557"",""K29"",""Block K20-K31"",""Chapter XI"",""M255"",""M25"",""Block M20-M25"",""Chapter XIII""]"
sample_100_126,0,"[""H259"",""G473"",""I839"",""Z035"",""K529"",""H25"",""Block H25-H28"",""Chapter VII"",""G47"",""Block G40-G47"",""Chapter VI"",""I83"",""Block I80-I89"",""Chapter IX"",""Z03"",""Block Z00-Z13"",""Chapter XXI"",""K52"",""Block K50-K52"",""Chapter XI""]"
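The multi-select column `participant.p41202` arrives as a single string that looks like a JSON array. As a minimal sketch of what has to happen before such values can be decoded, the string can be split into individual codes with base R. The helper `parse_multi()` is hypothetical and is not the parser `xvhelper` uses.

```r
# One raw value of participant.p41202, as it looks after read_csv()
# (abbreviated from the first row above).
raw <- '["K297","I802","K29","Block K20-K31","Chapter XI"]'

# Hypothetical helper: pull out every double-quoted item, then drop the quotes.
parse_multi <- function(x) {
  items <- regmatches(x, gregexpr('"[^"]*"', x))[[1]]
  gsub('"', "", items, fixed = TRUE)
}

codes <- parse_multi(raw)
codes
```

If the `jsonlite` package is available, `jsonlite::fromJSON(raw)` should give the same character vector.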


In [9]:
data[1:50,] |>
  decode_single(codings) |>
  decode_multi_purrr(codings) |>
  decode_column_names(codings) |>
  head() 

“running command 'timedatectl' had status 1”


participant_id,sex,diagnoses_main_icd10
<chr>,<chr>,<chr>
sample_100_101,Male,"K29.7 Gastritis, unspecified|I80.2 Phlebitis and thrombophlebitis of other deep vessels of lower extremities|K29 Gastritis and duodenitis|K20-K31 Diseases of oesophagus, stomach and duodenum|Chapter XI Diseases of the digestive system|I80 Phlebitis and thrombophlebitis|I80-I89 Diseases of veins, lymphatic vessels and lymph nodes, not elsewhere classified|Chapter IX Diseases of the circulatory system"
sample_100_11,Female,"I25.1 Atherosclerotic heart disease|K40.9 Unilateral or unspecified inguinal hernia, without obstruction or gangrene|I25 Chronic ischaemic heart disease|I20-I25 Ischaemic heart diseases|Chapter IX Diseases of the circulatory system|K40 Inguinal hernia|K40-K46 Hernia|Chapter XI Diseases of the digestive system"
sample_100_110,Male,"M54.56 Low back pain (Lumbar region)|K63.5 Polyp of colon|S72.00 Fracture of neck of femur (closed)|K08.3 Retained dental root|M54.5 Low back pain|M54 Dorsalgia|M50-M54 Other dorsopathies|Chapter XIII Diseases of the musculoskeletal system and connective tissue|K63 Other diseases of intestine|K55-K64 Other diseases of intestines|Chapter XI Diseases of the digestive system|S72.0 Fracture of neck of femur|S72 Fracture of femur|S70-S79 Injuries to the hip and thigh|Chapter XIX Injury, poisoning and certain other consequences of external causes|K08 Other disorders of teeth and supporting structures|K00-K14 Diseases of oral cavity, salivary glands and jaws"
sample_100_116,Female,"R07.3 Other chest pain|Z09.9 Follow-up examination after unspecified treatment for other conditions|I83.9 Varicose veins of lower extremities without ulcer or inflammation|Z30.5 Surveillance of (intra-uterine) contraceptive device|R07 Pain in throat and chest|R00-R09 Symptoms and signs involving the circulatory and respiratory systems|Chapter XVIII Symptoms, signs and abnormal clinical and laboratory findings, not elsewhere classified|Z09 Follow-up examination after treatment for conditions other than malignant neoplasms|Z00-Z13 Persons encountering health services for examination and investigation|Chapter XXI Factors influencing health status and contact with health services|I83 Varicose veins of lower extremities|I80-I89 Diseases of veins, lymphatic vessels and lymph nodes, not elsewhere classified|Chapter IX Diseases of the circulatory system|Z30 Contraceptive management|Z30-Z39 Persons encountering health services in circumstances related to reproduction"
sample_100_124,Female,"K29.8 Duodenitis|M25.57 Pain in joint (Ankle and foot)|K29 Gastritis and duodenitis|K20-K31 Diseases of oesophagus, stomach and duodenum|Chapter XI Diseases of the digestive system|M25.5 Pain in joint|M25 Other joint disorders, not elsewhere classified|M20-M25 Other joint disorders|Chapter XIII Diseases of the musculoskeletal system and connective tissue"
sample_100_126,Female,"H25.9 Senile cataract, unspecified|G47.3 Sleep apnoea|I83.9 Varicose veins of lower extremities without ulcer or inflammation|Z03.5 Observation for other suspected cardiovascular diseases|K52.9 Non-infective gastro-enteritis and colitis, unspecified|H25 Senile cataract|H25-H28 Disorders of lens|Chapter VII Diseases of the eye and adnexa|G47 Sleep disorders|G40-G47 Episodic and paroxysmal disorders|Chapter VI Diseases of the nervous system|I83 Varicose veins of lower extremities|I80-I89 Diseases of veins, lymphatic vessels and lymph nodes, not elsewhere classified|Chapter IX Diseases of the circulatory system|Z03 Medical observation and evaluation for suspected diseases and conditions|Z00-Z13 Persons encountering health services for examination and investigation|Chapter XXI Factors influencing health status and contact with health services|K52 Other non-infective gastro-enteritis and colitis|K50-K52 Noninfective enteritis and colitis|Chapter XI Diseases of the digestive system"
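In the decoded output, the multi-select values are joined into one string with `|` as the separator. A small sketch, in base R, of splitting such a string back into individual diagnoses and checking for a particular ICD-10 code (the example string is abbreviated from the first decoded row above):

```r
decoded <- "K29.7 Gastritis, unspecified|I80.2 Phlebitis and thrombophlebitis of other deep vessels of lower extremities|K29 Gastritis and duodenitis"

# fixed = TRUE treats "|" as a literal character rather than regex alternation.
diagnoses <- strsplit(decoded, "|", fixed = TRUE)[[1]]
diagnoses

# Check whether any entry starts with a given ICD-10 code.
any(startsWith(diagnoses, "I80.2"))
```

Splitting on `|` rather than on commas matters here, because several diagnosis descriptions contain commas themselves.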


## Reading in Cohort Information

Working with cohorts is very similar to working with the entire dataset. Let's list the cohorts in our project:

In [10]:
cohorts <- find_all_cohorts()
cohorts

id,name,project,project_record
<chr>,<chr>,<chr>,<glue>
record-G5Ky4f008KQZZ6bx0yYz44fB,female_control_3.0,project-GY19Qz00Yq34kBPz8jj0XKg0,project-GY19Qz00Yq34kBPz8jj0XKg0:record-G5Ky4f008KQZZ6bx0yYz44fB
record-G5Ky4Gj08KQYQ4P810fJ8qPp,female_coffee_3.0,project-GY19Qz00Yq34kBPz8jj0XKg0,project-GY19Qz00Yq34kBPz8jj0XKg0:record-G5Ky4Gj08KQYQ4P810fJ8qPp


Once we have the cohort `record` IDs, we can use `extract_data()` to extract the cohorts to our project.

In [11]:
fields <- c("participant.eid", "participant.p31", "participant.p41202")
cohort_id <- cohorts$project_record[1]
extract_data(cohort_id, fields)

→ running dx extract_dataset project-GY19Qz00Yq34kBPz8jj0XKg0:record-G5Ky4f008KQZZ6bx0yYz44fB --fields participant.eid,participant.p31,participant.p41202 -o female_control_3.0.data.csv

✔ data is now extracted to /opt/notebooks/female_control_3.0.data.csv



In [13]:
cohort1 <- readr::read_csv("female_control_3.0.data.csv")

Rows: 37206 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): participant.eid, participant.p41202
dbl (1): participant.p31

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


We can decode our cohort in the same way:

In [14]:
cohort1[1:10,] |>
  decode_single(codings) |>
  decode_multi_purrr(codings) |>
  decode_column_names(codings)

participant_id,sex,diagnoses_main_icd10
<chr>,<chr>,<chr>
sample_100_11,Female,"I25.1 Atherosclerotic heart disease|K40.9 Unilateral or unspecified inguinal hernia, without obstruction or gangrene|I25 Chronic ischaemic heart disease|I20-I25 Ischaemic heart diseases|Chapter IX Diseases of the circulatory system|K40 Inguinal hernia|K40-K46 Hernia|Chapter XI Diseases of the digestive system"
sample_100_124,Female,"K29.8 Duodenitis|M25.57 Pain in joint (Ankle and foot)|K29 Gastritis and duodenitis|K20-K31 Diseases of oesophagus, stomach and duodenum|Chapter XI Diseases of the digestive system|M25.5 Pain in joint|M25 Other joint disorders, not elsewhere classified|M20-M25 Other joint disorders|Chapter XIII Diseases of the musculoskeletal system and connective tissue"
sample_100_126,Female,"H25.9 Senile cataract, unspecified|G47.3 Sleep apnoea|I83.9 Varicose veins of lower extremities without ulcer or inflammation|Z03.5 Observation for other suspected cardiovascular diseases|K52.9 Non-infective gastro-enteritis and colitis, unspecified|H25 Senile cataract|H25-H28 Disorders of lens|Chapter VII Diseases of the eye and adnexa|G47 Sleep disorders|G40-G47 Episodic and paroxysmal disorders|Chapter VI Diseases of the nervous system|I83 Varicose veins of lower extremities|I80-I89 Diseases of veins, lymphatic vessels and lymph nodes, not elsewhere classified|Chapter IX Diseases of the circulatory system|Z03 Medical observation and evaluation for suspected diseases and conditions|Z00-Z13 Persons encountering health services for examination and investigation|Chapter XXI Factors influencing health status and contact with health services|K52 Other non-infective gastro-enteritis and colitis|K50-K52 Noninfective enteritis and colitis|Chapter XI Diseases of the digestive system"
sample_100_127,Female,"C77.3 Axillary and upper limb lymph nodes|C77 Secondary and unspecified malignant neoplasm of lymph nodes|C76-C80 Malignant neoplasms of ill-defined, secondary and unspecified sites|Chapter II Neoplasms"
sample_100_138,Female,I63.5 Cerebral infarction due to unspecified occlusion or stenosis of cerebral arteries|I63 Cerebral infarction|I60-I69 Cerebrovascular diseases|Chapter IX Diseases of the circulatory system
sample_100_141,Female,"R11 Nausea and vomiting|R10-R19 Symptoms and signs involving the digestive system and abdomen|Chapter XVIII Symptoms, signs and abnormal clinical and laboratory findings, not elsewhere classified"
sample_100_150,Female,"I84.9 Unspecified haemorrhoids without complication|A09.0 Other and unspecified gastroenteritis and colitis of infectious origin|H02.1 Ectropion of eyelid|R10.3 Pain localised to other parts of lower abdomen|K52.9 Non-infective gastro-enteritis and colitis, unspecified|I84 Haemorrhoids|I80-I89 Diseases of veins, lymphatic vessels and lymph nodes, not elsewhere classified|Chapter IX Diseases of the circulatory system|A09 Diarrhoea and gastro-enteritis of presumed infectious origin|A00-A09 Intestinal infectious diseases|Chapter I Certain infectious and parasitic diseases|H02 Other disorders of eyelid|H00-H06 Disorders of eyelid, lacrimal system and orbit|Chapter VII Diseases of the eye and adnexa|R10 Abdominal and pelvic pain|R10-R19 Symptoms and signs involving the digestive system and abdomen|Chapter XVIII Symptoms, signs and abnormal clinical and laboratory findings, not elsewhere classified|K52 Other non-infective gastro-enteritis and colitis|K50-K52 Noninfective enteritis and colitis|Chapter XI Diseases of the digestive system"
sample_100_155,Female,"B34.9 Viral infection, unspecified|H35.8 Other specified retinal disorders|K25.3 Acute without haemorrhage or perforation|Z01.8 Other specified special examinations|B34 Viral infection of unspecified site|B25-B34 Other viral diseases|Chapter I Certain infectious and parasitic diseases|H35 Other retinal disorders|H30-H36 Disorders of choroid and retina|Chapter VII Diseases of the eye and adnexa|K25 Gastric ulcer|K20-K31 Diseases of oesophagus, stomach and duodenum|Chapter XI Diseases of the digestive system|Z01 Other special examinations and investigations of persons without complaint or reported diagnosis|Z00-Z13 Persons encountering health services for examination and investigation|Chapter XXI Factors influencing health status and contact with health services"
sample_100_170,Female,"M25.56 Pain in joint (Lower leg)|T84.0 Mechanical complication of internal joint prosthesis|K08.3 Retained dental root|M25.5 Pain in joint|M25 Other joint disorders, not elsewhere classified|M20-M25 Other joint disorders|Chapter XIII Diseases of the musculoskeletal system and connective tissue|T84 Complications of internal orthopaedic prosthetic devices, implants and grafts|T80-T88 Complications of surgical and medical care, not elsewhere classified|Chapter XIX Injury, poisoning and certain other consequences of external causes|K08 Other disorders of teeth and supporting structures|K00-K14 Diseases of oral cavity, salivary glands and jaws|Chapter XI Diseases of the digestive system"
sample_100_203,Female,"D17.0 Benign lipomatous neoplasm of skin and subcutaneous tissue of head, face and neck|Z03.8 Observation for other suspected diseases and conditions|Z03.5 Observation for other suspected cardiovascular diseases|J32.0 Chronic maxillary sinusitis|K80.1 Calculus of gallbladder with other cholecystitis|N31.9 Neuromuscular dysfunction of bladder, unspecified|N40 Hyperplasia of prostate|D17 Benign lipomatous neoplasm|D10-D36 Benign neoplasms|Chapter II Neoplasms|Z03 Medical observation and evaluation for suspected diseases and conditions|Z00-Z13 Persons encountering health services for examination and investigation|Chapter XXI Factors influencing health status and contact with health services|J32 Chronic sinusitis|J30-J39 Other diseases of upper respiratory tract|Chapter X Diseases of the respiratory system|K80 Cholelithiasis|K80-K87 Disorders of gallbladder, biliary tract and pancreas|Chapter XI Diseases of the digestive system|N31 Neuromuscular dysfunction of bladder, not elsewhere classified|N30-N39 Other diseases of urinary system|Chapter XIV Diseases of the genitourinary system|N40-N51 Diseases of male genital organs"
