Load real UK Biobank data

meliao edited this page May 22, 2020 · 30 revisions

Duplicated data-fields

If you load several CSV files from different datasets (for instance, data refreshes or new data requests) and they share data-fields (for example, data-field 50 is present in both dataset 1 and dataset 2), ukbREST loads the copy from the latest dataset. It infers which dataset is the latest from the number in each CSV file name (the dataset ID). So if you have three files, ukb00.csv, ukb01.csv and ukb50.csv, they are loaded in this order: ukb50.csv, ukb01.csv and ukb00.csv.

If duplicated data-fields are found in the loading stage, you will see messages like these:

...
2018-11-25 06:34:49,054 - ukbrest - WARNING - Column c25756_2_0 already loaded from /var/lib/phenotype/ukb24989.csv. Skipping.
2018-11-25 06:34:49,055 - ukbrest - WARNING - Column c25757_2_0 already loaded from /var/lib/phenotype/ukb24989.csv. Skipping.
...
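
The ordering rule described above can be sketched in a few lines of Python. This is only an illustration, not ukbREST's actual code: the helper names (dataset_id, loading_order, column_sources) are hypothetical, and treating the first run of digits in the file name as the dataset ID is an assumption.

```python
import re


def dataset_id(filename):
    """Return the first run of digits in a file name (assumed to be the dataset ID)."""
    match = re.search(r"\d+", filename)
    if match is None:
        raise ValueError(f"no dataset ID found in {filename!r}")
    return int(match.group())


def loading_order(filenames):
    """Sort file names so that the highest dataset ID comes first."""
    return sorted(filenames, key=dataset_id, reverse=True)


def column_sources(columns_by_file):
    """Map each column to the first file, in loading order, that contains it."""
    sources = {}
    for filename in loading_order(columns_by_file):
        for column in columns_by_file[filename]:
            # A column already taken from a newer dataset is skipped here.
            sources.setdefault(column, filename)
    return sources


print(loading_order(["ukb00.csv", "ukb01.csv", "ukb50.csv"]))
# ['ukb50.csv', 'ukb01.csv', 'ukb00.csv']
```

With this rule, a column present in both ukb01.csv and ukb50.csv is attributed to ukb50.csv, which matches the "already loaded ... Skipping" warnings shown above.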

Unicode decoding errors

When loading real UK Biobank data, you may see this error:

2018-08-01 23:53:52,219 - ukbrest - INFO - Working on /var/lib/phenotype/example15_00.csv
[...]
2018-08-01 23:53:52,378 - ukbrest - WARNING - No encodings.txt found, assuming utf-8
2018-08-01 23:53:52,530 - ukbrest - ERROR - Unicode decoding error when reading CSV file. Activate debug to show more details.

That means the CSV file uses a different encoding (ukbREST assumes utf-8 by default). To fix it, specify the correct encoding for that file in a text file named encodings.txt in your phenotype folder (where you have your CSV/HTML files). For the example message above (where the file being loaded is example15_00.csv), the content of your encodings.txt file should be:

example15_00.csv latin1

The encodings.txt file has one line per CSV file. If you run into this issue, you can try different encodings like latin1 or cp1252 (see here for a full list of encodings supported in Python) or use a tool such as uchardet to detect the encoding. You only need to list the files that fail to load; utf-8 is used for all the others.
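
As a rough aid for choosing an encoding, the sketch below tries a few candidates in order and formats the matching encodings.txt line. The function names and the candidate list are assumptions made for illustration. Note that latin1 maps every possible byte, so it never fails: a successful decode only shows that the file can be read, not that the characters are interpreted correctly, so inspect the result.

```python
def first_decodable_encoding(raw, candidates=("utf-8", "cp1252", "latin1")):
    """Return the first candidate that decodes the bytes without errors, or None."""
    for encoding in candidates:
        try:
            raw.decode(encoding)
        except UnicodeDecodeError:
            continue
        return encoding
    return None


def encodings_line(csv_name, encoding):
    """Format one encodings.txt entry: file name, a space, then the encoding."""
    return f"{csv_name} {encoding}"


# Byte 0xE9 followed by a newline is not valid utf-8, but it is valid cp1252.
sample = b"eid,name\n1,caf\xe9\n"
print(encodings_line("example15_00.csv", first_decodable_encoding(sample)))
# example15_00.csv cp1252
```

For a real file, the bytes could be read with pathlib.Path(path).read_bytes() before calling first_decodable_encoding.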

Data-fields codings

ukbREST allows you to load data-field codings. By default, when using the --load-codings parameter of the Docker image, ukbREST will load several data codings that are publicly available from the UK Biobank Data Showcase. However, you could have data-fields in your application whose coding was not loaded by default. To load the exact list of codings for your application data, follow this procedure.

Once the loading process finishes, you can get all the data-field codings in your data by connecting to the PostgreSQL database and exporting a list of codings:

\copy (select distinct coding from fields where coding is not null) to /tmp/all_codings.txt (format csv)

The file /tmp/all_codings.txt is just a list of coding numbers, one per line, that you can use to download all coding files using the download_codings.sh script (which you can get from this repository):

$ mkdir /tmp/codings && cd /tmp/codings
$ [UKBREST_CODE]/utils/scripts/download_codings.sh /tmp/all_codings.txt
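
Before loading, it can be useful to confirm that every coding in the exported list has a matching downloaded file. The following is a minimal sketch, assuming the list holds one coding number per line and the files follow the coding_<number>.tsv naming described in this section; the function name missing_codings is hypothetical.

```python
import re


def missing_codings(listed_codings, filenames):
    """Return the listed coding numbers that have no coding_<number>.tsv file."""
    downloaded = set()
    for name in filenames:
        match = re.fullmatch(r"coding_(\d+)\.tsv", name)
        if match:
            downloaded.add(match.group(1))
    # Ignore blank lines and surrounding whitespace in the exported list.
    wanted = [line.strip() for line in listed_codings if line.strip()]
    return [coding for coding in wanted if coding not in downloaded]


print(missing_codings(["19\n", "100329\n", "489\n"],
                      ["coding_19.tsv", "coding_489.tsv"]))
# ['100329']
```

In practice, listed_codings could be the open file /tmp/all_codings.txt and filenames the result of os.listdir("/tmp/codings").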

Once you have downloaded all coding files (with names like coding_100329.tsv for coding 100329), place them in a folder, for example /tmp/codings, and run this command:

$ docker run --rm --net ukb \
  -v /tmp/codings:/var/lib/codings \
  -e UKBREST_DB_URI="postgresql://test:test@pg:5432/ukb" \
  hakyimlab/ukbrest --load-codings

You'll see output like this:

2018-07-09 19:19:50,353 - ukbrest - INFO - Loading codings from /var/lib/codings
2018-07-09 19:19:51,121 - ukbrest - INFO - Processing coding file: coding_489.tsv
2018-07-09 19:19:51,190 - ukbrest - INFO - Processing coding file: coding_238.tsv
[...]

Once the command finishes, your database will have a table called codings that lets you link your data with, for instance, ICD10 codes (through data-coding 19 in this case).

Loading other types of data

You can load other types of sample data, like Sample-QC and relatedness (see this page for more information).

For example, to load Sample-QC and relatedness data, create a subfolder named samples_data in your phenotype directory. Copy the Sample-QC file (ukb_sqc_vZ.txt) there under the new name samplesqc.txt; note that this file does not have a sample ID column, so you must add one using the .fam file from your application (read more about that here). Also copy the relatedness file (ukbA_rel_sP.txt) under the name relatedness.txt. The names samplesqc.txt and relatedness.txt are not mandatory, but the files must have the .txt extension so that ukbREST can find and load them. Finally, run this command:

$ docker run --rm --net ukb \
  -v /full/path/to/phenotype/folder/:/var/lib/phenotype \
  -e UKBREST_DB_URI="postgresql://test:test@pg:5432/ukb" \
  hakyimlab/ukbrest --load-samples-data --identifier-columns relatedness.txt:ID1,ID2

2018-08-06 22:43:00,179 - ukbrest - INFO - Loading samples data from file: samplesqc.txt
2018-08-06 22:48:28,681 - ukbrest - INFO - Adding primary key
2018-08-06 22:48:29,147 - ukbrest - INFO - Adding columns to 'fields' table
2018-08-06 22:48:29,180 - ukbrest - INFO - Loading samples data from file: relatedness.txt
2018-08-06 22:48:52,616 - ukbrest - INFO - Adding primary key
2018-08-06 22:48:52,682 - ukbrest - INFO - Adding columns to 'fields' table
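
Adding the sample ID column to the Sample-QC file, as mentioned above, could look like the following sketch. It assumes that row i of the Sample-QC file describes the same participant as row i of the .fam file, that both files are whitespace-separated, and that the participant ID is the second column of the .fam file; check these assumptions against the UK Biobank documentation for your release. The function name is hypothetical, and header handling is left out.

```python
def add_sample_ids(fam_lines, sqc_lines):
    """Prefix each Sample-QC row with the ID taken from the matching .fam row."""
    fam_rows = [line.split() for line in fam_lines if line.strip()]
    sqc_rows = [line.rstrip("\n") for line in sqc_lines if line.strip()]
    if len(fam_rows) != len(sqc_rows):
        raise ValueError("the .fam and Sample-QC files have different numbers of rows")
    # Column 2 of a PLINK .fam file is the individual ID.
    return [f"{fam[1]} {sqc}" for fam, sqc in zip(fam_rows, sqc_rows)]


fam = ["1001 1001 0 0 1 Batch_b001\n", "1002 1002 0 0 2 Batch_b001\n"]
sqc = ["a b c\n", "d e f\n"]
print(add_sample_ids(fam, sqc))
# ['1001 a b c', '1002 d e f']
```

The length check is deliberate: if the two files do not have the same number of rows, pairing them by position would silently assign IDs to the wrong participants.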

A new table is created for each file, which you can later use in your queries. You can load other kinds of sample data the same way: put the files in the samples_data folder with the .txt extension and run the command above. You can specify the ID columns with --identifier-columns (the format is file1.txt:column1 file2.txt:column2), skip some columns with --skip-columns (the format is file1.txt:column1 file2.txt:column2,column3), and specify file separators with --separators (file1.txt:, file2.txt:;).
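
To make the option formats concrete, the sketch below splits such values into a per-file mapping. It only illustrates how the formats read; it is not ukbREST's own argument parser, and both function names are hypothetical.

```python
def parse_columns_option(values):
    """Parse items like 'file2.txt:column2,column3' into {file: [columns]}."""
    parsed = {}
    for value in values:
        filename, _, columns = value.partition(":")
        parsed[filename] = columns.split(",")
    return parsed


def parse_separators_option(values):
    """Parse items like 'file1.txt:,' into {file: separator}."""
    parsed = {}
    for value in values:
        # Split on the first colon only; the separator itself may be a comma.
        filename, _, separator = value.partition(":")
        parsed[filename] = separator
    return parsed


print(parse_columns_option(["relatedness.txt:ID1,ID2"]))
# {'relatedness.txt': ['ID1', 'ID2']}
print(parse_separators_option(["file1.txt:,", "file2.txt:;"]))
# {'file1.txt': ',', 'file2.txt': ';'}
```

The separators case is handled separately because splitting its value on commas, as done for column lists, would destroy a comma separator.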

Load withdrawals

You can also load a list of IDs of participants who have withdrawn consent to continue participating in the study. UK Biobank provides this list as CSV files (in fact, each file has just one ID per line, with no header). Place all these files in a folder, for example ~/withdrawals, and run this command:

$ docker run --rm --net ukb \
  -e UKBREST_DB_URI="postgresql://test:test@pg:5432/ukb" \
  -v ~/withdrawals:/var/lib/withdrawals \
  hakyimlab/ukbrest --load-withdrawals
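
A quick sanity check of the withdrawal files before loading can catch an unexpected header or stray text. The sketch below assumes that participant IDs are purely numeric; the function name is hypothetical and this is not part of ukbREST.

```python
def read_withdrawn_ids(lines):
    """Collect participant IDs from lines holding one ID each, ignoring blank lines."""
    ids = set()
    for number, line in enumerate(lines, start=1):
        text = line.strip()
        if not text:
            continue
        if not text.isdigit():
            raise ValueError(f"line {number}: expected a numeric ID, got {text!r}")
        ids.add(int(text))
    return ids


print(sorted(read_withdrawn_ids(["1000010\n", "\n", "1000028\n", "1000010\n"])))
# [1000010, 1000028]
```

Returning a set means an ID that appears in more than one withdrawal file is counted once.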

Load Electronic Health Records

ukbREST supports primary care records as well as certain hospital inpatient record datasets. The currently supported datasets are:

  • gp_clinical: clinical records of primary care events
  • gp_registrations: health system registration dates
  • gp_scripts: prescriptions resulting from primary care visits
  • hesin: main UK Biobank hospital inpatient record table, detailing hospital episode-level data
  • hesin_diag: ICD codes for diagnoses made during inpatient care

These records can be downloaded from the UK Biobank Data Showcase as tab-separated text files. Suppose the primary care tables are located in ~/primary_care/ and the hospital inpatient records are in ~/hospital_inpatient/. Then this command will load the files:

$ docker run --rm --net ukb \
-e UKBREST_DB_URI="postgresql://test:test@pg:5432/ukb" \
-v ~/hospital_inpatient:/var/lib/hospital_inpatient \
-v ~/primary_care:/var/lib/primary_care