# Data Wrangling UK Biobank Data

I downloaded Dixon-weighted scans of 1000 subjects from the UK Biobank database, using the command-line tools provided by UKBB: https://biobank.ndph.ox.ac.uk/ukb/download.cgi

```bash
# Decrypt UKBB dataset (only contains identifier of subjects) -> creates file 'dataset.enc_ukb'
ukbunpack <dataset.enc> <keyfile.key>

# Create bulk file. 20201-2 specifies Dixon scans from a patient's 'first' examination. Some patients came for follow-up examinations, which I don't plan to include.
# (Remark: spelling of '20201-2' might be slightly wrong)
ukbconv dataset.enc_ukb bulk -s20201-2

# Download 1000 DICOMs based on bulk file
ukbfetch -a"$activation_code" -b<ukb675228.20201_2_0.bulk> -m1000
```
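
Before downloading, it can help to look at what the bulk file lists. The following is a minimal sketch, assuming each line of the bulk file holds a participant ID and a field identifier separated by whitespace (for example `1000011 20201_2_0`) and that each download is named `<eid>_<field identifier>.zip`. The helper `parse_bulk_lines` and the IDs are made up for illustration.

```python
def parse_bulk_lines(lines):
    """Turn bulk-file lines into (eid, field_id) pairs, skipping malformed or blank lines."""
    pairs = []
    for line in lines:
        parts = line.split()
        if len(parts) == 2:
            pairs.append((parts[0], parts[1]))
    return pairs


# Made-up example content; real participant IDs come from the decrypted dataset.
example_lines = ["1000011 20201_2_0\n", "1000027 20201_2_0\n", "\n"]
pairs = parse_bulk_lines(example_lines)

# Archive names one would expect to find after the download (assumed naming scheme).
expected_zips = [f"{eid}_{field_id}.zip" for eid, field_id in pairs]
print(expected_zips)
```

With a real bulk file, the lines would come from `open(path)` instead of the example list.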

In [1]:
import pandas as pd
import SimpleITK as sitk
from tqdm.notebook import tqdm
from os.path import join, exists
from os import scandir, mkdir

# private libraries
import sys

if "../scripts" not in sys.path:
    sys.path.insert(1, "../scripts")
import config
from dicom_io import read_dicom_series_zipped



In [2]:
zip_list = [f for f in scandir(join(config.ukbb, "dicom")) if f.name.endswith(".zip")]
dixon_types = ["in", "opp", "W", "F"]

The data is currently present as individual zip files, structured like this:
```
.
└── dicom [1000 entries]
    ├──<eid>_20201_2_0.zip 
    ├──<eid>_20201_2_0.zip 
    ...
    └──<eid>_20201_2_0.zip 

```
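
The archive names carry all the identifying information. As a rough sketch, assuming the name follows the pattern `<eid>_<field>_<instance>_<array index>.zip`, the parts can be split as shown here. `BulkName` and `parse_zip_name` are hypothetical helpers, not part of the UKBB tools, and the eid is made up.

```python
from typing import NamedTuple


class BulkName(NamedTuple):
    eid: str
    field: str
    instance: str
    array_index: str


def parse_zip_name(name: str) -> BulkName:
    """Split '<eid>_<field>_<instance>_<array index>.zip' into its four parts."""
    if not name.endswith(".zip"):
        raise ValueError(f"not a zip archive: {name}")
    stem = name[: -len(".zip")]
    eid, field, instance, array_index = stem.split("_")
    return BulkName(eid, field, instance, array_index)


parsed = parse_zip_name("1000011_20201_2_0.zip")  # made-up eid
print(parsed)
```
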
The end goal is a new directory `nifti`, where each exam is represented by four Dixon sequences for each of six body sections:

```
.
└── nifti [1000 entries]
    ├──<eid> [24 entries]
    │   ├──section0_in.nii.gz
    │   ├──section0_opp.nii.gz
    │   ...
    │   └──section5_F.nii.gz
    ...
    └──<eid> [24 entries]
        ├──section0_in.nii.gz
        ├──section0_opp.nii.gz
        ...
        └──section5_F.nii.gz
```
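
The 24 entries per exam are simply every combination of six sections and four Dixon types. A small sketch that enumerates the expected relative paths for one exam (the function name and the eid are illustrative; `posixpath.join` is used only to keep the separator fixed in this example):

```python
from itertools import product
from posixpath import join

DIXON_TYPES = ["in", "opp", "W", "F"]
N_SECTIONS = 6


def expected_nifti_paths(eid):
    """List the relative paths of all section/type combinations for one exam."""
    return [
        join(eid, f"section{section}_{dixon_type}.nii.gz")
        for section, dixon_type in product(range(N_SECTIONS), DIXON_TYPES)
    ]


paths = expected_nifti_paths("1000011")  # made-up eid
print(len(paths), paths[0], paths[-1])
```
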

The dataframe/csv should look like this:


| eid        | section | dixon_type | image                                |
|------------|---------|------------|--------------------------------------|
| eid_000001 |    0    |     in     | nifti/eid_000001/section0_in.nii.gz  |
| eid_000001 |    0    |     opp    | nifti/eid_000001/section0_opp.nii.gz |
| eid_00000n |    5    |     F      | nifti/eid_00000n/section5_F.nii.gz   |

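A table in this shape makes it easy to select a subset of images later. The sketch below uses only the standard library and an in-memory CSV with made-up rows; with the real file one would more likely call `pd.read_csv` and filter the dataframe.

```python
import csv
import io

# Made-up manifest content following the column layout above.
manifest_text = """eid,section,dixon_type,image
eid_000001,0,in,nifti/eid_000001/section0_in.nii.gz
eid_000001,0,opp,nifti/eid_000001/section0_opp.nii.gz
eid_000001,1,W,nifti/eid_000001/section1_W.nii.gz
"""

rows = list(csv.DictReader(io.StringIO(manifest_text)))

# csv returns every value as a string, so compare against "0" rather than 0.
section0_images = [row["image"] for row in rows if row["section"] == "0"]
print(section0_images)
```
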

## Dicom to Nifti
Create NIfTI images for 6 sections of the human body, with 4 sequence types each. Meta information (i.e. ID, section, Dixon type, path) is saved in a separate file, `manifest.csv`.
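
The section number is derived from the position of a series in the list: with four Dixon types per section, integer division of the index by 4 gives the section. This only works if the series arrive grouped by section, so a quick check is worthwhile. The sketch below illustrates the idea; the series descriptions are placeholders, not real UKBB series names, and both helpers are hypothetical.

```python
def describe_series(series_desc, n_types=4):
    """Derive section and Dixon type for each series from its position and name suffix."""
    return [
        {"index": i, "section": i // n_types, "dixon_type": desc.split("_")[-1]}
        for i, desc in enumerate(series_desc)
    ]


def check_grouping(records, n_types=4):
    """Return True if every section contains n_types distinct Dixon types."""
    by_section = {}
    for record in records:
        by_section.setdefault(record["section"], set()).add(record["dixon_type"])
    return all(len(types) == n_types for types in by_section.values())


# Placeholder descriptions for two sections; only the suffix after the last "_" matters.
example_desc = [f"series{s}_{t}" for s in range(2) for t in ("in", "opp", "W", "F")]
records = describe_series(example_desc)
print(records[5], check_grouping(records))
```

If `check_grouping` returns `False`, the ordering assumption does not hold for that exam and the section numbers should not be trusted.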

In [None]:
data = pd.DataFrame(columns=["eid", "datafile", "section", "dixon_type", "image"])

for file in tqdm(zip_list, postfix="Reading multiple patients", leave=False, position=0):
    eid = file.name.split("_")[0]
    datafile = file.name[len(eid) + 1 : -4]
    # mkdir is not recursive: the parent directory <config.ukbb>/nifti must already exist
    if not exists(join(config.ukbb, "nifti", eid)):
        mkdir(join(config.ukbb, "nifti", eid))

    images, series_desc = read_dicom_series_zipped(file, pbar_position=1)
    image_meta = pd.DataFrame({"series_desc": series_desc})
    image_meta["dixon_type"] = image_meta["series_desc"].apply(lambda x: x.split("_")[-1])

    for i, row in tqdm(
        image_meta.iterrows(),
        total=len(image_meta),
        postfix="Saving as nifti",
        leave=False,
        position=1,
    ):
        entry = {
            "eid": eid,
            "datafile": datafile,
            # integer-divide by 4, because each section has 4 different sequence types
            "section": i // 4,
            "dixon_type": row["dixon_type"],
            "image": join(eid, "section" + str(i // 4) + "_" + row["dixon_type"] + ".nii.gz"),
        }
        data.loc[len(data)] = entry
        sitk.WriteImage(images[i], join(config.ukbb, "nifti", entry["image"]), True, 1)

data.to_csv(join(config.ukbb, "manifest.csv"), index=False)
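
After the conversion, one may want to confirm that every exam really has all 24 section/type combinations. A minimal sketch of such a check, working on plain dictionaries (for example from `data.to_dict("records")`); the function name and the example eids `A` and `B` are made up:

```python
from collections import defaultdict
from itertools import product

DIXON_TYPES = ("in", "opp", "W", "F")


def find_missing(rows, n_sections=6, dixon_types=DIXON_TYPES):
    """Map each incomplete eid to the (section, dixon_type) pairs it lacks."""
    expected = set(product(range(n_sections), dixon_types))
    seen = defaultdict(set)
    for row in rows:
        seen[row["eid"]].add((int(row["section"]), row["dixon_type"]))
    return {
        eid: sorted(expected - found)
        for eid, found in seen.items()
        if expected - found
    }


# Made-up manifest rows: exam "A" is complete, exam "B" lacks section 5, type F.
complete = [
    {"eid": "A", "section": s, "dixon_type": t}
    for s, t in product(range(6), DIXON_TYPES)
]
incomplete = [
    {**row, "eid": "B"}
    for row in complete
    if (row["section"], row["dixon_type"]) != (5, "F")
]
missing = find_missing(complete + incomplete)
print(missing)
```
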