# Osteosarcoma - Data Wrangling

The dataset consists of MRI images of children with bone tumors (osteosarcoma). The MRIs vary in the number and type of sequences. Furthermore, patients sometimes moved considerably between sequences, necessitating multiple transforms.

In [1]:
from pathlib import Path
import re
import shutil

import pandas as pd

In [2]:
DATA_DIR = Path("/workspace/data")

In [3]:
files = sorted(list(DATA_DIR.glob("**/*nrrd")))

In [4]:
assert len(files) > 0, "Empty set. Check filepaths"

In [5]:
sequences = set()
for f in files:
    sequences.add(re.sub("\\..*", "", f.name))
sequences

{'AxADC',
 'AxDWI',
 'AxT1',
 'AxT1fs',
 'AxT2',
 'AxT2fs',
 'CorT1',
 'PostAxT1fs',
 'PreAxT1',
 'PreAxT1fs',
 'SagT1',
 'Segmentation_1',
 'Segmentation_2',
 'Segmentation_29',
 'Segmentation_3'}

#### Explanation of image names

- **AxADC:** axial Apparent Diffusion Coefficient maps
- **AxDWI:** axial Diffusion Weighted Imaging
- **AxT1:** axial T1 without fat saturation
- **AxT2fs:** axial T2 with fat saturation
- **CorT1:** coronal T1 without fat saturation. Only if no other T1w sequence was available
- **SagT1:** sagittal T1 without fat saturation. Only if no other T1w sequence was available
- **PostAxT1fs:** axial T1 with fat saturation after contrast injection
- **PreAxT1fs:** axial T1 with fat saturation before contrast injection
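
The naming scheme above can be split into its components with a regular expression. The following is only a minimal sketch: the pattern and the helper `parse_sequence_name` are assumptions derived from the names listed here, not part of the dataset or of `trainlib`.

```python
import re

# Assumed pattern: optional Pre/Post prefix (relative to contrast injection),
# imaging plane, contrast type, optional "fs" suffix for fat saturation.
SEQUENCE_PATTERN = re.compile(
    r"^(?P<timing>Pre|Post)?(?P<plane>Ax|Cor|Sag)(?P<contrast>T1|T2|ADC|DWI)(?P<fs>fs)?$"
)

PLANES = {"Ax": "axial", "Cor": "coronal", "Sag": "sagittal"}


def parse_sequence_name(name):
    """Split a sequence name such as 'PostAxT1fs' into its components."""
    match = SEQUENCE_PATTERN.match(name)
    if match is None:
        return None  # not an image sequence, e.g. a segmentation file
    return {
        "timing": match["timing"],  # 'Pre', 'Post' or None
        "plane": PLANES[match["plane"]],
        "contrast": match["contrast"],
        "fat_saturated": match["fs"] is not None,
    }


parsed = parse_sequence_name("PostAxT1fs")
```

Names that do not follow the scheme, such as `Segmentation_1`, return `None` and can be handled separately.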


#### Explanation of segmentation names
Sometimes the patient moved between the pre- and post-contrast images. In that case the images are no longer aligned, and new segmentations had to be created.

- **Segmentation_1:** segmentations for pre-contrast images, if significant movement happened between pre- and post-contrast sequences
- **Segmentation_3:** segmentations for pre-contrast images, if significant movement happened between pre- and post-contrast sequences
- **Segmentation_2:** segmentation valid for all sequences in the examination





## Filenames to DataFrame
For compatibility with `trainlib`, the file names need to be provided in CSV format. For now, sequences are regarded as independent, even if they come from the same examination. Later in the project, one can implement a solution to merge multiple sequences.

The sequences are organized in a hierarchical structure.

```bash
patient_id # patient id
  |--exam_type # mostly two types: baseline or follow-up. However, there can be more than two
       |       # if significant movement happened between sequences.
       |--sequence_1 # first MRI sequence, e.g. T1 axial
       |--sequence_2
       |
       |--sequence_n
       |--segmentation # pixel-wise segmentation of the tumor

```
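
As a minimal sketch, the levels of this hierarchy can be read back from a single file path. The helper `describe_file` and the example path are hypothetical, and the sketch assumes that exam folders are named `<patient_id>_<exam_type>`.

```python
from pathlib import Path


def describe_file(path):
    """Return the hierarchy levels encoded in a file path."""
    path = Path(path)
    patient_id = path.parent.parent.name
    exam_dir = path.parent.name
    # Assumption: the exam folder is named '<patient_id>_<exam_type>'.
    exam_type = exam_dir.replace(patient_id, "").replace("_", "", 1)
    return {
        "patient_id": patient_id,
        "exam_type": exam_type,
        "sequence": path.name.split(".")[0],
        "is_segmentation": path.name.endswith(".seg.nrrd"),
    }


# Hypothetical path, for illustration only.
info = describe_file("/workspace/data/P001/P001_Baseline/AxT1.nrrd")
```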

### Manual fixes
> corrections before creating the dataframe

There are some minor errors in the data, which need to be corrected manually.

- rename file `AxT1nrrd` -> `AxT1.nrrd`
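
The rename could also be scripted instead of done by hand. This is a sketch only: the helper `fix_missing_dot` is hypothetical, and it assumes that the only naming error is a missing dot before the `nrrd` extension.

```python
from pathlib import Path


def fix_missing_dot(data_dir, dry_run=True):
    """Rename files ending in 'nrrd' without a preceding dot, e.g. 'AxT1nrrd'."""
    renamed = []
    for path in sorted(Path(data_dir).glob("**/*nrrd")):
        if path.is_file() and not path.name.endswith(".nrrd"):
            target = path.with_name(path.name[: -len("nrrd")] + ".nrrd")
            if not dry_run:
                path.rename(target)
            renamed.append((path, target))
    return renamed
```

With the default `dry_run=True`, the function only returns the planned `(source, target)` pairs, so they can be reviewed before any file is touched.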

In [7]:
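# Patient folders sit two levels above each file: <patient_id>/<exam>/<file>
patient_ids = sorted({f.parent.parent.name for f in files})
df = {"patient_id": [], "image": [], "label": [], "follow_up": [], "exam_type": []}
for idx in patient_ids:
    # Compare folder names, so an ID that is a substring of another ID cannot mix patients
    exams = [f for f in files if f.parent.parent.name == idx]
    # Exam type = exam folder name without the patient ID and the first underscore
    exam_types = sorted({e.parent.name.replace(idx, "").replace("_", "", 1) for e in exams})
    for exam_type in exam_types:
        exams_to_keep = [
            e for e in exams
            if e.parent.name.replace(idx, "").replace("_", "", 1) == exam_type
        ]
        images = []
        label = None  # reset, so a label from a previous exam is never reused
        for exam in exams_to_keep:
            if exam.name.endswith(".seg.nrrd"):
                label = exam
            else:
                images.append(exam)
        if label is None:
            # Without a segmentation there is nothing to train on
            print(f"No segmentation found for {idx} / {exam_type}, skipping")
            continue

        df["patient_id"] += [idx] * len(images)
        df["image"] += images
        df["label"] += [label] * len(images)
        df["follow_up"] += ["Follow" in str(i) for i in images]
        df["exam_type"] += [exam_type] * len(images)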

In [8]:
# Store paths relative to DATA_DIR
df["image"] = [fn.relative_to(DATA_DIR).as_posix() for fn in df["image"]]
df["label"] = [fn.relative_to(DATA_DIR).as_posix() for fn in df["label"]]

In [9]:
df = pd.DataFrame(df)
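
Since the file names need to end up in CSV format, the dataframe can be serialized with `DataFrame.to_csv`. The sketch below uses toy rows with hypothetical paths and an in-memory buffer. In the notebook, `df` from the cell above would be written to a file of your choice instead.

```python
import io

import pandas as pd

# Toy rows with hypothetical paths, mirroring the columns built above.
example = pd.DataFrame(
    {
        "patient_id": ["P001", "P001"],
        "image": ["P001/P001_Baseline/AxT1.nrrd", "P001/P001_Baseline/AxT2fs.nrrd"],
        "label": ["P001/P001_Baseline/Segmentation_2.seg.nrrd"] * 2,
        "follow_up": [False, False],
        "exam_type": ["Baseline", "Baseline"],
    }
)

# index=False keeps the row index out of the CSV.
csv_text = example.to_csv(index=False)

# Round trip to check that the columns survive serialization.
restored = pd.read_csv(io.StringIO(csv_text))
```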