# Preparing the Medical Decathlon dataset

The [Medical Segmentation Decathlon (MSD)](http://medicaldecathlon.com/) is a large collection of annotated medical image datasets covering a range of clinically relevant anatomies. It is released under an open source license to facilitate the development of semantic segmentation algorithms. This enables: 1) objective assessment of general-purpose segmentation methods through comprehensive benchmarking and 2) open and free access to medical image data for any researcher interested in the problem domain.

## Why are we converting the 3D MRI scans to 2D slices?

In this part of the tutorial, you will convert the Medical Decathlon dataset from [NIfTI](https://nifti.nimh.nih.gov/) into NumPy format. Note that in the [3D model directory](../3D), we simply use the 3D NIfTI files in our data loader. A 3D model is generally the preferred approach for 3D datasets and will usually yield more accurate models. For this demo, however, we want to work with a simple 2D model, so it is faster to first extract the 2D slices into separate NumPy files and load the data as 2D. A batch in the 2D case is a set of 2D slices, which will usually come from different 3D scans.
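As a rough illustration of what slice extraction means, the sketch below splits a synthetic 4D array laid out as (height, width, depth, channels) into one 2D multi-channel array per depth index. The array layout and the helper name `volume_to_slices` are assumptions made for illustration; the actual conversion in this notebook is done by `convert_raw_to_npy` and may differ in its details.

```python
import numpy as np


def volume_to_slices(volume):
    """Split a (H, W, D, C) volume into a list of D slices of shape (H, W, C).

    Hypothetical helper for illustration only.
    """
    # Move the depth axis to the front so that iterating yields one slice at a time.
    slices_first = np.moveaxis(volume, 2, 0)
    return [np.ascontiguousarray(s) for s in slices_first]


# Synthetic stand-in for one multi-channel 3D scan (not real data).
volume = np.arange(4 * 5 * 3 * 2, dtype=np.float32).reshape(4, 5, 3, 2)
slices = volume_to_slices(volume)

print(len(slices))      # number of 2D slices (one per depth index)
print(slices[0].shape)  # shape of a single slice: (H, W, C)
```

Each element of `slices` could then be written to its own NumPy file, which is what makes loading random 2D batches cheap.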

To begin, you will need to download the raw dataset from the Medical Decathlon website (http://medicaldecathlon.com), extract the data (untar), and follow the instructions in this notebook.
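If you prefer to extract the archive from Python rather than the command line, a minimal sketch using the standard `tarfile` module is shown here. The helper name `extract_tar` and the example paths are assumptions; adjust them to wherever you saved the download.

```python
import os
import tarfile


def extract_tar(tar_path, dest_dir):
    """Extract a tar archive into dest_dir (hypothetical helper)."""
    os.makedirs(dest_dir, exist_ok=True)
    with tarfile.open(tar_path, "r:*") as tar:
        # Newer Python releases support extraction filters that reject
        # unsafe archive members; fall back to plain extraction otherwise.
        if hasattr(tarfile, "data_filter"):
            tar.extractall(dest_dir, filter="data")
        else:
            tar.extractall(dest_dir)


# Example call (paths are assumptions, uncomment and adjust to use):
# extract_tar("../data/decathlon/Task01_BrainTumour.tar", "../data/decathlon")
```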

The cell below sets the paths to the raw data and to the output directory, along with the random seed and the train/test split fraction used in the rest of this notebook.

In [None]:
import os
import json
import numpy as np

data_path = "../data/decathlon/Task01_BrainTumour"  # directory for the original data
save_path = "../data/decathlon/Task01_BrainTumour/2D_model"   # directory to save NumPy files
seed = 816                                         # Random seed
train_test_split = 0.85                            # Train/test split value (fraction of the dataset to keep for training)

## BraTS Dataset

The [Brain Tumor Segmentation Dataset](https://www.med.upenn.edu/sbia/brats2018/data.html) contains multi-institutional pre-operative 3D MRI scans and focuses on the segmentation of intrinsically heterogeneous (in appearance, shape, and histology) brain tumors, namely gliomas. 

In [None]:
"""
Get the training file names from the data directory.
Decathlon should always have a dataset.json file in the
subdirectory which lists the experiment information including
the input and label filenames.
"""

json_filename = os.path.join(data_path, "dataset.json")

try:
    with open(json_filename, "r") as fp:
        experiment_data = json.load(fp)
        
    # Print information about the Decathlon experiment data
    print("*" * 30)
    print("=" * 30)
    print("Dataset name:        ", experiment_data["name"])
    print("Dataset description: ", experiment_data["description"])
    print("Tensor image size:   ", experiment_data["tensorImageSize"])
    print("Dataset release:     ", experiment_data["release"])
    print("Dataset reference:   ", experiment_data["reference"])
    print("Dataset license:     ", experiment_data["licence"])  # sic
    print("Modality:            ", experiment_data["modality"]) 
    print("Labels:              ", experiment_data["labels"]) 
    print("Training set size :  ", experiment_data["numTraining"]) 
    print("=" * 30)
    print("*" * 30)
except IOError as e:
    print("File {} doesn't exist. \nIt should be part of the "
          "Decathlon directory.\nDid you download and "
          "extract Task01_BrainTumour.tar to directory {}?".format(json_filename, data_path))



## Train/Validation/Testing Splits

Although the MSD has separate training, validation, and testing splits, the testing directory does not contain the ground truth annotations. Instead, we'll split the MSD training set into new training, validation, and testing sets so that ground truth annotations are available for all three.

In [None]:
"""
Randomize the file list. Then separate into training and
validation lists. We won't use the testing set since we
don't have ground truth masks for this; instead we'll
split the validation set into separate test and validation
sets.
"""
# Set the random seed so that we always get the same random mix
np.random.seed(seed)
numFiles = experiment_data["numTraining"]
idxList = np.arange(numFiles)  # List of file indices
randomList = np.random.random(numFiles)  # List of random numbers

# Random numbers range from 0 to 1. Files whose random number is
# below train_test_split go into the training list.
trainList = idxList[randomList < train_test_split]

# Now we'll just split the remaining files into 50% validation and 50% testing datasets
otherList = idxList[randomList >= train_test_split]
randomList = np.random.random(len(otherList))  # List of random numbers
validateList = otherList[randomList >= 0.5]
testList = otherList[randomList < 0.5]
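
The same splitting logic can be wrapped in a function, which makes it easier to check that the three lists are disjoint and together cover every file. The sketch below is an illustration under assumptions: the helper name `split_indices` is hypothetical, and it uses a local `np.random.RandomState` instead of the global seed so that it does not disturb other random state.

```python
import numpy as np


def split_indices(num_files, train_frac, seed):
    """Return (train, validate, test) index arrays (hypothetical helper)."""
    rng = np.random.RandomState(seed)  # local generator seeded for reproducibility
    idx = np.arange(num_files)

    # One uniform draw per file; draws below train_frac go to training.
    draws = rng.random_sample(num_files)
    train = idx[draws < train_frac]
    other = idx[draws >= train_frac]

    # Split the remaining files roughly 50/50 into validation and testing.
    draws = rng.random_sample(len(other))
    validate = other[draws >= 0.5]
    test = other[draws < 0.5]
    return train, validate, test


train, validate, test = split_indices(num_files=100, train_frac=0.85, seed=816)

# Sanity checks: every index appears exactly once across the three lists.
combined = np.concatenate([train, validate, test])
assert len(combined) == 100
assert len(np.unique(combined)) == 100
print(len(train), len(validate), len(test))
```

Because each file is assigned by an independent random draw, the training list holds approximately, not exactly, the requested fraction of files.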

In [None]:
from convert_raw_to_npy import convert_raw_data_to_numpy

convert_raw_data_to_numpy(trainList, validateList, testList,
                          data_path,
                          experiment_data,
                          save_path, resize=-1)
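
To make the role of the index lists more concrete, the sketch below shows one plausible way indices could be mapped to image and label files. It assumes that `experiment_data["training"]` is a list of dictionaries with `"image"` and `"label"` keys holding paths relative to the data directory; the helper name `list_training_pairs` and the example file names are hypothetical, and the real `convert_raw_data_to_numpy` may work differently.

```python
import os


def list_training_pairs(experiment_data, data_path, indices):
    """Return (image_path, label_path) tuples for the given indices.

    Assumes experiment_data["training"] is a list of dicts with
    "image" and "label" keys holding paths relative to data_path.
    """
    pairs = []
    for idx in indices:
        entry = experiment_data["training"][idx]
        image = os.path.normpath(os.path.join(data_path, entry["image"]))
        label = os.path.normpath(os.path.join(data_path, entry["label"]))
        pairs.append((image, label))
    return pairs


# Tiny made-up stand-in for dataset.json (file names are illustrative only).
example_data = {
    "numTraining": 2,
    "training": [
        {"image": "./imagesTr/case_001.nii.gz", "label": "./labelsTr/case_001.nii.gz"},
        {"image": "./imagesTr/case_002.nii.gz", "label": "./labelsTr/case_002.nii.gz"},
    ],
}

pairs = list_training_pairs(example_data, "data", [1])
print(pairs)
```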

### Medical Segmentation Decathlon (MSD)

The raw dataset is released under the [CC-BY-SA 4.0 license](https://creativecommons.org/licenses/by-sa/4.0/).
A paper describing the MSD is available [here](https://arxiv.org/abs/1902.09063).

### References for the BraTS Dataset

[1] B. H. Menze, A. Jakab, S. Bauer, J. Kalpathy-Cramer, K. Farahani, J. Kirby, et al. "The Multimodal Brain Tumor Image Segmentation Benchmark (BRATS)", IEEE Transactions on Medical Imaging 34(10), 1993-2024 (2015) DOI: 10.1109/TMI.2014.2377694

[2] S. Bakas, H. Akbari, A. Sotiras, M. Bilello, M. Rozycki, J.S. Kirby, et al., "Advancing The Cancer Genome Atlas glioma MRI collections with expert segmentation labels and radiomic features", Nature Scientific Data, 4:170117 (2017) DOI: 10.1038/sdata.2017.117

[3] S. Bakas, M. Reyes, A. Jakab, S. Bauer, M. Rempfler, A. Crimi, et al., "Identifying the Best Machine Learning Algorithms for Brain Tumor Segmentation, Progression Assessment, and Overall Survival Prediction in the BRATS Challenge", arXiv preprint arXiv:1811.02629 (2018)

In addition, if the journal or conference to which you submit your paper imposes no restrictions on citing "Data Citations", please be specific and also cite the following:

[4] S. Bakas, H. Akbari, A. Sotiras, M. Bilello, M. Rozycki, J. Kirby, et al., "Segmentation Labels and Radiomic Features for the Pre-operative Scans of the TCGA-GBM collection", The Cancer Imaging Archive, 2017. DOI: 10.7937/K9/TCIA.2017.KLXWJJ1Q

[5] S. Bakas, H. Akbari, A. Sotiras, M. Bilello, M. Rozycki, J. Kirby, et al., "Segmentation Labels and Radiomic Features for the Pre-operative Scans of the TCGA-LGG collection", The Cancer Imaging Archive, 2017. DOI: 10.7937/K9/TCIA.2017.GJQ7R0EF

*Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. SPDX-License-Identifier: EPL-2.0*

*Copyright (c) 2019-2020 Intel Corporation*