# Generate SUMMIT datasplit

Purpose: The MONAI detection algorithm is based on nnDetection, which expects its inputs in a very specific format, i.e. NIfTI files. The algorithm uses LUNA16 as its baseline dataset and therefore follows the competition rules, i.e. 10-fold cross-validation. The files must be prepared in a particular way to work out of the box with the detection algorithm.

SUMMIT is a much larger dataset, so it can probably support a single training set of 5K samples and a validation set of 1K samples (plus a hold-out set for later evaluation). The SUMMIT metadata, a flat CSV file listing the nodules, therefore has to be converted to the format required by MONAI detection, which is a JSON file.

This JSON file has the form:

    {
        "training" : [
            {
                "box": [
                    [
                        real_world_x,
                        real_world_y,
                        real_world_z,
                        xd,
                        yd,
                        zd
                    ],
                    [
                        ...
                    ]
                ],
                "image": "location of mhd on disc ... relative to a base root directory of the images",
                "label": [
                    0,
                    ...
                ]
            },
            ...
        ],
        "validation" : [
            {...},
            ...
            {...}
        ]
    }
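
The layout above implies a few invariants: both splits are present, every box has six values, and each scan has one label per box. A minimal sketch of a checker for those invariants follows. The function name `validate_datasplit` and the example values are hypothetical and not part of MONAI.

```python
def validate_datasplit(datasplit):
    """Return a list of problems found in a datasplit dict (empty if none)."""
    problems = []
    for split in ("training", "validation"):
        if split not in datasplit:
            problems.append(f"missing split: {split}")
            continue
        for n, item in enumerate(datasplit[split]):
            boxes = item.get("box", [])
            labels = item.get("label", [])
            # 'image' is a path relative to the image root directory
            if not isinstance(item.get("image"), str):
                problems.append(f"{split}[{n}]: 'image' must be a path string")
            # One label per box
            if len(boxes) != len(labels):
                problems.append(
                    f"{split}[{n}]: {len(boxes)} boxes but {len(labels)} labels"
                )
            # Each box is [x, y, z, xd, yd, zd]
            for box in boxes:
                if len(box) != 6:
                    problems.append(
                        f"{split}[{n}]: box has {len(box)} values, expected 6"
                    )
    return problems


# Invented example data, for illustration only
example = {
    "training": [
        {
            "box": [[10.0, -20.5, 130.0, 6.2, 6.2, 6.2]],
            "image": "a/a_scan.mhd",
            "label": [0],
        }
    ],
    "validation": [{"box": [], "image": "b/b_scan.mhd", "label": []}],
}
print(validate_datasplit(example))
```

A scan with no nodules is treated as valid here, since it simply has empty `box` and `label` lists.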

This notebook transforms training_metadata.csv and validation_metadata.csv (together with the matching training_scans.csv and validation_scans.csv scan lists) into this form. Due to the size and exploratory nature of this part of the project, only one fold will be created.
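
As a rough illustration of the transformation, the sketch below converts a tiny in-memory CSV into entries of the target form using only the standard library. The column names match those used in this notebook. The rows, the participant IDs and the `{pid}_scan.mhd` file naming are invented for the example.

```python
import csv
import io
import json
from collections import defaultdict

# Toy stand-in for the flat nodule CSV; the rows and values are invented
toy_csv = """main_participant_id,nodule_x_coordinate,nodule_y_coordinate,nodule_z_coordinate,nodule_diameter_mm
p001,-60.5,42.0,-130.25,6.0
p001,35.0,-12.5,-98.0,8.5
p002,10.0,20.0,-150.0,4.5
"""

# Group the nodules by participant, one [x, y, z, d, d, d] box per nodule
boxes_by_participant = defaultdict(list)
for row in csv.DictReader(io.StringIO(toy_csv)):
    d = float(row["nodule_diameter_mm"])
    boxes_by_participant[row["main_participant_id"]].append([
        float(row["nodule_x_coordinate"]),
        float(row["nodule_y_coordinate"]),
        float(row["nodule_z_coordinate"]),
        d, d, d,
    ])

# Build one entry per participant; the image file name is a hypothetical pattern
entries = [
    {"box": boxes, "image": f"{pid}/{pid}_scan.mhd", "label": [0] * len(boxes)}
    for pid, boxes in boxes_by_participant.items()
]
print(json.dumps(entries[0], indent=2))
```

Grouping the rows once up front avoids filtering the whole metadata table for every scan, which may matter for a dataset of several thousand scans.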

In [5]:
import pandas as pd
from pathlib import Path
import json


In [20]:
METADATA_PATH = '/Users/john/Projects/SOTAEvaluationNoduleDetection/output/metadata'
# Root directory of the locally available scans, used to build the subset
SCAN_ROOT = '/Users/john/Projects/SOTAEvaluationNoduleDetection/scans/lung50'

data_splits = ['training', 'validation']

# summit_datasplits holds every scan; summit_subset only those found under SCAN_ROOT
summit_subset = {'training': [], 'validation': []}
summit_datasplits = {'training': [], 'validation': []}

# i counts all scans, j counts the scans that exist on disk
i = j = 0

for data_split in data_splits:

    scans = pd.read_csv(Path(METADATA_PATH, data_split + '_scans.csv'))
    metadata = pd.read_csv(Path(METADATA_PATH, data_split + '_metadata.csv'))

    for scan_id in scans.scan_id.tolist():
        # The participant ID is the part of scan_id before the first underscore
        study_id = scan_id.split('_', 1)[0]

        scan_item = {
            'box': [],
            'image': f'{study_id}/{scan_id}.mhd',
            'label': []
        }

        # One box per nodule: world coordinates (x, y, z) followed by
        # the diameter repeated for each axis
        for _, row in metadata[metadata.main_participant_id == study_id].iterrows():
            scan_item['box'].append(
                [
                    row.nodule_x_coordinate,
                    row.nodule_y_coordinate,
                    row.nodule_z_coordinate,
                    row.nodule_diameter_mm,
                    row.nodule_diameter_mm,
                    row.nodule_diameter_mm
                ])
            # Every nodule is given label 0
            scan_item['label'].append(0)

        summit_datasplits[data_split].append(scan_item)
        i += 1

        # Keep the scan in the subset only if its image file is present locally
        if Path(SCAN_ROOT, scan_item['image']).exists():
            summit_subset[data_split].append(scan_item)
            j += 1


print(i, j)

Path('SUMMIT_datasplit/mhd_original').mkdir(parents=True, exist_ok=True)

with open('SUMMIT_datasplit/mhd_original/dataset_fold0.json','w') as f:
    json.dump(summit_datasplits, f)

with open('SUMMIT_datasplit/mhd_original/dataset_fold0_subset.json','w') as f:
    json.dump(summit_subset, f)


5124 25
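
The two printed numbers are the total number of scans across both splits and the number of those scans found on disk. A minimal sketch of a per-split summary for sanity-checking a written file is shown below. `summarize_datasplit` is a hypothetical helper and the toy dict is invented for illustration.

```python
def summarize_datasplit(datasplit):
    """Count scans and nodules per split in a datasplit dict."""
    summary = {}
    for split, items in datasplit.items():
        boxes_per_scan = [len(item["box"]) for item in items]
        summary[split] = {
            "scans": len(items),
            "nodules": sum(boxes_per_scan),
            "scans_without_nodules": sum(1 for n in boxes_per_scan if n == 0),
        }
    return summary


# Invented toy data in the same layout as dataset_fold0.json
toy = {
    "training": [
        {"box": [[1.0, 2.0, 3.0, 5.0, 5.0, 5.0]], "image": "a/a_s.mhd", "label": [0]},
        {"box": [], "image": "b/b_s.mhd", "label": []},
    ],
    "validation": [
        {"box": [[4.0, 5.0, 6.0, 7.0, 7.0, 7.0]], "image": "c/c_s.mhd", "label": [0]},
    ],
}
toy_summary = summarize_datasplit(toy)
print(toy_summary)
```

Passing the result of `json.load` on `dataset_fold0.json` to this helper should give scan counts that sum to the first printed number. For `dataset_fold0_subset.json` they should sum to the second.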
