In [None]:
import configparser
from pathlib import Path
import os
import pandas as pd
import nccid_cleaning.etl as etl
from nccid_cleaning import clean_data_df, patient_df_pipeline

This notebook generates CSV files containing the patient clinical data and the image metadata for each patient and image file in the NCCID data.

To use these tools, provide a `BASE_PATH` that points to the data pulled from the NCCID S3 bucket. Your local directory structure should match the original S3 structure. If you have split the data into training/test/validation sets, each subdirectory should have the same structure as the original S3 bucket, and the pipeline below should be run separately for each split.

To run the code below, replace the value of `training_data` under the `Paths` section of the `config.ini` file so that it points to the location of your NCCID data. Alternatively, comment out the first two lines of the next cell and specify the path directly as a `Path`, e.g., `BASE_PATH = Path("/project/data/training/")`.
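
As an illustration of the expected layout, the sketch below parses an in-memory stand-in for `config.ini` and builds the path from it. The section and key names come from this notebook; the path value is only the example used above, and the `EXAMPLE_CONFIG` and `base_path` names are hypothetical.

```python
import configparser
from pathlib import Path

# Hypothetical contents of config.ini; only the [Paths] section and the
# training_data key are taken from this notebook.
EXAMPLE_CONFIG = """
[Paths]
training_data = /project/data/training/
"""

config = configparser.ConfigParser()
config.read_string(EXAMPLE_CONFIG)  # the notebook uses config.read("../config.ini")

base_path = Path(config["Paths"]["training_data"])
print(f"Location of NCCID data to be processed: {base_path}")
```

Note that `ConfigParser.read` silently skips files it cannot find, so a `KeyError` on `config["Paths"]` usually means the relative path to `config.ini` is wrong for the current working directory.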

In [None]:
config = configparser.ConfigParser()  # comment out if specifying BASE_PATH directly
config.read("../config.ini")  # comment out if specifying BASE_PATH directly
# Change the value of "training_data" in the config.ini file
BASE_PATH = Path(config["Paths"]["training_data"])
print(f"Location of NCCID data to be processed: {BASE_PATH}")

## Imaging Metadata

For the imaging metadata, a separate CSV is generated for each imaging modality: X-ray, CT, and MRI. Three steps are performed:
<ul>
    <li> `select_image_files` - traverses the directory tree to find all files of the imaging modality. For X-ray, it is recommended to set `select_all = True` to process all available X-ray files. For the 3D modalities, CT and MRI, `select_first = True` is recommended so that only the first file of each imaging volume is selected, which speeds up run time and reduces redundant information. </li>
    <li> `ingest_dicom_jsons` - reads the DICOM header information for each file. </li>
    <li> `pydicom_to_df` - converts the DICOM metadata into a pandas DataFrame where the rows are images and the columns are the DICOM attributes. </li>
</ul> <br>
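
To make the idea behind `select_first` concrete, here is a minimal sketch of picking one file per imaging volume. It is not the implementation of `etl.select_image_files`: the helper `first_file_per_directory` is hypothetical, and it assumes that all files belonging to one volume sit in the same directory and use a `.json` suffix.

```python
import tempfile
from collections import defaultdict
from pathlib import Path
from typing import List


def first_file_per_directory(root: Path, suffix: str = ".json") -> List[Path]:
    """Return one file per directory under root: the first by sorted path.

    Hypothetical helper; assumes one directory per imaging volume.
    """
    files_by_dir = defaultdict(list)
    for path in root.rglob(f"*{suffix}"):
        if path.is_file():
            files_by_dir[path.parent].append(path)
    # min() picks the lexicographically first file in each directory.
    return [min(paths) for _, paths in sorted(files_by_dir.items())]


# Demonstration on a throwaway directory tree with two "volumes".
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for relative in ("vol1/slice_001.json", "vol1/slice_002.json", "vol2/slice_001.json"):
        target = root / relative
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text("{}")
    selected = [p.relative_to(root).as_posix() for p in first_file_per_directory(root)]

print(selected)
```

Reading a single header per volume is usually enough for study-level attributes, since these are typically repeated across every slice of the same volume.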

The resulting DataFrames are saved as CSV files in `data/`.
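
`DataFrame.to_csv` does not create missing directories, so the save step fails if `data/` does not already exist next to the notebook. A minimal sketch for creating it first is shown here; the helper name `ensure_output_dir` is hypothetical.

```python
from pathlib import Path


def ensure_output_dir(directory) -> Path:
    """Create the output directory (and any parents) if it is missing."""
    output_dir = Path(directory)
    output_dir.mkdir(parents=True, exist_ok=True)  # no error if it already exists
    return output_dir


OUTPUT_DIR = ensure_output_dir("data")
print(f"CSV files will be written to: {OUTPUT_DIR.resolve()}")
```

The returned path can then be used when saving, for example `xrays.to_csv(OUTPUT_DIR / "xrays.csv")`.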

In [None]:
# subdirectories
XRAY_SUBDIR = "xray-metadata"
CT_SUBDIR = "ct-metadata"
MRI_SUBDIR = "mri-metadata"

In [None]:
# 1. finding image file lists within the subdirs
xray_files = etl.select_image_files(BASE_PATH / XRAY_SUBDIR, select_all=True)
ct_files = etl.select_image_files(BASE_PATH / CT_SUBDIR, select_first=True)
mri_files = etl.select_image_files(BASE_PATH / MRI_SUBDIR, select_first=True)

In [None]:
# 2. process image metadata
xray_datasets = etl.ingest_dicom_jsons(xray_files)
ct_datasets = etl.ingest_dicom_jsons(ct_files)
mri_datasets = etl.ingest_dicom_jsons(mri_files)

In [None]:
# 3. converting to DataFrame
xrays = etl.pydicom_to_df(xray_datasets)
cts = etl.pydicom_to_df(ct_datasets)
mris = etl.pydicom_to_df(mri_datasets)

In [None]:
# check structure of DFs
xrays.head()

In [None]:
# Save as csv
xrays.to_csv("data/xrays.csv")
cts.to_csv("data/cts.csv")
mris.to_csv("data/mris.csv")

## Patient Clinical Data

For patient clinical data, the most recent <b>data</b> file (for COVID-positive patients) or <b>status</b> file (for COVID-negative patients) is parsed for each patient in the directory tree. The resulting DataFrame is generated using `patient_jsons_to_df`, where rows are patients and columns are data fields. <br>

Three fields that are not in the original JSON files are included in the DataFrame:
<ul>
    <li> `filename_earliest_date` - earliest data/status file present for the patient. </li>
    <li> `filename_latest_date` - latest data/status file present for the patient. This is the file from which the rest of the patient's data has been pulled. </li>
    <li> `filename_covid_status` - indicates whether the patient is in the COVID-positive or COVID-negative cohort, based on whether they have ever been submitted with a <b>data</b> file (which is only present for positive patients). </li>
 </ul>
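
The three derived fields can be illustrated with a small sketch. It is not the code used by `patient_jsons_to_df`: the filename pattern `data_YYYY-MM-DD.json` / `status_YYYY-MM-DD.json`, the helper `summarise_patient_files`, and the boolean encoding of `filename_covid_status` are all assumptions made for illustration.

```python
import re
from datetime import date
from typing import Dict, Iterable

# Assumed filename layout: "<data|status>_<YYYY-MM-DD>.json".
FILENAME_PATTERN = re.compile(r"^(data|status)_(\d{4})-(\d{2})-(\d{2})\.json$")


def summarise_patient_files(filenames: Iterable[str]) -> Dict[str, object]:
    """Derive the earliest date, latest date, and cohort from one patient's files."""
    dated = []
    for name in filenames:
        match = FILENAME_PATTERN.match(name)
        if match is None:
            continue  # ignore files that do not follow the assumed pattern
        kind, year, month, day = match.groups()
        dated.append((date(int(year), int(month), int(day)), kind))
    if not dated:
        raise ValueError("no data/status files found for this patient")
    dates = [file_date for file_date, _ in dated]
    return {
        "filename_earliest_date": min(dates),
        "filename_latest_date": max(dates),
        # Positive cohort if the patient was ever submitted with a data file.
        "filename_covid_status": any(kind == "data" for _, kind in dated),
    }


summary = summarise_patient_files(["status_2020-05-01.json", "data_2020-06-15.json"])
print(summary)
```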

In [None]:
PATIENT_SUBDIR = "data"

In [None]:
# process patient clinical data
patient_files = list(os.walk(BASE_PATH / PATIENT_SUBDIR))
patients = etl.patient_jsons_to_df(patient_files)

In [None]:
patients.head()

### Clean and enrich

The cleaning pipeline can be run on the resulting patients DataFrame to improve data quality. In addition, missing values for sex and age in the patient DataFrame can be filled using the DICOM image headers. This step generates two new columns, `sex_update` and `age_update`, from the cleaned columns `sex` and `age`.
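
To show what filling from DICOM headers can involve, the sketch below converts a DICOM Age String (for example `045Y` or `006M`) to years and uses it only when the clinical value is missing. It is an illustration under stated assumptions, not the logic of `etl.patient_data_dicom_update`; the helpers `dicom_age_to_years` and `fill_missing_age` are hypothetical.

```python
import re
from typing import Optional

# DICOM Age String format: three digits followed by D, W, M, or Y.
_AGE_PATTERN = re.compile(r"^(\d{3})([DWMY])$")
# Approximate number of each unit per year.
_UNITS_PER_YEAR = {"D": 365.25, "W": 365.25 / 7, "M": 12.0, "Y": 1.0}


def dicom_age_to_years(value: Optional[str]) -> Optional[float]:
    """Convert a DICOM Age String to years, or return None if it cannot be parsed."""
    if not value:
        return None
    match = _AGE_PATTERN.match(value.strip().upper())
    if match is None:
        return None
    number, unit = match.groups()
    return int(number) / _UNITS_PER_YEAR[unit]


def fill_missing_age(clinical_age: Optional[float], dicom_age: Optional[str]) -> Optional[float]:
    """Keep the clinical age when present; otherwise fall back to the DICOM header."""
    if clinical_age is not None:
        return clinical_age
    return dicom_age_to_years(dicom_age)


print(fill_missing_age(None, "045Y"))
print(fill_missing_age(30.0, "045Y"))
```

Preferring the clinical value and treating the header as a fallback keeps the original submission authoritative, which matches the description of `age_update` as a filled version of `age`.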

In [None]:
# cleaning
patients = clean_data_df(patients, patient_df_pipeline)

# enriching
images = [xrays, cts, mris] # list all image DFs
patients = etl.patient_data_dicom_update(patients, images)
patients.head()

In [None]:
print(f"Sex Unknowns before merging with dicom: {(patients['sex']=='Unknown').sum()}")
print(f"Sex Unknowns after merging with dicom: {(patients['sex_update']=='Unknown').sum()}")
print("------")
print(f"Age NaNs before merging with dicom: {patients['age'].isnull().sum()}")
print(f"Age NaNs after merging with dicom: {patients['age_update'].isnull().sum()}")

In [None]:
# save to csv
patients.to_csv("data/patients.csv")