In [None]:
from pathlib import Path
import os
import nccid_cleaning.etl as etl
from nccid_cleaning import clean_data_df, patient_df_pipeline

This notebook generates CSV files containing patient clinical data and image metadata for each patient and image file in the NCCID data.


To use these tools, provide a `BASE_PATH` that points to the data pulled from the NCCID S3 bucket. Your local directory structure should match the original S3 structure.

#### SET THE DIRECTORY LOCATION BELOW

In [None]:
base_path_str = ""  # Set location of NCCID data
BASE_PATH = Path(base_path_str)
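Before running the cells below, it can help to confirm that the local copy has the layout this notebook expects. The following is a minimal sketch: `missing_subdirs` is a hypothetical helper (not part of `nccid_cleaning`), and the subdirectory names are simply the ones used later in this notebook.

```python
from pathlib import Path

# Subdirectories this notebook reads from, relative to the base path.
EXPECTED_SUBDIRS = ("xray-metadata", "ct-metadata", "mri-metadata", "data")


def missing_subdirs(base_path, expected=EXPECTED_SUBDIRS):
    """Return the expected subdirectory names that are absent under base_path."""
    base_path = Path(base_path)
    return [name for name in expected if not (base_path / name).is_dir()]
```

Calling `missing_subdirs(BASE_PATH)` should return an empty list when the layout matches. Note that an empty `base_path_str` makes `BASE_PATH` refer to the current working directory.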

## Imaging Metadata

For the imaging metadata, a separate CSV is generated for each imaging modality: X-ray, CT, and MRI. Three steps are performed:
<l>
    <li> `select_image_files` - traverses the directory tree to find all files of the particular imaging modality. For X-ray, it is recommended to set `select_all = True` to process all available X-ray files. For the 3D modalities, CT and MRI, `select_first = True` is recommended to select only the first file of each imaging volume, which speeds up run time and reduces redundant information. </li>
    <li> `ingest_dicom_jsons` - reads the DICOM header information for each file. </li>
    <li> `pydicom_to_df` - converts the DICOM metadata into a pandas DataFrame where the rows are images and the columns are the DICOM attributes. </li>
</l> <br>

The resulting DataFrames are saved as CSV files in `data/`.
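To illustrate the difference between selecting all files and selecting only the first file per volume, here is a small sketch. It is not the `nccid_cleaning` implementation: `pick_files` is a hypothetical function, and it assumes that all files belonging to one imaging volume share a parent directory, which may not hold for your copy of the data.

```python
from pathlib import Path


def pick_files(paths, select_first=False):
    """Illustrative only: keep every file, or the first file per directory.

    Assumption: all files of one imaging volume sit in the same directory.
    """
    paths = sorted(Path(p) for p in paths)
    if not select_first:
        return paths
    first_per_dir = {}
    for path in paths:
        # setdefault keeps the first (lowest-sorted) file seen for each directory.
        first_per_dir.setdefault(path.parent, path)
    return list(first_per_dir.values())
```

With `select_first=True`, a volume stored as many slice files contributes a single row of metadata, which is usually enough because most header attributes are shared across the slices of a volume.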

In [None]:
# subdirectories
XRAY_SUBDIR = "xray-metadata"
CT_SUBDIR = "ct-metadata"
MRI_SUBDIR = "mri-metadata"

In [None]:
# 1. finding image file lists within the subdirs
xray_files = etl.select_image_files(
    BASE_PATH / XRAY_SUBDIR, select_all=True
)
ct_files = etl.select_image_files(
    BASE_PATH / CT_SUBDIR, select_first=True
)
mri_files = etl.select_image_files(
    BASE_PATH / MRI_SUBDIR, select_first=True
)

In [None]:
# 2. process image metadata
xray_datasets = etl.ingest_dicom_jsons(xray_files)
ct_datasets = etl.ingest_dicom_jsons(ct_files)
mri_datasets = etl.ingest_dicom_jsons(mri_files)



In [None]:
# 3. converting to DataFrame
xrays = etl.pydicom_to_df(xray_datasets)
cts = etl.pydicom_to_df(ct_datasets)
mris = etl.pydicom_to_df(mri_datasets)

In [None]:
# Save as csv (create the output directory first if it does not exist)
Path("data").mkdir(exist_ok=True)
xrays.to_csv("data/xrays.csv")
cts.to_csv("data/cts.csv")
mris.to_csv("data/mris.csv")

In [None]:
xrays.head()

## Patient Clinical Data

For patient clinical data, the most recent <b>data</b> file (for COVID-positive patients) or <b>status</b> file (for COVID-negative patients) is parsed for each patient in the directory tree. The resulting DataFrame is generated using `patient_jsons_to_df`, where rows are patients and columns are data fields. <br>

Three fields that are not in the original JSON files are included in the DataFrame:
<l>
    <li> `filename_earliest_date` - earliest data/status file present for the patient. </li>
    <li> `filename_latest_date` - latest data/status file present for the patient. This is the file from which the rest of the patient's data has been pulled. </li>
    <li> `filename_covid_status` - indicates whether the patient is in the COVID-positive or COVID-negative cohort, based on whether they have ever been submitted with a <b>data</b> file (which are only present for positive patients). </li>
 </l>
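The logic behind these three fields can be sketched as follows. This is an illustration under stated assumptions, not the library's code: `summarise_patient_files` is a hypothetical helper, the file names are assumed to look like `<kind>_<YYYY-MM-DD>.json` with `<kind>` being `data` or `status`, and the cohort is represented here as a boolean, which may differ from the actual column.

```python
from datetime import date
from pathlib import Path


def summarise_patient_files(filenames):
    """Derive the three filename-based fields from one patient's file names.

    Assumption: names follow '<kind>_<YYYY-MM-DD>.json', kind in {data, status}.
    """
    dated = []
    for name in filenames:
        # Split 'data_2020-05-10' into the kind and the ISO date string.
        kind, _, date_str = Path(name).stem.partition("_")
        dated.append((date.fromisoformat(date_str), kind))
    dates = [d for d, _ in dated]
    return {
        "filename_earliest_date": min(dates),
        "filename_latest_date": max(dates),
        # Positive cohort if the patient has ever had a 'data' file.
        "filename_covid_status": any(kind == "data" for _, kind in dated),
    }
```

For example, a patient with one `status` file followed by a later `data` file would be placed in the positive cohort, with the `data` file's date as the latest date.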

In [None]:
PATIENT_SUBDIR = "data"

In [None]:
# process patient clinical data
patient_files = list(os.walk(BASE_PATH / PATIENT_SUBDIR))
patients = etl.patient_jsons_to_df(patient_files)
# save DataFrame to csv (create the output directory first if it does not exist)
Path("data").mkdir(exist_ok=True)
patients.to_csv("data/patients.csv")

In [None]:
patients.head()