## Introduction

This notebook collects utility code snippets for preparing datasets in the required specs and formats.


### 1. DICOM to PNG

We'll convert the DICOM images in the training set to PNG images of a chosen size. PNG images are easier to work with, and initially it's helpful to prototype with smaller 128x128 or 256x256 images. Most of this code is based on this [notebook](https://www.kaggle.com/code/theoviel/get-started-quicker-dicom-png-conversion).

In [None]:
!pip install -qU python-gdcm pylibjpeg

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import os
import cv2
import glob
import gdcm
import pydicom
import matplotlib.pyplot as plt
import random
import shutil
random.seed(123)

from tqdm.notebook import tqdm
from joblib import Parallel, delayed

In [None]:
DICOM_PATH = '/kaggle/input/rsna-2023-abdominal-trauma-detection/train_images/'
SAVE_FOLDER = '/kaggle/working/train_images_128_png_complete/'
SIZE = 128
EXTENSION = 'png'

os.makedirs(SAVE_FOLDER, exist_ok = True)

In [None]:
def standardize_pixel_array(dcm: pydicom.dataset.FileDataset) -> np.ndarray:
    """
    Source : https://www.kaggle.com/competitions/rsna-2023-abdominal-trauma-detection/discussion/427217
    """
    pixel_array = dcm.pixel_array
    # For signed data (PixelRepresentation == 1) stored in fewer bits than
    # allocated, shift left then right to sign-extend the stored values.
    if dcm.PixelRepresentation == 1:
        bit_shift = dcm.BitsAllocated - dcm.BitsStored
        dtype = pixel_array.dtype
        pixel_array = (pixel_array << bit_shift).astype(dtype) >> bit_shift
#         pixel_array = pydicom.pixel_data_handlers.util.apply_modality_lut(new_array, dcm)

    # Rescale parameters and display window taken from the DICOM header.
    intercept = float(dcm.RescaleIntercept)
    slope = float(dcm.RescaleSlope)
    center = int(dcm.WindowCenter)
    width = int(dcm.WindowWidth)
    low = center - width / 2
    high = center + width / 2

    # Apply the linear rescale, then clip to the window [low, high].
    pixel_array = (pixel_array * slope) + intercept
    pixel_array = np.clip(pixel_array, low, high)

    return pixel_array
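
The sign-extension and windowing steps in `standardize_pixel_array` can be checked on a toy array without any DICOM file. The sketch below is only an illustration: the header values (12 bits stored in 16 allocated, slope 1, intercept -1024, window center 50, width 400) are assumed, not read from the dataset.

```python
import numpy as np

# Assumed toy header values: 12 bits stored in a 16-bit container.
bits_allocated, bits_stored = 16, 12
bit_shift = bits_allocated - bits_stored

# 12-bit two's-complement patterns whose sign bit was not propagated
# into the upper 4 bits of the int16 container.
raw = np.array([0x0FFF, 0x0800, 0x07FF, 0x0000], dtype=np.int16)

# Shift left so bit 11 lands on the int16 sign bit, then shift back;
# the right shift on a signed dtype is arithmetic and copies the sign bit.
fixed = (raw << bit_shift).astype(raw.dtype) >> bit_shift

# Hypothetical rescale and window parameters (not taken from real data).
slope, intercept = 1.0, -1024.0
center, width = 50, 400
low, high = center - width / 2, center + width / 2

# Linear rescale followed by clipping to the window [low, high].
windowed = np.clip(fixed * slope + intercept, low, high)

print(fixed)
print(windowed)
```

After the shift, the pattern `0x0FFF` reads as -1 and `0x0800` as -2048, which is what a 12-bit signed value should decode to.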

In [None]:
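def process(patient, size = SIZE, save_folder = "", data_path = ""):
    """Convert every DICOM slice of one patient to a size x size PNG."""
    for study in sorted(os.listdir(os.path.join(data_path, patient))):
        imgs = {}
        for f in sorted(glob.glob(os.path.join(data_path, patient, study, "*.dcm"))):

            dicom = pydicom.dcmread(f)
            # Tag (0020, 0032) is ImagePositionPatient; its last component is z.
            pos_z = dicom[(0x20, 0x32)].value[-1]

            # Window the slice, then min-max normalize to [0, 1].
            img = standardize_pixel_array(dicom)
            img = (img - img.min())/(img.max() - img.min() + 1e-6)

            # MONOCHROME1 stores inverted intensities, so flip them.
            if dicom.PhotometricInterpretation == "MONOCHROME1":
                img = 1 - img

            imgs[pos_z] = img

        # Write slices in order of increasing z so the file index i follows slice position.
        for i, k in enumerate(sorted(imgs.keys())):
            img = cv2.resize(imgs[k], (size, size))
            out_path = os.path.join(save_folder, f"{patient}_{study}_{i}.png")
            cv2.imwrite(out_path, (img * 255).astype(np.uint8))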
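
The per-slice steps in `process` (order by z position, min-max normalize, optionally invert, convert to 8-bit) can be tried on small arrays. This is a minimal sketch with made-up values; the helper name `to_uint8` and the toy slices are hypothetical and are not part of the notebook's pipeline.

```python
import numpy as np

def to_uint8(img, invert=False):
    """Min-max normalize to [0, 1], optionally invert, and scale to 8-bit."""
    img = img.astype(np.float64)
    img = (img - img.min()) / (img.max() - img.min() + 1e-6)
    if invert:  # mirrors the MONOCHROME1 branch
        img = 1 - img
    return (img * 255).astype(np.uint8)

# Toy "slices" keyed by an assumed z position, deliberately out of order.
slices = {
    12.5: np.array([[-150.0, 0.0], [100.0, 250.0]]),
    -3.0: np.array([[-150.0, -150.0], [0.0, 250.0]]),
    4.0: np.array([[250.0, 0.0], [0.0, -150.0]]),
}

# Sorting the keys gives the slice order used for the output file index.
z_order = sorted(slices)
pngs = [to_uint8(slices[z]) for z in z_order]

print(z_order)
print(pngs[0])
```

Keying the dictionary by z position also means two slices reporting the same z would overwrite each other, which is worth keeping in mind when a series looks shorter than expected.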

In [None]:
patients = os.listdir(DICOM_PATH)

In [None]:
Parallel(n_jobs = 3)(
    delayed(process)(patient, size=SIZE, save_folder=SAVE_FOLDER, data_path=DICOM_PATH)
    for patient in tqdm(patients)
)

In [None]:
random.choice(os.listdir('/kaggle/working/train_images_128_png_complete'))

In [None]:
random_file = random.choice(os.listdir(SAVE_FOLDER))
img = cv2.imread(SAVE_FOLDER + random_file, 0)
#cv2.imshow('image', img)

plt.figure(figsize=(15, 15))
plt.imshow(img, cmap="gray")
plt.show()

In [None]:
!tar -zcf train_images_128_png_complete.tar.gz /kaggle/working/train_images_128_png_complete/

### 2. Create a Small Subset of DICOM Images for Prototyping

Instead of working with all DICOM images for all patients in the training dataset, we'll use only a small subset of them. We would like every injury label to be represented by an adequate number of patients in this small dataset.

In [None]:
import os
import random
import shutil
from tqdm.notebook import tqdm
import pandas as pd
random.seed(123)

In [None]:
DICOM_PATH = '/kaggle/input/rsna-2023-abdominal-trauma-detection/train_images/'
SAVE_PATH = '/kaggle/working/train_images_DICOM_500/'
num_patients = 500
num_img = 10
os.makedirs(SAVE_PATH, exist_ok = True)

In [None]:
patients = random.sample(os.listdir(DICOM_PATH), k = num_patients)

for pat in tqdm(patients):

    # Pick one series at random for this patient.
    series_id = random.choice(os.listdir(os.path.join(DICOM_PATH, pat)))
    source_folder = os.path.join(DICOM_PATH, pat, series_id)

    # Copy at most num_img slices; a series may contain fewer.
    files = os.listdir(source_folder)
    files_to_copy = random.sample(files, k = min(num_img, len(files)))

    for file in files_to_copy:
        target_filename = os.path.join(SAVE_PATH, str(pat) + '_' + str(series_id) + '_' + file)
        file_path = os.path.join(source_folder, file)
        shutil.copy(file_path, target_filename)
    

Let us look at the distribution of labels in this small DICOM image set.

In [None]:
patient_labels = pd.read_csv('/kaggle/input/rsna-2023-abdominal-trauma-detection/train.csv')
patients = [int(pat) for pat in patients]
subset_patient_labels = patient_labels[patient_labels['patient_id'].isin(patients)]

In [None]:
def get_pos_num(df, label):
    """Return the number of rows in df where `label` equals 1."""
    return (df[label] == 1).sum()

injury_labels = ['bowel_injury', 'extravasation_injury' , 'kidney_low' , 'kidney_high', 'liver_low',
                 'liver_high', 'spleen_low', 'spleen_high']

print(f'Total number of patients: {num_patients}')
for label in injury_labels:    
    print(f'Number of patients with {label} = {get_pos_num(subset_patient_labels, label)}')
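
If the counts printed above turn out too low for some label, one possible alternative to purely random sampling is to reserve a minimum number of positive patients per label first and fill the rest at random. The sketch below is an assumption-laden illustration on a toy table: the function `sample_with_min_positives`, its parameters, and the toy labels `label_a` and `label_b` are hypothetical, and this is not what the notebook does.

```python
import pandas as pd

def sample_with_min_positives(labels, label_cols, n_total, min_pos, seed=123):
    """Pick patient ids so each label has up to min_pos positives, then fill randomly.

    The result can exceed n_total if the reserved positives alone exceed it.
    """
    chosen = set()
    for col in label_cols:
        pos = labels.loc[labels[col] == 1, "patient_id"]
        take = min(min_pos, len(pos))  # a label may have fewer positives than requested
        chosen.update(pos.sample(n=take, random_state=seed).tolist())

    # Fill the remaining slots from patients not selected yet.
    remaining = labels.loc[~labels["patient_id"].isin(chosen), "patient_id"]
    n_fill = max(0, min(n_total - len(chosen), len(remaining)))
    chosen.update(remaining.sample(n=n_fill, random_state=seed).tolist())
    return sorted(chosen)

# Toy label table: 20 patients, two rare labels.
toy = pd.DataFrame({
    "patient_id": range(20),
    "label_a": [1, 1, 1] + [0] * 17,
    "label_b": [0] * 5 + [1, 1] + [0] * 13,
})

picked = sample_with_min_positives(toy, ["label_a", "label_b"], n_total=8, min_pos=2)
subset = toy[toy["patient_id"].isin(picked)]
print(subset[["label_a", "label_b"]].sum())
```

The trade-off is that the subset no longer reflects the true label frequencies, so metrics computed on it should be read with that in mind.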

In [None]:
!du -sh /kaggle/working/train_images_DICOM_500/

In [None]:
len(os.listdir(SAVE_PATH))

### 3. Resize 512x512 PNGs to SIZE x SIZE PNGs

We take the 512x512 PNG datasets provided [here](https://www.kaggle.com/code/theoviel/get-started-quicker-dicom-png-conversion) and resize them to a given size. It's helpful to initially experiment with datasets of small images (128x128, 256x256, etc.).

In [None]:
BASE_DIR = '/kaggle/input/rsna-abdominal-trauma-detection-png-pt2'
SAVE_DIR = '/kaggle/working/rsna-atd-128-png-pt2/'
SIZE = 128
os.makedirs(SAVE_DIR, exist_ok = True)

In [None]:
def process(fn, size):
    # cv2.imread already returns uint8 pixels in [0, 255], so no rescaling is needed.
    im = cv2.imread(os.path.join(BASE_DIR, fn))
    im = cv2.resize(im, (size, size))
    cv2.imwrite(os.path.join(SAVE_DIR, fn), im)

In [None]:
import multiprocessing

multiprocessing.cpu_count()

In [None]:
f_list = os.listdir(BASE_DIR)
Parallel(n_jobs = 4)(
    delayed(process)(fn, SIZE)
    for fn in tqdm(f_list)
    )