# Introduction
In this notebook, I describe all the steps I took to build a new dataset of weekly CTs.

The process consists of six steps:

1. Navigating the folders that may contain weekly CTs. These folders can already be on this computer, or a user can create them by downloading new patients from MIRADA or other UMCG datasets.

2. Extracting only the weekly CTs from these folders and listing them in an Excel file.

3. Transferring the newly found weekly CTs into a destination folder (either an existing weekly CT folder or a new one).

4. Making an Excel report with information about the weekly CTs in the destination folder and some clinical information about the patients who have these weekly CTs.

5. Making a panel that presents different information about the weekly CT dataset.

6. Running a watchdog that keeps track of all additions to the destination folder and saves them in a log file.

In [1]:
# General Libraries
import glob
import os
import shutil
import math
import re
import numpy as np
import pandas as pd
from random import randint
import matplotlib.pyplot as plt
import matplotlib.colors as mcolors
from datetime import time, datetime, date

# DICOM Libraries
import pydicom as pdcm
from pydicom.tag import Tag

# 1. Navigation Phase
### DICOM Files
All CTs are stored as DICOM files. DICOM, which stands for Digital Imaging and Communications in Medicine, is a standard for transmitting, storing, and sharing medical images. DICOM files hold medical images such as X-rays, CT scans, MRIs, and ultrasound, together with information about them. The standard ensures the interoperability of medical imaging equipment from different manufacturers. Some key features are:

**Metadata:** DICOM files store not only the pixel data of the medical images but also a wealth of metadata. This metadata includes patient information, imaging device details, acquisition parameters, and more.

**Interoperability:** DICOM enables the exchange of medical images and related information between different devices and systems. This interoperability is crucial in healthcare settings where various imaging modalities and equipment are used.

**Structured Data:** DICOM files use a structured format for information, allowing for consistency and ease of interpretation by different systems. This makes it possible for healthcare professionals to access and understand the data regardless of the equipment used to capture or generate the images.

For the different tags and their definitions, see the following links: [Wikipedia](https://en.wikipedia.org/wiki/DICOM), [Innolitics DICOM Standard Browser](https://dicom.innolitics.com/ciods)
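
The helper functions in this notebook address DICOM attributes by tag, written as a single hexadecimal integer such as `0x0008103e`. The references linked above write the same tag as a `(group,element)` pair. As a minimal sketch, the hypothetical helper `format_tag` below (not part of the notebook's pipeline) converts between the two notations and lists the tags this notebook reads, with names taken from the DICOM data dictionary:

```python
# Tags read in this notebook, written as 32-bit integers (group << 16 | element).
# The attribute names follow the DICOM data dictionary.
NOTEBOOK_TAGS = {
    0x0008103E: "Series Description",
    0x00081030: "Study Description",
    0x00100020: "Patient ID",
    0x00080020: "Study Date",
    0x00180050: "Slice Thickness",
    0x00180010: "Contrast/Bolus Agent",
    0x00280030: "Pixel Spacing",
    0x00200052: "Frame of Reference UID",
}


def format_tag(tag):
    """Return a 32-bit tag as the '(GGGG,EEEE)' string used in DICOM references."""
    group, element = tag >> 16, tag & 0xFFFF
    return f"({group:04X},{element:04X})"


# Print each tag in reference notation next to its attribute name
for tag, name in NOTEBOOK_TAGS.items():
    print(format_tag(tag), name)
```

Looking up `(0008,103E)` in either reference then leads to the definition of the Series Description attribute.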


In [13]:
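def get_folder_name(image, subf):

    # Use the Series Description (0008,103E) as the folder name
    try:
        folder_name = image[Tag(0x0008103e)].value

    except KeyError:
        # Study Description and Patient ID may be missing as well, so read them safely
        study = getattr(image, 'StudyDescription', None)
        patient_id = getattr(image, 'PatientID', None)
        print(f'Warning: folder {study} with {patient_id} ID does NOT have Series Description')
        # Fall back to the directory name (works with both '/' and '\\' separators)
        folder_name = os.path.basename(os.path.normpath(subf))

    return folder_name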

def get_patient_id(image):

    # Extract the Patient ID (0010,0020)
    try:
        patient_id = image[Tag(0x00100020)].value

    except KeyError:
        print('Warning: There is NO patient ID')
        patient_id = None

    return patient_id

def get_probable_weklyct_name(name, number, names_list):

    lowercase_name = name.lower()
    has_digit = re.search(r'\d', name) is not None

    # Default: this token is not a weekly CT label.
    # Setting it here also covers the branch below that may leave it unassigned.
    saver = None

    # 'rct' or 'w' together with a number, e.g. 'rCT3' or 'wk4'
    if ('rct' in lowercase_name or 'w' in lowercase_name) and has_digit:
        saver = name

    # A week placeholder without a number, e.g. 'wk..'
    elif 'wk..' in lowercase_name and not has_digit:
        saver = name

    # 'w' without a number: append the next token (e.g. 'wk' + '3'),
    # unless that token contains '2.0' or '2,'
    elif 'w' in lowercase_name and number + 1 < len(names_list) and not has_digit:

        if '2.0' not in names_list[number + 1] and '2,' not in names_list[number + 1]:
            saver = name + str(names_list[number + 1])

    # 'rct' followed by a '.' or '#' placeholder instead of a number
    elif re.search(r'rct.*[.#]', lowercase_name) and not has_digit:
        saver = name

    return saver
    
def get_hd_fov(name):

    lowercase_name = name.lower()
    # Flag the token if it contains 'hd' or 'fov'
    if 'hd' in lowercase_name or 'fov' in lowercase_name:
        hd_fov = 1

    else:
        hd_fov = 0

    return hd_fov

def get_fraction(name):

    lowercase_name = name.lower()

    # Find the fraction number
    if 'rct' in lowercase_name and re.search(r'\d', name):
        fraction = int(re.findall(r'\d+', name)[0])
    
    else:
        fraction = None
    
    return fraction

def get_date_information(image):

    # Extract the date, the weekday (1 = Monday), and the ISO week number
    # from the Study Date (0008,0020)
    try:
        study_datetime_CT = datetime.strptime(image[Tag(0x00080020)].value, "%Y%m%d")
        date_info = study_datetime_CT.date()
        weekday = study_datetime_CT.weekday() + 1
        week_num = study_datetime_CT.isocalendar().week
    except (KeyError, ValueError, TypeError):
        # The tag is missing, empty, or not in YYYYMMDD format
        date_info = None
        weekday = None
        week_num = None

    return date_info, weekday, week_num

def get_slice_thickness(image):

    # Extract the Slice Thickness (0018,0050)
    try:
        slice_thickness = image[Tag(0x00180050)].value
    except KeyError:
        slice_thickness = None

    return slice_thickness

def get_contrast(image):

    # Flag whether the Contrast/Bolus Agent tag (0018,0010) is present in the header.
    # Only the presence of the tag is checked, not its value.
    contrast = 1 if Tag(0x00180010) in image else 0

    return contrast

def get_pixel_spacing(image):

    # Extract the Pixel Spacing (0028,0030)
    try:
        pixel_spacing = image[Tag(0x00280030)].value
    except KeyError:
        pixel_spacing = None

    return pixel_spacing

def get_ref_uid(image):

    # Extract the Frame of Reference UID (0020,0052)
    try:
        uid = image[Tag(0x00200052)].value
    except KeyError:
        uid = None

    return uid
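
The name-based helpers work on one whitespace-separated token of the series description at a time, so their results need to be combined into a single value per folder. The following is a minimal, self-contained sketch of one way to do that; `summarize_series_description` is a hypothetical helper, not part of the notebook's pipeline. It keeps the first fraction number found and flags HD_FoV if any token mentions it. The sample strings are series descriptions of the kind printed in the navigation output of this notebook.

```python
import re


def summarize_series_description(description):
    """Combine per-token findings into one record for a series description."""
    fraction = None
    hd_fov = 0

    for token in description.split():
        lower = token.lower()

        # Keep the first number attached to an 'rct' token as the fraction
        if fraction is None and 'rct' in lower:
            match = re.search(r'\d+', token)
            if match:
                fraction = int(match.group())

        # Flag the series as soon as any token mentions 'hd' or 'fov'
        if 'hd' in lower or 'fov' in lower:
            hd_fov = 1

    return {'fraction': fraction, 'HD_FoV': hd_fov}


samples = [
    'rCT3 protonen  2.0  HD_FoV imar   iMAR',
    'rCT28protonen  2.0  I40s  3 imar   iMAR',
    'HerCTHH wk 3  2.0  I40s  3',
]
for sample in samples:
    print(sample, '->', summarize_series_description(sample))
```

Note that `'HerCTHH'.lower()` contains the substring `'rct'`, so a plain substring test also matches these tokens; they yield no fraction only because the token itself carries no digit.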

In [16]:
def navigate_folder(path_folder):

    # TODO: move these settings to a config file
    exclusion_set = {'detail', 'ac_ct', 'ld_ct', 'ld ct', 'ac ct'}
    min_slice_num = 50

    # Collect one record per CT folder
    group = list()

    for r, d, f in os.walk(path_folder):
        # Make a list of all the subdirectories at this level
        subfolders = [os.path.join(r, folder) for folder in d]

        for subf in subfolders:
            # List the slices (.DCM images) in this folder once and reuse the list
            dcm_files = glob.glob(os.path.join(subf, '*.DCM'))
            slice_num = len(dcm_files)

            # Only consider folders with more than min_slice_num slices
            if slice_num > min_slice_num:

                # Read the header of the first slice (the pixel data is not needed here)
                image = pdcm.dcmread(dcm_files[0], force=True, stop_before_pixels=True)
                folder_name = get_folder_name(image, subf)

                # Keep CTs whose name contains none of the excluded keywords
                if getattr(image, 'Modality', None) == 'CT' and all(keyword not in folder_name.lower() for keyword in exclusion_set):

                    patient_id = get_patient_id(image)

                    # Split the name of the folder into strings of information
                    names_list = folder_name.split()

                    print(patient_id, folder_name)

                    # Combine the findings over all tokens. Without this, only the
                    # result for the last token of the name would be stored.
                    saver, hd_fov, fraction = None, 0, None
                    for number, name in enumerate(names_list):
                        if saver is None:
                            saver = get_probable_weklyct_name(name, number, names_list)
                        hd_fov = max(hd_fov, get_hd_fov(name))
                        if fraction is None:
                            fraction = get_fraction(name)

                    date_info, weekday, week_num = get_date_information(image)
                    slice_thickness = get_slice_thickness(image)
                    contrast = get_contrast(image)
                    pixel_spacing = get_pixel_spacing(image)
                    uid = get_ref_uid(image)

                    # Add the information of this folder to the total dataset
                    group.append({
                                'ID': patient_id, 'folder_name': folder_name, 'date': date_info,
                                'week_day': weekday, 'week_num': week_num, 'info_header': saver,
                                'fraction': fraction, 'HD_FoV': hd_fov, 'slice_thickness': slice_thickness,
                                'num_slices': slice_num, 'pixel_spacing': pixel_spacing, 'contrast': contrast,
                                'UID': uid, 'path': subf
                                })

    # Make a dataframe from the main folder
    df = pd.DataFrame(group)

    # Save the dataframe
    df.to_csv('output.csv', index=False)

    return df
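
The navigation step counts slices with the glob pattern `*.DCM`. On case-sensitive file systems, such as most Linux setups, that pattern does not match files ending in `.dcm`, so a folder could fall under the slice threshold without being noticed. The sketch below shows a case-insensitive count using only the standard library; `count_dicom_files` is a hypothetical helper, and the demonstration runs on a temporary directory with empty placeholder files rather than real DICOM data.

```python
import os
import tempfile


def count_dicom_files(folder, extension=".dcm"):
    """Count the files in `folder` whose extension matches, ignoring case."""
    with os.scandir(folder) as entries:
        return sum(
            1 for entry in entries
            if entry.is_file() and entry.name.lower().endswith(extension)
        )


# Demonstration on a throwaway directory with mixed-case extensions
with tempfile.TemporaryDirectory() as tmp:
    for file_name in ("a.DCM", "b.dcm", "c.Dcm", "notes.txt"):
        open(os.path.join(tmp, file_name), "w").close()
    demo_count = count_dicom_files(tmp)

print(demo_count)
```

Whether this matters depends on how the source folders were exported; if every file is known to use the upper-case `.DCM` extension, the glob pattern is sufficient.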

In [17]:
path_folder = '//zkh/appdata/RTDicom/Projectline_HNC_modelling/OPC_data/DICOM_data_organized'
navigate_folder(path_folder)

0020715 Hals 2mm  pCT0  I40s  3 imar   iMAR
0020715 Hals 2.0 +C  HD_FoV imar   iMAR
0020715 rCT3 protonen  2.0  HD_FoV imar   iMAR
0020715 rCT8 protonen  2.0  I40s  3 imar   iMAR
0020715 rCT13 protonen  2.0  I40s  3 imar   iMAR
0020715 rCT18 protonen  2.0  I40s  3 imar   iMAR
0020715 rCT23 protonen  2.0  I40s  3 imar   iMAR
0020715 rCT28protonen  2.0  I40s  3 imar   iMAR
0020715 rCT33 protonen  2.0  I40s  3 imar   iMAR
0021879 Neck  2.0  B30f
0021879 HerCTHH wk 1
0021879 HerCTHH wk 2
0021879 HerCTHH wk 3  2.0  I40s  3
0021879 HerCTHH wk4  2.0  I40s  3
0021879 HerCTHH wk5  2.0  I40s  3
0021879 HerCTHH wk6  2.0  I40s  3
0021879 HerCTHH wk7  2.0  I40s  3
0052277 Hals 2mm  2.0  B30f
0052277 HerCTHH wk1  2.0  I40s  3
0052277 HerCTHH wk2  2.0  I40s  3
0052277 HerCTHH wk3  2.0  I40s  3
0052277 HerCTHH wk..4  2.0  I40s  3
0052277 HerCTHH wk5  2.0  I40s  3
0052277 HerCTHH wk 6  2.0  I40s  3
0052277 HerCTHH wk 7  2.0  I40s  3
0059896 Hals 2mm  pCT0  I40s  3 imar   iMAR
0059896 Hals 2.0 +C  I40s 

KeyboardInterrupt: 