# DICOM to PNG Converter for CBIS-DDSM Mammograms

This notebook preprocesses DICOM (.dcm) medical imaging files and converts them into PNG format for use in deep learning pipelines. It was designed specifically for the [CBIS-DDSM dataset](https://www.cancerimagingarchive.net/collection/cbis-ddsm/), a curated breast imaging dataset publicly available from The Cancer Imaging Archive (TCIA).

## Dataset Information

Before running this notebook, download the CBIS-DDSM dataset from:
https://www.cancerimagingarchive.net/collection/cbis-ddsm/

Once downloaded, place the dataset's "Calc" and "Mass" folders in your working directory. This notebook restructures the folders and converts all `.dcm` images to `.png` format. Only the pixel data is written to the PNG files; the patient and view information is kept in the file names.
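
Before running the later cells, it can help to confirm that the expected folders are in place. The following is a minimal sketch, assuming the folders are named exactly "Calc" and "Mass"; `missing_dataset_folders` is a hypothetical helper and is not part of the notebook.

```python
import os
import tempfile


def missing_dataset_folders(base_dir, expected=("Calc", "Mass")):
    """Return the expected folder names that are not present in base_dir.

    Hypothetical helper: the folder names are an assumption based on the
    text above, adjust them to match your download.
    """
    return [
        name for name in expected
        if not os.path.isdir(os.path.join(base_dir, name))
    ]


# Demonstration on a throwaway directory that only contains "Calc".
with tempfile.TemporaryDirectory() as base:
    os.makedirs(os.path.join(base, "Calc"))
    missing = missing_dataset_folders(base)
    print(missing)  # ['Mass']
```

In practice you would call `missing_dataset_folders(os.getcwd())` and stop if the returned list is not empty.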

## Notebook Features

- Reorganizes and renames CBIS-DDSM subfolders
- Converts DICOM images into standard PNG format
- Updates CSV paths for use in PyTorch or other pipelines


In [2]:
from typing import Union
import numpy as np
import pydicom
import pandas as pd
import os
import shutil
from PIL import Image

In [None]:
#some of the following code was taken from
# https://github.com/pablogiaccaglia/Breast-Cancer-Segmentation-Datasets/blob/master/CBIS/refactorCBIS.py
# and modified

In [43]:
def renameDcmFiles(logger, dcmFilePath: str, prefix : str) -> Union[str, bool]:
    """
    This function takes the absolute path of a .dcm file
    and renames it according to the convention below:
    1. Full mammograms:
        - Mass-Training_P_00001_LEFT_CC_FULL.dcm
    Parameters
    ----------
    dcmFilePath : {str}
        The relative (or absolute) path of the .dcm file
        to rename, including the .dcm filename.
        e.g. "source_folder/Mass-Training_P_00001_LEFT_CC/1.dcm"
    Returns
    -------
    newFilename : {str}
        The new name that the .dcm file should have
        WITH the ".dcm" extension WITHOUT its relative
        (or absolute) path.
        e.g. "Mass-Training_P_00001_LEFT_CC_FULL.dcm"
    False : {boolean}
        False is returned if the new name of the .dcm
        file cannot be determined.
    """

    try:
        # Read dicom.
        ds = pydicom.dcmread(dcmFilePath)
        # Get information.
        patientID = ds.PatientID
        # Check whether the patient ID already starts with the dataset prefix (e.g. "Calc-Test").
        if not patientID.startswith(prefix):
            patientID = prefix + "_" + patientID

        print(patientID)
        patientID = patientID.replace(".dcm", "")

        try:
            # If ds contains SeriesDescription attribute...
            imageType = ds.SeriesDescription

            # === FULL ===
            if "full" in imageType:
                newFilename = patientID + "_FULL" + ".dcm"
                print(f"FULL --- {newFilename}")
                return newFilename

        except AttributeError:
            # If ds does not contain SeriesDescription...
            # === FULL ===
            if "full" in dcmFilePath:
                newFilename = patientID + "_FULL" + ".dcm"
                return newFilename

    except Exception as e:
        # logger.error(f'Unable to new_name_dcm!\n{e}')
        print(f"Unable to new_name_dcm!\n{e}")

    # The new name could not be determined (e.g. not a full mammogram).
    return False
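
The naming convention described in the docstring above can be illustrated without reading any DICOM file. This is a minimal sketch, assuming the patient ID looks like the docstring example; `build_full_name` is a hypothetical helper that only mirrors the string handling of `renameDcmFiles`.

```python
def build_full_name(prefix, patient_id):
    """Compose '<prefix>_<patient id>_FULL.dcm', adding the prefix only once."""
    if not patient_id.startswith(prefix):
        patient_id = prefix + "_" + patient_id
    return patient_id + "_FULL.dcm"


# The patient ID may or may not already carry the dataset prefix.
name_a = build_full_name("Mass-Training", "P_00001_LEFT_CC")
name_b = build_full_name("Mass-Training", "Mass-Training_P_00001_LEFT_CC")
print(name_a)  # Mass-Training_P_00001_LEFT_CC_FULL.dcm
print(name_b)  # Mass-Training_P_00001_LEFT_CC_FULL.dcm
```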


In [44]:
def countDcmFiles(logger, topDirectory: str) -> int:
    """
    This function recursively walks through a given directory
    (`topDirectory`) and counts the number of .dcm files present.
    Parameters
    ----------
    topDirectory : {str}
        The directory to count.
    Returns
    -------
    count : {int}
        The number of .dcm files in `topDirectory`.
    """

    count = 0

    try:

        # Count number of .dcm files in topDirectory.
        for _, _, files in os.walk(topDirectory):
            for f in files:
                if f.endswith(".dcm"):
                    count += 1

    except Exception as e:
        # logger.error(f'Unable to count_dcm!\n{e}')
        print(f"Unable to count_dcm!\n{e}")

    return count
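
For a quick cross-check of the count, the same recursive search can be written with `pathlib`. This is a sketch of an equivalent approach, not the notebook's implementation; `count_dcm` and the file names below are made up for the demonstration.

```python
import tempfile
from pathlib import Path


def count_dcm(top):
    """Count files ending in .dcm anywhere below `top` (including `top` itself)."""
    return sum(1 for p in Path(top).rglob("*.dcm") if p.is_file())


# Build a small throwaway tree: two .dcm files and one unrelated file.
with tempfile.TemporaryDirectory() as top:
    nested = Path(top, "case_1", "series_1")
    nested.mkdir(parents=True)
    (nested / "1.dcm").touch()
    (Path(top) / "2.dcm").touch()
    (Path(top) / "notes.txt").touch()
    total = count_dcm(top)
    print(total)  # 2
```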

In [45]:
def moveDcmFileUp(logger, destinationDir: str, sourceDir: str, dcmFilename: str) -> None:
    """
    This function moves a .dcm file from its given source
    directory into the given destination directory. It also
    handles conflicting filenames by adding "___a" to the
    end of a filename if the filename already exists in the
    destination directory.
    Parameters
    ----------
    destinationDir : {str}
        The relative (or absolute) path of the folder that
        the .dcm file needs to be moved to.
    sourceDir : {str}
        The relative (or absolute) path where the .dcm file
        needs to be moved from, including the filename.
        e.g. "source_folder/Mass-Training_P_00001_LEFT_CC_FULL.dcm"
    dcmFilename : {str}
        The name of the .dcm file WITH the ".dcm" extension
        but WITHOUT its (relative or absolute) path.
        e.g. "Mass-Training_P_00001_LEFT_CC_FULL.dcm".
    Returns
    -------
    None
    """

    try:
        dest_dir_with_new_name = os.path.join(destinationDir, dcmFilename)

        # If the destination path does not exist yet...
        if not os.path.exists(dest_dir_with_new_name):
            shutil.move(sourceDir, destinationDir)

        # If the destination path already exists...
        else:
            # Add "___a" before the extension of `dcmFilename`.
            # os.path.splitext is used because str.strip(".dcm") removes
            # characters from both ends of the name, not the suffix.
            newName2 = os.path.splitext(dcmFilename)[0] + "___a.dcm"
            # This moves the file into the destination while giving the file its new name.
            shutil.move(sourceDir, os.path.join(destinationDir, newName2))

    except Exception as e:
        # logger.error(f'Unable to move_dcm_up!\n{e}')
        print(f"Unable to move_dcm_up!\n{e}")
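
The conflict handling described in the docstring can be sketched on plain strings. This is a minimal illustration under assumptions: `conflict_free_name` is a hypothetical helper, and unlike `moveDcmFileUp` it keeps appending the marker until the name is free. It also shows why `os.path.splitext` is a safer way to drop the extension than `str.strip(".dcm")`, which removes individual characters from both ends.

```python
import os


def conflict_free_name(filename, existing_names, marker="___a"):
    """Append `marker` before the extension until the name is unused."""
    stem, ext = os.path.splitext(filename)
    candidate = filename
    while candidate in existing_names:
        stem += marker
        candidate = stem + ext
    return candidate


first = conflict_free_name("mass_FULL.dcm", {"mass_FULL.dcm"})
second = conflict_free_name(
    "mass_FULL.dcm", {"mass_FULL.dcm", "mass_FULL___a.dcm"}
)
print(first)   # mass_FULL___a.dcm
print(second)  # mass_FULL___a___a.dcm

# str.strip treats its argument as a set of characters, so the leading
# "m" of the stem is removed as well.
stripped = "mass_FULL.dcm".strip(".dcm")
print(stripped)  # ass_FULL
```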


In [46]:
def deleteEmptyFolders(logger, topDirectory: str, errorDirectory: str) -> None:
    """
    This function recursively walks through a given directory
    (`topDirectory`) using depth-first search (bottom up) and deletes
    any directory that is empty (ignoring hidden files).
    If there are directories that are not empty (except hidden
    files), it will save the absolute directory in a Pandas
    dataframe and export it as a `not-empty-folders.csv` to
    `errorDirectory`.
    Parameters
    ----------
    topDirectory : {str}
        The directory to iterate through.
    errorDirectory : {str}
        The directory to save the `not-empty-folders.csv` to.
    Returns
    -------
    None
    """

    try:
        curDirectoryList = []
        filesList = []

        for (curDir, dirs, files) in os.walk(top = topDirectory, topdown = False):

            if curDir != str(topDirectory):

                dirs.sort()
                files.sort()

                print(f"WE ARE AT: {curDir}")
                print("=" * 10)

                print("List dir:")

                directories_list = [
                    f for f in os.listdir(curDir) if not f.startswith(".")
                ]
                print(directories_list)

                if len(directories_list) == 0:
                    print("DELETE")
                    shutil.rmtree(curDir, ignore_errors = True)

                elif len(directories_list) > 0:
                    print("DON'T DELETE")
                    curDirectoryList.append(curDir)
                    filesList.append(directories_list)

                print()
                print("Moving one folder up...")
                print("-" * 40)
                print()

        if len(curDirectoryList) > 0:
            notEmptyDirs = pd.DataFrame(
                    list(zip(curDirectoryList, filesList)), columns = ["curDir", "files"]
            )
            pathToSave = os.path.join(errorDirectory, "not-empty-folders.csv")
            notEmptyDirs.to_csv(pathToSave, index = False)

    except Exception as e:
        # logger.error(f'Unable to delete_empty_folders!\n{e}')
        print(f"Unable to delete_empty_folders!\n{e}")
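
The bottom-up deletion can be tried safely on a temporary tree. The sketch below is a simplified variant that assumes only the deletion logic matters (no CSV report, no logging); `remove_empty_dirs` is a hypothetical name.

```python
import os
import shutil
import tempfile


def remove_empty_dirs(top):
    """Delete every folder below `top` that holds no visible entries.

    Walking bottom-up means a parent is inspected after its children,
    so a folder that only contained empty folders is removed as well.
    """
    removed = []
    for cur_dir, _, _ in os.walk(top, topdown=False):
        if os.path.abspath(cur_dir) == os.path.abspath(top):
            continue  # never delete the top directory itself
        visible = [n for n in os.listdir(cur_dir) if not n.startswith(".")]
        if not visible:
            shutil.rmtree(cur_dir, ignore_errors=True)
            removed.append(os.path.basename(cur_dir))
    return removed


with tempfile.TemporaryDirectory() as top:
    os.makedirs(os.path.join(top, "a", "b"))   # empty chain: a/b
    os.makedirs(os.path.join(top, "c"))
    open(os.path.join(top, "c", "keep.txt"), "w").close()
    removed = sorted(remove_empty_dirs(top))
    remaining = sorted(os.listdir(top))
    print(removed)    # ['a', 'b']
    print(remaining)  # ['c']
```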


In [47]:
def moveFiles(logger, topDirPath: str, substring: str, extension: str, destinationDirectory: str) -> int:
    """
    This function recursively walks through a given directory
    (`topDirPath`), finds file names that contain the `substring`
    substring and end with `extension`, and moves them to the
    target directory `destinationDirectory`.

    Parameters
    ----------
    topDirPath : {str}
        The directory to look in.
    substring : {str}
        The substring to look for, either "FULL" or "MASK".
    extension : {str}
        The extension of the file to look for. e.g. ".png".
    destinationDirectory : {str}
        The directory to move the files to.

    Returns
    -------
    movedFiles : {int}
        The number of files moved.
    """

    movedFiles = 0

    try:

        # Move every matching file found under topDirPath.
        for currentDirectory, _, files in os.walk(topDirPath):

            files.sort()

            for f in files:

                if f.endswith(extension) and substring in f:
                    sourcePath = os.path.join(currentDirectory, f)
                    destinationPath = os.path.join(destinationDirectory, f)
                    shutil.move(sourcePath, destinationPath)

                    movedFiles += 1
                # if movedFiles == 1:
                #     break

    except Exception as e:
        # logger.error(f'Unable to moveFiles!\n{e}')
        print(f"Unable to moveFiles!\n{e}")

    return movedFiles


In [48]:
def updateDcmPath(logger, og_df, images_folder, masks_folder):
    """
    This function updates paths to the full mammogram scan,
    cropped image and ROI mask of each row (.dcm file) of the
    given DataFrame.
    Parameters
    ----------
    og_df : {pd.DataFrame}
        The original Pandas DataFrame that needs to be updated.
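    images_folder : {str}
        The relative (or absolute) path to the folder that contains
        the full-mammogram .png files.
    masks_folder : {str}
        The relative (or absolute) path to the folder that contains
        the ROI mask .png files.
    Returns
    -------
    og_df: {pd.DataFrame}
        The Pandas DataFrame with all the updated image paths.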
    """

    try:

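        # Create new columns in og_df. Object dtype lets them hold path strings.
        og_df["full_path"] = pd.Series(index = og_df.index, dtype = object)
        # og_df["crop_path"] = np.nan
        og_df["mask_path"] = pd.Series(index = og_df.index, dtype = object)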

        # Get list of .png paths. Join with the folder currently being
        # walked so that files in sub-folders get the correct path.
        images_paths_list = []
        masks_paths_list = []
        for curDir, _, files in os.walk(images_folder):
            for f in files:
                if f.endswith(".png"):
                    images_paths_list.append(os.path.join(curDir, f))

        for curDir, _, files in os.walk(masks_folder):
            for f in files:
                if f.endswith(".png"):
                    masks_paths_list.append(os.path.join(curDir, f))

        for row in og_df.itertuples():

            row_id = row.Index

            # Get identification details.
            patient_id = row.patient_id
            img_view = row.image_view
            lr = row.left_or_right_breast
            abnormality_id = row.abnormality_id

            # Use this list to match DF row with .png path.
            info_list = [patient_id, img_view, lr]

            #  crop_suffix = "CROP_" + str(abnormality_id)
            mask_suffix = "MASK_" + str(abnormality_id)

            # Get list of relevant paths to this patient.
            full_paths = [
                path
                for path in images_paths_list
                if all(info in path for info in info_list + ["FULL"])
            ]

            """ crop_paths = [
                    i
                    for i in dcm_paths_list
                    if all(info in i for info in info_list + [crop_suffix])
                ] """

            mask_paths = [
                path
                for path in masks_paths_list
                if all(info in path for info in info_list + [mask_suffix])
            ]

            # full_paths_str = ",".join(full_paths)
            # crop_paths_str = ",".join(crop_paths)
            # mask_paths_str = ",".join(mask_paths)

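            # Update paths. Several matches are stored as one
            # comma-separated string so that each cell holds a scalar.
            if len(full_paths) > 0:
                og_df.loc[row_id, "full_path"] = ",".join(full_paths)
            #  if len(crop_paths) > 0:
            #      og_df.loc[row_id, "crop_path"] = crop_paths
            if len(mask_paths) > 0:
                og_df.loc[row_id, "mask_path"] = ",".join(mask_paths)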

        del og_df["cropped_image_file_path"]
        del og_df["image_file_path"]
        del og_df["ROI_mask_file_path"]

    except Exception as e:
        # Report the problem, then re-raise so the caller sees the failure.
        print(f"Unable to get updateDcmPath!\n{e}")
        raise

    return og_df
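
The row-to-file matching used above is a plain substring test: a path is kept when every identifying token occurs somewhere in it. The sketch below isolates that idea; `match_paths` is a hypothetical helper and the file names are made-up examples that follow the naming convention of this notebook.

```python
def match_paths(paths, tokens):
    """Keep the paths that contain every token as a substring."""
    return [p for p in paths if all(t in p for t in tokens)]


paths = [
    "images/Mass-Training_P_00001_LEFT_CC_FULL.png",
    "images/Mass-Training_P_00001_LEFT_MLO_FULL.png",
    "images/Mass-Training_P_00002_LEFT_CC_FULL.png",
]

hits = match_paths(paths, ["P_00001", "CC", "LEFT", "FULL"])
print(hits)  # ['images/Mass-Training_P_00001_LEFT_CC_FULL.png']

# Caveat of substring matching: a short suffix also matches longer ones.
ambiguous = "MASK_1" in "Mass-Training_P_00001_LEFT_CC_MASK_10.png"
print(ambiguous)  # True
```

Because of the caveat shown at the end, a row can match more than one file, which is worth checking in the resulting CSV.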


In [49]:
def refactorCBIS(logger, topDirectory: str):
    """main function for extractDicom module.
    Walks through `topDirectory`, renames each full-mammogram .dcm
    file, moves it up into `topDirectory` and deletes the folders
    that are left empty.
    Parameters
    ----------
    logger : {logging.Logger}
        The logger used for logging error information
    topDirectory : {str}
        The folder to reorganise, e.g. "Mass-Training". Its last
        path component is used as the filename prefix.
    """

    # Use the last path component as prefix; normpath drops a trailing separator.
    prefix = os.path.basename(os.path.normpath(topDirectory))

    # ==============================================
    # 1. Count number of .dcm files BEFORE executing
    # ==============================================
    print("start")
    before = countDcmFiles(logger = None, topDirectory = topDirectory)

    # ==========
    # 2. Execute
    # ==========

    print(before)

    # 2.1. Rename and move .dcm files.
    # --------------------------------
    for (currentDirectory, dirs, files) in os.walk(top = topDirectory, topdown = False):

        dirs.sort()
        files.sort()

        for f in files:

            # === Step 1: Rename .dcm file ===
            if f.endswith(".dcm"):

                old_name_path = os.path.join(currentDirectory, f)
                newFilename = renameDcmFiles(logger = None, dcmFilePath = old_name_path, prefix = prefix)

                if newFilename:
                    pathOfNewNameFile = os.path.join(currentDirectory, newFilename)
                    os.rename(old_name_path, pathOfNewNameFile)

                    # === Step 2: Move RENAMED .dcm file ===
                    moveDcmFileUp(logger = None,
                                  destinationDir = topDirectory, sourceDir = pathOfNewNameFile,
                                  dcmFilename = newFilename
                                  )

    # 2.2. Delete empty folders.
    # --------------------------
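    # NOTE: `errorDirectory` is a hard-coded path from the original script;
    # change it to an existing folder on your machine.
    deleteEmptyFolders(logger = None, topDirectory = topDirectory, errorDirectory = "/Users/pablo/Desktop/nl2-project")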

    # =============================================
    # 3. Count number of .dcm files AFTER executing
    # =============================================
    after = countDcmFiles(logger = None, topDirectory = topDirectory)

    print(f"BEFORE --> Number of .dcm files: {before}")
    print(f"AFTER --> Number of .dcm files: {after}")
    print()
    print("Getting out of extractDicom.")
    print("-" * 30)

    return


In [50]:
def updateCSV(logger, mass_csv_path, mass_png_folder, masks_folder, output_csv_path):
    """main function for updateDcmPath module.
    Parameters
    ----------
    logger : {logging.Logger}
        The logger used for logging error information
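    mass_csv_path : {str}
        Path to the original CBIS-DDSM .csv file.
    mass_png_folder : {str}
        Folder containing the full-mammogram .png files.
    masks_folder : {str}
        Folder containing the ROI mask .png files.
    output_csv_path : {str}
        Path where the updated .csv file is written.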
    """

    # Read the .csv files.
    og_mass_df = pd.read_csv(mass_csv_path)

    new_cols = [col.replace(" ", "_") for col in og_mass_df.columns]
    og_mass_df.columns = new_cols

    # Update .png paths.
    updated_mass_df = updateDcmPath(
            logger = None,
            og_df = og_mass_df, images_folder = mass_png_folder, masks_folder = masks_folder
    )

    updated_mass_df.to_csv(output_csv_path, index = False)

    print("Getting out of updateCSV.")
    print("-" * 30)

    return

In [None]:
# some of the following code was taken from 
# https://github.com/neheller/dicom-to-png/blob/master/mritopng.py
# and modified

In [None]:
def convert_dicom_to_png(dicom_path, png_path):
    """
    Converts a DICOM (.dcm) file into a normalized 8-bit PNG image.

    Parameters:
        dicom_path (str): Path to the input DICOM file.
        png_path (str): Path where the output PNG image will be saved.

    Steps:
        - Reads the DICOM image using pydicom
        - Normalizes pixel values to the range [0, 1]
        - Scales to 8-bit grayscale (0-255)
        - Saves the result as a PNG using Pillow
    """
    # Read the DICOM file
    ds = pydicom.dcmread(dicom_path)
    
    # Extract the pixel array
    pixel_array = ds.pixel_array
    
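    # Normalize the array to [0, 1]. Cast to float first and guard against
    # a constant image, where max == min would cause a division by zero.
    pixel_array = pixel_array.astype(np.float64)
    value_range = pixel_array.max() - pixel_array.min()
    if value_range > 0:
        normalized_array = (pixel_array - pixel_array.min()) / value_range
    else:
        normalized_array = np.zeros_like(pixel_array)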
    
    # Scale to 8-bit (0-255)
    scaled_array = (normalized_array * 255).astype(np.uint8)
    
    # Create an image object
    image = Image.fromarray(scaled_array)
    
    # Save the image
    image.save(png_path)
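
The normalization and 8-bit scaling steps amount to simple arithmetic per pixel. The following worked example uses plain Python lists instead of NumPy to stay dependency-free; `scale_to_8bit` is a hypothetical helper, and the truncation with `int()` is meant to mirror what `astype(np.uint8)` does to non-negative values.

```python
def scale_to_8bit(values):
    """Min-max normalize `values` to [0, 1], then scale to integers 0-255."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant image has no contrast; map everything to 0.
        return [0 for _ in values]
    return [int((v - lo) / (hi - lo) * 255) for v in values]


scaled = scale_to_8bit([100, 200, 300])
flat = scale_to_8bit([7, 7, 7])
print(scaled)  # [0, 127, 255]
print(flat)    # [0, 0, 0]
```

Note that this kind of per-image scaling discards the absolute intensity values, so two images with different ranges end up on the same 0-255 scale.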


In [None]:
def convert_folder(dicom_folder_path, png_folder_path):
    """
    Converts all DICOM (.dcm) files in a given folder into PNG images and 
    saves them in a specified output directory.

    Parameters:
        dicom_folder_path (str): Path to the folder containing DICOM files.
        png_folder_path (str): Path to the destination folder for the converted PNG files.

    Behavior:
        - Creates the output folder if it does not exist.
        - Recursively walks through the DICOM folder.
        - Converts each DICOM file to PNG using `convert_dicom_to_png()`.
        - Ensures no file is overwritten and reports success/failure per file.

    Raises:
        Exception: If input files are missing or output files already exist.
    """

    # Create the folder for the PNG directory structure
    if not os.path.exists(png_folder_path):
        os.makedirs(png_folder_path)

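    # Recursively traverse all sub-folders in the path
    for dicom_sub_folder, subdirs, files in os.walk(dicom_folder_path):
        for dicom_file_name in files:
            # Skip anything that is not a DICOM file (e.g. .csv or hidden files)
            if not dicom_file_name.endswith(".dcm"):
                continue
            # Build the source path from the folder currently being walked
            dicom_file_path = os.path.join(dicom_sub_folder, dicom_file_name)
            #print(dicom_file_path)
            png_file_name = dicom_file_name.replace(".dcm", ".png")
            png_file_path = os.path.join(png_folder_path, png_file_name)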
            #print(png_file_path)
            
            # Making sure that the dicom file exists
            if not os.path.exists(dicom_file_path):
                raise Exception('File "%s" does not exist' % dicom_file_path)

            # Making sure the png file does not exist
            if os.path.exists(png_file_path):
                raise Exception('File "%s" already exists' % png_file_path)
            
            try:
                # Convert the actual file
                convert_dicom_to_png(dicom_file_path, png_file_path)
                print( 'SUCCESS>', dicom_file_path, '-->', png_file_path)
            except Exception as e:
                print( 'FAIL>', dicom_file_path, '-->', png_file_path, ':', e)
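
The mapping from a source file to its output file can be looked at in isolation. This is a minimal sketch, assuming all PNG files go into one flat output folder as in `convert_folder`; `png_path_for` is a hypothetical helper and the paths are made-up examples.

```python
import os


def png_path_for(dicom_file_path, png_folder_path):
    """Return the output path: same base name, .png extension, flat folder."""
    stem = os.path.splitext(os.path.basename(dicom_file_path))[0]
    return os.path.join(png_folder_path, stem + ".png")


source = os.path.join("Mass-Test", "sub", "Mass-Test_P_00016_LEFT_CC_FULL.dcm")
target = png_path_for(source, "Mass-Test-png")
print(target)
```

Because the output folder is flat, two source files with the same name in different sub-folders would map to the same PNG path, which is why `convert_folder` refuses to overwrite existing files.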


## Usage Summary

1. Place the DICOM folders in the appropriate directory.
2. Run the `refactorCBIS()` function to rename and reorganise the files.
3. Call the `convert_folder()` function on each directory to convert `.dcm` files to `.png`.
4. Call the `updateCSV()` function to update the CSV paths for downstream processing.

Ensure that the necessary folders exist and modify paths as required for your local setup.


In [None]:
# Define paths
calc_test_dicom_path = "mammogram-ai-project/Data/Data png/Calc-Test"
calc_test_png_path = "mammogram-ai-project/Data/Data png/Calc-Test-png"

calc_training_dicom_path = "mammogram-ai-project/Data/Data png/Calc-Training"
calc_training_png_path = "mammogram-ai-project/Data/Data png/Calc-Training-png"

mass_test_dicom_path = "mammogram-ai-project/Data/Data png/Mass-Test"
mass_test_png_path = "mammogram-ai-project/Data/Data png/Mass-Test-png"

mass_training_dicom_path = "mammogram-ai-project/Data/Data png/Mass-Training"
mass_training_png_path = "mammogram-ai-project/Data/Data png/Mass-Training-png"

In [None]:
# rename and reorganise files
refactorCBIS(logger = None, topDirectory = calc_test_dicom_path)
refactorCBIS(logger = None, topDirectory = calc_training_dicom_path)
refactorCBIS(logger = None, topDirectory = mass_test_dicom_path)
refactorCBIS(logger = None, topDirectory = mass_training_dicom_path)

In [None]:
# convert dicom files to png images
convert_folder(calc_test_dicom_path, calc_test_png_path)
convert_folder(calc_training_dicom_path, calc_training_png_path)
convert_folder(mass_test_dicom_path, mass_test_png_path)
convert_folder(mass_training_dicom_path, mass_training_png_path)