<a href="https://colab.research.google.com/github/in7777777/nnfs_book/blob/main/01_PreprocessingIJB_ipyn.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Introduction

First, you need to obtain and preprocess the data for the segmentation task.
The data is provided by the Medical Segmentation Decathlon challenge (http://medicaldecathlon.com/). <br/>
You can directly download the full-body CTs and segmentation maps from: <br/>
https://drive.google.com/file/d/1I1LR7XjyEZ-VBQ-Xruh31V7xExMjlVvi/view?usp=sharing

(Data License: CC-BY-SA 4.0, https://creativecommons.org/licenses/by-sa/4.0/)

# **1. Imports:**
Task: Import the necessary libraries:

*1.1. List of necessary libraries:*

*   pathlib for convenient path handling;
*   NumPy for storing the actual images;
*   cv2 (OpenCV) for directly resizing the images;
*   pandas to read the provided labels;
*   Matplotlib for visualizing some images;
*   tqdm for a nice progress bar;
*   NiBabel for reading and converting between NIfTI and several other common neuroimaging file formats, including ANALYZE;
*   dicom2nifti for converting DICOM images to NIfTI format;
*   pydicom for reading DICOM files;
*   Nilearn for statistical analysis and machine learning tools for neuroimaging data;
*   scipy.ndimage for multi-dimensional image processing.

*1.2. Import of necessary libraries:*

In [1]:
pip install pydicom

Collecting pydicom
  Downloading pydicom-2.4.4-py3-none-any.whl (1.8 MB)
Installing collected packages: pydicom
Successfully installed pydicom-2.4.4


In [2]:
# Install the packages that are not preinstalled before importing them
!pip install dicom2nifti
!pip install nibabel
!pip install nilearn
from pathlib import Path
import dicom2nifti
import nibabel as nib
import nilearn as nil
import scipy.ndimage as ndi
import pydicom
import numpy as np
import cv2
import pandas as pd
import matplotlib.pyplot as plt
from tqdm.notebook import tqdm
import os
import tarfile

Collecting dicom2nifti
  Downloading dicom2nifti-2.4.11-py3-none-any.whl (43 kB)
Collecting python-gdcm (from dicom2nifti)
  Downloading python_gdcm-3.0.24.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (13.1 MB)
Installing collected packages: python-gdcm, dicom2nifti
Successfully installed dicom2nifti-2.4.11 python-gdcm-3.0.24.1
Collecting nilearn
  Downloading nilearn-0.10.4-py3-none-any.whl (10.4 MB)
Installing collected packages: nilearn
Successfully installed nilearn-0.10.4


# **2. File download and upload:**

*2.1. The files were downloaded from the lecturer's Google Drive accounts [[patelkrishat], [miki.micki]] to my laptop's local disk and then uploaded to my Google Drive [johnboscondaruhutse].
Mounting the Drive:*


*   Note: It was not possible to reach the working directory directly, so we proceeded in two steps:
*   * First step: mount "/content/drive" alone.
*   * Second step: change to "/content/drive/MyDrive".

*2.1.i. Mounting the Drive: "/content/drive"*




In [1]:
from google.colab import drive
drive.mount('/content/drive')

MessageError: Error: credential propagation was unsuccessful

*2.1.ii. Change the current working directory from "/content/drive" to "/content/drive/MyDrive":*




In [5]:
# Change the current working directory
os.chdir('/content/drive/MyDrive')

# **3. Inspection:**
**Task: Define the paths to images and labels**

*3.1. File path and the extraction directory:*

In [6]:
tar_file_path = 'Task06_Lung.tar'
extraction_dir = 'Task06_Lung'

*3.2. File extraction:*

**Extracting Task06_Lung.tar**

In [7]:
if os.path.isfile(tar_file_path):
    # Open and extract the tar file
    with tarfile.open(tar_file_path, 'r:') as tar:
        tar.extractall(path=extraction_dir)
    print(f'Extracted to {extraction_dir}')
else:
    print(f'The file {tar_file_path} does not exist.')

The file Task06_Lung.tar does not exist.
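
Before extracting, it can help to list the archive members, because an archive that already contains a top-level folder ends up nested one level deeper when extracted into `extraction_dir`. The sketch below illustrates this with a tiny stand-in archive built in a temporary directory; the folder and file names (`demo`, `scan_a.txt`) are hypothetical and are not the real dataset contents.

```python
import tarfile
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)

    # Build a small stand-in archive (hypothetical content, not the real dataset)
    src = tmp / "demo" / "imagesTr"
    src.mkdir(parents=True)
    (src / "scan_a.txt").write_text("placeholder")
    archive = tmp / "demo.tar"
    with tarfile.open(archive, "w") as tar:
        tar.add(tmp / "demo", arcname="demo")

    # Mode "r" lets tarfile detect compression transparently
    with tarfile.open(archive, "r") as tar:
        member_names = tar.getnames()   # inspect the layout first
        tar.extractall(path=tmp / "out")

    # Collect the extracted files relative to the extraction directory
    extracted = sorted(
        p.relative_to(tmp / "out").as_posix()
        for p in (tmp / "out").rglob("*")
        if p.is_file()
    )

print(member_names)
print(extracted)
```

If the listed members start with a folder name, every later path has to include that folder after the extraction directory.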


# **4. Define Helper Functions:**


**Task: Load a sample NIfTI (Neuroimaging Informatics Technology Initiative) file and its corresponding label mask**<br />
Hint: You might want to define a helper function to make your life easier.

*4.1. Load NIfTI File and Mask:*

In [None]:
def load_nifti(file_path):
    """Load a NIfTI file and return the image object and its voxel data as a float array."""
    img = nib.load(file_path)
    data = img.get_fdata()
    return img, data

*4.2. Find Orientation:*

**Task: Find out the orientation**

In [None]:
def find_orientation(nifti_img):
    return nib.orientations.aff2axcodes(nifti_img.affine)

*4.3. Plot Image with Ground Truth Overlay:*

**Task: Inspect the loaded data with overlaid Ground Truth tumor segmentation**
Hint: Use the axial view (last axis). You may skip every second slice to speed up the process.


In [None]:
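def plot_overlay(image_data, mask_data, skip_slices=2):
    """Show every `skip_slices`-th axial slice next to the same slice with the mask overlaid."""
    num_slices = image_data.shape[2]
    for i in range(0, num_slices, skip_slices):
        # Hide background voxels so that only labelled regions are tinted
        mask_slice = np.ma.masked_where(mask_data[:, :, i] == 0, mask_data[:, :, i])

        plt.figure(figsize=(10, 5))
        plt.subplot(1, 2, 1)
        plt.imshow(image_data[:, :, i], cmap='gray')
        plt.title(f'Image Slice {i}')
        plt.axis('off')

        plt.subplot(1, 2, 2)
        plt.imshow(image_data[:, :, i], cmap='gray')
        plt.imshow(mask_slice, cmap='jet', alpha=0.5)
        plt.title(f'Mask Overlay Slice {i}')
        plt.axis('off')

        plt.show()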
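
Since a tumor usually covers only a few slices, it may be quicker to look up which slices contain any label before plotting. The helper below is a hypothetical addition (not part of the original notebook) and is demonstrated on a synthetic mask.

```python
import numpy as np

def slices_with_label(mask_data):
    """Return the indices along the last axis whose slice contains a non-zero label."""
    return np.flatnonzero(mask_data.any(axis=(0, 1)))

# Synthetic mask with labelled voxels in slices 3 and 7 only
mask = np.zeros((8, 8, 10), dtype=np.uint8)
mask[2:4, 2:4, 3] = 1
mask[5, 5, 7] = 1

labelled = slices_with_label(mask)
print(labelled)
```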

# **4. Preprocessing:**
**Task: Please perform the following preprocessing steps:**

1. CT images have a fixed range from -1000 to 3071. **Thus you could normalize by dividing by 3071** <br /> You don't need to compute mean and standard deviation for this task
2. As we want to focus on lung tumors, we can crop away parts of the lower abdomen to reduce the complexity and help the network learn. As an example, **you might skip the first 30 slices (from lower abdomen to the neck)** (last axis)
3. As we want to tackle this task on a slice level (2D) and not on a subject level (3D) to reduce the computational cost, **you should store the preprocessed data as 2D files**, because reading a single slice is much faster than loading the complete NIfTI file.
4. Resize the single slices and masks to (256, 256) (when resizing the mask, pass interpolation=cv2.INTER_NEAREST to the resize function to apply nearest neighbour interpolation)
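
The reason for nearest-neighbour interpolation in step 4 can be seen in a small one-dimensional sketch. It uses NumPy as a stand-in for `cv2.resize` (so the exact sampling positions are an assumption and differ from OpenCV's), but the effect is the same: linear interpolation invents values that are not valid labels.

```python
import numpy as np

labels = np.array([0.0, 0.0, 1.0, 1.0])      # one row of a binary mask
old_x = np.arange(labels.size)
new_x = np.linspace(0, labels.size - 1, 7)   # upsample from 4 to 7 samples

# Linear interpolation blends neighbouring labels
linear = np.interp(new_x, old_x, labels)

# Nearest neighbour only copies existing labels
nearest = labels[np.rint(new_x).astype(int)]

print(linear)   # contains 0.5, which is not a valid label
print(nearest)  # contains only 0 and 1
```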

Loop over all_files and apply the preprocessing steps. <br />

Additionally, please make sure that all scans have the same orientation

In the preprocessing loop, you need to create a directory for each subject containing the CT and label slices with identical names. <br />
E.g:
* 0/data/0.npy
* 0/masks/0.npy

Store the last 6 subjects as validation data

PS: Feel free to try the lung window!
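
A lung window could be sketched as a clip-and-rescale of the Hounsfield values. The center of -600 and width of 1500 used below are commonly quoted lung-window settings and are an assumption here, not values given by the task; `apply_window` is a hypothetical helper name.

```python
import numpy as np

def apply_window(hu, center=-600.0, width=1500.0):
    """Clip Hounsfield units to [center - width/2, center + width/2] and rescale to [0, 1]."""
    low, high = center - width / 2, center + width / 2
    return (np.clip(hu, low, high) - low) / (high - low)

# Sample values: below the window, lower edge, center, upper edge, above the window
hu = np.array([-2000.0, -1350.0, -600.0, 150.0, 3071.0])
windowed = apply_window(hu)
print(windowed)
```

Everything outside the window saturates at 0 or 1, which spends the full intensity range on lung tissue.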

*4.1. Normalize, Crop, and Resize:*

In [None]:
def preprocess_image(image_data, mask_data, num_slices_to_skip=30, target_size=(256, 256)):
    """Normalize, crop along the last axis, and resize each slice; returns slice-first stacks."""
    # Min-max normalize the fixed CT range [-1000, 3071] to [0, 1]
    # (a variant of the suggested division by 3071 that also shifts the minimum to 0)
    image_data = (image_data - (-1000)) / (3071 + 1000)

    # Skip the first `num_slices_to_skip` slices along the last axis
    image_data = image_data[:, :, num_slices_to_skip:]
    mask_data = mask_data[:, :, num_slices_to_skip:]

    # Resize the images and masks
    num_slices = image_data.shape[2]
    preprocessed_images = []
    preprocessed_masks = []

    for i in range(num_slices):
        # Linear interpolation for intensities, nearest neighbour for labels
        image_slice = cv2.resize(image_data[:, :, i], target_size, interpolation=cv2.INTER_LINEAR)
        mask_slice = cv2.resize(mask_data[:, :, i], target_size, interpolation=cv2.INTER_NEAREST)
        preprocessed_images.append(image_slice)
        preprocessed_masks.append(mask_slice)

    return np.array(preprocessed_images), np.array(preprocessed_masks)
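
The function above rescales with a min-max formula, whereas the task text suggests dividing by 3071. The short comparison below shows what each option does to the two ends of the CT range and to 0 HU; which one to prefer is a design choice.

```python
import numpy as np

HU_MIN, HU_MAX = -1000.0, 3071.0
hu = np.array([HU_MIN, 0.0, HU_MAX])

# Option suggested in the task: divide by the maximum only
divide_only = hu / HU_MAX

# Option used in preprocess_image: shift by the minimum, then divide by the range
min_max = (hu - HU_MIN) / (HU_MAX - HU_MIN)

print(divide_only)  # lower end is negative, upper end is 1
print(min_max)      # lower end is 0, upper end is 1
```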

*4.2. Save Preprocessed Slices:*

In [None]:
def save_slices(images, masks, output_dir, subject_id):
    image_dir = os.path.join(output_dir, f"{subject_id}/data")
    mask_dir = os.path.join(output_dir, f"{subject_id}/masks")

    os.makedirs(image_dir, exist_ok=True)
    os.makedirs(mask_dir, exist_ok=True)

    for i in range(images.shape[0]):
        np.save(os.path.join(image_dir, f"{i}.npy"), images[i])
        np.save(os.path.join(mask_dir, f"{i}.npy"), masks[i])
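
The resulting directory layout can be checked with a small round trip on random data. This is a sketch in a temporary directory; `save_subject` is a hypothetical pathlib-based variant of the function above, and the array sizes are arbitrary.

```python
import tempfile
from pathlib import Path

import numpy as np

def save_subject(images, masks, root, subject_id):
    """Write each slice to <root>/<subject_id>/data/<i>.npy and .../masks/<i>.npy."""
    for kind, stack in (("data", images), ("masks", masks)):
        out = Path(root) / str(subject_id) / kind
        out.mkdir(parents=True, exist_ok=True)
        for i, slice_2d in enumerate(stack):
            np.save(out / f"{i}.npy", slice_2d)

# Three random 4x4 "slices" and matching binary masks
images = np.random.rand(3, 4, 4).astype(np.float32)
masks = (images > 0.5).astype(np.uint8)

with tempfile.TemporaryDirectory() as tmp:
    save_subject(images, masks, tmp, subject_id=0)
    written = sorted(p.relative_to(tmp).as_posix() for p in Path(tmp).rglob("*.npy"))
    reloaded = np.load(Path(tmp) / "0" / "data" / "2.npy")

print(written)
```

Storing images as `float32` and masks as `uint8`, as assumed here, takes less disk space than the `float64` arrays returned by `get_fdata()`.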

*4.3. Main Preprocessing Loop:*

In [None]:
def main_preprocessing(input_dir, output_dir):
    # Sort so that "the last 6 subjects" is deterministic; glob order is not guaranteed
    all_files = sorted(Path(input_dir).glob("lung_*"))

    # Split the files into training and validation sets
    splits = {"train": all_files[:-6], "validation": all_files[-6:]}

    for split_name, files in splits.items():
        for subject_id, file_pair in enumerate(tqdm(files)):
            image_file = file_pair / "imagesTr" / "image.nii.gz"  # Adjust paths as per actual file structure
            mask_file = file_pair / "labelsTr" / "label.nii.gz"   # Adjust paths as per actual file structure

            # Load image and mask
            _, image_data = load_nifti(image_file)
            _, mask_data = load_nifti(mask_file)

            # Preprocess image and mask
            preprocessed_images, preprocessed_masks = preprocess_image(image_data, mask_data)

            # Save preprocessed slices
            save_slices(preprocessed_images, preprocessed_masks, output_dir / split_name, subject_id)

# Define file paths and output directory
input_dir = 'Task06_Lung'
output_dir = Path('Task06_Lung_Preprocessed')

# Preprocess and save the data
main_preprocessing(input_dir, output_dir)

0it [00:00, ?it/s]

0it [00:00, ?it/s]
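
The two `0it` progress bars above indicate that the glob matched nothing, so no subject was processed. Once files are found, the "last 6 subjects" split only means something if the list is sorted first. A minimal sketch with hypothetical subject names:

```python
from pathlib import Path

# Hypothetical subject names in the arbitrary order a directory listing might return
names = [f"lung_{i:03d}" for i in (10, 1, 3, 25, 14, 2, 7, 9)]
paths = sorted(Path(name) for name in names)

n_val = 6
train, val = paths[:-n_val], paths[-n_val:]

print([p.name for p in train])
print([p.name for p in val])
```

A quick `assert len(all_files) > 0` before the loop would also surface an empty match right away.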

# **5. Validate Preprocessed Data:**

**Task: Take a look at your stored files and inspect if everything worked as expected.**

*5.1. Inspect Preprocessed Data:*

In [None]:
def inspect_preprocessed_data(data_dir):
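    for subject_dir in sorted(Path(data_dir).glob('*')):
        # Sort numerically so that slice 10 follows slice 9 rather than slice 1
        data_files = sorted(subject_dir.glob('data/*.npy'), key=lambda p: int(p.stem))
        mask_files = sorted(subject_dir.glob('masks/*.npy'), key=lambda p: int(p.stem))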

        for data_file, mask_file in zip(data_files, mask_files):
            image_data = np.load(data_file)
            mask_data = np.load(mask_file)

            plt.figure(figsize=(10, 5))
            plt.subplot(1, 2, 1)
            plt.imshow(image_data, cmap='gray')
            plt.title(f'{data_file.stem} Image')
            plt.axis('off')

            plt.subplot(1, 2, 2)
            plt.imshow(mask_data, cmap='jet', alpha=0.5)
            plt.title(f'{mask_file.stem} Mask')
            plt.axis('off')

            plt.show()

# Validate preprocessed data
inspect_preprocessed_data(output_dir / "train")
inspect_preprocessed_data(output_dir / "validation")
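
Besides the visual inspection, a few numeric checks can catch problems without plotting every slice. The sketch below assumes (256, 256) slices, intensities in [0, 1] and a binary mask with labels 0 and 1; `check_slice` is a hypothetical helper, shown here on synthetic arrays.

```python
import numpy as np

def check_slice(image, mask, size=(256, 256), labels=(0, 1)):
    """Return a list of problems found in one image/mask pair (empty if none)."""
    problems = []
    if image.shape != size or mask.shape != size:
        problems.append("shape")
    if image.min() < 0 or image.max() > 1:
        problems.append("range")
    if not np.isin(mask, labels).all():
        problems.append("labels")
    return problems

# A well-formed pair
good_image = np.full((256, 256), 0.3)
good_mask = np.zeros((256, 256))
good_mask[100:110, 100:110] = 1

# A pair with the wrong size, an out-of-range intensity and an unexpected label
bad_image = np.full((128, 128), 2.0)
bad_mask = np.full((128, 128), 3.0)

print(check_slice(good_image, good_mask))
print(check_slice(bad_image, bad_mask))
```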