# Mixed Dataset

This Jupyter notebook creates folders containing mixed DL and CT images.

**Authors:** [Raphaël Achddou](https://people.epfl.ch/raphael.achddou) ([contact](mailto:raphael.achddou@epfl.ch)), [Paulo Ribeiro](https://people.epfl.ch/paulo.ribeirodecarvalho) ([contact](mailto:paulo.ribeirodecarvalho@epfl.ch))

In [2]:
!pip install opencv-python

Defaulting to user installation because normal site-packages is not writeable
Collecting opencv-python
  Downloading opencv_python-4.9.0.80-cp37-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (20 kB)
Downloading opencv_python-4.9.0.80-cp37-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (62.2 MB)
Installing collected packages: opencv-python
Successfully installed opencv-python-4.9.0.80


In [3]:
import os
import cv2
import random
from tqdm.notebook import tqdm

SEED = 42
PERCENTAGES = [0.2, 0.4, 0.6, 0.8]

## Load DL and CT

First load all the DL and CT images.

In [4]:
# Specify the directory containing the .png images
dl_folder = "train/dl_images_texture"
ct_folder = "train/ct_images"

def load_images_in_folder(folder_path: str) -> list:
    """Load every readable .png image found in `folder_path`."""
    images = []
    # Iterate over all files in the directory (sorted, so the order is deterministic)
    for filename in tqdm(sorted(os.listdir(folder_path))):
        # Skip anything that is not a .png image
        if not filename.endswith(".png"):
            continue
        # Construct the full file path
        file_path = os.path.join(folder_path, filename)
        # Read the image; cv2.imread returns None if the file cannot be decoded
        image = cv2.imread(file_path)
        if image is not None:
            images.append(image)

    return images

dl_images = load_images_in_folder(folder_path=dl_folder)
print(f"Loaded {len(dl_images)} DL images.")
ct_images = load_images_in_folder(folder_path=ct_folder)
print(f"Loaded {len(ct_images)} CT images.")

Loaded 13300 DL images.
Loaded 13302 CT images.
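
The loader silently skips entries that are not `.png` files, as well as files that OpenCV cannot decode. The following is a minimal sketch, using only the standard library, for listing the entries that the extension filter would skip. `list_skipped_entries` is a hypothetical helper introduced here for illustration; it does not detect `.png` files that fail to decode.

```python
import os

def list_skipped_entries(folder_path):
    """Return the sorted names in `folder_path` that do not end with .png."""
    return sorted(
        name for name in os.listdir(folder_path) if not name.endswith(".png")
    )

# Folder names taken from the loading cell; missing folders are ignored
for folder in ["train/dl_images_texture", "train/ct_images"]:
    if os.path.isdir(folder):
        print(folder, list_skipped_entries(folder))
```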


## Mixed Dataset

For each value in `PERCENTAGES`, sample that fraction of DL images, fill the remainder with CT images, and write the shuffled mix to disk.
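
The split arithmetic and the seeded sampling can be illustrated on placeholder names, without loading any images. This is a minimal sketch rather than the notebook's implementation: `plan_split` is a hypothetical helper, and it uses a local `random.Random` instance instead of the global seed. Like the cell below, it takes the number of DL items as the size of the mixed set.

```python
import random

def plan_split(dl_items, ct_items, percentage, seed=42):
    """Return a shuffled mix with `percentage` DL items and the rest CT items."""
    rng = random.Random(seed)  # local generator, so global random state is untouched
    total = len(dl_items)  # the mixed set has as many items as the DL set
    dl_number = int(total * percentage)
    ct_number = total - dl_number
    mixed = rng.sample(ct_items, ct_number) + rng.sample(dl_items, dl_number)
    rng.shuffle(mixed)
    return mixed

# Toy example with placeholder names instead of images
dl_names = [f"dl_{i}" for i in range(10)]
ct_names = [f"ct_{i}" for i in range(12)]
mixed = plan_split(dl_names, ct_names, percentage=0.2)
n_dl = sum(name.startswith("dl_") for name in mixed)
print(n_dl, len(mixed) - n_dl)  # number of DL items, number of CT items
```

With the 13300 DL images loaded above, the same arithmetic gives 2660, 5320, 7980, and 10640 DL images for the four percentages, with CT images making up the rest of each 13300-image folder.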

In [6]:
# Fix the seed for reproducibility
random.seed(SEED)

# Create the mixed datasets
def create_mixed_dataset(ct_images: list, dl_images: list, percentages: list):
    base_folder_name = "train/mixed_dl_ct_"

    for percentage in percentages:
        # Create the output directory, e.g. "train/mixed_dl_ct_20%"
        current_folder_name = base_folder_name + f"{int(percentage * 100)}%"
        os.makedirs(current_folder_name, exist_ok=True)
        # Randomly choose the DL share, then fill the remainder with CT images
        total_images = len(dl_images)
        dl_number = int(total_images * percentage)
        ct_number = total_images - dl_number
        batch_ct_images = random.sample(ct_images, ct_number)
        batch_dl_images = random.sample(dl_images, dl_number)
        mixed_images = batch_ct_images + batch_dl_images
        random.shuffle(mixed_images)  # Shuffle the mixed images

        # Store the mixed images on disk
        for idx, image in tqdm(enumerate(mixed_images), total=len(mixed_images)):
            cv2.imwrite(os.path.join(current_folder_name, f"image_{idx:05}.png"), image)

create_mixed_dataset(ct_images=ct_images, dl_images=dl_images, percentages=PERCENTAGES)
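
Once the cell has run, each output folder can be checked for the expected number of files. This is a minimal sketch, assuming the folder naming pattern used above. `count_png_files` is a hypothetical helper introduced here for illustration.

```python
import os

def count_png_files(folder_path):
    """Count the entries in `folder_path` whose name ends with .png."""
    return sum(name.endswith(".png") for name in os.listdir(folder_path))

# Folder names follow the pattern used in create_mixed_dataset
for percentage in [0.2, 0.4, 0.6, 0.8]:
    folder = f"train/mixed_dl_ct_{int(percentage * 100)}%"
    if os.path.isdir(folder):
        print(folder, count_png_files(folder))
    else:
        print(folder, "not found")
```

Each folder should contain as many images as the DL set, since the mixed set size is taken from `len(dl_images)`.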
