# About the Dataset

Imports used in this notebook:

In [1]:
import os
import shutil

The dataset comes from [OSF](https://osf.io/s6ru5/files/osfstorage#). I downloaded only the folder “Dataset_Original”, which is located inside the folder “CD&S”. The download may take a while because the folder is large.

Once you have downloaded the folder and moved it to your working directory, you can unzip it. The `main(base_folder)` cell below prints an overview of its contents. Tip: do not delete the zip file, so that you can recreate the original data set later if needed.
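The unzip step can also be done from Python. The following is only a minimal sketch: the helper `extract_dataset` and the archive name `Dataset_Original.zip` are assumptions for illustration, so adjust the path to the file you actually downloaded.

```python
import os
import zipfile

def extract_dataset(zip_path, target_dir):
    """Extract zip_path into target_dir and return the top-level names in the archive."""
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target_dir)
        # Names inside a zip archive always use '/' as separator
        top_level = sorted({name.split('/')[0] for name in archive.namelist()})
    return top_level

# Hypothetical archive name and location; adjust to your download.
zip_path = '../ML2Project/Dataset_Original.zip'
if os.path.exists(zip_path):
    print(extract_dataset(zip_path, '../ML2Project'))
else:
    print(f"Archive not found: {zip_path}")
```

Because the archive is only read, the zip file stays untouched and can be extracted again at any time.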

Make sure the path to the Dataset is set correctly.

In [13]:
base_folder = '../ML2Project/Dataset_Original'  # Adjust this path if the data set is not in the same working directory

import os

def list_folder_structure(base_folder):
    """Build a nested dict: base folder -> split folder -> class folder -> list of files."""
    structure = {}
    for root, dirs, files in os.walk(base_folder):
        # Depth of the current folder relative to base_folder
        level = root.replace(base_folder, '').count(os.sep)
        if level == 0:
            structure[root] = {}
        elif level == 1:
            parent_folder = os.path.basename(root)
            structure[base_folder][parent_folder] = {}
        elif level == 2:
            grandparent_folder = os.path.basename(os.path.dirname(root))
            parent_folder = os.path.basename(root)
            structure[base_folder][grandparent_folder][parent_folder] = files
    return structure

def count_images_in_folder(folder):
    image_extensions = ('.jpg',)  # Trailing comma makes this a tuple, so more extensions can be added
    return len([f for f in os.listdir(folder) if f.lower().endswith(image_extensions)])

def summarize_image_counts(base_folder):
    summary = {}
    for root, dirs, files in os.walk(base_folder):
        level = root.replace(base_folder, '').count(os.sep)
        if level == 2:  # We're at the lowest level
            image_count = count_images_in_folder(root)
            summary[root] = image_count
    return summary

def main(base_folder):
    structure = list_folder_structure(base_folder)
    summary = summarize_image_counts(base_folder)
    
    print("Folder Structure:")
    for key, value in structure.items():
        print(f"{key}:")
    
    print("\nImage Summary:")
    for key, value in summary.items():
        print(f"{key} contains {value} images")

main(base_folder)

Folder Structure:
../ML2Project/Dataset_Original:

Image Summary:
../ML2Project/Dataset_Original\test\gls contains 261 images
../ML2Project/Dataset_Original\test\nlb contains 248 images
../ML2Project/Dataset_Original\test\nls contains 276 images
../ML2Project/Dataset_Original\train\gls contains 262 images
../ML2Project/Dataset_Original\train\nlb contains 249 images
../ML2Project/Dataset_Original\train\nls contains 275 images
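
To see how the images are distributed between training and test data, the per-folder counts can be summed per split. This is a small sketch, not part of the original notebook: the helper `images_per_split` is hypothetical, and the counts are copied by hand from the output above.

```python
from collections import Counter

def images_per_split(summary):
    """Sum the image counts per split folder ('train' / 'test')."""
    totals = Counter()
    for path, count in summary.items():
        # Normalise Windows separators, then take the split name,
        # i.e. the second-to-last path component (.../<split>/<class>)
        parts = path.replace('\\', '/').rstrip('/').split('/')
        totals[parts[-2]] += count
    return dict(totals)

# Counts copied from the output above
summary = {
    '../ML2Project/Dataset_Original\\test\\gls': 261,
    '../ML2Project/Dataset_Original\\test\\nlb': 248,
    '../ML2Project/Dataset_Original\\test\\nls': 276,
    '../ML2Project/Dataset_Original\\train\\gls': 262,
    '../ML2Project/Dataset_Original\\train\\nlb': 249,
    '../ML2Project/Dataset_Original\\train\\nls': 275,
}

totals = images_per_split(summary)
total_images = sum(totals.values())
for split, count in totals.items():
    print(f"{split}: {count} images ({count / total_images:.1%})")
```

The two splits come out almost exactly equal in size (785 test images versus 786 training images).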


# Reorganize Dataset

The original data set is split into roughly 50% training data and 50% test data, which is not optimal for this application. I decided to halve the test data set and to add the removed images to the training data set. The following script restructures the data set accordingly.

**EXECUTE THE FOLLOWING CODE ONLY ONCE!**

If you ran the following code more than once, each additional run would move another half of the remaining test images into the training set, leaving too little test data. The target distribution is about 3/4 training and 1/4 test data.

The folder paths are defined first; then the script that moves every second image is executed for each pair of folders.

*Do not forget to adjust the paths if yours are different.*

In [15]:
# Paths to the folders (adjust if the image folder is not in the working directory!)
folder_pairs = [
    ('../ML2Project/Dataset_Original/test/gls', '../ML2Project/Dataset_Original/train/gls'),  # source folder 1, target folder 1
    ('../ML2Project/Dataset_Original/test/nlb', '../ML2Project/Dataset_Original/train/nlb'),  # source folder 2, target folder 2
    ('../ML2Project/Dataset_Original/test/nls', '../ML2Project/Dataset_Original/train/nls')   # source folder 3, target folder 3
]

def move_every_second_image(source_folder, target_folder):
    # Check if the source and target folders exist
    if not os.path.exists(source_folder):
        print(f"Source folder '{source_folder}' does not exist.")
        return
    
    if not os.path.exists(target_folder):
        os.makedirs(target_folder)
    
    # Get a list of files in the source folder
    files = sorted(os.listdir(source_folder))
    
    # Keep only image files (add more extensions to the tuple if needed)
    image_extensions = ('.jpg',)
    images = [f for f in files if f.lower().endswith(image_extensions)]
    
    # Move every second image to the target folder
    for index, image in enumerate(images):
        if index % 2 != 0:  # index is odd (every second image)
            source_path = os.path.join(source_folder, image)
            target_path = os.path.join(target_folder, image)
            shutil.move(source_path, target_path)


def count_folder_contents(folder):
    return len(os.listdir(folder))

def process_folders(folder_pairs):
    for source_folder, target_folder in folder_pairs:
        print(f"Processing folder pair: {source_folder} -> {target_folder}")

        before_source_count = count_folder_contents(source_folder)
        before_target_count = count_folder_contents(target_folder)
        
        print(f"Before: {source_folder} contains {before_source_count} items.")
        print(f"Before: {target_folder} contains {before_target_count} items.")
        
        move_every_second_image(source_folder, target_folder)

        after_source_count = count_folder_contents(source_folder)
        after_target_count = count_folder_contents(target_folder)

        print(f"After: {source_folder} contains {after_source_count} items.")
        print(f"After: {target_folder} contains {after_target_count} items.")
        
        print(f"Finished processing folder pair: {source_folder} -> {target_folder}")


process_folders(folder_pairs)

Processing folder pair: ../ML2Project/Dataset_Original/test/gls -> ../ML2Project/Dataset_Original/train/gls
Before: ../ML2Project/Dataset_Original/test/gls contains 261 items.
Before: ../ML2Project/Dataset_Original/train/gls contains 262 items.
After: ../ML2Project/Dataset_Original/test/gls contains 131 items.
After: ../ML2Project/Dataset_Original/train/gls contains 392 items.
Finished processing folder pair: ../ML2Project/Dataset_Original/test/gls -> ../ML2Project/Dataset_Original/train/gls
Processing folder pair: ../ML2Project/Dataset_Original/test/nlb -> ../ML2Project/Dataset_Original/train/nlb
Before: ../ML2Project/Dataset_Original/test/nlb contains 248 items.
Before: ../ML2Project/Dataset_Original/train/nlb contains 249 items.
After: ../ML2Project/Dataset_Original/test/nlb contains 124 items.
After: ../ML2Project/Dataset_Original/train/nlb contains 373 items.
Finished processing folder pair: ../ML2Project/Dataset_Original/test/nlb -> ../ML2Project/Dataset_Original/train/nlb
Proces

After executing the above code, you should get an output similar to the following graphic:

<img src="ImageLib\ImageTransfer.png" alt="image" style="width:700px;"/>
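
The numbers in the output can be checked with a little arithmetic: `enumerate` starts at 0, so the images at the odd indices 1, 3, 5, ... are moved, which is `n // 2` of the `n` images in a test folder. The sketch below uses a hypothetical helper `expected_counts_after_move` and assumes the folders contain only `.jpg` files, so that item counts equal image counts.

```python
def expected_counts_after_move(test_count, train_count):
    """Return (remaining test images, new train total) after one run of the move script."""
    # Odd indices 1, 3, 5, ... out of test_count images: test_count // 2 are moved
    moved = test_count // 2
    return test_count - moved, train_count + moved

# (test, train) counts before the move, taken from the overview output
before = {'gls': (261, 262), 'nlb': (248, 249), 'nls': (276, 275)}
after = {name: expected_counts_after_move(*counts) for name, counts in before.items()}

for name, (test_left, train_total) in after.items():
    print(f"{name}: test {test_left}, train {train_total}")

# Share of all images that end up in the training set
train_share = sum(train for _, train in after.values()) / sum(t + tr for t, tr in after.values())
print(f"Training share: {train_share:.1%}")
```

For `gls` and `nlb` this reproduces the counts printed by the script (131 / 392 and 124 / 373), and the overall training share lands at about three quarters.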

If something went wrong and your output differs, you can delete the data set at any time and recreate it from the downloaded zip file.
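
One way to make the "execute only once" rule harder to break is a marker file that records that the reorganization has already run. This is only a sketch under assumptions: `run_once` and the marker name `.split_done` are hypothetical and not part of the original notebook.

```python
import os

def run_once(marker_path, action):
    """Run action() only if marker_path does not exist yet; return True if it ran."""
    if os.path.exists(marker_path):
        print(f"Marker '{marker_path}' found, skipping.")
        return False
    action()
    # Create the empty marker file only after the action finished without error
    with open(marker_path, 'w'):
        pass
    return True

# Usage sketch (names from the cells above; the marker file name is hypothetical):
# run_once(os.path.join(base_folder, '.split_done'),
#          lambda: process_folders(folder_pairs))
```

Because the marker would live inside the data set folder, deleting the folder and extracting the zip file again also removes the marker, so the reorganization can then be run once more.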
