# Image Quality Quantitative Metrics - Data Subset

## Implementation

The purpose of this notebook is to draw a subset of the synthetic (augmented) dataset. This subset should present the same imbalance and class distribution as the raw training data.

This subset will be used to compute the quantitative image quality metrics FID, IS, and SSIM, as outlined in Section 3.5.1 of the bachelor thesis.

## Step 1 - Importing Dependencies

- Importing the necessary libraries to execute the code.

In [None]:
from torchvision.datasets import ImageFolder
import random
import os

## Step 2 - Dataset Loading

- Loading the full synthetic dataset as a PyTorch dataset with the `ImageFolder` class.

In [None]:
synthetic_dataset = ImageFolder(root='/kaggle/input/augmented-files/03_augmented/images')
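
`ImageFolder` assigns integer labels by sorting the class folder names, so the keys of `discrete_class` only line up with the dataset labels if the folders sort into the same order. The sketch below reproduces that sorting rule with the standard library, so a folder layout can be checked before it is loaded. `folder_label_map` and `expected` are hypothetical names introduced for illustration, and the demo directory is created on the fly rather than taken from the real dataset.

```python
import os
import tempfile

def folder_label_map(root):
    """Map each class folder name to an index, sorted by name (assumed to mirror ImageFolder)."""
    class_names = sorted(entry.name for entry in os.scandir(root) if entry.is_dir())
    return {name: index for index, name in enumerate(class_names)}

# Hypothetical demo layout with three of the class folders
expected = {0: "0_punching_hole", 1: "1_welding_line", 2: "2_crescent_gap"}

with tempfile.TemporaryDirectory() as demo_root:
    for name in expected.values():
        os.makedirs(os.path.join(demo_root, name))
    label_map = folder_label_map(demo_root)

# Every expected index should point at the folder with the same name
mismatches = {i: n for i, n in expected.items() if label_map.get(n) != i}
print(label_map)
print("mismatches:", mismatches)
```

With a loaded dataset, the same comparison can be made against `synthetic_dataset.class_to_idx`. Plain string sorting would place a folder named `10_...` before `2_...`, which is not an issue for the ten single-digit prefixes used here.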

## Step 3 - Defining Utility Functions

- Functions and variables to sample the datasets.

In [None]:
# Discrete classes for the analysis
discrete_class = {0 : "0_punching_hole",
                  1 : "1_welding_line",
                  2 : "2_crescent_gap",
                  3 : "3_water_spot",
                  4 : "4_oil_spot",
                  5 : "5_silk_spot",
                  6 : "6_inclusion",
                  7 : "7_rolled_pit",
                  8 : "8_crease",
                  9 : "9_waist_folding"}

# Desired final number of instances per class
count_real = {0: 140, 1: 174, 2: 144, 3: 184, 4: 130, 5: 416, 6: 138, 7: 19, 8: 33, 9: 96}

def count_instances(dataset):
    """
    Count the number of instances per class in a given dataset.
    Used to double-check the final results.
    """
    class_count = {}
    for _, label in dataset:
        # Start at 0 for labels seen for the first time
        class_count[label] = class_count.get(label, 0) + 1
    return class_count


def random_sample(class_count, synthetic_dataset):
    """
    Randomly sample each class of the dataset, without replacement,
    up to the target size given in class_count.
    """
    synthetic_subset = []
    for label, target_size in class_count.items():
        # Note: this iterates over (and loads) the whole dataset once per class
        label_instances = [(image, label) for image, l in synthetic_dataset if l == label]
        # Cap the sample size when the class has fewer instances than requested
        sample_size = min(target_size, len(label_instances))
        sampled_instances = random.sample(label_instances, sample_size)
        synthetic_subset.extend(sampled_instances)
    return synthetic_subset
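
Because `random_sample` iterates over the dataset once per class, every image is decoded ten times. A lighter alternative, sketched here under the assumption that only the labels are needed to choose the sample, is to draw indices from a list of integer labels (for an `ImageFolder`, the `targets` attribute) and load the selected images afterwards. `sample_indices`, its `seed` parameter, and the toy `targets` list are hypothetical and not part of the original notebook.

```python
import random
from collections import defaultdict

def sample_indices(targets, class_count, seed=None):
    """Pick up to class_count[label] dataset indices per label, without replacement."""
    rng = random.Random(seed)  # local generator, so the global random state is untouched

    # Group dataset indices by label in a single pass
    indices_by_label = defaultdict(list)
    for index, label in enumerate(targets):
        indices_by_label[label].append(index)

    chosen = []
    for label, target_size in class_count.items():
        candidates = indices_by_label.get(label, [])
        # Cap the sample size when the class has fewer instances than requested
        sample_size = min(target_size, len(candidates))
        chosen.extend(rng.sample(candidates, sample_size))
    return chosen

# Toy labels standing in for `synthetic_dataset.targets`
targets = [0] * 5 + [1] * 3 + [2] * 1
subset_indices = sample_indices(targets, {0: 2, 1: 3, 2: 4}, seed=0)
subset_labels = [targets[i] for i in subset_indices]
print(subset_indices, subset_labels)
```

The selected images could then be read as `synthetic_dataset[i]` for each `i` in `subset_indices`. Fixing `seed` makes the draw repeatable between runs.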

## Step 4 - Defining the Subset

- Subset definition.
- Confirming that the number of instances per class in the subset matches the target number.

In [None]:

synthetic_subset = random_sample(class_count=count_real, synthetic_dataset=synthetic_dataset)

count_synthetic = count_instances(synthetic_subset)

print(f"Synthetic instances: {count_synthetic}")
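
Printing the counts leaves the comparison with `count_real` to the reader. A small helper can report any class whose sampled count falls short of its target, which happens when the synthetic dataset holds fewer images of a class than requested. `find_shortfalls` and the example counts below are hypothetical and only illustrate the check.

```python
def find_shortfalls(target_count, actual_count):
    """Return {label: (target, actual)} for classes below their target."""
    shortfalls = {}
    for label, target in target_count.items():
        actual = actual_count.get(label, 0)  # a missing class counts as 0
        if actual < target:
            shortfalls[label] = (target, actual)
    return shortfalls

# Illustrative values only, not results from the notebook
example_target = {0: 140, 7: 19, 8: 33}
example_actual = {0: 140, 7: 12}
shortfalls = find_shortfalls(example_target, example_actual)
print(shortfalls)  # {7: (19, 12), 8: (33, 0)}
```

In the notebook, the call would be `find_shortfalls(count_real, count_synthetic)`; an empty dictionary means every class reached its target.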

## Step 5 - Saving the New Synthetic Subset

- Saving the subset created for future use.

In [None]:
path = 'path/to/the/subset'

for idx, (image, label) in enumerate(synthetic_subset):
    # One output folder per class, named after the class
    folder_name = discrete_class[label]
    image_name = f"Class_{label}_Image_{idx}.jpg"
    folder_path = os.path.join(path, folder_name)
    os.makedirs(folder_path, exist_ok=True)
    image_path = os.path.join(folder_path, image_name)
    # `image` is a PIL image, as returned by ImageFolder's default loader
    image.save(image_path)
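
Calling `image.save` with a `.jpg` name re-encodes each image as JPEG, which can change pixel values slightly compared with the source files. Since the subset feeds pixel-based metrics, one option is to copy the source files byte for byte instead. The sketch assumes a list of `(file_path, label)` pairs such as the entries of `ImageFolder.samples`; `copy_subset` and the temporary demo file are hypothetical.

```python
import os
import shutil
import tempfile

def copy_subset(samples, class_names, out_dir):
    """Copy each (file_path, label) pair into out_dir/<class name>/ unchanged."""
    copied = []
    for idx, (src_path, label) in enumerate(samples):
        folder_path = os.path.join(out_dir, class_names[label])
        os.makedirs(folder_path, exist_ok=True)
        extension = os.path.splitext(src_path)[1]  # keep the original file type
        dst_path = os.path.join(folder_path, f"Class_{label}_Image_{idx}{extension}")
        shutil.copy2(src_path, dst_path)
        copied.append(dst_path)
    return copied

# Demo with a placeholder file instead of real images
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "source.png")
    with open(src, "wb") as handle:
        handle.write(b"placeholder bytes")
    subset_dir = os.path.join(tmp, "subset")
    copied = copy_subset([(src, 8)], {8: "8_crease"}, subset_dir)
    with open(copied[0], "rb") as handle:
        copied_bytes = handle.read()
    copied_name = os.path.relpath(copied[0], subset_dir)

print(copied_name)
```

In the notebook, `class_names` would be `discrete_class` and `samples` would be the chosen entries of `synthetic_dataset.samples`.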
