# Synthetic Data
___

With only 153 images to start, more are being generated to evaluate whether the project is practical.
In the real world, the binary classification dataset should be balanced 50/50. A vessel is either dirty or clean, so it should provide pictures of dirty holds and, eventually, of clean holds.

The multiclass classification problem should also be balanced.  The goal of the multiclass model is to determine how close the vessel's holds are to various cargo cleanliness standards.
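
One way to check how far a labeled set is from a balanced target is to compute each class's share of the total. The sketch below is illustrative only: `class_shares` is a hypothetical helper, and the labels in the usage example are made up rather than taken from the project data.

```python
from collections import Counter


def class_shares(labels):
    """Return each label's fraction of the total (hypothetical helper)."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


# Made-up labels for illustration; a balanced binary set would give 0.5 / 0.5.
example_labels = ['clean'] * 1 + ['dirty'] * 3
print(class_shares(example_labels))
```

The same helper works unchanged for the multiclass case, since it makes no assumption about the number of labels.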

For the time being, the dataset is augmented to generate synthetic data, which will be replaced by real data when it is received.

In [1]:
import os
from tensorflow.keras.preprocessing.image import ImageDataGenerator, img_to_array, load_img, smart_resize

In [12]:
# data augmentation generator
datagen = ImageDataGenerator(
    brightness_range=(0.5, 1.5),
    rotation_range=45,
    width_shift_range=0.3,
    height_shift_range=0.3,
    shear_range=0.3,
    zoom_range=[0.4, 0.6],
    channel_shift_range=100,
    horizontal_flip=True,
    validation_split=0.1,
    fill_mode='nearest'
)

In [11]:
# Creating more clean images
clean_class_path = '../data/02_classified/clean/'
clean_class_files = os.listdir(clean_class_path)
clean_class_images = [clean_class_path + file for file in clean_class_files]

1488
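
`os.listdir` returns every entry in the folder in arbitrary order, including hidden files and subfolders, and all of them would later be passed to `load_img`. A minimal sketch of a stricter listing follows, assuming the images use common extensions. `list_images` and the extension set are illustrative choices, not part of the original notebook.

```python
from pathlib import Path

# Assumed set of image extensions; adjust to match the actual data.
IMAGE_EXTENSIONS = {'.jpg', '.jpeg', '.png'}


def list_images(folder):
    """Return sorted paths of image files directly inside folder (hypothetical helper)."""
    return sorted(
        str(path)
        for path in Path(folder).iterdir()
        if path.is_file() and path.suffix.lower() in IMAGE_EXTENSIONS
    )
```

For example, `list_images('../data/02_classified/clean/')` could stand in for the list comprehension in the cell above.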

In [13]:
import math

def create_synthetics(image_list, target, destination_path):
    """Create augmented copies of each image in image_list and save them to destination_path."""
    # number of images to generate per source image, rounded up so the total reaches the target
    generate_per_image = math.ceil(target / len(image_list))

    # loop through image list and generate synthetic images
    for image in image_list:
        x = load_img(image)
        x = img_to_array(x)
        x = x.reshape((1,) + x.shape)  # add a batch dimension
        i = 0
        # datagen.flow yields batches indefinitely, so stop once enough images are saved
        for batch in datagen.flow(x, batch_size=1, save_to_dir=destination_path, save_prefix='aug', save_format='jpg'):
            i += 1
            if i >= generate_per_image:
                break
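
The per-image count is the target divided by the number of source images. As a small worked sketch of that arithmetic, assuming the count is rounded up so the total is never below the target (`copies_per_image` is a hypothetical name):

```python
import math


def copies_per_image(target, n_sources):
    """Number of augmented copies per source image, rounded up (hypothetical helper)."""
    if n_sources <= 0:
        raise ValueError('n_sources must be positive')
    return math.ceil(target / n_sources)


# Illustrative numbers: 422 images spread over 2 sources, and 1500 over 153.
print(copies_per_image(422, 2))     # 211
print(copies_per_image(1500, 153))  # 10
```

When the division is not exact, rounding up overshoots the target slightly, which is usually acceptable for augmentation.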

In [14]:
# Commented out so as not to add more unrealistic pictures
# clean_target = 1500
# clean_processed = '../data/03_processed/clean/'
# create_synthetics(clean_class_images, clean_target, clean_processed)

In [27]:
clean_gray = []
image_gray_1 = '../data/02_classified/clean/465113660_8598614486852404_3598750916332545183_n.jpg'
image_gray_2 = '../data/02_classified/clean/50035257_2049242315123020_9194499840761921536_n.jpg'
clean_gray.append(image_gray_1)
clean_gray.append(image_gray_2)  

In [28]:
# Commented out so as not to add more unrealistic pictures
#clean_processed = '../data/03_processed/clean/'
#gray_target = 422
#create_synthetics(clean_gray, gray_target, clean_processed)

In [33]:
# Creating more dirty images
dirty_class_path = '../data/02_classified/dirty/'
dirty_class_files = os.listdir(dirty_class_path)
dirty_class_images = [dirty_class_path + file for file in dirty_class_files]

In [34]:
# Commented out so as not to add more unrealistic pictures
#dirty_target = 1500
#dirty_processed = '../data/03_processed/dirty/'
#create_synthetics(dirty_class_images, dirty_target, dirty_processed)

### Train / Validate / Test Split
___

This section organizes the data further into train, validation, and test directories after synthetic data has been generated for the two classes: clean and dirty.

This approach is inspired by the Medium article by Aravinda:
https://aravinda-gn.medium.com/how-to-split-image-dataset-into-train-validation-and-test-set-5a41c48af332

In [35]:
import random
import shutil

In [38]:
# assemble lists of images
dirty_process_path = '../data/03_processed/dirty_STOP_SIGN/'
clean_process_path = '../data/03_processed/clean_STOP_SIGN/'

dirty_process_files = os.listdir(dirty_process_path)
clean_process_files = os.listdir(clean_process_path)

dirty_process_images = [dirty_process_path + file for file in dirty_process_files]
clean_process_images = [clean_process_path + file for file in clean_process_files]

len(dirty_process_images), len(clean_process_images)

(1418, 1363)

In [46]:
# setup each directory
train = os.path.join('../data/', '04_train')
validation = os.path.join('../data/', '05_valiation')
test = os.path.join('../data/', '06_test')

# setup each sub directory, built from the parent directories above
train_dirty = os.path.join(train, 'train_dirty')
train_clean = os.path.join(train, 'train_clean')
validation_dirty = os.path.join(validation, 'validation_dirty')
validation_clean = os.path.join(validation, 'validation_clean')
test_dirty = os.path.join(test, 'test_dirty')
test_clean = os.path.join(test, 'test_clean')

# randomness
random.seed(27)
random.shuffle(dirty_process_images)
random.shuffle(clean_process_images)

# size of each group
train_dirty_size = round((len(dirty_process_images)) * 0.8, 0)
train_clean_size = round((len(clean_process_images)) * 0.8, 0)
validation_dirty_size  = round((len(dirty_process_images)) * 0.1, 0)
validation_clean_size = round((len(clean_process_images)) * 0.1, 0)
test_dirty_size  = round((len(dirty_process_images)) * 0.1, 0)
test_clean_size  = round((len(clean_process_images)) * 0.1, 0)
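
`random.seed(27)` makes the shuffle repeatable only if the input list arrives in the same order each time, and `os.listdir` does not guarantee an order. A hedged sketch of one way to make the shuffle fully reproducible is to sort first and use a local `random.Random` instance. `reproducible_shuffle` is a hypothetical helper, not part of the notebook.

```python
import random


def reproducible_shuffle(items, seed=27):
    """Return a shuffled copy of items that is identical on every run (hypothetical helper)."""
    ordered = sorted(items)    # remove dependence on the listing order
    rng = random.Random(seed)  # local generator, leaves the global random state untouched
    rng.shuffle(ordered)
    return ordered
```

Because it returns a copy, the original list is left unchanged, unlike `random.shuffle`, which shuffles in place.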

In [53]:
# sizes of the clean train / validation / test groups
train_clean_size, validation_clean_size, test_clean_size

(1090.0, 136.0, 136.0)
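
Rounding each share independently means the three sizes need not add up to the class total: for the 1,363 clean images, 1090 + 136 + 136 = 1362. One possible alternative, sketched here under the assumption that the test set should absorb the remainder, computes integer sizes that always sum to the total. `split_sizes` is a hypothetical helper.

```python
def split_sizes(n_items, train_frac=0.8, val_frac=0.1):
    """Return integer (train, validation, test) sizes that sum to n_items (hypothetical helper)."""
    train_size = round(n_items * train_frac)
    val_size = round(n_items * val_frac)
    test_size = n_items - train_size - val_size  # remainder goes to test
    return train_size, val_size, test_size


print(split_sizes(1363))  # (1090, 136, 137)
print(split_sizes(1418))  # (1134, 142, 142)
```

Returning integers also avoids the float sizes (`1090.0`) that `round(x, 0)` produces, which matters if the sizes are later used as slice indices.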

In [54]:
# Create destination folders if they don't exist
for folder_path in [train_dirty, train_clean, validation_dirty, validation_clean, test_dirty, test_clean]:
    os.makedirs(folder_path, exist_ok=True)

In [61]:
tvt = []

In [62]:
tvt_dir = {
    'name': 'train_dirty',
    'source_list': dirty_process_images,
    'size': train_dirty_size,
    'destination': train_dirty,
}

tvt.append(tvt_dir)

In [None]:
# Train / Validate / Test per class

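def train_validate_test(source_list, train_size, val_size, train_folder, val_folder, test_folder):
    """Copy the shuffled images of one class into train, validation, and test folders."""
    for i, f in enumerate(source_list):
        if i < train_size:
            dest_folder = train_folder
        elif i < train_size + val_size:
            dest_folder = val_folder
        else:
            dest_folder = test_folder
        # source_list holds full paths, so copy each file under its original name
        shutil.copy(f, os.path.join(dest_folder, os.path.basename(f)))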
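
To show the intended behavior end to end, here is a self-contained sketch that splits a list of full file paths into three folders by position. It runs against temporary folders with empty placeholder files, so it does not touch the project data. `split_and_copy`, `demo`, and their parameters are assumptions made for illustration, not the notebook's final function.

```python
import os
import shutil
import tempfile


def split_and_copy(source_list, train_size, val_size, train_folder, val_folder, test_folder):
    """Copy files into train / validation / test folders by list position (hypothetical helper)."""
    for folder in (train_folder, val_folder, test_folder):
        os.makedirs(folder, exist_ok=True)
    for i, src in enumerate(source_list):
        if i < train_size:
            dest_folder = train_folder
        elif i < train_size + val_size:
            dest_folder = val_folder
        else:
            dest_folder = test_folder
        # source_list holds full paths, so keep only the file name at the destination
        shutil.copy(src, os.path.join(dest_folder, os.path.basename(src)))


def demo():
    """Split 10 placeholder files 8 / 1 / 1 and return the per-folder counts."""
    with tempfile.TemporaryDirectory() as root:
        source = os.path.join(root, 'source')
        os.makedirs(source)
        paths = []
        for n in range(10):
            path = os.path.join(source, f'img_{n}.jpg')
            open(path, 'w').close()  # empty placeholder, not a real image
            paths.append(path)
        folders = [os.path.join(root, name) for name in ('train', 'validation', 'test')]
        split_and_copy(paths, 8, 1, *folders)
        return tuple(len(os.listdir(folder)) for folder in folders)


print(demo())  # (8, 1, 1)
```

Anything past the train and validation sizes lands in the test folder, so no file is dropped even when the sizes do not sum exactly to the list length.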