# Creation of an all-in-one dataset

During the preprocessing step in `Preprocessing.ipynb`, training and test images are stored in one subfolder per region. This proved impractical at times, especially for retraining. For predicting the challenge submission, it is also convenient to have all test images in one place. This notebook therefore merges the data of all regions into a single `all_in_one` dataset.

In [15]:
from os.path import join
import shutil
import json

import utils

The order of the entries matters: datasets and labels must be loaded in the same order!

In [16]:
sets = [('colombia', 'borde_rural'),
        ('colombia', 'borde_soacha'),
        ('guatemala', 'mixco_1_and_ebenezer'), 
        ('guatemala', 'mixco_3'),
        ('st_lucia', 'dennery')]

### Define paths

In [17]:
data_path = join('..', '..', 'stac')
# Per-region source folders are built inside the copy loop further down,
# where `country` and `region` are actually defined.
all_in_one = join(data_path, 'all_in_one')

materials = {'concrete_cement':0, 'healthy_metal':1, 'incomplete':2, 'irregular_metal':3, 'other':4}
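The `materials` dictionary maps each material name to an integer label. When reading model output back, the inverse lookup may be handy. The following is a minimal sketch; `index_to_material` and `predicted_indices` are hypothetical names that do not appear elsewhere in this notebook.

```python
materials = {'concrete_cement': 0, 'healthy_metal': 1, 'incomplete': 2,
             'irregular_metal': 3, 'other': 4}

# Inverse lookup: integer label -> material name
index_to_material = {index: name for name, index in materials.items()}

# Hypothetical predicted labels, used only for illustration
predicted_indices = [3, 0, 4]
predicted_names = [index_to_material[i] for i in predicted_indices]
print(predicted_names)  # ['irregular_metal', 'concrete_cement', 'other']
```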

### Copy all labels into one file

In [18]:
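import os

# Make sure the target folder exists before writing the merged labels
os.makedirs(join(all_in_one, 'roofs_train'), exist_ok=True)

all_labels = utils.get_labels(sets, data_path)
with open(join(all_in_one, 'roofs_train', 'simple_labels.json'), 'w') as label_file:
    json.dump(all_labels, label_file)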
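The implementation of `utils.get_labels` is not shown here. As a rough sketch of what merging per-region labels in the order of `sets` could look like, the hypothetical helper below assumes that every region has a `roofs_train/simple_labels.json` file containing a JSON object that maps a sample id to its label. Both the file location and the format are assumptions, and the real helper may work differently.

```python
import json
from os.path import join


def merge_region_labels(sets, data_path, label_name='simple_labels.json'):
    """Merge per-region label files into one dict, visiting regions in `sets` order.

    Hypothetical helper: assumes each label file is a JSON object
    mapping sample id -> label.
    """
    merged = {}
    for country, region in sets:
        label_path = join(data_path, country, region, 'roofs_train', label_name)
        with open(label_path) as label_file:
            region_labels = json.load(label_file)

        # Refuse to silently overwrite ids that occur in more than one region
        duplicates = merged.keys() & region_labels.keys()
        if duplicates:
            raise ValueError(f'duplicate ids in {region}: {sorted(duplicates)[:5]}')

        merged.update(region_labels)  # dicts keep insertion order
    return merged
```

Because Python dictionaries preserve insertion order, the merged labels follow the order of `sets`, which is the property the ordering remark above relies on.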

### Copy all samples into one directory per material

In [28]:
for country, region in sets:
    region_train_dir = join(data_path, country, region, 'roofs_train')
    region_test_dir = join(data_path, country, region, 'roofs_test')

    # Copy all training images by material.
    # dirs_exist_ok requires Python 3.8+; files with the same name are overwritten.
    for mat in materials:
        region_mat_dir = join(region_train_dir, mat)
        all_mat_dir = join(all_in_one, 'roofs_train', mat)
        shutil.copytree(region_mat_dir, all_mat_dir, dirs_exist_ok=True)

    # Copy all test samples
    all_test_dir = join(all_in_one, 'region_test')
    shutil.copytree(region_test_dir, all_test_dir, dirs_exist_ok=True)
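Since `shutil.copytree(..., dirs_exist_ok=True)` overwrites files that share a name, two regions containing an identically named image would silently lose one of them in the merged folder. The sketch below shows one way to check for such collisions and to count the merged files afterwards. `find_name_collisions` and `count_files` are hypothetical helpers, not part of `utils`.

```python
from collections import defaultdict
from pathlib import Path


def find_name_collisions(source_dirs):
    """Return {file name: [source dirs]} for names found in more than one directory."""
    seen = defaultdict(list)
    for src in source_dirs:
        src = Path(src)
        if not src.is_dir():
            continue  # skip folders that do not exist
        for path in sorted(src.iterdir()):
            if path.is_file():
                seen[path.name].append(str(src))
    return {name: dirs for name, dirs in seen.items() if len(dirs) > 1}


def count_files(directory):
    """Count regular files directly inside `directory` (0 if it does not exist)."""
    directory = Path(directory)
    if not directory.is_dir():
        return 0
    return sum(1 for path in directory.iterdir() if path.is_file())
```

With the names used in this notebook, a check on the test images could look like `find_name_collisions(join(data_path, c, r, 'roofs_test') for c, r in sets)`. An empty result means no file was overwritten, and the sum of `count_files` over the region folders should then match `count_files(all_test_dir)`.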