# split_in_folders
## Splits training data into train/test folders
Author: Ilan Valencius

Date: 1-31-22

**Input Directory**
* Images
  * img_1, img_2, ...
* Masks
  * mask_1, mask_2, ...
* *Note*: image and mask indexes must correspond, i.e. img_1 <-> mask_1
* *Note*: each mask should have the same file name as its image so that both folders sort into the same alphabetical order
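Because the pairing relies only on sort order, a quick check before splitting can catch a missing or extra file. The following is a minimal sketch, not part of the original notebook: `find_unpaired` is a hypothetical helper, and comparing names without their extension is an assumption that allows images and masks to use different file types.

```python
import os

def find_unpaired(img_names, mask_names):
    """Return (images without a mask, masks without an image).

    Names are compared without their extension (an assumption), so
    'a_1.jpg' and 'a_1.png' count as a pair.
    """
    img_stems = {os.path.splitext(name)[0] for name in img_names}
    mask_stems = {os.path.splitext(name)[0] for name in mask_names}
    return sorted(img_stems - mask_stems), sorted(mask_stems - img_stems)

# Toy example with made-up file names
imgs_without_mask, masks_without_img = find_unpaired(
    ["a_1.jpg", "a_2.jpg"], ["a_1.jpg", "a_3.jpg"]
)
print(imgs_without_mask)
print(masks_without_img)
```

If either list is non-empty, the sorted image and mask lists would drift out of alignment and should be fixed before any files are moved.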

**Output Directory**
* Training (can change this name)
  * training_sources
    * img_1, img_2, ...
  * training_targets
    * mask_1, mask_2, ...
* Test (can change this name)
  * test_sources
    * img_n, ...
  * test_targets
    * mask_n, ...

*Note*: please set up the output folder structure manually. The subfolders training_sources, training_targets, test_sources, and test_targets must be named exactly as shown.
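As an alternative to creating the folders by hand, they could be created from Python. This is a sketch under assumptions: `make_split_dirs` is a hypothetical helper that is not in the original notebook, and the demonstration uses a temporary directory rather than the real data paths.

```python
import os
import tempfile

def make_split_dirs(training_folder, test_folder):
    """Create the four subfolders the notebook expects and return their paths."""
    subdirs = [
        os.path.join(training_folder, "training_sources"),
        os.path.join(training_folder, "training_targets"),
        os.path.join(test_folder, "test_sources"),
        os.path.join(test_folder, "test_targets"),
    ]
    for path in subdirs:
        # exist_ok=True makes the call safe to repeat
        os.makedirs(path, exist_ok=True)
    return subdirs

# Demonstration in a throwaway directory
with tempfile.TemporaryDirectory() as root:
    created = make_split_dirs(os.path.join(root, "Training"), os.path.join(root, "Test"))
    print(all(os.path.isdir(path) for path in created))
```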

In [1]:
# Import dependencies
import os
import random

In [2]:
# Set parameters
img_folder = "D:\\Snyder_UNet_spring_2022\\UNet_data\\landcover.ai.v1\\img_output\\" # Folder containing images (absolute path, '\\' separators, trailing '\\')
img_file_ext = "*.jpg" # Glob pattern for image files: *.png for png files, *.tif for tiff files

mask_folder = "D:\\Snyder_UNet_spring_2022\\UNet_data\\landcover.ai.v1\\mask_output\\" # Folder containing masks (same path rules as img_folder)
mask_file_ext = "*.jpg" # Glob pattern for mask files: *.png for png files, *.tif for tiff files

training_folder = "D:\\Snyder_UNet_spring_2022\\UNet_data\\landcover.ai.v1\\landai_training\\"
test_folder = "D:\\Snyder_UNet_spring_2022\\UNet_data\\landcover.ai.v1\\landai_test\\"

# Get lists of filenames in folders
img_files = sorted(os.listdir(img_folder))
mask_files = sorted(os.listdir(mask_folder))

# Get number of samples in the test set
test_percentage = 0.2
test_n = int(test_percentage * len(img_files))
print("Number of samples: %d"%(len(img_files)))
print("Number of samples in test set: %d"%(test_n))

Number of samples: 10674
Number of samples in test set: 2134
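The `img_file_ext` and `mask_file_ext` parameters are defined above, but `os.listdir` returns every entry in a folder, so any stray file would be counted as a sample. One way to apply the patterns is sketched below. `filter_by_pattern` is a hypothetical helper, not part of the original notebook, and the file names are made up.

```python
import fnmatch

def filter_by_pattern(names, pattern):
    """Keep only the names matching a glob pattern such as '*.jpg', sorted."""
    return sorted(name for name in names if fnmatch.fnmatch(name, pattern))

# Toy example: the non-image entry is dropped
print(filter_by_pattern(["b.jpg", "a.jpg", "Thumbs.db"], "*.jpg"))
```

In the notebook this could wrap the listing calls, for example `filter_by_pattern(os.listdir(img_folder), img_file_ext)`.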


In [3]:
# Shuffle both lists together
combined_files = list(zip(img_files, mask_files))
random.shuffle(combined_files)
img_files, mask_files = zip(*combined_files)

# Sanity check: the first image and mask names should match
print(img_files[0])
print(mask_files[0])

# Put files into training dataset
for i in range(test_n, len(img_files)):
    img = img_files[i]
    mask = mask_files[i]

    new_img_path = training_folder + "training_sources\\" + os.path.basename(img)
    new_mask_path = training_folder + "training_targets\\" + os.path.basename(mask)
    
    os.rename(img_folder + img, new_img_path)
    os.rename(mask_folder + mask, new_mask_path)
    
# Put files into test dataset
for i in range(test_n):
    img = img_files[i]
    mask = mask_files[i]
    
    new_img_path = test_folder + "test_sources\\" + os.path.basename(img)
    new_mask_path = test_folder + "test_targets\\" + os.path.basename(mask)
    
    os.rename(img_folder + img, new_img_path)
    os.rename(mask_folder + mask, new_mask_path)



M-33-32-B-b-4-4_106.jpg
M-33-32-B-b-4-4_106.jpg
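The cell above moves files with `os.rename`, so the input folders are emptied, and the unseeded shuffle gives a different split on every run. A possible variant that keeps the originals and makes the split repeatable is sketched here. `split_pairs`, `copy_pairs`, and the `seed` parameter are hypothetical additions, not part of the original notebook.

```python
import os
import random
import shutil

def split_pairs(pairs, test_fraction, seed=0):
    """Shuffle (image, mask) pairs with a fixed seed and return (train, test)."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)  # local generator, global state untouched
    test_n = int(test_fraction * len(shuffled))
    return shuffled[test_n:], shuffled[:test_n]

def copy_pairs(pairs, img_folder, mask_folder, out_img_dir, out_mask_dir):
    """Copy each image and its mask into the given output folders."""
    for img, mask in pairs:
        shutil.copy2(os.path.join(img_folder, img), os.path.join(out_img_dir, img))
        shutil.copy2(os.path.join(mask_folder, mask), os.path.join(out_mask_dir, mask))

# Toy example with made-up file names
pairs = [("img_%d.jpg" % i, "img_%d.jpg" % i) for i in range(10)]
train, test = split_pairs(pairs, 0.2, seed=1)
print(len(train), len(test))
```

Copying doubles the disk usage, so moving the files as the notebook does may still be preferable for large datasets.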
