# Split dataset into `train`/`test` folders

Using the `split-folders` package: https://pypi.org/project/split-folders/

In [2]:
import splitfolders

From the instructions on the package's website:

In [6]:
# # Split with a ratio.
# # To split into only training and validation sets, pass a two-value tuple to `ratio`, e.g. `(.8, .2)`.
# splitfolders.ratio("raw_images", 
#                    output="input_images", 
#                    seed=42, 
#                    ratio=(.8, .2))

**Tues 10/12, 1:30pm:**

It worked! The package created copies of all my files, sorted into `train` and `val` folders, each containing subfolders with the original class folder names.

For now I'm going to delete these, then re-run this code after I have added to the dataset a little today.
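
A quick way to confirm a split like this is to count the files per class in each output folder. The sketch below is a minimal helper, assuming the `<output>/<split>/<class>/` layout described above; `count_split_files` is a hypothetical name and not part of `split-folders`.

```python
from pathlib import Path


def count_split_files(output_dir):
    """Return {split_name: {class_name: file_count}} for a split output folder."""
    counts = {}
    for split_dir in sorted(Path(output_dir).iterdir()):
        if not split_dir.is_dir():
            continue
        # Count only regular files directly inside each class folder.
        counts[split_dir.name] = {
            class_dir.name: sum(1 for f in class_dir.iterdir() if f.is_file())
            for class_dir in sorted(split_dir.iterdir())
            if class_dir.is_dir()
        }
    return counts


# Example usage (assumes the output folder from the cell above exists):
# print(count_split_files("input_images"))
```

Comparing these counts before and after each re-run makes it easy to spot a class that ended up with fewer files than expected.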

In [7]:
# Split with a fixed number of items per class.
# To split into only training and validation sets, pass a single number to `fixed`, e.g. `10`.
splitfolders.fixed("raw_images",
                   output="input_images",
                   seed=42,
                   fixed=100)  # plain int; `(100)` is not a tuple

Copying files: 1461 files [00:15, 91.90 files/s] 


**Wed 10/13, 10:30am:**

Performed operation on full dataset after cropping images.

Decided to set aside 100 images of each class as a holdout/test set. I will use the `validation_split` parameter in Keras's `ImageDataGenerator` to split the training set during model training.

**2:30pm:**

Had to repeat this process after re-orienting all images with `exif_transpose` (see EDA notebook).

In [4]:
# Delete metadata files created by Mac OS
!find . -name ".DS_Store" -delete
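
The same cleanup can be done without a shell command, which helps if the notebook is ever run on a system without `find`. This is a minimal sketch using only the standard library; `delete_ds_store` is a hypothetical helper name.

```python
from pathlib import Path


def delete_ds_store(root="."):
    """Delete every .DS_Store file under root and return the removed paths."""
    removed = []
    # Materialize the matches first so files are not deleted mid-traversal.
    for path in list(Path(root).rglob(".DS_Store")):
        if path.is_file():
            path.unlink()
            removed.append(path)
    return removed


# Example usage: delete_ds_store(".")
```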

In [5]:
# Split with a fixed number of items per class.
# To split into only training and validation sets, pass a single number to `fixed`, e.g. `10`.
splitfolders.fixed("processed_images",
                   output="input_images",
                   seed=42,
                   fixed=100)  # plain int; `(100)` is not a tuple

Copying files: 1461 files [00:08, 164.31 files/s]


**Thu Oct 14, 12pm:**

At the data meeting with Max, he suggested reducing the holdout set to 100 images total (not 100 per class; tbh, 100 total is what I thought I was already doing). Maximize the training set since it's already so small!
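
Because a single number passed to `fixed` is applied to each class folder, a target holdout total has to be divided by the number of classes first. The sketch below shows that arithmetic under the assumption that every subfolder of the source directory is one class; `per_class_holdout` is a hypothetical helper, not part of `split-folders`.

```python
from pathlib import Path


def per_class_holdout(source_dir, total_holdout):
    """Divide a total holdout size evenly across the class folders in source_dir."""
    class_dirs = [d for d in Path(source_dir).iterdir() if d.is_dir()]
    if not class_dirs:
        raise ValueError(f"no class folders found in {source_dir}")
    # Integer division: any remainder is simply left in the training set.
    return total_holdout // len(class_dirs)


# Example usage: with two class folders, a total of 100 gives 50 per class.
# per_class_holdout("other_images/processed_images", 100)
```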

In [2]:
# Delete metadata files created by Mac OS
!find . -name ".DS_Store" -delete

# Split with a fixed number of items per class.
# To split into only training and validation sets, pass a single number to `fixed`, e.g. `10`.
splitfolders.fixed("other_images/processed_images",
                   output="input_images",
                   seed=42,
                   fixed=50)  # 50 per class

Copying files: 1461 files [00:13, 108.61 files/s]


**Tues Oct 19, 1:30pm:**

Realized that when I use `validation_split=0.1` in an `ImageDataGenerator` during training, it just takes the last 10% of images *in order* from each training class folder, with no shuffling!!! This is a seriously huge issue since my data isn't stored in a random order. In fact, it means that the Google Maps Street View screenshots, which are last alphabetically, are only ever used for validation and never for training! Ultimately, my models are being trained and validated on non-random samples of my dataset.
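
To make the problem concrete, here is a small pure-Python illustration of an ordered split versus a shuffled one. It mimics the behavior as described in this entry and is not Keras's actual implementation; the file names and both helper functions are made up for the example.

```python
import random

# Hypothetical file names: 8 photos followed by 2 Street View screenshots,
# sorted alphabetically the way they would sit in a class folder.
files = sorted([f"photo_{i:02d}.jpg" for i in range(8)]
               + [f"streetview_{i:02d}.jpg" for i in range(2)])


def ordered_tail_split(items, fraction):
    """Hold out the last `fraction` of items without shuffling."""
    n_val = int(len(items) * fraction)
    cut = len(items) - n_val
    return items[:cut], items[cut:]


def shuffled_split(items, fraction, seed=42):
    """Shuffle a copy with a fixed seed, then hold out `fraction` of it."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_val = int(len(shuffled) * fraction)
    return shuffled[n_val:], shuffled[:n_val]


train, val = ordered_tail_split(files, 0.2)
print(val)  # both Street View screenshots, so none are left for training
```

With the ordered split, the training list never contains a Street View file, while the shuffled split draws the holdout from the whole list.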

To address this issue, I will use `split-folders` to create a separate `validation` folder of images and feed it to a second `ImageDataGenerator` dedicated to validation, instead of using the `validation_split` parameter. Since `split-folders` does shuffle data during assignment, it should take care of this issue. I probably should have done this in the first place, but I thought that using `validation_split` was better.

Not only that, but apparently using `validation_split` with image augmentation means that the validation set is still being augmented, since both subsets come from the same generator! This is seriously not good since *no* holdout/test set images (or new images being used for prediction) will be augmented. Ultimately, my validation metrics are not fully valid: the validation images are a non-random sample *and* they've been augmented.

Creating a new folder of images specifically for validation will take care of both of these issues, although it still means that my original test/holdout set (which I won't touch or add to) will not include new data/images.
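
Since the holdout set has to stay separate from everything used in training, a simple leakage check is to look for file names that appear in two folder trees. This is a rough sketch that assumes file names are unique across the dataset; `shared_filenames` is a hypothetical helper.

```python
from pathlib import Path


def shared_filenames(dir_a, dir_b):
    """Return the sorted file names that appear under both directory trees."""
    names_a = {p.name for p in Path(dir_a).rglob("*") if p.is_file()}
    names_b = {p.name for p in Path(dir_b).rglob("*") if p.is_file()}
    return sorted(names_a & names_b)


# Example usage (paths are placeholders for the training and holdout folders):
# assert shared_filenames("path/to/train", "path/to/holdout") == []
```

An empty list means no file name is shared between the two trees; it does not catch duplicate images saved under different names.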

In [3]:
# Delete metadata files created by Mac OS
!find . -name ".DS_Store" -delete

# Split with a fixed number of items per class.
# To split into only training and validation sets, pass a single number to `fixed`, e.g. `10`.
splitfolders.fixed("input_images/full_combined",
                   output="input_images/validation",
                   seed=42,
                   fixed=50)  # 50 per class

Copying files: 1719 files [00:15, 107.95 files/s]
