### Class distribution

The dataset contains 3542 images, distributed as follows:

| Class | # Samples |
| --- | --- |
| **Species1** | 186 |
| **Species2** | 532 |
| **Species3** | 515 |
| **Species4** | 511 |
| **Species5** | 531 |
| **Species6** | 222 |
| **Species7** | 537 |
| **Species8** | 508 |
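
The per-class shares follow directly from the table. As a minimal sketch (the `counts` dictionary only transcribes the table above, and the variable names are illustrative), the snippet below computes the total and each class's share, which makes the under-representation of Species1 and Species6 explicit:

```python
# Sample counts transcribed from the table above
counts = {
    'Species1': 186, 'Species2': 532, 'Species3': 515, 'Species4': 511,
    'Species5': 531, 'Species6': 222, 'Species7': 537, 'Species8': 508,
}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

for label, n in counts.items():
    print(f"{label}: {n:4d} images ({shares[label]:.1%})")
print(f"Total: {total}")

# Ratio between the largest and the smallest class
imbalance = max(counts.values()) / min(counts.values())
print(f"Largest / smallest class: {imbalance:.2f}")
```

Species1 and Species6 each hold well under half as many images as any other class, which may be worth keeping in mind when evaluating the model.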


### Settings

Import the necessary modules

In [1]:
import os
import random

from shutil import copyfile

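Set the random seed for reproducibility

In [2]:
seed = 42

random.seed(seed)
# Only affects Python processes started after this point: hash randomization
# of the running interpreter is fixed at startup
os.environ['PYTHONHASHSEED'] = str(seed)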
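
To illustrate what seeding buys, here is a small sketch using a hypothetical helper `shuffled_copy` (not part of this notebook). It uses a private `random.Random` generator instead of the global state set by `random.seed`, which is an alternative to the approach above rather than a description of it. The same seed yields the same order:

```python
import random

def shuffled_copy(items, seed):
    """Return a shuffled copy of items using a private, seeded generator."""
    rng = random.Random(seed)   # independent of the global random state
    result = list(items)
    rng.shuffle(result)
    return result

# Illustrative file names only
files = [f"img_{i}.jpg" for i in range(10)]

first = shuffled_copy(files, seed=42)
second = shuffled_copy(files, seed=42)

print(first == second)                  # True: same seed, same order
print(sorted(first) == sorted(files))   # True: same elements, only reordered
```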

### Data preprocessing

Define the dataset directory and the subdirectories that will contain the training, validation and test images

In [3]:
DATA_DIR = 'C:\\Users\\markz\\Desktop\\ANN_chall_1\\training_data_final'

SPLIT_DIR = ['training', 'validation', 'testing']

Set the fraction of the dataset assigned to training, validation and testing

In [4]:
PERC = {
    'training': 0.6,      # 60%
    'validation': 0.2,    # 20%
    'testing': 0.2,       # 20%
}
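
The three fractions are expected to sum to 1, otherwise some images would be left out of every partition or the slices would not line up. A quick sanity check, sketched here with the same values as above (the check itself is an addition, not part of the original notebook):

```python
import math

# Same fractions as in the cell above
PERC = {'training': 0.6, 'validation': 0.2, 'testing': 0.2}

total = sum(PERC.values())

# Compare with a tolerance: sums of binary floats are not always exact
assert math.isclose(total, 1.0), f"fractions sum to {total}, expected 1.0"
assert all(0.0 < p < 1.0 for p in PERC.values())
```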

Set the label names

In [5]:
LABELS = ['Species1', 'Species2', 'Species3', 'Species4', 'Species5', 'Species6', 'Species7', 'Species8']

Create the folders that will contain the training, validation and testing data

In [6]:
def createFolder(path):
    # path : path of the folder to create
    try:
        os.mkdir(path)
    except FileExistsError:
        # Only an existing folder is tolerated; other errors
        # (missing parent, permissions) are raised as usual
        print("Directory " + path + " already exists, skipping creation")

# Create the three folders training, validation and testing
# ('split' is used instead of 'dir' to avoid shadowing the built-in)
for split in SPLIT_DIR:
    folder = os.path.join(DATA_DIR, split)
    createFolder(folder)

    # Create one subfolder for each class
    for species in LABELS:
        createFolder(os.path.join(folder, species))

Assign each image to one partition

In [7]:
SET = {}

SET['training'] = []
SET['validation'] = []
SET['testing'] = []

for label in LABELS:
    folder = os.path.join(DATA_DIR, label)
    # Sort first: os.walk lists files in an arbitrary, filesystem-dependent
    # order, so the seeded shuffle is reproducible only from a fixed order
    images = sorted(next(os.walk(folder))[2])

    # Count the number of samples
    samples_number = len(images)

    # Shuffle
    random.shuffle(images)

    # Split boundaries: [0, train_end) is training,
    # [train_end, val_end) is validation, the rest is testing
    train_end = int(PERC['training'] * samples_number)
    val_end = int((PERC['training'] + PERC['validation']) * samples_number)

    SET['training'].append(images[:train_end])
    SET['validation'].append(images[train_end:val_end])
    SET['testing'].append(images[val_end:])
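
The slicing arithmetic can be tried in isolation. The sketch below wraps it in a hypothetical helper `split_list` (a name introduced here for illustration) and applies it to a list as long as the smallest class, 186 images:

```python
def split_list(items, train_frac, val_frac):
    """Split items into (training, validation, testing) by cumulative fractions.

    int() truncates both boundaries, so the testing part absorbs
    whatever is left over.
    """
    n = len(items)
    train_end = int(train_frac * n)
    val_end = int((train_frac + val_frac) * n)
    return items[:train_end], items[train_end:val_end], items[val_end:]

# Worked example with the size of the smallest class (186 images)
items = list(range(186))
train, val, test = split_list(items, 0.6, 0.2)

print(len(train), len(val), len(test))   # 111 37 38
```

Because the three slices share their boundaries, every item lands in exactly one partition.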

Print the number of images in each set

In [8]:
for split, partitions in SET.items():
    print([len(images) for images in partitions])

[111, 319, 309, 306, 318, 133, 322, 304]
[37, 106, 103, 102, 106, 44, 107, 102]
[38, 107, 103, 103, 107, 45, 108, 102]
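
These counts can be cross-checked against the class totals from the distribution table. A minimal sketch, with the three lists copied from the output above and the expected totals copied from the table:

```python
# Per-class counts as printed above (order follows LABELS)
training = [111, 319, 309, 306, 318, 133, 322, 304]
validation = [37, 106, 103, 102, 106, 44, 107, 102]
testing = [38, 107, 103, 103, 107, 45, 108, 102]

# Class totals from the distribution table
expected = [186, 532, 515, 511, 531, 222, 537, 508]

# Each class must be fully accounted for across the three partitions
per_class = [tr + va + te for tr, va, te in zip(training, validation, testing)]
assert per_class == expected

split_totals = {
    'training': sum(training),
    'validation': sum(validation),
    'testing': sum(testing),
}
print(split_totals)   # {'training': 2122, 'validation': 707, 'testing': 713}
```

The effective shares are about 59.9%, 20.0% and 20.1%; the small deviation from 60/20/20 comes from truncating the slice boundaries to integers.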


Copy the images into the new folders

In [9]:
for split in SPLIT_DIR:
    folder = os.path.join(DATA_DIR, split)
    # For each target class, paired with its list of images
    for label, samples in zip(LABELS, SET[split]):

        target = os.path.join(folder, label)
        source = os.path.join(DATA_DIR, label)

        # Copy each image to the new directory
        # (os.path.join keeps the paths portable across operating systems)
        for sample in samples:
            copyfile(os.path.join(source, sample), os.path.join(target, sample))
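
After copying, it can be useful to confirm that each folder holds the expected number of files. The following sketch defines a hypothetical helper `count_files` (not part of the original notebook) and demonstrates it on a throwaway directory with illustrative names:

```python
import os
import tempfile

def count_files(root, splits, labels):
    """Return {split: [number of entries per label]} for a root/split/label layout."""
    counts = {}
    for split in splits:
        counts[split] = [
            len(os.listdir(os.path.join(root, split, label)))
            for label in labels
        ]
    return counts

# Tiny demo on a temporary directory (names below are illustrative only)
with tempfile.TemporaryDirectory() as root:
    splits, labels = ['training', 'testing'], ['A', 'B']
    for split in splits:
        for label in labels:
            os.makedirs(os.path.join(root, split, label))

    # Put two dummy files into training/A
    for name in ('x.jpg', 'y.jpg'):
        open(os.path.join(root, 'training', 'A', name), 'w').close()

    demo_counts = count_files(root, splits, labels)

print(demo_counts)   # {'training': [2, 0], 'testing': [0, 0]}
```

In this notebook the call would presumably be `count_files(DATA_DIR, SPLIT_DIR, LABELS)`, and its result should match the per-class counts printed earlier.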

The dataset is now divided into training, validation and testing sets.<br>
We are ready to start training the model.