**Group 22**

Name  | Surname | Email  
---------|-------------------|---------
Julio|Vigueras|20220661@novaims.unl.pt 
Ariel|Pérez|20220662@novaims.unl.pt
Miguelanguel|Mayuare|20220665@novaims.unl.pt
Ayotunde|Aribo|20221012@novaims.unl.pt

# Preprocessing
---

In this notebook, we review each of the datasets (train, validation and test) to see how they are composed. If necessary, images are moved from one set to another to balance the sets and make them suitable for training and evaluating models.

In [None]:
# Make the imports
import os
import shutil
import random
from collections import defaultdict
import pathlib
from math import floor

from google.colab import drive

In [None]:
# Mounts the google drive folder
drive.mount('/content/gdrive/', force_remount=True)

Mounted at /content/gdrive/


In [None]:
path = "/content/gdrive/MyDrive/project/moths-image-datasetclassification/" #change to where you have your directories

In [None]:
# If the dataset is located in the drive and it is zipped, this
# cell will unzip it and save it in the virtual machine
unzip_from_drive = False # if _True_ then it will unzip the dataset from drive
if unzip_from_drive:
  import zipfile
  drive.mount('/content/drive')
  path = "/content/drive/MyDrive/project/moths-image-datasetclassification.zip"
  shutil.copyfile(path, 'moths-image-datasetclassification.zip')
  zip_ = zipfile.ZipFile('moths-image-datasetclassification.zip')
  zip_.extractall()
  zip_.close()
  path = ""

Mounted at /content/drive


In the training set, each class has more than 100 images, whereas each class in the validation and test sets has only 5. We therefore move images from the training set to the validation and test sets so that the final split is 80% training, 10% validation and 10% test.

The following function moves images from each training class folder to the corresponding validation and test class folders, so that each class keeps about 80% of its images in the training set and about 10% in each of the validation and test sets.
To keep file naming consistent across folders, the last images of each training class folder are the ones moved, and they are renamed "6.jpg", "7.jpg", "8.jpg" and so on, following the 5 images already present in each validation and test class folder.
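
The arithmetic and renaming described above can be sketched as a dry run that touches no files. The helper names `split_counts` and `move_plan` are hypothetical and not part of the notebook; the sketch assumes, as the function below does, that training images are named with three zero-padded digits (e.g. "099.jpg") and that each validation and test class folder starts with 5 images.

```python
from math import floor


def split_counts(total_train, n_val=5, n_test=5, train_frac=0.8):
    """Return the target set sizes and the number of images to move."""
    total = total_train + n_val + n_test
    target_train = floor(total * train_frac)
    target_val = floor((total - target_train) / 2)
    target_test = total - target_train - target_val
    return {
        "targets": (target_train, target_val, target_test),
        "move_to_val": target_val - n_val,
        "move_to_test": target_test - n_test,
    }


def move_plan(total_train, n_val=5, n_test=5):
    """List (source, destination) file names without touching the disk."""
    counts = split_counts(total_train, n_val, n_test)
    plan = []
    # The last training images go to the validation set first...
    for i in range(counts["move_to_val"]):
        plan.append((f"train/{total_train - i:03d}.jpg", f"valid/{n_val + i + 1}.jpg"))
    # ...and the images just before them go to the test set.
    for i in range(counts["move_to_test"]):
        src = total_train - counts["move_to_val"] - i
        plan.append((f"train/{src:03d}.jpg", f"test/{n_test + i + 1}.jpg"))
    return plan


print(split_counts(100))
print(move_plan(100)[:2])
```

For a class with 100 training images, the total is 110, the targets are 88, 11 and 11, and 6 images are moved to each of the validation and test sets.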

In [None]:
def structure_of_data(path):
  """
  Move images from train to valid/test so that each class is split roughly 80/10/10.

  path: path where the dataset is located (must be empty or end with '/')
  """
  moth_classes = [item for item in os.listdir(path + 'train') if os.path.isdir(os.path.join(path + 'train',item))]
  for moth in moth_classes:
    total_train = len( os.listdir(path+'train/'+moth) )
    total = total_train + 5 + 5 # images in train + val + test (val and test start with 5 each)
    oitenta = floor( total*0.8 ) # target size of the train set (80%)
    dez_val = floor( (total-oitenta)/2 ) # target size of the validation set (about 10%)
    dez_test = total - oitenta - dez_val # target size of the test set (the remainder, about 10%)
    
    new_dez_val = dez_val - 5 #number of images that must be moved from train to val set
    new_dez_test = dez_test - 5 #number of images that must be moved from train to test set

    #Moving images from train set to validation set
    for i in range(new_dez_val): #goes from 0 to new_dez_val-1
      old = path + 'train/' + moth + '/' + str(total_train-i).zfill(3) + '.jpg'
      new = path + 'valid/' + moth + '/' + str(5+i+1) + '.jpg'
      shutil.move(old,new)
    
    #Moving images from train set to test set
    for i in range(new_dez_test): #goes from 0 to new_dez_test-1
      old = path + 'train/' + moth + '/' + str(total_train-new_dez_val-i).zfill(3) + '.jpg'
      new = path + 'test/' + moth + '/' + str(5+i+1) + '.jpg'
      shutil.move(old,new)
    
    # The next prints just validate that we get about 80% of the data in train, 10% in validation and 10% in test
    print("_____ Class: ", moth,"_____")
    print( "Proportion of train data: ", round(len(os.listdir(path + 'train/'+ moth))/total,3) )
    print( "Proportion of validation data: ", round(len(os.listdir(path + 'valid/'+ moth))/total,3) )
    print( "Proportion of test data: ", round(len(os.listdir(path + 'test/'+ moth))/total,3) )
    print("---------------------------- \n")

In [None]:
structure_of_data(path)

_____ Class:  LUNA MOTH _____
Proportion of train data:  0.795
Proportion of validation data:  0.102
Proportion of test data:  0.102
---------------------------- 

_____ Class:  ELEPHANT HAWK MOTH _____
Proportion of train data:  0.8
Proportion of validation data:  0.1
Proportion of test data:  0.1
---------------------------- 

_____ Class:  ARCIGERA FLOWER MOTH _____
Proportion of train data:  0.8
Proportion of validation data:  0.1
Proportion of test data:  0.1
---------------------------- 

_____ Class:  FIERY CLEARWING MOTH _____
Proportion of train data:  0.799
Proportion of validation data:  0.101
Proportion of test data:  0.101
---------------------------- 

_____ Class:  ATLAS MOTH _____
Proportion of train data:  0.799
Proportion of validation data:  0.101
Proportion of test data:  0.101
---------------------------- 

_____ Class:  RUSTY DOT PEARL MOTH _____
Proportion of train data:  0.799
Proportion of validation data:  0.098
Proportion of test data:  0.103
----------------

Now let's see if our dataset is balanced.

In [None]:
moth_classes = [item for item in os.listdir(path + 'train') if os.path.isdir(os.path.join(path + 'train',item))]
moth_classes_count = [len(os.listdir(path + "train/" + moth + "/" )) for moth in moth_classes]
total_training_images = sum(moth_classes_count)
moth_classes_proportion = [round( (x/total_training_images)*100 ,3) for x in moth_classes_count]

print("_____ Proportion of each class in training set _____")
for i in range( len(moth_classes) ):
  print("Proportion of ", moth_classes[i], ": ", moth_classes_proportion[i], "%", sep="")

print("-------------------------")
print("Range of the proportions: (", min(moth_classes_proportion), "%,", max(moth_classes_proportion), "%)", sep="")

_____ Proportion of each class in training set _____
Proportion of LUNA MOTH: 1.648%
Proportion of ELEPHANT HAWK MOTH: 2.22%
Proportion of ARCIGERA FLOWER MOTH: 1.959%
Proportion of FIERY CLEARWING MOTH: 1.942%
Proportion of ATLAS MOTH: 1.812%
Proportion of RUSTY DOT PEARL MOTH: 2.399%
Proportion of EMPEROR MOTH: 1.975%
Proportion of RED NECKED FOOTMAN MOTH: 2.285%
Proportion of COMET MOTH: 1.502%
Proportion of IO MOTH: 1.697%
Proportion of BLUE BORDERED CARPET MOTH: 1.697%
Proportion of SIXSPOT BURNET MOTH: 1.436%
Proportion of VESTAL MOTH: 1.828%
Proportion of SCHORCHED WING MOTH: 2.138%
Proportion of CINNABAR MOTH: 1.828%
Proportion of OWL MOTH: 1.697%
Proportion of BIRD CHERRY ERMINE MOTH: 1.861%
Proportion of HUMMING BIRD HAWK MOTH: 1.991%
Proportion of GARDEN TIGER MOTH: 1.828%
Proportion of WHITE LINED SPHINX MOTH: 1.991%
Proportion of BLACK RUSTIC MOTH: 1.877%
Proportion of DEATHS HEAD HAWK MOTH: 2.334%
Proportion of PLUME MOTH: 2.35%
Proportion of ROSY MAPLE MOTH: 1.91%
Propor

All classes have similar proportions of images, within the range (1.436%, 2.546%). Therefore, we can conclude that the dataset is balanced.
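
One way to put a number on this is the ratio between the largest and smallest class. The following is a minimal sketch; `balance_summary` is a hypothetical helper and the counts are toy values, not the real dataset.

```python
def balance_summary(counts):
    """Summarise class balance from a {class_name: image_count} mapping."""
    total = sum(counts.values())
    shares = {name: 100 * n / total for name, n in counts.items()}
    smallest = min(counts, key=counts.get)
    largest = max(counts, key=counts.get)
    return {
        "min_share_pct": round(shares[smallest], 3),
        "max_share_pct": round(shares[largest], 3),
        # 1.0 means perfectly balanced; larger values mean more imbalance.
        "imbalance_ratio": round(counts[largest] / counts[smallest], 3),
    }


# Hypothetical counts, for illustration only (not the real dataset).
toy_counts = {"CLASS A": 88, "CLASS B": 120, "CLASS C": 100}
print(balance_summary(toy_counts))
```

Applied to the range reported above, the largest class is roughly 1.77 times the size of the smallest (2.546 / 1.436).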

As stated in the explore notebook, all the images have the same size: 224x224 pixels.

To work within Colab's limited resources, we select a subset of the data: a random sample of 30 classes.

In [None]:
# Modify the paths if you are working locally.
# If Google Drive is used, you must run the explore notebook beforehand
# to import the dataset.
source_path = "/content/gdrive/MyDrive/project/moths-image-datasetclassification"
subset_path = "/content/gdrive/MyDrive/project/moths-subset"

num_classes = 30
random.seed(1)

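# os.listdir returns entries in arbitrary order, so sort them to make the seeded sample reproducible
class_folders = sorted(os.listdir(source_path + '/train/'))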

classes = random.sample(class_folders, num_classes)

sets = ["/train/", "/valid/", "/test/"]
for _set in sets:
    for _class in classes:
        class_path = source_path + _set + _class
        dest_path = subset_path + _set + _class
        shutil.copytree(class_path, dest_path)
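
After copying, it can be worth confirming that every selected class is present and non-empty in all three sets. The following is a minimal sketch of such a check; `check_subset` is a hypothetical helper, not part of the notebook, and it assumes the `train`, `valid` and `test` folder layout used above.

```python
import os


def check_subset(subset_path, expected_classes, sets=("train", "valid", "test")):
    """Return a list of problems found in the copied subset (empty if none)."""
    expected = set(expected_classes)
    problems = []
    for split in sets:
        split_dir = os.path.join(subset_path, split)
        found = set(os.listdir(split_dir)) if os.path.isdir(split_dir) else set()
        # Class folders that were expected but not copied.
        for name in sorted(expected - found):
            problems.append(f"{split}: missing class folder '{name}'")
        # Class folders that exist but contain no files.
        for name in sorted(expected & found):
            if not os.listdir(os.path.join(split_dir, name)):
                problems.append(f"{split}: class folder '{name}' is empty")
    return problems
```

In the notebook this could be called as `check_subset(subset_path, classes)`; an empty list means all 30 classes were copied into every set.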


---
In this notebook, the datasets were rebalanced into an approximately 80/10/10 train/validation/test split so that each set is representative. Finally, 30 classes were randomly selected for use in the models presented in the following notebooks.