# Preprocessing
As mentioned in our proposal, we intend to:
* Group species into larger categories by genus.
* Discard genera with fewer than 75 images.
* Apply data augmentation techniques, including rotations, flips, and Gaussian noise.
* Format data for ease of training, validating, and testing.

## Download
This uses the Kaggle API, but you can alternatively download the dataset directly from its Kaggle page. Make sure you have an API key at `C:/Users/<username>/.kaggle/kaggle.json` (for Windows machines). We'll be doing our file manipulations in place since the dataset is fairly large (~3.0 GB).
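Before running the download cell, it can help to confirm that the key file is where the paragraph above says it should be. The sketch below assumes the default location `~/.kaggle/kaggle.json` and ignores any environment-variable configuration; `kaggle_key_path` and `has_kaggle_key` are hypothetical helper names, not part of the Kaggle API.

```python
from pathlib import Path
from typing import Optional

def kaggle_key_path(home: Optional[Path] = None) -> Path:
    """Return the default location of the Kaggle API key under a home directory."""
    home = Path.home() if home is None else Path(home)
    return home / '.kaggle' / 'kaggle.json'

def has_kaggle_key(home: Optional[Path] = None) -> bool:
    """True if a key file exists at the default location."""
    return kaggle_key_path(home).is_file()

if not has_kaggle_key():
    print(f'No key found at {kaggle_key_path()}; download the dataset manually instead.')
```

On Windows, `Path.home()` resolves to `C:/Users/<username>`, so the path checked here matches the one given above.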

In [1]:
import nest_asyncio
nest_asyncio.apply()

In [2]:
# download via API
!kaggle datasets download -d anaclaricerezende/mind-funga -p ../data --unzip

Dataset URL: https://www.kaggle.com/datasets/anaclaricerezende/mind-funga
License(s): CC-BY-SA-3.0
Downloading mind-funga.zip to ../data





## Genus grouping

In [1]:
import os
import shutil

def move_files(src: str, dst: str) -> None:
    files = os.scandir(src)
    for file in files:
        shutil.move(file.path, os.path.join(dst, file.name))

In [31]:
# run once, don't touch again
# adjust if your unzip location differs; later cells use '../data/MIND.Funga App'
data_path = '../data/mind-funga/MIND.Funga App'

# Snapshot and sort the species directories first: genus directories are created
# inside data_path, so iterating a live os.scandir() handle would be unsafe,
# and scandir gives no ordering guarantee.
species_dirs = sorted((e for e in os.scandir(data_path) if e.is_dir()), key=lambda e: e.name)
genera = set()

for species in species_dirs:
    curr_genus = species.name.split(' ', 1)[0] # genus is the first word of the folder name
    new_path = os.path.join(data_path, curr_genus)
    if curr_genus not in genera: # first species seen for this genus
        genera.add(curr_genus)
        print(f'Preprocessing {curr_genus}')
        os.makedirs(new_path, exist_ok=True) # create genus dir if it does not exist yet
    if species.path != new_path: # skip folders that are already named after the genus
        move_files(species.path, new_path)
        shutil.rmtree(species.path) # remove old dir

num_genera = len(genera)
print('Number of genera:', num_genera)

Preprocessing Abrachium
Preprocessing Abundisporus
Preprocessing Aegis
Preprocessing Agaricus
Preprocessing Agrocybe
Preprocessing Aleurodiscus
Preprocessing Amanita
Preprocessing Amauroderma
Preprocessing Amparoina
Preprocessing Amylostereum
Preprocessing Antrodia
Preprocessing Antrodiella
Preprocessing Arambarria
Preprocessing Armilaria
Preprocessing Artolenzites
Preprocessing Artomyces
Preprocessing Ascopolyporus
Preprocessing Aseroe
Preprocessing Asteridiella
Preprocessing Asterostroma
Preprocessing Astrothelium
Preprocessing Atroporus
Preprocessing Aurantiopileus
Preprocessing Auricularia
Preprocessing Auriporia
Preprocessing Auriscalpium
Preprocessing Beauveria
Preprocessing Biscogniauxia
Preprocessing Boletinellus
Preprocessing Bondarzewia
Preprocessing Bresadolia
Preprocessing Brigantiaea
Preprocessing Brunneocorticium
Preprocessing Byssomerulius
Preprocessing Callistosporium
Preprocessing Calocera
Preprocessing Calvatia
Preprocessing Camarops
Preprocessing Camillea
Preprocessi
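The grouping cell derives the genus from the first word of each species folder name. A minimal sketch of that rule follows; `genus_of` is a hypothetical helper, and the folder names are illustrative rather than taken from the dataset listing.

```python
def genus_of(species_dir_name: str) -> str:
    """Return the first space-delimited word of a species folder name."""
    return species_dir_name.split(' ', 1)[0]

# Illustrative folder names, not taken from the dataset listing
for name in ['Amanita muscaria', 'Amanita sp', 'Agaricus']:
    print(name, '->', genus_of(name))
```

A name with no space maps to itself, so for such a folder the genus directory and the species directory are the same path.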

In [3]:
data_path = '../data/MIND.Funga App'
# count the images in each genus directory
file_counts = [(genus.path, len(os.listdir(genus.path)))
               for genus in os.scandir(data_path) if genus.is_dir()]
file_counts.sort(key=lambda x: x[1], reverse=True) # largest genera first

for genus_path, count in file_counts[10:]: # keep the 10 largest genera, delete the rest
    shutil.rmtree(genus_path)
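The cell above keeps the ten largest genera. The 75-image threshold from the proposal could be sketched as follows, as a dry run that only lists the genera that would be discarded. `genera_below_threshold` is a hypothetical helper, and nothing is deleted.

```python
import os

def genera_below_threshold(data_path: str, min_images: int = 75) -> list[str]:
    """List genus folders under data_path holding fewer than min_images files."""
    small = []
    for entry in os.scandir(data_path):
        if entry.is_dir():
            n_files = sum(1 for f in os.scandir(entry.path) if f.is_file())
            if n_files < min_images:
                small.append(entry.name)
    return sorted(small)

# Example (dry run): print(genera_below_threshold('../data/MIND.Funga App'))
```

Reviewing the returned names before calling `shutil.rmtree` on them keeps the destructive step explicit.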

## Data augmentation

In [4]:
import collections

Since the group plans to use data augmentation in model training, the data is first split into training/validation/testing folders so that different augmented versions of the same image never appear across different sets. This avoids leakage between sets, which gives better model training and an unbiased accuracy estimate on the testing set.

The split is written to a folder called `splittedImgFolder` containing three folders, `trainSet`, `validationSet`, and `testSet`. The code below creates them with `os.makedirs` if they do not already exist.
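The split function in this notebook slices each genus folder in the order returned by `os.listdir`. As an alternative, here is a minimal sketch of a seeded, shuffled split with the same slicing arithmetic. `split_names` and its `seed` parameter are assumptions for illustration, not part of the notebook.

```python
import random

def split_names(names: list[str], train_ratio: float = 0.75, val_ratio: float = 0.15,
                seed: int = 0) -> dict[str, list[str]]:
    """Shuffle file names reproducibly, then slice into train/validation/test."""
    shuffled = sorted(names)               # fixed starting order, independent of the OS listing
    random.Random(seed).shuffle(shuffled)  # reproducible shuffle
    n_train = int(len(shuffled) * train_ratio)
    n_val = int(len(shuffled) * val_ratio) + n_train  # end index of the validation slice
    return {
        'trainSet': shuffled[:n_train],
        'validationSet': shuffled[n_train:n_val],
        'testSet': shuffled[n_val:],       # the remainder
    }

parts = split_names([f'img_{i}.jpg' for i in range(389)])
print({k: len(v) for k, v in parts.items()})
# {'trainSet': 291, 'validationSet': 58, 'testSet': 40}
```

Because each file name lands in exactly one slice, augmenting after the split cannot place variants of one image in two sets.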

In [5]:
def splitFunction(trainRatio = 0.75, valRatio = 0.15, testRatio = 0.1):
    # testRatio is kept for readability only: the test set receives whatever
    # remains after the training and validation slices are taken
    src = '../data/MIND.Funga App'

    #Destination folder for the split sets
    newLocation = '../splittedImgFolder'

    #Loop through all the genus folders
    for currClass in os.listdir(src):
        folder = os.path.join(src, currClass)

        #List the images in the folder
        imgArr = os.listdir(folder)

        print(len(imgArr))

        #End indices of the training and validation slices
        numTrainSplit = int(len(imgArr) * trainRatio)
        numValSplit = int(len(imgArr) * valRatio) + numTrainSplit

        splitDictionary = {
            'trainSet': imgArr[:numTrainSplit],
            'validationSet': imgArr[numTrainSplit:numValSplit],
            'testSet': imgArr[numValSplit:],
        }

        #Loop through each split, then copy its images into the matching folder
        for genType, images in splitDictionary.items():
            #e.g. ../splittedImgFolder/trainSet/<genus>
            newFolder = os.path.join(newLocation, genType, currClass)
            os.makedirs(newFolder, exist_ok=True)
            for img in images:
                from_ = os.path.join(folder, img)
                to_ = os.path.join(newFolder, img)

                #Copy over image, preserving file metadata
                shutil.copy2(from_, to_)

In [22]:
#Make sure the folder is first created
#Chose to split training: 75%, validation: 15% and testing: 10%
splitFunction(0.75, 0.15, 0.1)

389
719
486
346
979
1122
485
466
345
252


In [23]:
# BE CAREFUL WHEN RUNNING THE AUGMENTATION MORE THAN ONCE. IF YOU WANT TO RUN IT AGAIN, USE THE CLEANUP CELL UNDER "MISC Functions" TO REMOVE THE AUGMENTED IMAGES FIRST

import numpy as np
import random
from PIL import Image
import matplotlib.pyplot as plt

def _augmented_name(base_name: str, suffix: str) -> str:
    # Insert the suffix before the extension, whatever it is ('.jpg', '.JPG', '.png', ...).
    # A plain str.replace('.jpg', ...) leaves other names unchanged, so the
    # augmented file would overwrite the original.
    stem, ext = os.path.splitext(base_name)
    return f'{stem}{suffix}{ext}'

def da_horizontal_flip(img_path: str, save_dir: str, base_name: str) -> None:
    img = Image.open(img_path)
    img_hflip = img.transpose(Image.FLIP_LEFT_RIGHT)
    img_hflip.save(os.path.join(save_dir, _augmented_name(base_name, '_hflip')))

def da_90_rotate(img_path: str, save_dir: str, base_name: str) -> None:
    img = Image.open(img_path)
    # expand=False keeps the original canvas size, so non-square images are cropped
    img_90_rotate = img.rotate(90, expand=False)
    img_90_rotate.save(os.path.join(save_dir, _augmented_name(base_name, '_90_rotate')))

def da_180_rotate(img_path: str, save_dir: str, base_name: str) -> None:
    img = Image.open(img_path)
    img_180_rotate = img.rotate(180, expand=False)
    img_180_rotate.save(os.path.join(save_dir, _augmented_name(base_name, '_180_rotate')))

def da_270_rotate(img_path: str, save_dir: str, base_name: str) -> None:
    img = Image.open(img_path)
    img_270_rotate = img.rotate(270, expand=False)
    img_270_rotate.save(os.path.join(save_dir, _augmented_name(base_name, '_270_rotate')))

def da_gaussian_noise(img_path: str, save_dir: str, base_name: str) -> None:
    img = Image.open(img_path)
    img_gauss = np.array(img)
    mean = 0
    var = 0.2 # this is doubled already, use this to adjust the noise (pixel values are on a 0-255 scale)
    sigma = var**0.5
    gauss = np.random.normal(mean, sigma, img_gauss.shape) # one noise sample per pixel value
    # clip before casting so values outside 0-255 do not wrap around in uint8
    gaussed = np.clip(img_gauss + gauss, 0, 255)
    finalgauss = Image.fromarray(np.uint8(gaussed))
    finalgauss.save(os.path.join(save_dir, _augmented_name(base_name, '_gauss')))

def da_random_erasing(img_path: str, save_dir: str, base_name: str) -> None:
    img = Image.open(img_path)
    img_re = np.array(img)
    row, col = img_re.shape[:2] # works for both colour and grayscale arrays
    s = random.uniform(0.1, 0.4) # these vary the size of the rectangle
    r = random.uniform(0.1, 0.4)
    random_row = int(row * s) # rectangle height
    random_col = int(col * r) # rectangle width
    random_x = np.random.randint(0, row - random_row) # top edge
    random_y = np.random.randint(0, col - random_col) # left edge
    img_re[random_x:random_x+random_row, random_y:random_y+random_col] = 0 # black out the rectangle
    img_re = Image.fromarray(np.uint8(img_re))
    img_re.save(os.path.join(save_dir, _augmented_name(base_name, '_random_erasing')))
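Adding Gaussian noise to 8-bit images needs some care at the ends of the intensity range, because sums below 0 or above 255 are not representable in `uint8`. The sketch below shows one way to handle this on a synthetic array by clipping before the cast. `add_gaussian_noise` is a hypothetical helper and the `sigma` value is chosen only to make the effect visible.

```python
import numpy as np

def add_gaussian_noise(pixels: np.ndarray, sigma: float, rng: np.random.Generator) -> np.ndarray:
    """Add zero-mean Gaussian noise to an 8-bit image array and return uint8."""
    noisy = pixels.astype(np.float64) + rng.normal(0.0, sigma, pixels.shape)
    # Clip to the valid 8-bit range before casting back to uint8
    return np.clip(noisy, 0, 255).astype(np.uint8)

rng = np.random.default_rng(0)
img = np.full((4, 4, 3), 250, dtype=np.uint8)  # bright synthetic image near the upper limit
out = add_gaussian_noise(img, sigma=20.0, rng=rng)
print(out.dtype, out.min(), out.max())
```

Passing the array's own shape to the noise generator means the same function works for colour and grayscale inputs.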

In [24]:
#Calls all the data augmentation functions for each set
def dataAugmentationHelper(data_path):
    subdir = os.scandir(data_path)
    for genus in subdir:
        if genus.is_dir():
            genus_name = genus.name
            genus_path = os.path.join(data_path, genus_name)
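            # snapshot the originals first so freshly written augmented files are not re-augmented
            files = [entry for entry in os.scandir(genus_path) if entry.is_file()]
            for i in range(0, len(files), 2): # augment every second image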
                img_file = files[i]
                da_horizontal_flip(img_file.path, genus_path, img_file.name)    
                da_90_rotate(img_file.path, genus_path, img_file.name)    
                da_180_rotate(img_file.path, genus_path, img_file.name)    
                da_270_rotate(img_file.path, genus_path, img_file.name)   
                da_gaussian_noise(img_file.path, genus_path, img_file.name)
                da_random_erasing(img_file.path, genus_path, img_file.name)

In [25]:
#Call the data augmentation function
dataAugmentationHelper('../splittedImgFolder/trainSet')
dataAugmentationHelper('../splittedImgFolder/testSet')
dataAugmentationHelper('../splittedImgFolder/validationSet')
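After augmentation, the no-leakage property described earlier can be spot-checked by mapping every file back to its source image and confirming that no source appears in two sets. This is a minimal sketch: `original_stem` and `find_leaks` are hypothetical helpers, and the suffix list assumes the file naming used by the augmentation functions above.

```python
import os

AUG_SUFFIXES = ('_hflip', '_90_rotate', '_180_rotate', '_270_rotate', '_gauss', '_random_erasing')

def original_stem(filename: str) -> str:
    """Strip the extension and any augmentation suffix to recover the source image name."""
    stem = os.path.splitext(filename)[0]
    for suffix in AUG_SUFFIXES:
        if stem.endswith(suffix):
            return stem[:-len(suffix)]
    return stem

def find_leaks(split_root: str,
               splits=('trainSet', 'validationSet', 'testSet')) -> set[tuple[str, str]]:
    """Return (genus, stem) pairs that occur in more than one split folder."""
    seen: dict[tuple[str, str], str] = {}  # source image -> first split it was seen in
    leaks = set()
    for split in splits:
        for genus in os.scandir(os.path.join(split_root, split)):
            if not genus.is_dir():
                continue
            for f in os.scandir(genus.path):
                key = (genus.name, original_stem(f.name))
                if seen.setdefault(key, split) != split:
                    leaks.add(key)
    return leaks

# Example: print(find_leaks('../splittedImgFolder'))  # an empty set means no leakage
```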

## Final steps
Run this at the very end.

In [26]:
import os
import shutil
from PIL import Image

try: # CUDA-specific install if running on Colab
    %load_ext cudf.pandas
except ModuleNotFoundError:
    print('CuDF not installed, defaulting to regular pandas')
import pandas as pd

CuDF not installed, defaulting to regular pandas


In [27]:
def generate_labelled_set(data_path: str, dimensions: tuple[int, int] = (300, 300), grayscale: bool = False,
                          set_type: str = 'test') -> None:
    """
    Generates a labelled dataset, with a few different output options. Internally operates on a pandas dataframe.
    :param data_path: Path to image folder this function will operate on.
    :param dimensions: Output dimensions of the image. By default, will resize to 300x300.
    :param grayscale: Option to convert RGB(A) colour image to grayscale. By default, will keep all colour channels.
    :param set_type: Suffix of the output file names (`set_<set_type>.csv` and `set_<set_type>.pkl`).
    :return: None. Writes a CSV dataset (with path and label), and a serialised binary Pickle object (with image included).
    """
    def load_image(path: str) -> Image.Image:
        # resize first, then optionally drop the colour channels
        img = Image.open(path).resize(dimensions, Image.Resampling.LANCZOS)
        return img.convert('L') if grayscale else img

    rows = []
    for subdir in os.scandir(data_path):
        if not subdir.is_dir():
            continue
        genus = subdir.name # the folder name is the label
        rows.extend({'Path': file.path, 'Genus': genus, 'Image': load_image(file.path)}
                    for file in os.scandir(subdir.path) if file.is_file())

    # build the dataframe once, rather than concatenating per genus
    df = pd.DataFrame(rows, columns=['Path', 'Genus', 'Image'])

    print('Output dataframe with size:', df.shape)
    df.to_pickle(f'../data/set_{set_type}.pkl')

    df = df.drop(columns='Image')
    df.to_csv(f'../data/set_{set_type}.csv', index=False)

In [28]:
#generate csv files for each of the data sets
generate_labelled_set(grayscale=True, data_path='../splittedImgFolder/trainSet', set_type='train') 
generate_labelled_set(grayscale=True, data_path='../splittedImgFolder/testSet', set_type='test') 
generate_labelled_set(grayscale=True, data_path='../splittedImgFolder/validationSet', set_type='validation') 

(16769, 3)
(2293, 3)
(3339, 3)
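The row counts above can be reproduced from the per-genus image counts printed by `splitFunction(0.75, 0.15, 0.1)`. Each set is sliced with `int()` truncation, then every second image gains six augmented files. The sketch below redoes that arithmetic; `split_sizes` and `augmented_size` are hypothetical helper names.

```python
import math

# Per-genus image counts printed by splitFunction(0.75, 0.15, 0.1)
class_counts = [389, 719, 486, 346, 979, 1122, 485, 466, 345, 252]

def split_sizes(n: int, train_ratio: float = 0.75, val_ratio: float = 0.15) -> tuple[int, int, int]:
    """Train/validation/test sizes for one genus, with the test set taking the remainder."""
    n_train = int(n * train_ratio)
    n_val = int(n * val_ratio)
    return n_train, n_val, n - n_train - n_val

def augmented_size(n: int, copies_per_image: int = 6, step: int = 2) -> int:
    """Files in a folder after every `step`-th image gains `copies_per_image` new files."""
    return n + copies_per_image * math.ceil(n / step)

totals = [0, 0, 0]
for n in class_counts:
    for i, size in enumerate(split_sizes(n)):
        totals[i] += augmented_size(size)

print(dict(zip(['train', 'validation', 'test'], totals)))
# {'train': 16769, 'validation': 3339, 'test': 2293}
```

The totals match the dataframe sizes printed for the train, test, and validation sets, and `sum(class_counts)` gives 5589, the number of images before splitting and augmentation.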


In [32]:
generate_labelled_set(grayscale=True, data_path='../data/MIND.Funga App', set_type='baseline')

(5589, 3)


At this point the CSV files are generated. Go to `transferLearningTensor.ipynb` for the next part.
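As a quick sanity check on a generated CSV, the label distribution can be read back with the standard library alone. The sketch assumes the `Path` and `Genus` columns written by `generate_labelled_set`; `genus_counts` is a hypothetical helper.

```python
import csv
from collections import Counter

def genus_counts(csv_path: str) -> Counter:
    """Count rows per genus in a CSV with 'Path' and 'Genus' columns."""
    with open(csv_path, newline='', encoding='utf-8') as fh:
        return Counter(row['Genus'] for row in csv.DictReader(fh))

# Example: print(genus_counts('../data/set_train.csv').most_common())
```

A heavily skewed distribution here is worth knowing about before training, since it affects how accuracy should be read.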

## MISC Functions (saved just in case)

In [29]:
# RUN THIS TO REMOVE AUGMENTED IMAGES
# data_path must point at the folder whose genus subfolders hold the augmented
# images, e.g. '../splittedImgFolder/trainSet'
aug_suffixes = ('_hflip', '_90_rotate', '_180_rotate', '_270_rotate', '_gauss', '_random_erasing')

for genus in os.scandir(data_path):
    if genus.is_dir():
        for filename in os.listdir(genus.path):
            # delete any file whose name carries one of the augmentation suffixes
            if any(suffix in filename for suffix in aug_suffixes):
                os.remove(os.path.join(genus.path, filename))