# Preprocessing
As mentioned in our proposal, we intend to:
* Group species into larger categories by genus.
* Discard genera with fewer than 75 images.
* Apply data augmentation techniques, including rotations, flips, and Gaussian noise.
* Format the data for ease of training, validation, and testing.

## Download
This uses the Kaggle API, but you can also download the dataset directly from the dataset page. Make sure you have an API key at `C:/Users/<username>/.kaggle/kaggle.json` (on Windows machines). We'll do our file manipulations in place, since the dataset is fairly large (~3.0 GB).
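
Before running the download cell, it can help to confirm that the key file is where it is expected, and afterwards that the archive is a complete zip. The sketch below is only an illustration: `kaggle_key_exists` and `archive_is_intact` are hypothetical helper names, and the `.kaggle/kaggle.json` location under the home directory is assumed from the path given above.

```python
import os
import zipfile

def kaggle_key_exists(home: str = os.path.expanduser('~')) -> bool:
    """Return True if <home>/.kaggle/kaggle.json exists (assumed key location)."""
    return os.path.isfile(os.path.join(home, '.kaggle', 'kaggle.json'))

def archive_is_intact(zip_path: str) -> bool:
    """Return True if zip_path is a readable zip whose members all pass their CRC check."""
    if not zipfile.is_zipfile(zip_path):
        return False  # missing, truncated, or not a zip at all
    with zipfile.ZipFile(zip_path) as archive:
        return archive.testzip() is None  # testzip() returns the first bad member, or None
```

For example, `archive_is_intact('../data/mind-funga.zip')` should return `False` for a partially downloaded file, which is a cheap check to run before trying to unzip.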

In [1]:
import nest_asyncio
nest_asyncio.apply()

In [5]:
# download via API
!kaggle datasets download -d anaclaricerezende/mind-funga -p ../data --unzip

Dataset URL: https://www.kaggle.com/datasets/anaclaricerezende/mind-funga
License(s): CC-BY-SA-3.0
Downloading mind-funga.zip to ../data

('Bad zip file, please report on www.github.com/kaggle/kaggle-api', BadZipFile('File is not a zip file'))




Dataset URL: https://www.kaggle.com/datasets/anaclaricerezende/mind-funga
License(s): CC-BY-SA-3.0
Downloading mind-funga.zip to ../data
... resuming from 1934622720 bytes (1260734328 bytes left) ...

('Bad zip file, please report on www.github.com/kaggle/kaggle-api', BadZipFile('Bad magic number for file header'))





## Genus grouping

In [7]:
import os
import shutil

def move_files(src: str, dst: str) -> None:
    files = os.scandir(src)
    for file in files:
        shutil.move(file.path, os.path.join(dst, file.name))

In [9]:
# run once, don't touch again
data_path = '../data/MIND.Funga App'
original_num_files = 0

# take a sorted snapshot of the listing first, since directories are created and removed in the loop
for species in sorted(os.scandir(data_path), key=lambda entry: entry.name):
    if not species.is_dir():
        continue
    original_num_files += len(os.listdir(species.path))

    curr_genus = species.name.split(' ', 1)[0]
    new_path = os.path.join(data_path, curr_genus)
    if new_path == species.path: # directory is already named after its genus
        continue
    if not os.path.isdir(new_path): # first species seen for this genus
        print(f'Preprocessing {curr_genus}')
        os.mkdir(new_path) # create new dir
    move_files(species.path, new_path)
    shutil.rmtree(species.path) # remove old dir

new_num_files = 0
for genus in os.scandir(data_path):
    if genus.is_dir():
        num_files = len(os.listdir(genus.path))
        print(f'Number of files in {genus.name}:', num_files)
        new_num_files += num_files

Preprocessing Abrachium
Preprocessing Abundisporus
Preprocessing Aegis
Preprocessing Agaricus
Preprocessing Agrocybe
Preprocessing Aleurodiscus
Preprocessing Amanita
Preprocessing Amauroderma
Preprocessing Amparoina
Preprocessing Amylostereum
Preprocessing Antrodia
Preprocessing Antrodiella
Preprocessing Arambarria
Preprocessing Armilaria
Preprocessing Artolenzites
Preprocessing Artomyces
Preprocessing Ascopolyporus
Preprocessing Aseroe
Preprocessing Asteridiella
Preprocessing Asterostroma
Preprocessing Astrothelium
Preprocessing Atroporus
Preprocessing Aurantiopileus
Preprocessing Auricularia
Preprocessing Auriporia
Preprocessing Auriscalpium
Preprocessing Beauveria
Preprocessing Biscogniauxia
Preprocessing Boletinellus
Preprocessing Bondarzewia
Preprocessing Bresadolia
Preprocessing Brigantiaea
Preprocessing Brunneocorticium
Preprocessing Byssomerulius
Preprocessing Callistosporium
Preprocessing Calocera
Preprocessing Calvatia
Preprocessing Camarops
Preprocessing Camillea
Preprocessi

In [10]:
print('Original number of files from Kaggle:', original_num_files)
print('Saved files:', new_num_files)

new_num_files = 0

Original number of files from Kaggle: 16491
Saved files: 16489
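
The count drops by two files after grouping. One possible explanation, which is an assumption and not verified against the dataset, is that two species folders of the same genus contained a file with the same name, so the later move replaced the earlier file. A sketch of a move helper that renames on a clash instead of overwriting (`move_files_keep_all` is a hypothetical name, not part of the notebook):

```python
import os
import shutil

def move_files_keep_all(src: str, dst: str) -> int:
    """Move every file from src into dst, renaming on name clashes instead of overwriting.

    Returns the number of files that had to be renamed.
    """
    renamed = 0
    for entry in list(os.scandir(src)):
        stem, ext = os.path.splitext(entry.name)
        target = os.path.join(dst, entry.name)
        counter = 1
        while os.path.exists(target):  # name already taken in dst
            target = os.path.join(dst, f'{stem}_{counter}{ext}')
            counter += 1
        if counter > 1:
            renamed += 1
        shutil.move(entry.path, target)
    return renamed
```

A non-zero return value would indicate that clashes occurred, which is one way to test the explanation above.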


In [11]:
data_path = '../data/MIND.Funga App'
# count files per genus directory, skipping any stray non-directory entries
file_counts = [(genus.path, len(os.listdir(genus.path)))
               for genus in os.scandir(data_path) if genus.is_dir()]
file_counts.sort(key=lambda x: x[1], reverse=True)

for genus_path, count in file_counts[10:]: # keep the 10 largest genera, delete the rest
    shutil.rmtree(genus_path)

In [12]:
subdirs = os.scandir(data_path)
for species in subdirs:
    if species.is_dir():
        print(f'Number of files in {species.name}:', len(os.listdir(species.path)))
        new_num_files += len(os.listdir(species.path))
        
print('Total saved files:', new_num_files)

Number of files in Auricularia: 389
Number of files in Cookeina: 719
Number of files in Entoloma: 486
Number of files in Geastrum: 346
Number of files in Hygrocybe: 979
Number of files in Marasmius: 1122
Number of files in Ophiocordyceps: 485
Number of files in Oudemansiella: 466
Number of files in Phallus: 345
Number of files in Schizophyllum: 252
Total saved files: 5589
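
The cell above keeps the ten largest genera, while the plan at the top of the notebook describes a cutoff of 75 images. For comparison, a threshold-based selection might look like the sketch below. `genera_to_discard` is a hypothetical helper, and it only reports names rather than deleting anything.

```python
def genera_to_discard(counts: dict[str, int], min_images: int = 75) -> list[str]:
    """Return the genera whose image count falls below min_images, sorted by name."""
    return sorted(genus for genus, n in counts.items() if n < min_images)

# 'ExampleGenus' is a made-up entry for illustration; the other counts come from the output above
example_counts = {'Marasmius': 1122, 'Schizophyllum': 252, 'ExampleGenus': 12}
print(genera_to_discard(example_counts))  # ['ExampleGenus']
```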


## Data splitting and augmentation

In [6]:
import collections

In [7]:
def splitFunction(trainRatio = 0.75, valRatio = 0.15, testRatio = 0.1):
    # testRatio is not used directly: the test set takes whatever is left after train and validation
    src = '../data/MIND.Funga App'

    # splits are copied into a new folder called splittedImgFolder
    newLocation = '../splittedImgFolder'

    allClasses = [d for d in os.listdir(src) if os.path.isdir(os.path.join(src, d))]
    for currClass in allClasses:
        folder = os.path.join(src, currClass)

        # only regular files count as images; sorting makes the split repeatable
        imgArr = sorted(img for img in os.listdir(folder) if os.path.isfile(os.path.join(folder, img)))

        print(len(imgArr))

        numTrain = int(len(imgArr) * trainRatio)
        numVal = int(len(imgArr) * valRatio)

        splitDictionary = {
            'trainSet': imgArr[:numTrain],
            'validationSet': imgArr[numTrain:numTrain+numVal],
            'testSet': imgArr[numTrain+numVal:],
        }

        # loop through each split and copy its images into <newLocation>/<split>/<class>
        for typeSet, images in splitDictionary.items():
            newFolder = os.path.join(newLocation, typeSet, currClass)
            os.makedirs(newFolder, exist_ok=True)
            for img in images:
                from_ = os.path.join(folder, img)
                to_ = os.path.join(newFolder, img)

                shutil.copy2(from_, to_)

In [8]:
#Make sure the folder is first created
splitFunction(0.75, 0.15, 0.1)

389
719
486
346
979
1122
485
466
345
318
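
`splitFunction` slices each genus's file list in the order the files are listed, so which images land in each split depends on file naming. If a randomized but repeatable split is preferred, one option is to shuffle with a fixed seed before slicing. The sketch below is an illustration only: `split_names` and its `seed` parameter are hypothetical and not part of the notebook.

```python
import random

def split_names(names: list[str], train_ratio: float = 0.75, val_ratio: float = 0.15,
                seed: int = 0) -> dict[str, list[str]]:
    """Shuffle names with a fixed seed, then slice into train/validation/test lists."""
    shuffled = sorted(names)               # fixed starting order, independent of the file system
    random.Random(seed).shuffle(shuffled)  # private RNG, so the global random state is untouched
    num_train = int(len(shuffled) * train_ratio)
    num_val = int(len(shuffled) * val_ratio)
    return {
        'trainSet': shuffled[:num_train],
        'validationSet': shuffled[num_train:num_train + num_val],
        'testSet': shuffled[num_train + num_val:],  # remainder
    }

sizes = {k: len(v) for k, v in split_names([f'img_{i}.jpg' for i in range(389)]).items()}
print(sizes)  # {'trainSet': 291, 'validationSet': 58, 'testSet': 40}
```

Because the train and validation sizes are rounded down, the test set receives the remainder and ends up slightly larger than 10% (40 of 389 images in this example).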


In [9]:
# BE CAREFUL WHEN RUNNING THIS MORE THAN ONCE: A SECOND RUN ALSO AUGMENTS THE ALREADY-AUGMENTED IMAGES. TO RUN IT AGAIN, FIRST CLEAR THE AUGMENTED IMAGES WITH THE CLEANUP CELL NEAR THE END OF THE NOTEBOOK

import numpy as np
import random
from PIL import Image
import matplotlib.pyplot as plt

def _augmented_path(save_dir: str, base_name: str, suffix: str) -> str:
    # insert the suffix before the extension; unlike str.replace('.jpg', ...), this can
    # never return the original name (and overwrite the source) for .JPG/.jpeg/.png files
    stem, ext = os.path.splitext(base_name)
    return os.path.join(save_dir, f'{stem}{suffix}{ext}')

def da_horizontal_flip(img_path: str, save_dir: str, base_name: str) -> None:
    img = Image.open(img_path)
    img_hflip = img.transpose(Image.FLIP_LEFT_RIGHT)
    img_hflip.save(_augmented_path(save_dir, base_name, '_hflip'))

# note: expand=False keeps the original canvas size, so non-square images are cropped by the 90/270 rotations
def da_90_rotate(img_path: str, save_dir: str, base_name: str) -> None:
    img = Image.open(img_path)
    img_90_rotate = img.rotate(90, expand=False)
    img_90_rotate.save(_augmented_path(save_dir, base_name, '_90_rotate'))

def da_180_rotate(img_path: str, save_dir: str, base_name: str) -> None:
    img = Image.open(img_path)
    img_180_rotate = img.rotate(180, expand=False)
    img_180_rotate.save(_augmented_path(save_dir, base_name, '_180_rotate'))

def da_270_rotate(img_path: str, save_dir: str, base_name: str) -> None:
    img = Image.open(img_path)
    img_270_rotate = img.rotate(270, expand=False)
    img_270_rotate.save(_augmented_path(save_dir, base_name, '_270_rotate'))

def da_gaussian_noise(img_path: str, save_dir: str, base_name: str) -> None:
    img = Image.open(img_path).convert('RGB') # guarantees a (row, col, 3) array
    img_gauss = np.array(img)
    mean = 0
    var = 0.2 # noise variance on the 0-255 pixel scale, use this to adjust the noise
    sigma = var**0.5
    gauss = np.random.normal(mean, sigma, img_gauss.shape)
    gaussed = np.clip(img_gauss + gauss, 0, 255) # clip so the uint8 cast cannot wrap around
    finalgauss = Image.fromarray(np.uint8(gaussed))
    finalgauss.save(_augmented_path(save_dir, base_name, '_gauss'))

def da_random_erasing(img_path: str, save_dir: str, base_name: str) -> None:
    img = Image.open(img_path).convert('RGB') # guarantees a (row, col, 3) array
    img_re = np.array(img)
    row, col, ch = img_re.shape
    s = random.uniform(0.1, 0.4) # these vary the size of the rectangle
    r = random.uniform(0.1, 0.4)
    random_row = int(row * s)
    random_col = int(col * r)
    random_x = np.random.randint(0, row - random_row) # top-left corner of the erased rectangle
    random_y = np.random.randint(0, col - random_col)
    img_re[random_x:random_x+random_row, random_y:random_y+random_col, :] = 0
    img_re = Image.fromarray(np.uint8(img_re))
    img_re.save(_augmented_path(save_dir, base_name, '_random_erasing'))
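
Two details of the Gaussian-noise step are easy to miss. First, with `var = 0.2` the standard deviation is about 0.45 intensity levels on the 0-255 scale, so the added noise is very subtle. Second, adding floating-point noise to 8-bit pixels can push values below 0 or above 255, and casting those straight to `uint8` can wrap them around. The sketch below shows the array-level operation with clipping; `add_gaussian_noise` and its `seed` parameter are hypothetical names used for illustration.

```python
import numpy as np

def add_gaussian_noise(pixels: np.ndarray, sigma: float, seed: int = 0) -> np.ndarray:
    """Add zero-mean Gaussian noise to a uint8 image array and return a uint8 array."""
    rng = np.random.default_rng(seed)
    noisy = pixels.astype(np.float64) + rng.normal(0.0, sigma, pixels.shape)
    return np.clip(np.rint(noisy), 0, 255).astype(np.uint8)  # clip before the cast

# a black image stays dark: negative noise is clipped to 0 instead of wrapping to bright values
black = np.zeros((4, 4, 3), dtype=np.uint8)
noisy_black = add_gaussian_noise(black, sigma=10.0)
```

A larger `sigma` (for example 10 to 25) would give visibly noisy images, if that is the intent of the augmentation.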

In [10]:
def dataAugmentationHelper(data_path):
    subdir = os.scandir(data_path)
    for genus in subdir:
        if genus.is_dir():
            genus_name = genus.name
            genus_path = os.path.join(data_path, genus_name)
            files = [entry for entry in os.scandir(genus_path) if entry.is_file()]
            for i in range(0, len(files), 2):
                img_file = files[i]
                da_horizontal_flip(img_file.path, genus_path, img_file.name)    
                da_90_rotate(img_file.path, genus_path, img_file.name)    
                da_180_rotate(img_file.path, genus_path, img_file.name)    
                da_270_rotate(img_file.path, genus_path, img_file.name)   
                da_gaussian_noise(img_file.path, genus_path, img_file.name)
                da_random_erasing(img_file.path, genus_path, img_file.name)

In [11]:
#Call the data augmentation function
dataAugmentationHelper('../splittedImgFolder/trainSet')
dataAugmentationHelper('../splittedImgFolder/testSet')
dataAugmentationHelper('../splittedImgFolder/validationSet')

## Final steps
Run this at the very end.

In [12]:
import os
import shutil
from PIL import Image

try: # CUDA-specific install if running on Colab
    %load_ext cudf.pandas
except ModuleNotFoundError:
    print('CuDF not installed, defaulting to regular pandas')
import pandas as pd

CuDF not installed, defaulting to regular pandas


In [13]:
def generate_labelled_set(dimensions: tuple[int, int] = (300, 300), grayscale: bool = False, 
                          data_path: str = '../splittedImgFolder/trainSet', setType: str = 'train') -> None:
    rows = []
    for subdir in os.scandir(data_path):
        if not subdir.is_dir():
            continue
        genus = subdir.name # the directory name is the label
        for file in os.scandir(subdir.path):
            with Image.open(file.path) as img: # close the file handle once the image is resized
                resized = img.resize(dimensions, Image.Resampling.LANCZOS)
            if grayscale:
                resized = resized.convert('L')
            rows.append({'Path': file.path, 'Genus': genus, 'Image': resized})

    # build the frame in one go instead of concatenating per genus; this also gives a unique index
    df = pd.DataFrame(rows, columns=['Path', 'Genus', 'Image'])
    print(df.shape)

    # the pickle keeps the resized images; the CSV keeps only the path and label columns
    df.to_pickle('../data/set_' + setType + '.pkl')
    df.drop(columns='Image').to_csv('../data/set_' + setType + '.csv', index=False)

In [14]:
generate_labelled_set(grayscale=True, data_path='../splittedImgFolder/trainSet', setType = 'train') 
generate_labelled_set(grayscale=True, data_path='../splittedImgFolder/testSet', setType = 'test') 
generate_labelled_set(grayscale=True, data_path='../splittedImgFolder/validationSet', setType = 'validation') 

(20672, 3)
(2841, 3)
(4121, 3)


In [14]:
generate_labelled_set(grayscale=True, data_path='../data/MIND.Funga App', setType='baseline')

(3076, 3)


At this point the pickle and CSV files have been generated. Go to `transferLearningTensor.ipynb` for the next part.
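
The CSV files hold the `Path` and `Genus` columns (the `Image` column is dropped before writing). As a quick sanity check on class balance before training, the rows per genus can be counted with the standard library alone. This is a sketch; `images_per_genus` is a hypothetical helper name.

```python
import csv
from collections import Counter

def images_per_genus(csv_path: str) -> Counter:
    """Count rows per value of the 'Genus' column in a set_*.csv file."""
    with open(csv_path, newline='', encoding='utf-8') as f:
        return Counter(row['Genus'] for row in csv.DictReader(f))
```

For example, `images_per_genus('../data/set_train.csv').most_common()` would list the genera from most to least represented in the training set.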

## Misc functions (saved just in case)

In [16]:
# RUN THIS TO REMOVE AUGMENTED IMAGES
# data_path must point at the folder that was augmented, e.g. '../splittedImgFolder/trainSet'
augmented_tags = ('_hflip', '_90_rotate', '_180_rotate', '_270_rotate', '_gauss', '_random_erasing')
for genus in os.scandir(data_path):
    if genus.is_dir():
        for filename in os.listdir(genus.path):
            # any() removes each file once, even if its name carries several tags
            if any(tag in filename for tag in augmented_tags):
                os.remove(os.path.join(genus.path, filename))

## Data analysis

Section to show statistics and examples.

In [None]:
#Use data -> MIND.Funga APP file here (list of original images)
