# Dataset Preprocessing

This notebook preprocesses the rice leaf disease dataset published on Kaggle by **Tedi Setiady** (https://www.kaggle.com/tedisetiady/leaf-rice-disease-indonesia).

The notebook performs the following steps:

1.   Download the dataset from Kaggle using the Kaggle CLI
2.   Extract the downloaded dataset
3.   Resize the extracted images to fit the **MobileNetV2** input shape
4.   Split the resized dataset into three subsets (**training**, **validation**, and **testing**)
5.   Compress the subsets into a single zip archive for convenient storage
6.   Copy the zipped dataset to Google Drive


---





> **First, upload `kaggle.json` to this Colab runtime before executing the cell below.**
>
> **Once `kaggle.json` is uploaded, run the cell below to install and import the required libraries.**

In [None]:
!pip install -q kaggle
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json
!pip install -q split-folders

import splitfolders as sf
import zipfile
import os
from PIL import Image
from google.colab import drive
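
The upload step above can fail quietly if `kaggle.json` is missing, malformed, or readable by other users. The following is a minimal sketch of a sanity check, assuming the usual `kaggle.json` layout with `username` and `key` fields; `check_kaggle_credentials` is a hypothetical helper, not part of the Kaggle package.

```python
import json
import os
import stat


def check_kaggle_credentials(path):
    """Return a list of problems found with a kaggle.json file (empty if none)."""
    if not os.path.isfile(path):
        return ["file not found: {}".format(path)]
    with open(path) as fh:
        try:
            creds = json.load(fh)
        except ValueError:
            return ["file is not valid JSON"]
    if not isinstance(creds, dict):
        return ["file does not contain a JSON object"]

    problems = []
    # Assumption: the token file holds a "username" and a "key" entry.
    for field in ("username", "key"):
        if not creds.get(field):
            problems.append("missing field: {}".format(field))
    # Any group/other permission bit means the key is not private (expects 600).
    mode = stat.S_IMODE(os.stat(path).st_mode)
    if mode & 0o077:
        problems.append("permissions {:o} allow access by other users".format(mode))
    return problems
```

Calling `check_kaggle_credentials(os.path.expanduser('~/.kaggle/kaggle.json'))` after the setup cell should return an empty list if the file was copied and its permissions were set correctly.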

> **Mount Google Drive at `/drive`**

In [None]:
drive.mount('/drive', force_remount=True)

> **Download the dataset from Kaggle using the Kaggle CLI, then extract it**

In [None]:
!kaggle datasets download -d tedisetiady/leaf-rice-disease-indonesia
zip_loc = 'leaf-rice-disease-indonesia.zip'
# Extract the archive into /content/tedi; the context manager closes the file.
with zipfile.ZipFile(zip_loc, 'r') as zip_ref:
    zip_ref.extractall('/content/tedi')

!rm leaf-rice-disease-indonesia.zip

> **Create the output folders for the resized images, and capitalize the extracted class folder names so they match**

In [None]:
!mkdir -p 'Resized/Blast'
!mkdir -p 'Resized/Blight'
!mkdir -p 'Resized/Tungro'
!mv /content/tedi/blast '/content/tedi/Blast'
!mv /content/tedi/blight '/content/tedi/Blight'
!mv /content/tedi/tungro '/content/tedi/Tungro'

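> **Resize the images for MobileNetV2. Square (1:1) images are resized to exactly 224 by 224, the MobileNetV2 input shape. Non-square images are shrunk with `thumbnail`, which preserves the aspect ratio and fits the image inside a 224 by 400 box, so the width comes out as 224 unless the 400 height limit is reached first or the image is already smaller than the box.**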

In [None]:
IMAGE_DIR = '/content/tedi/'
RESIZED_DIR = '/content/Resized/'
for root, dirs, files in os.walk(IMAGE_DIR):
    for f in files:
        src_path = os.path.join(root, f)
        # Mirror the class folder name (e.g. Blast) under RESIZED_DIR.
        resized_filename = os.path.join(RESIZED_DIR, os.path.split(root)[-1], f)
        try:
            # convert() returns a new image, so the result must be reassigned.
            im = Image.open(src_path).convert('RGB')
            width, height = im.size
            if width == height:
                im = im.resize((224, 224), Image.LANCZOS)
            else:
                # thumbnail() shrinks in place and preserves the aspect ratio.
                im.thumbnail((224, 400), Image.LANCZOS)
            im.save(resized_filename)
        except Exception as e:
            print("Error resizing {}: {}".format(src_path, e))
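
To make the resizing rule concrete, here is a small sketch that predicts the output size for a given input size without opening any image. `expected_size` is a hypothetical helper written for illustration; it approximates how `thumbnail` fits an image into a box, and Pillow's own rounding may differ by a pixel in edge cases.

```python
def expected_size(width, height, box=(224, 400)):
    """Approximate the output size produced by the resize rule in this notebook."""
    if width == height:
        # Square images are resized (up or down) to exactly 224 by 224.
        return (224, 224)
    max_w, max_h = box
    if width <= max_w and height <= max_h:
        # An image that already fits inside the box is left at its original size.
        return (width, height)
    # Shrink by whichever limit is hit first, keeping the aspect ratio.
    scale = min(max_w / width, max_h / height)
    return (max(1, round(width * scale)), max(1, round(height * scale)))


for size in [(448, 448), (448, 800), (1000, 500), (200, 1000), (100, 150)]:
    print(size, "->", expected_size(*size))
```

The `(200, 1000)` case shows a very tall image ending up narrower than 224 because the 400 height limit applies first, so a model that expects a fixed 224 by 224 input would still need a final crop or resize at load time.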

> **Split the resized dataset into three subsets (training, validation, and test).**

In [None]:
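DATASET_DIR = '/content/Resized'
# fixed=(15, 10) puts 15 images per class in val and 10 in test; the rest go to train.
# oversample=True duplicates training images so smaller classes match the largest one.
sf.fixed(DATASET_DIR, output='/content/Splitted/', seed=22, fixed=(15, 10), oversample=True)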
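
The effect of a fixed split with oversampling can be illustrated on plain lists of file names. This is a simplified sketch, not the split-folders implementation: `fixed_split` and the class sizes below are made up for illustration, and the shuffle order will not match what the library produces for the same seed.

```python
import random


def fixed_split(files_by_class, fixed=(15, 10), seed=22, oversample=True):
    """Split each class into train/val/test with fixed val and test sizes."""
    rng = random.Random(seed)
    n_val, n_test = fixed
    splits = {}
    for label, files in files_by_class.items():
        shuffled = sorted(files)
        rng.shuffle(shuffled)
        splits[label] = {
            "val": shuffled[:n_val],
            "test": shuffled[n_val:n_val + n_test],
            "train": shuffled[n_val + n_test:],
        }
    if oversample:
        # Duplicate training files at random until every class matches the largest.
        target = max(len(s["train"]) for s in splits.values())
        for s in splits.values():
            if s["train"]:
                extra = target - len(s["train"])
                s["train"] += rng.choices(s["train"], k=extra)
    return splits


# Hypothetical class sizes, chosen only to show the imbalance being evened out.
example = {
    "Blast": ["blast_{}.jpg".format(i) for i in range(80)],
    "Blight": ["blight_{}.jpg".format(i) for i in range(60)],
    "Tungro": ["tungro_{}.jpg".format(i) for i in range(40)],
}
result = fixed_split(example)
for label, parts in result.items():
    print(label, {name: len(items) for name, items in parts.items()})
```

Only the training subset is oversampled in this sketch, so the validation and test subsets contain no duplicates and share no files with training.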

> **Compress the split dataset into a single archive, then copy it to Google Drive**

In [None]:
path = '/content/Splitted/'
with zipfile.ZipFile('dataset.zip', 'w', zipfile.ZIP_DEFLATED) as zipObj:
    for root, dirs, files in os.walk(path):
        for f in files:
            full_path = os.path.join(root, f)
            # Store paths relative to /content so the archive unpacks into Splitted/.
            zipObj.write(full_path, os.path.relpath(full_path, os.path.join(path, '..')))

!cp dataset.zip '/drive/My Drive/Datasets/dataset-tedi.zip'
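
Before relying on the copy in Google Drive, it can help to confirm what the archive actually contains. The sketch below repeats the same relative-path logic in a function and returns the stored names; `zip_tree` is a hypothetical helper name, and the paths in the usage note are the ones used earlier in this notebook.

```python
import os
import zipfile


def zip_tree(src_dir, zip_path):
    """Zip src_dir so that entries start with its folder name; return the entry names."""
    src_dir = os.path.abspath(src_dir)
    base = os.path.dirname(src_dir)  # parent folder, so names begin with e.g. Splitted/
    with zipfile.ZipFile(zip_path, 'w', zipfile.ZIP_DEFLATED) as archive:
        for root, dirs, files in os.walk(src_dir):
            for name in files:
                full_path = os.path.join(root, name)
                archive.write(full_path, os.path.relpath(full_path, base))
    # Reopen the archive and list what was stored.
    with zipfile.ZipFile(zip_path) as archive:
        return sorted(archive.namelist())
```

For example, `zip_tree('/content/Splitted/', 'dataset.zip')` should return names of the form `Splitted/<subset>/<class>/<file>`, which is the layout recreated when the archive is later extracted.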