# EE382V - Hardware Architecture for Machine Learning
## NVIDIA Tensor Cores for Accelerating Machine Learning Workload

## Notebook 0 - Dataset Preparation

In this notebook, we will download and prepare the dataset used to train our neural network model. Given an image, the model should be able to tell whether it shows a cat or a dog. To train it, we need a large number of correctly labeled dog and cat images. Luckily, we do not need to build the dataset ourselves; we can download one from the internet.

We will use a dataset from Kaggle that contains 12,500 images of cats and 12,500 images of dogs [5]. You may think that 25,000 images are already a lot. In fact, more complex models may need millions of images to learn more complex tasks. This is where big data becomes useful for machine learning.

### Import Library
We import the libraries used in this notebook.

In [None]:
from google_drive_downloader import GoogleDriveDownloader as gdd
import os
import shutil
import re
import split_folders
import tqdm
import matplotlib.pyplot as plt
from PIL import Image

### Global Variable
Here, we define the global variables that hold the directory paths for the dataset, logs, and checkpoints.

In [None]:
data_dir       = '.'
raw_dir        = f'{data_dir}/raw'
raw_dogs_dir   = f'{raw_dir}/dogs'
raw_cats_dir   = f'{raw_dir}/cats'
train_dir      = f'{data_dir}/train'
train_dogs_dir = f'{train_dir}/dogs'
train_cats_dir = f'{train_dir}/cats'
val_dir        = f'{data_dir}/val'
val_dogs_dir   = f'{val_dir}/dogs'
val_cats_dir   = f'{val_dir}/cats'
log_dir        = f'{data_dir}/log'
chk_dir        = f'{data_dir}/checkpoint'
test_dir       = f'{data_dir}/test'

### Download Dataset
Let's download the training+validation dataset and the test dataset. The training+validation dataset is 543 MB and the test dataset is 271 MB, so downloading and extracting them may take a while.

In [None]:
# Download and unzip the training+validation dataset
gdd.download_file_from_google_drive(file_id='1TgS3BLPIoc3FHUBrvp6rXaz6g1UJz_2E',
                                    dest_path='./raw.zip',
                                    showsize=False,
                                    overwrite=True,
                                    unzip=True)

In [None]:
# Download and unzip the test dataset
gdd.download_file_from_google_drive(file_id='1JRMQY-gXp43ag65nP7HMNFEKhkJTxykw',
                                    dest_path='./test.zip',
                                    showsize=False,
                                    overwrite=True,
                                    unzip=True)
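
Because the downloads are large, it can be worth checking that an archive arrived intact before moving on. The following is a minimal sketch, assuming the archives (`./raw.zip` and `./test.zip`) are still on disk after extraction; `describe_archive` is a hypothetical helper, not part of `google_drive_downloader`.

```python
import os
import zipfile

def describe_archive(zip_path):
    """Return (size in MB, number of entries) for a zip archive.

    Hypothetical helper for illustration only.
    """
    size_mb = os.path.getsize(zip_path) / 1e6
    with zipfile.ZipFile(zip_path) as zf:
        # testzip() reads every member and returns the name of the
        # first corrupt one, or None if all checksums match.
        bad_entry = zf.testzip()
        if bad_entry is not None:
            raise ValueError(f"Corrupt entry in {zip_path}: {bad_entry}")
        n_entries = len(zf.namelist())
    return size_mb, n_entries
```

For example, `describe_archive('./raw.zip')` should report a size close to the 543 MB stated above. Note that the entry count may include directory entries, so it does not have to equal the number of images exactly.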

### Workspace Preparation
We create the directories needed to process our dataset.

In [None]:
os.makedirs(raw_cats_dir   ,exist_ok=True)
os.makedirs(raw_dogs_dir   ,exist_ok=True)
os.makedirs(train_dir      ,exist_ok=True)
os.makedirs(train_cats_dir ,exist_ok=True)
os.makedirs(train_dogs_dir ,exist_ok=True)
os.makedirs(val_dir        ,exist_ok=True)
os.makedirs(val_cats_dir   ,exist_ok=True)
os.makedirs(val_dogs_dir   ,exist_ok=True)
os.makedirs(log_dir        ,exist_ok=True)
os.makedirs(chk_dir        ,exist_ok=True)

### Dataset Grouping
Since we work with two classes of data, dogs and cats, it is good practice to keep all images of the same class in the same folder. Therefore, we will move all dog images into the dogs folder and all cat images into the cats folder.

In [None]:
files = os.listdir(raw_dir)
for f in files:
    src = f'{raw_dir}/{f}'
    # Skip the cats/ and dogs/ folders created above; only image files are moved
    if not os.path.isfile(src):
        continue
    if re.search("cat", f):
        shutil.move(src, raw_cats_dir)
    elif re.search("dog", f):
        shutil.move(src, raw_dogs_dir)
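
After grouping, a quick count per class confirms that no image was missed. This is a minimal sketch; `count_files` is a hypothetical helper introduced here for illustration.

```python
import os

def count_files(folder):
    """Count regular files directly inside folder (subfolders are ignored)."""
    return sum(1 for entry in os.scandir(folder) if entry.is_file())
```

With the variables defined earlier, `count_files(raw_cats_dir)` and `count_files(raw_dogs_dir)` should each return 12,500 if the dataset matches the description above.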

### Dataset Splitting
We need to split our dataset into a training dataset and a validation dataset. The training dataset is used to train the model and update its parameters, while the validation dataset is used to evaluate the model without updating its parameters. In this way, we can see how the model performs on data it has never seen before. Validation also helps us detect overfitting, that is, a model that performs well only on the data it was trained on. Our training target is to get the best accuracy on the validation dataset.

The training+validation dataset contains 25,000 images: 12,500 cat images and 12,500 dog images. A commonly used ratio is 80% for training and 20% for validation, but you are free to change it. You can also change the random seed to obtain a different random split. Splitting the dataset will take a while.

In [None]:
# Splitting dataset for training and validation

###################### Change as needed ######################
percentage_for_training   = 0.8
percentage_for_validation = 0.2
random_seed               = 12345
##############################################################

split_folders.ratio(f'{raw_dir}', output="./", seed=random_seed, ratio=(percentage_for_training, percentage_for_validation))
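
To make the effect of the ratio and the seed concrete, the sketch below performs a seeded shuffle-and-cut on a list of file names. It only illustrates the idea of a ratio split and is not the implementation used by `split_folders`; `split_names` is a hypothetical helper.

```python
import random

def split_names(names, train_ratio, seed):
    """Shuffle names reproducibly and cut them into (train, val) lists.

    Illustration only; not how split_folders is implemented.
    """
    shuffled = sorted(names)               # fixed starting order
    random.Random(seed).shuffle(shuffled)  # same seed gives the same order
    n_train = int(len(shuffled) * train_ratio)
    return shuffled[:n_train], shuffled[n_train:]
```

With a ratio of 0.8, the 12,500 images of one class would be cut into 10,000 training images and 2,500 validation images. Using the same seed reproduces the same split, while a different seed assigns the images differently.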

### Dataset Preview
Let's check whether we have correct data in each dataset class.

In [None]:
# Display the sample images in Cats Folder

###################### Change as needed ######################
num_of_cats_images = 5
##############################################################

cats_data_files = os.listdir(train_cats_dir)[:num_of_cats_images]
# squeeze=False keeps ax a 2-D array, so indexing also works for a single image
fig, ax = plt.subplots(num_of_cats_images, squeeze=False, figsize=(num_of_cats_images*5, num_of_cats_images*5))
fig.tight_layout(pad=5)

for row, fname in enumerate(cats_data_files):
    im = Image.open(f'{train_cats_dir}/{fname}')
    ax[row, 0].imshow(im)
    ax[row, 0].axis('on')
    ax[row, 0].set_title(fname)

In [None]:
# Display the sample images in Dogs Folder

###################### Change as needed ######################
num_of_dogs_images = 5
##############################################################

dogs_data_files = os.listdir(train_dogs_dir)[:num_of_dogs_images]
# squeeze=False keeps ax a 2-D array, so indexing also works for a single image
fig, ax = plt.subplots(num_of_dogs_images, squeeze=False, figsize=(num_of_dogs_images*5, num_of_dogs_images*5))
fig.tight_layout(pad=5)

for row, fname in enumerate(dogs_data_files):
    im = Image.open(f'{train_dogs_dir}/{fname}')
    ax[row, 0].imshow(im)
    ax[row, 0].axis('on')
    ax[row, 0].set_title(fname)
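
The images in this dataset do not share a common size. The following minimal sketch shows one way to check this numerically instead of by eye; `collect_sizes` and `summarize_sizes` are hypothetical helpers introduced here for illustration.

```python
import os

def collect_sizes(folder, limit=None):
    """Return a list of (width, height) tuples for the images in folder."""
    # Imported here so that summarize_sizes can be used without Pillow.
    from PIL import Image
    sizes = []
    for index, fname in enumerate(sorted(os.listdir(folder))):
        if limit is not None and index >= limit:
            break
        with Image.open(os.path.join(folder, fname)) as im:
            sizes.append(im.size)  # Pillow reports size as (width, height)
    return sizes

def summarize_sizes(sizes):
    """Summarize a non-empty list of (width, height) tuples."""
    widths = [w for w, _ in sizes]
    heights = [h for _, h in sizes]
    return {
        "distinct": len(set(sizes)),
        "min": (min(widths), min(heights)),
        "max": (max(widths), max(heights)),
    }
```

For example, `summarize_sizes(collect_sizes(train_cats_dir, limit=100))` reports how many distinct sizes occur among the first 100 cat images, together with the smallest and largest width and height.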

### End
This is the end of Notebook 0. Please note that the images in our dataset have different sizes, so we must resize them to fit our neural network model. Please move on to Notebook 1, where we will train our neural network model.


Version 1.0  - January 5th, 2020 - ©2020 hanindhito@bagus.my.id