# Homework of Ch4. Image Classification by Convolutional Neural Network
----
This notebook is the homework for TU-ETP-AD1062 Machine Learning Fundamentals.

For more information, please refer to:
https://sites.google.com/view/tu-ad1062-mlfundamentals/

> You do NOT have to build everything from scratch. Please do your best on the following parts:
> - **Your task: HW4.2.1.**
> - **Your task: HW4.3.1.**
> - **Your task: HW4.3.2.**
> - **Your task: HW4.3.3.**
> - **Your task: HW4.4.**

## HW4.1. Import Packages
----
- Data pre-processing:
    - `os`: Joins file paths
    - `sklearn.preprocessing.LabelEncoder`: Converts string-based labels into numeric labels
    - `PIL.Image`: Reads and resizes image files
    - `pandas`: Reads and writes CSV files
- Model construction:
    - `models.*`, `layers.*`, and `optimizers.*`: Keras components used to construct the convolutional neural network
    - `utils.to_categorical`: Converts numeric labels into categorical (one-hot) labels
- Performance evaluation:
    - `sklearn.metrics.zero_one_loss`: Computes the error rate, i.e., the fraction of misclassified samples
    - `sklearn.model_selection.train_test_split`: Splits your data once into a training set and a validation set, so that you can fit the classifier on one part and check the score and confusion matrix on the other
    - `mlfund.plot.PlotMetric`: Plots the confusion matrix (provided by this repository)

In [None]:
!pip install pillow
!pip install pandas

import os
import numpy as np
import pandas as pd

from PIL import Image

from keras.models import Sequential, Model
from keras.layers import Embedding, Conv1D, Conv2D, MaxPooling2D, GlobalMaxPooling1D, GlobalAveragePooling2D, Flatten, Dense, Dropout, Activation
from keras.optimizers import Adadelta
from keras.utils import to_categorical

from matplotlib import pyplot as plt
from mlfund.plot import PlotMetric

from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import zero_one_loss
from sklearn.model_selection import train_test_split
%matplotlib inline

## HW4.2. Data pre-processing
----
The code snippets below read the image files. The label of each image is the name of its parent directory, i.e.,
- `buildings`
- `forest`
- `glacier` 
- `mountain`
- `sea`
- `street`

### HW4.2.1. Read Image Files
----
Prepare the image files as `X_train`, a `numpy.ndarray` instance, and the string-based labels as `y_train_str`.

> **Your task: HW4.2.1.**
> Try adjusting `target_size` based on the training results below. You may need several attempts to get a better result. Remember that:
> - A higher `target_size` feeds more detail into the convolution layers, so you may need corresponding pooling layers to reduce the effect of spatial variation. It also requires more computation time.
> - A lower `target_size` preserves less detail. It requires less computation time, but information that would help training and classification may be lost.

In [None]:
target_size = (32, 32)
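
One detail worth keeping in mind when trying a non-square `target_size`: `PIL.Image.resize` takes its size as `(width, height)`, whereas the array obtained from `np.array(image)` has shape `(height, width, channels)`. The sketch below uses a synthetic in-memory image (an assumption made here so that no files from `data/hw4` are needed) to show the two orderings. With a square size such as `(32, 32)` the difference does not matter, but the loader below allocates its array as `(target_size[0], target_size[1], 3)`, so a non-square size would likely need the two entries swapped in one of the two places.

```python
import numpy as np
from PIL import Image

# Synthetic 150x100 (width x height) RGB image; stands in for a training photo.
image = Image.new('RGB', (150, 100), color=(0, 128, 255))

# PIL expects the target size as (width, height).
resized = image.resize((48, 32))
pixels = np.array(resized)

print(resized.size)   # PIL reports (width, height)
print(pixels.shape)   # NumPy reports (height, width, channels)
```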

The following code snippet reads the image files for you. You don't have to modify the block shown below.

In [None]:
labels = ['buildings', 'forest', 'glacier', 'mountain', 'sea', 'street']

# Training set
def prepare_training_set():
    X_train = np.empty((0, target_size[0], target_size[1], 3), dtype='uint8')
    y_train_str = []
    sha1_train = []

    for label in labels:
        dir_path_current = os.path.join('data', 'hw4', 'train', label)
        print('Processing %s ...' % dir_path_current)

        filelist_img = os.listdir(dir_path_current)

        # Images
        X_label_set = np.array([np.array(Image.open(os.path.join(dir_path_current, filename_img)).resize( target_size )) for filename_img in filelist_img])
        X_train = np.append(X_train, X_label_set, axis=0)

        # Labels
        y_train_str = y_train_str + ([label] * len(X_label_set))

        # SHA1
        sha1_train = sha1_train + [filename_img.split('.')[0] for filename_img in filelist_img ]

    # Sort by SHA1 file name: a deterministic order that mixes the classes
    idx_sorted = np.argsort(sha1_train)

    X_train = X_train[idx_sorted, :]
    y_train_str = np.array(y_train_str)[idx_sorted]
    sha1_train = np.array(sha1_train)[idx_sorted]
    
    return X_train, y_train_str, sha1_train


# Testing set
def prepare_testing_set():
    X_test = np.empty((0, target_size[0], target_size[1], 3), dtype='uint8')
    sha1_test = []

    dir_path_current = os.path.join('data', 'hw4', 'test')
    print('Processing %s ...' % dir_path_current)

    filelist_img = os.listdir(dir_path_current)

    # Images
    X_test = np.array([np.array(Image.open(os.path.join(dir_path_current, filename_img)).resize( target_size )) for filename_img in filelist_img])

    # SHA1
    sha1_test = [filename_img.split('.')[0] for filename_img in filelist_img ]
    
    return X_test, sha1_test


X_train, y_train_str, sha1_train = prepare_training_set()
X_test, sha1_test = prepare_testing_set()
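
The loader reads one class directory at a time, so the samples start out grouped by class. Sorting them by their SHA1-like file names then puts them in a fixed, reproducible order that is unrelated to the class. A minimal sketch of that idea follows; the file names are made up here by hashing placeholder strings with `hashlib`, whereas the real names come from the data set.

```python
import hashlib
import numpy as np

# Hypothetical samples, grouped by class as the loader produces them.
labels_grouped = ['forest'] * 3 + ['sea'] * 3
names = [hashlib.sha1(f'{label}-{i}'.encode()).hexdigest()
         for i, label in enumerate(labels_grouped)]

# Sorting by hash gives a deterministic order that ignores the class label.
order = np.argsort(names)
names_sorted = np.array(names)[order]
labels_sorted = np.array(labels_grouped)[order]

# Each label still travels with its own file name.
for name, label in zip(names_sorted, labels_sorted):
    print(name[:8], label)
```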

### HW4.2.2. Convert String Labels to Numeric Labels
----
Use `LabelEncoder` to convert the string-based labels into the integers `0`, `1`, `2`, ..., `5`.

In [None]:
label_encoder = LabelEncoder()
label_encoder.fit(y_train_str)

y_train = label_encoder.transform(y_train_str)

display( [ (idx, label) for idx, label in enumerate(label_encoder.classes_) ] )
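
`LabelEncoder` assigns the integer codes according to the sorted order of the class names. The NumPy sketch below reproduces that mapping on a small made-up label list. It illustrates the result only and is not a claim about how scikit-learn implements the encoder internally.

```python
import numpy as np

y_str = np.array(['sea', 'forest', 'buildings', 'sea'])  # made-up labels

# classes: sorted unique names; codes: index of each label within `classes`.
classes, codes = np.unique(y_str, return_inverse=True)

print(classes.tolist())  # ['buildings', 'forest', 'sea']
print(codes.tolist())    # [2, 1, 0, 2]

# Decoding is a plain lookup, which is the role inverse_transform plays later.
decoded = classes[codes]
```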

### HW4.2.3. Show the First 10 Images of Each Category
----
This gives an overview of the 6-class data set after the images are resized.

In [None]:
y_train = np.array(y_train)

for label in np.unique(y_train):
    plt.figure(figsize=(16, 2))
    plt.suptitle(label_encoder.classes_[label])
    
    X_label_set = X_train[y_train == label]
    for i in range(0, 10):
        plt.subplot(1, 10, i+1)
        plt.imshow(X_label_set[i,:])


## HW4.3. Construct your Classifier
----
> **Your task: HW4.3.1**  
> Build your own neural network model with the Keras framework, and try to maximize the performance by adjusting the model structure.
> The documents listed below might be useful:
> - Convolutional layer: https://keras.io/layers/convolutional/
> - Pooling layer: https://keras.io/layers/pooling/
> - Dropout layer: https://keras.io/layers/core/#dropout
> - Dense (Fully-connected) layer: https://keras.io/layers/core/#dense

In [None]:
def create_convNet(num_classes):
    model = Sequential()

    model.add(Conv2D(100, kernel_size=(3,3), activation='relu', input_shape=(target_size[0], target_size[1], 3)))
    model.add(MaxPooling2D(2,2))
    model.add(Conv2D(50, kernel_size=(3,3), activation='relu'))
    model.add(MaxPooling2D(2,2))
    model.add(Flatten())
    model.add(Dense(50,activation='relu'))
    model.add(Dropout(rate=0.5))
    # One output unit per class, taken from the num_classes argument
    model.add(Dense(num_classes, activation='softmax'))
    
    model.compile(loss='categorical_crossentropy', optimizer=Adadelta(), metrics=['accuracy'])
    
    return model
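
To see how `target_size` interacts with this architecture, the sketch below traces the feature-map size through the convolution and pooling layers by hand. It assumes the Keras defaults used above: `'valid'` padding with stride 1 for `Conv2D`, and a 2x2 pooling window with stride 2 for `MaxPooling2D`. The helper name `trace_shapes` is made up for illustration.

```python
def trace_shapes(size, conv_filters=(100, 50), kernel=3, pool=2):
    """Return the (height, width, channels) after each conv + pool stage."""
    height, width = size
    shapes = []
    for filters in conv_filters:
        # A 'valid' convolution shrinks each side by kernel - 1.
        height, width = height - kernel + 1, width - kernel + 1
        # Non-overlapping pooling divides each side by the pool size (floor).
        height, width = height // pool, width // pool
        shapes.append((height, width, filters))
    return shapes


for size in [(32, 32), (64, 64)]:
    final = trace_shapes(size)[-1]
    flat = final[0] * final[1] * final[2]
    print(size, '->', final, 'flattened to', flat, 'values')
```

With `(32, 32)` the `Flatten` layer passes 1800 values to the first `Dense` layer; with `(64, 64)` it passes 9800, so the weight count of that dense layer grows by the same factor.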

The code snippet below splits the labeled training data `X_train`, `y_train` into a training part (`X1`, `y1`) and a validation part (`X2`, `y2`).

In [None]:
X1, X2, y1, y2 = train_test_split(X_train, y_train, test_size=0.2, random_state=0)

y1_categorical = to_categorical(y1)
y2_categorical = to_categorical(y2)
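
`to_categorical` turns each integer label into a one-hot row. A NumPy sketch of the same encoding is shown below; `one_hot` is a hypothetical helper, not part of Keras. Passing the number of classes explicitly is a cautious habit: when `num_classes` is omitted, Keras infers it as the largest label plus one, which would yield fewer columns if a split happened to lack the highest class.

```python
import numpy as np

def one_hot(y, num_classes):
    """One-hot encode integer labels into shape (len(y), num_classes)."""
    y = np.asarray(y, dtype=int)
    # Row k of the identity matrix is the one-hot vector for class k.
    return np.eye(num_classes, dtype='float32')[y]

encoded = one_hot([0, 2, 5], num_classes=6)
print(encoded.shape)  # (3, 6)
print(encoded[1])     # [0. 0. 1. 0. 0. 0.]
```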

> **Your task: HW4.3.2**  
> Adjust the `fit` call used for training, including `batch_size` and `epochs`, to suit your hardware.
> 
> For more details, see: https://keras.io/models/model/#fit

In [None]:
model = create_convNet(len(label_encoder.classes_))
model.summary()

batch_size=64
epochs=10

model.fit(X1, y1_categorical, batch_size=batch_size, epochs=epochs)
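
As a rough guide when choosing `batch_size` and `epochs`: each epoch performs one weight update per batch, so training performs `ceil(n_samples / batch_size) * epochs` updates in total. The sample count in the sketch below is a placeholder, not the size of the actual training split.

```python
import math

n_samples = 10000   # placeholder; use len(X1) in the notebook
batch_size = 64
epochs = 10

# The last batch of an epoch may be smaller, hence the ceiling.
steps_per_epoch = math.ceil(n_samples / batch_size)
total_updates = steps_per_epoch * epochs

print(steps_per_epoch)  # 157
print(total_updates)    # 1570
```

A larger `batch_size` means fewer updates per epoch and more memory per step, while more `epochs` means proportionally more training time.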

> **Your task: HW4.3.3**  
> Check the zero-one loss and the confusion matrix, then use them to guide further adjustments to the model.
>
> **Notice:** In general, one should conduct cross-validation, as mentioned in Chapter 1.

In [None]:
y2_categorical_predict = model.predict(X2)
y2_predict = np.argmax(y2_categorical_predict, axis=1)

# Error rate
err_01loss = zero_one_loss(y2, y2_predict)
print('Error rate = %2.3f' % err_01loss)

# Confusion matrix of prediction
plot_conf_mat = PlotMetric(figsize=(6,6))
plot_conf_mat.set_labels(label_encoder.classes_.tolist())
plot_conf_mat.confusion_matrix(y2, y2_predict, True)
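
For reference, the zero-one loss reported above is the fraction of misclassified samples, and a confusion matrix counts how often each true class is predicted as each class. The NumPy sketch below computes both on made-up labels. It follows the scikit-learn convention of true classes in rows and predicted classes in columns; the orientation used by `PlotMetric` is not verified here.

```python
import numpy as np

y_true = np.array([0, 0, 1, 1, 2, 2])  # made-up labels
y_pred = np.array([0, 1, 1, 1, 2, 0])  # made-up predictions

# Zero-one loss: fraction of samples whose prediction is wrong.
error_rate = np.mean(y_true != y_pred)

# Confusion matrix: rows are true classes, columns are predicted classes.
num_classes = 3
conf_mat = np.zeros((num_classes, num_classes), dtype=int)
np.add.at(conf_mat, (y_true, y_pred), 1)

print('Error rate = %2.3f' % error_rate)  # Error rate = 0.333
print(conf_mat)
```

The diagonal holds the correct predictions, so off-diagonal cells show which pairs of classes the model confuses most.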

## HW4.4. Submit to Kaggle InClass
----
> **Your task: HW4.4.**
> 1. Train the model created by `create_convNet` on the full data set `X_train`,
> 2. Predict the **unknown** testing data `X_test` with the trained model, then
> 3. Submit your result to Kaggle.

**Notice: You have 5 chances to submit your result every day.**

In [None]:
# Create model and train
y_train_categorical = to_categorical(y_train)

model = create_convNet(len(label_encoder.classes_))
model.fit(X_train, y_train_categorical, batch_size=batch_size, epochs=epochs)

# Predict the testing data
y_test_categorical_predict = model.predict(X_test)
y_test_predict = np.argmax(y_test_categorical_predict, axis=1)
y_test_predict_str = label_encoder.inverse_transform(y_test_predict)
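
The prediction step above goes from softmax probabilities to class indices to class names. A small NumPy sketch of that chain, with made-up probabilities for two images and the class names in the alphabetical order the encoder uses:

```python
import numpy as np

classes = np.array(['buildings', 'forest', 'glacier', 'mountain', 'sea', 'street'])

# Made-up softmax outputs for two images (each row sums to 1).
probs = np.array([
    [0.05, 0.70, 0.05, 0.10, 0.05, 0.05],
    [0.10, 0.05, 0.15, 0.20, 0.45, 0.05],
])

indices = np.argmax(probs, axis=1)   # most probable class per row
names = classes[indices]             # lookup from index back to class name

print(indices.tolist())  # [1, 4]
print(names.tolist())    # ['forest', 'sea']
```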

## Before you submit
----
Please join the homework 4 competition **using your email address ending with \@trendmicro.com as your Kaggle InClass team name**.

Type your email address into the variable `my_trendmicro_email_which_is_also_my_team_name` to confirm that you have read this paragraph; the following code snippet will then generate the CSV file for submission.

In [None]:
my_trendmicro_email_which_is_also_my_team_name = ''

import re
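# Escape the dot and anchor the end so that only addresses ending in @trendmicro.com pass
assert re.match(r"[^@]+@trendmicro\.com$", my_trendmicro_email_which_is_also_my_team_name), "Please read the instructions in the paragraph above carefully"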

target_path = 'data/hw04.result.csv'
df_test_label = pd.DataFrame({'id': sha1_test, 'label': y_test_predict_str})
df_test_label.to_csv(target_path, index=False)

print('Congratulations! Please submit your result \'%s\' to https://www.kaggle.com/t/f462abb1fb02461eba8318493c482c7a' % target_path)
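
The submission file written above has a header row followed by one `id,label` row per test image. The sketch below builds the same layout in memory with the standard `csv` module, so the format can be inspected without writing to `data/`. The ids and labels are made up; the notebook itself uses `sha1_test` and `y_test_predict_str` with pandas.

```python
import csv
import io

# Made-up ids and predictions standing in for sha1_test and y_test_predict_str.
ids = ['0a1b2c', '3d4e5f']
predicted = ['forest', 'sea']

buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator='\n')
writer.writerow(['id', 'label'])        # header row
writer.writerows(zip(ids, predicted))   # one row per test image

submission_text = buffer.getvalue()
print(submission_text)
```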