# CS5489 - Course Project - Final

Due date: May 4, 11:59pm

## Goal

For the course project, select one of the following competitions on Kaggle:

### [Plant Pathology 2020 - FGVC7](https://www.kaggle.com/c/plant-pathology-2020-fgvc7/overview): Identify the category of foliar diseases in apple trees

> Misdiagnosis of the many diseases impacting agricultural crops can lead to misuse of chemicals leading to the emergence of resistant pathogen strains, increased input costs, and more outbreaks with significant economic loss and environmental impacts. Current disease diagnosis based on human scouting is time-consuming and expensive, and although computer-vision based models have the promise to increase efficiency, the great variance in symptoms due to age of infected tissues, genetic variations, and light conditions within trees decreases the accuracy of detection.
>
> Objectives of ‘Plant Pathology Challenge’ are to train a model using images of training dataset to 1) Accurately classify a given image from testing dataset into different diseased category or a healthy leaf; 2) Accurately distinguish between many diseases, sometimes more than one on a single leaf; 3) Deal with rare classes and novel symptoms; 4) Address depth perception—angle, light, shade, physiological age of the leaf; and 5) Incorporate expert knowledge in identification, annotation, quantification, and guiding computer vision to search for relevant features during learning.

### [University of Liverpool - Ion Switching](https://www.kaggle.com/c/liverpool-ion-switching/overview): Identify the number of channels open at each time point

>Think you can use your data science skills to make big predictions at a submicroscopic level?
>
>Many diseases, including cancer, are believed to have a contributing factor in common. Ion channels are pore-forming proteins present in animals and plants. They encode learning and memory, help fight infections, enable pain signals, and stimulate muscle contraction. If scientists could better study ion channels, which may be possible with the aid of machine learning, it could have a far-reaching impact.
>
>When ion channels open, they pass electric currents. Existing methods of detecting these state changes are slow and laborious. Humans must supervise the analysis, which imparts considerable bias, in addition to being tedious. These difficulties limit the volume of ion channel current analysis that can be used in research. Scientists hope that technology could enable rapid automatic detection of ion channel current events in raw data.
>
>The University of Liverpool’s Institute of Ageing and Chronic Disease is working to advance ion channel research. Their team of scientists have asked for your help. In this competition, you’ll use ion channel data to better model automatic identification methods. If successful, you’ll be able to detect individual ion channel events in noisy raw signals. The data is simulated and injected with real world noise to emulate what scientists observe in laboratory experiments.
>
>Technology to analyze electrical data in cells has not changed significantly over the past 20 years. If we better understand ion channel activity, the research could impact many areas related to cell health and migration. From human diseases to how climate change affects plants, faster detection of ion channels could greatly accelerate solutions to major world problems.

### [Jigsaw Multilingual Toxic Comment Classification](https://www.kaggle.com/c/jigsaw-multilingual-toxic-comment-classification): Use TPUs to identify toxicity comments across multiple languages

>It only takes one toxic comment to sour an online discussion. The Conversation AI team, a research initiative founded by Jigsaw and Google, builds technology to protect voices in conversation. A main area of focus is machine learning models that can identify toxicity in online conversations, where toxicity is defined as anything rude, disrespectful or otherwise likely to make someone leave a discussion. If these toxic contributions can be identified, we could have a safer, more collaborative internet.
>
>In the previous 2018 Toxic Comment Classification Challenge, Kagglers built multi-headed models to recognize toxicity and several subtypes of toxicity. In 2019, in the Unintended Bias in Toxicity Classification Challenge, you worked to build toxicity models that operate fairly across a diverse range of conversations. This year, we're taking advantage of Kaggle's new TPU support and challenging you to build multilingual models with English-only training data.
>
>Jigsaw's API, Perspective, serves toxicity models and others in a growing set of languages (see our documentation for the full list). Over the past year, the field has seen impressive multilingual capabilities from the latest model innovations, including few- and zero-shot learning. We're excited to learn whether these results "translate" (pun intended!) to toxicity classification. Your training data will be the English data provided for our previous two competitions and your test data will be Wikipedia talk page comments in several different languages.
>
>As our computing resources and modeling capabilities grow, so does our potential to support healthy conversations across the globe. Develop strategies to build effective multilingual models and you'll help Conversation AI and the entire industry realize that potential.


## Groups
Group projects should contain 2 students.  To sign up for a group, go to Canvas and under "People", join one of the existing "Project Groups".  _For group projects, the project report must state the percentage contribution from each project member._

## Methodology
You are free to choose the methodology to solve the task.  In machine learning, it is important to use domain knowledge to help solve the problem.  Hence, instead of blindly applying algorithms to the data, you need to think about how to represent the data in a way that makes sense for the algorithm to solve the task.


## Evaluation on Kaggle

The final evaluation will be performed on Kaggle.

## Project Presentation

Each project group needs to give a presentation at the end of the semester.  The presentation time is 8 minutes.  You _must_ give a presentation.

## What to hand in

You need to turn in the following things:

1. This ipynb file `CourseProject-2020.ipynb` with your source code and documentation. You should write about all the various attempts that you make to find a good solution.
2. Your final submission file to Kaggle.
3. The ipynb file `CourseProject-2018-final.ipynb`, which contains the code that generates the final submission file that you submit to Kaggle. This code will be used to verify that your Kaggle submission is reproducible.
4. Presentation slides.

Files should be uploaded to "Course Project" on Canvas.


## Grading
The marks of the assignment are distributed as follows:
- 40% - Results using various feature representations, dimensionality reduction methods, classifiers, etc.
- 25% - Trying out feature representations (e.g. adding additional features, combining features from different sources) or methods not used in the tutorials.
- 15% - Quality of the written report.  More points for insightful observations and analysis.
- 15% - Project presentation
- 5% - Final ranking on the Kaggle test data (private leaderboard).

**Late Penalty:** 25 marks will be subtracted for each day late.
<hr>

# YOUR METHODS HERE

In [None]:
import numpy as np
import scipy as sp
import pandas as pd

import albumentations as A
import cv2

from sklearn import *
from sklearn.model_selection import train_test_split
import tensorflow as tf
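# Import everything from tf.keras; mixing the standalone `keras` package with
# `tensorflow.keras` objects in one model can cause incompatibility errors
from tensorflow import keras
from tensorflow.keras.layers import Dense, Activation, Dropout, Conv2D, Flatten, Input, GlobalAveragePooling2D
from tensorflow.keras.models import Sequential
from tensorflow.keras.callbacks import TensorBoard
from tensorflow.keras import backend as K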

# from tqdm import tqdm
# tqdm.pandas()
import matplotlib.pyplot as plt
from mpl_toolkits.axes_grid1 import ImageGrid
import plotly.offline as pyo
import plotly.express as px
import plotly.graph_objects as go
import plotly.figure_factory as ff
from plotly.subplots import make_subplots
pyo.init_notebook_mode()
from tensorflow.keras.utils import plot_model

np.random.seed(2020)
# tf.random.set_seed(2020)
import datetime
import warnings
warnings.filterwarnings("ignore")

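# Enable memory growth on every visible GPU; the loop does nothing on a CPU-only machine
physical_devices = tf.config.list_physical_devices('GPU')
for device in physical_devices:
    tf.config.experimental.set_memory_growth(device, True)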

print(f'Tensorflow version: {tf.__version__}')

## 1. Data Analysis

In [None]:
import os
IMAGE_PATH = "./dataset/images/"
TEST_PATH = "./dataset/test.csv"
TRAIN_PATH = "./dataset/train.csv"
SUB_PATH = "./dataset/sample_submission.csv"

if os.path.exists("/kaggle/input/plant-pathology-2020-fgvc7/images"):
    IMAGE_PATH = "/kaggle/input/plant-pathology-2020-fgvc7/images/"
    TEST_PATH = "/kaggle/input/plant-pathology-2020-fgvc7/test.csv"
    TRAIN_PATH = "/kaggle/input/plant-pathology-2020-fgvc7/train.csv"
    SUB_PATH = "/kaggle/input/plant-pathology-2020-fgvc7/sample_submission.csv"

sub = pd.read_csv(SUB_PATH)
test_data = pd.read_csv(TEST_PATH)
train_data = pd.read_csv(TRAIN_PATH)

In [None]:
print(len(train_data))
train_data.head()

In [None]:
print(len(test_data))
test_data.head()

In [None]:
def format_path(image_id, basedir=IMAGE_PATH):
    return basedir + image_id + '.jpg'

test_paths = test_data.image_id.apply(format_path).values
train_paths = train_data.image_id.apply(format_path).values

train_labels = np.float32(train_data.loc[:, 'healthy':'scab'].values)

In [None]:
print(len(train_paths))
print(len(test_paths))
print(train_paths[0])
print(train_labels[0])

### Visualize sample leaves

In [None]:
classes = ['healthy', 'multiple_diseases', 'rust', 'scab']
def get_tags(labels):
    # Convert each one-hot label row into the list of class names that are set
    tags = []
    for label in labels:
        tag = [classes[i] for i in range(len(classes)) if label[i] == 1]
        tags.append(tag)
    return tags

train_tags = get_tags(train_labels)

In [None]:
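def load_image(file_path):
    image = cv2.imread(file_path)  # returns None if the file cannot be read
    if image is None:
        raise FileNotFoundError(f'Cannot read image: {file_path}')
    # OpenCV loads images as BGR; convert to RGB for matplotlib
    return cv2.cvtColor(image, cv2.COLOR_BGR2RGB)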

In [None]:
def show_samples(c='healthy'):
    plt.figure(figsize=(16,10))
    num = 0
    for i in range(len(train_paths)):
        if c not in train_tags[i]:
            continue
        plt.subplot(4, 4, num+1)
        plt.imshow(cv2.resize(load_image(train_paths[i]), (205, 136))) 
        plt.title(' '.join(train_tags[i]))
        plt.axis('off')
        num += 1
        if num == 16:
            break
    plt.show()

#### Split training and validation sets

In [None]:
# Hold out 20% of the labelled images as a validation set
train_paths, valid_paths, train_labels, valid_labels = \
    train_test_split(train_paths, train_labels, test_size=0.2, random_state=42)

## 2. Pre-processing data for generative classifiers and discriminative classifiers

...

## 3. Pre-processing data for DNN

In the next sections we use deep neural networks to fit the dataset.

In this section we build the data preprocessing pipeline, including data augmentation, TensorFlow dataset generation, and lazy loading.
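To illustrate what lazy loading means here, the following is a minimal plain-Python sketch, not the `tf.data` implementation: an item is decoded only when a batch that contains it is actually requested. The names `lazy_pipeline` and `fake_decode` are hypothetical stand-ins introduced for this illustration.

```python
from itertools import islice

def lazy_pipeline(paths, decode, batch_size):
    """Yield batches of decoded items; decode runs only when a batch is requested."""
    decoded = map(decode, paths)  # lazy: nothing is decoded at this point
    while True:
        batch = list(islice(decoded, batch_size))
        if not batch:
            return
        yield batch

decode_calls = []

def fake_decode(path):
    # Hypothetical stand-in for reading and decoding an image file
    decode_calls.append(path)
    return path.upper()

paths = [f"img_{i}.jpg" for i in range(10)]
batches = lazy_pipeline(paths, fake_decode, batch_size=4)

first = next(batches)  # only now are the first 4 items decoded
print(first)
print("items decoded so far:", len(decode_calls))
```

The same idea lets the training loop start before the whole image folder has been read, and keeps memory use bounded by the batch and buffer sizes rather than by the dataset size.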

### Image augmentation 

...

### Build input pipeline

In this section we build the pipeline that assembles the steps of model training.

In [None]:
def decode_image(filename, label=None, image_size=(512, 512)):
    bits = tf.io.read_file(filename)
    image = tf.image.decode_jpeg(bits, channels=3)
    image = tf.cast(image, tf.float32) / 255.0
    image = tf.image.resize(image, image_size)
    
    if label is None:
        return image
    else:
        return image, label

def data_augment(image, label=None):
    # add different data augmentation methods here
    image = tf.image.random_flip_left_right(image)
    image = tf.image.random_flip_up_down(image)
    image = tf.image.random_brightness(image, max_delta=0.5)
    image = tf.image.random_saturation(image, 0.5, 3)
    image = tf.image.random_contrast(image, 0.5, 3)
    
    if label is None:
        return image
    else:
        return image, label

In [None]:
AUTOTUNE = tf.data.experimental.AUTOTUNE
# BATCH_SIZE = 8  # 
# STEPS_PER_EPOCH = train_labels.shape[0] // BATCH_SIZE


def build_dataset(batch_size, data_augment,
                 train_paths=train_paths,
                 train_labels=train_labels,
                 valid_paths=valid_paths,
                 valid_labels=valid_labels,
                 test_paths=test_paths):
    # `data_augment` is the augmentation function to map over the training set,
    # or None to disable augmentation.
    train = (
        tf.data.Dataset
        .from_tensor_slices((train_paths, train_labels))
        .map(decode_image, num_parallel_calls=AUTOTUNE)
    )
    if data_augment:
        print('Data augment enabled.')
        train = train.map(data_augment, num_parallel_calls=AUTOTUNE)
    train = (
        train
        .repeat()
        .shuffle(512)
        .batch(batch_size)
        .prefetch(AUTOTUNE)
    )

    # Validation and test sets are neither augmented, shuffled, nor repeated
    valid = (
        tf.data.Dataset
        .from_tensor_slices((valid_paths, valid_labels))
        .map(decode_image, num_parallel_calls=AUTOTUNE)
        .batch(batch_size)
        .prefetch(AUTOTUNE)
    )

    test = (
        tf.data.Dataset
        .from_tensor_slices(test_paths)
        .map(decode_image, num_parallel_calls=AUTOTUNE)
        .batch(batch_size)
    )
    
    return train, valid, test

# train_dataset, valid_dataset, test_dataset = build_dataset(BATCH_SIZE, None)


In [None]:
## Log settings
log_dir = "./logs/" + datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
tensorboard = TensorBoard(log_dir=log_dir, histogram_freq=0, write_graph=False, write_images=False)
earlystop = keras.callbacks.EarlyStopping(
           monitor='val_loss', 
           min_delta=0.0001, patience=10, 
           verbose=1, mode='auto')

## Use a learning rate scheduler
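def scheduler(epoch):
    # Hold 0.001 for the first 10 epochs, then decay by a factor of exp(-0.1) per epoch
    if epoch < 10:
        return 0.001
    else:
        # Return a plain float; newer Keras versions reject a tensor here
        return float(0.001 * np.exp(0.1 * (10 - epoch)))
lr_scheduler = tf.keras.callbacks.LearningRateScheduler(scheduler)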

def train(model, trainset, validset, steps_per_epoch,
          lr=0.01, 
          lr_scheduler=None, 
          earlystop=None, 
          optimizer='adam', 
          epochs=3
         ):   
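    # NOTE: `lr` is not applied anywhere. By default the optimizer is passed by name,
    # so Keras uses its default learning rate unless `lr_scheduler` overrides it each epoch.
    callbacks_list = [tensorboard]
    if lr_scheduler:
        callbacks_list.append(lr_scheduler)

    if earlystop:
        callbacks_list.append(earlystop)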
        
    model.compile(optimizer=optimizer, 
                  loss=keras.losses.categorical_crossentropy,
                  metrics=['categorical_accuracy'])

    history = model.fit(trainset, 
                        epochs=epochs, 
                        callbacks=callbacks_list, 
                        validation_data=validset,
                        steps_per_epoch=steps_per_epoch,
                        verbose=True)
    
    return history

Instead of a fixed learning rate, we use a learning rate scheduler. The learning rate is held at 0.001 for the first 10 epochs and then decays exponentially with the epoch number. The curve of the learning rate schedule is shown below.
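As a quick check of the schedule's shape, the sketch below re-implements the same rule in plain Python and prints the learning rate at a few epochs. It uses `math.exp` instead of `tf.math.exp` so that it runs without TensorFlow; the function name `schedule` and its keyword parameters are introduced here for illustration only.

```python
import math

def schedule(epoch, base_lr=0.001, hold_epochs=10, decay=0.1):
    """Hold base_lr for hold_epochs epochs, then decay exponentially."""
    if epoch < hold_epochs:
        return base_lr
    return base_lr * math.exp(decay * (hold_epochs - epoch))

for epoch in (0, 9, 10, 20, 30, 49):
    print(f"epoch {epoch:2d}: lr = {schedule(epoch):.6f}")
```

Ten epochs after the hold period ends (epoch 20) the rate has dropped by a factor of exp(-1), i.e. to roughly 37% of its initial value.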

We also use an early-stopping callback to mitigate overfitting: training stops when the monitored indicator no longer improves. In this project the monitored indicator is the validation loss. The `patience` parameter of the early-stopping callback depends on the variance of the model. With `patience = n`, training stops `n` epochs after the epoch with the smallest validation loss, so the model we end up with is the one from `n` epochs after the best one. In our experiments the validation loss of our deep neural network models does not decrease steadily most of the time, so we choose a relatively large value, `patience=10`.
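The effect of `patience` can be sketched with a small plain-Python simulation. This is a simplified model of the stopping rule, not the Keras implementation; `early_stop_epoch` and the loss values are hypothetical and chosen only to show the behaviour.

```python
def early_stop_epoch(val_losses, patience=10, min_delta=0.0001):
    """Return (stop_epoch, best_epoch) for a list of per-epoch validation losses."""
    best = float("inf")
    best_epoch = -1
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best - min_delta:
            # Improvement: remember it and reset the patience counter
            best, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    # Training ran out of epochs before patience was exhausted
    return len(val_losses) - 1, best_epoch

# Hypothetical noisy validation losses
losses = [1.0, 0.8, 0.9, 0.85, 0.95, 0.7]
print(early_stop_epoch(losses, patience=2))
print(early_stop_epoch(losses, patience=10))
```

With the small `patience`, training halts two epochs after the early minimum at epoch 1 and never reaches the lower loss at epoch 5; the larger `patience` tolerates the fluctuation, which is the reason for preferring a larger value when the validation loss is noisy.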

In [None]:
## Visualisation
def plot_history(history): 
    fig, ax1 = plt.subplots()
    
    ax1.plot(history.history['loss'], 'r', label="training loss ({:.6f})".format(history.history['loss'][-1]))
    ax1.plot(history.history['val_loss'], 'r--', label="validation loss ({:.6f})".format(history.history['val_loss'][-1]))
    ax1.grid(True)
    ax1.set_xlabel('epoch')
    ax1.legend(loc="best", fontsize=9)    
    ax1.set_ylabel('loss', color='r')
    ax1.tick_params('y', colors='r')

    if 'categorical_accuracy' in history.history:
        ax2 = ax1.twinx()

        ax2.plot(history.history['categorical_accuracy'], 'b', label="training acc ({:.4f})".format(history.history['categorical_accuracy'][-1]))
        ax2.plot(history.history['val_categorical_accuracy'], 'b--', label="validation acc ({:.4f})".format(history.history['val_categorical_accuracy'][-1]))

        ax2.legend(loc="best", fontsize=9)
        ax2.set_ylabel('acc', color='b')        
        ax2.tick_params('y', colors='b')

## 4. NN models

In this section we train several neural network models, including a shallow neural network and well-known deep neural networks from the area of image recognition/classification, with or without data augmentation.

...

#### (3) DenseNet121 with data augmentation

In this section we try another deep network, DenseNet, proposed by Gao Huang et al. [[2]](https://arxiv.org/pdf/1608.06993.pdf). DenseNet connects each layer to every other layer in a feed-forward fashion so that features are reused. It has fewer parameters than ResNet while achieving comparable performance.
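Because each layer's output is concatenated to its input, the number of feature channels grows linearly inside a dense block. The sketch below only counts channels; it does not build a network. The block sizes (6, 12, 24, 16), growth rate 32, 64 initial channels, and compression factor 0.5 are the DenseNet-121 settings as described in [2], not values defined in this notebook, and the function name is hypothetical.

```python
def densenet_feature_channels(block_layers=(6, 12, 24, 16),
                              growth_rate=32,
                              init_channels=64,
                              compression=0.5):
    """Count feature channels at the end of each dense block."""
    channels = init_channels
    per_block = []
    for i, num_layers in enumerate(block_layers):
        # Each layer contributes growth_rate new feature maps,
        # which are concatenated with everything before them
        channels += num_layers * growth_rate
        per_block.append(channels)
        if i < len(block_layers) - 1:
            # A transition layer between blocks compresses the channel count
            channels = int(channels * compression)
    return channels, per_block

final_channels, per_block = densenet_feature_channels()
print("channels after each dense block:", per_block)
print("channels fed to global average pooling:", final_channels)
```

Under these assumptions the last block ends with 1024 channels, so the global average pooling layer produces a 1024-dimensional feature vector for the final 4-way classifier.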

In [None]:
from tensorflow.keras.applications import DenseNet121

def create_model_densenet121():
    model = tf.keras.Sequential()
    model.add(DenseNet121(input_shape=(512, 512, 3),
                          weights='imagenet',
                          include_top=False))
    model.add(GlobalAveragePooling2D())
    model.add(Dense(4, activation='softmax'))
    
    model.summary()  # summary() prints directly and returns None
    
    return model

In [None]:
# Setup
K.clear_session()
BATCH_SIZE = 8  # 
STEPS_PER_EPOCH = train_labels.shape[0] // BATCH_SIZE
trainset, validset, testset = build_dataset(batch_size=BATCH_SIZE, data_augment=data_augment)

# Train the DenseNet121 model with the same setup, with data augmentation
model = create_model_densenet121()
history = train(model, trainset, validset, 
                steps_per_epoch=STEPS_PER_EPOCH,
                lr_scheduler=lr_scheduler, 
                epochs=50
                )

In [None]:
plot_history(history)

In [None]:
probs_dnn = model.predict(testset, verbose=1)
sub.loc[:, 'healthy':] = probs_dnn
sub.to_csv('submission_DenseNet121_ww_augment.csv', index=False)
sub.head(10)

**Summary of DenseNet121** This setup achieved a score of 0.962 on the Kaggle public leaderboard.

#### InceptionResNetV2

In [None]:
# Early Stop

# Setup
K.clear_session()
BATCH_SIZE = 8  # 
STEPS_PER_EPOCH = train_labels.shape[0] // BATCH_SIZE
trainset, validset, testset = build_dataset(batch_size=BATCH_SIZE, data_augment=data_augment)

model = create_model_InceptionResNetV2()
history = train(model, trainset, validset, 
                steps_per_epoch=STEPS_PER_EPOCH,
                lr_scheduler=lr_scheduler, 
                earlystop=earlystop,
                epochs=100
                )

In [None]:
plot_history(history)

In [None]:
probs_dnn = model.predict(testset, verbose=1)
sub.loc[:, 'healthy':] = probs_dnn
sub.to_csv('submission_InceptionResNetV2_ww_augment_earlystop.csv', index=False)
sub.head(10)

## References

[1] K. He, X. Zhang, S. Ren, and J. Sun, “[Deep Residual Learning for Image Recognition](http://arxiv.org/pdf/1512.03385),” arXiv:1512.03385 [cs], Dec. 2015, Accessed: May 04, 2020. [Online]. Available: http://arxiv.org/abs/1512.03385.

[2] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger, “[Densely Connected Convolutional Networks](https://arxiv.org/pdf/1608.06993.pdf),” in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, Jul. 2017, pp. 2261–2269, doi: 10.1109/CVPR.2017.243.

[3] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. Alemi, “[Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning](https://arxiv.org/pdf/1602.07261),” arXiv:1602.07261 [cs], Aug. 2016, Accessed: May 04, 2020. [Online]. Available: http://arxiv.org/abs/1602.07261.

### Library docs, tutorials, manuals

[TensorFlow: Data augmentation](https://www.tensorflow.org/tutorials/images/data_augmentation)

[tf.data: Build TensorFlow input pipelines](https://www.tensorflow.org/guide/data)

[TensorFlow: Load images](https://www.tensorflow.org/tutorials/load_data/images)

[tensorflow.keras.applications](https://www.tensorflow.org/api_docs/python/tf/keras/applications)

[Keras applications, model summaries, etc.](https://keras.io/applications/)

[OpenCV background removal](https://www.kaggle.com/victorlouisdg/plant-pathology-opencv-background-removal/execution)