## This assignment is designed for automated pathology detection in medical images in a realistic setup, i.e., each image may have multiple pathologies/disorders.
### The goal, for you as an MLE, is to design models and methods to predictively detect pathological images and explain the pathology sites in the image data.

## Data for this assignment is taken from a Kaggle contest: https://www.kaggle.com/c/vietai-advance-course-retinal-disease-detection/overview
Explanation of the data set:
The training data set contains 3435 retinal images that represent multiple pathological disorders. The pathology classes and corresponding labels are included in the 'train.csv' file, and each image can have more than one class category (multiple pathologies).
The labels for each image are:

```
- opacity (0)
- diabetic retinopathy (1)
- glaucoma (2)
- macular edema (3)
- macular degeneration (4)
- retinal vascular occlusion (5)
- normal (6)
```
The test data set contains 350 unlabelled images.

# For this assignment, you are working with specialists for Diabetic Retinopathy and Glaucoma only, and your client is interested in a predictive learning model along with feature explainability and self-learning for Diabetic Retinopathy and Glaucoma vs. Normal images.
# Design models and methods for the following tasks. Each task should be accompanied by code, plots/images (if applicable), tables (if applicable) and text:
## Task 1: Build a classification model for Diabetic Retinopathy and Glaucoma vs. normal images. You may consider multi-class classification vs. one-vs-all classification. Clearly state your choice and share details of your model, parameters, and hyperparameter tuning process. (60 points)
```
a. Perform 70/30 data split and report performance scores on the test data set.
b. You can choose to apply any data augmentation strategy. 
Explain your methods and rationale behind parameter selection.
c. Show training/validation curves to demonstrate that overfitting and underfitting are avoided.
```
## Task 2: Visualize the heatmap/saliency/features using any method of your choice to demonstrate what regions of interest contribute to Diabetic Retinopathy and Glaucoma, respectively. (25 points)
```
Submit images/folder of images with heatmaps/features aligned on top of the images, or corresponding bounding boxes, and report what regions of interest in your opinion represent the pathological sites.
```
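One model-agnostic way to produce such a heatmap is occlusion sensitivity: mask one patch of the image at a time and record how much the predicted score drops. The sketch below is only an illustration of that idea, not the method the assignment prescribes. `occlusion_map`, `predict_fn`, and `toy_predict` are hypothetical names, and the toy scoring function stands in for a trained model.

```python
from typing import Callable

import numpy as np


def occlusion_map(image: np.ndarray,
                  predict_fn: Callable[[np.ndarray], float],
                  patch: int = 8,
                  fill: float = 0.0) -> np.ndarray:
    """Return a heatmap of the score drop caused by masking each patch.
    :param image: The image, of shape (height, width, channels).
    :param predict_fn: Maps an image to a single class score (hypothetical).
    :param patch: The side length of the square occlusion patch, in pixels.
    :param fill: The value written into the occluded patch.
    :return: A (height, width) array; larger values mean more important.
    """
    height, width = image.shape[:2]
    baseline = predict_fn(image)
    heatmap = np.zeros((height, width), dtype=np.float32)
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            occluded = image.copy()
            occluded[top:top + patch, left:left + patch] = fill
            # A large drop means the score depended on this region.
            heatmap[top:top + patch, left:left + patch] = (
                baseline - predict_fn(occluded))
    return heatmap


def toy_predict(img: np.ndarray) -> float:
    """Toy stand-in for a model: the score is the brightness of one region."""
    return float(img[8:16, 8:16].mean())


toy_image = np.ones((32, 32, 3), dtype=np.float32)
heatmap = occlusion_map(toy_image, toy_predict, patch=8)
print(np.argwhere(heatmap > 0).min(axis=0))  # top-left corner of the hot region
```

With a trained Keras model, `predict_fn` could plausibly wrap `model.predict` on a single-image batch and return the score of one class. The resulting heatmap can then be overlaid on the image, for example with `plt.imshow(heatmap, alpha=0.5)`.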

## Task 3: Using the unlabelled data set in the 'test' folder, augment the training data (semi-supervised learning) and report the variation in classification performance on the test data set. (15 points)
[You may use any method of your choice, one possible way is mentioned below.] 

```
Hint: 
a. Train a model using the 'train' split.
b. Pass the unlabelled images through the trained model and retrieve the dense layer features prior to the classification layer. Using this dense layer as a representation of the image, apply label propagation to retrieve labels corresponding to the unlabelled data.
c. Next, concatenate the train data with the unlabelled data (that has now been self-labelled) and retrain the network.
d. Report classification performance on the test data.
Use the unlabelled test data to improve classification performance by using a semi-supervised label-propagation/self-labelling approach. (20 points)
```
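Step b of the hint can be sketched with a small NumPy implementation of graph-based label propagation. This is a minimal illustration under stated assumptions, not the required method: `propagate_labels` is a hypothetical helper, the RBF width `gamma`, the iteration count, and the 0.5 threshold are arbitrary choices, and the 2-D toy features stand in for the dense-layer features of real images. Each label column is propagated independently, which suits the multi-label setting.

```python
import numpy as np


def propagate_labels(features: np.ndarray,
                     labels: np.ndarray,
                     is_labelled: np.ndarray,
                     gamma: float = 1.0,
                     n_iter: int = 50) -> np.ndarray:
    """Spread label scores from labelled to unlabelled rows over an RBF graph.
    :param features: The (n, d) feature matrix, one row per image.
    :param labels: The (n, c) label matrix; unlabelled rows are ignored.
    :param is_labelled: A boolean mask of shape (n,), True for labelled rows.
    :param gamma: The RBF kernel width (an assumption; tune on real data).
    :param n_iter: The number of propagation steps.
    :return: The (n, c) matrix of propagated label scores.
    """
    # Pairwise squared distances. This needs O(n^2 * d) memory, so it only
    # suits small problems.
    diffs = features[:, None, :] - features[None, :, :]
    affinity = np.exp(-gamma * (diffs ** 2).sum(axis=-1))
    # Row-normalize so each row is a probability distribution over neighbors.
    transition = affinity / affinity.sum(axis=1, keepdims=True)
    scores = np.where(is_labelled[:, None], labels, 0.0).astype(np.float64)
    for _ in range(n_iter):
        scores = transition @ scores
        # Clamp the known labels after every step.
        scores[is_labelled] = labels[is_labelled]
    return scores


# Toy data: two well-separated clusters, one labelled point in each.
features = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
labels = np.array([[1.0, 0.0], [0.0, 0.0], [0.0, 1.0], [0.0, 0.0]])
is_labelled = np.array([True, False, True, False])

scores = propagate_labels(features, labels, is_labelled)
pseudo_labels = (scores >= 0.5).astype(int)
print(pseudo_labels)
```

In the toy example each unlabelled point inherits the label of the labelled point in its own cluster. On real features it is probably worth keeping only the pseudo-labels whose scores are far from the threshold before retraining.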
## Good Luck!

In [2]:
# Import statements.
import os
import math
import numpy as np
import pandas
from typing import Any, Dict, List, Optional
from datetime import datetime
import cv2
import matplotlib
import matplotlib.pyplot as plt
from tensorflow.keras.models import Model, Sequential
from tensorflow.keras.callbacks import History, TensorBoard, ModelCheckpoint
from tensorflow.keras.layers import Dense, Flatten, Conv2D, MaxPooling2D, \
    Dropout, BatchNormalization
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.preprocessing.image import ImageDataGenerator

In [3]:
# Global constants.
DEFAULT_ROOT_DIR = os.path.join('.', 'Data')
DEFAULT_DATASET_ARGS = {'val_split': 0.3}
DEFAULT_MODEL_ARGS = {'input_shape': (512, 512, 3),
                      'output_shape': (3,),
                      'learning_rate': 1e-4}
DEFAULT_TRAIN_ARGS = {'epochs': 5,
                      'batch_size': 32,
                      'shuffle': True,
                      'use_tensorboard': False,
                      'model_checkpoint_filename': None,
                      'augment': False}
METRICS = {'accuracy', 'precision', 'recall', 'F1'}
DEFAULT_CLASS_NAMES = ['normal', 'diabetic retinopathy', 'glaucoma']
TRAIN_KEY = 'train'
VAL_KEY = 'val'
TEST_KEY = 'test'

In [3]:
# Your drive might be named 'gdrive' rather than 'drive'
from google.colab import drive
drive.mount('/content/drive')
os.chdir('/content/drive/MyDrive/Colab Notebooks/Midterm/')
os.listdir('.')

ModuleNotFoundError: No module named 'google.colab'

# Task 1.

## Approach

We will use a convolutional neural network with a 3-neuron sigmoid output. We make this choice of model because:

* Convolutional neural networks are a well-established and consistently strong choice for image classification problems across many domains.
* A 3-neuron sigmoid output (i.e., multi-label classification) is easy to implement for a neural network, and also allows us to quickly extend the model to 2, 7, or any arbitrary number of outputs.
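To make the difference between the two common output activations concrete, the sketch below applies both to the same set of logits. The logit values and the 0.5 decision threshold are made up for illustration. A softmax forces the class probabilities to compete and sum to 1, whereas independent sigmoids can assign a high probability to several classes at once, which is what a multi-label problem needs.

```python
import numpy as np


def softmax(logits: np.ndarray) -> np.ndarray:
    """Mutually exclusive class probabilities that sum to 1."""
    shifted = logits - np.max(logits)  # Subtract the max for stability.
    exp = np.exp(shifted)
    return exp / exp.sum()


def sigmoid(logits: np.ndarray) -> np.ndarray:
    """Independent per-class probabilities; they need not sum to 1."""
    return 1.0 / (1.0 + np.exp(-logits))


# Hypothetical logits in the order normal, diabetic retinopathy, glaucoma,
# for an eye that shows both retinopathy and glaucoma.
logits = np.array([-3.0, 2.0, 2.0])

softmax_probs = softmax(logits)
sigmoid_probs = sigmoid(logits)
# Thresholding each sigmoid output at 0.5 (an assumed cut-off) gives the
# predicted multi-hot label.
predicted_labels = (sigmoid_probs >= 0.5).astype(int)

print(softmax_probs.round(3))
print(sigmoid_probs.round(3))
print(predicted_labels)
```

With these logits the softmax splits its probability mass between the two diseases, so neither reaches 0.5, while the sigmoid outputs flag both.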

## Functions

Below are the functions we will use to work with the dataset and run the model.

In [4]:
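def load_image_into_numpy_array(path: str) -> np.ndarray:
    """Load an image from file into a numpy array.
    :param path: The path to the image.
    :return: The image contents as an np.ndarray of type uint8, with shape
    (height, width, 3) in RGB channel order.
    """
    img_data = cv2.imread(path)
    if img_data is None:
        # cv2.imread returns None instead of raising when it cannot read.
        raise ValueError('Could not read image at {0}'.format(path))
    # OpenCV loads images as BGR with shape (height, width, 3), so only the
    # channel order needs converting; reshaping would scramble the pixels.
    img_data = cv2.cvtColor(img_data, cv2.COLOR_BGR2RGB)
    return img_data.astype(np.uint8)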

In [5]:
def normalize_image(img: np.ndarray) -> np.ndarray:
    """Returns the normalized image from the numpy array.
    :param img: The image to normalize, each element of which is np.uint8.
    :return: The normalized image, each element of which is np.float16.
    """
    if img.dtype != np.uint8:
        raise ValueError('Expected image array type np.uint8, but found {0}'
        .format(img.dtype))
    return img.astype(np.float16) / 255

In [6]:
def denormalize_image(norm_img: np.ndarray) -> np.ndarray:
    """Returns the denormalized image (i.e., inverts the normalization process)
    so that the output is a numpy array representing an image.
    :param norm_img: The image to denormalize, each element of which is
    np.float16.
    :return: The denormalized image, each element of which is np.uint8.
    """
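    if norm_img.dtype != np.float16:
        raise ValueError('Expected normalized image array type np.float16, '
        'but found {0}'.format(norm_img.dtype))
    # Round before casting: truncation could map a value that lands just
    # below an integer (due to float16 precision) to the next lower integer.
    return np.rint(norm_img.astype(np.float32) * 255).astype(np.uint8)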

### A note on labels

One decision I make on labels for this problem is that I do not force the labels for each example to contain at least one 1. In other words, it is possible for the label for an example to be (0, 0, 0), which means that the eye has no glaucoma, no retinopathy, and is not normal.
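The sketch below shows how an all-zero label can arise when the full label vector is restricted to the three classes of interest. It assumes the class order from the data set explanation above; the two example label vectors are made up, and `restrict_label` is a hypothetical helper rather than part of this notebook's pipeline.

```python
import numpy as np

# Class order taken from the data set explanation (indices 0 through 6).
ALL_CLASSES = ['opacity', 'diabetic retinopathy', 'glaucoma', 'macular edema',
               'macular degeneration', 'retinal vascular occlusion', 'normal']
KEPT_CLASSES = ['normal', 'diabetic retinopathy', 'glaucoma']


def restrict_label(full_label: np.ndarray) -> np.ndarray:
    """Keep only the entries for KEPT_CLASSES, in that order.
    :param full_label: The multi-hot label over ALL_CLASSES.
    :return: The multi-hot label over KEPT_CLASSES.
    """
    indices = [ALL_CLASSES.index(name) for name in KEPT_CLASSES]
    return full_label[indices]


# Hypothetical image showing only macular edema.
edema_only = np.array([0, 0, 0, 1, 0, 0, 0], dtype=np.float32)
# Hypothetical image showing both diabetic retinopathy and glaucoma.
both_diseases = np.array([0, 1, 1, 0, 0, 0, 0], dtype=np.float32)

print(restrict_label(edema_only))     # Neither normal nor a kept disease.
print(restrict_label(both_diseases))  # Two positive entries at once.
```

An eye with only macular edema is not normal, yet it has neither of the two diseases we model, so its restricted label is all zeros.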

In [7]:
def get_partition(root_dir: os.path, dataset_args: Dict[str, Any]) -> \
    Dict[str, List[str]]:
    """Returns a dict where the keys are 'train', 'test', and 'val', and the
    values are the images under each. The list is the authoritative order of the
    train/test examples; partition['train'][0] is the first training example,
    and x_train[0] will correspond with that filename.
    :param root_dir: The root directory of the dataset, in which are located
    test and train subdirectories.
    :param dataset_args: The dataset arguments. See DEFAULT_DATASET_ARGS for
    available options.
    :return: The train/val/test partition.
    """
    dataset_args = {**DEFAULT_DATASET_ARGS, **dataset_args}
    partition = {}
    train_image_dir = os.path.join(root_dir, 'train', 'train')
    train_image_filenames = [filename for
                             filename in os.listdir(train_image_dir) if
                             filename.endswith('.jpg')]
    rand_indices = np.random.permutation(len(train_image_filenames))
    split_index = int(dataset_args['val_split'] * len(train_image_filenames))
    val_indices = rand_indices[:split_index]
    train_indices = rand_indices[split_index:]
    partition[TRAIN_KEY] = [train_image_filenames[i] for i in train_indices]
    partition[VAL_KEY] = [train_image_filenames[i] for i in val_indices]
    test_image_dir = os.path.join(root_dir, 'test', 'test')
    test_image_filenames = [filename for
                            filename in os.listdir(test_image_dir) if
                            filename.endswith('.jpg')]
    partition[TEST_KEY] = test_image_filenames
    return partition

In [8]:
def get_labels(root_dir: os.path, class_names: List[str]) -> Dict[str,
                                                                  np.ndarray]:
    """Returns a dict where the keys are the image filenames and the values are
    the labels. Each label has classes in the order of class_names (e.g.,
    ['normal', 'diabetic retinopathy', 'glaucoma']).
    :param root_dir: The root directory of the dataset, in which are located
    test and train subdirectories.
    :param class_names: The names of the classes, in order.
    :return: The label dict.
    """
    labels = {}
    train_labels_filename = os.path.join(root_dir, 'train', 'train.csv')
    df_train = pandas.read_csv(train_labels_filename)[['filename'] +
                                                      class_names]
    for i, filename in enumerate(df_train['filename']):
        row = df_train.iloc[i]
        labels[filename] = np.array([row[class_name] for 
                                     class_name in class_names],
                                    dtype=np.float32)
    return labels

In [9]:
def _get_dataset_from_scratch(
    root_dir: os.path,
    partition: Dict[str, str],
    labels: Dict[str, np.ndarray]) -> (np.ndarray, np.ndarray, np.ndarray,
                                       np.ndarray):
    """Preprocesses and returns the dataset located at root_dir.
    :param root_dir: The root directory of the dataset, in which are located
    test and train subdirectories.
    :param partition: The train/test partition.
    :param labels: The labels.
    :return: The preprocessed dataset as a 4-tuple: x_train, y_train, x_test,
    y_test.
    """
    train_image_dir = os.path.join(root_dir, 'train', 'train')
    first_image = load_image_into_numpy_array(os.path.join(
        train_image_dir, partition[TRAIN_KEY][0]))
    input_shape = (len(partition[TRAIN_KEY]), *first_image.shape)
    print('Inferring training input shape: {0}'.format(input_shape))
    x_train = np.zeros(input_shape, dtype=np.float16)
    y_train = np.zeros((input_shape[0], len(labels[partition[TRAIN_KEY][0]])))
    for i, filename in enumerate(partition[TRAIN_KEY]):
        if i % 100 == 0:
            print('Loading image {0}'.format(i))
        img_arr = load_image_into_numpy_array(os.path.join(
            train_image_dir, filename))
        preprocessed_arr = normalize_image(img_arr)
        x_train[i, :] = preprocessed_arr
        y_train[i, :] = labels[filename]
    test_image_dir = os.path.join(root_dir, 'test', 'test')
    first_image = load_image_into_numpy_array(os.path.join(
        test_image_dir, partition[TEST_KEY][0]))
    input_shape = (len(partition[TEST_KEY]), *first_image.shape)
    if input_shape[1:] != x_train.shape[1:]:
        raise ValueError('Expected test set shape, {0}, to match train set '
        'shape, {1}.'.format(input_shape, x_train.shape))
    print('Inferring test input shape: {0}'.format(input_shape))
    x_test = np.zeros(input_shape, dtype=np.float16)
    y_test = None
    for i, filename in enumerate(partition[TEST_KEY]):
        if i % 100 == 0:
            print('Loading image {0}'.format(i))
        img_arr = load_image_into_numpy_array(os.path.join(
            test_image_dir, filename))
        preprocessed_arr = normalize_image(img_arr)
        x_test[i, :] = preprocessed_arr
    return x_train, y_train, x_test, y_test

In [10]:
def get_dataset(
    root_dir: os.path,
    partition: Dict[str, str],
    labels: Dict[str, np.ndarray]) -> (np.ndarray, np.ndarray, np.ndarray,
                                       np.ndarray):
    """Preprocesses and returns the dataset located at root_dir. If the cached
    dataset files are available, uses those.
    :param root_dir: The root directory of the dataset, in which are located
    test and train subdirectories.
    :param partition: The train/test partition.
    :param labels: The labels.
    :return: The preprocessed dataset as a 4-tuple: x_train, y_train, x_test,
    y_test.
    """
    return _get_dataset_from_scratch(root_dir, partition, labels)

In [11]:
def _get_alex_net(model_args: Dict[str, Any]) -> Model:
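    """Returns an AlexNet-style network as a tensorflow.keras.models.Model
    instance: the five AlexNet convolutional layers, each followed by batch
    normalization, feeding directly into the output layer (no hidden dense
    layers).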
    :param model_args: Model hyperparameters. See DEFAULT_MODEL_ARGS for
    hyperparameters and their default values.
    :return: A tensorflow.keras.models.Model instance.
    """
    model = Sequential()
    model.add(Conv2D(96, kernel_size=(11, 11), strides=4,
                     padding='valid', activation='relu',
                     input_shape=model_args['input_shape'],
                     kernel_initializer='he_normal'))
    model.add(BatchNormalization())
    model.add(MaxPooling2D(pool_size=(3, 3), strides=(2, 2),
                           padding='valid', data_format=None))
    model.add(Conv2D(256, kernel_size=(5, 5), strides=1,
                     padding='same', activation='relu',
                     kernel_initializer='he_normal'))
    model.add(BatchNormalization())
    model.add(MaxPooling2D(pool_size=(3, 3), strides=(2, 2),
                           padding='valid', data_format=None)) 
    model.add(Conv2D(384, kernel_size=(3, 3), strides=1,
                     padding='same', activation= 'relu',
                     kernel_initializer='he_normal'))
    model.add(BatchNormalization())
    model.add(Conv2D(384, kernel_size=(3, 3), strides=1,
                     padding='same', activation='relu',
                     kernel_initializer='he_normal'))
    model.add(BatchNormalization())
    model.add(Conv2D(256, kernel_size=(3, 3), strides=1,
                     padding='same', activation='relu',
                     kernel_initializer='he_normal'))
    model.add(BatchNormalization())
    model.add(MaxPooling2D(pool_size=(3, 3), strides=(2, 2),
                           padding='valid', data_format=None))
    model.add(Flatten())
    if len(model_args['output_shape']) != 1:
        raise ValueError('Expected last layer to be flat, but found shape {0}'
        .format(model_args['output_shape']))
    model.add(Dense(model_args['output_shape'][0], activation='sigmoid'))
    model.compile(
        optimizer=Adam(learning_rate=model_args['learning_rate']),
        loss='binary_crossentropy',
        metrics=['accuracy'])
    return model

In [12]:
def get_model(model_args: Dict[str, Any]) -> Model:
    """Returns a tensorflow.keras.models.Model instance for this dataset.
    :param model_args: Model hyperparameters. See DEFAULT_MODEL_ARGS for
    hyperparameters and their default values.
    :return: A tensorflow.keras.models.Model instance.
    """
    model_args = {**DEFAULT_MODEL_ARGS, **model_args}
    return _get_alex_net(model_args)

In [13]:
def train_model(model: Model,
                partition: Dict[str, List[str]],
                labels: Dict[str, np.ndarray],
                train_args: Dict[str, Any],
                root_dir: os.path,
                class_names: List[str]) -> History:
    """Trains the model and returns the History object that results.
    :param model: The model.
    :param partition: The train/val/test partition.
    :param labels: The labels.
    :param train_args: Training hyperparameters. See DEFAULT_TRAIN_ARGS for
    hyperparameters and their default values.
    :param root_dir: The root directory of the dataset, in which are located
    test and train subdirectories.
    :param class_names: The names of the classes, in order.
    :return: The History object that results from calling model.fit().
    """
    train_args = {**DEFAULT_TRAIN_ARGS, **train_args}
    df_train = pandas.DataFrame()
    df_train['filename'] = partition[TRAIN_KEY]
    for i, class_name in enumerate(class_names):
        df_train[class_name] = [labels[filename][i]
                                for filename in partition[TRAIN_KEY]]
    df_val = pandas.DataFrame()
    df_val['filename'] = partition[VAL_KEY]
    for i, class_name in enumerate(class_names):
        df_val[class_name] = [labels[filename][i]
                              for filename in partition[VAL_KEY]]
    if train_args['augment']:
        image_datagen = ImageDataGenerator(
            rotation_range=10.,
            width_shift_range=0.05,
            height_shift_range=0.05,
            zoom_range=0.2,
            channel_shift_range=0.05,
            horizontal_flip=True,
            vertical_flip=True,
            fill_mode='constant',
            data_format='channels_last',
            rescale=1 / 255)
    else:
        image_datagen = ImageDataGenerator(
            data_format='channels_last',
            rescale=1 / 255)
    train_image_dir = os.path.join(root_dir, 'train', 'train')
    train_generator = image_datagen.flow_from_dataframe(
        df_train,
        directory=train_image_dir,
        x_col='filename',
        y_col=class_names,
        target_size=model.input_shape[1:3],
        class_mode='raw',
        batch_size=train_args['batch_size'],
        shuffle=train_args['shuffle'])
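    # Validation images are only rescaled: augmenting them would distort the
    # validation metrics, and shuffling them is unnecessary.
    val_datagen = ImageDataGenerator(
        data_format='channels_last',
        rescale=1 / 255)
    val_generator = val_datagen.flow_from_dataframe(
        df_val,
        directory=train_image_dir,
        x_col='filename',
        y_col=class_names,
        target_size=model.input_shape[1:3],
        class_mode='raw',
        batch_size=train_args['batch_size'],
        shuffle=False)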
    callbacks = []
    if train_args['use_tensorboard']:
        log_dir = 'logs_{0}'.format(datetime.now())
        tensorboard_callback = TensorBoard(log_dir=log_dir)
        callbacks.append(tensorboard_callback)
    if train_args['model_checkpoint_filename']:
        checkpoint_callback = ModelCheckpoint(
            train_args['model_checkpoint_filename'])
        callbacks.append(checkpoint_callback)
    return model.fit(
        x=train_generator,
        epochs=train_args['epochs'],
        callbacks=callbacks,
        validation_data=val_generator,
        steps_per_epoch=math.ceil(len(partition[TRAIN_KEY]) /
                                  train_args['batch_size']),
        validation_steps=math.ceil(len(partition[VAL_KEY]) /
                                   train_args['batch_size']))

In [14]:
# TODO update to partition/labels.
def eval_model(model: Model, x_test: np.ndarray, y_test: np.ndarray,
               metric: str = 'accuracy') -> float:
    """Returns the model's performance on the test set as a single metric.
    :param model: The model.
    :param x_test: The preprocessed test dataset.
    :param y_test: The test labels.
    :param metric: The metric to use. One of METRICS (e.g., 'accuracy').
    :return: The model's performance as a single metric.
    """
    if metric not in METRICS:
        raise ValueError('Expected metric to be one of {0}, but found {1}'
        .format(METRICS, metric))
    result = model.evaluate(
        x=x_test,
        y=y_test,
        return_dict=True)
    return result[metric]
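
METRICS also lists precision, recall, and F1, but the model is compiled with accuracy only, so as far as I can tell `eval_model` can only return accuracy as written. One way to obtain the other metrics is to compute them per class from the predicted probabilities. The sketch below is a NumPy illustration under assumptions: `per_class_scores` is a hypothetical helper, the 0.5 threshold is arbitrary, and the toy arrays stand in for real labels and for the output of `model.predict`.

```python
from typing import Dict

import numpy as np


def per_class_scores(y_true: np.ndarray, y_prob: np.ndarray,
                     threshold: float = 0.5) -> Dict[str, np.ndarray]:
    """Returns per-class precision, recall, and F1 for multi-label output.
    :param y_true: The (n, c) multi-hot ground-truth labels.
    :param y_prob: The (n, c) predicted probabilities.
    :param threshold: The decision threshold applied to each probability.
    :return: A dict mapping each metric name to an array of length c.
    """
    y_pred = (y_prob >= threshold).astype(int)
    y_true = y_true.astype(int)
    true_pos = ((y_pred == 1) & (y_true == 1)).sum(axis=0)
    false_pos = ((y_pred == 1) & (y_true == 0)).sum(axis=0)
    false_neg = ((y_pred == 0) & (y_true == 1)).sum(axis=0)
    # np.maximum guards against division by zero for classes that have no
    # predictions or no positives; those scores come out as 0.
    precision = true_pos / np.maximum(true_pos + false_pos, 1)
    recall = true_pos / np.maximum(true_pos + false_neg, 1)
    f1 = 2 * precision * recall / np.maximum(precision + recall, 1e-12)
    return {'precision': precision, 'recall': recall, 'F1': f1}


# Toy labels and probabilities in the order normal, retinopathy, glaucoma.
y_true = np.array([[1, 0, 0], [0, 1, 0], [0, 1, 1], [0, 0, 1]])
y_prob = np.array([[0.9, 0.2, 0.1],
                   [0.6, 0.8, 0.3],
                   [0.1, 0.4, 0.7],
                   [0.2, 0.1, 0.9]])

scores = per_class_scores(y_true, y_prob)
for name, values in scores.items():
    print(name, values.round(3))
```

Reporting the scores per class, rather than as a single accuracy, makes it visible when the model does well on one disease and poorly on the other.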

In [15]:
def smooth_curve(points: List[float], factor: float = 0.6) -> List[float]:
    """Returns points smoothed over an exponential.
    :param points: The points of the curve to smooth.
    :param factor: The smoothing factor.
    :return: The smoothed points.
    """
    smoothed_points = []
    for point in points:
        if smoothed_points:
            prev = smoothed_points[-1]
            smoothed_points.append(prev * factor + point * (1 - factor))
        else:
            smoothed_points.append(point)
    return smoothed_points

In [16]:
def plot_history(history: History, smooth_fac: float = 0.6) -> None:
    """Plots the training history. You can also visualize the training process
    in Tensorboard or wandb.
    :param history: The training history.
    :param smooth_fac: The smoothing factor passed to smooth_curve().
    """
    # The model is compiled with metrics=['accuracy'], so the history keys are
    # 'accuracy' and 'val_accuracy'; fall back to 'acc' if that key is absent.
    acc_key = 'accuracy' if 'accuracy' in history.history else 'acc'
    acc = history.history[acc_key]
    val_acc = history.history['val_' + acc_key]
    loss = history.history['loss']
    val_loss = history.history['val_loss']
    epochs = range(1, len(acc) + 1)
    plt.plot(epochs, smooth_curve(acc, factor=smooth_fac), 'bo',
             label='Smoothed training acc')
    plt.plot(epochs, smooth_curve(val_acc, factor=smooth_fac), 'b',
             label='Smoothed validation acc')
    plt.title('Training and validation accuracy')
    plt.legend()
    plt.figure()
    plt.plot(epochs, smooth_curve(loss, factor=smooth_fac), 'bo',
             label='Smoothed training loss')
    plt.plot(epochs, smooth_curve(val_loss, factor=smooth_fac), 'b',
             label='Smoothed validation loss')
    plt.title('Training and validation loss')
    plt.legend()
    plt.show()

## Experiments

Below we run experiments on the dataset.

In [17]:
partition = get_partition(DEFAULT_ROOT_DIR, DEFAULT_DATASET_ARGS)
labels = get_labels(DEFAULT_ROOT_DIR, DEFAULT_CLASS_NAMES)
print('Found {0} training images, {1} val images, {2} test images.'.format(
    len(partition[TRAIN_KEY]), len(partition[VAL_KEY]),
    len(partition[TEST_KEY])))
print('Found {0} labels.'.format(len(labels.keys())))

Found 2405 training images, 1030 val images, 350 test images.
Found 3435 labels.


In [18]:
# Define hyperparameters.
model_args = {}
train_args = {'batch_size': 4}
metric = 'accuracy'

In [1]:
model = get_model(model_args)
history = train_model(model, partition, labels, train_args, DEFAULT_ROOT_DIR,
                      DEFAULT_CLASS_NAMES)
score = -1 #eval_model(model, x_test, y_test, metric=metric)
plot_history(history)
print('Model score ({0}): {1}'.format(metric, score))

NameError: name 'get_model' is not defined