# DS6050 Deep Learning Project
__Project: Covid19 Image Classification - Phase 1__
__Team: Paul Hicks(pdh2d), Sudharshan Luthra(sl3zs), Jay Hombal(mh4ey), Matt Dakolios(mrd7f)__

__Abstract:__ Our aim is to detect Covid19 from chest X-rays. The Covid19 image dataset we are using is small, with about 3000 images belonging to three classes: 'Normal', 'Covid19' and 'Pneumonia'. A dataset of this size is insufficient for a model to generalize. So, for the purpose of our project, in Phase 1 we first use the NIH X-ray image data to retrain and fine-tune pretrained model architectures such as ResNet50V2, MobileNetV2 and VGG16.

In Phase 2, we intend to reload the best saved model from Phase 1, then train, validate and fine-tune it, and finally evaluate its classifications on the target Covid19 dataset.

Reference:
1. https://www.kaggle.com/nih-chest-xrays/data
2. https://www.kaggle.com/mushaxyz/covid19-customized-xray-dataset

## Project Code Organization

Cookiecutter is a command-line utility that creates projects from cookiecutters (project templates), e.g. Python package projects, LaTeX documents, etc.
  
__Installed and created the project template using Cookiecutter:__  
Follow instructions from https://ericbassett.tech/cookiecutter-data-science-crash-course/



## Validating and pre-processing NIH X-ray metadata dataset  

The following instructions use the make tool. Run the commands from a terminal in your project folder.

__Setup__  
Setup python environment   

    1. Validate Python is installed and create required directories  
        Run: make test_environment  

__Data Extraction:(execute only once)__  

    2. Download and unzip the NIH X-ray images in data/raw    
        Run: make get_nih_images   

__Data Validation:(execute only once)__  

    3. Validate Dataset (rename columns and delete patient records with age greater than 100)   
        Run: make validate_nih_images   

__Data Preparation:(execute only once)__  

    4. Prepare Dataset (add path attribute, split dataset into train and validation dataset)
        Run: make prepare_nih_images

This produces three output files in the processed folder:
    1. prepared_data_entry_2017.csv (full dataset)
    2. prepared_train_data_entry_2017.csv (train_dataset)
    3. prepared_valid_data_entry_2017.csv (validation_dataset)

Next, we use the prepared_train_data_entry_2017.csv and prepared_valid_data_entry_2017.csv files to retrain CNN model architectures pre-trained on the ImageNet database.
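
The validation and preparation steps described above can be illustrated with a small sketch. This is not the project's `make` code: the helper `validate_and_split`, the column names `patient_id` and `patient_age`, and the toy data are assumptions made for illustration. It drops records with an age above 100 and then assigns whole patients to either the training or the validation set, so that no patient appears in both.

```python
import numpy as np
import pandas as pd


def validate_and_split(df, max_age=100, train_size=0.8, seed=42):
    """Drop implausible ages, then split by patient (hypothetical helper)."""
    # Validation: remove records whose age exceeds the cutoff.
    clean = df[df["patient_age"] <= max_age].reset_index(drop=True)

    # Preparation: shuffle the unique patient ids and assign whole patients
    # to the training set; everyone else goes to the validation set.
    rng = np.random.default_rng(seed)
    patients = rng.permutation(clean["patient_id"].unique())
    n_train = int(round(train_size * len(patients)))
    train_ids = list(patients[:n_train])

    is_train = clean["patient_id"].isin(train_ids)
    return clean[is_train], clean[~is_train]


# Toy metadata: patient 3 has an impossible age and is removed.
toy = pd.DataFrame({
    "patient_id":  [1, 1, 2, 3, 4, 5, 5, 6, 7, 8, 9, 10],
    "patient_age": [34, 35, 61, 412, 50, 72, 72, 28, 45, 66, 19, 80],
})
train_df, valid_df = validate_and_split(toy)
print(len(train_df), len(valid_df))
```

Splitting by patient rather than by image matters because one patient can have several X-rays; a row-level split could place near-identical images on both sides.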

## Imports and Setup

In [None]:
# common imports
import os
import numpy as np
import matplotlib.pylab as plt
import pandas as pd
import random
from glob import glob
from pathlib import Path
from functools import partial
from sklearn.model_selection import GroupShuffleSplit
from sklearn.model_selection import train_test_split

os.environ["TF_CPP_MIN_LOG_LEVEL"] = "2"
# let TensorFlow grow GPU memory on demand instead of reserving all VRAM up front
os.environ['TF_FORCE_GPU_ALLOW_GROWTH'] = 'true'

# TensorFlow ≥2.0 is required
import tensorflow as tf
from tensorflow import keras
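# compare the major version as a number; a string comparison would reject e.g. "10.0"
assert int(tf.__version__.split(".")[0]) >= 2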

import warnings
warnings.filterwarnings('ignore')


Change the working directory to the project folder, as the images are located in data/raw in the project folder __(Execute this cell once)__

In [None]:

if '/notebooks' in os.getcwd():
    os.chdir("../")
    print("set the project directory as working directory")
else:
    print(os.getcwd())

In [None]:
# Import functions for training the model
%load_ext autoreload
%autoreload 2
import src.models.train_model as train_model

# load tensorboard extension
%reload_ext tensorboard

## Data Ingestion

In [None]:
# Constants
SEED = 42
IMAGE_SIZE = (224, 224)
IMAGE_SHAPE = (224, 224, 3)
BATCH_SIZE = 32
SHUFFLE = True
NUM_CLASSES = 15 # number of classes
NUM_EPOCHS = 10
PRETRAINED_MODELS = ['ResNet50V2', 'MobileNetV2', 'VGG16']

# Train and validate function
def train_and_validate_model(model_name,
                             train_generator, 
                             valid_generator, 
                             save_model_filepath: str,
                             logs_dir: str,
                             freeze_layers:bool = True, 
                             activation: str = 'softmax', 
                             learning_rate: float =0.01, 
                             fine_tune_learning_rate: float = 0.0001,
                             fine_tune_at_layer:int = 186,
                             num_epochs:int = NUM_EPOCHS,
                             num_classes: int = NUM_CLASSES,
                             batch_size: int = BATCH_SIZE,
                             input_shape: tuple = IMAGE_SHAPE):
    
    print(model_name)
    
    my_model = train_model.get_base_model_with_new_toplayer(base_model=model_name,
                                                          freeze_layers = freeze_layers, 
                                                          num_classes = num_classes,
                                                          activation_func=activation,
                                                          learning_rate = learning_rate,
                                                          input_shape = input_shape)

    my_model_history = train_model.fit_model(my_model, 
                                             train_generator, 
                                             valid_generator,
                                             num_epochs=num_epochs,
                                             batch_size=batch_size,
                                             checkpoint_filepath=save_model_filepath,
                                             logs_dir = logs_dir)

    print(f'{model_name} Accuracy and Loss plots')
    train_model.plot_accuracy_and_loss(my_model_history)


    print("\n")
    #fine_tune model_name
    model_ft = train_model.fine_tune_model(my_model,
                                           fine_tune_learning_rate,
                                           optimizer='Adam',
                                           fine_tune_at_layer=fine_tune_at_layer,
                                           activation_func=activation)
    
    
    print("\n")
    print(f'Fine-Tuned {model_name} Training and Validation: ')
    model_ft_history = train_model.fit_model(model_ft, train_generator, 
            valid_generator, num_epochs=num_epochs,batch_size=batch_size)
    print(f'Fine-Tuned {model_name} Accuracy and Loss plots')
    train_model.plot_accuracy_and_loss(model_ft_history)
    return model_ft
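
The `fine_tune_at_layer` argument is passed to `train_model.fine_tune_model`, whose body is not shown in this notebook. A common way to use such an index is to keep every layer below it frozen and make the rest trainable. The sketch below shows that idea under this assumption; `unfreeze_from` is a hypothetical helper, and the toy model with 190 layers is an arbitrary stand-in for a Keras model (any object with a `layers` list whose items have a `trainable` attribute).

```python
from types import SimpleNamespace


def unfreeze_from(model, fine_tune_at_layer):
    """Freeze layers below the index and unfreeze the rest (hypothetical helper)."""
    model.trainable = True
    for index, layer in enumerate(model.layers):
        # Layers before the cut-off keep their pretrained weights fixed.
        layer.trainable = index >= fine_tune_at_layer
    # Return how many layers will be updated during fine-tuning.
    return sum(layer.trainable for layer in model.layers)


# Stand-in for a Keras model; the layer count of 190 is arbitrary.
toy_model = SimpleNamespace(
    trainable=False,
    layers=[SimpleNamespace(name=f"layer_{i}", trainable=False) for i in range(190)],
)
n_trainable = unfreeze_from(toy_model, fine_tune_at_layer=186)
print(n_trainable)  # 4
```

Under this scheme, a backbone with fewer layers than the index would stay fully frozen, so a single default value may not suit every architecture.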

In [None]:
def load_data():
    nih_xrays_train_df = pd.read_csv('data/processed/prepared_train_data_entry_2017.csv')
    nih_xrays_valid_df = pd.read_csv('data/processed/prepared_valid_data_entry_2017.csv')
    return nih_xrays_train_df,nih_xrays_valid_df
nih_xrays_train_df, nih_xrays_valid_df = load_data()

In [None]:
# Get the unique diagnosis labels.
# Each finding_label is a '|'-separated string; splitting it gives one list of labels per image.
# itertools.chain takes several iterables and returns a single flat iterable.
# The asterisk "*" unpacks the list of lists into separate arguments for chain.
from itertools import chain
all_labels = np.unique(list(chain(*nih_xrays_train_df['finding_label'].map(lambda x: x.split('|')).tolist())))
# remove the empty label
all_labels = [x for x in all_labels if len(x)>0]
print('All Labels ({}): {}'.format(len(all_labels), all_labels))
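
As a small illustration of the label extraction above, the following sketch runs the same split-and-flatten idea on a toy list and then turns each label string into a multi-hot vector. The `multi_hot` helper and the toy strings are assumptions for illustration; how `train_model.get_image_data_generator` actually encodes labels is not shown in this notebook.

```python
from itertools import chain

import numpy as np

finding_labels = ["Effusion|Infiltration", "Pneumonia", "Effusion", ""]

# Split each string, flatten with chain, and keep the sorted unique non-empty labels.
split_labels = [s.split("|") for s in finding_labels]
labels = sorted({lbl for lbl in chain(*split_labels) if lbl})


def multi_hot(finding, labels):
    """Return a 0/1 vector with one position per label (hypothetical helper)."""
    present = set(finding.split("|"))
    return np.array([1 if lbl in present else 0 for lbl in labels])


encoded = np.stack([multi_hot(s, labels) for s in finding_labels])
print(labels)   # ['Effusion', 'Infiltration', 'Pneumonia']
print(encoded)
```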

## Preprocess Images

In [None]:
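# use the Keras that ships with TensorFlow, consistent with the imports above
from tensorflow.keras.applications.resnet_v2 import preprocess_input
from tensorflow.keras.preprocessing.image import ImageDataGenerator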

train_generator = train_model.get_image_data_generator(nih_xrays_train_df,batch_size=BATCH_SIZE,image_size=IMAGE_SIZE,lables=all_labels,shuffle=True,seed=SEED)
valid_generator = train_model.get_image_data_generator(nih_xrays_valid_df,batch_size=BATCH_SIZE,image_size=IMAGE_SIZE,lables=all_labels,shuffle=True,seed=SEED)

### Visualize Images

In [None]:
t_x, t_y = next(train_generator)
fig, m_axs = plt.subplots(4, 4, figsize = (16, 16))
for (c_x, c_y, c_ax) in zip(t_x, t_y, m_axs.flatten()):
    c_ax.imshow(c_x[:,:,0], cmap = 'bone', vmin = -1.5, vmax = 1.5)
    c_ax.set_title(', '.join([n_class for n_class, n_score in zip(all_labels, c_y) 
                             if n_score>0.5]))
    c_ax.axis('off')

---

## Experiment 1: Classification using all NIH data
  
The train_model.py module (src/models/train_model.py) includes all functions for training the model.

### ResNet50V2

In [None]:
model_name = PRETRAINED_MODELS[0]
save_model_filepath = 'models/'+ model_name + 'exp1.h5'
logs_dir = 'logs/fit/ResNet50V2exp1'
model = train_and_validate_model(model_name = model_name, 
                                 train_generator=train_generator, 
                                 valid_generator=valid_generator, 
                                 save_model_filepath=save_model_filepath,
                                 logs_dir=logs_dir)



### MobileNetV2

In [None]:
model_name = PRETRAINED_MODELS[1]
save_model_filepath = 'models/'+ model_name + 'exp1.h5'
logs_dir = 'logs/fit/MobileNetV2exp1'
model = train_and_validate_model(model_name = model_name, 
                                 train_generator=train_generator, 
                                 valid_generator=valid_generator, 
                                 save_model_filepath=save_model_filepath,
                                 logs_dir=logs_dir)

### VGG16

In [None]:
model_name = PRETRAINED_MODELS[2]
save_model_filepath = 'models/'+ model_name + 'exp1.h5'
logs_dir = 'logs/fit/VGG16exp1'
model = train_and_validate_model(model_name = model_name, 
                                 train_generator=train_generator, 
                                 valid_generator=valid_generator, 
                                 save_model_filepath=save_model_filepath,
                                 logs_dir=logs_dir)
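
The three cells above differ only in the model name, so the checkpoint path and log directory could be derived in one place. The sketch below shows one way to do that; `experiment_paths` is a hypothetical helper, and the naming scheme simply mirrors the strings used in the cells above.

```python
from pathlib import Path

PRETRAINED_MODELS = ["ResNet50V2", "MobileNetV2", "VGG16"]


def experiment_paths(model_name, experiment):
    """Return (checkpoint path, log directory) for one run (hypothetical helper)."""
    checkpoint = Path("models") / f"{model_name}{experiment}.h5"
    logs = Path("logs") / "fit" / f"{model_name}{experiment}"
    return checkpoint.as_posix(), logs.as_posix()


for name in PRETRAINED_MODELS:
    print(experiment_paths(name, "exp1"))
```

Deriving both paths from the same model name avoids a mismatch between the saved checkpoint and its TensorBoard logs when a cell is copied and edited by hand.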

---

## Experiment 2: Balance the dataset

In [None]:
nih_xrays_df = pd.read_csv('data/processed/prepared_data_entry_2017.csv')

In [None]:
def sample_with_weights(df, all_labels, num_samples: int = 40000):
    for lbl in all_labels: 
        df[lbl] = df['finding_label'].map(lambda find: 1 if lbl in find else 0)
    # build the encoding from the df argument, not from the global nih_xrays_df
    df['encoding'] = [[1 if l in lbl.split('|') else 0 for l in all_labels] for lbl in df['finding_label']]

    class_count = {}
    for lbl in all_labels:
        class_count[lbl] = df[lbl].sum()

    classweight = {}
    for lbl in all_labels :
        classweight[lbl] = 1/class_count[lbl]

    classweight['NoFinding'] /= 2   # halve the weight of the no-finding class so it is sampled less often
    def apply_weights(row):
        weight = 0
        for lbl in all_labels: 
            if(row[lbl]==1):
                weight += classweight[lbl]
        return weight
    new_weights = df.apply(apply_weights, axis=1)
    # honor the num_samples argument instead of a hard-coded sample size
    sampled_data = df.sample(num_samples, weights = new_weights)

    
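    # use a list: recent pandas versions reject a set as a column indexer,
    # and a list keeps the column order fixed
    nih_required_columns = [
            'patient_id',
            'image_name',
            'path',
            'finding_label'
        ]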

  
    sampled_data = sampled_data[nih_required_columns]

    
    group_shuffle_split = GroupShuffleSplit(n_splits=1, train_size=0.8, random_state=42)

    for train_idx, valid_idx in group_shuffle_split.split(sampled_data,
        groups=sampled_data['patient_id'].values):
        train_df = sampled_data.iloc[train_idx]
        valid_df = sampled_data.iloc[valid_idx]
        
    return train_df, valid_df
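
The weighting idea in `sample_with_weights` can be seen on a toy table. In this sketch each class gets a weight of one over its count, and each row's weight is the sum of the weights of its labels, so rows with rare labels are more likely to be drawn. The labels `A`, `B`, `C` and the counts are made up for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "finding_label": ["A", "A", "A", "A", "A", "A", "B", "B", "A|C", "C"],
})
labels = ["A", "B", "C"]

# One indicator column per label, matched on the split string.
for lbl in labels:
    toy[lbl] = toy["finding_label"].map(lambda s, l=lbl: int(l in s.split("|")))

# Inverse-frequency class weights: A appears 7 times, B and C twice each.
class_weight = {lbl: 1.0 / toy[lbl].sum() for lbl in labels}

# A row's sampling weight is the sum of the weights of the labels it carries.
row_weight = sum(toy[lbl] * class_weight[lbl] for lbl in labels)

# pandas normalizes the weights; sampling is without replacement by default.
sampled = toy.sample(n=5, weights=row_weight, random_state=0)
print(row_weight.round(3).tolist())
print(sampled["finding_label"].tolist())
```

Weighted sampling only reduces the imbalance; with sampling without replacement, the common class can still dominate once the rare rows have been drawn.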

In [None]:
train_df, valid_df = sample_with_weights(nih_xrays_df,all_labels,num_samples=40000)
sampled_train_gen = train_model.get_image_data_generator(train_df,batch_size=BATCH_SIZE,image_size=IMAGE_SIZE,lables=all_labels,shuffle=True,seed=SEED)
sampled_valid_gen = train_model.get_image_data_generator(valid_df,batch_size=BATCH_SIZE,image_size=IMAGE_SIZE,lables=all_labels,shuffle=True,seed=SEED)

### ResNet50V2

In [None]:
model_name = PRETRAINED_MODELS[0]
save_model_filepath = 'models/'+ model_name + 'exp2.h5'
logs_dir = 'logs/fit/ResNet50V2exp2'
model = train_and_validate_model(model_name = model_name, 
                                 train_generator=sampled_train_gen, 
                                 valid_generator=sampled_valid_gen, 
                                 save_model_filepath=save_model_filepath,
                                 logs_dir=logs_dir)

### MobileNetV2

In [None]:
model_name = PRETRAINED_MODELS[1]
save_model_filepath = 'models/'+ model_name + 'exp2.h5'
logs_dir = 'logs/fit/MobileNetV2exp2'
model = train_and_validate_model(model_name = model_name, 
                                 train_generator=sampled_train_gen, 
                                 valid_generator=sampled_valid_gen, 
                                 save_model_filepath=save_model_filepath,
                                 logs_dir=logs_dir)

### VGG16

In [None]:
model_name = PRETRAINED_MODELS[2]
save_model_filepath = 'models/'+ model_name + 'exp2.h5'
logs_dir = 'logs/fit/VGG16exp2'
model = train_and_validate_model(model_name = model_name, 
                                 train_generator=sampled_train_gen, 
                                 valid_generator=sampled_valid_gen, 
                                 save_model_filepath=save_model_filepath,
                                 logs_dir=logs_dir)

---

## Experiment 3: Sub-Sampling Classes

In [None]:
sub_samples = ['Cardiomegaly','Effusion','Emphysema', 'Fibrosis', 'Infiltration', 'Pneumonia', 'Pneumothorax','Pleural_Thickening']

In [None]:
sub_nih_xrays_train_df  = nih_xrays_train_df [nih_xrays_train_df['finding_label'].isin(sub_samples)]
sub_nih_xrays_valid_df = nih_xrays_valid_df[nih_xrays_valid_df['finding_label'].isin(sub_samples)]
sub_sampled_train_gen = train_model.get_image_data_generator(sub_nih_xrays_train_df,batch_size=BATCH_SIZE,image_size=IMAGE_SIZE,lables=sub_samples,shuffle=True,seed=SEED)
sub_sampled_valid_gen = train_model.get_image_data_generator(sub_nih_xrays_valid_df,batch_size=BATCH_SIZE,image_size=IMAGE_SIZE,lables=sub_samples,shuffle=True,seed=SEED)
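
Note that `isin` compares the whole `finding_label` string, so the filter above keeps only images whose label is exactly one of the selected classes; images with several findings joined by `|` are dropped. The toy sketch below contrasts this with a filter that keeps a row when any of its labels is selected. The second filter is shown only for comparison and is not what the cell above does.

```python
import pandas as pd

df = pd.DataFrame({
    "finding_label": ["Effusion", "Effusion|Infiltration", "Pneumonia", "Hernia"],
})
subset = ["Effusion", "Pneumonia"]

# Exact match on the full string, as in the cell above.
exact = df[df["finding_label"].isin(subset)]

# Alternative for comparison: keep rows where any single label is in the subset.
any_match = df[df["finding_label"].map(
    lambda s: any(lbl in subset for lbl in s.split("|"))
)]

print(exact["finding_label"].tolist())      # ['Effusion', 'Pneumonia']
print(any_match["finding_label"].tolist())  # ['Effusion', 'Effusion|Infiltration', 'Pneumonia']
```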

### ResNet50V2

In [None]:
model_name = PRETRAINED_MODELS[0]
save_model_filepath = 'models/'+ model_name + 'exp3.h5'
logs_dir = 'logs/fit/ResNet50V2exp3'
model = train_and_validate_model(model_name = model_name, 
                                 train_generator=sub_sampled_train_gen, 
                                 valid_generator=sub_sampled_valid_gen, 
                                 save_model_filepath=save_model_filepath,
                                 logs_dir=logs_dir,
                                 num_classes=len(sub_samples))

### MobileNetV2

In [None]:
model_name = PRETRAINED_MODELS[1]
save_model_filepath = 'models/'+ model_name + 'exp3.h5'
logs_dir = 'logs/fit/MobileNetV2exp3'
model = train_and_validate_model(model_name = model_name, 
                                 train_generator=sub_sampled_train_gen, 
                                 valid_generator=sub_sampled_valid_gen, 
                                 save_model_filepath=save_model_filepath,
                                 logs_dir=logs_dir,
                                 num_classes=len(sub_samples))

### VGG16

In [None]:
model_name = PRETRAINED_MODELS[2]
save_model_filepath = 'models/'+ model_name + 'exp3.h5'
logs_dir = 'logs/fit/VGG16exp3'
model = train_and_validate_model(model_name = model_name, 
                                 train_generator=sub_sampled_train_gen, 
                                 valid_generator=sub_sampled_valid_gen, 
                                 save_model_filepath=save_model_filepath,
                                 logs_dir=logs_dir,
                                 num_classes=len(sub_samples))

---

## Experiment 4: Randomly reducing multiple labels to a single label for images with multiple labels

In [None]:
import random
random.seed(SEED)  # make the random label choice reproducible
# copy so the original train/validation DataFrames keep their multi-label strings
reduced_nih_xrays_train_df = nih_xrays_train_df.copy()
reduced_nih_xrays_valid_df = nih_xrays_valid_df.copy()
reduced_nih_xrays_train_df['finding_label'] = reduced_nih_xrays_train_df['finding_label'].map( lambda x : random.choice(x.split('|')) )
reduced_nih_xrays_valid_df['finding_label'] = reduced_nih_xrays_valid_df['finding_label'].map( lambda x : random.choice(x.split('|')) )
reduced_sampled_train_gen = train_model.get_image_data_generator(reduced_nih_xrays_train_df,batch_size=BATCH_SIZE,image_size=IMAGE_SIZE,lables=all_labels,shuffle=True,seed=SEED)
reduced_sampled_valid_gen = train_model.get_image_data_generator(reduced_nih_xrays_valid_df,batch_size=BATCH_SIZE,image_size=IMAGE_SIZE,lables=all_labels,shuffle=True,seed=SEED)
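
The label reduction can be checked on a toy frame. This sketch picks one label per row with a seeded `random.Random` instance and works on a copy, so the multi-label strings in the source frame are left intact. The toy labels and the variable names are assumptions for illustration.

```python
import random

import pandas as pd

original = pd.DataFrame({
    "finding_label": ["Effusion|Infiltration", "Pneumonia", "Atelectasis|Mass|Nodule"],
})

rng = random.Random(42)      # dedicated, seeded generator for reproducibility
reduced = original.copy()    # leave the source frame untouched
reduced["finding_label"] = reduced["finding_label"].map(
    lambda s: rng.choice(s.split("|"))
)

print(original["finding_label"].tolist())
print(reduced["finding_label"].tolist())
```

Each reduced label is one of the labels the image originally had, but the other findings are discarded, so the same image can end up with a different target on a rerun that uses another seed.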

### ResNet50V2

In [None]:
model_name = PRETRAINED_MODELS[0]
save_model_filepath = 'models/'+ model_name + 'exp4.h5'
logs_dir = 'logs/fit/ResNet50V2exp4'
model = train_and_validate_model(model_name = model_name, 
                                 train_generator=reduced_sampled_train_gen, 
                                 valid_generator=reduced_sampled_valid_gen, 
                                 save_model_filepath=save_model_filepath,
                                 logs_dir=logs_dir)

### MobileNetV2

In [None]:
model_name = PRETRAINED_MODELS[1]
save_model_filepath = 'models/'+ model_name + 'exp4.h5'
logs_dir = 'logs/fit/MobileNetV2exp4'
model = train_and_validate_model(model_name = model_name, 
                                 train_generator=reduced_sampled_train_gen, 
                                 valid_generator=reduced_sampled_valid_gen, 
                                 save_model_filepath=save_model_filepath,
                                 logs_dir=logs_dir)

### VGG16

In [None]:
model_name = PRETRAINED_MODELS[2]
save_model_filepath = 'models/'+ model_name + 'exp4.h5'
logs_dir = 'logs/fit/VGG16exp4'
model = train_and_validate_model(model_name = model_name, 
                                 train_generator=reduced_sampled_train_gen, 
                                 valid_generator=reduced_sampled_valid_gen, 
                                 save_model_filepath=save_model_filepath,
                                 logs_dir=logs_dir)

In [None]:
%tensorboard --logdir logs/fit/

### Clean Up
Run this cell after completing execution of the notebook.

In [None]:
# clear gpu memory
from numba import cuda 
device = cuda.get_current_device()
device.reset()

__Run this command from the command prompt__  
  
  
jupyter-nbconvert --to pdf COVID-19-Image-Classification-phase1.ipynb