<a href="https://colab.research.google.com/github/mthomp89/NU_489_capstone/blob/develop_thompson/NIH_DenseNet_fine_tune_Colab.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Deep Transfer Learning on NIH Chest-Xray 8 Sample

This notebook covers two models for transfer learning: DenseNet121 and Inception_v3.

* Expected input shape for DenseNet121 is (224, 224)
* Expected input shape for Inception_v3 is (299, 299)
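
Because the input size has to be changed in several cells when switching models, one option is to keep it in a single lookup. The sketch below is illustrative only: `INPUT_SIZES` and `input_shape` are hypothetical names that do not appear elsewhere in this notebook.

```python
# Hypothetical helper: one place to look up the image size for each backbone.
INPUT_SIZES = {
    "DenseNet121": (224, 224),
    "Inception_v3": (299, 299),
}

def input_shape(model_name, channels=3):
    """Return (height, width, channels) for the chosen backbone."""
    height, width = INPUT_SIZES[model_name]
    return (height, width, channels)

print(input_shape("DenseNet121"))
print(input_shape("Inception_v3"))
```

The first two entries of the returned tuple could serve as `target_size` for the generators, and the full tuple as the `Input` shape.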

The notebook consists of a data load, limited EDA, a data pipeline (preprocessing and augmentation), and modeling with one of the above architectures at a time.

Fine-tuning of the selected model is also covered and consists of setting particular layers of the selected model as trainable.

*Credits:*
* *malgamves/DeepClassifyML https://github.com/malgamves/DeepClassifyML/blob/master/Inception_v3_fine_tune.ipynb*
* *xhlulu https://www.kaggle.com/xhlulu/transfer-learning-with-densenet-keras*

====================================================

### **Do not run all cells**
### **Several modeling experiments follow**
### **Run only the model of interest then evaluate**

It should be noted that these images require some significant pre-processing and/or relabeling for best results. For this exercise we will do only minimal pre-processing of the images. See the following blogpost for more detail about specific challenges associated with this dataset: https://lukeoakdenrayner.wordpress.com/2017/12/18/the-chestxray14-dataset-problems/.

## **TO-DOs**

* Move notebook to Kaggle site and use full NIH Chest-Xray 8 dataset to mitigate overfitting
* Figure out how to use the `BalancedDataGenerator` class (https://medium.com/analytics-vidhya/how-to-apply-data-augmentation-to-deal-with-unbalanced-datasets-in-20-lines-of-code-ada8521320c9)
* Institute additional callbacks such as `EarlyStopping`, `ReduceLROnPlateau`, and `ModelCheckpoint` (https://www.kaggle.com/paultimothymooney/predicting-pathologies-in-x-ray-images)
* Visualize or confirm by tensor that classes are balanced after using the `BalancedDataGenerator` class or the `sklearn.utils.class_weight` class 
* Experiment with `batch_size`, `momentum`, `learning rate`, `activation`, and `optimizer` 
* Experiment with adding another model such as VGG16 as the top layers
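
For the to-do item about confirming that classes are balanced, one simple check is the fraction of positive labels per class in a batch of multi-hot targets. This is a minimal sketch on made-up data; `positive_rate_per_class` is a hypothetical helper, and the toy labels are not taken from the dataset.

```python
import numpy as np

def positive_rate_per_class(y_batch, class_names):
    """Fraction of samples in a multi-hot batch that carry each label."""
    y_batch = np.asarray(y_batch, dtype=float)
    return dict(zip(class_names, y_batch.mean(axis=0)))

# Toy multi-hot labels: 4 samples, 3 classes (illustrative data only).
toy_labels = [[1, 0, 0],
              [1, 1, 0],
              [0, 0, 0],
              [1, 0, 1]]
rates = positive_rate_per_class(toy_labels, ["Atelectasis", "Effusion", "Mass"])
for name, rate in rates.items():
    print(f"{name}: {rate:.2f}")
```

Roughly equal rates across classes would suggest that the balancing step is having the intended effect.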



# Set environment

In [0]:
from google.colab import files 
from google.colab import drive

drive.mount('/content/drive/')

**Import kaggle.json**

In [0]:
files.upload()

**Wire-up Kaggle**

In [0]:
!mkdir -p ~/.kaggle

In [0]:
!cp kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json
!ls ~/.kaggle

**Verify json**

In [0]:
!ls -l ~/.kaggle
!cat ~/.kaggle/kaggle.json

In [0]:
!pip install -q kaggle 
!pip install -q kaggle-cli

**Import libraries**

In [0]:
# Load the TensorBoard notebook extension to visualize evaluation of the model
%load_ext tensorboard

In [0]:
from keras.applications.inception_v3 import InceptionV3
from keras.applications.densenet import DenseNet121
from keras.preprocessing.image import ImageDataGenerator
from keras.utils.data_utils import Sequence
from imblearn.over_sampling import RandomOverSampler
from imblearn.keras import balanced_batch_generator
from keras.models import Model
from keras.layers import Dense, GlobalAveragePooling2D, Dropout, Input
from keras import backend as K
import pandas as pd
import sklearn as sk
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
from glob import glob
import numpy as np
import os
import time
import datetime
import tensorflow as tf

In [0]:
class BalancedDataGenerator(Sequence):
    """ImageDataGenerator + RandomOversampling"""
    def __init__(self, x, y, datagen, batch_size=32):
        self.datagen = datagen
        self.batch_size = batch_size
        self._shape = x.shape        
        datagen.fit(x)
        self.gen, self.steps_per_epoch = balanced_batch_generator(x.reshape(x.shape[0], -1), y, sampler=RandomOverSampler(), batch_size=self.batch_size, keep_sparse=True)

    def __len__(self):
        return self._shape[0] // self.batch_size

    def __getitem__(self, idx):
        x_batch, y_batch = self.gen.__next__()
        x_batch = x_batch.reshape(-1, *self._shape[1:])
        return self.datagen.flow(x_batch, y_batch, batch_size=self.batch_size).next()
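
The class above relies on random oversampling. As a rough illustration of that idea only (not imbalanced-learn's implementation), the sketch below repeats minority-class rows until every class matches the largest one. It assumes single-label integer targets, and `random_oversample_indices` is a hypothetical name.

```python
import numpy as np

def random_oversample_indices(y, seed=0):
    """Return row indices that repeat minority-class rows until every
    class has as many rows as the largest class."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    picked = []
    for cls, count in zip(classes, counts):
        idx = np.flatnonzero(y == cls)
        # Draw the missing rows with replacement from this class only.
        extra = rng.choice(idx, size=target - count, replace=True)
        picked.append(np.concatenate([idx, extra]))
    return np.concatenate(picked)

y = np.array([0, 0, 0, 0, 1, 1, 2])   # toy labels, heavily imbalanced
balanced = y[random_oversample_indices(y)]
print(np.unique(balanced, return_counts=True))
```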

**Data load**

In [0]:
os.chdir('/content/drive/My Drive/')
!mkdir -p nih-chest-xrays

In [0]:
# Download into the folder that the next cell unzips from
!kaggle datasets download -d nih-chest-xrays/sample -p nih-chest-xrays

In [0]:
os.chdir('/content/drive/My Drive/nih-chest-xrays')
!unzip -q sample.zip

In [0]:
df = pd.read_csv("sample_labels.csv")

In [0]:
# Append image paths to csv; required for 'datagen.flow_from_dataframe'
image_paths = {os.path.basename(x): x for x in 
                   glob(os.path.join('/content/drive/My Drive/nih-chest-xrays/images/', '*.png'))}
print('Scans found:', len(image_paths), ', Total Headers', df.shape[0])

In [0]:
df['paths'] = df['Image Index'].map(image_paths.get)

# EDA - Exploratory Data Analysis


In [0]:
df.head()

In [0]:
des_df = df.describe()
des_df.T

In [0]:
# Visualize distribution of data

label_counts = df['Finding Labels'].value_counts()[:15]
fig, ax1 = plt.subplots(1,1,figsize = (12, 8))
ax1.bar(np.arange(len(label_counts))+0.5, label_counts)
ax1.set_xticks(np.arange(len(label_counts))+0.5)
_ = ax1.set_xticklabels(label_counts.index, rotation = 90)

In [0]:
df['Finding Labels'] = df['Finding Labels'].map(lambda x: x.replace('No Finding',''))

from itertools import chain
all_labels = np.unique(list(chain(*df['Finding Labels'].map(lambda x: x.split('|')).tolist())))
all_labels = [x for x in all_labels if len(x)>0]

print('All Labels ({}): {}'.format(len(all_labels), all_labels))

for c_label in all_labels:
    if len(c_label)>1: # leave out empty labels
        df[c_label] = df['Finding Labels'].map(lambda finding: 1.0 if c_label in finding else 0)
df.sample(3)

In [0]:
# Eliminate observations having null values/no findings of disease

#df = df[~df['Finding Labels'].str.contains('No Finding')]
#df.shape

In [0]:
# Weight dataset to achieve a normal distribution when sampling
# Weight is 0.1 + number of findings

#sample_weights = df['Finding Labels'].map(lambda x: len(x.split('|')) if len(x)>100 else 0).values + 4e-2
#sample_weights /= sample_weights.sum()
#df = df.sample(df.shape[0], weights=sample_weights)

# Visualize new distribution

#label_counts = df['Finding Labels'].value_counts()[:15]
#fig, ax1 = plt.subplots(1,1,figsize = (12, 8))
#ax1.bar(np.arange(len(label_counts))+0.5, label_counts)
#ax1.set_xticks(np.arange(len(label_counts))+0.5)
#_ = ax1.set_xticklabels(label_counts.index, rotation = 90)

In [0]:
# Reload the raw labels; the next cell needs `labels` with the original 'No Finding' values
labels = pd.read_csv('sample_labels.csv')

In [0]:
labels = labels[['Image Index','Finding Labels','Follow-up #','Patient ID','Patient Age','Patient Gender']]

# Create new columns for each disease
pathology_list = ['Cardiomegaly',
                  'Emphysema',
                  'Effusion',
                  'Hernia',
                  'Nodule',
                  'Pneumothorax',
                  'Atelectasis',
                  'Pleural_Thickening',
                  'Mass',
                  'Edema',
                  'Consolidation',
                  'Infiltration',
                  'Fibrosis',
                  'Pneumonia']

for pathology in pathology_list :
    labels[pathology] = labels['Finding Labels'].apply(lambda x: 1 if pathology in x else 0)

# Remove `Y` after age
labels['Age']=labels['Patient Age'].apply(lambda x: x[:-1]).astype(int)

# Visualize dataset by disease by gender

plt.figure(figsize=(15,10))
gs = gridspec.GridSpec(8,1)
ax1 = plt.subplot(gs[:7, :])
ax2 = plt.subplot(gs[7, :])
data1 = pd.melt(labels,
             id_vars=['Patient Gender'],
             value_vars = list(pathology_list),
             var_name = 'Category',
             value_name = 'Count')
data1 = data1.loc[data1.Count>0]
g=sns.countplot(y='Category',hue='Patient Gender',data=data1, ax=ax1, order = data1['Category'].value_counts().index)
ax1.set( ylabel="",xlabel="")
ax1.legend(fontsize=20)
ax1.set_title('X Ray partition',fontsize=18);

labels['Nothing']=labels['Finding Labels'].apply(lambda x: 1 if 'No Finding' in x else 0)

data2 = pd.melt(labels,
             id_vars=['Patient Gender'],
             value_vars = list(['Nothing']),
             var_name = 'Category',
             value_name = 'Count')
data2 = data2.loc[data2.Count>0]
g = sns.countplot(y='Category',hue='Patient Gender',data=data2,ax=ax2)
ax2.set( ylabel="",xlabel="Number of patients")
ax2.legend('')
plt.subplots_adjust(hspace=.5)

**Drop unused columns**

In [0]:
df_drop = df.drop(['Follow-up #',
                   'Patient ID',
                   'Patient Age',
                   'Patient Gender',
                   'View Position',
                   'OriginalImageWidth',
                   'OriginalImageHeight',
                   'OriginalImagePixelSpacing_x',
                   'OriginalImagePixelSpacing_y'], axis = 1)

In [0]:
# Saving cleaned csv
df_drop.to_csv('/content/drive/My Drive/nih-chest-xrays/cleaned_sample.csv', index = False)

# Prepare data

In [0]:
df = pd.read_csv('/content/drive/My Drive/nih-chest-xrays/cleaned_sample.csv')

In [0]:
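# Creating a list of all labels ('No Finding' rows were blanked earlier and
# are read back as NaN, so they become empty lists)
df['Finding Labels'] = df['Finding Labels'].fillna('').apply(lambda x: [label for label in x.split('|') if label])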

In [0]:
#os.chdir('/content/drive/My Drive/nih-chest-xrays/')

In [0]:
# Create and define image generators for data augmentation
datagen = ImageDataGenerator(rescale = 1./255., 
                             samplewise_center=True,
                             samplewise_std_normalization=True,
                             horizontal_flip = True,
                             vertical_flip = False,
                             height_shift_range= 0.05,
                             width_shift_range=0.1,
                             rotation_range=5,
                             shear_range = 0.1,
                             fill_mode = 'reflect',
                             zoom_range=0.15)



train_gen = datagen.flow_from_dataframe(dataframe = df[:1900],
                                        directory = '/content/drive/My Drive/nih-chest-xrays/',
                                        x_col ='paths',
                                        y_col ='Finding Labels',
                                        batch_size = 64,
                                        seed = 42,
                                        shuffle = True,
                                        class_mode = 'categorical',
                                        classes = ['Atelectasis','Cardiomegaly','Consolidation','Edema','Effusion',
                                                   'Emphysema', 'Fibrosis','Infiltration','Mass','Nodule','Pleural_Thickening',
                                                   'Pneumonia','Pneumothorax'], 
                                        
                                        # Inception_v3 target size = (299, 299)
                                        #target_size = (299,299))
                                        
                                        # DenseNet121 target size = (224, 224)
                                        target_size = (224, 224))


valid_gen = datagen.flow_from_dataframe(dataframe = df[1900:2000],
                                        directory = '/content/drive/My Drive/nih-chest-xrays/',
                                        x_col ='paths',
                                        y_col ='Finding Labels',
                                        batch_size = 64,
                                        seed = 42,
                                        shuffle = True,
                                        class_mode = 'categorical',
                                        classes = ['Atelectasis','Cardiomegaly','Consolidation','Edema','Effusion',
                                                   'Emphysema', 'Fibrosis','Infiltration','Mass','Nodule','Pleural_Thickening',
                                                   'Pneumonia','Pneumothorax'],
                                        
                                        # Inception_v3 target size = (299, 299)
                                        #target_size = (299,299))
                                        
                                        # DenseNet121 target size = (224, 224)
                                        target_size = (224, 224))


test_gen = datagen.flow_from_dataframe(dataframe = df[2000:],
                                       directory = '/content/drive/My Drive/nih-chest-xrays/',
                                       x_col ='paths',
                                       y_col ='Finding Labels',
                                       batch_size = 64,
                                       seed = 42,
                                       shuffle = True,
                                       class_mode = 'categorical',
                                       classes = ['Atelectasis','Cardiomegaly','Consolidation','Edema','Effusion',
                                                  'Emphysema', 'Fibrosis','Infiltration','Mass','Nodule','Pleural_Thickening',
                                                  'Pneumonia','Pneumothorax'],
                                       
                                       # Inception_v3 target size = (299, 299)
                                       #target_size = (299,299))
                                       
                                       # DenseNet121 target size = (224, 224)
                                       target_size = (224, 224))
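
The generator settings `rescale`, `samplewise_center`, and `samplewise_std_normalization` amount to scaling each image and then standardizing it on its own statistics. The NumPy sketch below shows roughly what that does to a single image. It is an approximation for illustration, not Keras's code: `standardize_sample` and the `eps` value are assumptions, and the random array is a stand-in for an X-ray.

```python
import numpy as np

def standardize_sample(image, rescale=1.0 / 255.0, eps=1e-6):
    """Rescale one image, then centre it and scale it to unit variance."""
    x = np.asarray(image, dtype=np.float64) * rescale
    x = x - x.mean()
    return x / (x.std() + eps)

rng = np.random.default_rng(0)
fake_image = rng.integers(0, 256, size=(224, 224, 3))  # stand-in for an X-ray
out = standardize_sample(fake_image)
print(round(float(out.mean()), 6), round(float(out.std()), 3))
```

Because each image is normalized independently, any image passed to the model at prediction time needs the same treatment.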


# Balanced Data Generator


In [0]:
# Oversample minority classes so that batches are balanced across classes

#bgen = BalancedDataGenerator(x_col, y_col, datagen, batch_size=32)
#BGEN_STEPS_PER_EPOCH = bgen.steps_per_epoch

# **STOP**
Several modeling experiments follow. Run only the model of interest, then evaluate.

# Modeling
Select which of the two models to use by changing lines 5 and 6 of the next cell (`input_tensor` and `base_model`).

In [0]:
# Model
# Inception_v3 input tensor = (299, 299, 3)
# DenseNet121 input tensor = (224, 224, 3)

input_tensor = Input(shape=(224, 224, 3))  
base_model = DenseNet121(input_tensor = input_tensor, weights = 'imagenet', include_top= False)

# Add new layers
x = base_model.output
x = GlobalAveragePooling2D()(x)
x = Dense(2048, activation = 'relu')(x)
drop = Dropout(0.5)(x)
cl = Dense(13, activation = 'sigmoid')(drop)

model = Model(inputs=input_tensor, outputs = cl )

# Train only the top layers (which were randomly initialized)
for layer in base_model.layers:
    layer.trainable = False
    
model.compile(optimizer = 'rmsprop', loss ='binary_crossentropy', metrics= ['accuracy'])
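
The head above ends in a 13-unit `sigmoid` layer trained with `binary_crossentropy`, which treats each pathology as its own yes/no decision so that one image can carry several findings. The NumPy sketch below illustrates that behaviour on toy numbers. The logits and labels are made up, and the helper functions are written here for illustration rather than taken from Keras.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy over all samples and labels."""
    y_prob = np.clip(y_prob, eps, 1.0 - eps)
    losses = -(y_true * np.log(y_prob) + (1.0 - y_true) * np.log(1.0 - y_prob))
    return losses.mean()

logits = np.array([[2.0, -1.0, 0.0]])   # one image, three labels (toy values)
y_true = np.array([[1.0, 0.0, 1.0]])    # two findings present at once
probs = sigmoid(logits)
loss = binary_crossentropy(y_true, probs)

# Unlike softmax, the per-label probabilities are not forced to sum to 1.
print(probs.round(3), float(probs.sum()))
print(float(loss))
```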

In [0]:
# Define step sizes from each generator
STEP_SIZE_TRAIN = train_gen.n//train_gen.batch_size
STEP_SIZE_VALID = valid_gen.n//valid_gen.batch_size
STEP_SIZE_TEST = test_gen.n//test_gen.batch_size

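# Weight classes inversely to their frequency to offset class imbalance
# (sklearn.utils.class_weight is an alternative to the balanced generator)

from itertools import chain
from sklearn.utils import class_weight

# train_gen.classes holds one list of class indices per image (multi-label),
# so flatten it before counting
flat_labels = np.array(list(chain.from_iterable(train_gen.classes)))
present = np.unique(flat_labels)

weights = class_weight.compute_class_weight(
               class_weight = 'balanced',
               classes = present,
               y = flat_labels)

class_weights = dict(zip(present, weights))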
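
For reference, scikit-learn's `'balanced'` heuristic weights each class by `n_samples / (n_classes * count)`, so rare classes receive larger weights. The sketch below reproduces that formula in NumPy on toy label indices. `balanced_class_weights` is a hypothetical helper, and the data is illustrative.

```python
import numpy as np

def balanced_class_weights(y):
    """weight[c] = n_samples / (n_classes * count[c])"""
    y = np.asarray(y)
    classes, counts = np.unique(y, return_counts=True)
    weights = len(y) / (len(classes) * counts)
    return dict(zip(classes.tolist(), weights.tolist()))

toy_y = [0, 0, 0, 0, 0, 0, 1, 1, 2, 2]   # illustrative label indices
toy_weights = balanced_class_weights(toy_y)
print(toy_weights)
```

In this toy case the majority class gets a weight below 1 and the two minority classes get equal weights above 1.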

In [0]:
log_dir = "logs/fit/" + datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir=log_dir, histogram_freq=1)

## Initial Training

In [0]:
# Training 

startTrainTime = time.time()

history = model.fit_generator(generator = train_gen, steps_per_epoch = STEP_SIZE_TRAIN,
                   validation_data = valid_gen,
                   validation_steps = STEP_SIZE_VALID,
                   class_weight = class_weights,
                   callbacks = [tensorboard_callback],
                   epochs = 25)

endTrainTime = time.time()
trainTime = endTrainTime - startTrainTime
print()
print('Total Training Time (sec): {}'.format(trainTime))

In [0]:
%tensorboard --logdir logs/fit

In [0]:
# saving model
model.save_weights('initial_model.h5')

## Initial Training - Tuning
Decrease epochs

In [0]:
# Training 

#startTrainTime = time.time()

#history = model.fit_generator(generator = train_gen, 
#                              steps_per_epoch = STEP_SIZE_TRAIN,
#                              validation_data = valid_gen,
#                              validation_steps = STEP_SIZE_VALID,
#                              class_weight = class_weights,
#                              callbacks = [tensorboard_callback],
#                              epochs = 5)

#endTrainTime = time.time()
#trainTime = endTrainTime - startTrainTime
#print()
#print('Total Training Time (sec): {}'.format(trainTime))

In [0]:
#%tensorboard --logdir logs/fit

In [0]:
# saving model
#model.save_weights('model.h5')

# Fine Tuning of Selected Model

Up to this point, the selected model has been used as a fixed feature extractor, with only the two added dense layers trainable. This next section also sets the top convolutional layers of the selected model as trainable.
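
The freezing pattern itself is just a split at a layer index: everything below the cutoff stays frozen, everything from the cutoff onward is trainable. The sketch below shows that logic with stand-in objects rather than real Keras layers, so it runs without building a network. `freeze_below` and the layer names are hypothetical.

```python
from types import SimpleNamespace

def freeze_below(layers, first_trainable):
    """Freeze layers[:first_trainable] and unfreeze the rest.
    Returns the number of trainable layers."""
    for index, layer in enumerate(layers):
        layer.trainable = index >= first_trainable
    return sum(layer.trainable for layer in layers)

# Stand-in objects with a `trainable` attribute, used instead of real Keras
# layers so the sketch runs without building a network.
layers = [SimpleNamespace(name=f"layer_{i}", trainable=False) for i in range(10)]
n_trainable = freeze_below(layers, first_trainable=7)
print(n_trainable)  # 3
```

With a real Keras model, the model has to be compiled again after changing `trainable` flags for the change to take effect, which is why the cells below call `model.compile` once more.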

In [0]:
for i, layer in enumerate(base_model.layers):
   print(i, layer.name)

In [0]:
# Set top convolution layers (313 onward) as trainable
for layer in model.layers[:313]:
   layer.trainable = False
for layer in model.layers[313:]:
   layer.trainable = True

In [0]:
from keras.optimizers import SGD
model.compile(optimizer=SGD(lr=0.0001, momentum=0.9), loss ='binary_crossentropy', metrics= ['accuracy'])

# Fine-Tuned Training

## Tuned Training - Round 1

In [0]:
startTrainTime = time.time()

history = model.fit_generator(generator = train_gen, 
                              steps_per_epoch = STEP_SIZE_TRAIN,
                              class_weight = class_weights,
                              validation_data = valid_gen,
                              validation_steps = STEP_SIZE_VALID,
                              callbacks = [tensorboard_callback],
                              epochs = 25)

endTrainTime = time.time()
trainTime = endTrainTime - startTrainTime
print()
print('Total Training Time (sec): {}'.format(trainTime))

In [0]:
%tensorboard --logdir logs/fit

## Tuned Training - Round 2
Decrease epochs

In [0]:
#startTrainTime = time.time()

#history = model.fit_generator(generator = train_gen, 
#                              steps_per_epoch = STEP_SIZE_TRAIN,
#                              validation_data = valid_gen,
#                              validation_steps = STEP_SIZE_VALID,
#                              class_weight = class_weights,
#                              callbacks = [tensorboard_callback],
#                              epochs = 10)

#endTrainTime = time.time()
#trainTime = endTrainTime - startTrainTime
#print()
#print('Total Training Time (sec): {}'.format(trainTime))

In [0]:
#%tensorboard --logdir logs/fit

## Set All Layers as Trainable

Finally, just for experimentation, the entire selected model is set as trainable.

In [0]:
for layer in model.layers:
   layer.trainable = True

In [0]:
from keras.optimizers import SGD
model.compile(optimizer=SGD(lr=0.0001, momentum=0.9), loss ='binary_crossentropy', metrics= ['accuracy'])

In [0]:
startTrainTime = time.time()

history = model.fit_generator(generator = train_gen, steps_per_epoch = STEP_SIZE_TRAIN,
                   validation_data = valid_gen,
                   validation_steps = STEP_SIZE_VALID,
                   class_weight = class_weights,
                   callbacks = [tensorboard_callback],
                   epochs = 5)

endTrainTime = time.time()
trainTime = endTrainTime - startTrainTime
print()
print('Total Training Time (sec): {}'.format(trainTime))

In [0]:
%tensorboard --logdir logs/fit

# Evaluate Results

In [0]:
results = model.evaluate_generator(generator = test_gen,
                                   steps = STEP_SIZE_TEST)
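
The `STEP_SIZE_*` values use floor division, so any final partial batch is left out of a pass. The sketch below contrasts floor and ceiling division. The sizes are illustrative (the real `n` depends on how many rows each generator keeps), and `steps` is a hypothetical helper.

```python
import math

def steps(n_samples, batch_size, keep_partial=True):
    """Number of generator steps needed to see n_samples once."""
    if keep_partial:
        return math.ceil(n_samples / batch_size)
    return n_samples // batch_size

n, batch = 100, 64   # e.g. a 100-image split with batch_size = 64
floor_steps = steps(n, batch, keep_partial=False)  # floor division, as in STEP_SIZE_*
ceil_steps = steps(n, batch)                       # ceiling division
print(floor_steps, ceil_steps)  # 1 2
```

With floor division in this example, 36 of the 100 images would not be seen in a single pass.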


In [0]:
print(results)

In [0]:

import matplotlib.pyplot as plt

loss = history.history['loss']
val_loss = history.history['val_loss']

epochs = range(1, len(loss)+1)

plt.plot(epochs, loss, 'bo', label = 'Training loss')
plt.plot(epochs, val_loss, 'b', label = 'Validation loss')
plt.title('Training and validation loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()
plt.show()

In [0]:
acc = history.history['acc']
val_acc = history.history['val_acc']

plt.plot(epochs, acc, 'bo', label = 'Training acc')
plt.plot(epochs, val_acc, 'b', label = 'Validation acc')
plt.title('Training and validation accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend()
plt.show()


# Test Model

In [0]:
os.chdir('/content/drive/My Drive/nih-chest-xrays/')

# Inception_v3 input tensor = (299, 299, 3)
# DenseNet121 input tensor = (224, 224, 3)

input_tensor = Input(shape=(224, 224, 3))  
base_model = DenseNet121(input_tensor = input_tensor,weights = 'imagenet', include_top= False)
x = base_model.output
x = GlobalAveragePooling2D()(x)
x = Dense(2048, activation = 'relu')(x)
drop = Dropout(0.5)(x)
cl = Dense(13, activation = 'sigmoid')(drop)
model = Model(inputs=input_tensor, outputs = cl )
  

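# Expects weights saved as 'model.h5' (see the commented-out
# model.save_weights('model.h5') cell above)
model.load_weights('model.h5')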

In [0]:
from keras.optimizers import SGD
model.compile(optimizer=SGD(lr=0.0001, momentum=0.9), loss ='binary_crossentropy', metrics= ['accuracy'])
pred = model.evaluate_generator(generator = test_gen,
                                steps = STEP_SIZE_TEST)

In [0]:
pred

In [0]:
# Get predictions for a single image
from keras.preprocessing import image
import numpy as np
img_path = '/content/drive/My Drive/nih-chest-xrays/images/00000030_001.png'

# Inception_v3 input tensor = (299, 299, 3)
# DenseNet121 input tensor = (224, 224, 3)

img = image.load_img(img_path, target_size=(224, 224))
x = image.img_to_array(img)
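x = np.expand_dims(x, axis=0)
# Apply the same rescaling and per-sample normalization used by the generators
x = datagen.standardize(x)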
preds = model.predict(x)
preds = preds.astype(float)*100


In [0]:
preds
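
The raw `preds` array is easier to read when paired with label names. The sketch below assumes the model's output order follows the `classes` list passed to the generators. `top_findings` is a hypothetical helper, and the scores are made up purely to show the output format.

```python
import numpy as np

CLASS_NAMES = ['Atelectasis', 'Cardiomegaly', 'Consolidation', 'Edema', 'Effusion',
               'Emphysema', 'Fibrosis', 'Infiltration', 'Mass', 'Nodule',
               'Pleural_Thickening', 'Pneumonia', 'Pneumothorax']

def top_findings(pred_row, class_names=CLASS_NAMES, k=3):
    """Return the k (label, score) pairs with the highest scores."""
    order = np.argsort(pred_row)[::-1][:k]
    return [(class_names[i], float(pred_row[i])) for i in order]

# Made-up scores for one image, only to show the output format.
fake_preds = np.linspace(5.0, 65.0, num=len(CLASS_NAMES))
top3 = top_findings(fake_preds)
print(top3)
```

With the notebook's own output, the call would be `top_findings(preds[0])`, since `preds` has one row per image.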

In [0]:
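from sklearn.metrics import multilabel_confusion_matrix

# One 2x2 matrix per class; labels and predictions are collected batch by
# batch so they stay aligned even though test_gen shuffles (0.5 threshold is an assumption)
y_true, y_pred = [], []
for _ in range(STEP_SIZE_TEST):
    x_batch, y_batch = next(test_gen)
    y_true.append(y_batch.astype(int))
    y_pred.append((model.predict(x_batch) > 0.5).astype(int))
multilabel_confusion_matrix(np.concatenate(y_true), np.concatenate(y_pred))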
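
For a multi-label model, a confusion matrix is usually read per class: each label gets its own counts of true negatives, false positives, false negatives, and true positives after thresholding the sigmoid outputs. The NumPy sketch below shows that bookkeeping on toy data. `per_class_confusion` and the 0.5 threshold are assumptions chosen for illustration.

```python
import numpy as np

def per_class_confusion(y_true, y_prob, threshold=0.5):
    """Return an array of shape (n_classes, 4) holding TN, FP, FN, TP."""
    y_true = np.asarray(y_true).astype(bool)
    y_pred = np.asarray(y_prob) >= threshold
    tp = (y_true & y_pred).sum(axis=0)
    tn = (~y_true & ~y_pred).sum(axis=0)
    fp = (~y_true & y_pred).sum(axis=0)
    fn = (y_true & ~y_pred).sum(axis=0)
    return np.stack([tn, fp, fn, tp], axis=1)

y_true_toy = [[1, 0], [0, 0], [1, 1], [0, 1]]                  # toy multi-hot labels
y_prob_toy = [[0.9, 0.2], [0.6, 0.1], [0.3, 0.8], [0.1, 0.4]]  # toy sigmoid outputs
counts = per_class_confusion(y_true_toy, y_prob_toy)
print(counts)
```

Each row of the result corresponds to one class, in the same column order as the label array.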