**CHEST X-RAY IMAGE CLASSIFICATION: ADVANCED DATA SCIENCE CAPSTONE PROJECT**

Paolo Cavadini, February 2021.

Dataset: https://www.kaggle.com/paultimothymooney/chest-xray-pneumonia

Find on GitHub: https://github.com/pcavad/capstone_x_rays

In [None]:
import os

import keras
from keras.preprocessing import image
from keras import backend as K
from keras.models import Sequential, load_model
from keras import layers
from keras.layers import Input, Dense, Dropout, Flatten, MaxPool2D 
from keras.layers import Conv2D, MaxPooling2D, BatchNormalization, SeparableConv2D 
from keras.optimizers import Adam,SGD,Adagrad,Adadelta,RMSprop
from keras.callbacks import ReduceLROnPlateau, ModelCheckpoint,EarlyStopping 
import itertools
import matplotlib.pyplot as plt
plt.style.use('seaborn')
from mpl_toolkits.mplot3d import Axes3D
%matplotlib inline
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, accuracy_score 
import seaborn as sns
import tensorflow as tf

**MODEL DEFINITION**

In [None]:
X_train = np.load('./data/X_train.npy')
y_train = np.load('./data/y_train.npy')
X_test = np.load('./data/X_test.npy')
y_test = np.load('./data/y_test.npy')

**Logistic regression (non-deep-learning model as baseline)**

In [None]:
# Flatten each image into a 1-D feature vector, and the labels into a 1-D array, to fit a logistic regression.

X_train_LR = np.reshape(X_train, (len(X_train),-1))
X_test_LR = np.reshape(X_test, (len(X_test),-1))
y_train_LR = np.reshape(y_train, (len(y_train),))
y_test_LR = np.reshape(y_test, (len(y_test),))

In [None]:
# Fit with C=0.1 and the liblinear solver, then predict on the test set.

LR = LogisticRegression(C=0.1, solver='liblinear').fit(X_train_LR,y_train_LR)
yhat = LR.predict(X_test_LR)
yhat

In [None]:
# Print the scikit-learn accuracy score, which serves as the baseline for the neural network to beat.

print('Accuracy score: {:.2f}%'.format(accuracy_score(y_test_LR, yhat)*100))

In [None]:
def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
    Plot the confusion matrix `cm` with one tick per entry of `classes`.
    Set normalize=True to show row-wise fractions instead of raw counts.
    """
    if normalize:
        # Divide each row by its total so that every row sums to 1.
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
    fmt = '.2f' if normalize else '.0f'

    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)
    # Use white text on dark cells and black text on light cells.
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')

In [None]:
# Compute the confusion matrix; labels=[1,0] puts pneumonia (1) in the first row and column.
cnf_matrix = confusion_matrix(y_test_LR, yhat, labels=[1, 0])
# Plot the confusion matrix.
plot_confusion_matrix(cnf_matrix, classes=['pneumonia', 'normal'], normalize=False, title='Confusion matrix')
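
Because the matrix is built with `labels=[1,0]`, its first row holds the true pneumonia cases and its first column the predicted pneumonia cases. The sketch below shows how recall, precision and specificity could be read off a matrix laid out this way. The function name `summarize_confusion` and the counts in `toy_cm` are hypothetical and only illustrate the layout; they are not results from this notebook.

```python
def summarize_confusion(cm):
    """Derive common metrics from a 2x2 matrix ordered as labels=[1, 0].

    Row 0 = true pneumonia, row 1 = true normal;
    column 0 = predicted pneumonia, column 1 = predicted normal.
    """
    tp, fn = cm[0]
    fp, tn = cm[1]
    total = tp + fn + fp + tn
    return {
        'accuracy': (tp + tn) / total,
        'recall': tp / (tp + fn) if (tp + fn) else 0.0,        # sensitivity
        'precision': tp / (tp + fp) if (tp + fp) else 0.0,
        'specificity': tn / (tn + fp) if (tn + fp) else 0.0,
    }

# Hypothetical counts, for illustration only.
toy_cm = [[90, 10],
          [30, 70]]
metrics = summarize_confusion(toy_cm)
for name, value in metrics.items():
    print('{}: {:.2f}'.format(name, value))
```

With the notebook's own matrix, `summarize_confusion(cnf_matrix.tolist())` would apply the same reading. Recall on the pneumonia class is often the number to watch in a screening setting, since a missed case costs more than a false alarm.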

**DEEP LEARNING**

**Modeling**

I use an input shape of (150, 150, 3) and a batch size of 32, which is a good compromise between accuracy and run time. The activation function and the optimizer will be selected through a fine-tuning process, as will the option to augment the data. I chose the binary cross-entropy loss because it is a common choice for two-class classification, and the accuracy metric because it is easy to interpret.
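
To make the loss and the metric concrete, here is a minimal pure-Python sketch of mean binary cross-entropy and thresholded accuracy. The function names, the clipping constant `eps` and the toy labels and probabilities are assumptions for illustration; Keras computes both quantities internally when the model is compiled with `loss='binary_crossentropy'` and `metrics=['accuracy']`.

```python
import math

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean of -(y*log(p) + (1-y)*log(1-p)) over all samples."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

def binary_accuracy(y_true, y_pred, threshold=0.5):
    """Fraction of samples whose thresholded probability matches the label."""
    hits = sum(int(p > threshold) == y for y, p in zip(y_true, y_pred))
    return hits / len(y_true)

# Toy values, for illustration only (1 = pneumonia, 0 = normal).
toy_labels = [1, 0, 1, 1]
toy_probs = [0.9, 0.2, 0.6, 0.4]

loss = binary_crossentropy(toy_labels, toy_probs)
acc = binary_accuracy(toy_labels, toy_probs)
print('loss: {:.4f}, accuracy: {:.2f}'.format(loss, acc))
```

The loss penalizes confident wrong predictions much more than hesitant ones, whereas accuracy only counts which side of the threshold each prediction falls on.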

The neural network consists of 5 convolutional blocks, each with a convolutional layer, batch normalization and max-pooling; three of the blocks also include a dropout layer. On top of these sits a flatten layer followed by two dense layers, with a dropout layer in between to reduce over-fitting.
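
The effect of the five blocks on the feature-map size can be worked out by hand. The sketch below assumes the settings used in this notebook's model: a 150×150×3 input, convolutions with `padding='same'` and stride 1 (which keep height and width unchanged), and 2×2 max-pooling with stride 2 and `padding='same'` (which gives the ceiling of half the size). The helper name `pooled_size` is hypothetical.

```python
import math

def pooled_size(size, stride=2):
    """Output length of a 'same'-padded pooling layer with the given stride."""
    return math.ceil(size / stride)

height, width = 150, 150
filters_per_block = [32, 64, 64, 128, 256]

block_shapes = []
for n_filters in filters_per_block:
    # The 'same'-padded, stride-1 convolution leaves height and width unchanged;
    # only the pooling layer shrinks them.
    height, width = pooled_size(height), pooled_size(width)
    block_shapes.append((height, width, n_filters))

# Flatten turns the last feature map into a single vector.
flatten_units = height * width * filters_per_block[-1]

for i, shape in enumerate(block_shapes, start=1):
    print('block {}: {}'.format(i, shape))
print('flatten units:', flatten_units)
```

These shapes should match the output shapes reported by `model.summary()` for the pooling layers and the flatten layer.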

In [None]:
input_shape = (150,150,3)

In [None]:
def modeling(activation_f=None, optimizer_f=None, input_shape_img=None):
    '''
    This function creates the model from the given parameters, then compiles and saves it.
    '''
    model = Sequential()
    
    #1st block
    model.add(Conv2D(32 , (3,3) , strides = 1 , padding = 'same' , activation = activation_f , input_shape = input_shape_img))
    model.add(BatchNormalization())
    model.add(MaxPool2D((2,2) , strides = 2 , padding = 'same'))
    #2nd block
    model.add(Conv2D(64 , (3,3) , strides = 1 , padding = 'same' , activation = activation_f))
    model.add(Dropout(0.1))
    model.add(BatchNormalization())
    model.add(MaxPool2D((2,2) , strides = 2 , padding = 'same'))
    #3rd block
    model.add(Conv2D(64 , (3,3) , strides = 1 , padding = 'same' , activation = activation_f))
    model.add(BatchNormalization())
    model.add(MaxPool2D((2,2) , strides = 2 , padding = 'same'))
    #4th block
    model.add(Conv2D(128 , (3,3) , strides = 1 , padding = 'same' , activation = activation_f))
    model.add(Dropout(0.2))
    model.add(BatchNormalization())
    model.add(MaxPool2D((2,2) , strides = 2 , padding = 'same'))
    #5th block
    model.add(Conv2D(256 , (3,3) , strides = 1 , padding = 'same' , activation = activation_f))
    model.add(Dropout(0.2))
    model.add(BatchNormalization())
    model.add(MaxPool2D((2,2) , strides = 2 , padding = 'same'))

    model.add(Flatten())
    model.add(Dense(units = 128 , activation = activation_f))
    model.add(Dropout(0.2))
    model.add(Dense(units = 1 , activation = 'sigmoid')) 

    model.compile(optimizer=optimizer_f, loss='binary_crossentropy', metrics=['accuracy']) # compiling ##'MeanSquaredError'
    
    model.save('./data/model.h5') # saving
    
    return model

In [None]:
# Quick test: build the model with ReLU and Adam, then print its summary.
model = modeling('relu', 'adam', input_shape)
model.summary()
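
Since `modeling` takes the activation function and the optimizer as arguments, the fine-tuning search mentioned earlier could enumerate the combinations as sketched below. The candidate lists are assumptions for illustration: the optimizer strings mirror the optimizer classes imported at the top of the notebook, and `'tanh'` is only an example alternative to `'relu'`. The helper `search_grid` is hypothetical, and the call to `modeling` is left as a comment so that the sketch runs without Keras.

```python
import itertools

# Assumed candidates, for illustration only.
candidate_activations = ['relu', 'tanh']
candidate_optimizers = ['adam', 'sgd', 'adagrad', 'adadelta', 'rmsprop']

def search_grid(activations, optimizers):
    """Return every (activation, optimizer) pair to evaluate."""
    return list(itertools.product(activations, optimizers))

grid = search_grid(candidate_activations, candidate_optimizers)
for activation_f, optimizer_f in grid:
    # In the notebook, each pair would be passed to
    # modeling(activation_f, optimizer_f, input_shape),
    # and the returned model would then be trained and scored.
    print(activation_f, optimizer_f)
```

Note that `modeling` writes to `./data/model.h5` on every call, so a loop like this would overwrite the saved file at each step unless the path is varied.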