# COVID-19 Detection from chest X-Ray images using Convolutional Neural Networks

#### Pablo Lázaro Herrasti and Rubén Barco Terrones

This notebook collects all the steps and code that we used while developing this research. Each section is explained below. If you have any doubts, questions or suggestions, contact Pablo or Rubén by e-mail. Our contact information is on the [GitHub](https://github.com/polazaro/Covid-19-Detection) page of the publication.

### Libraries

First of all, we import all the libraries and functions used in the project. We use **Keras** on top of **TensorFlow**, plus some **Scikit-Learn** functions to compute the accuracies.

In [None]:
import pandas as pd
import os
import numpy as np
import tensorflow as tf
from tensorflow.keras.preprocessing import image
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from tensorflow.keras import models
from tensorflow.keras import layers
from tensorflow.keras import optimizers
from tensorflow.keras import applications
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.models import Sequential, Model 
from tensorflow.keras.layers import concatenate, Input, Dropout, Flatten, Dense, GlobalAveragePooling2D,Conv2D,MaxPooling2D,BatchNormalization
from tensorflow.keras import backend as K
from tensorflow.keras.callbacks import ModelCheckpoint, LearningRateScheduler, TensorBoard, EarlyStopping

### Directories

Here we use one cell to define all the paths used in the notebook. Note that there is a main directory called `dir_covid`, which contains the following subdirectories:
* **`COVID-19_TRAIN`**: Contains all the chest X-Ray images that tested positive for COVID-19 and that we use in the **training** process.
* **`COVID-19_TRAIN_AUG`**: The same as above, but with data augmentation.
* **`COVID-19_TEST`**: Contains all the chest X-Ray images that tested positive for COVID-19 and that we use for **test**.
* **`COVID-19_VAL`**: Contains all the chest X-Ray images that tested positive for COVID-19 and that we use as the **validation** set.
* **`NORMAL`**: All the **healthy** images (train, validation and test).
* **`Viral Pneumonia`**: All the **Viral Pneumonia** images (train, validation and test).

In [None]:
dir_covid = '.../Covid-19-Detection/'

# Select only one of the following two directories (without or with DataAugmentation).
# If both lines are left active, the second assignment overrides the first.
dir_covid_images_train  = dir_covid + 'COVID-19_TRAIN/'      # Without DataAug
dir_covid_images_train  = dir_covid + 'COVID-19_TRAIN_AUG/'  # With DataAug

dir_covid_images_test   = dir_covid + 'COVID-19_TEST/'
dir_covid_images_val    = dir_covid + 'COVID-19_VAL/'
dir_normal_images       = dir_covid + 'NORMAL/'
dir_pneumonia_images    = dir_covid + 'Viral Pneumonia/'

# Here we join both normal and viral pneumonia dirs in a list to use them in a loop easily
all_dir_images = [dir_normal_images, dir_pneumonia_images]

# Here we define the metadata files for the three classes
dir_covid_metadata      = dir_covid + 'COVID-19.metadata.xlsx'
dir_normal_metadata     = dir_covid + 'NORMAL.metadata.xlsx'
dir_pneumonia_metadata  = dir_covid + 'Viral Pneumonia.matadata.xlsx'
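
Before loading anything, it can help to confirm that the expected folders and metadata files exist. The following is a minimal sketch using only the standard library. `missing_paths` is a hypothetical helper name that is not part of the original notebook, and the demo runs on a temporary directory rather than on the real dataset.

```python
import os
import tempfile

def missing_paths(paths):
    """Return the entries of `paths` that do not exist on disk."""
    return [p for p in paths if not os.path.exists(p)]

# Demo on a temporary directory standing in for dir_covid (assumption).
with tempfile.TemporaryDirectory() as demo_root:
    os.makedirs(os.path.join(demo_root, 'NORMAL'))
    expected = [os.path.join(demo_root, name) for name in ('NORMAL', 'COVID-19_VAL')]
    not_found = missing_paths(expected)
    print([os.path.basename(p) for p in not_found])
```

With the real paths, the same helper could be called on the list of `dir_*` variables defined above.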

### Reading and Pre-processing Data

In this part we start by loading the metadata files. Here we only need to load the healthy and viral pneumonia metadata. We assign one label to each class:
* **COVID-19** is **0**
* **Healthy** (normal) is **1**
* **Viral Pneumonia** (pneumonia) is **2**

In [None]:
metadata_normal = pd.read_excel(dir_normal_metadata)
metadata_normal['label'] = 1
metadata_pneumonia = pd.read_excel(dir_pneumonia_metadata)
metadata_pneumonia['label'] = 2
metadata_all = {dir_normal_images:metadata_normal, dir_pneumonia_images:metadata_pneumonia}
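
The label convention above can also be kept in a small lookup table, which makes printed results easier to read. This is only an illustrative sketch: `CLASS_NAMES` and `class_counts` are hypothetical names, and the toy labels are made up.

```python
from collections import Counter

# Label convention used in this notebook.
CLASS_NAMES = {0: 'COVID-19', 1: 'Healthy', 2: 'Viral Pneumonia'}

def class_counts(labels):
    """Count how many samples of each class appear in `labels`, keyed by class name."""
    counts = Counter(labels)
    return {name: counts.get(idx, 0) for idx, name in CLASS_NAMES.items()}

# Toy labels (made up for illustration, not taken from the dataset).
demo_counts = class_counts([1, 1, 2, 0, 2, 1])
print(demo_counts)
```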

#### Data Augmentation

Data augmentation is performed with the Augmentor library for Python. This library offers many ways to increase the number of samples in our database, such as flipping images, rotating them, applying distortions, etc. To learn more about them you can visit the [Augmentor repository](https://github.com/mdbloice/Augmentor).

**NOTE**: If you have already done your data augmentation, you can skip these two cells. Just change the previous directory to point to your augmented train folder.

In [None]:
import Augmentor

# This command creates a folder named 'output' inside the dir_covid_images_train directory, where all the new
# images are going to be created. After the data augmentation, we have to merge the images from 'output' with the
# original COVID-19 data to obtain all the COVID-19 images
p = Augmentor.Pipeline(dir_covid_images_train)

In [None]:
p.rotate(probability=0.7, max_left_rotation=15, max_right_rotation=15)
p.flip_left_right(probability=0.7)
p.sample(len(os.listdir(dir_covid_images_train))*5)
p.process()
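
To give an intuition of what a left-right flip does to the pixel grid, and of how many images result from sampling five times the original count, here is a small NumPy sketch. It is not Augmentor's implementation. The toy array and `n_original` are made-up values.

```python
import numpy as np

# Toy 2x3 single-channel "image"; values are made up for illustration.
img = np.array([[1, 2, 3],
                [4, 5, 6]])

# A left-right flip reverses the column order of every row.
flipped = np.fliplr(img)
print(flipped)

# Sampling 5x the number of originals and merging with them gives 6x in total.
n_original = 100                  # hypothetical count
n_generated = n_original * 5
n_total = n_original + n_generated
print(n_total)
```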

#### Reading Image Data

In the following three cells we only load the images from the healthy and pneumonia cases, because we can split them into train, validation and test sets using the `train_test_split` function.

In [None]:
#Reading Image data and converting it into pixels and separating class labels
Data=[]
Label=[]

for dir_images in all_dir_images:
    files = os.listdir(dir_images)
    for index, row in metadata_all[dir_images].iterrows():
        Label.append(row['label'])
        filename=os.path.join(dir_images, files[index])
        im=image.load_img(filename,target_size=(224, 224))
        im=np.reshape(im,(224,224,3))
        im=im.astype('float32') / 255
        Data.append(im)
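
The per-image preprocessing in the loop above boils down to converting 8-bit pixels to `float32` values in the range [0, 1]. A minimal sketch on a synthetic array follows. `to_model_input` is a hypothetical helper name and the all-white image is made up.

```python
import numpy as np

def to_model_input(pixels):
    """Scale 8-bit pixel values to float32 in [0, 1], as done in the loading loop."""
    arr = np.asarray(pixels)
    return arr.astype('float32') / 255

# Synthetic 224x224 RGB image standing in for a loaded X-ray (assumption).
fake_image = np.full((224, 224, 3), 255, dtype=np.uint8)
scaled = to_model_input(fake_image)
print(scaled.shape, scaled.dtype, scaled.max())
```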

In [None]:
# First split: 70% for training, 30% held out
X_train, X_1, y_train, y_1 = train_test_split(np.array(Data), np.array(Label), test_size=0.3, random_state=42,stratify=Label)

# Second split: the held-out 30% is divided equally into validation and test
X_cv, X_test, y_cv, y_test = train_test_split(X_1, y_1, test_size=0.5, random_state=42,stratify=y_1)

In [None]:
# Print some information about the amount of data
len(X_train), len(y_train), len(X_cv), len(y_cv), len(X_test), len(y_test)
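
The two consecutive splits give roughly a 70/15/15 division. The arithmetic can be sketched as below, assuming the held-out part of each split is rounded up. `split_sizes` is a hypothetical helper and the total of 1000 images is a made-up figure.

```python
import math

def split_sizes(n_samples, first_test_size=0.3, second_test_size=0.5):
    """Return (train, validation, test) sizes for the two-step split above.

    Assumes the held-out part of each split is rounded up.
    """
    n_held_out = math.ceil(first_test_size * n_samples)
    n_train = n_samples - n_held_out
    n_test = math.ceil(second_test_size * n_held_out)
    n_val = n_held_out - n_test
    return n_train, n_val, n_test

demo_sizes = split_sizes(1000)  # hypothetical total of 1000 images
print(demo_sizes)
```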

Now we load the **COVID-19** cases. We load the three subsets directly from their directories because we split them before running the code. We did this because we only wanted to apply data augmentation to the COVID-19 training samples, so we had to isolate them from the validation and test COVID-19 images and from the healthy and viral pneumonia ones.

After loading each of the three subsets (train, validation and test), we join them with the data from the other two classes.

In [None]:
# For data augmentation COVID with label 0

''' TRAINING '''
Data=[]
Label=[]

files = os.listdir(dir_covid_images_train)
for file in files:
    Label.append(0)
    filename=os.path.join(dir_covid_images_train, file)
    im=image.load_img(filename,target_size=(224, 224))
    im=np.reshape(im,(224,224,3))
    im=im.astype('float32') / 255
    Data.append(im)

y_train = np.array(list(y_train) + Label)
X_train = np.array(list(X_train) + Data)

''' VALIDATION '''
Data=[]
Label=[]
files = os.listdir(dir_covid_images_val)
for file in files:
    Label.append(0)
    filename=os.path.join(dir_covid_images_val, file)
    im=image.load_img(filename,target_size=(224, 224))
    im=np.reshape(im,(224,224,3))
    im=im.astype('float32') / 255
    Data.append(im)

y_cv = np.array(list(y_cv) + Label)
X_cv = np.array(list(X_cv) + Data)

''' TEST '''
Data=[]
Label=[]
files = os.listdir(dir_covid_images_test)
for file in files:
    Label.append(0)
    filename=os.path.join(dir_covid_images_test, file)
    im=image.load_img(filename,target_size=(224, 224))
    im=np.reshape(im,(224,224,3))
    im=im.astype('float32') / 255
    Data.append(im)

y_test = np.array(list(y_test) + Label)
X_test = np.array(list(X_test) + Data)

In [None]:
# Print some information about the amount of data
len(X_train), len(y_train), len(X_cv), len(y_cv), len(X_test), len(y_test)
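
The step of joining the COVID-19 samples with the other two classes can also be written with `np.concatenate`, which avoids converting the arrays to Python lists. This is a sketch on tiny made-up arrays. `append_class` is a hypothetical helper, not part of the original code.

```python
import numpy as np

def append_class(X, y, new_images, label):
    """Append `new_images` to X and a matching run of `label` values to y."""
    new_images = np.asarray(new_images, dtype=X.dtype)
    new_labels = np.full(len(new_images), label, dtype=y.dtype)
    return np.concatenate([X, new_images]), np.concatenate([y, new_labels])

# Tiny made-up arrays standing in for image batches (4x4 RGB instead of 224x224).
X_demo = np.zeros((5, 4, 4, 3), dtype='float32')
y_demo = np.array([1, 1, 2, 2, 1])
covid_demo = np.ones((2, 4, 4, 3), dtype='float32')

X_demo, y_demo = append_class(X_demo, y_demo, covid_demo, label=0)
print(X_demo.shape, y_demo)
```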

Once all the data has been loaded and divided into train, validation and test sets, we can reshape all the samples to the input shape of the network. A very common shape, used in many state-of-the-art publications, is **`224 x 224`**.

In [None]:
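# Import from tensorflow.keras so that these objects match the ones used to build the model
from tensorflow.keras import backend as K
from tensorflow.keras.callbacks import ModelCheckpoint
from tensorflow.keras.models import load_model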

img_width=224
img_height=224

if K.image_data_format() == 'channels_first':
    
    input_shape = (3, img_width, img_height)
    # The images were loaded as channels-last, so move the channel axis instead of reshaping
    X_train = X_train.transpose(0, 3, 1, 2)
    X_cv    = X_cv.transpose(0, 3, 1, 2)
    X_test  = X_test.transpose(0, 3, 1, 2)
    
else:
    
    input_shape = (img_width, img_height, 3)
    X_train = X_train.reshape(X_train.shape[0],img_width,img_height,3)
    X_cv    = X_cv.reshape(X_cv.shape[0],img_width,img_height,3)
    X_test  = X_test.reshape(X_test.shape[0],img_width,img_height,3)

    
del Data
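
When converting between the two image layouts, note that moving the channel axis and reshaping are not the same operation. The following NumPy sketch on one made-up 2x2 RGB image illustrates the difference. The pixel values are chosen only so that the channels are easy to tell apart.

```python
import numpy as np

# One made-up 2x2 RGB image in channels-last layout: shape (1, 2, 2, 3).
# Each pixel holds [R, G, B] = [10*k, 10*k + 1, 10*k + 2] for k = 0..3.
batch = np.array([[[[0, 1, 2], [10, 11, 12]],
                   [[20, 21, 22], [30, 31, 32]]]])

# Moving the channel axis keeps each channel plane intact.
channels_first = batch.transpose(0, 3, 1, 2)
print(channels_first[0, 0])   # red plane

# A plain reshape only reinterprets the memory order and mixes the channels.
reshaped = batch.reshape(1, 3, 2, 2)
print(reshaped[0, 0])
```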

### Baseline Architecture

In the following cells, the **Baseline Architecture** is defined and trained. Some accuracy values are obtained after the training process.

In [None]:
model=Sequential()
model.add(Conv2D(32, 3, input_shape=input_shape, activation='relu', padding='same'))
model.add(MaxPooling2D(2))
model.add(Conv2D(64, 3, activation='relu', padding='same'))
model.add(MaxPooling2D(2))
model.add(Conv2D(128, 3, activation='relu', padding='same'))
model.add(MaxPooling2D(2))
model.add(Flatten())
model.add(Dense(64, activation='relu'))
model.add(Dropout(0.3))
model.add(Dense(3, activation='softmax'))
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['acc'])
model.summary()
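
The numbers reported by `model.summary()` can be cross-checked by hand. The sketch below reproduces the arithmetic for the baseline layers in plain Python, assuming a `224 x 224 x 3` channels-last input. `conv2d_params` and `dense_params` are hypothetical helper names.

```python
def conv2d_params(kernel, in_channels, filters):
    """Weights plus biases of a square-kernel Conv2D layer."""
    return (kernel * kernel * in_channels + 1) * filters

def dense_params(inputs, units):
    """Weights plus biases of a Dense layer."""
    return (inputs + 1) * units

size, channels, total = 224, 3, 0
for filters in (32, 64, 128):
    total += conv2d_params(3, channels, filters)
    channels = filters
    size //= 2                     # each MaxPooling2D(2) halves height and width

flat = size * size * channels      # Flatten output: 28 * 28 * 128
total += dense_params(flat, 64) + dense_params(64, 3)
print(size, flat, total)
```

Most of the parameters sit in the first dense layer, because it is connected to every value of the flattened feature map.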

#### Training the Network

We only save the best models here (those that improve the validation accuracy), but you can save only the last one or one model per epoch, as you wish.

In [None]:
n_epochs = 30
batch_size = 32

checkpoint = ModelCheckpoint('model-{epoch:03d}-{acc:03f}-{val_acc:03f}.h5', save_best_only=True, monitor='val_acc', mode='max')
model.fit(x=X_train, y=y_train, batch_size=batch_size, epochs=n_epochs, callbacks=[checkpoint], validation_data=(X_cv,y_cv), shuffle=True)
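
The checkpoint file name is a format template whose placeholders are filled with the epoch number and the logged metrics. The sketch below shows what the resulting names look like, using plain `str.format` with made-up metric values. The shorter `.3f` variant is only a suggestion, not what the notebook uses.

```python
# Same template as in the training cell.
template = 'model-{epoch:03d}-{acc:03f}-{val_acc:03f}.h5'

# Made-up metric values for illustration.
name = template.format(epoch=7, acc=0.9512, val_acc=0.9)
print(name)

# A precision such as .3f would give shorter names.
short = 'model-{epoch:03d}-{acc:.3f}-{val_acc:.3f}.h5'.format(epoch=7, acc=0.9512, val_acc=0.9)
print(short)
```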

#### Multiclass Predictions - Accuracy

In [None]:
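# predict_classes was removed in recent TensorFlow releases; take the argmax of the softmax output instead
train_acc = accuracy_score(y_train, np.argmax(model.predict(X_train), axis=1))
valid_acc = accuracy_score(y_cv, np.argmax(model.predict(X_cv), axis=1))
test_acc  = accuracy_score(y_test, np.argmax(model.predict(X_test), axis=1))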

print("The final train accuracy is ",train_acc*100,"%")
print("The final validation accuracy is ",valid_acc*100,"%")
print("The final test accuracy is ",test_acc*100,"%")

In [None]:
''' COVID-19 Accuracy '''

X_covid  = X_test[y_test == 0]
y_covid  = y_test[y_test == 0]
test_acc = accuracy_score(y_covid, np.argmax(model.predict(X_covid), axis=1))

print("The final test accuracy for COVID-19 is ", test_acc*100, "%")

In [None]:
''' Healthy Accuracy '''

X_normal = X_test[y_test == 1]
y_normal = y_test[y_test == 1]
test_acc = accuracy_score(y_normal, np.argmax(model.predict(X_normal), axis=1))

print("The final test accuracy for NORMAL is ", test_acc*100, "%")

In [None]:
''' Viral Pneumonia Accuracy '''

X_pneumonia = X_test[y_test == 2]
y_pneumonia = y_test[y_test == 2]
test_acc    = accuracy_score(y_pneumonia, np.argmax(model.predict(X_pneumonia), axis=1))

print("The final test accuracy for PNEUMONIA is ", test_acc*100, "%")
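
The three per-class accuracies can also be read from a single confusion matrix: each one is a diagonal entry divided by its row sum. Below is a minimal NumPy sketch with made-up labels and predictions. `confusion` is a hypothetical helper that uses the same layout as Scikit-Learn's `confusion_matrix` (rows are true classes, columns are predicted classes).

```python
import numpy as np

def confusion(y_true, y_pred, n_classes=3):
    """Rows are true classes, columns are predicted classes."""
    matrix = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        matrix[t, p] += 1
    return matrix

# Made-up labels and predictions (0 = COVID-19, 1 = Healthy, 2 = Viral Pneumonia).
y_true_demo = np.array([0, 0, 1, 1, 1, 2, 2, 2])
y_pred_demo = np.array([0, 1, 1, 1, 2, 2, 2, 0])

cm = confusion(y_true_demo, y_pred_demo)
per_class = cm.diagonal() / cm.sum(axis=1)
print(cm)
print(per_class)
```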

### Inception Based Architecture

**Inception** is a method that applies convolutional layers in parallel with different filter sizes to obtain different features and then concatenates them. 

In [None]:
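# Import from tensorflow.keras so that these objects match the ones used to build the model
from tensorflow.keras import backend as K
from tensorflow.keras.callbacks import ModelCheckpoint
from tensorflow.keras.models import load_model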

img_width=224
img_height=224
input_shape = (img_width, img_height, 3)

input_layer = Input(shape=input_shape)
conv1 = Conv2D(64, 1, activation='relu', padding='same')(input_layer)
conv2 = Conv2D(32, 1, activation='relu', padding='same')(input_layer)
conv3 = Conv2D(16, 1, activation='relu', padding='same')(input_layer)

layer_out = concatenate([conv1, conv2, conv3], axis=-1)
layer_out = MaxPooling2D(2)(layer_out)
conv4     = Conv2D(128, 7, activation='relu', padding='same', name='last_conv')(layer_out)
layer_out = MaxPooling2D(2)(conv4)
layer_out = Flatten()(layer_out)
dense1    = Dense(128, activation='relu')(layer_out)
dense1    = Dropout(0.3)(dense1)
output    = Dense(3, activation='softmax')(dense1)
model     = Model(inputs=input_layer, outputs=output)
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['acc'])
model.summary()
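
The effect of concatenating the parallel branches on the last axis can be checked with plain NumPy arrays. This sketch uses made-up 8x8 feature maps instead of the real 224x224 ones, and zeros instead of real activations. Only the shapes matter here.

```python
import numpy as np

# Made-up feature maps for one 8x8 input (224x224 in the notebook), channels-last.
branch_a = np.zeros((1, 8, 8, 64))
branch_b = np.zeros((1, 8, 8, 32))
branch_c = np.zeros((1, 8, 8, 16))

# Concatenating on the last axis stacks the channels: 64 + 32 + 16 = 112.
merged = np.concatenate([branch_a, branch_b, branch_c], axis=-1)
print(merged.shape)
```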

#### Training the Network

We only save the best models here (those that improve the validation accuracy), but you can save only the last one or one model per epoch, as you wish.

In [None]:
n_epochs = 30
batch_size = 32

checkpoint = ModelCheckpoint('model-{epoch:03d}-{acc:03f}-{val_acc:03f}.h5', save_best_only=True, monitor='val_acc', mode='max')
model.fit(x=X_train, y=y_train, batch_size=batch_size, epochs=n_epochs, callbacks=[checkpoint], validation_data=(X_cv,y_cv), shuffle=True)

#### Multiclass Predictions - Accuracy

In [None]:
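# A functional Model has no predict_classes method; take the argmax of the softmax output instead
train_acc = accuracy_score(y_train, np.argmax(model.predict(X_train), axis=1))
valid_acc = accuracy_score(y_cv, np.argmax(model.predict(X_cv), axis=1))
test_acc  = accuracy_score(y_test, np.argmax(model.predict(X_test), axis=1))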

print("The final train accuracy is ",train_acc*100,"%")
print("The final validation accuracy is ",valid_acc*100,"%")
print("The final test accuracy is ",test_acc*100,"%")

In [None]:
''' COVID-19 Accuracy '''

X_covid  = X_test[y_test == 0]
y_covid  = y_test[y_test == 0]
test_acc = accuracy_score(y_covid, np.argmax(model.predict(X_covid), axis=1))

print("The final test accuracy for COVID-19 is ", test_acc*100, "%")

In [None]:
''' Healthy Accuracy '''

X_normal = X_test[y_test == 1]
y_normal = y_test[y_test == 1]
test_acc = accuracy_score(y_normal, np.argmax(model.predict(X_normal), axis=1))

print("The final test accuracy for NORMAL is ", test_acc*100, "%")

In [None]:
''' Viral Pneumonia Accuracy '''

X_pneumonia = X_test[y_test == 2]
y_pneumonia = y_test[y_test == 2]
test_acc    = accuracy_score(y_pneumonia, np.argmax(model.predict(X_pneumonia), axis=1))

print("The final test accuracy for PNEUMONIA is ", test_acc*100, "%")