# DEEP LEARNING WITH COVID-19 XRAY CONVOLUTED NEURAL NETWORK

### Tensorflow, Keras, Sci-Kit Learn

Mitchell Thomas


---
##[Disclaimer: Please note that this project is not scientifically tested or prepared for use in any other setting than a personal project.]



I decided to take on the project of identifying whether X-ray imagery of lungs contained COVID-19 virus or were healthy. Through doing this I was able to study various types of convolutional neural networks, image classification, and real world example of model analysis and where shortcomings working with real problems.





In [0]:
# import packages
!pip install tensorflow 
from tensorflow.keras.utils import to_categorical
from sklearn.preprocessing import LabelBinarizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from imutils import paths
import matplotlib.pyplot as plt
import numpy as np
import argparse
import cv2
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from numpy import array
from numpy import argmax
from imutils import paths

import argparse
import random
import shutil
import os
import pandas as pd
import argparse
import shutil
import os
import glob

import keras 
from keras.layers.core import Dense, Dropout, Activation, Flatten
from keras.layers.convolutional import Conv2D, MaxPooling2D, SeparableConv2D
from keras.regularizers import l2
from keras.optimizers import SGD, RMSprop
from keras.utils import to_categorical
from keras.layers.normalization import BatchNormalization
from keras.utils.vis_utils import plot_model
from keras.layers import Input, GlobalAveragePooling2D
from keras import models
from keras.models import Model


**The Data**

I would say the largest shortcoming that I foresaw from the start was lack of data and COVID-19 X-ray images. However, I decided to move forward with the project anyways in hopes of larger datasets in the future that I can tune this network to.

I found a great resource online that had compiled a dataset of 25 Posterieranterior COVID-19 infected lung x-rays and 25 x-rays of healthy lungs to feed into my neural network ultimately. Please refer [to this helpful post from pyimagesearch.com](https://www.pyimagesearch.com/2020/03/16/detecting-covid-19-in-x-ray-images-with-keras-tensorflow-and-deep-learning/) for more details.

The csv file comes from a Kaggle dataset that I found with the same COVID-19 infected lung x-rays as the author of the above pyimagesearch post used to compile the data.

Find the kaggle dataset here --> https://www.kaggle.com/bachrr/covid-chest-xray#metadata.csv

In [0]:
data = pd.read_csv('metadata.csv')

## **Let's Get To It**

After importing the dataset, below I have matched up all of the healthy/covid images with their corresponding labels using the package cv2.



---

I then go on to establish parameters (which I had tuned throughout the project to find optimum results). These parameters are number of epochs (or times passing through CNN), learning rate (which is how drastically the weights are affected as the network learns), and batch size (which is number of training samples used in one iteration).



## **Feature and Label Vectors**

I then initialize my feature and label vectors. My feature vectors consist of the attributes that are being used to determine the outcome, or prediction of the network, which is the label vector.

As I built these initial vectors, I resized the x-ray images to be 224x224 pixels so that they were standardized and uniform.

## **One-Hot Encoding the label vector data**

In order to work with the classes covid/healthy in a way that they were numerical, but not ordinal, I used one-hot encoding, which basically creates more attributes, or dimensions, your dataset is working with by number of unique labels and either fills them with a value of 1 or 0 based on whether that entry is associated with the corresponding label. Here I used LabelBinarizer() as I had two classes. (Binary)



---


Last but not least, I shuffled both vectors just for safe keeping to make sure that they were all jumbled up and my training data had a variety to train on and test on, or as much as possible with 50 total datapoints.

In [73]:
covid_data = []
normal_data = []
covid_imgs = glob.glob ("/content/datasets/covid/*")
normal_imgs = glob.glob("/content/datasets/healthy/*")

for myFile in covid_imgs:
    # print(myFile)
    image = cv2.imread(myFile)
    covid_data.append(image)

print('covid_data shape:', np.array(covid_data).shape)

for myFile in normal_imgs:
    # print(myFile)
    image = cv2.imread(myFile)
    normal_data.append(image)

print('normal_data shape:', np.array(normal_data).shape)

# initial model parameters
epochs = 25
lr = 1e-1
BS = 8

# lr 1e-1 works with BS of 32 epoch 25
# lr 1e-1 works with BS of 8 epoch 25 better

# grab the list of images in our dataset directory, then initialize
# the list of data (i.e., images) and class images
print("images are being vectorized")

X = []
y = []

# loop over the image paths
for imagePath in covid_imgs:
	# use the label name from folder name
	label = imagePath.split(os.path.sep)[-2]

	# load the image, swap color channels, and resize it to be a fixed
	# 224x224 pixels while ignoring aspect ratio
	image = cv2.imread(imagePath)
	image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
	image = cv2.resize(image, (224, 224))

	# build the feature and label vectors starting with COVID
	X.append(image)
	y.append(label)
 
for imagePath in normal_imgs:
	# use the label name from folder name
	label = imagePath.split(os.path.sep)[-2]

	# load the image, swap color channels, and resize it to be a fixed
	# 224x224 pixels while ignoring aspect ratio
	image = cv2.imread(imagePath)
	image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
	image = cv2.resize(image, (224, 224))

	# update the feature and label vectors with healthy lung x-rays
	X.append(image)
	y.append(label)


# standardize the data to range [0, 255]
# create numPy arrays
X = np.array(X) / 255.0
y = np.array(y)

# One Hot Encode the data
lb = LabelBinarizer()
y = lb.fit_transform(y)
y = to_categorical(y)

random.shuffle(X)
random.shuffle(y)


covid_data shape: (25,)
normal_data shape: (25,)
images are being vectorized


### Sci-Kit Learn's Train/Test Split

In order to avoid overfitting of my model, which means my model would not be generalized but too specific to my data, I split the data into training and testing sets. Generally a good ratio is around 70%/30% or 80%/20%. I used the default split option in this function (which I believe is 80/20)

In [0]:
# split our vectors into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y)

## Choosing the right Convolutional Neural Network

I had some choices here and I was between using a VGG16 network or GoogleNet's Inception network. From what I had gathered, VGG16 was the second choice of many due to it's slow and demanding computation. I looked into the fairly new 'Inception' model from Googlenet which I learned much about and challenged myself to it's implementation for this project.


---

## GoogLeNet Inception Convolutional Neural Network
Googlenet inception provides a convolutional neural network that takes an input of an image and then filters it essentially through three dimensions 1x1, 3x3, and 5x5 and a pooling layer. This reduces computational expense and avoids overfitting with a deep model. This makes your Convolutional Neural Network ‘wider’ as these operations are being performed on the same layers, and not ‘deeper’ which would have more layers for the input data vectors to pass through.

In other words, if we had a network that was supposed to identify whether the picture consisted of a dog or a cat, this type of neural network would “look at the picture from different angles” in order to decide, just like a human might look at a piece of art or photo to identify what it consisted of. Why? Because every picture of a dog or a cat is not quite set up the same way.

In [75]:
# GooGlenet Naive Inception CNN model

shapex = 224
shapey = 224
n_rows,n_cols,n_dims = X_train.shape[1:]
# in_shape = (n_rows, n_cols, n_dims)
nClasses = 2

input_vec = Input(shape=(shapex, shapey, 3))

hidden_layer_1 = Conv2D(10, (1,1), padding='same', activation='relu')(input_vec)
hidden_layer_1 = Conv2D(10, (3,3), padding='same', activation='relu')(hidden_layer_1)

hidden_layer_2 = Conv2D(10, (1,1), padding='same', activation='relu')(input_vec)
hidden_layer_2 = Conv2D(10, (5,5), padding='same', activation='relu')(hidden_layer_2)

hidden_layer_3 = MaxPooling2D((3,3), strides=(1,1), padding='same')(input_vec)
hidden_layer_3 = Conv2D(10, (1,1), padding='same', activation='relu')(hidden_layer_3)

combined_layers = keras.layers.concatenate([hidden_layer_1, hidden_layer_2, hidden_layer_3], axis = 3)

flattener = Flatten()(combined_layers)

dense_1 = Dense(10, activation='relu')(flattener)
dense_2 = Dense(5, activation='relu')(dense_1)
dense_3 = Dense(2, activation='relu')(dense_2)
output = Dense(nClasses, activation='softmax')(dense_3)

model = Model([input_vec], output)

plot_model(model, to_file='model.png', show_shapes=True, show_layer_names=True)
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

fitted_model = model.fit(X_train, y_train, epochs=epochs, batch_size=BS, validation_data=(X_test, y_test))

 


# make predictions on the testing set
print("Evaluating the Convoluted Neural Network. Please Wait.")
y_pred = model.predict(X_test, batch_size=BS)
 # # for each image in the testing set we need to find the index of the
# # label with corresponding largest predicted probability
y_pred = np.argmax(y_pred, axis=1)

# show a nicely formatted classification report
print(classification_report(y_test.argmax(axis=1), y_pred,
	target_names=lb.classes_))

# compute the confusion matrix and and use it to derive the raw
# accuracy, sensitivity, and specificity
matrix = confusion_matrix(y_test.argmax(axis=1), y_pred)
total = sum(sum(matrix))
accuracy = (matrix[0, 0] + matrix[1, 1]) / total
sensitivity = matrix[0, 0] / (matrix[0, 0] + matrix[0, 1])
specificity = matrix[1, 1] / (matrix[1, 0] + matrix[1, 1])

# show the confusion matrix, accuracy, sensitivity, and specificity
print(matrix)
print("acc: {:.4f}".format(accuracy))
print("sensitivity: {:.4f}".format(sensitivity))
print("specificity: {:.4f}".format(specificity))








Train on 37 samples, validate on 13 samples
Epoch 1/25
Epoch 2/25
Epoch 3/25
Epoch 4/25
Epoch 5/25
Epoch 6/25
Epoch 7/25
Epoch 8/25
Epoch 9/25
Epoch 10/25
Epoch 11/25
Epoch 12/25
Epoch 13/25
Epoch 14/25
Epoch 15/25
Epoch 16/25
Epoch 17/25
Epoch 18/25
Epoch 19/25
Epoch 20/25
Epoch 21/25
Epoch 22/25
Epoch 23/25
Epoch 24/25
Epoch 25/25
Evaluating the Convoluted Neural Network. Please Wait.
              precision    recall  f1-score   support

       covid       0.92      1.00      0.96        12
     healthy       0.00      0.00      0.00         1

    accuracy                           0.92        13
   macro avg       0.46      0.50      0.48        13
weighted avg       0.85      0.92      0.89        13

[[12  0]
 [ 1  0]]
acc: 0.9231
sensitivity: 1.0000
specificity: 0.0000


  _warn_prf(average, modifier, msg_start, len(result))


### Conclusion

For the amount of data that I had I'd say the network runs fairly well. Through the countless iterations I've run with it, I seemed to get overall validation accuracy score of around 75-92%. Of course, this is hard to guarantee because of the size of the data that is being used, but definitely hopeful for the future.

The confusion matrix above will outline how many predictions were false-positive, true-positive, false-negative, and true-negative. To analyze this performance metric, think this way --> You generally want the majority of your predictions to fall along the diagonal (which means your model predicted correctly), but in the cases it doesn't predict correctly, you want less to be false-positive. To explain this further, you would not want to go to the doctor and they tell you you aren't sick, when you really are. It would be less risky to be told you are sick when you actually aren't.




---

***I encourage you if you have any interest to check out the current COVID-19 Research challenge that Kaggle is holding. 

-->https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge

I treated this project as a learning process, and while making my own decisions about what to implement ultimately, I did find the following articles to be very educational and guide me in the right directions.


-->https://www.pyimagesearch.com/2020/03/16/detecting-covid-19-in-x-ray-images-with-keras-tensorflow-and-deep-learning/

Info on Googlenet Inception CNN
-->https://towardsdatascience.com/a-simple-guide-to-the-versions-of-the-inception-network-7fc52b863202

Information on VGG16 CNN
-->https://neurohive.io/en/popular-networks/vgg16/




