# Pneumonia Classification using CNN

Convolutional neural networks (CNNs) can be an efficient tool for detecting diseases across different types of medical imaging quickly and reliably. This paper illustrates the creation and training of a CNN from scratch and describes the importance of data preprocessing for producing better results, since raw data gives poor results and also takes longer to train on a standard neural network (NN). The work is based on another research article; the difference lies in how the data is managed, because we do not consider data augmentation the right procedure for reducing development effort in this area. As its main result, the solution can help radiologists and medical personnel decide, in this case, whether or not a patient has pneumonia, taking into account how difficult and slow it is to read an X-ray image. The dataset is the Chest X-Ray Image Dataset, a public Kaggle dataset containing 5856 JPEG images organized in three directories.

## Library import process

In [None]:
# Numerical (NumPy) and tabular data (pandas) modules.
import numpy as np
import pandas as pd

# Module for working with OS (Linux).
import os

# Import keras classes for CNN design.
from keras.utils import Sequence
from keras.models import Sequential
from keras.layers import Conv2D
from keras.layers import MaxPooling2D
from keras.layers import Flatten
from keras.layers import Dense
from keras.layers import BatchNormalization, Dropout

# OpenCV module for computer vision.
import cv2

# Keras module for loading our pretrained model.
from keras.models import model_from_json

import time

## Function to read Images from both directories

In [None]:
"""
    @param dirnameNormal
    Directory for normal X-Ray images.
    
    @param dirnamePneumonia
    Directory for Pneumonia X-Ray images.
    
    Returns two lists: the image file paths and their labels (0 = normal, 1 = pneumonia).
    
"""
def readDirectory(dirnameNormal, dirnamePneumonia):
    array1 = []
    array2 = []
    for i in os.listdir(dirnameNormal):
        if '.DS_Store' not in i:
            array1.append(dirnameNormal + '/' + str(i))
            array2.append(0)
        
    for i in os.listdir(dirnamePneumonia):
        if '.DS_Store' not in i:
            array1.append(dirnamePneumonia + '/' + str(i))
            array2.append(1)
        
    return array1, array2

## Directories are loaded into arrays

In [None]:
# Dataset root directory.
base_dir = '/kaggle/input/chest-xray-pneumonia/chest_xray/chest_xray/'

# Train file paths (x) and labels (y) are loaded.
train_x, train_y = readDirectory(base_dir+'train/NORMAL', base_dir+'train/PNEUMONIA')
print('Reading on train directory finished!')

# Test file paths (x) and labels (y) are loaded.
test_x, test_y = readDirectory(base_dir+'test/NORMAL', base_dir+'test/PNEUMONIA')
print('Reading on test directory finished!')

# Validation file paths (x) and labels (y) are loaded.
val_x, val_y = readDirectory(base_dir+'val/NORMAL', base_dir+'val/PNEUMONIA')
print('Reading on val directory finished!')
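
Since the labels are plain lists of 0 (normal) and 1 (pneumonia), it can be useful to check how balanced the two classes are before training. The following is a minimal sketch of such a check; the helper names `class_counts` and `class_fractions` and the toy labels are assumptions for illustration and do not come from the notebook or reflect the real dataset counts.

```python
from collections import Counter


def class_counts(labels):
    """Return {label: count} for a list of labels."""
    return dict(Counter(labels))


def class_fractions(labels):
    """Return {label: fraction of the total} for a list of labels."""
    total = len(labels)
    return {label: count / total for label, count in Counter(labels).items()}


# Toy labels for illustration only; these are NOT the dataset's real counts.
toy_labels = [0, 1, 1, 1, 0, 1, 1, 1]

counts = class_counts(toy_labels)
fractions = class_fractions(toy_labels)
print(counts)
print(fractions)
```

In the notebook, the same helpers could be called on `train_y`, `test_y`, and `val_y`.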

## Join train and test arrays into one list

In [None]:
print('Length verification before joining process!')
print(len(train_x), '<-->', len(train_y))
print(len(test_x), '<-->', len(test_y))
print(len(val_x), '<-->', len(val_y))

# Joining train and test lists into one.

files = train_x + test_x

# Joining train and test labels.

labels = train_y + test_y

print('Length verification after joining process!')
print(len(files), '<-->', len(labels))

## Shuffle process

In [None]:
from sklearn.utils import shuffle

files_shuffled, labels_shuffled = shuffle(files, labels)
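
The important property of this step is that both lists are reordered with the same permutation, so each file path keeps its label. The sketch below illustrates that idea with the standard library only; `shuffle_pairs` and the toy file names are hypothetical and are not part of the notebook. Passing a fixed seed makes the order reproducible, which `sklearn.utils.shuffle` also supports through its `random_state` argument.

```python
import random


def shuffle_pairs(files, labels, seed=None):
    """Shuffle two parallel lists with the same permutation."""
    if len(files) != len(labels):
        raise ValueError("files and labels must have the same length")
    pairs = list(zip(files, labels))
    # A dedicated Random instance keeps the global random state untouched.
    random.Random(seed).shuffle(pairs)
    if not pairs:
        return [], []
    shuffled_files, shuffled_labels = zip(*pairs)
    return list(shuffled_files), list(shuffled_labels)


# Hypothetical file names; label 0 = normal, 1 = pneumonia.
toy_files = ["n1.jpeg", "n2.jpeg", "p1.jpeg", "p2.jpeg"]
toy_labels = [0, 0, 1, 1]

out_files, out_labels = shuffle_pairs(toy_files, toy_labels, seed=1)
for name, label in zip(out_files, out_labels):
    print(name, label)
```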

## Train and validation split

In [None]:
from sklearn.model_selection import train_test_split

X_train_filenames, X_val_filenames, y_train, y_val = train_test_split(files_shuffled, labels_shuffled, test_size=0.2, random_state=1)


## Custom Generator and Image Preprocessing

In [None]:
from skimage.io import imread
from skimage.transform import resize

# This class inherits from Sequence in order to create a custom generator.
class Data_Generator(Sequence):
    
    # We feed our generator with our parameters.
    def __init__(self, image_filenames, labels, batch_size):
        self.image_filenames = image_filenames
        self.labels = labels
        self.batch_size = batch_size
        
    # Computes the number of batches to produce.
    def __len__(self) :
        # The built-in int is used because the np.int alias was removed from recent NumPy releases.
        return int(np.ceil(len(self.image_filenames) / float(self.batch_size)))
    
    # We preprocess our dataset with the current batch (Here is where magic happens).
    def __getitem__(self, idx) :
        batch_x = self.image_filenames[idx * self.batch_size : (idx+1) * self.batch_size]
        batch_y = self.labels[idx * self.batch_size : (idx+1) * self.batch_size]
        
        return np.array(
            [self.preprocess_image(directory) for directory in batch_x]
        ), np.array(batch_y)
        
    
    # Preprocess a single image and return an array.
    def preprocess_image(self, directory):
        # Read image from directory
        img = cv2.imread(directory, cv2.IMREAD_GRAYSCALE)
        # Resize the image
        img = cv2.resize(src= img, dsize= (300, 300), interpolation = cv2.INTER_AREA)
        # Denoise the image
        img = cv2.fastNlMeansDenoising(img, None, 10, 7, 21)
        # Normalize the image
        img = img/255
        
        img.shape += (1,)
        
        return img
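
The batching logic in `__len__` and `__getitem__` can be checked in isolation. The following sketch reproduces the same arithmetic on a toy list, using only the standard library; `num_batches` and `batch_slice` are hypothetical helper names introduced here for illustration. It shows that the last batch is simply shorter when the number of items is not a multiple of the batch size.

```python
import math


def num_batches(n_items, batch_size):
    """Number of batches needed to cover n_items (the last one may be partial)."""
    return math.ceil(n_items / batch_size)


def batch_slice(items, idx, batch_size):
    """Items belonging to batch number idx, using the same slicing as the generator."""
    return items[idx * batch_size:(idx + 1) * batch_size]


# Toy data: 10 items in batches of 4.
toy_items = list(range(10))
toy_batch_size = 4

n = num_batches(len(toy_items), toy_batch_size)
batches = [batch_slice(toy_items, i, toy_batch_size) for i in range(n)]
print(n)
print(batches)
```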

In [None]:
# Preprocess a single image and return an array.
def preprocess_image(directory):
    # Read image from directory
    img = cv2.imread(directory, cv2.IMREAD_GRAYSCALE)
    # Resize the image
    img = cv2.resize(src= img, dsize= (300, 300), interpolation = cv2.INTER_AREA)
    # Denoise the image
    img = cv2.fastNlMeansDenoising(img, None, 10, 7, 21)
    # Normalize the image
    img = img/255
        
    img.shape += (1,)
    
    return img
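
The last two steps of the preprocessing (scaling pixel values to the range [0, 1] and adding a channel axis) can be illustrated without OpenCV. This is a minimal sketch on a tiny synthetic array rather than a real X-ray; the function name `to_model_input` is hypothetical, and `np.expand_dims` is used here as an alternative to the in-place `img.shape += (1,)` of the notebook. Casting to `float32` is also an assumption of this sketch.

```python
import numpy as np


def to_model_input(gray_uint8):
    """Scale an 8-bit grayscale image to [0, 1] and append a channel axis."""
    scaled = gray_uint8.astype(np.float32) / 255.0
    return np.expand_dims(scaled, axis=-1)  # (H, W) -> (H, W, 1)


# Tiny synthetic 2x2 "image" for illustration only.
toy_image = np.array([[0, 255], [51, 102]], dtype=np.uint8)

x = to_model_input(toy_image)
# Stacking several preprocessed images yields a batch of shape (N, H, W, 1).
batch = np.stack([x, x])
print(x.shape, batch.shape)
```

With 300x300 inputs, the same steps would give a batch of shape `(N, 300, 300, 1)`, which is the layout the `input_shape=(300, 300, 1)` of the first layer expects.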

## Instantiation of the custom Data Generator with a batch size of 64

In [None]:
batch_size = 64

my_training_batch_generator = Data_Generator(X_train_filenames, y_train, batch_size)
my_validation_batch_generator = Data_Generator(X_val_filenames, y_val, batch_size)


## Creation of the Convolutional Neural Network

In [None]:
# Sequential class instantiation for the CNN model.

model = Sequential()

# First convolutional layer; the input shape is 300x300 with 1 channel.
model.add(Conv2D(16, (3, 3), activation="relu", input_shape=(300, 300, 1)))

model.add(MaxPooling2D(pool_size = (2, 2)))

model.add(Conv2D(32, (3, 3), activation="tanh"))

model.add(MaxPooling2D(pool_size = (2, 2)))

model.add(Conv2D(32, (3, 3), activation="tanh"))

model.add(MaxPooling2D(pool_size = (2, 2)))

model.add(Conv2D(64, (3, 3), activation="tanh"))

model.add(MaxPooling2D(pool_size = (2, 2)))

model.add(Flatten())

model.add(Dense(activation = 'relu', units = 128))
model.add(Dense(activation = 'sigmoid', units = 1))

# Compile the CNN model, with adam optimizer.
model.compile(optimizer = 'adam', loss = 'binary_crossentropy', metrics = ['accuracy'])

# Print our model
model.summary()
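
The shapes and parameter counts reported by `model.summary()` can be cross-checked by hand. The sketch below does that arithmetic in plain Python, assuming the Keras defaults for these layers (convolutions with stride 1 and no padding, pooling with a stride equal to the pool size); the helper names are hypothetical and not part of the notebook.

```python
def conv_out(size, kernel=3):
    """Spatial size after an unpadded convolution with stride 1."""
    return size - kernel + 1


def pool_out(size, pool=2):
    """Spatial size after non-overlapping max pooling."""
    return size // pool


def conv_params(in_channels, out_channels, kernel=3):
    """Weights plus one bias per filter."""
    return kernel * kernel * in_channels * out_channels + out_channels


def dense_params(in_units, out_units):
    """Weights plus one bias per output unit."""
    return in_units * out_units + out_units


size, channels = 300, 1
total = 0
# Filter counts of the four Conv2D layers, each followed by 2x2 max pooling.
for filters in (16, 32, 32, 64):
    total += conv_params(channels, filters)
    size = pool_out(conv_out(size))
    channels = filters
    print(f"after conv({filters}) + pool: {size}x{size}x{channels}")

flat = size * size * channels
total += dense_params(flat, 128) + dense_params(128, 1)
print("flattened units:", flat)
print("trainable parameters:", total)
```

Under these assumptions the feature map before `Flatten` is 16x16x64, so most of the parameters sit in the first dense layer.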

## Plotting the CNN model

In [None]:
from keras.utils import plot_model

plot_model(model, to_file='model_plot.png', show_shapes=True, show_layer_names=True)

## Start the training process

In [None]:
# `fit` accepts Sequence generators directly; `fit_generator` is deprecated.
history = model.fit(
    my_training_batch_generator,
    steps_per_epoch = int( np.ceil(len(X_train_filenames) / batch_size)),
    epochs= 5,
    verbose= 1,
    validation_data= my_validation_batch_generator,
    validation_steps= int( np.ceil(len(X_val_filenames) / batch_size)),
    use_multiprocessing=True
)

## Train accuracy vs Validation Accuracy

In [None]:
accuracy_training = history.history['accuracy']
accuracy_testing = history.history['val_accuracy']
epochs = 5

In [None]:
acc_training = np.array(accuracy_training)
acc_testing = np.array(accuracy_testing)

## Save the accuracies in a CSV file

In [None]:
dataframe_accuracies = pd.DataFrame(list(zip(acc_training, acc_testing)), columns=['ACC_Training', 'ACC_Testing'])
dataframe_accuracies.to_csv('dataframe_accuracies.csv')
dataframe_accuracies.head()

## Plotting the accuracies obtained by the CNN model

In [None]:
from matplotlib import pyplot as plt

plt.rcParams.update({'font.size': 22})

plt.rcParams["figure.figsize"] = (12,8)
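plt.grid()
# Plot against epoch numbers 1..epochs so the ticks line up with the points.
plt.plot(np.arange(1, epochs + 1), accuracy_training, color='b', label="Training accuracy")
plt.xticks(np.arange(1, epochs + 1, 1))
plt.xlabel('EPOCHS')
plt.ylabel('Training accuracy')
plt.legend()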
plt.tight_layout()
plt.show()

In [None]:
plt.rcParams["figure.figsize"] = (12,8)
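plt.grid()
# Plot against epoch numbers 1..epochs so the ticks line up with the points.
plt.plot(np.arange(1, epochs + 1), accuracy_testing, color='r')
plt.xticks(np.arange(1, epochs + 1, 1))
plt.xlabel('EPOCHS')
plt.ylabel('Validation accuracy')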
plt.tight_layout()
plt.show()

## Conclusions
We can confirm that the model proposed in this article outperforms the baseline model on both quality measures by a considerable margin. In addition, these training and test values are obtained with far fewer epochs than the baseline model needs.

For this article we took the model from a properly published paper, so developing the proposed model, with all its trial and error, was a great challenge for the authors because we were competing with professionals who have more experience in the field of data science. As future work, we intend to extend this work on X-ray image classification, aiming to improve accuracy when classifying images that contain cancer or tumors, contributing to medicine and, above all, to saving lives.
