AUTHORS: **ÓSCAR PIETTE AND LUCÍA SÁNCHEZ**


# **Pneumonia identification from X-Ray images**:

The aim of this project is to build a classifier that distinguishes patients with pneumonia from normal patients using chest X-ray images. To do so, we used transfer learning: we took the pre-trained Keras deep learning model **ResNet50** as a base and modified it to adapt it to our data. We also used the Orca library from Analytics Zoo, which supports distributed processing of Big Data. The full dataset is available at https://www.kaggle.com/paultimothymooney/chest-xray-pneumonia

First, we install and import all the necessary libraries and dependencies:

**Installing Analytics Zoo:**

In [1]:
# Install jdk8
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
import os
# Set environment variable JAVA_HOME.
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
!update-alternatives --set java /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java
!java -version

openjdk version "1.8.0_312"
OpenJDK Runtime Environment (build 1.8.0_312-8u312-b07-0ubuntu1~18.04-b07)
OpenJDK 64-Bit Server VM (build 25.312-b07, mixed mode)


In [2]:
# Install latest pre-release version of Analytics Zoo
# Installing Analytics Zoo from pip will automatically install pyspark, bigdl, and their dependencies.
!pip install --pre --upgrade analytics-zoo[ray]
exit() # restart the runtime so that the newly installed packages are picked up



**Installing tensorflow:**

In [3]:
!pip install tensorflow==2.7.0
import tensorflow as tf
tf.__version__



'2.7.0'

**Initialize the Orca context:**

In [17]:
from zoo.orca import init_orca_context, stop_orca_context
from zoo.orca import OrcaContext

# It is recommended to set it to True when running Analytics Zoo in Jupyter notebook 
OrcaContext.log_output = True 

init_orca_context(cluster_mode="local", cores=2)

In [5]:
#Obtain the spark session from the Orca Context
spark = OrcaContext.get_spark_session()

**Load and preprocess the data:**

The data used for this project is hosted on Kaggle. One option is to download the data from Kaggle and upload it to Google Drive for later use in Colab, but we chose to fetch the data directly from Kaggle instead. To do so, we followed the instructions in this source: https://buggyprogrammer.com/load-kaggle-dataset-in-colab-or-jupyter/

In [6]:
!pip install kaggle



In [7]:
#Make a directory named kaggle and copy the kaggle.json file there:
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/
# change the permission of the file
!chmod 600 ~/.kaggle/kaggle.json
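
Before calling the Kaggle CLI, it can help to confirm that the credentials file is in place. The following is a minimal sketch and not part of the original notebook: `check_kaggle_credentials` is a hypothetical helper, and it assumes the usual `kaggle.json` layout with a `username` and a `key` field.

```python
import json
import os
import stat


def check_kaggle_credentials(path="~/.kaggle/kaggle.json"):
    """Return a list of problems found with the Kaggle token file (empty if none)."""
    path = os.path.expanduser(path)
    if not os.path.isfile(path):
        return [f"{path} does not exist"]

    try:
        with open(path) as fh:
            token = json.load(fh)
    except json.JSONDecodeError:
        return [f"{path} is not valid JSON"]
    if not isinstance(token, dict):
        return [f"{path} does not contain a JSON object"]

    problems = []
    for field in ("username", "key"):
        if not token.get(field):
            problems.append(f"missing field: {field}")

    # chmod 600 leaves no permission bits set for group or others.
    mode = stat.S_IMODE(os.stat(path).st_mode)
    if mode & 0o077:
        problems.append(f"permissions are {oct(mode)}, expected 0o600")
    return problems


if __name__ == "__main__":
    print(check_kaggle_credentials() or "kaggle.json looks fine")
```

An empty list means the file exists, parses as JSON, contains both fields and is readable only by its owner.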

In [8]:
# Obtain the dataset:
!kaggle datasets download -d paultimothymooney/chest-xray-pneumonia

chest-xray-pneumonia.zip: Skipping, found more recently modified local copy (use --force to force download)


In [9]:
# Unzip the dataset:
from zipfile import ZipFile
file_name = 'chest-xray-pneumonia.zip'
with ZipFile(file_name, 'r') as archive:  # avoid shadowing the built-in zip()
  archive.extractall()
  print('Done')

Done


In [10]:
# Establish the path to each folder:
test_path="/content/chest_xray/test/"
train_path="/content/chest_xray/train/"
val_path="/content/chest_xray/val/"

The dataset consists of three folders: train, test and val (validation). Each of them contains two subfolders, one for the "PNEUMONIA" images and one for the "NORMAL" images. There are 16 images in the validation folder, 624 images in the test folder and 5216 images in the train folder.
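
The folder sizes quoted above can be checked by counting the files on disk. This is a small sketch rather than part of the original notebook: `count_images` is a hypothetical helper, the set of file extensions is an assumption, and the root path is the one used in the previous cell.

```python
from pathlib import Path

# Assumption: images are stored with one of these extensions.
IMAGE_SUFFIXES = {".jpeg", ".jpg", ".png"}


def count_images(root, splits=("train", "val", "test"),
                 classes=("NORMAL", "PNEUMONIA")):
    """Count image files per split and per class under `root`."""
    root = Path(root)
    counts = {}
    for split in splits:
        counts[split] = {}
        for cls in classes:
            folder = root / split / cls
            files = folder.iterdir() if folder.is_dir() else []
            counts[split][cls] = sum(
                1 for f in files if f.suffix.lower() in IMAGE_SUFFIXES
            )
    return counts


if __name__ == "__main__":
    for split, per_class in count_images("/content/chest_xray").items():
        print(split, per_class, "total:", sum(per_class.values()))
```

Besides confirming the totals, the per-class counts show how balanced each split is, which matters when interpreting accuracy.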

To manage the data we used Keras and TensorFlow, both of which can be used within the Orca context. Keras is a neural network library and TensorFlow is an open-source library for a wide range of machine learning tasks. We used them to prepare and preprocess the data, and to build our model with transfer learning.

In [11]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
# importing the necessary libraries

Following the instructions in https://analytics-zoo.readthedocs.io/en/latest/doc/Orca/QuickStart/orca-tf2keras-quickstart.html, we wrote three data creator functions: one for the training dataset, one for the test dataset and one for the validation dataset. With the function image_dataset_from_directory we obtained a TensorFlow dataset in which each image is labelled as "NORMAL" or "PNEUMONIA". We also set the image size to 224 x 224, which is the default input size of ResNet50.

In the training function we perform data augmentation and duplicate the dataset in order to increase the number of images, because the dataset is very small for deep learning. We also duplicate the validation dataset. All three datasets go through the same general preprocessing step with the function preprocess_input.
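
The augmentation layers used in the next cell take fractional factors, which are easier to picture once converted to degrees and pixels. The sketch below is only an illustration of that arithmetic: the helper names are hypothetical, and the ranges follow the Keras documentation of `RandomRotation`, `RandomTranslation` and `RandomContrast` for a single float factor.

```python
IMAGE_SIZE = 224  # pixels per side, as used in this notebook


def rotation_range_degrees(factor):
    """A rotation factor f is a fraction of a full turn: angles lie in [-f*360, f*360]."""
    return (-factor * 360.0, factor * 360.0)


def translation_range_pixels(factor, size=IMAGE_SIZE):
    """A translation factor f shifts the image by up to f * size pixels either way."""
    return (-factor * size, factor * size)


def contrast_multiplier_range(factor):
    """A contrast factor f draws a multiplier from [1 - f, 1 + f]."""
    return (1.0 - factor, 1.0 + factor)


if __name__ == "__main__":
    print("rotation (degrees):", rotation_range_degrees(0.15))
    print("translation (pixels):", translation_range_pixels(0.1))
    print("contrast multiplier:", contrast_multiplier_range(0.1))
```

With the values used in this notebook, images are rotated by up to about 54 degrees in either direction and shifted by up to about 22 pixels along each axis.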
 


In [12]:
from tensorflow.keras.preprocessing import image_dataset_from_directory
from tensorflow.keras.applications.resnet50 import preprocess_input

# Obtain and preprocess the training dataset:
def train_data_creator(config, batch_size):
    
    # Obtaining the data from the directory:
    data_train = image_dataset_from_directory(train_path,
                                            image_size=(224, 224), # Resize all images to 224 x 224
                                            batch_size = 1, # One image per batch; the batch_size argument is not used here
                                            shuffle=False, # Files are read in alphanumeric order, without shuffling
                                            label_mode='binary',
                                            class_names = ['NORMAL', 'PNEUMONIA'])

    ##### Perform data augmentation:
    # RandomRotation rotates each image by a random angle in [-0.15 * 2pi, 0.15 * 2pi] radians
    # RandomTranslation shifts each image by up to 10% of its height and of its width
    # RandomFlip randomly flips each image (default mode: horizontally and vertically)
    # RandomContrast scales the contrast by a random factor in [0.9, 1.1], independently for each channel of each image during training
    data_augmentation = keras.Sequential(
        [
            layers.RandomRotation(factor=0.15),
            layers.RandomTranslation(height_factor=0.1, width_factor=0.1),
            layers.RandomFlip(),
            layers.RandomContrast(factor=0.1)
        ]
    )
    # Duplicate the data:
    data_train = data_train.repeat(2)

    # Applying the data augmentation process:
    data_train = data_train.map(lambda x, y: (data_augmentation(x, training = True), y))

    # Applying a preprocessing function from Keras to the images:
    data_train = data_train.map(lambda x, y: (keras.applications.resnet50.preprocess_input(x), y))
    
    # Transform the data into tf.float32:
    data_train = data_train.map(lambda x, y: (tf.cast(x, dtype = tf.float32), y))

    return data_train

# Obtain and preprocess the test dataset:
def test_data_creator(config, batch_size):
    
    # Obtaining the data from the directory:
    data_test = image_dataset_from_directory(test_path, 
                                            image_size=(224, 224), 
                                            batch_size = batch_size,
                                            shuffle=True,
                                            label_mode='binary',
                                            class_names = ['NORMAL', 'PNEUMONIA'])
   
    # Applying a preprocessing function from Keras to the images:
    data_test = data_test.map(lambda x, y: (keras.applications.resnet50.preprocess_input(x), y))
    
    # Transform the data into tf.float32:
    data_test = data_test.map(lambda x, y: (tf.cast(x, dtype = tf.float32), y))
    
    return data_test

# Obtain and preprocess the validation dataset:
def val_data_creator(config, batch_size):
    
    # Obtaining the data from the directory:
    data_val = image_dataset_from_directory(val_path, 
                                            image_size=(224, 224), 
                                            batch_size = 1,
                                            shuffle=False,
                                            label_mode='binary',
                                            class_names = ['NORMAL', 'PNEUMONIA'])
    # Duplicate the data:
    data_val = data_val.repeat(2)

    # Applying a preprocessing function from Keras to the images:
    data_val = data_val.map(lambda x, y: (keras.applications.resnet50.preprocess_input(x), y))
    # Cast to tf.float32 (omitting this explicit cast caused problems)
    data_val = data_val.map(lambda x, y: (tf.cast(x, dtype = tf.float32), y))
    
    # Returning the processed dataset
    return data_val
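
The number of steps shown in the progress bars later in the notebook follows from how these creators batch and repeat the data. Below is a small sketch of that arithmetic. `steps_per_pass` is a hypothetical helper, and it assumes a single worker, so that the dataset is not split across processes.

```python
import math


def steps_per_pass(num_images, batch_size, repeats=1):
    """Batches produced when images are batched first and the dataset is then repeated."""
    batches_per_repeat = math.ceil(num_images / batch_size)
    return batches_per_repeat * repeats


if __name__ == "__main__":
    # Training creator: batch_size=1 and repeat(2) on 5216 images.
    print("train:", steps_per_pass(5216, batch_size=1, repeats=2))   # 10432
    # Validation creator: batch_size=1 and repeat(2) on 16 images.
    print("val:  ", steps_per_pass(16, batch_size=1, repeats=2))     # 32
    # Test creator: uses the batch_size passed in (32) on 624 images.
    print("test: ", steps_per_pass(624, batch_size=32))              # 20
```

The validation and test values match the `1/32` and `1/20` counters printed during evaluation. Because the training and validation creators fix the batch size at 1, the `batch_size = 32` passed to `fit` does not change how many images go into each training step.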



**Construction and application of the model:**

To use the model in the Orca context, we first need to define a model creator function that builds and returns a compiled Keras model. The model_creation function serves that purpose: it loads the existing ResNet50 model, modifies it in Keras and compiles it. Passing this function to Estimator.from_keras creates an Estimator that can run the model in an Orca context.

In [13]:
def model_creation(config):
  
  from tensorflow.keras.applications import ResNet50
  from tensorflow.keras.models import Sequential
  from tensorflow.keras.layers import Dense

  ### Initialize the pretrained model:
  # include_top=False leaves out the final fully connected (classification) layer
  base_model = ResNet50(input_shape=(224, 224, 3), include_top=False, weights="imagenet")
 
  for layer in base_model.layers: # mark the layers as frozen
    layer.trainable = False

  # NOTE: base_model is reassigned here, so the frozen ResNet50 above is discarded
  # and the ResNet50 added below is a new instance whose layers are all trainable.
  base_model = Sequential()
  # Adding a single fully connected layer on top of the pretrained model:
  base_model.add(ResNet50(include_top=False, weights='imagenet', pooling='max'))
  base_model.add(Dense(1, activation='sigmoid')) # For binary classification of the labels
 
  # Compile the model using the SGD optimizer (learning_rate replaces the deprecated lr argument):
  base_model.compile(optimizer = tf.keras.optimizers.SGD(learning_rate=0.0001), loss = 'binary_crossentropy', metrics = ['acc'])

  return base_model
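
The model ends in a single sigmoid unit and is trained with binary cross-entropy. To help read the loss values reported later, here is a plain-Python sketch of that loss for one image. It is an illustration and not the Keras implementation; the clipping constant `EPSILON = 1e-7` is assumed to mirror the Keras backend default.

```python
import math

EPSILON = 1e-7  # assumption: predictions are clipped away from exactly 0 and 1


def sigmoid(z):
    """Numerically stable logistic function, mapping a raw score to (0, 1)."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def binary_cross_entropy(y_true, p_pred, eps=EPSILON):
    """Loss for one example with label y_true (0 or 1) and predicted probability p_pred."""
    p = min(max(p_pred, eps), 1.0 - eps)
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1.0 - p))


if __name__ == "__main__":
    for p in (0.9, 0.5, 0.01, 0.0001):
        print(f"label=1, predicted p={p}: loss={binary_cross_entropy(1, p):.3f}")
```

An undecided prediction (p = 0.5) costs about 0.69, while a confident wrong prediction (p = 0.0001 for a positive image) costs about 9.2. Average losses in the range of roughly 4 to 9, like those in the evaluation logs of this notebook, therefore suggest that many images receive confident but wrong predictions.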


In [15]:
# Create an estimator using the model created in order to perform the training, the evaluation and prediction:
from zoo.orca.learn.tf2.estimator import Estimator
estimator_model = Estimator.from_keras(model_creator=model_creation)

2022-02-10 20:04:59,798 INFO services.py:1174 -- View the Ray dashboard at http://172.28.0.2:8265


{'node_ip_address': '172.28.0.2', 'raylet_ip_address': '172.28.0.2', 'redis_address': '172.28.0.2:6379', 'object_store_address': '/tmp/ray/session_2022-02-10_20-04-59_120376_2403/sockets/plasma_store', 'raylet_socket_name': '/tmp/ray/session_2022-02-10_20-04-59_120376_2403/sockets/raylet', 'webui_url': '172.28.0.2:8265', 'session_dir': '/tmp/ray/session_2022-02-10_20-04-59_120376_2403', 'metrics_export_port': 57393, 'node_id': 'fd5560ec20d961218b3c02469e64faffd1ec2065a5be9123ce331998'}


(pid=2681) Instructions for updating:
(pid=2681) use distribute.MultiWorkerMirroredStrategy instead
(pid=2681) 2022-02-10 20:05:06.301000: E tensorflow/stream_executor/cuda/cuda_driver.cc:271] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected


Finally, the last steps are to train the model and to evaluate its results:

In [16]:
stats = estimator_model.fit(train_data_creator,
                epochs=1,
                batch_size = 32,
                validation_data=val_data_creator)



(pid=2681) Found 5216 files belonging to 2 classes.


(pid=2681) Cause: could not parse the source code of <function train_data_creator.<locals>.<lambda> at 0x7fe0c84394d0>: no matching AST found among candidates:
(pid=2681)
(pid=2681) Cause: could not parse the source code of <function train_data_creator.<locals>.<lambda> at 0x7fe0c8439dd0>: no matching AST found among candidates:
(pid=2681)
(pid=2681) Cause: could not parse the source code of <function train_data_creator.<locals>.<lambda> at 0x7fe0c8439d40>: no matching AST found among candidates:
(pid=2681)
(pid=2681) Cause: could not parse the source code of <function val_data_creator.<locals>.<lambda> at 0x7fe0c72d8b90>: no matching AST found among candidates:
(pid=2681)
(pid=2681) Cause: could not parse the source code of <function val_data_creator.<locals>.<lambda> at 0x7fe0c72d8f80>: no matching AST found among candidates:
(pid=2681)


(pid=2681) Found 16 files belonging to 2 classes.


(pid=2681) 2022-02-10 20:05:16.160341: W tensorflow/core/grappler/optimizers/data/auto_shard.cc:766] AUTO sharding policy will apply DATA sharding policy as it failed to apply FILE sharding policy because of the following reason: Found an unshardable source dataset: name: "TensorSliceDataset/_1"
(pid=2681) op: "TensorSliceDataset"
(pid=2681) input: "Placeholder/_0"
(pid=2681) attr {
(pid=2681)   key: "Toutput_types"
(pid=2681)   value {
(pid=2681)     list {
(pid=2681)       type: DT_STRING
(pid=2681)     }
(pid=2681)   }
(pid=2681) }
(pid=2681) attr {
(pid=2681)   key: "_cardinality"
(pid=2681)   value {
(pid=2681)     i: 5216
(pid=2681)   }
(pid=2681) }
(pid=2681) attr {
(pid=2681)   key: "is_files"
(pid=2681)   value {

Streaming output truncated to the last 5000 lines.
(pid=2681)
(pid=2681)  - ETA: 1:11:34 - loss: 0.0127 - acc: 0.9982
(pid=2681)  - ETA: 1:10:19 - loss: 0.0126 - acc: 0.9982
(pid=2681)  - ETA: 1:07:46 - loss: 0.0122 - acc: 0.9983
(pid=2681)
(pid=2681)  - ETA: 1:05:42 - loss: 0.0119 - acc: 0.9983
(pid=2681)  - ETA: 56:52 - loss: 0.0109 - acc: 0.9985
(pid=2681)  - ETA: 55:44 - loss: 0.0139 - acc: 0.9979
(pid=2681)
(pid=2681)  - ETA: 47:14 - loss: 0.0128 - acc: 0.9980
(pid=2681)  - ETA: 46:00 - loss: 0.0127 - acc: 0.9981
(pid=2681)  - ETA: 40:58 - loss

(pid=2681) 2022-02-10 22:37:00.695798: W tensorflow/core/grappler/optimizers/data/auto_shard.cc:766] AUTO sharding policy will apply DATA sharding policy as it failed to apply FILE sharding policy because of the following reason: Found an unshardable source dataset: name: "TensorSliceDataset/_1"
(pid=2681) op: "TensorSliceDataset"
(pid=2681) input: "Placeholder/_0"
(pid=2681) attr {
(pid=2681)   key: "Toutput_types"
(pid=2681)   value {
(pid=2681)     list {
(pid=2681)       type: DT_STRING
(pid=2681)     }
(pid=2681)   }
(pid=2681) }
(pid=2681) attr {
(pid=2681)   key: "_cardinality"
(pid=2681)   value {
(pid=2681)     i: 16
(pid=2681)   }
(pid=2681) }
(pid=2681) attr {
(pid=2681)   key: "is_files"
(pid=2681)   value {



In [18]:
# Evaluate the model with the validation dataset:
validation_stats = estimator_model.evaluate(val_data_creator, batch_size = 32)
validation_stats

(pid=2681) Found 16 files belonging to 2 classes.


(pid=2681) Cause: could not parse the source code of <function val_data_creator.<locals>.<lambda> at 0x7fe0c8439b90>: no matching AST found among candidates:
(pid=2681)
(pid=2681) Cause: could not parse the source code of <function val_data_creator.<locals>.<lambda> at 0x7fe0c8439200>: no matching AST found among candidates:
(pid=2681)
(pid=2681) 2022-02-10 22:39:52.697055: W tensorflow/core/grappler/optimizers/data/auto_shard.cc:766] AUTO sharding policy will apply DATA sharding policy as it failed to apply FILE sharding policy because of the following reason: Found an unshardable source dataset: name: "TensorSliceDataset/_1"
(pid=2681) op: "TensorSliceDataset"
(pid=2681) input: "Placeholder/_0"
(pid=2681) attr {
(pid=2681)   key: "Toutput_types"
(pid=2681)   value {
(pid=2681)     list {
(pid=2681)       type: DT_STRING

 1/32 [..............................] - ETA: 12s - loss: 8.9606 - acc: 0.0000e+00
 2/32 [>.............................] - ETA: 5s - loss: 5.2671 - acc: 0.0000e+00 
 3/32 [=>............................] - ETA: 5s - loss: 5.9629 - acc: 0.0000e+00
 4/32 [==>...........................] - ETA: 5s - loss: 9.3301 - acc: 0.0000e+00
 5/32 [===>..........................] - ETA: 5s - loss: 8.6556 - acc: 0.0000e+00
 6/32 [====>.........................] - ETA: 5s - loss: 8.2196 - acc: 0.0000e+00
 7/32 [=====>........................] - ETA: 4s - loss: 7.8833 - acc: 0.0000e+00


{'validation_acc': 0.5, 'validation_loss': 4.38414192199707}

In [19]:
# Evaluate the model with the test dataset:
test_stats = estimator_model.evaluate(test_data_creator, batch_size = 32)
test_stats

(pid=2681) Found 624 files belonging to 2 classes.


(pid=2681) Cause: could not parse the source code of <function test_data_creator.<locals>.<lambda> at 0x7fe0c8439f80>: no matching AST found among candidates:
(pid=2681)
(pid=2681) Cause: could not parse the source code of <function test_data_creator.<locals>.<lambda> at 0x7fe0c5c09b00>: no matching AST found among candidates:
(pid=2681)
(pid=2681) 2022-02-10 22:40:11.083093: W tensorflow/core/grappler/optimizers/data/auto_shard.cc:766] AUTO sharding policy will apply DATA sharding policy as it failed to apply FILE sharding policy because of the following reason: Found an unshardable source dataset: name: "TensorSliceDataset/_1"
(pid=2681) op: "TensorSliceDataset"
(pid=2681) input: "Placeholder/_0"
(pid=2681) attr {
(pid=2681)   key: "Toutput_types"
(pid=2681)   value {
(pid=2681)     list {
(pid=2681)       type: DT_STRING

 1/20 [>.............................] - ETA: 2:19 - loss: 3.7326 - acc: 0.6562
 2/20 [==>...........................] - ETA: 1:27 - loss: 4.5879 - acc: 0.6094
 3/20 [===>..........................] - ETA: 1:22 - loss: 4.7107 - acc: 0.6250
 4/20 [=====>........................] - ETA: 1:17 - loss: 4.7919 - acc: 0.6250


{'validation_acc': 0.625, 'validation_loss': 4.74613618850708}

**Conclusions:**

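The trained model reaches an accuracy of 0.625 on the test set, which is a low value for a neural network. One possible reason is that we did not adapt the pretrained model enough to our goal; we could have added more layers on top of it to fit the data better. Another possible reason is that the amount of data is small. The gap between the accuracy obtained during training (0.99) and on the test set (0.625) indicates substantial overfitting. Two details of the code above may also contribute: the training images are read with shuffle=False and a batch size of 1, so the network sees them in a fixed, unshuffled order, and the frozen ResNet50 in model_creation is replaced by a new, fully trainable ResNet50 before the model is compiled.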
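
A useful reference point for these accuracies is the score of a trivial classifier that always predicts the most frequent class. The sketch below computes that baseline. `majority_class_accuracy` is a hypothetical helper, and the per-class counts are an assumption based on the commonly reported composition of the Kaggle test and validation folders; they should be recounted from the files before drawing conclusions.

```python
def majority_class_accuracy(class_counts):
    """Accuracy obtained by always predicting the most frequent class."""
    total = sum(class_counts.values())
    if total == 0:
        raise ValueError("class_counts must contain at least one example")
    return max(class_counts.values()) / total


if __name__ == "__main__":
    # Assumed per-class counts (not stated in this notebook, verify on disk).
    assumed_test_counts = {"NORMAL": 234, "PNEUMONIA": 390}
    assumed_val_counts = {"NORMAL": 8, "PNEUMONIA": 8}
    print("test baseline:", majority_class_accuracy(assumed_test_counts))
    print("val baseline: ", majority_class_accuracy(assumed_val_counts))
```

If the assumed counts are right, the baselines are 0.625 for the test set and 0.5 for the validation set, the same values reported above. That would be consistent with a model that assigns the same class to every image, so per-class metrics such as recall would be worth checking alongside accuracy.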