### Mark Lisi (ml2622)
# Malaria Detection Using Convolutional Neural Networks


By training a convolutional neural network (CNN) on images of cells - both infected with malaria and uninfected - we can accurately detect the presence of malaria using just an image.

In [1]:
# For our CNN
import tensorflow as tf 
import numpy as np 
from tensorflow.keras import datasets, layers, models

# For file/image parsing
import cv2 
from PIL import Image
import os
import matplotlib.pyplot as plt

## 1. Data Loading/Parsing

First, we must load our data! We are using the following cell image dataset from Kaggle: https://www.kaggle.com/datasets/iarunava/cell-images-for-detecting-malaria

Since this dataset separates infected and uninfected images, we can add binary labels as we load the images (0 for uninfected, 1 for infected). We will also employ a trick to get the most out of our image data: alongside each resized image, we add three distorted copies (rotated by 45°, rotated by 75°, and blurred) to the dataset. Besides quadrupling the number of samples, this gives our model experience with imperfect images, which should make it more robust.
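The bookkeeping behind this trick (one label per sample, with every distorted copy inheriting the label of its source image) can be sketched with NumPy alone. This is a minimal illustration, not the loading code of this notebook: `expand_with_variants` is a hypothetical helper, the random arrays stand in for real cell images, and right-angle rotation and flips are swapped in for the 45°/75° rotations and blur so that the sketch needs no image library.

```python
import numpy as np

def expand_with_variants(images, label, variant_fns):
    """Return (samples, labels): each image plus one copy per variant function."""
    samples, labels = [], []
    for img in images:
        samples.append(img)
        samples.extend(fn(img) for fn in variant_fns)
        # The original and all of its distorted copies share one label.
        labels.extend([label] * (1 + len(variant_fns)))
    return np.array(samples), np.array(labels)

# Stand-in distortions (assumption: NOT the rotations/blur used in this notebook).
variant_fns = [np.rot90, np.fliplr, np.flipud]

# Five random 50x50 RGB arrays standing in for resized cell images.
rng = np.random.default_rng(0)
fake_cells = [rng.integers(0, 256, size=(50, 50, 3), dtype=np.uint8) for _ in range(5)]

samples, sample_labels = expand_with_variants(fake_cells, label=1, variant_fns=variant_fns)
print(samples.shape, sample_labels.shape)
```

With three variants per image, five source images become twenty samples, all carrying the same label.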

In [2]:
infected = os.listdir('data/cell_images/cell_images/Parasitized/') 
uninfected = os.listdir('data/cell_images/cell_images/Uninfected/')


data = []
labels = []

for i in infected:
    try:
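        # Note: cv2.imread returns channels in BGR order (or None if the file is unreadable).
        # The order is never converted, so every image stays BGR despite the 'RGB' mode below.
        image = cv2.imread("data/cell_images/cell_images/Parasitized/"+i)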
        image_array = Image.fromarray(image , 'RGB')    
        resize_img = image_array.resize((50 , 50)) # Make sure our images are of uniform size
        # Applying some distortions to our training data will make our model more robust!
        rotated45 = resize_img.rotate(45)
        rotated75 = resize_img.rotate(75)
        blur = cv2.blur(np.array(resize_img) ,(10,10))
        data.append(np.array(resize_img))
        data.append(np.array(rotated45))
        data.append(np.array(rotated75))
        data.append(np.array(blur))
        labels.extend([1,1,1,1])
        
    except AttributeError: # If CV2 can't read in the image, we discard it.
        pass
    
for u in uninfected:
    try:
        image = cv2.imread("data/cell_images/cell_images/Uninfected/"+u)
        image_array = Image.fromarray(image , 'RGB')
        resize_img = image_array.resize((50 , 50)) # More resizing...
        # ...and more distortions.
        rotated45 = resize_img.rotate(45)
        rotated75 = resize_img.rotate(75)
        blur = cv2.blur(np.array(resize_img) ,(10,10))
        data.append(np.array(resize_img))
        data.append(np.array(rotated45))
        data.append(np.array(rotated75))
        data.append(np.array(blur))
        labels.extend([0,0,0,0])
        
    except AttributeError:
        pass

In [3]:
cells = np.array(data)
labels = np.array(labels)

Lastly, we will shuffle together the infected and uninfected images and split the data into test and training data.

In [6]:
n = np.arange(len(cells)) # n is an array of ordered indices: (0, 1, 2, ... , len(cells) - 1)
np.random.shuffle(n) # then we shuffle it!

# numpy fancy indexing: reorder both arrays with the same shuffled indices
# (this builds reordered copies, so images and labels stay paired)
cells = cells[n]
labels = labels[n]
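To see why indexing both arrays with the same shuffled indices keeps each image attached to its label, here is a small self-contained check on toy data. The names `toy_cells`, `toy_labels`, and `perm` are illustrative only and do not appear elsewhere in this notebook.

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy data: sample i is filled with the value i, and its label is also i,
# so a mismatch after shuffling would be easy to spot.
toy_cells = np.arange(6).reshape(6, 1, 1) * np.ones((6, 2, 2), dtype=int)
toy_labels = np.arange(6)

perm = rng.permutation(len(toy_cells))  # shuffled copy of the indices 0..5

shuffled_cells = toy_cells[perm]    # fancy indexing returns a new array
shuffled_labels = toy_labels[perm]  # same order applied to the labels

# Every image still sits next to its own label.
assert all(shuffled_cells[k, 0, 0] == shuffled_labels[k] for k in range(6))
```

The original arrays are left untouched; only the reassignment (`cells = cells[n]`) makes the shuffled order replace the old one.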

In [7]:
from sklearn.model_selection import train_test_split
train_images, test_images, train_labels, test_labels = train_test_split(cells, labels, test_size=0.2)
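Because every source image contributes four samples, a purely random split can place distorted copies of the same cell in both the training and the test set. One possible way to avoid that is to split by source image instead of by sample. The sketch below assumes a hypothetical `groups` array (not built in this notebook) that records which source image each sample came from; `split_by_source` is an illustrative helper, and scikit-learn's `GroupShuffleSplit` offers similar behavior.

```python
import numpy as np

def split_by_source(groups, test_size=0.2, seed=0):
    """Return boolean masks (train, test) such that all samples sharing a
    source-image id land on the same side of the split."""
    rng = np.random.default_rng(seed)
    unique_ids = np.unique(groups)
    rng.shuffle(unique_ids)
    n_test = int(round(test_size * len(unique_ids)))
    test_ids = unique_ids[:n_test]
    test_mask = np.isin(groups, test_ids)
    return ~test_mask, test_mask

# 10 hypothetical source images, 4 samples each (original + 3 distorted copies).
groups = np.repeat(np.arange(10), 4)
train_mask, test_mask = split_by_source(groups, test_size=0.2)

print(train_mask.sum(), test_mask.sum())
```

The masks could then index the image and label arrays (for example `cells[train_mask]`), provided `groups` is kept in the same order as the samples.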

## 2. Constructing the Model

Now that our data is properly loaded in, we can begin the actual classification process! A binary classification problem such as this one can be approached in many ways; for image classification, convolutional neural networks are a reliably accurate option.

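The model below stacks two convolutional layers, each with 32 filters of size 6×6, ReLU activation, and `'same'` padding, and follows each with 3×3 max pooling to shrink the feature maps. The result is flattened and passed to a single sigmoid unit, whose output can be read as the probability that the cell is infected.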

In [14]:
model = tf.keras.Sequential(
    [
        tf.keras.Input(shape=(50,50,3)),
        layers.Conv2D(32, kernel_size=(6, 6), activation='relu', padding='same'),
        layers.MaxPooling2D(pool_size=(3, 3), padding='same'),
        layers.Conv2D(32, kernel_size=(6, 6), activation='relu', padding='same'),
        layers.MaxPooling2D(pool_size=(3, 3), padding='same'),
        layers.Flatten(),
        #layers.Dropout(0.5),
        #layers.Dense(128, activation='tanh'),
        layers.Dense(1, activation='sigmoid'),
    ]
)

model.summary()

Model: "sequential_3"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 conv2d_6 (Conv2D)           (None, 50, 50, 32)        3488      
                                                                 
 max_pooling2d_6 (MaxPooling  (None, 17, 17, 32)       0         
 2D)                                                             
                                                                 
 conv2d_7 (Conv2D)           (None, 17, 17, 32)        36896     
                                                                 
 max_pooling2d_7 (MaxPooling  (None, 6, 6, 32)         0         
 2D)                                                             
                                                                 
 flatten_3 (Flatten)         (None, 1152)              0         
                                                                 
 dense_4 (Dense)             (None, 1)                 1153      
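The output shapes and parameter counts in this summary can be reproduced by hand. The sketch below assumes the Keras defaults used by the model above: pooling stride equal to the pool size, and `'same'` padding rounding the output size up. The helper names `same_pool_out` and `conv_params` are illustrative.

```python
import math

def same_pool_out(size, pool):
    # With padding='same' and stride equal to the pool size, the output is rounded up.
    return math.ceil(size / pool)

def conv_params(kernel, in_ch, filters):
    # One kernel x kernel x in_ch weight block per filter, plus one bias per filter.
    return kernel * kernel * in_ch * filters + filters

first_pool = same_pool_out(50, 3)           # side length after the first pooling layer
second_pool = same_pool_out(first_pool, 3)  # side length after the second pooling layer
flat = second_pool * second_pool * 32       # length of the flattened vector

conv1 = conv_params(6, 3, 32)   # first Conv2D sees 3 input channels
conv2 = conv_params(6, 32, 32)  # second Conv2D sees 32 input channels
dense = flat * 1 + 1            # one weight per input plus a bias

print(first_pool, second_pool, flat)               # 17 6 1152
print(conv1, conv2, dense, conv1 + conv2 + dense)  # 3488 36896 1153 41537
```

Most of the parameters sit in the second convolutional layer, since its 6×6 kernels span all 32 channels produced by the first one.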

## 3. Training...

We can now compile and start training our model.  

In [12]:
batch_size = 128
epochs = 5

model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
model.fit(train_images, train_labels, batch_size=batch_size, epochs=epochs, validation_split=0.1)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x2a230318dc0>

With our CNN, we were able to achieve a training accuracy of ~94.5%!
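The loss minimized during training, binary cross-entropy, penalizes confident wrong predictions far more than hesitant ones. A minimal NumPy sketch of the formula follows; the clipping constant `eps` is an assumption chosen to avoid `log(0)`, and the probabilities are made-up toy values, not outputs of the model above.

```python
import numpy as np

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean of -(y*log(p) + (1-y)*log(1-p)) over all samples."""
    y_true = np.asarray(y_true, dtype=float)
    # Clip probabilities away from 0 and 1 so the logarithms stay finite.
    y_prob = np.clip(np.asarray(y_prob, dtype=float), eps, 1.0 - eps)
    losses = -(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))
    return float(np.mean(losses))

toy_truth = [1, 0, 1, 0]
mostly_right = [0.9, 0.1, 0.8, 0.2]  # high probability on the correct class
mostly_wrong = [0.1, 0.9, 0.2, 0.8]  # high probability on the wrong class

print(binary_crossentropy(toy_truth, mostly_right))
print(binary_crossentropy(toy_truth, mostly_wrong))
```

The second value comes out much larger than the first, which is the signal the optimizer uses to push the sigmoid output toward the correct label.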

## 4. ...and Testing!

In [13]:
score = model.evaluate(test_images, test_labels, verbose=0)
print("Test loss: %.4f" % score[0])
print("Test accuracy: %.2f%%" % (100*score[1]))

Test loss: 0.2048
Test accuracy: 93.70%


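Our test accuracy (~93.7%) is almost as high as our training accuracy (~94.5%), which suggests the model did not overfit badly during training. One caveat: the rotated and blurred copies were created before the train/test split, so the test set can contain distorted versions of training images, and the test accuracy may be somewhat optimistic as a result.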
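Accuracy alone does not show which kind of mistake a classifier makes, and for a diagnostic task a missed infection and a false alarm are not equally costly. As a rough sketch, the function below thresholds sigmoid outputs at 0.5 (an assumed cutoff) and reports accuracy alongside sensitivity and specificity. `binary_metrics` is a hypothetical helper, and the labels and probabilities are made-up toy values, not results from this notebook.

```python
import numpy as np

def binary_metrics(y_true, y_prob, threshold=0.5):
    """Accuracy, sensitivity, and specificity for thresholded probabilities."""
    y_true = np.asarray(y_true).astype(bool)
    y_pred = np.asarray(y_prob) > threshold
    tp = int(np.sum(y_pred & y_true))    # infected, predicted infected
    tn = int(np.sum(~y_pred & ~y_true))  # uninfected, predicted uninfected
    fp = int(np.sum(y_pred & ~y_true))   # uninfected, predicted infected
    fn = int(np.sum(~y_pred & y_true))   # infected, predicted uninfected
    return {
        "accuracy": (tp + tn) / len(y_true),
        "sensitivity": tp / (tp + fn) if (tp + fn) else float("nan"),
        "specificity": tn / (tn + fp) if (tn + fp) else float("nan"),
    }

toy_labels = [1, 1, 1, 0, 0, 0, 0, 0]
toy_probs = [0.9, 0.7, 0.4, 0.2, 0.1, 0.6, 0.3, 0.05]
print(binary_metrics(toy_labels, toy_probs))
```

On real data, the probabilities would come from `model.predict(test_images)` flattened to one value per image.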