# **Malaria Detection**

## <b>Problem Definition</b>
**The context:** Why is this problem important to solve?<br>
**The objectives:** What is the intended goal?<br>
**The key questions:** What are the key questions that need to be answered?<br>
**The problem formulation:** What is it that we are trying to solve using data science?

## <b>Data Description </b>

There are 24,958 training images and 2,600 test images, all in color, taken from microscope images. Each image belongs to one of the following categories:<br>


**Parasitized:** The parasitized cells contain the Plasmodium parasite, which causes malaria.<br>
**Uninfected:** The uninfected cells are free of Plasmodium parasites.<br>


###<b> Mount the Drive

In [1]:
# Import the Colab module used to mount Google Drive

from google.colab import drive

drive.mount('/content/gdrive')

Mounted at /content/gdrive


### <b>Loading libraries</b>

In [8]:
# General data science libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import sklearn
from sklearn.metrics import classification_report, confusion_matrix

from concurrent.futures import ThreadPoolExecutor, as_completed

# Image processing

import cv2
import os

# TensorFlow + Keras for deep learning

import tensorflow as tf
from tensorflow import keras

from keras.models import Sequential
from keras.layers import Input, Dense, Conv2D, MaxPool2D, Flatten, Dropout, BatchNormalization, LeakyReLU
from keras.optimizers import Adam, SGD
from keras.utils import to_categorical

### <b>Let us load the data</b>

**Note:**
- Download the dataset from the link provided on Olympus, upload it to your Google Drive, and unzip the folder.

The extracted folder contains separate folders for the train and test data. Each one holds images of varying sizes for parasitized and uninfected cells, in subfolders named after the class.

All images must be resized to the same dimensions and stacked into a 4D array (samples, height, width, channels) so that they can be used as input for the convolutional neural network. We also need to create labels for both types of images to be able to train and test the model.

Let's do this for the training data first, then reuse the same code for the test data.

In [46]:
# Paths to the train and test folders
test_dir = "cell_images/test"
train_dir = "cell_images/train"

###<b> Check the shape of train and test images

In [None]:
# Return the number of files in a directory; used to report image-loading progress

def count_images(directory):

    return len([file for file in os.listdir(directory) if os.path.isfile(os.path.join(directory, file))])

In [52]:

parasitized_train_count = count_images(os.path.join(train_dir, "parasitized"))
print(f"Number of parasitized images in the training set: {parasitized_train_count}")
uninfected_train_count = count_images(os.path.join(train_dir, "uninfected"))
print(f"Number of uninfected images in the training set: {uninfected_train_count}")
train_count = parasitized_train_count + uninfected_train_count
print(f"Total number of images in the training set: {train_count}")

parasitized_test_count = count_images(os.path.join(test_dir, "parasitized"))
print(f"Number of parasitized images in the test set: {parasitized_test_count}")
uninfected_test_count = count_images(os.path.join(test_dir, "uninfected"))
print(f"Number of uninfected images in the test set: {uninfected_test_count}")
test_count = parasitized_test_count + uninfected_test_count
print(f"Total number of images in the test set: {test_count}")

Number of parasitized images in the training set: 12582
Number of uninfected images in the training set: 12376
Total number of images in the training set: 24958
Number of parasitized images in the test set: 1300
Number of uninfected images in the test set: 1300
Total number of images in the test set: 2600


In [53]:
# Read an image from disk and resize it to size x size pixels

def process_image(img_path, label, size):

  img = cv2.imread(img_path)

  # cv2.imread returns None for unreadable files, so check before resizing
  if img is None:

    return None, None

  img = cv2.resize(img, (size, size))

  return img, label

In [39]:
# Load every image under the given directory (train or test) and label it by its
# subfolder: 1 for parasitized, 0 for uninfected

def load_images_from_directory(dir, train_test):

  images = []
  labels = []

  for label in ['parasitized', 'uninfected']:

    path = os.path.join(dir, label)

    total = count_images(path) # Count once per folder instead of on every iteration

    for count, image in enumerate(os.listdir(path), start=1):

      img_path = os.path.join(path, image)

      # Use a separate name so the loop variable `label` is not overwritten
      img, img_label = process_image(img_path, label, 64)

      if img is not None and img_label is not None:

        images.append(img)
        labels.append(1 if img_label == 'parasitized' else 0) # Assigning 1 for parasitized and 0 for uninfected

      print(f"Loading {label} {train_test} images: {count}/{total}", end="\r")

    print()

  return images, labels

In [54]:
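# Load images with multithreading to speed up the process.
# Results are collected in completion order, so the image order can differ between runs,
# but each image stays paired with its own label.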

def load_images_from_directory_with_multithreading(dir, train_test):
    
    images = []
    labels = []

    tasks = []

    with ThreadPoolExecutor() as executor:

        for label in ['parasitized', 'uninfected']:

            path = os.path.join(dir, label)

            for image in os.listdir(path):

                img_path = os.path.join(path, image)

                tasks.append(executor.submit(process_image, img_path, label, 64))

        for task in as_completed(tasks):

            img, label = task.result()

            if img is not None and label is not None:

                images.append(img)
                labels.append(1 if label == 'parasitized' else 0)
        
    return np.array(images), np.array(labels)
                
  

In [55]:
X_train, y_train = load_images_from_directory_with_multithreading(train_dir, "train")
X_test, y_test = load_images_from_directory_with_multithreading(test_dir, "test")

###<b> Check the shape of train and test labels

In [45]:
X_train.shape, y_train.shape

((24958, 64, 64, 3), (24958,))

####<b> Observations and insights: _____


In [56]:
train_min, train_max = np.min(X_train), np.max(X_train)
test_min, test_max = np.min(X_test), np.max(X_test)

print(f"Train images - Min pixel value: {train_min}, Max pixel value: {train_max}")
print(f"Test images - Min pixel value: {test_min}, Max pixel value: {test_max}")

Train images - Min pixel value: 0, Max pixel value: 255
Test images - Min pixel value: 0, Max pixel value: 255


####<b> Observations and insights: _____



###<b> Count the number of values in both uninfected and parasitized
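One way to do the count is sketched below. It assumes the labels are a 1D NumPy array of 0s and 1s, as produced by the loader above (1 = parasitized, 0 = uninfected). The helper name `count_labels` and the toy array are illustrative, not part of the original notebook.

```python
import numpy as np

def count_labels(labels):
    """Return a {class_name: count} dict for binary labels (1 = parasitized, 0 = uninfected)."""
    labels = np.asarray(labels)
    # minlength=2 keeps both classes in the result even if one of them is absent
    counts = np.bincount(labels, minlength=2)
    return {"uninfected": int(counts[0]), "parasitized": int(counts[1])}

# Toy labels standing in for y_train
toy_labels = np.array([1, 0, 1, 1, 0])
print(count_labels(toy_labels))
```

Calling `count_labels(y_train)` and `count_labels(y_test)` should reproduce the per-class file counts printed earlier, provided every file was readable.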

###<b>Normalize the images
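Since the pixel values range from 0 to 255, a common choice is to divide by 255 so that every value falls in [0, 1]. The sketch below assumes the images are stored as a uint8 NumPy array; `normalize_images` is a hypothetical helper name and the random batch only stands in for `X_train`.

```python
import numpy as np

def normalize_images(images):
    """Scale uint8 pixel values from [0, 255] to float32 values in [0, 1]."""
    return np.asarray(images, dtype=np.float32) / 255.0

# Toy batch standing in for X_train: 2 images, 64x64 pixels, 3 channels
toy_batch = np.random.randint(0, 256, size=(2, 64, 64, 3), dtype=np.uint8)
scaled = normalize_images(toy_batch)
print(scaled.dtype, scaled.min(), scaled.max())
```

Using float32 rather than float64 halves the memory needed for the normalized arrays, which matters with roughly 25,000 training images.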

####<b> Observations and insights: _____

###<b> Plot to check if the data is balanced

####<b> Observations and insights: _____

### <b>Data Exploration</b>
Let's visualize the images from the train data

####<b> Observations and insights: _____

###<b> Visualize the images with subplot(6, 6) and figsize = (12, 12)

####<b>Observations and insights:

###<b> Plotting the mean images for parasitized and uninfected

<b> Mean image for parasitized

<b> Mean image for uninfected
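A mean image is the pixel-by-pixel average over all images of one class. The following is a minimal sketch under the assumption that images are a 4D array and labels a 1D array of 0s and 1s; `mean_image` is an illustrative helper name and the toy data is synthetic.

```python
import numpy as np

def mean_image(images, labels, target_label):
    """Average all images whose label equals target_label, pixel by pixel."""
    images = np.asarray(images, dtype=np.float32)
    labels = np.asarray(labels)
    selected = images[labels == target_label]
    if len(selected) == 0:
        raise ValueError(f"no images with label {target_label}")
    # Averaging over axis 0 collapses the sample dimension and keeps (H, W, C)
    return selected.mean(axis=0)

# Toy data: three constant 2x2 "images" with labels 1, 1, 0
toy_images = np.stack([np.full((2, 2, 3), v) for v in (10, 20, 30)])
toy_labels = np.array([1, 1, 0])
print(mean_image(toy_images, toy_labels, 1)[0, 0])
```

When displaying the result with `plt.imshow`, keep in mind that images read with `cv2.imread` are in BGR channel order, so the last axis may need reversing (`img[..., ::-1]`), and float images are expected in the [0, 1] range.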

####<b> Observations and insights: _____

### <b>Converting RGB to HSV of Images using OpenCV

###<b> Converting the train data

###<b> Converting the test data

####<b>Observations and insights: _____

###<b> Processing Images using Gaussian Blurring

###<b> Gaussian Blurring on train data

###<b> Gaussian Blurring on test data

####**Observations and insights: _____**

**Think About It:** Would blurring help us for this problem statement in any way? What else can we try?

## **Model Building**

### **Base Model**

**Note:** The Base Model has been fully built and evaluated with all outputs shown to give an idea about the process of the creation and evaluation of the performance of a CNN architecture. A similar process can be followed in iterating to build better-performing CNN architectures.

###<b> Importing the required libraries for building and training our Model

####<B>One Hot Encoding the train and test labels
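One-hot encoding turns each label into a vector with a single 1 at the class index. The `to_categorical` utility imported earlier is the usual route in Keras; the NumPy version below is only a sketch of the same idea, with `one_hot` as a hypothetical helper name.

```python
import numpy as np

def one_hot(labels, num_classes=2):
    """Map integer labels to one-hot rows, e.g. 1 -> [0, 1] and 0 -> [1, 0]."""
    labels = np.asarray(labels, dtype=np.int64)
    # Row i of the identity matrix is the one-hot vector for class i
    return np.eye(num_classes, dtype=np.float32)[labels]

# Toy labels standing in for y_train
encoded = one_hot([0, 1, 1])
print(encoded)
```

With this encoding, column 0 corresponds to uninfected and column 1 to parasitized, and `np.argmax(encoded, axis=1)` recovers the original integer labels.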

###<b> Building the model

In [None]:
#
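As a starting point, a small CNN for 64x64x3 inputs could look like the sketch below. The layer sizes, dropout rates, learning rate, and the function name `build_base_model` are all assumptions chosen for illustration; they are not the notebook's reference architecture. The model is compiled inside the function for convenience, using categorical cross-entropy to match one-hot encoded labels.

```python
from tensorflow import keras

layers = keras.layers

def build_base_model(input_shape=(64, 64, 3), num_classes=2):
    """Small CNN: two conv/pool/dropout stages followed by a dense classifier."""
    model = keras.Sequential([
        keras.Input(shape=input_shape),
        layers.Conv2D(32, kernel_size=3, padding="same", activation="relu"),
        layers.MaxPool2D(pool_size=2),
        layers.Dropout(0.2),
        layers.Conv2D(32, kernel_size=3, padding="same", activation="relu"),
        layers.MaxPool2D(pool_size=2),
        layers.Dropout(0.2),
        layers.Flatten(),
        layers.Dense(128, activation="relu"),
        layers.Dropout(0.4),
        # Two softmax outputs, one per class, to pair with one-hot labels
        layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(
        optimizer=keras.optimizers.Adam(learning_rate=1e-3),
        loss="categorical_crossentropy",
        metrics=["accuracy"],
    )
    return model

model = build_base_model()
model.summary()
```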

###<b> Compiling the model

<b> Using Callbacks
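Callbacks let training stop early and keep the best weights. The sketch below uses Keras's `EarlyStopping` and `ModelCheckpoint`; the helper name `make_callbacks`, the file name `best_model.keras`, and the patience value are hypothetical choices.

```python
from tensorflow import keras

def make_callbacks(checkpoint_path="best_model.keras", patience=3):
    """Return callbacks that stop on stalled val_loss and save the best model."""
    early_stop = keras.callbacks.EarlyStopping(
        monitor="val_loss",
        patience=patience,          # epochs to wait without improvement
        restore_best_weights=True,  # roll back to the best epoch when stopping
    )
    checkpoint = keras.callbacks.ModelCheckpoint(
        filepath=checkpoint_path,
        monitor="val_loss",
        save_best_only=True,
    )
    return [early_stop, checkpoint]

callbacks = make_callbacks()
print([type(cb).__name__ for cb in callbacks])
```

The list would be passed as `callbacks=callbacks` to `model.fit(...)`. Because both callbacks monitor `val_loss`, the fit call also needs `validation_split` or `validation_data`.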

<b> Fit and train our Model

###<b> Evaluating the model on test data

<b> Plotting the confusion matrix
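The confusion matrix and classification report can be computed with the scikit-learn functions imported earlier. The sketch assumes one-hot true labels and per-class predicted probabilities (for example the output of `model.predict`); the helper name `evaluate_predictions` and the toy arrays are illustrative.

```python
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix

def evaluate_predictions(y_true_onehot, y_prob):
    """Return (confusion matrix, text report) from one-hot labels and class probabilities."""
    y_true = np.argmax(y_true_onehot, axis=1)
    y_pred = np.argmax(y_prob, axis=1)
    # Rows are true classes, columns are predicted classes (0 = uninfected, 1 = parasitized)
    cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
    report = classification_report(
        y_true, y_pred, labels=[0, 1],
        target_names=["uninfected", "parasitized"], zero_division=0,
    )
    return cm, report

# Toy example: four cells, one uninfected cell misclassified as parasitized
toy_true = np.array([[1, 0], [1, 0], [0, 1], [0, 1]])
toy_prob = np.array([[0.9, 0.1], [0.3, 0.7], [0.2, 0.8], [0.4, 0.6]])
cm, report = evaluate_predictions(toy_true, toy_prob)
print(cm)
print(report)
```

The matrix can then be drawn with `sns.heatmap(cm, annot=True, fmt="d")`. For this problem the bottom-left cell (parasitized cells predicted as uninfected) deserves particular attention, since those are missed infections.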

<b>Plotting the train and validation curves

Now let's build another model with a few additional layers and check whether it improves performance. Add layers where needed and try different activation functions.

###<b> Model 1
####<b> Trying to improve the performance of our model by adding new layers


###<b> Building the Model

###<b> Compiling the model

<b> Using Callbacks

<b>Fit and Train the model

###<b> Evaluating the model

<b> Plotting the confusion matrix

<b> Plotting the train and the validation curves

###<b>Think about it:</b><br>
Now let's build a model with LeakyReLU as the activation function.

*  Can the model performance be improved if we change our activation function to LeakyReLU?
*  Can BatchNormalization improve our model?

Let us try to build a model using BatchNormalization, with LeakyReLU as the activation function.

###<b> Model 2 with Batch Normalization

###<b> Building the Model
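One way to combine the two ideas is to leave the activation out of each convolution and follow it with `BatchNormalization` and then a separate `LeakyReLU` layer. The sketch below is an assumption-laden example: the filter counts, dense size, dropout rate, and the name `build_bn_leaky_model` are illustrative, and `LeakyReLU` is used with its default negative slope.

```python
from tensorflow import keras

layers = keras.layers

def build_bn_leaky_model(input_shape=(64, 64, 3), num_classes=2):
    """CNN where each convolution is followed by BatchNormalization and LeakyReLU."""
    return keras.Sequential([
        keras.Input(shape=input_shape),
        layers.Conv2D(32, kernel_size=3, padding="same"),  # no activation here
        layers.BatchNormalization(),
        layers.LeakyReLU(),
        layers.MaxPool2D(pool_size=2),
        layers.Conv2D(64, kernel_size=3, padding="same"),
        layers.BatchNormalization(),
        layers.LeakyReLU(),
        layers.MaxPool2D(pool_size=2),
        layers.Flatten(),
        layers.Dense(128),
        layers.LeakyReLU(),
        layers.Dropout(0.4),
        layers.Dense(num_classes, activation="softmax"),
    ])

model_bn = build_bn_leaky_model()
model_bn.summary()
```

Placing normalization before the activation is one common ordering; swapping the two is another option to compare when iterating on this model.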

###<b>Compiling the model

<b> Using callbacks

<b>Fit and train the model

<b>Plotting the train and validation accuracy

###<b>Evaluating the model

####<b>Observations and insights: ____

<b> Generate the classification report and confusion matrix

###**Think About It :**<br>

* Can we improve the model with Image Data Augmentation?
* References to image data augmentation can be seen below:
  *   [Image Augmentation for Computer Vision](https://www.mygreatlearning.com/blog/understanding-data-augmentation/)
  *   [How to Configure Image Data Augmentation in Keras?](https://machinelearningmastery.com/how-to-configure-image-data-augmentation-when-training-deep-learning-neural-networks/)





###<b>Model 3 with Data Augmentation

###<b> Use image data generator
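To illustrate what a data generator does, the sketch below is a hand-rolled NumPy generator rather than Keras's `ImageDataGenerator`: it yields shuffled batches with random flips and 90-degree rotations, transformations that leave the label of a cell image unchanged. The name `augment_batches` and its parameters are hypothetical, and it assumes square images so that rotation preserves the shape.

```python
import numpy as np

def augment_batches(images, labels, batch_size=32, seed=0):
    """Yield (images, labels) batches forever, with random flips and 90-degree rotations."""
    rng = np.random.default_rng(seed)
    images = np.asarray(images)
    labels = np.asarray(labels)
    n = len(images)
    while True:
        order = rng.permutation(n)  # reshuffle at the start of every pass
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            augmented = []
            for img in images[idx]:
                if rng.random() < 0.5:
                    img = img[:, ::-1]  # horizontal flip
                if rng.random() < 0.5:
                    img = img[::-1]     # vertical flip
                # Rotate by 0, 90, 180, or 270 degrees in the height/width plane
                img = np.rot90(img, k=int(rng.integers(0, 4)))
                augmented.append(img)
            yield np.stack(augmented), labels[idx]

# Toy data standing in for X_train and y_train
toy_x = np.zeros((10, 64, 64, 3), dtype=np.float32)
toy_y = np.arange(10) % 2
batch_x, batch_y = next(augment_batches(toy_x, toy_y, batch_size=4))
print(batch_x.shape, batch_y.shape)
```

A generator like this can be passed to `model.fit` together with `steps_per_epoch`, since it never terminates on its own. Augmentation should be applied to the training data only, leaving the validation and test images untouched.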

###**Think About It :**<br>

*  Check if the performance of the model can be improved by changing different parameters in the ImageDataGenerator.



####<B>Visualizing Augmented images

####<b>Observations and insights: ____

###<b>Building the Model

<b>Using Callbacks

<b> Fit and Train the model

###<B>Evaluating the model

<b>Plot the train and validation accuracy

<B>Plotting the classification report and confusion matrix

<b> Now, let us try to use a pretrained model like VGG16 and check how it performs on our data.

### **Pre-trained model (VGG16)**
- Import the VGG16 network up to any layer you choose
- Add Fully Connected Layers on top of it
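The two steps above can be sketched as follows. The function name `build_vgg16_classifier`, the cut-off layer `block5_pool`, and the size of the dense head are assumptions chosen for illustration; with `weights="imagenet"` Keras downloads the pretrained weights on first use.

```python
from tensorflow import keras

def build_vgg16_classifier(input_shape=(64, 64, 3), num_classes=2,
                           weights="imagenet", last_layer="block5_pool"):
    """VGG16 convolutional base, cut at last_layer, with a small dense head on top."""
    base = keras.applications.VGG16(
        include_top=False, weights=weights, input_shape=input_shape
    )
    base.trainable = False  # freeze the pretrained convolutional layers

    # Take the output of the chosen layer instead of the full network
    features = base.get_layer(last_layer).output
    x = keras.layers.Flatten()(features)
    x = keras.layers.Dense(256, activation="relu")(x)
    x = keras.layers.Dropout(0.3)(x)
    outputs = keras.layers.Dense(num_classes, activation="softmax")(x)
    return keras.Model(inputs=base.input, outputs=outputs)

# Example call (downloads ImageNet weights on first use):
# vgg_model = build_vgg16_classifier()
```

Cutting the network at an earlier layer such as `block3_pool` gives larger feature maps and more generic features, which is one of the choices worth comparing here.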

###<b>Compiling the model

<b> using callbacks

<b>Fit and Train the model

<b>Plot the train and validation accuracy

###**Observations and insights: _____**

*   What can be observed from the validation and train curves?

###<b> Evaluating the model

<b>Plotting the classification report and confusion matrix

###<b>Think about it:</b>
*  What observations and insights can be drawn from the confusion matrix and classification report?
*  Choose the model with the best accuracy scores from all the above models and save it as a final model.


####<b> Observations and Conclusions drawn from the final model: _____



**Improvements that can be done:**<br>


*  Can the model performance be improved using other pre-trained models or a different CNN architecture?
*  You can try to build a model using these HSV images and compare them with your other models.

#### **Insights**

####**Refined insights**:
- What are the most meaningful insights from the data relevant to the problem?

####**Comparison of various techniques and their relative performance**:
- How do different techniques perform? Which one is performing relatively better? Is there scope to improve the performance further?

####**Proposal for the final solution design**:
- What model do you propose to be adopted? Why is this the best solution to adopt?