## Malaria Level Detection classifier

* This notebook implements a PyTorch classifier that detects the different stages of malaria.
* The dataset used for this project was downloaded from [Kaggle](https://www.kaggle.com/kmader/malaria-bounding-boxes). It contains a total of 1,364 images with roughly 80,000 cells annotated by human researchers into different categories.
* Each image contains tens of blood smears. The dataset includes two JSON files, which contain the following details:
 * Image **path**
 * **shape** containing size of the image and number of channels
 * **objects** containing the `lower left co-ordinates` and `upper right co-ordinates` of the blood smears and the `category` of each smear.
* We used Python to crop out each cell using its co-ordinates in the image and saved it to the folder created for its category. The script `crop_utils.py` uses OpenCV, pandas and other libraries.
* Exploratory data analysis and data preprocessing were carried out because the dataset is highly imbalanced. We used up-sampling and down-sampling to bring the data distribution to the desired ratio. The details and implementation are in `EDA_DataPreProcessing.ipynb`.
* The processed dataset is divided into three different subsets, `train`, `valid` and `test`.
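
The up-sampling and down-sampling idea can be sketched in plain Python. This is only an illustration and not the code from `EDA_DataPreProcessing.ipynb`: the function name `resample_to_target`, the target count of 200 and the file names are all hypothetical.

```python
import random

def resample_to_target(items, target, seed=0):
    """Return exactly `target` items drawn from `items`.

    Down-samples without replacement when there are too many items and
    up-samples by appending random duplicates when there are too few.
    """
    if not items:
        raise ValueError("cannot resample an empty class")
    rng = random.Random(seed)
    if len(items) >= target:
        return rng.sample(items, target)           # down-sampling
    extra = rng.choices(items, k=target - len(items))
    return list(items) + extra                     # up-sampling

# Hypothetical, heavily imbalanced file lists per category
files = {
    "red_blood_cell": ["rbc_{}.png".format(i) for i in range(1000)],
    "gametocyte": ["gam_{}.png".format(i) for i in range(30)],
}
balanced = {name: resample_to_target(paths, target=200)
            for name, paths in files.items()}
print({name: len(paths) for name, paths in balanced.items()})
```

In this sketch the majority class loses images while the minority class keeps every original image and gains duplicates, so both end up the same size.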

### Classifier implementation
* As the dataset is relatively small, we used [transfer learning](https://towardsdatascience.com/what-is-transfer-learning-8b1a0fa42b4): a pre-trained model provides the feature extractor, and we customized the classifier part of the model.
* For better feature extraction, the pre-trained model is the one saved from `Pretrained_model.ipynb`.
* Building the model involved the following steps:
 * Data transformation: the [torchvision.transforms](https://pytorch.org/docs/stable/torchvision/transforms.html) module is used to augment the data during training by `flipping`, `rotating`, `jittering`, `cropping` and `normalizing`. The transformations are defined separately for `train` and for `test and valid`, as the `test and validation` sets don't need the same set of transformations.
 * The dataset is fed to the network every epoch in batches of 16 for faster convergence.
 * The `device` is allocated dynamically based on the availability of CUDA.
 * The `feature` network parameters are frozen at their pre-trained values by setting `requires_grad` to False.
 * The customized fully connected `classifier` network uses:
  * two hidden layers of 1024 and 512 neurons, the first of which takes its input from the `feature` CNN network, followed by an output layer with 5 units, one per category.
  * [`ReLU`](https://pytorch.org/docs/stable/nn.html#torch.nn.ReLU) as the activation function.
  * a [`dropout`](https://pytorch.org/docs/stable/nn.html#torch.nn.Dropout) of 0.2, which turns off 20% of the neurons at random during training to reduce overfitting and make the model generalise better.
 * [`CrossEntropyLoss`](https://pytorch.org/docs/stable/nn.html#torch.nn.CrossEntropyLoss) is used as the loss function, as we have multiple categories.
 * Stochastic Gradient Descent ([SGD](https://pytorch.org/docs/stable/optim.html#torch.optim.SGD)) is used as the optimizer to update the network parameters for every batch in every epoch.
 * The learning rate is decayed by a factor of 0.2 every 5 epochs for smooth convergence to the optimum.
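
To make the choice of loss function more concrete, here is a small pure-Python sketch of what a cross-entropy loss computes for a single sample from the raw scores (logits) of the 5 categories. The helper `cross_entropy` and the example scores are hypothetical. `CrossEntropyLoss` performs the equivalent log-softmax computation on tensors and, by default, averages it over the batch.

```python
import math

def cross_entropy(logits, target):
    """Cross-entropy of one sample: -log(softmax(logits)[target])."""
    m = max(logits)  # subtract the maximum for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]

# Hypothetical raw scores for the 5 categories
logits = [0.2, 2.5, 0.1, -1.0, 0.3]
print(cross_entropy(logits, target=1))  # lower loss: class 1 has the highest score
print(cross_entropy(logits, target=3))  # higher loss: class 3 has the lowest score
```

With equal scores for all 5 categories the loss equals log(5), which is roughly 1.61. That is a useful reference point when reading the first training epochs.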

In [16]:
# Import the required modules
import copy
import numpy as np
import os
import pandas as pd
import torch
import torch.nn as nn
import torch.optim as optim
import time
import torchvision
from torch.utils.data import random_split
from torch.optim import lr_scheduler
from torch.autograd import Variable
from torchvision import datasets, models, transforms
# from torch.utils.data.sampler import SubsetRandomSampler

torch.cuda.current_device() # Work around for the Bug https://github.com/pytorch/pytorch/issues/20635

0

In [2]:
# Dataset directory
data_dir = r"E:\Class_Notes_Sem2\ADM\Project\malaria-bounding-boxes\malaria\Processed_Images"

The training data is transformed on the fly, for every batch of every epoch.

In [3]:
transformations = {
    'train': transforms.Compose([
        transforms.RandomHorizontalFlip(),
        transforms.RandomRotation(50),
        transforms.ColorJitter(brightness=0.5, contrast=0.5, saturation=0.5, hue=0.5),
        transforms.RandomResizedCrop(224),
        transforms.ToTensor(),
        transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
    ]),
    'test': transforms.Compose([
        transforms.Resize(240),
        transforms.CenterCrop(224),
        transforms.ToTensor(),
        transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
    ]),
    'valid': transforms.Compose([
        transforms.Resize(240),
        transforms.CenterCrop(224),
        transforms.ToTensor(),
        transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
    ])
}

dataset = { x : datasets.ImageFolder(os.path.join(data_dir, x), transformations[x])
               for x in ['train', 'test', 'valid']
          }

dataset_loaders = {x : torch.utils.data.DataLoader(dataset[x], batch_size=16,
                        shuffle=True, num_workers=4) for x in ['train', 'test', 'valid']
                  }
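
As a small illustration of the normalization step, the sketch below applies `(value - mean) / std` per channel to a single RGB pixel, which is what `transforms.Normalize` does for every pixel of the tensor produced by `ToTensor` (values in [0, 1]). The mean and standard deviation are the ImageNet statistics used above. The helper `normalize_pixel` is hypothetical.

```python
MEAN = [0.485, 0.456, 0.406]
STD = [0.229, 0.224, 0.225]

def normalize_pixel(rgb):
    """Apply (value - mean) / std per channel to one RGB pixel in [0, 1]."""
    return [(v - m) / s for v, m, s in zip(rgb, MEAN, STD)]

print(normalize_pixel([0.485, 0.456, 0.406]))  # a pixel equal to the mean maps to zeros
print(normalize_pixel([1.0, 1.0, 1.0]))        # a white pixel maps to positive values
```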

In [4]:
# Dynamically allocating the device for computation
device = 'cuda' if torch.cuda.is_available() else 'cpu'
print(device)

cuda


In [5]:
# Pretrained model location
model = torch.load(r'E:\Class_Notes_Sem2\ADM\Project\malaria_level_detection\first_model.pth')
# To load VGG models with pretrained parameters
# model = models.vgg16(pretrained=True)

# Setting requires_grad to false to stop calculating gradients for all layers
for param in model.parameters():
    param.requires_grad = False

# Getting the number of features coming from the feature network to the classifier network
num_ftrs = model.classifier[0].in_features

# Customizing the classifier network and replacing the loaded one; requires_grad is True for these new layers by default.
model.classifier = nn.Sequential(
    nn.Linear(num_ftrs, 1024),  
    nn.ReLU(), 
    nn.Dropout(p=0.2),
    nn.Linear(1024, 512),
    nn.ReLU(), 
    nn.Dropout(p=0.2),
    nn.Linear(512, 5)
)
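
To get a feel for how many parameters remain trainable after freezing the feature network, the following sketch counts the weights and biases of the three `Linear` layers. It assumes `num_ftrs = 25088`, which is the classifier input size of VGG16. The model loaded from `first_model.pth` may use a different value, so treat the total as an example only.

```python
def linear_params(in_features, out_features):
    """Weights plus biases of one fully connected layer."""
    return in_features * out_features + out_features

num_ftrs = 25088  # assumption: VGG16 classifier input (512 * 7 * 7)
layers = [(num_ftrs, 1024), (1024, 512), (512, 5)]
total = sum(linear_params(i, o) for i, o in layers)
print(total)
```

Under this assumption almost all trainable parameters sit in the first layer, which connects the flattened feature map to the 1024 hidden units.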

In [6]:
# Loading the model to the device
model.to(device)

# Loss Function definition
criterion = nn.CrossEntropyLoss()

# Optimizer for back propagation
optimizer_classifier = optim.SGD(model.classifier.parameters(), lr=0.005, momentum=0.9)

# Decay LR by a factor of 0.2 every 5 epochs
classifier_lr_scheduler = lr_scheduler.StepLR(optimizer_classifier, step_size=5, gamma=0.2)
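
The step decay configured above follows `lr = initial_lr * gamma ** (epoch // step_size)`, where `epoch` counts the scheduler steps taken so far. The helper `step_lr` below is a hypothetical pure-Python version of that formula, using the values from this notebook.

```python
def step_lr(initial_lr, gamma, step_size, epoch):
    """Learning rate after `epoch` scheduler steps under step decay."""
    return initial_lr * gamma ** (epoch // step_size)

# Learning rate at a few points of a 25-epoch run
for epoch in (0, 4, 5, 10, 20):
    print(epoch, step_lr(0.005, 0.2, 5, epoch))
```

The rate stays at 0.005 for the first five epochs and is then multiplied by 0.2 at every fifth epoch.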

#### Defining the Training function.
* The dropout layers are switched on for the `train` phase and off for the `valid` phase.
* The model weights that give the best accuracy on the validation set are kept.
* The overall accuracy of the model is printed for each epoch.
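
The effect of switching dropout on and off can be sketched without PyTorch. The hypothetical `dropout` function below mimics inverted dropout on a plain list: in training mode each activation is zeroed with probability `p` and the survivors are scaled by `1 / (1 - p)`, and in evaluation mode the input passes through unchanged. `nn.Dropout` behaves the same way on tensors.

```python
import random

def dropout(values, p=0.2, training=True, rng=None):
    """Inverted dropout on a list of activations."""
    if not training:
        return list(values)  # evaluation mode: identity
    rng = rng or random.Random(0)
    scale = 1.0 / (1.0 - p)
    # Each activation is zeroed with probability p; survivors are rescaled
    return [0.0 if rng.random() < p else v * scale for v in values]

activations = [1.0] * 10
print(dropout(activations, training=True))   # each entry is either 0.0 or 1.25
print(dropout(activations, training=False))  # unchanged
```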

In [7]:
def train_model(model, criterion, optimizer_cl, scheduler, num_epochs=25):
    since = time.time()

    best_model_wts = copy.deepcopy(model.state_dict())
    best_acc = 0.0

    for epoch in range(num_epochs):
        print('Epoch {}/{}'.format(epoch+1, num_epochs))
        print('-' * 10)

        # Each epoch has a training and validation phase
        for phase in ['train', 'valid']:
            if phase == 'train':
                model.train()  # Set model to training mode (dropout active)
            else:
                model.eval()   # Set model to evaluation mode (dropout disabled)
            running_loss = 0.0
            running_corrects = 0

            # Iterate over data.
            for inputs, labels in dataset_loaders[phase]:
                # Move the inputs and labels to the device
                inputs, labels = inputs.to(device), labels.to(device)

                # Reset the gradients before every batch
                optimizer_cl.zero_grad()

                # Forward pass; gradients are tracked only in the training phase
                with torch.set_grad_enabled(phase == 'train'):
                    outputs = model(inputs)
                    _, preds = torch.max(outputs, 1)
                    loss = criterion(outputs, labels)

                    # Backward pass, optimize only if in the training phase
                    if phase == 'train':
                        loss.backward()
                        optimizer_cl.step()

                # Accumulate the loss and the correct predictions for each batch
                running_loss += loss.item() * inputs.size(0)
                running_corrects += torch.sum(preds == labels).item()

            # Step the LR scheduler once per epoch, after the optimizer updates
            if phase == 'train':
                scheduler.step()

            # Get the statistics of loss and accuracy for each epoch
            epoch_loss = running_loss / len(dataset[phase])
            epoch_acc = running_corrects / len(dataset[phase])

            print('{} Loss: {:.4f} Acc: {:.4f}'.format(
                phase, epoch_loss, epoch_acc))

            # Copy the model with best accuracy
            if phase == 'valid' and epoch_acc > best_acc:
                best_acc = epoch_acc
                best_model_wts = copy.deepcopy(model.state_dict())

        print()

    time_elapsed = time.time() - since
    print('Training complete in {:.0f}m {:.0f}s'.format(
        time_elapsed // 60, time_elapsed % 60))
    print('Best val Acc: {:.4f}'.format(best_acc))

    # load best model weights
    model.load_state_dict(best_model_wts)
    return model
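
The training function multiplies each batch loss by the batch size before summing. The reason is that `CrossEntropyLoss` returns the mean loss of a batch by default, and the last batch of an epoch can be smaller than the others. The sketch below, with hypothetical numbers and a hypothetical helper `epoch_average`, shows how the size-weighted average differs from a naive mean of batch means.

```python
def epoch_average(batch_means, batch_sizes):
    """Average per-sample loss computed from per-batch mean losses."""
    total = sum(mean * size for mean, size in zip(batch_means, batch_sizes))
    return total / sum(batch_sizes)

# Hypothetical epoch: two full batches of 16 and a final batch of 4
batch_means = [1.0, 1.0, 2.0]
batch_sizes = [16, 16, 4]
print(epoch_average(batch_means, batch_sizes))  # weighted by batch size
print(sum(batch_means) / len(batch_means))      # naive mean over-weights the small batch
```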

In [8]:
def evaluate_model(model, dataloader, criterion):
    model.eval()  # Evaluation mode disables dropout
    running_loss, running_corrects = 0.0, 0
    # Use the size of the dataset behind the loader that was passed in
    n_samples = len(dataloader.dataset)
    # No gradients are needed for evaluation
    with torch.no_grad():
        for inputs, labels in dataloader:
            inputs, labels = inputs.to(device), labels.to(device)
            outputs = model(inputs)
            _, preds = torch.max(outputs, 1)
            loss = criterion(outputs, labels)
            running_loss += loss.item() * inputs.size(0)
            running_corrects += torch.sum(preds == labels).item()
    test_loss = running_loss / n_samples
    test_acc = running_corrects / n_samples
    print('Test Loss: {:.4f} Acc: {:.4f}'.format(test_loss, test_acc))
    return test_loss, test_acc

In [9]:
# Train the model for 25 epochs
model_ft = train_model(model, criterion, optimizer_classifier, classifier_lr_scheduler,
                       num_epochs=25)

Epoch 1/25
----------
train Loss: 1.2071 Acc: 0.4753
valid Loss: 1.0916 Acc: 0.5450

Epoch 2/25
----------
train Loss: 1.0771 Acc: 0.5475
valid Loss: 0.9052 Acc: 0.6160

Epoch 3/25
----------
train Loss: 1.0390 Acc: 0.5613
valid Loss: 0.9546 Acc: 0.5970

Epoch 4/25
----------
train Loss: 0.9973 Acc: 0.5856
valid Loss: 0.9286 Acc: 0.6440

Epoch 5/25
----------
train Loss: 0.9296 Acc: 0.6144
valid Loss: 0.9004 Acc: 0.6580

Epoch 6/25
----------
train Loss: 0.9091 Acc: 0.6296
valid Loss: 0.9123 Acc: 0.6070

Epoch 7/25
----------
train Loss: 0.8882 Acc: 0.6349
valid Loss: 0.8504 Acc: 0.6450

Epoch 8/25
----------
train Loss: 0.8790 Acc: 0.6451
valid Loss: 0.8145 Acc: 0.6700

Epoch 9/25
----------
train Loss: 0.8685 Acc: 0.6457
valid Loss: 0.7909 Acc: 0.6910

Epoch 10/25
----------
train Loss: 0.8734 Acc: 0.6495
valid Loss: 0.7931 Acc: 0.6910

Epoch 11/25
----------
train Loss: 0.8460 Acc: 0.6503
valid Loss: 0.8070 Acc: 0.6690

Epoch 12/25
----------
train Loss: 0.8529 Acc: 0.6531
valid Los

In [10]:
# Evaluate the model on test data
evaluate_model(model_ft, dataset_loaders['test'], criterion)

Test Loss: 0.7463 Acc: 0.7007


(0.7463246703147888, 0.7006666666666667)

In [16]:
# Free up CUDA Cached memory
torch.cuda.empty_cache()

### Creating the confusion matrix for the model evaluation

In [None]:
# Create the confusion matrix (rows: true labels, columns: predictions)
nb_classes = 5

confusion_matrix = torch.zeros(nb_classes, nb_classes)
with torch.no_grad():
    for i, (inputs, labels) in enumerate(dataset_loaders['test']):
        inputs, labels = inputs.to(device), labels.to(device)
        outputs = model_ft(inputs)
        _, preds = torch.max(outputs, 1)
        for t, p in zip(labels.view(-1), preds.view(-1)):
            confusion_matrix[t.long(), p.long()] += 1

print(confusion_matrix)

In [12]:
print(confusion_matrix.diag()/confusion_matrix.sum(1))

tensor([0.5033, 0.7933, 0.8633, 0.7967, 0.5467])
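
Dividing the diagonal of the confusion matrix by its row sums gives the per-class recall: the fraction of the true samples of each class that were predicted correctly. A pure-Python sketch of the same calculation follows. The helper `per_class_recall` and the 3-class matrix are hypothetical and unrelated to the results above.

```python
def per_class_recall(matrix):
    """Diagonal divided by row sums; rows are true classes, columns are predictions."""
    return [row[i] / sum(row) if sum(row) else float("nan")
            for i, row in enumerate(matrix)]

# Hypothetical 3-class confusion matrix
matrix = [
    [8, 1, 1],
    [2, 6, 2],
    [0, 5, 5],
]
print(per_class_recall(matrix))
```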


In [15]:
dataset['test'].class_to_idx

{'gametocyte': 0,
 'red_blood_cell': 1,
 'ring': 2,
 'schizont': 3,
 'trophozoite': 4}

In [17]:
confusion_matrix_df = pd.DataFrame(columns = dataset['test'].class_to_idx.keys())

In [None]:
for rows in enumerate(te)