## Malaria Level Detection classifier

* This notebook implements a classifier using PyTorch to detect different stages of malaria.
* The dataset used for this project has been downloaded from [Kaggle](https://www.kaggle.com/kmader/malaria-bounding-boxes). The dataset contains a total of 1364 images with roughly 80,000 cells annotated by human researchers in different categories.
* In each of the images, tens of blood smears are present. There are two JSON files in the dataset, which contain details about:
 * Image **path**
 * **shape** containing size of the image and number of channels
 * **objects** containing the `lower left co-ordinates` and `upper right co-ordinates` of the blood smears and the `category` of each smear.
* We have used Python to crop out each cell using the co-ordinates in the JSON files and save it to the respective folder created for each category. The script `crop_utils.py` uses OpenCV, pandas and other libraries.
* Exploratory Data Analysis and data preprocessing are done because the dataset is highly imbalanced. We have used up-sampling and down-sampling to bring the data distribution to a desired ratio. The details and implementation are in `EDA_DataPreProcessing.ipynb`.
* The processed dataset is divided into three different subsets, `train`, `valid` and `test`.
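
The up-sampling and down-sampling step can be pictured with a small sketch. This is not the code from `EDA_DataPreProcessing.ipynb`: the helper name `rebalance`, the category names and the target count are all hypothetical. Classes with more than `target` files are randomly down-sampled without replacement, and classes with fewer files are up-sampled by drawing extra files with replacement.

```python
import random

def rebalance(files_by_class, target, seed=0):
    """Return a dict with exactly `target` file names per class.

    Hypothetical helper: large classes are down-sampled without
    replacement, small classes are up-sampled with replacement.
    """
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    balanced = {}
    for label, files in files_by_class.items():
        if len(files) >= target:
            # Down-sampling: keep a random subset of the files
            balanced[label] = rng.sample(files, target)
        else:
            # Up-sampling: keep every file and repeat some at random
            extra = rng.choices(files, k=target - len(files))
            balanced[label] = list(files) + extra
    return balanced

# Made-up file lists standing in for the cropped cell images
files_by_class = {
    "majority_class": ["maj_{}.png".format(i) for i in range(100)],
    "minority_class": ["min_{}.png".format(i) for i in range(8)],
}
balanced = rebalance(files_by_class, target=20)
for label, files in balanced.items():
    print(label, len(files))
```

Because up-sampling repeats files, the repeated copies only differ from each other once random augmentation is applied at training time.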

### Classifier implementation
* As the dataset size is relatively small we have used [transfer learning](https://towardsdatascience.com/what-is-transfer-learning-8b1a0fa42b4), where a pre-trained model is used and we have customized the classifier part of the model.
* As the pre-trained model, for better feature extraction we have used the model saved from `Pretrained_model.ipynb`.
* In building the model we have done:
 * Data Transformation: the [torchvision.transforms](https://pytorch.org/docs/stable/torchvision/transforms.html) module has been used to augment the data while training with `flip`, `rotate`, `jittering`, `cropping` and `normalizing`. The transformations are defined separately for `train` and for `test and valid`, as `test and validation` don't need the same set of transformations.
 * We are feeding the network the dataset each epoch in batches of 16 for faster convergence.
 * We have dynamically allocated the `device` based on the availability of CUDA.
 * The `feature` network parameters are frozen at their pre-trained values by setting `requires_grad` to False.
 * The customized fully connected `classifier` network uses:
  * two hidden layers of 1024 and 512 neurons, the first of which takes its input from the `feature` CNN network, followed by a 5-unit output layer.
  * [`ReLU`](https://pytorch.org/docs/stable/nn.html#torch.nn.ReLU) as the activation function.
  * a [`dropout`](https://pytorch.org/docs/stable/nn.html#torch.nn.Dropout) of 0.3 after each hidden layer, which turns off 30% of the neurons at random during training to reduce overfitting and help the model generalise.
 * As a loss function [`CrossEntropyLoss`](https://pytorch.org/docs/stable/nn.html#torch.nn.CrossEntropyLoss) has been used as we have multiple categories.
 * Stochastic Gradient Descent ([SGD](https://pytorch.org/docs/stable/optim.html#torch.optim.SGD)) is used as the optimizer to update the network parameters after every batch in each epoch.
 * We decay the learning rate by a factor of 0.1 every 5 epochs for smooth convergence to the optimum.
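
To make the loss named in the list above concrete: for a single sample, `CrossEntropyLoss` applies a softmax to the raw network outputs (logits) and returns the negative log of the probability assigned to the true class. The sketch below reproduces that arithmetic in plain Python for one made-up sample with five classes; the logit values and the helper name `cross_entropy` are illustrative only.

```python
import math

def cross_entropy(logits, target):
    """Negative log-softmax probability of the `target` class."""
    # Subtract the maximum logit for numerical stability
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]

logits = [2.0, 0.5, 0.1, -1.0, 0.3]  # made-up scores for five classes
loss_correct = cross_entropy(logits, target=0)  # true class has the top score
loss_wrong = cross_entropy(logits, target=3)    # true class has the lowest score
print(loss_correct, loss_wrong)
```

The loss is small when the true class receives the highest score and grows as the network puts more probability on the other classes.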

In [None]:
# Import the required modules
import copy
import matplotlib.pyplot as plt
import numpy as np
import os
import pandas as pd
import torch
import torch.nn as nn
import torch.optim as optim
import time
import torchvision
from sklearn.metrics import classification_report
from torch.utils.data import random_split
from torch.optim import lr_scheduler
from torch.autograd import Variable
from torchvision import datasets, models, transforms
%matplotlib inline

# Workaround for the bug https://github.com/pytorch/pytorch/issues/20635
# Guarded so that the notebook also runs on machines without CUDA
if torch.cuda.is_available():
    torch.cuda.current_device()

In [None]:
# Dataset directory
data_dir = r"E:\Class_Notes_Sem2\ADM\Project\malaria-bounding-boxes\malaria\Processed_Images"

The transformations below are applied to every batch in every epoch while training.

In [None]:
transformations = {
    'train': transforms.Compose([
        transforms.RandomHorizontalFlip(),
        transforms.RandomRotation(50),
        transforms.RandomResizedCrop(224),
        transforms.ToTensor(),
        transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
    ]),
    'test': transforms.Compose([
        transforms.Resize(240),
        transforms.CenterCrop(224),
        transforms.ToTensor(),
        transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
    ]),
    'valid': transforms.Compose([
        transforms.Resize(240),
        transforms.CenterCrop(224),
        transforms.ToTensor(),
        transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
    ])
}

dataset = { x : datasets.ImageFolder(os.path.join(data_dir, x), transformations[x])
               for x in ['train', 'test', 'valid']
          }

dataset_loaders = {x : torch.utils.data.DataLoader(dataset[x], batch_size=16,
                        shuffle=True, num_workers=4) for x in ['train', 'test', 'valid']
                  }
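
`transforms.Normalize(mean, std)` rescales each channel as `(x - mean) / std`. The mean and standard deviation used above are the widely used ImageNet channel statistics. A plain-Python sketch of the per-channel arithmetic for a single made-up RGB pixel (values in [0, 1], as produced by `ToTensor`); the helper name `normalize_pixel` is hypothetical:

```python
mean = [0.485, 0.456, 0.406]  # per-channel means used in the transforms
std = [0.229, 0.224, 0.225]   # per-channel standard deviations

def normalize_pixel(pixel, mean, std):
    """Apply (x - mean) / std to each channel of one RGB pixel."""
    return [(x - m) / s for x, m, s in zip(pixel, mean, std)]

pixel = [0.714, 0.456, 0.181]  # made-up RGB values in [0, 1]
normalized = normalize_pixel(pixel, mean, std)
print(normalized)
```

A channel value equal to the mean maps to zero, and a value one standard deviation above or below the mean maps to roughly +1 or -1.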

In [None]:
# Dynamically allocating the device for computation
device = 'cuda' if torch.cuda.is_available() else 'cpu'
print(device)

In [None]:
# Pretrained model location
model = torch.load(r'E:\Class_Notes_Sem2\ADM\Project\malaria_level_detection\first_model.pth')

# Setting requires_grad to false to stop calculating gradients for all layers
for param in model.parameters():
    param.requires_grad = False

# Getting the number of features coming from the feature network to the classifier network
num_ftrs = model.classifier[0].in_features

# Customizing the classifier network and replacing the loaded one; requires_grad is True for these new layers by default.
model.classifier = nn.Sequential(
    nn.Linear(num_ftrs, 1024),  
    nn.ReLU(), 
    nn.Dropout(p=0.3),
    nn.Linear(1024, 512),
    nn.ReLU(), 
    nn.Dropout(p=0.3),
    nn.Linear(512, 5)
)

In [None]:
# Loading the model to the device
model.to(device)

# Loss Function definition
criterion = nn.CrossEntropyLoss()

# Optimizer for back propagation
optimizer_classifier = optim.SGD(model.classifier.parameters(), lr=0.005, momentum=0.9)

# Decay LR by a factor of 0.1 every 5 epochs
classifier_lr_scheduler = lr_scheduler.StepLR(optimizer_classifier, step_size=5, gamma=0.1)
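
With the arguments passed to `StepLR` above (`step_size=5`, `gamma=0.1`) and the initial learning rate of 0.005, the learning rate after a given number of scheduler steps follows `lr = lr0 * gamma ** (epoch // step_size)`. The helper `step_lr` below is a hypothetical stand-in that reproduces this closed form in plain Python; it is not the scheduler itself.

```python
def step_lr(initial_lr, gamma, step_size, epoch):
    """Closed-form step decay: multiply by `gamma` every `step_size` epochs."""
    return initial_lr * gamma ** (epoch // step_size)

# Values taken from the optimizer and scheduler definitions in this notebook
initial_lr, gamma, step_size = 0.005, 0.1, 5

schedule = [step_lr(initial_lr, gamma, step_size, e) for e in range(10)]
for epoch, lr in enumerate(schedule):
    print("epoch {:2d}: lr = {:.6f}".format(epoch, lr))
```

Here `epoch` counts how many times `scheduler.step()` has been called, so the rate stays constant for five steps and then drops by a factor of ten.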

#### Defining the Training function.
* Dropout layers are turned on for the `train` phase and off for the `valid` phase.
* The model weights that give the best accuracy on the validation set are kept.
* The overall loss and accuracy of the model are printed for each epoch.

In [None]:
def train_model(model, criterion, optimizer_cl, scheduler, num_epochs=25):
    since = time.time()

    best_model_wts = copy.deepcopy(model.state_dict())
    best_acc = 0.0
    train_loss = []
    valid_loss = []
    train_acc = []
    valid_acc = []
    for epoch in range(num_epochs):
        print('Epoch {}/{}'.format(epoch+1, num_epochs))
        print('-' * 10)

        # Each epoch has a training and validation phase
        for phase in ['train', 'valid']:
            if phase == 'train':
                model.train()  # Training mode: dropout is active
            else:
                model.eval()   # Evaluation mode: dropout is disabled
            running_loss = 0.0
            running_corrects = 0

            # Iterate over data.
            for data in dataset_loaders[phase]:
                # Getting the inputs and labels
                inputs, labels = data
                # Moving the inputs and labels to the device
                inputs, labels = inputs.to(device), labels.to(device)

                # Gradients are zeroed before every batch
                optimizer_cl.zero_grad()

                # Forward pass and loss; gradients are tracked only while training
                with torch.set_grad_enabled(phase == 'train'):
                    outputs = model(inputs)
                    _, preds = torch.max(outputs, 1)
                    loss = criterion(outputs, labels)

                    # Backward pass, optimize only if in the training phase
                    if phase == 'train':
                        loss.backward()
                        optimizer_cl.step()

                # Accumulate the loss and the number of correct predictions for each batch
                running_loss += loss.item() * inputs.size(0)
                running_corrects += torch.sum(preds == labels).item()

            # Step the LR scheduler once per epoch, after the optimizer updates
            if phase == 'train':
                scheduler.step()

            # Get the statistics of loss and accuracy for each epoch
            epoch_loss = running_loss / len(dataset[phase])
            epoch_acc = running_corrects / len(dataset[phase])
            if phase == 'train':
                train_loss.append(epoch_loss)
                train_acc.append(epoch_acc)
            else:
                valid_loss.append(epoch_loss)
                valid_acc.append(epoch_acc)
            print('{} Loss: {:.4f} Acc: {:.4f}'.format(
                phase, epoch_loss, epoch_acc))

            # Copy the model with best accuracy
            if phase == 'valid' and epoch_acc > best_acc:
                best_acc = epoch_acc
                best_model_wts = copy.deepcopy(model.state_dict())

        print()

    time_elapsed = time.time() - since
    print('Training complete in {:.0f}m {:.0f}s'.format(
        time_elapsed // 60, time_elapsed % 60))
    print('Best val Acc: {:.4f}'.format(best_acc))

    # load best model weights
    model.load_state_dict(best_model_wts)
    return model, train_loss, valid_loss, train_acc, valid_acc

In [None]:
def evaluate_model(model, dataloader, criterion):
    model.eval()  # Evaluation mode: dropout is disabled
    running_loss, running_corrects = 0.0, 0
    # No gradients are needed for evaluation
    with torch.no_grad():
        for inputs, labels in dataloader:
            inputs, labels = inputs.to(device), labels.to(device)
            outputs = model(inputs)
            _, preds = torch.max(outputs, 1)
            loss = criterion(outputs, labels)
            running_loss += loss.item() * inputs.size(0)
            running_corrects += torch.sum(preds == labels).item()
    # Normalise by the size of the dataset behind the loader that was passed in
    n_samples = len(dataloader.dataset)
    test_loss = running_loss / n_samples
    test_acc = running_corrects / n_samples
    print('Test Loss: {:.4f} Acc: {:.4f}'.format(test_loss, test_acc))
    return test_loss, test_acc

In [None]:
# Train the model for 10 epochs
model_ft, train_loss, valid_loss, train_acc, valid_acc = train_model(model, criterion, optimizer_classifier, classifier_lr_scheduler,
                       num_epochs=10)

In [None]:
# Evaluate the model on test data
evaluate_model(model_ft, dataset_loaders['test'], criterion)

In [None]:
plt.plot(train_loss, label="Training loss")
plt.plot(train_acc, label = "Training accuracy")
plt.plot(valid_loss, label = "Validation loss")
plt.plot(valid_acc, label = "Validation accuracy")
plt.legend(labels = ["Training loss", "Training accuracy", "Validation loss", "Validation accuracy"])
plt.show()

In [None]:
# Free up CUDA Cached memory
torch.cuda.empty_cache()

### Save the model for final evaluation and Confusion Matrix
* Here we save the trained model locally.
* The model is then re-loaded into a different object to re-evaluate its performance and build the confusion matrix.

In [None]:
torch.save(model_ft, r"E:\Class_Notes_Sem2\ADM\Project\malaria_level_detection\malaria_classifier.pth")

In [None]:
trained_model = torch.load(r"E:\Class_Notes_Sem2\ADM\Project\malaria_level_detection\malaria_classifier.pth")
# Evaluate the loaded model on test data
evaluate_model(trained_model, dataset_loaders['test'], criterion)

### Creating the confusion matrix for the model evaluation

In [None]:
# Create the confusion matrix
nb_classes = 5

confusion_matrix = torch.zeros(nb_classes, nb_classes)
pred_list = []
label_list = []
with torch.no_grad():
    for i, (inputs, labels) in enumerate(dataset_loaders['test']):
        inputs, labels = inputs.to(device), labels.to(device)
        label_list.append(labels)
        outputs = trained_model(inputs)
        _, preds = torch.max(outputs, 1)
        pred_list.append(preds)
        # Rows index the predicted class, columns index the true class
        for p, t in zip(preds.view(-1), labels.view(-1)):
            confusion_matrix[p.long(), t.long()] += 1

In [None]:
overall_accuracy = confusion_matrix.diag().sum()/len(dataset_loaders['test'].dataset)
print("Overall Accuracy of the model : {}".format(overall_accuracy))

In [None]:
dataset['test'].class_to_idx

In [None]:
confusion_matrix_df = pd.DataFrame(columns = dataset['test'].class_to_idx.keys())

In [None]:
for i, rows in enumerate(confusion_matrix):
    confusion_matrix_df.loc[list(dataset['test'].class_to_idx.keys())[i]] = rows.numpy()

In [None]:
confusion_matrix_df

In [None]:
pred_list_ar = []
label_list_ar = []
for i in pred_list:
    pred_list_ar = pred_list_ar + list(i.cpu().numpy())
for i in label_list:
    label_list_ar = label_list_ar + list(i.cpu().numpy())

In [None]:
print(classification_report(label_list_ar, pred_list_ar))
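
The per-class precision and recall that `classification_report` prints can also be read off a confusion matrix. The sketch below assumes the same layout as the matrix built above, with rows indexed by the predicted class and columns by the true class; the 3x3 counts and the helper name `precision_recall` are made up for illustration.

```python
def precision_recall(matrix):
    """Per-class (precision, recall) for a matrix[predicted][true] layout."""
    n = len(matrix)
    results = []
    for k in range(n):
        true_positive = matrix[k][k]
        predicted_k = sum(matrix[k])                    # row sum: predicted as k
        actual_k = sum(matrix[r][k] for r in range(n))  # column sum: truly k
        precision = true_positive / predicted_k if predicted_k else 0.0
        recall = true_positive / actual_k if actual_k else 0.0
        results.append((precision, recall))
    return results

# Made-up counts: rows are predictions, columns are true labels
matrix = [
    [8, 1, 1],
    [2, 6, 0],
    [0, 3, 9],
]
scores = precision_recall(matrix)
for k, (p, r) in enumerate(scores):
    print("class {}: precision={:.2f} recall={:.2f}".format(k, p, r))
```

Precision divides the diagonal entry by its row sum and recall divides it by its column sum; with the transposed layout (rows as true labels) the two would swap.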