# Using Deep Learning to detect Pneumonia in X-ray images
> Using Pytorch to develop a Deep Learning framework to predict pneumonia in X-ray images

- toc: true 
- badges: true
- comments: true
- categories: [jovian, pytorch, transfer learning, fastpages, jupyter]
- image: images/xray-pneumonia-1.png

# About

This blog post is my submission for the [Course Project](https://jovian.ml/forum/t/assignment-5-course-project/1563) of the free online course [Pytorch Zero to GANS](https://jovian.ml/forum/c/pytorch-zero-to-gans/18) run by [JOVIAN.ML](https://www.jovian.ml).

The course [competition](https://jovian.ml/forum/t/assignment-4-in-class-data-science-competition/1564/2) was based on analysing protein cells with multi-label classification.

Therefore, to extend my understanding of dealing with medical imaging I decided to use the [X-Ray image database](https://www.kaggle.com/paultimothymooney/chest-xray-pneumonia) in Kaggle.

Seeing as I ran out of GPU hours on Kaggle because of the competition (restricted to 30hrs/week at the time of writing June 2020) I opted to use Google Colab. 

This blog is in the form of a Jupyter notebook and inspired by [link](https://github.com/viritaromero/Detecting-Pneumonia-in-Chest-X-Rays/blob/master/Detecting_Pneumonia.ipynb).

This blog covers getting the dataset into Google Colab, exploring the dataset, developing the training model and metrics, and then running some preliminary training to get a model which is then used to make a few predictions.
I will then talk about some of the lessons learned.

> Warning! The purpose of this blog is to outline the steps taken in a typical Machine Learning project and should be treated as such.

**Link to non-sanitised notebook on Jovian.ML here**

# Import libraries

# collapse-hide
import os
import torch
import pandas as pd
import time
import copy
import PIL
import numpy as np
from torch.utils.data import Dataset, random_split, DataLoader
from PIL import Image
import torchvision
from torchvision import datasets
import torchvision.models as models
import matplotlib.pyplot as plt
from tqdm.notebook import tqdm
import torchvision.transforms as T
from sklearn.metrics import f1_score
import torch.nn.functional as F
import torch.nn as nn
from torch.optim import lr_scheduler
from collections import OrderedDict
from torchvision.utils import make_grid
from torch.autograd import Variable
import seaborn as sns
import csv
%matplotlib inline

# Colab setup and getting data from Kaggle

I used Google Colab with GPU processing for this project because I had exhausted my Kaggle hours (30hrs/wk) working on the competition :( The challenge here was signing into Colab, setting up the working directory and then linking to Kaggle and copying the data over. The size of the dataset was about 1.3 GB, which wasn't too much of a bother as Google gives each Gmail account 15 GB for free!


> Tip: I used the monokai settings in Colab which gave excellent font contrast and colours for editing.![monokai]({{"/"|relative_url}}images/xray-colab-monokai.png "LOOKING PRETTY")

from google.colab import drive
drive.mount('/content/drive', force_remount=True)

The default directory linked to Google Drive (the one connected to the Gmail address) is /content/drive/My Drive/

I created a project directory jovian-xray and used this as the new root directory.

root_dir = '/content/drive/My Drive/jovian-xray'
os.listdir(root_dir)

os.chdir(root_dir)
!pwd
os.makedirs('kaggle', exist_ok=True)  # no error if the directory already exists

Install Kaggle in your current Colab session.
Log into Kaggle, point to the dataset and copy the API key. This downloads a kaggle.json file.
Upload this kaggle.json to Colab.

!pip install -q kaggle

from google.colab import files

Select the kaggle.json file. This will be uploaded to your current working directory, which is the root_dir specified above.
Create a .kaggle directory in the home directory.
Copy kaggle.json from the current directory to this new directory.
Change permissions so that only the owner can read and write the file (chmod 600).

upload = files.upload()

!mkdir -p ~/.kaggle
!ls
!cp kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json

proj_dir = os.path.join(root_dir, 'kaggle', 'chest_xray')
os.makedirs(proj_dir, exist_ok=True)  # make sure the directory exists before changing into it
os.chdir(proj_dir)
!pwd

In the [Kaggle data directory](https://www.kaggle.com/paultimothymooney/chest-xray-pneumonia) page select New Notebook > Three vertical dots, Copy API Command

#API key
!kaggle datasets download -d paultimothymooney/chest-xray-pneumonia

!unzip chest-xray-pneumonia

os.listdir(proj_dir)

The dataset is structured into training, val and test folders, each with sub-folders of NORMAL and PNEUMONIA images.
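A quick way to check that folder layout (and how balanced the classes are) is to count the files in each split/class sub-folder. The snippet below is a minimal sketch using only the standard library; the helper name `count_images` is made up for illustration and it assumes the `<split>/<class>/<image>` layout described above.

```python
import os
from collections import Counter

def count_images(data_dir, splits=("train", "val", "test")):
    """Return {(split, class_name): number_of_files} for a split/class folder layout."""
    counts = Counter()
    for split in splits:
        split_dir = os.path.join(data_dir, split)
        if not os.path.isdir(split_dir):
            continue  # skip splits that are missing on disk
        for class_name in sorted(os.listdir(split_dir)):
            class_dir = os.path.join(split_dir, class_name)
            if os.path.isdir(class_dir):
                # ignore hidden files such as .DS_Store
                files = [f for f in os.listdir(class_dir) if not f.startswith(".")]
                counts[(split, class_name)] = len(files)
    return counts
```

Calling `count_images(proj_dir)` would then give one count per pair such as `('train', 'NORMAL')`, which is handy for spotting class imbalance before training.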

# Data exploration
## Image transforms
We will now prepare the data for reading into Pytorch as tensors using DataLoaders.

Having data augmentation is a good way to get extra training data for free. However, care must be taken to ensure that the requested transforms produce images that are likely to appear at inference time (or in the test set).

The images (RGB) are normalized using the mean [0.485, 0.456, 0.406] and standard deviation [0.229, 0.224, 0.225] of the Imagenet data used to train the Resnet model, so that the new input images have roughly the same distribution as the images the Resnet model was trained on.
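To make the arithmetic concrete, here is a minimal sketch of what the normalization does per channel, and how to undo it before plotting. The function names `normalize` and `denormalize` are hypothetical helpers, and the random tensor is only a stand-in for a real X-ray image.

```python
import torch

# Imagenet channel statistics quoted in the text (RGB order),
# reshaped so they broadcast over a (3, H, W) image.
mean = torch.tensor([0.485, 0.456, 0.406]).view(3, 1, 1)
std = torch.tensor([0.229, 0.224, 0.225]).view(3, 1, 1)

def normalize(img):
    """Per-channel (x - mean) / std, the same arithmetic T.Normalize applies."""
    return (img - mean) / std

def denormalize(img):
    """Invert the normalization so the image can be plotted in the [0, 1] range."""
    return (img * std + mean).clamp(0, 1)

# Stand-in for an image after T.ToTensor(): shape (3, H, W), values in [0, 1].
img = torch.rand(3, 224, 224)
restored = denormalize(normalize(img))
print(torch.allclose(img, restored, atol=1e-5))
```

The `denormalize` step matters when visualising transformed images, since matplotlib expects values in the [0, 1] range.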

I have set up two transforms dictionaries, one with augmentation and normalization and one without, so it would be easy to plot images and compare.

# collapse-hide
imagenet_stats = ([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])

data_transforms = {'train' : T.Compose([
    T.Resize(224),
    T.CenterCrop(224),
    T.RandomRotation(20),
    T.RandomHorizontalFlip(),
    T.ToTensor(),
    T.Normalize(*imagenet_stats, inplace=True)
]),
'test' : T.Compose([
    T.Resize(224),
    T.CenterCrop(224),
    T.ToTensor(),
    T.Normalize(*imagenet_stats, inplace=True)
]),
'val' : T.Compose([
    T.Resize(224),
    T.CenterCrop(224),
    T.ToTensor(),
    T.Normalize(*imagenet_stats, inplace=True)
])}

data_no_transforms = {'train' : T.Compose([ T.ToTensor() ]),
'test' : T.Compose([T.ToTensor() ]),
'val' : T.Compose([T.ToTensor() ])}

proj_dir = os.path.join(root_dir, 'kaggle', 'chest_xray')

proj_dir

![xray-raw-images]({{"/"|relative_url}}images/xray-raw-images.png)

# Transfer Learning model

## Use a Resnet34 model with our custom classifier for X-ray pneumonia images

Transfer learning is widely used to take advantage of the clever and hardworking chaps who have spent time training a model on a million+ images and saving the trained model architecture and weights.

The Resnet34 model has been trained on the Imagenet database which has 1000 classes [from trombones to toilet tissue.](https://gist.github.com/yrevar/942d3a0ac09ec9e5eb3a)

Resnet34 has a top-most fully connected layer to predict 1000 classes. In our case we need only two so we will remove the last fc layer and add our own.

Have a look [here](https://towardsdatascience.com/understanding-and-visualizing-resnets-442284831be8) for a good explanation of Resnet architectures. Briefly, after each set of convolutions the input is added to the output. This helps to maintain the resolution of the input, ie the features of the input are not lost as the network gets deeper.
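The "input is added to the output" idea can be sketched in a few lines. This is a simplified illustration, assuming a block that keeps the number of channels and the image size unchanged. The class name `BasicResidualBlock` is made up here and it is not the exact torchvision implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BasicResidualBlock(nn.Module):
    """Two 3x3 convolutions whose output is added to the unchanged input."""

    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return F.relu(out + x)  # skip connection: add the input back in

block = BasicResidualBlock(channels=64)
x = torch.randn(2, 64, 56, 56)  # (batch, channels, height, width)
y = block(x)
print(y.shape)  # same shape as the input: torch.Size([2, 64, 56, 56])
```

Because the input is carried straight through the addition, the block only has to learn a correction on top of it, which is what makes very deep stacks of these blocks trainable.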

In a deep neural network the early layers capture generic features such as edges, texture and colour, while the later layers capture more specific features such as cats' ears, eyes, elephant trunks and so on.

So our process is to take the trained Resnet architecture and weights, remove the head, ie the last layer that is used to predict the 1000 classes, and add our own tailored to the number of classes we want to predict, which in our case is two.

We will do a first pass of training where the weights of the Resnet model are locked, ie we do not want to overwrite or lose those values, which would mean more GPU expense for us. Then we will unfreeze the weights and run the entire model at our preferred learning rate. Note, ideally we would like to unfreeze only specific layers, say layer 1 and layer 4, which I will cover in a separate blogpost.

# Setup the Training (and Validation) model

The dataset has training, val and test sets, which makes our lives a little bit easier, ie we don't have to do any data splitting and can set up specific transforms for each.

# Check if GPU is available
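A common way to do this check, shown here as a minimal sketch (the helper name `get_default_device` is an assumption, not necessarily what the original notebook used):

```python
import torch

def get_default_device():
    """Return the GPU device when CUDA is available, otherwise the CPU."""
    return torch.device("cuda" if torch.cuda.is_available() else "cpu")

device = get_default_device()
print(device)
```

The model and each batch of images and labels then need to be moved to this device with `.to(device)` before training.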

# Run the Training (and Validation) model

# Testing (Inference)

### Save the interim model (checkpoint)

### Visualise the Training/Validation images

### Visualise the predictions

# Lessons Learned

1. Overfitting results when the model fits the training data very well (low error) but the validation data poorly (high error).
There are five methods to reduce it:
> Get more data  
> Data augmentation  
> Generalizable architectures  
> Regularisation  
> Reduce architecture complexity  

2. Undertaking an online course like the [Jovian Zero to Gans](https://jovian.ml/forum/c/pytorch-zero-to-gans/18) has been an excellent opportunity to immerse myself in Machine Learning. Taking part in the competition (which is ongoing) and writing this blog on the X-ray dataset has helped me to better understand important concepts such as Dataloaders, learning rate, batch size, optimizers and loss functions.


3. Thank you to Aakash, the course instructor, and the rest of the Jovian team for their efforts in helping us to better understand such an exciting paradigm.

------


fin