# **COVI-DET**

This notebook implements a deep neural network whose backbone is CheXNet, a DenseNet-121-based architecture. CheXNet was trained on chest X-rays for pneumonia detection, and its authors report that it exceeds average radiologist performance on that task. We apply transfer learning from this model to a COVID-19 dataset to detect COVID-19 from X-ray images. We also apply RISE (Randomized Input Sampling for Explanation of Black-box Models) to generate saliency maps for model interpretability.

## Dataset Sources



*   COVID-19 Chest X-Ray Dataset : https://github.com/ieee8023/covid-chestxray-dataset
*   Pneumonia Chest X-Ray Dataset : https://www.kaggle.com/paultimothymooney/chest-xray-pneumonia
*   Pre-trained weights for CheXNet : https://github.com/arnoweng/CheXNet



We also use the Pneumonia Chest X-Ray dataset because the COVID-19 dataset contains few Pneumonia and Normal X-rays. We combine the two to form our dataset, which is then split into training, validation and test sets.

## References



*   https://arxiv.org/abs/2004.12823
*   https://arxiv.org/abs/2004.09803
*   https://github.com/arnoweng/CheXNet
*   https://stanfordmlgroup.github.io/projects/chexnet/
*   https://github.com/eclique/RISE







# Imports and Data Downloading

In [None]:
import os
import numpy as np
import torch
import torch.nn as nn
import torch.backends.cudnn as cudnn
import torchvision
import torchvision.transforms as transforms
from torch import optim
from torch.utils.data import DataLoader, Dataset
from torch.utils.data.sampler import SubsetRandomSampler, RandomSampler, SequentialSampler
import re
from shutil import copyfile
import glob
import warnings
from tqdm.notebook import tqdm  # tqdm_notebook is deprecated in favour of tqdm.notebook.tqdm

import datetime
import json

import seaborn as sn
import pandas as pd
from numpy import interp  # scipy.interp was a deprecated alias of numpy.interp and is missing from recent SciPy releases
from itertools import cycle
import matplotlib
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, confusion_matrix, roc_curve, auc, f1_score
from skimage.transform import resize

from PIL import Image
from pylab import rcParams
import cv2

warnings.filterwarnings("ignore")
pd.set_option('max_colwidth', 1000)

## Data Downloading

The following cell clones my repo, which contains the dataset for this task. The data in the repo is stored exactly as downloaded from the data sources listed above, collected in one place (the Kaggle data cannot be downloaded without a private API token). All pre-processing is done in this notebook.

The repo also contains the pretrained weights for the CheXNet model.

In [None]:
!git clone https://github.com/dragonsan17/covid_detection_from_xray

In [None]:
!mkdir -p final_data/train final_data/validation final_data/test

# Config

In [None]:
CKPT_PATH = '/content/covid_detection_from_xray/data/chexnet_pretrained'
METADATA_PATH = '/content/covid_detection_from_xray/data/covid-chestxray-dataset/metadata.csv'

TRAIN_DATA_PATH = '/content/final_data/train'
VAL_DATA_PATH = '/content/final_data/validation'
TEST_DATA_PATH = '/content/final_data/test'
SAVE_PATH = '/content/final_data'

NORMAL_DATA_PATH = "/content/covid_detection_from_xray/data/NORMAL"
PNEUMONIDA_DATA_PATH = "/content/covid_detection_from_xray/data/PNEUMONIA"
COVID_DATA_PATH = '/content/covid_detection_from_xray/data/covid-chestxray-dataset'

BEST_PATH = CKPT_PATH
BEST_VAL = 100000

NUM_EPOCHS_FIRST_RUN = 30   # set to 1 for a quick end-to-end run
NUM_EPOCHS_SECOND_RUN = 10  # set to 1 for a quick end-to-end run
BATCH_SIZE_FIRST_RUN = 16
BATCH_SIZE_SECOND_RUN = 4

# Data Pre-Processing

In [None]:
img_paths = []
classes = []

## Data Reading for Normal and Pneumonia X-Rays

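Reads the images from their respective folders, labels them (0 = normal, 1 = bacterial pneumonia, 2 = viral pneumonia), and splits them into train, validation and test sets (roughly 70/20/10, stratified by class).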

In [None]:
# Label 0: normal X-rays
for f in glob.glob(os.path.join(NORMAL_DATA_PATH, '*')):
  img_paths.append(f)
  classes.append(0)

# Pneumonia X-rays: label 1 if 'bacteria' appears in the path, otherwise label 2 (viral)
for f in glob.glob(os.path.join(PNEUMONIDA_DATA_PATH, '*')):
  img_paths.append(f)
  classes.append(1 if 'bacteria' in f else 2)

chest_xray_data = pd.DataFrame({'img_paths' : img_paths, 'classes' : classes})
chest_xray_train, chest_xray_test, _, _ = train_test_split(chest_xray_data, chest_xray_data.classes, test_size=0.3, random_state=42, stratify=chest_xray_data.classes)
chest_xray_valid, chest_xray_test, _, _ = train_test_split(chest_xray_test, chest_xray_test.classes, test_size=0.33, random_state=42, stratify=chest_xray_test.classes)

train_img_paths = list(chest_xray_train.img_paths)
train_classes = list(chest_xray_train.classes)

val_img_paths = list(chest_xray_valid.img_paths)
val_classes = list(chest_xray_valid.classes)

test_img_paths = list(chest_xray_test.img_paths)
test_classes = list(chest_xray_test.classes)

# Copy every image into the folder of the split it belongs to
for split_df, split_dir in [(chest_xray_train, TRAIN_DATA_PATH),
                            (chest_xray_valid, VAL_DATA_PATH),
                            (chest_xray_test, TEST_DATA_PATH)]:
  for src in split_df.img_paths:
    dst = os.path.join(split_dir, os.path.basename(src))
    copyfile(src, dst)
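
The two `train_test_split` calls above chain a 30% hold-out with a 33% split of that hold-out, so the effective proportions are easy to misread. A quick arithmetic check in plain Python (the variable names are illustrative and not part of the notebook):

```python
# Fractions passed to the two train_test_split calls above
first_test_size = 0.3    # share held out from the full set
second_test_size = 0.33  # share of the held-out part that becomes the test set

train_frac = 1 - first_test_size
val_frac = first_test_size * (1 - second_test_size)
test_frac = first_test_size * second_test_size

print(f"train={train_frac:.3f}  val={val_frac:.3f}  test={test_frac:.3f}")
# prints: train=0.700  val=0.201  test=0.099
```

The validation set is therefore about 20% and the test set about 10% of the Normal and Pneumonia images.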

## Data Reading for COVID-19 X-Rays


*   Reads the metadata and keeps only COVID-19 images taken in the PA, AP or AP Supine view
*   Because several images can belong to the same patient ID, the train/val/test split is made by patient: all images of a patient go to the same subset, which prevents information leakage between subsets



In [None]:
covid_data = pd.read_csv(METADATA_PATH).fillna('')
covid_data = covid_data[((covid_data.view == 'PA') | (covid_data.view == 'AP') | (covid_data.view == 'AP Supine')) & ((covid_data.finding == 'Pneumonia/Viral/COVID-19'))]
covid_data.describe()

Unnamed: 0,patientid,offset,sex,age,finding,RT_PCR_positive,survival,intubated,intubation_present,went_icu,in_icu,needed_supplemental_O2,extubated,temperature,pO2_saturation,leukocyte_count,neutrophil_count,lymphocyte_count,view,modality,date,location,folder,filename,doi,url,license,clinical_notes,other_notes,Unnamed: 29
count,478,478.0,478,478.0,478,478,478.0,478.0,478.0,478.0,478.0,478.0,478.0,478.0,478.0,478.0,478.0,478.0,478,478,478,478,478,478,478.0,478,478.0,478.0,478.0,478.0
unique,295,40.0,3,64.0,1,3,3.0,3.0,3.0,3.0,3.0,3.0,3.0,29.0,34.0,15.0,23.0,22.0,3,1,55,100,1,478,77.0,215,9.0,342.0,104.0,1.0
top,250,,M,,Pneumonia/Viral/COVID-19,Y,,,,,,,,,,,,,PA,X-ray,2020,"Hannover Medical School, Hannover, Germany",images,333932bd.jpg,,https://github.com/ml-workgroup/covid-19-image-repository,,,,
freq,7,92.0,287,125.0,478,284,309.0,318.0,318.0,237.0,285.0,416.0,447.0,418.0,385.0,464.0,454.0,442.0,196,478,320,79,478,1,231.0,79,205.0,97.0,272.0,478.0


In [None]:
unique_patient_ids = np.array(covid_data.patientid.unique())
patient_id_counts = []
total_samples = len(covid_data)
for patient_id in unique_patient_ids:
  count = len(covid_data[covid_data.patientid==patient_id])
  patient_id_counts.append([count, patient_id])

patient_id_counts.sort(reverse = True)

train_patient_ids = []
val_patient_ids = []
test_patient_ids = []
t_c, v_c, te_c = 0,0,0
total_count = 0
for count,id in patient_id_counts:
  total_count += count
  if total_count < 0.7*total_samples:
    train_patient_ids.append(id)
    t_c += count
  elif total_count < 0.9*total_samples:
    val_patient_ids.append(id)
    v_c += count
  else:
    test_patient_ids.append(id)
    te_c += count

print(f'Total Samples : {total_samples}')
print(f'Training Set contains {(t_c)} samples')
print(f'Validation Set contains {v_c} samples')
print(f'Test Set contains {te_c} samples')

for patient_id in train_patient_ids:
  details = covid_data[covid_data.patientid == patient_id]
  filenames = details.filename

  for filename in filenames:
    src = os.path.join(COVID_DATA_PATH, filename)
    dst = os.path.join(TRAIN_DATA_PATH, filename)
    img_paths.append(filename)
    classes.append(3)
    train_img_paths.append(filename)
    train_classes.append(3)
    copyfile(src, dst)

for patient_id in val_patient_ids:
  details = covid_data[covid_data.patientid == patient_id]
  filenames = details.filename

  for filename in filenames:
    src = os.path.join(COVID_DATA_PATH, filename)
    dst = os.path.join(VAL_DATA_PATH, filename)
    img_paths.append(filename)
    classes.append(3)
    
    val_img_paths.append(filename)
    val_classes.append(3)
    copyfile(src, dst)

for patient_id in test_patient_ids:
  details = covid_data[covid_data.patientid == patient_id]
  filenames = details.filename

  for filename in filenames:
    src = os.path.join(COVID_DATA_PATH, filename)
    dst = os.path.join(TEST_DATA_PATH, filename)
    img_paths.append(filename)
    classes.append(3)
    test_img_paths.append(filename)
    test_classes.append(3)
    copyfile(src, dst)

Total Samples : 478
Training Set contains 334 samples
Validation Set contains 96 samples
Test Set contains 48 samples
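
Since the split is made by patient, it can be checked directly that no patient ID ends up in two subsets. The helper below is a minimal sketch of such a check. `find_patient_overlap` and the example IDs are hypothetical and not part of the notebook.

```python
from itertools import combinations


def find_patient_overlap(splits):
    """Return {(split_a, split_b): shared_ids} for every pair of splits that shares a patient."""
    overlaps = {}
    for (name_a, ids_a), (name_b, ids_b) in combinations(splits.items(), 2):
        shared = set(ids_a) & set(ids_b)
        if shared:
            overlaps[(name_a, name_b)] = shared
    return overlaps


# Made-up patient IDs, for illustration only
clean_splits = {"train": [250, 17, 42], "val": [8, 99], "test": [3]}
leaky_splits = {"train": [250, 17, 42], "val": [17, 99], "test": [3]}

print(find_patient_overlap(clean_splits))  # no shared patients
print(find_patient_overlap(leaky_splits))  # patient 17 is in both train and val
```

In the notebook this could be called with `train_patient_ids`, `val_patient_ids` and `test_patient_ids`. An empty result means no patient is shared across subsets.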


## Data-Set Class

This class supplies the data to the model. It also carries a loss function as a member, which implements a class-weighted binary cross-entropy loss.

In [None]:
all_data = pd.DataFrame({'img_paths' : img_paths, 'classes' : classes})
all_data.img_paths = all_data.img_paths.transform(lambda x : str(x).split('/')[-1])

train_df = pd.DataFrame({'img_paths' : train_img_paths, 'classes' : train_classes})
train_df.img_paths = train_df.img_paths.transform(lambda x : str(x).split('/')[-1])

val_df = pd.DataFrame({'img_paths' : val_img_paths, 'classes' : val_classes})
val_df.img_paths = val_df.img_paths.transform(lambda x : str(x).split('/')[-1])

test_df = pd.DataFrame({'img_paths' : test_img_paths, 'classes' : test_classes})
test_df.img_paths = test_df.img_paths.transform(lambda x : str(x).split('/')[-1])

In [None]:
class Data_Set(Dataset):
    def __init__(self, df, rand=False, transform=None):

        self.df = df.reset_index(drop=True)
        self.rand = rand
        self.transform = transform
        self.num_normal = len(df[df.classes == 0])
        self.num_bact = len(df[df.classes == 1])
        self.num_viral = len(df[df.classes == 2])
        self.num_covid = len(df[df.classes == 3])
        self.total = len(df)
        self.loss_weight_minus = torch.FloatTensor([self.num_normal, self.num_bact, self.num_viral, self.num_covid]).unsqueeze(0).cuda() / self.total
        self.loss_weight_plus = 1.0 - self.loss_weight_minus
        
    def __len__(self):
        return self.df.shape[0]

    def __getitem__(self, index):
        row = self.df.iloc[index]
        img_path = row.img_paths
        path = ''
        if os.path.exists(os.path.join(TRAIN_DATA_PATH, img_path)):
          path = os.path.join(TRAIN_DATA_PATH, img_path)
        elif os.path.exists(os.path.join(VAL_DATA_PATH, img_path)):
          path = os.path.join(VAL_DATA_PATH, img_path)
        else:
          path = os.path.join(TEST_DATA_PATH, img_path)

        image = Image.open(path).convert('RGB')
        if self.transform is not None:
            image = self.transform(image)
        label = np.zeros(4).astype(np.float32)
        label[row.classes] = 1.
        return torch.tensor(image).float(), torch.tensor(label)

    def loss(self, output, target):
        # Per-class weights have shape (1, 4) and broadcast over the batch dimension.
        weight_plus = self.loss_weight_plus.to(output.device)
        weight_neg = self.loss_weight_minus.to(output.device)

        epsilon = 1e-15
        # Positive targets are weighted by weight_plus, negative targets by weight_neg.
        pos_term = (output + epsilon).log() * weight_plus
        neg_term = (1 - output + epsilon).log() * weight_neg
        # torch.where builds a new tensor, so the model output is not modified in place.
        loss = torch.where(target >= 0.5, pos_term, neg_term)
        return -loss.sum()
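
To make the weighting concrete, here is a small worked example in plain Python. It assumes the intended loss is the weighted binary cross-entropy of the CheXNet paper, $-\sum_c \left[ w^+_c\, y_c \log p_c + w^-_c\, (1 - y_c) \log(1 - p_c) \right]$, with $w^-_c = n_c / N$ and $w^+_c = 1 - n_c / N$ as computed in `__init__`. The function names, class counts and probabilities are made up for illustration.

```python
import math


def class_weights(counts):
    """Per-class (w_plus, w_minus) from class counts; rarer classes get a larger w_plus."""
    total = sum(counts)
    w_minus = [n / total for n in counts]
    w_plus = [1.0 - w for w in w_minus]
    return w_plus, w_minus


def weighted_bce(probs, target, w_plus, w_minus, eps=1e-15):
    """Weighted binary cross-entropy for one sample with independent per-class probabilities."""
    loss = 0.0
    for p, y, wp, wm in zip(probs, target, w_plus, w_minus):
        if y >= 0.5:
            loss -= wp * math.log(p + eps)        # positive class term
        else:
            loss -= wm * math.log(1.0 - p + eps)  # negative class term
    return loss


# Hypothetical class counts (normal, bacterial, viral, COVID-19)
counts = [400, 300, 200, 100]
w_plus, w_minus = class_weights(counts)

# One COVID-19 sample (class 3) with made-up sigmoid outputs
probs = [0.1, 0.2, 0.1, 0.7]
target = [0.0, 0.0, 0.0, 1.0]
sample_loss = weighted_bce(probs, target, w_plus, w_minus)
print(round(sample_loss, 4))
```

Because COVID-19 is the rarest class in this toy setup, its positive term carries the largest weight, so a missed COVID-19 case costs more than a missed normal case.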