# ECE176: Pneumonia Detection, CNN

## Introduction

**In this report, we aim to address the following questions:**

1. How accurately can our CNN distinguish healthy patients from pneumonia patients?

2. Can we distinguish between viral and bacterial pneumonia?

3. Can pre-trained CNNs or U-Net models give us a more accurate model?

## Dataset

[Collection of chest X-rays of healthy vs. pneumonia-affected patients](https://www.kaggle.com/datasets/praveengovi/coronahack-chest-xraydataset)

## Import Packages and Prepare GPU

In [21]:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader
from torch.utils.data import sampler

import torchvision.datasets as dset
import torchvision.transforms as T

import numpy as np
import os
import pandas as pd
import shutil

In [22]:
USE_GPU = True
num_class = 100
dtype = torch.float32 # use float32 throughout this notebook

if USE_GPU and torch.cuda.is_available():
    device = torch.device('cuda')
else:
    device = torch.device('cpu')

print('using device:', device)

using device: cuda


## Load Dataset With Augmentations

We first need to sort the images into one subdirectory per labeled class, which is the layout `ImageFolder` expects.
**Note: This only needs to be run once.**

In [23]:
# first read in the metadata file
data = pd.read_csv("Chest_xray_Corona_Metadata.csv")
data.head  # no parentheses, so this echoes the bound method, whose repr prints the frame

<bound method NDFrame.head of       Unnamed: 0            X_ray_image_name     Label Dataset_type  \
0              0           IM-0128-0001.jpeg    Normal        TRAIN   
1              1           IM-0127-0001.jpeg    Normal        TRAIN   
2              2           IM-0125-0001.jpeg    Normal        TRAIN   
3              3           IM-0122-0001.jpeg    Normal        TRAIN   
4              4           IM-0119-0001.jpeg    Normal        TRAIN   
...          ...                         ...       ...          ...   
5905        5928  person1637_virus_2834.jpeg  Pnemonia         TEST   
5906        5929  person1635_virus_2831.jpeg  Pnemonia         TEST   
5907        5930  person1634_virus_2830.jpeg  Pnemonia         TEST   
5908        5931  person1633_virus_2829.jpeg  Pnemonia         TEST   
5909        5932  person1632_virus_2827.jpeg  Pnemonia         TEST   

     Label_2_Virus_category Label_1_Virus_category  
0                       NaN                    NaN  
1          

In [24]:
print(set(data['Label']))

{'Pnemonia', 'Normal'}
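
The set of labels shows which classes exist but not how many images each one has. A minimal sketch of a per-split count, assuming the same `Label` and `Dataset_type` columns as the metadata above; the helper name `label_counts` and the tiny `toy` frame are made up here for illustration only:

```python
import pandas as pd


def label_counts(frame: pd.DataFrame) -> pd.DataFrame:
    """Count images per (Dataset_type, Label) pair."""
    # crosstab builds a table with one row per split and one column per label;
    # combinations that never occur are filled with 0
    return pd.crosstab(frame["Dataset_type"], frame["Label"])


# hypothetical stand-in for the real metadata, only to show the call
toy = pd.DataFrame({
    "Label": ["Normal", "Pnemonia", "Pnemonia", "Normal"],
    "Dataset_type": ["TRAIN", "TRAIN", "TRAIN", "TEST"],
})
counts = label_counts(toy)
print(counts)
```

On the real metadata the call would be `label_counts(data)`.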


In [25]:
# subdirectories for each class (exist_ok=True makes this cell safe to re-run)
root = "Coronahack-Chest-XRay-Dataset/Coronahack-Chest-XRay-Dataset"
for split in ("train", "test"):
    for label in ("Pnemonia", "Normal"):
        os.makedirs(os.path.join(root, split, label), exist_ok=True)

In [26]:
path_train = "Coronahack-Chest-XRay-Dataset/Coronahack-Chest-XRay-Dataset/train"
path_test = "Coronahack-Chest-XRay-Dataset/Coronahack-Chest-XRay-Dataset/test"

train_num = len(os.listdir(path_train))
print("Train data: " + str(train_num))
test_num = len(os.listdir(path_test))
print("Test data: " + str(test_num))

Train data: 5311
Test data: 626
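
`os.listdir` counts every entry in a directory, including any subdirectories, so the numbers above need not match the number of images listed in the metadata. A small sketch of two checks that could help reconcile them; both helper names are hypothetical, and the temporary directory only stands in for the real dataset folder:

```python
import os
import tempfile


def count_files_only(directory):
    """Count regular files in a directory, ignoring subdirectories."""
    return sum(
        1 for name in os.listdir(directory)
        if os.path.isfile(os.path.join(directory, name))
    )


def entries_not_in_metadata(directory, known_names):
    """Return the sorted directory entries whose names are not in known_names."""
    known = set(known_names)
    return sorted(name for name in os.listdir(directory) if name not in known)


# toy directory: two image files plus one class subdirectory
with tempfile.TemporaryDirectory() as tmp:
    for name in ("a.jpeg", "b.jpeg"):
        open(os.path.join(tmp, name), "w").close()
    os.mkdir(os.path.join(tmp, "Normal"))
    n_files = count_files_only(tmp)
    extras = entries_not_in_metadata(tmp, ["a.jpeg"])

print(n_files, extras)
```

With the notebook's names, the second check would be `entries_not_in_metadata(path_train, data["X_ray_image_name"])`.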


In [29]:
# ONLY NEED TO RUN ONCE

normal_train = 0
pnemonia_train = 0
normal_test = 0
pnemonia_test = 0

# "X_ray_image_name" = name of file
# "Label" = pneumonia or normal
# "Dataset_type" = train or test

# walk the three metadata columns row by row and copy each image into its class folder
for name, label, split in zip(data["X_ray_image_name"], data["Label"], data["Dataset_type"]):
    if split == "TRAIN":
        if label == "Normal":
            shutil.copy(os.path.join(path_train, name), os.path.join(path_train, "Normal", name))
            normal_train += 1
        else:
            shutil.copy(os.path.join(path_train, name), os.path.join(path_train, "Pnemonia", name))
            pnemonia_train += 1
    elif split == "TEST":
        if label == "Normal":
            shutil.copy(os.path.join(path_test, name), os.path.join(path_test, "Normal", name))
            normal_test += 1
        else:
            shutil.copy(os.path.join(path_test, name), os.path.join(path_test, "Pnemonia", name))
            pnemonia_test += 1

# sep="\n" puts each count on its own line without a stray leading space
print(
    "X-ray of Normal patients (TRAIN DATASET): " + str(normal_train),
    "X-ray of Infected patients (TRAIN DATASET): " + str(pnemonia_train),
    "X-ray of Normal patients (TEST DATASET): " + str(normal_test),
    "X-ray of Infected patients (TEST DATASET): " + str(pnemonia_test),
    sep="\n",
)

X-ray of Normal patients (TRAIN DATASET): 1342
X-ray of Infected patients (TRAIN DATASET): 3944
X-ray of Normal patients (TEST DATASET): 234
X-ray of Infected patients (TEST DATASET): 390



In [31]:
# redefine the train/test sizes from the metadata counts (os.listdir above counts every directory entry)
train_num = normal_train + pnemonia_train
print("Train data: " + str(train_num))
test_num = normal_test + pnemonia_test
print("Test data: " + str(test_num))

Train data: 5286
Test data: 624
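
The training split is imbalanced: 1342 normal versus 3944 infected images. One common option, which the notebook itself does not use at this point, is to weight the loss by inverse class frequency. A minimal sketch, where `inverse_frequency_weights` is a hypothetical helper name:

```python
def inverse_frequency_weights(counts):
    """Weight each class by total / (num_classes * count).

    A class holding exactly the average number of samples gets weight 1.0;
    rarer classes get larger weights.
    """
    total = sum(counts)
    num_classes = len(counts)
    return [total / (num_classes * c) for c in counts]


# training counts from above, in ImageFolder's alphabetical class order:
# index 0 = "Normal", index 1 = "Pnemonia"
train_counts = [1342, 3944]
weights = inverse_frequency_weights(train_counts)
print(weights)
```

These could then be passed to the loss, for example `nn.CrossEntropyLoss(weight=torch.tensor(weights, dtype=dtype, device=device))`.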


In [32]:
train_data_path = "Coronahack-Chest-XRay-Dataset/Coronahack-Chest-XRay-Dataset/train/"
test_data_path = "Coronahack-Chest-XRay-Dataset/Coronahack-Chest-XRay-Dataset/test/"
batch_size = 64

# data augmentation
transform = T.Compose([
    T.Resize(256), # resize the shorter side to 256
    T.RandomHorizontalFlip(), # random horizontal flip (p=0.5)
    T.CenterCrop(256), # center crop to 256x256
    T.ToTensor(),
    T.Normalize((0.5071, 0.4867, 0.4408), (0.2675, 0.2565, 0.2761)) # hard-coded RGB mean and std from assignment 5, not computed from this dataset
    ])

train_data = dset.ImageFolder(root=train_data_path, transform=transform)
train_data_loader = DataLoader(train_data, batch_size=batch_size, num_workers=2, sampler=sampler.SubsetRandomSampler(range(train_num)))
test_data = dset.ImageFolder(root=test_data_path, transform=transform)
test_data_loader  = DataLoader(test_data, batch_size=batch_size, num_workers=2, sampler=sampler.SubsetRandomSampler(range(test_num))) 


## Visualize Data

See this notebook for some visualization ideas:
https://www.kaggle.com/code/frozenwolf/coronahack-finetuning-resnet18-pytorch