## Detecting COVID-19 from X-Ray. Training a CNN.

_________________________________________________________________________________________________________________
Date: 2021-03-01
_________________________________________________________________________________________________________________

Video: https://www.youtube.com/watch?v=nHQDDAAzIsI

Covid X-Ray Image Dataset for positive cases - https://github.com/ieee8023/covid-che...

Kaggle X-Ray Chest Images for negative cases - https://www.kaggle.com/paultimothymoo...

_________________________________________________________________________________________________________________

In this Jupyter Notebook we follow the tutorial from the video linked above to build an image classifier that distinguishes COVID-19 chest X-rays from normal ones. The images come from the two datasets listed above.

Before working with the images, download both datasets from the repositories linked above. The code below expects them under the `Datasets/` folder.

In [1]:
import pandas as pd
import os
import shutil
import random
import numpy as np

The `os` and `shutil` modules are used to create folders and copy files into them.

In [2]:
# Create the data for positive samples

FILE_PATH = "Datasets/covid_chestxray_dataset/metadata.csv"
IMAGES_PATH = "Datasets/covid_chestxray_dataset/images"

In [3]:
df = pd.read_csv(FILE_PATH)
df.head()

Unnamed: 0,patientid,offset,sex,age,finding,RT_PCR_positive,survival,intubated,intubation_present,went_icu,...,date,location,folder,filename,doi,url,license,clinical_notes,other_notes,Unnamed: 29
0,2,0.0,M,65.0,Pneumonia/Viral/COVID-19,Y,Y,N,N,N,...,"January 22, 2020","Cho Ray Hospital, Ho Chi Minh City, Vietnam",images,auntminnie-a-2020_01_28_23_51_6665_2020_01_28_...,10.1056/nejmc2001272,https://www.nejm.org/doi/full/10.1056/NEJMc200...,,"On January 22, 2020, a 65-year-old man with a ...",,
1,2,3.0,M,65.0,Pneumonia/Viral/COVID-19,Y,Y,N,N,N,...,"January 25, 2020","Cho Ray Hospital, Ho Chi Minh City, Vietnam",images,auntminnie-b-2020_01_28_23_51_6665_2020_01_28_...,10.1056/nejmc2001272,https://www.nejm.org/doi/full/10.1056/NEJMc200...,,"On January 22, 2020, a 65-year-old man with a ...",,
2,2,5.0,M,65.0,Pneumonia/Viral/COVID-19,Y,Y,N,N,N,...,"January 27, 2020","Cho Ray Hospital, Ho Chi Minh City, Vietnam",images,auntminnie-c-2020_01_28_23_51_6665_2020_01_28_...,10.1056/nejmc2001272,https://www.nejm.org/doi/full/10.1056/NEJMc200...,,"On January 22, 2020, a 65-year-old man with a ...",,
3,2,6.0,M,65.0,Pneumonia/Viral/COVID-19,Y,Y,N,N,N,...,"January 28, 2020","Cho Ray Hospital, Ho Chi Minh City, Vietnam",images,auntminnie-d-2020_01_28_23_51_6665_2020_01_28_...,10.1056/nejmc2001272,https://www.nejm.org/doi/full/10.1056/NEJMc200...,,"On January 22, 2020, a 65-year-old man with a ...",,
4,4,0.0,F,52.0,Pneumonia/Viral/COVID-19,Y,,N,N,N,...,"January 25, 2020","Changhua Christian Hospital, Changhua City, Ta...",images,nejmc2001573_f1a.jpeg,10.1056/NEJMc2001573,https://www.nejm.org/doi/full/10.1056/NEJMc200...,,diffuse infiltrates in the bilateral lower lungs,,


First we create a new folder called _Dataset_covid_chestxray_, where we will store the images used for the classification. Inside it, we create a folder called _Covid_.

In [4]:
TARGET_DIR = "Dataset_covid_chestxray/Covid"

if not os.path.exists(TARGET_DIR):
    # makedirs also creates the missing parent folder, which os.mkdir would not
    os.makedirs(TARGET_DIR)
    print("Covid folder created")

The column _finding_ holds the diagnosis. The dataset contains frontal views as well as images taken from the top or the side. We discard the latter because we only want frontal views. The view is recorded in the column _view_, and the code below keeps the rows marked PA (posteroanterior).

The counter `cnt` tracks how many frontal COVID-19 images are selected. Each selected image is copied to the new directory.

In [5]:
cnt = 0

for _, row in df.iterrows():
    if row["finding"] == "Pneumonia/Viral/COVID-19" and row["view"] == "PA":
        filename = row["filename"]
        image_path = os.path.join(IMAGES_PATH, filename)
        image_copy_path = os.path.join(TARGET_DIR, filename)
        shutil.copy2(image_path, image_copy_path)
        # print("Moving image", cnt)
        cnt += 1

print(cnt)

196
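
The row loop above can also be written as a boolean mask over the two columns. The following is a minimal sketch, assuming the metadata has the `finding`, `view` and `filename` columns used above. The helper name `select_frontal_covid` and the `toy` rows are made up for illustration and are not taken from the real metadata.

```python
import pandas as pd

def select_frontal_covid(df: pd.DataFrame) -> pd.Series:
    """Return the filenames of COVID-19 rows recorded with a PA view."""
    mask = (df["finding"] == "Pneumonia/Viral/COVID-19") & (df["view"] == "PA")
    return df.loc[mask, "filename"]

# Made-up rows, for illustration only
toy = pd.DataFrame({
    "finding": ["Pneumonia/Viral/COVID-19", "Pneumonia/Viral/COVID-19", "No Finding"],
    "view": ["PA", "L", "PA"],
    "filename": ["a.jpg", "b.jpg", "c.jpg"],
})

selected = select_frontal_covid(toy)
print(len(selected))  # number of frontal COVID-19 images in the toy frame
```

On the real metadata, `len(select_frontal_covid(df))` should match `cnt`, since both apply the same two conditions.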


For the normal images we use the Kaggle dataset. To keep a 50-50 class ratio, we take as many normal images as COVID-19 images.

In [6]:
# Sampling of images from Kaggle dataset (negative samples)

KAGGLE_FILE_PATH = "Datasets/kaggle_datasets/chest_xray_pneumonia/train/NORMAL"
TARGET_NORMAL_DIR = "Dataset_covid_chestxray/Normal"

We want to copy as many images as there are COVID-19 images, in this case 196. First we list the image names, then we pick 196 of them at random.

In [7]:
image_names = os.listdir(KAGGLE_FILE_PATH)

In [8]:
# Random shuffling
random.shuffle(image_names)

In [9]:
# Number of selected covid front view images
cnt

196

The image names are now shuffled, so we take the first 196.

In [10]:
# Create the target folder if it does not exist yet
os.makedirs(TARGET_NORMAL_DIR, exist_ok=True)

for i in range(cnt):
    image_name = image_names[i]
    image_path = os.path.join(KAGGLE_FILE_PATH, image_name)
    target_path = os.path.join(TARGET_NORMAL_DIR, image_name)
    shutil.copy2(image_path, target_path)
    # print("Copying image", i)
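
Shuffling and taking the first `cnt` names gives a different selection on every run. If a repeatable selection is preferred, one option is `random.sample` with a seeded generator. This is a sketch of an alternative, not what the notebook does. The helper name `pick_random_names`, the seed value and the example file names are all hypothetical.

```python
import random

def pick_random_names(names, k, seed=42):
    """Return k distinct names chosen at random, repeatable for a given seed."""
    rng = random.Random(seed)  # local generator, leaves the global state untouched
    # Sort first: os.listdir returns names in arbitrary order
    return rng.sample(sorted(names), k)

names = [f"img_{i}.jpeg" for i in range(10)]  # hypothetical file names
picked = pick_random_names(names, 4)
print(picked)
```

Calling `pick_random_names(os.listdir(KAGGLE_FILE_PATH), cnt)` would then return the same 196 names each time, as long as the folder contents do not change.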

Once the dataset is ready, we can split the images into training and test sets. First we shuffle the images, then we choose the proportion to use for each set.

In [11]:
COVID_FILE_PATH = "Dataset_covid_chestxray/Covid"
NORMAL_FILE_PATH = "Dataset_covid_chestxray/Normal"

covid_image_names = os.listdir(COVID_FILE_PATH)
normal_image_names = os.listdir(NORMAL_FILE_PATH)

# Random shuffling
random.shuffle(normal_image_names)
random.shuffle(covid_image_names)

In [12]:
# Folders where the train and test data will be placed
TRAIN_DIR = "Dataset_covid_chestxray/train"
TEST_DIR = "Dataset_covid_chestxray/test"

Each of these folders will contain a folder named _Normal_ and a folder named _Covid_.
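
The four folders can be created by hand or in code. Below is a minimal sketch of the latter, assuming the `train`/`test` and `Covid`/`Normal` names used in this notebook. The helper `make_split_dirs` is hypothetical, and the demo runs inside a temporary directory so that it writes nothing to the project.

```python
import os
import tempfile

def make_split_dirs(root, splits=("train", "test"), classes=("Covid", "Normal")):
    """Create root/<split>/<class> for every combination and return the paths."""
    created = []
    for split in splits:
        for cls in classes:
            path = os.path.join(root, split, cls)
            os.makedirs(path, exist_ok=True)  # no error if the folder already exists
            created.append(path)
    return created

# Demonstrated in a temporary folder
with tempfile.TemporaryDirectory() as tmp:
    paths = make_split_dirs(os.path.join(tmp, "Dataset_covid_chestxray"))
    all_exist = all(os.path.isdir(p) for p in paths)
    print(len(paths), all_exist)
```

In the notebook itself the call would be `make_split_dirs("Dataset_covid_chestxray")`.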

In [17]:
# Define the cut_id to set the proportion
# In this case we will use 0.75 for train and 0.25 for test
cut_id = int(0.75*cnt)
cut_id

147
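
The cut index works as a slice boundary: names before it go to training and the rest go to test. The sketch below shows the arithmetic on a list of 196 placeholder names. The helper `split_names` is hypothetical and only mirrors the `int(0.75*cnt)` computation above.

```python
def split_names(names, train_fraction=0.75):
    """Split a list into (train, test) at int(train_fraction * len(names))."""
    cut = int(train_fraction * len(names))
    return names[:cut], names[cut:]

names = [f"img_{i}" for i in range(196)]  # placeholder names, same count as cnt
train, test = split_names(names)
print(len(train), len(test))  # 147 49
```

Because `int()` truncates, any remainder from the multiplication ends up in the test set.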

Now we split the data into training and test sets.

In [19]:
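TARGET_TRAIN_COVID = "Dataset_covid_chestxray/train/Covid"
TARGET_TRAIN_NORMAL = "Dataset_covid_chestxray/train/Normal"

# Make sure the target folders exist before copying
os.makedirs(TARGET_TRAIN_COVID, exist_ok=True)
os.makedirs(TARGET_TRAIN_NORMAL, exist_ok=True)

# TRAIN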
for i in range(cut_id):
    
    covid_image_name = covid_image_names[i]
    normal_image_name = normal_image_names[i]
    covid_image_path = os.path.join(COVID_FILE_PATH, covid_image_name)
    normal_image_path = os.path.join(NORMAL_FILE_PATH, normal_image_name)
    
    covid_train_path = os.path.join(TARGET_TRAIN_COVID, covid_image_name)
    normal_train_path = os.path.join(TARGET_TRAIN_NORMAL, normal_image_name)
    
    shutil.copy2(covid_image_path, covid_train_path)
    shutil.copy2(normal_image_path, normal_train_path)

Each train folder should now contain `cut_id` images, in this case 147.

In [32]:
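TARGET_TEST_COVID = "Dataset_covid_chestxray/test/Covid"
TARGET_TEST_NORMAL = "Dataset_covid_chestxray/test/Normal"

# Make sure the target folders exist before copying
os.makedirs(TARGET_TEST_COVID, exist_ok=True)
os.makedirs(TARGET_TEST_NORMAL, exist_ok=True)

# TEST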
for i in range(cut_id,cnt):
    
    covid_image_name = covid_image_names[i]
    normal_image_name = normal_image_names[i]
    covid_image_path = os.path.join(COVID_FILE_PATH, covid_image_name)
    normal_image_path = os.path.join(NORMAL_FILE_PATH, normal_image_name)
    
    covid_test_path = os.path.join(TARGET_TEST_COVID, covid_image_name)
    normal_test_path = os.path.join(TARGET_TEST_NORMAL, normal_image_name)
    
    shutil.copy2(covid_image_path, covid_test_path)
    shutil.copy2(normal_image_path, normal_test_path)

In [33]:
# In test folders we should have the following number of images
print(cnt - cut_id)

49
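
The expected counts can be checked against what is actually on disk. Below is a minimal sketch of such a check. The helper `count_files` is hypothetical, and the demo uses a temporary folder with three empty placeholder files instead of the real images.

```python
import os
import tempfile

def count_files(directory):
    """Count regular files directly inside `directory` (subfolders are ignored)."""
    return sum(
        1
        for name in os.listdir(directory)
        if os.path.isfile(os.path.join(directory, name))
    )

with tempfile.TemporaryDirectory() as tmp:
    # Three empty placeholder files and one subfolder
    for i in range(3):
        with open(os.path.join(tmp, f"img_{i}.jpeg"), "w") as f:
            f.write("")
    os.mkdir(os.path.join(tmp, "subfolder"))
    n_files = count_files(tmp)
    print(n_files)
```

In the notebook, `count_files(TARGET_TEST_COVID)` and `count_files(TARGET_TEST_NORMAL)` would be expected to equal `cnt - cut_id`, provided the folders held no other files beforehand and no two copied images share a filename.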


From here on, the rest of the work is done in Google Colab in order to use a GPU: https://colab.research.google.com/drive/1OJhPU1IVRQ7QVZic-VaKlQ7xbrgGDzRD?authuser=1#scrollTo=6i76U0XcqBbX