This is consolidated notebook to setup the environment for preprocessing and preprocessing the XRay images from multiple sources based on the instructions in this page : https://github.com/lindawangg/COVID-Net/blob/master/docs/COVIDx.md

The final output of this notebook is to create Train, Test and Metadata files for the X-Ray images

This notebook is based on 2 script files in the COVID net repository :
1. create_COVIDx.ipynb
2. rsna_test_patients_{}.npy

First step is to download all the script repo and data repo as per the below order :

As second step setup the present working directory to /content in the Google colab:

* Script Repo : Download the COVID-Net repository containing the Integration scripts : https://github.com/lindawangg/COVID-Net

* Data Repo : Download the datasets listed above
1. git clone https://github.com/ieee8023/covid-chestxray-dataset.git
2. git clone https://github.com/agchung/Figure1-COVID-chestxray-dataset.git
3. git clone https://github.com/agchung/Actualmed-COVID-chestxray-dataset.git
4. Go to this link <https://www.kaggle.com/tawsifurrahman/covid19-radiography-database> to download the COVID-19 Radiography database. Only the COVID-19 image folder and metadata file is required. The overlaps between covid-chestxray-dataset are handled.
5. Go to this link<https://www.kaggle.com/c/rsna-pneumonia-detection-challenge/data> to download the RSNA pneumonia dataset
6. Create a data directory in /content and within the data directory, create a train and test directory


### (ADD : )Configuration and setup before running data consolidation script - create_COVIDx.ipynb




#### **Final structure of the directory**
##### Common account - Ansgl125@gmail.com  **My Drive/Capstone-Dec2020/**

Store this data for reference only not for integration.
* **raw_data/kaggle_covid19_radiography_db      [*Data imported from setp-4*]
* **raw_data/kaggle_lungs_normal_pneumonia_rsna**                         [*Data imported from step-5*]

* **processed_data/[data/train, data/test**                         [*Contains original consolidated train and test data as zip files* ]

This is the final output for model consumption.

### (ADD:)**Runtime Pre-requisites :** Initialize Colab runtime with the following :
##### a) Python 3 ( Runtime->Change Runtime Type to Python 3) 
##### c) Tensor Flow 1.x

In [1]:
import warnings
warnings.filterwarnings('ignore')

In [2]:
from google.colab import drive
drive.mount('/gdrive', force_remount=True)

Mounted at /gdrive


In [3]:
%tensorflow_version 1.x
import tensorflow as tf
tf.__version__

TensorFlow 1.x selected.


'1.15.2'

Download the Git repository containing the create_COVIDx.ipynb and rsna_test_patients_{}.npy

In [8]:
#!git clone https://github.com/lindawangg/COVID-Net.git

In [10]:
#!mv '/content/COVID-Net' '/gdrive/My Drive/Capstone_Dec2020/data_integration_scripts/' 

Step-1 (Extract all the COVID datasets)

In [4]:
# Directly in the runtime
!git clone https://github.com/ieee8023/covid-chestxray-dataset.git

Cloning into 'covid-chestxray-dataset'...
remote: Enumerating objects: 3622, done.[K
remote: Total 3622 (delta 0), reused 0 (delta 0), pack-reused 3622[K
Receiving objects: 100% (3622/3622), 632.87 MiB | 47.43 MiB/s, done.
Resolving deltas: 100% (1439/1439), done.
Checking out files: 100% (1173/1173), done.


In [5]:
# Directly in the runtime
!git clone https://github.com/agchung/Figure1-COVID-chestxray-dataset.git

Cloning into 'Figure1-COVID-chestxray-dataset'...
remote: Enumerating objects: 112, done.[K
remote: Counting objects: 100% (112/112), done.[K
remote: Compressing objects: 100% (97/97), done.[K
remote: Total 112 (delta 28), reused 95 (delta 14), pack-reused 0[K
Receiving objects: 100% (112/112), 14.13 MiB | 44.93 MiB/s, done.
Resolving deltas: 100% (28/28), done.


In [6]:
# Directly in the runtime
!git clone https://github.com/agchung/Actualmed-COVID-chestxray-dataset.git

Cloning into 'Actualmed-COVID-chestxray-dataset'...
remote: Enumerating objects: 10, done.[K
remote: Counting objects: 100% (10/10), done.[K
remote: Compressing objects: 100% (9/9), done.[K
remote: Total 422 (delta 1), reused 6 (delta 1), pack-reused 412[K
Receiving objects: 100% (422/422), 1.56 GiB | 35.53 MiB/s, done.
Resolving deltas: 100% (16/16), done.
Checking out files: 100% (240/240), done.


Step-2 ( Download new COVID-19 dataset from Kaggle )

* Download **"Covid-19 radiography database"** from the link to local laptop/desktop - **https://www.kaggle.com/tawsifurrahman/covid19-radiography-database**
* Upload the corresponding zip file to **/Capstone-Dec2020/raw_data/kaggle_covid19_radiography_db**

Step-3 (Clone the repository containing the data integration scripts)

In [7]:
!ls '/content' 

Actualmed-COVID-chestxray-dataset  Figure1-COVID-chestxray-dataset
covid-chestxray-dataset		   sample_data


In [9]:
!pwd

/content


Step-4 ( Unzip the Radiography zip files and the RSNA zip files )

In [11]:
### Required to unzip the file in the gdrive itself (Takes hours to upload extracted file from local laptop to gdrive !)
!apt install unzip

Reading package lists... Done
Building dependency tree       
Reading state information... Done
unzip is already the newest version (6.0-21ubuntu1.1).
0 upgraded, 0 newly installed, 0 to remove and 15 not upgraded.


In [12]:
!ls /gdrive/MyDrive/Capstone_Dec2020/raw_data/kaggle_covid19_radiography_db/archive.zip

/gdrive/MyDrive/Capstone_Dec2020/raw_data/kaggle_covid19_radiography_db/archive.zip


In [13]:
# Commented so that we do not run it my mistake..
!unzip '/gdrive/MyDrive/Capstone_Dec2020/raw_data/kaggle_covid19_radiography_db/archive.zip'

Archive:  /gdrive/MyDrive/Capstone_Dec2020/raw_data/kaggle_covid19_radiography_db/archive.zip
  inflating: COVID-19 Radiography Database/COVID-19/COVID-19 (1).png  
  inflating: COVID-19 Radiography Database/COVID-19/COVID-19 (10).png  
  inflating: COVID-19 Radiography Database/COVID-19/COVID-19 (100).png  
  inflating: COVID-19 Radiography Database/COVID-19/COVID-19 (1000).png  
  inflating: COVID-19 Radiography Database/COVID-19/COVID-19 (1001).png  
  inflating: COVID-19 Radiography Database/COVID-19/COVID-19 (1002).png  
  inflating: COVID-19 Radiography Database/COVID-19/COVID-19 (1003).png  
  inflating: COVID-19 Radiography Database/COVID-19/COVID-19 (1004).png  
  inflating: COVID-19 Radiography Database/COVID-19/COVID-19 (1005).png  
  inflating: COVID-19 Radiography Database/COVID-19/COVID-19 (1006).png  
  inflating: COVID-19 Radiography Database/COVID-19/COVID-19 (1007).png  
  inflating: COVID-19 Radiography Database/COVID-19/COVID-19 (1008).png  
  inflating: COVID-19 Ra

Rename the radiograph database directory as given here "COVID-19-Radiography-Database" in /content folder

In [14]:
!ls

 Actualmed-COVID-chestxray-dataset   Figure1-COVID-chestxray-dataset
'COVID-19 Radiography Database'      sample_data
 covid-chestxray-dataset


In [16]:
!mkdir '/content/rsna-pneumonia-detection-challenge'

In [17]:
import os
os.getcwd()
os.chdir("/content/rsna-pneumonia-detection-challenge")

In [18]:
!pwd

/content/rsna-pneumonia-detection-challenge


In [19]:
!unzip '/gdrive/MyDrive/Capstone_Dec2020/raw_data/kaggle_lungs_normal_pneumonia_rsna/rsna-pneumonia-detection-challenge.zip'

[1;30;43mStreaming output truncated to the last 5000 lines.[0m
  inflating: stage_2_train_images/d5231546-354e-4071-9af1-6644beabfd86.dcm  
  inflating: stage_2_train_images/d5252a78-3ea1-48e9-9ffb-e7535be3ce80.dcm  
  inflating: stage_2_train_images/d525eafb-8908-45fd-a942-48d07c435487.dcm  
  inflating: stage_2_train_images/d5265640-17db-4880-866d-d2952e32941c.dcm  
  inflating: stage_2_train_images/d5277276-f8f8-40e9-b8e1-791cf5d96ac0.dcm  
  inflating: stage_2_train_images/d528d9e9-647a-4e2e-a16c-bd5e32a5bbf5.dcm  
  inflating: stage_2_train_images/d5293a3e-f050-4b98-8bbf-1f40e25bced5.dcm  
  inflating: stage_2_train_images/d52cbb5a-1d0a-457d-8c72-0f7aeec21ca7.dcm  
  inflating: stage_2_train_images/d52ce67b-be7c-4349-8dc4-38562928d208.dcm  
  inflating: stage_2_train_images/d535a3c8-c4a4-4856-b5cd-17f6332eac8b.dcm  
  inflating: stage_2_train_images/d5360dc4-6bea-4a7b-bc49-5b2547ad7877.dcm  
  inflating: stage_2_train_images/d5364bc1-bc2a-4bd0-a1bd-0cfb5a369ccc.dcm  
  inflating

The steps until here covers downloading and placing the files under the respective directories so that the data integration steps can begin...

### Data Integration steps..

**Pre-requisite : Install pydicom everytime after the runtime restarts**

In [20]:
### Required by the Data consolidation script
!pip install pydicom

Collecting pydicom
[?25l  Downloading https://files.pythonhosted.org/packages/f4/15/df16546bc59bfca390cf072d473fb2c8acd4231636f64356593a63137e55/pydicom-2.1.2-py3-none-any.whl (1.9MB)
[K     |████████████████████████████████| 1.9MB 6.3MB/s 
[?25hInstalling collected packages: pydicom
Successfully installed pydicom-2.1.2


In [21]:
import numpy as np
import pandas as pd
import os
import random 
from shutil import copyfile
import pydicom as dicom
import cv2

In [22]:
!pwd

/content/rsna-pneumonia-detection-challenge


In [23]:
os.chdir("/content")

In [54]:
# set parameters here
savepath = 'data'
seed = 0
np.random.seed(seed) # Reset the seed so all runs are the same.
random.seed(seed)
MAXVAL = 255  # Range [0 255]

# path to covid-19 dataset from https://github.com/ieee8023/covid-chestxray-dataset
cohen_imgpath = './covid-chestxray-dataset/images' 
cohen_csvpath = './covid-chestxray-dataset/metadata.csv'

# path to covid-19 dataset from https://github.com/agchung/Figure1-COVID-chestxray-dataset
fig1_imgpath = './Figure1-COVID-chestxray-dataset/images'
fig1_csvpath = './Figure1-COVID-chestxray-dataset/metadata.csv'

# path to covid-19 dataset from https://github.com/agchung/Actualmed-COVID-chestxray-dataset
actmed_imgpath = './Actualmed-COVID-chestxray-dataset/images'
actmed_csvpath = './Actualmed-COVID-chestxray-dataset/metadata.csv'

# path to covid-19 dataset from https://www.kaggle.com/tawsifurrahman/covid19-radiography-database
sirm_imgpath = './COVID-19-Radiography-Database/COVID-19'
sirm_csvpath = './COVID-19-Radiography-Database/COVID-19.metadata.xlsx'

# path to https://www.kaggle.com/c/rsna-pneumonia-detection-challenge
rsna_datapath = './rsna-pneumonia-detection-challenge'
# get all the normal from here
rsna_csvname = 'stage_2_detailed_class_info.csv' 
# get all the 1s from here since 1 indicate pneumonia
# found that images that aren't pneunmonia and also not normal are classified as 0s
rsna_csvname2 = 'stage_2_train_labels.csv' 
rsna_imgpath = 'stage_2_train_images'

# parameters for COVIDx dataset
train = []
test = []
test_count = {'normal': 0, 'pneumonia': 0, 'COVID-19': 0}
train_count = {'normal': 0, 'pneumonia': 0, 'COVID-19': 0}

mapping = dict()
mapping['COVID-19'] = 'COVID-19'
mapping['SARS'] = 'pneumonia'
mapping['MERS'] = 'pneumonia'
mapping['Streptococcus'] = 'pneumonia'
mapping['Klebsiella'] = 'pneumonia'
mapping['Chlamydophila'] = 'pneumonia'
mapping['Legionella'] = 'pneumonia'
mapping['E.Coli'] = 'pneumonia'
mapping['Normal'] = 'normal'
mapping['Lung Opacity'] = 'pneumonia'
mapping['1'] = 'pneumonia'

# train/test split
split = 0.1

# to avoid duplicates
patient_imgpath = {}

In [55]:
!mkdir data
!mkdir data/train
!mkdir data/test

!ls -l

total 28
drwxr-xr-x  4 root root 4096 Dec 26 06:56 Actualmed-COVID-chestxray-dataset
drwxr-xr-x  6 root root 4096 Dec 26 07:30 COVID-19-Radiography-Database
drwxr-xr-x 10 root root 4096 Dec 26 06:55 covid-chestxray-dataset
drwxr-xr-x  4 root root 4096 Dec 26 07:58 data
drwxr-xr-x  4 root root 4096 Dec 26 06:56 Figure1-COVID-chestxray-dataset
drwxr-xr-x  4 root root 4096 Dec 26 07:01 rsna-pneumonia-detection-challenge
drwxr-xr-x  1 root root 4096 Dec 21 17:29 sample_data


In [56]:
sirm_csvpath

'./COVID-19-Radiography-Database/COVID-19.metadata.xlsx'

In [57]:
!pwd

/content


UPLOAD METADATA FILE FROM DESKTOP OR COPY FROM RAW_DATA FOLDER

In [58]:
# adapted from https://github.com/mlmed/torchxrayvision/blob/master/torchxrayvision/datasets.py#L814
cohen_csv = pd.read_csv(cohen_csvpath, nrows=None)
#idx_pa = csv["view"] == "PA"  # Keep only the PA view
views = ["PA", "AP", "AP Supine", "AP semi erect", "AP erect"]
cohen_idx_keep = cohen_csv.view.isin(views)
cohen_csv = cohen_csv[cohen_idx_keep]

fig1_csv = pd.read_csv(fig1_csvpath, encoding='ISO-8859-1', nrows=None)
actmed_csv = pd.read_csv(actmed_csvpath, nrows=None)

sirm_csv = pd.read_excel(sirm_csvpath)

In [59]:
sirm_csv.shape

(219, 4)

In [60]:
# get non-COVID19 viral, bacteria, and COVID-19 infections from covid-chestxray-dataset, figure1 and actualmed
# stored as patient id, image filename and label
filename_label = {'normal': [], 'pneumonia': [], 'COVID-19': []}
count = {'normal': 0, 'pneumonia': 0, 'COVID-19': 0}
covid_ds = {'cohen': [], 'fig1': [], 'actmed': [], 'sirm': []}

for index, row in cohen_csv.iterrows():
    f = row['finding'].split('/')[-1] # take final finding in hierarchy, for the case of COVID-19, ARDS
    if f in mapping: # 
        count[mapping[f]] += 1
        entry = [str(row['patientid']), row['filename'], mapping[f], 'cohen']
        filename_label[mapping[f]].append(entry)
        if mapping[f] == 'COVID-19':
            covid_ds['cohen'].append(str(row['patientid']))
        
for index, row in fig1_csv.iterrows():
    if not str(row['finding']) == 'nan':
        f = row['finding'].split(',')[0] # take the first finding
        if f in mapping: # 
            count[mapping[f]] += 1
            if os.path.exists(os.path.join(fig1_imgpath, row['patientid'] + '.jpg')):
                entry = [row['patientid'], row['patientid'] + '.jpg', mapping[f], 'fig1']
            elif os.path.exists(os.path.join(fig1_imgpath, row['patientid'] + '.png')):
                entry = [row['patientid'], row['patientid'] + '.png', mapping[f], 'fig1']
            filename_label[mapping[f]].append(entry)
            if mapping[f] == 'COVID-19':
                covid_ds['fig1'].append(row['patientid'])

for index, row in actmed_csv.iterrows():
    if not str(row['finding']) == 'nan':
        f = row['finding'].split(',')[0]
        if f in mapping:
            count[mapping[f]] += 1
            entry = [row['patientid'], row['imagename'], mapping[f], 'actmed']
            filename_label[mapping[f]].append(entry)
            if mapping[f] == 'COVID-19':
                covid_ds['actmed'].append(row['patientid'])
    
sirm = set(sirm_csv['URL'])
cohen = set(cohen_csv['url'])
discard = ['100', '101', '102', '103', '104', '105', 
           '110', '111', '112', '113', '122', '123', 
           '124', '125', '126', '217']

for idx, row in sirm_csv.iterrows():
    #print(idx, row)
    patientid = row['FILE NAME']
    #print("Patient id:",patientid)
    if row['URL'] not in cohen and patientid[patientid.find('(')+1:patientid.find(')')] not in discard:
        count[mapping['COVID-19']] += 1
        imagename = patientid + '.' + row['FORMAT'].lower()
        if not os.path.exists(os.path.join(sirm_imgpath, imagename)):
            imagename = patientid.split('(')[0] + ' ('+ patientid.split('(')[1] + '.' + row['FORMAT'].lower()
            #print(imagename)
        entry = [patientid, imagename, mapping['COVID-19'], 'sirm']
        #print("SIRM : ",entry)
        filename_label[mapping['COVID-19']].append(entry)
        covid_ds['sirm'].append(patientid)
    
print('Data distribution from covid datasets:')
print(count)

COVID-19 (69).png
SIRM :  ['COVID-19(69)', 'COVID-19 (69).png', 'COVID-19', 'sirm']
COVID-19 (70).png
SIRM :  ['COVID-19(70)', 'COVID-19 (70).png', 'COVID-19', 'sirm']
COVID-19 (71).png
SIRM :  ['COVID-19(71)', 'COVID-19 (71).png', 'COVID-19', 'sirm']
COVID-19 (72).png
SIRM :  ['COVID-19(72)', 'COVID-19 (72).png', 'COVID-19', 'sirm']
COVID-19 (74).png
SIRM :  ['COVID-19(74)', 'COVID-19 (74).png', 'COVID-19', 'sirm']
COVID-19 (75).png
SIRM :  ['COVID-19(75)', 'COVID-19 (75).png', 'COVID-19', 'sirm']
COVID-19 (76).png
SIRM :  ['COVID-19(76)', 'COVID-19 (76).png', 'COVID-19', 'sirm']
COVID-19 (77).png
SIRM :  ['COVID-19(77)', 'COVID-19 (77).png', 'COVID-19', 'sirm']
COVID-19 (78).png
SIRM :  ['COVID-19(78)', 'COVID-19 (78).png', 'COVID-19', 'sirm']
COVID-19 (79).png
SIRM :  ['COVID-19(79)', 'COVID-19 (79).png', 'COVID-19', 'sirm']
COVID-19 (80).png
SIRM :  ['COVID-19(80)', 'COVID-19 (80).png', 'COVID-19', 'sirm']
COVID-19 (81).png
SIRM :  ['COVID-19(81)', 'COVID-19 (81).png', 'COVID-19', 

In [61]:
# add covid-chestxray-dataset, figure1 and actualmed into COVIDx dataset
# since these datasets don't have test dataset, split into train/test by patientid
# for covid-chestxray-dataset:
# patient 8 is used as non-COVID19 viral test
# patient 31 is used as bacterial test
# patients 19, 20, 36, 42, 86 are used as COVID-19 viral test
# for figure 1:
# patients 24, 25, 27, 29, 30, 32, 33, 36, 37, 38

# Made file name changes to the SIRM dataset COVID-19() to COVID-19 () as per the db in kaggle - Krishna

ds_imgpath = {'cohen': cohen_imgpath, 'fig1': fig1_imgpath, 'actmed': actmed_imgpath, 'sirm': sirm_imgpath}

for key in filename_label.keys():
    print(key, filename_label[key])
    arr = np.array(filename_label[key])
    if arr.size == 0:
        continue
    # split by patients
    # num_diff_patients = len(np.unique(arr[:,0]))
    # num_test = max(1, round(split*num_diff_patients))
    # select num_test number of random patients
    # random.sample(list(arr[:,0]), num_test)
    if key == 'pneumonia':
        test_patients = ['8', '31']
    elif key == 'COVID-19':
        test_patients = ['19', '20', '36', '42', '86', 
                         '94', '97', '117', '132', 
                         '138', '144', '150', '163', '169', '174', '175', '179', '190', '191'
                         'COVID-00024', 'COVID-00025', 'COVID-00026', 'COVID-00027', 'COVID-00029',
                         'COVID-00030', 'COVID-00032', 'COVID-00033', 'COVID-00035', 'COVID-00036',
                         'COVID-00037', 'COVID-00038',
                         'ANON24', 'ANON45', 'ANON126', 'ANON106', 'ANON67',
                         'ANON153', 'ANON135', 'ANON44', 'ANON29', 'ANON201', 
                         'ANON191', 'ANON234', 'ANON110', 'ANON112', 'ANON73', 
                         'ANON220', 'ANON189', 'ANON30', 'ANON53', 'ANON46',
                         'ANON218', 'ANON240', 'ANON100', 'ANON237', 'ANON158',
                         'ANON174', 'ANON19', 'ANON195',
                         'COVID-19(119)', 'COVID-19(87)', 'COVID-19(70)', 'COVID-19(94)', 
                         'COVID-19(215)', 'COVID-19(77)', 'COVID-19(213)', 'COVID-19(81)', 
                         'COVID-19(216)', 'COVID-19(72)', 'COVID-19(106)', 'COVID-19(131)', 
                         'COVID-19(107)', 'COVID-19(116)', 'COVID-19(95)', 'COVID-19(214)', 
                         'COVID-19(129)']
    else: 
        test_patients = []
    #print('Key: ', key)
    #print('Test patients: ', test_patients)
    # go through all the patients
    for patient in arr:
        if patient[0] not in patient_imgpath:
            patient_imgpath[patient[0]] = [patient[1]]
        else:
            if patient[1] not in patient_imgpath[patient[0]]:
                patient_imgpath[patient[0]].append(patient[1])
            else:
                continue  # skip since image has already been written
        if patient[0] in test_patients:
            #print("If block..", patient[0])
            if patient[3] == 'sirm':
                #print("In SIRM block")
                image = cv2.imread(os.path.join(ds_imgpath[patient[3]], patient[1]))
                gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
                patient[1] = patient[1].replace(' ', '')
                cv2.imwrite(os.path.join(savepath, 'test', patient[1]), gray)
            else:
                copyfile(os.path.join(ds_imgpath[patient[3]], patient[1]), os.path.join(savepath, 'test', patient[1]))
            test.append(patient)
            test_count[patient[2]] += 1
        else:
            #print("Else block..", patient[0])
            if patient[3] == 'sirm':
                #print("In SIRM block")
                image = cv2.imread(os.path.join(ds_imgpath[patient[3]], patient[1]))
                gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
                patient[1] = patient[1].replace(' ', '')
                cv2.imwrite(os.path.join(savepath, 'train', patient[1]), gray)
            else:
                copyfile(os.path.join(ds_imgpath[patient[3]], patient[1]), os.path.join(savepath, 'train', patient[1]))
            train.append(patient)
            train_count[patient[2]] += 1

print('test count: ', test_count)
print('train count: ', train_count)

normal []
pneumonia [['3', 'SARS-10.1148rg.242035193-g04mr34g0-Fig8a-day0.jpeg', 'pneumonia', 'cohen'], ['3', 'SARS-10.1148rg.242035193-g04mr34g0-Fig8b-day5.jpeg', 'pneumonia', 'cohen'], ['3', 'SARS-10.1148rg.242035193-g04mr34g0-Fig8c-day10.jpeg', 'pneumonia', 'cohen'], ['7', 'SARS-10.1148rg.242035193-g04mr34g04a-Fig4a-day7.jpeg', 'pneumonia', 'cohen'], ['7', 'SARS-10.1148rg.242035193-g04mr34g04b-Fig4b-day12.jpeg', 'pneumonia', 'cohen'], ['8', 'SARS-10.1148rg.242035193-g04mr34g05x-Fig5-day9.jpeg', 'pneumonia', 'cohen'], ['9', 'SARS-10.1148rg.242035193-g04mr34g07a-Fig7a-day5.jpeg', 'pneumonia', 'cohen'], ['9', 'SARS-10.1148rg.242035193-g04mr34g07b-Fig7b-day12.jpeg', 'pneumonia', 'cohen'], ['10', 'SARS-10.1148rg.242035193-g04mr34g09a-Fig9a-day17.jpeg', 'pneumonia', 'cohen'], ['10', 'SARS-10.1148rg.242035193-g04mr34g09b-Fig9b-day19.jpeg', 'pneumonia', 'cohen'], ['10', 'SARS-10.1148rg.242035193-g04mr34g09c-Fig9c-day27.jpeg', 'pneumonia', 'cohen'], ['29', 'streptococcus-pneumoniae-pneumonia

In [62]:
# add normal and rest of pneumonia cases from https://www.kaggle.com/c/rsna-pneumonia-detection-challenge
csv_normal = pd.read_csv(os.path.join(rsna_datapath, rsna_csvname), nrows=None)
csv_pneu = pd.read_csv(os.path.join(rsna_datapath, rsna_csvname2), nrows=None)
patients = {'normal': [], 'pneumonia': []}

for index, row in csv_normal.iterrows():
    if row['class'] == 'Normal':
        patients['normal'].append(row['patientId'])

for index, row in csv_pneu.iterrows():
    if int(row['Target']) == 1:
        patients['pneumonia'].append(row['patientId'])

for key in patients.keys():
    arr = np.array(patients[key])
    if arr.size == 0:
        continue
    # split by patients 
    # num_diff_patients = len(np.unique(arr))
    # num_test = max(1, round(split*num_diff_patients))
    test_patients = np.load('/gdrive/My Drive/Capstone_Dec2020/data_integration_scripts/COVID-Net/rsna_test_patients_{}.npy'.format(key)) # random.sample(list(arr), num_test), download the .npy files from the repo.
    # Commented By : Krishna Kumar
    # np.save('rsna_test_patients_{}.npy'.format(key), np.array(test_patients))
    for patient in arr:
        if patient not in patient_imgpath:
            patient_imgpath[patient] = [patient]
        else:
            continue  # skip since image has already been written
                
        ds = dicom.dcmread(os.path.join(rsna_datapath, rsna_imgpath, patient + '.dcm'))
        pixel_array_numpy = ds.pixel_array
        imgname = patient + '.png'
        if patient in test_patients:
            cv2.imwrite(os.path.join(savepath, 'test', imgname), pixel_array_numpy)
            test.append([patient, imgname, key, 'rsna'])
            test_count[key] += 1
        else:
            cv2.imwrite(os.path.join(savepath, 'train', imgname), pixel_array_numpy)
            train.append([patient, imgname, key, 'rsna'])
            train_count[key] += 1

print('test count: ', test_count)
print('train count: ', train_count)

test count:  {'normal': 885, 'pneumonia': 594, 'COVID-19': 100}
train count:  {'normal': 7966, 'pneumonia': 5475, 'COVID-19': 517}


In [63]:
# final stats
print('Final stats')
print('Train count: ', train_count)
print('Test count: ', test_count)
print('Total length of train: ', len(train))
print('Total length of test: ', len(test))

Final stats
Train count:  {'normal': 7966, 'pneumonia': 5475, 'COVID-19': 517}
Test count:  {'normal': 885, 'pneumonia': 594, 'COVID-19': 100}
Total length of train:  13958
Total length of test:  1579


Export the metadata files for Train and Test

In [64]:
# export to train and test csv
# format as patientid, filename, label, separated by a space
train_file = open("train_split.txt",'w') 
for sample in train:
    if len(sample) == 4:
        info = str(sample[0]) + ' ' + sample[1] + ' ' + sample[2] + ' ' + sample[3] + '\n'
    else:
        info = str(sample[0]) + ' ' + sample[1] + ' ' + sample[2] + '\n'
    train_file.write(info)

train_file.close()

test_file = open("test_split.txt", 'w')
for sample in test:
    if len(sample) == 4:
        info = str(sample[0]) + ' ' + sample[1] + ' ' + sample[2] + ' ' + sample[3] + '\n'
    else:
        info = str(sample[0]) + ' ' + sample[1] + ' ' + sample[2] + '\n'
    test_file.write(info)

test_file.close()

In [48]:
!cd /content/data/train

In [66]:
!ls /content/data/train | wc -l

13958


Final Step - Zip the Train and Test folders as zip files without directory information

In [74]:
!zip -r -j xray-images-train.zip /content/data/train/

[1;30;43mStreaming output truncated to the last 5000 lines.[0m
  adding: 89a70e5a-1a24-46a0-9b60-e0780752a1d7.png (deflated 1%)
  adding: d8e8af62-c7de-4f67-a5de-d38507419763.png (deflated 1%)
  adding: 09f314e5-047f-4f7d-87b3-b944e9e566dd.png (deflated 2%)
  adding: a65bb30b-a401-4e2d-848e-7aa7ecd69b00.png (deflated 1%)
  adding: 1149a2ab-1e6e-42a3-b850-27486c330a87.png (deflated 0%)
  adding: 28eaff00-5fb7-40ee-8206-9d2ff5b081e0.png (deflated 0%)
  adding: 4769fa12-e694-48ce-ad4e-0628e5dd75f0.png (deflated 0%)
  adding: 894e2c84-f131-41f6-b846-eb5ae491a1c0.png (deflated 0%)
  adding: e87d8111-5071-446b-82f6-00a55f338954.png (deflated 2%)
  adding: 82cd9979-23bc-422a-86a4-3987ffccf77b.png (deflated 0%)
  adding: e67b71d5-e6cb-40be-bf98-ee06d26a6a96.png (deflated 2%)
  adding: bca964e1-2a1e-4d66-b191-1550430659cb.png (deflated 0%)
  adding: bfa0bdbb-8f93-4c32-b8b6-4e71f6caea5e.png (deflated 0%)
  adding: 9d2c812a-8537-454f-84e2-be8ebe51b23d.png (deflated 1%)
  adding: 615aa2bd-c60b-4

In [73]:
!zip -r -j xray-images-test.zip /content/data/test/

  adding: ebd4509c-0b8f-4be2-aa51-bebad08235be.png (deflated 1%)
  adding: 9681620c-bb96-4a93-b0c8-2e10912e88ee.png (deflated 0%)
  adding: ee2aa282-fcf7-4368-a66c-70fc51aedbab.png (deflated 1%)
  adding: 3c2f580e-bdac-4f28-a33e-38b147fc60ca.png (deflated 1%)
  adding: 2398ea30-8cf0-4265-b374-0300a5181759.png (deflated 1%)
  adding: a541be19-c9f6-487e-b615-d15042a246b1.png (deflated 1%)
  adding: 0c24e2a2-9fca-463a-a1fa-fbcda2712260.png (deflated 1%)
  adding: db7d9187-7313-4dec-8ba6-1d36800c44d5.png (deflated 1%)
  adding: 2ab70125-b741-487a-9f06-12a66018dc5e.png (deflated 1%)
  adding: b2d2ef9e-fabe-4e70-b36a-e1e62e0a00f2.png (deflated 1%)
  adding: bcce369d-8397-46d1-8c53-5142c4fd2866.png (deflated 2%)
  adding: 7105d840-ed2b-42fa-a32c-ba10ff3604bb.png (deflated 1%)
  adding: 67702847-2e38-4864-acd6-5ca69941ee68.png (deflated 2%)
  adding: 55c25b52-0043-4763-b889-3f5f44f01728.png (deflated 1%)
  adding: a16503fc-2eb7-4a00-b101-2b5d976dc8a9.png (deflated 0%)
  adding: 999e9144-e7f0-4

In [76]:
!ls /content

Actualmed-COVID-chestxray-dataset   sample_data
COVID-19-Radiography-Database	    test_split.txt
covid-chestxray-dataset		    train_split.txt
data				    xray-images-test.zip
Figure1-COVID-chestxray-dataset     xray-images-train.zip
rsna-pneumonia-detection-challenge


Copy the zip files from the VM into the My Drive folder

In [77]:
!cp /content/xray-images-test.zip /gdrive/MyDrive/Capstone_Dec2020/processed_data/data/test/

In [82]:
!cp /content/xray-images-train.zip /gdrive/MyDrive/Capstone_Dec2020/processed_data/data/train/