## Prepare train/test dataset split

This notebook contains all the code used to prepare the datasets for training and testing our DNN models.

It takes all 1431 images with the Pneumonia label from the original NIH-CXR-14 dataset.

It then adds enough NON-Pneumonia images to satisfy the following requirements:

* The train-test split follows an 80-20 proportion. In other words, the train set contains 4X the number of images in the test set
* The train set is balanced: Pneumonia images are 50% of the train set
* The test set is not balanced. The percentage of Pneumonia images is close to what could be expected from CXR of patients admitted to a hospital. In our case it is 25% of the test set
* No patient has images in both the train set and the test set.

Putting these constraints in a linear system, we get the following result:

* Total number of images in the train set: 2544
* Total number of images in the test set: 636
* Total number of Pneumonia images in the train set: 1272
* Total number of NON-Pneumonia images in the train set: 1272
* Total number of Pneumonia images in the test set: 159
* Total number of NON-Pneumonia images in the test set: 477
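
The numbers above can be re-derived from the four requirements. The following is a minimal sketch using only the Python standard library; the variable names are ours and are not used elsewhere in the notebook.

```python
from fractions import Fraction

N_PNEUMONIA = 1431  # Pneumonia images available in NIH-CXR-14 (from the text)

# Unknown: n_test, the total number of images in the test set.
#   n_train   = 4 * n_test               (80-20 split)
#   pne_train = n_train / 2              (balanced train set)
#   pne_test  = n_test / 4               (25% Pneumonia in the test set)
#   pne_train + pne_test = N_PNEUMONIA   (every Pneumonia image is used)
# Substituting the first three into the last: (4/2 + 1/4) * n_test = N_PNEUMONIA
n_test = Fraction(N_PNEUMONIA) / (Fraction(4, 2) + Fraction(1, 4))
assert n_test.denominator == 1, "constraints have no integer solution"

n_test = int(n_test)
n_train = 4 * n_test
pne_train, pne_test = n_train // 2, n_test // 4
non_pne_train, non_pne_test = n_train - pne_train, n_test - pne_test

print(n_train, n_test, pne_train, non_pne_train, pne_test, non_pne_test)
# prints: 2544 636 1272 1272 159 477
```

The system has an integer solution only because 1431 is divisible by 9. With a different number of Pneumonia images, some of them would have to be left out.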

All images are pre-processed to speed up training and get the maximum utilization of the TPU:
* compressed in JPEG, resized to 512x512
* packed in files in **TensorFlow TFRecord** format

In [1]:
import numpy as np
import pandas as pd
import os
import cv2

from itertools import chain
from sklearn.model_selection import GroupKFold
import random
import glob
import time

import tensorflow as tf

import matplotlib.pyplot as plt

In [2]:
IMAGE_DIR = '/volb/cxr/images'
FILE_ORIG = './Data_Entry_2017.csv'

df_orig = pd.read_csv(FILE_ORIG)

print('Original number of images:', df_orig.shape[0])

Original number of images: 112120


In [3]:
# add other columns to identify pneumonia and not pneumonia
def prepare_df_with_all_diseases(f_name):
    # load the metadata; one column per disease (0,1) is added below
    full_df = pd.read_csv(f_name)
    # drop the columns that are not needed
    full_df.drop(['Unnamed: 11', 'OriginalImage[Width', 'Height]', 
              'OriginalImagePixelSpacing[x', 'y]'], axis = 1, inplace=True)
    
    # add one column per label
    all_labels = np.unique(list(chain(*full_df['Finding Labels'].map(lambda x: x.split('|')).tolist())))
    all_labels = [x for x in all_labels if len(x)>0]
    
    for c_label in all_labels:
        if len(c_label)>1: # leave out empty labels
            full_df[c_label] = full_df['Finding Labels'].map(lambda finding: 1.0 if c_label in finding else 0)
    
    return full_df
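
The function above marks a label with a substring test (`c_label in finding`), which is fine as long as no label name is contained in another one. As a hedged alternative, here is a small sketch of exact-token matching on the `|`-separated strings. It uses plain Python on toy rows; `one_hot_findings` is a hypothetical helper, not part of the notebook.

```python
def one_hot_findings(finding_labels):
    """Return (sorted label names, one 0/1 row per input string)."""
    # Split each 'A|B|C' string into a set of exact label tokens.
    token_sets = [set(s.split('|')) - {''} for s in finding_labels]
    all_labels = sorted(set().union(*token_sets))
    rows = [[1 if label in tokens else 0 for label in all_labels]
            for tokens in token_sets]
    return all_labels, rows

# Toy rows in the 'Finding Labels' format of Data_Entry_2017.csv
sample = ['Pneumonia', 'Infiltration|Pneumonia', 'No Finding', 'Effusion|Mass']
labels, rows = one_hot_findings(sample)

pneumonia_col = labels.index('Pneumonia')
print(labels)
print([row[pneumonia_col] for row in rows])
```

Because each string is split into tokens first, a label is matched only when it appears as a whole entry between separators.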

In [4]:
full_df = prepare_df_with_all_diseases(FILE_ORIG)

In [5]:
# select ONLY pneumonia

condition = (full_df['Pneumonia'] == 1)

pneumonia_df = full_df[condition]

N_PNEUMONIA = len(pneumonia_df)

print('Total number of images with Pneumonia label:', N_PNEUMONIA)

Total number of images with Pneumonia label: 1431


### First split pneumonia images between test and train datasets
#### the split is done in order to have no intersection on Patient ID

In [6]:
# images will be split between train and test
N_PNEUMONIA_TEST = 159
N_PNEUMONIA_TRAIN = 1272

# select PNEUMONIA images for test, train

# We don't want the same Patient ID in train and test, therefore we use a GroupKFold split
# the split is grouped by Patient ID
# one Patient ID per row, in the same order as the rows of pneumonia_df
groups = pneumonia_df['Patient ID'].values

# 8 is the ratio N_PNEUMONIA_TRAIN/N_PNEUMONIA_TEST, therefore we divide in 9 parts
# first 8 parts for train, last for test
gkf = GroupKFold(n_splits = int(N_PNEUMONIA_TRAIN/N_PNEUMONIA_TEST) +1) 

for i, (train_index, test_index) in enumerate(gkf.split(pneumonia_df, groups=groups)):
    # we could take any one of the splits, we choose the first
    if i == 0:
        pne_idxs_train = train_index
        pne_idxs_test = test_index

In [7]:
# run some sanity checks

print('Check: number of Pneumonia in train and test is correct')
assert len(pne_idxs_train) == N_PNEUMONIA_TRAIN
assert len(pne_idxs_test) == N_PNEUMONIA_TEST
print('Check OK')

# no intersection between Patient ID?
print('Check: No intersection of Patients between train and test')
pne_pid_train = list(pneumonia_df.iloc[pne_idxs_train]['Patient ID'].values)
pne_pid_test = list(pneumonia_df.iloc[pne_idxs_test]['Patient ID'].values)
assert len(set(pne_pid_train).intersection(set(pne_pid_test))) == 0
print('Check OK')

Check: number of Pneumonia in train and test is correct
Check OK
Check: No intersection of Patients between train and test
Check OK
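
To make the idea of a patient-disjoint split concrete, here is a simplified sketch in plain Python. It assigns whole groups (patients) to folds and keeps the fold sizes roughly equal. It only illustrates the principle: it is not scikit-learn's `GroupKFold` implementation, and `group_folds` and the toy data are hypothetical.

```python
from collections import Counter

def group_folds(groups, n_splits):
    """Assign each group to one fold, keeping fold sizes roughly equal.

    groups: one group id per sample (here, a patient id per image).
    Returns a list with the fold number of every sample.
    """
    sizes = Counter(groups)
    fold_load = [0] * n_splits
    fold_of_group = {}
    # Largest groups first, each one into the currently lightest fold.
    for group, count in sorted(sizes.items(), key=lambda kv: (-kv[1], kv[0])):
        lightest = min(range(n_splits), key=fold_load.__getitem__)
        fold_of_group[group] = lightest
        fold_load[lightest] += count
    return [fold_of_group[g] for g in groups]

# Hypothetical toy data: 9 images from 5 patients
patient_ids = [1, 1, 1, 2, 2, 3, 3, 4, 5]
folds = group_folds(patient_ids, n_splits=3)

# Use fold 0 as test and the remaining folds as train
test_idx = [i for i, f in enumerate(folds) if f == 0]
train_idx = [i for i, f in enumerate(folds) if f != 0]

train_patients = {patient_ids[i] for i in train_idx}
test_patients = {patient_ids[i] for i in test_idx}
assert train_patients.isdisjoint(test_patients)
```

Since a patient is never divided between folds, any choice of test fold gives a split with no shared Patient ID.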


In [8]:
# now pne_idxs_train and pne_idxs_test can be used on pneumonia_df to get image name with pneumonia: this is the split train, test set

pne_train_image_list = list(pneumonia_df.iloc[pne_idxs_train]['Image Index'].values)
pne_test_image_list = list(pneumonia_df.iloc[pne_idxs_test]['Image Index'].values)

# now we need to add NON Pneumonia, respecting separation of Patient ID and correct ratio

### add NON pneumonia images to test set

In [9]:
# now we add to the test set 477 NON pneumonia images, where the Patient ID is NOT in pne_pid_train
N_NON_PNE_TEST = 477

# select non pneumonia
condition = (full_df['Pneumonia'] == 0.)
non_pneumonia_df = full_df[condition]

# we want only Patient ID (pid) not in TRAIN
condition = ~non_pneumonia_df['Patient ID'].isin(pne_pid_train)

# we want N_NON_PNE_TEST images for test
non_pne_test_df = non_pneumonia_df[condition].sample(n=N_NON_PNE_TEST)

# make a check: intersection is null
print('Check: No intersection of Patients between non pneumonia test and pneumonia train')
non_pne_pid_test = list(non_pne_test_df['Patient ID'].values)
assert len(set(non_pne_pid_test).intersection(set(pne_pid_train))) == 0
print('Check OK')

Check: No intersection of Patients between non pneumonia test and pneumonia train
Check OK
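
The sampling in the cell above is not seeded, so each run of the notebook draws a different set of NON pneumonia images. A minimal sketch of a reproducible, patient-aware draw follows. It uses plain Python and toy data; `sample_excluding_patients` is a hypothetical helper.

```python
import random

def sample_excluding_patients(records, excluded_pids, n, seed):
    """Draw n records whose patient id is not in excluded_pids.

    records: list of (image_name, patient_id) tuples.
    """
    excluded = set(excluded_pids)
    candidates = [r for r in records if r[1] not in excluded]
    if n > len(candidates):
        raise ValueError('not enough candidate images')
    # A private Random instance keeps the draw independent of global state.
    return random.Random(seed).sample(candidates, n)

# Hypothetical toy data: (image name, patient id), two images per patient
records = [('img_%03d.png' % i, i // 2) for i in range(20)]
train_pids = {0, 1, 2}

first = sample_excluding_patients(records, train_pids, n=5, seed=42)
second = sample_excluding_patients(records, train_pids, n=5, seed=42)

assert first == second  # same seed, same draw
assert all(pid not in train_pids for _, pid in first)
```

With pandas the same effect can be obtained by passing `random_state` to `DataFrame.sample`.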


In [10]:
# this is the list of NON pneumonia images to add to the test set
non_pne_test_image_list = list(non_pne_test_df['Image Index'].values)

print('Check: total images in test are: 477+159')
assert len(non_pne_test_image_list) + len(pne_test_image_list) == (477 + 159)
print('Check OK')

ratio = len(non_pne_test_image_list)/len(pne_test_image_list)
print('Split NON PNE vs PNE in test set is:', len(non_pne_test_image_list), 'vs', len(pne_test_image_list), 'ratio is:', ratio)

# this is the complete list (pne and non pne for test)
test_image_list = non_pne_test_image_list + pne_test_image_list

Check: total images in test are: 477+159
Check OK
Split NON PNE vs PNE in test set is: 477 vs 159 ratio is: 3.0


### Test set is not balanced. Pneumonia images are 25% of the total.

### Now: complete the train set

In [11]:
# now we need to complete the train set by adding NON PNE images to pne_train_image_list
# we need to add 1272 NON PNE images
# conditions:
# 1. images must be NON PNE
# 2. Patient ID must not be among those of the test images
# 3. Images must not be in the test set

# 1. we use non_pneumonia_df
# 2. Patient ID must NOT be in non_pne_pid_test + pne_pid_test == test_pid

# this is the list of ALL Patient ID in test set
test_pid = non_pne_pid_test + pne_pid_test

# first select a list of images satisfying 1+2
condition = ~non_pneumonia_df['Patient ID'].isin(test_pid)

candidate_list = list(non_pneumonia_df[condition]['Image Index'].values)

# select only those images not in test image list (condition 3)
non_pne_train_image_list = [x for x in candidate_list if x not in test_image_list]

# now select ONLY 1272 NON PNE for train
N_NON_PNE_TRAIN = 1272
non_pne_train_image_list = random.sample(non_pne_train_image_list, N_NON_PNE_TRAIN)

In [12]:
# this is the complete list (pne and non pne) for train
train_image_list = non_pne_train_image_list + pne_train_image_list

ratio = len(non_pne_train_image_list)/len(pne_train_image_list)

print('Split NON PNE vs PNE in train set is:', len(non_pne_train_image_list), 'vs', len(pne_train_image_list), 'ratio is:', ratio)

Split NON PNE vs PNE in train set is: 1272 vs 1272 ratio is: 1.0


### Train set is balanced 50%-50%

In [13]:
# check that there is no intersection between train and test
print('Check: there are no common images between train and test')
assert len(set(train_image_list).intersection(set(test_image_list))) == 0
print('Check OK')

Check: there are no common images between train and test
Check OK
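
The check above compares image names. A complementary check at patient level, over the complete train and test lists, can be sketched as follows. The mapping and the image names are hypothetical toy data; in this notebook such a mapping could be built from the `Image Index` and `Patient ID` columns of `full_df`.

```python
def patients_of(image_names, patient_by_image):
    """Return the set of patient ids behind a list of image names."""
    return {patient_by_image[name] for name in image_names}

# Hypothetical toy mapping: image name -> patient id
patient_by_image = {
    'a_000.png': 1, 'a_001.png': 1,
    'b_000.png': 2,
    'c_000.png': 3, 'c_001.png': 3,
}
train_images = ['a_000.png', 'a_001.png', 'b_000.png']
test_images = ['c_000.png', 'c_001.png']

# Patients that appear on both sides of the split (should be none)
shared = (patients_of(train_images, patient_by_image)
          & patients_of(test_images, patient_by_image))
assert not shared, 'patients in both sets: %s' % sorted(shared)
```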


In [14]:
# create label list

# order is always non-pneumonia + pneumonia in train and test
# first part all zeros (non pneumonia) then all one (pneumonia)
train_label_list = list(np.zeros(len(non_pne_train_image_list))) + list(np.ones(len(pne_train_image_list)))
test_label_list = list(np.zeros(len(non_pne_test_image_list))) + list(np.ones(len(pne_test_image_list)))

### Before creating TFRecord files we need to shuffle the lists, to avoid having long consecutive runs of non-pneumonia and pneumonia images

In [15]:
# shuffle train set

idxs = np.arange(len(train_label_list))
np.random.shuffle(idxs)

train_label_list = np.array(train_label_list)
train_image_list = np.array(train_image_list)

# shuffle and back to list
train_label_list = list(train_label_list[idxs])
train_image_list = list(train_image_list[idxs])

In [16]:
# shuffle test set

idxs = np.arange(len(test_label_list))
np.random.shuffle(idxs)

test_label_list = np.array(test_label_list)
test_image_list = np.array(test_image_list)

# shuffle and back to list
test_label_list = list(test_label_list[idxs])
test_image_list = list(test_image_list[idxs])
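
The two cells above apply one permutation to both the image list and the label list, so each image keeps its label. The same idea can be sketched with the standard library only, adding a seed for reproducibility. `shuffle_together` and the toy data are hypothetical.

```python
import random

def shuffle_together(images, labels, seed):
    """Shuffle two equal-length lists with the same permutation."""
    if len(images) != len(labels):
        raise ValueError('lists must have the same length')
    pairs = list(zip(images, labels))
    random.Random(seed).shuffle(pairs)
    shuffled_images = [image for image, _ in pairs]
    shuffled_labels = [label for _, label in pairs]
    return shuffled_images, shuffled_labels

# Hypothetical toy data: label 1.0 only for names starting with 'pne'
images = ['non_0', 'non_1', 'non_2', 'pne_0', 'pne_1']
labels = [0.0, 0.0, 0.0, 1.0, 1.0]
s_images, s_labels = shuffle_together(images, labels, seed=0)

# Every image still carries its own label after the shuffle
assert all((lab == 1.0) == img.startswith('pne')
           for img, lab in zip(s_images, s_labels))
assert sorted(s_images) == sorted(images)
```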

In [17]:
# save data in two csv files, for further processing

df_train_csv = pd.DataFrame(list(zip(train_image_list, train_label_list)), columns = ['image_name', 'label'])
df_test_csv = pd.DataFrame(list(zip(test_image_list, test_label_list)), columns = ['image_name', 'label'])

In [18]:
df_train_csv.to_csv('train-submission3.csv')
df_test_csv.to_csv('test-submission3.csv')

## Prepare TFRecord files

In [19]:
# utility functions from TF2 docs

def _bytes_feature(value):
  """Returns a bytes_list from a string / byte."""
  if isinstance(value, type(tf.constant(0))):
    value = value.numpy() # BytesList won't unpack a string from an EagerTensor.
  return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))

def _float_feature(value):
  """Returns a float_list from a float / double."""
  return tf.train.Feature(float_list=tf.train.FloatList(value=[value]))

def _int64_feature(value):
  """Returns an int64_list from a bool / enum / int / uint."""
  return tf.train.Feature(int64_list=tf.train.Int64List(value=[value]))

In [20]:
# the feature dict defines the record schema:
# image (JPEG bytes), image_idx (image file name), label (target)

def serialize_example(img, img_idx, label):
  feature = {
      'image': _bytes_feature(img),
      'image_idx': _bytes_feature(img_idx),
      'label': _int64_feature(label),
  }
  example_proto = tf.train.Example(features=tf.train.Features(feature=feature))
  return example_proto.SerializeToString()

In [21]:
# everything in a function

# how many images per file
SIZE = 200
# image size (e.g. 512x512)
IMG_PIXEL = 512
# directory where we put TFREC files
TFREC_DIR = '/volb/cxr/tfrec3-512'

def create_tfrec(image_list, label_list, file_prefix='train'):
    # imgs to process
    IMGS = image_list
    
    # number of files needed (the last one may hold fewer than SIZE images)
    CT = len(IMGS)//SIZE + int(len(IMGS)%SIZE!=0)
    
    for j in range(CT):
        print()
        print('Writing TFRecord %i of %i...'%(j+1, CT))
    
        tStart = time.time()
        
        # number of images that will go in the file
        CT2 = min(SIZE, len(IMGS)-j*SIZE)
        
        # j here is the number given to the file (starting with 00)
        with tf.io.TFRecordWriter(os.path.join(TFREC_DIR, file_prefix + '%.2i-%i.tfrec'%(j,CT2))) as writer:
            for k in range(CT2):
                index = SIZE*j+k
            
                # read and preprocess png image
                img = cv2.imread(os.path.join(IMAGE_DIR, IMGS[index]))
                img = cv2.resize(img, (IMG_PIXEL, IMG_PIXEL), interpolation = cv2.INTER_AREA)
            
                # encode image as JPEG, to save space and reduce read time
                img = cv2.imencode('.jpg', img, (cv2.IMWRITE_JPEG_QUALITY, 94))[1].tobytes()
                name = IMGS[index]
            
                # get the label
                label = label_list[index]
            
                # build the record
                # here the structure is img, image_name, label
                # must be aligned to the serialize_example() above defined
                example = serialize_example(img, str.encode(name),int(label))
                
                writer.write(example)
            
                # print progress
                if k%100==0: print('#','',end='')
    
        tEnd = time.time()
    
        print('')
        print('Elapsed: ', round((tEnd - tStart),1), ' (sec)')
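
The file count (`CT`) and the per-file image count (`CT2`) computed in `create_tfrec` follow a simple ceiling division. The arithmetic can be checked in isolation with this small sketch, where `shard_sizes` is a hypothetical helper:

```python
import math

def shard_sizes(n_images, shard_size):
    """Return the number of images written to each TFRecord file."""
    n_shards = math.ceil(n_images / shard_size)
    return [min(shard_size, n_images - j * shard_size) for j in range(n_shards)]

train_shards = shard_sizes(2544, 200)
test_shards = shard_sizes(636, 200)

print(len(train_shards), train_shards[-1])  # prints: 13 144
print(len(test_shards), test_shards[-1])    # prints: 4 36
```

With 200 images per file, the train set needs 13 files and the test set 4, the last file of each being only partially filled.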

### TFRecords for train set

In [22]:
create_tfrec(train_image_list, train_label_list, file_prefix='train')


Writing TFRecord 1 of 13...
# # 
Elapsed:  5.1  (sec)

Writing TFRecord 2 of 13...
# # 
Elapsed:  4.9  (sec)

Writing TFRecord 3 of 13...
# # 
Elapsed:  4.8  (sec)

Writing TFRecord 4 of 13...
# # 
Elapsed:  4.9  (sec)

Writing TFRecord 5 of 13...
# # 
Elapsed:  4.9  (sec)

Writing TFRecord 6 of 13...
# # 
Elapsed:  4.9  (sec)

Writing TFRecord 7 of 13...
# # 
Elapsed:  4.9  (sec)

Writing TFRecord 8 of 13...
# # 
Elapsed:  4.9  (sec)

Writing TFRecord 9 of 13...
# # 
Elapsed:  4.9  (sec)

Writing TFRecord 10 of 13...
# # 
Elapsed:  4.9  (sec)

Writing TFRecord 11 of 13...
# # 
Elapsed:  5.0  (sec)

Writing TFRecord 12 of 13...
# # 
Elapsed:  4.9  (sec)

Writing TFRecord 13 of 13...
# # 
Elapsed:  3.4  (sec)


### Now for test

In [23]:
create_tfrec(test_image_list, test_label_list, file_prefix='test')


Writing TFRecord 1 of 4...
# # 
Elapsed:  4.6  (sec)

Writing TFRecord 2 of 4...
# # 
Elapsed:  4.6  (sec)

Writing TFRecord 3 of 4...
# # 
Elapsed:  4.5  (sec)

Writing TFRecord 4 of 4...
# 
Elapsed:  0.8  (sec)


In [24]:
# the list of files produced
!ls -l $TFREC_DIR

total 151336
-rw-rw-r-- 1 ubuntu ubuntu 9546273 Jan  7 11:29 test00-200.tfrec
-rw-rw-r-- 1 ubuntu ubuntu 9767954 Jan  7 11:30 test01-200.tfrec
-rw-rw-r-- 1 ubuntu ubuntu 9776337 Jan  7 11:30 test02-200.tfrec
-rw-rw-r-- 1 ubuntu ubuntu 1780421 Jan  7 11:30 test03-36.tfrec
-rw-rw-r-- 1 ubuntu ubuntu 9791415 Jan  7 11:28 train00-200.tfrec
-rw-rw-r-- 1 ubuntu ubuntu 9734354 Jan  7 11:28 train01-200.tfrec
-rw-rw-r-- 1 ubuntu ubuntu 9776307 Jan  7 11:28 train02-200.tfrec
-rw-rw-r-- 1 ubuntu ubuntu 9582946 Jan  7 11:28 train03-200.tfrec
-rw-rw-r-- 1 ubuntu ubuntu 9683012 Jan  7 11:28 train04-200.tfrec
-rw-rw-r-- 1 ubuntu ubuntu 9732832 Jan  7 11:29 train05-200.tfrec
-rw-rw-r-- 1 ubuntu ubuntu 9720501 Jan  7 11:29 train06-200.tfrec
-rw-rw-r-- 1 ubuntu ubuntu 9801484 Jan  7 11:29 train07-200.tfrec
-rw-rw-r-- 1 ubuntu ubuntu 9925128 Jan  7 11:29 train08-200.tfrec
-rw-rw-r-- 1 ubuntu ubuntu 9765686 Jan  7 11:29 train09-200.tfrec
-rw-rw-r-- 1 ubuntu ubuntu 9743104 Jan  7 11:29 train10-200.tfrec
-r

#### For each file the number of images contained is embedded in the name (e.g. -200)
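
Since the count is part of the file name, the total number of images in a set can be recovered without opening the files, for instance to compute the number of steps per epoch. A minimal sketch, assuming the naming scheme used by `create_tfrec` (prefix, shard number, dash, image count); `count_images` is a hypothetical helper:

```python
import re

# File name pattern written by create_tfrec: <prefix><shard number>-<count>.tfrec
TFREC_NAME = re.compile(r'^(?P<prefix>[a-z]+)(?P<shard>\d{2,})-(?P<count>\d+)\.tfrec$')

def count_images(file_names, prefix):
    """Sum the image counts embedded in the names of the files with this prefix."""
    total = 0
    for name in file_names:
        match = TFREC_NAME.match(name)
        if match and match.group('prefix') == prefix:
            total += int(match.group('count'))
    return total

# Test file names taken from the listing above
test_files = ['test00-200.tfrec', 'test01-200.tfrec',
              'test02-200.tfrec', 'test03-36.tfrec']
print(count_images(test_files, 'test'))  # prints: 636
```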