# Preprocessing the NIH Chest X-ray Dataset

In this notebook, we prepare the NIH Chest X-ray dataset for use in the GAN framework.

In [1]:
import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'
import tensorflow as tf
import pandas as pd
from itertools import chain
import matplotlib.pyplot as plt

The dataset is composed of chest X-ray images, accompanied by a CSV file containing information about each image. We read the CSV file into a Pandas dataframe to inspect it and clean the data if necessary.

In [2]:
csv_path = "../input/data/Data_Entry_2017.csv"
img_paths = [f"../input/data/images_{i:03}/images/" for i in range(1, 13)]
df = pd.read_csv(csv_path)
df.head()

Unnamed: 0,Image Index,Finding Labels,Follow-up #,Patient ID,Patient Age,Patient Gender,View Position,OriginalImage[Width,Height],OriginalImagePixelSpacing[x,y],Unnamed: 11
0,00000001_000.png,Cardiomegaly,0,1,58,M,PA,2682,2749,0.143,0.143,
1,00000001_001.png,Cardiomegaly|Emphysema,1,1,58,M,PA,2894,2729,0.143,0.143,
2,00000001_002.png,Cardiomegaly|Effusion,2,1,58,M,PA,2500,2048,0.168,0.168,
3,00000002_000.png,No Finding,0,2,81,M,PA,2500,2048,0.171,0.171,
4,00000003_000.png,Hernia,0,3,81,F,PA,2582,2991,0.143,0.143,


The `Finding Labels` column contains combinations of pathological labels. If no pathological signs are present in the image, the `No Finding` label is used. If we are going to use these labels as target variables, we need to convert them into numerical values.

In [3]:
labels = list(set(list(chain(*df['Finding Labels'].str.split('|')))))
labels

['Edema',
 'Infiltration',
 'Nodule',
 'Pneumothorax',
 'Consolidation',
 'Effusion',
 'Pneumonia',
 'Mass',
 'Pleural_Thickening',
 'Cardiomegaly',
 'No Finding',
 'Atelectasis',
 'Fibrosis',
 'Hernia',
 'Emphysema']
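
To make the expression in cell 3 easier to follow, here is a minimal sketch on a three-row toy dataframe (the rows are made up for illustration and are not taken from the real CSV). It also wraps the result in `sorted`, which is our assumption rather than what the notebook does: a plain `set` has no guaranteed order, so sorting gives a label order that is reproducible across runs.

```python
from itertools import chain

import pandas as pd

# Toy data for illustration only; not taken from the real CSV.
toy = pd.DataFrame({
    "Finding Labels": ["Cardiomegaly|Effusion", "No Finding", "Effusion"]
})

# Split each entry on '|' to get one list of labels per row.
per_row = toy["Finding Labels"].str.split("|")

# Flatten the per-row lists, deduplicate, and sort for a stable order.
toy_labels = sorted(set(chain.from_iterable(per_row)))
print(toy_labels)  # ['Cardiomegaly', 'Effusion', 'No Finding']
```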

In [4]:
df['Finding Labels'].value_counts().head(20)

No Finding                           60361
Infiltration                          9547
Atelectasis                           4215
Effusion                              3955
Nodule                                2705
Pneumothorax                          2194
Mass                                  2139
Effusion|Infiltration                 1603
Atelectasis|Infiltration              1350
Consolidation                         1310
Atelectasis|Effusion                  1165
Pleural_Thickening                    1126
Cardiomegaly                          1093
Emphysema                              892
Infiltration|Nodule                    829
Atelectasis|Effusion|Infiltration      737
Fibrosis                               727
Edema                                  628
Cardiomegaly|Effusion                  484
Consolidation|Infiltration             441
Name: Finding Labels, dtype: int64
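
The counts above are per label combination, so a single finding such as `Effusion` is spread over many rows. If per-label frequencies are of interest, one possible approach (a sketch on made-up toy data, not part of the original notebook) is to split the combinations and count each label separately with `explode`:

```python
import pandas as pd

# Toy label combinations for illustration only.
toy = pd.Series(
    ["Cardiomegaly|Effusion", "No Finding", "Effusion", "No Finding"],
    name="Finding Labels",
)

# One row per (image, label) pair, then count how often each label occurs.
per_label = toy.str.split("|").explode().value_counts()
print(per_label)
```

On the real dataframe, the same chain would be applied to `df['Finding Labels']`.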

In [5]:
df['Finding Labels'].value_counts(dropna=False).tail(20)

Effusion|Hernia|Pleural_Thickening                                                        1
Effusion|Fibrosis|Mass|Pleural_Thickening                                                 1
Atelectasis|Consolidation|Effusion|Infiltration|Nodule|Pleural_Thickening                 1
Atelectasis|Cardiomegaly|Consolidation|Effusion|Pleural_Thickening                        1
Atelectasis|Cardiomegaly|Consolidation|Effusion|Infiltration|Nodule|Pleural_Thickening    1
Atelectasis|Consolidation|Edema|Effusion|Pneumonia                                        1
Cardiomegaly|Effusion|Fibrosis|Mass                                                       1
Cardiomegaly|Edema|Effusion|Fibrosis|Hernia                                               1
Emphysema|Infiltration|Pneumonia                                                          1
Consolidation|Effusion|Emphysema|Infiltration                                             1
Emphysema|Infiltration|Pleural_Thickening|Pneumonia                             

We focus on four columns:
* `Image Index`, which contains the filenames
* `Finding Labels`, which contains the labels
* `Patient Gender`
* `View Position`, which describes how the image was acquired

In [6]:
assert df['Image Index'].str.endswith('.png').all()
assert df['Patient Gender'].isin(['M', 'F']).all()
assert df['View Position'].isin(['PA', 'AP']).all()

In [7]:
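# Take an explicit copy so that later column assignments do not trigger SettingWithCopyWarning.
df_clean = df[['Image Index', 'Finding Labels', 'Patient Gender', 'View Position']].copy()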
df_clean.head()

Unnamed: 0,Image Index,Finding Labels,Patient Gender,View Position
0,00000001_000.png,Cardiomegaly,M,PA
1,00000001_001.png,Cardiomegaly|Emphysema,M,PA
2,00000001_002.png,Cardiomegaly|Effusion,M,PA
3,00000002_000.png,No Finding,M,PA
4,00000003_000.png,Hernia,F,PA


The images are stored in multiple directories. Therefore, we need to extract the location of each file and add it to the dataframe.

In [8]:
# The CSV rows are expected to follow the order of the image directories,
# so each directory corresponds to a contiguous slice of the dataframe.
filenames_list = list(df_clean['Image Index'])
num_files_list = []
offset = 0
num_files = 0
for path in img_paths:
    filenames = sorted(os.listdir(path))
    num_files += len(filenames)
    # Check that this slice of the CSV matches the directory contents exactly.
    filenames_subset = sorted(filenames_list[offset:num_files])
    assert filenames_subset == filenames
    num_files_list.append(len(filenames))
    offset = num_files
assert len(df_clean) == num_files

# Repeat each directory path once per file it holds, then append the filename.
filename_col = []
for path, num in zip(img_paths, num_files_list):
    filename_col.extend([path] * num)
df_clean.insert(0, 'Filename', filename_col)
df_clean['Filename'] = df_clean['Filename'] + df_clean['Image Index']
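
The cell above relies on the CSV rows being in the same order as the directories. An alternative that does not depend on row order is to build a filename-to-path lookup and map it onto the dataframe. The sketch below is only an illustration under that assumption: `build_path_lookup` is a hypothetical helper, and the temporary directories stand in for the real `images_001` ... `images_012` folders.

```python
import os
import tempfile

import pandas as pd


def build_path_lookup(directories):
    """Map each filename to its full path by scanning the given directories."""
    lookup = {}
    for directory in directories:
        for name in os.listdir(directory):
            lookup[name] = os.path.join(directory, name)
    return lookup


# Self-contained demo: two temporary directories with one empty file each.
with tempfile.TemporaryDirectory() as root:
    dirs = [os.path.join(root, d) for d in ("a", "b")]
    for d, name in zip(dirs, ("x_000.png", "y_000.png")):
        os.makedirs(d)
        open(os.path.join(d, name), "w").close()

    # Rows deliberately listed in a different order than the directories.
    toy = pd.DataFrame({"Image Index": ["y_000.png", "x_000.png"]})
    toy["Filename"] = toy["Image Index"].map(build_path_lookup(dirs))

    # Unmatched filenames would show up as NaN, so check that none are missing.
    assert toy["Filename"].notna().all()
    resolved = toy["Filename"].tolist()
```

The trade-off is that this version silently accepts extra files on disk, whereas the notebook's assertions also verify that the directories contain nothing beyond what the CSV lists.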


In [9]:
df_clean.head()

Unnamed: 0,Filename,Image Index,Finding Labels,Patient Gender,View Position
0,../input/data/images_001/images/00000001_000.png,00000001_000.png,Cardiomegaly,M,PA
1,../input/data/images_001/images/00000001_001.png,00000001_001.png,Cardiomegaly|Emphysema,M,PA
2,../input/data/images_001/images/00000001_002.png,00000001_002.png,Cardiomegaly|Effusion,M,PA
3,../input/data/images_001/images/00000002_000.png,00000002_000.png,No Finding,M,PA
4,../input/data/images_001/images/00000003_000.png,00000003_000.png,Hernia,F,PA


In [10]:
# Change the file extension (the resized copies are written as .jpg) and rename the columns.
df_clean['Image Index'] = df_clean['Image Index'].str.replace('.png', '.jpg', regex=False)
df_clean = df_clean.rename({
    'Finding Labels': 'Label',
    'Patient Gender': 'Male',
    'View Position':  'AP'
}, axis = 1)


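We encode the remaining information about the images as binary columns: `Male` is 1 for male patients, `AP` is 1 for anteroposterior views (0 for posteroanterior), and each finding gets its own 0/1 column.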

In [11]:
df_clean['Male'] = df_clean['Male'].map({
    'M': 1,
    'F': 0
})
df_clean['AP'] = df_clean['AP'].map({
    'AP': 1,
    'PA': 0
})
for col in labels:
    df_clean[col] = df_clean['Label'].str.contains(col).astype(int)

In [12]:
df_clean.head()

Unnamed: 0,Filename,Image Index,Label,Male,AP,Edema,Infiltration,Nodule,Pneumothorax,Consolidation,Effusion,Pneumonia,Mass,Pleural_Thickening,Cardiomegaly,No Finding,Atelectasis,Fibrosis,Hernia,Emphysema
0,../input/data/images_001/images/00000001_000.png,00000001_000.jpg,Cardiomegaly,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0
1,../input/data/images_001/images/00000001_001.png,00000001_001.jpg,Cardiomegaly|Emphysema,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1
2,../input/data/images_001/images/00000001_002.png,00000001_002.jpg,Cardiomegaly|Effusion,1,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0
3,../input/data/images_001/images/00000002_000.png,00000002_000.jpg,No Finding,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0
4,../input/data/images_001/images/00000003_000.png,00000003_000.jpg,Hernia,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0
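
As a point of comparison, pandas also offers `Series.str.get_dummies`, which splits on a separator and produces one 0/1 column per distinct token in a single call. The sketch below uses made-up toy rows and is an alternative to the loop in cell 11, not what the notebook does. One difference worth noting: `get_dummies` matches whole tokens, while `str.contains` matches substrings, which only gives the same result as long as no label name is contained in another.

```python
import pandas as pd

# Toy labels for illustration only.
toy = pd.DataFrame({
    "Label": ["Cardiomegaly|Effusion", "No Finding", "Hernia"]
})

# One indicator column per distinct label, split on '|'.
one_hot = toy["Label"].str.get_dummies(sep="|")
print(one_hot)
```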


Next, we define how each image is loaded from disk. The function is mapped over a `tf.data.Dataset`, which takes care of loading the files in parallel. We resize the images to 64×64 pixels to make training more manageable.

In [13]:
def load_img(filename):
    img = tf.io.read_file(filename)
    img = tf.io.decode_png(img, channels=1)
    img = tf.image.resize(img, size=(64, 64), method='bilinear', antialias=True)
    return img

Now, we save the resized images to disk.

In [14]:
x_train = tf.data.Dataset.from_tensor_slices(list(df_clean['Filename'])).map(load_img, num_parallel_calls=tf.data.AUTOTUNE)
filenames = tf.data.Dataset.from_tensor_slices(list(df_clean['Image Index']))
dataset = tf.data.Dataset.zip((x_train, filenames)).batch(256)

In [15]:
# Create the output directories; exist_ok avoids an error when the cell is re-run.
os.makedirs('out/images', exist_ok=True)

In [16]:
for images, filenames in dataset:
    for i in range(len(images)):
        plt.imsave(f"out/images/{filenames[i].numpy().decode()}", images[i, :, :, 0], cmap='gray')
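
One detail to be aware of: to our understanding, when `plt.imsave` receives a 2-D array without `vmin`/`vmax`, it scales that array by its own minimum and maximum, so each saved image has its contrast stretched individually. The NumPy sketch below illustrates the difference between per-image scaling and a fixed 0 to 255 range; the helper names are hypothetical and the two-pixel "image" is made up.

```python
import numpy as np


def to_uint8_fixed(img, lo=0.0, hi=255.0):
    """Scale with a fixed range shared by all images, then convert to uint8."""
    scaled = (np.clip(img, lo, hi) - lo) / (hi - lo)
    return np.round(scaled * 255).astype(np.uint8)


def to_uint8_minmax(img):
    """Scale each image by its own min and max (contrast is stretched per image)."""
    lo, hi = float(img.min()), float(img.max())
    if hi == lo:  # constant image: avoid division by zero
        return np.zeros_like(img, dtype=np.uint8)
    return np.round((img - lo) / (hi - lo) * 255).astype(np.uint8)


dim = np.array([[50.0, 100.0]])   # a low-contrast toy "image"
fixed = to_uint8_fixed(dim)       # keeps the original intensities
stretched = to_uint8_minmax(dim)  # stretches them to the full range
```

If fixed-range behaviour is preferred in the loop above, passing `vmin=0, vmax=255` to `plt.imsave` should achieve it; which behaviour is better depends on whether absolute intensities matter for the downstream GAN.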

Finally, we save the clean dataframe as a CSV file.

In [17]:
df_clean.drop(columns=['Label', 'Filename']).to_csv('out/info.csv', index=False)
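
As a final sanity check, it can be useful to confirm that every row in the saved CSV has a matching image on disk. The following is a minimal sketch under that assumption; `missing_images` is a hypothetical helper, and the demo uses a temporary directory rather than the real `out/` folder.

```python
import os
import tempfile

import pandas as pd


def missing_images(info_csv, image_dir):
    """Return the filenames listed in the CSV that are absent from image_dir."""
    info = pd.read_csv(info_csv)
    on_disk = set(os.listdir(image_dir))
    return [name for name in info["Image Index"] if name not in on_disk]


# Self-contained demo: one image present, one deliberately missing.
with tempfile.TemporaryDirectory() as root:
    image_dir = os.path.join(root, "images")
    os.makedirs(image_dir)
    open(os.path.join(image_dir, "a.jpg"), "w").close()

    csv_file = os.path.join(root, "info.csv")
    pd.DataFrame({"Image Index": ["a.jpg", "b.jpg"]}).to_csv(csv_file, index=False)

    missing = missing_images(csv_file, image_dir)
    print(missing)  # ['b.jpg']
```

For the notebook's output, the call would be `missing_images('out/info.csv', 'out/images')`, and an empty list would indicate that the CSV and the image folder agree.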