# CheXpert Dataset - Download and Data Preparation

This notebook shows how to download and preprocess the CheXpert dataset for weakly supervised learning experiments.

The original CheXpert paper, "CheXpert: A large chest radiograph dataset with uncertainty labels and expert comparison" by Irvin et al. (2019), can be found here: https://arxiv.org/pdf/1901.07031.pdf.

# Data description

The CheXpert training set is composed of chest radiographs, which were annotated on the basis of reports using the rule-based CheXpert labeler. Each image is labeled with respect to 12 pathologies as well as the observations "No Finding" and "Support Devices". For each of these categories, except "No Finding", the assigned weak label is one of the following: (Irvin et al. (2019))

- positive (1.0)
- negative (0.0)
- not mentioned (blank)
- uncertain (-1.0) 

The development set was annotated by radiologists and therefore only contains the binary labels: (Irvin et al. (2019))

- positive (1.0) 
- negative (0.0)

# Getting access to the data

You can register to obtain the data at the following link: https://stanfordmlgroup.github.io/competitions/chexpert/. Once registration is complete, you should receive an email containing links to two versions of the dataset: the original CheXpert dataset (around 439 GB) and a version with downsampled resolution (around 11 GB). The code below uses the downsampled version. Please unzip the downloaded folder into a directory of your choice without changing the filenames or the folder structure; otherwise, you may need to adjust some of the paths used in the code below. The zip file should contain a training set and a validation set. The CheXpert test set is not publicly available, as it is used for the CheXpert competition (see the link above). The reports that were used to label the images are also unavailable.

Please note that some output of the code below is not displayed due to the "Stanford University School of Medicine CheXpert Dataset Research Use Agreement" which can be found here: https://stanfordmlgroup.github.io/competitions/chexpert/.
Please run the code yourself with the data you downloaded from the above link.

# Imports

First, let's import the required packages.

In [1]:
import os
from typing import List

import joblib
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import time
import torch
import torchvision.transforms as transforms
from PIL import Image
from tabulate import tabulate
from torch.utils.data import Dataset, DataLoader
from torchvision.utils import save_image
from tqdm import tqdm

# Define storing locations for the preprocessed data

If you wish to save the preprocessed data on your computer, please replace "storing_location_path" in the code below with the path to the location in which you want to store the data.

At the end of the tutorial, you will be presented with two options for storing the preprocessed data:

- storing each preprocessed image tensor and its corresponding labels in a separate .npz file (~ 35 GB in total)
- storing each preprocessed image as a .jpg file and saving all of the labels in a joblib file (~ 2 GB in total)

Please note that in both approaches, the training and validation sets are stored separately. 
If you wish to store the data, please create two folders, named "train_images" and "valid_images", in your specified location.

In [None]:
storing_location = "storing_location_path"

# joblib files in which labels are stored if second option is chosen
joblib_labels_train = os.path.join(storing_location, 'chexpert_data_train_labels.joblib')
joblib_labels_valid = os.path.join(storing_location, 'chexpert_data_valid_labels.joblib')
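The two folders mentioned above can also be created from Python instead of by hand. The following is a minimal sketch; the helper name `make_output_dirs` is hypothetical and not part of the tutorial code, and the demonstration uses a throwaway temporary directory rather than your real storing location.

```python
import os
import tempfile


def make_output_dirs(base_dir: str) -> list:
    """Create the "train_images" and "valid_images" folders and return their paths."""
    created = []
    for name in ("train_images", "valid_images"):
        folder = os.path.join(base_dir, name)
        os.makedirs(folder, exist_ok=True)  # no error if the folder already exists
        created.append(folder)
    return created


# demonstration in a temporary directory (assumption: any writable folder works the same way)
demo_base = tempfile.mkdtemp()
demo_dirs = make_output_dirs(demo_base)
print(demo_dirs)
```

In the tutorial itself, you would call the helper with `storing_location` once that variable points to a real folder.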

# Load the dataset

Now, let's load the files train.csv and valid.csv, which accompany the images. First, please change the working directory to the appropriate location by inserting the path to the folder in which you stored train.csv and valid.csv where "data_path" is mentioned in the code. If you didn't change the folder structure, this path should end with "\CheXpert-v1.0-small\CheXpert-v1.0-small".



In [2]:
path = "data_path"
os.chdir(path) # change working directory to appropriate location
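Because the later cells read train.csv and valid.csv relative to the working directory, it can help to confirm that both files are present in the chosen folder. The sketch below is one possible check; the function name `missing_chexpert_files` is an assumption made for illustration, and the demonstration runs on an empty temporary folder instead of the real data.

```python
import os
import tempfile


def missing_chexpert_files(data_dir: str) -> list:
    """Return the names of the expected CSV files that are not found in data_dir."""
    expected = ("train.csv", "valid.csv")
    return [name for name in expected
            if not os.path.isfile(os.path.join(data_dir, name))]


# demonstration: a temporary folder that only contains an empty train.csv
demo_dir = tempfile.mkdtemp()
open(os.path.join(demo_dir, "train.csv"), "w").close()

missing = missing_chexpert_files(demo_dir)
print("Missing files:", missing)
```

An empty list means that both files were found, so the working directory is most likely set correctly.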

## Get train data

First, let's load the training data.

In [3]:
training_set = pd.read_csv('train.csv')

Let's take a look at the first five rows of the raw training data. 

Please note that the small dataframe displayed below is entirely made up of fake entries. If you want to see the true data, please run training_set.head(5) yourself.

In [None]:
training_set.head(5)

In [5]:
# dataframe with fake entries for demonstration purposes

fake_entries = list(zip(["CheXpert-v1.0-small/train/patient99999/study1/...", "CheXpert-v1.0-small/train/patient99999/study2/..."],
                        ["Female", "Female"], [44, 48], ["Frontal", "Frontal"], ["AP", "AP"], ["NaN", 1.0],
                        ["NaN", "NaN"], ["NaN", "NaN"], ["NaN", "NaN"], ["NaN", -1.0], [-1.0, "NaN"], ["NaN", "NaN"],
                        ["NaN", "NaN"], ["NaN", "NaN"], ["NaN", "NaN"], ["NaN", "NaN"], ["NaN", "NaN"], 
                        [1.0, -1.0], ["NaN", "NaN"]))

fake_training_set = pd.DataFrame(data = fake_entries, index = None, columns = training_set.columns, dtype = None, copy = None)

fake_training_set.head()

Unnamed: 0,Path,Sex,Age,Frontal/Lateral,AP/PA,No Finding,Enlarged Cardiomediastinum,Cardiomegaly,Lung Opacity,Lung Lesion,Edema,Consolidation,Pneumonia,Atelectasis,Pneumothorax,Pleural Effusion,Pleural Other,Fracture,Support Devices
0,CheXpert-v1.0-small/train/patient99999/study1/...,Female,44,Frontal,AP,,,,,,-1.0,,,,,,,1.0,
1,CheXpert-v1.0-small/train/patient99999/study2/...,Female,48,Frontal,AP,1.0,,,,-1.0,,,,,,,,-1.0,


In [6]:
print("Number of observations in training set:", training_set.shape[0])

Number of observations in training set: 223414


As you can see, each observation in the training set consists of a path to an image, some additional information about the patient and the nature of the image, as well as the weak labels for all 14 classes. 12 of the classes, "Enlarged Cardiomediastinum" to "Fracture", are considered pathologies. "No Finding" is assigned the label 1 (meaning "positive") if no pathology was marked as positive (1.0) or uncertain (-1.0) for this observation. Labels which were blank (meaning that the pathology was not mentioned in the report) turned into NaNs when loading the data.

The training set has a total of 223414 observations. (Irvin et al. (2019))
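To illustrate how blank labels become NaNs on loading, here is a small sketch with a made-up CSV held in memory. The paths and values are invented for demonstration and do not come from the dataset.

```python
import io

import pandas as pd

# two made-up rows: an empty field between commas stands for "not mentioned"
csv_text = (
    "Path,Edema,Fracture\n"
    "hypothetical/image_a.jpg,,1.0\n"
    "hypothetical/image_b.jpg,-1.0,\n"
)

toy = pd.read_csv(io.StringIO(csv_text))

print(toy)
print("Edema is NaN:", toy["Edema"].isna().tolist())
print("Fracture is NaN:", toy["Fracture"].isna().tolist())
```

The empty fields are read as NaN, while the explicit values -1.0 and 1.0 are kept as floats.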

## Get validation data

Now, let's do the same for the validation set.

In [7]:
validation_set = pd.read_csv("valid.csv")

In [None]:
validation_set.head(5)

In [9]:
# dataframe with fake entries for demonstration purposes

fake_entries = list(zip(["CheXpert-v1.0-small/train/patient00000/study1/...", "CheXpert-v1.0-small/train/patient00001/study2/..."],
                        ["Male", "Female"], [65, 82], ["Frontal", "Lateral"], ["AP", "NaN"], [0.0, 0.0],
                        [0.0, 0.0], [0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 0.0], [0.0, 0.0],
                        [0.0, 0.0], [0.0, 0.0], [0.0, 0.0], [0.0, 0.0], [0.0, 0.0], [0.0, 1.0]))

fake_validation_set = pd.DataFrame(data = fake_entries, index = None, columns = training_set.columns, dtype = None, copy = None)

fake_validation_set.head()

Unnamed: 0,Path,Sex,Age,Frontal/Lateral,AP/PA,No Finding,Enlarged Cardiomediastinum,Cardiomegaly,Lung Opacity,Lung Lesion,Edema,Consolidation,Pneumonia,Atelectasis,Pneumothorax,Pleural Effusion,Pleural Other,Fracture,Support Devices
0,CheXpert-v1.0-small/train/patient00000/study1/...,Male,65,Frontal,AP,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,CheXpert-v1.0-small/train/patient00001/study2/...,Female,82,Lateral,,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0


In [10]:
print("Number of observations in validation set:", validation_set.shape[0])

Number of observations in validation set: 234


As you can see, the validation set has the same structure as the training set. However, the validation set only uses positive (1.0) and negative (0.0) labels and no uncertainty (-1.0) or not-mentioned labels (blank).

The validation set contains 234 observations. (Irvin et al. (2019))

## Collect statistics

In the following, some statistics are computed in order to get an idea about the label distribution in the two datasets.

In [11]:
training_labels = training_set.iloc[:, -13:-1]
labels_per_row = training_labels.count(axis = 1) # number of non-NaN labels per row in the training set

vals = labels_per_row.value_counts()

# make a table; .items() yields (number of labels, number of datapoints) pairs on all pandas versions
val_list = list(vals.items())
    
print(tabulate(val_list, headers = ["Number of non-NaN labels", "Number of datapoints"]))

  Number of non-NaN labels    Number of datapoints
--------------------------  ----------------------
                         3                   64386
                         4                   50893
                         2                   50672
                         5                   23828
                         1                   23185
                         6                    7922
                         7                    1960
                         8                     294
                         0                     250
                         9                      19
                        10                       5


As you can see, most training samples have a non-blank label, i.e. positive (1.0), negative (0.0) or uncertain (-1.0), for at least a few pathologies. Nevertheless, there are 250 observations for which none of the 12 pathologies was mentioned in the corresponding report.
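The counts above rely on `DataFrame.count(axis = 1)`, which counts the non-NaN entries of each row. A toy sketch with three invented rows shows the behavior; the values are made up and only three pathology columns are used.

```python
import numpy as np
import pandas as pd

# made-up labels for three observations
toy_labels = pd.DataFrame({
    "Edema": [1.0, np.nan, np.nan],
    "Pneumonia": [-1.0, 0.0, np.nan],
    "Fracture": [np.nan, np.nan, np.nan],
})

# number of non-NaN labels per row; uncertain (-1.0) counts as a label, blank does not
labels_per_row_toy = toy_labels.count(axis=1)
print(labels_per_row_toy.tolist())
```

The last row corresponds to the case discussed above, in which no pathology was mentioned at all.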

The following two tables give an idea about the label distribution for the different pathologies in the training set and in the validation set.

In [12]:
val_list = []
for cond in training_labels.columns:
    vals = np.array(pd.Categorical(training_labels[cond], categories = [-1.0, 0.0, 1.0]).value_counts().sort_index(ascending = True))
    val_list.append([cond, vals[0], vals[1], vals[2]])

print("Label distribution in the training set:", "\n")
print(tabulate(val_list, headers = ["Pathology", "-1.0", "0.0", "1.0"]))

Label distribution in the training set: 

Pathology                     -1.0    0.0     1.0
--------------------------  ------  -----  ------
Enlarged Cardiomediastinum   12403  21638   10798
Cardiomegaly                  8087  11116   27000
Lung Opacity                  5598   6599  105581
Lung Lesion                   1488   1270    9186
Edema                        12984  20726   52246
Consolidation                27742  28097   14783
Pneumonia                    18770   2799    6039
Atelectasis                  33739   1328   33376
Pneumothorax                  3145  56341   19448
Pleural Effusion             11628  35396   86187
Pleural Other                 2653    316    3523
Fracture                       642   2512    9040


In [13]:
validation_labels = validation_set.iloc[:, -13:-1]

val_list = []
for cond in validation_labels.columns:
    vals = np.array(pd.Categorical(validation_labels[cond],categories = [0.0, 1.0]).value_counts().sort_index(ascending=True))
    val_list.append([cond, vals[0], vals[1]])
        
print("Label distribution in the validation set:", "\n")
print(tabulate(val_list, headers = ["Pathology", "0.0", "1.0"]))

Label distribution in the validation set: 

Pathology                     0.0    1.0
--------------------------  -----  -----
Enlarged Cardiomediastinum    125    109
Cardiomegaly                  166     68
Lung Opacity                  108    126
Lung Lesion                   233      1
Edema                         189     45
Consolidation                 201     33
Pneumonia                     226      8
Atelectasis                   154     80
Pneumothorax                  226      8
Pleural Effusion              167     67
Pleural Other                 233      1
Fracture                      234      0


As can be seen, some pathologies such as "Fracture" or "Lung Lesion", which are labeled positive (1.0) for numerous observations in the training set, barely appear or do not appear at all in the validation set.
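The two tables above wrap each column in `pd.Categorical` with a fixed list of categories before counting. A brief sketch with invented values suggests why: a plain `value_counts` omits label values that never occur, whereas the categorical version reports them with a count of zero, so every table row has the same number of entries.

```python
import numpy as np
import pandas as pd

# made-up column: no negative (0.0) label occurs, and one entry is blank
toy_column = pd.Series([1.0, -1.0, np.nan, 1.0])

# plain counting only lists the values that are present
plain_counts = toy_column.value_counts()

# fixed categories keep an entry for 0.0 even though it never occurs
fixed_counts = pd.Categorical(toy_column, categories=[-1.0, 0.0, 1.0]).value_counts().sort_index()
counts = np.array(fixed_counts)

print(len(plain_counts), "values found by plain counting")
print("counts for -1.0, 0.0 and 1.0:", counts)
```

This is also why "Fracture" shows up with a 0 in the validation table rather than causing an index error.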

## Image preprocessing

Now that we have familiarized ourselves with the labels in the training and in the validation set, let's take a look at an example image from the training set. In order to do that, we first need to create paths that lead to the images.

In [14]:
# paths to training images
image_paths_train = [os.path.join(path[: path.find("CheXpert-v1.0-small")], "CheXpert-v1.0-small", p) for p in training_set["Path"]]

# paths to validation images
image_paths_valid = [os.path.join(path[: path.find("CheXpert-v1.0-small")], "CheXpert-v1.0-small", p) for p in validation_set["Path"]]
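The path construction above cuts `path` at the first occurrence of "CheXpert-v1.0-small" and then appends the folder name and the entry from the "Path" column, which itself starts with "CheXpert-v1.0-small". The following sketch isolates that step. The helper `build_image_path`, the folder "/data" and the file name are hypothetical. Unlike the cell above, the sketch raises an error when the folder name is missing, because `str.find` returns -1 in that case and the slice would silently drop the last character of the path.

```python
import os

ROOT_NAME = "CheXpert-v1.0-small"


def build_image_path(data_path: str, csv_entry: str) -> str:
    """Combine the data folder with one entry from the "Path" column."""
    cut = data_path.find(ROOT_NAME)  # position of the first occurrence, -1 if absent
    if cut == -1:
        raise ValueError("data_path does not contain " + ROOT_NAME)
    return os.path.join(data_path[:cut], ROOT_NAME, csv_entry)


# made-up example values
demo_data_path = "/data/CheXpert-v1.0-small/CheXpert-v1.0-small"
demo_entry = "CheXpert-v1.0-small/train/patient99999/study1/view.jpg"

demo_result = build_image_path(demo_data_path, demo_entry)
print(demo_result)
```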

In [None]:
sample_image = Image.open(image_paths_train[0]).convert('RGB')
plt.imshow(sample_image)
plt.show()
print("Dimensions of image:", sample_image.size)

The images in the (downsampled) dataset measure around 390 x 320 pixels; however, the individual image sizes vary. (Garbin et al. (2021))

Now we can define a transform sequence which will be used to preprocess the images. The images are:

- firstly, resized to 224 x 224 pixels, since this is a requirement for many pre-trained CNNs in PyTorch;
- secondly, normalized with the mean and standard deviation from ImageNet.

Such transformations have previously been applied by several other submissions regarding CheXpert on GitHub such as:
[this](https://github.com/gaetandi/cheXpert/blob/master/cheXpert_final.ipynb) and [this](https://github.com/Stomper10/CheXpert/blob/master/CheXpert_DenseNet121_FL.ipynb).

In [16]:
transform_list = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]) # normalization from ImageNet
    ])
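As a rough illustration of the normalization step: `transforms.ToTensor()` scales pixel values to the range [0, 1], and `transforms.Normalize` then computes (value - mean) / std for each channel. The NumPy sketch below reproduces only this arithmetic for the darkest and brightest possible pixel; the function name `normalize_pixel` is made up for the example.

```python
import numpy as np

# ImageNet statistics, as used in the transform sequence
mean = np.array([0.485, 0.456, 0.406])
std = np.array([0.229, 0.224, 0.225])


def normalize_pixel(value: float, mean: np.ndarray, std: np.ndarray) -> np.ndarray:
    """Per-channel normalization of a single pixel value in [0, 1]."""
    return (value - mean) / std


lowest = normalize_pixel(0.0, mean, std)   # a completely black pixel
highest = normalize_pixel(1.0, mean, std)  # a completely white pixel

print("normalized black pixel:", lowest.round(3))
print("normalized white pixel:", highest.round(3))
```

The normalized values are no longer confined to [0, 1]; dark pixels in particular become negative.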

The next step is image transformation and preparation of the dataset for PyTorch's DataLoader, which is done with the `CheXpertDatasetProcessor` class.

The structure of the class was inspired by the GitHub submissions credited above.

### Label preparation

Apart from image preprocessing, an important feature of the `CheXpertDatasetProcessor` class is the handling of the uncertainty labels. The arguments `to_ones`, `to_zeros` and `to_ignore` each take a list consisting of pathologies as their input and transform the uncertainty labels for these pathologies accordingly. In particular:

- `to_ones`: pathologies for which uncertainty labels (-1.0) should be turned to positive (1.0)
- `to_zeros`: pathologies for which uncertainty labels (-1.0) should be turned to negative (0.0)
- `to_ignore`: pathologies for which uncertainty labels (-1.0) should be turned to nan

Not changing the uncertainty labels to either 1.0, 0.0 or nan results in training the model with 3 possible classes (positive, negative and uncertain). These 4 methods of handling the uncertainty labels were also described in the original CheXpert paper by Irvin et al. (2019).

If you have already saved the transformed images but wish to experiment with different settings of `to_ones`, `to_zeros` or `to_ignore`, you can set the `return_image` parameter of the class to False, so that only the labels are returned.

The "Support Devices" and "No Finding" labels are dropped for all instances, since support devices do not count as pathologies and the "No Finding" label depends only on the labels of the remaining 12 pathologies.

The blank labels, which indicate that a pathology was not mentioned in the corresponding report, are treated as uncertain (-1.0).
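The label handling described in this section can be tried out on a toy dataframe. The sketch below is a simplified stand-alone version, not the class used in this tutorial: the function `apply_uncertainty_policy` and the example values are assumptions made for illustration, and only three pathology columns are used.

```python
import numpy as np
import pandas as pd


def apply_uncertainty_policy(labels, to_ones=(), to_zeros=(), to_ignore=()):
    """Return a copy of labels with blank and uncertain entries remapped."""
    out = labels.fillna(-1.0)  # blank -> uncertain
    for columns, new_value in ((to_ones, 1.0), (to_zeros, 0.0), (to_ignore, np.nan)):
        for column in columns:
            # only entries equal to -1.0 are replaced, all others are kept
            out[column] = out[column].mask(out[column] == -1.0, new_value)
    return out


# made-up labels: 1.0 positive, 0.0 negative, -1.0 uncertain, NaN blank
toy_labels = pd.DataFrame({
    "Edema": [1.0, -1.0, np.nan],
    "Pneumonia": [-1.0, 0.0, np.nan],
    "Fracture": [np.nan, -1.0, 1.0],
})

result = apply_uncertainty_policy(toy_labels,
                                  to_ones=["Edema"],
                                  to_zeros=["Pneumonia"],
                                  to_ignore=["Fracture"])
print(result)
```

Because blank labels are first mapped to -1.0, they follow the same policy as the uncertain labels of their column.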

In [17]:
class CheXpertDatasetProcessor():
    
    def __init__(self, 
                 path: str,
                 subset: str, 
                 image_paths: List[str], 
                 number_of_images: int,  
                 transform_sequence: List = None,
                 to_ones: List[str] = None,
                 to_zeros: List[str] = None, 
                 to_ignore: List[str] = None,
                 return_image: bool = True):
        
        """
        Args:
            path: path to the folder where train.csv and valid.csv are stored
            subset: either "train" to load the train.csv or "valid" to load valid.csv
            image_paths: paths to the images
            number_of_images: number of images in the dataset
            transform_sequence: sequence used to transform the images
            to_ones: list of pathologies for which uncertainty labels should be replaced by 1
            to_zeros: list of pathologies for which uncertainty labels should be replaced by 0
            to_ignore: list of pathologies for which uncertainty labels should be ignored (label will be turned to nan)
            return_image: True: image tensor and labels are returned, False: only labels are returned
        Returns: 
            224 x 224 image tensor and a corresponding tensor containing 12 labels
        """
        
        self.path = path
        self.subset = subset
        self.image_paths = image_paths
        self.number_of_images = number_of_images
        self.transform_sequence = transform_sequence
        self.to_ones = to_ones
        self.to_zeros = to_zeros
        self.to_ignore = to_ignore
        self.return_image = return_image
        
    def process_chexpert_dataset(self):
        
        # read dataset
        if self.subset == "train":
            data = pd.read_csv("train.csv")
            
        elif self.subset == "valid":
            data = pd.read_csv("valid.csv")
            
        else:
            raise ValueError("Invalid subset, please choose either 'train' or 'valid'")
            
        pathologies = data.iloc[:, -13:-1].columns
        
        # prepare labels
        data.iloc[:, -13:-1] = data.iloc[:, -13:-1].replace(float("nan"), -1) # blank labels -> uncertain
        
        if self.to_ones is not None:
            if all(p in pathologies for p in self.to_ones): # check whether arguments are valid pathologies
                data[self.to_ones] = data[self.to_ones].replace(-1, 1) # replace uncertainty labels with ones
            else:
                raise ValueError("List supplied to to_ones contains invalid pathology, please choose from:",
                                 list(pathologies))
            
        if self.to_zeros is not None:
            if all(p in pathologies for p in self.to_zeros):
                data[self.to_zeros] = data[self.to_zeros].replace(-1, 0) # replace uncertainty labels with zeros
            else:
                raise ValueError("List supplied to to_zeros contains invalid pathology, please choose from:",
                                 list(pathologies))
            
        if self.to_ignore is not None:
            if all(p in pathologies for p in self.to_ignore):
                data[self.to_ignore] = data[self.to_ignore].replace(-1, float("nan")) # replace uncertainty labels with nan
            else:
                raise ValueError("List supplied to to_ignore contains invalid pathology, please choose from:",
                                 list(pathologies))
        
        self.data = data
    
    def __getitem__(self, index: int):
        
        """
        index: index of example that should be retrieved
        """
        
        if self.return_image not in [True, False]:
            raise ValueError("Please set return_image argument either to True or False")
        
        image_labels = self.data.iloc[index, -13:-1]
        
        if self.return_image is False: # only labels are returned, not the images
            return torch.tensor(image_labels)
        
        else:
            image_name = self.image_paths[index]
        
            patient_image = Image.open(image_name).convert('RGB')
        
            if self.transform_sequence is not None:
                patient_image = self.transform_sequence(patient_image) # apply the transform_sequence if one is specified
        
            else:
                # even if no other transformation is applied, the image should be turned into a tensor
                to_tensor = transforms.ToTensor()
                patient_image = to_tensor(patient_image)
            
            return patient_image, torch.tensor(image_labels)
    
    def __len__(self):
        return self.number_of_images

In [18]:
# prepare training data
chexpert_train = CheXpertDatasetProcessor(path = path, subset = "train", image_paths = image_paths_train,
                                          number_of_images = training_set.shape[0], transform_sequence = transform_list)
chexpert_train.process_chexpert_dataset()

# prepare validation data
chexpert_valid = CheXpertDatasetProcessor(path = path, subset = "valid", image_paths = image_paths_valid,
                                          number_of_images = validation_set.shape[0], transform_sequence = transform_list)
chexpert_valid.process_chexpert_dataset()

For a given index, the `__getitem__` function returns the tensor representing the preprocessed image as well as a tensor containing all of the labels for this observation.

Let's take a look at what `__getitem__` returns for the training example with index 0.

In [None]:
chexpert_train.__getitem__(0)

In [20]:
shape_of_image_tensor = chexpert_train.__getitem__(0)[0].shape
shape_of_label_tensor = chexpert_train.__getitem__(0)[1].shape

print("Shape of image tensor:", shape_of_image_tensor)
print("Shape of label tensor:", shape_of_label_tensor)

Shape of image tensor: torch.Size([3, 224, 224])
Shape of label tensor: torch.Size([12])
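Calling `__getitem__` and `__len__` directly makes explicit which method runs, but any object that defines these two methods also supports square-bracket indexing and `len()`. The sketch below uses a made-up `ToyDataset` class to show the equivalence; it is not part of the tutorial code.

```python
class ToyDataset:
    """Minimal object with the same two methods as the class above."""

    def __init__(self, items):
        self.items = list(items)

    def __getitem__(self, index: int):
        return self.items[index]

    def __len__(self):
        return len(self.items)


toy = ToyDataset(["a", "b", "c"])

# both spellings call the same method
print(toy[0], toy.__getitem__(0))
print(len(toy), toy.__len__())
```

Under this assumption, `chexpert_train[0]` should return the same result as `chexpert_train.__getitem__(0)`.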


Let's take a look at the same example, this time using the `to_ones`, `to_zeros` and `to_ignore` arguments to see exactly what they do. 

In [None]:
# example using to_ones, to_zeros and to_ignore
all_pathologies = training_set.iloc[:, -13:-1].columns
chexpert_train_alt = CheXpertDatasetProcessor(path = path, subset = "train", image_paths = image_paths_train,
                                          number_of_images = training_set.shape[0], transform_sequence = transform_list,
                                          to_ones = all_pathologies[0:2],
                                          to_zeros = all_pathologies[2:4],
                                          to_ignore = all_pathologies[4:6],
                                          return_image = False)
chexpert_train_alt.process_chexpert_dataset()
chexpert_train_alt.__getitem__(0)

As you can see, the uncertainty labels for the first 6 pathologies changed depending on whether the name of the pathology was supplied to `to_ones`, `to_zeros` or `to_ignore`.

## Store the preprocessed data

Now, we want to apply the preprocessing to our entire training and validation set and save the results for further use.

Remember, the two different options of storing the data provided in this tutorial are:

- storing each preprocessed image tensor and its corresponding labels in a separate .npz file (~ 35 GB in total)
- storing each resized image as a .jpg file and saving all of the labels in a joblib file (~ 2 GB in total)

### Store data as .npz files

If you prefer storing the images and labels in .npz files, please run the following code.

In [None]:
# store the training set
for i in tqdm(range(0, training_set.shape[0])):
    x = chexpert_train.__getitem__(i)
    np.savez_compressed(os.path.join(storing_location, "train_images", "image_" + str(i) + ".npz"), 
                        image = x[0], label = x[1])
    
# store the validation set
for i in tqdm(range(0, validation_set.shape[0])):
    x = chexpert_valid.__getitem__(i)
    np.savez_compressed(os.path.join(storing_location, "valid_images", "image_" + str(i) + ".npz"), 
                        image = x[0], label = x[1])

Please note that `np.load` returns NumPy arrays and not tensors, so the "image" entry has to be transposed to height x width x channels, the layout that `transforms.ToTensor()` expects, and turned into a tensor again. The labels also need to be converted back to a tensor. Here is a quick example of how to do this:

In [None]:
to_tensor = transforms.ToTensor()

example = np.load(os.path.join(storing_location, "train_images", "image_" + str(0) + ".npz"))
print(to_tensor(example["image"].transpose(1,2,0)))
print(torch.tensor(example["label"]))

In [None]:
# compare with original output from __getitem__()
chexpert_train.__getitem__(0) # same result
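The save-and-load round trip can also be tested without the dataset. The following sketch stores a small random array in place of a real image tensor; the array shapes and the file location are made up for the demonstration.

```python
import os
import tempfile

import numpy as np

# stand-ins for one image tensor (channels x height x width) and its labels
image = np.random.rand(3, 4, 4).astype(np.float32)
label = np.array([1.0, -1.0, 0.0])

demo_file = os.path.join(tempfile.mkdtemp(), "image_0.npz")
np.savez_compressed(demo_file, image=image, label=label)

# the keyword names used when saving become the keys of the loaded file
with np.load(demo_file) as loaded:
    restored_image = loaded["image"]
    restored_label = loaded["label"]

print(type(restored_image).__name__, restored_image.shape)
print("identical after reloading:", np.array_equal(restored_image, image))
```

The compression is lossless, so the reloaded arrays should match the saved ones exactly.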

### Store images as .jpg files

If you want to store the resized images as .jpg files and the labels in joblib files, you can run the code below.
Please note that due to the negative values that result from the normalization, the images are saved without the normalization.

In [None]:
# define transformations (without normalization)

transform_list_resize = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    ])

# store the training images
chexpert_train_resize = CheXpertDatasetProcessor(path = path, subset = "train", image_paths = image_paths_train, number_of_images = training_set.shape[0], transform_sequence = transform_list_resize)
chexpert_train_resize.process_chexpert_dataset()

for i in tqdm(range(0, training_set.shape[0])):
    ex = chexpert_train_resize.__getitem__(i)[0]
    save_image(ex, os.path.join(storing_location, "train_images", "image_" + str(i) + ".jpg"))
    
# store the training labels
chexpert_train_labels = CheXpertDatasetProcessor(path = path, subset = "train", image_paths = image_paths_train, number_of_images = training_set.shape[0], return_image = False)
chexpert_train_labels.process_chexpert_dataset()
train_label_loader = DataLoader(chexpert_train_labels, batch_size = training_set.shape[0])
dataiter = iter(train_label_loader)
train_labels = next(dataiter) # dataiter.next() is not available in recent PyTorch versions

# the with-block closes the file once the labels are written
with open(joblib_labels_train, 'wb') as joblib_file:
    joblib.dump(train_labels, joblib_file, compress = 2)

# store the validation images
chexpert_valid_resize = CheXpertDatasetProcessor(path = path, subset = "valid", image_paths = image_paths_valid, number_of_images = validation_set.shape[0], transform_sequence = transform_list_resize)
chexpert_valid_resize.process_chexpert_dataset()

for i in tqdm(range(0, validation_set.shape[0])):
    ex = chexpert_valid_resize.__getitem__(i)[0]
    save_image(ex, os.path.join(storing_location, "valid_images", "image_" + str(i) + ".jpg"))
    
# store the validation labels
chexpert_valid_labels = CheXpertDatasetProcessor(path = path, subset = "valid", image_paths = image_paths_valid, number_of_images = validation_set.shape[0], return_image = False)
chexpert_valid_labels.process_chexpert_dataset()
valid_label_loader = DataLoader(chexpert_valid_labels, batch_size = validation_set.shape[0])
dataiter = iter(valid_label_loader)
valid_labels = next(dataiter) # dataiter.next() is not available in recent PyTorch versions

# the with-block closes the file once the labels are written
with open(joblib_labels_valid, 'wb') as joblib_file:
    joblib.dump(valid_labels, joblib_file, compress = 2)
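Since the .jpg files are stored without normalization, the ImageNet normalization has to be applied again when the images are loaded for training. The NumPy sketch below shows one way this could look for an image that has already been read into an 8-bit array of shape height x width x 3. The function name `normalize_uint8_image` and the constant gray test image are assumptions made for the example.

```python
import numpy as np

IMAGENET_MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
IMAGENET_STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)


def normalize_uint8_image(pixels: np.ndarray) -> np.ndarray:
    """Turn an H x W x 3 uint8 array into a normalized 3 x H x W float32 array."""
    scaled = pixels.astype(np.float32) / 255.0              # 0..255 -> 0..1
    normalized = (scaled - IMAGENET_MEAN) / IMAGENET_STD    # broadcasts over the channel axis
    return normalized.transpose(2, 0, 1)                    # channels first


# made-up stand-in for a loaded 224 x 224 image: uniform mid-gray
demo_pixels = np.full((224, 224, 3), 128, dtype=np.uint8)
demo_out = normalize_uint8_image(demo_pixels)

print(demo_out.shape, demo_out.dtype)
```

Note that JPEG compression is lossy, so images restored this way will probably differ slightly from the tensors stored in the .npz files.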

## Finish

This concludes the preprocessing of the CheXpert data.

### References

CheXpert: A large chest radiograph dataset with uncertainty labels and expert comparison by Irvin et al. (2019):
https://arxiv.org/pdf/1901.07031.pdf

Structured dataset documentation: a datasheet for CheXpert by Garbin et al. (2021):
https://arxiv.org/pdf/2105.03020.pdf