# Preprocessing of the MIMIC-CXR dataset

## Getting all set up

To get access to the data, one needs to be a credentialed user on PhysioNet and sign the data use agreement.
1) To get a credentialed PhysioNet account, follow the instructions at https://physionet.org/about/citi-course/.

2) Download the files
    "cxr-record-list.csv.gz" (from mimic-cxr),
    "cxr-study-list.csv.gz" (from mimic-cxr),
    "mimic-cxr-2.0.0-split.csv.gz" (from mimic-cxr-jpg),
    "mimic-cxr-2.0.0-chexpert.csv.gz" (from mimic-cxr-jpg)
    and unzip them (e.g. with 7zip), so that the plain .csv files exist

3) Set your working directory to this folder

In [None]:
#os.chdir("")
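
If 7zip is not at hand, the downloaded archives can also be unpacked with Python's standard library. The following is a minimal sketch; the helper name `gunzip` is made up for this illustration, and only archives that are actually present in the working directory are touched. Note that `pd.read_csv` can also read `.csv.gz` files directly, in which case no unpacking is needed.

```python
import gzip
import shutil
from pathlib import Path


def gunzip(src: str) -> str:
    """Decompress `src` (must end in .gz) next to itself and return the new path."""
    if not src.endswith(".gz"):
        raise ValueError("expected a .gz file, got: " + src)
    dst = src[:-3]  # drop the ".gz" suffix
    with gzip.open(src, "rb") as f_in, open(dst, "wb") as f_out:
        shutil.copyfileobj(f_in, f_out)
    return dst


archives = [
    "cxr-record-list.csv.gz",
    "cxr-study-list.csv.gz",
    "mimic-cxr-2.0.0-split.csv.gz",
    "mimic-cxr-2.0.0-chexpert.csv.gz",
]

# Skip archives that have not been downloaded yet.
for name in archives:
    if Path(name).exists():
        print("unpacked", gunzip(name))
```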

## Imports

In [None]:
import os
import numpy as np
import pandas as pd
from tqdm import tqdm
from PIL import Image
import random
import copy
import torch.optim as optim
import csv
from torch.utils.data import DataLoader, Dataset
import torchvision.transforms as transforms
import torch
import torch.nn as nn
import torchvision.models as models
import itertools
from typing import Dict

## Downloading MIMIC-CXR-JPG images

In [None]:
#set n between 1 and 377110
n = 1000
record_list = pd.read_csv("cxr-record-list.csv").to_numpy()
study_list = pd.read_csv("cxr-study-list.csv").to_numpy()

username = "your_username_here"
password = "your_pw_here"

#image download - run only once
for i in tqdm(range(n)):
    #convert the DICOM path to the JPG path first, so that the replacement
    #cannot touch the username or password inside the command string
    jpg_path = record_list[i,3].replace(".dcm", ".jpg")
    command = "".join(["wget -r -N -c -np --user=", username, " --password=", password, " https://physionet.org/files/mimic-cxr-jpg/2.0.0/", jpg_path])
    os.system(command)
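
The mapping from a record path to its JPG download URL can be isolated in a small helper. This is only a sketch: the function name `jpg_url` and the example path are hypothetical. It rewrites the `.dcm` suffix on the path alone, before anything else is added to the string.

```python
BASE_URL = "https://physionet.org/files/mimic-cxr-jpg/2.0.0/"


def jpg_url(dicom_path: str, base_url: str = BASE_URL) -> str:
    """Turn a record path ending in .dcm into the URL of the matching .jpg."""
    if dicom_path.endswith(".dcm"):
        dicom_path = dicom_path[: -len(".dcm")]
    return base_url + dicom_path + ".jpg"


# Hypothetical path, shaped like the entries in the "path" column.
example_path = "files/p10/p10000032/s50414267/example.dcm"
print(jpg_url(example_path))
```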


## Downloading MIMIC-CXR reports

In [None]:
username = "your_username_here"
password = "your_pw_here"

url = ["wget -r -N -c -np --user=", username, " --password=", password, " https://physionet.org/files/mimic-cxr/2.0.0/mimic-cxr-reports.zip"]
command = "".join(url)
os.system(command)

#Now unzip the folder with 7zip

#### Extracting "Findings" and "Impressions" from each txt file and saving them to one csv

In [None]:
with open('mimic_cxr_text.csv', 'w', newline='', encoding='utf-8') as f:
    #create the writer once instead of once per report
    csvwriter = csv.writer(f)
    for i in tqdm(range(len(study_list))):
        with open(''.join(["mimic-cxr-reports/", study_list[i,2]])) as f_path:
            text = f_path.read()
        text = text.replace("\n", "")
        text = text.replace(",", "")
        start = text.find("FINDINGS:")
        end = text.find("IMPRESSION:")
        #find() returns -1 for a missing section; use an empty string in that case
        findings = text[start:end if end > start else len(text)] if start != -1 else ""
        impressions = text[end:] if end != -1 else ""
        row = [study_list[i,0],study_list[i,1], findings, impressions]
        csvwriter.writerow(row)

#open
reports = pd.read_csv("mimic_cxr_text.csv", names = ["patient","id", "findings", "impressions"])
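
The section extraction can be tried on a single report before running it over all studies. The sketch below uses a made-up sample text (not taken from the dataset) and a hypothetical helper `split_report`. It assumes that "IMPRESSION:" follows "FINDINGS:" when both are present, and it returns empty strings for missing sections.

```python
def split_report(text):
    """Return (findings, impressions) from the free text of one report."""
    flat = " ".join(text.split())  # collapse newlines and repeated whitespace
    start = flat.find("FINDINGS:")
    end = flat.find("IMPRESSION:")
    findings = ""
    if start != -1:
        # Stop at the impression section if it follows, else at the end.
        stop = end if end > start else len(flat)
        findings = flat[start:stop].strip()
    impressions = flat[end:].strip() if end != -1 else ""
    return findings, impressions


# Made-up sample report, for illustration only.
sample = """FINAL REPORT
 FINDINGS: The lungs are clear.
 No pleural effusion.

 IMPRESSION: No acute process.
"""

findings, impressions = split_report(sample)
print(findings)
print(impressions)
```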

## Adding labels and split

Each label column in mimic-cxr-2.0.0-chexpert.csv contains one of four values: 1.0, -1.0, 0.0 or missing. They have the following interpretation:

* 1.0 The label was positively mentioned in the associated study, and is present in one or more of the corresponding images
* 0.0 The label was negatively mentioned in the associated study, and therefore should not be present in any of the corresponding images
* -1.0 The label was either: (1) mentioned with uncertainty in the report, and therefore may or may not be present to some degree in the corresponding image, or (2) mentioned with ambiguous language in the report and it is unclear if the pathology exists or not
* Missing (empty element) - No mention of the label was made in the report

So we are primarily interested in the 1.0s.
One study can have multiple positively mentioned labels, as is the case in row 7 below.
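
To make the four cases concrete, the following sketch counts them in a tiny, made-up table that only mimics the layout of the label file (three label columns instead of fourteen). The point is that an empty cell is not the same as 0.0.

```python
import csv
import io
from collections import Counter

# Made-up rows; empty cells stand for "label not mentioned".
toy_csv = """study_id,Atelectasis,Edema,No Finding
1,1.0,,
2,0.0,-1.0,
3,,,1.0
"""

MEANING = {"1.0": "positive", "0.0": "negative", "-1.0": "uncertain", "": "missing"}

counts = Counter()
for row in csv.DictReader(io.StringIO(toy_csv)):
    for name, cell in row.items():
        if name != "study_id":
            counts[MEANING[cell]] += 1

print(dict(counts))
```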

In [None]:
split_list = pd.read_csv("mimic-cxr-2.0.0-split.csv")
labels_chexpert = pd.read_csv("mimic-cxr-2.0.0-chexpert.csv")
record_list = pd.read_csv("cxr-record-list.csv")

labels_chexpert.head(8)

If exactly one label is positively mentioned, it is assigned to the study. If multiple labels are positively mentioned, one of them is chosen at random. If no label is 1.0, the label is set to 'No Finding'.

In [None]:
#initialise the label column with a string, since it will hold class names
labels_chexpert['label'] = 'No Finding'
labels_list = labels_chexpert.columns.to_numpy()
#iterate through labels: 
#three cases: exactly one, none, or multiple diagnoses
for i in tqdm(range(len(labels_chexpert))):
    #which labels are 1? 
    label_is1 = (labels_chexpert.iloc[i,:] == 1.0).to_numpy()
    if label_is1.sum() == 1:
        #take the single matching name as a scalar, not as a one-element array
        labels_chexpert.iloc[i,16] = labels_list[label_is1][0]
    elif label_is1.sum() > 1:
        labels_chexpert.iloc[i,16] = random.choice(labels_list[label_is1])
    else: 
        labels_chexpert.iloc[i,16] = 'No Finding'
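
The three cases of the selection rule can be checked in isolation on plain dictionaries. This is a sketch with a hypothetical helper `choose_label`. Passing a seeded `random.Random` instance makes the random choice reproducible, which the loop over the full table does not do by itself.

```python
import random


def choose_label(row, rng):
    """Pick one class name for a study from its {label: value} mapping."""
    positives = [name for name, value in row.items() if value == 1.0]
    if len(positives) == 1:
        return positives[0]
    if positives:
        return rng.choice(positives)  # several positives: pick one at random
    return "No Finding"


rng = random.Random(0)  # fixed seed for reproducibility
one = choose_label({"Edema": 1.0, "Pneumonia": 0.0, "Fracture": None}, rng)
many = choose_label({"Edema": 1.0, "Pneumonia": 1.0, "Fracture": -1.0}, rng)
none = choose_label({"Edema": -1.0, "Pneumonia": None}, rng)
print(one, many, none)
```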

Next, the records, labels and split are merged to one file. 

In [None]:
#merge records, labels and split
record_split_list = pd.merge(record_list, split_list, how = 'left', on = ['dicom_id', 'study_id','subject_id'])
record_split_label_list = pd.merge(record_split_list, labels_chexpert.iloc[:,[0,1,16]], how = 'left', on = ['study_id','subject_id'])

print("train-val-test split proportions:" ,record_split_label_list.groupby('split').size()/len(record_split_label_list))
print("classes proportions:", record_split_label_list.groupby('label').size()/len(record_split_label_list))

There are 14 classes, and about 40% of the studies are assigned to the class 'No Finding'. We should keep in mind that we are dealing with an imbalanced dataset.

Now the labels get replaced by their id. This gives us the final "input_list". Since we will need this file multiple times, it is cached.

In [None]:
labels = {'Atelectasis':0,
           'Cardiomegaly':1,
           'Consolidation':2,
           'Edema':3,
           'Enlarged Cardiomediastinum':4,
           'Fracture':5,
           'Lung Lesion':6,
           'Lung Opacity':7,
           'Pleural Effusion':8,
           'Pneumonia':9,
           'Pneumothorax':10,
           'Pleural Other':11,
           'Support Devices':12,
           'No Finding':13}

for i in tqdm(range(len(record_split_label_list))):
    record_split_label_list.iloc[i,5] = labels.get(record_split_label_list.iloc[i,5])
              
input_list = record_split_label_list.to_numpy()
#save the whole file
np.save("input_list.npy",input_list)
#open only first n rows
input_list = np.load('input_list.npy', allow_pickle=True)[0:n,:]

## Creating rules from the CheXpert labeler

We use the "phrases" from the CheXpert labeler https://github.com/stanfordmlgroup/chexpert-labeler/tree/master/phrases to build our rules. Therefore, we need to download the .txt file corresponding to each class.

In [None]:
#download synonym list from chexpert
classes = list(labels)
#lower case
classes = [each_string.lower() for each_string in classes]
#replace whitespace with _
classes = [each_string.replace(" ", "_") for each_string in classes]
labels2ids = {classes[i]:i for i in range(14)}
#create folder (no error if it already exists)
os.makedirs(os.path.join(os.getcwd(), "chexpert_rules"), exist_ok=True)
#store files in folder
for i in range(len(classes)):
    os.system("".join(["curl https://raw.githubusercontent.com/stanfordmlgroup/chexpert-labeler/master/phrases/mention/", 
                       classes[i], ".txt ", "-o chexpert_rules/", classes[i], ".txt"]))
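
The naming convention behind the download is simple: the class name is lower-cased and spaces become underscores. The sketch below builds the source URL and the local target path without downloading anything; the helper names `phrase_file` and `phrase_source_and_target` are made up for this illustration.

```python
PHRASES_URL = (
    "https://raw.githubusercontent.com/stanfordmlgroup/"
    "chexpert-labeler/master/phrases/mention/"
)


def phrase_file(class_name: str) -> str:
    """File name of the phrase list that belongs to a class name."""
    return class_name.lower().replace(" ", "_") + ".txt"


def phrase_source_and_target(class_name: str, folder: str = "chexpert_rules"):
    """Return (download URL, local path) for one class."""
    name = phrase_file(class_name)
    return PHRASES_URL + name, folder + "/" + name


url, target = phrase_source_and_target("Enlarged Cardiomediastinum")
print(url)
print(target)
```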

The T matrix contains information about which rule corresponds to which label. In the following snippets, we build this matrix. 

In [None]:
#read txt in
lines = {}
for i in range(len(classes)):
    with open("".join(["chexpert_rules/", classes[i], ".txt"])) as f:
        lines[classes[i]] = [each_string.replace("\n", "") for each_string in f.readlines()]
          
mentions = pd.DataFrame({'label': label, 'rule': rule} for (label, rule) in lines.items())
mentions.head()

In [None]:
#building the dataframe "rules"
rules = pd.DataFrame([i for i in itertools.chain.from_iterable(mentions['rule'])], columns = ["rule"])
rules['rule_id'] = range(len(rules))
rules['label'] = np.concatenate([np.repeat(mentions['label'][i], len(mentions['rule'][i])) for i in range(14)])
rules['label_id'] = [labels2ids[rules['label'][i]] for i in range(len(rules))]
rules.head()

In [None]:
rule2rule_id = dict(zip(rules["rule"], rules["rule_id"]))
rule2label = dict(zip(rules["rule_id"], rules["label_id"]))

def get_mapping_rules_labels_t(rule2label: Dict, num_classes: int) -> np.ndarray:
    """ Function calculates t matrix (rules x labels) using the known correspondence of relations to decision rules """
    mapping_rules_labels_t = np.zeros([len(rule2label), num_classes])
    for rule, labels in rule2label.items():
        mapping_rules_labels_t[rule, labels] = 1
    return mapping_rules_labels_t

mapping_rules_labels_t = get_mapping_rules_labels_t(rule2label, len(labels2ids))
mapping_rules_labels_t[0:5,:]
mapping_rules_labels_t.shape
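
A toy example may help to read the T matrix. It uses four made-up rules and three classes, so each row has exactly one entry equal to 1, in the column of the class the rule belongs to. The last step, multiplying a vector of matched rules by T to get per-class counts, is an assumption about how such a matrix is typically used and is not taken from this notebook.

```python
import numpy as np

# Made-up mapping: rule id -> label id (rules 0 and 1 share class 0).
toy_rule2label = {0: 0, 1: 0, 2: 1, 3: 2}
num_classes = 3

t = np.zeros((len(toy_rule2label), num_classes))
for rule_id, label_id in toy_rule2label.items():
    t[rule_id, label_id] = 1

print(t)

# Hypothetical report in which rules 0 and 2 matched.
matched = np.array([1, 0, 1, 0])
votes = matched @ t  # number of matched rules per class
print(votes)
```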

Now we want to check whether any rule is assigned to more than one class. Indeed, there is one such case: the rule "defib" appears in the phrase lists of two different classes.

In [None]:
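#True if every rule string is unique
len(np.unique(rules['rule'])) == len(rules['rule'])
rules_size = rules.groupby('rule').size() 
#select with a boolean mask; integer positions are ambiguous on a string index
rules_size[rules_size > 1]
#rule defib appears for two different classes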
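
A rule string that occurs for more than one class also affects lookups keyed by the rule text: a dictionary built with `dict(zip(...))` silently keeps only the last entry per key. The sketch below shows both the duplicate check and this effect on made-up phrases and class names.

```python
from collections import Counter

# Made-up (rule, class) pairs; "phrase_a" is listed for two classes.
toy_rules = [
    ("phrase_a", "class_1"),
    ("phrase_b", "class_1"),
    ("phrase_a", "class_2"),
]

rule_counts = Counter(rule for rule, _ in toy_rules)
duplicates = [rule for rule, count in rule_counts.items() if count > 1]
print(duplicates)

# Rule ids are the row positions; the later id overwrites the earlier one.
rule2id = dict(zip((rule for rule, _ in toy_rules), range(len(toy_rules))))
print(rule2id)
```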

# Image encoding 
## Finetuning a pretrained CNN and extracting the second-to-last layer as features

For the image encoding, we use transfer learning: we take a pretrained CNN and continue training it on our data. This saves a lot of training time and can give satisfying results even on a small dataset.
For the implementation we use PyTorch and the pretrained ResNet-50 from torchvision.

The following class loads the data and applies the transformations required by ResNet-50. It is written so that it is compatible with torch.utils.data.DataLoader.

In [None]:
class mimicDataset(Dataset):
    
    def __init__(self, path):
        'Initialization'
        self.path = path
        #self.y = y
        
    def __len__(self):
        'Denotes the total number of samples'
        return len(self.path)
    
    def __getitem__(self, index):
        'Generates one sample of data'
        # Select sample
        image = Image.open("".join(["physionet.org/files/mimic-cxr-jpg/2.0.0/",self.path[index,3].replace(".dcm", ".jpg")])).convert('RGB')
        X = self.transform(image)
        label = self.path[index,5]
        
        return X, torch.tensor(label)
    
    transform = transforms.Compose([
        transforms.Resize(256),
        transforms.CenterCrop(224),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])])


Next, we split the data according to the split given in mimic-cxr-jpg. 

In [None]:
input_train = input_list[input_list[:,4] == 'train',:]
input_validate = input_list[input_list[:,4] == 'validate',:]
input_test = input_list[input_list[:,4] == 'test',:]

Since the dataset is imbalanced, we use a weighted sampler.

In [None]:
class_counts = np.zeros(14)
for i in range(14): class_counts[i] = sum(input_train[:,5]==i)
weight = 1/class_counts
sample_weights = np.array([weight[t] for t in input_train[:,5]])
sample_weights = torch.from_numpy(sample_weights)
sample_weights = sample_weights.double()
sampler = torch.utils.data.WeightedRandomSampler(weights=sample_weights, num_samples=len(sample_weights))
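
The effect of inverse-frequency weights can be verified on a small, made-up label vector. Each sample gets the weight 1 / (size of its class), so every class receives the same total sampling probability regardless of its size. `WeightedRandomSampler` draws with replacement by default, so rare classes are oversampled rather than frequent ones dropped.

```python
import numpy as np

# Made-up labels: class 0 has 6 samples, class 1 has 3, class 2 has 1.
toy_labels = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 2])

class_counts = np.bincount(toy_labels, minlength=3)
weight = 1.0 / class_counts
sample_weights = weight[toy_labels]

# Normalised weights are the per-sample probabilities of being drawn.
probs = sample_weights / sample_weights.sum()
per_class = np.array([probs[toy_labels == c].sum() for c in range(3)])
print(per_class)
```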

In [None]:
dataset = {'train' : mimicDataset(input_train),
           'val': mimicDataset(input_validate),
           'test':  mimicDataset(input_test)}

dataloaders = {'train': DataLoader(dataset['train'] , batch_size=4, num_workers=0, sampler = sampler),
               'val': DataLoader(dataset['val'] , batch_size=4, num_workers=0 )}

dataset_sizes = {x: len(dataset[x]) for x in ['train', 'val']}
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

The following function is adapted from https://pytorch.org/tutorials/beginner/transfer_learning_tutorial.html

In [None]:
def train_model(model, criterion, optimizer, scheduler, num_epochs=25):

    best_model_wts = copy.deepcopy(model.state_dict())
    best_acc = 0.0

    for epoch in range(num_epochs):
        print('Epoch {}/{}'.format(epoch, num_epochs - 1))
        print('-' * 10)

        # Each epoch has a training and validation phase
        for phase in ['train', 'val']:
            if phase == 'train':
                model.train()  # Set model to training mode
            else:
                model.eval()   # Set model to evaluate mode

            running_loss = 0.0
            running_corrects = 0

            # Iterate over data.
            for inputs, labels in dataloaders[phase]:
                inputs = inputs.to(device)
                labels = labels.to(device)

                # forward
                # track history if only in train
                with torch.set_grad_enabled(phase == 'train'):
                    outputs = model(inputs)
                    _, preds = torch.max(outputs, 1)
                    loss = criterion(outputs, labels)

                    # backward + optimize only if in training phase
                    if phase == 'train':
                        optimizer.zero_grad()
                        loss.backward()
                        optimizer.step()

                # statistics
                running_loss += loss.item() * inputs.size(0)
                running_corrects += torch.sum(preds == labels.data)
                if phase == 'val':
                    print('predictions',preds)

            if phase == 'train':
                scheduler.step()

            epoch_loss = running_loss / dataset_sizes[phase]
            epoch_acc = running_corrects.double() / dataset_sizes[phase]

            print('{} Loss: {:.4f} Acc: {:.4f}'.format(
                phase, epoch_loss, epoch_acc))

            # deep copy the model
            if phase == 'val' and epoch_acc > best_acc:
                best_acc = epoch_acc
                best_model_wts = copy.deepcopy(model.state_dict())

        print()

    print('Best val Acc: {:.4f}'.format(best_acc))

    # load best model weights
    model.load_state_dict(best_model_wts)
    return model


In [None]:
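# note: torchvision >= 0.13 deprecates pretrained=True in favour of
# weights=models.ResNet50_Weights.IMAGENET1K_V1 (the same weights)
model = models.resnet50(pretrained=True)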
num_ftrs = model.fc.in_features
# set output size to 14 (number of classes)
model.fc = nn.Linear(num_ftrs, 14)
model = model.to(device)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.0001)
step_lr_scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=7, gamma=0.1)
model = train_model(model, criterion, optimizer, step_lr_scheduler, num_epochs=2)
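
The step schedule is easy to work out by hand: with `step_size=7` and `gamma=0.1`, the learning rate is multiplied by 0.1 after every 7 scheduler steps. The sketch below reproduces that arithmetic in plain Python (the helper `step_lr` is made up for this illustration). It shows that with `num_epochs=2` the learning rate stays at its initial value throughout.

```python
def step_lr(base_lr, epoch, step_size=7, gamma=0.1):
    """Learning rate after `epoch` scheduler steps under a step decay."""
    return base_lr * gamma ** (epoch // step_size)


for epoch in (0, 1, 6, 7, 14):
    print(epoch, step_lr(1e-4, epoch))
```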

Now that we have trained our model, we extract the features.
In the next step, the last layer is removed from the model, so the output is the second-to-last layer, which yields a 2048-dimensional feature vector per image.


In [None]:
modules = list(model.children())[:-1]
model=torch.nn.Sequential(*modules)
for p in model.parameters():
    p.requires_grad = False
    
model.eval()
#apply modified resnet50 to data (all n images are loaded as one batch)
dataloaders = DataLoader(mimicDataset(input_list[:n,:]), batch_size=n,num_workers=0)
    
data, labels = next(iter(dataloaders))
with torch.no_grad():
    #the model lives on `device`, so the batch has to be moved there as well
    features = model(data.to(device))
    #output has shape (batch, 2048, 1, 1); flatten and move back to the CPU for numpy
    all_X = features.reshape(len(data),2048).cpu().numpy()
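
The trick of dropping the last child module can be tried on a tiny stand-in network without downloading any weights. The class `TinyNet` below is a made-up toy, not ResNet-50. It only shares the relevant structure: a convolutional part, a global average pool, and a final linear layer registered last.

```python
import torch
import torch.nn as nn


class TinyNet(nn.Module):
    """Toy stand-in: conv -> global average pool -> linear classifier."""

    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 8, kernel_size=3, padding=1)
        self.avgpool = nn.AdaptiveAvgPool2d((1, 1))
        self.fc = nn.Linear(8, 14)

    def forward(self, x):
        x = self.avgpool(self.conv(x))
        return self.fc(torch.flatten(x, 1))


net = TinyNet().eval()

# children() yields the modules in registration order; drop the classifier.
backbone = nn.Sequential(*list(net.children())[:-1])

with torch.no_grad():
    feats = backbone(torch.randn(5, 3, 32, 32))

print(feats.shape)           # pooled output keeps two trailing singleton axes
flat = torch.flatten(feats, 1)
print(flat.shape)            # one feature vector per image
```

Rebuilding a model from `children()` only works when the forward pass is a plain chain of those modules up to the cut, since any operation written inside `forward` (such as the flatten here) is lost. That is why the pooled output still has to be reshaped afterwards.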


In [None]:
from joblib import dump
#only features
dump(all_X, "all_X.lib")