<a href="https://colab.research.google.com/github/ipavlopoulos/rtex/blob/master/RTEx_IUXray.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# RTEx: Ranking, Tagging and Explanatory Diagnostic Captioning of Radiographs
In this notebook we use the IUXray dataset to showcase an implementation of our suggested methodology. Please see our article for more details.

In [0]:
import os
import numpy as np
import pandas as pd
from keras.models import load_model
from keras.models import Model
from keras.applications.densenet import preprocess_input
from keras.preprocessing import image
from tqdm import tqdm

Using TensorFlow backend.


# Data setup

Download the IUXray test data and the data handling code.

In [0]:
# Download test datafiles
! gdown --id 1nubphDVrKpB3Ss9uNaxHLUf2DWpqWRzq
! unzip -q iu_xray_data.zip

Downloading...
From: https://drive.google.com/uc?id=1nubphDVrKpB3Ss9uNaxHLUf2DWpqWRzq
To: /content/iu_xray_data.zip
  0% 0.00/166k [00:00<?, ?B/s]100% 166k/166k [00:00<00:00, 24.2MB/s]


In [0]:
# Download iu xray images
!wget https://openi.nlm.nih.gov/imgs/collections/NLMCXR_png.tgz
!mkdir iu_xray
!tar -xzf NLMCXR_png.tgz -C iu_xray/

In [0]:
!git clone https://github.com/ipavlopoulos/rtex
from rtex import data_handler

Cloning into 'rtex'...
remote: Enumerating objects: 16, done.[K
remote: Counting objects: 100% (16/16), done.[K
remote: Compressing objects: 100% (15/15), done.[K
remote: Total 16 (delta 5), reused 0 (delta 0), pack-reused 0[K
Unpacking objects: 100% (16/16), done.


Load tags

In [0]:
tags_df = pd.read_csv("mti_tags.csv", header=None)
tags_list = tags_df[0].to_list()

Load train data

In [0]:
train = pd.read_csv("iu_xray_abnormal_train.tsv", sep="\t")
train_cases_images = dict(zip(train.reports,train.images))
train_cases_tags = dict(zip(train.reports,train.mti_tags))
train_cases_captions = dict(zip(train.reports,train.captions))

Load test data and encode images

In [0]:
# Load data
import pandas as pd
test = pd.read_csv("iu_xray_all_test.tsv", sep="\t")
test_cases_images = dict(zip(test.reports,test.images))
test_case_ids = list(test_cases_images.keys())

In [0]:
# Encode all test images
x_test = data_handler.encode_images(test_cases_images, "iu_xray")

#RTEx@R
* Rank a batch of radiography exams based on the probability that they contain an abnormality.
* Top ranked are more likely to be abnormal.

In [0]:
# Download Bi-CXN checkpoint
! gdown --id 1D8oHHSib1k8QHnDD_LLzf6lLbyjrtfOm
! unzip -q iu_xray_bi_cxn.zip

Downloading...
From: https://drive.google.com/uc?id=1D8oHHSib1k8QHnDD_LLzf6lLbyjrtfOm
To: /content/iu_xray_bi_cxn.zip
74.8MB [00:01, 70.2MB/s]


Load the BI-CXN model, which will rank the radiography exams

In [0]:
# Load Bi-CXN checkpoint
bi_cxn = load_model("iu_xray_bi_cxn.hdf5")

Get the probabilities of the exams

In [0]:
# Bi-CXN prediction...
test_abn_probs = bi_cxn.predict(x_test, batch_size=16, verbose=1).flatten()



Sort exams by abnormality probability in descending order and get the top 100 exams

In [0]:
cases_probs = dict(zip(test_case_ids, test_abn_probs))
# Sort all exams (a.k.a. cases)
sorted_cases_probs = {k: v for k, v in sorted(cases_probs.items(), key=lambda item: item[1], reverse=True)}
sorted_cases = list(sorted_cases_probs.keys())
# Get the top 100 abnormal exams
abnormal_cases_images = {case: test_cases_images[case] for case in sorted_cases[:100]}

#RTEx@T

In [0]:
# Download TagCXN checkpoint
! gdown --id 1MaQW72B1bOULBwvr4ieqE6lONPkqHgil
! unzip -q iu_xray_tag_cxn.zip

Downloading...
From: https://drive.google.com/uc?id=1MaQW72B1bOULBwvr4ieqE6lONPkqHgil
To: /content/iu_xray_tag_cxn.zip
85.6MB [00:01, 73.3MB/s]


Load the TagCXN model, which will assign tags to the radiography exams

In [0]:
# Load TagCXN checkpoint
tag_cxn = load_model("iu_xray_tag_cxn.hdf5")

In [0]:
# Encode abnormal test images
abnormal_x_test = data_handler.encode_images(abnormal_cases_images, "iu_xray")

Assign tags to each exam

In [0]:
# Get predictions for test set
test_tag_probs = tag_cxn.predict(abnormal_x_test, batch_size=16, verbose=1)

best_threshold = 0.097

tagging_results = {}
# for each exam, assign all tags above threshold
for i in range(len(test_tag_probs)):
    predicted_tags = []
    for j in range(len(tags_list)):
        if test_tag_probs[i, j] >= best_threshold:
            predicted_tags.append(tags_list[j])
    tagging_results[list(abnormal_cases_images.keys())[i]] = ";".join(predicted_tags)



In [0]:
results = list(tagging_results.items())
print(f"For example, for the top-ranked {results[0][0]} exam, the following tags were found:\n{', '.join(results[0][1].split(';'))}")

For example, for the top-ranked CXR3892 exam, the following tags were found:
atelectases, atelectasis, cardiomegaly, degenerative change, opacity, pleural effusion, pleural effusions, scarring


# RTEx@X

Use TagCXN to get exam embeddings for CNN+NN

In [0]:
# Extract from the model the concatenation layer
vector_extraction_model = Model(inputs=tag_cxn.input,
                        outputs=tag_cxn.get_layer("concatenate_1").output)

Get the embeddings of the (abrnormal) train and test exams

In [0]:
# Get train embeddings
train_images_vec = data_handler.extract_img_embeddings(vector_extraction_model, 
                                              "iu_xray", train_cases_images)
# Get test embeddings
test_images_vec = data_handler.extract_img_embeddings(vector_extraction_model, 
                                            "iu_xray", abnormal_cases_images)

100%|██████████| 1183/1183 [09:14<00:00,  2.13it/s]
100%|██████████| 100/100 [00:45<00:00,  2.19it/s]


In [0]:
def CNN_NN(vec, train_image_mat, train_ids):
  """ 
  Search the train abnormal exams for the nearest one to the test exam.
  Use Cosine similarity and the CNN-encoded radiograph embeddings.
  @return: the text of the nearest exam  
  """
  assert train_image_mat.shape[0] == len(train_ids)
  vec = vec / np.sum(vec)
  vec_clones = np.array([vec] * train_image_mat.shape[0])
  similarities = np.sum(vec_clones * train_image_mat, 1)
  newarest_id = train_ids[np.argmax(similarities)]
  return train_cases_captions[newarest_id]

`CNNtNN`:
Search the train abnormal exams with the same tags assigned, for the nearest one to the test exam. Use Cosine similarity and the CNN-encoded radiograph embeddings. Return the text of the nearest exam.

In [0]:
# 1NN+
sim_test_results = {}
for test_id in tqdm(abnormal_cases_images.keys()):
    # screen the train DB for cases with the predicted tags (if none, use all)
    predicted_tags = set(tagging_results[test_id])  # test_cases_tags[test_id])
    train_indices = [i for i in train_images_vec.keys() if set(predicted_tags) == set(train_cases_tags[i])]
    if len(train_indices) == 0:
        train_indices = list(train_images_vec.keys())
    # compute dot similarity with the train DB (screening or not)
    raw = np.array([train_images_vec[i] for i in train_indices])
    raw = raw / np.array([np.sum(raw, 1)] * raw.shape[1]).transpose()
    sim_test_results[test_id] = CNN_NN(test_images_vec[test_id], raw, train_indices)

100%|██████████| 100/100 [00:00<00:00, 177.97it/s]


In [0]:
results = list(sim_test_results.items())
print(f"For example, for the top-ranked {results[0][0]} exam, the following decription was provided:\n{results[0][1]}")

For example, for the top-ranked CXR3892 exam, the following decription was provided:
moderate bilateral interstitial edema with cardiomegaly and bilateral effusion consistent with moderate cardiac failure newsentence a large calcified right mediastinal adenopathy xxxx chronic fungal newsentence no pneumothorax
