## AIMI High School Internship 2023
### Notebook 1: Extracting Labels from Radiology Reports

**The Problem**: Given a chest X-ray, our goal in this project is to predict the distance from an endotracheal tube to the carina. This is an important clinical task: an endotracheal tube positioned too far (>5 cm) above the carina will not work effectively.

In order to train a model that can predict tube distances given chest X-rays, we require a ***training set*** with chest X-rays and labeled tube distances. However, when working with real-world medical data, important labels (e.g. endotracheal tube distances) are often not annotated ahead of time. The only data that a researcher has access to are the raw images and free-form clinical text written by the radiologist.

**Your First Task**: Given a set of chest X-rays and paired radiology reports, your goal is to use natural language processing tools to extract endotracheal tube distances from the reports.

**Looking Ahead**: When you complete this task, you should have a training dataset with chest X-rays labeled with endotracheal tube distances. You will later use this dataset to train a computer vision model that predicts the tube distance given an image.

### Load Data

Upload `data.zip`; the upload should take about 10 minutes. Then run the following cells to unzip the dataset, which should take less than 10 seconds.

### Understanding the Data

Let's first go through some terminology. Medical data is often stored in a hierarchy consisting of three levels: patient, study, and images.
- Patient: A patient is a single unique individual.
- Study: Each patient may have multiple sets of images taken, perhaps on different days. Each set of images is referred to as a *study*.
- Images: Each study consists of one or more *images*.

Chest X-ray images and radiology reports are stored in `data/` and are organized as follows:
- `data/mimic-train`:
  - Images: The MIMIC training set consists of 5313 subfolders, each representing a patient. Every patient has one or more studies, which are stored as subfolders. Images are stored in study folders as `.jpg` files with 512x512 pixels.
  - Text: Reports are stored in patient folders with `.txt` extensions. The filename corresponds to the study id and the content of the report applies to all images in the corresponding study.
- `data/mimic-test`: The MIMIC test set is organized in the same way as the MIMIC training set. Note that this is a held-out test set with 500 images that we will use for scoring models, so reports are not provided!
- `data/mimic_train_student.csv`: This spreadsheet provides mappings between image paths, report paths, patient ids, study ids, and image ids for samples in the training set.
- `data/mimic_test_student.csv`: This spreadsheet provides mappings between image paths, patient ids, study ids, and image ids for samples in the test set.
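The folder layout described above can be turned into paths with a couple of small helpers. This is only a sketch: the helper names `image_path` and `report_path` and the `DATA_ROOT` constant are hypothetical and not part of the provided code. In practice, the CSV files already list these paths for you.

```python
from pathlib import Path

DATA_ROOT = Path("data")  # assumed location of the unzipped dataset

def image_path(split, patient_id, study_id, image_id):
    """Path to one image: data/mimic-<split>/<patient>/<study>/<image>.jpg"""
    return DATA_ROOT / f"mimic-{split}" / str(patient_id) / str(study_id) / f"{image_id}.jpg"

def report_path(split, patient_id, study_id):
    """Path to a study's report: data/mimic-<split>/<patient>/<study>.txt"""
    return DATA_ROOT / f"mimic-{split}" / str(patient_id) / f"{study_id}.txt"

# Patient 12000, study 59707, image 90529 from the training set
print(image_path("train", 12000, 59707, 90529).as_posix())
print(report_path("train", 12000, 59707).as_posix())
```

Note that the report lives one level above its images: one report per study, shared by every image in that study.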

In [2]:
# Example Image
from PIL import Image
img = Image.open(f"data/mimic-train/12000/59707/90529.jpg")
img.show()

In [3]:
# Example Text Report
with open(f"data/mimic-train/12000/59707.txt", "r") as f:
  txt = f.readlines()
txt

['                                 FINAL REPORT\n',
 ' PORTABLE CHEST ___\n',
 ' \n',
 ' COMPARISON:  ___ radiograph.\n',
 ' \n',
 ' FINDINGS:  Tip of endotracheal tube terminates 6 cm above the carina. \n',
 ' Cardiomediastinal contours are normal in appearance, and lungs are grossly\n',
 ' clear.\n']

In [4]:
# Load csv file with mappings
import pandas as pd
subjects = pd.read_csv(f'data/mimic_test_student.csv')

subjects = subjects.drop(columns=["Unnamed: 0"])
subjects


Unnamed: 0,split,patient_id,study_id,image_id,image_path
0,test,10345,50410,80276,mimic-test/10345/50410/80276.jpg
1,test,10345,50232,80350,mimic-test/10345/50232/80350.jpg
2,test,10189,50388,80353,mimic-test/10189/50388/80353.jpg
3,test,10127,50441,80124,mimic-test/10127/50441/80124.jpg
4,test,10004,50475,80218,mimic-test/10004/50475/80218.jpg
...,...,...,...,...,...
495,test,10252,50003,80193,mimic-test/10252/50003/80193.jpg
496,test,10363,50408,80072,mimic-test/10363/50408/80072.jpg
497,test,10363,50145,80039,mimic-test/10363/50145/80039.jpg
498,test,10054,50265,80374,mimic-test/10054/50265/80374.jpg


### Extracting Tube Distance Labels

You're now ready to begin this task! Keep in mind that not every chest X-ray provided in the training set contains endotracheal tube distance information, and there may be several edge cases to consider.
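Before reaching for a language model, it can help to try a simple rule-based baseline. The sketch below is one possible starting point and is not the approach used in the rest of this notebook: it looks for a number followed by a cm or mm unit in any sentence that mentions the carina. The names `DISTANCE_RE` and `regex_distance_cm` are hypothetical, and the pattern deliberately ignores harder cases such as ranges ("2-3 cm") or spaced decimals ("3 . 5 cm").

```python
import re

# A number (with "." or "," as the decimal separator) followed by a cm/mm unit
DISTANCE_RE = re.compile(
    r"(\d+(?:[.,]\d+)?)\s*(cm|centimeters?|mm|millimeters?)\b",
    re.IGNORECASE,
)

def regex_distance_cm(report):
    """Return the first distance (in cm) from a sentence mentioning the carina, or None."""
    # Naive sentence split: break after ".", "!" or "?" followed by whitespace
    for sentence in re.split(r"(?<=[.!?])\s+", report):
        if "carina" not in sentence.lower():
            continue
        match = DISTANCE_RE.search(sentence)
        if match is None:
            continue
        value = float(match.group(1).replace(",", "."))
        unit = match.group(2).lower()
        # Convert millimeters to centimeters
        return value / 10.0 if unit.startswith("m") else value
    return None

report = ("FINDINGS:  Tip of endotracheal tube terminates 6 cm above the carina. "
          "Cardiomediastinal contours are normal in appearance.")
print(regex_distance_cm(report))
```

A baseline like this is also useful for spot-checking the output of a model-based extractor on a handful of reports.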

In [5]:
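def is_cm(string):
  # Despite the name, this returns the first character that is not part of
  # the number (e.g. "c" for cm, "m" for mm), or None if no unit is present
  valid = "0123456789., "
  unit = None
  for char in string:
    if valid.find(char) == -1:
      unit = char
      break
  return unit

def validate(string):
  # Accept only digits, separators, and the letters that appear in
  # "cm", "mm", "centimeter", and "millimeter". There is no "s" in the
  # allowed set, so plural spellings such as "centimeters" are rejected.
  valid = "0123456789CcentiMmrl,. "

  for char in string:
    if valid.find(char) == -1:
      return False
  return True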

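def handle_no_space(string):
  # Keep the leading run of digits and stop at the first non-digit
  # character, e.g. "3cm" -> "3.0" and "3centimeter" -> "3.0"
  valid = "0123456789"
  end_of_int = len(string)

  for i, char in enumerate(string):
    if valid.find(char) == -1:
      end_of_int = i
      break
  return string[:end_of_int] + ".0"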

def handle_space(string):
  space_loc = string.find(" ")
  if space_loc != -1:
      return string.split(" ")[0] + ".0"
  return handle_no_space(string)

def handle_period(string):
  # Keep only digits and periods, then drop one trailing period
  # (e.g. "12.5. cm" -> "12.5")
  out = ""
  valid = "0123456789."
  for char in string:
    if valid.find(char) != -1:
      out += char
  if out.endswith("."):
    out = out[:-1]
  return out

def parse_measure(measurement):
  comma_loc = measurement.find(",")
  period_loc = measurement.find(".")
  # no comma or period
  if period_loc == -1 and comma_loc == -1:
    return handle_space(measurement)
  # a comma is present: treat it as the decimal separator
  if comma_loc != -1:
    measurement = measurement.replace(",", ".")
  # at this point, every remaining string
  # contains at least one period
  return handle_period(measurement)

def parse_report(path):
  with open(f"data/{path}", "r") as f:
    txt = f.readlines()
  return "".join(txt).replace("\n", "")


### Measurement Parsing Test Harness

In [6]:
# Test 1: 3.5 cm
print("3.5" == parse_measure("3.5 cm"))
# parse_measure("3.5 cm")

# Test 2: 3,5 cm
print("3.5" == parse_measure("3,5 cm"))
# parse_measure("3,5 cm")

# Test 3: 3 . 5 centimeter
print("3.5" == parse_measure("3 . 5 centimeter"))
# parse_measure("3 . 5 centimeter")

# Test 4: 3.5centimeters
print("3.5" == parse_measure("3.5centimeters"))
# parse_measure("3.5centimeters")

# Test 5: 3,5 centimetrs
print("3.5" == parse_measure("3,5 centimetrs"))
# parse_measure("3,5 centimetrs")

# Test 6: 3 cm
print("3.0" == parse_measure("3 cm"))
# parse_measure("3 cm")

# Test 7: 3.0 cm
print("3.0" == parse_measure("3.0 cm"))
# parse_measure("3.0 cm")

# Test 8: 12.5. cm (stray trailing period)
print("12.5" == parse_measure("12.5. cm"))
# parse_measure("12.5. cm")

# Test 9: 22,5 cm
print("22.5" == parse_measure("22,5 cm"))
# parse_measure("22,5 cm")

# Test 10: 25 . 5 centimeter
print("25.5" == parse_measure("25 . 5 centimeter"))
# parse_measure("25 . 5 centimeter")

# Test 11: 361.5centimeters
print("361.5" == parse_measure("361.5centimeters"))
# parse_measure("361.5centimeters")

# Test 12: 46,5 centimetrs
print("46.5" == parse_measure("46,5 centimetrs"))
# parse_measure("46,5 centimetrs")

# Test 13: 75 cm
print("75.0" == parse_measure("75 cm"))
# parse_measure("75 cm")

# Test 14: 232.0 cm
print("232.0" == parse_measure("232.0 cm"))
# parse_measure("232.0 cm")

# Test 15: 2 3 2 .    0 cm (digits separated by spaces)
print("232.0" == parse_measure("2 3 2 .    0 cm"))
# parse_measure("2 3 2 .    0 cm")

# Test 16: 232 mm
print("m" == is_cm("232 mm"))

# Test 17: 232.0 millimietr
print("m" == is_cm("232.0 millimietr"))

# Test 18: 257 , 0 cm
print("c" == is_cm("257 , 0 cm"))

# Test 19: 25 , 0 cm
print("c" == is_cm("25 , 0 cm"))

# Test 20: 25
print(None == is_cm("25"))

# Test 21: 2 and 3 cm
print(False == validate("2 and 3 cm"))

# Test 22: 2 or 3 cm
print(False == validate("2 or 3 cm"))

# Test 23: 2-3 cm
print(False == validate("2-3 cm"))

# Test 24: less than 2 cm:
print(False == validate("less than 2 cm"))

# Test 25: 2 away
print(False == validate("2 away"))

# Test 26:
print(True == validate("2 3 2 .    0 cm"))

# Test 27:
print(True == validate("25 . 5 centimeter"))

# Test 28:
print(True == validate("22,5 cm"))



True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
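A wall of `True` lines makes it hard to see which case broke when something fails. One alternative, sketched below under stated assumptions, is a table-driven check that lists the failing inputs. The helper `check_cases` and the stand-in function `keep_digits` are hypothetical; in this notebook you would pass `parse_measure`, `is_cm`, or `validate` together with the cases from the cell above.

```python
def check_cases(func, cases):
    """Run func on each input and return (input, expected, actual) for every failure."""
    failures = []
    for text, expected in cases:
        actual = func(text)
        if actual != expected:
            failures.append((text, expected, actual))
    return failures

# Hypothetical stand-in for the function under test
def keep_digits(text):
    return "".join(ch for ch in text if ch.isdigit())

cases = [("3 cm", "3"), ("2 3 2 cm", "232"), ("12mm", "12")]
failures = check_cases(keep_digits, cases)

print(f"{len(cases) - len(failures)}/{len(cases)} cases passed")
for text, expected, actual in failures:
    print(f"FAILED: {text!r} expected {expected!r}, got {actual!r}")
```

Adding a new edge case then only requires one more tuple in `cases`.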


In [7]:
from functools import lru_cache
from transformers import pipeline
from transformers import AutoTokenizer, AutoModelForTokenClassification, AutoModelForQuestionAnswering

# Pass device=0 to pipeline(...) if using a GPU

@lru_cache(maxsize=None)
def get_ner_pipeline():
  # Build the NER pipeline once and reuse it; reloading the model for every report is slow
  tokenizer = AutoTokenizer.from_pretrained("d4data/biomedical-ner-all")
  model = AutoModelForTokenClassification.from_pretrained("d4data/biomedical-ner-all")
  return pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")

def biomed_token_class(report):
  return get_ner_pipeline()(report)

@lru_cache(maxsize=None)
def get_qa_pipeline():
  # Build the question-answering pipeline once and reuse it across reports
  model_name = "deepset/roberta-base-squad2"
  model = AutoModelForQuestionAnswering.from_pretrained(model_name)
  tokenizer = AutoTokenizer.from_pretrained(model_name)
  return pipeline('question-answering', model=model, tokenizer=tokenizer)

def roberta_squad2(report):
  question = "What is the exact distance between the ETT device and the carina?"

  qa_input = {
      'question': question,
      'context': report
  }

  return get_qa_pipeline()(qa_input)['answer']

def get_distance(report):
  measure_roberta = roberta_squad2(report)
  roberta_valid = validate(measure_roberta)
  if not roberta_valid:
    return "-1"
  if roberta_valid and is_cm(measure_roberta) is not None:
    return measure_roberta
  return "-2"

def get_biomed_token_class(report):
  response = biomed_token_class(report)
  measure = "-1"
  max_score = 0
  for json in response:
    if json["entity_group"] == "Distance" and json["score"] > max_score:
      measure = json["word"]
      max_score = json["score"]
  return measure

NOT_VALID = -1.0
NO_UNITS = -2.0

def process_volume(path):
  report = parse_report(path)
  distance = get_distance(report)
  if distance == "-1":
    return NOT_VALID
  if distance == "-2":
    return NO_UNITS
  unit = is_cm(distance)
  measurement = parse_measure(distance)
  if unit == "m" or unit == "M":
    return float(measurement) / 10.0
  return float(measurement)



In [8]:

def good_positioning(measure):
  if measure < 0:
    return -1
  if measure <= 5:
    return 1
  return 0


# Load csv file with mappings
import pandas as pd
import numpy as np

def get_labels(data):
  PATH = f"data/mimic_{data}_student.csv"
  subjects = pd.read_csv(PATH)
  subjects = subjects.drop(columns=["Unnamed: 0", "study_id", "image_id"])
  report_paths = subjects["report_path"].to_numpy()
  measures = []
  positioning = [] # 1 is good, 0 is bad

  for i, path in enumerate(report_paths):
    if i % 100 == 0:
      print(f"Progress checkpoint, processed {i} volumes. {len(report_paths) - (i)} remain.")
    if i % 1000 == 0:
      print(f"Saving {i} data points to data/batch-{i}.csv")
      checkpoint = pd.DataFrame()
      checkpoint["measures"] = measures
      checkpoint["positioning"] = positioning
      checkpoint.to_csv(f"data/batch-{i}.csv")

    measure = process_volume(path)
    measures.append(measure)
    positioning.append(good_positioning(measure))

  subjects["measures"] = measures
  subjects["positioning"] = positioning
  subjects.to_csv(f"data/mimic_{data}_labels.csv")

# get_labels(data="train")
# print("Task completed.")

In [16]:
raw_labels = pd.read_csv(f"mimic_train_labels.csv")
num_problems = len(np.where(raw_labels["measures"].to_numpy() < 0)[0])

num_not_valid = len(np.where(raw_labels["measures"].to_numpy() == -1.0)[0])
num_no_units = len(np.where(raw_labels["measures"].to_numpy() == -2.0)[0])

# These are images without endotracheal devices
print(f"Number of volumes with invalid text files: {num_not_valid}")
# Delete these rows
print(f"Number of volumes with no units: {num_no_units}")
print(f"Total number of problematic volumes: {num_problems}")

Number of volumes with invalid text files: 667
Number of volumes with no units: 13
Total number of problematic volumes: 680
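The same tally can be written more compactly by mapping each sentinel value to a name and counting. This is a sketch on made-up measurements, not the real labels: the names `SENTINELS` and `tally_measures` are hypothetical, and in the notebook you would pass `raw_labels["measures"].tolist()` instead of the example list.

```python
from collections import Counter

# Sentinel codes used in this notebook: -1.0 = not valid, -2.0 = no units
SENTINELS = {-1.0: "not_valid", -2.0: "no_units"}

def tally_measures(measures):
    """Count sentinel codes; anything that is not a sentinel counts as usable."""
    return Counter(SENTINELS.get(m, "usable") for m in measures)

# Made-up measurements for illustration only
example = [4.5, -1.0, 6.0, -2.0, -1.0, 3.0]
counts = tally_measures(example)
print(dict(counts))
```

Keeping the sentinel meanings in one dictionary also avoids repeating the magic numbers `-1.0` and `-2.0` throughout the analysis cells.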


In [17]:
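# Prune data: keep only rows with a real measurement, dropping the
# sentinel values -1.0 (not valid) and -2.0 (no units)
pruned = raw_labels[raw_labels.measures > -1.0]

# index=False avoids writing the row index as an extra "Unnamed: 0" column
pruned.to_csv("mimic_train_labels_pruned.csv", index=False)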

new = pd.read_csv(f"mimic_train_labels_pruned.csv")
num_problems = len(np.where(new["measures"].to_numpy() < 0)[0])

num_not_valid = len(np.where(new["measures"].to_numpy() == -1.0)[0])
num_no_units = len(np.where(new["measures"].to_numpy() == -2.0)[0])

# These are images without endotracheal devices
print(f"Number of volumes with invalid text files: {num_not_valid}")
# Delete these rows
print(f"Number of volumes with no units: {num_no_units}")
print(f"Total number of problematic volumes: {num_problems}")
print(f"Total number of data points that remain: {new['measures'].to_numpy().shape}")

Number of volumes with invalid text files: 0
Number of volumes with no units: 0
Total number of problematic volumes: 0
Total number of data points that remain: (11565,)
