## AIMI High School Internship 2023
### Notebook 1: Extracting Labels from Radiology Reports

**The Problem**: Given a chest X-ray, our goal in this project is to predict the distance from an endotracheal tube to the carina. This is an important clinical task: endotracheal tubes positioned too far (>5 cm) above the carina will not work effectively.

To train a model that can predict tube distances from chest X-rays, we need a ***training set*** of chest X-rays with labeled tube distances. However, when working with real-world medical data, important labels (e.g. endotracheal tube distances) are often not annotated ahead of time. Often the only data a researcher has access to are the raw images and the free-form clinical text written by the radiologist.

**Your First Task**: Given a set of chest X-rays and paired radiology reports, your goal is to use natural language processing tools to extract endotracheal tube distances from the reports.

**Looking Ahead**: When you complete this task, you should have a training dataset with chest X-rays labeled with endotracheal tube distances. You will later use this dataset to train a computer vision model that predicts the tube distance given an image.

### Load Data

Upload the four files: `mimic-train.zip`, `mimic-test.zip`, `mimic_train_student.csv`, and `mimic_test_student.csv`. The upload should take about 10 minutes. Then run the following cells to unzip the dataset (which should take less than 10 seconds).

In [None]:
!unzip content/mimic-train.zip

In [None]:
!unzip content/mimic-test.zip

In [None]:
!ls content

### Understanding the Data

Let's first go through some terminology. Medical data is often stored in a hierarchy consisting of three levels: patient, study, and images.
- Patient: A patient is a single unique individual.
- Study: Each patient may have multiple sets of images taken, perhaps on different days. Each set of images is referred to as a *study*.
- Images: Each study consists of one or more *images*.

Chest X-ray images and radiology reports are stored in `content/` (the folder that the code cells in this notebook read from) and are organized as follows:
- `content/mimic-train`:
  - Images: The MIMIC training set consists of 5313 subfolders, each representing a patient. Every patient has one or more studies, which are stored as subfolders. Images are stored in study folders as 512x512-pixel `.jpg` files.
  - Text: Reports are stored in patient folders as `.txt` files. The filename corresponds to the study id, and the content of the report applies to all images in the corresponding study.
- `content/mimic-test`: The MIMIC test set is organized in the same way as the MIMIC training set. Note that this is a held-out test set with 500 images that we will use for scoring models, so reports are not provided!
- `content/mimic_train_student.csv`: This spreadsheet provides mappings between image paths, report paths, patient ids, study ids, and image ids for samples in the training set.
- `content/mimic_test_student.csv`: This spreadsheet provides mappings between image paths, patient ids, study ids, and image ids for samples in the test set.
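
The layout above means that every id can be recovered from an image path alone. A minimal sketch, assuming paths follow the `<split>/<patient>/<study>/<image>.jpg` pattern seen in the CSV files; `parse_image_path` is a hypothetical helper, not part of the provided materials:

```python
from pathlib import PurePosixPath

def parse_image_path(image_path):
    """Split '<split>/<patient>/<study>/<image>.jpg' into its id components."""
    parts = PurePosixPath(image_path).parts
    if len(parts) != 4:
        raise ValueError(f"unexpected path layout: {image_path!r}")
    split, patient_id, study_id, filename = parts
    image_id = PurePosixPath(filename).stem
    # The report sits in the patient folder and is named after the study.
    report_path = str(PurePosixPath(split, patient_id, f"{study_id}.txt"))
    return {
        "patient_id": int(patient_id),
        "study_id": int(study_id),
        "image_id": int(image_id),
        "report_path": report_path,
    }

info = parse_image_path("mimic-train/13282/56112/91263.jpg")
print(info)
```

In practice the CSV files already provide these columns, so a helper like this is mainly useful for checking that paths and ids agree.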

In [1]:
# Example Image
from PIL import Image
img = Image.open("content/mimic-train/12000/59707/90529.jpg")
img.show()

In [2]:
# Example Text Report
with open("content/mimic-train/12000/59707.txt", "r") as f:
    txt = f.readlines()
txt

['                                 FINAL REPORT\n',
 ' PORTABLE CHEST ___\n',
 ' \n',
 ' COMPARISON:  ___ radiograph.\n',
 ' \n',
 ' FINDINGS:  Tip of endotracheal tube terminates 6 cm above the carina. \n',
 ' Cardiomediastinal contours are normal in appearance, and lungs are grossly\n',
 ' clear.\n']

In [63]:
# Load csv file with mappings
import pandas as pd
df = pd.read_csv('content/mimic_train_student.csv',index_col=0)
df

| | patient_id | study_id | image_id | image_path | report_path |
|---|---|---|---|---|---|
| 0 | 13282 | 56112 | 91263 | mimic-train/13282/56112/91263.jpg | mimic-train/13282/56112.txt |
| 1 | 13282 | 58693 | 86967 | mimic-train/13282/58693/86967.jpg | mimic-train/13282/58693.txt |
| 2 | 13360 | 54397 | 84764 | mimic-train/13360/54397/84764.jpg | mimic-train/13360/54397.txt |
| 3 | 13360 | 57560 | 92873 | mimic-train/13360/57560/92873.jpg | mimic-train/13360/57560.txt |
| 4 | 13360 | 62326 | 88457 | mimic-train/13360/62326/88457.jpg | mimic-train/13360/62326.txt |
| ... | ... | ... | ... | ... | ... |
| 12240 | 13795 | 60202 | 87633 | mimic-train/13795/60202/87633.jpg | mimic-train/13795/60202.txt |
| 12241 | 13795 | 60202 | 82617 | mimic-train/13795/60202/82617.jpg | mimic-train/13795/60202.txt |
| 12242 | 13818 | 59053 | 93743 | mimic-train/13818/59053/93743.jpg | mimic-train/13818/59053.txt |
| 12243 | 13906 | 62812 | 85124 | mimic-train/13906/62812/85124.jpg | mimic-train/13906/62812.txt |


In [286]:
# Read the full report text for 10 randomly sampled rows
samp = df.sample(10)
sample_reports = []
for path in samp['report_path']:
    # The context manager closes each file after reading
    with open('content/' + path, "r") as f:
        sample_reports.append(f.read())

### Extracting Tube Distance Labels

You're now ready to begin this task! Keep in mind that not every report in the training set contains endotracheal tube distance information, and there may be several edge cases to consider (for example, distances given in mm rather than cm, or a tube described only as being at the level of the carina).
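
One way to explore these edge cases is to check first whether a report mentions an endotracheal tube at all. The following is a minimal sketch, not part of the provided materials: `mentions_ett` is a hypothetical helper, and the keyword list is an assumption that will miss phrasings it does not anticipate.

```python
import re

# Hypothetical keyword list; real reports may use other phrasings.
ETT_PATTERN = re.compile(r"\b(endotracheal|ett|et\s+tube)\b", re.IGNORECASE)

def mentions_ett(report_text):
    """Return True if the report appears to mention an endotracheal tube."""
    return ETT_PATTERN.search(report_text) is not None

print(mentions_ett("Tip of endotracheal tube terminates 6 cm above the carina."))
print(mentions_ett("Cardiomediastinal contours are normal in appearance."))
```

A check like this only narrows down which reports are worth a closer look; it does not extract the distance itself.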

In [8]:
sample = ['                                 FINAL REPORT\n',
 ' PORTABLE CHEST ___\n',
 ' \n',
 ' COMPARISON:  ___ radiograph.\n',
 ' \n',
 ' FINDINGS:  Tip of endotracheal tube terminates 6 cm above the carina. \n',
 ' Cardiomediastinal contours are normal in appearance, and lungs are grossly\n',
 ' clear.\n']

In [91]:
sample_reports[4].strip()

'FINAL REPORT\n INDICATION:  ___ s/p DD liver transplant admitted to SICU // pls assess for\n ptx, fluid status, any acute abnormalities\n \n TECHNIQUE:  Chest PA and lateral\n \n COMPARISON:  ___\n \n FINDINGS: \n \n Endotracheal tube in situ with the tip 45 mm proximal to the carina. \n Swan-Ganz catheter in situ in the appropriate position.  Right-sided IJV\n sheath in situ with the tip in the proximal SVC. The cardiomediastinal shadow\n is normal. No pulmonary edema. No pneumothorax. No pleural effusion. No\n airspace consolidation. NG tube in situ coursing out of sight inferiorly. \n Abdominal ___-___ drain in situ.\n \n IMPRESSION: \n \n No pneumothorax or pulmonary edema.'

In [178]:
from transformers import pipeline

model_name = "deepset/tinyroberta-squad2"

# Build an extractive question-answering pipeline from the pretrained model
nlp = pipeline('question-answering', model=model_name, tokenizer=model_name)

In [289]:
# Ask the same question of every sampled report and print the model's answer
for n, rep in enumerate(sample_reports):
    QA_input = {
        'question': 'What is the current distance between the carina and the endotracheal (ET) tube (ETT) or tracheostomy tube?',
        'context': rep.strip().replace('\n', '')
    }
    print(n, nlp(QA_input))

0 {'score': 0.5575271844863892, 'start': 207, 'end': 213, 'answer': '3.0 cm'}
1 {'score': 0.36157333850860596, 'start': 1137, 'end': 1158, 'answer': '6 cm above the carina'}
2 {'score': 0.5293547511100769, 'start': 681, 'end': 686, 'answer': '59 mm'}
3 {'score': 0.10115028917789459, 'start': 263, 'end': 269, 'answer': '3.8 cm'}
4 {'score': 0.21408334374427795, 'start': 507, 'end': 513, 'answer': '5.4 cm'}
5 {'score': 0.6741148233413696, 'start': 295, 'end': 301, 'answer': '4.2 cm'}
6 {'score': 0.4947572350502014, 'start': 143, 'end': 149, 'answer': '6.5 cm'}
7 {'score': 0.7383593916893005, 'start': 1461, 'end': 1467, 'answer': '2.9 cm'}
8 {'score': 0.532941460609436, 'start': 238, 'end': 244, 'answer': '3.5 cm'}
9 {'score': 5.3249912923547527e-08, 'start': 858, 'end': 877, 'answer': 'level of the carina'}


In [290]:
q = 'What is the current distance between the carina and the endotracheal (ET) tube (ETT) or tracheostomy tube?'

def extract_distance(report_path):
    """Return the QA model's answer span for a report, or 'NO DISTANCE'."""
    with open('content/' + report_path, "r") as f:
        report = f.read()
    QA_input = {
        'question': q,
        'context': report.strip().replace('\n', '')
    }
    output = nlp(QA_input)
    # Treat very low-confidence answers as "no distance reported"
    if output['score'] > 0.005:
        return output['answer']
    return 'NO DISTANCE'

In [288]:
list(samp['report_path'].apply(extract_distance))

['3.0 cm',
 '6 cm above the carina',
 '59 mm',
 '3.8 cm',
 '5.4 cm',
 '4.2 cm',
 '6.5 cm',
 '2.9 cm',
 '3.5 cm',
 'NO DISTANCE']

In [None]:
df['distance_raw'] = df['report_path'].apply(extract_distance)
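
Because a report applies to every image in its study, the same `report_path` can appear in several rows, and the cell above runs the model once per row. A minimal sketch of running an extractor once per distinct report and mapping the results back; `apply_per_unique` and `fake_extract` are hypothetical names, and `fake_extract` stands in for `extract_distance` so the example runs without the model:

```python
import pandas as pd

def apply_per_unique(series, func):
    """Apply func once per distinct value in series, then map results back."""
    cache = {value: func(value) for value in series.unique()}
    return series.map(cache)

# Stand-in extractor that records how often it is called.
calls = []
def fake_extract(report_path):
    calls.append(report_path)
    return f"label for {report_path}"

toy = pd.DataFrame({
    "report_path": [
        "mimic-train/13795/60202.txt",
        "mimic-train/13795/60202.txt",
        "mimic-train/13818/59053.txt",
    ]
})
toy["distance_raw"] = apply_per_unique(toy["report_path"], fake_extract)
print(len(calls))  # 2 calls for 3 rows
```

With the real data this would look like `apply_per_unique(df['report_path'], extract_distance)`; how much time it saves depends on how many studies contain more than one image.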

In [None]:
# Save next to the other data files so the later read from content/ finds it
df.to_csv('content/with_dist.csv')

In [60]:
import re

In [102]:
def convert_mm(input_text):
    """Convert an answer such as '6 cm above the carina' to millimeters."""
    # Missing values read back from the CSV are not strings
    if not isinstance(input_text, str) or input_text == "NO DISTANCE":
        return None
    # A number (optionally with decimals) followed by a unit
    match = re.search(r"(\d+(?:\.\d+)?)\s?(cm|mm|centimeter)", input_text, re.IGNORECASE)
    if match is None:
        return None
    num = float(match.group(1))
    unit = match.group(2).lower()
    if unit in ('cm', 'centimeter'):
        num *= 10
    return num
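
The pattern above recognizes `cm`, `mm`, and `centimeter`, so a spelled-out `millimeter` would not match. A minimal sketch of a variant that uses a unit lookup table instead; `to_mm`, `UNIT_TO_MM`, and `DIST_PATTERN` are hypothetical names, and whether spelled-out millimeters actually occur in the reports is an assumption worth checking against the data:

```python
import re

# Conversion factor to millimeters for each recognized unit.
UNIT_TO_MM = {"mm": 1.0, "millimeter": 1.0, "cm": 10.0, "centimeter": 10.0}
DIST_PATTERN = re.compile(
    r"(\d+(?:\.\d+)?)\s*(millimeter|centimeter|mm|cm)", re.IGNORECASE
)

def to_mm(text):
    """Return the first distance found in text in millimeters, or None."""
    if not isinstance(text, str):
        return None
    match = DIST_PATTERN.search(text)
    if match is None:
        return None
    value, unit = match.groups()
    return float(value) * UNIT_TO_MM[unit.lower()]

print(to_mm("6 cm above the carina"))  # 60.0
print(to_mm("59 mm"))                  # 59.0
print(to_mm("level of the carina"))    # None
```

Adding a new unit then only requires a new dictionary entry and a new alternative in the pattern.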

In [74]:
with_dist = pd.read_csv('content/with_dist.csv',index_col=0)

In [103]:
with_dist['distance_mm'] = with_dist['distance_raw'].apply(convert_mm)

In [114]:
# Keep only rows with a positive parsed distance (missing values are dropped)
df_mm = with_dist[with_dist['distance_mm'] > 0].reset_index(drop=True)
df_mm

| | patient_id | study_id | image_id | image_path | report_path | distance_raw | distance_mm |
|---|---|---|---|---|---|---|---|
| 0 | 13282 | 56112 | 91263 | mimic-train/13282/56112/91263.jpg | mimic-train/13282/56112.txt | 6 cm | 60.0 |
| 1 | 13360 | 54397 | 84764 | mimic-train/13360/54397/84764.jpg | mimic-train/13360/54397.txt | 5.6 cm | 56.0 |
| 2 | 13360 | 57560 | 92873 | mimic-train/13360/57560/92873.jpg | mimic-train/13360/57560.txt | 4.6 cm | 46.0 |
| 3 | 13360 | 62326 | 88457 | mimic-train/13360/62326/88457.jpg | mimic-train/13360/62326.txt | 5 cm | 50.0 |
| 4 | 13360 | 59248 | 87908 | mimic-train/13360/59248/87908.jpg | mimic-train/13360/59248.txt | 1.8 cm | 18.0 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 11451 | 13795 | 60202 | 87633 | mimic-train/13795/60202/87633.jpg | mimic-train/13795/60202.txt | 3.7 cm | 37.0 |
| 11452 | 13795 | 60202 | 82617 | mimic-train/13795/60202/82617.jpg | mimic-train/13795/60202.txt | 3.7 cm | 37.0 |
| 11453 | 13818 | 59053 | 93743 | mimic-train/13818/59053/93743.jpg | mimic-train/13818/59053.txt | 4.7 cm | 47.0 |
| 11454 | 13906 | 62812 | 85124 | mimic-train/13906/62812/85124.jpg | mimic-train/13906/62812.txt | 3.5 cm | 35.0 |
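
With distances in millimeters, the >5 cm criterion from the problem statement becomes a simple comparison. A minimal sketch on the first five `distance_mm` values shown above; the column name `too_high` and the constant `TOO_HIGH_MM` are hypothetical, and the strict `>` follows the wording of the problem statement:

```python
import pandas as pd

TOO_HIGH_MM = 50.0  # the >5 cm threshold from the problem statement

# First five distance_mm values from the table above
toy = pd.DataFrame({"distance_mm": [60.0, 56.0, 46.0, 50.0, 18.0]})
toy["too_high"] = toy["distance_mm"] > TOO_HIGH_MM
print(toy["too_high"].tolist())  # [True, True, False, False, False]
```

The same comparison applied to `df_mm['distance_mm']` would give a quick view of how many labeled tubes sit above the threshold.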


In [116]:
df_mm.to_csv('content/distances.csv')