## Process CheXpert Data
- Main aim: train the vision encoder used in LLaVA-Med so that it better recognises features in the image

#### Data Representation

According to the source: https://physionet.org/content/mimic-cxr-jpg/2.1.0/

The original data is represented as follows. Each label column contains one of four values: 1.0, -1.0, 0.0, or missing. These values have the following interpretation:

- **1.0**: The label was positively mentioned in the associated study, and is present in one or more of the corresponding images
    - e.g. "A large pleural effusion"
- **0.0**: The label was negatively mentioned in the associated study, and therefore should not be present in any of the corresponding images
    - e.g. "No pneumothorax."
- **-1.0**: The label was either: (1) mentioned with uncertainty in the report, and therefore may or may not be present to some degree in the corresponding image, or (2) mentioned with ambiguous language in the report and it is unclear if the pathology exists or not
    - Explicit uncertainty: "The cardiac size cannot be evaluated."
    - Ambiguous language: "The cardiac contours are stable."
- **Missing** (empty element): No mention of the label was made in the report

**We replace the missing values with the placeholder value 2.0, so that every label column is fully populated when training the vision encoder of the LLaVA-Med model to better extract features from X-ray images.**
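
To see how the four label states are distributed once the placeholder is in place, the counts can be tabulated per column. The following is a minimal sketch on a toy frame: the two column names come from the CheXpert CSV, but the values are invented for illustration, and the `label_counts` helper and `MISSING` constant are hypothetical names, not part of the notebook.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the CheXpert label file; the values are invented for illustration.
toy = pd.DataFrame({
    "Atelectasis": [1.0, np.nan, -1.0, 0.0],
    "Pneumonia": [np.nan, np.nan, 1.0, -1.0],
})

MISSING = 2.0  # placeholder for "label not mentioned in the report"


def label_counts(frame, placeholder=MISSING):
    """Count each label state per column after filling missing values."""
    filled = frame.fillna(placeholder)
    # value_counts per column; states absent from a column come back as NaN, so fill with 0.
    counts = filled.apply(lambda col: col.value_counts()).fillna(0).astype(int)
    # Fix the row order: positive, negative, uncertain, not mentioned.
    return counts.reindex([1.0, 0.0, -1.0, placeholder], fill_value=0)


counts = label_counts(toy)
print(counts)
```

On the real label file, a table like this would show how heavily each column is dominated by the "not mentioned" state, which may matter when choosing a loss or sampling strategy.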

In [1]:
import pandas as pd
import numpy as np

In [2]:
df = pd.read_csv("raw_data/mimic-cxr-2.0.0-chexpert.csv")

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 227827 entries, 0 to 227826
Data columns (total 16 columns):
 #   Column                      Non-Null Count   Dtype  
---  ------                      --------------   -----  
 0   subject_id                  227827 non-null  int64  
 1   study_id                    227827 non-null  int64  
 2   Atelectasis                 57666 non-null   float64
 3   Cardiomegaly                66799 non-null   float64
 4   Consolidation               23076 non-null   float64
 5   Edema                       65833 non-null   float64
 6   Enlarged Cardiomediastinum  21837 non-null   float64
 7   Fracture                    5831 non-null    float64
 8   Lung Lesion                 8287 non-null    float64
 9   Lung Opacity                58425 non-null   float64
 10  No Finding                  75455 non-null   float64
 11  Pleural Effusion            87272 non-null   float64
 12  Pleural Other               2902 non-null    float64
 13  Pneumonia     

In [4]:
df

Unnamed: 0,subject_id,study_id,Atelectasis,Cardiomegaly,Consolidation,Edema,Enlarged Cardiomediastinum,Fracture,Lung Lesion,Lung Opacity,No Finding,Pleural Effusion,Pleural Other,Pneumonia,Pneumothorax,Support Devices
0,10000032,50414267,,,,,,,,,1.0,,,,,
1,10000032,53189527,,,,,,,,,1.0,,,,,
2,10000032,53911762,,,,,,,,,1.0,,,,,
3,10000032,56699142,,,,,,,,,1.0,,,,,
4,10000764,57375967,,,1.0,,,,,,,,,-1.0,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
227822,19999442,58708861,,,,,,,,,1.0,,,,,1.0
227823,19999733,57132437,,,,,,,,,1.0,,,,,
227824,19999987,55368167,1.0,-1.0,,,,,0.0,,,0.0,,,0.0,
227825,19999987,58621812,1.0,,,,,,,,,,,,,1.0


In [5]:
processed_df = df.fillna(2.0)

In [6]:
processed_df

Unnamed: 0,subject_id,study_id,Atelectasis,Cardiomegaly,Consolidation,Edema,Enlarged Cardiomediastinum,Fracture,Lung Lesion,Lung Opacity,No Finding,Pleural Effusion,Pleural Other,Pneumonia,Pneumothorax,Support Devices
0,10000032,50414267,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,1.0,2.0,2.0,2.0,2.0,2.0
1,10000032,53189527,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,1.0,2.0,2.0,2.0,2.0,2.0
2,10000032,53911762,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,1.0,2.0,2.0,2.0,2.0,2.0
3,10000032,56699142,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,1.0,2.0,2.0,2.0,2.0,2.0
4,10000764,57375967,2.0,2.0,1.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,-1.0,2.0,2.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
227822,19999442,58708861,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,1.0,2.0,2.0,2.0,2.0,1.0
227823,19999733,57132437,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,1.0,2.0,2.0,2.0,2.0,2.0
227824,19999987,55368167,1.0,-1.0,2.0,2.0,2.0,2.0,0.0,2.0,2.0,0.0,2.0,2.0,0.0,2.0
227825,19999987,58621812,1.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,1.0
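
`df.fillna(2.0)` fills every column, including the identifier columns. In this file `subject_id` and `study_id` have no missing values, so the result is the same, but a slightly more defensive variant restricts the fill to the label columns and checks that only the expected values remain. This is a sketch under that assumption; `fill_labels`, `ID_COLUMNS`, and `ALLOWED` are hypothetical names and the toy rows are invented.

```python
import numpy as np
import pandas as pd

ID_COLUMNS = ["subject_id", "study_id"]
ALLOWED = {-1.0, 0.0, 1.0, 2.0}  # uncertain, negative, positive, placeholder


def fill_labels(frame, placeholder=2.0):
    """Fill missing values in the label columns only and validate the result."""
    label_cols = [c for c in frame.columns if c not in ID_COLUMNS]
    out = frame.copy()
    out[label_cols] = out[label_cols].fillna(placeholder)
    # Any value outside the four known states indicates a problem in the input.
    unexpected = set(np.unique(out[label_cols].to_numpy())) - ALLOWED
    if unexpected:
        raise ValueError(f"Unexpected label values: {sorted(unexpected)}")
    return out


# Toy rows, invented for illustration.
toy = pd.DataFrame({
    "subject_id": [1, 2],
    "study_id": [10, 20],
    "Edema": [np.nan, 1.0],
    "Fracture": [0.0, np.nan],
})
filled = fill_labels(toy)
print(filled)
```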


### Get the image paths and merge with the CheXpert findings

In [7]:
reports_df = pd.read_csv("processed_data/single_image_reports_removed_comparisons_removed_history.csv")

In [8]:
reports_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 34576 entries, 0 to 34575
Data columns (total 9 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   study_id     34576 non-null  int64 
 1   size         34576 non-null  int64 
 2   subject_id   34576 non-null  int64 
 3   report_path  34576 non-null  object
 4   dicom_id     34576 non-null  object
 5   split        34576 non-null  object
 6   image_path   34576 non-null  object
 7   report       34576 non-null  object
 8   indication   34576 non-null  object
dtypes: int64(3), object(6)
memory usage: 2.4+ MB


In [9]:
reports_df = reports_df[["study_id", "subject_id", "image_path", "report", "split"]]
reports_df

Unnamed: 0,study_id,subject_id,image_path,report,split
0,50000014,11941242,files/p11/p11941242/s50000014/dffc8ab2-ff37704...,Lung volumes are low. Retrocardiac opacity wit...,train
1,50000125,19309850,files/p19/p19309850/s50000125/dfa001f0-9c3d0a8...,There has been interval resolution of the mode...,train
2,50000198,16548129,files/p16/p16548129/s50000198/b66847d6-6848ea1...,Heart size is normal. The mediastinal and hila...,train
3,50000511,13658672,files/p13/p13658672/s50000511/5f930e4e-77b4587...,Single portable view of the chest. There is le...,train
4,50001417,18283050,files/p18/p18283050/s50001417/5b37ed21-cd7243b...,The patient is status post a right upper lobec...,train
...,...,...,...,...,...
34571,59998831,11911069,files/p11/p11911069/s59998831/be6982b6-5d4333f...,A right internal jugular port-a-cath is presen...,train
34572,59999179,15654175,files/p15/p15654175/s59999179/7516bc44-0c66852...,"An ng tube is present, the tip and side port l...",train
34573,59999335,11216730,files/p11/p11216730/s59999335/7ef1a4c7-a7c0b86...,No change.,train
34574,59999824,19148695,files/p19/p19148695/s59999824/66bd155f-6b30082...,Heart size remains mildly enlarged. Mediastina...,train


In [10]:
reports_df["split"].unique()

array(['train', 'test', 'validate'], dtype=object)

In [11]:
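# Inner join (the pandas default) on the study keys; reports without CheXpert labels are dropped.
merged_df = pd.merge(reports_df, processed_df, on=["subject_id", "study_id"], suffixes=("", "_remove"))
# Drop any duplicated columns that received the "_remove" suffix during the merge.
merged_df.drop([i for i in merged_df.columns if "remove" in i], axis=1, inplace=True)
merged_df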

Unnamed: 0,study_id,subject_id,image_path,report,split,Atelectasis,Cardiomegaly,Consolidation,Edema,Enlarged Cardiomediastinum,Fracture,Lung Lesion,Lung Opacity,No Finding,Pleural Effusion,Pleural Other,Pneumonia,Pneumothorax,Support Devices
0,50000014,11941242,files/p11/p11941242/s50000014/dffc8ab2-ff37704...,Lung volumes are low. Retrocardiac opacity wit...,train,-1.0,1.0,2.0,0.0,2.0,2.0,2.0,1.0,2.0,2.0,2.0,-1.0,2.0,2.0
1,50000125,19309850,files/p19/p19309850/s50000125/dfa001f0-9c3d0a8...,There has been interval resolution of the mode...,train,2.0,1.0,2.0,0.0,2.0,2.0,2.0,1.0,2.0,0.0,2.0,2.0,0.0,1.0
2,50000198,16548129,files/p16/p16548129/s50000198/b66847d6-6848ea1...,Heart size is normal. The mediastinal and hila...,train,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,1.0,2.0,2.0,2.0,2.0,2.0
3,50000511,13658672,files/p13/p13658672/s50000511/5f930e4e-77b4587...,Single portable view of the chest. There is le...,train,-1.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,-1.0,2.0,2.0
4,50001417,18283050,files/p18/p18283050/s50001417/5b37ed21-cd7243b...,The patient is status post a right upper lobec...,train,2.0,1.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,1.0,2.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
34571,59998831,11911069,files/p11/p11911069/s59998831/be6982b6-5d4333f...,A right internal jugular port-a-cath is presen...,train,2.0,1.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0
34572,59999179,15654175,files/p15/p15654175/s59999179/7516bc44-0c66852...,"An ng tube is present, the tip and side port l...",train,2.0,2.0,2.0,2.0,2.0,2.0,2.0,1.0,2.0,2.0,2.0,2.0,2.0,1.0
34573,59999335,11216730,files/p11/p11216730/s59999335/7ef1a4c7-a7c0b86...,No change.,train,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,1.0,2.0,2.0,2.0,2.0,2.0
34574,59999824,19148695,files/p19/p19148695/s59999824/66bd155f-6b30082...,Heart size remains mildly enlarged. Mediastina...,train,2.0,2.0,2.0,-1.0,2.0,2.0,2.0,2.0,2.0,1.0,2.0,1.0,2.0,2.0
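
Because the merge is an inner join, any report without a matching label row disappears silently. One way to make this visible is to merge with `how="left"` and `indicator=True`, and to ask pandas to verify that each study has at most one label row via `validate="many_to_one"`. The sketch below uses toy frames with invented IDs, paths, and labels; it is an optional check, not the notebook's own method.

```python
import pandas as pd

# Toy stand-ins for reports_df and processed_df; all values are invented.
reports = pd.DataFrame({
    "study_id": [1, 2, 3],
    "subject_id": [100, 200, 300],
    "image_path": ["a.jpg", "b.jpg", "c.jpg"],
})
labels = pd.DataFrame({
    "subject_id": [100, 200],
    "study_id": [1, 2],
    "Edema": [0.0, 2.0],
})

checked = pd.merge(
    reports,
    labels,
    on=["subject_id", "study_id"],
    how="left",               # keep every report so unmatched rows can be counted
    validate="many_to_one",   # raises MergeError if a study has duplicate label rows
    indicator=True,           # adds a "_merge" column: "both" or "left_only"
)

unmatched = checked[checked["_merge"] == "left_only"]
merged = checked[checked["_merge"] == "both"].drop(columns="_merge")
print(f"{len(unmatched)} report(s) without labels, {len(merged)} merged")
```

The `merged` frame here matches what the inner join would produce, while `unmatched` lists the reports that would otherwise be lost.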


In [13]:
def write_csv(df, output_path, split):
    """Write the rows of `df` belonging to `split` to `{output_path}_{split}.csv`."""
    df = df[df["split"] == split]
    df.to_csv(f"{output_path}_{split}.csv", index=False)

In [15]:
write_csv(merged_df, "processed_data/processed_mimic-cxr-2.0.0-chexpert", "train")
write_csv(merged_df, "processed_data/processed_mimic-cxr-2.0.0-chexpert", "test")
write_csv(merged_df, "processed_data/processed_mimic-cxr-2.0.0-chexpert", "validate")
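
As an alternative to calling the writer once per split, the frame can be grouped by its `split` column so that every split present in the data is written and its row count recorded. This is a minimal sketch: `write_splits` is a hypothetical helper, the toy rows are invented, and a temporary directory stands in for `processed_data/`.

```python
import tempfile
from pathlib import Path

import pandas as pd


def write_splits(frame, output_prefix):
    """Write one CSV per value of the "split" column and return the row counts."""
    written = {}
    for split, part in frame.groupby("split"):
        part.to_csv(f"{output_prefix}_{split}.csv", index=False)
        written[split] = len(part)
    return written


# Toy rows, invented for illustration.
toy = pd.DataFrame({
    "image_path": ["a.jpg", "b.jpg", "c.jpg", "d.jpg"],
    "split": ["train", "train", "test", "validate"],
})

with tempfile.TemporaryDirectory() as tmp:
    sizes = write_splits(toy, Path(tmp) / "toy_chexpert")
    reloaded = pd.read_csv(Path(tmp) / "toy_chexpert_train.csv")

print(sizes)
```

Comparing the sum of the returned counts with `len(frame)` is a quick way to confirm that no row was left out of the written files.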

### Check CheXpert data
- Check whether any of the images are corrupted

In [1]:
import pandas as pd
from PIL import Image
import os

In [None]:
# validate_df = pd.read_csv("~/Datasets/mimic-cxr/processed_data/processed_mimic-cxr-2.0.0-chexpert_validate.csv")
# base_path = "/home/FYP/angk0064/Datasets/mimic-cxr-jpg/2.1.0"
# image_paths = []  # paths of images that fail to open or decode

# for image_path in validate_df["image_path"]:
#     try:
#         image = Image.open(os.path.join(base_path, image_path)).convert("RGB")
#     except Exception:
#         image_paths.append(image_path)
#         print(image_path)
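
The same check can be wrapped in a small function and exercised on throwaway files before running it over the full dataset. The sketch below assumes Pillow is installed; `find_unreadable_images` is a hypothetical helper, and the files are created in a temporary directory purely for illustration.

```python
import os
import tempfile

from PIL import Image


def find_unreadable_images(base_path, relative_paths):
    """Return the relative paths whose image cannot be opened or fully decoded."""
    bad = []
    for rel in relative_paths:
        full = os.path.join(base_path, rel)
        try:
            with Image.open(full) as img:
                img.convert("RGB")  # forces a full decode, not just a header read
        except (OSError, ValueError):
            # Missing, unidentified, and truncated files all raise OSError subclasses.
            bad.append(rel)
    return bad


with tempfile.TemporaryDirectory() as tmp:
    # One valid JPEG, one file with garbage bytes, and one path that does not exist.
    Image.new("L", (4, 4)).save(os.path.join(tmp, "good.jpg"))
    with open(os.path.join(tmp, "bad.jpg"), "wb") as fh:
        fh.write(b"not an image")
    unreadable = find_unreadable_images(tmp, ["good.jpg", "bad.jpg", "missing.jpg"])

print(unreadable)
```

Catching `OSError` and `ValueError` rather than using a bare `except` keeps keyboard interrupts and programming errors visible during a long scan.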
