<a href="https://colab.research.google.com/github/jeet1912/florence-2_ablationStudy/blob/main/mimic_cxr.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**CheXpert-derived labels** were used for the following reasons:
  - Wider clinical coverage (more findings per study captured).
  - Multi-label ready (each study has 14 possible findings).
  - Includes uncertain cases (better mimics clinical reality).
  - CheXpert prioritization logic (positive > uncertain > negative) is more consistent with human diagnostic decision-making.
  - Standardized mention extraction patterns (including synonyms and abbreviations).

The following are the labels extracted using CheXpert:
- No Finding
- Enlarged Cardiomediastinum
- Cardiomegaly
- Lung Opacity
- Lung Lesion
- Edema
- Consolidation
- Pneumonia
- Atelectasis
- Pneumothorax
- Pleural Effusion
- Pleural Other
- Fracture
- Support Devices


| Label Value | Meaning       | Interpretation                                          |
| ----------- | ------------- | ------------------------------------------------------- |
| `1.0`       | Positive      | Condition is clearly **present**                        |
| `0.0`       | Negative      | Condition is clearly **absent**                         |
| `-1.0`      | Uncertain     | Report mentions the condition, but with **uncertainty** |
| `null`      | Not mentioned | No statement about the condition (blank in the CSV)     |
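
As a minimal sketch of how the four label values in the table can be decoded in code, the helper below maps a raw value to its meaning. The names `LABEL_MEANINGS` and `describe_label` are hypothetical and not part of this notebook; the sketch assumes that "not mentioned" arrives either as `None` or as `NaN`, which is how pandas loads a blank CSV cell.

```python
import math

# Hypothetical lookup table (not part of the notebook): label value -> meaning.
LABEL_MEANINGS = {1.0: "positive", 0.0: "negative", -1.0: "uncertain"}


def describe_label(value):
    """Return the meaning of one CheXpert label value, following the table above."""
    # Assumption: a missing label shows up as None or as NaN (blank CSV cell).
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return "not mentioned"
    return LABEL_MEANINGS[float(value)]


print([describe_label(v) for v in (1.0, 0.0, -1.0, float("nan"))])
```

Applied to a label column, this could be used as `df["Edema"].map(describe_label)`, since `Series.map` passes `NaN` values to the function by default.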


There are over 14 view positions for the chest X-rays in the dataset.

For chest X-rays, **PA (Posteroanterior)** and **AP (Anteroposterior)** are the most common and diagnostically rich frontal views:

- **PA** is the standard in ambulatory settings:
  - The patient stands facing the detector.
  - Results in a clearer and less magnified heart silhouette.

- **AP** is often used for bedridden or ICU patients:
  - Taken with the detector behind the patient.
  - More prone to artifacts, but still interpretable.

These views are preferred because:

- They provide a **full frontal projection** of the chest.
- They are **easier to learn from** due to greater dataset availability.
- They offer **more consistent anatomical display** across patients.
- They are **commonly labeled** as part of “normal” or “abnormal” classes in training datasets.

By **restricting training to PA/AP views**, we:
- **Reduce variability** caused by pose or projection differences.
- Help the model **focus on pathology**, rather than view-dependent artifacts.
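
The PA/AP restriction described above can be sketched as a simple pandas filter. This is an illustration under assumptions rather than the exact step used to build the cohort: `keep_frontal` and `FRONTAL_VIEWS` are hypothetical names, the toy rows are made up, and only the `ViewPosition` column name is taken from the metadata shown later in this notebook.

```python
import pandas as pd

# Frontal projections kept for training (hypothetical constant name).
FRONTAL_VIEWS = ["PA", "AP"]


def keep_frontal(df, view_col="ViewPosition"):
    """Return only the rows whose view position is PA or AP."""
    # isin() is False for missing values, so rows without a view are dropped too.
    return df[df[view_col].isin(FRONTAL_VIEWS)].reset_index(drop=True)


# Toy frame standing in for the real metadata; the values are made up.
toy = pd.DataFrame({
    "dicom_id": ["a", "b", "c", "d"],
    "ViewPosition": ["PA", "LATERAL", "AP", None],
})
frontal = keep_frontal(toy)
print(frontal["dicom_id"].tolist())
```

Resetting the index after filtering keeps positional lookups predictable in later cells, at the cost of discarding the original row numbers.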


In [1]:
!pip install --upgrade gcsfs



In [2]:
!pip install --upgrade google-cloud-storage



In [16]:
import base64
import io
import os
import shutil

import gcsfs
import pandas as pd
from google.cloud import storage
from google.colab import auth, drive

# Authenticate this Colab session for Google Cloud access
auth.authenticate_user()

In [10]:
drive.mount('/content/drive', force_remount=True)

Mounted at /content/drive


In [12]:
bucket_name = "filtered_cxr"
file_path = "cxr"
gcs_path = f"gs://{bucket_name}/{file_path}"

df = pd.read_csv(gcs_path)
df.head()

Unnamed: 0,subject_id,study_id,dicom_id,Atelectasis,Cardiomegaly,Consolidation,Edema,Enlarged_Cardiomediastinum,Fracture,Lung_Lesion,...,Pleural_Effusion,Pleural_Other,Pneumonia,Pneumothorax,Support_Devices,split,ViewPosition,Rows,Columns,study_datetime
0,14887088,54257662,1bc85033-355accce-e8d0ed50-78188cd3-dac92e86,,0.0,0.0,,,,,...,,,,,1.0,train,AP,2539,776,2135-01-27 04:49:44.968
1,18650767,54780106,7920918a-6b7415f5-f5b6f7d5-815bb396-c6702eb3,,-1.0,,-1.0,,,,...,,,1.0,,,train,PA,1504,1188,2131-09-14 22:14:49.221
2,11548266,59905684,95bf2c9c-fd2e6da5-fe82e5db-032420dd-d4daa45c,,,,,,,,...,,,,,,train,PA,1713,1309,2129-04-20 13:37:47.133
3,13191147,55483625,b9cf3249-939bd85e-aaf9795b-d2c0a201-3c0d322c,,,,0.0,,1.0,,...,,,,,,train,PA,2000,1356,2184-12-08 12:03:51.120
4,13880916,53360233,729183be-b03670e1-35db29c1-356f54ed-94481994,,,,,,,,...,,,,,,train,PA,1986,1380,2170-04-27 12:42:31.124


In [13]:
df['image_data'] = None
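
The `image_data` column initialized above is later filled with base64 text, one string per image. As a rough sketch of that encoding and of how to reverse it, the round trip below uses only the standard `base64` module. The helper names `encode_image` and `decode_image` are hypothetical, and the byte string is a made-up stand-in rather than a real X-ray.

```python
import base64


def encode_image(img_bytes: bytes) -> str:
    """Encode raw image bytes as base64 text that can be stored in a CSV cell."""
    return base64.b64encode(img_bytes).decode("utf-8")


def decode_image(b64_text: str) -> bytes:
    """Recover the original image bytes from the base64 text."""
    return base64.b64decode(b64_text)


# Made-up payload: a JPEG-like header followed by padding, not a real image.
fake_jpeg = b"\xff\xd8\xff\xe0" + b"\x00" * 8
text = encode_image(fake_jpeg)
restored = decode_image(text)
print(len(fake_jpeg), len(text), restored == fake_jpeg)
```

Base64 emits four characters for every three input bytes, so the stored text is about a third larger than the JPEG itself. That overhead is worth keeping in mind when many images are embedded in a single CSV.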

In [20]:
billing_project = 'ablation-study'
fs = gcsfs.GCSFileSystem(project=billing_project, requester_pays=True)

In [None]:
# Download each JPEG and embed it in the DataFrame as a base64 string
for idx, row in df.iterrows():
    sub = str(row['subject_id'])
    st  = str(row['study_id'])
    dic = str(row['dicom_id'])
    # Top-level folder is "p" plus the first two digits of the subject ID
    prefix = f"p{sub[:2]}"
    image_path = f'mimic-cxr-jpg-2.1.0.physionet.org/files/{prefix}/p{sub}/s{st}/{dic}.jpg'

    try:
        with fs.open(image_path, 'rb') as img_file:
            img_bytes = img_file.read()
        b64 = base64.b64encode(img_bytes).decode('utf-8')
        df.at[idx, 'image_data'] = b64
        print(f"✅ Embedded image for {dic}")
    except Exception as e:
        # Log the failure and store an empty string so the row is kept
        print(f"❌ Failed {dic}: {e}")
        df.at[idx, 'image_data'] = ""

# Save the new CSV to local disk
out_csv = "/content/filtered_cohort_with_images.csv"
df.to_csv(out_csv, index=False)

# Copy the CSV to Google Drive (optional)
drive_cxr_dir = "/content/drive/MyDrive/cxr"
os.makedirs(drive_cxr_dir, exist_ok=True)
!cp "{out_csv}" "{drive_cxr_dir}/filtered_cohort_with_images.csv"

print("✅ CSV with embedded images saved to Google Drive at: cxr/filtered_cohort_with_images.csv")

✅ Embedded image for 1bc85033-355accce-e8d0ed50-78188cd3-dac92e86
✅ Embedded image for 7920918a-6b7415f5-f5b6f7d5-815bb396-c6702eb3
✅ Embedded image for 95bf2c9c-fd2e6da5-fe82e5db-032420dd-d4daa45c
✅ Embedded image for b9cf3249-939bd85e-aaf9795b-d2c0a201-3c0d322c
✅ Embedded image for 729183be-b03670e1-35db29c1-356f54ed-94481994
✅ Embedded image for ef0b5e29-2f62aa4b-4c57e7a1-a511837a-e480eea0
✅ Embedded image for f33c14c1-7da00e1d-0b25c925-9ba25639-9210fefd
✅ Embedded image for 12622d34-1a419a0d-6809a110-318f0fb1-eb8635e6
✅ Embedded image for e35b7ee2-29df0d95-f9449c63-97920a18-0517a4ca
✅ Embedded image for 399cdc12-0687c51c-12a16b0d-a80cba18-8822dc33
✅ Embedded image for abb6d09f-211a3765-7aa525f0-5f49935f-240cb4dc
✅ Embedded image for 6a2f92bf-fc003a6c-0157ddcc-e73294c3-290ba945
✅ Embedded image for 29b7295a-befc9d9e-7f774c6f-794c3388-d4b65420
✅ Embedded image for 0fd413de-52d4b5bc-985b64a5-df3b60c4-ed15d568
✅ Embedded image for 9c688249-ad293971-52c869e3-2f2a487e-cef1bf9a
✅ Embedded