**CheXpert-derived labels** were used for the following reasons:
  - Wider clinical coverage (more findings per study captured).
  - Multi-label ready (each study has 14 possible findings).
  - Includes uncertain cases (better mimics clinical reality).
  - CheXpert prioritization logic (positive > uncertain > negative) is more consistent with human diagnostic decision-making.
  - Standardized mention extraction patterns (including synonyms and abbreviations).
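
The prioritization rule listed above (positive > uncertain > negative) can be sketched in a few lines. This is only an illustration of the ordering, not the CheXpert labeler's own code; `aggregate_mentions` and `_PRIORITY` are hypothetical names, and the sketch assumes each mention of a finding has already been classified as `1.0`, `-1.0`, or `0.0`.

```python
import math
from typing import Iterable, Optional

# Priority order described above: positive > uncertain > negative.
_PRIORITY = {1.0: 3, -1.0: 2, 0.0: 1}


def aggregate_mentions(mention_labels: Iterable[Optional[float]]) -> Optional[float]:
    """Collapse the per-mention labels of one finding into a single study-level label.

    Returns None when the finding is never mentioned.
    """
    best = None
    for value in mention_labels:
        # Skip "not mentioned" entries (None or NaN).
        if value is None or (isinstance(value, float) and math.isnan(value)):
            continue
        if best is None or _PRIORITY[value] > _PRIORITY[best]:
            best = value
    return best


print(aggregate_mentions([0.0, -1.0, 1.0]))  # 1.0
print(aggregate_mentions([0.0, -1.0]))       # -1.0
print(aggregate_mentions([]))                # None
```

Under this ordering, a single positive mention outweighs any number of uncertain or negative mentions of the same finding.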

The following are the labels extracted using CheXpert:
- No Finding
- Enlarged Cardiomediastinum
- Cardiomegaly
- Lung Opacity
- Lung Lesion
- Edema
- Consolidation
- Pneumonia
- Atelectasis
- Pneumothorax
- Pleural Effusion
- Pleural Other
- Fracture
- Support Devices


| Label Value | Meaning       | Interpretation                                    |
| ----------- | ------------- | ------------------------------------------------- |
| `1.0`       | Positive      | Condition is clearly **present**                  |
| `0.0`       | Negative      | Condition is clearly **absent**                   |
| `-1.0`      | Uncertain     | Radiologist **suspects** condition but isn't sure |
| `null`      | Not mentioned | No statement about the condition                  |
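
To see how these four values are distributed for a single finding, the numeric codes can be mapped to their meanings and counted. The sketch below uses a made-up toy column, not rows from the dataset; `toy` and `names` are illustrative names. It relies on pandas loading `null` entries as `NaN`.

```python
import pandas as pd

# Toy column for illustration only; these are not rows from the dataset.
toy = pd.DataFrame({"Edema": [1.0, 0.0, -1.0, None, None]})

# Map each numeric code to the meaning given in the table above.
names = {1.0: "Positive", 0.0: "Negative", -1.0: "Uncertain"}

# NaN has no entry in `names`, so it stays NaN after map() and is filled afterwards.
counts = toy["Edema"].map(names).fillna("Not mentioned").value_counts()
print(counts)
```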


There are over 14 view positions for the chest X-rays in the dataset.

For chest X-rays, **PA (Posteroanterior)** and **AP (Anteroposterior)** are the most common and diagnostically rich frontal views:

- **PA** is the standard in ambulatory settings:
  - The patient stands facing the detector.
  - Results in a clearer and less magnified heart silhouette.

- **AP** is often used for bedridden or ICU patients:
  - Taken with the detector behind the patient.
  - More prone to artifacts, but still interpretable.

These views are preferred because:

- They provide a **full frontal projection** of the chest.
- They are **easier to learn from** because more PA/AP images are available than images of other views.
- They offer **more consistent anatomical display** across patients.
- They are **commonly labeled** as part of “normal” or “abnormal” classes in training datasets.

By **restricting training to PA/AP views**, we:
- **Reduce variability** caused by pose or projection differences.
- Help the model **focus on pathology**, rather than view-dependent artifacts.
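
A minimal sketch of the PA/AP restriction, assuming the view is stored in a `ViewPosition` column as in the dataset loaded below. The toy rows and the `frontal` name are illustrative, and the file read in this notebook may already have been filtered this way.

```python
import pandas as pd

# Toy rows for illustration only; dicom_id values are made up.
toy = pd.DataFrame({
    "dicom_id": ["a", "b", "c", "d"],
    "ViewPosition": ["PA", "AP", "LATERAL", None],
})

# Keep only frontal PA/AP images; rows with other or missing views are dropped.
frontal = toy[toy["ViewPosition"].isin(["PA", "AP"])].reset_index(drop=True)
print(frontal)
```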


In [1]:
import pandas as pd
import gcsfs
import base64
import gc
import time
from tqdm import tqdm
import glob
# For local environment, use Application Default Credentials
# Make sure you've run: gcloud auth application-default login
# or set GOOGLE_APPLICATION_CREDENTIALS environment variable

In [2]:
bucket_name = "filtered_cxr"
file_path = "cxr"
gcs_path = f"gs://{bucket_name}/{file_path}"

df = pd.read_csv(gcs_path)
df.head()



Unnamed: 0,subject_id,study_id,dicom_id,Atelectasis,Cardiomegaly,Consolidation,Edema,Enlarged_Cardiomediastinum,Fracture,Lung_Lesion,...,Pleural_Effusion,Pleural_Other,Pneumonia,Pneumothorax,Support_Devices,split,ViewPosition,Rows,Columns,study_datetime
0,14887088,54257662,1bc85033-355accce-e8d0ed50-78188cd3-dac92e86,,0.0,0.0,,,,,...,,,,,1.0,train,AP,2539,776,2135-01-27 04:49:44.968
1,18650767,54780106,7920918a-6b7415f5-f5b6f7d5-815bb396-c6702eb3,,-1.0,,-1.0,,,,...,,,1.0,,,train,PA,1504,1188,2131-09-14 22:14:49.221
2,11548266,59905684,95bf2c9c-fd2e6da5-fe82e5db-032420dd-d4daa45c,,,,,,,,...,,,,,,train,PA,1713,1309,2129-04-20 13:37:47.133
3,13191147,55483625,b9cf3249-939bd85e-aaf9795b-d2c0a201-3c0d322c,,,,0.0,,1.0,,...,,,,,,train,PA,2000,1356,2184-12-08 12:03:51.120
4,13880916,53360233,729183be-b03670e1-35db29c1-356f54ed-94481994,,,,,,,,...,,,,,,train,PA,1986,1380,2170-04-27 12:42:31.124


### Sampling images per label

In [None]:
label_cols = [
'Atelectasis', 'Cardiomegaly', 'Consolidation', 'Edema',
'Enlarged_Cardiomediastinum', 'Fracture', 'Lung_Lesion',
'Lung_Opacity', 'No_Finding', 'Pleural_Effusion',
'Pleural_Other', 'Pneumonia', 'Pneumothorax', 'Support_Devices'
]

# Keep rows where at least one label is mentioned (non-null)
df_labels = df.copy()
df_labels = df_labels[df_labels[label_cols].notna().any(axis=1)]

#Container for sampled entries
sampled_rows = []

#Keep track of already used dicom_ids
used_dicom_ids = set()

# Sample up to 5000 positive images per label
for label in label_cols:
    label_df = df_labels[df_labels[label] == 1.0]
    # Skip images already sampled for an earlier label, so each dicom_id is used once
    label_df = label_df[~label_df['dicom_id'].isin(used_dicom_ids)]
    # Take up to 5000 (fewer if the label has fewer unused positives)
    sample = label_df.sample(n=min(5000, len(label_df)), random_state=33)
    # Track used dicoms
    used_dicom_ids.update(sample['dicom_id'])
    sampled_rows.append(sample)

#Combine all sampled subsets
sampled_df = pd.concat(sampled_rows, ignore_index=True)

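def label_to_text(label, condition):
    # Turn one label value into a sentence; unmentioned (NaN) labels yield None.
    # `condition` is the column name, so underscores are kept (e.g. "lung_opacity").
    if pd.isna(label):
        return None
    elif label == 1.0:
        return f"Evidence of {condition.lower()} is present."
    elif label == 0.0:
        return f"No evidence of {condition.lower()}."
    elif label == -1.0:
        return f"There is suspicion of {condition.lower()}."
    else:
        return None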

def create_report(row):
    sentences = []
    for condition in label_cols:
        sentence = label_to_text(row[condition], condition)
        if sentence:
            sentences.append(sentence)
    return " ".join(sentences)

#Add new column to the DataFrame
sampled_df["mini_report"] = sampled_df.apply(create_report, axis=1)

#Save final result
sampled_df.to_csv('sampled_5000_per_label.csv', index=False)



Atelectasis: 5000 samples
  Train: 4105
  Validate: 395
  Test: 500
Cardiomegaly: 5000 samples
  Train: 4243
  Validate: 256
  Test: 501
Consolidation: 5000 samples
  Train: 4799
  Validate: 53
  Test: 148
Edema: 5000 samples
  Train: 4480
  Validate: 113
  Test: 407
Enlarged_Cardiomediastinum: 5000 samples
  Train: 4924
  Validate: 16
  Test: 60
Fracture: 4269 samples
  Train: 4206
  Validate: 12
  Test: 51
Lung_Lesion: 5000 samples
  Train: 4887
  Validate: 38
  Test: 75
Lung_Opacity: 5000 samples
  Train: 4324
  Validate: 176
  Test: 500
No_Finding: 5000 samples
  Train: 4000
  Validate: 500
  Test: 500
Pleural_Effusion: 5000 samples
  Train: 4748
  Validate: 68
  Test: 184
Pleural_Other: 1414 samples
  Train: 1399
  Validate: 1
  Test: 14
Pneumonia: 5000 samples
  Train: 4903
  Validate: 26
  Test: 71
Pneumothorax: 5000 samples
  Train: 4949
  Validate: 22
  Test: 29
Support_Devices: 5000 samples
  Train: 4848
  Validate: 62
  Test: 90

Final split distribution:
train: 60815 sample
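
The per-label counts printed above come from a cell that is not shown here. A minimal sketch of how such a summary could be produced, assuming each label's sample is kept together with its name (for example via `dict(zip(label_cols, sampled_rows))`). `summarize_splits` is a hypothetical helper, not the original code, and the toy data below is made up.

```python
import pandas as pd


def summarize_splits(samples_by_label):
    """Return summary lines with the train/validate/test counts of each label's sample."""
    lines = []
    for label, sample in samples_by_label.items():
        lines.append(f"{label}: {len(sample)} samples")
        counts = sample["split"].value_counts()
        for split in ("train", "validate", "test"):
            # A split that never occurs in this sample is reported as 0.
            lines.append(f"  {split.capitalize()}: {int(counts.get(split, 0))}")
    return lines


# Toy input for illustration only.
toy_samples = {"Fracture": pd.DataFrame({"split": ["train", "train", "test"]})}
print("\n".join(summarize_splits(toy_samples)))
```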

In [12]:
sampled_df.shape

(65683, 23)

In [13]:
sampled_df.head()

Unnamed: 0,subject_id,study_id,dicom_id,Atelectasis,Cardiomegaly,Consolidation,Edema,Enlarged_Cardiomediastinum,Fracture,Lung_Lesion,...,Pleural_Other,Pneumonia,Pneumothorax,Support_Devices,split,ViewPosition,Rows,Columns,study_datetime,mini_report
0,15672829,53399488,fabd4f59-1ee52083-5cddd2c5-186cc8bb-8a3b8148,1.0,,,,,,,...,,,,,train,PA,3056,2544,2150-11-04 14:45:02.218,Evidence of atelectasis is present.
1,12324075,55152148,875b8cfe-cb7c35f7-9e7ac4a5-08bf33cf-8b34e96c,1.0,,1.0,0.0,,,,...,,-1.0,0.0,1.0,train,AP,2381,3050,2122-03-05 10:19:41.734,Evidence of atelectasis is present. Evidence o...
2,15721773,52591378,9536d495-7733c58c-665e061b-072565cf-8e7860f1,1.0,,1.0,-1.0,,,,...,,,,,train,AP,2219,2794,2180-02-06 18:21:22.578,Evidence of atelectasis is present. Evidence o...
3,12095092,55634770,39cefdbf-02a85e10-754d9386-b6261d85-f6bdf4d7,1.0,,,,,,,...,,,,,train,AP,3050,2539,2120-03-27 04:18:35.703,Evidence of atelectasis is present.
4,10997073,50423961,7558a0f5-822c7d0c-33dde2e9-d132cd32-181ca2b9,1.0,,,,,,,...,,,,,train,AP,2741,2539,2169-06-22 14:04:13.484,Evidence of atelectasis is present. Evidence o...


In [15]:
# Count unique entries in the 'split' column
split_counts = sampled_df['split'].value_counts()
print("Split column value counts:")
print(split_counts)
print(f"Total samples: {split_counts.sum()}")

# Alternative way to see the distribution
print("\nSplit distribution:")
for split_value, count in split_counts.items():
    percentage = (count / len(sampled_df)) * 100
    print(f"{split_value}: {count} samples ({percentage:.1f}%)")

Split column value counts:
split
train       60815
test         3130
validate     1738
Name: count, dtype: int64
Total samples: 65683

Split distribution:
train: 60815 samples (92.6%)
test: 3130 samples (4.8%)
validate: 1738 samples (2.6%)


In [9]:
billing_project = 'ablation-study'
fs = gcsfs.GCSFileSystem(project=billing_project, requester_pays=True)