**CheXpert-derived labels** were used for the following reasons:
  - Wider clinical coverage (more findings per study captured).
  - Multi-label ready (each study has 14 possible findings).
  - Includes uncertain cases (better mimics clinical reality).
  - CheXpert prioritization logic (positive > uncertain > negative) is more consistent with human diagnostic decision-making.
  - Standardized mention extraction patterns (including synonyms and abbreviations).
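
The prioritization rule above (positive > uncertain > negative) can be sketched in a few lines. This only illustrates the stated ordering and is not the CheXpert labeler's actual implementation; the `PRIORITY` mapping and the `aggregate_mentions` helper are hypothetical names introduced for this example.

```python
from typing import List, Optional

# Rank of each mention-level value: 1.0 = positive, -1.0 = uncertain, 0.0 = negative.
# A higher rank wins when one finding is mentioned several times in a report.
PRIORITY = {1.0: 3, -1.0: 2, 0.0: 1}


def aggregate_mentions(mentions: List[float]) -> Optional[float]:
    """Collapse all mentions of one finding into a single study-level label.

    Returns None when the finding is never mentioned.
    """
    if not mentions:
        return None
    return max(mentions, key=lambda value: PRIORITY[value])


print(aggregate_mentions([0.0, -1.0]))       # negative + uncertain
print(aggregate_mentions([0.0, 1.0, -1.0]))  # all three kinds of mention
print(aggregate_mentions([]))                # finding not mentioned
```

Under this ordering, a single positive mention outweighs any number of uncertain or negative mentions of the same finding.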

The following are the labels extracted using CheXpert:
- No Finding
- Enlarged Cardiomediastinum
- Cardiomegaly
- Lung Opacity
- Lung Lesion
- Edema
- Consolidation
- Pneumonia
- Atelectasis
- Pneumothorax
- Pleural Effusion
- Pleural Other
- Fracture
- Support Devices


Each label column takes one of four values:

| Label Value | Meaning       | Interpretation                                    |
| ----------- | ------------- | ------------------------------------------------- |
| `1.0`       | Positive      | Condition is clearly **present**                  |
| `0.0`       | Negative      | Condition is clearly **absent**                   |
| `-1.0`      | Uncertain     | Radiologist **suspects** condition but isn't sure |
| `null`      | Not mentioned | No statement about the condition                  |
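
To see how the four label values are distributed, one option is to count them per label column. The sketch below runs on a small made-up frame; `toy`, `value_names`, and `label_state_counts` are illustrative names, not part of the dataset or the pipeline.

```python
import numpy as np
import pandas as pd

# Toy frame with two label columns; the values are invented for illustration.
toy = pd.DataFrame({
    "Edema": [1.0, 0.0, -1.0, np.nan],
    "Pneumonia": [np.nan, np.nan, 1.0, -1.0],
})

# Same encoding as the table above; missing values become "not mentioned".
value_names = {1.0: "positive", 0.0: "negative", -1.0: "uncertain"}


def label_state_counts(frame: pd.DataFrame) -> pd.DataFrame:
    """Count positive / negative / uncertain / not-mentioned entries per column."""
    named = frame.apply(lambda col: col.map(value_names)).fillna("not mentioned")
    counts = pd.DataFrame({col: named[col].value_counts() for col in named.columns})
    # States that never occur in a column come out as NaN; report them as 0.
    return counts.fillna(0).astype(int)


print(label_state_counts(toy))
```

Applied to the real frame, the same helper would be called on `df[label_cols]`.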


There are over 14 view positions for the chest X-rays in the dataset.

For chest X-rays, **PA (Posteroanterior)** and **AP (Anteroposterior)** are the most common and diagnostically rich frontal views:

- **PA** is the standard in ambulatory settings:
  - The patient stands facing the detector.
  - Results in a clearer and less magnified heart silhouette.

- **AP** is often used for bedridden or ICU patients:
  - Taken with the detector behind the patient.
  - More prone to artifacts, but still interpretable.

These views are preferred because:

- They provide a **full frontal projection** of the chest.
- They are **easier to learn from** due to greater dataset availability.
- They offer **more consistent anatomical display** across patients.
- They are **commonly labeled** as part of “normal” or “abnormal” classes in training datasets.

By **restricting training to PA/AP views**, we:
- **Reduce variability** caused by pose or projection differences.
- Help the model **focus on pathology**, rather than view-dependent artifacts.
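
A minimal sketch of the PA/AP restriction, assuming the view is stored in a `ViewPosition` column as in the metadata shown later in this notebook. The `keep_frontal_views` helper and the toy rows (including the `LATERAL` value) are assumptions made for illustration.

```python
import pandas as pd

FRONTAL_VIEWS = ["PA", "AP"]


def keep_frontal_views(frame: pd.DataFrame) -> pd.DataFrame:
    """Return only the rows whose ViewPosition is PA or AP."""
    mask = frame["ViewPosition"].isin(FRONTAL_VIEWS)
    return frame[mask].reset_index(drop=True)


# Toy metadata: two frontal images, one lateral, one with a missing view.
toy = pd.DataFrame({
    "dicom_id": ["a", "b", "c", "d"],
    "ViewPosition": ["PA", "LATERAL", "AP", None],
})

frontal = keep_frontal_views(toy)
print(frontal["ViewPosition"].value_counts())
```

Rows with a missing view position are dropped as well, because `isin` treats them as non-matching.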


In [1]:
import pandas as pd
import gcsfs
import base64
import gc
import time
from tqdm import tqdm
import glob
# For local environment, use Application Default Credentials
# Make sure you've run: gcloud auth application-default login
# or set GOOGLE_APPLICATION_CREDENTIALS environment variable

In [2]:
bucket_name = "filtered_cxr"
file_path = "cxr"
gcs_path = f"gs://{bucket_name}/{file_path}"

df = pd.read_csv(gcs_path)
df.head()



Unnamed: 0,subject_id,study_id,dicom_id,Atelectasis,Cardiomegaly,Consolidation,Edema,Enlarged_Cardiomediastinum,Fracture,Lung_Lesion,...,Pleural_Effusion,Pleural_Other,Pneumonia,Pneumothorax,Support_Devices,split,ViewPosition,Rows,Columns,study_datetime
0,14887088,54257662,1bc85033-355accce-e8d0ed50-78188cd3-dac92e86,,0.0,0.0,,,,,...,,,,,1.0,train,AP,2539,776,2135-01-27 04:49:44.968
1,18650767,54780106,7920918a-6b7415f5-f5b6f7d5-815bb396-c6702eb3,,-1.0,,-1.0,,,,...,,,1.0,,,train,PA,1504,1188,2131-09-14 22:14:49.221
2,11548266,59905684,95bf2c9c-fd2e6da5-fe82e5db-032420dd-d4daa45c,,,,,,,,...,,,,,,train,PA,1713,1309,2129-04-20 13:37:47.133
3,13191147,55483625,b9cf3249-939bd85e-aaf9795b-d2c0a201-3c0d322c,,,,0.0,,1.0,,...,,,,,,train,PA,2000,1356,2184-12-08 12:03:51.120
4,13880916,53360233,729183be-b03670e1-35db29c1-356f54ed-94481994,,,,,,,,...,,,,,,train,PA,1986,1380,2170-04-27 12:42:31.124


### Sampling images per label

In [11]:
label_cols = [
    'Atelectasis', 'Cardiomegaly', 'Consolidation', 'Edema',
    'Enlarged_Cardiomediastinum', 'Fracture', 'Lung_Lesion',
    'Lung_Opacity', 'No_Finding', 'Pleural_Effusion',
    'Pleural_Other', 'Pneumonia', 'Pneumothorax', 'Support_Devices'
]

# Keep only rows that carry at least one label value (positive, negative, or uncertain).
df_labels = df.copy()
df_labels = df_labels[df_labels[label_cols].notna().any(axis=1)]

sampled_rows = []

# DICOM IDs already assigned to an earlier label, so each image is sampled at most once.
used_dicom_ids = set()

for label in label_cols:
    # Positive cases for this label that no earlier label has already claimed.
    label_df = df_labels[df_labels[label] == 1.0]
    label_df = label_df[~label_df['dicom_id'].isin(used_dicom_ids)]

    # Draw up to 4000 / 500 / 500 images from the train / validate / test splits.
    train_samples = label_df[label_df['split'] == 'train'].sample(
        n=min(4000, len(label_df[label_df['split'] == 'train'])), random_state=33
    )
    validate_samples = label_df[label_df['split'] == 'validate'].sample(
        n=min(500, len(label_df[label_df['split'] == 'validate'])), random_state=33
    )
    test_samples = label_df[label_df['split'] == 'test'].sample(
        n=min(500, len(label_df[label_df['split'] == 'test'])), random_state=33
    )

    label_samples = pd.concat([train_samples, validate_samples, test_samples])

    # If the per-split draws fall short of 5000, top up from the remaining rows of
    # this label. The top-up ignores the split, so per-split counts can exceed the caps.
    if len(label_samples) < 5000:
        remaining_needed = 5000 - len(label_samples)
        used_in_label = set(label_samples['dicom_id'])
        remaining_df = label_df[~label_df['dicom_id'].isin(used_in_label)]
        additional_samples = remaining_df.sample(
            n=min(remaining_needed, len(remaining_df)), random_state=33
        )
        label_samples = pd.concat([label_samples, additional_samples])

    # Reserve these images so later labels cannot pick them again.
    used_dicom_ids.update(label_samples['dicom_id'])
    sampled_rows.append(label_samples)

    print(f"{label}: {len(label_samples)} samples")
    print(f"  Train: {len(label_samples[label_samples['split'] == 'train'])}")
    print(f"  Validate: {len(label_samples[label_samples['split'] == 'validate'])}")
    print(f"  Test: {len(label_samples[label_samples['split'] == 'test'])}")

sampled_df = pd.concat(sampled_rows, ignore_index=True)

print("\nFinal split distribution:")
split_counts = sampled_df['split'].value_counts()
for split_value, count in split_counts.items():
    percentage = (count / len(sampled_df)) * 100
    print(f"{split_value}: {count} samples ({percentage:.1f}%)")

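def label_to_text(label, condition):
    """Turn one label value into a short sentence; None when the finding is not mentioned."""
    # Column names use underscores (e.g. 'Pleural_Effusion'); use spaces in the report text.
    name = condition.replace('_', ' ').lower()
    if pd.isna(label):
        return None
    elif label == 1.0:
        return f"Evidence of {name} is present."
    elif label == 0.0:
        return f"No evidence of {name}."
    elif label == -1.0:
        return f"There is suspicion of {name}."
    else:
        return None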

def create_report(row):
    sentences = []
    for condition in label_cols:
        sentence = label_to_text(row[condition], condition)
        if sentence:
            sentences.append(sentence)
    return " ".join(sentences)

sampled_df["mini_report"] = sampled_df.apply(create_report, axis=1)

sampled_df.to_csv('sampled_5000_per_label_balanced_splits.csv', index=False)

print(f"\nTotal samples: {len(sampled_df)}")
print(f"Unique DICOM IDs: {sampled_df['dicom_id'].nunique()}")

Atelectasis: 5000 samples
  Train: 4105
  Validate: 395
  Test: 500
Cardiomegaly: 5000 samples
  Train: 4243
  Validate: 256
  Test: 501
Consolidation: 5000 samples
  Train: 4799
  Validate: 53
  Test: 148
Edema: 5000 samples
  Train: 4480
  Validate: 113
  Test: 407
Enlarged_Cardiomediastinum: 5000 samples
  Train: 4924
  Validate: 16
  Test: 60
Fracture: 4269 samples
  Train: 4206
  Validate: 12
  Test: 51
Lung_Lesion: 5000 samples
  Train: 4887
  Validate: 38
  Test: 75
Lung_Opacity: 5000 samples
  Train: 4324
  Validate: 176
  Test: 500
No_Finding: 5000 samples
  Train: 4000
  Validate: 500
  Test: 500
Pleural_Effusion: 5000 samples
  Train: 4748
  Validate: 68
  Test: 184
Pleural_Other: 1414 samples
  Train: 1399
  Validate: 1
  Test: 14
Pneumonia: 5000 samples
  Train: 4903
  Validate: 26
  Test: 71
Pneumothorax: 5000 samples
  Train: 4949
  Validate: 22
  Test: 29
Support_Devices: 5000 samples
  Train: 4848
  Validate: 62
  Test: 90

Final split distribution:
train: 60815 sample

In [4]:
sampled_df.shape

(65726, 23)

In [5]:
sampled_df.head()

Unnamed: 0,subject_id,study_id,dicom_id,Atelectasis,Cardiomegaly,Consolidation,Edema,Enlarged_Cardiomediastinum,Fracture,Lung_Lesion,...,Pleural_Other,Pneumonia,Pneumothorax,Support_Devices,split,ViewPosition,Rows,Columns,study_datetime,mini_report
0,18310719,57997889,84eb7e34-12b4cdcf-04196ef9-a5d6e59d-3d17fafc,1.0,,,,,,,...,,,,,train,AP,2544,3056,2129-05-30 17:47:45.468,Evidence of atelectasis is present.
1,14004638,54797919,110e4fe4-2b4d6b84-14687e5f-877c71ea-d469285a,1.0,,,,,,,...,,,,,train,AP,3050,2539,2156-01-24 08:14:06.953,Evidence of atelectasis is present.
2,13233757,56501945,43eaf4e8-7069e8c1-472726df-b9acc888-c8069f09,1.0,,,,,,,...,,,0.0,1.0,train,AP,2539,3050,2156-08-30 05:51:23.468,Evidence of atelectasis is present. Evidence o...
3,16444272,54476134,32e6ec27-1022a80a-37c2af58-4415c8de-d3da40ef,1.0,,,1.0,,,,...,,,,,train,PA,3056,2544,2146-02-09 23:01:11.437,Evidence of atelectasis is present. Evidence o...
4,17372922,51457365,796cef17-0b455d0e-56288df5-80097fde-f090b2ba,1.0,,,0.0,,,,...,,,,1.0,train,AP,2544,3056,2156-10-05 20:30:24.765,Evidence of atelectasis is present. No evidenc...


In [10]:
# Count unique entries in the 'split' column
split_counts = sampled_df['split'].value_counts()
print("Split column value counts:")
print(split_counts)
print(f"\nTotal unique splits: {len(split_counts)}")
print(f"Total samples: {split_counts.sum()}")

# Alternative way to see the distribution
print("\nSplit distribution:")
for split_value, count in split_counts.items():
    percentage = (count / len(sampled_df)) * 100
    print(f"{split_value}: {count} samples ({percentage:.1f}%)")

Split column value counts:
split
train       64078
test         1128
validate      520
Name: count, dtype: int64

Total unique splits: 3
Total samples: 65726

Split distribution:
train: 64078 samples (97.5%)
test: 1128 samples (1.7%)
validate: 520 samples (0.8%)
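
The split distribution above is heavily weighted toward `train`. One contributing factor is that the top-up step in the sampling cell draws from all remaining rows of a label regardless of their split. If fixed per-split caps matter more than reaching 5000 images per label, a variant without the top-up could look like the sketch below. `sample_within_splits` and the toy frame are hypothetical; this is not the sampling used to produce the outputs in this notebook.

```python
import pandas as pd


def sample_within_splits(label_df: pd.DataFrame, caps: dict, seed: int = 33) -> pd.DataFrame:
    """Sample up to caps[split] rows from each split, with no cross-split top-up."""
    parts = []
    for split_name, cap in caps.items():
        split_rows = label_df[label_df["split"] == split_name]
        n_rows = min(cap, len(split_rows))
        parts.append(split_rows.sample(n=n_rows, random_state=seed))
    return pd.concat(parts, ignore_index=True)


# Toy data: 15 train, 2 validate, and 3 test images for a single label.
toy = pd.DataFrame({
    "dicom_id": [f"id{i}" for i in range(20)],
    "split": ["train"] * 15 + ["validate"] * 2 + ["test"] * 3,
})

picked = sample_within_splits(toy, {"train": 8, "validate": 4, "test": 2})
print(picked["split"].value_counts())
```

With this variant a label with few validate or test images simply ends up with fewer samples, so the total per label can fall below 5000.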


In [9]:
billing_project = 'ablation-study'
fs = gcsfs.GCSFileSystem(project=billing_project, requester_pays=True)