**CheXpert-derived labels** were used for the following reasons:
  - Wider clinical coverage (more findings per study captured).
  - Multi-label ready (each study has 14 possible findings).
  - Includes uncertain cases (better mimics clinical reality).
  - CheXpert prioritization logic (positive > uncertain > negative) is more consistent with human diagnostic decision-making.
  - Standardized mention extraction patterns (including synonyms and abbreviations).
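
The prioritization rule above (positive > uncertain > negative) can be illustrated with a small sketch. This is not the CheXpert labeler's own code; the function name `aggregate_mentions` and the `PRIORITY` table are hypothetical, and the sketch assumes each finding arrives as a list of mention-level labels.

```python
import math

# Priority order described above: positive > uncertain > negative.
# PRIORITY and aggregate_mentions are hypothetical names used for illustration only.
PRIORITY = {1.0: 3, -1.0: 2, 0.0: 1}

def aggregate_mentions(mention_labels):
    """Collapse the mention-level labels of one finding into one study-level label.

    Returns NaN when the finding is never mentioned.
    """
    if not mention_labels:
        return math.nan
    # Pick the label with the highest priority.
    return max(mention_labels, key=PRIORITY.__getitem__)

print(aggregate_mentions([0.0, -1.0]))       # uncertain outranks negative
print(aggregate_mentions([0.0, 1.0, -1.0]))  # positive outranks both
print(aggregate_mentions([]))                # finding not mentioned
```

Under this rule, a single positive mention is enough to mark the whole study positive for that finding, even when other sentences in the report negate it.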

The following are the labels extracted using CheXpert:
- No Finding
- Enlarged Cardiomediastinum
- Cardiomegaly
- Lung Opacity
- Lung Lesion
- Edema
- Consolidation
- Pneumonia
- Atelectasis
- Pneumothorax
- Pleural Effusion
- Pleural Other
- Fracture
- Support Devices


| Label Value | Meaning       | Interpretation                                    |
| ----------- | ------------- | ------------------------------------------------- |
| `1.0`       | Positive      | Condition is clearly **present**                 |
| `0.0`       | Negative      | Condition is clearly **absent**                   |
| `-1.0`      | Uncertain     | Radiologist **suspects** condition but isn't sure |
| `null`      | Not mentioned | No statement about the condition                  |
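
As a quick illustration of the encoding in the table, the sketch below maps the four possible values of one label column to their meanings and counts them. The toy values and the `MEANING` dictionary are made up for the example; only the column name `Edema` comes from the dataset.

```python
import numpy as np
import pandas as pd

# Toy label column; the values are invented for illustration.
toy = pd.DataFrame({"Edema": [1.0, 0.0, -1.0, np.nan, 1.0]})

# Hypothetical lookup table mirroring the label-value table.
MEANING = {1.0: "positive", 0.0: "negative", -1.0: "uncertain"}

# Series.map leaves NaN where the value has no dictionary entry,
# so the missing labels are filled in as "not mentioned" afterwards.
status = toy["Edema"].map(MEANING).fillna("not mentioned")
counts = status.value_counts().to_dict()
print(counts)
```

The same pattern applied to each label column gives a quick overview of how many studies are positive, negative, uncertain, or silent on a finding.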


The dataset contains more than 14 distinct view positions for chest X-rays.

For chest X-rays, **PA (Posteroanterior)** and **AP (Anteroposterior)** are the most common and diagnostically rich frontal views:

- **PA** is the standard in ambulatory settings:
  - The patient stands facing the detector.
  - Results in a clearer and less magnified heart silhouette.

- **AP** is often used for bedridden or ICU patients:
  - Taken with the detector behind the patient.
  - More prone to artifacts, but still interpretable.

These views are preferred because:

- They provide a **full frontal projection** of the chest.
- They are **easier to learn from** due to greater dataset availability.
- They offer **more consistent anatomical display** across patients.
- They are **commonly labeled** as part of “normal” or “abnormal” classes in training datasets.

By **restricting training to PA/AP views**, we:
- **Reduce variability** caused by pose or projection differences.
- Help the model **focus on pathology**, rather than view-dependent artifacts.
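
Restricting to frontal views amounts to a filter on the `ViewPosition` column. The notebook does not show how the filtered CSV was produced, so the following is only a sketch on a toy frame; the non-frontal values `LATERAL` and `LL` are illustrative.

```python
import pandas as pd

# Toy metadata; dicom_id values and view positions are invented for illustration.
toy = pd.DataFrame({
    "dicom_id": ["a", "b", "c", "d"],
    "ViewPosition": ["PA", "LATERAL", "AP", "LL"],
})

# Keep only the two frontal projections discussed above.
frontal = toy[toy["ViewPosition"].isin(["PA", "AP"])].reset_index(drop=True)
print(frontal)
```

Rows with a missing `ViewPosition` are dropped by `isin` as well, which is usually the desired behavior when the projection is unknown.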


In [6]:
import pandas as pd
import gcsfs  # not called directly; pandas needs it installed to read gs:// paths
from sklearn.model_selection import train_test_split
from iterstrat.ml_stratifiers import MultilabelStratifiedShuffleSplit
import numpy as np
# For local environment, use Application Default Credentials
# Make sure you've run: gcloud auth application-default login
# or set GOOGLE_APPLICATION_CREDENTIALS environment variable

In [2]:
bucket_name = "filtered_cxr"
file_path = "cxr"
gcs_path = f"gs://{bucket_name}/{file_path}"

df = pd.read_csv(gcs_path)
df.head()



Unnamed: 0,subject_id,study_id,dicom_id,Atelectasis,Cardiomegaly,Consolidation,Edema,Enlarged_Cardiomediastinum,Fracture,Lung_Lesion,...,Pleural_Effusion,Pleural_Other,Pneumonia,Pneumothorax,Support_Devices,split,ViewPosition,Rows,Columns,study_datetime
0,14887088,54257662,1bc85033-355accce-e8d0ed50-78188cd3-dac92e86,,0.0,0.0,,,,,...,,,,,1.0,train,AP,2539,776,2135-01-27 04:49:44.968
1,18650767,54780106,7920918a-6b7415f5-f5b6f7d5-815bb396-c6702eb3,,-1.0,,-1.0,,,,...,,,1.0,,,train,PA,1504,1188,2131-09-14 22:14:49.221
2,11548266,59905684,95bf2c9c-fd2e6da5-fe82e5db-032420dd-d4daa45c,,,,,,,,...,,,,,,train,PA,1713,1309,2129-04-20 13:37:47.133
3,13191147,55483625,b9cf3249-939bd85e-aaf9795b-d2c0a201-3c0d322c,,,,0.0,,1.0,,...,,,,,,train,PA,2000,1356,2184-12-08 12:03:51.120
4,13880916,53360233,729183be-b03670e1-35db29c1-356f54ed-94481994,,,,,,,,...,,,,,,train,PA,1986,1380,2170-04-27 12:42:31.124


### Sampling images per label

In [3]:
label_cols = [
'Atelectasis', 'Cardiomegaly', 'Consolidation', 'Edema',
'Enlarged_Cardiomediastinum', 'Fracture', 'Lung_Lesion',
'Lung_Opacity', 'No_Finding', 'Pleural_Effusion',
'Pleural_Other', 'Pneumonia', 'Pneumothorax', 'Support_Devices'
]

# Separate No_Finding from pathological conditions
pathological_cols = [col for col in label_cols if col != 'No_Finding']

# Keep rows where at least one label column is non-null (positive, negative, or uncertain)
df_labels = df.copy()
df_labels = df_labels[df_labels[label_cols].notna().any(axis=1)]

# Container for sampled entries
sampled_rows = []

# Keep track of already used dicom_ids
used_dicom_ids = set()

# Sample up to 5000 images per pathological label (excluding No_Finding)
for label in pathological_cols:
    label_df = df_labels[df_labels[label] == 1.0]
    # Skip any dicom_id already sampled for an earlier label, so each image appears at most once
    label_df = label_df[~label_df['dicom_id'].isin(used_dicom_ids)]
    # Take up to 5000
    sample = label_df.sample(n=min(5000, len(label_df)), random_state=33)
    # Track used dicoms
    used_dicom_ids.update(sample['dicom_id'])
    sampled_rows.append(sample)
    print(f"Sampled {len(sample)} images for {label}")

# Separately handle No_Finding (normal cases)
no_finding_df = df_labels[df_labels['No_Finding'] == 1.0]
# For No_Finding, keep only rows with no positive pathological finding.
# NaN != 1.0 evaluates to True, so rows whose pathological labels are all NaN pass this check too.
truly_normal = no_finding_df[
    (no_finding_df[pathological_cols] != 1.0).all(axis=1)
]

# Avoid reusing dicom_ids
truly_normal = truly_normal[~truly_normal['dicom_id'].isin(used_dicom_ids)]
normal_sample = truly_normal.sample(n=min(5000, len(truly_normal)), random_state=33)
used_dicom_ids.update(normal_sample['dicom_id'])
sampled_rows.append(normal_sample)
print(f"Sampled {len(normal_sample)} truly normal images (No_Finding only)")

# Combine all sampled subsets
sampled_df = pd.concat(sampled_rows, ignore_index=True)

def label_to_text(label, condition):
    if pd.isna(label):
        return None
    elif condition == 'No_Finding':
        # Special handling for No_Finding
        if label == 1.0:
            return "The chest X-ray appears normal with no acute findings."
        elif label == 0.0:
            return "Abnormal findings are present."  # If No_Finding is explicitly negative
        elif label == -1.0:
            return "The image quality or findings are uncertain."
        else:
            return None
    else:
        # Handle pathological conditions
        if label == 1.0:
            return f"Evidence of {condition.lower().replace('_', ' ')} is present."
        elif label == 0.0:
            return f"No evidence of {condition.lower().replace('_', ' ')}."
        elif label == -1.0:
            return f"There is suspicion of {condition.lower().replace('_', ' ')}."
        else:
            return None

def create_report(row):
    sentences = []
    
    # Check if this is a normal case (No_Finding = 1.0 and no positive pathological findings)
    is_normal = (row['No_Finding'] == 1.0 and 
                 (row[pathological_cols] != 1.0).all())
    
    if is_normal:
        # For normal cases, just use the No_Finding description
        normal_sentence = label_to_text(row['No_Finding'], 'No_Finding')
        if normal_sentence:
            sentences.append(normal_sentence)
    else:
        # For abnormal cases, describe all findings except No_Finding
        for condition in pathological_cols:
            sentence = label_to_text(row[condition], condition)
            if sentence:
                sentences.append(sentence)
        
        # Only mention No_Finding if it's explicitly negative (meaning abnormal)
        if row['No_Finding'] == 0.0:
            no_finding_sentence = label_to_text(row['No_Finding'], 'No_Finding')
            if no_finding_sentence:
                sentences.insert(0, no_finding_sentence)  # Put at beginning
    
    return " ".join(sentences)

# Add new column to the DataFrame
sampled_df["mini_report"] = sampled_df.apply(create_report, axis=1)

# Print some examples
print("\nSample reports:")
print("Normal cases:")
normal_cases = sampled_df[sampled_df['No_Finding'] == 1.0].head(3)
for idx, row in normal_cases.iterrows():
    print(f"- {row['mini_report']}")

print("\nAbnormal cases:")
abnormal_cases = sampled_df[sampled_df['No_Finding'] != 1.0].head(3)
for idx, row in abnormal_cases.iterrows():
    print(f"- {row['mini_report']}")

# Save final result
sampled_df.to_csv('sampled_5000_per_label.csv', index=False)

print(f"\nFinal dataset: {len(sampled_df)} samples")
print(f"Normal cases (No_Finding=1.0): {len(sampled_df[sampled_df['No_Finding'] == 1.0])}")
print(f"Abnormal cases: {len(sampled_df[sampled_df['No_Finding'] != 1.0])}")

Sampled 5000 images for Atelectasis
Sampled 5000 images for Cardiomegaly
Sampled 5000 images for Consolidation
Sampled 5000 images for Edema
Sampled 5000 images for Enlarged_Cardiomediastinum
Sampled 4277 images for Fracture
Sampled 5000 images for Lung_Lesion
Sampled 5000 images for Lung_Opacity
Sampled 5000 images for Pleural_Effusion
Sampled 1449 images for Pleural_Other
Sampled 5000 images for Pneumonia
Sampled 5000 images for Pneumothorax
Sampled 5000 images for Support_Devices
Sampled 5000 truly normal images (No_Finding only)

Sample reports:
Normal cases:
- Evidence of support devices is present.
- Evidence of support devices is present.
- No evidence of consolidation. No evidence of edema. No evidence of enlarged cardiomediastinum. No evidence of pleural effusion. No evidence of pneumothorax. Evidence of support devices is present.

Abnormal cases:
- Evidence of atelectasis is present.
- Evidence of atelectasis is present.
- Evidence of atelectasis is present. Evidence of lung
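
The "Normal cases" printed above describe support devices rather than a normal study. Given the code, this follows from the filter used for printing: it selects on `No_Finding == 1.0` alone, and rows sampled under `Support_Devices` can carry `No_Finding == 1.0` as well. The sketch below reproduces the effect on a toy frame (values invented, label list shortened) and shows a stricter mask that also requires no positive pathological label.

```python
import numpy as np
import pandas as pd

# Shortened label list for the toy example.
pathological = ["Edema", "Support_Devices"]

# Toy rows: row 0 is flagged No_Finding but also has a support device.
toy = pd.DataFrame({
    "No_Finding":      [1.0, 1.0, np.nan],
    "Edema":           [np.nan, np.nan, 1.0],
    "Support_Devices": [1.0, np.nan, np.nan],
})

# Filter used for printing: No_Finding flag only.
flagged_normal = toy["No_Finding"] == 1.0

# Stricter filter: additionally require that no pathological label is positive.
# NaN != 1.0 is True, so unmentioned labels do not disqualify a row.
no_positive_pathology = (toy[pathological] != 1.0).all(axis=1)
strictly_normal = toy[flagged_normal & no_positive_pathology]

print(int(flagged_normal.sum()), len(strictly_normal))
```

The stricter mask matches the `is_normal` condition inside `create_report`, so rows selected with it would receive the "appears normal" sentence.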

In [4]:
sampled_df.shape

(65726, 23)
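
The row count can be cross-checked against the per-label sample sizes printed earlier. Because every `dicom_id` is sampled at most once, the total should be the plain sum of those sizes. The dictionary below just restates the printed counts.

```python
# Per-label sample sizes copied from the sampling output above.
per_label = {
    "Atelectasis": 5000, "Cardiomegaly": 5000, "Consolidation": 5000,
    "Edema": 5000, "Enlarged_Cardiomediastinum": 5000, "Fracture": 4277,
    "Lung_Lesion": 5000, "Lung_Opacity": 5000, "Pleural_Effusion": 5000,
    "Pleural_Other": 1449, "Pneumonia": 5000, "Pneumothorax": 5000,
    "Support_Devices": 5000, "No_Finding (truly normal)": 5000,
}

total = sum(per_label.values())
print(total)  # matches the 65726 rows reported by sampled_df.shape
```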

In [5]:
sampled_df.head()

Unnamed: 0,subject_id,study_id,dicom_id,Atelectasis,Cardiomegaly,Consolidation,Edema,Enlarged_Cardiomediastinum,Fracture,Lung_Lesion,...,Pleural_Other,Pneumonia,Pneumothorax,Support_Devices,split,ViewPosition,Rows,Columns,study_datetime,mini_report
0,18310719,57997889,84eb7e34-12b4cdcf-04196ef9-a5d6e59d-3d17fafc,1.0,,,,,,,...,,,,,train,AP,2544,3056,2129-05-30 17:47:45.468,Evidence of atelectasis is present.
1,14004638,54797919,110e4fe4-2b4d6b84-14687e5f-877c71ea-d469285a,1.0,,,,,,,...,,,,,train,AP,3050,2539,2156-01-24 08:14:06.953,Evidence of atelectasis is present.
2,13233757,56501945,43eaf4e8-7069e8c1-472726df-b9acc888-c8069f09,1.0,,,,,,,...,,,0.0,1.0,train,AP,2539,3050,2156-08-30 05:51:23.468,Evidence of atelectasis is present. Evidence o...
3,16444272,54476134,32e6ec27-1022a80a-37c2af58-4415c8de-d3da40ef,1.0,,,1.0,,,,...,,,,,train,PA,3056,2544,2146-02-09 23:01:11.437,Evidence of atelectasis is present. Evidence o...
4,17372922,51457365,796cef17-0b455d0e-56288df5-80097fde-f090b2ba,1.0,,,0.0,,,,...,,,,1.0,train,AP,2544,3056,2156-10-05 20:30:24.765,Evidence of atelectasis is present. No evidenc...


In [None]:
# Create stratification labels matrix (only using the 14 label columns)
# Replace NaN with 0 for stratification only; "not mentioned" is treated like negative, while -1 (uncertain) stays a separate value
stratify_labels = sampled_df[label_cols].fillna(0).values

print(f"Original dataset shape: {sampled_df.shape}")
print(f"Stratification labels shape: {stratify_labels.shape}")

# First split: separate out test set (20%)
splitter_test = MultilabelStratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=33)
train_val_idx, test_idx = next(splitter_test.split(sampled_df, stratify_labels))

# Get train+val and test sets
train_val_df = sampled_df.iloc[train_val_idx].copy()
test_df = sampled_df.iloc[test_idx].copy()
train_val_labels = stratify_labels[train_val_idx]

print(f"Train+Val set shape: {train_val_df.shape}")
print(f"Test set shape: {test_df.shape}")

# Second split: split train+val into train (60% of original) and val (20% of original)
# This means val will be 0.25 of the train_val set (0.25 * 0.8 = 0.2 of original)
splitter_val = MultilabelStratifiedShuffleSplit(n_splits=1, test_size=0.25, random_state=33)
train_idx, val_idx = next(splitter_val.split(train_val_df, train_val_labels))

# Get final train and val sets
train_df = train_val_df.iloc[train_idx].copy()
val_df = train_val_df.iloc[val_idx].copy()

print("\nFinal split sizes:")
print(f"Train set: {train_df.shape[0]} samples ({train_df.shape[0]/len(sampled_df)*100:.1f}%)")
print(f"Val set: {val_df.shape[0]} samples ({val_df.shape[0]/len(sampled_df)*100:.1f}%)")
print(f"Test set: {test_df.shape[0]} samples ({test_df.shape[0]/len(sampled_df)*100:.1f}%)")

def print_label_distribution(df, split_name):
    print(f"\n{split_name} label distribution:")
    for label in label_cols:
        positive_count = (df[label] == 1.0).sum()
        total_count = df[label].notna().sum()
        if total_count > 0:
            percentage = positive_count / total_count * 100
            print(f"  {label}: {positive_count}/{total_count} ({percentage:.1f}%)")

print_label_distribution(train_df, "Train")
print_label_distribution(val_df, "Validation")
print_label_distribution(test_df, "Test")

# Save the splits
train_df.to_csv('train_split.csv', index=False)
val_df.to_csv('val_split.csv', index=False)
test_df.to_csv('test_split.csv', index=False)

print("\nSaved splits to CSV files:")
print(f"- train_split.csv: {len(train_df)} samples")
print(f"- val_split.csv: {len(val_df)} samples")
print(f"- test_split.csv: {len(test_df)} samples")

Original dataset shape: (65726, 23)
Stratification labels shape: (65726, 14)
Train+Val set shape: (52527, 23)
Test set shape: (13199, 23)

Final split sizes:
Train set: 39422 samples (60.0%)
Val set: 13105 samples (19.9%)
Test set: 13199 samples (20.1%)
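
The two-stage split arithmetic can be verified directly. The sketch below recomputes the intended fractions from the two `test_size` values and compares them with the reported split sizes; the variable names are illustrative and not taken from the notebook.

```python
# test_size values used in the two splitter calls.
test_frac = 0.2
val_frac_of_remainder = 0.25

# Fractions of the original dataset implied by splitting twice.
val_frac = (1 - test_frac) * val_frac_of_remainder
train_frac = (1 - test_frac) * (1 - val_frac_of_remainder)
print(round(train_frac, 3), round(val_frac, 3), test_frac)

# Split sizes copied from the output above.
sizes = {"train": 39422, "val": 13105, "test": 13199}
n_total = sum(sizes.values())

percentages = {name: round(100 * n / n_total, 1) for name, n in sizes.items()}
print(n_total, percentages)
```

The realized percentages deviate slightly from the 60/20/20 target because the multilabel stratifier balances label combinations rather than enforcing exact split sizes.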

Train label distribution:
  Atelectasis: 11590/14045 (82.5%)
  Cardiomegaly: 10789/15197 (71.0%)
  Consolidation: 4578/6741 (67.9%)
  Edema: 7886/14792 (53.3%)
  Enlarged_Cardiomediastinum: 3847/7014 (54.8%)
  Fracture: 2869/3024 (94.9%)
  Lung_Lesion: 3628/3932 (92.3%)
  Lung_Opacity: 13383/14551 (92.0%)
  No_Finding: 3686/3686 (100.0%)
  Pleural_Effusion: 15029/20720 (72.5%)
  Pleural_Other: 1251/1396 (89.6%)
  Pneumonia: 5921/12631 (46.9%)
  Pneumothorax: 4799/14842 (32.3%)
  Support_Devices: 16847/17587 (95.8%)

Validation label distribution:
  Atelectasis: 3846/4677 (82.2%)
  Cardiomegaly: 3615/5117 (70.6%)
  Consolidation: 1499/2234 (67.1%)
  Edema: 2663/4928 (54.0%)
  Enlarged_Cardiomediastinum: 1313/2357 (55.7%)
  Fracture: 953