# Tutorial 1: Loading the CheXpert Dataset and Data Preprocessing

This tutorial demonstrates how to download, subset, preprocess, and load the CheXpert chest X-ray dataset for use in PyTorch models.  

Because the full CheXpert dataset is very large, we focus on:
- Using the **CheXpert-v1.0-small** subset
- Copying a **manageable subset of images** into Google Colab
- Filtering image metadata to ensure consistency between images and labels
- Preparing the dataset for training

By the end of this tutorial, we will have a clean PyTorch `Dataset` class that can be used directly for model training.

## Dataset Overview

CheXpert is a large-scale chest X-ray dataset containing over 200,000 images annotated for 14 clinical observations.  
Each image is associated with metadata stored in CSV files, including:
- Patient and study identifiers
- View position (frontal or lateral)
- Labels indicating the presence, absence, or uncertainty of various pathologies

The original dataset is available at https://stanfordaimi.azurewebsites.net/datasets/8cbd9ed4-2eb9-4565-affc-111cf4f7ebe2, but it is extremely large. For this tutorial, we use the smaller CheXpert-v1.0-small version hosted on Kaggle.

In this tutorial we focus on:
- **Frontal chest X-rays only**
- A **single binary label (Pneumonia)** for simplicity
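
To make the label convention concrete, the sketch below builds a tiny in-memory table in the CheXpert CSV layout and applies the two restrictions above. The rows are made up for illustration; only the column names (`Path`, `Frontal/Lateral`, `Pneumonia`) and the label values (1.0 positive, 0.0 negative, -1.0 uncertain, blank for not mentioned) follow the CheXpert CSV format. Treating uncertain and blank labels as negative is one simple policy, assumed here because it matches the dataset class defined later in this tutorial.

```python
import io
import pandas as pd

# Made-up rows in the CheXpert CSV layout (illustration only).
toy_csv = io.StringIO(
    "Path,Frontal/Lateral,Pneumonia\n"
    "CheXpert-v1.0-small/train/patient00001/study1/view1_frontal.jpg,Frontal,1.0\n"
    "CheXpert-v1.0-small/train/patient00001/study1/view2_lateral.jpg,Lateral,1.0\n"
    "CheXpert-v1.0-small/train/patient00002/study1/view1_frontal.jpg,Frontal,-1.0\n"
    "CheXpert-v1.0-small/train/patient00003/study1/view1_frontal.jpg,Frontal,\n"
)
toy = pd.read_csv(toy_csv)

# Restriction 1: frontal views only.
frontal = toy[toy["Frontal/Lateral"] == "Frontal"].copy()

# Restriction 2: one binary label. Only an explicit 1.0 counts as positive;
# uncertain (-1.0) and blank (NaN) entries become 0.
frontal["PneumoniaBinary"] = (frontal["Pneumonia"] == 1.0).astype(int)

print(frontal[["Frontal/Lateral", "Pneumonia", "PneumoniaBinary"]])
```

Other policies for uncertain labels (for example, mapping -1 to positive) are possible; the choice here is only for simplicity.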


## Downloading the Dataset

We download the CheXpert small dataset from Kaggle using the Kaggle API.  
Google Colab is used to avoid local installation issues and to enable GPU acceleration later.

This step requires:
- A Kaggle account
- A `kaggle.json` API token placed in Google Drive
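
The Kaggle command-line tool looks for the token at `~/.kaggle/kaggle.json` by default, so the file stored in Google Drive has to be copied there before the download cell can run. A minimal sketch of that step follows; the helper name `install_kaggle_token` and the Drive path in the usage comment are hypothetical and should be adapted to your setup.

```python
import os
import shutil
from pathlib import Path

def install_kaggle_token(token_src, config_dir=None):
    """Copy kaggle.json into the Kaggle config directory and restrict its permissions.

    token_src:  path to the kaggle.json file (e.g. somewhere in Google Drive)
    config_dir: target directory; defaults to ~/.kaggle, the CLI's default location
    """
    if config_dir is None:
        config_dir = Path.home() / ".kaggle"
    config_dir = Path(config_dir)
    config_dir.mkdir(parents=True, exist_ok=True)

    dst = config_dir / "kaggle.json"
    shutil.copy(token_src, dst)

    # Make the token readable by the owner only; the Kaggle CLI warns otherwise.
    os.chmod(dst, 0o600)
    return dst

# Usage (hypothetical path, run after Google Drive is mounted):
# install_kaggle_token("/content/drive/MyDrive/kaggle.json")
```

If you prefer to keep the token elsewhere, setting the `KAGGLE_CONFIG_DIR` environment variable to the directory containing `kaggle.json` is an alternative.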

In [1]:
import os

from google.colab import drive
drive.mount('/content/drive')

# Set GOOGLE_DRIVE_ROOT in configs/paths.py to your target directory in Google Drive
from configs.paths import GOOGLE_DRIVE_ROOT

# Download the dataset only if it is not already present in Google Drive
if os.path.exists(os.path.join(GOOGLE_DRIVE_ROOT, "CheXpert-v1.0-small")):
    print("CheXpert already downloaded: skipping.")
else:
    print("CheXpert not found: downloading")
    !kaggle datasets download -d ashery/chexpert
    # Extracts into ./chexpert in the current working directory
    !unzip chexpert.zip -d chexpert


## Upload a Subset to Colab

We copy the data from Google Drive to Colab's local filesystem so that training is not slowed down by Drive I/O. The CheXpert dataset is too large to copy in full.

To enable fast experimentation, we copy only a **subset of patient training data folders** into the Colab filesystem. Feel free to experiment with the size of the subset you copy.

Key steps:
- Copy data in batches to avoid I/O errors
- Skip files that already exist
- Retry failed transfers
- Keep the original directory structure intact

This allows PyTorch to load images efficiently during training.

In [None]:
# modify paths in configs/paths.py to yours if needed
from configs.paths import (GOOGLE_DRIVE_CHEXPERT_VALID_DIR,
                           GOOGLE_DRIVE_CHEXPERT_TRAIN_CSV,
                           GOOGLE_DRIVE_CHEXPERT_VALID_CSV,
                           CHEXPERT_VALID_DIR,
                           CHEXPERT_TRAIN_CSV,
                           CHEXPERT_VALID_CSV)

!rsync -avh --ignore-errors {GOOGLE_DRIVE_CHEXPERT_VALID_DIR} {CHEXPERT_VALID_DIR}
!rsync -avh --ignore-errors {GOOGLE_DRIVE_CHEXPERT_TRAIN_CSV} {CHEXPERT_TRAIN_CSV}
!rsync -avh --ignore-errors {GOOGLE_DRIVE_CHEXPERT_VALID_CSV} {CHEXPERT_VALID_CSV}

In [None]:
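import subprocess
import os
import time

# Source (Drive) and destination (Colab) training directories;
# assumed to be defined in configs/paths.py alongside the other paths
from configs.paths import GOOGLE_DRIVE_CHEXPERT_TRAIN_DIR, CHEXPERT_TRAIN_DIR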

# Modify if needed 
BATCH_SIZE = 50
TOTAL = 2500 

def batch_copy_patients(src, dst, batch_size, total):
    os.makedirs(dst, exist_ok=True)

    patients = sorted(os.listdir(src))[:total]
    print(f"Found {len(patients)} patient folders. Starting batch copy...")

    for i in range(0, len(patients), batch_size):
        batch = patients[i:i + batch_size]
        print(f"\n=== Copying batch {i//batch_size + 1} ({i} to {i+len(batch)-1}) ===")

        for p in batch:
            src_path = os.path.join(src, p)
            dst_path = os.path.join(dst, p)

            if os.path.exists(dst_path):
                print(f"{p} already exists, skipping")
                continue

            retries = 3
            for attempt in range(1, retries + 1):
                print(f"Copying {p} (attempt {attempt})...")

                result = subprocess.run(
                    ["rsync", "-a", src_path, dst],
                    stdout=subprocess.PIPE,
                    stderr=subprocess.PIPE,
                    text=True
                )

                if result.returncode == 0:
                    print(f"Finished {p}")
                    break
                else:
                    print(f"Error copying {p}: {result.stderr.strip()}")
                    time.sleep(5)

            if result.returncode != 0:
                print(f"Failed to copy {p} after {retries} attempts.")


batch_copy_patients(
        GOOGLE_DRIVE_CHEXPERT_TRAIN_DIR,
        CHEXPERT_TRAIN_DIR, 
        BATCH_SIZE,
        TOTAL)
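
After the copy finishes, it can be useful to confirm how much data actually arrived. The helper below is a small sketch, not part of the original pipeline: `summarize_copy` is a hypothetical name, and it simply counts top-level patient folders and `.jpg` files under a directory.

```python
import os

def summarize_copy(root):
    """Return (number of top-level folders, number of .jpg files) under root."""
    patients = [
        d for d in os.listdir(root)
        if os.path.isdir(os.path.join(root, d))
    ]
    n_images = sum(
        1
        for _, _, files in os.walk(root)
        for f in files
        if f.lower().endswith(".jpg")
    )
    return len(patients), n_images

# Usage (CHEXPERT_TRAIN_DIR is the destination directory used above):
# n_patients, n_images = summarize_copy(CHEXPERT_TRAIN_DIR)
# print(f"{n_patients} patient folders, {n_images} images")
```

If the number of patient folders is lower than `TOTAL`, some transfers failed after all retries and the copy cell can be run again; folders that already exist are skipped.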

## Synchronizing Image Files and Metadata (.csv files)

CheXpert stores labels in CSV files that reference image paths.  
After copying only a subset of images, many CSV entries point to files that were not copied. The CSV must match the contents of the training directory on Colab; otherwise the `Dataset` class will try to open images that do not exist.

Additionally, we train on frontal views only. Frontal and lateral X-rays look very different, so restricting to one view keeps the image distribution consistent.

To ensure consistency:
- We scan the copied image directory to collect valid image paths
- We filter the CSV to keep only rows corresponding to existing images
- We keep **frontal views only** 
- We generate a new `train_subset.csv` file

In [None]:
def collect_valid_paths(root):
    valid = set()
    for root_dir, _, files in os.walk(root):
        for f in files:
            if f.endswith(".jpg"):
                rel = os.path.relpath(os.path.join(root_dir, f), root)
                valid.add(rel)
    return valid
    
valid_paths = collect_valid_paths(CHEXPERT_TRAIN_DIR)

In [None]:
import pandas as pd

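# modify in configs/paths.py if needed
from configs.paths import (CHEXPERT_TRAIN_DIR,
                           CHEXPERT_TRAIN_CSV,
                           CHEXPERT_TRAIN_SUBSET_CSV)

# The CSV paths are normalized relative to the train directory below,
# so the images must be scanned from that same directory.
def prepare_train_csv(
    image_root = CHEXPERT_TRAIN_DIR,
    csv_in     = CHEXPERT_TRAIN_CSV,
    csv_out    = CHEXPERT_TRAIN_SUBSET_CSV
):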
    print("Collecting valid image paths...")
    valid_paths = collect_valid_paths(image_root)

    print("Loading CSV...")
    df = pd.read_csv(csv_in).fillna(0)

    # Normalize paths
    df["Path"] = df["Path"].str.replace(
        "CheXpert-v1.0-small/train/", "", regex=False
    )

    # Keep frontal views only
    df = df[df["Path"].str.contains("frontal", na=False)]

    # Keep only images that exist on colab
    df = df[df["Path"].isin(valid_paths)]

    df.to_csv(csv_out, index=False)

    print(f"Filtered dataset size: {len(df)}")
    # Guard against empty results so the example prints do not raise
    if len(df) > 0:
        print("Example CSV path:", df["Path"].iloc[0])
    if valid_paths:
        print("Example valid path:", next(iter(valid_paths)))

prepare_train_csv()

Filtered dataset size: 0
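
If the filtered dataset size is 0, the most likely cause is that the paths in the CSV and the paths collected from disk are relative to different directories (for example, one side still carries a `train/` prefix). The sketch below is a hypothetical diagnostic, not part of the original pipeline: it compares the two path collections exactly and then by their last three components (`patient/study/file`), which reveals a pure prefix mismatch.

```python
def tail(path, n=3):
    """Return the last n components of a path, using '/' as the separator."""
    return "/".join(path.replace("\\", "/").split("/")[-n:])

def diagnose_path_mismatch(csv_paths, valid_paths):
    """Compare CSV paths with on-disk paths, exactly and by trailing components."""
    csv_paths, valid_paths = set(csv_paths), set(valid_paths)
    exact = len(csv_paths & valid_paths)
    by_tail = len({tail(p) for p in csv_paths} & {tail(p) for p in valid_paths})
    return {
        "exact_matches": exact,
        "tail_matches": by_tail,
        # No exact matches but matching tails: the prefixes differ.
        "prefix_mismatch": exact == 0 and by_tail > 0,
    }

# Toy inputs (made up): the on-disk side carries an extra "train/" prefix.
csv_example = ["patient00001/study1/view1_frontal.jpg"]
disk_example = ["train/patient00001/study1/view1_frontal.jpg"]
result = diagnose_path_mismatch(csv_example, disk_example)
print(result)
```

When `prefix_mismatch` is true, either scan the images from the directory that the CSV paths are relative to, or adjust the prefix that is stripped from the `Path` column.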


## PyTorch Dataset Class

We define a custom PyTorch `Dataset` class to:
- Load data
- Apply basic preprocessing transforms
- Return `(image, label)` pairs suitable for training

Design choices:
- Grayscale images (single channel)
- Simple resizing and tensor conversion
- Handle missing files by printing a warning and falling back to the next sample


In [None]:
from torch.utils.data import Dataset
from PIL import Image
import os
import torch
import pandas as pd
from torchvision import transforms

class CheXpertDataset(Dataset):
    def __init__(self, csv_path, root, label_name="Pneumonia"):
        self.df = pd.read_csv(csv_path)

        # Keep only rows with valid image paths (this must run before fillna,
        # which would otherwise turn missing paths into 0)
        self.df = self.df[self.df['Path'].notna()]
        self.df = self.df.fillna(0)

        self.root = root
        self.label_name = label_name

        self.transform = transforms.Compose([
            transforms.Resize((128, 128)),
            transforms.ToTensor()
        ])

    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx):
        row = self.df.iloc[idx]

        rel_path = row["Path"].replace("CheXpert-v1.0-small/", "")
        img_path = os.path.join(self.root, rel_path)

        # Skip missing images
        if not os.path.exists(img_path):
            print(f"Missing image: {img_path}")
            return self.__getitem__((idx + 1) % len(self))

        img = Image.open(img_path).convert("L")
        img = self.transform(img)

        # Labels are -1 (uncertain), 0, 1 in CheXpert: map 1 -> 1 and everything else -> 0
        y = torch.tensor([1.0 if row[self.label_name] == 1 else 0.0], dtype=torch.float32)

        return img, y
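
To show how such a dataset plugs into training, the sketch below wraps a dataset in a PyTorch `DataLoader` and pulls one batch. So that it runs without any image files, it uses a hypothetical stand-in class, `RandomXrayDataset`, which returns random tensors with the same shapes as `CheXpertDataset` (a 1x128x128 image and a one-element label). The batch size of 4 is an arbitrary choice.

```python
import torch
from torch.utils.data import Dataset, DataLoader

class RandomXrayDataset(Dataset):
    """Hypothetical stand-in that mimics the output shapes of CheXpertDataset."""

    def __init__(self, n=10):
        self.n = n

    def __len__(self):
        return self.n

    def __getitem__(self, idx):
        img = torch.rand(1, 128, 128)  # grayscale image, values in [0, 1)
        y = torch.tensor([float(idx % 2)], dtype=torch.float32)  # fake binary label
        return img, y

loader = DataLoader(RandomXrayDataset(), batch_size=4, shuffle=True)

# Each batch stacks the samples along a new leading dimension.
images, labels = next(iter(loader))
print(images.shape, labels.shape)
```

With real data, the stand-in would be replaced by something like `CheXpertDataset(CHEXPERT_TRAIN_SUBSET_CSV, CHEXPERT_TRAIN_DIR)`, assuming the subset CSV stores paths relative to the training directory.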




## Summary

At this stage, we have:
- Downloaded and subsetted the CheXpert dataset
- Ensured consistency between image files and CSV metadata
- Filtered to frontal views only
- Created a PyTorch-compatible dataset class

This dataset will be used in subsequent tutorials for:
- Model definition
- Training conditional variational autoencoders
- Evaluating reconstruction quality
