# Project Summary

## Dataset & Goal
- Dataset Source: RSNA Pneumonia Detection Challenge (Kaggle), organized by the Radiological Society of North America.
- Total Images: 26,684 unique chest X-ray images.
- Model Goal: Multi-class classification (3 classes), with a planned fallback to binary classification if performance is poor.
- Image Path Root: The images are unzipped into the Colab runtime under /content/train_images/stage_2_train_images/.

## Class Definitions & Mapping

| Class Name | Label (Target) | Pathological Status | Role in Classification |
| :--- | :--- | :--- | :--- |
| Normal | 0 | Healthy | Negative (healthy) |
| Lung Opacity | 1 | Pneumonia Present | Positive (pneumonia) |
| No Lung Opacity / Not Normal | 2 | Other Diseases/Issues | Hard negative (abnormal, but not pneumonia) |
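
If the three-class model underperforms and the binary fallback is used, the labels in the table above could be collapsed as sketched below. The names `THREE_TO_BINARY` and `to_binary` are illustrative and not part of the notebook; the sketch assumes that both non-pneumonia classes become the negative class.

```python
# Hypothetical helper: collapse the 3-class labels into pneumonia vs. not.
# Assumption: "Normal" (0) and "No Lung Opacity / Not Normal" (2) both become 0.
THREE_TO_BINARY = {0: 0, 1: 1, 2: 0}

def to_binary(label: int) -> int:
    """Map a 3-class label (0, 1, 2) to a binary pneumonia label (0, 1)."""
    return THREE_TO_BINARY[label]

three_class_labels = [0, 1, 2, 2, 1]
binary_labels = [to_binary(y) for y in three_class_labels]
print(binary_labels)
```

With pandas, the same dictionary could be applied to a label column via `.map(THREE_TO_BINARY)`.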

# Data Preparation

## Load Datasets

In [None]:
# Mounting Google Drive
from google.colab import drive
drive.mount('/content/drive')

# Installing pydicom (needed for medical images)
!pip install pydicom

# Unzipping images into the local Colab environment (FAST)
!unzip -q "/content/drive/My Drive/STAT362 Final Project_RSNA/images.zip" -d "/content/train_images"

Mounted at /content/drive
Collecting pydicom
  Downloading pydicom-3.0.1-py3-none-any.whl.metadata (9.4 kB)
Downloading pydicom-3.0.1-py3-none-any.whl (2.4 MB)
   2.4/2.4 MB 46.0 MB/s eta 0:00:00
Installing collected packages: pydicom
Successfully installed pydicom-3.0.1


In [None]:
import pandas as pd

# Loading the detailed class info
detailed_class = pd.read_csv('/content/drive/My Drive/STAT362 Final Project_RSNA/stage_2_detailed_class_info.csv')
labels = pd.read_csv('/content/drive/My Drive/STAT362 Final Project_RSNA/stage_2_train_labels.csv')

## EDA

In [3]:
detailed_class.head()

Unnamed: 0,patientId,class
0,0004cfab-14fd-4e49-80ba-63a80b6bddd6,No Lung Opacity / Not Normal
1,00313ee0-9eaa-42f4-b0ab-c148ed3241cd,No Lung Opacity / Not Normal
2,00322d4d-1c29-4943-afc9-b6754be640eb,No Lung Opacity / Not Normal
3,003d8fa0-6bf1-40ed-b54c-ac657f8495c5,Normal
4,00436515-870c-4b36-a041-de91049b9ab4,Lung Opacity


In [4]:
labels.head()

Unnamed: 0,patientId,x,y,width,height,Target
0,0004cfab-14fd-4e49-80ba-63a80b6bddd6,,,,,0
1,00313ee0-9eaa-42f4-b0ab-c148ed3241cd,,,,,0
2,00322d4d-1c29-4943-afc9-b6754be640eb,,,,,0
3,003d8fa0-6bf1-40ed-b54c-ac657f8495c5,,,,,0
4,00436515-870c-4b36-a041-de91049b9ab4,264.0,152.0,213.0,379.0,1


In [5]:
# Merge the two datasets to explore the relationship between class and target
merge_df = pd.merge(detailed_class, labels[['patientId', 'Target']], on='patientId')
merge_df.head()

Unnamed: 0,patientId,class,Target
0,0004cfab-14fd-4e49-80ba-63a80b6bddd6,No Lung Opacity / Not Normal,0
1,00313ee0-9eaa-42f4-b0ab-c148ed3241cd,No Lung Opacity / Not Normal,0
2,00322d4d-1c29-4943-afc9-b6754be640eb,No Lung Opacity / Not Normal,0
3,003d8fa0-6bf1-40ed-b54c-ac657f8495c5,Normal,0
4,00436515-870c-4b36-a041-de91049b9ab4,Lung Opacity,1


In [6]:
merge_df.groupby(by=['class', 'Target']).count()

class,Target,patientId
Lung Opacity,1,16957
No Lung Opacity / Not Normal,0,11821
Normal,0,8851


Each class maps to exactly one Target value: 'Lung Opacity' always has Target 1 (pneumonia), while 'Normal' and 'No Lung Opacity / Not Normal' always have Target 0 (no pneumonia). Because the two Target-0 classes are clinically distinct, we will train a three-class CNN to distinguish healthy lungs, pneumonia, and other lung abnormalities.
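
One caveat about the grouped table above: its counts are rows of the merged frame, not patients. Both CSV files appear to hold one row per bounding box, so a patient with k boxes would occur k times in each file and k × k times after a merge on `patientId`. The sketch below reproduces the effect with made-up IDs (`p1`, `p2`, `p3` are placeholders, and `merged_row_count` is an illustrative helper) to show how the merged row count follows from per-patient counts.

```python
from collections import Counter

# Toy stand-ins for the 'patientId' column of each CSV (IDs are made up).
# p2 has two bounding boxes, so it appears twice in both files.
class_ids = ["p1", "p2", "p2", "p3"]
label_ids = ["p1", "p2", "p2", "p3"]

def merged_row_count(left_ids, right_ids):
    """Rows produced by an inner join on the ID: sum of per-ID count products."""
    left, right = Counter(left_ids), Counter(right_ids)
    return sum(left[pid] * right[pid] for pid in left if pid in right)

rows_after_merge = merged_row_count(class_ids, label_ids)
unique_patients = len(set(class_ids))
print(rows_after_merge, unique_patients)
```

Counting unique patients per class, for example after `drop_duplicates(subset=['patientId'])`, avoids the inflation.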

## Process Labels (The Multi-Class Logic)

In [None]:
# REMOVING DUPLICATES
# We only need one label per patientId
detailed_class = detailed_class.drop_duplicates(subset=['patientId'])

# DEFINING 3-CLASS MAPPING
# Normal = 0, Pneumonia = 1, Other Disease = 2
class_mapping = {
    'Normal': 0,
    'Lung Opacity': 1,
    'No Lung Opacity / Not Normal': 2
}

detailed_class['target'] = detailed_class['class'].map(class_mapping)

# Creating the file path column
# The patientId in CSV does not have .dcm extension, so we add it
detailed_class['path'] = detailed_class['patientId'].apply(lambda x: f"/content/train_images/stage_2_train_images/{x}.dcm")

print(f"Total unique images: {len(detailed_class)}")
print(detailed_class['target'].value_counts())

Total unique images: 26684
target
2    11821
0     8851
1     6012
Name: count, dtype: int64
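
The counts above are moderately imbalanced, which matters both for interpreting accuracy and for the loss. The sketch below computes the majority-class baseline and inverse-frequency class weights from the printed counts. The weighting scheme (N / (K · n_c)) is one common choice and an assumption here, not something the notebook uses.

```python
# Class counts copied from the value_counts() output above.
counts = {0: 8851, 1: 6012, 2: 11821}

total = sum(counts.values())
n_classes = len(counts)

# Accuracy obtained by always predicting the most frequent class.
majority_baseline = max(counts.values()) / total

# Inverse-frequency weights: rarer classes get a larger weight.
class_weights = {c: total / (n_classes * n) for c, n in counts.items()}

print(f"Majority baseline: {majority_baseline:.4f}")
for c in sorted(class_weights):
    print(f"class {c}: weight {class_weights[c]:.3f}")
```

Such weights could be passed to the loss as a float tensor ordered by class index, via the `weight` argument of `nn.CrossEntropyLoss`.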


## The Custom Dataset Class

This is the core of the data pipeline. The class below tells PyTorch how to read a DICOM file and convert it into a tensor the model can consume.

In [8]:
import torch
from torch.utils.data import Dataset, DataLoader
import pydicom
from PIL import Image
import torchvision.transforms as transforms

class RSNADataset(Dataset):
    def __init__(self, dataframe, transform=None):
        self.dataframe = dataframe
        self.transform = transform

    def __len__(self):
        return len(self.dataframe)

    def __getitem__(self, idx):
        # Get file path and label
        img_path = self.dataframe.iloc[idx]['path']
        label = self.dataframe.iloc[idx]['target']

        # Read DICOM file
        dcm_data = pydicom.dcmread(img_path)

        # Extract image data (pixels)
        image = dcm_data.pixel_array

        # Convert to PIL Image (Standard for PyTorch transforms)
        image = Image.fromarray(image).convert("RGB")

        # Apply transformations (Resize, Normalize, etc.)
        if self.transform:
            image = self.transform(image)

        return image, torch.tensor(label, dtype=torch.long)
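
`Image.fromarray(...).convert("RGB")` works directly when `pixel_array` is 8-bit. For DICOM files stored at a higher bit depth, the array would likely need rescaling first. The helper below is a minimal sketch of min-max scaling; `to_uint8` is a hypothetical name, and whether this dataset needs it is an assumption to check against the actual `dtype` of `pixel_array`.

```python
import numpy as np

def to_uint8(pixels: np.ndarray) -> np.ndarray:
    """Min-max scale a grayscale array of any bit depth to uint8 in [0, 255]."""
    pixels = pixels.astype(np.float32)
    lo, hi = float(pixels.min()), float(pixels.max())
    if hi == lo:
        # Constant image: avoid division by zero and return all zeros.
        return np.zeros(pixels.shape, dtype=np.uint8)
    scaled = (pixels - lo) / (hi - lo) * 255.0
    return np.round(scaled).astype(np.uint8)

# Made-up 12-bit style values, for illustration only.
example = np.array([[0, 1024], [2048, 4095]], dtype=np.uint16)
converted = to_uint8(example)
print(converted.dtype, converted.min(), converted.max())
```

Inside `__getitem__`, such a helper would be applied to `dcm_data.pixel_array` before the call to `Image.fromarray`.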

In [None]:
# Defining Transforms
# We must resize images (e.g., to 224x224) because raw X-rays are too big
data_transforms = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    # Normalizing using mean/std of ImageNet (standard practice)
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
])
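
`transforms.Normalize` computes `(x - mean) / std` per channel on the `[0, 1]` values produced by `ToTensor`. Because the X-rays are grayscale images replicated to three channels, the same pixel ends up with a slightly different value in each channel. A plain-Python sketch of that arithmetic follows (`normalize_pixel` is an illustrative helper, not a torchvision function):

```python
IMAGENET_MEAN = [0.485, 0.456, 0.406]
IMAGENET_STD = [0.229, 0.224, 0.225]

def normalize_pixel(value: float) -> list:
    """Apply (x - mean) / std per channel to one grayscale value in [0, 1]."""
    return [(value - m) / s for m, s in zip(IMAGENET_MEAN, IMAGENET_STD)]

# A mid-gray pixel: uint8 value 128 becomes 128 / 255 after ToTensor.
mid_gray = 128 / 255
print([round(v, 3) for v in normalize_pixel(mid_gray)])
```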

## Train-Test-Validation Split

In [None]:
from sklearn.model_selection import train_test_split

# 1. Splitting the DataFrame (70% Train, 15% Val, 15% Test)
train_df, temp_df = train_test_split(detailed_class, test_size=0.3, stratify=detailed_class['target'], random_state=42)
val_df, test_df = train_test_split(temp_df, test_size=0.5, stratify=temp_df['target'], random_state=42)

print(f"Train: {len(train_df)}, Val: {len(val_df)}, Test: {len(test_df)}")

# 2. Creating Datasets
train_dataset = RSNADataset(train_df, transform=data_transforms)
val_dataset = RSNADataset(val_df, transform=data_transforms)
test_dataset = RSNADataset(test_df, transform=data_transforms)

# 3. Creating DataLoaders
# Batch size determines how many images you feed the GPU at once
BATCH_SIZE = 32

train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=BATCH_SIZE, shuffle=False)
test_loader = DataLoader(test_dataset, batch_size=BATCH_SIZE, shuffle=False)

# Quick Test: Get one batch
images, labels = next(iter(train_loader))
print(f"Image batch shape: {images.shape}") # Should be [32, 3, 224, 224]
print(f"Labels batch shape: {labels.shape}")

Train: 18678, Val: 4003, Test: 4003
Image batch shape: torch.Size([32, 3, 224, 224])
Labels batch shape: torch.Size([32])
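
The printed split sizes can be checked by hand. The sketch below assumes scikit-learn's rounding rule for a fractional `test_size` (the test portion is rounded up and the remainder goes to the other split), and also derives the number of training batches per epoch. `split_sizes` is an illustrative helper.

```python
import math

def split_sizes(n_samples: int, test_size: float) -> tuple:
    """Return (n_first, n_second), rounding the second (test) portion up."""
    n_second = math.ceil(test_size * n_samples)
    return n_samples - n_second, n_second

n_total = 26684
n_train, n_temp = split_sizes(n_total, 0.3)   # 70% train, 30% held out
n_val, n_test = split_sizes(n_temp, 0.5)      # held-out part split in half

batch_size = 32
batches_per_epoch = math.ceil(n_train / batch_size)
last_batch = n_train - (batches_per_epoch - 1) * batch_size

print(n_train, n_val, n_test)          # 18678 4003 4003
print(batches_per_epoch, last_batch)   # 584 22
```

The last training batch is smaller than `BATCH_SIZE`, which is why the loops weight each batch loss by `images.size(0)`.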


# CNN Model

## Instructions for Building the Model
This section defines and trains a custom CNN. Batches from train_loader are fed to the model, the loss function is nn.CrossEntropyLoss() (appropriate for 3-class classification), and training runs on the GPU (cuda) when one is available. The final layer must output 3 values (logits) to match the target labels (0, 1, 2).

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim

N_CLASSES = 3
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("Using device:", device)

# Defining CNN model
class RSNACNN(nn.Module):
    def __init__(self, num_classes=N_CLASSES):
        super().__init__()

        # Feature extractor: conv + batchnorm + relu + pooling
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),
            nn.BatchNorm2d(16),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2),      # 224 -> 112

            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.BatchNorm2d(32),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2),      # 112 -> 56

            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.BatchNorm2d(64),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2),      # 56 -> 28

            nn.Conv2d(64, 128, kernel_size=3, padding=1),
            nn.BatchNorm2d(128),
            nn.ReLU(),

            # Independent of exact input size
            nn.AdaptiveAvgPool2d((1, 1))      # -> [B, 128, 1, 1]
        )

        # Classifier head
        self.classifier = nn.Linear(128, num_classes)

    def forward(self, x):
        x = self.features(x)
        x = x.view(x.size(0), -1)  # flatten to [B, 128]
        x = self.classifier(x)
        return x


model = RSNACNN().to(device)
print(model)


Using device: cuda
RSNACNN(
  (features): Sequential(
    (0): Conv2d(3, 16, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (1): BatchNorm2d(16, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (2): ReLU()
    (3): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    (4): Conv2d(16, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (5): BatchNorm2d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (6): ReLU()
    (7): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    (8): Conv2d(32, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (9): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (10): ReLU()
    (11): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    (12): Conv2d(64, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (13): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_runni
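
For a sense of scale, the number of trainable parameters can be derived from the layer shapes in the model definition. The sketch counts weights and biases for each `Conv2d`, the two learnable parameters per channel of each `BatchNorm2d`, and the final `Linear` layer; the helper names are illustrative.

```python
def conv_params(c_in: int, c_out: int, k: int = 3) -> int:
    """Weights (c_out * c_in * k * k) plus one bias per output channel."""
    return c_out * c_in * k * k + c_out

def bn_params(channels: int) -> int:
    """Learnable scale and shift per channel (running stats are buffers)."""
    return 2 * channels

def linear_params(n_in: int, n_out: int) -> int:
    """Weight matrix plus one bias per output."""
    return n_in * n_out + n_out

channels = [3, 16, 32, 64, 128]   # input channels, then each conv's output
total_params = 0
for c_in, c_out in zip(channels, channels[1:]):
    total_params += conv_params(c_in, c_out) + bn_params(c_out)
total_params += linear_params(128, 3)   # classifier head: 128 -> 3 classes

print(f"Trainable parameters: {total_params:,}")   # Trainable parameters: 98,307
```

In PyTorch the same figure could be obtained with `sum(p.numel() for p in model.parameters() if p.requires_grad)`.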

In [12]:
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=1e-3)

num_epochs = 10
best_val_acc = 0.0
best_state_dict = None

for epoch in range(num_epochs):
    # Training
    model.train()
    train_loss = 0.0
    train_correct = 0
    train_total = 0

    for images, labels in train_loader:
        images = images.to(device)
        labels = labels.to(device)

        optimizer.zero_grad()
        outputs = model(images)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()

        train_loss += loss.item() * images.size(0)
        preds = outputs.argmax(dim=1)
        train_correct += (preds == labels).sum().item()
        train_total += labels.size(0)

    train_loss = train_loss / train_total
    train_acc = train_correct / train_total

    # Validating
    model.eval()
    val_loss = 0.0
    val_correct = 0
    val_total = 0

    with torch.no_grad():
        for images, labels in val_loader:
            images = images.to(device)
            labels = labels.to(device)

            outputs = model(images)
            loss = criterion(outputs, labels)

            val_loss += loss.item() * images.size(0)
            preds = outputs.argmax(dim=1)
            val_correct += (preds == labels).sum().item()
            val_total += labels.size(0)

    val_loss = val_loss / val_total
    val_acc = val_correct / val_total

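    # Save best model so far (by val accuracy).
    # state_dict() returns references to the live tensors, so clone them;
    # otherwise later epochs overwrite the saved weights in place and the
    # weights restored after training are simply those of the final epoch.
    if val_acc > best_val_acc:
        best_val_acc = val_acc
        best_state_dict = {k: v.detach().clone() for k, v in model.state_dict().items()}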

    print(
        f"Epoch [{epoch+1}/{num_epochs}] "
        f"Train Loss: {train_loss:.4f} | Train Acc: {train_acc:.4f} "
        f"| Val Loss: {val_loss:.4f} | Val Acc: {val_acc:.4f}"
    )

# Loading best weights (based on validation accuracy)
if best_state_dict is not None:
    model.load_state_dict(best_state_dict)
    print(f"Loaded best model with Val Acc = {best_val_acc:.4f}")


Epoch [1/10] Train Loss: 0.9190 | Train Acc: 0.5380 | Val Loss: 0.8816 | Val Acc: 0.5616
Epoch [2/10] Train Loss: 0.8656 | Train Acc: 0.5783 | Val Loss: 0.8983 | Val Acc: 0.5661
Epoch [3/10] Train Loss: 0.8270 | Train Acc: 0.6064 | Val Loss: 0.8151 | Val Acc: 0.6198
Epoch [4/10] Train Loss: 0.7914 | Train Acc: 0.6278 | Val Loss: 0.7846 | Val Acc: 0.6238
Epoch [5/10] Train Loss: 0.7710 | Train Acc: 0.6341 | Val Loss: 0.7554 | Val Acc: 0.6480
Epoch [6/10] Train Loss: 0.7533 | Train Acc: 0.6479 | Val Loss: 0.9883 | Val Acc: 0.5558
Epoch [7/10] Train Loss: 0.7413 | Train Acc: 0.6550 | Val Loss: 0.7150 | Val Acc: 0.6722
Epoch [8/10] Train Loss: 0.7292 | Train Acc: 0.6603 | Val Loss: 0.8229 | Val Acc: 0.6233
Epoch [9/10] Train Loss: 0.7217 | Train Acc: 0.6668 | Val Loss: 0.9053 | Val Acc: 0.5736
Epoch [10/10] Train Loss: 0.7162 | Train Acc: 0.6685 | Val Loss: 1.0009 | Val Acc: 0.5296
Loaded best model with Val Acc = 0.6722
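
The validation metrics above swing widely after epoch 5 while the training loss keeps falling, which suggests stopping on validation loss rather than running a fixed number of epochs. The sketch below replays the logged validation losses through a simple patience rule; `early_stop` and `patience=2` are illustrative choices, not part of the notebook.

```python
def early_stop(val_losses, patience):
    """Return (stop_epoch, best_epoch), both 1-indexed.

    Training stops once `patience` consecutive epochs fail to improve on the
    best validation loss; if that never happens, stop_epoch is the last epoch.
    """
    best_loss, best_epoch, waited = float("inf"), 0, 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                return epoch, best_epoch
    return len(val_losses), best_epoch

# Validation losses copied from the training log above.
logged = [0.8816, 0.8983, 0.8151, 0.7846, 0.7554,
          0.9883, 0.7150, 0.8229, 0.9053, 1.0009]

stop_epoch, best_epoch = early_stop(logged, patience=2)
print(stop_epoch, best_epoch)   # 9 7
```

A small patience is sensitive to noisy epochs such as epoch 6, so the value would need tuning against runs like this one.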


In [13]:
# Evaluating on test dataset

model.eval()
test_correct = 0
test_total = 0
test_loss = 0.0

with torch.no_grad():
    for images, labels in test_loader:
        images = images.to(device)
        labels = labels.to(device)

        outputs = model(images)
        loss = criterion(outputs, labels)

        test_loss += loss.item() * images.size(0)
        preds = outputs.argmax(dim=1)
        test_correct += (preds == labels).sum().item()
        test_total += labels.size(0)

test_loss = test_loss / test_total
test_acc = test_correct / test_total

print("======================================")
print("  Test Loss:     {:.4f}".format(test_loss))
print("  Test Accuracy: {:.4f}".format(test_acc))
print("======================================")


  Test Loss:     1.0118
  Test Accuracy: 0.5266
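
Overall accuracy does not show which classes are being confused, which matters here because 'No Lung Opacity / Not Normal' is meant to act as a hard negative. A confusion matrix and per-class recall could be computed from the collected predictions as sketched below; the label lists are made-up placeholders, not results from this model.

```python
def confusion_matrix(y_true, y_pred, n_classes=3):
    """matrix[t][p] counts samples with true class t predicted as class p."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix

def per_class_recall(matrix):
    """Fraction of each true class predicted correctly (0.0 if class absent)."""
    return [row[i] / sum(row) if sum(row) else 0.0 for i, row in enumerate(matrix)]

# Placeholder labels for illustration only.
y_true = [0, 0, 1, 1, 2, 2, 2, 2]
y_pred = [0, 2, 1, 1, 2, 0, 2, 1]

cm = confusion_matrix(y_true, y_pred)
print(cm)                      # [[1, 0, 1], [0, 2, 0], [1, 1, 2]]
print(per_class_recall(cm))    # [0.5, 1.0, 0.5]
```

In the test loop, `preds.cpu().tolist()` and `labels.cpu().tolist()` could be appended to the two lists batch by batch.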
