# DATASCI 315, Groupwork 9: Kaggle Galaxy Challenge with CNNs

In this groupwork assignment, you will continue the task of inferring the number of galaxies in an image. The data is available on Kaggle (link below). Your task is to build a convolutional neural network (CNN) and try to achieve the highest accuracy possible on the test set, the labels for which are hidden from you.

To submit your work, upload the `html` output from executing this notebook to Canvas, **and** submit a `csv` file with your test-set predictions to the Kaggle competition.

The `csv` file should have two columns: `id`, the index of the image (starting at 0), and `label`, the predicted number of galaxies in the image. Code for generating the prediction `csv` is provided in this file.
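
As an illustration of that format, here is a minimal sketch that writes three made-up predictions and checks the result. The rows and the `check_submission` helper are hypothetical and not part of the assignment code; the check assumes labels must be integers from 0 to 6.

```python
import csv
import io

# Hypothetical predictions for three test images (illustrative values only).
example_rows = [("0", 2), ("1", 0), ("2", 5)]

buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(["id", "label"])  # header row
writer.writerows(example_rows)
submission_text = buffer.getvalue()
print(submission_text)


def check_submission(text, num_classes=7):
    """Hypothetical helper: True if ids run 0..n-1 and labels are valid classes."""
    rows = list(csv.DictReader(io.StringIO(text)))
    if not rows or set(rows[0].keys()) != {"id", "label"}:
        return False
    ids_ok = [row["id"] for row in rows] == [str(i) for i in range(len(rows))]
    labels_ok = all(
        row["label"].isdigit() and int(row["label"]) < num_classes for row in rows
    )
    return ids_ok and labels_ok


print(check_submission(submission_text))
```

Running a check like this on the text of your own file before uploading can save one of your limited Kaggle submissions.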

**Kaggle competition link:** [https://www.kaggle.com/t/c7bb7892c2774d61af49f788398b2eec](https://www.kaggle.com/t/c7bb7892c2774d61af49f788398b2eec)

**Reference:** [PyTorch CNN Tutorial](https://pytorch.org/tutorials/beginner/blitz/cifar10_tutorial.html)

## Getting Started

### Import Relevant Packages and Initialize

In [None]:
import matplotlib.pyplot as plt
import pandas as pd
import torch
from torch import nn, optim
from torch.optim.lr_scheduler import CosineAnnealingLR
from torch.utils.data import DataLoader, Dataset

plt.rcParams["axes.grid"] = False

In [None]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

### Load the Data

Follow these steps to load the data:

1. Download the following files from Kaggle:
   - `train_dataset_0.0125.pt`
   - `validation_dataset_0.0125.pt`: optional validation set, **not** used for ranking
   - `test_images_0.0125.pt`: image-only set used for the competition
2. If you're using Google Colab, go to the `Files` tab on the left.
3. Create a directory named `data` (or name it something else and change the path below).
4. Upload the dataset files to this directory either by clicking the `Upload` button or dragging the files to the directory.

Alternatively, you may place the dataset in your Google Drive for persistent storage, and connect to it with the following code:
```python
from google.colab import drive
drive.mount('/content/drive')
```
The following code cell loads the data. Update the file paths if you stored the files somewhere other than `data/`.

In [None]:
train_images, train_counts = torch.load("data/train_dataset_0.0125.pt", weights_only=True)
val_images, val_counts = torch.load("data/validation_dataset_0.0125.pt", weights_only=True)

print(f"Training set: {train_images.shape[0]} images")
print(f"Validation set: {val_images.shape[0]} images")
print(f"Image dimensions: {train_images.shape[1:]}")

Since we will be using CNNs, we need to add a channel dimension to the images. PyTorch expects images in the format `(batch, channels, height, width)`.

In [None]:
train_images = train_images.unsqueeze(1).to(device)
val_images = val_images.unsqueeze(1).to(device)
train_counts = train_counts.to(device)
val_counts = val_counts.to(device)

print(f"Training images shape: {train_images.shape}")
print(f"Validation images shape: {val_images.shape}")

### Inspect the Data

Let's display a random image from each of the training and validation sets, along with its galaxy count:

In [None]:
idx = torch.randint(len(train_images), (1,)).item()
plt.imshow(train_images[idx].squeeze().cpu(), cmap="gray")
plt.title(f"Training image - Number of galaxies: {train_counts[idx].item()}")
plt.show()

In [None]:
idx = torch.randint(len(val_images), (1,)).item()
plt.imshow(val_images[idx].squeeze().cpu(), cmap="gray")
plt.title(f"Validation image - Number of galaxies: {val_counts[idx].item()}")
plt.show()

Let's examine the distribution of galaxy counts in our training data:

In [None]:
unique_counts, frequencies = torch.unique(train_counts, return_counts=True)
plt.bar(unique_counts.cpu().numpy(), frequencies.cpu().numpy())
plt.xlabel("Number of galaxies")
plt.ylabel("Frequency")
plt.title("Distribution of galaxy counts in training data")
plt.show()
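
The class distribution also gives a useful reference point: a classifier that always predicts the most common count reaches an accuracy equal to that class's share of the data. A minimal sketch, using a hypothetical list of labels rather than the real `train_counts`:

```python
from collections import Counter

# Hypothetical galaxy counts for ten images (illustrative only, not the real data).
example_counts = [0, 1, 1, 2, 2, 2, 3, 2, 1, 6]

count_frequencies = Counter(example_counts)
majority_class, majority_frequency = count_frequencies.most_common(1)[0]
baseline_accuracy = majority_frequency / len(example_counts)

print(f"Most common count: {majority_class}")
print(f"Majority-class baseline accuracy: {baseline_accuracy:.0%}")
```

A trained model should comfortably beat this baseline; if it does not, it has probably learned little beyond the class frequencies.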

### Problem 1: Define the Model Architecture

We will predict the number of galaxies in each image using a CNN. The input to the model will be the image, and the output will be the number of galaxies. Since the galaxy count is a discrete variable (0, 1, 2, ..., 6), we treat this as a classification problem where each count is a class.

Design a CNN architecture that you think will work well. Your model must:
- Accept input tensors of shape `(batch_size, 1, 50, 50)`
- Output tensors of shape `(batch_size, 7)` (logits for 7 classes)

Consider using:
- Multiple convolutional layers with increasing filter counts
- Batch normalization for stable training
- Pooling layers to reduce spatial dimensions
- Dropout for regularization
- Fully connected layers at the end

**Hint:** A typical CNN architecture might look like: Conv -> ReLU -> Pool -> Conv -> ReLU -> Pool -> Flatten -> FC -> ReLU -> FC
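
When chaining convolution and pooling layers, it helps to track the spatial size so that the first fully connected layer receives the right number of inputs. For a square input of size n with kernel size k, padding p, and stride s, both `nn.Conv2d` and `nn.MaxPool2d` (with default dilation and rounding) produce an output of size floor((n + 2p - k) / s) + 1. The sketch below applies this to a hypothetical three-block layout (3x3 convolutions with padding 1, each followed by 2x2 max pooling). The layer choices and the final channel count are assumptions for illustration, not a required design.

```python
def output_size(size, kernel_size, padding=0, stride=1):
    """Spatial output size of a conv or pooling layer (dilation 1, floor rounding)."""
    return (size + 2 * padding - kernel_size) // stride + 1


size = 50  # input images are 50x50
sizes = [size]
for _ in range(3):
    size = output_size(size, kernel_size=3, padding=1)  # 3x3 conv keeps the size
    size = output_size(size, kernel_size=2, stride=2)   # 2x2 max pool roughly halves it
    sizes.append(size)

final_channels = 128  # hypothetical number of filters in the last conv layer
flattened_features = final_channels * size * size

print(f"Spatial sizes after each block: {sizes}")
print(f"Inputs to the first fully connected layer: {flattened_features}")
```

Note that pooling an odd size such as 25 rounds down, which is easy to miss when computing the flattened dimension by hand.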

In [None]:
# Image dimension and number of classes
IMAGE_DIM = 50
NUM_CLASSES = 7  # 0 through 6 galaxies

In [None]:
# BEGIN SOLUTION
model = nn.Sequential(
    # First conv block: 1 -> 32 channels
    nn.Conv2d(1, 32, kernel_size=3, padding=1),
    nn.BatchNorm2d(32),
    nn.ReLU(),
    nn.MaxPool2d(2),  # 50 -> 25
    # Second conv block: 32 -> 64 channels
    nn.Conv2d(32, 64, kernel_size=3, padding=1),
    nn.BatchNorm2d(64),
    nn.ReLU(),
    nn.MaxPool2d(2),  # 25 -> 12
    # Third conv block: 64 -> 128 channels
    nn.Conv2d(64, 128, kernel_size=3, padding=1),
    nn.BatchNorm2d(128),
    nn.ReLU(),
    nn.MaxPool2d(2),  # 12 -> 6
    # Flatten and fully connected layers (3 FC layers)
    nn.Flatten(),
    nn.Linear(128 * 6 * 6, 512),
    nn.ReLU(),
    nn.Dropout(0.5),
    nn.Linear(512, 128),
    nn.ReLU(),
    nn.Dropout(0.3),
    nn.Linear(128, NUM_CLASSES),
).to(device)
# END SOLUTION

print(model)

In [None]:
# Test assertions
test_input = torch.randn(4, 1, 50, 50).to(device)
test_output = model(test_input)
expected_shape = (4, 7)
assert test_output.shape == expected_shape, f"Output shape should be {expected_shape}"
assert model[0].in_channels == 1, "First conv layer should have 1 input channel"
print("All tests passed!")

# BEGIN HIDDEN TESTS
test_single = torch.randn(1, 1, 50, 50).to(device)
single_output = model(test_single)
assert single_output.shape == (1, 7), "Model should work with batch size 1"
test_batch = torch.randn(32, 1, 50, 50).to(device)
batch_output = model(test_batch)
assert batch_output.shape == (32, 7), "Model should work with batch size 32"
# END HIDDEN TESTS

### Problem 2: Implement the Training Function

Implement the training loop. Your training function should:
1. Reset the model parameters
2. For each epoch:
   - Set model to training mode and iterate through the training data
   - Compute the loss and perform backpropagation
   - Set model to evaluation mode and compute validation loss
3. Return the training and validation losses for plotting

**Hint:** Use `optimizer.zero_grad()` before computing gradients, `loss.backward()` for backpropagation, and `optimizer.step()` to update weights.
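
The call order in the hint can be sketched on a single batch as follows. The tiny linear model, the random batch, and the learning rate are stand-ins chosen only to keep the example self-contained; they are not the model or data for this assignment.

```python
import torch
from torch import nn, optim

torch.manual_seed(0)

# Hypothetical tiny classifier and random batch, just to show the call order.
tiny_model = nn.Sequential(nn.Flatten(), nn.Linear(50 * 50, 7))
tiny_optimizer = optim.SGD(tiny_model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

images = torch.randn(4, 1, 50, 50)
counts = torch.randint(0, 7, (4,))  # integer class labels, as CrossEntropyLoss expects

weights_before = tiny_model[1].weight.detach().clone()

tiny_optimizer.zero_grad()      # clear gradients left over from any previous step
logits = tiny_model(images)     # forward pass: shape (batch, 7)
loss = loss_fn(logits, counts)  # scalar loss for the batch
loss.backward()                 # backpropagation: populate .grad on each parameter
tiny_optimizer.step()           # update the weights using those gradients

weights_changed = not torch.equal(weights_before, tiny_model[1].weight.detach())
print(f"loss={loss.item():.3f}, weights changed: {weights_changed}")
```

`nn.CrossEntropyLoss` takes raw logits and integer class indices, so no softmax layer is needed at the end of the model.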

In [None]:
def reset_model_parameters(model):
    """Re-initialize all model parameters.

    Useful when re-training a model after changing hyperparameters.
    """
    for module in model.modules():
        if hasattr(module, "reset_parameters"):
            module.reset_parameters()

In [None]:
def train(model, optimizer, train_dataloader, val_dataloader, num_epochs=10):
    """Train the model and return training/validation losses."""
    loss_fn = nn.CrossEntropyLoss()
    reset_model_parameters(model)

    # BEGIN SOLUTION
    # Cosine annealing scheduler for smooth LR decay
    scheduler = CosineAnnealingLR(optimizer, T_max=num_epochs)

    train_losses = []
    val_losses = []

    for epoch in range(num_epochs):
        # Training phase
        model.train()
        epoch_train_loss = 0.0
        num_train_batches = 0

        for images, counts in train_dataloader:
            optimizer.zero_grad()
            outputs = model(images)
            loss = loss_fn(outputs, counts)
            loss.backward()
            optimizer.step()
            epoch_train_loss += loss.item()
            num_train_batches += 1

        train_loss = epoch_train_loss / num_train_batches

        # Validation phase
        model.eval()
        epoch_val_loss = 0.0
        num_val_batches = 0

        with torch.no_grad():
            for images, counts in val_dataloader:
                outputs = model(images)
                loss = loss_fn(outputs, counts)
                epoch_val_loss += loss.item()
                num_val_batches += 1

        val_loss = epoch_val_loss / num_val_batches

        # Step the scheduler
        scheduler.step()

        lr = optimizer.param_groups[0]["lr"]
        print(f"Epoch {epoch:2d} train={train_loss:.3f} val={val_loss:.3f} lr={lr:.5f}")

        train_losses.append(train_loss)
        val_losses.append(val_loss)
    # END SOLUTION

    return train_losses, val_losses
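
For intuition about the `CosineAnnealingLR` scheduler used above, the schedule it follows can be written in closed form: lr(t) = eta_min + 0.5 * (base_lr - eta_min) * (1 + cos(pi * t / T_max)). The sketch below evaluates that formula directly in plain Python, assuming `eta_min = 0` (PyTorch's default) and example values for the base learning rate and number of epochs; it does not call the scheduler itself.

```python
import math


def cosine_annealed_lr(epoch, base_lr, t_max, eta_min=0.0):
    """Closed-form cosine annealing: base_lr at epoch 0, eta_min at epoch t_max."""
    return eta_min + 0.5 * (base_lr - eta_min) * (1 + math.cos(math.pi * epoch / t_max))


base_lr, t_max = 0.02, 80  # example values; substitute your own settings
for epoch in (0, 20, 40, 60, 80):
    print(f"epoch {epoch:2d}: lr = {cosine_annealed_lr(epoch, base_lr, t_max):.5f}")
```

The learning rate falls slowly at first, fastest around the midpoint, and flattens out again near the end of training.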

In [None]:
# Test assertions
import inspect

assert callable(train), "train should be a function"
sig = inspect.signature(train)
params = list(sig.parameters.keys())
assert "model" in params, "train should accept a model parameter"
assert "optimizer" in params, "train should accept an optimizer parameter"
print("All tests passed!")

# BEGIN HIDDEN TESTS
assert "train_dataloader" in params, "train should accept train_dataloader"
assert "val_dataloader" in params, "train should accept val_dataloader"
assert "num_epochs" in params, "train should accept num_epochs"
# END HIDDEN TESTS

### Problem 3: Configure Training Hyperparameters

Set up your optimizer, batch size, and data loaders. Common optimizer choices include:
- `optim.Adam`: Adaptive learning rate, often works well with default settings
- `optim.SGD`: Stochastic gradient descent (optionally with momentum), usually needs learning rate tuning

The batch size affects:
- Training stability (larger batches = more stable gradients)
- Memory usage (larger batches = more memory)
- Training speed (larger batches = fewer updates per epoch)

**Hint:** Start with `Adam`, a learning rate around `1e-3` or `1e-4`, and a batch size of 32 or 64.
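
To make the batch-size trade-off concrete, the number of optimizer updates per epoch is the number of batches, ceil(N / batch_size), when the last partial batch is kept (the `DataLoader` default). The dataset size below is a hypothetical placeholder; in the notebook you would use `len(train_images)`.

```python
import math


def updates_per_epoch(num_examples, batch_size):
    """Number of batches (and optimizer steps) in one pass over the data."""
    return math.ceil(num_examples / batch_size)


num_train_images = 10_000  # hypothetical dataset size, for illustration only

for candidate_batch_size in (32, 64, 256):
    steps = updates_per_epoch(num_train_images, candidate_batch_size)
    print(f"batch_size={candidate_batch_size:3d} -> {steps} updates per epoch")
```

Halving the batch size roughly doubles the number of updates per epoch, which is one reason smaller batches often pair with more careful learning rate choices.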

In [None]:
# BEGIN SOLUTION
# Using SGD with momentum and higher LR works well with cosine annealing
learning_rate = 0.02
optimizer = optim.SGD(model.parameters(), lr=learning_rate, momentum=0.9, weight_decay=1e-4)
batch_size = 32  # Smaller batch size for more gradient updates
# END SOLUTION

In [None]:
# Test assertions
assert optimizer is not None, "optimizer must be defined"
assert batch_size > 0, "batch_size must be positive"
print("All tests passed!")

# BEGIN HIDDEN TESTS
assert hasattr(optimizer, "step"), "optimizer must have a step method"
assert hasattr(optimizer, "zero_grad"), "optimizer must have a zero_grad method"
assert batch_size <= 256, "batch_size should not be too large for memory"
# END HIDDEN TESTS

Set up the data loaders:

In [None]:
# BEGIN SOLUTION
# Custom dataset with data augmentation for training
class AugmentedGalaxyDataset(Dataset):
    """Dataset with random augmentations for galaxy images."""

    def __init__(self, images, labels, *, augment=False):
        self.images = images
        self.labels = labels
        self.augment = augment

    def __len__(self):
        return len(self.images)

    def __getitem__(self, idx):
        image = self.images[idx]
        label = self.labels[idx]

        if self.augment:
            # Random horizontal flip
            if torch.rand(1).item() > 0.5:
                image = torch.flip(image, dims=[2])
            # Random vertical flip
            if torch.rand(1).item() > 0.5:
                image = torch.flip(image, dims=[1])
            # Random 90-degree rotation (0, 90, 180, or 270 degrees)
            k = torch.randint(0, 4, (1,)).item()
            image = torch.rot90(image, k, dims=[1, 2])

        return image, label


# Training set with augmentation, validation set without
train_dataset = AugmentedGalaxyDataset(train_images, train_counts, augment=True)
train_dataloader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)

val_dataset = AugmentedGalaxyDataset(val_images, val_counts, augment=False)
val_dataloader = DataLoader(val_dataset, batch_size=batch_size, shuffle=False)
# END SOLUTION
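
The flips and 90-degree rotations used in the augmentation above are all symmetries of a square, so every combination of them yields one of only 8 distinct orientations of an image. The check below enumerates all flip and rotation combinations on a hypothetical 3x3 tensor with distinct pixel values and counts the unique results.

```python
import itertools

import torch

# Hypothetical 1x3x3 "image" with distinct values so every orientation is distinguishable.
image = torch.arange(9.0).reshape(1, 3, 3)

orientations = set()
for hflip, vflip, k in itertools.product([False, True], [False, True], range(4)):
    out = image
    if hflip:
        out = torch.flip(out, dims=[2])  # flip along width
    if vflip:
        out = torch.flip(out, dims=[1])  # flip along height
    out = torch.rot90(out, k, dims=[1, 2])  # rotate by k * 90 degrees
    orientations.add(tuple(out.flatten().tolist()))

print(f"{len(orientations)} distinct orientations from 16 flip/rotation combinations")
```

These transformations preserve the galaxy count, which is what makes them safe label-preserving augmentations for this task.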

### Problem 4: Train the Model and Achieve 90% Validation Accuracy

Train your model, iterating on your architecture and hyperparameters, until you achieve at least **90% accuracy** on the validation set (the assertion cell at the end of this problem checks only for 75%). As you train, watch for:
- **Underfitting:** Both training and validation loss remain high
- **Overfitting:** Training loss decreases but validation loss increases

If your model is not performing well enough, go back to Problems 1-3 and try different:
- Model architectures (more/fewer layers, different filter sizes)
- Training hyperparameters (learning rate, batch size, number of epochs)
- Regularization techniques (dropout rate, weight decay)

**Note:** The validation set is not used for ranking, so you can use it to tune your model. The test set is the one used for ranking, where your accuracy might be lower. Think about whether your model is overfitting!
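
One simple way to spot overfitting in the loss curves is to find the epoch with the lowest validation loss and see how long training continued afterwards. The sketch below uses hypothetical loss values (not real results), and `best_epoch` is an illustrative helper rather than part of the assignment code.

```python
def best_epoch(val_losses):
    """Index of the epoch with the lowest validation loss."""
    return min(range(len(val_losses)), key=lambda epoch: val_losses[epoch])


# Hypothetical loss curves (illustrative numbers, not real results).
example_train_losses = [1.60, 1.10, 0.80, 0.55, 0.40, 0.28, 0.20]
example_val_losses = [1.62, 1.15, 0.90, 0.78, 0.80, 0.86, 0.95]

best = best_epoch(example_val_losses)
epochs_since_best = len(example_val_losses) - 1 - best

print(f"Lowest validation loss at epoch {best}: {example_val_losses[best]:.2f}")
print(f"Epochs trained after the best epoch: {epochs_since_best}")
```

In this made-up example the training loss keeps falling while the validation loss rises after its minimum, which is the overfitting pattern described above.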

In [None]:
# BEGIN SOLUTION
train_losses, val_losses = train(model, optimizer, train_dataloader, val_dataloader, num_epochs=80)
# END SOLUTION

Plot the training and validation losses to diagnose the training process:

In [None]:
plt.figure(figsize=(10, 5))
plt.plot(train_losses, label="Training loss")
plt.plot(val_losses, label="Validation loss")
plt.xlabel("Epoch")
plt.ylabel("Loss")
plt.title("Training and Validation Loss")
plt.legend()
plt.show()

Compute the training and validation accuracy:

In [None]:
# Compute predictions in a single forward pass (evaluate in batches if this runs out of memory)
model.eval()
with torch.no_grad():
    pred_train_counts = model(train_images)
    pred_val_counts = model(val_images)

train_accuracy = (pred_train_counts.argmax(dim=1) == train_counts).float().mean().item()
val_accuracy = (pred_val_counts.argmax(dim=1) == val_counts).float().mean().item()

print(f"Training accuracy: {train_accuracy:.2%}")
print(f"Validation accuracy: {val_accuracy:.2%}")
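
If the single forward pass above runs out of memory, accuracy can be computed in batches instead. The following is a minimal sketch; `batched_accuracy` is a hypothetical helper, and the small linear model and random tensors exist only to make the example runnable on its own.

```python
import torch
from torch import nn


def batched_accuracy(model, images, labels, batch_size=512):
    """Fraction of correct argmax predictions, computed one batch at a time."""
    model.eval()
    correct = 0
    with torch.no_grad():
        for start in range(0, len(images), batch_size):
            batch_images = images[start:start + batch_size]
            batch_labels = labels[start:start + batch_size]
            predictions = model(batch_images).argmax(dim=1)
            correct += (predictions == batch_labels).sum().item()
    return correct / len(images)


# Hypothetical stand-ins for the real model and data.
torch.manual_seed(0)
dummy_model = nn.Sequential(nn.Flatten(), nn.Linear(50 * 50, 7))
dummy_images = torch.randn(10, 1, 50, 50)
dummy_labels = torch.randint(0, 7, (10,))

batched = batched_accuracy(dummy_model, dummy_images, dummy_labels, batch_size=4)
with torch.no_grad():
    full = (dummy_model(dummy_images).argmax(dim=1) == dummy_labels).float().mean().item()

print(f"Batched accuracy: {batched:.2%}, single-pass accuracy: {full:.2%}")
```

In the notebook, the equivalent call would be something like `batched_accuracy(model, val_images, val_counts)`.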

In [None]:
# Test assertions
assert train_losses is not None, "train_losses should be defined after training"
assert val_losses is not None, "val_losses should be defined after training"
assert len(train_losses) > 0, "train_losses should not be empty"
assert val_accuracy >= 0.75, f"Need >= 75% validation accuracy, got {val_accuracy:.2%}"
print("All tests passed!")

# BEGIN HIDDEN TESTS
assert len(train_losses) == len(val_losses), "losses should have same length"
assert all(loss >= 0 for loss in train_losses), "all losses should be non-negative"
# END HIDDEN TESTS

## Generate Kaggle Submission

Once you are happy with your model's performance, run the code below to generate the submission file for the Kaggle competition.

**Note:** You are limited to 3 submissions per day, so be strategic with your submissions.

In [None]:
test_images = torch.load("data/test_images_0.0125.pt", weights_only=True)
test_loader = DataLoader(test_images, batch_size=512, shuffle=False)

print(f"Test set: {test_images.shape[0]} images")

In [None]:
predictions = []
model.eval()

with torch.no_grad():
    for batch_images in test_loader:
        batch_with_channel = batch_images.unsqueeze(1).to(device)
        outputs = model(batch_with_channel)
        predicted = outputs.argmax(dim=1)  # class with the highest logit
        predictions.extend(predicted.cpu().tolist())

image_ids = [str(idx) for idx in range(test_images.shape[0])]
prediction_df = pd.DataFrame({"id": image_ids, "label": predictions})

prediction_df.to_csv("submission.csv", index=False)
print("Submission file saved to 'submission.csv'")
print(prediction_df.head())