⚠️ **Note**: **Submission is currently limited to only the speech detection tasks. We'll be releasing the obfuscated holdout data and an updated submission tutorial for the Phoneme Classification tasks in time for the second half of the competition.**

# 🍍 LibriBrain Competition: Submission
You've trained a model for one of our tracks and are now ready to submit your results? Congratulations! Let's walk through the process.

Broadly, you will need to do the following:
1. Run model predictions on our holdout data
2. Submit the .CSV file containing your results (find the detailed instructions [here](https://neural-processing-lab.github.io/2025-libribrain-competition/participate/#4-submit-on-evalai)).

This tutorial will walk you through step (1), generating the .CSV file for you to submit.

In case of any questions or problems, please get in touch through [our Discord server](https://neural-processing-lab.github.io/2025-libribrain-competition/links/discord).

⚠️ **Note**: We have only comprehensively validated the notebook to work on Colab and Unix. Your experience in other environments (e.g., Windows) may vary.

## Setting up dependencies
Run the code below *as is*. It will download all required dependencies, including our own [PNPL](https://pypi.org/project/pnpl/) package. On Windows, you might have to restart your kernel after the installation has finished.

In [1]:
# Install additional dependencies
%pip install -q lightning torchmetrics scikit-learn plotly ipywidgets tqdm pnpl

# Set up base path for dataset and related files (base_path is assumed to be set in the cells below!)
base_path = "./libribrain"
try:
    import google.colab  # This module is only available in Colab.
    in_colab = True
    base_path = "/content"  # This is the folder displayed in the Colab sidebar
except ImportError:
    in_colab = False

## Generating submission CSV
For the speech detection task, you will be asked to predict, **for each timepoint** of the "competition holdout" split of the data, whether it is speech or not. This means we expect a total of 560,638 predictions (that is how many samples there are in that split). These predictions should then be packaged into a .csv file you can upload on EvalAI. As we don't have labels to train against, the way you download the holdout data differs slightly from the regular `LibriBrainSpeech` dataset.

Here is how to generate the submission:

In [None]:
from torch.utils.data import DataLoader
from pnpl.datasets import LibriBrainCompetitionHoldout
from tqdm import tqdm
import torch

# First, instantiate the Competition Holdout dataset
speech_holdout_dataset = LibriBrainCompetitionHoldout(
    data_path=base_path,  # Same as in the other LibriBrain dataset - this is where we'll store the data
    tmax=0.8,             # Also identical to the other datasets - window length in seconds (0.8 s gives 200-timepoint samples)
    task="speech"         # "speech" or "phoneme" ("phoneme" is not supported until Phoneme track launch)
)

# Next, create a DataLoader for the dataset
dataloader = DataLoader(
    speech_holdout_dataset,
    batch_size=1,
    shuffle=False,
    num_workers=4
)

# The final array of predictions must contain len(speech_holdout_dataset) values between 0..1
segments_to_predict = len(speech_holdout_dataset)
print(segments_to_predict)

# Finally, we loop over every sample to generate a prediction.
# For now, we will fill the submission with random values
all_random = torch.rand((segments_to_predict, 1))
random_predictions = [None] * segments_to_predict

for i, sample in enumerate(tqdm(dataloader)):
    # For your submission, this is where you would generate your model prediction:
    # segment = sample[0]                  # The actual segment data is at sample[0]
    # prediction = model.predict(segment)  # Assuming model has a predict method
    #
    # Here, we simply pull the precomputed random tensor instead
    random_predictions[i] = all_random[i]

speech_holdout_dataset.generate_submission_in_csv(
    random_predictions,
    "holdout_speech_predictions.csv"
)
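Before writing the CSV, it can be worth checking that the list of predictions has the expected length and that every value lies between 0 and 1. The helper below is a minimal sketch and not part of the `pnpl` package; the name `check_predictions` is hypothetical. It relies on `float(...)`, which works for plain Python floats as well as single-element tensors like the ones used above.

```python
def check_predictions(predictions, expected_count):
    """Return a list of problems found; an empty list means all checks passed."""
    problems = []
    if len(predictions) != expected_count:
        problems.append(
            f"expected {expected_count} predictions, got {len(predictions)}"
        )
    for i, p in enumerate(predictions):
        if p is None:
            problems.append(f"prediction {i} is missing (None)")
            continue
        value = float(p)  # plain floats and single-element tensors both convert
        # The chained comparison is False for NaN, so NaN is reported as well
        if not 0.0 <= value <= 1.0:
            problems.append(f"prediction {i} is outside [0, 1]: {value}")
    return problems


# Tiny demonstration with plain floats
print(check_predictions([0.1, 0.9, 0.5], expected_count=3))
print(check_predictions([0.1, 1.7, None], expected_count=4))
```

In the notebook, this could be called as `check_predictions(random_predictions, len(speech_holdout_dataset))` right before generating the CSV.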


If you don't wish to wait the ~20min it takes to generate the file above, you can generate a mock (valid, but filled with random values) submission file without iterating over all samples:

In [None]:
from pnpl.datasets import LibriBrainCompetitionHoldout
import torch


speech_holdout_dataset = LibriBrainCompetitionHoldout(
    data_path=base_path,
    tmax=0.8,
    task="speech"
)

segments_to_predict = len(speech_holdout_dataset)
all_random = torch.rand((segments_to_predict, 1))  # build all (1,) tensors at once
random_predictions = list(all_random)              # convert to list of shape-(1,) tensors
speech_holdout_dataset.generate_submission_in_csv(
    random_predictions,
    "holdout_speech_predictions.csv"
)
print("Submission file created!")
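As a quick sanity check before uploading, you may want to count the rows in the generated file. The sketch below uses only the standard library. The exact layout written by `generate_submission_in_csv` (for example, whether there is a header row) is not described in this tutorial, so the column names in the demo file are hypothetical and you should compare the count against what you see in your own file.

```python
import csv
import os
import tempfile


def count_csv_rows(path):
    """Count the non-empty rows of a CSV file (including any header row)."""
    with open(path, newline="") as f:
        return sum(1 for row in csv.reader(f) if row)


# Demonstration on a small temporary file with a hypothetical layout
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.csv")
    with open(demo_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["index", "prediction"])  # hypothetical header
        writer.writerows([[0, 0.2], [1, 0.8], [2, 0.5]])
    demo_rows = count_csv_rows(demo_path)

print(demo_rows)  # header row plus three data rows
```

For the real file, `count_csv_rows("holdout_speech_predictions.csv")` should be consistent with the 560,638 expected predictions, plus one if the file starts with a header.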

### Generating the correct number of predictions
The code above is all you need to generate your submission! Below, we will outline some more details that may be helpful to consider. 

We understand that while training your model, you may have played around with averaging samples or combining multiple timepoints into a single output. In fact, the baseline model used in the [Speech Detection Notebook](https://neural-processing-lab.github.io/2025-libribrain-competition/colabs/LibriBrain_Competition_Speech_Detection.ipynb) did just that. However, for the submission to be valid, it will need to contain 560,638 predictions, one per timepoint. There are multiple ways to resolve this (predicting a baseline value if no prediction can be performed, interpolating between results, and so on).
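To illustrate the interpolation option, here is a rough sketch in plain Python. It assumes your model produces predictions only at some known, strictly increasing timepoint indices (called `centers` here, a hypothetical name) and fills in every other timepoint linearly. Timepoints before the first or after the last known prediction simply reuse the nearest value. This is one possible approach, not the method used by the baseline.

```python
def interpolate_predictions(centers, values, total_timepoints):
    """Linearly interpolate sparse predictions to one value per timepoint.

    centers: strictly increasing timepoint indices that have a prediction
    values:  predicted probabilities at those indices
    """
    if not centers or len(centers) != len(values):
        raise ValueError("centers and values must be non-empty and equally long")

    dense = [0.0] * total_timepoints

    # Edges: hold the nearest known prediction
    for t in range(0, centers[0]):
        dense[t] = values[0]
    for t in range(centers[-1], total_timepoints):
        dense[t] = values[-1]

    # Interior: straight line between neighbouring known predictions
    known = list(zip(centers, values))
    for (c0, v0), (c1, v1) in zip(known, known[1:]):
        for t in range(c0, c1):
            weight = (t - c0) / (c1 - c0)
            dense[t] = v0 + weight * (v1 - v0)
    return dense


dense = interpolate_predictions(centers=[2, 6], values=[0.0, 1.0], total_timepoints=10)
print(dense)
# [0.0, 0.0, 0.0, 0.25, 0.5, 0.75, 1.0, 1.0, 1.0, 1.0]
```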

Below, we will show one workaround. As our baseline model was trained to predict a single label for each sample of 200 timepoints, we need to somehow deal with the first 99 and last 100 timepoints, for which no full window is available:

![Padding due to Sliding Window](https://neural-processing-lab.github.io/2025-libribrain-competition/images/baseline-model-sliding-window-padding.png)
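The window length of 200 timepoints is consistent with `tmax=0.8` if the recordings are sampled at 250 Hz. Note that the sampling rate is an assumption here: it is inferred from those two numbers and is not stated in this tutorial. Under that assumption, the arithmetic looks like this:

```python
def timepoints_per_sample(tmax, sampling_rate_hz):
    """Number of timepoints in a window that is tmax seconds long."""
    return round(tmax * sampling_rate_hz)


ASSUMED_SAMPLING_RATE_HZ = 250  # assumption: inferred, not stated in this tutorial

window_size = timepoints_per_sample(tmax=0.8, sampling_rate_hz=ASSUMED_SAMPLING_RATE_HZ)
print(window_size)  # 200
```

If you train with a different `tmax`, the window size (and therefore the number of timepoints that need a default value) changes accordingly.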

Follow along as we:
1. Load a trained speech detection model
2. Handle the problematic samples in the holdout dataset that can't be loaded/used as they don't contain sufficient timepoints
3. Generate exactly 560,638 predictions (as required for a valid submission)

**Step 1: Define the model architecture**


In [1]:
import torch
import torch.nn as nn
import lightning as L
import os
from torch.utils.data import Dataset
import torchmetrics
from sklearn.metrics import roc_curve, auc, balanced_accuracy_score, jaccard_score
from torchmetrics import Precision, Recall, F1Score
from lightning.pytorch.callbacks import Callback
import numpy as np
from torchmetrics.functional import recall

# Model architecture (identical to Speech Detection Colab tutorial)
class SpeechModel(nn.Module):
    """
    Parameters:
        input_dim (int): Number of channels/features in the input tensor (usually SENSORS_SPEECH_MASK)
        model_dim (int): Dimensionality for the intermediate model representation.
        dropout_rate (float, optional): Dropout probability applied after convolutional and LSTM layers.
        lstm_layers (int, optional): Number of layers in the LSTM module.
        bi_directional (bool, optional): If True, uses a bidirectional LSTM; otherwise, a unidirectional LSTM.
        batch_norm (bool, optional): Indicates whether to use batch normalization.

    """
    def __init__(self, input_dim, model_dim, dropout_rate=0.3, lstm_layers = 1, bi_directional = False, batch_norm=False):
        super().__init__()
        self.conv = nn.Conv1d(
            in_channels=input_dim,
            out_channels=model_dim,
            kernel_size=3,
            padding=1,
        )
        self.lstm_layers = lstm_layers
        self.batch_norm = nn.BatchNorm1d(num_features=model_dim) if batch_norm else nn.Identity()
        self.conv_dropout = nn.Dropout(p=dropout_rate)
        self.lstm = nn.LSTM(
            input_size=model_dim,
            hidden_size=model_dim,
            num_layers=self.lstm_layers,
            dropout=dropout_rate,
            batch_first=True,
            bidirectional=bi_directional
        )
        self.lstm_dropout = nn.Dropout(p=dropout_rate)
        self.speech_classifier = nn.Linear(model_dim, 1)

    def forward(self, x):
        x = self.conv(x)
        x = self.batch_norm(x)
        x = self.conv_dropout(x)
        # LSTM expects (batch, seq_len, input_size)
        output, (h_n, c_n) = self.lstm(x.permute(0, 2, 1))
        last_layer_h_n = h_n
        if self.lstm_layers > 1:
            # handle more than one layer
            last_layer_h_n = h_n[-1, :, :]
            last_layer_h_n = last_layer_h_n.unsqueeze(0)
        output = self.lstm_dropout(last_layer_h_n)
        output = output.flatten(start_dim=0, end_dim=1)
        x = self.speech_classifier(output)
        return x


class BCEWithLogitsLossWithSmoothing(nn.Module):
    def __init__(self, smoothing=0.1, pos_weight = 1.0):
        """
        Binary Cross-Entropy Loss with Deterministic Label Smoothing.

        Parameters:
            smoothing (float): Smoothing factor. Must be between 0 and 1.
            pos_weight (float): Weight for the positive class.
        """
        super().__init__()
        self.smoothing = smoothing
        self.bce_loss = nn.BCEWithLogitsLoss(pos_weight=torch.tensor([pos_weight]))

    def forward(self, logits, target):
        target = target.float()  # Ensure target is a float tensor
        target_smoothed = target * (1 - self.smoothing) + self.smoothing * 0.5
        return self.bce_loss(logits, target_smoothed)


class SpeechClassifier(L.LightningModule):
    """
    Parameters:
        input_dim (int): Number of input channels/features. This is passed to the underlying SpeechModel.
        model_dim (int): Dimensionality of the intermediate model representation.
        learning_rate (float, optional): Learning rate for the optimizer.
        weight_decay (float, optional): Weight decay for the optimizer.
        batch_size (int, optional): Batch size used during training and evaluation.
        dropout_rate (float, optional): Dropout probability applied after convolutional and LSTM layers.
        smoothing (float, optional): Label smoothing factor applied in the BCEWithLogits loss.
        pos_weight (float, optional): Weight for the positive class in the BCEWithLogits loss.
        batch_norm (bool, optional): Indicates whether to use batch normalization.
        lstm_layers (int, optional): Number of layers in the LSTM module within the SpeechModel.
        bi_directional (bool, optional): If True, uses a bidirectional LSTM in the SpeechModel; otherwise, uses a unidirectional LSTM.
    """

    def __init__(self, input_dim, model_dim, learning_rate=1e-3, weight_decay=0.01, batch_size=32, dropout_rate=0.3, smoothing=0.1, pos_weight = 1.0 , batch_norm = False, lstm_layers = 1, bi_directional = False):
        super().__init__()
        self.save_hyperparameters()

        self.learning_rate = learning_rate
        self.weight_decay = weight_decay
        self.batch_size = batch_size
        self.model = SpeechModel(input_dim, model_dim, dropout_rate=dropout_rate, lstm_layers=lstm_layers, bi_directional=bi_directional, batch_norm=batch_norm)

        self.loss_fn = BCEWithLogitsLossWithSmoothing(smoothing=smoothing, pos_weight = pos_weight)

        self.val_step_outputs = []
        self.test_step_outputs = {}


    def forward(self, x):
        return self.model(x)

    def _shared_eval_step(self, batch, stage):
        x = batch[0]
        y = batch[1] # (batch, seq_len)

        logits = self(x)
        loss = self.loss_fn(logits, y.unsqueeze(1).float())
        probs = torch.sigmoid(logits)
        y_probs = probs.detach().cpu()

        y_true = batch[1].detach().cpu()
        meg = x.detach().cpu()

        self.log(f'{stage}_loss', loss, on_step=False, on_epoch=True, batch_size=self.batch_size)
        return loss


    def training_step(self, batch, batch_idx):
        return self._shared_eval_step(batch, "train")


    def validation_step(self, batch, batch_idx):
        return self._shared_eval_step(batch, "val")


    def test_step(self, batch, batch_idx):
        x = batch[0]
        y = batch[1]  # (batch, seq_len)

        # ugly, taking care of only one label
        if len(y.shape) != 1:
            y = y.flatten(start_dim=0, end_dim=1).view(-1, 1)  # (batch, seq_len) -> (batch * seq_len, 1)
        else:
            y = y.unsqueeze(1)

        logits = self(x)
        loss = self.loss_fn(logits, y.float())
        probs = torch.sigmoid(logits)

        # Collect outputs in a plain dict, creating each key on first use
        if "y_probs" not in self.test_step_outputs:
            self.test_step_outputs["y_probs"] = []
        if "y_true" not in self.test_step_outputs:
            self.test_step_outputs["y_true"] = []
        if "meg" not in self.test_step_outputs:
            self.test_step_outputs["meg"] = []

        # Append data
        if y.shape[-1] != 1:
            self.test_step_outputs["y_probs"].extend(
                probs.detach().view(x.shape[0], x.shape[-1]).cpu())  # (batch, seq_len)
        else:
            self.test_step_outputs["y_probs"].extend(
                probs.detach().view(x.shape[0], 1).cpu())  # (batch, seq_len)

        self.test_step_outputs["y_true"].extend(batch[1].detach().cpu())  # (batch, seq_len)
        self.test_step_outputs["meg"].extend(x.detach().cpu())  # MEG data (batch, channels, seq_len)

        return self._shared_eval_step(batch, "test")

    def configure_optimizers(self):
        optimizer = torch.optim.AdamW(self.parameters(), lr=self.learning_rate, weight_decay=self.weight_decay)
        return optimizer
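For intuition about `BCEWithLogitsLossWithSmoothing` defined above, the short sketch below applies the same smoothing formula to plain floats. The function name `smooth_target` is made up for this illustration. With the default `smoothing=0.1`, the hard labels 0 and 1 are pulled towards 0.5 and become 0.05 and 0.95.

```python
def smooth_target(target, smoothing=0.1):
    """Same formula as in BCEWithLogitsLossWithSmoothing.forward, on a plain float."""
    return target * (1 - smoothing) + smoothing * 0.5


for label in (0.0, 1.0):
    # Rounded only to keep floating-point noise out of the printout
    print(label, "->", round(smooth_target(label), 4))
```

Softened targets like these discourage the model from producing extremely confident logits, which is the usual motivation for label smoothing.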


**Step 2: Load trained model**

In [None]:
import urllib.request

# Load trained model checkpoint
checkpoint_path = f"{base_path}/speech_model.ckpt"

# Download model if it doesn't exist
if not os.path.exists(checkpoint_path):
    os.makedirs(base_path, exist_ok=True)  # Make sure the target folder exists
    print(f"Downloading model to {checkpoint_path}")
    url = "https://neural-processing-lab.github.io/2025-libribrain-competition/speech_model.ckpt"
    urllib.request.urlretrieve(url, checkpoint_path)
    print("Model downloaded successfully!")

# This is the sensor mask used in the model checkpoint
SENSORS_SPEECH_MASK = [18, 20, 22, 23, 45, 120, 138, 140, 142, 143, 145,
                       146, 147, 149, 175, 176, 177, 179, 180, 198, 271, 272, 275]


if os.path.exists(checkpoint_path):
    print(f"Loading trained model from {checkpoint_path}")
        
    # Load the trained model with parameters matching the checkpoint
    model = SpeechClassifier.load_from_checkpoint(
        checkpoint_path,
        input_dim=len(SENSORS_SPEECH_MASK),  # 23 sensors
        model_dim=100,  # The checkpoint uses 100 dimensions, not 64
        learning_rate=1e-3,
        weight_decay=0.01,
        batch_size=32,
        dropout_rate=0.3,
        smoothing=0.1,
        pos_weight=1.0,
        batch_norm=False,  # No batch norm in the checkpoint
        lstm_layers=2,  # The checkpoint has 2 LSTM layers
        bi_directional=False
    )
    model.eval()  # Set to evaluation mode
        
    print("Model loaded successfully!")
    print(f"Model uses {len(SENSORS_SPEECH_MASK)} sensors out of 306")
        
    # Count parameters
    total_params = sum(p.numel() for p in model.parameters())
    print(f"Model has {total_params:,} parameters")


**Step 3: Create a wrapper to handle problematic samples**


In [None]:
import torch
from torch.utils.data import Dataset
from pnpl.datasets import LibriBrainCompetitionHoldout

class SimpleHoldoutWrapper(Dataset):
    """
    Simple wrapper that categorizes samples as full vs incomplete.
    """
    def __init__(self, holdout_dataset):
        self.holdout_dataset = holdout_dataset
        
    def __len__(self):
        return len(self.holdout_dataset)
        
    def __getitem__(self, idx):
        try:
            sample = self.holdout_dataset[idx]
            meg_data = sample[0]  # Shape: [306, timepoints]
            
            if meg_data.shape[1] == 200:
                return meg_data, "full"  # Full sample - use model
            else:
                return meg_data, "incomplete"  # Incomplete - use default
                
        except Exception as e:
            return None, "error"

# Create the holdout dataset
holdout_dataset = LibriBrainCompetitionHoldout(
    data_path=base_path,
    tmax=0.8,
    task="speech"
)

print(f"Holdout dataset loaded with {len(holdout_dataset):,} samples")


**Step 4: Generate predictions efficiently**

Technically, we could now loop over every sample and check whether it contains the correct number of timepoints. However, this is very slow, and we already know that only the final 200 samples will contain fewer than 200 timepoints (and therefore be incomplete):

![Sample Structure](https://neural-processing-lab.github.io/2025-libribrain-competition/images/libribrain-speech-holdout-sample-structure.png)

In [7]:
def generate_predictions(model, holdout_dataset):
    """
    Sliding window prediction generation:
    - Creates 200-timepoint windows to predict the middle timepoint
    - First 99 timepoints (0-98): default to 1 (speech) - no previous context
    - Last 100 timepoints: default to 1 (speech) - no future context  
    - Middle timepoints 99-560537: use model predictions
    """
    
    # Get the total number of timepoints in the dataset
    total_timepoints = len(holdout_dataset)
    print(f"Total timepoints in dataset: {total_timepoints:,}")
    
    # Calculate prediction windows
    window_size = 200
    half_window = window_size // 2  # 100
    
    # Timepoints we can predict (have full 200-point context)
    first_predictable = half_window - 1  # 99 (first model prediction is for timepoint 99)
    last_predictable = total_timepoints - half_window - 1  # 560537
    predictable_count = last_predictable - first_predictable + 1  # 560439
    
    print(f"\nSliding window analysis:")
    print(f"  Window size: {window_size} timepoints")
    print(f"  First {first_predictable} timepoints (0-{first_predictable - 1}): default to speech (no past context)")
    print(f"  Timepoints {first_predictable}-{last_predictable}: {predictable_count:,} model predictions")
    print(f"  Last {half_window} timepoints: default to speech (no future context)")
    print(f"  Total predictions needed: {total_timepoints:,}")
    
    # Initialize predictions array
    all_predictions = [None] * total_timepoints
    
    # Step 1: Fill first 99 timepoints with default predictions (timepoints 0-98)
    print(f"\nSetting first {first_predictable} timepoints (0-{first_predictable - 1}) to speech=1...")
    for i in range(first_predictable):
        all_predictions[i] = 1.0
    
    # Step 2: Fill last 100 timepoints with default predictions  
    print(f"Setting last {half_window} timepoints to speech=1...")
    for i in range(last_predictable + 1, total_timepoints):
        all_predictions[i] = 1.0
    
    # Step 3: Generate model predictions for the middle timepoints
    print(f"\nGenerating model predictions for {predictable_count:,} timepoints...")
    

    model.eval()
    batch_size = 1000  # Process predictions in batches
    
    with torch.no_grad():
        for start_idx in tqdm(range(0, predictable_count, batch_size), desc="Model predictions"):
            end_idx = min(start_idx + batch_size, predictable_count)
            
            # For each timepoint in this batch, find the corresponding dataset sample
            batch_data = []
            batch_timepoints = []
            
            for batch_pos in range(start_idx, end_idx):
                timepoint_idx = first_predictable + batch_pos  # Actual timepoint index (99, 100, ...)
                
                # The dataset sample that has timepoint_idx as its center (at position 99)
                # If timepoint_idx is the center, then the window starts at timepoint_idx - 99
                window_start = timepoint_idx - (half_window - 1)  # timepoint_idx - 99
                dataset_sample_idx = window_start  # This dataset sample starts at window_start
                
                try:
                    # Ensure we don't go out of bounds
                    if 0 <= dataset_sample_idx < len(holdout_dataset):
                        sample = holdout_dataset[dataset_sample_idx]
                        meg_data = sample[0]  # [306, 200]
                        
                        # Verify this sample has the right timepoints
                        if meg_data.shape[1] == window_size:
                            batch_data.append(meg_data)
                            batch_timepoints.append(timepoint_idx)
                        else:
                            # If not full window, default to speech
                            all_predictions[timepoint_idx] = 1.0
                    else:
                        # Out of bounds, default to speech
                        all_predictions[timepoint_idx] = 1.0
                        
                except Exception as e:
                    # If anything fails, default to speech
                    all_predictions[timepoint_idx] = 1.0
                    if start_idx == 0 and len([p for p in all_predictions[:timepoint_idx+1] if p == 1.0]) <= 5:
                        print(f"    Timepoint {timepoint_idx} failed: {str(e)[:100]}...")
            
            # Process batch with model if we have data
            if batch_data:
                try:
                    # Stack into batch and apply sensor mask
                    meg_batch = torch.stack(batch_data)  # [batch, 306, 200]
                    meg_masked = meg_batch[:, SENSORS_SPEECH_MASK, :]  # [batch, 23, 200]
                    
                    # Get model predictions
                    logits = model(meg_masked)
                    probs = torch.sigmoid(logits).squeeze()
                    
                    if probs.dim() == 0:
                        probs = probs.unsqueeze(0)
                    
                    # Store predictions
                    prob_list = probs.cpu().tolist()
                    for i, prob in enumerate(prob_list):
                        if i < len(batch_timepoints):
                            timepoint_idx = batch_timepoints[i]
                            all_predictions[timepoint_idx] = prob
                        
                except Exception as e:
                    print(f"Model failed on batch {start_idx // batch_size}: {e}")
                    # If model fails, use default for all timepoints in batch
                    for timepoint_idx in batch_timepoints:
                        all_predictions[timepoint_idx] = 1.0
            
            # Fill any unfilled timepoints in this batch with default
            for batch_pos in range(start_idx, end_idx):
                timepoint_idx = first_predictable + batch_pos
                if all_predictions[timepoint_idx] is None:
                    all_predictions[timepoint_idx] = 1.0
    
    # Verify all predictions are filled
    for i, pred in enumerate(all_predictions):
        if pred is None:
            all_predictions[i] = 1.0
    
    print(f"\nGenerated {len(all_predictions):,} predictions")
    
    # Summary statistics
    default_start_count = first_predictable
    default_end_count = half_window  
    model_preds = [all_predictions[i] for i in range(first_predictable, last_predictable + 1) 
                   if abs(all_predictions[i] - 1.0) > 1e-6]
    
    print(f"\nSummary:")
    print(f"  Default predictions (start): {default_start_count:,} (all 1.0)")
    print(f"  Model predictions: {len(model_preds):,} (avg: {sum(model_preds)/len(model_preds):.3f})" if model_preds else "  Model predictions: 0")
    print(f"  Default predictions (end): {default_end_count:,} (all 1.0)")
            
    return all_predictions


**Step 5: Generate the final submission**
The model is trained to predict the **middle timepoint** of a 200-timepoint window. With 560,638 total timepoints, you can create a sliding window where (zero-indexed):

- **Timepoints 0-98:** Cannot predict (need 99 previous timepoints) → Default to speech (1.0)
- **Timepoints 99-560,537:** Can predict using model (560,439 predictions) 
- **Timepoints 560,538-560,637:** Cannot predict (need 100 future timepoints) → Default to speech (1.0)
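The boundaries in the list above can be reproduced with a few lines of arithmetic. The sketch below assumes, as in `generate_predictions`, that the predicted timepoint sits at zero-based position `window_size // 2 - 1` inside its window; the helper name `predictable_range` is hypothetical.

```python
def predictable_range(total_timepoints, window_size):
    """Return (first, last, count) of the timepoints that get a model prediction."""
    centre_offset = window_size // 2 - 1  # 99 for a 200-timepoint window
    # The first full window starts at timepoint 0
    first = centre_offset
    # The last full window ends at the final timepoint
    last = total_timepoints - window_size + centre_offset
    return first, last, last - first + 1


total = 560_638
first, last, count = predictable_range(total_timepoints=total, window_size=200)
defaults_at_start = first
defaults_at_end = total - 1 - last

print(first, last, count)                  # 99 560537 560439
print(defaults_at_start, defaults_at_end)  # 99 100
```

The three parts add up to the required total: 99 + 560,439 + 100 = 560,638.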

Now let's run the prediction generation and create the submission CSV file:



In [None]:
# Generate predictions
predictions = generate_predictions(model, holdout_dataset)

# Convert to tensor format expected by submission function
tensor_predictions = [torch.tensor(pred).unsqueeze(0) for pred in predictions]

# Generate submission CSV
submission_filename = "submission.csv"
holdout_dataset.generate_submission_in_csv(tensor_predictions, submission_filename)

print(f"✅ SUCCESS! Submission file created: {submission_filename}")
print(f"📄 Contains {len(predictions):,} predictions")
print("🎯 Ready for upload to EvalAI!")


Done! We have now used the model from the Colab tutorial to generate predictions for 560,439 out of 560,638 timepoints, and padded the first 99 and last 100 timepoints with a fixed majority-class prediction so that we always end up with the expected number of predictions for the submission!

## Ready to submit?
After generating the predictions file, the next step is to submit it for evaluation. Don't worry, you are allowed to submit multiple times. Please take a look at the [Submit on EvalAI](https://neural-processing-lab.github.io/2025-libribrain-competition/participate/#4-submit-on-evalai) section on the website to learn more.

## That's it! 🥳
Thanks for taking the time to look at and/or participate in our competition. If this caught your interest, you might also want to take a look at the more advanced version of the task, focussed on Phoneme Classification - you can find the corresponding Colab [here](https://neural-processing-lab.github.io/2025-libribrain-competition/links/phoneme-colab). If you have any open questions, please get in touch through [our Discord server](https://neural-processing-lab.github.io/2025-libribrain-competition/links/discord).