## Approach 3: Bootstrapped Ensemble
The first ensemble approach tested whether simply training 5 independent models with randomly initialized weights would yield a usable prediction variance on unlabeled samples, so that the highest-variance samples could be selected. Random initialization alone may have been a weak source of variance. A better approach may be to make the models differ more from each other by bootstrapping: train each model independently on a random subset (e.g. 90%) of the labeled data, then see what they disagree on. I was particularly encouraged to try this bootstrapped ensemble method when I noticed that the paper [Active learning-assisted directed evolution](https://www.nature.com/articles/s41467-025-55987-8) used this approach for its top-performing DNN Ensemble method.
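
"A random subset" and "a bootstrap sample" are not quite the same thing, and the difference affects how much the ensemble members' training sets overlap. The sketch below contrasts the two using NumPy; the index array and seed are hypothetical stand-ins, not values from this project. Drawing 90 indices with replacement from 100 leaves roughly 60 distinct indices on average (1 - 0.99^90 ≈ 0.595), whereas drawing without replacement always leaves 90.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed for this sketch
labeled_indices = np.arange(100)  # hypothetical stand-in for the labeled set
n_draw = int(0.9 * len(labeled_indices))

# Subsampling: a 90% subset without replacement, so every index is unique
subsample = rng.choice(labeled_indices, size=n_draw, replace=False)

# Bootstrap resampling: draws with replacement, so some indices repeat
resample = rng.choice(labeled_indices, size=n_draw, replace=True)

print("unique in subsample:", len(np.unique(subsample)))
print("unique in resample: ", len(np.unique(resample)))
```

The code in this notebook draws with replacement, so each model sees fewer distinct samples than the 90% figure suggests, which should increase disagreement between models at the cost of less data per model.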

This will not address what I hypothesized last time to be the main issue: ensuring that the newly selected samples are diverse. I will work on that next time and apply it to this bootstrapped ensemble along with the dropout method.

In [1]:
from scripts.data_utils import train_val_test_split
from scripts.config import (
    DATA_PATH, 
    SEQUENCE_COL, 
    SCORE_COL, 
    TOK_MODEL, 
    VAL_SPLIT,
    TEST_SPLIT,
    BATCH_SIZE,
    RANDOM_SEED,
)

training_pool, val_dataloader, test_dataloader = train_val_test_split(
    DATA_PATH,
    SEQUENCE_COL,
    SCORE_COL,
    TOK_MODEL,
    VAL_SPLIT,
    TEST_SPLIT,
    BATCH_SIZE,
    RANDOM_SEED
)

Compared to approach 1, the main change is that the labeled training data from `acquire_new_batch` is resampled immediately before each new ensemble model is trained.

In [2]:
import torch
from torch.utils.data import Subset, DataLoader
import numpy as np

from scripts.training import initialize_and_train_new_model
from scripts.acquisition import get_pool_predictions

def get_bootstrap_sample(labeled_indices, pool_dataset, train_dataloader_batch_size):
    # draw 90% of the labeled set size, with replacement (duplicate indices are possible)
    bootstrap_indices = np.random.choice(labeled_indices, size=int(0.9 * len(labeled_indices)), replace=True)
    bootstrap_subset = Subset(pool_dataset, bootstrap_indices)
    bootstrap_dataloader = DataLoader(bootstrap_subset, batch_size=train_dataloader_batch_size, shuffle=True)
    return bootstrap_dataloader

def train_ensemble(
        n_models, 
        model_name, 
        approach,
        learning_rate,
        weight_decay,
        epochs,
        labeled_indices,
        train_dataloader_batch_size,
        pool_dataset, 
        pool_dataloader, 
        val_dataloader,
        patience
        ):
    
    # define list to store predictions as each model is trained then evaluated
    ensemble_predictions = []
    
    for i in range(n_models):
        print(f"\nTraining Model {i+1}...")
        # set a changing manual seed (seeds torch only; the np.random call in get_bootstrap_sample is not seeded here)
        torch.manual_seed(i)
        torch.cuda.manual_seed(i)

        # get bootstrap sample from labeled dataset
        bootstrap_dataloader = get_bootstrap_sample(labeled_indices, pool_dataset, train_dataloader_batch_size)

        # initialize and train a new model
        model = initialize_and_train_new_model(approach, model_name, learning_rate, weight_decay, epochs, bootstrap_dataloader, val_dataloader, patience)
        
        # get model predictions on pool dataloader, append to ensemble predictions list
        pool_preds = get_pool_predictions(model, pool_dataloader)
        ensemble_predictions.append(pool_preds)

    # stack ensemble predictions to create tensor of shape (n_models, n_unlabeled_samples)
    ensemble_predictions = torch.stack(ensemble_predictions, dim=0)
    print("Ensemble training complete, submitting predictions for next cycle.")
    # return the stacked tensor of ensemble predictions
    return ensemble_predictions
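
The stacked predictions are later turned into acquisition scores by `get_variances`, which lives in `scripts.acquisition` and is not shown here. As a rough sketch of what such a step might compute (an assumption, not the actual implementation), the example below takes the variance across the model dimension and picks the samples with the largest disagreement. NumPy and random data are used as stand-ins; on a torch tensor, `ensemble_predictions.var(dim=0)` would be the equivalent call, and it also uses the unbiased estimator by default.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed for this sketch

# Hypothetical predictions with shape (n_models, n_unlabeled_samples)
ensemble_predictions = rng.normal(size=(5, 8))

# Variance across models gives one disagreement score per unlabeled sample
variances = ensemble_predictions.var(axis=0, ddof=1)

# Positions of the k samples the ensemble disagrees on most, highest first
k = 3
top_positions = np.argsort(variances)[::-1][:k]

print(variances.shape, top_positions)
```

Note that these positions index into the current unlabeled pool, so they would still need to be mapped back to indices in the full training pool.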

In [None]:
from pathlib import Path
import pandas as pd

from scripts.acquisition import acquire_new_batch, get_variances
from scripts.training import initialize_and_train_new_model, test_model
from scripts.campaigns import run_standard_finetuning


def get_learning_curves(
        n_samples,
        initial_n_samples,
        n_samples_per_batch,
        model_name, 
        approach,
        learning_rate, 
        weight_decay, 
        epochs, 
        training_pool, 
        train_dataloader_batch_size,
        pool_dataloader_batch_size,
        val_dataloader, 
        test_dataloader,
        patience=5,
        n_models=5,
        results_path="active_vs_standard_learning_curves.csv"
):
    results_path = Path(results_path)
    results_dir = results_path.parent
    results_dir.mkdir(parents=True, exist_ok=True)

    # Load existing results if the file exists, otherwise start with a fresh DataFrame.
    if results_path.exists():
        all_results_df = pd.read_csv(results_path)
    else:
        all_results_df = pd.DataFrame()
    
    total_pool_size = len(training_pool)
    unlabeled_indices = np.arange(total_pool_size)
    labeled_indices = np.array([], dtype=np.int64)

    ensemble_predictions = None
    current_cycle = 1
    total_cycles = int(np.ceil((n_samples-initial_n_samples)/n_samples_per_batch)) + 1
    
    while len(labeled_indices) < n_samples and len(unlabeled_indices) > 0:
        print(f"\nCycle {current_cycle}/{total_cycles}\n-------------------------------------------------")

        # on the first cycle, choose random samples of initial_n_samples size
        if ensemble_predictions is None:
            print(f"Choosing initial {initial_n_samples} samples randomly...")
            train_dataloader, pool_dataloader, labeled_indices, unlabeled_indices = acquire_new_batch(
                training_pool, train_dataloader_batch_size, pool_dataloader_batch_size, initial_n_samples, n_samples_per_batch, labeled_indices, unlabeled_indices, acquisition_scores=None
            )
        # each other time, use the n_samples_per_batch with acquisition scores to select
        else:
            scores = get_variances(ensemble_predictions, f"results/03_bootstrap_ensemble/variances{current_cycle}.csv")
            print("Selecting new data points...")
            train_dataloader, pool_dataloader, labeled_indices, unlabeled_indices = acquire_new_batch(
                training_pool, train_dataloader_batch_size, pool_dataloader_batch_size, initial_n_samples, n_samples_per_batch, labeled_indices, unlabeled_indices, acquisition_scores=scores
            )
        
        # give message when loop ends
        if len(unlabeled_indices) == 0:
            print("Unlabeled pool is empty. Proceeding to final model training.")
            break
        
        # evaluate active vs standard
        final_results = []

        # active
        print(f"\nTraining and evaluating model using {len(labeled_indices)} actively selected samples...")
        model_active = initialize_and_train_new_model(approach, model_name, learning_rate, weight_decay, epochs, train_dataloader, val_dataloader, patience, return_history=False)
        results_active = test_model(model_active, test_dataloader, return_results=True)
        results_active = {
            'changing_var': 'n_samples',
            'local_exp_idx': current_cycle-1,
            'value': len(labeled_indices),
            'training_method': 'active',
            **results_active
        }
        final_results.append(results_active)

        # standard
        print(f"\nTraining and evaluating model using {len(labeled_indices)} randomly selected samples...")
        model_standard, _ = run_standard_finetuning(len(labeled_indices), approach, model_name, train_dataloader_batch_size, learning_rate, weight_decay, epochs, training_pool, val_dataloader, patience)
        results_standard = test_model(model_standard, test_dataloader, return_results=True)
        results_standard = {
            'changing_var': 'n_samples',
            'local_exp_idx': current_cycle-1,
            'value': len(labeled_indices),
            'training_method': 'standard',
            **results_standard
        }
        final_results.append(results_standard)
        # save to disk each time to save progress
        results_df = pd.DataFrame(final_results)
        all_results_df = pd.concat([all_results_df, results_df], ignore_index=True)
        all_results_df.to_csv(results_path, index=False)
        print(f"Progress for experiment {current_cycle-1} appended to {results_path}")

        # if it's the last cycle, skip ensemble predictions
        if (current_cycle == total_cycles):
            print("Experiments complete.")
            break

        print("Starting ensemble training and pool evaluation...")
        ensemble_predictions = train_ensemble(n_models, model_name, approach, learning_rate, weight_decay, epochs, labeled_indices, train_dataloader_batch_size, training_pool, pool_dataloader, val_dataloader, patience)
    
        current_cycle += 1
    return all_results_df
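
As a quick check of the cycle count computed inside `get_learning_curves`, the arithmetic can be worked through for the values used in this notebook's run (256 total samples, 16 initial, 16 per batch). This sketch assumes each acquisition adds exactly `n_samples_per_batch` samples until `n_samples` is reached; the list of labeled-set sizes is illustrative and not produced by the function itself.

```python
import math

n_samples, initial_n_samples, n_samples_per_batch = 256, 16, 16

# One cycle for the initial random draw, plus one per acquired batch
total_cycles = math.ceil((n_samples - initial_n_samples) / n_samples_per_batch) + 1

# Labeled-set size at each cycle, capped at n_samples
labeled_per_cycle = [
    min(initial_n_samples + cycle * n_samples_per_batch, n_samples)
    for cycle in range(total_cycles)
]

print(total_cycles)
print(labeled_per_cycle[0], labeled_per_cycle[1], labeled_per_cycle[-1])
```

This gives 16 cycles, with the labeled set growing from 16 to 256 samples.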

In [4]:
from scripts.config import (
    MODEL_NAME,
    APPROACH,
    LEARNING_RATE,
    WEIGHT_DECAY,
    EPOCHS,
    POOL_BATCH_SIZE,
    PATIENCE,
    N_MODELS,
)

get_learning_curves(
    n_samples=256,
    initial_n_samples=16,
    n_samples_per_batch=16,
    model_name=MODEL_NAME,
    approach=APPROACH,
    learning_rate=LEARNING_RATE,
    weight_decay=WEIGHT_DECAY,
    epochs=EPOCHS,
    training_pool=training_pool,
    train_dataloader_batch_size=BATCH_SIZE,
    pool_dataloader_batch_size=POOL_BATCH_SIZE,
    val_dataloader=val_dataloader,
    test_dataloader=test_dataloader,
    patience=PATIENCE,
    n_models=N_MODELS,
    results_path='results/03_bootstrap_ensemble/active_vs_standard_learning_curve.csv'
)


Cycle 1/16
-------------------------------------------------
Choosing initial 16 samples randomly...

Training and evaluating model using 16 actively selected samples...


[Training]:  76%|███████▌  | 38/50 [00:16<00:05,  2.35it/s]


Early stopping triggered after 10 epochs with no improvement.
Train Loss: 0.0436 | Val Loss: 0.2137 | SpearmanR: 0.3303


[Testing]: 100%|██████████| 25/25 [00:00<00:00, 72.70it/s]



Training and evaluating model using 16 randomly selected samples...


[Training]:  20%|██        | 10/50 [00:04<00:17,  2.35it/s]


Early stopping triggered after 10 epochs with no improvement.
Train Loss: 0.1909 | Val Loss: 0.2235 | SpearmanR: 0.1096


[Testing]: 100%|██████████| 25/25 [00:00<00:00, 72.68it/s]


Progress for experiment 0 appended to results/03_bootstrap_ensemble/active_vs_standard_learning_curve.csv
Starting ensemble training and pool evaluation...

Training Model 1...


[Training]:  52%|█████▏    | 26/50 [00:11<00:10,  2.28it/s]


Early stopping triggered after 10 epochs with no improvement.
Train Loss: 0.0488 | Val Loss: 0.3131 | SpearmanR: 0.2721


[Surveying]: 100%|██████████| 25/25 [00:02<00:00, 12.04it/s]



Training Model 2...


[Training]:  74%|███████▍  | 37/50 [00:16<00:05,  2.25it/s]


Early stopping triggered after 10 epochs with no improvement.
Train Loss: 0.0544 | Val Loss: 0.2670 | SpearmanR: 0.2994


[Surveying]: 100%|██████████| 25/25 [00:02<00:00, 12.00it/s]



Training Model 3...


[Training]:  38%|███▊      | 19/50 [00:08<00:13,  2.27it/s]


Early stopping triggered after 10 epochs with no improvement.
Train Loss: 0.0753 | Val Loss: 0.1990 | SpearmanR: 0.0530


[Surveying]: 100%|██████████| 25/25 [00:02<00:00, 11.94it/s]



Training Model 4...


[Training]:  58%|█████▊    | 29/50 [00:12<00:08,  2.36it/s]


Early stopping triggered after 10 epochs with no improvement.
Train Loss: 0.0348 | Val Loss: 0.2820 | SpearmanR: 0.2569


[Surveying]: 100%|██████████| 25/25 [00:02<00:00, 12.00it/s]



Training Model 5...


[Training]:  66%|██████▌   | 33/50 [00:14<00:07,  2.27it/s]


Early stopping triggered after 10 epochs with no improvement.
Train Loss: 0.0447 | Val Loss: 0.2516 | SpearmanR: 0.3652


[Surveying]: 100%|██████████| 25/25 [00:02<00:00, 11.92it/s]


Ensemble training complete, submitting predictions for next cycle.

Cycle 2/16
-------------------------------------------------
Saving variance distribution to results/02_mc_dropout/variances2.csv...
Save complete.
Selecting new data points...

Training and evaluating model using 32 actively selected samples...


[Training]:  64%|██████▍   | 32/50 [00:14<00:08,  2.14it/s]


Early stopping triggered after 10 epochs with no improvement.
Train Loss: 0.0495 | Val Loss: 0.2012 | SpearmanR: 0.3027


[Testing]: 100%|██████████| 25/25 [00:00<00:00, 72.10it/s]



Training and evaluating model using 32 randomly selected samples...


[Training]:   4%|▍         | 2/50 [00:01<00:30,  1.58it/s]


KeyboardInterrupt: 