# Approach 2.5 / 3.5

Each of the approaches tested so far, despite having been used previously in the literature (at least to the extent that I understand them), has failed to outperform simply selecting samples at random. I suspect this is due to one or both of two issues with the current acquisition strategy. Each approach uses some method to estimate the variance of model predictions and then picks the most uncertain sequences for the next batch. However, those uncertain sequences might correspond to noisy measurements (measurements with lower read counts, especially if they tend to be the sequences with more mutations), and they might not be sufficiently diverse.

Given these hypotheses about what's going wrong, I may be able to address them by ensuring the next batch is diverse as well as uncertain. I'd like to compare two variants against the random baseline: selection that is only diverse, and selection that is both uncertain and diverse.

But how do we ensure diversity? Well, since we are conveniently using a language model, we should be able to just use the language model embeddings. So, all I need to do is figure out how to get the embeddings, measure the distances between them, and integrate some amount of distance optimization into my acquisition function.

Note: Below, I'm going to extract the embeddings from the last hidden state of the CLS token, since that's the standard model I've been using while setting up the active learning structure. However, other fine-tuning approaches should use different embeddings. For example, if you're feeding only the mutant residues' final hidden states into your output layer, you should probably use those embeddings for diversity. Another decision point is whether to use the embeddings of your fine-tuned model or of a fresh ESM2 model. I'm going to work with just the fine-tuned model here.
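To make the embedding step concrete, here is a minimal sketch of pulling out the CLS vector and measuring distances. It uses a random NumPy array as a stand-in for the model's last hidden state, and it assumes the CLS token sits at position 0 (as it does with the ESM2 tokenizer) and that Euclidean distance is the metric; cosine distance would be an equally reasonable choice. The helper names `cls_embeddings` and `pairwise_euclidean` are hypothetical and not part of `scripts`.

```python
import numpy as np

def cls_embeddings(last_hidden_state):
    """Return the first-token (CLS) vector for each sequence.

    last_hidden_state: array of shape (n_sequences, seq_len, hidden_dim).
    Assumes the CLS token is at position 0.
    """
    return last_hidden_state[:, 0, :]

def pairwise_euclidean(a, b):
    """Euclidean distances between rows of a (n, d) and rows of b (m, d) -> (n, m)."""
    diff = a[:, None, :] - b[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

# Toy stand-in for model output: 4 sequences, 6 tokens, 8 hidden dimensions.
rng = np.random.default_rng(0)
hidden = rng.normal(size=(4, 6, 8))

emb = cls_embeddings(hidden)          # shape (4, 8)
dist = pairwise_euclidean(emb, emb)   # shape (4, 4), zeros on the diagonal
```

In the real pipeline the array would come from the fine-tuned model's final layer rather than a random generator.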

Here's the algorithm I'll employ:
1. Train ensemble on random initial set.
2. Get variances from model predictions.
3. Keep only the samples in the top k% of variances.
4. Retrieve embeddings for those samples.
5. For each candidate remaining in the unlabeled pool, calculate its distance from each of the samples in the current batch.
6. For each candidate, find the minimum of those distances.
7. Add the candidate c whose minimum distance from the current batch is the greatest, and remove it from the unlabeled pool.
8. Calculate the distance of the just-added candidate c from each candidate remaining in the unlabeled pool, and append these distances to each candidate's distance list.
9. Repeat 6-8 until n samples have been added to the labeled samples.
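
Steps 5-9 amount to a greedy farthest-point (max-min) selection. The sketch below is one way to write it, not the final implementation: it assumes Euclidean distance, assumes the "current batch" starts out as a non-empty set of reference embeddings (for example the already labeled samples), and tracks only the running minimum distance per candidate instead of a full distance list. The function name `greedy_max_min_selection` is hypothetical.

```python
import numpy as np

def greedy_max_min_selection(candidate_emb, reference_emb, n_select):
    """Greedily pick candidates that are far from everything chosen so far.

    candidate_emb: (n_candidates, d) embeddings of the high-variance pool samples
    reference_emb: (n_ref, d) embeddings of the current batch (assumed non-empty)
    Returns the selected candidate row indices, in selection order.
    """
    n_select = min(n_select, len(candidate_emb))

    # Steps 5-6: minimum distance from each candidate to the current batch.
    diff = candidate_emb[:, None, :] - reference_emb[None, :, :]
    min_dist = np.sqrt((diff ** 2).sum(axis=-1)).min(axis=1)

    selected = []
    for _ in range(n_select):
        # Step 7: take the candidate whose nearest chosen sample is farthest away.
        best = int(np.argmax(min_dist))
        selected.append(best)
        # Step 8: fold the new pick's distances into the running minimum.
        new_dist = np.linalg.norm(candidate_emb - candidate_emb[best], axis=1)
        min_dist = np.minimum(min_dist, new_dist)
        # Mark the pick as removed from the pool.
        min_dist[best] = -np.inf
    return selected

# Toy example: candidates on a line, with the current batch at the origin.
candidates = np.array([[0.0], [1.0], [5.0], [10.0]])
reference = np.array([[0.0]])
picked = greedy_max_min_selection(candidates, reference, n_select=3)
# picked == [3, 2, 1]: the point at 10 first, then 5, then 1
```

Keeping a running minimum gives the same picks as storing every distance, but it needs only one vector of length `n_candidates`.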

Actually, now that I spell out the algorithm, I'm realizing that I should also test just a random choice from the top k% of variances to make sure that the embeddings even add anything.
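
As a compact illustration of that control, here is a NumPy sketch of "random choice from the top k% of variances". It assumes the ensemble predictions are stacked with shape (n_models, n_pool_samples) and that the acquisition score is the per-sample variance across models; whether `get_variances` uses the population or sample variance isn't shown here, so the default `ddof=0` below is an assumption. The function name `random_from_top_variance` is hypothetical.

```python
import numpy as np

def random_from_top_variance(ensemble_preds, top_fraction, n_select, rng):
    """Sample n_select pool positions uniformly from the highest-variance fraction.

    ensemble_preds: (n_models, n_pool_samples) array of predictions.
    """
    # Disagreement between ensemble members, one value per pool sample.
    variances = ensemble_preds.var(axis=0)
    # Size of the high-variance shortlist, kept at least as large as the batch.
    n_top = max(n_select, int(top_fraction * variances.size))
    n_top = min(n_top, variances.size)
    n_select = min(n_select, n_top)
    # Positions of the n_top largest variances.
    top_idx = np.argsort(variances)[-n_top:]
    return rng.choice(top_idx, size=n_select, replace=False)

rng = np.random.default_rng(0)
preds = rng.normal(size=(5, 100))   # 5 models, 100 pool samples
preds[:, :10] *= 10                 # inflate disagreement on the first 10 samples
chosen = random_from_top_variance(preds, top_fraction=0.1, n_select=4, rng=rng)
```

If this control matches the diversity-aware selection, the embeddings aren't contributing much beyond the variance filter.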

In [1]:
from scripts.data_utils import train_val_test_split
from scripts.config import (
    DATA_PATH, 
    SEQUENCE_COL, 
    SCORE_COL, 
    TOK_MODEL, 
    VAL_SPLIT,
    TEST_SPLIT,
    BATCH_SIZE,
    RANDOM_SEED,
)

training_pool, val_dataloader, test_dataloader = train_val_test_split(
    DATA_PATH,
    SEQUENCE_COL,
    SCORE_COL,
    TOK_MODEL,
    VAL_SPLIT,
    TEST_SPLIT,
    BATCH_SIZE,
    RANDOM_SEED
)

In [2]:
import torch
from torch.utils.data import Subset, DataLoader
import numpy as np

from scripts.training import initialize_and_train_new_model
from scripts.acquisition import get_pool_predictions

def get_bootstrap_sample(labeled_indices, pool_dataset, train_dataloader_batch_size):
    # sample with replacement, drawing 90% as many indices as there are labeled samples
    bootstrap_indices = np.random.choice(labeled_indices, size=int(0.9 * len(labeled_indices)), replace=True)
    bootstrap_subset = Subset(pool_dataset, bootstrap_indices)
    bootstrap_dataloader = DataLoader(bootstrap_subset, batch_size=train_dataloader_batch_size, shuffle=True)
    return bootstrap_dataloader

def train_ensemble(
        n_models, 
        model_name, 
        approach,
        learning_rate,
        weight_decay,
        epochs,
        labeled_indices,
        train_dataloader_batch_size,
        pool_dataset, 
        pool_dataloader, 
        val_dataloader,
        patience
        ):
    
    # define list to store predictions as each model is trained then evaluated
    ensemble_predictions = []
    
    for i in range(n_models):
        print(f"\nTraining Model {i+1}...")
        # set a changing manual seed
        torch.manual_seed(i)
        torch.cuda.manual_seed(i)

        # get bootstrap sample from labeled dataset
        bootstrap_dataloader = get_bootstrap_sample(labeled_indices, pool_dataset, train_dataloader_batch_size)

        # initialize and train a new model
        model = initialize_and_train_new_model(approach, model_name, learning_rate, weight_decay, epochs, bootstrap_dataloader, val_dataloader, patience)
        
        # get model predictions on pool dataloader, append to ensemble predictions list
        pool_preds = get_pool_predictions(model, pool_dataloader)
        ensemble_predictions.append(pool_preds)

    # stack ensemble predictions to create tensor of shape (n_models, n_unlabeled_samples)
    ensemble_predictions = torch.stack(ensemble_predictions, dim=0)
    print("Ensemble training complete, submitting predictions for next cycle.")
    # return the stacked tensor of ensemble predictions
    return ensemble_predictions

### Random choice from top k% of variances

In [9]:
import numpy as np
from torch.utils.data import Subset, DataLoader

# acquire a new batch: randomly if no scores are given, otherwise randomly from the top-scoring fraction of the pool
def acquire_new_batch(
    dataset, 
    train_dataloader_batch_size, 
    pool_dataloader_batch_size, 
    initial_batch_size,
    top_score_fraction, 
    batch_size_to_acquire, 
    labeled_indices, 
    unlabeled_indices, 
    acquisition_scores=None
    ):

    # if initial batch, when there are no acquisition scores, select randomly
    if acquisition_scores is None:
        initial_batch_size = min(initial_batch_size, len(unlabeled_indices))
        indices_to_acquire = np.random.choice(unlabeled_indices, size=initial_batch_size, replace=False)
    
    # else select based on top acquisition scores
    else:
        # make sure we don't overshoot samples to acquire if on the final batch
        batch_size_to_acquire = min(batch_size_to_acquire, len(acquisition_scores))
        # determine the number of top scorers to select from; keep it at least as large
        # as the batch (so sampling without replacement works) and no larger than the pool
        num_top_scorers = int(top_score_fraction * len(unlabeled_indices))
        num_top_scorers = min(max(num_top_scorers, batch_size_to_acquire), len(acquisition_scores))
        # get the positions of the top acquisition scores within the unlabeled pool
        top_indices = acquisition_scores.topk(num_top_scorers).indices
        # choose a random subset of these top scorers
        top_k_indices = np.random.choice(top_indices.cpu().numpy(), size=batch_size_to_acquire, replace=False)
        # map these pool positions back to indices in the original dataset
        indices_to_acquire = unlabeled_indices[top_k_indices]
    
    # update the indices lists
    labeled_indices = np.concatenate([labeled_indices, indices_to_acquire])
    unlabeled_indices = np.setdiff1d(unlabeled_indices, indices_to_acquire, assume_unique=True)
    
    # create new subsets and dataloaders
    train_subset = Subset(dataset, labeled_indices.tolist())
    pool_subset = Subset(dataset, unlabeled_indices.tolist())
    train_dataloader = DataLoader(train_subset, batch_size=train_dataloader_batch_size, shuffle=True)
    pool_dataloader = DataLoader(pool_subset, batch_size=pool_dataloader_batch_size, shuffle=False)
    
    return train_dataloader, pool_dataloader, labeled_indices, unlabeled_indices

In [10]:
from pathlib import Path
import pandas as pd

from scripts.acquisition import get_variances
from scripts.training import initialize_and_train_new_model, test_model
from scripts.campaigns import run_standard_finetuning


def get_learning_curves(
        n_samples,
        initial_n_samples,
        top_score_fraction,
        n_samples_per_batch,
        model_name, 
        approach,
        learning_rate, 
        weight_decay, 
        epochs, 
        training_pool, 
        train_dataloader_batch_size,
        pool_dataloader_batch_size,
        val_dataloader, 
        test_dataloader,
        patience=5,
        n_models=5,
        results_path="active_vs_standard_learning_curves.csv"
):
    results_path = Path(results_path)
    results_dir = results_path.parent
    results_dir.mkdir(parents=True, exist_ok=True)

    # Load existing results if the file exists, otherwise start with a fresh DataFrame.
    if results_path.exists():
        all_results_df = pd.read_csv(results_path)
    else:
        all_results_df = pd.DataFrame()
    
    total_pool_size = len(training_pool)
    unlabeled_indices = np.arange(total_pool_size)
    labeled_indices = np.array([], dtype=np.int64)

    ensemble_predictions = None
    current_cycle = 1
    total_cycles = int(np.ceil((n_samples-initial_n_samples)/n_samples_per_batch)) + 1
    
    while len(labeled_indices) < n_samples and len(unlabeled_indices) > 0:
        print(f"\nCycle {current_cycle}/{total_cycles}\n-------------------------------------------------")

        # on the first cycle, choose random samples of initial_n_samples size
        if ensemble_predictions is None:
            print(f"Choosing initial {initial_n_samples} samples randomly...")
            train_dataloader, pool_dataloader, labeled_indices, unlabeled_indices = acquire_new_batch(
                training_pool, train_dataloader_batch_size, pool_dataloader_batch_size, initial_n_samples, top_score_fraction, n_samples_per_batch, labeled_indices, unlabeled_indices, acquisition_scores=None
            )
        # each other time, use the n_samples_per_batch with acquisition scores to select
        else:
            scores = get_variances(ensemble_predictions, f"{results_dir}/variances{current_cycle}.csv")
            print("Selecting new data points...")
            train_dataloader, pool_dataloader, labeled_indices, unlabeled_indices = acquire_new_batch(
                training_pool, train_dataloader_batch_size, pool_dataloader_batch_size, initial_n_samples, top_score_fraction, n_samples_per_batch, labeled_indices, unlabeled_indices, acquisition_scores=scores
            )
        
        # give message when loop ends
        if len(unlabeled_indices) == 0:
            print("Unlabeled pool is empty. Proceeding to final model training.")
            break
        
        # evaluate active vs standard
        final_results = []

        # active
        print(f"\nTraining and evaluating model using {len(labeled_indices)} actively selected samples...")
        model_active = initialize_and_train_new_model(approach, model_name, learning_rate, weight_decay, epochs, train_dataloader, val_dataloader, patience, return_history=False)
        results_active = test_model(model_active, test_dataloader, return_results=True)
        results_active = {
            'changing_var': 'n_samples',
            'local_exp_idx': current_cycle-1,
            'value': len(labeled_indices),
            'training_method': 'active',
            **results_active
        }
        final_results.append(results_active)

        # standard
        print(f"\nTraining and evaluating model using {len(labeled_indices)} randomly selected samples...")
        model_standard, _ = run_standard_finetuning(len(labeled_indices), approach, model_name, train_dataloader_batch_size, learning_rate, weight_decay, epochs, training_pool, val_dataloader, patience)
        results_standard = test_model(model_standard, test_dataloader, return_results=True)
        results_standard = {
            'changing_var': 'n_samples',
            'local_exp_idx': current_cycle-1,
            'value': len(labeled_indices),
            'training_method': 'standard',
            **results_standard
        }
        final_results.append(results_standard)
        # save to disk each time to save progress
        results_df = pd.DataFrame(final_results)
        all_results_df = pd.concat([all_results_df, results_df], ignore_index=True)
        all_results_df.to_csv(results_path, index=False)
        print(f"Progress for experiment {current_cycle-1} appended to {results_path}")

        # if it's the last cycle, skip ensemble predictions
        if (current_cycle == total_cycles):
            print("Experiments complete.")
            break

        print("Starting ensemble training and pool evaluation...")
        ensemble_predictions = train_ensemble(n_models, model_name, approach, learning_rate, weight_decay, epochs, labeled_indices, train_dataloader_batch_size, training_pool, pool_dataloader, val_dataloader, patience)
    
        current_cycle += 1
    return all_results_df

In [None]:
from scripts.config import (
    MODEL_NAME,
    APPROACH,
    LEARNING_RATE,
    WEIGHT_DECAY,
    EPOCHS,
    POOL_BATCH_SIZE,
    PATIENCE,
    N_MODELS,
    TOP_SCORE_FRACTION,
)

get_learning_curves(
    n_samples=256,
    initial_n_samples=16,
    top_score_fraction=TOP_SCORE_FRACTION,
    n_samples_per_batch=16,
    model_name=MODEL_NAME,
    approach=APPROACH,
    learning_rate=LEARNING_RATE,
    weight_decay=WEIGHT_DECAY,
    epochs=EPOCHS,
    training_pool=training_pool,
    train_dataloader_batch_size=BATCH_SIZE,
    pool_dataloader_batch_size=POOL_BATCH_SIZE,
    val_dataloader=val_dataloader,
    test_dataloader=test_dataloader,
    patience=PATIENCE,
    n_models=N_MODELS,
    results_path='results/04_adding_diversity/random_from_top_fraction/active_vs_standard_learning_curve.csv'
)


Cycle 1/16
-------------------------------------------------
Choosing initial 16 samples randomly...

Training and evaluating model using 16 actively selected samples...


[Training]: 100%|██████████| 50/50 [00:35<00:00,  1.42it/s]


Train Loss: 0.0336 | Val Loss: 0.1788 | SpearmanR: 0.4060


[Testing]: 100%|██████████| 25/25 [00:00<00:00, 127.40it/s]



Training and evaluating model using 16 randomly selected samples...


[Training]:  20%|██        | 10/50 [00:03<00:13,  2.88it/s]


Early stopping triggered after 10 epochs with no improvement.
Train Loss: 0.2326 | Val Loss: 0.2467 | SpearmanR: -0.0998


[Testing]: 100%|██████████| 25/25 [00:00<00:00, 129.63it/s]


Progress for experiment 0 appended to results/04_adding_diversity/random_from_top_fraction/active_vs_standard_learning_curve.csv
Starting ensemble training and pool evaluation...

Training Model 1...


[Training]:  96%|█████████▌| 48/50 [00:27<00:01,  1.77it/s]


Early stopping triggered after 10 epochs with no improvement.
Train Loss: 0.0013 | Val Loss: 0.2925 | SpearmanR: 0.2855


[Surveying]: 100%|██████████| 25/25 [00:01<00:00, 16.78it/s]



Training Model 2...


[Training]:  84%|████████▍ | 42/50 [00:24<00:04,  1.73it/s]


Early stopping triggered after 10 epochs with no improvement.
Train Loss: 0.0291 | Val Loss: 0.3151 | SpearmanR: 0.1252


[Surveying]: 100%|██████████| 25/25 [00:01<00:00, 16.76it/s]



Training Model 3...


[Training]: 100%|██████████| 50/50 [00:32<00:00,  1.55it/s]


Train Loss: 0.0047 | Val Loss: 0.1990 | SpearmanR: 0.1898


[Surveying]: 100%|██████████| 25/25 [00:01<00:00, 16.92it/s]



Training Model 4...


[Training]:  20%|██        | 10/50 [00:02<00:11,  3.51it/s]


Early stopping triggered after 10 epochs with no improvement.
Train Loss: 0.2623 | Val Loss: 0.1999 | SpearmanR: -0.0290


[Surveying]: 100%|██████████| 25/25 [00:01<00:00, 16.81it/s]



Training Model 5...


[Training]: 100%|██████████| 50/50 [00:32<00:00,  1.54it/s]


Train Loss: 0.0059 | Val Loss: 0.1985 | SpearmanR: 0.3307


[Surveying]: 100%|██████████| 25/25 [00:01<00:00, 17.00it/s]


Ensemble training complete, submitting predictions for next cycle.

Cycle 2/16
-------------------------------------------------
Saving variance distribution to results/04_adding_diversity/random_from_top_fraction/variances2.csv...
Save complete.
Selecting new data points...

Training and evaluating model using 32 actively selected samples...


[Training]:  90%|█████████ | 45/50 [00:18<00:02,  2.38it/s]


Early stopping triggered after 10 epochs with no improvement.
Train Loss: 0.0235 | Val Loss: 0.1893 | SpearmanR: 0.3408


[Testing]: 100%|██████████| 25/25 [00:00<00:00, 131.91it/s]



Training and evaluating model using 32 randomly selected samples...


[Training]:  26%|██▌       | 13/50 [00:07<00:20,  1.81it/s]


Early stopping triggered after 10 epochs with no improvement.
Train Loss: 0.1520 | Val Loss: 0.2177 | SpearmanR: 0.1649


[Testing]: 100%|██████████| 25/25 [00:00<00:00, 131.04it/s]


Progress for experiment 1 appended to results/04_adding_diversity/random_from_top_fraction/active_vs_standard_learning_curve.csv
Starting ensemble training and pool evaluation...

Training Model 1...


[Training]: 100%|██████████| 50/50 [00:17<00:00,  2.83it/s]


Train Loss: 0.0019 | Val Loss: 0.2308 | SpearmanR: 0.3456


[Surveying]: 100%|██████████| 25/25 [00:01<00:00, 16.81it/s]



Training Model 2...


[Training]:  88%|████████▊ | 44/50 [00:13<00:01,  3.18it/s]


Early stopping triggered after 10 epochs with no improvement.
Train Loss: 0.0084 | Val Loss: 0.1676 | SpearmanR: 0.4401


[Surveying]: 100%|██████████| 25/25 [00:01<00:00, 16.75it/s]



Training Model 3...


[Training]:  70%|███████   | 35/50 [00:11<00:04,  3.05it/s]


Early stopping triggered after 10 epochs with no improvement.
Train Loss: 0.0448 | Val Loss: 0.2103 | SpearmanR: 0.1413


[Surveying]: 100%|██████████| 25/25 [00:01<00:00, 16.74it/s]



Training Model 4...


[Training]:  80%|████████  | 40/50 [00:12<00:03,  3.20it/s]


Early stopping triggered after 10 epochs with no improvement.
Train Loss: 0.0181 | Val Loss: 0.2064 | SpearmanR: 0.3939


[Surveying]: 100%|██████████| 25/25 [00:01<00:00, 16.74it/s]



Training Model 5...


[Training]:  90%|█████████ | 45/50 [00:14<00:01,  3.05it/s]


Early stopping triggered after 10 epochs with no improvement.
Train Loss: 0.0050 | Val Loss: 0.2154 | SpearmanR: 0.3789


[Surveying]: 100%|██████████| 25/25 [00:01<00:00, 16.76it/s]


Ensemble training complete, submitting predictions for next cycle.

Cycle 3/16
-------------------------------------------------
Saving variance distribution to results/04_adding_diversity/random_from_top_fraction/variances3.csv...
Save complete.
Selecting new data points...

Training and evaluating model using 48 actively selected samples...


[Training]:  82%|████████▏ | 41/50 [00:14<00:03,  2.85it/s]


Early stopping triggered after 10 epochs with no improvement.
Train Loss: 0.0309 | Val Loss: 0.1424 | SpearmanR: 0.5392


[Testing]: 100%|██████████| 25/25 [00:00<00:00, 128.74it/s]



Training and evaluating model using 48 randomly selected samples...


[Training]:  42%|████▏     | 21/50 [00:07<00:09,  2.99it/s]


Early stopping triggered after 10 epochs with no improvement.
Train Loss: 0.0978 | Val Loss: 0.1651 | SpearmanR: 0.3981


[Testing]: 100%|██████████| 25/25 [00:00<00:00, 128.23it/s]


Progress for experiment 2 appended to results/04_adding_diversity/random_from_top_fraction/active_vs_standard_learning_curve.csv
Starting ensemble training and pool evaluation...

Training Model 1...


[Training]:  62%|██████▏   | 31/50 [00:11<00:06,  2.80it/s]


Early stopping triggered after 10 epochs with no improvement.
Train Loss: 0.0081 | Val Loss: 0.1883 | SpearmanR: 0.3627


[Surveying]: 100%|██████████| 25/25 [00:01<00:00, 16.74it/s]



Training Model 2...


[Training]: 100%|██████████| 50/50 [00:18<00:00,  2.76it/s]


Train Loss: 0.0061 | Val Loss: 0.1606 | SpearmanR: 0.4655


[Surveying]: 100%|██████████| 25/25 [00:01<00:00, 16.76it/s]



Training Model 3...


[Training]:  76%|███████▌  | 38/50 [00:13<00:04,  2.77it/s]


Early stopping triggered after 10 epochs with no improvement.
Train Loss: 0.0019 | Val Loss: 0.1933 | SpearmanR: 0.4052


[Surveying]: 100%|██████████| 25/25 [00:01<00:00, 16.86it/s]



Training Model 4...


[Training]: 100%|██████████| 50/50 [00:18<00:00,  2.64it/s]


Train Loss: 0.0057 | Val Loss: 0.1433 | SpearmanR: 0.5033


[Surveying]: 100%|██████████| 25/25 [00:01<00:00, 16.87it/s]



Training Model 5...


[Training]: 100%|██████████| 50/50 [00:18<00:00,  2.76it/s]


Train Loss: 0.0155 | Val Loss: 0.1741 | SpearmanR: 0.4640


[Surveying]: 100%|██████████| 25/25 [00:01<00:00, 16.64it/s]


Ensemble training complete, submitting predictions for next cycle.

Cycle 4/16
-------------------------------------------------
Saving variance distribution to results/04_adding_diversity/random_from_top_fraction/variances4.csv...
Save complete.
Selecting new data points...

Training and evaluating model using 64 actively selected samples...


[Training]:  56%|█████▌    | 28/50 [00:10<00:08,  2.56it/s]


Early stopping triggered after 10 epochs with no improvement.
Train Loss: 0.0719 | Val Loss: 0.1567 | SpearmanR: 0.4937


[Testing]: 100%|██████████| 25/25 [00:00<00:00, 125.16it/s]



Training and evaluating model using 64 randomly selected samples...


[Training]:  92%|█████████▏| 46/50 [00:15<00:01,  2.89it/s]


Early stopping triggered after 10 epochs with no improvement.
Train Loss: 0.0271 | Val Loss: 0.1510 | SpearmanR: 0.4731


[Testing]: 100%|██████████| 25/25 [00:00<00:00, 130.01it/s]


Progress for experiment 3 appended to results/04_adding_diversity/random_from_top_fraction/active_vs_standard_learning_curve.csv
Starting ensemble training and pool evaluation...

Training Model 1...


[Training]:  86%|████████▌ | 43/50 [00:14<00:02,  2.89it/s]


Early stopping triggered after 10 epochs with no improvement.
Train Loss: 0.0081 | Val Loss: 0.1741 | SpearmanR: 0.3866


[Surveying]: 100%|██████████| 25/25 [00:01<00:00, 16.67it/s]



Training Model 2...


[Training]:  86%|████████▌ | 43/50 [00:14<00:02,  3.05it/s]


Early stopping triggered after 10 epochs with no improvement.
Train Loss: 0.0154 | Val Loss: 0.1542 | SpearmanR: 0.5206


[Surveying]: 100%|██████████| 25/25 [00:01<00:00, 16.75it/s]



Training Model 3...


[Training]: 100%|██████████| 50/50 [00:17<00:00,  2.79it/s]


Train Loss: 0.0139 | Val Loss: 0.1796 | SpearmanR: 0.4037


[Surveying]: 100%|██████████| 25/25 [00:01<00:00, 16.51it/s]



Training Model 4...


[Training]:  72%|███████▏  | 36/50 [00:12<00:05,  2.77it/s]


Early stopping triggered after 10 epochs with no improvement.
Train Loss: 0.0047 | Val Loss: 0.1641 | SpearmanR: 0.4576


[Surveying]: 100%|██████████| 25/25 [00:01<00:00, 16.71it/s]



Training Model 5...


[Training]: 100%|██████████| 50/50 [00:18<00:00,  2.72it/s]


Train Loss: 0.0052 | Val Loss: 0.1634 | SpearmanR: 0.5089


[Surveying]: 100%|██████████| 25/25 [00:01<00:00, 16.60it/s]


Ensemble training complete, submitting predictions for next cycle.

Cycle 5/16
-------------------------------------------------
Saving variance distribution to results/04_adding_diversity/random_from_top_fraction/variances5.csv...
Save complete.
Selecting new data points...

Training and evaluating model using 80 actively selected samples...


[Training]:  86%|████████▌ | 43/50 [00:15<00:02,  2.85it/s]


Early stopping triggered after 10 epochs with no improvement.
Train Loss: 0.0236 | Val Loss: 0.1710 | SpearmanR: 0.4967


[Testing]: 100%|██████████| 25/25 [00:00<00:00, 128.71it/s]



Training and evaluating model using 80 randomly selected samples...


[Training]: 100%|██████████| 50/50 [00:17<00:00,  2.84it/s]


Train Loss: 0.0198 | Val Loss: 0.1620 | SpearmanR: 0.5035


[Testing]: 100%|██████████| 25/25 [00:00<00:00, 130.55it/s]


Progress for experiment 4 appended to results/04_adding_diversity/random_from_top_fraction/active_vs_standard_learning_curve.csv
Starting ensemble training and pool evaluation...

Training Model 1...


[Training]:  72%|███████▏  | 36/50 [00:22<00:08,  1.64it/s]


Early stopping triggered after 10 epochs with no improvement.
Train Loss: 0.0088 | Val Loss: 0.1681 | SpearmanR: 0.4383


[Surveying]: 100%|██████████| 25/25 [00:01<00:00, 17.13it/s]



Training Model 2...


[Training]:  54%|█████▍    | 27/50 [00:16<00:14,  1.59it/s]


Early stopping triggered after 10 epochs with no improvement.
Train Loss: 0.0117 | Val Loss: 0.1650 | SpearmanR: 0.4218


[Surveying]: 100%|██████████| 25/25 [00:01<00:00, 17.00it/s]



Training Model 3...


[Training]:  54%|█████▍    | 27/50 [00:18<00:15,  1.48it/s]


Early stopping triggered after 10 epochs with no improvement.
Train Loss: 0.0130 | Val Loss: 0.1816 | SpearmanR: 0.3911


[Surveying]: 100%|██████████| 25/25 [00:01<00:00, 17.26it/s]



Training Model 4...


[Training]: 100%|██████████| 50/50 [00:28<00:00,  1.76it/s]


Train Loss: 0.0008 | Val Loss: 0.1386 | SpearmanR: 0.5346


[Surveying]: 100%|██████████| 25/25 [00:01<00:00, 17.12it/s]



Training Model 5...


[Training]: 100%|██████████| 50/50 [00:34<00:00,  1.44it/s]


Train Loss: 0.0020 | Val Loss: 0.1901 | SpearmanR: 0.4033


[Surveying]: 100%|██████████| 25/25 [00:01<00:00, 17.27it/s]


Ensemble training complete, submitting predictions for next cycle.

Cycle 6/16
-------------------------------------------------
Saving variance distribution to results/04_adding_diversity/random_from_top_fraction/variances6.csv...
Save complete.
Selecting new data points...

Training and evaluating model using 96 actively selected samples...


[Training]: 100%|██████████| 50/50 [00:31<00:00,  1.57it/s]


Train Loss: 0.0138 | Val Loss: 0.1468 | SpearmanR: 0.5769


[Testing]: 100%|██████████| 25/25 [00:00<00:00, 129.91it/s]



Training and evaluating model using 96 randomly selected samples...


[Training]:  68%|██████▊   | 34/50 [00:21<00:10,  1.56it/s]


Early stopping triggered after 10 epochs with no improvement.
Train Loss: 0.0285 | Val Loss: 0.1320 | SpearmanR: 0.5998


[Testing]: 100%|██████████| 25/25 [00:00<00:00, 128.37it/s]


Progress for experiment 5 appended to results/04_adding_diversity/random_from_top_fraction/active_vs_standard_learning_curve.csv
Starting ensemble training and pool evaluation...

Training Model 1...


[Training]:  90%|█████████ | 45/50 [00:31<00:03,  1.41it/s]


Early stopping triggered after 10 epochs with no improvement.
Train Loss: 0.0030 | Val Loss: 0.1603 | SpearmanR: 0.5065


[Surveying]: 100%|██████████| 24/24 [00:01<00:00, 16.61it/s]



Training Model 2...


[Training]:  58%|█████▊    | 29/50 [00:20<00:14,  1.40it/s]


Early stopping triggered after 10 epochs with no improvement.
Train Loss: 0.0181 | Val Loss: 0.1552 | SpearmanR: 0.5059


[Surveying]: 100%|██████████| 24/24 [00:01<00:00, 16.59it/s]



Training Model 3...


[Training]:  72%|███████▏  | 36/50 [00:19<00:07,  1.83it/s]


Early stopping triggered after 10 epochs with no improvement.
Train Loss: 0.0068 | Val Loss: 0.1569 | SpearmanR: 0.4809


[Surveying]: 100%|██████████| 24/24 [00:01<00:00, 16.48it/s]



Training Model 4...


[Training]:  80%|████████  | 40/50 [00:27<00:06,  1.43it/s]


Early stopping triggered after 10 epochs with no improvement.
Train Loss: 0.0074 | Val Loss: 0.1516 | SpearmanR: 0.5078


[Surveying]: 100%|██████████| 24/24 [00:01<00:00, 16.42it/s]



Training Model 5...


[Training]:  88%|████████▊ | 44/50 [00:29<00:04,  1.49it/s]


Early stopping triggered after 10 epochs with no improvement.
Train Loss: 0.0101 | Val Loss: 0.1569 | SpearmanR: 0.5020


[Surveying]: 100%|██████████| 24/24 [00:01<00:00, 16.52it/s]


Ensemble training complete, submitting predictions for next cycle.

Cycle 7/16
-------------------------------------------------
Saving variance distribution to results/04_adding_diversity/random_from_top_fraction/variances7.csv...
Save complete.
Selecting new data points...

Training and evaluating model using 112 actively selected samples...


[Training]: 100%|██████████| 50/50 [00:34<00:00,  1.45it/s]


Train Loss: 0.0096 | Val Loss: 0.1576 | SpearmanR: 0.5423


[Testing]: 100%|██████████| 25/25 [00:00<00:00, 128.10it/s]



Training and evaluating model using 112 randomly selected samples...


[Training]: 100%|██████████| 50/50 [00:33<00:00,  1.48it/s]


Train Loss: 0.0174 | Val Loss: 0.1533 | SpearmanR: 0.5817


[Testing]: 100%|██████████| 25/25 [00:00<00:00, 129.01it/s]


Progress for experiment 6 appended to results/04_adding_diversity/random_from_top_fraction/active_vs_standard_learning_curve.csv
Starting ensemble training and pool evaluation...

Training Model 1...


[Training]: 100%|██████████| 50/50 [00:32<00:00,  1.52it/s]


Train Loss: 0.0036 | Val Loss: 0.1561 | SpearmanR: 0.5177


[Surveying]: 100%|██████████| 24/24 [00:01<00:00, 16.62it/s]



Training Model 2...


[Training]:  34%|███▍      | 17/50 [00:11<00:22,  1.45it/s]


Early stopping triggered after 10 epochs with no improvement.
Train Loss: 0.0222 | Val Loss: 0.1765 | SpearmanR: 0.4338


[Surveying]: 100%|██████████| 24/24 [00:01<00:00, 16.67it/s]



Training Model 3...


[Training]:  66%|██████▌   | 33/50 [00:20<00:10,  1.59it/s]


Early stopping triggered after 10 epochs with no improvement.
Train Loss: 0.0446 | Val Loss: 0.1629 | SpearmanR: 0.4387


[Surveying]: 100%|██████████| 24/24 [00:01<00:00, 16.53it/s]



Training Model 4...


[Training]:  86%|████████▌ | 43/50 [00:30<00:04,  1.41it/s]


Early stopping triggered after 10 epochs with no improvement.
Train Loss: 0.0159 | Val Loss: 0.1547 | SpearmanR: 0.5035


[Surveying]: 100%|██████████| 24/24 [00:01<00:00, 16.51it/s]



Training Model 5...


[Training]:  54%|█████▍    | 27/50 [00:18<00:15,  1.46it/s]


Early stopping triggered after 10 epochs with no improvement.
Train Loss: 0.0162 | Val Loss: 0.1602 | SpearmanR: 0.4933


[Surveying]: 100%|██████████| 24/24 [00:01<00:00, 16.51it/s]


Ensemble training complete, submitting predictions for next cycle.

Cycle 8/16
-------------------------------------------------
Saving variance distribution to results/04_adding_diversity/random_from_top_fraction/variances8.csv...
Save complete.
Selecting new data points...

Training and evaluating model using 128 actively selected samples...


[Training]:  80%|████████  | 40/50 [00:25<00:06,  1.55it/s]


Early stopping triggered after 10 epochs with no improvement.
Train Loss: 0.0151 | Val Loss: 0.1450 | SpearmanR: 0.5699


[Testing]: 100%|██████████| 25/25 [00:00<00:00, 129.95it/s]



Training and evaluating model using 128 randomly selected samples...


[Training]:  94%|█████████▍| 47/50 [00:33<00:02,  1.42it/s]


Early stopping triggered after 10 epochs with no improvement.
Train Loss: 0.0358 | Val Loss: 0.1352 | SpearmanR: 0.5908


[Testing]: 100%|██████████| 25/25 [00:00<00:00, 128.69it/s]


Progress for experiment 7 appended to results/04_adding_diversity/random_from_top_fraction/active_vs_standard_learning_curve.csv
Starting ensemble training and pool evaluation...

Training Model 1...


[Training]:  38%|███▊      | 19/50 [00:13<00:21,  1.45it/s]


Early stopping triggered after 10 epochs with no improvement.
Train Loss: 0.1477 | Val Loss: 0.2029 | SpearmanR: 0.5158


[Surveying]: 100%|██████████| 24/24 [00:01<00:00, 16.62it/s]



Training Model 2...


[Training]: 100%|██████████| 50/50 [00:37<00:00,  1.32it/s]


Train Loss: 0.0045 | Val Loss: 0.1711 | SpearmanR: 0.4564


[Surveying]: 100%|██████████| 24/24 [00:01<00:00, 16.70it/s]



Training Model 3...


[Training]: 100%|██████████| 50/50 [00:24<00:00,  2.05it/s]


Train Loss: 0.0178 | Val Loss: 0.1521 | SpearmanR: 0.5003


[Surveying]: 100%|██████████| 24/24 [00:01<00:00, 16.73it/s]



Training Model 4...


[Training]:  80%|████████  | 40/50 [00:29<00:07,  1.37it/s]


Early stopping triggered after 10 epochs with no improvement.
Train Loss: 0.0133 | Val Loss: 0.1453 | SpearmanR: 0.5624


[Surveying]: 100%|██████████| 24/24 [00:01<00:00, 16.67it/s]



Training Model 5...


[Training]: 100%|██████████| 50/50 [00:40<00:00,  1.24it/s]


Train Loss: 0.0394 | Val Loss: 0.1550 | SpearmanR: 0.5098


[Surveying]: 100%|██████████| 24/24 [00:01<00:00, 16.62it/s]


Ensemble training complete, submitting predictions for next cycle.

Cycle 9/16
-------------------------------------------------
Saving variance distribution to results/04_adding_diversity/random_from_top_fraction/variances9.csv...
Save complete.
Selecting new data points...

Training and evaluating model using 144 actively selected samples...


[Training]:  68%|██████▊   | 34/50 [00:30<00:14,  1.11it/s]


Early stopping triggered after 10 epochs with no improvement.
Train Loss: 0.0296 | Val Loss: 0.1379 | SpearmanR: 0.5825


[Testing]: 100%|██████████| 25/25 [00:00<00:00, 129.54it/s]



Training and evaluating model using 144 randomly selected samples...


[Training]: 100%|██████████| 50/50 [00:33<00:00,  1.49it/s]


Train Loss: 0.0233 | Val Loss: 0.1500 | SpearmanR: 0.5689


[Testing]: 100%|██████████| 25/25 [00:00<00:00, 126.49it/s]


Progress for experiment 8 appended to results/04_adding_diversity/random_from_top_fraction/active_vs_standard_learning_curve.csv
Starting ensemble training and pool evaluation...

Training Model 1...


[Training]: 100%|██████████| 50/50 [00:23<00:00,  2.15it/s]


Train Loss: 0.0168 | Val Loss: 0.1379 | SpearmanR: 0.5476


[Surveying]: 100%|██████████| 24/24 [00:01<00:00, 16.61it/s]



Training Model 2...


[Training]:  92%|█████████▏| 46/50 [00:19<00:01,  2.39it/s]


Early stopping triggered after 10 epochs with no improvement.
Train Loss: 0.0120 | Val Loss: 0.1674 | SpearmanR: 0.5490


[Surveying]: 100%|██████████| 24/24 [00:01<00:00, 16.62it/s]



Training Model 3...


[Training]:  62%|██████▏   | 31/50 [00:14<00:08,  2.20it/s]


Early stopping triggered after 10 epochs with no improvement.
Train Loss: 0.0259 | Val Loss: 0.1938 | SpearmanR: 0.4978


[Surveying]: 100%|██████████| 24/24 [00:01<00:00, 16.72it/s]



Training Model 4...


[Training]: 100%|██████████| 50/50 [00:19<00:00,  2.50it/s]


Train Loss: 0.0217 | Val Loss: 0.1418 | SpearmanR: 0.5246


[Surveying]: 100%|██████████| 24/24 [00:01<00:00, 16.68it/s]



Training Model 5...


[Training]:  58%|█████▊    | 29/50 [00:14<00:10,  1.95it/s]


Early stopping triggered after 10 epochs with no improvement.
Train Loss: 0.0204 | Val Loss: 0.1893 | SpearmanR: 0.4517


[Surveying]: 100%|██████████| 24/24 [00:01<00:00, 16.62it/s]


Ensemble training complete, submitting predictions for next cycle.

Cycle 10/16
-------------------------------------------------
Saving variance distribution to results/04_adding_diversity/random_from_top_fraction/variances10.csv...
Save complete.
Selecting new data points...

Training and evaluating model using 160 actively selected samples...


[Training]:  94%|█████████▍| 47/50 [00:21<00:01,  2.14it/s]


Early stopping triggered after 10 epochs with no improvement.
Train Loss: 0.0205 | Val Loss: 0.1256 | SpearmanR: 0.5914


[Testing]: 100%|██████████| 25/25 [00:00<00:00, 125.13it/s]



Training and evaluating model using 160 randomly selected samples...


[Training]:  98%|█████████▊| 49/50 [00:25<00:00,  1.94it/s]


Early stopping triggered after 10 epochs with no improvement.
Train Loss: 0.0090 | Val Loss: 0.1355 | SpearmanR: 0.6327


[Testing]: 100%|██████████| 25/25 [00:00<00:00, 125.51it/s]


Progress for experiment 9 appended to results/04_adding_diversity/random_from_top_fraction/active_vs_standard_learning_curve.csv
Starting ensemble training and pool evaluation...

Training Model 1...


[Training]:  92%|█████████▏| 46/50 [00:22<00:01,  2.03it/s]


Early stopping triggered after 10 epochs with no improvement.
Train Loss: 0.0044 | Val Loss: 0.1577 | SpearmanR: 0.5298


[Surveying]: 100%|██████████| 24/24 [00:01<00:00, 16.71it/s]



Training Model 2...


[Training]:  82%|████████▏ | 41/50 [00:19<00:04,  2.15it/s]


Early stopping triggered after 10 epochs with no improvement.
Train Loss: 0.0117 | Val Loss: 0.1464 | SpearmanR: 0.5633


[Surveying]: 100%|██████████| 24/24 [00:01<00:00, 16.75it/s]



Training Model 3...


[Training]:  96%|█████████▌| 48/50 [00:23<00:00,  2.05it/s]


Early stopping triggered after 10 epochs with no improvement.
Train Loss: 0.0016 | Val Loss: 0.1418 | SpearmanR: 0.5635


[Surveying]: 100%|██████████| 24/24 [00:01<00:00, 16.84it/s]



Training Model 4...


[Training]:  98%|█████████▊| 49/50 [00:22<00:00,  2.17it/s]


Early stopping triggered after 10 epochs with no improvement.
Train Loss: 0.0056 | Val Loss: 0.1467 | SpearmanR: 0.5560


[Surveying]: 100%|██████████| 24/24 [00:01<00:00, 16.81it/s]



Training Model 5...


[Training]:  80%|████████  | 40/50 [00:17<00:04,  2.29it/s]


Early stopping triggered after 10 epochs with no improvement.
Train Loss: 0.0061 | Val Loss: 0.1338 | SpearmanR: 0.5726


[Surveying]: 100%|██████████| 24/24 [00:01<00:00, 16.60it/s]


Ensemble training complete, submitting predictions for next cycle.

Cycle 11/16
-------------------------------------------------
Saving variance distribution to results/04_adding_diversity/random_from_top_fraction/variances11.csv...
Save complete.
Selecting new data points...

Training and evaluating model using 176 actively selected samples...


[Training]:  82%|████████▏ | 41/50 [00:21<00:04,  1.95it/s]


Early stopping triggered after 10 epochs with no improvement.
Train Loss: 0.0332 | Val Loss: 0.1579 | SpearmanR: 0.6268


[Testing]: 100%|██████████| 25/25 [00:00<00:00, 128.50it/s]



Training and evaluating model using 176 randomly selected samples...


[Training]:  80%|████████  | 40/50 [00:20<00:05,  1.97it/s]


Early stopping triggered after 10 epochs with no improvement.
Train Loss: 0.0515 | Val Loss: 0.1163 | SpearmanR: 0.6516


[Testing]: 100%|██████████| 25/25 [00:00<00:00, 128.68it/s]


Progress for experiment 10 appended to results/04_adding_diversity/random_from_top_fraction/active_vs_standard_learning_curve.csv
Starting ensemble training and pool evaluation...

Training Model 1...


[Training]: 100%|██████████| 50/50 [00:33<00:00,  1.50it/s]


Train Loss: 0.0158 | Val Loss: 0.1611 | SpearmanR: 0.5337


[Surveying]: 100%|██████████| 24/24 [00:01<00:00, 17.05it/s]



Training Model 2...


[Training]:  84%|████████▍ | 42/50 [00:30<00:05,  1.38it/s]


Early stopping triggered after 10 epochs with no improvement.
Train Loss: 0.0073 | Val Loss: 0.1462 | SpearmanR: 0.5419


[Surveying]: 100%|██████████| 24/24 [00:01<00:00, 16.92it/s]



Training Model 3...


[Training]:  80%|████████  | 40/50 [00:30<00:07,  1.32it/s]


Early stopping triggered after 10 epochs with no improvement.
Train Loss: 0.0485 | Val Loss: 0.1494 | SpearmanR: 0.5668


[Surveying]: 100%|██████████| 24/24 [00:01<00:00, 17.08it/s]



Training Model 4...


[Training]:  72%|███████▏  | 36/50 [00:18<00:07,  1.93it/s]


Early stopping triggered after 10 epochs with no improvement.
Train Loss: 0.0086 | Val Loss: 0.1380 | SpearmanR: 0.5817


[Surveying]: 100%|██████████| 24/24 [00:01<00:00, 16.77it/s]



Training Model 5...


[Training]:  78%|███████▊  | 39/50 [00:21<00:05,  1.84it/s]


Early stopping triggered after 10 epochs with no improvement.
Train Loss: 0.0098 | Val Loss: 0.1325 | SpearmanR: 0.5677


[Surveying]: 100%|██████████| 24/24 [00:01<00:00, 16.69it/s]


Ensemble training complete, submitting predictions for next cycle.

Cycle 12/16
-------------------------------------------------
Saving variance distribution to results/04_adding_diversity/random_from_top_fraction/variances12.csv...
Save complete.
Selecting new data points...

Training and evaluating model using 192 actively selected samples...


[Training]: 100%|██████████| 50/50 [00:38<00:00,  1.31it/s]


Train Loss: 0.0164 | Val Loss: 0.1415 | SpearmanR: 0.6202


[Testing]: 100%|██████████| 25/25 [00:00<00:00, 127.06it/s]



Training and evaluating model using 192 randomly selected samples...


[Training]:  86%|████████▌ | 43/50 [00:33<00:05,  1.27it/s]


Early stopping triggered after 10 epochs with no improvement.
Train Loss: 0.0154 | Val Loss: 0.1386 | SpearmanR: 0.6167


[Testing]: 100%|██████████| 25/25 [00:00<00:00, 130.31it/s]


Progress for experiment 11 appended to results/04_adding_diversity/random_from_top_fraction/active_vs_standard_learning_curve.csv
Starting ensemble training and pool evaluation...

Training Model 1...


[Training]: 100%|██████████| 50/50 [00:33<00:00,  1.51it/s]


Train Loss: 0.0027 | Val Loss: 0.1411 | SpearmanR: 0.5689


[Surveying]: 100%|██████████| 24/24 [00:01<00:00, 16.91it/s]



Training Model 2...


[Training]:  56%|█████▌    | 28/50 [00:23<00:18,  1.17it/s]


Early stopping triggered after 10 epochs with no improvement.
Train Loss: 0.0091 | Val Loss: 0.1352 | SpearmanR: 0.5561


[Surveying]: 100%|██████████| 24/24 [00:01<00:00, 16.98it/s]



Training Model 3...


[Training]:  98%|█████████▊| 49/50 [00:37<00:00,  1.32it/s]


Early stopping triggered after 10 epochs with no improvement.
Train Loss: 0.0109 | Val Loss: 0.1192 | SpearmanR: 0.6487


[Surveying]: 100%|██████████| 24/24 [00:01<00:00, 16.83it/s]



Training Model 4...


[Training]:  70%|███████   | 35/50 [00:25<00:11,  1.35it/s]


Early stopping triggered after 10 epochs with no improvement.
Train Loss: 0.0143 | Val Loss: 0.1797 | SpearmanR: 0.5343


[Surveying]: 100%|██████████| 24/24 [00:01<00:00, 16.86it/s]



Training Model 5...


[Training]:  92%|█████████▏| 46/50 [00:36<00:03,  1.28it/s]


Early stopping triggered after 10 epochs with no improvement.
Train Loss: 0.0026 | Val Loss: 0.1287 | SpearmanR: 0.6145


[Surveying]: 100%|██████████| 24/24 [00:01<00:00, 16.86it/s]


Ensemble training complete, submitting predictions for next cycle.

Cycle 13/16
-------------------------------------------------
Saving variance distribution to results/04_adding_diversity/random_from_top_fraction/variances13.csv...
Save complete.
Selecting new data points...

Training and evaluating model using 208 actively selected samples...


[Training]:  70%|███████   | 35/50 [00:29<00:12,  1.20it/s]


Early stopping triggered after 10 epochs with no improvement.
Train Loss: 0.0254 | Val Loss: 0.1389 | SpearmanR: 0.6247


[Testing]: 100%|██████████| 25/25 [00:00<00:00, 129.79it/s]



Training and evaluating model using 208 randomly selected samples...


[Training]:  82%|████████▏ | 41/50 [00:31<00:06,  1.30it/s]


Early stopping triggered after 10 epochs with no improvement.
Train Loss: 0.0177 | Val Loss: 0.1152 | SpearmanR: 0.6718


[Testing]: 100%|██████████| 25/25 [00:00<00:00, 131.18it/s]


Progress for experiment 12 appended to results/04_adding_diversity/random_from_top_fraction/active_vs_standard_learning_curve.csv
Starting ensemble training and pool evaluation...

Training Model 1...


[Training]: 100%|██████████| 50/50 [00:41<00:00,  1.19it/s]


Train Loss: 0.0057 | Val Loss: 0.1622 | SpearmanR: 0.5660


[Surveying]: 100%|██████████| 24/24 [00:01<00:00, 17.28it/s]



Training Model 2...


[Training]:  94%|█████████▍| 47/50 [00:40<00:02,  1.17it/s]


Early stopping triggered after 10 epochs with no improvement.
Train Loss: 0.0056 | Val Loss: 0.1614 | SpearmanR: 0.5567


[Surveying]: 100%|██████████| 24/24 [00:01<00:00, 17.32it/s]



Training Model 3...


[Training]:  74%|███████▍  | 37/50 [00:24<00:08,  1.51it/s]


Early stopping triggered after 10 epochs with no improvement.
Train Loss: 0.0121 | Val Loss: 0.1210 | SpearmanR: 0.6124


[Surveying]: 100%|██████████| 24/24 [00:01<00:00, 17.12it/s]



Training Model 4...


[Training]:  64%|██████▍   | 32/50 [00:28<00:15,  1.14it/s]


Early stopping triggered after 10 epochs with no improvement.
Train Loss: 0.0200 | Val Loss: 0.1585 | SpearmanR: 0.5745


[Surveying]: 100%|██████████| 24/24 [00:01<00:00, 17.24it/s]



Training Model 5...


[Training]:  82%|████████▏ | 41/50 [00:23<00:05,  1.74it/s]


Early stopping triggered after 10 epochs with no improvement.
Train Loss: 0.0144 | Val Loss: 0.1526 | SpearmanR: 0.5292


[Surveying]: 100%|██████████| 24/24 [00:01<00:00, 17.32it/s]


Ensemble training complete, submitting predictions for next cycle.

Cycle 14/16
-------------------------------------------------
Saving variance distribution to results/04_adding_diversity/random_from_top_fraction/variances14.csv...
Save complete.
Selecting new data points...

Training and evaluating model using 224 actively selected samples...


[Training]:  88%|████████▊ | 44/50 [00:39<00:03,  1.96it/s]

In [None]:
lc_df = pd.read_csv("results/04_adding_diversity/random_from_top_fraction/active_vs_standard_learning_curve.csv")
lc_df.head()

Unnamed: 0,changing_var,local_exp_idx,value,training_method,avg_test_loss,spearmanr,pearsonr,final_mse
0,n_samples,0,16,active,0.197894,0.382034,0.373106,0.198538
1,n_samples,0,16,standard,0.320916,0.270746,0.294286,0.320806
2,n_samples,1,32,active,0.184044,0.545784,0.502709,0.184418
3,n_samples,1,32,standard,0.17345,0.41655,0.422225,0.173986
4,n_samples,2,48,active,0.236712,0.378425,0.410424,0.237579


In [None]:
import plotly.express as px

fig = px.line(lc_df, x='value', y='spearmanr', color='training_method')

fig.update_layout(
    xaxis_title="Number of Training Samples",
    yaxis_title="Spearman Correlation Coefficient"
)

fig.show()
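To read the gap between the two strategies as numbers rather than off the plot, the learning curve can be pivoted so that each training method becomes a column. This is only a sketch: it uses the five rows printed by `lc_df.head()` above, so the resulting table is incomplete by construction, and `toy_df`, `wide`, and `gap` are names introduced here for illustration.

```python
import pandas as pd

# The five rows printed by lc_df.head(); the full CSV contains more cycles.
rows = [
    (16, "active", 0.382034),
    (16, "standard", 0.270746),
    (32, "active", 0.545784),
    (32, "standard", 0.416550),
    (48, "active", 0.378425),
]
toy_df = pd.DataFrame(rows, columns=["value", "training_method", "spearmanr"])

# One row per training-set size, one column per training method.
wide = toy_df.pivot(index="value", columns="training_method", values="spearmanr")

# A positive gap means active selection beat random selection at that size.
# Sizes with only one method present (48 here) come out as NaN.
wide["gap"] = wide["active"] - wide["standard"]
```

Running the same pivot on the full `lc_df` would give one gap per cycle, which makes it easier to see whether the active curve is consistently above or below the random baseline.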

In [None]:
import pandas as pd
import glob
from pathlib import Path

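# Collect the per-cycle variance dumps written during the active learning run.
file_pattern = "results/04_adding_diversity/random_from_top_fraction/variances*.csv"
# Note: this is a lexicographic sort, so variances10 comes before variances2.
variance_files = sorted(glob.glob(file_pattern))

all_variances_dfs = []

for filepath in variance_files:
    temp_df = pd.read_csv(filepath)
    # Name each column after its file (e.g. "variances9") so cycles stay distinguishable.
    column_name = Path(filepath).stem
    temp_df = temp_df.rename(columns={'variance': column_name})
    all_variances_dfs.append(temp_df)

# Place the cycles side by side; columns of different length are padded with NaN.
final_df = pd.concat(all_variances_dfs, axis=1)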

final_df

Unnamed: 0,variances10,variances11,variances12,variances13,variances14,variances15,variances16,variances2,variances3,variances4,variances5,variances6,variances7,variances8,variances9
0,0.149963,0.011875,0.025784,0.021565,0.042797,0.023504,0.020772,0.013999,0.028303,0.051192,0.048048,0.027420,0.055021,0.014279,0.049370
1,0.050087,0.005050,0.020347,0.031385,0.019446,0.037039,0.030839,0.119323,0.064065,0.027289,0.024033,0.024117,0.022003,0.027090,0.043563
2,0.069689,0.014829,0.010998,0.004873,0.013083,0.016777,0.035694,0.028558,0.052594,0.060716,0.149823,0.034743,0.123635,0.090861,0.006818
3,0.030903,0.022928,0.014478,0.082185,0.103137,0.105877,0.029240,0.044814,0.054224,0.040302,0.059472,0.037716,0.021208,0.004541,0.026667
4,0.036251,0.021226,0.081336,0.022297,0.035207,0.022637,0.013253,0.083888,0.057288,0.017450,0.041651,0.015074,0.017717,0.039450,0.071926
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3147,,,,,,,,0.032242,,,,,,,
3148,,,,,,,,0.089120,,,,,,,
3149,,,,,,,,0.029723,,,,,,,
3150,,,,,,,,0.050000,,,,,,,
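
Because the file list is sorted as strings, the columns above run from variances10 to variances16 before variances2. If cycle order matters (for example when comparing distributions across consecutive cycles), one option is to sort on the numeric suffix instead. A minimal sketch: `cycle_number` is a hypothetical helper and assumes every file stem ends in digits.

```python
import re
from pathlib import Path

def cycle_number(filepath):
    """Extract the trailing cycle number from a name like 'variances12.csv'."""
    match = re.search(r"(\d+)$", Path(filepath).stem)
    if match is None:
        raise ValueError(f"no cycle number in {filepath!r}")
    return int(match.group(1))

# Illustrative file names only; in the notebook this would be the glob result.
example_files = ["variances10.csv", "variances2.csv", "variances9.csv"]
ordered = sorted(example_files, key=cycle_number)
```

Passing `key=cycle_number` to the sort over `variance_files` would put the columns of `final_df` in cycle order.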


In [None]:
columns_to_plot = ['variances2', 'variances9', 'variances15']

# "Melt" the selected columns into a long-format DataFrame
melted_df = final_df[columns_to_plot].melt(
    var_name='Cycle',      # New column for the original column names
    value_name='Variance'  # New column for the variance values
).dropna(subset=['Variance'])  # Shorter cycles are NaN-padded by the concat; drop those rows

fig = px.histogram(
    data_frame=melted_df,
    x='Variance',
    color='Cycle',
    barmode='overlay',
    opacity=0.65,
    histnorm='probability density',
    title='Distribution of Variances Across Active Learning Cycles'
)

fig.show()

### With embeddings

In [None]:
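# Planned helpers for diversity-aware acquisition (stubs, not yet implemented).

def get_sequence_embedding(model, sequence):
    """Return the embedding of one sequence (last hidden state of the CLS token)."""
    raise NotImplementedError

def get_all_embeddings(unlabeled_indices):
    """Return the embeddings of every sequence in the unlabeled pool."""
    raise NotImplementedError

def get_distance(embed_1, embed_2):
    """Return the distance between two embeddings."""
    raise NotImplementedError

def get_all_distances(unlabeled_embeddings):
    """Return the pairwise distances between the unlabeled embeddings."""
    raise NotImplementedError

def get_next_sample():
    """Pick the next sample to add to the batch."""
    raise NotImplementedError

def acquire_new_batch():
    """Assemble the next batch of samples to label."""
    raise NotImplementedError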
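
One way these helpers could fit together, as a minimal sketch rather than the implementation used in this notebook: keep the top fraction of the pool by ensemble variance, then greedily add the candidate whose embedding is farthest from everything already chosen (farthest-point selection). The function name `select_diverse_uncertain`, the `top_fraction` parameter, the use of Euclidean distance, and seeding the batch with the single most uncertain candidate are all assumptions made for illustration. Embeddings are assumed to be precomputed as an array of shape `(n_pool, dim)`.

```python
import numpy as np

def select_diverse_uncertain(embeddings, variances, batch_size, top_fraction=0.2):
    """Greedy farthest-point selection restricted to the most uncertain candidates.

    Hypothetical helper: the name, arguments, and defaults are illustrative.
    Returns a list of indices into the unlabeled pool.
    """
    embeddings = np.asarray(embeddings, dtype=float)
    variances = np.asarray(variances, dtype=float)
    if batch_size <= 0 or len(variances) == 0:
        return []

    # Keep the top fraction of the pool by variance (at least batch_size candidates).
    n_keep = max(batch_size, int(np.ceil(top_fraction * len(variances))))
    n_keep = min(n_keep, len(variances))
    candidates = np.argsort(variances)[::-1][:n_keep]
    cand_emb = embeddings[candidates]

    # Seed with the most uncertain candidate (position 0 after the descending sort).
    chosen = [0]
    # Distance from every candidate to its nearest already-chosen point.
    min_dist = np.linalg.norm(cand_emb - cand_emb[0], axis=1)

    while len(chosen) < min(batch_size, n_keep):
        min_dist[chosen] = -np.inf  # never re-pick a chosen candidate
        nxt = int(np.argmax(min_dist))
        chosen.append(nxt)
        new_dist = np.linalg.norm(cand_emb - cand_emb[nxt], axis=1)
        min_dist = np.minimum(min_dist, new_dist)

    # Map positions within the candidate set back to pool indices.
    return candidates[chosen].tolist()

# Toy pool: points 0 and 1 are near-duplicates, point 4 has low variance.
toy_embeddings = [[0, 0], [0.1, 0], [10, 0], [0, 10], [5, 5]]
toy_variances = [0.9, 0.8, 0.7, 0.6, 0.01]
picked = select_diverse_uncertain(toy_embeddings, toy_variances, batch_size=3, top_fraction=0.8)
```

On the toy pool, ranking by variance alone would pick indices 0, 1, and 2, whereas this sketch skips index 1 because it sits almost on top of index 0. Setting `top_fraction=1.0` gives the purely diverse variant, which is the other condition worth comparing against the random baseline.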