# Training a BioEncoder model on butterflies

This notebook demonstrates the complete two-stage workflow for training a BioEncoder model on Junonia butterfly images. Stage 1 learns discriminative features using deep metric learning, while Stage 2 fine-tunes a classification head for species prediction.

In [None]:
import os
import bioencoder

In [None]:
os.chdir(r"D:\git-repos\mluerig\workshop-nau-bioencoder")
# os.chdir(r"/home/mlurig/git-repos/workshop-nau-bioencoder")
# os.chdir(r"/scratch/mdl458/workshop-nau-bioencoder")

### Initialize BioEncoder Workspace

Create the project directory structure for this training run. The `root_dir` parameter points to where all training outputs (models, logs, plots) will be saved, and the `run_name` parameter lets you keep multiple experiments separate.

In [None]:
bioencoder.configure(root_dir="bioencoder_wd", run_name="v1", create=True)

### Split Dataset into Train/Val Sets

Automatically partition the Junonia dorsal images into training and validation sets. The `max_ratio=10` parameter ensures that no class keeps more than 10 times as many samples as the smallest class, which helps balance training. The `random_seed` parameter makes the split reproducible. Use `help(bioencoder.split_dataset)` to see additional options, such as `val_percent` for custom split proportions.

In [None]:
help(bioencoder.split_dataset)
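
To make the effect of `max_ratio`, `val_percent`, and `random_seed` concrete, here is a minimal pure-Python sketch of a class-capped split. It is not BioEncoder's implementation: the function `capped_split`, its defaults, and the toy file names are hypothetical and only illustrate the idea. For brevity it produces only train and validation subsets.

```python
import random
from collections import defaultdict


def capped_split(labels_by_file, max_ratio=10, val_percent=0.1, random_seed=42):
    """Illustrative only: cap large classes, then split each class into train/val."""
    rng = random.Random(random_seed)  # seeded generator makes the split repeatable

    # Group file names by class label (sorted for a deterministic starting order).
    by_class = defaultdict(list)
    for filename, label in sorted(labels_by_file.items()):
        by_class[label].append(filename)

    # No class may keep more than max_ratio times the smallest class size.
    smallest = min(len(files) for files in by_class.values())
    cap = smallest * max_ratio

    split = {"train": [], "val": []}
    for label, files in sorted(by_class.items()):
        rng.shuffle(files)
        kept = files[:cap]
        n_val = max(1, round(len(kept) * val_percent))
        split["val"] += [(f, label) for f in kept[:n_val]]
        split["train"] += [(f, label) for f in kept[n_val:]]
    return split


# Hypothetical toy data: one common class (500 images) and one rare class (20 images).
toy = {f"common_{i}.jpg": "common" for i in range(500)}
toy.update({f"rare_{i}.jpg": "rare" for i in range(20)})

split = capped_split(toy, max_ratio=10, val_percent=0.1, random_seed=42)
print(len(split["train"]), len(split["val"]))
```

With these toy numbers the common class is capped at 200 images (10 times the 20 rare images) before the split, so the imbalance seen during training is bounded by `max_ratio`.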

### Stage 1 Training

Train the feature extraction backbone using metric learning (e.g., triplet or contrastive loss). This stage learns to embed visually similar specimens close together in feature space. 

In [None]:
# Pass overwrite=True to overwrite the outputs of an existing run.
bioencoder.train(root_dir=r"bioencoder_wd", run_name="v1", config_path=r"configs/train_stage1.yml")
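
As an illustration of what a metric-learning objective optimizes, the following sketch computes a triplet loss on toy 2-D embeddings. This is a generic textbook formulation, not necessarily the loss BioEncoder uses (the actual loss is set in the training config); the function names and the `margin` value are assumptions made for this example.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Zero once the negative is at least `margin` farther from the anchor than the positive."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


anchor = [0.0, 0.0]
positive = [0.0, 1.0]       # same class as the anchor, distance 1.0
near_negative = [0.0, 1.5]  # different class, but only 1.5 away
far_negative = [3.0, 4.0]   # different class, 5.0 away

loss_violated = triplet_loss(anchor, positive, near_negative)
loss_satisfied = triplet_loss(anchor, positive, far_negative)
print(loss_violated, loss_satisfied)
```

The loss is positive only while a different-class specimen sits too close to the anchor, which is what pushes same-class specimens together and different-class specimens apart in feature space.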

### Stochastic Weight Averaging (SWA)

Average the model weights from the last several training epochs to create a more robust final model. SWA typically improves generalization by finding flatter minima in the loss landscape, leading to better performance on unseen data.

In [None]:
bioencoder.swa(config_path=r"configs/swa_stage1.yml")
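
The core of weight averaging can be sketched in a few lines. The example below takes checkpoints represented as plain dictionaries of float lists and returns their element-wise mean; it is a simplified illustration, not BioEncoder's `swa` implementation, and the helper `average_checkpoints` and the parameter names are hypothetical.

```python
def average_checkpoints(checkpoints):
    """Element-wise mean over a list of {parameter_name: [floats]} dictionaries."""
    if not checkpoints:
        raise ValueError("need at least one checkpoint")
    n = len(checkpoints)
    averaged = {}
    for name in checkpoints[0]:
        # zip(*) lines up the i-th value of this parameter across all checkpoints.
        columns = zip(*(ckpt[name] for ckpt in checkpoints))
        averaged[name] = [sum(values) / n for values in columns]
    return averaged


# Hypothetical weights saved at the end of the last three epochs.
last_epochs = [
    {"layer.weight": [1.0, 2.0], "layer.bias": [0.0]},
    {"layer.weight": [3.0, 4.0], "layer.bias": [0.5]},
    {"layer.weight": [5.0, 6.0], "layer.bias": [1.0]},
]

swa_weights = average_checkpoints(last_epochs)
print(swa_weights)
```

Averaging smooths out the epoch-to-epoch fluctuations of the individual checkpoints, which is the intuition behind the flatter minima mentioned above.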

### Visualize Stage 1 Results

Generate interactive plots including PCA/t-SNE embeddings to visualize how the model organizes specimens in feature space. These visualizations help assess whether the learned embeddings capture meaningful phenotypic relationships.

In [None]:
df_emb, df_plots = bioencoder.interactive_plots(config_path=r"configs/plot_stage1.yml", overwrite=True)
# Save the embeddings for later analysis (the variable name must match the one assigned above).
os.makedirs(r"data", exist_ok=True)
df_emb.to_csv(r"data/embeddings_v1.csv", index=False)
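
A simple number can complement the visual inspection: the fraction of specimens whose nearest neighbor in embedding space belongs to the same class. The sketch below computes this on toy data with plain lists. The function `nearest_neighbor_agreement` is hypothetical and makes no assumption about the column layout of the saved CSV, so the embeddings and labels would need to be extracted from it first.

```python
import math


def nearest_neighbor_agreement(embeddings, labels):
    """Fraction of points whose nearest *other* point shares their label."""
    hits = 0
    for i, (point, label) in enumerate(zip(embeddings, labels)):
        # Leave the point itself out, then find the closest remaining point.
        best_j = min(
            (j for j in range(len(embeddings)) if j != i),
            key=lambda j: math.dist(point, embeddings[j]),
        )
        hits += labels[best_j] == label
    return hits / len(embeddings)


# Toy embeddings: two tight, well-separated clusters.
clustered = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.2], [5.0, 5.0], [5.1, 5.0], [5.0, 5.2]]
clean_labels = ["A", "A", "A", "B", "B", "B"]

# Toy embeddings where each point's nearest neighbor has the other label.
interleaved = [[0.0], [1.0], [10.0], [11.0]]
mixed_labels = ["A", "B", "A", "B"]

clean_score = nearest_neighbor_agreement(clustered, clean_labels)
mixed_score = nearest_neighbor_agreement(interleaved, mixed_labels)
print(clean_score, mixed_score)
```

Values near 1.0 suggest that classes form coherent neighborhoods, while low values suggest that the embedding does not yet separate them.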


### Stage 2 Training

Fine-tune the pre-trained encoder by adding and training a classification head. This stage uses the rich feature representations learned in Stage 1 but optimizes for direct class prediction using cross-entropy loss. The frozen or partially frozen backbone helps prevent overfitting.

In [None]:
# Pass overwrite=True to overwrite the outputs of an existing run.
bioencoder.train(root_dir=r"bioencoder_wd", config_path=r"configs/train_stage2.yml")
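
For reference, the cross-entropy objective that a classification head minimizes can be written out directly. The sketch below is a generic softmax cross-entropy in pure Python, not BioEncoder code; the logits are made-up numbers standing in for the head's outputs on one image.

```python
import math


def softmax(logits):
    """Convert raw scores to probabilities (shifted by the max for numerical stability)."""
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target_index):
    """Negative log-probability assigned to the true class."""
    return -math.log(softmax(logits)[target_index])


# Hypothetical three-class example where class 0 is the true species.
uniform_loss = cross_entropy([0.0, 0.0, 0.0], target_index=0)    # undecided head
confident_loss = cross_entropy([5.0, 0.0, 0.0], target_index=0)  # confident and correct
print(uniform_loss, confident_loss)
```

An undecided head over three classes scores `log(3)`, and the loss shrinks toward zero as the head assigns more probability to the correct species.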

### SWA for Stage 2

Average the weights from the Stage 2 training checkpoints to stabilize the final classification model. This lets the classifier benefit from the same kind of generalization improvement as the feature extractor.

In [None]:
bioencoder.swa(config_path=r"configs/swa_stage2.yml")

### Explore the Final Model

Launch the interactive model explorer to analyze predictions, visualize attention maps, and identify which image regions the model uses for classification. This tool helps reveal which morphological traits matter most and lets you check whether the model has learned biologically meaningful features.

In [None]:
bioencoder.model_explorer(config_path=r"configs/explore_stage2.yml")