# Butterfly Classification Pipeline Notebook

This notebook runs the full pipeline by importing functions from the repository. It executes:
1. Wing segmentation using the pretrained U-Net model
2. Data augmentation to balance the dataset
3. Fine-tuning the BioCLIP classifier

Before running, please ensure you have downloaded the data and U-Net model weights as described in the project summary.

Adjust the placeholder paths and parameters as needed.

In [None]:
# Add repository folders to the Python path
import sys
import os

# Adjust these relative paths if your repository structure is different
repo_root = os.path.abspath(os.path.join(os.getcwd(), '..'))
for module_dir in ('remove_bg', 'augmentation', 'finetune_bioclip'):
    sys.path.append(os.path.join(repo_root, module_dir))

# Define common paths (update these paths as needed)
PRETRAINED_UNET_MODEL = os.path.join('..', 'models', 'best_unet_model.pth')
METADATA_CSV = os.path.join('..', 'data', 'metadata.csv')
OUTPUT_WING_IMAGES = os.path.join('..', 'data', 'wing_images')

ORIG_IMG_FOLDER = OUTPUT_WING_IMAGES  # Use the segmented wing images
OUTPUT_AUG_IMAGES = os.path.join('..', 'data', 'augmented_images')
INPUT_CSV_FOR_AUG = METADATA_CSV
OUTPUT_CSV_FOR_AUG = os.path.join('..', 'data', 'augmented_metadata.csv')

# Target minimum number of images per class after augmentation
AUG_MIN_IMAGES_PER_CLASS = 1000
AUG_PER_IMAGE_HIGH_COUNT = 1

DATA_FOR_FINETUNING = OUTPUT_CSV_FOR_AUG
IMG_DIR_FOR_FINETUNING = OUTPUT_AUG_IMAGES
CLASSIFIER_SAVE_DIR = os.path.join('..', 'models', 'bioclip_classifier')

print('Paths set up successfully.')
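
Before moving on, it can help to confirm that the downloaded inputs are where the configuration expects them. The following is a minimal sketch; the helper name `missing_inputs` is hypothetical and not part of the repository, and the paths simply mirror the configuration cell above.

```python
import os

def missing_inputs(paths):
    """Return the labels of required inputs that do not exist on disk."""
    return [label for label, path in paths.items() if not os.path.exists(path)]

# Paths mirror the configuration cell above; adjust if yours differ.
required = {
    'U-Net weights': os.path.join('..', 'models', 'best_unet_model.pth'),
    'metadata CSV': os.path.join('..', 'data', 'metadata.csv'),
}

missing = missing_inputs(required)
if missing:
    print('Missing inputs:', ', '.join(missing))
else:
    print('All required inputs found.')
```

Running this check first surfaces a missing download immediately instead of partway through segmentation.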

In [None]:
# Import functions from the custom modules
from select_wings_unet import select_wings
from albumentation_augm import augment_dataset
from finetune_aug_bg import finetune_model

print('Custom modules imported successfully.')

## 1. Wing Segmentation

This step uses the pretrained U-Net model to segment the butterfly wings. The function `select_wings` is assumed to perform the following:
- Resize input images
- Apply the U-Net to generate a segmentation mask
- Threshold the mask and remove the background
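
The thresholding and background-removal step listed above can be sketched with NumPy. This is an illustration only, not the code inside `select_wings`: the function name `remove_background`, the 0.5 threshold, and the black fill value are assumptions. It also assumes the mask has already been resized to match the image.

```python
import numpy as np

def remove_background(image, prob_mask, threshold=0.5, fill_value=0):
    """Keep pixels whose mask probability exceeds the threshold.

    image:     H x W x 3 uint8 array
    prob_mask: H x W float array in [0, 1], e.g. a sigmoid over the U-Net output
    """
    binary = prob_mask > threshold  # H x W boolean mask
    # Broadcast the mask over the colour channels; background gets fill_value.
    return np.where(binary[..., None], image, fill_value).astype(image.dtype)

# Toy example: a 2 x 2 image in which two pixels are classified as wing.
image = np.full((2, 2, 3), 200, dtype=np.uint8)
prob_mask = np.array([[0.9, 0.1],
                      [0.2, 0.8]])
result = remove_background(image, prob_mask)
```

In this toy case the two pixels with probabilities 0.9 and 0.8 keep their original values and the other two are set to the fill value.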

Update the paths if needed before running.

In [None]:
# Run wing segmentation
select_wings(
    model_path=PRETRAINED_UNET_MODEL,
    csv_path=METADATA_CSV,
    output_folder=OUTPUT_WING_IMAGES
)

print('Wing segmentation completed.')

## 2. Data Augmentation

This step augments the segmented wing images to balance the dataset so that every class has at least 1,000 images. The function `augment_dataset` is expected to write the augmented images to the output folder and record them in a new metadata CSV.
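
To make the balancing idea concrete, the sketch below computes how many augmented copies each original image would need for its class to reach the minimum. It is an assumption about how such a plan could be derived, not the logic of `augment_dataset`. In particular, it assumes that `aug_per_image_high_count` is the number of copies per image for classes already at or above the minimum; the helper name `augmentations_per_image` is hypothetical.

```python
import math
from collections import Counter

def augmentations_per_image(labels, min_images_per_class=1000,
                            aug_per_image_high_count=1):
    """Return {class: number of augmented copies to create per original image}."""
    counts = Counter(labels)
    plan = {}
    for cls, n in counts.items():
        if n >= min_images_per_class:
            # Assumed meaning: well-represented classes get a fixed number of copies.
            plan[cls] = aug_per_image_high_count
        else:
            # Spread the deficit evenly over the existing images, rounding up.
            deficit = min_images_per_class - n
            plan[cls] = math.ceil(deficit / n)
    return plan

# Toy class distribution: one large class and two small ones.
labels = ['a'] * 1200 + ['b'] * 300 + ['c'] * 50
plan = augmentations_per_image(labels)
```

Under these assumptions, class `b` (300 images) needs 3 copies per image and class `c` (50 images) needs 19, which brings both to at least 1,000 images including the originals.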

In [None]:
# Run data augmentation
augment_dataset(
    orig_img_folder=ORIG_IMG_FOLDER,
    output_img_folder=OUTPUT_AUG_IMAGES,
    csv_path=INPUT_CSV_FOR_AUG,
    output_csv_path=OUTPUT_CSV_FOR_AUG,
    min_images_per_class=AUG_MIN_IMAGES_PER_CLASS,
    aug_per_image_high_count=AUG_PER_IMAGE_HIGH_COUNT
)

print('Data augmentation completed.')

## 3. Fine-Tuning BioCLIP

This step fine-tunes the pretrained BioCLIP model for butterfly subspecies classification. The function `finetune_model` is expected to:
- Load the augmented dataset and images
- Unfreeze the last two attention blocks of the backbone and train them with a small learning rate
- Train an additional classifier head with a higher learning rate
- Save the best model

Adjust any additional hyperparameters via keyword arguments as needed.
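
The two-learning-rate scheme described above is commonly expressed as optimizer parameter groups. The sketch below is an assumption about how that could be set up, not the code in `finetune_model`. The helper names, the `resblocks.<index>.` naming pattern, and the `head.` prefix are hypothetical and should be matched to the actual model's parameter names. Dummy objects stand in for real parameters so the example runs without loading a model.

```python
import re
from types import SimpleNamespace

def block_index(name, pattern):
    """Return the transformer block index encoded in a parameter name, or None."""
    match = re.search(pattern, name)
    return int(match.group(1)) if match else None

def build_param_groups(named_params, n_unfreeze=2, lr_backbone=1e-5, lr_head=1e-3,
                       block_pattern=r'resblocks\.(\d+)\.', head_prefix='head.'):
    """Freeze all but the last n_unfreeze blocks and the head; return optimizer groups."""
    named_params = list(named_params)
    indices = sorted({block_index(name, block_pattern) for name, _ in named_params}
                     - {None})
    trainable_blocks = set(indices[-n_unfreeze:]) if n_unfreeze > 0 else set()

    backbone, head = [], []
    for name, param in named_params:
        if name.startswith(head_prefix):
            param.requires_grad = True
            head.append(param)
        elif block_index(name, block_pattern) in trainable_blocks:
            param.requires_grad = True
            backbone.append(param)
        else:
            param.requires_grad = False  # frozen: excluded from both groups
    return [
        {'params': backbone, 'lr': lr_backbone},
        {'params': head, 'lr': lr_head},
    ]

# Toy stand-in for model.named_parameters(): four blocks plus a classifier head.
names = [f'visual.transformer.resblocks.{i}.attn.in_proj_weight' for i in range(4)]
names += ['head.weight', 'head.bias']
toy = [(name, SimpleNamespace(requires_grad=True)) for name in names]
groups = build_param_groups(toy)
```

With a real PyTorch model, the same list of dictionaries can be passed to an optimizer such as `torch.optim.AdamW(groups)`, which applies each group's `lr` to its own parameters.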

In [None]:
# Run fine-tuning
finetune_model(
    data_file=DATA_FOR_FINETUNING,
    img_dir=IMG_DIR_FOR_FINETUNING,
    clf_save_dir=CLASSIFIER_SAVE_DIR,
    # Optional hyperparameters; adjust as needed
    num_epochs=30,
    batch_size=32,
    lr_backbone=1e-5,
    lr_head=1e-3
)

print('Fine-tuning completed.')

## Pipeline Completed

The notebook has run all the steps of the butterfly classification pipeline:
1. Wing segmentation
2. Data augmentation
3. Fine-tuning of the BioCLIP classifier

Check the output folders and saved models to verify that each step was executed correctly.
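
One way to verify the augmentation output is to count rows per class in the augmented metadata CSV and list any class still below the target. This is a minimal sketch: the helper name `classes_below_minimum` and the column name `label` are hypothetical, so substitute the label column your metadata actually uses.

```python
import csv
import io
from collections import Counter

def classes_below_minimum(csv_file, label_column, min_images_per_class=1000):
    """Return {class: count} for classes with fewer rows than the minimum."""
    counts = Counter(row[label_column] for row in csv.DictReader(csv_file))
    return {cls: n for cls, n in counts.items() if n < min_images_per_class}

# Toy example with an in-memory CSV; the column names are hypothetical.
sample = io.StringIO('filename,label\na.png,x\nb.png,x\nc.png,y\n')
short = classes_below_minimum(sample, label_column='label', min_images_per_class=2)
print(short)
```

To run it on the real output, open the file at `OUTPUT_CSV_FOR_AUG` with `open(..., newline='')` and pass the file object in place of `sample`. An empty result means every class meets the minimum.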