# Butterfly Classification Pipeline Notebook with Data & Model Downloads

This notebook downloads a sample of the dataset and the model weights, validates the downloaded images, and then runs the full pipeline:

1. Wing segmentation using the pretrained U-Net model
2. Data augmentation for dataset balancing
3. Fine-tuning the pretrained BioCLIP classifier

Scripts designed as command-line tools (segmentation, augmentation, and fine-tuning) are invoked with IPython's `!` shell escape.

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

sns.set_style('whitegrid')
print('Seaborn style set.')

## 1. Load Dataset CSV

Load the butterfly anomaly training CSV from GitHub.

In [None]:
csv_url = "https://raw.githubusercontent.com/Imageomics/HDR-anomaly-challenge/refs/heads/main/files/butterfly_anomaly_train.csv"
df = pd.read_csv(csv_url)
print(df.head())
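
Later cells index the table by column name, so a quick check that those columns are present can catch a changed or truncated CSV early. The sketch below is an optional addition: `REQUIRED_COLUMNS` and `missing_columns` are hypothetical names, and the list is collected from the columns this notebook references, not from a schema published with the dataset.

```python
# Columns referenced by name later in this notebook (an assumption gathered
# from the cells below, not an official schema).
REQUIRED_COLUMNS = [
    "CAMID", "subspecies", "parent_subspecies_1",
    "parent_subspecies_2", "hybrid_stat", "md5",
]

def missing_columns(columns, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent from `columns`."""
    present = set(columns)
    return [name for name in required if name not in present]

# In the notebook this would be called as missing_columns(df.columns).
example = ["CAMID", "subspecies", "hybrid_stat", "md5"]
print(missing_columns(example))
```

An empty list means every column used below is available.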

## 2. Create Classification Column

Rows without a direct subspecies label get a classification built from their two parent subspecies, joined as `<parent 1> and <parent 2>`. All other rows use their subspecies label.

In [None]:
# Label for rows without a subspecies: "<parent 1> and <parent 2>".
parent_pair = df["parent_subspecies_1"].astype(str) + " and " + df["parent_subspecies_2"].astype(str)

# Keep the subspecies label where it exists; otherwise fall back to the parent pair.
df["classification"] = df["subspecies"].fillna(parent_pair).astype(str)

print('Classification column added.')
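
To make the labeling rule concrete, here is a minimal sketch on three hypothetical rows. The values `alpha` and `beta` are made up for illustration; only the column names come from the training CSV.

```python
import pandas as pd

# Hypothetical toy rows; the real values come from the training CSV.
toy = pd.DataFrame({
    "CAMID": ["A1", "A2", "A3"],
    "subspecies": ["alpha", None, "beta"],
    "parent_subspecies_1": [None, "alpha", None],
    "parent_subspecies_2": [None, "beta", None],
})

# Label used for rows without a subspecies: "<parent 1> and <parent 2>".
parents = toy["parent_subspecies_1"].astype(str) + " and " + toy["parent_subspecies_2"].astype(str)

# Keep the subspecies where present, otherwise use the parent pair.
toy["classification"] = toy["subspecies"].fillna(parents)
print(toy[["CAMID", "classification"]])
```

Row `A2` has no subspecies, so it is labeled `alpha and beta`; the other two rows keep their own subspecies.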

## 3. Visualize Distribution

Plot the distribution of images by classification (colored by hybrid status).

In [None]:
sns.histplot(df, y="classification", hue="hybrid_stat")
plt.show()

## 4. Create Sample Subset for Demo

Select a stratified 15% sample of the dataset for a quicker demo download.

In [None]:
from sklearn.model_selection import train_test_split

df_set, df_sample = train_test_split(df, test_size=0.15, stratify=df["classification"], random_state=614)
print(df_sample.info())
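
A stratified split should leave each class with roughly the same share in the sample as in the full table. The sketch below is one way to check that, assuming the labels are passed as plain iterables; `class_shares` and `max_share_gap` are hypothetical helper names, not part of scikit-learn.

```python
from collections import Counter

def class_shares(labels):
    """Map each label to its fraction of the total number of labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}

def max_share_gap(full_labels, sample_labels):
    """Largest absolute difference in class share between two label collections."""
    full = class_shares(full_labels)
    sample = class_shares(sample_labels)
    return max(abs(full[label] - sample.get(label, 0.0)) for label in full)

# Hypothetical labels; in the notebook, pass df["classification"] and
# df_sample["classification"] instead.
full = ["a"] * 80 + ["b"] * 20
sample = ["a"] * 12 + ["b"] * 3
print(max_share_gap(full, sample))
```

A gap near zero indicates the sample preserves the class balance. A class that is missing from the sample shows up as a gap equal to its full share.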

## 5. Download Sample Images and Validate

Use the functions from the `cautiousrobot` and `sumbuddy` modules to download and validate the sample images.

In [None]:
from cautiousrobot.__main__ import download_images
from cautiousrobot.buddy_check import BuddyCheck
from sumbuddy import get_checksums

IMG_DIR = "sample_images"
CHECKSUM_PATH = "sample_images_checksums.csv"

print("Downloading sample images...")
download_images(
    df_sample,
    img_dir=IMG_DIR,
    log_filepath="sample_img_logs.txt",
    error_log_filepath="sample_img_error_logs.txt",
    downsample_path="sample_images_downsized",
    downsample=256
)

print("Downloading complete. Calculating checksums...")
get_checksums(input_path=IMG_DIR, output_filepath=CHECKSUM_PATH)

checksum_df = pd.read_csv(CHECKSUM_PATH, low_memory=False)
expected_num_imgs = df_sample.shape[0]
print(f"{checksum_df.shape[0]} images were downloaded to {IMG_DIR} of the {expected_num_imgs} expected.")

buddy_check = BuddyCheck(buddy_id="filename", buddy_col="md5")
missing_imgs = buddy_check.validate_download(source_df=df_sample, checksum_df=checksum_df, source_validation_col="md5")
if missing_imgs is not None:
    missing_imgs.to_csv("samples_missing.csv", index=False)
    print("See samples_missing.csv for missing image info and check logs.")
else:
    print(f"Buddy check successful. All {expected_num_imgs} expected images accounted for.")

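# Record the folder holding the downsized images; assign() returns a new
# DataFrame, which avoids pandas' SettingWithCopyWarning on the split result.
df_sample = df_sample.assign(folder="sample_images_downsized")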
df_sample.to_csv('./sample_annotation.csv', index=False)
print("Sample annotation saved.")
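
For readers unfamiliar with checksum validation, the idea is to hash each downloaded file and compare the result with the MD5 value recorded in the source table. The sketch below illustrates that idea with the standard library only. It is an independent illustration, not how `sumbuddy` or `BuddyCheck` are implemented, and `md5_of_file` and `unmatched_checksums` are hypothetical helper names.

```python
import hashlib
from pathlib import Path

def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def unmatched_checksums(img_dir, expected_md5s):
    """Return the expected MD5 values that no file under `img_dir` produces."""
    found = {md5_of_file(p) for p in Path(img_dir).rglob("*") if p.is_file()}
    return sorted(set(expected_md5s) - found)

# In the notebook this could be called as:
# unmatched_checksums(IMG_DIR, df_sample["md5"])
```

An empty result means every expected checksum was matched by some file on disk.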

## 6. Download Model Weights

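Download the required model weights from Hugging Face using the `wget` Python package. The U-Net weights (`best_model.pth` in the repository) are saved locally as `best_unet_model.pth`.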

In [None]:
import wget

file_urls = [
    "https://huggingface.co/pn74870/2025-NSF-HDR-Hackaton-Butterfly-Hybrid-Detection/resolve/main/best_model.pth",
    "https://huggingface.co/pn74870/2025-NSF-HDR-Hackaton-Butterfly-Hybrid-Detection/resolve/main/cl_head_select_wings.pth",
    "https://huggingface.co/pn74870/2025-NSF-HDR-Hackaton-Butterfly-Hybrid-Detection/resolve/main/fine_tuned_bioclip_select_wings.pth"
]
file_names = [
    "best_unet_model.pth",
    "cl_head_select_wings.pth",
    "fine_tuned_bioclip_select_wings.pth"
]

for file_url, filename in zip(file_urls, file_names):
    print(f"Downloading {filename}...")
    wget.download(file_url, filename)
    print("\nDownload complete.")
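
The loop above downloads every file each time the cell runs and does not confirm that the files arrived. A minimal sketch of a check, assuming the weights are saved in the working directory under the names in `file_names`; `missing_or_empty` is a hypothetical helper name.

```python
from pathlib import Path

def missing_or_empty(paths):
    """Return the paths that do not exist or have zero size."""
    problems = []
    for path in map(Path, paths):
        if not path.is_file() or path.stat().st_size == 0:
            problems.append(str(path))
    return problems

# In the notebook this could be called as missing_or_empty(file_names),
# before the loop to decide what still needs downloading, or after it to
# confirm that all three weight files are present.
```

This only checks presence and size. It does not verify that a file is a valid PyTorch checkpoint.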

## 7. Run Pipeline Steps

Call the segmentation, augmentation, and fine-tuning scripts using shell commands. Adjust the paths as needed.

In [None]:
# 7.1 Wing Segmentation
!python ../remove_bg/select_wings_unet.py --model_path best_unet_model.pth --csv_path sample_annotation.csv --output_folder ../data/wing_images

In [None]:
# 7.2 Data Augmentation
!python ../augmentation/albumentation_augm.py --orig_img_folder ../data/wing_images --output_img_folder ../data/augmented_images --csv_path sample_annotation.csv --output_csv_path ../data/augmented_metadata.csv --min_images_per_class 1000 --aug_per_image_high_count 1

In [None]:
# 7.3 Fine-Tuning BioCLIP
!python ../training/finetune_aug_bg.py --data_file ../data/augmented_metadata.csv --img_dir ../data/augmented_images --clf_save_dir ../models/bioclip_classifier --num_epochs 5 --batch_size 4 --lr_backbone 1e-5 --lr_classifier 1e-3
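
A `!` command does not raise a Python exception when the script exits with an error, so later cells can run against incomplete outputs. One alternative is to launch each step with `subprocess.run(..., check=True)`. The sketch below assumes the same script paths and flags as the cells above; `build_command` and `run_step` are hypothetical helper names.

```python
import subprocess
import sys

def build_command(script, **options):
    """Build an argument list such as [python, script, "--key", "value", ...]."""
    command = [sys.executable, script]
    for key, value in options.items():
        command += [f"--{key}", str(value)]
    return command

def run_step(script, **options):
    """Run one pipeline script and raise CalledProcessError on a non-zero exit."""
    return subprocess.run(build_command(script, **options), check=True)

# Example: the segmentation call from step 7.1, built but not executed here.
print(build_command(
    "../remove_bg/select_wings_unet.py",
    model_path="best_unet_model.pth",
    csv_path="sample_annotation.csv",
    output_folder="../data/wing_images",
))
```

Calling `run_step` with the same arguments would execute the script and stop the notebook at the failing step.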

## Pipeline Completed

Check the output folders and saved models to verify that each step was executed correctly.
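
A quick way to do that check is to count the files each step produced. The sketch below assumes the output locations passed to the scripts in section 7; `count_files` is a hypothetical helper name.

```python
from pathlib import Path

def count_files(folder):
    """Count regular files under `folder`; return 0 if the folder is missing."""
    root = Path(folder)
    if not root.is_dir():
        return 0
    return sum(1 for p in root.rglob("*") if p.is_file())

# Output locations passed to the scripts in section 7.
for folder in ["../data/wing_images", "../data/augmented_images", "../models/bioclip_classifier"]:
    print(f"{folder}: {count_files(folder)} files")
```

A count of zero for any folder suggests the corresponding step failed or wrote its output elsewhere.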