## BIOSCAN-5M

The BIOSCAN-5M dataset is a multimodal collection of over 5.15 million arthropod specimens (98% insects), curated to advance biodiversity monitoring through machine learning. It expands the earlier BIOSCAN-1M dataset by including high-resolution images, DNA barcode sequences, taxonomic labels (phylum to species), geographical locations, specimen size data, and Barcode Index Numbers (BINs). Designed for both closed-world (known species) and open-world (novel species) scenarios, it supports tasks like taxonomic classification, clustering, and multimodal learning.

##### This notebook uses a randomly chosen subset of 30,000 samples, drawn across all splits of the Cropped 256 version of the dataset

Key Features:

* Images: 5.15M high-resolution microscope images (1024×768px) with cropped/resized versions.

* Genetic Data: Raw DNA barcode sequences (COI gene) and BIN clusters.

* Taxonomy: Labels for 7 taxonomic ranks (phylum, class, order, family, subfamily, genus, species).

* Geographical Metadata: Collection country, province/state, latitude/longitude.

* Size Metadata: Pixel count, area fraction, and scale factor for specimens.

Let's begin by installing some dependencies and [downloading the dataset from the Voxel51 org on Hugging Face](https://huggingface.co/datasets/Voxel51/BIOSCAN-30k).

In [None]:
!pip install fiftyone open-clip-torch umap-learn transformers

In [None]:
from fiftyone.utils.huggingface import load_from_hub

dataset = load_from_hub(
    "Voxel51/BIOSCAN-30k",
    name="bioscan30k",
    overwrite=True,
    persistent=True
    )

This dataset includes geolocation data. To visualize it in the FiftyOne app, you'll need a Mapbox key. You can sign up for one [here](https://account.mapbox.com/auth/signup/); it's free and you get 50,000 free map loads. Once you have a Mapbox account and API key, set the following environment variable: `export MAPBOX_TOKEN=xxxxxxx`. Alternatively, if you're running this in a Jupyter notebook, you can do the following:

In [None]:
import os

from getpass import getpass

os.environ["MAPBOX_TOKEN"] = getpass("Input your Mapbox token:")
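If you rerun the notebook often, it can be handy to prompt only when the variable isn't already set. Here's a minimal sketch of that idea. Note that `ensure_mapbox_token` is a hypothetical helper name of my own, not a FiftyOne function; the only thing it relies on is the `MAPBOX_TOKEN` variable described above.

```python
import os
from getpass import getpass


def ensure_mapbox_token(prompt=getpass):
    """Return the Mapbox token, prompting only if MAPBOX_TOKEN is not set.

    Hypothetical helper for this notebook, not part of FiftyOne.
    """
    token = os.environ.get("MAPBOX_TOKEN", "").strip()
    if not token:
        # Nothing set yet: ask once and store it for this session
        token = prompt("Input your Mapbox token:").strip()
        os.environ["MAPBOX_TOKEN"] = token
    return token
```

Calling `ensure_mapbox_token()` before launching the app either reuses the existing value or asks for it once.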

Now let’s install a plugin that allows us to create custom dashboards and glean more insight into our dataset:

In [None]:
!fiftyone plugins download \
    https://github.com/voxel51/fiftyone-plugins \
    --plugin-names @voxel51/dashboard

Once the dataset has been downloaded, you can do some initial exploration by launching the FiftyOne app.

There are two ways to use the app:

* As a cell in your notebook, by running `fo.launch_app(dataset)`

* In a separate browser window, by running `fiftyone app launch` in your terminal

Once the app is launched, you can explore the dataset by:

* Scrolling through the images for a visual vibe check of its contents

* Filtering based on the labels (the various taxonomic classifications, geographic information, or size measurements)

* Opening the map panel and exploring based on geographic location

* Creating a dashboard of plots for the various information fields of the dataset


In [None]:
import fiftyone as fo

fo.launch_app(dataset)

#### 🐛 Warning: You're about to see some creepy crawly insects.

Below is an example of using the map panel:

![Explore BIOSCAN](assets/bioscan-explore.gif)


#### You can also create a custom dashboard like so:

![BIOSCAN dashboard](assets/bioscan-5m-dashboard.gif)

You can call the dataset as shown below to see all the fields available:

In [None]:
dataset

### Deeper analysis with FiftyOne

You can take your analysis to a deeper level by using embedding-based workflows.

The authors of the paper mention that they trained a CLIP-like model. This model, built using the CLIBD (Contrastive Learning for Image-Barcode-Description) framework, learns a shared embedding space across the three modalities (images, DNA barcodes, and text descriptions), enabling cross-modal queries and improving performance in taxonomic classification tasks. However, I was unable to find the model weights on Hugging Face or through the project's GitHub repo.

Instead, I will make use of some other models which were mentioned in the paper.
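To make the idea of a shared embedding space concrete, here's a toy sketch of a cross-modal query: given a DNA embedding, find the closest image embedding by cosine similarity. Everything in it is made up for illustration (the 3-D vectors, the names `image_a` and `image_b`, and the helper functions); it is not CLIBD's code or API.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest(query, candidates):
    """Return the key of the candidate embedding most similar to the query."""
    return max(candidates, key=lambda k: cosine_similarity(query, candidates[k]))


# Made-up vectors standing in for image embeddings in a shared space
image_embeddings = {
    "image_a": [0.9, 0.1, 0.0],
    "image_b": [0.0, 0.2, 0.9],
}
dna_query = [0.8, 0.3, 0.1]  # hypothetical DNA barcode embedding

best_match = nearest(dna_query, image_embeddings)
print(best_match)  # image_a
```

This only works when both modalities are embedded by a jointly trained model. The independently trained models used in the rest of this notebook produce embeddings in separate spaces, so they can't be compared against each other this way.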

Note: I'm not an expert in biology, genomics, or insects. I'm just a hacker. I apologize in advance to the community of practitioners working in this space if I'm not using the models as intended. My goal is to show you what's possible when you use the open source FiftyOne library.

Let's start by computing embeddings for the images using [BioCLIP](https://github.com/Imageomics/bioclip/tree/main).

BioCLIP extends the CLIP framework to create a vision foundation model specialized for biological imagery, focusing on taxonomic relationships across the tree of life. Trained on TreeOfLife-10M, a novel dataset of 10M biological images spanning 454K taxa, BioCLIP learns hierarchical representations aligned with taxonomic ranks (kingdom to species). Unlike standard CLIP, it treats species as interconnected nodes in a biological hierarchy rather than isolated classes.
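One way to picture the hierarchy: as I understand the BioCLIP paper, the text side of training can use a "taxonomic name" that flattens the ranks from kingdom to species into a single string, so closely related species share a long common prefix. Below is a minimal sketch of that flattening. The `taxonomic_name` helper and the honey bee record are my own illustration, not BioCLIP's API and not a sample from this dataset.

```python
# Rank order for the flattened name, from broadest to most specific
RANKS = ["kingdom", "phylum", "class", "order", "family", "genus", "species"]


def taxonomic_name(taxonomy):
    """Join the ranks that are present, skipping any that are missing."""
    return " ".join(taxonomy[rank] for rank in RANKS if taxonomy.get(rank))


# Illustrative record for the western honey bee
honey_bee = {
    "kingdom": "Animalia",
    "phylum": "Arthropoda",
    "class": "Insecta",
    "order": "Hymenoptera",
    "family": "Apidae",
    "genus": "Apis",
    "species": "mellifera",
}

print(taxonomic_name(honey_bee))
# Animalia Arthropoda Insecta Hymenoptera Apidae Apis mellifera
```

Two bees in the same genus would differ only in the last word of this string, which is the intuition behind treating species as nodes in a hierarchy rather than unrelated labels.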

BioCLIP is part of the [OpenCLIP](https://github.com/mlfoundations/open_clip) ecosystem, so you can use FiftyOne's integration with it as follows:

In [2]:
import fiftyone.zoo as foz

bio_clip_model = foz.load_zoo_model(
    "open-clip-torch",
    pretrained="",
    clip_model="hf-hub:imageomics/bioclip"
)

Once the model is downloaded, you can compute embeddings as follows:


In [None]:
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"  # use GPU if available

dataset.compute_embeddings(
    model=bio_clip_model,
    embeddings_field="bio_clip_embeddings",
    batch_size=128,  # use whatever batch size your GPU can handle
    device=device,
)

We'll visualize these embeddings shortly, but first let's compute embeddings for the DNA sequences using [BarcodeBERT](https://github.com/bioscan-ml/BarcodeBERT).



In [None]:
import torch

from transformers import AutoTokenizer, AutoModel, BertConfig

# First load the configuration
barcode_bert_config = BertConfig.from_pretrained(
    "bioscan-ml/BarcodeBERT", 
    trust_remote_code=True
    )

# Load the tokenizer
barcode_bert_tokenizer = AutoTokenizer.from_pretrained(
    "bioscan-ml/BarcodeBERT", 
    trust_remote_code=True
    )

# Load the model
barcode_bert_model = AutoModel.from_pretrained(
    "bioscan-ml/BarcodeBERT", 
    device_map=device,
    trust_remote_code=True,
    config=barcode_bert_config
    )

In [13]:
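# Embed each DNA barcode and store the result on its sample
with torch.no_grad():  # inference only, so no gradients are needed
    # autosave=True writes sample edits back to the dataset in batches
    for sample in dataset.iter_samples(autosave=True, progress=True):
        dna_sequence = sample["dna_barcode"]["value"]
        inputs = barcode_bert_tokenizer(dna_sequence, return_tensors="pt")["input_ids"]
        inputs = inputs.to(device)
        # Hidden states from the last layer of the model
        outputs = barcode_bert_model(inputs.unsqueeze(0))["hidden_states"][-1]
        # Mean-pool over the token axis to get one vector per barcode
        embs = outputs.mean(1).squeeze().cpu().numpy()
        sample["barcode_bert_embeddings"] = embs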
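The `outputs.mean(1)` call in the loop above is global average pooling: it averages the per-token vectors from the last hidden layer into one fixed-length vector per barcode. Here's a plain-Python sketch of the same operation on made-up numbers. No model is involved, and `mean_pool` is just an illustrative name.

```python
def mean_pool(hidden_states):
    """Average a list of equal-length token vectors into a single vector."""
    num_tokens = len(hidden_states)
    # zip(*...) groups the values of each hidden dimension across tokens
    return [sum(column) / num_tokens for column in zip(*hidden_states)]


# Toy "last hidden state" for one sequence: 3 tokens, hidden size 2
toy_hidden_states = [
    [1.0, 2.0],
    [3.0, 4.0],
    [5.0, 6.0],
]

print(mean_pool(toy_hidden_states))  # [3.0, 4.0]
```

Because the average is taken over tokens, barcodes of different lengths all end up with embeddings of the same size, which is what the visualization step needs.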

Now we can visualize the embeddings!

In [None]:
import fiftyone.brain as fob

embedding_fields = ["bio_clip_embeddings", "barcode_bert_embeddings"]

for field in embedding_fields:
    # e.g. "bio_clip_embeddings" -> brain key "bio_clip_viz"
    field_name = field.split("_embeddings")[0]
    results = fob.compute_visualization(
        dataset,
        embeddings=field,
        method="umap",
        brain_key=f"{field_name}_viz",
        num_dims=2,
    )
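Each result is stored under a brain key derived from the embeddings field name. This small sketch just reproduces that string logic, so you know which keys to look for when you open the Embeddings panel in the app:

```python
embedding_fields = ["bio_clip_embeddings", "barcode_bert_embeddings"]

# Same naming rule as the loop above: strip "_embeddings", then append "_viz"
brain_keys = [f"{field.split('_embeddings')[0]}_viz" for field in embedding_fields]

print(brain_keys)  # ['bio_clip_viz', 'barcode_bert_viz']
```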

In [None]:
import fiftyone as fo
fo.launch_app(dataset)