This is a quick afternoon data exploration project by me, Johno. All code by GPT 5.4. Please don't hammer iNaturalist's API too hard if you choose to replicate this. Below is an (AI-generated) overview of the project and how to get it running. Feel free to target a different taxon rather than repeating this for tulips; ask an agent of your choice to make the change.
This repo:
- fetches tulip observation metadata and image URLs from iNaturalist
- downloads image files locally if you want them
- embeds image URLs with a CLIP model on Replicate
- lets you search the embedding space in a small review app
- builds a filtered subset from a CLIP query and threshold
- projects that subset with UMAP and shows it in a local viewer
The current working filtered set is based on:
- query: closeup photo of a tulip flower, filling the frame
- threshold: 0.22
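
To make the query-plus-threshold idea concrete, here is a minimal sketch of how such a filter could work. It assumes the score is the cosine similarity between each image embedding and the text-query embedding, which is the usual way to compare CLIP vectors but is not confirmed to be exactly what `scripts/build_filtered_dataset.py` does. The function names and the toy 3-dimensional vectors are hypothetical stand-ins for real CLIP embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def filter_by_query(image_embeddings, query_embedding, threshold=0.22):
    """Return (image_id, score) pairs at or above the threshold, best first."""
    scored = [
        (image_id, cosine_similarity(vector, query_embedding))
        for image_id, vector in image_embeddings.items()
    ]
    kept = [(image_id, score) for image_id, score in scored if score >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Toy vectors standing in for real CLIP embeddings (assumption for illustration).
toy_images = {
    "closeup": [0.9, 0.1, 0.0],
    "field": [0.3, 0.9, 0.1],
    "unrelated": [0.0, 0.1, 0.9],
}
toy_query = [1.0, 0.0, 0.0]

kept = filter_by_query(toy_images, toy_query, threshold=0.22)
print([image_id for image_id, _ in kept])
```

Raising the threshold keeps fewer, more on-topic images; lowering it lets in looser matches.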
- Python 3.12-ish
- a Replicate API key for the CLIP embedding steps
- gh, only if you want to publish the repo somewhere
Put your Replicate key in a .env file in the repo root like this:

    REPLICATE_API_TOKEN=...

Install dependencies:

    uv pip install -r requirements.txt

Fetch observation metadata from iNaturalist:

    python3 scripts/fetch_inat_metadata.py \
      --limit 10000 \
      --output data/metadata/tulips_medium_10k.csv

Download the images locally (optional):

    python3 scripts/download_inat_images.py \
      --input data/metadata/tulips_medium_10k.csv \
      --output-dir data/images/medium_10k \
      --limit 10000

Build the CLIP embeddings. This step uses Replicate and requires REPLICATE_API_TOKEN.

    python3 scripts/embed_inat_clip.py --dataset main

Run the review app:

    python3 review_app/app.py

Open http://127.0.0.1:5000.

Build the filtered subset and the projection bundle, then run the projection viewer:

    python3 scripts/build_filtered_dataset.py
    python3 scripts/build_projection_bundle.py
    python3 projection_app/app.py

Open http://127.0.0.1:5001.

- scripts/fetch_inat_metadata.py: query iNaturalist and write a CSV manifest
- scripts/download_inat_images.py: download images locally
- scripts/embed_inat_clip.py: build CLIP embeddings through Replicate
- scripts/build_filtered_dataset.py: create a saved filtered subset from a CLIP query
- scripts/build_projection_bundle.py: compute KMeans clusters, UMAP coordinates, and center-crop average colors
- review_app/: text-query CLIP review/search UI
- projection_app/: UMAP projection viewer
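
As an illustration of the "center-crop average colors" step, the sketch below averages the RGB values over the middle portion of an image. It is an assumption about what that step might look like, not the code in `scripts/build_projection_bundle.py`: the function name, the `crop_fraction` parameter and its default are hypothetical, and a real implementation would more likely use an image library than nested lists.

```python
def center_crop_average_color(pixels, crop_fraction=0.5):
    """Average RGB over the central crop of an image.

    pixels: list of rows, each row a list of (r, g, b) tuples.
    crop_fraction: crop size relative to each dimension (hypothetical default).
    """
    height = len(pixels)
    width = len(pixels[0])
    crop_h = max(1, round(height * crop_fraction))
    crop_w = max(1, round(width * crop_fraction))
    top = (height - crop_h) // 2
    left = (width - crop_w) // 2

    totals = [0, 0, 0]
    for row in pixels[top:top + crop_h]:
        for r, g, b in row[left:left + crop_w]:
            totals[0] += r
            totals[1] += g
            totals[2] += b

    count = crop_h * crop_w
    return tuple(total / count for total in totals)


# Toy 4x4 image: green border (background) around a red 2x2 center (flower).
GREEN = (0, 255, 0)
RED = (255, 0, 0)
toy_image = [
    [GREEN, GREEN, GREEN, GREEN],
    [GREEN, RED, RED, GREEN],
    [GREEN, RED, RED, GREEN],
    [GREEN, GREEN, GREEN, GREEN],
]

print(center_crop_average_color(toy_image))
```

Cropping to the center before averaging is a cheap way to bias the color toward the subject rather than the background, which suits a set already filtered for frame-filling closeups.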
- Generated data, downloaded images, embeddings, filtered subsets, projection bundles, and .env are all ignored by git.
- If you want to adapt this to a different taxon, the main place to start is the iNaturalist fetch step and the CLIP filter query.
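
For anyone adapting the fetch step, here is a rough sketch of a throttled, taxon-parameterized request loop against the public iNaturalist v1 observations endpoint. It is not the code in `scripts/fetch_inat_metadata.py`. The helper names, the one-second pause, and the use of `taxon_name` (rather than a numeric taxon id) are assumptions for illustration. The HTTP call is injected as `get_json` so the example runs offline; in practice you would pass a function that performs a real GET request and returns the parsed JSON.

```python
import time

INAT_OBSERVATIONS_URL = "https://api.inaturalist.org/v1/observations"


def build_params(taxon_name, page, per_page=200):
    """Query parameters for one page of observations that have photos."""
    return {
        "taxon_name": taxon_name,
        "photos": "true",
        "per_page": per_page,
        "page": page,
    }


def photo_url(photo, size="medium"):
    """Photo records carry a 'square' thumbnail URL; swap in another size."""
    return photo["url"].replace("/square.", f"/{size}.")


def fetch_pages(taxon_name, pages, get_json, delay_seconds=1.0, sleep=time.sleep):
    """Yield (observation_id, photo_url) pairs, pausing between requests."""
    for page in range(1, pages + 1):
        if page > 1:
            sleep(delay_seconds)  # be gentle with the API
        payload = get_json(INAT_OBSERVATIONS_URL, build_params(taxon_name, page))
        for observation in payload.get("results", []):
            for photo in observation.get("photos", []):
                yield observation["id"], photo_url(photo)


# Offline demo: a stub stands in for the real HTTP call (hypothetical data).
def stub_get_json(url, params):
    return {
        "results": [
            {"id": params["page"], "photos": [{"url": "https://example.org/photos/1/square.jpg"}]}
        ]
    }


demo_pairs = list(fetch_pages("Tulipa", 2, stub_get_json, sleep=lambda seconds: None))
print(demo_pairs)
```

Switching taxon then comes down to changing the `taxon_name` argument here and rewording the CLIP filter query to describe the new subject.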