FlyBase local sync/query

Use FlyBase bulk files for agent workloads. Live API: helper only.

Why

https://api.flybase.org/api/v1.0/ exists.
some endpoints return useful JSON now, eg domain/FBgn0001250, sequence/id/FBgn0001250.
some plausible endpoints return empty body today.
bulk bucket + release files: better for repeatable agent queries.

Current surfaces checked

release bucket: https://s3ftp.flybase.org/releases/current/
precomputed files: https://s3ftp.flybase.org/releases/current/precomputed_files/
Postgres dump: https://s3ftp.flybase.org/releases/current/psql/FB2026_01.sql.gz
API root: https://api.flybase.org/api/v1.0/
batch download: https://flybase.org/batchdownload

Layout

src/flybase_cli/: package code
tests/: stdlib unittest
flybase_cli.py: thin repo-root shim
pyproject.toml: package metadata / console entrypoint

CLI

python3 flybase_cli.py presets

python3 flybase_cli.py sync gene-core

python3 flybase_cli.py sync gene-core --release FB2026_01

python3 flybase_cli.py sync gene-knowledge --release FB2026_01

python3 flybase_cli.py full-sync --release FB2026_01

python3 flybase_cli.py full-sync \
  --release FB2026_01 \
  --include 'best_gene_summary|entity_publication'

python3 flybase_cli.py sync-incremental \
  gene-knowledge \
  --from-release FB2025_06 \
  --release FB2026_01

python3 flybase_cli.py release-diff \
  --preset gene-knowledge \
  --from-release FB2025_06 \
  --to-release FB2026_01

python3 flybase_cli.py genomes --release FB2026_01

python3 flybase_cli.py sync-genome \
  --release FB2026_01 \
  --genome dmel_r6.67 \
  --section fasta \
  --asset mirna

python3 flybase_cli.py genome-presets

python3 flybase_cli.py sync-genome \
  --release FB2026_01 \
  --genome dmel_r6.67 \
  --preset mirna-fasta

PYTHONPATH=src python3 -m flybase_cli sync gene-expression

python3 flybase_cli.py manifest \
  --url https://s3ftp.flybase.org/genomes/Drosophila_melanogaster/dmel_r6.67_FB2026_01/fasta/ \
  --include 'miRNA'

python3 flybase_cli.py sync-url \
  --url https://s3ftp.flybase.org/genomes/Drosophila_melanogaster/dmel_r6.67_FB2026_01/fasta/ \
  --include 'miRNA'

python3 flybase_cli.py ingest \
  data/flybase/precomputed_files/genes/best_gene_summary_fb_2026_01.tsv.gz \
  data/flybase/precomputed_files/genes/fbgn_fbtr_fbpp_fb_2026_01.tsv.gz \
  data/flybase/precomputed_files/genes/fbgn_annotation_ID_fb_2026_01.tsv.gz

python3 flybase_cli.py tables --columns

python3 flybase_cli.py describe --sample-values 2
python3 flybase_cli.py schema-export --sample-values 1
python3 flybase_cli.py query-plan --sample-values 1 --limit 5
python3 flybase_cli.py query-run --template-name gene-summary-by-fbgn --param fbgn_id=FBgn0002121

python3 flybase_cli.py fts-build

python3 flybase_cli.py search 'memory formation'

python3 flybase_cli.py pg-load --release FB2026_01

python3 flybase_cli.py sql \
  "select * from fb_best_gene_summary_fb_2026_01 limit 5"

python3 flybase_cli.py sql \
  "select s.fbgn_id, s.gene_symbol, a.annotation_id, p.flybase_fbtr, p.flybase_fbpp \
   from fb_best_gene_summary_fb_2026_01 s \
   join fb_fbgn_annotation_id_fb_2026_01 a on a.primary_fbgn = s.fbgn_id \
   left join fb_fbgn_fbtr_fbpp_fb_2026_01 p on p.flybase_fbgn = s.fbgn_id \
   limit 5"

python3 flybase_cli.py api domain/FBgn0001250

Sync presets

gene-core: summaries + FBgn/FBtr/FBpp + annotation IDs + SO annotations
gene-expression: curated/high-throughput/scRNA expression slices
references: publication/link tables
gene-knowledge: core gene facts + representative publications + orthology tables
orthology: ortholog, paralog, and disease-association tables
interactions: gene- and allele-level interaction tables

Full sync

full-sync crawls an entire release prefix, default precomputed_files/
default behavior: download only files the current loaders can ingest into SQLite
use --all-files if you want non-ingestable release artifacts too
use --include / --exclude to stage a narrower smoke or partial warehouse
default manifest path: data/flybase/manifests/<release>/full-sync.json

Discovery

genomes --release FB2026_01 lists genome builds linked from that FlyBase release
sync-url turns a crawlable FlyBase directory URL into a one-step local sync
sync-genome resolves a release/build pair into the right genome-section URL automatically
genome-presets lists reusable genome asset sync recipes

Genome sync

sections: fasta, gff, gtf, dna, chado-xml
asset shortcuts include mirna, transcript, translation, gene, chromosome, cds, ncrna, gff, gtf
presets include mirna-fasta, transcript-fasta, translation-fasta, gene-fasta, chromosome-fasta, ncrna-fasta, gff-all, gtf-all
use --include/--exclude for narrower file selection on top of the asset preset

Ingest formats

delimited: tsv, csv, gzipped variants
sequence: fasta, fa, fna, faa, gzipped variants
annotation: gff, gff3, gtf, gzipped variants
JSON: json, json.gz

JSON ingest

top-level scalar JSON fields become queryable SQLite columns
one nested dict level is flattened, eg gene.symbol -> gene_symbol
repeated top-level lists become child tables, eg symbolSynonyms -> <table>_symbolsynonyms
repeated lists nested inside child dict rows become descendant tables, eg genomeLocations[].exons[] -> <table>_genomelocations_exons
full source record remains in payload_json

Example:

python3 flybase_cli.py sql \
  "select record_id, symbol, gene_geneId from fb_ncrna_genes_fb_2026_01 limit 5"

python3 flybase_cli.py sql \
  "select parent_record_id, ordinal, value \
   from fb_ncrna_genes_fb_2026_01_symbolsynonyms \
   limit 5"

python3 flybase_cli.py sql \
  "select parent_record_id, parent_ordinal, ordinal, startPosition, endPosition \
   from fb_ncrna_genes_fb_2026_01_genomelocations_exons \
   limit 5"

Search

fts-build creates a local SQLite FTS5 index from ingested tables
search queries that index without calling the live FlyBase API
record ids prefer stable FlyBase-like columns such as fbgn_id, primary_fbgn, flybase_fbtr

Metadata

describe summarizes ingested tables with row counts, source paths, semantic tags, columns, and representative non-empty values
schema-export writes the same metadata to a deterministic JSON artifact beside the SQLite DB, eg FB2026_01.schema.json
schema-export also includes inferred relationships for nested child tables and common FlyBase ID joins
schema-export also emits semantic_summary for table/entity tag coverage
schema-export also emits ready-to-run query_templates
query-plan prints starter SQL without the larger schema payload
query-plan now includes named biological templates such as gene-summary-by-fbgn, transcript-protein-links, publications-for-gene, and coordinate lookups when matching tables exist
query-run selects one template and executes it with parameter values
useful first step before writing ad hoc SQL or building agent query plans

Example:

python3 flybase_cli.py schema-export \
  --db data/flybase/FB2026_01.sqlite \
  --sample-values 1

python3 flybase_cli.py query-plan \
  --db data/flybase/FB2026_01.sqlite \
  --sample-values 1 \
  --limit 5

python3 flybase_cli.py query-run \
  --db data/flybase/FB2026_01.sqlite \
  --template-name gene-summary-by-fbgn \
  --param fbgn_id=FBgn0002121

Notes

nested JSON child tables keep lineage columns like parent_record_id, parent_ordinal, ordinal.
many FlyBase files start with ## metadata lines; loader skips those.
sync writes a preset manifest under data/flybase/manifests/<release>/.
full-sync is the broadest offline path for release bulk data without going through the full Postgres dump.
sync --release FB2026_01 defaults to data/flybase/FB2026_01.sqlite to avoid cross-release mixing.
sync-incremental uses stable manifest keys so release-renamed files still land in updated instead of noisy add/remove pairs.
release-diff compares releases either by raw prefix or by curated multi-prefix preset.
manifest --url lets you crawl non-releases/ FlyBase directories such as genome FASTA/GFF trees.
sync-url is the shortest path for genome assets once you know the directory URL.
sync-genome is the shortest path when you know the FlyBase release + genome build label.
sync-genome --preset ... is the preferred path for common genome asset pulls.
some FlyBase .gff.gz assets are tar-wrapped gzip archives; loader handles that transparently.
sql and query-run shape results as record-oriented JSON with summary metadata for agent chaining.
pg-load stages the full Postgres import script for releases/<release>/psql/<release>.sql.gz.
pg-load --execute runs the staged script when createdb and psql are installed locally.
SQLite keeps setup minimal; switch to DuckDB/Postgres if you want bigger joins/faster scans.
if you only need a few IDs, FlyBase Batch Download may be simpler than syncing files.
use --no-header for files whose first non-comment row is data, not column names.

Tests

python3 -m unittest discover -s tests

Name		Name	Last commit message	Last commit date
Latest commit History 34 Commits
src/flybase_cli		src/flybase_cli
tests		tests
.gitignore		.gitignore
AGENTS.md		AGENTS.md
LICENSE		LICENSE
README.md		README.md
flybase_cli.py		flybase_cli.py
pyproject.toml		pyproject.toml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

FlyBase local sync/query

Why

Current surfaces checked

Layout

CLI

Sync presets

Full sync

Discovery

Genome sync

Ingest formats

JSON ingest

Search

Metadata

Notes

Tests

About

Uh oh!

Releases 1

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

FlyBase local sync/query

Why

Current surfaces checked

Layout

CLI

Sync presets

Full sync

Discovery

Genome sync

Ingest formats

JSON ingest

Search

Metadata

Notes

Tests

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases 1

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages