Observa

A Dockerized platform for hosting and exploring iNaturalist Open Data from the AWS Open Registry. Includes a PostGIS-enabled PostgreSQL database, automated daily imports, Grafana dashboards, and a REST API.

Prerequisites

Docker and Docker Compose v2+
~50 GB of free disk space (for downloaded CSVs and database storage)
16 GB+ RAM recommended (PostgreSQL is tuned for analytical workloads)

Quick Start

Clone and configure

git clone https://github.com/jmcmeen/observa.git && cd observa
cp .env.example .env

Edit .env to set your passwords:

POSTGRES_PASSWORD=your_secure_password
API_USER_PASSWORD=your_api_password
GF_SECURITY_ADMIN_PASSWORD=your_grafana_password

Security: The importer will refuse to start if any password is still set to changeme. Choose strong, unique passwords for each service before proceeding.

Start the services
```
docker compose up -d
```
This starts PostgreSQL (with PostGIS), Grafana, the importer, and PostgREST.
Run the initial import

The first import must be triggered manually. This downloads ~10 GB of compressed data from S3 and loads it into the database. It will take a while depending on your connection and hardware.
```
docker compose run --rm --entrypoint /import.sh importer
```
Open the dashboard

Navigate to http://localhost:3000 and log in with the credentials from your .env file. Eight dashboards are pre-provisioned:
- iNaturalist Overview — total observations, taxa, maps, quality grade distribution, monthly trends, and top species
- Regional Explorer — parameterized dashboard for any region and taxon (set bounding box, taxon ID, and quality grade via dropdowns)
- Import Health — import duration trends, row counts, and status history
- iNaturalist HotSpots — global observation density heatmap with multi-resolution grid variable
- Anomaly Detection — anomaly score distribution, trends, geographic map, and top anomalous species
- Observer Activity — observer engagement, cumulative growth, activity distribution, and monthly active counts (filterable by observer ID)
- BioBlitz Event — time-bounded citizen science events with species accumulation and participant leaderboards
- Database Health — cache hit ratio, index hit ratio, database size, table sizes, index usage, bloat, slow queries, and active connections

Configuration

All settings are in .env (see .env.example for defaults):

Variable	Description	Default
`POSTGRES_USER`	Database user	`observa`
`POSTGRES_PASSWORD`	Database password	(required)
`POSTGRES_DB`	Database name	`inaturalist`
`API_USER_PASSWORD`	PostgREST API database user password	(required)
`GF_SECURITY_ADMIN_USER`	Grafana admin username	`admin`
`GF_SECURITY_ADMIN_PASSWORD`	Grafana admin password	(required)
`IMPORT_CRON`	Import schedule (cron expression)	`0 3 * * *` (daily 3 AM)
`ROW_DROP_THRESHOLD`	Max allowed row count drop % before aborting import	`50`
`BACKUP_CRON`	Backup schedule (cron expression)	`0 2 * * 0` (weekly Sunday 2 AM)
`BACKUP_RETENTION`	Number of database backups to keep	`4`

PostgreSQL tuning

db/postgresql.conf is tuned for a host with ~16 GB RAM. The most impactful settings are shared_buffers (4 GB) and work_mem (256 MB). On hosts with less RAM, create a docker-compose.override.yml to lower these values:

services:
  postgres:
    command: >
      postgres
        -c config_file=/etc/postgresql/postgresql.conf
        -c shared_buffers=1GB
        -c work_mem=64MB

See comments in db/postgresql.conf for details on worst-case memory usage under parallel queries.

Services

Service	Port	Description
`postgres`	5432	PostGIS 16 database with iNaturalist data
`grafana`	3000	Dashboard UI with alerting
`importer`	—	Cron-based ETL container (no exposed ports)
`nginx`	3001	Reverse proxy with rate limiting and access logging
`postgrest`	—	Read-only REST API over the database (internal, behind nginx)
`backup`	—	Scheduled database backup (weekly by default)

Daily Imports

The importer runs on the schedule defined by IMPORT_CRON. Each run uses a zero-downtime swap-table strategy:

Downloads the latest CSVs from s3://inaturalist-open-data
Validates file integrity and CSV headers
Loads data into staging tables with indexes
Atomically swaps staging tables with live tables
Refreshes materialized views for fast dashboard queries
Logs results to the import_log table

Safety features:

File lock with PID-based stale lock recovery prevents concurrent imports
ETag caching skips the full import cycle when S3 data hasn't changed
Row count validation aborts if data drops >50% from previous import
Failed imports are automatically logged with error details
Staging tables are cleaned up on failure (live data preserved)

To trigger a manual import:

docker compose run --rm --entrypoint /import.sh importer

To check import logs:

docker compose exec postgres psql -U observa -d inaturalist \
  -c "SELECT * FROM import_log ORDER BY id DESC LIMIT 5;"

Health Endpoint

A v_health view is exposed via PostgREST for external uptime monitors:

curl "http://localhost:3001/v_health"

Returns the status of the last finished import (completed or failed), hours since last import, row counts, and any error message. In-progress imports are excluded. This can be polled by monitoring tools without requiring Grafana access.

REST API

PostgREST provides a read-only REST API at http://localhost:3001. Examples:

# Get 10 observations
curl "http://localhost:3001/observations?limit=10"

# Search taxa by name
curl "http://localhost:3001/taxa?name=ilike.*robin*"

# Filter observations by quality grade
curl "http://localhost:3001/observations?quality_grade=eq.research&limit=50"

# Get observation counts from materialized views
curl "http://localhost:3001/mv_quality_grade_counts"

Spatial queries

Find observations within a radius of a geographic point using the observations_near RPC function:

# Observations within 50 km of Great Smoky Mountains (lat 35.5, lon -83.2)
curl "http://localhost:3001/rpc/observations_near?lat=35.5&lon=-83.2&radius_km=50"

# Nearest 20 observations to a point (default radius: 10 km)
curl "http://localhost:3001/rpc/observations_near?lat=40.7&lon=-74.0&lim=20"

Returns observation_uuid, taxon_id, quality_grade, observed_on, and distance_m (distance in meters from the query point), ordered by distance.

Taxa search

Search taxa by name with fuzzy matching (typo-tolerant):

# Search for "robin" (substring + similarity match)
curl "http://localhost:3001/rpc/taxa_search?query=robin"

# Fuzzy search for a misspelled name
curl "http://localhost:3001/rpc/taxa_search?query=turdus%20migratorus&lim=5"

Returns taxon_id, name, rank, rank_level, active, and similarity score, ordered by best match.

Taxonomy tree

Navigate the taxonomic hierarchy using recursive ancestry queries:

# Get the full lineage of a taxon (e.g., American Robin, taxon_id=12727)
curl "http://localhost:3001/rpc/taxon_lineage?target_taxon_id=12727"

# List direct children of a taxon with observation counts
curl "http://localhost:3001/rpc/taxon_children?parent_id=3&lim=20"

taxon_lineage walks from any taxon up to its kingdom, returning each ancestor's name, rank, and depth. taxon_children lists direct children sorted by observation count.

See the PostgREST documentation for full query syntax.

CSV export

Add Accept: text/csv to any API request to get CSV output instead of JSON:

curl -H "Accept: text/csv" \
  "http://localhost:3001/observations?quality_grade=eq.research&limit=1000" > research.csv

See docs/csv-export-guide.md for filtering, column selection, pagination, joins, and more examples.

API security notice

The API is rate-limited (10 requests/second per IP, burst of 20) via an nginx reverse proxy. The /v_health endpoint is exempt from rate limiting for uptime monitors. PostgREST is not directly exposed — all traffic routes through nginx.

The API is unauthenticated by default — anyone with network access to port 3001 can query the full dataset. This is acceptable on trusted networks or when bound to localhost (the default), but you should add authentication before exposing it to the internet.

To enable JWT authentication:

Generate a secret (at least 32 characters):
```
openssl rand -base64 32
```
Add it to your .env:
```
PGRST_JWT_SECRET=your_generated_secret
```
Add the variable to the postgrest service in docker-compose.yml:
```
environment:
  PGRST_JWT_SECRET: ${PGRST_JWT_SECRET}
```
Requests must now include a valid JWT in the Authorization: Bearer <token> header. See the PostgREST authentication docs for details on generating tokens and configuring roles.

Database Backup

Backups run automatically on the schedule defined by BACKUP_CRON (default: weekly Sunday 2 AM). To trigger an immediate backup:

docker compose exec backup /backup.sh

Backups are stored in the backup_data volume in PostgreSQL custom format. To restore, copy the dump from the volume and pipe it into pg_restore:

# List available backups
docker compose exec backup ls /backups

# Restore a specific backup
docker compose exec backup \
  pg_restore -h postgres -U observa -d inaturalist --clean /backups/observa_20260315_030000.dump

Data Export

Observa supports exporting filtered CSV files for use in R, Python, Excel, QGIS, and other tools. The quickest method is the REST API:

# Research-grade observations for a specific taxon
curl -H "Accept: text/csv" \
  "http://localhost:3001/observations?taxon_id=eq.3726&quality_grade=eq.research" > research.csv

# Top 100 most-observed species
curl -H "Accept: text/csv" \
  "http://localhost:3001/mv_top_taxa?rank=eq.species&order=observation_count.desc&limit=100" > top_species.csv

For joins, complex queries, and full-table dumps, use psql \COPY or the built-in export.sh script.

See docs/csv-export-guide.md for the complete guide — filtering, column selection, pagination, joins, and examples for common workflows.

Connecting Directly to the Database

docker compose exec postgres psql -U observa -d inaturalist

Or from your host (with a PostgreSQL client installed):

psql -h localhost -U observa -d inaturalist

Grafana Alerts

The following alerts are pre-configured:

Import is stale — warns if no successful import in 36+ hours
Last import failed — critical alert on import failure
Observation count drop — warns if count drops >10% between imports
Import hung — critical alert if an import has been running for 4+ hours

Configure notification channels in Grafana under Alerting > Contact Points.

Development

Running the importer against an existing database

If the database is already populated, you can re-run the importer without downloading from S3. The importer uses ETag caching, so if cached files exist and haven't changed upstream, the download phase is skipped automatically:

docker compose exec importer /bin/sh /import.sh

To force a fresh download (e.g., after clearing the cache volume):

docker compose down importer
docker volume rm observa_import_cache
docker compose up -d importer
docker compose exec importer /bin/sh /import.sh

Local testing without S3

A one-command test harness seeds 100K synthetic observations, refreshes materialized views, and runs an API smoke test against all endpoints:

docker compose up -d
./scripts/test-local.sh

This generates a realistic dataset with taxonomy hierarchy (10 orders, 9 families, 80 species), 200 observers, geographically clustered observations across 5 US regions, 60K photos, and seasonal date spread — enough to exercise all dashboards, spatial queries, taxa search, and data quality alerts.

The individual scripts can also be run separately:

# Seed data only (idempotent — safe to re-run)
docker compose exec -T postgres psql -U observa -d inaturalist -f - < scripts/seed-test-data.sql
docker compose exec importer psql -f /scripts/create-materialized-views.sql

# API smoke tests only
./scripts/test-api.sh

Testing alert SQL

Grafana alert rules query PostgreSQL directly. To test alert queries from the command line:

# Import stale alert — checks hours since last successful import
docker compose exec postgres psql -U observa -d inaturalist -c "
  SELECT EXTRACT(EPOCH FROM now() - max(finished_at)) / 3600 AS hours_since_import
  FROM import_log WHERE status = 'completed';
"

# Observation count drop alert
docker compose exec postgres psql -U observa -d inaturalist -c "
  SELECT a.observations_count AS current, b.observations_count AS previous,
         round(100.0 * (b.observations_count - a.observations_count) / b.observations_count, 1) AS drop_pct
  FROM import_log a, import_log b
  WHERE a.id = (SELECT max(id) FROM import_log WHERE status = 'completed')
    AND b.id = (SELECT max(id) FROM import_log WHERE status = 'completed' AND id < a.id);
"

Rebuilding a single service

docker compose up -d --build importer   # rebuild and restart importer only
docker compose up -d --build postgres   # rebuild postgres (data persists in volume)

Viewing logs

docker compose logs -f importer         # follow importer logs
docker compose logs --tail 50 postgres  # last 50 lines from postgres
docker compose logs                     # all services

Documentation

Guide	Description
CSV Export Guide	Exporting filtered data as CSV for R, Python, Excel, QGIS
Data Model Reference	Tables, columns, relationships, indexes, and materialized views
Custom Dashboards	Building your own Grafana dashboards with query examples
Troubleshooting	Common problems, error messages, and fixes

Architecture

Key design decisions and the reasoning behind them:

Atomic swap-table imports. Each import loads data into staging tables, then renames them into place in a single transaction. This avoids the read-blocking and WAL amplification of TRUNCATE/INSERT or row-by-row UPSERT on a 200M+ row table. Dashboards and API queries experience zero downtime during imports.

Materialized views rebuilt per import. The iNaturalist open dataset is published as a full snapshot, not a delta. Since every row is replaced on each import, incremental view maintenance would add complexity without benefit. Views are built under temporary names and swapped in atomically so dashboards continue serving stale-but-valid data during the rebuild.

File-lock concurrency control. The importer uses an atomic mkdir-based lock with PID-based stale detection. If a previous import was killed (OOM, node restart), the next run detects the dead PID and recovers the lock automatically. This prevents concurrent imports from corrupting the table swap without requiring external coordination.

Read-only API role separation. PostgREST connects through an api_readonly role that has only SELECT and EXECUTE grants. Even if PostgREST is exposed beyond localhost, the database enforces that no data can be modified through the API layer.

Contributing

Contributions are welcome — bug fixes, dashboards, documentation improvements, and well-scoped features. See CONTRIBUTING.md for the full guide, including:

Local development setup and the synthetic test data harness
Running the same lint checks CI runs
How to add a new dashboard, docs page, alert, or schema change
PR submission and review checklist

Bug reports and feature requests can be filed via GitHub Issues.

Citation

If you use Observa in academic work or publications, please cite it as:

@software{observa,
  author       = {McMeen, John},
  title        = {Observa: A Dockerized Platform for iNaturalist Open Data},
  url          = {https://github.com/jmcmeen/observa},
  license      = {Apache-2.0}
}

Data Source

This project uses the iNaturalist Open Dataset, which is hosted on the AWS Open Data Registry. The dataset includes observations, photos, taxa, and observer metadata published under Creative Commons licenses. See the upstream repository for licensing and attribution details.

Name		Name	Last commit message	Last commit date
Latest commit History 123 Commits
.github/workflows		.github/workflows
backup		backup
db		db
docs		docs
grafana/provisioning		grafana/provisioning
importer		importer
nginx		nginx
scripts		scripts
.env.example		.env.example
.gitignore		.gitignore
CHANGELOG.md		CHANGELOG.md
CITATION.cff		CITATION.cff
CONTRIBUTING.md		CONTRIBUTING.md
LICENSE		LICENSE
README.md		README.md
docker-compose.yml		docker-compose.yml

Folders and files

Latest commit

History

Repository files navigation

Observa

Prerequisites

Quick Start

Configuration

PostgreSQL tuning

Services

Daily Imports

Health Endpoint

REST API

Spatial queries

Taxa search

Taxonomy tree

CSV export

API security notice

Database Backup

Data Export

Connecting Directly to the Database

Grafana Alerts

Development

Running the importer against an existing database

Local testing without S3

Testing alert SQL

Rebuilding a single service

Viewing logs

Documentation

Architecture

Contributing

Citation

Data Source

About

Resources

License

Contributing

Uh oh!

Stars

Watchers

Forks

Releases 6

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages