feat: data/dataset refresh + indexes auto-embedding + embedding providers by anoop-narang · Pull Request #67 · hotdata-dev/hotdata-cli

anoop-narang · 2026-04-29T15:17:20Z

Three cohesive additions to the CLI, all wiring up server endpoints that already existed but weren't reachable from hotdata:

hotdata connections refresh data mode + async + scope flags
hotdata datasets refresh (new subcommand) + async
hotdata indexes dataset scope + auto-embedding + delete + bug fix; new hotdata embedding-providers CRUD

What changed

`connections refresh`

Was schema-only and synchronous. Now:

--data toggles to data refresh.
--schema <s> --table <t> narrows a data refresh to a single table (must be paired).
--async submits as a background job and returns a job ID; only valid with --data.
--include-uncached includes tables that haven't been cached yet (connection-wide data refresh only).
CLI-side validation matches server validation rules (no schema-only data refresh, async only with data, etc.).
Output now reports tables_discovered/added/modified for schema refresh and rows_synced/duration for data refresh.

`datasets refresh`

New subcommand. Re-runs the dataset's source (URL fetch or saved query) and creates a new version.

hotdata datasets refresh <dataset_id> [--async]

Upload-source datasets reject sync upfront (server returns a clear 400). The async path on upload-source datasets is still accepted by the server but the worker can't complete it — flagged for runtimedb.

`indexes`

New --dataset-id scope, alternative to --connection-id + --schema + --table on list, create, and the new delete. Mutually exclusive (clap-enforced).
New delete subcommand on both scopes.
Auto-embedding flags on create: --embedding-provider-id, --dimensions, --output-column, --description. With --type vector over a text column, the server generates embeddings automatically using the named provider (or first system provider as fallback).
--type is now required (no default sorted). Forces deliberate choice. Breaking for any scripts that omitted it — pass --type sorted to keep the previous default.
Bug fix: indexes create --async previously read parsed[\"job_id\"] from the response, but the server returns id per the SubmitJobResponse schema. Result: it always printed job_id: unknown. Now reads id correctly. Confirmed end-to-end against prod.
CLI-side pre-validation: auto-embed flags only valid with --type vector; --type vector requires exactly one column.

`embedding-providers` (new command)

Full CRUD for embedding providers used by vector indexes.

hotdata embedding-providers list
hotdata embedding-providers get <id>
hotdata embedding-providers create --name X --provider-type service [--config '<json>'] [--provider-api-key <key> | --secret-name <name>]
hotdata embedding-providers update <id> [...]
hotdata embedding-providers delete <id>

Note the flag name --provider-api-key (not --api-key) — pairs with --provider-type and avoids colliding with the global --api-key (Hotdata auth) flag. The collision was at clap's struct-field-id level; the JSON request body field is still api_key per the OpenAPI schema.

Misc

Added dataset_refresh and create_dataset_index to hotdata jobs list --job-type accepted values (the server emits these types but the CLI was rejecting them as filter values).
Added ApiClient::delete_raw helper. Mirrors post_raw/get_raw.
README and SKILL.md docs updated for all of the above.

Test plan

Verified end-to-end against api.hotdata.dev (production):

All 4 connections-refresh combos (sync/async × single-table/connection-wide), plus --include-uncached
CLI validation rejects all illegal scope/flag combinations upfront
datasets refresh sync (creates new version) and async (job submitted, polls successfully)
indexes create sorted/vector on connection scope; vector with auto-embedding generates title_embedding column with cosine metric via sys_emb_openai
indexes create sorted on dataset scope; async path returns real job ID (bug fix confirmed)
indexes delete on both scopes
embedding-providers full CRUD (list, get, create with --provider-api-key, update, delete)
Global --api-key flow remains unchanged — verified after rename
cargo test — 110/110 passing (no test files were modified or removed)

Notes for reviewers

Breaking: hotdata indexes create --type is now required (was default_value = \"sorted\"). Scripts that omitted it will get a clear clap error; recovery is --type sorted.
Server-side findings flagged elsewhere (not fixed here):
- Async dataset refresh on upload-source datasets is accepted but unfinishable (sync rejects upfront).
- runtimedb/openapi.yaml:3175 describes RefreshDatasetResponse as the body for POST /v1/datasets/{id}/refresh — that endpoint doesn't exist; actual endpoint is POST /v1/refresh with dataset_id.
- OpenAPI for POST .../indexes documents only 201 IndexInfoResponse, missing the 202 SubmitJobResponse async variant.
- Vector indexes on legacy datasets (parquet predating _key column) return a clear 400 directing users to refresh the dataset first.
The branch was developed as a stack of three (connections-refresh-data-async → datasets-refresh → indexes-auto-embedding); commits are kept separate inside the PR for readability but are intended to merge as a unit.

…cached `hotdata connections refresh` previously only triggered a synchronous schema refresh. The /v1/refresh endpoint actually supports data refresh (per-table or whole-connection), an async/background-job mode, and an include_uncached toggle for picking up newly-discovered tables — none of which were exposed. Adds: - --data: refresh data instead of schema metadata - --schema/--table: narrow scope (server requires both together for data refresh) - --async: submit as a background job, returns a job id to poll via `hotdata jobs <id>` - --include-uncached: connection-wide data refresh only, includes uncached tables - CLI-side validation mirroring server rules so we fail fast with clear errors - Richer output: schema refresh now reports tables_discovered/added/modified; data refresh reports rows_synced and duration Also adds `dataset_refresh` to the allowed values for `jobs list --job-type`, which the server emits but the CLI didn't accept as a filter.

Updates the README and the bundled hotdata skill to match the expanded `hotdata connections refresh` surface (--data, --schema/--table, --async, --include-uncached) and to add `dataset_refresh` to the documented values for `hotdata jobs list --job-type`.

Adds `hotdata datasets refresh <dataset_id> [--async]` for re-running a dataset's source (URL fetch or saved query) to create a new version. Calls the same `/v1/refresh` endpoint as `connections refresh`, but with `dataset_id` set instead of `connection_id`. The sync path prints the new version and status; the async path prints the job ID and points the user at `hotdata jobs <id>` to poll. Upload- source datasets have no remote to re-pull from, so the server's 400 ("Refresh not supported for source type 'upload'") is surfaced directly. Updates README.md and SKILL.md to document the new subcommand.

…-providers CRUD INDEXES - New `--dataset-id` scope alternative to `--connection-id`/`--schema`/`--table` on `indexes list`, `indexes create`, and the new `indexes delete` subcommand. Scopes are mutually exclusive (clap-enforced). - New auto-embedding flags on `indexes create`: --embedding-provider-id --dimensions --output-column --description When `--type vector` runs against a text column, the server generates embeddings automatically using the named provider (or the first system provider). Generated column defaults to `{column}_embedding`. - `--type` is now required on `indexes create` (previously defaulted to `sorted`). Forces deliberate choice. BREAKING for scripts that omitted it. - New `indexes delete` subcommand for both connection and dataset scopes. - CLI-side pre-validation: * scope flags can't be mixed (clap mutex) * `--schema`/`--table` require `--connection-id` (clap) * `--connection-id` requires both `--schema` and `--table` (clap) * auto-embed flags only valid with `--type vector` (custom) * `--type vector` requires exactly one column in `--columns` (custom) BUG FIX - `indexes create --async` previously read `parsed["job_id"]` from the response, but the server returns `id` (per `SubmitJobResponse`). Result: it always printed `job_id: unknown`. Now reads `id` correctly. Confirmed end-to-end against prod with `hotdata jobs <id>` lookups working. EMBEDDING PROVIDERS - New `hotdata embedding-providers` command surface: list, get, create, update, delete - The "inline API key" flag is named `--inline-api-key` (struct field `inline_api_key`) to avoid colliding with the global `--api-key` auth flag — clap merges fields by their internal id, so reusing the name `api_key` would silently route the value to the auth layer. JOBS - Added `create_dataset_index` to the `--job-type` value list (server emits this type for async dataset index creation; the CLI was rejecting it as an invalid filter value). API LAYER - Added `ApiClient::delete_raw` — needed for `indexes delete` and `embedding-providers delete`. Mirrors `post_raw`/`get_raw` shape.

…viders CRUD Updates README.md and skills/hotdata/SKILL.md for the new surface: - `hotdata indexes` now supports both connection and dataset scope; show both invocation forms side-by-side, note `--type` is required, and document the auto-embedding flags (--embedding-provider-id, --dimensions, --output-column, --description). - `hotdata indexes delete` is new; documented for both scopes. - `hotdata embedding-providers` is new; full list/get/create/update/delete surface documented, with a callout that `--inline-api-key` (not `--api-key`) is the inline-secret flag — to avoid colliding with the global auth `--api-key`. - `--job-type` filter list updated with `create_dataset_index`.

…api-key Renames the flag (and Rust struct field) on `embedding-providers create` and `embedding-providers update` from `--inline-api-key` / `inline_api_key` to `--provider-api-key` / `provider_api_key`. Why: - Pairs naturally with the existing `--provider-type` flag on the same subcommand (consistent prefix family). - Self-documenting: this is the embedding service's own API key (e.g. an OpenAI sk-... key), not the user's Hotdata auth credential. - Avoids the clap field-id collision with the global `Cli::api_key` that motivated the original rename, but does so via a name that reads more naturally than `--inline-api-key`. The JSON request body field stays `api_key` per the OpenAPI schema — only the user-facing CLI flag and Rust field are renamed.

sentry · 2026-04-29T15:19:21Z

Codecov Report

❌ Patch coverage is 20.50817% with 438 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
src/embedding_providers.rs	23.80%	144 Missing ⚠️
src/main.rs	0.00%	118 Missing ⚠️
src/connections.rs	0.00%	86 Missing ⚠️
src/indexes.rs	36.73%	62 Missing ⚠️
src/datasets.rs	0.00%	28 Missing ⚠️

📢 Thoughts on this report? Let us know!

claude · 2026-04-29T15:19:34Z

+    if name.is_none() && config.is_none() && api_key.is_none() && secret_name.is_none() {
+        eprintln!(
+            "{}",
+            "error: provide at least one of --name, --config, --api-key, --secret-name.".red()


Blocking: The error message names --api-key, but the actual flag is --provider-api-key. A user who sees this error and follows its guidance will try --api-key, which is the global Hotdata auth flag — wrong key, confusing failure.

Suggested change

"error: provide at least one of --name, --config, --api-key, --secret-name.".red()

"error: provide at least one of --name, --config, --provider-api-key, --secret-name.".red()

claude

Review

Blocking Issues

src/embedding_providers.rs:158 — update's no-op guard prints --api-key in the error message, but the actual flag is --provider-api-key. A user following the error guidance will reach for --api-key, which is the global Hotdata auth flag, not the provider key. One-character fix in the string literal.

Action Required

Fix the flag name in the error string on line 158 before merge.

…_distance `hotdata search` now requires `--type vector|bm25` (no default; same rule as `indexes create --type`) and a positional query text argument. Both modes run entirely server-side with no client-side embedding. Routing: - `--type vector "<query>"` → SELECT *, vector_distance(<col>, '<query>') AS dist FROM <t> ORDER BY dist Server resolves the embedding column, model, dimensions, and metric from the index metadata. The user names the source text column. - `--type bm25 "<query>"` → existing bm25_search() server-side path. Removed: - `--model` flag (was: client-side OpenAI embedding + `l2_distance` SQL). - Stdin-piped-vector path (was: read JSON vector from stdin, generate `l2_distance` SQL). - `src/embedding.rs` module (its only callers were the two paths above). Both removed paths hardcoded `l2_distance` regardless of the index's actual metric, which silently produced wrong rankings on cosine indexes. They also required the user to point `--column` at the auto-generated `_embedding` column rather than the source text column. Power users who need client-side embedding or want to query with a precomputed vector can use raw SQL via `hotdata query` (e.g. `SELECT *, cosine_distance(...)`). Verified against prod on `my_ducklake.main.internet_pages_small`: - BM25 "basketball" → finds the basketball ProCamp title (score 2.92) - BM25 "HIV" → finds the HIV Story titles (score 4.81) - Vector "sports games athletes" → ranks the basketball ProCamp first (cosine distance 0.69), heart-attack-fitness second (0.80) - Vector "travel vacation cruise" → ranks the cruise excursion first (0.63), 48-hours-in-Cesky-Krumlov second (0.74) The semantically meaningful vector results confirm auto-embedding produced useful vectors AND the server-side rewrite correctly resolves provider+metric+output_column from index metadata. Cleaned up indexes after the test run.

…-op error `update`'s "provide at least one field" guard message still listed `--api-key`, which was the original local flag name before it was renamed to `--provider-api-key` to avoid colliding with the global Hotdata auth flag. A user following the error guidance would reach for `--api-key` (the global auth flag), not the provider key. One-character class of fix; caught by the PR review bot on #67.

Adds 7 unit tests covering the lightest-weight gaps in the new code: - `src/api.rs` — 2 mockito tests for `delete_raw` (204 success, 404 with error body), matching the existing `get_none_if_not_found_*` pattern. - `src/indexes.rs` — 2 path-construction tests for the new `IndexScope` enum (`Connection` and `Dataset` variants), matching the existing pure-function tests in this module. - `src/embedding_providers.rs` — 3 deserialization fixture tests for the `Provider` and `ListResponse` shapes plus a `parse_config` smoke test, matching the runtimedb-payload-deserialization pattern from `datasets.rs::update_response_deserializes_runtimedb_payload`. Skipped `datasets.rs::refresh` — adding a typed response struct purely for test fixture purposes would be over-engineering since the refresh function reads the response as `serde_json::Value` directly. 110 tests → 117 (105 → 112 unit + 5 integration unchanged). Patch coverage on the PR is still informational by repo policy (codecov.yml).

Reading just the Search section, a user might miss that vector search is end-to-end auto-embedded — both the column's embeddings (built when the index was created) and the query embedding (computed at search time) come from the same server-configured provider, with matching metric, model, and dimensions. Spells that out at the top of the `--type vector` bullet, and adds an explicit pointer to raw SQL via `hotdata query` for cases where the user needs a different model than the index, or has no index at all (the SQL reference covers the underlying distance functions and table UDFs).

claude

The cycle-1 blocking issue is fixed: src/embedding_providers.rs:158 now correctly names --provider-api-key. Everything else looks clean.

anoop-narang added 6 commits April 29, 2026 14:56

claude Bot reviewed Apr 29, 2026

View reviewed changes

claude Bot requested changes Apr 29, 2026

View reviewed changes

anoop-narang added 2 commits April 29, 2026 21:31

claude Bot previously approved these changes Apr 29, 2026

View reviewed changes

anoop-narang added 2 commits April 29, 2026 22:23

anoop-narang dismissed claude[bot]’s stale review via 5007ca8 April 29, 2026 16:57

claude Bot approved these changes Apr 29, 2026

View reviewed changes

anoop-narang merged commit 27dcdb0 into main Apr 29, 2026
9 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: data/dataset refresh + indexes auto-embedding + embedding providers#67

feat: data/dataset refresh + indexes auto-embedding + embedding providers#67
anoop-narang merged 10 commits intomainfrom
feat/indexes-auto-embedding

anoop-narang commented Apr 29, 2026

Uh oh!

sentry Bot commented Apr 29, 2026 •

edited

Loading

Uh oh!

claude Bot Apr 29, 2026

Uh oh!

claude Bot left a comment

Uh oh!

claude Bot left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

	"error: provide at least one of --name, --config, --api-key, --secret-name.".red()
	"error: provide at least one of --name, --config, --provider-api-key, --secret-name.".red()

Conversation

anoop-narang commented Apr 29, 2026

What changed

connections refresh

datasets refresh

indexes

embedding-providers (new command)

Misc

Test plan

Notes for reviewers

Uh oh!

sentry Bot commented Apr 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

claude Bot Apr 29, 2026

Choose a reason for hiding this comment

Uh oh!

claude Bot left a comment

Choose a reason for hiding this comment

Review

Blocking Issues

Action Required

Uh oh!

claude Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

`connections refresh`

`datasets refresh`

`indexes`

`embedding-providers` (new command)

sentry Bot commented Apr 29, 2026 •

edited

Loading