Skip to content

feat: data/dataset refresh + indexes auto-embedding + embedding providers#67

Merged
anoop-narang merged 10 commits intomainfrom
feat/indexes-auto-embedding
Apr 29, 2026
Merged

feat: data/dataset refresh + indexes auto-embedding + embedding providers#67
anoop-narang merged 10 commits intomainfrom
feat/indexes-auto-embedding

Conversation

@anoop-narang
Copy link
Copy Markdown
Contributor

Three cohesive additions to the CLI, all wiring up server endpoints that already existed but weren't reachable from hotdata:

  1. hotdata connections refresh data mode + async + scope flags
  2. hotdata datasets refresh (new subcommand) + async
  3. hotdata indexes dataset scope + auto-embedding + delete + bug fix; new hotdata embedding-providers CRUD

What changed

connections refresh

Was schema-only and synchronous. Now:

  • --data toggles to data refresh.
  • --schema <s> --table <t> narrows a data refresh to a single table (must be paired).
  • --async submits as a background job and returns a job ID; only valid with --data.
  • --include-uncached includes tables that haven't been cached yet (connection-wide data refresh only).
  • CLI-side validation matches server validation rules (no schema-only data refresh, async only with data, etc.).
  • Output now reports tables_discovered/added/modified for schema refresh and rows_synced/duration for data refresh.

datasets refresh

New subcommand. Re-runs the dataset's source (URL fetch or saved query) and creates a new version.

hotdata datasets refresh <dataset_id> [--async]

Upload-source datasets reject sync upfront (server returns a clear 400). The async path on upload-source datasets is still accepted by the server but the worker can't complete it — flagged for runtimedb.

indexes

  • New --dataset-id scope, alternative to --connection-id + --schema + --table on list, create, and the new delete. Mutually exclusive (clap-enforced).
  • New delete subcommand on both scopes.
  • Auto-embedding flags on create: --embedding-provider-id, --dimensions, --output-column, --description. With --type vector over a text column, the server generates embeddings automatically using the named provider (or first system provider as fallback).
  • --type is now required (no default sorted). Forces deliberate choice. Breaking for any scripts that omitted it — pass --type sorted to keep the previous default.
  • Bug fix: indexes create --async previously read parsed[\"job_id\"] from the response, but the server returns id per the SubmitJobResponse schema. Result: it always printed job_id: unknown. Now reads id correctly. Confirmed end-to-end against prod.
  • CLI-side pre-validation: auto-embed flags only valid with --type vector; --type vector requires exactly one column.

embedding-providers (new command)

Full CRUD for embedding providers used by vector indexes.

hotdata embedding-providers list
hotdata embedding-providers get <id>
hotdata embedding-providers create --name X --provider-type service [--config '<json>'] [--provider-api-key <key> | --secret-name <name>]
hotdata embedding-providers update <id> [...]
hotdata embedding-providers delete <id>

Note the flag name --provider-api-key (not --api-key) — pairs with --provider-type and avoids colliding with the global --api-key (Hotdata auth) flag. The collision was at clap's struct-field-id level; the JSON request body field is still api_key per the OpenAPI schema.

Misc

  • Added dataset_refresh and create_dataset_index to hotdata jobs list --job-type accepted values (the server emits these types but the CLI was rejecting them as filter values).
  • Added ApiClient::delete_raw helper. Mirrors post_raw/get_raw.
  • README and SKILL.md docs updated for all of the above.

Test plan

Verified end-to-end against api.hotdata.dev (production):

  • All 4 connections-refresh combos (sync/async × single-table/connection-wide), plus --include-uncached
  • CLI validation rejects all illegal scope/flag combinations upfront
  • datasets refresh sync (creates new version) and async (job submitted, polls successfully)
  • indexes create sorted/vector on connection scope; vector with auto-embedding generates title_embedding column with cosine metric via sys_emb_openai
  • indexes create sorted on dataset scope; async path returns real job ID (bug fix confirmed)
  • indexes delete on both scopes
  • embedding-providers full CRUD (list, get, create with --provider-api-key, update, delete)
  • Global --api-key flow remains unchanged — verified after rename
  • cargo test — 110/110 passing (no test files were modified or removed)

Notes for reviewers

  • Breaking: hotdata indexes create --type is now required (was default_value = \"sorted\"). Scripts that omitted it will get a clear clap error; recovery is --type sorted.
  • Server-side findings flagged elsewhere (not fixed here):
    • Async dataset refresh on upload-source datasets is accepted but unfinishable (sync rejects upfront).
    • runtimedb/openapi.yaml:3175 describes RefreshDatasetResponse as the body for POST /v1/datasets/{id}/refresh — that endpoint doesn't exist; actual endpoint is POST /v1/refresh with dataset_id.
    • OpenAPI for POST .../indexes documents only 201 IndexInfoResponse, missing the 202 SubmitJobResponse async variant.
    • Vector indexes on legacy datasets (parquet predating _key column) return a clear 400 directing users to refresh the dataset first.
  • The branch was developed as a stack of three (connections-refresh-data-asyncdatasets-refreshindexes-auto-embedding); commits are kept separate inside the PR for readability but are intended to merge as a unit.

…cached

`hotdata connections refresh` previously only triggered a synchronous
schema refresh. The /v1/refresh endpoint actually supports data refresh
(per-table or whole-connection), an async/background-job mode, and an
include_uncached toggle for picking up newly-discovered tables — none
of which were exposed.

Adds:
- --data: refresh data instead of schema metadata
- --schema/--table: narrow scope (server requires both together for data refresh)
- --async: submit as a background job, returns a job id to poll via `hotdata jobs <id>`
- --include-uncached: connection-wide data refresh only, includes uncached tables
- CLI-side validation mirroring server rules so we fail fast with clear errors
- Richer output: schema refresh now reports tables_discovered/added/modified;
  data refresh reports rows_synced and duration

Also adds `dataset_refresh` to the allowed values for `jobs list --job-type`,
which the server emits but the CLI didn't accept as a filter.
Updates the README and the bundled hotdata skill to match the expanded
`hotdata connections refresh` surface (--data, --schema/--table, --async,
--include-uncached) and to add `dataset_refresh` to the documented values
for `hotdata jobs list --job-type`.
Adds `hotdata datasets refresh <dataset_id> [--async]` for re-running a
dataset's source (URL fetch or saved query) to create a new version.
Calls the same `/v1/refresh` endpoint as `connections refresh`, but with
`dataset_id` set instead of `connection_id`.

The sync path prints the new version and status; the async path prints
the job ID and points the user at `hotdata jobs <id>` to poll. Upload-
source datasets have no remote to re-pull from, so the server's 400
("Refresh not supported for source type 'upload'") is surfaced directly.

Updates README.md and SKILL.md to document the new subcommand.
…-providers CRUD

INDEXES
- New `--dataset-id` scope alternative to `--connection-id`/`--schema`/`--table`
  on `indexes list`, `indexes create`, and the new `indexes delete` subcommand.
  Scopes are mutually exclusive (clap-enforced).
- New auto-embedding flags on `indexes create`:
    --embedding-provider-id  --dimensions  --output-column  --description
  When `--type vector` runs against a text column, the server generates
  embeddings automatically using the named provider (or the first system
  provider). Generated column defaults to `{column}_embedding`.
- `--type` is now required on `indexes create` (previously defaulted to
  `sorted`). Forces deliberate choice. BREAKING for scripts that omitted it.
- New `indexes delete` subcommand for both connection and dataset scopes.
- CLI-side pre-validation:
    * scope flags can't be mixed (clap mutex)
    * `--schema`/`--table` require `--connection-id` (clap)
    * `--connection-id` requires both `--schema` and `--table` (clap)
    * auto-embed flags only valid with `--type vector` (custom)
    * `--type vector` requires exactly one column in `--columns` (custom)

BUG FIX
- `indexes create --async` previously read `parsed["job_id"]` from the
  response, but the server returns `id` (per `SubmitJobResponse`). Result:
  it always printed `job_id: unknown`. Now reads `id` correctly. Confirmed
  end-to-end against prod with `hotdata jobs <id>` lookups working.

EMBEDDING PROVIDERS
- New `hotdata embedding-providers` command surface:
    list, get, create, update, delete
- The "inline API key" flag is named `--inline-api-key` (struct field
  `inline_api_key`) to avoid colliding with the global `--api-key` auth
  flag — clap merges fields by their internal id, so reusing the name
  `api_key` would silently route the value to the auth layer.

JOBS
- Added `create_dataset_index` to the `--job-type` value list (server
  emits this type for async dataset index creation; the CLI was rejecting
  it as an invalid filter value).

API LAYER
- Added `ApiClient::delete_raw` — needed for `indexes delete` and
  `embedding-providers delete`. Mirrors `post_raw`/`get_raw` shape.
…viders CRUD

Updates README.md and skills/hotdata/SKILL.md for the new surface:
- `hotdata indexes` now supports both connection and dataset scope; show
  both invocation forms side-by-side, note `--type` is required, and
  document the auto-embedding flags (--embedding-provider-id, --dimensions,
  --output-column, --description).
- `hotdata indexes delete` is new; documented for both scopes.
- `hotdata embedding-providers` is new; full list/get/create/update/delete
  surface documented, with a callout that `--inline-api-key` (not
  `--api-key`) is the inline-secret flag — to avoid colliding with the
  global auth `--api-key`.
- `--job-type` filter list updated with `create_dataset_index`.
…api-key

Renames the flag (and Rust struct field) on `embedding-providers create`
and `embedding-providers update` from `--inline-api-key` /
`inline_api_key` to `--provider-api-key` / `provider_api_key`.

Why:
- Pairs naturally with the existing `--provider-type` flag on the same
  subcommand (consistent prefix family).
- Self-documenting: this is the embedding service's own API key (e.g.
  an OpenAI sk-... key), not the user's Hotdata auth credential.
- Avoids the clap field-id collision with the global `Cli::api_key`
  that motivated the original rename, but does so via a name that
  reads more naturally than `--inline-api-key`.

The JSON request body field stays `api_key` per the OpenAPI schema —
only the user-facing CLI flag and Rust field are renamed.
@sentry
Copy link
Copy Markdown

sentry Bot commented Apr 29, 2026

Codecov Report

❌ Patch coverage is 20.50817% with 438 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
src/embedding_providers.rs 23.80% 144 Missing ⚠️
src/main.rs 0.00% 118 Missing ⚠️
src/connections.rs 0.00% 86 Missing ⚠️
src/indexes.rs 36.73% 62 Missing ⚠️
src/datasets.rs 0.00% 28 Missing ⚠️

📢 Thoughts on this report? Let us know!

Comment thread src/embedding_providers.rs Outdated
if name.is_none() && config.is_none() && api_key.is_none() && secret_name.is_none() {
eprintln!(
"{}",
"error: provide at least one of --name, --config, --api-key, --secret-name.".red()
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Blocking: The error message names --api-key, but the actual flag is --provider-api-key. A user who sees this error and follows its guidance will try --api-key, which is the global Hotdata auth flag — wrong key, confusing failure.

Suggested change
"error: provide at least one of --name, --config, --api-key, --secret-name.".red()
"error: provide at least one of --name, --config, --provider-api-key, --secret-name.".red()

Copy link
Copy Markdown
Contributor

@claude claude Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review

Blocking Issues

  • src/embedding_providers.rs:158update's no-op guard prints --api-key in the error message, but the actual flag is --provider-api-key. A user following the error guidance will reach for --api-key, which is the global Hotdata auth flag, not the provider key. One-character fix in the string literal.

Action Required

Fix the flag name in the error string on line 158 before merge.

…_distance

`hotdata search` now requires `--type vector|bm25` (no default; same rule
as `indexes create --type`) and a positional query text argument. Both
modes run entirely server-side with no client-side embedding.

Routing:
- `--type vector "<query>"` →
    SELECT *, vector_distance(<col>, '<query>') AS dist FROM <t> ORDER BY dist
  Server resolves the embedding column, model, dimensions, and metric from
  the index metadata. The user names the source text column.
- `--type bm25 "<query>"` → existing bm25_search() server-side path.

Removed:
- `--model` flag (was: client-side OpenAI embedding + `l2_distance` SQL).
- Stdin-piped-vector path (was: read JSON vector from stdin, generate
  `l2_distance` SQL).
- `src/embedding.rs` module (its only callers were the two paths above).

Both removed paths hardcoded `l2_distance` regardless of the index's
actual metric, which silently produced wrong rankings on cosine indexes.
They also required the user to point `--column` at the auto-generated
`_embedding` column rather than the source text column. Power users who
need client-side embedding or want to query with a precomputed vector
can use raw SQL via `hotdata query` (e.g. `SELECT *, cosine_distance(...)`).

Verified against prod on `my_ducklake.main.internet_pages_small`:
- BM25 "basketball" → finds the basketball ProCamp title (score 2.92)
- BM25 "HIV" → finds the HIV Story titles (score 4.81)
- Vector "sports games athletes" → ranks the basketball ProCamp first
  (cosine distance 0.69), heart-attack-fitness second (0.80)
- Vector "travel vacation cruise" → ranks the cruise excursion first
  (0.63), 48-hours-in-Cesky-Krumlov second (0.74)

The semantically meaningful vector results confirm auto-embedding produced
useful vectors AND the server-side rewrite correctly resolves
provider+metric+output_column from index metadata. Cleaned up indexes
after the test run.
…-op error

`update`'s "provide at least one field" guard message still listed
`--api-key`, which was the original local flag name before it was renamed
to `--provider-api-key` to avoid colliding with the global Hotdata auth
flag. A user following the error guidance would reach for `--api-key`
(the global auth flag), not the provider key.

One-character class of fix; caught by the PR review bot on #67.
claude[bot]
claude Bot previously approved these changes Apr 29, 2026
Adds 7 unit tests covering the lightest-weight gaps in the new code:

- `src/api.rs` — 2 mockito tests for `delete_raw` (204 success, 404 with
  error body), matching the existing `get_none_if_not_found_*` pattern.
- `src/indexes.rs` — 2 path-construction tests for the new `IndexScope`
  enum (`Connection` and `Dataset` variants), matching the existing
  pure-function tests in this module.
- `src/embedding_providers.rs` — 3 deserialization fixture tests for the
  `Provider` and `ListResponse` shapes plus a `parse_config` smoke test,
  matching the runtimedb-payload-deserialization pattern from
  `datasets.rs::update_response_deserializes_runtimedb_payload`.

Skipped `datasets.rs::refresh` — adding a typed response struct purely
for test fixture purposes would be over-engineering since the refresh
function reads the response as `serde_json::Value` directly.

110 tests → 117 (105 → 112 unit + 5 integration unchanged). Patch
coverage on the PR is still informational by repo policy (codecov.yml).
Reading just the Search section, a user might miss that vector search is
end-to-end auto-embedded — both the column's embeddings (built when the
index was created) and the query embedding (computed at search time)
come from the same server-configured provider, with matching metric,
model, and dimensions.

Spells that out at the top of the `--type vector` bullet, and adds an
explicit pointer to raw SQL via `hotdata query` for cases where the user
needs a different model than the index, or has no index at all (the SQL
reference covers the underlying distance functions and table UDFs).
Copy link
Copy Markdown
Contributor

@claude claude Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The cycle-1 blocking issue is fixed: src/embedding_providers.rs:158 now correctly names --provider-api-key. Everything else looks clean.

@anoop-narang anoop-narang merged commit 27dcdb0 into main Apr 29, 2026
9 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant