feat: data/dataset refresh + indexes auto-embedding + embedding providers#67
Merged
anoop-narang merged 10 commits intomainfrom Apr 29, 2026
Merged
feat: data/dataset refresh + indexes auto-embedding + embedding providers#67anoop-narang merged 10 commits intomainfrom
anoop-narang merged 10 commits intomainfrom
Conversation
…cached `hotdata connections refresh` previously only triggered a synchronous schema refresh. The /v1/refresh endpoint actually supports data refresh (per-table or whole-connection), an async/background-job mode, and an include_uncached toggle for picking up newly-discovered tables — none of which were exposed. Adds: - --data: refresh data instead of schema metadata - --schema/--table: narrow scope (server requires both together for data refresh) - --async: submit as a background job, returns a job id to poll via `hotdata jobs <id>` - --include-uncached: connection-wide data refresh only, includes uncached tables - CLI-side validation mirroring server rules so we fail fast with clear errors - Richer output: schema refresh now reports tables_discovered/added/modified; data refresh reports rows_synced and duration Also adds `dataset_refresh` to the allowed values for `jobs list --job-type`, which the server emits but the CLI didn't accept as a filter.
Updates the README and the bundled hotdata skill to match the expanded `hotdata connections refresh` surface (--data, --schema/--table, --async, --include-uncached) and to add `dataset_refresh` to the documented values for `hotdata jobs list --job-type`.
Adds `hotdata datasets refresh <dataset_id> [--async]` for re-running a
dataset's source (URL fetch or saved query) to create a new version.
Calls the same `/v1/refresh` endpoint as `connections refresh`, but with
`dataset_id` set instead of `connection_id`.
The sync path prints the new version and status; the async path prints
the job ID and points the user at `hotdata jobs <id>` to poll. Upload-
source datasets have no remote to re-pull from, so the server's 400
("Refresh not supported for source type 'upload'") is surfaced directly.
Updates README.md and SKILL.md to document the new subcommand.
…-providers CRUD
INDEXES
- New `--dataset-id` scope alternative to `--connection-id`/`--schema`/`--table`
on `indexes list`, `indexes create`, and the new `indexes delete` subcommand.
Scopes are mutually exclusive (clap-enforced).
- New auto-embedding flags on `indexes create`:
--embedding-provider-id --dimensions --output-column --description
When `--type vector` runs against a text column, the server generates
embeddings automatically using the named provider (or the first system
provider). Generated column defaults to `{column}_embedding`.
- `--type` is now required on `indexes create` (previously defaulted to
`sorted`). Forces deliberate choice. BREAKING for scripts that omitted it.
- New `indexes delete` subcommand for both connection and dataset scopes.
- CLI-side pre-validation:
* scope flags can't be mixed (clap mutex)
* `--schema`/`--table` require `--connection-id` (clap)
* `--connection-id` requires both `--schema` and `--table` (clap)
* auto-embed flags only valid with `--type vector` (custom)
* `--type vector` requires exactly one column in `--columns` (custom)
BUG FIX
- `indexes create --async` previously read `parsed["job_id"]` from the
response, but the server returns `id` (per `SubmitJobResponse`). Result:
it always printed `job_id: unknown`. Now reads `id` correctly. Confirmed
end-to-end against prod with `hotdata jobs <id>` lookups working.
EMBEDDING PROVIDERS
- New `hotdata embedding-providers` command surface:
list, get, create, update, delete
- The "inline API key" flag is named `--inline-api-key` (struct field
`inline_api_key`) to avoid colliding with the global `--api-key` auth
flag — clap merges fields by their internal id, so reusing the name
`api_key` would silently route the value to the auth layer.
JOBS
- Added `create_dataset_index` to the `--job-type` value list (server
emits this type for async dataset index creation; the CLI was rejecting
it as an invalid filter value).
API LAYER
- Added `ApiClient::delete_raw` — needed for `indexes delete` and
`embedding-providers delete`. Mirrors `post_raw`/`get_raw` shape.
…viders CRUD Updates README.md and skills/hotdata/SKILL.md for the new surface: - `hotdata indexes` now supports both connection and dataset scope; show both invocation forms side-by-side, note `--type` is required, and document the auto-embedding flags (--embedding-provider-id, --dimensions, --output-column, --description). - `hotdata indexes delete` is new; documented for both scopes. - `hotdata embedding-providers` is new; full list/get/create/update/delete surface documented, with a callout that `--inline-api-key` (not `--api-key`) is the inline-secret flag — to avoid colliding with the global auth `--api-key`. - `--job-type` filter list updated with `create_dataset_index`.
…api-key Renames the flag (and Rust struct field) on `embedding-providers create` and `embedding-providers update` from `--inline-api-key` / `inline_api_key` to `--provider-api-key` / `provider_api_key`. Why: - Pairs naturally with the existing `--provider-type` flag on the same subcommand (consistent prefix family). - Self-documenting: this is the embedding service's own API key (e.g. an OpenAI sk-... key), not the user's Hotdata auth credential. - Avoids the clap field-id collision with the global `Cli::api_key` that motivated the original rename, but does so via a name that reads more naturally than `--inline-api-key`. The JSON request body field stays `api_key` per the OpenAPI schema — only the user-facing CLI flag and Rust field are renamed.
Codecov Report❌ Patch coverage is
📢 Thoughts on this report? Let us know! |
| if name.is_none() && config.is_none() && api_key.is_none() && secret_name.is_none() { | ||
| eprintln!( | ||
| "{}", | ||
| "error: provide at least one of --name, --config, --api-key, --secret-name.".red() |
Contributor
There was a problem hiding this comment.
Blocking: The error message names --api-key, but the actual flag is --provider-api-key. A user who sees this error and follows its guidance will try --api-key, which is the global Hotdata auth flag — wrong key, confusing failure.
Suggested change
| "error: provide at least one of --name, --config, --api-key, --secret-name.".red() | |
| "error: provide at least one of --name, --config, --provider-api-key, --secret-name.".red() |
Contributor
There was a problem hiding this comment.
Review
Blocking Issues
src/embedding_providers.rs:158—update's no-op guard prints--api-keyin the error message, but the actual flag is--provider-api-key. A user following the error guidance will reach for--api-key, which is the global Hotdata auth flag, not the provider key. One-character fix in the string literal.
Action Required
Fix the flag name in the error string on line 158 before merge.
…_distance
`hotdata search` now requires `--type vector|bm25` (no default; same rule
as `indexes create --type`) and a positional query text argument. Both
modes run entirely server-side with no client-side embedding.
Routing:
- `--type vector "<query>"` →
SELECT *, vector_distance(<col>, '<query>') AS dist FROM <t> ORDER BY dist
Server resolves the embedding column, model, dimensions, and metric from
the index metadata. The user names the source text column.
- `--type bm25 "<query>"` → existing bm25_search() server-side path.
Removed:
- `--model` flag (was: client-side OpenAI embedding + `l2_distance` SQL).
- Stdin-piped-vector path (was: read JSON vector from stdin, generate
`l2_distance` SQL).
- `src/embedding.rs` module (its only callers were the two paths above).
Both removed paths hardcoded `l2_distance` regardless of the index's
actual metric, which silently produced wrong rankings on cosine indexes.
They also required the user to point `--column` at the auto-generated
`_embedding` column rather than the source text column. Power users who
need client-side embedding or want to query with a precomputed vector
can use raw SQL via `hotdata query` (e.g. `SELECT *, cosine_distance(...)`).
Verified against prod on `my_ducklake.main.internet_pages_small`:
- BM25 "basketball" → finds the basketball ProCamp title (score 2.92)
- BM25 "HIV" → finds the HIV Story titles (score 4.81)
- Vector "sports games athletes" → ranks the basketball ProCamp first
(cosine distance 0.69), heart-attack-fitness second (0.80)
- Vector "travel vacation cruise" → ranks the cruise excursion first
(0.63), 48-hours-in-Cesky-Krumlov second (0.74)
The semantically meaningful vector results confirm auto-embedding produced
useful vectors AND the server-side rewrite correctly resolves
provider+metric+output_column from index metadata. Cleaned up indexes
after the test run.
…-op error `update`'s "provide at least one field" guard message still listed `--api-key`, which was the original local flag name before it was renamed to `--provider-api-key` to avoid colliding with the global Hotdata auth flag. A user following the error guidance would reach for `--api-key` (the global auth flag), not the provider key. One-character class of fix; caught by the PR review bot on #67.
Adds 7 unit tests covering the lightest-weight gaps in the new code: - `src/api.rs` — 2 mockito tests for `delete_raw` (204 success, 404 with error body), matching the existing `get_none_if_not_found_*` pattern. - `src/indexes.rs` — 2 path-construction tests for the new `IndexScope` enum (`Connection` and `Dataset` variants), matching the existing pure-function tests in this module. - `src/embedding_providers.rs` — 3 deserialization fixture tests for the `Provider` and `ListResponse` shapes plus a `parse_config` smoke test, matching the runtimedb-payload-deserialization pattern from `datasets.rs::update_response_deserializes_runtimedb_payload`. Skipped `datasets.rs::refresh` — adding a typed response struct purely for test fixture purposes would be over-engineering since the refresh function reads the response as `serde_json::Value` directly. 110 tests → 117 (105 → 112 unit + 5 integration unchanged). Patch coverage on the PR is still informational by repo policy (codecov.yml).
Reading just the Search section, a user might miss that vector search is end-to-end auto-embedded — both the column's embeddings (built when the index was created) and the query embedding (computed at search time) come from the same server-configured provider, with matching metric, model, and dimensions. Spells that out at the top of the `--type vector` bullet, and adds an explicit pointer to raw SQL via `hotdata query` for cases where the user needs a different model than the index, or has no index at all (the SQL reference covers the underlying distance functions and table UDFs).
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Three cohesive additions to the CLI, all wiring up server endpoints that already existed but weren't reachable from
hotdata:hotdata connections refreshdata mode + async + scope flagshotdata datasets refresh(new subcommand) + asynchotdata indexesdataset scope + auto-embedding + delete + bug fix; newhotdata embedding-providersCRUDWhat changed
connections refreshWas schema-only and synchronous. Now:
--datatoggles to data refresh.--schema <s> --table <t>narrows a data refresh to a single table (must be paired).--asyncsubmits as a background job and returns a job ID; only valid with--data.--include-uncachedincludes tables that haven't been cached yet (connection-wide data refresh only).tables_discovered/added/modifiedfor schema refresh androws_synced/durationfor data refresh.datasets refreshNew subcommand. Re-runs the dataset's source (URL fetch or saved query) and creates a new version.
Upload-source datasets reject sync upfront (server returns a clear 400). The async path on upload-source datasets is still accepted by the server but the worker can't complete it — flagged for runtimedb.
indexes--dataset-idscope, alternative to--connection-id+--schema+--tableonlist,create, and the newdelete. Mutually exclusive (clap-enforced).deletesubcommand on both scopes.create:--embedding-provider-id,--dimensions,--output-column,--description. With--type vectorover a text column, the server generates embeddings automatically using the named provider (or first system provider as fallback).--typeis now required (no defaultsorted). Forces deliberate choice. Breaking for any scripts that omitted it — pass--type sortedto keep the previous default.indexes create --asyncpreviously readparsed[\"job_id\"]from the response, but the server returnsidper theSubmitJobResponseschema. Result: it always printedjob_id: unknown. Now readsidcorrectly. Confirmed end-to-end against prod.--type vector;--type vectorrequires exactly one column.embedding-providers(new command)Full CRUD for embedding providers used by vector indexes.
Note the flag name
--provider-api-key(not--api-key) — pairs with--provider-typeand avoids colliding with the global--api-key(Hotdata auth) flag. The collision was at clap's struct-field-id level; the JSON request body field is stillapi_keyper the OpenAPI schema.Misc
dataset_refreshandcreate_dataset_indextohotdata jobs list --job-typeaccepted values (the server emits these types but the CLI was rejecting them as filter values).ApiClient::delete_rawhelper. Mirrorspost_raw/get_raw.Test plan
Verified end-to-end against
api.hotdata.dev(production):--include-uncacheddatasets refreshsync (creates new version) and async (job submitted, polls successfully)indexes createsorted/vector on connection scope; vector with auto-embedding generatestitle_embeddingcolumn withcosinemetric viasys_emb_openaiindexes createsorted on dataset scope; async path returns real job ID (bug fix confirmed)indexes deleteon both scopesembedding-providersfull CRUD (list, get, create with--provider-api-key, update, delete)--api-keyflow remains unchanged — verified after renamecargo test— 110/110 passing (no test files were modified or removed)Notes for reviewers
hotdata indexes create --typeis now required (wasdefault_value = \"sorted\"). Scripts that omitted it will get a clear clap error; recovery is--type sorted.runtimedb/openapi.yaml:3175describesRefreshDatasetResponseas the body forPOST /v1/datasets/{id}/refresh— that endpoint doesn't exist; actual endpoint isPOST /v1/refreshwithdataset_id.POST .../indexesdocuments only201 IndexInfoResponse, missing the202 SubmitJobResponseasync variant._keycolumn) return a clear 400 directing users to refresh the dataset first.connections-refresh-data-async→datasets-refresh→indexes-auto-embedding); commits are kept separate inside the PR for readability but are intended to merge as a unit.