diff --git a/README.md b/README.md
index ae5ac99..82ee6d8 100644
--- a/README.md
+++ b/README.md
@@ -71,7 +71,8 @@ API key priority (lowest to highest): config file → `HOTDATA_API_KEY` env var
 | `query` | | Execute a SQL query |
 | `queries` | `list` | Inspect query run history |
 | `search` | | Full-text search across a table column |
-| `indexes` | `list`, `create` | Manage indexes on a table |
+| `indexes` | `list`, `create`, `delete` | Manage indexes on a table or dataset |
+| `embedding-providers` | `list`, `get`, `create`, `update`, `delete` | Manage embedding providers used by vector indexes |
 | `results` | `list` | Retrieve stored query results |
 | `jobs` | `list` | Manage background jobs |
 | `sandbox` | `list`, `new`, `set`, `read`, `update`, `run` | Manage sandboxes |
@@ -101,13 +102,16 @@ hotdata workspaces set [<workspace_id>]
 ```sh
 hotdata connections list [-w <workspace_id>] [-o table|json|yaml]
 hotdata connections <connection_id> [-w <workspace_id>] [-o table|json|yaml]
-hotdata connections refresh <connection_id> [-w <workspace_id>]
+hotdata connections refresh <connection_id> [-w <workspace_id>] [--data] [--schema <schema> --table <table>] [--async] [--include-uncached]
 hotdata connections new [-w <workspace_id>]
 ```
 
 - `list` returns `id`, `name`, `source_type` for each connection.
 - Pass a connection ID to view details (id, name, source type, table counts).
-- `refresh` triggers a schema refresh for a connection.
+- `refresh` triggers a schema refresh by default. Pass `--data` to refresh cached row data instead.
+- `--schema` and `--table` narrow a data refresh to a single table (must be supplied together).
+- `--async` submits a data refresh as a background job and returns a job ID; poll with `hotdata jobs <job_id>`. Only valid with `--data` — schema refresh is always synchronous.
+- `--include-uncached` includes tables that haven't been cached yet in a connection-wide data refresh. Only valid with `--data` and no `--table`.
 - `new` launches an interactive connection creation wizard.
 
 ### Create a connection
@@ -143,6 +147,7 @@ hotdata datasets create --file data.csv [--label "My Dataset"] [--table-name my_
 hotdata datasets create --sql "SELECT ..." --label "My Dataset"
 hotdata datasets create --url "https://example.com/data.parquet" --label "My Dataset"
 hotdata datasets update <dataset_id> [--label "New Label"] [--table-name new_table]
+hotdata datasets refresh <dataset_id> [--workspace-id <workspace_id>] [--async]
 ```
 
 - Datasets are queryable as `datasets.main.<table_name>`.
@@ -150,6 +155,8 @@ hotdata datasets update <dataset_id> [--label "New Label"] [--table-name new_tab
 - `--url` imports data directly from a URL (supports csv, json, parquet).
 - Format is auto-detected from file extension or content.
 - Piped stdin is supported: `cat data.csv | hotdata datasets create --label "My Dataset"`
+- `refresh` re-runs the dataset's source (URL fetch or saved query) and creates a new version. Not supported for upload-source datasets.
+- `--async` submits the refresh as a background job and returns a job ID; poll with `hotdata jobs <job_id>`.
 
 ## Workspace context
 
@@ -194,33 +201,62 @@ hotdata queries <query_id> [-o table|json|yaml]
 
 ## Search
 
-```sh
-# BM25 full-text search
-hotdata search "query text" --table <table> --column <column> [--select <columns>] [--limit <n>] [-o table|json|csv]
+`--type` is **required** — no default. Pass either `vector` (similarity search via the index's embedding provider) or `bm25` (full-text search). Both run entirely server-side.
-# Vector search with --model (calls OpenAI to embed the query)
-hotdata search "query text" --table <table> --column <column> --model text-embedding-3-small [--limit <n>]
+```sh
+# BM25 full-text search (requires a BM25 index on the column)
+hotdata search "<query>" --type bm25 --table <table> --column <column> [--select <columns>] [--limit <n>] [-o table|json|csv]
 
-# Vector search with piped embedding
-echo '[0.1, -0.2, ...]' | hotdata search --table <table> --column <column> [--limit <n>]
+# Vector search (requires a vector index with auto-embedding on the column)
+hotdata search "<query>" --type vector --table <table> --column <column> [--limit <n>]
 ```
 
-- Without `--model` and with query text: BM25 full-text search. Requires a BM25 index on the target column.
-- With `--model`: generates an embedding via OpenAI and performs vector search using `l2_distance`. Requires `OPENAI_API_KEY` env var.
-- Without query text and with piped stdin: reads a vector (raw JSON array or OpenAI embedding response) and performs vector search.
-- BM25 results are ordered by relevance score (descending). Vector results are ordered by distance (ascending).
+- **`--type vector`** — pass your query as **plain text** and name the **source text column** (e.g. `title`). The server embeds the query at query time, using the same provider that auto-embedded the column when the index was built — so distance metric, model, and dimensions all match automatically. No `OPENAI_API_KEY`, no client-side embedding, no need to know about the auto-generated `{column}_embedding` column. Generated SQL: `vector_distance(col, 'query')`, evaluated server-side.
+- **`--type bm25`** runs `bm25_search(table, col, 'query')` — requires a BM25 index on the column.
+- **No vector index, or want to use a different model than the index?** Skip `hotdata search` and use raw SQL via `hotdata query` (e.g. `SELECT *, cosine_distance(col, [<vector>]) FROM ...`). The SQL reference covers the available distance functions and table UDFs.
+- BM25 results sort by score (descending). Vector results sort by distance (ascending).
 - `--select` specifies which columns to return (comma-separated, defaults to all).
+- The previous `--model` flag and stdin-piped-vector path are **removed** — both hardcoded `l2_distance` regardless of the index's actual metric, which silently produced wrong rankings on cosine indexes. For client-side embedding or precomputed-vector workflows, use raw SQL via `hotdata query` (e.g. `SELECT *, cosine_distance(col, [<vector>]) ...`).
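+
+For example, assuming a connection table `shop.public.products` whose `title` column already carries the relevant index (names here are illustrative, not part of the CLI):
+
+```sh
+# BM25: top 5 rows by relevance
+hotdata search "wireless mouse" --type bm25 --table shop.public.products --column title --limit 5
+
+# Vector: top 5 rows by distance, query embedded server-side
+hotdata search "wireless mouse" --type vector --table shop.public.products --column title --limit 5
+```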
 
 ## Indexes
 
+Indexes attach to either a connection-table (`--connection-id` + `--schema` + `--table`) or a dataset (`--dataset-id`). The two scopes are mutually exclusive.
+
+```sh
+# Connection-table scope
+hotdata indexes list --connection-id <connection_id> --schema <schema> --table <table> [-o table|json|yaml]
+hotdata indexes create --connection-id <connection_id> --schema <schema> --table <table> \
+    --name <name> --columns <columns> --type sorted|bm25|vector \
+    [--metric l2|cosine|dot] [--async] \
+    [--embedding-provider-id <provider_id>] [--dimensions <n>] [--output-column <column>] [--description <text>]
+hotdata indexes delete --connection-id <connection_id> --schema <schema> --table <table> --name <name>
+
+# Dataset scope
+hotdata indexes list --dataset-id <dataset_id> [-o table|json|yaml]
+hotdata indexes create --dataset-id <dataset_id> --name <name> --columns <columns> --type sorted|bm25|vector ...
+hotdata indexes delete --dataset-id <dataset_id> --name <name>
+```
+
+- `--type` is **required** — choose `sorted` (B-tree-like), `bm25` (full-text), or `vector` (similarity).
+- `--type vector` requires exactly one column.
+- `--async` submits index creation as a background job and returns a job ID; poll with `hotdata jobs <job_id>`.
+- **Auto-embedding (text → vector):** when `--type vector` is used on a text column, embeddings are generated automatically. The embedding provider can be specified with `--embedding-provider-id`; if omitted, the first system provider is used. The generated column defaults to `{column}_embedding` and can be overridden with `--output-column`.
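+
+A typical auto-embedding flow (connection ID and names are illustrative): create a vector index over a text column, poll the job, and drop the index when it is no longer needed.
+
+```sh
+hotdata indexes create --connection-id conn_abc123 --schema public --table products \
+    --name idx_title_vec --columns title --type vector --metric cosine --async
+hotdata jobs <job_id>   # poll until succeeded
+hotdata indexes delete --connection-id conn_abc123 --schema public --table products --name idx_title_vec
+```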
+
+## Embedding providers
+
 ```sh
-hotdata indexes list --connection-id <connection_id> --schema <schema> --table <table> [--workspace-id <workspace_id>] [--format table|json|yaml]
-hotdata indexes create --connection-id <connection_id> --schema <schema> --table <table> --name <name> --columns <columns> [--type sorted|bm25|vector] [--metric l2|cosine|dot] [--async]
+hotdata embedding-providers list [-o table|json|yaml]
+hotdata embedding-providers get <provider_id> [-o table|json|yaml]
+hotdata embedding-providers create --name <name> --provider-type service|local \
+    [--config '<json>'] [--provider-api-key <key> | --secret-name <secret_name>]
+hotdata embedding-providers update <provider_id> [--name <name>] [--config '<json>'] \
+    [--provider-api-key <key> | --secret-name <secret_name>]
+hotdata embedding-providers delete <provider_id>
 ```
 
-- `list` shows indexes on a table with name, type, columns, status, and creation date.
-- `create` creates an index. Use `--type bm25` for full-text search, `--type vector` for vector search (requires `--metric`).
-- `--async` submits index creation as a background job.
+- `list`/`get` show registered providers (system providers like `sys_emb_openai` come pre-configured).
+- `--provider-api-key` auto-creates a managed secret for the provider; `--secret-name` references an existing secret. They are mutually exclusive.
+- `--provider-api-key` is named to pair with `--provider-type` and to avoid colliding with the global `--api-key` flag (Hotdata auth).
 
 ## Results
 
@@ -239,7 +275,7 @@ hotdata jobs <job_id> [--workspace-id <workspace_id>] [--format table|json|yaml]
 ```
 
 - `list` shows only active jobs (`pending` and `running`) by default. Use `--all` to see all jobs.
-- `--job-type` accepts: `data_refresh_table`, `data_refresh_connection`, `create_index`.
+- `--job-type` accepts: `data_refresh_table`, `data_refresh_connection`, `dataset_refresh`, `create_index`, `create_dataset_index`.
 - `--status` accepts: `pending`, `running`, `succeeded`, `partially_succeeded`, `failed`.
 
 ## Sandboxes
diff --git a/skills/hotdata/SKILL.md b/skills/hotdata/SKILL.md
index 576a748..626868a 100644
--- a/skills/hotdata/SKILL.md
+++ b/skills/hotdata/SKILL.md
@@ -94,12 +94,15 @@ hotdata connections <connection_id> [--workspace-id <workspace_id>] [--output ta
 - `list` returns `id`, `name`, `source_type` for each connection.
 - Pass a connection ID to view details (id, name, source type, table counts).
 
-### Refresh connection schema
+### Refresh connection schema or data
 ```
-hotdata connections refresh <connection_id> [--workspace-id <workspace_id>]
+hotdata connections refresh <connection_id> [--workspace-id <workspace_id>] [--data] [--schema <schema> --table <table>] [--async] [--include-uncached]
 ```
-- Refreshes the connection’s catalog so new or changed tables and columns appear in `hotdata tables list` and queries.
-- Use after DDL or other changes in the source database when the workspace view is stale.
+- Default (no flags) refreshes the connection’s catalog so new or changed tables and columns appear in `hotdata tables list` and queries. Use after DDL or other changes in the source database when the workspace view is stale.
+- `--data` re-syncs cached row data from the source instead of refreshing the catalog.
+- `--schema` and `--table` narrow a data refresh to a single table (must be supplied together).
+- `--async` submits a data refresh as a background job and returns a job ID; poll with `hotdata jobs <job_id>`. Only valid with `--data` — schema refresh is always synchronous.
+- `--include-uncached` includes tables that haven't been cached yet in a connection-wide data refresh. Only valid with `--data` and no `--table`.
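+
+For example, to re-sync one table's cached rows in the background (connection ID, schema, and table are illustrative):
+
+```
+hotdata connections refresh conn_abc123 --data --schema public --table orders --async
+hotdata jobs <job_id>   # poll the returned job until it reports succeeded
+```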
 
 ### Create a Connection
 
@@ -212,6 +215,14 @@ hotdata datasets create --label "My Dataset" --upload-id <upload_id> [--format c
 - `--table-name` is optional — derived from the label if omitted.
 - After **`datasets create`**, the CLI prints a **`full_name`** line (for example `datasets.main.my_table` or `datasets.s_ufmblmvq.tac_csat`). **Always use that `full_name` in SQL**—do not assume `datasets.main`.
+#### Refresh a dataset
+```
+hotdata datasets refresh <dataset_id> [--workspace-id <workspace_id>] [--async]
+```
+- Re-runs the dataset's source (URL fetch or saved query) and creates a **new version**. Use after the upstream source has changed.
+- **Not supported for upload-source datasets** — those have no remote source to re-pull from. The CLI surfaces the server's `400` directly.
+- `--async` submits the refresh as a background job and returns a `job_id`; poll with **`hotdata jobs <job_id>`**.
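+
+Example (dataset ID illustrative) — refresh a URL-sourced dataset in the background, then poll the returned job:
+
+```
+hotdata datasets refresh ds_abc123 --async
+hotdata jobs <job_id>
+```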
+
 #### Querying datasets
 
 Qualified dataset tables are **`datasets.<schema>.<table>`**: **`main`** for workspace-scoped datasets (created outside a sandbox), or the **sandbox id** for sandbox-created data (e.g. `datasets.s_ufmblmvq.tac_csat`). The create output’s **`full_name`** is authoritative—copy it into `FROM` / `JOIN` clauses instead of guessing `datasets.main.…`.
 
@@ -286,32 +297,59 @@ These commands use the **active workspace only** (the `queries` command has no `
 
 To create a dataset from a **saved query** still registered for the workspace, use **`hotdata datasets create --query-id <query_id>`** (this CLI does not expose separate saved-query create/run subcommands).
 
 ### Search
-```
-# BM25 full-text search
-hotdata search "query text" --table <table> --column <column> [--select <columns>] [--limit <n>] [--output table|json|csv]
 
-# Vector search with --model (calls OpenAI to embed the query)
-hotdata search "query text" --table <table> --column <column> --model text-embedding-3-small [--limit <n>]
+`--type` is **required**. Pass `vector` or `bm25`. Both run entirely server-side.
+
+```
+# BM25 full-text search (requires BM25 index on the column)
+hotdata search "<query>" --type bm25 --table <table> --column <column> [--select <columns>] [--limit <n>] [--output table|json|csv]
 
-# Vector search with piped embedding
-echo '[0.1, -0.2, ...]' | hotdata search --table <table> --column <column> [--limit <n>]
+# Vector similarity search via server-side auto-embed (requires a vector index on the column)
+hotdata search "<query>" --type vector --table <table> --column <column> [--limit <n>]
 ```
 
-- Without `--model` and with query text: BM25 full-text search. Requires a BM25 index on the target column.
-- With `--model`: generates an embedding via OpenAI and performs vector search using `l2_distance`. Requires `OPENAI_API_KEY` env var. Supported models: `text-embedding-3-small`, `text-embedding-3-large`.
-- Without query text and with piped stdin: reads a vector (raw JSON array or OpenAI embedding response) and performs vector search.
-- BM25 results are ordered by relevance score (descending). Vector results are ordered by distance (ascending).
+- **`--type vector`** — pass the query as **plain text** and name the **source text column** (e.g. `title`). The server embeds the query at query time, using the same provider that auto-embedded the column when the index was built — distance metric, model, and dimensions match automatically. No client-side embedding, no `OPENAI_API_KEY` required. Generated SQL: `vector_distance(col, 'text')`.
+- **`--type bm25`** generates `bm25_search(table, col, 'text')` server-side; requires a BM25 index on the column.
+- **No vector index on the column, or want a different embedding model?** `hotdata search` won't help — drop down to raw SQL via `hotdata query` (e.g. `SELECT *, cosine_distance(col, [<vector>]) FROM ...`; see the sketch after this list). The SQL reference covers available distance functions and table UDFs.
+- BM25 results sort by score (descending). Vector results sort by distance (ascending).
 - `--select` specifies which columns to return (comma-separated, defaults to all).
 - Default limit is 10.
-- **For BM25 search, create a BM25 index on the target column first. For vector search, create a vector index.**
+- **For BM25 search, create a BM25 index on the target column first (`hotdata indexes create ... --type bm25`). For vector search, create a vector index, optionally with auto-embedding on a text column.**
+- The earlier `--model` flag and stdin-piped-vector path have both been removed. They hardcoded `l2_distance` regardless of the index's metric (silently wrong on cosine indexes). For client-side embedding or precomputed-vector workflows, use raw SQL via `hotdata query`.
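+
+A sketch of that raw-SQL fallback (table name, embedding column, and vector literal are illustrative — in practice the vector comes from your own embedding call):
+
+```
+hotdata query "SELECT id, title, cosine_distance(title_embedding, [0.12, -0.03, 0.98]) AS dist FROM conn.public.products ORDER BY dist LIMIT 5"
+```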
 
 ### Indexes
+
+Indexes attach to either a connection-table (`--connection-id` + `--schema` + `--table`) or a dataset (`--dataset-id`) — the two scopes are mutually exclusive. `--type` is required (no default).
+
+```
+# Connection-table scope
+hotdata indexes list --connection-id <connection_id> --schema <schema> --table <table> [--workspace-id <workspace_id>] [--output table|json|yaml]
+hotdata indexes create --connection-id <connection_id> --schema <schema> --table <table> \
+    --name <name> --columns <columns> --type sorted|bm25|vector \
+    [--metric l2|cosine|dot] [--async] \
+    [--embedding-provider-id <provider_id>] [--dimensions <n>] [--output-column <column>] [--description <text>]
+hotdata indexes delete --connection-id <connection_id> --schema <schema> --table <table> --name <name>
+
+# Dataset scope (positional dataset_id replaced by --dataset-id flag)
+hotdata indexes list --dataset-id <dataset_id> [--workspace-id <workspace_id>] [--output table|json|yaml]
+hotdata indexes create --dataset-id <dataset_id> --name <name> --columns <columns> --type sorted|bm25|vector ...
+hotdata indexes delete --dataset-id <dataset_id> --name <name>
+```
+- `--type` accepts `sorted` (B-tree-like; range/exact lookups), `bm25` (full-text), or `vector` (similarity). It is **required**.
+- `--type vector` requires exactly one column.
+- `--async` submits index creation as a background job; poll with `hotdata jobs <job_id>`.
+- **Auto-embedding:** with `--type vector` on a **text** column, the server generates embeddings automatically. Pass `--embedding-provider-id` to pick a specific provider; if omitted, the first system provider is used. The generated column defaults to `{column}_embedding` (override with `--output-column`).
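+
+End-to-end sketch (IDs and names are illustrative): create an auto-embedding vector index on a dataset's text column, wait for the job, then search it.
+
+```
+hotdata indexes create --dataset-id ds_abc123 --name idx_title_vec --columns title --type vector --async
+hotdata jobs <job_id>   # wait until the job reports succeeded
+hotdata search "running shoes" --type vector --table datasets.main.products --column title
+```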
+
+### Embedding providers
+```
-hotdata indexes list --connection-id <connection_id> --schema <schema> --table <table> [--workspace-id <workspace_id>] [--output table|json|yaml]
-hotdata indexes create --connection-id <connection_id> --schema <schema> --table <table> --name <name> --columns <columns> [--workspace-id <workspace_id>] [--type sorted|bm25|vector] [--metric l2|cosine|dot] [--async]
+hotdata embedding-providers list [--workspace-id <workspace_id>] [--output table|json|yaml]
+hotdata embedding-providers get <provider_id> [--workspace-id <workspace_id>] [--output table|json|yaml]
+hotdata embedding-providers create --name <name> --provider-type service|local \
+    [--config '<json>'] [--provider-api-key <key> | --secret-name <secret_name>] [--workspace-id <workspace_id>]
+hotdata embedding-providers update <provider_id> [--name <name>] [--config '<json>'] [--provider-api-key <key> | --secret-name <secret_name>]
+hotdata embedding-providers delete <provider_id> [--workspace-id <workspace_id>]
 ```
-- `list` shows indexes on a table with name, type, columns, status, and creation date.
-- `create` creates an index. Use `--type bm25` for full-text search, `--type vector` for vector search (requires `--metric`).
-- `--async` submits index creation as a background job. Use `hotdata jobs <job_id>` to check status.
+- System providers (e.g. `sys_emb_openai`) come pre-configured. `list` shows IDs to pass to `--embedding-provider-id`.
+- `--provider-api-key` takes the embedding service's own key (e.g. an OpenAI `sk-...` key) and auto-creates a managed secret; the flag is named to avoid colliding with the global `--api-key` (Hotdata auth). `--secret-name` references an existing secret instead. The two flags are mutually exclusive.
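+
+For example, registering a service provider that brings its own API key (name and config values are illustrative):
+
+```
+hotdata embedding-providers create --name my-openai --provider-type service \
+    --config '{"model": "text-embedding-3-large", "base_url": "https://api.openai.com/v1"}' \
+    --provider-api-key sk-...
+```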
 
 ### Jobs
 ```
@@ -319,7 +357,7 @@ hotdata jobs list [--workspace-id <workspace_id>] [--job-type <type>] [--status
 hotdata jobs <job_id> [--workspace-id <workspace_id>] [--output table|json|yaml]
 ```
 - `list` shows only active jobs (`pending`, `running`) by default. Use `--all` to see all jobs.
-- `--job-type`: `data_refresh_table`, `data_refresh_connection`, `create_index`.
+- `--job-type`: `data_refresh_table`, `data_refresh_connection`, `dataset_refresh`, `create_index`, `create_dataset_index`.
 - `--status`: `pending`, `running`, `succeeded`, `partially_succeeded`, `failed`.
 - Use `hotdata jobs <job_id>` to inspect a specific job's status, error, and result.
 
diff --git a/src/api.rs b/src/api.rs
index 9e0e81c..4824af9 100644
--- a/src/api.rs
+++ b/src/api.rs
@@ -214,6 +214,13 @@ impl ApiClient {
         self.send(req, Some(body))
     }
 
+    /// DELETE request, exits on connection error, returns raw (status, body).
+    pub fn delete_raw(&self, path: &str) -> (reqwest::StatusCode, String) {
+        let url = format!("{}{path}", self.api_url);
+        let req = self.build_request(reqwest::Method::DELETE, &url);
+        self.send(req, None)
+    }
+
     /// PATCH request with JSON body, returns parsed response.
     pub fn patch<T: DeserializeOwned>(&self, path: &str, body: &serde_json::Value) -> T {
         let url = format!("{}{path}", self.api_url);
@@ -301,6 +308,39 @@ mod tests {
         mock.assert();
     }
 
+    #[test]
+    fn delete_raw_returns_status_and_body() {
+        let mut server = mockito::Server::new();
+        let mock = server
+            .mock("DELETE", "/widgets/abc")
+            .match_header("Authorization", "Bearer test-key")
+            .with_status(204)
+            .with_body("")
+            .create();
+
+        let api = ApiClient::test_new(&server.url(), "test-key", None);
+        let (status, body) = api.delete_raw("/widgets/abc");
+        assert_eq!(status.as_u16(), 204);
+        assert!(body.is_empty());
+        mock.assert();
+    }
+
+    #[test]
+    fn delete_raw_surfaces_error_body_on_4xx() {
+        let mut server = mockito::Server::new();
+        let mock = server
+            .mock("DELETE", "/widgets/missing")
+            .with_status(404)
+            .with_body(r#"{"error":{"message":"not found"}}"#)
+            .create();
+
+        let api = ApiClient::test_new(&server.url(), "test-key", None);
+        let (status, body) = api.delete_raw("/widgets/missing");
+        assert_eq!(status.as_u16(), 404);
+        assert!(body.contains("not found"));
+        mock.assert();
+    }
+
     #[test]
     fn get_none_if_not_found_returns_some_on_200() {
         let mut server = mockito::Server::new();
diff --git a/src/command.rs b/src/command.rs
index fbefeaf..3013c70 100644
--- a/src/command.rs
+++ b/src/command.rs
@@ -125,16 +125,38 @@ pub enum Commands {
         command: IndexesCommands,
     },
 
+    /// Manage embedding providers (OpenAI, local, etc.) used by vector indexes
+    #[command(name = "embedding-providers")]
+    EmbeddingProviders {
+        /// Workspace ID (defaults to first workspace from login)
+        #[arg(long, short = 'w', global = true)]
+        workspace_id: Option<String>,
+
+        #[command(subcommand)]
+        command: EmbeddingProvidersCommands,
+    },
+
     /// Full-text or vector search across a table column
     Search {
-        /// Search query text (omit to read a vector from stdin for vector search)
-        query: Option<String>,
+        /// Search query text — required for both --type bm25 and --type vector
+        query: String,
+
+        /// Search type — required (no default; choose deliberately)
+        ///
+        /// `vector` runs server-side `vector_distance(col, 'text')` — the server resolves the
+        /// embedding column, model, and metric from the index metadata.
+        ///
+        /// `bm25` runs server-side `bm25_search(table, col, 'text')` and requires a BM25 index
+        /// on the column.
+        #[arg(long, value_parser = ["vector", "bm25"])]
+        r#type: String,
 
         /// Table to search (connection.schema.table)
         #[arg(long)]
         table: String,
 
-        /// Column to search
+        /// Column to search. For `--type vector`, name the source text column — the server
+        /// resolves the embedding column from the index metadata.
         #[arg(long)]
         column: String,
 
@@ -146,10 +168,6 @@ pub enum Commands {
         #[arg(long, default_value = "10")]
         limit: u32,
 
-        /// Embedding model to generate a vector from the query text (e.g. text-embedding-3-small)
-        #[arg(long, value_parser = ["text-embedding-3-small", "text-embedding-3-large"])]
-        model: Option<String>,
-
         /// Workspace ID (defaults to first workspace from login)
         #[arg(long, short = 'w')]
         workspace_id: Option<String>,
@@ -248,49 +266,60 @@ pub enum AuthCommands {
 
 #[derive(Subcommand)]
 pub enum IndexesCommands {
-    /// List indexes (defaults to the whole workspace; narrow with filters)
+    /// List indexes (defaults to the whole workspace; narrow with filters or pass --dataset-id)
     List {
         /// Filter by connection ID
-        #[arg(long, short = 'c')]
+        #[arg(long, short = 'c', conflicts_with = "dataset_id")]
         connection_id: Option<String>,
 
         /// Filter by schema name
-        #[arg(long)]
+        #[arg(long, conflicts_with = "dataset_id")]
        schema: Option<String>,
 
         /// Filter by table name
-        #[arg(long)]
+        #[arg(long, conflicts_with = "dataset_id")]
         table: Option<String>,
 
+        /// List indexes for a specific dataset (alternative scope to --connection-id)
+        #[arg(long)]
+        dataset_id: Option<String>,
+
         /// Output format
         #[arg(long = "output", short = 'o', default_value = "table", value_parser = ["table", "json", "yaml"])]
         output: String,
     },
 
-    /// Create an index on a table
+    /// Create an index on a table or dataset
+    ///
+    /// Pass either connection scope (--connection-id + --schema + --table) OR
+    /// dataset scope (--dataset-id), not both.
     Create {
-        /// Connection ID
-        #[arg(long, short = 'c')]
-        connection_id: String,
+        /// Connection ID (use with --schema and --table)
+        #[arg(long, short = 'c', conflicts_with = "dataset_id", requires_all = ["schema", "table"])]
+        connection_id: Option<String>,
 
-        /// Schema name
-        #[arg(long)]
-        schema: String,
+        /// Schema name (requires --connection-id)
+        #[arg(long, requires = "connection_id")]
+        schema: Option<String>,
 
-        /// Table name
-        #[arg(long)]
-        table: String,
+        /// Table name (requires --connection-id)
+        #[arg(long, requires = "connection_id")]
+        table: Option<String>,
+
+        /// Dataset ID (alternative scope to --connection-id)
+        #[arg(long, conflicts_with_all = ["connection_id", "schema", "table"])]
+        dataset_id: Option<String>,
 
         /// Index name
         #[arg(long)]
         name: String,
 
-        /// Columns to index (comma-separated)
+        /// Columns to index (comma-separated). Vector indexes accept exactly one column.
         #[arg(long)]
         columns: String,
 
-        /// Index type
-        #[arg(long, default_value = "sorted", value_parser = ["sorted", "bm25", "vector"])]
+        /// Index type — required (no default; choose deliberately)
+        #[arg(long, value_parser = ["sorted", "bm25", "vector"])]
         r#type: String,
 
         /// Distance metric for vector indexes
@@ -300,6 +329,49 @@ pub enum IndexesCommands {
         /// Create as a background job
         #[arg(long)]
         r#async: bool,
+
+        /// Embedding provider ID — when set on a vector index over a text column,
+        /// embeddings are generated automatically. Defaults to first system provider if omitted.
+        #[arg(long = "embedding-provider-id")]
+        embedding_provider_id: Option<String>,
+
+        /// Override embedding output dimensions (vector indexes with auto-embedding only)
+        #[arg(long)]
+        dimensions: Option<u32>,
+
+        /// Custom name for the generated embedding column (defaults to `{column}_embedding`)
+        #[arg(long = "output-column")]
+        output_column: Option<String>,
+
+        /// Human-readable description of the embedding (e.g. "product titles")
+        #[arg(long)]
+        description: Option<String>,
+    },
+
+    /// Delete an index from a table or dataset
+    ///
+    /// Pass either connection scope (--connection-id + --schema + --table) OR
+    /// dataset scope (--dataset-id), not both.
+    Delete {
+        /// Connection ID (use with --schema and --table)
+        #[arg(long, short = 'c', conflicts_with = "dataset_id", requires_all = ["schema", "table"])]
+        connection_id: Option<String>,
+
+        /// Schema name (requires --connection-id)
+        #[arg(long, requires = "connection_id")]
+        schema: Option<String>,
+
+        /// Table name (requires --connection-id)
+        #[arg(long, requires = "connection_id")]
+        table: Option<String>,
+
+        /// Dataset ID (alternative scope to --connection-id)
+        #[arg(long, conflicts_with_all = ["connection_id", "schema", "table"])]
+        dataset_id: Option<String>,
+
+        /// Index name
+        #[arg(long)]
+        name: String,
+    },
 }
 
@@ -308,7 +380,7 @@ pub enum JobsCommands {
     /// List background jobs (shows active jobs by default)
     List {
         /// Filter by job type
-        #[arg(long, value_parser = ["data_refresh_table", "data_refresh_connection", "create_index"])]
+        #[arg(long, value_parser = ["data_refresh_table", "data_refresh_connection", "dataset_refresh", "create_index", "create_dataset_index"])]
         job_type: Option<String>,
 
         /// Filter by status
@@ -402,6 +474,16 @@ pub enum DatasetsCommands {
         #[arg(long = "output", short = 'o', default_value = "table", value_parser = ["table", "json", "yaml"])]
         output: String,
     },
+
+    /// Refresh a dataset by re-running its source (URL fetch or saved query) and creating a new version
+    Refresh {
+        /// Dataset ID
+        id: String,
+
+        /// Submit as a background job
+        #[arg(long)]
+        r#async: bool,
+    },
 }
 
 #[derive(Subcommand)]
@@ -467,10 +549,30 @@ pub enum ConnectionsCommands {
         output: String,
     },
 
-    /// Refresh a connection's schema
+    /// Refresh a connection's schema or data
     Refresh {
         /// Connection ID
         connection_id: String,
+
+        /// Refresh data instead of schema metadata
+        #[arg(long)]
+        data: bool,
+
+        /// Narrow refresh to a specific schema (requires --table for data refresh)
+        #[arg(long)]
+        schema: Option<String>,
+
+        /// Narrow refresh to a specific table (requires --schema)
+        #[arg(long)]
+        table: Option<String>,
+
+        /// Submit as a background job (only valid with --data)
+        #[arg(long)]
+        r#async: bool,
+
+        /// Include uncached tables in connection-wide data refresh (only with --data, no --table)
+        #[arg(long = "include-uncached")]
+        include_uncached: bool,
     },
 }
 
@@ -664,3 +766,86 @@ pub enum TablesCommands {
         output: String,
     },
 }
+
+#[derive(Subcommand)]
+pub enum EmbeddingProvidersCommands {
+    /// List embedding providers
+    List {
+        /// Output format
+        #[arg(long = "output", short = 'o', default_value = "table", value_parser = ["table", "json", "yaml"])]
+        output: String,
+    },
+
+    /// Show details for a specific embedding provider
+    Get {
+        /// Provider ID
+        id: String,
+
+        /// Output format
+        #[arg(long = "output", short = 'o', default_value = "table", value_parser = ["table", "json", "yaml"])]
+        output: String,
+    },
+
+    /// Create a new embedding provider
+    Create {
+        /// Provider name (must be unique within the workspace)
+        #[arg(long)]
+        name: String,
+
+        /// Provider type ("local" or "service")
+        #[arg(long, value_parser = ["local", "service"])]
+        provider_type: String,
+
+        /// Provider-specific config as a JSON string (model, base_url, dimensions, etc.)
+        #[arg(long)]
+        config: Option<String>,
+
+        /// The provider's own API key (e.g. an OpenAI sk-... key). Auto-creates a
+        /// managed secret. Mutually exclusive with --secret-name. Named
+        /// `--provider-api-key` to pair with `--provider-type` and to avoid colliding
+        /// with the global `--api-key` (Hotdata auth) flag.
+        #[arg(long = "provider-api-key", conflicts_with = "secret_name")]
+        provider_api_key: Option<String>,
+
+        /// Reference an existing secret by name (for service providers)
+        #[arg(long)]
+        secret_name: Option<String>,
+
+        /// Output format
+        #[arg(long = "output", short = 'o', default_value = "table", value_parser = ["table", "json", "yaml"])]
+        output: String,
+    },
+
+    /// Update an embedding provider's name, config, or secret
+    Update {
+        /// Provider ID
+        id: String,
+
+        /// New name
+        #[arg(long)]
+        name: Option<String>,
+
+        /// New config as a JSON string
+        #[arg(long)]
+        config: Option<String>,
+
+        /// New provider API key (replaces or creates the managed secret).
+        /// See `embedding-providers create --provider-api-key` for naming rationale.
+        #[arg(long = "provider-api-key", conflicts_with = "secret_name")]
+        provider_api_key: Option<String>,
+
+        /// New secret name to reference
+        #[arg(long)]
+        secret_name: Option<String>,
+
+        /// Output format
+        #[arg(long = "output", short = 'o', default_value = "table", value_parser = ["table", "json", "yaml"])]
+        output: String,
+    },
+
+    /// Delete an embedding provider
+    Delete {
+        /// Provider ID
+        id: String,
+    },
+}
diff --git a/src/connections.rs b/src/connections.rs
index 9909cf0..fcf011e 100644
--- a/src/connections.rs
+++ b/src/connections.rs
@@ -344,21 +344,122 @@ pub fn list(workspace_id: &str, format: &str) {
     }
 }
 
-pub fn refresh(workspace_id: &str, connection_id: &str) {
-    let body = serde_json::json!({
+pub fn refresh(
+    workspace_id: &str,
+    connection_id: &str,
+    data: bool,
+    schema: Option<&str>,
+    table: Option<&str>,
+    async_mode: bool,
+    include_uncached: bool,
+) {
+    use crossterm::style::Stylize;
+
+    if async_mode && !data {
+        eprintln!(
+            "{}",
+            "--async only valid with --data (schema refresh is always synchronous)".red()
+        );
+        std::process::exit(1);
+    }
+    if include_uncached && !data {
+        eprintln!("{}", "--include-uncached only valid with --data".red());
+        std::process::exit(1);
+    }
+    if include_uncached && table.is_some() {
+        eprintln!(
+            "{}",
+            "--include-uncached cannot be combined with --table (it only applies to connection-wide refresh)".red()
+        );
+        std::process::exit(1);
+    }
+    if table.is_some() && schema.is_none() {
+        eprintln!("{}", "--table requires --schema".red());
+        std::process::exit(1);
+    }
+    if data && schema.is_some() && table.is_none() {
+        eprintln!(
+            "{}",
+            "--schema requires --table for data refresh (no schema-scoped data refresh)".red()
+        );
+        std::process::exit(1);
+    }
+
+    let mut body = serde_json::json!({
         "connection_id": connection_id,
-        "data": false,
+        "data": data,
     });
+    if let Some(s) = schema {
+        body["schema_name"] = serde_json::Value::String(s.to_string());
+    }
+    if let Some(t) = table {
+        body["table_name"] = serde_json::Value::String(t.to_string());
+    }
+    if async_mode {
+        body["async"] = serde_json::Value::Bool(true);
+    }
+    if include_uncached {
+        body["include_uncached"] = serde_json::Value::Bool(true);
+    }
 
     let api = ApiClient::new(Some(workspace_id));
     let (status, resp_body) = api.post_raw("/refresh", &body);
 
     if !status.is_success() {
-        use crossterm::style::Stylize;
         eprintln!("{}", crate::util::api_error(resp_body).red());
         std::process::exit(1);
     }
 
-    use crossterm::style::Stylize;
-    println!("{}", "Schema refresh completed.".green());
+    let parsed: serde_json::Value = serde_json::from_str(&resp_body).unwrap_or_default();
+
+    if async_mode {
+        let job_id = parsed["id"].as_str().unwrap_or("unknown");
+        println!("{}", "Data refresh submitted.".green());
+        println!("job_id: {}", job_id);
+        println!(
+            "{}",
+            format!("Use 
'hotdata jobs {}' to check status.", job_id).dark_grey() + ); + return; + } + + if !data { + let discovered = parsed["tables_discovered"].as_u64().unwrap_or(0); + let added = parsed["tables_added"].as_u64().unwrap_or(0); + let modified = parsed["tables_modified"].as_u64().unwrap_or(0); + println!("{}", "Schema refresh completed.".green()); + println!( + "{}", + format!(" tables: {discovered} discovered, {added} added, {modified} modified") + .dark_grey() + ); + return; + } + + if let Some(rows) = parsed["rows_synced"].as_u64() { + let dur = parsed["duration_ms"].as_u64().unwrap_or(0); + println!("{}", "Data refresh completed.".green()); + println!("{}", format!(" {rows} rows synced ({dur} ms)").dark_grey()); + } else { + let refreshed = parsed["tables_refreshed"].as_u64().unwrap_or(0); + let failed = parsed["tables_failed"].as_u64().unwrap_or(0); + let total = parsed["total_rows"].as_u64().unwrap_or(0); + let dur = parsed["duration_ms"].as_u64().unwrap_or(0); + println!("{}", "Data refresh completed.".green()); + println!( + "{}", + format!( + " {refreshed} tables refreshed, {failed} failed, {total} total rows ({dur} ms)" + ) + .dark_grey() + ); + if let Some(errors) = parsed["errors"].as_array() { + if !errors.is_empty() { + eprintln!("{}", format!(" {} error(s):", errors.len()).yellow()); + for err in errors { + eprintln!(" {}", err); + } + } + } + } } diff --git a/src/datasets.rs b/src/datasets.rs index 0dff510..bab4604 100644 --- a/src/datasets.rs +++ b/src/datasets.rs @@ -553,6 +553,47 @@ pub fn update( } } +pub fn refresh(workspace_id: &str, dataset_id: &str, async_mode: bool) { + use crossterm::style::Stylize; + + let mut body = json!({ + "dataset_id": dataset_id, + }); + if async_mode { + body["async"] = json!(true); + } + + let api = ApiClient::new(Some(workspace_id)); + let (status, resp_body) = api.post_raw("/refresh", &body); + + if !status.is_success() { + eprintln!("{}", crate::util::api_error(resp_body).red()); + std::process::exit(1); + } + + let parsed: serde_json::Value = serde_json::from_str(&resp_body).unwrap_or_default(); + + if async_mode { + let job_id = parsed["id"].as_str().unwrap_or("unknown"); + println!("{}", "Dataset refresh submitted.".green()); + println!("job_id: {}", job_id); + println!( + "{}", + format!("Use 'hotdata jobs {}' to check status.", job_id).dark_grey() + ); + return; + } + + let id = parsed["id"].as_str().unwrap_or("unknown"); + let version = parsed["version"].as_i64().unwrap_or(0); + let dataset_status = parsed["status"].as_str().unwrap_or(""); + println!("{}", "Dataset refresh completed.".green()); + println!( + "{}", + format!(" id: {id}, version: {version}, status: {dataset_status}").dark_grey() + ); +} + #[cfg(test)] mod tests { use super::*; diff --git a/src/embedding.rs b/src/embedding.rs deleted file mode 100644 index 29b362e..0000000 --- a/src/embedding.rs +++ /dev/null @@ -1,123 +0,0 @@ -use serde_json::Value; - -/// Try to parse a vector from stdin. Accepts either: -/// - A raw JSON array of numbers: [0.1, -0.2, ...] 
-/// - An OpenAI-compatible response: {"data": [{"embedding": [...]}]}
-pub fn read_vector_from_stdin() -> Vec<f64> {
-    use std::io::Read;
-    let mut input = String::new();
-    std::io::stdin()
-        .read_to_string(&mut input)
-        .unwrap_or_else(|e| {
-            eprintln!("error reading stdin: {e}");
-            std::process::exit(1);
-        });
-
-    let input = input.trim();
-    if input.is_empty() {
-        eprintln!("error: no vector provided on stdin");
-        std::process::exit(1);
-    }
-
-    let parsed: Value = match serde_json::from_str(input) {
-        Ok(v) => v,
-        Err(e) => {
-            eprintln!("error parsing vector from stdin: {e}");
-            std::process::exit(1);
-        }
-    };
-
-    extract_vector(&parsed)
-}
-
-/// Extract a float vector from either a raw JSON array or an OpenAI embedding response.
-fn extract_vector(value: &Value) -> Vec<f64> {
-    // Raw array: [0.1, -0.2, ...]
-    if let Some(arr) = value.as_array() {
-        return parse_float_array(arr);
-    }
-
-    // OpenAI response: {"data": [{"embedding": [...]}]}
-    if let Some(embedding) = value
-        .get("data")
-        .and_then(|d| d.get(0))
-        .and_then(|d| d.get("embedding"))
-        .and_then(|e| e.as_array())
-    {
-        return parse_float_array(embedding);
-    }
-
-    eprintln!("error: stdin must be a JSON array of numbers or an OpenAI embedding response");
-    std::process::exit(1);
-}
-
-fn parse_float_array(arr: &[Value]) -> Vec<f64> {
-    arr.iter()
-        .enumerate()
-        .map(|(i, v)| {
-            v.as_f64().unwrap_or_else(|| {
-                eprintln!("error: vector element {i} is not a number: {v}");
-                std::process::exit(1);
-            })
-        })
-        .collect()
-}
-
-/// Call the OpenAI embeddings API to generate a vector from text.
-pub fn openai_embed(text: &str, model: &str) -> Vec<f64> {
-    let api_key = match std::env::var("OPENAI_API_KEY") {
-        Ok(k) if !k.is_empty() => k,
-        _ => {
-            eprintln!("error: OPENAI_API_KEY environment variable is not set");
-            std::process::exit(1);
-        }
-    };
-
-    let body = serde_json::json!({
-        "input": text,
-        "model": model,
-    });
-
-    let client = reqwest::blocking::Client::new();
-    let req = client
-        .post("https://api.openai.com/v1/embeddings")
-        .header("Authorization", format!("Bearer {api_key}"))
-        .json(&body);
-    let (status, resp_body) = match crate::util::send_debug(&client, req, Some(&body)) {
-        Ok(pair) => pair,
-        Err(e) => {
-            eprintln!("error connecting to OpenAI API: {e}");
-            std::process::exit(1);
-        }
-    };
-
-    if !status.is_success() {
-        let message = serde_json::from_str::<Value>(&resp_body)
-            .ok()
-            .and_then(|v| v["error"]["message"].as_str().map(str::to_string))
-            .unwrap_or(resp_body);
-        eprintln!("error from OpenAI API ({status}): {message}");
-        std::process::exit(1);
-    }
-
-    let parsed: Value = match serde_json::from_str(&resp_body) {
-        Ok(v) => v,
-        Err(e) => {
-            eprintln!("error parsing OpenAI response: {e}");
-            std::process::exit(1);
-        }
-    };
-
-    extract_vector(&parsed)
-}
-
-/// Format a vector as a SQL ARRAY literal: ARRAY[0.1,-0.2,...]
-pub fn vector_to_sql(vec: &[f64]) -> String {
-    format!(
-        "ARRAY[{}]",
-        vec.iter()
-            .map(|v| v.to_string())
-            .collect::<Vec<String>>()
-            .join(",")
-    )
-}
diff --git a/src/embedding_providers.rs b/src/embedding_providers.rs
new file mode 100644
index 0000000..335a741
--- /dev/null
+++ b/src/embedding_providers.rs
@@ -0,0 +1,264 @@
+use crate::api::ApiClient;
+use serde::{Deserialize, Serialize};
+
+#[derive(Deserialize, Serialize)]
+struct Provider {
+    id: String,
+    name: String,
+    provider_type: String,
+    config: serde_json::Value,
+    has_secret: bool,
+    source: String,
+    created_at: String,
+    updated_at: String,
+}
+
+#[derive(Deserialize)]
+struct ListResponse {
+    embedding_providers: Vec<Provider>,
+}
+
+fn parse_config(raw: Option<&str>) -> Option<serde_json::Value> {
+    use crossterm::style::Stylize;
+    raw.map(|s| match serde_json::from_str::<serde_json::Value>(s) {
+        Ok(v) => v,
+        Err(e) => {
+            eprintln!("{}", format!("--config is not valid JSON: {e}").red());
+            std::process::exit(1);
+        }
+    })
+}
+
+pub fn list(workspace_id: &str, format: &str) {
+    let api = ApiClient::new(Some(workspace_id));
+    let body: ListResponse = api.get("/embedding-providers");
+
+    use crossterm::style::Stylize;
+    match format {
+        "json" => println!(
+            "{}",
+            serde_json::to_string_pretty(&body.embedding_providers).unwrap()
+        ),
+        "yaml" => print!(
+            "{}",
+            serde_yaml::to_string(&body.embedding_providers).unwrap()
+        ),
+        "table" => {
+            if body.embedding_providers.is_empty() {
+                eprintln!("{}", "No embedding providers found.".dark_grey());
+                return;
+            }
+            let rows: Vec<Vec<String>> = body
+                .embedding_providers
+                .iter()
+                .map(|p| {
+                    vec![
+                        p.id.clone(),
+                        p.name.clone(),
+                        p.provider_type.clone(),
+                        p.source.clone(),
+                        if p.has_secret { "yes" } else { "no" }.to_string(),
+                    ]
+                })
+                .collect();
+            crate::table::print(&["ID", "NAME", "TYPE", "SOURCE", "SECRET"], &rows);
+        }
+        _ => unreachable!(),
+    }
+}
+
+pub fn get(workspace_id: &str, id: &str, format: &str) {
+    let api = ApiClient::new(Some(workspace_id));
+    let p: Provider = api.get(&format!("/embedding-providers/{id}"));
+
+    match format {
+        "json" => println!("{}", serde_json::to_string_pretty(&p).unwrap()),
+        "yaml" => print!("{}", serde_yaml::to_string(&p).unwrap()),
+        "table" => {
+            println!("id: {}", p.id);
+            println!("name: {}", p.name);
+            println!("type: {}", p.provider_type);
+            println!("source: {}", p.source);
+            println!("has_secret: {}", p.has_secret);
+            println!("created_at: {}", crate::util::format_date(&p.created_at));
+            println!("updated_at: {}", crate::util::format_date(&p.updated_at));
+            println!(
+                "config: {}",
+                serde_json::to_string_pretty(&p.config).unwrap_or_default()
+            );
+        }
+        _ => unreachable!(),
+    }
+}
+
+#[allow(clippy::too_many_arguments)]
+pub fn create(
+    workspace_id: &str,
+    name: &str,
+    provider_type: &str,
+    config: Option<&str>,
+    api_key: Option<&str>,
+    secret_name: Option<&str>,
+    format: &str,
+) {
+    use crossterm::style::Stylize;
+
+    let api = ApiClient::new(Some(workspace_id));
+    let mut body = serde_json::json!({
+        "name": name,
+        "provider_type": provider_type,
+    });
+    if let Some(cfg) = parse_config(config) {
+        body["config"] = cfg;
+    }
+    if let Some(k) = api_key {
+        body["api_key"] = serde_json::json!(k);
+    }
+    if let Some(s) = secret_name {
+        body["secret_name"] = serde_json::json!(s);
+    }
+
+    let (status, resp_body) = api.post_raw("/embedding-providers", &body);
+    if !status.is_success() {
+        eprintln!("{}", crate::util::api_error(resp_body).red());
+        std::process::exit(1);
+    }
+
+    let parsed: serde_json::Value = serde_json::from_str(&resp_body).unwrap_or_default();
+    
eprintln!("{}", "Embedding provider created.".green()); + match format { + "json" => println!("{}", serde_json::to_string_pretty(&parsed).unwrap()), + "yaml" => print!("{}", serde_yaml::to_string(&parsed).unwrap()), + "table" => { + println!("id: {}", parsed["id"].as_str().unwrap_or("")); + println!("name: {}", parsed["name"].as_str().unwrap_or("")); + println!( + "type: {}", + parsed["provider_type"].as_str().unwrap_or("") + ); + } + _ => unreachable!(), + } +} + +pub fn update( + workspace_id: &str, + id: &str, + name: Option<&str>, + config: Option<&str>, + api_key: Option<&str>, + secret_name: Option<&str>, + format: &str, +) { + use crossterm::style::Stylize; + + if name.is_none() && config.is_none() && api_key.is_none() && secret_name.is_none() { + eprintln!( + "{}", + "error: provide at least one of --name, --config, --provider-api-key, --secret-name.".red() + ); + std::process::exit(1); + } + + let api = ApiClient::new(Some(workspace_id)); + let mut body = serde_json::json!({}); + if let Some(n) = name { + body["name"] = serde_json::json!(n); + } + if let Some(cfg) = parse_config(config) { + body["config"] = cfg; + } + if let Some(k) = api_key { + body["api_key"] = serde_json::json!(k); + } + if let Some(s) = secret_name { + body["secret_name"] = serde_json::json!(s); + } + + let resp: serde_json::Value = api.put(&format!("/embedding-providers/{id}"), &body); + eprintln!("{}", "Embedding provider updated.".green()); + match format { + "json" => println!("{}", serde_json::to_string_pretty(&resp).unwrap()), + "yaml" => print!("{}", serde_yaml::to_string(&resp).unwrap()), + "table" => { + println!("id: {}", resp["id"].as_str().unwrap_or("")); + println!("name: {}", resp["name"].as_str().unwrap_or("")); + if let Some(updated_at) = resp["updated_at"].as_str() { + println!("updated_at: {}", crate::util::format_date(updated_at)); + } + } + _ => unreachable!(), + } +} + +pub fn delete(workspace_id: &str, id: &str) { + use crossterm::style::Stylize; + let api = ApiClient::new(Some(workspace_id)); + let (status, resp_body) = api.delete_raw(&format!("/embedding-providers/{id}")); + if !status.is_success() { + eprintln!("{}", crate::util::api_error(resp_body).red()); + std::process::exit(1); + } + println!("{}", format!("Embedding provider '{id}' deleted.").green()); +} + +#[cfg(test)] +mod tests { + use super::*; + + /// Mirrors runtimedb's `EmbeddingProviderResponse` (see `runtimedb/openapi.yaml`). + /// If the server response shape changes, update this fixture in lockstep. + #[test] + fn provider_deserializes_runtimedb_payload() { + let body = serde_json::json!({ + "id": "sys_emb_openai", + "name": "openai", + "provider_type": "service", + "config": { + "base_url": "https://api.openai.com/v1", + "metric": "cosine", + "model": "text-embedding-3-small" + }, + "has_secret": true, + "source": "system", + "created_at": "2026-04-29T08:19:57.083658085Z", + "updated_at": "2026-04-29T08:19:57.083658085Z" + }); + let p: Provider = serde_json::from_value(body).unwrap(); + assert_eq!(p.id, "sys_emb_openai"); + assert_eq!(p.provider_type, "service"); + assert_eq!(p.source, "system"); + assert!(p.has_secret); + assert_eq!(p.config["model"], "text-embedding-3-small"); + } + + /// Mirrors runtimedb's `ListEmbeddingProvidersResponse`. 
+    #[test]
+    fn list_response_deserializes_runtimedb_payload() {
+        let body = serde_json::json!({
+            "embedding_providers": [
+                {
+                    "id": "sys_emb_openai",
+                    "name": "openai",
+                    "provider_type": "service",
+                    "config": {},
+                    "has_secret": true,
+                    "source": "system",
+                    "created_at": "2026-04-29T08:19:57Z",
+                    "updated_at": "2026-04-29T08:19:57Z"
+                }
+            ]
+        });
+        let resp: ListResponse = serde_json::from_value(body).unwrap();
+        assert_eq!(resp.embedding_providers.len(), 1);
+        assert_eq!(resp.embedding_providers[0].name, "openai");
+    }
+
+    #[test]
+    fn parse_config_accepts_valid_json() {
+        // parse_config exits the process on invalid JSON, so only the success path
+        // is verified here.
+        let parsed = parse_config(Some(r#"{"key":"value"}"#));
+        assert_eq!(parsed.unwrap()["key"], "value");
+        assert!(parse_config(None).is_none());
+    }
+}
diff --git a/src/indexes.rs b/src/indexes.rs
index 8629311..b9473fa 100644
--- a/src/indexes.rs
+++ b/src/indexes.rs
@@ -128,6 +128,12 @@ fn list_one_table(api: &ApiClient, connection_id: &str, schema: &str, table: &st
     body.indexes
 }
 
+fn list_one_dataset(api: &ApiClient, dataset_id: &str) -> Vec<Index> {
+    let path = format!("/datasets/{dataset_id}/indexes");
+    let body: ListResponse = api.get(&path);
+    body.indexes
+}
+
 fn list_one_table_scan(
     api: &ApiClient,
     connection_id: &str,
@@ -146,12 +152,24 @@ pub fn list(
     connection_id: Option<&str>,
     schema: Option<&str>,
     table: Option<&str>,
+    dataset_id: Option<&str>,
     format: &str,
 ) {
     let api = ApiClient::new(Some(workspace_id));
 
-    let (rows, multi_table) = match (connection_id, schema, table) {
-        (Some(cid), Some(sch), Some(tbl)) => {
+    let (rows, multi_table) = match (dataset_id, connection_id, schema, table) {
+        (Some(did), _, _, _) => {
+            let indexes = list_one_dataset(&api, did);
+            let rows: Vec<IndexRow> = indexes
+                .into_iter()
+                .map(|i| IndexRow {
+                    inner: i,
+                    table: None,
+                })
+                .collect();
+            (rows, false)
+        }
+        (None, Some(cid), Some(sch), Some(tbl)) => {
             let indexes = list_one_table(&api, cid, sch, tbl);
             let rows: Vec<IndexRow> = indexes
                 .into_iter()
                 .map(|i| IndexRow {
@@ -243,21 +261,85 @@ pub fn list(
     }
 }
 
+/// Where an index is being created or deleted.
+pub enum IndexScope<'a> {
+    Connection {
+        connection_id: &'a str,
+        schema: &'a str,
+        table: &'a str,
+    },
+    Dataset {
+        dataset_id: &'a str,
+    },
+}
+
+impl IndexScope<'_> {
+    fn create_path(&self) -> String {
+        match self {
+            IndexScope::Connection {
+                connection_id,
+                schema,
+                table,
+            } => format!("/connections/{connection_id}/tables/{schema}/{table}/indexes"),
+            IndexScope::Dataset { dataset_id } => format!("/datasets/{dataset_id}/indexes"),
+        }
+    }
+
+    fn delete_path(&self, index_name: &str) -> String {
+        match self {
+            IndexScope::Connection {
+                connection_id,
+                schema,
+                table,
+            } => format!(
+                "/connections/{connection_id}/tables/{schema}/{table}/indexes/{index_name}"
+            ),
+            IndexScope::Dataset { dataset_id } => {
+                format!("/datasets/{dataset_id}/indexes/{index_name}")
+            }
+        }
+    }
+}
+
 #[allow(clippy::too_many_arguments)]
 pub fn create(
     workspace_id: &str,
-    connection_id: &str,
-    schema: &str,
-    table: &str,
+    scope: IndexScope<'_>,
     name: &str,
     columns: &str,
     index_type: &str,
     metric: Option<&str>,
     async_mode: bool,
+    embedding_provider_id: Option<&str>,
+    dimensions: Option<u32>,
+    output_column: Option<&str>,
+    description: Option<&str>,
 ) {
-    let api = ApiClient::new(Some(workspace_id));
+    use crossterm::style::Stylize;
 
     let cols: Vec<&str> = columns.split(',').map(str::trim).collect();
+
+    let auto_embed_set = embedding_provider_id.is_some()
+        || dimensions.is_some()
+        || output_column.is_some()
+        || description.is_some();
+    if auto_embed_set && index_type != "vector" {
+        eprintln!(
+            "{}",
+            "--embedding-provider-id, --dimensions, --output-column, and --description are only valid with --type vector".red()
+        );
+        std::process::exit(1);
+    }
+    if index_type == "vector" && cols.len() != 1 {
+        eprintln!(
+            "{}",
+            "--type vector requires exactly one column in --columns".red()
+        );
+        std::process::exit(1);
+    }
+
+    let api = ApiClient::new(Some(workspace_id));
+
     let mut body = serde_json::json!({
         "index_name": name,
         "columns": cols,
@@ -267,31 +349,54 @@ pub fn create(
     if let Some(m) = metric {
         body["metric"] = serde_json::json!(m);
     }
+    if let Some(p) = embedding_provider_id {
+        body["embedding_provider_id"] = serde_json::json!(p);
+    }
+    if let Some(d) = dimensions {
+        body["dimensions"] = serde_json::json!(d);
+    }
+    if let Some(o) = output_column {
+        body["output_column"] = serde_json::json!(o);
+    }
+    if let Some(d) = description {
+        body["description"] = serde_json::json!(d);
+    }
 
-    let path = format!("/connections/{connection_id}/tables/{schema}/{table}/indexes");
-    let (status, resp_body) = api.post_raw(&path, &body);
+    let (status, resp_body) = api.post_raw(&scope.create_path(), &body);
 
     if !status.is_success() {
-        use crossterm::style::Stylize;
         eprintln!("{}", crate::util::api_error(resp_body).red());
         std::process::exit(1);
     }
 
-    use crossterm::style::Stylize;
     if async_mode {
         let parsed: serde_json::Value = serde_json::from_str(&resp_body).unwrap_or_default();
-        let job_id = parsed["job_id"].as_str().unwrap_or("unknown");
+        let job_id = parsed["id"].as_str().unwrap_or("unknown");
         println!("{}", "Index creation submitted.".green());
         println!("job_id: {}", job_id);
         println!(
             "{}",
-            "Use 'hotdata jobs <job_id>' to check status.".dark_grey()
+            format!("Use 'hotdata jobs {}' to check status.", job_id).dark_grey()
         );
     } else {
         println!("{}", "Index created.".green());
     }
 }
 
+pub fn delete(workspace_id: &str, scope: IndexScope<'_>, index_name: &str) {
+    use crossterm::style::Stylize;
+
+    let api = ApiClient::new(Some(workspace_id));
+    let (status, resp_body) = 
api.delete_raw(&scope.delete_path(index_name)); + + if !status.is_success() { + eprintln!("{}", crate::util::api_error(resp_body).red()); + std::process::exit(1); + } + + println!("{}", format!("Index '{}' deleted.", index_name).green()); +} + #[cfg(test)] mod tests { use super::*; @@ -304,6 +409,35 @@ mod tests { )); } + #[test] + fn index_scope_connection_paths() { + let scope = IndexScope::Connection { + connection_id: "conn1", + schema: "public", + table: "users", + }; + assert_eq!( + scope.create_path(), + "/connections/conn1/tables/public/users/indexes" + ); + assert_eq!( + scope.delete_path("idx_email"), + "/connections/conn1/tables/public/users/indexes/idx_email" + ); + } + + #[test] + fn index_scope_dataset_paths() { + let scope = IndexScope::Dataset { + dataset_id: "data_xyz", + }; + assert_eq!(scope.create_path(), "/datasets/data_xyz/indexes"); + assert_eq!( + scope.delete_path("idx_title"), + "/datasets/data_xyz/indexes/idx_title" + ); + } + #[test] fn information_schema_followup_breaks_when_more_but_no_cursor() { assert!(matches!( diff --git a/src/main.rs b/src/main.rs index d213fdc..f78d329 100644 --- a/src/main.rs +++ b/src/main.rs @@ -6,7 +6,7 @@ mod connections; mod connections_new; mod context; mod datasets; -mod embedding; +mod embedding_providers; mod indexes; mod jobs; mod jwt; @@ -24,8 +24,9 @@ use anstyle::AnsiColor; use clap::{Parser, builder::Styles}; use command::{ AuthCommands, Commands, ConnectionsCommands, ConnectionsCreateCommands, ContextCommands, - DatasetsCommands, IndexesCommands, JobsCommands, QueriesCommands, QueryCommands, - ResultsCommands, SandboxCommands, SkillCommands, TablesCommands, WorkspaceCommands, + DatasetsCommands, EmbeddingProvidersCommands, IndexesCommands, JobsCommands, QueriesCommands, + QueryCommands, ResultsCommands, SandboxCommands, SkillCommands, TablesCommands, + WorkspaceCommands, }; #[derive(Parser)] @@ -173,6 +174,9 @@ fn main() { table_name.as_deref(), &output, ), + Some(DatasetsCommands::Refresh { id, r#async }) => { + datasets::refresh(&workspace_id, &id, r#async) + } None => { use clap::CommandFactory; let mut cmd = Cli::command(); @@ -270,9 +274,22 @@ fn main() { ) } }, - Some(ConnectionsCommands::Refresh { connection_id }) => { - connections::refresh(&workspace_id, &connection_id) - } + Some(ConnectionsCommands::Refresh { + connection_id, + data, + schema, + table, + r#async, + include_uncached, + }) => connections::refresh( + &workspace_id, + &connection_id, + data, + schema.as_deref(), + table.as_deref(), + r#async, + include_uncached, + ), None => { use clap::CommandFactory; let mut cmd = Cli::command(); @@ -393,92 +410,186 @@ fn main() { connection_id, schema, table, + dataset_id, output, } => indexes::list( &workspace_id, connection_id.as_deref(), schema.as_deref(), table.as_deref(), + dataset_id.as_deref(), &output, ), IndexesCommands::Create { connection_id, schema, table, + dataset_id, name, columns, r#type, metric, r#async, - } => indexes::create( + embedding_provider_id, + dimensions, + output_column, + description, + } => { + let scope = match ( + dataset_id.as_deref(), + connection_id.as_deref(), + schema.as_deref(), + table.as_deref(), + ) { + (Some(did), _, _, _) => indexes::IndexScope::Dataset { dataset_id: did }, + (None, Some(cid), Some(sch), Some(tbl)) => { + indexes::IndexScope::Connection { + connection_id: cid, + schema: sch, + table: tbl, + } + } + _ => { + eprintln!( + "error: provide either --dataset-id or all three of --connection-id, --schema, --table" + ); + std::process::exit(1); + } + }; + 
indexes::create( + &workspace_id, + scope, + &name, + &columns, + &r#type, + metric.as_deref(), + r#async, + embedding_provider_id.as_deref(), + dimensions, + output_column.as_deref(), + description.as_deref(), + ) + } + IndexesCommands::Delete { + connection_id, + schema, + table, + dataset_id, + name, + } => { + let scope = match ( + dataset_id.as_deref(), + connection_id.as_deref(), + schema.as_deref(), + table.as_deref(), + ) { + (Some(did), _, _, _) => indexes::IndexScope::Dataset { dataset_id: did }, + (None, Some(cid), Some(sch), Some(tbl)) => { + indexes::IndexScope::Connection { + connection_id: cid, + schema: sch, + table: tbl, + } + } + _ => { + eprintln!( + "error: provide either --dataset-id or all three of --connection-id, --schema, --table" + ); + std::process::exit(1); + } + }; + indexes::delete(&workspace_id, scope, &name); + } + } + } + Commands::EmbeddingProviders { + workspace_id, + command, + } => { + let workspace_id = resolve_workspace(workspace_id); + match command { + EmbeddingProvidersCommands::List { output } => { + embedding_providers::list(&workspace_id, &output) + } + EmbeddingProvidersCommands::Get { id, output } => { + embedding_providers::get(&workspace_id, &id, &output) + } + EmbeddingProvidersCommands::Create { + name, + provider_type, + config, + provider_api_key, + secret_name, + output, + } => embedding_providers::create( &workspace_id, - &connection_id, - &schema, - &table, &name, - &columns, - &r#type, - metric.as_deref(), - r#async, + &provider_type, + config.as_deref(), + provider_api_key.as_deref(), + secret_name.as_deref(), + &output, + ), + EmbeddingProvidersCommands::Update { + id, + name, + config, + provider_api_key, + secret_name, + output, + } => embedding_providers::update( + &workspace_id, + &id, + name.as_deref(), + config.as_deref(), + provider_api_key.as_deref(), + secret_name.as_deref(), + &output, ), + EmbeddingProvidersCommands::Delete { id } => { + embedding_providers::delete(&workspace_id, &id) + } } } Commands::Search { query, + r#type, table, column, select, limit, - model, workspace_id, output, } => { let workspace_id = resolve_workspace(workspace_id); let select_cols = select.as_deref().unwrap_or("*"); - // Determine search mode: - // 1. --model flag: embed the query text via the model provider - // 2. No query + piped stdin: read vector from stdin - // 3. 
Query text without --model: BM25 text search - let sql = if let Some(ref model_name) = model { - let query_text = match query { - Some(ref q) => q.as_str(), - None => { - eprintln!("error: --model requires a search query text"); - std::process::exit(1); - } - }; - let vec = embedding::openai_embed(query_text, model_name); - let vec_str = embedding::vector_to_sql(&vec); - format!( - "SELECT {}, l2_distance({}, {}) as dist FROM {} ORDER BY dist LIMIT {}", - select_cols, column, vec_str, table, limit, - ) - } else if let Some(q) = query.as_ref() { - let bm25_columns = match select.as_deref() { - Some(cols) => format!("{}, score", cols), - None => "*".to_string(), - }; - format!( - "SELECT {} FROM bm25_search('{}', '{}', '{}') ORDER BY score DESC LIMIT {}", - bm25_columns, - table.replace('\'', "''"), - column.replace('\'', "''"), - q.replace('\'', "''"), - limit, - ) - } else { - use std::io::IsTerminal; - if std::io::stdin().is_terminal() { - eprintln!("error: provide a search query or pipe a vector via stdin"); - std::process::exit(1); + let sql = match r#type.as_str() { + "bm25" => { + let bm25_columns = match select.as_deref() { + Some(cols) => format!("{}, score", cols), + None => "*".to_string(), + }; + format!( + "SELECT {} FROM bm25_search('{}', '{}', '{}') ORDER BY score DESC LIMIT {}", + bm25_columns, + table.replace('\'', "''"), + column.replace('\'', "''"), + query.replace('\'', "''"), + limit, + ) } - let vec = embedding::read_vector_from_stdin(); - let vec_str = embedding::vector_to_sql(&vec); - format!( - "SELECT {}, l2_distance({}, {}) as dist FROM {} ORDER BY dist LIMIT {}", - select_cols, column, vec_str, table, limit, - ) + // Server-side vector_distance: resolves the embedding column, model, + // and metric from the index metadata. The user names the source text column. + "vector" => format!( + "SELECT {}, vector_distance({}, '{}') AS dist FROM {} ORDER BY dist LIMIT {}", + select_cols, + column, + query.replace('\'', "''"), + table, + limit, + ), + _ => unreachable!(), }; query::execute(&sql, &workspace_id, None, &output) }