33 changes: 32 additions & 1 deletion .env.template
@@ -11,6 +11,13 @@ DISABLE_DISPLAY_KEYS=false # if true, API keys will not be shown in the frontend
SANDBOX=local # code execution backend: 'local' (default) or 'docker'
# LOG_LEVEL=INFO # logging level for data_formulator modules (DEBUG, INFO, WARNING, ERROR)

# --- Feature gates ---
# Disable external data connectors (MySQL, PostgreSQL, etc.).
# Recommended for multi-user anonymous deployments to prevent credential exposure.
# DISABLE_DATA_CONNECTORS=false

# Prevent users from adding custom LLM endpoints via the UI.
# Only server-configured models (below) will be available.
# DISABLE_CUSTOM_MODELS=false

# Flask session secret key — used to sign cookies and encrypt session data.
# Required for SSO and plugin auth (Superset, etc.). Generate one with:
# python -c "import secrets; print(secrets.token_hex(32))"
@@ -200,4 +207,28 @@ OLLAMA_MODELS=qwen3:32b # models with good code generation capabilities recommen
# Superset-side setup:
# The Superset instance needs a small bridge endpoint at /df-sso-bridge/
# that converts a Superset session into a JWT and posts it back to DF.
# See: superset-sso-bridge-setup.md

# -------------------------------------------------------------------
# Deployment profiles (quick-start presets)
# -------------------------------------------------------------------
# See DEVELOPMENT.md "Deployment Profiles" for full documentation.
#
# Profile 1 — Single-user local (default, no changes needed):
# Just run: data_formulator
#
# Profile 2 — Multi-user anonymous demo:
# WORKSPACE_BACKEND=ephemeral
# DISABLE_DATA_CONNECTORS=true
# DISABLE_CUSTOM_MODELS=true
# DISABLE_DISPLAY_KEYS=true
# (or simply: DISABLE_DATABASE=true as shortcut)
#
# Profile 3 — Multi-user authenticated (enterprise):
# AUTH_PROVIDER=oidc
# OIDC_ISSUER_URL=https://your-idp.example.com/realms/main
# OIDC_CLIENT_ID=data-formulator
# ALLOW_ANONYMOUS=false
# DISABLE_CUSTOM_MODELS=true
# WORKSPACE_BACKEND=azure_blob
# FLASK_SECRET_KEY=<generate-with: python -c "import secrets; print(secrets.token_hex(32))">
264 changes: 175 additions & 89 deletions DEVELOPMENT.md

Large diffs are not rendered by default.

384 changes: 384 additions & 0 deletions design-docs/8-unified-data-source-panel.md

Large diffs are not rendered by default.

1,744 changes: 1,744 additions & 0 deletions design-docs/9-generalized-data-source-plugins.md

Large diffs are not rendered by default.

315 changes: 315 additions & 0 deletions design-docs/9.1-data-source-connection-model.md
@@ -0,0 +1,315 @@
# Data Source Connection Model — Auth, Persistence, and Multi-User Isolation

## Status: Complete (Phase A + B done, Phase C deferred to doc 9 Phase 4)

Parent: [9-generalized-data-source-plugins.md](9-generalized-data-source-plugins.md)

## 1. Problem

After Phase 3 of the generalized plugin migration, all external data sources flow through `DataConnector`. But the **connection lifecycle** has gaps:

1. **Connections are ephemeral.** `DataConnector._loaders` is an in-memory dict. Server restart = all connections lost. Users must re-enter credentials every session.
2. **No "already connected" state.** The data loader panel shows all sources as "Available" with a connect form. There's no way to show "you're already connected to Kusto — here are your tables."
3. **Credential storage exists but isn't wired.** `CredentialVault` (Fernet-encrypted SQLite) exists and works for the Superset plugin, but `DataConnector` doesn't use it.
4. **Multi-user isolation works but has no persistence.** Two users hitting `/api/connectors/kusto/auth/connect` get separate loaders (keyed by identity), but neither survives a restart.

## 2. Desired UX

The data loader panel should present two categories:

### 2.1 Connected Sources (user has active/stored credentials)

```
┌──────────────────────────────────┐
│ ● PostgreSQL (prod) Connected │ ← vault has credentials
│ ● Kusto (corp) Connected │ ← vault has credentials
│ ○ BigQuery (analytics) Session │ ← in-memory only, this session
└──────────────────────────────────┘
```

**Behavior:** User clicks → jumps directly to catalog/table browser. No credential form needed.

- **Vault-backed (●):** Credentials encrypted in `credentials.db`. Auto-reconnect on server restart.
- **Session-only (○):** In-memory only. Connected this session but credentials not persisted. Lost on restart.

### 2.2 Available Sources (registered but no credentials)

```
┌──────────────────────────────────┐
│ MySQL │ ← installed, no connection yet
│ S3 │ ← installed, no connection yet
│ MongoDB │ ← installed, no connection yet
│ ───────────────────────────── │
│ Athena (install) │ ← missing deps
│ MSSQL (install) │ ← missing deps
└──────────────────────────────────┘
```

**Behavior:** User clicks → shown credential form → connect → source moves to "Connected" category.

### 2.3 Multi-User Isolation

Same route path, different state per identity:

```
Route: /api/connectors/kusto/auth/connect
Alice → _loaders["user:alice@corp.com"] = KustoLoader(cluster="alice-cluster")
Bob → _loaders["user:bob@corp.com"] = KustoLoader(cluster="bob-cluster")
```
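A minimal sketch of the per-identity keying shown above. `FakeLoader` and `IsolationSketch` are hypothetical stand-ins, not the real `KustoLoader` or `DataConnector` classes; only the `_loaders` dict keyed by identity comes from this document.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class FakeLoader:
    """Hypothetical stand-in for a real loader such as KustoLoader."""
    cluster: str


@dataclass
class IsolationSketch:
    """Hypothetical connector: one route, one loader per identity."""
    source_id: str
    _loaders: Dict[str, FakeLoader] = field(default_factory=dict)

    def connect(self, identity: str, params: dict) -> FakeLoader:
        # The identity string is the only key, so two users never share a loader.
        loader = FakeLoader(**params)
        self._loaders[identity] = loader
        return loader

    def loader_for(self, identity: str) -> Optional[FakeLoader]:
        return self._loaders.get(identity)


kusto = IsolationSketch("kusto")
kusto.connect("user:alice@corp.com", {"cluster": "alice-cluster"})
kusto.connect("user:bob@corp.com", {"cluster": "bob-cluster"})
```
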

The admin can also pin shared params via config:

```yaml
# data-sources.yml
sources:
  - type: kusto
    id: kusto_corp
    name: "Corp Kusto"
    params:
      cluster: "https://corp.kusto.windows.net"  # pinned (hidden from user form)
      # Users only see: database, token
```

In this scenario, both Alice and Bob connect to the same cluster but provide their own database and token. Their loaders are still separate.
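One way pinned params could combine with user input is sketched below. The function names `form_fields` and `effective_params` are assumptions for illustration, and the rule that pinned values override user-supplied ones is a guess at the intended behavior, not something this document states.

```python
from typing import Dict, List


def form_fields(all_params: List[str], pinned: Dict[str, str]) -> List[str]:
    """Fields the user still has to fill in: everything the admin did not pin."""
    return [name for name in all_params if name not in pinned]


def effective_params(pinned: Dict[str, str], user_params: Dict[str, str]) -> Dict[str, str]:
    """Merge user input with pinned values; pinned values win (assumed rule)."""
    unpinned = {k: v for k, v in user_params.items() if k not in pinned}
    return {**unpinned, **pinned}


pinned = {"cluster": "https://corp.kusto.windows.net"}
alice = effective_params(pinned, {"database": "sales", "token": "aaa"})
bob = effective_params(pinned, {"database": "eng", "token": "bbb", "cluster": "https://other"})
```
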

## 3. Credential Persistence Design

### 3.1 Existing Infrastructure

| Component | Location | Status |
|-----------|----------|--------|
| `CredentialVault` (abstract) | `credential_vault/base.py` | ✅ Working |
| `LocalCredentialVault` (Fernet + SQLite) | `credential_vault/local_vault.py` | ✅ Working |
| Key auto-generation | `credential_vault/__init__.py` | ✅ Zero-config for local mode |
| API endpoints | `credential_routes.py` | ✅ `/api/credentials/store\|list\|delete` |
| Vault integration | `plugins/superset/` only | ⚠️ Only wired for Superset |

### 3.2 What Needs to Happen

Wire `DataConnector` into `CredentialVault`:

```
Connect flow:
  1. User submits params via /auth/connect
  2. DataConnector._connect() creates loader, tests connection
  3. If success AND vault available:
     → vault.store(identity, source_id, {user_params + safe metadata})
  4. Loader cached in _loaders[identity]

Auto-reconnect flow (on /auth/status or first catalog/data call):
  1. _loaders[identity] is empty
  2. Check vault.retrieve(identity, source_id)
  3. If credentials found → _connect(stored_params) → test connection
  4. If test fails → delete stale vault entry, return "not connected"
  5. If test succeeds → loader ready, return "connected"

Disconnect flow:
  1. User calls /auth/disconnect
  2. _loaders.pop(identity)
  3. vault.delete(identity, source_id)
```
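The three flows above can be sketched as follows. `InMemoryVault` and `ConnectorFlowSketch` are hypothetical stand-ins for `CredentialVault` and `DataConnector`, the connection test is injected as a plain callable, and the stored blob holds only the raw params rather than the params plus metadata.

```python
from typing import Callable, Dict, Optional, Tuple


class InMemoryVault:
    """Hypothetical stand-in for CredentialVault (the real one encrypts to SQLite)."""

    def __init__(self) -> None:
        self._entries: Dict[Tuple[str, str], dict] = {}

    def store(self, identity: str, source_id: str, blob: dict) -> None:
        self._entries[(identity, source_id)] = dict(blob)

    def retrieve(self, identity: str, source_id: str) -> Optional[dict]:
        entry = self._entries.get((identity, source_id))
        return dict(entry) if entry is not None else None

    def delete(self, identity: str, source_id: str) -> None:
        self._entries.pop((identity, source_id), None)


class ConnectorFlowSketch:
    def __init__(self, source_id: str, vault: Optional[InMemoryVault],
                 test_connection: Callable[[dict], bool]) -> None:
        self.source_id = source_id
        self.vault = vault
        self.test_connection = test_connection
        self._loaders: Dict[str, dict] = {}  # a dict of params stands in for a loader

    def connect(self, identity: str, params: dict, persist: bool = True) -> dict:
        if not self.test_connection(params):
            return {"status": "error", "persisted": False}
        self._loaders[identity] = dict(params)
        persisted = persist and self.vault is not None
        if persisted:
            self.vault.store(identity, self.source_id, params)
        return {"status": "connected", "persisted": persisted}

    def status(self, identity: str) -> str:
        if identity in self._loaders:
            return "connected"
        stored = self.vault.retrieve(identity, self.source_id) if self.vault is not None else None
        if stored is None:
            return "not connected"
        if not self.test_connection(stored):
            # Lazy invalidation: drop the stale entry so the user is prompted again.
            self.vault.delete(identity, self.source_id)
            return "not connected"
        self._loaders[identity] = stored
        return "connected"

    def disconnect(self, identity: str) -> None:
        self._loaders.pop(identity, None)
        if self.vault is not None:
            self.vault.delete(identity, self.source_id)


valid_passwords = {"hunter2"}
vault = InMemoryVault()
pg = ConnectorFlowSketch("postgresql", vault, lambda p: p.get("password") in valid_passwords)
result = pg.connect("user:alice", {"password": "hunter2"})
pg._loaders.clear()  # simulate a server restart: in-memory loaders are gone
```

After the simulated restart, `pg.status("user:alice")` reconnects from the vault. If the password has since been rotated, the same call deletes the stale entry instead.
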

### 3.3 Storage Architecture: Centralized Vault

**Decision: Single centralized `credentials.db` at `DATA_FORMULATOR_HOME/`.** All users' credentials in one Fernet-encrypted SQLite file, keyed by `(user_id, source_key)`.

Considered and rejected: per-user storage at `users/{id}/credentials.db`.

**Rationale:**

| Concern | Centralized | Per-user dirs |
|---------|-------------|---------------|
| Security boundary | Server process holds the Fernet key and can decrypt all entries regardless of file layout | Same — server still needs all keys |
| Operational simplicity | One file, one volume mount, one backup | N directories, must manage creation/cleanup/permissions |
| User data deletion (GDPR) | `DELETE WHERE user_id = ?` | Delete user dir |
| Concurrent access | SQLite handles fine (rare writes) | No contention but N DB connections |
| Backend swap (e.g., Azure Key Vault) | One interface to replace | N stores to replace |

The logical separation is in the composite key `(user_id, source_key)`, not the physical file layout. Admin-configured credentials don't go in the vault at all — they live in `data-sources.yml` with `auto_connect: true`.
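To illustrate the composite-key layout, here is a minimal `sqlite3` sketch. The table and column names (`credentials`, `encrypted_blob`) are assumptions and may not match the actual `LocalCredentialVault` schema, and the blob is treated as opaque bytes rather than real Fernet ciphertext.

```python
import sqlite3
from typing import Optional

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE credentials (
        user_id        TEXT NOT NULL,
        source_key     TEXT NOT NULL,
        encrypted_blob BLOB NOT NULL,
        PRIMARY KEY (user_id, source_key)  -- logical isolation lives in this key
    )
    """
)


def put(user_id: str, source_key: str, blob: bytes) -> None:
    conn.execute(
        "INSERT OR REPLACE INTO credentials VALUES (?, ?, ?)",
        (user_id, source_key, blob),
    )


def get(user_id: str, source_key: str) -> Optional[bytes]:
    row = conn.execute(
        "SELECT encrypted_blob FROM credentials WHERE user_id = ? AND source_key = ?",
        (user_id, source_key),
    ).fetchone()
    return row[0] if row else None


def delete_user(user_id: str) -> None:
    # User data deletion is a single statement, as in the table above.
    conn.execute("DELETE FROM credentials WHERE user_id = ?", (user_id,))


put("user:alice", "kusto_corp", b"ciphertext-a")
put("user:bob", "kusto_corp", b"ciphertext-b")
delete_user("user:alice")
```
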

### 3.4 What Gets Stored in the Vault

```json
{
  "user_params": {
    "host": "db.corp.com",
    "port": "5432",
    "database": "analytics",
    "password": "hunter2"
  },
  "connected_at": "2026-04-14T10:30:00Z",
  "source_id": "postgresql"
}
```

The vault encrypts the **entire blob** with Fernet (AES-128-CBC + HMAC-SHA256). The encryption key:
- **Local mode:** Auto-generated, stored at `DATA_FORMULATOR_HOME/.vault_key`
- **Server mode:** Set via `CREDENTIAL_VAULT_KEY` env var
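The key lookup order could look roughly like the sketch below, using only the standard library. `resolve_vault_key` is a hypothetical helper, not the function in `credential_vault/__init__.py`. The generated key uses the Fernet key format (32 random bytes, URL-safe base64), and the `0o600` file mode is an added assumption.

```python
import base64
import os
import tempfile
from pathlib import Path
from typing import Mapping


def resolve_vault_key(home: Path, env: Mapping[str, str] = os.environ) -> bytes:
    """Return the vault key: env var first, then the key file, else generate one."""
    from_env = env.get("CREDENTIAL_VAULT_KEY")
    if from_env:
        return from_env.encode("ascii")  # server mode

    key_file = home / ".vault_key"
    if key_file.exists():
        return key_file.read_bytes().strip()  # local mode, key already generated

    # Local mode, first run: a Fernet key is 32 random bytes, URL-safe base64 encoded.
    key = base64.urlsafe_b64encode(os.urandom(32))
    home.mkdir(parents=True, exist_ok=True)
    key_file.write_bytes(key)
    os.chmod(key_file, 0o600)  # assumption: restrict the key file to the owner
    return key


with tempfile.TemporaryDirectory() as tmp:
    first = resolve_vault_key(Path(tmp), env={})
    second = resolve_vault_key(Path(tmp), env={})
    from_env = resolve_vault_key(Path(tmp), env={"CREDENTIAL_VAULT_KEY": "abc"})
```
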

### 3.5 What Gets Stored in Workspace Metadata (Unchanged)

Workspace YAML only stores **non-sensitive** params (via `get_safe_params()`). This is already the case — passwords, tokens, and secrets are filtered out. No change needed.
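As an illustration of that filtering, a name-based filter might look like this. It is only a sketch: the marker list and the function name `safe_params_sketch` are assumptions, and the real `get_safe_params()` may decide sensitivity differently (for example, from per-loader parameter declarations).

```python
from typing import Dict

# Hypothetical markers; the actual set of sensitive parameter names is not given here.
SENSITIVE_MARKERS = ("password", "token", "secret")


def safe_params_sketch(params: Dict[str, str]) -> Dict[str, str]:
    """Keep only params whose names do not look sensitive."""
    return {
        name: value
        for name, value in params.items()
        if not any(marker in name.lower() for marker in SENSITIVE_MARKERS)
    }


workspace_params = safe_params_sketch(
    {"host": "db.corp.com", "port": "5432", "database": "analytics", "password": "hunter2"}
)
```
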

### 3.6 Connection State Summary

| Scenario | _loaders dict | Vault | Survives restart? |
|----------|--------------|-------|-------------------|
| Just connected | ✅ has loader | ✅ encrypted | Yes |
| Reconnected from vault | ✅ has loader | ✅ encrypted | Yes |
| Vault disabled / not available | ✅ has loader | ❌ nothing | No |
| Disconnected | ❌ removed | ❌ deleted | — |
| Server restarted, vault has creds | ❌ empty | ✅ encrypted | Yes (auto-reconnect on next access) |

## 4. Deployment Scenarios

### 4.1 Local Mode (single user, `WORKSPACE_BACKEND=local`)

- User IS the admin
- All auto-discovered sources appear as "Available"
- User connects → credentials stored in vault (zero-config, key auto-generated)
- Server restart → auto-reconnect from vault
- No multi-user concerns

### 4.2 Centrally Managed (multi-user, auth provider configured)

- Admin configures shared sources in `data-sources.yml` with pinned params
- Each user provides their own credentials (password/token) for the unpinned params
- Vault keyed by `(user_identity, source_id)` — full isolation
- Two users connecting to the same source_id with different params = two separate vault entries, two separate loaders

Example:

```yaml
# Admin config: data-sources.yml
sources:
  - type: kusto
    id: kusto_corp
    name: "Corp Kusto"
    params:
      cluster: "https://corp.kusto.windows.net"
```

```
Alice connects: vault["user:alice", "kusto_corp"] = {database: "sales", token: "aaa"}
Bob connects: vault["user:bob", "kusto_corp"] = {database: "eng", token: "bbb"}

Same route: /api/connectors/kusto_corp/auth/connect
Different credentials, different catalog results.
```

### 4.3 SSO / Token Forwarding (future)

When the app's auth provider (OIDC/Azure) issues tokens that the data source also accepts:

```
User logs in via OIDC → gets access_token
DataConnector sees auth_mode = "token_forward"
→ auto-connect using the user's OIDC token (no credential form)
→ no vault storage needed (token comes from auth session)
```

This is how the Superset SSO bridge already works. Generalizing it to DataConnector is a future enhancement.

### 4.4 Ephemeral Mode (`WORKSPACE_BACKEND=ephemeral`)

- No vault (no persistent storage)
- Connections are session-only (in-memory `_loaders` dict)
- Credentials typed each time
- This is fine — ephemeral mode is for demos/public instances where no state should persist

## 5. Frontend Changes

### 5.1 `/api/app-config` Enhancement

Add `CONNECTED_CONNECTORS` to the config response — the list of source_ids where the current user has vault credentials:

```json
{
  "CONNECTORS": [...],
  "DISABLED_SOURCES": {...},
  "CONNECTED_CONNECTORS": ["postgresql", "kusto_corp"]
}
```

This lets the frontend immediately render the "Connected / Available" split on mount without calling `/auth/status` for each source.
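On the backend, the list could be derived roughly as follows. This is a sketch under assumptions: `connected_connectors` is a hypothetical helper, and the vault is modeled as a plain iterable of `(user_id, source_key)` pairs rather than the real `CredentialVault` interface.

```python
from typing import Iterable, List, Tuple


def connected_connectors(
    vault_keys: Iterable[Tuple[str, str]],
    identity: str,
    registered_source_ids: List[str],
) -> List[str]:
    """Source ids for which this identity has a vault entry, in registration order."""
    owned = {source_key for user_id, source_key in vault_keys if user_id == identity}
    return [sid for sid in registered_source_ids if sid in owned]


vault_keys = [
    ("user:alice", "postgresql"),
    ("user:alice", "kusto_corp"),
    ("user:bob", "mysql"),
]
alice_connected = connected_connectors(
    vault_keys, "user:alice", ["postgresql", "mysql", "kusto_corp"]
)
```
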

### 5.2 Data Loader Panel States

```typescript
// Derived from serverConfig.CONNECTORS + serverConfig.CONNECTED_CONNECTORS
const connectedSources = sources.filter(s => connectedIds.includes(s.source_id));
const availableSources = sources.filter(s => !connectedIds.includes(s.source_id));
```

**Connected source row:**
```
[●] PostgreSQL (prod) [Browse Tables] [Disconnect]
```

**Available source row:**
```
[ ] MySQL [Connect...]
```

### 5.3 Connect Flow UI Change

After successful connect, the source moves from "Available" to "Connected":
1. Frontend sends `{ params, persist }` to `/auth/connect` (30s AbortController timeout)
2. Backend creates loader → tests connection → persists if requested
3. Backend returns `{ status: "connected", persisted: true/false }`
4. Frontend checks `status === "connected"` before calling `onConnected()`
5. Source re-renders in "Connected" category with catalog browser
6. If timeout or error → source stays in "Available", error message shown

### 5.4 Persist Credentials Toggle

The connect form includes a "Remember credentials" checkbox (default: checked).
When unchecked, `persist: false` is sent to the backend, and credentials are
session-only (in-memory). The toggle is only shown when there are param fields.

## 6. Implementation Plan

### Phase A: Vault Integration in DataConnector

1. ✅ Add `_vault_store()`, `_vault_retrieve()`, `_vault_delete()`, `_persist_credentials()` helpers
2. ✅ `_connect()` creates loader in-memory; vault persistence is separate via `_persist_credentials()`
3. ✅ Wire into `_disconnect()` → delete from vault
4. ✅ Add auto-reconnect in `_require_loader()` → try vault before raising
5. ✅ Add `CONNECTED_CONNECTORS` to `/api/app-config`
6. ✅ Tests: vault store/retrieve/disconnect/auto-reconnect/persist-flag (21 tests)

### Phase B: Frontend Two-Panel UX

7. ✅ Parse `CONNECTED_CONNECTORS` from server config
8. ✅ Split data loader panel into Connected / Available sections
9. ✅ Auto-open catalog browser for connected sources (auto-reconnect from vault)
10. ✅ "Remember credentials" checkbox (default: on), sends `persist` flag to backend
11. ✅ Connection timeout (30s AbortController), verified `status === "connected"` before state transition

### Phase C: Token Forwarding (deferred)

12. Add `auth_mode: "token_forward"` to DataConnector
13. Auto-connect using the user's auth session token
14. No credential form needed — just catalog browser

## 7. Design Decisions (Resolved)

### D1: Credential persistence — opt-out (default: persist)

Local users expect "remember me" behavior, and they can disconnect to clear stored credentials. In server mode, admins can disable the vault entirely by leaving `CREDENTIAL_VAULT_KEY` unset. In local mode a key is auto-generated, so the vault is always available unless explicitly blocked.

### D2: Credential rotation / expiry — lazy invalidation

Vault entries don't expire. Auto-reconnect tests the connection — if the password has changed, the stale entry is deleted and the user is prompted to reconnect. Token-based connections (OAuth) would need refresh token support (Phase C).

### D3: Vault scope — global, not per-workspace

A user who connects to PostgreSQL in workspace A should see it connected in workspace B too. The vault key is `(user_id, source_key)` with no workspace dimension.

### D4: Admin-provided credentials — config file, not vault

Use `auto_connect: true` in `data-sources.yml`. The admin provides full credentials (with `${ENV_VAR}` refs), and all users auto-connect without entering anything. These never enter the per-user vault.
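A minimal sketch of resolving `${ENV_VAR}` references in admin-provided params. `expand_env_refs` is a hypothetical helper, and failing loudly on an unset variable is an assumed policy, since this document does not say how missing variables are handled.

```python
import re
from typing import Mapping

_ENV_REF = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def expand_env_refs(value: str, env: Mapping[str, str]) -> str:
    """Replace each ${NAME} in value with env[NAME]; raise if NAME is unset."""

    def substitute(match: "re.Match[str]") -> str:
        name = match.group(1)
        if name not in env:
            raise KeyError(f"environment variable {name} is not set")
        return env[name]

    return _ENV_REF.sub(substitute, value)


fake_env = {"PG_PASSWORD": "s3cret", "PG_HOST": "db.corp.com"}
resolved = expand_env_refs("postgresql://app:${PG_PASSWORD}@${PG_HOST}/analytics", fake_env)
```
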

### D5: Storage architecture — single centralized vault

One `credentials.db` at `DATA_FORMULATOR_HOME/`, keyed by `(user_id, source_key)`. Not per-user files. The trust boundary is the server process (which holds the Fernet key), so physical file separation adds operational complexity without security benefit. Admin credentials stay in config; user credentials stay in vault. User data deletion is `DELETE WHERE user_id = ?`.