diff --git a/README.md b/README.md deleted file mode 100644 index be62f50..0000000 --- a/README.md +++ /dev/null @@ -1,111 +0,0 @@ -# NL2SQL Platform - -**Production-Grade Natural Language to SQL Engine.** - -[![Documentation](https://img.shields.io/badge/docs-mkdocs-blue.svg)](docs/index.md) -[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT) - -The **NL2SQL Platform** is a modular, agentic system designed to convert natural language questions into accurate, authorized SQL queries across multiple database engines (Postgres, MySQL, MSSQL, SQLite). - -It features: - -* **Defense-in-Depth Security**: RBAC and Read-Only enforcement at multiple layers. [Learn more](docs/architecture/security.md). -* **Multi-Database Routing**: Federated queries across silos. -* **Agentic Reasoning**: Iterative planning, self-correction, and validation. - ---- - -## 📚 Documentation - -Detailed documentation is available in the `docs/` directory. - -* [**Architecture**](docs/architecture/overview.md): Understand the SQL Agent, Map-Reduce routing, and Plugins. -* [**Security**](docs/architecture/security.md): Authentication, RBAC, and Validation. -* [**Configuration**](docs/guides/configuration.md): Secrets, Cloud Auth, and Safety Limits. -* [**Guides**](docs/guides.md): Installation and Benchmarking. -* [**Reference**](docs/reference.md): CLI arguments and API specs. - ---- - -## 🚀 Try it Now (Demo) - -The fastest way to experience the platform is the interactive demo. This sets up a "Manufacturing" environment with 4 SQLite databases and sample data. - -```bash -# 1. Setup Demo Environment -nl2sql setup --demo - -# 2. Run a Query -nl2sql --env demo run "Show me broken machines in Austin" -``` - ---- - -## 🛠️ Installation & Setup - -### 1. Installation - -The platform is a monorepo. Install the CLI application: - -```bash -# Core & SDK -pip install -e packages/adapter-sdk -pip install -e packages/cli # Installs 'nl2sql' command - -# Database Adapters (install as needed, or let setup wizard handle it) -# pip install -e packages/adapters/postgres -``` - -### 2. Setup (Production) - -Run the interactive wizard to configure your own database connections (Postgres, MySQL, etc.) and LLM: - -```bash -nl2sql setup -``` - -*This will inspect your schema and index it into the vector store.* - -### 3. Usage - -**Querying** - -```bash -nl2sql run "Show me the top 5 users by sales" -``` - -**Common Commands** - -* `nl2sql index` - Re-index schemas and examples. -* `nl2sql list-adapters` - Show installed database adapters. -* `nl2sql doctor` - Diagnose environment issues. - -> See [**CLI Reference**](docs/guides/cli_reference.md) for full documentation on flags like `--env` and `--role`. - ---- - -## 🏗️ Architecture - -The system uses a directed graph of AI Agents (`Planner` -> `Validator` -> `Generator`). - -```mermaid -graph TD - UserQuery["User Query"] --> Semantic["Semantic Analysis"] - Semantic --> Decomposer["Decomposer Node"] - Decomposer -- "Splits Query" --> MapBranching["Fan Out (Map)"] - - subgraph Execution_Layer ["Execution Layer (Parallel)"] - MapBranching --> SQL_Agent["SQL Agent (Planner + Validator + Executor)"] - end - - SQL_Agent -- "Result Set" --> Reducer["State Aggregation"] - Reducer --> Aggregator["Aggregator Node"] -``` - -[Read more in the Architecture Overview](docs/architecture/overview.md). - ---- - -## 🤝 Contributing - -See [Development Guide](docs/guides/development.md). 
diff --git a/docs/adapters/development.md b/docs/adapters/development.md
new file mode 100644
index 0000000..dcd2c4b
--- /dev/null
+++ b/docs/adapters/development.md
@@ -0,0 +1,120 @@
+# Building Adapters
+
+The **Adapter SDK** (`nl2sql-adapter-sdk`) allows you to extend the platform to support new databases or APIs.
+
+## Implementing an Adapter
+
+You must implement the `DatasourceAdapter` interface.
+
+### Mandatory Properties
+
+* `datasource_id`: Unique identifier (e.g. "postgres_prod").
+* `row_limit`: **Safety Breaker**. Must return `1000` (or the configured value) to prevent massive result sets.
+* `max_bytes`: **Safety Breaker**. Limits result size at the network/driver level where possible.
+
+### Mandatory Methods
+
+* `fetch_schema()`: Must return `SchemaMetadata` with `tables`, `columns`, `pks`, `fks`. *Crucially, it should also populate `col.statistics` (samples, min/max) for Indexing.*
+* `execute(sql)`: Returns `QueryResult`.
+* `dry_run(sql)`: Returns validity checks.
+
+### Optional Optimization
+
+* `explain(sql)`: Returns the query plan.
+* `cost_estimate(sql)`: Returns estimated rows/time. Used by the PhysicalValidator.
+
+::: nl2sql_adapter_sdk.interfaces.DatasourceAdapter
+
+## Compliance Testing
+
+The SDK provides a compliance test suite. **All Adapters MUST pass this suite.**
+
+It verifies:
+
+* Schema Introspection (PKs/FKs detected?)
+* Type Mapping (Date -> Python Date, Numeric -> Python Float)
+* Error Handling (Bad SQL -> AdapterError)
+
+```python
+# tests/test_my_adapter.py
+import pytest
+
+from nl2sql_adapter_sdk.testing import BaseAdapterTest
+from my_adapter import MyAdapter
+
+class TestMyAdapter(BaseAdapterTest):
+    @pytest.fixture
+    def adapter(self):
+        return MyAdapter(...)
+```
+
+## Choosing a Base Class
+
+The platform provides two ways to build adapters. Choose the one that fits your target datasource.
+
+| Feature | `DatasourceAdapter` (Base Interface) | `BaseSQLAlchemyAdapter` (Helper Class) |
+| :--- | :--- | :--- |
+| **Package** | `nl2sql-adapter-sdk` | `nl2sql-adapter-sqlalchemy` |
+| **Best For** | REST APIs, NoSQL, GraphQL, Manual SQL Drivers. | SQL Databases with SQLAlchemy dialects (Postgres, Oracle, Snowflake). |
+| **Schema Fetching** | **Manual Implementation Required**. You must map metadata to `SchemaMetadata`. | **Automatic**. Uses `sqlalchemy.inspect` to reflect tables/FKs. |
+| **Execution** | **Manual Implementation Required**. You handle connections, cursors, and types. | **Automatic**. Handles pooling, transactions, and result formatting. |
+| **Stats Gathering** | **Manual**. You write queries to fetch min/max/nulls. | **Automatic**. Runs optimized generic queries for stats. |
+| **Dry Run** | **Manual**. | **Automatic**. Uses transaction rollback pattern. |
+
+### When to use `DatasourceAdapter`?
+
+Use the raw interface when:
+
+1. You are connecting to a non-SQL source (e.g., Elasticsearch, HubSpot API).
+2. You are using a customized internal SQL driver that is not compatible with SQLAlchemy.
+3. You need complete control over the execution lifecycle (e.g. async-only drivers).
+
+A minimal sketch of this route is shown at the end of this comparison.
+
+### When to use `BaseSQLAlchemyAdapter`?
+
+Use this helper class when:
+
+1. There is an existing SQLAlchemy dialect for your database (this covers 95% of SQL databases).
+2. You want to save time on boilerplate (connection pooling, schema reflection).
+3. You want consistent behavior with the core supported adapters.
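+
+### Minimal `DatasourceAdapter` Sketch
+
+A minimal sketch of the raw-interface route, wrapping Python's built-in `sqlite3` as a stand-in for a custom internal DB-API driver. The properties and method names follow the interface above; field names such as `columns`, `rows`, and `error` on the result models are assumptions to adjust against your SDK version.
+
+```python
+import sqlite3  # stand-in for a custom internal DB-API driver
+
+from nl2sql_adapter_sdk.interfaces import DatasourceAdapter
+from nl2sql_adapter_sdk.models import SchemaMetadata, QueryResult, DryRunResult
+
+
+class InternalDriverAdapter(DatasourceAdapter):
+    """Sketch: wraps a raw DB-API driver without SQLAlchemy."""
+
+    def __init__(self, path: str, row_limit: int = 1000):
+        self._conn = sqlite3.connect(path)
+        self._row_limit = row_limit
+
+    @property
+    def datasource_id(self) -> str:
+        return "internal_driver_db"
+
+    @property
+    def row_limit(self) -> int:
+        return self._row_limit  # safety breaker: cap result rows
+
+    @property
+    def max_bytes(self) -> int:
+        return 10 * 1024 * 1024  # safety breaker: 10MB payload cap
+
+    def fetch_schema(self) -> SchemaMetadata:
+        # Manual mapping: introspect the driver's catalog into tables,
+        # columns, pks, fks, and populate col.statistics for indexing.
+        ...
+
+    def execute(self, sql: str) -> QueryResult:
+        cursor = self._conn.execute(sql)
+        rows = cursor.fetchmany(self.row_limit)  # enforce the breaker
+        columns = [desc[0] for desc in cursor.description]
+        return QueryResult(columns=columns, rows=rows)
+
+    def dry_run(self, sql: str) -> DryRunResult:
+        # No shared rollback helper here: validate via parse-only EXPLAIN.
+        try:
+            self._conn.execute(f"EXPLAIN {sql}")
+            return DryRunResult(is_valid=True)
+        except sqlite3.Error as exc:
+            return DryRunResult(is_valid=False, error=str(exc))
+```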
+
+## Building SQL Adapters (The Fast Way)
+
+For SQL databases supported by SQLAlchemy, you should use the `nl2sql-adapter-sqlalchemy` package as described in the comparison above.
+
+### `BaseSQLAlchemyAdapter` Features
+
+This base class implements ~90% of the required functionality for you:
+
+* **Automatic Schema Fetching**: Uses `sqlalchemy.inspect` to get tables, columns, PKs.
+* **Automatic Statistics**: Runs optimized queries to fetch `min/max`, `null_percentage`, `distinct_count`, and `sample_values` for text columns.
+* **Generic Execution**: Handles connection pooling and result formatting.
+* **Safety**: Built-in generic `dry_run` using transaction rollbacks.
+
+### Example Implementation
+
+See `packages/adapters/postgres` for a reference implementation.
+
+```python
+from typing import Any, Dict
+
+from nl2sql_adapter_sdk.models import DryRunResult  # adjust to your SDK layout
+from nl2sql_sqlalchemy_adapter import BaseSQLAlchemyAdapter
+
+class PostgresAdapter(BaseSQLAlchemyAdapter):
+    def construct_uri(self, args: Dict[str, Any]) -> str:
+        return f"postgresql://{args.get('user')}:{args.get('password')}@{args.get('host')}/{args.get('database')}"
+
+    # Optional: override dry_run for better performance using EXPLAIN
+    def dry_run(self, sql: str) -> DryRunResult:
+        self.execute(f"EXPLAIN {sql}")
+        return DryRunResult(is_valid=True)
+```
+
+## Reference Adapters
+
+For detailed configuration of each supported adapter, see the **[Supported Adapters](index.md)** section.
+
+Explore the `packages/adapters/` directory for examples:
+
+* `postgres`: Standard implementation using `sqlalchemy`.
+* `sqlite`: Simple, file-based.
+* `mssql` / `mysql`: Standard enterprise drivers.
+
+## Next Steps
+
+Check out the [Postgres Adapter Source Code](https://github.com/nadeem4/nl2sql/tree/main/packages/adapters/postgres) for a complete, production-grade example.
diff --git a/docs/adapters/index.md b/docs/adapters/index.md
new file mode 100644
index 0000000..102f459
--- /dev/null
+++ b/docs/adapters/index.md
@@ -0,0 +1,33 @@
+# Supported Adapters
+
+The NL2SQL Platform supports a variety of datasources through specialized adapters. Each adapter is designed to handle the specific idiosyncrasies of its underlying database engine, from connection strings to specialized `EXPLAIN` plans.
+
+## SQL Adapters
+
+We provide first-class support for the following SQL databases via SQLAlchemy.
+
+| Adapter | Description | Status |
+| :--- | :--- | :--- |
+| **[PostgreSQL](postgres.md)** | Full support including `EXPLAIN`, JSONB, and SSL. | 🟢 Stable |
+| **[MySQL](mysql.md)** | Support for 5.7+ and 8.0. Includes `MAX_EXECUTION_TIME` management. | 🟢 Stable |
+| **[Microsoft SQL Server](mssql.md)** | Enterprise support via `pyodbc` and the `T-SQL` dialect. | 🟡 Beta |
+| **[SQLite](sqlite.md)** | File-based local development. | 🟢 Stable |
+
+## Missing your database?
+
+Can't find what you need? Check out the **[Building Adapters](development.md)** guide to see how to implement your own.
+
+## Configuration
+
+All adapters are configured in your `configs/datasources.yaml` file.
+
+```yaml
+version: 1
+datasources:
+  - id: "sales_db"
+    connection:
+      type: "postgres"
+      host: "${env:DB_HOST}"
+      port: 5432
+      # ... see specific adapter docs for full reference
+```
diff --git a/docs/adapters/mssql.md b/docs/adapters/mssql.md
new file mode 100644
index 0000000..069245d
--- /dev/null
+++ b/docs/adapters/mssql.md
@@ -0,0 +1,47 @@
+# Microsoft SQL Server (MSSQL) Adapter
+
+Support for SQL Server 2017+ and Azure SQL.
+
+!!! info "Implementation"
+    This adapter extends `BaseSQLAlchemyAdapter` but provides specialized `dry_run` logic using `SET NOEXEC ON` to safely validate T-SQL.
+
+## Configuration
+
+**Type**: `mssql`
+
+```yaml
+connection:
+  type: "mssql"
+  host: "localhost"
+  port: 1433
+  user: "sa"
+  password: "${env:DB_PASS}"
+  database: "my_db"
+  driver: "ODBC Driver 17 for SQL Server" # Default
+  trusted_connection: false
+```
+
+### Connection Details
+
+* **Driver**: `pyodbc`. **Requires the system ODBC driver and headers to be installed.**
+* **URI Constructed**: `mssql+pyodbc://{user}:{pass}@{host}:{port}/{db}?driver={driver}`
+
+## Features
+
+| Feature | Implementation | Note |
+| :--- | :--- | :--- |
+| **Timeout** | Not strictly enforced by driver | Rely on global statement timeout. |
+| **Dry Run** | `SET NOEXEC ON` | Validates syntax without execution. |
+| **Costing** | `SET SHOWPLAN_XML ON` | Parses XML for `StatementSubTreeCost`. |
+
+## Requirements
+
+You must have the MS ODBC Driver installed in your Docker image or local environment.
+
+[Download ODBC Driver for SQL Server](https://learn.microsoft.com/en-us/sql/connect/odbc/download-odbc-driver-for-sql-server)
+
+```bash
+# Debian / Ubuntu: unixODBC headers required by pyodbc.
+# The MS ODBC driver itself is installed separately (see link above).
+sudo apt-get install unixodbc-dev
+```
diff --git a/docs/adapters/mysql.md b/docs/adapters/mysql.md
new file mode 100644
index 0000000..c603f15
--- /dev/null
+++ b/docs/adapters/mysql.md
@@ -0,0 +1,41 @@
+# MySQL Adapter
+
+Support for MySQL 5.7+ and 8.0+.
+
+!!! info "Implementation"
+    This adapter extends `BaseSQLAlchemyAdapter`. It overrides `connect()` to handle `MAX_EXECUTION_TIME` session variables.
+
+## Configuration
+
+**Type**: `mysql`
+
+```yaml
+connection:
+  type: "mysql"
+  host: "localhost"
+  port: 3306
+  user: "root"
+  password: "${env:DB_PASS}"
+  database: "my_db"
+  options:
+    charset: "utf8mb4"
+```
+
+### Connection Details
+
+* **Driver**: `pymysql` (Pure Python).
+* **URI Constructed**: `mysql+pymysql://{user}:{pass}@{host}:{port}/{db}?{options}`
+
+## Features
+
+| Feature | Implementation | Note |
+| :--- | :--- | :--- |
+| **Timeout** | `SET MAX_EXECUTION_TIME={ms}` | Session-level enforcement. |
+| **Dry Run** | Transaction Rollback | Starts a transaction, runs the SQL, rolls back. |
+| **Costing** | `EXPLAIN FORMAT=JSON` | Extracts `query_cost`. |
+| **Stats** | `SELECT count(*), min(), max()` | Standard aggregation. |
+
+## Limitations
+
+* **Row Estimation**: Compared to Postgres, MySQL's `EXPLAIN` does not always provide a reliable "Total Rows" estimate for complex joins.
diff --git a/docs/adapters/postgres.md b/docs/adapters/postgres.md
new file mode 100644
index 0000000..c277686
--- /dev/null
+++ b/docs/adapters/postgres.md
@@ -0,0 +1,43 @@
+# PostgreSQL Adapter
+
+The Postgres adapter is the **Gold Standard** adapter for the platform. It supports the full set of optimization features, including `EXPLAIN`-based dry runs and cost estimation.
+
+!!! info "Implementation"
+    This adapter extends `BaseSQLAlchemyAdapter`, leveraging automatic schema reflection and statistics gathering.
+
+## Configuration
+
+**Type**: `postgres` (or `postgresql`)
+
+```yaml
+connection:
+  type: "postgres"
+  host: "localhost"
+  port: 5432
+  user: "postgres"
+  password: "${env:DB_PASS}"
+  database: "my_db"
+  options:
+    sslmode: "require" # Optional: passed to query string
+```
+
+### Connection Details
+
+* **Driver**: `psycopg2` (via `sqlalchemy`).
+* **URI Constructed**: `postgresql://{user}:{pass}@{host}:{port}/{db}?{options}` + +## Features + +| Feature | Implementation | Note | +| :--- | :--- | :--- | +| **Timeout** | Native `-c statement_timeout={ms}` | Enforced server-side. | +| **Dry Run** | `EXPLAIN {sql}` | Highly accurate validation. | +| **Costing** | `EXPLAIN (FORMAT JSON) {sql}` | Returns "Total Cost" and "Plan Rows". | +| **Stats** | Optimized Queries | Fetches `null_perc`, `distinct`, `min/max`. | + +## Troubleshooting + +### SSL Verification + +If connecting to Azure or AWS RDS with strict SSL, ensure you pass the CA certificate path in options or standard libpq environment variables, or use `sslmode: disable` for testing. diff --git a/docs/adapters/sqlite.md b/docs/adapters/sqlite.md new file mode 100644 index 0000000..8262b39 --- /dev/null +++ b/docs/adapters/sqlite.md @@ -0,0 +1,37 @@ +# SQLite Adapter + +Simple file-based adapter for local development and testing. + +!!! info "Implementation" + This adapter extends `BaseSQLAlchemyAdapter`. Note that `timeout` configurations apply to the *database lock*, not query execution time. + +## Configuration + +**Type**: `sqlite` + +```yaml + +connection: + type: "sqlite" + database: "./my_data.db" # Absolute or relative path +``` + +!!! warning "Persistence" + If using **Docker**, avoid relative paths like `./my_data.db` as they will be lost on container restart. Use an absolute path mapped to a volume, e.g., `/app/data/my_data.db`. + +### Connection Details + +* **Driver**: Built-in `sqlite3`. +* **URI Constructed**: `sqlite:///{database}` + +## Features + +| Feature | Implementation | Note | +| :--- | :--- | :--- | +| **Timeout** | `connect_args["timeout"]` | Controls *Locking* timeout, not execution. | +| **Dry Run** | `EXPLAIN QUERY PLAN` | Validates parsing (rudimentary). | +| **Costing** | Stubbed | Returns default cost=1.0. | + +## Hints + +* **Concurrency**: SQLite is poor at high concurrency. Use for **Lite Mode** or single-user testing only. diff --git a/docs/architecture/adapters.md b/docs/architecture/adapters.md deleted file mode 100644 index cde30b1..0000000 --- a/docs/architecture/adapters.md +++ /dev/null @@ -1,46 +0,0 @@ -# Adapters (Plugins) - -Connectivity is strictly decoupled from the Core engine. The Core knows *nothing* about specific SQL drivers, connection strings, or dialect quirks. - -## The Architecture - -We use Python's `importlib.metadata` entry points to discover and load adapters dynamically. - -```mermaid -sequenceDiagram - participant Core - participant Registry - participant Plugin as PostgresAdapter - - Core->>Registry: get_adapter("postgres") - Registry->>Plugin: Load Entry Point - Plugin-->>Registry: Return Adapter Instance - - - Plugin-->>Core: { dry_run: true, limit: true } - - Core->>Plugin: execute("SELECT 1") - Plugin-->>Core: Result(rows=[(1)]) -``` - -## The Interface - -Every adapter adheres to the contract defined in `nl2sql-adapter-sdk`. - -| Method | Purpose | -| :--- | :--- | - -| `fetch_schema()` | Returns `SchemaMetadata` (Tables, Columns, Types). | -| `execute(sql)` | Runs the query and returns normalized `QueryResult`. | -| `cost_estimate(sql)` | Returns estimated rows/cost (optional). | - -## Creating a New Adapter - -1. Create a new python package (e.g. `nl2sql-oracle`). -2. Implement `DatasourceAdapter`. -3. 
Register the entry point in `pyproject.toml`: - -```toml -[project.entry-points."nl2sql.adapters"] -oracle = "nl2sql_oracle.adapter:OracleAdapter" -``` diff --git a/docs/architecture/indexing.md b/docs/architecture/indexing.md deleted file mode 100644 index 22fe93b..0000000 --- a/docs/architecture/indexing.md +++ /dev/null @@ -1,59 +0,0 @@ -# Data Indexing Strategy - -**Objective**: Move beyond schema-only understanding by indexing structured knowledge about data. This improves routing accuracy, prevents hallucinations, and enhances SQL planning. - -## The Problem: Data Blindness - -Without indexing, the AI knows the *tables* but not the *data*. - -1. **Hallucinations**: Asking for "Gold Tier" users when the data uses `tier='premium_level_1'`. -2. **Inefficient Routing**: Searching legacy tables for recent data. -3. **Invalid Assumptions**: Querying NULL columns. - -## Layered Indexing Model - -We use a layered approach to balance cost and accuracy. - -### Layer 1: Statistical Index (Must Have) - -Stores metadata *about* the data, not the data itself. - -* Row counts -* Distinct values -* Null percentages -* Min/Max sets (e.g. date ranges) - -**Benefit**: Helps the Planner know if a query will return 0 rows before writing SQL. - -### Layer 2: Sample Data (Controlle) - -Stores a small, stratified sample of actual rows (e.g. 100 rows). - -* **Security**: PII is masked. -* **Usage**: Used for few-shot prompting in the Generator. - -### Layer 3: Business Entity Index - -Maps business terms to database realities. - -* "Client" -> `customers` table. -* "Revenue" -> `total_amount` column. -* **Storage**: Vector embeddings for semantic search. - -## Retrieval Strategy - -When a query arrives, the **Decomposer** and **Planner** retrieve context in this order: - -1. **Schema Context**: Table definitions. -2. **Entity Index**: Mapping synonyms. -3. **Statistical Profile**: Validating values. -4. **Sample Data**: Only if needed for complex join logic. - -## Security & Privacy - -> [!IMPORTANT] -> We **NEVER** index full raw data blindly. - -* **PII**: Automatically masked or excluded. -* **Financials**: Sensitive identifiers are hashed. -* **Tenant Isolation**: Separate indexes for separate logical tenants. diff --git a/docs/architecture/overview.md b/docs/architecture/overview.md deleted file mode 100644 index 652a71d..0000000 --- a/docs/architecture/overview.md +++ /dev/null @@ -1,44 +0,0 @@ -# System Overview - -The **NL2SQL Platform** is a modular, agentic system designed to convert natural language questions into accurate, authorized SQL queries across multiple database engines. - -## Core Philosophy - -The system is built on three pillars: - -1. **Agentic Workflow**: Instead of a "one-shot" generation, we use a multi-step **SQL Agent** that plans, validates, and refines queries. -2. **Map-Reduce Routing**: Complex queries are decomposed and routed to the most relevant datasources (Map), then executed in parallel, and finally aggregated (Reduce). -3. **Plugin Architecture**: The core engine is database-agnostic. Connectivity is handled by "Adapters" (Plugins) that strictly implement the `nl2sql-adapter-sdk`. - -## High-Level Architecture - -The system follows a directed graph execution flow, orchestrating several specialized AI Agents. 
- -```mermaid -graph TD - UserQuery["User Query"] --> Semantic["Semantic Analysis"] - Semantic --> Decomposer["Decomposer Node"] - Decomposer -- "Splits Query" --> MapBranching["Fan Out (Map)"] - - subgraph Execution_Layer ["Execution Layer (Parallel)"] - MapBranching --> SQL_Agent["SQL Agent (Planner + Validator + Executor)"] - end - - SQL_Agent -- "Result Set" --> Reducer["State Aggregation"] - Reducer --> Aggregator["Aggregator Node"] - Aggregator --> FinalAnswer["Final Response"] -``` - -### 1. Semantic Analysis & Decomposition - -The entry point. It understands the user's intent ("Compare sales in US and EU") and identifies which datasources contain the relevant data. If necessary, it splits the query into sub-questions (e.g., "Get US sales", "Get EU sales"). - -### 2. The SQL Agent - -For *each* relevant datasource, a dedicated **SQL Agent** is spawned. This is where the core logic lives. It iteratively plans a query, validates it against security policies, generates SQL, and verifies execution feasibility. - -[Read more about the SQL Agent](./sql_agent.md). - -### 3. Aggregation (Reduce) - -Once all agents complete their work, the **Aggregator Node** synthesizes the results. It can perform cross-database operations (like calculating the difference between the two sub-queries) and formats the final answer for the user. diff --git a/docs/architecture/routing.md b/docs/architecture/routing.md deleted file mode 100644 index 09b4688..0000000 --- a/docs/architecture/routing.md +++ /dev/null @@ -1,52 +0,0 @@ -# Routing Strategy - -The platform operates in a specific "Map-Reduce" pattern to handle multi-datasource environments. - -## The Challenge - -A user might ask: *"Compare the sales in our US Store (Postgres) vs. the EU Store (MySQL)."* - -This request cannot be solved by a single database query. It requires: - -1. Identifying that two different databases are involved. -2. Splitting the question into two independent sub-queries. -3. Routing each sub-query to the correct engine. - -## The Solution: Decomposer Node - -The **Decomposer Node** is responsible for this logic. - -```mermaid -graph LR - UserQuery --> VectorSearch["Vector Search (Schema Index)"] - VectorSearch --> Context["Relevant Tables Context"] - Context --> Decomposer["Decomposer LLM"] - Decomposer --> SubQueryA["Query A -> DB 1"] - Decomposer --> SubQueryB["Query B -> DB 2"] -``` - -### 1. Vector Search (Indexing) - -Before runtime, we "Index" all datasources. This creates embeddings for: - -* Table Names -* Column Descriptions -* Sample Data (Few-Shot examples) - -When a query arrives, we first search this index to find which tables are semantically relevant. - -### 2. Decomposition - -The LLM is presented with the user question and the *list of potentially relevant tables*. It is asked to: - -1. Determine if the query is ambiguous or multi-part. -2. If so, split it. -3. Assign each split query to a specific `datasource_id`. - -### 3. Fan-Out (Map) - -The system then spawns parallel execution branches. Each branch receives: - -* The simplified sub-query. -* The specific `datasource_id`. -* The schema context for that datasource. diff --git a/docs/architecture/security.md b/docs/architecture/security.md deleted file mode 100644 index 080c856..0000000 --- a/docs/architecture/security.md +++ /dev/null @@ -1,171 +0,0 @@ -# Security Architecture - -The NL2SQL Platform implements a multi-layered security approach designed to ensure that users can only access data they are explicitly authorized to see. 
This document outlines the security mechanisms for authentication, authorization, and query validation. - -## 1. Authentication & Context - -The platform assumes that the caller (API or CLI) has already authenticated the user. The user's identity and roles are passed into the execution pipeline via the `user_context` dictionary in the `GraphState`. - -### User Context Structure - -The `user_context` must contain the following keys to enable authorization checks: - -```json -{ - "role": "sales_analyst", - "allowed_datasources": ["manufacturing_history", "manufacturing_supply"], - "allowed_tables": [ - "customers", "sales_orders", "products", "inventory" - ] -} -``` - -* **role**: A label for logging and auditing purposes. -* **allowed_datasources**: A list of `datasource_id` strings that the user can query. Use `["*"]` for full access (e.g., Admin). -* **allowed_tables**: A list of table names the user can access. This enables fine-grained Row/Table-Level Security (RLS/TLS). - -## 2. Authorization Layers - -Security checks are performed at two distinct stages of the pipeline to fail fast and prevent unauthorized data access. - -### Layer 1: Datasource Access (Routing) - -**Component**: `DecomposerNode` & `OrchestratorVectorStore` - -Before any query planning begins, the system enforces **Knowledge Isolation**. - -* **Logic**: - 1. Extracts `allowed_datasources` from `state.user_context`. - 2. If the list is empty, the request is immediately rejected. - 3. **Vector Store Partitioning**: During context retrieval (RAG), the search is strictly filtered using `filter={"datasource_id": {"$in": allowed_ids}}`. - 4. This ensures that the LLM is **never** exposed to schema definitions or example questions from unauthorized datasources, effectively partitioning the "knowledge base" per user. - -### Layer 2: Table Access (Logical Validation) - -**Component**: `LogicalValidatorNode` - -After the `PlannerNode` generates an abstract syntax tree (AST) for the query, the `LogicalValidatorNode` performs a strict policy check. - -* **Logic**: - 1. Extracts distinct table names from the `PlanModel` (AST). - 2. Compares them against `state.user_context["allowed_tables"]`. - 3. If any table in the plan is not in the allowed list, a critical `SECURITY_VIOLATION` error is raised. - 4. The pipeline terminates immediately; the query is never generated or executed. - -## 3. Query Safety & Validation - -Beyond RBAC, the system enforces strict structural constraints to prevent SQL injection and accidental mutation. - -### Read-Only Enforcement - -The `LogicalValidatorNode` enforces that all generated plans are strictly `READ` operations (SELECT). - -* Mutation statements (INSERT, UPDATE, DELETE, DROP) are structurally impossible to represent in the `PlanModel` AST. -* Even if the planner were to hallucinate a non-read type, the validator acts as a firewall and rejects it. - -### Validated AST vs. Raw SQL - -The system does **not** rely on the LLM to generate raw SQL strings directly. - -1. **Planner**: Generates a typed JSON AST (`PlanModel`). -2. **Validator**: Validates the AST logic and security. -3. **Generator**: Compiles the validated AST into SQL using a deterministic compiler (`Visitor` pattern). - -This separation ensures that "Prompt Injection" attacks cannot easily force the model to output malicious SQL syntax, as the Generator controls the final output syntax. - -## 4. 
Resource Protection (DoS) - -To prevent Denial of Service (DoS) attacks or run-away queries, the system implements resource safeguards in the `ExecutorNode`. - -* **Cost Estimation Safeguard**: Before execution, the system requests a cost estimate from the database adapter. If the estimated rows exceed `row_limit` (configured in `datasources.yaml`, default: 1000), execution is aborted with a `SAFEGUARD_VIOLATION`. -* **Timeouts**: Datasource configurations (`datasources.yaml`) support `statement_timeout_ms` to kill long-running queries at the driver level (Native enforcement for Postgres/MySQL). -* **Payload Size Limit**: The system enforces a strict memory limit (`max_bytes`) on the serialized result set. If the data returned by the adapter exceeds this limit (default: 10MB), the execution is halted to prevent OOM (Out Of Memory) crashes. - -## 5. Secret Management - -The platform employs a **Pluggable Secret Management** system (`nl2sql.secrets`) to handle sensitive credentials securely. - -### Mechanism - -* **Configuration**: Secrets providers are defined in `secrets.yaml`. -* **Template Hydration**: Secrets are referenced in other config files (like `datasources.yaml`) using the syntax `${provider_id:key}`. -* **Resolution**: The `SecretManager` resolves these references *before* the configuration is parsed, ensuring that sensitive values are never hardcoded in YAML files or committed to version control. - -### Providers - -The system supports extensible providers via the `SecretProvider` protocol. You configure them in `secrets.yaml` with a unique `id`. - -1. **Environment (`env`)**: Standard lookup (e.g., `${env:DB_PASS}`). Always available. -2. **AWS Secrets Manager**: Defined by type `aws`. (e.g., `${aws-prod:db/pass}`). -3. **Azure Key Vault**: Defined by type `azure`. (e.g., `${azure-main:db-secret}`). -4. **HashiCorp Vault**: Defined by type `hashi`. (e.g., `${vault-internal:secret/data/db:pass}`). - -**Dependencies**: Cloud providers require optional extras (`nl2sql-core[aws]`, `nl2sql-core[azure]`, etc.) to keep the core lightweight. - -## 6. Configuration Security - -### Strict Validation (V3) - -Datasource configurations are validated strictly at load time. This ensures: - -* **Type Safety**: Malformed integers or booleans are rejected. -* **Field Constraints**: Unknown fields are forbidden, preventing "config injection" or typos. -* **Sanitization**: Passwords and sensitive fields are masked in logs. -* **Adapter Specifics**: Each adapter (e.g., `PostgresAdapter`) defines and validates its own configuration schema requirements. - -### 6.1 Policy Definition (`configs/policies.json`) - -The application uses **Role-Based Access Control (RBAC)**. The `policies.json` file defines policies keyed by **Role ID** (e.g., `admin`, `analyst`). - -**Strict Namespacing Rule**: To prevent namespace collisions, `allowed_tables` MUST use the format `datasource_id.table_name`. Simple table names are not supported. - -#### Example - -```json -{ - "sales_analyst": { - "description": "Access to Sales DB only", - "role": "analyst", - "allowed_datasources": ["sales_db"], - "allowed_tables": [ - // Exact Match - "sales_db.orders", - - // Datasource Wildcard - "sales_db.customers_*" - ] - }, - "admin": { - "description": "Super Admin", - "role": "admin", - "allowed_tables": ["*"] - } -} -``` - -In CLI execution: `nl2sql run ... --role sales_analyst`. -The application assumes the identity provider has already authenticated the user and assigned this role. 
- -### 6.2 Policy Schema & Validation - -Policies are treated as **Configuration Code**. To prevent misconfiguration (e.g., typos, invalid types), the system validates `policies.json` against a strict **Pydantic Schema** at startup. - -**Schema (`nl2sql.security.policies`)**: - -1. **Strict Typing**: Fields like `allowed_datasources` MUST be lists of strings. -2. **Syntax Enforcement**: `allowed_tables` values are validated ensuring they match the `datasource_id.table_name` or wildcard format. -3. **Fail Fast**: If the configuration is invalid, the application refuses to start, printing a clear error message describing the violation. - -### 6.3 Policy Management CLI - -You can validate your policy file without running a query using the CLI. - -```bash -# Validate Syntax & Integrity -nl2sql policy validate -``` - -This command performs two checks: - -1. **Schema Check**: Validates syntax against the Pydantic model. -2. **Integrity Check**: Verifies that referenced `datasources` and `tables` actually exist in `datasources.yaml`. Users often typo table names; this catches those errors before runtime. diff --git a/docs/architecture/sql_agent.md b/docs/architecture/sql_agent.md deleted file mode 100644 index bbafdf7..0000000 --- a/docs/architecture/sql_agent.md +++ /dev/null @@ -1,66 +0,0 @@ -# The SQL Agent - -The **SQL Agent** is the core execution engine of the platform. For every relevant datasource identified by the routing layer, a dedicated SQL Agent is instantiated. - -Unlike traditional text-to-SQL systems that attempt to generate SQL in one shot, the SQL Agent treats the problem as a **multi-step reasoning process**. - -## Execution Pipeline - -The pipeline is modeled as a state machine: - -```mermaid -graph TD - Planner["Planner Node"] --> LogicalValidator{"Logical Validator"} - - LogicalValidator -- "Valid" --> Generator["Generator Node"] - LogicalValidator -- "Invalid" --> Refiner["Refiner Node"] - - Generator --> PhysicalValidator{"Physical Validator"} - - PhysicalValidator -- "Valid" --> Executor["Executor Node"] - PhysicalValidator -- "Invalid" --> Refiner - - Refiner -- "Feedback" --> Planner -``` - -### 1. Planner Node - -* **Input**: User Query, Schema Context (Tables). -* **Role**: Reasoning & Planning. -* **Output**: A recursive Abstract Syntax Tree (AST), *not* SQL. -* **Why?**: Generating an AST allows us to validate structure and intent *before* committing to a specific SQL dialect syntax. - -### 2. Logical Validator Node - -* **Input**: `PlanModel` (AST). -* **Checks**: - * **Structure**: Are joins valid? Do column aliases exist? - * **Authorization (RBAC)**: Does the user have permission to access the referenced tables? -* **Outcome**: If invalid, the error is sent to the **Refiner**. - -### 3. Generator Node - -* **Input**: Validated `PlanModel`. -* **Role**: Compiler. -* **Output**: Dialect-specific SQL string (e.g., T-SQL, PostgreSQL). -* **Mechanism**: Uses the Visitor pattern to traverse the AST and emit SQL compatible with the specific adapter's capabilities. - -### 4. Physical Validator Node - -* **Input**: Generated SQL String. -* **Checks**: - * **Semantic**: Performs a "Dry Run" (if supported) to check for runtime errors (e.g., function mismatches). - * **Performance**: Estimates query cost/rows to prevent database overload. -* **Outcome**: If dry run fails or cost is too high, it loops back to **Refiner**. - -### 5. Executor Node - -* **Input**: Validated SQL. -* **Role**: Execution. -* **Output**: Raw result set. - -### 6. 
Refiner Node
-
-* **Input**: Validation Errors + Failed Plan.
-* **Role**: Error Recovery.
-* **Mechanism**: Uses an LLM to analyze *why* the plan failed and provides specific instructions to the Planner for the next attempt.
diff --git a/docs/core/agents.md b/docs/core/agents.md
new file mode 100644
index 0000000..69b83e7
--- /dev/null
+++ b/docs/core/agents.md
@@ -0,0 +1,43 @@
+# Agents & Subgraphs
+
+While Nodes are the building blocks, **Subgraphs** define the control flow and agentic behaviors.
+
+## The SQL Agent (ReAct Loop)
+
+The core "thinking" engine for a single datasource is encapsulated in the **SQL Agent Subgraph**. It implements a **ReAct (Reasoning + Acting)** loop pattern.
+
+### Flow Diagram
+
+```mermaid
+graph LR
+    Plan(Planner) --> L_Val{Logical Validator}
+    L_Val -->|Pass| Gen(Generator)
+    L_Val -->|Fail| Ref(Refiner)
+
+    Gen --> P_Val{Physical Validator}
+    P_Val -->|Pass| Exec(Executor)
+    P_Val -->|Fail| Ref
+
+    Ref --> Plan
+
+    Exec --> End
+```
+
+### Self-Correction Mechanism
+
+The `RefinerNode` acts as the feedback mechanism. It is only invoked when a validation or execution step fails.
+
+1. **Error Capture**: The state collects `PipelineError` objects.
+2. **Retry Handler**: Checks `retry_count`. Max retries: **3**.
+3. **Refinement**: The Refiner consumes the Error + Original Plan and outputs **Feedback**.
+4. **Re-Planning**: The Planner receives this feedback and generates a new `plan_v2`.
+
+### Execution Subgraph
+
+The `SQL Agent` is wrapped inside an **Execution Subgraph**, which handles:
+
+1. **Initialization**: Loading vector stores and adapters.
+2. **Formatting**: Converting raw execution results into a standardized dictionary for the Aggregator.
+
+::: nl2sql.pipeline.subgraphs.sql_agent.build_sql_agent_graph
+::: nl2sql.pipeline.subgraphs.execution.build_execution_subgraph
diff --git a/docs/core/architecture.md b/docs/core/architecture.md
new file mode 100644
index 0000000..2a223a8
--- /dev/null
+++ b/docs/core/architecture.md
@@ -0,0 +1,60 @@
+# Architecture
+
+The NL2SQL Platform is built on a **State-Machine** architecture using [LangGraph](https://langchain-ai.github.io/langgraph/). This allows for deterministic execution flows with the flexibility of agentic loops (retries, self-correction).
+
+## The Graph State
+
+Central to the architecture is the `GraphState` object, which acts as a shared memory space for all nodes. It is passed from node to node, accumulating context, plans, and results.
+
+::: nl2sql.pipeline.state.GraphState
+
+## Data Flow Lifecycle
+
+1. **Ingestion**: User query enters the pipeline.
+2. **Semantic Analysis**:
+    * Query is canonicalized (spelling correction, lowercasing).
+    * Intent is classified (SQL Generation vs General Chat).
+3. **Decomposition (The Router)**:
+    * The **DecomposerNode** analyzes the complexity.
+    * Retrieves relevant Schema and Examples from the Vector Store.
+    * Determines if the query needs to be split (Multi-Datasource) or routed to a single source.
+4. **SQL Agent Loop (Per Datasource)**:
+    * If valid, a **PlannerNode** creates an Abstract Syntax Tree (AST) plan.
+    * **LogicalValidator** checks the AST for security and structure.
+    * **GeneratorNode** converts the AST to dialect-specific SQL.
+    * **PhysicalValidator** performs a dry-run and cost estimation.
+    * **ExecutorNode** runs the query in a sandboxed, read-only connection.
+    * *Errors at any stage trigger the **RefinerNode** for self-correction. A wiring sketch follows this list.*
+5. **Aggregation**:
+    * **AggregatorNode** collects results from all execution branches.
+    * Synthesizes a final answer or passes through raw data (Fast Path).
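+
+To make the loop concrete, here is a minimal sketch of how such a retry loop is wired as a LangGraph state machine. The node functions and the `route` predicate are illustrative stubs; the real construction lives in `nl2sql.pipeline.graph` and the subgraph builders.
+
+```python
+from typing import TypedDict
+
+from langgraph.graph import StateGraph, END
+
+
+class AgentState(TypedDict):
+    # Simplified stand-in for the real GraphState.
+    user_query: str
+    plan: dict
+    errors: list
+    retry_count: int
+
+
+def planner(state: AgentState) -> dict:
+    return {"plan": {"tables": ["orders"]}}  # stub
+
+
+def validator(state: AgentState) -> dict:
+    return {}  # stub: would append PipelineErrors on failure
+
+
+def refiner(state: AgentState) -> dict:
+    return {"retry_count": state["retry_count"] + 1}  # stub feedback
+
+
+def route(state: AgentState) -> str:
+    # Fail -> refine (up to 3 retries), otherwise finish.
+    if state["errors"] and state["retry_count"] < 3:
+        return "refine"
+    return "done"
+
+
+graph = StateGraph(AgentState)
+graph.add_node("planner", planner)
+graph.add_node("validator", validator)
+graph.add_node("refiner", refiner)
+graph.set_entry_point("planner")
+graph.add_edge("planner", "validator")
+graph.add_conditional_edges("validator", route, {"refine": "refiner", "done": END})
+graph.add_edge("refiner", "planner")
+app = graph.compile()
+
+# app.invoke({"user_query": "...", "plan": {}, "errors": [], "retry_count": 0})
+```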
+
+## Entity Relationship
+
+The platform models SQL concepts using Pydantic models to ensure type safety before any SQL is generated.
+
+```mermaid
+classDiagram
+    class GraphState {
+        user_query: str
+        plan: PlanModel
+        sql_draft: str
+        execution: ExecutionResult
+        errors: List[PipelineError]
+    }
+
+    class PlanModel {
+        query_type: READ
+        tables: List[TableNode]
+        joins: List[JoinNode]
+        filters: List[FilterNode]
+    }
+
+    class TableNode {
+        name: str
+        alias: str
+    }
+
+    GraphState --> PlanModel
+    PlanModel --> TableNode
+```
diff --git a/docs/core/environment.md b/docs/core/environment.md
new file mode 100644
index 0000000..20b1d1b
--- /dev/null
+++ b/docs/core/environment.md
@@ -0,0 +1,35 @@
+# Environment Management
+
+The platform employs a **Universal Environment Protocol** to manage configuration across Development, Demo, and Production.
+
+## .env Files
+
+Configuration is loaded from `.env.{environment}` files. The global `.env` is loaded first, followed by the specific environment file, which overrides values.
+
+* `dev`: `.env.dev`
+* `demo`: `.env.demo`
+* `prod`: `.env.prod`
+
+### Generating Environments
+
+The CLI provides a generator to create strict, compliant environment files:
+
+```bash
+nl2sql setup --env prod
+```
+
+This ensures required keys (like `OPENAI_API_KEY` and `SecretProvider` configs) are present.
+
+### Secrets Injection
+
+Secrets are dynamically injected during the build process or at runtime using the `SecretManager`. References in the `.env` file can use the `${provider:key}` syntax.
+
+Example `.env.prod`:
+
+```ini
+# Database Connection
+DB_PASSWORD=${aws-secrets:prod-db-password}
+API_KEY=${env:OPENAI_API_KEY}
+```
+
+::: nl2sql.cli.generators.env.generator.EnvFileGenerator
diff --git a/docs/core/indexing.md b/docs/core/indexing.md
new file mode 100644
index 0000000..e0d8527
--- /dev/null
+++ b/docs/core/indexing.md
@@ -0,0 +1,55 @@
+# Retrieval & Indexing Strategies
+
+The platform uses a specialized **Orchestrator Vector Store** to manage context for the Decomposer Node. This ensures that even with hundreds of tables, the LLM only receives the most relevant schema information.
+
+## Indexing Strategy
+
+We index two main types of documents: **Schema** and **Examples**.
+
+### 1. Schema Indexing
+
+Table schemas are "flattened" into a rich text representation before embedding.
+
+* **Format**: `Table: {name} (Alias: {alias}). Columns: {col_desc}. Primary Key: {pk}. Foreign Keys: {fk}.`
+* **Offline Aliasing**: To prevent conflicts, tables are assigned canonical aliases (e.g., `sales_db_t1`) during indexing.
+* **Rich Metadata (via Adapter)**:
+    * **Sample Values**: Adapters populate `col.statistics.sample_values`. We embed the **Top-5** distinct values for categorical columns (e.g., `status: ['ACTIVE', 'PENDING']`).
+    * **Range Data**: Adapters populate `min_value`/`max_value`. These are indexed for numeric/date columns (e.g., `created_at: [2023-01..2024-12]`).
+
+### 2. Example Indexing (Few-Shot)
+
+Routing examples are indexed to help the model reliably distinguish between similar datasources.
+
+* **Source**: Loaded from `configs/sample_questions.yaml`.
+* **Enrichment (Semantic Variants)**:
+    The `SemanticAnalysisNode` is used to generate variants for each example to maximize retrieval surface area (see the sketch after this list).
+    1. **Original**: "Show me my orders"
+    2. **Canonical**: "select orders from sales_db" (Hypothetical SQL-like intent)
+    3. **Meta-Text**: "purchases transactions history active items" (Keywords & Synonyms)
+
+    *Each variant is stored as a separate vector document pointing to the same example.*
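+
+A minimal sketch of what that storage looks like against a Chroma collection. The collection name, metadata keys, and the pre-generated variant strings are illustrative; the real logic lives in `nl2sql.services.vector_store.OrchestratorVectorStore`.
+
+```python
+import chromadb
+
+client = chromadb.PersistentClient(path="./chroma_db")
+collection = client.get_or_create_collection("orchestrator")
+
+# Three variants of one routing example, all pointing back to it
+# via shared metadata (example_id).
+variants = {
+    "original": "Show me my orders",
+    "canonical": "select orders from sales_db",
+    "meta_text": "purchases transactions history active items",
+}
+
+collection.add(
+    ids=[f"ex-42::{kind}" for kind in variants],
+    documents=list(variants.values()),
+    metadatas=[
+        {"type": "example", "datasource_id": "sales_db",
+         "example_id": "ex-42", "variant": kind}
+        for kind in variants
+    ],
+)
+```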
+
+## Retrieval Strategy
+
+We employ a **Partitioned Retrieval** strategy using **Maximal Marginal Relevance (MMR)**.
+
+### The Problem
+
+If we simply retrieved the top-10 chunks, a query like "Show me users" might return 10 "User" tables from different databases, crowding out helpful "Example" chunks that explain *which* User table is relevant.
+
+### The Solution: Partitioned MMR
+
+1. **Fetch Top-K Tables**: Independent MMR search for `type: table`.
+2. **Fetch Top-K Examples**: Independent MMR search for `type: example`.
+3. **Merge**: The results are concatenated.
+
+This ensures the Decomposer always receives both the **Schema Candidates** AND the **Instructional Examples**.
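+
+A minimal sketch of partitioned MMR retrieval, assuming a LangChain-style Chroma store; the store construction and the metadata keys mirror the conventions above but are illustrative, and the real method is `OrchestratorVectorStore.retrieve_routing_context`.
+
+```python
+from langchain_chroma import Chroma
+from langchain_openai import OpenAIEmbeddings
+
+store = Chroma(
+    collection_name="orchestrator",
+    embedding_function=OpenAIEmbeddings(model="text-embedding-3-small"),
+    persist_directory="./chroma_db",
+)
+
+def retrieve_routing_context(query: str, allowed_ids: list[str]):
+    # One independent MMR search per partition, both scoped to the
+    # datasources the caller is authorized to see.
+    def search(doc_type: str, k: int):
+        return store.max_marginal_relevance_search(
+            query,
+            k=k,
+            fetch_k=4 * k,  # wider candidate pool for diversity
+            filter={"$and": [
+                {"type": doc_type},
+                {"datasource_id": {"$in": allowed_ids}},
+            ]},
+        )
+
+    # Merge: schema candidates first, then instructional examples.
+    return search("table", k=5) + search("example", k=5)
+```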
+
+### Filtering
+
+Retrieval is strictly scoped by security considerations:
+
+* **AuthZ**: `filter={'datasource_id': {'$in': allowed_ids}}` is applied to every query. The user can *never* retrieve a schema they don't have access to.
+
+::: nl2sql.services.vector_store.OrchestratorVectorStore.index_schema
+::: nl2sql.services.vector_store.OrchestratorVectorStore.retrieve_routing_context
diff --git a/docs/core/nodes.md b/docs/core/nodes.md
new file mode 100644
index 0000000..48d6550
--- /dev/null
+++ b/docs/core/nodes.md
@@ -0,0 +1,88 @@
+# Nodes & Pipeline
+
+The pipeline is composed of specialized **Nodes**. Each node has a single responsibility and is designed to be validated independently.
+
+## 1. Semantic Analysis Node
+
+**Responsibility**: Pre-processing and Intent Classification.
+
+* **Canonicalization**: Corrects spelling and formats entities.
+* **Enrichment**: Adds synonyms and keywords to aid retrieval.
+
+::: nl2sql.pipeline.nodes.semantic.node.SemanticAnalysisNode
+
+## 2. Decomposer Node (The Router)
+
+**Responsibility**: Complexity analysis and Datasource Routing.
+
+* **Retrieval**: Fetches relevant schemas and few-shot examples using **Partitioned MMR**.
+* **Splitting**: Breaks down complex/ambiguous questions into sub-queries targeted at specific datasources.
+
+::: nl2sql.pipeline.nodes.decomposer.node.DecomposerNode
+
+## 3. Planner Node
+
+**Responsibility**: Logical Planning (Schema Hydration).
+
+* **Input**: User Query + Schema Context.
+* **Output**: `PlanModel` (AST).
+* **Logic**: Identifies tables, resolves joins using Foreign Keys, and maps natural language filters to columns.
+
+::: nl2sql.pipeline.nodes.planner.node.PlannerNode
+
+## 4. Logical Validator
+
+**Responsibility**: Static Analysis & Security.
+
+* **Checks**:
+    * Column Existence (scoping aliases).
+    * Join Validity (keys exist).
+    * Policy Compliance (see `safety/security.md`).
+
+::: nl2sql.pipeline.nodes.validator.node.LogicalValidatorNode
+
+## 5. Generator Node
+
+**Responsibility**: Code Generation.
+
+* **Input**: Validated `PlanModel`.
+* **Output**: Dialect-specific SQL string.
+* **Logic**: Uses Jinja2 templates and strict typing to convert the AST into executable SQL.
+
+::: nl2sql.pipeline.nodes.generator.node.GeneratorNode
+
+## 6. Physical Validator
+
+**Responsibility**: Execution Readiness.
+
+* **Dry Run**: Executes `EXPLAIN` or equivalent to verify syntax validity without running the query.
+* **Performance**: Checks row cost estimates against `row_limit`.
+
+::: nl2sql.pipeline.nodes.validator.physical_node.PhysicalValidatorNode
+
+## 7. Executor Node
+
+**Responsibility**: Sandboxed Execution.
+
+* **Logic**: Connects to the database using the specific Adapter and executes the query.
+* **Safety**: Read-only connection enforcement.
+
+::: nl2sql.pipeline.nodes.executor.node.ExecutorNode
+
+## 8. Refiner Node
+
+**Responsibility**: Self-Correction.
+
+* **Trigger**: Activated by Validator or Executor errors.
+* **Logic**: Analyzes the error stack trace plus the previous plan, and generates feedback for the Planner to retry.
+
+::: nl2sql.pipeline.nodes.refiner.node.RefinerNode
+
+## 9. Aggregator Node
+
+**Responsibility**: Result Synthesis.
+
+* **Fast Path**: If single result set & no errors -> Return Data.
+* **Slow Path**: If multiple results or errors -> Use LLM to summarize and explain.
+
+::: nl2sql.pipeline.nodes.aggregator.node.AggregatorNode
diff --git a/docs/dev/api.md b/docs/dev/api.md
new file mode 100644
index 0000000..63920ce
--- /dev/null
+++ b/docs/dev/api.md
@@ -0,0 +1,31 @@
+# API Reference
+
+This section provides auto-generated documentation for the core Python API.
+
+## Core Package
+
+### Pipeline & Graph
+
+::: nl2sql.pipeline.graph
+::: nl2sql.pipeline.state.GraphState
+
+### Nodes
+
+::: nl2sql.pipeline.nodes.semantic.node
+::: nl2sql.pipeline.nodes.decomposer.node
+::: nl2sql.pipeline.nodes.planner.node
+::: nl2sql.pipeline.nodes.validator.node
+::: nl2sql.pipeline.nodes.generator.node
+::: nl2sql.pipeline.nodes.executor.node
+::: nl2sql.pipeline.nodes.refiner.node
+::: nl2sql.pipeline.nodes.aggregator.node
+
+### Services
+
+::: nl2sql.services.vector_store.OrchestratorVectorStore
+::: nl2sql.services.llm.LLMRegistry
+
+## Adapter SDK
+
+::: nl2sql_adapter_sdk.interfaces
+::: nl2sql_adapter_sdk.models
diff --git a/docs/getting-started/demo.md b/docs/getting-started/demo.md
new file mode 100644
index 0000000..f5e908d
--- /dev/null
+++ b/docs/getting-started/demo.md
@@ -0,0 +1,65 @@
+# Quickstart (Demo)
+
+The CLI comes with a powerful `setup` command that can generate a fully functional **Demo Environment**, allowing you to test the platform without connecting to your production databases.
+
+## 1. Lite Mode (Fastest)
+
+**Lite Mode** uses **SQLite** databases. It requires no Docker containers and runs entirely locally. Ideal for quick logic verification.
+
+```bash
+nl2sql setup --demo --lite
+```
+
+**What happens?**
+
+1. Creates a `.env.demo` file.
+2. Generates a local `my_sqlite_db.db`.
+3. Populates it with sample data (e.g., "Users", "Orders").
+4. Configures `configs/datasources.yaml` to point to this file.
+
+**Run a Query:**
+
+```bash
+nl2sql --env demo run "Show me all users in the system"
+```
+
+## 2. Docker Mode (Full Stack)
+
+**Docker Mode** spins up real **PostgreSQL** or **MySQL** containers using `docker-compose`. This simulates a real production environment.
+
+```bash
+nl2sql setup --demo --docker --api-key sk-...
+```
+
+* `--api-key`: Optional. If provided, sets up your LLM configuration immediately.
+
+**What happens?**
+
+1. Creates a `deploy/docker` directory with `docker-compose.yml`.
+2. Starts PostgreSQL/MySQL containers.
+3. Waits for health checks.
+4. Seeds the databases with sample schema and data.
+5. Creates `.env.demo` pointing to these containers (localhost ports).
+
+## 3. Interactive Wizard
+
+If you run `setup` without flags, you enter the Interactive Wizard:
+
+```bash
+nl2sql setup
+```
+
+The wizard will guide you through:
+
+1. **Environment Selection**: Dev, Demo, or Prod.
+2. **Datasource Configuration**: Host, Port, User, Password (securely handled).
+3. 
**LLM Configuration**: OpenAI, Gemini, or Ollama selection. +4. **RBAC Policies**: Generating default Admin roles. + +## Next Steps + +Once set up, you should run **Indexing** to prepare the Vector Store: + +```bash +nl2sql --env demo index +``` diff --git a/docs/getting-started/installation.md b/docs/getting-started/installation.md new file mode 100644 index 0000000..08b90bf --- /dev/null +++ b/docs/getting-started/installation.md @@ -0,0 +1,52 @@ +# Installation + +The NL2SQL Platform is designed to run as a modular Python application. It uses a **Monorepo** structure managed by `poetry` (optional) or standard `pip`. + +## Prerequisites + +* **Python 3.9+** +* **Docker** (Required for database containers in Demo/Test modes) +* **Git** + +## Local Development Setup + +Clone the repository: + +```bash +git clone https://github.com/your-org/nl2sql-platform.git +cd nl2sql-platform +``` + +### Option 1: Using Pip (Standard) + +Install the core package in editable mode: + +```bash +pip install -e packages/core +pip install -e packages/cli +pip install -e packages/adapter-sdk +# Install specific adapters as needed +pip install -e packages/adapters/postgres +``` + +### Option 2: Using Make (Convenience) + +If a `Makefile` is present: + +```bash +make install +``` + +## Verifying Installation + +Verify the CLI is installed and accessible: + +```bash +nl2sql --version +``` + +Run the `doctor` command to check for missing dependencies or configuration issues: + +```bash +nl2sql doctor +``` diff --git a/docs/guides/adapters.md b/docs/guides/adapters.md deleted file mode 100644 index 1560ae3..0000000 --- a/docs/guides/adapters.md +++ /dev/null @@ -1,53 +0,0 @@ -# Adapter Development Guide - -The NL2SQL Platform uses a modular Adapter architecture to support any database engine. - -## 1. Overview - -Adapters bridge the gap between our semantic engine and specific database drivers. They implement the `DatasourceAdapter` protocol. - -## 2. Creating a New Adapter - -### Step 1: Subclass Base Adapter - -Inherit from `BaseSQLAlchemyAdapter` to get free connection handling, introspection, and dry-run support. - -```python -from nl2sql_sqlalchemy_adapter import BaseSQLAlchemyAdapter - -class MyDbAdapter(BaseSQLAlchemyAdapter): - - def connect(self) -> None: - """Override to handle specific timeout or connection logic.""" - super().connect() -``` - -### Step 2: Implement Protocol Methods - -You must ensure these methods work for your dialect: - -* `fetch_schema()`: (Handled by Base via SQLAlchemy Inspector) -* `estimate_cost(sql)`: Return query cost/rows. -* `dry_run(sql)`: Verify SQL without committing. -* `explain(sql)`: Return execution plan. - -### Step 3: Deployment - -1. Package your adapter (e.g. `nl2sql-adapter-mydb`). -2. Register it via Python Entry Points in `pyproject.toml`: - -```toml -[project.entry-points."nl2sql.adapters"] -mydb = "nl2sql_mydb.adapter:MyDbAdapter" -``` - -## 3. Handling Timeouts - -Your adapter receives `statement_timeout_ms` (int, milliseconds) in `__init__`. - -* **Standard approach**: Use `BaseSQLAlchemyAdapter` logic (maps to `execution_options={"timeout": s}`). -* **Native approach**: Override `connect()` and inject dialect-specific arguments (e.g., `connect_args={"options": "-c statement_timeout=..."}` for Postgres). - -## 4. Testing - -Use the `packages/adapter-sdk/tests/conftest.py` fixtures to verify your adapter against a real database instance. 
diff --git a/docs/guides/benchmarking.md b/docs/guides/benchmarking.md deleted file mode 100644 index f7bd561..0000000 --- a/docs/guides/benchmarking.md +++ /dev/null @@ -1,28 +0,0 @@ -# Evaluation & Benchmarking - -The platform includes a comprehensive benchmarking suite to measure accuracy, stability, and routing performance. - -## Running Benchmarks - -Benchmarks are run via the CLI. - -```bash -# Run the full Golden Set -python -m nl2sql.cli --benchmark --dataset tests/golden_dataset.yaml -``` - -## Metrics - -1. **Execution Accuracy**: Measures if the returned data matches the "Ground Truth". - - *Method*: Executes Generated SQL vs Expected SQL and compares row sets. -2. **Semantic Accuracy**: Used when data mismatches. Uses an LLM Judge to check if the *intent* is equivalent. -3. **Routing Accuracy**: Did we pick the right database? -4. **Stability (Pass@K)**: Runs each query `K` times to ensure results are deterministic. - -## Benchmarking Options - -| Flag | Description | -| :--- | :--- | -| `--routing-only` | Skips SQL generation; tests Decomposer only. | -| `--iterations N` | Runs each queries N times (Pass@N stability). | -| `--export-path` | Saves results to generic JSON/CSV. | diff --git a/docs/guides/cli_reference.md b/docs/guides/cli_reference.md deleted file mode 100644 index b774a59..0000000 --- a/docs/guides/cli_reference.md +++ /dev/null @@ -1,117 +0,0 @@ -# CLI Reference - -The `nl2sql` Command Line Interface (CLI) is the primary way to interact with the platform. - -## Global Options - -These flags apply to all commands and must be specified **before** the subcommand. - -* `--env`, `-e TEXT`: Environment name (e.g. `dev`, `demo`, `prod`). Isolates configurations and vector stores. Defaults to `default` (Production). - -Example: - -```bash -# Uses configs/datasources.yaml -nl2sql run "query" - -# Uses configs/datasources.demo.yaml -nl2sql --env demo run "query" -``` - -## Commands - -### `setup` - -Interactive wizard to initialize the platform. - -```bash -nl2sql setup [OPTIONS] -``` - -**Options:** - -* `--demo`: **Quickstart Mode**. Automatically generates a "Manufacturing" demo environment with 4 SQLite databases and sample questions. -* `--docker`: Used with `--demo` to generate a `docker-compose.yml` for full-fidelity testing (Postgres/MySQL) instead of SQLite. - -### Example: Try it Now - -```bash -nl2sql setup --demo -``` - -### `index` - -Indexes database schemas and examples into the vector store for retrieval. - -```bash -nl2sql index [OPTIONS] -``` - -**Features:** - -* **Schema Indexing**: Introspects tables, columns, foreign keys, and comments. -* **Example Indexing**: Indexes sample questions for few-shot routing. -* **Granular Feedback**: Displays a checklist of indexed items per datasource. -* **Summary Table**: Shows total tables, columns, and examples indexed. - -**Options:** - -* `--config PATH`: Path to datasource config. -* `--vector-store PATH`: Path to vector store directory. - -**Example:** - -```bash -nl2sql --env demo index -``` - -### `run` - -Executes a natural language query against the configured datasources. - -```bash -nl2sql run [QUERY] [OPTIONS] -``` - -**Arguments:** - -* `QUERY`: The natural language question (e.g. "Show me active users"). - -**Options:** - -* `--role TEXT`: The RBAC role to assume (e.g. `admin`, `analyst`). Defaults to `admin`. -* `--no-exec`: Plan and Validate only, do not execute SQL. -* `--verbose`, `-v`: Show detailed reasoning traces and intermediate node outputs. -* `--show-perf`: Display timing metrics. 
- -**Example:** - -```bash -nl2sql run "Who bought the Bolt M5?" --role sales_analyst --verbose -``` - -### `policy` - -Manage RBAC policies. - -* `validate`: Validates syntax and integrity of `policies.json`. - -```bash -nl2sql policy validate -``` - -### `doctor` - -Diagnoses environment issues (Python version, missing adapters, connectivity). - -```bash -nl2sql doctor -``` - -### `benchmark` - -Runs the evaluation suite against a "Golden Dataset". - -```bash -nl2sql benchmark --dataset configs/benchmark.yaml -``` diff --git a/docs/guides/configuration.md b/docs/guides/configuration.md deleted file mode 100644 index 152ffe1..0000000 --- a/docs/guides/configuration.md +++ /dev/null @@ -1,181 +0,0 @@ -# Configuration Guide - -The NL2SQL Platform uses a strict, type-safe configuration system (Schema V3) defined in `datasources.yaml`. - -## 1. Quick Start - -Run the setup wizard to generate a valid configuration: - -```bash -nl2sql setup -``` - -## 2. Secrets Management - -Sensitive credentials should never be stored in plaintext. We support **strict** variable expansion using the `${provider_id:key}` syntax, powered by `secrets.yaml`. - -### 2.1 Configuration (`secrets.yaml`) - -Define your secret providers in a `secrets.yaml` file (or `configs/secrets.yaml`). - -```yaml -version: 1 -providers: - - id: azure-main - type: azure - vault_url: "https://my-vault.vault.azure.net/" - # You can resolve credentials from ENV here (Two-Phase Loading) - client_secret: "${env:AZURE_CLIENT_SECRET}" - - - id: aws-prod - type: aws - region_name: us-east-1 -``` - -### 2.2 Usage (`datasources.yaml`) - -Reference the secrets using the `id` defined above. - -> [!IMPORTANT] -> -> - **Strict Syntax**: Format must be exactly `${provider_id:key}`. -> - **Provider ID**: Matches the `id` field in `secrets.yaml`. -> - **Environment**: `${env:VAR}` is always available without config. - -**Example**: - -```yaml -connection: - host: localhost - # Uses 'aws-prod' provider defined in secrets.yaml - password: ${aws-prod:db/password} - # Uses built-in env provider - user: ${env:DB_USER} -``` - -**Example**: - -```yaml -connection: - host: localhost - password: ${env:DB_PASSWORD} # Valid - # invalid_host: my-db-${env:ID}.com <-- Partial interpolation is NOT supported -``` - -## 3. Supported Databases - -### PostgreSQL - -```yaml -- id: my_postgres - connection: - type: postgres - host: localhost - port: 5432 - user: admin - password: ${env:PG_PASS} - database: analytics - # Options: require, prefer, verify-full - ssl_mode: prefer -``` - -### MySQL - -```yaml -- id: my_mysql - connection: - type: mysql - user: admin - password: ${env:MYSQL_PASS} - database: ecommerce - # Optional: Use Unix Socket for local connection - unix_socket: /var/run/mysqld/mysqld.sock -``` - -### SQL Server (MSSQL / Azure) - -Supports Standard, Windows, and Azure Authentication. 
- -**Standard**: - -```yaml -- id: mssql_std - connection: - type: mssql - host: sql.example.com - user: sa - password: ${env:SA_PASS} - database: master -``` - -**Azure Service Principal**: - -```yaml -- id: azure_db - connection: - type: mssql - host: my-server.database.windows.net - database: core_db - authentication: azure_sp - # Service Principal Credentials - client_id: ${env:AZURE_CLIENT_ID} - client_secret: ${env:AZURE_CLIENT_SECRET} - tenant_id: ${env:AZURE_TENANT_ID} -``` - -### SQLite - -```yaml -- id: local_sqlite - connection: - type: sqlite - database: /abs/path/to/db.sqlite -``` - -> [!NOTE] -> The `connection.type` field determines which **Adapter** loads the datasource. For standard SQL databases, this matches the **SQLAlchemy Dialect** (e.g., `postgres`, `mysql`, `mssql`, `sqlite`). Custom adapters may define their own types. - -## 4. Safety Limits - -To prevent "Out of Memory" (OOM) crashes and protect the LLM context window, we enforce strict limits on query results. - -| Field | Default | Description | -| :--- | :--- | :--- | -| `row_limit` | 1000 | Max rows returned by a query. | -| `max_bytes` | 10MB | **Hard Limit** on payload size. Calculated via efficient row sampling (avg of first 50 rows). Queries exceeding this will fail safely. | - -**Example**: - -```yaml -options: - row_limit: 500 - max_bytes: 5242880 # 5MB limit -``` - -## 5. IDE Support (VS Code) - -Your `datasources.yaml` includes a header that enables **Autocomplete** and **Validation** in VS Code: - -```yaml -# yaml-language-server: $schema=./datasources.schema.json -``` - -Do not remove this line. It ensures your configuration matches the strict Pydantic models used by the engine. - -## 6. Environment Variables - -You can configure the application using the following environment variables (defined in `.env` or system environment). - -| Variable | Default | Description | -| :--- | :--- | :--- | -| `OPENAI_API_KEY` | - | **Required** for LLM and Embedding services. | -| `DATASOURCE_CONFIG` | `configs/datasources.yaml` | Path to the datasource configuration file. | -| `SECRETS_CONFIG` | `configs/secrets.yaml` | Path to the secrets configuration file. | -| `LLM_CONFIG` | `configs/llm.yaml` | Path to the LLM model configuration file. | -| `POLICIES_CONFIG` | `configs/policies.json` | Path to the RBAC policies file. | -| `VECTOR_STORE` | `./chroma_db` | Path (directory) to persist the vector store. | -| `EMBEDDING_MODEL` | `text-embedding-3-small` | OpenAI embedding model name. | -| `BENCHMARK_CONFIG` | `configs/benchmark_suite.yaml` | Path to the benchmark suite configuration. | -| `ROUTING_EXAMPLES` | `configs/sample_questions.yaml` | Path to examples used for few-shot routing. | -| `ROUTER_L1_THRESHOLD` | `0.4` | Threshold for Vector Search relevance. | -| `ROUTER_L2_THRESHOLD` | `0.6` | Threshold for Multi-Query voting agreement. | diff --git a/docs/guides/deployment.md b/docs/guides/deployment.md deleted file mode 100644 index 5db248d..0000000 --- a/docs/guides/deployment.md +++ /dev/null @@ -1,40 +0,0 @@ -# Deployment Guide - -This guide covers deploying the NL2SQL platform in a production environment using Docker. - -## Architecture - -In production, the system typically runs as a backend API service.
- -```mermaid -graph TD - Client["Web UI / Dashboard"] --> LoadBalancer - LoadBalancer --> API["NL2SQL API (FastAPI)"] - API --> VectorDB["Vector Store (Chroma)"] - API --> Databases["Target Databases"] -``` - -## Docker Compose - -Here is an example `docker-compose.yml` for deploying the API service: - -```yaml -version: '3.8' -services: - api: - build: . - ports: - - "8000:8000" - environment: - - OPENAI_API_KEY=${OPENAI_API_KEY} - - CONFIG_PATH=/app/configs/datasources.yaml - volumes: - - ./configs:/app/configs -``` - -## Production Checklist - -1. [ ] **Security**: Ensure `configs/policies.json` is mapped to your identity provider (e.g. OAuth) to dynamically enforce `allowed_tables`. -2. [ ] **Read-Only**: Configure the database users in `datasources.yaml` to have READ-ONLY permissions at the database level. - > The PhysicalValidator provides a safety net, but defense-in-depth requires DB-level grants. -3. [ ] **Monitoring**: Enable the `PipelineMonitorCallback` to log all traces to your observability stack (e.g. LangSmith). diff --git a/docs/guides/development.md b/docs/guides/development.md deleted file mode 100644 index 3bb7999..0000000 --- a/docs/guides/development.md +++ /dev/null @@ -1,30 +0,0 @@ -# Development Guide - -How to contribute to the platform. - -## Setup - -```bash -# Windows PowerShell -./scripts/setup_dev.ps1 -``` - -## Running Tests - -We use `pytest`. - -```bash -# Run Unit Tests (Fast) -python -m pytest packages/core/tests/unit - -# Run Integration Tests (Requires Docker) -docker compose up -d -python -m pytest packages/core/tests/integration -``` - -## Adding a New Node - -1. Create the node class in `packages/core/src/nl2sql/pipeline/nodes/`. -2. Implement `__call__(self, state: GraphState) -> Dict`. -3. Register it in `graph.py` or the relevant subgraph (`sql_agent.py`). -4. Add unit tests. diff --git a/docs/guides/getting_started.md b/docs/guides/getting_started.md deleted file mode 100644 index 98dc5a0..0000000 --- a/docs/guides/getting_started.md +++ /dev/null @@ -1,61 +0,0 @@ -# Getting Started - -This guide will help you install the platform and run your first query. - -## Prerequisites - -* Python 3.10+ -* Docker (for running integration tests) - -## 1. Installation - -The platform is designed as a monorepo. You should install the core and the adapters you need. - -```bash -# Clone the repository -git clone https://github.com/nadeem4/nl2sql.git -cd nl2sql - -# Install Core -pip install -e packages/adapter-sdk -pip install -e packages/cli # Installs 'nl2sql' command - -# Install Adapters (e.g. Postgres) -pip install -e packages/adapters/postgres -``` - -## 2. Setup - -Run the interactive setup wizard. This will inspect your environment, guide you through creating a configuration, and index your database schema. - -```bash -nl2sql setup -``` - -The wizard will ask for: - -1. **Database Details** (Host, Port, User, Password). -2. **LLM Provider** (OpenAI, Gemini, Ollama). -3. Confirmation to install required **Adapters** (e.g. `nl2sql-postgres`). - -## 3. Run a Query - -Now you are ready to ask questions! - -```bash -nl2sql run "Show me the top 5 users by sales" -``` - -The system will: - -1. Plan the query. -2. Route it to the correct database. -3. Generate and Validate SQL. -4. Execute and display the results. - -## Next Steps - -Explore the full capabilities of the CLI: - -* [CLI Reference](cli_reference.md) - Learn about `--env`, `--role`, and `index` options. - -* [Configuration](configuration.md) - Connect to your own databases.
diff --git a/docs/guides/production_roadmap.md b/docs/guides/production_roadmap.md deleted file mode 100644 index 8bf8753..0000000 --- a/docs/guides/production_roadmap.md +++ /dev/null @@ -1,19 +0,0 @@ -# Production Roadmap - -Steps to transition from MVP to Mission-Critical Service. - -## Phase 1: Security & Governance - -* [ ] **Secret Management**: Move credentials from `datasources.yaml` to AWS Secrets Manager / Azure Key Vault. -* [ ] **Dependency Locking**: Generate `poetry.lock` or `uv.lock`. -* [ ] **SQL Governance**: Implement a "Policy Agent" to block request types (e.g. `CROSS JOIN` on large tables). - -## Phase 2: Reliability - -* [ ] **Circuit Breakers**: Enforce strict timeouts (e.g. 5s) at the driver level. -* [ ] **Rate Limiting**: Implement per-tenant token budgeting. - -## Phase 3: Observability - -* [ ] **Audit Logging**: Write immutable execution logs (User, Query, SQL) to compliance storage. -* [ ] **Async Concurrency**: Refactor Core to `ainvoke` for higher throughput. diff --git a/docs/index.md b/docs/index.md index 73bcbc0..19e7fe3 100644 --- a/docs/index.md +++ b/docs/index.md @@ -1,35 +1,53 @@ -# Welcome to the NL2SQL Platform - -The **NL2SQL Platform** is an enterprise-grade engine for converting natural language into SQL. - -It features: - -* **Defense-in-Depth Security** (RBAC, Read-Only enforcement). -* **Multi-Database Routing** (Federated queries across Postgres, MySQL, etc.). -* **Agentic Reasoning** (Iterative planning and self-correction). - -## Quick Start - -```bash -# Install Dependencies -pip install -e packages/core - -# Run a Query -python -m nl2sql.cli --query "Show me top 5 users" +# NL2SQL Platform + +Welcome to the **NL2SQL Platform**, a production-grade **Natural Language to SQL** engine built on an Agentic Graph Architecture. + +This platform transforms complex user questions into safe, optimized, and strictly validated SQL queries across multiple database engines (PostgreSQL, MySQL, MSSQL, SQLite). + +## 🚀 Key Features + +* **Agentic Graph Architecture**: Powered by [LangGraph](https://langchain-ai.github.io/langgraph/), the system orchestrates a graph of specialized nodes (Planner, Validator, Generator) that can self-correct and backtrack. +* **Production Security**: Implementation of **Strict AST Validation**, **Role-Based Access Control (RBAC)**, and **Secrets Management**. +* **Polyglot Support**: Works seamlessly with multiple SQL dialects via an **Adapter SDK**. +* **Smart Routing**: A specialized **Decomposer Node** handles complex multi-datasource queries by splitting them into sub-queries. +* **Optimization**: Built-in **Vector Store** for schema and few-shot example retrieval to optimize context window usage. + +## 🏗️ High-Level Architecture + +The system follows a directed cyclic graph (DCG) flow, allowing for feedback loops and self-correction. 
+ +```mermaid +graph TD + User([User Query]) --> Semantic[Semantic Analysis] + Semantic --> Decomposer[Decomposer Node] + + subgraph "Orchestration & Routing" + Decomposer -->|Sub-Query 1| Execution{Execution Branch} + Decomposer -->|Sub-Query 2| Execution + end + + subgraph "SQL Agent (ReAct Loop)" + Execution --> Planner[Planner Node] + Planner --> L_Validator["Logical Validator (AST)"] + + L_Validator -->|Error| Refiner[Refiner Node] + Refiner -->|Feedback| Planner + + L_Validator -->|OK| Generator[Generator Node] + Generator --> P_Validator[Physical Validator] + + P_Validator -->|Error| Refiner + P_Validator -->|OK| Executor[Executor Node] + end + + Executor --> Aggregator[Aggregator Node] + Aggregator --> Result([Final Result]) ``` -[Get Started Guide](guides/getting_started.md){ .md-button .md-button--primary } - -## Documentation Structure - -### [Architecture](architecture/overview.md) - -Understand the "Map-Reduce" design, the **SQL Agent** pipeline, and how the **Physical Validator** ensures safety. - -### [Guides](guides/getting_started.md) - -Step-by-step instructions for installation, configuration, and deploying to production with Docker. - -### [Reference](reference/cli.md) +## 📚 Documentation Guide -Technical specifications for the CLI, API, and internal Node logic. +* [**Getting Started**](getting-started/installation.md): Installation and quickstart demos. +* [**Core Engine**](core/nodes.md): Deep dive into the Neural Components (Nodes) and Graph State. +* [**Security**](safety/security.md): How we ensure Safety, Compliance, and Data Protection. +* [**Operations**](ops/configuration.md): Configuration, Logging, and Benchmarking guides. +* [**Development**](dev/adapters.md): Guide to building custom adapters and extending the platform. diff --git a/docs/nodes/aggregator_node.md b/docs/nodes/aggregator_node.md deleted file mode 100644 index 0fd1b43..0000000 --- a/docs/nodes/aggregator_node.md +++ /dev/null @@ -1,41 +0,0 @@ -# AggregatorNode - -## Purpose - -The `AggregatorNode` combines results from the execution phase and prepares the final response. It implements a "Fast Path" for direct data streaming and a "Slow Path" for LLM-based summarization or answer synthesis. - -## Class Reference - -- **Class**: `AggregatorNode` -- **Path**: `packages/core/src/nl2sql/pipeline/nodes/aggregator/node.py` - -## Inputs - -The node reads the following fields from `GraphState`: - -- `state.user_query` (str): The user's question. - -- `state.intermediate_results` (List): Results from the executor(s). -- `state.output_mode` (str): "data" (Fast Path) or "summary"/"verbose" (Slow Path). -- `state.errors` (List[PipelineError]): Any errors to include in the summary. - -## Outputs - -The node updates the following fields in `GraphState`: - -- `state.final_answer` (Any): The final text answer or data payload. -- `state.reasoning` (List[Dict]): Log of which path was taken. - -## Logic Flow - -1. **Fast Path Check**: - - If there is exactly one result, no errors, and `output_mode` is "data": - - Returns the raw data directly. -2. **Slow Path (LLM Aggregation)**: - - Formats all `intermediate_results` (and errors) into a string. - - Prompts the LLM to synthesize an answer to the `user_query` using the provided data. - - Formats the LLM output (Table/List/Text). - - Returns the generated summary. - -## Error Handling - -- **`AGGREGATOR_FAILED`**: If the LLM summarization fails.
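To make the Fast Path / Slow Path dispatch concrete, here is a minimal sketch of the branching logic described above. It is illustrative only: the `GraphState` field names follow this page, while `summarize_with_llm` is a hypothetical stand-in for the Slow Path LLM call.

```python
# Illustrative sketch of the AggregatorNode dispatch (not the actual source).
# GraphState fields follow this page; summarize_with_llm is a hypothetical
# stand-in for the Slow Path LLM call.
def aggregate(state) -> dict:
    # Fast Path: exactly one clean result and the caller asked for raw data.
    if (
        len(state.intermediate_results) == 1
        and not state.errors
        and state.output_mode == "data"
    ):
        return {"final_answer": state.intermediate_results[0]}

    # Slow Path: fold all results (and any errors) into a prompt and let
    # the LLM synthesize an answer to the original user_query.
    context = "\n".join(str(r) for r in state.intermediate_results)
    if state.errors:
        context += "\nErrors:\n" + "\n".join(str(e) for e in state.errors)
    summary = summarize_with_llm(state.user_query, context)  # hypothetical helper
    return {"final_answer": summary, "reasoning": [{"aggregator": "slow_path"}]}
```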
diff --git a/docs/nodes/decomposer_node.md b/docs/nodes/decomposer_node.md deleted file mode 100644 index ef8fa01..0000000 --- a/docs/nodes/decomposer_node.md +++ /dev/null @@ -1,46 +0,0 @@ -# DecomposerNode - -## Purpose - -The `DecomposerNode` acts as the entry point and router for the pipeline. It is responsible for analyzing the user's query to determine which datasource(s) should handle the request. For complex requests, it can break the query down into sub-queries (though simple routing is the primary function). It also checks user authorization before proceeding. - -## Class Reference - -- **Class**: `DecomposerNode` -- **Path**: `packages/core/src/nl2sql/pipeline/nodes/decomposer/node.py` - -## Inputs - -The node reads the following fields from `GraphState`: - -- `state.user_query` (str): The initial user question. -- `state.user_context` (Dict): User session data, specifically `allowed_datasources` for authorization. -- `state.semantic_analysis` (SemanticAnalysisResponse): Used to expand the query with keywords/synonyms for better vector retrieval. - -## Outputs - -The node updates the following fields in `GraphState`: - -- `state.sub_queries` (List[SubQuery]): A list of routed queries. Each `SubQuery` contains: - - `question`: The specific question for the datasource. - - `datasource_id`: The ID of the chosen datasource. -- `state.confidence` (float): The confidence score of the routing decision. -- `state.reasoning` (List[Dict]): Explanation of why a specific datasource was selected. -- `state.errors` (List[PipelineError]): `SECURITY_VIOLATION` if the user lacks access. - -## Logic Flow - -1. **Authorization Check**: Verifies if `state.user_context` contains accessible datasources. If not, returns `SECURITY_VIOLATION`. -2. **Query Expansion**: If `state.semantic_analysis` is present, it augments the query with keywords and synonyms to improve retrieval recall. -3. **Context Retrieval**: - - Queries the `OrchestratorVectorStore` using the expanded query. - - Retrieves relevant table schemas and datasource descriptions. -4. **LLM Routing**: - - Uses the LLM to analyze the retrieved context and the user query. - - Decides which datasource is best suited to answer the question. -5. **Output Generation**: Returns the routing decision (datasource selection) and confidence score. - -## Error Handling - -- **`SECURITY_VIOLATION`**: Critical error if the user has no allowed datasources. -- **Retrieval Warnings**: Logs warnings if no relevant documents are found in the vector store. diff --git a/docs/nodes/executor_node.md b/docs/nodes/executor_node.md deleted file mode 100644 index f198d4c..0000000 --- a/docs/nodes/executor_node.md +++ /dev/null @@ -1,44 +0,0 @@ -# ExecutorNode - -## Purpose - -The `ExecutorNode` is responsible for executing the generated SQL query against the target datasource. It handles connection management via the `DatasourceRegistry` adapters, safeguards against massive result sets, and formats the output. - -## Class Reference - -- **Class**: `ExecutorNode` -- **Path**: `packages/core/src/nl2sql/pipeline/nodes/executor/node.py` - -## Inputs - -The node reads the following fields from `GraphState`: - -- `state.sql_draft` (str): The SQL query to execute. -- `state.selected_datasource_id` (str): The target database ID. - -## Outputs - -The node updates the following fields in `GraphState`: - -- `state.execution` (`ExecutionModel`): The result of the query. - - `columns` (List[str]): Column names. - - `rows` (List[Dict]): The data returned. 
- - `row_count` (int): Number of rows. -- `state.errors` (List[PipelineError]): Errors during execution. - -## Logic Flow - -1. **Validation**: Ensures `sql_draft` and `datasource_id` are present. -2. **Adapter Retrieval**: Fetches the correct adapter (e.g., PostgresAdapter) from the registry. -3. **Cost Estimation (Safeguard)**: - - If supported by the adapter, estimates the query cost. - - If the estimated row count exceeds `SAFEGUARD_ROW_LIMIT` (10,000), aborts execution and raises `SAFEGUARD_VIOLATION`. -4. **Execution**: - - Runs `adapter.execute(sql)`. - - Captures the result set. -5. **Formatting**: Converts the results into the standard `ExecutionModel`. - -## Error Handling - -- **`SAFEGUARD_VIOLATION`**: If the query is predicted to return too many rows. -- **`DB_EXECUTION_ERROR`**: If the database raises an exception (e.g., timeout, syntax error not caught by validator). diff --git a/docs/nodes/generator_node.md b/docs/nodes/generator_node.md deleted file mode 100644 index 3973a19..0000000 --- a/docs/nodes/generator_node.md +++ /dev/null @@ -1,43 +0,0 @@ -# GeneratorNode - -## Purpose - -The `GeneratorNode` is the compiler of the pipeline. It takes the abstract execution plan (`PlanModel`) produced by the Planner and generates a valid, dialect-specific SQL string. It uses `sqlglot` to transpile the internal AST into the target SQL dialect (e.g., PostgreSQL, T-SQL, MySQL), enforcing syntactic correctness. - -## Class Reference - -- **Class**: `GeneratorNode` -- **Path**: `packages/core/src/nl2sql/pipeline/nodes/generator/node.py` - -## Inputs - -The node reads the following fields from `GraphState`: - -- `state.plan` (`PlanModel`): The logical plan to compile. -- `state.selected_datasource_id` (str): The ID of the target database, used to determine the SQL dialect. - -## Outputs - -The node updates the following fields in `GraphState`: - -- `state.sql_draft` (str): The generated SQL query string. -- `state.reasoning` (List[Dict]): Logs the generated SQL. -- `state.errors` (List[PipelineError]): `SQL_GEN_FAILED` if compilation errors occur. - -## Logic Flow - -1. **Validation**: Checks if a plan and a datasource ID are present in the state. -2. **Profile Lookup**: Fetches the `dialect` (e.g., "postgres", "tsql") and default `row_limit` from the datasource registry. -3. **AST Transformation (`SqlVisitor`)**: - - The node uses a `SqlVisitor` class to traverse the `PlanModel` (Expr tree). - - It builds a corresponding `sqlglot` Expression tree. - - This visitor handles literals, columns, functions, binary/unary operations, and case statements. -4. **SQL Synthesis**: - - Constructs the top-level `SELECT` statement using `sqlglot` builders. - - Applies transformations for `SELECT`, `FROM` (Tables), `JOIN`, `WHERE`, `GROUP BY`, `HAVING`, `ORDER BY`, and `LIMIT`. - - Handles dialect-specific nuances (e.g., quoting identifiers, function names) via `sqlglot.transpile` mechanisms (implicit in `.sql(dialect=...)`). -5. **Output**: Returns the final SQL string. - -## Error Handling - -- **`SQL_GEN_FAILED`**: Raised if the visitor encounters unknown expression types or if `sqlglot` fails to generate the string. diff --git a/docs/nodes/planner_node.md b/docs/nodes/planner_node.md deleted file mode 100644 index ecb362d..0000000 --- a/docs/nodes/planner_node.md +++ /dev/null @@ -1,56 +0,0 @@ -# PlannerNode - -## Purpose - -The `PlannerNode` is the cognitive core of the SQL generation process. 
It synthesizes the user's intent and the retrieved schema (tables, columns) to create a structured, dialect-agnostic "Execution Plan" (`PlanModel`). Unlike a simple text-to-SQL prompt, this node generates a deterministic Abstract Syntax Tree (AST), ensuring strict adherence to the schema and enabling logical validation before any SQL is written. - -## Class Reference - -- **Class**: `PlannerNode` -- **Path**: `packages/core/src/nl2sql/pipeline/nodes/planner/node.py` - -## Inputs - -The node reads the following fields from `GraphState`: - -- `state.user_query` (str): The original natural language question. -- `state.relevant_tables` (List[Table]): The list of schema definitions (tables/columns) found relevant by the Decomposer. -- `state.semantic_analysis` (SemanticAnalysisResponse): Enriched context (keywords, synonyms) to guide planning. -- `state.errors` (List[PipelineError]): Existing errors (if in a re-planning loop), used to provide feedback to the LLM. -- `state.selected_datasource_id` (str): ID of the target database, used to fetch dialect-specific settings (e.g., date formats). - -## Outputs - -The node updates the following fields in `GraphState`: - -- `state.plan` (`PlanModel`): The comprehensive execution plan containing: - - **tables**: List of `TableNode` (name, alias, schema). - - **joins**: List of `JoinNode` (right table, condition, join type). - - **select_items**: List of `SelectItemNode` (expression, alias). - - **where**: Root `Expr` node for filtering. - - **group_by**: List of `GroupByNode`. - - **having**: Root `Expr` node for post-aggregation filtering. - - **order_by**: List of `OrderByNode`. - - **limit**: Integer row limit. -- `state.reasoning` (List[Dict]): Log entry explaining the planning decisions. -- `state.errors` (List[PipelineError]): Appends `MISSING_LLM` or `PLANNING_FAILURE` if generation fails. - -## Logic Flow - -1. **Initialization**: Checks if the LLM is provided; otherwise returns a critical error. -2. **Context Assembly**: - - Serializes `state.relevant_tables` into a schema string. - - Formats previous `state.errors` into a feedback string (for self-correction). - - Retrieves the `date_format` from the selected datasource profile. - - Includes `state.semantic_analysis` context. -3. **LLM Invocation**: - - Prompts the LLM with the query, schema, examples, and feedback. - - Expects a JSON response conforming to `PlanModel`. -4. **Post-Processing**: - - Validates and parses the JSON into the `PlanModel` Pydantic object. - - Updates `state.plan` and logs reasoning. - -## Error Handling - -- **`MISSING_LLM`**: Raised if the node is initialized without a language model. -- **`PLANNING_FAILURE`**: Raised if the LLM output is malformed or cannot be parsed into `PlanModel`. diff --git a/docs/nodes/refiner_node.md b/docs/nodes/refiner_node.md deleted file mode 100644 index 3394947..0000000 --- a/docs/nodes/refiner_node.md +++ /dev/null @@ -1,41 +0,0 @@ -# RefinerNode - -## Purpose - -The `RefinerNode` operates in the self-correction loop. When validation or execution fails, the Refiner analyzes the error, the failed plan, and the schema to generate constructive feedback (Natural Language Advice) for the Planner, enabling it to retry and fix the mistake. - -## Class Reference - -- **Class**: `RefinerNode` -- **Path**: `packages/core/src/nl2sql/pipeline/nodes/refiner/node.py` - -## Inputs - -The node reads the following fields from `GraphState`: - -- `state.user_query` (str): The original intent. -- `state.plan` (`PlanModel`): The plan that failed.
-- `state.errors` (List[PipelineError]): The specific errors (e.g., "Table not found", "Execution error"). -- `state.relevant_tables`: The schema context. - -## Outputs - -The node updates the following fields in `GraphState`: - -- `state.errors` (List[PipelineError]): Appends a new error of type `PLAN_FEEDBACK` containing the LLM's advice. -- `state.reasoning` (List[Dict]): The feedback generated. - -## Logic Flow - -1. **Context Assembly**: - - Dumps the `relevant_tables`, `failed_plan`, and `errors` into strings. -2. **LLM Analysis**: - - Prompts the LLM to diagnose the failure. - - Asks: "Given this query, this plan, and these errors, what went wrong and how should the planner fix it?" -3. **Feedback Injection**: - - Wraps the LLM's response in a `PipelineError` (Severity WARNING). - - This error is read by the `PlannerNode` in the next iteration as "Feedback". - -## Error Handling - -- **`REFINER_FAILED`**: If the refinement LLM call fails. diff --git a/docs/nodes/semantic_analysis_node.md b/docs/nodes/semantic_analysis_node.md deleted file mode 100644 index 9df15ec..0000000 --- a/docs/nodes/semantic_analysis_node.md +++ /dev/null @@ -1,39 +0,0 @@ -# SemanticAnalysisNode - -## Purpose - -The `SemanticAnalysisNode` is a preprocessing step that normalizes the user query and extracts metadata such as keywords and synonyms. This enriched context supports both the `DecomposerNode` (by expanding the search query for better retrieval) and the `PlannerNode` (by resolving ambiguity). - -## Class Reference - -- **Class**: `SemanticAnalysisNode` -- **Path**: `packages/core/src/nl2sql/pipeline/nodes/semantic/node.py` - -## Inputs - -The node reads the following fields from `GraphState`: - -- `state.user_query` (str): The raw user input. - -## Outputs - -The node updates the following fields in `GraphState`: - -- `state.semantic_analysis` (`SemanticAnalysisResponse`): - - `canonical_query`: Normalized form of the question. - - `keywords`: Extracted domain keywords. - - `synonyms`: List of potential synonyms for columns/tables. - - `reasoning`: The analysis thought process. - -## Logic Flow - -1. **LLM Invocation**: - - Prompts the LLM with the `user_query`. - - Requests an analysis including canonicalization and keyword extraction. -2. **Result Storage**: - - Stores the structured response in `state.semantic_analysis`. - - Logs the reasoning and keywords. - -## Error Handling - -- **Fallback**: If the LLM call fails, it defaults to returning the raw query with empty keywords to prevent pipeline blockage. diff --git a/docs/nodes/validator_node.md b/docs/nodes/validator_node.md deleted file mode 100644 index 6ec16d2..0000000 --- a/docs/nodes/validator_node.md +++ /dev/null @@ -1,44 +0,0 @@ -# LogicalValidatorNode - -## Purpose - -The `LogicalValidatorNode` validates the generated AST (`PlanModel`) *before* any SQL is generated. It performs static analysis to ensure the plan structure is valid (e.g., no duplicate aliases) and that all referenced tables and columns actually exist in the schema. It also enforces access control policies. - -## Class Reference - -- **Class**: `LogicalValidatorNode` -- **Path**: `packages/core/src/nl2sql/pipeline/nodes/validator/node.py` - -## Inputs - -The node reads the following fields from `GraphState`: - -- `state.plan` (`PlanModel`): The plan to validate. -- `state.relevant_tables` (List[Table]): The schema context to check against. -- `state.user_context` (Dict): Used for policy validation (`allowed_tables`). 
- -## Outputs - -The node updates the following fields in `GraphState`: - -- `state.errors` (List[PipelineError]): Appends any validation failures found: - - `TABLE_NOT_FOUND`: Referenced table does not exist in `relevant_tables`. - - `INVALID_PLAN_STRUCTURE`: Malformed AST (e.g., non-contiguous ordinals). - - `SECURITY_VIOLATION`: Reference to unauthorized table. - -## Logic Flow - -1. **Duplicate Alias Check**: Ensures all table aliases in the plan are unique. -2. **Schema Verification (`_build_alias_map`)**: - - Iterates through `plan.tables`. - - Verifies each table exists in `state.relevant_tables`. - - Maps aliases to valid columns for downstream checks. -3. **Policy Validation**: - - Checks if the user is authorized to access the referenced tables based on `state.user_context`. -4. **Static Validation**: - - (Implied) Additional checks on the structure of the AST. - -## Error Handling - -- **`TABLE_NOT_FOUND`**: If the plan invents a table name. -- **`SECURITY_VIOLATION`**: If a table is restricted by policy. diff --git a/docs/ops/benchmarking.md b/docs/ops/benchmarking.md new file mode 100644 index 0000000..c5adf54 --- /dev/null +++ b/docs/ops/benchmarking.md @@ -0,0 +1,41 @@ +# Benchmarking + +The platform includes a built-in **Matrix Benchmarking** tool to evaluate accuracy across different LLMs and Datasources. + +## Running a Benchmark + +```bash +nl2sql benchmark --config configs/benchmark_suite.yaml +``` + +## Matrix Testing + +You can define multiple LLM configurations in a `bench_config.yaml` to run head-to-head comparisons. + +```yaml +# bench_config.yaml +gpt4_base: + default: + provider: openai + model: gpt-4 +claude3_opus: + default: + provider: anthropic + model: claude-3-opus +``` + +Run the matrix: + +```bash +nl2sql benchmark --bench-config bench_config.yaml +``` + +## Metrics + +The benchmark reports: + +* **Execution Success Rate (ESR)**: % of queries that ran without error. +* **Valid SQL Rate**: % of generated SQL that passed Physical Validation. +* **Accuracy**: (Requires Golden SQL) % of results matching ground truth. + +::: nl2sql.cli.commands.benchmark.run_benchmark diff --git a/docs/ops/configuration.md b/docs/ops/configuration.md new file mode 100644 index 0000000..5af5820 --- /dev/null +++ b/docs/ops/configuration.md @@ -0,0 +1,40 @@ +# Configuration + +The platform is configured via a combination of **Environment Variables** (for secrets/global settings) and **YAML Config Files** (for structured data). + +## Global Settings (`config.yml` / `.env`) + +Global settings control the behavior of the core engine. + +::: nl2sql.common.settings.Settings + +## Datasources (`datasources.yaml`) + +Defines the available databases and their connection details. - -```yaml +postgres_prod: + type: postgres + connection_string: ${env:POSTGRES_URL} + schema: public +``` + +## RBAC Policies (`policies.json`) + +Defines roles and access rules. (See [Security](../safety/security.md) for details). + +## LLM Configuration (`llm.yaml`) + +Defines the LLM providers and model parameters for different agents. + +```yaml +default: + provider: openai + model: gpt-4o + +agents: + planner: + model: o1-preview # Uses reasoning model for planning + generator: + model: gpt-4o +``` diff --git a/docs/ops/errors.md b/docs/ops/errors.md new file mode 100644 index 0000000..aa69367 --- /dev/null +++ b/docs/ops/errors.md @@ -0,0 +1,8 @@ +# Error Reference + +Standardized Error Codes used throughout the Pipeline.
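To illustrate how these codes surface, a node typically records a failure by appending an error tagged with a code to `state.errors` rather than raising. A hedged sketch follows; the `ErrorCode` enum is documented below, while the `PipelineError` import location and constructor arguments are assumptions for illustration.

```python
# Hypothetical sketch: attaching a standardized code to the pipeline state.
# ErrorCode is documented on this page; the PipelineError location and
# constructor signature are assumptions.
from nl2sql.common.errors import ErrorCode, PipelineError

def check_tables(state, known_tables: set) -> None:
    for table in state.plan.tables:
        if table.name not in known_tables:
            state.errors.append(
                PipelineError(
                    code=ErrorCode.TABLE_NOT_FOUND,
                    message=f"Plan references unknown table '{table.name}'",
                )
            )
```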
+ +::: nl2sql.common.errors.ErrorCode + options: + show_root_heading: true + show_source: false diff --git a/docs/ops/observability.md b/docs/ops/observability.md new file mode 100644 index 0000000..98b8a80 --- /dev/null +++ b/docs/ops/observability.md @@ -0,0 +1,29 @@ +# Observability + +## Logging + +We use a structured logging approach suitable for production environments (Splunk, Datadog, ELK). + +* **Format**: JSON (Production) or Human-Readable (Dev). +* **Attributes**: Logs include `request_id`, `user_id`, `node_name`, and `execution_time`. + +### Enabling JSON Logs + +Set the environment variable or use the flag: + +```bash +export LOG_FORMAT=json +# or +nl2sql run "query" --json-logs +``` + +::: nl2sql.common.logger.JsonFormatter + +## Tracing + +The platform is instrumented with [LangSmith](https://smith.langchain.com/) for deep tracing of the Agentic Graph. + +1. Set `LANGCHAIN_TRACING_V2=true`. +2. Set `LANGCHAIN_API_KEY=...`. + +This will stream full traces of the Planner, Validator, and Generator steps to the LangSmith dashboard. diff --git a/docs/reference/api/adapter_sdk.md b/docs/reference/api/adapter_sdk.md deleted file mode 100644 index a6ce734..0000000 --- a/docs/reference/api/adapter_sdk.md +++ /dev/null @@ -1,31 +0,0 @@ -# Adapter SDK - -The `nl2sql-adapter-sdk` defines the contract for new plugins. - -## Models - -### `DatasourceAdapter` (Abstract Base Class) - -The main entry point. Implementations must provide: - -- `fetch_schema()` -- `execute(sql)` -- `cost_estimate(sql)` - -### `Table` - -Schema definition. - -```python -name: str -columns: List[Column] -foreign_keys: List[ForeignKey] -``` - -The SDK also defines capability flags that control the `GeneratorNode`: - -| Flag | Description | -| :--- | :--- | -| `supports_limit_offset` | Generator will use LIMIT/OFFSET syntax. | -| `supports_dry_run` | PhysicalValidator will attempt `dry_run()`. | -| `supports_cost_estimation` | PhysicalValidator will check row counts. | diff --git a/docs/reference/api/graph_state.md b/docs/reference/api/graph_state.md deleted file mode 100644 index 505ce78..0000000 --- a/docs/reference/api/graph_state.md +++ /dev/null @@ -1,17 +0,0 @@ -# GraphState API - -The `GraphState` object represents the shared memory passed between nodes in the LangGraph pipeline. - -## Attributes - -| Field | Type | Description | -| :--- | :--- | :--- | -| `user_query` | `str` | Canonical user query. | -| `sql_draft` | `str` | The Generated SQL string (from Generator). | -| `plan` | `PlanModel` | The AST produced by the Planner. | -| `relevant_tables` | `List[Table]` | Subset of the schema found by Vector Search. | -| `user_context` | `Dict` | User identity (`role`, `allowed_tables`) for Policy Logic. | -| `errors` | `List[PipelineError]` | Accumulating list of errors (Validation/Execution). | -| `reasoning` | `List[Dict]` | Log of "thoughts" from agents. | -| `execution` | `ExecutionModel` | Final database results. | -| `sub_queries` | `List[SubQuery]` | If Map-Reduce was triggered. | diff --git a/docs/reference/cli.md b/docs/reference/cli.md deleted file mode 100644 index 7f7caba..0000000 --- a/docs/reference/cli.md +++ /dev/null @@ -1,70 +0,0 @@ -# CLI Reference - -The **NL2SQL** command line interface uses a subcommand structure. - -**Command**: `nl2sql [COMMAND] [OPTIONS]` - -## Commands - -### `setup` - -Interactively configure the environment, create config files, and index schemas. - -| Option | Description | -| :--- | :--- | -| `None` | Runs the interactive wizard.
| - --- - -### `run` - -Execute natural language queries against your datasources. - -**Usage**: `nl2sql run "Your query here"` - -| Option | Description | Default | -| :--- | :--- | :--- | -| `--config` | Path to datasources YAML. | `configs/datasources.yaml` | -| `--llm-config` | Path to LLM configuration. | `configs/llm.yaml` | -| `--id` | Target specific datasource ID (bypass routing). | auto-route | -| `--json` | Output result as raw JSON only. | `False` | -| `--no-exec` | Generate and Validate SQL only (skip execution). | `False` | -| `--user` | Context user ID for RBAC checks. | `admin` | - ---- - -### `doctor` - -Diagnose environment issues, check dependencies, and verify connectivity. - -| Option | Description | -| :--- | :--- | -| `None` | Runs diagnostic checks. | - ---- - -### `index` - -Manually trigger the schema indexing process. - -| Option | Description | Default | -| :--- | :--- | :--- | -| `--config` | Path to datasources YAML. | `configs/datasources.yaml` | -| `--llm-config` | Path to LLM configuration. | `configs/llm.yaml` | - ---- - -### `list-adapters` - -List all currently installed database adapter packages. - ---- - -### `benchmark` - -Run the evaluation suite. - -| Option | Description | Default | -| :--- | :--- | :--- | -| `--config` | Path to benchmark suite YAML. | `configs/benchmark.yaml` | -| `--output` | Directory to save report artifacts. | `benchmarks/` | diff --git a/docs/reference/lessons_learned.md b/docs/reference/lessons_learned.md deleted file mode 100644 index a3f3568..0000000 --- a/docs/reference/lessons_learned.md +++ /dev/null @@ -1,17 +0,0 @@ -# Lessons Learned - -A collection of architectural insights/gotchas discovered during development. - -## 1. Context Retrieval Bias (Signal Density) - -**Issue**: The Decomposer exhibits a "Table-First Bias". It prioritizes datasources with matched **Tables** over those with matched **Examples**. - -**Scenario**: - -* Query: "Show production runs for 'Widget A'" -* `db_history`: Returns `production_runs` table. (Strong Signal) -* `db_supply`: Returns NO tables, but matches example "Who supplies Widget A?". (Weak Signal) - -**Result**: The Decomposer ignores `db_supply` because the prompt prioritized schema. - -**Fix**: The `DECOMPOSER_PROMPT` was updated to treat **Matched Examples** as a valid routing signal, even if no tables are returned. diff --git a/docs/reference/test_data.md b/docs/reference/test_data.md deleted file mode 100644 index bd651d1..0000000 --- a/docs/reference/test_data.md +++ /dev/null @@ -1,272 +0,0 @@ -# Database Setup & Schema Documentation - -This document outlines the database environment used in the NL2SQL pipeline, including datasources, schemas, and data volume. - -## Overview - -The system runs a **Multi-Database Architecture** simulating a real-world manufacturing environment. - -| Metrics | Value | -| :--- | :--- | -| **Total Datasources** | 5 (4 Unique Engines) | -| **Total Tables** | 17 | -| **Total Rows** | 30,000+ | - ---- - -## 1. Reference Data (`manufacturing_ref`) - -**Engine**: SQLite -**File**: `data/manufacturing.db` -**Description**: Static reference data used across the organization. - -### Tables - -#### `factories` - -Manufacturing plant locations. - -| Column | Type | Constraints | -| :--- | :--- | :--- | -| `id` | INTEGER | PRIMARY KEY | -| `name` | TEXT | NOT NULL | -| `location` | TEXT | NOT NULL | -| `opened_on` | DATE | NOT NULL | -| `capacity_index` | INTEGER | DEFAULT 100 | - -#### `machine_types` - -Specifications for equipment models.
- -| Column | Type | Constraints | -| :--- | :--- | :--- | -| `id` | INTEGER | PRIMARY KEY | -| `model_name` | TEXT | NOT NULL | -| `manufacturer` | TEXT | NOT NULL | -| `specifications` | JSON | | - -#### `shifts` - -Standard work shifts. - -| Column | Type | Constraints | -| :--- | :--- | :--- | -| `id` | INTEGER | PRIMARY KEY | -| `name` | TEXT | NOT NULL | -| `start_time` | TIME | NOT NULL | -| `end_time` | TIME | NOT NULL | - -> **Note**: The datasource `manufacturing_sqlite` also points to this same database file but is used for "Generic" demo queries. - ---- - -## 2. Operations Data (`manufacturing_ops`) - -**Engine**: PostgreSQL -**Description**: High-velocity operational data. Tracks daily plant activities. - -### Tables - -#### `employees` - -Staff details and roles. - -| Column | Type | Constraints | -| :--- | :--- | :--- | -| `id` | SERIAL | PRIMARY KEY | -| `full_name` | TEXT | NOT NULL | -| `role` | TEXT | NOT NULL | -| `factory_id` | INTEGER | NOT NULL (Logical FK to `factories.id`) | -| `hired_date` | DATE | NOT NULL | -| `contact_info` | JSONB | | - -#### `machines` - -Active equipment assets. - -| Column | Type | Constraints | -| :--- | :--- | :--- | -| `id` | SERIAL | PRIMARY KEY | -| `factory_id` | INTEGER | NOT NULL (Logical FK to `factories.id`) | -| `machine_type_id` | INTEGER | NOT NULL (Logical FK to `machine_types.id`) | -| `name` | TEXT | NOT NULL | -| `serial_number` | TEXT | NOT NULL UNIQUE | -| `commissioned_on` | DATE | NOT NULL | -| `status` | TEXT | NOT NULL DEFAULT 'Active' | - -#### `spare_parts` - -Inventory of repair parts. - -| Column | Type | Constraints | -| :--- | :--- | :--- | -| `id` | SERIAL | PRIMARY KEY | -| `name` | TEXT | NOT NULL | -| `machine_type_id` | INTEGER | NOT NULL (Logical FK to `machine_types.id`) | -| `stock_quantity` | INTEGER | NOT NULL | -| `criticality` | TEXT | NOT NULL | - -#### `maintenance_logs` - -History of repairs and downtime. - -| Column | Type | Constraints | -| :--- | :--- | :--- | -| `id` | SERIAL | PRIMARY KEY | -| `machine_id` | INTEGER | NOT NULL (FK to `machines.id`) | -| `performed_at` | TIMESTAMP | NOT NULL | -| `maintenance_type` | TEXT | NOT NULL | -| `downtime_minutes` | INTEGER | NOT NULL | -| `performed_by_id` | INTEGER | NOT NULL (FK to `employees.id`) | -| `notes` | TEXT | | - ---- - -## 3. Supply Chain (`manufacturing_supply`) - -**Engine**: MySQL -**Description**: External-facing supply chain and inventory management. - -### Tables - -#### `suppliers` - -Vendors and partners. - -| Column | Type | Constraints | -| :--- | :--- | :--- | -| `id` | INT | AUTO_INCREMENT PRIMARY KEY | -| `name` | TEXT | NOT NULL | -| `contact_email` | TEXT | NOT NULL | -| `country` | TEXT | NOT NULL | -| `rating` | FLOAT | | - -#### `products` - -Catalog of items produced/bought. - -| Column | Type | Constraints | -| :--- | :--- | :--- | -| `id` | INT | AUTO_INCREMENT PRIMARY KEY | -| `sku` | VARCHAR(255) | NOT NULL UNIQUE | -| `name` | TEXT | NOT NULL | -| `category` | TEXT | NOT NULL | -| `unit_price` | DECIMAL(10, 2) | NOT NULL | -| `supplier_id` | INT | NOT NULL (FK to `suppliers.id`) | -| `description` | TEXT | | - -#### `inventory` - -Stock levels by warehouse.
- -| Column | Type | Constraints | -| :--- | :--- | :--- | -| `id` | INT | AUTO_INCREMENT PRIMARY KEY | -| `product_id` | INT | NOT NULL (FK to `products.id`) | -| `warehouse_location` | TEXT | NOT NULL | -| `quantity_on_hand` | INT | NOT NULL | -| `last_updated` | TIMESTAMP | DEFAULT CURRENT_TIMESTAMP | - -#### `purchase_orders` - -Procurement records. - -| Column | Type | Constraints | -| :--- | :--- | :--- | -| `id` | INT | AUTO_INCREMENT PRIMARY KEY | -| `supplier_id` | INT | NOT NULL (FK to `suppliers.id`) | -| `ordered_at` | DATETIME | NOT NULL | -| `status` | VARCHAR(50) | NOT NULL | -| `total_amount` | DECIMAL(12, 2) | NOT NULL | - -#### `purchase_order_items` - -Line items for POs. - -| Column | Type | Constraints | -| :--- | :--- | :--- | -| `id` | INT | AUTO_INCREMENT PRIMARY KEY | -| `purchase_order_id` | INT | NOT NULL (FK to `purchase_orders.id`) | -| `product_id` | INT | NOT NULL (FK to `products.id`) | -| `quantity` | INT | NOT NULL | -| `unit_price` | DECIMAL(10, 2) | NOT NULL | - ---- - -## 4. Historical Data (`manufacturing_history`) - -**Engine**: Microsoft SQL Server (MSSQL) -**Description**: Long-term archival of sales and production performance. - -### Tables - -#### `customers` - -Client database. - -| Column | Type | Constraints | -| :--- | :--- | :--- | -| `id` | INT | IDENTITY(1,1) PRIMARY KEY | -| `company_name` | NVARCHAR(255) | NOT NULL | -| `contact_name` | NVARCHAR(255) | NOT NULL | -| `email` | NVARCHAR(255) | NOT NULL | -| `region` | NVARCHAR(100) | NOT NULL | - -#### `production_runs` - -Completed manufacturing batches. - -| Column | Type | Constraints | -| :--- | :--- | :--- | -| `id` | INT | IDENTITY(1,1) PRIMARY KEY | -| `product_id` | INT | NOT NULL (Logical FK) | -| `machine_id` | INT | NOT NULL (Logical FK) | -| `start_time` | DATETIME | NOT NULL | -| `end_time` | DATETIME | NOT NULL | -| `quantity_produced` | INT | NOT NULL | -| `scrap_count` | INT | NOT NULL DEFAULT 0 | - -#### `defects` - -QC failures recorded. - -| Column | Type | Constraints | -| :--- | :--- | :--- | -| `id` | INT | IDENTITY(1,1) PRIMARY KEY | -| `production_run_id` | INT | NOT NULL (FK to `production_runs.id`) | -| `defect_type` | NVARCHAR(100) | NOT NULL | -| `severity` | NVARCHAR(50) | NOT NULL | -| `count` | INT | NOT NULL | - -#### `sales_orders` - -Validated customer orders. - -| Column | Type | Constraints | -| :--- | :--- | :--- | -| `id` | INT | IDENTITY(1,1) PRIMARY KEY | -| `customer_id` | INT | NOT NULL (FK to `customers.id`) | -| `order_date` | DATETIME | NOT NULL | -| `status` | NVARCHAR(50) | NOT NULL | -| `total_amount` | DECIMAL(12, 2) | NOT NULL | - -#### `sales_order_items` - -Line items for Sales Orders. - -| Column | Type | Constraints | -| :--- | :--- | :--- | -| `id` | INT | IDENTITY(1,1) PRIMARY KEY | -| `sales_order_id` | INT | NOT NULL (FK to `sales_orders.id`) | -| `product_id` | INT | NOT NULL (Logical FK) | -| `quantity` | INT | NOT NULL | -| `unit_price` | DECIMAL(10, 2) | NOT NULL | - ---- - -## Cross-Database Context - -The system effectively joins data across these silos logically: - -* **Ops** `factory_id` -> **Ref** `factories.id` -* **Supply** `product_id` <-> **History** `product_id` (Shared SKU logical link) diff --git a/docs/safety/security.md b/docs/safety/security.md new file mode 100644 index 0000000..4925113 --- /dev/null +++ b/docs/safety/security.md @@ -0,0 +1,65 @@ +# Security Architecture + +Security is a first-class citizen in the NL2SQL Platform. 
We implement a **Defense-in-Depth** strategy involving Static Analysis, Execution Sandboxing, and Role-Based Access Control. + +## 1. Logical Validation (The Firewall) + +Before any SQL is generated or executed, the **Abstract Syntax Tree (AST)** must pass the **Logical Validator**. + +### Static Analysis + +We perform analysis on the `PlanModel` (AST) to enforce: + +* **Read-Only**: Only `SELECT` statements are allowed. `DROP`, `ALTER`, `INSERT` are structurally impossible to represent in the Plan Model. +* **Ordinal Integrity**: Ensures plan structure is valid. +* **Safety Violations**: Any violation triggers `ErrorCode.SECURITY_VIOLATION` (Critical Severity). + +### Column Scoping (`ValidatorVisitor`) + +A recursive walker traverses the AST to verify that: + +* Every column reference `t1.col` resolves to a valid Alias `t1` defined in the Plan. +* The column `col` actually exists in the effective schema of `t1`. +* No ambiguous columns (without aliases) are present if multiple tables share the column name. +* *Failures result in `ErrorCode.COLUMN_NOT_FOUND` or `ErrorCode.INVALID_ALIAS_USAGE`.* + +## 2. Authorization (RBAC) + +We use a strict **Role-Based Access Control** system defined in `configs/policies.json`. + +### Policy Enforcement + +The `LogicalValidator` checks the `user_context` against the `RolePolicy`. + +```json +"role_id": { + "allowed_datasources": ["sales_db"], + "allowed_tables": ["sales_db.orders", "sales_db.items"] +} +``` + +* **Strict Namespacing**: Policies MUST use the `datasource.table` format. +* **Fail-Closed**: If the system cannot determine the `selected_datasource_id` (e.g., ambiguous routing), the Validator fails closed immediately. It never defaults to "Allow All". + +## 3. Physical Validation & Sandboxing + +Even after safe SQL is generated, we perform **Physical Validation**. + +* **Dry Run**: We execute an `EXPLAIN` (or equivalent) on the generated SQL. This catches semantic errors (e.g., type mismatches) safely. +* **Cost Estimation**: We verify the query won't return > `row_limit` (default 1000) rows. Exceeding this triggers `ErrorCode.PERFORMANCE_WARNING` and stops execution. + +## 4. Secrets Management + +Secrets are never hardcoded. The `SecretManager` uses a **Provider Pattern**. + +* **Resolution**: Secrets are resolved at runtime using `${provider:key}` syntax. +* **Two-Phase Loading**: + 1. **Bootstrap**: Loads `env` provider first. + 2. **Resolution**: Resolves config references (e.g. `${env:DB_PASS}`) before registering subsequent providers. +* **Default Provider**: Environment Variables (`${env:MY_SECRET}`). +* **Extensible**: You can register custom providers (e.g., AWS Secrets Manager, Azure KeyVault). + +::: nl2sql.pipeline.nodes.validator.node.LogicalValidatorNode +::: nl2sql.security.policies.RolePolicy +::: nl2sql.secrets.manager.SecretManager diff --git a/mkdocs.yml b/mkdocs.yml index e89a5ba..8deb0d7 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -1,59 +1,41 @@ site_name: NL2SQL Platform -site_description: Production-grade Natural Language to SQL engine.
-site_url: https://nadeem4.github.io/nl2sql/ +site_description: Production-grade Natural Language to SQL Platform Documentation +site_author: Platform Engineering + theme: name: material features: - navigation.tabs - navigation.sections - navigation.expand - - search.suggest - - search.highlight + - navigation.top - toc.follow - palette: + - content.code.copy + palette: - scheme: default - primary: indigo - accent: indigo + primary: teal + accent: purple toggle: - icon: material/brightness-7 + icon: material/brightness-7 name: Switch to dark mode - scheme: slate - primary: indigo - accent: indigo + primary: teal + accent: lime toggle: icon: material/brightness-4 name: Switch to light mode - icon: - repo: fontawesome/brands/git-alt - - -extra: - social: - - icon: fontawesome/brands/github - link: https://github.com/nadeem4/nl2sql - -copyright: Copyright © 2024 NL2SQL Contributors - - -repo_url: https://github.com/nadeem4/nl2sql -repo_name: nl2sql -edit_uri: edit/main/docs/ - -# Validation settings (Strict Mode equivalent) -validation: - omitted_files: warn - absolute_links: warn - unrecognized_links: warn plugins: - search - + - mermaid2 + - mkdocstrings: + handlers: + python: + paths: [packages/core/src, packages/cli/src, packages/adapter-sdk/src] markdown_extensions: - pymdownx.highlight: anchor_linenums: true - line_spans: __span - pygments_lang_class: true - pymdownx.inlinehilite - pymdownx.snippets - pymdownx.superfences: @@ -61,42 +43,35 @@ markdown_extensions: - name: mermaid class: mermaid format: !!python/name:pymdownx.superfences.fence_code_format - - pymdownx.tabbed: - alternate_style: true - admonition - pymdownx.details - attr_list - - md_in_html nav: - Home: index.md - - Architecture: - - Overview: architecture/overview.md - - Security Model: architecture/security.md - - SQL Agent (Pipeline): architecture/sql_agent.md - - Data Indexing: architecture/indexing.md - - Routing Strategy: architecture/routing.md - - Adapter Plugins: architecture/adapters.md - - Guides: - - Getting Started: guides/getting_started.md - - Configuration: guides/configuration.md - - Deployment: guides/deployment.md - - Benchmarking: guides/benchmarking.md - - Production Roadmap: guides/production_roadmap.md - - Development: guides/development.md - - Reference: - - Test Data: reference/test_data.md - - CLI: reference/cli.md - - Lessons Learned: reference/lessons_learned.md - - API: - - GraphState: reference/api/graph_state.md - - Adapter SDK: reference/api/adapter_sdk.md - - Nodes: - - Aggregator: nodes/aggregator_node.md - - Decomposer: nodes/decomposer_node.md - - Executor: nodes/executor_node.md - - Generator: nodes/generator_node.md - - Planner: nodes/planner_node.md - - Refiner: nodes/refiner_node.md - - Semantic Analysis: nodes/semantic_analysis_node.md - - Logical Validator: nodes/validator_node.md + - Getting Started: + - Installation: getting-started/installation.md + - Quickstart (Demo): getting-started/demo.md + - Core Engine: + - Architecture: core/architecture.md + - Nodes & Pipeline: core/nodes.md + - Environments: core/environment.md + - Agents & Subgraphs: core/agents.md + - Retrieval & Indexing: core/indexing.md + - Security Architecture: safety/security.md + - Operations: + - Configuration: ops/configuration.md + - Observability: ops/observability.md + - Benchmarking: ops/benchmarking.md + - Error Reference: ops/errors.md + - Adapters: + - Overview: adapters/index.md + - Supported Databases: + - Postgres: adapters/postgres.md + - MySQL: adapters/mysql.md + - MSSQL: adapters/mssql.md + - 
SQLite: adapters/sqlite.md + - Extensions: + - Building Adapters: adapters/development.md + - Development: + - API Reference: dev/api.md diff --git a/requirements-docs.txt b/requirements-docs.txt index 1312c30..fac2496 100644 --- a/requirements-docs.txt +++ b/requirements-docs.txt @@ -1,3 +1,5 @@ mkdocs>=1.6.1 mkdocs-material>=9.5.0 pymdown-extensions>=10.0 +mkdocs-mermaid2-plugin>=1.1.1 +mkdocstrings[python]>=0.24.0
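With the Mermaid and mkdocstrings plugins now pinned, the documentation site can be previewed locally. A typical invocation, assuming `mkdocs.yml` sits at the repository root:

```bash
pip install -r requirements-docs.txt
mkdocs serve  # live-reload preview at http://127.0.0.1:8000
```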