notashes/glued

Glue Query Runner

Run SQL queries against Delta Lake tables in AWS Glue Data Catalog with DataFusion.

Why?

Query your Glue-registered Delta tables without Spark. Fast, lightweight, Rust-native.

Quick Start

# Run a query
cargo run --release --bin sql-runner -- \
  --file query.sql \
  --database my_database

# With debug tracing
RUST_LOG=glue_query_runner=debug cargo run --release --bin sql-runner -- \
  --file query.sql \
  --database my_database

Features

  • Builder API - Ergonomic, fluent configuration
  • Automatic caching - Table metadata cached with TTL (5min default)
  • Structured tracing - Optional debug logging with RUST_LOG
  • Batch processing - Process entire folders of SQL files
  • Query plans - --explain flag for optimization insights

Library Usage

Simple (one-liner setup)

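use datafusion::prelude::*;
use glue_query_runner::GlueContextExt;

// Assumes a Tokio runtime; DataFusion and the AWS SDK are both async.
#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Load region and credentials from the default AWS provider chain.
    let aws_config = aws_config::load_defaults(aws_config::BehaviorVersion::latest()).await;

    // Attach the Glue database to a new DataFusion session.
    let ctx = SessionContext::new()
        .with_glue_database(&aws_config, "my_database")
        .await?;

    let df = ctx.sql("SELECT * FROM my_table LIMIT 10").await?;
    df.show().await?;
    Ok(())
}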

Advanced (builder pattern)

use glue_query_runner::{GlueSchemaProvider, catalog::GlueCatalog};
use std::sync::Arc;
use std::time::Duration;

// `aws_config` is the SdkConfig loaded as in the simple example.
let glue_catalog = Arc::new(GlueCatalog::new(&aws_config));

let schema = GlueSchemaProvider::builder("my_database")
    .catalog(glue_catalog)
    .cache_ttl(Duration::from_secs(600))  // cache table metadata for 10 minutes
    .build()
    .await?;

CLI Options

# Single file
--file query.sql --database my_db

# Batch folder
--input-folder ./queries --output-folder ./results --database my_db

# With execution plans
--file query.sql --database my_db --explain
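
As a rough illustration of what batch mode involves, the sketch below collects the `.sql` files in an input folder and derives one result path per file. This is not the tool's actual implementation: the function names `collect_sql_files` and `result_path` and the `.txt` result extension are assumptions made for the example.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Collect the `.sql` files directly inside `dir`, sorted for a stable run order.
fn collect_sql_files(dir: &Path) -> io::Result<Vec<PathBuf>> {
    let mut files = Vec::new();
    for entry in fs::read_dir(dir)? {
        let path = entry?.path();
        let is_sql = path.extension().and_then(|ext| ext.to_str()) == Some("sql");
        if path.is_file() && is_sql {
            files.push(path);
        }
    }
    files.sort();
    Ok(files)
}

/// Map an input file to a result path inside `output_dir`.
/// The `.txt` extension is a hypothetical choice, not the tool's documented naming.
fn result_path(input: &Path, output_dir: &Path) -> Option<PathBuf> {
    // Append to the stem rather than replacing an extension, so a name like
    // `top.customers.sql` becomes `top.customers.txt`.
    let mut name = input.file_stem()?.to_os_string();
    name.push(".txt");
    Some(output_dir.join(name))
}
```

Sorting the file list keeps runs reproducible, since `read_dir` does not guarantee any particular order.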

SQL Format

Separate queries with semicolons; text after -- is a comment:

-- Get top customers
SELECT customer_id, SUM(amount) as total
FROM orders
GROUP BY customer_id
ORDER BY total DESC
LIMIT 10;

-- Daily revenue
SELECT CAST(order_date AS DATE) AS order_day, SUM(amount) AS revenue
FROM orders
GROUP BY CAST(order_date AS DATE);
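
To make the file format concrete, here is a minimal sketch of how a script in this format could be split into individual statements. It is an illustration under stated assumptions, not the runner's actual parser: `split_statements` is a hypothetical name, and the scanner only recognizes single-quoted string literals and `--` line comments.

```rust
/// Split a SQL script into statements: drop `--` comments, split on `;`.
/// Semicolons and dashes inside single-quoted strings are left untouched.
fn split_statements(script: &str) -> Vec<String> {
    let mut statements = Vec::new();
    let mut current = String::new();
    let mut in_string = false;
    let mut chars = script.chars().peekable();

    while let Some(c) = chars.next() {
        match c {
            // Toggle string state on each quote; an escaped quote ('') toggles twice.
            '\'' => {
                in_string = !in_string;
                current.push(c);
            }
            // `--` outside a string starts a comment that runs to the end of the line.
            '-' if !in_string && chars.peek() == Some(&'-') => {
                while let Some(&next) = chars.peek() {
                    if next == '\n' {
                        break;
                    }
                    chars.next();
                }
            }
            // `;` outside a string ends the current statement.
            ';' if !in_string => {
                let stmt = current.trim();
                if !stmt.is_empty() {
                    statements.push(stmt.to_string());
                }
                current.clear();
            }
            _ => current.push(c),
        }
    }

    // Keep a trailing statement that has no terminating semicolon.
    let tail = current.trim();
    if !tail.is_empty() {
        statements.push(tail.to_string());
    }
    statements
}
```

Applied to the example above, this would yield two statements, each of which could then be passed to `ctx.sql(...)` in turn.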

Features (Cargo)

[features]
default = ["tracing", "cache"]
tracing = ["dep:tracing", "dep:tracing-subscriber"]
cache = ["dep:moka"]

Both features are on by default; build with --no-default-features to disable them.

Tracing Example

RUST_LOG=glue_query_runner=debug cargo run --release --bin sql-runner -- \
  --file query.sql \
  --database my_database

Output:

DEBUG Fetching table list from Glue catalog
INFO  Retrieved table names from Glue count=100
DEBUG Table cache miss
INFO  Resolved table location location=s3://bucket/table
INFO  Delta table loaded files=21
DEBUG Table cached
DEBUG Table cache hit  # <-- Second query instant!

How It Works

  1. GlueCatalog - Fetches table locations from AWS Glue
  2. GlueSchemaProvider - Lazy-loads Delta tables, caches results
  3. DataFusion - Executes SQL with full query optimization
  4. Moka cache - In-memory TTL cache for table metadata
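
The lazy-load-then-cache behaviour in steps 2 and 4 can be sketched with the standard library alone. The crate itself uses Moka for this; the `TtlCache` type and `get_or_load` method below are hypothetical names for illustration, and the sketch is single-threaded, whereas Moka's cache is concurrent.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Minimal TTL cache: an entry is considered stale `ttl` after it was stored.
struct TtlCache<V> {
    ttl: Duration,
    entries: HashMap<String, (Instant, V)>,
}

impl<V: Clone> TtlCache<V> {
    fn new(ttl: Duration) -> Self {
        Self { ttl, entries: HashMap::new() }
    }

    /// Return the cached value for `key`, or call `load` on a miss or stale entry.
    /// The returned bool is true when the value came from the cache.
    fn get_or_load<F>(&mut self, key: &str, load: F) -> (V, bool)
    where
        F: FnOnce() -> V,
    {
        if let Some((stored_at, value)) = self.entries.get(key) {
            if stored_at.elapsed() < self.ttl {
                return (value.clone(), true); // cache hit
            }
        }
        // Cache miss or stale entry: load once and store with a fresh timestamp.
        let value = load();
        self.entries
            .insert(key.to_string(), (Instant::now(), value.clone()));
        (value, false)
    }
}
```

In the real flow, the `load` closure would stand in for the expensive part: resolving the table location through Glue and opening the Delta table.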

AWS Setup

Uses the default AWS credential chain:

export AWS_REGION=us-east-1
export AWS_PROFILE=your_profile
# or AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY

Performance

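  • First query on a table: ~3s (Glue lookup + Delta load)
  • Later queries on the same table: near-instant table resolution (a cache hit skips the Glue lookup and Delta load)
  • Cache TTL: 5 minutes by default (configurable via cache_ttl)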

Dependencies

  • deltalake 0.28 - Delta Lake with S3
  • datafusion 49.0 - Query engine
  • aws-sdk-glue 1.73 - Glue API
  • moka 0.12 - Async cache (optional)
  • tracing 0.1 - Structured logging (optional)

License

Apache-2.0
