Skip to content

feat(python): add Python bindings for ggsql#74

Merged
georgestagg merged 22 commits intoposit-dev:mainfrom
cpsievert:feature/python-e2e-tests
Jan 29, 2026
Merged

feat(python): add Python bindings for ggsql#74
georgestagg merged 22 commits intoposit-dev:mainfrom
cpsievert:feature/python-e2e-tests

Conversation

@cpsievert
Copy link
Collaborator

@cpsievert cpsievert commented Jan 20, 2026

This PR introduces ggsql-python, a new Python package providing PyO3 bindings for the ggsql crate. The package enables Python users to render Altair charts from DataFrames using ggsql's VISUALISE syntax.

  • Add ggsql-python package with PyO3 bindings exposing split_query() and render_altair() functions
  • render_altair() returns an altair.Chart directly for easy display and customization
  • Support any narwhals-compatible DataFrame (polars, pandas, etc.) with automatic LazyFrame collection
  • Add CI workflow testing across Python 3.10-3.13 on Linux, macOS, and Windows
  • Commit tree-sitter generated files to enable Windows CI builds without tree-sitter CLI

Key Features

API:

import ggsql
import duckdb

# Split a ggsql query into SQL and VISUALISE portions
sql, viz = ggsql.split_query("""
    SELECT date, revenue, region FROM sales
    WHERE year = 2024
    VISUALISE date AS x, revenue AS y, region AS color
    DRAW line
    LABEL title => 'Sales by Region'
""")

# Execute SQL with a Python package like duckdb, polars, sqlalchemy, etc
df = duckdb.sql(sql).pl()

# Render DataFrame + VISUALISE spec to Altair chart
chart = ggsql.render_altair(df, viz)

# Display or save the chart
chart.display()  # In Jupyter
chart.save("chart.html")  # Save to file

DataFrame Support:

  • Polars DataFrames (native)
  • Polars LazyFrames (auto-collected)
  • Pandas DataFrames (via narwhals)
  • Any narwhals-compatible DataFrame type

Mapping Styles:

  • Explicit: VISUALISE x AS x, y AS y
  • Implicit: VISUALISE x, y (column name matches aesthetic)
  • Wildcard: VISUALISE * (map all matching columns)

Building & Installation

Requirements:

  • Python 3.10+
  • Rust toolchain (install via rustup)
  • maturin for building

From source (development mode):

cd ggsql-python
python -m venv .venv
source .venv/bin/activate  # or `.venv\Scripts\activate` on Windows
pip install maturin
maturin develop

Build a wheel:

cd ggsql-python
maturin build --release
pip install target/wheels/ggsql-*.whl

Run tests:

pip install pytest
pytest tests/ -v

Dependencies:

  • altair>=5.0 - Chart rendering
  • narwhals>=2.15.0 - DataFrame compatibility layer
  • polars>=1.0 - Native DataFrame support
  • pyarrow>=14.0 - Required by pyo3-polars

Test plan

  • split_query() correctly separates SQL and VISUALISE portions
  • render_altair() accepts polars DataFrames
  • render_altair() accepts polars LazyFrames (auto-collected)
  • render_altair() accepts narwhals DataFrames
  • render_altair() accepts pandas DataFrames (via narwhals)
  • render_altair() rejects invalid DataFrame types with TypeError
  • render_altair() returns altair.Chart instance
  • render_altair() passes **kwargs through to altair.Chart.from_json()
  • Invalid VISUALISE syntax raises ValueError
  • CI runs on Python 3.10-3.13 across ubuntu, macos, and windows

🤖 Generated with Claude Code

@cpsievert cpsievert force-pushed the feature/python-e2e-tests branch 6 times, most recently from e30a71a to 874472b Compare January 21, 2026 22:44
cpsievert and others added 2 commits January 22, 2026 09:20
Adds Python bindings for ggsql via PyO3/maturin, enabling Python users
to render Vega-Lite visualizations from polars DataFrames.

Features:
- split_query(): Split a ggSQL query into SQL and VISUALISE parts
- render(): Render a DataFrame with a VISUALISE spec to Vega-Lite JSON
- Supports explicit, implicit, and wildcard mappings
- Works with Python polars >= 1.0

Includes:
- Unit tests and E2E tests with altair
- GitHub Actions workflow for Python 3.10-3.14
- README with installation and development guide

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@cpsievert cpsievert force-pushed the feature/python-e2e-tests branch from 874472b to 994062e Compare January 22, 2026 15:20
cpsievert and others added 10 commits January 22, 2026 09:57
- Add default-members to Cargo.toml so cargo build/test work without
  requiring --exclude flags. ggsql-python requires Python dev headers
  and must be built separately with maturin.
- Fix invalid import path in ggsql-python (ggsql::parser::ast doesn't
  exist, AestheticValue is exported at crate root)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Remove speculative <2.0 upper bound on maturin
- Split extras into test, e2e, and dev for better granularity
- Update CI workflow to use extras instead of hardcoded pip installs

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Fix wheel install using --find-links instead of glob patterns
- Add Rust caching (Swatinem/rust-cache) to speed up builds
- Enable sccache in maturin-action for compilation caching
- Remove Python 3.14 from matrix (still experimental)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add _to_polars() helper to convert narwhals/pandas DataFrames to polars
- Use GLOBAL_DATA_KEY constant instead of hardcoded "__global__"

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add narwhals dependency for handling IntoFrame types
- Simplify render() to accept any narwhals-compatible DataFrame
- Add tests for GLOBAL_DATA_KEY fix, narwhals, and pandas DataFrames
- Fix test_split_query to test VISUALISE FROM behavior

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The maturin build outputs to the workspace root's target/wheels/
directory, not ggsql-python/target/wheels/. This was causing pip
to fail finding the wheel.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
pyarrow is not required - pyo3-polars handles DataFrame conversion
without Arrow FFI. Updated README to reflect narwhals support.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The pyo3-polars crate requires pyarrow for the Python-to-Rust DataFrame
conversion. Without it, Series.to_arrow() fails with ModuleNotFoundError.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The generated parser.c and related files are required for the build.
While Unix builds can regenerate these via tree-sitter-cli, Windows CI
often fails due to missing npm/tree-sitter dependencies. Committing
these files is the standard practice for tree-sitter grammars.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Comment on lines 27 to 32
# Tree-sitter generated files
/tree-sitter-ggsql/src/parser.c
/tree-sitter-ggsql/src/tree_sitter/
/tree-sitter-ggsql/src/node-types.json
/tree-sitter-ggsql/src/grammar.json

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change was needed since tree-sitter generate was failing on Windows. Claude offered a few different ways to fix this, but recommended this approach:

Approach Repo Size Build Speed Reliability Complexity
Commit generated files +1.3MB Fast High Low
CI installs tree-sitter-cli Small Slow Medium Low
tree-sitter-cli crate Small Slow High Medium
Conditional build.rs Small* Fast* Medium Medium

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the issue is simply that tree sitter is not installed in CI, I'd rather not check in the generated files and instead install it on Windows using something like https://github.com/tree-sitter/setup-action/tree/master.

Those generated files can get very large as grammars get complex, and I'd like to avoid the noise if we can.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do recall having to do a little 'nvm' song and dance to please installation on windows, but I've forgotten the details already.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in 8ffdc1a

@cpsievert cpsievert marked this pull request as ready for review January 22, 2026 18:01
cpsievert and others added 8 commits January 22, 2026 12:08
- Add altair as required dependency (moved from optional e2e)
- render() now returns altair.Chart instead of JSON string
- Add @overload decorator for future writer support
- Update tests to work with Chart return type
- Simplify CI workflow by removing separate e2e job
- Update README with new API and usage examples

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Validates writer value in Python before calling into Rust,
providing a clearer error message with supported options.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Remove test_altair_e2e.py (e2e naming no longer meaningful)
- Consolidate all tests into test_ggsql.py
- Focus tests on Python-specific logic:
  - DataFrame conversion via narwhals
  - Writer parameter validation
  - Return type handling
- Remove tests that were testing Rust logic (chart structure, encodings, etc.)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Rename render() to render_altair() for clearer API
- Remove writer parameter (always renders to Altair)
- Add **kwargs pass-through to altair.Chart.from_json()
- Update tests and README

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Replace pyo3-polars PyDataFrame with IPC byte serialization
- Use polars native IPC reader/writer for Python-Rust data transfer
- Remove pyarrow from dependencies (saves ~117MB install size)
- Add proper Altair chart type detection (LayerChart, FacetChart, etc.)
- Wheel size: 5.8MB → 6.1MB (but eliminates 117MB pyarrow dep)

The IPC approach adds <2ms overhead for 1M rows, negligible compared
to Altair's JSON parsing which dominates execution time.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add section 8 documenting ggsql-python package
- Update feature flags to reflect Python is implemented (not planned)
- Update feature dependencies mapping for ggsql-python

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@cpsievert
Copy link
Collaborator Author

cpsievert commented Jan 26, 2026

@georgestagg or @thomasp85 this is ready for a review whenever.

BTW, If neither of you have interest/bandwidth in reviewing the Python package (which, at least for now, I'm primarily wanting for querychat), then I'd be happy to gain write access and be considered a "maintainer" of the Python package. I'll handle submitting to PyPI, etc.

@georgestagg
Copy link
Collaborator

I am going to be very busy this week with other things, but I will take a look at this when I can. My apologies if it blocks the querychat work in the meantime.

I would like to think carefully about the consequences of checking in the tree sitter artefacts, and whether the split_query entry point is still the right design at our end. Probably it should be renamed at least.

@thomasp85
Copy link
Collaborator

I'm afraid my knowledge and interest in Python is too limited to provide any sensible review

Windows CI builds require tree-sitter-cli to generate parser files at
build time. This adds the npm package installation to all workflows
that build Rust code and removes the pre-generated parser files from
version control.

Changes:
- Install tree-sitter-cli via npm in build, python, publish, and R workflows
- Update build.rs to generate parser files during compilation
- Remove pre-generated tree-sitter files from git (grammar.json, parser.c, etc.)
- Add generated files to .gitignore

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@cpsievert cpsievert force-pushed the feature/python-e2e-tests branch from 8cc8caa to 8ffdc1a Compare January 27, 2026 22:14
Doc tests were causing linker crashes (Bus error) on CI due to memory
pressure when linking each doc test binary against the large dependency
tree (polars, duckdb, etc.). Skip doc tests in CI since unit tests
provide the real coverage.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@cpsievert cpsievert force-pushed the feature/python-e2e-tests branch from 746d63b to d0bb9fe Compare January 28, 2026 00:06
@georgestagg
Copy link
Collaborator

georgestagg commented Jan 28, 2026

skip doc tests to avoid linker memory issues

If you bang up against this again, I've used this little snippet in the past. A little cheeky, perhaps, but it seems to work.

  - name: Increase swapfile
    run: |
      sudo swapoff -a
      sudo fallocate -l 15G /swapfile
      sudo chmod 600 /swapfile
      sudo mkswap /swapfile
      sudo swapon /swapfile
      sudo swapon --show

We could also try the with the posit-dev runners.

Copy link
Collaborator

@georgestagg georgestagg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, this looks good to go in. The infrastructure looks good, thank you for making the Python build optional.

Now, the tricky bit. I am going to rework the Rust API, as discussed elsewhere, and so the Python package will need to change in response. I will handle that, but in a followup PR and tag you for review. I will try to keep the render_altair() interface as it currently is presented to the end user.

@georgestagg georgestagg merged commit 177e000 into posit-dev:main Jan 29, 2026
16 checks passed
@cpsievert cpsievert deleted the feature/python-e2e-tests branch January 29, 2026 19:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants