Merged

16 commits
e0c217f
feat(#1133): add Polars DataFrame and LazyFrame support
lmeyerov Apr 18, 2026
94bec35
fix(ci): add polars to mypy ignore_missing_imports
lmeyerov Apr 18, 2026
afdb1ee
fix(tests): initialize pl=None before try/except so test_polars skips…
lmeyerov Apr 19, 2026
b9ca79a
docs(#1133): add Polars support to CHANGELOG
lmeyerov Apr 19, 2026
6773646
fix(#1133): audit fixes — DRY, module-string, test coverage
lmeyerov Apr 19, 2026
96e10da
fix(tests): use gfql() instead of deprecated chain() in polars hop/ch…
lmeyerov Apr 19, 2026
a4e5bd0
fix(tests): drop gfql hop/chain tests — ASTNode.execute pre-existing …
lmeyerov Apr 19, 2026
42c1705
fix(#1133): correct to_pandas(), clarify gpu-engine guard, add chain/…
lmeyerov Apr 20, 2026
5acc35d
fix(#1133): review cleanup — drop redundant _resolve_engine alias, ad…
lmeyerov Apr 21, 2026
7634c68
fix(lint): remove duplicate TestCombineStepsEdgeCases, add engine to …
lmeyerov Apr 22, 2026
9b2d87d
fix(lint): remove duplicate [mypy-polars.*] section that broke mypy e…
lmeyerov Apr 22, 2026
7ca4f3f
fix(review): pass in maybe_polars(); restructure test_polars.py to mo…
lmeyerov Apr 23, 2026
889343b
test(#1133): add unit tests for dbg_df, s_na, df_to_engine(DASK)
lmeyerov Apr 23, 2026
0c00a9f
fix(review): polars CI job, OTel attrs fix, ImportError guard, docstr…
lmeyerov Apr 23, 2026
d92da79
chore: rebase onto master after #1148 merge
lmeyerov Apr 24, 2026
49341e3
refactor(tests): move test_polars, test_engine_coercion, test_df_type…
lmeyerov Apr 24, 2026
44 changes: 43 additions & 1 deletion .github/workflows/ci.yml
@@ -891,6 +891,48 @@ jobs:
source pygraphistry/bin/activate
./bin/test-graphviz.sh

test-polars:
needs: [ test-minimal-python, test-gfql-core, generate-lockfiles ]
if: ${{ success() }}
runs-on: ubuntu-latest
timeout-minutes: 10

strategy:
matrix:
        python-version: ['3.9', '3.10', '3.11', '3.12', '3.13', '3.14']

steps:

- name: Checkout repo
uses: actions/checkout@v4
with:
lfs: true
persist-credentials: false

- name: Set up Python ${{ matrix.python-version }}
uses: actions/setup-python@v5
with:
python-version: ${{ matrix.python-version }}

- name: Download lockfiles
uses: actions/download-artifact@v4
with:
name: lockfiles
path: requirements

- name: Install Python dependencies
run: |
python -m venv pygraphistry
source pygraphistry/bin/activate
python -m pip install --upgrade pip uv
uv pip install --require-hashes -r requirements/test-polars-py${{ matrix.python-version }}.lock
uv pip install -e . --no-deps

- name: Polars tests
run: |
source pygraphistry/bin/activate
./bin/test-polars.sh

test-core-umap:
needs: [ test-minimal-python, test-gfql-core, generate-lockfiles ]
# Inherit condition from test-minimal-python
@@ -1220,7 +1262,7 @@ jobs:
- name: Run Spark tests
run: |
source pygraphistry/bin/activate
-            python -B -m pytest graphistry/tests/test_df_types.py -v -k spark
python -B -m pytest graphistry/tests/compute/test_df_types.py -v -k spark


test-neo4j:
5 changes: 5 additions & 0 deletions CHANGELOG.md
@@ -19,6 +19,10 @@ This project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.htm
- **Release docs / metadata**: Updated publish instructions to pull `master` in fast-forward-only mode (`git pull --ff-only origin master`), require a clean working tree before tagging (`git status --short` should be empty), push only the intended tag ref (`git push origin refs/tags/X.Y.Z`) instead of `--tags`/ambiguous ref pushes, clarified manual publish dispatch as maintainer-only recovery on `master`, added guidance to avoid rerunning already-published versions, and normalized legacy `pypi.python.org` links in `README.md` to `pypi.org`.
- **CI / OIDC context tightening**: `publish-pypi.yml` now verifies repository/workflow identity via `GITHUB_REPOSITORY` + `GITHUB_WORKFLOW_REF` and enforces release-tag format checks before publish. `DEVELOP.md` now documents the required PyPI Trusted Publisher binding (`repository`, `workflow`, `environment`, and trusted refs) so external OIDC policy stays aligned with workflow constraints.

### Added
- **CI / Polars**: Added `test-polars` CI job (Python 3.9–3.14) with a dedicated `test-polars` lockfile profile; `polars` is now a named `setup.py` extra so the test matrix installs and exercises `test_polars.py` on every PR (#1133).
- **Polars support**: `polars.DataFrame` and `polars.LazyFrame` now work in `plot()`, `materialize_nodes()`, `get_degrees()`, `get_indegrees()`, `get_outdegrees()`, and `hypergraph()`. Polars is an optional dependency — no behavior change when not installed. Upload path uses efficient Arrow conversion (`to_arrow()` with schema-metadata stripping and memoization); compute/hypergraph paths coerce to pandas at entry. `LazyFrame` is materialized via `.collect()` at each boundary. Adds `test_polars.py` with 17 tests; skips gracefully when polars is absent (#1133).

### Fixed
- **GFQL / Cypher binder**: Replaced fragile regex-based WHERE label narrowing fallback in `_apply_where_label_narrowing` with AST-derived narrowing. `generic_where_clause` now lifts AND-joined bare label predicates (`WHERE n:Admin AND n:Active`) to structured `WhereClause.predicates` using the existing quote/bracket/paren/backtick-aware `_split_top_level_and_terms` helper; string-literal false-matches (e.g. `WHERE n.name = 'n:Admin'` incorrectly narrowing alias `n`) are closed by `fullmatch` anchoring. Removes `_WHERE_LABEL_RE` and `_WHERE_NON_CONJUNCTIVE_RE` from `binder.py`. Adds 10 targeted tests covering single/double/triple AND, multi-alias, multi-label-per-alias, lowercase `and`, XOR/OR/NOT conservative non-narrowing, mixed label+property all-or-nothing, and string-literal false-positive guards (#1125, #1193).
- **DataFrame engine coercion**: Unified all DataFrame-to-engine conversion behind `df_to_engine()` with explicit dispatch for Arrow, Spark, dask, dask_cudf, cuDF, Polars, and pandas; unknown types now raise `ValueError` instead of silently calling `.to_pandas()`. `_coerce_input_formats(g, engine)` replaces `_coerce_to_pandas(g)` as the engine-aware coercion entry point in `chain()`, `hop()`, and `materialize_nodes()`, preserving GPU (cuDF) output when input is cuDF. `to_pandas()` now handles all input types via the same dispatch. Adds `test_engine_coercion.py` with 50+ tests (#1148).
@@ -2508,3 +2512,4 @@ Code that looks like `g.edges(some_fn, None, None, some_arg)` should now be like
### Changed
- Removed deprecated docker test harness in favor of `docker/` - [#172](https://github.com/graphistry/pygraphistry/pull/172)


1 change: 1 addition & 0 deletions bin/generate-lockfiles.sh
@@ -34,6 +34,7 @@ PROFILE_DEFS=(
"test-compat-latest:test,bolt,nodexl:3.14:3.14:--constraint /tmp/pandas-latest.txt"
"test-compat-gfql-legacy:test:3.9:3.9:--constraint /tmp/pandas-legacy.txt"
"test-compat-gfql-latest:test:3.14:3.14:--constraint /tmp/pandas-latest.txt"
"test-polars:test,polars:3.9::"
"test-graphviz:test,pygraphviz:3.8::"
"test-umap:test,testai,umap-learn:3.9::--no-emit-package torch"
"test-ai:test,testai,ai:3.9::--no-emit-package torch --constraint /tmp/sentence-transformers-compat.txt"
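The lockfile profile entries above pack several fields into one colon-delimited string. A minimal parsing sketch, assuming the fields are name, extras, min/max Python version, and extra flags (field meanings inferred from the entries, not confirmed by the script):

```python
from typing import Dict, List, Union


def parse_profile(spec: str) -> Dict[str, Union[str, List[str]]]:
    # Split into exactly 5 fields; the trailing flags field may contain
    # spaces, so cap the split rather than splitting on every colon.
    name, extras, min_py, max_py, flags = spec.split(':', 4)
    return {
        'name': name,
        'extras': extras.split(',') if extras else [],
        'min_python': min_py,
        'max_python': max_py,  # empty appears to mean "use the default"
        'flags': flags,
    }
```

For example, `parse_profile("test-polars:test,polars:3.9::")` yields extras `['test', 'polars']` with empty max-Python and flags fields.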
13 changes: 13 additions & 0 deletions bin/test-polars.sh
@@ -0,0 +1,13 @@
#!/bin/bash
set -ex

# Run from project root
# - Args get passed to pytest phase
# Non-zero exit code on fail

# Assume [polars,test] installed

python -m pytest --version

python -B -m pytest -vv \
graphistry/tests/compute/test_polars.py
8 changes: 8 additions & 0 deletions graphistry/Engine.py
@@ -82,6 +82,14 @@ def resolve_engine(
except ImportError:
pass

if 'polars' in str(type(g_or_df).__module__):
try:
import polars as pl
if isinstance(g_or_df, (pl.DataFrame, pl.LazyFrame)):
return Engine.PANDAS
except ImportError:
pass

if 'cudf.core.dataframe' in str(getmodule(g_or_df)):
has_cudf_dependancy_, _, _ = lazy_cudf_import()
if has_cudf_dependancy_:
50 changes: 46 additions & 4 deletions graphistry/PlotterBase.py
@@ -142,6 +142,17 @@ def maybe_spark():
logger.warning('Runtime error import pyspark: Available but failed to initialize', exc_info=True)
return None

@lru_cache(maxsize=1)
def maybe_polars():
    try:
        import polars
        return polars
    except ImportError:
        pass
    except RuntimeError:
        logger.warning('Runtime error importing polars', exc_info=True)
    return None

# #####################################


Expand All @@ -165,13 +176,15 @@ class PlotterBase(Plottable):

_pd_hash_to_arrow : WeakValueDictionary = WeakValueDictionary()
_cudf_hash_to_arrow : WeakValueDictionary = WeakValueDictionary()
_polars_hash_to_arrow : WeakValueDictionary = WeakValueDictionary()
_umap_param_to_g : WeakValueDictionary = WeakValueDictionary()
_feat_param_to_g : WeakValueDictionary = WeakValueDictionary()

    def reset_caches(self):
"""Reset memoization caches"""
self._pd_hash_to_arrow.clear()
self._cudf_hash_to_arrow.clear()
self._polars_hash_to_arrow.clear()
self._umap_param_to_g.clear()
self._feat_param_to_g.clear()
cache_coercion_helper.cache_clear()
@@ -2753,7 +2766,8 @@ def _plot_dispatch(self, graph, nodes, name, description, mode='json', metadata=
or ( not (maybe_cudf() is None) and isinstance(graph, maybe_cudf().DataFrame) ) \
or ( not (maybe_dask_cudf() is None) and isinstance(graph, maybe_dask_cudf().DataFrame) ) \
or ( not (maybe_dask_dataframe() is None) and isinstance(graph, maybe_dask_dataframe().DataFrame) ) \
-            or ( not (maybe_spark() is None) and isinstance(graph, maybe_spark().sql.dataframe.DataFrame) ):
or ( not (maybe_spark() is None) and isinstance(graph, maybe_spark().sql.dataframe.DataFrame) ) \
or ( not (maybe_polars() is None) and isinstance(graph, (maybe_polars().DataFrame, maybe_polars().LazyFrame)) ):
return g._make_dataset(graph, nodes, name, description, mode, metadata, memoize, validate_mode, emit_warnings)

try:
@@ -2861,7 +2875,7 @@ def bind(df, pbname, attrib, default=None):

def _table_to_pandas(self, table) -> Optional[pd.DataFrame]:
"""
-        pandas | arrow | dask | cudf | dask_cudf => pandas
pandas | arrow | dask | cudf | dask_cudf | polars | spark => pandas
"""

if table is None:
@@ -2882,6 +2896,11 @@ def _table_to_pandas(self, table) -> Optional[pd.DataFrame]:
if not (maybe_dask_dataframe() is None) and isinstance(table, maybe_dask_dataframe().DataFrame):
return self._table_to_pandas(table.compute())

if not (maybe_polars() is None) and isinstance(table, (maybe_polars().DataFrame, maybe_polars().LazyFrame)):
if isinstance(table, maybe_polars().LazyFrame):
table = table.collect()
return table.to_pandas()

raise Exception('Unknown type %s: Could not convert data to Pandas dataframe' % str(type(table)))

def _find_bad_arrow_columns(self, df: Any, is_cudf: bool = False) -> List[str]:
@@ -2923,7 +2942,7 @@ def _coerce_mixed_type_columns(self, df: Any, is_cudf: bool = False, emit_warnin

def _table_to_arrow(self, table: Any, memoize: bool = True, validate_mode: ValidationMode = 'autofix', emit_warnings: bool = True) -> Optional[pa.Table]: # noqa: C901
"""
-        pandas | arrow | dask | cudf | dask_cudf => arrow
pandas | arrow | dask | cudf | dask_cudf | polars | spark => arrow

dask/dask_cudf convert to pandas/cudf

@@ -3035,6 +3054,29 @@ def _table_to_arrow(self, table: Any, memoize: bool = True, validate_mode: Valid
#TODO push the hash check to Spark
return self._table_to_arrow(df, memoize, validate_mode, emit_warnings)

if not (maybe_polars() is None) and isinstance(table, (maybe_polars().DataFrame, maybe_polars().LazyFrame)):
# validate_mode and emit_warnings are not applied for polars input: polars frames are
# strictly typed so mixed-type columns cannot exist, making validation a no-op here.
if isinstance(table, maybe_polars().LazyFrame):
table = table.collect()
hashed = None
if memoize:
try:
hashed = (
hashlib.sha256(table.hash_rows().to_numpy().tobytes()).hexdigest()
+ hashlib.sha256(str(table.columns).encode('utf-8')).hexdigest()
)
if hashed in PlotterBase._polars_hash_to_arrow:
return PlotterBase._polars_hash_to_arrow[hashed].v
except Exception:
logger.debug('Failed to hash polars frame', exc_info=True)
out = table.to_arrow().replace_schema_metadata({})
if memoize and hashed is not None:
w = WeakValueWrapper(out)
cache_coercion(hashed, w)
PlotterBase._polars_hash_to_arrow[hashed] = w
return out

raise Exception('Unknown type %s: Could not convert data to Arrow' % str(type(table)))

def to_arrow(
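The memoization key in `_table_to_arrow` combines a digest of polars' per-row hashes (`hash_rows()`) with a digest of the column names. The same keying idea, sketched over plain integer row hashes so it runs without polars (function name hypothetical):

```python
import hashlib
from typing import Iterable, Sequence


def frame_cache_key(row_hashes: Iterable[int], columns: Sequence[str]) -> str:
    # One digest over the row-level hashes, one over the column names:
    # equal data under a different schema must not share a cache entry.
    row_bytes = b''.join(h.to_bytes(8, 'little') for h in row_hashes)
    return (hashlib.sha256(row_bytes).hexdigest()
            + hashlib.sha256(str(list(columns)).encode('utf-8')).hexdigest())
```

Because both halves are hex SHA-256 digests, the key has a fixed 128-character length regardless of frame size, which suits its use as a `WeakValueDictionary` key.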
27 changes: 14 additions & 13 deletions graphistry/compute/gfql/df_executor.py
@@ -83,21 +83,22 @@ def edges_df_for_step(self, edge_idx: int, state: Optional[PathState] = None) ->
return state.pruned_edges[edge_idx] if state is not None and edge_idx in state.pruned_edges else self.forward_steps[edge_idx]._edges

def run(self) -> Plottable:
-        mode = os.environ.get(_CUDF_MODE_ENV, "auto").lower()
-        if self.inputs.engine == Engine.CUDF:
-            cudf_available = True
-            try:
-                import cudf  # type: ignore # noqa: F401
-            except Exception:
-                cudf_available = False
-            if not cudf_available:
-                if mode == "strict":
-                    raise RuntimeError(
-                        "cuDF engine requested with strict mode but cudf is unavailable")
-                # auto mode: fall back to pandas transparently
-                self.inputs = dataclass_replace(self.inputs, engine=Engine.PANDAS)
# Collect OTel attrs after engine fallback so gfql.engine reflects actual execution engine
attrs = self._otel_attrs() if otel_enabled() else None
with otel_span("gfql.df_executor.run", attrs=attrs):
mode = os.environ.get(_CUDF_MODE_ENV, "auto").lower()
if self.inputs.engine == Engine.CUDF:
cudf_available = True
try:
import cudf # type: ignore # noqa: F401
except Exception:
cudf_available = False
if not cudf_available:
if mode == "strict":
raise RuntimeError(
"cuDF engine requested with strict mode but cudf is unavailable")
# auto mode: fall back to pandas transparently
self.inputs = dataclass_replace(self.inputs, engine=Engine.PANDAS)
self._forward()
if mode == "oracle":
return self._unsafe_run_test_only_oracle()
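The strict/auto fallback in `run()` reduces to a small policy function. A sketch under assumed names (the env var and function name here are hypothetical; the real code reads `_CUDF_MODE_ENV`):

```python
import os


def pick_engine(requested: str, cudf_available: bool) -> str:
    # 'strict' turns a missing cudf into a hard error; 'auto' (the
    # default) falls back to pandas transparently so callers still run.
    mode = os.environ.get('GRAPHISTRY_CUDF_MODE', 'auto').lower()
    if requested == 'cudf' and not cudf_available:
        if mode == 'strict':
            raise RuntimeError(
                'cuDF engine requested with strict mode but cudf is unavailable')
        return 'pandas'
    return requested
```

Note the ordering constraint the diff comment calls out: the engine decision must happen before OTel attributes are collected, so `gfql.engine` reflects the engine actually used.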
14 changes: 3 additions & 11 deletions graphistry/hyper_dask.py
@@ -4,7 +4,7 @@

from typing import TYPE_CHECKING, Any, Dict, List, Optional, Union
from typing_extensions import Literal
-from .Engine import Engine, EngineAbstractType, DataframeLike, DataframeLocalLike, resolve_engine
from .Engine import Engine, EngineAbstractType, DataframeLike, DataframeLocalLike, resolve_engine, df_to_engine
import numpy as np, pandas as pd, pyarrow as pa, sys
from .util import setup_logger
logger = setup_logger(__name__)
@@ -817,17 +817,9 @@ def hypergraph(
engine_resolved = resolve_engine(engine, raw_events)
else:
engine_resolved = engine
-    # Coerce input-format types (Arrow, Spark) to the resolved engine's native type
# Coerce input-format types (Arrow, Spark, Polars, dask) to the resolved engine's native type
if raw_events is not None and engine_resolved == Engine.PANDAS and not isinstance(raw_events, pd.DataFrame):
-        if isinstance(raw_events, pa.Table):
-            raw_events = raw_events.to_pandas()
-        else:
-            try:
-                from pyspark.sql import DataFrame as SparkDataFrame
-                if isinstance(raw_events, SparkDataFrame):
-                    raw_events = raw_events.toPandas()
-            except ImportError:
-                pass
raw_events = df_to_engine(raw_events, Engine.PANDAS)

defs = HyperBindings(**opts)
entity_types = [i for i in screen_entities(raw_events, entity_types, defs) if i != defs.event_id]
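`df_to_engine()` centralizes the per-backend branches the old inline code spelled out. A toy version of the dispatch idea (names and mapping hypothetical; the real function performs the conversion rather than labeling it):

```python
def coercion_strategy(obj) -> str:
    # Route on the defining module's name so optional backends (polars,
    # spark, cudf, ...) need not be importable just to reject an input.
    mod = str(type(obj).__module__)
    for prefix, strategy in (
        ('pandas', 'identity'),
        ('pyarrow', 'Table.to_pandas()'),
        ('polars', 'collect() then to_pandas()'),
        ('pyspark', 'toPandas()'),
    ):
        if mod.startswith(prefix):
            return strategy
    # Mirrors df_to_engine(): unknown types raise instead of guessing.
    raise ValueError('Unknown dataframe type: %s' % type(obj))
```

The explicit `ValueError` on unknown types matches the CHANGELOG note above: silent `.to_pandas()` calls on unrecognized inputs are replaced by a loud failure.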