Bug: Non-pandas DataFrames accepted by plot() crash in compute methods #1132

@lmeyerov

Summary

pa.Table (PyArrow) and pyspark.DataFrame work when passed to plot(), but crash with confusing errors in materialize_nodes(), get_degrees(), and hypergraph(). The library already has the conversion infrastructure — it just isn't called in those paths.

Support matrix

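| Type | plot() | materialize_nodes() / get_degrees() | hypergraph() |
|---|---|---|---|
| pd.DataFrame | ✅ | ✅ | ✅ |
| pa.Table | ✅ | ❌ | ❌ |
| cudf.DataFrame | ✅ | ✅ | ✅ |
| dask.DataFrame | | ⚠️ partial | ⚠️ partial |
| pyspark.DataFrame | ✅ | ❌ | ❌ |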

This issue covers the ❌ cells. Dask gaps are out of scope here.

Reproduction

import pyarrow as pa
import graphistry

edges = pa.table({'src': ['a', 'b', 'c'], 'dst': ['b', 'c', 'a']})
g = graphistry.edges(edges, 'src', 'dst')

g.plot()               # ✅ works
g.materialize_nodes()  # ❌ ValueError: Could not determine engine for edges,
                       #    expected pandas or cudf dataframe, got: pyarrow.Table
g.get_degrees()        # ❌ same

events = pa.table({'user': ['alice', 'bob'], 'action': ['click', 'view']})
g.hypergraph(events)   # ❌ AttributeError: 'pyarrow.lib.ChunkedArray' object
                       #    has no attribute 'dropna'

Same failures with pyspark.DataFrame.

Workaround

g = graphistry.edges(arrow_table.to_pandas(), 'src', 'dst')
g.hypergraph(spark_df.toPandas())
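
When the input type is not known ahead of time, the two conversions above can be wrapped in a single helper. This is a minimal user-side sketch, not a pygraphistry API: `to_pandas_for_compute` is a hypothetical name, and it identifies Arrow and Spark objects by the package their class is defined in, so neither library has to be imported. pandas and cuDF frames pass through untouched.

```python
def to_pandas_for_compute(df):
    """Convert Arrow/Spark tables to pandas; return anything else unchanged.

    Hypothetical helper for illustration, not part of pygraphistry.
    """
    # Top-level package that defines the object's class,
    # e.g. "pyarrow" for a pyarrow.lib.Table instance.
    top_level = type(df).__module__.split(".")[0]
    if top_level == "pyarrow":
        return df.to_pandas()  # pyarrow.Table.to_pandas()
    if top_level == "pyspark":
        return df.toPandas()   # pyspark DataFrame.toPandas()
    return df                  # pandas, cuDF, etc. are left as-is


# Intended use (requires graphistry, so left as a comment):
# g = graphistry.edges(to_pandas_for_compute(edges), 'src', 'dst')
# g.hypergraph(to_pandas_for_compute(events))
```

Checking the defining package rather than calling `to_pandas()` on anything that has it is deliberate: cuDF frames also expose `to_pandas()`, and converting them would silently move GPU data to the CPU.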

Root cause and fix

pygraphistry has two conversion patterns:

Upload path (plot()): graphistry/PlotterBase.py: _table_to_arrow() and _table_to_pandas() each have an explicit branch per supported type. Arrow and Spark are handled here and work correctly.

Compute/hypergraph paths: The intended pattern is resolve_engine(df) → coerce df to match the resolved engine → run engine-specific code. This was applied for pandas and cuDF but not completed for Arrow and Spark.
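
The resolve, coerce, dispatch pattern described above can be illustrated with a toy version. Everything in this sketch is a stand-in: `PandasFrame` and `ArrowTable` are placeholder classes so it runs without pandas or pyarrow, and this `resolve_engine()` is a simplified illustration that raises on unknown types, unlike the real function.

```python
from enum import Enum


class Engine(Enum):
    PANDAS = "pandas"
    CUDF = "cudf"


# Stand-in types so the sketch runs without pandas/pyarrow installed.
class PandasFrame:
    def __init__(self, rows):
        self.rows = rows


class ArrowTable:
    def __init__(self, rows):
        self.rows = rows

    def to_pandas(self):
        return PandasFrame(list(self.rows))


def resolve_engine(df):
    # Step 1: pick an engine. Arrow has no engine of its own, so it maps to PANDAS.
    if isinstance(df, (PandasFrame, ArrowTable)):
        return Engine.PANDAS
    raise ValueError(f"Could not determine engine, got: {type(df).__name__}")


def coerce_to_engine(df, engine):
    # Step 2: make the data match the engine that was chosen for it.
    if engine == Engine.PANDAS and not isinstance(df, PandasFrame):
        return df.to_pandas()
    return df


def count_rows(df):
    engine = resolve_engine(df)
    df = coerce_to_engine(df, engine)
    # Step 3: engine-specific code can now assume a PandasFrame.
    return len(df.rows)


print(count_rows(ArrowTable([("a", "b"), ("b", "c")])))  # 2
print(count_rows(PandasFrame([("a", "b")])))             # 1
```

The failures in this issue correspond to skipping step 2: the engine resolves (or falls through) to pandas, but the Arrow or Spark object reaches pandas-specific code unchanged.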

Three localized fixes, all using existing infrastructure:

Fix 1 — graphistry/Engine.py: resolve_engine()

Currently resolve_engine() returns Engine.PANDAS for unrecognized types via silent fallthrough (line ~84). Add explicit branches before the fallthrough:

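# Arrow has no compute engine of its own; it is coerced to pandas downstream
if isinstance(g_or_df, pa.Table):
    return Engine.PANDAS

# Spark is optional: only run the isinstance check when pyspark is importable
spark = maybe_spark()
if spark is not None and isinstance(g_or_df, spark.sql.dataframe.DataFrame):
    return Engine.PANDAS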
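
The Spark branch depends on `maybe_spark()` returning `None` when pyspark is not installed, so the `isinstance` check is never evaluated without the library. A minimal sketch of that optional-import guard follows. `maybe_module` and `isinstance_optional` are hypothetical names used only for illustration; the real `maybe_spark()` may be implemented differently.

```python
import importlib


def maybe_module(name):
    """Return the imported module, or None if it is not installed."""
    try:
        return importlib.import_module(name)
    except ImportError:
        return None


def isinstance_optional(obj, module_name, class_name):
    """isinstance() against a class from an optional dependency.

    Returns False instead of raising when the dependency is missing.
    """
    mod = maybe_module(module_name)
    if mod is None:
        return False
    return isinstance(obj, getattr(mod, class_name))


# Demonstrated with the standard library so the sketch runs anywhere.
from collections import OrderedDict

print(isinstance_optional(OrderedDict(), "collections", "OrderedDict"))  # True
print(isinstance_optional(OrderedDict(), "no_such_module_xyz", "Thing"))  # False

# The Spark check would look like:
# isinstance_optional(g_or_df, "pyspark.sql.dataframe", "DataFrame")
```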

Fix 2 — graphistry/compute/ComputeMixin.py: materialize_nodes()

Currently, materialize_nodes() (line ~191) checks isinstance(g._edges, pd.DataFrame), then cudf, and raises for anything else. Once engine detection resolves to Engine.PANDAS, add a coerce step before the engine-specific code runs:

if engine_concrete == Engine.PANDAS and not isinstance(g._edges, pd.DataFrame):
    g = g.edges(self._table_to_pandas(g._edges)).nodes(self._table_to_pandas(g._nodes))

Fix 3 — graphistry/hyper_dask.py: hypergraph()

After resolve_engine() (line ~817), raw_events is still in its original type when passed to engine-specific ops. Add one coerce-at-entry block before screen_entities() is called:

if engine_resolved == Engine.PANDAS and not isinstance(raw_events, pd.DataFrame):
    raw_events = _table_to_pandas(raw_events)

_table_to_pandas() already handles Arrow and Spark — no new conversion logic needed. This same fix will cover Polars once #1124 adds it to _table_to_pandas().

Testing

Add to tests/test_compute.py and tests/test_hypergraph.py (or a new tests/test_df_types.py):

  • pa.table(...) in materialize_nodes() → returns pandas-backed result, no error
  • pa.table(...) in get_degrees() → same
  • pa.table(...) in hypergraph() → returns valid Hypergraph result
  • Repeat for Spark if available; skip gracefully if not

Relationship to #1124

Once _table_to_pandas() gains a Polars branch (per #1124), Fix 2 and Fix 3 above automatically cover Polars in materialize_nodes() and hypergraph() — no additional Polars-specific code needed in those paths.
