Skip to content

fix(plot): Fix dataset_id invalidation in encoding and metadata methods#797

Merged
lmeyerov merged 11 commits into
masterfrom
fix/dataset-file-id-invalidation
Oct 17, 2025
Merged

fix(plot): Fix dataset_id invalidation in encoding and metadata methods#797
lmeyerov merged 11 commits into
masterfrom
fix/dataset-file-id-invalidation

Conversation

@lmeyerov
Copy link
Copy Markdown
Contributor

@lmeyerov lmeyerov commented Oct 17, 2025

Summary

Fixed dataset_id invalidation logic for encoding, metadata, and style methods. Dataset IDs now properly invalidate when encodings, name, description, or visualization styles change, preventing stale ID reuse after modifying visualization properties.

Analysis Approach

Used complementary analysis tools to ensure complete coverage:

  • AST pattern matching: Found 9 methods with direct attribute modifications
  • Pysa call graph analysis: Found 11 methods total, including 2 additional bugs AST missed (+17% bug detection)

Changes

Fixed 7 bugs in graphistry/PlotterBase.py:

  1. __encode() helper - Now invalidates dataset_id (fixes all 8 encoding methods)
  2. encode_axis() - Now invalidates dataset_id
  3. name() - Now invalidates dataset_id
  4. description() - Now invalidates dataset_id
  5. bind() - Conditional invalidation (only when changing encodings, not when setting IDs)
  6. style() - Now invalidates dataset_id (Pysa-discovered)
  7. addStyle() - Now invalidates dataset_id (Pysa-discovered)

Why Pysa Found More:

  • AST finds direct patterns: res._attr = value
  • Pysa finds indirect patterns: res = self.bind() followed by modifications
  • style() and addStyle() use double-copy pattern via bind() that AST missed

Test Coverage

Comprehensive test suite with 25 tests in graphistry/tests/test_dataset_id_invalidation.py:

  • ✅ 11 core tests for basic invalidation scenarios
  • ✅ 3 Pysa-discovered bug tests (style, addStyle, infer_labels)
  • ✅ 8 parametric tests covering ALL encoding methods (future-proof)
  • ✅ 3 bind() edge case tests (multiple encodings, no params, mixed usage)

Test design principles:

  • Parametric over individual (one test covers all 8 encoding methods)
  • Future-proof (adding new encoding = add to parametrize list)
  • Non-brittle (clear assertions, independent tests, representative samples)

Validation

  • ✅ All 25 tests passing (Python 3.8 and 3.12)
  • ✅ PlotterBase tests passing (48 passed, no regressions)
  • ✅ Type checking clean (mypy)
  • ✅ Linting clean (flake8)

Impact

Server-side uploads now correctly detect when datasets have changed, preventing visualization artifacts from stale data. Multi-tool analysis (AST + Pysa) provided 17% better bug detection than AST alone.

🤖 Generated with Claude Code

Comment thread ai/prompts/PYRE_ANALYSIS.md Outdated
Comment thread ai/prompts/PYRE_ANALYSIS.md Outdated
Comment thread ai/prompts/PYRE_ANALYSIS.md Outdated
Comment thread ai/prompts/PYRE_ANALYSIS.md Outdated
Comment thread ai/prompts/PYRE_ANALYSIS.md
Comment thread ai/prompts/PYRE_ANALYSIS.md Outdated
Comment thread ai/prompts/PYRE_ANALYSIS.md Outdated
Comment thread ai/prompts/PYRE_ANALYSIS.md Outdated
Comment thread ai/prompts/PYRE_ANALYSIS.md Outdated
Comment thread graphistry/compute/hop.py
lmeyerov and others added 11 commits October 17, 2025 02:02
Dataset IDs were not being invalidated when encoding or metadata attributes
changed, causing stale ID reuse and incorrect server-side uploads.

**Problem**: After upload, modifying encodings (encode_point_color, bind with
point_*/edge_* params) or metadata (name, description) would reuse the old
dataset_id, sending outdated visualization configurations to the server.

**Root cause**: Invalidation was only implemented for DataFrame-changing
methods (.nodes(), .edges(), .graph()) but not for encoding/metadata methods.

**Fixed methods** (5 total, ~10 LOC):
- __encode() helper: Invalidates dataset_id after modifying _complex_encodings
  → Fixes all 7+ .encode_*() methods (encode_point_color, encode_edge_color,
    encode_point_size, encode_point_icon, encode_edge_icon, encode_point_badge,
    encode_edge_badge) with single choke-point fix
- encode_axis(): Invalidates dataset_id after modifying _complex_encodings
- name(): Invalidates dataset_id after modifying _name
- description(): Invalidates dataset_id after modifying _description
- bind(): Conditionally invalidates dataset_id when changing encodings
  → Preserves IDs when explicitly setting them via bind(dataset_id='...')
  → Invalidates when changing encodings via bind(point_color='...', etc.)

**Analysis approach**: Used Python AST analysis to identify all methods
modifying dataset-relevant attributes, confirming manual audit findings.

**Testing**: Added comprehensive test suite with 11 tests covering:
- 5 bug reproduction tests (now passing)
- 2 correct behavior tests (settings, bind with IDs)
- 2 file ID invalidation tests (already working)
- 2 real-world mixed scenario tests

**Impact**: Server-side uploads now correctly detect dataset changes,
preventing visualization artifacts from stale encodings and metadata.

**Documentation**: Added pyre-check setup guide (ai/prompts/PYRE_ANALYSIS.md)
and tool selection guidance (ai/README.md) demonstrating AST-based analysis
as practical alternative for custom code patterns.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Pyre-check hangs when analyzing graphistry/compute/hop.py due to the
hop() function's complexity (384 lines, nested loops, runtime TypeVar
resolution). Excluding this single file allows pyre to successfully
analyze the entire codebase.

Changes:
- Exclude compute/hop.py from pyre analysis in .pyre_configuration
- Include tests/ in pyre analysis (previously excluded)
- Add .pyre/ cache directory to .gitignore
- Document exclusion rationale in hop.py docstring
- Update PYRE_ANALYSIS.md with working configuration and quick start
- Add pyre investigation results to PLAN.md template

Results:
- Pyre now analyzes 3711 functions (including tests) in ~2.5 seconds
- Previously hung at 1631/1709 functions indefinitely
- No other files require exclusion

The hop() function exceeds pyre's hard-coded complexity limit for deep
analysis. Mypy handles this file without issues.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Both style() and addStyle() modify the _style attribute which affects
visualization rendering. These methods should invalidate dataset_id to
ensure proper server-side upload behavior.

Found by Pysa call graph analysis - these methods use indirect
modification via bind() that AST pattern matching missed.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Add 25 tests covering dataset_id and file_id invalidation logic:
- 11 core tests for basic invalidation scenarios
- 3 Pysa-discovered bug tests (style, addStyle, infer_labels)
- 8 parametric tests covering ALL encoding methods
- 3 bind() edge case tests (conditional invalidation)

Uses parametric testing (@pytest.mark.parametrize) to ensure all
encoding methods are tested and future methods are easily added.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Reduce from 373 to 65 lines by:
- Removing generic pyre documentation
- Focusing on PyGraphistry-specific commands
- Adding timeout debugging section
- Using $(pwd) instead of absolute paths
- Removing future work section

Addresses PR feedback for more concise, actionable documentation.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Add ai/assets/pysa_extract_callers.py for extracting caller
information from Pysa's call-graph.json output. Moved from gitignored
plans/ to committed ai/assets/ for team reuse.

Update .pyre_configuration with taint_models_path for Pysa analysis.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Remove Status line (project-specific findings) and redundant Pysa
Workflow section. Keep as reusable workflow guide focused on commands
and gotchas.

Addresses PR feedback on lines 3 and 25.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Reduce verbose 5-line explanation to concise 1-line note.
Still conveys essential information about pyre exclusion.

Addresses PR feedback on line 9.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Remove taint_models_path referencing plans/ (gitignored directory).
This path would break pyre for other developers.

Taint models are not needed for call graph analysis (documented
workflow in PYRE_ANALYSIS.md). Configuration now works on all systems.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Keep .pyre_configuration clean (no gitignored paths).
Document how users can optionally add taint models locally.

Taint models are experimental and user-specific, not committed.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Add numbered workflow and missing commands:
1. Generate call graph (existing)
2. Verify call graph generated (new)
3. Extract callers with single/multiple method examples (enhanced)

Makes the workflow clearer and more actionable.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
@lmeyerov lmeyerov force-pushed the fix/dataset-file-id-invalidation branch from aa5a814 to ac4dd54 Compare October 17, 2025 09:03
@lmeyerov lmeyerov marked this pull request as ready for review October 17, 2025 09:11
@lmeyerov lmeyerov merged commit 56ac00e into master Oct 17, 2025
69 checks passed
@lmeyerov lmeyerov deleted the fix/dataset-file-id-invalidation branch October 17, 2025 09:20
@lmeyerov lmeyerov mentioned this pull request Oct 17, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant