
Arrow IPC Streaming Cache Implementation #35

Merged: kmacrow merged 3 commits into master from arrow-cache-v1 on Sep 25, 2025

@kmacrow (Collaborator) commented Sep 24, 2025

What is this?

Pontoon has an internal abstraction called a cache. The cache is a data-store-agnostic record store for data that is being read from a source connector and will be written somewhere by a destination connector. We've used Arrow in Pontoon from day one as an intermediate, language-independent in-memory format, but we haven't been leveraging its serialization capabilities. Instead, we've used different cache implementations to translate between Arrow and various on-disk formats. The default for a while now has been a SQLite-backed cache, which has performed surprisingly well but has significant limitations for very large and very wide data.

  • Replaces SqliteCache with ArrowIpcCache, which uses columnar Arrow IPC streaming serialization
  • Uses the Arrow streaming format (.arrows) to store record streams in transit between source and destination connectors
  • Uses Arrow's native Python type translation by default
  • Memory-maps stream files for extremely fast reads
  • Increases records-per-second (RPS) throughput by between 3x and 50x with very low, bounded resident memory overhead
  • Achieves upwards of 1M RPS on large datasets with wide schemas

Why make the caching layer faster?

  • Although the Arrow cache is faster even for small streams, the real benefit is scalability: it handles wider schemas and higher record volumes far more efficiently than SqliteCache
  • This gives us a much more scalable core that works well for small workloads and really shines with very large, complex datasets

Benchmark summary

Write performance vs SqliteCache:

  • Optimized streaming mode: up to 5.2x faster in the best cases

Read performance vs SqliteCache:

  • Overall: 3-57x faster, depending on schema
  • Wide schemas: up to 57x faster
  • Mixed schemas: consistently 3-4x faster

@kmacrow kmacrow merged commit 6cb1833 into master Sep 25, 2025
3 checks passed
@kmacrow kmacrow deleted the arrow-cache-v1 branch September 25, 2025 16:13