
Add pod_vector and mmap zero-copy support to generic_loader#11

Open
adamant-pwn wants to merge 1 commit into jermp:master from ratschlab:pr/pod-vector-mmap

Conversation

@adamant-pwn
Contributor

Summary

This PR adds pod_vector<T>, a dual-mode vector for POD types, and extends generic_loader with an optional mmap zero-copy deserialization path. Together, these allow applications that serialize data structures built on essentials to skip heap allocation and memcpy during loading by pointing directly into a memory-mapped file.

No behavioral change for existing users — pod_vector is a drop-in replacement for std::vector when used in the default owned mode.

Motivation

In Metagraph, SSHash graphs are serialized using the essentials visitor framework. For large indices (tens to hundreds of GB), deserialization is dominated by allocating vectors and copying data from disk into heap memory. By switching serialized members to pod_vector and using generic_loader::set_mmap(), the loader can set up non-owning views into a memory-mapped file instead, eliminating allocation and copy overhead entirely.

Benchmark results

We benchmarked metagraph query on graphs built with SSHash as the underlying representation, using a production-scale index to measure end-to-end impact. The times below reflect the full Metagraph query pipeline, not SSHash in isolation.

Setup:

  • Hardware: 2× Intel Xeon E5-2680 v4 @ 2.40 GHz (14 cores/socket, 56 threads total), 503 GB RAM, Samsung 7 TB NVMe SSD
  • SSHash graph: 95 GB on disk (~4 billion k-mers, k=31, built from 100 SRA metagenomic studies)
  • Succinct graph (control): 23 GB on disk (same k-mer set, BOSS representation)
  • Annotation: 19 GB BRWT row_diff
  • Query input: 5.6 MB FASTQ (~18,900 sequences)
  • Command: metagraph query -p 4 (i.e. only 4 cores were actually utilized)
| Graph | Mode | Load (s) | Map (s) | Anno (s) | Batch (s) | Total | Peak RSS |
|---|---|---|---|---|---|---|---|
| Succinct | no_mmap | 19.8 | 2.2 | 19.7 | 23.7 | 1:04 | 43.4 GB |
| Succinct | mmap cold | 2.5 | 7.8 | 26.6 | 36.2 | 0:42 | 42.4 GB |
| Succinct | mmap warm | 1.8 | 1.7 | 19.1 | 22.5 | 0:26 | 42.5 GB |
| SSHash | no_mmap | 121.5 | 0.6 | 16.5 | 19.0 | 2:28 | 115.1 GB |
| SSHash | mmap cold | 4.3 | 24.7 | 21.0 | 47.2 | 0:55 | 102.2 GB |
| SSHash | mmap warm | 1.2 | 1.0 | 17.2 | 19.5 | 0:25 | 103.6 GB |

Column definitions: Load = graph deserialization; Map = k-mer lookups into graph; Anno = BRWT row_diff annotation slice; Batch = Map + Anno + misc (timed query work); Total = wall clock including load.

Warm vs cold cache: "mmap cold" means the file was not in the OS page cache before the run (echo 3 > /proc/sys/vm/drop_caches). Pages are faulted in on demand during Map/Anno, so load is near-instant but batch work is slower. "mmap warm" means the file was already fully cached (e.g. from a prior run or cat > /dev/null), so page accesses hit RAM with no I/O stalls.

Key takeaways:

  • SSHash load time: 121.5 s → 1.2 s (+1.0 s Map) with warm mmap, 4.3 s (+24.7 s Map) cold
  • SSHash total query time: 2:28 → 0:25 with warm mmap
  • RSS reduction: 115.1 GB → 103.6 GB (~11 GB saved, since mmap'd pages aren't double-counted as heap)
  • Cold mmap penalizes batch time (page faults during Map phase) but still cuts total time in half vs no_mmap

RAM loading vs mmap — when each wins:
The benefit depends on the ratio of loading time to query work. With mmap, the graph is not fully resident at load time — pages are faulted in lazily as they are touched. For reasonably small queries (like the 5.6 MB FASTQ above), loading dominates the wall clock, so mmap wins decisively. For very large query batches that touch most of the graph repeatedly, the per-access overhead of page faults (cold) or kernel-mediated page table lookups (warm) can make the query phase itself slower than with a fully-materialized in-RAM copy. In the extreme case, SSHash warm mmap batch time (19.5 s) is comparable to no_mmap (19.0 s), and for cold mmap the batch time increases to 47.2 s. So mmap is most beneficial when many short queries are served from a long-lived process (warm cache, near-zero startup) or when fast startup matters more than peak throughput on a single giant batch. Note also that these benchmarks were run on NVMe SSD, where random page fault latency is low (~100 µs). On spinning disks (HDD) or network-attached storage, cold mmap performance would be substantially worse due to seek times, and RAM loading may be preferable unless the cache is warm.

Changes

pod_vector<T> (new)

A dual-mode vector that operates in either:

  • Owned mode (default): wraps std::vector<T>, full mutable API, drop-in replacement
  • View mode: holds a non-owning const T* pointer + size into externally managed memory (e.g., mmap'd region), with an optional std::shared_ptr<const void> owner to keep the backing memory alive

Key methods:

  • set_view(data, n, owner) — switch to view mode
  • clear() — resets to empty owned mode
  • swap() — works across modes and with std::vector<T>
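A rough sketch of this dual-mode design (illustrative only; names follow the PR description, not necessarily the actual implementation):

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Illustrative sketch of the dual-mode pod_vector described above.
template <typename T>
struct pod_vector {
    pod_vector() = default;  // owned mode by default

    // Switch to view mode: non-owning pointer into external memory,
    // with an optional owner keeping the backing memory (e.g. an mmap) alive.
    void set_view(const T* data, size_t n, std::shared_ptr<const void> owner = {}) {
        m_view = data;
        m_view_size = n;
        m_owner = std::move(owner);
        m_owned.clear();
    }

    // Reset to empty owned mode.
    void clear() {
        m_view = nullptr;
        m_view_size = 0;
        m_owner.reset();
        m_owned.clear();
    }

    bool is_view() const { return m_view != nullptr; }

    // Mutating API is only valid in owned mode.
    void push_back(const T& x) {
        assert(!is_view());
        m_owned.push_back(x);
    }

    size_t size() const { return is_view() ? m_view_size : m_owned.size(); }
    const T* data() const { return is_view() ? m_view : m_owned.data(); }
    const T& operator[](size_t i) const { return data()[i]; }

private:
    std::vector<T> m_owned;              // owned mode storage
    const T* m_view = nullptr;           // view mode pointer
    size_t m_view_size = 0;
    std::shared_ptr<const void> m_owner; // optional lifetime anchor
};
```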

is_pod_vector<T> / is_pod_vector_v<T> (new)

Type trait for compile-time detection of pod_vector types, used to enable the mmap fast path.
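A plausible shape for such a trait (a sketch; the forward declaration stands in for the PR's class):

```cpp
#include <type_traits>
#include <vector>

template <typename T> struct pod_vector;  // the PR's class, declaration only

// False in general...
template <typename T>
struct is_pod_vector : std::false_type {};

// ...true for any pod_vector specialization.
template <typename T>
struct is_pod_vector<pod_vector<T>> : std::true_type {};

template <typename T>
inline constexpr bool is_pod_vector_v = is_pod_vector<T>::value;

static_assert(is_pod_vector_v<pod_vector<int>>);
static_assert(!is_pod_vector_v<std::vector<int>>);
```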

generic_loader::set_mmap() (new)

void set_mmap(const uint8_t* base, size_t size,
              std::shared_ptr<const void> owner = {});

Call before visit() to enable zero-copy loading. When set, visit(pod_vector<T>&) computes the file offset from the stream position and calls set_view() pointing directly into the mmap'd region, then seeks the stream past the data. Non-pod_vector members (plain PODs, std::vector) continue using the normal read path.
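The offset arithmetic of that fast path can be sketched as follows (illustrative only: mmap_loader, visit_zero_copy, and the length-prefixed layout are assumptions for the example, not the actual essentials API):

```cpp
#include <cassert>
#include <cstdint>
#include <istream>
#include <sstream>
#include <string>

// Sketch of the zero-copy branch: compute the payload offset from the
// current stream position, point into the mapped region, and seek past
// the data instead of reading it.
struct mmap_loader {
    std::istream& is;     // same stream the loader reads headers from
    const uint8_t* base;  // start of the memory-mapped file

    template <typename T>
    void visit_zero_copy(const T*& out_data, uint64_t& out_size) {
        uint64_t n = 0;
        is.read(reinterpret_cast<char*>(&n), sizeof(n));       // length prefix
        std::streamoff offset = is.tellg();                    // payload starts here
        out_data = reinterpret_cast<const T*>(base + offset);  // no copy
        out_size = n;
        is.seekg(offset + std::streamoff(n * sizeof(T)));      // skip past payload
    }
};
```

The key point is that the stream is only used for bookkeeping (position and skipping); the payload bytes are never read or copied.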

generic_saver / sizer refactoring

  • Merged the std::vector and pod_vector visit overloads into a single private visit_seq() template in each of generic_loader, generic_saver, and sizer
  • Removed standalone save_vec() function (its logic is now in generic_saver::visit_seq())

contiguous_memory_allocator

  • Added visit(pod_vector<T>&) overload so pod_vector members can be loaded via the contiguous allocator path as well

Usage example

// Application-side: open file with mmap, pass context to loader
auto [base, size] = mmap_file("data.bin");
std::ifstream is("data.bin", std::ios::binary);

essentials::generic_loader loader(is);
loader.set_mmap(base, size);  // enable zero-copy

my_data_structure ds;
ds.visit(loader);  // pod_vector members now point into mmap'd memory
// ds is usable immediately, no heap copies made for pod_vector fields

@adamant-pwn
Contributor Author

adamant-pwn commented Mar 4, 2026

Disclaimer: as you may guess, most of the changes, the experiments, and the lengthy description above were authored by @copilot (mostly the Claude Opus 4.6 model) under my close supervision, so I tried my best to make sure it's informative and useful.

Note: This PR (and its companion in bits) focuses on providing the means to use mmap with the loader, but doesn't incorporate mmap-based loading in SSHash itself, as we directly use loader to load dictionary in Metagraph.

You might want to experiment a bit with directly using mmap-based loading in the SSHash binary itself (I assume tools/common.hpp is the appropriate place for this, or maybe adding some parameters to essentials::load) 😊

@jermp
Owner

jermp commented Mar 5, 2026

Nice, thanks Oleksandr! This is actually something that I wanted to do sooner or later; thanks for pushing it.
I would definitely use it for both SSHash and Fulgor then.

I'll review the changes soon. Now running to a lecture :)

@jermp
Owner

jermp commented Mar 5, 2026

Hi @adamant-pwn, I've read the code. Thanks again!

Actually, I don't like the fact that pod_vector operates in a dual mode, i.e., that it wraps a std::vector when the memory is owned... because that adds an explicit if everywhere in the code, which is not elegant.

It would be better to write a class vector that holds a T*; it should not care where this pointer points -- it could be on the heap or coming from a memory-mapped region on disk. The only trick would be, in the case where we have to "steal" the memory from a std::vector, to extend the lifetime of the vector with a shared_ptr.

What do you think?

@adamant-pwn
Contributor Author

adamant-pwn commented Mar 5, 2026

Hi Giulio!

The way pod_vector is currently implemented, you can use the typical dynamic operations (resize, push_back, etc.) that are used in other parts of the code directly on the structure while it is in heap mode (so, typically during construction). It looks like we would need more refactoring if we want to do these operations on a std::vector and only then give its data up for stealing, which I wouldn't really see as an improvement.

I also don't really like keeping a std::vector in a shared_ptr just to extend its lifetime after it is passed by &&, as it feels like a hacky solution, and there would be no other use for such a member if the class is implemented the way you suggest.

If by explicit if you mean the is_view checks when accessing members: I didn't notice any performance degradation in the non-mmap case, and I generally expect branch prediction to handle a check of a pointer against nullptr fairly well.

But if you really don't want to have them and want a more systematic solution: Metagraph uses int_vector.hpp from sdsl, which just stores a T* and a shared_ptr to an mmap_context to keep the mapping alive. int_vector manages memory directly via malloc and implements all dynamic operations properly. So doing something like this, or incorporating int_vector.hpp, could be an alternative if you prefer it that way. But I don't really think anything warrants writing a full-on custom vector yet.

On a side note: C++20 has std::span, with functionality similar to what you described, but it would unfortunately also require external lifetime management for the vectors...

@adamant-pwn
Contributor Author

adamant-pwn commented Mar 5, 2026

I thought about it a bit more. I think shared_ptr<std::vector> is a bad member for a vector-like class, but it would be appropriate and expected if the class were called something like owning_span, with a shared_ptr<T[]> to store the data. This way, we could also use aliasing to combine the potential shared_ptr<vector>, shared_ptr<mmap_owner>, and the T* itself into one member, which would be pretty good.

The only question then is whether we want to do all the construction in std::vector and only then move the data into owning_span. Now that I look at the code again, the changes needed look fairly minimal.

…with optional ownership

owning_span<T> uses shared_ptr<const T[]> with aliasing constructor for
zero-branch const T* access under three ownership models:
1. Heap-owned: constructed from any contiguous range (vector, etc.)
2. Externally-owned: raw pointer + shared_ptr owner (e.g. mmap)
3. Unowned: raw pointer without owner

Changes to generic_loader:
- mmap path: zero-copy view into mapped memory (POD types only)
- non-mmap path: always reads into std::vector<T> then moves into Vec
- Unified visit_seq handles both owning_span and std::vector
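The three ownership models described in the commit message can be sketched roughly as follows (illustrative; the names mirror the commit message, not the exact code):

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

// Sketch of owning_span: one shared_ptr<const T[]> member, built with the
// aliasing constructor, gives branch-free const T* access under all three
// ownership models.
template <typename T>
struct owning_span {
    owning_span() = default;

    // 1. Heap-owned: steal a std::vector, keeping it alive via shared_ptr.
    explicit owning_span(std::vector<T>&& v) {
        auto keeper = std::make_shared<std::vector<T>>(std::move(v));
        // Aliasing constructor: share ownership with keeper, point at its data.
        m_data = std::shared_ptr<const T[]>(keeper, keeper->data());
        m_size = keeper->size();
    }

    // 2. Externally-owned: raw pointer plus an owner (e.g. an mmap context).
    owning_span(const T* p, size_t n, std::shared_ptr<const void> owner)
        : m_data(std::move(owner), p), m_size(n) {}

    // 3. Unowned: raw pointer, no lifetime management (empty control block).
    owning_span(const T* p, size_t n)
        : m_data(std::shared_ptr<const T[]>(), p), m_size(n) {}

    const T* data() const { return m_data.get(); }
    size_t size() const { return m_size; }
    const T& operator[](size_t i) const { return m_data[i]; }

private:
    std::shared_ptr<const T[]> m_data;  // owner and pointer may be unrelated
    size_t m_size = 0;
};
```

In every mode, element access goes through the same stored pointer, so there is no is_view branch on the hot path; only construction and destruction care who owns the memory.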
@adamant-pwn
Contributor Author

I updated the PR (and also jermp/bits#12) with a version of owning_span; it should be aligned with what you suggested.

Will also check if it still works soon 😅

@jermp
Owner

jermp commented Mar 5, 2026

Hi Oleksandr, indeed my idea was to have a private member m_life of type shared_ptr<T[]>.
Using the aliasing constructor is even cleaner! I like this current solution very much.
Don't you agree it's very elegant?

Regarding the construction -- yes, I was thinking exactly that: to use a std::vector and finally steal its data with the view, hence making it non-modifiable (which is ok, since bits is all about static data structures).

@adamant-pwn
Contributor Author

Yes, I also think the current version is pretty good!

indeed my idea was to have a private member m_life of type shared_ptr<T[]>.

Sorry, I probably misunderstood your suggestion at first then 🙂

@jermp
Owner

jermp commented Mar 5, 2026

No worries! I'll probably test the code tomorrow. (Did you write any tests?) Thanks a lot!

@adamant-pwn
Contributor Author

adamant-pwn commented Mar 5, 2026

Not for the owning_span version in particular yet, but I'm generally working on testing and benchmarking various scenarios with different query sizes on an SSHash-based DBG, so hopefully I'll also have a chance to run it against our graphs tomorrow.

@jermp
Owner

jermp commented Mar 5, 2026

Beautiful!
