
Improving buffering of SyncedCollections #454

Closed
vyasr opened this issue Dec 30, 2020 · 3 comments

Comments

@vyasr
Contributor

vyasr commented Dec 30, 2020

I've been aware of this issue for a while, but I hadn't had a chance to benchmark it until working on #453, so I'm only now getting around to documenting it. While the examples below are for the JSONDict class in master, I only plan to implement any future changes in the new SyncedCollection classes, since those are nearing completion.

Currently, while our buffering strategy substantially improves performance on slow filesystems, it isn't particularly useful on fast filesystems (see #453). While it's not surprising that filesystem speed plays a big role, the fact that buffering can actually slow down our code on a fast filesystem shows how much overhead our method incurs.

Benchmarking the buffered behavior on a fast filesystem reveals that the bulk of the time is spent on JSON serialization and deserialization (which has to happen on every operation), as well as on updating the in-memory representation to match, which requires a recursive check of the entire dictionary's contents. These operations are extremely expensive. We can easily test this with the current JSONDict using instance-level buffering as currently implemented, which essentially just disables all synchronization. Consider the following two snippets:

# Assuming signac's pre-2.0 module layout for these imports:
from signac.core.jsondict import JSONDict, buffer_reads_writes

d = JSONDict('foo.json')

# Option 1: global buffering via the buffer_reads_writes context manager
with buffer_reads_writes():
    for i in range(1000):
        d[str(i)] = i

# Option 2: instance-level buffering, which disables synchronization
with d.buffered() as db:
    for i in range(1000):
        db[str(i)] = i

On my machine, Option 2 is ~100x faster. The reason is that while both of these involve only one write to disk, Option 1 also has to update the cached data on every operation and then reload it prior to the call to d._data.__setitem__. That indicates to me that we need to seriously reconsider our buffering strategy.
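To isolate the serialization cost from any filesystem effects, here is a self-contained micro-benchmark (pure json on plain dicts, not signac's actual classes) that mimics the two access patterns: re-serializing the whole dict on every write, versus mutating in memory and serializing once at the end:

```python
import json
import time

N = 1000

# Pattern 1: serialize/deserialize the full dict on every write, which is
# roughly what the global buffer does per operation.
data = {}
start = time.perf_counter()
for i in range(N):
    data[str(i)] = i
    encoded = json.dumps(data)   # update the cached (buffered) copy
    data = json.loads(encoded)   # reload before the next operation
t1 = time.perf_counter() - start

# Pattern 2: mutate in memory, serialize once at the end.
data = {}
start = time.perf_counter()
for i in range(N):
    data[str(i)] = i
encoded = json.dumps(data)
t2 = time.perf_counter() - start

print(f"per-operation sync: {t1:.4f}s, single flush: {t2:.4f}s")
```

Because Pattern 1 re-encodes a growing dict on every iteration, its total work is quadratic in N, while Pattern 2 is linear.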

The basic buffering strategy employed by the per-instance buffering has two problems:

  1. It offers no protection against external modifications of the underlying file, nor does it have any way of synchronizing between multiple instances of the file.
  2. It has no concept of a maximum buffer size. This is fine for a single object: no matter how big it is, we have to write it out at some point, and since JSON has no concept of incremental changes, the entire file is dumped at once anyway. However, this may not be acceptable if we're opening a large number of files and modifying all of them; in addition to the large amount of actual file modification, there's the overhead of opening and closing many file handles, and some filesystems might choke on a huge number of writes.

One possible solution starts from the observation that in many use cases, client code (e.g. signac-flow) may be able to guarantee (or at least assume that the user is staying within safe usage) that (1) is not an issue. Then we only have to consider (2), and we might be able to get away with "buffering" by just making in-memory modifications to collections without writing to any centralized buffer at all. We could use a simple metric of total process resource usage, such as the number of distinct collections that have been modified during buffering, and force-write all of them when that number gets too large. This feature would be relatively easy to add to the new buffering framework for SyncedCollections.

A much smaller change would be to replace JSON encoding with pickling, which should be safe within the context of buffering (no security issues since it's purely internal/in-memory with no external files being read) and substantially faster. I'm open to other solutions, but given the massive performance implications I think that we need to consider employing some alternative buffering protocol.
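For a sense of the difference, here is a quick round-trip comparison of the two encodings on nested builtin data. Pickle is typically faster here, though the exact ratio depends on the data shape and Python version, so no specific speedup is claimed:

```python
import json
import pickle
import timeit

# Nested dict of builtins, loosely resembling a collection of documents.
payload = {str(i): {"x": [i, i + 1], "y": float(i)} for i in range(500)}

t_json = timeit.timeit(lambda: json.loads(json.dumps(payload)), number=100)
t_pickle = timeit.timeit(
    lambda: pickle.loads(pickle.dumps(payload)), number=100
)
print(f"json round-trip: {t_json:.3f}s, pickle round-trip: {t_pickle:.3f}s")
```

As noted above, pickling is only safe here because the bytes never leave process memory; pickle data from an external file should never be trusted.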

@bdice
Member

bdice commented Dec 31, 2020

Buffering strategy: I agree that your benchmark of executing many small edits to a single document should be faster using the document-specific buffer. However, in my understanding, the use case for buffering in signac-flow is iterating over thousands of state points in multiple loops, and attempting to avoid re-loading. That is, the benchmark of many small modifications may not be an appropriate match for the use case of mostly-reading over a large data space. I might see the benefit if we used this approach for a project document with many status updates, but probably not at the job-level.

Buffering serialization: I agree it would be nice if we used another approach instead of recursive conversions with JSON serialization/deserialization. I think that's worth testing, regardless of whether we use global or file-local buffering. The only catch is that (depending on implementation) we might not be able to detect whether the contents are possible to JSON encode until the point where the buffer writes. That would be a lazy validation approach and might still be valid. It may also be possible to ensure JSON-encodability only on the delta of data being updated in the buffer, but actually store it in a form that is fast to serialize/deserialize.
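The lazy-validation idea could be sketched like this (LazyBuffer is a hypothetical name, not anything in signac): stage writes without checking them, and let json.dumps raise at flush time if something non-encodable slipped in:

```python
import json


class LazyBuffer:
    """Stage writes cheaply; validate JSON-encodability only on flush."""

    def __init__(self):
        self._staged = {}  # filename -> in-memory data

    def stage(self, filename, data):
        # No validation here; accepting the write is a dict assignment.
        self._staged[filename] = data

    def flush(self):
        for filename, data in self._staged.items():
            # Validation happens only now: json.dumps raises TypeError
            # for values that cannot be JSON-encoded.
            encoded = json.dumps(data)
            with open(filename, "w") as f:
                f.write(encoded)
        self._staged.clear()
```

The trade-off is exactly the one described above: errors surface late, at flush time, rather than at the point where the bad value was written.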

@vyasr
Contributor Author

vyasr commented Jan 1, 2021

You're right that I probably chose an unfair access pattern. I've proposed a solution in #462 that includes more appropriate benchmarks; let's continue the discussion there.

Regarding the validation, this isn't an issue because we still run our validators (irrespective of how the data is saved), so as long as the validators are comprehensive we don't have a problem.

@vyasr
Contributor Author

vyasr commented Jan 6, 2021

Closed in #462.

@vyasr vyasr closed this as completed Jan 6, 2021