Improving buffering of SyncedCollections #454
**Buffering strategy:** I agree that your benchmark of executing many small edits to a single document should be faster using the document-specific buffer. However, in my understanding, the use case for buffering in signac-flow is iterating over thousands of state points in multiple loops while attempting to avoid re-loading. That is, the benchmark of many small modifications may not be an appropriate match for the use case of mostly reading over a large data space. I might see the benefit if we used this approach for a project document with many status updates, but probably not at the job level.

**Buffering serialization:** I agree it would be nice if we used another approach instead of recursive conversions with JSON serialization/deserialization. I think that's worth testing, regardless of whether we use global or file-local buffering. The only catch is that (depending on implementation) we might not be able to detect whether the contents are JSON-encodable until the point where the buffer writes. That would be a lazy validation approach and might still be valid. It may also be possible to ensure JSON-encodability only on the delta of data being updated in the buffer, while actually storing it in a form that is fast to serialize/deserialize.
You're right that I probably chose an unfair access pattern. I've proposed a solution in #462 that includes more appropriate benchmarks; let's continue the discussion there. Regarding validation, this isn't an issue because we still run our validators (irrespective of how the data is saved), so as long as the validators are comprehensive we don't have a problem.
Closed in #462.
I've been aware of this issue for a while, but I haven't had a chance to benchmark it until working on #453, so I'm just getting around to documenting it now. While the examples below are for the `JSONDict` class in master, I only plan to implement any future changes in the new SyncedCollection classes, since those are nearing completion.
Currently, while our buffering strategy substantially improves performance on slow filesystems, it isn't particularly useful on fast filesystems (see #453). While it's not surprising that filesystem speed plays a big role, the fact that buffering can actually slow down our code on a fast filesystem shows how much overhead our method incurs. Benchmarking the buffered behavior on a fast filesystem reveals that the bulk of the time is spent on JSON serialization and deserialization (which has to happen on every operation), as well as on updating the in-memory representation to match, which requires a recursive check of the entire dictionary's contents. These operations are extremely expensive. We can easily test this with the current `JSONDict` using the instance-level buffering as currently implemented, which basically just disables all synchronization. Consider the following two snippets:

On my machine, Option 2 is ~100x faster. The reason is that while both involve only one write to disk, Option 1 also has to update the cached data every time and then reload it prior to the call to `d._data.__setitem__`. That indicates to me that we need to seriously reconsider our buffering strategy.

The basic buffering strategy employed by the per-instance buffering has two problems:

1. It must remain safe if the underlying files are modified outside the buffer.
2. It must bound the total memory consumed by buffered data.
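The two access patterns being compared can be sketched with a self-contained stand-in. `ToyJSONDict` below is a hypothetical simplification, not signac's `JSONDict`; it only reproduces the relevant behavior of syncing to disk on every write.

```python
import json
import os
import tempfile


class ToyJSONDict:
    """Hypothetical stand-in for JSONDict: syncs to disk on every write."""

    def __init__(self, filename):
        self._filename = filename
        self._data = {}

    def __setitem__(self, key, value):
        # Option 1 behavior: serialize, write, and reload on every operation.
        self._data[key] = value
        with open(self._filename, "w") as f:
            json.dump(self._data, f)
        with open(self._filename) as f:
            self._data = json.load(f)


tmpdir = tempfile.mkdtemp()

# Option 1: every assignment goes through the synced interface.
d1 = ToyJSONDict(os.path.join(tmpdir, "opt1.json"))
for i in range(100):
    d1[str(i)] = i  # 100 serialize/write/reload cycles

# Option 2: mutate the backing dict directly, then write once at the end.
d2 = ToyJSONDict(os.path.join(tmpdir, "opt2.json"))
for i in range(100):
    d2._data[str(i)] = i  # plain dict insertion, no I/O
with open(d2._filename, "w") as f:
    json.dump(d2._data, f)

# Both approaches leave identical data on disk; only the I/O and
# serialization counts differ.
with open(d1._filename) as f:
    final1 = json.load(f)
with open(d2._filename) as f:
    final2 = json.load(f)
assert final1 == final2
```

The large speed gap comes entirely from Option 1's repeated serialization and reload work, since the final on-disk state is the same either way.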
One possible solution is to observe that in many use cases, client code (e.g. signac-flow) may be able to guarantee (or at least assume that the user is staying within safe usage patterns) that (1) is not an issue, leaving only (2) to consider. We might then be able to get away with "buffering" by simply making in-memory modifications to collections without writing to any centralized buffer at all. In that case, we could use a simpler metric of total process resource usage, such as the number of distinct collections that have been changed during buffering, and force-write all of them when that number gets too large. This feature would be relatively easy to add to the new buffering framework for SyncedCollections.
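The force-write-on-threshold metric described above can be sketched as follows. The `DirtyTracker` class and its method names are hypothetical, chosen only to illustrate the counting logic.

```python
class DirtyTracker:
    """Sketch: count distinct modified collections, force-flush at a limit."""

    def __init__(self, max_dirty=1000):
        self.max_dirty = max_dirty
        self._dirty = set()  # collections changed since the last flush
        self.flush_count = 0

    def register_change(self, collection_id):
        self._dirty.add(collection_id)
        if len(self._dirty) >= self.max_dirty:
            self.flush_all()

    def flush_all(self):
        # A real implementation would write each dirty collection to disk.
        self.flush_count += 1
        self._dirty.clear()


tracker = DirtyTracker(max_dirty=10)
for i in range(25):
    tracker.register_change(f"collection-{i}")
# 25 distinct collections with a threshold of 10 -> two forced flushes,
# with 5 collections still pending in memory.
assert tracker.flush_count == 2
assert len(tracker._dirty) == 5
```

Because the metric counts distinct collections rather than individual edits, a loop making many small edits to one document never triggers a premature flush, which matches the mostly-reading use case described above.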
A much smaller change would be to replace JSON encoding with pickling, which should be safe within the context of buffering (no security issues, since the data is purely internal and in-memory, with no external files being read) and substantially faster. I'm open to other solutions, but given the massive performance implications I think we need to consider employing some alternative buffering protocol.
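The proposed swap can be illustrated directly: pickle round-trips the buffered data without requiring JSON-encodable types. This is a behavioral sketch, not a benchmark of signac itself.

```python
import json
import pickle

# A nested payload similar in shape to a job document.
payload = {"params": {"a": [1, 2, 3], "b": {"c": 4.5}}, "status": "ok"}

# JSON round-trip: what the current buffer effectively does per operation.
json_copy = json.loads(json.dumps(payload))

# Pickle round-trip: the proposed replacement. Safe here because the bytes
# never leave the process, so untrusted-input concerns don't apply.
pickle_copy = pickle.loads(pickle.dumps(payload))

assert json_copy == pickle_copy == payload

# Pickle also preserves types JSON cannot encode, e.g. sets:
extra = {"tags": {"x", "y"}}
assert pickle.loads(pickle.dumps(extra)) == extra
try:
    json.dumps(extra)
except TypeError:
    pass  # sets are not JSON-encodable
```

One consequence of this swap: since the pickled buffer no longer validates JSON-encodability as a side effect, that check would have to happen elsewhere (e.g. via the validators, or lazily when the buffer finally writes to the JSON file).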