
Implements an in-memory buffer for SyncedCollections #462

Merged

Conversation


@vyasr commented Jan 1, 2021

Description

In case readers want to skip to them, benchmarks are at the bottom of the motivation section.
Here's the notebook I used to run the benchmarks (saved as a text file because GitHub doesn't accept any other format; just rename it to .ipynb and it will load fine).

This PR implements an alternative to the current buffering strategy used by SyncedCollections (and by the JSONDict in signac master). Rather than serializing and deserializing the data into a JSON-encoded buffer on every operation (a strategy that saves on I/O costs but still has to do a lot of work), it allows all SyncedCollections pointing to a given file to share their internal data directly, so that any operation on one is transparently reflected by the others. In addition to skipping serialization, this method avoids the need for recursive updates of nested collections, which can be expensive. In a sense it's no longer true buffering: we're not simulating a write by serializing the data in the same manner as a true write to an intermediate buffer; we simply stop at the data stored in memory.

Motivation and Context

The current buffering implementation is quite expensive. Although it largely avoids I/O operations, it maintains a fully JSON-encoded buffer of all the data associated with existing SyncedCollection objects. That design has a few notable benefits relative to the approach in this PR:

  1. Data in the buffer can be directly written to disk.
  2. JSON-encoded data can be hashed to perform relatively cheap equality checks.
  3. The total buffer size is trivially the sum of all JSON-encoded bytes in the buffer, so a buffer capacity can be set explicitly to control buffer flushes.

However, this approach also has significant performance implications:

  1. Every write operation requires a JSON encoding step.
  2. Every read requires decoding the encoded JSON.
  3. Every read requires updating the collection's internal data store to match the one in the buffer, which requires an expensive recursive traversal of the (possibly nested) data structure. (These three costs are sketched below.)
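
For concreteness, here is a minimal sketch of what the old JSON-encoded buffer amounts to. The names are hypothetical, not the actual signac implementation (though `_update` is the real recursive-update method on SyncedCollections):

```python
import json

_json_buffer = {}  # hypothetical: filename -> JSON-encoded bytes


def buffered_write(filename, data):
    # Cost 1: every write re-encodes the entire collection to JSON.
    _json_buffer[filename] = json.dumps(data).encode()


def buffered_read(filename, collection):
    # Cost 2: every read decodes the encoded JSON...
    decoded = json.loads(_json_buffer[filename])
    # Cost 3: ...and recursively updates the collection's internal data
    # store to match the buffer, traversing any nested structure.
    collection._update(decoded)
```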

The approach in this PR circumvents these three performance bottlenecks by making all collections synced with a particular file directly share their underlying data attribute (an object of the "base type" that contains the data, e.g. a dict for a SyncedAttrDict). Since Python collections are shared by reference, any change made through one instance is transparently visible to all the others.
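
As a minimal illustration of this sharing (a toy sketch, not the signac implementation; `TinySyncedDict` and `_shared_data` are hypothetical names):

```python
# All instances synced with the same file receive the *same* dict object,
# so a mutation through one is immediately visible through the others.
_shared_data = {}  # hypothetical: one in-memory store per filename


class TinySyncedDict:
    def __init__(self, filename):
        self._data = _shared_data.setdefault(filename, {})

    def __setitem__(self, key, value):
        self._data[key] = value  # pure in-memory write, no JSON encoding

    def __getitem__(self, key):
        return self._data[key]  # no decoding or recursive update needed


d1 = TinySyncedDict("signac_statepoint.json")
d2 = TinySyncedDict("signac_statepoint.json")
d1["a"] = 1
assert d2["a"] == 1  # the change is transparently reflected in d2
```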

There are some tradeoffs with this approach when considering the advantages of the old approach listed above, but they can be mitigated to some extent:

  1. While data in the buffer cannot be directly written to disk, this just means that the cost of JSON encoding is deferred until the buffer flush occurs; it does not change the total amount of work to be done.
  2. The equality checks are primarily a way to determine whether the buffered data has changed from the version on disk, which indicates that it needs to be written. The new approach can't check this directly without performing a dict equality check, which I have not implemented but would expect to be slow. However, we can use a simpler approach (currently implemented) that just records whether any mutating operation (e.g. setting an element) has been performed; if only read operations occur, then there's no need to write back to disk. The only scenario where this method is slower than the old approach is code that sets elements of a collection to their exact original values, which I don't think we need to optimize for. If that use case truly becomes a bottleneck (e.g. in status updates in flow), it's simple enough to add a conditional check in client code that verifies whether the value will actually change, e.g. if value != synced_dict[key]: synced_dict[key] = value.
  3. Although we can no longer calculate the buffer size exactly, we can approximate it. The crudest technique I can think of is to assume that a collection's size won't grow by more than some factor, say an order of magnitude, during one instance of buffering. We already use stat calls to get file metadata when a collection first enters the buffer, so we could also record the original file size (which is also provided by os.stat) and use it as a baseline for how much can be inserted before a buffer flush is necessary (see the sketch after this list). I'm open to other suggestions for this as well.
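
A rough sketch of the mitigations for points 2 and 3 (again with hypothetical names; the actual bookkeeping in this PR may differ):

```python
import os

BUFFER_CAPACITY = 32 * 1024 * 1024  # bytes; assumed configurable
_buffer_size_estimate = 0
_modified = set()  # filenames whose data mutated while buffered


def flush_buffer():
    """Write modified entries back to disk, then reset the bookkeeping."""
    global _buffer_size_estimate
    _modified.clear()
    _buffer_size_estimate = 0


def mark_modified(filename):
    # Called from every mutating operation (e.g. __setitem__); read-only
    # access never sets the flag, so a flush can skip unchanged files.
    _modified.add(filename)


def register_in_buffer(filename):
    """Estimate a file's in-memory footprint when it first enters the buffer."""
    global _buffer_size_estimate
    size = os.stat(filename).st_size  # original on-disk size from os.stat
    _buffer_size_estimate += 10 * size  # assume at most ~10x growth
    if _buffer_size_estimate > BUFFER_CAPACITY:
        flush_buffer()
```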

The performance payoff is shown in the table below. To get a sense of how much I/O contributes to the performance characteristics of buffered mode, I tested on three separate file systems, benchmarking both read (d[key]) and write (d[key] = value) operations. To capture the balance between in-memory operations and disk reads/writes, each benchmark performs 1000 total operations distributed over different numbers of files: for example, some benchmarks read 1000 values from the same JSON file, while others read 1 value from each of 1000 different files (a simplified sketch of this structure follows). Note that there's some significant variability in the results because I only ran 10 loops per benchmark, but I think the numbers are accurate enough for the purposes of this PR.
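
Roughly, each benchmark has the shape sketched below. Plain dicts and contextlib.nullcontext stand in for the real synced collections and buffering context manager, so this is illustrative only, not the actual notebook code:

```python
import time
from contextlib import nullcontext


def benchmark_setitem(num_dicts, elements_per_dict, buffered=nullcontext):
    """Time num_dicts * elements_per_dict total writes (always 1000 here)."""
    dicts = [dict() for _ in range(num_dicts)]  # stand-ins for JSONDicts
    start = time.perf_counter()
    with buffered():  # stand-in for the buffering context under test
        for d in dicts:
            for i in range(elements_per_dict):
                d[f"key_{i}"] = i  # the "setitem" operation being measured
    return time.perf_counter() - start


# 1000 total operations per configuration, as in the table below:
for n, k in [(1, 1000), (10, 100), (100, 10), (1000, 1)]:
    print(f"{n} dicts x {k} elements: {benchmark_setitem(n, k):.6f} s")
```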

| System | Operation | Number of dicts | Elements per dict | Speed multiplier |
| --- | --- | --- | --- | --- |
| Mac SSD | setitem | 1 | 1000 | 24.72 |
| Mac SSD | setitem | 10 | 100 | 21.79 |
| Mac SSD | setitem | 100 | 10 | 12.19 |
| Mac SSD | setitem | 1000 | 1 | 3.97 |
| Mac SSD | getitem | 1 | 1000 | 182.96 |
| Mac SSD | getitem | 10 | 100 | 187.33 |
| Mac SSD | getitem | 100 | 10 | 166.74 |
| Mac SSD | getitem | 1000 | 1 | 98.47 |
| Linux NFS | setitem | 1 | 1000 | 16.39 |
| Linux NFS | setitem | 10 | 100 | 7.58 |
| Linux NFS | setitem | 100 | 10 | 2.51 |
| Linux NFS | setitem | 1000 | 1 | 1.33 |
| Linux NFS | getitem | 1 | 1000 | 133.48 |
| Linux NFS | getitem | 10 | 100 | 138.15 |
| Linux NFS | getitem | 100 | 10 | 115.78 |
| Linux NFS | getitem | 1000 | 1 | 72.95 |
| Linux scratch | setitem | 1 | 1000 | 22.35 |
| Linux scratch | setitem | 10 | 100 | 20.48 |
| Linux scratch | setitem | 100 | 10 | 11.93 |
| Linux scratch | setitem | 1000 | 1 | 3.25 |
| Linux scratch | getitem | 1 | 1000 | 143.59 |
| Linux scratch | getitem | 10 | 100 | 141.50 |
| Linux scratch | getitem | 100 | 10 | 116.29 |
| Linux scratch | getitem | 1000 | 1 | 71.02 |

Types of Changes

  • Documentation update
  • Bug fix
  • New feature
  • Breaking change¹

¹ The change breaks (or has the potential to break) existing functionality.

Checklist:

If necessary:

  • I have updated the API documentation as part of the package doc-strings.
  • I have created a separate pull request to update the framework documentation on signac-docs and linked it here.
  • I have updated the changelog and added all related issue and pull request numbers for future reference (if applicable). See example below.


codecov bot commented Jan 1, 2021

Codecov Report

Merging #462 (bb67d6f) into feature/synced_collections (0b2bbcf) will increase coverage by 0.38%.
The diff coverage is 92.03%.


@@                      Coverage Diff                       @@
##           feature/synced_collections     #462      +/-   ##
==============================================================
+ Coverage                       77.85%   78.24%   +0.38%     
==============================================================
  Files                              56       57       +1     
  Lines                            6575     6688     +113     
  Branches                         1232     1251      +19     
==============================================================
+ Hits                             5119     5233     +114     
+ Misses                           1155     1152       -3     
- Partials                          301      303       +2     
| Impacted Files | Coverage Δ |
| --- | --- |
| ...ore/synced_collections/file_buffered_collection.py | 94.54% <ø> (+9.09%) ⬆️ |
| ...e/synced_collections/memory_buffered_collection.py | 90.90% <90.90%> (ø) |
| signac/core/synced_collections/collection_json.py | 98.86% <100.00%> (+0.21%) ⬆️ |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data


bdice commented Jan 1, 2021

@vyasr The concept seems sound. I haven't reviewed the code carefully, but your description of the feature and the large speedups indicate this is probably a very good idea. The benchmarks for the slow network filesystems are the most important, in my opinion: they make the penalty of excess system calls most obvious.


vyasr commented Jan 1, 2021

@bdice just to make sure that we're on the same page, this PR does not change the amount of I/O performed at all (except in the one edge case mentioned in point 2 above). The slow NFS benchmarks simply indicate that on NFS, I/O costs dramatically outweigh all others, to the point where removing JSON serialization and internal updates only leads to marginal performance improvements. So the other benchmarks are much more representative of how expensive all of the other tasks are on a fast filesystem. Conversely, the NFS benchmarks indicate the limited impact those changes will have when the filesystem is slow. I'm not sure if that's what you mean by "most important".


bdice commented Jan 2, 2021

@vyasr Ok! Thanks for clarifying, that was helpful.

Base automatically changed from feature/synced_collection/optimization to feature/synced_collections January 3, 2021 16:22
@vyasr vyasr force-pushed the feature/synced_collection/memory_buffering branch from 9c470f4 to 2e24e26 Compare January 4, 2021 22:44
@vyasr vyasr marked this pull request as ready for review January 4, 2021 23:04
@vyasr vyasr requested review from a team as code owners January 4, 2021 23:04
@vyasr vyasr requested review from mikemhenry and vishav1771 and removed request for a team January 4, 2021 23:04

@bdice left a comment


A few comments. This looks good overall. I find it a little funny that the buffer capacity doesn't have the same meaning as in the other (serialized) buffer type, but there isn't a good solution to that. We'll just need to rely on good documentation (and perhaps avoiding use of the same names to indicate "object count" and "byte count").

Co-authored-by: Bradley Dice <bdice@bradleydice.com>
@vyasr vyasr merged commit af991e7 into feature/synced_collections Jan 6, 2021
@vyasr vyasr deleted the feature/synced_collection/memory_buffering branch January 6, 2021 02:07
vyasr added a commit that referenced this pull request Feb 20, 2021
* Improve Sync Data Structures (#336)

Adds a generic framework for synced collections to signac. The abstract SyncedCollection defines the interface for any Collection-like object (as defined by Python's collections.abc.Collection) that is synced with some form of persistent storage. This concept generalizes the existing _SyncedDict and SyncedList classes by implementing a common framework for these. Critically, this framework decouples the act of saving to and loading from the backend from the recursive conversion between a SyncedCollection and its base data type (e.g. a SyncedDict to a dict and vice versa), the latter of which is necessary to ensure proper synchronization of nested collections. This decoupling makes it much easier to enable different back ends for a given data structure without having to modify the synchronization logic.

* Add alternative backends for synced_collection (#364)

Adds Redis, Zarr, and MongoDB backends.

* Added validation layer to SyncedCollection (#378)

SyncedCollections can now apply arbitrary validators to input data to ensure that it conforms to a desired specification. This layer can be employed to ensure, for instance, that all inputs are valid JSON.

* Added buffering to SyncedCollection (#363)

Adds the ability to buffer I/O in SyncedCollections using an in-memory cache to store data.

* Feature/synced collection/reorg (#437)

Reorganizes synced collections into their own subpackage and moves all corresponding tests into a subdirectory for easier identification.

* Feature/synced collection/reorg tests (#438)

Cleans up and standardizes the testing architecture so that all backends can easily apply the exact same set of tests using a few minor extra pieces of information. Prevents duplicate execution of tests upon import, while also adding a large number of additional tests for backends that were not previously tested.

* Fix incomplete merge

* Remove lingering old file.

* Feature/synced collection/cleanup (#445)

Rename various methods and remove unnecessary ones from the public API. Standardize internal APIs and simplify the implementation of new backends by automating more subclass behaviors. Improve constructors to enable positional arguments. Improve interfaces for various backends by making it easier for the user to specify and access the precise storage mechanisms.

* Merge master, apply all relevant formatting tools, and add documentation (#446)

Makes the SyncedCollection framework adhere to black, isort, flake8, pydocstyle, and mypy requirements while adding sufficiently detailed documentation to all classes.

* Feature/synced collection/cleanup2 (#447)

Simplifies and standardizes file buffering implementation. Adds extra tests for SyncedAttrDict.update and simplifies its implementation to use _update.

* Feature/synced collection/optimization (#453)

Optimize various aspects of the SyncedCollection framework, including validators, abstract base class type lookups, and the core data structures.

* Remove unnecessary backend str from tests.

* Feature/synced collection/test mongodb redis (#464)

MongoDB and Redis will no longer be silently skipped on CI, so any changes that break support for those will be immediately discovered.

* Make SyncedCollections thread-safe (#463)

Operations that modify SyncedCollections (and their subclasses) now acquire object-specific mutexes prior to modification in order to guarantee consistency between threads. Thread safety can be enabled or disabled by the user.

* Implements an in-memory buffer for SyncedCollections (#462)

The new buffering mode is a variant on the old one that avoids unnecessary encoding, decoding, and in-memory updating by directly sharing memory between various collections accessing the same file in the buffer. This direct sharing allows all changes to be automatically persisted, avoiding any cache inconsistencies without the high overhead of JSON serialization for every single modification.

* Clean up miscellaneous outstanding to-do items (#466)

Completes TODO items scattered throughout the code base and removes a number of outdated ones that have already been addressed.

* Make buffering thread safe (#468)

In addition to synced collections being thread safe individually, while in buffered mode the buffer accesses also have to be made thread safe for multithreaded operation to be safe. This pull request introduces sufficient locking mechanisms to support thread-safe reads from and writes to the buffers.

* Feature/synced collection/unify buffering (#469)

Unifies the implementation of the two different file buffering modes as much as possible using a shared base class. In addition, this fixes a couple of issues with the thread-safe buffering solution in #468 that only show up on PyPy where execution is fast enough to stress-test multithreaded execution. It also reduces thread locking scopes to the minimum possible.

* Feature/synced collection/contexts (#470)

Removes usage of contextlib.contextmanager and replaces it with custom context classes implementing the context manager protocol. The contextlib decorator has measurable overhead that these pre-instantiated context managers avoid. Furthermore, many of the context managers in synced collections follow a very similar counter-based pattern that is now generalized and shared between them.

* Install pymongo on pypy.

* Don't sync on construction.

* Add comparison operators to SyncedList and make sure modifying the filename of JSONDict is safe.

* Remove unnecessary constructor validation, providing both data and resource arguments (e.g. filename for a JSONDict) is valid.

* Fix unused imports.

* Feature/synced collection/replace jsondict (#472)

The old JSONDict and SyncedAttrDict classes are replaced with the new ones from the SyncedCollections framework. The new classes are now used for the Job's statepoint and document attributes as well as the Project document. The state point is now stored in the new _StatePointDict class, which extends the new JSONDict class to afford greater control over saving and loading of data. Much of the internals of the Job class have also been simplified, with most of the complex logic for job migration and validation when the state point changes now contained within the _StatePointDict.

* Replace old JSONDict with new BufferedJSONDict.

* Verify that BufferedJSONDict can be replaced with MemoryBufferedJSONDict.

* Remove largely redundant _reset_sp method.

* Remove single-use internal functions in Job to reduce surface area for SyncedCollection integration.

* Move logic from _init into init.

* Working implementation of statepoint using new SyncedCollection.

* Remove _check_manifest.

* Expose loading explicitly to remove the need for internal laziness in the StatepointDict.

* Simplify the code as much as possible by inlining move method and catching the correct error.

* Improve documentation of context manager for statepoint loading.

* Replace MemoryBufferedJSONDict in Project for now.

* Add documentation of why jobs must be stored as a list in the statepoint.

* Address PR comments.

* Add back import.

* Ensure _StatepointDict is always initialized in constructor.

* Change _StatepointDict to validate id on load.

* Refactor error handling into _StatepointDict class.

* Update docstrings.

* Update comment.

* Fix some docstrings.

* Remove redundant JobsCorruptedError check.

* Rewrite reset_statepoint to not depend on creating another job.

* Reduce direct accesses of internal attributes and do some simplification of the code.

* Reraise errors in JSONCollection.

* Change reset to require a non-None argument and to call _update internally.

* Add reset_data method to provide clear access points of the _StatepointDict for the Job.

* Create new internal method for handling resetting.

* Move statepoint resetting logic into the statepoint object itself.

* Stop accessing internal statepoint filename attribute directly and rely on validation on construction.

* Make statepoint thread safe.

* Some minor cleanup.

* Remove now unnecessary protection of the filename key.

* Explicitly document behavior of returning None from _load_from_resource.

* Apply suggestions from code review

Co-authored-by: Bradley Dice <bdice@bradleydice.com>

* Rename SCJSONEncoder to SyncedCollectionJSONEncoder.

* Only access old id once.

* Move lazy attribute initialization into one location.

* Address PR requests that don't cause any issues.

* Remove the temporary state point file backup.

* Make as many of the old buffer tests pass as possible.

* Reset buffer size after test.

* Last changes from PR review.

Co-authored-by: Bradley Dice <bdice@bradleydice.com>

* Fix synced collection support for 0d numpy arrays.

* Add oldest supported version of pymongo and ensure that zarr/mongo collections don't fail on import.

* Deprecate json module (#480)

Change all non-deprecated modules to import the built-in json module instead of signac's and deprecate signac.core.json.

* Feature/synced collection/reorg (#481)

Reorganizes the package structure so that the synced_collections subpackage is now at the package root and is internally structured with subpackages for data types, backends, and buffers.

* Move synced collections from core to package root.

* Fix all import locations.

* Reorganize internals of synced collection package.

* Fix all imports for reorganized package.

* Hide caching module since it's still experimental.

* Address PR comments.

* Make __all__ an empty list rather than a list containing an empty string.

* Feature/synced collection/simplify global buffering (#482)

Eliminate the global buffering mode across all backends in favor of a more targeted per-backend approach.

* Enable per backend buffering.

* Remove global buffering in favor of class-specific buffering.

* Reintroduce warnings for deprecated functionality.

* Remove truly global buffering and replace it with class-level buffering.

* Document new features.

* Feature/synced collection/deprecate old (#483)

Deprecates the old _SyncedDict, SyncedAttrDict, and JSONDict classes, along with any associated functions and exceptions.

* Deprecate old synced dict classes.

* Move class deprecation warnings to constructors.

* Feature/synced collection/fix buffer reload (#486)

* Make sure data type is preserved when reloading from buffer after flush.

* Add test of new error case.

* Fix lots of documentation issues.

* Address first round of PR comments.

* Update changelog.

* Fix mypy error.

* Don't error check uid unless it's provided.

* First pass to address PR comments.

* Feature/synced collection/remove attr access (#504)

This patch removes attribute-based access to synced dictionaries. This logic is moved to a new `AttrDict` mixin class that can be inherited by any other subclasses if this feature is desired.

* Simplify definition of __setattr__ by relying on complete list of protected keys.

* Move all attribute-based access to a separate mixin class AttrDict.

* Rename SyncedAttrDict to SyncedDict.

* Move synced_attr_dict to synced_dict.

* Remove attribute-based access from existing backend dict types and add new validator to check string keys.

* Isolate deprecated non-str key handling to the _convert_key_to_str json validator.

* Add tests of the AttrDict behavior.

* Use new attrdict based classes in signac and make all tests pass.

* Remove support for inheriting protected attributes, they must be defined by the class itself.

* Change initialization order to call super first wherever possible.

* Address PR comments.

* Address final round of PR comments.

* Feature/synced collection/numpy (#503)

Isolate all numpy logic into a single utility file so that handling of numpy arrays can be standardized. Also substantially improves test coverage, testing a large matrix of different numpy dtypes and shapes with different types of synced collections. The testing framework as a whole is refactored to simplify the code and reduce the amount of boilerplate required to add the new numpy tests.

* Initial pass at isolating all numpy logic to a single file.

* Use pytest to generate temporary directories.

* Stop saving backend kwargs as class variables.

* Remove most references to class-level variables in tests.

* Remove the _backend_collection variable.

* Remove unnecessary autouse fixtures.

* Start adding comprehensive tests for numpy conversion.

* Make sure locks are released even if saving fails.

* Add tests of vector types and add tests for SyncedList as well as SyncedDict.

* Use pytest to parametrize numpy types instead of looping manually.

* Unify vector and scalar tests.

* Stop testing squeezing and just test the relevant shapes directly.

* Add test of reset.

* Move numpy tests back to main file.

* Remove _convert_numpy_scalar, which performed undesirable squeezing of 1-element 1d arrays, and replace its usage with _convert_numpy.

* Separate numpy testing into separate functions and limit supported behavior to only the necessary use cases.

* Add a warning when implicit numpy conversion occurs.

* Update changelog.

* Address all PR comments aside from numpy conversion in type resolver.

* Catch numpy warnings in signac numpy integration tests.

* Support older versions of numpy for random number generation.

* Fix isort issue introduced by rebase.

* Address PR comments.

* Allow AbstractTypeResolver to perform arbitrary preprocessing, delegating the numpy-specialized code to the caller and making it less confusing.

* Add missing call to _convert_numpy.

* Set NUMPY_SHAPES for MongoDB tests when numpy is not present.

* Remove add_validators and specify validators in class definition (#507)

* Remove add_validators classmethod and instead require validators to be defined at class definition time, preventing one application from modifying the validation process for others and giving subclasses a standard means to completely override parent validators.

* Change type in docstring.

* Don't use os.path.join where not needed. (#511)

The extra work performed by os.path.join can be slow, so this PR replaces it with direct string concatenation of os.sep.

* Don't use os.path.join where not needed.

* Update signac/contrib/job.py

Co-authored-by: Bradley Dice <bdice@bradleydice.com>

* Disable recursive validation during recursive conversion of nested types. (#509)

* Feature/synced collection/optimize jsondict validation (#508)

Define a single validator for JSONAttrDict classes that combines the logic of other validators while reducing collection traversal costs. Also switch from converting numpy arrays to just bypassing the resolver's cache for them.

* Use single separate validator for state points for performance.

* Remove preprocessor from type resolver and instead use a blocklist that prevents caching data for certain types.

* Reorder resolvers to optimize performance.

* Make sure not to include strings as sequences.

* Move state point validator to collection_json and use for all JSONAttrDict types.

* Make sure to also check complex types.

* Add back missing period lost during stash merge.

* Address review comments.

* Feature/synced collection/optimize protected key lookup (#510)

Since the most common operation on protected keys is to check if some key is within the list of protected keys, this patch changes the `_PROTECTED_KEYS` attribute to a set for faster O(1) membership checks.

* Switch protected keys from a sequence to a set for faster containment checks.

* Change evaluation order of checks in setattr.

* Address PR comments.

* Defer statepoint instantiation, unify reset_statepoint logic (#497)

* Defer state point initialization when lazy loading.

* Allow validation to be disabled in SyncedCollection._update.

* Unify reset_statepoint logic across methods.

* Revert validation-related changes.

* Add test, fix bug.

* Update tests/test_job.py

* Feature/synced collection/optimize (#513)

* Add a resolver to fast-exit _from_base for non-collection types.

* Optimize synced collection init method.

* Remove validators property and access internal variable directly.

* Add early exit for _convert_array.

* Use locally cached root var instead of self._root.

* Remove superfluous duplicate check.

* Optionally skip validation in SyncedCollection _update. (#512)

* Add option to trust the data source during _update.

* Skip validation if JSON is valid.

* Fix pre-commit.

* Rename trust_source to _validate.

* Set _validate=False.

Co-authored-by: Vishav Sharma <46069089+vishav1771@users.noreply.github.com>
Co-authored-by: Bradley Dice <bdice@bradleydice.com>