Stress and scale testing #668
Might be able to leverage some of the existing benchmark-testing stuff for this. Next steps:
|
Components in Papiea that can affect performance at scale
Parameters we can use to scale/stress test the system
Most accessed functionalities within papiea
Metrics for test result
These metrics should be collected over multiple test runs (maybe 5) and averaged to get an accurate picture of the values. We also need to plot these metrics and save the plots every time we run stress and scale tests.

Risks in Papiea
Currently, the major risk in Papiea is race conditions on entities in the Papiea engine. The tests should therefore also check for race conditions by introducing random delays in procedures and diff handlers.

Strategy for running tests
Stress testing: As I understand it, stress testing is done to find the breaking point of the system. For Papiea, I think we can do that by tuning the parameters mentioned above to various numbers (100, 1000, 10000, etc.). The parameters we tune need to be fired all at once, i.e. all the procedure calls or all the CRUD operations should be sent together without any delay. These tests should be run only when we make major updates to the engine that could change the limits of some of the components mentioned above. I would suggest running them on every minor and major version update in Papiea, i.e. (0.9.50 -> 0.10.0), or at the end of the month, whichever comes first.
Scale testing: For scale testing, I think we should simulate a real-world application that operates at large scale and develop common scenarios/workloads that can be tested regularly against the components mentioned above. For example, spec_update(entity1) -> intent_handler(entity1) -> spec_update(entity2) -> status_update(entity1) is a common scenario in Papiea. This strategy should be applied at multiple load levels so that we learn how the system behaves at each level (high, medium, and low load). We also need to log and save the error-handling responses for failures/abnormal conditions within the system so that we understand what is going wrong in Papiea.
Scale testing should be scheduled to run at the end of each week, so that the test scenarios survive for a longer time and give us an idea of how the system would behave in real-world conditions. |
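The "fire everything all at once" idea above can be sketched with a small async driver. This is a minimal sketch, assuming a hypothetical `call_procedure` stand-in with simulated latency; a real harness would issue requests to the Papiea engine instead.

```python
import asyncio
import random
import time

# Hypothetical stand-in for a Papiea procedure call; in a real harness
# this would send an HTTP request to the engine.
async def call_procedure(entity_id: int) -> float:
    start = time.monotonic()
    await asyncio.sleep(random.uniform(0.001, 0.005))  # simulated engine latency
    return time.monotonic() - start

async def stress_run(n: int) -> list:
    # Fire all n calls at once, with no delay between them.
    return await asyncio.gather(*(call_procedure(i) for i in range(n)))

latencies = asyncio.run(stress_run(100))
print(len(latencies))
```

Scaling `n` through 100, 1000, 10000, etc. and watching where latencies or errors spike would locate the breaking point described above.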
@nitesh-idnani1 can you provide some more input into what tech choice we should make, e.g. scale configuring, stress monitoring, etc. |
I haven't done much research on what we can use, but I was thinking once we get the requirements finalized choosing the library/framework should be a simple task. |
Based on my discussion with @joshua-berry-ntnx, here are some of the points:
Test Objective
Since we are doing scale testing, the objective of the test is to monitor and analyze the performance of the system under varying load levels. We also need to identify the risks within the system and develop tests that ensure safety against those risks.

Test Scenario
To simulate a real-world scenario, we'll test Papiea on a file-system-based use case with the following components:
Bucket Entity - The bucket structure, which contains one or more objects. Bucket has two fields.
Object Entity - The object structure, which stores the content and relevant metadata. Object has four fields.
Note: We maintain the BucketRefs in the object to support creating symbolic links to an object. Each item in the bucket refs records the bucket name, the object name (the object's name in that bucket), and a reference to the bucket entity.
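The two entity shapes above can be sketched as plain data structures. This is an assumption-heavy sketch: the exact field names are not given in the text, so everything beyond "one or more objects", "content", "metadata", and the BucketRefs items is hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class BucketRef:
    # One entry per bucket that links to the object, as described in the note.
    bucket_name: str
    object_name: str       # the object's name within that bucket
    bucket_reference: str  # reference to the bucket entity (e.g. its uuid)

@dataclass
class ObjectEntity:
    content: str
    size: int
    bucket_refs: list = field(default_factory=list)

@dataclass
class BucketEntity:
    name: str
    objects: list = field(default_factory=list)  # object names in this bucket

b = BucketEntity(name="test-bucket", objects=["obj1"])
o = ObjectEntity(content="", size=0,
                 bucket_refs=[BucketRef("test-bucket", "obj1", "bucket-uuid-1")])
print(o.bucket_refs[0].bucket_name)
```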
We have the following procedures, which are responsible for creating entities and managing the content:
Ensure Bucket Exists - Creates a new bucket if it does not exist; otherwise returns the bucket that was found.
Change Bucket Name - Updates the name of the bucket in the entity and also in the BucketRefs list for each object.
Create Object - Creates a new object and populates it with empty/default values for content and size. The object name must be unique within the bucket, otherwise this procedure fails.
Link Object - Creates a (symbolic) link to an existing object, only if it is found. The linked object can be in the same bucket or a different bucket.
Unlink Object - Removes the link to an object (which must be linked). The unlinked object is removed from the bucket's list as well.
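Two of the procedures above can be sketched as a minimal in-memory model; this is only an illustration of the described semantics (create-if-missing, unique object names), not the real engine-backed implementation.

```python
# In-memory sketch: bucket name -> list of object names.
buckets = {}

def ensure_bucket_exists(name: str) -> list:
    # Create the bucket if it does not exist; otherwise return the one found.
    return buckets.setdefault(name, [])

def create_object(bucket: str, obj: str) -> None:
    objs = ensure_bucket_exists(bucket)
    if obj in objs:
        # Object names must be unique within a bucket, per the description.
        raise ValueError(f"object {obj!r} already exists in bucket {bucket!r}")
    objs.append(obj)

create_object("b1", "o1")
try:
    create_object("b1", "o1")  # duplicate name: must fail
    duplicate_rejected = False
except ValueError:
    duplicate_rejected = True
print(buckets, duplicate_rejected)
```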
We have the following intent handlers, which are responsible for resolving the diffs for buckets and objects:
Bucket Create Handler - Invoked every time a new bucket is created.
Bucket Name Handler - Invoked every time the bucket name is updated.
Object Added Handler - Invoked every time an object is added to the bucket (including linked objects).
Object Removed Handler - Invoked every time an object is removed from the bucket (including unlinked objects).
Object Create Handler - Invoked every time a new object is created.
Object Content Handler - Invoked every time the object content is updated, to update the related metadata.
Note: We'll add some delay to the procedures/intent handlers to create randomness in processing/return time and to verify safety against race conditions.

Test Configuration
For the purpose of scale testing, I'm planning to run the tests under varying levels of load, i.e.
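The randomized-delay note above can be sketched as a wrapper around a handler. This is a sketch under assumptions: `object_content_handler` is a hypothetical handler that derives metadata (size) from content, and the wrapper simply varies its processing time to expose ordering-dependent races.

```python
import asyncio
import random

def with_random_delay(handler, max_delay: float = 0.01):
    # Wrap an intent handler so its processing time varies between runs,
    # making race conditions in the surrounding system more likely to surface.
    async def wrapped(entity):
        await asyncio.sleep(random.uniform(0, max_delay))
        return await handler(entity)
    return wrapped

async def object_content_handler(entity: dict) -> dict:
    # Hypothetical handler: update metadata derived from the content.
    entity["size"] = len(entity["content"])
    return entity

handler = with_random_delay(object_content_handler)
result = asyncio.run(handler({"content": "hello", "size": 0}))
print(result)
```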
Note: Each level will be executed at least 3-5 times, to average out the findings and get a more accurate picture of the system.

Test Deliverables
To monitor and analyze the above parameters, we'll have to add our own logic to the system to track and save these values, which can be used later to assess system performance.

Test Risks
Test Exit Strategy
A reasonable general strategy is to exit when the system stops responding to API operations, or when we cannot verify the correctness of the operations after a certain retry/timeout threshold. |
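The exit strategy above can be sketched as a retry loop around a correctness check. This is a minimal sketch; `check` is a hypothetical callable that would ping the API and verify operation results in a real harness.

```python
import time

def should_exit(check, retries: int = 3, delay: float = 0.01) -> bool:
    # Retry the liveness/correctness check up to `retries` times; exit the
    # test run only once the threshold is exhausted.
    for _ in range(retries):
        if check():
            return False  # system responding and correct: keep testing
        time.sleep(delay)
    return True  # threshold exhausted: abort the run

attempts = []
def failing_check() -> bool:
    attempts.append(1)
    return False  # simulate a system that never recovers

print(should_exit(failing_check), len(attempts))
```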
@nitesh-idnani1 You've got a lot of good stuff here spread across a few comments; can you capture all of it in a doc in our Papiea folder? That will make it easier to review and comment on. Thanks! |
We should have some regular stress and scale testing for Papiea. TBD what this will actually look like specifically, but I'm thinking lots of entities, lots of things happening in parallel. Lots of procedures, lots of concurrent diff resolutions, etc. Basically we want to find any races/inconsistencies in Papiea itself, particularly the engine.