
Stress and scale testing #668

Open

joshua-berry-ntnx opened this issue Mar 25, 2021 · 6 comments

@joshua-berry-ntnx (Contributor)

We should have some regular stress and scale testing for Papiea. What this will look like specifically is TBD, but I'm thinking lots of entities and lots of things happening in parallel: lots of procedures, lots of concurrent diff resolutions, etc. Basically we want to find any races/inconsistencies in Papiea itself, particularly the engine.

@joshua-berry-ntnx (Contributor, Author)

Might be able to leverage some of the existing benchmark-testing stuff for this.

Next steps:

  • Write a test plan (what do we want to test?)
  • Come up with an approach/design for how to implement the tests (e.g. how do we run longer-running tests in CI?)

@nitesh-idnani1 (Contributor) commented Apr 13, 2021

Components in Papiea that can affect performance at scale

  • Authentication
  • Database
  • Intentful Engine (Diff & Intent Resolver)

Parameters we can use to scale/stress test the system

  • Providers
  • Kinds per provider
  • Entities within a kind
  • Procedure calls
  • CRUD operation requests
  • Intent handler invocations

Most frequently accessed functionality within Papiea

  • Update entity spec
  • Update entity status
  • Diff computation for spec and status
  • Watchlist operations

Metrics for test results

  • Response time for the request (should be done for each operation)
  • Throughput, i.e. number of requests processed per second (for the whole test run)
  • CPU usage (for the whole test run)
  • Memory usage (for the whole test run)
  • Network usage (for the whole test run)

These metrics should be collected over multiple test runs (maybe 5) and averaged out to get an accurate idea of the values. We also need to plot these metrics and save them every time we do stress and scale testing.
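
A small sketch of that averaging step, assuming an illustrative per-run metrics record (the field names are not from the Papiea codebase):

```typescript
// Hypothetical per-run metrics record; field names are illustrative.
interface RunMetrics {
    responseTimeMs: number;
    requestsPerSec: number;
    cpuPercent: number;
    memoryMb: number;
    networkMbps: number;
}

// Average each metric across runs (e.g. 5) to smooth out noise.
function averageRuns(runs: RunMetrics[]): RunMetrics {
    const avg = (pick: (r: RunMetrics) => number) =>
        runs.reduce((sum, r) => sum + pick(r), 0) / runs.length;
    return {
        responseTimeMs: avg(r => r.responseTimeMs),
        requestsPerSec: avg(r => r.requestsPerSec),
        cpuPercent: avg(r => r.cpuPercent),
        memoryMb: avg(r => r.memoryMb),
        networkMbps: avg(r => r.networkMbps),
    };
}
```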

Risks in Papiea

Currently, the major risk in Papiea is race conditions on entities in the Papiea engine. Hence, the goal of the tests should also be to check for race conditions by introducing random delays into procedures and diff handlers.

Strategy for running tests

Stress testing:

As I understand it, stress testing is done to find the breaking point of the system. In the case of Papiea, I think we can do that by tuning the parameters mentioned above to various values (100, 1000, 10000, etc.). The operations being scaled need to be fired all at once, i.e. all the procedure calls or all the CRUD operations should be sent together without any delay.
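
As a rough illustration, here is a minimal sketch of firing a batch of calls concurrently with no pacing; `callProcedure` is a hypothetical stand-in for however the test harness invokes Papiea, not a real SDK function:

```typescript
// Fire `count` calls concurrently with no pacing between them.
// `callProcedure` is a hypothetical stand-in for the actual client call.
async function fireAllAtOnce(
    count: number,
    callProcedure: (i: number) => Promise<void>
): Promise<{ succeeded: number; failed: number }> {
    const results = await Promise.allSettled(
        Array.from({ length: count }, (_, i) => callProcedure(i))
    );
    const failed = results.filter(r => r.status === "rejected").length;
    return { succeeded: count - failed, failed };
}
```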

These tests should be run only when we make major updates to the engine that could change the limits of some of the components mentioned above. I would suggest doing it on every minor and major version update in Papiea (e.g. 0.9.50 -> 0.10.0) or at the end of the month, whichever comes first.

Scale testing:

For scale testing, I think we should try to simulate a real-world application that operates at a large scale and develop common scenarios/workloads that can be tested regularly against the components mentioned above. For example, spec_update(entity1) -> intent_handler(entity1) -> spec_update(entity2) -> status_update(entity1) is a common scenario in Papiea; a sketch of this chain follows.
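
A minimal sketch of that chain, assuming a hypothetical client interface (the method names here are stand-ins, not the real Papiea SDK surface):

```typescript
// Stand-in for whatever client the tests end up using; these method
// names are assumptions, not the actual Papiea SDK API.
interface PapieaClientStub {
    specUpdate(entity: string, spec: object): Promise<void>;
    awaitIntentHandler(entity: string): Promise<void>;
    statusUpdate(entity: string, status: object): Promise<void>;
}

// The common scenario from above, expressed as one test step.
async function runScenario(client: PapieaClientStub): Promise<void> {
    await client.specUpdate("entity1", { desired: "v2" });    // spec_update(entity1)
    await client.awaitIntentHandler("entity1");               // intent_handler(entity1)
    await client.specUpdate("entity2", { desired: "v2" });    // spec_update(entity2)
    await client.statusUpdate("entity1", { actual: "v2" });   // status_update(entity1)
}
```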

The above strategy should be applied at multiple load levels so that we get an idea of how the system behaves at each level (high, medium, and low load). We also need to log and save the error-handling responses for failures/abnormal conditions within the system, so that we understand what's going wrong in Papiea.

Scale testing should be set to run at the end of the week, so that the test scenarios survive for a longer time and we get an idea of how the system would behave in real-world scenarios.

@Emulebest (Contributor)

@nitesh-idnani1 can you provide some more input on the technology choices we should make, e.g. for scale configuration, stress monitoring, etc.?

@nitesh-idnani1 (Contributor)

I haven't done much research on what we can use, but I was thinking that once we finalize the requirements, choosing a library/framework should be a simple task.

@nitesh-idnani1 (Contributor)

Based on my discussion with @joshua-berry-ntnx, here are some of the points:

  1. Scale testing is more important for Papiea, since we want to identify and analyze the reasons for failures that could happen at scale.
  2. Scale testing will run for a span of 1-2 days to verify stability under real-world conditions. The tests should be triggered every weekend and whenever we have a major/minor version update.
  3. Scale testing should be done using a real-world scenario where we can develop relationships between the entity types (kinds). Hence, I have developed the plan below based on a file-system application.

Test Objective

Since we are doing scale testing, the objective is to monitor and analyze the performance of the system under varying load levels. We also need to identify the risks within the system and develop tests that guard against them.

Test Scenario

To simulate a real-world scenario, we'll be testing Papiea on a file-system-based use case, which has the following components:

  • Entities

Bucket Entity - The bucket structure, which contains one or more objects. A bucket has two fields: Name and ObjectRefs[]. ObjectRefs is a list whose items contain the Object Name and a Ref(Object) entity reference.

Object Entity - The object structure, which stores the content and relevant metadata. An object has four fields: Content, Size, Last Modified time, and BucketRefs[]. BucketRefs is a list whose items contain the Bucket Name, Object Name, and Ref(Bucket) fields.

Note: We maintain BucketRefs on the object to support creating symbolic links to an object. Each item in BucketRefs gives the bucket name, the object name (the object's name in that bucket), and a reference to the bucket entity. A sketch of these shapes follows.
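
As an illustration only, the two entity specs might look like this in TypeScript; the field names follow the description above, and EntityRef is a stand-in for Papiea's entity-reference type:

```typescript
// Stand-in for Papiea's entity-reference type; illustrative only.
type EntityRef = { uuid: string; kind: string };

// Bucket spec: a name plus references to the objects it contains.
interface BucketSpec {
    name: string;
    objectRefs: { objectName: string; ref: EntityRef }[];
}

// Object spec: content, metadata, and back-references to buckets,
// which is what makes symbolic links representable.
interface ObjectSpec {
    content: string;
    size: number;
    lastModified: string;
    bucketRefs: { bucketName: string; objectName: string; ref: EntityRef }[];
}
```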

  • Procedures

We have the following procedures, which are responsible for creating entities and managing content:

Ensure Bucket Exists - This procedure creates a new bucket, if it does not exist. Otherwise, returns the bucket that was found.

Change Bucket Name - This procedure updates the name of the bucket in the entity and also in the BucketRefs list for each object.

Create Object - This procedure creates a new object and populates it with empty/default values for content and size. The object name must be unique within the bucket; otherwise, this procedure fails.

Link Object - This procedure creates a (symbolic) link to an existing object, only if it is found. The linked object can be in the same bucket or a different bucket.

Unlink Object - This procedure removes the link to an object (which must currently be linked). The unlinked object is removed from the bucket's list as well. A sketch of the Ensure Bucket Exists procedure follows.
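
For illustration, Ensure Bucket Exists could be sketched like this; `findBucketByName` and `createBucket` are hypothetical helpers the provider would supply:

```typescript
// Sketch of "Ensure Bucket Exists": return the bucket if it already
// exists, otherwise create it. Both helpers are hypothetical.
async function ensureBucketExists<Bucket>(
    name: string,
    findBucketByName: (name: string) => Promise<Bucket | undefined>,
    createBucket: (name: string) => Promise<Bucket>
): Promise<Bucket> {
    const existing = await findBucketByName(name);
    if (existing !== undefined) {
        return existing;            // bucket found: return it
    }
    return createBucket(name);      // not found: create a new bucket
}
```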

  • Intent Handlers

We have the following intent handlers, which are responsible for resolving the diffs for buckets and objects:

Bucket Create Handler - This handler is invoked every time a new bucket is created.

Bucket Name Handler - This handler is invoked every time the bucket name is updated.

Object Added Handler - This handler is invoked every time an object is added to the bucket (including via Link Object).

Object Removed Handler - This handler is invoked every time an object is removed from the bucket (including via Unlink Object).

Object Create Handler - This handler is invoked every time a new object is created.

Object Content Handler - This handler is invoked every time the object content is updated, and it updates the related metadata.

Note: For the procedures/intent handlers, we'll add some random delay to vary the processing/return time, which increases the chance of surfacing race conditions, as sketched below.
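
A minimal sketch of such a wrapper, assuming an arbitrary upper bound of 500 ms on the injected delay:

```typescript
// Wrap a handler so each invocation waits a random 0..maxDelayMs before
// running, making interleavings vary between runs and exposing races.
function withRandomDelay<Args extends unknown[], R>(
    handler: (...args: Args) => Promise<R>,
    maxDelayMs = 500
): (...args: Args) => Promise<R> {
    return async (...args: Args) => {
        const delayMs = Math.random() * maxDelayMs;
        await new Promise(resolve => setTimeout(resolve, delayMs));
        return handler(...args);
    };
}
```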

Test Configuration

For the purpose of scale testing, I'm planning to run the tests under varying levels of load, i.e.:

  1. Low load:
  • Limit of 100 entities each for bucket and object
  • Limit of 100 procedure calls at a given time
  • Limit of 100 CRUD operations at a given time
  2. Medium load:
  • Limit of 1000 entities each for bucket and object
  • Limit of 1000 procedure calls at a given time
  • Limit of 1000 CRUD operations at a given time
  3. Heavy load:
  • Limit of 10000 entities each for bucket and object
  • Limit of 10000 procedure calls at a given time
  • Limit of 10000 CRUD operations at a given time

Note: Each level will be executed at least 3-5 times, to average out the findings and get a more accurate picture of the system. Expressed as data for a test runner, the levels could look like the sketch below.
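
A possible encoding, with illustrative field names:

```typescript
// The three load levels above as data a test runner can iterate over.
// Field names are illustrative, not from the Papiea codebase.
interface LoadLevel {
    name: string;
    maxEntitiesPerKind: number;          // buckets and objects each
    maxConcurrentProcedureCalls: number;
    maxConcurrentCrudOps: number;
}

const LOAD_LEVELS: LoadLevel[] = [
    { name: "low",    maxEntitiesPerKind: 100,   maxConcurrentProcedureCalls: 100,   maxConcurrentCrudOps: 100 },
    { name: "medium", maxEntitiesPerKind: 1000,  maxConcurrentProcedureCalls: 1000,  maxConcurrentCrudOps: 1000 },
    { name: "heavy",  maxEntitiesPerKind: 10000, maxConcurrentProcedureCalls: 10000, maxConcurrentCrudOps: 10000 },
];
```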

Test Deliverables

  1. Response time - The time it takes for CRUD operations to return.
  2. Throughput - Total CRUD operations performed per minute.
  3. Correctness - The correctness of the operations, i.e. spec updates, status updates, entity create/delete, etc.

For monitoring and analyzing the above parameters, we'll have to add our own logic to the system to track and save these values, which can be used later to assess system performance.
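
One possible shape for that logic, a minimal timing wrapper using Node's built-in perf_hooks module (the accumulation strategy is an assumption):

```typescript
import { performance } from "perf_hooks";

// Record response times for each wrapped operation.
const responseTimesMs: number[] = [];

async function timed<R>(op: () => Promise<R>): Promise<R> {
    const start = performance.now();
    try {
        return await op();
    } finally {
        responseTimesMs.push(performance.now() - start);
    }
}

// Throughput deliverable: operations completed per minute over the run.
function throughputPerMinute(elapsedMs: number): number {
    return responseTimesMs.length / (elapsedMs / 60_000);
}
```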

Test Risks

  1. Component failure i.e. diff resolver, intent resolver, database, API gateway, etc.
  2. High memory footprint causing the system to slow down or run into OOM issues.
  3. Incorrect behavior of the operations (spec/status updates, watchlist operations, etc.) due to race conditions.

Test Exit Strategy

A generalized strategy should be to exit when the system stops responding to API operations, or when we cannot verify the correctness of the operations after a certain retry/timeout threshold.
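
As a sketch, the abort decision could look like this; the probe function, retry count, and delay are all assumptions:

```typescript
// Decide whether to abort the run: retry a responsiveness probe a few
// times, and give up once every attempt fails or errors out.
async function shouldAbort(
    probe: () => Promise<boolean>,   // e.g. pings a Papiea API endpoint
    maxRetries = 5,
    retryDelayMs = 2000
): Promise<boolean> {
    for (let attempt = 0; attempt < maxRetries; attempt++) {
        try {
            if (await probe()) {
                return false;        // still responsive: keep testing
            }
        } catch {
            // a failed probe counts the same as an unresponsive system
        }
        await new Promise(resolve => setTimeout(resolve, retryDelayMs));
    }
    return true;                     // retry threshold exceeded: abort
}
```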

@joshua-berry-ntnx (Contributor, Author)

@nitesh-idnani1 You've got a lot of good stuff here spread across a few comments; can you capture all of it in a doc in our Papiea folder? That will make it easier to review and comment on. Thanks!
