Performance testing and improvement #195

Open · 7 of 11 tasks

bmtcril opened this issue Feb 23, 2024 · 1 comment

bmtcril commented Feb 23, 2024

As we approach our v1 release, it is important that we develop and maintain methodologies for testing the performance of various parts of the system and periodically checking for regressions. This epic holds the high-level tasks for managing that work.

Event delivery

We currently have several ways of delivering xAPI events to ClickHouse. For each of these we should document a methodology and a reference configuration for testing how many events it can deliver to ClickHouse before issues emerge (queues filling up, task sizes growing, delivery time lagging, etc.).
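
As a rough illustration, spot-checking backlog on the queue-backed delivery paths might look like the sketch below. The Celery queue name and Redis stream name here are assumptions; substitute whatever your deployment actually uses.

```python
# Rough sketch: spot-check backlog on queue-backed delivery paths.
# The queue name "celery" and stream name "openedx-analytics" are
# assumptions -- substitute the names your deployment actually uses.
import redis

r = redis.Redis(host="localhost", port=6379)

celery_backlog = r.llen("celery")             # pending Celery tasks
stream_backlog = r.xlen("openedx-analytics")  # pending Redis stream entries
print(f"celery backlog: {celery_backlog}, stream backlog: {stream_backlog}")
```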

We should be able to emulate traffic by replaying very large tracking log files with a batch size of 1, reducing the sleep setting until we find the maximum rate the backend can handle. If the backend can keep up with a zero-sleep loop, we should run additional processes until it breaks. A sketch of such a replay loop follows.
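
This is a minimal sketch only, assuming a file of xAPI statements with one JSON object per line; `send_to_backend()` is a hypothetical stand-in for whichever delivery backend is under test:

```python
# Minimal replay sketch: batch size 1 with a tunable sleep.
# send_to_backend() is a hypothetical hook -- wire it up to
# whichever delivery backend is being load tested.
import json
import time

SLEEP_SECS = 0.01  # reduce toward 0 until the backend stops keeping up

def send_to_backend(event: dict) -> None:
    raise NotImplementedError("wire this to the backend under test")

with open("large_tracking_log.json") as log:
    for line in log:
        send_to_backend(json.loads(line))
        if SLEEP_SECS:
            time.sleep(SLEEP_SECS)
```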

We should be careful to constrain the configurations to be roughly equivalent in resources / cost, to emulate a production environment for a mid-sized system, and to use the same version of Aspects for all tests.

Template for reporting results:

Test system configuration:
- Tutor version
- Aspects version
- Environment specifications (local / k8s, CPU / Memory / Disk resources allocated)

Load generation specifications:
- Tool
- Exact script
- Any custom settings for things like sleep time and # of processes

Data captured for results:
- Length of run
- Sleep time / batch size
- Values for each of these, sampled every 10 seconds (see the polling sketch after this list):
  - Latency of events in ClickHouse (now - most recent event)
  - Queue size (if applicable), e.g. pending tasks in Celery, pending stream size in Redis, etc.
  - Total events in ClickHouse
  - Query times for 2-3 ClickHouse reporting queries (as taken from Superset)
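
A sketch of that 10-second sampling loop using the clickhouse-connect client; the `xapi.xapi_events_all` table and `emission_time` column are assumptions based on a typical Aspects schema, so verify them against your deployment:

```python
# Sketch of the 10-second sampling loop. The table and column names
# (xapi.xapi_events_all, emission_time) are assumptions -- verify
# them against your Aspects schema first.
import time
import clickhouse_connect

client = clickhouse_connect.get_client(host="localhost", port=8123)

while True:
    latency = client.query(
        "SELECT now() - max(emission_time) FROM xapi.xapi_events_all"
    ).result_rows[0][0]
    total = client.query(
        "SELECT count() FROM xapi.xapi_events_all"
    ).result_rows[0][0]
    print(f"event latency: {latency}s, total events: {total}")
    time.sleep(10)
```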

Query performance

On a load-test dataset, check every reporting query we have (as captured from the "show SQL" option in Superset), with and without any applicable filters, to see how they perform. We should run each query 5 times and capture the response times and the number of rows returned; a small harness sketch follows. It should also be possible to capture the queries by browsing each chart with different filters applied, then pulling the SQL from the ClickHouse logs.
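
A small harness along those lines, again assuming clickhouse-connect; the query entry below is only a placeholder to be filled with SQL copied out of Superset:

```python
# Hypothetical harness for the 5x query runs described above.
# Replace the placeholder SQL with queries captured from Superset.
import time
import clickhouse_connect

client = clickhouse_connect.get_client(host="localhost", port=8123)

QUERIES = {
    # short name -> SQL; the entry below is only a stand-in
    "enrollments, no filter": "SELECT count() FROM xapi.xapi_events_all",
}

for name, sql in QUERIES.items():
    for run in range(1, 6):
        start = time.monotonic()
        result = client.query(sql)
        elapsed = time.monotonic() - start
        print(f"{name} run {run}: {elapsed:.3f}s, "
              f"{len(result.result_rows)} rows")
```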

We should be careful to capture the xapi-db-load configuration used to generate the data so we can regenerate it as necessary.

Template for reporting results:

Test ClickHouse configuration:
- Deployment type (local / k8s / ClickHouse Cloud / Altinity, etc.)
- Hardware or configuration specs
- Total rows in ClickHouse

For each query:
- Query short name (e.g. "enrollments, no filter", "enrollments, enrollment type filter")
- Raw query
- Duration
- Rows returned

bmtcril commented Apr 4, 2024

All of the data collected for the first set of load tests is in these two files:

load_test_stats_1.txt
load_test_runs_1.txt

The Superset dashboard I'm using, along with its associated datasets, can be imported from this zip:

dashboard_export_20240404T203605.zip
