Performance testing and improvement #195

Open · 7 of 11 tasks

bmtcril opened this issue Feb 23, 2024 · 1 comment

bmtcril commented Feb 23, 2024

As we approach our v1 release, it is important that we develop and maintain methodologies for testing the performance of various parts of the system and periodically checking for regressions. This epic holds the high-level tasks for managing that work.

Event delivery

We currently have several ways of delivering xAPI events to ClickHouse. For each of these we should document a methodology and a reference configuration for testing how many events it can deliver to ClickHouse before issues emerge (queues filling up, task sizes growing, delivery time lagging, etc.).
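
As a rough illustration, spot-checking backlog on the queue-backed delivery paths might look like the sketch below. The Celery queue name and Redis stream name here are assumptions; substitute whatever your deployment actually uses.

```python
# Rough sketch: spot-check backlog on queue-backed delivery paths.
# The queue name "celery" and stream name "openedx-analytics" are
# assumptions -- substitute the names your deployment actually uses.
import redis

r = redis.Redis(host="localhost", port=6379)

celery_backlog = r.llen("celery")             # pending Celery tasks
stream_backlog = r.xlen("openedx-analytics")  # pending Redis stream entries
print(f"celery backlog: {celery_backlog}, stream backlog: {stream_backlog}")
```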

We should be able to emulate traffic by replaying very large tracking log files with a batch size of 1, reducing the sleep setting until we find the maximum rate the backend can handle. If the backend can keep up with a zero-sleep loop, we should run additional processes until it breaks. A sketch of such a replay loop follows.
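
This is a minimal sketch only, assuming a file of xAPI statements with one JSON object per line; `send_to_backend()` is a hypothetical stand-in for whichever delivery backend is under test:

```python
# Minimal replay sketch: batch size 1 with a tunable sleep.
# send_to_backend() is a hypothetical hook -- wire it up to
# whichever delivery backend is being load tested.
import json
import time

SLEEP_SECS = 0.01  # reduce toward 0 until the backend stops keeping up

def send_to_backend(event: dict) -> None:
    raise NotImplementedError("wire this to the backend under test")

with open("large_tracking_log.json") as log:
    for line in log:
        send_to_backend(json.loads(line))
        if SLEEP_SECS:
            time.sleep(SLEEP_SECS)
```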

We should be careful to constrain the configurations to be roughly equivalent in resources / cost, to emulate a production environment for a mid-sized system, and to use the same version of Aspects for all tests.

Template for reporting results:

Test system configuration:
- Tutor version
- Aspects version
- Environment specifications (local / k8s, CPU / Memory / Disk resources allocated)

Load generation specifications:
- Tool
- Exact script
- Any custom settings for things like sleep time and # of processes

Data captured for results:
- Length of run
- Sleep time / batch size
- Values for each of these, sampled every 10 seconds (see the polling sketch after this list):
  - Latency of events in ClickHouse (now - most recent event)
  - Queue size (if applicable), e.g. pending tasks in Celery, pending stream size in Redis, etc.
  - Total events in ClickHouse
  - Query times for 2-3 ClickHouse reporting queries (as taken from Superset)
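
A sketch of that 10-second sampling loop using the clickhouse-connect client; the `xapi.xapi_events_all` table and `emission_time` column are assumptions based on a typical Aspects schema, so verify them against your deployment:

```python
# Sketch of the 10-second sampling loop. The table and column names
# (xapi.xapi_events_all, emission_time) are assumptions -- verify
# them against your Aspects schema first.
import time
import clickhouse_connect

client = clickhouse_connect.get_client(host="localhost", port=8123)

while True:
    latency = client.query(
        "SELECT now() - max(emission_time) FROM xapi.xapi_events_all"
    ).result_rows[0][0]
    total = client.query(
        "SELECT count() FROM xapi.xapi_events_all"
    ).result_rows[0][0]
    print(f"event latency: {latency}s, total events: {total}")
    time.sleep(10)
```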

Query performance

On a load-test dataset, check every reporting query we have (as captured from the "show SQL" option in Superset), with and without any applicable filters, to see how they perform. We should run each query 5 times and capture the response times and the number of rows returned; a small harness sketch follows. It should also be possible to capture the queries by browsing each chart with different filters applied, then pulling the SQL from the ClickHouse logs.
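
A small harness along those lines, again assuming clickhouse-connect; the query entry below is only a placeholder to be filled with SQL copied out of Superset:

```python
# Hypothetical harness for the 5x query runs described above.
# Replace the placeholder SQL with queries captured from Superset.
import time
import clickhouse_connect

client = clickhouse_connect.get_client(host="localhost", port=8123)

QUERIES = {
    # short name -> SQL; the entry below is only a stand-in
    "enrollments, no filter": "SELECT count() FROM xapi.xapi_events_all",
}

for name, sql in QUERIES.items():
    for run in range(1, 6):
        start = time.monotonic()
        result = client.query(sql)
        elapsed = time.monotonic() - start
        print(f"{name} run {run}: {elapsed:.3f}s, "
              f"{len(result.result_rows)} rows")
```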

We should be careful to capture the xapi-db-load configuration used to generate the data so we can regenerate it as necessary.

Template for reporting results:

Test ClickHouse configuration:
- Deployment type (local / k8s / ClickHouse Cloud / Altinity, etc.)
- Hardware or configuration specs
- Total rows in ClickHouse

For each query:
- Query short name (e.g. "enrollments, no filter", "enrollments, enrollment type filter")
- Raw query
- Duration
- Rows returned

bmtcril commented Apr 4, 2024

All of the data collected for the first set of load tests is in these two files:

load_test_stats_1.txt
load_test_runs_1.txt

The Superset dashboard I'm using, along with its associated datasets, can be imported from this zip:

dashboard_export_20240404T203605.zip
