Spike: Investigate Clickhouse as an event store, check Ralph integration points #24

Closed
bmtcril opened this issue Nov 2, 2022 · 3 comments

bmtcril commented Nov 2, 2022

We're interested in running Ralph as an LRS, but are hoping to use a highly scalable analytic database as the backend. This spike is to get hands-on with ClickHouse and see how well it suits our needs. This isn't intended to be a load or scalability test, but rather to answer these questions:

  • How easy is it to run ClickHouse in Tutor?
  • How many resources does it consume on the host machine when running in a Tutor environment?
  • How performant is it under a small-scale load locally?
  • How easy is it to query the data back out?

AC:

Do the following and document the necessary steps, SQL, etc. as well as the outcomes...

  • Stand up ClickHouse as a Tutor plugin locally
  • Write a script to send randomized xAPI JSON events into a single ClickHouse table, starting with 1 million, and document how it goes and how long the inserts take. If it goes quickly and smoothly, try 1 billion. (A sketch of what those inserts could look like follows this list.)
    • The events will need to include enrollments/unenrollments for randomized users and courses, with enough events per user and course to be meaningful
  • Query data from the table and document how long each query runs (while the insert script is running); example SQL for these is sketched after this list
    • Count "total enrollments for this course"
    • Count "total enrollments for this organization" (presumably this means parsing out the org from the course key, it doesn't need to be perfect)
    • Count "total enrollments for this course in the past 2 hours"
    • Count "total number of events for this learner"
  • If time permits, it would be worth running a couple of more complicated analytics queries (also sketched below), such as...
    • Number of enrollments for this course - number of unenrollments
    • Sliding window of enrollments over the past minute
    • For every course, how many enrollments - unenrollments have there been in the past 5 minutes
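
For illustration, a minimal sketch of what the loading script's inserts could look like, assuming a hypothetical flattened `xapi.events` table; the column names, verb IRI, and values are all placeholders, not a decided schema:

```sql
-- Hypothetical table, columns, and verb IRI; the real schema is an output of this spike.
-- Each row carries the broken-out query fields plus the full xAPI JSON document as a string.
INSERT INTO xapi.events (event_id, verb, actor_id, org, course_run_id, emission_time, event)
VALUES (generateUUIDv4(), 'http://adlnet.gov/expapi/verbs/registered', 'learner-0001', 'edX',
        'course-v1:edX+DemoX+Demo_Course', now64(3), '{"actor": "...", "verb": "...", "object": "..."}');
```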
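
The count queries above could be sketched like this, under the same assumptions (hypothetical `xapi.events` table and columns, with ADL's `registered` IRI standing in for the enrollment verb):

```sql
-- "Total enrollments for this course"
SELECT count()
FROM xapi.events
WHERE course_run_id = 'course-v1:edX+DemoX+Demo_Course'
  AND verb = 'http://adlnet.gov/expapi/verbs/registered';

-- "Total enrollments for this organization", parsing the org out of the course key
SELECT count()
FROM xapi.events
WHERE splitByChar('+', splitByChar(':', course_run_id)[2])[1] = 'edX'
  AND verb = 'http://adlnet.gov/expapi/verbs/registered';

-- "Total enrollments for this course in the past 2 hours"
SELECT count()
FROM xapi.events
WHERE course_run_id = 'course-v1:edX+DemoX+Demo_Course'
  AND verb = 'http://adlnet.gov/expapi/verbs/registered'
  AND emission_time >= now() - INTERVAL 2 HOUR;

-- "Total number of events for this learner"
SELECT count()
FROM xapi.events
WHERE actor_id = 'learner-0001';
```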
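
And the more complicated analytics queries, again as rough sketches only (the `unregistered` verb IRI in particular is an assumption):

```sql
-- Enrollments minus unenrollments for one course
SELECT countIf(verb = 'http://adlnet.gov/expapi/verbs/registered')
     - countIf(verb = 'http://id.tincanapi.com/verb/unregistered') AS net_enrollments
FROM xapi.events
WHERE course_run_id = 'course-v1:edX+DemoX+Demo_Course';

-- Enrollments over the past minute, bucketed into 10-second windows
SELECT toStartOfInterval(emission_time, INTERVAL 10 SECOND) AS bucket,
       count() AS enrollments
FROM xapi.events
WHERE verb = 'http://adlnet.gov/expapi/verbs/registered'
  AND emission_time >= now() - INTERVAL 1 MINUTE
GROUP BY bucket
ORDER BY bucket;

-- Net enrollments per course over the past 5 minutes
SELECT course_run_id,
       countIf(verb = 'http://adlnet.gov/expapi/verbs/registered')
     - countIf(verb = 'http://id.tincanapi.com/verb/unregistered') AS net_enrollments
FROM xapi.events
WHERE emission_time >= now() - INTERVAL 5 MINUTE
GROUP BY course_run_id
ORDER BY net_enrollments DESC;
```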
bmtcril self-assigned this Nov 10, 2022

bmtcril commented Nov 21, 2022

An update on where I'm at with this. All tests have been on my M1 Mac under Docker Desktop with 4 CPUs, 10GB memory, and 1GB swap assigned, and no other containers running.

  • I have a test Tutor plugin that runs the latest ClickHouse Docker image, creates an xapi database, creates users, and assigns permissions. It allows some customization of the ClickHouse server through template overrides.
  • It was fairly easy to set up the plugin; ClickHouse itself was very easy.
  • I created scripts for loading randomized instances of each xAPI event type defined in event-routing-backends, with a distribution matching what edx.org sees.
  • At rest and with 200k test rows, ClickHouse consumes about 1% CPU and 300MB of RAM.
  • I have updated the loading script's ClickHouse configuration to follow some basic ClickHouse best practices (using a Buffer table for inserts, and breaking out the fields we want to query by into their own columns). A sketch of that schema appears after this list.
  • I have used my script to generate several rounds of data, up to 100 million rows, with this configuration:
    • ENGINE MergeTree ORDER BY (course_run_id, verb, emission_time)
    • PRIMARY KEY (course_run_id, verb)
  • Fully loaded with that configuration:
    • Querying only by event_id timed out (not indexed, expected)
    • Querying only by actor_id timed out (not indexed, expected)
    • Querying a count of all event types for a course with ~30k events: 14 sec
    • Querying by emission time and verb timed out (not indexed, expected)
    • Querying by course_id and verb: 499ms
    • Querying registrations - unregistrations by course id: 900ms
    • Querying registrations - unregistrations by course id and emission date for the last 30 days: 900ms
    • Backing up the whole set to disk: 100000000 rows in set. Elapsed: 19288.384 sec. Processed 100.00 million rows, 145.28 GB (5.18 thousand rows/s., 7.53 MB/s.)
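
For reference, a minimal sketch of a schema matching the configuration described above. The column list, types, and names (and the Buffer thresholds) are illustrative, not the exact DDL used in the test:

```sql
CREATE DATABASE IF NOT EXISTS xapi;

-- Base MergeTree table with the fields we query by broken out into columns
CREATE TABLE xapi.events
(
    event_id       UUID,
    verb           String,
    actor_id       String,
    org            String,
    course_run_id  String,
    emission_time  DateTime64(3),
    event          String        -- full xAPI JSON document
)
ENGINE = MergeTree
ORDER BY (course_run_id, verb, emission_time)
PRIMARY KEY (course_run_id, verb);

-- Buffer table that inserts are pointed at; it batches writes in memory and
-- flushes them to xapi.events once the time/row/byte thresholds are reached
CREATE TABLE xapi.events_buffer AS xapi.events
ENGINE = Buffer(xapi, events, 16, 10, 100, 10000, 1000000, 10000000, 100000000);
```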
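
The backup line above looks like clickhouse-client output from streaming the whole table out; one way to do that kind of dump (an assumption about the method, not necessarily what was run) is:

```sql
-- Dump the full table to a local file in ClickHouse's Native format
SELECT *
FROM xapi.events
INTO OUTFILE 'xapi_events_backup.native'
FORMAT Native;
```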

Next steps:

  • Try for a billion row load
  • Automatically run the more relevant queries from this ticket throughout the load, and capture the timings
  • Document for real / share findings

bmtcril commented Nov 29, 2022

I was able to complete a billion-row test with remarkable success, and am writing up the findings here: https://openedx.atlassian.net/wiki/spaces/COMM/pages/3593109505/Clickhouse+as+an+LRS+Data+Store

I'll close this ticket out when the details are all completed there.

bmtcril commented May 25, 2023

This has been done for a while, ClickHouse is selected, Ralph integration is done. 👍

bmtcril closed this as completed May 25, 2023