New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Bundles xapi-db-load into a command-line tool with setuptools #1
Conversation
@bmtcril Are you ok with these changes? I had to make them in order to use this code to (optionally) loading demo xAPI data with my OARS tutor plugin. |
@bmtcril and out of interest, I assume that the xAPI schemas used here more closely match Ralph's not the event-routing-backends? I will make my Superset permissions filter flexible so it will suit any (flat) xAPI course field imported into Clickhouse, but was just wondering for my own notes. |
I'll have to dig through and see what I've got pending. I think this is fine, I just want to make sure any big refactors I've done have already landed otherwise it might be easier to land them first then rebase. The events generated here are straight from the event-routing-backends tests right now: https://github.com/openedx/event-routing-backends/tree/master/event_routing_backends/processors/xapi/tests/fixtures/expected There are definitely some things that can be improved, especially w/r/t ordering of events. Right now I think it enrolls learners in courses before firing other events for them, but everything else happens in a random order and will make for some extremely confusing visualizations. |
Sadly I did have a large branch running for the Ralph tests. If you want to take a spin through it here, I can merge it and then fight through the rebase on this PR to get it in: #2 |
@bmtcril I'm more than happy to rebase -- my changes are simple, they only look big. Please let me know when you've merged your major work, and I'll reapply my changes. |
Ok, I've merged my big mess and will start treating this as a real shared repo now. Thanks! |
@@ -20,9 +20,10 @@ def __init__(self, db_host="localhost", db_port=18123, db_username="default", | |||
password=db_password, | |||
port=self.port, | |||
database=self.database, | |||
date_time_input_format="best_effort", # Allows RFC dates | |||
old_parts_lifetime=10, # Seconds, reduces disk usage |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@bmtcril I get this error when I include 'old_parts_lifetime': 10
in the client settings:
clickhouse_connect.driver.exceptions.ProgrammingError: Setting old_parts_lifetime is unknown or readonly
So I've added it to the MergeTable CREATE statement below instead. From what you've learned about Clickhouse, is that ok?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That should be fine here. I wonder if there's a version mismatch in one of our installs, though. If we're going to be using this elsewhere we should probably do the work to make it an officially supported repo. I'll ask around about what people think about that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@bmtcril Or maybe we need to pin the version of clickhouse-connect
? I've got this, what are you running?
clickhouse-connect==0.5.3
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've got 0.3.6 so am on the other side of this breaking change: https://github.com/ClickHouse/clickhouse-connect/blob/main/CHANGELOG.md#050-2023-01-14
I think pinning it to >0.5 and <0.6 is probably the best since they seem to do breaking changes on minor versions.
Thanks for merging your changes @bmtcril ! I've rebased and re-tested, and fixed some things (see individual commits). I'm working on a 3rd PR against https://github.com/open-craft/tutor-contrib-superset with scripts to initialize the clickhouse data source, and add a chart + dashboard for https://github.com/openedx/data-wg/issues/26. |
-- video_id String NULL, | ||
-- nav_starting_point String NULL, | ||
-- nav_ending_point String NULL, | ||
course_id String NOT NULL, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These changes are all fine, this is mostly a note for myself to come back and change the rest of this create statement to the one that was performing better in the Ralph tests:
CREATE TABLE IF NOT EXISTS {self.event_table_name} (
event_id UUID NOT NULL,
verb_id String NOT NULL,
actor_id UUID NOT NULL,
org String NOT NULL,
course_id String NOT NULL,
emission_time timestamp NOT NULL,
event JSON NOT NULL
)
ENGINE MergeTree ORDER BY (org, course_id, verb_id, actor_id, emission_time,
event_id)
PRIMARY KEY (org, course_id, verb_id, actor_id, emission_time, event_id)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah cool -- I've left the extra fields alone (since the data loading requires them), but c106f35 adds those additional fields to ORDER BY
and PRIMARY KEY
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, thanks. I'll normalize those between the clickhouse and Ralph backends in a different PR. That table structure is going to all need to be updated after some discussions I just had with FUN, but I haven't decided on exactly how yet.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is great, I left some comments but I think the only real change needed is to the help string.
Just like the other one, @pomegranited this repository has been moved into the openedx GitHub org. Can you confirm here that you're good with contributing this PR to openedx under your CLA? |
@bmtcril I've addressed your suggestions, please have a look? Thanks for moving this repo into the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm happy to approve this as-is and resolve the outstanding comments in a follow-on PR so as to unblock openedx-unsupported/tutor-contrib-clickhouse#1
@@ -20,9 +20,10 @@ def __init__(self, db_host="localhost", db_port=18123, db_username="default", | |||
password=db_password, | |||
port=self.port, | |||
database=self.database, | |||
date_time_input_format="best_effort", # Allows RFC dates | |||
old_parts_lifetime=10, # Seconds, reduces disk usage |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've got 0.3.6 so am on the other side of this breaking change: https://github.com/ClickHouse/clickhouse-connect/blob/main/CHANGELOG.md#050-2023-01-14
I think pinning it to >0.5 and <0.6 is probably the best since they seem to do breaking changes on minor versions.
-- video_id String NULL, | ||
-- nav_starting_point String NULL, | ||
-- nav_ending_point String NULL, | ||
course_id String NOT NULL, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, thanks. I'll normalize those between the clickhouse and Ralph backends in a different PR. That table structure is going to all need to be updated after some discussions I just had with FUN, but I haven't decided on exactly how yet.
@pomegranited I added you as a grandfathered-in "writer" to this repo and tutor-contrib-clickhouse since you were already working on them. I'm happy to sponsor you to the analytics CC group, which would likely make all the management easier. These repos are all in there as well, if you're ok with it I'll make a post about it in the forums! |
Thanks for the pull request, @pomegranited! Please note that it may take us up to several weeks or months to complete a review and merge your PR. Feel free to add as much of the following information to the ticket as you can:
All technical communication about the code itself will be done via the GitHub pull request interface. As a reminder, our process documentation is here. Please let us know once your PR is ready for our review and all tests are green. |
Thank you :) I'll merge this, and get it tagged.
Thank you, yes please! I'm already a core contributor to the legacy analytics repos, but this was granted before we had these CC repo groups. |
Modularizes the source code so that it can be installed and run as a command line script, and adds missing dependencies.
Clickhouse client no longer accepts arbitrary settings with each request. cf https://github.com/ClickHouse/clickhouse-connect/blob/main/CHANGELOG.md#warning----breaking-change----removing-get_client-arbitrary-keyword-arguments * Retains the date_time_input_format as a general setting for all calls * Moves old_parts_lifetime to the MergeTree table creation because errors are thrown if it's used by the client on non-MergeTree tables: clickhouse_connect.driver.exceptions.ProgrammingError: Setting old_parts_lifetime is unknown or readonly * Ensures events table gets created with all the expected fields * Tweaks the order_by and primary key fields for performance
@bmtcril Ach, sorry.. I squashed commits and git wants a new approving review. (Tried reverting back to the previously approved commit, and it didn't buy it.) I'll remember for next time and just squash on merge! |
@pomegranited 🎉 Your pull request was merged! Please take a moment to answer a two question survey so we can improve your experience in the future. |
...and handles a breaking change with query settings from
clickhouse-connect==0.5.0
.Related discussions
Testing instructions
E.g inside a docker container with access to
clickhouse_service
from https://github.com/bmtcril/tutor-contrib-clickhouse :pip install .
to install it and the latest dependencies.xapi-db-load --backend clickhouse...
to ensure that the script still runs as expected.Author Notes & Concerns
0.1
after merging? Then I can reference it in my tutor init scripts, and we don't have to worry about api changes on the script.