Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bundles xapi-db-load into a command-line tool with setuptools #1

Merged
merged 2 commits into from Feb 6, 2023
Merged

Bundles xapi-db-load into a command-line tool with setuptools #1

merged 2 commits into from Feb 6, 2023

Conversation

pomegranited
Copy link
Contributor

@pomegranited pomegranited commented Jan 25, 2023

...and handles a breaking change with query settings from clickhouse-connect==0.5.0.

Related discussions

Testing instructions

E.g inside a docker container with access to clickhouse_service from https://github.com/bmtcril/tutor-contrib-clickhouse :

  1. Checkout this branch, and change into the repo directory.
  2. In a new python virtualenv, run pip install . to install it and the latest dependencies.
  3. Run xapi-db-load --backend clickhouse... to ensure that the script still runs as expected.

Author Notes & Concerns

  1. @bmtcril Can you tag this as 0.1 after merging? Then I can reference it in my tutor init scripts, and we don't have to worry about api changes on the script.

@pomegranited
Copy link
Contributor Author

@bmtcril Are you ok with these changes? I had to make them in order to use this code to (optionally) loading demo xAPI data with my OARS tutor plugin.

@pomegranited
Copy link
Contributor Author

pomegranited commented Jan 25, 2023

@bmtcril and out of interest, I assume that the xAPI schemas used here more closely match Ralph's not the event-routing-backends?

I will make my Superset permissions filter flexible so it will suit any (flat) xAPI course field imported into Clickhouse, but was just wondering for my own notes.

@bmtcril
Copy link
Contributor

bmtcril commented Jan 25, 2023

I'll have to dig through and see what I've got pending. I think this is fine, I just want to make sure any big refactors I've done have already landed otherwise it might be easier to land them first then rebase.

The events generated here are straight from the event-routing-backends tests right now: https://github.com/openedx/event-routing-backends/tree/master/event_routing_backends/processors/xapi/tests/fixtures/expected

There are definitely some things that can be improved, especially w/r/t ordering of events. Right now I think it enrolls learners in courses before firing other events for them, but everything else happens in a random order and will make for some extremely confusing visualizations.

@bmtcril
Copy link
Contributor

bmtcril commented Jan 25, 2023

Sadly I did have a large branch running for the Ralph tests. If you want to take a spin through it here, I can merge it and then fight through the rebase on this PR to get it in: #2

@pomegranited
Copy link
Contributor Author

@bmtcril I'm more than happy to rebase -- my changes are simple, they only look big. Please let me know when you've merged your major work, and I'll reapply my changes.

@bmtcril
Copy link
Contributor

bmtcril commented Jan 26, 2023

Ok, I've merged my big mess and will start treating this as a real shared repo now. Thanks!

@@ -20,9 +20,10 @@ def __init__(self, db_host="localhost", db_port=18123, db_username="default",
password=db_password,
port=self.port,
database=self.database,
date_time_input_format="best_effort", # Allows RFC dates
old_parts_lifetime=10, # Seconds, reduces disk usage
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bmtcril I get this error when I include 'old_parts_lifetime': 10 in the client settings:

clickhouse_connect.driver.exceptions.ProgrammingError: Setting old_parts_lifetime is unknown or readonly

So I've added it to the MergeTable CREATE statement below instead. From what you've learned about Clickhouse, is that ok?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That should be fine here. I wonder if there's a version mismatch in one of our installs, though. If we're going to be using this elsewhere we should probably do the work to make it an officially supported repo. I'll ask around about what people think about that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bmtcril Or maybe we need to pin the version of clickhouse-connect? I've got this, what are you running?

clickhouse-connect==0.5.3

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've got 0.3.6 so am on the other side of this breaking change: https://github.com/ClickHouse/clickhouse-connect/blob/main/CHANGELOG.md#050-2023-01-14

I think pinning it to >0.5 and <0.6 is probably the best since they seem to do breaking changes on minor versions.

@pomegranited
Copy link
Contributor Author

Thanks for merging your changes @bmtcril ! I've rebased and re-tested, and fixed some things (see individual commits).
So this PR and openedx-unsupported/tutor-contrib-clickhouse#1 are ready for your review :)

I'm working on a 3rd PR against https://github.com/open-craft/tutor-contrib-superset with scripts to initialize the clickhouse data source, and add a chart + dashboard for https://github.com/openedx/data-wg/issues/26.

README.rst Show resolved Hide resolved
-- video_id String NULL,
-- nav_starting_point String NULL,
-- nav_ending_point String NULL,
course_id String NOT NULL,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These changes are all fine, this is mostly a note for myself to come back and change the rest of this create statement to the one that was performing better in the Ralph tests:

            CREATE TABLE IF NOT EXISTS {self.event_table_name} (
            event_id UUID NOT NULL,
            verb_id String NOT NULL,
            actor_id UUID NOT NULL,
            org String NOT NULL,
            course_id String NOT NULL,
            emission_time timestamp NOT NULL,
            event JSON NOT NULL
            )
            ENGINE MergeTree ORDER BY (org, course_id, verb_id, actor_id, emission_time,
             event_id)
            PRIMARY KEY (org, course_id, verb_id, actor_id, emission_time, event_id)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah cool -- I've left the extra fields alone (since the data loading requires them), but c106f35 adds those additional fields to ORDER BY and PRIMARY KEY.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, thanks. I'll normalize those between the clickhouse and Ralph backends in a different PR. That table structure is going to all need to be updated after some discussions I just had with FUN, but I haven't decided on exactly how yet.

Copy link
Contributor

@bmtcril bmtcril left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is great, I left some comments but I think the only real change needed is to the help string.

@bmtcril
Copy link
Contributor

bmtcril commented Feb 1, 2023

Just like the other one, @pomegranited this repository has been moved into the openedx GitHub org. Can you confirm here that you're good with contributing this PR to openedx under your CLA?

@pomegranited
Copy link
Contributor Author

@bmtcril I've addressed your suggestions, please have a look?

Thanks for moving this repo into the openedx fold :)
I'm happy to contribute this PR to openedx under my CLA. Do I need to also request merge access to it as part of my CC role, and get this and the other OARS repos added to the "analytics" group?

bmtcril
bmtcril previously approved these changes Feb 2, 2023
Copy link
Contributor

@bmtcril bmtcril left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm happy to approve this as-is and resolve the outstanding comments in a follow-on PR so as to unblock openedx-unsupported/tutor-contrib-clickhouse#1

README.rst Show resolved Hide resolved
@@ -20,9 +20,10 @@ def __init__(self, db_host="localhost", db_port=18123, db_username="default",
password=db_password,
port=self.port,
database=self.database,
date_time_input_format="best_effort", # Allows RFC dates
old_parts_lifetime=10, # Seconds, reduces disk usage
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've got 0.3.6 so am on the other side of this breaking change: https://github.com/ClickHouse/clickhouse-connect/blob/main/CHANGELOG.md#050-2023-01-14

I think pinning it to >0.5 and <0.6 is probably the best since they seem to do breaking changes on minor versions.

-- video_id String NULL,
-- nav_starting_point String NULL,
-- nav_ending_point String NULL,
course_id String NOT NULL,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, thanks. I'll normalize those between the clickhouse and Ralph backends in a different PR. That table structure is going to all need to be updated after some discussions I just had with FUN, but I haven't decided on exactly how yet.

@bmtcril
Copy link
Contributor

bmtcril commented Feb 2, 2023

@pomegranited I added you as a grandfathered-in "writer" to this repo and tutor-contrib-clickhouse since you were already working on them. I'm happy to sponsor you to the analytics CC group, which would likely make all the management easier. These repos are all in there as well, if you're ok with it I'll make a post about it in the forums!

@bmtcril bmtcril changed the title Bundles xapi-db-load into a command-line tool with setuptools Bundle xapi-db-load into a command-line tool with setuptools Feb 2, 2023
@bmtcril bmtcril changed the title Bundle xapi-db-load into a command-line tool with setuptools Bundles xapi-db-load into a command-line tool with setuptools Feb 2, 2023
@openedx-webhooks openedx-webhooks added the open-source-contribution PR author is not from Axim or 2U label Feb 2, 2023
@openedx-webhooks
Copy link

Thanks for the pull request, @pomegranited! Please note that it may take us up to several weeks or months to complete a review and merge your PR.

Feel free to add as much of the following information to the ticket as you can:

  • supporting documentation
  • Open edX discussion forum threads
  • timeline information ("this must be merged by XX date", and why that is)
  • partner information ("this is a course on edx.org")
  • any other information that can help Product understand the context for the PR

All technical communication about the code itself will be done via the GitHub pull request interface. As a reminder, our process documentation is here.

Please let us know once your PR is ready for our review and all tests are green.

@pomegranited
Copy link
Contributor Author

@bmtcril

I added you as a grandfathered-in "writer" to this repo and tutor-contrib-clickhouse since you were already working on them.

Thank you :) I'll merge this, and get it tagged.

I'm happy to sponsor you to the analytics CC group, which would likely make all the management easier. These repos are all in there as well, if you're ok with it I'll make a post about it in the forums!

Thank you, yes please! I'm already a core contributor to the legacy analytics repos, but this was granted before we had these CC repo groups.

Modularizes the source code so that it can be installed
and run as a command line script, and adds missing dependencies.
Clickhouse client no longer accepts arbitrary settings with each request.

cf https://github.com/ClickHouse/clickhouse-connect/blob/main/CHANGELOG.md#warning----breaking-change----removing-get_client-arbitrary-keyword-arguments

* Retains the date_time_input_format as a general setting for all calls
* Moves old_parts_lifetime to the MergeTree table creation
  because errors are thrown if it's used by the client on non-MergeTree tables:
  clickhouse_connect.driver.exceptions.ProgrammingError: Setting old_parts_lifetime is unknown or readonly
* Ensures events table gets created with all the expected fields
* Tweaks the order_by and primary key fields for performance
@pomegranited
Copy link
Contributor Author

@bmtcril Ach, sorry.. I squashed commits and git wants a new approving review. (Tried reverting back to the previously approved commit, and it didn't buy it.) I'll remember for next time and just squash on merge!

@bmtcril bmtcril merged commit dbc1468 into openedx:main Feb 6, 2023
@openedx-webhooks
Copy link

@pomegranited 🎉 Your pull request was merged! Please take a moment to answer a two question survey so we can improve your experience in the future.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
open-source-contribution PR author is not from Axim or 2U
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

None yet

3 participants