This repository has been archived by the owner on Apr 3, 2024. It is now read-only.

feat: backfill data to clickhouse in batches and same thread #74

Merged
Ian2012 merged 9 commits into main from cag/add-batching-for-backfill
Feb 12, 2024

Conversation

Ian2012
Contributor

@Ian2012 Ian2012 commented Feb 6, 2024

Description

This PR improves the backfill command so that it sends data in configurable batches and no longer uses Celery; the work runs in the same thread instead. Regular LMS traffic is not affected, and backfill time is reduced because ClickHouse handles batch inserts well and serialization is more performant.

This PR:

  • Improves the performance of the queries by using select_related when needed.
  • Uses Django paginators to fetch batches of objects.
  • Removes heavy objects from memory.
  • Reduces the memory footprint of the whole command to around the size of the object batch in memory
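
The batching described in the list above can be sketched in plain Python. This is only an illustration, not the code from this PR: `iter_batches`, `backfill`, and `send_batch` are hypothetical names, and a list of dicts stands in for the Django queryset and paginator.

```python
from typing import Callable, Iterator, List, Sequence


def iter_batches(rows: Sequence[dict], batch_size: int) -> Iterator[List[dict]]:
    """Yield consecutive slices of at most batch_size rows.

    The final, possibly shorter, slice is yielded too, so no rows are dropped.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(rows), batch_size):
        yield list(rows[start:start + batch_size])


def backfill(
    rows: Sequence[dict],
    batch_size: int,
    send_batch: Callable[[List[dict]], None],
) -> int:
    """Send rows one batch at a time, in the calling thread.

    send_batch is a hypothetical stand-in for the ClickHouse insert call.
    Only one batch is built at a time, so the extra memory used by the loop
    stays around the size of a single batch.
    """
    sent = 0
    for batch in iter_batches(rows, batch_size):
        send_batch(batch)
        sent += len(batch)
    return sent


# Seven fake rows sent in batches of three: two full batches and one partial.
rows = [{"pk": pk} for pk in range(1, 8)]
received: List[List[dict]] = []
total = backfill(rows, batch_size=3, send_batch=received.append)
```

With a real queryset the slicing would be done by the database (for example through Django's `Paginator`), but the control flow is the same: fetch one batch, send it, move on.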

@openedx-webhooks openedx-webhooks added the open-source-contribution PR author is not from Axim or 2U label Feb 6, 2024
@openedx-webhooks

Thanks for the pull request, @Ian2012! Please note that it may take us up to several weeks or months to complete a review and merge your PR.

Feel free to add as much of the following information to the ticket as you can:

  • supporting documentation
  • Open edX discussion forum threads
  • timeline information ("this must be merged by XX date", and why that is)
  • partner information ("this is a course on edx.org")
  • any other information that can help Product understand the context for the PR

All technical communication about the code itself will be done via the GitHub pull request interface. As a reminder, our process documentation is here.

Please let us know once your PR is ready for our review and all tests are green.

@Ian2012 Ian2012 force-pushed the cag/add-batching-for-backfill branch from d8d4c80 to bb364df Compare February 6, 2024 19:51
Contributor

@bmtcril bmtcril left a comment

Just a few nits, otherwise looks great

      """
      Return the queryset to be used for the insert
      """
-     return self.get_model().objects.all()
+     if start_pk:
+         return self.get_model().objects.filter(pk__gt=start_pk).order_by("pk")
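
The effect of the `pk__gt=start_pk` filter in the snippet above can be illustrated without a database. This is a hedged sketch: `get_rows` is a hypothetical stand-in for the sink method, and the fallback branch for a missing `start_pk` is an assumption, since the snippet does not show it.

```python
from typing import Iterable, List, Optional


def get_rows(rows: Iterable[dict], start_pk: Optional[int] = None) -> List[dict]:
    """Return rows ordered by primary key, optionally resuming after start_pk.

    Mirrors filter(pk__gt=start_pk).order_by("pk"): rows whose pk is less than
    or equal to start_pk are skipped, and the rest come back in ascending order.
    """
    ordered = sorted(rows, key=lambda row: row["pk"])
    if start_pk:
        return [row for row in ordered if row["pk"] > start_pk]
    # Assumed fallback (not shown in the snippet): return every row.
    return ordered


table = [{"pk": 5}, {"pk": 1}, {"pk": 3}, {"pk": 2}]
resumed = [row["pk"] for row in get_rows(table, start_pk=2)]
everything = [row["pk"] for row in get_rows(table)]
```

Because the check is a truthiness test, `start_pk=0` behaves like `None`, which is harmless when primary keys are positive integers.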
Contributor

One last performance improvement we could make here is to use .values() or .values_list(), which would save a bunch of overhead on creating model instances, instead returning a dict or tuple. values_list performs best, but dealing with tuples can be annoying. We could also then choose the fields we want to select to just what we use, potentially saving a lot of data transfer from the db and memory on the job.

Contributor Author

@Ian2012 Ian2012 Feb 7, 2024

Using values_list would break the serializer, while values would return the data already partially serialized (we would still need to deal with dates and related names), and we would need to implement a serializer method for every sink. I'm not sure about the gains here.
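
A small sketch of the trade-off discussed here, using plain Python data in place of Django results. Everything in it is hypothetical (`FIELDS`, `serialize`, the sample row); it only assumes that `values()` yields dict-like rows and `values_list()` yields tuples.

```python
import json
from datetime import datetime, timezone

# Hypothetical field order, as it would be passed to values_list().
FIELDS = ("id", "display_name", "created")


def serialize(row: dict) -> str:
    """Serialize a row by field name; dates still need explicit handling."""
    return json.dumps(
        {
            "id": row["id"],
            "display_name": row["display_name"],
            "created": row["created"].isoformat(),
        },
        sort_keys=True,
    )


created = datetime(2024, 2, 6, tzinfo=timezone.utc)
dict_row = {"id": 1, "display_name": "Demo", "created": created}  # values()-style
tuple_row = (1, "Demo", created)  # values_list()-style

as_json = serialize(dict_row)

# A name-based serializer cannot index a tuple by field name.
try:
    serialize(tuple_row)
    tuple_failed = False
except TypeError:
    tuple_failed = True

# Tuples work only after being paired with the field names again.
from_tuple = serialize(dict(zip(FIELDS, tuple_row)))
```

The dict row goes through unchanged apart from the date conversion, while the tuple row needs the field names re-attached first, which is the extra per-sink work mentioned above.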

Contributor

Yeah, I've seen some pretty significant gains in the past from it, but not always. If you're comfortable with the performance as it is we can leave it and revisit later if necessary.

@Ian2012 Ian2012 force-pushed the cag/add-batching-for-backfill branch 2 times, most recently from 1c5281e to e752256 Compare February 7, 2024 14:48
@itsjeyd itsjeyd added the core contributor PR author is a Core Contributor (who may or may not have write access to this repo). label Feb 8, 2024
@Ian2012 Ian2012 force-pushed the cag/add-batching-for-backfill branch from b2d450d to 358b2cf Compare February 12, 2024 15:05
fix: query course overviews with django model

fix: remove submitted_objects objects

chore: quality fixes

chore: remove old dump courses to clickhouse command

fix: fix initial page, tests, and quality

chore: handle pr comments

test: improve coverage

fix: paginator last page is included

test: fixing test
test: add test for get_queryset

test: add test for get_queryset
@Ian2012 Ian2012 force-pushed the cag/add-batching-for-backfill branch from f8d341d to a43189d Compare February 12, 2024 16:59
Contributor

@bmtcril bmtcril left a comment

🚀

@Ian2012 Ian2012 force-pushed the cag/add-batching-for-backfill branch from a43189d to 755f961 Compare February 12, 2024 17:25
test: add test for course_published changes

chore: quality fixes

test: add test for get_queryset changes in sinks

test: add test for should_dump_item in course_published

chore: use invalid emails

chore: quality fixes

test: add tests for course_published should_dump_item

tmp
@Ian2012 Ian2012 force-pushed the cag/add-batching-for-backfill branch from 84d9c64 to 69fa84e Compare February 12, 2024 17:29
@Ian2012 Ian2012 merged commit ea2c666 into main Feb 12, 2024
9 checks passed
@Ian2012 Ian2012 deleted the cag/add-batching-for-backfill branch February 12, 2024 18:08
@openedx-webhooks

@Ian2012 🎉 Your pull request was merged! Please take a moment to answer a two-question survey so we can improve your experience in the future.

4 participants