This repository was archived by the owner on Aug 26, 2022. It is now read-only.

Bug 1190171 - Revision content denormalization and other cleanups of the compare revision view.#3386

Merged
groovecoder merged 6 commits into master from bug1190171 on Aug 10, 2015
Conversation

@jezdez (Contributor) commented Aug 4, 2015

  • Adds ratelimiting to the compare revision view.
  • Refactors the signal handling to use new patterns and less confusing code.
  • Prefetch revision tags and document for fewer queries.
  • Denormalize the revision content in a separate field that is filled on save and uses tidy to do the CPU-intense work once instead of on every request.
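The denormalization in the last bullet can be sketched as follows. This is plain Python, not the actual Kuma model, and the whitespace-normalizing `tidy` stand-in replaces the real tidy library; it only illustrates paying the CPU cost once at save time instead of on every request:

```python
class Revision:
    """Illustrative sketch of the denormalization pattern (hypothetical names)."""

    def __init__(self, content):
        self.content = content      # raw submitted content, preserved as-is
        self.tidied_content = ""    # denormalized field, filled on save

    def save(self):
        # In the PR this happens via a post_save signal and a Celery task
        # running tidy; here a trivial whitespace normalizer stands in.
        self.tidied_content = " ".join(self.content.split())

    def get_tidied_content(self):
        # Serve the precomputed value; no per-request tidy work.
        return self.tidied_content
```

The raw `content` field stays untouched, so the original submission is never lost to the cleanup step.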

@jezdez (Contributor, Author) commented Aug 4, 2015

So this does a bunch of things, which we don't have to do all at once, but I wanted to offer it anyway, since I think we shouldn't run tidy on every request to that view but should keep the tidied revision content on file instead.

The other changes are probably less troubling.

Contributor:

Ah, this is a good way to preserve the submitted data (in the content field) and store the tidied content. But should we populate the tidied_content field when it's accessed via get_tidied_content() or should we just populate it when the Revision is saved?


Contributor Author:

The idea is to not have to go through all the hundreds of thousands of revisions at once but to spread the work out a bit. Updating the tidied content on access is only useful for old revisions; the post_save hook is for all new revisions.
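The two population paths described above can be sketched like this (hypothetical function names and dict-based revisions, not the actual Kuma code):

```python
def tidy(html):
    # Stand-in for the real tidy call; the real one is CPU-intensive.
    return " ".join(html.split())

def on_post_save(revision):
    # Runs for every newly saved revision, so new content is always tidied.
    revision["tidied_content"] = tidy(revision["content"])

def get_tidied_content(revision):
    # Old revisions saved before the field existed are backfilled lazily,
    # on first access, instead of migrating all of them in one pass.
    if not revision["tidied_content"]:
        revision["tidied_content"] = tidy(revision["content"])
    return revision["tidied_content"]
```

New revisions never hit the lazy branch; only pre-existing rows pay the one-time backfill cost.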

Contributor:

Doh, sorry I overlooked that; I jumped straight to this models.py diff to see what you did here.

@groovecoder (Contributor)

❤️ using Django 1.7 app system.

@jezdez jezdez force-pushed the bug1190171 branch 2 times, most recently from 880d013 to ecd578f on August 6, 2015 07:07
@jezdez jezdez removed the not ready label Aug 6, 2015
@groovecoder (Contributor)

Needs a rebase now.

@groovecoder groovecoder assigned jezdez and unassigned groovecoder Aug 6, 2015
jezdez added 5 commits August 6, 2015 20:41
This makes use of Django's new app loading classes, which are guaranteed to connect the signals at startup; this is the recommended way of doing it.

This also moves the search index config into the wiki app.
That reduces the number of queries being done when rendering the view.
@groovecoder (Contributor)

Intern tests pass. Spot-check on get_tidied_content() from a django shell works as expected - empty at first, triggers a celery task, then it's populated next time. Saving a new revision populates tidied_content before get_tidied_content is called.

Contributor:

This is still 60/m?

@jezdez (Contributor, Author) commented Aug 7, 2015

I actually decided to increase the rate limit of the tidy_revision_content Celery task to 120/m. The reasoning is simple: we have 550k revisions right now, and with a rate limit of 15/m we'd wait 12 days just to run the tasks. On the other hand, MySQL currently handles UPDATE queries very fast and didn't come anywhere close to a performance ceiling when I looked at last week's stats. The Celery nodes can also absorb the CPU-intensive tidying more easily since we can spread it across their collective CPUs; they aren't fully utilized right now either.

That said, I agree that there is a big risk in scheduling Celery tasks on every call to get_tidied_content: assuming we create a big stack of tasks through normal site browsing, we're guaranteed to create duplicates. That's why I've added a simple cache-based check to the get_tidied_content function that prevents duplicate scheduling. It's set to time out after 3 days, after which a call to get_tidied_content would again schedule a task. The calculation for that timeout is (546235 / 120) / 60 / 24 = 3.1 days, the number of days it takes in theory to run through all revisions.
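The timeout arithmetic and the duplicate-scheduling guard can be sketched as follows. A plain dict stands in for Django's cache, and the key format and function names are illustrative, not the actual Kuma code:

```python
# Back-of-the-envelope from the comment above: at 120 tasks/minute,
# how long to work through ~546k revisions?
REVISIONS = 546_235
RATE_PER_MINUTE = 120
days = REVISIONS / RATE_PER_MINUTE / 60 / 24  # roughly 3.16 days

# Duplicate-scheduling guard: remember which revisions already have a
# pending task so browsing the same page twice doesn't enqueue it twice.
cache = {}        # real code would use Django's cache with a ~3-day TTL
scheduled = []    # stand-in for tidy_revision_content.delay(revision_id)

def schedule_tidy_task(revision_id):
    key = "revision-tidy-scheduled-%s" % revision_id
    if key in cache:
        return False             # a task is already pending for this revision
    cache[key] = True            # real code: cache.set(key, True, 3 * 24 * 3600)
    scheduled.append(revision_id)
    return True
```

The TTL is deliberately a bit longer than the full-sweep time, so a revision can only be re-enqueued after the backlog has plausibly been drained once.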

This will be filled on post_save via a Celery task, and occasionally on demand for older revisions (also via Celery). The task is rate-limited so that it can get through the ~550k revisions rather quickly.
@jezdez jezdez assigned groovecoder and unassigned jezdez Aug 7, 2015
@groovecoder (Contributor)

Note: It's set to 120/m, which I assume is correct. (Not 120/s as written in the comment. 😉)

@jezdez (Contributor, Author) commented Aug 10, 2015

@groovecoder woops, good catch!

Edit: Fixed.

groovecoder added a commit that referenced this pull request Aug 10, 2015
Bug 1190171 - Revision content denormalization and other cleanups of the compare revision view.
@groovecoder groovecoder merged commit 9120a26 into master Aug 10, 2015
@jezdez jezdez deleted the bug1190171 branch August 11, 2015 09:33