
MBS-12239: Fix "too many URLs" in incremental sitemaps #3092

Merged

mwiencek merged 2 commits into metabrainz:master from the mbs-12239 branch on Mar 8, 2024

Conversation

mwiencek (Member)

Problem

MBS-12239

If the incremental sitemaps have not run for some time, then many changes can accumulate which are too numerous to bundle into a single sitemap without exceeding `$MAX_SITEMAP_SIZE`.

Solution

I implemented batching code for the incremental sitemaps which is mostly copied from the overall sitemaps code.
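For illustration, here is a minimal sketch of the batching idea, not this PR's actual code: `batch_urls` is a hypothetical helper, and the 50,000 value for `$MAX_SITEMAP_SIZE` is assumed from the sitemaps.org per-file URL limit.

```perl
use strict;
use warnings;

# Assumed value: the sitemaps.org protocol caps a single sitemap file
# at 50,000 URLs. The real constant lives in the sitemaps code.
my $MAX_SITEMAP_SIZE = 50_000;

# Split a flat list of URLs into batches of at most $MAX_SITEMAP_SIZE,
# each of which can then be written out as its own sitemap file.
sub batch_urls {
    my @urls = @_;
    my @batches;
    push @batches, [ splice @urls, 0, $MAX_SITEMAP_SIZE ] while @urls;
    return @batches;
}

# Hypothetical usage: my @batches = batch_urls(@changed_urls);
```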

Testing

I copied these changes into the production musicbrainz-sitemaps container and ran the hourly script. It appears to have resolved the errors.

@mwiencek (Member Author)

Hint: View the diff with "Hide whitespace" enabled.

@yvanzo (Contributor) left a comment

Made minor suggestions only. I’m a bit lost with one of your comments in the code.

lib/MusicBrainz/Server/Sitemap/Incremental.pm (inline review comments, resolved)
Comment on lines 228 to 234
# The batching code is mostly copied from `build_one_entity` in
# `Sitemap::Overall`. However, the logic to always put the last batch
# in its own sitemap is not performed here, since we're not as
# concerned about which sitemap a URL starts in for an *incremental*
# dump. (As mentioned previously, it's very unlikely there will be
# more than one batch, anyway, due to the code that attempts to merge
# adjacent ones.)
@yvanzo (Contributor)

I’m not sure I understand the differences mentioned in the second sentence. What does it mean to put the last batch in its own sitemap? Why are we not as concerned about which sitemap a URL starts in for an incremental dump?

Also, about the sentence in parentheses: what if there is still more than one batch? Where is the code that attempts to merge adjacent ones? (And which ones? Batches?)

@mwiencek (Member Author)

> I’m not sure I understand the differences mentioned in the second sentence. What does it mean to put the last batch in its own sitemap? Why are we not as concerned about which sitemap a URL starts in for an incremental dump?

I tried to improve the comment in `Sitemap::Overall::build_one_entity` to explain the batching more, and also the comment here. Does it help at all?

> Also, about the sentence in parentheses: what if there is still more than one batch? Where is the code that attempts to merge adjacent ones? (And which ones? Batches?)

Not sure I follow the first question (the PR makes it possible for there to be more than one batch). The code to merge adjacent batches for incremental dumps is below this comment (via the `for my $raw_batch (@$raw_batches)` loop). For the overall sitemaps, it's in `Sitemap::Overall::build_one_entity`, which the comment here references.
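(An illustrative sketch of that merging loop, not the code as committed; the batch representation is an assumption, with each raw batch as an array-ref of URLs in ID order:)

```perl
my @merged;
for my $raw_batch (@$raw_batches) {
    # If the previous batch still has room for this one, fold it in;
    # otherwise start a new batch. Adjacent batches are thus combined
    # whenever the result stays within $MAX_SITEMAP_SIZE.
    if (@merged && @{ $merged[-1] } + @$raw_batch <= $MAX_SITEMAP_SIZE) {
        push @{ $merged[-1] }, @$raw_batch;
    }
    else {
        push @merged, [ @$raw_batch ];
    }
}
# @merged now holds the combined batches, each within the size cap.
```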

@yvanzo (Contributor)

To be honest, I’m not even sure what/why/how batches are to start with. There seem to be many prerequisites missing to allow any understanding. This isn’t new to this pull request, though, as I’ve been under the same impression when reading `Sitemap::Overall::build_one_entity`. Maybe architecture documentation in the wiki and/or a Markdown file would be helpful. I’ve just created MBS-13508 (and a Sitemaps sub-component under Back-end) for whenever you’d like to help us understand how things are and how they work. 🗺️

@mwiencek (Member Author)

OK, thanks for the ticket. ;) The batching is just a way to split the sitemap files into smaller chunks. For example, the overall sitemap files `sitemap-artist-$batch.xml.gz` contain every artist URL in MusicBrainz... which is a lot of URLs.

Sitemap files have a `lastmod` timestamp associated with them, which is stored in the sitemap index. So if a particular batch hasn't changed since the previous build, we preserve its `lastmod` time. That indicates to crawlers that that particular file doesn't need to be downloaded or checked again. (You can imagine how this would be pointless if all artist URLs were contained in a single file.)

Additionally, in many sitemap files we have `lastmod` times for the individual URLs. These times indicate when the JSON-LD on the page changed.
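(To make the two levels of `lastmod` concrete, here is an illustrative fragment in the standard sitemaps.org format; the file name, MBID placeholder, and dates are made up:)

```xml
<!-- Sitemap index: one lastmod per sitemap file, i.e. per batch. -->
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://musicbrainz.org/sitemap-artist-1.xml.gz</loc>
    <lastmod>2024-03-01T00:00:00Z</lastmod>
  </sitemap>
</sitemapindex>

<!-- Inside a sitemap file: an optional lastmod per URL, here meaning
     "when this page's JSON-LD last changed". -->
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://musicbrainz.org/artist/$mbid</loc>
    <lastmod>2024-03-08T12:00:00Z</lastmod>
  </url>
</urlset>
```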

But on pages which don't have JSON-LD, the sitemaps basically just list all the URLs for a particular path suffix. (In that case, the `lastmod` of a batch only changes if a URL is removed, because an entity was merged or deleted, or if a new URL is added, which can only happen in the very last batch.)

This PR adds batching to the incremental sitemaps, which output the URLs of pages whose JSON-LD has changed in the past hour. Normally there aren't 50,000 such changes in an hour, except in the extreme case I mentioned. Since these URLs change hourly, the `lastmod` properties of the incremental batches don't help much (they will change every hour anyway). The batching for these is mainly a way to work around a hard-coded limit (`$MAX_SITEMAP_SIZE`) without making that limit configurable.

@yvanzo (Contributor) left a comment

This works at least, and is already running in production, so let’s just merge this pull request and start documentation later on, with MBS-13508 as a reminder. 🚢

@mwiencek merged commit eee631e into metabrainz:master on Mar 8, 2024
2 of 3 checks passed
@mwiencek deleted the mbs-12239 branch on March 8, 2024 at 16:15