-
Notifications
You must be signed in to change notification settings - Fork 188
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add dynamic sitemap generation route and caching mechanism #2234
Add dynamic sitemap generation route and caching mechanism #2234
Conversation
6c1dc5e
to
d84cfc1
Compare
Here is a sample sitemap generated by this change: |
bb2bb26
to
0d2353f
Compare
7aaa4a5
to
0329ec4
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Per ecosystem sitemaps are a good start, though some ecosystems (Git for example) have over 30000 entries, which is approaching the 50000 url limit of sitemaps. I think it's fine for now, but maybe add in a TODO to document this limitation.
Can you also see if performing the generation within the Dockerfile will work?
gcp/appengine/frontend_handlers.py
Outdated
sitemaps_directory = get_sitemaps_directory() | ||
sitemap_path = f"{sitemaps_directory}/{filename}" | ||
if not os.path.exists(sitemap_path): | ||
generate_sitemap.generate_sitemap_for_ecosystem(cleaned_ecosystem, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If it does not exist, I think we should just 404 rather than generating it on the fly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated!
gcp/appengine/frontend_handlers.py
Outdated
sitemaps_directory = get_sitemaps_directory() | ||
sitemaps_index_path = f"{sitemaps_directory}/{_SITEMAP_INDEX_FILENAME}" | ||
if not os.path.exists(sitemaps_index_path): | ||
generate_sitemap.generate_sitemaps(get_base_url()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If it does not exist, I think we should just 404 rather than generating it on the fly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated!
def main() -> int: | ||
parser = argparse.ArgumentParser(description='Generate sitemaps.') | ||
parser.add_argument( | ||
'--base_url', required=True, help='The base URL for the sitemap entries.') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sitemap should only be generated for osv.dev main site, and not needed for test.osv.dev or any other deployment, so we can probably hard code in the base URL into the Dockerfile.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added df3dccf but couldn't test it locally, due to the auth failure during docker image build.
lastmod.text = datetime.datetime.now().isoformat() | ||
|
||
tree = ElementTree(sitemapindex) | ||
tree.write(_SITEMAP_INDEX_PATH, encoding='utf-8', xml_declaration=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should probably add some compression here, as the contents are easily compressible (I'm getting up to 10% compression ratio), and we only need to do it one time onto disk, then we can just send the compressed file directly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added aeb15e9 to save and serve the sitemaps from gz
files, while the urls are still displayed as xml
.
gcp/appengine/generate_sitemap.py
Outdated
urlset = Element( | ||
'urlset', xmlns="http://www.sitemaps.org/schemas/sitemap/0.9") | ||
|
||
for vuln in vulnerability_ids: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Limit this to 50000 (or 49999) to avoid generating an invalid sitemap.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated!
0329ec4
to
df3dccf
Compare
gcp/appengine/generate_sitemap.py
Outdated
for ecosystem in ecosystems: | ||
sitemap = SubElement(sitemapindex, "sitemap") | ||
loc = SubElement(sitemap, 'loc') | ||
loc.text = compress_file(get_sitemap_url_for_ecosystem(ecosystem, base_url)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is compress_file supposed to be here?
compress_file(get_sitemap_filename_for_ecosystem(ecosystem)) | ||
|
||
generate_sitemap_index(base_ecosystems, base_url) | ||
compress_file(_SITEMAP_INDEX_PATH) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The index file is already compressed and deleted on line 109
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes was an unwanted consequence of a rebase. It should be fixed now.
df3dccf
to
f9603c4
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
This change adds generate_sitemap.py script that accepts base URL as an argument. The script generates a sitemap index file including individual sitemaps for each ecosystem. The sitemaps are then converted to a compressed version. It also implements unit tests to ensure proper functionality and coverage of generate_sitemap script.
f9603c4
to
c3693b5
Compare
This change:
This change improves the SEO functionality by ensuring the sitemap is always available and up-to-date, with minimal performance overhead.
issue: #1639