
Add dynamic sitemap generation route and caching mechanism #2234

Merged

Conversation

@zahraaalizadeh (Contributor) commented May 23, 2024

This change:

  • Adds a route to dynamically generate sitemap.xml on the first request.
  • Implements a caching mechanism that serves the generated sitemap.xml from disk on subsequent requests.
  • Updates the Flask app to include the generate_sitemap function.
  • Ensures the sitemap is generated during the initial deployment and reused afterward.

This change improves SEO by ensuring the sitemap is always available and up to date, with minimal performance overhead.

issue: #1639
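The generate-on-first-request flow described above can be sketched as a minimal Flask route. The helper names, cache path, and sitemap contents below are illustrative assumptions, not the PR's actual code:

```python
import os
import tempfile

from flask import Flask, Response

app = Flask(__name__)

# Hypothetical cache location; the real PR writes into a sitemaps directory.
_SITEMAP_PATH = os.path.join(tempfile.mkdtemp(), "sitemap.xml")


def _build_sitemap_xml() -> str:
  # Stand-in for the real generator, which enumerates vulnerability pages.
  urls = ["https://osv.dev/list", "https://osv.dev/vulnerability/EXAMPLE-1"]
  entries = "".join(f"<url><loc>{u}</loc></url>" for u in urls)
  return ('<?xml version="1.0" encoding="UTF-8"?>'
          '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
          + entries + '</urlset>')


@app.route("/sitemap.xml")
def sitemap():
  # Generate once on the first request; later requests hit the cached file.
  if not os.path.exists(_SITEMAP_PATH):
    with open(_SITEMAP_PATH, "w") as f:
      f.write(_build_sitemap_xml())
  with open(_SITEMAP_PATH) as f:
    return Response(f.read(), mimetype="application/xml")
```

Because the cached file persists for the lifetime of the deployment, only the first request pays the generation cost.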

@zahraaalizadeh (Contributor, Author) commented May 23, 2024

Here is a sample sitemap generated by this change:
sitemap.xml.zip

@zahraaalizadeh force-pushed the issue-1639/improve-seo-web-crawling branch 2 times, most recently from bb2bb26 to 0d2353f on May 30, 2024 09:34
@zahraaalizadeh force-pushed the issue-1639/improve-seo-web-crawling branch 5 times, most recently from 7aaa4a5 to 0329ec4 on June 4, 2024 06:13
@another-rex (Contributor) left a comment

Per-ecosystem sitemaps are a good start, though some ecosystems (Git, for example) have over 30,000 entries, which is approaching the 50,000-URL limit of sitemaps. I think it's fine for now, but maybe add a TODO to document this limitation.

Can you also see if performing the generation within the Dockerfile will work?

sitemaps_directory = get_sitemaps_directory()
sitemap_path = f"{sitemaps_directory}/(unknown)"
if not os.path.exists(sitemap_path):
  generate_sitemap.generate_sitemap_for_ecosystem(cleaned_ecosystem,
Contributor

If it does not exist, I think we should just 404 rather than generating it on the fly.

Contributor Author

Updated!
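The review outcome above (return a 404 when the pre-generated file is missing, rather than building it inside the request) can be sketched as a small handler. The function name and the (status, body) return shape are illustrative assumptions:

```python
import os


def sitemap_response(sitemap_path: str):
  # Hypothetical handler sketch: if the sitemap was not generated at
  # deployment time, respond 404 instead of generating it on the fly.
  if not os.path.exists(sitemap_path):
    return 404, b""
  with open(sitemap_path, "rb") as f:
    return 200, f.read()
```

Failing fast keeps request latency predictable and surfaces missing-deployment-step bugs instead of hiding them behind slow first requests.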

sitemaps_directory = get_sitemaps_directory()
sitemaps_index_path = f"{sitemaps_directory}/{_SITEMAP_INDEX_FILENAME}"
if not os.path.exists(sitemaps_index_path):
  generate_sitemap.generate_sitemaps(get_base_url())
Contributor

If it does not exist, I think we should just 404 rather than generating it on the fly.

Contributor Author

Updated!

def main() -> int:
  parser = argparse.ArgumentParser(description='Generate sitemaps.')
  parser.add_argument(
      '--base_url', required=True, help='The base URL for the sitemap entries.')
Contributor

The sitemap only needs to be generated for the main osv.dev site, and not for test.osv.dev or any other deployment, so we can probably hard-code the base URL in the Dockerfile.

Contributor Author

I added df3dccf but couldn't test it locally, due to an auth failure during the Docker image build.

lastmod.text = datetime.datetime.now().isoformat()

tree = ElementTree(sitemapindex)
tree.write(_SITEMAP_INDEX_PATH, encoding='utf-8', xml_declaration=True)
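The index-writing step shown above can be fleshed out into a self-contained sketch. The function name and signature here are assumptions for illustration, not the PR's exact helpers:

```python
import datetime
from xml.etree.ElementTree import Element, ElementTree, SubElement


def write_sitemap_index(path: str, sitemap_urls: list[str]) -> None:
  # Build a <sitemapindex> with one <sitemap> entry per child sitemap,
  # each carrying a <loc> URL and a <lastmod> timestamp.
  sitemapindex = Element(
      'sitemapindex', xmlns="http://www.sitemaps.org/schemas/sitemap/0.9")
  for url in sitemap_urls:
    sitemap = SubElement(sitemapindex, 'sitemap')
    loc = SubElement(sitemap, 'loc')
    loc.text = url
    lastmod = SubElement(sitemap, 'lastmod')
    lastmod.text = datetime.datetime.now().isoformat()
  tree = ElementTree(sitemapindex)
  tree.write(path, encoding='utf-8', xml_declaration=True)
```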
Contributor

We should probably add some compression here, as the contents are highly compressible (I'm seeing compressed sizes down to about 10% of the original), and we only need to do it once onto disk; then we can just send the compressed file directly.

Contributor Author

Added aeb15e9 to save and serve the sitemaps from .gz files, while the URLs are still displayed as .xml.

[Screenshot 2024-06-06 at 10:06:25 AM]
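A compress_file helper along these lines could do the one-time gzip step. The name compress_file comes from the snippets in this review, but the body below is an illustrative guess at it:

```python
import gzip
import os
import shutil


def compress_file(path: str) -> str:
  # Gzip the sitemap once on disk and drop the uncompressed original,
  # so responses can stream the .gz bytes directly.
  gz_path = path + ".gz"
  with open(path, "rb") as src, gzip.open(gz_path, "wb") as dst:
    shutil.copyfileobj(src, dst)
  os.remove(path)
  return gz_path
```

Since sitemap XML is highly repetitive, gzip typically shrinks it by an order of magnitude, matching the ratio the reviewer observed.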

urlset = Element(
    'urlset', xmlns="http://www.sitemaps.org/schemas/sitemap/0.9")

for vuln in vulnerability_ids:
Contributor

Limit this to 50,000 (or 49,999) to avoid generating an invalid sitemap.

Contributor Author

Updated!
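One way to respect the protocol's 50,000-URL-per-file cap is to split the ID list into fixed-size slices, one sitemap file per slice. This is an illustrative sketch only (the PR may simply cap the list; the constant and helper name are assumptions):

```python
# Hypothetical cap, one below the sitemap protocol's 50,000-URL limit.
_MAX_URLS_PER_SITEMAP = 49_999


def chunk_vulnerability_ids(ids):
  # Yield successive slices small enough for one sitemap file each.
  for i in range(0, len(ids), _MAX_URLS_PER_SITEMAP):
    yield ids[i:i + _MAX_URLS_PER_SITEMAP]
```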

@zahraaalizadeh force-pushed the issue-1639/improve-seo-web-crawling branch from 0329ec4 to df3dccf on June 6, 2024 01:27
for ecosystem in ecosystems:
  sitemap = SubElement(sitemapindex, "sitemap")
  loc = SubElement(sitemap, 'loc')
  loc.text = compress_file(get_sitemap_url_for_ecosystem(ecosystem, base_url))
Contributor

Is compress_file supposed to be here?

compress_file(get_sitemap_filename_for_ecosystem(ecosystem))

generate_sitemap_index(base_ecosystems, base_url)
compress_file(_SITEMAP_INDEX_PATH)
Contributor

The index file is already compressed and deleted on line 109

Contributor Author

Yes, this was an unwanted consequence of a rebase. It should be fixed now.

@zahraaalizadeh force-pushed the issue-1639/improve-seo-web-crawling branch from df3dccf to f9603c4 on June 7, 2024 03:02
@another-rex (Contributor) left a comment

Thanks!

This change adds a generate_sitemap.py script that accepts
a base URL as an argument. The script generates a sitemap
index file, including individual sitemaps for each ecosystem.

The sitemaps are then converted to a compressed version.

It also implements unit tests to ensure proper functionality
and coverage of the generate_sitemap script.
@zahraaalizadeh force-pushed the issue-1639/improve-seo-web-crawling branch from f9603c4 to c3693b5 on June 7, 2024 05:19
@another-rex merged commit 94d68e8 into google:master on Jun 7, 2024
11 checks passed
@G-Rath G-Rath deleted the issue-1639/improve-seo-web-crawling branch August 20, 2024 19:25