[FIX] website_blog: Saner crawler instructions #30832

Open · wants to merge 1 commit into base 10.0 from Tecnativa:10.0-website_blog-saner_tag_archive_crawling
Conversation

@Yajo (Contributor) commented Feb 5, 2019

The fixes that this patch includes are:

  • The tag URL for tags printed in short and expanded blog posts had no slug, thus producing a brand-new URL for the same page. Now they include the slug (see the sketch after this list).
  • The <a> elements in the right-column archives now have rel="nofollow".
  • The <a> elements in the right-column tag cloud now have rel="nofollow" when more than 1 tag is used.
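
For illustration, here is a minimal Python sketch of the two URL shapes involved. The slugify helper below is hypothetical; it only mimics the name-id slugs Odoo generates and is not the actual implementation:

```python
import re

def slugify(name, record_id):
    """Build a 'name-id' slug, loosely mimicking the slugs Odoo generates."""
    base = re.sub(r"[^a-z0-9]+", "-", name.lower()).strip("-")
    return "%s-%d" % (base, record_id)

tag_name, tag_id = "Functional Testing", 3

# Before the patch: an ID-only URL, a duplicate of the canonical page.
url_without_slug = "/blog/tag/%d" % tag_id
# After the patch: the canonical slugged URL.
url_with_slug = "/blog/tag/" + slugify(tag_name, tag_id)

print(url_without_slug)  # /blog/tag/3
print(url_with_slug)     # /blog/tag/functional-testing-3
```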

To understand why, imagine a website with:

  • 1 blog post per week.
  • For the last 3 years.
  • Using 100 different tags.
  • The website admin has activated the tag cloud, archives, and tags per post.

With behavior prior to this patch:

  1. There have been 3*52=156 blog posts.
  2. There will be 156/20=7.8, so 8, pages of posts.
  3. There will be 12*3=36 untagged archive links.
  4. There will be 100 single-tag links which don't use the slug (only the ID).
  5. There will be 100 single-tag links with the slug.
  6. There will be 100^100=1×10²⁰⁰ multi-tag links, all with the slug.
  7. Summing up the last 3 points, the crawler has to gather 100+100^100^36=Infinity pages that only add duplicated content (reproduced in the sketch after this list).
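
The arithmetic above can be reproduced directly. The numbers below follow the description's own assumptions (20 posts per paginator page, and the 100^100 figure as an upper bound on multi-tag links):

```python
posts = 3 * 52                 # 156 blog posts
pages = posts / 20.0           # 7.8 -> about 8 paginator pages
archive_links = 12 * 3         # 36 monthly archive links
single_tag_no_slug = 100       # ID-only tag URLs (pure duplicates)
single_tag_with_slug = 100     # canonical slugged tag URLs
multi_tag_links = 100 ** 100   # the description's upper bound

# Even the conservative count of unordered multi-tag combinations
# (subsets of 100 tags with at least 2 tags) is astronomically large.
conservative = 2 ** 100 - 100 - 1

print(posts, pages, archive_links)        # 156 7.8 36
print(len(str(multi_tag_links)) - 1)      # 200 -> multi_tag_links == 1e200
print("%.2e" % conservative)              # ~1.27e+30
```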

The result of this was:

  • If your site is interesting enough, crawlers will probably eat all your CPU resources, and backend users will notice a big lag.
  • Crawlers punish duplicated content, so you'd get an effectively infinite number of pages that penalize you.
  • The crawl weight of tags grows exponentially.

With the current patch:

  • All tag links contain a slug.
  • All tag links with more than 1 tag are marked as nofollow, so an obedient crawler will only index single-tag-with-slug pages, which actually enhances SEO.
  • Links with dates are no longer followed. The crawler will reach the post content via the paginator and index the posts themselves, which is what actually matters.
  • Adding a tag still has a crawler cost, but it is linear in the amount of content under that tag (see the sketch after this list).
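
As a rough sketch of the rule this patch applies: single-tag links stay followable, multi-tag links get rel="nofollow". This is plain Python for illustration only; the real change lives in the QWeb templates, and the comma-joined multi-tag URL format is an assumption here:

```python
def tag_link(slugged_tags):
    """slugged_tags: list of 'name-id' slugs, e.g. ['news-1', 'odoo-2']."""
    href = "/blog/tag/" + ",".join(slugged_tags)
    # More than one tag -> a filter combination, not canonical content.
    rel = ' rel="nofollow"' if len(slugged_tags) > 1 else ""
    return '<a href="%s"%s>%s</a>' % (href, rel, " + ".join(slugged_tags))

print(tag_link(["news-1"]))            # followable, canonical slug URL
print(tag_link(["news-1", "odoo-2"]))  # nofollow: an obedient crawler skips it
```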

--
I confirm I have signed the CLA and read the PR guidelines at www.odoo.com/submit-pr
@Tecnativa

@robodoo robodoo added the seen 🙂 label Feb 5, 2019

@pedrobaeza (Contributor) commented Feb 5, 2019

@JKE-be what do you think about this?

@robodoo robodoo added the CI 🤖 label Feb 5, 2019

@mart-e mart-e added the Website label Feb 6, 2019

@JKE-be (Contributor) commented Feb 8, 2019

[FIX] website_blog: Saner crawler instructions
The fixes that this patch includes are:

- The tag URL for tags printed in short and expanded blog posts had no slug, thus producing a brand-new URL for the same page. Now they include the slug.
- The `<a>` elements in the right-column archives now have `rel="nofollow"`.
- The `<a>` elements in the right-column tag cloud now have `rel="nofollow"` when more than 1 tag is used.

To understand why, imagine a website with:

- 1 blog post per week.
- For the last 3 years.
- Using 100 different tags.
- The website admin has activated the tag cloud, archives, and tags per post.

With behavior prior to this patch:

1. There have been 3*52=156 blog posts.
2. There will be 156/20=7.8, so 8, pages of posts.
3. There will be 12*3=36 untagged archive links.
4. There will be 100 single-tag links which don't use the slug (only the ID).
5. There will be 100 single-tag links with the slug.
6. There will be 100^100=1×10²⁰⁰ multi-tag links, all with the slug.
7. Summing up the last 3 points, the crawler has to gather [100+100^100^36=Infinity][1] pages that only add duplicated content.

The result of this was:

- If your site is interesting enough, crawlers will probably eat all your CPU resources, and backend users will notice a big lag.
- Crawlers punish duplicated content, so you'd get an effectively infinite number of pages that penalize you.
- The crawl weight of tags grows exponentially.

With the current patch:

- All tag links contain a slug.
- All tag links with more than 1 tag are marked as `nofollow`, so an obedient crawler will only index single-tag-with-slug pages, which actually enhances SEO.
- Links with dates are no longer followed. The crawler will reach the post content via the paginator and index the posts themselves, which is what actually matters.
- Adding a tag still has a crawler cost, but it is linear in the amount of content under that tag.

[1]: https://duckduckgo.com/?q=100%2B100%5E100%5E36&ia=calculator

@Yajo Yajo force-pushed the Tecnativa:10.0-website_blog-saner_tag_archive_crawling branch from 8a1a8af to aa544c2 Feb 11, 2019

@Yajo (Contributor, Author) commented Feb 11, 2019

I updated the patch with the parts that were actual improvements. Thanks @JKE-be!

@robodoo robodoo added CI 🤖 and removed CI 🤖 labels Feb 11, 2019
