[FIX] website_blog: Saner crawler instructions
The fixes that this patch includes are:

- The tag URL for tags printed in short and expanded blog posts had no slug, thus producing a brand new URL. Now they have the slug.
- The `<a>` elements in the right-column archives now have `rel="nofollow"`.
- The `<a>` elements in the right-column tag cloud, where more than 1 tag is used, now have `rel="nofollow"`.

To see why this matters, imagine a website with:

- 1 blog post per week,
- for the last 3 years,
- using 100 different tags,
- where the website admin has activated the tag cloud, archives, and per-post tags.

With the behavior prior to this patch:

1. There have been 3*52=156 blog posts.
2. There will exist 156/20=7.8 pages of posts.
3. There will exist 12*3=36 untagged archive links.
4. There will exist 100 single-tag links which don't use the slug (only the ID).
5. There will exist 100 single-tag links with the slug.
6. There will exist 100^100=1×10²⁰⁰ multi-tag links, all with slug.
7. Summarizing the last 3 points, the crawler will have to gather [100+100^100^36=Infinity] pages that only add duplicated content.

The result of this was:

- If your site is interesting enough, crawlers will probably eat all your CPU resources and backend users will notice a big lag.
- Crawlers punish duplicated content, so you'd get infinite pages that penalize you.
- The weight of tags becomes exponential.

With the current patch:

- All tag links contain a slug.
- All tag links with more than 1 tag are marked as `nofollow`, so an obedient crawler will only index single-tag-with-slug pages, which actually enhances SEO.
- All links with dates are not followed. The crawler will reach the post content via the paginator and will index the posts themselves, which is what actually matters.
- Adding a tag still has a crawler cost, but it is linear in the amount of content under that tag.

: https://duckduckgo.com/?q=100%2B100%5E100%5E36&ia=calculator
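The back-of-the-envelope numbers above can be checked with a short Python sketch. This is only an illustration of the commit message's estimate: the 20-posts-per-page paginator size and the 100^100 figure are the message's own assumptions (100^100 is a loose upper bound on multi-tag URL combinations, not an exact count):

```python
import math

posts = 3 * 52                    # 1 post per week for 3 years
print(posts)                      # 156

pages = math.ceil(posts / 20)     # assumed paginator size: 20 posts per page
print(pages)                      # 8 pages of posts

archive_links = 12 * 3            # one monthly archive link per month, 3 years
print(archive_links)              # 36

tags = 100
# The message's upper bound for multi-tag links: 100^100 = 1e200
print(tags ** tags == 10 ** 200)  # True
```

With `nofollow` on multi-tag and archive links, only the ~100 single-tag-with-slug pages plus the handful of paginator pages remain crawlable, which is linear in the amount of content rather than exponential.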