Bad requests with GTM #508
-
I have noticed a lot of bad requests from archive.org's crawler on our sites using Google Tag Manager. For instance:
These are starting to add noticeable load on the server (which serves many sites). I understand Heritrix is speculatively trying URLs based on the JavaScript code, which is known to sometimes result in 404s. But GTM is used on many websites, so these issues are bad for everybody. Could this speculation be improved to take Google's code into account? Alternatively, is there a way to disable that speculation with robots.txt?
Replies: 4 comments
-
Hi @damien-git, you should probably drop the Internet Archive a note (mailto:info@archive.org), as they may be able to tune the behaviour of their crawler. In general, I personally do not recommend that Heritrix users use the speculative JavaScript extractor at all. It seems to cause more trouble than it's worth. I quite like the idea of tuning the crawl via robots.txt, but we should probably look at deprecating or improving the ExtractorJS or KnowledgableExtractorJS processors first. If we can find out which extractor they are using, that might help.
-
I've run a variant of ExtractorJS for years that lets me filter out the links it discovers using a set of regular expressions. These are applied before the links are turned into full URLs, which makes it a bit easier to target common false positives in JS libraries than it would be if we did the filtering in the scope. You also don't risk catching any URLs extracted via other (more reliable) means. Looking at the above, I should probably filter out any links extracted via ExtractorJS containing "gtm."
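To illustrate the idea (this is not the actual Heritrix variant, just a minimal Python sketch of the same approach): the reject patterns are matched against the raw strings pulled out of the JavaScript, and only the survivors are resolved against the page URL. The function name, the pattern list, and the sample inputs are all hypothetical.

```python
import re
from urllib.parse import urljoin

# Hypothetical reject patterns, matched against the raw extracted string
# (before it is resolved into an absolute URL).
REJECT_PATTERNS = [
    re.compile(r"gtm\."),   # Google Tag Manager false positives such as "gtm.js"
    re.compile(r"^ak\."),   # mPulse-style keys such as "ak.bpcip"
]


def filter_speculative_links(raw_links, base_url, reject_patterns=REJECT_PATTERNS):
    """Drop raw candidates matching any reject pattern, then resolve the rest."""
    kept = []
    for raw in raw_links:
        if any(pattern.search(raw) for pattern in reject_patterns):
            continue  # known false positive: never becomes a URL
        kept.append(urljoin(base_url, raw))
    return kept


links = filter_speculative_links(
    ["gtm.js", "ak.bpcip", "scripts/app.js"],
    "http://website/foo/bar/",
)
print(links)
```

Because the filter only sees strings that came from the speculative extractor, a URL containing "gtm." that was found by another extractor would be unaffected.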
-
We (Akamai) are seeing a similar issue with sites that have our mPulse product enabled, which includes JavaScript in the page's HTML that looks like this:
This results in our customers' websites getting crawled by numerous crawlers on each page for those 20+ elements of the array, e.g.: http://website/foo/bar/ak.bpcip
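As a rough illustration of how such a request can arise (a sketch of generic relative-URL resolution, not of Heritrix's actual code): if a crawler treats a string from the inline script as a relative link, resolving it against the page URL yields exactly this kind of path. The page URL and the function name below are hypothetical; only the "ak.bpcip" string comes from the example above.

```python
from urllib.parse import urljoin


def resolve_candidate(page_url, candidate):
    """Resolve a string found in inline JavaScript as if it were a relative link."""
    return urljoin(page_url, candidate)


# Hypothetical page URL; "ak.bpcip" is one of the array elements mentioned above.
speculative_url = resolve_candidate("http://website/foo/bar/", "ak.bpcip")
print(speculative_url)  # http://website/foo/bar/ak.bpcip
```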
-
+1. I am using Google Tag Manager and the crawler is making many requests containing "/gtm.js".