Bad requests with GTM #508

damien-git · 2019-02-18T17:36:46Z

I have noticed a lot of bad requests from archive.org's crawler on our sites using Google Tag Manager. For instance:

/in.tag/11/10/2024/gtm.start/
/mouseup.dismiss/11/10/2024/gtm.start
/mousedown.dismiss/
/gtm.load/gtm.start/11/10/2024
/json/11/10/2024/gtm.start
/11/10/2024/gtm.js/

These are starting to add noticeable load on the server (which serves many sites).

I understand Heritrix is speculatively trying URLs based on the JavaScript code, which is known to sometimes result in 404s. But GTM is used on many websites, so these issues are bad for everybody. Could this speculation be improved to take Google's code into account? Alternatively, is there a way to disable that speculation with robots.txt?

Answered by anjackson

anjackson (Maintainer) · 2019-02-19T11:07:41Z

Hi @damien-git, you should probably drop the Internet Archive a note (mailto:info@archive.org), as they may be able to tune the behaviour of their crawler.

In general, I personally do not recommend Heritrix users use the speculative JavaScript extractor at all. It seems to cause more trouble than it's worth.

I quite like the idea of tuning the crawl via robots.txt, but we should probably look at deprecating or improving the ExtractorJS or KnowledgableExtractorJS processors first.

If we can find out which extractor they are using, that might help.

kris-sigur (Maintainer) · 2019-02-20T07:45:59Z

I've run a variant of ExtractorJS for years that lets me filter out the links it discovers using a set of regular expressions. These are applied before the links are turned into full URLs, which makes it a bit easier to target common false positives in JS libraries than it would be if we were doing the filtering in the scope. You also don't risk catching any URLs extracted via other (more reliable) means.

Looking at the above, I should probably filter out any links extracted via ExtractorJS containing "gtm."
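To illustrate the idea, here is a minimal Python sketch of that kind of pre-resolution filter. It is not Heritrix code and not the actual ExtractorJS variant mentioned above; the names `REJECT_PATTERNS` and `filter_candidates`, and the second pattern, are hypothetical and chosen only for illustration.

```python
import re

# Hypothetical reject patterns. They are matched against the raw string
# pulled out of the JavaScript, before it is resolved into a full URL.
REJECT_PATTERNS = [
    re.compile(r"gtm\."),       # GTM event names such as "gtm.start", "gtm.js"
    re.compile(r"\.dismiss$"),  # assumed extra pattern for "mouseup.dismiss" etc.
]


def filter_candidates(candidates, reject_patterns=REJECT_PATTERNS):
    """Return the candidate strings that match none of the reject patterns."""
    return [
        c for c in candidates
        if not any(p.search(c) for p in reject_patterns)
    ]


candidates = ["gtm.start", "gtm.js", "mouseup.dismiss", "/images/logo.png", "about.html"]
print(filter_candidates(candidates))
```

Because the patterns only see strings that came from the JavaScript extractor, a link such as `/gtm.js` found by another extractor (for example in an HTML attribute) would not be affected.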

nicjansma · 2019-04-18T19:19:11Z

We (Akamai) are seeing a similar issue with sites that have our mPulse product enabled, which includes JavaScript in the page's HTML that looks like this:

var a=["ak.bpcip","ak.cport","..."];

This results in our customers' websites getting crawled by numerous crawlers, on each page, for each of the 20+ elements of that array, e.g.:

http://website/foo/bar/ak.bpcip
http://website/foo/bar/ak.cport
... etc
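
A rough Python sketch of how this kind of speculation can turn string literals into requests is shown below. It is a simplified stand-in, not the heuristic Heritrix or any other crawler actually uses: the regular expression and the function name `speculative_links` are assumptions made for illustration, and only the two array elements quoted above are used.

```python
import re
from urllib.parse import urljoin

# Simplified assumption: any double-quoted string that contains a dot and
# no whitespace is treated as a possible relative URL.
STRING_LITERAL = re.compile(r'"([^"\s]+\.[^"\s]+)"')


def speculative_links(page_url, script):
    """Resolve every dotted string literal in `script` against `page_url`."""
    return [urljoin(page_url, s) for s in STRING_LITERAL.findall(script)]


script = 'var a=["ak.bpcip","ak.cport"];'
for link in speculative_links("http://website/foo/bar/", script):
    print(link)
```

Under these assumptions, every page that embeds the script yields one bogus URL per array element, relative to that page's own path.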

poolerMF · 2021-12-31T18:37:24Z

+1

I am using Google Tag Manager, and the crawler is making many requests containing "/gtm.js".
