Bad requests with GTM #508

Answered by anjackson
damien-git asked this question in Q&A
Hi @damien-git, you should probably drop the Internet Archive a note (mailto:info@archive.org), as they may be able to tune the behaviour of their crawler.

In general, I personally do not recommend Heritrix users use the speculative JavaScript extractor at all. It seems to cause more trouble than it's worth.
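To illustrate why speculative extraction tends to generate bad requests, here is a toy sketch in Python. It is not Heritrix's actual extractor logic: the regex, the `speculative_links` helper, the JavaScript snippet and the `example.org` URL are all invented for illustration. It assumes a naive heuristic that treats any quoted string containing a slash or a dot as a candidate URL.

```python
import re
from urllib.parse import urljoin

# Toy heuristic (an assumption, NOT Heritrix's real implementation):
# any quoted string without whitespace is a candidate, and it is kept
# if it contains a slash or a dot.
STRING_LITERAL = re.compile(r"""["']([^"'\s]+)["']""")


def speculative_links(js_source, base_url):
    """Return guessed absolute URLs from string literals in js_source."""
    candidates = []
    for match in STRING_LITERAL.finditer(js_source):
        text = match.group(1)
        if "/" in text or "." in text:
            # Resolve the guess against the page URL, as a crawler would.
            candidates.append(urljoin(base_url, text))
    return candidates


# Hypothetical tag-manager-style snippet: "gtm.js" here is an event name,
# not a file, but the heuristic cannot tell the difference.
snippet = 'var cfg = {"event": "gtm.js", "path": "/real/page.html"};'
links = speculative_links(snippet, "https://example.org/index.html")
print(links)
```

Under these assumptions the event name is resolved into a URL that does not exist on the site, alongside the one genuine path. That is the kind of false positive that shows up as bad requests in server logs.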

I quite like the idea of tuning the crawl via robots.txt, but we should probably look at deprecating or improving the ExtractorJS or KnowledgableExtractorJS processors first.

If we can find out which extractor they are using, that might help.

Replies: 4 comments

Answer selected by ato
Category: Q&A
Labels: archive.org (archive.org services not (just) Heritrix), question
5 participants

This discussion was converted from issue #232 on September 30, 2022 00:45.