Noisy alerts about 401s without auth challenge #158

Closed
kris-sigur opened this Issue Apr 27, 2016 · 0 comments

kris-sigur commented Apr 27, 2016

A 401 response is supposed to include an auth challenge (a WWW-Authenticate header), but in practice many sites erroneously return 401 without one (they should really be using 403).

When Heritrix encounters such a response, it logs the error in such a way that it is added to the alerts log. Since this isn't a problem with the crawler, the entry isn't very useful, and the flood of such errors can hide other, more serious and actionable ones.

Example entry from the alerts log:

Apr 27, 2016 1:47:31 PM org.archive.modules.fetcher.FetchHTTP extractChallenges
WARNING: Failed to extract auth challenge headers for uri with response status 401: http://aktravel.is/en/fundir-og-radstefnur/framkvaemd-radstefnu (in thread 'ToeThread #7: http://aktravel.is/en/fundir-og-radstefnur/framkvaemd-radstefnu'; in processor 'fetchHttp')

I suggest we change how these errors are handled and log them only in nonfatal-errors.log.
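
To illustrate the condition being discussed, here is a minimal sketch of how a 401 without a challenge could be detected. It is not the actual Heritrix change: `AuthChallengeCheck`, `hasChallenge` and `isChallengeless401` are hypothetical names, and the header map is assumed to have the shape `Map<String, List<String>>` (header name to values). A caller could use the result to send the message to the non-fatal error log rather than raising an alert; how that routing is wired inside Heritrix is not shown here.

```java
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Hypothetical helper, not part of Heritrix: spots 401 responses that carry no auth challenge. */
final class AuthChallengeCheck {

    private AuthChallengeCheck() {}

    /** True when a WWW-Authenticate header (name matched case-insensitively) has a non-blank value. */
    static boolean hasChallenge(Map<String, List<String>> headers) {
        for (Map.Entry<String, List<String>> entry : headers.entrySet()) {
            String name = entry.getKey();
            // Some header maps use a null key for the status line, so guard against it.
            if (name == null || !name.toLowerCase(Locale.ROOT).equals("www-authenticate")) {
                continue;
            }
            for (String value : entry.getValue()) {
                if (value != null && !value.trim().isEmpty()) {
                    return true;
                }
            }
        }
        return false;
    }

    /** True for the case described in this issue: status 401 with no usable challenge header. */
    static boolean isChallengeless401(int status, Map<String, List<String>> headers) {
        return status == 401 && !hasChallenge(headers);
    }
}
```

When `isChallengeless401` returns true, the problem lies with the site rather than the crawler, which is the argument for keeping it out of the alerts log.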

kris-sigur added a commit to kris-sigur/heritrix3 that referenced this issue May 2, 2016

nlevitt added a commit that referenced this issue May 2, 2016

Fixes issue #158 : Noisy alerts about 401s without auth challenge (#159)
* Fixes issue #158 : Noisy alerts about 401s without auth challenge

* Update test to account for non-fatal-error log not being empty on
non-auth 401s.

@kris-sigur kris-sigur closed this May 3, 2016

nlevitt added a commit to nlevitt/heritrix3 that referenced this issue Jun 7, 2016

Merge remote-tracking branch 'origin/master' into fix-test-errors
* origin/master:
  Setup TravisCI
  Set fetch status on curis when testing link extraction
  No link extraction on URI not successfully downloaded
  Fixes issue #158 : Noisy alerts about 401s without auth challenge (#159)
  Make Content-Location header url INFERRED not REFFER hop type since Content-Location is not for redirection (#151)
  fixes for kafka 0.9 (?)
  upgrade to kafka 0.9
  somewhat ugly fix to handle exceptions from the bean browser like java.lang.RuntimeException: not implemented at org.archive.modules.fetcher.BdbCookieStore$RestrictedCollectionWrappedList.get(BdbCookieStore.java:92)

nlevitt added a commit to vonrosen/heritrix3 that referenced this issue Jun 7, 2016

Merge branch 'fix-test-errors' into qa
* fix-test-errors:
  hopefully fix remaining serialization tests in oraclejdk8 by using ConcurrentSkipListMap instead of ConcurrentHashMap
  hopefully fix serialization tests in oraclejdk8 by using TreeSet instead of HashSet in KeyedProperties.java
  clear the history store at the beginning of testBasics(), because the other test might have run first
  yeesh... "cd .."  to get back to the right place to see the failure reports
  let me see what failed, travis
  Setup TravisCI
  Set fetch status on curis when testing link extraction
  No link extraction on URI not successfully downloaded
  Fixes issue #158 : Noisy alerts about 401s without auth challenge (#159)
  Make Content-Location header url INFERRED not REFFER hop type since Content-Location is not for redirection (#151)