Unexpected 404 link #648

Closed
sbesson opened this issue Aug 2, 2021 · 3 comments

@sbesson

sbesson commented Aug 2, 2021

A DOI URL which previously passed htmlproofer just broke, likely due to upstream website changes.

To reproduce

$ htmlproofer --as-links https://doi.org/10.1126/science.1082602
...
Checking 1 external link...
- 
  *  External link https://doi.org/10.1126/science.1082602 failed: 404 No error

HTML-Proofer found 1 failure!

With increased verbosity

$ htmlproofer --as-links https://doi.org/10.1126/science.1082602 --typhoeus-config='{"verbose":true}'
...
Connection #2 to host science.sciencemag.org left intact
Issue another request to this URL: 'https://science.sciencemag.org/content/300/5616/100'
Found bundle for host science.sciencemag.org: 0x55a6f03696e0 [can multiplex]
Re-using existing connection! (#2) with host science.sciencemag.org
Connected to science.sciencemag.org (104.18.24.238) port 443 (#2)
Using Stream ID: 7 (easy handle 0x55a6f043b7a0)
GET /content/300/5616/100 HTTP/2
Host: science.sciencemag.org
user-agent: Mozilla/5.0 (compatible; HTML Proofer/3.19.2; +https://github.com/gjtorikian/html-proofer)
accept: application/xml,application/xhtml+xml,text/html;q=0.9, text/plain;q=0.8,image/png,*/*;q=0.5

HTTP/2 404
date: Mon, 02 Aug 2021 08:09:57 GMT
content-type: text/html; charset=utf-8
content-length: 46103
accept-ranges: bytes
age: 0
cache-control: no-cache, must-revalidate
cf-railgun: 16eb96f076 stream 0.000000 0210 e6be
content-language: en
expires: Sun, 19 Nov 1978 05:00:00 GMT
set-cookie: ALT_LOGIN=IlsndHlwJyA9PiAnSldUJywgJ2FsZycgPT4gJ0hTMjU2J10i.e30=; path=/
vary: Accept-Encoding
via: 1.1 varnish
x-content-type-options: nosniff
x-drupal-cache: MISS
x-frame-options: SAMEORIGIN
x-generator: Drupal 7 (http://drupal.org)
x-highwire-sitecode: sci
x-highwire-smart-code: science_production
x-varnish: 1538811437
x-varnish-cache: 
x-varnish-ttl: 
cf-cache-status: DYNAMIC
expect-ct: max-age=604800, report-uri="https://report-uri.cloudflare.com/cdn-cgi/beacon/expect-ct"
server: cloudflare
cf-ray: 6785d3b4bc2d3628-MAN

Connection #2 to host science.sciencemag.org left intact
- 
  *  External link https://doi.org/10.1126/science.1082602 failed: 404 No error

HTML-Proofer found 1 failure!

The same command works for DOIs hosted by other websites, e.g. htmlproofer --as-links https://doi.org/10.1007/s00335-015-9587-6.

On the other hand, curl -IL https://doi.org/10.1126/science.1082602 reports 200 as expected. Is there a particular option that should be set for this specific URL or should it be excluded for now?

@gjtorikian
Owner

It seems to me that the website is blocking html-proofer based on its User-Agent header. Does the command work for you if you change the header, like this:

htmlproofer --as-links https://doi.org/10.1126/science.1082602 --typhoeus-config='{"verbose":true, "headers" : {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Brave Chrome/91.0.4472.164 Safari/537.36"}}'
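
The JSON passed to --typhoeus-config is easy to mistype when quoting it by hand. As a minimal sketch, assuming only Ruby's standard library, the command above could be assembled like this. The helper name htmlproofer_argv is hypothetical and not part of html-proofer; the flags and the User-Agent string are the ones from the command above.

```ruby
require "json"
require "shellwords"

# Browser-like User-Agent copied from the command above.
BROWSER_UA = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 " \
             "(KHTML, like Gecko) Brave Chrome/91.0.4472.164 Safari/537.36"

# Hypothetical helper (not an html-proofer API): build the argument list for
# an htmlproofer run that overrides the User-Agent via --typhoeus-config.
def htmlproofer_argv(url, user_agent, verbose: false)
  config = { "verbose" => verbose, "headers" => { "User-Agent" => user_agent } }
  ["htmlproofer", "--as-links", url, "--typhoeus-config=#{JSON.generate(config)}"]
end

argv = htmlproofer_argv("https://doi.org/10.1126/science.1082602", BROWSER_UA, verbose: true)
puts Shellwords.join(argv) # shell-escaped command line, safe to paste
# system(*argv)            # uncomment to run it; requires the html-proofer gem
```

Passing the arguments as an array to system avoids shell quoting altogether, which is why the sketch builds an argv rather than a single string.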

@sbesson
Author

sbesson commented Aug 2, 2021

Thanks @gjtorikian, your command indeed fixed the output of htmlproofer. Annoying as it is, the problem appears to be server-side, and I don't think this library should do anything differently unless you feel otherwise.
I will get in touch with the site owners and enquire whether the User-Agent could be added to the allowed list. Is

'User-Agent' => "Mozilla/5.0 (compatible; HTML Proofer/#{HTMLProofer::VERSION}; +https://github.com/gjtorikian/html-proofer)",
the right value?

@gjtorikian
Owner

It is. Thanks.
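
To check whether a site treats the two User-Agent values differently (for instance after the site owners update their allow list), the same URL can be requested once per value and the status codes compared. The following is a rough sketch using Ruby's standard Net::HTTP rather than html-proofer itself; the helper names are hypothetical, and the curl User-Agent shown in the usage note is an assumed example value.

```ruby
require "net/http"
require "uri"

# Default html-proofer User-Agent as it appears in the verbose log (3.19.2).
PROOFER_UA = "Mozilla/5.0 (compatible; HTML Proofer/3.19.2; " \
             "+https://github.com/gjtorikian/html-proofer)"

# Hypothetical helper: GET the URL with the given User-Agent, follow up to
# `limit` redirects, and return the final numeric status code.
def fetch_status(url, user_agent, limit: 5)
  uri = URI(url)
  limit.times do
    request = Net::HTTP::Get.new(uri)
    request["User-Agent"] = user_agent
    response = Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == "https") do |http|
      http.request(request)
    end
    location = response["location"]
    return response.code.to_i unless response.is_a?(Net::HTTPRedirection) && location
    uri = URI.join(uri.to_s, location) # Location may be relative
  end
  raise "too many redirects for #{url}"
end

# Map each labelled User-Agent to the status it receives. The fetcher is
# injectable so the comparison can be exercised without network access.
def compare_user_agents(url, agents, fetcher: method(:fetch_status))
  agents.transform_values { |ua| fetcher.call(url, ua) }
end

# Differing statuses for the same URL suggest User-Agent based filtering.
def user_agent_filtered?(statuses)
  statuses.values.uniq.size > 1
end
```

For example, compare_user_agents("https://doi.org/10.1126/science.1082602", { "proofer" => PROOFER_UA, "curl" => "curl/7.68.0" }) returns a hash of status codes keyed by label. What it returns today depends on the site's current behaviour.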
