How to defeat CloudFlare challenges? #738
Comments
I came here to report what I believe to be fundamentally the same problem, which is not specific to cloudflare. (Happy to spike this out as a separate issue, if it is considered different.)

The documents I am checking may contain links to pages on websites that are ordinarily only accessible when logged in. This is expected. When such links are visited without being logged in (as is the case when accessed by linkchecker), the website returns 403 Forbidden and presents information about how to log in or redirects to a login page. If such links are visited by a human and the human logs in with credentials known to them, they are likely to be taken to the original URL as a logged in user. In that sense, the links are valid.

I don't know what the URLs will be in advance, nor which ones will be behind a login, so I can't define a rule to exclude the affected URLs. I just expect that there will be URLs of that type.

I wish to configure LinkChecker to accept URLs that reach a valid server that subsequently responds with 403 Forbidden, rather than treating them as an error. Returning a warning, rather than an error, would be fine.

How can I instruct LinkChecker to not report an error when the server responds with 403 Forbidden? Thanks.
Same status code, different question I suspect - fortunately because there is a solution the linkcheckerrc https://linkchecker.github.io/linkchecker/man/linkcheckerrc.html#url-checking-results Haven't tried it for this. A URL regular expression is required but a match any
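To make the idea of a rule with a required URL regular expression concrete, here is a minimal Python sketch of how such a rule could be evaluated. It is not LinkChecker code: the rule shape (a URL pattern plus an optional error-message pattern), the `is_ignored` function, and the use of `re.search` are all assumptions made for illustration. The linked linkcheckerrc page remains the authority on the actual option and its syntax.

```python
import re
from typing import Iterable, Optional, Tuple

# Hypothetical rule shape: (url_regex, message_regex or None).
Rule = Tuple[str, Optional[str]]


def is_ignored(url: str, message: str, rules: Iterable[Rule]) -> bool:
    """Return True if any rule matches the URL and, when given, the message."""
    for url_regex, message_regex in rules:
        if not re.search(url_regex, url):
            continue
        # A rule without a message pattern tolerates every error for the URL.
        if message_regex is None or re.search(message_regex, message):
            return True
    return False


# A match-any URL pattern that only tolerates "403 Forbidden" results.
RULES = [(r".*", r"^403 Forbidden")]

print(is_ignored("https://onlinelibrary.wiley.com/doi/10.1002/pip.4670020103",
                 "403 Forbidden", RULES))  # True
print(is_ignored("https://thesis.cse.unsw.edu.au/search",
                 "404 Not Found", RULES))  # False
```

Restricting the second pattern to 403 keeps genuinely broken links, such as the 404 in the log further down, visible as errors.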
Thank you for this @cjmayo. I had to overcome the hurdle that the On further investigation, I did actually discover that one of the URLs returning 403 Forbidden for me (an institution page on ResearchGate) is affected by the cloudflare blocking problem. If it were possible to validate the response that a normal user would be presented with when visiting the URL, rather than simply masking the problem like the
Summary
Our website has lists of publications with links to the original publishers. Many of these are protected by CloudFlare challenges, and are reported as 403 Forbidden by LinkChecker.
Steps to reproduce
Check a webpage that contains a link to a CloudFlare-protected page.
For instance,
The HACMS program: using formal methods to eliminate exploitable bugs
Actual result
I expect the link check to pass, as the link is valid. It actually gives a 403 Forbidden response.
Expected result
Environment
Configuration
DEBUG linkcheck.cmdline 2023-05-10 11:06:58,728 MainThread configuration: [('aborttimeout', 300),
('allowedschemes', []),
('authentication', []),
('checkextern', False),
('cookiefile', None),
('csv', {}),
('debugmemory', False),
('dot', {}),
('enabledplugins', []),
('externlinks', []),
('failures', {}),
('fileoutput', []),
('gml', {}),
('gxml', {}),
('html', {}),
('ignoreerrors', []),
('ignorewarnings', []),
('internlinks', []),
('localwebroot', None),
('logger', 'NoneLogger'),
('loginextrafields', {}),
('loginpasswordfield', 'password'),
('loginurl', None),
('loginuserfield', 'login'),
('maxfilesizedownload', 5242880),
('maxfilesizeparse', 1048576),
('maxhttpredirects', 10),
('maxnumurls', None),
('maxrequestspersecond', 10),
('maxrunseconds', None),
('nntpserver', None),
('none', {}),
('output', 'text'),
('pluginfolders', []),
('quiet', False),
('recursionlevel', -1),
('resultcachesize', 100000),
('robotstxt', True),
('sitemap', {}),
('sql', {}),
('sslverify', '/etc/ssl/certs/ca-certificates.crt'),
('status', True),
('status_wait_seconds', 5),
('text', {}),
('threads', 10),
('timeout', 60),
('trace', False),
('useragent',
'Mozilla/5.0 (compatible; LinkChecker/10.2.1; '
'+https://linkchecker.github.io/linkchecker/)'),
('verbose', False),
('warnings', True),
('xml', {})]
WARNING linkcheck.cmdline 2023-05-10 11:06:58,729 MainThread no files or URLs given
Logs
/usr/bin/linkchecker -F 'html/var/www/html/linkcheck/index.html' --no-status --ignore-url='^https?://twitter.com/' --ignore-url=print$ --ignore-url='^mailto:' --ignore-url='https?://scholar.google.com(.au)?/.*' --user-agent='Mozilla/5.0 (Windows NT 5.1; rv:38.0) Gecko/20100101 Firefox/38.0 SeaMonkey/2.35' --check-extern https://trustworthy.systems/publications
Read the documentation at https://linkchecker.github.io/linkchecker/
Write comments and bugs to https://github.com/linkchecker/linkchecker/issues
Start checking at 2023-05-10 11:11:43+011
URL        `https://thesis.cse.unsw.edu.au/search?search_query=heiser&search_by=Supervisor'
Name       `official list'
Parent URL https://trustworthy.systems/students/theses, line 221, col 4
Real URL   https://thesis.cse.unsw.edu.au/search?search_query=heiser&search_by=Supervisor
Check time 1.038 seconds
Size       145B
Result     Error: 404 Not Found

URL        `http://dx.doi.org/10.1002/%28SICI%291099-159X%28199611/12%294:6%3C399::AID-PIP148%3E3.0.CO;2-4'
Parent URL https://trustworthy.systems/people/?cn=Gernot%20Heiser, line 2246, col 1
Real URL   https://onlinelibrary.wiley.com/doi/10.1002/(SICI)1099-159X(199611/12)4:6%3C399::AID-PIP148%3E3.0.CO;2-4
Check time 1.304 seconds
Info       Redirected to
           `https://dx.doi.org/10.1002/(SICI)1099-159X(199611/12)4%3A6%3C399%3A%3AAID-PIP148%3E3.0.CO;2-4'.
           Redirected to
           `https://onlinelibrary.wiley.com/doi/10.1002/(SICI)1099-159X(199611/12)4:6%3C399::AID-PIP148%3E3.0.CO;2-4'.
Result     Error: 403 Forbidden

URL        `http://dx.doi.org/10.1002/%28SICI%291099-159X%28199609/10%294:5%3C355::AID-PIP145%3E3.0.CO;2-X'
Parent URL https://trustworthy.systems/people/?cn=Gernot%20Heiser, line 2270, col 1
Real URL   https://onlinelibrary.wiley.com/doi/10.1002/(SICI)1099-159X(199609/10)4:5%3C355::AID-PIP145%3E3.0.CO;2-X
Check time 1.131 seconds
Info       Redirected to
           `https://dx.doi.org/10.1002/(SICI)1099-159X(199609/10)4%3A5%3C355%3A%3AAID-PIP145%3E3.0.CO;2-X'.
           Redirected to
           `https://onlinelibrary.wiley.com/doi/10.1002/(SICI)1099-159X(199609/10)4:5%3C355::AID-PIP145%3E3.0.CO;2-X'.
Result     Error: 403 Forbidden

URL        `http://dx.doi.org/10.1002/pip.4670020103'
Parent URL https://trustworthy.systems/people/?cn=Gernot%20Heiser, line 2426, col 1
Real URL   https://onlinelibrary.wiley.com/doi/10.1002/pip.4670020103
Check time 1.287 seconds
Info       Redirected to
           `https://dx.doi.org/10.1002/pip.4670020103'.
           Redirected to
           `https://onlinelibrary.wiley.com/doi/10.1002/pip.4670020103'.
Result     Error: 403 Forbidden

URL        `http://onlinelibrary.wiley.com/doi/10.1002/cpe.597/abstract'
Parent URL https://trustworthy.systems/people/?cn=Gerwin%20Klein, line 1323, col 1
Real URL   https://onlinelibrary.wiley.com/doi/10.1002/cpe.597/abstract
Check time 2.509 seconds
Info       Redirected to
           `https://onlinelibrary.wiley.com/doi/10.1002/cpe.597/abstract'.
Result     Error: 403 Forbidden
Other notes
There are quite a few Python libraries out there that purport to bypass CloudFlare's protection.
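Short of bypassing the protection, a checker could at least tell a challenge page apart from an ordinary 403. The sketch below is one possible heuristic in Python, not something LinkChecker does today. It assumes that a challenge response carries a `cf-mitigated: challenge` header and that proxied responses carry `Server: cloudflare`. The function name `classify_403` and the returned labels are hypothetical.

```python
from typing import Mapping


def classify_403(status: int, headers: Mapping[str, str]) -> str:
    """Label a response so that challenge pages can be reported separately."""
    if status != 403:
        return "not-403"
    # HTTP header names are case-insensitive, so normalise before lookup.
    lowered = {name.lower(): value.strip().lower()
               for name, value in headers.items()}
    # Assumption: challenge pages are marked with this header.
    if lowered.get("cf-mitigated") == "challenge":
        return "cloudflare-challenge"
    # Served through the proxy, but possibly a genuine 403 from the origin.
    if lowered.get("server") == "cloudflare":
        return "cloudflare-unknown"
    return "other-403"


print(classify_403(403, {"CF-Mitigated": "challenge", "Server": "cloudflare"}))
# cloudflare-challenge
print(classify_403(403, {"Server": "nginx"}))
# other-403
```

A result of `cloudflare-challenge` could then be reported as a warning instead of an error, while `other-403` keeps its current treatment.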