Fix issue #191: "RIS already open for ToeThread..." exception during https pages crawl over proxy #457
Merged
ato merged 1 commit into internetarchive:master on Jan 17, 2022
- Do not send the scheme/host/port part of the request, even when using a proxy, if the scheme is https.
- In the case of an https request sent over a proxy, do not wrap the first input/output streams (which only contain the data for the `CONNECT` request), but the ones after them, which contain the actual data.

Fixes internetarchive#191
Collaborator
Thanks!
See issue #191 for the problem description.
Description of the Change
When using a proxy, HTTPS requests are handled in two steps:

a) The client sends a `CONNECT` request to the proxy. The proxy then opens a connection to the actual server and relays it to the client, which sets up the encrypted session through this tunnel.

b) The client sends its request over the secure connection as a plain HTTP request, with the same request line as without a proxy, and receives the response in plain form too (both are tunneled via the encrypted connection).
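The two request lines involved can be sketched as follows. This is only an illustration of the general proxy protocol, not the code from this patch: the class `ProxyRequestLines`, its method names and the `example.org` URI are hypothetical, and port 443 is assumed as the default when the URI names none.

```java
import java.net.URI;

/** Illustrative only: builds the two request lines used for HTTPS over a proxy. */
final class ProxyRequestLines {

    /** Step a): the line sent to the proxy to open the tunnel. */
    static String connectLine(URI target) {
        // Assumption: fall back to the default https port if none is given.
        int port = target.getPort() != -1 ? target.getPort() : 443;
        return "CONNECT " + target.getHost() + ":" + port + " HTTP/1.1";
    }

    /** Step b): the line sent through the tunnel, without scheme/host/port. */
    static String tunneledLine(String method, URI target) {
        String path = target.getRawPath();
        if (path == null || path.isEmpty()) {
            path = "/";
        }
        String query = target.getRawQuery();
        return method + " " + path + (query != null ? "?" + query : "") + " HTTP/1.1";
    }

    public static void main(String[] args) {
        URI uri = URI.create("https://example.org/index.html?x=1");
        System.out.println(connectLine(uri));
        System.out.println(tunneledLine("GET", uri));
    }
}
```

The second line is what a client would also send when talking to the server directly, which is why the tunneled request must not carry the full URI.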
To support this behavior, the current code in `FetchHTTPRequest` is changed in two places:

a) In the constructor (around line 189), do not use the full URI as the data to be sent after the verb if the scheme is https. Instead, send the same request line as without a proxy.

b) In the methods of the inner class `RecordingHttpClientConnection`, check whether we have an https request and a proxy in use. In that case, do not wrap the first created input/output streams, only the ones created afterwards.

Here it would be better to check whether the socket is the one used for the `CONNECT` or the one used to tunnel the actual request/response data (as the `XXX` comments already present in the code suggest). However, this information is not available inside this class, so instead the code assumes that the input/output streams for the `CONNECT` are created first and the ones for the data to be recorded afterwards.

This avoids using the same `RecordingInputStream`/`RecordingOutputStream` twice, without having to remove the safety checks in these classes that prevent erroneous reuse.
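The "skip the first stream, wrap the later ones" assumption can be illustrated with a minimal sketch. This is not the actual `RecordingHttpClientConnection` code: `TunnelAwareWrapper` and `RecordingStandIn` are hypothetical names, and a plain `FilterInputStream` subclass stands in for the recording stream.

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.InputStream;

/** Illustrative only: decides per stream whether it should be recorded. */
final class TunnelAwareWrapper {

    /** Hypothetical stand-in for a recording stream. */
    static final class RecordingStandIn extends FilterInputStream {
        RecordingStandIn(InputStream in) {
            super(in);
        }
    }

    private final boolean httpsViaProxy;
    private int streamsCreated = 0;

    TunnelAwareWrapper(boolean httpsViaProxy) {
        this.httpsViaProxy = httpsViaProxy;
    }

    /**
     * Assumption: for https over a proxy, the first stream created belongs
     * to the CONNECT exchange and is returned unwrapped; every later
     * stream carries the data to be recorded and is wrapped.
     */
    InputStream wrap(InputStream raw) {
        boolean isConnectStream = httpsViaProxy && streamsCreated == 0;
        streamsCreated++;
        return isConnectStream ? raw : new RecordingStandIn(raw);
    }

    public static void main(String[] args) {
        TunnelAwareWrapper wrapper = new TunnelAwareWrapper(true);
        InputStream raw = new ByteArrayInputStream(new byte[0]);
        System.out.println(wrapper.wrap(raw) == raw);                      // CONNECT stream, unwrapped
        System.out.println(wrapper.wrap(raw) instanceof RecordingStandIn); // data stream, wrapped
    }
}
```

A counter like this relies entirely on creation order, which is the trade-off described above: it is simpler than identifying the socket, but it would break if the streams were ever created in a different order.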
Testing
There is one new test, `testHttpsProxy`, in `FetchHTTPTests`, but I am not sure whether it is sufficiently complete.

Additionally, I tested the changes manually using Apache HTTPD as a proxy, running on the same host as the Heritrix crawler. The proxy virtual host was configured in `/etc/apache2/sites-enabled/proxy-vhost.conf`, with the following modules enabled:

- `mod_proxy`
- `mod_ssl`
- `mod_proxy_http`
- `mod_proxy_connect`

The `fetchHttp` bean in the Heritrix config has the proxy configured in the usual way.

The HTTPS server used as the target for the crawl was also an Apache HTTPD. As it turned out, this server is not affected by the first change (it does not seem to mind receiving requests that look like proxy requests). The second change was necessary to crawl any HTTPS URIs via the proxy at all, and it did not affect the crawling of HTTP URIs.
If you have change requests, suggestions for improvements or any other questions/comments about the patch, please let me know.