Skip to content

Fix issue#191: "RIS already open for ToeThread..." exception during https pages crawl over proxy#457

Merged
ato merged 1 commit intointernetarchive:masterfrom
ClemensRobbenhaar:issues191-https-over-proxy
Jan 17, 2022
Merged

Fix issue#191: "RIS already open for ToeThread..." exception during https pages crawl over proxy#457
ato merged 1 commit intointernetarchive:masterfrom
ClemensRobbenhaar:issues191-https-over-proxy

Conversation

@ClemensRobbenhaar
Copy link
Copy Markdown
Contributor

See issue#191 for the problem description.

Description of the Change

When using a proxy, HTTPS-Requests are handled in two steps:

a) the client sends a request of the form:

CONNECT <hostname>:<port> HTTP/1.1

to the proxy. The proxy then opens an encrypted connection to
the actual server and passes this to the client.

b) the client sends the data over the secure connection
as plain HTTP-request like

GET /some/path/name?param1=value1 HTTP/1.1

and receives the response in plain, too (as it is tunneled via an encrypted connection)

To support this behavior, the current code in the FetchHTTPRequest is changed in two places:

a) in the constructor around line 189 do not use the full URI as data to be send after the verb
if the method is HTTPS. Instead send the same line as without proxy

b) in the methods for the inner class RecordingHttpClientConnection check if we have an https request and a proxy in use. In that case do not wrap the first created input/output stream, but only the ones afterwards.

Here it would be better to check if the socket is the one used for the CONNECT or the one used to tunnel the actual request/response data (as the XXX comments suggests, which are already present in the code).
However this information is not available inside this class, so instead make the assumption that the input / output streams for the CONNECT are created first, and the one for the data to be recorded afterwards.

This avoids trying to use the same RecordingInput/OutputStream twice without the need to remove the safety checks in these classes which prevents erroneous reuse.

Testing

There is one new test testHttpsProxy in the FetchHTTPTests, but I am not sure if that is sufficiently complete.

Additionally I tested the changes manually using Apache HTTPD as a proxy, running on the same host as the Herirtix crawler.

The following snippet was used as configuration file /etc/apache2/sites-enabled/proxy-vhost.conf:

Listen 8000

<VirtualHost *:8000>
  ProxyRequests On

  ProxyVia On

  SSLProxyVerify none
  SSLProxyCheckPeerCN off
  SSLProxyCheckPeerName off

  <Proxy "*">
    Require ip 127.0.0 192.168
  </Proxy>

  LogFormat "%h %l %u %t \"%r\" %>s %b [%D] \"%{User-agent}i\"" withduration
  CustomLog ${APACHE_LOG_DIR}/proxy_access.log withduration

</VirtualHost>

Enabled modules:

  • mod_proxy
  • mod_ssl
  • mod_proxy_http
  • mod_proxy_connect

The the fetchHttp bean in the heritirix config has the proxy configured in the usual way:

 <property name="httpProxyHost" value="127.0.0.1" />
 <property name="httpProxyPort" value="8000" />

The HTTPS-Server used as target for the crawl was also an apache httpd; as it turned out this server is not affected by the first change (it seems not to mind if you send it requests which looks like proxy requests). The second change was necessary to crawl any HTTPS-URIs via proxy at all and did not affect the crawling of HTTP-URIs.

If you have change requests, suggestions for improvements or any other questions/comments about the patch, please let me know.

- do not send the scheme/host/port part of the request
  even when using a proxy, if the scheme is https
- in case of a https request send over a proxy, do not
  wrap the first input/output streams (which only contain
  the data for the `CONNECT` request), but the ones after
  them, which contain the actual data

Fixes internetarchive#191
@ato ato merged commit aa4f0f9 into internetarchive:master Jan 17, 2022
@ato
Copy link
Copy Markdown
Collaborator

ato commented Jan 17, 2022

Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants