Issue211: support dns over https if local DNS is not working / available #476

Merged
73 changes: 65 additions & 8 deletions docs/configuring-jobs.rst
Expand Up @@ -558,17 +558,74 @@ Scripting Console
[This section to be written. For now see the
`Heritrix3 Useful Scripts <https://github.com/internetarchive/heritrix3/wiki/Heritrix3%20Useful%20Scripts>`_ wiki page.]


Configuring HTTP Proxies
~~~~~~~~~~~~~~~~~~~~~~~~

There are two options to specify a proxy for crawling.

The command line options ``--proxy-host`` and ``--proxy-port`` can be used to define a proxy for all jobs.
If only the ``--proxy-host`` option is given, a default value of 8000 is used for the proxy port.
These proxy settings are also used when connecting to a "DNS-over-HTTP" server
(see the `section on DNS-over-HTTP <#configuring-dns-over-http-doh>`_ below).

Alternatively, one can define a per-job proxy via the ``httpProxyHost`` and ``httpProxyPort`` properties of the
``fetchHttp`` bean. These settings, if both are defined, override the global options. They also allow for
a user and password in the ``httpProxyUser`` and ``httpProxyPassword`` properties, which the global options do not
support, due to incompatibilities between the different supported Java versions.

The optional SOCKS5 proxy documented in the next section is likewise configured on a per-job basis; there are
currently no global options to define it.
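
The precedence described above (per-job settings first, global settings as fallback) can be sketched roughly as follows. This is a simplified illustration, not the Heritrix implementation: the class ``ProxyResolutionSketch``, its nested ``Proxy`` type and the ``resolve`` method are hypothetical names, SOCKS5 handling is left out, and the sketch assumes that the global options end up in the standard ``http.proxyHost`` / ``https.proxyHost`` JVM system properties.

```java
import java.util.Optional;

/** Hypothetical helper for illustration only; not part of Heritrix. */
final class ProxyResolutionSketch {

    /** Simple host/port pair (hypothetical type). */
    static final class Proxy {
        final String host;
        final int port;

        Proxy(String host, int port) {
            this.host = host;
            this.port = port;
        }

        @Override
        public String toString() {
            return host + ":" + port;
        }
    }

    /**
     * Picks the proxy for a request: per-job settings win when both host
     * and port are defined, otherwise the global system properties for the
     * request scheme are consulted.
     */
    static Optional<Proxy> resolve(String jobHost, Integer jobPort, String scheme) {
        if (jobHost != null && !jobHost.isEmpty() && jobPort != null) {
            return Optional.of(new Proxy(jobHost, jobPort));
        }
        // Global fallback: choose the property prefix by URI scheme.
        String prefix = "https".equalsIgnoreCase(scheme) ? "https" : "http";
        String host = System.getProperty(prefix + ".proxyHost");
        Integer port = Integer.getInteger(prefix + ".proxyPort");
        if (host != null && !host.isEmpty() && port != null) {
            return Optional.of(new Proxy(host, port));
        }
        // Assumption of this sketch: no usable setting means no proxy.
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Simulates "--proxy-host proxy.example.org" with the default port.
        System.setProperty("http.proxyHost", "proxy.example.org");
        System.setProperty("http.proxyPort", "8000");
        System.out.println(resolve(null, null, "http"));
        System.out.println(resolve("job-proxy.example.org", 3128, "http"));
    }
}
```

Note that a per-job host without a port does not count as a complete per-job setting in this sketch, so the global values are used in that case.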

Configuring SOCKS5 Proxy
~~~~~~~~~~~~~~~~~~~~~~~
~~~~~~~~~~~~~~~~~~~~~~~~

Heritrix crawler traffic can optionally be routed through a SOCKS5 proxy. This overrides any HTTP proxy
configuration. It is configured by setting the ``socksProxyHost`` and ``socksProxyPort`` properties of the
``org.archive.modules.fetcher.FetchHTTP`` bean, as in the example below:

```
<bean class="org.archive.modules.fetcher.FetchHTTP" id="fetchHttp">
<!-- ... -->
<property name="socksProxyHost" value="127.0.0.1"/>
<property name="socksProxyPort" value="24000"/>
</bean>
```
.. code-block:: xml

<bean class="org.archive.modules.fetcher.FetchHTTP" id="fetchHttp">
<!-- ... -->
<property name="socksProxyHost" value="127.0.0.1"/>
<property name="socksProxyPort" value="24000"/>
</bean>

Configuring DNS over HTTP (DoH)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

If the local DNS on the server running Heritrix is not able to resolve the DNS names of the crawled sites, e.g. because
the server is sitting behind an enterprise firewall and can only resolve names of the local network, then using
DNS-over-HTTP (DoH) might be an alternative to fetch DNS information.

To activate this, set the ``dnsOverHttpServer`` property of the ``fetchDns`` bean to the URL of a DoH server.
If a global proxy has been configured via the ``--proxy-host`` and ``--proxy-port`` command line options,
these proxy settings are used to contact the DoH server as well. However, due to a limitation of the library in use,
username and password information for the proxy is not supported. Also, any per-job proxy settings defined in the
``fetchHttp`` bean are not used when contacting the DoH server.

As the implementation relies on the corresponding client in the "dnsjava" library, which is currently labeled as
experimental, this option comes with some limitations:

* With Java 11, due to a `well-known bug <https://bugs.openjdk.java.net/browse/JDK-8221395>`_, connections to the
  DoH server are not closed until Heritrix shuts down.
  As the DoH server might stop accepting new connections once some limit is reached while these connections are
  still open, it is not recommended to use this feature when running Heritrix with Java 11.
* For other Java versions, the connection to the DoH server is closed when the garbage collector runs.
  Depending on the garbage collector used, this causes a delay of anything between a few seconds and several
  minutes before the connection is closed. Also note that if the garbage collector is explicitly triggered via the
  Heritrix UI, one needs to add the ``-XX:-DisableExplicitGC`` option to ``JAVA_OPTS`` for Java versions 13 and up,
  as otherwise this action has no effect.

Without making a recommendation, the following DoH servers have been tested with this feature:

* https://dns.google/dns-query
* https://cloudflare-dns.com/dns-query

However, servers implementing the official `RFC 8484 <https://tools.ietf.org/html/rfc8484>`_ specification
unfortunately do not work with the current implementation. This includes, for example, the following server:

* https://dns.digitale-gesellschaft.ch/dns-query

This limitation might be overcome by a newer version of the "dnsjava" library.
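
The Java version caveats listed above can be condensed into a small check. The following is only a sketch: the class ``DohJavaVersionNote`` and its methods are hypothetical and not part of Heritrix or "dnsjava", and the messages merely restate the limitations described in this section.

```java
/** Hypothetical helper for illustration only; not part of Heritrix. */
final class DohJavaVersionNote {

    /** Converts "1.8" style and "11" style specification versions to 8, 11, ... */
    static int featureVersion(String specVersion) {
        String v = specVersion.startsWith("1.") ? specVersion.substring(2) : specVersion;
        return Integer.parseInt(v);
    }

    /** Returns a short note on DoH usage for the given Java feature version. */
    static String note(int javaFeatureVersion) {
        if (javaFeatureVersion == 11) {
            return "not recommended: connections to the DoH server stay open "
                    + "until Heritrix shuts down (JDK-8221395)";
        }
        String base = "connections to the DoH server are closed when the garbage collector runs";
        if (javaFeatureVersion >= 13) {
            return base + "; add -XX:-DisableExplicitGC to JAVA_OPTS if the garbage "
                    + "collector is triggered from the Heritrix UI";
        }
        return base;
    }

    public static void main(String[] args) {
        // java.specification.version is "1.8" on Java 8 and "11", "17", ... afterwards.
        int current = featureVersion(System.getProperty("java.specification.version"));
        System.out.println("Java " + current + ": " + note(current));
    }
}
```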
19 changes: 16 additions & 3 deletions engine/src/main/java/org/archive/crawler/Heritrix.java
Expand Up @@ -141,11 +141,15 @@ private static Options options() {
"web interface to bind to.");
options.addOption("p", "web-port", true, "The port the web interface " +
"should listen on.");
options.addOption("r", "run-job", true, "Run a single job and then exit when it" +
"finishes.");
options.addOption("r", "run-job", true, "Run a single job and then exit " +
"when it finishes.");
options.addOption("s", "ssl-params", true, "Specify a keystore " +
"path, keystore password, and key password for HTTPS use. " +
"Separate with commas, no whitespace.");
options.addOption(null, "proxy-host", true, "Global http(s) proxy host " +
"to use for crawling.");
options.addOption(null, "proxy-port", true, "Global http(s) proxy port " +
"to use for crawling.");
return options;
}

Expand Down Expand Up @@ -307,6 +311,15 @@ public void instanceMain(String[] args)
useAdhocKeystore(startupOut);
}

if(cl.hasOption("proxy-host")) {
String proxyHost = cl.getOptionValue("proxy-host");
String proxyPort = cl.getOptionValue("proxy-port", "8000");
System.setProperty("http.proxyHost", proxyHost);
System.setProperty("http.proxyPort", proxyPort);
System.setProperty("https.proxyHost", proxyHost);
System.setProperty("https.proxyPort", proxyPort);
}

// Restlet will reconfigure logging according to the system property
// so we must set it for -l to work properly
System.setProperty("java.util.logging.config.file", properties.getPath());
Expand All @@ -315,7 +328,7 @@ public void instanceMain(String[] args)
LogManager.getLogManager().readConfiguration(finp);
finp.close();
}

// Set timezone here. Would be problematic doing it if we're running
// inside in a container.
TimeZone.setDefault(TimeZone.getTimeZone("GMT"));
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -215,6 +215,7 @@
<!-- <property name="acceptNonDnsResolves" value="false" /> -->
<!-- <property name="digestContent" value="true" /> -->
<!-- <property name="digestAlgorithm" value="sha1" /> -->
<!-- <property name="dnsOverHttpServer" value="https://dns.google/dns-query" /> -->
</bean>
<bean id="fetchHttp" class="org.archive.modules.fetcher.FetchHTTP">
<!-- <property name="maxLengthBytes" value="0" /> -->
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -245,6 +245,7 @@ http://example.example/example
<!-- <property name="acceptNonDnsResolves" value="false" /> -->
<!-- <property name="digestContent" value="true" /> -->
<!-- <property name="digestAlgorithm" value="sha1" /> -->
<!-- <property name="dnsOverHttpServer" value="https://dns.google/dns-query" /> -->
</bean>
<!-- <bean id="fetchWhois" class="org.archive.modules.fetcher.FetchWhois">
<property name="specialQueryTemplates">
Expand Down
39 changes: 36 additions & 3 deletions modules/src/main/java/org/archive/modules/fetcher/FetchDNS.java
Expand Up @@ -35,6 +35,7 @@
import java.util.regex.Matcher;

import org.apache.commons.httpclient.URIException;
import org.apache.commons.lang.StringUtils;
import org.archive.modules.CrawlURI;
import org.archive.modules.Processor;
import org.archive.modules.net.CrawlHost;
Expand All @@ -45,6 +46,7 @@
import org.springframework.beans.factory.annotation.Autowired;
import org.xbill.DNS.ARecord;
import org.xbill.DNS.DClass;
import org.xbill.DNS.DohResolver;
import org.xbill.DNS.Lookup;
import org.xbill.DNS.Record;
import org.xbill.DNS.ResolverConfig;
Expand Down Expand Up @@ -115,6 +117,19 @@ public boolean getDisableJavaDnsResolves() {
public void setDisableJavaDnsResolves(boolean disableJavaDnsResolves) {
kp.put("disableJavaDnsResolves",disableJavaDnsResolves);
}

public String getDnsOverHttpServer() {
return (String) kp.get("dnsOverHttpServer");
}
/**
* URL of the DNS-over-HTTP(S) server.
* If this is not set or is set to an empty string, no DNS-over-HTTP(S)
* will be used; otherwise it should contain the URL of the
* DNS-over-HTTPS server.
*/
public void setDnsOverHttpServer(String dnsOverHttpServer) {
kp.put("dnsOverHttpServer", dnsOverHttpServer);
}

/**
* Used to do DNS lookups.
Expand Down Expand Up @@ -163,8 +178,8 @@ public FetchDNS() {
protected boolean shouldProcess(CrawlURI curi) {
return curi.getUURI().getScheme().equals("dns");
}


protected void innerProcess(CrawlURI curi) {
Record[] rrecordSet = null; // Retrieved dns records
String dnsName = null;
Expand Down Expand Up @@ -194,7 +209,7 @@ protected void innerProcess(CrawlURI curi) {
// If we have not disabled JavaDNS, use that:
if (!getDisableJavaDnsResolves()) {
try {
rrecordSet = (new Lookup(lookupName, TypeType, ClassType)).run();
rrecordSet = createDNSLookup(lookupName).run();
} catch (TextParseException e) {
rrecordSet = null;
}
Expand Down Expand Up @@ -378,4 +393,22 @@ protected ARecord getFirstARecord(Record[] rrecordSet) {
}
return arecord;
}

protected Lookup createDNSLookup(String lookupName)
throws TextParseException {
Lookup lookup = new Lookup(lookupName, TypeType, ClassType);

String dohServer = getDnsOverHttpServer();
if (StringUtils.isNotEmpty(dohServer)) {
if (logger.isLoggable(Level.FINER)) {
logger.log(Level.FINER,
"use dns on http with server " + dohServer);
}

DohResolver hts = new DohResolver(dohServer);
lookup.setResolver(hts);
}

return lookup;
}
}
Original file line number Diff line number Diff line change
Expand Up @@ -210,6 +210,16 @@ public FetchHTTPRequest(FetchHTTP fetcher, CrawlURI curi) throws URIException {
// HTTP proxy settings
String proxyHostname = (String) fetcher.getAttributeEither(curi, "httpProxyHost");
Integer proxyPort = (Integer) fetcher.getAttributeEither(curi, "httpProxyPort");
if (!(StringUtils.isNotEmpty(proxyHostname) && proxyPort != null) && !this.useSocksProxy) {
String sysPropertyPrefix;
if ("https".equalsIgnoreCase(curi.getUURI().getScheme())) {
sysPropertyPrefix = "https";
} else {
sysPropertyPrefix = "http";
}
proxyHostname = System.getProperty(sysPropertyPrefix + ".proxyHost");
proxyPort = Integer.getInteger(sysPropertyPrefix + ".proxyPort");
}

// use HTTP proxy settings if SOCKS5 has not already been specified
String requestLineUri;
Expand Down