HtmlExtractor: allow relative hrefs in the base element #209

anjackson · 2018-07-04T15:19:51Z

I believe this ensures Heritrix3 deals with relative base hrefs. (fixes #208)

ato · 2018-07-04T22:23:40Z

Looks good to me. I think JerichoExtractorHtml has the same problem.

anjackson · 2018-07-05T09:24:30Z

Hm, well, this now breaks other tests. Elsewhere, BAD BASE tests are in play that assume base href should be absolute.

Not sure what to do about this.

ato · 2018-07-05T11:34:29Z

Hm. So I think the problem is that HTML4 specified that base URLs must be absolute. Presumably browsers at the time ignored relative base URLs, hence the BADBASE tests. However, HTML5 allows them to be relative, but only the first base element is obeyed and any subsequent base element is ignored.

So I guess the options for Heritrix's behaviour are:

1. Follow what contemporary browsers do: allow relative URLs and ignore base tags other than the first.
2. Generate candidates for every possible case.
3. Change behaviour depending on the doctype.

Personally I would go with option 1 since it's simple and matches what browsers do. I would consider option 2 speculative link extraction.
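For illustration, option 1 can be sketched with plain `java.net.URI` rather than Heritrix's own `UURI` classes. The class and method names below (`FirstBaseWins`, `effectiveBase`, `resolveLink`) are hypothetical and not part of Heritrix; this is a minimal sketch of the "first base element wins, relative hrefs allowed" rule, not the actual ExtractorHTML change.

```java
import java.net.URI;
import java.util.List;

/** Hypothetical sketch of option 1; not Heritrix code. */
class FirstBaseWins {

    /**
     * Returns the base URI to use for a document: the first base href,
     * resolved against the document URI (so it may be relative), or the
     * document URI itself when the page has no base element.
     */
    static URI effectiveBase(URI documentUri, List<String> baseHrefs) {
        if (baseHrefs.isEmpty()) {
            return documentUri;
        }
        // Only the first base element is obeyed; later ones are ignored.
        return documentUri.resolve(baseHrefs.get(0));
    }

    /** Resolves a link href against the effective base. */
    static URI resolveLink(URI documentUri, List<String> baseHrefs, String href) {
        return effectiveBase(documentUri, baseHrefs).resolve(href);
    }
}
```

With a document at `http://www.example.com/abc/index.html` and base hrefs `../base/` followed by `http://other.example/`, a link to `page.html` resolves against `http://www.example.com/base/` and the second base is never consulted.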

anjackson · 2018-07-05T12:21:09Z

I'm happy to implement option 1. Option 2 gets quite messy as it would mean storing all potential base URIs and generating all potential URL combinations (i.e. for L links and B base URIs we'd need to emit L x B links!) Option 3 is doable too I guess, with a fallback to option 1 when the DOCTYPE is not known.

I'm not really sure what to do with the tests, as they appear to be testing what happens when you get bad URIs, and happen to use the HTML4 base URI convention to make it happen. Having taken that away, I am struggling to make a test case that actually does create a bad URI!

ato · 2018-07-05T13:56:56Z

> I'm not really sure what to do with the tests, as they appear to be testing what happens when you get bad URIs

If we're going with option 1 I think we should replace those bad url tests entirely. They're just plain wrong for modern html. We should instead test relative base hrefs and a page with several base tags.

anjackson · 2018-07-05T14:00:36Z

The problem is that the failing test is called BadURIsStopPageParsingSelfTest i.e. it's about hitting bad URIs (that it uses base hrefs to do it seems to be incidental). Having modified the base href handling, I don't know how to make the BadURIs test go off. I guess we could use very, very long URIs? If I add 'bad' characters it just escapes them.

ato · 2018-07-05T14:50:48Z

Looks like this is the original issue: https://webarchive.jira.com/browse/HER-25

> I don't know how to make the BadURIs test go off.

I'm confused, why would we want them to? Isn't the point of the original issue that Heritrix should NOT choke on bad URIs? (ie the name of the test is describing the problem addressed not the desired behaviour)

ato · 2018-07-05T15:02:19Z

Those tests look all wrong to me. I'm pretty sure modern browsers would apply the base urls in all cases (one.html two.html html). If there are extra slashes or illegal characters or whatever, modern browsers just normalise and escape them. The schemaless url is treated as a relative URL. So I still think the tests are wrong and should be removed. (Or inverted so they enforce that goodone.html etc are actually not found)

anjackson · 2018-07-05T15:08:36Z

Yes, and the test appears to be designed to check that the page parsing does not stop if the extractor hits a bad URI. If I can't create a URI error, then I can't make the test test anything.

anjackson · 2018-07-05T15:18:44Z

Okay, so from a Real Life Log File^(TM) I find three types of error:

URI length > 2083
URI too short to be a meaningful URI
Contains non-LDH characters due to a %20 in the host name.

I think the last of these is probably easiest to use.
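As a rough illustration of the first and third error types, here is a string-level sketch. The class `UriSanityChecks` and its methods are hypothetical; Heritrix's real checks live in its `UURIFactory` code and may well differ. The 2083 limit is the figure quoted above, and "LDH" is taken to mean letters, digits and hyphens, with dots separating labels.

```java
/** Hypothetical checks for illustration only; not Heritrix's validation. */
class UriSanityChecks {

    /** Length limit as quoted in the log excerpt above. */
    static final int MAX_LENGTH = 2083;

    /** True if the whole URI string exceeds the length limit. */
    static boolean tooLong(String uri) {
        return uri.length() > MAX_LENGTH;
    }

    /**
     * True if the host consists only of letters, digits, hyphens and
     * label-separating dots. A host containing "%20" fails this check.
     */
    static boolean isLdhHost(String host) {
        return host.matches("[A-Za-z0-9.-]+");
    }
}
```

Under these assumptions a host such as `www.exa%20mple.com` is rejected because of the `%` character, which is the kind of input a test page could embed in a link.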

ato · 2018-07-05T15:26:05Z

> If I can't create a URI error, then I can't make the test test anything.

Ah. How about this?

UURIFactory.getInstance("http://a:b/");
-->
org.apache.commons.httpclient.URIException: invalid port number
	at org.apache.commons.httpclient.URI.parseAuthority(URI.java:2248)
	at org.archive.url.LaxURI.parseAuthority(LaxURI.java:190)
	at org.archive.url.LaxURI.parseUriReference(LaxURI.java:359)
	at org.apache.commons.httpclient.URI.<init>(URI.java:147)

anjackson · 2018-07-06T14:41:43Z

Right, so, finally got the tests working. One of the other tests assumed BaseURI would be set every time a base element was read, rather than just the first, so I had to modify that too.

anjackson · 2018-07-06T14:42:58Z

modules/src/test/java/org/archive/modules/extractor/ExtractorHTMLTest.java

@@ -329,7 +402,6 @@ public void testScriptTagWritingScriptType() throws URIException {
    public void testOutLinksWithBaseHref() throws URIException {
        CrawlURI puri = new CrawlURI(UURIFactory
                .getInstance("http://www.example.com/abc/index.html"));
-        puri.setBaseURI(puri.getUURI());


This was the change needed because BaseURI can now only be set once in ExtractorHTML.
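To show why a pre-set base interferes with the new behaviour, here is a small stand-alone sketch. `BaseOnce` and its methods are hypothetical and use `java.net.URI` instead of `CrawlURI`/`UURI`; the assumption is that the extractor only assigns a base when none has been assigned yet.

```java
import java.net.URI;

/** Hypothetical stand-in for "set the base URI only once"; not Heritrix code. */
class BaseOnce {
    private final URI documentUri;
    private URI base; // null until a base is assigned

    BaseOnce(URI documentUri) {
        this.documentUri = documentUri;
    }

    /** Mimics a caller (such as a test) assigning the base up front. */
    void presetBase(URI preset) {
        base = preset;
    }

    /** Called for each base element; only takes effect if no base is set. */
    void onBaseElement(String href) {
        if (base == null) {
            base = documentUri.resolve(href);
        }
    }

    /** The base in effect, falling back to the document URI. */
    URI base() {
        return base != null ? base : documentUri;
    }
}
```

In this sketch, calling `presetBase` before parsing means every later `onBaseElement` call is ignored, which mirrors why the `setBaseURI` line had to be dropped from the test.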

kris-sigur · 2018-07-13T13:23:33Z

@anjackson It looks fine, but unfortunately you caught me minutes before I'm out the door for a holiday so I can't properly vet it until the end of the month.

anjackson · 2018-09-07T20:14:12Z

Hey @nlevitt @kris-sigur or anyone able to take a quick look at this?

nlevitt · 2018-09-11T22:18:42Z

Looks fine to me.

Test case and fix for internetarchive#208.

42181b8

Same fix for JerichoExtractorHTML.

14148cc

ato changed the title ~~Test case and fix for internetarchive/heritrix3#208.~~ HtmlExtractor: allow relative hrefs in the base element Jul 5, 2018

ato changed the title ~~HtmlExtractor: allow relative hrefs in the base element~~ HtmlExtractor: allow relative hrefs in the base element (for #208) Jul 5, 2018

ato changed the title ~~HtmlExtractor: allow relative hrefs in the base element (for #208)~~ HtmlExtractor: allow relative hrefs in the base element Jul 5, 2018

Only set the BaseURI if not set already.

8bf2308

Fix up tests to account for new Base URI behaviour.

8384a59

anjackson requested review from nlevitt and kris-sigur July 13, 2018 13:16

anjackson self-assigned this Jul 13, 2018

nlevitt merged commit a831676 into internetarchive:master Sep 11, 2018

anjackson deleted the relative-base-href branch September 13, 2018 09:54
