Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HtmlExtractor: allow relative hrefs in the base element #209

Merged
merged 4 commits into from Sep 11, 2018

Conversation

anjackson
Copy link
Collaborator

@anjackson anjackson commented Jul 4, 2018

I believe this ensures Heritrix3 deals with relative base hrefs. (fixes #208)

@ato
Copy link
Collaborator

ato commented Jul 4, 2018

Looks good to me. I think JerichoExtractorHtml has the same problem.

@anjackson
Copy link
Collaborator Author

Hm, well, this now breaks other tests. Elsewhere, BAD BASE tests are in play that assume base href should be absolute.

Not sure what to do about this.

@ato
Copy link
Collaborator

ato commented Jul 5, 2018

Hm. So I think the problem is that HTML4 specified that base URLs must be absolute. Presumably browsers at the time ignored absolute URLs hence the BADBASE tests. However HTML5 allows them to be relative but only the first is obeyed and any subsequent base element is ignored.

So I guess the options for Heritrix's behaviour are:

  1. Follow what contemporary browsers do: allow relative URLs and ignore base tags other than the first.
  2. Generate candidates for every possible case.
  3. Change behaviour depending on the doctype.

Personally I would go with option 1 since it's simple and matches what browsers do. I would consider option 2 speculative link extraction.

@ato ato changed the title Test case and fix for internetarchive/heritrix3#208. HtmlExtractor: allow relative hrefs in the base element Jul 5, 2018
@ato ato changed the title HtmlExtractor: allow relative hrefs in the base element HtmlExtractor: allow relative hrefs in the base element (for #208) Jul 5, 2018
@ato ato changed the title HtmlExtractor: allow relative hrefs in the base element (for #208) HtmlExtractor: allow relative hrefs in the base element Jul 5, 2018
@anjackson
Copy link
Collaborator Author

I'm happy to implement option 1. Option 2 gets quite messy as it would mean storing all potential base URIs and generating all potential URL combinations (i.e. for L links and B base URIs we'd need to emit L x B links!) Option 3 is doable too I guess, with a fallback to option 1 when the DOCTYPE is not known.

I'm not really sure what to do with the tests, as they appear to be testing what happens when you get bad URIs, and happen to use the HTML4 base URI convention to make it happen. Having taken that away, I am struggling to make a test case that actually does create a bad URI!

@ato
Copy link
Collaborator

ato commented Jul 5, 2018

I'm not really sure what to do with the tests, as they appear to be testing what happens when you get bad URIs

If we're going with option 1 I think we should replace those bad url tests entirely. They're just plain wrong for modern html. We should instead test relative base hrefs and a page with several base tags.

@anjackson
Copy link
Collaborator Author

The problem is that the failing test is called BadURIsStopPageParsingSelfTest i.e. it's about hitting bad URIs (that it uses base hrefs to do it seems to be incidental). Having modified the base href handling, I don't know how to make the BadURIs test go off. I guess we could use very, very long URIs? If I add 'bad' characters it just escapes them.

@ato
Copy link
Collaborator

ato commented Jul 5, 2018

Looks like this is the original issue: https://webarchive.jira.com/browse/HER-25

I don't know how to make the BadURIs test go off.

I'm confused, why would we want them to? Isn't the point of the original issue that Heritrix should NOT choke on bad URIs? (ie the name of the test is describing the problem addressed not the desired behaviour)

@ato
Copy link
Collaborator

ato commented Jul 5, 2018

Those tests look all wrong to me. I'm pretty sure modern browsers would apply the base urls in all cases (one.html two.html html). If there's extra slashes or illegal characters ir whatever modern browsers just normalise and escape them. The schemaless url is treated as a relative URL. So I still think the tests are wrong and should be removed. (Or inverted so they enforce that goodone.html etc are actually not found)

@anjackson
Copy link
Collaborator Author

Yes, and the test appears to be designed to check that the page parsing does not stop if the extractor hits a bad URI. If I can't create a URI error, then I can't make the test test anything.

@anjackson
Copy link
Collaborator Author

Okay, so from a Real Life Log File^(TM) I find three types of error:

  • URI length > 2083
  • URI too short to be a meaningful URI
  • Contains non-LDH characters due to a %20 in the host name.

I think the latter is probably easiest to use.

@ato
Copy link
Collaborator

ato commented Jul 5, 2018

If I can't create a URI error, then I can't make the test test anything.

Ah. How about this?

UURIFactory.getInstance("http://a:b/");
-->
org.apache.commons.httpclient.URIException: invalid port number
	at org.apache.commons.httpclient.URI.parseAuthority(URI.java:2248)
	at org.archive.url.LaxURI.parseAuthority(LaxURI.java:190)
	at org.archive.url.LaxURI.parseUriReference(LaxURI.java:359)
	at org.apache.commons.httpclient.URI.<init>(URI.java:147)

@anjackson
Copy link
Collaborator Author

Right, so, finally got the tests working. One of the other tests assumed BaseURI would be set every time a base element was read, rather than just the first, so I had to modify that too.

@@ -329,7 +402,6 @@ public void testScriptTagWritingScriptType() throws URIException {
public void testOutLinksWithBaseHref() throws URIException {
CrawlURI puri = new CrawlURI(UURIFactory
.getInstance("http://www.example.com/abc/index.html"));
puri.setBaseURI(puri.getUURI());
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was the change needed because BaseURI can now only be set once in ExtractorHTML

@anjackson anjackson self-assigned this Jul 13, 2018
@kris-sigur
Copy link
Collaborator

@anjackson It looks fine, but unfortunately you caught me minutes before I'm out the door for a holiday so I can't properly vet it until the end of the month.

@anjackson
Copy link
Collaborator Author

Hey @nlevitt @kris-sigur or anyone able to take a quick look at this?

@nlevitt
Copy link
Contributor

nlevitt commented Sep 11, 2018

Looks fine to me.

@nlevitt nlevitt merged commit a831676 into internetarchive:master Sep 11, 2018
@anjackson anjackson deleted the relative-base-href branch September 13, 2018 09:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

HTML extractor does not handle the base href correctly when it's relative
4 participants