Atom: implement xml:base relative URI resolution #101

cristoper · 2018-09-26T17:34:36Z

My application needs to resolve relative URLs in content html according to the xml:base attribute of the root feed element, so this is my attempt at implementing xml:base resolution (Issue #2 )

I believe this will work well for my needs, and if you think it's a reasonable approach in general I don't mind spending more time fixing any issues you foresee with it.

What it does:

Resolve relative URIs in feed element attributes, feed elements which contain URIs (like author:uri), and HTML element attributes in atom elements of type "html" or "xhtml" according to the xml:base specification (https://www.w3.org/TR/xmlbase/)

What it is:

Three changesets:

The first actually implements the XMLBase type and functions which live in the internal/shared package (internal/shared/xmlbase.go), with a smallish patch against atom/parser.go
The second adds several tests adapted from the Python feedparser project
The third fixes a small bug in atom/parser_test.go which confused me while testing for a second

How it works:

As each atom element is parsed, a new xml:base is (recursively) pushed to the stack; the top xml:base URI is used to resolve attributes (uses golang.org/x/net/html to parse any "html" or "xhtml" element content); then the base is popped from the stack.

TODO:

This has not been manually tested much yet so I'm sure there are edge cases that fail and possibly some low-hanging performance improvements.

coveralls · 2018-09-26T17:39:15Z

Coverage decreased (-2.9%) to 60.619% when pulling d3aa2ab on cristoper:xmlbase-rebase into c9d2a40 on mmcdole:master.

cristoper · 2018-09-26T21:05:32Z

It looks like the travis build fails because Go 1.4 marshal's urlencoded strings differently. I can delete that test if it is important to have travis checks pass.

mmcdole · 2018-09-26T21:17:01Z

This is great! I'm working on looking through it now.

mmcdole · 2018-09-26T23:18:58Z

I still need to finish reviewing all the changes, but one thing that stood out to me is if it would be possible to avoid doing all the pushing/popping of the xml base in atom.Parser all together.

If we could push that functionality further up, into shared it might be cleaner. I think that we currently call a wrapped NextToken() function in shared. Perhaps if this was part of the new base parser object you created, then on StartTag and EndTag we can handle the pushing/popping in a single place there.

cristoper · 2018-09-27T04:45:22Z

Thanks for taking a look so soon. I agree it would be better to have push and pop in one place instead of having them sprinkled throughout atom/parser.go. Tomorrow I'll experiment with moving them to NextToken().

cristoper · 2018-09-27T16:14:12Z

This version moves the xml:base stack management out of atom.parser and into XMLBase. It is better, but not perfect:

atom.parser's parseAtomText() func calls goxpp's DecodeElement() func which consumes an element so that the end tag is never found by NextTag(). This had me puzzled for a second, but I got it working by checking in NextTag() if we're already on an EndTag to know when to pop the xml:base urlStack.
As an aside, I thought of using the XMLPullParser's Depth attribute to determine when to pop the xml:base stack, but it wasn't decrementing when expected. Is it a bug in goxpp that p.Depth is not decremented here?
This duplicates both shared.FindRoot() and shared.NextTag() because those methods are used by RSS also, and some of the XMLBase stuff is currently rather Atom-centric. Though supporting xml:base resolution for RSS feeds might be a nice feature, even if not required by the spec.

mmcdole · 2018-09-28T03:32:56Z

Nice, I think the recent changes are good with the move into xmlbase.

I've created mmcdole/goxpp#3 to track the depth field not updating in goxpp. I will have to look and see what is going on there.

I think that the following link might be useful for us to look at: https://pythonhosted.org/feedparser/resolving-relative-links.html

It looks like feedparser treats the following HTML/XHTML elements as URIs which are to be resolved following the XML:Base spec:

<a href=”...”>
<applet codebase=”...”>
<area href=”...”>
<audio src=”...”>
<blockquote cite=”...”>
<body background=”...”>
<del cite=”...”>
<form action=”...”>
<frame longdesc=”...”>
<frame src=”...”>
<head profile=”...”>
<iframe longdesc=”...”>
<iframe src=”...”>
<img longdesc=”...”>
<img src=”...”>
<img usemap=”...”>
<input src=”...”>
<input usemap=”...”>
<ins cite=”...”>
<link href=”...”>
<object classid=”...”>
<object codebase=”...”>
<object data=”...”>
<object usemap=”...”>
<q cite=”...”>
<script src=”...”>
<source src=”...”>
<video poster=”...”>
<video src=”...”>

It then mentions that the following feed fields are identified URIs which should be resolved:

entries[i].author_detail.href
entries[i].comments
entries[i].contributors[j].href
entries[i].enclosures[j].href
entries[i].id
entries[i].license
entries[i].link
entries[i].links[j].href
entries[i].publisher_detail.href
entries[i].source.author_detail.href
entries[i].source.contributors[j].href
entries[i].source.links[j].href
feed.author_detail.href
feed.contributors[i].href
feed.docs
feed.generator_detail.href
feed.id
feed.image.href
feed.image.link
feed.license
feed.link
feed.links[i].href
feed.publisher_detail.href
feed.textinput.link

Both of these lists might be useful to reference. The second list of field elements would need to be translated back to their Atom equivalents for our purposes here.

I think it would be nice if we could keep shared.xmlbase completely ignorant of any atom specific knowledge.

I think this would mean we should move the current URIElements to atom.Parser (marked as unexported and perhaps expand this list), as these elements are specific to the atom spec. It will own this list so it knowns which fields it needs to call xmlbase.ResolveURL on.

Then, xmlbase will have a list of of xhtml/html elements/attributes that need to be resolved for its own ResolveHTML function.

We are getting close!

cristoper · 2018-09-28T17:03:25Z

Thanks!

I expanded the list of URI-containing HTML attributes according to what feedparser uses.

I'm not sure we catch all of the same URI-containing Atom elements as feedparser. But by my search of the Atom 1.0 spec that is only ICON, ID, LOGO, and URI. Plus URL from Atom 0.3. We get those.

I moved the atom-specific vars from shared.XMLBase to parser.Atom (which passes the list of element attributes to resolve to its instance of XMLBase). I also removed the duplicated shared.FindRoot and shared.NextTag functions, so now rss.Parser also uses XMLBase.NextTag(), but with an empty URIAttrs field so that it doesn't actually resolve anything (but it should be easy to add xml:base resolution to the rss parser now, I just hesitate to do so without tests in place).

(Let me know when/if you consider this branch ready to pull from, and I can do a --force update to clean up some of the commit history first.)

internal/shared/xmlbase.go

mmcdole · 2018-09-29T13:51:45Z

I'm ready to merge this. Let me know if still wanted to any force update.

What it does: Resolve relative URIs in feed element attributes, feed elements which contain URIs (like author:uri), and HTML element attributes in atom elements of type "html" or "xhtml" according to the xml:base specification (https://www.w3.org/TR/xmlbase/) What it is: The XMLBase type and functions live in the internal/shared package (internal/shared/xmlbase.go), with a minimalish patch against atom/parser.go. Tests live in testdata/parser/atom/ and are adapted from the python feedparser project: https://github.com/kurtmckee/feedparser/tree/master/feedparser/tests/wellformed/base How it works: As each atom element is parsed, a new xml:base is pushed to the stack; the top xml:base URI is used to resolve attributes (uses golang.org/x/net/html to parse any "html" or "xhtml" element content); then the base is popped from the stack. The shared.FindRoot() and shared.NextTag() functions have been moved to methods of XMLBase so that they can manage the xml:base url stack.

Also: No need to pass address of pointer to json.Unmarshal

cristoper · 2018-09-29T17:58:41Z

Awesome, thanks mmcdole. I just pushed a final version with a slightly cleaned up commit message.

mmcdole · 2018-09-29T21:24:39Z

@cristoper done! Thanks again for your contribution. I can finally close #2.

mmcdole · 2018-09-29T22:02:51Z

I've updated the README.md to credit you for your work on this as well.

cristoper · 2018-09-30T02:16:19Z

I appreciate it! And thanks for all your work making gofeed available. Now I can get back to work on my little feed reader side project :)

mmcdole mentioned this pull request Sep 28, 2018

Depth field not updating as expected mmcdole/goxpp#3

Closed

mmcdole reviewed Sep 28, 2018

View reviewed changes

internal/shared/xmlbase.go Show resolved Hide resolved

internal/shared/xmlbase.go Outdated Show resolved Hide resolved

cristoper added 2 commits September 29, 2018 11:55

Fix transposed 'expected' and 'actual' arguments

d3aa2ab

Also: No need to pass address of pointer to json.Unmarshal

mmcdole merged commit b7a5fbf into mmcdole:master Sep 29, 2018

cristoper deleted the xmlbase-rebase branch September 30, 2018 02:16

cristoper mentioned this pull request Feb 25, 2023

Make XMLBase thread-safe #201

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Atom: implement xml:base relative URI resolution #101

Atom: implement xml:base relative URI resolution #101

cristoper commented Sep 26, 2018

coveralls commented Sep 26, 2018 •

edited

Loading

cristoper commented Sep 26, 2018

mmcdole commented Sep 26, 2018

mmcdole commented Sep 26, 2018 •

edited

Loading

cristoper commented Sep 27, 2018

cristoper commented Sep 27, 2018 •

edited

Loading

mmcdole commented Sep 28, 2018 •

edited

Loading

cristoper commented Sep 28, 2018

mmcdole commented Sep 29, 2018

cristoper commented Sep 29, 2018

mmcdole commented Sep 29, 2018

mmcdole commented Sep 29, 2018

cristoper commented Sep 30, 2018

Atom: implement xml:base relative URI resolution #101

Atom: implement xml:base relative URI resolution #101

Conversation

cristoper commented Sep 26, 2018

coveralls commented Sep 26, 2018 • edited Loading

cristoper commented Sep 26, 2018

mmcdole commented Sep 26, 2018

mmcdole commented Sep 26, 2018 • edited Loading

cristoper commented Sep 27, 2018

cristoper commented Sep 27, 2018 • edited Loading

mmcdole commented Sep 28, 2018 • edited Loading

cristoper commented Sep 28, 2018

mmcdole commented Sep 29, 2018

cristoper commented Sep 29, 2018

mmcdole commented Sep 29, 2018

mmcdole commented Sep 29, 2018

cristoper commented Sep 30, 2018

coveralls commented Sep 26, 2018 •

edited

Loading

mmcdole commented Sep 26, 2018 •

edited

Loading

cristoper commented Sep 27, 2018 •

edited

Loading

mmcdole commented Sep 28, 2018 •

edited

Loading