Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Atom: implement xml:base relative URI resolution #101

Merged
merged 2 commits into from
Sep 29, 2018
Merged

Atom: implement xml:base relative URI resolution #101

merged 2 commits into from
Sep 29, 2018

Conversation

cristoper
Copy link
Contributor

My application needs to resolve relative URLs in content html according to the xml:base attribute of the root feed element, so this is my attempt at implementing xml:base resolution (Issue #2 )

I believe this will work well for my needs, and if you think it's a reasonable approach in general I don't mind spending more time fixing any issues you foresee with it.

What it does:

Resolve relative URIs in feed element attributes, feed elements which contain URIs (like author:uri), and HTML element attributes in atom elements of type "html" or "xhtml" according to the xml:base specification (https://www.w3.org/TR/xmlbase/)

What it is:

Three changesets:

  1. The first actually implements the XMLBase type and functions which live in the internal/shared package (internal/shared/xmlbase.go), with a smallish patch against atom/parser.go

  2. The second adds several tests adapted from the Python feedparser project

  3. The third fixes a small bug in atom/parser_test.go which confused me while testing for a second

How it works:

As each atom element is parsed, a new xml:base is (recursively) pushed to the stack; the top xml:base URI is used to resolve attributes (uses golang.org/x/net/html to parse any "html" or "xhtml" element content); then the base is popped from the stack.

TODO:

This has not been manually tested much yet so I'm sure there are edge cases that fail and possibly some low-hanging performance improvements.

@coveralls
Copy link

coveralls commented Sep 26, 2018

Coverage Status

Coverage decreased (-2.9%) to 60.619% when pulling d3aa2ab on cristoper:xmlbase-rebase into c9d2a40 on mmcdole:master.

@cristoper
Copy link
Contributor Author

It looks like the travis build fails because Go 1.4 marshal's urlencoded strings differently. I can delete that test if it is important to have travis checks pass.

@mmcdole
Copy link
Owner

mmcdole commented Sep 26, 2018

This is great! I'm working on looking through it now.

@mmcdole
Copy link
Owner

mmcdole commented Sep 26, 2018

I still need to finish reviewing all the changes, but one thing that stood out to me is if it would be possible to avoid doing all the pushing/popping of the xml base in atom.Parser all together.

If we could push that functionality further up, into shared it might be cleaner. I think that we currently call a wrapped NextToken() function in shared. Perhaps if this was part of the new base parser object you created, then on StartTag and EndTag we can handle the pushing/popping in a single place there.

@cristoper
Copy link
Contributor Author

Thanks for taking a look so soon. I agree it would be better to have push and pop in one place instead of having them sprinkled throughout atom/parser.go. Tomorrow I'll experiment with moving them to NextToken().

@cristoper
Copy link
Contributor Author

cristoper commented Sep 27, 2018

This version moves the xml:base stack management out of atom.parser and into XMLBase. It is better, but not perfect:

  • atom.parser's parseAtomText() func calls goxpp's DecodeElement() func which consumes an element so that the end tag is never found by NextTag(). This had me puzzled for a second, but I got it working by checking in NextTag() if we're already on an EndTag to know when to pop the xml:base urlStack.

  • As an aside, I thought of using the XMLPullParser's Depth attribute to determine when to pop the xml:base stack, but it wasn't decrementing when expected. Is it a bug in goxpp that p.Depth is not decremented here?

  • This duplicates both shared.FindRoot() and shared.NextTag() because those methods are used by RSS also, and some of the XMLBase stuff is currently rather Atom-centric. Though supporting xml:base resolution for RSS feeds might be a nice feature, even if not required by the spec.

@mmcdole
Copy link
Owner

mmcdole commented Sep 28, 2018

Nice, I think the recent changes are good with the move into xmlbase.

I've created mmcdole/goxpp#3 to track the depth field not updating in goxpp. I will have to look and see what is going on there.

I think that the following link might be useful for us to look at: https://pythonhosted.org/feedparser/resolving-relative-links.html

It looks like feedparser treats the following HTML/XHTML elements as URIs which are to be resolved following the XML:Base spec:

<a href=”...”>
<applet codebase=”...”>
<area href=”...”>
<audio src=”...”>
<blockquote cite=”...”>
<body background=”...”>
<del cite=”...”>
<form action=”...”>
<frame longdesc=”...”>
<frame src=”...”>
<head profile=”...”>
<iframe longdesc=”...”>
<iframe src=”...”>
<img longdesc=”...”>
<img src=”...”>
<img usemap=”...”>
<input src=”...”>
<input usemap=”...”>
<ins cite=”...”>
<link href=”...”>
<object classid=”...”>
<object codebase=”...”>
<object data=”...”>
<object usemap=”...”>
<q cite=”...”>
<script src=”...”>
<source src=”...”>
<video poster=”...”>
<video src=”...”>

It then mentions that the following feed fields are identified URIs which should be resolved:

entries[i].author_detail.href
entries[i].comments
entries[i].contributors[j].href
entries[i].enclosures[j].href
entries[i].id
entries[i].license
entries[i].link
entries[i].links[j].href
entries[i].publisher_detail.href
entries[i].source.author_detail.href
entries[i].source.contributors[j].href
entries[i].source.links[j].href
feed.author_detail.href
feed.contributors[i].href
feed.docs
feed.generator_detail.href
feed.id
feed.image.href
feed.image.link
feed.license
feed.link
feed.links[i].href
feed.publisher_detail.href
feed.textinput.link

Both of these lists might be useful to reference. The second list of field elements would need to be translated back to their Atom equivalents for our purposes here.

I think it would be nice if we could keep shared.xmlbase completely ignorant of any atom specific knowledge.

I think this would mean we should move the current URIElements to atom.Parser (marked as unexported and perhaps expand this list), as these elements are specific to the atom spec. It will own this list so it knowns which fields it needs to call xmlbase.ResolveURL on.

Then, xmlbase will have a list of of xhtml/html elements/attributes that need to be resolved for its own ResolveHTML function.

We are getting close!

@cristoper
Copy link
Contributor Author

Thanks!

I expanded the list of URI-containing HTML attributes according to what feedparser uses.

I'm not sure we catch all of the same URI-containing Atom elements as feedparser. But by my search of the Atom 1.0 spec that is only ICON, ID, LOGO, and URI. Plus URL from Atom 0.3. We get those.

I moved the atom-specific vars from shared.XMLBase to parser.Atom (which passes the list of element attributes to resolve to its instance of XMLBase). I also removed the duplicated shared.FindRoot and shared.NextTag functions, so now rss.Parser also uses XMLBase.NextTag(), but with an empty URIAttrs field so that it doesn't actually resolve anything (but it should be easy to add xml:base resolution to the rss parser now, I just hesitate to do so without tests in place).

(Let me know when/if you consider this branch ready to pull from, and I can do a --force update to clean up some of the commit history first.)

internal/shared/xmlbase.go Show resolved Hide resolved
internal/shared/xmlbase.go Outdated Show resolved Hide resolved
@mmcdole
Copy link
Owner

mmcdole commented Sep 29, 2018

I'm ready to merge this. Let me know if still wanted to any force update.

What it does:

Resolve relative URIs in feed element attributes, feed elements which contain
URIs (like author:uri), and HTML element attributes in atom elements of
type "html" or "xhtml" according to the xml:base specification
(https://www.w3.org/TR/xmlbase/)

What it is:

The XMLBase type and functions live in the internal/shared package
(internal/shared/xmlbase.go), with a minimalish patch against
atom/parser.go.

Tests live in testdata/parser/atom/ and are adapted from the python feedparser project:

https://github.com/kurtmckee/feedparser/tree/master/feedparser/tests/wellformed/base

How it works:

As each atom element is parsed, a new xml:base is pushed to the stack;
the top xml:base URI is used to resolve attributes (uses
golang.org/x/net/html to parse any "html" or "xhtml" element content);
then the base is popped from the stack.

The shared.FindRoot() and shared.NextTag() functions have been moved to
methods of XMLBase so that they can manage the xml:base url stack.
Also: No need to pass address of pointer to json.Unmarshal
@cristoper
Copy link
Contributor Author

Awesome, thanks mmcdole. I just pushed a final version with a slightly cleaned up commit message.

@mmcdole mmcdole merged commit b7a5fbf into mmcdole:master Sep 29, 2018
@mmcdole
Copy link
Owner

mmcdole commented Sep 29, 2018

@cristoper done! Thanks again for your contribution. I can finally close #2.

@mmcdole
Copy link
Owner

mmcdole commented Sep 29, 2018

I've updated the README.md to credit you for your work on this as well.

@cristoper
Copy link
Contributor Author

I appreciate it! And thanks for all your work making gofeed available. Now I can get back to work on my little feed reader side project :)

@cristoper cristoper deleted the xmlbase-rebase branch September 30, 2018 02:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants