Atom: implement xml:base relative URI resolution #101
It looks like the Travis build fails because Go 1.4 marshals URL-encoded strings differently. I can delete that test if it is important to have the Travis checks pass.
This is great! I'm working on looking through it now.
I still need to finish reviewing all the changes, but one thing that stood out to me is whether it would be possible to avoid doing all the pushing/popping of the xml base in atom.Parser altogether. If we could push that functionality further up, into shared, it might be cleaner. I think that we currently call a wrapped
Thanks for taking a look so soon. I agree it would be better to have push and pop in one place instead of having them sprinkled throughout
This version moves the xml:base stack management out of atom.parser and into XMLBase. It is better, but not perfect:
Nice, I think the recent changes are good with the move into
I've created mmcdole/goxpp#3 to track the depth field not updating in goxpp. I will have to look and see what is going on there.
I think that the following link might be useful for us to look at: https://pythonhosted.org/feedparser/resolving-relative-links.html
It looks like feedparser treats the following HTML/XHTML elements as URIs which are to be resolved following the XML:Base spec:
It then mentions that the following feed fields are identified URIs which should be resolved:
Both of these lists might be useful to reference. The second list of field elements would need to be translated back to their Atom equivalents for our purposes here. I think it would be nice if we could keep I think this would mean we should move the current Then, We are getting close!
Thanks! I expanded the list of URI-containing HTML attributes according to what feedparser uses. I'm not sure we catch all of the same URI-containing Atom elements as feedparser, but by my search of the Atom 1.0 spec that is only ICON, ID, LOGO, and URI, plus URL from Atom 0.3. We get those.
I moved the atom-specific vars from shared.XMLBase to parser.Atom (which passes the list of element attributes to resolve to its instance of XMLBase). I also removed the duplicated
(Let me know when/if you consider this branch ready to pull from, and I can do a --force update to clean up some of the commit history first.)
I'm ready to merge this. Let me know if you still want to do a force update.
What it does:
Resolve relative URIs in feed element attributes, feed elements which contain URIs (like author:uri), and HTML element attributes in atom elements of type "html" or "xhtml" according to the xml:base specification (https://www.w3.org/TR/xmlbase/).
What it is:
The XMLBase type and functions live in the internal/shared package (internal/shared/xmlbase.go), with a minimalish patch against atom/parser.go. Tests live in testdata/parser/atom/ and are adapted from the Python feedparser project: https://github.com/kurtmckee/feedparser/tree/master/feedparser/tests/wellformed/base
How it works:
As each atom element is parsed, a new xml:base is pushed to the stack; the top xml:base URI is used to resolve attributes (uses golang.org/x/net/html to parse any "html" or "xhtml" element content); then the base is popped from the stack. The shared.FindRoot() and shared.NextTag() functions have been moved to methods of XMLBase so that they can manage the xml:base URL stack.
Also: No need to pass address of pointer to json.Unmarshal
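As an illustration of the json.Unmarshal note, here is a minimal sketch. The feed struct and decodeFeed function are hypothetical names made up for this example, not code from the PR. The point is only that a value which is already a pointer can be passed to json.Unmarshal directly; taking its address again adds an indirection that is not needed.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// feed is a hypothetical stand-in for whatever struct the tests decode into.
type feed struct {
	Title string `json:"title"`
}

// decodeFeed passes the pointer f itself to json.Unmarshal.
// Writing json.Unmarshal(data, &f) would hand over a **feed, which also
// decodes, but the extra level of indirection serves no purpose.
func decodeFeed(data []byte) (*feed, error) {
	f := &feed{}
	if err := json.Unmarshal(data, f); err != nil {
		return nil, err
	}
	return f, nil
}

func main() {
	f, err := decodeFeed([]byte(`{"title":"Example"}`))
	if err != nil {
		fmt.Println("decode failed:", err)
		return
	}
	fmt.Println(f.Title)
}
```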
Awesome, thanks mmcdole. I just pushed a final version with a slightly cleaned up commit message.
@cristoper done! Thanks again for your contribution. I can finally close #2.
I've updated the README.md to credit you for your work on this as well.
I appreciate it! And thanks for all your work making gofeed available. Now I can get back to work on my little feed reader side project :)
My application needs to resolve relative URLs in content HTML according to the xml:base attribute of the root feed element, so this is my attempt at implementing xml:base resolution (Issue #2).
I believe this will work well for my needs, and if you think it's a reasonable approach in general I don't mind spending more time fixing any issues you foresee with it.
What it does:
Resolve relative URIs in feed element attributes, feed elements which contain URIs (like author:uri), and HTML element attributes in atom elements of type "html" or "xhtml" according to the xml:base specification (https://www.w3.org/TR/xmlbase/)
What it is:
Three changesets:
The first actually implements the XMLBase type and functions which live in the internal/shared package (internal/shared/xmlbase.go), with a smallish patch against atom/parser.go
The second adds several tests adapted from the Python feedparser project
The third fixes a small bug in atom/parser_test.go which confused me while testing for a second
How it works:
As each atom element is parsed, a new xml:base is (recursively) pushed to the stack; the top xml:base URI is used to resolve attributes (uses golang.org/x/net/html to parse any "html" or "xhtml" element content); then the base is popped from the stack.
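The push/resolve/pop cycle described above can be sketched roughly as follows. This is an illustrative approximation using only net/url, not the code in internal/shared/xmlbase.go: the baseStack type and its push, pop, top, and resolve methods are hypothetical names, the example.org URLs are made up, and the HTML attribute handling via golang.org/x/net/html is left out.

```go
package main

import (
	"fmt"
	"net/url"
)

// baseStack is a hypothetical stand-in for the PR's XMLBase type:
// one base URL per currently open element.
type baseStack struct {
	bases []*url.URL
}

// top returns the base URL in effect, or nil if none has been set.
func (s *baseStack) top() *url.URL {
	if len(s.bases) == 0 {
		return nil
	}
	return s.bases[len(s.bases)-1]
}

// push records the base for a newly opened element. An empty xmlBase means
// the element has no xml:base attribute and inherits its parent's base.
// A relative xml:base is itself resolved against the parent's base.
func (s *baseStack) push(xmlBase string) error {
	parent := s.top()
	if xmlBase == "" {
		s.bases = append(s.bases, parent)
		return nil
	}
	u, err := url.Parse(xmlBase)
	if err != nil {
		return err
	}
	if parent != nil {
		u = parent.ResolveReference(u)
	}
	s.bases = append(s.bases, u)
	return nil
}

// pop discards the base of the element that was just closed.
func (s *baseStack) pop() {
	if len(s.bases) > 0 {
		s.bases = s.bases[:len(s.bases)-1]
	}
}

// resolve turns a possibly relative reference into an absolute URL using
// the base on top of the stack; with no base it returns ref unchanged.
func (s *baseStack) resolve(ref string) (string, error) {
	u, err := url.Parse(ref)
	if err != nil {
		return "", err
	}
	if base := s.top(); base != nil {
		u = base.ResolveReference(u)
	}
	return u.String(), nil
}

func main() {
	var s baseStack
	_ = s.push("http://example.org/feed/") // <feed xml:base="http://example.org/feed/">
	_ = s.push("entries/")                 // <entry xml:base="entries/">
	link, _ := s.resolve("post.html")
	fmt.Println(link) // http://example.org/feed/entries/post.html
	s.pop() // </entry>
	logo, _ := s.resolve("logo.png")
	fmt.Println(logo) // http://example.org/feed/logo.png
}
```

Pushing an entry even for elements without an xml:base attribute keeps every pop paired with exactly one push, which is one simple way to avoid tracking element depth separately.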
TODO:
This has not been manually tested much yet so I'm sure there are edge cases that fail and possibly some low-hanging performance improvements.