Broken relative links #125
This:

import reader
r = reader.Reader(':memory:')
r.add_feed('https://rachelbythebay.com/w/atom.xml')
r.update_feeds()
e, = [e for e in r.get_entries() if '2019/08/01/reliability' in e.link]
print(e.content[0].value.split('<p>')[1].splitlines()[2])

import feedparser
f = feedparser.parse('https://rachelbythebay.com/w/atom.xml')
e, = [e for e in f.entries if '2019/08/01/reliability' in e.link]
print(e.content[0].value.split('<p>')[1].splitlines()[2])

Outputs this:

<a href="/w/2019/07/21/reliability/">put forth</a>
<a href="/w/2019/07/21/reliability/">put forth</a>

So this is from feedparser, not reader.

Next steps:
Installing sgmllib3k results in:

<a href="https://rachelbythebay.com/w/2019/07/21/reliability/">put forth</a>
<a href="https://rachelbythebay.com/w/2019/07/21/reliability/">put forth</a>
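Since the difference above comes down to whether an sgmllib module is importable (it is no longer in the Python 3 standard library; sgmllib3k provides it), a quick check could look like the sketch below. The helper name `sgmllib_available` is made up for illustration and is not part of reader or feedparser.

```python
import importlib.util


def sgmllib_available():
    """Return True if an sgmllib module can be found on the import path.

    Hypothetical helper for illustration; it only looks the module up,
    it does not import it.
    """
    return importlib.util.find_spec('sgmllib') is not None


if not sgmllib_available():
    # Without sgmllib, relative links were left unresolved in the example above.
    print('sgmllib not found; try: pip install sgmllib3k')
```

This only tells you whether the module is present; it says nothing about which code paths feedparser takes when it is missing.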
Ideally, we should pull relative link resolution out of feedparser's control and into reader's (like we did with HTTP requests). This will also allow downloading assets (images etc.) in the future.

I assume sanitization also doesn't work (it probably relies on sgmllib). This should be documented / fixed ASAP, since it is a security issue.

Update: confirmed, sanitization doesn't work without sgmllib; from feedparser/sgml.py:
Next steps:
On my blog I use relative urls for links and images. This makes them proper long links. kurtmckee/feedparser#43 lemon24/reader#125
So in the end, I made sgmllib3k a required dependency, and forced sanitization and link resolution on (commit above). We can consider the problem fixed; the "ideally" part of the comment above can be considered a feature request.
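For the feature request (doing relative link resolution in reader instead of feedparser), here is a rough sketch of what it might look like using only the standard library. This is not how reader or feedparser actually do it; `LinkResolver`, `resolve_links` and `URL_ATTRS` are hypothetical names, the attribute list is deliberately tiny, and which base URL to use (feed URL, entry link, xml:base) is left as an assumption for the caller.

```python
import html
from html.parser import HTMLParser
from urllib.parse import urljoin

# (tag, attribute) pairs assumed to hold URLs; illustrative, not exhaustive.
URL_ATTRS = {('a', 'href'), ('img', 'src')}


class LinkResolver(HTMLParser):
    """Re-emit HTML, joining relative URLs in URL_ATTRS with base_url."""

    def __init__(self, base_url):
        # Keep character references as-is so they can be re-emitted verbatim.
        super().__init__(convert_charrefs=False)
        self.base_url = base_url
        self.parts = []

    def _emit_tag(self, tag, attrs, close=''):
        out = [tag]
        for name, value in attrs:
            if value is None:
                # Attribute without a value, e.g. <input disabled>.
                out.append(name)
                continue
            if (tag, name) in URL_ATTRS:
                value = urljoin(self.base_url, value)
            # The parser unescapes attribute values, so escape them again.
            out.append('{}="{}"'.format(name, html.escape(value, quote=True)))
        self.parts.append('<{}{}>'.format(' '.join(out), close))

    def handle_starttag(self, tag, attrs):
        self._emit_tag(tag, attrs)

    def handle_startendtag(self, tag, attrs):
        self._emit_tag(tag, attrs, close=' /')

    def handle_endtag(self, tag):
        self.parts.append('</{}>'.format(tag))

    def handle_data(self, data):
        self.parts.append(data)

    def handle_entityref(self, name):
        self.parts.append('&{};'.format(name))

    def handle_charref(self, name):
        self.parts.append('&#{};'.format(name))

    def handle_comment(self, data):
        self.parts.append('<!--{}-->'.format(data))


def resolve_links(value, base_url):
    """Return value with relative href/src URLs made absolute."""
    parser = LinkResolver(base_url)
    parser.feed(value)
    parser.close()  # flush any buffered trailing text
    return ''.join(parser.parts)
```

Calling `resolve_links('<a href="/w/2019/07/21/reliability/">put forth</a>', 'https://rachelbythebay.com/w/atom.xml')` should give the absolute link from the sgmllib3k output above. Note this does no sanitization at all; it only rewrites URLs.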
Deploying 1.0 doesn't seem to fix it...

Update: Turns out it's update_feeds()'s fault; see #164 for details.
A few quick thoughts on how re-implementing sanitization would work:
Note: