Broken relative links #125

lemon24 · 2019-08-09T10:08:58Z

pictures in https://mcfunley.com/manual-delivery
links in https://rachelbythebay.com/w/2019/08/04/olddocs/

lemon24 · 2019-08-12T17:44:21Z

This:

# Through reader:
import reader
r = reader.Reader(':memory:')
r.add_feed('https://rachelbythebay.com/w/atom.xml')
r.update_feeds()
# Pick the single entry whose link matches, then print one line of its content.
e, = [e for e in r.get_entries() if '2019/08/01/reliability' in e.link]
print(e.content[0].value.split('<p>')[1].splitlines()[2])

# Through feedparser directly, same feed and same entry:
import feedparser
f = feedparser.parse('https://rachelbythebay.com/w/atom.xml')
e, = [e for e in f.entries if '2019/08/01/reliability' in e.link]
print(e.content[0].value.split('<p>')[1].splitlines()[2])

Outputs this:

<a href="/w/2019/07/21/reliability/">put forth</a>
<a href="/w/2019/07/21/reliability/">put forth</a>

So this is from feedparser, not reader.

Next steps:

- Check each of the cases from "How Relative URIs Are Resolved".
- See if it's related to kurtmckee/feedparser#43 ("Relative link resolution doesn't work for some <img>").

lemon24 · 2019-08-17T10:14:45Z

Installing sgmllib3k results in:

<a href="https://rachelbythebay.com/w/2019/07/21/reliability/">put forth</a>
<a href="https://rachelbythebay.com/w/2019/07/21/reliability/">put forth</a>
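For reference, the resolved value above is what standard URL joining gives. A minimal sketch, assuming the feed URL is used as the base (the feed could declare a different base, which this does not handle):

```python
from urllib.parse import urljoin

# Assumption: the feed URL acts as the base for relative links in content.
base = 'https://rachelbythebay.com/w/atom.xml'

# The href is root-relative, so only the scheme and host of the base are kept.
resolved = urljoin(base, '/w/2019/07/21/reliability/')
print(resolved)
```

This should print the same absolute URL as in the output with sgmllib3k installed.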

lemon24 · 2019-08-18T15:53:26Z

Ideally, we should pull relative link resolution out of feedparser's control and into reader's (like we did with HTTP requests). This will also allow downloading assets (images etc.) in the future.

I assume sanitization also doesn't work (it probably relies on sgmllib). This should be documented / fixed ASAP, since it is a security issue.

Update: nope, sanitization doesn't work without sgmllib; from feedparser/sgml.py:

sgmllib is not available by default in Python 3; if the end user doesn't have it available then we'll lose illformed XML parsing and content sanitizing
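Since the failure is silent, a check like the following could make it visible. This is only a sketch: `sgmllib_available` is a hypothetical helper, and it assumes sgmllib3k installs a top-level module named `sgmllib`.

```python
import importlib.util
import warnings


def sgmllib_available():
    """Return True if a top-level module named sgmllib can be imported."""
    return importlib.util.find_spec('sgmllib') is not None


if not sgmllib_available():
    # Per the comment quoted from feedparser/sgml.py, this is the case where
    # ill-formed XML parsing and content sanitizing are lost.
    warnings.warn('sgmllib not found; content may not be sanitized')
```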

Next steps:

- Check if sanitization works ~~(without sgmllib)~~. If it does, see if it is possible to turn it off, and force it on (also, mark all entries as stale). If it doesn't, warn in the documentation and force it off (so we get consistent results).
- Similarly, force relative link resolution off.
- Check why the relative.{rss,atom} tests were passing.
- Document what feedparser features reader is using;
  - even better, be explicit about all configurable behaviors and don't leave them up to defaults.
- Re-implement sanitization (html5lib.filters.sanitizer.Filter or Bleach should do it).
  - 2021-10 update: relevant feedparser issue: kurtmckee/feedparser#257 ("Use bleach for HTML sanitizing").
- Re-implement relative link resolution (we'll probably need to expose the base attribute).
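To make the last step more concrete, here is a rough sketch of resolving links inside content using only the standard library. It is not how feedparser or reader do it; `LinkResolver` and `resolve_links` are hypothetical names, only `href` and `src` are treated as URLs, and the HTML is re-serialized rather than preserved byte for byte.

```python
from html import escape
from html.parser import HTMLParser
from urllib.parse import urljoin

# Assumption: only these attributes hold URLs; real HTML has more (srcset, etc.).
URL_ATTRS = {'href', 'src'}


class LinkResolver(HTMLParser):
    """Re-serialize HTML, making href/src values absolute against a base URL."""

    def __init__(self, base):
        # Keep character references as-is so text is passed through unchanged.
        super().__init__(convert_charrefs=False)
        self.base = base
        self.parts = []

    def _format_tag(self, tag, attrs, closing=''):
        out = [tag]
        for name, value in attrs:
            if value is None:  # attribute without a value, e.g. <input disabled>
                out.append(name)
                continue
            if name in URL_ATTRS:
                value = urljoin(self.base, value)
            # The parser unescapes attribute values, so escape them again.
            out.append(f'{name}="{escape(value, quote=True)}"')
        return '<' + ' '.join(out) + closing + '>'

    def handle_starttag(self, tag, attrs):
        self.parts.append(self._format_tag(tag, attrs))

    def handle_startendtag(self, tag, attrs):
        self.parts.append(self._format_tag(tag, attrs, ' /'))

    def handle_endtag(self, tag):
        self.parts.append(f'</{tag}>')

    def handle_data(self, data):
        self.parts.append(data)

    def handle_entityref(self, name):
        self.parts.append(f'&{name};')

    def handle_charref(self, name):
        self.parts.append(f'&#{name};')

    def handle_comment(self, data):
        self.parts.append(f'<!--{data}-->')


def resolve_links(html, base):
    parser = LinkResolver(base)
    parser.feed(html)
    parser.close()
    return ''.join(parser.parts)
```

For example, `resolve_links('<a href="/w/2019/07/21/reliability/">put forth</a>', 'https://rachelbythebay.com/w/atom.xml')` should give the absolute link shown earlier. Doing this in the same pass as sanitization would avoid parsing the HTML twice.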

lemon24 · 2020-04-27T13:19:20Z

So in the end, I made sgmllib3k a required dependency, and forced sanitization and link resolution on (see the commits referencing this issue).

We can consider the problem fixed; the "ideally" part of the comment above can be considered a feature request.

lemon24 · 2020-04-28T15:12:55Z

Deploying 1.0 doesn't seem to fix it...

Update: Turns out it's update_feeds()'s fault; see #164 for details.

lemon24 · 2023-12-03T08:05:51Z

A few quick thoughts on how re-implementing sanitization would work:

- (optional) add a .sanitized flag to data objects, default false, feedparser true
- (optional) add a "before entry update" hook to allow a plugin to sanitize the content before it's stored
- add a get_entries()/get_feeds() hook/plugin that lazily sanitizes attributes at runtime (noop if .sanitized == true)
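The flag plus the lazy hook could look roughly like this. It is a sketch under assumptions: `Content` here is a stand-in dataclass rather than reader's actual data object, `ensure_sanitized` is a hypothetical helper, and the sanitizer is passed in instead of being tied to a specific library.

```python
from dataclasses import dataclass, replace
from typing import Callable


@dataclass(frozen=True)
class Content:
    # Stand-in for a content data object; the real one has more fields.
    value: str
    sanitized: bool = False  # would be True for content coming from feedparser


def ensure_sanitized(content: Content, sanitize: Callable[[str], str]) -> Content:
    """Return content unchanged if already sanitized, else a sanitized copy."""
    if content.sanitized:
        return content  # noop, as described in the last bullet
    return replace(content, value=sanitize(content.value), sanitized=True)
```

A get_entries() hook would call `ensure_sanitized()` on each content object before returning the entry, so anything stored as already sanitized pays no extra cost.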

Note:

- sanitization refers strictly to HTML attributes
- ideally, relative link resolution would also be part of this, since we want to avoid parsing the HTML twice
  - bleach seems to have some hooks out of the box https://bleach.readthedocs.io/en/latest/linkify.html
  - I need to think about how resolution works more; per summary of this issue, "relative to original page" does not work for us, we want something like
    - link (same page) -> fragment link (same page)
    - relative link to another page -> link to corresponding entry, if we have it (how do we find out?); this only makes sense in the context of the web app
    - anything else -> external (absolute) link
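One possible reading of the three cases, as a sketch only. `classify_link` and `known_entries` (a mapping from entry URL to entry id) are hypothetical, and the "how do we find out?" question is reduced to an exact URL match, which is almost certainly too naive.

```python
from urllib.parse import urldefrag, urljoin


def classify_link(href, page_url, known_entries):
    """Return (kind, target) for a link found in an entry's content.

    page_url is the entry's own URL; known_entries maps entry URLs to ids
    (an assumption about how entries would be looked up).
    """
    absolute = urljoin(page_url, href)
    target, fragment = urldefrag(absolute)
    if fragment and target == urldefrag(page_url)[0]:
        # Same page: keep only the fragment.
        return ('fragment', '#' + fragment)
    if target in known_entries:
        # Another page we have as an entry: point at that entry.
        return ('entry', known_entries[target])
    # Everything else: external, absolute link.
    return ('external', absolute)
```

The 'entry' case only makes sense in the web app, where an entry id can be turned into a local URL.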

lemon24 added the bug, core, and web app labels on Aug 12, 2019

hrw added a commit to hrw/very-simple-planet-aggregator that referenced this issue Jan 22, 2020

require sgmllib3k to get relative urls fixed

4fa3c19

On my blog I use relative urls for links and images. This makes them proper long links. kurtmckee/feedparser#43 lemon24/reader#125

lemon24 mentioned this issue Mar 2, 2020

Cannot search through entries #122

Closed

lemon24 mentioned this issue Apr 9, 2020

Fix sanitization #157

Closed

lemon24 added a commit that referenced this issue Apr 27, 2020

Test for broken relative links inside content and for sanitization.

206052d

For #125 / #157.

lemon24 added a commit that referenced this issue Apr 27, 2020

Require sgmllib3k.

6ccc4bd

For #125 / #157.

lemon24 added a commit that referenced this issue Apr 27, 2020

Force relative link resolution and content sanitization ON.

97672a0

For #125 / #157.

lemon24 added a commit that referenced this issue Apr 27, 2020

Force relative link resolution and content sanitization ON.

d15f55b

For #125 / #157.

lemon24 added a commit that referenced this issue Apr 27, 2020

Mark feeds as stale for #125.

2b05e6a

For #125 / #157.

lemon24 mentioned this issue Apr 29, 2020

update_feeds() does not continue after a feed fails #164

Closed

lemon24 closed this as completed Apr 29, 2020

lemon24 reopened this Apr 29, 2020

lemon24 mentioned this issue Mar 20, 2021

JSON feed content is not sanitized #227

Open

lemon24 mentioned this issue Nov 18, 2021

Consider supporting alternative feed parsers #264

Closed
