reader treats all bozo feeds as errors #270

Open
lemon24 opened this issue Jan 29, 2022 · 1 comment

lemon24 commented Jan 29, 2022

reader treats all bozo feeds as errors, even if the loose parser managed to parse them:

<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <title>title</title>
  <updated>2021-12-18T11:00:00</updated>
  <id>http://example.com/</id>
  <entry>
    <id>http://example.com/entry</id>
    <updated>2021-07-29T00:00:00</updated>
    <content type="html">
        &#39; &amp; &gt; &ldquo; &lt; &quot; &rdquo; &rsquo;
    </content>
  </entry>
</feed>

{
    'bozo': 1,
    'bozo_exception': SAXParseException('undefined entity'),
    'encoding': 'utf-8',
    'entries': [
        {
            'content': [
                {
                    'base': '',
                    'language': None,
                    'type': 'text/html',
                    'value': '\' & > “ < " ” ’',
                }
            ],
            'id': 'http://example.com/entry',
            'summary': '\' & > “ < " ” ’',
            ...
        }
    ],
    'feed': {
        'id': 'http://example.com/',
        'title': 'title',
        ...
    },
    'headers': {},
    'namespaces': {'': 'http://www.w3.org/2005/Atom'},
    'version': 'atom10',
}

We still need a heuristic to tell that apart from complete garbage (version, and the presence of entries?):

>>> feedparser.parse("garbage")
{'bozo': 1, 'entries': [], 'feed': {}, 'headers': {}, 'encoding': 'utf-8', 'version': '', 'bozo_exception': SAXParseException('syntax error'), 'namespaces': {}}
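
A minimal sketch of the heuristic suggested above (a non-empty version plus at least one entry), applied to a feedparser-style result. The function name `looks_like_feed` is hypothetical, and the sample dicts are trimmed copies of the two results shown in this issue. It is an illustration under those assumptions, not what reader does.

```python
def looks_like_feed(result):
    """Guess whether a feedparser-style result is usable.

    Assumption: a bozo result is still usable if the parser detected
    a feed version and produced at least one entry.
    """
    if not result.get('bozo'):
        # The strict parser succeeded; nothing to second-guess.
        return True
    return bool(result.get('version')) and bool(result.get('entries'))


# Trimmed copies of the two results shown in this issue.
survivable = {
    'bozo': 1,
    'version': 'atom10',
    'entries': [{'id': 'http://example.com/entry'}],
    'feed': {'id': 'http://example.com/', 'title': 'title'},
}
garbage = {'bozo': 1, 'version': '', 'entries': [], 'feed': {}}

print(looks_like_feed(survivable))
print(looks_like_feed(garbage))
```

One caveat: this also rejects a bozo feed that is otherwise fine but has no entries yet, so requiring entries may be too strict.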

lemon24 commented Feb 6, 2022

Some conclusions from playing with the Atom feed below:

  • xml.sax.SAXParseException "undefined entity" is survivable.
  • "mismatched tag" is not: we get all the good entries before it, then the broken entry in a bad state (e.g. all of its content ends up in <title>); entries after it are usually missing, but not always.
  • It may be worth finding what other kinds of errors can be encountered... (all of them).
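
One way to act on the conclusions above is to classify the strict parser's exception by its message. The sketch below uses only the standard library's `xml.sax`; the names `classify_sax_error`, `strict_parse_error`, and the two message sets are hypothetical. The message strings come from the expat-based parser and may differ with other SAX drivers, so treat the sets as a starting point rather than a complete list.

```python
import xml.sax

# Assumption: only the two messages discussed in this issue are
# classified; everything else is reported as "unknown".
SURVIVABLE_MESSAGES = {"undefined entity"}
FATAL_MESSAGES = {"mismatched tag"}


def classify_sax_error(exc):
    """Map a SAXParseException to 'survivable', 'fatal', or 'unknown'."""
    message = exc.getMessage()
    if message in SURVIVABLE_MESSAGES:
        return "survivable"
    if message in FATAL_MESSAGES:
        return "fatal"
    return "unknown"


def strict_parse_error(data):
    """Return the exception raised by a strict parse, or None."""
    try:
        xml.sax.parseString(data, xml.sax.ContentHandler())
    except xml.sax.SAXParseException as exc:
        return exc
    return None


print(classify_sax_error(strict_parse_error(b"<a>&ldquo;</a>")))
print(classify_sax_error(strict_parse_error(b"<a><b></a>")))
```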

Also, when the loose parser is used, the feed should be considered stale; that is, we should always prefer entries from the non-broken feed.

I'm thinking of something like this:

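| existing | parsed | desired behavior  | current behavior               |
|----------|--------|-------------------|--------------------------------|
| none     | any    | use new (any)     | yes                            |
| any      | strict | use new (strict)  | yes (hash takes care of it)    |
| strict   | loose  | keep old (strict) | no (different hash => update)  |
| loose    | loose  | use new (loose)   | yes (hash takes care of it)    |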

This would favor feeds that are temporarily broken, and eventually get fixed. For feeds that become permanently broken, it results in old strict entries not receiving updates.
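
The desired behavior above reduces to a single rule: keep the old data only when the stored feed was parsed strictly and the new one was parsed loosely. A minimal sketch, assuming each stored feed remembers which parser produced it; the function name `should_use_new` and the `None`/`"strict"`/`"loose"` values are hypothetical and not part of reader's API.

```python
def should_use_new(existing, parsed):
    """Decide whether newly parsed data should replace stored data.

    existing: None (nothing stored), "strict", or "loose"
    parsed:   "strict" or "loose" (which parser produced the new data)
    """
    if existing not in (None, "strict", "loose"):
        raise ValueError(f"unexpected existing state: {existing!r}")
    if parsed not in ("strict", "loose"):
        raise ValueError(f"unexpected parsed state: {parsed!r}")
    # Only a strict -> loose transition keeps the old (strict) data.
    return not (existing == "strict" and parsed == "loose")


for existing in (None, "strict", "loose"):
    for parsed in ("strict", "loose"):
        action = "use new" if should_use_new(existing, parsed) else "keep old"
        print(existing, parsed, action)
```

The hash comparison would still run after this check; the sketch only covers the extra strict/loose decision.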

<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">

    <entry>
        <id>one</id>
        <title>1</title>
        <summary>i</summary>
    </entry>
    <entry>
        <id>two</id>
        <title>Atom-Powered Robots Run Amok
        <summary>Summary.&veryundefinedentity;
        <content>Content.</content>
    </entry>
    <entry>
        <id>three</id>
        <title>3</title>
        <summary>iii</summary>
    </entry>

</feed>
