Consider using Atoma #263

Closed
lemon24 opened this issue Nov 18, 2021 · 2 comments

@lemon24
Owner

lemon24 commented Nov 18, 2021

In light of the various issues feedparser has (see #265), I think it would be wise to consider other feed parser implementations.

In this issue, we'll look at https://github.com/NicolasLM/atoma; my comments are [in brackets]:

Features:

  • RSS 2.0 - RSS 2.0 Specification
  • Atom Syndication Format v1 - RFC4287
  • JSON Feed v1 - JSON Feed specification [including v1.1]
  • OPML 2.0, to share lists of feeds - OPML 2.0
  • Typed: feeds decomposed into meaningful Python objects
  • Secure: uses defusedxml to load untrusted feeds [no plain etree, no lxml]
  • Compatible with Python 3.6+

Non-implemented Features:

  • XML signature and encryption [likely not needed]
  • Some Atom and RSS extensions [although feedparser may have more, I don't think reader uses them]
  • Atom content other than text, html and xhtml [likely OK]

Of note, it:

  • Seems actively developed.
  • Supports passing open files (but you need to know what type of feed you have for this).
  • Does not do sanitization.
  • Does not support relative link resolution (nor does it expose base).
  • Does not support various kinds of malformed feeds (see comment below).
  • Seems to use much less memory.
@lemon24
Owner Author

lemon24 commented Nov 18, 2021

I did a comparison between feedparser and atoma, by parsing 157 feeds from disk.

atoma seems to be faster and consume significantly less memory (for a fair comparison, feedparser had both sanitization and relative link resolution disabled).

noop doesn't do anything with the feeds, to provide a baseline.

# impl time(s) maxrss(MiB)

# Ubuntu 20.04, Python 3.8.10

feedparser 9.0 61
atoma 1.5 28
noop 0.0 20

# macOS Catalina, Python 3.8.10

feedparser 14.5 56
atoma 2.3 29
noop 0.0 18
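
One way to read the table in relative terms is sketched below. The numbers are copied from the table; subtracting the noop baseline from maxrss is an assumption about how to isolate parser overhead, not something the measurements themselves prescribe, and the `compare` helper is hypothetical.

```python
# (time in seconds, maxrss in MiB), copied from the table above.
results = {
    'Ubuntu 20.04': {'feedparser': (9.0, 61), 'atoma': (1.5, 28), 'noop': (0.0, 20)},
    'macOS Catalina': {'feedparser': (14.5, 56), 'atoma': (2.3, 29), 'noop': (0.0, 18)},
}


def compare(rows):
    """Return (speedup, memory ratio net of baseline) of atoma vs. feedparser."""
    fp_time, fp_rss = rows['feedparser']
    at_time, at_rss = rows['atoma']
    _, base_rss = rows['noop']
    speedup = fp_time / at_time
    # Assumption: the noop run approximates interpreter + script overhead.
    memory_ratio = (fp_rss - base_rss) / (at_rss - base_rss)
    return round(speedup, 1), round(memory_ratio, 1)


for platform, rows in results.items():
    print(platform, *compare(rows))
```

Under that assumption, atoma is about 6 times faster on both platforms, and feedparser uses roughly 3.5 to 5 times more memory above the baseline.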

Unfortunately, atoma fails to parse some of the feeds:

error: _feeds/https-blog-nelhage-com-atom-xml.atom: Could not parse feed: "rss" does not have a "feed:id"
error: _feeds/https-nedbatchelder-com-blog-rss-xml.rss: Cannot process RSS feed version "None"
error: _feeds/https-ciechanow-ski-atom-xml.atom: Could not parse feed: "rss" does not have a "feed:id"
error: _feeds/http-www-xn-8ws00zhy3a-com-feed.atom: EntitiesForbidden(name='xhtml', system_id=None, public_id=None)
error: _feeds/https-www-reddit-com-r-oilshell-rss.rss: Not a valid XML document
error: _feeds/https-blog-ncase-me-rss.rss: Cannot process RSS feed version "None"
error: _feeds/https-danluu-com-atom-xml.atom: Could not parse feed: "rss" does not have a "feed:id"
error: _feeds/https-blogs-dropbox-com-tech-feed.rss: Could not parse feed: "url" text is required but is empty

The EntitiesForbidden error is due to using defusedxml (#212 (comment)).
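
Three of the errors above mention an "rss" element in files saved with an .atom extension, which suggests (though nothing here confirms it) that the parser was chosen from the file name and not from the content. A minimal sketch of picking the feed type from the root element follows. `sniff_feed_type` is a hypothetical helper, not part of atoma or feedparser, and it uses the plain standard-library parser, so for untrusted input the defusedxml equivalent would be the safer choice.

```python
import io
import xml.etree.ElementTree as ET

ATOM_NS = '{http://www.w3.org/2005/Atom}'


def sniff_feed_type(data: bytes) -> str:
    """Return 'json_feed', 'rss', 'atom', or 'unknown' for raw feed bytes."""
    # JSON Feed documents are JSON objects, so they start with '{'.
    if data.lstrip()[:1] == b'{':
        return 'json_feed'
    try:
        # Only the first "start" event is needed: it carries the root tag.
        # Caveat: plain ElementTree, not defusedxml (see the lead-in).
        for _event, element in ET.iterparse(io.BytesIO(data), events=('start',)):
            root_tag = element.tag
            break
        else:
            return 'unknown'
    except ET.ParseError:
        return 'unknown'
    if root_tag == 'rss':
        return 'rss'
    if root_tag == ATOM_NS + 'feed':
        return 'atom'
    return 'unknown'
```

The result could then select a parser by name, for example `getattr(atoma, f'parse_{kind}_bytes')`, assuming atoma's `parse_rss_bytes`, `parse_atom_bytes` and `parse_json_feed_bytes` functions. This would not help with the other errors (missing RSS version, forbidden entities, invalid XML, empty required elements).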

The script I used:
import sys, time, resource
import feedparser, atoma

def feedparser_parse(path, file):
    # Disable the work atoma doesn't do, for a fair comparison.
    return feedparser.parse(
        file,
        resolve_relative_uris=False,
        sanitize_html=False,
    )

def atoma_parse(path, file):
    # Pick parse_rss_file / parse_atom_file based on the file extension.
    return getattr(atoma, f'parse_{path.rpartition(".")[2]}_file')(file)

def noop_parse(*_): pass

# The implementation name is the first argument; feed paths come from stdin.
impl = sys.argv[1]
parse = locals()[f'{impl}_parse']

# Total parse time in seconds; failed parses are reported but not counted.
timings = 0
for line in sys.stdin:
    path = line.rstrip()
    with open(path, 'rb') as file:
        try:
            start = time.perf_counter()
            parse(path, file)
            end = time.perf_counter()
            timings += end - start
        except Exception as e:
            print(f'error: {path}: {e}', file=sys.stderr)

print(
    impl, round(timings, 1),
    # ru_maxrss is in bytes on macOS and in kilobytes on Linux; convert to MiB.
    int(
        resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
        / 2 ** (20 if sys.platform == 'darwin' else 10)
    ),
)
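
The platform check at the end of the script is there because `ru_maxrss` is reported in bytes on macOS but in kilobytes on Linux. A small sketch that isolates the conversion, so it can be checked on its own; the `maxrss_mib` name is hypothetical and not part of the script above.

```python
import sys


def maxrss_mib(ru_maxrss: int, platform: str = sys.platform) -> int:
    """Convert a raw ru_maxrss value to whole MiB."""
    # macOS reports bytes; Linux reports kilobytes.
    divisor = 2 ** 20 if platform == 'darwin' else 2 ** 10
    return int(ru_maxrss / divisor)
```

For example, `maxrss_mib(61 * 1024, 'linux')` and `maxrss_mib(61 * 2 ** 20, 'darwin')` both give 61.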

@lemon24
Owner Author

lemon24 commented Jan 29, 2022

Closing in favor of #265.

@lemon24 lemon24 closed this as completed Jan 29, 2022