Consider using Atoma #263
I did a comparison between feedparser and atoma, by parsing 157 feeds from disk. atoma seems to be faster and consume significantly less memory (for a fair comparison, feedparser had both sanitization and relative link resolution disabled). noop doesn't do anything with the feeds, to provide a baseline.
Unfortunately, atoma doesn't support some of the RSS feeds:
The EntitiesForbidden error is due to using defusedxml (#212 (comment)).
The script I used:
import sys, time, resource
import feedparser, atoma

def feedparser_parse(path, file):
    # sanitization and relative link resolution are disabled for a fair comparison
    return feedparser.parse(
        file,
        resolve_relative_uris=False,
        sanitize_html=False,
    )

def atoma_parse(path, file):
    # the file extension selects the atoma function, e.g. .rss -> parse_rss_file
    return getattr(atoma, f'parse_{path.rpartition(".")[2]}_file')(file)

def noop_parse(*_): pass

# implementation name comes from the command line: feedparser, atoma, or noop
impl = sys.argv[1]
parse = locals()[f'{impl}_parse']

timings = 0
# feed paths are read from stdin, one per line
for line in sys.stdin:
    path = line.rstrip()
    with open(path, 'rb') as file:
        try:
            start = time.perf_counter()
            parse(path, file)
            end = time.perf_counter()
            timings += end - start
        except Exception as e:
            print(f'error: {path}: {e}', file=sys.stderr)

# ru_maxrss is in bytes on macOS and in kilobytes on Linux; report MiB
print(
    impl, round(timings, 1),
    int(
        resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
        / 2 ** (20 if sys.platform == 'darwin' else 10)
    ),
)
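Two parts of the script are easy to misread, so here is a minimal standalone sketch of them, using only the standard library. The helper names `atoma_function_name` and `peak_rss_mib` are hypothetical and are not part of the script or of atoma. The sketch assumes the feeds on disk are named with an extension that matches an atoma parser (for example `.rss` or `.atom`).

```python
import resource
import sys


def atoma_function_name(path):
    """Build the atoma function name the script looks up for a feed path.

    rpartition(".")[2] returns everything after the last dot, so
    "feeds/example.rss" gives "rss". If the path has no dot at all, the
    whole path is returned, and the getattr() lookup in the script would
    then fail.
    """
    extension = path.rpartition(".")[2]
    return f"parse_{extension}_file"


def peak_rss_mib():
    """Return the peak resident set size of this process in MiB.

    ru_maxrss is reported in bytes on macOS and in kilobytes on Linux,
    hence the platform-dependent divisor.
    """
    max_rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    return max_rss / 2 ** (20 if sys.platform == "darwin" else 10)


if __name__ == "__main__":
    print(atoma_function_name("feeds/example.rss"))
    print(atoma_function_name("feeds/example.atom"))
    print(round(peak_rss_mib(), 1), "MiB")
```

Because the memory figure is a peak for the whole process, each implementation has to run in its own process for the numbers to be comparable, which is why the script takes the implementation name as an argument and the feed paths on stdin.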
Closing in favor of #265.
In light of the various issues feedparser has (see #265), I think it would be wise to consider other feed parser implementations.
In this issue, we'll look at https://github.com/NicolasLM/atoma; my comments [in brackets]:
Of note, it:
base
).