[Feature Request] Parsing HTML5 Pages #309

Closed
sjehuda opened this issue May 19, 2022 · 7 comments

@sjehuda

sjehuda commented May 19, 2022

It appears that the only feed reader that handles <article> elements is Liferea, by Lars Windolf (@lwindolf).

Introduction:
Subscribing To Html5 Websites That Have No Feed

First commit:
Add support for subscribing to HTML5 websites without RSS/Atom feeds by extracting article titles, links and descriptions

Last commit to date:
Improve HTML5 extraction: extract … if it exists and no article was found

Test page:
https://miranda-ng.org/
https://www.brandenburg.de/
http://intertwingly.net/blog/

Frankly, this is one of the best features of Liferea to date, mainly because novice users don't need to do any scraping themselves for pages that use <article> tags.

@kurtmckee
Owner

You haven't actually reported a bug or requested a feature. I can guess what the point is, but please modify the text of your issue to include a feature request or a bug report. Thanks!

@sjehuda changed the title from "Parsing HTML5 Pages" to "[Feature Request] Parsing HTML5 Pages" on May 19, 2022

@sjehuda
Author

sjehuda commented May 19, 2022

Title corrected accordingly

@kurtmckee
Owner

Thanks! So the request is: support extracting feed items directly from HTML data?

@sjehuda
Author

sjehuda commented May 19, 2022 via email

@sjehuda
Author

sjehuda commented Jul 14, 2022

I think this is unacceptable.
Software should do one thing and do it well.

I want to close this issue (or change it).
If you want, I can provide, for the feedparser documentation, a complementary script that will scrape <article> elements, guess a Title and Summary for each using lxml (XPath), and output an Atom feed using feedparser.
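
For readers wondering what such a script might look like, here is a minimal sketch, not the script offered above. To stay dependency-free it swaps in the standard-library `html.parser` in place of lxml/XPath, and because feedparser parses feeds rather than writing them, the Atom output is built with `xml.etree.ElementTree`. The names `ArticleExtractor`, `build_atom` and `html_to_atom` are hypothetical, and the guessing rules (first heading is the title, first link is the link, first paragraph is the summary) are assumptions. The output also omits elements a valid Atom feed requires, such as `id` and `updated`.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
import xml.etree.ElementTree as ET

ATOM_NS = "http://www.w3.org/2005/Atom"


class ArticleExtractor(HTMLParser):
    """Hypothetical helper: guess title, link and summary per <article>."""

    HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self.items = []     # one dict per <article> found
        self._item = None   # the <article> currently being read, if any
        self._field = None  # "title" or "summary" while text is captured

    def handle_starttag(self, tag, attrs):
        if tag == "article":
            self._item = {"title": "", "link": "", "summary": ""}
        elif self._item is None:
            return  # ignore everything outside <article>
        elif tag in self.HEADINGS and not self._item["title"]:
            self._field = "title"      # assumption: first heading is the title
        elif tag == "p" and not self._item["summary"]:
            self._field = "summary"    # assumption: first paragraph is the summary
        elif tag == "a" and not self._item["link"]:
            self._item["link"] = dict(attrs).get("href") or ""

    def handle_data(self, data):
        if self._item is not None and self._field:
            self._item[self._field] += data

    def handle_endtag(self, tag):
        if tag == "article" and self._item is not None:
            # Collapse runs of whitespace before storing the item.
            self.items.append(
                {k: " ".join(v.split()) for k, v in self._item.items()}
            )
            self._item = None
            self._field = None
        elif tag in self.HEADINGS or tag == "p":
            self._field = None


def build_atom(items, feed_title, feed_url):
    """Serialize the items as a bare-bones Atom document (no id/updated)."""
    ET.register_namespace("", ATOM_NS)
    feed = ET.Element(f"{{{ATOM_NS}}}feed")
    ET.SubElement(feed, f"{{{ATOM_NS}}}title").text = feed_title
    ET.SubElement(feed, f"{{{ATOM_NS}}}link", href=feed_url)
    for item in items:
        entry = ET.SubElement(feed, f"{{{ATOM_NS}}}entry")
        ET.SubElement(entry, f"{{{ATOM_NS}}}title").text = item["title"]
        # Relative links are resolved against the page URL; an empty link
        # falls back to the page URL itself.
        href = urljoin(feed_url, item["link"])
        ET.SubElement(entry, f"{{{ATOM_NS}}}link", href=href)
        ET.SubElement(entry, f"{{{ATOM_NS}}}summary").text = item["summary"]
    return ET.tostring(feed, encoding="unicode")


def html_to_atom(html, feed_title, feed_url):
    parser = ArticleExtractor()
    parser.feed(html)
    parser.close()
    return build_atom(parser.items, feed_title, feed_url)
```

The string returned by `html_to_atom` could then be handed to `feedparser.parse()`, which accepts feed content as a string as well as a URL, so the rest of a reader's pipeline would not need to know the entries came from HTML.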

If someone has a problem with websites not providing web feeds (probably because they are unaware of this technology), contact the web admins. It's a better solution.

What do you think, @kurtmckee?

@kurtmckee
Owner

I would still like to see this in feedparser in the future, using the h-feed spec as a guide. For now, I'm fine with closing this issue.

@sjehuda
Author

sjehuda commented Jul 14, 2022

I didn't know there was a spec for <article>.
The h-feed spec is definitely worth adding.
I just don't think that something so specific, especially something that can be done by a relatively simple external script, is a sensible addition to feedparser.

Thank you for sharing h-feed!
