[Feature Request] Parsing HTML5 Pages #309

sjehuda · 2022-05-19T03:37:34Z

It appears that the only feed reader that handles <article> tags is Liferea, by Mr. Lars Windolf (@lwindolf).

Introduction:
Subscribing To Html5 Websites That Have No Feed

First commit:
Add support for subscribing to HTML5 websites without RSS/Atom feeds by extracting article titles, links and descriptions

Last commit to date:
Improve HTML5 extraction: extract

if it exists and no article was found

Test page:
https://miranda-ng.org/
https://www.brandenburg.de/
http://intertwingly.net/blog/

Frankly, this is one of the best features of Liferea to date, mainly because novice users don't need to handle scraping for pages with <article> tags.

kurtmckee · 2022-05-19T11:58:08Z

You haven't actually reported a bug or requested a feature. I can guess what the point is, but please modify the text of your issue to include a feature request or a bug report. Thanks!

sjehuda · 2022-05-19T13:04:00Z

Title corrected accordingly

kurtmckee · 2022-05-19T13:35:21Z

Thanks! So the request is: support extracting feed items directly from HTML data?

sjehuda · 2022-05-19T15:44:40Z

On Thu, 19 May 2022 06:35:32 -0700, Kurt McKee wrote:
> Thanks! So the request is: support extracting feed items directly from HTML data?

Yes, but only on certain occasions, just like Liferea. Of course, this leaves us with limited options because we are guessing what an <article> entry contains. Apparently, some websites that don't provide feeds are useful when treated as feeds, hence I think a very specific guessing mechanism is worth having.

sjehuda · 2022-07-14T13:21:06Z

I think this is unacceptable.
Software should do one thing and do it well.

I want to close this issue (or change it).
If you want, I can provide, for the feedparser documentation, a complementary script that will scrape and guess the Title and Summary from <article> elements using lxml (XPath) and output an Atom feed using feedparser.
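A minimal sketch of what such a complementary script could look like. It is an illustration, not the script offered above: it swaps in the standard library's `html.parser` and `xml.etree.ElementTree` for lxml/XPath so that it runs without extra dependencies. The names `ArticleScraper` and `to_atom`, the sample markup, and the guessing rules (first heading as title, first link as URL, first paragraph as summary) are all assumptions. The output also omits the `updated` elements that a strictly valid Atom feed requires, since the page provides no dates.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
import xml.etree.ElementTree as ET

ATOM_NS = "http://www.w3.org/2005/Atom"
HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class ArticleScraper(HTMLParser):
    """Hypothetical helper: guess a title, link and summary per <article>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.entries = []    # one dict per top-level <article>
        self._entry = None   # article currently being read, if any
        self._field = None   # "title" or "summary" while inside that tag
        self._depth = 0      # nesting depth of <article> elements

    def handle_starttag(self, tag, attrs):
        if tag == "article":
            self._depth += 1
            if self._depth == 1:
                self._entry = {"title": "", "link": "", "summary": ""}
            return
        if self._entry is None:
            return
        if tag in HEADINGS and not self._entry["title"]:
            self._field = "title"      # guess: first heading is the title
        elif tag == "p" and not self._entry["summary"]:
            self._field = "summary"    # guess: first paragraph is the summary
        elif tag == "a" and not self._entry["link"]:
            self._entry["link"] = dict(attrs).get("href") or ""

    def handle_endtag(self, tag):
        if tag == "article" and self._depth:
            self._depth -= 1
            if self._depth == 0:
                self.entries.append(self._entry)
                self._entry = None
                self._field = None
        elif tag in HEADINGS or tag == "p":
            self._field = None

    def handle_data(self, data):
        if self._entry is not None and self._field:
            self._entry[self._field] += data


def to_atom(entries, feed_title, feed_url):
    """Serialize the guessed entries as a bare-bones Atom document."""
    ET.register_namespace("", ATOM_NS)
    feed = ET.Element(f"{{{ATOM_NS}}}feed")
    ET.SubElement(feed, f"{{{ATOM_NS}}}title").text = feed_title
    ET.SubElement(feed, f"{{{ATOM_NS}}}id").text = feed_url
    for item in entries:
        entry = ET.SubElement(feed, f"{{{ATOM_NS}}}entry")
        ET.SubElement(entry, f"{{{ATOM_NS}}}title").text = item["title"].strip()
        link = urljoin(feed_url, item["link"])  # resolve relative links
        ET.SubElement(entry, f"{{{ATOM_NS}}}link", href=link)
        ET.SubElement(entry, f"{{{ATOM_NS}}}id").text = link
        ET.SubElement(entry, f"{{{ATOM_NS}}}summary").text = item["summary"].strip()
    return ET.tostring(feed, encoding="unicode")


# Made-up sample page; a real script would download the HTML first.
SAMPLE = """
<main>
  <article>
    <h2><a href="/posts/1">First post</a></h2>
    <p>Summary of the first post.</p>
  </article>
  <article>
    <h2><a href="/posts/2">Second post</a></h2>
    <p>Summary of the second post.</p>
  </article>
</main>
"""

scraper = ArticleScraper()
scraper.feed(SAMPLE)
atom = to_atom(scraper.entries, "Example site", "https://example.org/")
print(atom)
```

Note that feedparser itself only parses feeds. In a setup like this it would sit at the consuming end, for example by passing the generated string to `feedparser.parse()`.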

If someone has a problem with websites not providing web feeds (probably because they are unaware of this technology), contact the web admins. It's a better solution.

What do you think, @kurtmckee?

kurtmckee · 2022-07-14T13:39:07Z

I would still like to see this in feedparser in the future, using the h-feed spec as a guide. For now, I'm fine with closing this issue.
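For readers unfamiliar with h-feed: it is a microformats2 vocabulary in which a page marks each post with the class `h-entry` and its properties with classes such as `p-name`, `u-url` and `p-summary`. The sketch below shows the idea with the standard library only. It is not feedparser's API and not the full microformats2 parsing algorithm (it ignores implied properties, nested microformats and unclosed tags); the class name `HEntryParser` and the sample markup are assumptions.

```python
from html.parser import HTMLParser

# Elements that never have an end tag, so they must not be pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class HEntryParser(HTMLParser):
    """Hypothetical helper: collect p-name, u-url and p-summary per h-entry."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.entries = []
        self._stack = []    # one (is_entry_root, text_property) pair per open element
        self._entry = None  # h-entry currently being read, if any

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        is_root = "h-entry" in classes and self._entry is None
        if is_root:
            self._entry = {"name": "", "url": "", "summary": ""}
        prop = None
        if self._entry is not None and not is_root:
            if "u-url" in classes and not self._entry["url"]:
                self._entry["url"] = attrs.get("href") or ""
            if "p-name" in classes:
                prop = "name"
            elif "p-summary" in classes:
                prop = "summary"
        self._stack.append((is_root, prop))

    def handle_data(self, data):
        if self._entry is None:
            return
        # Text belongs to the nearest enclosing element that carries a property.
        for _, prop in reversed(self._stack):
            if prop:
                self._entry[prop] += data
                break

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self._stack:
            return
        is_root, _ = self._stack.pop()
        if is_root:
            self.entries.append(self._entry)
            self._entry = None


# Made-up sample markup following the h-feed / h-entry class names.
SAMPLE = """
<div class="h-feed">
  <article class="h-entry">
    <a class="p-name u-url" href="https://example.org/a">Post A</a>
    <p class="p-summary">About <em>A</em>.</p>
  </article>
  <article class="h-entry">
    <a class="p-name u-url" href="https://example.org/b">Post B</a>
    <p class="p-summary">About B.</p>
  </article>
</div>
"""

parser = HEntryParser()
parser.feed(SAMPLE)
for entry in parser.entries:
    print(entry["name"], entry["url"], entry["summary"])
```

Unlike guessing from bare <article> elements, this relies on markup the site author added on purpose, which is why it is a firmer basis for extracting feed items from HTML.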

sjehuda · 2022-07-14T13:47:02Z

I didn't know there was a spec for <article>.
h-feed spec is definitely worth adding.
I just don't think that something so specific, especially something that can be done by a relatively simple external script, is a sensible addition to feedparser.

Thank you for sharing h-feed!

sjehuda changed the title ~~Parsing HTML5 Pages~~ [Feature Request] Parsing HTML5 Pages May 19, 2022

kurtmckee closed this as completed Jul 14, 2022

This was referenced Jul 19, 2022

h-feed support #320

Open

h-feed h-entry support mastodon/mastodon#18845

Closed

h-feed h-entry support lwindolf/liferea#1127

Closed
