Script-based website to feed conversion #265

martinrotter · 2020-08-10T09:12:53Z

Add script-based generic URL-to-feed conversion, similar to what Liferea offers:

The user will be able to enter the path to a script, including the interpreter and all command-line switches, in the "Add/Edit feed" dialog.
An enum will specify the input mode: Url (default), Script, and maybe File (like Liferea). This will be a new "input" column in the "Feeds" table.
The feed type specified in the feed's properties will be what is expected as the output of the script. Feed guessing will also work with this feature: if the user sets Script as the input, and even adds a post-processing script, the result will be used as the source for guessing.
The user will be able to specify the script "interpreter", for example "bash.exe" or "cmd.exe", along with parameters, for example "bash.exe -C %1/myscript.bash 'twitter'". This will be stored in the "url" column of the "Feeds" table if the input mode is Script.
The user will be able to set a "post-processing" filter path, which will work the same way as in Liferea. The path to the post-processing filter will be saved as a column in the "Feeds" table. This filter feature must be compatible with Liferea.
There will be a special "%scripts" placeholder for the "script" attribute (if in Script mode) and for the "post-processing" script. This placeholder will be replaced at runtime by the path to the "user data scripts" folder (probably just <rss-guard-data-folder>/scripts), which will allow users to place scripts in portable locations.
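To make the proposal above more concrete, here is a minimal Python sketch of the input-mode enum and the "%scripts" placeholder expansion. It is only an illustration, not RSS Guard's actual code: the names `InputMode`, `expand_scripts_placeholder` and `resolve_source` are hypothetical, and the data folder is passed in as a plain path.

```python
from enum import Enum
from pathlib import Path


class InputMode(Enum):
    """Hypothetical mirror of the proposed "input" column values."""
    URL = 0      # default: the "url" column holds a plain URL
    SCRIPT = 1   # the "url" column holds a command line to execute
    FILE = 2     # tentative in the proposal ("maybe File")


# Placeholder name taken from the proposal text.
SCRIPTS_PLACEHOLDER = "%scripts"


def expand_scripts_placeholder(command: str, data_folder: Path) -> str:
    """Replace every %scripts occurrence with <data_folder>/scripts."""
    scripts_dir = data_folder / "scripts"
    return command.replace(SCRIPTS_PLACEHOLDER, str(scripts_dir))


def resolve_source(mode: InputMode, stored: str, data_folder: Path) -> str:
    """Return the string that would actually be fetched or executed."""
    if mode is InputMode.SCRIPT:
        # Only Script mode interprets the placeholder; URLs stay untouched.
        return expand_scripts_placeholder(stored, data_folder)
    return stored
```

Keeping the expansion in one small function means the same helper could be reused for the post-processing filter path, which the proposal says should honour the same placeholder.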

Liferea docs:

Example "script input": php tweeper.php https://twitter.com/NSACareers (downloads RSS XML file generated from twitter page)
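A "script input" like the one above boils down to running a command line and treating its standard output as the feed data. The following Python sketch shows that idea under some assumptions: `run_script_source` is a hypothetical name, the 30-second timeout is arbitrary, and `shlex.split` uses POSIX quoting rules, so Windows paths with backslashes would need different handling.

```python
import shlex
import subprocess


def run_script_source(command: str, timeout_s: float = 30.0) -> str:
    """Run the stored command line and return its stdout as feed data."""
    # Split into argv without invoking a shell; quotes are honoured.
    argv = shlex.split(command)
    result = subprocess.run(
        argv,
        capture_output=True,  # collect stdout and stderr
        text=True,            # decode output to str
        timeout=timeout_s,    # do not hang forever on a stuck script
        check=True,           # raise CalledProcessError on non-zero exit
    )
    return result.stdout
```

With this helper, the example would be invoked as `run_script_source("php tweeper.php https://twitter.com/NSACareers")`, and the returned string would then be parsed as the feed type selected in the feed's properties.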

pcause · 2020-09-17T10:36:07Z

martin, do you know about rss-bridge? it has plugins to convert many sites without feeds to feeds. you can find it here on github

martinrotter · 2020-09-17T14:29:06Z

Will look, but it will probably be out of scope for RSS Guard, because I plan to add a very simple-to-use regular-expression-based approach which will be very universal.

pcause · 2020-09-18T19:49:05Z

wasn't suggesting you did, just thought you might want to know about this.

will you be sure to only fetch if modified since last fetch? I'm assuming you always do this, but for fetching regular pages I suspect users only want to get a new RSS item if the page has changed.

martinrotter · 2020-09-19T18:06:51Z

@pcause That is actually a very good observation. Fetching the date/time of the actual message will be very tricky. The thing is that 90% of feeds do not actually "update" messages with newer versions.

See this. In RSS Guard, two messages on DB level are considered "same" if they meet the requirements stated in the link. Therefore the "real" date/time of a message is not that important, and many feeds do not even have it, so RSS Guard is already written to take that situation into account, and I believe that many users do not realise this.

pcause · 2020-09-19T19:41:16Z

Sorry if I wasn't clear, I thought with an HTTP request you could specify to only get content when the page at the URL had changed. Since these are web sites, I thought you could do this. "Modified since" would be the last time you fetched.
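For reference, the mechanism described here is an HTTP conditional GET using the `If-Modified-Since` request header; a server that supports it answers `304 Not Modified` with no body when nothing has changed. Below is a minimal Python sketch using only the standard library. The function names are hypothetical, and nothing here reflects how RSS Guard itself fetches pages.

```python
from datetime import datetime, timezone
from email.utils import format_datetime
from typing import Optional
from urllib.error import HTTPError
from urllib.request import Request, urlopen


def build_conditional_request(url: str, last_fetch: datetime) -> Request:
    """Build a GET request that asks only for content newer than last_fetch."""
    # HTTP dates are expressed in GMT, e.g. "Mon, 10 Aug 2020 09:12:53 GMT".
    stamp = format_datetime(last_fetch.astimezone(timezone.utc), usegmt=True)
    return Request(url, headers={"If-Modified-Since": stamp})


def fetch_if_modified(url: str, last_fetch: datetime) -> Optional[bytes]:
    """Return the body if the page changed, or None on 304 Not Modified."""
    try:
        with urlopen(build_conditional_request(url, last_fetch), timeout=30) as resp:
            return resp.read()
    except HTTPError as err:
        if err.code == 304:  # urllib reports 304 as an HTTPError
            return None
        raise
```

A server that ignores the header simply returns the full page with status 200, so the caller cannot rely on `None` ever being returned.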

pcause · 2020-10-26T17:37:37Z

on this one i wonder if you can use readability to extract just the main content, or have an option to use it. that way we get the content and not the page and ads. the implementations i know of are in javascript, so not sure if there are any in c++, but maybe there are, or you can use the browser control you have to do the work

martinrotter · 2020-10-26T18:00:52Z

Yes, the "Modified since" HTTP header (If-Modified-Since) exists, but many sites simply do not implement it. Still, it is worth investigating.

What is "readability"? I don't know it, can you point me to a website?

As for this ticket, at this point it is really unclear how exactly its implementation/use-case will work; I think about it from time to time. An elaborate and well-written regular expression with named groups could do an amazing job; it really just depends on how skilled the author of the regexp is.
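To illustrate what a regular expression with named groups could look like for this purpose, here is a small Python sketch that turns matching links on a page into a bare RSS 2.0 document. Everything in it is an assumption made for the example: the `class="post"` markup, the `ITEM_RE` pattern and the `html_to_rss` function are hypothetical and do not come from RSS Guard.

```python
import re
from html import unescape
from xml.sax.saxutils import escape

# Hypothetical page markup: <a class="post" href="...">Title</a>
ITEM_RE = re.compile(
    r'<a\s+class="post"\s+href="(?P<link>[^"]+)">(?P<title>[^<]+)</a>'
)


def html_to_rss(html: str, channel_title: str) -> str:
    """Build a minimal RSS 2.0 document from every regexp match in html."""
    items = []
    for match in ITEM_RE.finditer(html):
        # Decode HTML entities first, then escape once for XML output.
        title = escape(unescape(match.group("title").strip()))
        link = escape(unescape(match.group("link")))
        items.append(f"<item><title>{title}</title><link>{link}</link></item>")
    return (
        '<?xml version="1.0" encoding="UTF-8"?>'
        f'<rss version="2.0"><channel><title>{escape(channel_title)}</title>'
        f'{"".join(items)}</channel></rss>'
    )
```

The named groups (`link`, `title`) are what map page fragments to feed fields, so a user would only need to supply the pattern; the quality of the result depends entirely on how well that pattern fits the site.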

neoavalon · 2020-11-01T04:08:21Z

Would be a good feature to have.

Just want to point out that Liferea supports a very nice and simple generalization (relative to just regexps) of this capability by allowing a user to supply a path to a conversion post-filter (a script to call w/ parameters) when creating a new feed. The filter/script runs each time the URL for the feed is retrieved and is fed the retrieved URL content via stdin. The filter/script is expected to output the converted content (in XML/RSS form) via stdout back to Liferea. So it's up to the user what they want to do in the script/filter (run regular expressions, apply xpath, invoke python code to do more fancy things, etc.).

This means not having to add new menus to enter regexps, etc. You'd need to add a new label and input box to "Add new feed" for specifying the command to run (w/ flags). This could be enabled with a checkbox. In this case the selected "Type" of feed could be the expected format of the output generated by the user's script. This would make things a bit less user friendly to more layman users (compared to specifying regexps) but gives much more (I think) flexibility in general. Just an idea.
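The stdin/stdout contract described above can be sketched from the reader's side in a few lines of Python. This is only an illustration of the idea, not Liferea's or RSS Guard's implementation: `apply_post_filter` is a hypothetical name, the timeout is arbitrary, and `shlex.split` assumes POSIX-style quoting.

```python
import shlex
import subprocess


def apply_post_filter(filter_command: str, raw: str, timeout_s: float = 30.0) -> str:
    """Pipe downloaded content through a user filter and return its output."""
    result = subprocess.run(
        shlex.split(filter_command),  # command plus flags, no shell involved
        input=raw,                    # retrieved content goes to the filter's stdin
        capture_output=True,          # converted feed comes back on stdout
        text=True,
        timeout=timeout_s,
        check=True,                   # a failing filter raises CalledProcessError
    )
    return result.stdout
```

Because the filter is an ordinary executable, the user is free to write it in any language; the reader only has to parse whatever comes back as the feed type selected for that feed.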

martinrotter · 2020-11-01T06:27:59Z

Shit that is actually nice idea. Even simpler for me to implement. Will add instead of just regexps.

martinrotter · 2020-11-01T13:32:18Z

I will update first message to reflect your ideas.

martinrotter · 2021-02-02T12:41:32Z

Working on this.

martinrotter · 2021-02-03T09:41:30Z

OK, I made significant progress and the feature is basically done. I did quite some testing and it works with all well-known interpreters, including Bash, PowerShell and PHP.

The feature is even supported by the "Fetch feed metadata" feature and is thus able to semi-automatically scrape sites like Twitter etc.

martinrotter self-assigned this Aug 10, 2020

martinrotter added the Component-Core, Priority-Low and Type-Enhancement labels Aug 10, 2020

martinrotter changed the title ~~Regular expression based website to feed conversion~~ Script-based website to feed conversion Nov 1, 2020

martinrotter mentioned this issue Nov 4, 2020

Unify behavior of "Edit feed" for feeds in synced accounts #287

Closed

10 tasks

martinrotter removed the Priority-Low label Feb 2, 2021

martinrotter added this to the 3.9.0 milestone Feb 2, 2021

martinrotter pushed a commit that referenced this issue Feb 2, 2021

Add sql for #265.

7bef56b

martinrotter pushed a commit that referenced this issue Feb 2, 2021

Working on #265.

45304b9

martinrotter pushed a commit that referenced this issue Feb 2, 2021

Working on #265.

e86882c

martinrotter pushed a commit that referenced this issue Feb 2, 2021

Working on #265.

f93d0f2

martinrotter pushed a commit that referenced this issue Feb 2, 2021

Working on #265.

2c5b014

martinrotter closed this as completed in a94c016 Feb 3, 2021

martinrotter added the Status-Fixed label Feb 3, 2021
