Script-based website to feed conversion #265

Closed
6 tasks done
martinrotter opened this issue Aug 10, 2020 · 12 comments
Assignees
Labels
Component-Core Status-Fixed Ticket is resolved. Type-Enhancement This is request for brand new feature.
Milestone

Comments

@martinrotter
Owner

martinrotter commented Aug 10, 2020

Add script-based generic URL-to-feed conversion, similar to what Liferea offers:

  • The user will be able to enter the path to a script, including the interpreter and all command-line switches, in the "Add/Edit feed" dialog.
  • An enum will specify the input mode: Url (default), Script, and maybe File (like Liferea). This will be a new "input" column in the "Feeds" table.
  • The feed type specified in the feed's properties will be the expected output format of the script. Feed guessing will also work with this feature: if the user sets Script as the input, even with a post-processing script, the result will be used as the source for guessing.
  • The user will be able to specify the script "interpreter", for example "bash.exe" or "cmd.exe", along with parameters, for example "bash.exe -C %1/myscript.bash 'twitter'". This will be stored in the "url" column of the "Feeds" table if the input mode is "Script".
  • The user will be able to set a "post-processing" filter path, which will work the same way as in Liferea. The path to the post-processing filter will be saved as a column in the "Feeds" table. These filters must be compatible with Liferea.
  • There will be a special "%scripts" placeholder for the "script" attribute (in Script mode) and for the "post-processing" script. At runtime, this placeholder will be replaced by the path to the "user data scripts" folder (probably just <rss-guard-data-folder>/scripts), which will allow users to keep scripts in portable locations.

Liferea docs:
image

Example "script input": php tweeper.php https://twitter.com/NSACareers (downloads RSS XML file generated from twitter page)
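
A minimal sketch of how the Script input mode and the "%scripts" placeholder described above could fit together. It is written in Python purely for illustration and is not RSS Guard's actual code; the names `SCRIPTS_DIR`, `expand_args` and `run_feed_script`, and the folder location, are assumptions made for this example.

```python
import shlex
import subprocess
import sys
from pathlib import Path
from typing import List

# Hypothetical location: the issue only says "probably <rss-guard-data-folder>/scripts".
SCRIPTS_DIR = Path.home() / ".rssguard" / "scripts"


def expand_args(command: str, scripts_dir: Path = SCRIPTS_DIR) -> List[str]:
    """Split the stored command line, then replace %scripts in each argument.

    Splitting first keeps a scripts folder that contains spaces in one argument.
    shlex.split() uses POSIX quoting rules, so Windows paths with backslashes
    would need different handling.
    """
    return [arg.replace("%scripts", str(scripts_dir)) for arg in shlex.split(command)]


def run_feed_script(command: str, timeout: float = 30.0) -> str:
    """Run interpreter + switches and return stdout, which is treated as feed data."""
    result = subprocess.run(
        expand_args(command),
        capture_output=True,
        text=True,
        timeout=timeout,
        check=True,  # a non-zero exit code raises CalledProcessError
    )
    return result.stdout
```

With this sketch, the stored command `php %scripts/tweeper.php https://twitter.com/NSACareers` would be expanded and executed, and whatever the script prints to stdout would be parsed as the feed.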

@martinrotter martinrotter self-assigned this Aug 10, 2020
@martinrotter martinrotter added Component-Core Priority-Low I not personally interested in this ticket, perhaps others might prepare PR. Type-Enhancement This is request for brand new feature. labels Aug 10, 2020
@pcause

pcause commented Sep 17, 2020

Martin, do you know about rss-bridge? It has plugins to convert many sites without feeds into feeds. You can find it here on GitHub.

@martinrotter
Owner Author

I will take a look, but it will probably be out of scope for RSS Guard, because I plan to add a very simple-to-use, regular-expression-based approach that will be very universal.

@pcause

pcause commented Sep 18, 2020

I wasn't suggesting you did, I just thought you might want to know about this.

Will you be sure to only fetch if the page was modified since the last fetch? I'm assuming you always do this, but for fetching regular pages I suspect users only want to get a new RSS item if the page has changed.

@martinrotter
Owner Author

@pcause That is actually a very good observation. Fetching the date/time of the actual message will be very tricky. The thing is that 90% of feeds do not actually "update" messages with newer versions.

See this. In RSS Guard, two messages are considered the "same" at the DB level if they meet the requirements stated in the link. Therefore the "real" date/time of a message is not that important, and many feeds do not even have it, so RSS Guard is already written to take that situation into account, and I believe that many users do not realise this.

@pcause

pcause commented Sep 19, 2020

Sorry if I wasn't clear. I thought that with an HTTP request you could specify that you only want the content when the page at the URL has changed. Since these are web sites, I thought you could do this. "Modified since" would be the last time you fetched.
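
For illustration, a conditional request of this kind can be sketched in Python with the standard `If-Modified-Since` header. This is only a sketch under assumptions: the function names and the example URL are made up here, and a server is free to ignore the header and send the full page anyway.

```python
import urllib.error
import urllib.request
from email.utils import formatdate
from typing import Optional


def build_conditional_request(url: str, last_fetch_epoch: float) -> urllib.request.Request:
    """Attach If-Modified-Since, formatted as an HTTP date in GMT."""
    request = urllib.request.Request(url)
    request.add_header("If-Modified-Since", formatdate(last_fetch_epoch, usegmt=True))
    return request


def fetch_if_modified(url: str, last_fetch_epoch: float) -> Optional[bytes]:
    """Return the page body, or None if the server answers 304 Not Modified."""
    request = build_conditional_request(url, last_fetch_epoch)
    try:
        with urllib.request.urlopen(request, timeout=30) as response:
            return response.read()
    except urllib.error.HTTPError as error:
        if error.code == 304:  # urllib reports 304 as an HTTPError
            return None
        raise
```

A caller would store the time of each successful fetch and pass it as `last_fetch_epoch` next time, for example `fetch_if_modified("https://example.com/page", stored_time)`.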

@pcause

pcause commented Oct 26, 2020

On this one, I wonder if you can use readability to extract just the main content, or have an option to use it. That way we get the content and not the whole page and ads. The implementations I know of are in JavaScript, so I'm not sure if there are any in C++, but maybe there are, or you can use the browser control you have to do the work.

@martinrotter
Owner Author

Yes, the "Modified since" HTTP header (If-Modified-Since) exists, but many sites simply do not implement it. Still, it is worth investigating.

What is "readability"? I don't know it, can you point me to a website?

As for this ticket, it is really unclear at this point how exactly its implementation and use case will work; I think about it from time to time. An elaborate, well-written regular expression with named groups could do an amazing job; it really just depends on how skilled the author of the regexp is.
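
To make the named-groups idea concrete, here is a small Python sketch. The HTML shape (`<a class="post" href="...">`) and the group names `url` and `title` are assumptions chosen for this example; a real site would need its own pattern.

```python
import re
from typing import Dict, List

# Hypothetical pattern: each article is assumed to be a link with class "post".
ITEM_PATTERN = re.compile(
    r'<a\s+class="post"\s+href="(?P<url>[^"]+)">(?P<title>[^<]+)</a>'
)


def extract_items(html: str) -> List[Dict[str, str]]:
    """Return one dict per match, keyed by the named groups of the pattern."""
    return [match.groupdict() for match in ITEM_PATTERN.finditer(html)]
```

Each named group maps directly onto a message field, so the same extraction code can serve any site once the user supplies a suitable pattern.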

@neoavalon

neoavalon commented Nov 1, 2020

Would be a good feature to have.

Just want to point out that Liferea supports a very nice and simple generalization (relative to just regexps) of this capability by allowing a user to supply a path to a conversion post-filter (a script to call with parameters) when creating a new feed. The filter/script runs each time the URL for the feed is retrieved and is fed the retrieved URL content via stdin. The filter/script is expected to output the converted content (in XML/RSS form) via stdout back to Liferea. So it's up to the user what they want to do in the script/filter (run regular expressions, apply XPath, invoke Python code to do fancier things, etc.).

This means not having to add new menus to enter regexps, etc. You'd need to add a new label and input box to "Add new feed" for specifying the command to run (with flags). This could be enabled with a checkbox. In this case the selected "Type" of feed could be the expected format of the output generated by the user's script. This would make things a bit less user-friendly for layman users (compared to specifying regexps) but gives, I think, much more flexibility in general. Just an idea.
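
As a rough illustration of such a post-filter, the Python sketch below reads HTML and writes a bare-bones RSS 2.0 document. The link pattern, the function names and the channel title are assumptions for this example; a complete RSS channel would also carry `link` and `description` elements.

```python
import re
import sys
from html import unescape
from typing import TextIO
from xml.sax.saxutils import escape

# Hypothetical pattern: treat every plain link on the page as one feed item.
LINK_PATTERN = re.compile(r'<a\s+href="(?P<url>[^"]+)">(?P<title>[^<]+)</a>')


def html_to_rss(html: str, channel_title: str = "Converted feed") -> str:
    """Convert matched links into a minimal RSS 2.0 document."""
    items = []
    for match in LINK_PATTERN.finditer(html):
        # Decode HTML entities first, then escape once for XML.
        title = escape(unescape(match.group("title")))
        url = escape(unescape(match.group("url")))
        items.append("<item><title>{}</title><link>{}</link></item>".format(title, url))
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<rss version="2.0"><channel><title>{}</title>{}</channel></rss>\n'
    ).format(escape(channel_title), "".join(items))


def run_filter(source: TextIO = sys.stdin, sink: TextIO = sys.stdout) -> None:
    """Filter entry point: page content in via stdin, RSS out via stdout."""
    sink.write(html_to_rss(source.read()))
```

To use it as an actual filter script, `run_filter()` would be called at the bottom of the file, so the reader pipes the downloaded page in and parses whatever comes out.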

@martinrotter
Owner Author

Shit, that is actually a nice idea. It is even simpler for me to implement. I will add this instead of just regexps.

@martinrotter martinrotter changed the title Regular expression based website to feed conversion Script-based website to feed conversion Nov 1, 2020
@martinrotter
Owner Author

I will update the first message to reflect your ideas.

@martinrotter martinrotter removed the Priority-Low I not personally interested in this ticket, perhaps others might prepare PR. label Feb 2, 2021
@martinrotter martinrotter added this to the 3.9.0 milestone Feb 2, 2021
martinrotter pushed a commit that referenced this issue Feb 2, 2021
@martinrotter
Owner Author

Working on this.

martinrotter pushed a commit that referenced this issue Feb 2, 2021
martinrotter pushed a commit that referenced this issue Feb 2, 2021
martinrotter pushed a commit that referenced this issue Feb 2, 2021
martinrotter pushed a commit that referenced this issue Feb 2, 2021
@martinrotter
Owner Author

OK, I made significant progress and the feature is basically done. I did quite a bit of testing and it works with all well-known interpreters, including Bash, PowerShell and PHP.

image

The feature is even supported by the "Fetch feed metadata" feature and is thus able to semi-automatically scrape sites like Twitter etc.

image

@martinrotter martinrotter added the Status-Fixed Ticket is resolved. label Feb 3, 2021
3 participants