Script-based website to feed conversion #265
Comments
Martin, do you know about RSS-Bridge? It has plugins to convert many sites without feeds into feeds. You can find it here on GitHub.
Will look, but it will probably be out of scope for RSS Guard, because I plan to add a very simple-to-use, regular expression-based approach which will be very universal.
Wasn't suggesting you did, just thought you might want to know about this. When you implement this, will you be sure to fetch only if the page was modified since the last fetch? I'm assuming you always do this, but for fetching regular pages I suspect users only want to get a new RSS item if the page has changed.
@pcause That is actually a very good observation. Fetching the date/time of the actual message will be very tricky. The thing is that 90% of feeds do not actually "update" messages with newer versions. See this. In RSS Guard, two messages are considered the "same" at the DB level if they meet the requirements stated in the link. Therefore the "real" date/time of a message is not that important, and many feeds do not even have it, so RSS Guard is already written to take that situation into account, and I believe many users do not realise this.
Sorry if I wasn't clear. I thought that with an HTTP request you could specify to get content only when the page at the URL had changed. Since these are web sites, I thought you could do this. "Modified since" would be the last time you fetched.
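For readers following along, the idea in this comment can be sketched as a conditional GET. This is only an illustration of the standard `If-Modified-Since` request header using Python's standard library, not how RSS Guard fetches pages; the function names `build_conditional_request` and `fetch_if_changed` are made up for this example, and it assumes the server actually honours the header.

```python
import email.utils
import urllib.error
import urllib.request


def build_conditional_request(url, last_fetch_epoch):
    """Build a GET request that asks for the page only if it changed.

    last_fetch_epoch is the time of the previous fetch as a Unix timestamp.
    """
    # HTTP dates must be in GMT, e.g. "Thu, 01 Jan 1970 00:00:00 GMT".
    stamp = email.utils.formatdate(last_fetch_epoch, usegmt=True)
    return urllib.request.Request(url, headers={"If-Modified-Since": stamp})


def fetch_if_changed(url, last_fetch_epoch):
    """Return the page body as bytes, or None if the server answers 304."""
    request = build_conditional_request(url, last_fetch_epoch)
    try:
        with urllib.request.urlopen(request, timeout=30) as response:
            return response.read()
    except urllib.error.HTTPError as err:
        if err.code == 304:  # Not Modified: nothing new since the last fetch
            return None
        raise
```

A server that ignores the header simply returns the full page with status 200, so a caller would still need its own duplicate detection in that case.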
On this one, I wonder if you can use Readability to extract just the main content, or have an option to use it. That way we get the content and not the whole page and ads. The implementations I know of are in JavaScript, so I'm not sure if there are any in C++, but maybe there are, or you can use the browser control you have to do the work.
Yes, the "If-Modified-Since" HTTP header exists, but many sites simply do not support it. Still, it is worth investigating. What is "readability"? I don't know it; can you point me to a website? As for this ticket, at this point it is really unclear how exactly its implementation and use case will work; I think about it from time to time. An elaborate and well-written regular expression with named groups could do an amazing job; it really just depends on how skilled the author of the regexp is.
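To make the named-groups idea concrete, here is a minimal sketch in Python. The HTML shape it matches (`<a class="post" href="...">`) and the names `ITEM_PATTERN` and `extract_items` are assumptions made up for this example; a real site would need its own pattern, and nothing here reflects RSS Guard's eventual implementation.

```python
import re

# Hypothetical markup: each article is a link of the form
#   <a class="post" href="/some/url">Some title</a>
# The named groups "link" and "title" label the parts we want to keep.
ITEM_PATTERN = re.compile(
    r'<a\s+class="post"\s+href="(?P<link>[^"]+)"\s*>(?P<title>[^<]+)</a>'
)


def extract_items(html):
    """Return one dict per matched article link, keyed by group name."""
    return [
        {"title": match.group("title").strip(), "link": match.group("link")}
        for match in ITEM_PATTERN.finditer(html)
    ]
```

The appeal of named groups is that the mapping from capture to feed field (title, link, and so on) lives in the pattern itself, so supporting a new site would only mean supplying a new expression.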
Would be a good feature to have. Just want to point out that Liferea supports a very nice and simple generalization of this capability (relative to just regexps) by allowing a user to supply a path to a conversion post-filter (a script to call with parameters) when creating a new feed. The filter/script runs each time the URL for the feed is retrieved and is fed the retrieved URL content via stdin. The filter/script is expected to output the converted content (in XML/RSS form) via stdout back to Liferea. So it's up to the user what they want to do in the script/filter (run regular expressions, apply XPath, invoke Python code to do fancier things, etc.). This means not having to add new menus for entering regexps, etc. You'd need to add a new label and input box to "Add new feed" for specifying the command to run (with flags). This could be enabled with a checkbox. In this case the selected "Type" of feed could be the expected format of the output generated by the user's script. This would make things a bit less user-friendly for lay users than specifying regexps, but it gives (I think) much more flexibility in general. Just an idea.
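As an illustration of the stdin-to-stdout contract described in this comment, a post-filter could look roughly like the following Python sketch. It is an assumption-laden example, not a script shipped by Liferea or RSS Guard: the names `LinkCollector`, `html_to_rss` and `run_filter` are invented here, every `<a href>` on the page is naively turned into an item, and the channel omits the `link` and `description` elements that a strict RSS 2.0 feed would also carry.

```python
import sys
import xml.etree.ElementTree as ET
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (text, href) pairs for every <a href=...> in a page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")  # None if no href
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            title = "".join(self._text).strip()
            if title:
                self.links.append((title, self._href))
            self._href = None


def html_to_rss(html, feed_title="Converted page"):
    """Turn the links found in html into a minimal RSS 2.0 document."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = feed_title
    for title, href in collector.links:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = title
        ET.SubElement(item, "link").text = href
    return ET.tostring(rss, encoding="unicode")


def run_filter(in_stream=sys.stdin, out_stream=sys.stdout):
    """Read the fetched page from in_stream, write the feed to out_stream."""
    out_stream.write(html_to_rss(in_stream.read()))
```

Saved as a standalone script, the file would end with a call to `run_filter()`, so the feed reader can pipe the downloaded page in and read the generated XML back.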
Shit, that is actually a nice idea. Even simpler for me to implement. Will add this instead of just regexps.
I will update the first message to reflect your ideas.
Working on this.
OK, I made significant progress and the feature is basically done. I did quite a lot of testing and it works with all well-known interpreters, including Bash, PowerShell and PHP. The feature is even supported by the "Fetch feed metadata" feature and is thus able to semi-automatically scrape sites like Twitter etc.
Add script-based generic URL-to-feed conversion, similar to what Liferea offers:
<rss-guard-data-folder>/scripts
which will allow users to place scripts in portable locations.
Liferea docs:
![image](https://user-images.githubusercontent.com/1255302/106560311-1d061580-6527-11eb-9347-c58fc8df501c.png)
Example "script input":
php tweeper.php https://twitter.com/NSACareers
(downloads RSS XML file generated from twitter page)
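For illustration only, the way a feed reader might consume such a "script input" line can be sketched in Python: split the line into arguments, run it, and treat whatever the script prints on stdout as the feed. The helper name `run_script_input` is hypothetical, the splitting uses POSIX shell rules via `shlex`, and nothing here describes RSS Guard's actual C++ implementation.

```python
import shlex
import subprocess


def run_script_input(command_line, timeout=60):
    """Run a "script input" line and return what it printed on stdout.

    command_line is a string such as the example above; it is split into
    an argument list rather than handed to a shell.
    """
    argv = shlex.split(command_line)
    completed = subprocess.run(
        argv,
        capture_output=True,  # collect stdout/stderr instead of inheriting
        text=True,            # decode output to str
        timeout=timeout,      # do not hang forever on a stuck script
        check=True,           # raise CalledProcessError on non-zero exit
    )
    return completed.stdout
```

The returned string would then be handed to the regular feed parser, exactly as if it had been downloaded from a URL.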