RSS Agent Adding Extra Characters to Google Alerts RSS URLs #889

marycanady · 2015-07-06T22:58:49Z

RSSAgent is adding "amp;" characters after the ampersands in the URLs it outputs. How do I prevent or remove them? This happens whether or not I use the "clean:true" option. I know Google Alert URLs are long and complex, but I don't know any alternative. Thanks in advance!

Original:
https://www.google.com/url?rct=j&sa=t&url=http://www.vocalrepublic.com/superbug-threat-prompts-west-to-revisit-soviet-era-virus-therapy/15605/&ct=ga&cd=CAIyGjhiMjMxNTkwMzBiNmY1OGU6Y29tOmVuOlVT&usg=AFQjCNFRgagE8CRhJY-oQqfNzxfRAclTeg

RSS Agent Output:
https://www.google.com/url?rct=j&amp;sa=t&amp;url=http://www.vocalrepublic.com/superbug-threat-prompts-west-to-revisit-soviet-era-virus-therapy/15605/&amp;ct=ga&amp;cd=CAIyGjhiMjMxNTkwMzBiNmY1OGU6Y29tOmVuOlVT&amp;usg=AFQjCNFRgagE8CRhJY-oQqfNzxfRAclTeg

cantino · 2015-07-07T02:29:17Z

Hey @marycanady, welcome! Could you show us an example of the Agent configuration that you're using?

marycanady · 2015-07-07T16:06:39Z

Thanks so much! Here's the code for the RSS Agent. I tried it with and without "clean": "true" and "disable_url_encoding": "false", and got the same results.

{
  "expected_update_period_in_days": "1",
  "clean": "true",
  "url": [
    "https://www.google.com/alerts/feeds/03898956544915458721/17285894713843133113",
    "https://www.google.com/alerts/feeds/03898956544915458721/15913629364926333310",
    "https://www.google.com/alerts/feeds/03898956544915458721/12406979761446534656",
    "https://www.google.com/alerts/feeds/03898956544915458721/12718887008787068431",
    "https://www.google.com/alerts/feeds/03898956544915458721/14261209551583029212"
  ],
  "disable_url_encoding": "false"
}

virtadpt · 2015-07-07T19:23:29Z

Does this happen with non-Google RSS feeds?

cantino · 2015-07-08T05:12:28Z

Hey @marycanady, I'll try setting this up locally and see if I can reproduce. I'm pretty swamped for the next two days, though, so it may be later in the week or early next week. If someone else wants to try and debug / fix, go for it! :)

knu · 2015-07-08T06:39:21Z

Apparently FeedNormalizer is doing something wrong.

knu · 2015-07-08T07:12:44Z

When parsing one of these feeds, FeedNormalizer somehow falls back from the rss parser to the simple-rss parser, and SimpleRSS generates the wrong URLs.

require 'simple-rss'
require 'open-uri'

# URI.open (from open-uri) is needed on Ruby 3.0+, where Kernel#open no longer accepts URLs.
rss = SimpleRSS.parse URI.open('https://www.google.com/alerts/feeds/03898956544915458721/17285894713843133113')

puts rss.items.first.link
#=> https://www.google.com/url?rct=j&amp;sa=t&amp;url=http://www.dakotafinancialnews.com/intrexon-corp-lifted-to-buy-at-zacks-xon/243174/&amp;ct=ga&amp;cd=CAIyGmUxMDlkOTA3MDE0YzQ2ZjY6Y29tOmVuOlVT&amp;usg=AFQjCNH61DNiDuZW20aTxuFzFfNKrlWpVg

So, one problem is RSS::Parser failing to parse the feed and another is SimpleRSS not handling XML attributes properly.
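
To illustrate the second problem, here is a minimal sketch using only the standard library. It assumes a simplified Atom-style `link` element; `raw_href` and `decoded_href` are hypothetical helpers, not SimpleRSS code, and `CGI.unescapeHTML` merely stands in for the entity decoding a real XML parser performs.

```ruby
require 'cgi'

# Naive, non-XML-aware extraction (hypothetical helper): returns the href
# attribute text exactly as it is written in the source.
def raw_href(xml)
  xml[/href="([^"]*)"/, 1]
end

# Stand-in for what an XML-aware parser does: decode character entities.
def decoded_href(xml)
  href = raw_href(xml)
  href && CGI.unescapeHTML(href)
end

# In XML source, "&" inside an attribute value must be written as "&amp;".
xml = '<link href="https://www.google.com/url?rct=j&amp;sa=t&amp;ct=ga"/>'
puts raw_href(xml)      # https://www.google.com/url?rct=j&amp;sa=t&amp;ct=ga
puts decoded_href(xml)  # https://www.google.com/url?rct=j&sa=t&ct=ga
```

The first line of output matches the broken URLs reported above; the second is what a consumer of the feed would expect.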

knu · 2015-07-08T07:20:11Z

This is the current status, and I'm going out for a meal...

marycanady · 2015-07-11T19:27:40Z

Thanks! Any updates? It will likely be a significant problem for huginn as so many people use Google Alerts.

cantino · 2015-07-11T23:25:43Z

I will try to dig into it this week.

knu · 2015-07-12T04:09:34Z

The first problem was due to feed-normalizer not expecting an Atom feed from RSS::Parser.parse(): https://github.com/aasmith/feed-normalizer/blob/master/lib/parsers/rss.rb#L37 (RSS::RDF has channel() while RSS::Atom::Feed does not)
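
The fallback can be pictured roughly as follows. This is a sketch under assumptions, not feed-normalizer's actual code: `FakeRssFeed`, `FakeAtomFeed`, and the three methods are hypothetical stand-ins, and it assumes the first parser only accepts objects that respond to `channel`.

```ruby
# Hypothetical stand-ins for the objects RSS::Parser.parse can return.
FakeRssFeed  = Struct.new(:channel)  # RSS style: has #channel
FakeAtomFeed = Struct.new(:entries)  # Atom style: no #channel

# A wrapper that only accepts RSS-shaped results and returns nil otherwise.
def strict_parser(feed)
  feed.respond_to?(:channel) ? :rss_parser : nil
end

# A catch-all parser that accepts anything it is given.
def loose_parser(_feed)
  :simple_rss
end

# Try each parser in order; the first non-nil result wins.
def parse_with_fallback(feed)
  strict_parser(feed) || loose_parser(feed)
end

puts parse_with_fallback(FakeRssFeed.new('channel'))  # rss_parser
puts parse_with_fallback(FakeAtomFeed.new([]))        # simple_rss
```

Under this assumption, a correctly parsed Atom feed is silently discarded and the less capable parser takes over.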

The second "problem" turned out to be a "feature" of simple-rss, which does not really deal with XML at all: https://github.com/cardmagic/simple-rss#usage

knu · 2015-07-12T04:15:28Z

So, it would be best to enhance feed-normalizer to support Atom, but in the meantime we'd have to monkey-patch it.

cantino · 2015-07-13T22:20:35Z

Given aasmith/feed-normalizer#4, I doubt feed-normalizer is being maintained. @knu, does updating it interest you? If not, I'll take a crack at it.

knu · 2015-07-14T04:00:19Z

We could probably use RSS::Parser directly, calling feed-normalizer's HTML cleaner as necessary.

cantino · 2015-07-14T04:05:08Z

That might be simpler, if it gives us more control, as I imagine Huginn's RSS parsing needs will only increase with time.

marycanady · 2015-07-14T15:44:14Z

Thanks guys, I know you're still working on it, but I'd be happy to contribute somewhere for your time, let me know where!

knu · 2015-07-15T03:13:05Z

It's not ideal, but I guess adding a Liquid filter `unescape` wouldn't hurt, and it would mitigate the issue by allowing you to create an EventFormattingAgent to fix the messed-up escaped URLs.

cantino · 2015-07-15T21:21:27Z

@marycanady, would an unescape filter solve this?

marycanady · 2015-07-16T17:36:31Z

Hi--thanks! I'm a n00b both on Huginn and Git, has this been created, and if so how do I use it? I'm also using Huginn on Heroku, and from what I remember, they don't have the latest snapshot of Huginn (and I'm not yet savvy enough to add new code) so I don't know if I can use it yet. Will look into it soon and let you know. Have you been able to test?

cantino · 2015-07-16T22:06:22Z

@marycanady, it has not been merged into Huginn's master branch yet. When it is, you'll be able to update your local checkout of Huginn, then push the update to Heroku. This would be something like `git fetch origin`, then `git merge origin/master`, then `git push heroku master`.

marycanady · 2015-07-26T14:31:48Z

Thanks--sorry I have very limited time to work on this right now. Can you please tell me when the change has been merged with the master? Can you also please point me to some instructions on how to use the filter?

cantino · 2015-07-26T21:54:36Z

@marycanady, @knu's filter has been merged into master. It's a normal Liquid filter, as is documented in our wiki. I think @knu's intention was for you to make an EventFormattingAgent that receives your events, runs the unescape filter on the URL parts, and re-emits them.

FeedNormalizer is no longer maintained, and its Atom support is flawed: it throws away what RSS::Parser returns and falls back to SimpleRSS, which cannot handle XML entities, resulting in unusable URLs such as ones containing `&amp;`. A breaking change is the removal of the `clean` option, which needs to be reimplemented if it is important, although I personally think you should use EventFormattingAgent to tweak feed contents if you need to. This should address #889 and #955, among others.

marycanady · 2015-08-08T15:51:46Z

Thanks, I finally have time for this now. I had created my Huginn build from a buildpack and I didn't get the admin access set up (I had used Windows and am now trying the VMWare/Linux option, still working on it). If I can't get that to work, will a new instance created using the buildpack have the updates I need to implement the filter?

cantino · 2015-08-08T19:37:36Z

Yes, if you're pulling from the current git repo, it should be up to date.

marycanady · 2015-08-09T14:33:40Z

Thanks, it's working, the Google Alerts URLs are valid. Here's the code for the Formatting Agent; you can close this item. I would suggest adding a note to the wiki that the unescape filter takes no arguments, as I was unclear on this.

{
  "instructions": {
    "title": "Blogs: {{title}}",
    "url": "{{url | unescape }}"
  },
  "matchers": [],
  "mode": "merge"
}
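
For readers unfamiliar with Liquid, the effect of the configuration above can be approximated in plain Ruby. This is only a sketch: `format_event` is a hypothetical helper, it assumes the `unescape` filter behaves like `CGI.unescapeHTML`, and it assumes "merge" mode keeps the incoming payload and writes the formatted keys over it.

```ruby
require 'cgi'

# Hypothetical stand-in for the formatting step in "merge" mode: the
# incoming payload is kept and the formatted keys are written over it.
def format_event(payload)
  formatted = {
    'title' => "Blogs: #{payload['title']}",
    # Assumption: the unescape filter decodes entities like CGI.unescapeHTML.
    'url'   => CGI.unescapeHTML(payload['url'].to_s)
  }
  payload.merge(formatted)
end

event = {
  'title' => 'Superbug threat',
  'url'   => 'https://www.google.com/url?rct=j&amp;sa=t&amp;ct=ga'
}
puts format_event(event)['url']  # https://www.google.com/url?rct=j&sa=t&ct=ga
```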

cantino · 2015-08-11T02:02:02Z

Glad it's working! Would you mind updating the wiki?

cantino · 2015-08-23T21:47:24Z

@marycanady, did my JavaScript Agent work? If so, would you mind updating the wiki with your successes, and maybe blogging about it? :)

mcanady · 2015-08-24T16:45:03Z

Hi--forgive me, but where would I have gotten the JavaScript Agent?

cantino · 2015-08-24T18:23:23Z

I emailed you a week or so ago. Just replied to it again.

FeedNormalizer is no longer maintained, and its Atom support is flawed: it throws away what RSS::Parser returns and falls back to SimpleRSS, which cannot handle XML entities, resulting in unusable URLs such as ones containing `&amp;`. A new key `links` is added, which lists all `link` elements. A breaking change is the removal of the `clean` option, which needs to be reimplemented later if it is important. This should address #889 and #955, among others.

knu · 2018-11-23T13:13:19Z

I recently checked out feeds generated by Google Alerts and found that they are Atom documents with <title type="html"/> elements. So, RssAgent should be made aware of that and convert marked-up title contents to plain text as necessary.

I think that would be the job of the underlying feed parser, which currently is Feedjira, so I'm looking to make a PR for it. We've already been patching Feedjira quite heavily, but this would be relatively easy to factor out and feed back.
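
As a rough illustration of the conversion described here (not Feedjira's or Huginn's implementation), a title carried as HTML could be reduced to plain text along these lines. `html_title_to_text` is a hypothetical helper, and the regex is only an approximation of real HTML parsing.

```ruby
require 'cgi'

# Hypothetical helper: reduce an HTML title to plain text by dropping tags
# first and then decoding character entities. Tags are removed before
# decoding so that escaped text such as "&lt;b&gt;" is kept as literal text.
def html_title_to_text(html)
  CGI.unescapeHTML(html.gsub(/<[^>]*>/, '')).strip
end

puts html_title_to_text('<b>Superbug</b> threat &amp; therapy')
# Superbug threat & therapy
```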

knu · 2019-07-10T12:32:42Z

Finally, here's the work in progress!
feedjira/feedjira#423

knu · 2019-07-10T13:07:44Z

And this: feedjira/feedjira#424

mcanady · 2019-07-10T13:52:40Z

Thanks! I reread the thread and can't remember how I solved the problem...but the work is appreciated!

knu added a commit that referenced this issue Jul 16, 2015

Add a new filter unescape

e5860dd

Proposed in #889.

knu mentioned this issue Jul 16, 2015

Add a new filter unescape #919

Merged

knu mentioned this issue Aug 4, 2015

RssAgent: Use RSS::Parser directly #957

Closed

cantino closed this as completed Aug 23, 2015

cantino mentioned this issue Jun 28, 2016

RssAgent: Migrate from FeedNormalizer to Feedjira #1564

Merged
