RSS Agent Adding Extra Characters to Google Alerts RSS URLs #889

Closed
marycanady opened this issue Jul 6, 2015 · 32 comments

@marycanady

RSSAgent is adding "amp;" after each ampersand in the URLs it outputs. How do I prevent or remove this? It happens whether or not I use the "clean:true" option. I know Google Alert URLs are long and complex, but I don't know of any alternative. Thanks in advance!

Original:
https://www.google.com/url?rct=j&sa=t&url=http://www.vocalrepublic.com/superbug-threat-prompts-west-to-revisit-soviet-era-virus-therapy/15605/&ct=ga&cd=CAIyGjhiMjMxNTkwMzBiNmY1OGU6Y29tOmVuOlVT&usg=AFQjCNFRgagE8CRhJY-oQqfNzxfRAclTeg

RSS Agent Output:
https://www.google.com/url?rct=j&amp;sa=t&amp;url=http://www.vocalrepublic.com/superbug-threat-prompts-west-to-revisit-soviet-era-virus-therapy/15605/&amp;ct=ga&amp;cd=CAIyGjhiMjMxNTkwMzBiNmY1OGU6Y29tOmVuOlVT&amp;usg=AFQjCNFRgagE8CRhJY-oQqfNzxfRAclTeg

@cantino
Member

cantino commented Jul 7, 2015

Hey @marycanady, welcome! Could you show us an example of the Agent configuration that you're using?

@marycanady
Author

Thanks so much! Here's the code for the RSS Agent. I tried it with and without "clean": "true" and "disable_url_encoding": "false", and got the same results.

{
  "expected_update_period_in_days": "1",
  "clean": "true",
  "url": [
    "https://www.google.com/alerts/feeds/03898956544915458721/17285894713843133113",
    "https://www.google.com/alerts/feeds/03898956544915458721/15913629364926333310",
    "https://www.google.com/alerts/feeds/03898956544915458721/12406979761446534656",
    "https://www.google.com/alerts/feeds/03898956544915458721/12718887008787068431",
    "https://www.google.com/alerts/feeds/03898956544915458721/14261209551583029212"
  ],
  "disable_url_encoding": "false"
}

@virtadpt
Collaborator

virtadpt commented Jul 7, 2015

Does this happen with non-Google RSS feeds?

@cantino
Member

cantino commented Jul 8, 2015

Hey @marycanady, I'll try setting this up locally and see if I can reproduce. I'm pretty swamped for the next two days, though, so it may be later in the week or early next week. If someone else wants to try and debug / fix, go for it! :)

@knu
Member

knu commented Jul 8, 2015

Apparently FeedNormalizer is doing something wrong.

@knu
Member

knu commented Jul 8, 2015

When parsing one of these feeds, FeedNormalizer somehow falls back from the rss parser to the simple-rss parser, and the SimpleRSS generates the wrong URLs.

require 'simple-rss'
require 'open-uri'

# URI.open replaces the bare Kernel#open on URLs, which Ruby 3.0 no longer supports
rss = SimpleRSS.parse URI.open('https://www.google.com/alerts/feeds/03898956544915458721/17285894713843133113')

# The link still carries the XML escape `&amp;` instead of a bare `&`
puts rss.items.first.link
#=> https://www.google.com/url?rct=j&amp;sa=t&amp;url=http://www.dakotafinancialnews.com/intrexon-corp-lifted-to-buy-at-zacks-xon/243174/&amp;ct=ga&amp;cd=CAIyGmUxMDlkOTA3MDE0YzQ2ZjY6Y29tOmVuOlVT&amp;usg=AFQjCNH61DNiDuZW20aTxuFzFfNKrlWpVg

So, one problem is RSS::Parser failing to parse the feed and another is SimpleRSS not handling XML attributes properly.
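
To illustrate the second problem, here is a minimal sketch contrasting text-level extraction (roughly what this thread attributes to simple-rss) with an XML-aware parser. The sample feed, its placeholder URL, and the helper names `link_by_regex` and `link_by_parser` are assumptions made up for this example; REXML is used only as a convenient parser and is not what Huginn or feed-normalizer uses.

```ruby
require 'rexml/document'

# Made-up Atom entry; the URL is a placeholder, not taken from a real alert.
SAMPLE_ATOM = <<~XML
  <feed xmlns="http://www.w3.org/2005/Atom">
    <entry>
      <title>Example</title>
      <link href="https://www.google.com/url?rct=j&amp;sa=t&amp;url=http://example.com/"/>
    </entry>
  </feed>
XML

# Text-level extraction (hypothetical helper): copies the attribute verbatim,
# so the XML escape `&amp;` ends up inside the URL.
def link_by_regex(xml)
  xml[/<link href="([^"]*)"/, 1]
end

# XML-aware extraction (hypothetical helper): the parser decodes entities
# in attribute values, so `&amp;` comes back as a plain `&`.
def link_by_parser(xml)
  doc = REXML::Document.new(xml)
  link = doc.root.find_first_recursive { |node| node.name == 'link' }
  link.attributes['href']
end

puts link_by_regex(SAMPLE_ATOM)
puts link_by_parser(SAMPLE_ATOM)
```

The first line printed keeps the `&amp;` sequences, matching the symptom reported in this issue; the second prints the URL with bare ampersands.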

@knu
Member

knu commented Jul 8, 2015

This is the current status, and I'm going out for a meal...

@marycanady
Author

Thanks! Any updates? It will likely be a significant problem for huginn as so many people use Google Alerts.

@cantino
Member

cantino commented Jul 11, 2015

I will try to dig into it this week.

@knu
Member

knu commented Jul 12, 2015

The first problem was due to feed-normalizer not expecting an Atom feed from RSS::Parser.parse(): https://github.com/aasmith/feed-normalizer/blob/master/lib/parsers/rss.rb#L37 (RSS::RDF has channel() while RSS::Atom::Feed does not)

The second "problem" turned out to be a "feature" of simple-rss, which does not really deal with XML at all: https://github.com/cardmagic/simple-rss#usage
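
The first problem can be sketched with Ruby's `rss` library directly: an RSS 2.0 document parses to an object with `channel`, while an Atom document parses to `RSS::Atom::Feed`, which has none. Both sample feeds and the helper `feed_title` are made up for illustration, and this is only one possible way to branch on the feed type, not the fix that was adopted.

```ruby
require 'rss'

# Made-up RSS 2.0 feed.
SAMPLE_RSS2 = <<~XML
  <rss version="2.0">
    <channel>
      <title>RSS example</title>
      <link>http://example.com/</link>
      <description>Made-up RSS 2.0 feed</description>
    </channel>
  </rss>
XML

# Made-up Atom feed.
SAMPLE_ATOM_FEED = <<~XML
  <feed xmlns="http://www.w3.org/2005/Atom">
    <id>urn:example:feed</id>
    <title>Atom example</title>
    <updated>2015-07-08T00:00:00Z</updated>
  </feed>
XML

# Hypothetical helper: read the feed title without assuming `channel` exists.
def feed_title(feed)
  if feed.respond_to?(:channel)
    feed.channel.title   # RSS 1.0 / 2.0 style objects
  else
    feed.title.content   # RSS::Atom::Feed keeps the title on the feed itself
  end
end

# The second argument turns validation off so the tiny samples are accepted.
rss_feed  = RSS::Parser.parse(SAMPLE_RSS2, false)
atom_feed = RSS::Parser.parse(SAMPLE_ATOM_FEED, false)

puts feed_title(rss_feed)
puts feed_title(atom_feed)
```

Code that calls `channel` unconditionally on the parse result would raise on the Atom feed, which is consistent with the fallback behaviour described above.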

@knu
Member

knu commented Jul 12, 2015

So, it would be best to enhance feed-normalizer to support Atom, but in the meantime we'd have to monkey-patch it.

@cantino
Member

cantino commented Jul 13, 2015

Given aasmith/feed-normalizer#4, I doubt feed-normalizer is being maintained. @knu, does updating it interest you? If not, I'll take a crack at it.

@knu
Member

knu commented Jul 14, 2015

We could probably use RSS::Parser directly, calling feed-normalizer's HTML cleaner as necessary.

@cantino
Member

cantino commented Jul 14, 2015

That might be simpler, if it gives us more control, as I imagine Huginn's RSS parsing needs will only increase with time.

@marycanady
Author

Thanks guys, I know you're still working on it, but I'd be happy to contribute somewhere for your time, let me know where!

@knu
Member

knu commented Jul 15, 2015

It's not ideal, but I guess adding a Liquid unescape filter wouldn't hurt, and it would mitigate the issue by allowing you to create an EventFormattingAgent that fixes the wrongly escaped URLs.
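
As a rough sketch of what such a filter could look like, the module below decodes HTML/XML entities with the standard library's `CGI.unescapeHTML`. The module name `UnescapeFilter` and the sample URL are assumptions for illustration; this is not necessarily how the filter in Huginn is implemented.

```ruby
require 'cgi'

# Hypothetical filter module; Liquid filters are plain Ruby modules whose
# public methods become filter names.
module UnescapeFilter
  # Turn entities such as `&amp;` back into the characters they stand for.
  def unescape(input)
    CGI.unescapeHTML(input.to_s)
  end
end

# With the liquid gem loaded, a module like this could be registered via
#   Liquid::Template.register_filter(UnescapeFilter)
# Here it is exercised directly so the example needs no extra gems.
filter = Object.new.extend(UnescapeFilter)
puts filter.unescape('https://www.google.com/url?rct=j&amp;sa=t&amp;url=http://example.com/')
```

In a template this would be used as `{{ url | unescape }}`, with no arguments.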

@cantino
Member

cantino commented Jul 15, 2015

@marycanady, would an unescape filter solve this?

knu added a commit that referenced this issue Jul 16, 2015
@marycanady
Author

Hi--thanks! I'm a n00b both on Huginn and Git, has this been created, and if so how do I use it? I'm also using Huginn on Heroku, and from what I remember, they don't have the latest snapshot of Huginn (and I'm not yet savvy enough to add new code) so I don't know if I can use it yet. Will look into it soon and let you know. Have you been able to test?

@cantino
Member

cantino commented Jul 16, 2015

@marycanady, it has not been merged into Huginn's master branch yet. When it is, you'll be able to update your local checkout of Huginn, then push the update to Heroku. This would be something like git fetch origin, then git merge origin/master, then git push heroku master.

@marycanady
Author

Thanks--sorry I have very limited time to work on this right now. Can you please tell me when the change has been merged with the master? Can you also please point me to some instructions on how to use the filter?

@cantino
Member

cantino commented Jul 26, 2015

@marycanady, @knu's filter has been merged into master. It's a normal Liquid filter, as is documented in our wiki. I think @knu's intention was for you to make an EventFormattingAgent that receives your events, runs the unescape filter on the URL parts, and re-emits them.

knu added a commit that referenced this issue Aug 4, 2015
FeedNormalizer is no longer maintained, and its Atom support has flaws
in that it throws away what RSS::Parser returns and falls back to using
SimpleRSS which is not capable of handling XML entities, resulting in
getting unusable URLs such as ones including `&amp;`.

A breaking change is removal of the `clean` option which needs to be
reimplemented if it is important, while I personally think you should
use EventFormattingAgent to tweak feed contents if you need to.

This should address #889 and #955, among others.
@marycanady
Author

Thanks, I'm finally having time for this now. I had created my Huginn build from a buildpack and I didn't get the admin access set up (had used windows, now trying the VMWare/Linux option, still working on it). If I can't get that to work, will a new instance created using the buildpack have the updates I need to implement the filter?

@cantino
Member

cantino commented Aug 8, 2015

Yes, if you're pulling from the current git repo, it should be up to date.

@marycanady
Author

Thanks, it's working, the Google Alerts URLs are valid. Here's the code for the Formatting Agent, you can close this item. I would suggest noting in the wiki that the unescape filter takes no arguments, as this was unclear to me.

{
  "instructions": {
    "title": "Blogs: {{title}}",
    "url": "{{url | unescape }}"
  },
  "matchers": [],
  "mode": "merge"
}

@cantino
Member

cantino commented Aug 11, 2015

Glad it's working! Would you mind updating the wiki?

@cantino
Member

cantino commented Aug 23, 2015

@marycanady, did my JavaScript Agent work? If so, would you mind updating the wiki with your successes, and maybe blogging about it? :)

@cantino cantino closed this as completed Aug 23, 2015
@mcanady

mcanady commented Aug 24, 2015

Hi--forgive me, but where would I have gotten the JavaScript Agent?

@cantino
Member

cantino commented Aug 24, 2015

I emailed you a week or so ago. Just replied to it again.

knu added a commit that referenced this issue Jun 28, 2016
FeedNormalizer is no longer maintained, and its Atom support has flaws
in that it throws away what RSS::Parser returns and falls back to using
SimpleRSS which is not capable of handling XML entities, resulting in
getting unusable URLs such as ones including `&amp;`.

A new key `links` is added, which lists all `link` elements.

A breaking change is removal of the `clean` option which needs to be
reimplemented later if it is important.

This should address #889 and #955, among others.
@knu
Member

knu commented Nov 23, 2018

I recently checked out feeds generated by Google Alerts and found out they are Atom documents with <title type="html"/> elements. So, RssAgent should be made aware of that and convert marked-up title contents to plain text as necessary.

I think that would be the job of the underlying feed parser, which currently is Feedjira, so I'm looking into making a PR for it. We've already been patching Feedjira quite heavily, but this would be relatively easy to factor out and feed back.
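
A naive sketch of the conversion described above, assuming the title has already been pulled out of the XML as a string of HTML markup. The helper name `html_title_to_text` and the sample title are made up, and the regex-based tag stripping is a deliberate simplification; a real HTML parser would be more robust, and this is not the approach taken in the Feedjira work.

```ruby
require 'cgi'

# Hypothetical helper: reduce an Atom title declared as type="html" to plain text.
def html_title_to_text(html)
  without_tags = html.gsub(/<[^>]*>/, '')   # drop markup such as <b>...</b>
  CGI.unescapeHTML(without_tags).strip      # decode entities left in the text
end

puts html_title_to_text('<b>Superbug</b> threat &amp; response')
```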

@knu
Member

knu commented Jul 10, 2019

Finally, here's the work in progress!
feedjira/feedjira#423

@knu
Member

knu commented Jul 10, 2019

And this: feedjira/feedjira#424

@mcanady

mcanady commented Jul 10, 2019

Thanks! I reread the thread and can't remember how I solved the problem...but the work is appreciated!
