RSSAgent not extracting the primary entry URL #1276

Closed
Lenbok opened this issue Feb 6, 2016 · 14 comments

@Lenbok

Lenbok commented Feb 6, 2016

I have an RSSAgent collecting data from http://hydraraptor.blogspot.com/feeds/posts/default (although the same happens with other blogspot feeds) to perform some aggregation and filtering. Unfortunately, the emitted events do not contain the primary URL for each entry, instead choosing one that does not direct to the article itself (and can't be clicked on to view in a browser). Looking at the RSS feed itself, each article has a series of links such as:

<link rel='replies' type='application/atom+xml' href='http://hydraraptor.blogspot.com/feeds/4971617975719595764/comments/default' title='Post Comments'/>
<link rel='replies' type='text/html' href='http://hydraraptor.blogspot.com/2016/02/cool-maps.html#comment-form' title='3 Comments'/>
<link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4339813531032979196/posts/default/4971617975719595764'/>
<link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4339813531032979196/posts/default/4971617975719595764'/>
<link rel='alternate' type='text/html' href='http://hydraraptor.blogspot.com/2016/02/cool-maps.html' title='Cool maps'/>

The corresponding emitted event has the following attributes:

[...]
  "url": "http://hydraraptor.blogspot.com/feeds/4971617975719595764/comments/default",
  "urls": [
    "http://hydraraptor.blogspot.com/feeds/4971617975719595764/comments/default"
  ],
[...]

So, it appears that the first link is selected by the RSSAgent.

Is it possible to select the rel='alternate' link instead, or otherwise exert some control over which links are selected?
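For illustration only, the selection requested above could be sketched as follows. This is not RSSAgent code (the agent is written in Ruby); it is a minimal Python sketch using the standard library, and the helper name `primary_url` is hypothetical. It prefers the link whose `rel` is `alternate` and falls back to the first link otherwise.

```python
import xml.etree.ElementTree as ET

ATOM_NS = "{http://www.w3.org/2005/Atom}"

# One entry, reduced to the <link> elements quoted in the issue.
ENTRY_XML = """<entry xmlns='http://www.w3.org/2005/Atom'>
<link rel='replies' type='application/atom+xml' href='http://hydraraptor.blogspot.com/feeds/4971617975719595764/comments/default' title='Post Comments'/>
<link rel='replies' type='text/html' href='http://hydraraptor.blogspot.com/2016/02/cool-maps.html#comment-form' title='3 Comments'/>
<link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4339813531032979196/posts/default/4971617975719595764'/>
<link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4339813531032979196/posts/default/4971617975719595764'/>
<link rel='alternate' type='text/html' href='http://hydraraptor.blogspot.com/2016/02/cool-maps.html' title='Cool maps'/>
</entry>"""


def primary_url(entry):
    """Return the rel='alternate' href, falling back to the first link."""
    links = entry.findall(ATOM_NS + "link")
    for link in links:
        # Atom treats a missing rel attribute as rel="alternate".
        if link.get("rel", "alternate") == "alternate":
            return link.get("href")
    return links[0].get("href") if links else None


entry = ET.fromstring(ENTRY_XML)
first_url = entry.find(ATOM_NS + "link").get("href")  # what the event currently carries
chosen_url = primary_url(entry)                        # what the issue asks for
print(first_url)
print(chosen_url)
```

With the links quoted above, `first_url` is the comments feed, while `chosen_url` is the article URL.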

@cantino
Member

cantino commented Feb 6, 2016

Hey @Lenbok. I think that'd be a code change to the RSSAgent, if you want to give it a go. I'm not sure how hard it'd be. I'll add "Help Wanted" to this issue in case someone has suggestions.

You can also parse RSS with a WebsiteAgent, and that would give you more control over exactly what you extract. There are some examples on the wiki.

@Lenbok
Author

Lenbok commented Feb 6, 2016

I know exactly zero Ruby, and (after some issue scouting) it looks like RSSAgent may be getting fixed/enhanced/superseded anyway by #957. Is it possible that this issue would already be fixed by #957?

@cantino
Member

cantino commented Feb 7, 2016

We'd have to ask @knu, I'm not sure. I don't think that PR will fix all of the issues that reference it at this point ;)

@xdirty

xdirty commented Feb 8, 2016

Can't you use the WebsiteAgent to extract any data, even from an XML feed?

@cantino
Member

cantino commented Feb 8, 2016

Yes, you could definitely try the WebsiteAgent too.

@Lenbok
Author

Lenbok commented Feb 8, 2016

Someone who is familiar with XML parsing / XPath / etc. would probably have no trouble getting the job done via the WebsiteAgent. I was hoping that the RSSAgent could be made to Just Work for those who aren't :-).

@xdirty

xdirty commented Feb 10, 2016

Lenbok, if you play around with the WebsiteAgent and look at Cantino's example, you should be able to figure out what keywords you need to enter to extract the right data. Here's what I used for one of my feeds that wouldn't extract; I didn't need to change any of Cantino's code. I think your situation is a little more complicated, but you don't need coding skills, you just need to be able to recognize patterns.

{
  "url": "Enter rss feed address here",
  "mode": "on_change",
  "type": "xml",
  "expected_update_period_in_days": "10",
  "extract": {
    "title": {
      "css": "item title",
      "value": ".//text()"
    },
    "url": {
      "css": "item link",
      "value": ".//text()"
    }
  }
}

Alternatively, you can use http://www.rssmix.com/. You need more than one feed, but it will combine them, and the output will be more consistent with what XML output should be, which means the RSSAgent will be able to extract the data.
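One note on the config above: the `item link` selector with `.//text()` fits RSS 2.0, where the URL is the text of `<link>`. In an Atom feed like the Blogger one in this issue, the URL sits in the `href` attribute and each entry has several `<link>` elements. Here is a minimal Python sketch of that difference, using only the standard library. The RSS sample, the `example.com` URL, and the function names are made up for illustration.

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"

# Hypothetical RSS 2.0 item: the URL is the text content of <link>.
RSS_XML = """<rss version='2.0'><channel>
<item><title>Cool maps</title><link>http://example.com/cool-maps.html</link></item>
</channel></rss>"""

# Atom entry shaped like the Blogger feed: the URL is the href attribute,
# and only the rel='alternate' link points at the article.
ATOM_XML = """<feed xmlns='http://www.w3.org/2005/Atom'>
<entry><title>Cool maps</title>
<link rel='replies' type='application/atom+xml' href='http://hydraraptor.blogspot.com/feeds/4971617975719595764/comments/default'/>
<link rel='alternate' type='text/html' href='http://hydraraptor.blogspot.com/2016/02/cool-maps.html'/>
</entry></feed>"""


def rss_items(xml_text):
    """Extract title and URL from RSS 2.0 items (URL = text of <link>)."""
    root = ET.fromstring(xml_text)
    return [
        {"title": item.findtext("title"), "url": item.findtext("link")}
        for item in root.iter("item")
    ]


def atom_entries(xml_text):
    """Extract title and URL from Atom entries (URL = href of the alternate link)."""
    root = ET.fromstring(xml_text)
    results = []
    for entry in root.iter(ATOM + "entry"):
        link = entry.find(ATOM + "link[@rel='alternate']")
        results.append({
            "title": entry.findtext(ATOM + "title"),
            "url": link.get("href") if link is not None else None,
        })
    return results


print(rss_items(RSS_XML))
print(atom_entries(ATOM_XML))
```

In WebsiteAgent terms this would presumably mean selecting the `rel='alternate'` link and reading its `href` attribute rather than its text, though the exact selector is untested here.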

@knu
Member

knu commented Feb 10, 2016

Let me take a look into this later.

@Jngai
Contributor

Jngai commented Mar 7, 2016

@knu do you have an update for this issue?

@Jngai
Contributor

Jngai commented Mar 8, 2016

@knu if you don't mind, I plan to try it. I was looking at the RSSAgent and think it is manageable.

@Jngai
Contributor

Jngai commented Mar 9, 2016

@Lenbok I don't think there is a way to select your own URLs. I was looking into the agent, and it turns out it's pretty reliant on feed-normalizer, so the URL has already been selected. On the upside, did you notice that the alternative links that you were looking for are actually in the content section? Hope this helps.

{
  "id": "tag:blogger.com,1999:blog-4339813531032979196.post-8884703291196141290",
  "date_published": "2016-01-29 17:20:00 +0000",
  "last_updated": "2016-02-18 11:21:40 +0000",
  "url": "http://hydraraptor.blogspot.com/feeds/8884703291196141290/comments/default",
  "urls": [
    "http://hydraraptor.blogspot.com/feeds/8884703291196141290/comments/default"
  ],
  "description": "",
  "content": "A few months ago <a href=\"https://plus.google.com/u/0/113737869224798118032/auto\" target=\"_blank\">+Neil Darlow<\/a> mentioned that he had replaced his <a href=\"http://hydraraptor.blogspot.co.uk/2011/12/mendel90.html\" target=\"_blank\">Mendel90<\/a> fan with a quiet version and it seemed to give better cooling results. This made me curious because quieter fans of the same dimensions generally spin slower and produce less airflow. So I purchased

@Jngai
Contributor

Jngai commented Apr 4, 2016

@Lenbok did my suggestion help?

@Lenbok
Author

Lenbok commented Apr 4, 2016

@Jngai Hmmmm, in the (raw XML of the) entries I have looked at, the links in the content section do not correspond to the links that appear after the content section (in particular, I want the "alternate" link, as that is the one I should click on to visit the article itself in a browser). #957 discusses an alternative implementation that doesn't depend on feed-normalizer, so I was hoping that would provide a solution.

@Lenbok
Author

Lenbok commented Dec 6, 2017

Closing, as this has been addressed by #1564.

Lenbok closed this as completed Dec 6, 2017