RSSAgent not extracting the primary entry URL #1276

Lenbok · 2016-02-06T22:25:01Z

I have an RSSAgent collecting data from http://hydraraptor.blogspot.com/feeds/posts/default (although the same happens with other blogspot feeds) to perform some aggregation and filtering. Unfortunately, the emitted events do not contain the primary URL for each entry, instead choosing one that does not direct to the article itself (and can't be clicked on to view in a browser). Looking at the RSS feed itself, each article has a series of links such as:

<link rel='replies' type='application/atom+xml' href='http://hydraraptor.blogspot.com/feeds/4971617975719595764/comments/default' title='Post Comments'/>
<link rel='replies' type='text/html' href='http://hydraraptor.blogspot.com/2016/02/cool-maps.html#comment-form' title='3 Comments'/>
<link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4339813531032979196/posts/default/4971617975719595764'/>
<link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4339813531032979196/posts/default/4971617975719595764'/>
<link rel='alternate' type='text/html' href='http://hydraraptor.blogspot.com/2016/02/cool-maps.html' title='Cool maps'/>

The corresponding emitted event has the following attributes:

[...]
  "url": "http://hydraraptor.blogspot.com/feeds/4971617975719595764/comments/default",
  "urls": [
    "http://hydraraptor.blogspot.com/feeds/4971617975719595764/comments/default"
  ],
[...]

So, it appears that the first link is selected by the RSSAgent.

Is it possible to select the rel='alternate' link instead, or otherwise exert some control over which links are selected?
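
To make the requested behaviour concrete, here is a minimal Python sketch of the selection rule, not the RSSAgent's actual code. It prefers the link whose rel is 'alternate' (the Atom spec, RFC 4287, treats a missing rel as 'alternate') and only falls back to the first link if none is found. The function name `primary_url` and the `<entry>` wrapper around the links quoted above are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

ATOM_NS = "{http://www.w3.org/2005/Atom}"

# The five links quoted above, wrapped in a hypothetical <entry> element.
ENTRY_XML = """<entry xmlns='http://www.w3.org/2005/Atom'>
  <link rel='replies' type='application/atom+xml' href='http://hydraraptor.blogspot.com/feeds/4971617975719595764/comments/default' title='Post Comments'/>
  <link rel='replies' type='text/html' href='http://hydraraptor.blogspot.com/2016/02/cool-maps.html#comment-form' title='3 Comments'/>
  <link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4339813531032979196/posts/default/4971617975719595764'/>
  <link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4339813531032979196/posts/default/4971617975719595764'/>
  <link rel='alternate' type='text/html' href='http://hydraraptor.blogspot.com/2016/02/cool-maps.html' title='Cool maps'/>
</entry>"""


def primary_url(entry):
    """Return the href of the entry's 'alternate' link, if there is one."""
    links = entry.findall(ATOM_NS + "link")
    for link in links:
        # A link without a rel attribute counts as rel='alternate'.
        if link.get("rel", "alternate") == "alternate":
            return link.get("href")
    # No alternate link: fall back to the first link, which matches
    # the behaviour observed in the emitted events.
    return links[0].get("href") if links else None


print(primary_url(ET.fromstring(ENTRY_XML)))
# prints: http://hydraraptor.blogspot.com/2016/02/cool-maps.html
```

With this rule the order of the link elements in the feed no longer matters, which is the property the request above is asking for.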

cantino · 2016-02-06T22:58:29Z

Hey @Lenbok. I think that'd be a code change to the RSSAgent, if you want to give it a go. I'm not sure how hard it'd be. I'll add "Help Wanted" to this issue in case someone has suggestions.

You can also parse RSS with a WebsiteAgent, and that would give you more control over exactly what you extract. There are some examples on the wiki.

Lenbok · 2016-02-06T23:57:49Z

I know exactly zero Ruby, and (after some issue scouting) it looks like RSSAgent may be getting fixed/enhanced/superseded anyway by #957 -- is it possible that this issue would already be fixed by #957?

cantino · 2016-02-07T01:47:25Z

We'd have to ask @knu, I'm not sure. I don't think that PR will fix all of the issues that reference it at this point ;)

xdirty · 2016-02-08T15:12:36Z

Can't you use the WebsiteAgent to extract any data, even from an XML feed?

cantino · 2016-02-08T16:10:51Z

Yes, you could definitely try the WebsiteAgent too.

Lenbok · 2016-02-08T19:59:30Z

Someone who is familiar with xml parsing / xpath / etc would probably have no trouble getting the job done via the website agent. I was hoping that the RSSAgent could be made to Just Work for those that aren't :-).

xdirty · 2016-02-10T04:57:29Z

Lenbok, if you play around with the WebsiteAgent and look at cantino's example, you should be able to figure out what keywords you need to enter to extract the right data. Here's what I used for one of my feeds that wouldn't extract; I didn't need to change any of cantino's code. I think your situation is a little more complicated, but you don't need coding skills, you just need to be able to recognize patterns.

{
  "url": "Enter rss feed address here",
  "mode": "on_change",
  "type": "xml",
  "expected_update_period_in_days": "10",
  "extract": {
    "title": {
      "css": "item title",
      "value": ".//text()"
    },
    "url": {
      "css": "item link",
      "value": ".//text()"
    }
  }
}
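
One reason the Blogger case is more complicated: the `item link` pattern above fits RSS 2.0 feeds, where the URL is the text inside a link element. In the Atom entries quoted earlier in this thread, the URL sits in the href attribute of one of several link elements. The Python sketch below only illustrates that difference; it is not Huginn code. The function name `extract` and the example.com sample item are made up for the illustration.

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"

# Hypothetical RSS 2.0 item: the URL is the text of <link>.
RSS_SAMPLE = """<rss version="2.0"><channel>
  <item><title>Example post</title><link>http://example.com/post.html</link></item>
</channel></rss>"""

# Atom entry shaped like the Blogger one: the URL is an href attribute,
# and several <link> elements compete for it.
ATOM_SAMPLE = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <title>Cool maps</title>
    <link rel="replies" type="text/html" href="http://hydraraptor.blogspot.com/2016/02/cool-maps.html#comment-form"/>
    <link rel="alternate" type="text/html" href="http://hydraraptor.blogspot.com/2016/02/cool-maps.html"/>
  </entry>
</feed>"""


def extract(feed_xml):
    """Return a list of {'title', 'url'} dicts for RSS 2.0 items and Atom entries."""
    root = ET.fromstring(feed_xml)
    results = []
    # RSS 2.0: plain element text, the same pattern as "item link" + text().
    for item in root.iter("item"):
        results.append({"title": item.findtext("title"),
                        "url": item.findtext("link")})
    # Atom: pick the link by its rel attribute and read href.
    for entry in root.iter(ATOM + "entry"):
        link = entry.find(ATOM + "link[@rel='alternate']")
        results.append({"title": entry.findtext(ATOM + "title"),
                        "url": link.get("href") if link is not None else None})
    return results


print(extract(RSS_SAMPLE))
print(extract(ATOM_SAMPLE))
```

For a WebsiteAgent, the equivalent would be a selector that filters on rel='alternate' and a value that reads the href attribute instead of the text; the exact selector depends on how the agent handles the feed's XML namespace, so treat that as something to experiment with.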

Alternatively, you can use http://www.rssmix.com/. You need more than one feed, but it will combine them, and the output will be more consistent with what XML output should be, which means the RSSAgent will be able to extract the data.

knu · 2016-02-10T04:59:29Z

Let me take a look into this later.

Jngai · 2016-03-07T18:52:48Z

@knu do you have an update for this issue?

Jngai · 2016-03-08T22:13:11Z

@knu if you don't mind, I plan to try it. I was looking at the rss agent and think it is manageable.

Jngai · 2016-03-09T18:02:34Z

@Lenbok I don't think there is a way to select your own URLs. I was looking into the agent. It turns out it's pretty reliant on feed-normalizer, and the URL has already been selected. On the upside, did you notice that the alternate links that you were looking for are actually in the content section? Hope this helps.

{
  "id": "tag:blogger.com,1999:blog-4339813531032979196.post-8884703291196141290",
  "date_published": "2016-01-29 17:20:00 +0000",
  "last_updated": "2016-02-18 11:21:40 +0000",
  "url": "http://hydraraptor.blogspot.com/feeds/8884703291196141290/comments/default",
  "urls": [
    "http://hydraraptor.blogspot.com/feeds/8884703291196141290/comments/default"
  ],
  "description": "",
  "content": "A few months ago <a href=\"https://plus.google.com/u/0/113737869224798118032/auto\" target=\"_blank\">+Neil Darlow<\/a> mentioned that he had replaced his <a href=\"http://hydraraptor.blogspot.co.uk/2011/12/mendel90.html\" target=\"_blank\">Mendel90<\/a> fan with a quiet version and it seemed to give better cooling results. This made me curious because quieter fans of the same dimensions generally spin slower and produce less airflow. So I purchased

Jngai · 2016-04-04T19:09:32Z

@Lenbok did my suggestion help?

Lenbok · 2016-04-04T21:02:36Z

@Jngai Hmmmm, in the (raw xml of the) entries I have looked at, the links in the content section do not correspond to those links that appear after the content section (in particular I want the "alternate" link, as that is the one I should click on in order to visit the article itself in a browser). #957 discusses an alternative implementation that doesn't depend on feed-normalizer, so I was hoping that would provide a solution.

Lenbok · 2017-12-06T18:03:06Z

Closing as this has been addressed by #1564

cantino added the help wanted label Feb 6, 2016

Lenbok closed this as completed Dec 6, 2017
