RSSAgent not extracting the primary entry URL #1276
Comments
Hey @Lenbok. I think that'd be a code change to the RSSAgent, if you want to give it a go. I'm not sure how hard it'd be. I'll add "Help Wanted" to this issue in case someone has suggestions. You can also parse RSS with a WebsiteAgent, and that would give you more control over exactly what you extract. There are some examples on the wiki.
We'd have to ask @knu, I'm not sure. I don't think that PR will fix all of the issues that reference it at this point ;)
Can't you use the WebsiteAgent to extract any data, even from an XML feed?
Yes, you could definitely try the WebsiteAgent too.
Someone who is familiar with XML parsing, XPath, etc. would probably have no trouble getting the job done via the WebsiteAgent. I was hoping that the RSSAgent could be made to Just Work for those that aren't :-).
Lenbok, if you play around with the WebsiteAgent and look at Cantino's example, you should be able to figure out what keywords you need to enter to extract the right data. Here's what I used for one of my feeds that wouldn't extract; I didn't need to change any of Cantino's code. I think your situation is a little more complicated, but you don't need coding skills, you just need to be able to recognize patterns.
Alternatively, you can use http://www.rssmix.com/. You need more than one feed, but it will combine them, and the output will be more consistent with what XML output should be, which means the RSSAgent will be able to extract the data.
Let me take a look into this later.
@knu do you have an update for this issue?
@knu if you don't mind, I plan to try it. I was looking at the RSSAgent and think it is manageable.
@Lenbok I don't think there is a way to select your own URLs. I was looking into the agent. Turns out it's pretty reliant on feed-normalizer, and the URL has already been selected. On the upside, did you notice that the alternative links that you were looking for are actually in the content section? Hope this helps.
@Lenbok did my suggestion help?
@Jngai Hmmmm, in the (raw XML of the) entries I have looked at, the links in the content section do not correspond to those links that appear after the content section (in particular I want the "alternate" link, as that is the one I should click on in order to visit the article itself in a browser). #957 discusses an alternative implementation that doesn't depend on feed-normalizer, so I was hoping that would provide a solution.
Closing as this has been addressed by #1564.
I have an RSSAgent collecting data from http://hydraraptor.blogspot.com/feeds/posts/default (although the same happens with other blogspot feeds) to perform some aggregation and filtering. Unfortunately, the emitted events do not contain the primary URL for each entry, instead choosing one that does not direct to the article itself (and can't be clicked on to view in a browser). Looking at the RSS feed itself, each article has a series of links such as:
The corresponding emitted event has the following attributes:
So, it appears that the first link is selected by the RSSAgent.
Is it possible to select the rel='alternate' link instead, or otherwise exert some control over which links are selected?
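For anyone who wants to see what "select the rel='alternate' link" means in practice, here is a minimal sketch in Python. It is not how the RSSAgent or feed-normalizer works internally; it only illustrates the selection rule being asked for. The sample entry, the example.com URLs, and the function name `primary_url` are all made up for illustration and are not taken from the feed in this issue. The sketch assumes an Atom entry with several link elements and relies on the Atom rule that a link without a rel attribute counts as "alternate".

```python
import xml.etree.ElementTree as ET

# Namespace used by Atom feeds.
ATOM_NS = {"atom": "http://www.w3.org/2005/Atom"}

# Hypothetical entry, invented for this example (not copied from any real feed).
SAMPLE_ENTRY = """\
<entry xmlns="http://www.w3.org/2005/Atom">
  <title>Example post</title>
  <link rel="replies" type="application/atom+xml"
        href="http://example.com/feeds/123/comments/default"/>
  <link rel="self" type="application/atom+xml"
        href="http://example.com/feeds/posts/default/123"/>
  <link rel="alternate" type="text/html"
        href="http://example.com/2016/01/example-post.html"/>
</entry>
"""


def primary_url(entry_xml):
    """Return the href of the entry's rel='alternate' link.

    Falls back to the first link if no alternate link exists,
    and returns None if the entry has no links at all.
    """
    entry = ET.fromstring(entry_xml)
    links = entry.findall("atom:link", ATOM_NS)
    for link in links:
        # In Atom, a missing rel attribute is treated as "alternate".
        if link.get("rel", "alternate") == "alternate":
            return link.get("href")
    return links[0].get("href") if links else None


if __name__ == "__main__":
    print(primary_url(SAMPLE_ENTRY))
```

Taking the first link unconditionally would return the "replies" URL for this sample, which matches the kind of behaviour described above; filtering on rel picks the browser-viewable article URL instead.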