Bad Uri #938
Comments
WebsiteAgent automatically parses the value as a URI when the extraction key is `url`. An ugly workaround is to rename the key from `url` to something else.
No, `uri_escape` shouldn't be applied automatically, as it might end up double escaping. I was thinking maybe a flag to set the escape option, but that didn't make much sense either, because this seems like a one-off case specific to extracting URLs. The ugly workaround doesn't quite work in this case... The problem is that once the WebsiteAgent comes across the malformed URL, it stops extracting. It will only output everything before it encountered the bad URL (and nothing after). If the HTML is as follows:

```html
<ul>
  <li><a href="http://google.com">google</a></li>
  <li><a href="https://www.google.ca/search?q=some query">broken</a></li>
  <li><a href="https://www.google.ca/search?q=some%20query">escaped</a></li>
</ul>
```

extraction should try to process as many links as possible, not stop after it finds a bad URL. Ignoring the second one and just logging it would suffice, I think.
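For illustration, here is a minimal sketch of the double-escaping concern using the stdlib escaper (an assumption for demonstration; Huginn's actual internal escaping may differ): re-escaping an already-escaped URL encodes the `%` itself.

```ruby
require "uri"

already_escaped = "https://www.google.ca/search?q=some%20query"

# Re-escaping encodes the '%' of %20, so the query now decodes to the
# literal string "some%20query" instead of "some query".
URI::DEFAULT_PARSER.escape(already_escaped)
# => "https://www.google.ca/search?q=some%2520query"
```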
Did you actually try the workaround? The "rename the key from `url` to something different" part is essential, because WebsiteAgent tries to parse a value as a URI only if the key is `url`. WebsiteAgent wouldn't stop after a URI error unless it found a bad URL in the `url` key.
Sorry, I hadn't.
OK, it's my turn to look for a real fix. I still think it's a good idea to first try parsing a value as a URL and, if that fails, fall back to escaping the unescaped characters in it and retry parsing. What do you think?
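A sketch of that parse-then-escape fallback with the stdlib (a hypothetical helper for illustration, not Huginn's actual code):

```ruby
require "uri"

# Strict parse first; escape and retry only on failure. Already-escaped
# URLs succeed on the first attempt, which avoids double escaping.
def parse_uri_leniently(str)
  URI.parse(str)
rescue URI::InvalidURIError
  URI.parse(URI::DEFAULT_PARSER.escape(str))
end

parse_uri_leniently("https://www.google.ca/search?q=some query")
# => #<URI::HTTPS https://www.google.ca/search?q=some%20query>
```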
I spoke too soon. While the first part works, the second part fails... Unless I misinterpreted something?

WebsiteAgent:

```json
{
  "expected_update_period_in_days": "2",
  "url": "https://dl.dropboxusercontent.com/u/28950293/test.html",
  "type": "html",
  "mode": "on_change",
  "extract": {
    "item_url": {
      "css": "a",
      "value": "@href"
    },
    "title": {
      "css": "a",
      "value": "normalize-space(.)"
    }
  }
}
```

EventFormattingAgent:

```json
{
  "instructions": {
    "url": "{{item_url | uri_escape | uri_expand(base_uri) }}"
  },
  "mode": "merge"
}
```
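One plausible reading of the failure, assuming the `uri_escape` filter percent-encodes reserved characters the way `CGI.escape` does (an assumption about Huginn's filter, not verified here): escaping the whole string destroys the URL's structure before `uri_expand` ever sees it.

```ruby
require "cgi"

# CGI.escape encodes the scheme and path separators too, so the result
# is no longer a URL that a later expand step can resolve.
CGI.escape("https://www.google.ca/search?q=some query")
# => "https%3A%2F%2Fwww.google.ca%2Fsearch%3Fq%3Dsome+query"
```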
Ah, of course.
Running under Ruby 2.2.2, this issue goes away, and the WebsiteAgent runs as expected:

```json
[
  {
    "url": "http://google.com",
    "title": "google"
  },
  {
    "url": "https://www.google.ca/search?q=some%20query",
    "title": "broken"
  },
  {
    "url": "https://www.google.ca/search?q=some%20query",
    "title": "escaped"
  }
]
```
I've got a different problem now: Unicode characters in the URL. For a URL like this in a site: http://ko.wikipedia.org/wiki/위키백과:대문
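A minimal reproduction sketch with the stdlib (exact behavior and error text may vary slightly across Ruby versions):

```ruby
require "uri"

url = "http://ko.wikipedia.org/wiki/위키백과:대문"

begin
  URI.parse(url)   # the stdlib parser accepts ASCII-only URIs
rescue URI::InvalidURIError => e
  puts e.message   # "URI must be ascii only ..." or similar
end

# Percent-encoding the non-ASCII path first makes it parse cleanly:
URI.parse(URI::DEFAULT_PARSER.escape(url))
# => #<URI::HTTP http://ko.wikipedia.org/wiki/%EC%9C%84%ED%82%A4%EB%B0%B1%EA%B3%BC:%EB%8C%80%EB%AC%B8>
```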
Do you know which gem is producing that error?
It seems the problem is this line. Comment it out and the agent in the first post works... Otherwise it raises an error from the URI parser.
Oh wow, that looks like the error is deep in the Ruby standard library. I wonder if escaping the Unicode in some way would help? Alternatively, we could possibly call…
Well, really, the sites should be encoding the URLs properly... For my…
You could definitely add a rescue similar to #943.
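A generic sketch of that kind of rescue (illustrative only, not the actual change in #943): skip and log unparsable URLs instead of aborting the whole extraction.

```ruby
require "uri"
require "logger"

# Hypothetical helper: yield only the hrefs that parse; log the rest.
def each_parsable_url(hrefs, logger: Logger.new($stderr))
  hrefs.each do |href|
    begin
      yield URI.parse(href)
    rescue URI::Error => e
      logger.warn("Ignoring bad URI #{href.inspect}: #{e.message}")
    end
  end
end

each_parsable_url(["http://google.com",
                   "https://www.google.ca/search?q=some query"]) do |uri|
  puts uri
end
# prints http://google.com and logs a warning for the second href
```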
I also have the same error, but with a different setup. I have tried both the RSS Agent and the Website Agent, with the same result (an error) but different failures. I have upgraded Ruby to 2.2.2 and also updated Huginn to the latest version from Git master, but no luck.

The RSS Agent is as follows: { … }

The error I get is:

E, [2015-08-17T15:41:07.077118 #20] ERROR -- : Failed to fetch [ARRAY OF URLS AS ABOVE] with message 'bad URI(is not URI?): [ "http://www.teekay.com/rss/pressrelease.aspx", "http://www.swireshipping.com/index.php':

For the Website Agent, I get the following: { … }

E, [2015-08-17T15:43:14.839367 #17] ERROR -- : Ignoring a non-HTTP url: "["ESCAPED URL LIST FROM ABOVE"]"
It looks like you have the array inside of a string. It should be an actual JSON array, e.g. `"url": ["http://www.teekay.com/rss/pressrelease.aspx", …]`, not that array serialized into one quoted string.
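The difference, sketched in Ruby (values taken from the URLs quoted above, for illustration):

```ruby
require "json"

urls = ["http://www.teekay.com/rss/pressrelease.aspx",
        "http://www.swireshipping.com/index.php"]

good = { "url" => urls }          # an actual array: each entry is one URL
bad  = { "url" => urls.to_json }  # the array serialized into a single string

good["url"].first  # => "http://www.teekay.com/rss/pressrelease.aspx"
bad["url"]         # => "[\"http://www.teekay.com/...\"]" -- one giant
                   #    "URL" that can never parse: bad URI(is not URI?)
```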
Hi there, was there a suggested resolution on this? I updated and started again with a fresh Huginn install in case I'd configured something incorrectly, but no dice: the error I'm getting is with non-ASCII characters.
I'm not sure where this is at. |
@haroonis what URL is causing that error? |
Thanks for the suggestion, @irfancharania. I'm using a workaround for now: I've named the URL key differently so it's not processed as a URL. Not ideal, as I then have to append it to the domain name (it's a relative URL); however, I'm not set up to recompile the code at the moment (I use Windows, mainly!).
Does @irfancharania's #958 PR fix these issues?
I found it a bit tricky to install Huginn in the first place, so I'm not sure how I'd go about testing it! It looks to have failed some checks, according to that link?
I believe @knu was going to take a look at that PR again.
Just pushed #1125! This fixes #938, and the specs are from #958. (Thanks @irfancharania!)

Awesome!
The browser seems to be more forgiving with malformed links than Ruby. I've got a site that I'm trying to extract URLs from, and because one URL is not properly formed, Ruby just borks with

Error when fetching url: bad URI(is not URI?)

and stops parsing the rest of the page. Example:
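A minimal sketch of the failure with the stdlib (an illustration, assuming an href with a raw space like the "broken" link shown earlier in the thread):

```ruby
require "uri"

# An href with an unencoded space is rejected outright:
URI.parse("https://www.google.ca/search?q=some query")
# raises URI::InvalidURIError: bad URI(is not URI?)
```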
Is there any way to make something like `uri_escape` available for Liquid during the extraction? At the very least, I think it should ignore the bad URI and continue trying to parse the rest of the links on the page.