Bad Uri #938
Comments
WebsiteAgent automatically parses the value as a URI when the extraction key is `url`. An ugly workaround is to rename the key from `url` to something else.
No, `uri_escape` shouldn't be applied automatically, as it might end up double escaping. I was thinking maybe a flag to set the escape option, but that didn't make much sense either, because this seems like a one-off case specific to extracting URLs. The ugly workaround doesn't quite work in this case... The problem is that once the WebsiteAgent comes across the malformed URL, it stops extracting. It will only output everything before it encountered the bad URL (and nothing after). If the HTML is as follows:

```html
<ul>
  <li><a href="http://google.com">google</a></li>
  <li><a href="https://www.google.ca/search?q=some query">broken</a></li>
  <li><a href="https://www.google.ca/search?q=some%20query">escaped</a></li>
</ul>
```

extraction should try to process as many links as possible, not stop after it finds a bad URL. Ignoring the second one and just logging it would suffice, I think.
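For illustration, here is a minimal sketch of the double-escaping concern using the stdlib escaper (an assumption for demonstration; Huginn's actual internal escaping may differ): re-escaping an already-escaped URL encodes the `%` itself.

```ruby
require "uri"

already_escaped = "https://www.google.ca/search?q=some%20query"

# Re-escaping encodes the '%' of %20, so the query now decodes to the
# literal string "some%20query" instead of "some query".
URI::DEFAULT_PARSER.escape(already_escaped)
# => "https://www.google.ca/search?q=some%2520query"
```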
Did you actually try the workaround? The "rename the key from `url` to something different" part is essential, because WebsiteAgent tries to parse a value as a URI only if the key is `url`. WebsiteAgent wouldn't stop after a URI error unless it found a bad URL in the `url` key.
Sorry, I hadn't.
OK, it's my turn to look for a real fix. I still think it's a good idea to first try parsing a value as a URL and, if that fails, fall back to escaping the unescaped characters in it and retry parsing. What do you think?
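A sketch of that parse-then-escape fallback with the stdlib (a hypothetical helper for illustration, not Huginn's actual code):

```ruby
require "uri"

# Strict parse first; escape and retry only on failure. Already-escaped
# URLs succeed on the first attempt, which avoids double escaping.
def parse_uri_leniently(str)
  URI.parse(str)
rescue URI::InvalidURIError
  URI.parse(URI::DEFAULT_PARSER.escape(str))
end

parse_uri_leniently("https://www.google.ca/search?q=some query")
# => #<URI::HTTPS https://www.google.ca/search?q=some%20query>
```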
I spoke too soon. While the first part works, the second part fails... Unless I misinterpreted something?

WebsiteAgent:

```json
{
  "expected_update_period_in_days": "2",
  "url": "https://dl.dropboxusercontent.com/u/28950293/test.html",
  "type": "html",
  "mode": "on_change",
  "extract": {
    "item_url": {
      "css": "a",
      "value": "@href"
    },
    "title": {
      "css": "a",
      "value": "normalize-space(.)"
    }
  }
}
```

EventFormattingAgent:

```json
{
  "instructions": {
    "url": "{{item_url | uri_escape | uri_expand(base_uri) }}"
  },
  "mode": "merge"
}
```
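One plausible reading of the failure, assuming the `uri_escape` filter percent-encodes reserved characters the way `CGI.escape` does (an assumption about Huginn's filter, not verified here): escaping the whole string destroys the URL's structure before `uri_expand` ever sees it.

```ruby
require "cgi"

# CGI.escape encodes the scheme and path separators too, so the result
# is no longer a URL that a later expand step can resolve.
CGI.escape("https://www.google.ca/search?q=some query")
# => "https%3A%2F%2Fwww.google.ca%2Fsearch%3Fq%3Dsome+query"
```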
Ah, of course.
Running under Ruby 2.2.2, this issue goes away, and the WebsiteAgent runs as expected:

```json
[
  {
    "url": "http://google.com",
    "title": "google"
  },
  {
    "url": "https://www.google.ca/search?q=some%20query",
    "title": "broken"
  },
  {
    "url": "https://www.google.ca/search?q=some%20query",
    "title": "escaped"
  }
]
```
I've got a different problem now: Unicode characters in the URL. For a URL like this in a site: http://ko.wikipedia.org/wiki/위키백과:대문
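A minimal reproduction sketch with the stdlib (exact behavior and error text may vary slightly across Ruby versions):

```ruby
require "uri"

url = "http://ko.wikipedia.org/wiki/위키백과:대문"

begin
  URI.parse(url)   # the stdlib parser accepts ASCII-only URIs
rescue URI::InvalidURIError => e
  puts e.message   # "URI must be ascii only ..." or similar
end

# Percent-encoding the non-ASCII path first makes it parse cleanly:
URI.parse(URI::DEFAULT_PARSER.escape(url))
# => #<URI::HTTP http://ko.wikipedia.org/wiki/%EC%9C%84%ED%82%A4%EB%B0%B1%EA%B3%BC:%EB%8C%80%EB%AC%B8>
```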
Do you know which gem is producing that error?
It seems the problem is this line. Comment it out and the agent in the first post works... Otherwise it raises an error from the URI parser.
Oh wow, that looks like the error is deep in the Ruby standard library. I wonder if escaping the Unicode in some way would help? Alternatively, we could possibly call…
Well, really, the sites should be encoding the URLs properly... For my…
You could definitely add a rescue similar to #943.
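A generic sketch of that kind of rescue (illustrative only, not the actual change in #943): skip and log unparsable URLs instead of aborting the whole extraction.

```ruby
require "uri"
require "logger"

# Hypothetical helper: yield only the hrefs that parse; log the rest.
def each_parsable_url(hrefs, logger: Logger.new($stderr))
  hrefs.each do |href|
    begin
      yield URI.parse(href)
    rescue URI::Error => e
      logger.warn("Ignoring bad URI #{href.inspect}: #{e.message}")
    end
  end
end

each_parsable_url(["http://google.com",
                   "https://www.google.ca/search?q=some query"]) do |uri|
  puts uri
end
# prints http://google.com and logs a warning for the second href
```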
I also have the same error, but with a different setup. I have tried both the RSS Agent and the Website Agent, with the same result (an error) but different failures. I have upgraded Ruby to 2.2.2 and also updated Huginn to the latest version from Git master, but no luck.

The RSS Agent is as follows: { … }

The error I get is:

E, [2015-08-17T15:41:07.077118 #20] ERROR -- : Failed to fetch [ARRAY OF URLS AS ABOVE] with message 'bad URI(is not URI?): [ "http://www.teekay.com/rss/pressrelease.aspx", "http://www.swireshipping.com/index.php':

For the Website Agent, I get the following: { … }

E, [2015-08-17T15:43:14.839367 #17] ERROR -- : Ignoring a non-HTTP url: "["ESCAPED URL LIST FROM ABOVE"]"
It looks like you have the array inside of a string. It should be an actual JSON array, e.g. `"url": ["http://www.teekay.com/rss/pressrelease.aspx", …]`, not that array serialized into one quoted string.
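The difference, sketched in Ruby (values taken from the URLs quoted above, for illustration):

```ruby
require "json"

urls = ["http://www.teekay.com/rss/pressrelease.aspx",
        "http://www.swireshipping.com/index.php"]

good = { "url" => urls }          # an actual array: each entry is one URL
bad  = { "url" => urls.to_json }  # the array serialized into a single string

good["url"].first  # => "http://www.teekay.com/rss/pressrelease.aspx"
bad["url"]         # => "[\"http://www.teekay.com/...\"]" -- one giant
                   #    "URL" that can never parse: bad URI(is not URI?)
```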
Hi there, was there a suggested resolution on this? I updated and started again with a fresh Huginn install in case I'd configured something incorrectly, but no dice: the error I'm getting is with non-ASCII characters.
I'm not sure where this is at. |
@haroonis what URL is causing that error? |
Thanks for the suggestion, @irfancharania. I'm using a workaround for now: I've named the URL key differently so it's not processed as a URL. Not ideal, as I then have to append it to the domain name (it's a relative URL); however, I'm not set up to recompile the code at the moment (I use Windows, mainly!).
Does @irfancharania's #958 PR fix these issues?
I found it a bit tricky to install Huginn in the first place, so I'm not sure how I'd go about testing it! It looks to have failed some checks, according to that link?
I believe @knu was going to take a look at that PR again.
Just pushed #1125! This fixes #938, and the specs are from #958. (Thanks @irfancharania!)

Awesome!
The browser seems to be more forgiving with malformed links than Ruby. I've got a site that I'm trying to extract URLs from, and because one URL is not properly formed, Ruby just borks with

Error when fetching url: bad URI(is not URI?)

and stops parsing the rest of the page. Example:
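A minimal sketch of the failure with the stdlib (an illustration, assuming an href with a raw space like the "broken" link shown earlier in the thread):

```ruby
require "uri"

# An href with an unencoded space is rejected outright:
URI.parse("https://www.google.ca/search?q=some query")
# raises URI::InvalidURIError: bad URI(is not URI?)
```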
Is there any way to make something like `uri_escape` available for Liquid during the extraction? At the very least, I think it should ignore the bad URI and continue trying to parse the rest of the links on the page.