Request stalls making a request to a URL that streams content #188

Closed
lepek opened this issue Aug 25, 2016 · 7 comments

Comments

@lepek

lepek commented Aug 25, 2016

Example:

MetaInspector.new('http://listen.radionomy.com/abc-lounge')

I have searched MetaInspector, Faraday and Net::HTTP for a way to limit the amount of time a request can stay open reading data, but I couldn't find one. Am I missing something here?
One approach would be to make a HEAD request and decide from the content-type whether the URL is HTML or not. But that would mean making two requests to the URL, and I don't like it.
Another way is to wrap the previous call with the Timeout module, but it is known that the Thread.raise used by the Timeout module can cause some issues. Anyway, I am not sure it is entirely wrong; I think we could use it in this case.
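
For illustration, the HEAD-request idea could look roughly like the sketch below. It uses only Net::HTTP from the standard library; the helper names `html_content_type?` and `html_url?` and the 5-second timeouts are assumptions made for this example, not part of MetaInspector or Faraday. The sketch does not follow redirects, so for a redirecting URL it only sees the headers of the redirect response itself.

```ruby
require 'net/http'
require 'uri'

# Hypothetical helper: true when a Content-Type header value looks like HTML.
def html_content_type?(content_type)
  content_type.to_s.match?(%r{\A\s*(?:text/html|application/xhtml\+xml)\b}i)
end

# Hypothetical helper: issues a HEAD request (no body is transferred) and
# inspects the Content-Type header. Redirects are NOT followed here.
def html_url?(url, open_timeout: 5, read_timeout: 5)
  uri = URI.parse(url)
  Net::HTTP.start(uri.host, uri.port,
                  use_ssl: uri.scheme == 'https',
                  open_timeout: open_timeout,
                  read_timeout: read_timeout) do |http|
    response = http.head(uri.request_uri)
    html_content_type?(response['content-type'])
  end
end
```

The cost is the one already mentioned: every HTML page is requested twice, once with HEAD and once with GET.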

@jaimeiniesta
Owner

Thanks @lepek, that's an interesting problem. In this case the URL corresponds to a streaming audio file, so MetaInspector has nothing to do with it as it's not HTML, but I understand that you've stumbled into that when crawling a site?

Anyway, our only timeouts are the ones explained here but none of them can save us from being stalled at a streaming URL.

Both :connection_timeout and :read_timeout are options we pass to Faraday, but it looks like Faraday doesn't support streaming. I've seen there are many issues on Faraday trying to cover this, but there doesn't seem to be anything finished yet.

I'll ask the Faraday crew if there is a solution to this.

If not, in order to protect us from this, I would wrap the request in a Timeout, and then catch the error raised and re-raise a MetaInspector::TimeoutError.
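
A minimal sketch of that wrapping, using only the standard library's Timeout module. The error class `FetchTimeoutError` and the helper `with_global_timeout` are hypothetical names chosen for this example; they stand in for whatever MetaInspector would actually raise and are not its real implementation.

```ruby
require 'timeout'

# Hypothetical error class standing in for a library-specific timeout error.
class FetchTimeoutError < StandardError; end

# Runs the given block, giving up after `seconds` and translating the
# Timeout::Error into the library-specific error.
def with_global_timeout(seconds)
  Timeout.timeout(seconds) { yield }
rescue Timeout::Error
  raise FetchTimeoutError, "request did not finish within #{seconds} seconds"
end
```

Usage would be along the lines of `with_global_timeout(10) { conn.get }`. The caveat raised earlier in the thread still applies: Timeout interrupts the block from another thread, so the block can be cut off at an arbitrary point.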

@jaimeiniesta
Owner

I've found that with Faraday, I can get an empty response from that URL, and it doesn't get stuck:

2.0.0-p648 :001 > conn = Faraday.new(url: 'http://listen.radionomy.com/abc-lounge')
 => #<Faraday::Connection:0x007fe5fa8cac98 @parallel_manager=nil, @headers={"User-Agent"=>"Faraday v0.9.2"}, @params={}, @options=#<Faraday::RequestOptions (empty)>, @ssl=#<Faraday::SSLOptions (empty)>, @default_parallel_manager=nil, @builder=#<Faraday::RackBuilder:0x007fe5fa8ca9f0 @handlers=[Faraday::Request::UrlEncoded, Faraday::Adapter::NetHttp]>, @url_prefix=#<URI::HTTP:0x007fe5f9bcbbf0 URL:http://listen.radionomy.com/abc-lounge>, @proxy=nil>
2.0.0-p648 :002 > response = conn.get
 => #<Faraday::Response:0x007fe5f9bb9ce8 @on_complete_callbacks=[], @env=#<Faraday::Env @method=:get @body="" @url=#<URI::HTTP:0x007fe5f9bcbbf0 URL:http://listen.radionomy.com/abc-lounge> @request=#<Faraday::RequestOptions (empty)> @request_headers={"User-Agent"=>"Faraday v0.9.2"} @ssl=#<Faraday::SSLOptions (empty)> @response=#<Faraday::Response:0x007fe5f9bb9ce8 ...> @response_headers={"cache-control"=>"private", "content-type"=>"application/octet-stream", "location"=>"http://streaming.radionomy.com/ABC-Lounge", "server"=>"Microsoft-IIS/8.5", "x-aspnetmvc-version"=>"5.2", "x-aspnet-version"=>"4.0.30319", "x-powered-by"=>"ASP.NET", "date"=>"Sat, 27 Aug 2016 22:28:26 GMT", "connection"=>"close", "content-length"=>"0"} @status=302>>
2.0.0-p648 :003 > response.body
 => ""

But, if I use it from MetaInspector, it gets stuck. Unless we disable redirections, like this:

2.0.0-p648 :004 > page = MetaInspector.new('http://listen.radionomy.com/abc-lounge', allow_redirections: false)
 => #<MetaInspector::Document:0x007fe5f9b83800 @connection_timeout=20, @read_timeout=20, @retries=3, @allow_redirections=false, @allow_non_html_content=false, @document="http://streaming.radionomy.com/ABC-Lounge", @download_images=true, @headers={"User-Agent"=>"MetaInspector/5.2.3 (+https://github.com/jaimeiniesta/metainspector)", "Accept-Encoding"=>"identity"}, @normalize_url=true, @faraday_options=nil, @faraday_http_cache=nil, @url=#<MetaInspector::URL:0x007fe5f9b82158 @normalize=true, @url="http://listen.radionomy.com/abc-lounge">, @request=#<MetaInspector::Request:0x007fe5fbe67240 @url=#<MetaInspector::URL:0x007fe5f9b82158 @normalize=true, @url="http://listen.radionomy.com/abc-lounge">, @allow_redirections=false, @connection_timeout=20, @read_timeout=20, @retries=3, @headers={"User-Agent"=>"MetaInspector/5.2.3 (+https://github.com/jaimeiniesta/metainspector)", "Accept-Encoding"=>"identity"}, @faraday_options={:url=>"http://listen.radionomy.com/abc-lounge"}, @faraday_http_cache=nil, @response=#<Faraday::Response:0x007fe5fa07f598 @on_complete_callbacks=[], @env=#<Faraday::Env @method=:get @body="http://streaming.radionomy.com/ABC-Lounge" @url=#<URI::HTTP:0x007fe5fbe65620 URL:http://listen.radionomy.com/abc-lounge> @request=#<Faraday::RequestOptions timeout=20, open_timeout=20> @request_headers={"User-Agent"=>"MetaInspector/5.2.3 (+https://github.com/jaimeiniesta/metainspector)", "Accept-Encoding"=>"identity"} @ssl=#<Faraday::SSLOptions (empty)> @response=#<Faraday::Response:0x007fe5fa07f598 ...> @response_headers={"cache-control"=>"private", "content-type"=>"text/html; charset=utf-8", "location"=>"http://streaming.radionomy.com/ABC-Lounge", "server"=>"Microsoft-IIS/8.5", "x-aspnetmvc-version"=>"5.2", "x-aspnet-version"=>"4.0.30319", "x-powered-by"=>"ASP.NET", "date"=>"Sat, 27 Aug 2016 22:30:27 GMT", "connection"=>"close", "content-length"=>"41"} @status=302>>>, @parser=#<MetaInspector::Parser:0x007fe5fbe246c0 @document=#<MetaInspector::Document:0x007fe5f9b83800 
...>, @head_links_parser=#<MetaInspector::Parsers::HeadLinksParser:0x007fe5fbe24698 @main_parser=#<MetaInspector::Parser:0x007fe5fbe246c0 ...>>, @meta_tag_parser=#<MetaInspector::Parsers::MetaTagsParser:0x007fe5fbe24670 @main_parser=#<MetaInspector::Parser:0x007fe5fbe246c0 ...>>, @links_parser=#<MetaInspector::Parsers::LinksParser:0x007fe5fbe24648 @main_parser=#<MetaInspector::Parser:0x007fe5fbe246c0 ...>>, @download_images=true, @images_parser=#<MetaInspector::Parsers::ImagesParser:0x007fe5fbe245f8 @download_images=true, @main_parser=#<MetaInspector::Parser:0x007fe5fbe246c0 ...>>, @texts_parser=#<MetaInspector::Parsers::TextsParser:0x007fe5fbe245d0 @main_parser=#<MetaInspector::Parser:0x007fe5fbe246c0 ...>>, @parsed=#<Nokogiri::HTML::Document:0x3ff2fdf0f548 name="document" children=[#<Nokogiri::XML::DTD:0x3ff2fdf0b858 name="html">, #<Nokogiri::XML::Element:0x3ff2fdf0b5b0 name="html" children=[#<Nokogiri::XML::Element:0x3ff2fdf0a930 name="body" children=[#<Nokogiri::XML::Element:0x3ff2fdf0a700 name="p" children=[#<Nokogiri::XML::Text:0x3ff2fdf0a520 "http://streaming.radionomy.com/ABC-Lounge">]>]>]>]>>>
2.0.0-p648 :005 > page.to_s
 => "http://streaming.radionomy.com/ABC-Lounge"

Of course, we don't want to disable redirections globally, so this is not a solution, just a hint at what can be happening here. I'll keep you posted.

@jaimeiniesta
Owner

Example of Faraday getting stuck:

2.0.0-p648 :007 > conn = Faraday.new(url: 'http://listen.radionomy.com/abc-lounge') do |faraday|
2.0.0-p648 :008 >     faraday.use FaradayMiddleware::FollowRedirects, limit: 10
2.0.0-p648 :009?>     faraday.use :cookie_jar
2.0.0-p648 :010?>     faraday.adapter :net_http
2.0.0-p648 :011?>   end
 => #<Faraday::Connection:0x007fef94776ee8 @parallel_manager=nil, @headers={"User-Agent"=>"Faraday v0.9.2"}, @params={}, @options=#<Faraday::RequestOptions (empty)>, @ssl=#<Faraday::SSLOptions (empty)>, @default_parallel_manager=nil, @builder=#<Faraday::RackBuilder:0x007fef94776b28 @handlers=[FaradayMiddleware::FollowRedirects, Faraday::CookieJar, Faraday::Adapter::NetHttp]>, @url_prefix=#<URI::HTTP:0x007fef94776880 URL:http://listen.radionomy.com/abc-lounge>, @proxy=nil>
2.0.0-p648 :012 > response = conn.get
^CIRB::Abort: abort then interrupt!

Strangely, it doesn't get stuck if we leave out the faraday.adapter :net_http line, but that is the default adapter, so the line shouldn't be needed.

@jaimeiniesta
Owner

Opened an issue upstream - lostisland/faraday#602

If we don't come up with a solution for that, I'll put the request inside a Timeout as said before.

@lepek
Author

lepek commented Aug 28, 2016

Although we know it is not HTML, a URL like this can be found and fetched while crawling a site. We know it is not HTML; now we have to make MetaInspector->Faraday know that too 😄

The redirect is there only because that URL redirects to the actual streaming URL, which is longer, so we can ignore the redirections; they are not the problem:

2.1.4 :067 > conn = Faraday.new(url: 'http://streaming.radionomy.com/ABC-Lounge?lang=es-ES%2ces%3bq%3d0.8%2cen%3bq%3d0.6%2cpt%3bq%3d0.4%2cgl%3bq%3d0.2') do |f|
2.1.4 :068 >     f.adapter :net_http
2.1.4 :069?>   end
 => #<Faraday::Connection:0x007fbf3cfa84c8 @parallel_manager=nil, @headers={"User-Agent"=>"Faraday v0.9.2"}, @params={"lang"=>"es-ES,es;q=0.8,en;q=0.6,pt;q=0.4,gl;q=0.2"}, @options=#<Faraday::RequestOptions (empty)>, @ssl=#<Faraday::SSLOptions (empty)>, @default_parallel_manager=nil, @builder=#<Faraday::RackBuilder:0x007fbf3cfa8090 @handlers=[Faraday::Adapter::NetHttp]>, @url_prefix=#<URI::HTTP:0x007fbf3cfb3918 URL:http://streaming.radionomy.com/ABC-Lounge>, @proxy=nil>
2.1.4 :070 > response = conn.get
^CIRB::Abort: abort then interrupt!

This is an idea using Net::HTTP:

2.1.4 :097 > url = URI.parse('http://streaming.radionomy.com/ABC-Lounge?lang=es-ES%2ces%3bq%3d0.8%2cen%3bq%3d0.6%2cpt%3bq%3d0.4%2cgl%3bq%3d0.2')
 => #<URI::HTTP:0x007fbf3bf31f40 URL:http://streaming.radionomy.com/ABC-Lounge?lang=es-ES%2ces%3bq%3d0.8%2cen%3bq%3d0.6%2cpt%3bq%3d0.4%2cgl%3bq%3d0.2>
2.1.4 :098 > body=""
 => ""
2.1.4 :099 > res = Net::HTTP.start(url.host, url.port) { |http|
2.1.4 :100 >    http.request_get(url.path) {|response|
2.1.4 :101 >       break unless response['content-type'] =~ /html/i
2.1.4 :102?>       response.read_body {|b|
2.1.4 :103 >          body<<b
2.1.4 :104?>       }
2.1.4 :105?>   }
2.1.4 :106?>}
 => nil

But Faraday doesn't use Net::HTTP like that, so I guess I would have to write a new adapter.
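
The same Net::HTTP idea can be written as a reusable sketch that also caps how much is read and for how long. Everything named here (`BodyLimitError`, `read_html_body`, `fetch_html`, and the `max_bytes` and `deadline` defaults) is an assumption made for illustration, not an existing API. The deadline is only checked when a chunk arrives, so a server that goes silent is still left to Net::HTTP's own read timeout.

```ruby
require 'net/http'
require 'uri'

# Hypothetical error raised when the body exceeds the size or time budget.
class BodyLimitError < StandardError; end

# Reads the body of a Net::HTTPResponse-like object chunk by chunk.
# Returns nil for non-HTML content types without reading any of the body.
def read_html_body(response, max_bytes: 1_000_000, deadline: 10)
  return nil unless response['content-type'].to_s =~ /html/i

  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  body = +''
  response.read_body do |chunk|
    body << chunk
    elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
    raise BodyLimitError, 'body too large' if body.bytesize > max_bytes
    raise BodyLimitError, 'body took too long' if elapsed > deadline
  end
  body
end

# Fetches a URL and returns its HTML body, or nil when it is not HTML.
def fetch_html(url, **limits)
  uri = URI.parse(url)
  Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == 'https') do |http|
    # request_uri keeps the query string, unlike uri.path.
    http.request_get(uri.request_uri) do |response|
      # Returning from inside the block leaves before Net::HTTP tries to
      # read the rest of the body, which is what avoids the stall.
      return read_html_body(response, **limits)
    end
  end
end
```

The early `return` matters: if the block simply finished, Net::HTTP would go on to read the whole body itself, and an endless stream would stall the request again.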

@jaimeiniesta
Owner

Alright, so then it's got nothing to do with redirections. This is the simplest case where Faraday gets stuck:

conn = Faraday.new(url: 'http://streaming.radionomy.com/ABC-Lounge?lang=es-ES%2ces%3bq%3d0.8%2cen%3bq%3d0.6%2cpt%3bq%3d0.4%2cgl%3bq%3d0.2')

conn.get

I think that as Faraday doesn't support streaming, we're going to have to add our own global timeout around it.

@jaimeiniesta
Owner

This has been fixed and released in 5.3.0.

As Faraday doesn't handle streaming for now, I've added a timeout on top of it, set to the sum of the connection timeout and the read timeout, plus one second.

By default, this means a timeout of 21 seconds. You can set a different timeout with the :connection_timeout and :read_timeout options.
