Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Timeout on provision #177

Open
jfalameda opened this issue Jan 25, 2019 · 8 comments
Open

Timeout on provision #177

jfalameda opened this issue Jan 25, 2019 · 8 comments

Comments

@jfalameda
Copy link

I am deploying a very large application. I get a timeout on provision, any tips?

OS: Mac OXS. Service fabric on docker, latest version.

`Command arguments: ['application', 'provision', '--application-type-build-path', 'DHCloud', '--timeout', '1000000', '--debug']
Event: Cli.PreExecute []
Event: CommandParser.OnGlobalArgumentsCreate [<function CLILogging.on_global_arguments at 0x1014632f0>, <function OutputProducer.on_global_arguments at 0x1014c96a8>, <function CLIQuery.on_global_arguments at 0x1014f8950>]
Event: CommandInvoker.OnPreCommandTableCreate []
Event: CommandLoader.OnLoadArguments []
Event: CommandInvoker.OnPostCommandTableCreate []
Event: CommandInvoker.OnCommandTableLoaded []
Event: CommandInvoker.OnPreParseArgs []
Event: CommandInvoker.OnPostParseArgs [<function OutputProducer.handle_output_argument at 0x1014c9730>, <function CLIQuery.handle_query_parameter at 0x1014f89d8>]
msrest.universal_http.requests : Configuring retry: max_retries=3, backoff_factor=0.8, max_backoff=90
msrest.service_client : Accept header absent and forced to application/json
msrest.universal_http : Configuring redirects: allow=True, max=30
msrest.universal_http : Configuring request: timeout=100, verify=True, cert=None
msrest.universal_http : Configuring proxies: ''
msrest.universal_http : Evaluate proxies against ENV settings: True
urllib3.connectionpool : Starting new HTTP connection (1): localhost
msrest.exceptions : Error occurred in request., ReadTimeout: HTTPConnectionPool(host='localhost', port=19080): Read timed out. (read timeout=100)
Error occurred in request., ReadTimeout: HTTPConnectionPool(host='localhost', port=19080): Read timed out. (read timeout=100)
Traceback (most recent call last):
File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 387, in _make_request
six.raise_from(e, None)
File "", line 2, in raise_from
File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 383, in _make_request
httplib_response = conn.getresponse()
File "/usr/local/Cellar/python/3.7.2_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/http/client.py", line 1321, in getresponse
response.begin()
File "/usr/local/Cellar/python/3.7.2_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/http/client.py", line 296, in begin
version, status, reason = self._read_status()
File "/usr/local/Cellar/python/3.7.2_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/http/client.py", line 257, in _read_status
line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
File "/usr/local/Cellar/python/3.7.2_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/socket.py", line 589, in readinto
return self._sock.recv_into(b)
socket.timeout: timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/usr/local/lib/python3.7/site-packages/requests/adapters.py", line 440, in send
timeout=timeout
File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 639, in urlopen
_stacktrace=sys.exc_info()[2])
File "/usr/local/lib/python3.7/site-packages/urllib3/util/retry.py", line 333, in increment
raise six.reraise(type(error), error, _stacktrace)
File "/usr/local/lib/python3.7/site-packages/urllib3/packages/six.py", line 686, in reraise
raise value
File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 601, in urlopen
chunked=chunked)
File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 389, in _make_request
self._raise_timeout(err=e, url=url, timeout_value=read_timeout)
File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 309, in _raise_timeout
raise ReadTimeoutError(self, url, "Read timed out. (read timeout=%s)" % timeout_value)
urllib3.exceptions.ReadTimeoutError: HTTPConnectionPool(host='localhost', port=19080): Read timed out. (read timeout=100)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/usr/local/lib/python3.7/site-packages/msrest/universal_http/requests.py", line 137, in send
**kwargs)
File "/usr/local/lib/python3.7/site-packages/requests/sessions.py", line 508, in request
resp = self.send(prep, **send_kwargs)
File "/usr/local/lib/python3.7/site-packages/requests/sessions.py", line 618, in send
r = adapter.send(request, **kwargs)
File "/usr/local/lib/python3.7/site-packages/requests/adapters.py", line 521, in send
raise ReadTimeout(e, request=request)
requests.exceptions.ReadTimeout: HTTPConnectionPool(host='localhost', port=19080): Read timed out. (read timeout=100)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/usr/local/lib/python3.7/site-packages/knack/cli.py", line 206, in invoke
cmd_result = self.invocation.execute(args)
File "/usr/local/lib/python3.7/site-packages/knack/invocation.py", line 188, in execute
cmd_result = parsed_args.func(params)
File "/usr/local/lib/python3.7/site-packages/knack/commands.py", line 105, in call
return self.handler(*args, **kwargs)
File "/usr/local/lib/python3.7/site-packages/knack/commands.py", line 212, in _command_handler
result = op(client, **command_args) if client else op(**command_args)
File "/usr/local/lib/python3.7/site-packages/sfctl/custom_app_type.py", line 117, in provision_application_type
request, header_parameters, body_content_sorted)
File "/usr/local/lib/python3.7/site-packages/msrest/service_client.py", line 336, in send
pipeline_response = self.config.pipeline.run(request, **kwargs)
File "/usr/local/lib/python3.7/site-packages/msrest/pipeline/init.py", line 197, in run
return first_node.send(pipeline_request, **kwargs) # type: ignore
File "/usr/local/lib/python3.7/site-packages/msrest/pipeline/init.py", line 150, in send
response = self.next.send(request, **kwargs)
File "/usr/local/lib/python3.7/site-packages/msrest/pipeline/requests.py", line 72, in send
return self.next.send(request, **kwargs)
File "/usr/local/lib/python3.7/site-packages/msrest/pipeline/requests.py", line 137, in send
return self.next.send(request, **kwargs)
File "/usr/local/lib/python3.7/site-packages/msrest/pipeline/init.py", line 150, in send
response = self.next.send(request, **kwargs)
File "/usr/local/lib/python3.7/site-packages/msrest/pipeline/requests.py", line 193, in send
self.driver.send(request.http_request, **kwargs)
File "/usr/local/lib/python3.7/site-packages/msrest/universal_http/requests.py", line 328, in send
return super(RequestsHTTPSender, self).send(request, **requests_kwargs)
File "/usr/local/lib/python3.7/site-packages/msrest/universal_http/requests.py", line 140, in send
raise_with_traceback(ClientRequestError, msg, err)
File "/usr/local/lib/python3.7/site-packages/msrest/exceptions.py", line 51, in raise_with_traceback
raise error.with_traceback(exc_traceback)
File "/usr/local/lib/python3.7/site-packages/msrest/universal_http/requests.py", line 137, in send
**kwargs)
File "/usr/local/lib/python3.7/site-packages/requests/sessions.py", line 508, in request
resp = self.send(prep, **send_kwargs)
File "/usr/local/lib/python3.7/site-packages/requests/sessions.py", line 618, in send
r = adapter.send(request, **kwargs)
File "/usr/local/lib/python3.7/site-packages/requests/adapters.py", line 521, in send
raise ReadTimeout(e, request=request)
msrest.exceptions.ClientRequestError: Error occurred in request., ReadTimeout: HTTPConnectionPool(host='localhost', port=19080): Read timed out. (read timeout=100)
msrest.universal_http.requests : Configuring retry: max_retries=3, backoff_factor=0.8, max_backoff=90
Performing cluster version check
msrest.universal_http : Configuring redirects: allow=True, max=30
msrest.universal_http : Configuring request: timeout=100, verify=True, cert=None
msrest.universal_http : Configuring proxies: ''
msrest.universal_http : Evaluate proxies against ENV settings: True
urllib3.connectionpool : Starting new HTTP connection (1): localhost
urllib3.connectionpool : http://localhost:19080 "GET /$/GetClusterVersion?api-version=6.4&timeout=60 HTTP/1.1" 200 23`

@Christina-Kang
Copy link
Contributor

Hi @jfalameda, sorry to hear that you're running into this issue. To help us debug, can you share the output from running this command with the --debug option? Could you also share the cluster traces and the approximate time at which you ran the provision command? The traces would be in /etc/servicefabric/FabricDataRoot. Thanks!

@jfalameda
Copy link
Author

jfalameda commented Feb 6, 2019

Hi @Christina-Kang,

It seems that the provision is taking few minutes and the timeout for the call to provision is 100.

You can download the tracing from here:

https://drive.google.com/file/d/17cAPVzd6v3v7PULhgxyN2_h4SSGRIMmL/view?usp=sharing

Kind Regards,
/ José.

@Christina-Kang
Copy link
Contributor

Hi @jfalameda,

Thanks for your patience! It looks like the actual timeout duration is not the issue here, but rather, an issue with permissions, which holds up the operation until it eventually times out.

Issue is with certain folders not being ACLed correctly. From your traces, it looks like the source is this:
/home/ClusterDeployer/ClusterData/Data/N0030/Fabric/work/Applications/__FabricSystem_App4294967295/work/Store and the destination is this: /home/ClusterDeployer/ClusterData/Data/N0030/Fabric/work/ImageBuilderProxy/

There are also paths under different "nodes", represented in this path by "N0010" instead of the "N0030" in the previous path: /home/ClusterDeployer/ClusterData/Data/N0010/Fabric/work/Applications/__FabricSystem_App4294967295/work/Store. The same is for the destination path.

Can you check all these paths (N0010, N0020, N0030) to make sure that users have the correct permissions, and if not, chmod everything under those paths to give user permissions to rwx?

You should be able to retry upload after this is done. You should not need to, but if the above doesn't work, try restarting fabric and seeing if that plus the permission changes allows provision to continue. Please let me know if this mitigates the issue for you.

This issue happens rarely, so if you were to have a cluster in azure, or on a different dev machine, this issue most likely won't reoccur.

Hope this helps!
Christina

@jfalameda
Copy link
Author

jfalameda commented Feb 7, 2019

Hi @Christina-Kang ,

I can see the application provisioning now. I still get the timeout error as the application takes a lot of time to provision.

It is weird that I need to fix permission, as I am using the docker based service fabric.

Thanks.

Kind Regards,
/ José.

@Christina-Kang
Copy link
Contributor

Hi @jfalameda,

Is it timing out with the same error / stack as posted above still? Where the timeout is 100 for read timeouts, rather than the passed in time out.

Thanks!

@jfalameda
Copy link
Author

jfalameda commented Feb 14, 2019

Hi @Christina-Kang ,

Yes, the application times out the same way. I can see on SF that the application is still provisioning. The application disappears when the provisioning is done without any error, therefore I cannot publish it.

PS: After checking the checksums it shows "unprovisioning." then it disappears from the applications list.

Kind Regards,
/ José.

@jfalameda
Copy link
Author

jfalameda commented Feb 22, 2019

Hi @Christina-Kang

Unfortunately 7.1.0 didn't resolve this issue. The timeout is not about uploading the application, but to provision it. The application takes few minutes to provision, while sfctl timeouts before it has completed. By the time it is 100% provisioned, the application unprovision itself automatically.

This is the error I get on the client, I can see the applicaiton is still provisioning on the service fabric explorer.

Error occurred in request., ReadTimeout: HTTPConnectionPool(host='localhost', port=19080): Read timed out. (read timeout=100)
Traceback (most recent call last):
  File "/home/jose/.local/lib/python2.7/site-packages/knack/cli.py", line 206, in invoke
    cmd_result = self.invocation.execute(args)
  File "/home/jose/.local/lib/python2.7/site-packages/sfctl/entry.py", line 81, in execute
    return super(SFInvoker, self).execute(args)
  File "/home/jose/.local/lib/python2.7/site-packages/knack/invocation.py", line 188, in execute
    cmd_result = parsed_args.func(params)
  File "/home/jose/.local/lib/python2.7/site-packages/knack/commands.py", line 105, in __call__
    return self.handler(*args, **kwargs)
  File "/home/jose/.local/lib/python2.7/site-packages/knack/commands.py", line 212, in _command_handler
    result = op(client, **command_args) if client else op(**command_args)
  File "/home/jose/.local/lib/python2.7/site-packages/sfctl/custom_app_type.py", line 117, in provision_application_type
    request, header_parameters, body_content_sorted)
  File "/home/jose/.local/lib/python2.7/site-packages/msrest/service_client.py", line 336, in send
    pipeline_response = self.config.pipeline.run(request, **kwargs)
  File "/home/jose/.local/lib/python2.7/site-packages/msrest/pipeline/__init__.py", line 197, in run
    return first_node.send(pipeline_request, **kwargs)  # type: ignore
  File "/home/jose/.local/lib/python2.7/site-packages/msrest/pipeline/__init__.py", line 150, in send
    response = self.next.send(request, **kwargs)
  File "/home/jose/.local/lib/python2.7/site-packages/msrest/pipeline/requests.py", line 72, in send
    return self.next.send(request, **kwargs)

Kind Regards,
/ Jose.

@Christina-Kang
Copy link
Contributor

Christina-Kang commented Feb 22, 2019

@jfalameda While the timeout may occur on the client side (this is a bug on our part, that the timeout can occur before the cluster times out), the cluster should continue its work with provisioning. There should be no unprovision happening, even on error, from the cluster side. If you are running the upload / provision / create process as part of a script, this may be causing the unprovisioning to happen. If this is not the case, are you available for a call with us so we can debug this faster? You can contact me at bikang@microsoft.com

Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants