I am trying to debug a situation where I am running into sporadic registration timeouts on the engine. In this Gist I have included the relevant config files and sample output: https://gist.github.com/jakirkham/b0452178331db511dd0d. All other config files are simply the result of running ipython profile create --parallel --profile=sge.
To provide more information: this is on a CentOS 6.6 VM on a single machine, so there is no need to worry about network accessibility between the jobs. The queue has 7 jobs in it in this case and has been configured to limit the number of running jobs to the number of accessible cores. However, I have run into the same problem with fewer than 7 running jobs as well. All of the jobs are able to start and run successfully. I don't believe this is a resource issue, as I have repeatedly run heavy-duty machine learning algorithms in the VM without error.
As the failure is sporadic, I am wondering whether timing differences between the engines communicating with the controller could be causing the problem. For example, all the engines might slam the controller at the same time, leaving it unable to respond before the timeout expires. Unfortunately, I have had trouble finding parameters that would introduce delays between engine queries or similar to test this hypothesis; the closest settings I have found are sketched below.
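For reference, a minimal sketch of those settings (the option names come from the generated IPython.parallel config files; the values here are illustrative guesses, not tuned recommendations):

    # --- ipengine_config.py (in the profile directory) ---
    c = get_config()

    # Seconds an engine waits for the controller to answer its registration
    # request before giving up; raising it tolerates a slow or busy controller.
    c.EngineFactory.timeout = 30.0

    # Seconds to wait for ipcontroller-engine.json to appear, for launchers
    # (like SGE) where the controller and engines start at about the same time.
    c.IPEngineApp.wait_for_url_file = 30.0

    # --- ipcluster_config.py ---
    # Seconds between starting the controller and starting the engines,
    # giving the controller time to write fresh connection files.
    c.IPClusterStart.delay = 10.0

As far as I can tell there is no built-in per-engine stagger, so if the thundering-herd theory holds, spacing out the qsub submissions themselves may be the simplest test.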
Any pointers would be appreciated.
Sorry for not responding in a reasonable amount of time!
I'm going to close this as old and stale, but my hunch based on the logs is that a stale ipcontroller-engine.json file from a previous run was never cleaned up (note that the engine logs show the connection file being loaded before the controller starts, though clock drift could also be at play). These files should normally be cleaned up, but if stale files are left over, an engine can load an old file before the new, valid connection file is put in its place; a quick way to rule this out is sketched below.
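A minimal pre-launch cleanup sketch, assuming the default profile layout (a profile_sge security directory under ~/.ipython; adjust the path if your IPYTHONDIR differs). Run it before ipcluster start so a freshly launched engine cannot pick up leftover files:

    import glob
    import os

    # Assumed default location of the SGE profile's connection files.
    security_dir = os.path.expanduser("~/.ipython/profile_sge/security")

    # Remove leftover controller connection files from previous runs so a
    # newly launched engine cannot register against a stale controller.
    for path in glob.glob(os.path.join(security_dir, "ipcontroller-*.json")):
        print("removing stale connection file:", path)
        os.remove(path)

(Note that ipcontroller --reuse deliberately keeps connection files across runs, so avoid that flag while testing this.)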
Moved from here: ipython/ipython#8569.