
CloudSim - limit 1 running simulation per team #644

Open
m3d opened this issue Oct 4, 2020 · 11 comments
Labels
enhancement New feature or request

Comments

@m3d
Contributor

m3d commented Oct 4, 2020

Hello,
a few days ago @angelacmaio announced on https://community.subtchallenge.com/t/practice-for-cave-circuit-virtual-competition/1449 that the limit on simultaneous submissions would be reduced from 10 to 3:

Teams can continue submitting solutions against the practice scenarios. To ensure cloud machine availability for all teams, each team can submit a maximum of 3 simultaneous practice runs on Cloudsim. Once the limit is reached, teams will not be able to submit additional runs until at least one of the 3 runs has finished. During peak usage, submissions may also display a “queued” status until machines become available.

At the moment we have two simple simulations (with a small set of robots) that have been pending for more than 24 hours, so I would suggest limiting the number of concurrent runs per team to only one, so that the system still provides feedback in a reasonable time. You can vote 👍 or 👎

thanks
Martin/Robotika Team

@wolfgangschwab

We agree that it makes sense to limit the allowed parallel runs to 1, if this helps to avoid the Pending status.

We have a submitted solution run that has been in the Pending status for 16 hours now, and we have no other simulation running.

@knoedler
Contributor

knoedler commented Oct 4, 2020

I am in a similar situation: one run pending for 36 hours, one pending for 12 hours, and none currently running. I don't think the issue is specifically the number of simultaneous runs; rather, the system seems to get many of its resources into a state that nobody can use. It recovers after a reset, but with more people using the system it reaches that bad state much faster. Limiting it to one simultaneous run might keep the resources out of the bad state for longer.

@malcolmst

Agreed, I am also seeing runs stuck in Pending for a very long time. Whatever the issue is, the ability to get results ASAP is more important to me than the number of concurrent simulations.

On a slightly related note though, as the maximum number of concurrent simulations is decreased, it would be really helpful to be able to cancel a running job :).

@peci1
Contributor

peci1 commented Oct 4, 2020

Or, an even better solution if budget allows - buy more compute on AWS. That would allow teams to have 3 simultaneous simulations and reasonable finish times. I think the fact that the whole setup is cloud-based should make it simple to add more compute...

@malcolmst

Good point! I’ve been running my own simulator in AWS for a while, based on the cloudsim containers (I wish I could share it with other teams, but I don’t know how I could provide enough privacy at the moment, and it also still has its own set of issues :)), and it costs somewhere around $1/hr/robot. I expect that’s about the same for the SubT simulator. Not crazy expensive, but it does add up depending on the available budget.
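To put that in rough numbers (assuming the ~$1/hr/robot figure above plus one extra instance for the simulator, which is my own assumption): a 5-robot run uses 6 instances, so roughly 6 × $1/hr ≈ $6 per hour, or about $144 for a full day of continuous practice.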

@zwn

zwn commented Oct 5, 2020

I was thinking about reviving #354... A simulation of ours seems to be running, but we cannot connect to it to see what is going on. The message I get is "Connection failed. Please contact an administrator." The exact error is:

```
GET | wss://cloudsim-ws.ignitionrobotics.org/simulations/84ffbecd-eaf5-44e1-8769-d2cd7a77c2f2-r-1
503 Service Unavailable
```

Started at 2020-10-04T17:59:05.599Z.
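For reference, a minimal sketch of how that handshake failure can be checked from outside the portal. It uses the gorilla/websocket package, which is an assumption on my part, and omits any authentication the real endpoint presumably requires:

```go
// ws_check.go - probe a CloudSim websocket endpoint and print the HTTP status.
// Illustrative only: the real endpoint likely requires auth headers/tokens.
package main

import (
	"fmt"
	"log"

	"github.com/gorilla/websocket"
)

func main() {
	url := "wss://cloudsim-ws.ignitionrobotics.org/simulations/84ffbecd-eaf5-44e1-8769-d2cd7a77c2f2-r-1"
	conn, resp, err := websocket.DefaultDialer.Dial(url, nil)
	if err != nil {
		if resp != nil {
			// A 503 here matches the "Service Unavailable" seen in the portal.
			fmt.Println("handshake failed with HTTP status:", resp.Status)
		}
		log.Fatal(err)
	}
	defer conn.Close()
	fmt.Println("websocket connection established")
}
```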

If I may summarize what I have learned in this thread: all teams see Pending times in tens of hours and none of them have anything running - except us, where something seems to be running but we cannot connect to it, and the simulations that finished recently ended abruptly, probably due to a crash (see #631 (comment)).

Why does it break each time before the circuit deadline? It was kind of expected for tunnel, somewhat expected for urban but definitely not expected for cave since that is the last try before the finals. Is the load so much higher in these times? Is nobody running anything during the year except us? I am confused.

@angelacmaio
Collaborator

When Cloudsim is unable to procure enough instances for a submission (1 per robot + 1 for the simulator), submissions will display the Pending status. The limit of 3 simultaneous runs per team was put in place to spread available capacity across the 17 teams.

AWS has many users, and external spikes in usage can result in fewer total available instances for Cloudsim (likely what occurred over the weekend). We are in contact with AWS about instance availability and will only reduce the limit of simultaneous runs per team further if it is necessary based on the usual expected machine availability. We do not wish to further restrict practice runs around-the-clock based on infrequent dips in availability.

All Pending runs were able to spin up after more instances were freed up later in the weekend and have now terminated, so please check the status of your submissions on the SubT Virtual Portal.
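For readers unfamiliar with the mechanics, the rule described above (one instance per robot plus one for the simulator, with Pending meaning the instances could not yet be procured) amounts to a simple capacity check. A minimal sketch in Go, purely illustrative and not the actual cloudsim code:

```go
// Simplified illustration of the capacity rule described above; not the real cloudsim logic.
package main

import "fmt"

// instancesNeeded returns how many EC2 instances a submission requires:
// one per robot plus one for the simulator itself.
func instancesNeeded(robots int) int {
	return robots + 1
}

// status reports whether a submission can launch or must wait as Pending.
func status(robots, availableInstances int) string {
	if instancesNeeded(robots) <= availableInstances {
		return "Launching"
	}
	return "Pending" // wait until enough instances are free
}

func main() {
	fmt.Println(status(5, 6)) // Launching: 5 robots + 1 simulator == 6 instances
	fmt.Println(status(5, 4)) // Pending: not enough free instances
}
```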

@malcolmst

FWIW, while I don’t know specifically what limits AWS has per user, I was able to spin up a simulation with 6 instances (5 robots + simulator) in my private simulator, shortly after hitting the pending state in the subt simulator using a similar number of robots. That was with g3.4xlarge instances in the us-east-1 region. Those pending simulations did eventually run, but it wasn’t until many hours later.
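For anyone who wants to repeat that kind of availability check themselves, here is a hedged sketch using the AWS SDK for Go. The AMI ID is a placeholder, and a real test should terminate the instance it launches:

```go
// Try to launch a single g3.4xlarge in us-east-1 and report whether AWS has capacity.
// Sketch only: ami-xxxxxxxx is a placeholder, and a real check should clean up after itself.
package main

import (
	"fmt"
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/awserr"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/ec2"
)

func main() {
	sess := session.Must(session.NewSession(&aws.Config{Region: aws.String("us-east-1")}))
	svc := ec2.New(sess)

	_, err := svc.RunInstances(&ec2.RunInstancesInput{
		ImageId:      aws.String("ami-xxxxxxxx"), // placeholder AMI
		InstanceType: aws.String("g3.4xlarge"),
		MinCount:     aws.Int64(1),
		MaxCount:     aws.Int64(1),
	})
	if aerr, ok := err.(awserr.Error); ok && aerr.Code() == "InsufficientInstanceCapacity" {
		fmt.Println("AWS is out of g3.4xlarge capacity in this region right now")
		return
	}
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("instance launched - capacity is available (remember to terminate it)")
}
```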

@zwn

zwn commented Oct 5, 2020

All Pending runs were able to spin up after more instances were freed up later in the weekend and have now terminated

That is not correct. The instance 84ffbecd-eaf5-44e1-8769-d2cd7a77c2f2-r-1 still shows as Running, even though it crashed a long time ago.

AWS has many users, and external spikes in usage can result in fewer total available instances for Cloudsim

This only ever happens when a circuit deadline is approaching (it has not happened since urban), so our guess is that it is related to SubT activities and not to external AWS usage - hence the suggestion to further limit the number of concurrent simulations per team. Can you publish stats about SubT usage of AWS? It would be interesting to see a comparison of now vs. a month ago. It would also be interesting to know the size of the AWS pool - how many machines are there?

@malcolmst

malcolmst commented Oct 6, 2020

This only ever happens when a circuit deadline is approaching (it has not happened since urban)

Agreed, this does seem to happen consistently when simulator usage is higher before a deadline. Also, as mentioned in my previous post, I saw no evidence of an AWS g3 instance shortage over the weekend.

Taking a look at the cloudsim web code, there is a pool of threads responsible for starting up new simulations:

https://gitlab.com/ignitionrobotics/web/cloudsim/-/blob/master/simulations/sim_service.go

    The Simulations Service is in charge of launching and terminating Gazebo simulations. And,
    in case of an error, it is responsible of rolling back the failed operation.
    To do this and handle some concurrency without exhausting the host, it has
    one worker-thread-pool for each main activity (launch, terminate, error handling).
    The `launch` and `terminate` pools can launch 10 concurrent workers (eg. the launcher can
    launch 10 simulations in parallel). The error handler pool only has one worker.

Is it possible the threads in the launch pool, or one of the other pools, all got in a bad state (crashed or hung), preventing new simulations from starting successfully? Or maybe another internal limit got hit due to the crashed simulations from #631?
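To make that speculation concrete, here is a minimal worker-pool sketch in plain Go (not the actual cloudsim implementation): if every worker in a fixed-size launch pool blocks on a launch that never completes, queued submissions are never picked up even though nothing appears to be running.

```go
// Minimal worker-pool sketch: a fixed number of launch workers drain a job queue.
// If all workers block inside launch() (e.g. waiting on instances that never come up),
// queued submissions sit untouched - the "Pending forever" symptom discussed above.
package main

import (
	"fmt"
	"sync"
	"time"
)

const launchWorkers = 10 // mirrors the 10 concurrent workers mentioned in sim_service.go

func launch(sim string) {
	// Placeholder for the real work: provisioning instances, starting Gazebo, etc.
	time.Sleep(100 * time.Millisecond)
	fmt.Println("launched", sim)
}

func main() {
	jobs := make(chan string, 100)
	var wg sync.WaitGroup

	for i := 0; i < launchWorkers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for sim := range jobs {
				launch(sim) // a hung launch here ties up this worker permanently
			}
		}()
	}

	for i := 1; i <= 5; i++ {
		jobs <- fmt.Sprintf("simulation-%d", i)
	}
	close(jobs)
	wg.Wait()
}
```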

[Edit] There is an internal limit of running EC2 instances defined here:
https://gitlab.com/ignitionrobotics/web/cloudsim/-/blob/master/simulations/ec2_machines.go
(See AvailableEC2Machines).
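If AvailableEC2Machines is effectively a counter of machines the service still believes it can start, then crashed simulations that are never cleaned up would never return their machines, and the internal counter alone could block new launches. A hedged sketch of that failure mode, illustrative only and not the real implementation:

```go
// Illustration of how a machine-count limit can be exhausted by runs that crash
// without releasing their machines. Not the real AvailableEC2Machines logic.
package main

import "fmt"

type machinePool struct {
	limit int
	inUse int
}

func (p *machinePool) acquire(n int) bool {
	if p.inUse+n > p.limit {
		return false // looks like "no capacity" even if AWS itself has instances
	}
	p.inUse += n
	return true
}

func (p *machinePool) release(n int) { p.inUse -= n }

func main() {
	pool := &machinePool{limit: 20}
	pool.acquire(6) // run A: 5 robots + simulator
	pool.acquire(6) // run B
	pool.acquire(6) // run C
	// Runs A-C crash but are never marked terminated, so release() is never called.
	fmt.Println("new 6-machine run accepted:", pool.acquire(6)) // false: pool looks full
}
```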

@m3d
Contributor Author

m3d commented Oct 6, 2020

P.S. I probably misunderstood the limit. Until now I expected that you could put several simulations into the queue and only the given limit would be processed in parallel, but at the moment you cannot add a new simulation to the queue at all (error 5506 - Simultaneous simulations limit reached).
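The two behaviours being contrasted here - rejecting a submission outright at the limit versus queueing it and only capping how many run in parallel - can be sketched with a semaphore-style buffered channel in Go. Only error code 5506 is real portal behaviour; everything else below is an assumption:

```go
// Two ways to enforce "at most N simultaneous simulations per team":
// (a) reject new submissions at the limit (current portal behaviour, error 5506),
// (b) accept them into a queue and start them as slots free up (what was expected).
package main

import (
	"errors"
	"fmt"
)

const maxSimultaneous = 3

var running = make(chan struct{}, maxSimultaneous) // semaphore of running slots

// submitReject mirrors behaviour (a): fail immediately when the limit is reached.
func submitReject(sim string) error {
	select {
	case running <- struct{}{}:
		fmt.Println("started", sim)
		return nil
	default:
		return errors.New("5506 - Simultaneous simulations limit reached")
	}
}

// submitQueue mirrors behaviour (b): wait for a free slot, then start.
func submitQueue(sim string) {
	running <- struct{}{} // blocks in line instead of failing
	fmt.Println("started", sim)
}

// release frees one running slot, as if a simulation finished.
func release() { <-running }

func main() {
	for i := 1; i <= 4; i++ {
		if err := submitReject(fmt.Sprintf("run-%d", i)); err != nil {
			fmt.Println("run", i, "rejected:", err)
		}
	}
	release()             // one of the running simulations finishes
	submitQueue("run-4b") // a queued submission can now start
}
```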
