Is cloudsim totally broken? #354

Closed
osrf-migration opened this issue Mar 12, 2020 · 10 comments
Labels: bug (Something isn't working), major

Comments

@osrf-migration

Original report (archived issue) by Zbyněk Winkler (Bitbucket: Zbyněk Winkler (robotika)).


Yesterday we wanted to test the new logging that has been enabled on the bridge container, to maybe get closer to solving #261. We started 10 simulations with 1 robot each. Some of them (after 12 hours) are still in the Pending state. None of them is currently showing any logs (neither real-time nor for download). At least one of them is considered finished, because I was able to start yet another simulation. At least one of them was running yesterday, as I could see the real-time log saying that the battery had been running for 27 minutes already.

The solution images work locally without problems.

There is no information available to us that we could act upon. From here it looks as if cloudsim is simply broken 😟. The logs especially need to be available when something goes wrong, not be the first victim of any glitch. So, is the problem on our side? What is the problem?

@osrf-migration
Author

Original comment by Zbyněk Winkler (Bitbucket: Zbyněk Winkler (robotika)).


  • changed title from "Is cloudsim totaly broken?" to "Is cloudsim totally broken?"

@osrf-migration
Author

Original comment by Zbyněk Winkler (Bitbucket: Zbyněk Winkler (robotika)).


So, 22 hours later, nothing useful has come out of the 11 simulations (a day and a half after starting them). Some are still pending, some are showing errors, but not a single log can be downloaded.

I find that strange, since each simulation needs only 2 EC2 instances. Even if they were run sequentially on a single pair of instances, everything should have finished after 11 hours. So a shortage of EC2 instances should not be the problem here. I still have no idea what the problem is.
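
For reference, here is a minimal sketch of the back-of-envelope math above; the one-hour-per-run figure is an assumption implied by the 11-hour estimate, not a measured number.

```python
# Back-of-envelope capacity check (assumed numbers: 11 runs, 2 EC2 instances
# per run, roughly 1 hour per run -- the last figure is an assumption).
simulations = 11
instances_per_run = 2
hours_per_run = 1

# Worst case: reuse a single pair of instances and run everything sequentially.
sequential_hours = simulations * hours_per_run
print(f"{simulations} runs, strictly sequential on {instances_per_run} "
      f"instances: ~{sequential_hours} hours total")
```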

@osrf-migration
Author

Original comment by Sophisticated Engineering (Bitbucket: sopheng).


I started one test run 3 days ago. It shows:

DeletingPods

Error: ServerRestart

And the one I started yesterday is still in Pending :-(

@osrf-migration
Author

Original comment by Sarah Kitchen (Bitbucket: snkitche).


I tried starting two runs last night to check this and am having the same issue. Status switches between Pending and Launching Nodes. AWS status reporting normal: https://status.aws.amazon.com/

If SubT is monitoring the status of the instances being used, and if resources are being constrained by usage unrelated to SubT in a way that the AWS status-monitoring features could detect, it would be nice to have a summary of that status information displayed on the SubT portal.
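
As a rough illustration of that suggestion (not how cloudsim actually works), a portal-side check could query EC2 instance status through boto3; the helper name and instance ID below are hypothetical.

```python
# Hypothetical sketch: summarize EC2 instance and system status for display on
# a portal. Assumes AWS credentials are configured and the instance IDs are known.
import boto3


def summarize_instance_status(instance_ids):
    ec2 = boto3.client("ec2")
    resp = ec2.describe_instance_status(
        InstanceIds=instance_ids, IncludeAllInstances=True)
    summary = {}
    for st in resp["InstanceStatuses"]:
        summary[st["InstanceId"]] = {
            "state": st["InstanceState"]["Name"],           # e.g. pending, running
            "system_status": st["SystemStatus"]["Status"],  # AWS-side health check
            "instance_status": st["InstanceStatus"]["Status"],
        }
    return summary


if __name__ == "__main__":
    # Placeholder instance ID; real IDs would come from cloudsim itself.
    print(summarize_instance_status(["i-0123456789abcdef0"]))
```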

@osrf-migration
Author

Original comment by Malcolm Stagg (Bitbucket: malcolmst7).


I’ve been seeing the same thing this week. My last successful run was Monday at 6pm. Since then, runs take a really long time, eventually fail with a server restart, and no logs are available.

@osrf-migration
Author

Original comment by Nate Koenig (Bitbucket: Nathan Koenig).


  • changed state from "new" to "resolved"

A new ign-transport release was the root cause of this problem. The cloudsim server needed to be rebuilt, which was done last Friday (March 13th). Blocked runs were restarted, and weekend runs completed.

Please re-open if the problem persists.

@osrf-migration
Author

Original comment by Zbyněk Winkler (Bitbucket: Zbyněk Winkler (robotika)).


What was wrong with the ign-transport and what was the fix?

Can the logs be available for download when such a thing happens again?

We still have 3 runs that have not finished. They all seem to be stuck at “LaunchingNodes Error: ServerRestart”. Is that expected?

@osrf-migration
Author

Original comment by Nate Koenig (Bitbucket: Nathan Koenig).


This was the PR that fixed it: https://bitbucket.org/ignitionrobotics/ign-transport/pull-requests/436, addressing this issue: https://bitbucket.org/ignitionrobotics/ign-transport/issues/114.

I had to map your username to a team, which I think is Robotika, so the simulations you are referring to are: 923fd3b0-23ef-48da-9716-502570d68486, 79d36ef4-ae92-4a75-8e6c-a64c4cfe1c9a, and 67012ec5-4878-4252-adf9-8c2957116c46. I'll look at these.

@osrf-migration
Author

Original comment by Nate Koenig (Bitbucket: Nathan Koenig).


Those three simulation instances were affected while launching, so there are no logs. I've restarted them.

@osrf-migration
Author

Original comment by Malcolm Stagg (Bitbucket: malcolmst7).


Just took a quick look at PR 436, and while I’m not very familiar with that codebase, I was curious: if those publish/subscribe messages are now sent over UDP, is packet loss still being accounted for?

BTW, my last couple runs have also either failed with a server restart or gotten stuck before starting. I did have a couple successful runs over the weekend, but they took much longer to complete than they did a week ago.
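
To make the packet-loss question concrete (this is a generic illustration, not what ign-transport actually does internally), a receiver on plain UDP typically needs something like per-publisher sequence numbers just to notice a dropped message:

```python
# Generic illustration of detecting lost UDP datagrams via sequence numbers.
# Hypothetical wire format: each datagram is JSON with a "seq" counter.
import json
import socket


def receive_with_loss_detection(port=12345):
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("0.0.0.0", port))
    expected = 0
    while True:
        data, _addr = sock.recvfrom(65535)
        msg = json.loads(data)
        seq = msg["seq"]
        if seq > expected:
            print(f"lost {seq - expected} message(s) before seq {seq}")
        expected = seq + 1
```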
