Batch automation of ARIAC trials + user system #6
Conversation
@@ -20,20 +20,6 @@ usage()
exit 1
}

# Wait until the /ariac/competition_state returns "data: done".
@caguero I modified the script from https://bitbucket.org/osrf/ariac/pull-requests/58 to use the ARIAC_EXIT_ON_COMPLETION var instead of checking for 'done'. This should bring the simulation down when the trial is complete, when time is up, or when the user calls /ariac/end_competition.
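For context, a minimal sketch of the resulting flow (the container and image names here are illustrative, not the repo's actual ones):

# With ARIAC_EXIT_ON_COMPLETION set, gzserver shuts itself down when the
# competition reaches a terminal state, so no polling loop is needed.
docker run -d -e ARIAC_EXIT_ON_COMPLETION=1 --name ariac-server-container ariac-server
# Block until the server container exits (trial done, timeout, or end_competition).
docker wait ariac-server-container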
run_trial.bash
Outdated
# Make the log playable
#TODO: figure out why the ownership of the state log is root and resolve.
echo "Changing ownership of gazebo state log"
sudo chown $USER ${HOST_LOG_DIR}/gazebo/state.log
@j-rivero, I noticed that the gazebo state log has root ownership. The log is generated in the container and then copied into /ariac/logs on the container in this script, which is mounted to logs/{team_name}/{trial_name} on the host. Do you have any idea what might cause this?
This workaround is necessary to make the log playable on the host. If we can avoid it, I believe the script won't need sudo.
Are we running gazebo as root in the container? AFAIK there is no other change to the permissions of the state log inside gazebo.
Gazebo gets launched from a command passed through this line in the entrypoint. I see now that it looks to be run as the root user:

USER root
I changed the entrypoint to be run as the ariac user and now it's all good! f011bac
Oops, I got lucky on my machine with the userid. Added 2c3a870 for portability (thanks @dirk-thomas)
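For reference, the shape of the fix is roughly this (a sketch only; the build-arg and user names are assumptions, not necessarily what f011bac/2c3a870 use):

# Create an unprivileged user whose UID matches the host user, so files
# written to the mounted log directory end up owned by the host user.
ARG USERID=1000
RUN useradd --uid ${USERID} --create-home ariac
USER ariac

with the host UID passed in at build time, e.g. docker build --build-arg USERID=$(id -u) .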
@caguero: I added the example code back into the repository. I added some TODOs in aa6899f. I think we spoke about most of them in person. The most important one is aa6899f#diff-92694b972d13716f9b231735929abe9bR6 (the example code has to sleep for 10s at the moment because the competitor container is started first). We still don't support kinetic at the moment, but we don't know how many users are using kinetic to know how big of a priority it should be. I added 8ed5087, which should disable the nvidia support, but I can't test it because the airport internet won't let me pull docker images. Please revert it if it doesn't work!
@j-rivero would you be able to give this a look if you get a chance please? There are some known to-dos that I've noted, but you might be able to spot some other issues since you have more experience with Docker than @caguero and I. I forgot to document that if you want to stop the competition manually, as opposed to waiting for it to time out (10mins by default!), you can run
I had to remove the version number while installing
@@ -2,8 +2,22 @@

DOCKER_ARGS=""
# Uncomment this line to rebuild without cache
# TODO: expose this as an argument
What do we want to expose as an argument here?
It doesn't show up well in the diffs, sorry: the no-cache option.
I'm using the no-cache option in my upcoming kinetic_version branch; let's handle it from there.
OK, we'll just have to do it for the competitor script as well as the server script.
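A sketch of one way to expose it (the flag handling here is an assumption, not what the kinetic_version branch actually does):

DOCKER_ARGS=""
if [ "$1" = "--no-cache" ]; then
  # Rebuild the image from scratch instead of reusing cached layers.
  DOCKER_ARGS="--no-cache"
fi
docker build ${DOCKER_ARGS} -t ariac-server .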
run_all_trials.bash
Outdated
TEAM_NAME=$1

#TODO: get list of trials automatically
Where is the list of trials, or how can we get it?
From ls comp_configs minus the extensions. Each of the config files in the comp_configs directory is used for a different trial (once).
Oh, it was pretty obvious, sorry for the question. Let's try a patch to fix all three comments/problems we have in this file:
diff --git a/run_all_trials.bash b/run_all_trials.bash
index 3f0822e..7fd7b7c 100755
--- a/run_all_trials.bash
+++ b/run_all_trials.bash
@@ -1,10 +1,24 @@
#!/usr/bin/env bash
+DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )"
+
TEAM_NAME=$1
+COMP_CONFIGS_DIR=${DIR}/comp_configs/
+
+get_list_of_trials()
+{
+ yaml_files=$(ls ${COMP_CONFIGS_DIR}/*.yaml)
+
+ for f in $(ls ${COMP_CONFIGS_DIR}/*.yaml); do
+ f=${f##*/}
+ f=${f//.yaml}
+ all_names="${all_names} ${f}"
+ done
+
+ echo $all_names
+}
-#TODO: get list of trials automatically
-declare -a TRIAL_NAMES=("example_trial1" "example_trial2")
-for TRIAL_NAME in ${TRIAL_NAMES[@]}; do
+for TRIAL_NAME in $(get_list_of_trials); do
echo "Running trial: ${TRIAL_NAME}"
#TODO: GET THIS TO RUN THE INDIVIDUAL TRIALS CORRECTLY.
./run_trial.bash ${TEAM_NAME} ${TRIAL_NAME}
thanks, I'll try this!
added in aa62c7e thanks
run_all_trials.bash
Outdated
TEAM_NAME=$1

#TODO: get list of trials automatically
declare -a TRIAL_NAMES=("example_trial1" "example_trial2")
We probably don't need a bash array here; it should be enough to use:

TRIAL_NAMES="example_trial1 example_trial2"
for TRIAL_NAME in ${TRIAL_NAMES}; do
...
thank you for your advice here (and elsewhere!) -- indeed that looks like a simpler way to achieve it
run_all_trials.bash
Outdated
declare -a TRIAL_NAMES=("example_trial1" "example_trial2")
for TRIAL_NAME in ${TRIAL_NAMES[@]}; do
echo "Running trial: ${TRIAL_NAME}"
#TODO: GET THIS TO RUN THE INDIVIDUAL TRIALS CORRECTLY.
What is the problem here?
When the run_trial.bash script gets called from here, inside the run_trial.bash script the TRIAL_NAME variable sometimes has the correct value, but sometimes it is empty (whereas if you call the script from the terminal, the variable is always correct). I didn't have heaps of time to investigate it thoroughly yet; maybe your suggestion to not use bash arrays is enough to fix it.
Let's try the patch I've posted above and see if it is enough or whether we need to debug deeper.
In the end, quoting the parameters fixed it: 2b1a20b
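For anyone following along, the fix presumably amounts to quoting the expansions at the call site, roughly (a sketch; 2b1a20b may differ in detail):

# Unquoted, an empty or whitespace-containing value can vanish or be word-split:
./run_trial.bash ${TEAM_NAME} ${TRIAL_NAME}
# Quoted, each variable is passed as exactly one argument:
./run_trial.bash "${TEAM_NAME}" "${TRIAL_NAME}"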
# Start the competitors container and let it run in the background.
#TODO: parameterize the container name
./ariac-competitor/run_competitor_container.bash &
This background call to the run_competitor_container script is going to be hard to check and respond to on errors. For example, if I forget to generate the team container and then run the script, it keeps going even if this call fails completely, which ends up in other errors that are more difficult to debug. We can work around this with some checks beforehand.
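Something like the following could guard the call (the image name here is hypothetical):

# Fail fast if the competitor image hasn't been built yet.
if ! docker inspect --type=image "ariac-competitor-${TEAM_NAME}" > /dev/null 2>&1; then
  echo "Competitor image not found; build it before running trials." >&2
  exit 1
fi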
thank you for the explanation of why this could be an issue, I wasn't aware
I don't have a good suggestion to improve the situation right now, but let's keep this in mind.
I have encountered a different problem because of this: if the server doesn't start correctly (next line), the script exits and the competitor container is left running in the background. I added a script for users to kill the containers in 019b15c, which will at least let people handle the situation.
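(The kill script in 019b15c presumably boils down to something like this; the container names are assumptions:)

# Stop any leftover trial containers so a failed run can be retried cleanly.
docker kill ariac-server-container ariac-competitor-container 2> /dev/null || true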
# Start the competitors container and let it run in the background.
#TODO: parameterize the container name
./ariac-competitor/run_competitor_container.bash &
Would it make more sense to launch the ariac server first and the competitor code after it?
This part of the script is where I am most looking for feedback. It definitely would be preferable to have the competitor code start after the ariac server, so that we don't need to do waiting like this in the competitor code.
The reason the competitor code is launched first is for the flow of the script. I launch the competitor code in the background first, and it may run indefinitely. The server, however, will only run for the duration of the trial (limited by a timeout): I launch it in the foreground, it is blocking, and when it ends, we know the trial is finished, so we kill the competitor container.
I am "utilising" the blocking nature of the server here, but it is not ideal; it was just the easiest way I knew how to do it! If you could help me launch the competitor code after the server (and then kill it once the trial has ended), I would like to learn from how you do it.
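In other words, the control flow of run_trial.bash is roughly the following (the server script path and container name are simplified placeholders):

# 1. Competitor container runs in the background, possibly indefinitely.
./ariac-competitor/run_competitor_container.bash &

# 2. Server container runs in the foreground and blocks; it exits when the
#    trial completes, times out, or /ariac/end_competition is called.
./ariac-server/run_container.bash

# 3. The trial is over, so tear down the competitor container.
docker kill ariac-competitor-container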
Your approach makes sense, although I think we are abusing the nature of the system a little bit instead of creating a proper sync mechanism. A possible solution could be to implement a service in the ariac server that triggers the beginning of the trial. This service could be called from the user code (as soon as the team considers the code ready) or by a timeout in our system (let's say, give X minutes to start up).
It is hard to automatically know when all the pieces are ready using ROS1.
Yeah, it's a bit complicated: the teams do have to call a service etc., but I believe this issue has more to do with the ROS master not being available at the time. If the team's container starts before the server (which is the ROS master), when the ARIAC simulation starts it gives the following error:
Loaded ariac/arm
Started ['ariac/arm'] successfully
Traceback (most recent call last):
File "/opt/ros/indigo/lib/controller_manager/controller_manager", line 63, in <module>
controller_manager_interface.load_controller(c)
File "/opt/ros/indigo/lib/python2.7/dist-packages/controller_manager/controller_manager_interface.py", line 66, in load_controller
resp = s.call(LoadControllerRequest(name))
File "/opt/ros/indigo/lib/python2.7/dist-packages/rospy/impl/tcpros_service.py", line 525, in call
raise ServiceException("transport error completing service call: %s"%(str(e)))
rospy.service.ServiceException: transport error completing service call: unable to receive data from sender, check sender's logs for details
I'll see if moving the ROS master around helps...
For the record, I tried moving it to the competitor container and to a third container, but it still gives that error when the server connects to the master. I left the server container as the ROS master and moved the waiting to a workaround script in 64b10ed
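The workaround presumably amounts to blocking until the master answers before launching the competitor code, along these lines (a sketch; 64b10ed may differ):

# rostopic list fails while the ROS master is unreachable, so loop on it.
until rostopic list > /dev/null 2>&1; do
  echo "Waiting for the ROS master..."
  sleep 1
done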
Most of my comments are doubts or minor things. I would like to submit some changes and the implementation of the nvidia-docker detection. Feel free to merge and I will submit them against master, or I can create the PR against this branch, whatever you prefer. Important things still to do (on my list):
Update: these two should not be hard to implement.
See #2 (comment)
Avoids the following error when starting gazebo on my machine:

libGL error: failed to load driver: swrast
X Error of failed request: BadValue (integer parameter out of range for operation)
Major opcode of failed request: 154 (GLX)
Minor opcode of failed request: 3 (X_GLXCreateContext)
Value in failed request: 0x0
Serial number of failed request: 30
Current serial number in output stream: 31
Thanks @j-rivero for your help with this PR and #2. I've managed to address most issues now; the main one left is #6 (comment). I'll make a follow-up PR.
For now we have disabled GPU use completely in this PR so that we can get it out the door, since users can still replay the logs post-hoc. It's just considered a "nice-to-have": we're happy to drop support for it if it's complicated to implement. No need for intel support.
Awesome! Testing now
Otherwise copying ARIAC logs to the mounted directory is not guaranteed to have correct permissions
Addresses #3
This adds the run_trial.bash script that can be used to bring up the server and competitor containers and run a given trial config file. It also shuts the two containers down automatically, provided that the competitor's system gets the competition into a 'done' state (by completing the trial, making a call to the /ariac/end_competition service (giving up), or time running out). There's also the start of a script to call run_trial.bash with the different trial names, but something's not right with it at the moment.