Batch automation of ARIAC trials + user system #6
Conversation
@@ -20,20 +20,6 @@ usage()
exit 1
}

# Wait until the /ariac/competition_state returns "data: done".
@caguero I modified the script from https://bitbucket.org/osrf/ariac/pull-requests/58 to use the ARIAC_EXIT_ON_COMPLETION var instead of checking for 'done'. This should bring the simulation down when the trial is complete, when time is up, or when the user calls /ariac/end_competition.
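For context, a minimal sketch of the resulting flow (the container and image names here are illustrative, not the repo's actual ones):

# With ARIAC_EXIT_ON_COMPLETION set, gzserver shuts itself down when the
# competition reaches a terminal state, so no polling loop is needed.
docker run -d -e ARIAC_EXIT_ON_COMPLETION=1 --name ariac-server-container ariac-server
# Block until the server container exits (trial done, timeout, or end_competition).
docker wait ariac-server-container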
run_trial.bash
Outdated
# Make the log playable
#TODO: figure out why the ownership of the state log is root and resolve.
echo "Changing ownership of gazebo state log"
sudo chown $USER ${HOST_LOG_DIR}/gazebo/state.log
@j-rivero, I noticed that the gazebo state log has root ownership. The log is generated in the container and then copied into /ariac/logs on the container in this script, which is mounted to logs/{team_name}/{trial_name} on the host. Do you have any idea what might cause this?
This workaround is necessary to make the log playable on the host. If we can avoid it, I believe the script won't need sudo.
Are we running gazebo as root in the container? AFAIK there is no other change to the permissions of the state log inside gazebo.
Gazebo gets launched from a command passed through this line in the entrypoint. I see now that it looks to be run as the root user:

USER root
I changed the entrypoint to be run as the ariac user and now it's all good! f011bac
Oops, I got lucky on my machine with the userid. Added 2c3a870 for portability (thanks @dirk-thomas)
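For reference, the shape of the fix is roughly this (a sketch only; the build-arg and user names are assumptions, not necessarily what f011bac/2c3a870 use):

# Create an unprivileged user whose UID matches the host user, so files
# written to the mounted log directory end up owned by the host user.
ARG USERID=1000
RUN useradd --uid ${USERID} --create-home ariac
USER ariac

with the host UID passed in at build time, e.g. docker build --build-arg USERID=$(id -u) .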
@caguero: I added the example code back into the repository. I added some TODOs in aa6899f. I think we spoke about most of them in person. The most important one is aa6899f#diff-92694b972d13716f9b231735929abe9bR6 (the example code has to sleep for 10s at the moment because the competitor container is started first). We still don't support kinetic at the moment, but we don't know how many users are using kinetic to know how big of a priority it should be. I added 8ed5087, which should disable the nvidia support, but I can't test it because the airport internet won't let me pull docker images. Please revert it if it doesn't work!
@j-rivero would you be able to give this a look if you get a chance please? There are some known to-dos that I've noted, but you might be able to spot some other issues since you have more experience with Docker than @caguero and I. I forgot to document that if you want to stop the competition manually, as opposed to waiting for it to time out (10mins by default!), you can run
I had to remove the version number while installing
@@ -2,8 +2,22 @@

DOCKER_ARGS=""
# Uncomment this line to rebuild without cache
# TODO: expose this as an argument
What do we want to expose as an argument here?
It doesn't show up well in the diffs, sorry: the no-cache option.
I'm using the no-cache option in my upcoming kinetic_version branch; let's handle it from there.
OK, we'll just have to do it for the competitor script as well as the server script.
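A sketch of one way to expose it (the flag handling here is an assumption, not what the kinetic_version branch actually does):

DOCKER_ARGS=""
if [ "$1" = "--no-cache" ]; then
  # Rebuild the image from scratch instead of reusing cached layers.
  DOCKER_ARGS="--no-cache"
fi
docker build ${DOCKER_ARGS} -t ariac-server .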
run_all_trials.bash
Outdated
TEAM_NAME=$1

#TODO: get list of trials automatically
Where is the list of trials, or how can we get it?
From ls comp_configs minus the extensions. Each of the config files in the comp_configs directory is used for a different trial (once).
Oh, it was pretty obvious, sorry for the question. Let's try a patch to fix all three comments/problems we have in this file:
diff --git a/run_all_trials.bash b/run_all_trials.bash
index 3f0822e..7fd7b7c 100755
--- a/run_all_trials.bash
+++ b/run_all_trials.bash
@@ -1,10 +1,24 @@
#!/usr/bin/env bash
+DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )"
+
TEAM_NAME=$1
+COMP_CONFIGS_DIR=${DIR}/comp_configs/
+
+get_list_of_trials()
+{
+ yaml_files=$(ls ${COMP_CONFIGS_DIR}/*.yaml)
+
+ for f in $(ls ${COMP_CONFIGS_DIR}/*.yaml); do
+ f=${f##*/}
+ f=${f//.yaml}
+ all_names="${all_names} ${f}"
+ done
+
+ echo $all_names
+}
-#TODO: get list of trials automatically
-declare -a TRIAL_NAMES=("example_trial1" "example_trial2")
-for TRIAL_NAME in ${TRIAL_NAMES[@]}; do
+for TRIAL_NAME in $(get_list_of_trials); do
echo "Running trial: ${TRIAL_NAME}"
#TODO: GET THIS TO RUN THE INDIVIDUAL TRIALS CORRECTLY.
./run_trial.bash ${TEAM_NAME} ${TRIAL_NAME}
thanks, I'll try this!
added in aa62c7e thanks
run_all_trials.bash
Outdated
TEAM_NAME=$1

#TODO: get list of trials automatically
declare -a TRIAL_NAMES=("example_trial1" "example_trial2")
We probably don't need a bash array here; it should be enough to use:

TRIAL_NAMES="example_trial1 example_trial2"
for TRIAL_NAME in ${TRIAL_NAMES}; do
...
thank you for your advice here (and elsewhere!) -- indeed that looks like a simpler way to achieve it
run_all_trials.bash
Outdated
declare -a TRIAL_NAMES=("example_trial1" "example_trial2")
for TRIAL_NAME in ${TRIAL_NAMES[@]}; do
echo "Running trial: ${TRIAL_NAME}"
#TODO: GET THIS TO RUN THE INDIVIDUAL TRIALS CORRECTLY.
What is the problem here?
When the run_trial.bash script gets called from here, inside the run_trial.bash script the TRIAL_NAME variable sometimes has the correct value, but sometimes it is empty (whereas if you call the script from the terminal, the variable is always correct). I didn't have heaps of time to investigate it thoroughly yet; maybe your suggestion to not use bash arrays is enough to fix it.
Let's try the patch I've posted above and see if it is enough or whether we need to debug deeper.
In the end, quoting the parameters fixed it: 2b1a20b
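For anyone following along, the fix presumably amounts to quoting the expansions at the call site, roughly (a sketch; 2b1a20b may differ in detail):

# Unquoted, an empty or whitespace-containing value can vanish or be word-split:
./run_trial.bash ${TEAM_NAME} ${TRIAL_NAME}
# Quoted, each variable is passed as exactly one argument:
./run_trial.bash "${TEAM_NAME}" "${TRIAL_NAME}"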
# Start the competitors container and let it run in the background.
#TODO: parameterize the container name
./ariac-competitor/run_competitor_container.bash &
This background call to the run_competitor_container script is going to be hard to check and respond to on errors. For example, if I forget to generate the team container and then run the script, it keeps going even if this call fails completely, which ends up in other errors that are more difficult to debug. We can work around this with some checks beforehand.
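Something like the following could guard the call (the image name here is hypothetical):

# Fail fast if the competitor image hasn't been built yet.
if ! docker inspect --type=image "ariac-competitor-${TEAM_NAME}" > /dev/null 2>&1; then
  echo "Competitor image not found; build it before running trials." >&2
  exit 1
fi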
thank you for the explanation of why this could be an issue, I wasn't aware
I don't have a good suggestion to improve the situation right now, but let's keep this in mind.
I have encountered a different problem because of this: if the server doesn't start correctly (next line), the script exits and the competitor container is left running in the background. I added a script for users to kill the containers in 019b15c, which will at least let people handle the situation.
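(The kill script in 019b15c presumably boils down to something like this; the container names are assumptions:)

# Stop any leftover trial containers so a failed run can be retried cleanly.
docker kill ariac-server-container ariac-competitor-container 2> /dev/null || true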
# Start the competitors container and let it run in the background.
#TODO: parameterize the container name
./ariac-competitor/run_competitor_container.bash &
Would it make more sense to launch the ariac server first and the competitor code after it?
This part of the script is where I am most looking for feedback. It definitely would be preferable to have the competitor code start after the ariac server, so that we don't need to do waiting like this in the competitor code.
The reason the competitor code is launched first is for the flow of the script. I launch the competitor code in the background first, and it may run indefinitely. The server, however, will only run for the duration of the trial (limited by a timeout): I launch it in the foreground, it is blocking, and when it ends, we know the trial is finished, so we kill the competitor container.
I am "utilising" the blocking nature of the server here, but it is not ideal; it was just the easiest way I knew how to do it! If you could help me launch the competitor code after the server (and then kill it once the trial has ended), I would like to learn from how you do it.
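In other words, the control flow of run_trial.bash is roughly the following (the server script path and container name are simplified placeholders):

# 1. Competitor container runs in the background, possibly indefinitely.
./ariac-competitor/run_competitor_container.bash &

# 2. Server container runs in the foreground and blocks; it exits when the
#    trial completes, times out, or /ariac/end_competition is called.
./ariac-server/run_container.bash

# 3. The trial is over, so tear down the competitor container.
docker kill ariac-competitor-container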
Your approach makes sense, although I think we are abusing the nature of the system a little bit instead of creating a proper sync mechanism. A possible solution could be to implement a service in the ariac server that triggers the beginning of the trial. This service could be called from the user code (as soon as the team considers the code ready) or by a timeout in our system (let's say, give X minutes to start up).
It is hard to automatically know when all the pieces are ready using ROS1.
Yeah, it's a bit complicated: the teams do have to call a service etc., but I believe this issue has more to do with the ROS master not being available at the time. If the team's container starts before the server (which is the ROS master), when the ARIAC simulation starts it gives the following error:
Loaded ariac/arm
Started ['ariac/arm'] successfully
Traceback (most recent call last):
File "/opt/ros/indigo/lib/controller_manager/controller_manager", line 63, in <module>
controller_manager_interface.load_controller(c)
File "/opt/ros/indigo/lib/python2.7/dist-packages/controller_manager/controller_manager_interface.py", line 66, in load_controller
resp = s.call(LoadControllerRequest(name))
File "/opt/ros/indigo/lib/python2.7/dist-packages/rospy/impl/tcpros_service.py", line 525, in call
raise ServiceException("transport error completing service call: %s"%(str(e)))
rospy.service.ServiceException: transport error completing service call: unable to receive data from sender, check sender's logs for details
I'll see if moving the ROS master around helps...
For the record, I tried moving it to the competitor container and to a third container, but it still gives that error when the server connects to the master. I left the server container as the ROS master and moved the waiting to a workaround script in 64b10ed
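The workaround presumably amounts to blocking until the master answers before launching the competitor code, along these lines (a sketch; 64b10ed may differ):

# rostopic list fails while the ROS master is unreachable, so loop on it.
until rostopic list > /dev/null 2>&1; do
  echo "Waiting for the ROS master..."
  sleep 1
done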
Most of my comments are doubts or minor things. I would like to submit some changes and the implementation of the nvidia-docker detection. Feel free to merge and I will submit them against master, or I can create the PR against this branch, whatever you prefer. Important things still to do (on my list):
Update: these two should not be hard to implement.
See #2 (comment)
Avoids the following error when starting gazebo on my machine:

libGL error: failed to load driver: swrast
X Error of failed request: BadValue (integer parameter out of range for operation)
Major opcode of failed request: 154 (GLX)
Minor opcode of failed request: 3 (X_GLXCreateContext)
Value in failed request: 0x0
Serial number of failed request: 30
Current serial number in output stream: 31
Thanks @j-rivero for your help with this PR and #2. I've managed to address most issues now; the main one left is #6 (comment). I'll make a follow-up PR.
For now we have disabled GPU use completely in this PR so that we can get it out the door, since users can still replay the logs post-hoc. It's just considered a "nice-to-have": we're happy to drop support for it if it's complicated to implement. No need for intel support.
Awesome! Testing now
Otherwise copying ARIAC logs to the mounted directory is not guaranteed to have correct permissions
Addresses #3
This adds the run_trial.bash script that can be used to bring up the server and competitor containers and run a given trial config file. It also shuts the two containers down automatically, provided that the competitor's system gets the competition into a 'done' state (by completing the trial, making a call to the /ariac/end_competition service (giving up), or time running out). There's also the start of a script to call run_trial.bash with the different trial names, but something's not right with it at the moment.