Batch automation of ARIAC trials + user system #6

Merged · 30 commits · May 26, 2017

Conversation

@dhood (Contributor) commented May 17, 2017

Addresses #3

This adds the run_trial.bash script that can be used to bring up the server and competitor containers and run a given trial config file. It also shuts the two containers down automatically, provided that the competitor's system gets the competition into a 'done' state (by completing the trial, calling the /ariac/end_competition service (giving up), or running out of time).

There's also the start of a script that calls run_trial.bash with the different trial names, but something's not right with it at the moment.
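For context, a single trial as described above is run by invoking run_trial.bash with a team name and a trial name. This is just an illustrative sketch: the team name below is made up, and example_trial1 is one of the example trial names used later in this PR.

# Run one trial for one team (names are illustrative)
./run_trial.bash example_team example_trial1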

@dhood self-assigned this May 17, 2017
@@ -20,20 +20,6 @@ usage()
exit 1
}

# Wait until the /ariac/competition_state returns "data: done".
Contributor Author (@dhood):

@caguero I modified the script from https://bitbucket.org/osrf/ariac/pull-requests/58 to use the ARIAC_EXIT_ON_COMPLETION variable instead of checking for 'done'. This should bring the simulation down when the trial completes, time runs out, or the user calls /ariac/end_competition.
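For reference, the polling approach that this replaces would look roughly like the sketch below: block until the competition state topic reports done, then tear things down. With ARIAC_EXIT_ON_COMPLETION the simulation exits on its own instead. (Sketch only; assumes a sourced ROS environment.)

# Wait until /ariac/competition_state (a std_msgs/String topic) reports "data: done".
until rostopic echo -n 1 /ariac/competition_state 2> /dev/null | grep -q "data: done"; do
  sleep 1
done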

run_trial.bash Outdated
# Make the log playable
#TODO: figure out why the ownership of the state log is root and resolve.
echo "Changing ownership of gazebo state log"
sudo chown $USER ${HOST_LOG_DIR}/gazebo/state.log
Contributor Author (@dhood):

@j-rivero, I noticed that the gazebo state log has root ownership. The log is generated in the container and then copied by this script into /ariac/logs in the container, which is mounted to logs/{team_name}/{trial_name} on the host. Do you have any idea what might cause this?

This workaround is necessary to make the log playable on the host. If we can avoid it, I believe the script won't need sudo.

Contributor:

Are we running Gazebo as root in the container? AFAIK nothing else inside Gazebo changes the permissions of the state log.

Contributor Author (@dhood):

Gazebo gets launched from a command passed through this line in the entrypoint, and I see now that it looks to be run as the root user. Maybe changing that will fix it (it still seems strange to me that it wouldn't have the permissions of the mounted volume's owner, but I haven't worked with mounting much before).

Contributor Author (@dhood):

I changed the entrypoint to be run as the ariac user and now it's all good! f011bac

Contributor Author (@dhood):

Oops, I got lucky with the user ID on my machine. Added 2c3a870 for portability (thanks @dirk-thomas).
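One common way to get this kind of portability is to pass the host user's UID into the image build so the in-container ariac user matches the host user that owns the mounted log directory. The sketch below assumes the Dockerfile declares a matching build argument; the image tag and build-arg name are assumptions, and this is not necessarily what 2c3a870 does.

# Build the image so the "ariac" user is created with the host UID.
# (Assumes the Dockerfile contains: ARG USER_ID, followed by
#  RUN useradd --uid ${USER_ID} --create-home ariac)
docker build --build-arg USER_ID="$(id -u)" -t ariac-server .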

@dhood (Contributor Author) commented May 18, 2017

@caguero: I added the example code back into the repository. I added some TODOs in aa6899f. I think we spoke about most of them in person.

The most important one is aa6899f#diff-92694b972d13716f9b231735929abe9bR6 (the example code has to sleep for 10s at the moment because the competitor container is started first). We still don't support Kinetic at the moment, and we don't know how many users are on Kinetic, so it's hard to say how big a priority it should be.

I added 8ed5087, which should disable the nvidia support, but I can't test it because the airport internet won't let me pull Docker images. Please revert it if it doesn't work!

@dhood (Contributor Author) commented May 18, 2017

@j-rivero would you be able to give this a look if you get a chance, please? There are some known to-dos that I've noted, but you might be able to spot other issues since you have more experience with Docker than @caguero and I do.

I forgot to document that if you want to stop the competition manually, as opposed to waiting for it to time out (10 minutes by default!), you can run docker exec -it ariac-competitor-system /bin/bash to get into the user's container, then source the ROS setup file and call the /ariac/end_competition service. That should kill the simulation and bring down the competition container as a result.
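Spelled out, the manual shutdown described above looks roughly like this. The container name follows the naming used in this PR, and the setup file path assumes an Indigo install; adjust both as needed.

# On the host: open a shell in the competitor container
docker exec -it ariac-competitor-system /bin/bash

# Inside the container: source ROS and ask ARIAC to end the competition
source /opt/ros/indigo/setup.bash
rosservice call /ariac/end_competition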

@caguero (Contributor) commented May 18, 2017

I had to remove the version numbers when installing the ros-indigo-ros-core and ros-indigo-ros-base packages while building the Docker image; this is probably due to a new ROS sync. I also added a note about installing Docker.
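In other words, the install step ends up unpinned, along these lines (a sketch; the exact Dockerfile RUN line may differ):

# Unpinned install: exact package versions drift after each ROS sync,
# so pinning them in the Dockerfile breaks the build over time.
apt-get update && apt-get install -y \
    ros-indigo-ros-core \
    ros-indigo-ros-base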

@@ -2,8 +2,22 @@

DOCKER_ARGS=""
# Uncomment this line to rebuild without cache
# TODO: expose this as an argument
Contributor:

what do we want to expose as an argument here?

Contributor Author (@dhood):

it doesn't show up well in the diffs, sorry: the no-cache option

Contributor:

I'm using the no-cache option in my upcoming kinetic_version branch; let's handle it from there.

Contributor Author (@dhood):

OK, we'll just have to do it for the competitor script as well as the server script.
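A minimal way to expose the option in both build scripts might look like the sketch below. The flag name and image tag are assumptions, not necessarily what the kinetic_version branch does.

#!/usr/bin/env bash
# Pass --no-cache to force a rebuild without Docker's layer cache.
DOCKER_ARGS=""
if [ "$1" = "--no-cache" ]; then
  DOCKER_ARGS="${DOCKER_ARGS} --no-cache"
  shift
fi
docker build ${DOCKER_ARGS} -t ariac-server .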


TEAM_NAME=$1

#TODO: get list of trials automatically
Contributor (@j-rivero), May 19, 2017:

Where is the list of trials, or how can we get it?

Contributor Author (@dhood):

From ls comp_configs, minus the extensions: each of the config files in the comp_configs directory is used for a different trial (once).

Contributor:

Oh, it was pretty obvious, sorry for the question. Let's try a patch to fix all three of the comments/problems we have in this file:

diff --git a/run_all_trials.bash b/run_all_trials.bash
index 3f0822e..7fd7b7c 100755
--- a/run_all_trials.bash
+++ b/run_all_trials.bash
@@ -1,10 +1,24 @@
 #!/usr/bin/env bash
 
+DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )"
+
 TEAM_NAME=$1
+COMP_CONFIGS_DIR=${DIR}/comp_configs/
+
+get_list_of_trials()
+{
+  yaml_files=$(ls ${COMP_CONFIGS_DIR}/*.yaml)
+
+  for f in $(ls ${COMP_CONFIGS_DIR}/*.yaml); do
+    f=${f##*/}
+    f=${f//.yaml}
+    all_names="${all_names} ${f}"
+  done
+
+  echo $all_names
+}
 
-#TODO: get list of trials automatically
-declare -a TRIAL_NAMES=("example_trial1" "example_trial2")
-for TRIAL_NAME in ${TRIAL_NAMES[@]}; do
+for TRIAL_NAME in $(get_list_of_trials); do
   echo "Running trial: ${TRIAL_NAME}"
   #TODO: GET THIS TO RUN THE INDIVIDUAL TRIALS CORRECTLY.
   ./run_trial.bash ${TEAM_NAME} ${TRIAL_NAME}

Contributor Author (@dhood):

thanks, I'll try this!

Contributor Author (@dhood):

Added in aa62c7e, thanks.

TEAM_NAME=$1

#TODO: get list of trials automatically
declare -a TRIAL_NAMES=("example_trial1" "example_trial2")
Contributor:

we probably don't need a bash array here; it should be enough to use:

TRIAL_NAMES="example_trial1 example_trial2"
 for TRIAL_NAME in ${TRIAL_NAMES}; do
...

Contributor Author (@dhood):

thank you for your advice here (and elsewhere!) -- indeed that looks like a simpler way to achieve it

declare -a TRIAL_NAMES=("example_trial1" "example_trial2")
for TRIAL_NAME in ${TRIAL_NAMES[@]}; do
echo "Running trial: ${TRIAL_NAME}"
#TODO: GET THIS TO RUN THE INDIVIDUAL TRIALS CORRECTLY.
Contributor:

What is the problem here?

Contributor Author (@dhood):

When run_trial.bash gets called from here, the TRIAL_NAME variable inside it sometimes has the correct value but is sometimes empty (whereas if you call the script from the terminal, the variable is always correct). I haven't had heaps of time to investigate it thoroughly yet; maybe your suggestion to not use bash arrays is enough to fix it.

Contributor:

Let's try the patch I've posted above and see if it is enough or if we need to debug deeper.

Contributor Author (@dhood):

In the end, quoting the parameters fixed it: 2b1a20b
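That is, the loop now passes its arguments quoted, so an empty or whitespace-containing value can't shift the positional parameters seen inside run_trial.bash. This is a sketch of the idea, not necessarily the exact diff in 2b1a20b:

./run_trial.bash "${TEAM_NAME}" "${TRIAL_NAME}"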


# Start the competitors container and let it run in the background.
#TODO: parameterize the container name
./ariac-competitor/run_competitor_container.bash &
Contributor:

This background call to the run_competitor_container script is going to be hard to check and respond to on errors. For example, if I forget to generate the team container and then run the script, it keeps going even though this call fails completely, which ends up in other errors that are more difficult to debug. We could work around it with some checks beforehand.
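A pre-flight check of the kind suggested here could look like the sketch below; the per-team image naming is an assumption for illustration.

# Fail fast if the competitor image doesn't exist before backgrounding it.
if ! docker image inspect "ariac-competitor-${TEAM_NAME}" > /dev/null 2>&1; then
  echo "Competitor image for team '${TEAM_NAME}' not found; build it first." >&2
  exit 1
fi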

Contributor Author (@dhood):

thank you for the explanation of why this could be an issue, I wasn't aware

Contributor:

I don't have a good suggestion to improve the situation right now, but let's keep this in mind.

Contributor Author (@dhood):

I have encountered a different problem because of this: if the server doesn't start correctly (next line), the script exits and the competitor container is left running in the background. I added a script for users to kill the containers in 019b15c, which will at least let people handle the situation.
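The cleanup script could be as simple as the following sketch; the container names are assumptions, and 019b15c may differ.

# Stop any leftover ARIAC containers from an aborted run.
docker kill ariac-server-system ariac-competitor-system 2> /dev/null || true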


# Start the competitors container and let it run in the background.
#TODO: parameterize the container name
./ariac-competitor/run_competitor_container.bash &
Contributor:

Would it make more sense to launch the ARIAC server first and the competitor code after it?

Contributor Author (@dhood):

This part of the script is where I'm most looking for feedback. It would definitely be preferable to have the competitor code start after the ARIAC server, so that we don't need to do this kind of waiting in the competitor code.

The reason the competitor code is launched first is for the flow of the script. I launch the competitor code in the background first, and it may run indefinitely. The server, however, will only run for the duration of the trial (limited by a timeout): I launch it in the foreground, it is blocking, and when it ends, we know the trial is finished, so we kill the competitor container.

I am "utilising" the blocking nature of the server here, but it is not ideal; it was just the easiest way I knew how to do it! If you could help me launch the competitor code after the server (and then kill it once the trial has ended), I would like to learn from how you do it.
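For reference, the flow described above boils down to something like this sketch (the server script name and the competitor container name are assumptions):

# 1. Competitor runs in the background and may run indefinitely.
./ariac-competitor/run_competitor_container.bash &

# 2. Server runs in the foreground and blocks until the trial ends
#    (completion, give-up via /ariac/end_competition, or timeout).
./ariac-server/run_server_container.bash "${TRIAL_NAME}"

# 3. Once the server returns, the trial is over; tear the competitor down.
docker kill ariac-competitor-system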

Contributor:

Your approach makes sense, although I think we are abusing the nature of the system a little bit instead of creating a proper sync mechanism. A possible solution could be to implement a service in the ARIAC server that triggers the beginning of the trial. This service could be called from the user code (as soon as the team considers the code ready) or by a timeout in our system (say, give X minutes to start up).

It is hard to automatically know when all the pieces are ready using ROS 1.

Contributor Author (@dhood):

Yeah, it's a bit complicated: the teams do have to call a service etc., but I believe this issue has more to do with the ROS master not being available at the time. If the team's container starts before the server (which is the ROS master), the ARIAC simulation gives the following error when it starts:

Loaded ariac/arm
Started ['ariac/arm'] successfully
Traceback (most recent call last):
  File "/opt/ros/indigo/lib/controller_manager/controller_manager", line 63, in <module>
    controller_manager_interface.load_controller(c)
  File "/opt/ros/indigo/lib/python2.7/dist-packages/controller_manager/controller_manager_interface.py", line 66, in load_controller
    resp = s.call(LoadControllerRequest(name))
  File "/opt/ros/indigo/lib/python2.7/dist-packages/rospy/impl/tcpros_service.py", line 525, in call
    raise ServiceException("transport error completing service call: %s"%(str(e)))
rospy.service.ServiceException: transport error completing service call: unable to receive data from sender, check sender's logs for details

I'll see if moving the ROS master around helps...

Contributor Author (@dhood):

For the record, I tried moving it to the competitor container and to a third container, but it still gives that error when the server connects to the master. I left the server container as the ROS master and moved the waiting into a workaround script in 64b10ed.
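The waiting itself can be as simple as blocking on the master before starting the team's code, e.g. the sketch below; this illustrates the general pattern, not necessarily the contents of 64b10ed.

# Block until the ROS master in the server container is reachable.
until rostopic list > /dev/null 2>&1; do
  echo "Waiting for the ROS master..."
  sleep 1
done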

@j-rivero (Contributor):

Most of my comments are doubts or minor things. I would like to submit some changes plus the implementation of the nvidia-docker detection. Feel free to merge and I will submit them against master, or I can create the PR against this branch, whichever you prefer.

Important things still to do (in my list):

  • Implement the nvidia-docker detection and use of nvidia-docker mechanism
  • Implement the mechanism for the hardware acceleration in docker using Intel graphics cards
  • Implement the option of having kinetic

@j-rivero (Contributor) commented May 23, 2017

Update:

  • Implement the nvidia-docker detection and use of nvidia-docker mechanism
  • Implement the mechanism for the hardware acceleration in docker using Intel graphics cards

These two should not be hard to implement.

  • Implement the option of having kinetic

See #2 (comment)

dhood added 7 commits May 25, 2017 10:16
Avoids the following error when starting gazebo on my machine:

libGL error: failed to load driver: swrast
X Error of failed request:  BadValue (integer parameter out of range for operation)
  Major opcode of failed request:  154 (GLX)
  Minor opcode of failed request:  3 (X_GLXCreateContext)
  Value in failed request:  0x0
  Serial number of failed request:  30
  Current serial number in output stream:  31
@dhood (Contributor Author) commented May 25, 2017

Thanks @j-rivero for your help with this PR and #2. I've managed to address most issues now; the main one left is #6 (comment). I'll make a follow-up PR.

  • Implement the nvidia-docker detection and use of nvidia-docker mechanism
  • Implement the mechanism for the hardware acceleration in docker using Intel graphics cards

For now we have disabled GPU use completely in this PR so that we can get it out the door, since users can still replay the logs post hoc. It's just considered a "nice-to-have": we're happy to drop support for it if it's complicated to implement. No need for Intel support.

  • Implement the option of having kinetic
  see #2 (comment)

Awesome! testing now

dhood added 3 commits May 25, 2017 13:59
Otherwise copying ARIAC logs to the mounted directory is not guaranteed to have correct permissions