New BUILD_ROOT not being respected #182

Closed
pbauman opened this issue Dec 12, 2016 · 22 comments

pbauman commented Dec 12, 2016

So, the first time I started my client, I had a typo in the BUILD_ROOT variable. I fixed it, killed the current running client, and restarted a new client with the correct value of BUILD_ROOT exported. The first time the client runs my job, it's fine. When I invalidate it to try again (I'm currently debugging my setup, so I'm just invalidating a job on a PR), it's reverting back to the old, mistyped value of BUILD_ROOT.

Any suggestions? Any other information I can provide? Thanks!

pbauman commented Dec 12, 2016

I should mention that this behavior is repeatable, i.e. I can kill the running client and start a new one. The first time it runs the job, it has the correct BUILD_ROOT; the next time, it uses the "old" one. Clearly it's being cached somewhere, but I'm not sure where.

@brianmoose (Contributor)

I take it you are just running client.py?
I don't think it caches BUILD_ROOT anywhere...where are you setting it? On the command line or in .bashrc? Let me know what your command line is and I can try to reproduce it.

pbauman commented Dec 12, 2016

I'm just setting it in the environment: export BUILD_ROOT=<path>

The command I use for the client is:

./client.py --url https://femputer.eng.buffalo.edu:8000 --build-key <build_key> --configs linux-gnu --name pbauman_client --insecure &
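
A minimal sanity check before launching, assuming a bash shell (a suggested addition, not one of the commands above): confirm the corrected value is exported, not just set, so that client.py actually inherits it.

export BUILD_ROOT=/femputer/pbauman/civet_build_testing_root
env | grep '^BUILD_ROOT='   # should print exactly one line with the corrected path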

@brianmoose (Contributor)

Hmm, I can't seem to reproduce it.
Where are you seeing it using the wrong BUILD_ROOT?
If you are using the scripts from civet_example_recipes then it should display what it thinks the build root is in the header of each step. Is it wrong there as well? Is it right at the beginning (like when you clone/update your repository)? Or after, like when building and it is referencing bad library paths or object files?

Any custom scripts in civet_recipes that could be the culprit?

pbauman commented Dec 12, 2016

I see it as an error in the first step of the recipe, in particular when it's calling init_script I believe.

https://femputer.eng.buffalo.edu:8000/job/3/

/tmp/tmpG7CBZ2: line 123: cd: /femptuer/pbauman/civet_build_testing_root: No such file or directory
ERROR: exiting with code 1

I grepped through the scripts and I don't see any of them setting BUILD_ROOT:

[01:06:58][pbauman@femputer:/femputer/pbauman/civet_recipes/scripts][master] $ grep BUILD_ROOT *.sh
cleanup_build_testing_dir.sh:#REQUIRED: BUILD_ROOT
cleanup_build_testing_dir.sh:SUBDIR=$BUILD_ROOT/$FEMPUTER_BUILD_DIRNAME
cleanup_build_testing_dir.sh:cd "$BUILD_ROOT"
functions.sh:# Bad things can happen if BUILD_ROOT is not set
functions.sh:if [ -z "$BUILD_ROOT" ]; then
functions.sh:  echo "You need to set BUILD_ROOT"
functions.sh:  local b="$BUILD_ROOT/"
functions.sh:  local cwd=${p/#$b/BUILD_ROOT/}
functions.sh:  printf "Build Root: $BUILD_ROOT\n"
functions.sh:  export REPO_DIR=$BUILD_ROOT/$APPLICATION_NAME
functions.sh:    cd "$BUILD_ROOT"
make_build_testing_dir.sh:#REQUIRED: BUILD_ROOT
make_build_testing_dir.sh:SUBDIR=$BUILD_ROOT/$FEMPUTER_BUILD_DIRNAME
run_cmd.sh:REPO_DIR=$BUILD_ROOT/$APPLICATION_NAME

I only have one recipe at the moment and it doesn't set BUILD_ROOT.

Could this be cached somehow on the GitHub side? (I'm sure that's a stupid suggestion.)

pbauman commented Dec 12, 2016

Annnnnnd now the server isn't responding. Will report back in a few mins.

@brianmoose (Contributor)

OK, I guess you probably want /femputer/pbauman/civet_build_testing_root.
I am not seeing anywhere where this could be cached. The client gets its scripts from the CIVET server and the client should only be accessing BUILD_ROOT from its local environment. I guess it could get overwritten locally but it would have to be in the last step of whatever recipe you are running. Do you set BUILD_ROOT in one of your recipe .cfg files?
Could you send the results of env from the terminal where you are launching the client?
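
One simple way to capture that, assuming a standard shell (the filename is arbitrary):

env | sort > env_new.txt
grep BUILD_ROOT env_new.txt   # the value the client will inherit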

pbauman commented Dec 12, 2016

> OK, I guess you probably want /femputer/pbauman/civet_build_testing_root

Correct.

> I am not seeing anywhere where this could be cached. The client gets its scripts from the CIVET server and the client should only be accessing BUILD_ROOT from its local environment.

Does the server maybe cache any of the variables anywhere that would be read by the client?

> I guess it could get overwritten locally but it would have to be in the last step of whatever recipe you are running. Do you set BUILD_ROOT in one of your recipe .cfg files?

No. I only have one .cfg file in recipes. (https://github.com/ubaceslab/civet_recipes/tree/master/recipes)

I've tried the following.

  1. Killed the running client and shut down that terminal.
  2. Killed the running server and shut down that terminal.
  3. Started a new terminal and started a new server in that terminal.
  4. Started a new terminal and set a valid value of BUILD_ROOT. Started a new client in that terminal.
  5. Invalidated the job. It reran with the correct BUILD_ROOT up to the point of failure (the Configure step; still trying to figure out what's wrong with my shell script that's causing a -15 exit code, no output produced).
  6. Made changes to the script, invalidated the job.
  7. Now failing because it can't cd to the old BUILD_ROOT from the terminal that died in step 2 (which had been unset and reset to a valid value before that).
  8. Invalidated again. Now getting exit -15 on the Bootstrap step (this is the first time this has happened).
  9. Invalidated again. Now getting exit -15 on the Build step...
  10. Invalidated again. Now failing because of the bad BUILD_ROOT (the misspelled one that should've gone away...).
  11. Invalidated again. Now getting exit -15 on the Bootstrap step.

There were no changes to the recipe or scripts between steps 7, 8, 9, 10, and 11; I was just invalidating.

> Could you send the results of env from the terminal where you are launching the client?

This is from the new terminal started in step 4.
env_new.txt

pbauman commented Dec 12, 2016

Yeah, I don't even. I'm getting random behavior with each invalidate. Any clues about what error code -15 could be?

The nuclear option is rebooting the server, wiping out the current civet install, and starting from scratch, but it would be nice to try and understand what's causing this behavior.

@brianmoose (Contributor)

Nothing seems wrong with your environment.
-15 is the result of being killed by a signal (15 or SIGTERM).
A quick, messy debug option would be to sprinkle echo $BUILD_ROOT statements throughout your scripts to try to track down when it actually changes.
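
For example, something like the following near the top of functions.sh and of each step script (a sketch; the timestamp is only there to make it easy to line up with the step logs):

echo "DEBUG $(date +%T) ${0}: BUILD_ROOT=$BUILD_ROOT" >&2
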
I am going to clone your repo and go through your scripts real quick to see if anything pops out.

pbauman commented Dec 12, 2016

I put a print statement at the top of functions.sh and it looks like BUILD_ROOT is corrupted right from the beginning:

/femptuer/pbauman/civet_build_testing_root
Date: Mon Dec 12 16:16:56 EST 2016
Machine: femputer.eng.buffalo.edu
LSB Version:	:base-4.0-amd64:base-4.0-noarch:core-4.0-amd64:core-4.0-noarch:graphics-4.0-amd64:graphics-4.0-noarch:printing-4.0-amd64:printing-4.0-noarch
Distributor ID:	CentOS
Description:	CentOS release 6.8 (Final)
Release:	6.8
Codename:	Final
Cannot find MOOSE Package version.
Build Root: /femptuer/pbauman/civet_build_testing_root
Trigger: Pull request
Step: Load GRINS Modules (0)

pbauman commented Dec 12, 2016

I'm using Lmod. Does Lmod do some weird caching business? Is each step in the recipe starting a new environment?

pbauman commented Dec 12, 2016

There must be something funky happening between Lmod and the shell. I'm printing out env at the top of functions.sh and it's (occasionally...) showing the bad BUILD_ROOT value.

@brianmoose (Contributor)

I have seen weird Lmod caching behavior with environment variables but it is usually solved by quitting the current terminal and starting again.

I notice that you are loading modules in the scripts. I don't think that will work like you think it should. Each step is run in its own process so it shouldn't carry over (which is why I find it weird that BUILD_ROOT is getting set by a child process).
If your recipes need a certain environment then the environment should be set up how you want it in the terminal that is going to run the client. That way the environment is inherited from the parent process (the client).
If you need multiple different environments then you can use multiple build configs but then you might want to look at our inl_client.py which is set up to handle multiple different module environments.
Probably not the cause here, but I also noticed that you don't clear out grins-devel before you start everything. You will probably want to do that since you can't rely on the cleanup step always executing.
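
To illustrate the per-process point above: each step script runs in its own child process, so an export made there dies with that process. A toy demo (not part of the actual recipes):

export BUILD_ROOT=/femputer/pbauman/civet_build_testing_root
bash -c 'export BUILD_ROOT=/somewhere/else; echo "inside the step: $BUILD_ROOT"'
echo "after the step:  $BUILD_ROOT"   # still the value inherited from this shell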

pbauman commented Dec 13, 2016

> I have seen weird Lmod caching behavior with environment variables but it is usually solved by quitting the current terminal and starting again.

Yes, this was really perplexing.

> I notice that you are loading modules in the scripts. I don't think that will work like you think it should. Each step is run in its own process so it shouldn't carry over (which is why I find it weird that BUILD_ROOT is getting set by a child process).
> If your recipes need a certain environment then the environment should be set up how you want it in the terminal that is going to run the client. That way the environment is inherited from the parent process (the client).

Thanks for the info. I'd started inferring this through my trial and error. :)

> If you need multiple different environments then you can use multiple build configs but then you might want to look at our inl_client.py which is set up to handle multiple different module environments.
> Probably not the cause here, but I also noticed that you don't clear out grins-devel before you start everything. You will probably want to do that since you can't rely on the cleanup step always executing.

I'd planned on controlling the environment with the scripts + module system. Any pointers to what I should look at in the inl_client.py?

pbauman commented Dec 13, 2016

After rebooting the node and restarting the server and client, everything seems to be working now: I've invalidated twice, and I'm not getting -15 exit messages or invalid BUILD_ROOT values. I just need to fix LD_LIBRARY_PATH in one of the modules (because PETSc changed stuff again).

I guess this is good, but it would be nice to understand really what happened. Is there any chance that screen would be interfering? I was running the CIVET server and client within a screen session.

@brianmoose (Contributor)

I don't see how screen could have affected it, but then I can't really see how Lmod screwed things up so badly either. I bet it has something to do with Lmod. It seems to do some magic under the hood that I haven't gotten around to trying to figure out.

Are you going to be using multiple different module configurations? If not then you could use something like what is in civet/client/scripts/control.sh to set up the environment and run the client. We do that on a couple of machines that only have one configuration.
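
A rough sketch of that kind of wrapper (not the actual civet/client/scripts/control.sh; the module names and path are placeholders): set the environment up once, then start the client so every step inherits it.

#!/bin/bash
module purge
module load gcc openmpi   # whatever modules the build actually needs
export BUILD_ROOT=/femputer/pbauman/civet_build_testing_root
exec ./client.py --url https://femputer.eng.buffalo.edu:8000 --build-key <build_key> --configs linux-gnu --name pbauman_client --insecure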

Regardless, I will update the wiki on how to set up the inl_client.py.

pbauman commented Dec 13, 2016

> I don't see how screen could have affected it, but then I can't really see how Lmod screwed things up so badly either. I bet it has something to do with Lmod. It seems to do some magic under the hood that I haven't gotten around to trying to figure out.

OK, hopefully I won't make any more typos. :P Will report in if I ever figure it out (unlikely).

> Are you going to be using multiple different module configurations?

Yes? I guess I'm not quite sure what you mean. I plan to have several recipes that will build with different compiler options on the same PR, e.g. dbg vs. opt libMesh. I'll have different modules for the dbg build vs. the opt build. But I can easily just load the different modules in each of those recipes.

> If not then you could use something like what is in civet/client/scripts/control.sh to set up the environment and run the client. We do that on a couple of machines that only have one configuration.

OK, thanks, I'll have a look.

> Regardless, I will update the wiki on how to set up the inl_client.py.

Awesome, thanks!

@brianmoose (Contributor)

So you have different modules for debug vs opt? We don't do that much here; we just have different compiler targets or configuration targets, like linux-gcc, linux-clang, linux-intel, etc. Each one of those loads a different set of modules. But we build libmesh in dbg/opt every time.
For your case you could have a linux-gcc-opt and linux-gcc-dbg and use the inl_client.py to poll for jobs on those two configurations.
However, loading modules in an individual step should work as well. You will just need to make sure the correct modules are loaded in each step (make sure to do a purge first!). Going this route you probably wouldn't need the inl_client.py.
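
Going that route, each step script would start with something like this (hypothetical module names; the point is the purge before loading the set that particular step needs):

module purge
module load gcc petsc-dbg libmesh-dbg   # hypothetical modules for a dbg config
echo "Build Root: $BUILD_ROOT"          # the value inherited from the client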

@brianmoose (Contributor)

I updated the wiki for the INL client. Please let me know if there is something that isn't clear.

pbauman commented Dec 20, 2016

Sorry I was slow here (travel+end of semester). This is all clear for me now and my server and client are running smoothly so far. Thanks very much for the very quick and helpful comments and updates!

pbauman closed this as completed Dec 20, 2016

@brianmoose (Contributor)

Great! Let us know if you run into any problems or need something added.
