Execution in login shell (or not) #201

Open
LourensVeen opened this issue Mar 22, 2023 · 3 comments
Labels
instance-mgmt Topic: Starting and monitoring instances

Comments

@LourensVeen
Contributor

Model components are currently executed in a login shell. This is nice, because the environment is then the same as what you have on the command line, so there are fewer unexpected differences. On the other hand, there may be cases where different models require different things, and you want a clean environment to explicitly add modules and variables to. The first case may also create unexpected conflicts, because the shell scripts loaded by a login shell may fail in the presence of environment variables injected from the environment in which the manager was started by QCG-PJ.

QCG-PJ currently seems to copy various bits of its own environment to the jobs it runs, but as I recall the behaviour is not the same locally as on a cluster. It also runs jobs in a login shell on a cluster, but not when running locally. MUSCLE3 currently adds a manual bash -l -c to local runs to at least make this consistent.

Both of the above cases actually seem reasonable, so the solution is probably to add another key to the implementations section of the yMMSL file that specifies whether we want a login shell or a normal one, and/or a clean environment or one with passthrough from the host environment.
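
For illustration, this is roughly what the two modes would look like at the shell level. It's only a sketch: my_model and the particular variables passed through are placeholders, not an actual design.

    # Login shell with passthrough of the host environment (roughly the current behaviour):
    bash -l -c 'my_model'

    # Clean, non-login shell with only an explicit whitelist of variables:
    env -i HOME="$HOME" PATH=/usr/bin:/bin bash --noprofile --norc -c 'my_model'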

Thanks to @peter-t-fox for the report and discussion.

@LourensVeen
Contributor Author

Some additional complications have come up. In some cases, bash -l causes the program to be run in the home directory, which is definitely not what we want. Also, QCG-PJ version 0.14.0 now uses exec -l for some reason.

bash has the following ways of reading startup files when it's running non-interactively:

  • bash -c <command> doesn't read any startup files
  • bash -l -c <command> reads /etc/profile, then the first one it finds of ~/.bash_profile, ~/.bash_login and ~/.profile
  • bash -l --noprofile -c <command> doesn't read any startup files

Note that on my Ubuntu system, the default ~/.profile sources ~/.bashrc, so in that case ~/.bashrc gets read by bash -l as well.

It's not clear to me whether there's a difference between bash -c and bash -l --noprofile -c.
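
For what it's worth, one observable difference is the login_shell shell option: with --noprofile neither variant reads any startup files, but -l still marks the shell as a login shell, which anything that tests for it will notice. A quick check (assuming a reasonably recent bash):

    # Neither invocation reads startup files, but only the second reports being a login shell:
    bash -c 'shopt -q login_shell && echo "login shell" || echo "not a login shell"'
    bash -l --noprofile -c 'shopt -q login_shell && echo "login shell" || echo "not a login shell"'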

@LourensVeen
Contributor Author

Just encountered another issue: if we don't run in a login shell, then we'll inherit the environment from the parent, which is the manager or the node agent. The manager has in turn inherited the environment from the shell that started it, but not any functions defined in it. Subshells do "inherit" functions, because they're forked subprocesses of the parent shell, but when starting a Python interpreter those are lost because Python doesn't know anything about shell functions and there's no mechanism to pass them.
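
A small illustration of that last point, with myfunc as a throwaway example and python3 standing in for the manager (assumes python3 is on the PATH):

    # A plain, non-exported function survives a forked subshell, but not a bash
    # started from a Python process, because it isn't in the environment:
    myfunc() { echo "myfunc is defined" ; }
    ( myfunc )                                                                 # prints "myfunc is defined"
    python3 -c 'import subprocess; subprocess.run(["bash", "-c", "myfunc"])'   # myfunc: command not found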

The problem is that the module command that we need to load modules is a shell function, and if we launch a non-login shell from Python, it won't read the usual configuration files, so it won't source the Environment Modules or Lmod initialisation script and it won't have the module function.

So we need a login shell, but the problem with that is that it can also contain other stuff that we don't want (giant active banners, commands to move to a different directory, anything really).

It seems that Lmod defines a few environment variables that we may be able to use. Sourcing ${LMOD_PKG}/init/bash should define the required functions. But if we're using Lmod with Spack, then we also need to source ${SPACK_ROOT}/share/spack/setup-env.sh after that to make the Spack-built modules available.

We could add those commands to the run script if we detect that LMOD_PKG and SPACK_ROOT are set, but what if we're using Environment Modules, and what if we're using EasyBuild or Nix with Lmod? We'd have to figure out all those situations and add support for them one by one. And test them too...
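
As a rough sketch, a generated run script preamble covering just the Lmod and Lmod-plus-Spack cases could look like the following; the module name at the end is a placeholder, and every other module system would need its own detection along these lines.

    # Hypothetical run script preamble covering only Lmod, optionally with Spack:
    if [ -n "${LMOD_PKG:-}" ] ; then
        source "${LMOD_PKG}/init/bash"                      # defines the module/ml shell functions
    fi
    if [ -n "${SPACK_ROOT:-}" ] ; then
        source "${SPACK_ROOT}/share/spack/setup-env.sh"     # makes Spack-built modules available
    fi
    module load some_model_dependency                       # placeholder module name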

Question: how does SLURM do this? It starts the job script, and inside the job script you can do module load just fine. But I write those with #!/bin/bash, not #!/bin/bash -l. So how does the module function get defined?

@LourensVeen
Contributor Author

https://lists.schedmd.com/pipermail/slurm-users/2021-January/006675.html

According to the bash man page, login shells read /etc/profile if it exists, then the first one of ~/.bash_profile, ~/.bash_login, and ~/.profile that exists. An interactive non-login shell reads /etc/bash.bashrc and then ~/.bashrc, if they exist.

Of course, we have no idea where the cluster administrators source the module environment script, so that doesn't help.

The link above says that the module function is an exported function, and will therefore be inherited by subshells. The bash man page says that "Functions may be exported so that subshells automatically have them defined with the -f option to the export builtin." The man page doesn't describe the mechanism, but bash seems to use BASH_FUNC_<name> environment variables to pass exported functions to child shells, and those should propagate through Slurm, but also through our Python code.
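
A quick way to see the mechanism, with greet as a throwaway example and python3 standing in for the manager (assumes a post-Shellshock bash, which encodes exported functions as BASH_FUNC_<name>%% variables):

    # An exported function ends up in the environment and survives crossing a
    # Python process on its way to a child bash:
    greet() { echo "hello from an exported function" ; }
    export -f greet
    env | grep '^BASH_FUNC_greet'                                              # shows how bash encodes it
    python3 -c 'import subprocess; subprocess.run(["bash", "-c", "greet"])'    # prints the greeting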

So then why was module undefined in my test?
