Work with Environment Modules 4 on Gadi #209

Closed
penguian opened this issue Nov 18, 2019 · 20 comments · Fixed by #210 or #211

@penguian

penguian commented Nov 18, 2019

Today I tried running on Gadi. Building succeeded. Unfortunately, I was unable to run the 1deg_jra55_ryf experiment. The error in payu/envmod.py line 40 is

FileNotFoundError: [Errno 2] No such file or directory: '/opt/Modules/v4.3.0/init/.modulespath'

This error arises because payu was written for Environment Modules version 3.2.6 on Raijin, whereas Gadi runs version 4.3.0, which is not backwards compatible with 3.2.6. See https://modules.readthedocs.io/en/latest/diff_v3_v4.html
payu therefore needs to be changed to work with Environment Modules version 4, and in particular with the current configuration on Gadi. Note that the configuration of Environment Modules on Gadi could itself change.

See also pull request #128 and issue #200.

@aidanheerdegen
Collaborator

aidanheerdegen commented Nov 19, 2019

The proximate cause of this issue is that the module environment variables are not consistent with how they're configured on raijin.

On raijin the following environment variables are set:

MODULE_VERSION=3.2.6
MODULESHOME=/opt/Modules/3.2.6

On gadi, the only equivalent is

MODULES_CMD=/opt/Modules/v4.3.0/libexec/modulecmd.tcl

Without a code change the above issue is solved by setting MODULE_VERSION:

export MODULE_VERSION=v4.3.0

I think this should be the first fix, but the next step should be a code fix that either extracts MODULE_VERSION from MODULES_CMD or uses MODULES_CMD directly.
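
As a rough illustration of the "extract MODULE_VERSION from MODULES_CMD" idea, here is a minimal sketch. It is not payu's code: the function names are hypothetical, and it assumes MODULES_CMD always has the layout seen on gadi (`.../Modules/<version>/libexec/modulecmd.tcl`).

```python
import os
import re


def module_version_from_cmd(modules_cmd):
    """Return the version directory embedded in a MODULES_CMD path, or None.

    Assumes a layout like /opt/Modules/<version>/libexec/modulecmd.tcl,
    which is what gadi reports; other sites may differ.
    """
    match = re.search(r"/Modules/([^/]+)/libexec/", modules_cmd)
    return match.group(1) if match else None


def detect_module_version(env=os.environ):
    """Prefer an explicit MODULE_VERSION, else fall back to MODULES_CMD."""
    version = env.get("MODULE_VERSION")
    if version:
        return version
    return module_version_from_cmd(env.get("MODULES_CMD", ""))
```

With something like this, the raijin environment (MODULE_VERSION set) and the gadi environment (only MODULES_CMD set) could both yield a version string without the user exporting anything by hand.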

@aidanheerdegen
Collaborator

Once the above issue was solved another error occurred:

Traceback (most recent call last):
  File "/g/data3/hh5/public/apps/miniconda3/envs/analysis3-19.10/bin/payu-run", line 10, in <module>
    sys.exit(runscript())
  File "/g/data3/hh5/public/apps/miniconda3/envs/analysis3-19.10/lib/python3.6/site-packages/payu/subcommands/run_cmd.py", line 128, in runscript
    expt.run()
  File "/g/data3/hh5/public/apps/miniconda3/envs/analysis3-19.10/lib/python3.6/site-packages/payu/experiment.py", line 505, in run
    'libmpi.so'
  File "/g/data3/hh5/public/apps/miniconda3/envs/analysis3-19.10/lib/python3.6/site-packages/payu/envmod.py", line 102, in lib_update
    mod_name, mod_version = fsops.splitpath(lib_path)[2:4]
ValueError: not enough values to unpack (expected 2, got 0)

This is due to ldd not resolving dynamic library paths for my executable:

(Pdb) p slibs
['\tlinux-vdso.so.1 (0x00007ffc36f3f000)', '\tlibnetcdff.so.5 => not found', '\tlibnetcdf.so.7 => not found', '\tlibmpi_usempif08.so.11 => not found', '\tlibmpi_usempi_ignore_tkr.so.6 => not found', '\tlibmpi_mpifh.so.12 => not found', '\tlibmpi.so.12 => not found', '\tlibifport.so.5 => not found', '\tlibifcore.so.5 => not found', '\tlibimf.so => not found', '\tlibsvml.so => not found', '\tlibm.so.6 => /lib64/libm.so.6 (0x0000153a2660c000)', '\tlibintlc.so.5 => not found', '\tlibpthread.so.0 => /lib64/libpthread.so.0 (0x0000153a263ec000)', '\tlibc.so.6 => /lib64/libc.so.6 (0x0000153a26028000)', '\tlibgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x0000153a25e10000)', '\tlibdl.so.2 => /lib64/libdl.so.2 (0x0000153a25c0c000)', '\t/lib64/ld-linux-x86-64.so.2 (0x0000153a2698e000)', '']

Presumably the solution is to recompile on gadi to ensure all the libraries can be found. Will try this now.
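
For diagnosing this kind of failure earlier, a small helper that lists the unresolved libraries from ldd output might be useful. This is only a sketch with a hypothetical function name; it takes lines in the same form as the `slibs` list above and is not part of payu.

```python
def unresolved_libs(ldd_lines):
    """Return the names of libraries that ldd reported as 'not found'."""
    missing = []
    for line in ldd_lines:
        # Lines of interest look like: "\tlibmpi.so.12 => not found"
        name, sep, target = line.strip().partition(" => ")
        if sep and target.strip() == "not found":
            missing.append(name)
    return missing
```

Checking this list before unpacking the library path would let the run stop with a message naming the missing libraries, rather than with the ValueError shown in the traceback.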

@marshallward
Collaborator

Not sure if this is helpful or a distraction, but much of this code was based on an init script provided by the package:

/opt/Modules/3.2.6/init/python

It looks like the new script is here:

/opt/Modules/v4.3.0/init/python.py

If you can get that path sorted out, say via $MODULESHOME, it might help to make things more portable?
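
One way to make that lookup portable across the two layouts mentioned above is sketched here. The function name is hypothetical, and the only file names tried are the two quoted in this comment (`init/python` for 3.2.6 and `init/python.py` for v4.3.0).

```python
import os


def find_python_init(moduleshome):
    """Return the Python init script under moduleshome, or None if absent.

    Tries the v4 name (init/python.py) first, then the v3 name (init/python).
    """
    for name in ("python.py", "python"):
        candidate = os.path.join(moduleshome, "init", name)
        if os.path.isfile(candidate):
            return candidate
    return None
```

A caller would pass in the value of $MODULESHOME and handle a `None` result explicitly instead of assuming a fixed path.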

@penguian
Author

penguian commented Nov 19, 2019

In config.yaml I already had

qsub_flags: -v MODULE_VERSION=v4.3.0 -lstorage=scratch/fp0+gdata/fp0 -lother=hyperthread -W umask=027

when I got the error reported above. I am running again to double check. My config.yaml is at

gadi:/scratch/fp0/pcl900/access-om2-gadi/control/1deg_jra55_ryf/config.yaml

@penguian
Author

payu run now results in

[pcl900@gadi-login-03 1deg_jra55_ryf]$ more  1deg_jra55_ryf.e148947
Traceback (most recent call last):
  File "/g/data3/hh5/public/apps/miniconda3/envs/analysis3-19.07/bin/payu-run", line 12, in <module>
    sys.exit(runscript())
  File "/home/900/pcl900/.local/lib/python3.6/site-packages/payu-1.0-py3.6.egg/payu/subcommands/run_cmd.py", line 116, in runscript
    run_args.lab_path)
  File "/home/900/pcl900/.local/lib/python3.6/site-packages/payu-1.0-py3.6.egg/payu/laboratory.py", line 30, in __init__
    raise ValueError('Cannot determine model type.')
ValueError: Cannot determine model type.

@aidanheerdegen
Collaborator

Hi @penguian,

That error suggests there is no

model: access-om2

line in your config.yaml file

https://github.com/COSIMA/1deg_jra55_ryf/blob/master/config.yaml#L12

I guess you know you're running payu from your ~/.local directory?

@penguian
Author

I tried again, and now the result is

[pcl900@gadi-login-03 1deg_jra55_ryf]$ cat 1deg_jra55_ryf.e148960
/g/data3/hh5/public/apps/miniconda3/envs/analysis3-19.07/lib/python3.6/site-packages/yamanifest/manifest.py:99: YAMLLoadWarning: calling yaml.load_all() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.
  self.header, self.data = yaml.load_all(file)
Traceback (most recent call last):
  File "/g/data3/hh5/public/apps/miniconda3/envs/analysis3-19.07/bin/payu-run", line 12, in <module>
    sys.exit(runscript())
  File "/home/900/pcl900/.local/lib/python3.6/site-packages/payu-1.0-py3.6.egg/payu/subcommands/run_cmd.py", line 128, in runscript
    expt.run()
  File "/home/900/pcl900/.local/lib/python3.6/site-packages/payu-1.0-py3.6.egg/payu/experiment.py", line 441, in run
    envmod.setup()
  File "/home/900/pcl900/.local/lib/python3.6/site-packages/payu-1.0-py3.6.egg/payu/envmod.py", line 40, in setup
    with open(module_initpath) as initpaths:
FileNotFoundError: [Errno 2] No such file or directory: '/opt/Modules/v4.3.0/init/.modulespath'

@aidanheerdegen
Collaborator

I have managed to run an MITgcm simulation successfully (a small test configuration that does not have to be submitted to the queue) once I recompiled against the libraries available on gadi.

@penguian, can you try altering your PATH to pick up the conda/analysis3 version of payu, so we can rule out any issues with a differing codebase?

@penguian
Author

The result is:

[pcl900@gadi-login-03 1deg_jra55_ryf]$ module use /g/data3/hh5/public/modules
[pcl900@gadi-login-03 1deg_jra55_ryf]$ module load conda/analysis3
[pcl900@gadi-login-03 1deg_jra55_ryf]$ which payu
/g/data3/hh5/public/apps/miniconda3/envs/analysis3-19.07/bin/payu
[pcl900@gadi-login-03 1deg_jra55_ryf]$ payu sweep
Moving log 1deg_jra55_ryf.e148974
Moving log 1deg_jra55_ryf.o148974
Removing work path /scratch/fp0/pcl900/access-om2/work/1deg_jra55_ryf
Removing symlink /scratch/fp0/pcl900/access-om2-gadi/control/1deg_jra55_ryf/work
[pcl900@gadi-login-03 1deg_jra55_ryf]$ payu run
qsub -q normal -P fp0 -l walltime=14400 -l ncpus=288 -l mem=500GB -N 1deg_jra55_ryf -l wd -j n -v LD_LIBRARY_PATH=/g/data3/hh5/public/apps/miniconda3/envs/analysis3-19.07/lib,PAYU_PATH=/g/data3/hh5/public/apps/miniconda3/envs/analysis3-19.07/bin -v MODULE_VERSION=v4.3.0 -lstorage=scratch/fp0+gdata/fp0 -lother=hyperthread -W umask=027 -- /g/data3/hh5/public/apps/miniconda3/envs/analysis3-19.07/bin/python /g/data3/hh5/public/apps/miniconda3/envs/analysis3-19.07/bin/payu-run
148975.gadi-pbs
[pcl900@gadi-login-03 1deg_jra55_ryf]$ cat 1deg_jra55_ryf.e148975
/g/data3/hh5/public/apps/miniconda3/envs/analysis3-19.07/lib/python3.6/site-packages/yamanifest/manifest.py:99: YAMLLoadWarning: calling yaml.load_all() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.
  self.header, self.data = yaml.load_all(file)
Traceback (most recent call last):
  File "/g/data3/hh5/public/apps/miniconda3/envs/analysis3-19.07/bin/payu-run", line 12, in <module>
    sys.exit(runscript())
  File "/g/data3/hh5/public/apps/miniconda3/envs/analysis3-19.07/lib/python3.6/site-packages/payu/subcommands/run_cmd.py", line 128, in runscript
    expt.run()
  File "/g/data3/hh5/public/apps/miniconda3/envs/analysis3-19.07/lib/python3.6/site-packages/payu/experiment.py", line 443, in run
    envmod.setup()
  File "/g/data3/hh5/public/apps/miniconda3/envs/analysis3-19.07/lib/python3.6/site-packages/payu/envmod.py", line 40, in setup
    with open(module_initpath) as initpaths:
FileNotFoundError: [Errno 2] No such file or directory: '/opt/Modules/v4.3.0/init/.modulespath'

Perhaps I have misconfigured?

@aidanheerdegen
Collaborator

That code branch is only called if MODULE_VERSION is not defined

with open(module_initpath) as initpaths:

Can you set this and try again?

export MODULE_VERSION=v4.3.0

@penguian
Author

The result is:

[pcl900@gadi-login-03 1deg_jra55_ryf]$ export set MODULE_VERSION=v4.3.0
[pcl900@gadi-login-03 1deg_jra55_ryf]$ payu sweep
[pcl900@gadi-login-03 1deg_jra55_ryf]$ payu run
qsub -q normal -P fp0 -l walltime=14400 -l ncpus=288 -l mem=500GB -N 1deg_jra55_ryf -l wd -j n -v LD_LIBRARY_PATH=/g/data3/hh5/public/apps/miniconda3/envs/analysis3-19.07/lib,PAYU_PATH=/g/data3/hh5/public/apps/miniconda3/envs/analysis3-19.07/bin -v MODULE_VERSION=v4.3.0 -lstorage=scratch/fp0+gdata/fp0 -lother=hyperthread -W umask=027 -- /g/data3/hh5/public/apps/miniconda3/envs/analysis3-19.07/bin/python /g/data3/hh5/public/apps/miniconda3/envs/analysis3-19.07/bin/payu-run
148978.gadi-pbs
[pcl900@gadi-login-03 1deg_jra55_ryf]$ cat 1deg_jra55_ryf.e148978
/g/data3/hh5/public/apps/miniconda3/envs/analysis3-19.07/lib/python3.6/site-packages/yamanifest/manifest.py:99: YAMLLoadWarning: calling yaml.load_all() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.
  self.header, self.data = yaml.load_all(file)
Traceback (most recent call last):
  File "/g/data3/hh5/public/apps/miniconda3/envs/analysis3-19.07/bin/payu-run", line 12, in <module>
    sys.exit(runscript())
  File "/g/data3/hh5/public/apps/miniconda3/envs/analysis3-19.07/lib/python3.6/site-packages/payu/subcommands/run_cmd.py", line 128, in runscript
    expt.run()
  File "/g/data3/hh5/public/apps/miniconda3/envs/analysis3-19.07/lib/python3.6/site-packages/payu/experiment.py", line 443, in run
    envmod.setup()
  File "/g/data3/hh5/public/apps/miniconda3/envs/analysis3-19.07/lib/python3.6/site-packages/payu/envmod.py", line 40, in setup
    with open(module_initpath) as initpaths:
FileNotFoundError: [Errno 2] No such file or directory: '/opt/Modules/v4.3.0/init/.modulespath'

@penguian
Author

In the setup() code, the check on line 38 is for 'MODULEPATH', not MODULE_VERSION.

@aidanheerdegen
Collaborator

Yep, but MODULEPATH is set above that, and uses MODULE_VERSION

moduleshome = os.path.join(basepath, module_version)

@penguian
Author

No, that is moduleshome. The environment variable MODULEPATH is not set until line 46, which is after the FileNotFoundError.

@aidanheerdegen
Collaborator

You're right, sorry.

So you don't have MODULEPATH defined by the looks of it. I wonder why?

@penguian
Author

It is because MODULEPATH is defined by the Environment Modules TCL code on Gadi rather than the way it is done on Raijin. See /opt/Modules/v4.3.0/init/python.py and /opt/Modules/v4.3.0/libexec/modulecmd.tcl.
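
A defensive version of that lookup could avoid the FileNotFoundError by reading `init/.modulespath` only when it exists. The sketch below is hypothetical and is not the code in payu/envmod.py; it also assumes the 3.2.6-style file format of one path per line with `#` comments.

```python
import os


def default_modulepath(moduleshome, env=os.environ):
    """Return MODULEPATH from the environment, else from init/.modulespath.

    Returns None when neither source is available, so the caller can
    decide how to proceed instead of crashing on a missing file.
    """
    if env.get("MODULEPATH"):
        return env["MODULEPATH"]
    initpath = os.path.join(moduleshome, "init", ".modulespath")
    if not os.path.isfile(initpath):
        return None
    paths = []
    with open(initpath) as initpaths:
        for line in initpaths:
            # Drop trailing comments and surrounding whitespace.
            entry = line.split("#", 1)[0].strip()
            if entry:
                paths.append(entry)
    return ":".join(paths)
```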

@penguian
Author

On Raijin, the script /etc/profile.d/nf_sh_modules is run on login, and this calls $modules_path/init/bash, setting up the environment variables. On Gadi, this script does not exist, but there is a script /etc/profile.d/modules.sh which should do something similar. I will check.

@penguian
Author

Logging in to Gadi, I see:

Last login: Mon Nov 18 14:07:27 2019 from 150.203.248.245
[pcl900@gadi-login-01 ~]$ echo $MODULESHOME
/opt/Modules/v4.3.0
[pcl900@gadi-login-01 ~]$ echo $MODULEPATH
/apps/Modules/restricted-modulefiles/z00:/opt/Modules/modulefiles:/opt/Modules/v4.3.0/modulefiles:/apps/Modules/modulefiles
[pcl900@gadi-login-01 ~]$ echo $MODULE_VERSION

That is, MODULEPATH and MODULESHOME are defined in the interactive login shell, but MODULE_VERSION is not.

@penguian
Author

I succeeded in starting the model by setting:

qsub_flags: -v MODULE_VERSION=v4.3.0,MODULEPATH=/g/data3/hh5/public/modules:/apps/Modules/restricted-modulefiles/z00:/opt/Modules/modulefiles:/opt/Modules/v4.3.0/modulefiles:/apps/Modules/modulefiles -lstorage=scratch/fp0+gdata/fp0 -lother=hyperthread -W umask=027

in config.yaml. The following also works, as long as MODULE_VERSION and MODULEPATH are defined in the interactive shell:

qsub_flags: -V -lstorage=scratch/fp0+gdata/fp0 -lother=hyperthread -W umask=027

See /scratch/fp0/pcl900/access-om2-gadi/control/1deg_jra55_ryf/config.yaml

(FYI, but not relevant to payu: I also copied /g/data1/ua8/JRA55-do/RYF to /scratch/fp0/pcl900/ and adjusted atmosphere/forcing.json to match, as well as setting ncpus for ocean back to 216 in config.yaml. The model now fails with SIGSEGV in ice_transport_remap.f90.)

@aidanheerdegen
Collaborator

This fix only worked for payu-run. With payu run it fails, because payu passes only a limited environment to the PBS job, so none of the module environment variables were passed in.

Paul's work-around

qsub_flags: -V 

ensures the environment is fully populated, so it works in that case.

This needs another fix to properly populate the PBS environment with the module environment variables.
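
As a sketch of what such a fix might look like, the helper below builds a `-v` flag from whichever module variables are set in the submitting shell. The function name and the list of variables are assumptions for illustration (the variables are the ones mentioned in this thread), and this is not how payu actually builds its qsub command.

```python
import os

# Module-related variables mentioned in this thread.
MODULE_VARS = ("MODULE_VERSION", "MODULESHOME", "MODULEPATH", "MODULES_CMD")


def module_env_flag(env=os.environ, names=MODULE_VARS):
    """Build a 'qsub -v' flag from the module variables that are set.

    Returns an empty string when none of the variables are defined.
    Values containing commas would need extra quoting, which this
    sketch does not attempt.
    """
    pairs = ["{}={}".format(name, env[name]) for name in names if env.get(name)]
    return "-v " + ",".join(pairs) if pairs else ""
```

This is narrower than `-V`, which exports the whole interactive environment to the job, and it avoids hard-coding MODULEPATH in config.yaml.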
