Inconsistent state when experiment_uuid deleted #436

Closed
aidanheerdegen opened this issue Apr 9, 2024 · 3 comments

@aidanheerdegen
Collaborator

I managed to get a payu control directory into an inconsistent state by deleting the experiment_uuid from metadata.yaml and then running payu setup.

The payu version is the 431-Recover-from-incomplete-checkout branch from PR #435.

Details

I cloned a standard release config:

$ ~/.local/bin/payu clone -B release-1deg_jra55_ryf https://github.com/ACCESS-NRI/access-om2-configs.git 1deg_jra55_ryf
Cloned repository from https://github.com/ACCESS-NRI/access-om2-configs.git to directory: /tmp/lala/1deg_jra55_ryf
Checked out branch: release-1deg_jra55_ryf       
laboratory path:  /scratch/tm70/aph502/access-om2                                                                                                                
binary path:  /scratch/tm70/aph502/access-om2/bin
input path:  /scratch/tm70/aph502/access-om2/input
work path:  /scratch/tm70/aph502/access-om2/work
archive path:  /scratch/tm70/aph502/access-om2/archive
Updated metadata. Experiment UUID: 357368bb-2c6f-4b33-a7d5-38a8a2d6ab69
Added archive symlink to /scratch/tm70/aph502/access-om2/archive/1deg_jra55_ryf-release-1deg_jra55_ryf-357368bb
To change directory to control directory run:
  cd 1deg_jra55_ryf
$ cd 1deg_jra55_ryf/

So it has generated a new UUID:

$ ~/.local/bin/payu branch
* Current Branch: release-1deg_jra55_ryf
    experiment_uuid: 357368bb-2c6f-4b33-a7d5-38a8a2d6ab69
$ ls -ld archive
lrwxrwxrwx 1 aph502 tm70 86 Apr  9 22:32 archive -> /scratch/tm70/aph502/access-om2/archive/1deg_jra55_ryf-release-1deg_jra55_ryf-357368bb                       
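
Judging from the output above, the archive directory name looks like the control directory name, the branch name, and the first eight characters of the UUID joined with hyphens. A minimal sketch of that pattern follows. It is inferred from this log only, not taken from payu's source, and `experiment_name` is a hypothetical helper:

```python
def experiment_name(control_dir_name: str, branch: str,
                    experiment_uuid: str, short_len: int = 8) -> str:
    """Build the directory name seen in the log above.

    Assumption: the name is "<control dir>-<branch>-<short uuid>", where the
    short uuid is the first `short_len` characters of the full UUID. This is
    inferred from the issue's output, not from payu's implementation.
    """
    return f"{control_dir_name}-{branch}-{experiment_uuid[:short_len]}"


if __name__ == "__main__":
    print(experiment_name(
        "1deg_jra55_ryf",
        "release-1deg_jra55_ryf",
        "357368bb-2c6f-4b33-a7d5-38a8a2d6ab69",
    ))
```

With the values from the clone above, this reproduces the final component of the archive symlink target.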

At this stage it looks fine. I then removed the experiment_uuid from metadata.yaml and tried to run payu setup:

[aph502@gadi-login-02 1deg_jra55_ryf]$ payu setup
laboratory path:  /scratch/tm70/aph502/access-om2
binary path:  /scratch/tm70/aph502/access-om2/bin
input path:  /scratch/tm70/aph502/access-om2/input
work path:  /scratch/tm70/aph502/access-om2/work
archive path:  /scratch/tm70/aph502/access-om2/archive
payu: error: work path already exists: /scratch/tm70/aph502/access-om2/work/1deg_jra55_ryf.                                                                      
             payu sweep and then payu run

So it now thinks it is a legacy experiment and finds an existing work directory.

Force it to sweep and create a new work directory:

$ payu setup -f
laboratory path:  /scratch/tm70/aph502/access-om2
binary path:  /scratch/tm70/aph502/access-om2/bin
input path:  /scratch/tm70/aph502/access-om2/input
work path:  /scratch/tm70/aph502/access-om2/work
archive path:  /scratch/tm70/aph502/access-om2/archive
payu: work path already exists.
      Sweeping as --force option is True.
Removing work path /scratch/tm70/aph502/access-om2/work/1deg_jra55_ryf
Loading input manifest: manifests/input.yaml
Loading restart manifest: manifests/restart.yaml
Loading exe manifest: manifests/exe.yaml
Setting up atmosphere
Setting up ocean
Setting up ice
Setting up access-om2
Checking exe and input manifests
File no longer in input directory: work/ice/RESTART/i2o.nc removing from manifest                                                                                
File no longer in input directory: work/ice/RESTART/monthly_sstsss.nc removing from manifest                                                                     
File no longer in input directory: work/ice/RESTART/u_star.nc removing from manifest                                                                             
File no longer in input directory: work/ice/RESTART/kmt.nc removing from manifest                                                                                
File no longer in input directory: work/ice/RESTART/grid.nc removing from manifest                                                                               
File no longer in input directory: work/ice/RESTART/o2i.nc removing from manifest                                                                                
Creating restart manifest
Updating full hashes for 181 files in manifests/restart.yaml
Writing manifests/input.yaml
Writing manifests/restart.yaml

Now archive is still pointing at the same location as at the beginning, but work is pointing to the legacy location:

$ ls -l
total 48
-rw-r----- 1 aph502 tm70   861 Apr  9 22:32 accessom2.nml
lrwxrwxrwx 1 aph502 tm70    86 Apr  9 22:32 archive -> /scratch/tm70/aph502/access-om2/archive/1deg_jra55_ryf-release-1deg_jra55_ryf-357368bb                    
drwxr-x--- 2 aph502 tm70    80 Apr  9 22:32 atmosphere
-rw-r----- 1 aph502 tm70  5173 Apr  9 22:32 config.yaml
drwxr-x--- 2 aph502 tm70    60 Apr  9 22:32 doc
drwxr-x--- 2 aph502 tm70   120 Apr  9 22:32 ice
-rw-r----- 1 aph502 tm70 18657 Apr  9 22:32 LICENSE
drwxr-x--- 2 aph502 tm70   100 Apr  9 22:32 manifests
-rw-r----- 1 aph502 tm70  2219 Apr  9 22:35 metadata.yaml
-rw-r----- 1 aph502 tm70  7904 Apr  9 22:32 namcouple
drwxr-x--- 2 aph502 tm70   120 Apr  9 22:32 ocean
-rw-r----- 1 aph502 tm70  1367 Apr  9 22:32 README.md
drwxr-x--- 3 aph502 tm70    60 Apr  9 22:32 testing
drwxr-x--- 2 aph502 tm70   100 Apr  9 22:32 tools
lrwxrwxrwx 1 aph502 tm70    51 Apr  9 22:35 work -> /scratch/tm70/aph502/access-om2/work/1deg_jra55_ryf

Now it might be that the answer is "don't do that", but it is quite a confusing situation to get into with a relatively small change.
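
The mismatch in the listing above (archive pointing at a UUID-suffixed directory, work pointing at the legacy name) can be spotted by comparing the last path component of the two symlink targets. The sketch below assumes that a consistent control directory has `archive` and `work` links whose targets share the same final component. Both function names are hypothetical and none of this is payu code:

```python
import os
from pathlib import Path
from typing import Optional


def symlink_target_name(link: Path) -> Optional[str]:
    """Return the final path component of a symlink's target.

    Returns None when `link` is not a symlink. The target does not need to
    exist, so dangling links are handled too.
    """
    if not link.is_symlink():
        return None
    return Path(os.readlink(link)).name


def links_are_consistent(control_dir: Path) -> bool:
    """True when `archive` and `work` point at identically named directories.

    Assumption: matching names is the consistency rule; this is an
    illustration only.
    """
    archive_name = symlink_target_name(control_dir / "archive")
    work_name = symlink_target_name(control_dir / "work")
    return archive_name is not None and archive_name == work_name


if __name__ == "__main__":
    print(links_are_consistent(Path.cwd()))
```

Run from the control directory in the state shown above, this would report the links as inconsistent, since the two target names differ.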

@jo-basevi
Collaborator

Do you know if ~/.local/bin/payu and the payu command were pointing to the same location? There are a few lines that I would have expected to be logged if the payu version was 431-Recover-from-incomplete-checkout.

E.g. with the recently merged changes, if it finds an archive (legacy or not) it will always log the archive path:

Found experiment archive: /scratch/tm70/jb4202/access-om2/archive/test_inconsistent_state-release-1deg_jra55_ryf-14667dab

or a warning if there's no UUID in metadata (in the example below, the UUID has been removed from metadata.yaml and a legacy archive exists with no corresponding metadata):

jb4202@gadi-login-03 ~/test_models/incomplete_checkout/test_inconsistent_state (release-1deg_jra55_ryf)$ payu setup
laboratory path:  /scratch/tm70/jb4202/access-om2
binary path:  /scratch/tm70/jb4202/access-om2/bin
input path:  /scratch/tm70/jb4202/access-om2/input
work path:  /scratch/tm70/jb4202/access-om2/work
archive path:  /scratch/tm70/jb4202/access-om2/archive
/home/189/jb4202/payu_fork/payu/metadata.py:125: MetadataWarning: No experiment uuid found in metadata. Generating a new uuid
  warnings.warn("No experiment uuid found in metadata. "
Found experiment archive: /scratch/tm70/jb4202/access-om2/archive/test_inconsistent_state
Updated metadata. Experiment UUID: a68e2ad3-1547-4b87-a95d-09163c51dcfd
payu: error: work path already exists: /scratch/tm70/jb4202/access-om2/work/test_inconsistent_state.
             payu sweep and then payu run

or, on subsequent setups:

$ payu setup
laboratory path:  /scratch/tm70/jb4202/access-om2
binary path:  /scratch/tm70/jb4202/access-om2/bin
input path:  /scratch/tm70/jb4202/access-om2/input
work path:  /scratch/tm70/jb4202/access-om2/work
archive path:  /scratch/tm70/jb4202/access-om2/archive
Found experiment archive: /scratch/tm70/jb4202/access-om2/archive/test_inconsistent_state
payu: error: work path already exists: /scratch/tm70/jb4202/access-om2/work/test_inconsistent_state.
             payu sweep and then payu run

@aidanheerdegen
Collaborator Author

So sorry to have wasted your time. Yes you are correct, I was using inconsistent versions of payu when testing.

If I redo those commands but use ~/.local/bin/payu setup instead, it works as you say, and as I'd expect, generating a new UUID and relinking archive and work to point at the new experiment directories:

$ ~/.local/bin/payu setup
laboratory path:  /scratch/tm70/aph502/access-om2
binary path:  /scratch/tm70/aph502/access-om2/bin
input path:  /scratch/tm70/aph502/access-om2/input
work path:  /scratch/tm70/aph502/access-om2/work
archive path:  /scratch/tm70/aph502/access-om2/archive
/home/502/aph502/code/python/payu/payu/metadata.py:125: MetadataWarning: No experiment uuid found in metadata. Generating a new uuid
  warnings.warn("No experiment uuid found in metadata. "
Mismatch of UUIDs between metadata and an archive metadata found at: /scratch/tm70/aph502/access-om2/archive/1deg_jra55_ryf/metadata.yaml
/home/502/aph502/code/python/payu/payu/metadata.py:180: MetadataWarning: No pre-existing archive found. Generating a new uuid
  warnings.warn(
Updated metadata. Experiment UUID: d05616b0-e65b-4d71-a3cc-d4672ccb5e17
Loading input manifest: manifests/input.yaml
Loading restart manifest: manifests/restart.yaml
Loading exe manifest: manifests/exe.yaml
Making exe links
Setting up atmosphere
Setting up ocean
Setting up ice
Setting up access-om2
Checking exe and input manifests
Creating restart manifest
Writing manifests/restart.yaml
$ ls -ld work archive
lrwxrwxrwx 1 aph502 tm70 86 Apr 10 16:05 archive -> /scratch/tm70/aph502/access-om2/archive/1deg_jra55_ryf-release-1deg_jra55_ryf-383acc4b
lrwxrwxrwx 1 aph502 tm70 83 Apr 10 16:05 work -> /scratch/tm70/aph502/access-om2/work/1deg_jra55_ryf-release-1deg_jra55_ryf-d05616b0

Thanks for looking into that and finding my mistake @jo-basevi. Glad it was a stuff up on my part and not a code issue.
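
One way to catch this kind of mix-up is to check whether a bare `payu` on the PATH resolves to the same file as the explicit `~/.local/bin/payu`. The sketch below uses only the Python standard library. The helper names are hypothetical, and it compares launcher files only, so two different launchers that load the same installed package would still be reported as different:

```python
import os
import shutil
from typing import Optional


def resolve_executable(name_or_path: str) -> Optional[str]:
    """Resolve a command name or path to the real file that would run.

    `~` is expanded, the PATH is searched for bare names, and symlinks are
    followed. Returns None if nothing executable is found.
    """
    found = shutil.which(os.path.expanduser(name_or_path))
    return os.path.realpath(found) if found else None


def same_executable(first: str, second: str) -> bool:
    """True when both commands resolve to the same file on disk."""
    a = resolve_executable(first)
    b = resolve_executable(second)
    return a is not None and a == b


if __name__ == "__main__":
    for cmd in ("payu", "~/.local/bin/payu"):
        print(cmd, "->", resolve_executable(cmd))
    print("same:", same_executable("payu", "~/.local/bin/payu"))
```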

@jo-basevi
Collaborator

The archive symlink is still pointing towards an old archive. It will use the new archive and set up the symlink again when archive runs as part of payu run, or on payu checkout $SAME_BRANCH. We could add a line to payu setup to set up the archive symlink?
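
As a rough illustration of what such a step might do, the sketch below re-points an existing symlink at a new target and leaves it alone when it is already correct. `refresh_symlink` is a hypothetical helper written for this comment, not an existing payu function, and the real change would need to fit payu's own experiment and metadata handling:

```python
import os
from pathlib import Path


def refresh_symlink(link: Path, target: Path) -> bool:
    """Make `link` a symlink to `target`.

    Returns True if the link was created or changed, False if it already
    pointed at `target`. Refuses to replace anything that is not a symlink,
    so a real directory named `archive` is never removed.
    """
    if link.is_symlink():
        if os.readlink(link) == str(target):
            return False  # already up to date
        link.unlink()  # drop the stale link only, never its target
    elif link.exists():
        raise FileExistsError(f"{link} exists and is not a symlink")
    link.symlink_to(target)
    return True
```

Calling it with the control directory's `archive` path and the new experiment's archive directory would replace the stale link from the listing above.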
