
Checksum errors keep popping up #190

Closed
navidcy opened this issue Jun 9, 2019 · 18 comments

Comments

@navidcy
Contributor

navidcy commented Jun 9, 2019

I keep getting errors of this sort:

FATAL from PE     0: MOM_restart(restore_state): Checksum of input field DTBT 4027A842B7D0AE91 does not match value 1EFB741F9B086148 stored in INPUT/MOM.res.nc

See, e.g., job 9363751 in /home/552/nc3020/SOchanBcBtEddySat/layer2/layer2_tau5e-0_manyshortridges. (Possibly the logs are archived because I swept.)
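For reference, here is a minimal sketch of how to see what reference value the restart actually carries for DTBT, assuming Python's netCDF4 module is available and that FMS stores it as a per-variable checksum attribute (the attribute name is an assumption based on the error message):

    from netCDF4 import Dataset

    # Open the restart read-only and print every attribute attached to DTBT,
    # so the stored checksum string can be compared against the FATAL message.
    with Dataset("INPUT/MOM.res.nc") as nc:
        dtbt = nc.variables["DTBT"]
        for name in dtbt.ncattrs():
            print(name, "=", dtbt.getncattr(name))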

@navidcy
Contributor Author

navidcy commented Jun 9, 2019

Also /home/552/nc3020/SOchanBcBtEddySat/layer2/layer2_tau8e-0_manyshortridgesCorrectTopo/

Is it something I did? It seems impossible to restart my experiments unless I play dirty, go into the restart .nc file, and delete the checksum from the DTBT variable...
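(For the record, the "play dirty" workaround amounts to dropping that stored attribute for this one field. A rough sketch, again assuming Python's netCDF4 module and that the value lives in a per-variable checksum attribute; work on a backup copy:)

    from netCDF4 import Dataset

    # Open the restart in place and remove the stored checksum for DTBT only.
    with Dataset("INPUT/MOM.res.nc", "r+") as nc:
        dtbt = nc.variables["DTBT"]
        if "checksum" in dtbt.ncattrs():
            dtbt.delncattr("checksum")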

@marshallward
Collaborator

marshallward commented Jun 9, 2019

Was the restart created on the same machine? It could be a little/big-endian issue. I think FMS just writes out the bytes via a write '(Z16)' my_field conversion.

It's probably something simpler, but that's all that comes to mind at the moment. I agree that usually this only happens when one manually manipulates the fields in, say, another program.
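To illustrate the endian point: the stored string is just a 64-bit pattern printed as 16 hex digits, so reading the same bytes in the opposite byte order gives a completely different string. A rough Python analogue of that write '(Z16)' for a single value (not the actual FMS checksum over a whole field):

    import struct

    x = 1.0  # any double-precision value
    as_written = struct.unpack("<Q", struct.pack("<d", x))[0]    # intended bit pattern
    byte_swapped = struct.unpack(">Q", struct.pack("<d", x))[0]  # same bytes, opposite order
    print(f"{as_written:016X}")    # 3FF0000000000000
    print(f"{byte_swapped:016X}")  # 000000000000F03F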

@marshallward
Collaborator

There is a flag that disables checksums, by the way; if you can't find it, I'll have a look a bit later. Obviously not recommended in general...

@aidanheerdegen
Collaborator

aidanheerdegen commented Jun 9, 2019 via email

@marshallward
Collaborator

Seems more of an FMS issue, but probably best to start with MOM6.

I'm guessing you have confirmed that this happens from a manual execution, and is not a payu problem?

@navidcy
Contributor Author

navidcy commented Jun 9, 2019

@aidanheerdegen, sure, I can do that. I just want to make sure that this is not payu related.

[Either way, it's possibly @marshallward who will be attending to the MOM6 issue anyway... ha :)]

@navidcy
Contributor Author

navidcy commented Jun 9, 2019

@marshallward, no, I haven't done that because, actually, I don't know how to do a manual restart without payu! :) I'll try that and get back to you.

@marshallward
Collaborator

Oh! Did you do a code update? I think the mpp_checksum default method has changed to the new Bob/Alistair method. (Or should I say Hallberg-Adcroft? Adcroft-Hallberg?)

If this is what's happening, then I do think the checksum will be different.

@navidcy
Contributor Author

navidcy commented Jun 9, 2019

I recompiled my executable 2 months ago (early Apr...).
I believe I took the MOM6 code from commit NOAA-GFDL/MOM6@d93b047 or the one before/after it.

Did the mpp_checksum method change before that?

@marshallward
Collaborator

I could have sworn that I had seen the method in the latest FMS. But now that I'm looking, all I can see is the old version, so I guess I must have been mistaken. Really sorry about that.

The only other idea I can see after looking at the code is that there is a sensitivity to the expected fill value. If, say, the CF fill value (1e20) got changed to something like the netCDF default fill value (roughly 9.97e36), then it could cause problems.
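A quick way to check what fill values the field actually carries, sketched in Python with the netCDF4 module; the attribute names are just the usual CF/netCDF ones and may simply be absent:

    import numpy as np
    from netCDF4 import Dataset

    # Report any fill/missing-value attributes on DTBT and count elements
    # sitting at the CF-style 1e20 fill value.
    with Dataset("INPUT/MOM.res.nc") as nc:
        dtbt = nc.variables["DTBT"]
        for att in ("_FillValue", "missing_value"):
            if att in dtbt.ncattrs():
                print(att, "=", dtbt.getncattr(att))
        vals = np.ma.filled(dtbt[:], 0.0)
        print("values at 1e20:", np.count_nonzero(np.isclose(vals, 1e20)))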

But I'm really reaching here. Probably best to just

  1. Confirm that you can reproduce the original restart with the bad checksum
  2. Confirm that the allegedly "bad" restart gives the checksum error

and to do both of these independently of payu.

@navidcy
Contributor Author

navidcy commented Jun 10, 2019

Or should I say Hallberg-Adcroft? Adcroft-Hallberg?

How about "Hallcroft method"?

@marshallward
Collaborator

@adcroft gave the very wise suggestion of searching for older errors, which came up with this:

https://github.com/NOAA-GFDL/MOM6/issues/824

It seems this might be a bug with checksums on 1d arrays, such as a dynamic dtbt timestep. Did you change the number of nodes?

Anyway, I will ask @MJHarrison-GFDL when he comes in, since he seems to have figured it out last time. (If it's still happening then we might want to reopen this in MOM6, but let's confirm first.)

In the meantime, it seems OK for you to set RESTART_CHECKSUMS_REQUIRED = False in MOM_input.
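That is, something like the following line in MOM_input (the name is as suggested above; worth confirming against one of the MOM_parameter_doc files for your executable):

    RESTART_CHECKSUMS_REQUIRED = False   ! skip verification of restart field checksums on read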

@navidcy
Contributor Author

navidcy commented Jun 10, 2019

@marshallward, yes, I probably changed the number of nodes, since I'm systematically switching between submitting to the normal queue and to normalsl or normalbw... Does this info help identify the issue?

@aidanheerdegen
Collaborator

aidanheerdegen commented Jun 11, 2019

We tried this:

&fms_io_nml
    checksum_required=.false.
/

but it still complained about that field (no others). I couldn't find RESTART_CHECKSUMS_REQUIRED in the code. Is that a typo?

That is more of an FMS issue I guess.

@marshallward
Collaborator

marshallward commented Jun 11, 2019

@aidanheerdegen This looks like a MOM6 config rather than an FMS config; MOM6 actually overrides save_restart, for example. I suspect the FMS one is doing nothing here.

Try adding this setting to MOM_input. (Check one of the parameter doc files to confirm the spelling.)

@marshallward
Collaborator

@navidcy yes, it's useful info, and it matches the other issue. I haven't noticed it myself, but it looks like the original issue was never resolved and may still be present.

Hopefully I will get some time soon to look into it.

@navidcy
Contributor Author

navidcy commented Jun 11, 2019

@marshallward, adding RESTART_CHECKSUMS_REQUIRED = False to MOM_input did the job.

@marshallward
Collaborator

marshallward commented Jun 11, 2019

Sounds good, thanks for letting me know. I guess we can close this, but can you keep the experiment aside somewhere on raijin? I will grab it and try to reproduce the problem, and hopefully find a fix for it this time.
