
Checksum errors keep popping up #190

Closed
navidcy opened this issue Jun 9, 2019 · 18 comments

Comments

@navidcy
Contributor

navidcy commented Jun 9, 2019

I keep getting errors of this sort:

FATAL from PE     0: MOM_restart(restore_state): Checksum of input field DTBT 4027A842B7D0AE91 does not match value 1EFB741F9B086148 stored in INPUT/MOM.res.nc

See, e.g., job 9363751 in /home/552/nc3020/SOchanBcBtEddySat/layer2/layer2_tau5e-0_manyshortridges. (Possibly the logs are archived because I swept.)
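For reference, here is a minimal sketch of how to see what reference value the restart actually carries for DTBT, assuming Python's netCDF4 module is available and that FMS stores it as a per-variable checksum attribute (the attribute name is an assumption based on the error message):

    from netCDF4 import Dataset

    # Open the restart read-only and print every attribute attached to DTBT,
    # so the stored checksum string can be compared against the FATAL message.
    with Dataset("INPUT/MOM.res.nc") as nc:
        dtbt = nc.variables["DTBT"]
        for name in dtbt.ncattrs():
            print(name, "=", dtbt.getncattr(name))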

@navidcy
Contributor Author

navidcy commented Jun 9, 2019

Also /home/552/nc3020/SOchanBcBtEddySat/layer2/layer2_tau8e-0_manyshortridgesCorrectTopo/

Is it something I did? It seems impossible to restart my experiments unless I play dirty, go into the restart .nc file, and delete the checksum from the DTBT variable...
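(For the record, the "play dirty" workaround amounts to dropping that stored attribute for this one field. A rough sketch, again assuming Python's netCDF4 module and that the value lives in a per-variable checksum attribute; work on a backup copy:)

    from netCDF4 import Dataset

    # Open the restart in place and remove the stored checksum for DTBT only.
    with Dataset("INPUT/MOM.res.nc", "r+") as nc:
        dtbt = nc.variables["DTBT"]
        if "checksum" in dtbt.ncattrs():
            dtbt.delncattr("checksum")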

@marshallward
Collaborator

marshallward commented Jun 9, 2019

Was the restart created on the same machine? It could be a little/big-endian issue. I think FMS just writes out the bytes via a write '(Z16)' my_field conversion.

It's probably something simpler, but that's all that comes to mind at the moment. I agree that usually this only happens when one manually manipulates the fields in, say, another program.
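To illustrate the endian point: the stored string is just a 64-bit pattern printed as 16 hex digits, so reading the same bytes in the opposite byte order gives a completely different string. A rough Python analogue of that write '(Z16)' for a single value (not the actual FMS checksum over a whole field):

    import struct

    x = 1.0  # any double-precision value
    as_written = struct.unpack("<Q", struct.pack("<d", x))[0]    # intended bit pattern
    byte_swapped = struct.unpack(">Q", struct.pack("<d", x))[0]  # same bytes, opposite order
    print(f"{as_written:016X}")    # 3FF0000000000000
    print(f"{byte_swapped:016X}")  # 000000000000F03F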

@marshallward
Collaborator

There is a flag that disables checksums, by the way; if you can't find it, I'll have a look a bit later. Obviously not recommended in general...

@aidanheerdegen
Collaborator

aidanheerdegen commented Jun 9, 2019 via email

@marshallward
Collaborator

Seems more of an FMS issue, but probably best to start with MOM6.

I'm guessing you have confirmed that this happens from a manual execution, and is not a payu problem?

@navidcy
Contributor Author

navidcy commented Jun 9, 2019

@aidanheerdegen, sure, I can do that. I just want to make sure that this is not payu related.

[Either way, it's possibly @marshallward who will be attending to the MOM6 issue anyway... ha :)]

@navidcy
Contributor Author

navidcy commented Jun 9, 2019

@marshallward, no, I haven't done that because, actually, I don't know how to do a manual restart without payu! :) I'll try that and get back to you.

@marshallward
Collaborator

Oh! Did you do a code update? I think the mpp_checksum default method has changed to the new Bob/Alistair method. (Or should I say Hallberg-Adcroft? Adcroft-Hallberg?)

If this is what's happening, then I do think the checksum will be different.

@navidcy
Contributor Author

navidcy commented Jun 9, 2019

I recompiled my executable 2 months ago (early Apr...).
I believe I took the MOM6 code from commit NOAA-GFDL/MOM6@d93b047 or the one before/after it.

Did the mpp_checksum method change before that?

@marshallward
Collaborator

I could have sworn that I had seen the method in the latest FMS. But now that I'm looking, all I can see is the old version, so I guess I must have been mistaken. Really sorry about that.

The only other idea I can see after looking at the code is that there is a sensitivity to the expected fill value. If, say, the CF fill value (1e20) got changed to something like the netCDF default fill value (roughly 9.97e36), then it could cause problems.
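A quick way to check what fill values the field actually carries, sketched in Python with the netCDF4 module; the attribute names are just the usual CF/netCDF ones and may simply be absent:

    import numpy as np
    from netCDF4 import Dataset

    # Report any fill/missing-value attributes on DTBT and count elements
    # sitting at the CF-style 1e20 fill value.
    with Dataset("INPUT/MOM.res.nc") as nc:
        dtbt = nc.variables["DTBT"]
        for att in ("_FillValue", "missing_value"):
            if att in dtbt.ncattrs():
                print(att, "=", dtbt.getncattr(att))
        vals = np.ma.filled(dtbt[:], 0.0)
        print("values at 1e20:", np.count_nonzero(np.isclose(vals, 1e20)))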

But I'm really reaching here. Probably best to just

  1. Confirm that you can reproduce the original restart with the bad checksum
  2. Confirm that the allegedly "bad" restart gives the checksum error

and to do both of these independently of payu.

@navidcy
Contributor Author

navidcy commented Jun 10, 2019

Or should I say Hallberg-Adcroft? Adcroft-Hallberg?

How about "Hallcroft method"?

@marshallward
Collaborator

@adcroft gave the very wise suggestion of searching for older errors, which came up with this:

https://github.com/NOAA-GFDL/MOM6/issues/824

It seems this might be a bug with checksums on 1d arrays, such as a dynamic dtbt timestep. Did you change the number of nodes?

Anyway, I will ask @MJHarrison-GFDL when he comes in, since he seems to have figured it out last time. (If it's still happening then we might want to reopen this in MOM6, but let's confirm first.)

In the meantime, it seems OK for you to set RESTART_CHECKSUMS_REQUIRED = False in MOM_input.
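That is, something like the following line in MOM_input (the name is as suggested above; worth confirming against one of the MOM_parameter_doc files for your executable):

    RESTART_CHECKSUMS_REQUIRED = False   ! skip verification of restart field checksums on read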

@navidcy
Contributor Author

navidcy commented Jun 10, 2019

@marshallward, yes, I probably changed the number of nodes, since I'm systematically switching between submitting to the normal queue and to normalsl or normalbw... Does this info help identify the issue?

@aidanheerdegen
Collaborator

aidanheerdegen commented Jun 11, 2019

We tried this:

&fms_io_nml
    checksum_required=.false.
/

but it still complained about that field (no others). I couldn't find RESTART_CHECKSUMS_REQUIRED in the code. Is that a typo?

That is more of an FMS issue I guess.

@marshallward
Collaborator

marshallward commented Jun 11, 2019

@aidanheerdegen This looks like a MOM6 config rather than an FMS config; MOM6 actually overrides save_restart, for example. I suspect the FMS one is doing nothing here.

Try adding this setting to MOM_input. (Check one of the parameter doc files to confirm the spelling.)

@marshallward
Collaborator

@navidcy yes, it's useful info, and it matches the other issue. I haven't noticed it myself, but it looks like the original issue was never resolved and may still be present.

Hopefully I will get some time soon to look into it.

@navidcy
Contributor Author

navidcy commented Jun 11, 2019

@marshallward, adding RESTART_CHECKSUMS_REQUIRED = False to MOM_input did the job.

@marshallward
Collaborator

marshallward commented Jun 11, 2019

Sounds good, thanks for letting me know. I guess we can close this, but can you keep the experiment aside somewhere on raijin? I will grab it and try to reproduce the problem, and hopefully find a fix for it this time.
