CM4 does not reproduce across a change in ice_layout, unless icebergs are off #13

Closed
nikizadehgfdl opened this issue Aug 11, 2015 · 7 comments

Comments

@nikizadehgfdl
Contributor

This is a very old issue which was first seen in ESM2 years ago.

The CM4 coupled model (using SIS2 and its old icebergs module) does not produce the same answers when ice_layout is changed. When I turn off the icebergs the answers are bitwise identical across ice_layout change.

This is with repro mode and with make_exchange_reproduce=.true., but I think neither has an effect here.

I believe this issue persists if I swap SIS2 with SIS1, and I see no reason for it to go away with the new icebergs module either.

Here are the two configs that do not reproduce (ALL restart files differ) unless I turn off the bergs.
They differ only in ice_layout (and the matching ice_io_layout): 72,4 vs 96,3.

else if ( "$npes" == "2560" ) then
  set atmos_npes = "288"
  set atmos_nthreads = "2"
  set nxblocks = "4" ; set nyblocks = "2" 
  set fv_layout    =   "4,12";  set fv_io_layout    =  "1,4"
  set land_layout  =   "4,12";  set land_io_layout  =  "1,4"
  set ice_layout   =   "72,4";  set ice_io_layout   =  "1,4"
  set ocn_layout   =   "36,72"; set ocn_io_layout   =  "1,4"; set ocn_mask_table = "mask_table.622.36x72"
  set ocean_npes = "1970"

else if ( "$npes" == "2561" ) then
  set atmos_npes = "288"
  set atmos_nthreads = "2"
  set nxblocks = "4" ; set nyblocks = "2" 
  set fv_layout    =   "4,12";  set fv_io_layout    =  "1,4"
  set land_layout  =   "4,12";  set land_io_layout  =  "1,4"
  set ice_layout   =   "96,3";  set ice_io_layout   =  "1,3"
  set ocn_layout   =   "36,72"; set ocn_io_layout   =  "1,4"; set ocn_mask_table = "mask_table.622.36x72"
  set ocean_npes = "1970"
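
As a quick sanity check on the two blocks above, the following minimal sketch confirms that both ice layouts decompose onto the same number of PEs, so the runs differ only in the shape of the decomposition. The helper name `pe_count` is hypothetical and not part of any model script; the layout strings are copied from the configs above.

```python
def pe_count(layout: str) -> int:
    """Return the number of PEs implied by an 'nx,ny' layout string."""
    nx, ny = (int(part) for part in layout.split(","))
    return nx * ny

# Layout strings copied from the two csh blocks, keyed by the $npes value.
ice_layouts = {"2560": "72,4", "2561": "96,3"}
counts = {npes: pe_count(layout) for npes, layout in ice_layouts.items()}
print(counts)
```

Both products come out to 288, which is also the atmos_npes value set in each block.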

The experiments I tried are:

CM4_c96L32_am4g5r2_2000_sis2 which has the issue

/// /lustre/f1/Niki.Zadeh/ulm_201505_awg_v20150702_mom6sis2_2015.08.06b/CM4_c96L32_am4g5r2_2000_sis2/ncrc2.intel-repro-openmp/archive/1x0m10d_2560pe/restart/00010111.tar
\\\ /lustre/f1/Niki.Zadeh/ulm_201505_awg_v20150702_mom6sis2_2015.08.06b/CM4_c96L32_am4g5r2_2000_sis2/ncrc2.intel-repro-openmp/archive/1x0m10d_2561pe/restart/00010111.tar
DIFFER : ALL
    CROSSOVER   FAILED: CM4_c96L32_am4g5r2_2000_sis2

CM4_c96L32_am4g5r2_2000_sis2_nobergs which does not have the issue

/// /lustre/f1/Niki.Zadeh/ulm_201505_awg_v20150702_mom6sis2_2015.08.06b/CM4_c96L32_am4g5r2_2000_sis2_nobergs/ncrc2.intel-repro-openmp/archive/1x0m10d_2560pe/restart/00010111.tar
\\\ /lustre/f1/Niki.Zadeh/ulm_201505_awg_v20150702_mom6sis2_2015.08.06b/CM4_c96L32_am4g5r2_2000_sis2_nobergs/ncrc2.intel-repro-openmp/archive/1x0m10d_2561pe/restart/00010111.tar

    CROSSOVER   PASSED: CM4_c96L32_am4g5r2_2000_sis2_nobergs
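
The comparison tool that produced the output above is not named in this thread. As a rough stand-in, a crossover check of two restart tars could be sketched as below, assuming a byte-for-byte comparison of each archive member is acceptable. The function names are hypothetical. Note that a byte-level check is stricter than the per-variable comparison shown above: netCDF files that agree in every variable may still differ in metadata.

```python
import hashlib
import sys
import tarfile

def member_digests(tar_path):
    """Map each regular file in the tar archive to the SHA-256 of its bytes."""
    digests = {}
    with tarfile.open(tar_path) as tar:
        for member in tar:
            if not member.isfile():
                continue
            sha = hashlib.sha256()
            with tar.extractfile(member) as fh:
                # Read in 1 MiB chunks so large restart files are not held in memory.
                for chunk in iter(lambda: fh.read(1 << 20), b""):
                    sha.update(chunk)
            digests[member.name] = sha.hexdigest()
    return digests

def compare_restarts(tar_a, tar_b):
    """Return sorted member names that are missing from one archive or differ in content."""
    a, b = member_digests(tar_a), member_digests(tar_b)
    return sorted(name for name in a.keys() | b.keys() if a.get(name) != b.get(name))

if __name__ == "__main__" and len(sys.argv) == 3:
    differing = compare_restarts(sys.argv[1], sys.argv[2])
    print("CROSSOVER PASSED" if not differing else "DIFFER : " + ", ".join(differing))
```

Run as a script with the two `00010111.tar` paths as arguments; an empty difference list corresponds to the PASSED case.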
@nikizadehgfdl
Contributor Author

BTW the model does reproduce across a fv_layout change, atmos_threads change or ocean_layout change (with no mask_table).

@nikizadehgfdl
Contributor Author

@underwoo wrote:
"Looking at the stdout's from the CM4_c96L32_am4g5r2_2000_sis2 runs, the "Total Ice Mass|Salt|Heat" are all slightly different from the first print out. Points to something in the iceberg initialization. (The noberg runs do not show the difference in "Total Ice ..".)

Also, I don't see any iceberg restart files in the initCond file. Could you please run a test that uses iceberg restart files to see if that will reproduce across layout changes."

So, I did try that and the answers indeed reproduced across the same ice_layout change!

/// /lustre/f1/Niki.Zadeh/ulm_201505_awg_v20150702_mom6sis2_2015.08.06b/CM4_c96L32_am4g5r2_2000_sis2_withBergsInInitcond/ncrc2.intel-repro-openmp/archive/1x0m10d_2560pe/restart/00010121.tar
\\\ /lustre/f1/Niki.Zadeh/ulm_201505_awg_v20150702_mom6sis2_2015.08.06b/CM4_c96L32_am4g5r2_2000_sis2_withBergsInInitcond/ncrc2.intel-repro-openmp/archive/1x0m10d_2561pe/restart/00010121.tar

the only difference was
      Comparing icebergs.res.nc...
DIFFER : VARIABLE : lon : POSITION : 0 : VALUES : -265.452 <> -267.218

All I did was use one of the restart tars from a 10-day experiment (the 2560 one) as the initCond and repeat the runs.

Here are the stdouts:

/lustre/f1/Niki.Zadeh/ulm_201505_awg_v20150702_mom6sis2_2015.08.06b/CM4_c96L32_am4g5r2_2000_sis2_withBergsInInitcond/ncrc2.intel-repro-openmp/stdout/run/CM4_c96L32_am4g5r2_2000_sis2_withBergsInInitcond_1x0m10d_2560pe.o5041290 

/lustre/f1/Niki.Zadeh/ulm_201505_awg_v20150702_mom6sis2_2015.08.06b/CM4_c96L32_am4g5r2_2000_sis2_withBergsInInitcond/ncrc2.intel-repro-openmp/stdout/run/CM4_c96L32_am4g5r2_2000_sis2_withBergsInInitcond_1x0m10d_2561pe.o5041289

So, Seth, what do you make of this?

@jwdGFDL

jwdGFDL commented Aug 13, 2015

If I recall correctly, the initialization algorithm picks out the icebergs a given rank owns. One possibility is a layout-dependent flaw that attributes an iceberg to multiple ranks, or to no rank at all, in which case it is left out of the rest of the simulation.
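
The ownership hypothesis above can be tested in isolation. The sketch below is not the icebergs module's algorithm; it assumes a simple contiguous block decomposition on a small hypothetical 288 x 36 grid and counts how many ranks claim each cell under a given layout. All function names are made up for illustration. With half-open block bounds every cell should have exactly one owner; bounds that are closed at both ends would double-count cells on block edges.

```python
from collections import Counter

def block_bounds(n, nblocks):
    """Split range(n) into nblocks contiguous half-open blocks [start, end)."""
    base, extra = divmod(n, nblocks)
    bounds, start = [], 0
    for b in range(nblocks):
        end = start + base + (1 if b < extra else 0)
        bounds.append((start, end))
        start = end
    return bounds

def owners(i, j, xb, yb):
    """Return every rank whose block contains cell (i, j)."""
    return [ry * len(xb) + rx
            for ry, (js, je) in enumerate(yb)
            for rx, (i0, i1) in enumerate(xb)
            if i0 <= i < i1 and js <= j < je]

def ownership_histogram(ni, nj, layout):
    """Count how many cells have 0, 1, 2, ... owners under the given layout."""
    xb, yb = block_bounds(ni, layout[0]), block_bounds(nj, layout[1])
    return Counter(len(owners(i, j, xb, yb)) for j in range(nj) for i in range(ni))

# Hypothetical grid size; the two layouts are the ones from this issue.
NI, NJ = 288, 36
for layout in [(72, 4), (96, 3)]:
    print(layout, dict(ownership_histogram(NI, NJ, layout)))
```

Any count other than 1 in the histogram would point at cells (and hence bergs seeded there) that are duplicated or dropped for that layout.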


Jeff Durachta
Engineering Lead for Modeling Services
NOAA Geophysical Fluid Dynamics Lab
Forrestal Campus, Princeton University
201 Forrestal Road
Princeton, NJ 08540
Office: +1-609-987-5054

@adcroft
Member

adcroft commented Aug 18, 2015

I've been unable to make an ice-ocean configuration fail reproducibility tests in which I seed every model cell with four bergs moving in the cardinal directions.

Looking at the logs @nikizadehgfdl provided it looks like there is a difference in the calving restart checksum. How does this happen?

> grep restart_calv /lustre/f1/Niki.Zadeh/ulm_201505_awg_v20150702_mom6sis2_2015.08.06b/CM4_c96L32_am4g5r2_2000_sis2/ncrc2.intel-repro-openmp/stdout/run/CM4_c96L32_am4g5r2_2000_sis2_1x0m10d_2560pe.o7149337 /lustre/f1/Niki.Zadeh/ulm_201505_awg_v20150702_mom6sis2_2015.08.06b/CM4_c96L32_am4g5r2_2000_sis2/ncrc2.intel-repro-openmp/stdout/run/CM4_c96L32_am4g5r2_2000_sis2_1x0m10d_2561pe.o7149320
CM4_c96L32_am4g5r2_2000_sis2_1x0m10d_2560pe.o7149337:diamonds, grd_chksum3: read_restart_calvi chksum=           -1896008147 chksum2=           -1545844752 min= 0.000000000E+00 max= 7.399996075E+11 mean= 9.751374716E+10 rms= 1.634493874E+11 sd= 1.311745835E+11
CM4_c96L32_am4g5r2_2000_sis2_1x0m10d_2561pe.o7149320:diamonds, grd_chksum3: read_restart_calvi chksum=            -185424424 chksum2=            1140423430 min= 0.000000000E+00 max= 7.399992858E+11 mean= 9.750156607E+10 rms= 1.634237307E+11 sd= 1.311516694E+11
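
To see at a glance which statistics differ between the two `grd_chksum3` lines, they can be parsed into `name= value` pairs and compared field by field. This is a minimal sketch that assumes the line format shown in the grep output above; the function names are hypothetical, and the sample lines are the two log lines with the file-name prefix dropped.

```python
import re

# Matches "name= value" pairs such as "chksum=   -1896008147" or "max= 7.399996075E+11".
FIELD = re.compile(r"(\w+)=\s*(\S+)")

def parse_chksum_line(line):
    """Return the name -> value-string pairs found on one checksum line, in order."""
    return dict(FIELD.findall(line))

def differing_fields(line_a, line_b):
    """List the field names whose printed values differ between two lines."""
    a, b = parse_chksum_line(line_a), parse_chksum_line(line_b)
    return [name for name in a if a[name] != b.get(name)]

line_2560 = ("diamonds, grd_chksum3: read_restart_calvi chksum=           -1896008147 "
             "chksum2=           -1545844752 min= 0.000000000E+00 max= 7.399996075E+11 "
             "mean= 9.751374716E+10 rms= 1.634493874E+11 sd= 1.311745835E+11")
line_2561 = ("diamonds, grd_chksum3: read_restart_calvi chksum=            -185424424 "
             "chksum2=            1140423430 min= 0.000000000E+00 max= 7.399992858E+11 "
             "mean= 9.750156607E+10 rms= 1.634237307E+11 sd= 1.311516694E+11")
print(differing_fields(line_2560, line_2561))
```

For these two lines every field except `min` differs, so the mismatch is in the calving field values themselves rather than only in the checksum.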

There is also this line:

< OCN(ATMOCNLND)=  0.354793438964402       0.354793438964402    0.354793438964402
> OCN(ATMOCNLND)=  0.354433472151885       0.354433472151885    0.354433472151885

which has nothing to do with icebergs.

@Zhi-Liang
Contributor

Hi Niki,

< OCN(ATMOCNLND)=  0.354793438964402       0.354793438964402    0.354793438964402
> OCN(ATMOCNLND)=  0.354433472151885       0.354433472151885    0.354433472151885

This printout is from xgrid.F90. The calculation is based on some random numbers, so it cannot reproduce across processor counts.

Zhi


@underwoo
Member

There is a namelist option 'make_calving_reproduce' in the ice_sis version of ice_bergs. Niki, please check whether this option exists in the new icebergs module and whether it is set to .true. in your namelists.

Seth Underwood
Engility

Modeling Systems Group
GFDL/NOAA/DOC
201 Forrestal Road
Princeton, NJ 08540-6649

(609) 452-5847 Office
(304) 376-9002 Cell
(609) 987-5063 Fax
Seth.Underwood@noaa.gov


@nikizadehgfdl
Contributor Author

Thanks, that was the problem. The model reproduced across ice_layout change after I set the iceberg namelist make_calving_reproduce = .true.
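
This thread does not describe what make_calving_reproduce changes inside the icebergs code. As a generic illustration only, the sketch below shows one common way a layout change can alter answers: floating-point addition is not associative, so per-rank partial sums combined in rank order can differ between decompositions, while a sum that ignores the decomposition does not. The function names and sample values are hypothetical and chosen to make the effect visible.

```python
import math

def naive_sum(values):
    """Accumulate left to right in double precision, as a simple loop would."""
    total = 0.0
    for v in values:
        total += v
    return total

def layout_sum(values, nranks):
    """Each rank sums its contiguous chunk; partial sums are then added in rank order.

    Assumes len(values) is divisible by nranks.
    """
    chunk = len(values) // nranks
    partials = [naive_sum(values[r * chunk:(r + 1) * chunk]) for r in range(nranks)]
    return naive_sum(partials)

def reproducible_sum(values):
    """Layout-independent alternative: sum all values with exact rounding."""
    return math.fsum(values)

# Contrived values with large cancellation so that rounding depends on grouping.
values = [1.0, 1e16, -1e16, 1.0]
print(layout_sum(values, 1), layout_sum(values, 2), reproducible_sum(values))
```

With one rank the naive sum and with two ranks the chunked sum give different results for this data, while `math.fsum` over the full list returns the same value regardless of how the list would be split.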
