Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Porting MOM5 to latest FMS (ulm) #132

Merged
merged 14 commits into from
Aug 11, 2016
Merged

Conversation

marshallward
Copy link
Collaborator

This is a big one, I have replaced FMS with the latest version! Definitely talk with me before considering this pull request.

I will also start a mailing list post discussing this PR.

Commit message below:

This is a large patch which updates MOM5 to use the latest FMS library.

Changes to FMS are not summarised here, only changes to MOM and support
models (atmos, land, sis) are summarised below.

Changes are as follows:

  • Updating the FMS (src/shared) library to the latest FMS release.
  • write_version_number updated in each module to support inputs
  • version and tagname variable names are now used consistently
    across modules
  • The land_null model now sets its Current_domain to support changes
    to get_mosaic_xgrid (which now requires an explicit domain)
  • FMS_compile.csh buildscript replaces the fixed filepaths with an
    autogenerated list using the list_paths utility

The main purpose of this patch is to demonstrate that MOM5 still runs
under the newer FMS release with only minor changes, and to start
discussion of how to maintain MOM under future FMS releases.

@marshallward
Copy link
Collaborator Author

Forgot to add:

I have confirmed the following:

  • bitwise reproducibility of bowl1
  • bitwise reproducibility of the "canon" CM2.5-ish 0.25° MOM-SIS experiment
  • I can compile CM2M but I have not yet run it; there may be problems in the other submodels, particularly land_lad.

@nichannah
Copy link
Contributor

Hi Marshall, this looks good, I think we should merge it. Give me a day or two to run the full tests.

Can you update the commit message to include a reference to where the FMS code comes from, e.g. URL and commit hash.

Cheers

This is a large patch which updates MOM5 to use the latest FMS library.

Based on the following FMS release:

url: https://github.com/NOAA-GFDL/FMS.git
commit: 35e491c497d4bb9b77bd6d8b9557badd53e56e19

Modifications to the `load_xgrid` subroutine of `xgrid.F90` (already
present in the MOM5 source) have been integrated into this FMS release.

Changes to FMS are not summarised here, only changes to MOM and support
models (atmos, land, sis) are summarised below.

Changes are as follows:

- Updating the FMS (src/shared) library to the latest FMS release.

- `write_version_number` updated in each module to support inputs

- `version` and `tagname` variable names are now used consistently
  across modules

- The `land_null` model now sets its `Current_domain` to support changes
  to `get_mosaic_xgrid` (which now requires an explicit domain)

- FMS_compile.csh buildscript replaces the fixed filepaths with an
  autogenerated list using the `list_paths` utility

The main purpose of this patch is to demonstrate that MOM5 still runs
under the newer FMS release with only minor changes, and to start
discussion of how to maintain MOM under future FMS releases.
@marshallward
Copy link
Collaborator Author

No worries, I've updated the commit message to include a URL and commit hash.

The FMS import here is not identical to that FMS release, since it did not include the changes that we needed to make to xgrid.F90, so I had to manually re-insert that work.

The xgrid work is part of FMS, but it is still under a development branch.

@nichannah
Copy link
Contributor

Hi Marshall, I've done a bit of work to get all the tests fixed. I've also introduced a Jenkins job that can be used to do regression testing on pull requests. Can you please merge (or rebase onto) the latest master. I'll then set off the tests.

Thanks

@marshallward
Copy link
Collaborator Author

I think it's updated now.

@nichannah
Copy link
Contributor

Check out:
https://climate-cms.nci.org.au/jenkins/job/mom-ocean.org/job/MOM5_pr_run/

Scroll to the bottom of the console output to see the test results. Finding the actual error output is still not straight-forward, try searching for 'Error'.

For reference, the jenkins job is doing the following:

cd ${WORKSPACE}/test &&
nosetests --with-xunit test_run_setup.py &&
qsub_run.py -P v45 -l ncpus=960,mem=1920Gb,walltime=8:00:00 --
"cd ${WORKSPACE}/test && nosetests --with-xunit test_run.py -s" &&
nosetests --with-xunit test_run_outputs.py

@marshallward
Copy link
Collaborator Author

So basically ICCM failed to build? OK I will look at it.

Variable `tag` was renamed to `tagname` to match other modules, and fix
a bug in the `write_version_number` call.

This fixes the ICCM builds, which use `atmos_bgrid`.
@marshallward
Copy link
Collaborator Author

ICCM should be fixed now.

I'll try to figure out this script to test the rest.

@marshallward
Copy link
Collaborator Author

I ran this command:

nosetests --with-xunit test_run_setup.py

and it passed the 16 tests.

The alltoall-based exchange grid calculation is restored in this
commit.  It is based on the development version of FMS, since it has not
yet been integrated into the Ulm release.  It is based on the current
Alltoall implementation in the repo, with some minor changes by Zhi
Liang.

The following deprecated FMS IO namelist settings have also been
restored:

* ``threading_write``
* ``fileset_write``
@StephenGriffies
Copy link
Contributor

Hi,

What is the status of these tests?

If there are tests that are no longer supported by GFDL, then we may choose
to drop those from the newer MOM5 release.

I believe the problematic test is the EBM. That makes sense, as the
atmospheric model used in the EBM test has for some years not been
supported at GFDL.

Any other tests no longer working?

Steve

On Sun, Apr 10, 2016 at 10:05 PM, Nic Hannah notifications@github.com
wrote:

retest this please


You are receiving this because you are subscribed to this thread.
Reply to this email directly or view it on GitHub
#132 (comment)

Dr. Stephen M. Griffies
NOAA Geophysical Fluid Dynamics Lab
201 Forrestal Road
Princeton, NJ 08542
USA

@marshallward
Copy link
Collaborator Author

EBM is most likely not a problem. There was a minor error in the EBM grid_spec.nc file. One of the fields (edge coordinate in X, I think?) was missing a boundary value and ulm was raising an error because it was the wrong length (off by one). Most likely this vector was never used in MOM 5. We can fix this by cleaning up the EBM grid_spec.nc.

The harder problem was in the OBC of the atlantic1 test, discussed here:

https://mail.google.com/mail/u/0/#search/zhi/15427f4eb8897f9d

Quoting from the mailing list:

The experiment hangs in line 1729 of the src/mom5/ocean_core/ocean_obc_barotrop.F90, which is a call to init_external_field. Inside this function is a call to mpp_open, which contains a call to mpp_open_meta.

Inside mpp_open_meta (src/shared/mpp/include/mpp_io_read.inc), the older FMS version used to allow every read to independently read its own file. In the new FMS version, only rank 0 reads the file, then uses mpp_broadcast to scatter the data to other nodes. This is causing the model hangs.

I.e., mpp_io_read was changed. Previously, each rank independently read its own files. In the new FMS, only one rank reads the input files, then scatters the data to other ranks.

This fails in OBC because only ranks on each boundary will read boundary data. If it doesn't contain rank 0, then it hangs indefinitely.

There is some mixup between the global communicator and boundary subcommunicators, if they are even being used.

I understand the problem, but wasn't able to come up with a solution. I'll need some time to work it out.

@StephenGriffies
Copy link
Contributor

Hi,

Any update on these tests? I am pestering, as David Hutchinson (former
UNSW student) wishes to run CM2.1 with an updated radiation module
available in the Ulm release. I wish to give him a rough timeline.

Thanks,
Steve

On Wed, Jun 1, 2016 at 6:11 AM, Marshall Ward notifications@github.com
wrote:

EBM is most likely not a problem. There was a minor error in the EBM
grid_spec.nc file. One of the fields (edge coordinate in X, I think?) was
missing a boundary value and ulm was raising an error because it was the
wrong length (off by one). Most likely this vector was never used in MOM 5.
We can fix this by cleaning up the EBM grid_spec.nc.

The harder problem was in the OBC of the atlantic1 test, discussed here:

https://mail.google.com/mail/u/0/#search/zhi/15427f4eb8897f9d

Quoting from the mailing list:

The experiment hangs in line 1729 of the
src/mom5/ocean_core/ocean_obc_barotrop.F90, which is a call to
init_external_field. Inside this function is a call to mpp_open, which
contains a call to mpp_open_meta.

Inside mpp_open_meta (src/shared/mpp/include/mpp_io_read.inc), the older
FMS version used to allow every read to independently read its own file. In
the new FMS version, only rank 0 reads the file, then uses mpp_broadcast
to scatter the data to other nodes. This is causing the model hangs.

I.e., mpp_io_read was changed. Previously, each rank independently read
its own files. In the new FMS, only one rank reads the input files, then
scatters the data to other ranks.

This fails in OBC because only ranks on each boundary will read boundary
data. If it doesn't contain rank 0, then it hangs indefinitely.

There is some mixup between the global communicator and boundary
subcommunicators, if they are even being used.

I understand the problem, but wasn't able to come up with a solution. I'll
need some time to work it out.


You are receiving this because you commented.
Reply to this email directly, view it on GitHub
#132 (comment),
or mute the thread
https://github.com/notifications/unsubscribe/ACBam-5Urs-YJZTiHbdeTMw41ZP7c7pfks5qHVrEgaJpZM4GoTVm
.

Dr. Stephen M. Griffies
NOAA Geophysical Fluid Dynamics Lab
201 Forrestal Road
Princeton, NJ 08542
USA

@marshallward
Copy link
Collaborator Author

I'll try to do a "good" patch today (something which address the boundary communicator issue) but if I can't get anywhere then I'll just rip out the new IO code and put the old IO code in.

I could also try updating with the new verona FMS, althought that may just create new problems.

@marshallward
Copy link
Collaborator Author

marshallward commented Jul 19, 2016

Quick update. I think I have a solution that doesn't require any changes to the new FMS. If I swap to the boundary communicator using mpp_get/set_current_pelist before reading the input files, then it resolves the model hangs.

Unfortuantely only the barotropic part of the OBC initilization creates communicators, and is missing from the OBC tracer initialization so I'll need to copypasta some code into there.

I will push something once it appears to be working on my end (hopefully tomorrow).

@marshallward
Copy link
Collaborator Author

It looks like I have a working version now - well, no more hangs at least.

I will push it tomorrow morning and check the model output. If it looks OK then we can put it through the test suite.

The new FMS IO only allows one rank to open a file, and then
`MPI_BCAST`s the data to the other ranks.  This was causing model hangs
on the OBC code (via `init_external_field`), where only ranks along the
boundaries would open a file.

This was resolved by switching to the local boundary communicator during
the initialization IO for the barotropic and tracer initialization.

This required some copypasta from `ocean_obc_barotropic.F90` to
`ocean_obc.F90`, since only the barotropic part was properly
constructing the boundary communicators.  This should probably be
consolidated someday in the future, since there are now identical
duplicates of these communicators (along with the code duplication).
@nichannah
Copy link
Contributor

@marshallward
Copy link
Collaborator Author

There's one outstanding change that I don't know how to deal with. This WARNING was changed to an ERROR:

call mpp_error(WARNING, "horiz_interp_mod: You have overridden the default value of reproduce_siena " // &
"and set it to .true. in horiz_interp_nml. This is a temporary workaround to " // &
"allow for consistency in continuing experiments.  Please use the default " //&
"value (.false.) as this option will be removed in a future release. ")

i.e. the FMS horiz_interp no longer supports reproduce_siena.

However, reproduce_siena is still used in the ESM2M_pi-control_C2 experiment.

Should we disable reproduce_siena in ESM2M_pi-control_C2? Otherwise, we will have to change our FMS.

@StephenGriffies
Copy link
Contributor

Hi,

I recommend we move forward. If that means to disable reproduce_siena,
then so be it.

Steve

On Wed, Jul 20, 2016 at 4:09 AM, Marshall Ward notifications@github.com
wrote:

There's one outstanding change that I don't know how to deal with. This
WARNING was changed to an ERROR:

call mpp_error(WARNING, "horiz_interp_mod: You have overridden the default value of reproduce_siena " // &
"and set it to .true. in horiz_interp_nml. This is a temporary workaround to " // &
"allow for consistency in continuing experiments. Please use the default " //&
"value (.false.) as this option will be removed in a future release. ")

i.e. the FMS horiz_interp no longer supports reproduce_siena.

However, reproduce_siena is still used in the ESM2M_pi-control_C2
experiment.

Should we disable reproduce_siena in ESM2M_pi-control_C2? Otherwise, we
will have to change our FMS.


You are receiving this because you commented.
Reply to this email directly, view it on GitHub
#132 (comment),
or mute the thread
https://github.com/notifications/unsubscribe-auth/ACBam0ATh2WOOw8PrRoDHud0GVIDLU3rks5qXdfRgaJpZM4GoTVm
.

Dr. Stephen M. Griffies
NOAA Geophysical Fluid Dynamics Lab
201 Forrestal Road
Princeton, NJ 08542
USA

@nichannah
Copy link
Contributor

Marshall, let me know when you have all the updated configs and input data. I'll remake the tarballs and rerun the tests.

@marshallward
Copy link
Collaborator Author

Sorry for the delay here... the ESM namelist exposed some flaws in my namelist parser (f90nml) which were a lot harder to fix than I had thought.

It looks like setting reproduce_siena to .false. in horiz_interp is all that's needed here, it runs fine on Raijin.

I will set that up in the directory for you to test, Nic. (Probably tomorrow, it's really late here!)

@marshallward
Copy link
Collaborator Author

I've now copied the modified input files to /short/public/mxw900/nic on Raijin.

@nichannah
Copy link
Contributor

Hi Marshall, apologies for the delay. Can you please fix the perms? Also after I've got the tarballs made and installed I'll need you to do a quick rebase (or merge) to get the latest master... I'll let you know when.

@marshallward
Copy link
Collaborator Author

Sorry, permissions should be OK now.

Re: patch, github is saying there's no conflicts, do we still need to rebase? I can do the merge if you like, tho it might not be necessary.

@nichannah
Copy link
Contributor

OK, cool. So can you merge with master and then apply the attached patch. Your pr should then use new data sources containing your changes.

data_sources.txt

(p.s. github wouldn't let me upload .diff/.patch so I've made it a txt)

@marshallward
Copy link
Collaborator Author

I've updated against the latest master branch.

I didn't change the data sources file, was I supposed to do that?

@nichannah
Copy link
Contributor

Thanks Marshall. Yes, you'll need to change over to the new data sources. If you're interested in the details: this supplies the input data URLs, since your input has changed it needs to get from a new URL. This is a kind of versioning for data.

@marshallward
Copy link
Collaborator Author

So are there new tarballs somewhere? I had thought you were going to make
these.

On 29 Jul 2016 5:59 pm, "Nic Hannah" notifications@github.com wrote:

Thanks Marshall. Yes, you'll need to change over to the new data sources.
If you're interested in the details: this supplies the input data URLs,
since your input has changed it needs to get from a new URL. This is a kind
of versioning for data.


You are receiving this because you authored the thread.
Reply to this email directly, view it on GitHub
#132 (comment),
or mute the thread
https://github.com/notifications/unsubscribe-auth/AAcN63ZFtyqZhx0oYOwZKHIDo3rz0_zGks5qabLhgaJpZM4GoTVm
.

@nichannah
Copy link
Contributor

Yep, I've made them. The data_sources.txt (diff) attached has their location.

@marshallward
Copy link
Collaborator Author

OK, I understand now.
I've updated the data_sources.txt file using your patch.

@nichannah
Copy link
Contributor

Hi Marshall, sorry, I messed up the data sources script. Can you please merge master again? Thanks

@marshallward
Copy link
Collaborator Author

I've merged the updated script.

@nichannah
Copy link
Contributor

nichannah commented Aug 2, 2016

Hi Marshall, we're getting there! I made another mistake, there were duplicates in the data_sources.txt so the old one was still being used. Can you please merge again!

@marshallward
Copy link
Collaborator Author

The merge went funny and I didn't know how to resolve the conflcts, so I just copied the file directly. Hopefuly I didn't break anything.

@nichannah
Copy link
Contributor

You need to modify the data_sources.txt that you now have to use these two datasources:

https://climate-cms.nci.org.au/repository/entry/get/Data+Repository/Other+Data+at+NCI/MOM+Test+Data/4dc08e007b6920911634f766da7d36c6--ESM2M_pi-control_C2.input.tar.gz

https://climate-cms.nci.org.au/repository/entry/get/Data+Repository/Other+Data+at+NCI/MOM+Test+Data/521e418ac4988e8a424903ed4d219339--mom4p1_ebm1.tar.gz

These should replace the sources you currently have for ESM2M_pi-control_C2.input.tar.gz and mom4p1_ebm1.tar.gz

The idea is that the data_sources.txt file is a map between the experiment input (tar.gz) and the URL for that file. Since you are using new inputs you need these new URLs.

@nichannah
Copy link
Contributor

Can you just manually fix those two lines.

@marshallward
Copy link
Collaborator Author

I think I did what you needed.

@marshallward
Copy link
Collaborator Author

BTW you should be able to fork this to your own account and make any other modifications.. I can keep doing them if it's easier, but it's possible that I could mess them up.

@nichannah
Copy link
Contributor

Yeah, I probably should have done that to save you the trouble of testing/including my buggy code!

Anyway I think it's the last one. Please merge master on more time! Thanks

@marshallward
Copy link
Collaborator Author

I think it's done...

Nic Hannah and others added 2 commits August 10, 2016 12:26
Somehow these important lines got removed in your last merge of master.
Update data_sources.txt
@nichannah
Copy link
Contributor

@nichannah nichannah merged commit 43986e2 into mom-ocean:master Aug 11, 2016
@aidanheerdegen aidanheerdegen mentioned this pull request Oct 16, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants