
Specified UNIT in FLUSH is not connected with GNU #2

Closed · uturuncoglu opened this issue Feb 9, 2024 · 42 comments
Labels: bug (Something isn't working)

@uturuncoglu (Collaborator)

@hga007 As a part of debugging #1, I decided to run the same configuration on a different machine (Hercules). The configuration crashes in the same way with the Intel compiler. Then I compiled the configuration with GNU to see what happens, but the code crashed with the following error:

23: At line 39 of file /work2/noaa/stmp/tufuk/stmp/tufuk/FV3_RT/rt_3847567/compile_16_gnu/build_fv3_16_gnu/ROMS-interface/ROMS/f90/mp_routines.f90 (unit = 60)
23: Fortran runtime error: Specified UNIT in FLUSH is not connected
23:
23: Error termination. Backtrace:
16: #0  0x15b0888 in my_flush_
16:     at /work2/noaa/stmp/tufuk/stmp/tufuk/FV3_RT/rt_3847567/compile_16_gnu/build_fv3_16_gnu/ROMS-interface/ROMS/f90/mp_routines.f90:39
20: #0  0x15b0888 in my_flush_
20:     at /work2/noaa/stmp/tufuk/stmp/tufuk/FV3_RT/rt_3847567/compile_16_gnu/build_fv3_16_gnu/ROMS-interface/ROMS/f90/mp_routines.f90:39
21: #0  0x15b0888 in my_flush_
21:     at /work2/noaa/stmp/tufuk/stmp/tufuk/FV3_RT/rt_3847567/compile_16_gnu/build_fv3_16_gnu/ROMS-interface/ROMS/f90/mp_routines.f90:39
21: #1  0x16060cf in wclock_on_
21:     at /work2/noaa/stmp/tufuk/stmp/tufuk/FV3_RT/rt_3847567/compile_16_gnu/build_fv3_16_gnu/ROMS-interface/ROMS/f90/timers.f90:95
23: #0  0x15b0888 in my_flush_
23:     at /work2/noaa/stmp/tufuk/stmp/tufuk/FV3_RT/rt_3847567/compile_16_gnu/build_fv3_16_gnu/ROMS-interface/ROMS/f90/mp_routines.f90:39
23: #1  0x16060cf in wclock_on_
23:     at /work2/noaa/stmp/tufuk/stmp/tufuk/FV3_RT/rt_3847567/compile_16_gnu/build_fv3_16_gnu/ROMS-interface/ROMS/f90/timers.f90:95
13: #0  0x15b0888 in my_flush_
13:     at /work2/noaa/stmp/tufuk/stmp/tufuk/FV3_RT/rt_3847567/compile_16_gnu/build_fv3_16_gnu/ROMS-interface/ROMS/f90/mp_routines.f90:39
13: #1  0x16060cf in wclock_on_
13:     at /work2/noaa/stmp/tufuk/stmp/tufuk/FV3_RT/rt_3847567/compile_16_gnu/build_fv3_16_gnu/ROMS-interface/ROMS/f90/timers.f90:95
13: #2  0x13ea002 in __roms_kernel_mod_MOD_roms_initialize
13:     at /work2/noaa/stmp/tufuk/stmp/tufuk/FV3_RT/rt_3847567/compile_16_gnu/build_fv3_16_gnu/ROMS-interface/ROMS/f90/roms_kernel.f90:131
14: #0  0x15b0888 in my_flush_
14:     at /work2/noaa/stmp/tufuk/stmp/tufuk/FV3_RT/rt_3847567/compile_16_gnu/build_fv3_16_gnu/ROMS-interface/ROMS/f90/mp_routines.f90:39
14: #1  0x16060cf in wclock_on_
14:     at /work2/noaa/stmp/tufuk/stmp/tufuk/FV3_RT/rt_3847567/compile_16_gnu/build_fv3_16_gnu/ROMS-interface/ROMS/f90/timers.f90:95
14: #2  0x13ea002 in __roms_kernel_mod_MOD_roms_initialize
14:     at /work2/noaa/stmp/tufuk/stmp/tufuk/FV3_RT/rt_3847567/compile_16_gnu/build_fv3_16_gnu/ROMS-interface/ROMS/f90/roms_kernel.f90:131
15: #0  0x15b0888 in my_flush_
15:     at /work2/noaa/stmp/tufuk/stmp/tufuk/FV3_RT/rt_3847567/compile_16_gnu/build_fv3_16_gnu/ROMS-interface/ROMS/f90/mp_routines.f90:39
15: #1  0x16060cf in wclock_on_
15:     at /work2/noaa/stmp/tufuk/stmp/tufuk/FV3_RT/rt_3847567/compile_16_gnu/build_fv3_16_gnu/ROMS-interface/ROMS/f90/timers.f90:95
15: #2  0x13ea002 in __roms_kernel_mod_MOD_roms_initialize
15:     at /work2/noaa/stmp/tufuk/stmp/tufuk/FV3_RT/rt_3847567/compile_16_gnu/build_fv3_16_gnu/ROMS-interface/ROMS/f90/roms_kernel.f90:131
16: #1  0x16060cf in wclock_on_
16:     at /work2/noaa/stmp/tufuk/stmp/tufuk/FV3_RT/rt_3847567/compile_16_gnu/build_fv3_16_gnu/ROMS-interface/ROMS/f90/timers.f90:95
16: #2  0x13ea002 in __roms_kernel_mod_MOD_roms_initialize
16:     at /work2/noaa/stmp/tufuk/stmp/tufuk/FV3_RT/rt_3847567/compile_16_gnu/build_fv3_16_gnu/ROMS-interface/ROMS/f90/roms_kernel.f90:131
17: #0  0x15b0888 in my_flush_
17:     at /work2/noaa/stmp/tufuk/stmp/tufuk/FV3_RT/rt_3847567/compile_16_gnu/build_fv3_16_gnu/ROMS-interface/ROMS/f90/mp_routines.f90:39
17: #1  0x16060cf in wclock_on_
17:     at /work2/noaa/stmp/tufuk/stmp/tufuk/FV3_RT/rt_3847567/compile_16_gnu/build_fv3_16_gnu/ROMS-interface/ROMS/f90/timers.f90:95
17: #2  0x13ea002 in __roms_kernel_mod_MOD_roms_initialize
17:     at /work2/noaa/stmp/tufuk/stmp/tufuk/FV3_RT/rt_3847567/compile_16_gnu/build_fv3_16_gnu/ROMS-interface/ROMS/f90/roms_kernel.f90:131
18: #0  0x15b0888 in my_flush_
18:     at /work2/noaa/stmp/tufuk/stmp/tufuk/FV3_RT/rt_3847567/compile_16_gnu/build_fv3_16_gnu/ROMS-interface/ROMS/f90/mp_routines.f90:39
18: #1  0x16060cf in wclock_on_
18:     at /work2/noaa/stmp/tufuk/stmp/tufuk/FV3_RT/rt_3847567/compile_16_gnu/build_fv3_16_gnu/ROMS-interface/ROMS/f90/timers.f90:95

I checked it, and it seems to be related to the stdout file. It writes log.roms but then fails with this error. I wonder if you have tried to run this configuration on your end with the GNU compiler?

BTW, my run directory on Hercules: /work2/noaa/stmp/tufuk/stmp/tufuk/FV3_RT/rt_3847567/coastal_irene_atm2roms_gnu

uturuncoglu added the bug label on Feb 9, 2024
uturuncoglu self-assigned this on Feb 9, 2024
@uturuncoglu (Collaborator Author)

@hga007 This is just an attempt to compile the RT with the GNU compiler. There might be a bug in the CMEPS cap that we want to fix, since it is not working with GNU at this point. Have you tested the CMEPS cap with the GNU compiler on your end? We probably did not test it before. This is not urgent, but we would like to compile the RT with both GNU and Intel without any issue.

@hga007 commented Feb 9, 2024

@uturuncoglu: I use my JEDI Spack-Stack (ifort) with UFS support on my computer. I don't have one for gfortran, so all my runs are with ifort. I dislike gfortran very much because it doesn't work well with the TotalView debugger. It builds the *.o files without full debugging support, which makes them very difficult to examine in the debugger.

@uturuncoglu (Collaborator Author)

@hga007 Okay, that explains why it is working for you. It would be nice to test with GNU to catch these kinds of issues. Anyway, I'll look at it and update you.

@uturuncoglu (Collaborator Author)

@hga007 I traced this a little bit more and added print statements; here is the output:

12: ==> Entering ROMS_SetServices, PET0
12: <== Exiting  ROMS_SetServices, PET0
12: ==> Entering ROMS_SetInitializeP1, PET0
12: ==> Entering ROMS_Create, PET0
12: <== Exiting  ROMS_Create, PET0
12: ==> Entering ROMS_SetInitializeP2, PET0
12:  stdout =           60

The code enters ROMS_SetInitializeP2 and calls ROMS_initialization. The stdout value is 60 even though it is set to 6 in ROMS/Modules/mod_iounits.F. Do you have any idea why? Maybe some memory corruption is playing a role here, but I am not sure. There is definitely a bug in the code that we could not catch with Intel. It might be nice to run the ROMS native coupled configuration (outside of UFS) with the GNU compiler to see what happens.
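A minimal diagnostic sketch of this kind of check (hypothetical names, not the actual ROMS print statements): INQUIRE reports whether a unit number is currently connected and, if so, to which file.

subroutine report_stdout (stdout)
  implicit none
  integer, intent(in) :: stdout
  logical :: is_open
  character (len=256) :: fname

  print *, ' stdout = ', stdout
  inquire (unit=stdout, opened=is_open, name=fname)
  if (is_open) then
    print *, ' unit is connected to: ', trim(fname)
  else
    print *, ' unit is NOT connected; calling the FLUSH intrinsic on it fails under gfortran'
  end if
end subroutine report_stdout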

@uturuncoglu (Collaborator Author)

@hga007 I put a couple of print statements around the ROMS_initialization call in ROMS/ROMS/Drivers/nl_roms.h, and it seems that stdout changes after calling the inp_par routine. I'll add more print statements to find the exact point where the value of stdout changes from 6 to 60.

@uturuncoglu (Collaborator Author)

@hga007 Okay, I found it. The following statement is changing the value: https://github.com/myroms/roms/blob/b7a47408ba224a4703d9122c33995eee3c5ed06c/ROMS/Utility/inp_par.F#L94
So, stdout starts as 6 and is then changed to 60 by this routine, which leads to the crash in the flush() call. Anyway, I think there is a bug in the code. Let me know what you think.
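A minimal, self-contained sketch of the failure mode (this is not the ROMS my_flush routine): the FLUSH intrinsic subroutine is called on a unit number that was reassigned but never OPENed, which gfortran reports as "Specified UNIT in FLUSH is not connected".

program flush_unconnected
  implicit none
  integer :: stdout

  stdout = 6
  write (stdout,*) 'unit 6 is preconnected, so this flush is harmless'
  call flush (stdout)

  stdout = 60              ! reassigned (as inp_par.F does), but no OPEN was issued
  call flush (stdout)      ! gfortran runtime error: unit not connected
end program flush_unconnected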

@uturuncoglu (Collaborator Author)

@hga007 Maybe we need to disable ROMS_STDOUT for UFS. Not sure.

@uturuncoglu (Collaborator Author)

@hga007 BTW, I don't think changing stdout in the inp_par call is logical. It is set in ROMS/ROMS/Modules/mod_iounits.F, and if ROMS wants to use a different unit number for stdout, then that assignment should live in mod_iounits.F.

@uturuncoglu (Collaborator Author)

@hga007 Okay, I could get further without ROMS_STDOUT, but now it fails with the following, similar error:

13: At line 3170 of file /work2/noaa/nems/tufuk/COASTAL/ufs-coastal/tests/build_fv3_coastal/ROMS-interface/ROMS/f90/esmf_roms.f90 (unit = 77)
13: Fortran runtime error: Specified UNIT in FLUSH is not connected

So, there is probably a bug on the ROMS side in how these unit numbers are handled. I checked unit number 77, and it appears in the code as follows:

../ROMS-interface/ROMS/Master/mod_esmf_esm.F:      integer :: cplout  = 77             ! coupling driver
../ROMS-interface/ROMS/Master/mod_esmf_esm.F:      integer :: dataout = 77             ! data component
../ROMS-interface/ROMS/Master/cmeps_roms.h:      integer :: cplout  = 77         ! coupling driver

So, it seems we also have an issue with cplout. Maybe the code is not designed to work without ROMS_STDOUT, and this crash is just a side effect of that. Anyway, I could dig more and try to fix it, but since you have more experience on the model side, maybe we need to think about how these unit numbers are handled in the coupled application. At this point, I am switching to other issues in SCHISM since we now know the issue with GNU.
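For reference, one defensive pattern (a sketch, not the actual ROMS my_flush) that would avoid the runtime error for any of these hard-coded units (60, 77, ...): only flush a unit after confirming it is connected.

subroutine safe_flush (lun)
  implicit none
  integer, intent(in) :: lun
  logical :: is_open

  inquire (unit=lun, opened=is_open)
  if (is_open) flush (lun)    ! F2003 FLUSH statement, issued only for connected units
end subroutine safe_flush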

@hga007 commented Mar 26, 2024

@uturuncoglu: Sorry, this week I am swamped. Andy Moore is visiting to work on very difficult ROMS-JEDI codes, and we have four days to make progress on various issues.

The issue here is that we split the standard output for every coupled component in the native coupling scheme. Otherwise, tracking each component's very verbose standard output would take a lot of work. Thus, I open arbitrary Fortran units for the components I control and leave stdout=6 to the atmosphere component. Is the UFS using those Fortran units?

This logic always worked for me. Is it a compiler bug? The compiler may dislike initializing and rewriting variables in the module. Notice that I am not using parameter statements for those units. For years, we have done that with Ifort and gfortran. Since ifort will be deprecated this year, we are moving to ifx and icx.

I started coding a ROMS-to-ROMS interpolator using the ESMF/NUOPC interpolation utility with RegridStore. I think I sent you an email about it, but I may have an additional question about the RouteHandle for 3D fields where the interpolation is level-by-level.

@uturuncoglu (Collaborator Author)

@hga007 No worries. It is not urgent, but it needs to be resolved since ROMS coupling is not working with the GNU compiler. I am not sure which component on the UFS side uses those units, but I think the implementation needs to be pruned of them. ESMF has a specific call for finding an unused unit number. Rather than using separate files like log.roms, log.coupler, etc., I think the NUOPC cap should direct all of the log output to the PET files using ESMF calls; this would eliminate these kinds of issues in the future. As I see from the implementation, the log units are changed outside of the cap. It might be nice to move that into the cap since it is related to the coupling. I don't think this is a compiler bug; it seems to be a logic issue in the management of these files, and only GNU catches it. I think ROMS internally should use stdout (unit 6) for all of its logging, and the cap should then use ESMF calls to write its log to the PET files. This would split the logging between the model and the coupling interface.

Sorry, I missed your mail and forgot to reply. I'll reply to your mail directly regarding using ESMF for the pre/post processing.
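For reference, a sketch (hypothetical routine name, not ROMS or cap code) of two ways to obtain a free unit number instead of hard-coding 60 or 77: the ESMF utility ESMF_UtilIOUnitGet mentioned above, or the standard Fortran 2008 newunit= specifier.

subroutine open_component_log (filename, lun, rc)
  use ESMF, only : ESMF_UtilIOUnitGet, ESMF_SUCCESS
  implicit none
  character (len=*), intent(in)  :: filename
  integer,           intent(out) :: lun, rc

  ! Option 1: ask ESMF for a currently unconnected Fortran unit.
  call ESMF_UtilIOUnitGet (unit=lun, rc=rc)
  if (rc /= ESMF_SUCCESS) return
  open (unit=lun, file=filename, status='replace', action='write')

  ! Option 2 (standard Fortran 2008): let the runtime pick the unit instead:
  !   open (newunit=lun, file=filename, status='replace', action='write')
end subroutine open_component_log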

@uturuncoglu (Collaborator Author)

Here is the update from Hernan:

Hi Ufuk,

I made several corrections and a few changes in ROMS to address the issues we were having running ROMS in the UFS.

They are loaded into feature/stdinp in [github.com/myroms](http://github.com/myroms). Can you give it a try to see what you get?

Thank you, H

So, I'll try that branch and let him know about the results.

@hga007 commented Apr 17, 2024

@ufuk: I updated to the latest UFS-coastal and I got the following compiling error with gfortran:

[ 92%] Building Fortran object CMEPS-interface/CMakeFiles/cmeps.dir/CMEPS/mediator/med_utils_mod.F90.o
cd /home/arango/ROMS/Projects/IRENE/Coupling/roms_data_cmeps/Build_ufs/CMEPS-interface && /bin/gfortran -DCPRGNU -I/home/arango/ROMS/Projects/IRENE/Coupling/roms_data_cmeps/Build_ufs/CMEPS-interface/mod -I/opt/spack-stack/gcc/11.4.1/esmf-8.6.0-7bqs55m/include -I/opt/spack-stack/gcc/11.4.1/netcdf-c-4.9.2-isuwkgl/include -I/opt/spack-stack/gcc/11.4.1/netcdf-fortran-4.6.1-7udf3eb/include -I/opt/spack-stack/gcc/11.4.1/parallelio-2.6.2-murvht3/include -ggdb -fbacktrace -cpp -fcray-pointer -ffree-line-length-none -fno-range-check -fallow-argument-mismatch -fallow-invalid-boz -fdefault-real-8 -fdefault-double-8 -g -fbacktrace -ffree-line-length-none -fallow-argument-mismatch -fallow-invalid-boz -O2 -Jmod -fopenmp -c /home/arango/ocean/repository/git/ufs-coastal/CMEPS-interface/CMEPS/mediator/med_utils_mod.F90 -o CMakeFiles/cmeps.dir/CMEPS/mediator/med_utils_mod.F90.o
/home/arango/ocean/repository/git/ufs-coastal/CMEPS-interface/CMEPS/mediator/med_utils_mod.F90:37:9:

   37 |     use mpi , only : MPI_ERROR_STRING, MPI_MAX_ERROR_STRING, MPI_SUCCESS
      |         1
Fatal Error: Cannot open module file ‘mpi.mod’ for reading at (1): No such file or directory
compilation terminated.
make[2]: *** [CMEPS-interface/CMakeFiles/cmeps.dir/build.make:556: CMEPS-interface/CMakeFiles/cmeps.dir/CMEPS/mediator/med_utils_mod.F90.o] Error 1
make[2]: Leaving directory '/storage/arango/ROMS/Projects/IRENE/Coupling/roms_data_cmeps/Build_ufs'

If we replace /bin/gfortran with mpif90 in that command, it compiles because the wrapper knows where to find mpi.mod. Is there an error somewhere in the CMake files?

@uturuncoglu (Collaborator Author)

@hga007 Thanks for testing. Let me check on my side. I am syncing ufs-coastal with the ufs-weather-model at this point. Once that is done, I'll work on this. I'll keep you updated.

@uturuncoglu (Collaborator Author)

@hga007 BTW, I think you are testing on your local cluster, right? I am wondering if I can reproduce it on another supported platform like Orion or Hercules.

@hga007 commented Apr 19, 2024

@uturuncoglu:

  • I am testing on my Linux box and am getting mixed results. My old version of ufs-coastal (Oct 2023) compiles and runs well to completion with ifort. However, I get an ESMF error with gfortran:
20240418 174247.898 ERROR            PET09 Failure  - Driver and ROMS start times do not match: please check the config files

This is bizarre. Why does the compiler that we use affect the ESMF clock configuration? It doesn't make any sense!

  • If I use the latest version of ufs-coastal, it blows up with ifort, and I get the same error with gfortran. I have fixed all the issues you were having with gfortran in ROMS. I wouldn't say I like that compiler. I cannot use the latest version of OpenMPI because the mpirun wrapper doesn't allow the TotalView flag -tv.
  • This happens when I run the DATA-ROMS coupling system for IRENE with CDEPS and CMEPS.
  • Your newest version of ufs-coastal cannot find mpi.mod with the Spack-Stack version that I have for gfortran. It seems that you made changes to the CMake files, which overwrite the FC and CC environment variables. We have recommended in the past that you avoid using environment variables to specify the compiler because of unexpected behaviors.
  • It is using /bin/gfortran instead of the mpif90 wrapper to compile (see below). This needs to be more robust. The Oct 2023 version works because it gets the MPI module from -I/opt/spack-stack/gcc/11.4.1/openmpi-5.0.1-svju6pb/lib. In the new version, this path is not included when compiling. Thus, it failed, and I had to do it manually four times. It is messy and undesirable. We have yet to find the culprit. There are too many configuration scripts.
cd /home/arango/ROMS/Projects/IRENE/Coupling/roms_data_cmeps/Build_ufs/CMEPS-interface && /bin/gfortran -DCPRGNU -I/home/arango/ROMS/Projects/IRENE/Coupling/roms_data_cmeps/Build_ufs/CMEPS-interface/mod -I/opt/spack-stack/gcc/11.4.1/fms-2023.04-x7miv6p/include_r8 -I/opt/spack-stack/gcc/11.4.1/netcdf-fortran-4.6.1-7udf3eb/include -I/opt/spack-stack/gcc/11.4.1/netcdf-c-4.9.2-isuwkgl/include -I/opt/spack-stack/gcc/11.4.1/openmpi-5.0.1-svju6pb/include -I/opt/spack-stack/gcc/11.4.1/openmpi-5.0.1-svju6pb/lib -I/opt/spack-stack/gcc/11.4.1/esmf-8.6.0-7bqs55m/include -I/opt/spack-stack/gcc/11.4.1/parallelio-2.6.2-murvht3/include -g -fbacktrace -ffree-line-length-none -fallow-argument-mismatch -fallow-invalid-boz -O2 -Jmod -fopenmp -c /home/arango/ocean/repository/git/ufs-coastal_old/CMEPS-interface/CMEPS/mediator/med_utils_mod.F90 -o CMakeFiles/cmeps.dir/CMEPS/mediator/med_utils_mod.F90.o
/opt/spack-stack/gcc/11.4.1/cmake-3.23.1-nksdcyi/bin/cmake -E cmake_copy_f90_mod CMEPS-interface/mod/glc_elevclass_mod.mod CMEPS-interface/CMakeFiles/cmeps.dir/glc_elevclass_mod.mod.stamp GNU

I don't have more time to look into this, and I am running out of ideas.

@uturuncoglu (Collaborator Author)

@hga007 Sorry, I could not get back to this. I am trying to sync ufs-coastal and fix a couple of issues at this point. I don't think this is an issue related to the compiler or ESMF; there might be a bug on the ROMS side that only appears under GNU. Once I finalize that work, I'll look at ROMS next. I could also test this on an already supported system, so we could see what happens on Hercules. I also noticed that even when the files are fine when checked with NCAR's cprnc tool, the baseline check still fails with the nc compare tool. So, I also need to look at that.

@uturuncoglu (Collaborator Author) commented Apr 20, 2024

@hga007 I tested your branch on MSU's Hercules using the GNU compiler. It compiles without any issue in my case. So, the error that you get with the mpi use statement could be specific to your environment.

I am not sure what is wrong at this point, but the run crashes with the following error:

13: Fatal error in PMPI_Bcast:
13: Invalid communicator, error stack:
13: PMPI_Bcast(1645): MPI_Bcast(buf=0x7ffd8bb0ddb0, count=1, MPI_INTEGER, root=0, comm=0x0) failed
13: PMPI_Bcast(1546): Invalid communicator
13: 

and the last log line that I see in the out file is 22: ==> Entering ROMS_SetInitializeP1, PET0. There is also no error in any of the PET files, and log.roms is empty. I wonder if you changed anything in the configuration files? Do I need to update the input files? Otherwise, I have no clue, and I am not getting the clock-related error you reported.

I also tried to run with the Intel compiler to make sure it still works with Intel. That also failed with the same error.

At least, the following hash (currently used by ufs-coastal) is fine with Intel:

commit f309f5ab41de03e5b3ff5f28221b478a252b25f6 (HEAD -> develop, origin/develop, origin/HEAD)
Author: Hernan G. Arango <arango@marine.rutgers.edu>
Date:   Tue Jan 30 12:50:11 2024 -0500

So, let me know if I need to change anything to reproduce your error. Maybe I could fix it if I can reproduce the issue.

@hga007 commented May 1, 2024

@uturuncoglu: The coupling with gfortran compiles and runs with both the old (Oct 2023) and new versions of ufs-coastal. My issues were solved by cleaning the YAML parser while compiling aggressively with debugging -g flags. The clock mismatch was solved by initializing a logical variable that happened to be false under ifort and true under gfortran; uninitialized logical switches in gfortran have peculiar behavior. Since our Spack-Stack uses OpenMPI 5, we can't debug gfortran in TotalView. It was so annoying, and I had to put many PRINT statements in place to solve the issues I was having.
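For context, a tiny sketch of the uninitialized-logical pitfall described above (hypothetical names, not the ROMS/CMEPS code): the value of an unassigned LOGICAL is undefined, so different compilers can take different branches.

program uninit_logical
  implicit none
  logical :: clocks_checked          ! never assigned on some code path

  if (clocks_checked) then           ! result is compiler/stack dependent
    print *, 'start times assumed consistent'
  end if
end program uninit_logical

The fix is an explicit initialization, e.g. logical :: clocks_checked = .false.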

The old version of the UFS runs successfully to the end with both gfortran and ifort for CDEPS and CMEPS configurations. However, ROMS blows up after 300 timesteps with the new UFS code for all cases with both compilers. Something has changed in the configuration. I am still using the Oct 2023 configuration to run successfully. I have not figured out what changed since I didn't run your regression scripts. If you regenerate those scripts (datm_in, datm.streams, model_configure, and ufs.configure) I can try again with the new UFS code. We need better descriptions of those configuration scripts for some of us who are not interested in running the regression test but are interested in applications outside of NOAA computers. The build_ufs.csh or build_ufs.sh we created is much better and works well for us.

I have yet to load all those changes to the github.com/myroms/roms develop branch. They are still in feature/stdinp if you want to try it.

Since October 2023, the UFS has undergone many changes, and I need help tracking what is relevant to the ROMS applications.

@uturuncoglu (Collaborator Author)

@hga007 Okay, thanks. Let me try the branch again to see what happens. I wonder if I need to add any extra CPP flag in the UFS-level CMake build layer?

@uturuncoglu (Collaborator Author)

@hga007 I am still getting the following error with your branch:

14: Fatal error in PMPI_Bcast:
14: Invalid communicator, error stack:
14: PMPI_Bcast(1645): MPI_Bcast(buf=0x7ffcb6e1c580, count=1, MPI_INTEGER, root=0, comm=0x0) failed
14: PMPI_Bcast(1546): Invalid communicator
14:

Anyway, if you would like to try with a newer version of ufs-coastal, we could meet and discuss it, but at this point GNU is not working for ROMS under ufs-coastal.

@hga007 commented May 3, 2024

@uturuncoglu: Would you happen to have more information about that error? Maybe you'll need to compile with full debugging flags to get more information. I don't get that error at all with gfortran.

Yes, we can chat to see if we can figure out how to resolve this problem.

@uturuncoglu (Collaborator Author)

@hga007 There is no other error message. Next week, Wednesday afternoon around 1 PM MT is fine for me. If that is okay, I could set up a call.

@hga007 commented May 3, 2024

@uturuncoglu: Yes, it works for me. Thank you!

@uturuncoglu (Collaborator Author) commented May 3, 2024

@hga007 Okay. I have just sent the invitation.

@uturuncoglu (Collaborator Author)

@hga007 Okay, I am back to this issue. I have just tested the ROMS head of develop, and it works without any issue with Intel. I'll test your branch https://github.com/myroms/roms/tree/feature/stdinp with Intel first to see what happens. If it works, I'll test with GNU.

@uturuncoglu (Collaborator Author) commented Jul 5, 2024

@hga007 I am looking at the differences in this branch (feature/stdinp), and it seems to have lots of changes that are not related to the log issue. Is it possible to have a lightweight branch that contains only the fix for the issue we are seeing under GNU? That way we could isolate the problem we have under UFS Coastal. Let me know what you think. If you point me to the files that you fixed for this issue, I could try to create another branch based on develop and test it on my side.

@hga007 commented Jul 5, 2024

@uturuncoglu: The changes are all due to the GNU fix for the standard input unit, which now works with gfortran. It was more involved than I thought.

@uturuncoglu (Collaborator Author)

@hga007 It would be nice to split developments into meaningful chunks to find the issue. Maybe the problem that you see is not related to the fix that we need for this issue. So, rather than having a branch that modifies 111 files, let's do it in small chunks.

@uturuncoglu (Collaborator Author)

@hga007 Okay. If all those changes are required for the GNU fix, then let me try one more time under UFS Coastal to see what happens.

@uturuncoglu (Collaborator Author)

@hga007 I don't think I need to add any specific switch to compile for this fix, but let me know if I do. I am building the model with -DAPP=CSTLR -DMY_CPP_FLAGS=BULK_FLUXES, and the rest of the options come from the CMake interface at the UFS Coastal level.

@uturuncoglu (Collaborator Author)

@hga007 Still the same: your fix branch gives the following error and dies,

12: Abort(597253) on node 12 (rank 12 in comm 0): Fatal error in internal_Bcast: Invalid communicator, error stack:
12: internal_Bcast(1097): MPI_Bcast(buffer=0x7fff086d9bec, count=1, MPI_INTEGER, 0, comm=0x0) failed
12: internal_Bcast(1034): Invalid communicator

The last thing that I see is that it enters 12: ==> Entering ROMS_SetInitializeP1, PET0. So, there is probably still some issue in there. BTW, this is just Intel, not GNU, and the head of develop is fine.

@uturuncoglu (Collaborator Author)

@hga007 I am compiling with -DDEBUG=ON on the UFS side to see if I can get more information and a trace.

@hga007 commented Jul 5, 2024

@uturuncoglu: I cannot do that now because it will break a lot of stuff. Most of the changes are due to the intrinsic FLUSH routine related to how the standard input is managed in ROMS, which is associated with the fix for gfortran. It cannot be split. Unless we forget about gfortran and force ROMS not to support it, which is radical, we cannot do it for the UFS. This issue is associated with the coupling strategy that does not use the native framework.

I convinced myself that the issues I had were due to multiple changes in CMEPS that you cannot revert. It concerns the interpolation next to the land/sea mask. It works for me with the much older UFS version that I saved. I spent days in the debugger and compared the inputs with the ones you provided, but I have not yet been successful.

It is a peculiar problem. I can tell you that my changes work for me, but not for you. There may be something that our computers ignore, but you get issues with your computers. I can try again to see if that is still the case. Like I said, it has been a while since I've tried.

@uturuncoglu (Collaborator Author)

@hga007 I think I got the trace by running it with DDT. Here it is:

#26 ufs () at /work2/noaa/nems/tufuk/COASTAL/ufs-weather-model_om/driver/UFS.F90:389 (at 0x42c474)
#25 esmf_gridcompmod_mp_esmf_gridcompinitialize_ () at /work/noaa/epic/role-epic/spack-stack/hercules/spack-stack-1.6.0/cache/build_stage/spack-stage-esmf-8.6.0-rqrapepmgfb7kpri3ynqlxusquf6npfq/spack-src/src/Superstructure/Component/src/ESMF_GridComp.F90:1419 (at 0xdc8e31)
#24 esmf_compmod_mp_esmf_compexecute_ () at /work/noaa/epic/role-epic/spack-stack/hercules/spack-stack-1.6.0/cache/build_stage/spack-stage-esmf-8.6.0-rqrapepmgfb7kpri3ynqlxusquf6npfq/spack-src/src/Superstructure/Component/src/ESMF_Comp.F90:1252 (at 0x9ef030)
#23 c_esmc_ftablecallentrypointvm_ () at /work/noaa/epic/role-epic/spack-stack/hercules/spack-stack-1.6.0/cache/build_stage/spack-stage-esmf-8.6.0-rqrapepmgfb7kpri3ynqlxusquf6npfq/spack-src/src/Superstructure/Component/src/ESMCI_FTable.C:981 (at 0xadceea)
#22 ESMCI::VM::enter(ESMCI::VMPlan*, void*, void*) () at /work/noaa/epic/role-epic/spack-stack/hercules/spack-stack-1.6.0/cache/build_stage/spack-stage-esmf-8.6.0-rqrapepmgfb7kpri3ynqlxusquf6npfq/spack-src/src/Infrastructure/VM/src/ESMCI_VM.C:1216 (at 0x12e4b08)
#21 ESMCI::VMK::enter(ESMCI::VMKPlan*, void*, void*) () at /work/noaa/epic/role-epic/spack-stack/hercules/spack-stack-1.6.0/cache/build_stage/spack-stage-esmf-8.6.0-rqrapepmgfb7kpri3ynqlxusquf6npfq/spack-src/src/Infrastructure/VM/src/ESMCI_VMKernel.C:2501 (at 0x114fcfa)
#20 ESMCI_FTableCallEntryPointVMHop () at /work/noaa/epic/role-epic/spack-stack/hercules/spack-stack-1.6.0/cache/build_stage/spack-stage-esmf-8.6.0-rqrapepmgfb7kpri3ynqlxusquf6npfq/spack-src/src/Superstructure/Component/src/ESMCI_FTable.C:824 (at 0xadfa0f)
#19 ESMCI::FTable::callVFuncPtr(char const*, ESMCI::VM*, int*) () at /work/noaa/epic/role-epic/spack-stack/hercules/spack-stack-1.6.0/cache/build_stage/spack-stage-esmf-8.6.0-rqrapepmgfb7kpri3ynqlxusquf6npfq/spack-src/src/Superstructure/Component/src/ESMCI_FTable.C:2167 (at 0xadba94)
#18 nuopc_driver_mp_initializegeneric_ () at /work/noaa/epic/role-epic/spack-stack/hercules/spack-stack-1.6.0/cache/build_stage/spack-stage-esmf-8.6.0-rqrapepmgfb7kpri3ynqlxusquf6npfq/spack-src/src/addon/NUOPC/src/NUOPC_Driver.F90:484 (at 0x9c330b)
#17 nuopc_driver_mp_initializeipdv02p1_ () at /work/noaa/epic/role-epic/spack-stack/hercules/spack-stack-1.6.0/cache/build_stage/spack-stage-esmf-8.6.0-rqrapepmgfb7kpri3ynqlxusquf6npfq/spack-src/src/addon/NUOPC/src/NUOPC_Driver.F90:1316 (at 0x9b9494)
#16 nuopc_driver_mp_loopmodelcompss_ () at /work/noaa/epic/role-epic/spack-stack/hercules/spack-stack-1.6.0/cache/build_stage/spack-stage-esmf-8.6.0-rqrapepmgfb7kpri3ynqlxusquf6npfq/spack-src/src/addon/NUOPC/src/NUOPC_Driver.F90:2889 (at 0x9911f0)
#15 esmf_gridcompmod_mp_esmf_gridcompinitialize_ () at /work/noaa/epic/role-epic/spack-stack/hercules/spack-stack-1.6.0/cache/build_stage/spack-stage-esmf-8.6.0-rqrapepmgfb7kpri3ynqlxusquf6npfq/spack-src/src/Superstructure/Component/src/ESMF_GridComp.F90:1419 (at 0xdc8e31)
#14 esmf_compmod_mp_esmf_compexecute_ () at /work/noaa/epic/role-epic/spack-stack/hercules/spack-stack-1.6.0/cache/build_stage/spack-stage-esmf-8.6.0-rqrapepmgfb7kpri3ynqlxusquf6npfq/spack-src/src/Superstructure/Component/src/ESMF_Comp.F90:1252 (at 0x9ef030)
#13 c_esmc_ftablecallentrypointvm_ () at /work/noaa/epic/role-epic/spack-stack/hercules/spack-stack-1.6.0/cache/build_stage/spack-stage-esmf-8.6.0-rqrapepmgfb7kpri3ynqlxusquf6npfq/spack-src/src/Superstructure/Component/src/ESMCI_FTable.C:981 (at 0xadceea)
#12 ESMCI::VM::enter(ESMCI::VMPlan*, void*, void*) () at /work/noaa/epic/role-epic/spack-stack/hercules/spack-stack-1.6.0/cache/build_stage/spack-stage-esmf-8.6.0-rqrapepmgfb7kpri3ynqlxusquf6npfq/spack-src/src/Infrastructure/VM/src/ESMCI_VM.C:1216 (at 0x12e4b08)
#11 ESMCI::VMK::enter(ESMCI::VMKPlan*, void*, void*) () at /work/noaa/epic/role-epic/spack-stack/hercules/spack-stack-1.6.0/cache/build_stage/spack-stage-esmf-8.6.0-rqrapepmgfb7kpri3ynqlxusquf6npfq/spack-src/src/Infrastructure/VM/src/ESMCI_VMKernel.C:1247 (at 0x114ff0a)
#10 ESMCI_FTableCallEntryPointVMHop () at /work/noaa/epic/role-epic/spack-stack/hercules/spack-stack-1.6.0/cache/build_stage/spack-stage-esmf-8.6.0-rqrapepmgfb7kpri3ynqlxusquf6npfq/spack-src/src/Superstructure/Component/src/ESMCI_FTable.C:824 (at 0xadfa0f)
#9 ESMCI::FTable::callVFuncPtr(char const*, ESMCI::VM*, int*) () at /work/noaa/epic/role-epic/spack-stack/hercules/spack-stack-1.6.0/cache/build_stage/spack-stage-esmf-8.6.0-rqrapepmgfb7kpri3ynqlxusquf6npfq/spack-src/src/Superstructure/Component/src/ESMCI_FTable.C:2167 (at 0xadba94)
#8 cmeps_roms_mod::roms_setinitializep1 (model=(...), importstate=(...), exportstate=(...), clock=(...), rc=0) at /work2/noaa/nems/tufuk/COASTAL/ufs-weather-model_om/tests/build_fv3_coastal/ROMS-interface/ROMS/f90/esmf_roms.f90:1592 (at 0x220108e)
#7 stdout_mod::stdout_unit (mymaster=.TRUE.) at /work2/noaa/nems/tufuk/COASTAL/ufs-weather-model_om/tests/build_fv3_coastal/ROMS-interface/ROMS/f90/stdout_mod.f90:106 (at 0x3c713c2)
#6 distribute_mod::mp_bcasti_0d (ng=1, model=1, a=0, inpcomm=<error reading variable: Cannot access memory at address 0x0>) at /work2/noaa/nems/tufuk/COASTAL/ufs-weather-model_om/tests/build_fv3_coastal/ROMS-interface/ROMS/f90/distribute.f90:615 (at 0x3701ff3)
#5 pmpi_bcast_ (v1=0x14e1b88668e0, v2=0x0, v3=0xc091d05, v4=0x7ffe38fff818, v5=0x0, ierr=0x7ffe38fff600) at /build/impi/_buildspace/release/../../src/binding/fortran/mpif_h/bcastf.c:269 (at 0x14e1b8e703e4)
#4 PMPI_Bcast (buffer=0x14e1b88668e0, count=0, datatype=201923845, root=956299288, comm=0) at /build/impi/_buildspace/release/../../src/binding/intel/c/c_binding.c:1112 (at 0x14e1b75b37f6)
#3 internal_Bcast () at /build/impi/_buildspace/release/../../src/binding/intel/c/c_binding.c:1101 (at 0x14e1b75b37f6)
#2 MPIR_Err_return_comm (comm_ptr=0x14e1b88668e0, fcname=0x0, errcode=201923845) at /build/impi/_buildspace/release/../../src/mpi/errhan/errutil.c:301 (at 0x14e1b7786331)
#1 MPIR_Handle_fatal_error () at /build/impi/_buildspace/release/../../src/mpi/errhan/errutil.c:486 (at 0x14e1b7786331)
#0 MPID_Abort (comm=0x14e1b88668e0, mpi_errno=0, exit_code=201923845, error_msg=0x7ffe38fff818 \"Fatal error in internal_Bcast: Invalid communicator, error stack:\\ninternal_Bcast(1097): MPI_Bcast(buffer=0x7ffe39001ca8, count=1, MPI_INTEGER, 0, comm=0x0) failed\\ninternal_Bcast(1034): Invalid communi\"...) at /build/impi/_buildspace/release/../../src/mpid/ch4/src/ch4_globals.c:98 (at 0x14e1b7634810)

It seems that it is failing on stdout_mod::stdout_unit (mymaster=.TRUE.) at /work2/noaa/nems/tufuk/COASTAL/ufs-weather-model_om/tests/build_fv3_coastal/ROMS-interface/ROMS/f90/stdout_mod.f90:106

@uturuncoglu (Collaborator Author)

@hga007 There is a CALL mp_bcasti (1, 1, io_err) call on that line, and I don't know what it is doing. At least we have a lead here.
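For reference, a minimal sketch of the failure mode in the trace (hypothetical program, not the ROMS mp_bcasti code): MPI_Bcast invoked with a communicator handle of 0, matching the comm=0x0 in the backtrace, as would happen if a broadcast is issued before the component has received a valid communicator.

program bcast_bad_comm
  use mpi
  implicit none
  integer :: ocn_comm, io_err, ierr

  call MPI_Init (ierr)
  ocn_comm = 0             ! stands in for a communicator that was never set
  io_err   = 0
  call MPI_Bcast (io_err, 1, MPI_INTEGER, 0, ocn_comm, ierr)   ! Intel MPI: "Invalid communicator"
  call MPI_Finalize (ierr)
end program bcast_bad_comm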

@hga007 commented Jul 6, 2024

@uturuncoglu: Okay, I will look at it in the TotalView debugger.

@hga007 commented Jul 9, 2024 via email

@uturuncoglu (Collaborator Author)

-- adding the last mail from Hernan:
Hi Ufuk,

I made stdinp_mod.F and stdout_mod.F more robust. I removed the mp_bcast that you were having trouble with. Please give it a try with both ifort and gfortran. I think that it will work for you now.

Good luck, H

@uturuncoglu (Collaborator Author)

@hga007 Good news: Intel passes with your branch (feature/stdinp). I'll set up a GNU test to see what happens.

@uturuncoglu (Collaborator Author)

@hga007 I have just checked, and the GNU compiler is also fine now under UFS. I also created a GNU baseline on Hercules, and it reproduces itself. So, I think feature/stdinp is fine to merge into the head of develop on the ROMS side. Please let me know when you merge it; then I'll update the ROMS submodule under UFS Coastal and will probably close this issue at that time. Again, thanks for your help. This was a headache for us for a long time.

@uturuncoglu (Collaborator Author)

@janahaddad @pvelissariou1 @saeed-moghimi-noaa @hga007 I have just synced the ROMS fork under UFS Coastal, and it now points to the head of develop, which has the GNU fix. I also added an extra ROMS GNU test to rt_coastal.conf and created a baseline on Hercules (we might need to create baselines on other platforms as well, but it is not urgent; I am assuming Hercules is our base platform). Both the GNU and Intel RTs run without any problem. So, closing this issue.
