Random failure of SCHISM runs for Hurricane Ike 2008 (with 24 hr lead time) #45

Closed · FariborzDaneshvar-NOAA opened this issue Feb 29, 2024 · 15 comments · Labels: bug (Something isn't working)

@FariborzDaneshvar-NOAA (Collaborator):

As part of running the workflow for 6 lead times of the 25 storms recommended by the NHC team, random SCHISM runs for Ike 2008 (with a 24 hr lead time) failed with a segmentation fault. Here is an example error message from one of the 13 failed runs (run #5) out of the 31-member ensemble:

...
+ srun /work2/noaa/nos-surge/smani/bin//pschism_HERCULES_PAHM_BLD_STANDALONE_TVD-VL 4
[hercules-06-31:36479:0:36479] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x1c0252e10)
==== backtrace (tid:  36479) ====
 0 0x0000000000054d90 __GI___sigaction()  :0
 1 0x00000000009422eb libmetis__CreateCoarseGraphNoMask()  ???:0
 2 0x000000000093f8ec libmetis__Match_SHEM()  ???:0
 3 0x000000000093f21e libmetis__CoarsenGraph()  ???:0
 4 0x0000000000935b0f libmetis__MlevelRecursiveBisection()  ???:0
 5 0x0000000000935df4 libmetis__MlevelRecursiveBisection()  ???:0
 6 0x00000000009358ad METIS_PartGraphRecursive()  ???:0
 7 0x000000000091f518 libmetis__MlevelKWayPartitioning()  ???:0
 8 0x000000000091f084 METIS_PartGraphKway()  ???:0
 9 0x000000000083932d libparmetis__InitPartition()  ???:0
10 0x00000000008089ce libparmetis__Global_Partition()  ???:0
11 0x00000000008087f7 libparmetis__Global_Partition()  ???:0
12 0x00000000008087f7 libparmetis__Global_Partition()  ???:0
13 0x00000000008087f7 libparmetis__Global_Partition()  ???:0
14 0x00000000008087f7 libparmetis__Global_Partition()  ???:0
15 0x00000000008087f7 libparmetis__Global_Partition()  ???:0
16 0x0000000000805e47 ParMETIS_V3_PartGeomKway()  ???:0
17 0x000000000080527b parmetis_v3_partgeomkway_()  ???:0
18 0x00000000006afb03 partition_hgrid_.V()  grid_subs.F90:0
19 0x0000000000512965 schism_init_.V()  schism_init.F90:0
20 0x0000000000411b5f MAIN__()  ???:0
21 0x000000000041199d main()  ???:0
22 0x000000000003feb0 __libc_start_call_main()  ???:0
23 0x000000000003ff60 __libc_start_main_alias_2()  :0
24 0x00000000004118b5 _start()  ???:0
=================================
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image              PC                Routine            Line        Source             
libc.so.6          000014C253191D90  Unknown               Unknown  Unknown
pschism_HERCULES_  00000000009422EB  Unknown               Unknown  Unknown
pschism_HERCULES_  000000000093F8EC  Unknown               Unknown  Unknown
...
  • Run directory on Hercules: /work2/noaa/nos-surge/shared/nhc_hurricanes/ike_2008_fdc63f00-2b10-4848-92af-7d64335e249a
  • Example slurm-*.out file for a failed run: slurm-542899.txt
@FariborzDaneshvar-NOAA (Collaborator, Author):

The workflow is running for the other lead times of ike_2008 without any issue!

@FariborzDaneshvar-NOAA (Collaborator, Author):

I also ran it several times (on two different days), and each time a random number of runs failed:

| run directory | No. of failed runs |
| --- | --- |
| ike_2008_fdc63f00-2b10-4848-92af-7d64335e249a | 13 (5,6,7,12,13,19,21,22,25,27,28,29,30) |
| ike_2008_bc56cd29-7a6a-494a-babe-d82ea0636b8e | spinup |
| ike_2008_7d722a89-0f3e-43fe-b2aa-da1f8c571456 | spinup |
| ike_2008_3b071f55-2544-4bde-8992-dddd4a1a83f2 | spinup |
| ike_2008_759ad86f-b237-4fef-9437-1cdb3f458b13 | 25 (1,2,3,4,5,6,8,9,12,13,14,15,16,17,18,19,20,21,23,25,26,27,28,29,30) |
| ike_2008_c440364a-b6a5-42c9-942b-b20796bfd951 | 25 (2,3,4,5,6,7,8,9,10,12,13,14,15,16,17,18,19,20,21,22,24,25,26,28,30) |

@FariborzDaneshvar-NOAA (Collaborator, Author):

I also ran one of the failed SCHISM runs separately, but it failed with the same error!

@SorooshMani-NOAA (Collaborator):

@pvelissariou1 have you seen this error before? It seems to be related to ParMETIS during initialization. Does it point to an issue with the compute-node libraries, or is it some issue with the mesh, in your opinion?

@pvelissariou1:

I checked your input files in /work2/noaa/nos-surge/shared/nhc_hurricanes/ike_2008_bc56cd29-7a6a-494a-babe-d82ea0636b8e/setup/ensemble.dir/spinup and all seem consistent except, of course, the perturbed hurricane tracks. I don't think these random crashes are related to the SCHISM code itself. Maybe the issue is with the way the simulation is being run within Singularity. I don't know how your Singularity environment is configured, e.g., memory, network interface, MPI channel, etc.
In your slurm job submission script you may add:

ulimit -l unlimited
ulimit -s unlimited (I think this one is already set in the script)

export I_MPI_DEBUG=10 (for debugging purposes; can be deleted afterwards)
export KMP_STACKSIZE=20480000000 (~20 GB)
export KMP_AFFINITY=verbose,granularity=core,compact,1,0 (comment/uncomment to check its usefulness)

@FariborzDaneshvar-NOAA Can you try the above to see if the issue is resolved?

Inside Singularity, first run ulimit -a to see how the limits are set (watch for the -l and -s options in the ulimit output).
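
For reference, a minimal sketch of how these additions might sit in the sbatch script. The #SBATCH values are placeholders, and $BIN_DIR is assumed to point at /work2/noaa/nos-surge/smani/bin as in the log above:

```bash
#!/bin/bash
#SBATCH --job-name=schism_ike   # placeholder job name
#SBATCH --ntasks=80             # placeholder core count

# Raise memory-lock and stack limits before launching the solver
ulimit -l unlimited
ulimit -s unlimited

# Verbose MPI startup diagnostics; remove once the crash is diagnosed
export I_MPI_DEBUG=10

# Large per-thread stack (~20 GB) and explicit thread pinning
export KMP_STACKSIZE=20480000000
export KMP_AFFINITY=verbose,granularity=core,compact,1,0

srun "$BIN_DIR/pschism_HERCULES_PAHM_BLD_STANDALONE_TVD-VL" 4
```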

@FariborzDaneshvar-NOAA (Collaborator, Author):

@pvelissariou1 Thanks for your input. I updated the schism.sbatch file and added the following lines (as you suggested) before the srun $BIN_DIR/pschism_HERCULES_PAHM_BLD_STANDALONE_TVD-VL 4 command:

ulimit -l unlimited
ulimit -s unlimited
export I_MPI_DEBUG=10
export KMP_STACKSIZE=20480000000
export KMP_AFFINITY=verbose,granularity=core,compact,1,0  

But then I got this error message!

Shell debugging restarted
+ unset __lmod_sh_dbg 
+ return 0              
+ export MV2_ENABLE_AFFINITY=0 
+ MV2_ENABLE_AFFINITY=0     
+ ulimit -l unlimited    
+ ulimit -s unlimited     
+ export I_MPI_DEBUG=10 
+ I_MPI_DEBUG=10   
+ export KMP_STACKSIZE=20480000000  
+ KMP_STACKSIZE=20480000000
+ export KMP_AFFINITY=verbose,granularity=core,compact,1,0
+ KMP_AFFINITY=verbose,granularity=core,compact,1,0 
+ srun /work2/noaa/nos-surge/smani/bin//pschism_HERCULES_PAHM_BLD_STANDALONE_TVD-VL 4 
MPIR_pmi_virtualization(): MPI startup(): PMI calls are forwarded to /opt/slurm/lib/libpmi2.so 
MPIR_pmi_virtualization(): MPI startup(): PMI calls are forwarded to /opt/slurm/lib/libpmi2.so 
MPIR_pmi_virtualization(): MPI startup(): PMI calls are forwarded to /opt/slurm/lib/libpmi2.so 
...

@FariborzDaneshvar-NOAA (Collaborator, Author):

Per ulimit -a:
max locked memory (kbytes, -l) unlimited
stack size (kbytes, -s) 16384
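
Worth noting: the stack size is still 16384 KB (16 MB) even though the script sets ulimit -s unlimited. If this reading was taken inside the container, it may mean limits set on the host are not propagating into Singularity. A quick way to compare the two environments, sketched below (schism.sif is a hypothetical image name, not the actual container):

```bash
# Limits as seen by the batch script on the host
ulimit -s
ulimit -l

# Limits as seen inside the container; raising the limit in the same
# shell that launches the model lets child processes inherit it
# (schism.sif is a hypothetical image name)
singularity exec schism.sif bash -c 'ulimit -s unlimited; ulimit -a'
```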

@pvelissariou1:

@fariborz, @SorooshMani-NOAA Then the calls are directed to the PMI channel. Let me check on that. Keep the ulimit -s unlimited and ulimit -l unlimited in the script. Can you share the full log file with me?

@FariborzDaneshvar-NOAA (Collaborator, Author):

@pvelissariou1 Here is the slurm-*.out file:
slurm-568745.txt

Thanks

@pvelissariou1:

@FariborzDaneshvar-NOAA Fariborz, can you replace srun with srun --mpi=pmi2 and resubmit to see what happens? Leave all other definitions in the script as they are. Can you also increase I_MPI_DEBUG to 100? Lastly, we need to recompile SCHISM with the debug options, a) cmake -DDEBUG=ON, b) make -DUSE_DEBUG, to actually get the lines/files where the problem is located.
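
Applied to the launch line from the logs above, the suggested changes would look roughly like this (only the --mpi flag and the debug level change; the binary path is copied from the failing run):

```bash
export I_MPI_DEBUG=100   # more verbose MPI startup diagnostics

# Ask srun for the PMI2 plugin explicitly so the Slurm launcher and
# the MPI runtime agree on the process-management interface
srun --mpi=pmi2 /work2/noaa/nos-surge/smani/bin/pschism_HERCULES_PAHM_BLD_STANDALONE_TVD-VL 4
```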

@pvelissariou1:

@FariborzDaneshvar-NOAA Could you please share your slurm submission script?

@FariborzDaneshvar-NOAA (Collaborator, Author):

> @FariborzDaneshvar-NOAA Could you please share your slurm submission script?

Here it is: schism.txt

@FariborzDaneshvar-NOAA (Collaborator, Author) commented Mar 7, 2024:

I used the schism.sbatch from @SorooshMani-NOAA's directory and it solved the problem! ✅
Thanks @pvelissariou1 and @SorooshMani-NOAA for your help and for looking into this!

@pvelissariou1:

@FariborzDaneshvar-NOAA Could you please share the new slurm batch file? Thanks

@FariborzDaneshvar-NOAA (Collaborator, Author):

@pvelissariou1 here it is:
schism.txt

@FariborzDaneshvar-NOAA self-assigned this on Mar 7, 2024
@FariborzDaneshvar-NOAA added the bug label on Mar 7, 2024