Random failure of SCHISM runs for hurricane ike 2008 (with 24 hr leadtime) #45
The workflow runs for the other lead times of ike_2008 without any issue!
I also ran it several times (on two different days), and each time a random number of runs failed:
I also ran one of the failed SCHISM runs separately, but it failed with the same error!
@pvelissariou1 have you seen this error before? It seems to be related to ParMETIS during initialization. In your opinion, does it point to an issue with the compute node libraries, or is it an issue with the mesh?
I checked your input files in /work2/noaa/nos-surge/shared/nhc_hurricanes/ike_2008_bc56cd29-7a6a-494a-babe-d82ea0636b8e/setup/ensemble.dir/spinup and all seem consistent, except of course the perturbed hurricane tracks. I don't think these random crashes are related to the SCHISM code itself. The issue may be with the way the simulation is run within singularity. I don't know how your singularity environment is configured, e.g., memory, network interface, MPI channel, etc.
@FariborzDaneshvar-NOAA Can you try the above to see if the issue is resolved? In singularity run the command:
@pvelissariou1 Thanks for your input. I updated
But then got this error message!
per
@fariborz, @SorooshMani-NOAA Then the calls are directed to the PMI channel. Let me check on that. Keep the
@pvelissariou1 Here is the slurm-*.out file: Thanks
@FariborzDaneshvar-NOAA Fariborz, can you replace
@FariborzDaneshvar-NOAA Could you please share your slurm submission script?
Here it is: schism.txt
I used
@FariborzDaneshvar-NOAA Could you please share the new slurm batch file? Thanks
@pvelissariou1 here it is:
As part of running the workflow for 6 lead times of 25 storms recommended by the NHC team, random SCHISM runs for Ike 2008 (with 24 hr lead time) failed with a segmentation fault. Here is one example error message for one of the 13 failed runs (run #5) out of the 31-member ensemble: /work2/noaa/nos-surge/shared/nhc_hurricanes/ike_2008_fdc63f00-2b10-4848-92af-7d64335e249a
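For anyone trying to reproduce the failure count, here is a minimal sketch for tallying which runs hit the segmentation fault. It assumes each run leaves a slurm-*.out log somewhere under a common directory; the directory layout, the helper name `find_failed_runs`, and the default marker string are assumptions for illustration, not part of the workflow itself.

```python
import sys
from pathlib import Path


def find_failed_runs(root, pattern="slurm-*.out", marker="segmentation fault"):
    """Return the sorted log files under `root` whose text contains `marker`.

    The match is case-insensitive. `root`, `pattern`, and `marker` are
    assumptions about how the logs are laid out, not workflow settings.
    """
    failed = []
    for log in sorted(Path(root).rglob(pattern)):
        # errors="replace" keeps the scan going if a log has stray bytes
        text = log.read_text(errors="replace")
        if marker.lower() in text.lower():
            failed.append(log)
    return failed


if __name__ == "__main__":
    # Usage: python find_failed_runs.py <directory containing the run logs>
    root_dir = sys.argv[1] if len(sys.argv) > 1 else "."
    logs = find_failed_runs(root_dir)
    print(f"{len(logs)} log(s) mention a segmentation fault")
    for path in logs:
        print(path)
```

Running this after each attempt and comparing the lists would show whether the same ensemble members fail every time or whether the set really changes between attempts.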