bug: fix bounds fix_bound_violations = .true. seems to be required for ifort #709

hkershaw-brown · 2024-08-06T19:11:35Z

Describe the bug

List the steps someone needs to take to reproduce the bug.

/glade/derecho/scratch/hkershaw/DART/Bugs/bgunn_qceff/DART/models/lorenz_96_tracer_advection/work
Following https://github.com/NCAR/DART/blob/l96_tracer_tests/models/lorenz_96_tracer_advection/work/TESTS/TEST_DRIVER.csh
reported by Ben Gunn: (thanks @Benjamin-Gunn !)
https://github.com/Benjamin-Gunn/DART/blob/l96_tracer_tests/models/lorenz_96_tracer_advection/work/TESTS/TEST_DRIVER.csh

qceff_table_filename = 'one_below_qceff_table.csv'

&filter_nml
inf_flavor = 5, 5,

&model_nml
model_size = 120,
forcing = 8.0,
delta_t = 0.05,
mean_velocity = 0.0,
pert_velocity_multiplier = 5.0,
diffusion_coef = 0.0,
e_folding = 0.25,
sink_rate = 0.1,
source_rate = 100.0,
point_tracer_source_rate = 5.0,
positive_tracer = .false.,
bound_above_is_one = .true.,
time_step_days = 0,
time_step_seconds = 3600,
/

What was the expected outcome?
I did not expect fix_bound_violations = .true. to be required so often.
What actually happened?
Failures for "Ensemble member greater than upper bound first check" at various pe counts.

You can set:

&probit_transform_nml
fix_bound_violations = .true.
/

However, you still get different answers across MPI task counts.

#!/bin/bash

module load nco

rm -f one_var_temp.nc
ncrcat -d location,1,1 filter_output.nc one_var_temp.nc
ncks -V -C -v state_variable_mean one_var_temp.nc | tail -3 | head -1 >> test_output
rm -f  one_var_temp.nc
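
To check at a glance whether the values accumulated in test_output agree across runs, something like the following could be used. This is a minimal sketch: the function name all_runs_match is hypothetical, and it assumes test_output holds one value per line, as written by the script above.

```shell
#!/bin/bash
# Hypothetical helper (not part of DART): succeed only if every non-empty
# line of the given file is identical after trimming surrounding whitespace.
all_runs_match() {
    local file="$1"
    awk 'NF { $1 = $1; seen[$0] = 1 }           # normalise whitespace, record the line
         END { n = 0; for (k in seen) n++       # count the distinct lines
               exit (n != 1) }' "$file"         # exit status 0 only when exactly one
}
```

After running the script at several PE counts, `all_runs_match test_output && echo same || echo differ` reports whether the runs reproduced each other. It compares the printed text, so it only detects differences that show up in the ncks output.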

varying pe count:
7.95979093017264 ;
8.02126025256388 ;
8.55748257662756 ;

varying pe count with -fp-model precise:
8.62082489125036 ;
8.62082489125036 ;
8.62082489125036 ;

I am not sure how much difference is acceptable across varying PE counts.
Note: I cannot reproduce the bounds violations with -fp-model precise.
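
One plausible contributor to answers that change with PE count but not with the precise floating-point model is that floating-point addition is not associative, so any reordering of arithmetic can change the last bits of a result. This is an assumption for illustration, not something established in this issue. A minimal sketch using awk's double-precision arithmetic (the function names are hypothetical):

```shell
#!/bin/bash
# Illustration only: the same three numbers summed with two different groupings.
# awk evaluates these in IEEE double precision on typical systems.
sum_left_first()  { awk 'BEGIN { printf "%.17g\n", (0.1 + 0.2) + 0.3 }'; }
sum_right_first() { awk 'BEGIN { printf "%.17g\n", 0.1 + (0.2 + 0.3) }'; }
```

Calling `sum_left_first` and `sum_right_first` prints two values that differ in the last digits. In a cycling assimilation, small differences like this can grow over many steps, which is why bitwise reproducibility usually needs a value-safe floating-point mode.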

Todo @HKershaw intel/2024.0.2, ifx, vs gfortran

Error Message

3 mpi tasks: (also happens with 8,7 (without post_inf), 40(without post_inf))

 PE 0: comp_cov_factor: Standard Gaspari Cohn localization selected
 ERROR FROM:
  source : bnrh_distribution_mod.f90
  routine: bnrh_cdf_initialized
  message:  Ensemble member greater than upper bound first check(see code)   1.00000000000000        1.00000000000000
 
MPICH ERROR [Rank 0] [job id e35a8d7d-258f-45c5-8d80-ba05433b0be5] [Tue Aug  6 12:24:05 2024] [dec0508] - Abort(99) (rank 0 in comm 496): application called MPI_Abort(comm=0x84000002, 99) - process 0

 ERROR FROM:
  source : bnrh_distribution_mod.f90
  routine: bnrh_cdf_initialized
  message:  Ensemble member greater than upper bound first check(see code)   1.00000000000000        1.00000000000000
 
MPICH ERROR [Rank 1] [job id e35a8d7d-258f-45c5-8d80-ba05433b0be5] [Tue Aug  6 12:24:05 2024] [dec0508] - Abort(99) (rank 1 in comm 496): application called MPI_Abort(comm=0x84000001, 99) - process 1

 ERROR FROM:
  source : bnrh_distribution_mod.f90
  routine: bnrh_cdf_initialized
  message:  Ensemble member greater than upper bound first check(see code)   1.00000000000000        1.00000000000000
 
MPICH ERROR [Rank 2] [job id e35a8d7d-258f-45c5-8d80-ba05433b0be5] [Tue Aug  6 12:24:05 2024] [dec0508] - Abort(99) (rank 2 in comm 496): application called MPI_Abort(comm=0x84000001, 99) - process 2

Here is the code:

DART/assimilation_code/modules/assimilation/bnrh_distribution_mod.f90

Lines 292 to 300 in 75cf8dc

      elseif(x > sort_ens(ens_size)) then
         ! In the right tail
         ! Do an error check to make sure ensemble member isn't outside bounds, may be redundant
         if(bounded_above .and. x > upper_bound) then
            write(errstring, *) 'Ensemble member greater than upper bound first check(see code)', x, upper_bound
            call error_handler(E_ERR, 'bnrh_cdf_initialized', errstring, source)
            ! This error can occur due to roundoff in increment generation from bounded BNRHF
            ! See discussion in function fix_bounds
         endif
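
The error message prints x and upper_bound as the same value, 1.00000000000000, even though the check x > upper_bound fired. That is consistent with x exceeding the bound by less than the printed precision. The sketch below illustrates the effect with the smallest double above 1.0. How far x actually exceeded the bound in the failing run is not known from the log, and the function name is hypothetical.

```shell
#!/bin/bash
# Illustration only: a value one unit in the last place above 1.0 prints as
# 1.00000000000000 at 14 decimal places, yet still compares greater than 1.0.
just_above_one() {
    awk 'BEGIN {
        upper_bound = 1.0
        x = upper_bound + 1 / 2^52      # smallest double greater than 1.0
        printf "%.14f %.14f %s\n", x, upper_bound, (x > upper_bound ? "x>bound" : "x<=bound")
    }'
}
```

Running `just_above_one` prints `1.00000000000000 1.00000000000000 x>bound`, mirroring the pair of identical-looking numbers in the error message above.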

Which model(s) are you working with?

lorenz_96_tracer_advection

/glade/derecho/scratch/hkershaw/DART/Bugs/bgunn_qceff/DART/models/lorenz_96_tracer_advection/work

Version of DART

v11.5.1

Have you modified the DART code?

No

Build information

Please describe:

Derecho
ifort (IFORT) 2021.10.0 20230609

hkershaw-brown · 2024-08-06T19:26:41Z

No bounds failures with module intel/2024.0.2 (ifort (IFORT) 2021.11.1 20231117) without -fp-model precise:

8.83596691025763 ;
8.26235748376639 ;
8.41808494868261 ;

No bounds failures with ifx, intel-oneapi/2024.0.2 (ifx (IFX) 2024.0.2 20231213), without -fp-model precise; results are the same across core counts:

7.67172341333618 ;
7.67172341333618 ;
7.67172341333618 ;
7.67172341333618 ;

jlaucar · 2024-08-06T19:38:08Z

Helen, I have a strong sense of deja-vu about this. Have we possibly identified things before where fp-precise was required for various intel versions? Is fix_bound_violations needed to get the cases with fp-precise to run successfully? Do the cases that do not duplicate across PE count duplicate when the same PE count is run repeatedly? Jeff

hkershaw-brown · 2024-08-06T19:49:32Z

yup this is a reoccurrence of what I was seeing on my old laptop with ifort. It would be cooler if I'd recorded the version on my now dead laptop.
I'm trying to see if I can try an older Intel version on Derecho.

fix_bound_violations does not seem to be needed with -fp-model precise (I haven't got it to fail yet).
The cases that do not duplicate across PE counts do duplicate when rerun with the same PE count.

…ntered and gives a 'Failed to converge for quantile' when quantile==0 see issue #709 Q. Is this a floating point comparison error? (plus short circuiting the if statement)

hkershaw-brown · 2024-10-31T13:59:49Z

Note on B Gaubert's cam-chem(?) runs. These were done with fix_bound_violations = .true. rather than fix_bound_violations = .false. as originally thought.

So the bounds were enforced by clamping rather than by the probit transform (sd == 0, so you never transform into, or back out of, probit space).

/glade/derecho/scratch/hkershaw/DART/CAM-out-of-bounds/Rean_run is using the reanalysis runs #749

hkershaw-brown · 2024-10-31T14:04:25Z

Note I have not separated out varying results across pe counts (QCEFF vs no QCEFF vs what would be expected).

hkershaw-brown added the QCEFF quantile conserving filters label Sep 13, 2024

hkershaw-brown mentioned this issue Sep 13, 2024

Documentation: clamping is applied only when variables are written #734

Open

hkershaw-brown mentioned this issue Oct 3, 2024

bug: inflation creating out-of-bounds values qceff #749

Open
