bug: fix bounds fix_bound_violations = .true. seems to be required for ifort #709

Open
hkershaw-brown opened this issue Aug 6, 2024 · 5 comments
Labels
QCEFF quantile conserving filters

Comments

@hkershaw-brown
Member

Describe the bug

  1. List the steps someone needs to take to reproduce the bug.

/glade/derecho/scratch/hkershaw/DART/Bugs/bgunn_qceff/DART/models/lorenz_96_tracer_advection/work
Following https://github.com/NCAR/DART/blob/l96_tracer_tests/models/lorenz_96_tracer_advection/work/TESTS/TEST_DRIVER.csh
Reported by Ben Gunn (thanks @Benjamin-Gunn!):
https://github.com/Benjamin-Gunn/DART/blob/l96_tracer_tests/models/lorenz_96_tracer_advection/work/TESTS/TEST_DRIVER.csh

qceff_table_filename = 'one_below_qceff_table.csv'

&filter_nml
inf_flavor = 5, 5,

&model_nml
model_size = 120,
forcing = 8.0,
delta_t = 0.05,
mean_velocity = 0.0,
pert_velocity_multiplier = 5.0,
diffusion_coef = 0.0,
e_folding = 0.25,
sink_rate = 0.1,
source_rate = 100.0,
point_tracer_source_rate = 5.0,
positive_tracer = .false.,
bound_above_is_one = .true.,
time_step_days = 0,
time_step_seconds = 3600,
/

  2. What was the expected outcome?
    fix_bound_violations = .true. was not expected to be required so often.

  3. What actually happened?
    Failures with "Ensemble member greater than upper bound first check" at various PE counts.

You can set:

&probit_transform_nml
fix_bound_violations = .true.
/

However, you still get different answers across MPI task counts.

#!/bin/bash

module load nco

rm -f one_var_temp.nc
ncrcat -d location,1,1 filter_output.nc one_var_temp.nc
ncks -V -C -v state_variable_mean one_var_temp.nc | tail -3 | head -1 >> test_output
rm -f  one_var_temp.nc

Varying PE count:
7.95979093017264 ;
8.02126025256388 ;
8.55748257662756 ;

Varying PE count with -fp-model precise:
8.62082489125036 ;
8.62082489125036 ;
8.62082489125036 ;

Not sure how much difference is acceptable when varying the PE count.
Note: I cannot reproduce the bounds violations with -fp-model precise.
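
For context on why answers can change with the PE count at all: floating-point addition is not associative, so any change in the order in which partial results are combined can change the final value. The sketch below is a generic illustration in Python, not DART code, and it does not establish that summation order is the cause in this issue. The helper names `naive_sum` and `chunked_sum` are hypothetical.

```python
import math


def naive_sum(values):
    """Left-to-right floating-point sum (no compensation)."""
    total = 0.0
    for v in values:
        total += v
    return total


def chunked_sum(values, n_chunks):
    """Sum contiguous chunks separately, then add the partial sums.

    This mimics a reduction whose grouping depends on the task count.
    """
    size = math.ceil(len(values) / n_chunks)
    partials = [naive_sum(values[i:i + size]) for i in range(0, len(values), size)]
    return naive_sum(partials)


# Values of very different magnitude make the roundoff visible.
values = [1.0e16, 1.0, -1.0e16, 1.0]

serial = chunked_sum(values, 1)      # one "task"
two_tasks = chunked_sum(values, 2)   # two "tasks"
exact = math.fsum(values)            # correctly rounded reference

print(serial, two_tasks, exact)
```

The same four numbers give three different totals depending on grouping, which is the kind of sensitivity that value-changing compiler optimizations can expose.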

TODO @HKershaw: compare intel/2024.0.2, ifx, and gfortran

Error Message

3 MPI tasks (also happens with 8, 7 (without post_inf), and 40 (without post_inf)):

 PE 0: comp_cov_factor: Standard Gaspari Cohn localization selected
 ERROR FROM:
  source : bnrh_distribution_mod.f90
  routine: bnrh_cdf_initialized
  message:  Ensemble member greater than upper bound first check(see code)   1.00000000000000        1.00000000000000
 
MPICH ERROR [Rank 0] [job id e35a8d7d-258f-45c5-8d80-ba05433b0be5] [Tue Aug  6 12:24:05 2024] [dec0508] - Abort(99) (rank 0 in comm 496): application called MPI_Abort(comm=0x84000002, 99) - process 0

 ERROR FROM:
  source : bnrh_distribution_mod.f90
  routine: bnrh_cdf_initialized
  message:  Ensemble member greater than upper bound first check(see code)   1.00000000000000        1.00000000000000
 
MPICH ERROR [Rank 1] [job id e35a8d7d-258f-45c5-8d80-ba05433b0be5] [Tue Aug  6 12:24:05 2024] [dec0508] - Abort(99) (rank 1 in comm 496): application called MPI_Abort(comm=0x84000001, 99) - process 1

 ERROR FROM:
  source : bnrh_distribution_mod.f90
  routine: bnrh_cdf_initialized
  message:  Ensemble member greater than upper bound first check(see code)   1.00000000000000        1.00000000000000
 
MPICH ERROR [Rank 2] [job id e35a8d7d-258f-45c5-8d80-ba05433b0be5] [Tue Aug  6 12:24:05 2024] [dec0508] - Abort(99) (rank 2 in comm 496): application called MPI_Abort(comm=0x84000001, 99) - process 2

Here is the code:

elseif(x > sort_ens(ens_size)) then
   ! In the right tail
   ! Do an error check to make sure ensemble member isn't outside bounds, may be redundant
   if(bounded_above .and. x > upper_bound) then
      write(errstring, *) 'Ensemble member greater than upper bound first check(see code)', x, upper_bound
      call error_handler(E_ERR, 'bnrh_cdf_initialized', errstring, source)
      ! This error can occur due to roundoff in increment generation from bounded BNRHF
      ! See discussion in function fix_bounds
   endif
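
The error message prints x and upper_bound as the same number (1.00000000000000), yet the strict comparison x > upper_bound fired. One possible explanation, consistent with the roundoff comment in the code, is that x exceeds the bound by less than the printed precision. The Python sketch below illustrates that situation and what clamping to the bound would do. It is an assumption-laden illustration, not the DART implementation: the value of `x` is made up, and `clamp_to_bounds` is a hypothetical helper, not the `fix_bounds` routine.

```python
import sys

upper_bound = 1.0
# Hypothetical ensemble member: the smallest double greater than 1.0,
# standing in for a value pushed over the bound by roundoff.
x = upper_bound + sys.float_info.epsilon

# With 14 decimal places, as in the error message, the two look identical.
shown_x = f"{x:.14f}"
shown_bound = f"{upper_bound:.14f}"
print(shown_x, shown_bound)

# The strict comparison used in the check is still true.
violates = x > upper_bound
print("x > upper_bound:", violates)


def clamp_to_bounds(value, lower=None, upper=None):
    """Hypothetical helper: pull a value back onto the nearest bound."""
    if upper is not None and value > upper:
        return upper
    if lower is not None and value < lower:
        return lower
    return value


fixed = clamp_to_bounds(x, upper=upper_bound)
print("after clamping:", fixed, "violates:", fixed > upper_bound)
```

Printing more digits in the error message (or the difference x - upper_bound) would show directly how far outside the bound the member is.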

Which model(s) are you working with?

lorenz_96_tracer_advection

/glade/derecho/scratch/hkershaw/DART/Bugs/bgunn_qceff/DART/models/lorenz_96_tracer_advection/work

Version of DART

v11.5.1

Have you modified the DART code?

No

Build information

Please describe:

  1. Derecho
  2. ifort (IFORT) 2021.10.0 20230609
@hkershaw-brown
Member Author

hkershaw-brown commented Aug 6, 2024

No bounds failures with module intel/2024.0.2 (ifort (IFORT) 2021.11.1 20231117) without -fp-model precise:

8.83596691025763 ;
8.26235748376639 ;
8.41808494868261 ;

No bounds failures with ifx (intel-oneapi/2024.0.2, ifx (IFX) 2024.0.2 20231213) without -fp-model precise; results are the same across core counts:

7.67172341333618 ;
7.67172341333618 ;
7.67172341333618 ;
7.67172341333618 ;
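
To put a number on "how different", the spread of the extracted values can be computed directly. This is a minimal Python sketch, assuming the values are in the `value ;` form produced by the extraction script above; `parse_values` and `spread` are hypothetical helper names, and the inputs are the ifort and ifx values reported in this comment.

```python
def parse_values(text):
    """Parse lines such as '8.83596691025763 ;' into floats."""
    values = []
    for line in text.splitlines():
        token = line.strip().rstrip(";").strip()
        if token:
            values.append(float(token))
    return values


def spread(values):
    """Return (absolute spread, spread relative to the mean)."""
    abs_spread = max(values) - min(values)
    mean = sum(values) / len(values)
    return abs_spread, abs_spread / mean


ifort_text = """
8.83596691025763 ;
8.26235748376639 ;
8.41808494868261 ;
"""

ifx_text = """
7.67172341333618 ;
7.67172341333618 ;
7.67172341333618 ;
7.67172341333618 ;
"""

ifort_abs, ifort_rel = spread(parse_values(ifort_text))
ifx_abs, ifx_rel = spread(parse_values(ifx_text))

print(f"ifort: abs spread {ifort_abs:.6f}, relative {ifort_rel:.2%}")
print(f"ifx:   abs spread {ifx_abs:.6f}, relative {ifx_rel:.2%}")
```

Whether a given spread is acceptable still depends on the expected ensemble variability, which this sketch does not address.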

@jlaucar
Contributor

jlaucar commented Aug 6, 2024 via email

@hkershaw-brown
Member Author

Yup, this is a recurrence of what I was seeing on my old laptop with ifort. It would be cooler if I'd recorded the compiler version on my now-dead laptop.
I'm trying to see if I can try an older Intel version on Derecho.

fix_bound_violations does not seem to be needed with -fp-model precise (I haven't got it to fail yet).
The cases that do not give identical results across PE counts do give identical results when rerun with the same PE count.

@hkershaw-brown hkershaw-brown added the QCEFF quantile conserving filters label Sep 13, 2024
hkershaw-brown added a commit that referenced this issue Sep 13, 2024
…ntered

and gives a 'Failed to converge for quantile' when quantile==0
see issue #709

Q. Is this a floating-point comparison error? (plus short-circuiting the if statement)
@hkershaw-brown
Member Author

Note on B Gaubert's cam-chem(?) runs. These were done with fix_bound_violations = .true. rather than fix_bound_violations = .false. as originally thought.

So the bounds are enforced by clamping rather than by the probit transform (sd == 0, so you never transform into, or back out of, probit space).

/glade/derecho/scratch/hkershaw/DART/CAM-out-of-bounds/Rean_run is using the reanalysis runs #749

@hkershaw-brown
Member Author

Note: I have not separated out the varying results across PE counts (QCEFF vs. no QCEFF vs. what would be expected).
