Instability of some numerical results (condor vs CI) #200

Closed
mlincett opened this issue Jun 9, 2023 · 19 comments · Fixed by #227
Labels
CI / Testing (About CI and/or testing), prod concern (a problem with running at scale), question (Further information is requested)

Comments

@mlincett
Collaborator

mlincett commented Jun 9, 2023

As we have been discussing for a while, it seems sometimes the numerical results of millipede are not stable across runs on different platforms, in spite of containerisation.

While I am not sure there is an "issue" to solve, I would like to track here some observations about this behaviour. Updates to come.

@mlincett
Collaborator Author

mlincett commented Jun 9, 2023

After #198 the test of MillipedeWilks fails. A single pixel that fails the energy loss reco shows a difference in likelihood of 17% between condor and CI.

@ric-evans
Member

this also seems relevant: #60

@ric-evans added the labels question, CI / Testing, and prod concern on Jun 9, 2023
@tianluyuan
Contributor

Could you clarify where the failure occurred? Naively I would have thought that when run on GitHub CI the platform should be near identical, so tests shouldn't fail randomly. Though results from condor could still be different.

@ric-evans
Member

> Could you clarify where the failure occurred? Naively I would have thought that when run on GitHub CI the platform should be near identical, so tests shouldn't fail randomly. Though results from condor could still be different.

We're seeing it here #204 with millipede wilks

@ric-evans
Member

Following up, the issue in #204 is not related to this after all

@tianluyuan
Contributor

The spline tables for all these cases are obtained using wget over http. One possibility is that in those cases the integrity of the file is lost, leading to the different results. Perhaps we can add checksumming to ensure that is not happening.

@ric-evans
Member

ric-evans commented Sep 21, 2023

> The spline tables for all these cases are obtained using wget over http. One possibility is that in those cases the integrity of the file is lost, leading to the different results. Perhaps we can add checksumming to ensure that is not happening.

This is a good idea if we continue to rely on remote storage. What do you think with regard to #166?

@tianluyuan
Contributor

I think the containers can be dataless and cvmfs-less if we really wanted, but in that case we should ensure the file transfer mechanism is robust. I don't think I have run into such issues on sub-2, but that could be because it's using the spline tables on cvmfs rather than over http.

@mlincett
Collaborator Author

I believe issues in downloads should result in truncated files, rather than corrupted data, but a checksum is definitely a good idea.

@dsschult
Member

I've seen corrupted data without a truncation when transferring files. A bit flip that passes the TCP checksum is unlikely, but it does happen when enough bytes are moved around. Note that this does not happen with CVMFS, since that uses checksums internally.

The easiest thing to do is to have a checksum file next to the file you want to download, and if it's available then perform the checksum test.
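
The sidecar-checksum idea can be sketched roughly as follows. This is only an illustration under stated assumptions: the `.sha256` suffix, the `sha256sum`-style file layout, and the helper names `sha256_of` and `verify_download` are hypothetical and not part of this project.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path: Path) -> bool:
    """Check `path` against a sidecar `<name>.sha256` file, if one exists.

    Returns True when the digests match, or when no sidecar is available
    (there is nothing to check against). Returns False on a mismatch or
    when the sidecar is empty.
    """
    # Hypothetical naming convention: "table.fits" -> "table.fits.sha256"
    sidecar = path.with_name(path.name + ".sha256")
    if not sidecar.exists():
        return True
    fields = sidecar.read_text().split()
    if not fields:
        return False
    expected = fields[0].lower()  # sha256sum layout: "<hex digest>  <file name>"
    return sha256_of(path) == expected
```

A caller would download both files, run `verify_download(Path(...))`, and retry or abort on `False`. Whether a missing sidecar should pass silently, as it does here, is a design choice worth revisiting.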

@tianluyuan
Contributor

The numerical issues in the tests are quite confounding. They seem to oscillate between two results for millipede_wilks. E.g. here and here.

Basically for what appears to be the same seed particle the millipede unfolded particle can be different. Checking the OS indicates the CI runners are on the same OS (ubuntu_amd64). It doesn't rule out data corruption over wget, but it appears not completely random either.

@tianluyuan linked a pull request on Oct 14, 2023 that will close this issue
@tianluyuan
Contributor

It looks like numpy picks up avx512 on select processors

tyuan@cobalt06:~$ lscpu|head   
Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Address sizes:                   46 bits physical, 48 bits virtual
Byte Order:                      Little Endian
CPU(s):                          32
On-line CPU(s) list:             0-31
Vendor ID:                       GenuineIntel
Model name:                      Intel(R) Xeon(R) CPU E5-2680 0 @ 2.70GHz
CPU family:                      6
Model:                           45
tyuan@cobalt06:~$ python3 -c 'import numpy; numpy.show_config()'|tail
lapack_opt_info:
    libraries = ['lapack', 'lapack', 'blas', 'blas']
    library_dirs = ['/usr/lib/x86_64-linux-gnu']
    language = c
    define_macros = [('NO_ATLAS_INFO', 1), ('HAVE_CBLAS', None)]
    include_dirs = ['/usr/local/include', '/usr/include']
Supported SIMD extensions in this NumPy install:
    baseline = SSE,SSE2,SSE3
    found = SSSE3,SSE41,POPCNT,SSE42,AVX
    not found = F16C,FMA3,AVX2,AVX512F,AVX512CD,AVX512_KNL,AVX512_KNM,AVX512_SKX,AVX512_CLX,AVX512_CNL,AVX512_ICL

vs

Singularity> lscpu|head                                    
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Address sizes:       46 bits physical, 48 bits virtual
Byte Order:          Little Endian
CPU(s):              32
On-line CPU(s) list: 0-31
Vendor ID:           GenuineIntel
Model name:          Intel Xeon Processor (Cascadelake)
CPU family:          6
Model:               85
Singularity> python3 -c 'import numpy; numpy.show_config()'|tail
lapack_opt_info:
    libraries = ['lapack', 'lapack', 'blas', 'blas']
    library_dirs = ['/usr/lib/x86_64-linux-gnu']
    language = c
    define_macros = [('NO_ATLAS_INFO', 1), ('HAVE_CBLAS', None)]
    include_dirs = ['/usr/local/include', '/usr/include']
Supported SIMD extensions in this NumPy install:
    baseline = SSE,SSE2,SSE3
    found = SSSE3,SSE41,POPCNT,SSE42,AVX,F16C,FMA3,AVX2,AVX512F,AVX512CD,AVX512_SKX,AVX512_CLX
    not found = AVX512_KNL,AVX512_KNM,AVX512_CNL,AVX512_ICL

could explain why splinempe and wilks are the ones to fail.
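
For logging which SIMD features a worker node exposes alongside each job, a small helper that reads the kernel's CPU flags may be handy. This is a sketch assuming Linux on x86, where `/proc/cpuinfo` has a `flags` line; the helper name `simd_flags` and the selection of flags are illustrative choices, and it reports what the CPU advertises, not what NumPy or BLAS actually dispatch to.

```python
from pathlib import Path
from typing import Dict

# Flag names as they appear in /proc/cpuinfo on x86 Linux.
SIMD_FLAGS = ("sse4_2", "avx", "avx2", "fma", "avx512f")


def simd_flags(cpuinfo_text: str) -> Dict[str, bool]:
    """Report which of SIMD_FLAGS appear on the first 'flags' line."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            present = set(value.split())
            return {flag: flag in present for flag in SIMD_FLAGS}
    # No 'flags' line found (e.g. non-x86 or non-Linux input).
    return {flag: False for flag in SIMD_FLAGS}


if __name__ == "__main__":
    cpuinfo = Path("/proc/cpuinfo")  # Linux only
    if cpuinfo.exists():
        print(simd_flags(cpuinfo.read_text()))
```

Recording this dictionary next to each result would make it easier to correlate diverging outputs with the hardware they ran on.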

@tianluyuan
Contributor

More testing indicates it's not something in numpy/python/seeding but in minimization/fitting with millipede (and possibly splinempe?). See the discussion on Slack for comparisons.

@tianluyuan
Contributor

Possibly relevant numpy/numpy#23523

@dsschult
Member

That reminds me that you can do this to force different optimizations with OpenBLAS:
https://github.com/OpenMathLib/OpenBLAS/wiki/Faq#choose_target_dynamic

For avx2:

export OPENBLAS_CORETYPE=Haswell

For avx:

export OPENBLAS_CORETYPE=Sandybridge
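
To try several core types in one go, the variable can be set per child process rather than exported in the shell. A minimal sketch: `run_with_coretype` is a hypothetical helper, and the child command here only echoes the variable, standing in for whatever script is under test.

```python
import os
import subprocess
import sys


def run_with_coretype(cmd, coretype):
    """Run `cmd` with OPENBLAS_CORETYPE set in the child's environment only."""
    env = dict(os.environ, OPENBLAS_CORETYPE=coretype)
    return subprocess.run(cmd, env=env, capture_output=True, text=True)


if __name__ == "__main__":
    # Placeholder child: prints the variable it received.
    child = [sys.executable, "-c", "import os; print(os.environ['OPENBLAS_CORETYPE'])"]
    for coretype in ("Haswell", "Sandybridge"):
        result = run_with_coretype(child, coretype)
        print(coretype, result.returncode, result.stdout.strip())
```

On POSIX systems a negative `returncode` means the child was killed by a signal, which is how a crash on an unsupported instruction set would show up in such a scan.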

@tianluyuan
Contributor

That did it for avx2. This matches what I'm seeing on NPX with has_avx2

Singularity> lscpu |head && python3 mwe.py 2>/dev/null && OPENBLAS_CORETYPE=Haswell python3 mwe.py 2>/dev/null
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Address sizes:       46 bits physical, 48 bits virtual
Byte Order:          Little Endian
CPU(s):              32
On-line CPU(s) list: 0-31
Vendor ID:           GenuineIntel
Model name:          Intel Xeon Processor (Cascadelake)
CPU family:          6
Model:               85
output of refine_vertex_time: 10103.996258527779
I SimplexBuilder    0 - FCN =       107588.9842 Edm =       10725.01483 NCalls =      5
I SimplexBuilder Final iteration FCN =       88871.51294 Edm =       27488.76717 NCalls =      8
W SimplexBuilder Simplex did not converge, #fcn calls exhausted
output of refine_vertex_time: 10103.996258527779
I SimplexBuilder    0 - FCN =       107589.1377 Edm =       10724.80881 NCalls =      5
I SimplexBuilder Final iteration FCN =       88871.51294 Edm =       27488.76717 NCalls =      8
W SimplexBuilder Simplex did not converge, #fcn calls exhausted

Setting it to Sandybridge does not recover what I get on cobalt06 though:

Singularity> OPENBLAS_CORETYPE=Sandybridge python3 mwe.py 2>/dev/null
output of refine_vertex_time: 10103.996258527779
I SimplexBuilder    0 - FCN =       107589.2472 Edm =       10724.75928 NCalls =      5
I SimplexBuilder Final iteration FCN =       88871.51294 Edm =       27488.76717 NCalls =      8
W SimplexBuilder Simplex did not converge, #fcn calls exhausted

@tianluyuan
Contributor

Testing on an AMD chip with avx512 indicates that its default is equivalent to Haswell, which is avx2

tyuan@n-165:scan$ lscpu|head && python3 mwe.py 2>/dev/null && OPENBLAS_CORETYPE=Haswell python3 mwe.py 2>/dev/null                                                                    
Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Address sizes:                   52 bits physical, 57 bits virtual
Byte Order:                      Little Endian
CPU(s):                          128
On-line CPU(s) list:             0-127
Vendor ID:                       AuthenticAMD
Model name:                      AMD EPYC 9334 32-Core Processor
CPU family:                      25
Model:                           17
output of refine_vertex_time: 10103.996258527779
I SimplexBuilder    0 - FCN =       107589.1377 Edm =       10724.80881 NCalls =      5
I SimplexBuilder Final iteration FCN =       88871.51294 Edm =       27488.76717 NCalls =      8
W SimplexBuilder Simplex did not converge, #fcn calls exhausted
output of refine_vertex_time: 10103.996258527779
I SimplexBuilder    0 - FCN =       107589.1377 Edm =       10724.80881 NCalls =      5
I SimplexBuilder Final iteration FCN =       88871.51294 Edm =       27488.76717 NCalls =      8
W SimplexBuilder Simplex did not converge, #fcn calls exhausted

@tianluyuan
Contributor

Testing a variety of OPENBLAS_CORETYPE values on the Intel Xeon Processor (Cascadelake) machine results in the following, none of which matches cobalt06

  Core2
output of refine_vertex_time: 10103.996258527779                                                                                                                                       
I SimplexBuilder    0 - FCN =       107589.1096 Edm =       10724.79906 NCalls =      5                                                                                                
I SimplexBuilder Final iteration FCN =       88871.51294 Edm =       27488.76717 NCalls =      8                                                                                       
W SimplexBuilder Simplex did not converge, #fcn calls exhausted                                                                                                                        
bash: warning: setlocale: LC_ALL: cannot change locale (en_US.UTF-8)                                                                                                                   
  Banias                                                                                                                                                                               
output of refine_vertex_time: 10103.996258527779                                                                                                                                       
I SimplexBuilder    0 - FCN =       107589.1096 Edm =       10724.79906 NCalls =      5                                                                                                
I SimplexBuilder Final iteration FCN =       88871.51294 Edm =       27488.76717 NCalls =      8                                                                                       
W SimplexBuilder Simplex did not converge, #fcn calls exhausted                                                                                                                        
bash: warning: setlocale: LC_ALL: cannot change locale (en_US.UTF-8)                                                                                                                   
  Penryn                                                                                                                                                                               
output of refine_vertex_time: 10103.996258527779                                                                                                                                       
I SimplexBuilder    0 - FCN =       107589.1096 Edm =       10724.79906 NCalls =      5                                                                                                
I SimplexBuilder Final iteration FCN =       88871.51294 Edm =       27488.76717 NCalls =      8                                                                                       
W SimplexBuilder Simplex did not converge, #fcn calls exhausted                                                                                                                        
bash: warning: setlocale: LC_ALL: cannot change locale (en_US.UTF-8)                                                                                                                   
  Dunnington                                                                                                                                                                           
output of refine_vertex_time: 10103.996258527779                                                                                                                                       
I SimplexBuilder    0 - FCN =       107589.1096 Edm =       10724.79906 NCalls =      5                                                                                                
I SimplexBuilder Final iteration FCN =       88871.51294 Edm =       27488.76717 NCalls =      8                                                                                       
W SimplexBuilder Simplex did not converge, #fcn calls exhausted                                                                                                                        
bash: warning: setlocale: LC_ALL: cannot change locale (en_US.UTF-8)                                                                                                                   
  Opteron_SSE3                                                                                                                                                                         
bash: line 1:  1864 Illegal instruction     OPENBLAS_CORETYPE=Opteron_SSE3 python3 mwe.py 2> /dev/null                                                                                 
bash: warning: setlocale: LC_ALL: cannot change locale (en_US.UTF-8)                                                                                                                   
  Katmai                                                                                                                                                                               
output of refine_vertex_time: 10103.996258527779                                                                                                                                       
I SimplexBuilder    0 - FCN =       107589.1096 Edm =       10724.79906 NCalls =      5                                                                                                
I SimplexBuilder Final iteration FCN =       88871.51294 Edm =       27488.76717 NCalls =      8                                                                                       
W SimplexBuilder Simplex did not converge, #fcn calls exhausted                                                                                                                        
bash: warning: setlocale: LC_ALL: cannot change locale (en_US.UTF-8)                                                                                                                   
  Coppermine                                                                                                                                                                           
output of refine_vertex_time: 10103.996258527779                                                                                                                                       
I SimplexBuilder    0 - FCN =       107589.1096 Edm =       10724.79906 NCalls =      5
I SimplexBuilder Final iteration FCN =       88871.51294 Edm =       27488.76717 NCalls =      8
W SimplexBuilder Simplex did not converge, #fcn calls exhausted
bash: warning: setlocale: LC_ALL: cannot change locale (en_US.UTF-8)
  Northwood                                  
output of refine_vertex_time: 10103.996258527779
I SimplexBuilder    0 - FCN =       107589.1096 Edm =       10724.79906 NCalls =      5
I SimplexBuilder Final iteration FCN =       88871.51294 Edm =       27488.76717 NCalls =      8
W SimplexBuilder Simplex did not converge, #fcn calls exhausted
bash: warning: setlocale: LC_ALL: cannot change locale (en_US.UTF-8)
  Prescott                                   
output of refine_vertex_time: 10103.996258527779
I SimplexBuilder    0 - FCN =       107589.1096 Edm =       10724.79906 NCalls =      5
I SimplexBuilder Final iteration FCN =       88871.51294 Edm =       27488.76717 NCalls =      8
W SimplexBuilder Simplex did not converge, #fcn calls exhausted
bash: warning: setlocale: LC_ALL: cannot change locale (en_US.UTF-8)
  Atom
output of refine_vertex_time: 10103.996258527779
I SimplexBuilder    0 - FCN =       107589.1233 Edm =       10724.83424 NCalls =      5
I SimplexBuilder Final iteration FCN =       88871.51294 Edm =       27488.76717 NCalls =      8
W SimplexBuilder Simplex did not converge, #fcn calls exhausted
bash: warning: setlocale: LC_ALL: cannot change locale (en_US.UTF-8)
  Nehalem
output of refine_vertex_time: 10103.996258527779
I SimplexBuilder    0 - FCN =       107589.0994 Edm =       10724.80926 NCalls =      5
I SimplexBuilder Final iteration FCN =       88871.51294 Edm =       27488.76717 NCalls =      8
W SimplexBuilder Simplex did not converge, #fcn calls exhausted
bash: warning: setlocale: LC_ALL: cannot change locale (en_US.UTF-8)
  Barcelona
output of refine_vertex_time: 10103.996258527779
I SimplexBuilder    0 - FCN =       107589.1096 Edm =       10724.79906 NCalls =      5
I SimplexBuilder Final iteration FCN =       88871.51294 Edm =       27488.76717 NCalls =      8
W SimplexBuilder Simplex did not converge, #fcn calls exhausted
bash: warning: setlocale: LC_ALL: cannot change locale (en_US.UTF-8)
  Nano
output of refine_vertex_time: 10103.996258527779
I SimplexBuilder    0 - FCN =       107589.1096 Edm =       10724.79906 NCalls =      5
I SimplexBuilder Final iteration FCN =       88871.51294 Edm =       27488.76717 NCalls =      8
W SimplexBuilder Simplex did not converge, #fcn calls exhausted
bash: warning: setlocale: LC_ALL: cannot change locale (en_US.UTF-8)
  Bobcat
output of refine_vertex_time: 10103.996258527779
I SimplexBuilder    0 - FCN =       107589.1096 Edm =       10724.79906 NCalls =      5
I SimplexBuilder Final iteration FCN =       88871.51294 Edm =       27488.76717 NCalls =      8
W SimplexBuilder Simplex did not converge, #fcn calls exhausted
bash: warning: setlocale: LC_ALL: cannot change locale (en_US.UTF-8)
  Sandybridge
output of refine_vertex_time: 10103.996258527779
I SimplexBuilder    0 - FCN =       107589.2472 Edm =       10724.75928 NCalls =      5
I SimplexBuilder Final iteration FCN =       88871.51294 Edm =       27488.76717 NCalls =      8
W SimplexBuilder Simplex did not converge, #fcn calls exhausted
bash: warning: setlocale: LC_ALL: cannot change locale (en_US.UTF-8)
  Bulldozer
bash: line 1:  2310 Illegal instruction     OPENBLAS_CORETYPE=Bulldozer python3 mwe.py 2> /dev/null
bash: warning: setlocale: LC_ALL: cannot change locale (en_US.UTF-8)
  Piledriver
bash: line 1:  2345 Illegal instruction     OPENBLAS_CORETYPE=Piledriver python3 mwe.py 2> /dev/null
bash: warning: setlocale: LC_ALL: cannot change locale (en_US.UTF-8)
  Steamroller
bash: line 1:  2380 Illegal instruction     OPENBLAS_CORETYPE=Steamroller python3 mwe.py 2> /dev/null
bash: warning: setlocale: LC_ALL: cannot change locale (en_US.UTF-8)
  Excavator
bash: line 1:  2414 Illegal instruction     OPENBLAS_CORETYPE=Excavator python3 mwe.py 2> /dev/null

Cobalt06

tyuan@cobalt06:scan$ lscpu |head && python3 mwe.py 2>/dev/null
Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Address sizes:                   46 bits physical, 48 bits virtual
Byte Order:                      Little Endian
CPU(s):                          32
On-line CPU(s) list:             0-31
Vendor ID:                       GenuineIntel
Model name:                      Intel(R) Xeon(R) CPU E5-2680 0 @ 2.70GHz
CPU family:                      6
Model:                           45
output of refine_vertex_time: 10103.996258527779
I SimplexBuilder    0 - FCN =       107589.1115 Edm =        10724.9067 NCalls =      5
I SimplexBuilder Final iteration FCN =       88871.51294 Edm =       27488.76717 NCalls =      8
W SimplexBuilder Simplex did not converge, #fcn calls exhausted
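
To compare runs like these without requiring bit-identical output, one option is to pull the FCN values out of each log and compare them within a relative tolerance. A minimal sketch: the regular expression, the helper names, and the tolerance values are assumptions for illustration, and the two sample logs reuse FCN lines quoted earlier in this thread (default Cascadelake vs `OPENBLAS_CORETYPE=Sandybridge`).

```python
import math
import re

# Matches the number following "FCN =" in SimplexBuilder log lines.
FCN_RE = re.compile(r"FCN\s*=\s*([-+0-9.eE]+)")


def fcn_values(log_text):
    """Return all FCN values found in a log, in order of appearance."""
    return [float(m) for m in FCN_RE.findall(log_text)]


def logs_agree(log_a, log_b, rel_tol):
    """True if both logs have the same number of FCN values and each pair
    agrees within the given relative tolerance."""
    a, b = fcn_values(log_a), fcn_values(log_b)
    return len(a) == len(b) and all(
        math.isclose(x, y, rel_tol=rel_tol) for x, y in zip(a, b)
    )


DEFAULT_LOG = """\
I SimplexBuilder    0 - FCN =       107588.9842 Edm =       10725.01483 NCalls =      5
I SimplexBuilder Final iteration FCN =       88871.51294 Edm =       27488.76717 NCalls =      8
"""
SANDYBRIDGE_LOG = """\
I SimplexBuilder    0 - FCN =       107589.2472 Edm =       10724.75928 NCalls =      5
I SimplexBuilder Final iteration FCN =       88871.51294 Edm =       27488.76717 NCalls =      8
"""

if __name__ == "__main__":
    # The tolerances are arbitrary illustrations, not recommended values.
    print(logs_agree(DEFAULT_LOG, SANDYBRIDGE_LOG, rel_tol=1e-5))
    print(logs_agree(DEFAULT_LOG, SANDYBRIDGE_LOG, rel_tol=1e-9))
```

A tolerance only helps where the divergence stays small. It would not absorb a case like the 17% likelihood difference reported at the top of this issue.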
