non-deterministic NaN values in scipy.integrate.solve_ivp when compiling function with numba #8931

Closed
2 tasks done
abulenok opened this issue Apr 28, 2023 · 11 comments
Labels
needtriage stale Marker label for stale issues.

abulenok commented Apr 28, 2023

Reporting a bug

  • I have tried using the latest released version of Numba (the most recent is
    visible in the change log: https://github.com/numba/numba/blob/main/CHANGE_LOG).
  • I have included a self contained code sample to reproduce the problem.
    i.e. it's possible to run as 'python bug.py'.

What we know so far:

  • the behaviour is non-deterministic (running the script 100 times is usually enough to reproduce it)
  • occurs with both @jit and @njit functions, but does not happen when NUMBA_DISABLE_JIT=1
  • occurs only when using the “BDF” method in scipy.integrate.solve_ivp
  • seems to occur only on Linux systems (checked only on Debian-based ones); never observed on Windows or macOS
  • to experience error, it’s enough to compile a function with numba, no need to use it in the code
  • setting the fastmath option to True does not change the behaviour
  • the problem has been around for at least two to three years (hundreds of occurrences logged on GitHub Actions CI)

cc @slayoo @piotrbartman

import scipy.integrate
import numba
import numpy as np
import warnings

@numba.njit("int32(int32)")
def jitted_fun(x):
    return 1

def f(_, y):
    return np.zeros_like(y)

y0 = np.ones(2)

with warnings.catch_warnings(record=True):
    warnings.simplefilter("error")
    for _ in range(100):
        scipy.integrate.solve_ivp(fun=f, t_span=(0, 1), y0=y0, method="BDF")

The encountered error:

Traceback (most recent call last):
  File "/app/numba_scipy_bug.py", line 18, in <module>
    scipy.integrate.solve_ivp(fun=f, t_span=(0, 1), y0=y0, method="BDF")
  File "/usr/local/lib/python3.10/dist-packages/scipy/integrate/_ivp/ivp.py", line 591, in solve_ivp
    message = solver.step()
  File "/usr/local/lib/python3.10/dist-packages/scipy/integrate/_ivp/base.py", line 181, in step
    success, message = self._step_impl()
  File "/usr/local/lib/python3.10/dist-packages/scipy/integrate/_ivp/bdf.py", line 407, in _step_impl
    D[order + 2] = d - D[order + 1]
RuntimeWarning: invalid value encountered in subtract

Example Dockerfile to reproduce the issue:

FROM ubuntu:22.04
WORKDIR /app
COPY numba_scipy_err.py ./numba_scipy_err.py
RUN apt-get update && apt-get -y install python3.9 && apt-get -y install python3-pip
RUN pip install numba numpy scipy

CMD ["/bin/bash", "-c", "for run in {1..100}; do python3 numba_scipy_err.py; done"]
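The shell loop in the CMD line above can also be written as a small Python harness that counts how many runs fail, which is handy when the failure rate matters more than a single traceback. This is only a sketch: count_failures is a hypothetical helper that is not part of the original report, and the script name in the trailing comment is taken from the Dockerfile.

```python
import subprocess
import sys


def count_failures(cmd, runs=100):
    """Run `cmd` `runs` times and return how many runs exited with a non-zero status."""
    failures = 0
    for _ in range(runs):
        # A fresh interpreter per run mirrors the shell loop in the Dockerfile.
        result = subprocess.run(
            cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
        )
        if result.returncode != 0:
            failures += 1
    return failures


# Example (assumes the reproducer is saved as numba_scipy_err.py):
# count_failures([sys.executable, "numba_scipy_err.py"])
```

Because the reproducer turns warnings into errors, an affected run ends with an uncaught exception and therefore a non-zero exit status, which is what the helper counts.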

abulenok changed the title to "non-deterministic NaN values in scipy.integrate.solve_ivp when compiling function with numba" Apr 28, 2023
esc added the needtriage label Apr 28, 2023


slayoo commented May 1, 2023

CC: @seberg, @mdhaber, @drhagen, would you have any hints on how to investigate this on the SciPy side (the error occurs in scipy/integrate/_ivp/bdf.py)? Thanks.


mdhaber commented May 1, 2023

I would put a breakpoint in the code at /usr/local/lib/python3.10/dist-packages/scipy/integrate/_ivp/bdf.py, line 407, and run in debug mode. (Really, I'd put the breakpoint inside an if np.any(np.isnan(D)): branch so that execution would only stop when the NaN is actually present.) Then I'd look at which values are NaN and keep adding breakpoints earlier in the code execution to see where the NaN is introduced.
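The conditional breakpoint described above might look roughly like the following. It is a sketch only: first_nan_index and check are hypothetical helpers (not SciPy or NumPy functions), written with the standard library so that they accept any iterable of floats, for example D.ravel() for the array in bdf.py.

```python
import math


def first_nan_index(values):
    """Return the index of the first NaN in an iterable of floats, or None."""
    for i, v in enumerate(values):
        if math.isnan(v):
            return i
    return None


def check(name, values):
    """Stop in the debugger only when a NaN is actually present."""
    idx = first_nan_index(values)
    if idx is not None:
        print(f"NaN found in {name} at flat index {idx}")
        breakpoint()  # drops into pdb; use `up` to inspect the caller's frame
```

Calling check("D", D.ravel()) just before line 407, and then at progressively earlier points, narrows down where the NaN first appears.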

Member

esc commented May 2, 2023

I can reproduce this using Docker on my M1 Mac with an aarch64 Linux image, although sometimes I need to wait a few hours until it manifests.

Contributor

seberg commented May 2, 2023

Few hours? You can use feenableexcept to trap in gdb instead of getting that warning later for inspection (I guess that only really works if no unrelated ones are raised elsewhere). It might be worth trying valgrind and/or a sanitizer; maybe uninitialized memory is being used (in which case that memory might already contain the NaN, and no warning is given).

Not sure if either is worth it here though.

Member

esc commented May 2, 2023

Few hours?

Yeah, I put it in a while true loop and waited for it to appear.

Contributor

seberg commented May 2, 2023

Just going to post this in case anything in it makes sense to you guys. Valgrind reports the following for me, which is likely irrelevant (the call does not fail):

==28430== Conditional jump or move depends on uninitialised value(s)
==28430==    at 0x39A9A8B1: llvm::ConstantExpr::getGetElementPtr(llvm::Type*, llvm::Constant*, llvm::ArrayRef<llvm::Value*>, bool, llvm::Optional<unsigned int>, llvm::Type*) (in /home/sebastian/.local/lib/python3.9/site-packages/llvmlite/binding/libllvmlite.so)
==28430==    by 0x396E7E25: (anonymous namespace)::SymbolicallyEvaluateGEP(llvm::GEPOperator const*, llvm::ArrayRef<llvm::Constant*>, llvm::DataLayout const&, llvm::TargetLibraryInfo const*) (in /home/sebastian/.local/lib/python3.9/site-packages/llvmlite/binding/libllvmlite.so)
==28430==    by 0x396E8540: (anonymous namespace)::ConstantFoldInstOperandsImpl(llvm::Value const*, unsigned int, llvm::ArrayRef<llvm::Constant*>, llvm::DataLayout const&, llvm::TargetLibraryInfo const*) (in /home/sebastian/.local/lib/python3.9/site-packages/llvmlite/binding/libllvmlite.so)
==28430==    by 0x396E68F4: (anonymous namespace)::ConstantFoldConstantImpl(llvm::Constant const*, llvm::DataLayout const&, llvm::TargetLibraryInfo const*, llvm::SmallDenseMap<llvm::Constant*, llvm::Constant*, 4u, llvm::DenseMapInfo<llvm::Constant*>, llvm::detail::DenseMapPair<llvm::Constant*, llvm::Constant*> >&) [clone .part.0] (in /home/sebastian/.local/lib/python3.9/site-packages/llvmlite/binding/libllvmlite.so)
==28430==    by 0x396E6B96: llvm::ConstantFoldConstant(llvm::Constant const*, llvm::DataLayout const&, llvm::TargetLibraryInfo const*) (in /home/sebastian/.local/lib/python3.9/site-packages/llvmlite/binding/libllvmlite.so)
==28430==    by 0x396E6FAE: (anonymous namespace)::SymbolicallyEvaluateGEP(llvm::GEPOperator const*, llvm::ArrayRef<llvm::Constant*>, llvm::DataLayout const&, llvm::TargetLibraryInfo const*) (in /home/sebastian/.local/lib/python3.9/site-packages/llvmlite/binding/libllvmlite.so)
==28430==    by 0x396E8540: (anonymous namespace)::ConstantFoldInstOperandsImpl(llvm::Value const*, unsigned int, llvm::ArrayRef<llvm::Constant*>, llvm::DataLayout const&, llvm::TargetLibraryInfo const*) (in /home/sebastian/.local/lib/python3.9/site-packages/llvmlite/binding/libllvmlite.so)
==28430==    by 0x396E68F4: (anonymous namespace)::ConstantFoldConstantImpl(llvm::Constant const*, llvm::DataLayout const&, llvm::TargetLibraryInfo const*, llvm::SmallDenseMap<llvm::Constant*, llvm::Constant*, 4u, llvm::DenseMapInfo<llvm::Constant*>, llvm::detail::DenseMapPair<llvm::Constant*, llvm::Constant*> >&) [clone .part.0] (in /home/sebastian/.local/lib/python3.9/site-packages/llvmlite/binding/libllvmlite.so)
==28430==    by 0x396E6B96: llvm::ConstantFoldConstant(llvm::Constant const*, llvm::DataLayout const&, llvm::TargetLibraryInfo const*) (in /home/sebastian/.local/lib/python3.9/site-packages/llvmlite/binding/libllvmlite.so)
==28430==    by 0x3939D87C: combineInstructionsOverFunction(llvm::Function&, llvm::InstCombineWorklist&, llvm::AAResults*, llvm::AssumptionCache&, llvm::TargetLibraryInfo&, llvm::DominatorTree&, llvm::OptimizationRemarkEmitter&, llvm::BlockFrequencyInfo*, llvm::ProfileSummaryInfo*, unsigned int, llvm::LoopInfo*) (in /home/sebastian/.local/lib/python3.9/site-packages/llvmlite/binding/libllvmlite.so)
==28430==    by 0x3939FDE7: llvm::InstructionCombiningPass::runOnFunction(llvm::Function&) (in /home/sebastian/.local/lib/python3.9/site-packages/llvmlite/binding/libllvmlite.so)
==28430==    by 0x39B7230A: llvm::FPPassManager::runOnFunction(llvm::Function&) (in /home/sebastian/.local/lib/python3.9/site-packages/llvmlite/binding/libllvmlite.so)
==28430== 

EDIT: I had forgotten --track-origins=yes, but even with it, it doesn't actually report more. And, I am not sure what that means...

@stuartarchibald
Contributor

@seberg I've often seen that same report from Valgrind for llvm::ConstantExpr::getGetElementPtr(); it definitely can occur, but likewise I'm not sure it's related to this issue. I suspect trying to trap the inf/nan in gdb is perhaps the lowest-effort thing to try next.

I also wonder if conda packages produce the same issue or if it only occurs with the pip packages?


slayoo commented May 5, 2023

Here's a GitHub Actions setup demonstrating the problem: https://github.com/slayoo/numba_issue_8931/actions/runs/4891585656

Observations:

Contributor

seberg commented May 5, 2023

I never noticed this part:

to experience error, it’s enough to compile a function with numba, no need to use it in the code

One thing the SciPy code probably does here, and that may be non-trivial, is calling into BLAS/LAPACK, which might itself do some non-trivial things. In that case, maybe OMP_NUM_THREADS=1 makes the issue go away?

But that, again, is just a random guess. For all we know, numba has nothing to do with it and just happens to trigger a lingering bug on the other side of the world, I guess.
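One way to try this suggestion without touching the reproducer is to launch it in a child process with the variable set. A minimal sketch, assuming the script name from the Dockerfile above; run_with_omp_threads is a hypothetical helper, not something from numba or SciPy:

```python
import os
import subprocess
import sys


def run_with_omp_threads(cmd, num_threads=1):
    """Run `cmd` with OMP_NUM_THREADS set in the child's environment."""
    env = dict(os.environ)
    # The variable is set before the child starts, so it is already in place
    # when the child process loads its BLAS/OpenMP runtime.
    env["OMP_NUM_THREADS"] = str(num_threads)
    return subprocess.run(cmd, env=env, capture_output=True, text=True)


# Example (assumes the reproducer is saved as numba_scipy_err.py):
# run_with_omp_threads([sys.executable, "numba_scipy_err.py"]).returncode
```

Setting the variable in the parent's copy of the environment, rather than inside the script, avoids any dependence on import order in the reproducer.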


slayoo commented May 5, 2023

@seberg, thanks.

Trying out all the CI runs with OMP_NUM_THREADS set to either 1 or 2 (https://github.com/slayoo/numba_issue_8931/blob/9adceece7d6e4fb6a530ca2fd2c9b1adf022dc4c/.github/workflows/all.yml#L44) shows that it does not alter the behaviour: the runs still fail on Linux whenever JIT is enabled (even though it is used only in unrelated code):
https://github.com/slayoo/numba_issue_8931/actions/runs/4892152627

(for the record, the failure has also been recorded once on Windows, but so far never on macOS)


github-actions bot commented Jun 5, 2023

This issue is marked as stale as it has had no activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with any updates and confirm that this issue still needs to be addressed.
