Help. Process finished with exit code 132 #241

Open
Inqus opened this issue Mar 31, 2020 · 19 comments

Comments

@Inqus

Inqus commented Mar 31, 2020

I get an error when trying to run a function from java. What have I done wrong?

Process finished with exit code 132 (interrupted by signal 4: SIGILL)
try (Interpreter interp = new SharedInterpreter()) {
    interp.runScript("src/main/python/forecast.py");
    var pred = interp.getValue("predicted");
    System.out.println(pred);
} catch (JepException e) {
    e.printStackTrace();
}

forecast.py:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.metrics import mean_squared_error
import pmdarima as pm

# Import data
data = pd.read_csv('src/main/python/week.csv', parse_dates=['date'], index_col='date')

# Create Training and Test
train = data.value[:194]
test = data.value[194:]

# The next line is where I get the error
# 'Process finished with exit code 132 (interrupted by signal 4: SIGILL)'
model = pm.ARIMA(order=(1, 0, 1), seasonal_order=(0, 1, 1, 52))
#####

model.fit(train)

# Forecast
n_periods = 15
predicted, confint = model.predict(n_periods=n_periods, return_conf_int=True)

index_of_fc = pd.date_range(data.index[194], periods=n_periods, freq='W-MON')

# make series for plotting purpose
fitted_series = pd.Series(predicted, index=index_of_fc)
lower_series = pd.Series(confint[:, 0], index=index_of_fc)
upper_series = pd.Series(confint[:, 1], index=index_of_fc)

Environment:

  • OS Platform, Distribution, and Version: Mac OS 10.15.3
  • Python Distribution and Version: 3.7.0
  • Java Distribution and Version: 11.0.2
  • Jep Version: 3.9.0
  • Python packages used (e.g. numpy, pandas, tensorflow): numpy, pmdarima
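
For readers decoding the message above: exit code 132 follows the common convention of 128 plus the signal number, so 132 - 128 = 4, which is SIGILL (illegal instruction) on both Linux and macOS. A minimal Java sketch of that arithmetic, with ExitStatus and signalFromExitCode as hypothetical names chosen only for illustration:

```java
// Decodes a shell-style exit status into a signal number.
// Assumes the common "128 + signal" convention; this is not a Jep API.
final class ExitStatus {
    private ExitStatus() {}

    /** Returns the signal number, or -1 if the status looks like a normal exit code. */
    static int signalFromExitCode(int exitCode) {
        return (exitCode > 128 && exitCode < 256) ? exitCode - 128 : -1;
    }

    public static void main(String[] args) {
        // 132 - 128 = 4, the number of SIGILL on Linux and macOS.
        System.out.println(signalFromExitCode(132));
    }
}
```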
@Inqus changed the title from "Help" to "Help. Process finished with exit code 132" on Mar 31, 2020
@bsteffensmeier
Member

I suspect that either something in your environment is not correct or pmdarima is not working in an embedded Python. Can you try altering your function to do something that does not use pmdarima, to verify that your environment is working? If that works, then try to identify which specific line of your function is causing the crash; that may help narrow down what is wrong.

@Inqus
Author

Inqus commented Apr 1, 2020

I suspect that either something in your environment is not correct or pmdarima is not working in an embedded Python. Can you try altering your function to do something that does not use pmdarima, to verify that your environment is working? If that works, then try to identify which specific line of your function is causing the crash; that may help narrow down what is wrong.

I changed the Python code to something closer to what I need and marked the place where the error occurs. Strangely, it appears on the line where the model is declared.

@plapa

plapa commented Nov 18, 2020

Hi,
Could you try using pm.ARIMA(order=(1, 0, 1), seasonal_order=(0, 1, 1, 52), enforce_stationarity=False)?

@HemersonTacon

Hello,
I'm facing the same problem. The minimum piece of code that I have so far that reproduces the issue is the following:

This runs without problems in plain Python:

from statsmodels.tsa.statespace.sarimax import SARIMAX
import numpy as np
def test_sigill():
    history_vec = np.zeros(60)
    model = SARIMAX(history_vec, order=(1, 0, 1), seasonal_order=(2, 0, 3, 7), 
                    concentrate_scale=False, use_exact_diffuse=False)
    fitted_model = model.fit()

But this equivalent code running on Java through Jep fails with the same message (Process finished with exit code 132 (interrupted by signal 4: SIGILL)):

    @Test
    public void testSigIll() {
        try (Interpreter jep = new SharedInterpreter()) {
            jep.eval("from statsmodels.tsa.statespace.sarimax import SARIMAX");
            jep.eval("import numpy as np");
            jep.eval("def test_sigill():\n" +
                     "    history_vec = np.zeros(60)\n" +
                     "    model = SARIMAX(history_vec, order=(1, 0, 1), seasonal_order=(2, 0, 3, 7), concentrate_scale=False, use_exact_diffuse=False)\n" +
                     "    fitted_model = model.fit()\n");
            jep.invoke("test_sigill");
        } catch (final JepException e) {
            throw new RuntimeException(e);
        }
    }

Note that I am running this method as a test in my project.

The workaround that @plapa mentioned works for this example: adding enforce_stationarity=False to the constructor call, i.e. model = SARIMAX(history_vec, order=(1, 0, 1), seasonal_order=(2, 0, 3, 7), concentrate_scale=False, use_exact_diffuse=False, enforce_stationarity=False), avoids the crash.

Environment:

  • OS Platform, Distribution, and Version: macOS Catalina 10.15.7
  • Python Distribution and Version: 3.8.3
  • Java Distribution and Version: 11.0.5
  • Jep Version: 3.9.0
  • Python packages used (e.g. numpy, pandas, tensorflow): numpy, statsmodels.tsa.statespace.sarimax

@joaopmatias

joaopmatias commented Nov 25, 2020

I think I was able to narrow down the issue to a couple of functions in scipy. Are there other known issues with using scipy in Jep?

I experimented using examples for both packages (pmdarima and statsmodels) and got errors in parts of the code where the functions zgetrf or dgetrf from the module scipy.linalg.cython_lapack were called (not always, but sometimes).

I was able to replicate the issue using a file cc.pyx with the following content:

cimport numpy as np
cimport cython
import numpy as np

np.import_array()

cimport scipy.linalg.cython_lapack as lapack

cdef FORTRAN = 1

cdef int zmeh(int n) except *:
    cdef np.npy_intp dim[2]
    cdef np.complex128_t [::1,:] capI
    cdef int [::1,:] ipiv
    cdef:
        int info

    # Initialize arrays
    dim[0] = n; dim[1] = n;
    capI = np.PyArray_ZEROS(2, dim, np.NPY_COMPLEX128, FORTRAN)
    ipiv = np.PyArray_ZEROS(2, dim, np.NPY_INT32, FORTRAN)

    # create matrix with two diagonals
    for i in range(n-1):
        capI[i,i] = capI[i,i] + 1
        capI[i,i+1] = capI[i,i+1] + 1
    capI[n-1,n-1] = capI[n-1,n-1]

    lapack.zgetrf(&n, &n, &capI[0,0], &n, &ipiv[0,0], &info)
    print(np.asarray(capI))
    print(np.asarray(ipiv))
    print(info)
    return 0;

def meh(n):
    zmeh(n)

I created the .so file for the import using python setup.py build_ext --inplace and a file setup.py with the content

from Cython.Build import cythonize
from setuptools import Extension, setup
import numpy as np

extension = Extension("cc", ["cc.pyx"],
                      include_dirs=[np.get_include()], depends=[],
                      libraries=[], library_dirs=[])

setup(ext_modules=cythonize(extension, compiler_directives={'language_level': "3"}))

EDIT: I forgot to say that the error is produced after importing the module cc and executing >>> meh(10) in the jep console.
On the other hand, >>> meh(9) does not give an error and gives the same output as the python console.

@bsteffensmeier
Member

  1. The only known issues with scipy involve shared modules and multithreaded applications. I really don't think that is what is wrong here, but anyone testing this issue should use SharedInterpreter (not SubInterpreter) and test it in a single-threaded application, just to make sure that isn't causing problems.
  2. Jep has had issues with signal handlers, and since you are crashing due to a signal, that makes me suspicious. Generally Java is the main application, so it should be catching signals and the embedded Python should not. I'm not sure whether SIGILL is intended to be caught, though; perhaps the code is trying to use an optimized instruction that isn't always available and should catch SIGILL and proceed with slower instructions. I don't know if that is actually something scipy would do; I'm not sure how those types of things work at that low a level.
  3. There are a lot of little details that impact library loading. It would be helpful to compare the loaded libraries in both the broken Java application and the working Python application and make sure any scipy-related libraries are the same. I'm thinking some libraries might be distributed with multiple versions for optional instruction set extensions, and Java is picking the wrong ones and executing illegal instructions. Again, I'm not sure if that is how these things are even handled, but you might see something else in the loaded libraries that is different.
  4. Is Java specifically doing something to limit some instructions within the process? I really don't think so, because then other JNI applications would also break. It might be worthwhile to try to build a standalone application that embeds Python and executes the sample code. I've done this before to test things and it only takes a dozen lines of C. If it works in an embedded environment, that tells us Java is doing something to interfere with the way it should work; if it fails, that tells us that the python executable is setting up something extra that allows it to work.
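
The library comparison suggested in point 3 could be sketched as follows on Linux, where /proc/self/maps lists every file mapped into the current process. This is only an illustration under stated assumptions: LoadedLibraries and matching are hypothetical names, the keyword list is a guess at what is relevant, and /proc does not exist on macOS, where a different tool would be needed. The idea is to run it inside the Java process after the Python imports have executed and compare the result with the same listing taken from a plain python process.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper: extracts mapped file paths from /proc/<pid>/maps lines.
final class LoadedLibraries {
    private LoadedLibraries() {}

    /** Returns the sorted, de-duplicated paths whose lower-cased name contains any keyword. */
    static Set<String> matching(List<String> mapsLines, String... keywords) {
        Set<String> libraries = new TreeSet<>();
        for (String line : mapsLines) {
            // Format: address perms offset dev inode [pathname]
            String[] fields = line.trim().split("\\s+", 6);
            if (fields.length < 6) {
                continue; // anonymous mapping without a backing file
            }
            String path = fields[5];
            String lower = path.toLowerCase(Locale.ROOT);
            for (String keyword : keywords) {
                if (lower.contains(keyword)) {
                    libraries.add(path);
                    break;
                }
            }
        }
        return libraries;
    }

    public static void main(String[] args) throws IOException {
        // Linux only: reads the mappings of the current JVM process.
        List<String> lines = Files.readAllLines(Paths.get("/proc/self/maps"));
        matching(lines, "blas", "lapack", "numpy", "scipy").forEach(System.out::println);
    }
}
```

Differences in the BLAS or LAPACK paths between the two listings would support the "wrong library variant" theory; identical listings would point elsewhere.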

@joaopmatias

Thank you for the reply @bsteffensmeier !

Yes, we have been using SharedInterpreter instances in our Java code.

I found out about signal libraries after your comment. They provide some additional control over the signals, but it is not straightforward, as you would know. If you have additional tips related to these, please let us know. :)
However, what strikes me as odd is that, if signals were the cause, a standalone Python script should also crash with SIGILL, and that is not happening. Could Java be the one raising the signal, with Python then catching it?

I created a small example executing these python commands through the python c-api and jep in the folder linked below.
https://github.com/joaopmatias/jep/tree/replicate_sigill/temp_sandbox

Yet again, the issue only occurs while running it through jep. I also don't see how jep and standalone python could use different versions of dependencies since they essentially use the same executable (at least the script through the C-API does).

My only idea now is digging into scipy to find the exact point where it fails, which will probably lead to the lapack or blas implementations, but that seems a bit tedious. So far I have been using print statements to debug; are there better ways to debug Jep documented somewhere?

Thank you! :)

@bsteffensmeier
Member

With signals, the problem we have is that some libraries attempt to install signal handlers from Python, and Python will not install them because it detects that it is not on the main thread. The only way I can see that causing your problem is if something in Python/scipy is supposed to catch that SIGILL but Java is getting it instead and crashing. Again, I'm not sure if that is really plausible; I'm just trying to brainstorm and compare against problems we have seen before.

Thank you for putting together your small example. I was able to run it on my Ubuntu system. I do get a segfault when executing run.py from Jep. It does not segfault from python or from c_meh, so it sounds like the same problem even though I don't see SIGILL.

I was able to make the segfault disappear by building scipy from source (pip3 install scipy --no-binary :all:). I recommend trying that and seeing if it helps on your system. It would still be awesome if we could fix whatever is different about Jep, but that would be a big hint.

Another difference between c_meh and Jep (besides the entire JVM running) is the threading. You could try adding threading to c_meh and see if that reproduces the problem. I recently put together an isolated example of the Python-related threading in Jep here. If you modify that code to call py_meh, you should be able to determine whether it is threading related.

Along the lines of threading, one of the problems we have come across in the past is that python offers some C-API calls that are incompatible with multiple interpreters. This is why Jep has come up with the restriction that only one interpreter can be used on a thread and this has solved all known issues related to this API, but it is possible you have discovered some new caveat that we need to work around.

@joaopmatias

joaopmatias commented Nov 27, 2020

I built scipy from source and had some success with it!

However, I still see the error occurring in other situations. I'll share when I have updates. :)

@joaopmatias

joaopmatias commented Dec 2, 2020

Hi @bsteffensmeier !

This is just a short update. I updated the example in
https://github.com/joaopmatias/jep/tree/replicate_sigill/temp_sandbox

Yet again, the C code runs normally and it crashes with Jep (this time, I think the crash happens more consistently, although the solve function from scipy that is used is less specific than the old example).

I integrated your example with threading and the code still works. Please check if that's what you meant. I don't fully understand your example because it seems to always be using a single thread and when I uncomment any of the commented lines, it immediately breaks.

Thank you!

EDIT: the updated example does not replicate the issue when I build scipy from source.

@bsteffensmeier
Member

In the multithreaded c example the commented out portion shows things that used to work in previous versions but do not work in the latest python. The code you have now which runs on_thread_3 is simulating what we do in jep 3.9.1 and should create one extra thread which is a good simplification of how SubInterpreter runs. The only change you need to make is to run your py_meh code in on_thread_3 instead of in main. In Jep the "main" python thread doesn't actually run anything.

My theory with the recompile is that SIGILL indicates an illegal instruction and the most straightforward explanation for that is that the code was compiled on another machine with extra instructions that our CPUs don't support. Compiling locally should ensure the compiler only uses instructions which we actually have. But this doesn't really fit well with what you see where it fails in jep and not python. I am not aware of anything in jep or the JVM which could alter which instructions are used. My best recommendation would be to start digging into the scipy code to try to narrow down where it is crashing.

@joaopmatias

joaopmatias commented Dec 4, 2020

It seems that running code in a subthread explains these failures! I updated the code example in the fork using your suggestions and the results of C and Jep are more consistent (though not completely...).
https://github.com/joaopmatias/jep/tree/replicate_sigill/temp_sandbox

Looking into the results of the scipy tests (import scipy; scipy.test()), it seems that the segfaults differ depending on whether scipy was compiled locally and/or openblas is installed through conda (conda install libopenblas libgfortran nomkl), so there probably isn't a single root cause. I selected a few additional examples that I found interesting and that give different results depending on the environment. They can be commented/uncommented in run.py.

I also had a couple of questions related to this. Excuse me if they are answered in the docs. Since the C code was working fine when it was executed in the main thread, is there an alternative to SharedInterpreter() that would execute in the main thread too?
I also noticed that the output of threading.current_thread() through python or the C code was always _MainThread(MainThread, ...) independently of executing it in the main or subthread, whereas the output through Jep was always _DummyThread(Dummy-1, ...). Is this behavior expected?

Thanks!

@bsteffensmeier
Member

I also had a couple of questions related to this. Excuse me if they are answered in the docs. Since the C code was working fine when it was executed in the main thread, is there an alternative to SharedInterpreter() that would execute in the main thread too?

There is currently no simple way for developers to access the main thread in Jep. Right now the only thing the main thread is doing after initialization is handling shared module imports. You could try to use that mechanism to run your code on the main thread. If you create a SubInterpreter with "run" as a shared module then "import run" in the SubInterpreter then the actual import logic will be executed in the main interpreter. I don't think that would be a useful solution but if you want to try it out and make sure it is stable on the main thread that is one way to try it. Another option would be to just import run from the jep c code. You could import run in pyembed_startup at the end before we release the thread.

I also noticed that the output of threading.current_thread() through python or the C code was always _MainThread(MainThread, ...) independently of executing it in the main or subthread, whereas the output through Jep was always _DummyThread(Dummy-1, ...). Is this behavior expected?

I have never noticed this before but according to the threading documentation any threads that are not created by python use a dummy thread. Since Jep threads are created by java this seems like the correct behaviour to me. I'm surprised the subthread in c would not be a dummy thread, that may be a difference between sub-interpreter and shared-interpreter. The current code you have will create a SubInterpreter. Do you know if a SubInterpreter thread from Jep will be a Dummy or Main? If you want to try to test something closer to a shared interpreter from C then the code below could be used instead of on_thread_3. I have not tested it yet but it is simply the necessary parts from the jep source. I would expect that to be a dummy thread.

void *on_thread_shared(void *vargp){
    PyThreadState* state = PyThreadState_New(main_state->interp);
    PyEval_AcquireThread(state);
    PyRun_SimpleString("print('In a shared interpreter')\n");
    PyRun_SimpleString("import run");
    PyThreadState_Clear(state);
    PyEval_ReleaseThread(state);
    PyThreadState_Delete(state);
    return NULL;
}

@joaopmatias

joaopmatias commented Dec 14, 2020

@bsteffensmeier I updated the examples in my fork
https://github.com/joaopmatias/jep/tree/replicate_sigill/temp_sandbox
and I still can't pinpoint the issue. Part of the problem is that building numpy and scipy from source, and making the libraries' tests pass, is not straightforward. I probably need to add some flags to the build command. If anyone else has experience with this, please let me know. :)

By the way, just to be clear below, when I say "thread" or "main thread", I mean the outputs of the functions threading.current_thread() and threading.main_thread() from python.

If you create a SubInterpreter with "run" as a shared module then "import run" in the SubInterpreter then the actual import logic will be executed in the main interpreter.

The output of threading.main_thread() in the SubInterpreter matches the output of threading.current_thread() in a SharedInterpreter in the same Java program. Curiously, the output of threading.main_thread() in the SharedInterpreter shows a thread different from the previous one.

Another option would be to just import run from the jep c code. You could import run in pyembed_startup at the end before we release the thread.

After changing pyembed_startup, import run worked without problems in my few experiments. The thread executing this code was the main thread of SharedInterpreter, which makes sense. By the way, is there a way of doing it without changing the source code? I tried

try {
    MainInterpreter.setSharedModulesArgv("run");
} catch (JepException e) {
    System.out.println(e);
}
try (Interpreter interp = new SharedInterpreter()) {
    interp.eval("2+2");
} catch (JepException e) {
    System.out.println(e);
}

but had no success.

I have never noticed this before but according to the threading documentation any threads that are not created by python use a dummy thread. Since Jep threads are created by java this seems like the correct behavior to me. I'm surprised the subthread in c would not be a dummy thread, that may be a difference between sub-interpreter and shared-interpreter.

I used your snippet for void *on_thread_shared(void *vargp) and as a result, it used a thread labelled with DummyThread.

Do you know if a SubInterpreter thread from Jep will be a Dummy or Main?

The thread of the SubInterpreter was labelled as MainThread, but as I pointed out above, I guess it runs on a "third level", the hierarchy being pyembed_startup > SharedInterpreter > SubInterpreter.

Finally, there is a possible workaround for the issue discussed on this page: wrapping the Python functions in a threading.Thread, as follows

inception = threading.Thread(target=run, args=(q, skip_examples, throw_exception))
inception.start()
inception.join()

It works with all the examples I gathered, but it gives little insight into why they break in threads created in C or Java. The best explanation is that C or Java may be catching some signal generated by python code that is usually ignored as you mentioned.

@bsteffensmeier
Member

After changing pyembed_startup, import run worked without problems in my few experiments. The thread executing this code was the main thread of SharedInterpreter, which makes sense. By the way, is there a way of doing it without changing the source code? I tried

try {
    MainInterpreter.setSharedModulesArgv("run");
} catch (JepException e) {
    System.out.println(e);
}
try (Interpreter interp = new SharedInterpreter()) {
    interp.eval("2+2");
} catch (JepException e) {
    System.out.println(e);
}

but had no success.

setSharedModulesArgv() only controls what is seen by sys.argv in python, it does not actually run anything or control shared modules. If you want the run module to be shared you need to use JepConfig.addSharedModules() and then use that JepConfig to create a SubInterpreter. There is no way to import run on the main thread using SharedInterpreter.

Finally, there is a possible workaround by wrapping the python functions in a threading.Thread as follows

inception = threading.Thread(target=run, args=(q, skip_examples, throw_exception))
inception.start()
inception.join()

It works with all the examples I gathered, but it gives little insight into why they break in threads created in C or Java. The best explanation is that C or Java may be catching some signal generated by python code that is usually ignored as you mentioned.

I think this is really interesting. The C code in Jep, and in on_thread_shared, is theoretically functionally equivalent to threading.Thread.start(). It should be possible to step through the C code behind threading.Thread.start() and update c_run.c to match. If you can copy it perfectly, it should start working, and you could then compare the two to see what is actually making the difference. I started looking through the code and noticed right away that we only call PyThreadState_New() on the newly created thread, whereas CPython splits that up and calls _PyThreadState_Prealloc() on the old thread and then _PyThreadState_Init() on the new thread. That doesn't look like an actual problem, because both sets of calls still end up calling new_threadstate() followed by _PyThreadState_Init(), so I expect them to generate the same result, but somewhere in the details there might live the problem that is affecting you.

Another approach to looking at this might be to compare the PyThreadState objects between a thread created with threading and a thread created from the c-api. The details of that struct are considered internal and I have never looked closely at them but so much of the thread initialization code is just setting up that struct that I suspect something in there may be your problem and a simple comparison may give some idea what is different.

@joaopmatias

joaopmatias commented Dec 15, 2020

If you want the run module to be shared you need to use JepConfig.addSharedModules() and then use that JepConfig to create a SubInterpreter.

This procedure shows that the code is executed in the same thread as pyembed_startup.

The hint about tracking the matching C code looks promising. I wonder if the imports and/or "shared states"(?) will work well though. We'll see.

In the meantime, I found the first line in https://github.com/python/cpython/blob/master/Lib/threading.py neat
"""Thread module emulating a subset of Java's threading model."""
it makes it look like the circle is complete :)

EDIT: the following link seems to describe the thread module in a more accessible way
https://github.com/zpoint/CPython-Internals/blob/master/Interpreter/thread/thread.md

@joaopmatias

joaopmatias commented Jan 3, 2021

Success 🎉! The tldr of the fix I found was to increase the stack size of the threads or programs.

In the meantime, I did not figure out what the default stack size is when python is called from the command line, or when the main thread of a C program is running (the Python C-API worked well in that scenario). Please pitch in if anyone knows. :)
Additionally, please pitch in if anyone can suggest an easy-to-use profiler that could have detected this earlier. :)
I also updated the examples in https://github.com/joaopmatias/jep/tree/replicate_sigill/temp_sandbox again.

I hope that this actually sorts out the issue and does not simply hide it because there was a lot of trial and error and I kind of found this simple fix by chance. I wonder what actions could follow up: a heads up in the documentation could be added (in the numpy, scipy section) or, if possible, a warning could be issued before the call stack increases beyond its limit.

The longer story for how I found this is that I spent a lot of time checking if the computations could be affected by running Py_Initialize() and the other python examples in different threads. This was not necessarily the case since calling Py_Initialize() in a new thread and the remaining python code in the main thread of the C program worked well.
Then, I fidgeted a bit with the threads where Jep executes Py_Initialize() and noticed that the only case where the python examples worked well was when they were executed in the thread MainInterpreter.thread, which is different from the Java main thread. So, the following code snippet could work as an additional workaround.

Thread thread = new Thread() {
    @Override
    public void run() {
        try (Interpreter interp = new SharedInterpreter()) {
            System.out.println("\n   SharedInterpreter");
            interp.eval("import run");
            interp.eval("run.run()");
        } catch (JepException e) {
            System.out.println(e);
        }
    }
};
thread.start();
thread.join();

I also spent some time looking into the CPython source code, in particular the Thread class and PyThread_start_new_thread() as suggested in a previous comment. Eventually, I got an example that worked well and used PyThread_start_new_thread() directly. A few experiments later I was able to make the original examples work with few changes (that altered the stack size).

Finally, I increased the stack size of the Java program using Jep to 4 MB and the examples also worked well.
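
Besides raising the default for the whole JVM, a stack size can also be requested for a single thread through the four-argument java.lang.Thread constructor. The sketch below is an assumption about how that could be wired up, not something tested in this thread: BigStackRunner and callWithStack are hypothetical names, and the Javadoc describes the stackSize argument as a hint that some platforms may ignore. With Jep, the task would create, use, and close its SharedInterpreter entirely inside the supplier, as in the snippet above.

```java
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Supplier;

// Hypothetical helper: runs a task on a worker thread that requests a larger stack.
final class BigStackRunner {
    /** 4 MB, the value reported to work in this thread. */
    static final long STACK_BYTES = 4L * 1024 * 1024;

    private BigStackRunner() {}

    /** Runs the task on a fresh thread with the requested stack size and waits for its result. */
    static <T> T callWithStack(Supplier<T> task, long stackBytes) {
        AtomicReference<T> result = new AtomicReference<>();
        AtomicReference<Throwable> failure = new AtomicReference<>();
        Runnable body = () -> {
            try {
                result.set(task.get());
            } catch (Throwable t) {
                failure.set(t);
            }
        };
        // The last argument is the requested stack size in bytes (a hint to the JVM).
        Thread worker = new Thread(null, body, "big-stack-worker", stackBytes);
        worker.start();
        try {
            worker.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for worker", e);
        }
        if (failure.get() != null) {
            throw new IllegalStateException("task failed on worker thread", failure.get());
        }
        return result.get();
    }

    public static void main(String[] args) {
        // Placeholder task; a Jep caller would open a SharedInterpreter here instead.
        Integer sum = callWithStack(() -> 2 + 2, STACK_BYTES);
        System.out.println(sum);
    }
}
```

This keeps the larger stack limited to the thread that runs the Python code, which may be preferable when the application creates many other threads.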

@bsteffensmeier
Member

Success 🎉! The tldr of the fix I found was to increase the stack size of the threads or programs.

It is awesome that you were able to track that all the way down. Thank you so much for reporting back what you found. I am not sure how to identify who would need to adjust the stack size, but it is something simple we can recommend other people try when they run into problems.

To sum up the simplest solution for anyone running across this thread in the future: add -Xss4m as an argument to the Java application using Jep.
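
The flag only helps if it reaches the JVM that actually loads Jep; an IDE run configuration or a test runner may start its own JVM with its own arguments. A minimal sketch for checking this at startup, using the standard RuntimeMXBean. StackFlagCheck and findStackSizeFlag are hypothetical names, and -XX:ThreadStackSize is included on the assumption of a HotSpot JVM:

```java
import java.lang.management.ManagementFactory;
import java.util.List;
import java.util.Optional;

// Hypothetical helper: reports whether a stack size flag was passed to this JVM.
final class StackFlagCheck {
    private StackFlagCheck() {}

    /** Returns the last stack size flag found in the given JVM arguments, if any. */
    static Optional<String> findStackSizeFlag(List<String> jvmArgs) {
        String found = null;
        for (String arg : jvmArgs) {
            if (arg.startsWith("-Xss") || arg.startsWith("-XX:ThreadStackSize=")) {
                found = arg; // keep the last occurrence
            }
        }
        return Optional.ofNullable(found);
    }

    public static void main(String[] args) {
        // JVM options only; program arguments are not part of this list.
        List<String> jvmArgs = ManagementFactory.getRuntimeMXBean().getInputArguments();
        System.out.println(findStackSizeFlag(jvmArgs).orElse("no explicit stack size flag"));
    }
}
```

Started with java -Xss4m, this should print the flag; without it, the fallback text is printed.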

@joaopmatias

Thank you very much for always replying promptly, @bsteffensmeier :)


6 participants