Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

apparent deadlock in Python bindings for go-jsonnet #484

Open
githubaccount888 opened this issue Dec 1, 2020 · 10 comments
Open

apparent deadlock in Python bindings for go-jsonnet #484

githubaccount888 opened this issue Dec 1, 2020 · 10 comments

Comments

@githubaccount888
Copy link

First off - apologies for lack of any "simple reproduction" of this insidious bug - it only occurs under a specific set of (rather complex) circumstances. I will do my best to describe them and what I've observed - perhaps that will be enough to inspire a possible cause/solution.

Issue: I observe what appears to be a deadlock when calling _gojsonnet.evaluate_file() under a certain set of conditions. There is 0% CPU usage and the process appears to be hanging on IO somewhere in the call to evaluate_file().

My observations so far

  • This bug is deterministic. When the "conditions" are met it seems to happen 100% of the time.

  • It only affects gojsonnet bindings (v0.16 and v0.17 ... possibly earlier too, but these are the only ones I've been able to easily install). In my testing it does not affect C++ jsonnet bindings.

  • I've only been able to trigger it when calling via PythonOperator on Apache Airflow (using LocalExecutor). In my setup, Airflow is running as a Docker container (hopefully not relevant but worth mentioning). I briefly dug into the PythonOperator code and it appears to launch a new Python subprocess in case that's important. I have not been able to trigger it by calling evaluate_file() manually from the Python REPL (verified both locally and inside my Docker container environment).

  • It seems to only happen when the generated json is rather large (~5MB in my case). If I shrink the size of my generated json to a few KB the issue goes away. I don't know exactly where this cutoff is, but the deterministic nature of the bug strongly suggests there is a hard cutoff somewhere.

Please let me know what you think or if you have any questions - I don't have enough knowledge of how Python's C API works to really debug this further on my own. There are unfortunately lots of moving parts here, but I filed as an issue in go-jsonnet given that the C++ bindings do not exhibit this issue.

@sbarzowski
Copy link
Collaborator

Thanks a lot for reporting this. The folks at https://github.com/deepmind/kapitan reported similar problems and it's the primary reason why they're still on cpp-jsonnet. Having these additional details is really helpful.

We fixed one issue that had a small potential to fix this in 0.17.0, but apparently that wasn't it.

If you could get some info about what the hanged program is doing exactly that would be great. For example you can try to strace it or connect to it with gdb.

You can also try the newest version. We had one change that has a nonzero chance of affecting this: #478.

Please let us know if you can think of any other relevant details. If you're at any point able to create a self-contained repro (even if it's big and complicated) that would of course be super helpful.

@githubaccount888
Copy link
Author

Attached gdb and printed backtrace - looks like a call to Go's malloc triggers the deadlock?

(gdb) bt
#0  runtime.futex () at /usr/local/airflow/go/src/runtime/sys_linux_amd64.s:536
#1  0x00007f8af767ed66 in runtime.futexsleep (addr=0x7f8af7d10598 <runtime.work+216>, val=0, ns=-1)
    at /usr/local/airflow/go/src/runtime/os_linux.go:44
#2  0x00007f8af765f508 in runtime.notetsleep_internal (n=0x7f8af7d10598 <runtime.work+216>, ns=-1, 
    ~r2=<optimized out>) at /usr/local/airflow/go/src/runtime/lock_futex.go:174
#3  0x00007f8af765f710 in runtime.notetsleepg (n=0x7f8af7d10598 <runtime.work+216>, ns=-1, 
    ~r2=<optimized out>) at /usr/local/airflow/go/src/runtime/lock_futex.go:228
#4  0x00007f8af76702c2 in runtime.gcBgMarkStartWorkers () at /usr/local/airflow/go/src/runtime/mgc.go:1786
#5  0x00007f8af766ef4d in runtime.gcStart (trigger=...) at /usr/local/airflow/go/src/runtime/mgc.go:1265
#6  0x00007f8af7660a24 in runtime.mallocgc (size=204800, typ=0x7f8af7a6f580, needzero=true, 
    ~r3=<optimized out>) at /usr/local/airflow/go/src/runtime/malloc.go:1115
#7  0x00007f8af76950f4 in runtime.growslice (et=0x7f8af7a6f580, old=..., cap=1205, ~r3=...)
    at /usr/local/airflow/go/src/runtime/slice.go:181
#8  0x00007f8af7731724 in github.com/google/go-jsonnet/internal/parser.(*lexer).emitFullToken (
    l=0xc0000e3720, kind=4, data=..., stringBlockIndent=..., stringBlockTermIndent=...)
    at /tmp/pip-install-j2voo3v2/gojsonnet/internal/parser/lexer.go:362
#9  github.com/google/go-jsonnet/internal/parser.(*lexer).emitToken (l=0xc0000e3720, kind=4)
    at /tmp/pip-install-j2voo3v2/gojsonnet/internal/parser/lexer.go:374
#10 0x00007f8af7735e1f in github.com/google/go-jsonnet/internal/parser.Lex (diagnosticFilename=..., 
    importedFilename=..., input=..., ~r3=..., ~r4=...)
    at /tmp/pip-install-j2voo3v2/gojsonnet/internal/parser/lexer.go:856
#11 0x00007f8af7746473 in github.com/google/go-jsonnet/internal/parser.SnippetToRawAST (
    diagnosticFilename=..., importedFilename=..., snippet=..., ~r3=..., ~r4=..., ~r5=...)
    at /tmp/pip-install-j2voo3v2/gojsonnet/internal/parser/parser.go:1388
#12 0x00007f8af774cd85 in github.com/google/go-jsonnet/internal/program.SnippetToAST (diagnosticFilename=..., 
    importedFilename=..., snippet=..., ~r3=..., ~r4=...)
    at /tmp/pip-install-j2voo3v2/gojsonnet/internal/program/program.go:10
#13 0x00007f8af775ff07 in github.com/google/go-jsonnet.(*importCache).importAST (cache=0xc0000aa1e0, 
    importedFrom=..., importedPath=..., ~r2=..., ~r3=..., ~r4=...)
    at /tmp/pip-install-j2voo3v2/gojsonnet/imports.go:128
#14 0x00007f8af7760586 in github.com/google/go-jsonnet.(*importCache).importCode (cache=0xc0000aa1e0, 
    importedFrom=..., importedPath=..., i=0xc0000bc120, ~r3=..., ~r4=...)
    at /tmp/pip-install-j2voo3v2/gojsonnet/imports.go:162
#15 0x00007f8af7765bb9 in github.com/google/go-jsonnet.(*interpreter).evaluate (i=0xc0000bc120, a=..., tc=1, 
    ~r2=..., ~r3=...) at /tmp/pip-install-j2voo3v2/gojsonnet/interpreter.go:482
#16 0x00007f8af7769c10 in github.com/google/go-jsonnet.(*interpreter).EvalInCleanEnv (i=0xc0000bc120, 
    env=0xc0000a9aa0, ast=..., trimmable=false, ~r3=..., ~r4=...)
    at /tmp/pip-install-j2voo3v2/gojsonnet/interpreter.go:946
#17 0x00007f8af776ce87 in github.com/google/go-jsonnet.(*cachedThunk).getValue (t=0xc0000ce800, 
    i=0xc0000bc120, ~r1=..., ~r2=...) at /tmp/pip-install-j2voo3v2/gojsonnet/thunks.go:68
#18 0x00007f8af77637d2 in github.com/google/go-jsonnet.(*interpreter).evaluate (i=0xc0000bc120, a=..., tc=1, 
    ~r2=..., ~r3=...) at /tmp/pip-install-j2voo3v2/gojsonnet/interpreter.go:532
#19 0x00007f8af7769c10 in github.com/google/go-jsonnet.(*interpreter).EvalInCleanEnv (i=0xc0000bc120, 
    env=0xc0001ee160, ast=..., trimmable=false, ~r3=..., ~r4=...)
    at /tmp/pip-install-j2voo3v2/gojsonnet/interpreter.go:946
#20 0x00007f8af776ce87 in github.com/google/go-jsonnet.(*cachedThunk).getValue (t=0xc0001f0100, 
    i=0xc0000bc120, ~r1=..., ~r2=...) at /tmp/pip-install-j2voo3v2/gojsonnet/thunks.go:68
#21 0x00007f8af775de5a in github.com/google/go-jsonnet.(*binaryBuiltin).evalCall (b=0xc000078330, args=..., 
    i=0xc0000bc120, ~r2=..., ~r3=...) at /tmp/pip-install-j2voo3v2/gojsonnet/builtins.go:1409
#22 0x00007f8af776f757 in github.com/google/go-jsonnet.(*valueFunction).call (f=0xc0000af760, i=0xc0000bc120, 
    args=..., ~r2=..., ~r3=...) at /tmp/pip-install-j2voo3v2/gojsonnet/value.go:349
#23 0x00007f8af7769df4 in github.com/google/go-jsonnet.(*interpreter).evaluateTailCall (i=0xc0000bc120, 
#24 0x00007f8af776593a in github.com/google/go-jsonnet.(*interpreter).evaluate (i=0xc0000bc120, a=..., tc=1, 
    ~r2=..., ~r3=...) at /tmp/pip-install-j2voo3v2/gojsonnet/interpreter.go:587
#25 0x00007f8af7769c10 in github.com/google/go-jsonnet.(*interpreter).EvalInCleanEnv (i=0xc0000bc120, 
    env=0xc0001c9ce0, ast=..., trimmable=false, ~r3=..., ~r4=...)
    at /tmp/pip-install-j2voo3v2/gojsonnet/interpreter.go:946
#26 0x00007f8af776ce87 in github.com/google/go-jsonnet.(*cachedThunk).getValue (t=0xc0000cf7c0, 
    i=0xc0000bc120, ~r1=..., ~r2=...) at /tmp/pip-install-j2voo3v2/gojsonnet/thunks.go:68
#27 0x00007f8af77637d2 in github.com/google/go-jsonnet.(*interpreter).evaluate (i=0xc0000bc120, a=..., tc=1, 
    ~r2=..., ~r3=...) at /tmp/pip-install-j2voo3v2/gojsonnet/interpreter.go:532
#28 0x00007f8af7769c10 in github.com/google/go-jsonnet.(*interpreter).EvalInCleanEnv (i=0xc0000bc120, 
    env=0xc000517458, ast=..., trimmable=false, ~r3=..., ~r4=...)
    at /tmp/pip-install-j2voo3v2/gojsonnet/interpreter.go:946
#29 0x00007f8af776cfed in github.com/google/go-jsonnet.(*codeUnboundField).evaluate (f=0xc0001c7490, 
    i=0xc0000bc120, sb=..., origBindings=0xc0001e8660, fieldName=..., ~r4=..., ~r5=...)
    at /tmp/pip-install-j2voo3v2/gojsonnet/thunks.go:92
#30 0x00007f8af7771510 in github.com/google/go-jsonnet.objectIndex (i=0xc0000bc120, sb=..., fieldName=..., 
    ~r3=..., ~r4=...) at /tmp/pip-install-j2voo3v2/gojsonnet/value.go:723
#31 0x00007f8af77675df in github.com/google/go-jsonnet.(*valueObject).index (obj=0xc0001e83f0, 
    i=0xc0000bc120, field=...) at /tmp/pip-install-j2voo3v2/gojsonnet/value.go:496
#32 github.com/google/go-jsonnet.(*interpreter).manifestJSON (i=0xc0000bc120, v=..., ~r1=..., ~r2=...)
    at /tmp/pip-install-j2voo3v2/gojsonnet/interpreter.go:701
#33 0x00007f8af7767617 in github.com/google/go-jsonnet.(*interpreter).manifestJSON (i=0xc0000bc120, v=..., 
    ~r1=..., ~r2=...) at /tmp/pip-install-j2voo3v2/gojsonnet/interpreter.go:706
#34 0x00007f8af7767617 in github.com/google/go-jsonnet.(*interpreter).manifestJSON (i=0xc0000bc120, v=..., 
    ~r1=..., ~r2=...) at /tmp/pip-install-j2voo3v2/gojsonnet/interpreter.go:706
#35 0x00007f8af77683a5 in github.com/google/go-jsonnet.(*interpreter).manifestAndSerializeJSON (
    i=0xc0000bc120, buf=0xc000517b80, v=..., multiline=true, indent=..., ~r4=...)
    at /tmp/pip-install-j2voo3v2/gojsonnet/interpreter.go:823
#36 0x00007f8af776c832 in github.com/google/go-jsonnet.evaluate (node=..., ext=0xc0000aa000, 
    tla=0xc0000aa030, nativeFuncs=0xc0000aa060, maxStack=500, ic=0xc0000aa1e0, stringOutputMode=false, 
    ~r7=..., ~r8=...) at /tmp/pip-install-j2voo3v2/gojsonnet/interpreter.go:1270
#37 0x00007f8af77722c9 in github.com/google/go-jsonnet.(*VM).evaluateSnippet (vm=0xc0000ac000, 
    diagnosticFileName=..., filename=..., snippet=..., kind=0, output=..., err=...)
    at /tmp/pip-install-j2voo3v2/gojsonnet/vm.go:182
#38 0x00007f8af777240b in github.com/google/go-jsonnet.(*VM).EvaluateSnippet (vm=0xc0000ac000, filename=..., 
    snippet=..., json=..., formattedErr=...) at /tmp/pip-install-j2voo3v2/gojsonnet/vm.go:272
#39 0x00007f8af777cd7a in main.evaluateSnippet (vmRef=0x55f2da623cb0, filename=..., code=..., 
    e=0x7ffc12854804, ~r4=<optimized out>) at /tmp/pip-install-j2voo3v2/gojsonnet/c-bindings/c-bindings.go:148
#40 0x00007f8af777dd6a in main.jsonnet_evaluate_file (vmRef=0x55f2da623cb0, 
    filename=0x7f8b0cf024c0 "/usr/local/airflow/jsonnet/archive/dl_bars.jsonnet", e=0x7ffc12854804, 
    ~r3=<optimized out>) at /tmp/pip-install-j2voo3v2/gojsonnet/c-bindings/c-bindings.go:239
#41 0x00007f8af777aa64 in main._cgoexpwrap_247c82eca4e9_jsonnet_evaluate_file (p0=0x55f2da623cb0, 
    p1=0x7f8b0cf024c0 "/usr/local/airflow/jsonnet/archive/dl_bars.jsonnet", p2=0x7ffc12854804, r0=0x0)
    at _cgo_gotypes.go:285
#42 0x00007f8af76a9e1d in runtime.call32 () at /usr/local/airflow/go/src/runtime/asm_amd64.s:539
#43 0x00007f8af76580cb in runtime.cgocallbackg1 (ctxt=0) at /usr/local/airflow/go/src/runtime/cgocall.go:314
#44 0x00007f8af7657e77 in runtime.cgocallbackg (ctxt=0) at /usr/local/airflow/go/src/runtime/cgocall.go:191
#45 0x00007f8af76ab42a in runtime.cgocallback_gofunc () at /usr/local/airflow/go/src/runtime/asm_amd64.s:793
#46 0x00007f8af76abb81 in runtime.goexit () at /usr/local/airflow/go/src/runtime/asm_amd64.s:1357
#47 0x0000000000000000 in ?? ()

@githubaccount888
Copy link
Author

Also I did try building the go-jsonnet Python bindings from the latest Github commit - deadlock was still there.

@githubaccount888
Copy link
Author

githubaccount888 commented Dec 2, 2020

I'm attaching a "relatively simple" example that reproduces the issue for me. It's still a bit of an abomination (blame Airflow and its poor testing strategy) but at least it's a starting point. The README file should explain the details - you'll need both Docker and Docker-Compose to run this.
gojsonnet-bug.tar.gz

Follow-ups:

  • When Airflow executes the gojsonnet task you should see a process appear, with a CMD named something like "airflow task runner: test-dag.test1 generate_json 2020-02-24T09:00:00+00:00". You'll notice that process is in a sleeping/idle state. If you attach gdb you should see a similar backtrace to the one I posted above. If you switch from gojsonnet-->jsonnet and run the test again that process should consume 100% CPU until it completes.
  • Updated the example in place to remove unnecessary "test2".
  • Updated the example in place to remove the need for a postgres container (unnecessary complexity)
  • Added docker-compose-airflow-test-simple.yml that runs the same code in a more direct context, except this one does not hang.

There seems to be something different about the context when Airflow executes the task, which appears to not play nice with the way gojsonnet invokes the Go runtime.

@githubaccount888
Copy link
Author

githubaccount888 commented Dec 13, 2020

That kapitan thread inspired me to take another look - here's a vastly simplified (no dependencies) test case that exercises the bug.
gojsonnet-bug-simple.tar.gz

Python multiprocessing has 3 ways to start a new process: https://docs.python.org/3/library/multiprocessing.html#contexts-and-start-methods

"spawn" and "forkserver" both work fine. "fork" causes the deadlock

(I'm on Linux btw)

@githubaccount888
Copy link
Author

Appears suspiciously similar to this: golang/go#15538

It sounds like it might just be a fundamental issue with the Go runtime, since it's multithreaded. Even the Python documentation linked above states "Note that safely forking a multithreaded process is problematic."

Unfortunately, Python uses "fork" as the default for Unix, so it's more likely people will run into this.

@githubaccount888
Copy link
Author

It's not a general solution, but I was able to fix the issue for me by just deferring all of my "import _gojsonnet" calls until after the fork() occurs (however this "fix" won't work if there is any nested fork() pattern in one's program).

There may not be any good general solution to this. In theory, if you could reliably create/destroy the Go runtime multiple times in the same program (and you were okay with this overhead) then you could wrap all _gjsonnet calls with create/destroy. I'm not an expert on Go's runtime but I really doubt that's possible.

@joulaud
Copy link

joulaud commented Dec 13, 2020

I agree with previous comments.
Whatever you will be able to do or not in the binding to work around this problem I think that immediately a prominent notice should be added to documentation recommending to not use fork method of Python multiprocessing package with go-jsonnet binding.

@sbarzowski
Copy link
Collaborator

Thanks a lot for investigating. This explanation sounds right, but it's very unfortunate.

Maybe we can mitigate the issue by playing with gcstoptheworld setting (disabling the concurrency completely). Either way we should warn about the issue with fork + go library.

In the longer term, we can think about solving this issue by spawning a separate process for Jsonnet evaluation in our wrapper and only the new process actually imports the go library. It has its own problems and limitations, though.

@netomi
Copy link
Contributor

netomi commented Apr 19, 2023

I experienced the very same issue when using the gojsonnet python bindings together with the multiprocessing Pool.map() function. In my case, whenever I used evaluate_file or evaluate_snippet in a forked process, there would be a deadlock unrelated to jsonnet:

Traceback (most recent call last):
  File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/lib/python3.10/concurrent/futures/process.py", line 240, in _process_worker
    call_item = call_queue.get(block=True)
  File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/lib/python3.10/multiprocessing/queues.py", line 102, in get
    with self._rlock:
  File "/usr/lib/python3.10/multiprocessing/synchronize.py", line 95, in __enter__
    return self._semlock.__enter__()
  File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/lib/python3.10/concurrent/futures/process.py", line 240, in _process_worker
    call_item = call_queue.get(block=True)
  File "/usr/lib/python3.10/multiprocessing/queues.py", line 102, in get
    with self._rlock:
  File "/usr/lib/python3.10/multiprocessing/synchronize.py", line 95, in __enter__
    return self._semlock.__enter__()
KeyboardInterrupt
  File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/lib/python3.10/concurrent/futures/process.py", line 240, in _process_worker
    call_item = call_queue.get(block=True)
KeyboardInterrupt
  File "/usr/lib/python3.10/multiprocessing/queues.py", line 102, in get
    with self._rlock:
  File "/usr/lib/python3.10/multiprocessing/synchronize.py", line 95, in __enter__
    return self._semlock.__enter__()

I worked around the issue making sure that jsonnet is not called in a forked process. The pool is used to parallelize http requests (threading in python is unfortunately a deadend), if jsonnet needs to be used on the results, it has to be delegated once the results of the spawned processes have been collected which is acceptable.

Another workaround that worked was to set:

    import multiprocessing as mp
    mp.set_start_method('spawn')

or

    import multiprocessing as mp
    mp.set_start_method('forkserver')

at the start of the program. That avoided the deadlock, but was not acceptable for me, as the spawned processes do not share the same filedescriptors. As I develop a command line tool that outputs logging data, logging from spawned processes was not available anymore.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants