Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Destroying TorService a second time causes crash #57

Closed
grote opened this issue Nov 25, 2021 · 42 comments
Closed

Destroying TorService a second time causes crash #57

grote opened this issue Nov 25, 2021 · 42 comments

Comments

@grote
Copy link
Contributor

grote commented Nov 25, 2021

While how to stop TorService is not documented, one usually stops a bound service simply by unbinding from it.

This works fine the first time. But if I need to start and bind to TorService a second time, I reproducibly get a hard crash in libc when unbinding again:

    --------- beginning of crash
A/libc: fdsan: attempted to close file descriptor 91, expected to be unowned, actually owned by unique_fd 0x7b7962c254
A/libc: Fatal signal 6 (SIGABRT), code -1 (SI_QUEUE) in tid 17309 (tor), pid 17187 (onshare.android)
2021-11-25 15:47:57.180 17543-17543/? I/crash_dump64: obtaining output fd from tombstoned, type: kDebuggerdTombstoneProto
2021-11-25 15:47:57.181 711-711/? I/tombstoned: received crash request for pid 17527
2021-11-25 15:47:57.183 17543-17543/? I/crash_dump64: performing dump of process 17436 (target tid = 17527)
@grote
Copy link
Contributor Author

grote commented Nov 25, 2021

This may or may not be related, but it seems that you don't keep a reference to the FileDescriptor for the control socket and therefore you are never closing it. Not when exceptions occur and not when the service gets destroyed.

@eighthave
Copy link
Member

ah, that sounds like a good catch! I could never reliably reproduce it to troubleshoot it.

@eighthave
Copy link
Member

seems like keeping a reference, then closing it in onDestroy() could work

@eighthave
Copy link
Member

Here is also some related info and debugging:
https://gitlab.torproject.org/tpo/core/tor/-/issues/32729

@grote
Copy link
Contributor Author

grote commented Nov 25, 2021

Background info about fdsan:

fdsan is a file descriptor sanitizer added to Android in API level 29.
...
detects mishandling of file descriptor ownership, which tend to manifest as use-after-close and double-close.

@eighthave
Copy link
Member

I don't think ParcelFileDescriptor is in use at all here. That's for serializing things for Intents.

@eighthave
Copy link
Member

Seems like this might need to be handled in jtorctl, since it's TorControlConnection is what is using those streams.

eighthave added a commit to eighthave/tor-android that referenced this issue Nov 25, 2021
This should get run most of the time, but not always, since Android could
send `kill -9`.  But this might be better handled in jtorctl's
TorControlConnection, since it wraps those streams.

guardianproject#57
@eighthave
Copy link
Member

If that commit works, there should be a build available in the artfacts in this job, once it completes:
https://gitlab.com/eighthave/tor-android/-/pipelines/416708252

eighthave added a commit to eighthave/tor-android that referenced this issue Nov 26, 2021
This should get run most of the time, but not always, since Android could
send `kill -9`.  But this might be better handled in jtorctl's
TorControlConnection, since it wraps those streams.

guardianproject#57
eighthave added a commit to eighthave/tor-android that referenced this issue Nov 26, 2021
This should get run most of the time, but not always, since Android could
send `kill -9`.  But this might be better handled in jtorctl's
TorControlConnection, since it wraps those streams.

guardianproject#57
grote added a commit to grote/tor-android that referenced this issue Nov 26, 2021
E/fdsan: attempted to close file descriptor, expected to be unowned, actually owned
@grote
Copy link
Contributor Author

grote commented Nov 26, 2021

If that commit works, there should be a build available in the artfacts in this job, once it completes:
https://gitlab.com/eighthave/tor-android/-/pipelines/416708252

If you close the control connection streams there, won't this interfere with the ability to use the control connection?

I pushed an extra test here that allows me to reproduce the problem: grote@04d249d#diff-169d30aebc4b72de0419cb79771f8ced07f96a9d6353f17fd2bd86f5d3fe1f8cR1
It does nothing more then binding and unbinding the service twice.

Again, I urge you to read the fdsan docs: https://android.googlesource.com/platform/bionic/+/master/docs/fdsan.md
This has nothing to do with ParcelFileDescriptor, but is what is causing the crash.

@grote
Copy link
Contributor Author

grote commented Nov 26, 2021

@eighthave
Copy link
Member

I won't have time to dig into this for a bit. If you want to try the naive closing, it built here:
https://gitlab.com/eighthave/tor-android/-/pipelines/416958111

@eighthave
Copy link
Member

That test case looks very promising!

@grote
Copy link
Contributor Author

grote commented Nov 26, 2021

Full backtrace follows below.

Looks like a double close:

libc.so (android_fdsan_close_with_tag+752)
libc.so (close+16)
libtor.so (tor_close_socket_simple+28)
libtor.so (tor_close_socket+32)

A:       #00 pc 0000000000053bb0  /apex/com.android.runtime/lib64/bionic/libc.so (fdsan_error(char const*, ...)+556) (BuildId: ba489d4985c0cf173209da67405662f9)
A:       #01 pc 00000000000538bc  /apex/com.android.runtime/lib64/bionic/libc.so (android_fdsan_close_with_tag+752) (BuildId: ba489d4985c0cf173209da67405662f9)
A:       #02 pc 0000000000054028  /apex/com.android.runtime/lib64/bionic/libc.so (close+16) (BuildId: ba489d4985c0cf173209da67405662f9)
A:       #03 pc 000000000017fa60  /data/app/~~8_F6TBRrq0vF8qgXFuDBlQ==/org.onionshare.android-s6-aT-cVZg3nV1p28B_OOw==/lib/arm64/libtor.so (tor_close_socket_simple+28)
A:       #04 pc 000000000017fae8  /data/app/~~8_F6TBRrq0vF8qgXFuDBlQ==/org.onionshare.android-s6-aT-cVZg3nV1p28B_OOw==/lib/arm64/libtor.so (tor_close_socket+32)
A:       #05 pc 00000000001f2728  /data/app/~~8_F6TBRrq0vF8qgXFuDBlQ==/org.onionshare.android-s6-aT-cVZg3nV1p28B_OOw==/lib/arm64/libtor.so
A:       #06 pc 00000000001f9da4  /data/app/~~8_F6TBRrq0vF8qgXFuDBlQ==/org.onionshare.android-s6-aT-cVZg3nV1p28B_OOw==/lib/arm64/libtor.so (connection_free_all+140)
A:       #07 pc 0000000000148dc0  /data/app/~~8_F6TBRrq0vF8qgXFuDBlQ==/org.onionshare.android-s6-aT-cVZg3nV1p28B_OOw==/lib/arm64/libtor.so (tor_free_all+100)
A:       #08 pc 0000000000148d30  /data/app/~~8_F6TBRrq0vF8qgXFuDBlQ==/org.onionshare.android-s6-aT-cVZg3nV1p28B_OOw==/lib/arm64/libtor.so (tor_cleanup+184)
A:       #09 pc 0000000000107968  /data/app/~~8_F6TBRrq0vF8qgXFuDBlQ==/org.onionshare.android-s6-aT-cVZg3nV1p28B_OOw==/lib/arm64/libtor.so (tor_run_main+1192)
A:       #10 pc 0000000000105a40  /data/app/~~8_F6TBRrq0vF8qgXFuDBlQ==/org.onionshare.android-s6-aT-cVZg3nV1p28B_OOw==/lib/arm64/libtor.so (Java_org_torproject_jni_TorService_runMain+36)
A:       #11 pc 0000000000222244  /apex/com.android.art/lib64/libart.so (art_quick_generic_jni_trampoline+148) (BuildId: 556e634687c8250966f28a53a5a17b2b)
A:       #12 pc 0000000000218964  /apex/com.android.art/lib64/libart.so (art_quick_invoke_stub+548) (BuildId: 556e634687c8250966f28a53a5a17b2b)
A:       #13 pc 0000000000284208  /apex/com.android.art/lib64/libart.so (art::ArtMethod::Invoke(art::Thread*, unsigned int*, unsigned int, art::JValue*, char const*)+188) (BuildId: 556e634687c8250966f28a53a5a17b2b)
A:       #14 pc 00000000003e31cc  /apex/com.android.art/lib64/libart.so (art::interpreter::ArtInterpreterToCompiledCodeBridge(art::Thread*, art::ArtMethod*, art::ShadowFrame*, unsigned short, art::JValue*)+400) (BuildId: 556e634687c8250966f28a53a5a17b2b)
A:       #15 pc 00000000003de040  /apex/com.android.art/lib64/libart.so (bool art::interpreter::DoCall<false, false>(art::ArtMethod*, art::Thread*, art::ShadowFrame&, art::Instruction const*, unsigned short, art::JValue*)+804) (BuildId: 556e634687c8250966f28a53a5a17b2b)
A:       #16 pc 0000000000755724  /apex/com.android.art/lib64/libart.so (MterpInvokeDirect+1036) (BuildId: 556e634687c8250966f28a53a5a17b2b)
A:       #17 pc 0000000000203914  /apex/com.android.art/lib64/libart.so (mterp_op_invoke_direct+20) (BuildId: 556e634687c8250966f28a53a5a17b2b)
A:       #18 pc 0000000000275eec  [anon:dalvik-classes11.dex extracted in memory from /data/app/~~8_F6TBRrq0vF8qgXFuDBlQ==/org.onionshare.android-s6-aT-cVZg3nV1p28B_OOw==/base.apk!classes11.dex]
A:       #19 pc 00000000003d5e70  /apex/com.android.art/lib64/libart.so (art::interpreter::Execute(art::Thread*, art::CodeItemDataAccessor const&, art::ShadowFrame&, art::JValue, bool, bool)+304) (BuildId: 556e634687c8250966f28a53a5a17b2b)
A:       #20 pc 000000000073adbc  /apex/com.android.art/lib64/libart.so (artQuickToInterpreterBridge+776) (BuildId: 556e634687c8250966f28a53a5a17b2b)
A:       #21 pc 0000000000222378  /apex/com.android.art/lib64/libart.so (art_quick_to_interpreter_bridge+88) (BuildId: 556e634687c8250966f28a53a5a17b2b)
A:       #22 pc 0000000000211608  /apex/com.android.art/lib64/libart.so (nterp_helper+152) (BuildId: 556e634687c8250966f28a53a5a17b2b)
A:       #23 pc 0000000000275c9e  [anon:dalvik-classes11.dex extracted in memory from /data/app/~~8_F6TBRrq0vF8qgXFuDBlQ==/org.onionshare.android-s6-aT-cVZg3nV1p28B_OOw==/base.apk!classes11.dex]
A:       #24 pc 0000000000218964  /apex/com.android.art/lib64/libart.so (art_quick_invoke_stub+548) (BuildId: 556e634687c8250966f28a53a5a17b2b)
A:       #25 pc 0000000000284208  /apex/com.android.art/lib64/libart.so (art::ArtMethod::Invoke(art::Thread*, unsigned int*, unsigned int, art::JValue*, char const*)+188) (BuildId: 556e634687c8250966f28a53a5a17b2b)
A:       #26 pc 000000000061f4d4  /apex/com.android.art/lib64/libart.so (art::JValue art::InvokeVirtualOrInterfaceWithJValues<art::ArtMethod*>(art::ScopedObjectAccessAlreadyRunnable const&, _jobject*, art::ArtMethod*, jvalue const*)+460) (BuildId: 556e634687c8250966f28a53a5a17b2b)
A:       #27 pc 000000000066e0b8  /apex/com.android.art/lib64/libart.so (art::Thread::CreateCallback(void*)+1184) (BuildId: 556e634687c8250966f28a53a5a17b2b)
A:       #28 pc 00000000000b1910  /apex/com.android.runtime/lib64/bionic/libc.so (__pthread_start(void*)+264) (BuildId: ba489d4985c0cf173209da67405662f9)
A:       #29 pc 00000000000513f0  /apex/com.android.runtime/lib64/bionic/libc.so (__start_thread+64) (BuildId: ba489d4985c0cf173209da67405662f9)
E: Tombstone written to: tombstone_23

@n8fr8
Copy link
Member

n8fr8 commented Nov 30, 2021

Wonder why this isn't happening in Orbot..

@eighthave
Copy link
Member

@grote can you tell what is calling close each time?

@asn-d6 @ahf this seems to crash the same way as https://gitlab.torproject.org/tpo/core/tor/-/issues/32729 and @grote has put together a repro test case. The stack trace looks different though, so I'm not sure it is 100% related.

@grote
Copy link
Contributor Author

grote commented Nov 30, 2021

Wonder why this isn't happening in Orbot..

I wonder as well. Maybe it isn't really stopping the service? I noticed that unbinding isn't enough to stop it when running a foreground service.

can you tell what is calling close each time?

No, I just got

libc.so (android_fdsan_close_with_tag+752)
libc.so (close+16)
libtor.so (tor_close_socket_simple+28)
libtor.so (tor_close_socket+32)

as posted in #57 (comment) . So maybe Tor is closing it and then something in Android is trying as well to close it?

@grote has put together a repro test case.

Has anyone been able to reproduce the issue on an API 29+ device? Either with an app or that instrumentation test? Knowing if others can reproduce it, could be the first step to solving it.

@akwizgran
Copy link
Contributor

Has anyone been able to reproduce the issue on an API 29+ device? Either with an app or that instrumentation test? Knowing if others can reproduce it, could be the first step to solving it.

I can reproduce the crash with @grote's test on the Redmi Note 7 (API 29).

--------- beginning of crash
12-01 13:01:57.342 20633 20772 F libc    : Fatal signal 6 (SIGABRT), code -1 (SI_QUEUE) in tid 20772 (tor), pid 20633 (roject.jni.test)
12-01 13:01:57.354 20773 20773 F DEBUG   : *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** ***
12-01 13:01:57.354 20773 20773 F DEBUG   : Build fingerprint: 'xiaomi/lavender/lavender:10/QKQ1.190910.002/V12.5.1.0.QFGMIXM:user/release-keys'
12-01 13:01:57.354 20773 20773 F DEBUG   : Revision: '0'
12-01 13:01:57.354 20773 20773 F DEBUG   : ABI: 'arm64'
12-01 13:01:57.355 20773 20773 F DEBUG   : Timestamp: 2021-12-01 13:01:57+0000
12-01 13:01:57.355 20773 20773 F DEBUG   : pid: 20633, tid: 20725, name: tor  >>> org.torproject.jni.test <<<
12-01 13:01:57.355 20773 20773 F DEBUG   : uid: 10149
12-01 13:01:57.355 20773 20773 F DEBUG   : signal 35 (<debuggerd signal>), code -1 (SI_QUEUE), fault addr --------
12-01 13:01:57.355 20773 20773 F DEBUG   : Abort message: 'attempted to close file descriptor 62, expected to be unowned, actually owned by unique_fd 0x7bae8b2894'
12-01 13:01:57.355 20773 20773 F DEBUG   :     x0  0000000000000000  x1  00000000000050f5  x2  0000000000000023  x3  0000007b51762bb0
12-01 13:01:57.355 20773 20773 F DEBUG   :     x4  3832623865616237  x5  3832623865616237  x6  3832623865616237  x7  3439383262386561
12-01 13:01:57.355 20773 20773 F DEBUG   :     x8  00000000000000f0  x9  fc4ae377d41e87a7  x10 0000000000000000  x11 0000007c3e6dfcfa
12-01 13:01:57.355 20773 20773 F DEBUG   :     x12 0000000000000062  x13 0000007b5176278f  x14 0000007b5176277a  x15 0000000000000000
12-01 13:01:57.355 20773 20773 F DEBUG   :     x16 0000007c3e7b88c0  x17 0000007c3e794bc0  x18 0000007b4e5f8000  x19 0000000000005099
12-01 13:01:57.356 20773 20773 F DEBUG   :     x20 00000000000050f5  x21 0000007b51764020  x22 0000007c41d0c8f4  x23 0000000000000001
12-01 13:01:57.356 20773 20773 F DEBUG   :     x24 0000007b517629a0  x25 0000007b51762920  x26 0000007b517628e0  x27 0000000000000001
12-01 13:01:57.356 20773 20773 F DEBUG   :     x28 0000007b51763040  x29 0000007b51762c90
12-01 13:01:57.356 20773 20773 F DEBUG   :     sp  0000007b51762860  lr  0000007c3e7497d8  pc  0000007c3e7497f8
12-01 13:01:57.422 20778 20778 I crash_dump64: obtaining output fd from tombstoned, type: kDebuggerdTombstone
12-01 13:01:57.424  1184  1184 I /system/bin/tombstoned: received crash request for pid 20772
12-01 13:01:57.424  1184  1184 I /system/bin/tombstoned: enqueueing crash request for pid 20772
12-01 13:01:57.598 20773 20773 F DEBUG   : 
12-01 13:01:57.598 20773 20773 F DEBUG   : backtrace:
12-01 13:01:57.598 20773 20773 F DEBUG   :       #00 pc 00000000000867f8  /apex/com.android.runtime/lib64/bionic/libc.so (fdsan_error(char const*, ...)+388) (BuildId: b9254a9320cb794714fda3ae99886b88)
12-01 13:01:57.598 20773 20773 F DEBUG   :       #01 pc 0000000000086610  /apex/com.android.runtime/lib64/bionic/libc.so (android_fdsan_close_with_tag+728) (BuildId: b9254a9320cb794714fda3ae99886b88)
12-01 13:01:57.598 20773 20773 F DEBUG   :       #02 pc 0000000000086d7c  /apex/com.android.runtime/lib64/bionic/libc.so (close+16) (BuildId: b9254a9320cb794714fda3ae99886b88)
12-01 13:01:57.598 20773 20773 F DEBUG   :       #03 pc 000000000017fa60  /data/app/org.torproject.jni.test-pTrnXTqOi3Mk7DH8UnOaQA==/lib/arm64/libtor.so (tor_close_socket_simple+28)
12-01 13:01:57.598 20773 20773 F DEBUG   :       #04 pc 000000000017fae8  /data/app/org.torproject.jni.test-pTrnXTqOi3Mk7DH8UnOaQA==/lib/arm64/libtor.so (tor_close_socket+32)
12-01 13:01:57.598 20773 20773 F DEBUG   :       #05 pc 00000000001f2728  /data/app/org.torproject.jni.test-pTrnXTqOi3Mk7DH8UnOaQA==/lib/arm64/libtor.so
12-01 13:01:57.598 20773 20773 F DEBUG   :       #06 pc 00000000001f9da4  /data/app/org.torproject.jni.test-pTrnXTqOi3Mk7DH8UnOaQA==/lib/arm64/libtor.so (connection_free_all+140)
12-01 13:01:57.599 20773 20773 F DEBUG   :       #07 pc 0000000000148dc0  /data/app/org.torproject.jni.test-pTrnXTqOi3Mk7DH8UnOaQA==/lib/arm64/libtor.so (tor_free_all+100)
12-01 13:01:57.599 20773 20773 F DEBUG   :       #08 pc 0000000000148d30  /data/app/org.torproject.jni.test-pTrnXTqOi3Mk7DH8UnOaQA==/lib/arm64/libtor.so (tor_cleanup+184)
12-01 13:01:57.599 20773 20773 F DEBUG   :       #09 pc 0000000000107968  /data/app/org.torproject.jni.test-pTrnXTqOi3Mk7DH8UnOaQA==/lib/arm64/libtor.so (tor_run_main+1192)
12-01 13:01:57.599 20773 20773 F DEBUG   :       #10 pc 0000000000105a40  /data/app/org.torproject.jni.test-pTrnXTqOi3Mk7DH8UnOaQA==/lib/arm64/libtor.so (Java_org_torproject_jni_TorService_runMain+36)
12-01 13:01:57.599 20773 20773 F DEBUG   :       #11 pc 0000000000140350  /apex/com.android.runtime/lib64/libart.so (art_quick_generic_jni_trampoline+144) (BuildId: 64f7ca74f522ef2c933e507c8abf7ef4)
12-01 13:01:57.599 20773 20773 F DEBUG   :       #12 pc 0000000000137334  /apex/com.android.runtime/lib64/libart.so (art_quick_invoke_stub+548) (BuildId: 64f7ca74f522ef2c933e507c8abf7ef4)
12-01 13:01:57.599 20773 20773 F DEBUG   :       #13 pc 0000000000145fec  /apex/com.android.runtime/lib64/libart.so (art::ArtMethod::Invoke(art::Thread*, unsigned int*, unsigned int, art::JValue*, char const*)+244) (BuildId: 64f7ca74f522
ef2c933e507c8abf7ef4)
12-01 13:01:57.599 20773 20773 F DEBUG   :       #14 pc 00000000002e3800  /apex/com.android.runtime/lib64/libart.so (art::interpreter::ArtInterpreterToCompiledCodeBridge(art::Thread*, art::ArtMethod*, art::ShadowFrame*, unsigned short, a
rt::JValue*)+384) (BuildId: 64f7ca74f522ef2c933e507c8abf7ef4)
12-01 13:01:57.599 20773 20773 F DEBUG   :       #15 pc 00000000002dea60  /apex/com.android.runtime/lib64/libart.so (bool art::interpreter::DoCall<false, false>(art::ArtMethod*, art::Thread*, art::ShadowFrame&, art::Instruction const*, u
nsigned short, art::JValue*)+892) (BuildId: 64f7ca74f522ef2c933e507c8abf7ef4)
12-01 13:01:57.599 20773 20773 F DEBUG   :       #16 pc 00000000005a3ae4  /apex/com.android.runtime/lib64/libart.so (MterpInvokeDirect+424) (BuildId: 64f7ca74f522ef2c933e507c8abf7ef4)
12-01 13:01:57.599 20773 20773 F DEBUG   :       #17 pc 0000000000131914  /apex/com.android.runtime/lib64/libart.so (mterp_op_invoke_direct+20) (BuildId: 64f7ca74f522ef2c933e507c8abf7ef4)
12-01 13:01:57.599 20773 20773 F DEBUG   :       #18 pc 00000000000aab18  [anon:dalvik-classes.dex extracted in memory from /data/app/org.torproject.jni.test-pTrnXTqOi3Mk7DH8UnOaQA==/base.apk] (org.torproject.jni.TorService.access$1000)
12-01 13:01:57.599 20773 20773 F DEBUG   :       #19 pc 00000000005a452c  /apex/com.android.runtime/lib64/libart.so (MterpInvokeStatic+1040) (BuildId: 64f7ca74f522ef2c933e507c8abf7ef4)
12-01 13:01:57.599 20773 20773 F DEBUG   :       #20 pc 0000000000131994  /apex/com.android.runtime/lib64/libart.so (mterp_op_invoke_static+20) (BuildId: 64f7ca74f522ef2c933e507c8abf7ef4)
12-01 13:01:57.599 20773 20773 F DEBUG   :       #21 pc 00000000000a9c34  [anon:dalvik-classes.dex extracted in memory from /data/app/org.torproject.jni.test-pTrnXTqOi3Mk7DH8UnOaQA==/base.apk] (org.torproject.jni.TorService$3.run+724)
12-01 13:01:57.599 20773 20773 F DEBUG   :       #22 pc 00000000002b4b14  /apex/com.android.runtime/lib64/libart.so (_ZN3art11interpreterL7ExecuteEPNS_6ThreadERKNS_20CodeItemDataAccessorERNS_11ShadowFrameENS_6JValueEbb.llvm.10247671929900336896+240) (BuildId: 64f7ca74f522ef2c933e507c8abf7ef4)
12-01 13:01:57.599 20773 20773 F DEBUG   :       #23 pc 0000000000592d24  /apex/com.android.runtime/lib64/libart.so (artQuickToInterpreterBridge+1032) (BuildId: 64f7ca74f522ef2c933e507c8abf7ef4)
12-01 13:01:57.599 20773 20773 F DEBUG   :       #24 pc 0000000000140468  /apex/com.android.runtime/lib64/libart.so (art_quick_to_interpreter_bridge+88) (BuildId: 64f7ca74f522ef2c933e507c8abf7ef4)
12-01 13:01:57.599 20773 20773 F DEBUG   :       #25 pc 0000000000137334  /apex/com.android.runtime/lib64/libart.so (art_quick_invoke_stub+548) (BuildId: 64f7ca74f522ef2c933e507c8abf7ef4)
12-01 13:01:57.599 20773 20773 F DEBUG   :       #26 pc 0000000000145fec  /apex/com.android.runtime/lib64/libart.so (art::ArtMethod::Invoke(art::Thread*, unsigned int*, unsigned int, art::JValue*, char const*)+244) (BuildId: 64f7ca74f522ef2c933e507c8abf7ef4)
12-01 13:01:57.600 20773 20773 F DEBUG   :       #27 pc 00000000004b171c  /apex/com.android.runtime/lib64/libart.so (art::(anonymous namespace)::InvokeWithArgArray(art::ScopedObjectAccessAlreadyRunnable const&, art::ArtMethod*, art::(anonymous namespace)::ArgArray*, art::JValue*, char const*)+104) (BuildId: 64f7ca74f522ef2c933e507c8abf7ef4)
12-01 13:01:57.600 20773 20773 F DEBUG   :       #28 pc 00000000004b2830  /apex/com.android.runtime/lib64/libart.so (art::InvokeVirtualOrInterfaceWithJValues(art::ScopedObjectAccessAlreadyRunnable const&, _jobject*, _jmethodID*, jvalue const*)+416) (BuildId: 64f7ca74f522ef2c933e507c8abf7ef4)
12-01 13:01:57.600 20773 20773 F DEBUG   :       #29 pc 00000000004f31f8  /apex/com.android.runtime/lib64/libart.so (art::Thread::CreateCallback(void*)+1176) (BuildId: 64f7ca74f522ef2c933e507c8abf7ef4)
12-01 13:01:57.600 20773 20773 F DEBUG   :       #30 pc 00000000000e6b50  /apex/com.android.runtime/lib64/bionic/libc.so (__pthread_start(void*)+36) (BuildId: b9254a9320cb794714fda3ae99886b88)
12-01 13:01:57.600 20773 20773 F DEBUG   :       #31 pc 0000000000084c6c  /apex/com.android.runtime/lib64/bionic/libc.so (__start_thread+64) (BuildId: b9254a9320cb794714fda3ae99886b88)

@akwizgran
Copy link
Contributor

I haven't finished debugging this but I'll add some notes about what I've found so far.

So far I haven't been able to reproduce the crash when running only @grote's test case, but I can usually get a crash when running that test case along with the existing tests in TorServiceTest.java. (I had to fix a couple of tests - one was using a v2 onion URL, the other was using a plaintext HTTP URL which the API 29 device refused to access.)

Only a few of the crashes were related to fdsan errors. Most were caused by SIGABRT with one of the following two backtraces, both of which I saw multiple times:

12-01 13:01:59.383 20778 20778 F DEBUG   :       #00 pc 00000000000831f0  /apex/com.android.runtime/lib64/bionic/libc.so (abort+160) (BuildId: b9254a9320cb794714fda3ae99886b88)
12-01 13:01:59.383 20778 20778 F DEBUG   :       #01 pc 0000000000170864  /data/app/org.torproject.jni.test-pTrnXTqOi3Mk7DH8UnOaQA==/lib/arm64/libtor.so (tor_raw_abort_+12)
12-01 13:01:59.383 20778 20778 F DEBUG   :       #02 pc 0000000000179de8  /data/app/org.torproject.jni.test-pTrnXTqOi3Mk7DH8UnOaQA==/lib/arm64/libtor.so (tor_abort_+12)
12-01 13:01:59.383 20778 20778 F DEBUG   :       #03 pc 000000000023c9ec  /data/app/org.torproject.jni.test-pTrnXTqOi3Mk7DH8UnOaQA==/lib/arm64/libtor.so (hs_circuitmap_init+132)
12-01 13:01:59.383 20778 20778 F DEBUG   :       #04 pc 0000000000242fc8  /data/app/org.torproject.jni.test-pTrnXTqOi3Mk7DH8UnOaQA==/lib/arm64/libtor.so (hs_init+28)
12-01 13:01:59.383 20778 20778 F DEBUG   :       #05 pc 0000000000106d48  /data/app/org.torproject.jni.test-pTrnXTqOi3Mk7DH8UnOaQA==/lib/arm64/libtor.so (tor_init+128)
12-01 13:01:59.383 20778 20778 F DEBUG   :       #06 pc 0000000000107564  /data/app/org.torproject.jni.test-pTrnXTqOi3Mk7DH8UnOaQA==/lib/arm64/libtor.so (tor_run_main+164)
12-01 13:01:59.383 20778 20778 F DEBUG   :       #07 pc 0000000000105a40  /data/app/org.torproject.jni.test-pTrnXTqOi3Mk7DH8UnOaQA==/lib/arm64/libtor.so (Java_org_torproject_jni_TorService_runMain+36)
12-01 14:11:42.704 23799 23799 F DEBUG   :       #00 pc 00000000000831f0  /apex/com.android.runtime/lib64/bionic/libc.so (abort+160) (BuildId: b9254a9320cb794714fda3ae99886b88)
12-01 14:11:42.704 23799 23799 F DEBUG   :       #01 pc 0000000000170864  /data/app/org.torproject.jni.test-d3CQfay99oHSZlslpR6oVg==/lib/arm64/libtor.so (tor_raw_abort_+12)
12-01 14:11:42.704 23799 23799 F DEBUG   :       #02 pc 0000000000179de8  /data/app/org.torproject.jni.test-d3CQfay99oHSZlslpR6oVg==/lib/arm64/libtor.so (tor_abort_+12)
12-01 14:11:42.704 23799 23799 F DEBUG   :       #03 pc 0000000000107448  /data/app/org.torproject.jni.test-d3CQfay99oHSZlslpR6oVg==/lib/arm64/libtor.so (pubsub_install+156)
12-01 14:11:42.704 23799 23799 F DEBUG   :       #04 pc 0000000000107558  /data/app/org.torproject.jni.test-d3CQfay99oHSZlslpR6oVg==/lib/arm64/libtor.so (tor_run_main+152)
12-01 14:11:42.704 23799 23799 F DEBUG   :       #05 pc 0000000000105a40  /data/app/org.torproject.jni.test-d3CQfay99oHSZlslpR6oVg==/lib/arm64/libtor.so (Java_org_torproject_jni_TorService_runMain+36)

I started to wonder whether it was possible for one test's call to runMain() to overlap with another test's call to the same method, which might result in two threads accessing the same static state inside the Tor library. The fdsan errors would be a specific case of this involving a file descriptor, while the other two backtraces would involve some other static state.

I added some debug logging and found that when a crash occurred, the previous test case's call to runMain() often returned after the crash.

However, the problem wasn't exactly caused by overlapping calls to runMain(), because the crashing test case hadn't called runMain() at the time of the crash (or, possibly, it had done so but then it had crashed before the log message could be written to the log).

I added a lock to prevent concurrent calls to runMain(), but this didn't help. I made the lock static so that different TorService instances would share the same lock, and this did seem to make a difference. Since making the lock static I've only managed to capture one crash, which was also the only run where the log indicated that a TorService instance was waiting for the lock - indicating that different test cases were indeed potentially calling into the Tor library concurrently from different threads.

Since the crash still happens, I think the problem is not exactly concurrent calls to runMain(), but rather concurrent calls to a set of native methods that includes runMain(). I'm going to increase the amount of code guarded by the lock to see if the crashes can be eliminated, and if so, which concurrent native calls are involved.

@akwizgran
Copy link
Contributor

I think the crashes are probably caused by calling mainConfigurationFree() from onDestroy() while runMain() is still running in another thread (or may not even have been called yet... the thread that calls it may be doing the other native calls to set up the command line and control socket).

Moving the native setup calls and runMain() into a try block, with mainConfigurationFree() in the finally block, and also using a static lock to prevent concurrent entry into that block, seems to prevent the various crashes. However, fdsan now reliably complains about mainConfigurationFree() closing Tor's owning_controller_socket. This socket is the "other end" of the socket that's returned to TorService, and Tor closes its end when the main configuration is released. This produces the following complaint from fdsan:

12-01 16:41:35.061 20764 20873 I TorService: torControlFd is 62 before freeing main configuration
12-01 16:41:35.061 20764 20873 I TorService: Freeing main configuration
12-01 16:41:35.063 20764 20873 E fdsan   : attempted to close file descriptor 63, expected to be unowned, actually owned by unique_fd 0x7c40818d54

(In this case the FD for TorService's end of the control connection was 62, while the FD for Tor's end was 63.)

This happens before the static lock is released, so I don't think the issue is that another TorService instance is messing with the contents of the main configuration. And yet I can only reproduce this when running the whole test suite... so some kind of interaction between the different test cases does seem to be involved.

Perhaps when TorService closes the control socket, the FD on Tor's end is automatically closed and released as well. But Tor still has a record of the FD number in its main configuration struct. Later, some other caller asks for an FD and the number previously used by Tor gets reallocated. (Perhaps this part only happens if other test cases are running, which is why I can't reproduce the problem with a single test case?). Then later, when mainConfigurationFree() is called, Tor tries to close the FD that was already closed earlier, and fdsan notices the problem?

@eighthave
Copy link
Member

@akwizgran sounds like you're thinking that the crash isn't so much about the Control Socket but rather knowing when to call Tor's free/shutdown methods. That was my impression also, when we were going through the troubleshooting as discussed in https://gitlab.torproject.org/tpo/core/tor/-/issues/32729.

I also remember having the issue where these kinds of crashes happened only when running the whole test suite, where there were repeated start/stop cycles. The hard part is knowing exactly what is not cleaned up between runs, the docs never seem to fully outline it. Instead, there are statements like (keyword "most"):

Each test runs in its own Instrumentation instance. Therefore, if your tests share app state, most of that shared state is removed from your device's CPU or memory after each test.
[1]

There is also the poorly documented runner option:

   // The following argument makes the Android Test Orchestrator run its
   // "pm clear" command after each test invocation. This command ensures
   // that the app's state is completely cleared between tests.
   testInstrumentationRunnerArguments clearPackageData: 'true'

Also, FYI, the core JNI code used here came from @sysrqb, and I wrapped that with the UNIX domain socket. @sysrqb do you have any thoughts on a proper shutdown process when running tor in a thread?

@akwizgran
Copy link
Contributor

@akwizgran sounds like you're thinking that the crash isn't so much about the Control Socket but rather knowing when to call Tor's free/shutdown methods.

Right. Specifically, I think the crash is caused by calling mainConfigurationFree() when the thread that makes the other native calls (most importantly runMain()) is still running. The documentation for tor_main_configuration_free() in tor_api.h says:

Once you have passed a tor_main_configuration_t to tor_run_main(), you
must not free it until tor_run_main() has finished.

Moving mainConfigurationFree() to the same thread as the other native calls, and using a static lock to ensure that only one TorService instance per process is making native calls, prevents the crash. I'll clean these changes up and push them to a branch.

There's also some good news about the fdsan error. I can now reproduce this with only a single test case, which only binds and unbinds the service once. The error doesn't happen in every run, but in most runs fdsan warns about the file descriptor that's closed inside mainConfigurationFree(). If the test case finishes immediately then fdsan logs the initial warning but doesn't have time to log a backtrace or write a tombstone. Adding a one-second sleep to the end of the test case allows enough time for that to happen.

I think it was probably possible to reproduce the fdsan error before, but I was expecting to see a crash, so I ignored all the runs where the tests passed without crashing. That was a mistake. In the default configuration, fdsan doesn't terminate the process. It records a tombstone for the first fdsan violation, then disables further fdsan checks. So when I saw fdsan tombstones after a crash, they weren't the cause of the crash. Once I had prevented the crashes this became clear.

As far as I can tell, the fdsan error is unrelated to the crashes, and we might want to open a separate ticket for it. It's possible that it's a bug in Tor, where owning_controller_socket ought to be cleared if the control socket is closed before tor_main_configuration_free() is called, in order to avoid a double close.

I believe the reason we don't see the fdsan warning on every run is that fdsan only detects the double close if the file descriptor number is reallocated between the time when Tor's end of the control socket is closed and the time when tor_main_configuration_free() is called.

@grote
Copy link
Contributor Author

grote commented Dec 2, 2021

Great work! 🎉

In the default configuration, fdsan doesn't terminate the process. It records a tombstone for the first fdsan violation, then disables further fdsan checks.

Could it be that my API 31 device is configured differently? When using your branch it is still crashing when running the tests. With Android Studio, there's occasional crashes, but when running them with ./gradlew connectedAndroidTest, it seems to crash every time pointing at fdsan as the culprit.

The tombstone it wrote has quite some extensive debug information, including a list of file descriptors that are currently open. https://pastebin.com/kmmT83yD

@akwizgran
Copy link
Contributor

Could it be that my API 31 device is configured differently?

Ah, good thought!

https://developer.android.com/about/versions/11/behavior-changes-all#fdsan

"The default mode for fdsan is changing in Android 11. fdsan now aborts upon detecting an error; the previous behavior was to log a warning and continue."

@akwizgran
Copy link
Contributor

The tombstone it wrote has quite some extensive debug information, including a list of file descriptors that are currently open.

Nice!

12-02 16:40:53.105 12452 12491 E fdsan   : attempted to close file descriptor 70, expected to be unowned, actually owned by unique_fd 0x7c197bee44
    fd 69: socket:[1960483] (unowned)
    fd 70: anon_inode:[eventfd] (unowned)

Here's my theory: FD 69 and 70 were originally allocated to the control socket, with 69 being TorService's end and 70 being Tor's end. Tor closed its end of the socket during shutdown and FD 70 was reused by anon_inode:[eventfd] (whatever that means... anyway it's not socket like FD 69). When tor_main_configuration_free() tried to close its owning_controller_socket, fdsan noticed the double close.

@akwizgran
Copy link
Contributor

To add weight to this theory, you could log the value of torControlFd in TorService before calling mainConfigurationFree(). The FD for TorService's end of the socket should be one less than the FD that fdsan complains about, and hopefully the tombstone should still show its type as socket, whereas the FD that fdsan complains about will most likely be reused as some other type.

bitmold pushed a commit to bitmold/tor that referenced this issue Apr 7, 2024
This is here as an experiment in case it is useful in Java space.  It is
looking like there are better ways to handle the shutdown, so my guess is
that this will not be useful.

* guardianproject/tor-android#59
* guardianproject/tor-android#61 (comment)
* guardianproject/tor-android#57
bitmold pushed a commit to bitmold/tor that referenced this issue Apr 7, 2024
bitmold pushed a commit to bitmold/tor that referenced this issue Apr 7, 2024
This is here as an experiment in case it is useful in Java space.  It is
looking like there are better ways to handle the shutdown, so my guess is
that this will not be useful.

* guardianproject/tor-android#59
* guardianproject/tor-android#61 (comment)
* guardianproject/tor-android#57
syphyr pushed a commit to syphyr/tor that referenced this issue Apr 9, 2024
syphyr pushed a commit to syphyr/tor that referenced this issue Apr 9, 2024
This is here as an experiment in case it is useful in Java space.  It is
looking like there are better ways to handle the shutdown, so my guess is
that this will not be useful.

* guardianproject/tor-android#59
* guardianproject/tor-android#61 (comment)
* guardianproject/tor-android#57
syphyr pushed a commit to syphyr/tor that referenced this issue Apr 10, 2024
syphyr pushed a commit to syphyr/tor that referenced this issue Apr 10, 2024
This is here as an experiment in case it is useful in Java space.  It is
looking like there are better ways to handle the shutdown, so my guess is
that this will not be useful.

* guardianproject/tor-android#59
* guardianproject/tor-android#61 (comment)
* guardianproject/tor-android#57
syphyr pushed a commit to syphyr/tor that referenced this issue Apr 13, 2024
syphyr pushed a commit to syphyr/tor that referenced this issue Apr 13, 2024
This is here as an experiment in case it is useful in Java space.  It is
looking like there are better ways to handle the shutdown, so my guess is
that this will not be useful.

* guardianproject/tor-android#59
* guardianproject/tor-android#61 (comment)
* guardianproject/tor-android#57
syphyr pushed a commit to syphyr/tor that referenced this issue Apr 16, 2024
syphyr pushed a commit to syphyr/tor that referenced this issue Apr 16, 2024
This is here as an experiment in case it is useful in Java space.  It is
looking like there are better ways to handle the shutdown, so my guess is
that this will not be useful.

* guardianproject/tor-android#59
* guardianproject/tor-android#61 (comment)
* guardianproject/tor-android#57
syphyr pushed a commit to syphyr/tor that referenced this issue Apr 29, 2024
syphyr pushed a commit to syphyr/tor that referenced this issue Apr 29, 2024
This is here as an experiment in case it is useful in Java space.  It is
looking like there are better ways to handle the shutdown, so my guess is
that this will not be useful.

* guardianproject/tor-android#59
* guardianproject/tor-android#61 (comment)
* guardianproject/tor-android#57
syphyr pushed a commit to syphyr/tor that referenced this issue May 2, 2024
syphyr pushed a commit to syphyr/tor that referenced this issue May 2, 2024
This is here as an experiment in case it is useful in Java space.  It is
looking like there are better ways to handle the shutdown, so my guess is
that this will not be useful.

* guardianproject/tor-android#59
* guardianproject/tor-android#61 (comment)
* guardianproject/tor-android#57
bitmold pushed a commit to bitmold/tor that referenced this issue May 3, 2024
bitmold pushed a commit to bitmold/tor that referenced this issue May 3, 2024
This is here as an experiment in case it is useful in Java space.  It is
looking like there are better ways to handle the shutdown, so my guess is
that this will not be useful.

* guardianproject/tor-android#59
* guardianproject/tor-android#61 (comment)
* guardianproject/tor-android#57
uniqx pushed a commit to uniqx/tor that referenced this issue May 7, 2024
uniqx pushed a commit to uniqx/tor that referenced this issue May 7, 2024
This is here as an experiment in case it is useful in Java space.  It is
looking like there are better ways to handle the shutdown, so my guess is
that this will not be useful.

* guardianproject/tor-android#59
* guardianproject/tor-android#61 (comment)
* guardianproject/tor-android#57
syphyr pushed a commit to syphyr/tor that referenced this issue Jun 1, 2024
syphyr pushed a commit to syphyr/tor that referenced this issue Jun 1, 2024
This is here as an experiment in case it is useful in Java space.  It is
looking like there are better ways to handle the shutdown, so my guess is
that this will not be useful.

* guardianproject/tor-android#59
* guardianproject/tor-android#61 (comment)
* guardianproject/tor-android#57
syphyr pushed a commit to syphyr/tor that referenced this issue Jun 6, 2024
syphyr pushed a commit to syphyr/tor that referenced this issue Jun 6, 2024
This is here as an experiment in case it is useful in Java space.  It is
looking like there are better ways to handle the shutdown, so my guess is
that this will not be useful.

* guardianproject/tor-android#59
* guardianproject/tor-android#61 (comment)
* guardianproject/tor-android#57
syphyr pushed a commit to syphyr/tor that referenced this issue Jun 19, 2024
syphyr pushed a commit to syphyr/tor that referenced this issue Jun 19, 2024
This is here as an experiment in case it is useful in Java space.  It is
looking like there are better ways to handle the shutdown, so my guess is
that this will not be useful.

* guardianproject/tor-android#59
* guardianproject/tor-android#61 (comment)
* guardianproject/tor-android#57
uniqx pushed a commit to uniqx/tor that referenced this issue Jul 15, 2024
uniqx pushed a commit to uniqx/tor that referenced this issue Jul 15, 2024
This is here as an experiment in case it is useful in Java space.  It is
looking like there are better ways to handle the shutdown, so my guess is
that this will not be useful.

* guardianproject/tor-android#59
* guardianproject/tor-android#61 (comment)
* guardianproject/tor-android#57
syphyr pushed a commit to syphyr/tor that referenced this issue Aug 6, 2024
syphyr pushed a commit to syphyr/tor that referenced this issue Aug 6, 2024
This is here as an experiment in case it is useful in Java space.  It is
looking like there are better ways to handle the shutdown, so my guess is
that this will not be useful.

* guardianproject/tor-android#59
* guardianproject/tor-android#61 (comment)
* guardianproject/tor-android#57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants