For arm 32bit android build, the "__sync_synchronize" symbol is UNDEFINED in liblkl-hijack.so #348

Closed
mxi1 opened this issue May 13, 2017 · 21 comments

Comments

@mxi1

mxi1 commented May 13, 2017

Have you ever tried the 32-bit ARM build with arm-linux-androideabi-gcc (4.9.x, 5.x, 6.x, whatever)? I am trying to make hijacking work in the Android shell.

After setting the CROSS_PATH, CROSS_COMPILE and SYS_ROOT environment variables and removing some endian-related macro redefinitions, the build goes through and liblkl-hijack.so is generated. I then pushed the file to a rooted Android device and set LD_PRELOAD=/blah/blah/liblkl-hijack.so to run commands such as ip addr show.

This is where the problems start: several symbols cannot be located. There is actually more than one UNDEFINED symbol, including dpdk- and vde-related ones. I removed all the dpdk- and vde-related code blocks because those libraries don't exist on my rooted Android device. After that, the confusing __sync_synchronize shows up.

From a Google search I learned that this symbol is internal to gcc and should not appear as undefined in the symbol table of the dynamic library. I tried removing the -nodefaultlibs option; the symbol is then no longer UNDEFINED, but a segmentation fault occurs when the hijack library is in use.

Has anyone run into a similar issue on any 32-bit ARM platform? Any help would be appreciated. Thank you!

@thehajime
Member

I also ran into an atomic ops symbol issue, which I tried to solve in #273. It is not upstreamed yet, but it would be great if you could try the patch and see whether it works.

@mxi1
Author

mxi1 commented May 15, 2017

@thehajime Thanks for your help.
I added the "-march=armv8-a" option to CFLAGS and LDFLAGS, and the undefined "__sync_synchronize" is gone.
However, there are still 4 undefined symbols in the build, all of them related to atomic ops.

They are:

00000000         *UND*	00000000              __sync_fetch_and_or_4
00000000         *UND*	00000000              __sync_fetch_and_and_4
00000000         *UND*	00000000              __sync_fetch_and_add_4
00000000         *UND*	00000000              __sync_fetch_and_sub_4

In pull request #273, I can see that some similar ops are defined for the ARMEL arch. However, none of them has the "_4" suffix.

What can we do about these symbols? Could you give some advice?
Thank you!
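
A note on the names above: the `_4` suffix is the operand size in bytes. When GCC cannot expand a `__sync_fetch_and_*` builtin inline for the selected target, it emits a call to an external function with the same name plus that size suffix, and such helpers normally come from the compiler's runtime library. The sketch below shows what a lock-based stand-in for the four missing operations could look like. All `sketch_*` names are hypothetical and are not part of LKL, #273 or libgcc; a real fix would more likely pick a suitable `-march` or link the runtime library.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

/* Hypothetical names for illustration; not part of LKL, #273 or libgcc. */
static pthread_mutex_t sketch_atomic_lock = PTHREAD_MUTEX_INITIALIZER;

/* Define a lock-based fetch-and-<op> helper for 4-byte operands. */
#define SKETCH_FETCH_OP(fn, op)                                   \
	static uint32_t fn(volatile uint32_t *p, uint32_t val)    \
	{                                                         \
		uint32_t old;                                     \
		pthread_mutex_lock(&sketch_atomic_lock);          \
		old = *p;                                         \
		*p = old op val;                                  \
		pthread_mutex_unlock(&sketch_atomic_lock);        \
		return old;                                       \
	}

SKETCH_FETCH_OP(sketch_fetch_and_add_4, +)
SKETCH_FETCH_OP(sketch_fetch_and_sub_4, -)
SKETCH_FETCH_OP(sketch_fetch_and_or_4, |)
SKETCH_FETCH_OP(sketch_fetch_and_and_4, &)

/* Each helper returns the previous value, like the __sync builtins. */
static void sketch_atomic_selfcheck(void)
{
	volatile uint32_t v = 0xf0;

	assert(sketch_fetch_and_add_4(&v, 0x0f) == 0xf0 && v == 0xff);
	assert(sketch_fetch_and_sub_4(&v, 0x0f) == 0xff && v == 0xf0);
	assert(sketch_fetch_and_or_4(&v, 0x01) == 0xf0 && v == 0xf1);
	assert(sketch_fetch_and_and_4(&v, 0x0f) == 0xf1 && v == 0x01);
}
```

A single global mutex keeps the sketch short, but it serializes every operation and is only atomic with respect to other callers of these helpers, so it is a debugging aid rather than a replacement for real hardware atomics.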

@mxi1
Author

mxi1 commented May 15, 2017

@thehajime Here is one solution mentioned in an old blog post (about 5 years old): How to solve __sync_add_and_fetch_4. However, it requires rebuilding the compiler, which is too complicated for me.

@thehajime
Member

@mxi1 The basic idea of #273 is to provide a single interface to different atomic ops implementations. Currently only two cases are covered (ARMEL and everything else, which I assumed to be x86_64), but you can add more ifdefs where appropriate.
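
The "single interface, several backends" idea can be sketched as a small set of wrapper macros that pick an implementation at compile time. This is only an illustration under assumptions: the `sketch_*` names are hypothetical and the actual interface in #273 may look different. `__GCC_HAVE_SYNC_COMPARE_AND_SWAP_4` is a macro GCC predefines when the target supports 4-byte atomic compare-and-swap natively.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical wrapper names; the interface proposed in #273 may look
 * different. One backend is chosen at compile time.
 */
#if defined(__GCC_HAVE_SYNC_COMPARE_AND_SWAP_4)
/* The target can inline the legacy __sync builtins for 4-byte operands. */
#define sketch_fetch_add32(p, v) __sync_fetch_and_add((p), (v))
#define sketch_fetch_sub32(p, v) __sync_fetch_and_sub((p), (v))
#define sketch_barrier()         __sync_synchronize()
#else
/* Otherwise use the __atomic builtins, which may fall back to libatomic. */
#define sketch_fetch_add32(p, v) __atomic_fetch_add((p), (v), __ATOMIC_SEQ_CST)
#define sketch_fetch_sub32(p, v) __atomic_fetch_sub((p), (v), __ATOMIC_SEQ_CST)
#define sketch_barrier()         __atomic_thread_fence(__ATOMIC_SEQ_CST)
#endif

/* Both backends return the value held before the operation. */
static void sketch_wrapper_selfcheck(void)
{
	uint32_t counter = 10;

	assert(sketch_fetch_add32(&counter, 5) == 10);
	sketch_barrier();
	assert(sketch_fetch_sub32(&counter, 3) == 15);
	assert(counter == 12);
}
```

Callers only ever use the `sketch_*` names, so supporting another architecture means adding one more `#elif` branch rather than touching every call site.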

@mxi1
Author

mxi1 commented May 17, 2017

@thehajime After adding "-march=armv8-a" to both arch/lkl/Makefile and tools/lkl/Makefile, these __sync_xx-related issues are gone.
However, when I try to hijack with LD_PRELOAD=, a segmentation fault occurs. The only message comes from dmesg:

[115760.588470s][2017:05:16 20:36:26][pid:938,cpu0,iperf3]iperf3[938]: unhandled level 2 translation fault (11) at 0xf7b4ce28, esr 0x92000006
[115760.588500s][2017:05:16 20:36:26][pid:938,cpu0,iperf3]pgd = ffffffc08b0f6000
[115760.588531s][2017:05:16 20:36:26][pid:938,cpu0,iperf3][f7b4ce28] *pgd=000000008c95f003, *pmd=0000000000000000
[115760.588623s][2017:05:16 20:36:26][pid:938,cpu0,iperf3]
[115760.588653s][2017:05:16 20:36:26][pid:938,cpu0,iperf3]CPU: 0 PID: 938 Comm: iperf3 Tainted: G        W    3.10.90-gda247c4 #1
[115760.588653s][2017:05:16 20:36:26][pid:938,cpu0,iperf3]task: ffffffc0944f1700 ti: ffffffc0d57f8000 task.ti: ffffffc0d57f8000

I can trace the crash to lkl_start_kernel(...), but cannot gather any further information. As for the unhandled level 2 translation fault (11), I could only find this mail thread and this.

BTW, for comparison, the lkl hijacking for aarch64 works fine for me.

@thehajime
Member

> @thehajime By adding "-march=armv8-a" to both arch/lkl/Makefile and tools/lkl/Makefile, these __sync_xx -related issues has gone.
> However, when I am trying to hijack with LD_PRELOAD=, the segmentation fault shows up. The only message comes from dmesg, like:
>
> [115760.588470s][2017:05:16 20:36:26][pid:938,cpu0,iperf3]iperf3[938]: unhandled level 2 translation fault (11) at 0xf7b4ce28, esr 0x92000006
> [115760.588500s][2017:05:16 20:36:26][pid:938,cpu0,iperf3]pgd = ffffffc08b0f6000
> [115760.588531s][2017:05:16 20:36:26][pid:938,cpu0,iperf3][f7b4ce28] *pgd=000000008c95f003, *pmd=0000000000000000
> [115760.588623s][2017:05:16 20:36:26][pid:938,cpu0,iperf3]
> [115760.588653s][2017:05:16 20:36:26][pid:938,cpu0,iperf3]CPU: 0 PID: 938 Comm: iperf3 Tainted: G W 3.10.90-gda247c4 #1
> [115760.588653s][2017:05:16 20:36:26][pid:938,cpu0,iperf3]task: ffffffc0944f1700 ti: ffffffc0d57f8000 task.ti: ffffffc0d57f8000
>
> I can track to lkl_start_kernel(...), but cannot gather more information further. For the Unhandled level 2 translation fault (11), I can only find this mail thread.

A stack trace from gdb (from a core dump, if you have one) would help to see what is going wrong.
Another way to debug is to set the LKL_HIJACK_DEBUG=1 environment variable to generate more messages.

Does make test under the tools/lkl directory work?
That would clarify whether the hijack library or lkl itself is misbehaving.

> BTW, for comparison, the lkl hijacking for aarch64 works fine for me.

That's great news.

@mxi1
Author

mxi1 commented May 17, 2017

@thehajime

> a stack trace with gdb (from core dump if you have) will be helpful to see what's wrong with.

I got one strace, but it was not very useful. I can try again if I manage to capture another one.

> another way to debug is setting LKL_HIJACK_DEBUG=1 environment variable to generate more messages.

Yes, I enabled LKL_HIJACK_DEBUG, but couldn't get any helpful information.
It crashes at a very early stage in lkl_start_kernel, possibly when memory, thread or atomic operations are called.

> does make test under tools/lkl directory successfully work ?

No. boot can be built but does not run in the Android shell, which reports that only position independent executables (PIE) are supported.
Compiling net-test.c fails because of the incomplete type struct icmphdr.

> this will be clarify if hijack library behaves wrong, or lkl itself.

@mxi1
Author

mxi1 commented May 19, 2017

@thehajime Here is the final part of the strace output.

Is there anything useful for debugging in it? Thanks.

pipe2([4, 5], 0)                        = 0
fcntl64(4, F_SETFL, O_RDONLY|O_NONBLOCK) = 0
futex(0xf6a2af24, FUTEX_WAKE_PRIVATE, 2147483647) = 0
mmap2(NULL, 1044480, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = 0xf6741000
madvise(0xf6741000, 1044480, MADV_MERGEABLE) = 0
mprotect(0xf6741000, 4096, PROT_NONE)   = 0
clone(child_stack=0xf683f928, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0xf683f938, tls=0xf683f978, child_tidptr=0xf683f938) = 18931
mmap2(NULL, 1044480, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = 0xf6642000
madvise(0xf6642000, 1044480, MADV_MERGEABLE) = 0
mprotect(0xf6642000, 4096, PROT_NONE)   = 0
clone(child_stack=0xf6740928, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0xf6740938, tls=0xf6740978, child_tidptr=0xf6740938) = 18932
futex(0xf6853008, FUTEX_WAIT_PRIVATE, 0, NULL) = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
--- SIGSTOP {si_signo=SIGSTOP, si_code=SI_USER, si_pid=521, si_uid=0} ---
--- stopped by SIGSTOP ---
futex(0xf6853008, FUTEX_WAIT_PRIVATE, 0, NULL <ptrace(SYSCALL):No such process>
+++ killed by SIGSEGV +++
Segmentation fault

@thehajime
Member

@mxi1 Thanks for the trace, but I can't tell from it what is going wrong.

I think we could first fix the build issue you hit (only position independent executables (PIE) are supported), and then see how make test behaves.

If a PIE build is required for Android binaries, we can add the linker flag (-pie) to the lkl build.
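
For background, an executable linked with `-pie` is recorded in its ELF header with type `ET_DYN` (the same type used for shared objects), whereas a traditional fixed-address executable has type `ET_EXEC`. The sketch below is one way to check which kind a build produced before pushing it to the device. It assumes a Linux host that provides `<elf.h>`; the function name `sketch_elf_is_et_dyn` is hypothetical.

```c
#include <assert.h>
#include <elf.h>
#include <stdio.h>
#include <string.h>

/*
 * Return 1 if the ELF file at path has type ET_DYN (shared object or PIE),
 * 0 if it has another ELF type, and -1 on I/O error or if it is not ELF.
 * e_ident and e_type sit at the same offsets in 32- and 64-bit headers,
 * so reading the first EI_NIDENT + 2 bytes is enough for both.
 */
static int sketch_elf_is_et_dyn(const char *path)
{
	unsigned char hdr[EI_NIDENT + 2];
	unsigned int type;
	size_t n;
	FILE *f = fopen(path, "rb");

	if (!f)
		return -1;
	n = fread(hdr, 1, sizeof(hdr), f);
	fclose(f);
	if (n != sizeof(hdr) || memcmp(hdr, ELFMAG, SELFMAG) != 0)
		return -1;

	/* e_type is a 16-bit field stored in the file's own byte order. */
	if (hdr[EI_DATA] == ELFDATA2MSB)
		type = ((unsigned int)hdr[EI_NIDENT] << 8) | hdr[EI_NIDENT + 1];
	else
		type = hdr[EI_NIDENT] | ((unsigned int)hdr[EI_NIDENT + 1] << 8);

	return type == ET_DYN;
}

/* A path that cannot be opened is reported as an error, not as "non-PIE". */
static void sketch_pie_selfcheck(void)
{
	assert(sketch_elf_is_et_dyn("/nonexistent/sketch/path") == -1);
}
```

Note that `-pie` at link time usually needs the objects to be compiled with `-fPIE` (or `-fPIC`) as well, otherwise the link may fail or produce text relocations.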

@mxi1
Author

mxi1 commented May 23, 2017

@thehajime Thanks for your advice.
Here are the Makefile changes. Executables and dynamic libraries need separate rules here, because -pie is for executables and -shared is for dynamic libraries.

diff --git a/tools/lkl/Makefile b/tools/lkl/Makefile
index fd6c698b3524..8bb7eaeb83c3 100644
--- a/tools/lkl/Makefile
+++ b/tools/lkl/Makefile
@@ -133,9 +133,12 @@ $(OUTPUT)lib/lkl.o:
 $(OUTPUT)liblkl.a: $(OUTPUT)lib/lkl-in.o $(OUTPUT)lib/lkl.o
        $(QUIET_AR)$(AR) -rc $@ $^
 
-$(OUTPUT)liblkl$(SOSUF) $(OUTPUT)liblkl-hijack$(SOSUF) $(OUTPUT)lklfuse$(EXESUF) $(OUTPUT)fs2tar$(EXESUF) $(OUTPUT)cptofs$(EXESUF) $(OUTPUT)tests/boot $(OUTPUT)tests/net-test:
+$(OUTPUT)liblkl$(SOSUF) $(OUTPUT)liblkl-hijack$(SOSUF) $(OUTPUT)lklfuse$(EXESUF) $(OUTPUT)fs2tar$(EXESUF) $(OUTPUT)cptofs$(EXESUF):
        $(QUIET_LINK)$(CC) $(LDFLAGS) -o $@ $^ $(LDLIBS)
 
+$(OUTPUT)tests/boot $(OUTPUT)tests/net-test:
+       $(QUIET_LINK)$(CC) $(LDFLAGS) -o $@ $^ $(LDLIBS) -pie
+
 $(OUTPUT)cpfromfs$(EXESUF): cptofs$(EXESUF)
        $(Q)if ! [ -e $@ ]; then ln -s $< $@; fi

Now I can run the 32-bit boot command (without any parameters) in the Android shell. The output looks like this:

mutex                passed [1]
semaphore            passed [1]
join                 passed [joined -151783120]
Segmentation fault 

From the source code, the segmentation fault comes from lkl_start_kernel.
Any suggestions for this? Thank you!

@mxi1
Author

mxi1 commented Jun 1, 2017

@thehajime To find where the code crashes, I added several checkpoints and found the location, which is this line in arch/lkl/kernel/setup.c:

init_sem = lkl_ops->sem_alloc(0);

It calls the sem_alloc function defined in tools/lkl/lib/posix-host.c.

In sem_alloc(), all the code works fine until the last line, return sem;. It seems that returning the value triggers the unhandled level 2 translation fault.

@thehajime
Member

> BTW, for comparison, the lkl hijacking for aarch64 works fine for me.

@mxi1 I'm now trying to test this on aarch64, but I'm still struggling to build liblkl-hijack.so. Which toolchain did you use when the aarch64 hijack worked?

I'm currently using the tarball generated by make_standalone_toolchain.py --arch arm64, included in NDK version 15.0.4075724.

BTW, please close this issue if you have solved the original problem.

@mxi1
Author

mxi1 commented Jul 1, 2017

@thehajime Okay, I think I can close this issue. I am still preparing a patch to support both arm32 and arm64.

There's a bug in the aarch64 Android gcc 4.9 toolchain, so you need a newer toolchain. As far as I remember, version 5.4 should work; I got it from the Linaro git repo. Let me try to find the URL for you on my home laptop.

@mxi1 mxi1 closed this as completed Jul 1, 2017
@thehajime
Member

@mxi1 Thanks a lot !

Although I still hit the same linking issue (https://sourceware.org/bugzilla/show_bug.cgi?id=18270), I was able to build liblkl-hijack.so successfully with version 6.3 of the toolchain (https://android-git.linaro.org/platform/prebuilts/gcc/linux-x86/aarch64/aarch64-linux-android-6.3-linaro.git/), and it works fine on the aarch64 Android emulator.

generic_arm64:/data/local/tmp $ LKL_HIJACK_DEBUG=1 LD_PRELOAD=./liblkl-hijack.so ip link
[    0.000000] Linux version 4.10.0+ (tazaki@zakzak-x260) (gcc version 6.3.1 20170109 (Linaro GCC Snapshot 6.3-2017.01) ) #30 Mon Jul 3 12:28:57 JST 2017
[    0.000000] bootmem address range: 0x7689200000 - 0x768d200000
[    0.000000] Built 1 zonelists in Zone order, mobility grouping on.  Total pages: 16160
[    0.000000] Kernel command line:
[    0.000000] PID hash table entries: 256 (order: -1, 2048 bytes)
[    0.000000] Dentry cache hash table entries: 8192 (order: 4, 65536 bytes)
[    0.000000] Inode-cache hash table entries: 4096 (order: 3, 32768 bytes)
[    0.000000] Memory available: 64440k/0k RAM
[    0.000000] SLUB: HWalign=32, Order=0-3, MinObjects=0, CPUs=1, Nodes=1
[    0.000000] NR_IRQS:4096
[    0.000000] lkl: irqs initialized
[    0.000000] clocksource: lkl: mask: 0xffffffffffffffff max_cycles: 0x1cd42e4dffb, max_idle_ns: 881590591483 ns
[    0.000020] lkl: time and timers initialized (irq1)
[    0.000331] pid_max: default: 4096 minimum: 301
[    0.001149] Mount-cache hash table entries: 512 (order: 0, 4096 bytes)
[    0.001179] Mountpoint-cache hash table entries: 512 (order: 0, 4096 bytes)
[    0.233752] console [lkl_console0] enabled
[    0.235704] clocksource: jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 19112604462750000 ns
[    0.240603] NET: Registered protocol family 16
[    0.268119] clocksource: Switched to clocksource lkl
[    0.278931] NET: Registered protocol family 2
[    0.290727] TCP established hash table entries: 512 (order: 0, 4096 bytes)
[    0.294984] TCP bind hash table entries: 512 (order: 0, 4096 bytes)
[    0.299250] TCP: Hash tables configured (established 512 bind 512)
[    0.302808] UDP hash table entries: 128 (order: 0, 4096 bytes)
[    0.306051] UDP-Lite hash table entries: 128 (order: 0, 4096 bytes)
[    0.321164] workingset: timestamp_bits=62 max_order=14 bucket_order=0
[    0.380599] SGI XFS with ACLs, security attributes, no debug enabled
[    0.427369] io scheduler noop registered
[    0.431148] io scheduler deadline registered
[    0.435052] io scheduler cfq registered (default)
[    0.538038] mousedev: PS/2 mouse device common for all mice
[    0.547445] NET: Registered protocol family 10
[    0.560128] Segment Routing with IPv6
[    0.566145] sit: IPv6, IPv4 and MPLS over IPv4 tunneling driver
[    0.583736] Warning: unable to open an initial console.
[    0.590231] This architecture does not have kernel memory protection.
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: sit0@NONE: <NOARP> mtu 1480 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/sit 0.0.0.0 brd 0.0.0.0

@mxi1
Author

mxi1 commented Jul 3, 2017

@thehajime I hadn't noticed this issue. I am trying to get the networking function of the hijack library working. Is this issue serious for running lkl on the aarch64 platform?

@thehajime
Member

@mxi1 We're trying to get networking working, but haven't looked into it deeply yet.

The issue above with the 4.9/5.4 toolchains is in ld (the linker): we cannot build liblkl-hijack.so with them, so we cannot work around it.

@mxi1
Author

mxi1 commented Jul 3, 2017

@thehajime Have you enabled 64BIT in the config file? I checked my Makefile, and there are no special options for elf64-littleaarch64 support. All I did was add elf64-littleaarch64 alongside the existing elf32-littlearm and remove the __android__ part from tools/lkl/lib/endian.h. There is one link problem, which I mentioned in another issue about ARM support, but that seems to be another story.

@thehajime
Member

> @thehajime Have you enabled 64BIT in the config file?

Yes.

> I checked my Makefile, no special options for elf64-littleaarch64 support. What I do is just adding elf64-littleaarch64 along side with the existing elf32-littlearm, and removing the __android__ part from tools/lkl/lib/endian.h. There's one link problem, and I have mentioned in another issue for ARM support, but that seems to be another story.

My current work-in-progress patches are available here:

libos-nuse@5c5bd5c

It's going to get more complicated later :)

@thehajime
Member

BTW, I'm testing the aarch64 hijack above on Android 7.x.

@mxi1
Author

mxi1 commented Jul 3, 2017

@thehajime The Android version doesn't matter. The gcc cross-toolchain stays at version 4.9, and it seems Google will replace gcc with llvm. I have added two comments on your 5c5bd5c patch; you could try my approach to see if it works.
