-
Notifications
You must be signed in to change notification settings - Fork 133
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
corrupt kernel image read in stage 2 #552
Comments
It appears as if this the culprit is related to the real mode sector read in stage2. Apply the following patch to the latest master and - whether or not the crash occurs - observe that this specific address changes, always to the same value, after a call to bios_read_sectors:
see how the location always changes to 0x4200:
That last one was the correct word from the kernel image being clobbered. This is what was leading the elf parsing off into the weeds. Using a watchpoint in gdb shows the address being modified - presumably because it's in esp!
0xd562 is off in the weeds, but after entering bios_read_sectors we are changing modes and stacks, and I'm not sure if gdb is keeping up with this (though if I change to arch i8086 it basically shows the same regs and backtrace). My hunch is that somehow, in the process of switching to real mode, the high 16 bits of esp (stage2 stack) are merged with REAL_SP (0x1ff0), leading to the corruption as the stack is used while in real mode. (Note that the half word at 0x7ffd1fe8 is also corrupted.) But I'm not totally clear on why this is if SS is cleared; I would think the high word of ESP would only apply in protected mode. If I add a 'xor esp, esp' before the jump to real mode, the corruption goes away. I'm not sure if this is the right fix though. I will PR for further discussion. |
Will,
I will take it from here
Max
…On Fri, 22 Feb 2019 at 14:00, Will Jhun ***@***.***> wrote:
It appears as if this the culprit is related to the real mode sector read
in stage2. Apply the following patch to the latest master and - whether or
not the crash occurs - observe that this specific address changes, always
to the same value, after a call to bios_read_sectors:
diff --git a/boot/stage2.c b/boot/stage2.c
index fa6e06e..168f5e3 100644
--- a/boot/stage2.c
+++ b/boot/stage2.c
@@ -42,7 +42,13 @@ static void read_sectors(char *dest, u64 start_sector, u64 nsectors)
{
while (nsectors > 0) {
int read_sectors = MIN(nsectors, SCRATCH_LEN >> SECTOR_OFFSET);
+ console("pre ");
+ print_u64(*(u64*)0x7ffd1fe0);
+ console("\n");
bios_read_sectors(start_sector, read_sectors);
+ console("post ");
+ print_u64(*(u64*)0x7ffd1fe0);
+ console("\n");
runtime_memcpy(dest, pointer_from_u64(SCRATCH_BASE), read_sectors << SECTOR_OFFSET);
dest += read_sectors << SECTOR_OFFSET;
start_sector += read_sectors;
see how the location always changes to 0x4200:
/share/src/nanovms/qemu/x86_64-softmmu/qemu-system-x86_64 -display none -serial stdio -m 2G -device isa-debug-exit -no-reboot -drive file=/share/src/nanovms/master00/output/image/image,format=raw,if=virtio -device virtio-net,netdev=n0 -ne\
tdev user,id=n0,hostfwd=tcp::8080-:8080,hostfwd=tcp::9090-:9090,hostfwd=udp::5309-:5309
pre 0000000000000000
post 0000000000004200
pre 0000000000004200
post 0000000000004200
pre 0000000000004200
post 0000000000004200
[...]
pre 0001001200002e98
post 0000000000004200
assigned: 10.0.2.15
That last one was the correct word from the kernel image being clobbered.
This is what was leading the elf parsing off into the weeds.
Using a watchpoint in gdb shows the address being modified - presumably
because it's in esp!
(gdb) c
Continuing.
Hardware watchpoint 2: *(0x7ffd1fe0)
Old value = 10890
New value = 16896
0x0000d562 in ?? ()
(gdb) i r
eax 0x4200 16896
ecx 0xfffffffe -2
edx 0x80 128
ebx 0xde1c 56860
esp 0x7ffd1fe0 0x7ffd1fe0
ebp 0x7ffdfac4 0x7ffdfac4
esi 0x8062 32866
edi 0x7ffd3000 2147299328
eip 0xd562 0xd562
eflags 0x46 [ PF ZF ]
cs 0xf000 61440
ss 0x0 0
ds 0x0 0
es 0x0 0
fs 0x0 0
gs 0x0 0
(gdb) bt
#0 0x0000d562 in ?? ()
#1 0x00008954 in ?? ()
#2 0x0000c277 in ?? ()
#3 0x0000a320 in ?? ()
#4 0x0000c406 in ?? ()
#5 0x0000c56a in ?? ()
#6 0x000087cf in ?? ()
#7 0x00008852 in ?? ()
#8 0x0000bef4 in ?? ()
#9 0x0000d2c8 in ?? ()
#10 0x0000d2e7 in ?? ()
#11 0x00008ac1 in ?? ()
#12 0x0000d36b in ?? ()
#13 0x0000d3fe in ?? ()
#14 0x0000cdb4 in ?? ()
#15 0x00008c74 in ?? ()
#16 0x00008db8 in ?? ()
Backtrace stopped: previous frame inner to this frame (corrupt stack?)
(gdb)
0xd562 is off in the weeds, but after entering bios_read_sectors we are
changing modes and stacks, and I'm not sure if gdb is keeping up with this
(though if I change to arch i8086 it basically shows the same regs and
backtrace).
My hunch is that somehow, in the process of switching to real mode, the
high 16 bits of esp (stage2 stack) are merged with REAL_SP (0x1ff0),
leading to the corruption as the stack is used while in real mode. (Note
that the half word at 0x7ffd1fe8 is also corrupted.) But I'm not totally
clear on why this is if SS is cleared; I would think the high word of ESP
would only apply in protected mode.
If I add a 'xor esp, esp' before the jump to real mode, the corruption
goes away. I'm not sure if this is the right fix though. I will PR for
further discussion.
—
You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub
<#552 (comment)>, or mute
the thread
<https://github.com/notifications/unsubscribe-auth/AC4fneJLQc9CilYMyxsMrh8QVogN24K1ks5vP4d6gaJpZM4bH_f4>
.
|
I found the root cause - stack is still 32-bit in real-mode because we don't re-load segment registers in prot16 mode with data16 descriptor. |
This is similar to the first crash found in #541, seen elsewhere as well (here I am just attempting to run webg - the bug is in very early init and doesn't appear to be related to the program). While parsing the kernel elf symbol table in stage3 (which is read in stage2 but temporarily mapped in stage3 to ingest the symbols), an invalid symbol name offset is followed past the end of the kernel image buffer. I verified that this invalid offset is in place - and doesn't match the file contents - in stage2 in kernel_read_complete().
This is kind of a heisenbug; if you change the kernel code by a line or two, it vanishes. It is apparently affected by the size and alignment of the elf sections. I can't give instructions to reproduce, and I just happen to have a tree that is reproducing it ... though not reliably depending on the build or added debug code.
Of course another issue here is that the elf parsing needs to be hardened. But I'd like to fix the root of the problem before putting protective measures in place.
The problem doesn't appear to be in read_sectors(), at first glance anyway, since the words affected are in the middle of a copy. I kind of doubt it's a tfs issue, since the file only has two extents, this is smack in the middle of one, and everything is sector aligned. Also, a dump of the fs image produces a kernel file that matches the original.
The text was updated successfully, but these errors were encountered: