Service VM boot fails (Intel Atom E3900 Series) #7840

Closed
florian90re opened this issue Jun 28, 2022 · 22 comments
Labels
status: new

Comments

@florian90re

Describe the bug
The Service VM fails to boot. The system startup process ends in a black screen showing "ACRN loading..." (the output from the GRUB menu entry).

Platform
Duagon MC50M https://www.duagon.com/de/produkte/computing/box-pc/#selectedCategory=116%2C117%2C126%2C125%2C115
It's a box PC based on the Intel Atom E3900 series. It is commonly used in the railway industry.

Codebase
ACRN-HV Branch: release_2.7
ACRN SOS Branch: release_2.7
SOS Version: Ubuntu 18.04.06

Scenario
shared

To Reproduce
Steps to reproduce the behavior:

  1. Set up a clean Ubuntu 18.04.06
  2. Follow the getting started guide for Version 2.7: https://projectacrn.github.io/2.7/getting-started/getting-started.html
  3. Start up and boot into ACRN

Expected behavior
The Service VM (SOS) boots properly.

Additional context
The folder contains:

  • The output of the ACRN-Shell
  • board description file
  • scenario file
  • grub files

Duagon_shared_startup_fail.zip

@florian90re added the "status: new" label on Jun 28, 2022
@gvancuts
Contributor

1. Set up a clean Ubuntu 18.04.06

Out of curiosity, are you also using Ubuntu 18.04 on the build machine? I'm asking because the last time I tried, I couldn't build ACRN v2.7 on Ubuntu 18.04 (I did not investigate further and switched to 20.04 instead).

@florian90re
Author

I've tested it with both 18.04 and 20.04 and got the same result. However, I used the release_2.7 branch instead of the v2.7 branch. I have noticed that the ACRN build fails for some hardware on the v2.7 branch, but I never had any issues on the release_2.7 branch.
The build seems to work correctly: I compared the output to the output I got for other hardware, and they look pretty similar.

@fuzhongl
Contributor

fuzhongl commented Jun 29, 2022

@florian90re
root=PARTUUID="dc85eb42-d879-499a-ae68-11fbc0e35568" is partition 2 (p2) of the NVMe drive, and it boots natively, right?
Your scenario.xml uses /dev/ttyS5; is there a reason you set uart=bdf@0x500 in the GRUB menu?
<SERIAL_CONSOLE>/dev/ttyS5</SERIAL_CONSOLE>
seri:/dev/ttyS5 type:mmio base:0x91626000 irq:6 bdf:"00:18.2"
It should be uart=bdf@0xc2, or you can remove this parameter, since that is the default value.
See the following link for details:
https://projectacrn.github.io/2.7/user-guides/hv-parameters.html
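
For reference, both uart=bdf@ values mentioned here (0xc2 for 00:18.2 and 0x500 for 05:00.0) are consistent with packing bus, device, and function as (bus << 8) | (dev << 3) | func. A minimal Python sketch of that conversion, assuming that encoding; the helper names are illustrative and not part of ACRN:

```python
def parse_bdf(text):
    """Split a 'bus:dev.func' string (hexadecimal fields) into three integers."""
    bus_dev, func = text.split(".")
    bus, dev = bus_dev.split(":")
    return int(bus, 16), int(dev, 16), int(func, 16)


def uart_bdf_value(text):
    """Pack bus/device/function into the value written after 'uart=bdf@'.

    Assumed layout: bits 15:8 bus, bits 7:3 device, bits 2:0 function.
    """
    bus, dev, func = parse_bdf(text)
    if not (0 <= bus <= 0xFF and 0 <= dev <= 0x1F and 0 <= func <= 0x7):
        raise ValueError(f"BDF out of range: {text}")
    return (bus << 8) | (dev << 3) | func


for bdf in ("00:18.2", "05:00.0"):
    print(f"{bdf} -> uart=bdf@{uart_bdf_value(bdf):#x}")
# 00:18.2 -> uart=bdf@0xc2
# 05:00.0 -> uart=bdf@0x500
```

Under this reading, uart=bdf@0x500 points at the PCI device 05:00.0 rather than at the built-in controller 00:18.2.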

@florian90re
Author

@fuzhongl
Yes, root=PARTUUID="dc85eb42-d879-499a-ae68-11fbc0e35568" is p2 of the NVMe drive, and native boot works fine.

The serial port situation is a little confusing for this hardware:
The devices at bdf "00:18.0", "00:18.2", and "00:18.3" are the built-in signal processing controllers, but for some reason they are not connected to the actual serial adapter on the box PC. The serial adapter that works is PCI-based, at bdf 05:00.0. You can also see the adapter in the board description file.
I added <SERIAL_CONSOLE>/dev/ttyS5</SERIAL_CONSOLE> manually, since the ACRN build gives me an error if I leave the serial console information empty (<SERIAL_CONSOLE></SERIAL_CONSOLE>).

I also tested with no UART information in the GRUB file. The result was the same, except that I didn't get any output from the ACRN shell.

@fuzhongl
Contributor

@florian90re Thanks for the clarification!
Could you try booting natively with the Service VM kernel?
Please also try booting ACRN with the native kernel instead of the Service VM kernel.
Thanks!

@florian90re
Author

@fuzhongl
I natively booted the Service VM's kernel, and it works fine.

I also tried to boot ACRN with a native Kernel (5.4.0-84-generic). There was no output at all. I added the output of the ACRN shell.
ACRN_console_ouput_normal_kernel.txt

I assume this indicates that the problem comes from the HV rather than the Service VM kernel, right?

@fuzhongl
Contributor

fuzhongl commented Jun 30, 2022

Thanks for your effort!
Yes, I think so. But it is strange that there is no output in the HV console with the native kernel.
It seems like the boot needs information from the eMMC:
[ 115.916673] mmc1: Timeout waiting for hardware interrupt.

Please boot natively with the ACRN Service VM kernel and share the dmesg and lsblk logs.

Please also check whether the following patch works.

diff --git a/hypervisor/hw/pci.c b/hypervisor/hw/pci.c
index 30f0b487f..f43f79601 100644
--- a/hypervisor/hw/pci.c
+++ b/hypervisor/hw/pci.c
@@ -464,6 +464,11 @@ static void scan_pci_hierarchy(uint8_t bus, uint64_t buses_visited[BUSES_BITMAP_
                        continue;
                }

+               if ((pbdf.bits.b == 0x0) && ((pbdf.bits.d == 0x1b) || (pbdf.bits.d == 0x1c))) {
+                       //ignore MMC
+                       continue;
+               }
+
                for (dev = 0U; dev <= PCI_SLOTMAX; dev++) {
                        pbdf.bits.d = dev;
                        pbdf.bits.f = 0U;

Thanks!

@florian90re
Author

@fuzhongl
Thanks for your help.
I attached the logs from dmesg and lsblk from the native boot of the ACRN Service VM Kernel.
dmesg_output.txt
lsblk_ouput.txt

I applied the patch but I got the same result as before.

I also tried to use the 3.0 Version of the hypervisor and kernel which gave me a slightly different output from the ACRN Shell. If you think this might be helpful I can also share it.

@fuzhongl
Contributor

fuzhongl commented Jul 2, 2022

@florian90re Thanks for sharing the logs!
Did you regenerate board.xml and scenario.xml when you tried v3.0? Please share the slightly different ACRN shell output from v3.0.
Thanks!

@florian90re
Author

@fuzhongl Sorry for the delay. I couldn't access the hardware over the weekend.

Yes, I re-generated the board and scenario file using v3.0.
ACRN_console_output_v3.0.txt

@fuzhongl
Contributor

fuzhongl commented Jul 6, 2022

@florian90re Thanks for sharing ACRN_console_output of v3.0.

ACRN:\>vm_list

VM_ID VM_NAME                          VM_STATE
===== ================================ ========
  0   VM0                              Running
ACRN:\>vm_console 0
vuart console is not active
ACRN:\>

It seems that the Service VM console isn't enabled in scenario.xml.
Could you double-check that COM Port 1 is set for the console_vuart of the Service VM?
<console_vuart>COM Port 1</console_vuart>
Thanks!
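
One way to run that check without opening the file by hand is to read the console_vuart element per VM. This is a rough Python sketch only: the element name console_vuart and the value "COM Port 1" come from the comment above, while the acrn-config root, the vm elements with an id attribute, and the sample values are assumptions for illustration; a real scenario.xml contains many more elements.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed-down scenario fragment used only for this example.
SAMPLE = """
<acrn-config>
  <vm id="0">
    <name>VM0</name>
    <console_vuart>COM Port 1</console_vuart>
  </vm>
  <vm id="1">
    <name>VM1</name>
  </vm>
</acrn-config>
"""


def console_vuart_report(xml_text):
    """Return {vm id: console_vuart text, or None if absent or empty}."""
    root = ET.fromstring(xml_text)
    report = {}
    for vm in root.iter("vm"):
        node = vm.find(".//console_vuart")
        has_text = node is not None and node.text and node.text.strip()
        report[vm.get("id")] = node.text.strip() if has_text else None
    return report


print(console_vuart_report(SAMPLE))
```

To inspect a real file, the same function could be fed the file contents, for example `console_vuart_report(open("scenario.xml").read())`, where the path is whatever your working folder uses.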

@florian90re
Author

@fuzhongl Sorry for the delay.

Here is the complete output with Version 3.0:
ACRN_console_output_v3.0_new.txt

@fuzhongl
Contributor

@florian90re Thanks for sharing the v3.0 log.
It seems like the system boot needs information from the eMMC.

[19225594us][cpu=3][vm0:vcpu3][sev=2][seq=66]:vpci_write_cfg 0:d.0 not found! off: 0xe1, val: 0x0
[19235110us][cpu=3][vm0:vcpu3][sev=2][seq=67]:vpci_write_cfg 0:d.0 not found! off: 0xe1, val: 0x1
[19609703us][cpu=2][vm0:vcpu2][sev=2][seq=72]:vpci_write_cfg 0:d.0 not found! off: 0xe0, val: 0xb
[20434518us][cpu=2][vm0:vcpu2][sev=2][seq=73]:vpci_write_cfg 0:d.0 not found! off: 0xd0, val: 0x0
[20443994us][cpu=2][vm0:vcpu2][sev=2][seq=74]:vpci_write_cfg 0:d.0 not found! off: 0xd0, val: 0xd600003c
[20454100us][cpu=2][vm0:vcpu2][sev=2][seq=75]:vpci_write_cfg 0:d.0 not found! off: 0xdc, val: 0x0
[20463648us][cpu=2][vm0:vcpu2][sev=2][seq=76]:vpci_write_cfg 0:d.0 not found! off: 0xd4, val: 0x0
[20473230us][cpu=2][vm0:vcpu2][sev=2][seq=77]:vpci_write_cfg 0:d.0 not found! off: 0xd8, val: 0x0
[20482710us][cpu=2][vm0:vcpu2][sev=2][seq=78]:vpci_write_cfg 0:d.0 not found! off: 0xd8, val: 0x30e00000
[20492897us][cpu=2][vm0:vcpu2][sev=2][seq=79]:vpci_write_cfg 0:d.0 not found! off: 0xd8, val: 0x30e00000
[20503102us][cpu=2][vm0:vcpu2][sev=2][seq=80]:vpci_write_cfg 0:d.0 not found! off: 0xd8, val: 0x30e00000
[20513221us][cpu=2][vm0:vcpu2][sev=2][seq=81]:vpci_write_cfg 0:d.0 not found! off: 0xd8, val: 0x30e00001
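
The log lines above all name the same device, 0:d.0 (bus 0, device 0xd, function 0). A small Python sketch, assuming the log format shown here, that tallies such "not found" writes per VM and device so the pattern is easy to see in a longer log; the regular expression and function name are illustrative:

```python
import re
from collections import Counter

# Three lines copied from the log above, used as sample input.
LOG = """\
[19225594us][cpu=3][vm0:vcpu3][sev=2][seq=66]:vpci_write_cfg 0:d.0 not found! off: 0xe1, val: 0x0
[19609703us][cpu=2][vm0:vcpu2][sev=2][seq=72]:vpci_write_cfg 0:d.0 not found! off: 0xe0, val: 0xb
[20443994us][cpu=2][vm0:vcpu2][sev=2][seq=74]:vpci_write_cfg 0:d.0 not found! off: 0xd0, val: 0xd600003c
"""

PATTERN = re.compile(
    r"\[vm(?P<vm>\d+):vcpu\d+\].*?vpci_write_cfg "
    r"(?P<bdf>[0-9a-f]+:[0-9a-f]+\.[0-9a-f]+) not found! "
    r"off: (?P<off>0x[0-9a-f]+), val: (?P<val>0x[0-9a-f]+)"
)


def missing_device_writes(log_text):
    """Count 'vpci_write_cfg ... not found!' lines per (vm id, bdf)."""
    counts = Counter()
    for line in log_text.splitlines():
        match = PATTERN.search(line)
        if match:
            counts[(int(match["vm"]), match["bdf"])] += 1
    return counts


print(missing_device_writes(LOG))
```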

Please try the following patch:

diff --git a/misc/config_tools/library/board_cfg_lib.py b/misc/config_tools/library/board_cfg_lib.py
index 3f621a29d..92e583f3a 100644
--- a/misc/config_tools/library/board_cfg_lib.py
+++ b/misc/config_tools/library/board_cfg_lib.py
@@ -29,6 +29,7 @@ HEADER_LICENSE = common.open_license() + "\n"
 KNOWN_HIDDEN_PDEVS_BOARD_DB = {
     'apl-mrb':['00:0d:0'],
     'apl-up2':['00:0d:0'],
+    'my_board':['00:0d:0'],
 }

 TSN_DEVS = ["8086:4b30", "8086:4b31", "8086:4b32", "8086:4ba0", "8086:4ba1", "8086:4ba2",

Thanks!

@florian90re
Author

@fuzhongl Thanks again for your help.

I applied the patch but got an error during the ACRN build:

/home/codewerk/acrn-work/acrn-hypervisor/build/hypervisor/configs/boards/board.c:146:2: error: excess elements in array initializer [-Werror]
146 | {
| ^
/home/codewerk/acrn-work/acrn-hypervisor/build/hypervisor/configs/boards/board.c:146:2: note: (near initialization for ‘plat_hidden_pdevs’)
cc1: all warnings being treated as errors
make[1]: *** [Makefile:554: /home/codewerk/acrn-work/acrn-hypervisor/build/hypervisor/configs/boards/board.o] Error 1
make[1]: Leaving directory '/home/codewerk/acrn-work/acrn-hypervisor/hypervisor'
make: *** [Makefile:130: hypervisor] Error 2

@fuzhongl
Contributor

@florian90re Please work around the build issue with the following modification in board.xml:
From:
<acrn-config board="my_board">
To:
<acrn-config board="apl-mrb">

It works on my side.

Please check whether the Service VM boots successfully with the above change.
Thanks!

@fuzhongl
Contributor

@florian90re The following is the fix for the build issue:

diff --git a/misc/config_tools/library/board_cfg_lib.py b/misc/config_tools/library/board_cfg_lib.py
index 3f621a29d..92e583f3a 100644
--- a/misc/config_tools/library/board_cfg_lib.py
+++ b/misc/config_tools/library/board_cfg_lib.py
@@ -29,6 +29,7 @@ HEADER_LICENSE = common.open_license() + "\n"
 KNOWN_HIDDEN_PDEVS_BOARD_DB = {
     'apl-mrb':['00:0d:0'],
     'apl-up2':['00:0d:0'],
+    'my_board':['00:0d:0'],
 }

 TSN_DEVS = ["8086:4b30", "8086:4b31", "8086:4b32", "8086:4ba0", "8086:4ba1", "8086:4ba2",
diff --git a/misc/config_tools/xforms/lib.xsl b/misc/config_tools/xforms/lib.xsl
index 615961a16..f53ec481a 100644
--- a/misc/config_tools/xforms/lib.xsl
+++ b/misc/config_tools/xforms/lib.xsl
@@ -375,6 +375,9 @@
       <xsl:when test="//@board = 'apl-up2'">
         <func:result select="1" />
       </xsl:when>
+      <xsl:when test="//@board = 'my_board'">
+        <func:result select="1" />
+      </xsl:when>
       <xsl:otherwise>
         <func:result select="0" />
       </xsl:otherwise>
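
One plausible reading of the earlier "excess elements in array initializer" error, which the thread does not confirm, is that the board name has to be known in two places: the Python table that lists the hidden devices and the XSL test that returns 1 or 0 for the board. If only the first is updated, the generated initializer has one entry while the declared size is zero. A Python sketch of that consistency check; KNOWN_HIDDEN_PDEVS_BOARD_DB is the name from the patch, everything else is a hypothetical stand-in:

```python
# Entries mirror the patch, including the placeholder board name 'my_board'.
KNOWN_HIDDEN_PDEVS_BOARD_DB = {
    "apl-mrb": ["00:0d:0"],
    "apl-up2": ["00:0d:0"],
    "my_board": ["00:0d:0"],
}

# Hypothetical stand-in for the board test in lib.xsl BEFORE the second hunk.
boards_with_hidden_pdev_slot = {"apl-mrb", "apl-up2"}


def hidden_pdevs(board):
    """Devices that would be emitted into the generated initializer."""
    return KNOWN_HIDDEN_PDEVS_BOARD_DB.get(board, [])


def declared_slots(board):
    """Size the other generator would declare for the same board (assumed)."""
    return 1 if board in boards_with_hidden_pdev_slot else 0


def sizes_agree(board):
    """True when the declared size can hold the generated entries."""
    return len(hidden_pdevs(board)) <= declared_slots(board)


print(sizes_agree("my_board"))   # False: one entry, zero slots
boards_with_hidden_pdev_slot.add("my_board")  # what the lib.xsl hunk does, in spirit
print(sizes_agree("my_board"))   # True
```

The same reasoning would explain why renaming the board to apl-mrb also built: that name was already present in both places.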

@florian90re
Author

@fuzhongl This worked!!!
Thanks a lot for your help. I really appreciate the time you put in to fix this issue.

@dbkinder
Contributor

@NanlinXie @junjiemao1 Does this patch work because, in this case, "my_board" is actually an APL-based board? It's not a general solution since not ALL "my_board" boards are APL-based. What's the general problem being fixed here and is there a general solution we'd fix in the next release?

@fuzhongl
Contributor

fuzhongl commented Jul 19, 2022

@florian90re Glad to know the Service VM boots up successfully.
Could you please share what information on the eMMC is needed for system boot?
Thanks!

@fuzhongl
Contributor

fuzhongl commented Jul 19, 2022

@dbkinder The patch I posted above for @florian90re, which changes board_cfg_lib.py and lib.xsl, is the actual fix for the build issue, not a workaround. Please ignore the previous workaround patch.

@junjiemao1
Contributor

@NanlinXie @junjiemao1 Does this patch work because, in this case, "my_board" is actually an APL-based board? It's not a general solution since not ALL "my_board" boards are APL-based. What's the general problem being fixed here and is there a general solution we'd fix in the next release?

As the name suggests, that list is intended to track hidden PCI devices on specific boards. It is so SPECIFIC that no general solution exists except that kind of hardcoding (because they are HIDDEN for any reason).

That said, it looks strange to me that the service VM attempts to access that device. @fuzhongl Any idea on how the kernel detects that device?

@fuzhongl
Contributor

@junjiemao1 The system can boot natively with the Service VM kernel, so this issue isn't related to the kernel.
It is also strange to me that the system boot needs some information from the eMMC, while this eMMC is hidden by the hypervisor.

The following is the log from the failed Service VM boot:

root@codewerk-APL-Platform:~# [  105.676680] mmc1: Timeout waiting for hardware interrupt.
[  105.677322] mmc1: sdhci: ============ SDHCI REGISTER DUMP ===========
[  105.678060] mmc1: sdhci: Sys addr:  0x00000008 | Version:  0x00001002
[  105.678791] mmc1: sdhci: Blk size:  0x00007200 | Blk cnt:  0x00000008
[  105.679518] mmc1: sdhci: Argument:  0x00000000 | Trn mode: 0x00000033
[  105.680247] mmc1: sdhci: Present:   0x1fff0001 | Host ctl: 0x0000003c
[  105.680976] mmc1: sdhci: Power:     0x0000000a | Blk gap:  0x00000080
[  105.681705] mmc1: sdhci: Wake-up:   0x00000000 | Clock:    0x00000207
[  105.682431] mmc1: sdhci: Timeout:   0x00000005 | Int stat: 0x00000000
[  105.683163] mmc1: sdhci: Int enab:  0x03ff000b | Sig enab: 0x03ff000b
[  105.683891] mmc1: sdhci: ACmd stat: 0x00000000 | Slot int: 0x00000000
[  105.684620] mmc1: sdhci: Caps:      0x546ec881 | Caps_1:   0x00000805
[  105.685347] mmc1: sdhci: Cmd:       0x00000c1b | Max curr: 0x00000000
[  105.686079] mmc1: sdhci: Resp[0]:   0x00000000 | Resp[1]:  0x00000000
[  105.686807] mmc1: sdhci: Resp[2]:   0x00000000 | Resp[3]:  0x00000000
[  105.687533] mmc1: sdhci: Host ctl2: 0x0000000c
[  105.688046] mmc1: sdhci: ADMA Err:  0x00000000 | ADMA Ptr: 0x000000024267d200
[  105.688848] mmc1: sdhci: ============================================
[  115.916673] mmc1: Timeout waiting for hardware interrupt.
[  115.917305] mmc1: sdhci: ============ SDHCI REGISTER DUMP ===========
[  115.918037] mmc1: sdhci: Sys addr:  0x00000008 | Version:  0x00001002
[  115.918767] mmc1: sdhci: Blk size:  0x00007200 | Blk cnt:  0x00000008
[  115.919495] mmc1: sdhci: Argument:  0x00000000 | Trn mode: 0x00000033
[  115.920224] mmc1: sdhci: Present:   0x1fff0001 | Host ctl: 0x0000003c
[  115.920953] mmc1: sdhci: Power:     0x0000000a | Blk gap:  0x00000080
[  115.921681] mmc1: sdhci: Wake-up:   0x00000000 | Clock:    0x00000207
[  115.922407] mmc1: sdhci: Timeout:   0x00000005 | Int stat: 0x00000000
[  115.923139] mmc1: sdhci: Int enab:  0x03ff000b | Sig enab: 0x03ff000b
[  115.923868] mmc1: sdhci: ACmd stat: 0x00000000 | Slot int: 0x00000000
[  115.924594] mmc1: sdhci: Caps:      0x546ec881 | Caps_1:   0x00000805
[  115.925326] mmc1: sdhci: Cmd:       0x00000c1b | Max curr: 0x00000000
[  115.926055] mmc1: sdhci: Resp[0]:   0x00000000 | Resp[1]:  0x00000000
[  115.926784] mmc1: sdhci: Resp[2]:   0x00000000 | Resp[3]:  0x00000000
[  115.927509] mmc1: sdhci: Host ctl2: 0x0000000c
[  115.928022] mmc1: sdhci: ADMA Err:  0x00000000 | ADMA Ptr: 0x000000024267d200
[  115.928824] mmc1: sdhci: ============================================

Thanks!
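
The register dump above repeats for each timeout, so it can help to pull out just the timeout events and the spacing between them. A short Python sketch under the assumption that the dmesg lines look like the ones in this log; the sample input is copied from it and the function name is illustrative:

```python
import re

# Sample input copied from the log above (first line keeps its shell prompt).
DMESG = """\
root@codewerk-APL-Platform:~# [  105.676680] mmc1: Timeout waiting for hardware interrupt.
[  105.677322] mmc1: sdhci: ============ SDHCI REGISTER DUMP ===========
[  115.916673] mmc1: Timeout waiting for hardware interrupt.
[  115.917305] mmc1: sdhci: ============ SDHCI REGISTER DUMP ===========
"""

TIMEOUT = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+(?P<host>mmc\d+): "
    r"Timeout waiting for hardware interrupt\."
)


def timeout_events(text):
    """Return a list of (host, timestamp in seconds) for each timeout line."""
    events = []
    for line in text.splitlines():
        match = TIMEOUT.search(line)
        if match:
            events.append((match["host"], float(match["ts"])))
    return events


events = timeout_events(DMESG)
# Seconds between consecutive timeouts.
gaps = [later[1] - earlier[1] for earlier, later in zip(events, events[1:])]
print(events)
print(gaps)
```

In this log the two timeouts on mmc1 are roughly ten seconds apart.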


5 participants