Initial boot of BMC takes too long #2813

Open
madscientist159 opened this issue Jan 21, 2018 · 19 comments
Labels: bug

Comments

@madscientist159 madscientist159 commented Jan 21, 2018

Any boot / reboot of the BMC (normally as a result of standby power removal) takes way too long, typically in excess of 1-2 minutes from standby power application to READY state / ability to IPL host.

U-Boot itself consumes around 15 seconds of this with kernel load from flash and subsequent verification, but the largest offender seems to be something in the systemd startup sequence. Toward the end of the boot process the CPU is not the limiting factor, so there is a chance that some systemd dependency is falling back on a timeout before boot can finish and the host can be IPLed.

This was first noted on the Talos and Romulus systems.

Member

@amboar amboar commented Feb 4, 2018

@madscientist159 can you post some measurements using systemd-bootchart?

Author

@madscientist159 madscientist159 commented Feb 5, 2018

Absolutely! Looks like most of the time is wasted waiting for the mapper service if I'm reading this correctly.
[Attached image: systemd-bootchart]

Member

@amboar amboar commented Feb 5, 2018

Ok. There's a fairly ambiguous issue open about improving mapper performance in #2860. Hopefully @tomjoseph83 can provide more information on what his intent is there.

Author

@madscientist159 madscientist159 commented Feb 5, 2018

Yeah, we had the BMC brick itself on one of our first boards at a customer site. We're still waiting for a ROM dump to see what happened, but I'd guess right now that system power got switched off during the lengthy BMC boot and something got corrupted as a result. The BMC should boot in under 30 seconds from where we sit, and ideally under 15 seconds.

Contributor

@geissonator geissonator commented Feb 26, 2018

I've been working on this in parallel with some other issues dealing with performance on BMC resets. The number 1 thing I've been working on has been removing the legacy Python applications from our images. A lot of these Python applications from skeleton are no longer used, yet are still loaded and started.

A few commits have made it in:
https://gerrit.openbmc-project.xyz/#/c/8934/ (UBI only enhancement)
https://gerrit.openbmc-project.xyz/#/c/8874/ (system manager legacy function removal)
https://gerrit.openbmc-project.xyz/#/c/8871/ (legacy settings removal)
https://gerrit.openbmc-project.xyz/#/c/8834/ (legacy system state removal)

Some have not due to small uses in other repos (or waiting for merge):
https://gerrit.openbmc-project.xyz/#/c/9055/ (netman.py removal)
https://gerrit.openbmc-project.xyz/#/c/8829/ (settings removal)
https://gerrit.openbmc-project.xyz/#/c/8793/ (inventory removal)
https://gerrit.openbmc-project.xyz/#/c/9004/ (BMC control removal)

With the above changes I managed to reduce our boot time by about 20%. There are still some skeleton Python applications that need to be refactored and moved to C++ (chassis_control.py, system_manager.py, sensor_manager2.py, sync_inventory_items.py).

The other thing we've looked at is refactoring mapper into a C++ application. I believe @edtanous has a prototype for this and the initial results looked promising. Mapper by its nature has a ton of processing to do because it has to introspect every existing dbus service when it starts and every new one that comes on the bus, so it's the number 1 user of CPU during a BMC boot.
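
To make the cost described above concrete, here is a toy model of an introspection walk. It is not the mapper's code: the bus is a plain dictionary with invented service names and paths, and each visited object path stands in for one D-Bus `Introspect` call. It only illustrates that the work grows with the total number of objects across all services.

```python
# Hypothetical bus contents: service name -> {object path -> child node names}.
# A real mapper would learn this from Introspect replies; these names are invented.
TOY_BUS = {
    "example.Inventory": {
        "/": ["xyz"],
        "/xyz": ["inventory"],
        "/xyz/inventory": ["cpu0", "cpu1"],
    },
    "example.Sensors": {
        "/": ["xyz"],
        "/xyz": ["sensors"],
        "/xyz/sensors": ["temp0"],
    },
}


def walk_service(tree, path="/"):
    """Return every object path reachable from `path` in one service's tree."""
    paths = [path]
    for child in tree.get(path, []):
        child_path = path.rstrip("/") + "/" + child
        paths.extend(walk_service(tree, child_path))
    return paths


def build_map(bus):
    """Map object path -> owning services and count simulated Introspect calls."""
    mapping = {}
    calls = 0
    for service, tree in bus.items():
        for path in walk_service(tree):
            calls += 1  # one simulated Introspect call per visited path
            mapping.setdefault(path, []).append(service)
    return mapping, calls


if __name__ == "__main__":
    mapping, calls = build_map(TOY_BUS)
    print(f"{calls} simulated Introspect calls for {len(mapping)} distinct paths")
```

In this model a service that appears later costs another full walk of its own tree, which mirrors the "every new one that comes on the bus" part of the description.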

Contributor

@edtanous edtanous commented May 4, 2018

FYI, prototype is now available for experimenting. It has a ways to go before it's merge worthy, but should be able to validate some assumptions about the path we're going down.

https://gerrit.openbmc-project.xyz/#/c/10468/
https://gerrit.openbmc-project.xyz/#/c/10470/

Member

@amboar amboar commented May 10, 2018

In the meantime, I've also got some very much WIP patches to improve the performance of the current mapper implementation:

https://gerrit.openbmc-project.xyz/#/q/topic:mapper-perf

Contributor

@geissonator geissonator commented Jul 12, 2018

Some more good stuff that got us another 10 seconds on the reboot:
https://gerrit.openbmc-project.xyz/#/c/11402/
https://gerrit.openbmc-project.xyz/#/c/11427/

@stale stale bot commented Mar 27, 2019

This issue has been automatically marked as stale because no activity has occurred in the last 6 months. It will be closed if no activity occurs in the next 30 days. If this issue should not be closed please add a comment. Thank you for your understanding and contributions.

@stale stale bot added the stale label Mar 27, 2019

@stale stale bot commented Apr 26, 2019

This issue has been closed because no activity has occurred in the last 7 months. Please reopen if this issue should not have been closed. Thank you for your contributions.

@stale stale bot closed this Apr 26, 2019

Author

@madscientist159 madscientist159 commented Apr 27, 2019

This is still a significant concern on the latest OpenBMC master. Can someone with the correct privileges reopen so that I don't have to file a new report?

@amboar amboar reopened this May 2, 2019
@stale stale bot removed the stale label May 2, 2019

@stale stale bot commented Nov 1, 2019

This issue has been automatically marked as stale because no activity has occurred in the last 6 months. It will be closed if no activity occurs in the next 30 days. If this issue should not be closed please add a comment. Thank you for your understanding and contributions.

@stale stale bot added the stale label Nov 1, 2019

Member

@gtmills gtmills commented Nov 16, 2019

@madscientist159 Still a problem? Still want this issue open? We have made some improvements in this area.

@gtmills gtmills closed this Nov 16, 2019

@classilla classilla commented Nov 16, 2019

I'm not him, but I'm curious about the improvements. Which commit(s)?

Contributor

@geissonator geissonator commented Nov 19, 2019

We have removed all Python from our image and moved to the C++ mapper and the C++ bmcweb REST server. All of this helped, but we have also continued to add new features and services that must be started. So, with master, on our witherspoon, here is where we're at:

2.8.0-dev-877-g9d570fc

Nov 19 15:14:50 witherspoon-Y230UF71K03T systemd[1]: Startup finished in 8.472s (kernel) + 1min 44.506s (userspace) = 1min 52.978s.

I have noticed on the AST2600 EVB that this same code runs 3-4x faster, so that will be a nice upgrade when we get to it. But for AST2500, I'm not sure there's much more low-hanging fruit.
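
As a rough cross-check of the numbers in this comment, the sketch below parses the `Startup finished` line quoted above and compares the total with the 30 second and 15 second targets stated earlier in the thread. The helper names (`to_seconds`, `parse_startup`) and the `TARGETS_S` table are illustrative assumptions, not part of systemd or OpenBMC; only the log line itself comes from the thread.

```python
import re

# The log line as quoted in the thread (timestamp and hostname dropped).
LINE = ("Startup finished in 8.472s (kernel) + 1min 44.506s (userspace) "
        "= 1min 52.978s.")

# Targets taken from an earlier comment in this thread; the dict is illustrative.
TARGETS_S = {"acceptable": 30.0, "ideal": 15.0}

# Unit suffixes seen in this kind of line; "ms" and "h" are handled defensively.
_UNITS = {"h": 3600.0, "min": 60.0, "s": 1.0, "ms": 0.001}
_TOKEN = re.compile(r"(\d+(?:\.\d+)?)(min|ms|h|s)")


def to_seconds(text):
    """Convert a time span such as '1min 44.506s' to seconds."""
    tokens = _TOKEN.findall(text)
    if not tokens:
        raise ValueError(f"no duration found in {text!r}")
    return sum(float(value) * _UNITS[unit] for value, unit in tokens)


def parse_startup(line):
    """Split a 'Startup finished in ...' line into per-phase seconds and a total."""
    match = re.search(r"Startup finished in (.+) = (.+?)\.?\s*$", line)
    if match is None:
        raise ValueError("not a 'Startup finished' line")
    phases = {}
    for part in match.group(1).split(" + "):
        phase = re.fullmatch(r"(.+) \((\w+)\)", part.strip())
        if phase is None:
            raise ValueError(f"unexpected phase format: {part!r}")
        phases[phase.group(2)] = to_seconds(phase.group(1))
    return {"phases": phases, "total": to_seconds(match.group(2))}


if __name__ == "__main__":
    result = parse_startup(LINE)
    for label, limit in TARGETS_S.items():
        print(f"{label}: {result['total']:.3f}s vs {limit:.0f}s target "
              f"({result['total'] / limit:.1f}x)")
```

On the quoted line the two phases sum to the reported total, and userspace accounts for roughly 92% of it. The figure does not include time spent in U-Boot, since this line only reports the kernel and userspace phases.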

Member

@gtmills gtmills commented Nov 19, 2019

Moving to dbus-broker helped in this area as well.
bba6b5a


@classilla classilla commented Nov 19, 2019

Very interesting. Thank you.

Member

@amboar amboar commented Nov 20, 2019

I'd argue that 1m52s is still too slow:

> The BMC should boot in under 30 seconds from where we sit, and ideally under 15 seconds.

Member

@gtmills gtmills commented Nov 20, 2019

> I'd argue that 1m52s is still too slow

Reopened

@gtmills gtmills reopened this Nov 20, 2019
@stale stale bot removed the stale label Nov 20, 2019