Description
cnservice.(*service).bootstrap() at pkg/cnservice/server.go:944-945 panics on ANY bootstrap error, including context.Canceled and context.DeadlineExceeded. This causes unrecoverable crashes instead of graceful error propagation when the bootstrap timeout expires or the service context is canceled during shutdown.
Root Cause
The current code:
if err := s.bootstrapService.Bootstrap(ctx); err != nil {
panic(moerr.AttachCause(ctx, err))
}
This panics on every error, but context cancellation or timeout is NOT a bootstrap failure; it is an external signal that the operation should stop.
Inconsistency
The same file already handles context.Canceled gracefully for BootstrapUpgrade (line 955):
if err != context.Canceled {
// only log non-cancel errors
}
But initial Bootstrap does not have this guard.
Impact
- All embedded cluster tests with multiple CNs can fail with panic: context canceled when bootstrap takes longer than 5 minutes
- Tests affected: TestBasicCluster, Test_TxnExecutorExec, TestPartitionBasedShardCanBeCreated, TestDeleteAndSelect, TestCreateAndDropPitr, Test_UpgradeEntry, TestAffectedRows, TestCompile, and more
- In production, this could crash a CN process during graceful shutdown
Additional Issue
pkg/embed/cluster.go:doStartLocked() also panics on CN Start() errors instead of propagating them, preventing callers from handling startup failures gracefully.
Proposed Fix
- Check for context cancellation/timeout before panicking in bootstrap
- Return error instead of panicking for context-related failures
- Change doStartLocked() to propagate errors instead of panicking