[jobmanager] Recover from job panics #169

Open
xaionaro wants to merge 2 commits into develop from bugfix/recover_from_job_panics

xaionaro
Member

@xaionaro xaionaro commented Jul 4, 2023

A debatable proposal (feel free to reject it without any explanation).

This mitigates problems like the following one:

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x2e4c403]

goroutine 17643 [running]:
osf/contest/plugins/reporters/purgatory.(*Reporter).getRackSerial(0x0?, {0xc000a72f40?, 0x0?, 0x0?})
	fbcode/osf/contest/plugins/reporters/purgatory/purgatory.go:272 +0x83
osf/contest/plugins/reporters/purgatory.(*Reporter).getFinalReport(0x312be36?, {0xc000a72f40?, 0x1, 0x1})
	fbcode/osf/contest/plugins/reporters/purgatory/purgatory.go:336 +0x19d
osf/contest/plugins/reporters/purgatory.(*Reporter).FinalReport(0xc000253088, {0xb25ae0, 0xc001168730}, {0x5c5b00?, 0xc000e19200?}, {0xc000a72f40?, 0xc00144b7e0?, 0xc00144b7e0?}, {0xb17cc0, 0xc0010c2960})
	fbcode/osf/contest/plugins/reporters/purgatory/purgatory.go:761 +0x18c
github.com/linuxboot/contest/pkg/runner.(*JobRunner).Run(0xc000270900, {0xb25ae0?, 0xc0011685a0}, 0xc000afdc20, 0x0)
	fbcode/third-party-source/go/github.com/linuxboot/contest/pkg/runner/job_runner.go:261 +0x1d83
github.com/linuxboot/contest/pkg/jobmanager.(*JobManager).runJob(0xc0001e5d90, {0xb25ae0, 0xc001629310}, 0xc000afdc20, 0xc0007dcf01?)
	fbcode/third-party-source/go/github.com/linuxboot/contest/pkg/jobmanager/start.go:110 +0x325
created by github.com/linuxboot/contest/pkg/jobmanager.(*JobManager).startJob
	fbcode/third-party-source/go/github.com/linuxboot/contest/pkg/jobmanager/start.go:85 +0x290

If a single job fails, the whole instance does not need to panic.
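
For illustration only, here is a minimal sketch of the general technique (a deferred `recover` in the goroutine that runs a job). It is not the diff from this pull request: `runJobSafely`, `reporter`, and the job closures are hypothetical names, and the real `JobManager.runJob` may be structured differently.

```go
package main

import (
	"fmt"
	"runtime/debug"
)

// runJobSafely is a hypothetical helper, not part of contest. It runs job and
// converts a panic raised on this goroutine into an ordinary error.
func runJobSafely(jobID int, job func() error) (err error) {
	defer func() {
		if r := recover(); r != nil {
			// Keep the stack trace so the original failure site is not lost.
			err = fmt.Errorf("job %d panicked: %v\n%s", jobID, r, debug.Stack())
		}
	}()
	return job()
}

// reporter is a stand-in for a plugin type; it is an assumption for this sketch.
type reporter struct{ name string }

func main() {
	// Simulates a plugin bug similar to the trace above: a nil pointer dereference.
	buggy := func() error {
		var r *reporter
		fmt.Println(r.name) // panics here
		return nil
	}
	healthy := func() error { return nil }

	done := make(chan error, 2)
	go func() { done <- runJobSafely(1, buggy) }()
	go func() { done <- runJobSafely(2, healthy) }()

	for i := 0; i < 2; i++ {
		if err := <-done; err != nil {
			fmt.Println("job failed, instance still running:", err)
		}
	}
}
```

Note that `recover` only catches panics raised on the goroutine where the deferred function runs. Goroutines spawned by the job itself would need their own guard, and fatal runtime errors such as concurrent map writes cannot be recovered at all.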

@xaionaro xaionaro force-pushed the bugfix/recover_from_job_panics branch 2 times, most recently from 84d6448 to 73ffdb3, on July 4, 2023 15:01
Signed-off-by: Dmitrii Okunev <xaionaro@meta.com>
@xaionaro xaionaro force-pushed the bugfix/recover_from_job_panics branch from 73ffdb3 to dfc6f87 on July 4, 2023 15:02
@codecov-commenter

codecov-commenter commented Jul 4, 2023

Codecov Report

Patch coverage: 100.00% and project coverage change: +0.02 🎉

Comparison is base (fa98f00) 61.77% compared to head (2ab1456) 61.80%.

❗ Current head 2ab1456 differs from pull request most recent head dfc6f87. Consider uploading reports for the commit dfc6f87 to get more accurate results

Additional details and impacted files
@@             Coverage Diff             @@
##           develop     #169      +/-   ##
===========================================
+ Coverage    61.77%   61.80%   +0.02%     
===========================================
  Files          131      131              
  Lines         9228     9234       +6     
===========================================
+ Hits          5701     5707       +6     
  Misses        2855     2855              
  Partials       672      672              
Flag         Coverage Δ
e2e          49.71% <100.00%> (+0.04%) ⬆️
integration  56.86% <100.00%> (+<0.01%) ⬆️
unittests    46.03% <0.00%> (-0.15%) ⬇️

Impacted Files                Coverage Δ
pkg/jobmanager/jobmanager.go  77.41% <100.00%> (+0.24%) ⬆️
pkg/jobmanager/start.go       76.85% <100.00%> (+0.89%) ⬆️


@mimir-d
Member

mimir-d commented May 9, 2024

What about the other possible failures? This now only handles the job start case, but the handling of other API events may fail too. Am I reading this wrong?
