pre-commit.ci timing out when passes locally #62

Closed
matthewfeickert opened this issue Apr 27, 2021 · 7 comments
@matthewfeickert

matthewfeickert commented Apr 27, 2021

👋 Hi. pre-commit.ci is failing with a timeout for PR scikit-hep/pyhf#1403 even though everything passes locally in a fresh virtual environment (and the hooks have also passed on pre-commit.ci before timing out on later runs).

c.f. https://results.pre-commit.ci/repo/github/118789569

[screenshot: pre-commit_failure]

and for a particular failing run

[screenshot: run_failure]

This is probably just a transitory issue, but I thought I'd still report it.

cc @lukasheinrich @kratsg


Also, here is an example backing up my claim that pre-commit passes locally:

(base) $ git checkout feat/clean-public-api-all  # Branch for the PR that is failing
(base) $ pyenv virtualenv 3.8.7 test-pre-commit
(base) $ pyenv activate test-pre-commit 
(test-pre-commit) $ pip install --upgrade pip setuptools wheel
(test-pre-commit) $ pip install pre-commit
(test-pre-commit) $ pre-commit run --all-files
Check for added large files..............................................Passed
Check for case conflicts.................................................Passed
Check for merge conflicts................................................Passed
Check for broken symlinks................................................Passed
Check JSON...............................................................Passed
Check Yaml...............................................................Passed
Check Toml...............................................................Passed
Check Xml................................................................Passed
Debug Statements (Python)................................................Passed
Fix End of Files.........................................................Passed
Mixed line ending........................................................Passed
Fix requirements.txt.................................(no files to check)Skipped
Trim Trailing Whitespace.................................................Passed
black....................................................................Passed
blacken-docs.............................................................Passed
flake8...................................................................Passed
pyupgrade................................................................Passed
nbqa-black...............................................................Passed
nbqa-pyupgrade...........................................................Passed
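
For reference, individual hooks can also be timed in the same environment to see whether one stage is unusually slow locally; the hook IDs here are assumed to match the hook names shown above, and local timings will of course differ from the pre-commit.ci runners:

(test-pre-commit) $ time pre-commit run nbqa-pyupgrade --all-files
(test-pre-commit) $ time pre-commit run flake8 --all-files
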
@asottile
Member

very strange, timings do look elevated today according to my metrics -- let me look into whether something changed

matthewfeickert changed the title from "pre-commit.ci timing out when passes locally" to "pre-commit.ci timing out on nbqa-pyupgrade when passes locally" on Apr 27, 2021
matthewfeickert changed the title from "pre-commit.ci timing out on nbqa-pyupgrade when passes locally" back to "pre-commit.ci timing out when passes locally" on Apr 27, 2021
@matthewfeickert
Author

I doubt this matters much, but the three timeouts shown above are happening at different stages:

@matthewfeickert
Author

matthewfeickert commented Apr 27, 2021

@asottile
Member

yeah the queue makes sense, I was kicking off a bunch of runs at the same time while the hosts were cycling.

there were no code changes during the period that led to the higher timeouts; I suspect one of the hosts got a noisy neighbor in AWS.


I'll be putting in some automated alerts to catch this particular failure mode in the future -- thanks for the report!

I'm going to send a message to the mailing list to make sure others know about this and follow up with a postmortem once I'm comfortable that it is resolved

I'll be watching this closely over the next couple of hours to make sure that fixed it

I'll also be sending out a postmortem entry to the mailing list

@matthewfeickert
Author

Awesome. :) Many thanks for this report and also for being ⚡ fast in your feedback and help!

@asottile
Member

marking this all clear, run times have returned to normal after mitigation


postmortem

root cause

unknown

  • no code changes occurred before or after the incident or to mitigate the incident
  • observable host level metrics (cpu / io) were not elevated on any of the affected hosts

what went well

  • run-level metrics were extremely helpful for identifying the affected timeframe and validating the fix
  • host rotation was quick and easy (already scripted)
  • helpful issue created by @matthewfeickert alerting to the problem

what didn't go well

  • detection: slow and entirely manual
  • prevention: unknown what caused the actual root issue

follow-up

  • detection: add automated alerting for elevated timing (a rough sketch of such a check follows below)
  • prevention: investigate larger ec2 instance sizes for performance and to lessen "noisy neighbor" effects
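
For illustration, a minimal sketch of the kind of check such an automated alert could run; the durations.log input, the 300-second threshold, and the alert address are all placeholders rather than pre-commit.ci's actual setup:

# hypothetical input: durations.log, one run duration in whole seconds per line
threshold=300
p95=$(sort -n durations.log | awk '{a[NR]=$1} END {if (NR==0) {print 0} else {i=int(NR*0.95); if (i<1) i=1; print a[i]}}')
if [ "$p95" -gt "$threshold" ]; then
    # placeholder notification channel; swap in whatever alerting tool is actually in use
    echo "p95 pre-commit.ci run time ${p95}s exceeds ${threshold}s" | mail -s "timing alert" alerts@example.com
fi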

