removing timeout when starting a node #1237

Merged
merged 5 commits into from Sep 29, 2021

vishalchangrani
Contributor

Node startup can take a very long time for certain node types, e.g. Execution nodes. A timeout is not really needed when a node starts or stops.

Contributor

@huitseeker huitseeker left a comment

I'm wondering about the wisdom of having no timeout rather than a very long one (10-30 min?). Don't we use this in tests too?

@Kay-Zee
Member

Kay-Zee commented Sep 19, 2021

Do we use the timeout for tests? I'm not really following the logic of needing a startup timeout for a node in normal operations, but I could be convinced for tests.

@vishalchangrani
Contributor Author

vishalchangrani commented Sep 20, 2021

T-systems have reported that they see a timeout on node startup and that their node enters a restart loop.

Hi @Vishal we started the node with actual parameters and image, but we always receive this: {"level":"fatal","node_role":"execution","node_id":"2b396b7fab0102f104a2af7e095b145cc14da28f863564802e158afc3e07e638","time":"2021-09-17T10:10:22Z","message":"node startup timed out"}

I would like to remove the timeout altogether. The unit tests have an inherent timeout of 10 mins, which should be good enough.

@vishalchangrani
Contributor Author

@smnzhu @huitseeker @Kay-Zee bumping up this PR.

@codecov-commenter

codecov-commenter commented Sep 20, 2021

Codecov Report

Merging #1237 (645435b) into master (5ad4831) will decrease coverage by 0.05%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##           master    #1237      +/-   ##
==========================================
- Coverage   54.86%   54.80%   -0.06%     
==========================================
  Files         504      504              
  Lines       31918    31918              
==========================================
- Hits        17512    17494      -18     
- Misses      12031    12050      +19     
+ Partials     2375     2374       -1     
| Flag | Coverage Δ |
|------|------------|
| unittests | 54.80% <ø> (-0.06%) ⬇️ |

Flags with carried forward coverage won't be shown.

| Impacted Files | Coverage Δ |
|----------------|------------|
| ...sus/approvals/assignment_collector_statemachine.go | 42.30% <0.00%> (-9.62%) ⬇️ |
| ...ngine/common/synchronization/finalized_snapshot.go | 68.75% <0.00%> (-4.17%) ⬇️ |
| admin/command_runner.go | 79.69% <0.00%> (-1.51%) ⬇️ |
| engine/collection/synchronization/engine.go | 62.90% <0.00%> (-1.08%) ⬇️ |
| engine/common/synchronization/engine.go | 68.78% <0.00%> (-1.06%) ⬇️ |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 5ad4831...645435b.

Contributor

@huitseeker huitseeker left a comment

Accepting to unblock; it seems clear that the current timeout is too small, at least.

Contributor

@peterargue peterargue left a comment

From an operational perspective, it seems like having some timeout makes sense, though definitely much longer than 1 minute. If a node hangs on initial startup or a later restart, operators will want their supervisor tasks to restart the process. If it hangs for a long time or indefinitely, they'll require manual intervention.

I'm not sure about what startup/shutdown step could possibly hang that long, so I'm fine with this if long hangs are not a concern.

@Kay-Zee
Member

Kay-Zee commented Sep 29, 2021

bors merge

@bors
Contributor

bors bot commented Sep 29, 2021

@bors bors bot merged commit a78ab46 into master Sep 29, 2021
@bors bors bot deleted the vishal/remove_startup_timeout branch September 29, 2021 23:29