
force quit ipfs-cluster-service on second ctrl-c #358

Merged

merged 1 commit into master from feat/cmd/svc/force-quit on Apr 4, 2018

Conversation

lanzafame
Contributor

Resolves: #258.

@ghost ghost assigned lanzafame Mar 23, 2018
@ghost ghost added the status/in-progress In progress label Mar 23, 2018
@coveralls

coveralls commented Mar 23, 2018

Coverage Status

Coverage decreased (-0.1%) to 66.352% when pulling 685f58e on feat/cmd/svc/force-quit into f8acd4f on master.

ZenGround0
ZenGround0 previously approved these changes Mar 23, 2018
Collaborator

@ZenGround0 left a comment

Apart from some minor differences of opinion about naming, this LGTM. I'm personally not worried about the Code Climate cognitive complexity.

return cfg, &cfgs{clusterCfg, apiCfg, ipfshttpCfg, consensusCfg, trackerCfg, monCfg, diskInfCfg, numpinInfCfg}
}

type cfgs struct {
Collaborator

Thanks for this

Contributor Author

My pleasure 😄 The giant function signatures were bugging me.

@@ -222,20 +222,20 @@ configuration.
},
Action: func(c *cli.Context) error {
userSecret, userSecretDefined := userProvidedSecret(c.Bool("custom-secret"))
cfg, clustercfg, _, _, _, _, _, _, _ := makeConfigs()
defer cfg.Shutdown() // wait for saves
cfgmgr, cfgs := makeConfigs()
Collaborator

This is very subjective, but I'm wondering if there is a name for this variable that's easier to read. I can't pin down exactly what it is about this name that trips me up, so it might just be me. Some alternative ideas: just mgr? Just cfg? cfgMgr? configManager?

Contributor Author

I'm happy to do cfgMgr; it is more idiomatic anyway. @ZenGround0 @hsanjuan to clarify, the config manager loads the initial configuration from file, generates multiaddrs for the new node, and then saves that configuration back to file?

Collaborator

Yeah, that sounds mostly right. The config manager registers all of the other configs in makeConfigs so it can "manage" them. It only generates default values in the case of a call to the init subcommand; from then on it loads the saved config values after registration. As you've noticed, the manager watches for certain changes to its registered component configs and then saves the new values. Also, a cluster calls the manager's Validate function to validate each registered component config during initialization.

@lanzafame
Contributor Author

@ZenGround0 naming change has been made 👍 and thanks for the explanation of the config manager

Collaborator

@hsanjuan left a comment

Small things, lgtm otherwise.

It may need a rebase though, when I merge the libp2p-http stuff. We'll see.

}
go func() {
alreadyExiting = true
cluster.Shutdown()
Collaborator

err = missing

for {
select {
case <-signalChan:
err = cluster.Shutdown()
checkErr("shutting down cluster", err)
if alreadyExiting {
Collaborator

This is potentially racy. If 2 signals are received quickly, it may try to read alreadyExiting at the same time it's writing it. Easiest solution is to move alreadyExiting = true out of the goroutine.

hsanjuan
hsanjuan previously approved these changes Mar 26, 2018
@hsanjuan
Collaborator

Cool, thanks!

@lanzafame needs rebase though!

@lanzafame
Contributor Author

@hsanjuan rebase done 👍

Collaborator

@hsanjuan left a comment

@lanzafame I have been thinking... because forcing an exit is a potentially really bad thing, we shouldn't allow the second ctrl-c too close to the first one. So we should discard any second ctrl-c that happens within, say, 10 seconds of the first one.

After 10 seconds, we should inform the user that, from that point, she can force-close cluster: something like Shutdown is taking long. Press Ctrl-c again to manually kill cluster. Note that this may corrupt the local cluster state.

@hsanjuan
Collaborator

Also, let's fix that code climate warning.

@lanzafame
Contributor Author

> Also, let's fix that code climate warning.

@hsanjuan I have run the codeclimate analysis locally on the branch and it isn't complaining about the cyclomatic complexity of daemon there. I am guessing it hasn't properly grabbed the latest changes.

> I have been thinking... because forcing an exit is a potentially really bad thing, we shouldn't allow the second ctrl-c too close to the first one. So we should discard any second ctrl-c that happens within, say, 10 seconds of the first one.

I understand where you are coming from with this; I had the same thought. But is there really much benefit to forcing the user to wait 10 seconds? I agree that we should definitely print the warning message about corrupting the cluster state, but I think we should leave it up to the user to decide whether they want to risk it. IMHO, a prompt to confirm force quitting might be a better option. Thoughts?

@hsanjuan
Collaborator

@lanzafame what I'm fearing is that you can kill go-ipfs with double ctrl-c without any consequences, but in cluster you might interrupt the snapshotting process and screw the snapshot. So this should only be done if snapshotting gets stuck. We should also avoid accidental double ctrl-c.

Perhaps we can make it 3 ctrl-c. One shutdown, second prints warning about a third. Third kills cluster?

@lanzafame
Contributor Author

lanzafame commented Mar 28, 2018 via email

Collaborator

@hsanjuan left a comment

@lanzafame I'm not much into interactive questioning.

It may be that the signal does not come from a ctrl-c either, but from a kill. In general I think it's an anti-pattern when a daemon, which is supposed to be running in the background, blocks and asks something interactively.

Plus, unmuted logs don't respect the set log levels, and errors should not be lost because we're waiting on some user input.

Plus, a second ctrl-c re-launches the shutdown goroutine.

And a third ctrl-c would relaunch the shutdown goroutine and re-ask while the other prompt is still waiting to read an answer.

Can we do the 3 ctrl-c option? 1) shutdown 2) warning message 3) kill

@lanzafame
Contributor Author

lanzafame commented Mar 28, 2018 via email

@lanzafame
Contributor Author

@hsanjuan changes up 👍

@hsanjuan
Collaborator

@lanzafame something is funky with sharness tests. Can you have a look?

hsanjuan
hsanjuan previously approved these changes Mar 28, 2018
Collaborator

@hsanjuan left a comment

LGTM, but tests should be green. Also, holding this off until I have released 0.3.5.

@ZenGround0
Collaborator

My guess is that our sharness helper function cluster-kill now has the power to force a shutdown and is causing the corrupted state that we are warning about.

We could make cluster_kill parse output and hold off for a longer backoff if it sees the first signal has been received. Increasing the backoff time entirely might also give some relief to sharness tests but is probably going to be racy.

@hsanjuan
Collaborator

but cluster-kill only sends the signal once, so it shouldn't affect anything

@ZenGround0
Collaborator

Duh, thanks for the correction.

@lanzafame
Contributor Author

It seems that the following test is the first to fail and also is the cause of cascading failures in subsequent tests:

*** t0025-ctl-status-report-commands.sh ***
not ok 1 - cluster-ctl can read id
#	
#	    id=`cluster_id`
#	    [ -n "$id" ] && ( ipfs-cluster-ctl id | egrep -q "$id" )
#	

But when running sharness tests locally, this one passes, and at the time of commenting none of the others have failed (will edit if they do). I have kicked off the Travis CI builds to see if they pass; if they don't, I will dig in, but the failure seems rather unrelated to the changed code.

switch ctrlcCount {
case 1:
go func() {
time.Sleep(1 * time.Minute)
Contributor Author

Of course, stupid me forgot to remove the sleep statement before pushing up 🤦‍♂️. I needed it to test cluster 'hanging'.

Refactor daemon() to reduce code complexity.

Refactor configuration in ipfs-cluster-service.

License: MIT
Signed-off-by: Adrian Lanzafame <adrianlanzafame92@gmail.com>
@lanzafame
Contributor Author

@hsanjuan @ZenGround0 So my time.Sleep stuff-up was what was causing the weird errors.

@hsanjuan hsanjuan added RFM Ready for Merge and removed status/in-progress In progress labels Mar 29, 2018
@hsanjuan hsanjuan merged commit c651aed into master Apr 4, 2018
@hsanjuan hsanjuan deleted the feat/cmd/svc/force-quit branch April 4, 2018 10:22
@hsanjuan hsanjuan mentioned this pull request Apr 5, 2018