New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Teleport signal handling and live reload. #1679

Merged
merged 1 commit into from Feb 14, 2018

Conversation

Projects
None yet
3 participants
@klizhentas
Copy link
Contributor

klizhentas commented Feb 13, 2018

This commit introduces signal handling.
Parent teleport process is now capable of forking
the child process and passing listeners file descriptors
to the child.

Parent process then can gracefully shutdown
by tracking the amount of current connections and
closing listeners once the amount goes to 0.

Here are the signals handled:

  • USR2 signal will cause the parent to fork
    a child process and pass listener file descriptors to it.
    Child process will close unused file descriptors
    and will bind to the used ones.

At this moment two processes - the parent
and the forked child process will be serving requests.
After looking at the traffic and the log files,
administrator can either shut down the parent process
or the child process if the child process is not functioning
as expected.

  • TERM, INT signals will trigger graceful process shutdown.
    Auth, node and proxy processes will wait until the amount
    of active connections goes down to 0 and will exit after that.

  • KILL, QUIT signals will cause immediate non-graceful
    shutdown.

  • HUP signal combines USR2 and TERM signals in a convenient
    way: parent process will fork a child process and
    self-initate graceful shutdown. This is a more convenient
    than USR2/TERM sequence, but less agile and robust
    as if the connection to the parent process drops, but
    the new process exits with error, administrators
    can lock themselves out of the environment.

Additionally, boltdb backend has to be phased out,
as it does not support read/writes by two concurrent
processes. This had required refactoring of the dir
backend to use file locking to allow inter-process
collaboration on read/write operations.

@klizhentas klizhentas requested review from r0mant , kontsevoy and russjones Feb 13, 2018

@klizhentas klizhentas force-pushed the sasha/reload branch from 19578b5 to 909a554 Feb 13, 2018

@klizhentas

This comment has been minimized.

Copy link
Contributor Author

klizhentas commented Feb 13, 2018

retest this please

@klizhentas klizhentas force-pushed the sasha/reload branch 2 times, most recently from 8bdde0e to b890087 Feb 13, 2018

}
return trace.ConvertSystemError(err)
}
if err := syscall.Flock(int(f.Fd()), syscall.LOCK_EX); err != nil {

This comment has been minimized.

@r0mant

r0mant Feb 13, 2018

Contributor

Use writeLock() method from above?

return trace.ConvertSystemError(err)
}
defer f.Close()
if err := writeLock(f); err != nil {

This comment has been minimized.

@r0mant

r0mant Feb 13, 2018

Contributor

Wasn't this file just locked a couple of lines above?

const teleportFilesEnvVar = "TELEPORT_OS_FILES"

func execPath() (string, error) {
name, err := exec.LookPath(os.Args[0])

This comment has been minimized.

@r0mant

r0mant Feb 13, 2018

Contributor

There's std os.Executable() method that seems to do what you need.

@@ -194,7 +173,7 @@ func Run(options Options) (executedCommand string, conf *service.Config) {
if err != nil {
utils.FatalError(err)
}
log.Info("teleport: clean exit")
log.Debugf("Clean exit.")

This comment has been minimized.

@r0mant

r0mant Feb 13, 2018

Contributor

Debug.

if ccf.Gops {
log.Debugf("starting gops agent")
err := agent.Listen(&agent.Options{Addr: ccf.GopsAddr})
log.Debugf("Starting gops agent.")

This comment has been minimized.

@r0mant

r0mant Feb 13, 2018

Contributor

Debug.

}
warnOnErr(sshProxy.Close())
} else {
log.Infof("Shutting down gracefully.")

This comment has been minimized.

@r0mant

r0mant Feb 13, 2018

Contributor

Info.

log.Infof("Proxy service exited.")
defer listeners.Close()
if payload == nil {
log.Infof("Shutting down immediately.")

This comment has been minimized.

@r0mant

r0mant Feb 13, 2018

Contributor

Info.

}
log.Infof("Exited.")

This comment has been minimized.

@r0mant

r0mant Feb 13, 2018

Contributor

Info.

@@ -402,7 +402,7 @@ func (s *server) diffConns(newConns, existingConns map[string]services.TunnelCon
}

func (s *server) Wait() {
s.srv.Wait()
s.srv.Wait(context.TODO())

This comment has been minimized.

@r0mant

r0mant Feb 13, 2018

Contributor

Would it be better to have a caller pass context to the outer method (similar to Shutdown below)? Or is it too much refactoring?

This comment has been minimized.

@klizhentas

klizhentas Feb 13, 2018

Author Contributor

yeah, no one is using it now, so I decided to pass

if err != nil {
err = trace.ConvertSystemError(err)
if !trace.IsNotFound(err) {
return err

This comment has been minimized.

@r0mant

r0mant Feb 13, 2018

Contributor

trace.Wrap(err).

This comment has been minimized.

@klizhentas

klizhentas Feb 13, 2018

Author Contributor

ConvertSystemError wraps

@klizhentas

This comment has been minimized.

Copy link
Contributor Author

klizhentas commented Feb 13, 2018

For folks who are interested why are we using flock vs other types of locks, read this article:

https://gavv.github.io/blog/file-locks/

and this one is also fun:

http://0pointer.de/blog/projects/locking.html

for i := 0; i < attempts; i++ {
go func(cnt int) {
err := s.bk.UpsertVal(bucket, "key", []byte("new-value"), backend.Forever)
err := s.bk.UpsertVal(bucket, "key", []byte(value1), time.Hour)

This comment has been minimized.

@klizhentas

klizhentas Feb 13, 2018

Author Contributor

switch to create back

}
s.Infof("Shutdown: waiting for %v connections to finish.", activeConnections)
lastReport := time.Time{}
ticker := time.NewTicker(s.shutdownPollPeriod)

This comment has been minimized.

@klizhentas

klizhentas Feb 13, 2018

Author Contributor

defer ticker.Stop()

@klizhentas klizhentas force-pushed the sasha/reload branch from b890087 to 731ac3b Feb 13, 2018

Teleport signal handling and live reload.
This commit introduces signal handling.
Parent teleport process is now capable of forking
the child process and passing listeners file descriptors
to the child.

Parent process then can gracefully shutdown
by tracking the amount of current connections and
closing listeners once the amount goes to 0.

Here are the signals handled:

* USR2 signal will cause the parent to fork
a child process and pass listener file descriptors to it.
Child process will close unused file descriptors
and will bind to the used ones.

At this moment two processes - the parent
and the forked child process will be serving requests.
After looking at the traffic and the log files,
administrator can either shut down the parent process
or the child process if the child process is not functioning
as expected.

* TERM, INT signals will trigger graceful process shutdown.
Auth, node and proxy processes will wait until the amount
of active connections goes down to 0 and will exit after that.

* KILL, QUIT signals will cause immediate non-graceful
shutdown.

* HUP signal combines USR2 and TERM signals in a convenient
way: parent process will fork a child process and
self-initate graceful shutdown. This is a more convenient
than USR2/TERM sequence, but less agile and robust
as if the connection to the parent process drops, but
the new process exits with error, administrators
can lock themselves out of the environment.

Additionally, boltdb backend has to be phased out,
as it does not support read/writes by two concurrent
processes. This had required refactoring of the dir
backend to use file locking to allow inter-process
collaboration on read/write operations.

@klizhentas klizhentas force-pushed the sasha/reload branch from 731ac3b to 68b65f5 Feb 13, 2018

@klizhentas klizhentas merged commit 650a222 into master Feb 14, 2018

1 check passed

Teleport Build finished.
Details

@russjones russjones referenced this pull request Feb 15, 2018

Merged

Key Generation Fixes #1686

@klizhentas klizhentas referenced this pull request Feb 19, 2018

Closed

Changelog #1703

@klizhentas klizhentas deleted the sasha/reload branch Mar 21, 2018

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment