Skip to content
This repository has been archived by the owner on Jun 14, 2018. It is now read-only.

Add proxy watcher and auto-configuration for Envoy #32

Merged
merged 3 commits into from
Jan 11, 2017

Conversation

kyessenov
Copy link
Contributor

@kyessenov kyessenov commented Jan 10, 2017

Complete the proxy supervising process basic implementation.
In this configuration, proxy container has a watcher process that watches for changes to CDS and triggers a hot-reload in Envoy. The watcher creates an Envoy config file from CDS service definitions and inserts a reference to another manager instance acting as SDS.

Eventually, once CDS API is available, we do not need the hot reload. Hot reload would still be necessary for RDS or anything else that's not captured by Envoy runtime APIs.

@rshriram
Copy link
Member

Just a note: this will change completely once Envoy has CDS with polling. We won't need to do hot restarts. Will review tomorrow AM.

@kyessenov
Copy link
Contributor Author

Yeah, the plan is that CDS/RDS would fill out most of the config at runtime. We still need first-time config generation in the proxy container, so the code will probably stay.


// Spin up a new Envoy process.
cmd := exec.Command("envoy", "-c", s.config, "--restart-epoch", fmt.Sprint(restartEpoch))
cmd := exec.Command(s.binary, "-c", fname, "--restart-epoch", fmt.Sprint(s.epoch),
"--drain-time-s", "60", "--parent-shutdown-time-s", "90")
Copy link
Member

@GregHanson GregHanson Jan 10, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We also recently discovered the --drain-time-s and --parent-shutdown-time-s flags were needed for quicker envoy config file change propagation in amalgam8 . . . what factors contributed to you deciding on 60 and 90 seconds for these values?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Recommendation from Envoy documentation.
Ultimately, this delay should be determined by the frequency of restarts, and we do not know that yet.
60s seems like reasonable time.

@ijsnellf ijsnellf self-requested a review January 10, 2017 17:27

// Add the new Envoy process to the known set of running Envoy processes.
s.cmdMap[cmd] = restartEpoch
s.cmdMap[cmd] = s.epoch
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To successfully launch a new Envoy process that will replace the running Envoy processes, the restart epoch of the new process must be exactly 1 greater than the highest restart epoch of the currently running Envoy processes. The original restart epoch logic handled the case where envoy may have crashed. This PR always increments the restart epoch but does not handle the case where the last process exited.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If there are gaps in the the restart epoch, it will throw a core dump

# ps -ef
UID        PID  PPID  C STIME TTY          TIME CMD
root       130    54  0 18:54 ?        00:00:00 envoy -c /etc/envoy/envoy.json --restart-epoch 2

# envoy -c /etc/envoy/envoy.json --restart-epoch 4 &
[4] 140
root@06314a4f87f1:/# [2017-01-10 18:54:56.513][140][warning][main] initializing epoch 4 (hot restart version=3.2490504)
[2017-01-10 18:54:56.514][140][critical][assert] assert failure: rc != -1: /source/source/exe/hot_restart.cc:251

[4]+  Aborted                 (core dumped) envoy -c /etc/envoy/envoy.json --restart-epoch 4

# ps -ef                                                                                        
UID        PID  PPID  C STIME TTY          TIME CMD
root       130    54  0 18:54 ?        00:00:00 envoy -c /etc/envoy/envoy.json --restart-epoch 2

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for pointing this out!
This was explained in the code but I was looking at the Envoy documentation which is not as explicit.
There is still a race condition possible if the latest Envoy dies in the middle of Reload().
I added a short delay to avoid stressing this race. I really think Envoy should relax that epoch+1 requirement; otherwise, I just don't see how to solve this race.

Copy link
Contributor

@ijsnellf ijsnellf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@kyessenov kyessenov merged commit e44b42c into istio:master Jan 11, 2017
@kyessenov kyessenov deleted the static_config branch January 11, 2017 18:33
ayj pushed a commit to ayj/pilot that referenced this pull request Jan 30, 2017
* Add proxy watcher and auto-configuration for Envoy

* Make the hot restart faster: 90s instead of 15m

* Correct restart epoch and active config determination
@lizan lizan mentioned this pull request Sep 7, 2017
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants