Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Exposing more Envoy configuration knobs #2225

Open
pims opened this issue Feb 13, 2020 · 11 comments
Open

Exposing more Envoy configuration knobs #2225

pims opened this issue Feb 13, 2020 · 11 comments

Comments

@pims
Copy link
Contributor

@pims pims commented Feb 13, 2020

Please describe the problem you have
[A clear, concise, description of the problem you are facing. What is the problem that feature X would solve for you?]

When deploying Contour/Envoy in a large cluster with heterogenous workloads (gRPC, websockets, h2, etc.), some Envoy settings become quite important to get right.

Here's a subset of settings we have configured through a custom patch:

{
  DrainTimeout:      safetime(customCtx.DrainTimeout),
  IdleTimeout:       safetime(customCtx.IdleTimeout), 
  RequestTimeout:    safetime(customCtx.RequestTimeout),
  StreamIdleTimeout: safetime(customCtx.StreamIdleTimeout),

  PerConnectionBufferLimitBytes: customCtx.PerConnectionBufferLimitBytes,
  MaxConcurrentStreams:          customCtx.MaxConcurrentStreams,
  ...
}

Most of these are part of the HTTPConnectionManager filter configuration.
We have been using these settings with great success for several months in production now, and would be interested in contributing these changes upstream.

My questions to the Contour team: would this fall into the scope of what you envision for Contour?

If you believe that this is out of scope for this project, would you be open to minor code changes that would make it easier for people to patch.

A concrete example being the addition of requestTimeout in commit d197482

If this could be wrapped in some kind of type HTTPConnectionManagerConfiguration struct{}, adding unsupported fields would no longer require changing the function signature, which is used extensively through the comprehensive test suite.

Examples of features we have added over the years:

  • tracing
  • timeouts
  • limits for annotations (maxRetries, maxTimeout, maxPendingRequests…)
  • ext-authz

Worth noting, we were able to retain all of Contour's simplicity through custom defaults while giving us the ability to configure Envoy to our needs.

wdyt?

@jpeach

This comment has been minimized.

Copy link
Contributor

@jpeach jpeach commented Feb 13, 2020

xref #1902
xref #1375

@stevesloka

This comment has been minimized.

Copy link
Member

@stevesloka stevesloka commented Feb 13, 2020

@davecheney

This comment has been minimized.

Copy link
Contributor

@davecheney davecheney commented Feb 13, 2020

Thank you for raising this issue. I want to give a bit of background.

The general policy of this project has been to avoid knobs and tuneable for their own sake where possible. We've done this in the past by adjusting our defaults, and only in the cases where there are no generally applicable defaults exposed those tuneable to contour users.

What I would propose is that rather than tackling six different aspects of connection manager parameters, which are only tangentially related by the fact that they are all kind of tuneable things and are all presented in a single issue, we break this request into different parts, because before implementing anything, we need to have a user story, a reason to do something, and a way to know when we did it correctly. The more different requests are mixed into one issue, the lower the chance of success.

Does that sound reasonable to you?

@pims

This comment has been minimized.

Copy link
Contributor Author

@pims pims commented Feb 13, 2020

@davecheney totally reasonable. Thanks for sharing your reasoning.

I mentioned the HTTPConnectionManager configuration to share some of the options we chose to explicitly configure, and how we were able to configure all of them as a group of options, rather than individual changes.

Our starting point was the recommended configuration described at https://www.envoyproxy.io/docs/envoy/latest/configuration/best_practices/edge

and discussed by @mattalberts in this thread: #1375 (comment)

What do you think about starting with max_connection_duration. I'll open an issue for it unless, there is another configuration setting that you think is more important.

@jpeach

This comment has been minimized.

Copy link
Contributor

@jpeach jpeach commented Feb 13, 2020

@davecheney Although historically Contour has always aimed for the defaults to be universal, maybe there's a case for a general envoy parameterization mechanism so that we don't end up adding support for a lot of different parameters on a case-by-case basis?

@davecheney

This comment has been minimized.

Copy link
Contributor

@davecheney davecheney commented Feb 13, 2020

WRT max connection duration, what’s the use case for limiting it?

@davecheney

This comment has been minimized.

Copy link
Contributor

@davecheney davecheney commented Feb 13, 2020

@pims

This comment has been minimized.

Copy link
Contributor Author

@pims pims commented Feb 18, 2020

WRT max connection duration, what’s the use case for limiting it?

@davecheney our experience of providing ingress capabilities for all the teams using our cluster has led us to enforce cluster-wide limits to help with potential application issues.

One concrete example: we’ve hit this issue many times golang/go#30702 and applying max-duration for connection helped in mitigating its impact.

Choosing a good default applicable to different use cases seems somewhat tricky, hence the original desire to make this a configurable option with sane default.

@davecheney

This comment has been minimized.

Copy link
Contributor

@davecheney davecheney commented Feb 18, 2020

@pims thanks for the information. What would be a good value for you?

@pims

This comment has been minimized.

Copy link
Contributor Author

@pims pims commented Feb 18, 2020

@davecheney we’ve configured most of our timeouts to have a value of limit(annotationValue, 300s) with a drainTimeout of 30s.

Let me know if you'd like to move this conversation to a dedicated issue. And we can close this one.

@davecheney

This comment has been minimized.

Copy link
Contributor

@davecheney davecheney commented Feb 18, 2020

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Contour Project Board
  
Unprioritized
Linked pull requests

Successfully merging a pull request may close this issue.

None yet
4 participants
You can’t perform that action at this time.