-
Notifications
You must be signed in to change notification settings - Fork 38.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add option to k8s apiserver to reject incoming requests upon audit failure #65763
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just a handful of nits, then LGTM.
func auditEventProcessingError(w http.ResponseWriter, req *http.Request) { | ||
w.Header().Set("Content-Type", "text/plain") | ||
w.Header().Set("X-Content-Type-Options", "nosniff") | ||
w.WriteHeader(http.StatusInsufficientStorage) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this is the right error code... maybe just return a 500?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this is a plain 500, then we won't be able to differentiate it from other errors. The one I used is from 5xx family, since it's a server side error. At the same time, it is more specific than "something broke server side" which is what 500 means. Maybe someone from apimachinery should take a look?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I was thinking maybe we don't want to distinguish it intentionally. Error messages like this can leak useful data to attackers, so distinguishing this case makes me a little nervous. Debug logs & metrics can provide information for those authorized, without clueing in the (possibly unprivileged) requester.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is a valid point. At the same time, using the same error code will limit the ability to monitor how often requests are rejected because of audit backend failures. We'll be able to see just how often they are rejected in general. Is it possible to have both security and observability?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you're talking about monitoring from the client side, then no, not without returning this information. But if you're talking about monitoring the server, then that's what prometheus metrics are for.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought about existing prometheus metric that already exposes http response code, but you're right, this can be a separate metric exclusively for audit errors.
@@ -30,19 +30,19 @@ import ( | |||
) | |||
|
|||
// Avoid emitting errors that look like valid HTML. Quotes are okay. | |||
var sanitizer = strings.NewReplacer(`&`, "&", `<`, "<", `>`, ">") | |||
var Sanitizer = strings.NewReplacer(`&`, "&", `<`, "<", `>`, ">") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this rendered correctly for plain content? I would think setting content-type + nosniff should be sufficient...
@@ -30,19 +30,19 @@ import ( | |||
) | |||
|
|||
// Avoid emitting errors that look like valid HTML. Quotes are okay. | |||
var sanitizer = strings.NewReplacer(`&`, "&", `<`, "<", `>`, ">") | |||
var Sanitizer = strings.NewReplacer(`&`, "&", `<`, "<", `>`, ">") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use html.EscapeString
instead? https://golang.org/pkg/html/
w.Header().Set("Content-Type", "text/plain") | ||
w.Header().Set("X-Content-Type-Options", "nosniff") | ||
w.WriteHeader(http.StatusInsufficientStorage) | ||
fmt.Fprintf(w, "Server couldn't store an audit event for request: %q", responsewriters.Sanitizer.Replace(req.RequestURI)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe just use responsewriters.InternalErorr
?
} | ||
} | ||
}() | ||
return sendErr == nil |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
alternatively, just use a named return value. I'll let you decide which is cleaner.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, buffer will not be used when blocking-strict is selected, changed to true to keep existing behavior.
@@ -25,7 +25,8 @@ type Sink interface { | |||
// Errors might be logged by the sink itself. If an error should be fatal, leading to an internal | |||
// error, ProcessEvents is supposed to panic. The event must not be mutated and is reused by the caller | |||
// after the call returns, i.e. the sink has to make a deepcopy to keep a copy around if necessary. | |||
ProcessEvents(events ...*auditinternal.Event) | |||
// Returns true on success, may return false on error. | |||
ProcessEvents(events ...*auditinternal.Event) bool |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm, I was surprised you didn't just return the error, but I guess since we're handling the errors in-place it makes sense.
for _, ev := range events { | ||
b.logEvent(ev) | ||
if !b.logEvent(ev) { | ||
success = false |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
alternatively: success = success && b.logEvent(ev)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This would cause events to stop being logged after first failure.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
success = b.logEvent(ev) && success
would work though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh yeah, good call.
|
||
type safeBackend struct { | ||
// The delegate backend that actually exports events. | ||
delegateBackend audit.Backend |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: just embed it, then you only need to define ProcessEvents.
limitations under the License. | ||
*/ | ||
|
||
package safe |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A dedicated package feels a bit heavy weight for this.... maybe just embed it in options? (should only need ~6 lines)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you mean by embedding this in options? Wouldn't it require handling this flag in every backend?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I meant just append https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/apiserver/pkg/server/options/audit.go with:
type safeBackend struct {
audit.Backend
}
func (s *safeBackend) ProcessEvents(ev ...*auditinternal.Event) bool {
s.Backend.ProcessEvents(ev...)
return true
}
Since it's only used in that file anyway.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Geez, you're right, that makes way more sense. Thanks, updated.
@@ -25,7 +25,8 @@ type Sink interface { | |||
// Errors might be logged by the sink itself. If an error should be fatal, leading to an internal | |||
// error, ProcessEvents is supposed to panic. The event must not be mutated and is reused by the caller | |||
// after the call returns, i.e. the sink has to make a deepcopy to keep a copy around if necessary. | |||
ProcessEvents(events ...*auditinternal.Event) | |||
// Returns true on success, may return false on error. | |||
ProcessEvents(events ...*auditinternal.Event) bool |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about the buffered webhook audit? Does this function always return true in this scenario?
} | ||
} | ||
}() | ||
return sendErr == nil |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Saw it, return false when the buffer is full.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, changed to true since buffered backend will not be used when blocking-strict is selected.
@@ -56,7 +56,11 @@ func WithAudit(handler http.Handler, sink audit.Sink, policy policy.Checker, lon | |||
} | |||
|
|||
ev.Stage = auditinternal.StageRequestReceived | |||
processAuditEvent(sink, ev, omitStages) | |||
if !processAuditEvent(sink, ev, omitStages) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Only reject request on stage StageRequestReceived
, It is not consistent ignoring other stags audit failure.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I doesn't make sense to reject request at later stages (at least for mutating requests), as the request will already have been processed at this point.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes,it has been processed. But also without audit logs of these stages. So what i want to say is this pr does not handle lost audit events
completely.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With my change, an auditor can get a guarantee that if there was no event with RequestReceived stage, there was no change to apiserver state and the caller will have a guarantee that if the request failed, the state wasn't changed.
With what you are suggesting, auditor will still get the guarantee that no event implies no change, but you will take away that guarantee from the caller. When request fails, you don't really know what happened.
One change I think might make sense here would be to reject non-mutating requests upon failures on other stages. WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, this is kind of enhancement whatever .
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After some thought I don't like the idea of rejecting requests inconsistently - with audit policy configured to omit RequestReceived stage this would mean that audit backend issue will block all non-mutating requests, but allow mutating ones.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@x13n I don't understand your response to this - The code that is here only rejects requests when the RequestReceived
event cannot be written. I think this is the correct behavior, but I'm not sure it matches what you said above?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Apologies for being unclear. Let me try to clarify what I meant above.
I think this check should be applied only at RequestReceived
stage. Later on, a failure to audit log anything shouldn't affect apiserver response, because the request was already processed, possibly modifying etcd state. Now, one could argue that there is no harm in rejecting the request if it was non-mutating, but failed to be audit logged e.g. at ResponseCompleted
stage. However, this would lead to some weirdness, when RequestReceived
stage is omitted and audit backend has problems: mutating requests would ignore audit logging failures and simply work, while read only requests would be rejected because of audit backend problems.
/retest |
/retest |
/assign @deads2k |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm, please squash commits
@@ -56,7 +56,11 @@ func WithAudit(handler http.Handler, sink audit.Sink, policy policy.Checker, lon | |||
} | |||
|
|||
ev.Stage = auditinternal.StageRequestReceived | |||
processAuditEvent(sink, ev, omitStages) | |||
if !processAuditEvent(sink, ev, omitStages) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@x13n I don't understand your response to this - The code that is here only rejects requests when the RequestReceived
event cannot be written. I think this is the correct behavior, but I'm not sure it matches what you said above?
Thanks! Squashed. |
Ping. Is there any additional work needed in this PR? |
/lgtm |
@@ -327,10 +333,26 @@ func (o *AuditBatchOptions) AddFlags(pluginName string, fs *pflag.FlagSet) { | |||
"moment if ThrottleQPS was not utilized before. Only used in batch mode.") | |||
} | |||
|
|||
type safeBackend struct { | |||
audit.Backend | |||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like type safeBackend should be put in package pluginbuffered
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I initially had a separate package for this, but @tallclair convinced me this makes more sense here, as it is quite trivial. Since nothing in safeBackend is pluginbuffered-specific, I'd rather leave it here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure. Thanks.
@@ -57,6 +57,8 @@ type backend struct { | |||
|
|||
// Encoder used to calculate audit event sizes. | |||
e runtime.Encoder | |||
|
|||
// Value returned from ProcessEvents when there was any failure. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch, some leftover. Removed now.
/retest Review the full test history for this PR. Silence the bot with an |
12 similar comments
/retest Review the full test history for this PR. Silence the bot with an |
/retest Review the full test history for this PR. Silence the bot with an |
/retest Review the full test history for this PR. Silence the bot with an |
/retest Review the full test history for this PR. Silence the bot with an |
/retest Review the full test history for this PR. Silence the bot with an |
/retest Review the full test history for this PR. Silence the bot with an |
/retest Review the full test history for this PR. Silence the bot with an |
/retest Review the full test history for this PR. Silence the bot with an |
/retest Review the full test history for this PR. Silence the bot with an |
/retest Review the full test history for this PR. Silence the bot with an |
/retest Review the full test history for this PR. Silence the bot with an |
/retest Review the full test history for this PR. Silence the bot with an |
/hold I need to fix some merge conflicts and failing tests... |
New changes are detected. LGTM label has been removed. |
/retest |
@sttts I rebased this and fixed failing tests. PTAL. |
/retest Review the full test history for this PR. Silence the bot with an |
1 similar comment
/retest Review the full test history for this PR. Silence the bot with an |
What this PR does / why we need it: Part of #65266
Which issue(s) this PR fixes (optional, in
fixes #<issue number>(, fixes #<issue_number>, ...)
format, will close the issue(s) when PR gets merged):Special notes for your reviewer:
Release note:
cc @loburm @tallclair @wojtek-t