Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add option to k8s apiserver to reject incoming requests upon audit failure #65763

Merged
merged 1 commit into from
Nov 17, 2018

Conversation

x13n
Copy link
Member

@x13n x13n commented Jul 3, 2018

What this PR does / why we need it: Part of #65266

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):

Special notes for your reviewer:

Release note:

apiserver can be configured to reject requests that cannot be audit-logged.

cc @loburm @tallclair @wojtek-t

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Jul 3, 2018
@k8s-ci-robot k8s-ci-robot requested review from enj and soltysh July 3, 2018 12:59
Copy link
Member

@tallclair tallclair left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just a handful of nits, then LGTM.

func auditEventProcessingError(w http.ResponseWriter, req *http.Request) {
w.Header().Set("Content-Type", "text/plain")
w.Header().Set("X-Content-Type-Options", "nosniff")
w.WriteHeader(http.StatusInsufficientStorage)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is the right error code... maybe just return a 500?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this is a plain 500, then we won't be able to differentiate it from other errors. The one I used is from 5xx family, since it's a server side error. At the same time, it is more specific than "something broke server side" which is what 500 means. Maybe someone from apimachinery should take a look?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I was thinking maybe we don't want to distinguish it intentionally. Error messages like this can leak useful data to attackers, so distinguishing this case makes me a little nervous. Debug logs & metrics can provide information for those authorized, without clueing in the (possibly unprivileged) requester.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is a valid point. At the same time, using the same error code will limit the ability to monitor how often requests are rejected because of audit backend failures. We'll be able to see just how often they are rejected in general. Is it possible to have both security and observability?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you're talking about monitoring from the client side, then no, not without returning this information. But if you're talking about monitoring the server, then that's what prometheus metrics are for.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought about existing prometheus metric that already exposes http response code, but you're right, this can be a separate metric exclusively for audit errors.

@@ -30,19 +30,19 @@ import (
)

// Avoid emitting errors that look like valid HTML. Quotes are okay.
var sanitizer = strings.NewReplacer(`&`, "&amp;", `<`, "&lt;", `>`, "&gt;")
var Sanitizer = strings.NewReplacer(`&`, "&amp;", `<`, "&lt;", `>`, "&gt;")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this rendered correctly for plain content? I would think setting content-type + nosniff should be sufficient...

@@ -30,19 +30,19 @@ import (
)

// Avoid emitting errors that look like valid HTML. Quotes are okay.
var sanitizer = strings.NewReplacer(`&`, "&amp;", `<`, "&lt;", `>`, "&gt;")
var Sanitizer = strings.NewReplacer(`&`, "&amp;", `<`, "&lt;", `>`, "&gt;")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use html.EscapeString instead? https://golang.org/pkg/html/

w.Header().Set("Content-Type", "text/plain")
w.Header().Set("X-Content-Type-Options", "nosniff")
w.WriteHeader(http.StatusInsufficientStorage)
fmt.Fprintf(w, "Server couldn't store an audit event for request: %q", responsewriters.Sanitizer.Replace(req.RequestURI))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe just use responsewriters.InternalErorr?

}
}
}()
return sendErr == nil
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

alternatively, just use a named return value. I'll let you decide which is cleaner.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, buffer will not be used when blocking-strict is selected, changed to true to keep existing behavior.

@@ -25,7 +25,8 @@ type Sink interface {
// Errors might be logged by the sink itself. If an error should be fatal, leading to an internal
// error, ProcessEvents is supposed to panic. The event must not be mutated and is reused by the caller
// after the call returns, i.e. the sink has to make a deepcopy to keep a copy around if necessary.
ProcessEvents(events ...*auditinternal.Event)
// Returns true on success, may return false on error.
ProcessEvents(events ...*auditinternal.Event) bool
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, I was surprised you didn't just return the error, but I guess since we're handling the errors in-place it makes sense.

for _, ev := range events {
b.logEvent(ev)
if !b.logEvent(ev) {
success = false
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

alternatively: success = success && b.logEvent(ev)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would cause events to stop being logged after first failure.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

success = b.logEvent(ev) && success would work though.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh yeah, good call.


type safeBackend struct {
// The delegate backend that actually exports events.
delegateBackend audit.Backend
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: just embed it, then you only need to define ProcessEvents.

limitations under the License.
*/

package safe
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A dedicated package feels a bit heavy weight for this.... maybe just embed it in options? (should only need ~6 lines)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you mean by embedding this in options? Wouldn't it require handling this flag in every backend?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I meant just append https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/apiserver/pkg/server/options/audit.go with:

type safeBackend struct {
	audit.Backend
}

func (s *safeBackend) ProcessEvents(ev ...*auditinternal.Event) bool {
	s.Backend.ProcessEvents(ev...)
	return true
}

Since it's only used in that file anyway.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Geez, you're right, that makes way more sense. Thanks, updated.

@@ -25,7 +25,8 @@ type Sink interface {
// Errors might be logged by the sink itself. If an error should be fatal, leading to an internal
// error, ProcessEvents is supposed to panic. The event must not be mutated and is reused by the caller
// after the call returns, i.e. the sink has to make a deepcopy to keep a copy around if necessary.
ProcessEvents(events ...*auditinternal.Event)
// Returns true on success, may return false on error.
ProcessEvents(events ...*auditinternal.Event) bool
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about the buffered webhook audit? Does this function always return true in this scenario?

}
}
}()
return sendErr == nil
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Saw it, return false when the buffer is full.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, changed to true since buffered backend will not be used when blocking-strict is selected.

@@ -56,7 +56,11 @@ func WithAudit(handler http.Handler, sink audit.Sink, policy policy.Checker, lon
}

ev.Stage = auditinternal.StageRequestReceived
processAuditEvent(sink, ev, omitStages)
if !processAuditEvent(sink, ev, omitStages) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only reject request on stage StageRequestReceived, It is not consistent ignoring other stags audit failure.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I doesn't make sense to reject request at later stages (at least for mutating requests), as the request will already have been processed at this point.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes,it has been processed. But also without audit logs of these stages. So what i want to say is this pr does not handle lost audit events completely.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With my change, an auditor can get a guarantee that if there was no event with RequestReceived stage, there was no change to apiserver state and the caller will have a guarantee that if the request failed, the state wasn't changed.

With what you are suggesting, auditor will still get the guarantee that no event implies no change, but you will take away that guarantee from the caller. When request fails, you don't really know what happened.

One change I think might make sense here would be to reject non-mutating requests upon failures on other stages. WDYT?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, this is kind of enhancement whatever .

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After some thought I don't like the idea of rejecting requests inconsistently - with audit policy configured to omit RequestReceived stage this would mean that audit backend issue will block all non-mutating requests, but allow mutating ones.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@x13n I don't understand your response to this - The code that is here only rejects requests when the RequestReceived event cannot be written. I think this is the correct behavior, but I'm not sure it matches what you said above?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apologies for being unclear. Let me try to clarify what I meant above.

I think this check should be applied only at RequestReceived stage. Later on, a failure to audit log anything shouldn't affect apiserver response, because the request was already processed, possibly modifying etcd state. Now, one could argue that there is no harm in rejecting the request if it was non-mutating, but failed to be audit logged e.g. at ResponseCompleted stage. However, this would lead to some weirdness, when RequestReceived stage is omitted and audit backend has problems: mutating requests would ignore audit logging failures and simply work, while read only requests would be rejected because of audit backend problems.

@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jul 6, 2018
@x13n
Copy link
Member Author

x13n commented Jul 6, 2018

/retest

@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jul 6, 2018
@x13n
Copy link
Member Author

x13n commented Jul 6, 2018

/retest

@x13n
Copy link
Member Author

x13n commented Jul 10, 2018

/assign @deads2k

Copy link
Member

@tallclair tallclair left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm, please squash commits

@@ -56,7 +56,11 @@ func WithAudit(handler http.Handler, sink audit.Sink, policy policy.Checker, lon
}

ev.Stage = auditinternal.StageRequestReceived
processAuditEvent(sink, ev, omitStages)
if !processAuditEvent(sink, ev, omitStages) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@x13n I don't understand your response to this - The code that is here only rejects requests when the RequestReceived event cannot be written. I think this is the correct behavior, but I'm not sure it matches what you said above?

@x13n
Copy link
Member Author

x13n commented Jul 12, 2018

Thanks! Squashed.

@x13n
Copy link
Member Author

x13n commented Jul 18, 2018

Ping. Is there any additional work needed in this PR?

@tallclair
Copy link
Member

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jul 18, 2018
@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Aug 9, 2018
@@ -327,10 +333,26 @@ func (o *AuditBatchOptions) AddFlags(pluginName string, fs *pflag.FlagSet) {
"moment if ThrottleQPS was not utilized before. Only used in batch mode.")
}

type safeBackend struct {
audit.Backend
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like type safeBackend should be put in package pluginbuffered

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I initially had a separate package for this, but @tallclair convinced me this makes more sense here, as it is quite trivial. Since nothing in safeBackend is pluginbuffered-specific, I'd rather leave it here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure. Thanks.

@@ -57,6 +57,8 @@ type backend struct {

// Encoder used to calculate audit event sizes.
e runtime.Encoder

// Value returned from ProcessEvents when there was any failure.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is this?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, some leftover. Removed now.

@fejta-bot
Copy link

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to fejta).

Review the full test history for this PR.

Silence the bot with an /lgtm cancel comment for consistent failures.

12 similar comments
@fejta-bot
Copy link

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to fejta).

Review the full test history for this PR.

Silence the bot with an /lgtm cancel comment for consistent failures.

@fejta-bot
Copy link

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to fejta).

Review the full test history for this PR.

Silence the bot with an /lgtm cancel comment for consistent failures.

@fejta-bot
Copy link

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to fejta).

Review the full test history for this PR.

Silence the bot with an /lgtm cancel comment for consistent failures.

@fejta-bot
Copy link

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to fejta).

Review the full test history for this PR.

Silence the bot with an /lgtm cancel comment for consistent failures.

@fejta-bot
Copy link

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to fejta).

Review the full test history for this PR.

Silence the bot with an /lgtm cancel comment for consistent failures.

@fejta-bot
Copy link

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to fejta).

Review the full test history for this PR.

Silence the bot with an /lgtm cancel comment for consistent failures.

@fejta-bot
Copy link

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to fejta).

Review the full test history for this PR.

Silence the bot with an /lgtm cancel comment for consistent failures.

@fejta-bot
Copy link

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to fejta).

Review the full test history for this PR.

Silence the bot with an /lgtm cancel comment for consistent failures.

@fejta-bot
Copy link

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to fejta).

Review the full test history for this PR.

Silence the bot with an /lgtm cancel comment for consistent failures.

@fejta-bot
Copy link

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to fejta).

Review the full test history for this PR.

Silence the bot with an /lgtm cancel comment for consistent failures.

@fejta-bot
Copy link

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to fejta).

Review the full test history for this PR.

Silence the bot with an /lgtm cancel comment for consistent failures.

@fejta-bot
Copy link

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to fejta).

Review the full test history for this PR.

Silence the bot with an /lgtm cancel comment for consistent failures.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Nov 15, 2018
@x13n
Copy link
Member Author

x13n commented Nov 15, 2018

/hold

I need to fix some merge conflicts and failing tests...

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 15, 2018
@k8s-ci-robot
Copy link
Contributor

New changes are detected. LGTM label has been removed.

@k8s-ci-robot k8s-ci-robot removed lgtm "Looks good to me", indicates that a PR is ready to be merged. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Nov 16, 2018
@x13n
Copy link
Member Author

x13n commented Nov 16, 2018

/retest

@x13n
Copy link
Member Author

x13n commented Nov 16, 2018

@sttts I rebased this and fixed failing tests. PTAL.
/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 16, 2018
@mikedanese mikedanese added kind/bug Categorizes issue or PR as related to a bug. priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. lgtm "Looks good to me", indicates that a PR is ready to be merged. labels Nov 17, 2018
@fejta-bot
Copy link

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to fejta).

Review the full test history for this PR.

Silence the bot with an /lgtm cancel comment for consistent failures.

1 similar comment
@fejta-bot
Copy link

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to fejta).

Review the full test history for this PR.

Silence the bot with an /lgtm cancel comment for consistent failures.

@k8s-ci-robot k8s-ci-robot merged commit 46ebebc into kubernetes:master Nov 17, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/apiserver cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. lgtm "Looks good to me", indicates that a PR is ready to be merged. priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. sig/auth Categorizes an issue or PR as relevant to SIG Auth. sig/testing Categorizes an issue or PR as relevant to SIG Testing. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

10 participants