
log/slog: JSONHandler should deduplicate keys #59365

Open · veqryn opened this issue Apr 1, 2023 · 59 comments
Labels
NeedsInvestigation: Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Milestone: Go1.24

Comments

veqryn (Contributor) commented Apr 1, 2023

What version of Go are you using (go version)?

8edcddd

Does this issue reproduce with the latest release?

yes

What did you do?

	logger := slog.New(slog.NewJSONHandler(os.Stdout, nil))
	logger.LogAttrs(
		context.Background(),
		slog.LevelWarn,
		"fancy message",
		slog.Bool("msg", false),
		slog.String("my attr", "something"),
	)

What did you expect to see?

JSONHandler should deduplicate those special keys (time, message, level, and source) against user-supplied attributes.
I use "source" as a key in a lot of my code right now, so I'd run into this problem.

Either

{
	"time": "2023-04-01T04:27:46.0959443-06:00",
	"level": "WARN",
	"msg": "fancy message",
	"my attr": "something"
}

Or

{
	"time": "2023-04-01T04:27:46.0959443-06:00",
	"level": "WARN",
	"msg": "fancy message",
	"msg.msg": false,
	"my attr": "something"
}

Or

{
	"time": "2023-04-01T04:27:46.0959443-06:00",
	"level": "WARN",
	"msg": "fancy message",
	"fields": {
		"msg": false,
		"my attr": "something"
	}
}

Or

{
	"time": "2023-04-01T04:27:46.0959443-06:00",
	"level": "WARN",
	"msg": "fancy message",
	"msg#01": false,
	"my attr": "something"
}

What did you see instead?

Invalid JSON. It would break our log aggregation. Default/built-in handlers should never create invalid output.

{
	"time": "2023-04-01T04:27:46.0959443-06:00",
	"level": "WARN",
	"msg": "fancy message",
	"msg": false,
	"my attr": "something"
}

@jba

EDIT:
Hey all, I've created a logging handler middleware that deduplicates attribute keys before they reach the JSON handler.
Please check it out: https://github.com/veqryn/slog-dedup

go get github.com/veqryn/slog-dedup

package main

import (
	"log/slog"
	"os"

	slogdedup "github.com/veqryn/slog-dedup"
)

func main() {
	// OverwriteHandler
	overwriter := slogdedup.NewOverwriteHandler(slog.NewJSONHandler(os.Stdout, nil), nil)
	slog.SetDefault(slog.New(overwriter))

	/*
	  {
	    "time": "2023-10-03T01:30:00Z",
	    "level": "INFO",
	    "msg": "this is the dedup overwrite handler",
	    "duplicated": "two"
	  }
	*/
	slog.Info("this is the dedup overwrite handler",
		slog.String("duplicated", "zero"),
		slog.String("duplicated", "one"),
		slog.String("duplicated", "two"),
	)

	// IgnoreHandler
	ignorer := slogdedup.NewIgnoreHandler(slog.NewJSONHandler(os.Stdout, nil), nil)
	slog.SetDefault(slog.New(ignorer))

	/*
	  {
	    "time": "2023-10-03T01:30:00Z",
	    "level": "INFO",
	    "msg": "this is the dedup ignore handler",
	    "duplicated": "zero"
	  }
	*/
	slog.Info("this is the dedup ignore handler",
		slog.String("duplicated", "zero"),
		slog.String("duplicated", "one"),
		slog.String("duplicated", "two"),
	)

	// IncrementHandler
	incrementer := slogdedup.NewIncrementHandler(slog.NewJSONHandler(os.Stdout, nil), nil)
	slog.SetDefault(slog.New(incrementer))

	/*
	  {
	    "time": "2023-10-03T01:30:00Z",
	    "level": "INFO",
	    "msg": "this is the dedup incrementer handler",
	    "duplicated": "zero",
	    "duplicated#01": "one",
	    "duplicated#02": "two"
	  }
	*/
	slog.Info("this is the dedup incrementer handler",
		slog.String("duplicated", "zero"),
		slog.String("duplicated", "one"),
		slog.String("duplicated", "two"),
	)

	// AppendHandler
	appender := slogdedup.NewAppendHandler(slog.NewJSONHandler(os.Stdout, nil), nil)
	slog.SetDefault(slog.New(appender))

	/*
	  {
	    "time": "2023-10-03T01:30:00Z",
	    "level": "INFO",
	    "msg": "this is the dedup appender handler",
	    "duplicated": [
	      "zero",
	      "one",
	      "two"
	    ]
	  }
	*/
	slog.Info("this is the dedup appender handler",
		slog.String("duplicated", "zero"),
		slog.String("duplicated", "one"),
		slog.String("duplicated", "two"),
	)
}
@seankhliao changed the title from "affected/package: log/slog: JSONHandler easily creates invalid json" to "log/slog: JSONHandler should deduplicate keys" on Apr 1, 2023
@seankhliao added the NeedsInvestigation label (someone must examine and confirm this is a valid issue and not a duplicate of an existing one) on Apr 1, 2023
@seankhliao added this to the Go1.21 milestone on Apr 1, 2023
dsnet (Member) commented Apr 3, 2023

Whatever solution is chosen here, I would argue that it should work even if the user explicitly provides duplicate names:

logger.LogAttrs(
	context.Background(),
	slog.LevelWarn,
	"fancy message",
	slog.String("duplicate", "alpha"),
	slog.String("duplicate", "bravo"),
	slog.String("duplicate", "charlie"),
	slog.String("duplicate", "delta"),
	slog.String("duplicate", "echo"),
)

One possible solution is to use a "#%d" scheme similar to how testing.T.Run deduplicates conflicting test names.
For example:

{
	"time": "2023-04-01T04:27:46.0959443-06:00",
	"level": "WARN",
	"msg": "fancy message",
	"duplicate": "alpha",
	"duplicate#01": "bravo",
	"duplicate#02": "charlie",
	"duplicate#03": "delta",
	"duplicate#04": "echo",
}

seankhliao (Member) commented

I think the current situation is fine: duplicate keys aren't invalid, just less interoperable.
Changing the key name to deduplicate it seems worse.

As for the original issue, I'll note that option 3 was why WithGroup was added (though this doesn't affect deduplication in general).

dsnet (Member) commented Apr 3, 2023

duplicate keys aren't invalid, just less interoperable.

Technically you are correct. Under RFC 8259, section 4, duplicate names are essentially treated as undefined behavior.

In a possible future major version of "encoding/json", we will likely make duplicate keys an explicit error (which is still compliant behavior under RFC 8259), as this lack of checking has been exploited, resulting in severe security vulnerabilities (e.g., CVE-2017-12635). It's unlikely that duplicate names will have such drastic security implications in logging, but strict checking of duplicate names is likely the overall direction that JSON will gradually move towards.

As an aside, Tim Bray, the author of RFC 8259, actually recommends against naïve compliance with RFC 8259, but rather that implementations adhere to RFC 7493, which makes strict decisions on issues that RFC 8259 leaves undefined.
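
To make the lenient end of that spectrum concrete, Go's current "encoding/json" accepts duplicate names when unmarshaling and silently keeps the last value. A minimal demonstration (not part of any proposed API):

package main

import (
	"encoding/json"
	"fmt"
)

func main() {
	// RFC 8259 leaves the meaning of this object undefined; Go's
	// current decoder accepts it, and the last value silently wins.
	var m map[string]any
	err := json.Unmarshal([]byte(`{"msg":"fancy message","msg":false}`), &m)
	fmt.Println(m["msg"], err) // false <nil>
}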

jba (Contributor) commented Apr 3, 2023

I think it's important to understand whether encoding/json v2 with its currently proposed semantics is likely to be the accepted version. We should probably align slog with those semantics, even if slog appears in Go before encoding/json v2. We're already doing that for formatting NaNs, infinities and, probably, durations (#59345).

For this issue, if we think no dups will be the default for encoding/json v2, then I wouldn't be opposed to making it the default in slog and adding an AllowDuplicateNames option. And we should do the same for TextHandler for consistency.

dsnet (Member) commented Apr 3, 2023

whether encoding/json v2 with its currently proposed semantics is likely to be the accepted version

That I can't really answer unfortunately.

We haven't even started the GitHub discussion on anything related to v2 "encoding/json" and may not be ready to do so for some months. A significant piece of work that remains is figuring out the inter-compatibility story between v1 and a hypothetical v2. That's perhaps the hardest part and requires quite a lot of head-scratching.

That said, if there is a v2 "encoding/json", I feel like the security arguments for rejecting duplicate names by default are pretty compelling.


The (very) tentative plan for v2 is that v1 "encoding/json" will actually be implemented in terms of v2, with the appropriate set of options specified to make it behave semantically identically to what v1 "json" does today. Essentially, with the appropriate set of options, you can configure the semantics to be any hybrid mix in between the default behavior of v1 and the default behavior of v2.

Since "log/slog" already depends on the v1 semantics of "encoding/json", you could imagine doing nothing to address this issue for now. I don't think we should block adoption of "slog" on a hypothetical v2 "json". Nor should we overly massage the output of "slog" to match a hypothetical v2 "json".

Supposing v2 "json" one day lands, you could imagine that v2 "json" be integrated into "slog" by exposing something like:

package slog

type HandlerOptions struct {
    // JSONOptions is a list of Option
    // that configures how JSON is marshaled.
    JSONOptions []json.Option
}

To log each message, "slog" would do something like:

// Start with DefaultV1Options since "slog" has historically
// marshaled with mostly v1 semantics by default.
opts := []json.Option{json.DefaultV1Options}
// We have historically specified Encoder.SetEscapeHTML(false).
// By passing a nil escape function, we achieve the same behavior.
opts = append(opts, json.EscapeRune(nil))
// Append any user-specified options,
// which will override any previously set options.
opts = append(opts, h.opts.JSONOptions...)
return json.Marshal(v, opts...)

(You'll note that the options API for "json" above is different than what currently exists in "go-json-experiment/json". I mentioned elsewhere in #59339 that we're considering switching to variadic options as we see it as one of the ways to resolve the v1-to-v2 compatibility tensions).
(Also ignore the performance implications of appending a bunch of options right before calling json.Marshal; we can find ways to optimize this. Let's focus on semantics.)

Thus, for someone to use v2 "json" with "slog", they would do something like:

slog.HandlerOptions{
    JSONOptions: []json.Option{json.DefaultV2Options},
}.NewJSONHandler(...)

I should also note that DefaultV1Options and DefaultV2Options represent a composite set of options. For example DefaultV1Options includes AllowDuplicateNames(true) while DefaultV2Options includes AllowDuplicateNames(false).

You could also simplify the above by providing NewJSONv2Handler that implicitly inserts json.DefaultV2Options before user-specified options.

One could imagine that "slog" inspects HandlerOptions.JSONOptions and checks AllowDuplicateNames.

jba (Contributor) commented Apr 4, 2023

Thanks, @dsnet, that makes a lot of sense.

08d2 commented Apr 5, 2023

It would be very unexpected if repeated calls to Log with the same key "duplicate" were to produce output like {"duplicate": "alpha", "duplicate#01": "bravo", "duplicate#02": "charlie"}. Among many other reasons, this behavior would make it impossible to reliably assert that a log event emitted with some key K would actually be rendered with that key K in the ultimate output format.

Structured logging defines a concept of a log event which is conceptually above any specific encoding like JSON or logfmt or et cetera. That concept of a log event is well-defined by convention, and that definition doesn't accommodate repeated keys.

dsnet (Member) commented Apr 5, 2023

It would be very unexpected if repeated calls to Log with the same key "duplicate" were to produce output like {"duplicate": "alpha", "duplicate#01": "bravo", "duplicate#02": "charlie"}.

Possibly, but there are no good answers when it comes to duplicate names.

It would be nice if the API made it statically impossible to have duplicates,
but I don't see how that can be done without taking in a map[string]any as the input,
which would not be performant.

Given the possibility of runtime duplicates, your options are:

  1. Emit the object with all duplicate fields
  2. Emit the object, but only keep one of the duplicate fields (e.g., the first occurrence)
  3. Rename the duplicate fields to avoid a conflict
  4. Report an error
    a. and still do either 1, 2, or 3
    b. or simply reject the message and not log it.

With option 1, you risk the log parser later not being able to read the message entirely or accurately. Duplicate names result in undefined behavior where the parser might ignore duplicates, only use the first one, only use the last one, or use some merged form of all duplicates.

With option 2, you avoid any ambiguities when reading, but are subtly dropping information when writing.

With option 3, you avoid any ambiguities as to the meaning when reading, but are changing the meaning when writing. At least the meaning when reading is consistent across all parsers.

Option 4a is decent as it allows the caller to at least realize that they're possibly holding the API wrong, but I suspect most people will not check the error of logging calls.

Of all the bad options, option 4a with either 1 or 3 is probably the best approach.
Options 2 and 4b discard information, which seems antithetical to the point of a logging library.
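
For concreteness, option 2 (keep one occurrence; here, the last one wins) could be sketched as a wrapping handler. This is a minimal sketch only: it ignores groups and does not re-wrap itself in WithAttrs/WithGroup, both of which a real middleware would have to handle.

package main

import (
	"context"
	"log/slog"
	"os"
)

// dedupHandler keeps only the last attr with a given key in each record.
// Embedding delegates Enabled, WithAttrs, and WithGroup to the wrapped
// handler (and so drops the wrapper after WithAttrs/WithGroup).
type dedupHandler struct {
	slog.Handler
}

func (h dedupHandler) Handle(ctx context.Context, r slog.Record) error {
	index := make(map[string]int, r.NumAttrs())
	var attrs []slog.Attr
	r.Attrs(func(a slog.Attr) bool {
		if i, ok := index[a.Key]; ok {
			attrs[i] = a // duplicate: overwrite the earlier value
		} else {
			index[a.Key] = len(attrs)
			attrs = append(attrs, a)
		}
		return true
	})
	clean := slog.NewRecord(r.Time, r.Level, r.Message, r.PC)
	clean.AddAttrs(attrs...)
	return h.Handler.Handle(ctx, clean)
}

func main() {
	logger := slog.New(dedupHandler{slog.NewJSONHandler(os.Stdout, nil)})
	// Emits: {"time":...,"level":"INFO","msg":"demo","duplicate":"echo"}
	logger.Info("demo", "duplicate", "alpha", "duplicate", "echo")
}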

08d2 commented Apr 5, 2023

Option 2 is pretty clearly the correct choice.

Options 2 and 4b discard information, which seems antithetical to the point of a logging library.

I'm not sure why anyone would expect a structured logging library to never discard information. Emitting multiple structured log events with the same key is pretty clearly going to resolve to a single "winner" for that key, most likely the last event, but really any event would be acceptably valid.

dsnet (Member) commented Apr 5, 2023

Option 2 is pretty clearly the correct choice.

That's a matter of opinion. Just as you find option 3 an odd position, I find option 2 an odd position. Every option has pros and cons.

I'm not sure why anyone would expect a structured logging library to never discard information.

Any value passed to LogAttrs by the programmer carries fairly clear intent that it be logged. Arbitrarily choosing some field to keep, while discarding all others goes against the intent. Thus, my expectation is for a logging library to try hard to never discard information.

08d2 commented Apr 5, 2023

Values passed to a structured logging library necessarily include a key which is, by definition, understood as unique within the target log event. So, no: it's definitely not the case that calling a structured logging method like LogAttrs can be assumed to produce any specific log output. Structured logging is, hopefully obviously, not equivalent to log.Printf, or fmt.Printf, or anything like that.

veqryn (Contributor, Author) commented Apr 5, 2023

@dsnet For option 4, how would an error be reported? A separate log line at warning level, or some out-of-band channel or something?

Personally I would have probably expected option 2, where the latest value wins (except that the Message, Time, Level, and Source fields would always win over any Attribute values).

My preferences would probably be:
option 2, or 4a with 2
option 3, or 4a with 3

option 1 would break a lot of log aggregation systems
option 4b would seem like the worst option, as you lose the log line completely, making any error even harder to diagnose

dsnet (Member) commented Apr 5, 2023

Values passed to a structured logging library necessarily include a key which is, by definition, understood as unique within the target log event

By technical definition, I agree. However, in practice, duplicates often occur due to human error. If I make a mistake, I would rather still have all duplicates logged (whether with option 1 or 3).

As an aside, we should definitely create a vet check that detects accidental duplicate names.

@dsnet For option 4, how would an error be reported? A separate log line at warning level, or some out-of-band channel or something?

I was thinking of an explicit error returned by slog.Handler.Handle. I guess a separate log line is a variation of option 4. However, a go vet check might be a better way to statically detect most of these cases.
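
A rough sketch of that 4a variant, where the record is still emitted but Handle surfaces the duplicates as an error (the handler name and error text are illustrative, not a proposal):

package main

import (
	"context"
	"fmt"
	"log/slog"
	"os"
)

// reportingHandler logs the record regardless, then reports duplicates.
type reportingHandler struct {
	slog.Handler
}

func (h reportingHandler) Handle(ctx context.Context, r slog.Record) error {
	seen := make(map[string]bool, r.NumAttrs())
	var dups []string
	r.Attrs(func(a slog.Attr) bool {
		if seen[a.Key] {
			dups = append(dups, a.Key)
		}
		seen[a.Key] = true
		return true
	})
	if err := h.Handler.Handle(ctx, r); err != nil {
		return err
	}
	if len(dups) > 0 {
		return fmt.Errorf("duplicate keys: %q", dups)
	}
	return nil
}

func main() {
	logger := slog.New(reportingHandler{slog.NewJSONHandler(os.Stdout, nil)})
	// The record is still emitted; note that convenience methods
	// like Info currently discard the error returned by Handle.
	logger.Info("demo", "duplicate", "alpha", "duplicate", "echo")
}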

08d2 commented Apr 5, 2023

Values passed to a structured logging library necessarily include a key which is, by definition, understood as unique within the target log event

By technical definition, I agree. However, in practice, duplicates often occur due to human error. If I make a mistake, I would rather still have all duplicates logged (whether with option 1 or 3).

I'm not sure how to evaluate whether or not a particular mistake made by consumers of an API is common enough to warrant changing the API to accommodate that mistake. Especially so if that change would violate the intuitive expectations of the API, as is the case here.

AndrewHarrisSPU commented

It would be nice if the API made it statically impossible to have duplicates, but I don't see how that can be done without taking in a map[string]any as the input, which would not be performant.

I agree; I think a map is at least very close to the ideal data structure here. Regarding performance, the current implementation of x/exp/slog/internal/buffer allocates output buffers as a sync.Pool resource.

type Buffer []byte

It's possible to also pool a map[string]bool for keys and a []string for groups:

type Buffer struct {
	buf   []byte
	keys  map[string]bool
	group []string
}

On freeing the Buffer, like the buf slice, the keys and group fields can be emptied without releasing their backing memory. I used something similar when experimenting with string interpolation of attributes into log messages, which was fun and far more useless in practice than I anticipated, but I did benchmark, and this Buffer approach seemed to:

  • not allocate
  • increase the time of logging proportional to the number of attributes (loosely and non-scientifically, each Attr was 20% more expensive - use 5 Attrs and pay for 6)
  • increase the size of pooled objects by a constant factor - the additional fields are small, but so is the buf slice.
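
The reset-on-free step might look like this (a sketch; bufPool is an assumed package-level sync.Pool, not the actual x/exp/slog internals):

var bufPool = sync.Pool{
	New: func() any {
		return &Buffer{keys: make(map[string]bool)}
	},
}

// Free empties the pooled state without releasing its backing memory,
// so the next use reuses buf, keys, and group allocation-free.
func (b *Buffer) Free() {
	b.buf = b.buf[:0]
	for k := range b.keys {
		delete(b.keys, k) // this loop form is optimized to a map clear
	}
	b.group = b.group[:0]
	bufPool.Put(b)
}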

A contrasting approach: I think a Deduplicate wrapper for ReplaceAttr would be possible without any change to slog, although it is tricky because deduplication after ReplaceAttr seems significant. Notably, with no change to slog, I don't think this allows for your option 4 of reporting an error, which I thought was intriguing. Something like:

func Deduplicate(f func(groups []string, a Attr) Attr) func(groups []string, a Attr) Attr

(I really don't have any opinion on what's necessary here - I don't see much need for deduplication personally, and prior art in structured logging in Go doesn't deduplicate AFAIK, and yet the arguments for deduplication seem pretty reasonable.)

veqryn (Contributor, Author) commented Apr 5, 2023

Values passed to a structured logging library necessarily include a key which is, by definition, understood as unique within the target log event

By technical definition, I agree. However, in practice, duplicates often occur due to human error. If I make a mistake, I would rather still have all duplicates logged (whether with option 1 or 3).

As an aside, we should definitely create a vet check that detects accidental duplicate names.

Would that be able to detect duplicates in sub-packages though? I feel like the most common occurrence of duplicates will be adding attributes and telemetry before passing the logger down to libraries or other sub-packages.

@dsnet For option 4, how would an error be reported? A separate log line at warning level, or some out-of-band channel or something?

I was thinking of an explicit error returned by slog.Handler.Handle. I guess a separate log line is a variation of option 4. However, a go vet check might be a better way to statically detect most of these cases.

Right now, errors coming from a Handler are discarded by the most common slog methods (i.e., Info, WarnCtx, etc.). I certainly wouldn't want to change the signature of Info, WarnCtx, etc., to return the error. I suppose it is possible to use a middleware-type handler that captures and does something with errors, but that wouldn't be something people would use by default.
In any case, I would still want the log line output before any error is returned.

dsnet (Member) commented Apr 5, 2023

Would that be able to detect duplicates in sub-packages though?

Depends on the complexity of call-tracking. This is a similar problem to the vet checks for fmt.Sprintf, which already have some degree of call tracking. I don't know if it descends through package boundaries, but it is able to cross function call boundaries.

zephyrtronium (Contributor) commented

One option might be to rename duplicated keys to !BADKEY(duplicateKey) as a sort of combination of @dsnet's options 3 and 4a while carrying some symmetry with the existing error-while-logging mechanism. Then again, there is also some asymmetry if it is only JSONHandler that checks duplicates.

dsnet (Member) commented Apr 5, 2023

Another argument against option 2 (which discards data) is security.

If untrusted input is able to affect what names are present, then that can be used as a mechanism to hide malicious activity, knowing that it will not be logged. IMO, it seems better for readers to possibly choke on the malformed messages than for the malicious activity to not be properly logged. A human investigator looking at the malformed message can see the duplicate shenanigans at play.

08d2 commented Apr 5, 2023

There are many additional (and unavoidable) security concerns in a structured logging API beyond this one, often related to how values are rendered. It is always an error to pass untrusted input to a structured logger directly.

dsnet (Member) commented Apr 5, 2023

@08d2

There are many additional (and unavoidable) security concerns in a structured logging API beyond this one

Like what? The other security issue I can imagine is a DoS attack where you're tricked into logging something massive, but that's a distinctly different class of security issue from being able to hide activity or perform remote code execution.

@AndrewHarrisSPU

I think a map is at least very close to the ideal data structure here. Regarding performance, the current implementation of x/exp/slog/internal/buffer allocates output buffers as a sync.Pool resource.

I'm not sure I understand. Suppose we used a Go map such that you did something like:

slog.Info(map[string]any{
    "msg":   "my wonderful message",
    "other": "something else",
})

Yes, you get the benefit of static compilation preventing duplicate map keys (although it can't check the implicit "time" and "level" keys). However, I'm missing the part where the pool helps out.

When slog.Info calls the underlying slog.Handler.Handle, it goes through an interface method call, and all escape analysis goes out the window, so the input map is treated as having escaped (and thus heap-allocated). To ensure the input remains stack allocated, you would have to copy the input map to one of the pooled maps and then call Handler.Handle with it. It's not clear to me that the act of copying the map is going to be fast enough.

AndrewHarrisSPU commented Apr 5, 2023

@dsnet

When slog.Info calls the underlying slog.Handler.Handle, it goes through an interface method call, and all escape analysis goes out the window, so the input map is treated as having escaped (and thus heap-allocated). To ensure the input remains stack allocated, you would have to copy the input map to one of the pooled maps and then call Handler.Handle with it. It's not clear to me that the act of copying the map is going to be fast enough.

I don't think we need map inputs to achieve option (2) or (3); let me clarify. This technique doesn't involve any change that isn't internal to the Handler implementation; everything else looks the same as before. There's no map copying; there is one blank, probably already-allocated map obtained from a pool when the output Buffer is obtained by Handler.Handle. (edit: some attention to pre-formatted attributes is needed here, but it can be done pretty efficiently)

This map can be populated with keys visited when the handler traverses the attributes of the record, writing to output. A couple of quirks are that the traversal is made more interesting and dynamic by things like group Attrs, LogValuers that expand into groups, or ReplaceAttrs, etc. ReplaceAttr needs to happen before a key goes into the map, but this is tractable. Also, precisely, it's not just key strings that need to be matched, it's also the group prefix - a little tricky but possible to do safely and without allocating.

If this map is populated with visited keys, we can detect whether some key has been committed previously in the log line, and trigger behavior option (2) or (3) as a result. Big picture, I'm not enthusiastic about this as a solution, but I was surprised that it seemed more performant than I would've guessed.
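
Sketching that check (hypothetical handler internals; prefix stands for the already-joined group path, and Buffer is the pooled struct from the earlier comment):

// appendAttr skips an attr whose group-qualified key was already
// committed to this record's output - the option (2) behavior.
func (h *handler) appendAttr(buf *Buffer, prefix string, a slog.Attr) {
	key := prefix + a.Key
	if buf.keys[key] {
		return // drop the later duplicate
	}
	buf.keys[key] = true
	// ... encode the attr into buf.buf ...
}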

08d2 commented Apr 6, 2023

Like what? The other security issue I can imagine is a DOS attack where you're tricked into logging something massive, but that's a distinctly different class of security issue from being able to hide activity or perform remote-code-execution.

Consider a type which renders in one way in some version X, and a different (and richer) way in a subsequent version Y. Consider the case where that change results in sensitive information from that type which was previously not logged becoming logged. This happens easily if the type grows a new sensitive field in a new version. Such changes are in practice never considered breaking in terms of versioning. A downstream application which logs a value of that type will have one behavior if it includes version X of the relevant package, and a different behavior when it includes version Y, and that version upgrade is in practice never going to be noticed, or tested, by the downstream application.

Printf-based logging does not generally have this problem, because it is not conventional to pass values directly as Printf log arguments.

veqryn (Contributor, Author) commented Apr 7, 2023

FWIW, we once used a database driver that logged our database password in plain text when in debug mode, using regular old Printf (it was patched right away).

I don't really see any security issues with structured logging that don't already exist with regular logging.

And it's not like this is log4j, where we are executing the statements in the logs or trying to resolve remote XML or something. If the data is strings, the worst case I can think of is accidental logging of sensitive data.

jba (Contributor) commented Apr 8, 2023

A few points:

@veqryn's specific problem, duplicate "source" keys, will be easy to resolve with a ReplaceAttr function once #59280 is approved. You can then distinguish a "source" attribute that comes from the handler by seeing if the value is a *slog.Source. It's also going to appear before a user-supplied attribute, a fact you can use right now. The design doesn't promise that that will always be true, but I can't see why we'd change it.
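
That distinction could look roughly like this against the API as it eventually shipped (a sketch; the replacement key name is arbitrary):

package main

import (
	"log/slog"
	"os"
)

func main() {
	h := slog.NewJSONHandler(os.Stdout, &slog.HandlerOptions{
		AddSource: true,
		ReplaceAttr: func(groups []string, a slog.Attr) slog.Attr {
			// The handler's own source attr carries a *slog.Source value;
			// anything else under the "source" key came from the user.
			if len(groups) == 0 && a.Key == slog.SourceKey {
				if _, ok := a.Value.Any().(*slog.Source); !ok {
					a.Key = "user_source" // arbitrary rename to avoid the clash
				}
			}
			return a
		},
	})
	slog.New(h).Info("demo", "source", "billing-service")
}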

In general, if you don't like duplicate keys, you can write a wrapping Handler. That can return an error if you like, but it's worth remembering that logging is often the last resort for debugging in the wild. If you've ever been in a position where you can't debug because you have a picky logger that refuses to output everything it has, you know how frustrating that can be, and how happily you'd jettison your principles of careful error detection and failing fast for just a few more characters on the screen.

This isn't a proposal, but if it were I'd vote to decline. We can always add dedup functionality later (though not as the default, it's true). As for json v2 compatibility, @dsnet has provided a good blueprint for how to achieve that when the time comes. But you go into the standard library with the packages you have, not the packages you wish you had. slog should behave like the current encoding/json.

earthboundkid (Contributor) commented

Duplicate keys can be lost if the processor doesn't accept duplicates, but they can also be lost if the processor expects mykey and gets mykey-1. There's no best option here.

prochac commented Oct 31, 2023

@carlmjohnson Not just keys: the whole log line can be dropped as invalid JSON, if the JSON parser treats duplicate keys as making the object invalid.

Edit:
I wanted to do it later, after app-level migration, but it seems that our stack processes duplicate keys fine. The last occurrence wins.
stack: Vector+Loki

dsnet (Member) commented Oct 31, 2023

@prochac: There are no good answers when it comes to duplicate names.

You mentioned that you view your logs through Grafana. What is the underlying data store that you're using with Grafana, and does that at least store JSON objects with duplicate names?

In particular, I'm trying to understand whether this is a storage problem or a reading problem:

  • A storage problem is one where Prometheus (a common data store for Grafana) refuses to store the log message and drops it entirely. Do we know whether Prometheus, or other common log data stores (e.g., Splunk or Datadog), drop objects with duplicate names? In this situation, there is no hope for recovery since the data is lost forever.
  • A reading problem is one where some application written in a programming language of choice can't read the data. The default behavior of v2 "json" will reject values with duplicate names, but this can be easily disabled with json.Unmarshal(b, &v, jsontext.AllowDuplicateNames(true)). That said, the semantic behavior of duplicate names will differ across implementations. Here the situation is bad, but at least there is hope of recovering the log data.
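
For the reading side, recovery with the experimental v2 packages might look like this (a sketch; the go-json-experiment/json API is experimental and subject to change):

package main

import (
	"fmt"

	"github.com/go-json-experiment/json"
	"github.com/go-json-experiment/json/jsontext"
)

func main() {
	line := []byte(`{"msg":"fancy message","msg":false}`)

	// Default v2 behavior: duplicate names are rejected with an error.
	var strict map[string]any
	fmt.Println(json.Unmarshal(line, &strict))

	// Opting out of the strict check recovers the data; with duplicates
	// allowed, later values overwrite earlier ones in the map.
	var lax map[string]any
	err := json.Unmarshal(line, &lax, jsontext.AllowDuplicateNames(true))
	fmt.Println(lax, err)
}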

prochac commented Oct 31, 2023

@dsnet I don't have a problem in particular; what I have is a concern about an imaginary scenario:
log/slog produces JSON logs that SHOULDN'T (though not MUSTN'T) look that way per RFC 8259.
I know it's the user's fault, but...
That's why I suggested that this would be worth having in the standard library.
I know I can use a third-party library, and I'm doing so, at least until we fully migrate and verify that everything works fine.

ECMA says that the latest value wins:

25.5.1 JSON.parse
https://262.ecma-international.org/14.0/#sec-json.parse

08d2 commented Nov 13, 2023

Are there any other commonly-used JSON parsers that will fail to parse {"a":1,"a":2} by default?

prochac commented Nov 13, 2023

Are there any other commonly-used JSON parsers that will fail to parse {"a":1,"a":2} by default?

JavaScript, the father of JSON, definitely will not.
Python's stdlib json takes the last value as well.

Maybe it's not a good strategy to be more Catholic than the Pope.

dsnet (Member) commented Nov 13, 2023

In "encoding/json/v2", we are aiming to reject duplicate names by default, but make it fairly easy to opt-out of the strict check.

The decision to reject duplicate names is driven by various security exploits that have occurred over the years that depended on abusing the unspecified behavior that occurs with duplicate names (across a variety of JSON implementations in various languages). See CVE-2017-12635 as one such example. This particular security stance doesn't benefit logging as much, since the content of a log message is rarely used to alter some critical piece of control flow. However, JSON is used for so much more than just logging (e.g., RPCs), that it is safer to be secure by default even if it is the more annoying default for logs analysis.

Also, even if various libraries allow duplicate names, the behavior of how duplicates are handled varies widely. It is not enough to ask only whether they are permissible. In the CVE mentioned above, Erlang and JavaScript disagreed about the semantic meaning of a JSON object with duplicate names. In a similar way, permitting duplicate names means that you may end up with log-analysis results that differ (perhaps wildly) depending on which JSON implementation you happened to use.

marcusperlick commented

I think this is not just a JSON topic. The log/slog package should be explicit about the semantics of duplicate keys provided via log.Log(…) or arising via log.With(…).Log(…). With well-defined semantics, each Handler must find its own strategy to implement them. Otherwise, the information in a log would depend on the choice of log Handler, not on what was provided by the application that uses the logger.

madkins23 commented Jan 8, 2024

I stumbled onto this problem while I was trying to add some slog/zerolog handlers to a forked copy of go-logging-benchmarks. I filed a ticket there because half of the systems they're benchmarking duplicate a lot of test attributes (and the other half don't) and I felt like the benchmarks might not be completely accurate. Now I'm working on a suite of slog.Handler tests, with the primary motivation to check for duplicate fields, as slogtest does not.

My personal feeling is that duplicate fields in JSON are bad (many arguments above) and should be documented as such. The greatest advantage of slog is that handlers can be swapped out without rewriting the code that does the logging. This advantage is lessened if the logging output changes when the handlers are swapped, particularly for situations involving downstream processing of log files.

The simplest solution is option 2 (latest wins), though there may be unpredictability in what "latest" means in some situations. Or perhaps, if a field like "alpha" repeats, turn it into an array of all the values.

In any event, as mentioned earlier, the log/slog documentation should clarify the expected behavior, whatever it is supposed to be, sooner rather than later. I'm specifically thinking of the bullet items in the slog.Handler documentation.

ronny added a commit to ronny/clog that referenced this issue May 29, 2024
* fixes incorrect WithAttrs and WithGroup behaviour

* Deduplicate attributes with the same key

golang/go#59365
snqk commented Jun 27, 2024

Perhaps a somewhat different problem, but my use case required merging/overlaying a new set of slog.Attr values over the ones already stored in the slog.Logger.

Considering a log/slog output like:

{
  "time": "2009-11-10T23:00:00Z",
  "level": "INFO",
  "msg": "hello_world",
  "uuid": "3a48d92f-8bf9-4633-8d35-682618412afa",
  "runtime": {
    "lifecycle": {
      "retry": {
        "total": 4,
        "timeout_seconds": 10,
        "attempt": 1
      }
    }
  },
  "runtime": {
    "lifecycle": {
      "retry": {
        "attempt": 2
      },
      "duration": {
        "nanoseconds": 123456
      }
    }
  },
  "throughput_bps": 59,
  "throughput_bps": 76
}

A potential merge would overlay the updated slog.LogValue over the original set, without modifying any existing slog.Attr keys.

{
  "time": "2009-11-10T23:00:00Z",
  "level": "INFO",
  "msg": "hello_world",
  "uuid": "3a48d92f-8bf9-4633-8d35-682618412afa",
  "runtime": {
    "lifecycle": {
      "retry": {
        "total": 4,
        "attempt": 2,
        "timeout_seconds": 10
      },
      "duration": {
        "nanoseconds": 123456
      }
    }
  },
  "throughput_bps": 76
}
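
A recursive merge along these lines is possible in a wrapping handler. This is a rough sketch of the idea (not the actual slog-meld implementation), assuming log/slog is imported:

// mergeAttrs overlays src onto dst: group values sharing a key are
// merged recursively, while other duplicate keys are overwritten by
// the later value.
func mergeAttrs(dst, src []slog.Attr) []slog.Attr {
	for _, a := range src {
		i := indexOfKey(dst, a.Key)
		if i < 0 {
			dst = append(dst, a)
			continue
		}
		if dst[i].Value.Kind() == slog.KindGroup && a.Value.Kind() == slog.KindGroup {
			merged := mergeAttrs(dst[i].Value.Group(), a.Value.Group())
			dst[i] = slog.Attr{Key: a.Key, Value: slog.GroupValue(merged...)}
		} else {
			dst[i] = a // non-group duplicate: later value wins
		}
	}
	return dst
}

func indexOfKey(attrs []slog.Attr, key string) int {
	for i, a := range attrs {
		if a.Key == key {
			return i
		}
	}
	return -1
}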

I also experimented with implementing slog.LogValuer on a type Runtime struct{}, adding it as a root attribute. I felt that was too expensive and unnecessary to render every time. I'd be keen to explore better ways of achieving the same result if folks here have ideas.

I've created a Handler (https://github.com/snqk/slog-meld) based on this approach, in case anyone is looking for something similar. I'd appreciate any constructive feedback!

jba (Contributor) commented Jun 28, 2024

I haven't looked at the code in detail, but it seems reasonable.

I would test your handler with testing/slogtest.

cupen commented Aug 8, 2024

I think deduplication is expensive, and that's not what a log formatter is for.
slog.Handler is an interface, and a custom JSONHandler is good enough for this. So let's keep the default one as fast as possible.

@gopherbot modified the milestones: Go1.23, Go1.24 on Aug 13, 2024
madkins23 commented Aug 13, 2024

In a previous post on this issue I was pushing for no duplicates in logged JSON. Having spent some quality time with various log/slog handlers, and having written one myself, I now feel that removing duplicates is a fairly costly endeavor and should be optional.

I have compared slog.JSONHandler to snqk/meld and veqryn/slog-dedup (both wrapped around slog.JSONHandler) using my log/slog handler test harness.

Both of them work fine (or as well as the wrapped handlers, anyway). Note the performance cost in both cases.

FYI: @snqk @veqryn

snqk commented Aug 13, 2024

https://madkins23.github.io/go-slog/scores/:Dedup/summary.html

Thanks @madkins23

Overall I'd agree that merging/deduplication should not be offered as a default. It's also important to note that deduplication ⊂ merge.

However, I disagree that merging is that expensive. The benchmarks you posted show the meld logger isn't significantly behind the default logger (~97 vs ~87).

madkins23 commented

Expensive is subjective. My goal is to provide data that will help developers choose the appropriate solution for the requirements of a particular project. Your handler is a useful and reasonably efficient option. Thanks for contributing!

madkins23 commented

I had to change the URL above to:

due to limitations on file names in sum.golang.org/lookup.

madkins23 commented

@snqk: The "score" numbers on that URL are a somewhat misleading. They are rolled up using an algorithm I pulled out of my posterior. Much useful detail is lost. 😞

I added tables to the score page showing the actual time and memory numbers from the benchmark testing. Check the score page for dedup again and use the Macros, Allocs, and Bytes buttons to the upper right of the score table to see this additional detail.

snqk commented Aug 17, 2024

@madkins23: aye, I'm not sure how the score was calculated but it was easier to reference 😄. I did look at the rest of the benchmarks though.

While all features come at some cost, your benchmarks helped me discover a couple of performance-related bugs (thanks!). They're not straightforward to fix, but I believe they're what's causing the bloated numbers for some benchmarks, notably BigGroup.

Going back to the OP though, I don't think this needs to be offered as a feature of JSONHandler. Apart from the obvious performance cost, this problem applies equally to TextHandler, which isn't considered here (as others have previously pointed out).

Outside of convenience, I don't hold a strong opinion on this feature being offered as an independent handler in log/slog.

madkins23 commented

Glad the benchmarks helped, that's what they're for!

I agree that this doesn't need to be pushed into JSONHandler (or TextHandler, for that matter). Your wrapper approach seems appropriate for cases where this feature is needed.
