log/slog: JSONHandler should deduplicate keys #59365
Whatever solution is chosen here, I would argue that it should work even if the user explicitly provides duplicate names:
```go
logger.LogAttrs(
context.Background(),
slog.LevelWarn,
"fancy message",
slog.String("duplicate", "alpha"),
slog.String("duplicate", "bravo"),
slog.String("duplicate", "charlie"),
slog.String("duplicate", "delta"),
slog.String("duplicate", "echo"),
)
```
One possible solution is to use a "#%d" scheme, producing:
```json
{
"time": "2023-04-01T04:27:46.0959443-06:00",
"level": "WARN",
"msg": "fancy message",
"duplicate": "alpha",
"duplicate#01": "bravo",
"duplicate#02": "charlie",
"duplicate#03": "delta",
"duplicate#04": "echo",
}
```
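For illustration only, here is a minimal sketch of how such a renaming pass could be written over a record's attributes before they are encoded. Nothing like this exists in slog; the function name and the example in `main` are assumptions.

```go
package main

import (
	"fmt"
	"log/slog"
)

// renameDuplicates applies the "#%d" scheme: the first occurrence of a
// key keeps its name, and later occurrences get "#01", "#02", ... appended.
func renameDuplicates(attrs []slog.Attr) []slog.Attr {
	counts := make(map[string]int)
	out := make([]slog.Attr, 0, len(attrs))
	for _, a := range attrs {
		n := counts[a.Key]
		counts[a.Key] = n + 1
		if n > 0 {
			a.Key = fmt.Sprintf("%s#%02d", a.Key, n)
		}
		out = append(out, a)
	}
	return out
}

func main() {
	attrs := []slog.Attr{
		slog.String("duplicate", "alpha"),
		slog.String("duplicate", "bravo"),
	}
	for _, a := range renameDuplicates(attrs) {
		fmt.Println(a) // duplicate=alpha, then duplicate#01=bravo
	}
}
```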
I think the current situation is fine; duplicate keys aren't invalid, just less interoperable. As for the original issue, I'll note that option 3 was why WithGroup was added (though this doesn't affect deduplication in general).
Technically you are correct. Under RFC 8259, section 4, duplicate names are essentially treated as undefined behavior. In a possible future major version of "encoding/json", we will likely make duplicate keys an explicit error (which is still compliant behavior under RFC 8259), as this lack of checking has been exploited, resulting in severe security vulnerabilities (e.g., CVE-2017-12635). It's unlikely that duplicate names will have such drastic security implications in logging, but strict checking of duplicate names is likely the overall direction that JSON will gradually move towards. As an aside, Tim Bray, the author of RFC 8259, actually recommends against naïve compliance with RFC 8259, and instead suggests that implementations adhere to RFC 7493, which makes strict decisions on issues that RFC 8259 leaves undefined.
I think it's important to understand whether encoding/json v2 with its currently proposed semantics is likely to be the accepted version. We should probably align slog with those semantics, even if slog appears in Go before encoding/json v2. We're already doing that for formatting NaNs, infinities and, probably, durations (#59345). For this issue, if we think no dups will be the default for encoding/json v2, then I wouldn't be opposed to making it the default in slog and adding an option to allow them.
That I can't really answer, unfortunately. We haven't even started the GitHub discussion on anything related to v2 "encoding/json" and may not be ready to do so for some months. A significant piece of work that remains is figuring out the inter-compatibility story between v1 and a hypothetical v2. That's perhaps the hardest part and requires quite a lot of head-scratching. That said, if there is a v2 "encoding/json", I feel the security arguments for rejecting duplicate names by default are pretty compelling. The (very) tentative plan for v2 is that v1 "encoding/json" will actually be implemented in terms of v2, with the appropriate set of options specified to make it behave semantically identical to what v1 "json" does today. Essentially, with the appropriate set of options, you can configure the semantics to be any hybrid mix in between the default behavior of v1 and the default behavior of v2. Since "log/slog" already depends on the v1 semantics of "encoding/json", you could imagine doing nothing to address this issue for now. I don't think we should block adoption of "slog" on a hypothetical v2 "json". Nor should we overly massage the output of "slog" to match a hypothetical v2 "json". Supposing v2 "json" one day lands, you could imagine that v2 "json" be integrated into "slog" by exposing something like:
```go
package slog

type HandlerOptions struct {
// JSONOptions is a list of Option
// that configures how JSON is marshaled.
JSONOptions []json.Option
}
```
To log each message, "slog" would do something like:
```go
// Start with DefaultV1Options since "slog" has historically
// marshaled with mostly v1 semantics by default.
opts := []json.Option{json.DefaultV1Options}
// We have historically specified Encoder.SetEscapeHTML(false).
// By passing a nil escape function, we achieve the same behavior.
opts = append(opts, json.EscapeRune(nil))
// Append any user-specified options,
// which will override any previously set options.
opts = append(opts, h.opts.JSONOptions...)
return json.Marshal(v, opts...)
```
(You'll note that the options API for "json" above is different from what currently exists in "go-json-experiment/json". I mentioned elsewhere in #59339 that we're considering switching to variadic options, as we see it as one of the ways to resolve the v1-to-v2 compatibility tensions.) Thus, for someone to use v2 "json" with "slog", they would do something like:
```go
slog.HandlerOptions{
JSONOptions: []json.Option{json.DefaultV2Options},
}.NewJSONHandler(...)
```
Thanks, @dsnet, that makes a lot of sense.
It would be very unexpected if repeated calls to Log with the same key produced output containing that key more than once. Structured logging defines a concept of a log event which is conceptually above any specific encoding like JSON or logfmt or et cetera. That concept of a log event is well-defined by convention, and that definition doesn't accommodate repeated keys.
Possibly, but there are no good answers when it comes to duplicate names. It would be nice if the API made it statically impossible to have duplicates, but that isn't the case. Given the possibility of runtime duplicates, your options are roughly:

1. Emit the duplicates as-is.
2. Drop all but one of the duplicates (e.g., last one wins).
3. Rename the duplicates (e.g., with a "#%d" suffix).
4. Report an error to the caller: (4a) while still emitting the record, or (4b) while dropping it.
With option 1, you risk the log parser later not being able to read the message entirely or accurately. Duplicate names result in undefined behavior where the parser might ignore duplicates, only use the first one, only use the last one, or use some merged form of all duplicates. With option 2, you avoid any ambiguities when reading, but are subtly dropping information when writing. With option 3, you avoid any ambiguities as to the meaning when reading, but are changing the meaning when writing. At least the meaning when reading is consistent across all parsers. Option 4a is decent as it allows the caller to at least realize that they're possibly holding the API wrong, but I suspect most people will not check the error of logging calls. Of all the bad options, option 4a with either 1 or 3 is probably the best approach.
Option 2 is pretty clearly the correct choice.
I'm not sure why anyone would expect a structured logging library to never discard information. Emitting multiple structured log events with the same key is pretty clearly going to resolve to a single "winner" for that key, most likely the last event, but really any event would be acceptably valid.
That's a matter of opinion. Just as you find option 3 an odd position, I find option 2 an odd position. Every option has pros and cons.
Any value passed to a logging call is expected to show up in the log output.
Values passed to a structured logging library necessarily include a key which is, by definition, understood as unique within the target log event. So, no: it's definitely not the case that calling a structured logging method like LogAttrs can be assumed to produce any specific log output. Structured logging is, hopefully obviously, not equivalent to log.Printf, or fmt.Printf, or anything like that.
@dsnet For option 4, how would an error be reported? A separate log line at warning level, or some out-of-band channel or something? Personally I would have probably expected option 2, where the latest value wins (except that the Message, Time, Level, and Source fields would always win over any Attribute values). As for my preferences: option 1 would break a lot of log aggregation systems.
By technical definition, I agree. However, in practice, duplicates often occur due to human error. If I make a mistake, I would rather still have all duplicates logged (whether with option 1 or 3). As an aside, we should definitely create a vet check that detects accidental duplicate names.
I was thinking of an explicit error returned by the Handler's Handle method.
I'm not sure how to evaluate whether or not a particular mistake made by consumers of an API is common enough to warrant changing the API to accommodate that mistake. Especially so if that change would violate the intuitive expectations of the API, as is the case here.
I agree; I think a map is at least very close to the ideal data structure here. Regarding performance, slog's current internal buffer pool uses
```go
type Buffer []byte
```
It's possible to instead pool a
```go
type Buffer struct {
buf []byte
keys map[string]bool
group []string
}
```
On freeing the Buffer, the keys map and group slice can be cleared so the whole thing can be reused.
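For illustration, here is a rough sketch of what pooling that extended buffer might look like, assuming a sync.Pool as slog's existing internal buffer uses. The package and identifier names are made up, not slog internals.

```go
package buffer

import "sync"

// Buffer bundles the output bytes with per-record bookkeeping:
// which keys have already been written and the current group path.
type Buffer struct {
	buf   []byte
	keys  map[string]bool
	group []string
}

var pool = sync.Pool{
	New: func() any { return &Buffer{keys: make(map[string]bool)} },
}

func New() *Buffer { return pool.Get().(*Buffer) }

// Free resets the buffer for reuse. Deleting the map's entries keeps its
// allocated storage, so steady-state logging avoids reallocating it.
func (b *Buffer) Free() {
	b.buf = b.buf[:0]
	for k := range b.keys {
		delete(b.keys, k)
	}
	b.group = b.group[:0]
	pool.Put(b)
}
```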
A contrasting approach: I think a …
(I really don't have any opinion on what's necessary here. I don't see much need for deduplication personally, and prior art in structured logging in Go doesn't deduplicate AFAIK, yet the arguments for deduplication seem pretty reasonable.)
Would that be able to detect duplicates in sub-packages though? I feel like the most common occurrence of duplicates will be adding attributes and telemetry before passing the logger down to libraries or other sub-packages.
Right now, errors coming from Handler are discarded by the most common slog methods (i.e., Info, WarnCtx, etc.). I certainly wouldn't want to change the signature of Info, WarnCtx, etc., to return the error. I suppose it is possible to use a middleware-type handler that captures and does something with errors, but that wouldn't be something people would use by default.
Depends on the complexity of call-tracking. This is a similar problem to the vet checks for …
One option might be to rename duplicated keys, e.g. with the "#%d" suffix scheme mentioned above.
Another argument against option 2 (which discards data) is security. If untrusted input is able to affect what names are present, then that can be used as a mechanism to hide malicious activity, knowing that it will not be logged. IMO, it seems better for readers to possibly choke on the malformed messages than for the malicious activity to not be properly logged. A human investigator looking at the malformed message can see the duplicate-key shenanigans at play.
There are many additional (and unavoidable) security concerns in a structured logging API beyond this one, often related to how values are rendered. It is always an error to pass untrusted input to a structured logger directly.
Like what? The other security issue I can imagine is a DoS attack where you're tricked into logging something massive, but that's a distinctly different class of security issue from being able to hide activity or perform remote code execution.
I'm not sure I understand. Suppose we used a Go map such that you did something like:
```go
slog.Info(map[string]any{
"msg": "my wonderful message",
"other": "something else",
})
```
Yes, you get the benefit of compile-time checking preventing duplicate map keys (although it can't check the implicit "time" and "level" keys). However, I'm missing the part where the pool helps out.
I don't think we need to input maps to achieve option (2) or (3); I should clarify. This technique doesn't involve any change that isn't internal to the handler. The map can be populated with the keys visited as the handler traverses the attributes of the record, writing to output. (A couple of quirks: the traversal is made more interesting and dynamic by things like group Attrs.) If this map is populated with visited keys, we can detect whether some key has already been committed earlier in the log line, and trigger behavior option (2) or (3) as a result. Big picture, I'm not enthusiastic about this as a solution, but I was surprised that it seemed more performant than I would've guessed.
Consider a type which renders one way in some version X, and a different (and richer) way in a subsequent version Y. Consider the case where that change results in sensitive information from that type, previously not logged, becoming logged. This happens easily if the type grows a new sensitive field in a new version. Such changes are in practice never considered breaking in terms of versioning. A downstream application which logs a value of that type will have one behavior if it includes version X of the relevant package, and a different behavior when it includes version Y, and that version upgrade is in practice never going to be noticed, or tested, by the downstream application. Printf-based logging does not generally have this problem, because it is not conventional to pass values directly as Printf log arguments.
FWIW, we once used a database driver that logged our database password in plain text when in debug mode, using regular old Printf (it was patched right away). I don't really see any security issues with structured logging that don't already exist with regular logging. And it's not like this is log4j and we are executing the statements in the logs or trying to resolve remote XML or something. If the data is strings, the worst case I can think of is accidental logging of sensitive data.
A few points: @veqryn's specific problem, duplicate "source" keys, will be easy to resolve with a ReplaceAttr function once #59280 is approved. You can then distinguish a "source" attribute that comes from the handler by seeing if the value is a slog.Source. In general, if you don't like duplicate keys, you can write a wrapping Handler. That can return an error if you like, but it's worth remembering that logging is often the last resort for debugging in the wild. If you've ever been in a position where you can't debug because you have a picky logger that refuses to output everything it has, you know how frustrating that can be, and how happily you'd jettison your principles of careful error detection and failing fast for just a few more characters on the screen. This isn't a proposal, but if it were I'd vote to decline. We can always add dedup functionality later (though not as the default, it's true). As for json v2 compatibility, @dsnet has provided a good blueprint for how to achieve that when the time comes. But you go into the standard library with the packages you have, not the packages you wish you had.
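To make the first point concrete, here is a rough sketch using the slog API as released: ReplaceAttr can rename a caller-supplied "source" attribute, since the handler's own source attribute carries a *slog.Source value. The replacement key "source_user" and the example attribute value are illustrative choices, not anything from this thread.

```go
package main

import (
	"log/slog"
	"os"
)

func main() {
	h := slog.NewJSONHandler(os.Stdout, &slog.HandlerOptions{
		AddSource: true,
		ReplaceAttr: func(groups []string, a slog.Attr) slog.Attr {
			if len(groups) == 0 && a.Key == slog.SourceKey {
				// The handler's own source attribute carries a *slog.Source;
				// anything else came from the caller and gets renamed.
				if _, ok := a.Value.Any().(*slog.Source); !ok {
					a.Key = "source_user"
				}
			}
			return a
		},
	})
	slog.New(h).Info("fancy message", "source", "billing-service")
	// Emits both "source" (the handler's) and "source_user" (the caller's).
}
```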
Duplicate keys can be lost if the processor doesn't accept duplicates, but they can also be lost if the processor expects …
@carlmjohnson Not just keys; the whole log line can be dropped as invalid JSON, if the JSON parser considers duplicate keys to make an object invalid.
@proachac: There are no good answers when it comes to duplicate names. You mentioned that you view your logs through Grafana. What is the underlying data store that you're using with Grafana, and does it at least store JSON objects with duplicate names? In particular, I'm trying to understand whether this is a storage problem or a reading problem.
@dsnet I do not have a particular problem; what I have is a concern about an imaginary scenario. ECMA says that the latest wins.
Are there any other commonly-used JSON parsers that will fail to parse objects with duplicate names?
JavaScript, the father of JSON, is definitely not one of them. Maybe it's not a good strategy to be more Catholic than the Pope.
In "encoding/json/v2", we are aiming to reject duplicate names by default, but make it fairly easy to opt-out of the strict check. The decision to reject duplicate names is driven by various security exploits that have occurred over the years that depended on abusing the unspecified behavior that occurs with duplicate names (across a variety of JSON implementations in various languages). See CVE-2017-12635 as one such example. This particular security stance doesn't benefit logging as much, since the content of a log message is rarely used to alter some critical piece of control flow. However, JSON is used for so much more than just logging (e.g., RPCs), that it is safer to be secure by default even if it is the more annoying default for logs analysis. Also, even if various libraries allow for duplicate names, the behavior of how duplicates are handled varies widely. It is not enough to just ask the question of whether it is permissible. In the CVE mentioned above, Erland and JavaScript disagreed about the semantic meaning of a JSON object with duplicate names. In a similar way, permitting duplicate names, means that you may end up with logs analysis results that differ (perhaps wildly) depending on what JSON implementation you happened to use. |
I think this is not just a JSON topic. The log/slog package should be explicit about the semantics of duplicate keys provided via log.Log(…) or introduced via log.With(…).Log(…). With well-defined semantics, each Handler must find its own strategy to implement them. Otherwise, the information in a log would depend on the choice of log Handler, not on what was provided by the application that uses the logger.
I stumbled onto this problem as well. My personal feeling is that duplicate fields in JSON are bad (many arguments above) and should be documented as such. The simplest solution is #2 (latest wins), though there may be unpredictability in "latest" in some situations. Or perhaps, if field "alpha" repeats, turn it into an array of all the values.
* fixes incorrect WithAttrs and WithGroup behaviour
* Deduplicate attributes with the same key golang/go#59365
Perhaps a somewhat different problem, but my use case required merging/overlaying a new set of attributes. Consider a record like:
```json
{
"time": "2009-11-10T23:00:00Z",
"level": "INFO",
"msg": "hello_world",
"uuid": "3a48d92f-8bf9-4633-8d35-682618412afa",
"runtime": {
"lifecycle": {
"retry": {
"total": 4,
"timeout_seconds": 10,
"attempt": 1
}
}
},
"runtime": {
"lifecycle": {
"retry": {
"attempt": 2
},
"duration": {
"nanoseconds": 123456
}
}
},
"throughput_bps": 59,
"throughput_bps": 76
}
```
A potential merge would overlay the updated values:
```json
{
"time": "2009-11-10T23:00:00Z",
"level": "INFO",
"msg": "hello_world",
"uuid": "3a48d92f-8bf9-4633-8d35-682618412afa",
"runtime": {
"lifecycle": {
"retry": {
"total": 4,
"attempt": 2,
"timeout_seconds": 10
},
"duration": {
"nanoseconds": 123456
}
}
},
"throughput_bps": 76
}
```
I also experimented with implementing this; I've created a Handler (https://github.com/snqk/slog-meld) based on this approach, in case anyone is looking for something similar. I'd appreciate any constructive feedback!
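A rough sketch of the recursive merge such a handler might perform on a flat attribute list, assuming later attributes override earlier ones and groups with the same name are merged member by member. This is not the slog-meld code, just an illustration of the idea.

```go
package main

import (
	"fmt"
	"log/slog"
)

// mergeAttrs merges attributes by key: for scalar values the last one wins,
// and two groups with the same key have their members merged recursively,
// preserving first-seen key order.
func mergeAttrs(attrs []slog.Attr) []slog.Attr {
	index := make(map[string]int) // key -> position in out
	var out []slog.Attr
	for _, a := range attrs {
		i, ok := index[a.Key]
		if !ok {
			index[a.Key] = len(out)
			out = append(out, a)
			continue
		}
		prev := out[i]
		if prev.Value.Kind() == slog.KindGroup && a.Value.Kind() == slog.KindGroup {
			members := append(append([]slog.Attr{}, prev.Value.Group()...), a.Value.Group()...)
			out[i] = slog.Attr{Key: a.Key, Value: slog.GroupValue(mergeAttrs(members)...)}
		} else {
			out[i] = a // later value wins
		}
	}
	return out
}

func main() {
	attrs := []slog.Attr{
		slog.Group("retry", slog.Int("total", 4), slog.Int("attempt", 1)),
		slog.Group("retry", slog.Int("attempt", 2)),
	}
	for _, a := range mergeAttrs(attrs) {
		fmt.Println(a) // e.g. retry=[total=4 attempt=2]
	}
}
```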
I haven't looked at the code in detail, but it seems reasonable. I would test your handler with testing/slogtest.
I think that deduplication is expensive; that's not what a log formatter is for.
In a previous post on this issue I was pushing for no duplicates in logged JSON. Having since spent some quality time with various handlers, I have compared the two deduplicating handlers mentioned in this thread. Both of them work fine (or as well as the wrapped handlers, anyway). Note the performance cost in both cases.
Thanks @madkins23. Overall I'd agree that merging/deduplication should not be offered as a default. It's also important to note that deduplication ⊂ merge. However, I disagree that merging is that expensive: the benchmarks you posted show the meld logger isn't significantly behind the default logger (~97 vs ~87).
Expensive is subjective. My goal is to provide data that will help developers choose the appropriate solution for the requirements of a particular project. Your handler is a useful and reasonably efficient option. Thanks for contributing!
I had to change the URL above due to limitations on file names.
@snqk: The "score" numbers on that URL are a somewhat misleading. They are rolled up using an algorithm I pulled out of my posterior. Much useful detail is lost. 😞 I added tables to the score page showing the actual time and memory numbers from the benchmark testing. Check the score page for dedup again and use the |
@madkins23: aye, I'm not sure how the score was calculated, but it was easier to reference 😄. I did look at the rest of the benchmarks though. While all features come at some cost, your benchmarks helped me discover a couple of performance-related bugs (thanks!). They're not as straightforward to fix, but I believe that's what's causing the bloated numbers for some benchmarks. Going back to the OP though, I don't think this needs to be offered as a feature of log/slog's built-in handlers. Outside of convenience, I don't hold a strong opinion on this feature being offered as an independent handler elsewhere.
Glad the benchmarks helped, that's what they're for! I agree that this doesn't need to be pushed into log/slog itself.
What version of Go are you using (go version)?
8edcddd
Does this issue reproduce with the latest release?
yes
What did you do?
What did you expect to see?
JSONHandler should be fixed by deduplicating those special keys (time, message, level, and source).
I use "source" as a key in a lot of my code right now, so I'd run into this problem.
What did you see instead?
Invalid JSON. It would break our log aggregation. Default/built-in handlers should never create invalid output.
@jba
EDIT:
Hey all, I've created a logging handler middleware that will deduplicate attribute keys, before they go to the json logger.
Please check it out: https://github.com/veqryn/slog-dedup
go get github.com/veqryn/slog-dedup
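For readers who want the general shape of such a middleware without pulling in a dependency, a rough sketch follows. This is not the slog-dedup code: it only deduplicates the attributes attached directly to one record, ignores WithAttrs and groups, and assumes last-one-wins.

```go
package main

import (
	"context"
	"log/slog"
	"os"
)

// dedupHandler wraps another Handler and keeps only the last attribute
// for each key attached directly to a record.
type dedupHandler struct {
	slog.Handler
}

func (h dedupHandler) Handle(ctx context.Context, r slog.Record) error {
	seen := make(map[string]int) // key -> index into attrs
	var attrs []slog.Attr
	r.Attrs(func(a slog.Attr) bool {
		if i, ok := seen[a.Key]; ok {
			attrs[i] = a // last value wins
		} else {
			seen[a.Key] = len(attrs)
			attrs = append(attrs, a)
		}
		return true
	})
	// Rebuild the record with the deduplicated attribute list.
	nr := slog.NewRecord(r.Time, r.Level, r.Message, r.PC)
	nr.AddAttrs(attrs...)
	return h.Handler.Handle(ctx, nr)
}

func main() {
	logger := slog.New(dedupHandler{Handler: slog.NewJSONHandler(os.Stdout, nil)})
	logger.Info("fancy message", "duplicate", "alpha", "duplicate", "echo")
	// The JSON output contains "duplicate":"echo" only.
}
```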