encoding/json: allow per-Encoder/per-Decoder registration of marshal/unmarshal functions #5901

Open
rsc opened this issue Jul 17, 2013 · 47 comments

Comments

@rsc
Contributor

@rsc rsc commented Jul 17, 2013

For example, if a user wants to marshal net.IP with custom code, we should provide a way
to do that, probably a method on *Encoder. Similarly for *Decoder.

Same for encoding/xml.
@robpike
Contributor

@robpike robpike commented Aug 20, 2013

Comment 1:

Labels changed: removed go1.3.

@rsc
Contributor Author

@rsc rsc commented Nov 27, 2013

Comment 2:

Labels changed: added go1.3maybe.

@rsc
Contributor Author

@rsc rsc commented Dec 4, 2013

Comment 3:

Labels changed: added release-none, removed go1.3maybe.

@rsc
Contributor Author

@rsc rsc commented Dec 4, 2013

Comment 4:

Labels changed: added repo-main.

@rsc rsc added this to the Unplanned milestone Apr 10, 2015
@odeke-em
Member

@odeke-em odeke-em commented Jan 2, 2017

I did some work on this sometime back in October, with CL https://go-review.googlesource.com/c/31091 to get the conversation started.
Perhaps we can work on it for Go1.9.

In there I introduced Encoder.RegisterEncoder

func (enc *Encoder) RegisterEncoder(t reflect.Type, fn func(interface{}) ([]byte, error))

and Decoder.RegisterDecoder

func (dec *Decoder) RegisterDecoder(t reflect.Type, fn func([]byte) (interface{}, error))

@sebastien-rosset

@sebastien-rosset sebastien-rosset commented Sep 14, 2017

We have a similar requirement for custom serialization. Our specific use cases are:

  1. Data anonymization. For example, omit IP addresses in the JSON output.
  2. Data masking. For example, omit specific fields depending on user privileges.
  3. Omit empty values (e.g. empty strings). This is useful to generate compact documents.
In all three cases, the decision to omit fields is made at runtime using contextual information.

We are using CiscoM31@1e9514f. In our case, the client interface is implemented by doing a simple lookup in a map, so there is no need to register hundreds of custom marshallers (we have lots of structs).
We could have used a technique similar to https://go-review.googlesource.com/c/31091.

@harrisonhjones

@harrisonhjones harrisonhjones commented Dec 16, 2019

@rsc is there still interest in this feature on your end? I have a use-case for it and would be happy to implement it.

[Edit 1] Spelling
[Edit 2]

I wonder if the function signature should be

func (enc *Encoder) RegisterMarshaller(t reflect.Type, f func(reflect.Value) ([]byte, error))

with all standard Encoders exposed so that users can leverage them. Example redactor:

package main

import (
	"bytes"
	"encoding/json"
	"os"
	"reflect"
	"strings"
	"unicode/utf8"
)

func main() {
	enc := json.NewEncoder(os.Stdout)
	text := "My password, foo, is totally secure"

	enc.RegisterMarshaller(reflect.TypeOf(""), StringMarshaller)
	enc.Encode(text)
	// Output
	// "My password, foo, is totally secure"

	enc.RegisterMarshaller(reflect.TypeOf(""), func(value reflect.Value) ([]byte, error) {
		return StringMarshaller(reflect.ValueOf(strings.Replace(value.String(), "foo", "[REDACTED]", -1)))
	})
	enc.Encode(text)
	// Output
	// "My password, [REDACTED], is totally secure"
}

// Largely taken from `func (e *encodeState) string(s string, escapeHTML bool)` in `encoding/json/encode.go`.
// This would exist in encoding/json.
func StringMarshaller(value reflect.Value) ([]byte, error) {
	e := bytes.Buffer{}
	s := value.String()
	escapeHTML := false // TODO: Refactor StringEncoder into a 'htmlEscaping' one and a non 'htmlEscaping' one.

	e.WriteByte('"')
	start := 0
	for i := 0; i < len(s); {
		if b := s[i]; b < utf8.RuneSelf {
			if json.HTMLSafeSet[b] || (!escapeHTML && json.SafeSet[b]) {
				i++
				continue
			}
			if start < i {
				e.WriteString(s[start:i])
			}
			e.WriteByte('\\')
			switch b {
			case '\\', '"':
				e.WriteByte(b)
			case '\n':
				e.WriteByte('n')
			case '\r':
				e.WriteByte('r')
			case '\t':
				e.WriteByte('t')
			default:
				// This encodes bytes < 0x20 except for \t, \n and \r.
				// If escapeHTML is set, it also escapes <, >, and &
				// because they can lead to security holes when
				// user-controlled strings are rendered into JSON
				// and served to some browsers.
				e.WriteString(`u00`)
				e.WriteByte(json.Hex[b>>4])
				e.WriteByte(json.Hex[b&0xF])
			}
			i++
			start = i
			continue
		}
		c, size := utf8.DecodeRuneInString(s[i:])
		if c == utf8.RuneError && size == 1 {
			if start < i {
				e.WriteString(s[start:i])
			}
			e.WriteString(`\ufffd`)
			i += size
			start = i
			continue
		}
		// U+2028 is LINE SEPARATOR.
		// U+2029 is PARAGRAPH SEPARATOR.
		// They are both technically valid characters in JSON strings,
		// but don't work in JSONP, which has to be evaluated as JavaScript,
		// and can lead to security holes there. It is valid JSON to
		// escape them, so we do so unconditionally.
		// See http://timelessrepo.com/json-isnt-a-javascript-subset for discussion.
		if c == '\u2028' || c == '\u2029' {
			if start < i {
				e.WriteString(s[start:i])
			}
			e.WriteString(`\u202`)
			e.WriteByte(json.Hex[c&0xF])
			i += size
			start = i
			continue
		}
		i += size
	}
	if start < len(s) {
		e.WriteString(s[start:])
	}
	e.WriteByte('"')

	return e.Bytes(), nil
}

@dsnet
Member

@dsnet dsnet commented Dec 18, 2019

I have a use-case for this as well, but it's a bit specialized. Essentially:

  1. There is a type that I own (as a library author) that is used by many users where they pass it through encoding/json when they shouldn't.
  2. I would like to make changes to my type, but it breaks these users. I have no intention to support the use of this type with encoding/json, but I can't just break these users since there are a sufficient number of them.
  3. I would like to use the feature proposed here to migrate these users to an implementation of marshal/unmarshal that preserves the exact behavior as it exists today (even if it's buggy and wrong).
  4. This now frees me to make the changes I want to make.

For prior art, the cmp package has a Comparer option that allows the caller to override the comparison of any specific type in the tree. The ability for users to specify custom comparisons has proven itself to be immensely powerful and flexible.

Even though I'd like to have something like this, I do have some concerns:

  • It is already the case that users often complain that encoding/json is too slow. The addition of this feature may make things slower. For cmp, the need to check whether any given value node matches a custom comparer is a significant source of slow down.
  • A decision needs to be made whether the type override only supports concrete types or interfaces. Concrete types are easier to implement efficiently. However, supporting interfaces provides significant flexibility, but also brings in significant implementation and performance costs.
  • What's the expected behavior when json.Marshal encounters a type T, and the type override has a custom marshaler specified for *T? If the value is addressable, then it makes sense to address the value and call the override function. What if the value is not addressable?
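
The addressability question in the last bullet can be observed directly with the reflect package. The sketch below uses a made-up point type; it only shows when reflect.Value.CanAddr reports true, which is the condition under which an encoder could take the address of a T in order to call an override registered for *T.

```go
package main

import (
	"fmt"
	"reflect"
)

// point is an arbitrary example type.
type point struct{ X, Y int }

// addressability reports whether a point is addressable when reached
// (1) by value, (2) through a pointer, (3) as a field of a pointed-to
// struct, and (4) as a map element.
func addressability() (byValue, viaPointer, field, mapElem bool) {
	p := point{1, 2}
	m := map[string]point{"a": p}

	byValue = reflect.ValueOf(p).CanAddr()           // a copy stored in an interface
	viaPointer = reflect.ValueOf(&p).Elem().CanAddr() // the variable p itself
	field = reflect.ValueOf(&p).Elem().Field(0).CanAddr()
	mapElem = reflect.ValueOf(m).MapIndex(reflect.ValueOf("a")).CanAddr()
	return
}

func main() {
	fmt.Println(addressability()) // false true true false
}
```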

@harrisonhjones

@harrisonhjones harrisonhjones commented Dec 18, 2019

Thanks for the response! Some comments below:

It is already the case that users often complain that encoding/json is too slow. The addition of this feature may make things slower. For cmp, the need to check whether any given value node matches a custom comparer is a significant source of slow down.

I'd have to run a benchmark but I would expect a map lookup to be fairly quick. Perhaps others are concerned with a different scale of "slow" than I.

A decision needs to be made whether the type override only supports concrete types or interfaces. Concrete types are easier to implement efficiently. However, supporting interfaces provides significant flexibility, but also brings in significant implementation and performance costs.

Interesting. I hadn't considered supporting interfaces. At the moment I only need concrete type overrides but if there are users out there that would benefit from an interface check I'd be willing to at least prototype it.

[Edit] Perhaps we could rename RegisterMarshaller to RegisterConcreteMarshaller and introduce RegisterInterfaceMarshaller later if users needed it?

What's the expected behavior when json.Marshal encounters a type T, and the type override has a custom marshaler specified for *T? If the value is addressable, then it makes sense to address the value and call the override function. What if the value is not addressable?

This actually came up in a discussion with a colleague. My preference would be to require users who want to custom marshal both T and *T to declare both marshaler overrides. You could use a single custom marshaler, but you would still have to register both overrides.

...
var foo = ""
enc.RegisterMarshaller(reflect.TypeOf(foo), StringMarshaller)
enc.RegisterMarshaller(reflect.TypeOf(&foo), StringMarshaller)
...

func StringMarshaller(value reflect.Value) ([]byte, error) {
    ... check if value is a *String, if so deref ...
    ...
}

Alternatively, perhaps we modify the signature of RegisterMarshaller to be something like:

func (enc *Encoder) RegisterMarshaller(reflect.Type, bool, func(reflect.Value) ([]byte, error))

where, if the bool were true, it would match both T and *T, and if it were false, it would only match T or *T, depending on the passed-in type. It looks a bit yucky to me, though it could be made to look better if we used the optional pattern.

@harrisonhjones

@harrisonhjones harrisonhjones commented Dec 30, 2019

@dsnet I imagine you might be busy with the holidays but I wanted to give you a friendly ping on this. Any thoughts on my response?

@gopherbot

@gopherbot gopherbot commented Dec 31, 2019

Change https://golang.org/cl/212998 mentions this issue: encoding/json: implement type override for serialization

@dsnet
Member

@dsnet dsnet commented Dec 31, 2019

While the logic certainly uses reflect under the hood, I'm not sure if we should expose that in the public API. I propose the following API instead:

// RegisterFunc registers a custom encoder to use for specialized types.
// The input f must be a function of the type func(T) ([]byte, error).
//
// When marshaling a value of type R, the function f is called
// if R is identical to T for concrete types or
// if R implements T for interface types.
// Precedence is given to registered encoders that operate on concrete types,
// then registered encoders that operate on interface types
// in the order that they are registered, then the MarshalJSON method, and
// lastly the default behavior of Encode.
//
// It panics if T is already registered or if interface{} is assignable to T.
func (e *Encoder) RegisterFunc(f interface{})

// RegisterFunc registers a custom decoder to use for specialized types.
// The input f must be a function of the type func([]byte, T) error.
//
// When unmarshaling a value of type R, the function f is called
// if R is identical to T for concrete types or
// if R implements T for interface types.
// Precedence is given to registered decoders that operate on concrete types,
// then registered decoders that operate on interface types
// in the order that they are registered, then the UnmarshalJSON method, and
// lastly the default behavior of Decode.
//
// It panics if T is already registered or if interface{} is assignable to T.
func (d *Decoder) RegisterFunc(f interface{})

Arguments for this API:

  • Most users of this API probably do not want to deal with reflect.Type and reflect.Value, but rather want to deal with concrete types. For this reason, json.Unmarshal takes in an interface{} rather than a reflect.Value.
  • Using interface{} loses type safety, but the other proposed signatures also have type safety issues. For a signature like: func (enc *Encoder) RegisterMarshaller(t reflect.Type, f func(reflect.Value) ([]byte, error)), we still can't statically guarantee that t is the same type as the type that function f expects as its input.
  • It matches what's done in other APIs that also have type-safety issues. For example, sort.Slice takes in an interface{} with the requirement that it be a slice kind, rather than a reflect.Value.

The proposed API combined with the ability to handle interfaces, allows you to do something like:

e := json.NewEncoder(w)
e.RegisterFunc(protojson.Marshal)
e.Encode(v)

where protojson.Marshal is a function that matches the expected function signature. It enables the standard encoding/json package to now be able to properly serialize all proto.Message types without forcing or expecting every concrete proto.Message type to implement the json.Marshaler interface.

Since this is something I need for my other work, I uploaded a CL with my prototype implementation that I've been testing with actual code.

Some additional thoughts:

  • One concern is that json.Decoder.DisallowUnknownFields specifies a top-level option that should in theory affect the results of all unmarshal operations recursively. It is already a problem today that custom json.Unmarshaler implementations do not properly respect this option (and has actually been a real problem). It's unfortunate that we're adding another way where options are not properly propagated downward. However, the severity of the problem is not as bad as custom json.Unmarshaler implementations since the location where options are specified is most likely also the place where custom encoder functions are registered, so options can be properly replicated at the top-level.
  • The implementation checks for custom functions to handle the current value type first, and if the value is addressable, also tries checking for a pointer to that value type. This matches the behavior of encoding/json for how it handles checking for json.Marshaler and json.Unmarshaler implementations.
  • The implementation does not call decoder functions if the receiver type is a pointer/interface and the input JSON is null. It also does not call encoder functions if the receiver type is a pointer/interface and is nil. This matches the behavior of encoding/json for how it handles json.Marshaler and json.Unmarshaler implementations.
  • I haven't benchmarked it yet, but I expect decoding to take a minimal performance hit and encoding to take no performance hit when there are no custom encoders/decoders.
  • Generics might enable greater type safety for the proposed API, but the current proposal for generics does not allow type parameterization at the granularity of individual methods.
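
The existing json.Marshaler behavior that the second and third bullets refer to can be reproduced with the current package. This sketch uses made-up token and holder types; it shows a pointer-receiver MarshalJSON being skipped for non-addressable values, used for addressable ones, and not called at all for a nil pointer.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// token and holder are illustrative types.
type token struct{ Secret string }

// MarshalJSON has a pointer receiver, so only *token implements json.Marshaler.
func (t *token) MarshalJSON() ([]byte, error) {
	return []byte(`"[REDACTED]"`), nil
}

type holder struct{ Tok token }

// encode returns the JSON encoding of v as a string.
func encode(v interface{}) string {
	b, err := json.Marshal(v)
	if err != nil {
		return "error: " + err.Error()
	}
	return string(b)
}

func main() {
	fmt.Println(encode(token{"s3cr3t"}))          // {"Secret":"s3cr3t"} (value is not addressable)
	fmt.Println(encode(&token{"s3cr3t"}))         // "[REDACTED]"
	fmt.Println(encode(holder{token{"s3cr3t"}}))  // {"Tok":{"Secret":"s3cr3t"}} (field is not addressable)
	fmt.Println(encode(&holder{token{"s3cr3t"}})) // {"Tok":"[REDACTED]"} (field is addressable via the pointer)

	var nilTok *token
	fmt.Println(encode(nilTok)) // null (MarshalJSON is not called)
}
```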

@dsnet dsnet changed the title from "encoding/json: allow override type marshaling" to "proposal: encoding/json: allow override type marshaling" Dec 31, 2019
@dsnet
Member

@dsnet dsnet commented Dec 31, 2019

Adding the Proposal label, as there are at least two proposed APIs and this issue proposes adding API to encoding/json.

@dmitshur dmitshur removed this from the Unplanned milestone Jan 1, 2020
@icholy

@icholy icholy commented Sep 8, 2020

  1. needs to also provide an API for saying; "thanks for calling me, but I actually don't know how to serialize this specific type; please try something else".

I'd imagine this is necessary either way if you're supporting interfaces.

@dsnet
Member

@dsnet dsnet commented Sep 8, 2020

I'd imagine this is necessary either way if you're supporting interfaces.

Can you provide an example?

@mvdan
Member

@mvdan mvdan commented Sep 8, 2020

You can imagine a sync.Map that is functionally a map[reflect.Type]bool. It caches the answer to "will type T implement the set of registered interface type overrides?" If T is in the cache, then the query is O(1), if not, then we perform a linear lookup over all registered interface type overrides (which I contend is usually close to 0).

I have to admit I'm still not getting this. Such a map will only tell you if a type T has a match in the set of interfaces, but not which interfaces T matches, right? You'd still have to check them one by one. This is better for the case where T matches none of the interfaces, but still, the feature doesn't seem O(1) to me.

For some cases, it would require registering dozens (if not hundreds) of concrete types. Even worse, it requires that the user manually keep the code that declares their types constantly in sync with where those types are used with JSON serialization.

I was imagining this would be for code-generated types (since you mentioned Protobuf), so I was going to suggest code generating the registration of concrete types too. Though, the more I think about the counter-argument, the less I like it.

Can you explain how this would be the case?

I was thinking of a scenario where an interface only has a few implementations which are all known at compile time. From a performance perspective, it's likely better to register the concrete types with a shared piece of code. I think you can ignore this point if you're confident that we can implement interface support without it being too expensive for reasonable use cases.

Side note: if you intend the CL's performance to remain as-is, I wonder if we should document that we don't recommend registering more than a handful of interface types on a single encoder or decoder.

@dsnet
Member

@dsnet dsnet commented Sep 8, 2020

I have to admit I'm still not getting this. Such a map will only tell you if a type T has a match in the set of interfaces, but not which interfaces T matches, right? You'd still have to check them one by one. This is better for the case where T matches none of the interfaces, but still, the feature doesn't seem O(1) to me.

A simple modification of the cache is to use a map[reflect.Type]int where the int is the index of the interface that we care about.

Side note: if you intend the CL's performance to remain as-is, I wonder if we should document that we don't recommend registering more than a handful of interface types on a single encoder or decoder.

Sounds reasonable.

@mvdan
Member

@mvdan mvdan commented Sep 8, 2020

Thanks for adding the context on the design and the thoughts on performance. I'll go back to Gerrit now :)

@harrisonhjones

@harrisonhjones harrisonhjones commented Sep 9, 2020

For what it's worth: I just reviewed https://go-review.googlesource.com/c/go/+/212998 and it looks good to me. Thanks for the work @dsnet

@odeke-em
Member

@odeke-em odeke-em commented Jan 11, 2021

Thank you everyone, and Happy New Year! It is late in the cycle and the CL didn’t get much movement: @mvdan left comments on it, but @dsnet has been super busy. With that, I am punting this to Go1.17.

@odeke-em odeke-em removed this from the Go1.16 milestone Jan 11, 2021
@odeke-em odeke-em added this to the Go1.17 milestone Jan 11, 2021
@gopherbot

@gopherbot gopherbot commented Feb 24, 2021

This issue is currently labeled as early-in-cycle for Go 1.17.
That time is now, so a friendly reminder to look at it again.

@dsnet
Member

@dsnet dsnet commented Feb 24, 2021

@mvdan and I are doing a systematic review of the entire encoding/json API. We may delay the introduction of this to make sure it fits well with the overall direction we believe json should go.

@DeadlySurgeon

@DeadlySurgeon DeadlySurgeon commented Mar 17, 2021

  • Generics might enable greater type safety for the proposed API, but the current proposal for generics does not allow type parameterization at the granularity of individual methods.

I'm very late to the party, but I believe generics would probably be fine here, might reduce the amount of reflection needed, and like you said would enable greater type safety for the proposed API.

As for the type parameterization, I thought up something like this.

// Encoder
func MarshalIP(ip *net.IP) ([]byte, error) { ... }
func Register[T any](f func(T) ([]byte, error)) { ... }

// Decoder
func UnmarshalIP(data []byte) (*net.IP, error) { ... }
func Register[T any](f func([]byte) (T, error)) { ... }
// or
func UnmarshalIP(data []byte, ip *net.IP) error { ... }
func Register[T any](f func([]byte, T) error) { ... }

You'd obviously still need reflection to get the typeof out of T, but at least then we can ensure that what's being passed is of type func during compilation. You could do this by getting the type of f, and then pulling the parameter T out of that, or, you could just make a new T and throw that into reflect.TypeOf.

@neild
Contributor

@neild neild commented Aug 10, 2021

@mvdan and I are doing a systematic review of the entire encoding/json API. We may delay the introduction of this to make sure it fits well with the overall direction we believe json should go.

How's this going?

If this doesn't conflict with the overall package direction, it'd be nice to try to get this in for 1.18.

@dsnet
Member

@dsnet dsnet commented Aug 10, 2021

I apologize, I got side-tracked from json work quite a bit the past few months, but I'm starting to get back into it more.

@dsnet
Member

@dsnet dsnet commented Aug 28, 2021

In implementing this for v2, I think we may want to defer this feature until the release of generics.

I propose the following alternative API:

// Marshalers is a list of functions that each marshal a specific type.
// A nil Marshalers is equivalent to an empty list.
// For performance, marshalers should be stored in a global variable.
type Marshalers struct { ... }

// NewMarshalers constructs a list of functions that marshal a specific type.
// Functions that come earlier in the list take precedence.
func NewMarshalers(...*Marshalers) *Marshalers

// MarshalFunc constructs a marshaler that marshals values of type T.
func MarshalFunc[T any](fn func(T) ([]byte, error)) *Marshalers

// Unmarshalers is a list of functions that each unmarshal a specific type.
// A nil Unmarshalers is equivalent to an empty list.
type Unmarshalers struct { ... }

// NewUnmarshalers constructs a list of functions that unmarshal a specific type.
// Functions that come earlier in the list take precedence.
// For performance, unmarshalers should be stored in a global variable.
func NewUnmarshalers(...*Unmarshalers) *Unmarshalers

// UnmarshalFunc constructs an unmarshaler that unmarshals values of type T.
func UnmarshalFunc[T any](fn func([]byte, T) error) *Unmarshalers

// WithMarshalers configures the encoder to use any marshaler in m
// that operate on the current type that the encoder is marshaling.
func (*Encoder) WithMarshalers(m *Marshalers)

// WithUnmarshalers configures the decoder to use any unmarshaler in u
// that operate on the current type that the decoder is unmarshaling.
func (*Decoder) WithUnmarshalers(u *Unmarshalers)

There are several advantages of this API:

  • Performance. Without generics, the implementation must go through reflect.Value.Call. Empirical testing using the tip implementation of generics shows that it is 4x faster than an implementation using reflect.Value.Call.
  • Type safety. The API that I proposed earlier provides no type safety to ensure that users are passing functions of the right signature.
  • Composability. Marshalers and Unmarshalers can be composed with one another. For example:
    m1 := NewMarshalers(f1, f2)
    m2 := NewMarshalers(f0, m1, f3) // equivalent to NewMarshalers(f0, f1, f2, f3)
  • Immutability. Marshalers and Unmarshalers are immutable once constructed. This allows the implementation to perform aggressive caching that would be complicated by the list of marshalers/unmarshalers changing underfoot. This would allow further performance benefits that my implementation for v1 fails to unlock.

Example usage:

var protoMarshalers = json.MarshalFunc(protojson.Marshal)

enc := json.NewEncoder(...)
enc.WithMarshalers(protoMarshalers)
... = enc.Encode(...)

Footnotes:

  • Credit goes to @rogpeppe for realizing that generics provides a significant performance benefit over Go reflection in this case.
  • The experiment was compiled using GOEXPERIMENT=unified since the compiler crashes otherwise.

/cc @mvdan

@neild
Contributor

@neild neild commented Aug 30, 2021

Generics are arriving in 1.18. Should we add this (with generics) in 1.18, or delay a release?

@dsnet
Member

@dsnet dsnet commented Aug 30, 2021

I keep forgetting that generics is coming imminently in 1.18. Assuming the newly proposed API is acceptable, I support a release for 1.18.

In the event that generics is delayed, we could go with the following signatures:

func MarshalFunc(fn interface{}) *Marshalers
func UnmarshalFunc(fn interface{}) *Unmarshalers

It must rely on Go reflection for type safety at runtime. We could add a corresponding go vet or go build check that statically enforces type-safety so that we can switch it to the generic version in a backwards compatible way.

@liggitt
Contributor

@liggitt liggitt commented Sep 15, 2021

// WithUnmarshalers configures the decoder to use any unmarshaler in u
// that operate on the current type that the decoder is unmarshaling.
func (*Decoder) WithUnmarshalers(u *Unmarshalers)

Using a Decoder adds ~10 additional allocs and ~doubles memory consumption for callers which already have []byte data over use of json.Unmarshal, due to the Decoder only accepting a reader and then immediately reading the data into an internal buffer. Is there a plan to make the configurable decode behavior (like these custom unmarshalers, or existing options like DisallowUnknownFields/UseNumber) accessible without such a significant performance penalty?

@dsnet
Member

@dsnet dsnet commented Sep 15, 2021

Is there a plan to make the configurable decode behavior (like these custom unmarshalers, or existing options like DisallowUnknownFields/UseNumber) accessible without such a significant performance penalty?

There's been work on what a theoretical v2 json could look like (if ever). That API supports setting options apart from the Decoder. Some or all of those ideas could be back-ported into the current json package.

Projects
Proposals
Accepted