Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement optional string interning #279

Merged
merged 2 commits into from
Apr 12, 2020
Merged

Conversation

CAFxX
Copy link
Contributor

@CAFxX CAFxX commented Apr 7, 2020

Uses josharian/intern to optionally perform string interning during unmarshaling.
Saves one allocation per string field.

Fixes #191

@GoWebProd
Copy link
Collaborator

Hello, thanks for PR.

But why this needed? I understand that we can save 1 alloc per string but we need more time for unmarshalling.

@CAFxX
Copy link
Contributor Author

CAFxX commented Apr 12, 2020

@GoWebProd an experience report is included in the linked issue #191:

in one of our internal apps that has to parse a lot of json (with a lot of repeated string values) and process it, we were able to slash memory usage by a significant amount (30% on the whole app, >75% just for the in-memory data coming from the parsed json) by using [string interning]

(furthemore, see also the ongoing discussion in golang/go#32779 about adding interning to encoding/json)

Basically it's the same reason why sometimes you want to use sync.Pool: you do some extra work for the sake of avoiding memory allocations.

Keep in mind that allocating memory and garbage collecting it also consume CPU. So while it's true you spend some more CPU time to do string interning, it's also true you save CPU time by not allocating and by being able to run GC less often, and on a smaller heap.

Keep in mind that this feature is opt-in on a field-by-field basis; so for easyjson users that don't opt-in nothing changes, and they get the usual behavior with no overhead. This is for the same reason why it makes no sense to provide a benchmark showing the CPU difference: such a benchmark would be useless as it completely depends on how frequent the values are:

  • if the value of the field is never the same between unmarshaling operations, you will just consume CPU time and use a bit more memory
  • if the value of the field is always the same (or one of N values, with N small) then memory usage/allocations will decrease and it is quite possible that even CPU time will decrease

As such, the only reasonable option is to let the user enable it, if needed - ideally doing their own benchmarks.

README.md Outdated Show resolved Hide resolved
@GoWebProd
Copy link
Collaborator

Ok, looks needed. But I do not like the current implementation of interning, I will consider alternatives and come back.

@GoWebProd
Copy link
Collaborator

@CAFxX please see my variant of interning

@CAFxX
Copy link
Contributor Author

CAFxX commented Apr 12, 2020

@GoWebProd what is it specifically you don't like about the old version? The old version is significantly simpler in josharian/intern, and it has already seen significant production usage.

Also AFAICT your version lacks any synchronization, so if two goroutines perform unmarshaling of an object of the same type at the same time, you will get a data race. Unless I'm mistaken, the current code should probably not be merged.

Furthermore, you have hardcoded the cache size, and there's no way for the user to change it. Note that my version did not need it thanks to the use of sync.Pool by josharian/intern.

I honestly think the previous version was simpler (both regarding the code in the easyjson repo, as well as including the dependency), more correct and - if the data races in the new version were fixed - probably even faster.

@GoWebProd
Copy link
Collaborator

Yes, you need use mutex if want to use in many goroutines. But in josharian/intern you have N maps (if you have N concurrent access) and releasing at any time by GC.

@CAFxX
Copy link
Contributor Author

CAFxX commented Apr 12, 2020

Yes, you need use mutex if want to use in many goroutines.

This would be a breaking change no? Current users don't need to do this.

Also adding a mutex would significantly hinder scalability...

But in josharian/intern you have N maps (if you have N concurrent access) and releasing at any time by GC.

Correct. But I fail to see how that's a problem. In fact, note that the same applies to your version: every type has a different interning table (so strings from different objects are not deduplicated), and strings can be dropped of the table by hitting the cache size limit (and consider what happens if a json object contains more than 128 different strings!).

@GoWebProd
Copy link
Collaborator

I’ll think about a concurrent list, if not solution, then we’ll take the josharian

@GoWebProd GoWebProd merged commit 6ea07b3 into mailru:master Apr 12, 2020
@zikaeroh zikaeroh mentioned this pull request Jul 30, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Optionally intern strings during unmarshaling
2 participants