Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bytes, strings: should have minimal dependency on unicode #54098

Open
dsnet opened this issue Jul 27, 2022 · 9 comments
Open

bytes, strings: should have minimal dependency on unicode #54098

dsnet opened this issue Jul 27, 2022 · 9 comments
Labels
NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. Performance
Milestone

Comments

@dsnet
Copy link
Member

dsnet commented Jul 27, 2022

Importing "unicode" immediately bloats a binary by ~100k. This is unfortunately unavoidable since the unicode.Categories map contains a reference to every Unicode category in existence (see #7600 or #2559).

We should make it such that only referencing bytes functions (e.g., bytes.HasPrefix) that do not depend on unicode should not result in unicode being linked into the binary.

Here's a list of functions that depend on unicode:

  • Fields -> unicode.IsSpace
  • ToUpper -> unicode.ToUpper
  • ToLower -> unicode.ToLower
  • ToTitle -> unicode.ToTitle
  • ToUpperSpecial -> unicode.SpecialCase
  • ToLowerSpecial -> unicode.SpecialCase
  • ToTitleSpecial -> unicode.SpecialCase
  • Title -> unicode.{ToTitle,IsLetter,IsDigit,IsSpace}
  • TrimSpace -> unicode.IsSpace
  • EqualFold -> unicode.SimpleFold

Of all of these, only Fields and TrimSpace are used to any significant degree. Even still, the implementation of unicode.IsSpace is fairly small and references a relatively small table.

Perhaps we should create a internal/unicodetables package that contains every table. The unicode package can depend on unicodetables, and other stdlib packages can depend on unicodetables directly.

@cherrymui cherrymui added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Jul 27, 2022
@cherrymui cherrymui added this to the Backlog milestone Jul 27, 2022
@cherrymui
Copy link
Member

cc @bradfitz @ianlancetaylor

Perhaps we should just fix map initialization deadcode elimination issue... (We have thoughts, and there were some ongoing work.)

@gopherbot
Copy link
Contributor

Change https://go.dev/cl/419756 mentions this issue: unicode: split base functionality into internal package

@dsnet
Copy link
Member Author

dsnet commented Jul 28, 2022

Unless map deadcode elimination is going to happen in the near future, CL/419756 is perhaps a palatable solution for the interim period. It would actually be a simplification of the code base, since we could delete sub-copies of "unicode" functionality such as in "strconv" and in "fmt". Switching from "internal/unicode" back to "unicode" would be a trivial search and replace after #2559 is fixed.

@gopherbot
Copy link
Contributor

Change https://go.dev/cl/423376 mentions this issue: internal/bytealg: move implementation of strings.Clone

@martisch
Copy link
Contributor

Another place were we could potentially avoid pulling in all of unicode tables would be "reflect".

@dsnet
Copy link
Member Author

dsnet commented Aug 14, 2022

@cherrymui, I haven't seen any mentions about DCE for unused maps in the runtime/compiler meeting notes (#43930). Is it fair to say that this is not going to be something that's going to be worked on in the near future?

If so, what are our thoughts on doing something like https://go.dev/cl/419756 where we move the essential parts of "unicode" to "internal/unicode"? I've noticed quite a bit of technical debt throughout the Go source tree because stdlib packages are avoiding "unicode" and so it resorts to duplicating Unicode functionality or re-implementing functionality since it can't depend on a package (e.g., "bytes" or "strings") that transitively depends on "unicode".

In my opinion, the technical debt of splitting out part of "unicode" into "internal/unicode" is less than the technical debt of avoiding "unicode" all-together.

@randall77
Copy link
Contributor

I haven't seen any mentions about DCE for unused maps in the runtime/compiler meeting notes (#43930). Is it fair to say that this is not going to be something that's going to be worked on in the near future?

It's not something we're working on, or planning on working on, at the moment.

But see #2559, there are some external contributors that made some progress in this area, but it was never completed.

@cherrymui
Copy link
Member

@cherrymui, I haven't seen any mentions about DCE for unused maps in the runtime/compiler meeting notes (#43930). Is it fair to say that this is not going to be something that's going to be worked on in the near future?

I can give it a try this cycle, if I have a chance. I'd think it shouldn't be too hard to do, but I may be wrong.

@gopherbot
Copy link
Contributor

Change https://go.dev/cl/528799 mentions this issue: strconv: replace cloneString with bytealg.CloneString

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. Performance
Projects
None yet
Development

No branches or pull requests

5 participants