bytes, strings: should have minimal dependency on unicode #54098

@dsnet


Importing "unicode" immediately bloats a binary by ~100k. This is unfortunately unavoidable since the unicode.Categories map contains a reference to every Unicode category in existence (see #7600 or #2559).
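The shape of the problem can be sketched in a few lines. Everything named here (`rangeTable`, `upper`, `digit`, `categories`, `contains`, `isDigit`) is a hypothetical stand-in, not the real `unicode` declarations, and the tables are truncated to two ranges each. The sketch only shows that a package-level map whose initializer names every table refers to each of them, even when a caller needs just one.

```go
package main

import "fmt"

// rangeTable is a hypothetical, simplified stand-in for unicode.RangeTable:
// a list of inclusive [lo, hi] rune ranges.
type rangeTable [][2]rune

// Two illustrative, heavily truncated tables (not the full Unicode data).
var (
	upper = rangeTable{{'A', 'Z'}, {0x00C0, 0x00D6}}
	digit = rangeTable{{'0', '9'}, {0x0660, 0x0669}}
)

// categories mirrors the shape of unicode.Categories: its initializer
// names every table, so the map refers to each one of them.
var categories = map[string]rangeTable{
	"Lu": upper,
	"Nd": digit,
}

// contains reports whether r falls inside any range of t.
func contains(t rangeTable, r rune) bool {
	for _, rg := range t {
		if rg[0] <= r && r <= rg[1] {
			return true
		}
	}
	return false
}

// isDigit needs only the digit table, yet it shares a package with
// categories, which refers to all of the tables.
func isDigit(r rune) bool { return contains(digit, r) }

func main() {
	fmt.Println(isDigit('7'), isDigit('x'), len(categories))
}
```

Whether a particular toolchain version can discard such a map when nothing reads it is outside what this sketch demonstrates.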

Referencing only bytes functions that do not depend on unicode (e.g., bytes.HasPrefix) should not cause unicode to be linked into the binary.

Here's a list of functions that depend on unicode:

  • Fields -> unicode.IsSpace
  • ToUpper -> unicode.ToUpper
  • ToLower -> unicode.ToLower
  • ToTitle -> unicode.ToTitle
  • ToUpperSpecial -> unicode.SpecialCase
  • ToLowerSpecial -> unicode.SpecialCase
  • ToTitleSpecial -> unicode.SpecialCase
  • Title -> unicode.{ToTitle,IsLetter,IsDigit,IsSpace}
  • TrimSpace -> unicode.IsSpace
  • EqualFold -> unicode.SimpleFold

Of these, only Fields and TrimSpace are used to any significant degree. Even so, the implementation of unicode.IsSpace is fairly small and references a relatively small table.

Perhaps we should create an internal/unicodetables package that contains every table. The unicode package can depend on unicodetables, and other stdlib packages can depend on unicodetables directly.
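A rough single-file sketch of that layering follows. It is an assumption about how the split could look, not an existing API: `range16`, `whiteSpace`, `in`, `unicodeIsSpace`, and `trimSpace` are hypothetical names, and the comments mark which package each piece would belong to. The table lists the code points that `unicode.IsSpace` documents as white space; a real change should generate it from the Unicode data rather than write it by hand.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// --- would live in the hypothetical internal/unicodetables ---

// range16 is an inclusive range of 16-bit code points.
type range16 struct{ lo, hi uint16 }

// whiteSpace lists the White_Space code points in ascending order.
var whiteSpace = []range16{
	{0x0009, 0x000D}, {0x0020, 0x0020}, {0x0085, 0x0085}, {0x00A0, 0x00A0},
	{0x1680, 0x1680}, {0x2000, 0x200A}, {0x2028, 0x2029}, {0x202F, 0x202F},
	{0x205F, 0x205F}, {0x3000, 0x3000},
}

// in reports whether r is covered by the sorted table t.
func in(t []range16, r rune) bool {
	if r < 0 || r > 0xFFFF {
		return false
	}
	c := uint16(r)
	for _, rg := range t {
		if c < rg.lo {
			return false // table is sorted, so no later range can match
		}
		if c <= rg.hi {
			return true
		}
	}
	return false
}

// --- would live in unicode, which depends on the tables package ---

func unicodeIsSpace(r rune) bool { return in(whiteSpace, r) }

// --- would live in bytes, using the table directly instead of unicode ---

// trimSpace strips leading and trailing white space from s.
func trimSpace(s []byte) []byte {
	for len(s) > 0 {
		r, n := utf8.DecodeRune(s)
		if !in(whiteSpace, r) {
			break
		}
		s = s[n:]
	}
	for len(s) > 0 {
		r, n := utf8.DecodeLastRune(s)
		if !in(whiteSpace, r) {
			break
		}
		s = s[:len(s)-n]
	}
	return s
}

func main() {
	fmt.Printf("%q\n", trimSpace([]byte("\t hi there\u00a0\n")))
	fmt.Println(unicodeIsSpace(0x3000), unicodeIsSpace('a'))
}
```

In this arrangement the bytes-side helper touches only the one table it needs, so nothing in it refers to the full category map.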

Labels: NeedsInvestigation, Performance