text/tabwriter: character width #8273

rui314 · 2014-06-23T18:42:07Z

What steps will reproduce the problem?
Issue:
gofmt, or rather text/tabwriter, assumes that all Unicode code points occupy exactly one
column in editors or on terminals. That assumption is not correct because most (but not
all) Chinese/Japanese/Korean characters, emojis, "fullwidth" Latin characters, etc.,
occupy two columns. As a result gofmt formats Go code like this.

var Countries = map[string]string{
        "アメリカ合衆国": "United States of America",
        "日本":      "Japan",
        "ドイツ":     "Germany",
        "フランス":    "France",
        "ポーランド":   "Poland",
}

As you can see the column of the map value is misaligned. You cannot fix this by hand
because gofmt would reformat it for you in the wrong way if you do that. That's annoying.

In Unicode, there's a zero-column character (ZERO WIDTH SPACE; U+200B). SOFT HYPHEN
(U+00AD) may be displayed as a hyphen at the end of a line but may be zero-width in
other places, depending on your display environment. These characters also affect the
column layout.

What is the expected output? What do you see instead?
Proposal:
Unicode Standard Annex #11 gives the definition of column width for characters in the
legacy East Asian character sets. I propose to add the East Asian Width property to the
unicode package, so that we can get the column width for a CJK character. East Asian
Fullwidth and East Asian Wide characters should be treated as two columns by tabwriter.

(Note: East Asian Ambiguous characters need to be treated as one column. They are
treated as two columns only in an East Asian display environment. The character set
contains Cyrillic characters and others which we would never want to handle as two
columns.)

Because Annex #11 does not say anything about characters that are not in the legacy
East Asian character sets, we need additional rules for characters not in CJK character
sets but in Unicode. I propose this simple rule:

 - ZERO WIDTH SPACE is 0 column
 - Emojis are 2 columns
 - Other code points, including U+0000, SOFT HYPHEN, and all control characters, are 1 column

This additional rule will be implemented as an unexported function in text/tabwriter.
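
For illustration, the proposed rule could be sketched roughly as follows. This is not the proposed implementation: the `wide` table is a hand-picked approximation of the East Asian Wide/Fullwidth and emoji ranges (an assumption, not the UAX #11 data), and the names `wide`, `runeWidth` and `stringWidth` are hypothetical.

```go
package main

import (
	"fmt"
	"unicode"
)

// wide approximates East Asian Wide/Fullwidth code points plus common emoji.
// The ranges are a hand-picked assumption, not the real UAX #11 tables.
var wide = &unicode.RangeTable{
	R16: []unicode.Range16{
		{Lo: 0x1100, Hi: 0x115F, Stride: 1}, // Hangul Jamo initial consonants
		{Lo: 0x2E80, Hi: 0x303E, Stride: 1}, // CJK radicals, symbols, punctuation
		{Lo: 0x3041, Hi: 0xA4CF, Stride: 1}, // Hiragana, Katakana, CJK ideographs, Yi
		{Lo: 0xAC00, Hi: 0xD7A3, Stride: 1}, // Hangul syllables
		{Lo: 0xF900, Hi: 0xFAFF, Stride: 1}, // CJK compatibility ideographs
		{Lo: 0xFF01, Hi: 0xFF60, Stride: 1}, // fullwidth forms
		{Lo: 0xFFE0, Hi: 0xFFE6, Stride: 1}, // fullwidth signs
	},
	R32: []unicode.Range32{
		{Lo: 0x1F300, Hi: 0x1F64F, Stride: 1}, // pictographs and emoticons
		{Lo: 0x1F900, Hi: 0x1F9FF, Stride: 1}, // supplemental pictographs
		{Lo: 0x20000, Hi: 0x3FFFD, Stride: 1}, // CJK extension planes
	},
}

// runeWidth applies the rule from the proposal: ZERO WIDTH SPACE is 0 columns,
// wide characters and emoji are 2 columns, everything else is 1 column.
func runeWidth(r rune) int {
	switch {
	case r == 0x200B:
		return 0
	case unicode.Is(wide, r):
		return 2
	default:
		return 1
	}
}

// stringWidth sums the column widths of all runes in s.
func stringWidth(s string) int {
	n := 0
	for _, r := range s {
		n += runeWidth(r)
	}
	return n
}

func main() {
	for _, key := range []string{"アメリカ合衆国", "日本", "ドイツ", "フランス", "ポーランド"} {
		fmt.Printf("%-2d columns: %s\n", stringWidth(key), key)
	}
}
```

With widths computed this way, the padding after each key in the example above would be derived from columns rather than from the number of code points.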

Caveats:
I deliberately avoid defining the generic "wcswidth" function to determine the
column width for a string in the standard library. That function can never be defined in
the right way because there's no standard for it. Also it'd be hard to get a reasonable
definition for characters with odd semantics, such as SOFT HYPHEN.

ianlancetaylor · 2014-06-23T19:07:08Z

Comment 1:

Labels changed: added repo-main, release-none.

griesemer · 2014-06-23T21:38:10Z

Comment 2:

This is a (relatively minor) change to the tabwriter so that it can handle single and
double-width characters based on the fixed (_font-independent_) Unicode Annex #11 width
information, and assuming that the layout is for fixed-width (and multiples of the
fixed-width) characters.
It is an explicit non-goal to make the tabwriter work for variable-width fonts at this
time (it is possible, but it only makes sense in context with an IDE which lays out code
depending on font size).

Owner changed to @griesemer.

Status changed to Thinking.

clausecker · 2015-01-04T20:02:52Z

I see support for full-width characters as something integral to this package. It would be a bit sad if we left many millions of users in countries that use CJK characters without a usable text/tabwriter package.

clausecker · 2015-01-04T20:05:26Z

For an example implementation of a function to figure out how many columns a character occupies, see https://github.com/mattn/go-runewidth.

imuli · 2015-05-25T16:27:40Z

For what it's worth, this is also a problem with combining characters (and not all meaningful combinations have canonical forms):

var test = map[string]int{
    "tes̪t":   0,
    "testing": 0,
}

clausecker · 2015-05-26T09:11:09Z

@imuli Combining characters are handled as if their width is 0. This is fine if the code will never introduce a line-break before a combining character (which it doesn't). There's no need for canonical forms as this scheme works just fine.

imuli · 2015-05-26T10:36:58Z

@fuzxxl Yes, the package you linked to handles combining characters just fine. I meant that they are another side of this bug however, one that perhaps doesn't fall under "variable width font".

XenoPhex · 2017-01-30T22:33:54Z

Any updates on this?

griesemer · 2017-01-30T22:54:35Z

@XenoPhex Sorry, but this package is frozen: https://go-review.googlesource.com/#/c/31910/2/src/text/tabwriter/tabwriter.go .

griesemer · 2017-01-30T23:01:39Z

PS: Even if the package were not frozen, we are not going to add specific character sets or tables to this package for special treatment. The only sensible approach would be to provide a function that given a Unicode char returns a width, leaving the actual width determination to a client. However, the only way we could add such a function is by extending the API; specifically it would probably require a new Init function.
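
To make the shape of such an extension concrete, here is a rough sketch of alignment code that takes the width function from the client. Nothing here exists in text/tabwriter: `WidthFunc`, `cellWidth`, `padRows` and `crudeWide` are hypothetical names, and `crudeWide` is a deliberately crude stand-in for whatever a client would supply.

```go
package main

import (
	"fmt"
	"strings"
)

// WidthFunc reports how many columns a rune occupies. It is supplied by the
// client, so the alignment code itself needs no character tables.
type WidthFunc func(r rune) int

// cellWidth sums the client-reported widths of all runes in s.
func cellWidth(s string, w WidthFunc) int {
	n := 0
	for _, r := range s {
		n += w(r)
	}
	return n
}

// padRows pads the first cell of every row with spaces so that the second
// cells start in the same column, separated by at least one space.
func padRows(rows [][2]string, w WidthFunc) []string {
	widest := 0
	for _, row := range rows {
		if n := cellWidth(row[0], w); n > widest {
			widest = n
		}
	}
	out := make([]string, 0, len(rows))
	for _, row := range rows {
		pad := widest - cellWidth(row[0], w) + 1
		out = append(out, row[0]+strings.Repeat(" ", pad)+row[1])
	}
	return out
}

// crudeWide is an example client function: it treats one large CJK range as
// two columns wide and everything else as one column (an assumption).
func crudeWide(r rune) int {
	if r >= 0x2E80 && r <= 0xA4CF {
		return 2
	}
	return 1
}

func main() {
	rows := [][2]string{{"日本", "Japan"}, {"ドイツ", "Germany"}, {"ab", "x"}}
	for _, line := range padRows(rows, crudeWide) {
		fmt.Println(line)
	}
}
```

Passing a function that always returns 1 reproduces the current one-column-per-code-point behavior.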

This package is one of the earliest Go packages with some features (like HTML filtering) that are not needed/used anymore (at least by gofmt). We are not going to make further changes at this point.

If you need a special version, you can always vendor and adjust the code. A future gofmt might use a rewritten and trimmed version of this package. None of this is high-priority.

I will close this issue.

mattn · 2017-01-31T01:53:02Z

FYI: another way to do it: https://github.com/olekukonko/tablewriter

mpvl · 2018-04-05T07:49:59Z

@griesemer thanks for pointing me to this issue. This comes up once in a while.

The width information you need is already in golang.org/x/text/width. An implementation is not straightforward, though, as width cannot be determined unambiguously:

 - The width of fullwidth characters in a monospace font depends on the font. In East Asian fonts, a halfwidth character is indeed exactly half the width of a fullwidth character. In a monospace Latin font, however, the ratio is typically, but not always, 3:5.
 - There are ambiguous characters for which it is unknown whether a font will typically render them as fullwidth or halfwidth.
 - Some editors will render modifiers explicitly (although arguably increasingly rarely).
 - Spacing may vary as Unicode gets updated.

Now arguably, the current implementation will never align properly for anybody if any non-halfwidth rune is used. One could at least:

 - Render things properly for East Asian fonts if fullwidth characters are used.
 - Render things properly for any font if modifiers are used.

I've implemented an algorithm to determine a string width based on some experience-based recommendations for interpreting ambiguous characters, for exactly this purpose. It was decided not to add this way back then, but perhaps in light of Go 2 the willingness to change things has increased.

The main drawbacks of this approach:

 - It makes gofmt depend on x/text. But so does core.
 - Indentation may change as Unicode (and the go compiler) is updated.

The last one may be nasty if people are collaborating on the same project using different versions of gofmt. Gofmt would probably need some kind of logic to prevent flipping back and forth between two different interpretations and allow a flag to force an update.

Another complementary approach is to allow line breaks after table values so that values are indented and spaced independently of the keys.

Personally I think it would be best to rely on editors to do the alignment correctly, but have a best-effort implementation with some amount of stability guarantees that will render the alignment correctly in the majority of cases (albeit a small majority, I guess 66%). Note that even though no implementation will get the indentation right for everybody, it will at least do the right thing in many situations, whereas now it is guaranteed to never do the right thing.

griesemer · 2018-04-05T17:16:37Z

@mpvl Thanks for the info. I don't think we want to be dependent on x/text. I was hoping that there might be a small number of unicode code point ranges that we could trivially detect (and that are unlikely to change in the future) to identify full-width/wide chars and just give them the space of 2 characters. Of course this all depends on the actual font used during rendering and so this assumes that wide characters are taking the space of 2 regular characters in that font.

mpvl · 2018-04-05T18:07:28Z

@griesemer: I don't think that is a scalable approach, and it leaves out handling zero-width characters, which is actually easier. We could do something similar to what is done for core though: generate the tables in x/text and then copy them into gofmt. x/text has been set up to automate this.

mpvl · 2018-05-01T12:44:30Z

During my visit to Gopher China I did some polling, and almost exclusively people were using the preinstalled fonts for their editor (VSCode etc.) or a variant that would result in a CJK to Latin ratio of 5:3. Only occasionally did somebody report using the traditional 2:1 ratio.

IOW, it seems that adopting a 2:1 ratio will not fix the problem for the majority of the people. Conversely, adopting a 5:3 ratio would seem to do the trick, but it would also result in some peculiar artifacts in the gofmt rendering. I'm not sure that it is worth it. This doesn't preclude providing better handling for modifiers, of course. Emojis:Latin is typically also 5:3.
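
The artifacts follow from simple arithmetic: at 5:3 a wide rune is 5/3 of a Latin column, so two keys can only be aligned with whole spaces when their widths differ by a whole number of Latin columns. A small sketch, assuming an exact 5:3 ratio and a crude single-range test for wide runes (`isWide`, `thirds` and `spacesToAlign` are hypothetical names):

```go
package main

import "fmt"

// isWide is a crude stand-in for a real wide-character test (an assumption).
func isWide(r rune) bool { return r >= 0x2E80 && r <= 0xA4CF }

// thirds returns the rendered width of s in thirds of a Latin column,
// assuming wide runes are exactly 5/3 as wide as Latin ones.
func thirds(s string) int {
	n := 0
	for _, r := range s {
		if isWide(r) {
			n += 5
		} else {
			n += 3
		}
	}
	return n
}

// spacesToAlign reports how many whole spaces fit between the end of short
// and the end of long, and whether that padding aligns the two exactly.
func spacesToAlign(short, long string) (spaces int, exact bool) {
	d := thirds(long) - thirds(short)
	return d / 3, d >= 0 && d%3 == 0
}

func main() {
	for _, key := range []string{"ドイツ", "フランス", "ポーランド"} {
		n, ok := spacesToAlign("日本", key)
		fmt.Printf("日本 vs %s: %d spaces, exact=%v\n", key, n, ok)
	}
}
```

Under these assumptions two keys line up exactly only when their counts of wide runes differ by a multiple of three; in every other case the values end up a fraction of a column apart.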

Admittedly, my sample size was small (about 20), so I can ask Asta to do a larger-scale poll, but I wouldn't hold my breath.

It seems that having editor plugins handle this is really the ideal approach.

griesemer · 2018-05-01T16:41:32Z

Thanks, @mpvl, that is useful additional input. It sounds like there's no simple solution to address this.

rui314 added Thinking repo-main labels Jun 23, 2014

rui314 assigned griesemer Jun 23, 2014

clausecker mentioned this issue Jan 4, 2015

cmd/gofmt: incorrect comment alignment with double-width runes #7481

Closed

rsc added this to the Unplanned milestone Apr 10, 2015

rsc removed release-none labels Apr 10, 2015

griesemer mentioned this issue Jun 23, 2016

cmd/gofmt: incorrect alignment of map keys when Arabic characters are used #16170

Closed

griesemer closed this as completed Jan 30, 2017

golang locked and limited conversation to collaborators Jan 31, 2018

gopherbot added the FrozenDueToAge label Jan 31, 2018

rsc unassigned griesemer Jun 23, 2022
