New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

txt, tutxt does not handle multi-byte string #74

Closed
mattn opened this Issue Sep 13, 2017 · 13 comments

Comments

Projects
None yet
2 participants
@mattn

mattn commented Sep 13, 2017

@startuml
bob -> alice : こんにちわ
bob <- alice : さようなら
@enduml

bar

PNG works fine. But TUTXT, TXT doesn't work correctly. It seems platnum does not handle width of multi-byte characters. Below is output of java -jar platuml.jar -charset=UTF8 -tutxt input.platuml

https://gist.githubusercontent.com/mattn/d1be0d6711043abfce9946fa0112024a/raw/fe635b3a72724e75052c5c260d8e7ee7038b2a00/gistfile1.txt

3fd8135165637b81

Below is an expected output.

f1c2437ca42b8850

@arnaudroques

This comment has been minimized.

Show comment
Hide comment
@arnaudroques

arnaudroques Sep 16, 2017

Contributor

Hi,
Many thanks for the report. There is indeed an issue, and we have learn something very interesting!

The issue is due to the fact that we made the false assumption that using a monospaced font every character will be printed with the same width.

And here we discover the notion of half-width and full-width character :-) Even with a monospaced font, some character are wider than other.
Some pointers;

Determining if a character is half-width or full-width does not seems to be so easy to do in Java (if you have a good solution, please post here!).
Once we've found a way, we will have to update our code.
This may take some time, so please be patient. We'll post a message here when a beta version will be ready.
Thanks again

Contributor

arnaudroques commented Sep 16, 2017

Hi,
Many thanks for the report. There is indeed an issue, and we have learn something very interesting!

The issue is due to the fact that we made the false assumption that using a monospaced font every character will be printed with the same width.

And here we discover the notion of half-width and full-width character :-) Even with a monospaced font, some character are wider than other.
Some pointers;

Determining if a character is half-width or full-width does not seems to be so easy to do in Java (if you have a good solution, please post here!).
Once we've found a way, we will have to update our code.
This may take some time, so please be patient. We'll post a message here when a beta version will be ready.
Thanks again

@mattn

This comment has been minimized.

Show comment
Hide comment
@mattn

mattn Sep 17, 2017

You can use wcwidth implementation for java.

/**
 * <p>See <a href="http://www.cl.cam.ac.uk/~mgk25/ucs/wcwidth.c">wcwidth.c</a></p>
 *
 * <p>This is an implementation of wcwidth() and wcswidth() (defined in
 * IEEE Std 1002.1-2001) for Unicode.</p>
 *
 * http://www.opengroup.org/onlinepubs/007904975/functions/wcwidth.html
 * http://www.opengroup.org/onlinepubs/007904975/functions/wcswidth.html
 *
 * <p>In fixed-width output devices, Latin characters all occupy a single
 * "cell" position of equal width, whereas ideographic CJK characters
 * occupy two such cells. Interoperability between terminal-line
 * applications and (teletype-style) character terminals using the
 * UTF-8 encoding requires agreement on which character should advance
 * the cursor by how many cell positions. No established formal
 * standards exist at present on which Unicode character shall occupy
 * how many cell positions on character terminals. These routines are
 * a first attempt of defining such behavior based on simple rules
 * applied to data provided by the Unicode Consortium.</p>
 *
 * <p>For some graphical characters, the Unicode standard explicitly
 * defines a character-cell width via the definition of the East Asian
 * FullWidth (F), Wide (W), Half-width (H), and Narrow (Na) classes.
 * In all these cases, there is no ambiguity about which width a
 * terminal shall use. For characters in the East Asian Ambiguous (A)
 * class, the width choice depends purely on a preference of backward
 * compatibility with either historic CJK or Western practice.
 * Choosing single-width for these characters is easy to justify as
 * the appropriate long-term solution, as the CJK practice of
 * displaying these characters as double-width comes from historic
 * implementation simplicity (8-bit encoded characters were displayed
 * single-width and 16-bit ones double-width, even for Greek,
 * Cyrillic, etc.) and not any typographic considerations.</p>
 *
 * <p>Much less clear is the choice of width for the Not East Asian
 * (Neutral) class. Existing practice does not dictate a width for any
 * of these characters. It would nevertheless make sense
 * typographically to allocate two character cells to characters such
 * as for instance EM SPACE or VOLUME INTEGRAL, which cannot be
 * represented adequately with a single-width glyph. The following
 * routines at present merely assign a single-cell width to all
 * neutral characters, in the interest of simplicity. This is not
 * entirely satisfactory and should be reconsidered before
 * establishing a formal standard in this area. At the moment, the
 * decision which Not East Asian (Neutral) characters should be
 * represented by double-width glyphs cannot yet be answered by
 * applying a simple rule from the Unicode database content. Setting
 * up a proper standard for the behavior of UTF-8 character terminals
 * will require a careful analysis not only of each Unicode character,
 * but also of each presentation form, something the author of these
 * routines has avoided to do so far.</p>
 *
 * <p>http://www.unicode.org/unicode/reports/tr11/</p>
 */
public class Wcwidth {

  /**
   * sorted list of non-overlapping intervals of non-spacing characters
   * generated by "uniset +cat=Me +cat=Mn +cat=Cf -00AD +1160-11FF +200B c"
   */
  private static final int[][] COMBINING = {
      {0x0300, 0x036F}, {0x0483, 0x0486}, {0x0488, 0x0489},
      {0x0591, 0x05BD}, {0x05BF, 0x05BF}, {0x05C1, 0x05C2},
      {0x05C4, 0x05C5}, {0x05C7, 0x05C7}, {0x0600, 0x0603},
      {0x0610, 0x0615}, {0x064B, 0x065E}, {0x0670, 0x0670},
      {0x06D6, 0x06E4}, {0x06E7, 0x06E8}, {0x06EA, 0x06ED},
      {0x070F, 0x070F}, {0x0711, 0x0711}, {0x0730, 0x074A},
      {0x07A6, 0x07B0}, {0x07EB, 0x07F3}, {0x0901, 0x0902},
      {0x093C, 0x093C}, {0x0941, 0x0948}, {0x094D, 0x094D},
      {0x0951, 0x0954}, {0x0962, 0x0963}, {0x0981, 0x0981},
      {0x09BC, 0x09BC}, {0x09C1, 0x09C4}, {0x09CD, 0x09CD},
      {0x09E2, 0x09E3}, {0x0A01, 0x0A02}, {0x0A3C, 0x0A3C},
      {0x0A41, 0x0A42}, {0x0A47, 0x0A48}, {0x0A4B, 0x0A4D},
      {0x0A70, 0x0A71}, {0x0A81, 0x0A82}, {0x0ABC, 0x0ABC},
      {0x0AC1, 0x0AC5}, {0x0AC7, 0x0AC8}, {0x0ACD, 0x0ACD},
      {0x0AE2, 0x0AE3}, {0x0B01, 0x0B01}, {0x0B3C, 0x0B3C},
      {0x0B3F, 0x0B3F}, {0x0B41, 0x0B43}, {0x0B4D, 0x0B4D},
      {0x0B56, 0x0B56}, {0x0B82, 0x0B82}, {0x0BC0, 0x0BC0},
      {0x0BCD, 0x0BCD}, {0x0C3E, 0x0C40}, {0x0C46, 0x0C48},
      {0x0C4A, 0x0C4D}, {0x0C55, 0x0C56}, {0x0CBC, 0x0CBC},
      {0x0CBF, 0x0CBF}, {0x0CC6, 0x0CC6}, {0x0CCC, 0x0CCD},
      {0x0CE2, 0x0CE3}, {0x0D41, 0x0D43}, {0x0D4D, 0x0D4D},
      {0x0DCA, 0x0DCA}, {0x0DD2, 0x0DD4}, {0x0DD6, 0x0DD6},
      {0x0E31, 0x0E31}, {0x0E34, 0x0E3A}, {0x0E47, 0x0E4E},
      {0x0EB1, 0x0EB1}, {0x0EB4, 0x0EB9}, {0x0EBB, 0x0EBC},
      {0x0EC8, 0x0ECD}, {0x0F18, 0x0F19}, {0x0F35, 0x0F35},
      {0x0F37, 0x0F37}, {0x0F39, 0x0F39}, {0x0F71, 0x0F7E},
      {0x0F80, 0x0F84}, {0x0F86, 0x0F87}, {0x0F90, 0x0F97},
      {0x0F99, 0x0FBC}, {0x0FC6, 0x0FC6}, {0x102D, 0x1030},
      {0x1032, 0x1032}, {0x1036, 0x1037}, {0x1039, 0x1039},
      {0x1058, 0x1059}, {0x1160, 0x11FF}, {0x135F, 0x135F},
      {0x1712, 0x1714}, {0x1732, 0x1734}, {0x1752, 0x1753},
      {0x1772, 0x1773}, {0x17B4, 0x17B5}, {0x17B7, 0x17BD},
      {0x17C6, 0x17C6}, {0x17C9, 0x17D3}, {0x17DD, 0x17DD},
      {0x180B, 0x180D}, {0x18A9, 0x18A9}, {0x1920, 0x1922},
      {0x1927, 0x1928}, {0x1932, 0x1932}, {0x1939, 0x193B},
      {0x1A17, 0x1A18}, {0x1B00, 0x1B03}, {0x1B34, 0x1B34},
      {0x1B36, 0x1B3A}, {0x1B3C, 0x1B3C}, {0x1B42, 0x1B42},
      {0x1B6B, 0x1B73}, {0x1DC0, 0x1DCA}, {0x1DFE, 0x1DFF},
      {0x200B, 0x200F}, {0x202A, 0x202E}, {0x2060, 0x2063},
      {0x206A, 0x206F}, {0x20D0, 0x20EF}, {0x302A, 0x302F},
      {0x3099, 0x309A}, {0xA806, 0xA806}, {0xA80B, 0xA80B},
      {0xA825, 0xA826}, {0xFB1E, 0xFB1E}, {0xFE00, 0xFE0F},
      {0xFE20, 0xFE23}, {0xFEFF, 0xFEFF}, {0xFFF9, 0xFFFB},
      {0x10A01, 0x10A03}, {0x10A05, 0x10A06}, {0x10A0C, 0x10A0F},
      {0x10A38, 0x10A3A}, {0x10A3F, 0x10A3F}, {0x1D167, 0x1D169},
      {0x1D173, 0x1D182}, {0x1D185, 0x1D18B}, {0x1D1AA, 0x1D1AD},
      {0x1D242, 0x1D244}, {0xE0001, 0xE0001}, {0xE0020, 0xE007F},
      {0xE0100, 0xE01EF}
  };

  static boolean bisearch(int ucs) {
    int min = 0;
    int mid;
    int max = COMBINING.length - 1;

    if (ucs < COMBINING[0][0] || ucs > COMBINING[max][1]) {
      return false;
    }
    while (max >= min) {
      mid = (min + max) / 2;
      if (ucs > COMBINING[mid][1]) {
        min = mid + 1;
      } else if (ucs < COMBINING[mid][0]) {
        max = mid - 1;
      } else {
        return true;
      }
    }

    return false;
  }

  /**
   * See : http://www.cl.cam.ac.uk/%7Emgk25/ucs/wcwidth.c
   *
   * The following two functions define the column width of an ISO 10646
   * character as follows:
   *
   *    - The null character (U+0000) has a column width of 0.
   *
   *    - Other C0/C1 control characters and DEL will lead to a return
   *      value of -1.
   *
   *    - Non-spacing and enclosing combining characters (general
   *      category code Mn or Me in the Unicode database) have a
   *      column width of 0.
   *
   *    - SOFT HYPHEN (U+00AD) has a column width of 1.
   *
   *    - Other format characters (general category code Cf in the Unicode
   *      database) and ZERO WIDTH SPACE (U+200B) have a column width of 0.
   *
   *    - Hangul Jamo medial vowels and final consonants (U+1160-U+11FF)
   *      have a column width of 0.
   *
   *    - Spacing characters in the East Asian Wide (W) or East Asian
   *      Full-width (F) category as defined in Unicode Technical
   *      Report #11 have a column width of 2.
   *
   *    - All remaining characters (including all printable
   *      ISO 8859-1 and WGL4 characters, Unicode control characters,
   *      etc.) have a column width of 1.
   *
   * This implementation assumes that wchar_t characters are encoded
   * in ISO 10646.
   */
  public static int of(int codePoint) {
    // test for 8-bit control characters
    if (codePoint == 0) {
      return 0;
    }
    if (codePoint < 32 || (codePoint >= 0x7f && codePoint < 0xa0)) {
      return -1;
    }
    // binary search in table of non-spacing characters
    if (bisearch(codePoint)) {
      return 0;
    }

    // if we arrive here, ucs is not a combining or C0/C1 control character
    return 1 +
        ((codePoint >= 0x1100 &&
            (codePoint <= 0x115f ||                    // Hangul Jamo init. consonants
                codePoint == 0x2329 || codePoint == 0x232a ||
                (codePoint >= 0x2e80 && codePoint <= 0xa4cf &&
                    codePoint != 0x303f) ||                  // CJK ... Yi
                (codePoint >= 0xac00 && codePoint <= 0xd7a3) || // Hangul Syllables
                (codePoint >= 0xf900 && codePoint <= 0xfaff) || // CJK Compatibility Ideographs
                (codePoint >= 0xfe10 && codePoint <= 0xfe19) || // Vertical forms
                (codePoint >= 0xfe30 && codePoint <= 0xfe6f) || // CJK Compatibility Forms
                (codePoint >= 0xff00 && codePoint <= 0xff60) || // Fullwidth Forms
                (codePoint >= 0xffe0 && codePoint <= 0xffe6) ||
                (codePoint >= 0x20000 && codePoint <= 0x2fffd) ||
                (codePoint >= 0x30000 && codePoint <= 0x3fffd))) ? 1 : 0);
  }
}

mattn commented Sep 17, 2017

You can use wcwidth implementation for java.

/**
 * <p>See <a href="http://www.cl.cam.ac.uk/~mgk25/ucs/wcwidth.c">wcwidth.c</a></p>
 *
 * <p>This is an implementation of wcwidth() and wcswidth() (defined in
 * IEEE Std 1002.1-2001) for Unicode.</p>
 *
 * http://www.opengroup.org/onlinepubs/007904975/functions/wcwidth.html
 * http://www.opengroup.org/onlinepubs/007904975/functions/wcswidth.html
 *
 * <p>In fixed-width output devices, Latin characters all occupy a single
 * "cell" position of equal width, whereas ideographic CJK characters
 * occupy two such cells. Interoperability between terminal-line
 * applications and (teletype-style) character terminals using the
 * UTF-8 encoding requires agreement on which character should advance
 * the cursor by how many cell positions. No established formal
 * standards exist at present on which Unicode character shall occupy
 * how many cell positions on character terminals. These routines are
 * a first attempt of defining such behavior based on simple rules
 * applied to data provided by the Unicode Consortium.</p>
 *
 * <p>For some graphical characters, the Unicode standard explicitly
 * defines a character-cell width via the definition of the East Asian
 * FullWidth (F), Wide (W), Half-width (H), and Narrow (Na) classes.
 * In all these cases, there is no ambiguity about which width a
 * terminal shall use. For characters in the East Asian Ambiguous (A)
 * class, the width choice depends purely on a preference of backward
 * compatibility with either historic CJK or Western practice.
 * Choosing single-width for these characters is easy to justify as
 * the appropriate long-term solution, as the CJK practice of
 * displaying these characters as double-width comes from historic
 * implementation simplicity (8-bit encoded characters were displayed
 * single-width and 16-bit ones double-width, even for Greek,
 * Cyrillic, etc.) and not any typographic considerations.</p>
 *
 * <p>Much less clear is the choice of width for the Not East Asian
 * (Neutral) class. Existing practice does not dictate a width for any
 * of these characters. It would nevertheless make sense
 * typographically to allocate two character cells to characters such
 * as for instance EM SPACE or VOLUME INTEGRAL, which cannot be
 * represented adequately with a single-width glyph. The following
 * routines at present merely assign a single-cell width to all
 * neutral characters, in the interest of simplicity. This is not
 * entirely satisfactory and should be reconsidered before
 * establishing a formal standard in this area. At the moment, the
 * decision which Not East Asian (Neutral) characters should be
 * represented by double-width glyphs cannot yet be answered by
 * applying a simple rule from the Unicode database content. Setting
 * up a proper standard for the behavior of UTF-8 character terminals
 * will require a careful analysis not only of each Unicode character,
 * but also of each presentation form, something the author of these
 * routines has avoided to do so far.</p>
 *
 * <p>http://www.unicode.org/unicode/reports/tr11/</p>
 */
public class Wcwidth {

  /**
   * sorted list of non-overlapping intervals of non-spacing characters
   * generated by "uniset +cat=Me +cat=Mn +cat=Cf -00AD +1160-11FF +200B c"
   */
  private static final int[][] COMBINING = {
      {0x0300, 0x036F}, {0x0483, 0x0486}, {0x0488, 0x0489},
      {0x0591, 0x05BD}, {0x05BF, 0x05BF}, {0x05C1, 0x05C2},
      {0x05C4, 0x05C5}, {0x05C7, 0x05C7}, {0x0600, 0x0603},
      {0x0610, 0x0615}, {0x064B, 0x065E}, {0x0670, 0x0670},
      {0x06D6, 0x06E4}, {0x06E7, 0x06E8}, {0x06EA, 0x06ED},
      {0x070F, 0x070F}, {0x0711, 0x0711}, {0x0730, 0x074A},
      {0x07A6, 0x07B0}, {0x07EB, 0x07F3}, {0x0901, 0x0902},
      {0x093C, 0x093C}, {0x0941, 0x0948}, {0x094D, 0x094D},
      {0x0951, 0x0954}, {0x0962, 0x0963}, {0x0981, 0x0981},
      {0x09BC, 0x09BC}, {0x09C1, 0x09C4}, {0x09CD, 0x09CD},
      {0x09E2, 0x09E3}, {0x0A01, 0x0A02}, {0x0A3C, 0x0A3C},
      {0x0A41, 0x0A42}, {0x0A47, 0x0A48}, {0x0A4B, 0x0A4D},
      {0x0A70, 0x0A71}, {0x0A81, 0x0A82}, {0x0ABC, 0x0ABC},
      {0x0AC1, 0x0AC5}, {0x0AC7, 0x0AC8}, {0x0ACD, 0x0ACD},
      {0x0AE2, 0x0AE3}, {0x0B01, 0x0B01}, {0x0B3C, 0x0B3C},
      {0x0B3F, 0x0B3F}, {0x0B41, 0x0B43}, {0x0B4D, 0x0B4D},
      {0x0B56, 0x0B56}, {0x0B82, 0x0B82}, {0x0BC0, 0x0BC0},
      {0x0BCD, 0x0BCD}, {0x0C3E, 0x0C40}, {0x0C46, 0x0C48},
      {0x0C4A, 0x0C4D}, {0x0C55, 0x0C56}, {0x0CBC, 0x0CBC},
      {0x0CBF, 0x0CBF}, {0x0CC6, 0x0CC6}, {0x0CCC, 0x0CCD},
      {0x0CE2, 0x0CE3}, {0x0D41, 0x0D43}, {0x0D4D, 0x0D4D},
      {0x0DCA, 0x0DCA}, {0x0DD2, 0x0DD4}, {0x0DD6, 0x0DD6},
      {0x0E31, 0x0E31}, {0x0E34, 0x0E3A}, {0x0E47, 0x0E4E},
      {0x0EB1, 0x0EB1}, {0x0EB4, 0x0EB9}, {0x0EBB, 0x0EBC},
      {0x0EC8, 0x0ECD}, {0x0F18, 0x0F19}, {0x0F35, 0x0F35},
      {0x0F37, 0x0F37}, {0x0F39, 0x0F39}, {0x0F71, 0x0F7E},
      {0x0F80, 0x0F84}, {0x0F86, 0x0F87}, {0x0F90, 0x0F97},
      {0x0F99, 0x0FBC}, {0x0FC6, 0x0FC6}, {0x102D, 0x1030},
      {0x1032, 0x1032}, {0x1036, 0x1037}, {0x1039, 0x1039},
      {0x1058, 0x1059}, {0x1160, 0x11FF}, {0x135F, 0x135F},
      {0x1712, 0x1714}, {0x1732, 0x1734}, {0x1752, 0x1753},
      {0x1772, 0x1773}, {0x17B4, 0x17B5}, {0x17B7, 0x17BD},
      {0x17C6, 0x17C6}, {0x17C9, 0x17D3}, {0x17DD, 0x17DD},
      {0x180B, 0x180D}, {0x18A9, 0x18A9}, {0x1920, 0x1922},
      {0x1927, 0x1928}, {0x1932, 0x1932}, {0x1939, 0x193B},
      {0x1A17, 0x1A18}, {0x1B00, 0x1B03}, {0x1B34, 0x1B34},
      {0x1B36, 0x1B3A}, {0x1B3C, 0x1B3C}, {0x1B42, 0x1B42},
      {0x1B6B, 0x1B73}, {0x1DC0, 0x1DCA}, {0x1DFE, 0x1DFF},
      {0x200B, 0x200F}, {0x202A, 0x202E}, {0x2060, 0x2063},
      {0x206A, 0x206F}, {0x20D0, 0x20EF}, {0x302A, 0x302F},
      {0x3099, 0x309A}, {0xA806, 0xA806}, {0xA80B, 0xA80B},
      {0xA825, 0xA826}, {0xFB1E, 0xFB1E}, {0xFE00, 0xFE0F},
      {0xFE20, 0xFE23}, {0xFEFF, 0xFEFF}, {0xFFF9, 0xFFFB},
      {0x10A01, 0x10A03}, {0x10A05, 0x10A06}, {0x10A0C, 0x10A0F},
      {0x10A38, 0x10A3A}, {0x10A3F, 0x10A3F}, {0x1D167, 0x1D169},
      {0x1D173, 0x1D182}, {0x1D185, 0x1D18B}, {0x1D1AA, 0x1D1AD},
      {0x1D242, 0x1D244}, {0xE0001, 0xE0001}, {0xE0020, 0xE007F},
      {0xE0100, 0xE01EF}
  };

  static boolean bisearch(int ucs) {
    int min = 0;
    int mid;
    int max = COMBINING.length - 1;

    if (ucs < COMBINING[0][0] || ucs > COMBINING[max][1]) {
      return false;
    }
    while (max >= min) {
      mid = (min + max) / 2;
      if (ucs > COMBINING[mid][1]) {
        min = mid + 1;
      } else if (ucs < COMBINING[mid][0]) {
        max = mid - 1;
      } else {
        return true;
      }
    }

    return false;
  }

  /**
   * See : http://www.cl.cam.ac.uk/%7Emgk25/ucs/wcwidth.c
   *
   * The following two functions define the column width of an ISO 10646
   * character as follows:
   *
   *    - The null character (U+0000) has a column width of 0.
   *
   *    - Other C0/C1 control characters and DEL will lead to a return
   *      value of -1.
   *
   *    - Non-spacing and enclosing combining characters (general
   *      category code Mn or Me in the Unicode database) have a
   *      column width of 0.
   *
   *    - SOFT HYPHEN (U+00AD) has a column width of 1.
   *
   *    - Other format characters (general category code Cf in the Unicode
   *      database) and ZERO WIDTH SPACE (U+200B) have a column width of 0.
   *
   *    - Hangul Jamo medial vowels and final consonants (U+1160-U+11FF)
   *      have a column width of 0.
   *
   *    - Spacing characters in the East Asian Wide (W) or East Asian
   *      Full-width (F) category as defined in Unicode Technical
   *      Report #11 have a column width of 2.
   *
   *    - All remaining characters (including all printable
   *      ISO 8859-1 and WGL4 characters, Unicode control characters,
   *      etc.) have a column width of 1.
   *
   * This implementation assumes that wchar_t characters are encoded
   * in ISO 10646.
   */
  public static int of(int codePoint) {
    // test for 8-bit control characters
    if (codePoint == 0) {
      return 0;
    }
    if (codePoint < 32 || (codePoint >= 0x7f && codePoint < 0xa0)) {
      return -1;
    }
    // binary search in table of non-spacing characters
    if (bisearch(codePoint)) {
      return 0;
    }

    // if we arrive here, ucs is not a combining or C0/C1 control character
    return 1 +
        ((codePoint >= 0x1100 &&
            (codePoint <= 0x115f ||                    // Hangul Jamo init. consonants
                codePoint == 0x2329 || codePoint == 0x232a ||
                (codePoint >= 0x2e80 && codePoint <= 0xa4cf &&
                    codePoint != 0x303f) ||                  // CJK ... Yi
                (codePoint >= 0xac00 && codePoint <= 0xd7a3) || // Hangul Syllables
                (codePoint >= 0xf900 && codePoint <= 0xfaff) || // CJK Compatibility Ideographs
                (codePoint >= 0xfe10 && codePoint <= 0xfe19) || // Vertical forms
                (codePoint >= 0xfe30 && codePoint <= 0xfe6f) || // CJK Compatibility Forms
                (codePoint >= 0xff00 && codePoint <= 0xff60) || // Fullwidth Forms
                (codePoint >= 0xffe0 && codePoint <= 0xffe6) ||
                (codePoint >= 0x20000 && codePoint <= 0x2fffd) ||
                (codePoint >= 0x30000 && codePoint <= 0x3fffd))) ? 1 : 0);
  }
}
@arnaudroques

This comment has been minimized.

Show comment
Hide comment
@arnaudroques

arnaudroques Sep 19, 2017

Contributor

Many thanks for wcwidth code, we have begun to use it.
Unfortunately, the situation is much more complex than expected.
See following example:

     ┌───┐          ┌─────┐
     │bob│          │alice│
     └─┬─┘          └──┬──┘
       │    1234567    │   
       │────こここここ───>│   
       │    んんんん     │   
       │    にににに     │   
       │    ちちちち      │   
       │    わわわわ     │   
       │    にににに     │   
       │──────────────>│   

It seems that not every character has a width of 1 or 2.
The width seems to be a fraction and seems to depend of the font (for example, you need only six "こ" to have the same length as 1234567 (at least, with the font used by our current Firefox. Using Chrome on the same computer, only for "こ" are need!)
Which means it will be very difficult to make ASCII art working with those characters.

Any tip welcome!

Contributor

arnaudroques commented Sep 19, 2017

Many thanks for wcwidth code, we have begun to use it.
Unfortunately, the situation is much more complex than expected.
See following example:

     ┌───┐          ┌─────┐
     │bob│          │alice│
     └─┬─┘          └──┬──┘
       │    1234567    │   
       │────こここここ───>│   
       │    んんんん     │   
       │    にににに     │   
       │    ちちちち      │   
       │    わわわわ     │   
       │    にににに     │   
       │──────────────>│   

It seems that not every character has a width of 1 or 2.
The width seems to be a fraction and seems to depend of the font (for example, you need only six "こ" to have the same length as 1234567 (at least, with the font used by our current Firefox. Using Chrome on the same computer, only for "こ" are need!)
Which means it will be very difficult to make ASCII art working with those characters.

Any tip welcome!

@mattn

This comment has been minimized.

Show comment
Hide comment
@mattn

mattn Sep 19, 2017

seems to depend of the font (for example, you need only six "こ" to have the same length as 1234567 (at least, with the font used by our current Firefox. Using Chrome on the same computer, only for "こ" are need!)

Yes, it depend on fonts. But if the font is monospace, all of characters have a width of 1 or 2 (or more).
And For talking about TUI, you should not use propotional font for example using browser.

If using monospace font, "こ" has double width of "1".

mattn commented Sep 19, 2017

seems to depend of the font (for example, you need only six "こ" to have the same length as 1234567 (at least, with the font used by our current Firefox. Using Chrome on the same computer, only for "こ" are need!)

Yes, it depend on fonts. But if the font is monospace, all of characters have a width of 1 or 2 (or more).
And For talking about TUI, you should not use propotional font for example using browser.

If using monospace font, "こ" has double width of "1".

@mattn

This comment has been minimized.

Show comment
Hide comment
@mattn

mattn commented Sep 19, 2017

@arnaudroques

This comment has been minimized.

Show comment
Hide comment
@arnaudroques

arnaudroques Sep 19, 2017

Contributor

Ok, here is a first beta. We have used the Wcwidth class (thanks again).
https://www.dropbox.com/s/koo42q3d9gxw288/plantuml.jar?dl=0
It should work with your first example. Could you try it and tell us ?
Our work is not finished and many things will not work (yet), but your example should.
Thanks,

Contributor

arnaudroques commented Sep 19, 2017

Ok, here is a first beta. We have used the Wcwidth class (thanks again).
https://www.dropbox.com/s/koo42q3d9gxw288/plantuml.jar?dl=0
It should work with your first example. Could you try it and tell us ?
Our work is not finished and many things will not work (yet), but your example should.
Thanks,

@mattn

This comment has been minimized.

Show comment
Hide comment
@mattn

mattn Sep 19, 2017

mattn commented Sep 19, 2017

@arnaudroques

This comment has been minimized.

Show comment
Hide comment
@arnaudroques

arnaudroques Sep 19, 2017

Contributor

Sure.
Here is a zip file of modified files https://www.dropbox.com/s/mjq9vbal3uorr0o/wcwidth.zip?dl=0

It's difficult for us to work on this, because we cannot really test it: we did not find any fonts in browser that are really monospace.
For example, this is not working on our config:

<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
</head>
<p style="font-family: monospace;">
this is a test<br>
bob - alice : 1234567890<br>
bob - alice : こんにちわ<br>
bob - alice : さようなら<br>
</p>
</html>

Anyway, if you want to change our code, here are some tips:

  • We have added a length() method in Wcwidth that computes the real length of a string, using 1 or 2 per character, thanks to of() method.
  • There is now a getWcWidth() in StringUtils that uses this length() method
  • BasicCharAreaImpl contains a 2-dimension array of characters. When we add a wide character in the array, we put an extra '\0' char in following column.
    This denotes that the wide character is using two places. And in getLine(), we remove those extra '\0' so that the returned string has less characters but will be printed with the same length as "regular" lines.

Our modification is not perfect, but it is the shortest we've found that should work.
Feel free to improve it if you find issues : as I said, it's difficult for us to test.

Thanks again.

Contributor

arnaudroques commented Sep 19, 2017

Sure.
Here is a zip file of modified files https://www.dropbox.com/s/mjq9vbal3uorr0o/wcwidth.zip?dl=0

It's difficult for us to work on this, because we cannot really test it: we did not find any fonts in browser that are really monospace.
For example, this is not working on our config:

<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
</head>
<p style="font-family: monospace;">
this is a test<br>
bob - alice : 1234567890<br>
bob - alice : こんにちわ<br>
bob - alice : さようなら<br>
</p>
</html>

Anyway, if you want to change our code, here are some tips:

  • We have added a length() method in Wcwidth that computes the real length of a string, using 1 or 2 per character, thanks to of() method.
  • There is now a getWcWidth() in StringUtils that uses this length() method
  • BasicCharAreaImpl contains a 2-dimension array of characters. When we add a wide character in the array, we put an extra '\0' char in following column.
    This denotes that the wide character is using two places. And in getLine(), we remove those extra '\0' so that the returned string has less characters but will be printed with the same length as "regular" lines.

Our modification is not perfect, but it is the shortest we've found that should work.
Feel free to improve it if you find issues : as I said, it's difficult for us to test.

Thanks again.

@mattn

This comment has been minimized.

Show comment
Hide comment
@mattn

mattn Sep 20, 2017

Awesome! Working perfectly. 💯

mattn commented Sep 20, 2017

Awesome! Working perfectly. 💯

@mattn

This comment has been minimized.

Show comment
Hide comment
@mattn

mattn Oct 5, 2017

@arnaudroques Could you please merge your changes into master?

mattn commented Oct 5, 2017

@arnaudroques Could you please merge your changes into master?

@arnaudroques

This comment has been minimized.

Show comment
Hide comment
@arnaudroques

arnaudroques Oct 5, 2017

Contributor

Sure, I will commit this in the few incoming days.

Contributor

arnaudroques commented Oct 5, 2017

Sure, I will commit this in the few incoming days.

@arnaudroques

This comment has been minimized.

Show comment
Hide comment
@arnaudroques

arnaudroques Oct 7, 2017

Contributor

It's committed. Thanks for your help!

Contributor

arnaudroques commented Oct 7, 2017

It's committed. Thanks for your help!

@mattn

This comment has been minimized.

Show comment
Hide comment
@mattn

mattn Oct 7, 2017

Great news! I was surprised how quickly you understood about multibyte characters. Thank you.

mattn commented Oct 7, 2017

Great news! I was surprised how quickly you understood about multibyte characters. Thank you.

@mattn mattn closed this Oct 7, 2017

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment