cmd/cgo: add C.WcharString #1691

Open
rsc opened this Issue Apr 13, 2011 · 20 comments

@rsc
Contributor

rsc commented Apr 13, 2011

for calling routines that need a wchar_t*
@rsc

Contributor

rsc commented Dec 9, 2011

Comment 1:

Labels changed: added priority-later.

@rsc

Contributor

rsc commented Dec 12, 2011

Comment 2:

Labels changed: added priority-go1.

@remyoudompheng

Contributor

remyoudompheng commented Dec 15, 2011

Comment 3:

Should the various C.GoString, C.WString, etc. move somewhere into package runtime/cgo?
That would avoid inlining code in cgo string constants. Or is that awkward because it
would imply that some C types are predefined in runtime/cgo while others are
auto-generated?
@robpike

Contributor

robpike commented Jan 13, 2012

Comment 4:

Owner changed to builder@golang.org.

@rsc

Contributor

rsc commented Feb 17, 2012

Comment 6:

wchar_t is pretty rare; need not be in Go 1.

Labels changed: added priority-later, removed priority-go1.

@peterGo

Contributor

peterGo commented Feb 19, 2012

Comment 7:

On Windows, wchar_t is ubiquitous. Unicode-enabled Windows API functions take UTF-16
(wide character) strings, and UTF-16 is the native Unicode encoding on Windows
operating systems.
Windows Data Types for Strings
http://msdn.microsoft.com/en-us/library/windows/desktop/dd374131.aspx
@rsc

Contributor

rsc commented Feb 19, 2012

Comment 8:

I would be happy to review a patch providing wchar_t in cgo,
but the Go team is not going to make it a priority for their own
Go work to write such a patch.
@gopherbot

gopherbot commented Mar 13, 2012

Comment 9 by Edward.Casey.Adams:

Perhaps Cgo users should link to libiconv (http://www.gnu.org/software/libiconv/)
instead?
The problem is that both the width and the Unicode encoding of wchar_t are not well
defined. (See http://en.wikipedia.org/wiki/Wide_character#C.2FC.2B.2B) For example, on
Windows/Visual Studio platforms, wchar_t is 16 bits wide and encoded in UTF-16LE,
whereas on most Linux distros wchar_t is defined to be 32 bits wide, but most Unicode
text is UTF-8 stored in regular chars, and most anything else won't be little-endian.
Thus adding C.WcharString() adds ambiguity.
@rsc

Contributor

rsc commented Mar 13, 2012

Comment 10:

You would only use C.WcharString on systems where you needed a wchar_t*.
The definition would be whatever that means on that system.
@rsc

Contributor

rsc commented Sep 12, 2012

Comment 11:

Labels changed: added go1.1.

@rsc

Contributor

rsc commented Dec 9, 2012

Comment 12:

Labels changed: removed go1.1.

@rsc

Contributor

rsc commented Nov 27, 2013

Comment 13:

Labels changed: added go1.3maybe.

@rsc

Contributor

rsc commented Dec 4, 2013

Comment 14:

Labels changed: added release-none, removed go1.3maybe.

@rsc

Contributor

rsc commented Dec 4, 2013

Comment 15:

Labels changed: added repo-main.

@GeertJohan

Contributor

GeertJohan commented Apr 30, 2014

Comment 16:

I once made this package: https://github.com/GeertJohan/cgo.wchar
It works well, but requires libiconv. I have never tested it on anything except linux.
@andlabs

Contributor

andlabs commented Jun 2, 2014

Comment 17:

The problem with comment #10 is that you would either
a) need to know what the definition of wchar_t is on the target platform, or
b) use the mbtowc() family of functions, which requires you to know what the multibyte
encoding is.
If we could guarantee that all systems supported by Go have a multibyte encoding of UTF-8,
then we could implement this portably. Alas:
$ uname -a
Linux pietro-laptop 3.13.0-29-generic #52-Ubuntu SMP Wed May 28 12:42:47 UTC 2014 x86_64
x86_64 x86_64 GNU/Linux
$ cat multibyte.c
#include <stdio.h>
#include <stdlib.h>
#include <limits.h>
#include <string.h>
#include <errno.h>
#include <locale.h>
int main(void)
{
    wchar_t wide = L'世';
    char multibyte[MB_LEN_MAX];
    int i, n;
    setlocale(LC_ALL, "");
    errno = 0;
    n = wctomb(multibyte, wide);
    if (n == -1) {
        fprintf(stderr, "error %s\n", strerror(errno));
        return 1;
    }
    if (n == 0) {
        fprintf(stderr, "weird: wctomb() returned 0 (no bytes in output)\n");
        return 2;
    }
    for (i = 0; i < n; i++)
        printf("%02X ", multibyte[i]);
    printf("\n");
    return 0;
}
$ LC_CTYPE= ./a.out 
FFFFFFE4 FFFFFFB8 FFFFFF96 
$ LC_CTYPE=en_US.UTF8 ./a.out
FFFFFFE4 FFFFFFB8 FFFFFF96 
$ LC_CTYPE=ja_JP.SJIS ./a.out 
FFFFFF90 FFFFFFA2 
So as far as I can gather, a C.CWString() would need to be platform-specific.
For Windows, we can either
- do the work on the Go side: have unicode/utf16 do the conversion (this is what package
syscall does)
- do the work on the C side: use MultiByteToWideChar() in kernel32.dll by passing
CP_UTF8 as the first argument (which should work regardless of locale)
For the Unixes, though, I'm not sure... other than linking to libiconv, which I imagine
isn't optimal, or flat out not providing it since it isn't used much to begin with, in
which case for Windows we could just say use the routines in package syscall.
(I have wanted to prune through cgo myself sometime.)
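For the "do the work on the Go side" option on Windows, a minimal sketch of the conversion might look like the following. The helper name utf16z is hypothetical and is not part of cgo or package syscall; it only illustrates using unicode/utf16 to build the NUL-terminated UTF-16 buffer that a wchar_t* parameter expects on Windows.

```go
package main

import (
	"fmt"
	"unicode/utf16"
)

// utf16z is a hypothetical helper (not an existing cgo or syscall function).
// It converts a Go string (UTF-8) into UTF-16 code units and appends the
// terminating NUL expected by C wide-string APIs on Windows.
// Assumption: s contains no interior NUL; a real helper would likely reject one.
func utf16z(s string) []uint16 {
	return append(utf16.Encode([]rune(s)), 0)
}

func main() {
	fmt.Printf("%04X\n", utf16z("世"))  // prints [4E16 0000]
	fmt.Printf("%04X\n", utf16z("😀")) // prints [D83D DE00 0000] (surrogate pair)
}
```

The address of the first element of the returned slice could then be handed to C as a wchar_t*, subject to the usual cgo rules about passing Go memory to C.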
@mdempsky

Member

mdempsky commented Aug 6, 2014

Comment 18:

C99 and later specify that if __STDC_ISO_10646__ is defined, then wchar_t characters
have value equal to their Unicode code point.  We could conditionally provide/expose
C.WcharString() (or C.CWString() or whatever) only if the C compiler defines that macro,
and then I don't think we need to rely on any external libraries like libiconv.
I think the only nit would be how to handle code points greater than WCHAR_MAX.  ISO C
doesn't specify how to handle that case, but in practice it seems like encoding
characters using UTF-{8*sizeof(wchar_t)} should work.  Varying the implementation
depending on sizeof(wchar_t) might be a tad involved, but nothing really out of the
ordinary from what cgo already has to do I think.
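The idea of varying the encoding with sizeof(wchar_t) can be sketched in Go as follows. This is only an illustration under stated assumptions: encodeWide and its wcharSize parameter are hypothetical names, wcharSize stands in for a value cgo would have to obtain from the C compiler, and the sketch assumes wchar_t holds Unicode values as described above.

```go
package main

import (
	"fmt"
	"unicode/utf16"
)

// encodeWide is a hypothetical helper. It returns the NUL-terminated code
// units for s, choosing UTF-8, UTF-16, or UTF-32 according to wcharSize
// (a stand-in for sizeof(wchar_t) in bytes). Units are returned as uint32
// purely to keep the sketch short; a real version would use the exact width.
func encodeWide(s string, wcharSize int) ([]uint32, error) {
	var units []uint32
	switch wcharSize {
	case 1: // UTF-8: Go strings are already UTF-8 bytes
		for i := 0; i < len(s); i++ {
			units = append(units, uint32(s[i]))
		}
	case 2: // UTF-16: code points above U+FFFF become surrogate pairs
		for _, u := range utf16.Encode([]rune(s)) {
			units = append(units, uint32(u))
		}
	case 4: // UTF-32: one unit per code point
		for _, r := range s {
			units = append(units, uint32(r))
		}
	default:
		return nil, fmt.Errorf("unsupported wchar_t size %d", wcharSize)
	}
	return append(units, 0), nil
}

func main() {
	for _, size := range []int{2, 4} {
		units, err := encodeWide("世😀", size)
		if err != nil {
			panic(err)
		}
		fmt.Printf("size %d: %X\n", size, units)
	}
	// prints:
	// size 2: [4E16 D83D DE00 0]
	// size 4: [4E16 1F600 0]
}
```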
@ianlancetaylor

Contributor

ianlancetaylor commented Aug 6, 2014

Comment 19:

As far as I can tell, neither GCC nor clang defines __STDC_ISO_10646__, so this seems
rather theoretical.
@mdempsky

Member

mdempsky commented Aug 6, 2014

Comment 20:

Hm, at least GCC (4.8.2) on Ubuntu 14.04 defines it:
$ echo | gcc -E -dD - | grep STDC_ISO_10646
#define __STDC_ISO_10646__ 201103L
(Seems to come from /usr/include/stdc-predef.h, provided by glibc.)
But indeed GCC 4.6.3 on Ubuntu 12.04 or even just Clang 3.5 on Ubuntu 14.04 do not, so
that's unfortunate.
@mdempsky

Member

mdempsky commented Aug 6, 2014

Comment 21:

Oh, older glibc versions define __STDC_ISO_10646__ in <features.h>, which then gets pulled
in by other glibc headers like <wchar.h>, but it won't be provided by default or by
GCC-provided headers like <stddef.h>.
But I suppose it's still not a very worthwhile signal unless Windows and OS X also
define it.

rsc added this to the Unplanned milestone on Apr 10, 2015
