Add mrb_utf8_from_locale, mrb_utf8_free, mrb_locale_from_utf8, mrb_locale_free #1822

Merged
matz merged 1 commit into mruby:master from mattn:locale on Sep 11, 2015

6 participants
@mattn
Contributor

mattn commented Mar 7, 2014

Add mrb_cstr_from_locale/mrb_cstr_to_locale. ARGV should hold UTF-8 strings converted from locale strings, and printstr should print locale strings.

Contributor

mattn commented Mar 7, 2014

If you are worried about needless memory allocation on non-UTF-8 locales, I can add #ifdef code for them.

Contributor

mattn commented Mar 7, 2014

Sorry about noisy commits.

Member

matz commented Mar 7, 2014

Nice idea, but direct usage of malloc/strdup/strndup/free is unacceptable in mruby.
Besides that, if mrb_cstr_to/from_locale do no conversion and only copy memory in non-Windows environments, I am not sure the names are appropriate.

Contributor

mattn commented Mar 7, 2014

How about this?

Member

matz commented Mar 7, 2014

Much better. I prefer the no-overhead approach.
Should we name it after the Win32 term (e.g. codepage) instead of the generic locale?
Or had we better keep room for future extension?

Contributor

mattn commented Mar 7, 2014

I named these after the glib string functions.

http://www.gtk.org/api/2.6/glib/glib-Character-Set-Conversion.html#g-locale-to-utf8

BTW, there seems to be a conflict.

Contributor

mattn commented Mar 7, 2014

Patched again and force-pushed.

Contributor

mattn commented Mar 7, 2014

Do you prefer codepage in the names?

BEFORE

  • mrb_utf8_from_locale
  • mrb_utf8_free
  • mrb_locale_from_utf8
  • mrb_locale_free

AFTER

  • mrb_utf8_from_codepage
  • mrb_utf8_free
  • mrb_codepage_from_utf8
  • mrb_codepage_free

Contributor

mattn commented Mar 7, 2014

I suppose these APIs will be used in mrbgems, so we must choose the names carefully.

beoran commented Mar 7, 2014

I am not sure, but shouldn't it be possible to do this with the C99 standard functions mbrtowc/mbtowc/wctomb, etc.? I don't much like platform-specific code if there is a standard way to do it.

Member

matz commented Mar 8, 2014

Unfortunately, C99 mbtowc etc. can only convert strings between a locale-dependent multibyte encoding (which may or may not be UTF-8) and an opaque wide-character encoding (which may or may not be UTF-32). We cannot switch locale in the middle of execution in C99 either.

So those functions are too weak to implement locale to/from UTF-8 conversion.
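
To make the limitation concrete, here is a minimal C sketch (not taken from the patch; the `sketch_` name is hypothetical). `mbstowcs` takes no encoding argument: the source encoding is whatever the current `LC_CTYPE` locale says, and the resulting `wchar_t` values use an implementation-defined encoding, so UTF-8 can never be requested as an explicit target.

```c
#include <assert.h>
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical helper: decode `src` from the *current locale's* multibyte
 * encoding into `dst` (at most `dst_len` wide characters).
 * Returns the number of wide characters written, excluding the terminator,
 * or (size_t)-1 if `src` is not valid in the current locale. */
size_t sketch_locale_to_wide(const char *src, wchar_t *dst, size_t dst_len)
{
  assert(src != NULL && dst != NULL);
  /* No parameter names the source or destination encoding: both are
   * fixed by the process-wide locale and by the C implementation. */
  return mbstowcs(dst, src, dst_len);
}
```

In the default "C" locale only ASCII input is guaranteed to convert; anything else needs `setlocale(LC_CTYPE, "")`, which changes process-wide state.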

Contributor

mattn commented Mar 8, 2014

I can write code to convert wide-character code points to UTF-8 bytes.
But this needs a call to setlocale at initialization, and currently this
conversion is needed only on Windows.
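
For reference, encoding a code point as UTF-8 by hand could look roughly like the sketch below. It is an illustration only, not code from the patch: `sketch_utf8_encode` is a hypothetical name, and it assumes the input is already a full Unicode code point (on Windows, where `wchar_t` is UTF-16, surrogate pairs would have to be combined first).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode one Unicode code point as UTF-8.
 * Writes 1 to 4 bytes into `out` and returns the byte count, or 0 for
 * surrogates and for values above U+10FFFF. */
size_t sketch_utf8_encode(uint32_t cp, unsigned char out[4])
{
  assert(out != NULL);
  if (cp < 0x80) {                       /* 0xxxxxxx */
    out[0] = (unsigned char)cp;
    return 1;
  }
  if (cp < 0x800) {                      /* 110xxxxx 10xxxxxx */
    out[0] = (unsigned char)(0xC0 | (cp >> 6));
    out[1] = (unsigned char)(0x80 | (cp & 0x3F));
    return 2;
  }
  if (cp < 0x10000) {                    /* 1110xxxx 10xxxxxx 10xxxxxx */
    if (cp >= 0xD800 && cp <= 0xDFFF) return 0;  /* lone surrogate */
    out[0] = (unsigned char)(0xE0 | (cp >> 12));
    out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[2] = (unsigned char)(0x80 | (cp & 0x3F));
    return 3;
  }
  if (cp <= 0x10FFFF) {                  /* 11110xxx followed by 3 trail bytes */
    out[0] = (unsigned char)(0xF0 | (cp >> 18));
    out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (unsigned char)(0x80 | (cp & 0x3F));
    return 4;
  }
  return 0;                              /* outside the Unicode range */
}
```

The encoder itself is locale-independent; the setlocale requirement comes from the earlier step of turning locale bytes into wide characters.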

beoran commented Mar 8, 2014

Hmm, I see. Windows i18n & l10n seem really complicated... Too bad the ANSI functions are not powerful enough. If that's the case, then please carry on. However, I do wonder how plain Ruby handled this problem in the old 1.8.x days, before we had Encoding?

Member

matz commented Mar 8, 2014

@beoran good question.

Back in the 1.8 days, strings did not handle any multibyte encoding, but Regexp did. Besides that,
there was no encoding conversion at all.

So if you wanted to handle multibyte strings in 1.8, you had to use Regexp matching with your locale.
1.8 regexps supported only Shift-JIS, EUC and UTF-8.

bjorndm commented Mar 10, 2014

Hmmm, interesting. Currently an mruby string is just an arbitrary byte buffer. Wouldn't it be possible to do any conversions on the mruby side? Especially since we don't have working regexps yet.

Also, we have to keep issue #1715 in mind. What should strings in mruby be? Dumb byte buffers like in Ruby 1.8 or in Lua? Always UTF-8 encoded, with a separate byte buffer class, like in Python? Or support some form of Encoding...? We need to think about this carefully. The balance is to keep mruby small while at the same time helping portability and i18n.

Contributor

take-cheeze commented Mar 10, 2014

At least mruby-onig-regexp can support non-UTF-8 strings easily, because Oniguruma has great support for many encodings.

Contributor

nurse commented Mar 17, 2014

I'm not sure about the exact use case, but I think this is for input/output on the Windows console.
If so, why do you use the ANSI API and convert to UTF-8?
You can use the Wide API and convert between UTF-16 and UTF-8.
It is faster and loses no information.

Contributor

mattn commented Mar 17, 2014

> If so, why do you use the ANSI API and convert to UTF-8?

Are you talking to me?

mattn@f3017a9#diff-2a75186e465ffeac1b306a350f4a56f8R71

Contributor

mattn commented Mar 17, 2014

See #1715; we made a spec for how to store UTF-8 bytes in RString.

Contributor

nurse commented Mar 18, 2014

On output, you convert UTF-8 strings to (wide characters and then to) ANSI strings (SJIS strings) and call fwrite(3). This loses any Unicode characters the ANSI code page cannot represent.

For example, CRuby converts UTF-8 strings to wide characters and uses WriteConsoleW().

Contributor

mattn commented Mar 18, 2014

Ah, I understand now. The issue is, if anything, how to handle non-UTF-8 strings with minimal changes in mruby. What you describe only concerns mrb_p, right? And I guess it would be an easy improvement after merging this. Thanks.

Member

matz commented Mar 19, 2014

I am sorry, I don't understand. How can we convert a locale string to UTF-8 with the ANSI API?

Contributor

mattn commented Mar 19, 2014

It's possible, but it requires a call to setlocale(LC_CTYPE, ""). This affects many things and causes side effects.

Contributor

nurse commented Mar 20, 2014

> I am sorry, I don't understand. How can we convert a locale string to UTF-8 with the ANSI API?

What I mean is not conversion from/to a locale string. I pointed out that we can get UTF-16 strings from the console and output them to the console.

  • A: Console --(locale string) --> mruby
  • B: Console --(UTF-16 string) --> mruby

bjorndm commented Mar 21, 2014

What is the problem we're trying to solve here anyway?

If I understand correctly, the problem is that currently it's not possible to output UTF-8 encoded strings to the console on Windows using puts, etc?

I read a bit here: http://stackoverflow.com/questions/1371012/how-do-i-print-utf-8-from-c-console-application-on-windows, and I found that just entering chcp 65001 will correctly display UTF-8 in the Windows console. So, I think we don't need to change mruby for this.

But, if it's really a problem, we could use SetConsoleOutputCP (http://msdn.microsoft.com/en-us/library/ms686013.aspx) to automatically set a utf-8 enabled code page on Windows.

Edit: I compiled mruby using mingw and then mruby simply crashed on utf-8 input.. >_<

Contributor

take-cheeze commented Mar 21, 2014

BTW, libuv's tty treats const char* as UTF-8 on Windows.
libuv converts it to wchar_t* internally.
It also makes ANSI escape codes work on Windows.

Contributor

mattn commented Mar 22, 2014

@bjorndm

> I read a bit here: http://stackoverflow.com/questions/1371012/how-do-i-print-utf-8-from-c-console-application-on-windows, and I found that just entering chcp 65001 will correctly display UTF-8 in the Windows console. So, I think we don't need to change mruby for this.
>
> But, if it's really a problem, we could use SetConsoleOutputCP (http://msdn.microsoft.com/en-us/library/ms686013.aspx) to automatically set a utf-8 enabled code page on Windows.

That's not a good way to solve it. Changing the console code page affects the console font, so the window will be resized. And if we don't have Unicode fonts, we can't display any UTF-8 strings. For example, I want to use mruby as a scripting language.

ls *.rb | xargs mruby

The console window will be resized for each file.

Regarding @nurse's comment:

Here is what my patch does:

  1. MultiByteToWideChar to convert UTF-8 bytes into wide characters.
  2. WideCharToMultiByte to convert wide characters into code page bytes.
  3. fwrite to write the code page bytes.

You are saying that I can replace steps 2 and 3 above:

  1. MultiByteToWideChar to convert UTF-8 bytes into wide characters.
  2. WriteConsoleW to write the wide characters.

But WriteConsoleW should be used only when the output handle is a console.

mruby foo.rb > log

In this case, the output handle isn't a console.

@matz what are your worries or questions?
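
The three-step path described above can be sketched roughly as follows. This is an illustration, not the patch itself: `sketch_locale_from_utf8` and `sketch_print_utf8` are hypothetical names, plain `malloc`/`free` are used only to keep the example standalone (mruby itself would go through its own allocator, as matz points out earlier in the thread), and the non-Windows branch is assumed to just copy the bytes.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#ifdef _WIN32
#include <windows.h>
#endif

/* Hypothetical helper: return a malloc'ed, NUL-terminated copy of `utf8`
 * converted to the ANSI code page. Returns NULL on failure. */
char *sketch_locale_from_utf8(const char *utf8)
{
  assert(utf8 != NULL);
#ifdef _WIN32
  /* Step 1: UTF-8 bytes -> wide characters (length includes the NUL). */
  int wlen = MultiByteToWideChar(CP_UTF8, 0, utf8, -1, NULL, 0);
  wchar_t *wbuf;
  char *bytes = NULL;
  int blen;
  if (wlen <= 0) return NULL;
  wbuf = malloc(sizeof(wchar_t) * (size_t)wlen);
  if (!wbuf) return NULL;
  MultiByteToWideChar(CP_UTF8, 0, utf8, -1, wbuf, wlen);
  /* Step 2: wide characters -> code page bytes. */
  blen = WideCharToMultiByte(CP_ACP, 0, wbuf, -1, NULL, 0, NULL, NULL);
  if (blen > 0) bytes = malloc((size_t)blen);
  if (bytes) WideCharToMultiByte(CP_ACP, 0, wbuf, -1, bytes, blen, NULL, NULL);
  free(wbuf);
  return bytes;
#else
  /* Assumption: elsewhere the locale is treated as UTF-8, so just copy. */
  size_t n = strlen(utf8) + 1;
  char *copy = malloc(n);
  if (copy) memcpy(copy, utf8, n);
  return copy;
#endif
}

/* Step 3: write the converted bytes with fwrite. Returns bytes written. */
size_t sketch_print_utf8(const char *utf8, FILE *fp)
{
  char *bytes = sketch_locale_from_utf8(utf8);
  size_t n = 0;
  if (bytes) {
    n = fwrite(bytes, 1, strlen(bytes), fp);
    free(bytes);
  }
  return n;
}
```

Because the last step is an ordinary `fwrite`, this path works the same whether stdout is a console, a file, or a pipe, at the cost of losing characters the code page cannot represent.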

Contributor

nurse commented Mar 22, 2014

As @mattn says, if you use cp65001 you invite other issues.
With the Wide APIs you can bypass such locale-related issues.
You should use the Wide APIs when you talk to Windows.

> But WriteConsoleW should be used only when the output handle is a console.

Good point; you can check that with _isatty( _fileno( stdout ) ).
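
A rough sketch of how that check might gate the two output paths. Assumptions: `sketch_write_utf8` is a hypothetical name; redirected output is written here as raw UTF-8 bytes, whereas the patch under discussion converts to the code page first; on non-Windows systems the bytes are always written unchanged.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#ifdef _WIN32
#include <windows.h>
#include <io.h>
#endif

/* Hypothetical helper: write `len` bytes of UTF-8 text to stdout.
 * Returns 1 on success, 0 on failure. */
int sketch_write_utf8(const char *utf8, size_t len)
{
  assert(utf8 != NULL);
#ifdef _WIN32
  if (_isatty(_fileno(stdout))) {
    /* stdout is a console: convert to UTF-16 and bypass the code page. */
    int wlen = MultiByteToWideChar(CP_UTF8, 0, utf8, (int)len, NULL, 0);
    wchar_t *wbuf;
    DWORD written = 0;
    BOOL ok;
    if (wlen <= 0) return len == 0;  /* nothing to write, or invalid input */
    wbuf = malloc(sizeof(wchar_t) * (size_t)wlen);
    if (!wbuf) return 0;
    MultiByteToWideChar(CP_UTF8, 0, utf8, (int)len, wbuf, wlen);
    fflush(stdout);  /* keep ordering with earlier stdio output */
    ok = WriteConsoleW(GetStdHandle(STD_OUTPUT_HANDLE), wbuf,
                       (DWORD)wlen, &written, NULL);
    free(wbuf);
    return ok != 0;
  }
#endif
  /* Not a console (file or pipe), or not Windows: write the bytes as is. */
  return fwrite(utf8, 1, len, stdout) == len;
}
```

With `mruby foo.rb > log` the `_isatty` test fails, so the call falls through to `fwrite` and never touches `WriteConsoleW`.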

Contributor

mattn commented Mar 24, 2014

@nurse do you mean that it should be put inside #ifdef _WIN32?

Contributor

nurse commented Mar 25, 2014

@mattn If mruby supports Unicode on Windows, it should. But mruby also has the option of splitting such features out into mrbgems.

bjorndm commented Mar 25, 2014

I like the idea of making an mrbgem for windows-specific code too.

Contributor

mattn commented Mar 25, 2014

It's possible to implement mrb_p for Windows, but it's not possible to convert ARGV before scripts start running. So I didn't do it that way.

Support windows locale
Add mrb_utf8_from_locale, mrb_utf8_free, mrb_locale_from_utf8, mrb_locale_free. Just works for windows.

Contributor

mattn commented Sep 11, 2015

rebased.

matz added a commit that referenced this pull request Sep 11, 2015

Merge pull request #1822 from mattn/locale
Add mrb_utf8_from_locale, mrb_utf8_free, mrb_locale_from_utf8, mrb_locale_free

@matz matz merged commit eb9bec1 into mruby:master Sep 11, 2015

1 check passed

continuous-integration/travis-ci/pr The Travis CI build passed

@mattn mattn deleted the mattn:locale branch Sep 11, 2015