
Add mrb_utf8_from_locale, mrb_utf8_free, mrb_locale_from_utf8, mrb_locale_free #1822

Merged on Sep 11, 2015

Conversation

mattn
Contributor

@mattn mattn commented Mar 7, 2014

Add mrb_cstr_from_locale/mrb_cstr_to_locale. ARGV should hold UTF-8 strings converted from locale strings, and printstr should print locale strings.

@mattn
Contributor Author

mattn commented Mar 7, 2014

If you are worried about needless memory allocation on non-UTF-8 locales, I can add #ifdef guards to them.

@mattn
Contributor Author

mattn commented Mar 7, 2014

Sorry about the noisy commits.

@matz
Member

matz commented Mar 7, 2014

Nice idea, but direct usage of malloc/strdup/strndup/free is unacceptable in mruby.
Besides that, if mrb_cstr_to/from_locale do no conversion and only copy memory in non-Windows environments, I am not sure whether the names are appropriate.

@mattn
Contributor Author

mattn commented Mar 7, 2014

How about this?

@matz
Member

matz commented Mar 7, 2014

Much better. I prefer the no-overhead way.
Should we name it after the win32 term (e.g. codepage) instead of the generic locale?
Or had we better keep room for future extension?

@mattn
Contributor Author

mattn commented Mar 7, 2014

I named these after the GLib string functions:

http://www.gtk.org/api/2.6/glib/glib-Character-Set-Conversion.html#g-locale-to-utf8

BTW, there seems to be a conflict.

@mattn
Contributor Author

mattn commented Mar 7, 2014

Patched again and force-pushed.

@mattn
Contributor Author

mattn commented Mar 7, 2014

Do you prefer the name codepage?

BEFORE

  • mrb_utf8_from_locale
  • mrb_utf8_free
  • mrb_locale_from_utf8
  • mrb_locale_free

AFTER

  • mrb_utf8_from_codepage
  • mrb_utf8_free
  • mrb_codepage_from_utf8
  • mrb_codepage_free
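A minimal sketch of the no-overhead shape discussed above: on Windows the helpers convert through wide characters, elsewhere they collapse to macros that pass the pointer through and free nothing. The `sketch_` names, the `(const char *, int)` signature, the `-1` length convention, and the direct use of malloc/free are assumptions made for illustration only; as noted earlier in the thread, mruby itself would have to go through its own allocator.

```c
#include <stdlib.h>
#include <string.h>

#ifdef _WIN32
#include <windows.h>

/* Convert `len` bytes of `str` from code page `from` to code page `to`,
 * going through UTF-16. Returns a NUL-terminated malloc'd buffer, or NULL. */
static char *sketch_convert(const char *str, int len, UINT from, UINT to)
{
  wchar_t *wbuf;
  char *out;
  int wlen, olen;

  if (len < 0) len = (int)strlen(str);
  if (len == 0) return (char *)calloc(1, 1);  /* the Win32 calls reject length 0 */

  /* First call asks for the required size, second call converts. */
  wlen = MultiByteToWideChar(from, 0, str, len, NULL, 0);
  if (wlen == 0) return NULL;
  wbuf = (wchar_t *)malloc((size_t)wlen * sizeof(wchar_t));
  if (wbuf == NULL) return NULL;
  MultiByteToWideChar(from, 0, str, len, wbuf, wlen);

  olen = WideCharToMultiByte(to, 0, wbuf, wlen, NULL, 0, NULL, NULL);
  out = (olen > 0) ? (char *)malloc((size_t)olen + 1) : NULL;
  if (out != NULL) {
    WideCharToMultiByte(to, 0, wbuf, wlen, out, olen, NULL, NULL);
    out[olen] = '\0';
  }
  free(wbuf);
  return out;
}

#define sketch_utf8_from_locale(p, l) sketch_convert((p), (l), CP_ACP, CP_UTF8)
#define sketch_locale_from_utf8(p, l) sketch_convert((p), (l), CP_UTF8, CP_ACP)
#define sketch_utf8_free(p)   free(p)
#define sketch_locale_free(p) free(p)

#else  /* assumption: off Windows the locale string is used as-is */

#define sketch_utf8_from_locale(p, l) ((char *)(p))
#define sketch_locale_from_utf8(p, l) ((char *)(p))
#define sketch_utf8_free(p)   ((void)(p))
#define sketch_locale_free(p) ((void)(p))

#endif
```

A caller pairs each conversion with the matching free; off Windows both calls compile away, which is the "no overhead" property.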

@mattn
Contributor Author

mattn commented Mar 7, 2014

I suppose these APIs will be used in mrbgems, so we must choose the names carefully.

@beoran

beoran commented Mar 7, 2014

I am not sure, but I think it must be possible to do this using the C99 standard functions mbrtowc/mbtowc/wctomb, etc.? I don't much like platform-specific code if there is a standard way to do it.

@matz
Member

matz commented Mar 8, 2014

Unfortunately, C99 mbtowc etc. can only convert strings between a locale-dependent multibyte encoding (which may or may not be UTF-8) and an opaque wide character encoding (which may or may not be UTF-32). We cannot switch locale in the middle of execution in C99 either.

So those functions are too weak to implement locale to/from UTF-8 conversion.

@mattn
Contributor Author

mattn commented Mar 8, 2014

I can write code for converting wide-character code points to UTF-8 bytes.
But this needs a call to setlocale at initialization. And currently this
conversion is needed only for Windows.
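
The code-point-to-UTF-8 step mentioned here can be sketched as follows. This is an illustration, not the patch's code: `sketch_codepoint_to_utf8` is a hypothetical name, and it assumes its input is already a Unicode code point (on Windows a `wchar_t` is a UTF-16 unit, so surrogate pairs would have to be combined first).

```c
#include <stddef.h>
#include <stdint.h>

/* Encode one Unicode code point as UTF-8 into out[0..3].
 * Returns the number of bytes written (1 to 4), or 0 if `cp` is not a
 * valid scalar value (a surrogate, or above U+10FFFF). */
size_t sketch_codepoint_to_utf8(uint32_t cp, unsigned char out[4])
{
  if (cp < 0x80) {                       /* 0xxxxxxx */
    out[0] = (unsigned char)cp;
    return 1;
  }
  if (cp < 0x800) {                      /* 110xxxxx 10xxxxxx */
    out[0] = (unsigned char)(0xC0 | (cp >> 6));
    out[1] = (unsigned char)(0x80 | (cp & 0x3F));
    return 2;
  }
  if (cp < 0x10000) {                    /* 1110xxxx 10xxxxxx 10xxxxxx */
    if (cp >= 0xD800 && cp <= 0xDFFF) return 0;  /* lone surrogate */
    out[0] = (unsigned char)(0xE0 | (cp >> 12));
    out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[2] = (unsigned char)(0x80 | (cp & 0x3F));
    return 3;
  }
  if (cp <= 0x10FFFF) {                  /* 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
    out[0] = (unsigned char)(0xF0 | (cp >> 18));
    out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (unsigned char)(0x80 | (cp & 0x3F));
    return 4;
  }
  return 0;
}
```

For example, U+3042 (HIRAGANA LETTER A) encodes to the three bytes E3 81 82.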


@beoran

beoran commented Mar 8, 2014

Hmm, I see. Windows i18n & l10n seems really complicated... Too bad the ANSI functions are not powerful enough. If that's the case, then please carry on. However, I do wonder how plain Ruby handled this problem in the old 1.8.x days, before we had Encoding?

@matz
Member

matz commented Mar 8, 2014

@beoran good question.

Back in the 1.8 days, strings did not handle any multibyte encoding, but Regexp did. Besides that,
there was no encoding conversion at all.

So if you wanted to handle multibyte strings in 1.8, you had to use Regexp matching with your locale.
The 1.8 regexp only supported Shift-JIS, EUC and UTF-8.

@bjorndm

bjorndm commented Mar 10, 2014

Hmmm, interesting. Currently an mruby string is just an arbitrary byte buffer. Wouldn't it be possible to do any conversions on the mruby side? Especially since we don't have working regexps yet.

Also, we have to keep issue #1715 in mind. What should strings in mruby be? Dumb byte buffers like in Ruby 1.8 or in Lua? Always UTF-8 encoded, with a separate byte buffer class, like it is in Python? Or support some form of Encoding...? We need to think about this well. The balance is to keep mruby small, whilst at the same time helping portability and i18n.

@take-cheeze
Contributor

At least mruby-onig-regexp can support non-utf8 strings easily because Oniguruma has great support of many encodings.

@nurse
Contributor

nurse commented Mar 17, 2014

I'm not sure about this exact use case, but I think this is for input/output on the Windows console.
If so, why do you use the ANSI API and convert to UTF-8?
You can use the Wide API and convert between UTF-16 and UTF-8.
It is faster and loses no information.

@mattn
Contributor Author

mattn commented Mar 17, 2014

If so, why do you use the ANSI API and convert to UTF-8?

Are you talking to me?

mattn@f3017a9#diff-2a75186e465ffeac1b306a350f4a56f8R71

@mattn
Contributor Author

mattn commented Mar 17, 2014

See #1715, we made a spec how to store utf-8 bytes into RString.

@nurse
Contributor

nurse commented Mar 18, 2014

On output, you convert UTF-8 strings to (wide characters and then to) ANSI strings (SJIS strings) and call fwrite(3). This loses any Unicode characters the code page cannot represent.

For example, CRuby converts UTF-8 strings to wide characters and uses WriteConsoleW().

@mattn
Contributor Author

mattn commented Mar 18, 2014

Ah, I understand now. The issue is, if anything, how to handle non-UTF-8 strings with minimal changes in mruby. What you describe only concerns mrb_p, right? I guess it's an easy improvement after merging this. Thanks.

@matz
Member

matz commented Mar 19, 2014

I am sorry I don't understand. How can we convert locale string to UTF-8 in ANSI API?

@mattn
Contributor Author

mattn commented Mar 19, 2014

It's possible, but it requires calling setlocale(LC_CTYPE, ""). This affects many things and has side effects.
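
A rough sketch of that standard-C route, to make the dependency on setlocale concrete. `sketch_locale_to_wide` is a hypothetical helper, plain malloc is used only to keep the example self-contained, and the result is only meaningful for non-ASCII input if the process locale has been set beforehand, which is exactly the global side effect in question.

```c
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Convert a multibyte string in the *current* LC_CTYPE locale to wide
 * characters using only standard C. Returns a malloc'd, NUL-terminated
 * buffer, or NULL on allocation failure or an invalid byte sequence.
 *
 * For non-ASCII input the caller must already have run
 *   setlocale(LC_CTYPE, "");
 * which changes behaviour process-wide. */
wchar_t *sketch_locale_to_wide(const char *s)
{
  size_t cap = strlen(s) + 1;  /* wide length never exceeds byte length */
  wchar_t *w = (wchar_t *)malloc(cap * sizeof(wchar_t));
  size_t n;

  if (w == NULL) return NULL;
  n = mbstowcs(w, s, cap);
  if (n == (size_t)-1) {       /* byte sequence invalid in this locale */
    free(w);
    return NULL;
  }
  w[n] = L'\0';                /* n <= cap - 1, so this stays in bounds */
  return w;
}
```

The wide characters would still need a separate encoding step to become UTF-8 bytes, and what a `wchar_t` holds is platform-dependent.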

@nurse
Contributor

nurse commented Mar 20, 2014

I am sorry I don't understand. How can we convert locale string to UTF-8 in ANSI API?

What I am talking about is not conversion from/to locale strings. I pointed out that we can get UTF-16 strings from the console and output them to the console.

  • A: Console --(locale string) --> mruby
  • B: Console --(UTF-16 string) --> mruby

@bjorndm

bjorndm commented Mar 21, 2014

What is the problem we're trying to solve here anyway?

If I understand correctly, the problem is that currently it's not possible to output UTF-8 encoded strings to the console on Windows using puts, etc?

I read a bit here: http://stackoverflow.com/questions/1371012/how-do-i-print-utf-8-from-c-console-application-on-windows, and I found that just entering chcp 65001 will correctly display UTF-8 in the Windows console. So, I think we don't need to change mruby for this.

But, if it's really a problem, we could use SetConsoleOutputCP (http://msdn.microsoft.com/en-us/library/ms686013.aspx) to automatically set a utf-8 enabled code page on Windows.

Edit: I compiled mruby using mingw and then mruby simply crashed on utf-8 input.. >_<
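
The SetConsoleOutputCP route suggested above would look roughly like this. A sketch only: the `sketch_console_utf8_*` names are hypothetical, and saving and restoring the previous code page is an assumption about what a well-behaved caller would do, not something proposed in this thread. Off Windows both functions do nothing.

```c
#ifdef _WIN32
#include <windows.h>
static UINT sketch_saved_cp;  /* code page in effect before the switch */
#endif

/* Switch the console output code page to UTF-8 (65001).
 * Returns 0 on success, -1 on failure (e.g. no console attached). */
int sketch_console_utf8_begin(void)
{
#ifdef _WIN32
  sketch_saved_cp = GetConsoleOutputCP();
  return SetConsoleOutputCP(CP_UTF8) ? 0 : -1;
#else
  return 0;
#endif
}

/* Restore whatever code page was active before sketch_console_utf8_begin. */
void sketch_console_utf8_end(void)
{
#ifdef _WIN32
  if (sketch_saved_cp != 0) SetConsoleOutputCP(sketch_saved_cp);
#endif
}
```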

@take-cheeze
Contributor

BTW, tty of libuv treats const char* as UTF-8 on Windows.
libuv will convert it to wchar_t* internally.
And it makes ANSI escape code compatible on Windows too.

@mattn
Contributor Author

mattn commented Mar 22, 2014

@bjorndm

I read a bit here: http://stackoverflow.com/questions/1371012/how-do-i-print-utf-8-from-c-console-application-on-windows, and I found that just entering chcp 65001 will correctly display UTF-8 in the Windows console. So, I think we don't need to change mruby for this.

But, if it's really a problem, we could use SetConsoleOutputCP (http://msdn.microsoft.com/en-us/library/ms686013.aspx) to automatically set a utf-8 enabled code page on Windows.

That's not a good way to solve it. Changing the console codepage affects the console font, so the window will be resized. And if we don't have Unicode fonts, we can't display any UTF-8 strings. For example, I want to use mruby as a scripting language.

ls *.rb | xargs mruby

The console window will be resized for each file.

Regarding @nurse's comment:

This is what my patch does:

  • MultiByteToWideChar for converting utf-8 bytes into wide characters.
  • WideCharToMultiByte for converting wide characters into codepage bytes.
  • fwrite for writing codepage bytes.

You are saying that I can replace steps 2 and 3 above:

  • MultiByteToWideChar for converting utf-8 bytes into wide characters.
  • WriteConsoleW for writing wide characters.

But WriteConsoleW should only be used when the output handle is a console.

mruby foo.rb > log

In this case, output handle isn't console.

@matz what are your worries or questions?

@nurse
Contributor

nurse commented Mar 22, 2014

As @mattn says, if you use cp65001 you invite other issues.
With the Wide APIs you can bypass such locale-related issues.
You should use the Wide APIs when you talk to Windows.

But WriteConsoleW should only be used when the output handle is a console.

Good point, you can check it with _isatty( _fileno( stdout ) ).
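
Putting the check and the wide-character path together, as a sketch: `sketch_write_utf8` is a hypothetical helper, malloc/free are used only to keep the example self-contained, and the redirected case simply writes the bytes unchanged, where the patch under discussion would convert them to the code page first. Off Windows it is a plain fwrite.

```c
#include <stdio.h>
#include <stdlib.h>
#ifdef _WIN32
#include <io.h>
#include <windows.h>
#endif

/* Write `len` bytes of UTF-8 text to `fp`. Returns 0 on success, -1 on failure.
 * On Windows, a console target gets UTF-16 via WriteConsoleW; anything else
 * (file, pipe) falls through to fwrite. */
int sketch_write_utf8(FILE *fp, const char *utf8, int len)
{
#ifdef _WIN32
  if (len > 0 && _isatty(_fileno(fp))) {
    HANDLE h = (HANDLE)_get_osfhandle(_fileno(fp));
    int wlen = MultiByteToWideChar(CP_UTF8, 0, utf8, len, NULL, 0);
    wchar_t *wbuf;
    DWORD written = 0;
    BOOL ok;

    if (wlen == 0) return -1;
    wbuf = (wchar_t *)malloc((size_t)wlen * sizeof(wchar_t));
    if (wbuf == NULL) return -1;
    MultiByteToWideChar(CP_UTF8, 0, utf8, len, wbuf, wlen);

    fflush(fp);  /* keep ordering with earlier buffered stdio output */
    ok = WriteConsoleW(h, wbuf, (DWORD)wlen, &written, NULL);
    free(wbuf);
    return ok ? 0 : -1;
  }
#endif
  return fwrite(utf8, 1, (size_t)len, fp) == (size_t)len ? 0 : -1;
}
```

With this shape, `mruby foo.rb > log` takes the fwrite branch because the output handle is not a console.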

@mattn
Contributor Author

mattn commented Mar 24, 2014

@nurse do you mean that it should be put under #ifdef _WIN32?

@nurse
Contributor

nurse commented Mar 25, 2014

@mattn If mruby is to support Unicode on Windows, it should. But mruby has the option to split such a feature into mrbgems.

@bjorndm

bjorndm commented Mar 25, 2014

I like the idea of making an mrbgem for windows-specific code too.

@mattn
Contributor Author

mattn commented Mar 25, 2014

It's possible to implement mrb_p for Windows, but it's not possible to convert ARGV before scripts start running. So I didn't do it.

Add mrb_utf8_from_locale, mrb_utf8_free, mrb_locale_from_utf8, mrb_locale_free. Just works for windows.
@mattn
Contributor Author

mattn commented Sep 11, 2015

rebased.

matz added a commit that referenced this pull request Sep 11, 2015
Add mrb_utf8_from_locale, mrb_utf8_free, mrb_locale_from_utf8, mrb_locale_free
@matz matz merged commit eb9bec1 into mruby:master Sep 11, 2015
@mattn mattn deleted the locale branch September 11, 2015 02:53

6 participants