SL.str: char* vs wchar_t* vs std::string #829

Open
magol opened this issue Jan 16, 2017 · 15 comments
@magol

magol commented Jan 16, 2017

I cannot tell from the guide when I should use char* and when I should use std::string for parameters and return values.

Most of the examples use char*, but isn't that a little too imprecise?

And what about the difference between char* and wchar_t*?

@cubbimew
Member

cubbimew commented Jan 16, 2017

There's F.25: Use a zstring or a not_null<zstring> to designate a C-style string, which can be seen in action in many examples in the guidelines, such as F.22: Use T* or owner<T*> to designate a single object. It is also mentioned again in a couple of places, at least in R.2: In interfaces, use raw pointers to denote individual objects (only).

I'd say the guidance is

  • owning: std::string
  • non-owning:
    • used with C APIs? gsl::zstring/gsl::czstring
    • not used with C APIs? std::string_view/gsl::string_span
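
A minimal sketch of how the guidance above might look in code, assuming C++17 for std::string_view. The names Greeting, count_spaces and c_length are hypothetical, and czstring is declared locally as a stand-in for gsl::czstring so the snippet compiles without the GSL:

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>
#include <string_view>
#include <utility>

// Hypothetical stand-in for gsl::czstring, so the sketch needs no GSL headers.
using czstring = const char*;

// Owning: the object keeps its own copy of the text, so it stores a std::string.
class Greeting {
public:
    explicit Greeting(std::string text) : text_(std::move(text)) {}
    const std::string& text() const { return text_; }
private:
    std::string text_;
};

// Non-owning, not handed to a C API: a view is enough and no copy is made.
std::size_t count_spaces(std::string_view text) {
    std::size_t n = 0;
    for (char c : text) {
        if (c == ' ') ++n;
    }
    return n;
}

// Non-owning, forwarded to a C API that requires a zero-terminated string.
std::size_t c_length(czstring text) {
    assert(text != nullptr);  // the caller must pass a real C-style string
    return std::strlen(text);
}
```

A caller that owns a std::string can pass it to count_spaces directly, and can pass s.c_str() to c_length when a zero-terminated string is required.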

perhaps it should be made more explicit, such as by filling out the rule placeholder SL.str: String

As for guidelines for the use of wchar_t/char32_t/etc, that's a whole other discussion

@magol
Author

magol commented Jan 17, 2017

I'm not sure I understood that correctly. By "owning", do you mean that the called function is responsible for the parameter?
By "used with C APIs", do you mean that the function calls a C API, or that the data in the argument comes from a C API?
What is the best way to handle interoperability with code that uses CString?
Do the same rules apply when returning a string?
Do the same rules apply to in and out values?

| | Owned | Not owned, C API | Not owned, not C API |
|---|---|---|---|
| In | std::string | gsl::zstring, gsl::czstring | std::string_view, gsl::string_span |
| Out | | | |
| In/Out | | | |

A lot of information is missing from SL.str: String :-)

@AraHaan
Copy link

AraHaan commented Jan 29, 2017

As far as I know, std::wstring is a sequence of wchar_t, so if you want a const wchar_t* you can always call .c_str() on a std::wstring. As for the difference between a char* (a pointer to a character) and a wchar_t* (a pointer to a wide character): wchar_t is larger than char. And Wide characters are required when writing in languages like Chinese where all of their letters (or symbols) are wider than english letters. wchar_t can also be used for Unicode text. There is also a way to turn narrow string literals into wide ones for wstring. Here is an example:

#include <string>
#include <iostream>

int main() {
    // Does not compile: a narrow (char) literal cannot initialize a std::wstring.
    std::wstring data = "This is an wide string.\n";
    // Note: with wide strings you need std::wcout instead of std::cout
    std::wcout << data;
    return 0;
}

The example above does not compile, because std::wstring does not accept a narrow (char) string literal. Casting the literal to wchar_t* is not a fix either: the bytes would be reinterpreted rather than converted, so the output would not be the expected English text. Instead, prefix the literal with an uppercase L to make it a wide string literal.

#include <string>
#include <iostream>

int main() {
    std::wstring data = L"This is an wide string.\n";
    // Note: with wide strings you need std::wcout instead of std::cout
    std::wcout << data;
    return 0;
}

And now it should work. I hope this explains both the difference between wchar_t and char and the difference between std::string and std::wstring. I know many people who assume they are the same when one is actually larger than the other. There is a time and a place for wide strings, and as you can see, an uppercase L in front of an ordinary narrow literal makes it wide. I am not entirely sure whether English characters typed into std::wcin are widened in a way that keeps them readable as English rather than looking like Chinese or Japanese.

@jwakely
Contributor

jwakely commented Jan 29, 2017

If I understand correctly, you're asking for compilers to accept invalid code and magically transform an array of char to an array of wchar_t, which is not possible in general (because it would require assumptions about character sets and encodings). In any case, this is not a "guideline" that can be recommended as something for C++ programmers to follow.
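
To make the encoding assumption concrete, here is a rough sketch of an explicit narrow-to-wide conversion. The helper name widen_in_current_locale is hypothetical; it relies on std::mbsrtowcs, which interprets the bytes according to the current C locale, so the result depends on an assumption the caller has to get right:

```cpp
#include <cstddef>
#include <cwchar>
#include <stdexcept>
#include <string>

// Hypothetical helper: converts a narrow multibyte string to a wide string.
// ASSUMPTION: the bytes are encoded in the multibyte encoding of the current
// C locale (plain ASCII works in the default "C" locale). Conversion stops at
// the first embedded '\0'.
std::wstring widen_in_current_locale(const std::string& narrow) {
    std::mbstate_t state{};
    const char* src = narrow.c_str();

    // First pass: with a null destination, mbsrtowcs only reports the length.
    const std::size_t len = std::mbsrtowcs(nullptr, &src, 0, &state);
    if (len == static_cast<std::size_t>(-1)) {
        throw std::runtime_error("input is not valid in the current locale's encoding");
    }

    // Second pass: convert into a buffer of exactly that length.
    std::wstring wide(len, L'\0');
    state = std::mbstate_t{};
    src = narrow.c_str();
    std::mbsrtowcs(&wide[0], &src, len, &state);
    return wide;
}
```

The compiler cannot do this implicitly because it has no way of knowing which encoding the char data uses; here that choice is delegated to the locale and made visible in the function name.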

@jwakely jwakely closed this as completed Jan 29, 2017
@jwakely
Contributor

jwakely commented Jan 29, 2017

Oops, sorry, I closed this but meant to just add a reply to the previous comment. Reopening.

@jwakely jwakely reopened this Jan 29, 2017
@AraHaan

AraHaan commented Jan 29, 2017

Ok, updated the comment.

@cubbimew
Member

cubbimew commented Jan 30, 2017

And Wide characters are required when writing in languages like Chinese where all of their letters (or symbols) are wider than english letters.

not really, if you're on an OS that supports Unicode, such as Linux, this works just as well:

#include <string>
#include <iostream>
int main() {
    std::string data = "This is not a wide string, but it says 很高兴认识你.\n";
    std::cout << data; 
}

live demo http://melpon.org/wandbox/permlink/P0LUKyLzTs1xPyKu

.. but getting into that discussion would derail this thread thoroughly.

@MikeGitb

@cubbimew: Actually, this is not a question of Unicode support in general, but of UTF-8 encoding in particular.

@AraHaan

AraHaan commented Jan 30, 2017

@cubbimew Not all systems support Unicode (as you said). Windows is set to a code page by default, and with the default code page not all characters are supported, Japanese and Chinese characters among them (unless you change it). On some systems, however, the locale is set to UTF-8. The point is that not all programs are written to translate all of their text automatically based on whatever the code page or locale is set to, and sometimes that is just not practical for things like small console applications. Windows does support Unicode, but only if you set the code page to one that supports UTF-8 (if there is even a way to set it explicitly to UTF-8).

@MikeGitb

MikeGitb commented Jan 31, 2017

Please stop mixing up unicode and UTF-8. Unicode is a standard that uniquely maps all (?) known characters (actually code points) to numbers. UTF-8 (usually using char as the data type for individual code units) is one possible encoding of that; UTF-16 (using char16_t, or wchar_t on Windows) is a different one. Both examples (most likely) use Unicode.

The problem is that, afaik, Linux assumes by default that a char* points to a UTF-8 encoded string, whereas Windows assumes by default that it is some single-byte encoding like Latin-1, which can only encode a small subset. wchar_t is, afaik, always assumed to be a Unicode code unit on both platforms, but it has different sizes and encodings (2 byte, utf-16 encoding on windows, 4 byte, utf-32 encoding on linux).
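
The distinction between code units and code points can be sketched in a few lines. The function name count_code_points is hypothetical, and the code assumes its input is valid UTF-8 (it does no validation) and that C++17 is available for std::string_view:

```cpp
#include <cstddef>
#include <string_view>

// Counts Unicode code points in text that is ASSUMED to be valid UTF-8.
// Every code point starts with exactly one byte that is not a continuation
// byte, so it is enough to skip bytes of the form 10xxxxxx.
std::size_t count_code_points(std::string_view utf8) {
    std::size_t count = 0;
    for (unsigned char byte : utf8) {
        if ((byte & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}
```

For the two characters 很高 encoded as UTF-8, size() reports six char code units while count_code_points reports two code points, which is why a char* alone says nothing about how many characters it holds.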

@cubbimew
Member

cubbimew commented Jan 31, 2017

Please stop mixing up unicode and UTF-8.

I did not use a u8 string literal on purpose. That example would work as expected on any system that supports Unicode regardless of what transformation format it chose for the narrow multibyte encoding: UTF-8, GB18030, SCSU, whatever. That said, UTF-8 has been part of Unicode for over 20 years.

2 byte, utf-16 encoding on windows

It's UCS2 on Windows, obsolete as of 1996: L'\U0001F4A9' is a wchar_t on Windows with a meaningless value, and you can't read that from a UTF-8 file with std::codecvt_utf8. Yes, some (all?) WinAPIs treat wchar_ts as UTF-16 code units, but the language and the standard library do not (although you can trick stdout/cout into treating it that way with a non-standard API call). I'm not even mentioning lack of any Unicode locales in the CRT.
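
As a rough illustration of why U+1F4A9 does not fit in a single 2-byte unit, the sketch below encodes a code point as UTF-16 using char16_t. The helper name to_utf16 is hypothetical, and the input is assumed to be a valid Unicode scalar value:

```cpp
#include <string>

// Hypothetical helper: encodes one Unicode code point as UTF-16 code units.
// ASSUMPTION: cp is a valid scalar value (at most 0x10FFFF, not a surrogate).
std::u16string to_utf16(char32_t cp) {
    if (cp < 0x10000) {
        // Basic Multilingual Plane: one 16-bit code unit is enough.
        return std::u16string(1, static_cast<char16_t>(cp));
    }
    // Supplementary planes: split the 20-bit offset into a surrogate pair.
    const char32_t offset = cp - 0x10000;
    const char16_t high = static_cast<char16_t>(0xD800 + (offset >> 10));
    const char16_t low = static_cast<char16_t>(0xDC00 + (offset & 0x3FF));
    return std::u16string{high, low};
}
```

Under this scheme U+1F4A9 becomes the pair 0xD83D 0xDCA9, so any representation limited to one 16-bit value per character cannot hold it.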

char32_t could have saved the day, but LEWG voted against its use in iostreams, regexes, etc in 2006, in anticipation of a real Unicode library. Eleven years later.. here's hoping for C++20.

@magol magol changed the title from "char* vs wchar_t* vs std::string" to "SL.str: char* vs wchar_t* vs std::string" Jan 31, 2017
@AndrewPardoe
Contributor

Bjarne, per our meeting, please write this up during a minute slice of your infinite spare time.

@BjarneStroustrup
Contributor

I have made a start on the ASCII-string part of this

@magol
Author

magol commented Apr 16, 2017

@BjarneStroustrup
Great
If I may make a wish, I would like a list of all the common string types found in the wild, with a recommendation for what to do with each of them.
The program I work on uses a lot of MFC and the Windows API, and I want integration with that code to keep working even as I write modern code.

@magol
Author

magol commented Aug 27, 2017

I see that the ASCII part is there now, but what about the Unicode part?
What should I do with code that uses CString heavily? How should I go about modernizing it?


7 participants