Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Parse wchar_t args correctly #222

Open
Theodikes opened this issue Oct 15, 2022 · 17 comments
Open

Parse wchar_t args correctly #222

Theodikes opened this issue Oct 15, 2022 · 17 comments

Comments

@Theodikes
Copy link

Since users of my program can input non-ascii-encoded strings, it should be possible to parse arguments like this:
vector<wstring> args = program.get<vector<wstring>>("--some-arg");

With standard const char** args this of course doesn't work, I tried to use int wmain(int argc, const wchar_t** argv), but program.parse_args(argc, argv) doesn`t work with wchars.

@skrobinson
Copy link
Contributor

Can you preprocess the wmain parameters into a std::vector<std::string> and call the program.parse_args(const std::vector<std::string>) overload?

@Theodikes
Copy link
Author

Can you preprocess the wmain parameters into a std::vector<std::string> and call the program.parse_args(const std::vector<std::string>) overload?

I cannot preprocess "wmain" into vector of string, because if you convert specific symbols in wchar_t to standard string, you loss data. I need users to be able to pass the path to the output or input file as arguments, and file names are often in the user's language (Chinese, Ukrainian, etc.). But if I try convert user input to standard string and open file, I get filename like this: те�т.txt

@p-ranav
Copy link
Owner

p-ranav commented Oct 18, 2022

You could keep your arguments as std::wstring and convert them to std::string using codecvt wstring_convert converter.

The following works fine on Godbolt.

// clang-format off
#include <https://raw.githubusercontent.com/p-ranav/argparse/master/include/argparse/argparse.hpp>
// clang-format on

#include <iostream>
#include <vector>
#include <string>
#include <codecvt>

int main(int argc, char *argv[])
{
    std::wstring arg = L"Hello, 世界.txt";

    /// Convert to std::string
    using convert_type = std::codecvt_utf8<wchar_t>;
    std::wstring_convert<convert_type, wchar_t> converter;
    std::string arg_as_string = converter.to_bytes(arg);

    /// Create argument parser
    argparse::ArgumentParser program("program_name");
    program.add_argument("filename");

    try
    {
        /// Parse vector of strings
        program.parse_args(std::vector<std::string>{"./program", arg_as_string});
    }
    catch (const std::exception &err)
    {
        std::cerr << "argparse failed with: " << typeid(err).name() << " " << err.what() << "\n";
        std::cerr << program;
        return 1;
    }

    auto input = program.get("filename");
    std::cout << input << "\n"; /// Hello, 世界.txt
}

@Theodikes
Copy link
Author

You could keep your arguments as std::wstring and convert them to std::string using codecvt wstring_convert converter.

The following works fine on Godbolt. ```

You could keep your arguments as std::wstring and convert them to std::string using codecvt wstring_convert converter.

The following works fine on Godbolt.

// clang-format off
#include <https://raw.githubusercontent.com/p-ranav/argparse/master/include/argparse/argparse.hpp>
// clang-format on

#include <iostream>
#include <vector>
#include <string>
#include <codecvt>

int main(int argc, char *argv[])
{
    std::wstring arg = L"Hello, 世界.txt";

    /// Convert to std::string
    using convert_type = std::codecvt_utf8<wchar_t>;
    std::wstring_convert<convert_type, wchar_t> converter;
    std::string arg_as_string = converter.to_bytes(arg);

    /// Create argument parser
    argparse::ArgumentParser program("program_name");
    program.add_argument("filename");

    try
    {
        /// Parse vector of strings
        program.parse_args(std::vector<std::string>{"./program", arg_as_string});
    }
    catch (const std::exception &err)
    {
        std::cerr << "argparse failed with: " << typeid(err).name() << " " << err.what() << "\n";
        std::cerr << program;
        return 1;
    }

    auto input = program.get("filename");
    std::cout << input << "\n"; /// Hello, 世界.txt
}

I dont know. Maybe its working on Godbolt, because all locales and encodings are added there, but on my pc this program prints Hello, 世界.txt, and create file with a similar illegible name.

@skrobinson
Copy link
Contributor

@p-ranav Beat me to it. Yes, convert to UTF-8 for internal use and no codepoints should be mangled.

@Theodikes When sending UTF-8 strings back out to the consle or filesystem, you may need to convert to the user's locale.

@Theodikes
Copy link
Author

@p-ranav Beat me to it. Yes, convert to UTF-8 for internal use and no codepoints should be mangled.

@Theodikes When sending UTF-8 strings back out to the consle or filesystem, you may need to convert to the user's locale.

The problem is that the user can process files that are not in their locale. For example, I often see Ukrainian files, although I do not have this locale. In addition, how to correctly determine the locale of each user? For example, I use en-US Windows, but my native language is non-acsii. Isn't it easier to add support for wstring and wmain, which will immediately reduce all problems with locales to nothing?

@skrobinson
Copy link
Contributor

how to correctly determine the locale of each user?

Welcome to the world of multi-language programming. I don't mean that sarcastically, cross language coding is full of challenges.

Isn't it easier to add support for wstring and wmain, which will immediately reduce all problems with locales to nothing?

Only on Windows. wmain is MSVC-specific and not portable to other tool chains.

@Theodikes
Copy link
Author

how to correctly determine the locale of each user?

Welcome to the world of multi-language programming. I don't mean that sarcastically, cross language coding is full of challenges.

Yes, I know it's not easy, but why deliberately complicate your life when you can just use wstring?

Isn't it easier to add support for wstring and wmain, which will immediately reduce all problems with locales to nothing?

Only on Windows. wmain is MSVC-specific and not portable to other tool chains.

Yes, I know it. But my program is simple 300kb exe file which should work only under Windows...

@skrobinson
Copy link
Contributor

..why deliberately complicate your life...

Not everyone uses UTF-16 and std::wstring. Many platforms use UTF-8 and std::string. Both work, it's working across encodings that gets complicated.

But my program...should work only under Windows.

argparse tries to be cross platform.

Have you considered forking argparse and maintaining a set of changes that work for you? Or have you looked for another argument parsing library that fits more closely to your requirements?

@Theodikes
Copy link
Author

..why deliberately complicate your life...

Not everyone uses UTF-16 and std::wstring. Many platforms use UTF-8 and std::string. Both work, it's working across encodings that gets complicated.

I know it. But I was thinking about not changing to the detriment of cross-platform, but about adding a separate function for wmain parsing, for example, so that in addition to program.parse_args(argc, argv) for char argv, program.parse_wargs(argc, argv can be called if wmain is used and argv has wchar_t type

@skrobinson
Copy link
Contributor

I don't believe it is as simple as adding parse_wargs. argparse internally uses std::string, so an UTF-16 to UTF-8 translation would need to be done in parse_wargs. Which means argparse must be aware of the user's locale.

Would you be interested in templatizing argparse string use?

Every class, function, and statement that uses std::string or std::string_view would need to add template <typename StringType = std::string> (or sv equivalent). Then everywhere string is used would need conversion to StringType. After all this, I think argparse::ArgumentParser<std::wstring> program("program_name"); would do what you want and still allow backward compatibilty for UTF-8.

I don't know if this is workable or would be acceptable to @p-ranav.

@Theodikes
Copy link
Author

I think argparse::ArgumentParser<std::wstring> program("program_name"); would do what you want and still allow backward compatibilty for UTF-8.

No, it doesn`t work. Error class "ArgumentParser" may not have a template argument list
But if it worked (essentially replacing all strings and everything related to them with wstring) then this would be the perfect solution for me

@skrobinson
Copy link
Contributor

Could you push your changes to a fork? I'd like to see what is not working.

@p-ranav
Copy link
Owner

p-ranav commented Nov 3, 2022

I think argparse::ArgumentParser<std::wstring> program("program_name"); would do what you want and still allow backward compatibilty for UTF-8.

No, it doesn`t work. Error class "ArgumentParser" may not have a template argument list But if it worked (essentially replacing all strings and everything related to them with wstring) then this would be the perfect solution for me

Have you made any changes to the code? If not, it won't work as things are today.

After all this, I think argparse::ArgumentParserstd::wstring program("program_name"); would do what you want and still allow backward compatibilty for UTF-8.

As @skrobinson says, we'd have to templatize everything on the string type first. Then, you could use ArgumentParser<StringType>.

I'm not sure yet if all of the string algorithms we use are directly applicable to multi-byte character strings or if there are any assumptions based on std::string. Yes, the approach is acceptable though I worry that the effort might be non-trivial.

@Theodikes
Copy link
Author

I'm not sure yet if all of the string algorithms we use are directly applicable to multi-byte character strings or if there are any assumptions based on std::string. Yes, the approach is acceptable though I worry that the effort might be non-trivial.

Looked through the entire source code, changed everything to wstring / wchar_t / wstringstream and so on. I changed constant strings, functions, where necessary, but there was one problem that I don’t know how to solve: in the do_from_chars function, the call to from_chars, for which there is no multibyte analogue

@skrobinson
Copy link
Contributor

Here comes the possible non-trivial effort as, @p-ranav put it.

We have argparse::do_strtod and argparse::do_from_chars that seem to have overlapping functionality. It might be possible to adapt do_strtod to also handle integer types (i.e. std::strtol and family), remove do_from_chars, and use do_strtod (please rename) for all conversions. This would remove the use of std::from_chars.

@Theodikes
Copy link
Author

Theodikes commented Nov 5, 2022

Here comes the possible non-trivial effort as, @p-ranav put it.

We have argparse::do_strtod and argparse::do_from_chars that seem to have overlapping functionality. It might be possible to adapt do_strtod to also handle integer types (i.e. std::strtol and family), remove do_from_chars, and use do_strtod (please rename) for all conversions. This would remove the use of std::from_chars.

Yes, I have already doing this other, simpler way (as I think). I delete do_from_chars and all generic_strtod templates, changed name of do_strtod function to do_strtonum and replace generic_strtod call to this code:

if (is_floating_point<T>::value) return std::stold(first);
else if (is_unsigned<T>::value) return std::stoull(first, nullptr, radix);
else return std::stoll(first, nullptr, radix);

It handles any number correctly, including scientific notation, octal and hexadecimal.

But then for more than four hours I was looking for a problem - the program incorrectly handled negative numeric arguments.
It turned out that in function is_decimal_literal, in the exponent_part_opt label, the end of the string was checked (eof = -1), while wstring has eof = 65535.

Now all works, maybe later I will add compatibility with the current version so that the library can work with both wmain and main, and add tests, but for now I just made a draft working version (fork).

Issue solved, thanks for help.

@Theodikes Theodikes reopened this Nov 5, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants