Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

paths as std::string is not safe or cross-platform -- e.g. fails on Windows with all non-ascii paths #352

Closed
diablodale opened this issue Jan 26, 2022 · 9 comments

Comments

@diablodale
Copy link
Contributor

All non-ascii paths on OS which are not natively UTF-8 are doomed to fail.
For example, c:/MeinDeutschPäth is likely to fail on Windows due to the umlaut.
Non-latin characters, multibyte glyphs, etc. will all fail on operating systems that do not natively default to utf-8.

One example at

void NeuralNetwork::setBlobPath(const std::string& path) {
auto asset = assetManager.set("__blob", path);

Likely the cause of issue #294 because Chinese glyphs are definitely not the limited ascii.

Setup

  • all

Repro

Found while writing my app and then code review. A std::string is not a utf-8 string. And it will fail on Windows for any path that is non-ascii. The best fix is to use <filesystem> which is available in C++17, or <experimental::filesystem> in most C++14 toolchains. The STL library will manage going to/from utf-8 with clearly defined params/funcs. For example, it will manage the differences between native utf-8 on Linux vs. wchar_t on Windows.

If one doesn't use the STL, then one needs to write supporting functions for Windows (perhaps others) to convert between Windows wchar_t UTF-16 and what appears to be the desire of AssetManager to have a utf-8 path. There are clear apis in the STL to manually convert between UTF-16 and UTF-8.

I see many locations throughout depthai-core which have this bug. For example AssetManager::set(const std::string& key, const std::string& path, ...
Also locations using pointers to char like const char* pathToBootloader which are insufficient for native UTF-16 usage on Windows. They will have the same path bugs

Fix

It is not hard to use the STL cvt approaches rather than "filesystem". Pseudocode...

std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> utf16convert;
std::wstring widePath = utf16convert.from_bytes(myutf8path);
std::wifstream stream(widePath, std::ios::in | std::ios::binary);

Workarounds

  • There is no code workaround. The conversions must be done.
  • A user could move the blobs, mesh, calibration data, and all other data defined via a path... to a path that has only ascii characters. This is quite restrictive, but it is a workaround.
  • The Windows OS setting to use utf-8 suggested in The model path exists in Chinese, depthai_demo.py cannot run error #294 is not suggested and I discourage that approach due to the high risk compatibility issues it introduces across all apps and all os features.
@themarpe
Copy link
Collaborator

Would the following work well, while we modify the conversion depending on the platform?

std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> utf16convert;
std::wstring widePath = utf16convert.from_bytes(myutf8path);
std::wifstream stream(widePath, std::ios::in | std::ios::binary);

And we can keep the paths saved in char arrays / std::string, and specify that those have to be UTF-8, but convert to wide upon calls to various FS related calls (eg. for Wins)?

Or we go with std::filesystem - if Windows STL support is there for it?

I think both options would work well also from perspective of Python bindings, as the paths are UTF-8 there platform independent - so calls to the bound functions taking std::string would work as expected.

@diablodale
Copy link
Contributor Author

MSVC's STL fully supports <filesystem> when the C++17 switch is used. Now as I type that, I realize that you want to target c++14. And there is no promise of <filesystem> being compatible with <experimental::filesystem> . https://docs.microsoft.com/en-us/cpp/standard-library/filesystem?view=msvc-170

Given a c++14 approach, I think using codecvt with clear-as-possible APIs/params that declare they must be utf-8.

  • Yes, you can store in depthai-owned variables as a std::string. The ambiguity is an ongoing issue that still hasn't been fully solved in C++20.
  • This approach will cause double-convert on Windows where everything is wchar_t and UTF-16...even if using filesystem::path it is internally stored as wchar_t.
  • I think most of the depthai-core code that is affected is not performance-oriented. If it is, then could have overloaded functions to accept both wstring and string. Or SFINAE with templates. Or use boost::filesystem
  • I ?think? there was a c++ errata update to the standard that overloaded the std::ifstream, ofstream, etc. so it will accept a CharT rather than only char *. The tests+examples will easily surface any problem.
  • Probably useful to make a macro or function that behaves differently on non-utf8 operating systems. And then use that everywhere.
  • Also could be fancy and make dai::path class which has constructors for string, wstring, and path. And make all the depthai-core apis accept only dai::path. Then a caller can pass string, wstring, or path and c++ will implicitely convert it to a dai::path. I think it would even implicitly convert char* and wchar*. And then you can always get what is needed like std::ifstream stream(daiPath.native(), std::ios::in | std::ios::binary);
// main app code
// apps before c++17 will have to use native OS apis or codecvt to convert from wchar_t -> utf-8
// app with c++17 have `filesystem` so can use filesystem::path and its utf8 handling
neuralnet->setBlobPath(modelPath.generic_u8string()); // convert #1: wchar_t -> utf-8 stored in an std::string

// setBlobPath(const std::string &utf8Path) calls assetManager.set("__blob", path)

std::shared_ptr<dai::Asset> AssetManager::set(const std::string& key, const std::string& path, int alignment) {
    #ifdef WINDOWS
      std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> utf16convert;
      std::wstring widePath = utf16convert.from_bytes(path); // convert #2: utf-8 -> wchar_t
      std::ifstream stream(widePath, std::ios::in | std::ios::binary);
    #else
      // Load binary file at path
      std::ifstream stream(path, std::ios::in | std::ios::binary);
    #endif
    if(!stream.is_open()) {
    ...

@diablodale
Copy link
Contributor Author

...also be careful/test text output of this data. For example, on Windows a printf("%s", myutf8string.cstr()) will fail to display correctly. A windows hack is to use the %S with a wchar like `printf("%S" myutf8string.getwchar()). This makes me think the fancy class might be useful so that you can use it everywhere in debugs, logs, file apis, etc. 🤔

@diablodale
Copy link
Contributor Author

I've got some bandwidth again. If we can align on approach, I can take on this work.

I lean towards my bullet 3: "I think most of the depthai-core code that is affected is not performance-oriented. If it is, then could have overloaded functions to accept both wstring and string. Or SFINAE with templates. Or use boost::filesystem"

  1. I suspect I can get C++ templates to work for string, wstring, and path. Are templates ok with your python binding approach?
  2. boost::filesystem is well tested and crossplatform. It will need something in your python bindings so python knows it can send plain strings to it. https://stackoverflow.com/a/56010504
  3. Duplicate code 😝 and make sibling functions everywhere for string and wstring

@diablodale
Copy link
Contributor Author

Did some reasearch. I think the approach could be easy and flexible

change one line of the binding in depthai-python

        .def("setBlobPath", [](NeuralNetwork& nn, py::object obj){
            // Allows to call this function with paths as well as strings
            nn.setBlobPath<std::string>(py::str(obj));  // only need to add the explicit template param
        }, py::arg("path"), DOC(dai, node, NeuralNetwork, setBlobPath))

then in depthai-core, change the signature to be a template.

template<typename T>
void NeuralNetwork::setBlobPath(const T& path) {
    auto asset = assetManager.set("__blob", path);
    ...
}

template<typename T>
std::shared_ptr<dai::Asset> AssetManager::set(const std::string& key, const T& path, int alignment) {
    // Load binary file at path
    std::ifstream<T> stream(path, std::ios::in | std::ios::binary);

Something very similar to that should work with std::string and filesystem::path. I may need to specialize T=wstring w/ codecvt for those SDK users that are only C++14, on windows, and need the widechar support. Easy code like I wrote above.

@themarpe
Copy link
Collaborator

Thanks @diablodale and sorry for the delay

Your latest suggestion seems good to me - so we'd assume UTF-8 for std::string and let users use wstring for Windows (or filesystem::path)

Also bindings using std::string template is 👌 as string there are UTF-8. https://pybind11.readthedocs.io/en/stable/advanced/cast/strings.html#passing-python-strings-to-c

may need to specialize T=wstring w/ codecvt for those SDK users that are only C++14, on windows, and need the widechar support

What would need to happen in this case? Wouldn't eg. ifstream taking wstring do the right thing underneath?

Also we can specialize (or declare?) for this 3 cases (std::string, std::wstring & std::filesystem::path) if we don't want to move function implementation into the header

@diablodale
Copy link
Contributor Author

Only Microsoft's STL implementation supports ifstream(wstring path).
https://github.com/microsoft/STL/blob/33007ac75485ec3d465ab482112aba270a581725/stl/inc/fstream#L830
and naturally MSVC uses that implementation.

Other compilers for Windows probably don't use Microsoft's STL and runtime. Therefore, probably don't support wchar for path. I suspect this is a very small percentage of your users. I suggest not supporting this scenario.
The pseudo-unix things like ming claim to be posix/unix so it is their responsibility to handle utf-8 correctly in their lib/runtimes.

When a Windows user calls, e.g., ::setBlobPath(string& path)
By our conversation above, that signature expects path to be UTF-8.
But on Windows passing that string directly to ifstream(string) will fail (like today) for any path that is non-ASCII.
This could be a common coding error in the simplest code because even the forthcoming docs on this will declare ::setBlobPath(string& path) where path is utf-8.
On windows, what would be needed is to use the STL to convert the string of UTF-8 into a wstring of UTF-16 and then call ifstream(wstring).

I'll start work on PR and explore how to do this. I hear that you want to limit code in the headers.
There are 14 places with ifstream() and 3 ofstream(). I want an approach where I am not writing 3x17 overload functions.
So I'm going to lean towards templates, or a dai::path class, or dia::ifstream/ofstream. Something I can write once and then can be used everywhere.

diablodale added a commit to diablodale/depthai-core that referenced this issue Feb 16, 2022
diablodale added a commit to diablodale/depthai-core that referenced this issue Feb 22, 2022
@diablodale
Copy link
Contributor Author

@themarpe found a subissue while applying this PR broadly in depthai-core...
...depthai paths are also stored in environment variables.

I believe two PRs might be a good approach. PR #384 is the first that enables Unicode paths in API signatures, if/ofstream, and spdlog. A second later PR will enable Unicode from environment variables...which naturally needs the first PR's fixes.

The os manages the storage of the different char sets (ascii, utf-8, utf-16, etc.). When an app needs the env var value, it calls an os API. On Windows...it needs to call the UTF-16 wide character APIs for environment variables to get Unicode text.

A Windows customer in a country/language who's glyphs do not fit in ASCII, e.g. Japan, China, UAE, Germany, etc. stores a filename or path, using their native language, into env vars that depthai reads.
depthai calls the API to get the env var.

// runtime fail: underlying utility calls ascii OS api, Windows provides best-fit value, the path is corrupt/unusable
auto fwBinaryPath = utility::getEnv("DEPTHAI_DEVICE_BINARY");

// compile fail: underlying utility calls the ascii OS api which returns a std::string, it should return a Unicode std::wstring
std::wstring fwBinaryPath = utility::getEnv("DEPTHAI_DEVICE_BINARY");

I believe the utility::getEnv() call chain should always return wstring on Windows. Making the caller responsible for managing any unicode/ascii/number convert. After the getEnv() is fixed, then this is possible...

// works: the getEnv() code has been updated to return wstring
auto fwBinaryPath = utility::getEnv("DEPTHAI_DEVICE_BINARY");

// works: the getEnv() code has been updated to return wstring and dai::Path can be directly constructed with it
dai::Path fwBinaryPath = utility::getEnv("DEPTHAI_DEVICE_BINARY");

There ?may? be some cascading work to ensure things that parse the env var wstring are not assuming char. Templates and/or the typical MSVC overloads that can accept wchar in functions can likely address these.

Quick look at spdlog's code, they are partially aware of this issue. I see macro defines like SPDLOG_WCHAR_TO_UTF8_SUPPORT, SPDLOG_WCHAR_FILENAMES but I don't see they dealt with Windows for the env vars.

diablodale added a commit to diablodale/depthai-core that referenced this issue Mar 4, 2022
@themarpe
Copy link
Collaborator

themarpe commented Mar 7, 2022

Thanks @diablodale
I'd say that env vars are for starter off limits. Specifying the device binary manually is mostly an debugging help. But once support is added I agree it should be extended for that as well for it to be complete

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants