New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
<filesystem>: prevent filesystem::path dangerous conversions to/from default code page encoding #909
Comments
If you want UTF-8, simply use function If you want UTF-8 with |
@Berrysoft I understand that we may use the u8path conversion. However developers will forget and often we will find that out after the release, when someone will report that some file operation somewhere does not work with localized paths. |
C++ is not a language that will avoid developers' mistake, and also it should not limit the developers' behavior. Yes, if the standard library uses UTF-8 everywhere, the developers may face less problems, but they may also come across many other problems. Maybe there's a program which should read from a text file in the same code page with the system, and thus the program could not use UTF-8 all the time, because the user won't care about the encoding of input. In addition, STL is based on CRT of Windows, in which all functions uses the current code page, and thus all the functions of CRT and STL assume that |
How exactly does Boost.Filesystem prevent such conversions? According to my understanding, it behaves the same as For an error, this is a request for a non-Standard extension, which is explicitly a non-goal of our project. For a warning, this is a request to warn about Standard-conforming usage, which we don't warn about as a general rule. (We emit deprecation warnings for ISO-deprecated features, and non-Standard features that we're trying to get rid of, but otherwise we don't emit warnings from the library for Standard usage. One reason is that we don't have a mechanism for emitting such warnings other than deprecation; another is that the library has a limited local view of the program, instead of the more global view that the compiler and/or code analysis enjoy.) If you want to ban usage of I'm inclined to close this as |
As far as I know, there's a method |
|
Stephan, I really appreciate you guys are listening.
As @Berrysoft noted, we call imbue to get utf8 behavior on boost::filesystem::path. I understand that it leads to costly conversions between utf8 and wide characters on Windows, but for our purpose the costly conversion is much cheaper than bug squashing of localization issues.
I am an application programmer and a computational geometer by a trade. I am more a hands on person than a purist. Being a Linux militant evangelist during my studies, then developing Windows applications for 15 years, then multi-platform applications for another 4 years, I was burned by internationalization issues many times. Frankly I cannot imagine using a local code page API for anything else than a quick prototype. For anything else using a local code page API on Windows is IMHO almost always an error even for an in-house product in this connected world: Somebody somewhere will have her/his Windows set to one language and type file names and paths in another language.
Complete code review is something that you guys may afford at Microsoft, but a small software house cannot. Also it is too easy to overlook these kind of bugs. We need an automatic solution.
Again, your statement is the purist point of view. In the real world, if the default conversions do what you almost always don't want to do and these default conversions are hidden behind innocently looking methods, it is an invitation to errors. At the current state of affairs, I can only see myself using std::filesystem on Linux only: Mine field on Windows, and the default OSX compiler links against stdc++ library, which is shipped with the system, thus starting to support std::filesystem with the latest OSX Catalina. If I had to apply std::filesystem to our multi-platform Windows project which uses UTF-8 internally, I would most likely have to hack your stdlib source code to emit errors on local code page conversions, at least on the build server. Vojtech |
Well, if you don't need other code pages, don't want to change or review codes, and need Or do another hack: write your own |
write |
Thanks for the hint. Does it work on Windows 7 as well? I am not sure how the other libraries would handle it though. |
I tested. It works on Windows 7 #include <stdio.h>
#include <locale>
#include <filesystem>
using namespace std::filesystem;
int main()
{
setlocale(LC_ALL, ".UTF-8");
path p{ "中文" };
printf(p.u8string().c_str());
wprintf(p.c_str());
return 0;
} output
|
We talked about this, and we'll review a PR to add such a warning, off-by-default, through the |
Agree with OP. A warning (and optionally error) would be nice. Can confirm Have also found Use UTF-8 code pages in Windows apps which also makes |
Replacing boost::filesystem::fstream with boost::nowide::fstream variants with the unfortunate cost of string path conversion on Windows from 16 bits to UTF8 and back to 16 bits. Unfortunately we cannot use std::filesystem yet as it is missing on older MACs and because the interface is crooked minefield on Windows see microsoft/STL#909
On Windows, add a new "utf8.manifest" application manifest to Orbit, OrbitService and all tests so that we force UTF-8 as the application code page. With this change, the "A" variants of Windows APIs now support UTF-8, which is great because we don't have to change API calls to their "W" variant, saving a UTF-8 to UTF-16 string conversion in user code. This also makes std::filesystem::path accept UTF-8 strings by default (withouth explicit std::filesystem::u8path() call), and the "std::filesystem::path::string()" also now outputs proper UTF-8 strings out of the box, whitout calling "filesystem::path::u8string()". References: - microsoft/STL#909 - https://docs.microsoft.com/en-us/windows/apps/design/globalizing/use-utf8-code-page Tests: Newly added "UnicodeFileExists" in FileTest.cpp works out of the box without modification to our "FileExists" function. This means that "filesystem::path::exists()" can also be called directly with UTF-8 strings.
On Windows, add a new "utf8.manifest" application manifest to Orbit, OrbitService and all tests so that we force UTF-8 as the application code page. With this change, the "A" variants of Windows APIs now support UTF-8, which is great because we don't have to change API calls to their "W" variant, saving a UTF-8 to UTF-16 string conversion in user code. This also makes std::filesystem::path accept UTF-8 strings by default (withouth explicit std::filesystem::u8path() call), and the "std::filesystem::path::string()" also now outputs proper UTF-8 strings out of the box, whitout calling "filesystem::path::u8string()". References: - microsoft/STL#909 - https://docs.microsoft.com/en-us/windows/apps/design/globalizing/use-utf8-code-page - https://cmake.org/cmake/help/v3.4/release/3.4.html#other From CMake's doc: "CMake learned to honor *.manifest source files with MSVC tools. Manifest files named as sources of .exe and .dll targets will be merged with linker-generated manifests and embedded in the binary." Tests: Newly added "UnicodeFileExists" in FileTest.cpp works out of the box without modification to our "FileExists" function. This means that "filesystem::path::exists()" can also be called directly with UTF-8 strings.
On Windows, add a new "utf8.manifest" application manifest to Orbit, OrbitService and all tests so that we force UTF-8 as the application code page. With this change, the "A" variants of Windows APIs now support UTF-8, which is great because we don't have to change API calls to their "W" variant, saving a UTF-8 to UTF-16 string conversion in user code. This also makes std::filesystem::path accept UTF-8 strings by default (withouth explicit std::filesystem::u8path() call), and the "std::filesystem::path::string()" also now outputs proper UTF-8 strings out of the box, whitout calling "filesystem::path::u8string()". References: - microsoft/STL#909 - https://docs.microsoft.com/en-us/windows/apps/design/globalizing/use-utf8-code-page - https://cmake.org/cmake/help/v3.4/release/3.4.html#other From CMake's doc: "CMake learned to honor *.manifest source files with MSVC tools. Manifest files named as sources of .exe and .dll targets will be merged with linker-generated manifests and embedded in the binary." Tests: Newly added "UnicodeFileExists" in FileTest.cpp works out of the box without modification to our "FileExists" function. This means that "filesystem::path::exists()" can also be called directly with UTF-8 strings.
On Windows, add a new "utf8.manifest" application manifest to Orbit, OrbitService and all tests so that we force UTF-8 as the application code page. With this change, the "A" variants of Windows APIs now support UTF-8, which is great because we don't have to change API calls to their "W" variant, saving a UTF-8 to UTF-16 string conversion in user code. This also makes std::filesystem::path accept UTF-8 strings by default (withouth explicit std::filesystem::u8path() call), and the "std::filesystem::path::string()" also now outputs proper UTF-8 strings out of the box, whitout calling "filesystem::path::u8string()". References: - microsoft/STL#909 - https://docs.microsoft.com/en-us/windows/apps/design/globalizing/use-utf8-code-page - https://cmake.org/cmake/help/v3.4/release/3.4.html#other From CMake's doc: "CMake learned to honor *.manifest source files with MSVC tools. Manifest files named as sources of .exe and .dll targets will be merged with linker-generated manifests and embedded in the binary." Tests: Newly added "UnicodeFileExists" in FileTest.cpp works out of the box without modification to our "FileExists" function. This means that "filesystem::path::exists()" can also be called directly with UTF-8 strings.
On Windows, add a new "utf8.manifest" application manifest to Orbit, OrbitService and all tests so that we force UTF-8 as the application code page. With this change, the "A" variants of Windows APIs now support UTF-8, which is great because we don't have to change API calls to their "W" variant, saving a UTF-8 to UTF-16 string conversion in user code. This also makes std::filesystem::path accept UTF-8 strings by default (withouth explicit std::filesystem::u8path() call), and the "std::filesystem::path::string()" also now outputs proper UTF-8 strings out of the box, whitout calling "filesystem::path::u8string()". References: - microsoft/STL#909 - https://docs.microsoft.com/en-us/windows/apps/design/globalizing/use-utf8-code-page - https://cmake.org/cmake/help/v3.4/release/3.4.html#other From CMake's doc: "CMake learned to honor *.manifest source files with MSVC tools. Manifest files named as sources of .exe and .dll targets will be merged with linker-generated manifests and embedded in the binary." Tests: Newly added "UnicodeFileExists" in FileTest.cpp works out of the box without modification to our "FileExists" function. This means that "filesystem::path::exists()" can also be called directly with UTF-8 strings.
On Windows, add a new "utf8.manifest" application manifest to Orbit, OrbitService and all tests so that we force UTF-8 as the application code page. With this change, the "A" variants of Windows APIs now support UTF-8, which is great because we don't have to change API calls to their "W" variant, saving a UTF-8 to UTF-16 string conversion in user code. This also makes std::filesystem::path accept UTF-8 strings by default (withouth explicit std::filesystem::u8path() call), and the "std::filesystem::path::string()" also now outputs proper UTF-8 strings out of the box, whitout calling "filesystem::path::u8string()". References: - microsoft/STL#909 - https://docs.microsoft.com/en-us/windows/apps/design/globalizing/use-utf8-code-page - https://cmake.org/cmake/help/v3.4/release/3.4.html#other From CMake's doc: "CMake learned to honor *.manifest source files with MSVC tools. Manifest files named as sources of .exe and .dll targets will be merged with linker-generated manifests and embedded in the binary." Tests: Newly added "UnicodeFileExists" in FileTest.cpp works out of the box without modification to our "FileExists" function. This means that "filesystem::path::exists()" can also be called directly with UTF-8 strings.
On Windows, add a new "utf8.manifest" application manifest to Orbit, OrbitService and all tests so that we force UTF-8 as the application code page. With this change, the "A" variants of Windows APIs now support UTF-8, which is great because we don't have to change API calls to their "W" variant, saving a UTF-8 to UTF-16 string conversion in user code. This also makes std::filesystem::path accept UTF-8 strings by default (withouth explicit std::filesystem::u8path() call), and the "std::filesystem::path::string()" also now outputs proper UTF-8 strings out of the box, whitout calling "filesystem::path::u8string()". References: - microsoft/STL#909 - https://docs.microsoft.com/en-us/windows/apps/design/globalizing/use-utf8-code-page - https://cmake.org/cmake/help/v3.4/release/3.4.html#other From CMake's doc: "CMake learned to honor *.manifest source files with MSVC tools. Manifest files named as sources of .exe and .dll targets will be merged with linker-generated manifests and embedded in the binary." Tests: Newly added "UnicodeFileExists" in FileTest.cpp works out of the box without modification to our "FileExists" function. This means that "filesystem::path::exists()" can also be called directly with UTF-8 strings.
On Windows, add a new "utf8.manifest" application manifest to Orbit, OrbitService and all tests so that we force UTF-8 as the application code page. With this change, the "A" variants of Windows APIs now support UTF-8, which is great because we don't have to change API calls to their "W" variant, saving a UTF-8 to UTF-16 string conversion in user code. This also makes std::filesystem::path accept UTF-8 strings by default (withouth explicit std::filesystem::u8path() call), and the "std::filesystem::path::string()" also now outputs proper UTF-8 strings out of the box, whitout calling "filesystem::path::u8string()". References: - microsoft/STL#909 - https://docs.microsoft.com/en-us/windows/apps/design/globalizing/use-utf8-code-page - https://cmake.org/cmake/help/v3.4/release/3.4.html#other From CMake's doc: "CMake learned to honor *.manifest source files with MSVC tools. Manifest files named as sources of .exe and .dll targets will be merged with linker-generated manifests and embedded in the binary." Tests: Newly added "UnicodeFileExists" in FileTest.cpp works out of the box without modification to our "FileExists" function. This means that "filesystem::path::exists()" can also be called directly with UTF-8 strings.
On Windows, add a new "utf8.manifest" application manifest to Orbit, OrbitService and all tests so that we force UTF-8 as the application code page. With this change, the "A" variants of Windows APIs now support UTF-8, which is great because we don't have to change API calls to their "W" variant, saving a UTF-8 to UTF-16 string conversion in user code. This also makes std::filesystem::path accept UTF-8 strings by default (withouth explicit std::filesystem::u8path() call), and the "std::filesystem::path::string()" also now outputs proper UTF-8 strings out of the box, whitout calling "filesystem::path::u8string()". References: - microsoft/STL#909 - https://docs.microsoft.com/en-us/windows/apps/design/globalizing/use-utf8-code-page - https://cmake.org/cmake/help/v3.4/release/3.4.html#other From CMake's doc: "CMake learned to honor *.manifest source files with MSVC tools. Manifest files named as sources of .exe and .dll targets will be merged with linker-generated manifests and embedded in the binary." Tests: Newly added "UnicodeFileExists" in FileTest.cpp works out of the box without modification to our "FileExists" function. This means that "filesystem::path::exists()" can also be called directly with UTF-8 strings.
Replacing boost::filesystem::fstream with boost::nowide::fstream variants with the unfortunate cost of string path conversion on Windows from 16 bits to UTF8 and back to 16 bits. Unfortunately we cannot use std::filesystem yet as it is missing on older MACs and because the interface is crooked minefield on Windows see microsoft/STL#909
On Windows, add a new "utf8.manifest" application manifest to Orbit, OrbitService and all tests so that we force UTF-8 as the application code page. With this change, the "A" variants of Windows APIs now support UTF-8, which is great because we don't have to change API calls to their "W" variant, saving a UTF-8 to UTF-16 string conversion in user code. This also makes std::filesystem::path accept UTF-8 strings by default (withouth explicit std::filesystem::u8path() call), and the "std::filesystem::path::string()" also now outputs proper UTF-8 strings out of the box, whitout calling "filesystem::path::u8string()". References: - microsoft/STL#909 - https://docs.microsoft.com/en-us/windows/apps/design/globalizing/use-utf8-code-page - https://cmake.org/cmake/help/v3.4/release/3.4.html#other From CMake's doc: "CMake learned to honor *.manifest source files with MSVC tools. Manifest files named as sources of .exe and .dll targets will be merged with linker-generated manifests and embedded in the binary." Tests: Newly added "UnicodeFileExists" in FileTest.cpp works out of the box without modification to our "FileExists" function. This means that "filesystem::path::exists()" can also be called directly with UTF-8 strings.
I am wondering how I can turn this one on because I don't see anything related to this inside the code. Besides that, it is still weird, that a valid std::filesystem::path can throw when calling string(), which should be one of the most innocent functions. |
Generally, allocation is needed for copying the internally stored string into the return value, so we can't avoid potential throwing.
Perhaps it's suitable to conditionally deprecate these constructors: Lines 648 to 674 in 8eb3a06
The mechanism for conditional deprecation can be found in #634. |
What is the state on the planned compiler warning that was mentioned before? We are also seeing this problem on a large multi-platform code base and it is a showstopper for the migration from Boost.Filesystem to the Standard version. It is impossible to manually track down all character literals that may end up as std::filesystem::path instances. This implicit conversion is guaranteed to introduce bugs that are hard to track down. A compiler warning would really be a powerful mitigation of the problem. |
Hello.
I would like to ask for a way to disable implicit conversions of const char* to std::filesystem::path using the default 8bit code page encoding and vice versa. If we had to switch from boost::filesystem to std::filesystem, these innocently looking conversions would be a major mine field to us. Indeed, we have had the same problem for years in our product using the C++ multi platform library wxWidgets, where the wxString class behaves the same way on Windows as now the std::filesystem::path in regard to default conversion to/from const char*, and the team members have to be mindful to not trigger these harmful conversions.
I understand the decision to use 16bit characters on Windows for performance reasons. What I do not understand is the decision to convert to / from a local code page by default in the year 2020. It would be great if one had a way to switch the default conversion to UTF-8, either in runtime or by a preprocessor macro. AFAIK the only way to convince std::filesystem to convert to/from UTF-8 by default is to switch the whole Windows 10 to an experimental UTF-8 code page, which is not an option for a production code really, which shall run on Windows 7 and older Windows 10.
I also understand that it is difficult to change the std::filesystem API now as it has already been standardized. I would therefore like to ask for some way to trigger compiler warnings or errors when these default from / to local code page conversions are triggered. Without that we would most likely not switch from boost::filesystem to std::filesystem, as such move would be too risky.
Thank you for consideration,
Vojtech Bubnik, Prusa Research
The text was updated successfully, but these errors were encountered: