to_json(std::filesystem::path) can create invalid UTF-8 chars on windows #4271

Open
2 tasks done
MHebes opened this issue Jan 19, 2024 · 2 comments

MHebes commented Jan 19, 2024

Description

This conversion function:

https://github.com/nlohmann/json/blob/7efe875495a3ed7d805ddbb01af0c7725f50c88b/include/nlohmann/detail/conversions/to_json.hpp#L416C1-L420C2

template<typename BasicJsonType>
inline void to_json(BasicJsonType& j, const std_fs::path& p)
{
    j = p.string();
}

uses p.string(), which on Windows returns the path in the native narrow encoding rather than UTF-8 (at least in some cases). Calling dump() on the resulting JSON then throws an "invalid UTF-8 byte" exception.

Reproduction steps

Convert a std::filesystem::path that contains a Unicode "Right Single Quotation Mark" character (U+2019) to a json, either implicitly or with to_json.

Inspect the bytes of the resulting json string (string_t), either by calling dump() or by converting to BSON.

Expected vs. actual results

Expected: "Strings are stored in UTF-8 encoding." per https://json.nlohmann.me/api/basic_json/string_t/

Actual: The string gets converted by std::filesystem::path::string(), which appears to convert it to Windows-1252 encoding. Its bytes end up as \x92 rather than \xe2\x80\x99.
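
To make the expected bytes concrete, here is a minimal sketch of a UTF-8 encoder for a single code point. The helper name `encode_utf8` is hypothetical and is not part of nlohmann/json or the standard library; it only illustrates why U+2019 should appear as the three bytes \xe2\x80\x99 rather than the single Windows-1252 byte \x92.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper (not part of nlohmann/json): encodes one Unicode
// scalar value as UTF-8. Returns an empty string for surrogates
// (U+D800..U+DFFF) and for values above U+10FFFF.
inline std::string encode_utf8(std::uint32_t cp) {
    std::string out;
    if (cp < 0x80) {
        // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF) {
            return out;  // surrogates are not scalar values
        }
        // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0x10FFFF) {
        // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

With this sketch, `encode_utf8(0x2019)` yields the bytes E2 80 99, which is what a UTF-8 string_t would be expected to hold for this path.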

Minimal code example

#include <filesystem>
#include <iostream>
#include <nlohmann/json.hpp>

int main() {
  try {
    wchar_t wide_unicode_right_quote[2] = {0x2019, 0};  // came from a directory_iterator in reality
    nlohmann::json apost = std::filesystem::path(wide_unicode_right_quote);
    std::cout << apost << std::endl;
    return 0;
  } catch (const std::exception& e) {
    std::cerr << e.what() << std::endl;
    return 1;
  }
}

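The workaround I'm using is to call WideCharToMultiByte on .native() to get the string in UTF-8 before passing it to nlohmann:

#include <windows.h>

#include <filesystem>
#include <iostream>
#include <string>
#include <string_view>
#include <nlohmann/json.hpp>

// Converts a UTF-16 wide string to UTF-8 using the Win32 API.
inline std::string Narrow(std::wstring_view wstr) {
  if (wstr.empty()) return {};
  const int wlen = static_cast<int>(wstr.size());
  // First call computes the required size in bytes, second call writes the output.
  int len = ::WideCharToMultiByte(CP_UTF8, 0, wstr.data(), wlen, nullptr, 0, nullptr, nullptr);
  std::string out(len, 0);
  ::WideCharToMultiByte(CP_UTF8, 0, wstr.data(), wlen, &out[0], len, nullptr, nullptr);
  return out;
}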

int main() {
  try {
    wchar_t wide_unicode_right_quote[2] = {0x2019, 0};  // came from a directory_iterator in reality
    nlohmann::json apost = Narrow(std::filesystem::path(wide_unicode_right_quote).native());
    std::cout << apost << std::endl;
    return 0;
  } catch (const std::exception& e) {
    std::cerr << e.what() << std::endl;
    return 1;
  }
}
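
A possibly more portable variant of the same workaround is to go through std::filesystem::path::u8string(), which the C++ standard specifies as returning the path in UTF-8. The sketch below is an assumption on my part, not something tested in the report above: the helper name `path_to_utf8` is hypothetical, and whether it behaves correctly depends on the standard library doing the wide-to-UTF-8 conversion properly. The copy through iterators is there because u8string() returns std::string before C++20 and std::u8string from C++20 on.

```cpp
#include <filesystem>
#include <string>

// Hypothetical helper: returns the path as UTF-8 bytes in a std::string.
// u8string() returns std::string in C++17 and std::u8string in C++20,
// so the bytes are copied through iterators to work with both.
inline std::string path_to_utf8(const std::filesystem::path& p) {
    const auto u8 = p.u8string();
    return std::string(u8.begin(), u8.end());
}
```

The result could then be assigned to a json value the same way the output of Narrow() is above, without any Win32-specific code.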

Error messages

[json.exception.type_error.316] invalid UTF-8 byte at index 0: 0x92

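The reported index and byte can be reproduced with a simplified UTF-8 check. This is only a sketch under stated assumptions: `first_invalid_utf8_byte` is a hypothetical helper, it is not the library's actual validation code, and it does not reject overlong forms or encoded surrogates. It shows why 0x92 fails at index 0: bytes in the range 0x80..0xBF are continuation bytes and cannot start a sequence.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: returns the index of the first byte that cannot be
// part of well-formed UTF-8, or std::string::npos if no such byte is found.
// Simplified: overlong forms and encoded surrogates are not rejected.
inline std::size_t first_invalid_utf8_byte(const std::string& s) {
    std::size_t i = 0;
    while (i < s.size()) {
        const unsigned char lead = static_cast<unsigned char>(s[i]);
        std::size_t extra = 0;  // number of continuation bytes expected
        if (lead < 0x80) {
            extra = 0;
        } else if (lead >= 0xC2 && lead <= 0xDF) {
            extra = 1;
        } else if (lead >= 0xE0 && lead <= 0xEF) {
            extra = 2;
        } else if (lead >= 0xF0 && lead <= 0xF4) {
            extra = 3;
        } else {
            return i;  // stray continuation byte or invalid lead byte
        }
        for (std::size_t k = 1; k <= extra; ++k) {
            if (i + k >= s.size()) {
                return i;  // truncated sequence: report the lead byte
            }
            const unsigned char c = static_cast<unsigned char>(s[i + k]);
            if ((c & 0xC0) != 0x80) {
                return i + k;  // expected a continuation byte (10xxxxxx)
            }
        }
        i += extra + 1;
    }
    return std::string::npos;
}
```

For the Windows-1252 output "\x92" the function returns 0, matching the index in the exception, while the UTF-8 form "\xe2\x80\x99" passes.
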
Compiler and operating system

MSVC 2022 Professional, C++ 20

Library version

develop - a259ecc

Validation


MHebes commented Feb 16, 2024

I can also work around this problem by adding a manifest XML that sets my app's code page to CP_UTF8 on supported versions of Windows.

In CMake I wrapped this in a function:

# target_add_manifest(<target> <manifest file>)
#
# You probably want to use ${MANIFEST_FILE_UTF8} defined below this function
#
# Adds a manifest file (https://learn.microsoft.com/en-us/windows/win32/sbscs/application-manifests)
# to an EXE
function(target_add_manifest TARGET_NAME MANIFEST_FILE)
  if(NOT TARGET_NAME)
    message(FATAL_ERROR "You must provide a target")
  endif()
  if(NOT MANIFEST_FILE)
    message(FATAL_ERROR "You must provide a manifest file")
  endif()
  add_custom_command(
    TARGET ${TARGET_NAME}
    POST_BUILD
    COMMAND "mt.exe" -manifest \"${MANIFEST_FILE}\" \"-updateresource:$<TARGET_FILE:${TARGET_NAME}>\"
  )
endfunction()

which is used like this (probably want to wrap in a platform check):

add_executable(myapp main.cpp)
target_add_manifest(myapp "${CMAKE_CURRENT_SOURCE_DIR}/cmake/utf8.manifest")

with utf8.manifest being:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<assembly manifestVersion="1.0" xmlns="urn:schemas-microsoft-com:asm.v1">
  <application>
    <windowsSettings>
      <activeCodePage xmlns="http://schemas.microsoft.com/SMI/2019/WindowsSettings">UTF-8</activeCodePage>
    </windowsSettings>
  </application>
</assembly>

This solves the problem as long as the app is running on Windows version 1903 or later. It's still a bug, but I wanted to share this workaround because it's useful for many libraries that have the same issue.
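
To confirm at runtime that the manifest actually took effect, one option is to query the active code page with the Win32 function GetACP() and compare it against CP_UTF8 (65001). This is a sketch, not something from the workaround above: the helper name `narrow_code_page_is_utf8` is hypothetical, and the non-Windows branch simply assumes UTF-8 without inspecting the locale.

```cpp
#ifdef _WIN32
#include <windows.h>
#endif

// Hypothetical helper: reports whether the process's active code page is UTF-8.
inline bool narrow_code_page_is_utf8() {
#ifdef _WIN32
    // With the activeCodePage manifest applied on a supported Windows
    // version, GetACP() is expected to return CP_UTF8 (65001).
    return ::GetACP() == CP_UTF8;
#else
    // Assumption: on other platforms narrow paths are treated as UTF-8 here;
    // this sketch does not check the locale.
    return true;
#endif
}
```

An app could call this once at startup and fall back to an explicit conversion such as Narrow() when it returns false, for example on Windows versions older than 1903.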


MHebes commented Feb 16, 2024

Proposed diff to do the conversion to UTF-8 when targeting Windows:

diff --git a/include/nlohmann/detail/conversions/to_json.hpp b/include/nlohmann/detail/conversions/to_json.hpp
index 562089c3..a8b74688 100644
--- a/include/nlohmann/detail/conversions/to_json.hpp
+++ b/include/nlohmann/detail/conversions/to_json.hpp
@@ -413,10 +413,20 @@ inline void to_json(BasicJsonType& j, const T& t)
 }
 
 #if JSON_HAS_FILESYSTEM || JSON_HAS_EXPERIMENTAL_FILESYSTEM
+#if defined(_WIN32)
+#include <windows.h>
+#endif
 template<typename BasicJsonType>
 inline void to_json(BasicJsonType& j, const std_fs::path& p)
 {
+#if defined(_WIN32)
+    int len = ::WideCharToMultiByte(CP_UTF8, 0, &p.native()[0], p.native().size(), nullptr, 0, nullptr, nullptr);
+    std::string as_utf8(len, 0);
+    ::WideCharToMultiByte(CP_UTF8, 0, &p.native()[0], p.native().size(), &as_utf8[0], len, nullptr, nullptr);
+    j = std::move(as_utf8);
+#else
     j = p.string();
+#endif
 }
 #endif
