[Xmlb, JSON] Issue with some special characters #8

Open
ak2yny opened this issue Dec 19, 2023 · 1 comment

Comments

@ak2yny
Contributor

ak2yny commented Dec 19, 2023

Steps to reproduce:

  • Compile legal_360.json with the installed raven-formats: xmlb legal_360.json legal_360.engb

Error message:

Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "C:\Users\x\AppData\Local\Programs\Python\Python312\Scripts\xmlb.exe\__main__.py", line 7, in <module>
  File "C:\Users\x\AppData\Local\Programs\Python\Python312\Lib\site-packages\raven_formats\xmlb.py", line 236, in main
    compile(input_file, output_file)
  File "C:\Users\x\AppData\Local\Programs\Python\Python312\Lib\site-packages\raven_formats\xmlb.py", line 206, in compile
    data = json.load(json_file, object_pairs_hook=parse_json_object_pairs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\x\AppData\Local\Programs\Python\Python312\Lib\json\__init__.py", line 293, in load
    return loads(fp.read(),
                 ^^^^^^^^^
  File "<frozen codecs>", line 322, in decode
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc5 in position 2467: invalid continuation byte

Platform information:

  • Windows-11-10.0.22621-SP0
  • Python 3.12.1 (tags/v3.12.1:2305ca5, Dec 7 2023, 22:03:25) [MSC v.1937 64 bit (AMD64)] on win32
  • Raven-Formats v1.6
    (Note: Other JSON programs on my machine seem to have the same issue)

Things I tried:

  • Replace Å, é, and á in the source JSON file -> worked
  • Convert the file to XML and compile with xmlb legal_360.xml legal_360.engb -> worked
  • Use the converter of ak2yny's version to convert to XML -> failed (same error)
@ak2yny
Contributor Author

ak2yny commented Jan 12, 2024

After some tests with the test file, I found that the issue seems to be the decoder (utf-8):

Traceback (most recent call last):
  File "D:\GitHub\raven-formats\src\raven_formats\xmlb.py", line 346, in <module>
    main()
  File "D:\GitHub\raven-formats\src\raven_formats\xmlb.py", line 341, in main
    convert(input_file, output_file, not args.no_indent)
  File "D:\GitHub\raven-formats\src\raven_formats\xmlb.py", line 315, in convert
    json_data = input_file.read()
                ^^^^^^^^^^^^^^^^^
  File "<frozen codecs>", line 322, in decode
  File "C:\Users\x\AppData\Local\Programs\Python\Python312\Lib\encodings\utf_8_sig.py", line 69, in _buffer_decode
    return codecs.utf_8_decode(input, errors, final)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc5 in position 2467: invalid continuation byte

The complete code starting at line 314 is as follows:

    with input_path.open(mode='r', encoding='utf-8-sig') as input_file:
        json_data = input_file.read()
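
A minimal sketch that appears to reproduce the same error without the original file. It rests on an assumption the traceback does not confirm: that legal_360.json was saved in a single-byte Windows encoding such as cp1252, where Å is the single byte 0xC5. The sample text is made up for illustration.

```python
# Hypothetical sample text; the real file content is not shown in this issue.
sample = '{"name": "Åland"}'

# Assumption: the file was written as cp1252, where "Å" is the single byte 0xC5.
raw = sample.encode("cp1252")

try:
    raw.decode("utf-8-sig")
    error = None
except UnicodeDecodeError as exc:
    # In UTF-8, 0xC5 starts a two-byte sequence, so the following byte ("l")
    # is rejected as an invalid continuation byte.
    error = exc

print(error)
```

If this assumption holds, the decoder is behaving as designed, and the real question is which encoding the input file uses.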

Possible solutions are to use a different decoder or to ignore/replace the problematic characters. Any decoder can fail, depending on how the input file was encoded, so utf-8 is probably still the best default. I'm not sure which error handler is better to use, "ignore" or "replace".
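
To compare the two error handlers, here is a small sketch on made-up bytes (again assuming cp1252-encoded input, which is a guess). Both handlers lose the original characters; "replace" at least leaves a visible marker where something went wrong.

```python
# Hypothetical bytes: "Å" (0xC5) and "é" (0xE9) as cp1252, embedded in ASCII text.
raw = b'"\xc5 caf\xe9"'

# errors="ignore" silently drops every byte that is not valid UTF-8.
ignored = raw.decode("utf-8", errors="ignore")

# errors="replace" substitutes U+FFFD (the replacement character) instead.
replaced = raw.decode("utf-8", errors="replace")

print(repr(ignored))
print(repr(replaced))
```

Neither option recovers Å or é, so they only help if losing those characters is acceptable.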

However, after some reading and tests, I came up with this (instead of above code):

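from chardet import detect

    with input_path.open(mode='rb') as input_file:
        raw_data = input_file.read()
    # detect() can return None as the encoding; fall back to utf-8-sig in that case
    encoding = detect(raw_data)['encoding'] or 'utf-8-sig'
    # Binary mode skips newline translation, so the carriage returns of CRLF line endings are removed here
    json_data = raw_data.decode(encoding).replace("\r", "")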

Notes:

  • chardet is not part of Python by default (pip install chardet).
  • The encoding might not be detected (or not detected correctly), depending on the input file, according to a comment on the solution this is based on.
  • After using any of the solutions, json.loads(json_data, object_pairs_hook=parse_json_object_pairs) now works correctly.
  • With the earlier solutions, we could use, for example, with input_path.open(mode='r', encoding='utf-8', errors='ignore') as input_file: and keep working with input_file directly.
  • The encoding seems to be an issue only on Windows. utf-8 is usually the right choice there, but not always (latin-1 seems to be another common one).
  • I also tried to use with input_path.open(mode='r') as input_file: but this seems to (always) produce an empty string.
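
As an alternative without the extra dependency, a fallback chain using only the standard library might look like the sketch below. The helper name read_text_with_fallback and the choice of cp1252 as the second encoding are assumptions for illustration, not part of raven-formats. For background: open(mode='r') without an encoding argument uses the locale's preferred encoding, which on Windows is often a legacy code page rather than utf-8.

```python
import tempfile
from pathlib import Path


def read_text_with_fallback(path, encodings=("utf-8-sig", "cp1252")):
    """Hypothetical helper: decode with the first encoding that succeeds."""
    raw = Path(path).read_bytes()
    for encoding in encodings:
        try:
            text = raw.decode(encoding)
        except UnicodeDecodeError:
            continue
        # Bytes are decoded without newline translation, so normalize CRLF here.
        return text.replace("\r\n", "\n")
    # Last resort: keep going and mark undecodable bytes with U+FFFD.
    return raw.decode("utf-8", errors="replace").replace("\r\n", "\n")


# Usage with a made-up cp1252 file that has Windows line endings.
with tempfile.TemporaryDirectory() as tmp:
    sample_path = Path(tmp) / "sample.json"
    sample_path.write_bytes('{"name": "Åland"}\r\n'.encode("cp1252"))
    text = read_text_with_fallback(sample_path)

print(repr(text))
```

Trying utf-8-sig first keeps correctly encoded files working as before. The trade-off is that a file in some other legacy encoding would decode without error as cp1252 but show the wrong characters.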
