[Xmlb, JSON] Issue with some special characters #8

ak2yny · 2023-12-19T14:35:31Z

Steps to reproduce:

Try to compile legal_360.json with the installed raven-formats, using: xmlb legal_360.json legal_360.engb

Error message:

Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "C:\Users\x\AppData\Local\Programs\Python\Python312\Scripts\xmlb.exe\__main__.py", line 7, in <module>
  File "C:\Users\x\AppData\Local\Programs\Python\Python312\Lib\site-packages\raven_formats\xmlb.py", line 236, in main
    compile(input_file, output_file)
  File "C:\Users\x\AppData\Local\Programs\Python\Python312\Lib\site-packages\raven_formats\xmlb.py", line 206, in compile
    data = json.load(json_file, object_pairs_hook=parse_json_object_pairs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\x\AppData\Local\Programs\Python\Python312\Lib\json\__init__.py", line 293, in load
    return loads(fp.read(),
                 ^^^^^^^^^
  File "<frozen codecs>", line 322, in decode
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc5 in position 2467: invalid continuation byte

Platform information:

Windows-11-10.0.22621-SP0
Python 3.12.1 (tags/v3.12.1:2305ca5, Dec 7 2023, 22:03:25) [MSC v.1937 64 bit (AMD64)] on win32
Raven-Formats v1.6
(Note: Other JSON programs on my machine seem to have the same issue)

Things I tried:

Replace Å, é, and á in the source JSON file -> worked
Convert the file to XML and compile with xmlb legal_360.xml legal_360.engb -> worked
Use the converter of ak2yny's version to convert to XML -> failed (same error)
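The failing byte can be reproduced in isolation. The sketch below assumes the file was saved in a single-byte Windows encoding such as cp1252, where Å is the single byte 0xC5. That assumption matches the traceback but has not been confirmed for legal_360.json. The sample string is made up for illustration.

```python
# Hypothetical sample text: "Å" becomes the single byte 0xC5 in cp1252.
raw = "Å legal".encode("cp1252")

try:
    raw.decode("utf-8")
    error = None
except UnicodeDecodeError as exc:
    # 0xC5 announces a two-byte UTF-8 sequence, but the next byte (a space)
    # is not a continuation byte, so the strict decoder gives up here.
    error = exc

print(error)
print(raw.decode("cp1252"))  # the same bytes decode cleanly as cp1252
```

If the real file behaves the same way, that would explain why replacing Å, é, and á in the source makes the error go away.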

ak2yny · 2024-01-12T15:47:22Z

After some tests with the test file, I found that the issue seems to be the decoder (utf-8):

Traceback (most recent call last):
  File "D:\GitHub\raven-formats\src\raven_formats\xmlb.py", line 346, in <module>
    main()
  File "D:\GitHub\raven-formats\src\raven_formats\xmlb.py", line 341, in main
    convert(input_file, output_file, not args.no_indent)
  File "D:\GitHub\raven-formats\src\raven_formats\xmlb.py", line 315, in convert
    json_data = input_file.read()
                ^^^^^^^^^^^^^^^^^
  File "<frozen codecs>", line 322, in decode
  File "C:\Users\x\AppData\Local\Programs\Python\Python312\Lib\encodings\utf_8_sig.py", line 69, in _buffer_decode
    return codecs.utf_8_decode(input, errors, final)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc5 in position 2467: invalid continuation byte

The complete code, starting at line 314, is as follows:

    with input_path.open(mode='r', encoding='utf-8-sig') as input_file:
        json_data = input_file.read()

Possible solutions are to use a different decoder or to ignore/replace the problematic characters. Any decoder can fail, depending on the encoding of the input file, so utf-8 still seems like the best default. I'm not sure which error handler is better to use, "ignore" or "replace".
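To make the difference between the two error handlers concrete, here is a minimal sketch. The JSON snippet is a made-up sample encoded as cp1252, not content from legal_360.json.

```python
import json

# Hypothetical input: JSON text saved as cp1252 instead of UTF-8.
raw = '{"text": "Åre café"}'.encode("cp1252")

# "ignore" silently drops every byte that is not valid UTF-8.
ignored = raw.decode("utf-8", errors="ignore")

# "replace" substitutes U+FFFD (the replacement character) for each bad byte.
replaced = raw.decode("utf-8", errors="replace")

print(json.loads(ignored)["text"])
print(json.loads(replaced)["text"])
```

Both variants parse as JSON, but both lose the original characters. "replace" at least leaves a visible marker where data was lost, while "ignore" shortens the string without any trace.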

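However, after some reading and tests, I came up with this (instead of the code above):

from chardet import detect

    # Read the raw bytes and let chardet guess the encoding.
    with input_path.open(mode='rb') as input_file:
        raw_data = input_file.read()
    # detect() can return None as the encoding, so fall back to utf-8 in that case.
    encoding = detect(raw_data)['encoding'] or 'utf-8'
    json_data = raw_data.decode(encoding).replace("\r", "") # Binary mode does no newline translation, so carriage returns from CRLF line endings are stripped here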

Notes:

chardet is not part of the Python standard library (pip install chardet).
The encoding might not be detected, or might be detected incorrectly, depending on the input file (as noted in a comment on the solution this comes from).
After using any of these solutions, json.loads(json_data, object_pairs_hook=parse_json_object_pairs) works correctly.
With the earlier solutions (ignore/replace), we could for example use with input_path.open(mode='r', encoding='utf-8', errors='ignore') as input_file: and keep working with input_file directly.
The encoding seems to be an issue only on Windows, where files are usually utf-8, but not always (latin-1 seems to be another common encoding).
I also tried with input_path.open(mode='r') as input_file: (no explicit encoding), but this seems to (always) produce an empty string.
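An alternative that avoids the extra dependency is to try a short list of encodings in order. This is only a sketch: read_text_with_fallback is a hypothetical helper, not part of raven-formats, and the encoding list (utf-8-sig first, then cp1252) is an assumption about which files are likely to show up.

```python
import json
import tempfile
from pathlib import Path


def read_text_with_fallback(path, encodings=("utf-8-sig", "cp1252")):
    """Hypothetical helper: decode a file with the first encoding that fits."""
    raw = Path(path).read_bytes()
    for encoding in encodings:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Last resort: keep going, but mark undecodable bytes with U+FFFD.
    return raw.decode("utf-8", errors="replace")


# Usage with a made-up cp1252 file standing in for the real input.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.json"
    sample.write_bytes('{"text": "Åre café"}'.encode("cp1252"))
    data = json.loads(read_text_with_fallback(sample))

print(data)
```

The order matters: cp1252 accepts almost any byte sequence, so it has to come after utf-8-sig, and a wrong guess shows up as garbled characters rather than an exception. Carriage returns do not need to be stripped for json.loads, since JSON treats them as whitespace between tokens.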
