[Bug] json encoding error on Windows where utf-8 is not the default #1118

Bing-su · 2022-12-31T10:53:08Z

Bug description

I tried to train a Korean language recognition model with the doctr/references/recognition/train_pytorch.py script and got encoding errors.

The first error is here.

doctr/doctr/datasets/recognition.py

Lines 39 to 40 in e66ce01

    
    with open(labels_path) as f:
        labels = json.load(f)

The second is here.

doctr/doctr/models/factory/hub.py

Lines 86 to 87 in e66ce01

    
    with config_path.open("w") as f:
        json.dump(model_config, f, indent=2, ensure_ascii=False)

I'm using Korean Windows 11, where the default encoding is 'cp949'. Because neither call passes an encoding, the files are opened as 'cp949' instead of 'utf-8', which causes the errors.
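
For readers who are not on a cp949 system, here is a minimal sketch of the write-side failure. Passing encoding="cp949" explicitly stands in for the Korean Windows default. The `config` dict and the `dump_config` helper are hypothetical stand-ins for `model_config` and the code in hub.py, not the library's actual code.

```python
import json
import os
import tempfile

# Hypothetical config; '\xa3' is the character named in the traceback.
config = {"vocab": "abc\xa3"}


def dump_config(path, encoding):
    """Write the config as JSON using an explicit text encoding."""
    with open(path, "w", encoding=encoding) as f:
        json.dump(config, f, indent=2, ensure_ascii=False)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "config.json")

    # Simulate the Korean Windows default by forcing cp949.
    try:
        dump_config(path, "cp949")
        cp949_failed = False
    except UnicodeEncodeError as exc:
        cp949_failed = True
        print("cp949 write failed:", exc)

    # The same data writes and reads back cleanly as UTF-8.
    dump_config(path, "utf-8")
    with open(path, encoding="utf-8") as f:
        round_trip = json.load(f)

print("UTF-8 round trip matches:", round_trip == config)
```

With ensure_ascii=False, json.dump emits the raw character, so the file's encoding has to be able to represent it.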

Code snippet to reproduce the bug

# doctr/datasets/vocabs.py
VOCABS["korean"] = VOCABS["english"] + "ᄀᄁᄂᄃᄄᄅᄆᄇᄈᄉᄊᄋᄌᄍᄎᄏᄐᄑᄒ" + "ᅡᅢᅣᅤᅥᅦᅧᅨᅩᅪᅫᅬᅭᅮᅯᅰᅱᅲᅳᅴᅵ" + "ᆨᆩᆪᆫᆬᆭᆮᆯᆰᆱᆲᆳᆴᆵᆶᆷᆸᆹᆺᆻᆼᆽᆾᆿᇀᇁᇂ"

python train_pytorch.py  \
    vitstr_small  \
    --train_path train  \
    --val_path validation  \
    --name vitstr_small-korean  \
    --workers 8  \
    --vocab korean  \
    --wb  \
    --push-to-hub  \
    --amp

It probably won't raise an error on Windows systems where UTF-8 is the default encoding.
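
To check which case applies on a given machine, a small sketch like the following may help. It prints the encoding that open() falls back to in text mode when no encoding is passed, and whether Python's UTF-8 mode is active. The variable names are illustrative only.

```python
import locale
import sys

# Encoding used by open() in text mode when no encoding argument is given.
default_encoding = locale.getpreferredencoding(False)

# True when Python runs in UTF-8 mode (for example via `-X utf8` or PYTHONUTF8=1).
utf8_mode = bool(sys.flags.utf8_mode)

print("default text encoding:", default_encoding)
print("UTF-8 mode enabled:", utf8_mode)
```

On the setup described in this issue, the first line would be expected to report cp949. Enabling UTF-8 mode is a possible workaround, but passing the encoding explicitly in the code does not depend on the user's environment.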

dataset:
https://drive.google.com/file/d/1RN6pQAELWGYmwt1y6xnF6Xj0dO5RKU-Q/view?usp=share_link
(344k images, 1GB)

Error traceback

Cloning https://huggingface.co/Bingsu/vitstr_small-korean into local empty directory.
WARNING:huggingface_hub.repository:Cloning https://huggingface.co/Bingsu/vitstr_small-korean into local empty directory.
Pulling changes ...
WARNING:huggingface_hub.repository:Pulling changes ...
Upload file pytorch_model.bin:  87%|███████████████████████████████████████▉      | 71.1M/81.9M [00:06<00:00, 14.8MB/s]remote: Scanning LFS files for validity, may be slow...
remote: LFS file scan complete.
To https://huggingface.co/Bingsu/vitstr_small-korean
   5aefb96..101a2bf  main -> main

WARNING:huggingface_hub.repository:remote: Scanning LFS files for validity, may be slow...
remote: LFS file scan complete.
To https://huggingface.co/Bingsu/vitstr_small-korean
   5aefb96..101a2bf  main -> main

Upload file pytorch_model.bin: 100%|██████████████████████████████████████████████| 81.9M/81.9M [00:09<00:00, 9.45MB/s]
Traceback (most recent call last):
  File "C:\Users\smartmind\Desktop\workspace\test\train_ocr\train_pytorch.py", line 468, in <module>
    main(args)
  File "C:\Users\smartmind\Desktop\workspace\test\train_ocr\train_pytorch.py", line 405, in main
    push_to_hf_hub(model, exp_name, task="recognition", run_config=args)
  File "C:\Users\smartmind\miniconda3\envs\ocr\lib\site-packages\doctr\models\factory\hub.py", line 179, in push_to_hf_hub
    _save_model_and_config_for_hf_hub(model, repo.local_dir, arch=arch, task=task)
  File "C:\Users\smartmind\miniconda3\envs\ocr\lib\site-packages\doctr\models\factory\hub.py", line 87, in _save_model_and_config_for_hf_hub
    json.dump(model_config, f, indent=2, ensure_ascii=False)
  File "C:\Users\smartmind\miniconda3\envs\ocr\lib\json\__init__.py", line 180, in dump
    fp.write(chunk)
UnicodeEncodeError: 'cp949' codec can't encode character '\xa3' in position 98: illegal multibyte sequence

Environment

DocTR version: N/A
TensorFlow version: N/A
PyTorch version: 1.13.1 (torchvision 0.14.1)
OpenCV version: 4.7.0
OS: Microsoft Windows 11 Pro
Python version: 3.10.8
Is CUDA available (TensorFlow): N/A
Is CUDA available (PyTorch): Yes
CUDA runtime version: 11.7.99
GPU models and configuration: Could not collect
Nvidia driver version: Could not collect
cuDNN version: Could not collect

Deep Learning backend

is_tf_available: False
is_torch_available: True

zahidetastan · 2023-01-26T08:36:42Z

Hello @Bing-su ,
I had a similar error for Turkish characters.
Can you try this?

with open(labels_path, encoding="utf8") as f:
    labels = json.load(f)

and

with config_path.open("w", encoding="utf8") as f:
    json.dump(model_config, f, indent=2, ensure_ascii=False)
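
As a self-contained sketch of the same idea, the two helpers below read and write JSON with an explicit UTF-8 encoding, so the result no longer depends on the system default. `save_json` and `load_json` are hypothetical names for illustration, not functions from doctr, and the sample labels are made up.

```python
import json
import tempfile
from pathlib import Path


def save_json(path: Path, data: dict) -> None:
    """Write JSON as UTF-8, keeping non-ASCII characters unescaped."""
    with path.open("w", encoding="utf-8") as f:
        json.dump(data, f, indent=2, ensure_ascii=False)


def load_json(path: Path) -> dict:
    """Read JSON that was stored as UTF-8."""
    with path.open(encoding="utf-8") as f:
        return json.load(f)


# Made-up labels mapping image names to Korean and Turkish text.
labels = {"img_0.jpg": "안녕하세요", "img_1.jpg": "şğıöü"}

with tempfile.TemporaryDirectory() as tmp:
    labels_path = Path(tmp) / "labels.json"
    save_json(labels_path, labels)
    raw_bytes = labels_path.read_bytes()
    loaded = load_json(labels_path)

print("round trip matches:", loaded == labels)
```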

Bing-su · 2023-01-26T14:38:33Z

Thank you for your help. In fact, I already solved the problem, and this post is for sharing the problem.

Bing-su added the "type: bug" (Something isn't working) label Dec 31, 2022

felixdittrich92 closed this as completed Apr 20, 2023
