[Bug] json encoding error on Windows where utf-8 is not the default #1118

Closed
Bing-su opened this issue Dec 31, 2022 · 2 comments
Labels
type: bug Something isn't working

Comments


Bing-su commented Dec 31, 2022

Bug description

I tried to train a Korean-language recognition model with the doctr/references/recognition/train_pytorch.py script and got encoding errors.

The first error occurs here:

with open(labels_path) as f:
    labels = json.load(f)

The second occurs here:

with config_path.open("w") as f:
    json.dump(model_config, f, indent=2, ensure_ascii=False)

I'm using Korean Windows 11, where the default encoding is 'cp949', so the errors occur because these files are not read and written as 'utf-8'.
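
The read-side failure can be sketched without a Korean Windows machine by passing encoding="cp949" explicitly, which stands in for the system default described above. This is a minimal illustration, not docTR code: the read_labels helper, the file name, and the label contents are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def read_labels(path, encoding):
    """Return the parsed labels, or the exception name if reading fails."""
    try:
        with open(path, encoding=encoding) as f:
            return json.load(f)
    except ValueError as exc:
        # UnicodeDecodeError and json.JSONDecodeError are both ValueError subclasses.
        return type(exc).__name__


with tempfile.TemporaryDirectory() as tmp:
    labels_path = Path(tmp) / "labels.json"
    # Hypothetical label entry; the escapes are three Hangul jamo (U+1100, U+1161, U+11A8).
    labels = {"img_0.png": "\u1100\u1161\u11a8"}
    labels_path.write_text(json.dumps(labels, ensure_ascii=False), encoding="utf-8")

    as_cp949 = read_labels(labels_path, "cp949")  # simulates the Korean Windows default
    as_utf8 = read_labels(labels_path, "utf-8")   # explicit encoding

print("read as cp949:", as_cp949)
print("read as utf-8:", as_utf8)
```

Only the UTF-8 read is guaranteed to return the original labels; the cp949 read either fails or returns different text.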

Code snippet to reproduce the bug

# doctr/datasets/vocabs.py
VOCABS["korean"] = VOCABS["english"] + "ᄀᄁᄂᄃᄄᄅᄆᄇᄈᄉᄊᄋᄌᄍᄎᄏᄐᄑᄒ" + "ᅡᅢᅣᅤᅥᅦᅧᅨᅩᅪᅫᅬᅭᅮᅯᅰᅱᅲᅳᅴᅵ" + "ᆨᆩᆪᆫᆬᆭᆮᆯᆰᆱᆲᆳᆴᆵᆶᆷᆸᆹᆺᆻᆼᆽᆾᆿᇀᇁᇂ"
# shell command used to launch training
python train_pytorch.py  \
    vitstr_small  \
    --train_path train  \
    --val_path validation  \
    --name vitstr_small-korean  \
    --workers 8  \
    --vocab korean  \
    --wb  \
    --push-to-hub  \
    --amp

It probably won't raise an error on Windows systems where UTF-8 is the default encoding.
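
To check which case a given machine falls into, the following sketch prints the encoding that open() falls back to when no encoding argument is passed. The default_text_encoding helper name is hypothetical. Python 3.7+ also has a UTF-8 mode (enabled with the PYTHONUTF8=1 environment variable or the -X utf8 option) that makes UTF-8 the default; it was not tried in this thread, so treat it as a possible workaround rather than a confirmed fix.

```python
import locale
import sys


def default_text_encoding() -> str:
    """Encoding used by open() in text mode when no encoding argument is given."""
    return locale.getpreferredencoding(False)


print("default text encoding:", default_text_encoding())
print("UTF-8 mode enabled:", bool(sys.flags.utf8_mode))
```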

dataset:
https://drive.google.com/file/d/1RN6pQAELWGYmwt1y6xnF6Xj0dO5RKU-Q/view?usp=share_link
(344k images, 1GB)

Error traceback

Cloning https://huggingface.co/Bingsu/vitstr_small-korean into local empty directory.
WARNING:huggingface_hub.repository:Cloning https://huggingface.co/Bingsu/vitstr_small-korean into local empty directory.
Pulling changes ...
WARNING:huggingface_hub.repository:Pulling changes ...
Upload file pytorch_model.bin:  87%|███████████████████████████████████████▉      | 71.1M/81.9M [00:06<00:00, 14.8MB/s]remote: Scanning LFS files for validity, may be slow...
remote: LFS file scan complete.
To https://huggingface.co/Bingsu/vitstr_small-korean
   5aefb96..101a2bf  main -> main

WARNING:huggingface_hub.repository:remote: Scanning LFS files for validity, may be slow...
remote: LFS file scan complete.
To https://huggingface.co/Bingsu/vitstr_small-korean
   5aefb96..101a2bf  main -> main

Upload file pytorch_model.bin: 100%|██████████████████████████████████████████████| 81.9M/81.9M [00:09<00:00, 9.45MB/s]
Traceback (most recent call last):
  File "C:\Users\smartmind\Desktop\workspace\test\train_ocr\train_pytorch.py", line 468, in <module>
    main(args)
  File "C:\Users\smartmind\Desktop\workspace\test\train_ocr\train_pytorch.py", line 405, in main
    push_to_hf_hub(model, exp_name, task="recognition", run_config=args)
  File "C:\Users\smartmind\miniconda3\envs\ocr\lib\site-packages\doctr\models\factory\hub.py", line 179, in push_to_hf_hub
    _save_model_and_config_for_hf_hub(model, repo.local_dir, arch=arch, task=task)
  File "C:\Users\smartmind\miniconda3\envs\ocr\lib\site-packages\doctr\models\factory\hub.py", line 87, in _save_model_and_config_for_hf_hub
    json.dump(model_config, f, indent=2, ensure_ascii=False)
  File "C:\Users\smartmind\miniconda3\envs\ocr\lib\json\__init__.py", line 180, in dump
    fp.write(chunk)
UnicodeEncodeError: 'cp949' codec can't encode character '\xa3' in position 98: illegal multibyte sequence
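
The last line of the traceback can be checked in isolation. The sketch below tests whether the character from the error message, '\xa3', can be encoded with cp949 and with UTF-8. It also shows that json.dumps with its default ensure_ascii=True produces a pure-ASCII escape, which is why the failure only appears when ensure_ascii=False is combined with a non-UTF-8 file encoding. The can_encode helper is a hypothetical name used for illustration.

```python
import json


def can_encode(text: str, encoding: str) -> bool:
    """Return True if every character of text is representable in encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


pound = "\xa3"  # the character named in the traceback

print("cp949:", can_encode(pound, "cp949"))
print("utf-8:", can_encode(pound, "utf-8"))
# With the default ensure_ascii=True the character is written as an ASCII escape.
print("escaped JSON via cp949:", can_encode(json.dumps(pound), "cp949"))
```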

Environment

DocTR version: N/A
TensorFlow version: N/A
PyTorch version: 1.13.1 (torchvision 0.14.1)
OpenCV version: 4.7.0
OS: Microsoft Windows 11 Pro
Python version: 3.10.8
Is CUDA available (TensorFlow): N/A
Is CUDA available (PyTorch): Yes
CUDA runtime version: 11.7.99
GPU models and configuration: Could not collect
Nvidia driver version: Could not collect
cuDNN version: Could not collect

Deep Learning backend

is_tf_available: False
is_torch_available: True
Bing-su added the type: bug label on Dec 31, 2022

zahidetastan commented Jan 26, 2023

Hello @Bing-su,
I had a similar error with Turkish characters.
Can you try this?

with open(labels_path, encoding="utf8") as f:
    labels = json.load(f)

and

with config_path.open("w", encoding="utf8") as f:
    json.dump(model_config, f, indent=2, ensure_ascii=False)
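
As a self-contained check of the suggestion above, here is a minimal round-trip sketch that writes and reads JSON with an explicit UTF-8 encoding. The load_json and dump_json helpers, the file name, and the config contents are hypothetical and not part of docTR.

```python
import json
import tempfile
from pathlib import Path
from typing import Any


def load_json(path: Path) -> Any:
    """Read JSON as UTF-8 regardless of the platform default encoding."""
    with path.open(encoding="utf-8") as f:
        return json.load(f)


def dump_json(obj: Any, path: Path) -> None:
    """Write JSON as UTF-8, keeping non-ASCII characters unescaped."""
    with path.open("w", encoding="utf-8") as f:
        json.dump(obj, f, indent=2, ensure_ascii=False)


with tempfile.TemporaryDirectory() as tmp:
    config_path = Path(tmp) / "config.json"
    # Hypothetical config: the pound sign from the traceback plus three Hangul jamo.
    model_config = {"vocab": "abc\u00a3\u1100\u1161\u11a8"}
    dump_json(model_config, config_path)
    restored = load_json(config_path)

print("round trip ok:", restored == model_config)
```

Passing the encoding at every open() call keeps the behaviour the same on machines whose default is cp949, cp1254, or UTF-8.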


Bing-su commented Jan 26, 2023

Thank you for your help. In fact, I had already solved the problem; I opened this issue to share it.
