UnicodeDecodeError: 'gbk' codec can't decode byte 0xaf in position 40: illegal multibyte sequence #8

Open
ubarry opened this issue Jun 20, 2024 · 0 comments

mdtable raises UnicodeDecodeError when I try to convert a UTF-8 CSV file containing Chinese characters (hereinafter called 'test.csv') into a Markdown table.

background info

Windows 11, system default code page: 936 (GBK), Python 3.9.18, Python default encoding: 'utf-8', package manager: miniconda

mdtable is installed at "D:\app_barry\miniconda\Scripts\mdtable.exe". I installed it by running the following command in PowerShell:

pip install mdtable

tests I've tried

content of test.csv and test2.csv:

bbc,cctv,wapo
1,2,3
啊啊,拜拜,尺寸

1.original:

PS E:\down> mdtable test.csv
Traceback (most recent call last):
  File "D:\app_barry\miniconda\lib\runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "D:\app_barry\miniconda\lib\runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "D:\app_barry\miniconda\Scripts\mdtable.exe\__main__.py", line 7, in <module>
  File "D:\app_barry\miniconda\lib\site-packages\click\core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "D:\app_barry\miniconda\lib\site-packages\click\core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "D:\app_barry\miniconda\lib\site-packages\click\core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "D:\app_barry\miniconda\lib\site-packages\click\core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "D:\app_barry\miniconda\lib\site-packages\mdtable\cli.py", line 26, in main
    table = MDTable(
  File "D:\app_barry\miniconda\lib\site-packages\mdtable\mdtable.py", line 57, in __init__
    self._csv_dict = _read_csv(
  File "D:\app_barry\miniconda\lib\site-packages\mdtable\mdtable.py", line 180, in _read_csv
    header = next(csv_reader)
UnicodeDecodeError: 'gbk' codec can't decode byte 0xaf in position 40: illegal multibyte sequence
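
The failure can be reproduced without mdtable by decoding the same bytes with the 'gbk' codec. This is only a sketch: the CRLF line endings are an assumption about how test.csv was saved, and `try_decode` is a hypothetical helper, not part of mdtable. Under that assumption, byte 40 of the UTF-8 data is 0xAF (the middle byte of '寸'), which lines up with the offset in the traceback.

```python
# UTF-8 bytes of test.csv; the CRLF line endings are an assumption.
CSV_TEXT = "bbc,cctv,wapo\r\n1,2,3\r\n啊啊,拜拜,尺寸\r\n"
raw = CSV_TEXT.encode("utf-8")


def try_decode(data: bytes, encoding: str):
    """Return (text, None) on success or (None, error) on failure."""
    try:
        return data.decode(encoding), None
    except UnicodeDecodeError as exc:
        return None, exc


# Decoding UTF-8 bytes as GBK fails once a byte pair has no GBK mapping.
text_gbk, err_gbk = try_decode(raw, "gbk")
if err_gbk is not None:
    print(f"gbk failed at byte offset {err_gbk.start}: {err_gbk.reason}")

# Decoding with the encoding the file was actually saved in succeeds.
text_utf8, err_utf8 = try_decode(raw, "utf-8")
print("utf-8 decode ok:", err_utf8 is None)
```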

2. Retry with a new CSV file with the same content, encoded as GBK (hereinafter called 'test2.csv') > success, no error reported.

PS E:\down> mdtable test2.csv
| bbc | cctv | wapo |
| --- | ---- | ---- |
| 1   | 2    | 3    |
| 啊啊  | 拜拜   | 尺寸   |

PS E:\down>
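
As a stopgap, the GBK copy used in this test can be produced with a few lines of Python. This is a minimal sketch; `transcode` is a hypothetical helper, and the demo writes to a temporary directory rather than E:\down.

```python
import tempfile
from pathlib import Path


def transcode(src, dst, src_encoding="utf-8", dst_encoding="gbk"):
    """Re-encode a text file, working on bytes so line endings are preserved."""
    data = Path(src).read_bytes()
    Path(dst).write_bytes(data.decode(src_encoding).encode(dst_encoding))


SAMPLE = "bbc,cctv,wapo\r\n1,2,3\r\n啊啊,拜拜,尺寸\r\n"

with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "test.csv"
    dst = Path(tmp) / "test2.csv"
    src.write_bytes(SAMPLE.encode("utf-8"))
    transcode(src, dst)
    # Read the result back as GBK to confirm the text survived the round trip.
    gbk_bytes = dst.read_bytes()
    round_trip = gbk_bytes.decode("gbk")

print("round trip ok:", round_trip == SAMPLE)
```

This only works for text that GBK can represent; characters outside GBK would raise UnicodeEncodeError.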

3. Delete all Chinese characters from test.csv (now a pure-ASCII file saved as UTF-8, hereinafter called 'test3.csv') and retry > success, no error reported.

bbc,cctv,wapo
1,2,3
4,5,6
PS E:\down> mdtable test3.csv
| bbc | cctv | wapo |
| --- | ---- | ---- |
| 1   | 2    | 3    |
| 4   | 5    | 6    |

4. Retry in code page 65001 > fail, same error as in 1.

5. Check Python's default encoding > sys.getdefaultencoding() == 'utf-8'

>>> import sys
>>> sys.getdefaultencoding()
'utf-8'
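
Note that `sys.getdefaultencoding()` is not the value `open()` falls back to. In Python 3.9, a text-mode `open()` without an `encoding` argument uses `locale.getpreferredencoding(False)`, which on Windows follows the ANSI code page (936 here) rather than the console code page. That would explain why test 4 changed nothing. Whether mdtable actually calls `open()` this way is an assumption based on the traceback. A small check (`describe_encodings` is a hypothetical helper):

```python
import locale
import sys


def describe_encodings() -> dict:
    """Collect the two encodings that are easy to confuse."""
    return {
        # Used for implicit str/bytes conversions; always 'utf-8' on Python 3.
        "sys.getdefaultencoding": sys.getdefaultencoding(),
        # Used by open() in text mode when no encoding is passed.
        "locale.getpreferredencoding": locale.getpreferredencoding(False),
    }


info = describe_encodings()
for name, value in info.items():
    print(f"{name}: {value}")
```

On a machine with code page 936, the second line would be expected to show a GBK-family name instead of UTF-8.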

my guesses

mdtable seems to decode the input file with the system's ANSI encoding, so on my computer it uses GBK by default. But my input is a UTF-8 encoded CSV file, so when it reaches non-ASCII characters like '啊', the program raises UnicodeDecodeError.

If my guesses are correct, adding an option to select the encoding could be a solution.
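
A rough sketch of what such an option could look like on the reading side. `read_csv_rows` and its `encoding` parameter are hypothetical and not mdtable's actual API; the point is only that passing `encoding` to `open()` removes the dependency on the system code page.

```python
import csv
import tempfile
from pathlib import Path


def read_csv_rows(path, encoding="utf-8"):
    """Read a CSV file with an explicit encoding instead of the locale default."""
    # newline="" is what the csv module documentation recommends for readers.
    with open(path, newline="", encoding=encoding) as f:
        return list(csv.reader(f))


with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "test.csv"
    sample.write_bytes("bbc,cctv,wapo\r\n1,2,3\r\n啊啊,拜拜,尺寸\r\n".encode("utf-8"))
    rows = read_csv_rows(sample, encoding="utf-8")

print("rows read:", len(rows))
```

Until something like this exists, running with the environment variable PYTHONUTF8=1 (Python's UTF-8 mode, available since 3.7) may work as a workaround, since it makes `open()` default to UTF-8. I have not verified this with mdtable.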
