UnicodeDecodeError: 'gbk' codec can't decode byte 0xaf in position 40: illegal multibyte sequence #8

ubarry · 2024-06-20T09:26:41Z

mdtable raises a UnicodeDecodeError when I try to convert a UTF-8 CSV file containing Chinese characters (hereinafter 'test.csv') into a Markdown table.

background info

Windows 11, system default code page: 936, Python 3.9.18, Python default encoding (sys.getdefaultencoding()): 'utf-8', package manager: miniconda

mdtable is installed at "D:\app_barry\miniconda\Scripts\mdtable.exe", via the following command in PowerShell:

pip install mdtable

tests I've tried

content of test.csv and test2.csv:

bbc,cctv,wapo
1,2,3
啊啊,拜拜,尺寸

1. original:

PS E:\down> mdtable test.csv
Traceback (most recent call last):
  File "D:\app_barry\miniconda\lib\runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "D:\app_barry\miniconda\lib\runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "D:\app_barry\miniconda\Scripts\mdtable.exe\__main__.py", line 7, in <module>
  File "D:\app_barry\miniconda\lib\site-packages\click\core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "D:\app_barry\miniconda\lib\site-packages\click\core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "D:\app_barry\miniconda\lib\site-packages\click\core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "D:\app_barry\miniconda\lib\site-packages\click\core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "D:\app_barry\miniconda\lib\site-packages\mdtable\cli.py", line 26, in main
    table = MDTable(
  File "D:\app_barry\miniconda\lib\site-packages\mdtable\mdtable.py", line 57, in __init__
    self._csv_dict = _read_csv(
  File "D:\app_barry\miniconda\lib\site-packages\mdtable\mdtable.py", line 180, in _read_csv
    header = next(csv_reader)
UnicodeDecodeError: 'gbk' codec can't decode byte 0xaf in position 40: illegal multibyte sequence

2. retry with a new CSV file encoded as GBK (hereinafter 'test2.csv') > success, no error reported.

PS E:\down> mdtable test2.csv
| bbc | cctv | wapo |
| --- | ---- | ---- |
| 1   | 2    | 3    |
| 啊啊  | 拜拜   | 尺寸   |

PS E:\down>

3. delete all Chinese characters from test.csv (now a pure-ASCII file, still saved as UTF-8, hereinafter 'test3.csv') and retry > success, no error reported.

bbc,cctv,wapo
1,2,3
4,5,6

PS E:\down> mdtable test3.csv
| bbc | cctv | wapo |
| --- | ---- | ---- |
| 1   | 2    | 3    |
| 4   | 5    | 6    |

4. retry test.csv with the code page set to 65001 > fail, same error as in 1.

5. check Python's default encoding > sys.getdefaultencoding() == 'utf-8'

>>> import sys
>>> sys.getdefaultencoding()
'utf-8'
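
A note on test 5: sys.getdefaultencoding() is not the value that open() falls back on when no encoding argument is given. In text mode, open() uses the locale's preferred encoding, which on a Windows system with ANSI code page 936 is normally reported as cp936 (an alias of GBK). A minimal sketch to compare the two values on a given machine (the helper name describe_encodings is only for illustration):

```python
import locale
import sys


def describe_encodings():
    """Collect the encodings Python reports on this machine."""
    return {
        # Encoding used for implicit str/bytes conversions; 'utf-8' on Python 3.
        "sys.getdefaultencoding": sys.getdefaultencoding(),
        # Encoding that open() uses in text mode when encoding is omitted.
        "locale.getpreferredencoding": locale.getpreferredencoding(False),
        # True when Python's UTF-8 mode is active (python -X utf8 or PYTHONUTF8=1).
        "utf8_mode": bool(sys.flags.utf8_mode),
    }


if __name__ == "__main__":
    for name, value in describe_encodings().items():
        print(f"{name}: {value}")
```

Changing the console code page to 65001 does not change the locale's preferred encoding, which would be consistent with test 4 still failing. Setting the environment variable PYTHONUTF8=1 before running the command enables Python's UTF-8 mode, so open() defaults to UTF-8; this may work as a workaround, but it is untested with mdtable here.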

my guesses

mdtable seems to read files using the system's ANSI code page, so on my computer it defaults to GBK. But my input is a UTF-8 encoded CSV file, so when it reaches non-ASCII characters such as '啊', the program raises a UnicodeDecodeError.
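
The guess above can be checked without mdtable by decoding the same bytes with both codecs. This is only a sketch: the CRLF line endings and the missing trailing newline are assumptions about how test.csv was saved. Under those assumptions, the middle byte of '寸' (E5 AF B8 in UTF-8) sits at offset 40, the position named in the traceback.

```python
# Rebuild the bytes of test.csv as UTF-8. CRLF line endings and the absence
# of a trailing newline are assumptions about how the file was saved.
CSV_TEXT = "bbc,cctv,wapo\r\n1,2,3\r\n啊啊,拜拜,尺寸"
utf8_bytes = CSV_TEXT.encode("utf-8")


def try_decode(data, encoding):
    """Return (text, None) on success or (None, error) on failure."""
    try:
        return data.decode(encoding), None
    except UnicodeDecodeError as exc:
        return None, exc


# Decoding with the codec the file was written in succeeds.
text_utf8, err_utf8 = try_decode(utf8_bytes, "utf-8")
# Decoding the same bytes as GBK mimics what happens under code page 936.
text_gbk, err_gbk = try_decode(utf8_bytes, "gbk")

print("byte at offset 40:", hex(utf8_bytes[40]))
print("utf-8:", "ok" if err_utf8 is None else err_utf8)
print("gbk:  ", "ok" if err_gbk is None else err_gbk)
```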

If my guesses are correct, adding an option to select the encoding could be a solution.
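
A rough sketch of what such an option could look like at the file-reading level. read_csv_rows and its encoding parameter are hypothetical names for illustration; this is not mdtable's actual _read_csv, whose signature I have not checked.

```python
import csv
import os
import tempfile


def read_csv_rows(path, encoding="utf-8"):
    """Read a CSV file into a list of rows using an explicit encoding.

    Hypothetical helper for illustration; it is not mdtable's _read_csv.
    """
    # newline="" is what the csv module documentation recommends for readers.
    with open(path, newline="", encoding=encoding) as handle:
        return list(csv.reader(handle))


if __name__ == "__main__":
    # Write a small UTF-8 sample to a temporary file, then read it back.
    sample = "bbc,cctv,wapo\r\n1,2,3\r\n啊啊,拜拜,尺寸\r\n"
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "test.csv")
        with open(path, "w", newline="", encoding="utf-8") as handle:
            handle.write(sample)
        for row in read_csv_rows(path):
            print(row)
```

Defaulting to UTF-8 rather than the locale encoding would make test.csv work out of the box, while a GBK file such as test2.csv would then need the encoding passed explicitly (for example encoding="gbk"). Which default is preferable is a design choice for the maintainers.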
