`fread()` doesn't support unicode in file names on Windows #3400

o414o · 2022-12-20T08:58:02Z

I have just started using datatable and found that reading a file raises an IOError when the file path contains Chinese characters.
The same file loads without problems when the path contains only English (ASCII) characters.
The error message is attached at the end.
I don't know whether a solution already exists. I searched for one but could not find it.

IOError                                   Traceback (most recent call last)
<timed exec> in <module>

IOError: Unable to obtain size of D:/测试.csv: [errno 2] No such file or directory


o414o · 2022-12-20T09:05:15Z

Sorry, I forgot to mention that this error occurs only on Windows. The same Chinese path works fine on Linux.

oleksiyskononenko · 2022-12-20T18:52:45Z

On Windows we use _stat64() to check the file size. However, it seems that it doesn't support unicode characters and we need to switch to _wstat64() that is essentially a wide character version of _stat64(). Thanks for reporting the issue.
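Before blaming the file itself, it can help to confirm that Python can see it at the same path. The sketch below is only a diagnostic and not part of datatable: `describe_path` is a hypothetical helper name, and it relies on Python's own `os.stat`, which on Windows passes `str` paths to the wide-character system calls. If it reports the file as existing while `fread()` fails, the narrow-character size check described above is the likely cause.

```python
import os


def describe_path(path):
    """Report whether Python can stat `path` and whether the path is pure ASCII.

    Hypothetical diagnostic helper; it does not call datatable at all.
    """
    info = {"is_ascii": path.isascii(), "exists": False, "size": None}
    try:
        # os.stat accepts Unicode paths on every platform Python supports.
        info["size"] = os.stat(path).st_size
        info["exists"] = True
    except OSError:
        # Leave the defaults in place when the file cannot be found.
        pass
    return info
```

A result with `exists` set to `True` and `is_ascii` set to `False` matches the situation reported in this issue.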

o414o · 2022-12-21T03:06:53Z

Thank you very much for your reply.

I ran some experiments and found that the problem may be that datatable interprets the path as GBK-encoded when it is actually UTF-8, so it cannot find the file.

Here is what I tried:

import pandas as pd
import datatable as dt
import sys
print('defaultencoding: ' + sys.getdefaultencoding())
print('stdout.encoding: ' + sys.stdout.encoding)
print('stdin.encoding: ' + sys.stdin.encoding)

test_file = 'D:/测试.csv'
pd_df = pd.read_csv('D:/test.csv', encoding='utf-8', low_memory=False)
dt_df = dt.Frame(pd_df)
dt_df.to_csv(test_file)

output

defaultencoding: utf-8
stdout.encoding: UTF-8
stdin.encoding: utf-8

The output file is then named D:/娴嬭瘯.csv

print('D:/测试.csv'.encode('utf-8').decode('gbk'))

output

D:/娴嬭瘯.csv

This confirms that the path encoding is being misidentified. However, I don't know how to configure the encoding that dt uses. For now I can only use a crude workaround: reading and saving files under the garbled name, as in dt.fread('D:/娴嬭瘯.csv').

If possible, I would like to know where dt reads its encoding configuration from, and whether that configuration can be changed manually.
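The experiment above can be wrapped into a pair of small helpers. This is a minimal sketch, assuming the system code page is GBK as on the reporter's machine; the function names and the `codepage` parameter are made up for illustration and are not part of datatable.

```python
def as_seen_through_codepage(path, codepage="gbk"):
    """Return the name that results when UTF-8 path bytes are decoded as `codepage`."""
    return path.encode("utf-8").decode(codepage)


def recover_from_codepage(garbled, codepage="gbk"):
    """Invert the mix-up: turn a garbled name back into the intended path."""
    return garbled.encode(codepage).decode("utf-8")


garbled = as_seen_through_codepage("D:/测试.csv")
print(garbled)                         # D:/娴嬭瘯.csv
print(recover_from_codepage(garbled))  # D:/测试.csv
```

Not every UTF-8 byte sequence is valid in GBK, so for other file names the first helper may raise `UnicodeDecodeError`. The garbled-name workaround is therefore not guaranteed to work for arbitrary paths.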


oleksiyskononenko · 2022-12-23T05:23:45Z

The simplest workaround is to rename your file so that its name uses only ASCII characters. To support Unicode file names on Windows, we need to make changes to the datatable source code.
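When renaming the original file is not an option, the same idea can be applied to a temporary copy. The following is a sketch under stated assumptions, not a datatable feature: `read_via_ascii_copy` is a hypothetical helper, `reader` is any callable that takes a path string (for example `dt.fread`), and the system temporary directory is assumed to have an ASCII-only path, which may not hold for Windows accounts with non-ASCII user names.

```python
import shutil
import tempfile
from pathlib import Path


def read_via_ascii_copy(path, reader):
    """Copy `path` to an ASCII-named temporary file and pass that copy to `reader`.

    Hypothetical helper: `reader` is any callable that accepts a path string.
    """
    src = Path(path)
    with tempfile.TemporaryDirectory(prefix="dt_ascii_") as tmp_dir:
        # Keep the original extension so the reader can still detect the format.
        tmp_path = Path(tmp_dir) / ("input" + src.suffix)
        shutil.copyfile(src, tmp_path)
        # The copy is deleted when the `with` block exits.
        return reader(str(tmp_path))
```

With datatable installed, usage would look like `DT = read_via_ascii_copy('D:/测试.csv', dt.fread)`. The copy costs extra disk I/O, and if the reader still holds the file open or memory-mapped when it returns, the cleanup step may fail on Windows.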

TimothyZero · 2023-02-27T19:28:00Z

try this:

with open(f'中文.csv', encoding='utf_8_sig', mode='w') as f:  # utf_8_sig for Excel on windows
    f.write(d.to_csv())

oleksiyskononenko · 2023-02-27T22:47:17Z

@TimothyZero you can even try

with open('中文.csv', encoding='utf_8_sig', mode='r') as f:  # open for reading; utf_8_sig drops a leading BOM if present
    DT = dt.fread(f)

mengdeer589 · 2024-03-13T14:11:34Z

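I tried the with open approach to work around reading files whose path contains Chinese characters, but it significantly increased the time it takes to read the file.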
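One variation that might be worth timing is to read the file as raw bytes instead of through a text-mode handle, so that Python does no decoding of its own. This is only a sketch: `read_csv_bytes` is a hypothetical helper, the thread does not confirm that this is faster, and the commented usage assumes that the `text` argument of `fread` accepts the raw bytes.

```python
from pathlib import Path


def read_csv_bytes(path):
    """Return the raw bytes of a file; Python itself resolves the Unicode path."""
    return Path(path).read_bytes()


# Hypothetical usage, assuming datatable is installed and that fread's
# `text` argument accepts bytes:
#     import datatable as dt
#     DT = dt.fread(text=read_csv_bytes("D:/测试.csv"))
```

This still loads the whole file into memory before parsing, so it may remain slower than letting `fread` open the file directly.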

oleksiyskononenko changed the title ~~fread() cannot read the file containing Chinese path~~ fread() doesn't support unicode in file names on Windows Dec 20, 2022

oleksiyskononenko added the bug label Dec 20, 2022
