fread() doesn't support unicode in file names on Windows #3400

Open
o414o opened this issue Dec 20, 2022 · 7 comments
Labels
bug Any bugs / errors in datatable; however for severe bugs use [segfault] label

Comments

@o414o

o414o commented Dec 20, 2022

I have just started using datatable and found that reading a file whose path contains Chinese characters raises an IOError.
The same file reads without problems from a path that contains only English characters.
The error message is attached below.
I don't know whether a solution already exists; I searched but could not find one.

IOError                                   Traceback (most recent call last)
<timed exec> in <module>

IOError: Unable to obtain size of D:/测试.csv: [errno 2] No such file or directory
@o414o
Author

o414o commented Dec 20, 2022

Sorry, I forgot to mention that this error occurs only on Windows. The same Chinese path works fine on Linux.

@oleksiyskononenko
Contributor

oleksiyskononenko commented Dec 20, 2022

On Windows we use _stat64() to check the file size. However, it seems that it doesn't support Unicode characters, so we need to switch to _wstat64(), which is essentially the wide-character version of _stat64(). Thanks for reporting the issue.

@oleksiyskononenko oleksiyskononenko changed the title fread() cannot read the file containing Chinese path fread() doesn't support unicode in file names on Windows Dec 20, 2022
@oleksiyskononenko oleksiyskononenko added the bug Any bugs / errors in datatable; however for severe bugs use [segfault] label label Dec 20, 2022
@o414o
Author

o414o commented Dec 21, 2022

Thank you very much for your reply.

I ran some experiments and found that the problem may be that datatable interprets the path as GBK-encoded when it is actually UTF-8, so it cannot find the file.

This is how I tested it:

import pandas as pd
import datatable as dt
import sys
print('defaultencoding: ' + sys.getdefaultencoding())
print('stdout.encoding: ' + sys.stdout.encoding)
print('stdin.encoding: ' + sys.stdin.encoding)

test_file = 'D:/测试.csv'
pd_df = pd.read_csv('D:/test.csv', encoding='utf-8', low_memory=False)
dt_df = dt.Frame(pd_df)
dt_df.to_csv(test_file)

output

defaultencoding: utf-8
stdout.encoding: UTF-8
stdin.encoding: utf-8

The file that actually gets written is D:/娴嬭瘯.csv

print('D:/测试.csv'.encode('utf-8').decode('gbk'))

output

D:/娴嬭瘯.csv

This confirms that the path encoding is being detected incorrectly. However, I don't know how to configure the encoding that dt uses, so for now I can only use a crude workaround: reading and saving files under the garbled name, as in dt.fread('D:/娴嬭瘯.csv').

If possible, I would like to know where dt reads its encoding configuration from, and whether that configuration can be changed manually.
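The experiment above can be condensed into a small sketch that reproduces the garbled name and reverses it. The helper names `mojibake` and `undo_mojibake` are hypothetical and not part of datatable; the code only illustrates the suspected UTF-8 versus GBK mismatch and does not show what datatable does internally.

```python
def mojibake(path: str, actual: str = "utf-8", assumed: str = "gbk") -> str:
    """Return the name obtained when bytes written in `actual` are decoded as `assumed`."""
    return path.encode(actual).decode(assumed)


def undo_mojibake(garbled: str, actual: str = "utf-8", assumed: str = "gbk") -> str:
    """Reverse mojibake(): re-encode with the wrong codec, decode with the right one."""
    return garbled.encode(assumed).decode(actual)


original = "D:/测试.csv"
garbled = mojibake(original)

print(garbled)                             # the name seen on disk in the report
print(undo_mojibake(garbled) == original)  # the round trip recovers the original
```

The round trip works for this particular name. It is not guaranteed for every name, because some UTF-8 byte sequences are not valid GBK and would raise a UnicodeDecodeError.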

On Windows we use stat() to check the file size. However, it seems that it doesn't support unicode characters and we need to switch to _wstat() that is essentially a wide character version of stat(). Thanks for reporting the issue.

@oleksiyskononenko
Contributor

The simplest workaround is to rename your file so that its name uses only ASCII characters. To support Unicode file names on Windows, we need to make changes to the datatable source code.
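When renaming the original file is not convenient, one variation on this workaround is to copy it to a temporary ASCII-only name first. The following is a minimal sketch under that assumption; `ascii_temp_copy` is a hypothetical helper, not a datatable function, and the extra copy costs time and disk space for large files.

```python
import contextlib
import os
import shutil
import tempfile


@contextlib.contextmanager
def ascii_temp_copy(path, dir=None):
    """Yield the path of a temporary copy of `path` whose file name is ASCII-only.

    Only the file name is guaranteed to be ASCII. If the temporary directory
    itself contains non-ASCII characters, pass an ASCII-only `dir` explicitly.
    """
    suffix = os.path.splitext(path)[1]
    if not suffix.isascii():
        suffix = ""  # drop a non-ASCII extension rather than carry it over
    fd, tmp = tempfile.mkstemp(prefix="dt_ascii_", suffix=suffix, dir=dir)
    os.close(fd)  # only the name is needed; copyfile reopens the file itself
    try:
        shutil.copyfile(path, tmp)
        yield tmp
    finally:
        os.remove(tmp)  # clean up the copy even if the reader raised


# Intended use (requires datatable, so left as a comment here):
#
#     import datatable as dt
#     with ascii_temp_copy('D:/测试.csv') as tmp:
#         DT = dt.fread(tmp)
```

The copy itself goes through Python's own file functions, which handled the Unicode path correctly in the experiments reported earlier in this thread.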

@TimothyZero

TimothyZero commented Feb 27, 2023

try this:

with open(f'中文.csv', encoding='utf_8_sig', mode='w') as f:  # utf_8_sig for Excel on windows
    f.write(d.to_csv())

@oleksiyskononenko
Contributor

@TimothyZero you can even try

with open('中文.csv', encoding='utf_8_sig', mode='r') as f:  # open for reading; mode='w' would truncate the file
    DT = dt.fread(f)

@mengdeer589

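I tried the with open approach to work around reading files whose paths contain Chinese characters, but it significantly increased the time needed to read the file.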
