Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: parser_source as a filename with multibyte characters in Windows(non utf-8 filesystem) #6807

Closed
wants to merge 8 commits into from

Conversation

hshimizu77
Copy link
Contributor

fopen() in Windows doesn't accept utf-8 encoded filename with multibyte characters, so need to convert it to filesystem encoding.
Set 'utf-8' as default in case sys.getfilesystemencoding() return None.
sys.getfilesystemencoding() will return 'mbcs' in Windows, and will 'utf-8' or user setting in other systems.

@jreback
Copy link
Contributor

jreback commented Apr 5, 2014

thanks

can u write a test that fails without this change and works with it ?

@jreback jreback added this to the 0.15.0 milestone Apr 22, 2014
@jreback
Copy link
Contributor

jreback commented Apr 22, 2014

@hshimizu77 tests?

@hshimizu77
Copy link
Contributor Author

I have added a test but I'm wondering it's right place or not.

@@ -535,7 +535,10 @@ cdef class TextReader:

if isinstance(source, basestring):
if not isinstance(source, bytes):
source = source.encode('utf-8')
if sys.getfilesystemencoding() is None:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can just do:

source = source.encode(sys.getfilesystemencoding() or 'utf-8')

@jreback
Copy link
Contributor

jreback commented Apr 23, 2014

that's the correct place

@jreback
Copy link
Contributor

jreback commented Apr 23, 2014

@TomAugspurger can you test this on OSX?

@hshimizu77 ok...let me test on windows; I think we may put this in w/o the test (but don't change the test ATM).

as I don't think pandas will install on some systems because of the encoding of the filename (in theory it should always work, but don't want to bork older systems).

@jreback jreback modified the milestones: 0.14.0, 0.15.0 Apr 23, 2014
@TomAugspurger
Copy link
Contributor

The test errors

(pandas-dev) ~/E/p/l/p/s/p/p/i/tests (hshimizu77-bug-fix=) nosetests test_parsers:ParserTests.test_read_csv_example_in_windows_filesystem
E
======================================================================
ERROR: test_parsers.ParserTests.test_read_csv_example_in_windows_filesystem
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/admin/Envs/pandas-dev/lib/python2.7/site-packages/nose/case.py", line 197, in runTest
    self.test(*self.arg)
  File "/Users/admin/Envs/pandas-dev/lib/python2.7/site-packages/pandas/pandas/io/tests/test_parsers.py", line 1827, in test_read_csv_example_in_windows_filesystem
    self.read_csv(path)
  File "/Users/admin/Envs/pandas-dev/lib/python2.7/site-packages/pandas/pandas/io/tests/test_parsers.py", line 55, in read_csv
    raise NotImplementedError
NotImplementedError

----------------------------------------------------------------------
Ran 1 test in 0.067s

FAILED (errors=1)

Seems to work from the interpreter though:

In [1]: from test_parsers import *

In [2]: path = tm.get_data_path(u'日本語ファイル名テスト_read_csv_in_win_filesystem.csv')

In [3]: pd.read_csv(path)
Out[3]: 
     A    B    C
0  100  200  300
1  aaa  bbb  ccc

[2 rows x 3 columns]

@TomAugspurger
Copy link
Contributor

I think you want read_csv not self.read_csv

@jreback
Copy link
Contributor

jreback commented Apr 23, 2014

@hshimizu77

for this

path = tm.get_data_path(u'日本語ファイル名テスト_read_csv_in_win_filesystem.csv')

use u(.....) instead of the a literal u (this is translated by the compat module as needed for py2/3

@jreback
Copy link
Contributor

jreback commented Apr 30, 2014

merged via 79df67a

thanks!

@jreback jreback closed this Apr 30, 2014
@hshimizu77 hshimizu77 deleted the bug-fix branch May 3, 2014 13:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
IO CSV read_csv, to_csv Unicode Unicode strings Windows Windows OS
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants