OSError when reading file with accents in file path #15086

Closed
JGoutin opened this issue Jan 9, 2017 · 23 comments

@JGoutin

commented Jan 9, 2017

Code Sample, a copy-pastable example if possible

test.txt and test_é.txt have identical contents; only the file name differs:

pd.read_csv('test.txt')
Out[3]: 
   1 1 1
0  1 1 1
1  1 1 1

pd.read_csv('test_é.txt')
Traceback (most recent call last):

  File "<ipython-input-4-fd67679d1d17>", line 1, in <module>
    pd.read_csv('test_é.txt')

  File "d:\app\python36\lib\site-packages\pandas\io\parsers.py", line 646, in parser_f
    return _read(filepath_or_buffer, kwds)

  File "d:\app\python36\lib\site-packages\pandas\io\parsers.py", line 389, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)

  File "d:\app\python36\lib\site-packages\pandas\io\parsers.py", line 730, in __init__
    self._make_engine(self.engine)

  File "d:\app\python36\lib\site-packages\pandas\io\parsers.py", line 923, in _make_engine
    self._engine = CParserWrapper(self.f, **self.options)

  File "d:\app\python36\lib\site-packages\pandas\io\parsers.py", line 1390, in __init__
    self._reader = _parser.TextReader(src, **kwds)

  File "pandas\parser.pyx", line 373, in pandas.parser.TextReader.__cinit__ (pandas\parser.c:4184)

  File "pandas\parser.pyx", line 669, in pandas.parser.TextReader._setup_parser_source (pandas\parser.c:8471)

OSError: Initializing from file failed

Problem description

pandas raises an OSError when reading a file whose path contains accented characters.

The problem is new: it appeared after I upgraded to Python 3.6 and pandas 0.19.2.

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.0.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 94 Stepping 3, GenuineIntel
byteorder: little
LC_ALL: None
LANG: fr
LOCALE: None.None

pandas: 0.19.2
nose: None
pip: 9.0.1
setuptools: 32.3.1
Cython: 0.25.2
numpy: 1.11.3
scipy: 0.18.1
statsmodels: None
xarray: None
IPython: 5.1.0
sphinx: 1.5.1
patsy: None
dateutil: 2.6.0
pytz: 2016.10
blosc: None
bottleneck: 1.2.0
tables: None
numexpr: 2.6.1
matplotlib: 1.5.3
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.999999999
httplib2: None
apiclient: None
sqlalchemy: 1.1.4
pymysql: None
psycopg2: None
jinja2: 2.9.3
boto: None
pandas_datareader: None

@m-charlton

Contributor

commented Jan 9, 2017

Just my two pennies' worth: I quickly tried this on macOS and Ubuntu with no
problems. See below.

Could this be an environment/platform problem? I noticed that the LOCALE is
set to None.None. Unfortunately I do not have a Windows machine to try this
example on. Admittedly this would not explain why you've seen this after the
upgrade to Python 3.6 and pandas 0.19.2.

Note: I just set up a virtualenv with python3.6 and installed pandas 0.19.2 using pip.

>>> import pandas as pd
>>> pd.read_csv('test_é.txt')
   a  b  c
0  1  2  3
1  4  5  6

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.0.final.0
python-bits: 64
OS: Linux
OS-release: 4.4.0-57-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_GB.UTF-8
LOCALE: en_GB.UTF-8

pandas: 0.19.2
nose: None
pip: 9.0.1
setuptools: 32.3.1
Cython: None
numpy: 1.11.3
scipy: None
statsmodels: None
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.6.0
pytz: 2016.10
blosc: None
bottleneck: None
tables: None
numexpr: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
boto: None
pandas_datareader: None

@jreback

Contributor

commented Jan 9, 2017

I believe 3.6 switches the file system encoding on Windows to UTF-8 (from mbcs). Apart from that, we don't have testing enabled on Windows for 3.6 yet (some of the required packages are only now becoming available).

@jreback jreback added the Windows label Jan 9, 2017

@jreback

Contributor

commented Jan 9, 2017

@JGoutin

I just added build support on AppVeyor (Windows) for 3.6, so it would be great if you could push up your tests to see whether it works.

@z94624


commented Jul 16, 2017

I hit the same problem: the program stops at pd.read_csv(file_path). It started after I upgraded Python to 3.6 (I'm not sure exactly which version I had installed before, maybe 3.5).

@tpietruszka


commented Aug 23, 2017

@jreback what is the next step towards a fix here?
You mentioned a PR that got 'blown away'. What does that mean?

While I do not use Windows, I could try to help (I just got a VM to debug a piece of my code that apparently does not work on Windows).

BTW, a workaround: pass a file handle instead of a name
pd.read_csv(open('test_é.txt', 'r'))
(there are several workarounds in related issues, but I have not seen this one)
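
The file-handle workaround above can be sketched as a small helper. This is only an illustration: `read_via_handle` is a hypothetical name, and the UTF-8 content encoding is an assumption about the file, not something stated in this thread. The idea is that Python's own `open()` resolves the Unicode path, so the parser only ever receives an open handle.

```python
def read_via_handle(path, reader, encoding="utf-8"):
    """Open `path` in Python and pass the handle (not the name) to `reader`.

    `read_via_handle` is a hypothetical helper; `encoding` is the assumed
    encoding of the file's contents.
    """
    # newline="" leaves line-ending handling to the parser.
    with open(path, "r", encoding=encoding, newline="") as handle:
        return reader(handle)
```

With pandas this would be called as `read_via_handle('test_é.txt', pd.read_csv)`. Unlike the one-liner, the `with` block also closes the handle once the reader returns, so it only suits readers that consume the whole file in a single call.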

@jreback

Contributor

commented Aug 24, 2017

@tpietruszka see the comments on the PR: #15092 (it got removed from a private fork; it was pretty much there).

You need to encode the paths differently on Python 3.6 (vs. other Python versions) on Windows. In other words, we need to implement: https://docs.python.org/3/whatsnew/3.6.html#pep-529-change-windows-filesystem-encoding-to-utf-8
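
To make the mismatch concrete, here is a small sketch of what can go wrong when a UTF-8 encoded path is interpreted in a legacy Windows code page. It assumes a Western European code page (cp1252), which is a guess about the reporter's machine and not something confirmed in this thread.

```python
name = "test_é.txt"

# Bytes Python 3.6+ produces for the path under PEP 529 (UTF-8):
# b'test_\xc3\xa9.txt'
utf8_bytes = name.encode("utf-8")

# Bytes a cp1252-based API would expect for the same name (assumed code page):
# b'test_\xe9.txt'
ansi_bytes = name.encode("cp1252")

# Reading the UTF-8 bytes as cp1252 yields a different file name
# ('test_Ã©.txt'), so a lookup by that name would not find the file.
misread_name = utf8_bytes.decode("cp1252")
```

The two byte strings differ only in the accented character, which would be consistent with ASCII-only paths continuing to work.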

@dondon2475848


commented Aug 29, 2017

My old code (fails):

import pandas as pd
import os
file_path='./dict/字典.csv'
df_name = pd.read_csv(file_path,sep=',' )

New code (works):

import pandas as pd
import os
file_path='./dict/dict.csv'
df_name = pd.read_csv(file_path,sep=',' )

I think this bug is a filename problem.
After I changed the filename from Chinese to English, it runs.

@fotisj


commented Jan 14, 2018

If anyone else lands here because they hit the same problem, here is a workaround until pandas is fixed to work with PEP 529 (currently, any non-ASCII characters in your path or filename will result in errors):

Insert the following two lines at the beginning of your code to revert to the old way of handling paths on Windows:

import sys
sys._enablelegacywindowsfsencoding()
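
Because `sys._enablelegacywindowsfsencoding()` exists only on Windows builds of Python 3.6 and later, a script that also runs on other platforms may want to guard the call. A minimal sketch, where `use_legacy_windows_fs_encoding` is a hypothetical helper name:

```python
import sys


def use_legacy_windows_fs_encoding():
    """Switch to the pre-3.6 Windows path encoding when that is possible.

    Hypothetical helper. Returns True if the switch was made, False otherwise.
    """
    switch = getattr(sys, "_enablelegacywindowsfsencoding", None)
    if sys.platform == "win32" and switch is not None:
        switch()
        return True
    return False


# Call this once, before any file paths are handled.
changed = use_legacy_windows_fs_encoding()
```

On other platforms the helper does nothing and returns False, so the same script can run unchanged on Linux and macOS.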

@ColdHumour


commented Jan 21, 2018

I used the solution above and it works. Thanks very much @fotisj!
However, I'm still confused about why DataFrame.to_csv() doesn't hit the same problem. In other words, for a Unicode file path, writing works while reading doesn't.

@jreback

Contributor

commented Apr 9, 2018

This just needs a small change in the parser code (there was a PR doing this, but it was deleted).

@jdferreira


commented Apr 20, 2018

@TomAugspurger that does not work. read_csv expects a str and not a bytes value. It fails with

OSError: Expected file path name or file-like object, got <class 'bytes'> type

@TomAugspurger

Contributor

commented Apr 21, 2018

@mmagnuski


commented Nov 21, 2018

Just pinging this - I have the same issue, I'm using a workaround but it would be great if that was not required.

@jreback

Contributor

commented Nov 21, 2018

this needs a community patch

@kchawla-pi


commented Dec 10, 2018

I am encountering this issue. I want to try to contribute a patch. Any pointers on how to start fixing this?

@TomAugspurger

Contributor

commented Dec 10, 2018

I think none of the maintainers have access to a system that can reproduce this.

Perhaps some of the others in this issue can help put together a solution.

gfyoung added a commit to forking-repos/pandas that referenced this issue Jan 13, 2019

COMPAT: Properly encode filenames in read_csv
Python 3.6+ changes the default encoding to
UTF8 (PEP 529), which conflicts with the
encoding of Windows (MBCS).

This fix checks if we're using Python 3.6+
and on Windows, after which we force the
encoding to "mbcs".

Closes pandas-devgh-15086.
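
The check described in this commit message could look roughly like the sketch below. It is not the actual pandas patch: `choose_path_encoding` and `encode_path` are hypothetical names, and the fallback to `sys.getfilesystemencoding()` on other platforms is an assumption.

```python
import sys


def choose_path_encoding(platform=sys.platform, version=sys.version_info):
    """Pick the encoding used to turn a str path into bytes for C code.

    Hypothetical helper: "mbcs" on Windows with Python 3.6+, otherwise
    the interpreter's filesystem encoding (an assumed fallback).
    """
    if platform == "win32" and tuple(version[:2]) >= (3, 6):
        return "mbcs"
    return sys.getfilesystemencoding()


def encode_path(path):
    """Encode str paths; pass bytes through unchanged."""
    if isinstance(path, str):
        return path.encode(choose_path_encoding())
    return path
```

Note that "mbcs" can only represent characters available in the active Windows code page, which may be why the later commit for #25769 describes this approach as a workaround.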

@jreback jreback modified the milestones: Contributions Welcome, 0.24.0 Jan 13, 2019

gfyoung added a commit to forking-repos/pandas that referenced this issue Jan 14, 2019

COMPAT: Properly encode filenames in read_csv
Python 3.6+ changes the default encoding to
UTF8 (PEP 529), which conflicts with the
encoding of Windows (MBCS).

This fix checks if we're using Python 3.6+
and on Windows, after which we force the
encoding to "mbcs".

Closes pandas-devgh-15086.

jreback added a commit that referenced this issue Jan 14, 2019

COMPAT: Properly encode filenames in read_csv (#24758)
Python 3.6+ changes the default encoding to
UTF8 (PEP 529), which conflicts with the
encoding of Windows (MBCS).

This fix checks if we're using Python 3.6+
and on Windows, after which we force the
encoding to "mbcs".

Closes gh-15086.

Pingviinituutti added a commit to Pingviinituutti/pandas that referenced this issue Feb 28, 2019

COMPAT: Properly encode filenames in read_csv (pandas-dev#24758)
Python 3.6+ changes the default encoding to
UTF8 (PEP 529), which conflicts with the
encoding of Windows (MBCS).

This fix checks if we're using Python 3.6+
and on Windows, after which we force the
encoding to "mbcs".

Closes pandas-devgh-15086.

vnlitvin added a commit to anmyachev/pandas that referenced this issue Mar 12, 2019

vnlitvin added a commit to anmyachev/pandas that referenced this issue Mar 14, 2019

vnlitvin added a commit to anmyachev/pandas that referenced this issue Mar 18, 2019

@vnlitvin vnlitvin referenced this issue Mar 18, 2019

Merged

BUG: reading windows utf8 filenames in py3.6 #25769

4 of 4 tasks complete

vnlitvin added a commit to anmyachev/pandas that referenced this issue Mar 20, 2019

jreback added a commit that referenced this issue Mar 20, 2019

BUG: reading windows utf8 filenames in py3.6 (#25769)
* Fix gh-15086 properly instead of making a workaround

* fix code style

* Make sure test_filename_with_special_chars properly tests combinations of chars
Updated whatsnew

* Address comments by @jreback

* Parametrize test_filename_with_special_chars

Use CP-1252 and CP-1251 filenames separately,
skip the test on Windows on < 3.6 as it won't pass

anmyachev added a commit to anmyachev/pandas that referenced this issue Apr 18, 2019

BUG: reading windows utf8 filenames in py3.6 (pandas-dev#25769)
* Fix pandas-devgh-15086 properly instead of making a workaround

* fix code style

* Make sure test_filename_with_special_chars properly tests combinations of chars
Updated whatsnew

* Address comments by @jreback

* Parametrize test_filename_with_special_chars

Use CP-1252 and CP-1251 filenames separately,
skip the test on Windows on < 3.6 as it won't pass

Kiku-git added a commit to Kiku-git/pandas that referenced this issue May 16, 2019

BUG: reading windows utf8 filenames in py3.6 (pandas-dev#25769)
* Fix pandas-devgh-15086 properly instead of making a workaround

* fix code style

* Make sure test_filename_with_special_chars properly tests combinations of chars
Updated whatsnew

* Address comments by @jreback

* Parametrize test_filename_with_special_chars

Use CP-1252 and CP-1251 filenames separately,
skip the test on Windows on < 3.6 as it won't pass