BUG: read_csv - file left open after UnicodeDecodeError when sep=None #39024

davemfish · 2021-01-07T19:40:37Z

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
(optional) I have confirmed this bug exists on the master branch of pandas.


Code Sample, a copy-pastable example

import tempfile
import textwrap
import pandas
import os

workspace_dir = tempfile.mkdtemp()
csv_file = os.path.join(workspace_dir, 'non-utf8.csv')
# encode with ISO Cyrillic, include a non-ASCII character to achieve UnicodeDecodeError
with open(csv_file, 'w', encoding='iso8859_5') as file_obj:
    file_obj.write(textwrap.dedent(
        """
        header,
        fЮЮ,
        bar
        """
    ).strip())

try:
    dataframe = pandas.read_csv(csv_file, sep=None)
except UnicodeDecodeError as error:
    os.remove(csv_file)
    raise

Problem description

os.remove raises a PermissionError on Windows, apparently because the file handle opened by read_csv is still open. This only happens when the sep=None kwarg is used. Without that kwarg, the UnicodeDecodeError is raised and the file can be removed, which is the expected behavior.
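A possible caller-side workaround, sketched under assumptions: open the file yourself in a with block and pass the open handle to the parser, so that closing the file never depends on the parser's error handling. The names read_with_managed_handle and toy_reader are hypothetical and exist only for illustration; toy_reader is a stand-in for a real parser.

```python
def read_with_managed_handle(path, reader, encoding="utf-8", **kwargs):
    """Open `path` ourselves so the handle is closed even if `reader` raises."""
    with open(path, encoding=encoding, newline="") as handle:
        # The with block closes `handle` on both the success and the error path.
        return reader(handle, **kwargs)


def toy_reader(handle, sep=","):
    """Hypothetical stand-in parser: split each line of the handle on `sep`."""
    # Iterating the handle decodes the bytes, so invalid input raises
    # UnicodeDecodeError here, inside the caller's with block.
    return [line.rstrip("\r\n").split(sep) for line in handle]
```

With pandas, the call would presumably be read_with_managed_handle(csv_file, pandas.read_csv, sep=None, engine="python"), since read_csv accepts an open file object. That combination was not tested in this thread.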

Expected Output

Traceback (most recent call last):
  File "..\scratch\pandas_file_handle.py", line 19, in <module>
    dataframe = pandas.read_csv(csv_file, sep=None)
  File "C:\Users\dmf\projects\invest\env\lib\site-packages\pandas\io\parsers.py", line 605, in read_csv
    return _read(filepath_or_buffer, kwds)
  File "C:\Users\dmf\projects\invest\env\lib\site-packages\pandas\io\parsers.py", line 457, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
  File "C:\Users\dmf\projects\invest\env\lib\site-packages\pandas\io\parsers.py", line 814, in __init__
    self._engine = self._make_engine(self.engine)
  File "C:\Users\dmf\projects\invest\env\lib\site-packages\pandas\io\parsers.py", line 1045, in _make_engine
    return mapping[engine](self.f, **self.options)  # type: ignore[call-arg]
  File "C:\Users\dmf\projects\invest\env\lib\site-packages\pandas\io\parsers.py", line 2291, in __init__
    self._make_reader(self.handles.handle)
  File "C:\Users\dmf\projects\invest\env\lib\site-packages\pandas\io\parsers.py", line 2412, in _make_reader
    line = f.readline()
  File "C:\Users\dmf\projects\invest\env\lib\codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xce in position 10: invalid continuation byte

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit : 3e89b4c
python : 3.7.9.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.19041
machine : AMD64
processor : AMD64 Family 23 Model 1 Stepping 1, AuthenticAMD
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : None.None

pandas : 1.2.0
numpy : 1.19.2
pytz : 2020.5
dateutil : 2.8.1
pip : 20.2.4
setuptools : 49.6.0.post20201009
Cython : 0.29.21
pytest : 6.1.1
hypothesis : None
sphinx : 3.2.1
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.11.2
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyxlsb : None
s3fs : None
scipy : 1.4.1
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : 1.2.0
xlwt : 1.3.0
numba : None


twoertwein · 2021-01-07T20:07:15Z

Thank you for the example!

When I run your code on Linux, it doesn't even reach the except branch. I only get a warning that the python engine should be used:

ParserWarning: Falling back to the 'python' engine because the 'c' engine does not support sep=None with delim_whitespace=False; you can avoid this warning by specifying engine='python'.

Running the code with -W default doesn't show any resource warnings. Do you get warnings when you run the example with -W default? That might help us narrow down where the file isn't closed.

Edit: when you run the code without sep, can you please try it once with engine="c" (that should be the case where it fails as expected) and then with engine="python"? Maybe it only fails with sep=None because that falls back to the python engine.
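The -W default check can also be done in-process. The following is a minimal sketch, assuming CPython's behavior of emitting a ResourceWarning when an unclosed file object is finalized; the helper name unclosed_file_warnings is hypothetical.

```python
import gc
import warnings


def unclosed_file_warnings(func, *args, **kwargs):
    """Call `func` and return the ResourceWarning messages it triggers."""
    with warnings.catch_warnings(record=True) as caught:
        # ResourceWarning is ignored by default; record every occurrence.
        warnings.simplefilter("always", ResourceWarning)
        try:
            func(*args, **kwargs)
        except Exception:
            # The error itself is not of interest here, only leaked handles.
            pass
        # Force finalization of any file objects that are no longer referenced.
        gc.collect()
    return [
        str(w.message) for w in caught if issubclass(w.category, ResourceWarning)
    ]
```

A non-empty result whose messages start with "unclosed file" points to a handle that was never closed, which is the symptom to look for around the read_csv call.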

twoertwein · 2021-01-07T20:22:54Z

I found a different example that fails for a different reason (Error("Could not determine delimiter")), but here too read_csv ends up not closing its file handle.

import os
from pathlib import Path

import pandas

file = Path("non-utf8.csv")
file.write_bytes(b"\xe4\na\n1")  # non utf-8 character

dataframe = pandas.read_csv(file, engine="python", sep=None)

sys:1: ResourceWarning: unclosed file <_io.TextIOWrapper name='non-utf8.csv' mode='r' encoding='utf-8'>
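The general pattern that avoids this kind of leak can be sketched as follows: a reader that opens its own handle should close it if setup fails before the object is fully constructed. This is an illustration only, not pandas's implementation; the class SniffingReader and its attributes are hypothetical.

```python
class SniffingReader:
    """Hypothetical reader that owns the file handle it opens."""

    def __init__(self, path, encoding="utf-8"):
        self.handle = open(path, encoding=encoding, newline="")
        try:
            # Reading the first line (e.g. to guess a delimiter) is where a
            # decode error can surface during construction.
            self.first_line = self.handle.readline()
        except Exception:
            # The caller never receives this object, so nobody else can close
            # the handle; close it here before re-raising.
            self.handle.close()
            raise

    def close(self):
        """Close the owned handle after normal use."""
        self.handle.close()
```

Without the try/except, a failure in readline would leave the handle open until garbage collection, which on Windows is enough to block os.remove.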

davemfish · 2021-01-07T20:37:27Z

@twoertwein thanks for looking into this!

> thank you for the example!
>
> When I run your code on Linux, it doesn't even reach the except branch. I only get a warning that the python engine should be used:
>
> ParserWarning: Falling back to the 'python' engine because the 'c' engine does not support sep=None with delim_whitespace=False; you can avoid this warning by specifying engine='python'.
>
> Running the code with -W default doesn't show any resource warnings. Do you get warnings when you run the example with -W default? That might help us narrow down where the file isn't closed.

Yes, with -W default I get a ResourceWarning:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xce in position 10: invalid continuation byte

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "..\scratch\pandas_file_handle.py", line 21, in <module>
    os.remove(csv_file)
PermissionError: [WinError 32] The process cannot access the file because it is being used by another process: 'C:\\Users\\dmf\\AppData\\Local\\Temp\\tmppqtn8h_p\\non-utf8.csv'
sys:1: ResourceWarning: unclosed file <_io.TextIOWrapper name='C:\\Users\\dmf\\AppData\\Local\\Temp\\tmppqtn8h_p\\non-utf8.csv' mode='r' encoding='utf-8'>

> edit: when you run the code without sep, can you please try it once with engine="c" (that should be the case where it fails as expected)

Yes, pandas.read_csv(csv_file, engine="c") fails as expected with a UnicodeDecodeError.

> and then engine="python"? Maybe it just fails with sep because it is using the python engine.

pandas.read_csv(csv_file, engine="python") gives the same result as engine="c" above.

davemfish added the Bug and Needs Triage labels Jan 7, 2021

davemfish mentioned this issue Jan 7, 2021

pandas 1.2.0 compatibility natcap/invest#428

Closed

twoertwein mentioned this issue Jan 7, 2021

BUG: read_csv does not close file during an error in _make_reader #39029

Merged


jreback added this to the 1.3 milestone Jan 13, 2021

jreback added the IO CSV label and removed the Needs Triage label Jan 13, 2021

jreback modified the milestones: 1.3, 1.2.1 Jan 13, 2021

jreback closed this as completed in #39029 Jan 13, 2021
