read_csv and others derived from _read close user-provided filehandles #14418

ebolyen · 2016-10-13T19:09:25Z

I believe the "regression" was introduced on this line. That being said, tracking which filehandles a library owns vs what a user provided is hard, and I can't fault you guys if this is considered correct behavior from now on. Just wanted to bring it to your attention. Thanks!

A small, complete example of the issue

In [1]: import pandas as pd

In [2]: import io

In [3]: fh = io.StringIO('a,b\n1,2\n')

In [4]: fh.closed
Out[4]: False

In [5]: pd.read_csv(fh)
Out[5]: 
   a  b
0  1  2

In [6]: fh.closed
Out[6]: True

Expected Output

In [6]: fh.closed
Out[6]: False

Output of `pd.show_versions()`

## INSTALLED VERSIONS

commit: None
python: 3.4.4.final.0
python-bits: 64
OS: Linux
OS-release: 3.13.0-96-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.19.0
nose: 1.3.7
pip: 8.1.2
setuptools: 20.1.1
Cython: None
numpy: 1.11.2
scipy: 0.18.1
statsmodels: None
xarray: None
IPython: 5.1.0
sphinx: 1.5a2
patsy: None
dateutil: 2.5.3
pytz: 2016.4
blosc: None
bottleneck: None
tables: None
numexpr: None
matplotlib: 1.5.3
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.8
boto: None
pandas_datareader: None


ebolyen · 2016-10-13T19:13:20Z

As context for why it would be nice to not close a user-filehandle, we have a function like this:

def _parse_blast_data(fh, columns, error, error_message, comment=None,
                      skiprows=None):
    read_csv = functools.partial(pd.read_csv, na_values='N/A', sep='\t',
                                 header=None, keep_default_na=False,
                                 comment=comment, skiprows=skiprows)
    lineone = read_csv(fh, nrows=1)

    if len(lineone.columns) != len(columns):
        raise error(error_message % (len(columns), len(lineone.columns)))

    fh.seek(0)
    return read_csv(fh, names=columns, dtype=_possible_columns)

which reads the first line to check the columns before reading the entire file.
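
A workaround that avoids handing the same handle to the parser twice is to peek at the first record with the standard library instead. This is only a sketch: `count_first_row_fields` is a hypothetical helper, not part of pandas or of the function above, and it ignores the `comment` and `skiprows` handling that the real function needs.

```python
import csv
import io


def count_first_row_fields(fh, sep="\t"):
    """Return the number of fields in the first record, then rewind fh.

    Hypothetical helper: it never closes fh, so the caller keeps ownership.
    """
    start = fh.tell()
    first_line = fh.readline()
    fh.seek(start)  # rewind so the caller can still parse the whole stream
    if not first_line:
        return 0
    # csv.reader copes with quoted separators inside the first line.
    return len(next(csv.reader([first_line], delimiter=sep)))


fh = io.StringIO("q1\ts1\t99.0\nq2\ts2\t87.5\n")
n_fields = count_first_row_fields(fh)
```

After the call, `fh` is still open and positioned where it started, so a single `read_csv(fh, ...)` can follow the column-count check.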

jreback · 2016-10-13T20:50:46Z

@ebolyen

We need to close file handles that pandas opens. These include things like a re-encoding stream and compression streams (i.e. cases where pandas needs to open a new handle). In theory we shouldn't be closing a user stream (though it certainly is possible); it's not tested that well.

The changes above were to fix NOT closing things in the test suite as PY3 reports unclosed handles much better than PY2.

So if you can avoid closing things that are not supposed to be closed, that would be great.
We also need to test with actual file handles and memory-mapped ones, in addition to a file-like handle.

ebolyen · 2016-10-13T21:10:39Z

@jreback agree completely, and thank you guys for working to clean up resources opened by pandas.

The line I noted, in combination with this one, is what I believe is causing user sources to be closed. Unlike the handles list (which very nicely captures the filehandles that pandas opened), it is always closed regardless of context, and it appears that parser.TextReader uses whatever it was handed as its dsource, which is always closed when the reader itself is closed. So it seems like there would need to be a more context-aware ._reader property.

After poking through that it occurred to me that it may be specific only to the C parser engine, and testing it looks like that is the case:

In [1]: import pandas as pd

In [2]: import io

In [3]: fh = io.StringIO('a,b\n1,2')

In [4]: pd.read_csv(fh, engine='python')
Out[4]: 
   a  b
0  1  2

In [5]: fh.closed
Out[5]: False

jreback · 2016-10-13T21:13:52Z

@ebolyen yes, the c-engine got a facelift w.r.t. handles in 0.19.0.

yeah, we may need to keep some kind of state whether to close a handle
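
The state in question could be as small as a list of the handles the reader opened itself. A minimal sketch, assuming hypothetical helpers `open_source` and `read_all` (neither is pandas API, and this is not how the C parser is implemented):

```python
import gzip
import io


def open_source(path_or_buf, compression=None):
    """Return (handle, owned); owned lists only the handles opened here."""
    owned = []
    if isinstance(path_or_buf, str):
        handle = open(path_or_buf, "rb")
        owned.append(handle)
    else:
        handle = path_or_buf  # user-provided: never added to owned
    if compression == "gzip":
        # GzipFile.close() does not close the fileobj it was given.
        handle = gzip.GzipFile(fileobj=handle, mode="rb")
        owned.append(handle)
    return handle, owned


def read_all(path_or_buf, compression=None):
    handle, owned = open_source(path_or_buf, compression)
    try:
        return handle.read()
    finally:
        # Close only what we opened, innermost wrapper first.
        for h in reversed(owned):
            h.close()


user_buf = io.BytesIO(gzip.compress(b"a,b\n1,2\n"))
data = read_all(user_buf, compression="gzip")
```

Here the decompression stream is closed because the reader created it, while `user_buf` is left open for the caller.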

gfyoung · 2016-10-14T03:57:08Z

That seems logical. Perhaps close if it is an actual file BUT keep open any streams (e.g. StringIO), as that could be easily checked, IINM?
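
One way such a check might look, as a sketch only: `is_os_level_file` is a hypothetical name, and the test relies on in-memory streams refusing `fileno()`.

```python
import io
import tempfile


def is_os_level_file(fh):
    """Hypothetical check: True if fh is backed by an OS file descriptor."""
    try:
        fh.fileno()
    except (AttributeError, OSError, ValueError):
        # In-memory streams such as StringIO raise io.UnsupportedOperation,
        # which subclasses both OSError and ValueError.
        return False
    return True


in_memory = is_os_level_file(io.StringIO("a,b\n1,2\n"))
with tempfile.TemporaryFile("w+") as tmp:
    on_disk = is_os_level_file(tmp)
```

This distinguishes the two kinds of handle but says nothing about who opened them, which is the question that matters for deciding whether to close.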

jorisvandenbossche · 2016-10-14T14:47:18Z

@agraboso or @gfyoung if you would have time to give this a look, certainly welcome! :-) Would be nice to fix this in 0.19.1

chris-b1 · 2016-10-14T17:04:46Z

Could make an argument either way, but if we want to follow what seems to be the stdlib convention, we should also leave handles to actual files open.

In [48]: %%file tmp.json
    ...: {"a": "22"}
Overwriting tmp.json

In [49]: import json

In [50]: fh = open('tmp.json'); json.load(fh); fh.closed
Out[50]: False

gfyoung · 2016-10-15T05:11:43Z

But what about the resource warnings in our tests? I presume we would then just do an assert_produces_warning check?

jorisvandenbossche · 2016-10-31T20:49:13Z

@ebolyen BTW, you are always welcome to test #14520 (but given that the tests pass, it should work, I think). It will be in the upcoming 0.19.1.

tpllaha · 2020-09-25T16:30:54Z

This seems to still be happening (in version 1.1.2 & python 3.8.5), but only in case a file opened in binary mode is passed AND the encoding parameter is provided (even if it has the default utf-8 value).

Script to reproduce:

import sys
import pandas
from io import BytesIO, StringIO

print(f'Python version: {sys.version}')
print(f'pandas version: {pandas.__version__}')

string_io = StringIO('a,b\n1,2')
bytes_io_1 = BytesIO(b'a,b\n1,2')
bytes_io_2 = BytesIO(b'a,b\n1,2')

pandas.read_csv(string_io)
print(f'Was StringIO closed? - {string_io.closed}')

pandas.read_csv(bytes_io_1)
print(f'Was BytesIO closed when encoding is NOT passed? - {bytes_io_1.closed}')

pandas.read_csv(bytes_io_2, encoding='utf-8')
print(f'Was BytesIO closed when encoding is passed? - {bytes_io_2.closed}')

prints:

Python version: 3.8.5 (v3.8.5:580fbb018f, Jul 20 2020, 12:11:27) 
[Clang 6.0 (clang-600.0.57)]
pandas version: 1.1.2
Was StringIO closed? - False
Was BytesIO closed when encoding is NOT passed? - False
Was BytesIO closed when encoding is passed? - True
                                              ^
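
One standard-library behavior that can produce exactly this pattern is that closing an `io.TextIOWrapper` also closes the binary buffer it wraps. Whether pandas wraps the `BytesIO` this way when `encoding` is passed is an assumption here; the output above does not confirm it. The sketch below uses only the `io` module and shows `detach()` as the usual way to release the buffer without closing it.

```python
import io

# Closing a TextIOWrapper also closes the underlying binary buffer.
raw = io.BytesIO(b"a,b\n1,2")
wrapper = io.TextIOWrapper(raw, encoding="utf-8")
text = wrapper.read()
wrapper.close()
closed_after_close = raw.closed

# detach() separates the wrapper from the buffer, leaving the buffer open.
raw2 = io.BytesIO(b"a,b\n1,2")
wrapper2 = io.TextIOWrapper(raw2, encoding="utf-8")
text2 = wrapper2.read()
wrapper2.detach()
closed_after_detach = raw2.closed

print(f"Closed after wrapper.close()? - {closed_after_close}")
print(f"Closed after wrapper.detach()? - {closed_after_detach}")
```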

jorisvandenbossche · 2020-09-26T07:03:49Z

@tpllaha this is a very old issue. Can you open a new issue with your reproducible example?

tpllaha · 2020-10-08T12:34:54Z

@jorisvandenbossche done in #36980

jorisvandenbossche added the IO Data label Oct 13, 2016

jreback changed the title ~~read_csv and others derived from _read close user-provided filehandles~~ read_csv and others derived from _read close user-provided filehandles Oct 13, 2016

jreback added the Bug, IO CSV, and Difficulty Intermediate labels Oct 13, 2016

jreback added this to the Next Major Release milestone Oct 13, 2016

thequackdaddy mentioned this issue Oct 17, 2016

TST: Fixed failing tests statsmodels/statsmodels#3239

Merged

jorisvandenbossche mentioned this issue Oct 27, 2016

BUG: don't close user-provided file handles in C parser (GH14418) #14520

Merged

jorisvandenbossche closed this as completed in #14520 Nov 2, 2016

jorisvandenbossche modified the milestones: 0.19.1, Next Major Release Nov 2, 2016

jorisvandenbossche added the Regression label Nov 2, 2016

tpllaha mentioned this issue Oct 8, 2020

BUG: Pandas closes user-provided file handles that it doesn't own. #36980

Closed
