Join GitHub today
GitHub is home to over 20 million developers working together to host and review code, manage projects, and build software together.
read_csv and others derived from _read close user-provided filehandles #14418
Comments
ebolyen
commented
Oct 13, 2016
|
As context for why it would be nice to not close a user-filehandle, we have a function like this: def _parse_blast_data(fh, columns, error, error_message, comment=None,
skiprows=None):
read_csv = functools.partial(pd.read_csv, na_values='N/A', sep='\t',
header=None, keep_default_na=False,
comment=comment, skiprows=skiprows)
lineone = read_csv(fh, nrows=1)
if len(lineone.columns) != len(columns):
raise error(error_message % (len(columns), len(lineone.columns)))
fh.seek(0)
return read_csv(fh, names=columns, dtype=_possible_columns)which reads the first line to check the columns before reading the entire file. |
jorisvandenbossche
added the
Data IO
label
Oct 13, 2016
|
we need to close file handles that pandas opens. These include things like a re-encoding stream and compression streams (e.g. pandas needs to open a new handle). In theory we shouldn't be closing a user stream (though it certainly is possible), its not tested that well. The changes above were to fix NOT closing things in the test suite as PY3 reports unclosed handles much better than PY2. so if you can avoid closing things that are not supposed to be that would be great. |
jreback
changed the title from
`read_csv` and others derived from `_read` close user-provided filehandles to read_csv and others derived from _read close user-provided filehandles
Oct 13, 2016
ebolyen
commented
Oct 13, 2016
•
|
@jreback agree completely, and thank you guys for working to clean up resources opened by pandas. The line I noted in combination with this one is what I believe is causing user sources to be closed. As unlike the After poking through that it occurred to me that it may be specific only to the C parser engine, and testing it looks like that is the case: In [1]: import pandas as pd
In [2]: import io
In [3]: fh = io.StringIO('a,b\n1,2')
In [4]: pd.read_csv(fh, engine='python')
Out[4]:
a b
0 1 2
In [5]: fh.closed
Out[5]: False |
|
@ebolyen yes, the c-engine got a facelift w.r.t. handles in 0.19.0. yeah, we may need to keep some kind of state whether to close a handle |
jreback
added Bug IO CSV Difficulty Intermediate Effort Low
labels
Oct 13, 2016
jreback
added this to the
Next Major Release
milestone
Oct 13, 2016
|
That seems logical. Perhaps close if an actual file BUT keep open any streams (e.g. |
|
Could make an argument either way, but if we want to follow what seems to be the stdlib convention, should also leave handles to actual files open too. In [48]: %%file tmp.json
...: {"a": "22"}
Overwriting tmp.json
In [49]: import json
In [50]: fh = open('tmp.json'); json.load(fh); fh.closed
Out[50]: False |
|
But what about the resource warnings in our tests? I presume we would then just do an assert_produces_warning check? |
ebolyen commentedOct 13, 2016
I believe the "regression" was introduced on this line. That being said, tracking which filehandles a library owns vs what a user provided is hard, and I can't fault you guys if this is considered correct behavior from now on. Just wanted to bring it to your attention. Thanks!
A small, complete example of the issue
In [1]: import pandas as pd In [2]: import io In [3]: fh = io.StringIO('a,b\n1,2\n') In [4]: fh.closed Out[4]: False In [5]: pd.read_csv(fh) Out[5]: a b 0 1 2 In [6]: fh.closed Out[6]: TrueExpected Output
Output of
pd.show_versions()commit: None
python: 3.4.4.final.0
python-bits: 64
OS: Linux
OS-release: 3.13.0-96-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
pandas: 0.19.0
nose: 1.3.7
pip: 8.1.2
setuptools: 20.1.1
Cython: None
numpy: 1.11.2
scipy: 0.18.1
statsmodels: None
xarray: None
IPython: 5.1.0
sphinx: 1.5a2
patsy: None
dateutil: 2.5.3
pytz: 2016.4
blosc: None
bottleneck: None
tables: None
numexpr: None
matplotlib: 1.5.3
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.8
boto: None
pandas_datareader: None