BUG: to_parquet corrupting temporary files #37257

deschman · 2020-10-19T17:20:23Z

Code Sample, a copy-pastable example

I get an OSError from the following code:

from tempfile import NamedTemporaryFile
import pandas as pd

with NamedTemporaryFile(suffix='.gz') as file:
    df = pd.DataFrame({'A': [1, 2, 3]})
    df.to_parquet(file)

    df['A'] = list('abc')
    df.to_parquet(file)

    df = pd.read_parquet(file)

  File "C:\Users\deschman\AppData\Local\Continuum\anaconda3\Lib\site-packages\mymodules\reporting\untitled0.py", line 11, in <module>
    df = pd.read_parquet(file)

  File "C:\Users\deschman\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\io\parquet.py", line 317, in read_parquet
    return impl.read(path, columns=columns, **kwargs)

  File "C:\Users\deschman\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\io\parquet.py", line 142, in read
    path, columns=columns, filesystem=fs, **kwargs

  File "C:\Users\deschman\AppData\Local\Continuum\anaconda3\lib\site-packages\pyarrow\parquet.py", line 1595, in read_table
    use_pandas_metadata=use_pandas_metadata)

  File "C:\Users\deschman\AppData\Local\Continuum\anaconda3\lib\site-packages\pyarrow\parquet.py", line 1475, in read
    use_threads=use_threads

  File "pyarrow\_dataset.pyx", line 399, in pyarrow._dataset.Dataset.to_table

  File "pyarrow\_dataset.pyx", line 1994, in pyarrow._dataset.Scanner.to_table

  File "pyarrow\error.pxi", line 122, in pyarrow.lib.pyarrow_internal_check_status

  File "pyarrow\error.pxi", line 99, in pyarrow.lib.check_status

OSError: Unexpected end of stream

Problem description

The error message raised by pyarrow does not indicate where in the code the problem lies. I did not test with the fastparquet engine, as it does not currently seem to support _TemporaryFileWrapper. If this should be reported to pyarrow instead, please let me know.

Expected Output

If a temporary file should not be overwritten by to_parquet multiple times, I would expect to_parquet to raise an error before corrupting the file.
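
A possible workaround, offered only as a sketch: assuming the corruption comes from the second to_parquet call starting at the handle's current position rather than at the start of the file (this is not confirmed anywhere in this thread), the handle can be rewound and truncated between writes. The example below uses plain bytes in place of Parquet data so that it runs without pandas or pyarrow; `reset_handle` and `write_twice` are hypothetical helpers, not pandas or pyarrow functions.

```python
from tempfile import TemporaryFile


def reset_handle(handle):
    """Hypothetical helper: rewind to the start and discard existing contents."""
    handle.seek(0)
    handle.truncate()


def write_twice(reset_between_writes):
    """Write two payloads to one temporary file and return its final contents."""
    with TemporaryFile() as handle:
        handle.write(b"first payload")
        if reset_between_writes:
            reset_handle(handle)
        handle.write(b"second")
        handle.seek(0)  # rewind before reading back
        return handle.read()


# Without a reset, the second write lands after the first one.
appended = write_twice(reset_between_writes=False)
# With a reset, only the second payload remains in the file.
overwritten = write_twice(reset_between_writes=True)
```

In the code sample above, the equivalent would be calling file.seek(0) and file.truncate() before the second df.to_parquet(file), and file.seek(0) before pd.read_parquet(file). Whether that avoids the OSError with pyarrow 1.0.1 has not been verified here.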

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit : db08276
python : 3.7.7.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.17763
machine : AMD64
processor : Intel64 Family 6 Model 94 Stepping 3, GenuineIntel
byteorder : little
LC_ALL : None
LANG : en
LOCALE : None.None

pandas : 1.1.3
numpy : 1.18.1
pytz : 2019.3
dateutil : 2.8.1
pip : 20.0.2
setuptools : 44.1.0
Cython : 0.29.15
pytest : 5.4.1
hypothesis : 5.8.3
sphinx : 2.4.4
blosc : None
feather : None
xlsxwriter : 1.2.8
lxml.etree : 4.5.0
html5lib : 1.0.1
pymysql : None
psycopg2 : None
jinja2 : 2.11.1
IPython : 7.13.0
pandas_datareader: None
bs4 : 4.8.2
bottleneck : 1.3.2
fsspec : 0.7.1
fastparquet : 0.4.1
gcsfs : None
matplotlib : 3.1.3
numexpr : 2.7.1
odfpy : None
openpyxl : 3.0.3
pandas_gbq : None
pyarrow : 1.0.1
pytables : None
pyxlsb : None
s3fs : None
scipy : 1.4.1
sqlalchemy : 1.3.16
tables : 3.6.1
tabulate : None
xarray : None
xlrd : 1.2.0
xlwt : 1.3.0
numba : 0.48.0


deschman · 2020-11-03T15:05:12Z

If any additional information is needed for this issue, please let me know.

ivanovmg · 2020-11-13T15:19:43Z

I confirm the same error on master.

mvdornellas · 2021-02-18T19:43:42Z

I have the same problem with the code below.

Input

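import io
import boto3
import pandas as pd

def getFile(path):
    # `bucket` is not defined in this snippet; it is assumed to be set elsewhere
    s3 = boto3.resource('s3')
    buffer = io.BytesIO()
    s3_object = s3.Object(bucket, path)
    s3_object.download_fileobj(buffer)  # download the object into memory
    df = pd.read_parquet(buffer, engine='pyarrow')
    return df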

Output

Environment
AWS

Pandas version
1.0.1

Has anyone managed to solve it?

deschman added the Bug and Needs Triage labels on Oct 19, 2020

jbrockmendel added the IO Parquet label and removed the Needs Triage label on Jun 6, 2021
