BUG: to_parquet corrupting temporary files #37257

Open
deschman opened this issue Oct 19, 2020 · 3 comments
Labels
Bug, IO Parquet (parquet, feather)

Comments

@deschman

Code Sample, a copy-pastable example

I get an OSError from the following code:

from tempfile import NamedTemporaryFile
import pandas as pd

with NamedTemporaryFile(suffix='.gz') as file:
    df = pd.DataFrame({'A': [1, 2, 3]})
    df.to_parquet(file)

    df['A'] = list('abc')
    df.to_parquet(file)

    df = pd.read_parquet(file)

  File "C:\Users\deschman\AppData\Local\Continuum\anaconda3\Lib\site-packages\mymodules\reporting\untitled0.py", line 11, in <module>
    df = pd.read_parquet(file)

  File "C:\Users\deschman\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\io\parquet.py", line 317, in read_parquet
    return impl.read(path, columns=columns, **kwargs)

  File "C:\Users\deschman\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\io\parquet.py", line 142, in read
    path, columns=columns, filesystem=fs, **kwargs

  File "C:\Users\deschman\AppData\Local\Continuum\anaconda3\lib\site-packages\pyarrow\parquet.py", line 1595, in read_table
    use_pandas_metadata=use_pandas_metadata)

  File "C:\Users\deschman\AppData\Local\Continuum\anaconda3\lib\site-packages\pyarrow\parquet.py", line 1475, in read
    use_threads=use_threads

  File "pyarrow\_dataset.pyx", line 399, in pyarrow._dataset.Dataset.to_table

  File "pyarrow\_dataset.pyx", line 1994, in pyarrow._dataset.Scanner.to_table

  File "pyarrow\error.pxi", line 122, in pyarrow.lib.pyarrow_internal_check_status

  File "pyarrow\error.pxi", line 99, in pyarrow.lib.check_status

OSError: Unexpected end of stream

Problem description

The error message raised by pyarrow does not indicate which part of the code caused the problem. I did not test with the fastparquet engine because it does not currently seem to support _TemporaryFileWrapper. If this should be reported to pyarrow instead, please let me know.

Expected Output

If a temporary file should not be overwritten by to_parquet multiple times, I would expect to_parquet to raise an error before corrupting the file.
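
One possible explanation, not confirmed anywhere in this thread, is that the second to_parquet call continues writing from the handle's current offset instead of replacing the first write. The sketch below does not use pandas or pyarrow at all. It only shows, with plain byte writes and a hypothetical helper named write_payloads, how an open NamedTemporaryFile behaves when it is written to twice, and how rewinding and truncating the handle between writes changes the result.

```python
from tempfile import NamedTemporaryFile


def write_payloads(reset_between_writes):
    """Write two payloads to one open temporary file and return its bytes.

    Hypothetical helper for illustration only; it stands in for two
    consecutive writes to the same handle and is not pandas code.
    """
    with NamedTemporaryFile(suffix='.gz') as file:
        file.write(b'first-payload')
        if reset_between_writes:
            # Rewind to the start and discard everything already written.
            file.seek(0)
            file.truncate()
        file.write(b'second')
        file.flush()
        # Rewind again so that read() returns the whole file.
        file.seek(0)
        return file.read()


appended = write_payloads(reset_between_writes=False)
replaced = write_payloads(reset_between_writes=True)
print(appended)
print(replaced)
```

Without the reset, the file holds both payloads back to back. With the reset, only the second payload remains. If the assumption about to_parquet holds, calling file.seek(0) and file.truncate() between the two df.to_parquet(file) calls would be the analogous workaround, but that has not been verified here.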

Output of pd.show_versions()

INSTALLED VERSIONS

commit : db08276
python : 3.7.7.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.17763
machine : AMD64
processor : Intel64 Family 6 Model 94 Stepping 3, GenuineIntel
byteorder : little
LC_ALL : None
LANG : en
LOCALE : None.None

pandas : 1.1.3
numpy : 1.18.1
pytz : 2019.3
dateutil : 2.8.1
pip : 20.0.2
setuptools : 44.1.0
Cython : 0.29.15
pytest : 5.4.1
hypothesis : 5.8.3
sphinx : 2.4.4
blosc : None
feather : None
xlsxwriter : 1.2.8
lxml.etree : 4.5.0
html5lib : 1.0.1
pymysql : None
psycopg2 : None
jinja2 : 2.11.1
IPython : 7.13.0
pandas_datareader: None
bs4 : 4.8.2
bottleneck : 1.3.2
fsspec : 0.7.1
fastparquet : 0.4.1
gcsfs : None
matplotlib : 3.1.3
numexpr : 2.7.1
odfpy : None
openpyxl : 3.0.3
pandas_gbq : None
pyarrow : 1.0.1
pytables : None
pyxlsb : None
s3fs : None
scipy : 1.4.1
sqlalchemy : 1.3.16
tables : 3.6.1
tabulate : None
xarray : None
xlrd : 1.2.0
xlwt : 1.3.0
numba : 0.48.0

deschman added the Bug and Needs Triage labels on Oct 19, 2020
@deschman
Author

deschman commented Nov 3, 2020

If any additional information is needed for this issue, please let me know.

@ivanovmg
Member

I confirm the same error on master.

@mvdornellas

mvdornellas commented Feb 18, 2021

I have the same problem with the code below.

Input

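import io

import boto3
import pandas as pd

# `bucket` is not defined in this snippet; it is assumed to be set elsewhere.
def getFile(path):
    s3 = boto3.resource('s3')
    buffer = io.BytesIO()
    s3_object = s3.Object(bucket, path)
    # Download the S3 object into the in-memory buffer.
    s3_object.download_fileobj(buffer)
    df = pd.read_parquet(buffer, engine='pyarrow')
    return df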

Output
[image: screenshot not recovered]

Environment
AWS

Pandas version
1.0.1

Has anyone managed to solve it?

jbrockmendel added the IO Parquet label and removed the Needs Triage label on Jun 6, 2021