BUG: DataFrame.to_parquet doesn't round-trip pyarrow StringDtype #42664

Open
2 of 3 tasks
TomAugspurger opened this issue Jul 22, 2021 · 14 comments
Labels
Docs, good first issue, IO Parquet (parquet, feather), Strings (String extension data type and string data)

Comments

@TomAugspurger
Contributor

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.


Code Sample, a copy-pastable example

import pandas as pd
a = pd.DataFrame({"A": pd.array(['a', 'b'], dtype=pd.StringDtype("pyarrow"))})
a.to_parquet("test.parquet")
b = pd.read_parquet("test.parquet")
pd.testing.assert_frame_equal(a, b)

---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
/tmp/ipykernel_493/3001616580.py in <module>
      3 a.to_parquet("test.parquet")
      4 b = pd.read_parquet("test.parquet")
----> 5 pd.testing.assert_frame_equal(a, b)

    [... skipping hidden 3 frame]

/srv/conda/envs/notebook/lib/python3.8/site-packages/pandas/_testing/asserters.py in raise_assert_detail(obj, message, left, right, diff, index_values)
    663         msg += f"\n[diff]: {diff}"
    664 
--> 665     raise AssertionError(msg)
    666 
    667 

AssertionError: Attributes of DataFrame.iloc[:, 0] (column name="A") are different

Attribute "dtype" are different
[left]:  string[pyarrow]
[right]: string[python]

Problem description

read_parquet currently loads all string dtype as string[python]. We'd ideally match what was written.

Expected Output

A DataFrame with string[pyarrow] rather than string[python]

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit           : f00ed8f47020034e752baf0250483053340971b0
python           : 3.8.10.final.0
python-bits      : 64
OS               : Linux
OS-release       : 5.4.0-1040-azure
Version          : #42~18.04.1-Ubuntu SMP Mon Feb 8 19:05:32 UTC 2021
machine          : x86_64
processor        : x86_64
byteorder        : little
LC_ALL           : C.UTF-8
LANG             : C.UTF-8
LOCALE           : en_US.UTF-8

pandas           : 1.3.0
numpy            : 1.21.0
pytz             : 2021.1
dateutil         : 2.7.5
pip              : 20.3.4
setuptools       : 49.6.0.post20210108
Cython           : None
pytest           : None
hypothesis       : None
sphinx           : None
blosc            : 1.10.2
feather          : None
xlsxwriter       : None
lxml.etree       : None
html5lib         : None
pymysql          : None
psycopg2         : 2.9.1 (dt dec pq3 ext lo64)
jinja2           : 3.0.1
IPython          : 7.25.0
pandas_datareader: None
bs4              : 4.9.3
bottleneck       : 1.3.2
fsspec           : 2021.06.1
fastparquet      : None
gcsfs            : 2021.06.1
matplotlib       : 3.4.2
numexpr          : None
odfpy            : None
openpyxl         : None
pandas_gbq       : None
pyarrow          : 4.0.0
pyxlsb           : None
s3fs             : 2021.06.1
scipy            : 1.7.0
sqlalchemy       : 1.4.20
tables           : None
tabulate         : 0.8.9
xarray           : 0.18.2
xlrd             : None
xlwt             : None
numba            : 0.53.1

@TomAugspurger added the Bug, Strings (String extension data type and string data), and IO Parquet (parquet, feather) labels on Jul 22, 2021
@simonjayhawkins
Member

This is the tested behaviour. The storage type is an implementation detail. This allows reading a parquet file without pyarrow. If you change the global string_storage setting to pyarrow, you can read any string array into a pyarrow-backed string array.

import pandas as pd
a = pd.DataFrame({"A": pd.array(['a', 'b'], dtype=pd.StringDtype("pyarrow"))})
a2 = pd.DataFrame({"A": pd.array(['a', 'b'], dtype=pd.StringDtype("python"))})

a.to_parquet("test.parquet")
with pd.option_context("string_storage", "pyarrow"):
    b = pd.read_parquet("test.parquet")
pd.testing.assert_frame_equal(b, a)

a2.to_parquet("test.parquet")
with pd.option_context("string_storage", "pyarrow"):
    b = pd.read_parquet("test.parquet")
pd.testing.assert_frame_equal(b, a)

@TomAugspurger
Contributor Author

Thanks.

I think it's worth adding a small note to the docs, both in user_guide/text.rst and user_guide/io.rst noting that

  1. The parquet representation of StringDtype is the same, regardless of the storage (is that correct?)
  2. The data will be read in according to the string_storage setting (and show an example of getting pyarrow storage)

@jeremyswerdlow

Hi there, long time user, first time contributor. Is it alright for me to pick up this issue and expand the Docs?

@mzeitlin11
Member

Yep, that would be great @jeremyswerdlow!

@nakatomotoi
Contributor

@TomAugspurger @mzeitlin11
Hello, I would like to work on this issue. This will be my first contribution to OSS.
@jeremyswerdlow
If you haven't started on this issue yet, I'd like to work on it. Is that OK with you?

@Sara-cos

Sara-cos commented Sep 1, 2021

take

@jorisvandenbossche
Member

> This is the tested behaviour. The storage type is an implementation detail. This allows reading a parquet file without pyarrow.

Although it is tested, I don't know if we intentionally don't support a roundtrip. It's true that we should be able to read parquet files containing string data with fastparquet, without having pyarrow installed.

But when starting from a pandas dataframe, we store information about the dtype in the metadata of the arrow table / parquet file. And if we want, I think we could make use of this to support a faithful roundtrip out of the box.

@randolf-scholz
Contributor

This would be really nice, as the memory difference can be huge. I got tripped up when trying to load a table stored by pyarrow that takes 16G when using string[pyarrow], but more than 60G when using regular string[python].

@rsm-23
Contributor

rsm-23 commented Jun 30, 2023

> Thanks.
>
> I think it's worth adding a small note to the docs, both in user_guide/text.rst and user_guide/io.rst noting that
>
> 1. The parquet representation of StringDtype is the same, regardless of the `storage` (is that correct?)
>
> 2. The data will be read in according to the `string_storage` setting (and show an example of getting pyarrow storage)

Hey @TomAugspurger does this only require updating DOCS? Can I take it up?

@rsm-23
Contributor

rsm-23 commented Jul 1, 2023

I need some help with the above PR. One of the CI checks is failing, but I am able to build the docs locally. Can someone help me understand this error:
Sphinx parallel build error: RuntimeError: Non Expected warning in /home/runner/work/pandas/pandas/doc/source/user_guide/io.rst line 5485

@rsm-23
Contributor

rsm-23 commented Jul 1, 2023

I tried using :suppress: but that didn't help either. Any help is appreciated.
I figured out that using :okwarning: was the way to go. I'm not sure why it is necessary, since I didn't see any warnings while building locally, but it could be happening because of .to_parquet()

@rsm-23
Contributor

rsm-23 commented Jul 1, 2023

@TomAugspurger can you please review the PR?

@TomAugspurger
Contributor Author

I'm not sure what the latest status of StringDtype is, but based on #48469 it sounds like there's still some flux. It'd be good to confirm that no changes there will affect the parquet storage type.

And based on #42664 (comment), it sounds like my earlier comment wasn't necessarily correct.

@rsm-23
Contributor

rsm-23 commented Jul 2, 2023

I'll probably wait out the discussion on #48469

9 participants