BUG: `DataFrame.to_parquet` doesn't round-trip pyarrow StringDtype #42664

TomAugspurger · 2021-07-22T14:50:46Z

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
(optional) I have confirmed this bug exists on the master branch of pandas.

Code Sample, a copy-pastable example

import pandas as pd
a = pd.DataFrame({"A": pd.array(['a', 'b'], dtype=pd.StringDtype("pyarrow"))})
a.to_parquet("test.parquet")
b = pd.read_parquet("test.parquet")
pd.testing.assert_frame_equal(a, b)

---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
/tmp/ipykernel_493/3001616580.py in <module>
      3 a.to_parquet("test.parquet")
      4 b = pd.read_parquet("test.parquet")
----> 5 pd.testing.assert_frame_equal(a, b)

    [... skipping hidden 3 frame]

/srv/conda/envs/notebook/lib/python3.8/site-packages/pandas/_testing/asserters.py in raise_assert_detail(obj, message, left, right, diff, index_values)
    663         msg += f"\n[diff]: {diff}"
    664 
--> 665     raise AssertionError(msg)
    666 
    667 

AssertionError: Attributes of DataFrame.iloc[:, 0] (column name="A") are different

Attribute "dtype" are different
[left]:  string[pyarrow]
[right]: string[python]

Problem description

read_parquet currently loads all string dtype as string[python]. We'd ideally match what was written.

Expected Output

A DataFrame with string[pyarrow] rather than string[python]

Output of `pd.show_versions()`

INSTALLED VERSIONS
------------------
commit           : f00ed8f47020034e752baf0250483053340971b0
python           : 3.8.10.final.0
python-bits      : 64
OS               : Linux
OS-release       : 5.4.0-1040-azure
Version          : #42~18.04.1-Ubuntu SMP Mon Feb 8 19:05:32 UTC 2021
machine          : x86_64
processor        : x86_64
byteorder        : little
LC_ALL           : C.UTF-8
LANG             : C.UTF-8
LOCALE           : en_US.UTF-8

pandas           : 1.3.0
numpy            : 1.21.0
pytz             : 2021.1
dateutil         : 2.7.5
pip              : 20.3.4
setuptools       : 49.6.0.post20210108
Cython           : None
pytest           : None
hypothesis       : None
sphinx           : None
blosc            : 1.10.2
feather          : None
xlsxwriter       : None
lxml.etree       : None
html5lib         : None
pymysql          : None
psycopg2         : 2.9.1 (dt dec pq3 ext lo64)
jinja2           : 3.0.1
IPython          : 7.25.0
pandas_datareader: None
bs4              : 4.9.3
bottleneck       : 1.3.2
fsspec           : 2021.06.1
fastparquet      : None
gcsfs            : 2021.06.1
matplotlib       : 3.4.2
numexpr          : None
odfpy            : None
openpyxl         : None
pandas_gbq       : None
pyarrow          : 4.0.0
pyxlsb           : None
s3fs             : 2021.06.1
scipy            : 1.7.0
sqlalchemy       : 1.4.20
tables           : None
tabulate         : 0.8.9
xarray           : 0.18.2
xlrd             : None
xlwt             : None
numba            : 0.53.1

simonjayhawkins · 2021-07-22T16:44:05Z

This is the tested behaviour. The storage type is an implementation detail. This allows reading a parquet file without pyarrow. If you change the global setting to pyarrow, you can read any string array into a pyarrow-backed string array.

import pandas as pd
a = pd.DataFrame({"A": pd.array(['a', 'b'], dtype=pd.StringDtype("pyarrow"))})
a2 = pd.DataFrame({"A": pd.array(['a', 'b'], dtype=pd.StringDtype("python"))})

a.to_parquet("test.parquet")
with pd.option_context("string_storage", "pyarrow"):
    b = pd.read_parquet("test.parquet")
pd.testing.assert_frame_equal(b, a)

a2.to_parquet("test.parquet")
with pd.option_context("string_storage", "pyarrow"):
    b = pd.read_parquet("test.parquet")
pd.testing.assert_frame_equal(b, a)

TomAugspurger · 2021-07-28T13:21:50Z

Thanks.

I think it's worth adding a small note to the docs, both in user_guide/text.rst and user_guide/io.rst, noting that:

1. The parquet representation of StringDtype is the same, regardless of the storage (is that correct?)
2. The data will be read in according to the string_storage setting (and show an example of getting pyarrow storage)

jeremyswerdlow · 2021-08-06T19:37:37Z

Hi there, long time user, first time contributor. Is it alright for me to pick up this issue and expand the Docs?

mzeitlin11 · 2021-08-10T18:42:46Z

Yep, that would be great @jeremyswerdlow!

nakatomotoi · 2021-08-25T07:02:10Z

@TomAugspurger @mzeitlin11
Hello, I would like to work on this issue. This will be my first contribution to oss.
@jeremyswerdlow
If you haven't worked on this issue yet, I'd like to work on it. Is it ok for you?

Sara-cos · 2021-09-01T08:01:57Z

take

jorisvandenbossche · 2021-09-21T20:38:40Z

> This is the tested behaviour. The storage type is an implementation detail. This allows reading a parquet file without pyarrow.

Although it is tested, I don't know if we intentionally don't support a roundtrip. It's true that we should be able to read any parquet file with fastparquet without having pyarrow for parquet files with string data.

But when starting from a pandas dataframe, we store information about the dtype in the metadata of the arrow table / parquet file. And if we want, I think we could make use of this to support a faithful roundtrip out of the box.

randolf-scholz · 2022-09-09T14:22:46Z

This would be really nice, as the memory difference can be huge. Got tripped up when trying to load a table stored by pyarrow that would take 16G when using string[pyarrow], but > 60G using regular string[python].

rsm-23 · 2023-06-30T16:12:50Z

> Thanks.
>
> I think it's worth adding a small note to the docs, both in user_guide/text.rst and user_guide/io.rst noting that
>
> 1. The parquet representation of StringDtype is the same, regardless of the `storage` (is that correct?)
> 2. The data will be read in according to the `string_storage` setting (and show an example of getting pyarrow storage

Hey @TomAugspurger, does this only require updating the docs? Can I take it up?

rsm-23 · 2023-07-01T11:25:49Z

I need some help with the above PR. One of the CI checks is failing, but I am able to build the docs locally. Can someone help me understand this:
Sphinx parallel build error: RuntimeError: Non Expected warning in /home/runner/work/pandas/pandas/doc/source/user_guide/io.rst line 5485

rsm-23 · 2023-07-01T12:21:06Z

~~Tried using :suppress: but that didn't help either. Any help is appreciated.~~
Figured out that using :okwarning: was the way to go. I'm not fully convinced why it is necessary, since I didn't see any warnings while building locally, but it could be caused by .to_parquet()

rsm-23 · 2023-07-01T15:39:55Z

@TomAugspurger can you please review the PR?

TomAugspurger · 2023-07-02T12:04:47Z

I'm not sure what the latest status of the StringDtype is, but based on #48469 it sounds like there's still some flux. It'd be good to confirm that no changes there will affect the parquet storage type.

And based on #42664 (comment), it sounds like my earlier comment wasn't necessarily correct.

rsm-23 · 2023-07-02T15:05:21Z

I'll probably wait out the discussion on #48469

TomAugspurger added the Bug, Strings (String extension data type and string data), and IO Parquet (parquet, feather) labels on Jul 22, 2021

TomAugspurger added the Docs and good first issue labels and removed the Bug label on Jul 28, 2021

github-actions bot assigned Sara-cos Sep 1, 2021

mzeitlin11 mentioned this issue Sep 14, 2021

BUG: string[pyarrow] dtype not loaded from feather #43565

Closed

3 tasks

Sara-cos mentioned this issue Sep 16, 2021

DOC : DataFrame.to_parquet doesn't round-trip pyarrow StringDtype #42664 #43604

Closed

1 task

TomAugspurger mentioned this issue Nov 7, 2022

Read Parquet directly into string[pyarrow] dask/dask#9631

Closed

rsm-23 mentioned this issue Jul 1, 2023

DOC: Added note and example for pyarrow storage #53961

Closed

2 tasks

randolf-scholz mentioned this issue Mar 19, 2024

BUG: string[pyarrow] dtype doesn't roundtrip through pyarrow #50074

Open

3 tasks
