BUG: string[pyarrow] dtype doesn't roundtrip through pyarrow #50074

jrbourbeau opened this issue Dec 5, 2022 · 8 comments

Labels: Arrow, Bug, Strings

Comments

@jrbourbeau (Contributor)

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd
import pyarrow as pa

df = pd.DataFrame({"x": ["foo", "bar", "baz"]}, dtype="string[pyarrow]")
# Round-trip `df` through pyarrow using `Table.from_pandas` and `Table.to_pandas` + `types_mapper=`
types_mapper = {pa.string(): pd.ArrowDtype(pa.string())}
df_pa = pa.Table.from_pandas(df).to_pandas(types_mapper=types_mapper.get)
pd.testing.assert_frame_equal(df, df_pa)
# The assertion above fails with:
# Attribute "dtype" are different
# [left]:  string[pyarrow]
# [right]: string[pyarrow]

Issue Description

While working on improved support for pyarrow-backed dtypes over in Dask, I came across a case where round-tripping a DataFrame with string[pyarrow] dtypes doesn't appear to fully work as expected.

Expected Behavior

When using pa.Table.from_pandas and pa.Table.to_pandas + types_mapper as described in the above example, I would expect to get an equivalent DataFrame back.

Additionally, the error message

Attribute "dtype" are different
[left]:  string[pyarrow]
[right]: string[pyarrow]

is somewhat confusing, since from the information it displays the two dtypes look like they should be the same.

Installed Versions

INSTALLED VERSIONS
------------------
commit           : 8dab54d6573f7186ff0c3b6364d5e4dd635ff3e7
python           : 3.9.15.final.0
python-bits      : 64
OS               : Darwin
OS-release       : 21.6.0
Version          : Darwin Kernel Version 21.6.0: Thu Sep 29 20:12:57 PDT 2022; root:xnu-8020.240.7~1/RELEASE_X86_64
machine          : x86_64
processor        : i386
byteorder        : little
LC_ALL           : None
LANG             : en_US.UTF-8
LOCALE           : en_US.UTF-8

pandas           : 1.5.2
numpy            : 1.21.6
pytz             : 2022.6
dateutil         : 2.8.2
setuptools       : 59.8.0
pip              : 22.3.1
Cython           : None
pytest           : 7.2.0
hypothesis       : None
sphinx           : None
blosc            : None
feather          : None
xlsxwriter       : None
lxml.etree       : None
html5lib         : None
pymysql          : None
psycopg2         : None
jinja2           : 3.1.2
IPython          : 8.7.0
pandas_datareader: None
bs4              : 4.11.1
bottleneck       : None
brotli           :
fastparquet      : 2022.11.0
fsspec           : 2022.11.0
gcsfs            : None
matplotlib       : None
numba            : 0.56.4
numexpr          : 2.8.3
odfpy            : None
openpyxl         : None
pandas_gbq       : None
pyarrow          : 10.0.1
pyreadstat       : None
pyxlsb           : None
s3fs             : 2022.11.0
scipy            : 1.9.3
snappy           :
sqlalchemy       : 1.4.44
tables           : 3.7.0
tabulate         : None
xarray           : 2022.11.0
xlrd             : None
xlwt             : None
zstandard        : None
tzdata           : None
@jrbourbeau (Contributor, Author)

Poking around a bit, I've noticed that the underlying .type attributes of df.x.dtype and df_pa.x.dtype are different:

In [2]: df.x.dtype.type
Out[2]: str

In [3]: df_pa.x.dtype.type
Out[3]: pyarrow.lib.DataType

@mroeschke (Member)

Thanks for the report. Yeah, this is confusing because, as of 1.5, there are technically two separate pyarrow-backed string implementations. This could probably be clearer in our docs: https://pandas.pydata.org/pandas-docs/stable/reference/arrays.html#pyarrow

In [11]: pd.StringDtype("pyarrow")
Out[11]: string[pyarrow]

In [12]: pd.ArrowDtype(pa.string())
Out[12]: string[pyarrow]

pd.StringDtype + pd.arrays.ArrowStringArray is still more feature-complete than pd.ArrowDtype(pa.string()) + pd.arrays.ArrowExtensionArray, since the latter is not yet plugged into the pyarrow.compute methods for string operations. Once that's integrated, in theory pd.ArrowDtype(pa.string()) + pd.arrays.ArrowExtensionArray could be used for everything #48469 (comment)
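
For illustration, here is a minimal sketch (not from the thread) showing that the two dtypes share a repr but are not interchangeable:

import pandas as pd
import pyarrow as pa

# Both dtypes display as "string[pyarrow]", but they are distinct extension dtypes.
string_dtype = pd.StringDtype("pyarrow")  # backed by pd.arrays.ArrowStringArray
arrow_dtype = pd.ArrowDtype(pa.string())  # backed by pd.arrays.ArrowExtensionArray

print(repr(string_dtype), repr(arrow_dtype))  # string[pyarrow] string[pyarrow]
print(string_dtype == arrow_dtype)            # False
print(string_dtype.type)                      # <class 'str'>
print(arrow_dtype.type)                       # may differ (see the .type comparison above)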

@jrbourbeau (Contributor, Author)

Ah, I see -- thanks for clarifying @mroeschke. So, for the time being, it sounds like folks should use pd.StringDtype("pyarrow") when converting pyarrow datatypes to pandas dtypes, correct?

FWIW if I switch to using pd.StringDtype("pyarrow") in the original example, things work as expected:

import pandas as pd
import pyarrow as pa

df = pd.DataFrame({"x": ["foo", "bar", "baz"]}, dtype="string[pyarrow]")
# Round-trip `df` through pyarrow using `Table.from_pandas` and `Table.to_pandas` + `types_mapper=`
def types_mapper(pa_type):
    if pa_type == pa.string():
        return pd.StringDtype("pyarrow")
    # returning None here lets pyarrow fall back to its default conversion for other types
df_pa = pa.Table.from_pandas(df).to_pandas(types_mapper=types_mapper)
pd.testing.assert_frame_equal(df, df_pa)

@mroeschke (Member)

So, for the time being, it sounds like folks should use pd.StringDtype("pyarrow") when converting pyarrow datatypes to pandas dtypes, correct?

Correct (for strings only). For all other pyarrow data types, pd.ArrowDtype should work.
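
Putting those two recommendations together, a rough sketch of a types_mapper (the column names below are made up for illustration) could look like:

import pandas as pd
import pyarrow as pa

# Map pyarrow strings to pd.StringDtype("pyarrow") and every other pyarrow
# type to pd.ArrowDtype, per the advice above.
def types_mapper(pa_type):
    if pa_type == pa.string():
        return pd.StringDtype("pyarrow")
    return pd.ArrowDtype(pa_type)

table = pa.table({"s": ["foo", "bar"], "i": [1, 2]})
df = table.to_pandas(types_mapper=types_mapper)
print(df.dtypes)  # s -> string[pyarrow] (StringDtype), i -> int64[pyarrow]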

@jrbourbeau (Contributor, Author)

Thanks -- that also fixes the corresponding issue I was running into over in Dask

https://github.com/dask/dask/pull/9719/files#diff-965eb1b3afb3ebaa80c6c3d896d50044911569d696e5e25807d5e22f8d22d668R1588-R1594

@rachtsingh

To check my understanding, df = df.astype(dtype={'col': 'string[pyarrow]'}) sets the column dtype to be ArrowStringArray, but I actually want ArrowExtensionArray? I've been trying to figure out why str.split was slow, and it looks like it's calling the object-based API with a Python for loop, which isn't what I wanted.

Does Pandas have an issue tracking when the above astype call will return an ArrowExtensionArray? That looks like the eventual goal, right?

@mroeschke (Member) commented Oct 24, 2023

but I actually want ArrowExtensionArray?

Yes, you can use pd.ArrowDtype(pyarrow.string()).

Does Pandas have an issue tracking when the above astype call will return an ArrowExtensionArray? That looks like the eventual goal, right?

Hopefully in pandas 3.0 (coming out next year), where strings will be inferred and stored with pyarrow by default.
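
As a hedged sketch of that suggestion (assuming a recent pandas where ArrowExtensionArray string methods are backed by pyarrow.compute):

import pandas as pd
import pyarrow as pa

# Cast the column to pd.ArrowDtype(pa.string()) so it is backed by
# ArrowExtensionArray rather than ArrowStringArray.
df = pd.DataFrame({"col": ["a b", "c d"]})
df = df.astype({"col": pd.ArrowDtype(pa.string())})
print(df["col"].dtype)           # string[pyarrow] (pd.ArrowDtype)
print(type(df["col"].array))     # ArrowExtensionArray
print(df["col"].str.split(" "))  # dispatches to pyarrow.compute in recent versions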

@randolf-scholz (Contributor) commented Mar 19, 2024

Another roundtrip issue occurs when simply serializing/deserializing a DataFrame to Parquet (#42664):

import pandas as pd

df = pd.DataFrame({"col": ["a", "b", "c"]}, dtype="string[pyarrow]")
df.to_parquet("foo.parquet")

df2 = pd.read_parquet("foo.parquet")
pd.testing.assert_frame_equal(df, df2)  # left: string[pyarrow], right: string[python]

df3 = pd.read_parquet("foo.parquet", dtype_backend="pyarrow")
pd.testing.assert_frame_equal(df, df3)  # left: string[pyarrow], right: large_string[pyarrow]
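
One possible workaround (not suggested in the thread, just a sketch) is to cast back explicitly after reading:

import pandas as pd

# Write with string[pyarrow], then cast the column back after reading so the
# dtype matches the original frame.
df = pd.DataFrame({"col": ["a", "b", "c"]}, dtype="string[pyarrow]")
df.to_parquet("foo.parquet")
df2 = pd.read_parquet("foo.parquet").astype({"col": "string[pyarrow]"})
pd.testing.assert_frame_equal(df, df2)  # passes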
