Provide a way to convert Arrow tables to Arrow-backed dataframes #51760

datapythonista · 2023-03-03T12:58:42Z

As far as I can see, there is no easy way to get a DataFrame with pyarrow types from a PyArrow table.

I'd expect these idioms to work:

import numpy
import pyarrow
import pandas

arrow_u8 = pyarrow.array([1, 2, 3], type=pyarrow.uint8())
arrow_f64 = pyarrow.array([1., 2., 3.], type=pyarrow.float64())
table = pyarrow.table([arrow_u8, arrow_f64], names=['u8', 'f64'])

# Using the PyArrow `to_pandas` method will use NumPy backed data
df = table.to_pandas()

# Using the constructor with a PyArrow table raises: ValueError: DataFrame constructor not properly called!
df = pandas.DataFrame(table)

# This is not implemented (the method doesn't exist)
df = pandas.DataFrame.from_arrow(table)

# Creating a dataframe column by column naively from the arrow array will use NumPy dtypes
df = pandas.DataFrame({'u8': arrow_u8,
                       'f64': arrow_f64})

I think the easiest way to make the transition is with something like this:

df = pandas.DataFrame({name: pandas.Series(array,
                                            dtype=pandas.ArrowDtype(array.type))
                        for array, name
                        in zip(table.columns, table.column_names)})

@pandas-dev/pandas-core Given that Arrow dtypes is one of the highlights of pandas 2.0, shouldn't we provide at least one easy way to convert before the release?

phofl · 2023-03-03T13:50:33Z

Arrow isn’t fully supported yet, just something to consider when communicating this.

The easiest way is to provide a types mapper to to_pandas on an Arrow table; that’s what we are using internally as well.

datapythonista · 2023-03-03T14:13:02Z

Arrow isn’t fully supported yet, just something to consider when communicating this.

Agree, but this seems like an important and common use case, and doesn't seem difficult to implement, no?

The easiest way is to provide a types mapper to to_pandas on an Arrow table; that’s what we are using internally as well.

I didn't think about it, sounds good. But do we have a pandas function for the mapper, or is it something every user wanting this functionality should write? If that's the case the code I wrote is probably simpler.

phofl · 2023-03-03T15:01:52Z

No, we don’t have a common function, and I remembered this incorrectly: we are only using it for our own nullable dtypes. Internally we are doing more or less the same as you did.

You can wrap each column in an ArrowExtensionArray instead of using the Series.

jbrockmendel · 2023-03-03T16:00:25Z

# Using the PyArrow to_pandas method will use NumPy backed data

This seems like something we should ask pyarrow to change?

Supporting this in DataFrame.__init__ would be hacky without making pyarrow a required dependency xref #50285, but I could be convinced.

phofl · 2023-03-03T16:00:31Z

Ok there is actually an easy way to do this:

table.to_pandas(types_mapper=lambda x: pd.ArrowDtype(x))

Edit: I think this should be sufficient for now? We should definitely document this. I'll open a PR.

jbrockmendel · 2023-03-03T16:02:04Z

@phofl i think you can get rid of the lambda and just pass types_mapper=pd.ArrowDtype

phofl · 2023-03-03T16:03:32Z

Oh good point, yes you are correct.

datapythonista · 2023-03-03T19:57:45Z

Supporting this in DataFrame.__init__ would be hacky without making pyarrow a required dependency xref #50285, but I could be convinced.

I have a working version of it in #51769, and it seems simple enough to be worth adding, with pyarrow remaining optional.

jreback · 2023-03-04T01:50:17Z

@phofl i think you can get rid of the lambda and just pass types_mapper=pd.ArrowDtype

This seems like the ideal solution; we should just document it for now.

datapythonista added the Arrow pyarrow functionality label Mar 3, 2023

phofl mentioned this issue Mar 3, 2023: DOC: Add explanation how to convert arrow table to pandas df #51762 (merged)

datapythonista mentioned this issue Mar 3, 2023: ENH: Implement DataFrame.from_pyarrow and DataFrame.to_pyarrow #51769 (closed)

datapythonista closed this as completed in #51762 Mar 7, 2023
