Provide a way to convert Arrow tables to Arrow-backed dataframes #51760

Closed
datapythonista opened this issue Mar 3, 2023 · 9 comments · Fixed by #51762
Labels
Arrow pyarrow functionality

Comments

@datapythonista
Member

As far as I can see, there is no easy way, given a PyArrow table, to get a DataFrame with PyArrow types.

I'd expect these idioms to work:

import numpy
import pyarrow
import pandas

arrow_u8 = pyarrow.array([1, 2, 3], type=pyarrow.uint8())
arrow_f64 = pyarrow.array([1., 2., 3.], type=pyarrow.float64())
table = pyarrow.table([arrow_u8, arrow_f64], names=['u8', 'f64'])

# Using the PyArrow `to_pandas` method will use NumPy backed data
df = table.to_pandas()

# Using the constructor with a PyArrow table raises: ValueError: DataFrame constructor not properly called!
df = pandas.DataFrame(table)

# This is not implemented (the method doesn't exist)
df = pandas.DataFrame.from_arrow(table)

# Creating a dataframe column by column naively from the arrow array will use NumPy dtypes
df = pandas.DataFrame({'u8': arrow_u8,
                       'f64': arrow_f64})

I think the easiest way to make the transition is with something like this:

df = pandas.DataFrame({name: pandas.Series(array,
                                            dtype=pandas.ArrowDtype(array.type))
                        for array, name
                        in zip(table.columns, table.column_names)})

@pandas-dev/pandas-core Given that Arrow dtypes are one of the highlights of pandas 2.0, shouldn't we provide at least one easy way to convert before the release?

datapythonista added the Arrow pyarrow functionality label Mar 3, 2023
@phofl
Member

phofl commented Mar 3, 2023

Arrow isn’t fully supported yet, just something to consider when communicating this.

The easiest way is to provide a types mapper to to_pandas on an Arrow table; that's what we are using internally as well.

@datapythonista
Member Author

> Arrow isn’t fully supported yet, just something to consider when communicating this.

Agree, but this seems like an important and common use case, and doesn't seem difficult to implement, no?

> The easiest way is to provide a types mapper to to_pandas on an Arrow table; that's what we are using internally as well.

I didn't think about it, sounds good. But do we have a pandas function for the mapper, or is it something every user wanting this functionality should write? If that's the case the code I wrote is probably simpler.

@phofl
Member

phofl commented Mar 3, 2023

No, we don’t have a common function, and I remembered this incorrectly: we are only using it for our own nullable dtypes. Internally we are doing more or less the same as you did.

You can wrap the columns in an ArrowExtensionArray instead of using a Series.

@jbrockmendel
Member

jbrockmendel commented Mar 3, 2023

> # Using the PyArrow to_pandas method will use NumPy backed data

This seems like something we should ask pyarrow to change?

Supporting this in DataFrame.__init__ would be hacky without making pyarrow a required dependency xref #50285, but I could be convinced.

@phofl
Member

phofl commented Mar 3, 2023

OK, there is actually an easy way to do this:

table.to_pandas(types_mapper=lambda x: pd.ArrowDtype(x))

Edit: I think this should be sufficient for now? We should definitely document this. I'll open a PR.

@jbrockmendel
Member

@phofl I think you can get rid of the lambda and just pass types_mapper=pd.ArrowDtype

@phofl
Member

phofl commented Mar 3, 2023

Oh good point, yes you are correct.

@datapythonista
Member Author

> Supporting this in DataFrame.__init__ would be hacky without making pyarrow a required dependency xref #50285, but I could be convinced.

I have a working version of it in #51769, and it seems simple enough to be worth adding, with pyarrow remaining optional.

@jreback
Contributor

jreback commented Mar 4, 2023

> @phofl I think you can get rid of the lambda and just pass types_mapper=pd.ArrowDtype

This seems like the ideal solution; we should just document it for now.

4 participants