feat(python): handle Series init from python sequence of numpy arrays #5918

Merged


alexander-beedie
Collaborator

Closes #5905.


We already handle Series init from 2D numpy arrays; this PR extends that slightly so that we can also recognise/handle init from a python sequence (list/tuple) of numpy arrays.

Before

import numpy as np
import polars as pl

df = pl.DataFrame({ 
    "colx": [np.array([1,2,3]), np.array([4,5,6])],
})
# ┌─────────┐
# │ colx    │
# │ ---     │
# │ object  │  << 'object' dtype
# ╞═════════╡
# │ [1 2 3] │
# │ [4 5 6] │
# └─────────┘

df.select( 
  [pl.struct(["colx"])] 
)
# pyo3_runtime.PanicException: cannot convert object to arrow

After

df = pl.DataFrame({ 
    "colx": [np.array([1,2,3]), np.array([4,5,6])],
})
# ┌───────────┐
# │ colx      │
# │ ---       │
# │ list[i64] │  << native dtype
# ╞═══════════╡
# │ [1, 2, 3] │
# │ [4, 5, 6] │
# └───────────┘

df.select( 
  [pl.struct(["colx"])] 
)
# ┌─────────────┐
# │ colx        │
# │ ---         │
# │ struct[1]   │
# ╞═════════════╡
# │ {[1, 2, 3]} │
# │ {[4, 5, 6]} │
# └─────────────┘

@github-actions bot added the enhancement (New feature or an improvement of an existing feature) and python (Related to Python Polars) labels on Dec 28, 2022
@ritchie46
Member

Awesome!

I am working on some voodoo that allows us to put python objects in an arrow struct. Then both issues are solved. :)

@alexander-beedie
Collaborator Author

alexander-beedie commented Dec 28, 2022

It's a Christmas miracle, heh :)

Side-query on supported arrow types; does arrow2/rust handle Decimal yet? If I recall correctly the answer a while back was "no", but I've been coming across more use-cases (via BigQuery-sourced data) recently and thought I'd re-visit.

@ritchie46
Member

They do. We need to add an i128 primitive ChunkedArray and a decimal logical type on the polars side to make this work.
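
As a rough illustration of the split mentioned above (a 128-bit integer as the physical storage, with a decimal logical type that carries the scale), here is a plain-Python sketch. It is not the polars or arrow2 implementation; `to_physical`, `to_logical` and the bounds constants are hypothetical names used only for this example, and only finite decimals are handled.

```python
from decimal import Decimal

# Range of a signed 128-bit integer (the assumed physical storage type).
I128_MIN = -(2**127)
I128_MAX = 2**127 - 1


def to_physical(value: Decimal, scale: int) -> int:
    """Encode a finite Decimal as an unscaled integer for a fixed scale."""
    sign, digits, exponent = value.as_tuple()
    coefficient = int("".join(map(str, digits)))
    shift = exponent + scale
    if shift >= 0:
        unscaled = coefficient * 10**shift
    else:
        # The value has more fractional digits than the scale allows.
        unscaled, remainder = divmod(coefficient, 10**-shift)
        if remainder:
            raise ValueError(f"{value} cannot be represented with scale {scale}")
    if sign:
        unscaled = -unscaled
    if not I128_MIN <= unscaled <= I128_MAX:
        raise OverflowError(f"{value} does not fit in 128 bits at scale {scale}")
    return unscaled


def to_logical(unscaled: int, scale: int) -> Decimal:
    """Decode an unscaled integer back into a Decimal using the scale."""
    sign = 1 if unscaled < 0 else 0
    digits = tuple(int(d) for d in str(abs(unscaled)))
    return Decimal((sign, digits, -scale))


print(to_physical(Decimal("12.34"), 2))
print(to_logical(1234, 2))
```

The point of the sketch is that arithmetic and storage can operate on the integer, while the scale lives in the type rather than in each value.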

@alexander-beedie
Collaborator Author

alexander-beedie commented Dec 28, 2022

They do.

Oooo... ok, that's interesting.
I should probably make a New Year's resolution to more actively improve my Rust ;)

@ritchie46 ritchie46 merged commit eaeb703 into pola-rs:master Dec 28, 2022
@alexander-beedie alexander-beedie deleted the init-sequence-of-numpy-arrays branch January 5, 2023 19:40
@im44pos

im44pos commented Mar 6, 2023

Why do I only get the "Before" result?
I'm on Windows10, with:
Python 3.8.13
Numpy 1.23.1
Polars 0.14.6

@ghuls
Collaborator

ghuls commented Mar 6, 2023

Because you are running an old version. The latest version is 0.16.11 at the moment.
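
A quick way to check for this kind of mismatch is to compare the installed version against one that is known to work. The sketch below uses only the standard library; `version_tuple` and `installed_at_least` are hypothetical helper names, and 0.16.11 is used as the reference simply because it is the version mentioned in this thread, not because it is the first release containing the change.

```python
import re
from importlib.metadata import PackageNotFoundError, version
from typing import Tuple


def version_tuple(text: str) -> Tuple[int, ...]:
    """Turn '0.16.11' into (0, 16, 11), using the first three numeric parts."""
    return tuple(int(part) for part in re.findall(r"\d+", text)[:3])


def installed_at_least(package: str, minimum: str) -> bool:
    """Return True if `package` is installed at version `minimum` or newer."""
    try:
        installed = version(package)
    except PackageNotFoundError:
        # Package is not installed in the current environment.
        return False
    return version_tuple(installed) >= version_tuple(minimum)


print(installed_at_least("polars", "0.16.11"))
```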

@im44pos

im44pos commented Mar 7, 2023

Ok, that helps.
I guess I should have noticed that myself.

It solves that I now get the list[i64] instead of the object.
But I still don't get the struct[1]
Then also upgraded Numpy to 1.24.2: same result.

@ghuls
Collaborator

ghuls commented Mar 7, 2023

So what is the output you get then?

import numpy as np
import polars as pl

print(pl.show_versions())

df = pl.DataFrame({ 
    "colx": [np.array([1,2,3]), np.array([4,5,6])],
})

print(df)

df_struct = df.select( 
  [pl.struct(["colx"])] 
)

print(df_struct)

@im44pos

im44pos commented Mar 7, 2023

---Version info---
Polars: 0.16.11
Index type: UInt32
Platform: Windows-10-10.0.19044-SP0
Python: 3.8.13 | packaged by conda-forge | (default, Mar 25 2022, 05:59:45) [MSC v.1929 64 bit (AMD64)]
---Optional dependencies---
pyarrow: 11.0.0
pandas: 1.4.4
numpy: 1.24.2
fsspec:
connectorx: 0.3.1
xlsx2csv:
deltalake:
matplotlib: 3.5.2
None
shape: (2, 1)
┌───────────┐
│ colx      │
│ ---       │
│ list[i32] │
╞═══════════╡
│ [1, 2, 3] │
│ [4, 5, 6] │
└───────────┘
shape: (2, 1)
┌─────────────┐
│ colx        │
│ ---         │
│ struct[1]   │
╞═════════════╡
│ {[1, 2, 3]} │
│ {[4, 5, 6]} │
└─────────────┘

@im44pos

im44pos commented Mar 7, 2023

df.select(
  [pl.struct(["colx"])]
)

is different from your

df_struct = df.select(
  [pl.struct(["colx"])]
)

@ghuls
Collaborator

ghuls commented Mar 7, 2023

Ok, that helps. I guess I should have noticed that myself.

It solves that I now get the list[i64] instead of the object. But I still don't get the struct[1] Then also upgraded Numpy to 1.24.2: same result.

To me it looks like you get the struct[1] result now.

@im44pos

im44pos commented Mar 24, 2023

To me it looks like you get the struct[1] result now.

Yes, due to:
df_struct = df.select([pl.struct(["colx"])])
print(df_struct)
As I posted before.
Thanks for the support.

I don't know if I should post this here (relatively new to software engineering and GitHub).

I'm refactoring my Jupyter notebook into Python code.
Therefore I created a new environment with slightly different library versions:
print(pl.show_versions())

---Version info---
Polars: 0.16.14
Index type: UInt32
Platform: Windows-10-10.0.19044-SP0
Python: 3.8.16 | packaged by conda-forge | (default, Feb 1 2023, 15:53:35) [MSC v.1929 64 bit (AMD64)]
---Optional dependencies---
numpy: 1.24.2
pandas:
pyarrow: 11.0.0
connectorx: 0.3.1
deltalake:
fsspec:
matplotlib: 3.7.1
xlsx2csv:
xlsxwriter:
None

And the example as above works as it should.

In my code I collect data and put that into a polars dataframe.
Center_of_Mass (type: numpy.ndarray) goes into the column center of mass (type: list[f64])
But:
Matrix_of_Inertia (type: numpy.ndarray) goes into the column matrix of inertia (type: object)
Matrix_of_Inertia List (type: list) goes into the column matrix of inertia List (type: object)

[image]

Why do I get different results for Center_of_Mass and Matrix_of_Inertia inputs?
How can I convert a column within the dataframe from object to struct?

@ghuls
Collaborator

ghuls commented Mar 24, 2023

Without an input dataframe as example it is hard to say what is happening.

@im44pos

im44pos commented Mar 24, 2023

Without an input dataframe as example it is hard to say what is happening.

There is no input dataframe:
The numpy.ndarrays are the output for functions that I call and store in lists.
Then I create the dataframe from these lists.

The thing now is that I created a small notebook to replicate this result, and replaced all of my computation with just the numpy.ndarrays. "Strangely", I get list[f64] as output for all three columns.
Then I put the code from the notebook in a single file with Visual Studio Code: again I get list[f64] for all three.
So the cause must be in the computations (and libraries) that I use in my functions.

I'm learning every step of the way, so I'll have to do some more work and testing.

P.S.: How can I convert a column within the dataframe from object to struct?
I still might need this, because the matrix of inertia should be 3x3 instead of 1x3 as in this example.

Thanks.

Successfully merging this pull request may close these issues.

PanicException: Cannot convert object to arrow when using struct with numpy arrays in an expression
4 participants