DataFrame construction from numpy with dtype object #17819

Closed
2 tasks done
dpinol opened this issue Jul 23, 2024 · 2 comments
Labels: bug (Something isn't working), invalid (A bug report that is not actually a bug), python (Related to Python Polars)

Comments

@dpinol
Contributor

dpinol commented Jul 23, 2024

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

import numpy as np
import polars as pl

pl.DataFrame(np.array([[[5], 7]], np.object_), schema={"a": pl.List(pl.Int64), "b": pl.Int64}, orient="row")
# or, more simply
pl.Series("k", np.array([5], np.object_), pl.Int64)

Log output

---------------------------------------------------------------------------
ComputeError                              Traceback (most recent call last)
Cell In[35], line 1
----> 1 pl.DataFrame(np.array([[5.6]], np.object_), schema={"a": pl.Float64}, orient="row")

File ~/.local/share/virtualenvs/sc-api-DipHMMiG/lib/python3.12/site-packages/polars/dataframe/frame.py:384, in DataFrame.__init__(self, data, schema, schema_overrides, strict, orient, infer_schema_length, nan_to_null)
    379     self._df = series_to_pydf(
    380         data, schema=schema, schema_overrides=schema_overrides, strict=strict
    381     )
    383 elif _check_for_numpy(data) and isinstance(data, np.ndarray):
--> 384     self._df = numpy_to_pydf(
    385         data,
    386         schema=schema,
    387         schema_overrides=schema_overrides,
    388         strict=strict,
    389         orient=orient,
    390         nan_to_null=nan_to_null,
    391     )
    393 elif _check_for_pyarrow(data) and isinstance(data, pa.Table):
    394     self._df = arrow_to_pydf(
    395         data, schema=schema, schema_overrides=schema_overrides, strict=strict
    396     )

File ~/.local/share/virtualenvs/sc-api-DipHMMiG/lib/python3.12/site-packages/polars/_utils/construction/dataframe.py:1331, in numpy_to_pydf(data, schema, schema_overrides, orient, strict, nan_to_null)
   1328 else:
   1329     if orient == "row":
   1330         data_series = [
-> 1331             pl.Series(
   1332                 name=column_names[i],
   1333                 values=(
   1334                     data
   1335                     if two_d and n_columns == 1 and shape[1] > 1
   1336                     else data[:, i]
   1337                 ),
   1338                 dtype=schema_overrides.get(column_names[i]),
   1339                 strict=strict,
   1340                 nan_to_null=nan_to_null,
   1341             )._s
   1342             for i in range(n_columns)
   1343         ]
   1344     else:
   1345         data_series = [
   1346             pl.Series(
   1347                 name=column_names[i],
   (...)
   1355             for i in range(n_columns)
   1356         ]

File ~/.local/share/virtualenvs/sc-api-DipHMMiG/lib/python3.12/site-packages/polars/series/series.py:319, in Series.__init__(self, name, values, dtype, strict, nan_to_null)
    316             return
    318     if dtype is not None:
--> 319         self._s = self.cast(dtype, strict=strict)._s
    321 elif _check_for_pyarrow(values) and isinstance(
    322     values, (pa.Array, pa.ChunkedArray)
    323 ):
    324     self._s = arrow_to_pyseries(name, values, dtype=dtype, strict=strict)

File ~/.local/share/virtualenvs/sc-api-DipHMMiG/lib/python3.12/site-packages/polars/series/series.py:3992, in Series.cast(self, dtype, strict, wrap_numerical)
   3990 # Do not dispatch cast as it is expensive and used in other functions.
   3991 dtype = parse_into_dtype(dtype)
-> 3992 return self._from_pyseries(self._s.cast(dtype, strict, wrap_numerical))

ComputeError: cannot cast 'Object' type

Issue description

A DataFrame with a non-pl.Object schema cannot be created from NumPy arrays with dtype object.
Working in NumPy with `np.object_` is indispensable when other columns are strings or nested arrays, or to represent nulls with NaN in integer columns.
However, since Polars dtypes are per column, I guess it should support casting each column to a concrete dtype.
Related to #17484

Expected behavior

It should work the same as

pl.DataFrame(np.array([[5.6]]), schema={"a": pl.Float64}, orient="row")

Installed versions

In [53]: pl.show_versions()
--------Version info---------
Polars:               1.2.1
Index type:           UInt32
Platform:             Linux-6.8.0-31-generic-x86_64-with-glibc2.39
Python:               3.12.3 (main, Apr 11 2024, 10:16:04) [GCC 13.2.0]

----Optional dependencies----
numpy:                1.26.4
pandas:               2.2.2
pyarrow:              16.1.0

Same with numpy 2.0.1

@dpinol added the bug, needs triage, and python labels on Jul 23, 2024
@deanm0000
Collaborator

This isn't a bug, as the error message says ComputeError: cannot cast 'Object' type. You've got to do the conversion on the numpy/python side from object to a strict dtype that polars can work with.

For instance if you have

x=np.array([[[5], 7]], np.object_)

then you could do

pl.DataFrame({
    'a':list(x[:,0]),
    'b':x[:,1].astype(np.float64)
})
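
The same idea can be generalized a little. The following is a minimal sketch, not part of the Polars API: `rows_to_columns` and `frame_from_object_rows` are hypothetical helper names, and the approach assumes that every column of the object array holds values Polars can build from a plain Python list under the given schema.

```python
def rows_to_columns(rows, names):
    """Split row-oriented data into one plain Python list per column.

    `rows` can be a list of rows or a 2-D NumPy object array; indexing a
    row of an object array returns the original Python objects.
    """
    return {name: [row[i] for row in rows] for i, name in enumerate(names)}


def frame_from_object_rows(rows, schema):
    """Build a DataFrame from row-oriented object data and an explicit schema.

    Hypothetical helper: the column names are taken from the schema keys,
    in order, and each column is handed to Polars as a Python list.
    """
    import polars as pl  # imported lazily so rows_to_columns works without Polars

    return pl.DataFrame(rows_to_columns(rows, list(schema)), schema=schema)
```

With `x` defined as above, `frame_from_object_rows(x, {"a": pl.List(pl.Int64), "b": pl.Int64})` should take the same route as the snippet above, with the conversion away from `np.object_` done on the Python side before Polars sees the data.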

@deanm0000 closed this as not planned on Aug 15, 2024
@deanm0000 added the invalid label and removed the needs triage label on Aug 15, 2024
@dpinol
Contributor Author

dpinol commented Aug 15, 2024

@deanm0000 thanks for the workaround.
Unfortunately, as I mentioned, how can I then import null values for string columns? With polars 0.x it was possible through `None`s, but with polars 1.x I still haven't found a way.

In [34]: x=np.array([["7"], [np.nan]], np.object_); pl.DataFrame({
    ...:     'b':x[:,0].astype(np.str_)
    ...: }, nan_to_null=True, schema={"b": pl.String})
Out[34]: 
shape: (2, 1)
┌─────┐
│ b   │
│ --- │
│ str │
╞═════╡
│ 7   │
│ nan │
└─────┘

In [35]: x=np.array([["7"], [None]], np.object_); pl.DataFrame({
    ...:     'b':x[:,0].astype(np.str_)
    ...: }, nan_to_null=True, schema={"b": pl.String})
Out[35]: 
shape: (2, 1)
┌──────┐
│ b    │
│ ---  │
│ str  │
╞══════╡
│ 7    │
│ None │
└──────┘

thanks
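
In both outputs the missing value is already gone before Polars sees the data: `.astype(np.str_)` stringifies every element, so `np.nan` becomes "nan" and `None` becomes "None". One possible way around this, sketched below under the assumption that the column holds only strings, `None`, and float NaN, is to skip the NumPy string cast and pass a plain Python list with `None` for the missing entries. `nan_to_none` and `string_series_with_nulls` are hypothetical helper names, not Polars functions.

```python
import math


def nan_to_none(values):
    """Return a list in which float NaN entries are replaced by None."""
    return [
        None if isinstance(v, float) and math.isnan(v) else v
        for v in values
    ]


def string_series_with_nulls(name, values):
    """Build a pl.String Series, keeping missing entries as nulls.

    Hypothetical helper: assumes the remaining values are already strings.
    """
    import polars as pl  # imported lazily so nan_to_none works without Polars

    return pl.Series(name, nan_to_none(values), dtype=pl.String)
```

For the arrays above this would be called as `string_series_with_nulls("b", x[:, 0])`, since iterating over a 1-D object array yields the original Python objects.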
