
It is not possible to concatenate arrays of different data types. #15946

Open
david-waterworth opened this issue Apr 29, 2024 · 4 comments
Labels
python Related to Python Polars

Comments

@david-waterworth

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

data = pl.from_dicts(
    [
        {"group_id":0, "frequencies": [{"id":None, "count":10}, {"id":"a", "count":10}]},
        {"group_id":1, "frequencies": [{"id":"b", "count":20}, {"id":None, "count":5}]},
        {"group_id":2, "frequencies": [{"id":None, "count":10}, {"id":"a", "count":3}, {"id":"b", "count":2}, {"id":"c", "count":1}]},
        {"group_id":3, "frequencies": [{"id":None, "count":12}]},
    ]
)

def probabilities(frequencies):
    total = sum(x["count"] for x in frequencies)
    output = [{"id": x["id"], "probability": x["count"] / total} for x in frequencies]

    print(output)
    return output

data.with_columns(
    probabilities=pl.col("frequencies").map_elements(
        probabilities,
        return_dtype=pl.List(pl.Struct({'id': pl.String, 'probability': pl.Float64}))
    )
)

Log output

---------------------------------------------------------------------------
PanicException                            Traceback (most recent call last)
File venv/lib/python3.10/site-packages/polars/expr/expr.py:4130, in Expr._map_batches_wrapper.__call__(self, *args, **kwargs)
   4129 def __call__(self, *args: Any, **kwargs: Any) -> Any:
-> 4130     result = self.function(*args, **kwargs)
   4131     if _check_for_numpy(result) and isinstance(result, np.ndarray):
   4132         result = pl.Series(result, dtype=self.return_dtype)

File .venv/lib/python3.10/site-packages/polars/expr/expr.py:4469, in Expr.map_elements.<locals>.wrap_f(x)
   4467 with warnings.catch_warnings():
   4468     warnings.simplefilter("ignore", PolarsInefficientMapWarning)
-> 4469     return x.map_elements(
   4470         function, return_dtype=return_dtype, skip_nulls=skip_nulls
   4471     )

File .venv/lib/python3.10/site-packages/polars/series/series.py:5333, in Series.map_elements(self, function, return_dtype, skip_nulls)
   5329     pl_return_dtype = py_type_to_dtype(return_dtype)
   5331 warn_on_inefficient_map(function, columns=[self.name], map_target="series")
   5332 return self._from_pyseries(
-> 5333     self._s.apply_lambda(function, pl_return_dtype, skip_nulls)
   5334 )

PanicException: called `Result::unwrap()` on an `Err` value: InvalidOperation(ErrString("It is not possible to concatenate arrays of different data types."))

Issue description

I'm sure this isn't the "canonical" way of achieving what I want, but it's a proof of concept that I intend to rewrite later. I have grouped data; each group contains a frequencies list that maps ids to counts, and I'm trying to convert it to an equivalent list of probabilities.

Each struct has an id and a count key, and the value associated with id may be null; this is the source of the error. When there is only one item in the array (i.e. group_id == 3), the error above is thrown.

In fact, simply returning the input unchanged from the mapped function also triggers the error.

If you comment out the last row of the example (i.e. {"group_id":3...), there is no error.

Expected behavior

I think this should work, even if it's not the recommended approach?

Installed versions

--------Version info---------
Polars:               0.20.23
Index type:           UInt32
Platform:             Linux-5.15.0-102-generic-x86_64-with-glibc2.35
Python:               3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0]

----Optional dependencies----
adbc_driver_manager:  <not installed>
cloudpickle:          2.2.1
connectorx:           <not installed>
deltalake:            <not installed>
fastexcel:            <not installed>
fsspec:               2024.3.1
gevent:               24.2.1
hvplot:               <not installed>
matplotlib:           <not installed>
nest_asyncio:         1.6.0
numpy:                1.26.4
openpyxl:             <not installed>
pandas:               2.2.1
pyarrow:              16.0.0
pydantic:             2.7.1
pyiceberg:            <not installed>
pyxlsb:               <not installed>
sqlalchemy:           <not installed>
xlsx2csv:             <not installed>
xlsxwriter:           <not installed>
@david-waterworth david-waterworth added bug Something isn't working needs triage Awaiting prioritization by a maintainer python Related to Python Polars labels Apr 29, 2024
@david-waterworth
Author

david-waterworth commented Apr 29, 2024

Also note the example frame was generated by polars using

df.group_by(pl.col("group_id")).agg(frequencies=pl.col("id").value_counts(sort=True))

So if this expression can produce a singleton array with a null value (i.e. [{id:null, count:10}]), then it should also be possible to produce that as output from map_elements.

@cmdlineluser
Contributor

It seems there is no "supertype" inference happening on the .map_elements result?

You can modify the example to return a Series instead with the dtype:

return pl.Series(output, dtype=pl.Struct({'id': pl.String, 'probability': pl.Float64}))

As for a canonical approach, there is .list.eval(), if you are not aware of it:

df.with_columns(
    pl.col("frequencies").list.eval(
        pl.struct(
            id = pl.element().struct["id"],
            probability = pl.element().struct["count"] / pl.element().struct["count"].sum()
        )
    )
)

# shape: (4, 2)
# ┌──────────┬─────────────────────────────────────────────────────────┐
# │ group_id ┆ frequencies                                             │
# │ ---      ┆ ---                                                     │
# │ i64      ┆ list[struct[2]]                                         │
# ╞══════════╪═════════════════════════════════════════════════════════╡
# │ 0        ┆ [{null,0.5}, {"a",0.5}]                                 │
# │ 1        ┆ [{"b",0.8}, {null,0.2}]                                 │
# │ 2        ┆ [{null,0.625}, {"a",0.1875}, {"b",0.125}, {"c",0.0625}] │
# │ 3        ┆ [{null,1.0}]                                            │
# └──────────┴─────────────────────────────────────────────────────────┘

Although I imagine there is a simpler approach that avoids .value_counts() altogether.

@david-waterworth
Author

Thanks @cmdlineluser - I was aware of .list.eval() but I misunderstood the meaning of pl.element(). I thought it literally meant a single element, but looking at the source it's an alias for F.col(""), so you can use it in the same way (i.e. pl.element().struct["count"] refers to a single item, while pl.element().struct["count"].sum() aggregates over all items).

Either workaround works - should I close this?

@deanm0000 deanm0000 removed bug Something isn't working needs triage Awaiting prioritization by a maintainer labels May 3, 2024
@cmdlineluser
Contributor

Yeah, the element usage is a bit confusing in some cases.

should I close this?

I'm not sure - it does look like your original example should work.

At the very least, I would think the PanicException needs to be fixed.
