
It is not possible to concatenate arrays of different data types. #15946

Open
david-waterworth opened this issue Apr 29, 2024 · 4 comments
Labels
python Related to Python Polars

Comments

@david-waterworth

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

data = pl.from_dicts(
    [
        {"group_id":0, "frequencies": [{"id":None, "count":10}, {"id":"a", "count":10}]},
        {"group_id":1, "frequencies": [{"id":"b", "count":20}, {"id":None, "count":5}]},
        {"group_id":2, "frequencies": [{"id":None, "count":10}, {"id":"a", "count":3}, {"id":"b", "count":2}, {"id":"c", "count":1}]},
        {"group_id":3, "frequencies": [{"id":None, "count":12}]},
    ]
)

def probabilities(frequencies):
    total = sum(x["count"] for x in frequencies)
    output = [{"id": x["id"], "probability": x["count"] / total} for x in frequencies]

    print(output)
    return output

data.with_columns(
    probabilities=pl.col("frequencies").map_elements(
        probabilities,
        return_dtype=pl.List(pl.Struct({'id': pl.String, 'probability': pl.Float64}))
    )
)

Log output

---------------------------------------------------------------------------
PanicException                            Traceback (most recent call last)
File venv/lib/python3.10/site-packages/polars/expr/expr.py:4130, in Expr._map_batches_wrapper.__call__(self, *args, **kwargs)
   4129 def __call__(self, *args: Any, **kwargs: Any) -> Any:
-> 4130     result = self.function(*args, **kwargs)
   4131     if _check_for_numpy(result) and isinstance(result, np.ndarray):
   4132         result = pl.Series(result, dtype=self.return_dtype)

File .venv/lib/python3.10/site-packages/polars/expr/expr.py:4469, in Expr.map_elements.<locals>.wrap_f(x)
   4467 with warnings.catch_warnings():
   4468     warnings.simplefilter("ignore", PolarsInefficientMapWarning)
-> 4469     return x.map_elements(
   4470         function, return_dtype=return_dtype, skip_nulls=skip_nulls
   4471     )

File .venv/lib/python3.10/site-packages/polars/series/series.py:5333, in Series.map_elements(self, function, return_dtype, skip_nulls)
   5329     pl_return_dtype = py_type_to_dtype(return_dtype)
   5331 warn_on_inefficient_map(function, columns=[self.name], map_target="series")
   5332 return self._from_pyseries(
-> 5333     self._s.apply_lambda(function, pl_return_dtype, skip_nulls)
   5334 )

PanicException: called `Result::unwrap()` on an `Err` value: InvalidOperation(ErrString("It is not possible to concatenate arrays of different data types."))

Issue description

I'm sure this isn't the "canonical" way of achieving what I want, but it's a proof of concept that I intend to rewrite later. I have grouped data; each group contains a frequencies list that maps ids to counts, and I'm trying to convert it to an equivalent list of probabilities.

Each struct has an id and a count key, and the value associated with id may be null; this is the source of the error. When there is only one item in the array (i.e. group_id == 3), the error above is thrown.

In fact, simply returning the input unchanged from the mapped function also triggers the error.

If you comment out the last row of the example (i.e. {"group_id":3...), there is no error.

Expected behavior

I think this should work, even if it's not the recommended approach?

Installed versions

--------Version info---------
Polars:               0.20.23
Index type:           UInt32
Platform:             Linux-5.15.0-102-generic-x86_64-with-glibc2.35
Python:               3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0]

----Optional dependencies----
adbc_driver_manager:  <not installed>
cloudpickle:          2.2.1
connectorx:           <not installed>
deltalake:            <not installed>
fastexcel:            <not installed>
fsspec:               2024.3.1
gevent:               24.2.1
hvplot:               <not installed>
matplotlib:           <not installed>
nest_asyncio:         1.6.0
numpy:                1.26.4
openpyxl:             <not installed>
pandas:               2.2.1
pyarrow:              16.0.0
pydantic:             2.7.1
pyiceberg:            <not installed>
pyxlsb:               <not installed>
sqlalchemy:           <not installed>
xlsx2csv:             <not installed>
xlsxwriter:           <not installed>
@david-waterworth david-waterworth added bug Something isn't working needs triage Awaiting prioritization by a maintainer python Related to Python Polars labels Apr 29, 2024
@david-waterworth
Author

david-waterworth commented Apr 29, 2024

Also note the example frame was generated by polars using

df.group_by(pl.col("group_id")).agg(frequencies=pl.col("id").value_counts(sort=True))

So if this expression can produce a singleton array with a null value (i.e. [{id:null, count:10}]), then it should also be possible to produce that as output from map_elements.

@cmdlineluser
Contributor

It seems there is no "supertype" inference happening on the .map_elements result?

You can modify the example to return a Series instead with the dtype:

return pl.Series(output, dtype=pl.Struct({'id': pl.String, 'probability': pl.Float64}))

As for a canonical approach, there is .list.eval(), if you are not aware of it:

df.with_columns(
    pl.col("frequencies").list.eval(
        pl.struct(
            id = pl.element().struct["id"],
            probability = pl.element().struct["count"] / pl.element().struct["count"].sum()
        )
    )
)

# shape: (4, 2)
# ┌──────────┬─────────────────────────────────────────────────────────┐
# │ group_id ┆ frequencies                                             │
# │ ---      ┆ ---                                                     │
# │ i64      ┆ list[struct[2]]                                         │
# ╞══════════╪═════════════════════════════════════════════════════════╡
# │ 0        ┆ [{null,0.5}, {"a",0.5}]                                 │
# │ 1        ┆ [{"b",0.8}, {null,0.2}]                                 │
# │ 2        ┆ [{null,0.625}, {"a",0.1875}, {"b",0.125}, {"c",0.0625}] │
# │ 3        ┆ [{null,1.0}]                                            │
# └──────────┴─────────────────────────────────────────────────────────┘

Although I imagine there is a simpler approach that avoids .value_counts() altogether.

@david-waterworth
Author

Thanks @cmdlineluser - I was aware of .list.eval() but I misunderstood the meaning of pl.element(). I thought it literally meant a single element, but looking at the source it's an alias for F.col(""), so you can use it in the same way (i.e. pl.element().struct["count"] refers to a single item, while pl.element().struct["count"].sum() aggregates over all items).

Either workaround works - should I close this?

@deanm0000 deanm0000 removed bug Something isn't working needs triage Awaiting prioritization by a maintainer labels May 3, 2024
@cmdlineluser
Contributor

Yeah, the element usage is a bit confusing in some cases.

should I close this?

I'm not sure - it does look like your original example should work.

At the very least, I would think the PanicException needs to be fixed.
