pyo3_runtime.PanicException: DataType: [] not supported in writing to csv #6038

Closed
2 tasks done
ovcharenko opened this issue Jan 4, 2023 · 18 comments · Fixed by #6040
Labels
bug (Something isn't working), python (Related to Python Polars)

Comments

@ovcharenko

Polars version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of Polars.

Issue description

Exporting a DataFrame that contains a List column with write_csv() raises the exception in the title.

Reproducible example

import polars as pl

df = pl.DataFrame({
    "text": ["sample1"],
    "list": [[1, 2]]
})

df.write_csv()

Expected behavior

Similar to what you get from Pandas:

>>> df.to_pandas().to_csv(index=False)
'text,list\nsample1,[1 2]\n'

Installed versions

---Version info---
Polars: 0.15.11
Index type: UInt32
Platform: macOS-13.1-arm64-arm-64bit
Python: 3.10.9 (main, Dec 15 2022, 17:11:09) [Clang 14.0.0 (clang-1400.0.29.202)]
---Optional dependencies---
pyarrow: 10.0.1
pandas: 1.5.2
numpy: 1.23.5
fsspec: <not installed>
connectorx: <not installed>
xlsx2csv: <not installed>
matplotlib: <not installed>
ovcharenko added the bug (Something isn't working) and python (Related to Python Polars) labels on Jan 4, 2023
@ritchie46
Member

CSVs should not have list values. Flatten your data or use another format, such as Arrow, Parquet, or JSON.

@ovcharenko
Author

What are you referencing when saying so?

@ritchie46
Member

ritchie46 commented Jan 4, 2023

We follow RFC 4180 (https://datatracker.ietf.org/doc/html/rfc4180) as the closest thing to a reference on what is allowed in CSV.

CSV is not a format well suited for nested data. You can encode your data in a string column and then serialize that later, but that is not something we will support.

It is best to use a format designed to work with nested data or, if you want to use CSV, to transform your table to long format.

We can improve the error and suggest other formats.
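
As a library-agnostic illustration of the "long format" suggestion, here is a minimal sketch using only the Python standard library. The helper name `to_long` and the sample rows are assumptions made for this example and are not part of Polars. In Polars itself, `df.explode("list")` should give the same reshaping before calling `write_csv()`.

```python
import csv
import io

# One wide row: a scalar column plus a list-valued column.
rows = [{"text": "sample1", "list": [1, 2]}]


def to_long(rows, list_key):
    """Yield one flat row per list element, repeating the other columns."""
    for row in rows:
        for item in row[list_key]:
            yield {**row, list_key: item}


buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["text", "list"], lineterminator="\n")
writer.writeheader()
writer.writerows(to_long(rows, "list"))
long_csv = buffer.getvalue()
print(long_csv)
```

Every cell is now a scalar, so the output is plain CSV that any reader can parse. Note that in this sketch a row with an empty list produces no output rows at all, which may or may not be what a consumer expects.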

@ovcharenko
Author

I thought so. But this RFC doesn't say anything about not allowing lists. Just trying to understand the reason...

@ritchie46
Member

CSV is ill-suited for nested data, so we do not support it. We prefer to focus on and improve the data structures that are well suited to a given task. There are good alternatives: JSON, Parquet, and IPC.

@ghuls
Collaborator

ghuls commented Jan 4, 2023

You can serialize your list by first casting it to a list of strings and then joining it with a delimiter that differs from your column delimiter and does not appear in your list data:

In [73]: df.with_columns([pl.col("list").cast(pl.List(pl.Utf8)).arr.join(";")])
Out[73]: 
shape: (1, 2)
┌─────────┬──────┐
│ text    ┆ list │
│ ---     ┆ ---  │
│ str     ┆ str  │
╞═════════╪══════╡
│ sample1 ┆ 1;2  │
└─────────┴──────┘

In [74]: df.with_columns([pl.col("list").cast(pl.List(pl.Utf8)).arr.join(";")]).write_csv()
Out[74]: 'text,list\nsample1,1;2\n'

In [75]: pl.read_csv(b'text,list\nsample1,1;2\n', sep=",").with_columns([pl.col("list").str.split(";").cast(pl.List(pl.Int64))])
Out[75]: 
shape: (1, 2)
┌─────────┬───────────┐
│ text    ┆ list      │
│ ---     ┆ ---       │
│ str     ┆ list[i64] │
╞═════════╪═══════════╡
│ sample1 ┆ [1, 2]    │
└─────────┴───────────┘

@ovcharenko
Author

Thanks, but that only works if I am also the person who reads the files :) I was looking for a Pandas replacement, and Polars is very attractive. Alas, I can't do it "the Polars way" when it comes to the final output. And converting to Pandas just to get lists into a CSV seems... odd.

@ritchie46
Member

Why don't you use a file format that is designed for nested data?

@ovcharenko
Author

Legacy support. Anyway, I can live with Pandas.

@ghuls
Collaborator

ghuls commented Jan 4, 2023

With Pandas you would not be able to read that CSV data back as a list column either: it writes [1 2] and, without post-processing, reads it back as just a string column.

import io

import pandas as pd
import polars as pl

In [111]: df.to_pandas().to_csv()
Out[111]: ',text,list\n0,sample1,[1 2]\n'

In [125]: df_pd = pd.read_csv(io.StringIO(df.to_pandas().to_csv()))

In [126]: df_pd
Out[126]: 
   Unnamed: 0     text   list
0           0  sample1  [1 2]

In [127]: df_pd["list"][0]
Out[127]: '[1 2]'
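
To make the "post-processing" concrete, here is a minimal sketch of turning such a cell back into a list of integers. The function name `parse_space_list` is an assumption for this example. It only handles flat, space-separated integer lists like the `[1 2]` shown above, not nested lists or string elements.

```python
def parse_space_list(cell):
    """Parse a cell such as '[1 2]' into a list of ints."""
    inner = cell.strip()
    if not (inner.startswith("[") and inner.endswith("]")):
        raise ValueError(f"not a bracketed list: {cell!r}")
    # split() with no arguments tolerates repeated spaces and an empty body.
    return [int(token) for token in inner[1:-1].split()]


parsed = parse_space_list("[1 2]")
print(parsed)
```

Whoever reads the file has to know this convention in advance, since nothing in the CSV itself marks the column as a list.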

@ovcharenko
Author

I know! That's the point. I can't do that in Polars without Pandas.

@ritchie46
Member

ritchie46 commented Jan 4, 2023

You see that the datatype read by pandas is a string, not a list<i64>?

@ovcharenko
Author

Sure. Why? Is there any way to convert the list column to the same string using Polars? Because it's not so obvious...

>>> df.with_columns([pl.col("list").str.decode("utf8")])
...
ValueError: encoding must be one of {'hex', 'base64'}, got utf8

@ovcharenko
Author

So it looks like this or similar is running under the hood during CSV export:

>>> df.with_columns([(pl.col("list") + "")])
...
pyo3_runtime.PanicException: this operation is not implemented/valid for this dtype: List(Int64)

When a more appropriate way would be:

>>> df.apply(lambda t: (t[0], str(t[1])))
shape: (1, 2)
┌──────────┬──────────┐
│ column_0 ┆ column_1 │
│ ---      ┆ ---      │
│ str      ┆ str      │
╞══════════╪══════════╡
│ sample1  ┆ [1, 2]   │
└──────────┴──────────┘
>>>

But running internally without UDFs.

@ghuls
Collaborator

ghuls commented Jan 4, 2023

In [129]: df.with_columns([(pl.lit("[") + pl.col("list").cast(pl.List(pl.Utf8)).arr.join(" ") + pl.lit("]")).alias("list")])
Out[129]: 
shape: (1, 2)
┌─────────┬───────┐
│ text    ┆ list  │
│ ---     ┆ ---   │
│ str     ┆ str   │
╞═════════╪═══════╡
│ sample1 ┆ [1 2] │
└─────────┴───────┘

Or a bit nicer wrapped in a function:

def list_to_str(df, list_col_name):
    return df.with_column(
        (
            pl.lit("[") + pl.col(list_col_name).cast(pl.List(pl.Utf8)).arr.join(" ") + pl.lit("]")
        ).alias(list_col_name)
    )

In [135]: df.pipe(list_to_str, "list")
Out[135]: 
shape: (1, 2)
┌─────────┬───────┐
│ text    ┆ list  │
│ ---     ┆ ---   │
│ str     ┆ str   │
╞═════════╪═══════╡
│ sample1 ┆ [1 2] │
└─────────┴───────┘

@ovcharenko
Author

That doesn't solve the issue with nested lists, but at least it demonstrates why this is not so trivial to resolve.
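
For nested lists, one possible approach (an assumption here, not something Polars does or endorses in this thread) is to store each list cell as a JSON string and let the CSV layer handle quoting. The sketch below uses only the standard library, and the sample data is made up for illustration.

```python
import csv
import io
import json

# A row whose second column holds a nested list.
rows = [("sample1", [[1, 2], [3]])]

# Write: each list becomes a JSON string; csv.writer quotes it because it contains commas.
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(["text", "list"])
for text, nested in rows:
    writer.writerow([text, json.dumps(nested)])
csv_text = buffer.getvalue()

# Read: the cell comes back as a string and must be decoded explicitly.
reader = csv.reader(io.StringIO(csv_text))
header = next(reader)
restored = [(text, json.loads(cell)) for text, cell in reader]
print(csv_text)
print(restored)
```

The file remains valid CSV, but the list column is still only a string to any consumer that does not know to JSON-decode it, which is the limitation discussed above.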

@shamy1997

I faced the same problem. Using lists is common, but Polars makes it hard. Maybe Polars can take this into account to improve.

@rafmagns-skepa-dreag

rafmagns-skepa-dreag commented Feb 22, 2023

This would be valuable to me so that I can serialize my dataframe for use in COPY operations into Postgres (I sometimes work with array and JSON columns). I'm not aware of any easy way to do this, or of any implementation of a Polars dataframe to Postgres binary COPY format. The function above mostly works for me to get into a CSV, though, so thanks! (Also, apologies for commenting on a closed issue.)
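
For the Postgres use case, a rough sketch of formatting integer lists as PostgreSQL array literals (`{1,2}`) before writing CSV might look like the following. The helper `to_pg_array` is hypothetical and handles only integers, None, and nested lists. String elements would need additional quoting and escaping, which is deliberately left out here.

```python
import csv
import io


def to_pg_array(values):
    """Format a (possibly nested) list of ints as a PostgreSQL array literal."""
    parts = []
    for value in values:
        if value is None:
            parts.append("NULL")
        elif isinstance(value, list):
            parts.append(to_pg_array(value))
        else:
            parts.append(str(int(value)))
    return "{" + ",".join(parts) + "}"


buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(["text", "list"])
# The literal contains commas, so csv.writer wraps the cell in double quotes.
writer.writerow(["sample1", to_pg_array([1, 2])])
copy_ready_csv = buffer.getvalue()
print(copy_ready_csv)
```

This targets the text CSV form of COPY only. It does not address the binary COPY format mentioned above.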
