bug(format): converting from Arrow dictionary type is not working #8207

Closed
1 task done
buhrmann opened this issue Feb 3, 2024 · 1 comment
Labels
  • bug: Incorrect behavior inside of ibis
  • datatypes: Issues relating to ibis's datatypes (under `ibis.expr.datatypes`)
  • format: Issues related to output formats

Comments


buhrmann commented Feb 3, 2024

What happened?

Ibis fails with a KeyError when importing an Arrow table; it doesn't seem to support categorical/dictionary data:

import pandas as pd
import pyarrow as pa

df = pd.Series(list("abc"), dtype="category").to_frame()
tbl = pa.Table.from_pandas(df)
print(tbl)

icon = ibis.duckdb.connect()
icon.create_table("dataset", obj=tbl)
File ~/micromamba/envs/grapy/lib/python3.9/site-packages/ibis/formats/pyarrow.py:134, in PyArrowType.to_ibis(cls, typ, nullable)
    132     return dt.JSON()
    133 else:
--> 134     return _from_pyarrow_types[typ](nullable=nullable)

KeyError: DictionaryType(dictionary<values=string, indices=int8, ordered=0>)

What version of ibis are you using?

7.2.0

What backend(s) are you using, if any?

DuckDB

Relevant log output

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Cell In[60], line 9
      6 print(tbl)
      8 icon = ibis.duckdb.connect()
----> 9 icon.create_table("dataset", obj=tbl)

File ~/micromamba/envs/grapy/lib/python3.9/site-packages/ibis/backends/base/sql/alchemy/__init__.py:292, in BaseAlchemyBackend.create_table(self, name, obj, schema, database, temp, overwrite)
    289 import pyarrow_hotfix  # noqa: F401
    291 if isinstance(obj, (pd.DataFrame, pa.Table)):
--> 292     obj = ibis.memtable(obj)
    294 if database == self.current_database:
    295     # avoid fully qualified name
    296     database = None

File ~/micromamba/envs/grapy/lib/python3.9/site-packages/ibis/expr/api.py:438, in memtable(data, columns, schema, name)
    433 if columns is not None and schema is not None:
    434     raise NotImplementedError(
    435         "passing `columns` and schema` is ambiguous; "
    436         "pass one or the other but not both"
    437     )
--> 438 return _memtable(data, name=name, schema=schema, columns=columns)

File ~/micromamba/envs/grapy/lib/python3.9/site-packages/ibis/common/dispatch.py:88, in lazy_singledispatch.<locals>.call(arg, *args, **kwargs)
     86 @functools.wraps(func)
     87 def call(arg, *args, **kwargs):
---> 88     return dispatch(type(arg))(arg, *args, **kwargs)

File ~/micromamba/envs/grapy/lib/python3.9/site-packages/ibis/expr/api.py:456, in _memtable_from_pyarrow_table(data, name, schema, columns)
    452     assert schema is None, "if `columns` is not `None` then `schema` must be `None`"
    453     schema = sch.Schema(dict(zip(columns, sch.infer(data).values())))
    454 return ops.InMemoryTable(
    455     name=name if name is not None else util.gen_name("pyarrow_memtable"),
--> 456     schema=sch.infer(data) if schema is None else schema,
    457     data=PyArrowTableProxy(data),
    458 ).to_expr()

File ~/micromamba/envs/grapy/lib/python3.9/site-packages/ibis/common/dispatch.py:88, in lazy_singledispatch.<locals>.call(arg, *args, **kwargs)
     86 @functools.wraps(func)
     87 def call(arg, *args, **kwargs):
---> 88     return dispatch(type(arg))(arg, *args, **kwargs)

File ~/micromamba/envs/grapy/lib/python3.9/site-packages/ibis/expr/schema.py:279, in infer_pyarrow_table(table, schema)
    276 from ibis.formats.pyarrow import PyArrowSchema
    278 schema = schema if schema is not None else table.schema
--> 279 return PyArrowSchema.to_ibis(schema)

File ~/micromamba/envs/grapy/lib/python3.9/site-packages/ibis/formats/pyarrow.py:216, in PyArrowSchema.to_ibis(cls, schema)
    213 @classmethod
    214 def to_ibis(cls, schema: pa.Schema) -> Schema:
    215     """Convert a pyarrow schema to a schema."""
--> 216     fields = [(f.name, PyArrowType.to_ibis(f.type, f.nullable)) for f in schema]
    217     return Schema.from_tuples(fields)

File ~/micromamba/envs/grapy/lib/python3.9/site-packages/ibis/formats/pyarrow.py:216, in <listcomp>(.0)
    213 @classmethod
    214 def to_ibis(cls, schema: pa.Schema) -> Schema:
    215     """Convert a pyarrow schema to a schema."""
--> 216     fields = [(f.name, PyArrowType.to_ibis(f.type, f.nullable)) for f in schema]
    217     return Schema.from_tuples(fields)

File ~/micromamba/envs/grapy/lib/python3.9/site-packages/ibis/formats/pyarrow.py:134, in PyArrowType.to_ibis(cls, typ, nullable)
    132     return dt.JSON()
    133 else:
--> 134     return _from_pyarrow_types[typ](nullable=nullable)

KeyError: DictionaryType(dictionary<values=string, indices=int8, ordered=0>)

Code of Conduct

  • I agree to follow this project's Code of Conduct
@buhrmann buhrmann added the bug Incorrect behavior inside of ibis label Feb 3, 2024
@kszucs kszucs changed the title bug: Arrow import not working bug(format): converting from Arrow dictionary type is not working Feb 3, 2024
Member

kszucs commented Feb 3, 2024

Thanks @buhrmann for the report!

Confirmed. A smaller reproducer not involving the duckdb backend is to use ibis.memtable():

import ibis
import pandas as pd
import pyarrow as pa

df = pd.Series(list("abc"), dtype="category").to_frame()
tbl = pa.Table.from_pandas(df)

ibis.memtable(tbl)

The issue is caused by PyArrowType.to_ibis() not handling the Arrow dictionary type.

@kszucs kszucs added the format Issues related to output formats label Feb 3, 2024
@cpcloud cpcloud added the datatypes Issues relating to ibis's datatypes (under `ibis.expr.datatypes`) label Feb 5, 2024
@kszucs kszucs closed this as completed in 14c4226 Feb 12, 2024
4 participants