Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Dataframes with categorical columns cannot be interchanged #4652

Closed
honno opened this issue Jul 7, 2022 · 2 comments · Fixed by #4737
Closed

Dataframes with categorical columns cannot be interchanged #4652

honno opened this issue Jul 7, 2022 · 2 comments · Fixed by #4737
Assignees
Labels
bug 🦗 Something isn't working

Comments

@honno
Copy link

honno commented Jul 7, 2022

Interchanging a dataframe with categorical columns results in NotImplementedError. I'm not super familiar with data storage here, but I was wondering if maybe this is actually an unintended bug.

>>> from modin import pandas as pd
>>> df = pd.DataFrame({"foo": pd.Series([0, 1, 2], dtype="category")})
>>> from modin.pandas.utils import from_dataframe
>>> modin_df = from_dataframe(df)
NotImplementedError: See self.describe_null
Full traceback
.../modin/pandas/utils.py:123, in from_dataframe(df)
    120 from modin.core.execution.dispatching.factories.dispatcher import FactoryDispatcher
    121 from .dataframe import DataFrame
--> 123 return DataFrame(query_compiler=FactoryDispatcher.from_dataframe(df))

.../modin/core/execution/dispatching/factories/dispatcher.py:175, in FactoryDispatcher.from_dataframe(cls, *args, **kwargs)
    172 @classmethod
    173 @_inherit_docstrings(factories.BaseFactory._from_dataframe)
    174 def from_dataframe(cls, *args, **kwargs):
--> 175     return cls.__factory._from_dataframe(*args, **kwargs)

.../modin/core/execution/dispatching/factories/factories.py:197, in BaseFactory._from_dataframe(cls, *args, **kwargs)
    189 @classmethod
    190 @doc(
    191     _doc_io_method_template,
   (...)
    195 )
    196 def _from_dataframe(cls, *args, **kwargs):
--> 197     return cls.io_cls.from_dataframe(*args, **kwargs)

.../modin/core/io/io.py:120, in BaseIO.from_dataframe(cls, df)
    105 @classmethod
    106 def from_dataframe(cls, df):
    107     """
    108     Create a Modin QueryCompiler from a DataFrame supporting the DataFrame exchange protocol `__dataframe__()`.
    109 
   (...)
    118         QueryCompiler containing data from the DataFrame.
    119     """
--> 120     return cls.query_compiler_cls.from_dataframe(df, cls.frame_cls)

.../modin/core/storage_formats/pandas/query_compiler.py:278, in PandasQueryCompiler.from_dataframe(cls, df, data_cls)
    276 @classmethod
    277 def from_dataframe(cls, df, data_cls):
--> 278     return cls(data_cls.from_dataframe(df))

.../modin/core/dataframe/pandas/dataframe/dataframe.py:2968, in PandasDataframe.from_dataframe(cls, df)
   2963 from modin.core.dataframe.pandas.exchange.dataframe_protocol.from_dataframe import (
   2964     from_dataframe_to_pandas,
   2965 )
   2967 ErrorMessage.default_to_pandas(message="`from_dataframe`")
-> 2968 pandas_df = from_dataframe_to_pandas(df)
   2969 return cls.from_pandas(pandas_df)

.../modin/core/dataframe/pandas/exchange/dataframe_protocol/from_dataframe.py:68, in from_dataframe_to_pandas(df, n_chunks)
     66 pandas_dfs = []
     67 for chunk in df.get_chunks(n_chunks):
---> 68     pandas_df = protocol_df_chunk_to_pandas(chunk)
     69     pandas_dfs.append(pandas_df)
     71 pandas_df = pandas.concat(pandas_dfs, axis=0, ignore_index=True)

.../modin/core/dataframe/pandas/exchange/dataframe_protocol/from_dataframe.py:111, in protocol_df_chunk_to_pandas(df)
    109     columns[name], buf = primitive_column_to_ndarray(col)
    110 elif dtype == DTypeKind.CATEGORICAL:
--> 111     columns[name], buf = categorical_column_to_series(col)
    112 elif dtype == DTypeKind.STRING:
    113     columns[name], buf = string_column_to_ndarray(col)

.../modin/core/dataframe/pandas/exchange/dataframe_protocol/from_dataframe.py:167, in categorical_column_to_series(col)
    164     raise NotImplementedError("Non-dictionary categoricals not supported yet")
    166 categories = np.array(tuple(mapping.values()))
--> 167 buffers = col.get_buffers()
    169 codes_buff, codes_dtype = buffers["data"]
    170 codes = buffer_to_ndarray(codes_buff, codes_dtype, col.offset, col.size)

.../modin/core/dataframe/pandas/exchange/dataframe_protocol/column.py:313, in PandasProtocolColumn.get_buffers(self)
    311 buffers["data"] = self._get_data_buffer()
    312 try:
--> 313     buffers["validity"] = self._get_validity_buffer()
    314 except NoValidityBuffer:
    315     buffers["validity"] = None

.../modin/core/dataframe/pandas/exchange/dataframe_protocol/column.py:422, in PandasProtocolColumn._get_validity_buffer(self)
    420     msg = "This column uses NaN as null so does not have a separate mask"
    421 else:
--> 422     raise NotImplementedError("See self.describe_null")
    424 raise NoValidityBuffer(msg)

NotImplementedError: See self.describe_null

I assume the message relates to

# Null values for categoricals are stored as `-1` sentinel values
# in the category date (e.g., `col.values.codes` is int8 np.ndarray)
DTypeKind.CATEGORICAL: (ColumnNullType.USE_SENTINEL, -1),

but why can't _get_validity_buffer() raise NoValidityBuffer here instead, so by extension get_buffers() will have buffers["validity"] = None.

try:
buffers["validity"] = self._get_validity_buffer()
except NoValidityBuffer:
buffers["validity"] = None

Wouldn't that be valid given

- "validity": a two-element tuple whose first element is a buffer
containing mask values indicating missing data and
whose second element is the mask value buffer's
associated dtype. None if the null representation is not a bit or byte mask.

i.e. if columns use sentinel values, the validity buffer can just be None.

cc @vnlitvinov

@vnlitvinov vnlitvinov self-assigned this Jul 7, 2022
@vnlitvinov
Copy link
Collaborator

Thanks for the report, I will have a look!

@vnlitvinov vnlitvinov added the bug 🦗 Something isn't working label Jul 7, 2022
pyrito pushed a commit to pyrito/modin that referenced this issue Jul 29, 2022
Signed-off-by: Karthik Velayutham <vkarthik@ponder.io>
@vnlitvinov vnlitvinov assigned pyrito and unassigned vnlitvinov Aug 1, 2022
YarShev pushed a commit that referenced this issue Aug 1, 2022
Signed-off-by: Karthik Velayutham <vkarthik@ponder.io>
@pyrito
Copy link
Collaborator

pyrito commented Aug 1, 2022

@honno your intuition about the issue was correct and we are handling this case properly now. Thanks for opening the issue!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug 🦗 Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants