feat(steps): handle unknown category for all encoders #73

Closed
jitingxu1 opened this issue Apr 19, 2024 · 2 comments

@jitingxu1
Collaborator

The current encoding implementations ignore unknown categories (values seen at transform time but not at fit time). While we should consider adding an option to handle them in the future, it's not a high priority at the moment.

Opening this issue to record it for future consideration.

The current implementations:

  • CategoricalEncode will convert unknown category to None
  • OneHotEncode will convert all encoded cols to 0, see following example.
  • CountEncode will convert unknown category to 0

For example:

>>> import ibis
>>> import ibisml as ml
>>> import pandas as pd
>>>
>>> t_train = ibis.memtable(
...         {
...             "time": [
...                 pd.Timestamp("2016-05-25 13:30:00.023"),
...                 pd.Timestamp("2016-05-25 13:30:00.023"),
...                 pd.Timestamp("2016-05-25 13:30:00.030"),
...                 pd.Timestamp("2016-05-25 13:30:00.041"),
...                 pd.Timestamp("2016-05-25 13:30:00.048"),
...                 pd.Timestamp("2016-05-25 13:30:00.049"),
...                 pd.Timestamp("2016-05-25 13:30:00.072"),
...                 pd.Timestamp("2016-05-25 13:30:00.075"),
...             ],
...             "ticker": ["GOOG", "MSFT", "MSFT", "MSFT", None, "AAPL", "GOOG", "MSFT"],
...         }
...     )
>>> t_test = ibis.memtable(
...         {
...             "time": [
...                 pd.Timestamp("2016-05-25 13:30:00.023"),
...                 pd.Timestamp("2016-05-25 13:30:00.038"),
...                 pd.Timestamp("2016-05-25 13:30:00.048"),
...                 pd.Timestamp("2016-05-25 13:30:00.049"),
...                 pd.Timestamp("2016-05-25 13:30:00.050"),
...                 pd.Timestamp("2016-05-25 13:30:00.051"),
...             ],
...             "ticker": ["MSFT", "MSFT", "GOOG", "GOOG", "AMZN", None],
...         }
...     )
>>> step = ml.OneHotEncode("ticker")
>>> step.fit_table(t_train, ml.core.Metadata())
>>> res = step.transform_table(t_test)
>>> res

AMZN in the 5th row is unknown, so it is encoded as all 0s:

┏━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┓
┃ time                    ┃ ticker_AAPL ┃ ticker_GOOG ┃ ticker_MSFT ┃ ticker_None ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━┩
│ timestamp               │ int8        │ int8        │ int8        │ int8        │
├─────────────────────────┼─────────────┼─────────────┼─────────────┼─────────────┤
│ 2016-05-25 13:30:00.023 │           0 │           0 │           1 │           0 │
│ 2016-05-25 13:30:00.038 │           0 │           0 │           1 │           0 │
│ 2016-05-25 13:30:00.048 │           0 │           1 │           0 │           0 │
│ 2016-05-25 13:30:00.049 │           0 │           1 │           0 │           0 │
│ 2016-05-25 13:30:00.050 │           0 │           0 │           0 │           0 │
│ 2016-05-25 13:30:00.051 │           0 │           0 │           0 │           1 │
└─────────────────────────┴─────────────┴─────────────┴─────────────┴─────────────┘
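
The same outcome can be mimicked in plain Python to show why an unseen value ends up as all zeros. This is only an illustrative sketch, not how ibisml implements OneHotEncode; the `one_hot` helper and the choice to treat None as its own category are assumptions made to match the table above.

```python
# Plain-Python illustration only; not ibisml's implementation.
train = ["GOOG", "MSFT", "MSFT", "MSFT", None, "AAPL", "GOOG", "MSFT"]
test = ["MSFT", "MSFT", "GOOG", "GOOG", "AMZN", None]

# Categories learned at fit time; None is kept as its own category, placed last.
categories = sorted({c for c in train if c is not None}) + [None]


def one_hot(value):
    # A value outside `categories` matches nothing, so every indicator is 0.
    return [int(value == c) for c in categories]


encoded = [one_hot(v) for v in test]
for value, row in zip(test, encoded):
    print(value, row)
```

AMZN matches none of the fitted categories, so its row carries no signal that distinguishes it from any other unseen value.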
@deepyaman
Collaborator

> CategoricalEncode will convert unknown category to None

I haven't looked into whether this is correct or not, since the step doesn't have any tests; we should definitely add one, and then can identify the correct behaviors. 😅
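
Once the intended behavior is agreed on, a test along these lines could pin it down. This is a minimal sketch against a stand-in encoder: `fit_codes` and `categorical_encode` are hypothetical helpers written for illustration, not ibisml functions, and the "unknown maps to None" expectation is just the behavior reported in this issue, not a confirmed specification.

```python
def fit_codes(train_values):
    # Map each distinct non-null training value to an integer code, in sorted order.
    distinct = sorted({v for v in train_values if v is not None})
    return {category: code for code, category in enumerate(distinct)}


def categorical_encode(values, codes):
    # Unknown values map to None (the behavior reported in this issue);
    # nulls also fall through to None in this stand-in.
    return [codes.get(v) for v in values]


def test_unknown_category_maps_to_none():
    codes = fit_codes(["GOOG", "MSFT", "AAPL", "MSFT"])
    assert categorical_encode(["MSFT", "AMZN"], codes) == [2, None]


test_unknown_category_maps_to_none()
```

A real test would call the step's `fit_table` and `transform_table` instead of these helpers and compare the executed result.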

> OneHotEncode will convert all encoded cols to 0, see following example.

Having a separate category for unknown (i.e. the rest of the encoded column values are all 0) could be a nice option that gives the user more flexibility, but it may not matter for a lot of model types (e.g. GBDT).
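
To make the idea concrete, here is a rough plain-Python sketch of what a separate unknown indicator could look like. The `unknown_column` flag is a hypothetical option invented for this illustration; it is not an existing OneHotEncode parameter, and the function is not ibisml code.

```python
def one_hot_with_unknown(values, categories, unknown_column=False):
    """One-hot encode `values`, optionally adding a trailing unknown indicator.

    `unknown_column` is a hypothetical option, not an existing ibisml parameter.
    """
    rows = []
    for v in values:
        row = [int(v == c) for c in categories]
        if unknown_column:
            # 1 only when the value matched none of the fitted categories.
            row.append(int(v not in categories))
        rows.append(row)
    return rows


categories = ["AAPL", "GOOG", "MSFT", None]
result = one_hot_with_unknown(["MSFT", "AMZN", None], categories, unknown_column=True)
print(result)
```

With the flag off, the output has the same shape as today's behavior, so such an option could be opt-in.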

> CountEncode will convert unknown category to 0

This is intentional, as the count should be 0 for something that has not been seen.
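
The reasoning is easy to see with a frequency table in plain Python. This sketch uses `collections.Counter` as a stand-in and is not ibisml's CountEncode; the `count_encode` helper is hypothetical.

```python
from collections import Counter

train = ["GOOG", "MSFT", "MSFT", "MSFT", None, "AAPL", "GOOG", "MSFT"]

# Frequency of each value seen at fit time.
counts = Counter(train)


def count_encode(value):
    # A value never seen during fitting has, by definition, a count of 0.
    return counts.get(value, 0)


print(count_encode("MSFT"), count_encode("AMZN"))
```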

To me, it seems like the immediate action item is to add a test for CategoricalEncode to make sure its functionality is correct, and (at lower priority) to make the OneHotEncode unknown category handling a bit more flexible.

@deepyaman
Collaborator

@jitingxu1 closing this; if you want to make OneHotEncode unknown category handling a bit more flexible, feel free to create a new issue, but it doesn't seem to be a priority at this time.
