Bug in categorical_encoders.py ? #18

wrannaman · 2021-06-06T23:17:47Z

Love the repo! I'm trying to use this to predict customer behavior and have a data set with some continuous and some categorical data as well as some dates.

Describe the bug
When I have a date column and set encode_date_columns=True for a classification objective, the following error occurs:

Traceback (most recent call last):
  File "1.py", line 53, in <module>
    tabular_model.fit(train=train, validation=val)
  File "/Users/andrewpierno/opt/anaconda3/envs/churn/lib/python3.7/site-packages/pytorch_tabular/tabular_model.py", line 455, in fit
    reset,
  File "/Users/andrewpierno/opt/anaconda3/envs/churn/lib/python3.7/site-packages/pytorch_tabular/tabular_model.py", line 383, in _pre_fit
    train, validation, test, target_transform, train_sampler
  File "/Users/andrewpierno/opt/anaconda3/envs/churn/lib/python3.7/site-packages/pytorch_tabular/tabular_model.py", line 294, in _prepare_dataloader
    self.datamodule.setup("fit")
  File "/Users/andrewpierno/opt/anaconda3/envs/churn/lib/python3.7/site-packages/pytorch_lightning/core/datamodule.py", line 92, in wrapped_fn
    return fn(*args, **kwargs)
  File "/Users/andrewpierno/opt/anaconda3/envs/churn/lib/python3.7/site-packages/pytorch_tabular/tabular_datamodule.py", line 267, in setup
    self.validation, stage="inference"
  File "/Users/andrewpierno/opt/anaconda3/envs/churn/lib/python3.7/site-packages/pytorch_tabular/tabular_datamodule.py", line 182, in preprocess_data
    data = self.categorical_encoder.transform(data)
  File "/Users/andrewpierno/opt/anaconda3/envs/churn/lib/python3.7/site-packages/pytorch_tabular/categorical_encoders.py", line 41, in transform
    assert all(c in X.columns for c in self.cols)
AssertionError

I put some logging in around X.columns and self.cols

X is:
Index(['interval', 'amount', 'status', 'target', '_Month', '_Quarter',
       '_Is_quarter_end', '_Is_year_end'],
      dtype='object')
self cols is:
['interval', 'status', '_Month', '_Quarter', '_Is_quarter_end', '_Is_year_end']
[Next call happens immediately after]
X is:
Index(['interval', 'amount', 'status', 'target', '_Month', '_Quarter',
       '_Is_quarter_start', '_Is_year_start'],
      dtype='object')
self cols is:
['interval', 'status', '_Month', '_Quarter', '_Is_quarter_end', '_Is_year_end']
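
The mismatch in the logged output above can be made explicit with a small check. This is only a diagnostic sketch, not part of pytorch_tabular: the helper name missing_encoder_columns is hypothetical, and the two lists are copied from the logs above. It reports which columns the fitted encoder expects but the second frame does not have.

```python
# Diagnostic sketch (hypothetical helper, not a pytorch_tabular API).
def missing_encoder_columns(frame_columns, encoder_cols):
    """Return the encoder columns that are absent from the frame."""
    present = set(frame_columns)
    return [c for c in encoder_cols if c not in present]


# Columns the encoder was fitted on (copied from the "self cols" log line).
encoder_cols = ['interval', 'status', '_Month', '_Quarter',
                '_Is_quarter_end', '_Is_year_end']

# Columns of the frame passed in the second call (copied from the log).
second_call_columns = ['interval', 'amount', 'status', 'target', '_Month',
                       '_Quarter', '_Is_quarter_start', '_Is_year_start']

missing = missing_encoder_columns(second_call_columns, encoder_cols)
print(missing)  # → ['_Is_quarter_end', '_Is_year_end']
```

So the second frame carries _Is_quarter_start and _Is_year_start where the encoder expects _Is_quarter_end and _Is_year_end, which is why the assertion fails.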

To Reproduce
Steps to reproduce the behavior:

Here is the script

from pytorch_tabular import TabularModel
from pytorch_tabular.models import CategoryEmbeddingModelConfig
from pytorch_tabular.config import DataConfig, OptimizerConfig, TrainerConfig
from sklearn.model_selection import train_test_split
import pandas as pd


df = pd.read_csv('./repro.csv')
data = df

target_cols = ['target'] 
cat_col_names = ['interval', 'status']
continuous_col_names = ['amount']
date_col_list = [('date', 'M')] # Note: other timeframes don't work either.

train, test = train_test_split(data, random_state=42)
train, val = train_test_split(train, random_state=42)

data_config = DataConfig(
    target=target_cols, #target should always be a list. Multi-targets are only supported for regression. Multi-Task Classification is not implemented
    continuous_cols=continuous_col_names,
    categorical_cols=cat_col_names,
    date_columns=date_col_list,
    encode_date_columns=True,
    #    validation_split = 0.2,         #80% Train + test 20% validation
    num_workers=8,
    # continuous_feature_transform="quantile_normal",
)


trainer_config = TrainerConfig(
    auto_lr_find=True, # Runs the LRFinder to automatically derive a learning rate
    batch_size=8,
    max_epochs=100,
    gpus=0, # index of the GPU to use. 0, means CPU
)
optimizer_config = OptimizerConfig()

model_config = CategoryEmbeddingModelConfig(
    task="classification",
    layers="1024-512-512",  # Number of nodes in each layer
    activation="LeakyReLU", # Activation between each layers
    learning_rate = 1e-3
)

tabular_model = TabularModel(
    data_config=data_config,
    model_config=model_config,
    optimizer_config=optimizer_config,
    trainer_config=trainer_config,
)

tabular_model.fit(train=train, validation=val)
result = tabular_model.evaluate(test)
pred_df = tabular_model.predict(test)
tabular_model.save_model("examples/basic")
# loaded_model = TabularModel.load_from_checkpoint("examples/basic")
# result = loaded_model.evaluate(test)

Here is the data:

https://docs.google.com/spreadsheets/d/1jfV_p0pRXv0zkQLvaQXuDvVtw21FblT7YK83760SPDQ/edit?usp=sharing

Run example.py and see the error assert all(c in X.columns for c in self.cols)

Expected behavior
assert all(c in X.columns for c in self.cols) should pass when using date_columns and encode_date_columns=True


Desktop (please complete the following information):
ios


manujosephv · 2021-06-10T01:11:31Z

Thanks @wrannaman for the excellent issue. I'll take a look at it as soon as I get some time.

manujosephv · 2021-06-19T14:58:40Z

@wrannaman This should be fixed now in the develop branch. I'm not able to publish to PyPI yet because my CI/CD pipeline is currently on travis-ci.org; I need to migrate to travis-ci.com or GitHub Actions.

manujosephv · 2021-06-20T07:31:48Z

Fixed in v0.6, which is now on PyPI. Can you check and report back? Also, gpus should be None for CPU and -1 to use all GPUs.

wrannaman · 2021-06-23T16:44:03Z

I changed to gpus=None and now get a new error. Same code and dataset as above. Not sure if it's something on my end, though.

Traceback (most recent call last):
  File "/Users/andrewpierno/opt/anaconda3/envs/churn/lib/python3.7/site-packages/tqdm/std.py", line 1145, in __del__
  File "/Users/andrewpierno/opt/anaconda3/envs/churn/lib/python3.7/site-packages/tqdm/std.py", line 1299, in close
  File "/Users/andrewpierno/opt/anaconda3/envs/churn/lib/python3.7/site-packages/tqdm/std.py", line 1492, in display
  File "/Users/andrewpierno/opt/anaconda3/envs/churn/lib/python3.7/site-packages/tqdm/std.py", line 1148, in __str__
  File "/Users/andrewpierno/opt/anaconda3/envs/churn/lib/python3.7/site-packages/tqdm/std.py", line 1450, in format_dict
TypeError: cannot unpack non-iterable NoneType object

manujosephv · 2021-06-24T00:47:26Z

Is that the entire stack trace? If not, can you share the whole thing so that I can see where the error is coming from? Also, please share the versions of the libraries you are using.

I run my unit tests with gpus=None, so that shouldn't be a problem.
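
One way to gather the library versions asked for above is sketched below. The helper name installed_versions and the list of packages are assumptions chosen for illustration (they are the libraries that appear in the tracebacks in this thread); the sketch relies only on the common, but not universal, convention that a package exposes a __version__ attribute.

```python
import importlib
import sys


def installed_versions(module_names):
    """Map each module name to its __version__, or a placeholder string."""
    versions = {}
    for name in module_names:
        try:
            module = importlib.import_module(name)
        except ImportError:
            versions[name] = "not installed"
            continue
        # Not every package defines __version__, so fall back to "unknown".
        versions[name] = str(getattr(module, "__version__", "unknown"))
    return versions


if __name__ == "__main__":
    print("python", sys.version.split()[0])
    # Assumed list: the packages seen in the tracebacks above.
    names = ["pytorch_tabular", "pytorch_lightning", "torch", "pandas", "tqdm"]
    for name, version in installed_versions(names).items():
        print(name, version)
```

Pasting that output next to the full stack trace gives the environment details in one place.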

manujosephv · 2021-09-01T12:22:03Z

Closing because of non-response.

manujosephv self-assigned this Jun 10, 2021

manujosephv added the "good first issue" label Jun 10, 2021

manujosephv removed the "good first issue" label Jun 24, 2021

manujosephv closed this as completed Sep 1, 2021
