Bug in categorical_encoders.py ? #18

Closed
wrannaman opened this issue Jun 6, 2021 · 6 comments

wrannaman commented Jun 6, 2021

Love the repo! I'm trying to use this to predict customer behavior and have a data set with some continuous and some categorical data as well as some dates.

Describe the bug
When I set date_columns and encode_date_columns=True for a classification task, the following error occurs:

Traceback (most recent call last):
  File "1.py", line 53, in <module>
    tabular_model.fit(train=train, validation=val)
  File "/Users/andrewpierno/opt/anaconda3/envs/churn/lib/python3.7/site-packages/pytorch_tabular/tabular_model.py", line 455, in fit
    reset,
  File "/Users/andrewpierno/opt/anaconda3/envs/churn/lib/python3.7/site-packages/pytorch_tabular/tabular_model.py", line 383, in _pre_fit
    train, validation, test, target_transform, train_sampler
  File "/Users/andrewpierno/opt/anaconda3/envs/churn/lib/python3.7/site-packages/pytorch_tabular/tabular_model.py", line 294, in _prepare_dataloader
    self.datamodule.setup("fit")
  File "/Users/andrewpierno/opt/anaconda3/envs/churn/lib/python3.7/site-packages/pytorch_lightning/core/datamodule.py", line 92, in wrapped_fn
    return fn(*args, **kwargs)
  File "/Users/andrewpierno/opt/anaconda3/envs/churn/lib/python3.7/site-packages/pytorch_tabular/tabular_datamodule.py", line 267, in setup
    self.validation, stage="inference"
  File "/Users/andrewpierno/opt/anaconda3/envs/churn/lib/python3.7/site-packages/pytorch_tabular/tabular_datamodule.py", line 182, in preprocess_data
    data = self.categorical_encoder.transform(data)
  File "/Users/andrewpierno/opt/anaconda3/envs/churn/lib/python3.7/site-packages/pytorch_tabular/categorical_encoders.py", line 41, in transform
    assert all(c in X.columns for c in self.cols)
AssertionError

I put some logging in around X.columns and self.cols

X is:
Index(['interval', 'amount', 'status', 'target', '_Month', '_Quarter',
       '_Is_quarter_end', '_Is_year_end'],
      dtype='object')
self cols is:
['interval', 'status', '_Month', '_Quarter', '_Is_quarter_end', '_Is_year_end']
[Next call happens immediately after]
X is:
Index(['interval', 'amount', 'status', 'target', '_Month', '_Quarter',
       '_Is_quarter_start', '_Is_year_start'],
      dtype='object')
self cols is:
['interval', 'status', '_Month', '_Quarter', '_Is_quarter_end', '_Is_year_end']
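
To make the mismatch in the logs above easier to see, here is a minimal sketch that lists which fitted columns are absent from the second frame. The helper name missing_columns is hypothetical and is not part of pytorch_tabular; the column lists are copied from the logging output above.

```python
def missing_columns(fitted_cols, frame_cols):
    """Return the fitted columns that are absent from the frame, in fitted order.

    Hypothetical helper for illustration only; not a pytorch_tabular function.
    """
    present = set(frame_cols)
    return [c for c in fitted_cols if c not in present]


# Columns the encoder was fitted on (self.cols in the log above).
fitted_cols = ['interval', 'status', '_Month', '_Quarter',
               '_Is_quarter_end', '_Is_year_end']

# Columns of X on the second call (copied from the log above).
second_call_cols = ['interval', 'amount', 'status', 'target', '_Month',
                    '_Quarter', '_Is_quarter_start', '_Is_year_start']

missing = missing_columns(fitted_cols, second_call_cols)
print(missing)  # ['_Is_quarter_end', '_Is_year_end']
```

So the second frame carries `_Is_quarter_start` and `_Is_year_start` where the encoder expects `_Is_quarter_end` and `_Is_year_end`, which is why the bare assert fails without a message.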

To Reproduce
Steps to reproduce the behavior:

  1. Here is the script
from pytorch_tabular import TabularModel
from pytorch_tabular.models import CategoryEmbeddingModelConfig
from pytorch_tabular.config import DataConfig, OptimizerConfig, TrainerConfig
from sklearn.model_selection import train_test_split
import pandas as pd


df = pd.read_csv('./repro.csv')
data = df

target_cols = ['target'] 
cat_col_names = ['interval', 'status']
continuous_col_names = ['amount']
date_col_list = [('date', 'M')] # Note: other timeframes don't work either.

train, test = train_test_split(data, random_state=42)
train, val = train_test_split(train, random_state=42)

data_config = DataConfig(
    target=target_cols, #target should always be a list. Multi-targets are only supported for regression. Multi-Task Classification is not implemented
    continuous_cols=continuous_col_names,
    categorical_cols=cat_col_names,
    date_columns=date_col_list,
    encode_date_columns=True,
    #    validation_split = 0.2,         #80% Train + test 20% validation
    num_workers=8,
    # continuous_feature_transform="quantile_normal",
)


trainer_config = TrainerConfig(
    auto_lr_find=True, # Runs the LRFinder to automatically derive a learning rate
    batch_size=8,
    max_epochs=100,
    gpus=0, # index of the GPU to use. 0, means CPU
)
optimizer_config = OptimizerConfig()

model_config = CategoryEmbeddingModelConfig(
    task="classification",
    layers="1024-512-512",  # Number of nodes in each layer
    activation="LeakyReLU", # Activation between each layers
    learning_rate = 1e-3
)

tabular_model = TabularModel(
    data_config=data_config,
    model_config=model_config,
    optimizer_config=optimizer_config,
    trainer_config=trainer_config,
)

tabular_model.fit(train=train, validation=val)
result = tabular_model.evaluate(test)
pred_df = tabular_model.predict(test)
tabular_model.save_model("examples/basic")
# loaded_model = TabularModel.load_from_checkpoint("examples/basic")
# result = loaded_model.evaluate(test)
  2. Here is the data

https://docs.google.com/spreadsheets/d/1jfV_p0pRXv0zkQLvaQXuDvVtw21FblT7YK83760SPDQ/edit?usp=sharing

  3. Run example.py and see the error assert all(c in X.columns for c in self.cols)

Expected behavior
assert all(c in X.columns for c in self.cols) should pass when using date_columns and encode_date_columns=True

Desktop (please complete the following information):
ios

Additional context

@manujosephv manujosephv self-assigned this Jun 10, 2021
@manujosephv manujosephv added the good first issue Good for newcomers label Jun 10, 2021
@manujosephv
Owner

Thanks @wrannaman for the excellent issue. I'll take a look at it as soon as I get some time.

@manujosephv
Owner

@wrannaman This should be fixed now in the develop branch. I am not able to publish to PyPI because my CI/CD pipeline is currently on travis-ci.org; I need to migrate to travis-ci.com or GitHub Actions.

@manujosephv
Owner

Fixed in v0.6, which is now on PyPI. Can you check and report back? Also, gpus should be None for CPU and -1 for using all GPUs.

@wrannaman
Author

I changed to gpus=None and now get a new error, with the same code and dataset as above. Not sure if it's something on my end though.

Traceback (most recent call last):
  File "/Users/andrewpierno/opt/anaconda3/envs/churn/lib/python3.7/site-packages/tqdm/std.py", line 1145, in __del__
  File "/Users/andrewpierno/opt/anaconda3/envs/churn/lib/python3.7/site-packages/tqdm/std.py", line 1299, in close
  File "/Users/andrewpierno/opt/anaconda3/envs/churn/lib/python3.7/site-packages/tqdm/std.py", line 1492, in display
  File "/Users/andrewpierno/opt/anaconda3/envs/churn/lib/python3.7/site-packages/tqdm/std.py", line 1148, in __str__
  File "/Users/andrewpierno/opt/anaconda3/envs/churn/lib/python3.7/site-packages/tqdm/std.py", line 1450, in format_dict
TypeError: cannot unpack non-iterable NoneType object

@manujosephv
Owner

Is that the entire stack trace? If not, can you share the full stack trace so that I can see where the error is coming from? Also, please share the versions of the libraries you are using.

I run my unit-test cases with gpus=None, so it shouldn't be a problem really.
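
One way to gather the library versions requested above is sketched below. The helper name collect_versions and the package list are assumptions chosen for illustration; on Python 3.7 (the interpreter shown in the traceback paths) the stdlib module is unavailable, so the sketch falls back to the importlib_metadata backport, which would need to be installed separately.

```python
try:
    from importlib import metadata  # standard library on Python 3.8+
except ImportError:
    import importlib_metadata as metadata  # backport; assumed installed on 3.7


def collect_versions(packages):
    """Map each distribution name to its installed version or 'not installed'.

    Hypothetical helper for illustration only.
    """
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = "not installed"
    return versions


# Distribution names are assumptions; adjust to match the environment.
report = collect_versions(
    ["pytorch-tabular", "pytorch-lightning", "torch", "tqdm", "pandas"]
)
for name, version in report.items():
    print(f"{name}: {version}")
```

Pasting the printed lines alongside the full stack trace would give the version information asked for in this comment.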

@manujosephv manujosephv removed the good first issue Good for newcomers label Jun 24, 2021
@manujosephv
Owner

Closing due to no response.
