
Bug: incorrect shape size set for categorical field in classifier model. #25

Open
germanjoey opened this issue Apr 23, 2019 · 3 comments


germanjoey commented Apr 23, 2019

While trying out automl_gs on this uci dataset, I got this error:

Traceback (most recent call last):
  File "model.py", line 59, in <module>
    model_train(df, encoders, args, model)
  File "C:\Users\josep\automl_train\pipeline.py", line 835, in model_train
    batch_size=256)
  File "C:\Users\josep\venv\lib\site-packages\tensorflow\python\keras\engine\training.py", line 776, in fit
    shuffle=shuffle)
  File "C:\Users\josep\venv\lib\site-packages\tensorflow\python\keras\engine\training.py", line 2382, in _standardize_user_data
    exception_prefix='input')
  File "C:\Users\josep\venv\lib\site-packages\tensorflow\python\keras\engine\training_utils.py", line 362, in standardize_input_data
    ' but got array with shape ' + str(data_shape))
ValueError: Error when checking input: expected input_son to have shape (1,) but got array with shape (2,)

After some sleuthing, I eventually figured out that the error is that the shape size for the offending column was set incorrectly in build_model():

    input_son_size = len(encoders['son_encoder'].classes_)
    input_son = Input(
        shape=(input_son_size if input_son_size != 2 else 1,), name="input_son")

I don't understand the purpose of that `if-else` clause. It looks like this change was introduced in 1dcb9e2; reverting that commit allows my model to work.
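For what it's worth, if the encoder behind `son_encoder` is a scikit-learn `LabelBinarizer`, the `!= 2` special case looks like an attempt to handle its binary behavior: with exactly two classes it emits a single 0/1 column rather than two one-hot columns. A minimal sketch (assuming `LabelBinarizer` is the encoder in play):

```python
from sklearn.preprocessing import LabelBinarizer

# With exactly two classes, LabelBinarizer emits a single 0/1 column...
binary = LabelBinarizer().fit_transform(["yes", "no", "yes"])
print(binary.shape)  # (3, 1)

# ...but with three or more classes it emits one column per class.
multi = LabelBinarizer().fit_transform(["a", "b", "c", "a"])
print(multi.shape)  # (4, 3)
```

So the special case would be correct for a `LabelBinarizer`, but wrong for any encoder that still produces two columns for two classes.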

germanjoey commented Apr 23, 2019

Hmmm, it seems I jumped the gun. When I said my model worked, it turns out it only worked for that particular set of hyperparameters. That is to say, it worked when I ran it manually via `model.py -d ..\absenteeism\data.csv -m train`, but when I actually run `automl_gs` from the top on the same data, it now fails after the 5th trial of the search. The generated `pipeline.py` looks completely different now... I don't really understand what happened.

@germanjoey
Author

Input here: https://gist.github.com/germanjoey/4204c3b4d49476b78fc3edcab0417b0e
Error is here:

(venv) (base) C:\Users\josep>automl_gs absenteeism\absenteeism.csv "Absenteeism time in hours"
Solving a regression problem, minimizing mse using tensorflow.

Modeling with field specifications:
ID: ignore
Reason for absence: numeric
Month of absence: categorical
Day of the week: categorical
Seasons: categorical
Transportation expense: numeric
Distance from Residence to Work: numeric
Service time: numeric
Age: numeric
Work load Average/day: numeric
Hit target: categorical
Disciplinary failure: categorical
Education: categorical
Son: categorical
Social drinker: categorical
Social smoker: categorical
Pet: categorical
Weight: numeric
Height: numeric
Body mass index: numeric

Metrics:
time_completed: 2019-04-23 21:01:32
mse: 133.92264804535634+22
r_2: -0.1073565667623002+20
epoch: 20
mae: 4.6226266268137345
trial_id: af365751-a791-43a6-97b8-cb8721a92841
  5%|###9                                                                           | 5/100 [00:35<11:23,  7.20s/trial]
Traceback (most recent call last):
  File "model.py", line 59, in <module>
    model_train(df, encoders, args, model)
  File "C:\Users\josep\automl_train\pipeline.py", line 794, in model_train
    batch_size=128)
  File "C:\Users\josep\venv\lib\site-packages\tensorflow\python\keras\engine\training.py", line 776, in fit
    shuffle=shuffle)
  File "C:\Users\josep\venv\lib\site-packages\tensorflow\python\keras\engine\training.py", line 2382, in _standardize_user_data
    exception_prefix='input')
  File "C:\Users\josep\venv\lib\site-packages\tensorflow\python\keras\engine\training_utils.py", line 362, in standardize_input_data
    ' but got array with shape ' + str(data_shape))
ValueError: Error when checking input: expected input_social_drinker to have shape (2,) but got array with shape (1,)
Traceback (most recent call last):
  File "C:\Users\josep\AppData\Local\Programs\Python\Python35\Lib\runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "C:\Users\josep\AppData\Local\Programs\Python\Python35\Lib\runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "C:\Users\josep\venv\Scripts\automl_gs.exe\__main__.py", line 9, in <module>
  File "C:\Users\josep\venv\lib\site-packages\automl_gs\automl_gs.py", line 175, in cmd
    tpu_address=args.tpu_address)
  File "C:\Users\josep\venv\lib\site-packages\automl_gs\automl_gs.py", line 94, in automl_grid_search
    train_results = results.tail(1).to_dict('records')[0]
IndexError: list index out of range

@germanjoey
Author

Looks like this is a more fundamental issue: `build_model()` essentially tries to predict the shape of each encoded feature from the size of the encoder's `classes_`, but that doesn't always match the width the encoder actually produces.
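That mismatch is easy to reproduce with scikit-learn's `LabelBinarizer` (assuming that's the encoder behind these categorical fields): for a binary field like `Social drinker`, `len(classes_)` and the actual encoded width disagree:

```python
from sklearn.preprocessing import LabelBinarizer

# Two classes, but the binarized output is only one column wide,
# so a shape inferred from len(classes_) would be off by one.
encoder = LabelBinarizer().fit(["yes", "no"])
print(len(encoder.classes_))                 # 2
print(encoder.transform(["yes"]).shape[1])   # 1
```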

Combined with the above reversion of `templates/models/tensorflow/categorical.py` from 1dcb9e2, I changed the last few lines of `templates/encoders/categorical` from this:

    with open(os.path.join('encoders', '{{ field }}_encoder.json'),
              'w', encoding='utf8') as outfile:
        json.dump({{ field }}_encoder.classes_.tolist(), outfile, ensure_ascii=False)

to this:

    {{ field }}_encoder_shape = {{ field }}_encoder.transform(df['{{ field_raw }}'].values).shape[1]
    with open(os.path.join('encoders', '{{ field }}_encoder.json'),
              'w', encoding='utf8') as outfile:
        json.dump(list(range({{ field }}_encoder_shape)), outfile, ensure_ascii=False)

That gets the model working through the entire search, but I'm hesitant to submit a PR because I wonder if there's a more efficient way to do this. The dataset I'm testing on is small, but transforming the full column just to learn its shape could be a big performance hit on larger data sets. What do y'all think?
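For what it's worth, one cheaper alternative (a sketch, assuming the encoder is a scikit-learn `LabelBinarizer`) might be to transform a single representative value instead of the whole column; the output width is the same regardless of how many rows are fed in:

```python
from sklearn.preprocessing import LabelBinarizer

encoder = LabelBinarizer().fit(["low", "medium", "high"])

# Transforming one known class value is enough to learn the output width,
# so the cost no longer grows with the size of the dataset.
width = encoder.transform([encoder.classes_[0]]).shape[1]
print(width)  # 3
```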
