Error while saving with EarlyStoppingCallback #29157

Closed
dhruvmullick opened this issue Feb 21, 2024 · 6 comments

dhruvmullick commented Feb 21, 2024

System Info

  • transformers version: 4.38.0.dev0 (also in 4.38.0 and 4.39.0.dev0)
  • Platform: Linux-5.15.0-78-generic-x86_64-with-glibc2.35
  • Python version: 3.10.12
  • Huggingface_hub version: 0.20.3
  • Safetensors version: 0.4.2
  • Accelerate version: 0.28.0.dev0
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.1.2+cu121 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: Yes
  • Using distributed or parallel set-up in script?: DeepSpeed

Who can help?

@muellerzr and @pacman100

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Running a standard Causal LM training routine.

Reproduction

  • SFTTrainer is used for training the model
  • transformers.EarlyStoppingCallback is added to the trainer prior to .train()

This error has appeared in the last few days, likely due to some recent change.
The error goes away after either rolling back to transformers version 4.37.2 or removing the early stopping callback.
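
A minimal sketch of the setup that hits the error; the model name, dataset, and DeepSpeed config path are placeholders, not the exact training script:

```python
# Minimal sketch of the failing setup (placeholder model, dataset, and
# DeepSpeed config; not the exact training script).
from datasets import load_dataset
from transformers import EarlyStoppingCallback, TrainingArguments
from trl import SFTTrainer

train_ds = load_dataset("imdb", split="train[:1%]")
eval_ds = load_dataset("imdb", split="test[:1%]")

args = TrainingArguments(
    output_dir="out",
    evaluation_strategy="steps",
    eval_steps=50,
    save_strategy="steps",
    save_steps=50,
    load_best_model_at_end=True,        # required by EarlyStoppingCallback
    metric_for_best_model="eval_loss",
    deepspeed="ds_config.json",         # DeepSpeed, as in the original run
)

trainer = SFTTrainer(
    model="facebook/opt-125m",
    args=args,
    train_dataset=train_ds,
    eval_dataset=eval_ds,
    dataset_text_field="text",
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
)

trainer.train()  # raises the TypeError below when a checkpoint is saved
```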

Here's the stack trace:

File "/workspace/envs/torch_env/lib/python3.10/site-packages/trl/trainer/sft_trainer.py", line 331, in train
output = super().train(*args, **kwargs)
File "/workspace/envs/torch_env/lib/python3.10/site-packages/transformers/trainer.py", line 1624, in train
return inner_training_loop(
File "/workspace/envs/torch_env/lib/python3.10/site-packages/transformers/trainer.py", line 2029, in _inner_training_loop
self._maybe_log_save_evaluate(tr_loss, grad_norm, model, trial, epoch, ignore_keys_for_eval)
File "/workspace/envs/torch_env/lib/python3.10/site-packages/transformers/trainer.py", line 2423, in _maybe_log_save_evaluate
self._save_checkpoint(model, trial, metrics=metrics)
File "/workspace/envs/torch_env/lib/python3.10/site-packages/transformers/trainer.py", line 2525, in _save_checkpoint
self.state.save_to_json(os.path.join(staging_output_dir, TRAINER_STATE_NAME))
File "/workspace/envs/torch_env/lib/python3.10/site-packages/transformers/trainer_callback.py", line 113, in save_to_json
json_string = json.dumps(dataclasses.asdict(self), indent=2, sort_keys=True) + "\n"
File "/usr/lib/python3.10/json/init.py", line 238, in dumps
**kw).encode(obj)
File "/usr/lib/python3.10/json/encoder.py", line 201, in encode
chunks = list(chunks)
File "/usr/lib/python3.10/json/encoder.py", line 431, in _iterencode
yield from _iterencode_dict(o, _current_indent_level)
File "/usr/lib/python3.10/json/encoder.py", line 405, in _iterencode_dict
yield from chunks
File "/usr/lib/python3.10/json/encoder.py", line 325, in _iterencode_list
yield from chunks
File "/usr/lib/python3.10/json/encoder.py", line 405, in _iterencode_dict
yield from chunks
File "/usr/lib/python3.10/json/encoder.py", line 438, in _iterencode
o = _default(o)
File "/usr/lib/python3.10/json/encoder.py", line 179, in default
raise TypeError(f'Object of type {o.__class__.__name__} '
TypeError: Object of type Tensor is not JSON serializable

Expected behavior

Checkpoints should save without error on transformers 4.38.0.dev0.

@webbigdata-jp

pip3 install deepspeed==0.13.1

works for me.

pip3 install deepspeed==0.13.2

gives the same JSON serializable error.

@ArthurZucker
Collaborator

Wow, thanks @webbigdata-jp!


github-actions bot commented Apr 1, 2024

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.


iFe1er commented Apr 3, 2024

The error still occurs for me even with deepspeed==0.13.1 and transformers==4.37.2.

Could anyone help?

@ArthurZucker
Collaborator

pip install --upgrade transformers


iFe1er commented Apr 7, 2024

pip install --upgrade transformers

Problem fixed. I shouldn't have logged a tensor object using self.log (working with wandb). Converting my logged variable from a tensor to a Python object via xx.item() solved the problem.
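
For reference, a minimal sketch of why the conversion helps (the metric name is made up): everything passed to self.log() lands in TrainerState.log_history, which is serialized with json.dumps when a checkpoint is saved.

```python
import json

import torch

loss = torch.tensor(0.123)

# A raw tensor in the logged metrics reproduces the traceback above:
# json.dumps({"my_metric": loss})  # TypeError: Object of type Tensor is not JSON serializable

# .item() converts the tensor to a plain Python float, which serializes fine:
print(json.dumps({"my_metric": loss.item()}))
```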
