Error while saving with EarlyStoppingCallback #29157

Closed
dhruvmullick opened this issue Feb 21, 2024 · 6 comments

dhruvmullick commented Feb 21, 2024

System Info

  • transformers version: 4.38.0.dev0 (also in 4.38.0 and 4.39.0.dev0)
  • Platform: Linux-5.15.0-78-generic-x86_64-with-glibc2.35
  • Python version: 3.10.12
  • Huggingface_hub version: 0.20.3
  • Safetensors version: 0.4.2
  • Accelerate version: 0.28.0.dev0
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.1.2+cu121 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: Yes
  • Using distributed or parallel set-up in script?: DeepSpeed

Who can help?

@muellerzr and @pacman100

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Running a standard Causal LM training routine.

Reproduction

  • SFTTrainer is used for training the model
  • transformers.EarlyStoppingCallback is added to the trainer prior to .train()

This error has appeared in the last few days, likely due to some recent change.
The error goes away after either rolling back to transformers version 4.37.2 or removing the early stopping callback.
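
A minimal sketch of the setup that hits the error; the model name, dataset, and DeepSpeed config path are placeholders, not the exact training script:

```python
# Minimal sketch of the failing setup (placeholder model, dataset, and
# DeepSpeed config; not the exact training script).
from datasets import load_dataset
from transformers import EarlyStoppingCallback, TrainingArguments
from trl import SFTTrainer

train_ds = load_dataset("imdb", split="train[:1%]")
eval_ds = load_dataset("imdb", split="test[:1%]")

args = TrainingArguments(
    output_dir="out",
    evaluation_strategy="steps",
    eval_steps=50,
    save_strategy="steps",
    save_steps=50,
    load_best_model_at_end=True,        # required by EarlyStoppingCallback
    metric_for_best_model="eval_loss",
    deepspeed="ds_config.json",         # DeepSpeed, as in the original run
)

trainer = SFTTrainer(
    model="facebook/opt-125m",
    args=args,
    train_dataset=train_ds,
    eval_dataset=eval_ds,
    dataset_text_field="text",
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
)

trainer.train()  # raises the TypeError below when a checkpoint is saved
```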

Here's the stack trace:

File "/workspace/envs/torch_env/lib/python3.10/site-packages/trl/trainer/sft_trainer.py", line 331, in train
output = super().train(*args, **kwargs)
File "/workspace/envs/torch_env/lib/python3.10/site-packages/transformers/trainer.py", line 1624, in train
return inner_training_loop(
File "/workspace/envs/torch_env/lib/python3.10/site-packages/transformers/trainer.py", line 2029, in _inner_training_loop
self._maybe_log_save_evaluate(tr_loss, grad_norm, model, trial, epoch, ignore_keys_for_eval)
File "/workspace/envs/torch_env/lib/python3.10/site-packages/transformers/trainer.py", line 2423, in _maybe_log_save_evaluate
self._save_checkpoint(model, trial, metrics=metrics)
File "/workspace/envs/torch_env/lib/python3.10/site-packages/transformers/trainer.py", line 2525, in _save_checkpoint
self.state.save_to_json(os.path.join(staging_output_dir, TRAINER_STATE_NAME))
File "/workspace/envs/torch_env/lib/python3.10/site-packages/transformers/trainer_callback.py", line 113, in save_to_json
json_string = json.dumps(dataclasses.asdict(self), indent=2, sort_keys=True) + "\n"
File "/usr/lib/python3.10/json/init.py", line 238, in dumps
**kw).encode(obj)
File "/usr/lib/python3.10/json/encoder.py", line 201, in encode
chunks = list(chunks)
File "/usr/lib/python3.10/json/encoder.py", line 431, in _iterencode
yield from _iterencode_dict(o, _current_indent_level)
File "/usr/lib/python3.10/json/encoder.py", line 405, in _iterencode_dict
yield from chunks
File "/usr/lib/python3.10/json/encoder.py", line 325, in _iterencode_list
yield from chunks
File "/usr/lib/python3.10/json/encoder.py", line 405, in _iterencode_dict
yield from chunks
File "/usr/lib/python3.10/json/encoder.py", line 438, in _iterencode
o = _default(o)
File "/usr/lib/python3.10/json/encoder.py", line 179, in default
raise TypeError(f'Object of type {o.__class__.__name__} '
TypeError: Object of type Tensor is not JSON serializable

Expected behavior

Checkpoints should save without error on transformers 4.38.0.dev0.

@webbigdata-jp

pip3 install deepspeed==0.13.1

works for me.

pip3 install deepspeed==0.13.2

gives the same JSON serializable error.

@ArthurZucker
Collaborator

Wow, thanks @webbigdata-jp!


github-actions bot commented Apr 1, 2024

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.


iFe1er commented Apr 3, 2024

The error still occurs for me even with deepspeed==0.13.1 and transformers==4.37.2.

Could anyone help?

@ArthurZucker
Collaborator

pip install --upgrade transformers


iFe1er commented Apr 7, 2024

pip install --upgrade transformers

Problem fixed. I shouldn't have logged a tensor object using self.log (working with wandb). Converting my logged variable from a tensor to a Python object via xx.item() solved the problem.
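
For reference, a minimal sketch of why the conversion helps (the metric name is made up): everything passed to self.log() lands in TrainerState.log_history, which is serialized with json.dumps when a checkpoint is saved.

```python
import json

import torch

loss = torch.tensor(0.123)

# A raw tensor in the logged metrics reproduces the traceback above:
# json.dumps({"my_metric": loss})  # TypeError: Object of type Tensor is not JSON serializable

# .item() converts the tensor to a plain Python float, which serializes fine:
print(json.dumps({"my_metric": loss.item()}))
```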
