
Issues occurring during parallel evaluation (using Trainer.evaluate) #30767

Open · psychocosine opened this issue May 12, 2024 · 0 comments

psychocosine commented May 12, 2024

System Info

  • transformers version: 4.40.0
  • Platform: Linux-5.15.0-105-generic-x86_64-with-glibc2.31
  • Python version: 3.9.19
  • Huggingface_hub version: 0.22.2
  • Safetensors version: 0.4.3
  • Accelerate version: 0.29.3
  • Accelerate config: not found
  • PyTorch version (GPU?): 1.13.1 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed

Who can help?

@muellerzr

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

The length of the eval_preds received by the compute_metrics function differs from the length of the original eval_dataset.

    def compute_metrics(eval_preds):
        preds, labels = eval_preds
        # In case the model returns more than the prediction logits
        if isinstance(preds, tuple):
            preds = preds[0]
        assert preds.shape[-1] == training_args.max_length
        assert preds.shape[0] == len(tokenized_datasets[-1])  # AssertionError: preds.shape[0] == 1024, len(tokenized_datasets[-1]) == 1012

    trainer = Trainer(
        model,
        training_args,
        train_dataset=tokenized_datasets[0].shuffle(seed=42).select(range(int(1e6))),
        eval_dataset={data_args.task_name: tokenized_datasets[-1]},
        data_collator=data_collator,
        tokenizer=tokenizer,
        compute_metrics=compute_metrics,
        callbacks=[LoggerCallback, DenserEvalCallback],
    )

My training args are listed below:

n_gpus=2
per_device_train_batch_size=8
per_device_eval_batch_size=8
gradient_accumulation_steps=3
len(preds)=1024
len(tokenized_datasets[-1])=1012
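
My guess (an assumption on my part, not verified against the Trainer source) is that the distributed evaluation loop pads the eval dataset so that every process receives full, equal-sized batches: with 2 GPUs and per_device_eval_batch_size=8 the global eval batch size is 16, and ceil(1012 / 16) * 16 = 1024, which matches the preds length I see. Under that assumption, a minimal sketch of a workaround is to slice the gathered arrays back to the original dataset length inside compute_metrics (tokenized_datasets and training_args are the names from my script above; this also assumes the padded rows are appended at the end):

    def compute_metrics(eval_preds):
        preds, labels = eval_preds
        # In case the model returns more than the prediction logits
        if isinstance(preds, tuple):
            preds = preds[0]
        # Assumption: the extra rows come from padding the eval set so it
        # shards evenly across processes, and they sit at the end of the
        # gathered arrays, so slicing restores the original length.
        n_samples = len(tokenized_datasets[-1])
        preds = preds[:n_samples]
        labels = labels[:n_samples]
        assert preds.shape[-1] == training_args.max_length
        assert preds.shape[0] == n_samples
        # ... metric computation as before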

Expected behavior

Everything works fine when using a single GPU, but not with multiple GPUs.
I started my script by calling accelerate launch script.py
