
Issues occurring during parallel evaluation (using Trainer.evaluate) #30767

Open · psychocosine opened this issue May 12, 2024 · 0 comments

psychocosine commented May 12, 2024

System Info

  • transformers version: 4.40.0
  • Platform: Linux-5.15.0-105-generic-x86_64-with-glibc2.31
  • Python version: 3.9.19
  • Huggingface_hub version: 0.22.2
  • Safetensors version: 0.4.3
  • Accelerate version: 0.29.3
  • Accelerate config: not found
  • PyTorch version (GPU?): 1.13.1 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed

Who can help?

@muellerzr

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

The length of the eval_preds received by the compute_metrics function differs from the length of the original eval_dataset.

    def compute_metrics(eval_preds):
        preds, labels = eval_preds
        # In case the model returns more than the prediction logits
        if isinstance(preds, tuple):
            preds = preds[0]
        assert preds.shape[-1] == training_args.max_length
        assert preds.shape[0] == len(tokenized_datasets[-1])  # AssertionError: preds.shape[0] == 1024, len(tokenized_datasets[-1]) == 1012

    trainer = Trainer(
        model,
        training_args,
        train_dataset=tokenized_datasets[0].shuffle(seed=42).select(range(int(1e6))),
        eval_dataset={data_args.task_name: tokenized_datasets[-1]},
        data_collator=data_collator,
        tokenizer=tokenizer,
        compute_metrics=compute_metrics,
        callbacks=[LoggerCallback, DenserEvalCallback],
    )

My training args are listed below:

n_gpus=2
per_device_train_batch_size=8
per_device_eval_batch_size=8
gradient_accumulation_steps=3
len(preds)=1024
len(tokenized_datasets[-1])=1012
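
My guess (an assumption on my part, not verified against the Trainer source) is that the distributed evaluation loop pads the eval dataset so that every process receives full, equal-sized batches: with 2 GPUs and per_device_eval_batch_size=8 the global eval batch size is 16, and ceil(1012 / 16) * 16 = 1024, which matches the preds length I see. Under that assumption, a minimal sketch of a workaround is to slice the gathered arrays back to the original dataset length inside compute_metrics (tokenized_datasets and training_args are the names from my script above; this also assumes the padded rows are appended at the end):

    def compute_metrics(eval_preds):
        preds, labels = eval_preds
        # In case the model returns more than the prediction logits
        if isinstance(preds, tuple):
            preds = preds[0]
        # Assumption: the extra rows come from padding the eval set so it
        # shards evenly across processes, and they sit at the end of the
        # gathered arrays, so slicing restores the original length.
        n_samples = len(tokenized_datasets[-1])
        preds = preds[:n_samples]
        labels = labels[:n_samples]
        assert preds.shape[-1] == training_args.max_length
        assert preds.shape[0] == n_samples
        # ... metric computation as before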

Expected behavior

Everything works fine when using a single GPU, but not with multiple GPUs.
I started my script by calling accelerate launch script.py
