DROP Evaluation with Llama3 (vs. lm-evaluation-harness) #165

vipulraheja commented Apr 21, 2024

Evaluating Llama-3-8B on DROP with the standard configuration (3-shot, as reported for Llama 3) throws a warning suggesting that the input size exceeds the maximum context size allowed by the model:

The smallest context of your batch (10010) is bigger than the maximum context size allowed by the model (8192) for a task in{'lighteval|drop|3'}. This is likely to lead to some errors.

Here is the command I use:

accelerate launch --num_processes=1 run_evals_accelerate.py \
    --model_args "pretrained=meta-llama/Meta-Llama-3-8B" \
    --tasks "lighteval|drop|3|0" \
    --override_batch_size 16 \
    --output_dir "./log/"

I am able to reproduce this even after progressively reducing the batch size to 1.
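
For reference, here is a rough way to sanity-check how large the tokenized 3-shot DROP prompts get. This is a hypothetical sketch: the prompt construction below is a simplified stand-in, not lighteval's actual few-shot builder, and the field names assume the standard DROP schema.

# Rough sanity check: tokenize a DROP-style 3-shot prompt and compare it to
# the Llama-3-8B context window (8192 tokens).
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
ds = load_dataset("drop", split="validation", trust_remote_code=True)

# Three few-shot examples plus the target example, passage + question each.
prompt = "\n\n".join(
    f"Passage: {ds[i]['passage']}\nQuestion: {ds[i]['question']}\nAnswer:"
    for i in range(4)
)
print(len(tokenizer(prompt)["input_ids"]), "tokens vs. a context window of 8192")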

Log:

WARNING:lighteval.logging.hierarchical_logger:    Model info: ModelInfo(model_name='meta-llama/Meta-Llama-3-8B', model_sha='561487d18c41c76bcb5fc6cfb73a324982f04f47', model_dtype='torch.bfloat16', model_size='15.08 GB')
WARNING:lighteval.logging.hierarchical_logger:  } [0:00:15.762582]
WARNING:lighteval.logging.hierarchical_logger:  Tasks loading {
WARNING:lighteval.logging.hierarchical_logger:    If you want to use extended_tasks, make sure you installed their dependencies using `pip install -e .[extended_tasks]`.
WARNING:lighteval.logging.hierarchical_logger:    lighteval/drop_harness default
WARNING:lighteval.logging.hierarchical_logger:    Loading documents, and requests
WARNING:lighteval.logging.hierarchical_logger:  } [0:00:34.926653]
WARNING:lighteval.logging.hierarchical_logger:  Setting seeds and waiting for all processes {
WARNING:lighteval.logging.hierarchical_logger:    setting seed to 1234 for random and numpy
WARNING:lighteval.logging.hierarchical_logger:  } [0:00:00.000371]
WARNING:lighteval.logging.hierarchical_logger:  Evaluation {
WARNING:lighteval.logging.hierarchical_logger:    Evaluate on 1 tasks.
WARNING:lighteval.logging.hierarchical_logger:    Running RequestType.GREEDY_UNTIL requests
Splits:   0%|          | 0/4 [00:00<?, ?it/s]
WARNING:lighteval.logging.hierarchical_logger:    The smallest context of your batch (10010) is bigger than the maximum context size allowed by the model (8192) for a task in{'lighteval|drop|3'}. This is likely to lead to some errors.

The process then either stays stuck indefinitely until manually killed, or crashes as follows:

Note: the following traceback occurred even after reducing the batch size to 1 with --override_batch_size.

WARNING:lighteval.logging.hierarchical_logger:    The smallest context of your batch (9262) is bigger than the maximum context size allowed by the model (8192) for a task in{'lighteval|drop|3'}. This is likely to lead to some errors.
WARNING:lighteval.logging.hierarchical_logger:    The smallest context of your batch (9192) is bigger than the maximum context size allowed by the model (8192) for a task in{'lighteval|drop|3'}. This is likely to lead to some errors.
WARNING:lighteval.logging.hierarchical_logger:    The smallest context of your batch (9538) is bigger than the maximum context size allowed by the model (8192) for a task in{'lighteval|drop|3'}. This is likely to lead to some errors.
Splits:   0%|          | 0/4 [00:40<?, ?it/s]
WARNING:lighteval.logging.hierarchical_logger:  } [0:01:02.769634]
WARNING:lighteval.logging.hierarchical_logger:} [0:01:49.892104]
Traceback (most recent call last):
  File "/home/vipul.raheja/lighteval/run_evals_accelerate.py", line 82, in <module>
    main(args)
  File "/home/vipul.raheja/lighteval/src/lighteval/logging/hierarchical_logger.py", line 166, in wrapper
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/home/vipul.raheja/lighteval/src/lighteval/main_accelerate.py", line 111, in main
    evaluation_tracker = evaluate(
                         ^^^^^^^^^
  File "/home/vipul.raheja/lighteval/src/lighteval/evaluator.py", line 86, in evaluate
    full_resps = lm.greedy_until(requests, override_bs=override_bs)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/vipul.raheja/lighteval/src/lighteval/models/base_model.py", line 570, in greedy_until
    max_new_tokens = min(self.max_length - biggest_context, max_new_tokens)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
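
Looking at the last line of the traceback: when a prompt is longer than the context window, self.max_length - biggest_context is negative, so max_new_tokens ends up negative, which seems consistent with the hang/crash. A minimal illustration with the numbers from the warning above (illustrative values, not the real lighteval objects):

# Arithmetic at base_model.py:570 with the values from the warning.
max_length = 8192        # maximum context size allowed by the model
biggest_context = 10010  # smallest context in the batch, per the warning
max_new_tokens = 256     # illustrative generation budget

max_new_tokens = min(max_length - biggest_context, max_new_tokens)
print(max_new_tokens)  # -1818: a negative token budget gets passed downstream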

Running the same evaluation directly in lm-evaluation-harness does not throw any such warning and proceeds at a reasonable speed:

~/lm-evaluation-harness$ lm_eval --model hf --model_args pretrained=meta-llama/Meta-Llama-3-8B --tasks drop --device cuda:0 --batch_size 16
2024-04-21:20:19:29,714 INFO     [__main__.py:251] Verbosity set to INFO
2024-04-21:20:19:33,062 INFO     [__main__.py:335] Selected Tasks: ['drop']
2024-04-21:20:19:33,063 INFO     [evaluator.py:131] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234
2024-04-21:20:19:33,064 INFO     [evaluator.py:177] Initializing hf model, with arguments: {'pretrained': 'meta-llama/Meta-Llama-3-8B'}
2024-04-21:20:19:33,164 INFO     [huggingface.py:164] Using device 'cuda:0'
Loading checkpoint shards: 100%|█████████████████████████████| 4/4 [00:06<00:00,  1.62s/it]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Downloading builder script: 100%|█████████████████████████████| 7.46k/7.46k [00:00<00:00, 35.8MB/s]
Downloading readme: 100%|█████████████████████████████| 26.0/26.0 [00:00<00:00, 384kB/s]
Downloading data: 100%|█████████████████████████████| 8.31M/8.31M [00:00<00:00, 8.66MB/s]
Generating train split: 77409 examples [00:05, 13452.43 examples/s]
Generating validation split: 9536 examples [00:00, 11649.32 examples/s]
Map: 100%|█████████████████████████████| 77409/77409 [00:10<00:00, 7060.41 examples/s]
Map: 100%|█████████████████████████████| 9536/9536 [00:01<00:00, 4788.74 examples/s]
2024-04-21:20:20:11,675 INFO     [task.py:395] Building contexts for drop on rank 0...
100%|█████████████████████████████| 9536/9536 [00:03<00:00, 2793.13it/s]
2024-04-21:20:20:16,260 INFO     [evaluator.py:379] Running generate_until requests
Running generate_until requests:   9%|█████▊                             | 833/9536 [07:44<1:06:05,  2.19it/s]

Env:
transformers version: 4.39.3
Platform: Ubuntu 20.04.6 LTS
Python version: 3.11.9
Huggingface_hub version: 0.22.2
Safetensors version: 0.4.2
Accelerate version: 0.29.2
Lighteval version: 0.4.0.dev0

clefourrier (Member) commented

Iirc, the harness just does not check whether the context fits within the model's max length (the few-shot context is built here and used there; only the gold prediction must fit within the max length).

We decided to print a warning when the context is longer than the max length, as it means the model is likely to have non-trivial issues during generation. However, the bugs you're getting are not normal; I'll check them.
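
For anyone blocked on this in the meantime, a common workaround in other harnesses is to left-truncate the prompt so that prompt plus generation budget fits in the window. A minimal sketch, not lighteval's API; the helper below is made up for illustration:

# Keep only the last (max_length - max_new_tokens) prompt tokens.
# Truncating from the left preserves the final question and answer cue.
def truncate_left(input_ids: list[int], max_length: int, max_new_tokens: int) -> list[int]:
    budget = max_length - max_new_tokens
    return input_ids[-budget:] if len(input_ids) > budget else input_ids

prompt_ids = list(range(10010))  # stand-in for a 10010-token DROP prompt
kept = truncate_left(prompt_ids, max_length=8192, max_new_tokens=256)
print(len(kept))  # 7936: leaves room for 256 generated tokens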
