
Make CodeEval respect device_eval_batch_size #2969

Merged: josejg merged 29 commits into dev on Feb 15, 2024

Conversation

@josejg (Contributor) commented Feb 7, 2024

What does this PR do?

This PR re-implements our CodeEval so that it respects the device eval batch size.

What issue(s) does this change relate to?

Since the previous implementation silently overrode device_eval_batch_size to be generations_per_sample, the rewrite has two clear benefits (see the sketch after the list):

  • Better utilization when generations_per_sample < device_eval_batch_size
  • Prevents OOM when generations_per_sample >> device_eval_batch_size
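
For intuition, here is a minimal sketch of the batching idea, not the PR's actual code: generation requests are flattened (each prompt repeated generations_per_sample times) and then re-chunked so that no generation batch exceeds device_eval_batch_size. The function and variable names below are hypothetical.

```python
def chunk_generation_requests(prompts, generations_per_sample, device_eval_batch_size):
    """Expand each prompt into `generations_per_sample` copies, then yield
    batches of at most `device_eval_batch_size` prompts (hypothetical helper)."""
    expanded = [p for p in prompts for _ in range(generations_per_sample)]
    for start in range(0, len(expanded), device_eval_batch_size):
        yield expanded[start:start + device_eval_batch_size]


# 3 prompts x 20 generations each = 60 requests. With device_eval_batch_size=8
# this yields 8 batches (7 of size 8, 1 of size 4) instead of forcing one batch
# of 20 generations per prompt, which is where the utilization/OOM benefits come from.
batches = list(chunk_generation_requests(["p0", "p1", "p2"], 20, 8))
assert [len(b) for b in batches] == [8, 8, 8, 8, 8, 8, 8, 4]
```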

Regression testing

These are regression tests comparing the PR code against the previous implementation (plus a patch to use temperature 0.2, which was not possible before @maxisawesome's foundry PR; that PR was not merged at the time I ran these experiments). These experiments use generations_per_sample=20, so some variance is expected (see the pass@k note after the table).

| Model | This PR | Before |
| --- | --- | --- |
| meta-llama/Llama-2-7b-hf | 12.47 | 13.69 |
| codellama/CodeLlama-7b-hf | 27.99 | 27.68 |
| codellama/CodeLlama-13b-hf | 30.73 | 30.37 |
| deepseek-ai/deepseek-coder-6.7b-base | 39.82 | 40.3 |
| codellama/CodeLlama-34b-hf | 41.31 | 42.29 |
| deepseek-ai/deepseek-coder-33b-base | 47.96 | 47.56 |
| Phind/Phind-CodeLlama-34B-v2 | 65.12 | 65.27 |
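
On the expected variance: code-eval scores of this kind are typically pass@k estimates computed from a finite number of sampled completions per problem, e.g. with the unbiased estimator from the Codex paper (Chen et al., 2021). The sketch below illustrates that estimator and why n=20 samples per problem leaves room for the roughly one-point run-to-run differences in the table; it is not necessarily Composer's exact implementation.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n sampled completions, c of them correct
    (Chen et al., 2021). Equals 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# With n=20 samples at temperature 0.2, a single extra correct sample on one
# problem moves that problem's pass@1 estimate by 5 points (1/20), so small
# aggregate shifts between runs are expected.
print(pass_at_k(n=20, c=3, k=1))  # 0.15
print(pass_at_k(n=20, c=4, k=1))  # 0.20
```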

Before submitting

  • Have you read the contributor guidelines?
  • Was this change discussed/approved in a GitHub issue first? It is much more likely to be merged if so.
  • Did you update any related docs and document your change?
  • Did you update any related tests and add any new tests related to your change? (see testing)
  • Did you run the tests locally to make sure they pass?
  • Did you run pre-commit on your change? (see the pre-commit section of prerequisites)

@dakinggg (Contributor) left a comment

Overall LGTM. The metric thing is a little gross, but the current design doesn't afford any other options that I see, and I think it's safe, so I'm good with it.

@dakinggg (Contributor) left a comment

Could you include in the PR description evidence that results don't change before and after this PR (assuming you carefully set the hparams to be the same)? Or, if that isn't possible for some reason, at least a number on a popular model that we can match against something. Otherwise, LGTM; I will approve once you have that.

@josejg (Contributor, Author) commented Feb 14, 2024


Added experiments on public models to the PR description.

@josejg (Contributor, Author) commented Feb 14, 2024

I think we have a couple of tests that have the right intention but are testing the wrong thing:

The idea is to make sure that inputs aren't overly left-padded. The issue is that the ICL dataset maps using dataset.map and decides the left padding based on the entire dataset; we were not catching that before because we were only looking across all the examples.

To be fair, the test is checking for the ideal behaviour; the issue is that our ICL code doesn't do that, and that contribution feels out of scope for this PR.

@josejg (Contributor, Author) commented Feb 14, 2024

Reworked some of the tests. However, I'd like to know whether we can remove unnecessary left padding on a per-batch basis.
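
For illustration only (not code from this PR), here is a minimal sketch of what per-batch trimming of excess left padding could look like, assuming left-padded input_ids with a matching attention_mask; the function name and the pad-token id of 0 are assumptions for the example.

```python
import torch


def trim_left_padding(input_ids: torch.Tensor, attention_mask: torch.Tensor):
    """Drop leading columns that are padding for every row in this batch.

    Assumes left padding, i.e. each row's attention_mask looks like
    [0, ..., 0, 1, ..., 1]; columns that are 0 for all rows carry no tokens.
    """
    col_is_pad = ~attention_mask.bool().any(dim=0)            # True where the whole column is padding
    shared_pad = int(col_is_pad.long().cumprod(dim=0).sum())  # count of leading all-padding columns
    return input_ids[:, shared_pad:], attention_mask[:, shared_pad:]


# The dataset-wide padding length is 8, but this particular batch only needs 5.
ids = torch.tensor([[0, 0, 0, 0, 0, 11, 12, 13],
                    [0, 0, 0, 7, 8, 9, 10, 11]])   # 0 is the assumed pad token id
mask = (ids != 0).long()
trimmed_ids, trimmed_mask = trim_left_padding(ids, mask)
assert trimmed_ids.shape == (2, 5)
```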

@dakinggg (Contributor) left a comment

@josejg It looks good to me, and you are right that the tests were simply incorrect before. Are you saying you want to trim the extra left padding from each individual batch (since padding is determined based on the full dataset as opposed to each individual batch)?

@dakinggg (Contributor) left a comment

🚢

josejg merged commit 6a5972f into dev on Feb 15, 2024 (14 checks passed).
josejg deleted the batch_code_eval branch on February 15, 2024 at 16:27.
josejg restored the batch_code_eval branch on February 16, 2024 at 08:00.