
feat: use dynamic batching when generating #9

Merged: 4 commits merged into main from dynamic-batches on Apr 2, 2024

Conversation

tengomucho (Collaborator)

What does this PR do?

This will increase the number of batches dynamically when prefill is called, and reduce it only when prefill is called again. The intention is to avoid useless recompilation by keeping the batch size the same for as long as possible. Also note that if a slot is removed, the whole KV cache should be rebuilt; for now we do not do that.
Since the batch size is now dynamic, the expected test results are different and have been updated accordingly.
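
For illustration, here is a minimal sketch of the approach described above, not the code in this PR: prefill prunes slots freed since the last call and grows the slot list for the incoming requests, so the batch size only changes at prefill time. The is_empty() helper is an assumption; Slot and assign follow the snippets quoted in the review below.

# Minimal sketch (not the PR's implementation): grow the slot list at prefill
# time and shrink it only on the next prefill call, so the compiled graph is
# reused for the whole decode phase in between.
def prefill(self, batch):
    # Shrink only here: drop slots freed since the last prefill.
    # is_empty() is a hypothetical helper for "slot has no active request".
    active_slots = [slot for slot in self.slots if not slot.is_empty()]
    # Grow on demand: one new slot per incoming request.
    new_slots = [
        Slot(len(active_slots) + i, self.tokenizer, self.model.device)
        for i in range(len(batch.requests))
    ]
    self.slots = active_slots + new_slots
    # Assign each request to its freshly created slot.
    for request, slot in zip(batch.requests, new_slots):
        slot.assign(request, self.model.generation_config)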
@tengomucho force-pushed the dynamic-batches branch 2 times, most recently from 944ecbe to fe61d04, March 26, 2024 16:06
Now it is only useful for warmup, since dynamic batching is being used.
@tengomucho marked this pull request as ready for review March 26, 2024 17:27
mfuntowicz (Member) left a comment:

LGTM on the logic; I left a few comments on a few minor things that could potentially be optimized.

Super cool!

SEQUENCE_LENGTH = 1024


@pytest.fixture(scope="module")
def model_name_or_path():
    os.environ["HF_BATCH_SIZE"] = str(BATCH_SIZE)
mfuntowicz (Member):
Maybe we should keep a HF_MAX_BATCH_SIZE somewhere?

tengomucho (Collaborator, Author):

as mentioned below, maybe we'll use it later
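
If such a cap were introduced, a minimal sketch could read it from the environment with a fallback; HF_MAX_BATCH_SIZE is only the name suggested above, and max_batch_size() is a hypothetical helper, neither is added by this PR.

import os

def max_batch_size(default_batch_size):
    # Hypothetical cap on dynamic batch growth, falling back to the static
    # batch size when the environment variable is unset.
    return int(os.environ.get("HF_MAX_BATCH_SIZE", default_batch_size))

# e.g. refuse to grow the slot list past the cap during prefill:
# if len(self.slots) + len(batch.requests) > max_batch_size(self.model.config.batch_size):
#     raise ValueError("requested batch exceeds HF_MAX_BATCH_SIZE")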

@@ -312,7 +312,10 @@ def __init__(
        tokenizer.padding_side = "left"
        self.tokenizer = tokenizer
        self.special_tokens = self.tokenizer.all_special_ids
        self.slots = [Slot(i, tokenizer, self.model.device) for i in range(self.model.config.batch_size)]
mfuntowicz (Member):

If we introduce the HF_MAX_BATCH_SIZE, maybe we can initialize this list up to it, wdyt?

tengomucho (Collaborator, Author):

I think I need to better understand how the batches are used: whether the batch size can be increased/reduced often, and how that affects performance. At that point it will be easier to think about a reasonable algorithm that reduces compilation and batch-change overhead as much as possible.
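
For reference, the reviewer's suggestion would look roughly like the sketch below; max_batch_size is an assumed config attribute that does not exist in this PR.

# Sketch of pre-allocating the slot list up to an assumed maximum, instead of
# growing it dynamically in prefill.
self.slots = [
    Slot(i, tokenizer, self.model.device)
    for i in range(self.model.config.max_batch_size)  # assumed attribute
]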

@@ -350,13 +353,18 @@ def warmup(self, batch: Batch) -> int:
            The maximum number of tokens the model supports.
        """
        # Just check that the warmup request parameters match the model capacity
        batch_size = self.model.config.batch_size
        # NOTE: later self.model.config.batch_size might become self.model.config.max_batch_size.
mfuntowicz (Member):

Ok maybe you want to keep it for later 🤗
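
For context, the capacity check behind the NOTE above boils down to something like this sketch; check_warmup_capacity is a hypothetical helper and the exact error handling is an assumption, not the PR's code.

def check_warmup_capacity(num_requests, batch_size):
    # Reject a warmup batch larger than the configured (eventually maximum)
    # batch size, mirroring the comment in the hunk above.
    if num_requests > batch_size:
        raise ValueError(
            f"Warmup batch has {num_requests} requests, "
            f"but the model is configured for at most {batch_size}."
        )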

         # Assign each request to an empty slot
-        logger.debug(f"Prefilling {len(batch.requests)} new request(s) with {len(empty_slots)} empty slot(s)")
+        logger.debug(f"Prefilling {len(batch.requests)} new request(s) adding to {len(active_slots)} active slot(s)")
+        new_slots = []
mfuntowicz (Member):

What is new_slots used for?

tengomucho (Collaborator, Author):

useless 😄

            slot.assign(request, self.model.generation_config)
            new_slots.append(slot)
mfuntowicz (Member):

Same comment about new_slots

tengomucho (Collaborator, Author):

same answer, I'll remove it
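
With new_slots gone, the assignment loop reduces to a sketch like this; slots_to_fill stands in for whatever iterable of free slots prefill pairs with the incoming requests.

# Simplified assignment loop once the unused new_slots list is removed.
for request, slot in zip(batch.requests, slots_to_fill):
    slot.assign(request, self.model.generation_config)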

mfuntowicz (Member):

Just to be sure: in this PR we are not limiting the maximum batch size the server can handle, right? If so, can we implement this in a follow-up PR?

@tengomucho merged commit 595b7b2 into main on Apr 2, 2024
1 check passed
@mfuntowicz deleted the dynamic-batches branch on April 29, 2024 07:54