-
Notifications
You must be signed in to change notification settings - Fork 2.4k
Update sft trainer to include better packing #3100
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
qgallouedec
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for your contribution, and sorry for the late reply. I've made some comments. Can you please also:
- include a test for
pack_example_smater - modify the packing argument in
SFTConfigto allow for this packing method
trl/data_utils.py
Outdated
| return knapsacks, [[tup[1]] for tup in numbers] | ||
|
|
||
|
|
||
| def pack_examples_smarter(examples: dict[str, list[list]], seq_length: int) -> dict[str, list[list]]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
wdyt of
| def pack_examples_smarter(examples: dict[str, list[list]], seq_length: int) -> dict[str, list[list]]: | |
| def pack_examples_knapsack(examples: dict[str, list[list]], seq_length: int) -> dict[str, list[list]]: |
or even
| def pack_examples_smarter(examples: dict[str, list[list]], seq_length: int) -> dict[str, list[list]]: | |
| def pack_examples_knapsack_greedy(examples: dict[str, list[list]], seq_length: int) -> dict[str, list[list]]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed!
trl/trainer/sft_trainer.py
Outdated
| packing: bool, | ||
| formatting_func: Optional[Callable[[dict], str]], | ||
| dataset_name: str, | ||
| pack_smart: bool = False, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe this is better:
| packing: bool, | |
| formatting_func: Optional[Callable[[dict], str]], | |
| dataset_name: str, | |
| pack_smart: bool = False, | |
| packing: Union[bool, str], | |
| formatting_func: Optional[Callable[[dict], str]], | |
| dataset_name: str, |
and use if packing == "knapsack":
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed!
|
Hi @qgallouedec sorry for the delay, I had my final exams :) |
|
I think we've arrived at something very satisfactory with FFD see #3521, so I'm closing this PR. Thanks for all your hard work! |
What does this PR do?
Fixes #2466
Before submitting
Pull Request section?
to it if that's the case.
documentation guidelines.
Who can review?
@qgallouedec @shirinyamani @AIR-hl
Hi everyone, I am new to TRL, have attempted a fix at the issue 2466 here.
The greedy knapsack tries to fit as many examples as possible without overshooting, the remaining ones are chunked as before by breaking it up into smaller example size (<= max_seq_len). Also I havent made this the default packing and kept it under a flag.
Please let me know if this is close to what was expected, I can add tests then, thanks!