Extension of lora.py for supervised ML of configurable dataset formats with YAML-based configuration of parameters #235
Conversation
Will need to sync with ml-explore#213 when merged into main
Fixed import path, iteration calculation, and creation of configuration namespace from YAML.
Removed as much model-specific code as possible to clearly separate training logic from model-related concerns (tokenizer, model structure, model loading, generation, etc.). Added additional parameters: max_tokens, tokens_per_eval, and temp (for generation). Added a --dataset-summary option, which provides a summary of the training data and does nothing else. Fixed tqdm status to track iterations. Incorporated the latest LoRA training from mlx-examples main.
Incorporating the latest from llms/mistral/*, passing it off to a model-agnostic supervised LoRA training module
Uses the supervised LoRA framework, implementing all the HF model-specific bits. Modules for other kinds of models can be added in the same way, as model-specific modules that use supervised_lora.py
Another pass at separating model-specific bits from training logic. Still keeping an eye on #213 to see if there is any synergy
Just pushed the (proposed) final version of #213. Take a look and let me know how I can help us utilize our changes together!
That would be fantastic! Sorry, I only just saw this. However, I see this comment from today and will be consolidating #337 merges into this PR. I still think a separate, configuration-based LoRA example would be handy, even just for the purposes of LoRA hacking on this framework. Later this week, I'm also planning on adding the writing of loss/validation structured data for plotting purposes (perhaps as a separate PR). Let me know your updated thoughts regarding this and #213
Amazing job @chimezie 🚀
Based on lora.py and meant to perform supervised instruction fine-tuning for the various supported models, separating the model-specific logic from the common training and supporting logic (and management of training configuration parameters).
It breaks out argument parameters into a YAML file and allows arbitrary training data (and prompt) formats
A configuration .yaml file (the only command line argument) is expected to be in the following format:
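A minimal sketch of the expected shape follows. The parameter names mirror lora.py's command-line arguments, but the specific entries, the epochs key name, and all values shown here are illustrative assumptions rather than the definitive schema:

```yaml
# Illustrative only: entries mirror lora.py's argument names; values are examples.
parameters:
  model: "mistralai/Mistral-7B-v0.1"   # assumed example value
  train: true
  data: "data/"
  lora_layers: 16
  batch_size: 4
  learning_rate: 1.0e-5
  epochs: 2        # key name assumed; -1 ignores this and falls back to iters
  iters: -1        # if the epoch setting is -1 and iters is -1, one epoch runs
  max_tokens: 100
  temp: 0.8
```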
Each entry under parameters corresponds to the default argument name of one of the command-line arguments originally provided to lora.py. The default values are the same as those of the original arguments they are based on.
An epoch parameter was added, which, if provided, determines the number of iterations (the number needed for a full pass over the data, i.e., an epoch). If its value is -1, it is ignored and the iters parameter is used as before. If iters is -1, then one epoch is performed.
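The interplay between the epoch setting and iters can be sketched as follows. This is a minimal illustration of the rule stated above; resolve_iters is a hypothetical helper, not a function from the actual module:

```python
import math

def resolve_iters(num_records: int, batch_size: int,
                  epochs: int, iters: int) -> int:
    """Hypothetical sketch: derive total training iterations.

    One epoch is a full pass over the data, i.e.
    ceil(num_records / batch_size) iterations.
    """
    iters_per_epoch = math.ceil(num_records / batch_size)
    if epochs == -1:
        # Epoch setting ignored; use iters as before,
        # or a single epoch when iters is also -1.
        return iters if iters != -1 else iters_per_epoch
    return epochs * iters_per_epoch
```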
A module for a particular prompt syntax or training dataset format can be defined with a class that specializes TrainingRecordHandler, providing an instance of it to main along with a function for getting model and tokenizer instances and another for generating from a given prompt using the tokenizer. Otherwise, the iterating batch implementation performs (Q)LoRA fine-tuning with configurable parameters.
It needs to define its own get_input and get_output methods, which take a Python dictionary and return an instruction input and output as strings, respectively, with the appropriate prompt formatting.
This allows arbitrary dataset JSON formats and prompt formats to be handled separately from the training logic.
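A handler for, say, Alpaca-style instruction records might look like the sketch below. Only the get_input/get_output contract comes from the description above; the class name, record keys, and [INST] prompt template are illustrative assumptions:

```python
class AlpacaRecordHandler:
    """Hypothetical sketch of a TrainingRecordHandler specialization
    for records shaped like {"instruction": ..., "output": ...}."""

    def get_input(self, record: dict) -> str:
        # Wrap the raw instruction in the model's expected prompt format
        # (Mistral-style [INST] tags assumed here).
        return f"[INST] {record['instruction']} [/INST]"

    def get_output(self, record: dict) -> str:
        # The completion the model is trained to produce.
        return record["output"]
```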
Will probably need to reconcile with #213 once that PR is merged into main
See mistral_supervised.py for an example.
Checklist
Put an x in the boxes that apply.