Extension of lora.py for supervised ML of configurable dataset formats with YAML-based configuration of parameters #235

Closed
wants to merge 43 commits

Conversation

chimezie
Contributor

@chimezie chimezie commented Jan 5, 2024

Based on lora.py, this is meant to perform supervised instruction fine-tuning for the various supported models, separating the model-specific logic from the common training and supporting logic (and the management of training configuration parameters).

It breaks the argument parameters out into a YAML file and allows arbitrary training-data (and prompt) formats.

A configuration .yaml file (the only command-line argument) is expected to be in the following format:

parameters:
    model: "..."
    num_tokens: 100
    write_every: 1
    temp: 0.8
    train: true
    [..]

Each entry under parameters corresponds to one of the command-line arguments originally provided to lora.py and uses that argument's name. The default values are the same as those of the original arguments.
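
Below is a minimal sketch (not the PR's actual code) of how such a YAML file could be loaded into an argparse-style namespace; the load_config helper and the use of PyYAML are assumptions for illustration.

```python
import argparse
import sys

import yaml  # PyYAML, assumed to be available


def load_config(path: str) -> argparse.Namespace:
    """Turn the `parameters` block of the YAML file into a namespace."""
    with open(path) as f:
        config = yaml.safe_load(f)
    # Each key mirrors one of lora.py's original argument names.
    return argparse.Namespace(**config["parameters"])


if __name__ == "__main__":
    args = load_config(sys.argv[1])
    print(args.model, args.num_tokens, args.temp)
```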

An epoch parameter was added. If provided, it determines the number of iterations (the number needed for a full pass over the data, i.e., an epoch). If its value is -1, it is ignored and the iters parameter is used as before. If iters is -1, one epoch is performed.
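
The sketch below (assumed function and parameter names, not the PR's actual code) illustrates the iteration calculation described above.

```python
def resolve_iterations(iters: int, epochs: int, num_records: int, batch_size: int) -> int:
    """Return the number of training iterations to run."""
    # Iterations needed for one full pass over the training data.
    iters_per_epoch = max(1, num_records // batch_size)
    if epochs == -1:
        # Epoch count ignored: use iters as before, or one epoch if iters is also -1.
        return iters if iters != -1 else iters_per_epoch
    return epochs * iters_per_epoch
```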

A module for a particular prompt syntax or training-dataset format can be defined with a class that specializes TrainingRecordHandler. The module provides an instance of that class to main, along with a function for getting model and tokenizer instances and another for generating from a given prompt using the tokenizer. Beyond that, the batch-iterating implementation performs (Q)LoRA fine-tuning with configurable parameters.

The handler needs to define its own get_input and get_output methods, which take a Python dictionary and return the instruction input and output, respectively, as strings with the appropriate prompt formatting.

This allows arbitrary dataset JSON formats and prompt formats to be handled separately from the training logic.
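
As a concrete illustration, here is a minimal hypothetical handler; the TrainingRecordHandler stub, the field names ("instruction", "response"), and the prompt template are assumptions for illustration, not the PR's actual definitions (see mistral_supervised.py for the real example).

```python
class TrainingRecordHandler:
    """Stub standing in for the base class provided by the supervised LoRA module."""

    def get_input(self, record: dict) -> str:
        raise NotImplementedError

    def get_output(self, record: dict) -> str:
        raise NotImplementedError


class AlpacaStyleHandler(TrainingRecordHandler):
    """Maps an Alpaca-style JSON record onto an instruction prompt and its completion."""

    def get_input(self, record: dict) -> str:
        # Prompt text built from the dataset record, with formatting applied.
        return f"### Instruction:\n{record['instruction']}\n\n### Response:\n"

    def get_output(self, record: dict) -> str:
        # Target completion the model is trained to produce.
        return record["response"]


if __name__ == "__main__":
    handler = AlpacaStyleHandler()
    example = {"instruction": "Summarize LoRA in one sentence.",
               "response": "LoRA fine-tunes small low-rank adapter weights."}
    print(handler.get_input(example) + handler.get_output(example))
```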

This will probably need to be reconciled with #213 once that PR is merged into main.

See mistral_supervised.py for an example.

Checklist

Put an x in the boxes that apply.

  • I have read the CONTRIBUTING document
  • I have run pre-commit run --all-files to format my code / installed pre-commit prior to committing changes

Fixed import path, iteration calculation, and creation of configuration namespace from YAML.
@chimezie chimezie changed the title Extension of mlx-examples for supervised ML of configurable datasets with YAML-based configuration of parameters Extension of mlx-examples for supervised ML of configurable dataset formats with YAML-based configuration of parameters Jan 5, 2024
@chimezie chimezie changed the title Extension of mlx-examples for supervised ML of configurable dataset formats with YAML-based configuration of parameters Extension of lora.py for supervised ML of configurable dataset formats with YAML-based configuration of parameters Jan 5, 2024
Removed as much model-specific code as possible to clearly separate training logic from model-related concerns (tokenizer, model structure, model loading, generation, etc.). Added additional parameters: max_tokens, tokens_per_eval, and temp (for generation). Added a --dataset-summary option, which provides a summary of the training data and does nothing else. Fixed tqdm status to track iterations. Incorporated the latest LoRA training from mlx-examples main.
Incorporating the latest from llms/mistral/*, passing them off to the model-agnostic supervised LoRA training module.
Uses the supervised LoRA framework, implementing all the HF model-specific bits. Modules for other kinds of models can be added in the same way, as model-specific modules that use supervised_lora.py.
@chimezie
Contributor Author

Another pass at separating model-specific bits from training logic. Still keeping an eye on #213 to see if there is any synergy.

@ProjectProgramAMark

Just pushed the (proposed) final version of #213. Take a look and let me know how I can help utilize our changes together!

@chimezie
Contributor Author

> Just pushed the (proposed) final version of #213. Take a look and let me know how I can help utilize our changes together!

That would be fantastic! Sorry I only just saw this. However, I see this comment from today and will be consolidating #337 merges into this PR.

I still think a separate, configuration-based LoRA example would be handy, even just for the purposes of LoRA hacking on this framework. Later this week, I'm planning on also adding the writing of loss/validation structured data for plotting purposes (perhaps as a separate PR). Let me know your updated thoughts regarding this and #213.

@ivanfioravanti
Contributor

Amazing job @chimezie 🚀
