[ENH] Improved data preparation for LLM finetuning #8692

paxcema · 2024-01-25T22:41:38Z

Description

This PR introduces some common data preparation utilities for fine-tuning LLMs that operate following the ChatCompletion format from OpenAI. Currently, these coupled versions of these methods are actively being used in our integrations for Anyscale and OpenAI (hence indirectly by LangChain, too).

Added tests for all of these, so that we can use them on any other LLM engines.

Type of change

⚡ New feature (non-breaking change which adds functionality). Refactor.

Verification Process

To ensure the changes are working as expected:

Test Location: added unit tests, which pass.

Additional Media:

I have attached a brief loom video or screenshots showcasing the new functionality or change.

Checklist:

My code follows the style guidelines(PEP 8) of MindsDB.
I have appropriately commented on my code, especially in complex areas.
Necessary documentation updates are either made or tracked in issues.
Relevant unit and integration tests are updated or added.

dusvyat

LGTM, see minor point

tests/unit/test_llm_utils.py

pedrofluxa

Left a comment on a hard-coded value. I am approving this PR as it should work as it is, but I strongly suggest to refactor the code a little bit.

mindsdb/integrations/handlers/anyscale_endpoints_handler/anyscale_endpoints_handler.py

chandrevdw31 · 2024-02-08T17:27:40Z

Unable to Finetune as it says False

paxcema · 2024-02-08T18:12:38Z

@chandrevdw31 status is marked as generating in the screenshot above, and these finetune runs can easily take hours so this is normal. The model cannot be active before its status is complete.

If the user reports an actual error in the model record, please ask them to open a bug with it.

paxcema added 4 commits January 24, 2024 20:26

first pass

b4676e2

Merge branch 'staging' into feat/improved_ft_dataprep

541f868

fix missing dep for bigQuery

4baf05a

refactor: common llm ft utils

7ddc7fe

paxcema changed the title ~~[Feat] Improved data preparation for LLM finetuning~~ [ENH] Improved data preparation for LLM finetuning Jan 25, 2024

paxcema added 7 commits January 25, 2024 21:22

remove dep for bigquery

3d81032

lint: flake8

a73e9a8

fix pytest < 8.0.0 to avoid error

985c1e0

Merge branch 'staging' into feat/improved_ft_dataprep

aca55da

add tests

abe98b9

refactored helper FT methods slightly for better separation of work

29ec33e

add test for get_completed_prompts

e66242c

paxcema marked this pull request as ready for review January 31, 2024 00:22

paxcema requested a review from dusvyat January 31, 2024 00:22

paxcema added 2 commits January 30, 2024 21:26

lint: flake8

326863f

add tests to CI

9b25bdc

paxcema requested a review from pedrofluxa January 31, 2024 00:27

dusvyat approved these changes Jan 31, 2024

View reviewed changes

tests/unit/test_llm_utils.py Show resolved Hide resolved

pedrofluxa approved these changes Jan 31, 2024

View reviewed changes

mindsdb/integrations/handlers/anyscale_endpoints_handler/anyscale_endpoints_handler.py Outdated Show resolved Hide resolved

paxcema added 2 commits January 31, 2024 17:18

address feedback: constant for min val length, and test fixtures

2e4837a

rm leftover comments

1822532

paxcema merged commit 650fd7c into staging Jan 31, 2024
10 checks passed

paxcema self-assigned this Feb 1, 2024

StpMax mentioned this pull request Feb 19, 2024

Release v24.2.3.0 #8784

Merged

hamishfagg deleted the feat/improved_ft_dataprep branch June 10, 2024 21:32

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[ENH] Improved data preparation for LLM finetuning #8692

[ENH] Improved data preparation for LLM finetuning #8692

paxcema commented Jan 25, 2024 •

edited

Loading

dusvyat left a comment

pedrofluxa left a comment

chandrevdw31 commented Feb 8, 2024

paxcema commented Feb 8, 2024

[ENH] Improved data preparation for LLM finetuning #8692

[ENH] Improved data preparation for LLM finetuning #8692

Conversation

paxcema commented Jan 25, 2024 • edited Loading

Description

Type of change

Verification Process

Additional Media:

Checklist:

dusvyat left a comment

Choose a reason for hiding this comment

pedrofluxa left a comment

Choose a reason for hiding this comment

chandrevdw31 commented Feb 8, 2024

paxcema commented Feb 8, 2024

paxcema commented Jan 25, 2024 •

edited

Loading