add .json to SUPPORTED_EXTENSIONS #1114

Merged
merged 3 commits into from
Apr 19, 2024

Conversation

eitanturok
Contributor

@eitanturok eitanturok commented Apr 14, 2024

Summary:
This PR allows us to create a finetuning dataset from .json files.

What Happened:
I was trying to finetune llama2-7b-chat on the OpenHermes-2.5 dataset but got an error because the dataset is a .json file, not a .jsonl file.

Looking into it, I discovered that when is_safe=True, we check that the dataset files all have extensions in SUPPORTED_EXTENSIONS (source). Currently,

SUPPORTED_EXTENSIONS = ['.csv', '.jsonl', '.parquet']

and because the OpenHermes-2.5 dataset consists of a .json file, which is not in SUPPORTED_EXTENSIONS, we get an error.
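For reviewers, the check described above can be sketched roughly as follows. This is a minimal illustration and not the actual code in the repository: the helper names, the `ValueError`, the example file names, and the position of `.json` in the updated list are all assumptions. Only the original extension list is quoted from this description.

```python
from pathlib import Path

# Quoted from the PR description (the list before this change).
SUPPORTED_EXTENSIONS_BEFORE = ['.csv', '.jsonl', '.parquet']
# Assumed shape of the list after this change; the ordering is a guess.
SUPPORTED_EXTENSIONS_AFTER = ['.csv', '.json', '.jsonl', '.parquet']


def find_safe_files(file_names, supported_extensions):
    """Hypothetical helper: keep only files whose suffix is in the allow-list."""
    return [name for name in file_names if Path(name).suffix in supported_extensions]


def check_dataset_files(file_names, supported_extensions, is_safe=True):
    """Hypothetical check: with is_safe=True, require at least one allowed file."""
    if not is_safe:
        # No filtering at all when the safety check is disabled.
        return list(file_names)
    safe_files = find_safe_files(file_names, supported_extensions)
    if not safe_files:
        raise ValueError(
            f'No data files with safe extensions {supported_extensions} found.'
        )
    return safe_files


# Illustrative file listing for a dataset that ships a single .json file.
files = ['README.md', 'openhermes2_5.json']

try:
    check_dataset_files(files, SUPPORTED_EXTENSIONS_BEFORE)
except ValueError as err:
    print(f'before the change: {err}')

print('after the change:', check_dataset_files(files, SUPPORTED_EXTENSIONS_AFTER))
```

Note that `Path(name).suffix` returns `.jsonl` for a `.jsonl` file, so allowing `.jsonl` never implied allowing `.json`; the two have to be listed separately.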

My Solution:
So I added .json to SUPPORTED_EXTENSIONS, and now I can finetune on this dataset. The change itself is simple, but I tested it on a few runs to make sure everything works.

Testing:
My runs (below) show that on the main branch, finetuning on a .json file works when is_safe=False (run 1) but fails when is_safe=True (run 2). With my changes in the eitan-patch-json branch, you can finetune on a .json file when is_safe=True (run 3).

Runs:

  1. Run llama2-7b-chat-open-hermes-ft-DKoAb6 on main branch -- I successfully finetuned on the .json in OpenHermes-2.5 when is_safe=False.
  2. Run llama2-7b-chat-open-hermes-ft-CY6yNJ on main branch -- I got an error finetuning on the .json file in OpenHermes-2.5 when is_safe=True.
  3. Run llama2-7b-chat-open-hermes-ft-5xoXNv on eitan-patch-json branch -- I successfully finetuned on the .json in OpenHermes-2.5 when is_safe=True by running on my branch eitan-patch-json.

Note: I stopped runs 1 and 3 after training for 10+ batches, so the error you see in their logs is from me stopping the run early, not from preparing the dataset. To view the logs of run 2, run mcli logs llama2-7b-chat-open-hermes-ft-CY6yNJ --resumption 0. The error from this run looks like:

InvalidFileExtensionError: safe_load is set to True. No data files with safe 
extensions ['.csv', '.jsonl', '.parquet'] found for dataset at local path 
teknium/OpenHermes-2.5.

@dakinggg dakinggg requested a review from irenedea April 15, 2024 17:24
@eitanturok
Contributor Author

@irenedea Wanted to bump this.

@dakinggg
Collaborator

Did a little manual test too

In [18]: jsonl = datasets.load_dataset('jsonl-folder', split='train')
Generating train split: 2 examples [00:00, 193.61 examples/s]

In [19]: json = datasets.load_dataset('json-folder', split='train')
Generating train split: 2 examples [00:00, 220.07 examples/s]

In [20]: jsonl
Out[20]:
Dataset({
    features: ['prompt', 'response'],
    num_rows: 2
})

In [21]: jsonl[0]
Out[21]: {'prompt': 'hi', 'response': 'goodbye'}

In [22]: json
Out[22]:
Dataset({
    features: ['response', 'prompt'],
    num_rows: 2
})

In [23]: json[0]
Out[23]: {'response': 'goodbye', 'prompt': 'hi'}
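The manual test above compares a folder containing a .jsonl file with one containing a .json file. As a rough, standard-library-only sketch of why the two can hold the same examples (this does not use or reproduce `datasets.load_dataset`; the file names, helper functions, and the second record are illustrative assumptions):

```python
import json
import tempfile
from pathlib import Path

# First record is taken from the session above; the second is made up for illustration.
records = [
    {'prompt': 'hi', 'response': 'goodbye'},
    {'prompt': 'hello', 'response': 'bye'},
]


def write_jsonl(path, rows):
    """Write one JSON object per line (JSON Lines layout)."""
    with open(path, 'w') as f:
        for row in rows:
            f.write(json.dumps(row) + '\n')


def write_json(path, rows):
    """Write all rows as a single JSON array."""
    with open(path, 'w') as f:
        json.dump(rows, f)


def read_jsonl(path):
    """Parse each non-empty line as its own JSON object."""
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]


def read_json(path):
    """Parse the whole file as one JSON document."""
    with open(path) as f:
        return json.load(f)


with tempfile.TemporaryDirectory() as tmp:
    jsonl_path = Path(tmp) / 'train.jsonl'
    json_path = Path(tmp) / 'train.json'
    write_jsonl(jsonl_path, records)
    write_json(json_path, records)
    jsonl_rows = read_jsonl(jsonl_path)
    json_rows = read_json(json_path)

# Dict equality ignores key order, so a different column order is not a mismatch.
same_rows = jsonl_rows == json_rows
print('same rows:', same_rows)
```

The differing feature order in the two `Dataset` outputs above (`['prompt', 'response']` versus `['response', 'prompt']`) is a difference in column order only; the rows contain the same key/value pairs.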

@irenedea
Contributor

Oops, so sorry, @eitanturok! I need to fix my GH slack notifs :( Please don't hesitate to ping on slack if you need something reviewed!

@dakinggg dakinggg merged commit 20cb40c into main Apr 19, 2024
9 checks passed