add `.json` to SUPPORTED_EXTENSIONS #1114

eitanturok · 2024-04-14T20:50:56Z

Summary:
This PR allows us to create a finetuning dataset from .json files.

What Happened:
I was trying to finetune llama2-7b-chat on the OpenHermes-2.5 dataset but got an error because the dataset is a .json file, not a .jsonl file.

Looking into it, I discovered that when is_safe=True, we check that the dataset files all have extensions in SUPPORTED_EXTENSIONS (source). Currently,

SUPPORTED_EXTENSIONS = ['.csv', '.jsonl', '.parquet']

and because the OpenHermes-2.5 dataset consists of a .json file, which is not in SUPPORTED_EXTENSIONS, we get an error.
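
To make the failure mode concrete, here is a minimal sketch of an extension check of this kind. It is not the actual llm-foundry code: `has_safe_data_file` and the file listing are hypothetical, and the position of `.json` in the updated list is an assumption. Only the original extension list comes from this PR.

```python
from pathlib import Path

# Extension list as quoted in this PR description (before the change).
SUPPORTED_EXTENSIONS_BEFORE = ['.csv', '.jsonl', '.parquet']
# After the change; where '.json' sits in the real list is an assumption.
SUPPORTED_EXTENSIONS_AFTER = SUPPORTED_EXTENSIONS_BEFORE + ['.json']


def has_safe_data_file(file_names, supported_extensions):
    """Hypothetical helper: True if any file has a supported extension."""
    return any(Path(name).suffix in supported_extensions for name in file_names)


# Hypothetical listing for a dataset that ships a single .json data file.
files = ['README.md', 'data.json']

print(has_safe_data_file(files, SUPPORTED_EXTENSIONS_BEFORE))  # False
print(has_safe_data_file(files, SUPPORTED_EXTENSIONS_AFTER))   # True
```

With the original list, no file in the listing qualifies, which matches the "No data files with safe extensions" error reported further down.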

My Solution:
I added .json to SUPPORTED_EXTENSIONS, and I can now finetune on this dataset. The change is simple, but I tested it on a few runs to make sure everything works.

Testing:
My runs (below) show that, on the main branch, finetuning on a .json file works when is_safe=False (run 1) but fails when is_safe=True (run 2). With my changes in the eitan-patch-json branch, you can finetune on a .json file when is_safe=True (run 3).

Runs:

1. Run llama2-7b-chat-open-hermes-ft-DKoAb6 on main branch -- I successfully finetuned on the .json file in OpenHermes-2.5 when is_safe=False.
2. Run llama2-7b-chat-open-hermes-ft-CY6yNJ on main branch -- I got an error finetuning on the .json file in OpenHermes-2.5 when is_safe=True.
3. Run llama2-7b-chat-open-hermes-ft-5xoXNv on eitan-patch-json branch -- I successfully finetuned on the .json file in OpenHermes-2.5 when is_safe=True.

Note: I stopped runs 1 and 3 after training for 10+ batches, so the error you see in their logs is from me stopping the run early, not from preparing the dataset. To view the logs of run 2, run mcli logs llama2-7b-chat-open-hermes-ft-CY6yNJ --resumption 0. The error from this run looks like:

InvalidFileExtensionError: safe_load is set to True. No data files with safe 
extensions ['.csv', '.jsonl', '.parquet'] found for dataset at local path 
teknium/OpenHermes-2.5.

eitanturok · 2024-04-18T20:16:12Z

@irenedea Wanted to bump this.

dakinggg · 2024-04-19T01:34:11Z

Did a little manual test too

In [18]: jsonl = datasets.load_dataset('jsonl-folder', split='train')
Generating train split: 2 examples [00:00, 193.61 examples/s]

In [19]: json = datasets.load_dataset('json-folder', split='train')
Generating train split: 2 examples [00:00, 220.07 examples/s]

In [20]: jsonl
Out[20]:
Dataset({
    features: ['prompt', 'response'],
    num_rows: 2
})

In [21]: jsonl[0]
Out[21]: {'prompt': 'hi', 'response': 'goodbye'}

In [22]: json
Out[22]:
Dataset({
    features: ['response', 'prompt'],
    num_rows: 2
})

In [23]: json[0]
Out[23]: {'response': 'goodbye', 'prompt': 'hi'}
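
For anyone who wants to repeat this check, here is a rough sketch of how two such folders could be set up using only the standard library. The file names (`train.jsonl`, `train.json`), the temporary directory, and the second record are assumptions for illustration; only the folder names and the first record come from the session above.

```python
import json
import tempfile
from pathlib import Path

# First record is taken from the session above; the second is a made-up placeholder.
records = [
    {'prompt': 'hi', 'response': 'goodbye'},
    {'prompt': 'hello', 'response': 'bye'},
]

# Work in a temporary directory so nothing in the current folder is overwritten.
root = Path(tempfile.mkdtemp())
jsonl_path = root / 'jsonl-folder' / 'train.jsonl'
json_path = root / 'json-folder' / 'train.json'
for path in (jsonl_path, json_path):
    path.parent.mkdir(parents=True, exist_ok=True)

# .jsonl: one JSON object per line.
jsonl_path.write_text(''.join(json.dumps(r) + '\n' for r in records))
# .json: a single top-level array holding the same objects.
json_path.write_text(json.dumps(records))

# Read both files back and confirm they describe the same records.
from_jsonl = [json.loads(line) for line in jsonl_path.read_text().splitlines()]
from_json = json.loads(json_path.read_text())
print(from_jsonl == from_json)
```

Pointing `datasets.load_dataset` at each folder, as in the session above, should then give two datasets with the same rows.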

irenedea · 2024-04-19T06:35:14Z

Oops, so sorry, @eitanturok! I need to fix my GH slack notifs :( Please don't hesitate to ping on slack if you need something reviewed!

add .json to SUPPORTED_EXTENSIONS

60aad4c

dakinggg requested a review from irenedea April 15, 2024 17:24

Merge branch 'main' into eitan-patch-json

7da976c

Merge branch 'main' into eitan-patch-json

e72b8a2

dakinggg approved these changes Apr 19, 2024

irenedea approved these changes Apr 19, 2024

dakinggg merged commit 20cb40c into main Apr 19, 2024
9 checks passed
