
A brand-new DatasetGenerator using gpt-3.5-turbo and JSON #20

Merged: 91 commits into main, Apr 28, 2023

Conversation

@zhaochenyang20 zhaochenyang20 commented Apr 24, 2023

Description

After contacting the authors of Can ChatGPT Reproduce Human-Generated Labels? A Study of Social Computing Tasks, I thoroughly refactored our DatasetGenerator. We now use a prompt template and ask the LLM to return the generated examples in JSON format, which is more controllable than our previous method of returning natural-language examples and regenerating labels in a separate API call.
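The "ask for JSON, then parse it" pattern can be sketched as follows. The template text and the `"input"`/`"output"` key names are illustrative assumptions, not the PR's actual prompt:

```python
import json

# Hypothetical prompt template, shown only to illustrate the pattern;
# the real template lives in the PR's OpenAIDatasetGenerator.
PROMPT_TEMPLATE = (
    "{instruction}\n\n"
    "Here are some examples:\n{few_shot_examples}\n\n"
    'Return ONE new example as a JSON object: {{"input": ..., "output": ...}}'
)

def build_prompt(instruction: str, few_shot_examples: str) -> str:
    """Fill the template before sending it to the chat model."""
    return PROMPT_TEMPLATE.format(
        instruction=instruction, few_shot_examples=few_shot_examples
    )

def parse_example(raw_content: str) -> dict:
    """Parse the model's reply, which should be a single JSON object."""
    return json.loads(raw_content)
```

Because the label arrives inside the same JSON object as the input, no second API call is needed to regenerate it.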

The detailed changes follow:

To be Discussed

  • NA

Already Discussed

  1. I fixed a bug in the DatasetGenerator by including split.value in the code. This ensures that DatasetSplit can be serialized with the save_to_disk method.
  2. Following the previous change, I updated Vijay's run_locally.py by adding .value to the lines that define the generated_training, validation, and testing sets.
  3. I created OpenAIDatasetGenerator and InputOutputGenerator and added a unit test for InputOutputGenerator. However, I have some questions and concerns that need to be addressed:
  • Is the split argument necessary in the generate_examples function?
  • Currently, we use gpt-3.5-turbo rather than text-davinci-002 because turbo is derived from Codex and can handle JSON responses, while text-davinci can't. However, this narrows the range of models users can choose from.
  • I mocked the behavior of openai.ChatCompletion through a MockCompletion class for our unit test. 🚀🚀🚀
  • I hard-coded natural_instruction and few_shot_examples in MockPromptSpec.
  • Although the current response mining is much better than before, there are still some issues. First, the key the LLM returns in the JSON isn't always the expected one. Second, the code for extracting the expected key from the JSON response is not very Pythonic.
  • The exception handling in my code is not optimal, as I try to cover all potential errors that might occur in the try block. If any error other than json.JSONDecodeError, IndexError, TypeError, ValueError, or AttributeError occurs, the program will terminate. However, I cannot use a simple except: statement to catch all errors because it is blocked by mypy.
  • The generate_examples function is not very Pythonic in the for _ in tqdm(range(num_examples), desc="Generating examples"): loop. However, I do not have a better way to make it more Pythonic while also displaying the generation progress, which is essential.
  4. I removed the use of pandas and directly returned a Dataset object created from a dictionary.
  5. I added max_api_call to set an upper bound on the number of API calls.
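The split.value fix in item 1 can be illustrated with a small, self-contained sketch. The enum member names are assumptions, and json.dumps stands in here for the serialization save_to_disk performs internally:

```python
import json
from enum import Enum

class DatasetSplit(Enum):  # sketch; member names are assumptions
    TRAIN = "train"
    VAL = "val"
    TEST = "test"

split = DatasetSplit.TRAIN

# An Enum member itself is not JSON-serializable (json.dumps(split)
# raises TypeError), but its `.value` is a plain string, which is why
# the PR passes `split.value` through before calling `save_to_disk`.
serialized = json.dumps({"split": split.value})
print(serialized)  # {"split": "train"}
```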
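A minimal sketch of the MockCompletion idea: a stub object that mirrors the `choices[0].message.content` attribute path of a real openai.ChatCompletion response. The class body and payload below are assumptions, not the PR's exact code:

```python
from unittest import mock

class MockCompletion:
    """Stub mirroring the attribute shape of an openai.ChatCompletion
    response: response.choices[0].message.content (sketch only)."""

    def __init__(self, content: str):
        message = mock.Mock(content=content)
        self.choices = [mock.Mock(message=message)]

def extract_content(response) -> str:
    # The same attribute path production code would use on a real response.
    return response.choices[0].message.content

fake = MockCompletion('{"input": "1 + 1 = ?", "output": "2"}')
assert extract_content(fake) == '{"input": "1 + 1 = ?", "output": "2"}'
```

In a unit test, `mock.patch("openai.ChatCompletion.create", return_value=fake)` would make the generator consume this stub without touching the network.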
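The JSON-key extraction with a narrow exception tuple might look roughly like this. The expected key names and the None-on-failure behavior are assumptions for illustration:

```python
import json
from typing import Optional

EXPECTED_KEYS = ("input", "output")  # hypothetical key names

def parse_json_response(raw: str) -> Optional[dict]:
    """Return the parsed example, or None if the reply is unusable.

    Catching a named tuple of exceptions, instead of a bare `except:`,
    satisfies mypy/linters while still covering malformed JSON,
    missing keys, and non-dict replies.
    """
    try:
        parsed = json.loads(raw)
        return {key: parsed[key] for key in EXPECTED_KEYS}
    except (json.JSONDecodeError, KeyError, TypeError):
        return None
```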
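The tqdm loop and the max_api_call budget might combine roughly as follows. The stub example payload is an assumption, and tqdm degrades to a no-op wrapper if it is not installed:

```python
try:
    from tqdm import tqdm
except ImportError:  # fall back to a no-op wrapper if tqdm is absent
    def tqdm(iterable, **kwargs):
        return iterable

def generate_examples(num_examples: int, max_api_call: int) -> list:
    """Sketch of the generation loop: a progress bar plus an API budget."""
    examples = []
    api_calls = 0
    # tqdm only wraps the range for display; the loop body is unchanged,
    # which is why a more "Pythonic" rewrite is hard to find.
    for _ in tqdm(range(num_examples), desc="Generating examples"):
        if api_calls >= max_api_call:
            break  # budget exhausted; stop early
        api_calls += 1
        examples.append({"input": f"stub {api_calls}", "output": "..."})
    return examples
```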

References

Blocked by

  • Issue 26. Currently we just mock the behavior of prompt_spec.parse_from_prompt.

@viswavi (Collaborator) left a comment

Overall, things are looking pretty good! I have a couple of comments (primarily, I still think the iogenerator.py class can be incorporated into the openai.py class, and I have some suggestions about variable naming), but I think these should be easy to resolve. Hopefully we'll be good to merge after that!

Resolved review threads (outdated):
  • prompt2model/dataset_generator/iogenerator.py (3 threads)
  • prompt2model/dataset_generator/openai.py
  • tests/dataset_generator_test.py

@viswavi (Collaborator) left a comment

Looks good to me (LGTM)!

I made one typo correction, which you can accept before merging.

@viswavi requested a review from neubig April 27, 2023 02:51

@neubig (Collaborator) left a comment

Looking nice! I had a few comments.

Resolved review threads (outdated): prompt2model/dataset_generator/openai.py (7 threads)

@neubig (Collaborator) left a comment

Looking nice! I just have one final suggestion about the place where we add mocked code.

Resolved review threads (outdated): prompt2model/dataset_generator/openai.py (2 threads)

@neubig (Collaborator) left a comment

Looks great, please go ahead and merge :)

@zhaochenyang20 merged commit e636192 into main Apr 28, 2023
@zhaochenyang20 deleted the Eren_Dataset_Generator_JSON branch April 28, 2023 12:48