A brand-new DatasetGenerator using gpt-3.5-turbo and json #20
Conversation
Overall, things are looking pretty good! I have a couple comments (primarily, I still think the iogenerator.py class can be incorporated into the openai.py class, and I have some suggestions about variable naming), but I think these should be easy to resolve. Hopefully we'll be good to merge after that!
Looks good to me (LGTM)!
I made one typo correction which you can accept before merging
Looking nice! I had a few comments.
Looking nice! I just have one final suggestion about the place where we add mocked code.
Looks great, please go ahead and merge :)
Description
After contacting the authors of *Can ChatGPT Reproduce Human-Generated Labels? A Study of Social Computing Tasks*, I thoroughly refactored our `DatasetGenerator`. We now use a prompt template and ask the LLM to return the generated examples in `json` format, which is more controllable than our previous method of returning natural-language examples and regenerating labels in a separate API call (a rough sketch of the idea follows). The detailed changes are listed below.
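For readers skimming the diff, here is a minimal sketch of the json-based generation idea, assuming the pre-v1 `openai.ChatCompletion` API that this PR also mocks in its tests; the prompt template wording and the `input`/`output` key names are illustrative placeholders, not the PR's actual code.

```python
import json

import openai

# Illustrative template; the real prompt template lives in the PR's generator code.
PROMPT_TEMPLATE = (
    "{instruction}\n\n"
    "Here are some examples:\n{few_shot_examples}\n\n"
    'Generate one new example as json with the keys "input" and "output".'
)


def generate_one_example(instruction: str, few_shot_examples: str) -> dict:
    """Ask gpt-3.5-turbo for a single example and parse its json response."""
    prompt = PROMPT_TEMPLATE.format(
        instruction=instruction, few_shot_examples=few_shot_examples
    )
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    content = response["choices"][0]["message"]["content"]
    return json.loads(content)  # e.g. {"input": "...", "output": "..."}
```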
To be Discussed
Already Discussed
- Fixed `DatasetGenerator` by including `split.value` in the code. This ensures that `DatasetSplit` can be serialized with the `save_to_disk` method (a short sketch follows this list).
- Fixed `run_locally.py` by adding `.value` to the lines that define the `generated_training`, `validation`, and `testing` sets.
- Implemented `OpenAIDatasetGenerator` and `InputOutputGenerator`, and added a unit test for `InputOutputGenerator`. However, I have some questions and concerns that need to be addressed:
- Is the `split` argument necessary in the `generate_examples` function?
- I use `gpt-3.5-turbo` rather than `text-davinci-002` because `turbo` is derived from Codex and can handle a `json` response, while text-davinci can't. But this narrows the models users can choose from.
- Mocked `openai.ChatCompletion` through a `MockCompletion` class for our unit test (sketched after this list). 🚀🚀🚀
- Added `natrual_instruction` and `few_shot_examples` in `MockPromptSpec`.
- First, a valid `json` response isn't always returned. Second, the code for extracting the expected key from the `json` response is not very Pythonic.
- If `json.JSONDecodeError`, `IndexError`, `TypeError`, `ValueError`, or `AttributeError` occurs, the program will be terminated. However, I cannot use a simple `except:` statement to catch all errors because it is blocked by `mypy` (see the error-handling sketch after this list).
- The `generate_examples` function is not very Pythonic in the `for _ in tqdm(range(num_examples), desc="Generating examples"):` loop. However, I do not have a better way to make it more Pythonic while also displaying the generation progress, which is essential (see the progress-loop sketch after this list).
- Dropped `pandas` and directly returned a `Dataset` object created from a dictionary.
- Added `max_api_call` to set an upper bound on the number of API calls.
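A small sketch of the `split.value` change, under the assumption that `DatasetSplit` is a string-valued `Enum`; the column names and the save path below are hypothetical. Passing the member's `.value` keeps everything a plain string, which `datasets`' `save_to_disk` handles without trouble.

```python
from enum import Enum

from datasets import Dataset


class DatasetSplit(Enum):
    """Assumed shape of the split enum: each member wraps a plain string."""

    TRAIN = "train"
    VAL = "val"
    TEST = "test"


split = DatasetSplit.TRAIN
dataset = Dataset.from_dict({"input_col": ["2 + 2"], "output_col": ["4"]})

# Use the underlying string rather than the enum member wherever the split
# name ends up in serialized metadata or an on-disk path.
dataset.save_to_disk(f"generated_dataset/{split.value}")
```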
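On the mocking point, the test-side idea is roughly the following sketch, assuming `unittest.mock.patch` and a pytest-style test; the class body and the json payload are placeholders rather than the PR's exact `MockCompletion`.

```python
import json
from unittest.mock import patch

import openai


class MockCompletion:
    """Mimic only the part of the ChatCompletion response that the generator reads."""

    def __init__(self, content: str):
        self._data = {"choices": [{"message": {"content": content}}]}

    def __getitem__(self, key):
        return self._data[key]


def test_generate_examples_parses_json_response():
    mocked = MockCompletion(json.dumps({"input": "2 + 2", "output": "4"}))
    # Patch the real API call so the unit test never touches the network.
    with patch.object(openai.ChatCompletion, "create", return_value=mocked):
        response = openai.ChatCompletion.create(model="gpt-3.5-turbo", messages=[])
        content = response["choices"][0]["message"]["content"]
        assert json.loads(content) == {"input": "2 + 2", "output": "4"}
```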
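On the error-handling concern, one pattern that satisfies `mypy` without a bare `except:` is to catch exactly the exceptions you expect and decide explicitly what happens on failure, as in this sketch. The `input`/`output` keys are illustrative, and `KeyError` is added here for missing keys even though it is not in the list above.

```python
import json
from typing import Optional, Tuple

# The exceptions that json parsing and key extraction can realistically raise.
PARSING_ERRORS = (
    json.JSONDecodeError,
    IndexError,
    TypeError,
    ValueError,
    AttributeError,
    KeyError,  # added in this sketch to cover missing keys
)


def extract_example(response_text: str) -> Optional[Tuple[str, str]]:
    """Return (input, output) from a json response, or None if it is malformed."""
    try:
        parsed = json.loads(response_text)
        return str(parsed["input"]), str(parsed["output"])
    except PARSING_ERRORS:
        # Skip the malformed response instead of terminating the whole run.
        return None
```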
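And for the progress-bar and `max_api_call` points together, one possible shape (a sketch, not the PR's implementation) is a `while` loop that counts API calls separately from accepted examples, drives `tqdm` manually, and builds the `Dataset` straight from a dictionary at the end; `fake_api_call` and the column names are stand-ins.

```python
import random
from typing import Optional

from datasets import Dataset
from tqdm import tqdm


def fake_api_call() -> Optional[dict]:
    """Stand-in for the real API call plus json parsing; occasionally 'fails'."""
    return {"input": "2 + 2", "output": "4"} if random.random() > 0.2 else None


def generate_dataset(num_examples: int, max_api_call: int) -> Dataset:
    """Generate up to num_examples examples, capped at max_api_call API calls."""
    inputs: list = []
    outputs: list = []
    api_calls = 0
    with tqdm(total=num_examples, desc="Generating examples") as progress:
        while len(inputs) < num_examples and api_calls < max_api_call:
            api_calls += 1
            example = fake_api_call()
            if example is None:  # malformed response: try again on the next call
                continue
            inputs.append(example["input"])
            outputs.append(example["output"])
            progress.update(1)
    # No pandas round-trip: build the Dataset directly from a dictionary.
    return Dataset.from_dict({"input_col": inputs, "output_col": outputs})
```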
References

Blocked by `prompt_spec.parse_from_prompt`.