A Larger Deep Multi-Step Deductive Reasoning Dataset over Natural Language with Multi-Step Deductive Reasoning Instruction For OpenAI EVAL #651
Reuploaded the files using Git LFS and added eval JSON data examples.
@jorge-openai Hi Jorge, could you give me some tips or advice on how to fix the error? I am a bit confused about the error from the check. Any help would be much appreciated. Thanks in advance.
Hi @14H034160212, a couple of changes that I have flagged for this PR
Concerning the error, I'll try to take a look at it later today to see if I can give you any advice. First, though, try uploading a smaller file: if there is an error there, it will be easier to find, and we can add more samples later.
Hi @jorge-openai, many thanks for your reply! I have updated the files and data according to your suggestion.
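One way to produce the smaller file suggested above is to copy only the first few records of the JSONL data. The sketch below is a minimal illustration, not part of the PR: the function name `write_subset` and the default of 15 records are assumptions, and it only checks for the `input` and `ideal` keys that appear in the samples posted in this thread.

```python
import json
from pathlib import Path


def write_subset(src: Path, dst: Path, n: int = 15) -> int:
    """Copy the first n non-empty JSONL records from src to dst.

    Returns the number of records written. Each line is parsed first,
    so a malformed line fails early instead of failing later in CI.
    """
    written = 0
    with src.open(encoding="utf-8") as fin, dst.open("w", encoding="utf-8") as fout:
        for line in fin:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            record = json.loads(line)  # raises ValueError on malformed JSON
            # Assumed minimal schema, taken from the samples in this thread.
            if "input" not in record or "ideal" not in record:
                raise ValueError(f"record {written + 1} lacks 'input' or 'ideal'")
            fout.write(json.dumps(record, ensure_ascii=False) + "\n")
            written += 1
            if written == n:
                break
    return written
```

Calling `write_subset(Path("full.jsonl"), Path("small.jsonl"))` with hypothetical file names would write at most 15 validated records to the smaller file.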
Thanks for the quick response. There are still a couple of things to update, but we are on the right path. Let's see if this will get us through the CI tests.
evals/registry/evals/pararule-plus-multi-step-deductive-reasoning.yaml
…ing.yaml Co-authored-by: Jorge <133797909+jorge-openai@users.noreply.github.com>
…ing.yaml Co-authored-by: Jorge <133797909+jorge-openai@users.noreply.github.com>
Hi @jorge-openai, thanks a lot for your quick reply too! I have just committed your suggestion to the PR. It seems the
Thanks for the patience. The issue is not related to this particular eval, so we'll try a workaround to get the CI to pass in this case, and merge if all goes as expected. Please pull and merge main into your branch, then commit. This will allow us to run the CI on your eval. I'll check whether I am assigned to the other PR you mentioned, but just in case, do the same there.
Hi @jorge-openai, thanks a lot for your reply. I have pulled and merged the latest code from openai/evals into this branch now. I also did the same for my next PR.
Hi @jorge-openai, thanks a lot for your help! It seems that all checks have passed. Looks good. Is the code ready to be merged?
Hi, |
Hi @jorge-openai, thanks a lot for your reminder. I have pulled the latest code now.
Seems everything is good to merge now. Thanks for the patience and contribution!
…guage with Multi-Step Deductive Reasoning Instruction For OpenAI EVAL (openai#651)

# Thank you for contributing an eval! ♥️

🚨 Please make sure your PR follows these guidelines, __failure to follow the guidelines below will result in the PR being closed automatically__. Note that even if the criteria are met, that does not guarantee the PR will be merged nor GPT-4 access granted. 🚨

__PLEASE READ THIS__:

In order for a PR to be merged, it must fail on GPT-4. We are aware that right now, users do not have access, so you will not be able to tell if the eval fails or not. Please run your eval with GPT-3.5-Turbo, but keep in mind as we run the eval, if GPT-4 gets higher than 90% on the eval, we will likely reject since GPT-4 is already capable of completing the task.

We plan to roll out a way for users submitting evals to see the eval performance on GPT-4 soon. Stay tuned! Until then, you will not be able to see the eval performance on GPT-4. **Starting April 10, the minimum eval count is 15 samples, we hope this makes it easier to create and contribute evals.**

## Eval details 📑

### Eval name

[pararule-plus-multi-step-deductive-reasoning]

### Eval description

[We proposed a multi-step deductive reasoning instruction for the [PARARULE-Plus dataset](https://github.com/Strong-AI-Lab/PARARULE-Plus), which is a larger deep multi-step deductive reasoning dataset over natural language. We also submitted the PARARULE-Plus into the `Huggingface/Datasets`. Here is the [link](https://huggingface.co/datasets/qbao775/PARARULE-Plus). PARARULE-Plus dataset addresses the reasoning depth imbalance issue from the RuleTaker dataset. The dataset specifically increases the dataset on the deep reasoning depth, including depth=2, 3, 4, 5. In this pull request, we submit a dataset that includes `2708`, `2694`, `2704`, and `2692` questions for Depth=2, Depth=3, Depth=4, and Depth=5, respectively. Furthermore, we evaluate ChatGPT, and it fails on this dataset. Here is the [tweet link](https://twitter.com/qiming_bao/status/1615510552088018944).

### What makes this a useful eval?

[Logical reasoning ability is a fascinating topic in the NLP community. We hope to see if ChatGPT and GPT4 sheds more light on this topic.]

## Criteria for a good eval ✅

Below are some of the criteria we look for in a good eval. In general, we are seeking cases where the model does not do a good job despite being capable of generating a good response (note that there are some things large language models cannot do, so those would not make good evals).

Your eval should be:

- [x] Thematically consistent: The eval should be thematically consistent. We'd like to see a number of prompts all demonstrating some particular failure mode. For example, we can create an eval on cases where the model fails to reason about the physical world.
- [x] Contains failures where a human can do the task, but either GPT-4 or GPT-3.5-Turbo could not.
- [x] Includes good signal around what is the right behavior. This means either a correct answer for `Basic` evals or the `Fact` Model-graded eval, or an exhaustive rubric for evaluating answers for the `Criteria` Model-graded eval.
- [x] **Include at least 15 high quality examples.**

If there is anything else that makes your eval worth including, please document it below.

### Unique eval value

> Insert what makes your eval high quality that was not mentioned above. (Not required)

## Eval structure 🏗️

Your eval should

- [x] Check that your data is in `evals/registry/data/{name}`
- [x] Check that your yaml is registered at `evals/registry/evals/{name}.yaml`
- [x] Ensure you have the right to use the data you submit via this eval

(For now, we will only be approving evals that use one of the existing eval classes. You may still write custom eval classes for your own cases, and we may consider merging them in the future.)

## Final checklist 👀

### Submission agreement

By contributing to Evals, you are agreeing to make your evaluation logic and data under the same MIT license as this repository. You must have adequate rights to upload any data used in an Eval. OpenAI reserves the right to use this data in future service improvements to our product. Contributions to OpenAI Evals will be subject to our usual Usage Policies (https://platform.openai.com/docs/usage-policies).

- [x] I agree that my submission will be made available under an MIT license and complies with OpenAI's usage policies.

### Email address validation

If your submission is accepted, we will be granting GPT-4 access to a limited number of contributors. Access will be given to the email address associated with the merged pull request.

- [x] I acknowledge that GPT-4 access will only be granted, if applicable, to the email address used for my merged pull request.

### Limited availability acknowledgement

We know that you might be excited to contribute to OpenAI's mission, help improve our models, and gain access to GPT-4. However, due to the requirements mentioned above and high volume of submissions, we will not be able to accept all submissions and thus not grant everyone who opens a PR GPT-4 access. We know this is disappointing, but we hope to set the right expectation before you open this PR.

- [x] I understand that opening a PR, even if it meets the requirements above, does not guarantee the PR will be merged nor GPT-4 access granted.

### Submit eval

- [x] I have filled out all required fields in the evals PR form
- [ ] (Ignore if not submitting code) I have run `pip install pre-commit; pre-commit install` and have verified that `black`, `isort`, and `autoflake` are running when I commit and push

Failure to fill out all required fields will result in the PR being closed.
### Eval JSON data

Since we are using Git LFS, we are asking eval submitters to add in as many Eval Samples (at least 5) from their contribution here:

<details>
<summary>View evals in JSON</summary>

### Eval
```
{"input": [{"role": "system", "content": "Instructions: You will be presented with a passage and a question about that passage. You need to answer true or false to the question. Read the question thoroughly and answer true or false. Read the passage thoroughly to ensure you know what the passage entails and you need to use 2 rules to answer the question."}, {"role": "user", "content": "\nPassage: The lion is slow. The lion is lazy. The wolf is strong. The wolf is fierce. The lion chases the mouse. The wolf likes the dog. The mouse is smart. The dog is smart. The dog is cute. The dog is small. If something is not smart then it needs the mouse. If something needs the mouse then it is rough. If something is not kind then it is strong. If something is not big then it is furry. If something is cute then it is small. If something is small and not awful then it is lovely. If something is strong and not kind then it is heavy. If something is slow and lazy then it is awful. If something is awful and not small then it is fierce. All furry animals are beautiful. Question: The lion is heavy. \nAnswer: "}], "ideal": "true", "id_string": "NegationRule-Animal-D2-11451"}
{"input": [{"role": "system", "content": "Instructions: You will be presented with a passage and a question about that passage. You need to answer true or false to the question. Read the question thoroughly and answer true or false. Read the passage thoroughly to ensure you know what the passage entails and you need to use 2 rules to answer the question."}, {"role": "user", "content": "\nPassage: The lion is slow. The lion is lazy. The wolf is strong. The wolf is fierce. The lion chases the mouse. The wolf likes the dog. The mouse is smart. The dog is smart. The dog is cute. The dog is small. If something is not smart then it needs the mouse. If something needs the mouse then it is rough. If something is not kind then it is strong. If something is not big then it is furry. If something is cute then it is small. If something is small and not awful then it is lovely. If something is strong and not kind then it is heavy. If something is slow and lazy then it is awful. If something is awful and not small then it is fierce. All furry animals are beautiful. Question: The lion is not heavy. \nAnswer: "}], "ideal": "false", "id_string": "NegationRule-Animal-D2-11452"}
{"input": [{"role": "system", "content": "Instructions: You will be presented with a passage and a question about that passage. You need to answer true or false to the question. Read the question thoroughly and answer true or false. Read the passage thoroughly to ensure you know what the passage entails and you need to use 3 rules to answer the question."}, {"role": "user", "content": "\nPassage: The lion is slow. The lion is sleepy. The tiger is fierce. The tiger is big. The lion likes the dog. The tiger needs the mouse. The dog is smart. The mouse is smart. The mouse is small. The mouse is cute. If something is not smart then it sees the dog. If something sees the dog then it is lazy. If something is not kind then it is fierce. If something is not horrible then it is furry. If something is small then it is cute. If something is cute and not strong then it is beautiful. If something is fierce and not kind then it is awful. If something is slow and sleepy then it is strong. If something is strong and not cute then it is big. If something is furry then it is lovely. All lovely animals are round. All beautiful animals are quiet. All awful animals are heavy. All big animals are horrible. All lazy animals are rough. Question: The lion is rough. \nAnswer: "}], "ideal": "true", "id_string": "NegationRule-Animal-D3-10559"}
{"input": [{"role": "system", "content": "Instructions: You will be presented with a passage and a question about that passage. You need to answer true or false to the question. Read the question thoroughly and answer true or false. Read the passage thoroughly to ensure you know what the passage entails and you need to use 3 rules to answer the question."}, {"role": "user", "content": "\nPassage: The lion is slow. The lion is sleepy. The tiger is fierce. The tiger is big. The lion likes the dog. The tiger needs the mouse. The dog is smart. The mouse is smart. The mouse is small. The mouse is cute. If something is not smart then it sees the dog. If something sees the dog then it is lazy. If something is not kind then it is fierce. If something is not horrible then it is furry. If something is small then it is cute. If something is cute and not strong then it is beautiful. If something is fierce and not kind then it is awful. If something is slow and sleepy then it is strong. If something is strong and not cute then it is big. If something is furry then it is lovely. All lovely animals are round. All beautiful animals are quiet. All awful animals are heavy. All big animals are horrible. All lazy animals are rough. Question: The lion is not rough. \nAnswer: "}], "ideal": "false", "id_string": "NegationRule-Animal-D3-105510"}
{"input": [{"role": "system", "content": "Instructions: You will be presented with a passage and a question about that passage. You need to answer true or false to the question. Read the question thoroughly and answer true or false. Read the passage thoroughly to ensure you know what the passage entails and you need to use 4 rules to answer the question."}, {"role": "user", "content": "\nPassage: The snake is slow. The snake is lazy. The snake is rough. The snake chases the mouse. The crocodile sees the rabbit. The crocodile is fierce. The crocodile is big. The mouse is smart. The mouse is quiet. The mouse is nice. The rabbit is cute. The rabbit is small. The rabbit is adorable. Smart animals are cute. If something is lazy then it attacks the mouse. If something attacks the mouse then it is tired. If something is slow and lazy then it is rough. If something is cute and small then it is beautiful. If something is fierce and big then it is heavy. If something is rough then it is dull. If something is dull then it is sleepy. All sleepy animals are big. If something is cute then it is small. If something is small then it is adorable. If something is adorable then it is nice. All adorable animals are kind. If something is heavy then it is awful. All awful animals are obese. All obese animals are lazy. If something is beautiful then it is lovely. All lovely animals are furry. All furry animals are slow. If something is tired then it is strong. All strong animals are reckless. Question: The rabbit is not slow. \nAnswer: "}], "ideal": "false", "id_string": "NonNegationRule-Animal-D4-25898"}
{"input": [{"role": "system", "content": "Instructions: You will be presented with a passage and a question about that passage. You need to answer true or false to the question. Read the question thoroughly and answer true or false. Read the passage thoroughly to ensure you know what the passage entails and you need to use 4 rules to answer the question."}, {"role": "user", "content": "\nPassage: The snake is slow. The snake is lazy. The snake is rough. The snake chases the mouse. The crocodile sees the rabbit. The crocodile is fierce. The crocodile is big. The mouse is smart. The mouse is quiet. The mouse is nice. The rabbit is cute. The rabbit is small. The rabbit is adorable. Smart animals are cute. If something is lazy then it attacks the mouse. If something attacks the mouse then it is tired. If something is slow and lazy then it is rough. If something is cute and small then it is beautiful. If something is fierce and big then it is heavy. If something is rough then it is dull. If something is dull then it is sleepy. All sleepy animals are big. If something is cute then it is small. If something is small then it is adorable. If something is adorable then it is nice. All adorable animals are kind. If something is heavy then it is awful. All awful animals are obese. All obese animals are lazy. If something is beautiful then it is lovely. All lovely animals are furry. All furry animals are slow. If something is tired then it is strong. All strong animals are reckless. Question: The snake is reckless. \nAnswer: "}], "ideal": "true", "id_string": "NonNegationRule-Animal-D4-25899"}
{"input": [{"role": "system", "content": "Instructions: You will be presented with a passage and a question about that passage. You need to answer true or false to the question. Read the question thoroughly and answer true or false. Read the passage thoroughly to ensure you know what the passage entails and you need to use 5 rules to answer the question."}, {"role": "user", "content": "\nPassage: The snake is dull. The snake is slow. The bald eagle is awful. The bald eagle is powerful. The snake attacks the rabbit. The bald eagle likes the squirrel. The rabbit is quiet. The squirrel is quiet. The squirrel is beautiful. The squirrel is cute. If something is not quiet then it visits the rabbit. If something visits the rabbit then it is rough. If something is not kind then it is awful. If something is not fierce then it is furry. If something is beautiful then it is cute. If something is cute and not angry then it is small. If something is awful and not kind then it is horrible. If something is dull and slow then it is angry. If something is angry and not cute then it is powerful. If something is furry then it is lovely. If something is lovely then it is clever. If something is clever then it is kind. All kind animals are smart. All small animals are round. If something is round then it is nice. All nice animals are funny. If something is horrible then it is heavy. If something is heavy then it is tired. All tired animals are reckless. If something is powerful then it is fierce. If something is fierce then it is lazy. All lazy animals are boring. All rough animals are sleepy. If something is sleepy then it is strong. All strong animals are big. Question: The snake is big. \nAnswer: "}], "ideal": "true", "id_string": "NegationRule-Animal-D5-23709"}
{"input": [{"role": "system", "content": "Instructions: You will be presented with a passage and a question about that passage. You need to answer true or false to the question. Read the question thoroughly and answer true or false. Read the passage thoroughly to ensure you know what the passage entails and you need to use 5 rules to answer the question."}, {"role": "user", "content": "\nPassage: The snake is dull. The snake is slow. The bald eagle is awful. The bald eagle is powerful. The snake attacks the rabbit. The bald eagle likes the squirrel. The rabbit is quiet. The squirrel is quiet. The squirrel is beautiful. The squirrel is cute. If something is not quiet then it visits the rabbit. If something visits the rabbit then it is rough. If something is not kind then it is awful. If something is not fierce then it is furry. If something is beautiful then it is cute. If something is cute and not angry then it is small. If something is awful and not kind then it is horrible. If something is dull and slow then it is angry. If something is angry and not cute then it is powerful. If something is furry then it is lovely. If something is lovely then it is clever. If something is clever then it is kind. All kind animals are smart. All small animals are round. If something is round then it is nice. All nice animals are funny. If something is horrible then it is heavy. If something is heavy then it is tired. All tired animals are reckless. If something is powerful then it is fierce. If something is fierce then it is lazy. All lazy animals are boring. All rough animals are sleepy. If something is sleepy then it is strong. All strong animals are big. Question: The snake is not big. \nAnswer: "}], "ideal": "false", "id_string": "NegationRule-Animal-D5-237010"}
{"input": [{"role": "system", "content": "Instructions: You will be presented with a passage and a question about that passage. You need to answer true or false to the question. Read the question thoroughly and answer true or false. Read the passage thoroughly to ensure you know what the passage entails and you need to use 5 rules to answer the question."}, {"role": "user", "content": "\nPassage: Harry is huge. Harry is strong. Erin is small. Erin is tiny. Alan is nice. Anne is sad. Anne is poor. If someone is not big then they are sad. If someone is not bad then they are kind. If someone is nice then they are smart. If someone is smart and not rough then they are clever. If someone is sad and not big then they are dull. If someone is dull then they are little. If someone is little then they are thin. All thin people are bad. If someone is small and tiny then they are rough. If someone is rough and not smart then they are poor. If someone is poor then they are fashion. If someone is fashion then they are energetic. If someone is energetic then they are young. If someone is kind then they are wealthy. If someone is wealthy then they are quiet. If someone is quiet then they are smart. All smart people are wealthy. If someone is clever then they are famous. If someone is famous then they are old. All old people are experienced. Question: Harry is bad. \nAnswer: "}], "ideal": "true", "id_string": "NegationRule-D5-22331"}
{"input": [{"role": "system", "content": "Instructions: You will be presented with a passage and a question about that passage. You need to answer true or false to the question. Read the question thoroughly and answer true or false. Read the passage thoroughly to ensure you know what the passage entails and you need to use 5 rules to answer the question."}, {"role": "user", "content": "\nPassage: Harry is huge. Harry is strong. Erin is small. Erin is tiny. Alan is nice. Anne is sad. Anne is poor. If someone is not big then they are sad. If someone is not bad then they are kind. If someone is nice then they are smart. If someone is smart and not rough then they are clever. If someone is sad and not big then they are dull. If someone is dull then they are little. If someone is little then they are thin. All thin people are bad. If someone is small and tiny then they are rough. If someone is rough and not smart then they are poor. If someone is poor then they are fashion. If someone is fashion then they are energetic. If someone is energetic then they are young. If someone is kind then they are wealthy. If someone is wealthy then they are quiet. If someone is quiet then they are smart. All smart people are wealthy. If someone is clever then they are famous. If someone is famous then they are old. All old people are experienced. Question: Harry is not bad. \nAnswer: "}], "ideal": "false", "id_string": "NegationRule-D5-22332"}
{"input": [{"role": "system", "content": "Instructions: You will be presented with a passage and a question about that passage. You need to answer true or false to the question. Read the question thoroughly and answer true or false. Read the passage thoroughly to ensure you know what the passage entails and you need to use 5 rules to answer the question."}, {"role": "user", "content": "\nPassage: Harry is huge. Harry is strong. Erin is small. Erin is tiny. Alan is nice. Anne is sad. Anne is poor. If someone is not big then they are sad. If someone is not bad then they are kind. If someone is nice then they are smart. If someone is smart and not rough then they are clever. If someone is sad and not big then they are dull. If someone is dull then they are little. If someone is little then they are thin. All thin people are bad. If someone is small and tiny then they are rough. If someone is rough and not smart then they are poor. If someone is poor then they are fashion. If someone is fashion then they are energetic. If someone is energetic then they are young. If someone is kind then they are wealthy. If someone is wealthy then they are quiet. If someone is quiet then they are smart. All smart people are wealthy. If someone is clever then they are famous. If someone is famous then they are old. All old people are experienced. Question: Anne is wealthy. \nAnswer: "}], "ideal": "true", "id_string": "NegationRule-D5-22333"}
```
</details>

---------

Co-authored-by: qiming bao <qiming.bao@xtracta.com>
Co-authored-by: Jorge <133797909+jorge-openai@users.noreply.github.com>
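To make the notion of reasoning depth in the samples above concrete, here is a minimal forward-chaining sketch for the `NonNegationRule-Animal-D4-25899` sample. It is an illustration only, not the method used to build PARARULE-Plus: the facts and rules about the snake are hand-translated from the passage, the names `forward_chain`, `snake_facts`, and `snake_rules` are hypothetical, and rules containing "not" (as in the `NegationRule` samples) would need additional handling that this sketch does not attempt.

```python
from typing import FrozenSet, List, Set, Tuple

# A rule is (premises, conclusion); every attribute refers to the same entity.
Rule = Tuple[FrozenSet[str], str]


def forward_chain(facts: Set[str], rules: List[Rule]) -> Set[str]:
    """Apply positive rules repeatedly until no new attribute can be derived."""
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            if premises <= known and conclusion not in known:
                known.add(conclusion)
                changed = True
    return known


# Facts about the snake, hand-copied from the D4 passage.
snake_facts = {"slow", "lazy", "rough"}

# Rules from the same passage whose premises the snake can satisfy.
snake_rules: List[Rule] = [
    (frozenset({"lazy"}), "attacks the mouse"),
    (frozenset({"attacks the mouse"}), "tired"),
    (frozenset({"tired"}), "strong"),
    (frozenset({"strong"}), "reckless"),
    (frozenset({"slow", "lazy"}), "rough"),
    (frozenset({"rough"}), "dull"),
    (frozenset({"dull"}), "sleepy"),
    (frozenset({"sleepy"}), "big"),
]

derived = forward_chain(snake_facts, snake_rules)
print("reckless" in derived)  # the sample's "ideal" answer is "true"
```

Reaching "reckless" takes four rule applications in a row (lazy, attacks the mouse, tired, strong, reckless), which matches the "use 4 rules" instruction in that sample's system prompt.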
…guage with Multi-Step Deductive Reasoning Instruction For OpenAI EVAL (openai#651)
### Eval JSON data

Since we are using Git LFS, we are asking eval submitters to add in as many Eval Samples (at least 5) from their contribution here:

<details>
<summary>View evals in JSON</summary>

### Eval

```
{"input": [{"role": "system", "content": "Instructions: You will be presented with a passage and a question about that passage. You need to answer true or false to the question. Read the question thoroughly and answer true or false. Read the passage thoroughly to ensure you know what the passage entails and you need to use 2 rules to answer the question."}, {"role": "user", "content": "\nPassage: The lion is slow. The lion is lazy. The wolf is strong. The wolf is fierce. The lion chases the mouse. The wolf likes the dog. The mouse is smart. The dog is smart. The dog is cute. The dog is small. If something is not smart then it needs the mouse. If something needs the mouse then it is rough. If something is not kind then it is strong. If something is not big then it is furry. If something is cute then it is small. If something is small and not awful then it is lovely. If something is strong and not kind then it is heavy. If something is slow and lazy then it is awful. If something is awful and not small then it is fierce. All furry animals are beautiful. Question: The lion is heavy. \nAnswer: "}], "ideal": "true", "id_string": "NegationRule-Animal-D2-11451"}
{"input": [{"role": "system", "content": "Instructions: You will be presented with a passage and a question about that passage. You need to answer true or false to the question. Read the question thoroughly and answer true or false. Read the passage thoroughly to ensure you know what the passage entails and you need to use 2 rules to answer the question."}, {"role": "user", "content": "\nPassage: The lion is slow. The lion is lazy. The wolf is strong. The wolf is fierce. The lion chases the mouse. The wolf likes the dog. The mouse is smart. The dog is smart. The dog is cute. The dog is small. If something is not smart then it needs the mouse. If something needs the mouse then it is rough. If something is not kind then it is strong. If something is not big then it is furry. If something is cute then it is small. If something is small and not awful then it is lovely. If something is strong and not kind then it is heavy. If something is slow and lazy then it is awful. If something is awful and not small then it is fierce. All furry animals are beautiful. Question: The lion is not heavy. \nAnswer: "}], "ideal": "false", "id_string": "NegationRule-Animal-D2-11452"}
{"input": [{"role": "system", "content": "Instructions: You will be presented with a passage and a question about that passage. You need to answer true or false to the question. Read the question thoroughly and answer true or false. Read the passage thoroughly to ensure you know what the passage entails and you need to use 3 rules to answer the question."}, {"role": "user", "content": "\nPassage: The lion is slow. The lion is sleepy. The tiger is fierce. The tiger is big. The lion likes the dog. The tiger needs the mouse. The dog is smart. The mouse is smart. The mouse is small. The mouse is cute. If something is not smart then it sees the dog. If something sees the dog then it is lazy. If something is not kind then it is fierce. If something is not horrible then it is furry. If something is small then it is cute. If something is cute and not strong then it is beautiful. If something is fierce and not kind then it is awful. If something is slow and sleepy then it is strong. If something is strong and not cute then it is big. If something is furry then it is lovely. All lovely animals are round. All beautiful animals are quiet. All awful animals are heavy. All big animals are horrible. All lazy animals are rough. Question: The lion is rough. \nAnswer: "}], "ideal": "true", "id_string": "NegationRule-Animal-D3-10559"}
{"input": [{"role": "system", "content": "Instructions: You will be presented with a passage and a question about that passage. You need to answer true or false to the question. Read the question thoroughly and answer true or false. Read the passage thoroughly to ensure you know what the passage entails and you need to use 3 rules to answer the question."}, {"role": "user", "content": "\nPassage: The lion is slow. The lion is sleepy. The tiger is fierce. The tiger is big. The lion likes the dog. The tiger needs the mouse. The dog is smart. The mouse is smart. The mouse is small. The mouse is cute. If something is not smart then it sees the dog. If something sees the dog then it is lazy. If something is not kind then it is fierce. If something is not horrible then it is furry. If something is small then it is cute. If something is cute and not strong then it is beautiful. If something is fierce and not kind then it is awful. If something is slow and sleepy then it is strong. If something is strong and not cute then it is big. If something is furry then it is lovely. All lovely animals are round. All beautiful animals are quiet. All awful animals are heavy. All big animals are horrible. All lazy animals are rough. Question: The lion is not rough. \nAnswer: "}], "ideal": "false", "id_string": "NegationRule-Animal-D3-105510"}
{"input": [{"role": "system", "content": "Instructions: You will be presented with a passage and a question about that passage. You need to answer true or false to the question. Read the question thoroughly and answer true or false. Read the passage thoroughly to ensure you know what the passage entails and you need to use 4 rules to answer the question."}, {"role": "user", "content": "\nPassage: The snake is slow. The snake is lazy. The snake is rough. The snake chases the mouse. The crocodile sees the rabbit. The crocodile is fierce. The crocodile is big. The mouse is smart. The mouse is quiet. The mouse is nice. The rabbit is cute. The rabbit is small. The rabbit is adorable. Smart animals are cute. If something is lazy then it attacks the mouse. If something attacks the mouse then it is tired. If something is slow and lazy then it is rough. If something is cute and small then it is beautiful. If something is fierce and big then it is heavy. If something is rough then it is dull. If something is dull then it is sleepy. All sleepy animals are big. If something is cute then it is small. If something is small then it is adorable. If something is adorable then it is nice. All adorable animals are kind. If something is heavy then it is awful. All awful animals are obese. All obese animals are lazy. If something is beautiful then it is lovely. All lovely animals are furry. All furry animals are slow. If something is tired then it is strong. All strong animals are reckless. Question: The rabbit is not slow. \nAnswer: "}], "ideal": "false", "id_string": "NonNegationRule-Animal-D4-25898"}
{"input": [{"role": "system", "content": "Instructions: You will be presented with a passage and a question about that passage. You need to answer true or false to the question. Read the question thoroughly and answer true or false. Read the passage thoroughly to ensure you know what the passage entails and you need to use 4 rules to answer the question."}, {"role": "user", "content": "\nPassage: The snake is slow. The snake is lazy. The snake is rough. The snake chases the mouse. The crocodile sees the rabbit. The crocodile is fierce. The crocodile is big. The mouse is smart. The mouse is quiet. The mouse is nice. The rabbit is cute. The rabbit is small. The rabbit is adorable. Smart animals are cute. If something is lazy then it attacks the mouse. If something attacks the mouse then it is tired. If something is slow and lazy then it is rough. If something is cute and small then it is beautiful. If something is fierce and big then it is heavy. If something is rough then it is dull. If something is dull then it is sleepy. All sleepy animals are big. If something is cute then it is small. If something is small then it is adorable. If something is adorable then it is nice. All adorable animals are kind. If something is heavy then it is awful. All awful animals are obese. All obese animals are lazy. If something is beautiful then it is lovely. All lovely animals are furry. All furry animals are slow. If something is tired then it is strong. All strong animals are reckless. Question: The snake is reckless. \nAnswer: "}], "ideal": "true", "id_string": "NonNegationRule-Animal-D4-25899"}
{"input": [{"role": "system", "content": "Instructions: You will be presented with a passage and a question about that passage. You need to answer true or false to the question. Read the question thoroughly and answer true or false. Read the passage thoroughly to ensure you know what the passage entails and you need to use 5 rules to answer the question."}, {"role": "user", "content": "\nPassage: The snake is dull. The snake is slow. The bald eagle is awful. The bald eagle is powerful. The snake attacks the rabbit. The bald eagle likes the squirrel. The rabbit is quiet. The squirrel is quiet. The squirrel is beautiful. The squirrel is cute. If something is not quiet then it visits the rabbit. If something visits the rabbit then it is rough. If something is not kind then it is awful. If something is not fierce then it is furry. If something is beautiful then it is cute. If something is cute and not angry then it is small. If something is awful and not kind then it is horrible. If something is dull and slow then it is angry. If something is angry and not cute then it is powerful. If something is furry then it is lovely. If something is lovely then it is clever. If something is clever then it is kind. All kind animals are smart. All small animals are round. If something is round then it is nice. All nice animals are funny. If something is horrible then it is heavy. If something is heavy then it is tired. All tired animals are reckless. If something is powerful then it is fierce. If something is fierce then it is lazy. All lazy animals are boring. All rough animals are sleepy. If something is sleepy then it is strong. All strong animals are big. Question: The snake is big. \nAnswer: "}], "ideal": "true", "id_string": "NegationRule-Animal-D5-23709"}
{"input": [{"role": "system", "content": "Instructions: You will be presented with a passage and a question about that passage. You need to answer true or false to the question. Read the question thoroughly and answer true or false. Read the passage thoroughly to ensure you know what the passage entails and you need to use 5 rules to answer the question."}, {"role": "user", "content": "\nPassage: The snake is dull. The snake is slow. The bald eagle is awful. The bald eagle is powerful. The snake attacks the rabbit. The bald eagle likes the squirrel. The rabbit is quiet. The squirrel is quiet. The squirrel is beautiful. The squirrel is cute. If something is not quiet then it visits the rabbit. If something visits the rabbit then it is rough. If something is not kind then it is awful. If something is not fierce then it is furry. If something is beautiful then it is cute. If something is cute and not angry then it is small. If something is awful and not kind then it is horrible. If something is dull and slow then it is angry. If something is angry and not cute then it is powerful. If something is furry then it is lovely. If something is lovely then it is clever. If something is clever then it is kind. All kind animals are smart. All small animals are round. If something is round then it is nice. All nice animals are funny. If something is horrible then it is heavy. If something is heavy then it is tired. All tired animals are reckless. If something is powerful then it is fierce. If something is fierce then it is lazy. All lazy animals are boring. All rough animals are sleepy. If something is sleepy then it is strong. All strong animals are big. Question: The snake is not big. \nAnswer: "}], "ideal": "false", "id_string": "NegationRule-Animal-D5-237010"}
{"input": [{"role": "system", "content": "Instructions: You will be presented with a passage and a question about that passage. You need to answer true or false to the question. Read the question thoroughly and answer true or false. Read the passage thoroughly to ensure you know what the passage entails and you need to use 5 rules to answer the question."}, {"role": "user", "content": "\nPassage: Harry is huge. Harry is strong. Erin is small. Erin is tiny. Alan is nice. Anne is sad. Anne is poor. If someone is not big then they are sad. If someone is not bad then they are kind. If someone is nice then they are smart. If someone is smart and not rough then they are clever. If someone is sad and not big then they are dull. If someone is dull then they are little. If someone is little then they are thin. All thin people are bad. If someone is small and tiny then they are rough. If someone is rough and not smart then they are poor. If someone is poor then they are fashion. If someone is fashion then they are energetic. If someone is energetic then they are young. If someone is kind then they are wealthy. If someone is wealthy then they are quiet. If someone is quiet then they are smart. All smart people are wealthy. If someone is clever then they are famous. If someone is famous then they are old. All old people are experienced. Question: Harry is bad. \nAnswer: "}], "ideal": "true", "id_string": "NegationRule-D5-22331"}
{"input": [{"role": "system", "content": "Instructions: You will be presented with a passage and a question about that passage. You need to answer true or false to the question. Read the question thoroughly and answer true or false. Read the passage thoroughly to ensure you know what the passage entails and you need to use 5 rules to answer the question."}, {"role": "user", "content": "\nPassage: Harry is huge. Harry is strong. Erin is small. Erin is tiny. Alan is nice. Anne is sad. Anne is poor. If someone is not big then they are sad. If someone is not bad then they are kind. If someone is nice then they are smart. If someone is smart and not rough then they are clever. If someone is sad and not big then they are dull. If someone is dull then they are little. If someone is little then they are thin. All thin people are bad. If someone is small and tiny then they are rough. If someone is rough and not smart then they are poor. If someone is poor then they are fashion. If someone is fashion then they are energetic. If someone is energetic then they are young. If someone is kind then they are wealthy. If someone is wealthy then they are quiet. If someone is quiet then they are smart. All smart people are wealthy. If someone is clever then they are famous. If someone is famous then they are old. All old people are experienced. Question: Harry is not bad. \nAnswer: "}], "ideal": "false", "id_string": "NegationRule-D5-22332"}
{"input": [{"role": "system", "content": "Instructions: You will be presented with a passage and a question about that passage. You need to answer true or false to the question. Read the question thoroughly and answer true or false. Read the passage thoroughly to ensure you know what the passage entails and you need to use 5 rules to answer the question."}, {"role": "user", "content": "\nPassage: Harry is huge. Harry is strong. Erin is small. Erin is tiny. Alan is nice. Anne is sad. Anne is poor. If someone is not big then they are sad. If someone is not bad then they are kind. If someone is nice then they are smart. If someone is smart and not rough then they are clever. If someone is sad and not big then they are dull. If someone is dull then they are little. If someone is little then they are thin. All thin people are bad. If someone is small and tiny then they are rough. If someone is rough and not smart then they are poor. If someone is poor then they are fashion. If someone is fashion then they are energetic. If someone is energetic then they are young. If someone is kind then they are wealthy. If someone is wealthy then they are quiet. If someone is quiet then they are smart. All smart people are wealthy. If someone is clever then they are famous. If someone is famous then they are old. All old people are experienced. Question: Anne is wealthy. \nAnswer: "}], "ideal": "true", "id_string": "NegationRule-D5-22333"}
```

</details>

---------

Co-authored-by: qiming bao <qiming.bao@xtracta.com>
Co-authored-by: Jorge <133797909+jorge-openai@users.noreply.github.com>
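For readers who want to see why the first sample (`NegationRule-Animal-D2-11451`) has `"ideal": "true"`, here is a minimal forward-chaining sketch. It is not part of this PR or of the PARARULE-Plus tooling. The rule encoding, the helper names `rule` and `forward_chain`, and the treatment of negation are all assumptions made for illustration: a premise such as "not kind" is taken to hold when `kind` is absent from the stated facts (a closed-world reading), and relational facts like "The lion chases the mouse" are left out because they are not needed for this question.

```python
from typing import FrozenSet, Iterable, List, Set, Tuple

# A rule is (positive premises, negated premises, conclusion).
Rule = Tuple[FrozenSet[str], FrozenSet[str], str]


def rule(pos: Iterable[str] = (), neg: Iterable[str] = (), head: str = "") -> Rule:
    """Hypothetical helper that builds one rule tuple."""
    return (frozenset(pos), frozenset(neg), head)


def forward_chain(facts: Set[str], rules: List[Rule]) -> Set[str]:
    """Apply the rules until no new attribute can be derived.

    Assumption for this sketch: negated premises are checked against the
    stated facts only, so the result does not depend on rule order.
    """
    stated = frozenset(facts)
    derived = set(facts)
    changed = True
    while changed:
        changed = False
        for pos, neg, head in rules:
            if head not in derived and pos <= derived and not (neg & stated):
                derived.add(head)
                changed = True
    return derived


# Hand-encoded rules from the depth-2 passage in the sample.
RULES = [
    rule(neg=["smart"], head="needs_mouse"),
    rule(pos=["needs_mouse"], head="rough"),
    rule(neg=["kind"], head="strong"),
    rule(neg=["big"], head="furry"),
    rule(pos=["cute"], head="small"),
    rule(pos=["small"], neg=["awful"], head="lovely"),
    rule(pos=["strong"], neg=["kind"], head="heavy"),
    rule(pos=["slow", "lazy"], head="awful"),
    rule(pos=["awful"], neg=["small"], head="fierce"),
    rule(pos=["furry"], head="beautiful"),
]

# Stated attributes of the lion: "The lion is slow. The lion is lazy."
lion = forward_chain({"slow", "lazy"}, RULES)
print("heavy" in lion)  # prints True, matching "ideal": "true"
```

The two rules that matter are "not kind, so strong" followed by "strong and not kind, so heavy", which is the two-step chain the system prompt refers to with "you need to use 2 rules".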
…guage with Multi-Step Deductive Reasoning Instruction For OpenAI EVAL (openai#651)

# Thank you for contributing an eval!♥️

🚨 Please make sure your PR follows these guidelines, __failure to follow the guidelines below will result in the PR being closed automatically__. Note that even if the criteria are met, that does not guarantee the PR will be merged nor GPT-4 access granted. 🚨

__PLEASE READ THIS__:

In order for a PR to be merged, it must fail on GPT-4. We are aware that right now, users do not have access, so you will not be able to tell if the eval fails or not. Please run your eval with GPT-3.5-Turbo, but keep in mind as we run the eval, if GPT-4 gets higher than 90% on the eval, we will likely reject since GPT-4 is already capable of completing the task.

We plan to roll out a way for users submitting evals to see the eval performance on GPT-4 soon. Stay tuned! Until then, you will not be able to see the eval performance on GPT-4. **Starting April 10, the minimum eval count is 15 samples, we hope this makes it easier to create and contribute evals.**

## Eval details 📑

### Eval name

[pararule-plus-multi-step-deductive-reasoning]

### Eval description

[We propose a multi-step deductive reasoning instruction for the [PARARULE-Plus dataset](https://github.com/Strong-AI-Lab/PARARULE-Plus), a larger and deeper multi-step deductive reasoning dataset over natural language. We also submitted PARARULE-Plus to `Huggingface/Datasets`; here is the [link](https://huggingface.co/datasets/qbao775/PARARULE-Plus). PARARULE-Plus addresses the reasoning-depth imbalance in the RuleTaker dataset by adding more examples at the deeper reasoning depths (depth = 2, 3, 4, and 5). In this pull request, we submit a dataset that includes `2708`, `2694`, `2704`, and `2692` questions for Depth=2, Depth=3, Depth=4, and Depth=5, respectively. We also evaluated ChatGPT, and it fails on this dataset. Here is the [tweet link](https://twitter.com/qiming_bao/status/1615510552088018944).]

### What makes this a useful eval?

[Logical reasoning ability is a fascinating topic in the NLP community. We hope to see whether ChatGPT and GPT-4 can shed more light on this topic.]

## Criteria for a good eval ✅

Below are some of the criteria we look for in a good eval. In general, we are seeking cases where the model does not do a good job despite being capable of generating a good response (note that there are some things large language models cannot do, so those would not make good evals).

Your eval should be:

- [x] Thematically consistent: The eval should be thematically consistent. We'd like to see a number of prompts all demonstrating some particular failure mode. For example, we can create an eval on cases where the model fails to reason about the physical world.
- [x] Contains failures where a human can do the task, but either GPT-4 or GPT-3.5-Turbo could not.
- [x] Includes good signal around what is the right behavior. This means either a correct answer for `Basic` evals or the `Fact` Model-graded eval, or an exhaustive rubric for evaluating answers for the `Criteria` Model-graded eval.
- [x] **Include at least 15 high quality examples.**

If there is anything else that makes your eval worth including, please document it below.

### Unique eval value

> Insert what makes your eval high quality that was not mentioned above. (Not required)

## Eval structure 🏗️

Your eval should

- [x] Check that your data is in `evals/registry/data/{name}`
- [x] Check that your yaml is registered at `evals/registry/evals/{name}.yaml`
- [x] Ensure you have the right to use the data you submit via this eval

(For now, we will only be approving evals that use one of the existing eval classes. You may still write custom eval classes for your own cases, and we may consider merging them in the future.)
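A quick local check of the samples file against the criteria above can catch shape problems before CI does. The sketch below is only an illustration and is not part of the Evals repository: the function names `check_sample` and `check_jsonl` are hypothetical, and the expected record shape (an `input` list with a `system` and a `user` message, plus an `ideal` of `"true"` or `"false"`) is inferred from the samples in this PR rather than from any official schema.

```python
import json
from typing import Any, Dict, List


def check_sample(sample: Dict[str, Any]) -> List[str]:
    """Return a list of problems found in one sample (empty if none)."""
    problems: List[str] = []
    messages = sample.get("input")
    if not isinstance(messages, list) or not messages:
        problems.append("'input' must be a non-empty list of messages")
    else:
        roles = [m.get("role") for m in messages if isinstance(m, dict)]
        # Assumed shape: one system message followed by one user message.
        if roles != ["system", "user"]:
            problems.append(f"unexpected roles: {roles}")
    if sample.get("ideal") not in ("true", "false"):
        problems.append("'ideal' must be 'true' or 'false'")
    return problems


def check_jsonl(text: str, min_samples: int = 15) -> List[str]:
    """Check JSON Lines text: one JSON object per non-empty line."""
    problems: List[str] = []
    count = 0
    for lineno, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue
        count += 1
        try:
            sample = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append(f"line {lineno}: invalid JSON ({exc.msg})")
            continue
        if not isinstance(sample, dict):
            problems.append(f"line {lineno}: expected a JSON object")
            continue
        problems.extend(f"line {lineno}: {p}" for p in check_sample(sample))
    if count < min_samples:
        problems.append(f"only {count} samples, expected at least {min_samples}")
    return problems
```

To use it, read the samples file under `evals/registry/data/{name}` as text and pass the contents to `check_jsonl`; an empty list means every line parsed and matched the assumed shape, and the file meets the 15-sample minimum.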
Thank you for contributing an eval!♥️
🚨 Please make sure your PR follows these guidelines; failure to follow the guidelines below will result in the PR being closed automatically. Note that even if the criteria are met, that does not guarantee that the PR will be merged or that GPT-4 access will be granted. 🚨
PLEASE READ THIS:
In order for a PR to be merged, it must fail on GPT-4. We are aware that right now, users do not have access, so you will not be able to tell if the eval fails or not. Please run your eval with GPT-3.5-Turbo, but keep in mind that when we run the eval, if GPT-4 scores higher than 90%, we will likely reject it, since GPT-4 is already capable of completing the task.
We plan to roll out a way for users submitting evals to see the eval performance on GPT-4 soon. Stay tuned! Until then, you will not be able to see the eval performance on GPT-4.
Starting April 10, the minimum eval count is 15 samples; we hope this makes it easier to create and contribute evals.
Eval details 📑
Eval name
[pararule-plus-multi-step-deductive-reasoning]
Eval description
[We propose a multi-step deductive reasoning instruction for the [PARARULE-Plus dataset](https://github.com/Strong-AI-Lab/PARARULE-Plus), which is a larger, deep multi-step deductive reasoning dataset over natural language. We also submitted PARARULE-Plus to `Huggingface/Datasets`. Here is the [link](https://huggingface.co/datasets/qbao775/PARARULE-Plus). The PARARULE-Plus dataset addresses the reasoning-depth imbalance in the RuleTaker dataset: it specifically adds more examples at the deeper reasoning depths, namely depth=2, 3, 4, and 5. In this pull request, we submit a dataset that includes `2708`, `2694`, `2704`, and `2692` questions for Depth=2, Depth=3, Depth=4, and Depth=5, respectively. Furthermore, we evaluated ChatGPT, and it fails on this dataset. Here is the [tweet link](https://twitter.com/qiming_bao/status/1615510552088018944).]
What makes this a useful eval?
[Logical reasoning ability is a fascinating topic in the NLP community. We hope to see whether ChatGPT and GPT-4 can shed more light on this topic.]
Criteria for a good eval ✅
Below are some of the criteria we look for in a good eval. In general, we are seeking cases where the model does not do a good job despite being capable of generating a good response (note that there are some things large language models cannot do, so those would not make good evals).
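Your eval should be:
- [x] Thematically consistent: The eval should be thematically consistent. We'd like to see a number of prompts all demonstrating some particular failure mode. For example, we can create an eval on cases where the model fails to reason about the physical world.
- [x] Contains failures where a human can do the task, but either GPT-4 or GPT-3.5-Turbo could not.
- [x] Includes good signal around what is the right behavior. This means either a correct answer for `Basic` evals or the `Fact` Model-graded eval, or an exhaustive rubric for evaluating answers for the `Criteria` Model-graded eval.
- [x] Include at least 15 high quality examples.
If there is anything else that makes your eval worth including, please document it below.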
Unique eval value
Eval structure 🏗️
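Your eval should
- [x] Check that your data is in `evals/registry/data/{name}`
- [x] Check that your yaml is registered at `evals/registry/evals/{name}.yaml`
- [x] Ensure you have the right to use the data you submit via this eval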
(For now, we will only be approving evals that use one of the existing eval classes. You may still write custom eval classes for your own cases, and we may consider merging them in the future.)
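Before placing a samples file under `evals/registry/data/{name}`, it can help to sanity-check each JSONL line locally. The sketch below is only an illustration based on the fields visible in this PR's samples (`input`, `ideal`, `id_string`); the helper names `validate_sample`, `depth_of`, `count_by_depth`, and `example_lines` are hypothetical and are not part of the evals library, and the example passages are shortened placeholders rather than real dataset rows.

```python
import json
from collections import Counter


def validate_sample(line: str) -> dict:
    """Parse one JSONL line and check the fields seen in this PR's samples."""
    sample = json.loads(line)
    if not isinstance(sample, dict):
        raise ValueError("each line must be a JSON object")

    messages = sample.get("input")
    if not isinstance(messages, list) or not messages:
        raise ValueError("'input' must be a non-empty list of chat messages")
    for message in messages:
        if not isinstance(message, dict) or set(message) != {"role", "content"}:
            raise ValueError("each message needs exactly 'role' and 'content'")

    # The samples in this PR use the lowercase strings "true" and "false".
    if sample.get("ideal") not in ("true", "false"):
        raise ValueError("'ideal' must be the string 'true' or 'false'")
    return sample


def depth_of(sample: dict) -> str:
    """Extract the depth token (e.g. 'D5') from an id_string such as
    'NegationRule-Animal-D5-23709'. Assumes the naming seen in the samples."""
    for token in str(sample.get("id_string", "")).split("-"):
        if token[:1] == "D" and token[1:].isdigit():
            return token
    return "unknown"


def count_by_depth(lines) -> Counter:
    """Validate every non-empty line and count samples per reasoning depth."""
    return Counter(depth_of(validate_sample(line)) for line in lines if line.strip())


def example_lines() -> list:
    """Two shortened, made-up lines in the same shape as the PR's samples."""
    def make(depth: int, question: str, ideal: str, id_string: str) -> str:
        return json.dumps({
            "input": [
                {"role": "system",
                 "content": f"Instructions: ... you need to use {depth} rules to answer the question."},
                {"role": "user",
                 "content": f"\nPassage: (shortened) Question: {question} \nAnswer: "},
            ],
            "ideal": ideal,
            "id_string": id_string,
        })

    return [
        make(2, "The lion is heavy.", "true", "NegationRule-Animal-D2-11451"),
        make(5, "Harry is not bad.", "false", "NegationRule-D5-22332"),
    ]


if __name__ == "__main__":
    print(count_by_depth(example_lines()))
```

Running `count_by_depth` over the real file, one line at a time, would give per-depth totals that can be compared against the counts quoted in the eval description.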
Final checklist 👀
Submission agreement
By contributing to Evals, you are agreeing to make your evaluation logic and data available under the same MIT license as this repository. You must have adequate rights to upload any data used in an Eval. OpenAI reserves the right to use this data in future service improvements to our product. Contributions to OpenAI Evals will be subject to our usual Usage Policies (https://platform.openai.com/docs/usage-policies).
- [x] I agree that my submission will be made available under an MIT license and complies with OpenAI's usage policies.
Email address validation
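If your submission is accepted, we will be granting GPT-4 access to a limited number of contributors. Access will be given to the email address associated with the merged pull request.
- [x] I acknowledge that GPT-4 access will only be granted, if applicable, to the email address used for my merged pull request.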
Limited availability acknowledgement
We know that you might be excited to contribute to OpenAI's mission, help improve our models, and gain access to GPT-4. However, due to the requirements mentioned above and the high volume of submissions, we will not be able to accept all submissions and thus cannot grant GPT-4 access to everyone who opens a PR. We know this is disappointing, but we hope to set the right expectation before you open this PR.
- [x] I understand that opening a PR, even if it meets the requirements above, does not guarantee the PR will be merged nor GPT-4 access granted.
Submit eval
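- [x] I have filled out all required fields in the evals PR form
- [ ] (Ignore if not submitting code) I have run `pip install pre-commit; pre-commit install` and have verified that `black`, `isort`, and `autoflake` are running when I commit and push
Failure to fill out all required fields will result in the PR being closed.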
Eval JSON data
Since we are using Git LFS, we are asking eval submitters to add in as many Eval Samples (at least 5) from their contribution here:
View evals in JSON
Eval