A Larger Deep Multi-Step Deductive Reasoning Dataset over Natural Language with Multi-Step Deductive Reasoning Instruction For OpenAI EVAL #651

14H034160212 · 2023-04-12T12:34:14Z

Thank you for contributing an eval! ♥️

🚨 Please make sure your PR follows these guidelines, failure to follow the guidelines below will result in the PR being closed automatically. Note that even if the criteria are met, that does not guarantee the PR will be merged nor GPT-4 access granted. 🚨

PLEASE READ THIS:

In order for a PR to be merged, it must fail on GPT-4. We are aware that right now, users do not have access, so you will not be able to tell if the eval fails or not. Please run your eval with GPT-3.5-Turbo, but keep in mind as we run the eval, if GPT-4 gets higher than 90% on the eval, we will likely reject since GPT-4 is already capable of completing the task.

We plan to roll out a way for users submitting evals to see the eval performance on GPT-4 soon. Stay tuned! Until then, you will not be able to see the eval performance on GPT-4. Starting April 10, the minimum eval count is 15 samples, we hope this makes it easier to create and contribute evals.

Eval details 📑

Eval name

[pararule-plus-multi-step-deductive-reasoning]

Eval description

[We propose a multi-step deductive reasoning instruction for the PARARULE-Plus dataset (https://github.com/Strong-AI-Lab/PARARULE-Plus), a larger and deeper multi-step deductive reasoning dataset over natural language. We have also uploaded PARARULE-Plus to Hugging Face Datasets (https://huggingface.co/datasets/qbao775/PARARULE-Plus). PARARULE-Plus addresses the reasoning depth imbalance in the RuleTaker dataset by adding more examples at the deeper reasoning depths: depth=2, 3, 4, and 5. In this pull request, we submit a dataset that includes 2708, 2694, 2704, and 2692 questions for Depth=2, Depth=3, Depth=4, and Depth=5, respectively. Furthermore, we evaluated ChatGPT, and it fails on this dataset. Here is the tweet link: https://twitter.com/qiming_bao/status/1615510552088018944.]

What makes this a useful eval?

[Logical reasoning ability is a fascinating topic in the NLP community. We hope to see whether ChatGPT and GPT-4 shed more light on this topic.]

Criteria for a good eval ✅

Below are some of the criteria we look for in a good eval. In general, we are seeking cases where the model does not do a good job despite being capable of generating a good response (note that there are some things large language models cannot do, so those would not make good evals).

Your eval should be:

Thematically consistent: The eval should be thematically consistent. We'd like to see a number of prompts all demonstrating some particular failure mode. For example, we can create an eval on cases where the model fails to reason about the physical world.
Contains failures where a human can do the task, but either GPT-4 or GPT-3.5-Turbo could not.
Includes good signal around what is the right behavior. This means either a correct answer for Basic evals or the Fact Model-graded eval, or an exhaustive rubric for evaluating answers for the Criteria Model-graded eval.
Include at least 15 high quality examples.

If there is anything else that makes your eval worth including, please document it below.

Unique eval value

Insert what makes your eval high quality that was not mentioned above. (Not required)

Eval structure 🏗️

Your eval should

Check that your data is in evals/registry/data/{name}
Check that your yaml is registered at evals/registry/evals/{name}.yaml
Ensure you have the right to use the data you submit via this eval

(For now, we will only be approving evals that use one of the existing eval classes. You may still write custom eval classes for your own cases, and we may consider merging them in the future.)

Final checklist 👀

Submission agreement

By contributing to Evals, you are agreeing to make your evaluation logic and data under the same MIT license as this repository. You must have adequate rights to upload any data used in an Eval. OpenAI reserves the right to use this data in future service improvements to our product. Contributions to OpenAI Evals will be subject to our usual Usage Policies (https://platform.openai.com/docs/usage-policies).

I agree that my submission will be made available under an MIT license and complies with OpenAI's usage policies.

Email address validation

If your submission is accepted, we will be granting GPT-4 access to a limited number of contributors. Access will be given to the email address associated with the merged pull request.

I acknowledge that GPT-4 access will only be granted, if applicable, to the email address used for my merged pull request.

Limited availability acknowledgement

We know that you might be excited to contribute to OpenAI's mission, help improve our models, and gain access to GPT-4. However, due to the requirements mentioned above and high volume of submissions, we will not be able to accept all submissions and thus not grant everyone who opens a PR GPT-4 access. We know this is disappointing, but we hope to set the right expectation before you open this PR.

I understand that opening a PR, even if it meets the requirements above, does not guarantee the PR will be merged nor GPT-4 access granted.

Submit eval

I have filled out all required fields in the evals PR form
(Ignore if not submitting code) I have run pip install pre-commit; pre-commit install and have verified that black, isort, and autoflake are running when I commit and push

Failure to fill out all required fields will result in the PR being closed.

Eval JSON data

Since we are using Git LFS, we are asking eval submitters to add in as many Eval Samples (at least 5) from their contribution here:

View evals in JSON

Eval

{"input": [{"role": "system", "content": "Instructions: You will be presented with a passage and a question about that passage. You need to answer true or false to the question. Read the question thoroughly and answer true or false. Read the passage thoroughly to ensure you know what the passage entails and you need to use 2 rules to answer the question."}, {"role": "user", "content": "\nPassage: The lion is slow. The lion is lazy. The wolf is strong. The wolf is fierce. The lion chases the mouse. The wolf likes the dog. The mouse is smart. The dog is smart. The dog is cute. The dog is small. If something is not smart then it needs the mouse. If something needs the mouse then it is rough. If something is not kind then it is strong. If something is not big then it is furry. If something is cute then it is small. If something is small and not awful then it is lovely. If something is strong and not kind then it is heavy. If something is slow and lazy then it is awful. If something is awful and not small then it is fierce. All furry animals are beautiful. Question: The lion is heavy. \nAnswer: "}], "ideal": "true", "id_string": "NegationRule-Animal-D2-11451"}
{"input": [{"role": "system", "content": "Instructions: You will be presented with a passage and a question about that passage. You need to answer true or false to the question. Read the question thoroughly and answer true or false. Read the passage thoroughly to ensure you know what the passage entails and you need to use 2 rules to answer the question."}, {"role": "user", "content": "\nPassage: The lion is slow. The lion is lazy. The wolf is strong. The wolf is fierce. The lion chases the mouse. The wolf likes the dog. The mouse is smart. The dog is smart. The dog is cute. The dog is small. If something is not smart then it needs the mouse. If something needs the mouse then it is rough. If something is not kind then it is strong. If something is not big then it is furry. If something is cute then it is small. If something is small and not awful then it is lovely. If something is strong and not kind then it is heavy. If something is slow and lazy then it is awful. If something is awful and not small then it is fierce. All furry animals are beautiful. Question: The lion is not heavy. \nAnswer: "}], "ideal": "false", "id_string": "NegationRule-Animal-D2-11452"}
{"input": [{"role": "system", "content": "Instructions: You will be presented with a passage and a question about that passage. You need to answer true or false to the question. Read the question thoroughly and answer true or false. Read the passage thoroughly to ensure you know what the passage entails and you need to use 3 rules to answer the question."}, {"role": "user", "content": "\nPassage: The lion is slow. The lion is sleepy. The tiger is fierce. The tiger is big. The lion likes the dog. The tiger needs the mouse. The dog is smart. The mouse is smart. The mouse is small. The mouse is cute. If something is not smart then it sees the dog. If something sees the dog then it is lazy. If something is not kind then it is fierce. If something is not horrible then it is furry. If something is small then it is cute. If something is cute and not strong then it is beautiful. If something is fierce and not kind then it is awful. If something is slow and sleepy then it is strong. If something is strong and not cute then it is big. If something is furry then it is lovely. All lovely animals are round. All beautiful animals are quiet. All awful animals are heavy. All big animals are horrible. All lazy animals are rough. Question: The lion is rough. \nAnswer: "}], "ideal": "true", "id_string": "NegationRule-Animal-D3-10559"}
{"input": [{"role": "system", "content": "Instructions: You will be presented with a passage and a question about that passage. You need to answer true or false to the question. Read the question thoroughly and answer true or false. Read the passage thoroughly to ensure you know what the passage entails and you need to use 3 rules to answer the question."}, {"role": "user", "content": "\nPassage: The lion is slow. The lion is sleepy. The tiger is fierce. The tiger is big. The lion likes the dog. The tiger needs the mouse. The dog is smart. The mouse is smart. The mouse is small. The mouse is cute. If something is not smart then it sees the dog. If something sees the dog then it is lazy. If something is not kind then it is fierce. If something is not horrible then it is furry. If something is small then it is cute. If something is cute and not strong then it is beautiful. If something is fierce and not kind then it is awful. If something is slow and sleepy then it is strong. If something is strong and not cute then it is big. If something is furry then it is lovely. All lovely animals are round. All beautiful animals are quiet. All awful animals are heavy. All big animals are horrible. All lazy animals are rough. Question: The lion is not rough. \nAnswer: "}], "ideal": "false", "id_string": "NegationRule-Animal-D3-105510"}
{"input": [{"role": "system", "content": "Instructions: You will be presented with a passage and a question about that passage. You need to answer true or false to the question. Read the question thoroughly and answer true or false. Read the passage thoroughly to ensure you know what the passage entails and you need to use 4 rules to answer the question."}, {"role": "user", "content": "\nPassage: The snake is slow. The snake is lazy. The snake is rough. The snake chases the mouse. The crocodile sees the rabbit. The crocodile is fierce. The crocodile is big. The mouse is smart. The mouse is quiet. The mouse is nice. The rabbit is cute. The rabbit is small. The rabbit is adorable. Smart animals are cute. If something is lazy then it attacks the mouse. If something attacks the mouse then it is tired. If something is slow and lazy then it is rough. If something is cute and small then it is beautiful. If something is fierce and big then it is heavy. If something is rough then it is dull. If something is dull then it is sleepy. All sleepy animals are big. If something is cute then it is small. If something is small then it is adorable. If something is adorable then it is nice. All adorable animals are kind. If something is heavy then it is awful. All awful animals are obese. All obese animals are lazy. If something is beautiful then it is lovely. All lovely animals are furry. All furry animals are slow. If something is tired then it is strong. All strong animals are reckless. Question: The rabbit is not slow. \nAnswer: "}], "ideal": "false", "id_string": "NonNegationRule-Animal-D4-25898"}
{"input": [{"role": "system", "content": "Instructions: You will be presented with a passage and a question about that passage. You need to answer true or false to the question. Read the question thoroughly and answer true or false. Read the passage thoroughly to ensure you know what the passage entails and you need to use 4 rules to answer the question."}, {"role": "user", "content": "\nPassage: The snake is slow. The snake is lazy. The snake is rough. The snake chases the mouse. The crocodile sees the rabbit. The crocodile is fierce. The crocodile is big. The mouse is smart. The mouse is quiet. The mouse is nice. The rabbit is cute. The rabbit is small. The rabbit is adorable. Smart animals are cute. If something is lazy then it attacks the mouse. If something attacks the mouse then it is tired. If something is slow and lazy then it is rough. If something is cute and small then it is beautiful. If something is fierce and big then it is heavy. If something is rough then it is dull. If something is dull then it is sleepy. All sleepy animals are big. If something is cute then it is small. If something is small then it is adorable. If something is adorable then it is nice. All adorable animals are kind. If something is heavy then it is awful. All awful animals are obese. All obese animals are lazy. If something is beautiful then it is lovely. All lovely animals are furry. All furry animals are slow. If something is tired then it is strong. All strong animals are reckless. Question: The snake is reckless. \nAnswer: "}], "ideal": "true", "id_string": "NonNegationRule-Animal-D4-25899"}
{"input": [{"role": "system", "content": "Instructions: You will be presented with a passage and a question about that passage. You need to answer true or false to the question. Read the question thoroughly and answer true or false. Read the passage thoroughly to ensure you know what the passage entails and you need to use 5 rules to answer the question."}, {"role": "user", "content": "\nPassage: The snake is dull. The snake is slow. The bald eagle is awful. The bald eagle is powerful. The snake attacks the rabbit. The bald eagle likes the squirrel. The rabbit is quiet. The squirrel is quiet. The squirrel is beautiful. The squirrel is cute. If something is not quiet then it visits the rabbit. If something visits the rabbit then it is rough. If something is not kind then it is awful. If something is not fierce then it is furry. If something is beautiful then it is cute. If something is cute and not angry then it is small. If something is awful and not kind then it is horrible. If something is dull and slow then it is angry. If something is angry and not cute then it is powerful. If something is furry then it is lovely. If something is lovely then it is clever. If something is clever then it is kind. All kind animals are smart. All small animals are round. If something is round then it is nice. All nice animals are funny. If something is horrible then it is heavy. If something is heavy then it is tired. All tired animals are reckless. If something is powerful then it is fierce. If something is fierce then it is lazy. All lazy animals are boring. All rough animals are sleepy. If something is sleepy then it is strong. All strong animals are big. Question: The snake is big. \nAnswer: "}], "ideal": "true", "id_string": "NegationRule-Animal-D5-23709"}
{"input": [{"role": "system", "content": "Instructions: You will be presented with a passage and a question about that passage. You need to answer true or false to the question. Read the question thoroughly and answer true or false. Read the passage thoroughly to ensure you know what the passage entails and you need to use 5 rules to answer the question."}, {"role": "user", "content": "\nPassage: The snake is dull. The snake is slow. The bald eagle is awful. The bald eagle is powerful. The snake attacks the rabbit. The bald eagle likes the squirrel. The rabbit is quiet. The squirrel is quiet. The squirrel is beautiful. The squirrel is cute. If something is not quiet then it visits the rabbit. If something visits the rabbit then it is rough. If something is not kind then it is awful. If something is not fierce then it is furry. If something is beautiful then it is cute. If something is cute and not angry then it is small. If something is awful and not kind then it is horrible. If something is dull and slow then it is angry. If something is angry and not cute then it is powerful. If something is furry then it is lovely. If something is lovely then it is clever. If something is clever then it is kind. All kind animals are smart. All small animals are round. If something is round then it is nice. All nice animals are funny. If something is horrible then it is heavy. If something is heavy then it is tired. All tired animals are reckless. If something is powerful then it is fierce. If something is fierce then it is lazy. All lazy animals are boring. All rough animals are sleepy. If something is sleepy then it is strong. All strong animals are big. Question: The snake is not big. \nAnswer: "}], "ideal": "false", "id_string": "NegationRule-Animal-D5-237010"}
{"input": [{"role": "system", "content": "Instructions: You will be presented with a passage and a question about that passage. You need to answer true or false to the question. Read the question thoroughly and answer true or false. Read the passage thoroughly to ensure you know what the passage entails and you need to use 5 rules to answer the question."}, {"role": "user", "content": "\nPassage: Harry is huge. Harry is strong. Erin is small. Erin is tiny. Alan is nice. Anne is sad. Anne is poor. If someone is not big then they are sad. If someone is not bad then they are kind. If someone is nice then they are smart. If someone is smart and not rough then they are clever. If someone is sad and not big then they are dull. If someone is dull then they are little. If someone is little then they are thin. All thin people are bad. If someone is small and tiny then they are rough. If someone is rough and not smart then they are poor. If someone is poor then they are fashion. If someone is fashion then they are energetic. If someone is energetic then they are young. If someone is kind then they are wealthy. If someone is wealthy then they are quiet. If someone is quiet then they are smart. All smart people are wealthy. If someone is clever then they are famous. If someone is famous then they are old. All old people are experienced. Question: Harry is bad. \nAnswer: "}], "ideal": "true", "id_string": "NegationRule-D5-22331"}
{"input": [{"role": "system", "content": "Instructions: You will be presented with a passage and a question about that passage. You need to answer true or false to the question. Read the question thoroughly and answer true or false. Read the passage thoroughly to ensure you know what the passage entails and you need to use 5 rules to answer the question."}, {"role": "user", "content": "\nPassage: Harry is huge. Harry is strong. Erin is small. Erin is tiny. Alan is nice. Anne is sad. Anne is poor. If someone is not big then they are sad. If someone is not bad then they are kind. If someone is nice then they are smart. If someone is smart and not rough then they are clever. If someone is sad and not big then they are dull. If someone is dull then they are little. If someone is little then they are thin. All thin people are bad. If someone is small and tiny then they are rough. If someone is rough and not smart then they are poor. If someone is poor then they are fashion. If someone is fashion then they are energetic. If someone is energetic then they are young. If someone is kind then they are wealthy. If someone is wealthy then they are quiet. If someone is quiet then they are smart. All smart people are wealthy. If someone is clever then they are famous. If someone is famous then they are old. All old people are experienced. Question: Harry is not bad. \nAnswer: "}], "ideal": "false", "id_string": "NegationRule-D5-22332"}
{"input": [{"role": "system", "content": "Instructions: You will be presented with a passage and a question about that passage. You need to answer true or false to the question. Read the question thoroughly and answer true or false. Read the passage thoroughly to ensure you know what the passage entails and you need to use 5 rules to answer the question."}, {"role": "user", "content": "\nPassage: Harry is huge. Harry is strong. Erin is small. Erin is tiny. Alan is nice. Anne is sad. Anne is poor. If someone is not big then they are sad. If someone is not bad then they are kind. If someone is nice then they are smart. If someone is smart and not rough then they are clever. If someone is sad and not big then they are dull. If someone is dull then they are little. If someone is little then they are thin. All thin people are bad. If someone is small and tiny then they are rough. If someone is rough and not smart then they are poor. If someone is poor then they are fashion. If someone is fashion then they are energetic. If someone is energetic then they are young. If someone is kind then they are wealthy. If someone is wealthy then they are quiet. If someone is quiet then they are smart. All smart people are wealthy. If someone is clever then they are famous. If someone is famous then they are old. All old people are experienced. Question: Anne is wealthy. \nAnswer: "}], "ideal": "true", "id_string": "NegationRule-D5-22333"}

14H034160212 · 2023-04-24T05:57:22Z

Re-uploaded the files using Git LFS and added eval JSON data examples.

14H034160212 · 2023-05-23T13:11:16Z

@jorge-openai Hi Jorge, can you give me some tips or advice on how to fix the error? I am a bit confused about the error from the check. Any help would be much appreciated. Thanks in advance.

jorge-openai · 2023-05-23T13:31:29Z

Hi @14H034160212, here are a few changes that I have flagged for this PR:

The jsonl file should be reduced to a maximum of 2k rows; it currently has more than 10k.
Changes to .gitattributes won't be merged, so you'll need to revert them.
More specific guidance should be given for the output when using Match, because 'false' and 'False' are different values in this eval type. A better choice would be to use FuzzyMatch in this case.
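
To illustrate the last point, here is a minimal sketch of why a case-sensitive comparison treats 'false' and 'False' as different answers while a normalized comparison does not. These functions are simplified stand-ins written for illustration; they are assumptions about the general behavior and are not the evals library's Match or FuzzyMatch implementations.

```python
import string


def exact_prefix_match(completion: str, ideal: str) -> bool:
    """Case-sensitive check: the completion must start with the ideal answer."""
    return completion.strip().startswith(ideal)


def normalize(text: str) -> str:
    """Lowercase, drop ASCII punctuation, and collapse whitespace."""
    lowered = text.lower()
    no_punctuation = lowered.translate(str.maketrans("", "", string.punctuation))
    return " ".join(no_punctuation.split())


def lenient_match(completion: str, ideal: str) -> bool:
    """Normalized check: one string must contain the other."""
    a, b = normalize(completion), normalize(ideal)
    if not a or not b:
        return a == b
    return a in b or b in a


print(exact_prefix_match("False", "false"))  # False: capitalization differs
print(lenient_match("False.", "false"))      # True: case and punctuation ignored
```

One caveat of the lenient version: because it tests containment, a completion such as "It is not false, it is true" also matches the ideal answer "false", so the system prompt should still ask for a one-word answer.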

Concerning the error, I'll try to take a look at it later today to see if I can give you any advice. First, though, try uploading a smaller file: if there is an error there, it will be easier to find, and we can add more samples later.
Thanks for your patience!
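
One way to get under the 2k-row limit while keeping the four depths balanced is to draw an equal quota from each depth. The sketch below is only an assumption about how the file could be trimmed, not a record of how it was actually done in this PR; `balanced_subsample` and its parameters are hypothetical names.

```python
import random
from collections import defaultdict


def balanced_subsample(records, key, max_rows, seed=0):
    """Return at most max_rows records, drawn evenly from each group."""
    groups = defaultdict(list)
    for record in records:
        groups[key(record)].append(record)
    if not groups:
        return []
    rng = random.Random(seed)  # fixed seed keeps the selection reproducible
    quota = max_rows // len(groups)
    sample = []
    for name in sorted(groups):
        members = groups[name]
        sample.extend(rng.sample(members, min(quota, len(members))))
    return sample


# Stand-in records using the per-depth counts from the eval description.
counts = {2: 2708, 3: 2694, 4: 2704, 5: 2692}
records = [(depth, i) for depth, n in counts.items() for i in range(n)]
subset = balanced_subsample(records, key=lambda r: r[0], max_rows=2000)
print(len(records), len(subset))  # 10798 2000
```

With four depths and a 2000-row cap, this keeps 500 questions per depth.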

14H034160212 · 2023-05-24T00:14:15Z

Hi @jorge-openai, Many thanks for your reply! I have updated the files and data according to your suggestion.

jorge-openai

Thanks for the quick response, there are still a couple of things to update, but we are on the right path. Let's see if this will get us through the CI tests.

evals/registry/evals/pararule-plus-multi-step-deductive-reasoning.yaml

14H034160212 · 2023-05-24T01:29:16Z

Hi @jorge-openai, Thanks a lot for your quick reply too! I have just committed your suggestion to the PR. It seems the Run new evals / check_files (pull_request) check fails again.

jorge-openai · 2023-05-24T17:06:15Z

Thanks for the patience, the issue is not related to this particular eval, so we'll try a workaround for the CI to pass in this case and merge if all goes as expected.

Please pull and merge main into your branch, then commit. This will allow us to run the CI in your eval.

I'll check whether the other PR you mentioned is assigned to me, but just in case, do the same there.

14H034160212 · 2023-05-24T21:57:23Z

Hi @jorge-openai, Thanks a lot for your reply. I have pulled and merged the latest code from openai/evals into this branch now. I also did the same operation for my next PR.

14H034160212 · 2023-05-25T04:13:12Z

Hi @jorge-openai, Thanks a lot for your help! It seems that all checks have passed. Is the code ready to be merged?

jorge-openai · 2023-05-25T13:28:22Z

Hi, .gitattributes is still showing up among the files to be merged. Did you pull the latest version of that file from main too?

14H034160212 · 2023-05-25T23:41:35Z

Hi @jorge-openai, Thanks a lot for the reminder. I have pulled the latest code, and .gitattributes no longer shows up among the files to be merged. Can you re-review this PR? Thanks!

jorge-openai

Seems everything is good to merge now. Thanks for the patience and contribution!

…guage with Multi-Step Deductive Reasoning Instruction For OpenAI EVAL (openai#651) # Thank you for contributing an eval! ♥️ 🚨 Please make sure your PR follows these guidelines, __failure to follow the guidelines below will result in the PR being closed automatically__. Note that even if the criteria are met, that does not guarantee the PR will be merged nor GPT-4 access granted. 🚨 __PLEASE READ THIS__: In order for a PR to be merged, it must fail on GPT-4. We are aware that right now, users do not have access, so you will not be able to tell if the eval fails or not. Please run your eval with GPT-3.5-Turbo, but keep in mind as we run the eval, if GPT-4 gets higher than 90% on the eval, we will likely reject since GPT-4 is already capable of completing the task. We plan to roll out a way for users submitting evals to see the eval performance on GPT-4 soon. Stay tuned! Until then, you will not be able to see the eval performance on GPT-4. **Starting April 10, the minimum eval count is 15 samples, we hope this makes it easier to create and contribute evals.** ## Eval details 📑 ### Eval name [pararule-plus-multi-step-deductive-reasoning] ### Eval description [We proposed a multi-step deductive reasoning instruction for the [PARARULE-Plus dataset](https://github.com/Strong-AI-Lab/PARARULE-Plus), which is a larger deep multi-step deductive reasoning dataset over natural language. We also submitted the PARARULE-Plus into the `Huggingface/Datasets`. Here is the [link](https://huggingface.co/datasets/qbao775/PARARULE-Plus). PARARULE-Plus dataset addresses the reasoning depth imbalance issue from the RuleTaker dataset. The dataset specifically increases the dataset on the deep reasoning depth, including depth=2, 3, 4, 5. In this pull request, we submit a dataset that includes `2708`, `2694`, `2704`, and `2692` questions for Depth=2, Depth=3, Depth=4, and Depth=5, respectively. Furthermore, we evaluate ChatGPT, and it fails on this dataset. 
Here is the [tweet link](https://twitter.com/qiming_bao/status/1615510552088018944). ### What makes this a useful eval? [Logical reasoning ability is a fascinating topic in the NLP community. We hope to see if ChatGPT and GPT4 sheds more light on this topic.] ## Criteria for a good eval ✅ Below are some of the criteria we look for in a good eval. In general, we are seeking cases where the model does not do a good job despite being capable of generating a good response (note that there are some things large language models cannot do, so those would not make good evals). Your eval should be: - [x] Thematically consistent: The eval should be thematically consistent. We'd like to see a number of prompts all demonstrating some particular failure mode. For example, we can create an eval on cases where the model fails to reason about the physical world. - [x] Contains failures where a human can do the task, but either GPT-4 or GPT-3.5-Turbo could not. - [x] Includes good signal around what is the right behavior. This means either a correct answer for `Basic` evals or the `Fact` Model-graded eval, or an exhaustive rubric for evaluating answers for the `Criteria` Model-graded eval. - [x] **Include at least 15 high quality examples.** If there is anything else that makes your eval worth including, please document it below. ### Unique eval value > Insert what makes your eval high quality that was not mentioned above. (Not required) ## Eval structure 🏗️ Your eval should - [x] Check that your data is in `evals/registry/data/{name}` - [x] Check that your yaml is registered at `evals/registry/evals/{name}.yaml` - [x] Ensure you have the right to use the data you submit via this eval (For now, we will only be approving evals that use one of the existing eval classes. You may still write custom eval classes for your own cases, and we may consider merging them in the future.) 
## Final checklist 👀 ### Submission agreement By contributing to Evals, you are agreeing to make your evaluation logic and data under the same MIT license as this repository. You must have adequate rights to upload any data used in an Eval. OpenAI reserves the right to use this data in future service improvements to our product. Contributions to OpenAI Evals will be subject to our usual Usage Policies (https://platform.openai.com/docs/usage-policies). - [x] I agree that my submission will be made available under an MIT license and complies with OpenAI's usage policies. ### Email address validation If your submission is accepted, we will be granting GPT-4 access to a limited number of contributors. Access will be given to the email address associated with the merged pull request. - [x] I acknowledge that GPT-4 access will only be granted, if applicable, to the email address used for my merged pull request. ### Limited availability acknowledgement We know that you might be excited to contribute to OpenAI's mission, help improve our models, and gain access to GPT-4. However, due to the requirements mentioned above and high volume of submissions, we will not be able to accept all submissions and thus not grant everyone who opens a PR GPT-4 access. We know this is disappointing, but we hope to set the right expectation before you open this PR. - [x] I understand that opening a PR, even if it meets the requirements above, does not guarantee the PR will be merged nor GPT-4 access granted. ### Submit eval - [x] I have filled out all required fields in the evals PR form - [ ] (Ignore if not submitting code) I have run `pip install pre-commit; pre-commit install` and have verified that `black`, `isort`, and `autoflake` are running when I commit and push Failure to fill out all required fields will result in the PR being closed. 
### Eval JSON data

Since we are using Git LFS, we are asking eval submitters to add in as many Eval Samples (at least 5) from their contribution here:

<details>
<summary>View evals in JSON</summary>

### Eval

```
{"input": [{"role": "system", "content": "Instructions: You will be presented with a passage and a question about that passage. You need to answer true or false to the question. Read the question thoroughly and answer true or false. Read the passage thoroughly to ensure you know what the passage entails and you need to use 2 rules to answer the question."}, {"role": "user", "content": "\nPassage: The lion is slow. The lion is lazy. The wolf is strong. The wolf is fierce. The lion chases the mouse. The wolf likes the dog. The mouse is smart. The dog is smart. The dog is cute. The dog is small. If something is not smart then it needs the mouse. If something needs the mouse then it is rough. If something is not kind then it is strong. If something is not big then it is furry. If something is cute then it is small. If something is small and not awful then it is lovely. If something is strong and not kind then it is heavy. If something is slow and lazy then it is awful. If something is awful and not small then it is fierce. All furry animals are beautiful. Question: The lion is heavy. \nAnswer: "}], "ideal": "true", "id_string": "NegationRule-Animal-D2-11451"}
{"input": [{"role": "system", "content": "Instructions: You will be presented with a passage and a question about that passage. You need to answer true or false to the question. Read the question thoroughly and answer true or false. Read the passage thoroughly to ensure you know what the passage entails and you need to use 2 rules to answer the question."}, {"role": "user", "content": "\nPassage: The lion is slow. The lion is lazy. The wolf is strong. The wolf is fierce. The lion chases the mouse. The wolf likes the dog. The mouse is smart. The dog is smart. The dog is cute. The dog is small. If something is not smart then it needs the mouse. If something needs the mouse then it is rough. If something is not kind then it is strong. If something is not big then it is furry. If something is cute then it is small. If something is small and not awful then it is lovely. If something is strong and not kind then it is heavy. If something is slow and lazy then it is awful. If something is awful and not small then it is fierce. All furry animals are beautiful. Question: The lion is not heavy. \nAnswer: "}], "ideal": "false", "id_string": "NegationRule-Animal-D2-11452"}
{"input": [{"role": "system", "content": "Instructions: You will be presented with a passage and a question about that passage. You need to answer true or false to the question. Read the question thoroughly and answer true or false. Read the passage thoroughly to ensure you know what the passage entails and you need to use 3 rules to answer the question."}, {"role": "user", "content": "\nPassage: The lion is slow. The lion is sleepy. The tiger is fierce. The tiger is big. The lion likes the dog. The tiger needs the mouse. The dog is smart. The mouse is smart. The mouse is small. The mouse is cute. If something is not smart then it sees the dog. If something sees the dog then it is lazy. If something is not kind then it is fierce. If something is not horrible then it is furry. If something is small then it is cute. If something is cute and not strong then it is beautiful. If something is fierce and not kind then it is awful. If something is slow and sleepy then it is strong. If something is strong and not cute then it is big. If something is furry then it is lovely. All lovely animals are round. All beautiful animals are quiet. All awful animals are heavy. All big animals are horrible. All lazy animals are rough. Question: The lion is rough. \nAnswer: "}], "ideal": "true", "id_string": "NegationRule-Animal-D3-10559"}
{"input": [{"role": "system", "content": "Instructions: You will be presented with a passage and a question about that passage. You need to answer true or false to the question. Read the question thoroughly and answer true or false. Read the passage thoroughly to ensure you know what the passage entails and you need to use 3 rules to answer the question."}, {"role": "user", "content": "\nPassage: The lion is slow. The lion is sleepy. The tiger is fierce. The tiger is big. The lion likes the dog. The tiger needs the mouse. The dog is smart. The mouse is smart. The mouse is small. The mouse is cute. If something is not smart then it sees the dog. If something sees the dog then it is lazy. If something is not kind then it is fierce. If something is not horrible then it is furry. If something is small then it is cute. If something is cute and not strong then it is beautiful. If something is fierce and not kind then it is awful. If something is slow and sleepy then it is strong. If something is strong and not cute then it is big. If something is furry then it is lovely. All lovely animals are round. All beautiful animals are quiet. All awful animals are heavy. All big animals are horrible. All lazy animals are rough. Question: The lion is not rough. \nAnswer: "}], "ideal": "false", "id_string": "NegationRule-Animal-D3-105510"}
{"input": [{"role": "system", "content": "Instructions: You will be presented with a passage and a question about that passage. You need to answer true or false to the question. Read the question thoroughly and answer true or false. Read the passage thoroughly to ensure you know what the passage entails and you need to use 4 rules to answer the question."}, {"role": "user", "content": "\nPassage: The snake is slow. The snake is lazy. The snake is rough. The snake chases the mouse. The crocodile sees the rabbit. The crocodile is fierce. The crocodile is big. The mouse is smart. The mouse is quiet. The mouse is nice. The rabbit is cute. The rabbit is small. The rabbit is adorable. Smart animals are cute. If something is lazy then it attacks the mouse. If something attacks the mouse then it is tired. If something is slow and lazy then it is rough. If something is cute and small then it is beautiful. If something is fierce and big then it is heavy. If something is rough then it is dull. If something is dull then it is sleepy. All sleepy animals are big. If something is cute then it is small. If something is small then it is adorable. If something is adorable then it is nice. All adorable animals are kind. If something is heavy then it is awful. All awful animals are obese. All obese animals are lazy. If something is beautiful then it is lovely. All lovely animals are furry. All furry animals are slow. If something is tired then it is strong. All strong animals are reckless. Question: The rabbit is not slow. \nAnswer: "}], "ideal": "false", "id_string": "NonNegationRule-Animal-D4-25898"}
{"input": [{"role": "system", "content": "Instructions: You will be presented with a passage and a question about that passage. You need to answer true or false to the question. Read the question thoroughly and answer true or false. Read the passage thoroughly to ensure you know what the passage entails and you need to use 4 rules to answer the question."}, {"role": "user", "content": "\nPassage: The snake is slow. The snake is lazy. The snake is rough. The snake chases the mouse. The crocodile sees the rabbit. The crocodile is fierce. The crocodile is big. The mouse is smart. The mouse is quiet. The mouse is nice. The rabbit is cute. The rabbit is small. The rabbit is adorable. Smart animals are cute. If something is lazy then it attacks the mouse. If something attacks the mouse then it is tired. If something is slow and lazy then it is rough. If something is cute and small then it is beautiful. If something is fierce and big then it is heavy. If something is rough then it is dull. If something is dull then it is sleepy. All sleepy animals are big. If something is cute then it is small. If something is small then it is adorable. If something is adorable then it is nice. All adorable animals are kind. If something is heavy then it is awful. All awful animals are obese. All obese animals are lazy. If something is beautiful then it is lovely. All lovely animals are furry. All furry animals are slow. If something is tired then it is strong. All strong animals are reckless. Question: The snake is reckless. \nAnswer: "}], "ideal": "true", "id_string": "NonNegationRule-Animal-D4-25899"}
{"input": [{"role": "system", "content": "Instructions: You will be presented with a passage and a question about that passage. You need to answer true or false to the question. Read the question thoroughly and answer true or false. Read the passage thoroughly to ensure you know what the passage entails and you need to use 5 rules to answer the question."}, {"role": "user", "content": "\nPassage: The snake is dull. The snake is slow. The bald eagle is awful. The bald eagle is powerful. The snake attacks the rabbit. The bald eagle likes the squirrel. The rabbit is quiet. The squirrel is quiet. The squirrel is beautiful. The squirrel is cute. If something is not quiet then it visits the rabbit. If something visits the rabbit then it is rough. If something is not kind then it is awful. If something is not fierce then it is furry. If something is beautiful then it is cute. If something is cute and not angry then it is small. If something is awful and not kind then it is horrible. If something is dull and slow then it is angry. If something is angry and not cute then it is powerful. If something is furry then it is lovely. If something is lovely then it is clever. If something is clever then it is kind. All kind animals are smart. All small animals are round. If something is round then it is nice. All nice animals are funny. If something is horrible then it is heavy. If something is heavy then it is tired. All tired animals are reckless. If something is powerful then it is fierce. If something is fierce then it is lazy. All lazy animals are boring. All rough animals are sleepy. If something is sleepy then it is strong. All strong animals are big. Question: The snake is big. \nAnswer: "}], "ideal": "true", "id_string": "NegationRule-Animal-D5-23709"}
{"input": [{"role": "system", "content": "Instructions: You will be presented with a passage and a question about that passage. You need to answer true or false to the question. Read the question thoroughly and answer true or false. Read the passage thoroughly to ensure you know what the passage entails and you need to use 5 rules to answer the question."}, {"role": "user", "content": "\nPassage: The snake is dull. The snake is slow. The bald eagle is awful. The bald eagle is powerful. The snake attacks the rabbit. The bald eagle likes the squirrel. The rabbit is quiet. The squirrel is quiet. The squirrel is beautiful. The squirrel is cute. If something is not quiet then it visits the rabbit. If something visits the rabbit then it is rough. If something is not kind then it is awful. If something is not fierce then it is furry. If something is beautiful then it is cute. If something is cute and not angry then it is small. If something is awful and not kind then it is horrible. If something is dull and slow then it is angry. If something is angry and not cute then it is powerful. If something is furry then it is lovely. If something is lovely then it is clever. If something is clever then it is kind. All kind animals are smart. All small animals are round. If something is round then it is nice. All nice animals are funny. If something is horrible then it is heavy. If something is heavy then it is tired. All tired animals are reckless. If something is powerful then it is fierce. If something is fierce then it is lazy. All lazy animals are boring. All rough animals are sleepy. If something is sleepy then it is strong. All strong animals are big. Question: The snake is not big. \nAnswer: "}], "ideal": "false", "id_string": "NegationRule-Animal-D5-237010"}
{"input": [{"role": "system", "content": "Instructions: You will be presented with a passage and a question about that passage. You need to answer true or false to the question. Read the question thoroughly and answer true or false. Read the passage thoroughly to ensure you know what the passage entails and you need to use 5 rules to answer the question."}, {"role": "user", "content": "\nPassage: Harry is huge. Harry is strong. Erin is small. Erin is tiny. Alan is nice. Anne is sad. Anne is poor. If someone is not big then they are sad. If someone is not bad then they are kind. If someone is nice then they are smart. If someone is smart and not rough then they are clever. If someone is sad and not big then they are dull. If someone is dull then they are little. If someone is little then they are thin. All thin people are bad. If someone is small and tiny then they are rough. If someone is rough and not smart then they are poor. If someone is poor then they are fashion. If someone is fashion then they are energetic. If someone is energetic then they are young. If someone is kind then they are wealthy. If someone is wealthy then they are quiet. If someone is quiet then they are smart. All smart people are wealthy. If someone is clever then they are famous. If someone is famous then they are old. All old people are experienced. Question: Harry is bad. \nAnswer: "}], "ideal": "true", "id_string": "NegationRule-D5-22331"}
{"input": [{"role": "system", "content": "Instructions: You will be presented with a passage and a question about that passage. You need to answer true or false to the question. Read the question thoroughly and answer true or false. Read the passage thoroughly to ensure you know what the passage entails and you need to use 5 rules to answer the question."}, {"role": "user", "content": "\nPassage: Harry is huge. Harry is strong. Erin is small. Erin is tiny. Alan is nice. Anne is sad. Anne is poor. If someone is not big then they are sad. If someone is not bad then they are kind. If someone is nice then they are smart. If someone is smart and not rough then they are clever. If someone is sad and not big then they are dull. If someone is dull then they are little. If someone is little then they are thin. All thin people are bad. If someone is small and tiny then they are rough. If someone is rough and not smart then they are poor. If someone is poor then they are fashion. If someone is fashion then they are energetic. If someone is energetic then they are young. If someone is kind then they are wealthy. If someone is wealthy then they are quiet. If someone is quiet then they are smart. All smart people are wealthy. If someone is clever then they are famous. If someone is famous then they are old. All old people are experienced. Question: Harry is not bad. \nAnswer: "}], "ideal": "false", "id_string": "NegationRule-D5-22332"}
{"input": [{"role": "system", "content": "Instructions: You will be presented with a passage and a question about that passage. You need to answer true or false to the question. Read the question thoroughly and answer true or false. Read the passage thoroughly to ensure you know what the passage entails and you need to use 5 rules to answer the question."}, {"role": "user", "content": "\nPassage: Harry is huge. Harry is strong. Erin is small. Erin is tiny. Alan is nice. Anne is sad. Anne is poor. If someone is not big then they are sad. If someone is not bad then they are kind. If someone is nice then they are smart. If someone is smart and not rough then they are clever. If someone is sad and not big then they are dull. If someone is dull then they are little. If someone is little then they are thin. All thin people are bad. If someone is small and tiny then they are rough. If someone is rough and not smart then they are poor. If someone is poor then they are fashion. If someone is fashion then they are energetic. If someone is energetic then they are young. If someone is kind then they are wealthy. If someone is wealthy then they are quiet. If someone is quiet then they are smart. All smart people are wealthy. If someone is clever then they are famous. If someone is famous then they are old. All old people are experienced. Question: Anne is wealthy. \nAnswer: "}], "ideal": "true", "id_string": "NegationRule-D5-22333"}
```

</details>

---------

Co-authored-by: qiming bao <qiming.bao@xtracta.com>
Co-authored-by: Jorge <133797909+jorge-openai@users.noreply.github.com>
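
The sample layout in the JSON data above (a `system` message, a `user` message, an `ideal` answer of `true` or `false`, and an `id_string`) can be sanity-checked before upload with a few lines of Python. The sketch below is only an illustration: `parse_sample` and `fuzzy_match` are hypothetical helpers written for this example, not functions from the evals repository, the containment-based comparison is a simplified stand-in for whatever matching the eval class performs, and the demo sample abbreviates the passage with `...`.

```python
import json

# Keys every sample line in the PR carries besides the optional "id_string".
REQUIRED_KEYS = {"input", "ideal"}


def parse_sample(line: str) -> dict:
    """Parse one JSONL line and check it has the shape used in this PR."""
    sample = json.loads(line)
    missing = REQUIRED_KEYS - set(sample)
    if missing:
        raise ValueError(f"sample is missing keys: {sorted(missing)}")
    roles = [message["role"] for message in sample["input"]]
    if roles != ["system", "user"]:
        raise ValueError(f"unexpected message roles: {roles}")
    if sample["ideal"] not in ("true", "false"):
        raise ValueError(f"unexpected ideal answer: {sample['ideal']!r}")
    return sample


def fuzzy_match(completion: str, ideal: str) -> bool:
    """Hypothetical lenient comparison: case-insensitive containment."""
    a = completion.strip().lower()
    b = ideal.strip().lower()
    # Reject empty strings, which would otherwise be contained in anything.
    return bool(a) and bool(b) and (a in b or b in a)


# Abbreviated demo sample; the "..." stands for text omitted in this sketch.
demo_line = json.dumps({
    "input": [
        {"role": "system", "content": "Instructions: ... you need to use 2 rules to answer the question."},
        {"role": "user", "content": "\nPassage: ... Question: The lion is heavy. \nAnswer: "},
    ],
    "ideal": "true",
    "id_string": "demo-abbreviated-sample",
})

sample = parse_sample(demo_line)
print(fuzzy_match("True.", sample["ideal"]))
```

A lenient comparison of this kind tolerates answers such as `True.` with trailing punctuation, which an exact string match would reject.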

14H034160212 added 3 commits April 12, 2023 10:07

update complex logical reasoning evals

ffa349c

add pararule-plus

856de37

update pararule plus dataset

379ec90

qiming bao added 4 commits April 24, 2023 17:43

remove the data

c8cffd6

use lfs to upload jsonl file

7a2676f

remove jsonl

48833eb

reupload data using lfs

78d2e93

14H034160212 mentioned this pull request May 23, 2023

Cannot pass the check and got this error: KeyError: 'sample' #1012

Open

remove redundant line

6fc12b2

14H034160212 requested review from andrew-openai, rlbayes, jwang47 and logankilpatrick as code owners May 23, 2023 22:37

Qiming Bao added 3 commits May 23, 2023 23:54

remove the largejsonl file

1ac53df

add another new the largejsonl file

9b82821

use fuzzymatch replace match

1823c24

remove redundant code

4d76c0b

jorge-openai requested changes May 24, 2023

evals/registry/evals/pararule-plus-multi-step-deductive-reasoning.yaml (outdated, resolved)

evals/registry/evals/pararule-plus-multi-step-deductive-reasoning.yaml (resolved)

14H034160212 and others added 2 commits May 24, 2023 13:26

Update evals/registry/evals/pararule-plus-multi-step-deductive-reason…

fe57ecc

…ing.yaml Co-authored-by: Jorge <133797909+jorge-openai@users.noreply.github.com>

Update evals/registry/evals/pararule-plus-multi-step-deductive-reason…

82160db

…ing.yaml Co-authored-by: Jorge <133797909+jorge-openai@users.noreply.github.com>

add description

43e5238

14H034160212 mentioned this pull request May 24, 2023

A Group of more Challenge Logical Reasoning Machine Reading Comprehension Datasets with Logical Reasoning Instruction For OpenAI EVAL #648

Merged

12 tasks

Merge remote-tracking branch 'upstream/main'

1383dcd

especially if it merges an updated upstream into a topic branch.

resolve conflict

e2d0702

update description for logiqa plus

45b8eda

14H034160212 added 6 commits May 25, 2023 23:26

remove the changed gita file

d42b906

remove the change gita

ae2d1a7

Merge remote-tracking branch 'upstream/main'

50efba9

the commit.update code

add origin gita back

ed24b52

resolve conflict

e4f3051

remove redundant code

04c7f78

14H034160212 requested a review from jorge-openai May 25, 2023 23:43

jorge-openai approved these changes May 26, 2023

andrew-openai merged commit f077b82 into openai:main May 27, 2023
2 checks passed

14H034160212 mentioned this pull request May 28, 2023

Some new papers with logical reasoning zjunlp/Prompt4ReasoningPapers#6

Closed
