A Larger Deep Multi-Step Deductive Reasoning Dataset over Natural Language with Multi-Step Deductive Reasoning Instruction For OpenAI EVAL #651

Merged

Conversation

14H034160212
Contributor

@14H034160212 14H034160212 commented Apr 12, 2023

Thank you for contributing an eval! ♥️

🚨 Please make sure your PR follows these guidelines, failure to follow the guidelines below will result in the PR being closed automatically. Note that even if the criteria are met, that does not guarantee the PR will be merged nor GPT-4 access granted. 🚨

PLEASE READ THIS:

In order for a PR to be merged, it must fail on GPT-4. We are aware that right now, users do not have access, so you will not be able to tell if the eval fails or not. Please run your eval with GPT-3.5-Turbo, but keep in mind as we run the eval, if GPT-4 gets higher than 90% on the eval, we will likely reject since GPT-4 is already capable of completing the task.

We plan to roll out a way for users submitting evals to see the eval performance on GPT-4 soon. Stay tuned! Until then, you will not be able to see the eval performance on GPT-4. Starting April 10, the minimum eval count is 15 samples, we hope this makes it easier to create and contribute evals.

Eval details 📑

Eval name

[pararule-plus-multi-step-deductive-reasoning]

Eval description

[We propose a multi-step deductive reasoning instruction for the PARARULE-Plus dataset (https://github.com/Strong-AI-Lab/PARARULE-Plus), a larger deep multi-step deductive reasoning dataset over natural language. We also submitted PARARULE-Plus to Hugging Face Datasets (https://huggingface.co/datasets/qbao775/PARARULE-Plus). PARARULE-Plus addresses the reasoning-depth imbalance in the RuleTaker dataset by adding more examples at the deeper reasoning depths (depth = 2, 3, 4, and 5). In this pull request, we submit a dataset that includes 2708, 2694, 2704, and 2692 questions for Depth=2, Depth=3, Depth=4, and Depth=5, respectively. Furthermore, we evaluated ChatGPT, and it fails on this dataset (tweet: https://twitter.com/qiming_bao/status/1615510552088018944).]

What makes this a useful eval?

[Logical reasoning ability is a fascinating topic in the NLP community. We hope to see whether ChatGPT and GPT-4 shed more light on this topic.]

Criteria for a good eval ✅

Below are some of the criteria we look for in a good eval. In general, we are seeking cases where the model does not do a good job despite being capable of generating a good response (note that there are some things large language models cannot do, so those would not make good evals).

Your eval should be:

  • Thematically consistent: The eval should be thematically consistent. We'd like to see a number of prompts all demonstrating some particular failure mode. For example, we can create an eval on cases where the model fails to reason about the physical world.
  • Contains failures where a human can do the task, but either GPT-4 or GPT-3.5-Turbo could not.
  • Includes good signal around what is the right behavior. This means either a correct answer for Basic evals or the Fact Model-graded eval, or an exhaustive rubric for evaluating answers for the Criteria Model-graded eval.
  • Include at least 15 high quality examples.

If there is anything else that makes your eval worth including, please document it below.

Unique eval value

Insert what makes your eval high quality that was not mentioned above. (Not required)

Eval structure 🏗️

Your eval should

  • Check that your data is in evals/registry/data/{name}
  • Check that your yaml is registered at evals/registry/evals/{name}.yaml
  • Ensure you have the right to use the data you submit via this eval
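
The two path checks above can be sketched as a small helper. This is only an illustration under stated assumptions: the function names `expected_registry_paths` and `missing_registry_paths` are hypothetical and not part of the evals repository; only the two path patterns come from the checklist.

```python
from pathlib import Path


def expected_registry_paths(repo_root: str, name: str) -> dict:
    """Return the two locations named in the checklist for an eval called `name`."""
    registry = Path(repo_root) / "evals" / "registry"
    return {
        # evals/registry/data/{name}
        "data_dir": registry / "data" / name,
        # evals/registry/evals/{name}.yaml
        "yaml_file": registry / "evals" / f"{name}.yaml",
    }


def missing_registry_paths(repo_root: str, name: str) -> list:
    """List the labels of expected paths that do not exist on disk."""
    paths = expected_registry_paths(repo_root, name)
    return [label for label, path in paths.items() if not path.exists()]
```

Calling `missing_registry_paths(".", "pararule-plus-multi-step-deductive-reasoning")` from a checkout would return an empty list when both locations are present.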

(For now, we will only be approving evals that use one of the existing eval classes. You may still write custom eval classes for your own cases, and we may consider merging them in the future.)

Final checklist 👀

Submission agreement

By contributing to Evals, you are agreeing to make your evaluation logic and data under the same MIT license as this repository. You must have adequate rights to upload any data used in an Eval. OpenAI reserves the right to use this data in future service improvements to our product. Contributions to OpenAI Evals will be subject to our usual Usage Policies (https://platform.openai.com/docs/usage-policies).

  • I agree that my submission will be made available under an MIT license and complies with OpenAI's usage policies.

Email address validation

If your submission is accepted, we will be granting GPT-4 access to a limited number of contributors. Access will be given to the email address associated with the merged pull request.

  • I acknowledge that GPT-4 access will only be granted, if applicable, to the email address used for my merged pull request.

Limited availability acknowledgement

We know that you might be excited to contribute to OpenAI's mission, help improve our models, and gain access to GPT-4. However, due to the requirements mentioned above and high volume of submissions, we will not be able to accept all submissions and thus not grant everyone who opens a PR GPT-4 access. We know this is disappointing, but we hope to set the right expectation before you open this PR.

  • I understand that opening a PR, even if it meets the requirements above, does not guarantee the PR will be merged nor GPT-4 access granted.

Submit eval

  • I have filled out all required fields in the evals PR form
  • (Ignore if not submitting code) I have run pip install pre-commit; pre-commit install and have verified that black, isort, and autoflake are running when I commit and push

Failure to fill out all required fields will result in the PR being closed.

Eval JSON data

Since we are using Git LFS, we are asking eval submitters to add in as many Eval Samples (at least 5) from their contribution here:

View evals in JSON

Eval

{"input": [{"role": "system", "content": "Instructions: You will be presented with a passage and a question about that passage. You need to answer true or false to the question. Read the question thoroughly and answer true or false. Read the passage thoroughly to ensure you know what the passage entails and you need to use 2 rules to answer the question."}, {"role": "user", "content": "\nPassage: The lion is slow. The lion is lazy. The wolf is strong. The wolf is fierce. The lion chases the mouse. The wolf likes the dog. The mouse is smart. The dog is smart. The dog is cute. The dog is small. If something is not smart then it needs the mouse. If something needs the mouse then it is rough. If something is not kind then it is strong. If something is not big then it is furry. If something is cute then it is small. If something is small and not awful then it is lovely. If something is strong and not kind then it is heavy. If something is slow and lazy then it is awful. If something is awful and not small then it is fierce. All furry animals are beautiful. Question: The lion is heavy. \nAnswer: "}], "ideal": "true", "id_string": "NegationRule-Animal-D2-11451"}
{"input": [{"role": "system", "content": "Instructions: You will be presented with a passage and a question about that passage. You need to answer true or false to the question. Read the question thoroughly and answer true or false. Read the passage thoroughly to ensure you know what the passage entails and you need to use 2 rules to answer the question."}, {"role": "user", "content": "\nPassage: The lion is slow. The lion is lazy. The wolf is strong. The wolf is fierce. The lion chases the mouse. The wolf likes the dog. The mouse is smart. The dog is smart. The dog is cute. The dog is small. If something is not smart then it needs the mouse. If something needs the mouse then it is rough. If something is not kind then it is strong. If something is not big then it is furry. If something is cute then it is small. If something is small and not awful then it is lovely. If something is strong and not kind then it is heavy. If something is slow and lazy then it is awful. If something is awful and not small then it is fierce. All furry animals are beautiful. Question: The lion is not heavy. \nAnswer: "}], "ideal": "false", "id_string": "NegationRule-Animal-D2-11452"}
{"input": [{"role": "system", "content": "Instructions: You will be presented with a passage and a question about that passage. You need to answer true or false to the question. Read the question thoroughly and answer true or false. Read the passage thoroughly to ensure you know what the passage entails and you need to use 3 rules to answer the question."}, {"role": "user", "content": "\nPassage: The lion is slow. The lion is sleepy. The tiger is fierce. The tiger is big. The lion likes the dog. The tiger needs the mouse. The dog is smart. The mouse is smart. The mouse is small. The mouse is cute. If something is not smart then it sees the dog. If something sees the dog then it is lazy. If something is not kind then it is fierce. If something is not horrible then it is furry. If something is small then it is cute. If something is cute and not strong then it is beautiful. If something is fierce and not kind then it is awful. If something is slow and sleepy then it is strong. If something is strong and not cute then it is big. If something is furry then it is lovely. All lovely animals are round. All beautiful animals are quiet. All awful animals are heavy. All big animals are horrible. All lazy animals are rough. Question: The lion is rough. \nAnswer: "}], "ideal": "true", "id_string": "NegationRule-Animal-D3-10559"}
{"input": [{"role": "system", "content": "Instructions: You will be presented with a passage and a question about that passage. You need to answer true or false to the question. Read the question thoroughly and answer true or false. Read the passage thoroughly to ensure you know what the passage entails and you need to use 3 rules to answer the question."}, {"role": "user", "content": "\nPassage: The lion is slow. The lion is sleepy. The tiger is fierce. The tiger is big. The lion likes the dog. The tiger needs the mouse. The dog is smart. The mouse is smart. The mouse is small. The mouse is cute. If something is not smart then it sees the dog. If something sees the dog then it is lazy. If something is not kind then it is fierce. If something is not horrible then it is furry. If something is small then it is cute. If something is cute and not strong then it is beautiful. If something is fierce and not kind then it is awful. If something is slow and sleepy then it is strong. If something is strong and not cute then it is big. If something is furry then it is lovely. All lovely animals are round. All beautiful animals are quiet. All awful animals are heavy. All big animals are horrible. All lazy animals are rough. Question: The lion is not rough. \nAnswer: "}], "ideal": "false", "id_string": "NegationRule-Animal-D3-105510"}
{"input": [{"role": "system", "content": "Instructions: You will be presented with a passage and a question about that passage. You need to answer true or false to the question. Read the question thoroughly and answer true or false. Read the passage thoroughly to ensure you know what the passage entails and you need to use 4 rules to answer the question."}, {"role": "user", "content": "\nPassage: The snake is slow. The snake is lazy. The snake is rough. The snake chases the mouse. The crocodile sees the rabbit. The crocodile is fierce. The crocodile is big. The mouse is smart. The mouse is quiet. The mouse is nice. The rabbit is cute. The rabbit is small. The rabbit is adorable. Smart animals are cute. If something is lazy then it attacks the mouse. If something attacks the mouse then it is tired. If something is slow and lazy then it is rough. If something is cute and small then it is beautiful. If something is fierce and big then it is heavy. If something is rough then it is dull. If something is dull then it is sleepy. All sleepy animals are big. If something is cute then it is small. If something is small then it is adorable. If something is adorable then it is nice. All adorable animals are kind. If something is heavy then it is awful. All awful animals are obese. All obese animals are lazy. If something is beautiful then it is lovely. All lovely animals are furry. All furry animals are slow. If something is tired then it is strong. All strong animals are reckless. Question: The rabbit is not slow. \nAnswer: "}], "ideal": "false", "id_string": "NonNegationRule-Animal-D4-25898"}
{"input": [{"role": "system", "content": "Instructions: You will be presented with a passage and a question about that passage. You need to answer true or false to the question. Read the question thoroughly and answer true or false. Read the passage thoroughly to ensure you know what the passage entails and you need to use 4 rules to answer the question."}, {"role": "user", "content": "\nPassage: The snake is slow. The snake is lazy. The snake is rough. The snake chases the mouse. The crocodile sees the rabbit. The crocodile is fierce. The crocodile is big. The mouse is smart. The mouse is quiet. The mouse is nice. The rabbit is cute. The rabbit is small. The rabbit is adorable. Smart animals are cute. If something is lazy then it attacks the mouse. If something attacks the mouse then it is tired. If something is slow and lazy then it is rough. If something is cute and small then it is beautiful. If something is fierce and big then it is heavy. If something is rough then it is dull. If something is dull then it is sleepy. All sleepy animals are big. If something is cute then it is small. If something is small then it is adorable. If something is adorable then it is nice. All adorable animals are kind. If something is heavy then it is awful. All awful animals are obese. All obese animals are lazy. If something is beautiful then it is lovely. All lovely animals are furry. All furry animals are slow. If something is tired then it is strong. All strong animals are reckless. Question: The snake is reckless. \nAnswer: "}], "ideal": "true", "id_string": "NonNegationRule-Animal-D4-25899"}
{"input": [{"role": "system", "content": "Instructions: You will be presented with a passage and a question about that passage. You need to answer true or false to the question. Read the question thoroughly and answer true or false. Read the passage thoroughly to ensure you know what the passage entails and you need to use 5 rules to answer the question."}, {"role": "user", "content": "\nPassage: The snake is dull. The snake is slow. The bald eagle is awful. The bald eagle is powerful. The snake attacks the rabbit. The bald eagle likes the squirrel. The rabbit is quiet. The squirrel is quiet. The squirrel is beautiful. The squirrel is cute. If something is not quiet then it visits the rabbit. If something visits the rabbit then it is rough. If something is not kind then it is awful. If something is not fierce then it is furry. If something is beautiful then it is cute. If something is cute and not angry then it is small. If something is awful and not kind then it is horrible. If something is dull and slow then it is angry. If something is angry and not cute then it is powerful. If something is furry then it is lovely. If something is lovely then it is clever. If something is clever then it is kind. All kind animals are smart. All small animals are round. If something is round then it is nice. All nice animals are funny. If something is horrible then it is heavy. If something is heavy then it is tired. All tired animals are reckless. If something is powerful then it is fierce. If something is fierce then it is lazy. All lazy animals are boring. All rough animals are sleepy. If something is sleepy then it is strong. All strong animals are big. Question: The snake is big. \nAnswer: "}], "ideal": "true", "id_string": "NegationRule-Animal-D5-23709"}
{"input": [{"role": "system", "content": "Instructions: You will be presented with a passage and a question about that passage. You need to answer true or false to the question. Read the question thoroughly and answer true or false. Read the passage thoroughly to ensure you know what the passage entails and you need to use 5 rules to answer the question."}, {"role": "user", "content": "\nPassage: The snake is dull. The snake is slow. The bald eagle is awful. The bald eagle is powerful. The snake attacks the rabbit. The bald eagle likes the squirrel. The rabbit is quiet. The squirrel is quiet. The squirrel is beautiful. The squirrel is cute. If something is not quiet then it visits the rabbit. If something visits the rabbit then it is rough. If something is not kind then it is awful. If something is not fierce then it is furry. If something is beautiful then it is cute. If something is cute and not angry then it is small. If something is awful and not kind then it is horrible. If something is dull and slow then it is angry. If something is angry and not cute then it is powerful. If something is furry then it is lovely. If something is lovely then it is clever. If something is clever then it is kind. All kind animals are smart. All small animals are round. If something is round then it is nice. All nice animals are funny. If something is horrible then it is heavy. If something is heavy then it is tired. All tired animals are reckless. If something is powerful then it is fierce. If something is fierce then it is lazy. All lazy animals are boring. All rough animals are sleepy. If something is sleepy then it is strong. All strong animals are big. Question: The snake is not big. \nAnswer: "}], "ideal": "false", "id_string": "NegationRule-Animal-D5-237010"}
{"input": [{"role": "system", "content": "Instructions: You will be presented with a passage and a question about that passage. You need to answer true or false to the question. Read the question thoroughly and answer true or false. Read the passage thoroughly to ensure you know what the passage entails and you need to use 5 rules to answer the question."}, {"role": "user", "content": "\nPassage: Harry is huge. Harry is strong. Erin is small. Erin is tiny. Alan is nice. Anne is sad. Anne is poor. If someone is not big then they are sad. If someone is not bad then they are kind. If someone is nice then they are smart. If someone is smart and not rough then they are clever. If someone is sad and not big then they are dull. If someone is dull then they are little. If someone is little then they are thin. All thin people are bad. If someone is small and tiny then they are rough. If someone is rough and not smart then they are poor. If someone is poor then they are fashion. If someone is fashion then they are energetic. If someone is energetic then they are young. If someone is kind then they are wealthy. If someone is wealthy then they are quiet. If someone is quiet then they are smart. All smart people are wealthy. If someone is clever then they are famous. If someone is famous then they are old. All old people are experienced. Question: Harry is bad. \nAnswer: "}], "ideal": "true", "id_string": "NegationRule-D5-22331"}
{"input": [{"role": "system", "content": "Instructions: You will be presented with a passage and a question about that passage. You need to answer true or false to the question. Read the question thoroughly and answer true or false. Read the passage thoroughly to ensure you know what the passage entails and you need to use 5 rules to answer the question."}, {"role": "user", "content": "\nPassage: Harry is huge. Harry is strong. Erin is small. Erin is tiny. Alan is nice. Anne is sad. Anne is poor. If someone is not big then they are sad. If someone is not bad then they are kind. If someone is nice then they are smart. If someone is smart and not rough then they are clever. If someone is sad and not big then they are dull. If someone is dull then they are little. If someone is little then they are thin. All thin people are bad. If someone is small and tiny then they are rough. If someone is rough and not smart then they are poor. If someone is poor then they are fashion. If someone is fashion then they are energetic. If someone is energetic then they are young. If someone is kind then they are wealthy. If someone is wealthy then they are quiet. If someone is quiet then they are smart. All smart people are wealthy. If someone is clever then they are famous. If someone is famous then they are old. All old people are experienced. Question: Harry is not bad. \nAnswer: "}], "ideal": "false", "id_string": "NegationRule-D5-22332"}
{"input": [{"role": "system", "content": "Instructions: You will be presented with a passage and a question about that passage. You need to answer true or false to the question. Read the question thoroughly and answer true or false. Read the passage thoroughly to ensure you know what the passage entails and you need to use 5 rules to answer the question."}, {"role": "user", "content": "\nPassage: Harry is huge. Harry is strong. Erin is small. Erin is tiny. Alan is nice. Anne is sad. Anne is poor. If someone is not big then they are sad. If someone is not bad then they are kind. If someone is nice then they are smart. If someone is smart and not rough then they are clever. If someone is sad and not big then they are dull. If someone is dull then they are little. If someone is little then they are thin. All thin people are bad. If someone is small and tiny then they are rough. If someone is rough and not smart then they are poor. If someone is poor then they are fashion. If someone is fashion then they are energetic. If someone is energetic then they are young. If someone is kind then they are wealthy. If someone is wealthy then they are quiet. If someone is quiet then they are smart. All smart people are wealthy. If someone is clever then they are famous. If someone is famous then they are old. All old people are experienced. Question: Anne is wealthy. \nAnswer: "}], "ideal": "true", "id_string": "NegationRule-D5-22333"}
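
Each sample above is one JSON object per line with an `input` list of chat messages, an `ideal` answer, and an `id_string`. A minimal sketch of a sanity check for that shape follows; the function name `validate_sample` is hypothetical, and the accepted roles and `ideal` values are assumptions drawn only from the samples shown here.

```python
import json


def validate_sample(line: str) -> dict:
    """Parse one JSONL line and check the fields used in the samples above."""
    sample = json.loads(line)
    messages = sample["input"]
    if not isinstance(messages, list) or not messages:
        raise ValueError("'input' must be a non-empty list of chat messages")
    for message in messages:
        # The samples in this PR only use "system" and "user" messages.
        if message.get("role") not in {"system", "user"}:
            raise ValueError(f"unexpected role: {message.get('role')!r}")
        if not isinstance(message.get("content"), str):
            raise ValueError("each message needs a string 'content'")
    # The samples in this PR always use lowercase "true" or "false".
    if sample["ideal"] not in {"true", "false"}:
        raise ValueError(f"unexpected ideal: {sample['ideal']!r}")
    if not isinstance(sample.get("id_string"), str):
        raise ValueError("'id_string' must be a string")
    return sample


example_line = json.dumps(
    {
        "input": [
            {"role": "system", "content": "Instructions: answer true or false."},
            {"role": "user", "content": "\nPassage: ... Question: ... \nAnswer: "},
        ],
        "ideal": "true",
        "id_string": "NegationRule-Animal-D2-11451",
    }
)
checked = validate_sample(example_line)
```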

@14H034160212 14H034160212 changed the title from "A Larger Deep Multi-Step Deductive Reasoning Datasets over Natural Language with Multi-Step Deductive Reasoning Instruction For OpenAI EVAL" to "A Larger Deep Multi-Step Deductive Reasoning Dataset over Natural Language with Multi-Step Deductive Reasoning Instruction For OpenAI EVAL" on Apr 12, 2023
@14H034160212
Contributor Author

Re-uploaded the files using Git LFS and added eval JSON data examples.

@14H034160212
Contributor Author

@jorge-openai Hi Jorge, can you give me some tips or advice on how to fix the error? I am a bit confused about the error from the check. Any help would be much appreciated. Thanks in advance.

@jorge-openai
Collaborator

jorge-openai commented May 23, 2023

Hi @14H034160212, I have flagged a few changes for this PR:

  • The jsonl file should be reduced to a maximum of 2k rows; it currently has more than 10k.
  • Changes to .gitattributes won't be merged so you'll need to revert that.
  • More specific guidance should be given for the output when using Match, because 'false' and 'False' are different values in this eval type. FuzzyMatch would be a better choice in this case.
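
To illustrate the last point, here is a simplified sketch of why a strict comparison treats 'False' and 'false' differently while a normalized comparison does not. It is not the evals library's implementation of Match or FuzzyMatch; both function names are hypothetical and the normalization rules are assumptions chosen for the example.

```python
def strict_compare(completion: str, ideal: str) -> bool:
    """Case-sensitive, character-for-character comparison."""
    return completion == ideal


def lenient_compare(completion: str, ideal: str) -> bool:
    """Compare after trimming whitespace, trailing punctuation, and case."""

    def normalize(text: str) -> str:
        return text.strip().rstrip(".!").lower()

    return normalize(completion) == normalize(ideal)


strict_result = strict_compare("False", "false")
lenient_result = lenient_compare("False.", "false")
```

Here `strict_result` is `False` because of the capital letter alone, while `lenient_result` is `True`.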

Concerning the error, I'll try to take a look at it later today to see if I can give you any advice. First, though, try uploading a smaller file: if there is an error there, it will be easier to find, and we can add more samples later.
Thanks for your patience!
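
One way to get under the 2k-row limit while keeping the four reasoning depths balanced is sketched below. It assumes the depth can be read from the `-D<n>-` part of `id_string`, as in the samples in this PR; the function names and the equal-share policy are illustrative choices, not something the reviewers prescribed.

```python
import random
import re
from collections import defaultdict

# Matches the depth marker in ids such as "NegationRule-Animal-D2-11451".
DEPTH_PATTERN = re.compile(r"-D(\d+)-")


def depth_of(id_string: str) -> int:
    """Extract the reasoning depth encoded in an id_string."""
    match = DEPTH_PATTERN.search(id_string)
    if match is None:
        raise ValueError(f"no depth marker in {id_string!r}")
    return int(match.group(1))


def subsample_by_depth(samples: list, max_rows: int = 2000, seed: int = 0) -> list:
    """Keep at most max_rows samples, split evenly across reasoning depths."""
    by_depth = defaultdict(list)
    for sample in samples:
        by_depth[depth_of(sample["id_string"])].append(sample)
    if not by_depth:
        return []
    per_depth = max_rows // len(by_depth)
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    kept = []
    for depth in sorted(by_depth):
        group = by_depth[depth]
        kept.extend(rng.sample(group, min(per_depth, len(group))))
    return kept
```

With four depths and `max_rows=2000`, this keeps up to 500 samples per depth.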

@14H034160212
Contributor Author

Hi @jorge-openai, Many thanks for your reply! I have updated the files and data according to your suggestion.

Collaborator

@jorge-openai jorge-openai left a comment

Thanks for the quick response, there are still a couple of things to update, but we are on the right path. Let's see if this will get us through the CI tests.

14H034160212 and others added 2 commits May 24, 2023 13:26
…ing.yaml

Co-authored-by: Jorge <133797909+jorge-openai@users.noreply.github.com>
…ing.yaml

Co-authored-by: Jorge <133797909+jorge-openai@users.noreply.github.com>
@14H034160212
Contributor Author

14H034160212 commented May 24, 2023

Hi @jorge-openai, thanks a lot for your quick reply too! I have just committed your suggestions to the PR. It seems the "Run new evals / check_files (pull_request)" check is failing again.

@jorge-openai
Collaborator

Thanks for the patience, the issue is not related to this particular eval, so we'll try a workaround for the CI to pass in this case and merge if all goes as expected.

Please pull and merge main into your branch, then commit. This will allow us to run the CI in your eval.

I'll see if I have assigned the other PR you mentioned, but just in case do the same.

@14H034160212
Contributor Author

14H034160212 commented May 24, 2023

Hi @jorge-openai, Thanks a lot for your reply. I have pulled and merged the latest code from openai/evals into this branch now. I also did the same operation for my next PR.

@14H034160212
Contributor Author

Hi @jorge-openai, thanks a lot for your help! It seems all checks have passed. Is the code ready to be merged?

@jorge-openai
Collaborator

Hi, .gitattributes is still appearing in the changes to be merged. Did you pull the latest version of that file from main too?

@14H034160212
Contributor Author

14H034160212 commented May 25, 2023

Hi @jorge-openai, thanks a lot for the reminder. I have pulled the latest code, and .gitattributes no longer appears in the changes to be merged. Can you re-review this PR? Thanks!

Collaborator

@jorge-openai jorge-openai left a comment

Seems everything is good to merge now. Thanks for the patience and contribution!

@andrew-openai andrew-openai merged commit f077b82 into openai:main May 27, 2023
2 checks passed
h13e pushed a commit to h13e/evals that referenced this pull request May 29, 2023
…guage with Multi-Step Deductive Reasoning Instruction For OpenAI EVAL (openai#651)

# Thank you for contributing an eval! ♥️

🚨 Please make sure your PR follows these guidelines, __failure to follow
the guidelines below will result in the PR being closed automatically__.
Note that even if the criteria are met, that does not guarantee the PR
will be merged nor GPT-4 access granted. 🚨

__PLEASE READ THIS__:

In order for a PR to be merged, it must fail on GPT-4. We are aware that
right now, users do not have access, so you will not be able to tell if
the eval fails or not. Please run your eval with GPT-3.5-Turbo, but keep
in mind as we run the eval, if GPT-4 gets higher than 90% on the eval,
we will likely reject since GPT-4 is already capable of completing the
task.

We plan to roll out a way for users submitting evals to see the eval
performance on GPT-4 soon. Stay tuned! Until then, you will not be able
to see the eval performance on GPT-4. **Starting April 10, the minimum
eval count is 15 samples, we hope this makes it easier to create and
contribute evals.**

## Eval details 📑
### Eval name
[pararule-plus-multi-step-deductive-reasoning]

### Eval description

[We proposed a multi-step deductive reasoning instruction for the
[PARARULE-Plus dataset](https://github.com/Strong-AI-Lab/PARARULE-Plus),
which is a larger deep multi-step deductive reasoning dataset over
natural language. We also submitted the PARARULE-Plus into the
`Huggingface/Datasets`. Here is the
[link](https://huggingface.co/datasets/qbao775/PARARULE-Plus).
PARARULE-Plus dataset addresses the reasoning depth imbalance issue from
the RuleTaker dataset. The dataset specifically increases the dataset on
the deep reasoning depth, including depth=2, 3, 4, 5. In this pull
request, we submit a dataset that includes `2708`, `2694`, `2704`, and
`2692` questions for Depth=2, Depth=3, Depth=4, and Depth=5,
respectively. Furthermore, we evaluate ChatGPT, and it fails on this
dataset. Here is the [tweet
link](https://twitter.com/qiming_bao/status/1615510552088018944).

### What makes this a useful eval?

[Logical reasoning ability is a fascinating topic in the NLP community.
We hope to see whether ChatGPT and GPT-4 shed more light on this topic.]

## Criteria for a good eval ✅

Below are some of the criteria we look for in a good eval. In general,
we are seeking cases where the model does not do a good job despite
being capable of generating a good response (note that there are some
things large language models cannot do, so those would not make good
evals).

Your eval should be:

- [x] Thematically consistent: The eval should be thematically
consistent. We'd like to see a number of prompts all demonstrating some
particular failure mode. For example, we can create an eval on cases
where the model fails to reason about the physical world.
- [x] Contains failures where a human can do the task, but either GPT-4
or GPT-3.5-Turbo could not.
- [x] Includes good signal around what is the right behavior. This means
either a correct answer for `Basic` evals or the `Fact` Model-graded
eval, or an exhaustive rubric for evaluating answers for the `Criteria`
Model-graded eval.
- [x] **Include at least 15 high quality examples.**

If there is anything else that makes your eval worth including, please
document it below.

### Unique eval value

> Insert what makes your eval high quality that was not mentioned above.
(Not required)

## Eval structure 🏗️

Your eval should
- [x] Check that your data is in `evals/registry/data/{name}`
- [x] Check that your yaml is registered at
`evals/registry/evals/{name}.yaml`
- [x] Ensure you have the right to use the data you submit via this eval

(For now, we will only be approving evals that use one of the existing
eval classes. You may still write custom eval classes for your own
cases, and we may consider merging them in the future.)

## Final checklist 👀

### Submission agreement

By contributing to Evals, you are agreeing to make your evaluation logic
and data under the same MIT license as this repository. You must have
adequate rights to upload any data used in an Eval. OpenAI reserves the
right to use this data in future service improvements to our product.
Contributions to OpenAI Evals will be subject to our usual Usage
Policies (https://platform.openai.com/docs/usage-policies).

- [x] I agree that my submission will be made available under an MIT
license and complies with OpenAI's usage policies.

### Email address validation

If your submission is accepted, we will be granting GPT-4 access to a
limited number of contributors. Access will be given to the email
address associated with the merged pull request.

- [x] I acknowledge that GPT-4 access will only be granted, if
applicable, to the email address used for my merged pull request.

### Limited availability acknowledgement

We know that you might be excited to contribute to OpenAI's mission,
help improve our models, and gain access to GPT-4. However, due to the
requirements mentioned above and high volume of submissions, we will not
be able to accept all submissions and thus not grant everyone who opens
a PR GPT-4 access. We know this is disappointing, but we hope to set the
right expectation before you open this PR.

- [x] I understand that opening a PR, even if it meets the requirements
above, does not guarantee the PR will be merged nor GPT-4 access
granted.

### Submit eval

- [x] I have filled out all required fields in the evals PR form
- [ ] (Ignore if not submitting code) I have run `pip install
pre-commit; pre-commit install` and have verified that `black`, `isort`,
and `autoflake` are running when I commit and push

Failure to fill out all required fields will result in the PR being
closed.

### Eval JSON data 

Since we are using Git LFS, we are asking eval submitters to add in as
many Eval Samples (at least 5) from their contribution here:

<details>
  <summary>View evals in JSON</summary>

  ### Eval
  ```
{"input": [{"role": "system", "content": "Instructions: You will be
presented with a passage and a question about that passage. You need to
answer true or false to the question. Read the question thoroughly and
answer true or false. Read the passage thoroughly to ensure you know
what the passage entails and you need to use 2 rules to answer the
question."}, {"role": "user", "content": "\nPassage: The lion is slow.
The lion is lazy. The wolf is strong. The wolf is fierce. The lion
chases the mouse. The wolf likes the dog. The mouse is smart. The dog is
smart. The dog is cute. The dog is small. If something is not smart then
it needs the mouse. If something needs the mouse then it is rough. If
something is not kind then it is strong. If something is not big then it
is furry. If something is cute then it is small. If something is small
and not awful then it is lovely. If something is strong and not kind
then it is heavy. If something is slow and lazy then it is awful. If
something is awful and not small then it is fierce. All furry animals
are beautiful. Question: The lion is heavy. \nAnswer: "}], "ideal":
"true", "id_string": "NegationRule-Animal-D2-11451"}
{"input": [{"role": "system", "content": "Instructions: You will be
presented with a passage and a question about that passage. You need to
answer true or false to the question. Read the question thoroughly and
answer true or false. Read the passage thoroughly to ensure you know
what the passage entails and you need to use 2 rules to answer the
question."}, {"role": "user", "content": "\nPassage: The lion is slow.
The lion is lazy. The wolf is strong. The wolf is fierce. The lion
chases the mouse. The wolf likes the dog. The mouse is smart. The dog is
smart. The dog is cute. The dog is small. If something is not smart then
it needs the mouse. If something needs the mouse then it is rough. If
something is not kind then it is strong. If something is not big then it
is furry. If something is cute then it is small. If something is small
and not awful then it is lovely. If something is strong and not kind
then it is heavy. If something is slow and lazy then it is awful. If
something is awful and not small then it is fierce. All furry animals
are beautiful. Question: The lion is not heavy. \nAnswer: "}], "ideal":
"false", "id_string": "NegationRule-Animal-D2-11452"}
{"input": [{"role": "system", "content": "Instructions: You will be
presented with a passage and a question about that passage. You need to
answer true or false to the question. Read the question thoroughly and
answer true or false. Read the passage thoroughly to ensure you know
what the passage entails and you need to use 3 rules to answer the
question."}, {"role": "user", "content": "\nPassage: The lion is slow.
The lion is sleepy. The tiger is fierce. The tiger is big. The lion
likes the dog. The tiger needs the mouse. The dog is smart. The mouse is
smart. The mouse is small. The mouse is cute. If something is not smart
then it sees the dog. If something sees the dog then it is lazy. If
something is not kind then it is fierce. If something is not horrible
then it is furry. If something is small then it is cute. If something is
cute and not strong then it is beautiful. If something is fierce and not
kind then it is awful. If something is slow and sleepy then it is
strong. If something is strong and not cute then it is big. If something
is furry then it is lovely. All lovely animals are round. All beautiful
animals are quiet. All awful animals are heavy. All big animals are
horrible. All lazy animals are rough. Question: The lion is rough.
\nAnswer: "}], "ideal": "true", "id_string":
"NegationRule-Animal-D3-10559"}
{"input": [{"role": "system", "content": "Instructions: You will be
presented with a passage and a question about that passage. You need to
answer true or false to the question. Read the question thoroughly and
answer true or false. Read the passage thoroughly to ensure you know
what the passage entails and you need to use 3 rules to answer the
question."}, {"role": "user", "content": "\nPassage: The lion is slow.
The lion is sleepy. The tiger is fierce. The tiger is big. The lion
likes the dog. The tiger needs the mouse. The dog is smart. The mouse is
smart. The mouse is small. The mouse is cute. If something is not smart
then it sees the dog. If something sees the dog then it is lazy. If
something is not kind then it is fierce. If something is not horrible
then it is furry. If something is small then it is cute. If something is
cute and not strong then it is beautiful. If something is fierce and not
kind then it is awful. If something is slow and sleepy then it is
strong. If something is strong and not cute then it is big. If something
is furry then it is lovely. All lovely animals are round. All beautiful
animals are quiet. All awful animals are heavy. All big animals are
horrible. All lazy animals are rough. Question: The lion is not rough.
\nAnswer: "}], "ideal": "false", "id_string":
"NegationRule-Animal-D3-105510"}
{"input": [{"role": "system", "content": "Instructions: You will be
presented with a passage and a question about that passage. You need to
answer true or false to the question. Read the question thoroughly and
answer true or false. Read the passage thoroughly to ensure you know
what the passage entails and you need to use 4 rules to answer the
question."}, {"role": "user", "content": "\nPassage: The snake is slow.
The snake is lazy. The snake is rough. The snake chases the mouse. The
crocodile sees the rabbit. The crocodile is fierce. The crocodile is
big. The mouse is smart. The mouse is quiet. The mouse is nice. The
rabbit is cute. The rabbit is small. The rabbit is adorable. Smart
animals are cute. If something is lazy then it attacks the mouse. If
something attacks the mouse then it is tired. If something is slow and
lazy then it is rough. If something is cute and small then it is
beautiful. If something is fierce and big then it is heavy. If something
is rough then it is dull. If something is dull then it is sleepy. All
sleepy animals are big. If something is cute then it is small. If
something is small then it is adorable. If something is adorable then it
is nice. All adorable animals are kind. If something is heavy then it is
awful. All awful animals are obese. All obese animals are lazy. If
something is beautiful then it is lovely. All lovely animals are furry.
All furry animals are slow. If something is tired then it is strong. All
strong animals are reckless. Question: The rabbit is not slow. \nAnswer:
"}], "ideal": "false", "id_string": "NonNegationRule-Animal-D4-25898"}
{"input": [{"role": "system", "content": "Instructions: You will be
presented with a passage and a question about that passage. You need to
answer true or false to the question. Read the question thoroughly and
answer true or false. Read the passage thoroughly to ensure you know
what the passage entails and you need to use 4 rules to answer the
question."}, {"role": "user", "content": "\nPassage: The snake is slow.
The snake is lazy. The snake is rough. The snake chases the mouse. The
crocodile sees the rabbit. The crocodile is fierce. The crocodile is
big. The mouse is smart. The mouse is quiet. The mouse is nice. The
rabbit is cute. The rabbit is small. The rabbit is adorable. Smart
animals are cute. If something is lazy then it attacks the mouse. If
something attacks the mouse then it is tired. If something is slow and
lazy then it is rough. If something is cute and small then it is
beautiful. If something is fierce and big then it is heavy. If something
is rough then it is dull. If something is dull then it is sleepy. All
sleepy animals are big. If something is cute then it is small. If
something is small then it is adorable. If something is adorable then it
is nice. All adorable animals are kind. If something is heavy then it is
awful. All awful animals are obese. All obese animals are lazy. If
something is beautiful then it is lovely. All lovely animals are furry.
All furry animals are slow. If something is tired then it is strong. All
strong animals are reckless. Question: The snake is reckless. \nAnswer:
"}], "ideal": "true", "id_string": "NonNegationRule-Animal-D4-25899"}
{"input": [{"role": "system", "content": "Instructions: You will be
presented with a passage and a question about that passage. You need to
answer true or false to the question. Read the question thoroughly and
answer true or false. Read the passage thoroughly to ensure you know
what the passage entails and you need to use 5 rules to answer the
question."}, {"role": "user", "content": "\nPassage: The snake is dull.
The snake is slow. The bald eagle is awful. The bald eagle is powerful.
The snake attacks the rabbit. The bald eagle likes the squirrel. The
rabbit is quiet. The squirrel is quiet. The squirrel is beautiful. The
squirrel is cute. If something is not quiet then it visits the rabbit.
If something visits the rabbit then it is rough. If something is not
kind then it is awful. If something is not fierce then it is furry. If
something is beautiful then it is cute. If something is cute and not
angry then it is small. If something is awful and not kind then it is
horrible. If something is dull and slow then it is angry. If something
is angry and not cute then it is powerful. If something is furry then it
is lovely. If something is lovely then it is clever. If something is
clever then it is kind. All kind animals are smart. All small animals
are round. If something is round then it is nice. All nice animals are
funny. If something is horrible then it is heavy. If something is heavy
then it is tired. All tired animals are reckless. If something is
powerful then it is fierce. If something is fierce then it is lazy. All
lazy animals are boring. All rough animals are sleepy. If something is
sleepy then it is strong. All strong animals are big. Question: The
snake is big. \nAnswer: "}], "ideal": "true", "id_string":
"NegationRule-Animal-D5-23709"}
{"input": [{"role": "system", "content": "Instructions: You will be
presented with a passage and a question about that passage. You need to
answer true or false to the question. Read the question thoroughly and
answer true or false. Read the passage thoroughly to ensure you know
what the passage entails and you need to use 5 rules to answer the
question."}, {"role": "user", "content": "\nPassage: The snake is dull.
The snake is slow. The bald eagle is awful. The bald eagle is powerful.
The snake attacks the rabbit. The bald eagle likes the squirrel. The
rabbit is quiet. The squirrel is quiet. The squirrel is beautiful. The
squirrel is cute. If something is not quiet then it visits the rabbit.
If something visits the rabbit then it is rough. If something is not
kind then it is awful. If something is not fierce then it is furry. If
something is beautiful then it is cute. If something is cute and not
angry then it is small. If something is awful and not kind then it is
horrible. If something is dull and slow then it is angry. If something
is angry and not cute then it is powerful. If something is furry then it
is lovely. If something is lovely then it is clever. If something is
clever then it is kind. All kind animals are smart. All small animals
are round. If something is round then it is nice. All nice animals are
funny. If something is horrible then it is heavy. If something is heavy
then it is tired. All tired animals are reckless. If something is
powerful then it is fierce. If something is fierce then it is lazy. All
lazy animals are boring. All rough animals are sleepy. If something is
sleepy then it is strong. All strong animals are big. Question: The
snake is not big. \nAnswer: "}], "ideal": "false", "id_string":
"NegationRule-Animal-D5-237010"}
{"input": [{"role": "system", "content": "Instructions: You will be
presented with a passage and a question about that passage. You need to
answer true or false to the question. Read the question thoroughly and
answer true or false. Read the passage thoroughly to ensure you know
what the passage entails and you need to use 5 rules to answer the
question."}, {"role": "user", "content": "\nPassage: Harry is huge.
Harry is strong. Erin is small. Erin is tiny. Alan is nice. Anne is sad.
Anne is poor. If someone is not big then they are sad. If someone is not
bad then they are kind. If someone is nice then they are smart. If
someone is smart and not rough then they are clever. If someone is sad
and not big then they are dull. If someone is dull then they are little.
If someone is little then they are thin. All thin people are bad. If
someone is small and tiny then they are rough. If someone is rough and
not smart then they are poor. If someone is poor then they are fashion.
If someone is fashion then they are energetic. If someone is energetic
then they are young. If someone is kind then they are wealthy. If
someone is wealthy then they are quiet. If someone is quiet then they
are smart. All smart people are wealthy. If someone is clever then they
are famous. If someone is famous then they are old. All old people are
experienced. Question: Harry is bad. \nAnswer: "}], "ideal": "true",
"id_string": "NegationRule-D5-22331"}
{"input": [{"role": "system", "content": "Instructions: You will be
presented with a passage and a question about that passage. You need to
answer true or false to the question. Read the question thoroughly and
answer true or false. Read the passage thoroughly to ensure you know
what the passage entails and you need to use 5 rules to answer the
question."}, {"role": "user", "content": "\nPassage: Harry is huge.
Harry is strong. Erin is small. Erin is tiny. Alan is nice. Anne is sad.
Anne is poor. If someone is not big then they are sad. If someone is not
bad then they are kind. If someone is nice then they are smart. If
someone is smart and not rough then they are clever. If someone is sad
and not big then they are dull. If someone is dull then they are little.
If someone is little then they are thin. All thin people are bad. If
someone is small and tiny then they are rough. If someone is rough and
not smart then they are poor. If someone is poor then they are fashion.
If someone is fashion then they are energetic. If someone is energetic
then they are young. If someone is kind then they are wealthy. If
someone is wealthy then they are quiet. If someone is quiet then they
are smart. All smart people are wealthy. If someone is clever then they
are famous. If someone is famous then they are old. All old people are
experienced. Question: Harry is not bad. \nAnswer: "}], "ideal":
"false", "id_string": "NegationRule-D5-22332"}
{"input": [{"role": "system", "content": "Instructions: You will be
presented with a passage and a question about that passage. You need to
answer true or false to the question. Read the question thoroughly and
answer true or false. Read the passage thoroughly to ensure you know
what the passage entails and you need to use 5 rules to answer the
question."}, {"role": "user", "content": "\nPassage: Harry is huge.
Harry is strong. Erin is small. Erin is tiny. Alan is nice. Anne is sad.
Anne is poor. If someone is not big then they are sad. If someone is not
bad then they are kind. If someone is nice then they are smart. If
someone is smart and not rough then they are clever. If someone is sad
and not big then they are dull. If someone is dull then they are little.
If someone is little then they are thin. All thin people are bad. If
someone is small and tiny then they are rough. If someone is rough and
not smart then they are poor. If someone is poor then they are fashion.
If someone is fashion then they are energetic. If someone is energetic
then they are young. If someone is kind then they are wealthy. If
someone is wealthy then they are quiet. If someone is quiet then they
are smart. All smart people are wealthy. If someone is clever then they
are famous. If someone is famous then they are old. All old people are
experienced. Question: Anne is wealthy. \nAnswer: "}], "ideal": "true",
"id_string": "NegationRule-D5-22333"}
  ```
</details>
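
The samples above share one record shape: an `input` list holding a system message and a user message, an `ideal` answer of `"true"` or `"false"`, and an `id_string` whose `D<n>` segment appears to match the number of rules named in the system prompt. A minimal sketch of a sanity check for one record follows. The function name `check_sample` is hypothetical, the depth convention is inferred from these samples rather than documented, and the check assumes each record has been rejoined onto a single line, since the samples are hard-wrapped here.

```python
import json
import re


def check_sample(line: str) -> int:
    """Validate one JSONL record and return its reasoning depth.

    Raises ValueError when the record does not follow the shape of the
    samples shown above.
    """
    record = json.loads(line)

    messages = record["input"]
    roles = [message["role"] for message in messages]
    if roles != ["system", "user"]:
        raise ValueError(f"unexpected roles: {roles}")

    if record["ideal"] not in ("true", "false"):
        raise ValueError(f"unexpected ideal: {record['ideal']!r}")

    # Depth as encoded in the id, e.g. "NegationRule-Animal-D2-11451" -> 2.
    id_match = re.search(r"-D(\d+)-", record["id_string"])
    # Depth as stated in the instruction, e.g. "use 2 rules" -> 2.
    prompt_match = re.search(r"use (\d+) rules", messages[0]["content"])
    if id_match is None or prompt_match is None:
        raise ValueError("could not find a depth in the id or the prompt")

    depth = int(id_match.group(1))
    if depth != int(prompt_match.group(1)):
        raise ValueError("id depth and prompt depth disagree")
    return depth
```

Running `check_sample` over every line of a data file and tallying the returned depths gives a quick view of how the questions are spread across depths.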

---------

Co-authored-by: qiming bao <qiming.bao@xtracta.com>
Co-authored-by: Jorge <133797909+jorge-openai@users.noreply.github.com>
arbreton pushed a commit to arbreton/evals that referenced this pull request Jul 8, 2023
…guage with Multi-Step Deductive Reasoning Instruction For OpenAI EVAL (openai#651)

### Eval JSON data 

Since we are using Git LFS, we are asking eval submitters to add in as
many Eval Samples (at least 5) from their contribution here:

<details>
  <summary>View evals in JSON</summary>

  ### Eval
  ```
{"input": [{"role": "system", "content": "Instructions: You will be
presented with a passage and a question about that passage. You need to
answer true or false to the question. Read the question thoroughly and
answer true or false. Read the passage thoroughly to ensure you know
what the passage entails and you need to use 2 rules to answer the
question."}, {"role": "user", "content": "\nPassage: The lion is slow.
The lion is lazy. The wolf is strong. The wolf is fierce. The lion
chases the mouse. The wolf likes the dog. The mouse is smart. The dog is
smart. The dog is cute. The dog is small. If something is not smart then
it needs the mouse. If something needs the mouse then it is rough. If
something is not kind then it is strong. If something is not big then it
is furry. If something is cute then it is small. If something is small
and not awful then it is lovely. If something is strong and not kind
then it is heavy. If something is slow and lazy then it is awful. If
something is awful and not small then it is fierce. All furry animals
are beautiful. Question: The lion is heavy. \nAnswer: "}], "ideal":
"true", "id_string": "NegationRule-Animal-D2-11451"}
{"input": [{"role": "system", "content": "Instructions: You will be
presented with a passage and a question about that passage. You need to
answer true or false to the question. Read the question thoroughly and
answer true or false. Read the passage thoroughly to ensure you know
what the passage entails and you need to use 2 rules to answer the
question."}, {"role": "user", "content": "\nPassage: The lion is slow.
The lion is lazy. The wolf is strong. The wolf is fierce. The lion
chases the mouse. The wolf likes the dog. The mouse is smart. The dog is
smart. The dog is cute. The dog is small. If something is not smart then
it needs the mouse. If something needs the mouse then it is rough. If
something is not kind then it is strong. If something is not big then it
is furry. If something is cute then it is small. If something is small
and not awful then it is lovely. If something is strong and not kind
then it is heavy. If something is slow and lazy then it is awful. If
something is awful and not small then it is fierce. All furry animals
are beautiful. Question: The lion is not heavy. \nAnswer: "}], "ideal":
"false", "id_string": "NegationRule-Animal-D2-11452"}
{"input": [{"role": "system", "content": "Instructions: You will be
presented with a passage and a question about that passage. You need to
answer true or false to the question. Read the question thoroughly and
answer true or false. Read the passage thoroughly to ensure you know
what the passage entails and you need to use 3 rules to answer the
question."}, {"role": "user", "content": "\nPassage: The lion is slow.
The lion is sleepy. The tiger is fierce. The tiger is big. The lion
likes the dog. The tiger needs the mouse. The dog is smart. The mouse is
smart. The mouse is small. The mouse is cute. If something is not smart
then it sees the dog. If something sees the dog then it is lazy. If
something is not kind then it is fierce. If something is not horrible
then it is furry. If something is small then it is cute. If something is
cute and not strong then it is beautiful. If something is fierce and not
kind then it is awful. If something is slow and sleepy then it is
strong. If something is strong and not cute then it is big. If something
is furry then it is lovely. All lovely animals are round. All beautiful
animals are quiet. All awful animals are heavy. All big animals are
horrible. All lazy animals are rough. Question: The lion is rough.
\nAnswer: "}], "ideal": "true", "id_string":
"NegationRule-Animal-D3-10559"}
{"input": [{"role": "system", "content": "Instructions: You will be
presented with a passage and a question about that passage. You need to
answer true or false to the question. Read the question thoroughly and
answer true or false. Read the passage thoroughly to ensure you know
what the passage entails and you need to use 3 rules to answer the
question."}, {"role": "user", "content": "\nPassage: The lion is slow.
The lion is sleepy. The tiger is fierce. The tiger is big. The lion
likes the dog. The tiger needs the mouse. The dog is smart. The mouse is
smart. The mouse is small. The mouse is cute. If something is not smart
then it sees the dog. If something sees the dog then it is lazy. If
something is not kind then it is fierce. If something is not horrible
then it is furry. If something is small then it is cute. If something is
cute and not strong then it is beautiful. If something is fierce and not
kind then it is awful. If something is slow and sleepy then it is
strong. If something is strong and not cute then it is big. If something
is furry then it is lovely. All lovely animals are round. All beautiful
animals are quiet. All awful animals are heavy. All big animals are
horrible. All lazy animals are rough. Question: The lion is not rough.
\nAnswer: "}], "ideal": "false", "id_string":
"NegationRule-Animal-D3-105510"}
{"input": [{"role": "system", "content": "Instructions: You will be
presented with a passage and a question about that passage. You need to
answer true or false to the question. Read the question thoroughly and
answer true or false. Read the passage thoroughly to ensure you know
what the passage entails and you need to use 4 rules to answer the
question."}, {"role": "user", "content": "\nPassage: The snake is slow.
The snake is lazy. The snake is rough. The snake chases the mouse. The
crocodile sees the rabbit. The crocodile is fierce. The crocodile is
big. The mouse is smart. The mouse is quiet. The mouse is nice. The
rabbit is cute. The rabbit is small. The rabbit is adorable. Smart
animals are cute. If something is lazy then it attacks the mouse. If
something attacks the mouse then it is tired. If something is slow and
lazy then it is rough. If something is cute and small then it is
beautiful. If something is fierce and big then it is heavy. If something
is rough then it is dull. If something is dull then it is sleepy. All
sleepy animals are big. If something is cute then it is small. If
something is small then it is adorable. If something is adorable then it
is nice. All adorable animals are kind. If something is heavy then it is
awful. All awful animals are obese. All obese animals are lazy. If
something is beautiful then it is lovely. All lovely animals are furry.
All furry animals are slow. If something is tired then it is strong. All
strong animals are reckless. Question: The rabbit is not slow. \nAnswer:
"}], "ideal": "false", "id_string": "NonNegationRule-Animal-D4-25898"}
{"input": [{"role": "system", "content": "Instructions: You will be
presented with a passage and a question about that passage. You need to
answer true or false to the question. Read the question thoroughly and
answer true or false. Read the passage thoroughly to ensure you know
what the passage entails and you need to use 4 rules to answer the
question."}, {"role": "user", "content": "\nPassage: The snake is slow.
The snake is lazy. The snake is rough. The snake chases the mouse. The
crocodile sees the rabbit. The crocodile is fierce. The crocodile is
big. The mouse is smart. The mouse is quiet. The mouse is nice. The
rabbit is cute. The rabbit is small. The rabbit is adorable. Smart
animals are cute. If something is lazy then it attacks the mouse. If
something attacks the mouse then it is tired. If something is slow and
lazy then it is rough. If something is cute and small then it is
beautiful. If something is fierce and big then it is heavy. If something
is rough then it is dull. If something is dull then it is sleepy. All
sleepy animals are big. If something is cute then it is small. If
something is small then it is adorable. If something is adorable then it
is nice. All adorable animals are kind. If something is heavy then it is
awful. All awful animals are obese. All obese animals are lazy. If
something is beautiful then it is lovely. All lovely animals are furry.
All furry animals are slow. If something is tired then it is strong. All
strong animals are reckless. Question: The snake is reckless. \nAnswer:
"}], "ideal": "true", "id_string": "NonNegationRule-Animal-D4-25899"}
{"input": [{"role": "system", "content": "Instructions: You will be
presented with a passage and a question about that passage. You need to
answer true or false to the question. Read the question thoroughly and
answer true or false. Read the passage thoroughly to ensure you know
what the passage entails and you need to use 5 rules to answer the
question."}, {"role": "user", "content": "\nPassage: The snake is dull.
The snake is slow. The bald eagle is awful. The bald eagle is powerful.
The snake attacks the rabbit. The bald eagle likes the squirrel. The
rabbit is quiet. The squirrel is quiet. The squirrel is beautiful. The
squirrel is cute. If something is not quiet then it visits the rabbit.
If something visits the rabbit then it is rough. If something is not
kind then it is awful. If something is not fierce then it is furry. If
something is beautiful then it is cute. If something is cute and not
angry then it is small. If something is awful and not kind then it is
horrible. If something is dull and slow then it is angry. If something
is angry and not cute then it is powerful. If something is furry then it
is lovely. If something is lovely then it is clever. If something is
clever then it is kind. All kind animals are smart. All small animals
are round. If something is round then it is nice. All nice animals are
funny. If something is horrible then it is heavy. If something is heavy
then it is tired. All tired animals are reckless. If something is
powerful then it is fierce. If something is fierce then it is lazy. All
lazy animals are boring. All rough animals are sleepy. If something is
sleepy then it is strong. All strong animals are big. Question: The
snake is big. \nAnswer: "}], "ideal": "true", "id_string":
"NegationRule-Animal-D5-23709"}
{"input": [{"role": "system", "content": "Instructions: You will be
presented with a passage and a question about that passage. You need to
answer true or false to the question. Read the question thoroughly and
answer true or false. Read the passage thoroughly to ensure you know
what the passage entails and you need to use 5 rules to answer the
question."}, {"role": "user", "content": "\nPassage: The snake is dull.
The snake is slow. The bald eagle is awful. The bald eagle is powerful.
The snake attacks the rabbit. The bald eagle likes the squirrel. The
rabbit is quiet. The squirrel is quiet. The squirrel is beautiful. The
squirrel is cute. If something is not quiet then it visits the rabbit.
If something visits the rabbit then it is rough. If something is not
kind then it is awful. If something is not fierce then it is furry. If
something is beautiful then it is cute. If something is cute and not
angry then it is small. If something is awful and not kind then it is
horrible. If something is dull and slow then it is angry. If something
is angry and not cute then it is powerful. If something is furry then it
is lovely. If something is lovely then it is clever. If something is
clever then it is kind. All kind animals are smart. All small animals
are round. If something is round then it is nice. All nice animals are
funny. If something is horrible then it is heavy. If something is heavy
then it is tired. All tired animals are reckless. If something is
powerful then it is fierce. If something is fierce then it is lazy. All
lazy animals are boring. All rough animals are sleepy. If something is
sleepy then it is strong. All strong animals are big. Question: The
snake is not big. \nAnswer: "}], "ideal": "false", "id_string":
"NegationRule-Animal-D5-237010"}
{"input": [{"role": "system", "content": "Instructions: You will be
presented with a passage and a question about that passage. You need to
answer true or false to the question. Read the question thoroughly and
answer true or false. Read the passage thoroughly to ensure you know
what the passage entails and you need to use 5 rules to answer the
question."}, {"role": "user", "content": "\nPassage: Harry is huge.
Harry is strong. Erin is small. Erin is tiny. Alan is nice. Anne is sad.
Anne is poor. If someone is not big then they are sad. If someone is not
bad then they are kind. If someone is nice then they are smart. If
someone is smart and not rough then they are clever. If someone is sad
and not big then they are dull. If someone is dull then they are little.
If someone is little then they are thin. All thin people are bad. If
someone is small and tiny then they are rough. If someone is rough and
not smart then they are poor. If someone is poor then they are fashion.
If someone is fashion then they are energetic. If someone is energetic
then they are young. If someone is kind then they are wealthy. If
someone is wealthy then they are quiet. If someone is quiet then they
are smart. All smart people are wealthy. If someone is clever then they
are famous. If someone is famous then they are old. All old people are
experienced. Question: Harry is bad. \nAnswer: "}], "ideal": "true",
"id_string": "NegationRule-D5-22331"}
{"input": [{"role": "system", "content": "Instructions: You will be
presented with a passage and a question about that passage. You need to
answer true or false to the question. Read the question thoroughly and
answer true or false. Read the passage thoroughly to ensure you know
what the passage entails and you need to use 5 rules to answer the
question."}, {"role": "user", "content": "\nPassage: Harry is huge.
Harry is strong. Erin is small. Erin is tiny. Alan is nice. Anne is sad.
Anne is poor. If someone is not big then they are sad. If someone is not
bad then they are kind. If someone is nice then they are smart. If
someone is smart and not rough then they are clever. If someone is sad
and not big then they are dull. If someone is dull then they are little.
If someone is little then they are thin. All thin people are bad. If
someone is small and tiny then they are rough. If someone is rough and
not smart then they are poor. If someone is poor then they are fashion.
If someone is fashion then they are energetic. If someone is energetic
then they are young. If someone is kind then they are wealthy. If
someone is wealthy then they are quiet. If someone is quiet then they
are smart. All smart people are wealthy. If someone is clever then they
are famous. If someone is famous then they are old. All old people are
experienced. Question: Harry is not bad. \nAnswer: "}], "ideal":
"false", "id_string": "NegationRule-D5-22332"}
{"input": [{"role": "system", "content": "Instructions: You will be
presented with a passage and a question about that passage. You need to
answer true or false to the question. Read the question thoroughly and
answer true or false. Read the passage thoroughly to ensure you know
what the passage entails and you need to use 5 rules to answer the
question."}, {"role": "user", "content": "\nPassage: Harry is huge.
Harry is strong. Erin is small. Erin is tiny. Alan is nice. Anne is sad.
Anne is poor. If someone is not big then they are sad. If someone is not
bad then they are kind. If someone is nice then they are smart. If
someone is smart and not rough then they are clever. If someone is sad
and not big then they are dull. If someone is dull then they are little.
If someone is little then they are thin. All thin people are bad. If
someone is small and tiny then they are rough. If someone is rough and
not smart then they are poor. If someone is poor then they are fashion.
If someone is fashion then they are energetic. If someone is energetic
then they are young. If someone is kind then they are wealthy. If
someone is wealthy then they are quiet. If someone is quiet then they
are smart. All smart people are wealthy. If someone is clever then they
are famous. If someone is famous then they are old. All old people are
experienced. Question: Anne is wealthy. \nAnswer: "}], "ideal": "true",
"id_string": "NegationRule-D5-22333"}
  ```
</details>
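
Each sample above is one JSON object with an `input` message list, an `ideal` answer, and an `id_string`. The following is a minimal sketch of reading such a record and splitting the user message into its passage and question. The function name `parse_sample` is hypothetical, the passage in `SAMPLE_LINE` is shortened for illustration, and the `Passage: ... Question: ... \nAnswer:` layout is inferred only from the samples shown here.

```python
import json

# One record in the same shape as the samples above. The passage is shortened
# ("...") for illustration; it is not the full passage from the dataset.
SAMPLE_LINE = json.dumps({
    "input": [
        {"role": "system", "content": "Instructions: You will be presented with a passage and a question about that passage. ..."},
        {"role": "user", "content": "\nPassage: Harry is huge. Harry is strong. ... All old people are experienced. Question: Anne is wealthy. \nAnswer: "},
    ],
    "ideal": "true",
    "id_string": "NegationRule-D5-22333",
})


def parse_sample(line: str) -> dict:
    """Parse one JSONL line and split the user message into passage and question.

    Hypothetical helper: assumes a system message followed by a user message,
    and an ideal answer of "true" or "false", as in the samples shown.
    """
    record = json.loads(line)
    system, user = record["input"]
    if (system["role"], user["role"]) != ("system", "user"):
        raise ValueError("expected a system message followed by a user message")
    if record["ideal"] not in ("true", "false"):
        raise ValueError(f"unexpected ideal answer: {record['ideal']!r}")

    # Assumed layout: "\nPassage: <facts and rules> Question: <statement> \nAnswer: "
    passage_part, separator, rest = user["content"].partition("Question:")
    if not separator:
        raise ValueError("user message has no 'Question:' marker")
    question, _, _ = rest.partition("\nAnswer:")
    passage = passage_part.replace("Passage:", "", 1).strip()

    return {
        "id": record["id_string"],
        "passage": passage,
        "question": question.strip(),
        "ideal": record["ideal"],
    }


if __name__ == "__main__":
    parsed = parse_sample(SAMPLE_LINE)
    print(parsed["id"], "|", parsed["question"], "|", parsed["ideal"])
```

Raising on unexpected shapes, rather than skipping records silently, makes a malformed line visible before an eval run.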

---------

Co-authored-by: qiming bao <qiming.bao@xtracta.com>
Co-authored-by: Jorge <133797909+jorge-openai@users.noreply.github.com>
jacobbieker pushed a commit to withmartian/-ARCHIVED--router-evals that referenced this pull request Jan 9, 2024
…guage with Multi-Step Deductive Reasoning Instruction For OpenAI EVAL (openai#651)

Linmj-Judy pushed a commit to TablewareBox/evals that referenced this pull request Feb 27, 2024
…guage with Multi-Step Deductive Reasoning Instruction For OpenAI EVAL (openai#651)

# Thank you for contributing an eval! ♥️

🚨 Please make sure your PR follows these guidelines. __Failure to follow
the guidelines below will result in the PR being closed automatically__.
Note that even if the criteria are met, that does not guarantee that the
PR will be merged or that GPT-4 access will be granted. 🚨

__PLEASE READ THIS__:

For a PR to be merged, the eval must fail on GPT-4. We are aware that
users do not currently have access, so you will not be able to tell
whether the eval fails. Please run your eval with GPT-3.5-Turbo, but keep
in mind that when we run the eval, if GPT-4 scores higher than 90%, we
will likely reject it, since GPT-4 is already capable of completing the
task.

We plan to roll out a way for users submitting evals to see the eval
performance on GPT-4 soon. Stay tuned! Until then, you will not be able
to see the eval performance on GPT-4. **Starting April 10, the minimum
eval count is 15 samples; we hope this makes it easier to create and
contribute evals.**

## Eval details 📑
### Eval name
[pararule-plus-multi-step-deductive-reasoning]

### Eval description

[We propose a multi-step deductive reasoning instruction for the
[PARARULE-Plus dataset](https://github.com/Strong-AI-Lab/PARARULE-Plus),
a larger deep multi-step deductive reasoning dataset over natural
language. We have also published PARARULE-Plus on
[Hugging Face Datasets](https://huggingface.co/datasets/qbao775/PARARULE-Plus).
PARARULE-Plus addresses the reasoning-depth imbalance in the RuleTaker
dataset by adding more examples at the deeper reasoning depths (depth 2,
3, 4, and 5). In this pull request, we submit a dataset that includes
`2708`, `2694`, `2704`, and `2692` questions for Depth=2, Depth=3,
Depth=4, and Depth=5, respectively. We also evaluated ChatGPT, and it
fails on this dataset; see this
[tweet](https://twitter.com/qiming_bao/status/1615510552088018944).]
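
The per-depth counts stated above can be checked against a samples file by tallying the depth encoded in each `id_string`. This is a rough sketch: the helper names `depth_of` and `count_by_depth` are hypothetical, and reading the depth from the `-D<n>-` segment of the id is an assumption based on ids such as `NegationRule-Animal-D5-23709` in the samples shown in this pull request.

```python
import re
from collections import Counter

# Question counts per depth, as stated in the eval description.
STATED_COUNTS = {2: 2708, 3: 2694, 4: 2704, 5: 2692}


def depth_of(id_string: str) -> int:
    """Return the reasoning depth encoded as "-D<n>-" in an id (assumed format)."""
    match = re.search(r"-D(\d+)-", id_string)
    if match is None:
        raise ValueError(f"no depth marker in id: {id_string!r}")
    return int(match.group(1))


def count_by_depth(id_strings) -> Counter:
    """Tally how many ids fall at each reasoning depth."""
    return Counter(depth_of(id_string) for id_string in id_strings)


if __name__ == "__main__":
    # A handful of ids copied from the samples in this pull request.
    ids = [
        "NegationRule-Animal-D2-11451",
        "NegationRule-Animal-D3-10559",
        "NonNegationRule-Animal-D4-25898",
        "NegationRule-D5-22331",
        "NegationRule-D5-22332",
    ]
    print(dict(count_by_depth(ids)))
    print(sum(STATED_COUNTS.values()))  # 10798 questions in total
```

Comparing `count_by_depth` over a full file with `STATED_COUNTS` would flag a depth that is missing or over-represented.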

### What makes this a useful eval?

[Logical reasoning ability is a fascinating topic in the NLP community.
We hope to see whether ChatGPT and GPT-4 shed more light on this topic.]

## Criteria for a good eval ✅

Below are some of the criteria we look for in a good eval. In general,
we are seeking cases where the model does not do a good job despite
being capable of generating a good response (note that there are some
things large language models cannot do, so those would not make good
evals).

Your eval should be:

- [x] Thematically consistent: The eval should be thematically
consistent. We'd like to see a number of prompts all demonstrating some
particular failure mode. For example, we can create an eval on cases
where the model fails to reason about the physical world.
- [x] Contains failures where a human can do the task, but either GPT-4
or GPT-3.5-Turbo could not.
- [x] Includes good signal around what is the right behavior. This means
either a correct answer for `Basic` evals or the `Fact` Model-graded
eval, or an exhaustive rubric for evaluating answers for the `Criteria`
Model-graded eval.
- [x] **Include at least 15 high quality examples.**

If there is anything else that makes your eval worth including, please
document it below.

### Unique eval value

> Insert what makes your eval high quality that was not mentioned above.
(Not required)

## Eval structure 🏗️

Your eval should
- [x] Check that your data is in `evals/registry/data/{name}`
- [x] Check that your yaml is registered at
`evals/registry/evals/{name}.yaml`
- [x] Ensure you have the right to use the data you submit via this eval

(For now, we will only be approving evals that use one of the existing
eval classes. You may still write custom eval classes for your own
cases, and we may consider merging them in the future.)
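
The two path checks above can be automated with a short script. This is
a sketch under assumptions: `missing_registry_paths` is a hypothetical
helper, not part of the evals repository, and it only tests that the
two locations named in the checklist exist.

```python
from pathlib import Path
from typing import List


def missing_registry_paths(repo_root: Path, name: str) -> List[Path]:
    """Return the checklist paths for eval `name` that do not exist yet."""
    expected = [
        repo_root / "evals" / "registry" / "data" / name,
        repo_root / "evals" / "registry" / "evals" / f"{name}.yaml",
    ]
    return [path for path in expected if not path.exists()]


if __name__ == "__main__":
    # Assumes the script is run from the root of a local checkout.
    name = "pararule-plus-multi-step-deductive-reasoning"
    for path in missing_registry_paths(Path("."), name):
        print(f"missing: {path}")
```

An empty result means both locations are present; it says nothing about
whether the YAML contents are valid.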

## Final checklist 👀

### Submission agreement

By contributing to Evals, you are agreeing to make your evaluation logic
and data available under the same MIT license as this repository. You
must have
adequate rights to upload any data used in an Eval. OpenAI reserves the
right to use this data in future service improvements to our product.
Contributions to OpenAI Evals will be subject to our usual Usage
Policies (https://platform.openai.com/docs/usage-policies).

- [x] I agree that my submission will be made available under an MIT
license and complies with OpenAI's usage policies.

### Email address validation

If your submission is accepted, we will be granting GPT-4 access to a
limited number of contributors. Access will be given to the email
address associated with the merged pull request.

- [x] I acknowledge that GPT-4 access will only be granted, if
applicable, to the email address used for my merged pull request.

### Limited availability acknowledgement

We know that you might be excited to contribute to OpenAI's mission,
help improve our models, and gain access to GPT-4. However, due to the
requirements mentioned above and high volume of submissions, we will not
be able to accept all submissions and thus not grant everyone who opens
a PR GPT-4 access. We know this is disappointing, but we hope to set the
right expectation before you open this PR.

- [x] I understand that opening a PR, even if it meets the requirements
above, does not guarantee the PR will be merged nor GPT-4 access
granted.

### Submit eval

- [x] I have filled out all required fields in the evals PR form
- [ ] (Ignore if not submitting code) I have run `pip install
pre-commit; pre-commit install` and have verified that `black`, `isort`,
and `autoflake` are running when I commit and push

Failure to fill out all required fields will result in the PR being
closed.

### Eval JSON data 

Since we are using Git LFS, we are asking eval submitters to add in as
many Eval Samples (at least 5) from their contribution here:

<details>
  <summary>View evals in JSON</summary>

  ### Eval
  ```
{"input": [{"role": "system", "content": "Instructions: You will be
presented with a passage and a question about that passage. You need to
answer true or false to the question. Read the question thoroughly and
answer true or false. Read the passage thoroughly to ensure you know
what the passage entails and you need to use 2 rules to answer the
question."}, {"role": "user", "content": "\nPassage: The lion is slow.
The lion is lazy. The wolf is strong. The wolf is fierce. The lion
chases the mouse. The wolf likes the dog. The mouse is smart. The dog is
smart. The dog is cute. The dog is small. If something is not smart then
it needs the mouse. If something needs the mouse then it is rough. If
something is not kind then it is strong. If something is not big then it
is furry. If something is cute then it is small. If something is small
and not awful then it is lovely. If something is strong and not kind
then it is heavy. If something is slow and lazy then it is awful. If
something is awful and not small then it is fierce. All furry animals
are beautiful. Question: The lion is heavy. \nAnswer: "}], "ideal":
"true", "id_string": "NegationRule-Animal-D2-11451"}
{"input": [{"role": "system", "content": "Instructions: You will be
presented with a passage and a question about that passage. You need to
answer true or false to the question. Read the question thoroughly and
answer true or false. Read the passage thoroughly to ensure you know
what the passage entails and you need to use 2 rules to answer the
question."}, {"role": "user", "content": "\nPassage: The lion is slow.
The lion is lazy. The wolf is strong. The wolf is fierce. The lion
chases the mouse. The wolf likes the dog. The mouse is smart. The dog is
smart. The dog is cute. The dog is small. If something is not smart then
it needs the mouse. If something needs the mouse then it is rough. If
something is not kind then it is strong. If something is not big then it
is furry. If something is cute then it is small. If something is small
and not awful then it is lovely. If something is strong and not kind
then it is heavy. If something is slow and lazy then it is awful. If
something is awful and not small then it is fierce. All furry animals
are beautiful. Question: The lion is not heavy. \nAnswer: "}], "ideal":
"false", "id_string": "NegationRule-Animal-D2-11452"}
{"input": [{"role": "system", "content": "Instructions: You will be
presented with a passage and a question about that passage. You need to
answer true or false to the question. Read the question thoroughly and
answer true or false. Read the passage thoroughly to ensure you know
what the passage entails and you need to use 3 rules to answer the
question."}, {"role": "user", "content": "\nPassage: The lion is slow.
The lion is sleepy. The tiger is fierce. The tiger is big. The lion
likes the dog. The tiger needs the mouse. The dog is smart. The mouse is
smart. The mouse is small. The mouse is cute. If something is not smart
then it sees the dog. If something sees the dog then it is lazy. If
something is not kind then it is fierce. If something is not horrible
then it is furry. If something is small then it is cute. If something is
cute and not strong then it is beautiful. If something is fierce and not
kind then it is awful. If something is slow and sleepy then it is
strong. If something is strong and not cute then it is big. If something
is furry then it is lovely. All lovely animals are round. All beautiful
animals are quiet. All awful animals are heavy. All big animals are
horrible. All lazy animals are rough. Question: The lion is rough.
\nAnswer: "}], "ideal": "true", "id_string":
"NegationRule-Animal-D3-10559"}
{"input": [{"role": "system", "content": "Instructions: You will be
presented with a passage and a question about that passage. You need to
answer true or false to the question. Read the question thoroughly and
answer true or false. Read the passage thoroughly to ensure you know
what the passage entails and you need to use 3 rules to answer the
question."}, {"role": "user", "content": "\nPassage: The lion is slow.
The lion is sleepy. The tiger is fierce. The tiger is big. The lion
likes the dog. The tiger needs the mouse. The dog is smart. The mouse is
smart. The mouse is small. The mouse is cute. If something is not smart
then it sees the dog. If something sees the dog then it is lazy. If
something is not kind then it is fierce. If something is not horrible
then it is furry. If something is small then it is cute. If something is
cute and not strong then it is beautiful. If something is fierce and not
kind then it is awful. If something is slow and sleepy then it is
strong. If something is strong and not cute then it is big. If something
is furry then it is lovely. All lovely animals are round. All beautiful
animals are quiet. All awful animals are heavy. All big animals are
horrible. All lazy animals are rough. Question: The lion is not rough.
\nAnswer: "}], "ideal": "false", "id_string":
"NegationRule-Animal-D3-105510"}
{"input": [{"role": "system", "content": "Instructions: You will be
presented with a passage and a question about that passage. You need to
answer true or false to the question. Read the question thoroughly and
answer true or false. Read the passage thoroughly to ensure you know
what the passage entails and you need to use 4 rules to answer the
question."}, {"role": "user", "content": "\nPassage: The snake is slow.
The snake is lazy. The snake is rough. The snake chases the mouse. The
crocodile sees the rabbit. The crocodile is fierce. The crocodile is
big. The mouse is smart. The mouse is quiet. The mouse is nice. The
rabbit is cute. The rabbit is small. The rabbit is adorable. Smart
animals are cute. If something is lazy then it attacks the mouse. If
something attacks the mouse then it is tired. If something is slow and
lazy then it is rough. If something is cute and small then it is
beautiful. If something is fierce and big then it is heavy. If something
is rough then it is dull. If something is dull then it is sleepy. All
sleepy animals are big. If something is cute then it is small. If
something is small then it is adorable. If something is adorable then it
is nice. All adorable animals are kind. If something is heavy then it is
awful. All awful animals are obese. All obese animals are lazy. If
something is beautiful then it is lovely. All lovely animals are furry.
All furry animals are slow. If something is tired then it is strong. All
strong animals are reckless. Question: The rabbit is not slow. \nAnswer:
"}], "ideal": "false", "id_string": "NonNegationRule-Animal-D4-25898"}
{"input": [{"role": "system", "content": "Instructions: You will be
presented with a passage and a question about that passage. You need to
answer true or false to the question. Read the question thoroughly and
answer true or false. Read the passage thoroughly to ensure you know
what the passage entails and you need to use 4 rules to answer the
question."}, {"role": "user", "content": "\nPassage: The snake is slow.
The snake is lazy. The snake is rough. The snake chases the mouse. The
crocodile sees the rabbit. The crocodile is fierce. The crocodile is
big. The mouse is smart. The mouse is quiet. The mouse is nice. The
rabbit is cute. The rabbit is small. The rabbit is adorable. Smart
animals are cute. If something is lazy then it attacks the mouse. If
something attacks the mouse then it is tired. If something is slow and
lazy then it is rough. If something is cute and small then it is
beautiful. If something is fierce and big then it is heavy. If something
is rough then it is dull. If something is dull then it is sleepy. All
sleepy animals are big. If something is cute then it is small. If
something is small then it is adorable. If something is adorable then it
is nice. All adorable animals are kind. If something is heavy then it is
awful. All awful animals are obese. All obese animals are lazy. If
something is beautiful then it is lovely. All lovely animals are furry.
All furry animals are slow. If something is tired then it is strong. All
strong animals are reckless. Question: The snake is reckless. \nAnswer:
"}], "ideal": "true", "id_string": "NonNegationRule-Animal-D4-25899"}
{"input": [{"role": "system", "content": "Instructions: You will be
presented with a passage and a question about that passage. You need to
answer true or false to the question. Read the question thoroughly and
answer true or false. Read the passage thoroughly to ensure you know
what the passage entails and you need to use 5 rules to answer the
question."}, {"role": "user", "content": "\nPassage: The snake is dull.
The snake is slow. The bald eagle is awful. The bald eagle is powerful.
The snake attacks the rabbit. The bald eagle likes the squirrel. The
rabbit is quiet. The squirrel is quiet. The squirrel is beautiful. The
squirrel is cute. If something is not quiet then it visits the rabbit.
If something visits the rabbit then it is rough. If something is not
kind then it is awful. If something is not fierce then it is furry. If
something is beautiful then it is cute. If something is cute and not
angry then it is small. If something is awful and not kind then it is
horrible. If something is dull and slow then it is angry. If something
is angry and not cute then it is powerful. If something is furry then it
is lovely. If something is lovely then it is clever. If something is
clever then it is kind. All kind animals are smart. All small animals
are round. If something is round then it is nice. All nice animals are
funny. If something is horrible then it is heavy. If something is heavy
then it is tired. All tired animals are reckless. If something is
powerful then it is fierce. If something is fierce then it is lazy. All
lazy animals are boring. All rough animals are sleepy. If something is
sleepy then it is strong. All strong animals are big. Question: The
snake is big. \nAnswer: "}], "ideal": "true", "id_string":
"NegationRule-Animal-D5-23709"}
{"input": [{"role": "system", "content": "Instructions: You will be
presented with a passage and a question about that passage. You need to
answer true or false to the question. Read the question thoroughly and
answer true or false. Read the passage thoroughly to ensure you know
what the passage entails and you need to use 5 rules to answer the
question."}, {"role": "user", "content": "\nPassage: The snake is dull.
The snake is slow. The bald eagle is awful. The bald eagle is powerful.
The snake attacks the rabbit. The bald eagle likes the squirrel. The
rabbit is quiet. The squirrel is quiet. The squirrel is beautiful. The
squirrel is cute. If something is not quiet then it visits the rabbit.
If something visits the rabbit then it is rough. If something is not
kind then it is awful. If something is not fierce then it is furry. If
something is beautiful then it is cute. If something is cute and not
angry then it is small. If something is awful and not kind then it is
horrible. If something is dull and slow then it is angry. If something
is angry and not cute then it is powerful. If something is furry then it
is lovely. If something is lovely then it is clever. If something is
clever then it is kind. All kind animals are smart. All small animals
are round. If something is round then it is nice. All nice animals are
funny. If something is horrible then it is heavy. If something is heavy
then it is tired. All tired animals are reckless. If something is
powerful then it is fierce. If something is fierce then it is lazy. All
lazy animals are boring. All rough animals are sleepy. If something is
sleepy then it is strong. All strong animals are big. Question: The
snake is not big. \nAnswer: "}], "ideal": "false", "id_string":
"NegationRule-Animal-D5-237010"}
{"input": [{"role": "system", "content": "Instructions: You will be
presented with a passage and a question about that passage. You need to
answer true or false to the question. Read the question thoroughly and
answer true or false. Read the passage thoroughly to ensure you know
what the passage entails and you need to use 5 rules to answer the
question."}, {"role": "user", "content": "\nPassage: Harry is huge.
Harry is strong. Erin is small. Erin is tiny. Alan is nice. Anne is sad.
Anne is poor. If someone is not big then they are sad. If someone is not
bad then they are kind. If someone is nice then they are smart. If
someone is smart and not rough then they are clever. If someone is sad
and not big then they are dull. If someone is dull then they are little.
If someone is little then they are thin. All thin people are bad. If
someone is small and tiny then they are rough. If someone is rough and
not smart then they are poor. If someone is poor then they are fashion.
If someone is fashion then they are energetic. If someone is energetic
then they are young. If someone is kind then they are wealthy. If
someone is wealthy then they are quiet. If someone is quiet then they
are smart. All smart people are wealthy. If someone is clever then they
are famous. If someone is famous then they are old. All old people are
experienced. Question: Harry is bad. \nAnswer: "}], "ideal": "true",
"id_string": "NegationRule-D5-22331"}
{"input": [{"role": "system", "content": "Instructions: You will be
presented with a passage and a question about that passage. You need to
answer true or false to the question. Read the question thoroughly and
answer true or false. Read the passage thoroughly to ensure you know
what the passage entails and you need to use 5 rules to answer the
question."}, {"role": "user", "content": "\nPassage: Harry is huge.
Harry is strong. Erin is small. Erin is tiny. Alan is nice. Anne is sad.
Anne is poor. If someone is not big then they are sad. If someone is not
bad then they are kind. If someone is nice then they are smart. If
someone is smart and not rough then they are clever. If someone is sad
and not big then they are dull. If someone is dull then they are little.
If someone is little then they are thin. All thin people are bad. If
someone is small and tiny then they are rough. If someone is rough and
not smart then they are poor. If someone is poor then they are fashion.
If someone is fashion then they are energetic. If someone is energetic
then they are young. If someone is kind then they are wealthy. If
someone is wealthy then they are quiet. If someone is quiet then they
are smart. All smart people are wealthy. If someone is clever then they
are famous. If someone is famous then they are old. All old people are
experienced. Question: Harry is not bad. \nAnswer: "}], "ideal":
"false", "id_string": "NegationRule-D5-22332"}
{"input": [{"role": "system", "content": "Instructions: You will be
presented with a passage and a question about that passage. You need to
answer true or false to the question. Read the question thoroughly and
answer true or false. Read the passage thoroughly to ensure you know
what the passage entails and you need to use 5 rules to answer the
question."}, {"role": "user", "content": "\nPassage: Harry is huge.
Harry is strong. Erin is small. Erin is tiny. Alan is nice. Anne is sad.
Anne is poor. If someone is not big then they are sad. If someone is not
bad then they are kind. If someone is nice then they are smart. If
someone is smart and not rough then they are clever. If someone is sad
and not big then they are dull. If someone is dull then they are little.
If someone is little then they are thin. All thin people are bad. If
someone is small and tiny then they are rough. If someone is rough and
not smart then they are poor. If someone is poor then they are fashion.
If someone is fashion then they are energetic. If someone is energetic
then they are young. If someone is kind then they are wealthy. If
someone is wealthy then they are quiet. If someone is quiet then they
are smart. All smart people are wealthy. If someone is clever then they
are famous. If someone is famous then they are old. All old people are
experienced. Question: Anne is wealthy. \nAnswer: "}], "ideal": "true",
"id_string": "NegationRule-D5-22333"}
  ```
</details>
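
The samples above share one shape: an `input` list with a `system` and a
`user` message, an `ideal` of `"true"` or `"false"`, and an `id_string`.
The sketch below checks that shape for a single JSONL line. Both
functions are hypothetical helpers, and the prefix-based `is_correct`
scorer is an assumption for illustration; this description does not
state which eval class the registered YAML uses. The example line is a
shortened stand-in with an invented `id_string`, not a real sample from
the dataset.

```python
import json

REQUIRED_KEYS = {"input", "ideal", "id_string"}


def validate_sample(line: str) -> dict:
    """Parse one JSONL line and check it has the shape used by the samples above."""
    sample = json.loads(line)
    missing = REQUIRED_KEYS - sample.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    roles = [message["role"] for message in sample["input"]]
    if roles != ["system", "user"]:
        raise ValueError(f"unexpected roles: {roles}")
    if sample["ideal"] not in ("true", "false"):
        raise ValueError(f"unexpected ideal: {sample['ideal']!r}")
    return sample


def is_correct(completion: str, ideal: str) -> bool:
    """Assumed scoring rule: the completion starts with the ideal answer, ignoring case."""
    return completion.strip().lower().startswith(ideal)


# Shortened stand-in for illustration; real samples carry the full passage.
example_line = json.dumps(
    {
        "input": [
            {"role": "system", "content": "Instructions: answer true or false."},
            {"role": "user", "content": "\nPassage: ... Question: The lion is heavy. \nAnswer: "},
        ],
        "ideal": "true",
        "id_string": "example-0",
    }
)
sample = validate_sample(example_line)
print(is_correct("True.", sample["ideal"]))  # prints: True
```

Each sample must occupy a single physical line in the `.jsonl` file; the
line breaks visible in the listing above come from wrapping in this
message, and a line split inside a string would fail `json.loads`.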

---------

Co-authored-by: qiming bao <qiming.bao@xtracta.com>
Co-authored-by: Jorge <133797909+jorge-openai@users.noreply.github.com>