Adversarial attack on reward models #39

Closed
pseudotensor opened this issue Apr 16, 2023 · 0 comments
pseudotensor commented Apr 16, 2023

Question: what do reward models really optimize for, and how much context do they assume about their inputs?

E.g. an adversarial attack might include:

  • arbitrary \n inserted after some average number of words
  • long, semi-random sequences of words arranged into paragraphs
    i.e. changes to formatting only.

The reward model might still give such output a high score. If it actually detects coherence and the like, that would be impressive, since it would then have to be nearly as good as an LLM itself.
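A minimal sketch of this probe, assuming an off-the-shelf pairwise reward model from Hugging Face (the model name, the example question/answer, and the helper names are my own illustration, not from this repo):

```python
# Sketch: do formatting-only perturbations change a reward model's score?
# Assumes a reward model trained as a sequence classifier over (question, answer)
# pairs; the model name below is only an assumed example.
import random

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_NAME = "OpenAssistant/reward-model-deberta-v3-large-v2"  # assumed example
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)

def reward(question: str, answer: str) -> float:
    """Scalar reward for a (question, answer) pair."""
    inputs = tokenizer(question, answer, return_tensors="pt", truncation=True)
    with torch.no_grad():
        return model(**inputs).logits[0].item()

def inject_newlines(text: str, every: int = 7) -> str:
    """Arbitrary \\n after roughly `every` words -- formatting only."""
    words = text.split()
    return "\n".join(" ".join(words[i:i + every]) for i in range(0, len(words), every))

def shuffle_words(text: str, seed: int = 0) -> str:
    """Semi-random word sequence: same vocabulary, no coherence."""
    words = text.split()
    random.Random(seed).shuffle(words)
    return " ".join(words)

question = "What causes the seasons on Earth?"
answer = ("The seasons are caused by the tilt of Earth's rotational axis, "
          "which changes how directly sunlight hits each hemisphere over the year.")

print("original       :", reward(question, answer))
print("newline attack :", reward(question, inject_newlines(answer)))
print("word shuffle   :", reward(question, shuffle_words(answer)))
```

If the reformatted or shuffled answers score close to the original, the reward model is rewarding surface form rather than content.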

So reward models may assume a lot about the nature of the input data, e.g. that it is already human-readable, correct, etc.

How can RLHF prune wrong/hallucinated responses?

Also, human raters may be picking up on trivial features, like formatting, which are easy to train for, e.g. (see the feature sketch after this list):

  • thesis at front
  • average words per sentence
  • average sentences per paragraph
  • new lines between paragraphs
  • summary at end.

At least the length-related parts are easy to estimate from available open data. A summary can be generated with samsum-style models, and the thesis may not be as important for now.
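A rough sketch of how those surface features could be measured (the function and feature names below are mine, purely for illustration):

```python
# Sketch: crude formatting features a reward model (or human rater) might key on.
import re

def formatting_features(text: str) -> dict:
    paragraphs = [p for p in text.split("\n\n") if p.strip()]
    sentences = [s for s in re.split(r"[.!?]+\s+", text) if s.strip()]
    words = text.split()
    return {
        "num_paragraphs": len(paragraphs),
        "blank_lines_between_paragraphs": "\n\n" in text,
        "avg_words_per_sentence": len(words) / max(len(sentences), 1),
        "avg_sentences_per_paragraph": len(sentences) / max(len(paragraphs), 1),
        # crude proxies for "thesis at front" and "summary at end"
        "first_sentence_word_count": len(sentences[0].split()) if sentences else 0,
        "ends_with_summary_marker": bool(
            re.search(r"\b(in summary|in conclusion|overall)\b", sentences[-1], re.I)
        ) if sentences else False,
    }
```

Correlating these features with reward scores over a batch of responses would show how much of the reward is explained by formatting alone.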
