Adversarial attack on reward models #39

Closed
pseudotensor opened this issue Apr 16, 2023 · 0 comments
pseudotensor commented Apr 16, 2023

Question: what do reward models really optimize for, and how much context do they assume about their inputs?

E.g. an adversarial attack might include:

  • arbitrary \n inserted after some average number of words
  • long, semi-random sequences of words arranged into paragraphs
    i.e. changes to formatting only.

The reward model might still give such output a high score. If it actually detects coherence and the like, that would be impressive, since it would then have to be nearly as good as an LLM itself.
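A minimal sketch of this probe, assuming an off-the-shelf pairwise reward model from Hugging Face (the model name, the example question/answer, and the helper names are my own illustration, not from this repo):

```python
# Sketch: do formatting-only perturbations change a reward model's score?
# Assumes a reward model trained as a sequence classifier over (question, answer)
# pairs; the model name below is only an assumed example.
import random

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_NAME = "OpenAssistant/reward-model-deberta-v3-large-v2"  # assumed example
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)

def reward(question: str, answer: str) -> float:
    """Scalar reward for a (question, answer) pair."""
    inputs = tokenizer(question, answer, return_tensors="pt", truncation=True)
    with torch.no_grad():
        return model(**inputs).logits[0].item()

def inject_newlines(text: str, every: int = 7) -> str:
    """Arbitrary \\n after roughly `every` words -- formatting only."""
    words = text.split()
    return "\n".join(" ".join(words[i:i + every]) for i in range(0, len(words), every))

def shuffle_words(text: str, seed: int = 0) -> str:
    """Semi-random word sequence: same vocabulary, no coherence."""
    words = text.split()
    random.Random(seed).shuffle(words)
    return " ".join(words)

question = "What causes the seasons on Earth?"
answer = ("The seasons are caused by the tilt of Earth's rotational axis, "
          "which changes how directly sunlight hits each hemisphere over the year.")

print("original       :", reward(question, answer))
print("newline attack :", reward(question, inject_newlines(answer)))
print("word shuffle   :", reward(question, shuffle_words(answer)))
```

If the reformatted or shuffled answers score close to the original, the reward model is rewarding surface form rather than content.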

So reward models may assume a lot about the nature of the input data, e.g. that it is already human-readable, correct, etc.

How can RLHF prune wrong/hallucinated responses?

Also, human raters may be picking up on trivial features, like formatting, which are easy to train for, e.g. (see the feature sketch after this list):

  • thesis at front
  • average words per sentence
  • average sentences per paragraph
  • new lines between paragraphs
  • summary at end.

At least the length-related parts are easy to estimate from available open data. A summary can be generated with samsum-style models, and the thesis may not be as important for now.
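A rough sketch of how those surface features could be measured (the function and feature names below are mine, purely for illustration):

```python
# Sketch: crude formatting features a reward model (or human rater) might key on.
import re

def formatting_features(text: str) -> dict:
    paragraphs = [p for p in text.split("\n\n") if p.strip()]
    sentences = [s for s in re.split(r"[.!?]+\s+", text) if s.strip()]
    words = text.split()
    return {
        "num_paragraphs": len(paragraphs),
        "blank_lines_between_paragraphs": "\n\n" in text,
        "avg_words_per_sentence": len(words) / max(len(sentences), 1),
        "avg_sentences_per_paragraph": len(sentences) / max(len(paragraphs), 1),
        # crude proxies for "thesis at front" and "summary at end"
        "first_sentence_word_count": len(sentences[0].split()) if sentences else 0,
        "ends_with_summary_marker": bool(
            re.search(r"\b(in summary|in conclusion|overall)\b", sentences[-1], re.I)
        ) if sentences else False,
    }
```

Correlating these features with reward scores over a batch of responses would show how much of the reward is explained by formatting alone.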
