Rewards spec design philosophy #107

@Darktex

Leaving some notes here for broader circulation and transparency, based on feedback I'm receiving.

The current spec is very lightweight on rewards -- intentionally, to be honest: as we wrote in RFC 0, we started elsewhere, and we will look at rewards after we consolidate the foundations on sandboxing, binary distribution, etc.

An upside of this is that we get some space to take in feedback from folks.

Here's some feedback I got from a researcher that's worth thinking about:

  • High-level API makes sense: {reset, step, state}
  • Reward as part of the observation may not be the best place, for these reasons (see the sketch after this list):
    1. What if reward calculation is very slow? The reward may not be available for a while and may only arrive later (think of a very large reward model).
    2. How do we handle scenarios where you have multiple rewards, e.g. code style + accuracy + safety + something else?
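To make the concern concrete, here is a minimal sketch of the shape being discussed. All of the class and field names below are assumptions for illustration; the spec does not define them this way:

```python
from dataclasses import dataclass


# Hypothetical shapes, only to make the feedback concrete -- none of these
# names come from the spec.
@dataclass
class Observation:
    text: str
    reward: float | None = None  # None if the reward is not available yet


class Env:
    def reset(self) -> Observation: ...
    def step(self, action: str) -> Observation: ...
    def state(self) -> dict: ...


# Concern 1: if reward computation is slow (e.g. a large reward model),
# step() either blocks on it or returns reward=None, and the trainer has to
# pick up the value later through some other channel.
# Concern 2: a single `reward` field cannot carry code style, accuracy,
# safety, etc. as separate signals.
```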

My thoughts: point 2 should only require a simple extension of our dataclass. Right now we do enforce rewards to be scalars.

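For concreteness, here is one way such an extension could look. This is only a sketch, under the assumption that rewards live in a dataclass with a single scalar field today; the names are made up, not the spec's:

```python
from dataclasses import dataclass, field


@dataclass
class Reward:
    # Today: a single scalar per step/episode.
    value: float


@dataclass
class MultiReward:
    # Possible extension: named scalar components, e.g.
    # {"code_style": 0.8, "accuracy": 0.3, "safety": 1.0}.
    components: dict[str, float] = field(default_factory=dict)
```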

When you are training, the reward you optimize against is of course a scalar, but you could indeed have multiple trainers optimizing for different things. For example, you may decide to do one training step against just the code-style reward, and then have the next training step optimize for accuracy. You could also take a weighted sum of all the rewards so that you are back to a single scalar, but 1) these two approaches are very likely not equivalent, and 2) as providers of standards, I think we need to be neutral with respect to these design choices and allow both options.
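A tiny sketch of the two options, with made-up reward names, weights, and values:

```python
# Option A: collapse named rewards into a single scalar with fixed weights.
def weighted_sum(rewards: dict[str, float], weights: dict[str, float]) -> float:
    return sum(weights[name] * value for name, value in rewards.items())


# Option B: optimize one named reward per training step, alternating objectives.
def pick(rewards: dict[str, float], objective: str) -> float:
    return rewards[objective]


rewards = {"code_style": 0.8, "accuracy": 0.3}
scalar_a = weighted_sum(rewards, {"code_style": 0.5, "accuracy": 0.5})  # 0.55
scalar_b = pick(rewards, "code_style")  # this step trains on code style only
```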
