Rewards spec design philosophy #107

@Darktex

Leaving some notes here for broader circulation and transparency, based on feedback I'm receiving.

The current spec is very lightweight on rewards -- intentionally, to be honest: as we wrote in RFC 0, we started elsewhere, and we will look at rewards after we consolidate the foundations on sandboxing, binary distribution, etc.

An upside of this is that we get some space to take in feedback from folks.

Here's some feedback I got from a researcher that's worth thinking about:

  • High-level API makes sense: {reset, step, state}
  • Reward as part of the observation may not be the best place, for these reasons (see the sketch after this list):
    1. What if reward calculation is very slow? The reward may not be available for a while and may only arrive later (think of a very large reward model).
    2. How do we handle scenarios where you have multiple rewards, e.g. code style + accuracy + safety + something else?
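To make the concern concrete, here is a minimal sketch of the shape being discussed. All of the class and field names below are assumptions for illustration; the spec does not define them this way:

```python
from dataclasses import dataclass


# Hypothetical shapes, only to make the feedback concrete -- none of these
# names come from the spec.
@dataclass
class Observation:
    text: str
    reward: float | None = None  # None if the reward is not available yet


class Env:
    def reset(self) -> Observation: ...
    def step(self, action: str) -> Observation: ...
    def state(self) -> dict: ...


# Concern 1: if reward computation is slow (e.g. a large reward model),
# step() either blocks on it or returns reward=None, and the trainer has to
# pick up the value later through some other channel.
# Concern 2: a single `reward` field cannot carry code style, accuracy,
# safety, etc. as separate signals.
```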

My thoughts: point 2 should only require a simple extension of our dataclass. Right now we do enforce rewards to be scalars.

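For concreteness, here is one way such an extension could look. This is only a sketch, under the assumption that rewards live in a dataclass with a single scalar field today; the names are made up, not the spec's:

```python
from dataclasses import dataclass, field


@dataclass
class Reward:
    # Today: a single scalar per step/episode.
    value: float


@dataclass
class MultiReward:
    # Possible extension: named scalar components, e.g.
    # {"code_style": 0.8, "accuracy": 0.3, "safety": 1.0}.
    components: dict[str, float] = field(default_factory=dict)
```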

When you are training, the reward you optimize against is of course a scalar, but you could indeed have multiple trainers optimizing for different things. For example, you may decide to do one training step against just the code-style reward, and then have the next training step optimize for accuracy. You could also take a weighted sum of all the rewards so that you are back to a single scalar, but 1) these two approaches are very likely not equivalent, and 2) as providers of standards, I think we need to be neutral with respect to these design choices and allow both options.
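A tiny sketch of the two options, with made-up reward names, weights, and values:

```python
# Option A: collapse named rewards into a single scalar with fixed weights.
def weighted_sum(rewards: dict[str, float], weights: dict[str, float]) -> float:
    return sum(weights[name] * value for name, value in rewards.items())


# Option B: optimize one named reward per training step, alternating objectives.
def pick(rewards: dict[str, float], objective: str) -> float:
    return rewards[objective]


rewards = {"code_style": 0.8, "accuracy": 0.3}
scalar_a = weighted_sum(rewards, {"code_style": 0.5, "accuracy": 0.5})  # 0.55
scalar_b = pick(rewards, "code_style")  # this step trains on code style only
```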
