Commit 1361d74
fix(rewards): restore reentrant state scaling
- Turns out that if you don't scale up the negative rewards for re-entering states, you get a model that wins, but only at the last moment, after spinning in circles a bunch.
- Observed that associative swap tends to create trees where it's difficult for the model to commute terms. Added a negative reward for it.

1 parent b896e9b commit 1361d74
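The two reward changes described in the commit message can be illustrated with a minimal sketch. This is not the code from this commit: the function name `shaped_reward`, the constants, and the action name `"associative_swap"` are all hypothetical, chosen only to show the idea of scaling the penalty with each re-entry into a state and adding a small penalty for associative swaps.

```python
from collections import Counter

# Hypothetical constants for illustration; the real values are in the diff.
STEP_PENALTY = -0.01        # small cost for every non-winning move
REENTRY_PENALTY = -0.05     # extra cost per repeated visit to a state
ASSOC_SWAP_PENALTY = -0.02  # extra cost for using associative swap
WIN_REWARD = 1.0


def shaped_reward(state_key, visits, action_name, solved):
    """Return the reward for entering `state_key` via `action_name`.

    `visits` counts how often each state has been entered in the current
    episode and is updated in place.
    """
    if solved:
        return WIN_REWARD

    visits[state_key] += 1
    reward = STEP_PENALTY

    # Scale the penalty with the number of re-entries, so circling back to
    # the same state gets more expensive each time.
    repeats = visits[state_key] - 1
    if repeats > 0:
        reward += REENTRY_PENALTY * repeats

    # Discourage associative swaps, which (per the commit message) tend to
    # produce trees where terms are hard to commute.
    if action_name == "associative_swap":
        reward += ASSOC_SWAP_PENALTY

    return reward
```

With a fresh `Counter()` per episode, the first visit to a state costs only the step penalty, while each later visit to the same state costs progressively more. Under these assumed values, that should remove the incentive to spin in circles before winning.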
1 file changed, 5 additions and 2 deletions.

Hunk 1 (original lines 115-121, new lines 115-121): line 118 replaced.
Hunk 2 (original lines 204-211, new lines 204-214): one line added at new line 207; original line 208 replaced by new lines 209-211.

[Diff line contents not available.]