
Commit 1361d74

fix(rewards): restore reentrant state scaling
- Turns out that if you don't scale up the negative reward for re-entering previously visited states, you get a model that wins, but only at the last moment after spinning in circles for a while.
- Observe that the associative swap tends to create trees where it is difficult for the model to commute terms, so add a negative reward for selecting it.
1 parent b896e9b commit 1361d74
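
The idea behind the scaling change, as a standalone sketch: the penalty for landing on a previously visited expression grows with the number of visits, so an agent that spins in circles accumulates an increasingly negative return instead of a flat one. This is illustrative only, not the mathy API; the `visit_counts` store and the `PREVIOUS_LOCATION_PENALTY` constant are assumed names.

```python
# Minimal sketch of visit-count-scaled re-entry penalties (illustrative names,
# not the mathy API): the more often the same state is revisited, the larger
# the negative reward returned for that step.
from collections import Counter

PREVIOUS_LOCATION_PENALTY = -0.02  # hypothetical base penalty for revisiting

visit_counts: Counter = Counter()


def reentry_reward(state_key: str) -> float:
    """Return 0.0 for a first visit, and an increasingly negative reward
    for every repeat visit to the same expression state."""
    visit_counts[state_key] += 1
    count = visit_counts[state_key]
    if count <= 1:
        return 0.0
    # Scale the base penalty by the visit count to discourage loops.
    return PREVIOUS_LOCATION_PENALTY * count


# Revisiting "4x + 2x" repeatedly yields a growing penalty: 0.0, -0.04, -0.06
for _ in range(3):
    print(reentry_reward("4x + 2x"))
```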

File tree

1 file changed: +5 −2 lines

  • libraries/mathy_python/mathy


libraries/mathy_python/mathy/env.py

Lines changed: 5 additions & 2 deletions
@@ -115,7 +115,7 @@ def get_rewarding_actions(self, state: MathyEnvState) -> List[Type[BaseRule]]:
     def get_penalizing_actions(self, state: MathyEnvState) -> List[Type[BaseRule]]:
         """Get the list of penalizing action types. When these actions
         are selected, the agent gets a negative reward."""
-        return []
+        return [AssociativeSwapRule]
 
     def max_moves_fn(
         self, problem: MathyEnvProblem, config: MathyEnvProblemArgs

@@ -204,8 +204,11 @@ def get_state_transition(
             if list_count <= 1 or key != expression.raw:
                 continue
 
+            # NOTE: the reward is scaled by how many times this state has been visited
             return time_step.transition(
-                features, reward=EnvRewards.PREVIOUS_LOCATION, discount=self.discount,
+                features,
+                reward=EnvRewards.PREVIOUS_LOCATION * list_count,
+                discount=self.discount,
             )
 
         if len(agent.history) > 0:
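
For the first hunk, a hedged sketch of how a list of penalizing rule types can translate into a negative step reward. The classes and the `PENALIZE_ACTION` constant below are stand-ins rather than mathy's real definitions; only the idea of checking the chosen rule against the penalize list comes from the commit.

```python
# Illustrative sketch only: map an agent's chosen rule type to a small
# negative reward when it appears in the environment's penalize list.
from typing import List, Type


class BaseRule:
    """Placeholder for mathy's rule base class."""


class AssociativeSwapRule(BaseRule):
    """Placeholder: swaps grouping, which can make terms hard to commute."""


class ConstantsSimplifyRule(BaseRule):
    """Placeholder for a rule that is neither rewarded nor penalized here."""


PENALIZE_ACTION = -0.01  # hypothetical per-step penalty


def get_penalizing_actions() -> List[Type[BaseRule]]:
    # Discourage associative swaps, which tend to build awkward trees.
    return [AssociativeSwapRule]


def action_reward(rule: Type[BaseRule]) -> float:
    """Negative reward when the chosen rule type is on the penalize list."""
    return PENALIZE_ACTION if rule in get_penalizing_actions() else 0.0


print(action_reward(AssociativeSwapRule))    # -0.01
print(action_reward(ConstantsSimplifyRule))  # 0.0
```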
