Learning Curve and Rendering Mismatch #615

@ecada


I've trained PPO1 with the given hyperparameters, but after roughly 1M timesteps the episode reward starts to decline. However, when I render the environment, the humanoid walks better after 20M timesteps of training (where the episode reward is at its minimum) than after 1M timesteps (where the episode reward is at its maximum). Has anyone else noticed this mismatch? Am I missing something?
