Our goal is to teach the Lunar Lander (our agent) to land its spaceship between the two flags that mark the landing pad. The more accurately the agent lands, the larger the total reward it can attain. At any moment, the agent may choose one of four actions to achieve this objective: fire the left engine, fire the right engine, fire the main (downward) engine, or do nothing.
Useful Links:
- https://datascience.stackexchange.com/questions/115243/understanding-the-tensorboard-plots-on-a-stable-baseline3s-ppo
- https://www.reddit.com/r/reinforcementlearning/comments/vqb2wu/tips_and_tricks_for_rl_from_experimental_data/
- https://stable-baselines3.readthedocs.io/en/master/modules/ppo.html#parameters
- https://medium.com/aureliantactics/understanding-ppo-plots-in-tensorboard-cbc3199b9ba2
- https://towardsdatascience.com/how-policy-gradients-in-reinforcement-learning-can-get-you-to-the-moon-15940cbc076a