Added 3 more gym environment examples. Small changes to pilco.py, mgp… #23

Merged 7 commits into master on Mar 29, 2019

Conversation

@kyr-pol (Collaborator) commented Mar 8, 2019

…r.py and additions to rewards.py, explained further in the pull request.

Added 3 extra tasks:

  • a pendulum swing-up
  • a double inverted pendulum stabilisation (MuJoCo)
  • a swimmer robot (MuJoCo)

Each task is solved in a separate file.

For the swing-up task, I modified the gym environment's initial conditions, starting the pendulum at the bottom position with zero velocity. PILCO in general needs a specific starting state to plan from successfully.
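
A minimal sketch of the kind of change involved (assuming the classic `Pendulum-v0` environment, whose internal state is `[angle, angular_velocity]`; not necessarily the exact code in this PR):

```python
import numpy as np
import gym

class SwingUpPendulum():
    """Pendulum-v0 started hanging down, at rest, on every reset."""
    def __init__(self):
        self.env = gym.make('Pendulum-v0').env  # unwrap to reach the internal state

    def reset(self):
        self.env.reset()
        self.env.state = np.array([np.pi, 0.0])  # angle = pi (bottom), zero velocity
        # observation is [cos(theta), sin(theta), theta_dot]
        return np.array([np.cos(np.pi), np.sin(np.pi), 0.0])

    def step(self, action):
        return self.env.step(action)
```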

For the double pendulum task, a wrapper is used that terminates the episode when the pendulum reaches the limits of its state space, since hitting those limits creates non-smooth behaviour that is hard for PILCO to model. Additionally, angles in radians are calculated from the sin/cos representation, reducing the number of state dimensions (think of this as a much simpler version of the state augmentation the original PILCO uses).
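
Roughly, the wrapper does something like the following (the observation layout and the termination threshold are illustrative assumptions, not the exact values used in the example):

```python
import numpy as np
import gym

class DoublePendulumWrapper(gym.Wrapper):
    """Replace sin/cos pairs by angles and stop before the state-space limits."""
    ANGLE_LIMIT = 0.5  # illustrative threshold, in radians

    def _convert(self, obs):
        # assumed layout: obs = [x, sin(th1), sin(th2), cos(th1), cos(th2), velocities...]
        th1 = np.arctan2(obs[1], obs[3])
        th2 = np.arctan2(obs[2], obs[4])
        return np.concatenate([[obs[0], th1, th2], obs[5:]])

    def reset(self, **kwargs):
        return self._convert(self.env.reset(**kwargs))

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        obs = self._convert(obs)
        if np.any(np.abs(obs[1:3]) > self.ANGLE_LIMIT):
            done = True  # cut the episode before the non-smooth region
        return obs, reward, done, info
```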

For the swimmer, a wrapper is also used; it augments the state space with one extra state, which is the accumulated reward. In the original gym version the reward function uses a hidden state, which violates PILCO's assumptions. Still, no hidden information is accessed by PILCO; the formulation is just made compatible with its assumptions. Furthermore, I added a composite reward function that includes penalties for driving the robot's joints to their angle limits, again in order to maintain smooth behaviour that is easy for the GP to model.
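
The state-augmentation part of the wrapper looks roughly like this (a sketch of the idea rather than the merged code; the composite reward itself lives in rewards.py):

```python
import numpy as np
import gym

class SwimmerWrapper(gym.Wrapper):
    """Expose the accumulated reward as one extra, fully observed state."""
    def reset(self, **kwargs):
        self.total_reward = 0.0
        obs = self.env.reset(**kwargs)
        return np.append(obs, self.total_reward)

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        self.total_reward += reward  # the reward now depends only on observed quantities
        return np.append(obs, self.total_reward), reward, done, info
```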

On another note, I fixed the noise in some of the runs, which helps conditioning, and I also added a fairly uninformative prior on the lengthscales and variances, just to penalise the extreme values that otherwise occur in the higher-dimensional tasks (this is something the original PILCO does too).
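
For reference, with the GPflow 1.x API this amounts to something like the snippet below (the hyperparameter values are illustrative, and I'm assuming the dynamics GPs are reachable as `pilco.mgpr.models`):

```python
import gpflow

def add_default_priors(pilco):
    """Attach weak Gamma priors and fix the likelihood noise on every dynamics GP.

    Must be called before the model is compiled.
    """
    for model in pilco.mgpr.models:
        model.kern.lengthscales.prior = gpflow.priors.Gamma(1.0, 10.0)  # penalise extreme lengthscales
        model.kern.variance.prior = gpflow.priors.Gamma(1.5, 2.0)       # penalise extreme signal variances
        model.likelihood.variance = 0.001            # fix the noise level,
        model.likelihood.variance.trainable = False  # which helps conditioning
```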

@kyr-pol (Collaborator, Author) commented Mar 8, 2019

I think we should add an option in the PILCO constructor for priors, because they have to be defined before the model is compiled (afaik), and for the moment I have hard coded them (they are general enough that they probably help with all environments, but still not best practice).
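
Something along these lines, purely as a hypothetical interface for discussion (the argument name and the priors format are illustrative, not an implemented API):

```python
from pilco.models.mgpr import MGPR  # assumed import path, matching the repo layout

class PILCO:
    def __init__(self, X, Y, priors=None, **kwargs):
        self.mgpr = MGPR(X, Y)  # existing dynamics-model construction
        if priors is not None:
            # attach the priors before anything is compiled
            for model in self.mgpr.models:
                model.kern.lengthscales.prior = priors.get('lengthscales')
                model.kern.variance.prior = priors.get('variance')
        # ... rest of the existing constructor ...
```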

@codecov-io commented Mar 8, 2019

Codecov Report

Merging #23 into master will decrease coverage by 4.81%.
The diff coverage is 47.22%.


@@            Coverage Diff            @@
##           master     #23      +/-   ##
=========================================
- Coverage   95.12%   90.3%   -4.82%     
=========================================
  Files           7       7              
  Lines         328     361      +33     
=========================================
+ Hits          312     326      +14     
- Misses         16      35      +19
| Impacted Files | Coverage Δ |
|---|---|
| pilco/models/pilco.py | 93.33% <100%> (+0.39%) ⬆️ |
| pilco/models/mgpr.py | 100% <100%> (ø) ⬆️ |
| pilco/rewards.py | 61.11% <26.92%> (-32%) ⬇️ |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update c923040...659f0e7.

@kyr-pol (Collaborator, Author) commented Mar 8, 2019

Possibly the extra reward functions etc., if we think they are environment-specific, could be kept in the swimmer.py file.

Also, regarding the slight change in the policy optimisation function: I don't think we always have to cold-start the optimisation by randomising; we can run it once using the last values as initialisation, and then randomise.
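
For concreteness, a rough sketch of that scheme (method names like `optimize`, `predicted_return`, `randomize`, `get_params` and `set_params` are illustrative placeholders, not the repository's exact API):

```python
def optimise_policy_with_restarts(pilco, n_restarts=3):
    """Warm-start from the previous policy first, then try random restarts."""
    candidates = []
    for i in range(n_restarts + 1):
        if i > 0:
            pilco.controller.randomize()  # cold start only after the warm-started run
        pilco.optimize()                  # one round of policy optimisation
        candidates.append((pilco.predicted_return(), pilco.controller.get_params()))
    best_return, best_params = max(candidates, key=lambda c: c[0])
    pilco.controller.set_params(best_params)
    return best_return
```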

@nrontsis (Owner) commented

Amazing work; I will work on it later this week.

I definitely agree about the priors; an easy-to-use interface might be a great selling point for our implementation.

Furthermore, I think that we should:

  • Extract the recurring parts of the code into functions that are defined only once and used by all the examples.
  • Write a Readme/Jupyter notebook detailing the challenges of each environment. This would be a great resource for someone wanting to use PILCO and/or our implementation.

After this is done, we could include the environments in unit tests, requiring any new version of the library to solve them. This would allow automated testing of new ideas without having to manually check their validity on real-world examples.

@kyr-pol (Collaborator, Author) commented Mar 20, 2019

I did some work on these two points; check the added notebook. It's in progress, but what do you think of a structure more or less like that? I thought it'd be helpful for users getting started who are stuck on a task with PILCO running but not apparently learning.

At the end we could also add information more specific to what we did in the included examples.

@nrontsis merged commit 2bf469b into master on Mar 29, 2019