## How to - some advice for setting hyperparameters reasonably and troubleshooting

Here we compile a list of practical advice on how to get PILCO to solve a problem. There are several hyperparameters that need to be set in advance, but in general these can be related to aspects of the problem at hand, and a bit of experimenting can help avoid massive hyperparameter searches later. Additionally, a number of things can go wrong, and when this happens a good strategy is to start by figuring out which component is failing. We therefore organise the rest around the major components of the framework and comment on hyperparameter settings and troubleshooting for each.

### Gaussian Process model
The model is arguably the most crucial component, but it has no hyperparameters that need to be set in advance. The hyperparameters it uses (signal variance, signal noise, and lengthscales) are trained by maximum likelihood. It is sometimes beneficial to fix the signal noise; see the examples. Priors can also be used. They have a regularising effect, which is useful when the hyperparameters take extreme values (lengthscales are especially prone to this, with values like $10^5$).

#### Troubleshooting 
How to know something is wrong:

- By inspection, hyperparameter values can seem implausible (lengthscales > $10^5$).
- Check the one-step predictions "manually":

In [None]:
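# Assuming a data set X and Y and a trained `pilco` object.
# Assumption: pilco.mgpr.models is a list of GPflow GPR models, one per output dimension.
import matplotlib.pyplot as plt
for i, model in enumerate(pilco.mgpr.models):
    mean, var = model.predict_f(X)  # one-step prediction at the training inputs
    plt.plot(Y[:, i], label="observed")
    plt.plot(mean[:, 0], label="predicted")
    plt.legend(); plt.show()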

- Check multi-step predictions manually:
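
One library-independent way to run this check is sketched below. It assumes you can wrap your trained model in a function `one_step_model(x, u)` that returns the predicted next state (a hypothetical name, not part of PILCO), and that `states` and `actions` hold one recorded rollout. The model is fed its own predictions, so errors compound the way they would during planning.

```python
import numpy as np

def multi_step_errors(one_step_model, states, actions):
    """Roll the model forward from states[0], feeding back its own predictions.

    one_step_model: hypothetical callable (x, u) -> predicted next state.
    states: array [T + 1, state_dim] from a real rollout.
    actions: array [T, control_dim] applied during that rollout.
    Returns the per-step Euclidean error between prediction and recording.
    """
    x = np.asarray(states[0], dtype=float)
    errors = []
    for t, u in enumerate(actions):
        x = one_step_model(x, u)  # uses the previous prediction, not the true state
        errors.append(np.linalg.norm(x - states[t + 1]))
    return np.array(errors)

# Toy usage: known linear dynamics and a slightly wrong model (illustrative only).
A_true, A_model = 0.9, 0.85
actions = np.ones((20, 1))
states = np.zeros((21, 1))
for t in range(20):
    states[t + 1] = A_true * states[t] + actions[t]

errors = multi_step_errors(lambda x, u: A_model * x + u, states, actions)
print(errors.round(3))
```

Plotting `errors` against the step index shows how quickly the model drifts; a curve that stays small over the planning horizon is what matters here.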

If these are reasonably accurate (no need for perfection here), then it's probably not the model's issue.

Possible fixes to try:

- Need more data. Try increasing the J parameter, the number of rollouts run before optimising a policy for the first time. If you need 10 runs to get a model that can capture *anything* useful, the first 10 policy optimisation runs are wasted time. On the other hand, this is obviously not very data-efficient. Example: both inverted pendulum environments, since random rollouts tend to be extremely short.
- Are there non-smooth dynamics in the system you are trying to model? If so, is modeling them crucial for solving the task? If not, it might be beneficial to filter them out. Example: inverted double pendulum.
- Is the state fully observable? Example: swimmer.
- PILCO is only able to perform unimodal predictions. Is this a reasonable assumption in your case? If not, is it game-breaking or can you work around it? Example: pendulum stabilisation.

### The reward function

#### Initial Assumptions
PILCO assumes that the reward function is given a priori, as an analytic function of the current state, $r_t = r(x_t)$. The functional form is that of an exponential (in the swimmer environment we also introduce a linear reward function and the possibility of combining reward functions). Can the reward structure of your problem be captured sufficiently well with such a function?

#### Hyperparameters 
Take the exponential reward function as an example. We need to set the target and the weights, which correspond to the center of the exponential and to the sensitivity of the reward to the different state dimensions (in most cases a diagonal weight matrix is enough).
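
For reference, one common way to write such a reward is shown below. This is an assumption about the exact form (an unnormalised exponential of a weighted squared distance), with $x_{\text{target}}$ the target and $W$ the weight matrix, so check it against the reward class you actually use.

$$
r(x) = \exp\left(-\tfrac{1}{2}\,(x - x_{\text{target}})^\top W\,(x - x_{\text{target}})\right)
$$

With a diagonal $W = \mathrm{diag}(w_1, \dots, w_k)$ the exponent reduces to $-\tfrac{1}{2}\sum_i w_i (x_i - x_{\text{target},i})^2$, so the reward is 1 at the target and each $w_i$ controls how fast it decays along dimension $i$.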

The center of the exponential has to be a state such that, once the system reaches it and/or stabilises there, the task can be considered sufficiently solved.

The target can also describe a set of states. For example, if one state variable corresponds to a radius and another to an angular velocity, then by setting a target like [r, v] with weights $I$ we can get the system onto a circular trajectory.

The weights should take reasonable values between two extremes: 
- high magnitude values make the reward decay faster and make exploration much harder (the reward signal becomes sparser)
- low magnitude values can make the reward gradient uninformative or very small in magnitude, which slows learning down.

State variables that are irrelevant for the current task should correspond to very low weights. For example, if all we want from the system is to get to a certain position, velocity variables are irrelevant. If we have a 0 target for the velocity variable and its weight is not small, higher velocities will be penalised for no reason, resulting in "optimal" (by reward function standards) behaviours that are slow (usually not what we want).

#### Testing
Check the rewards for different states manually:


In [None]:
import
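
As a library-independent sketch of this check, the snippet below evaluates an exponential reward at a few hand-picked states. The functional form, the two-dimensional [position, velocity] state, and all numeric values are assumptions chosen for illustration; substitute your own reward object, target, and weights.

```python
import numpy as np

def exponential_reward(x, target, W):
    """Assumed form: exp(-0.5 * (x - target)^T W (x - target)) for a deterministic state."""
    d = np.asarray(x, dtype=float) - np.asarray(target, dtype=float)
    return float(np.exp(-0.5 * d @ W @ d))

target = np.array([1.0, 0.0])        # hypothetical [position, velocity] target
W_narrow = np.diag([100.0, 100.0])   # high weights: reward vanishes quickly
W_wide = np.diag([1.0, 1e-3])        # velocity weight near zero: velocity barely matters

for x in ([1.0, 0.0], [0.5, 0.0], [1.0, 3.0]):
    print(x, exponential_reward(x, target, W_narrow), exponential_reward(x, target, W_wide))
```

States that you consider "almost solved" should receive a clearly non-zero reward; if they do not, the weights are probably too high.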

You can also evaluate whole trajectories using something like:



In [None]:
# assuming X the data corresponding to a trajectory
import
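
A minimal sketch of such a trajectory evaluation follows, again assuming the exponential form of the reward and a deterministic state at every step. The function name `trajectory_return` and the toy data are hypothetical; `X` stands for the `[T, state_dim]` array of recorded states.

```python
import numpy as np

def trajectory_return(X, target, W):
    """Sum of assumed exponential rewards over the rows of X ([T, state_dim])."""
    D = np.asarray(X, dtype=float) - np.asarray(target, dtype=float)
    # Row-wise quadratic form d^T W d without an explicit Python loop.
    quad = np.einsum("ti,ij,tj->t", D, W, D)
    rewards = np.exp(-0.5 * quad)
    return rewards.sum(), rewards

# Toy trajectory approaching the target (illustrative data, not from a real system).
target = np.array([1.0, 0.0])
W = np.diag([4.0, 1e-3])
X = np.column_stack([np.linspace(0.0, 1.0, 11), np.zeros(11)])
total, per_step = trajectory_return(X, target, W)
print(total, per_step.round(3))
```

Looking at the per-step rewards as well as the total makes it easier to see whether a trajectory collects reward only near its end or along the whole episode.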

In certain cases we found it useful to print both the predicted return (computed from the trajectory the model predicts under the current policy and reward function) and the actual return obtained by running the policy for an episode on the real system:
- A significant disparity between the two points to a problem with the model.
- If both are practically 0, the reward function might be too narrow (weights too high).
- If the return is high but the trajectory doesn't seem to solve the task, there is some other reward shaping issue (weights too low, targets in the wrong place, or state variables that correspond to different things than assumed).

### The policy (aka the controller)

### Misc

Subsampling

Initial conditions

Horizon - planning, training, testing