## How to: some advice for setting hyperparameters reasonably and troubleshooting

Here we compile practical advice on getting PILCO to solve a problem. Several hyperparameters need to be set in advance, but these can generally be related to aspects of the problem at hand, and a bit of experimenting can help avoid massive hyperparameter searches later. Additionally, a number of things can go wrong, and when this happens a good strategy is to start by figuring out which component is failing. We therefore organise the rest of this guide around the major components of the framework and comment on hyperparameter settings and troubleshooting for each.

### Gaussian Process model
The model is perhaps the most crucial component, but it does not have hyperparameters that need to be set in advance. Its hyperparameters (signal variance, noise variance and lengthscales) are trained by maximum likelihood. It is sometimes beneficial to fix the noise variance, see examples. Priors can also be used; they have a regularising effect, which is useful when the hyperparameters take extreme values (lengthscales are especially prone to this, with values like $10^5$).
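To see why a lengthscale around $10^5$ is suspicious, consider a squared-exponential kernel, which we assume here as the GP covariance. The function `se_kernel` below is an illustrative stand-in and not part of any library. With such a lengthscale, inputs that differ by a whole unit in that dimension are still treated as almost perfectly correlated, so the model effectively ignores the dimension.

```python
import math

def se_kernel(x, y, lengthscales, signal_variance=1.0):
    """Squared-exponential kernel with one lengthscale per input dimension."""
    sq_dist = sum(((a - b) / l) ** 2 for a, b, l in zip(x, y, lengthscales))
    return signal_variance * math.exp(-0.5 * sq_dist)

x, y = [0.0, 0.0], [1.0, 1.0]

# Both dimensions contribute to the distance between x and y.
k_reasonable = se_kernel(x, y, lengthscales=[1.0, 1.0])

# The second dimension has an extreme lengthscale and barely contributes.
k_extreme = se_kernel(x, y, lengthscales=[1.0, 1e5])

# Same two points with the second dimension dropped altogether.
k_first_only = se_kernel(x[:1], y[:1], lengthscales=[1.0])

print(k_reasonable, k_extreme, k_first_only)
```

The last two values are practically identical, which is what "the model ignores this input" looks like in kernel terms. That may be legitimate (the input really is irrelevant) or a sign that the optimiser has drifted to an extreme value.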

#### Troubleshooting 
How to know something is wrong:

- By inspection, hyperparameter values can seem implausible (lengthscales > $10^5$).
- Check the one-step predictions "manually":

In [None]:
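# Assuming a data set X (inputs) and Y (targets) and a trained `pilco` object.
# `pilco.mgpr.models` (one GPflow model per output dimension) is assumed here.
import matplotlib.pyplot as plt
for i, m in enumerate(pilco.mgpr.models):
    mean, _ = m.predict_y(X)                 # one-step predictions
    plt.plot(Y[:, i], label=f"data, dim {i}")
    plt.plot(mean, "--", label=f"prediction, dim {i}")
plt.legend(); plt.show()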

- Check multi-step predictions manually:
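A multi-step check can be sketched as follows. The idea is to feed the model's own prediction back in as the next input and compare against a recorded trajectory; errors compound, so some drift is expected. The names `true_step`, `model_step` and `multi_step_errors` are hypothetical stand-ins (a toy 1-D system and a slightly wrong model), not PILCO's API. PILCO additionally propagates uncertainty through the predictions, which this sketch omits.

```python
def true_step(x, u):
    """Toy 1-D 'real' system (assumed for illustration)."""
    return 0.9 * x + 0.5 * u

def model_step(x, u):
    """Toy learned model with a small error in its dynamics coefficient."""
    return 0.88 * x + 0.5 * u

def multi_step_errors(x0, actions):
    """Roll the real system and the model forward from the same start state,
    each fed its own previous output, and return the per-step absolute error."""
    x_true, x_model, errors = x0, x0, []
    for u in actions:
        x_true = true_step(x_true, u)
        x_model = model_step(x_model, u)  # the model sees its own prediction, not the truth
        errors.append(abs(x_true - x_model))
    return errors

errors = multi_step_errors(x0=1.0, actions=[1.0] * 20)
print(f"one-step error: {errors[0]:.4f}, error after 20 steps: {errors[-1]:.4f}")
```

A small one-step error growing into a much larger multi-step error is normal; what matters is whether the predicted trajectory still has roughly the right shape over the planning horizon.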

If these are reasonably accurate (no need for perfection here), then it's probably not the model's issue.

Possible fixes to try:

- Need more data. Try increasing the J parameter, the number of rollouts run before optimising a policy for the first time. If you need 10 runs to get a model that can capture *anything* useful, the first 10 policy optimisation runs are wasted time. On the other hand this is obviously not very data-efficient. Example: both inverted pendulum tasks, since random rollouts tend to be extremely short.
- Are there non-smooth dynamics in the system you are trying to model? If so, is modelling them crucial for solving the task? If not, it might be beneficial to filter them out. Example: inverted double pendulum.
- Is the state fully observable? Example: swimmer.
- PILCO is only able to perform unimodal predictions. Is this a reasonable assumption in your case? If not, is it game-breaking or can you work around it? Example: pendulum stabilisation.

### The reward function

#### Initial Assumptions
PILCO assumes that the reward function is given a priori, as an analytic function of the current state, $r_t = r(x_t)$. The functional form is an exponential (in the swimmer environment we also introduce a linear reward function and the possibility of combining reward functions). Can the reward structure of your problem be captured sufficiently well by such a function?

#### Hyperparameters 
Taking the exponential reward function, we need to set the target and the weights, which correspond to the center of the exponential and to the sensitivity of the reward to the different state dimensions (in most cases a diagonal weight matrix suffices).

The center of the exponential has to be a state such that, once the system reaches it and/or stabilises there, the task can be considered sufficiently solved.

The target can also describe a set of states. For example, if one state variable corresponds to a radius and another to an angular velocity, then by setting a target like [r,v] and weights $I$, we can get the system onto a circular trajectory.

The weights should take reasonable values between two extremes:
- high-magnitude values make the reward decay faster and make exploration much harder (the reward signal becomes sparser)
- low-magnitude values can make the reward gradient uninformative, or very small in magnitude, and slow learning down.

State variables that are irrelevant to the current task should correspond to very small weights. For example, if all we want is for the system to reach a certain position, the velocity variables are irrelevant. If the velocity has a target of 0 and its weight is not small, higher velocities are penalised for no reason, resulting in "optimal" (by reward function standards) behaviours that are slow (usually not what we want).
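The effect of the weights can be illustrated with a minimal sketch. It assumes the reward has the form $\exp(-\frac{1}{2}(x - t)^\top W (x - t))$ with a diagonal $W$ and a deterministic state; the function `exp_reward` is hypothetical, and a full implementation would also account for the uncertainty of the state.

```python
import math

def exp_reward(x, target, weights):
    """exp(-0.5 * sum_i w_i * (x_i - t_i)^2), i.e. a diagonal weight matrix."""
    return math.exp(-0.5 * sum(w * (xi - ti) ** 2
                               for xi, ti, w in zip(x, target, weights)))

target = [0.0, 0.0]   # [position, velocity]
state = [1.0, 2.0]    # one unit away from the target position, moving fast

# Position weight swept from "too high" to "too low"; velocity weight is 0.
for w in (100.0, 1.0, 1e-4):
    print(f"position weight {w:g}: reward {exp_reward(state, target, [w, 0.0]):.4g}")

# Same state, with the (irrelevant) velocity either nearly ignored or fully weighted.
r_velocity_ignored = exp_reward(state, target, [1.0, 1e-3])
r_velocity_penalised = exp_reward(state, target, [1.0, 1.0])
print(r_velocity_ignored, r_velocity_penalised)
```

With a very high weight the reward one unit away from the target is numerically zero, so there is nothing to follow; with a very low weight it is almost 1 everywhere, so there is little gradient. Weighting the velocity makes a fast approach look much worse than a slow one, even though both are at the same distance from the target.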

#### Testing
Manually check the rewards for different states:


In [None]:
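from pilco.rewards import ExponentialReward  # import path assumed from the PILCO package layout
# e.g. R = ExponentialReward(state_dim=D, W=W, t=target); R.compute_reward(m, s) for a state mean m (1 x D) and covariance s (D x D)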

Whole trajectories can also be evaluated with something like:



In [None]:
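# assuming X the data corresponding to a trajectory (one state per row),
# R a reward object with a compute_reward(m, s) method and D the state dimension (names assumed)
import numpy as np
total = sum(R.compute_reward(x[None, :D], 1e-3 * np.eye(D))[0] for x in X)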

In certain cases we found it useful to print both the predicted return (based on the trajectory predicted by the model, given the policy and the reward function) and the actual return obtained by running the policy for an episode on the real system:
- A significant disparity between the two points to a problem with the model.
- If both are practically 0, the reward function might be too narrow (weights too high).
- If the reward is high but the trajectory doesn't seem to solve the task, there is some other reward shaping issue (weights too low, targets in the wrong place, state variables corresponding to different things than what is assumed).
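As a rough sketch, these rules of thumb can be written down as a small helper. The function `diagnose_returns` and its thresholds are hypothetical and arbitrary; they only make the comparison explicit and would need tuning for the scale of returns in a given task.

```python
def diagnose_returns(predicted, actual, rel_tol=0.3, near_zero=1e-3):
    """Map a (predicted return, actual return) pair to a rough diagnosis.
    Thresholds are illustrative defaults, not recommended values."""
    if abs(predicted) < near_zero and abs(actual) < near_zero:
        return "both ~0: reward may be too narrow (weights too high)"
    if abs(predicted - actual) > rel_tol * max(abs(predicted), abs(actual)):
        return "large disparity: suspect the model"
    return "returns agree: if the task is still unsolved, revisit reward shaping"

for predicted, actual in [(0.0, 0.0), (20.0, 5.0), (20.0, 19.0)]:
    print(predicted, actual, "->", diagnose_returns(predicted, actual))
```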

### The policy (aka the controller)

The final goal of the algorithm is to learn a policy that efficiently solves the task at hand. The policy is parametrised, and the values of these parameters are exactly what we optimise.

The action space, that is, the set of values available to the controller's outputs (which are the system's inputs, usually referred to as $u_t$), is assumed to be a continuous interval around 0 with maximum magnitude $e$. This value is specific to each environment.
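One way to keep actions inside $[-e, e]$ is to squash the unbounded controller output through a sine, an approach described in the PILCO literature. Whether a given implementation uses exactly this form is an assumption worth checking; the function below is a minimal deterministic sketch with a hypothetical name.

```python
import math

def squash_sin(raw_action, max_action):
    """Map an unbounded controller output to [-max_action, max_action]."""
    return max_action * math.sin(raw_action)

raw_outputs = [-10.0, -1.0, 0.0, 0.5, 3.0, 100.0]
squashed = [squash_sin(a, max_action=2.0) for a in raw_outputs]
print(squashed)
```

Note that the sine is periodic, so very large raw outputs wrap around rather than saturate; the guarantee is only that the magnitude never exceeds $e$.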

For _linear_ controllers, there aren't any hyperparameters to be set, since all the parameters are learned.

For _rbf_ controllers, the hyperparameter we need to set is the number of basis functions. A higher number gives more flexibility and might also alleviate problems related to local minima (also see restarts), but it increases computational demands.
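To make the trade-off concrete, here is a minimal sketch of a deterministic RBF policy as a weighted sum of Gaussian basis functions. The parameterisation (centers, weights and lengthscales shared across basis functions) and the name `rbf_controller` are assumptions for illustration; an actual implementation may be organised differently.

```python
import math

def rbf_controller(x, centers, weights, lengthscales):
    """Weighted sum of Gaussian basis functions centred at `centers`."""
    u = 0.0
    for c, w in zip(centers, weights):
        sq_dist = sum(((xi - ci) / l) ** 2
                      for xi, ci, l in zip(x, c, lengthscales))
        u += w * math.exp(-0.5 * sq_dist)
    return u

centers = [[-1.0, 0.0], [0.0, 0.0], [1.0, 0.0]]  # 3 basis functions, 2-D state
weights = [0.5, -1.0, 0.5]
lengthscales = [1.0, 1.0]

# Number of learnable parameters under this parameterisation.
n_params = len(centers) * len(centers[0]) + len(weights) + len(lengthscales)

u = rbf_controller([0.0, 0.0], centers, weights, lengthscales)
print(n_params, u)
```

Each extra basis function adds one center (one value per state dimension) and one weight, so both the number of parameters to optimise and the cost of evaluating the policy grow with the number of basis functions.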



### Misc
#### Restarts
Since the GP model's hyperparameters and the controller's parameters are trained with gradient-based iterative algorithms in generally non-convex settings, they can get stuck in local minima. Restarts can help deal with these issues, and we recommend experimenting with them, especially when performance fluctuates a lot between runs.
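The idea behind restarts can be sketched on a toy problem: run the same gradient-based optimiser from several random initialisations and keep the best result. Everything below (the objective `f`, plain gradient descent, the function names) is a stand-in chosen for illustration, not the optimiser or the objective PILCO uses.

```python
import random

def f(x):
    """Non-convex toy objective: local minimum near x = 1.13, global near x = -1.30."""
    return x**4 - 3 * x**2 + x

def grad_f(x):
    return 4 * x**3 - 6 * x + 1

def gradient_descent(x0, lr=0.01, steps=2000):
    x = x0
    for _ in range(steps):
        x -= lr * grad_f(x)
    return x

def optimise_with_restarts(n_restarts, seed=0):
    """Run gradient descent from random starting points and keep the best optimum."""
    rng = random.Random(seed)
    best_x = None
    for _ in range(n_restarts):
        x = gradient_descent(rng.uniform(-2.0, 2.0))
        if best_x is None or f(x) < f(best_x):
            best_x = x
    return best_x

single = gradient_descent(1.5)            # this start converges to the local minimum
restarted = optimise_with_restarts(10)
print(f"single run: x={single:.3f}, f={f(single):.3f}")
print(f"10 restarts: x={restarted:.3f}, f={f(restarted):.3f}")
```

The cost is linear in the number of restarts, so in practice a handful is usually a reasonable starting point.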

#### Subsampling
We have introduced a subsampling parameter that did not exist in the original algorithm. We did so because, unlike the original implementation, we are solving pre-existing scenarios whose timestep granularity does not always suit PILCO. For example, a long time horizon (300 steps) with only very small differences between consecutive states makes planning unnecessarily hard, since predictions have to be performed iteratively, with errors and uncertainty compounding. Subsampling, that is, using one in every 5 samples and repeating the same action for the steps in between, lets us deal with the same task over a much more manageable horizon of 60. Of course, since the same action is repeated 5 times, the control we apply is less finely tuned (alternatively, the system appears stiffer). We think this is a reasonable trade-off: a task that requires precise, high-frequency control over a long time horizon can fairly be considered hard, and it is reasonable for any algorithm to find it more challenging.
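A minimal sketch of a rollout with subsampling is shown below. The function `rollout_with_subsampling`, the toy system and the toy policy are all hypothetical; the point is only that a new action is chosen every `subs` environment steps and that only those decision points are recorded as data.

```python
def rollout_with_subsampling(step, policy, x0, env_steps, subs):
    """Run `env_steps` environment steps, choosing a new action only every
    `subs` steps and recording only the states at which an action is chosen."""
    x, recorded = x0, []
    for t in range(0, env_steps, subs):
        u = policy(x)
        recorded.append((x, u))
        for _ in range(min(subs, env_steps - t)):
            x = step(x, u)  # the same action is repeated
    return recorded

def toy_step(x, u):
    """Toy 1-D system with small per-step changes."""
    return x + 0.01 * u

def toy_policy(x):
    """Toy proportional controller."""
    return -x

data = rollout_with_subsampling(toy_step, toy_policy, x0=1.0, env_steps=300, subs=5)
print(len(data))  # 60 decision points instead of 300
```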

#### Initial conditions
PILCO takes in a mean and variance for the initial state of the task. Since planning is based on normally distributed and thus unimodal predictions, if the variance is very high (random initialisation anywhere in the eligible state space for example), PILCO can't use its model to learn. This was the case for the pendulum task (pendulum swing up), and when we fixed the initial state, PILCO solved the problem successfully.


#### Horizon - planning, training, testing

Another important aspect is the number of timesteps used by the different components of the algorithm.
These horizons are often assumed to be the same, and frequently are, but they need not be.

For planning, we set the number of time steps for which we plan using the model. It is set when the pilco object is first constructed. We want this to be long enough for PILCO to guide the system to the target state, but an overly long planning horizon can introduce unnecessary noise, increase the computational burden for no reason, and cause memory issues. We should think about the planning horizon together with the subsampling variable.

By training horizon we refer to the number of time steps per episode that are used for training the GP model. It should be greater than or equal to the corresponding planning horizon. Too short and data (and information) is wasted; too long and computational demands increase without real benefit.

The testing horizon is usually relevant for tasks where the system needs to be stabilised at a certain state and not just reach it. In these cases it might make sense to use a much longer testing horizon, to make sure the learned policy is successful in that regard.
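The relationships between the horizons can be captured in a small sanity check. The helper `check_horizons` is hypothetical, and it assumes that all horizons are counted in subsampled (model) time steps, so multiplying by the subsampling factor gives the number of real environment steps covered.

```python
def check_horizons(planning, training, testing, subs=1):
    """Summarise a horizon configuration. Horizons are assumed to be counted
    in subsampled time steps; `subs` is the subsampling factor."""
    warnings = []
    if training < planning:
        warnings.append("training horizon is shorter than the planning horizon")
    return {
        "env_steps_planned": planning * subs,
        "env_steps_per_training_episode": training * subs,
        "env_steps_per_test_episode": testing * subs,
        "warnings": warnings,
    }

summary = check_horizons(planning=60, training=60, testing=200, subs=5)
print(summary)
```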