# Planning

Planning means simulating candidate sequences of actions in a model of the environment and using the simulated outcomes to choose what to do, before taking an action in the real environment.
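As a rough illustration of that idea, the sketch below plans in a hypothetical 1-D world where the model is known exactly. The names `GOAL`, `model_step`, and `plan` are made up for this example, and the exhaustive search over short action sequences is only one simple way to plan, not one of the methods covered later.

```python
import itertools

GOAL = 5  # hypothetical target position for this toy example


def model_step(state, action):
    """Toy environment model: the action shifts the state by -1, 0 or +1.

    The reward is the negative distance to GOAL after the move.
    """
    next_state = state + action
    reward = -abs(GOAL - next_state)
    return next_state, reward


def plan(state, horizon=5, actions=(-1, 0, 1)):
    """Simulate every action sequence of length `horizon` in the model.

    Returns the first action of the best sequence and its simulated return.
    """
    best_return, best_seq = float("-inf"), None
    for seq in itertools.product(actions, repeat=horizon):
        s, total = state, 0.0
        for a in seq:
            s, r = model_step(s, a)  # simulated step, the real world is untouched
            total += r
        if total > best_return:
            best_return, best_seq = total, seq
    return best_seq[0], best_return


first_action, simulated_return = plan(state=0)
print(first_action, simulated_return)
```

Only the first action of the best sequence would be executed in the real environment. Enumerating every sequence grows exponentially with the horizon, which is why sampling-based methods such as CEM are used instead.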

Concepts covered:
1. Cross entropy method
2. Monte Carlo tree search
3. Probabilistic ensembles

## Cross Entropy Method

The Cross Entropy Method (CEM) is a gradient-free optimization method commonly used for planning in model-based reinforcement learning.

CEM algorithm:
1. Initialise a Gaussian distribution $N(\mu,\sigma)$ over the weights $\theta$ of the policy (a neural network or, as in the code below, a linear policy).
2. Sample a batch of $N$ candidates $\theta$ from the Gaussian.
3. Evaluate all $N$ candidates by their return, e.g. by running one trial per candidate.
4. Select the top-scoring fraction of the candidates (the elites) and compute the new $\mu$ and $\sigma$ from them to parameterise the next Gaussian distribution.
5. Repeat steps 2-4 until convergence.
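The steps above can be sketched on a toy problem using only NumPy. This is a minimal illustration and not the notebook's own implementation: the function `cem_minimal`, the quadratic `toy_objective`, and the defaults (100 samples, elite fraction 0.2, initial standard deviation 5) are assumptions chosen for the example.

```python
import numpy as np


def cem_minimal(objective, dim, n_samples=100, elite_frac=0.2,
                n_iters=30, init_sigma=5.0, seed=0):
    """Maximise `objective` over a `dim`-dimensional vector with CEM."""
    rng = np.random.default_rng(seed)
    mu = np.zeros(dim)                  # step 1: initial Gaussian
    sigma = np.full(dim, init_sigma)
    n_elite = max(1, int(n_samples * elite_frac))
    for _ in range(n_iters):            # step 5: repeat steps 2-4
        # step 2: sample a batch of candidates
        samples = rng.normal(mu, sigma, size=(n_samples, dim))
        # step 3: evaluate every candidate
        scores = np.array([objective(s) for s in samples])
        # step 4: keep the highest-scoring candidates and refit the Gaussian
        elite = samples[np.argsort(scores)[-n_elite:]]
        mu = elite.mean(axis=0)
        sigma = elite.std(axis=0)
    return mu, sigma


# Hypothetical objective: highest score when theta equals `target`.
target = np.array([1.0, -2.0, 0.5])


def toy_objective(theta):
    return -np.sum((theta - target) ** 2)


mu, sigma = cem_minimal(toy_objective, dim=3)
print(mu, sigma)
```

In the CartPole code below the same loop applies, with the objective replaced by the total reward of one episode and $\theta$ replaced by the policy weight matrix.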

In [1]:
import numpy as np
import tensorflow_probability as tfp
tfd = tfp.distributions
import gym
import warnings
warnings.filterwarnings("ignore")

In [2]:
# RL Gym
env = gym.make('CartPole-v1')

# Initialisation
n = 10  # number of candidate policies sampled per iteration
top_k = 0.40  # fraction of candidates kept as elites for the next iteration
mean = np.zeros((5,2))  # shape = (4 observation weights + 1 bias row, n_actions)
stddev = np.ones((5,2))  # same shape as mean

In [3]:
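def get_batch_weights(mean, stddev, n):
    # Independent Gaussian per weight; sample(n) has shape (n, 5, 2)
    mvn = tfd.MultivariateNormalDiag(
        loc=mean,
        scale_diag=stddev)
    return mvn.sample(n).numpy()

def policy(obs, weights):
    # Linear policy: rows 0-3 weight the 4 observations, row 4 is the bias;
    # the action with the higher score is chosen
    return np.argmax(obs @ weights[:4,:] + weights[4])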

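def run_trial(weights, render=False):
    # Assumes the classic gym API (gym < 0.26): reset() returns only the
    # observation and step() returns (obs, reward, done, info)
    obs = env.reset()
    done = False
    reward = 0  # total reward collected over the episode
    while not done:
        a = policy(obs, weights)
        obs, r, done, _ = env.step(a)
        reward += r
        if render:
            env.render()
    return reward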

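def get_new_mean_stddev(rewards, batch_weights):
    # Indices of the top-scoring candidates (the elites), best first;
    # uses the globals n and top_k defined above
    idx = np.argsort(rewards)[::-1][:int(n*top_k)]
    # Refit the Gaussian to the elites, element-wise over the weight matrix
    mean = np.mean(batch_weights[idx], axis=0)
    stddev = np.std(batch_weights[idx], axis=0)
    return mean, stddev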

In [4]:
for i in range(20):
    batch_weights = get_batch_weights(mean, stddev, n)
    rewards = [run_trial(weights) for weights in batch_weights]
    mean, stddev = get_new_mean_stddev(rewards, batch_weights)
    print(rewards)

[9.0, 10.0, 9.0, 27.0, 12.0, 8.0, 10.0, 37.0, 10.0, 12.0]
[13.0, 23.0, 9.0, 15.0, 8.0, 16.0, 23.0, 45.0, 9.0, 25.0]
[16.0, 33.0, 18.0, 30.0, 17.0, 17.0, 34.0, 21.0, 59.0, 35.0]
[27.0, 19.0, 17.0, 26.0, 27.0, 25.0, 36.0, 45.0, 30.0, 34.0]
[26.0, 28.0, 50.0, 11.0, 27.0, 27.0, 30.0, 71.0, 42.0, 29.0]
[33.0, 25.0, 31.0, 30.0, 27.0, 21.0, 29.0, 33.0, 37.0, 44.0]
[38.0, 37.0, 24.0, 34.0, 84.0, 20.0, 23.0, 28.0, 45.0, 52.0]
[29.0, 23.0, 25.0, 38.0, 28.0, 32.0, 67.0, 25.0, 30.0, 26.0]
[34.0, 27.0, 65.0, 32.0, 70.0, 38.0, 41.0, 24.0, 21.0, 27.0]
[28.0, 39.0, 38.0, 47.0, 31.0, 22.0, 35.0, 37.0, 45.0, 20.0]
[58.0, 20.0, 75.0, 23.0, 23.0, 36.0, 31.0, 27.0, 31.0, 29.0]
[35.0, 32.0, 41.0, 33.0, 40.0, 52.0, 28.0, 34.0, 28.0, 46.0]
[27.0, 24.0, 54.0, 52.0, 27.0, 29.0, 38.0, 42.0, 28.0, 47.0]
[30.0, 27.0, 27.0, 31.0, 38.0, 28.0, 30.0, 20.0, 85.0, 45.0]
[47.0, 24.0, 53.0, 68.0, 60.0, 49.0, 28.0, 32.0, 54.0, 79.0]
[24.0, 39.0, 25.0, 36.0, 101.0, 58.0, 24.0, 27.0, 37.0, 37.0]
[23.0, 63.0, 24.0, 34.0, 24.0, 25.0, 34.0, 54.0, 55.0, 41.0]
[39.0, 48.0, 23.0, 87.0, 38.0, 26.0, 48.0, 27.0, 23.0, 59.0]
[49.0, 32.0, 41.0, 51.0, 38.0, 27.0, 30.0, 46.0, 26.0, 26.0]
[42.0, 23.0, 40.0, 24.0, 31.0, 29.0, 51.0, 22.0, 35.0, 43.0]

In [5]:
mean, stddev

(array([[-2.97851359,  4.88814473],
        [ 0.54255219, -4.37263154],
        [ 0.3179081 ,  0.02582262],
        [-1.71460631,  0.07162347],
        [ 0.52532392, -0.24820291]]),
 array([[5.46232708e-03, 1.19883604e-02],
        [3.09637559e-04, 1.72948488e-02],
        [1.75588319e-02, 3.66150916e-05],
        [2.44638998e-03, 2.06700714e-05],
        [5.37123333e-04, 1.22797736e-02]]))

In [6]:
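# Draw one policy from the final distribution; stddev has nearly collapsed,
# so this sample is very close to `mean`
best_weights = get_batch_weights(mean, stddev, 1)[0]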

In [7]:
run_trial(best_weights, render=False)

37.0

## Monte Carlo Tree Search

Upcoming

## Probabilistic Ensembles

Upcoming