## Introduction:
The goal of this exercise is to implement a simple model-based reinforcement learning algorithm. First, we will learn a dynamics function that models observed state transitions; then we will use decision-time planning with the learned model to choose actions that maximize predicted rewards, following this [paper](https://arxiv.org/pdf/1708.02596.pdf).

Before we start, we install the packages needed to visualise the network (`graphviz`, `pydot`) and to run the environment (`gym`).

In [None]:
!pip install graphviz
!pip install pydot
!pip install gym

## Developing an RL cycle using OpenAI GYM
`Gym` is a toolkit for developing and comparing reinforcement learning algorithms. It ships with many built-in environments, such as CartPole and Pendulum. This [link](https://gym.openai.com/envs/) lists all defined environments.

<img src=img/rl.png width="400">

Import the required packages.

In [None]:
import numpy as np 
import gym
import matplotlib.pyplot as plt
import os
import pathlib
import shutil

### Environment

Create the environment

In [None]:
env = gym.make("Pendulum-v0")

In [None]:
obs_dim = env.observation_space.shape[0]
act_dim = env.action_space.shape[0]
print("the shape of the observation space: ", obs_dim)
print("the shape of the action space: ", act_dim)

The observation space of our system contains three measurements, $[\cos(\theta), \sin(\theta), \dot{\theta}]$. The task is to drive the pendulum to the position $\theta = 0$ (upright in this environment; the notebook calls it the rest position) and hold it there using the motor torque $a$.
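Because the angle is observed as a cosine and sine pair, it has no jump at $\pm\pi$, and $\theta$ can still be recovered when needed. A minimal sketch, assuming only NumPy; the helper name `angle_from_observation` is hypothetical and not part of Gym:

```python
import numpy as np

def angle_from_observation(obs):
    """Recover theta (radians, in [-pi, pi]) from [cos(theta), sin(theta), theta_dot]."""
    cos_th, sin_th, _ = obs
    return np.arctan2(sin_th, cos_th)

# Build an observation for a known angle and recover the angle again
theta = 2.5
obs = np.array([np.cos(theta), np.sin(theta), 0.3])
recovered = angle_from_observation(obs)
print(recovered)
```

The cost function later in this notebook uses the same `arctan2` trick to turn the observation back into an angle.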

## Random Policy

The following code lets the agent play four episodes with a random policy. In each episode the agent makes up to 100 moves, the game is rendered at every step, and the accumulated reward is printed at the end.

In [None]:
# play 4 games
number_episodes = 4
number_moves    = 100
for i in range(number_episodes):
    # initialize the environment
    env.reset()
    done = False
    game_rew = 0  # accumulated reward
    for j in range(number_moves):
        # choose a random action
        action = env.action_space.sample()
        # take a step in the environment
        new_obs, rew, done, info = env.step(action)
        game_rew += rew
        env.render()
        # when the episode is done, stop stepping; the next iteration resets the environment
        if done:
            print("Done")
            break
    print('Episode %d finished, reward: %.2f, length of the episode: %d' % (i, game_rew, j + 1))
env.close()

The environment is initialized by calling `reset()`. The inner loop then runs for at most `number_moves` (here 100) iterations. In each iteration, `env.action_space.sample()` samples a random action, `env.step()` executes it in the environment, and `render()` displays the current state of the game. In the end, the environment is closed by calling `env.close()`. The `step()` method returns four values that describe the interaction with the environment: the observation, the reward, the `done` flag, and an `info` dictionary.

Whenever `done` is True, this means that the episode has terminated and that the environment should be reset. 
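The reset-on-`done` pattern can be sketched without Gym installed. The `CountdownEnv` class below is a hypothetical stand-in that only mimics the four-value `step()` interface described above; it is not a real Gym environment.

```python
import random

class CountdownEnv:
    """Hypothetical stand-in for a Gym environment (four-value step API)."""

    def __init__(self, episode_length=5):
        self.episode_length = episode_length
        self.t = 0

    def reset(self):
        self.t = 0
        return 0.0  # initial observation

    def step(self, action):
        self.t += 1
        obs = float(self.t)
        reward = -abs(action)                  # smaller actions are rewarded
        done = self.t >= self.episode_length   # episode ends after a fixed length
        return obs, reward, done, {}

def run_episodes(env, num_steps, rng):
    """Step the environment and reset it whenever an episode terminates."""
    episode_returns = []
    obs, episode_return = env.reset(), 0.0
    for _ in range(num_steps):
        action = rng.uniform(-1.0, 1.0)        # random policy
        obs, reward, done, info = env.step(action)
        episode_return += reward
        if done:
            episode_returns.append(episode_return)
            obs, episode_return = env.reset(), 0.0
    return episode_returns

returns = run_episodes(CountdownEnv(episode_length=5), num_steps=12, rng=random.Random(0))
print(len(returns))
```

With 12 steps and episodes of length 5, two episodes finish and the third is still running when the loop ends.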

The instance attributes  `low` and `high` return the minimum and maximum values of the observation space

In [None]:
print("The minimum value of the observation space :", env.observation_space.low)
print("The maximum value of the observation space :", env.observation_space.high)

## Machine Learning with TF 2.X (Recap)

As a recap, this is the workflow we used in the previous exercises:

```python
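## Load Dataset
import tensorflow as tf

mnist = tf.keras.datasets.mnist
(x_train, y_train), (x_test, y_test) = mnist.load_data()
# Scale pixels to [0, 1] and add the channel axis expected by Conv2D
x_train = x_train[..., None] / 255.0
x_test = x_test[..., None] / 255.0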

## Define the model
model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, kernel_size=(3, 3), activation='relu', input_shape=(28, 28, 1)),
    tf.keras.layers.MaxPooling2D(pool_size=(2, 2)),
    tf.keras.layers.Conv2D(64, kernel_size=(3, 3), activation='relu'),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(10, activation='softmax')
])

## Define the loss function
loss_object = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False)

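## Create the Adam optimizer with a learning-rate schedule (example values)
lr_schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=1e-3, decay_steps=1000, decay_rate=0.9)
optimizer = tf.keras.optimizers.Adam(lr_schedule)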

## Compile the model
model.compile(optimizer=optimizer,
              loss=loss_object,
              metrics=['accuracy'])

## Train the model
model.fit(x_train, 
          y_train,
          epochs=3,
          validation_data=(x_test, y_test),
          verbose=1)
```

## Model-Based Reinforcement Learning with TF 2.X
Model-Based Reinforcement Learning consists primarily of two steps:
1. Learn a dynamics model
2. Plan an optimal action sequence using the model

In [None]:
import tensorflow as tf
if(int(tf.__version__[0]) <= 1):
    print('tensorflow {} detected; Please install tensorflow >= 2.0.0'.format(tf.__version__))
else:
    print('tensorflow {} detected'.format(tf.__version__))
    
from tensorflow.keras.utils import plot_model

import ml2_utils

In [None]:
# Set the random seed:
SEED = 999
tf.random.set_seed(SEED)
np.random.seed(SEED)

## Dynamics Model
We parameterize our learned dynamics function $f_\theta (s_t, a_t)$ as a deep neural network, where the parameter vector $\theta$ represents the network's weights. 

We do not train the network to predict the next state $s_{t+1}$ directly from the current state and action $(s_t, a_t)$. That function can be hard to learn when the states $s_t$ and $s_{t+1}$ are very similar and the action seems to have little effect on the output. The difficulty becomes more pronounced as the time between states, $\Delta t$, becomes small.

Note that increasing $\Delta t$ increases the information available from each data point, which can help both dynamics learning and planning with the learned model. However, increasing $\Delta t$ also makes the discretization of the underlying continuous-time dynamics coarser and the resulting mapping more complex, which makes learning harder.
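The effect of $\Delta t$ can be illustrated numerically. The sketch below uses a simplified pendulum with an explicit Euler step; the constants and the dynamics are illustrative assumptions, not the exact Gym implementation. It shows that the relative change between consecutive states shrinks as the time step shrinks, so a network that predicts $s_{t+1}$ directly would mostly learn to copy its input.

```python
import numpy as np

G, M, L = 10.0, 1.0, 1.0  # illustrative gravity, mass and length

def observe(theta, theta_dot):
    """Observation in the same layout as the notebook: [cos, sin, theta_dot]."""
    return np.array([np.cos(theta), np.sin(theta), theta_dot])

def relative_change(dt, theta=1.0, theta_dot=0.5, torque=1.0):
    """Return ||s_{t+1} - s_t|| / ||s_t|| for one Euler step of size dt."""
    theta_ddot = -(G / L) * np.sin(theta) + torque / (M * L ** 2)
    new_theta_dot = theta_dot + theta_ddot * dt
    new_theta = theta + new_theta_dot * dt
    s = observe(theta, theta_dot)
    s_next = observe(new_theta, new_theta_dot)
    return np.linalg.norm(s_next - s) / np.linalg.norm(s)

ratios = [relative_change(dt) for dt in (0.2, 0.05, 0.01)]
for dt, ratio in zip((0.2, 0.05, 0.01), ratios):
    print(f"dt={dt:<5} relative change={ratio:.3f}")
```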

We will learn a neural network dynamics model that encodes the change in state that results from executing the action $a_t$ from state $s_t$:
$$\hat{\Delta}_{t+1} = f_\theta (s_t, a_t)$$
such that the predicted next state is
$$ \hat{s}_{t+1} =  s_t + \hat{\Delta}_{t+1} $$

We will train $f_\theta$ in a standard supervised learning setup, by performing gradient descent on the following objective, where $\Delta_{t+1} = s_{t+1} - s_t$ is the observed state change:
$$L(\theta) =   \sum_{(s_t, a_t,s_{t+1} ) \in D}  \lVert (s_{t+1} - s_t) - f_\theta(s_t, a_t)\rVert_2^2$$
$$L(\theta) =   \sum_{(s_t, a_t,s_{t+1} ) \in D}  \lVert \Delta_{t+1} - \hat{\Delta}_{t+1}\rVert_2^2$$
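To make the objective concrete, here is a small NumPy sketch that evaluates it for a handful of made-up transitions. The function name `delta_loss` and the numbers are illustrative assumptions; the notebook itself uses the Keras mean-squared-error loss on normalized quantities.

```python
import numpy as np

def delta_loss(states, next_states, predicted_deltas):
    """Sum over transitions of the squared L2 error between true and predicted deltas."""
    true_deltas = next_states - states
    return np.sum(np.sum((true_deltas - predicted_deltas) ** 2, axis=1))

# Two made-up transitions with 3-dimensional states
states = np.array([[1.0, 0.0, 0.5], [0.9, 0.1, 0.4]])
next_states = np.array([[0.9, 0.1, 0.4], [0.8, 0.2, 0.3]])

perfect_prediction = next_states - states        # a model that predicts the true change
no_change_prediction = np.zeros_like(states)     # a model that predicts "nothing moves"

loss_perfect = delta_loss(states, next_states, perfect_prediction)
loss_no_change = delta_loss(states, next_states, no_change_prediction)
print(loss_perfect, loss_no_change)
```

The perfect prediction gives zero loss, while the "nothing moves" prediction is penalized by the squared size of the true state change.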


## Define the Model:
We will implement a neural network dynamics model and train it using a fixed dataset consisting of rollouts collected by a random policy.
<img src=img/5.png width="300">

In [None]:
def mlp(input_layer, hidden_layers, output_layer, activation=tf.tanh, last_activation=None):
    # Keras expects the input shape as a tuple (without the batch dimension)
    input_shape = (input_layer,)
    # generate input vector
    inputs = tf.keras.layers.Input(shape=input_shape, name='model_input')
    x = inputs
    # generate hidden layers
    for units in hidden_layers:
        x = tf.keras.layers.Dense(units,
                                   activation=activation)(x)
    
    # generate output vector
    output = tf.keras.layers.Dense(units=output_layer,
                                   activation=last_activation, 
                                   name='model_output')(x)
    # generate the model
    dynamic = tf.keras.models.Model(inputs,
                output,
                name='dynamic')
    return dynamic

In [None]:
# network parameters
input_layer = obs_dim + act_dim
# number of units per layer
hidden_layers = [64,64]
output_layer = obs_dim
# Define the model
dynamic = mlp(input_layer,hidden_layers, output_layer)
dynamic.summary()

In [None]:
plot_model(dynamic, to_file='mlp-model.png', show_shapes=True)

## Setup Training:
This cell sets up the loss and the optimizer used to train the dynamics model. The model receives the normalized state-action pair as input and is trained to predict the normalized state difference.

1. The loss function is the mean squared error between the normalized state difference and the normalized predicted state difference
2. Use the Adam optimizer with learning rate `learning_rate` to minimize the loss
3. Compile the model

In [None]:
# Define the loss function
loss_object = tf.keras.losses.MeanSquaredError()
# Create the Adam optimizer with the given learning rate
learning_rate = 0.001
optimizer = tf.keras.optimizers.Adam(learning_rate)

# This is a regression task, so track the mean absolute error instead of accuracy
dynamic.compile(optimizer=optimizer,
                loss=loss_object,
                metrics=['mae'])

Use the `TensorBoard` callback to write logs for visualizing the training:

In [None]:
def get_callbacks(name):
    # `logdir` is the global log directory defined in the Training section below
    return [
        tf.keras.callbacks.TensorBoard(log_dir=str(logdir / name)),
    ]

### Collecting training data:

`gather_rollouts` runs a random policy in the environment and stores every transition (state, action, next state, reward, done) in an `ml2_utils.Dataset`. We call it twice: once for the training set and once for the validation set.
1. Random policy: generates data for the model
2. Gather rollouts and save them in the dataset

In [None]:
def gather_rollouts(env, num_rollouts, max_rollout_length, render = False):
    dataset = ml2_utils.Dataset()
    for _ in range(num_rollouts):
        state = env.reset()
        done = False
        t = 0
        while not done:
            if render:
                env.render()
            # Random policy
            action = env.action_space.sample()
            next_state, reward, done, _ = env.step(action)
            done = done or (t >= max_rollout_length)
            dataset.add(state, action, next_state, reward, done)

            state = next_state
            t += 1
            
    if render:
        env.close()
    
    return dataset

In [None]:
# Define the hyperparameters
num_init_random_rollouts=250
max_rollout_length=500
render =False
print('Gathering random dataset')
random_dataset = gather_rollouts(env,num_init_random_rollouts, max_rollout_length,render)
print("The state mean: ", random_dataset.state_mean)
print("The state std: ",  random_dataset.state_std)
print("The action mean: ",random_dataset.action_mean)
print("The action std: ", random_dataset.action_std)
print("size of the random dataset: ", len(random_dataset))

Save the statistics of the random dataset, because we will reuse them several times:

In [None]:
# define a dictionary 
args = {}
args['state_mean'] =random_dataset.state_mean
args['state_std'] =random_dataset.state_std
args['action_mean']=random_dataset.action_mean
args['action_std'] =random_dataset.action_std
args['delta_state_mean'] =random_dataset.delta_state_mean
args['delta_state_std'] =random_dataset.delta_state_std

print(args)

In [None]:
num_init_random_rollouts_valid=10
max_rollout_length=500
render =False
print('Gathering validation dataset')
valid_dataset  = gather_rollouts(env,num_init_random_rollouts_valid, max_rollout_length,render)
print("The state mean: ", valid_dataset.state_mean)
print("The state std: ", valid_dataset.state_std)
print("The action mean: ", valid_dataset.action_mean)
print("The action std: ", valid_dataset.action_std)
print("size of the validation dataset: ", len(valid_dataset))

## Training Dynamics Function

1. Normalize both the states and actions in this buffer
2. Concatenate the normalized state and action
3. Pass the concatenated, normalized state-action tensor through a neural network. The resulting output is the normalized predicted difference between the next state and the current state
4. Compute the actual state difference
5. Normalize the state difference
6. Return the normalized state difference as labels and the normalized state-action tensor as features

**Note: to produce the predicted next state, you need to unnormalize the predicted state difference and add it to the current state.**
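The normalization round trip in the steps above can be sketched in plain NumPy. The `normalize` and `unnormalize` functions here are assumptions about what the `ml2_utils` helpers do (standardization with a small epsilon); the actual implementation may differ. The data is random and only serves to check shapes and the reconstruction of the next state.

```python
import numpy as np

EPS = 1e-8  # assumed guard against division by zero for constant features

def normalize(x, mean, std):
    return (x - mean) / (std + EPS)

def unnormalize(x_norm, mean, std):
    return x_norm * (std + EPS) + mean

rng = np.random.default_rng(0)
states = rng.normal(size=(100, 3))
actions = rng.uniform(-2.0, 2.0, size=(100, 1))
next_states = states + 0.1 * rng.normal(size=(100, 3))

deltas = next_states - states
stats = {
    "state": (states.mean(axis=0), states.std(axis=0)),
    "action": (actions.mean(axis=0), actions.std(axis=0)),
    "delta": (deltas.mean(axis=0), deltas.std(axis=0)),
}

# Features: normalized state and action, concatenated along the feature axis
features = np.concatenate(
    [normalize(states, *stats["state"]), normalize(actions, *stats["action"])], axis=1)
# Labels: normalized state difference
labels = normalize(deltas, *stats["delta"])

# A perfect model would output `labels`; undoing the normalization recovers s_{t+1}
reconstructed = states + unnormalize(labels, *stats["delta"])
print(features.shape, labels.shape)
print(np.allclose(reconstructed, next_states))
```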

In [None]:
def tf_dataset(states, actions, rewards, next_states, dones):
    #### Define the Features
    # Normalize both the states and actions in this buffer
    states_norm  = ml2_utils.normalize(states, args["state_mean"] ,  args["state_std"])
    actions_norm = ml2_utils.normalize(actions,args["action_mean"], args["action_std"])
    # Concatenate the normalized state and action
    input_layer  = tf.concat([states_norm, actions_norm], axis=1)
    
    #### Define the Labels
    # the actual state difference
    diff = next_states - states
    # Normalize it with the delta-state statistics of the random dataset
    diff_norm = ml2_utils.normalize(diff, args["delta_state_mean"], args["delta_state_std"])
    yield input_layer,diff_norm

In [None]:
dataset_tf = tf.data.Dataset.from_generator(tf_dataset, 
                                            output_types =(tf.float64,tf.float64),
                                            output_shapes = (tf.TensorShape([None,4]), tf.TensorShape([None,3])),
                                            args = (random_dataset.list_2_np()),
                                            ).unbatch()

batch_size = 200
batched_dataset_tf = dataset_tf.batch(batch_size=batch_size)

In [None]:
valid_dataset_tf = tf.data.Dataset.from_generator(tf_dataset, 
                                            output_types =(tf.float64,tf.float64),
                                            output_shapes = (tf.TensorShape([None,4]), tf.TensorShape([None,3])),
                                            args = (valid_dataset.list_2_np())
                                            )

## Training

In [None]:
logdir = pathlib.Path.home() / '.keras' /"tensorboard_logs"
# Delete an entire directory tree
shutil.rmtree(logdir, ignore_errors=True)

In [None]:
name = "model"

dynamic.fit(batched_dataset_tf,
            epochs=12,
            validation_data= valid_dataset_tf,
            callbacks=get_callbacks(name),
            verbose=1)

### View in TensorBoard
Open an embedded  TensorBoard viewer inside a notebook:

In [None]:
%load_ext tensorboard 

In [None]:
#docs_infra: no_execute
%tensorboard --logdir {logdir}

### Model Prediction

In [None]:
def predict(dynamic,state,action,args):
    # Normalize both the state and action
    states_norm = ml2_utils.normalize(state, args["state_mean"],   args["state_std"])
    actions_norm = ml2_utils.normalize(action, args["action_mean"],args["action_std"])
    # Concatenate the normalized state and action
    # Batch Case
    if len(actions_norm.shape)>1:
        input_layer =  tf.concat([states_norm, actions_norm], axis=1)
    else:
        input_layer = tf.concat([states_norm, actions_norm], axis=0)
        input_layer = tf.expand_dims(input_layer,0)
    # Pass the concatenated, normalized state-action tensor through a neural network. 
    # The resulting output is the normalized predicted difference between the next state and the current state
    pred_diff_norm = dynamic.predict(input_layer)
    # Unnormalize the predicted state difference
    pred_diff = ml2_utils.unnormalize(pred_diff_norm, args["delta_state_mean"],args["delta_state_std"])
    # The next State
    next_state = state +  pred_diff
    return next_state

### Model Evaluation

To evaluate the model $H$ steps into the future, we first run a random action sequence on the real system and save the resulting trajectory.
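The idea behind this open-loop evaluation can be sketched with a toy one-dimensional system. Both step functions below are hypothetical: the "real" system and the "model" differ by a small parameter error, which is enough to show how the prediction error grows with the number of steps.

```python
import numpy as np

def rollout(step_fn, init_state, actions):
    """Apply the same action sequence step by step and collect all visited states."""
    states = [init_state]
    for action in actions:
        states.append(step_fn(states[-1], action))
    return np.asarray(states)

def real_step(state, action):
    # hypothetical real system
    return 1.05 * state + action

def model_step(state, action):
    # hypothetical learned model with a small parameter error
    return 1.00 * state + action

horizon = 15
actions = np.full(horizon, 0.1)   # fixed action sequence used for both rollouts
real_states = rollout(real_step, 1.0, actions)
pred_states = rollout(model_step, 1.0, actions)

# Per-step absolute prediction error; step 0 is the shared initial state
errors = np.abs(real_states - pred_states)
print(errors)
```

Because each prediction is fed back into the model, the small one-step error compounds over the horizon.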

In [None]:
horizon= 15
# Reset the environment:
init_state = env.reset()
# Lists to save the observed states and the used actions
state_seq = [init_state]
used_action_seq= []
state = init_state
# Start the episode
for i in range(horizon):
    action = env.action_space.sample()
    state, reward, done , _ = env.step(action)
    # append the next observations and used actions to lists to plot them
    state_seq.append(state)
    used_action_seq.append(action)
    env.render()
env.close()
# convert to numpy array
state_seq = np.asarray(state_seq)

Run the same action sequence on the model

In [None]:
state = init_state
# List to save the predicted states
pred_state_seq = [init_state]
# Start the episode
for action in used_action_seq:
    next_state = predict(dynamic,state,action,args)  
    # predict returns a batch of size 1; take its only row
    state = next_state[0]
    # append the next observations to list to plot them
    pred_state_seq.append(state)
# convert to numpy array
pred_state_seq = np.asarray(pred_state_seq)

Plot the result

In [None]:
# compare the model prediction with the real trajectory
fig1, (ax1, ax2, ax3) = plt.subplots(figsize=(20,30), nrows=3, ncols=1)

# plot the predicted state
ax1.plot(np.arange(horizon+1), pred_state_seq[:,0], 'o-',label='model prediction state 0')
ax2.plot(np.arange(horizon+1), pred_state_seq[:,1], 'o-',label='model prediction state 1')
ax3.plot(np.arange(horizon+1), pred_state_seq[:,2], 'o-',label='model prediction state 2')

# plot real values
ax1.plot(np.arange(horizon+1), state_seq[:,0], 'o-',label='real state 0')
ax2.plot(np.arange(horizon+1), state_seq[:,1], 'o-',label='real state 1')
ax3.plot(np.arange(horizon+1), state_seq[:,2], 'o-',label='real state 2')

# plot the used action
ax3.plot(np.arange(horizon), used_action_seq[:], 'o-',label='used action')

# set axis labels
for ax in (ax1, ax2, ax3):
    ax.set_xlabel('step',fontsize='x-large')
ax1.set_ylabel('cos(θ)' ,fontsize='x-large')
ax2.set_ylabel('sin(θ)' ,fontsize='x-large')
ax3.set_ylabel('dθ/dt'   ,fontsize='x-large')

# plot legend
ax1.legend(loc='best',fontsize='x-large')
ax2.legend(loc='best',fontsize='x-large')
ax3.legend(loc='best',fontsize='x-large')
fig1.show()

## Action Selection
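Given the learned dynamics model, we now want to select and execute actions that maximize a known reward function (decision-time planning):
$$ (a^*_t, \dots, a^*_{t+H-1}) = \arg \max_{a_t, \dots, a_{t+H-1}} \sum_{t'=t}^{t+H-1} r(\hat{s}_{t'},a_{t'})$$
$$\text{s.t.}\; \hat{s}_{t} = s_t, \quad \hat{s}_{t'+1} = \hat{s}_{t'} + f_\theta ( \hat{s}_{t'}, a_{t'})$$
In the code we minimize an accumulated cost instead, which is equivalent when the cost is the negative reward.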

<img src=https://imgur.com/lJA1kXQ.png width="400">


However, executing the whole planned sequence in open loop is impractical: the learned dynamics model is imperfect, so prediction errors accumulate over time and planning far into the future becomes very inaccurate.

We will approximate the solution with a simple random-sampling method (gradient-free optimization): we sample $k$ random action sequences of length $H$, use the model to predict the future states for each sequence, evaluate the accumulated reward (or cost) of each candidate, and finally select the best sequence and return its first action.

<img src=https://imgur.com/6gJcbv4.png width="400">
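The sampling procedure can be sketched end to end on a toy problem before applying it to the pendulum. Everything below is an illustrative assumption: `toy_model` stands in for the learned dynamics model, `toy_cost` for the cost function, and the state is a single number that should be driven to zero.

```python
import numpy as np

def toy_model(states, actions):
    """Hypothetical stand-in for the learned model: s' = s + a."""
    return states + actions

def toy_cost(states, actions):
    """Penalize distance from zero and, slightly, the size of the action."""
    return states[:, 0] ** 2 + 0.001 * actions[:, 0] ** 2

def random_shooting(init_state, model, cost, num_sequences, horizon, low, high, rng):
    # 1. sample k random action sequences of length H
    action_sequences = rng.uniform(low, high, size=(num_sequences, horizon, 1))
    # 2. roll all candidates forward in the model as one batch
    states = np.repeat(init_state[None, :], num_sequences, axis=0)
    total_costs = np.zeros(num_sequences)
    for t in range(horizon):
        actions = action_sequences[:, t, :]
        total_costs += cost(states, actions)   # 3. accumulate the cost
        states = model(states, actions)
    # 4. pick the best sequence and return only its first action
    best = np.argmin(total_costs)
    return action_sequences[best, 0], total_costs[best]

rng = np.random.default_rng(0)
first_action, best_cost = random_shooting(
    np.array([1.5]), toy_model, toy_cost,
    num_sequences=2000, horizon=5, low=-1.0, high=1.0, rng=rng)
print(first_action, best_cost)
```

Starting from 1.5, the selected first action is negative, because moving toward zero lowers the accumulated cost.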

In [None]:
num_random_action_selection= 3000
mpc_horizon=10
action_dim       = env.action_space.shape[0]
action_space_low = env.action_space.low
action_space_high= env.action_space.high 
action_sequences = tf.random.uniform(
            shape=[num_random_action_selection, mpc_horizon, action_dim],
            minval=action_space_low,
            maxval=action_space_high,
            dtype=tf.float64
        )
print("The Shape of Actions: ", action_sequences.shape)
print("The first sequence: ", action_sequences[0])

In [None]:
# Define the Cost
costs = tf.zeros(num_random_action_selection, dtype=tf.float64)
print("The Shape of costs: ", costs.shape)

### Cost Function: 

We try to stabilize the pendulum in its rest position. Therefore, we define the cost function to achieve three goals:
1. Theta should be zero (rest position).
2. The rotation speed should also be damped. When our pendulum reaches its rest position, and the rotation speed is higher than zero, it will not stay there.
3. The torque should be as small as possible because we do not want to consume infinite energy to reach our goal.

* **Note:** We can also approximate the cost function with our model

In [None]:
def cost_fun(states,actions):
    cos_th = states[:,0]
    sin_th = states[:,1]
    th_dot = states[:,2]
    th = np.arctan2(sin_th,cos_th)
    th_normalize = (((th+np.pi) % (2*np.pi)) - np.pi)
    # one clipped torque per candidate (column 0 of the batch), not only the first row
    action = np.clip(actions,-2.0, 2.0)[:, 0]
    costs = (th_normalize ** 2 + .1 * th_dot ** 2 + .001 * (action ** 2))
    return costs
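As a quick sanity check of this kind of cost, the following standalone NumPy sketch evaluates it for two hand-picked states. The function names `wrap_angle` and `pendulum_cost` are hypothetical; the weights are the ones used in `cost_fun`.

```python
import numpy as np

def wrap_angle(theta):
    """Map an angle to the interval [-pi, pi)."""
    return ((theta + np.pi) % (2 * np.pi)) - np.pi

def pendulum_cost(states, actions):
    """states: (N, 3) rows [cos, sin, theta_dot]; actions: (N, 1) torques."""
    theta = wrap_angle(np.arctan2(states[:, 1], states[:, 0]))
    theta_dot = states[:, 2]
    torque = np.clip(actions, -2.0, 2.0)[:, 0]   # one torque per row
    return theta ** 2 + 0.1 * theta_dot ** 2 + 0.001 * torque ** 2

# Row 0: theta = 0, at rest, no torque. Row 1: theta = pi, at rest, no torque.
states = np.array([[1.0, 0.0, 0.0],
                   [-1.0, 0.0, 0.0]])
actions = np.zeros((2, 1))
example_costs = pendulum_cost(states, actions)
print(example_costs)
```

The state at $\theta = 0$ has zero cost, while the opposite position is penalized with $\pi^2$, the largest possible angle term.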

### Plan in the Model:
We have only one initial state and 3000 action sequence candidates. We will use our model to estimate the predicted state sequence for each of these candidates. Then we can use the cost function to calculate the value of each sequence.

In [None]:
init_state = env.reset()

Our model can process a single sample or a batch of samples. Evaluating all candidates as one batch makes the search for the best trajectory much more efficient, so we will use batches. For example:

In [None]:
# Batch of the first action in all action sequence candidates.
action1 = action_sequences[:, 0, :]
print(action1.shape)

In [None]:
# use the same initial state for 3000 trajectories 
states = tf.stack([init_state] * num_random_action_selection)

print(states.shape)

In [None]:
states

In [None]:
for t in range(mpc_horizon):
    # batch of the t-th action of every candidate sequence
    actions      = action_sequences[:, t, :]
    next_states  = predict(dynamic,states,actions,args) 
    # accumulate the cost of the current states and actions
    costs +=cost_fun(states, actions)
    # move every candidate one step forward in the model
    states = next_states
    
# action sequence with the lowest accumulated cost
action_seq  = action_sequences[tf.argmin(costs)]
# the first action of that sequence is the one to execute
# best_action = action_seq[0]

In [None]:
# accumulated cost of the best action sequence
tf.reduce_min(costs)

Run the optimal action sequence on the real system and then in the model (for evaluation)

In [None]:
# List to save the real observations
state_seq = [init_state]
state = init_state
# Start the episode
for action in action_seq:
    # execute the planned action (not a random one) on the real system
    state, reward, done , _ = env.step(np.asarray(action))
    # append the next observations and used actions to lists to plot them
    state_seq.append(state)
    env.render()
env.close()
# convert to numpy array
state_seq = np.asarray(state_seq)

In [None]:
state = init_state
# List to save the predicted states
pred_state_seq = [init_state]
# Start the episode
for action in action_seq:
    next_state = predict(dynamic,state,action,args)  
    # predict returns a batch of size 1; take its only row
    state = next_state[0]
    # append the next observations to list to plot them
    pred_state_seq.append(state)
# convert to numpy array
pred_state_seq = np.asarray(pred_state_seq)

In [None]:
# compare the model prediction with the real trajectory
fig1, (ax1, ax2, ax3) = plt.subplots(figsize=(20,30), nrows=3, ncols=1)

# plot the predicted state
ax1.plot(np.arange(mpc_horizon+1), pred_state_seq[:,0], 'o-',label='model prediction state 0')
ax2.plot(np.arange(mpc_horizon+1), pred_state_seq[:,1], 'o-',label='model prediction state 1')
ax3.plot(np.arange(mpc_horizon+1), pred_state_seq[:,2], 'o-',label='model prediction state 2')

# plot real values
ax1.plot(np.arange(mpc_horizon+1), state_seq[:,0], 'o-',label='real state 0')
ax2.plot(np.arange(mpc_horizon+1), state_seq[:,1], 'o-',label='real state 1')
ax3.plot(np.arange(mpc_horizon+1), state_seq[:,2], 'o-',label='real state 2')

# plot the used action
ax3.plot(np.arange(mpc_horizon), action_seq[:], 'o-',label='used action')

# set axis labels
for ax in (ax1, ax2, ax3):
    ax.set_xlabel('step',fontsize='x-large')
ax1.set_ylabel('cos(θ)' ,fontsize='x-large')
ax2.set_ylabel('sin(θ)' ,fontsize='x-large')
ax3.set_ylabel('dθ/dt'   ,fontsize='x-large')

# plot legend
ax1.legend(loc='best',fontsize='x-large')
ax2.legend(loc='best',fontsize='x-large')
ax3.legend(loc='best',fontsize='x-large')
fig1.show()

### Replanning:
<img src=img/4.png width="400">

1. Execute the first planned action $a_t$ and observe the next state $s_{t+1}$
2. Use the model again to optimize the action sequence $a_{t+1}, \dots, a_{t+H}$
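The two steps above form a closed loop, often called model predictive control (MPC). The sketch below shows why replanning helps, using a toy one-dimensional system in which the planning model is deliberately wrong. All functions and constants are illustrative assumptions and do not use the pendulum or the trained network.

```python
import numpy as np

def true_step(state, action):
    """Hypothetical real system: the action is only 80% effective."""
    return state + 0.8 * action

def model_step(states, actions):
    """Hypothetical imperfect model: assumes the action is fully effective."""
    return states + actions

def plan_first_action(state, rng, num_sequences=1000, horizon=3):
    """Random shooting in the model; return the first action of the best sequence."""
    sequences = rng.uniform(-1.0, 1.0, size=(num_sequences, horizon, 1))
    states = np.repeat(state[None, :], num_sequences, axis=0)
    costs = np.zeros(num_sequences)
    for t in range(horizon):
        states = model_step(states, sequences[:, t, :])
        costs += states[:, 0] ** 2            # cost: squared distance from zero
    return sequences[np.argmin(costs), 0]

rng = np.random.default_rng(0)
state = np.array([2.0])
for step in range(15):
    action = plan_first_action(state, rng)    # plan with the model
    state = true_step(state, action)          # execute only the first action, observe
final_distance = abs(state[0])
print(final_distance)
```

Although the model overestimates the effect of every action, the loop still drives the state close to zero, because each new plan starts from the state that was actually observed.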


In [None]:
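# Replanning (MPC): plan, execute only the first action, observe, and plan again
# start from a real environment state, not from the last model prediction
state = env.reset()
for i in range(50):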
    actions = tf.random.uniform(
            shape=[num_random_action_selection, mpc_horizon, action_dim],
            minval=action_space_low,
            maxval=action_space_high,
            dtype=tf.float64
        )
    init_state = state
    states = tf.stack([init_state] * num_random_action_selection)
    costs = 0
    
    # find the optimal action sequence 
    for t in range(mpc_horizon):
        # Normalize the state and the action
        states_norm = ml2_utils.normalize(states, random_dataset.state_mean,   random_dataset.state_std)
        actions_norm = ml2_utils.normalize(actions[:, t, :], random_dataset.action_mean,random_dataset.action_std)
        input_layer = tf.concat([states_norm, actions_norm], axis=1)
        # The resulting output is the normalized predicted difference between the next state and the current state
        pred_diffs_norm = dynamic.predict(input_layer)
        # calculate the cost
        costs +=cost_fun(states, actions[:, t, :])                    
        # The next State
        next_states = states + ml2_utils.unnormalize(pred_diffs_norm, random_dataset.delta_state_mean,random_dataset.delta_state_std)
        states = next_states
    print("the cost of best action sequence: ", tf.reduce_min(costs))
    # the first action of the sequence that minimizes the cost function
    best_action = actions[tf.argmin(costs)][0]
    # run only this action on the real system, then replan from the observed state
    state, reward, done , _ = env.step(np.asarray(best_action))
    env.render()

In [None]:
env.close()