# Tutorial Six

This tutorial creates a parallel environment for the pendulum task, learns a dynamics model from random rollouts,
and then uses MPC to collect more samples and refine the model iteratively.
Finally, it loads the saved model so that it can be trained further or used directly for control.
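
The loop described above (random rollouts, fit a model, collect more data with MPC, refit) can be sketched on a toy problem without tf_neuralmpc. Everything in this block is an illustrative assumption: the scalar linear system, the least-squares model, the random-shooting planner and all function names are hypothetical stand-ins, not what the library does internally.

```python
import numpy as np

rng = np.random.default_rng(0)
A_TRUE, B_TRUE = 0.9, 0.5  # hypothetical toy system, unknown to the learner


def step(x, u):
    """True dynamics of the toy system: x' = a*x + b*u."""
    return A_TRUE * x + B_TRUE * u


def fit_model(data):
    """Least-squares fit of (a, b) from (x, u, x_next) transitions."""
    inputs = np.array([(x, u) for x, u, _ in data])
    targets = np.array([x_next for _, _, x_next in data])
    theta, *_ = np.linalg.lstsq(inputs, targets, rcond=None)
    return theta


def mpc_action(theta, x, horizon=10, n_candidates=64):
    """Random-shooting MPC: roll candidate action sequences through the
    learned model and return the first action of the cheapest sequence."""
    a, b = theta
    candidates = rng.uniform(-1.0, 1.0, size=(n_candidates, horizon))
    xs = np.full(n_candidates, x)
    cost = np.zeros(n_candidates)
    for t in range(horizon):
        xs = a * xs + b * candidates[:, t]
        cost += xs ** 2  # drive the state towards zero
    return candidates[np.argmin(cost), 0]


def rollout(policy, length=20):
    """Run one episode on the true system and record its transitions."""
    x, data = rng.uniform(-2.0, 2.0), []
    for _ in range(length):
        u = policy(x)
        x_next = step(x, u)
        data.append((x, u, x_next))
        x = x_next
    return data


# 1) random rollouts, 2) initial fit, 3) alternate MPC collection and refitting
dataset = []
for _ in range(5):
    dataset += rollout(lambda x: rng.uniform(-1.0, 1.0))
theta = fit_model(dataset)
for _ in range(2):
    dataset += rollout(lambda x: mpc_action(theta, x))
    theta = fit_model(dataset)
```

The data gathered under the MPC policy comes from the states the controller actually visits, which is the motivation for refining the model on it rather than relying on random rollouts alone.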

In [1]:
from tf_neuralmpc.environment_utils import EnvironmentWrapper
import logging
from tf_neuralmpc.dynamics_functions import DeterministicMLP
from tf_neuralmpc.dynamics_handlers.system_dynamics_handler import SystemDynamicsHandler
from tf_neuralmpc.examples.cost_funcs import pendulum_actions_reward_function, pendulum_state_reward_function
from tf_neuralmpc import Runner
import tensorflow as tf
logging.getLogger().setLevel(logging.INFO)

In [2]:
number_of_agents = 2
log_path = './tutorial_6'
single_env, parallel_env = EnvironmentWrapper.make_standard_gym_env("Pendulum-v0", random_seed=0,
                                                                    num_of_agents=number_of_agents)
my_runner = Runner(env=[single_env, parallel_env],
                   log_path=log_path,
                   num_of_agents=number_of_agents)

Now define the dynamics model architecture.

In [3]:
state_size = single_env.observation_space.shape[0]
input_size = single_env.action_space.shape[0]
dynamics_function = DeterministicMLP()
dynamics_function.add_layer(state_size + input_size,
                            32, activation_function=tf.math.tanh)
dynamics_function.add_layer(32, 32, activation_function=tf.math.tanh)
dynamics_function.add_layer(32, 32, activation_function=tf.math.tanh)
dynamics_function.add_layer(32, state_size)
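
To make the shape of this network concrete, here is a minimal NumPy sketch of an MLP with the same layer sizes. It is not the DeterministicMLP implementation: the random initialisation, the `predict` helper, and the assumption that the last layer is linear (no activation was passed to it above) are all illustrative. For Pendulum-v0 the observation has 3 entries (cos(theta), sin(theta), angular velocity) and the action has 1 (torque).

```python
import numpy as np

state_size, input_size = 3, 1  # Pendulum-v0 observation and action sizes
layer_sizes = [state_size + input_size, 32, 32, 32, state_size]

rng = np.random.default_rng(0)
# One (weights, bias) pair per layer; the initial values are arbitrary here.
params = [(rng.normal(scale=0.1, size=(n_in, n_out)), np.zeros(n_out))
          for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]


def predict(states, actions):
    """Forward pass: concatenate state and action, apply tanh on the hidden
    layers, and leave the output layer linear (an assumption)."""
    h = np.concatenate([states, actions], axis=-1)
    for i, (weights, bias) in enumerate(params):
        h = h @ weights + bias
        if i < len(params) - 1:
            h = np.tanh(h)
    return h


batch_out = predict(np.zeros((5, state_size)), np.zeros((5, input_size)))
n_params = sum(weights.size + bias.size for weights, bias in params)
```

The output has one row per input sample and `state_size` columns. Whether the library interprets that output as the next state or as a state difference is not stated in this tutorial.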

Now learn the dynamics model from the random rollouts, then collect more samples with the MPC controller to fine-tune the model further. Note: the total number of rollouts is number_of_agents * number_of_rollouts.
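
As a quick check of the note above, the arithmetic for the values used in this tutorial looks as follows. The transition count assumes every rollout runs for the full task_horizon of 200 steps, which matches the "Rollout length: 200" lines in the log output.

```python
number_of_agents = 2
number_of_initial_rollouts = 20
number_of_rollouts_for_refinement = 2
task_horizon = 200

# Each agent runs its own copy of every requested rollout.
initial_rollouts = number_of_agents * number_of_initial_rollouts
rollouts_per_refinement_step = number_of_agents * number_of_rollouts_for_refinement

# Assumes no rollout terminates before task_horizon steps.
initial_transitions = initial_rollouts * task_horizon

print(initial_rollouts, rollouts_per_refinement_step, initial_transitions)
```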

In [4]:
system_dynamics_handler, mpc_policy = my_runner.learn_dynamics_iteratively_w_mpc(number_of_initial_rollouts=20,
                                                                                 number_of_rollouts_for_refinement=2,
                                                                                 number_of_refinement_steps=2,
                                                                                 dynamics_function=dynamics_function,
                                                                                 task_horizon=200,
                                                                                 planning_horizon=40,
                                                                                 state_reward_function=pendulum_state_reward_function,
                                                                                 actions_reward_function=pendulum_actions_reward_function,
                                                                                 optimizer_name='PI2',
                                                                                 exploration_noise=True)

INFO:root:Started collecting samples for rollouts
INFO:root:Average action selection time: 0.00036170363426208497
INFO:root:Rollout length: 200
INFO:root:Average action selection time: 0.00027430057525634766
INFO:root:Rollout length: 200
INFO:root:Average action selection time: 0.0002607285976409912
INFO:root:Rollout length: 200
INFO:root:Average action selection time: 0.0002713155746459961
INFO:root:Rollout length: 200
INFO:root:Average action selection time: 0.0003305566310882568
INFO:root:Rollout length: 200
INFO:root:Average action selection time: 0.0002593553066253662
INFO:root:Rollout length: 200
INFO:root:Average action selection time: 0.000253450870513916
INFO:root:Rollout length: 200
INFO:root:Average action selection time: 0.0003247642517089844
INFO:root:Rollout length: 200
INFO:root:Average action selection time: 0.0002527165412902832
INFO:root:Rollout length: 200
INFO:root:Average action selection time: 0.0002457737922668457
INFO:root:Rollout length: 200
INFO:root:Average a

Instructions for updating:
If using Keras pass *_constraint arguments to layers.


INFO:tensorflow:Assets written to: ./tutorial_6/saved_model/assets


INFO:tensorflow:Assets written to: ./tutorial_6/saved_model/assets
INFO:root:Trained initial system model
INFO:root:Started collecting samples for rollouts
INFO:root:Average action selection time: 0.09817265152931214
INFO:root:Rollout length: 200
INFO:root:Average action selection time: 0.09045002818107604
INFO:root:Rollout length: 200
INFO:root:Finished collecting samples for rollout
INFO:root:Started the system training
INFO:root:Saving the model now....


INFO:tensorflow:Assets written to: ./tutorial_6/saved_model/assets


INFO:tensorflow:Assets written to: ./tutorial_6/saved_model/assets
INFO:root:Started collecting samples for rollouts
INFO:root:Average action selection time: 0.0816757893562317
INFO:root:Rollout length: 200
INFO:root:Average action selection time: 0.08248288035392762
INFO:root:Rollout length: 200
INFO:root:Finished collecting samples for rollout
INFO:root:Started the system training
INFO:root:Saving the model now....


INFO:tensorflow:Assets written to: ./tutorial_6/saved_model/assets


INFO:tensorflow:Assets written to: ./tutorial_6/saved_model/assets
INFO:root:Started collecting samples for rollouts
INFO:root:Average action selection time: 0.08029894351959228
INFO:root:Rollout length: 200
INFO:root:Average action selection time: 0.08095643043518067
INFO:root:Rollout length: 200
INFO:root:Finished collecting samples for rollout
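
The call above selects the trajectory optimizer with optimizer_name='PI2' (Policy Improvement with Path Integrals). The sketch below shows the general idea behind path-integral-style optimizers: perturb a mean action sequence, score each sample, and move the mean towards a reward-weighted average of the perturbations. It is a generic illustration and not the library's implementation; `weighted_update`, `toy_return`, the temperature and the noise scale are all hypothetical.

```python
import numpy as np


def weighted_update(mean, return_fn, rng, n_samples=256,
                    noise_std=0.5, temperature=1.0):
    """One path-integral-style update of a mean action sequence."""
    noise = rng.normal(scale=noise_std, size=(n_samples,) + mean.shape)
    candidates = mean + noise
    returns = np.array([return_fn(c) for c in candidates])
    # Softmax over returns; subtracting the max keeps exp() numerically stable.
    weights = np.exp((returns - returns.max()) / temperature)
    weights /= weights.sum()
    # New mean: old mean plus the reward-weighted average perturbation.
    return mean + np.tensordot(weights, noise, axes=1)


def toy_return(actions):
    """Hypothetical return: highest when every action equals 1."""
    return -np.sum((actions - 1.0) ** 2)


rng = np.random.default_rng(0)
mean = np.zeros(5)  # planning horizon of 5 scalar actions
new_mean = weighted_update(mean, toy_return, rng)
```

In an MPC setting the return of each candidate would come from rolling it through the learned dynamics model and summing the state and action rewards over the planning horizon.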


In [5]:
%load_ext tensorboard
%tensorboard --logdir {log_path}

Reusing TensorBoard on port 6009 (pid 85723), started 0:10:19 ago. (Use '!kill 85723' to kill it.)

In [6]:
my_runner.record_rollout(horizon=500, policy=mpc_policy,
                         record_file_path=log_path+'/episode_1')

Now load the saved model and create another policy using the loaded dynamics model.

In [7]:
new_log_path = './tutorial_6_loaded_model'
new_systems_dynamics_handler = SystemDynamicsHandler(dim_O=state_size,
                                                     dim_U=input_size,
                                                     num_of_agents=1,
                                                     log_dir=new_log_path,
                                                     saved_model_dir=log_path+'/saved_model',
                                                     load_saved_model=True)
new_mpc_controller = my_runner.make_mpc_policy(system_dynamics_handler=new_systems_dynamics_handler,
                                               state_reward_function=pendulum_state_reward_function,
                                               actions_reward_function=pendulum_actions_reward_function,
                                               planning_horizon=50,
                                               optimizer_name='PI2',
                                               true_model=False)

INFO:root:Loading the saved model now....


In [8]:
my_runner.record_rollout(horizon=500, policy=new_mpc_controller,
                         record_file_path=new_log_path+'/episode_1')