# Tutorial Five

This tutorial creates a parallel version of the pendulum environment, learns a dynamics model from random rollouts,
and then uses MPC to collect more samples and refine the model iteratively.
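
To make the iterative scheme concrete before using the library, here is a minimal sketch of the same loop on a toy problem. Everything in it is an assumption made for illustration: the 1-D linear system, the least-squares model, the random-shooting planner, and all function names are hypothetical and are not part of `tf_neuralmpc`.

```python
import random

def true_step(x, u):
    """Toy ground-truth dynamics standing in for the real environment."""
    return 0.9 * x + 0.5 * u

def fit_linear_model(data):
    """Least-squares fit of x' ~ a*x + b*u via the 2x2 normal equations."""
    sxx = sum(x * x for x, u, y in data)
    sxu = sum(x * u for x, u, y in data)
    suu = sum(u * u for x, u, y in data)
    sxy = sum(x * y for x, u, y in data)
    suy = sum(u * y for x, u, y in data)
    det = sxx * suu - sxu * sxu
    return (sxy * suu - suy * sxu) / det, (suy * sxx - sxy * sxu) / det

def mpc_action(model, x0, rng, horizon=10, candidates=64):
    """Random-shooting MPC: first action of the cheapest sampled sequence."""
    a, b = model
    best_cost, best_u0 = float("inf"), 0.0
    for _ in range(candidates):
        seq = [rng.uniform(-1.0, 1.0) for _ in range(horizon)]
        x, cost = x0, 0.0
        for u in seq:
            x = a * x + b * u          # predict with the learned model
            cost += x * x + 0.01 * u * u
        if cost < best_cost:
            best_cost, best_u0 = cost, seq[0]
    return best_u0

def collect_rollout(policy, rng, steps=30):
    """Run one rollout on the true system and return (x, u, x_next) tuples."""
    x, data = rng.uniform(-2.0, 2.0), []
    for _ in range(steps):
        u = policy(x)
        x_next = true_step(x, u)
        data.append((x, u, x_next))
        x = x_next
    return data

rng = random.Random(0)

# 1) Initial data from a random policy, then a first model fit.
dataset = []
for _ in range(5):
    dataset += collect_rollout(lambda x: rng.uniform(-1.0, 1.0), rng)
model = fit_linear_model(dataset)

# 2) Refinement: collect with MPC on the current model, then refit.
for _ in range(2):
    dataset += collect_rollout(lambda x: mpc_action(model, x, rng), rng)
    model = fit_linear_model(dataset)

a_hat, b_hat = model
print(a_hat, b_hat, len(dataset))
```

The structure (random rollouts, fit, MPC rollouts, refit) mirrors what this tutorial does with a neural network model and the pendulum environment.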

In [1]:
from tf_neuralmpc.environment_utils import EnvironmentWrapper
import logging
from tf_neuralmpc.dynamics_functions import DeterministicMLP
from tf_neuralmpc.examples.cost_funcs import pendulum_actions_reward_function, pendulum_state_reward_function
from tf_neuralmpc import Runner
import tensorflow as tf
logging.getLogger().setLevel(logging.INFO)
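
The two imported reward functions are not shown in this tutorial. As a rough illustration of what such functions could compute, the sketch below follows the standard Gym pendulum reward, which penalizes the squared angle, angular velocity, and torque. The function names and the per-sample list signature are hypothetical; the library's own functions may differ, for example by operating on batched tensors.

```python
import math

def sketch_state_reward(observation):
    """Hypothetical state reward for an observation [cos(theta), sin(theta), theta_dot]."""
    cos_theta, sin_theta, theta_dot = observation
    theta = math.atan2(sin_theta, cos_theta)  # angle in [-pi, pi]; 0 is upright
    return -(theta ** 2 + 0.1 * theta_dot ** 2)

def sketch_action_reward(action):
    """Hypothetical action reward: a small quadratic penalty on the torque."""
    (torque,) = action
    return -0.001 * torque ** 2

upright = sketch_state_reward([1.0, 0.0, 0.0])    # pendulum upright and at rest
hanging = sketch_state_reward([-1.0, 0.0, 0.0])   # pendulum hanging straight down
print(upright, hanging, sketch_action_reward([2.0]))
```

Splitting the reward into a state term and an action term matches the two separate arguments that the runner takes later in this tutorial.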

In [2]:
number_of_agents = 2
log_path = './tutorial_5'
single_env, parallel_env = EnvironmentWrapper.make_standard_gym_env("Pendulum-v0", random_seed=0,
                                                                    num_of_agents=number_of_agents)
my_runner = Runner(env=[single_env, parallel_env],
                   log_path=log_path,
                   num_of_agents=number_of_agents)

Next, define the architecture of the dynamics model.

In [3]:
state_size = single_env.observation_space.shape[0]
input_size = single_env.action_space.shape[0]
dynamics_function = DeterministicMLP()
dynamics_function.add_layer(state_size + input_size,
                            32, activation_function=tf.math.tanh)
dynamics_function.add_layer(32, 32, activation_function=tf.math.tanh)
dynamics_function.add_layer(32, 32, activation_function=tf.math.tanh)
dynamics_function.add_layer(32, state_size)
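
To see the shapes involved, the following plain-Python sketch runs a forward pass through a network with the same layer sizes. It assumes the Pendulum-v0 dimensions (3-dimensional observation, 1-dimensional action), dense layers with biases, and a linear output layer; none of these details of `DeterministicMLP` are confirmed by this tutorial, and the helper names are hypothetical.

```python
import math
import random

# (inputs, outputs) per layer: state_size + input_size = 3 + 1 = 4 inputs.
layer_sizes = [(4, 32), (32, 32), (32, 32), (32, 3)]
rng = random.Random(0)

def make_layer(n_in, n_out):
    """Small random weights and zero biases for one dense layer."""
    weights = [[rng.gauss(0.0, 0.1) for _ in range(n_out)] for _ in range(n_in)]
    biases = [0.0] * n_out
    return weights, biases

def forward(layers, x):
    """Dense forward pass with tanh on every layer except the last."""
    for index, (weights, biases) in enumerate(layers):
        x = [sum(x[i] * weights[i][j] for i in range(len(x))) + biases[j]
             for j in range(len(biases))]
        if index < len(layers) - 1:
            x = [math.tanh(v) for v in x]
    return x

layers = [make_layer(n_in, n_out) for n_in, n_out in layer_sizes]
state_and_action = [1.0, 0.0, 0.0, 0.5]  # [cos(theta), sin(theta), theta_dot, torque]
prediction = forward(layers, state_and_action)
num_parameters = sum(n_in * n_out + n_out for n_in, n_out in layer_sizes)
print(len(prediction), num_parameters)  # 3 2371
```

Under these assumptions the network maps a concatenated state and action to a vector of the same size as the state, using 2371 trainable parameters.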

Now learn the dynamics model, starting from random rollouts. Note: the total number of rollouts is number_of_agents * number_of_rollouts.
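
As a quick check of the note above, the arithmetic below uses the values from this tutorial's configuration. It assumes every rollout runs for the full `task_horizon`; the derived variable names are only for illustration.

```python
number_of_agents = 2
number_of_initial_rollouts = 20
number_of_rollouts_for_refinement = 2
task_horizon = 200

# Each requested rollout is executed once per parallel agent.
initial_rollouts = number_of_agents * number_of_initial_rollouts
initial_transitions = initial_rollouts * task_horizon

refinement_rollouts_per_step = number_of_agents * number_of_rollouts_for_refinement
refinement_transitions_per_step = refinement_rollouts_per_step * task_horizon

print(initial_rollouts, initial_transitions)                          # 40 8000
print(refinement_rollouts_per_step, refinement_transitions_per_step)  # 4 800
```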

In [4]:
system_dynamics_handler, mpc_policy = my_runner.learn_dynamics_iteratively_w_mpc(number_of_initial_rollouts=20,
                                                                                 number_of_rollouts_for_refinement=2,
                                                                                 number_of_refinement_steps=2,
                                                                                 dynamics_function=dynamics_function,
                                                                                 task_horizon=200,
                                                                                 planning_horizon=40,
                                                                                 state_reward_function=pendulum_state_reward_function,
                                                                                 actions_reward_function=pendulum_actions_reward_function,
                                                                                 optimizer_name='PI2',
                                                                                 exploration_noise=True)

INFO:root:Started collecting samples for rollouts
INFO:root:Average action selection time: 0.00023541569709777831
INFO:root:Rollout length: 200
INFO:root:Average action selection time: 0.00024532914161682127
INFO:root:Rollout length: 200
INFO:root:Average action selection time: 0.00026759982109069824
INFO:root:Rollout length: 200
INFO:root:Average action selection time: 0.00023585081100463868
INFO:root:Rollout length: 200
INFO:root:Average action selection time: 0.00026265621185302736
INFO:root:Rollout length: 200
INFO:root:Average action selection time: 0.0002476179599761963
INFO:root:Rollout length: 200
INFO:root:Average action selection time: 0.000246272087097168
INFO:root:Rollout length: 200
INFO:root:Average action selection time: 0.0002491772174835205
INFO:root:Rollout length: 200
INFO:root:Average action selection time: 0.0002476632595062256
INFO:root:Rollout length: 200
INFO:root:Average action selection time: 0.0002444589138031006
INFO:root:Rollout length: 200
INFO:root:Averag

Instructions for updating:
If using Keras pass *_constraint arguments to layers.


INFO:tensorflow:Assets written to: ./tutorial_5/saved_model/assets


INFO:tensorflow:Assets written to: ./tutorial_5/saved_model/assets
INFO:root:Trained initial system model
INFO:root:Started collecting samples for rollouts
INFO:root:Average action selection time: 0.08433211565017701
INFO:root:Rollout length: 200
INFO:root:Average action selection time: 0.07166394948959351
INFO:root:Rollout length: 200
INFO:root:Finished collecting samples for rollout
INFO:root:Started the system training
INFO:root:Saving the model now....


INFO:tensorflow:Assets written to: ./tutorial_5/saved_model/assets


INFO:tensorflow:Assets written to: ./tutorial_5/saved_model/assets
INFO:root:Started collecting samples for rollouts
INFO:root:Average action selection time: 0.07713130831718445
INFO:root:Rollout length: 200
INFO:root:Average action selection time: 0.07219652056694031
INFO:root:Rollout length: 200
INFO:root:Finished collecting samples for rollout
INFO:root:Started the system training
INFO:root:Saving the model now....


INFO:tensorflow:Assets written to: ./tutorial_5/saved_model/assets


INFO:tensorflow:Assets written to: ./tutorial_5/saved_model/assets
INFO:root:Started collecting samples for rollouts
INFO:root:Average action selection time: 0.08324985504150391
INFO:root:Rollout length: 200
INFO:root:Average action selection time: 0.07785637497901916
INFO:root:Rollout length: 200
INFO:root:Finished collecting samples for rollout
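
In the log above, the average action selection time rises from about 0.25 ms with random actions to roughly 70 to 85 ms once MPC is used, which is consistent with an optimizer that evaluates many candidate action sequences at every step. The sketch below shows the general idea behind a path-integral (PI2-style) update: sample noisy action sequences, weight them exponentially by cost, and average. It is a generic illustration on a toy cost, not the implementation behind `optimizer_name='PI2'`; all names and hyperparameters are assumptions.

```python
import math
import random

def pi2_update(mean_sequence, cost_fn, rng, num_samples=64,
               noise_std=0.5, temperature=1.0):
    """One path-integral style update of a mean action sequence."""
    samples, costs = [], []
    for _ in range(num_samples):
        candidate = [m + rng.gauss(0.0, noise_std) for m in mean_sequence]
        samples.append(candidate)
        costs.append(cost_fn(candidate))
    # Subtract the minimum cost so the best sample gets weight 1 (avoids underflow).
    min_cost = min(costs)
    weights = [math.exp(-(c - min_cost) / temperature) for c in costs]
    total = sum(weights)
    # New mean: cost-weighted average of the sampled sequences.
    return [sum(w * s[t] for w, s in zip(weights, samples)) / total
            for t in range(len(mean_sequence))]

def toy_cost(sequence):
    """Toy cost that is minimized when every action equals 1.0."""
    return sum((u - 1.0) ** 2 for u in sequence)

rng = random.Random(0)
mean = [0.0] * 5
initial_cost = toy_cost(mean)
for _ in range(20):
    mean = pi2_update(mean, toy_cost, rng)
final_cost = toy_cost(mean)
print(initial_cost, final_cost)
```

In an MPC setting the cost of each candidate would come from rolling the learned dynamics model forward over the planning horizon, and only the first action of the updated sequence would be applied.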


In [5]:
%load_ext tensorboard
%tensorboard --logdir {log_path}

Finally, record a 500-step rollout with the MPC policy, which plans using the learned dynamics.

In [6]:
my_runner.record_rollout(horizon=500, policy=mpc_policy,
                         record_file_path=log_path+'/episode_1')