# Parametric Actions

One way to enforce constraints on the model is to restrict the actions available to an agent in certain states. This ought to improve learning and speed up results because some areas of the state space will be off limits.

Take, for example, an online knapsack problem. The agent is given an item at every time step and must decide whether to accept the item and pack it into the sack, reject it and get a new one, or close up the sack and end the episode. We could simply give the agent a large negative reward if it accepts an item that takes it over the weight limit, but it would be more effective to block the agent from accepting the item in these situations, leaving it with two options: end the episode or reject the item.

Here, I'll implement a simple knapsack environment that prevents the algorithm from selecting items that would cause it to exceed its weight limit. There will be three actions available to the algorithm.

- 0: end episode
- 1: accept item
- 2: reject item

If 0 is selected, the episode ends and the agent collects no additional reward. If 1 is selected, the agent packs that item and collects the reward. If 2 is selected, the agent rejects the item and moves to the next. 

If the parametric action selection works properly, the agent should never exceed the capacity of the knapsack, and so should never receive the large negative reward.

For this, I'm following the example laid out in the [Ray code for the parametric cartpole](https://github.com/ray-project/ray/blob/master/rllib/examples/parametric_action_cartpole.py).

In [1]:
import numpy as np
import ray
from ray import tune
import gym
from gym import spaces
from ray.rllib.models.tf.tf_modelv2 import TFModelV2
from ray.rllib.models.tf.fcnet_v2 import FullyConnectedNetwork
from ray.rllib.utils import try_import_tf
from ray.rllib.models import ModelCatalog
from or_gym.utils.env_config import *

# Building the Environment

We need to set up the environment to interact with the Ray framework properly so that the forbidden actions are masked given the goals outlined above. In this case, it's rather easy: we'll simply check whether the weight of the next item plus our current weight is greater than our weight capacity.
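The check described above can be sketched as a small standalone function. This is only an illustration in plain Python: `knapsack_action_mask` is a hypothetical helper name, and the action order (end episode, accept, reject) follows the list given earlier.

```python
def knapsack_action_mask(current_weight, item_weight, weight_capacity):
    """Return a 0/1 mask over the actions [end episode, accept, reject]."""
    # Ending the episode and rejecting the item are always allowed;
    # accepting is allowed only if the item still fits.
    fits = current_weight + item_weight <= weight_capacity
    return [1, 1 if fits else 0, 1]


print(knapsack_action_mask(18, 2, 20))  # item fits exactly: [1, 1, 1]
print(knapsack_action_mask(18, 3, 20))  # item would exceed capacity: [1, 0, 1]
```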

Referring to the Ray code (see lines 69-73 in the above link), we need to place our observations into a dictionary using the `spaces.Dict` space. This dictionary needs to include the normal state from our environment as well as an action mask and the available actions we can choose from. Our state has only three outputs: the current weight of the knapsack, the value of the next item, and the weight of the next item.

As discussed above, we also have three actions to choose from, so we'll need a corresponding list of three outputs for the mask and available actions.

In [2]:
class ParametricKnapsack(gym.Env):
    
    def __init__(self, *args, **kwargs):
        self.step_limit = 10
        self.item_values = np.random.randint(0, 10, self.step_limit)
        self.item_weights = np.random.randint(1, 5, self.step_limit)
        self.weight_capacity = 20
        self.action_space = spaces.Discrete(3)
        self.mask = True
        assign_env_config(self, kwargs)
        self.observation_space = spaces.Dict({
            "action_mask": spaces.Box(0, 1, shape=(3,)),
            "avail_actions": spaces.Box(0, 1, shape=(3,)),
            "state": spaces.Box(0, self.weight_capacity, shape=(3,))
        })

        self.reset()
        
    def reset(self):
        self.current_weight, self.current_step = 0, 0
        self.item_values = np.random.randint(0, 10, self.step_limit)
        self.item_weights = np.random.randint(1, 5, self.step_limit)
        self.state = {
            "action_mask": np.ones(3),
            "avail_actions": np.ones(3),
            "state": np.array(
                [self.current_weight, 
                 self.item_values[self.current_step], 
                 self.item_weights[self.current_step]])}
        self.update_state()

        return self.state
    
    def step(self, action):
        self.current_weight = self.state["state"][0]
        item_value = self.state["state"][1]
        item_weight = self.state["state"][2]
        done = False
        if action == 0:
            # End episode
            done = True
            reward = 0
        elif action == 1:
            # Accept item
            if self.current_weight + item_weight <= self.weight_capacity:
                self.current_weight += item_weight
                reward = item_value
                # End if capacity is met
                if self.current_weight == self.weight_capacity:
                    done = True
            else: # Overweight
                reward = -100
                done = True
        elif action == 2:
            # Reject item
            reward = 0
        
        self.current_step += 1
        if self.current_step >= self.step_limit:
            done = True
        self.update_state()
        return self.state, reward, done, {}
    
    def update_state(self):
        # Make action selection impossible if the knapsack would go over weight
        step = self.current_step if self.current_step < self.step_limit else self.step_limit-1
        knapsack = np.array([self.current_weight, 
                self.item_values[step], 
                self.item_weights[step]])
        action_mask = np.ones(3)
        if self.mask:
            if self.current_weight + knapsack[-1] > self.weight_capacity:
                action_mask = np.array([1, 0, 1])
            
        self.state = {
                "action_mask": action_mask,
                "avail_actions": np.ones(3),
                "state": knapsack
            }
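As a quick sanity check of the masking idea, independent of RLlib, the following sketch replays the knapsack rules in plain Python with a policy that samples uniformly from the allowed actions only. The names `allowed_actions` and `masked_random_episode` are hypothetical helpers, and the item ranges are chosen to mirror the `np.random.randint` calls in the class above.

```python
import random

CAPACITY = 20
STEP_LIMIT = 10


def allowed_actions(current_weight, item_weight, capacity=CAPACITY):
    """Actions 0 (end) and 2 (reject) are always allowed; 1 (accept) only if the item fits."""
    actions = [0, 2]
    if current_weight + item_weight <= capacity:
        actions.append(1)
    return actions


def masked_random_episode(rng):
    """Play one episode, sampling uniformly from the allowed actions only."""
    weight, total_reward = 0, 0
    for _ in range(STEP_LIMIT):
        item_value = rng.randint(0, 9)   # same range as np.random.randint(0, 10)
        item_weight = rng.randint(1, 4)  # same range as np.random.randint(1, 5)
        action = rng.choice(allowed_actions(weight, item_weight))
        if action == 0:
            break  # end episode
        if action == 1:
            weight += item_weight
            total_reward += item_value
            if weight == CAPACITY:
                break  # knapsack is exactly full
    return weight, total_reward


rng = random.Random(0)
episodes = [masked_random_episode(rng) for _ in range(1000)]
print(max(weight for weight, _ in episodes) <= CAPACITY)
```

Because the accept action is removed whenever the item does not fit, no episode in this sketch can end over capacity, so the penalty branch is never reached.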

In [3]:
tf = try_import_tf()

class KPParametricActionsModel(TFModelV2):
    
    def __init__(self, obs_space, action_space, num_outputs,
        model_config, name, true_obs_shape=(3,), action_embed_size=3,
        *args, **kwargs):
        super(KPParametricActionsModel, self).__init__(obs_space,
            action_space, num_outputs, model_config, name, *args, **kwargs)
        self.action_embed_model = FullyConnectedNetwork(
            spaces.Box(-1, 1, shape=true_obs_shape), action_space, action_embed_size,
            model_config, name + "_action_embedding")
        self.register_variables(self.action_embed_model.variables())
        
    def forward(self, input_dict, state, seq_lens):
        avail_actions = input_dict["obs"]["avail_actions"]
        action_mask = input_dict["obs"]["action_mask"]
        action_embedding, _ = self.action_embed_model({
            "obs": input_dict["obs"]["state"]
        })
        intent_vector = tf.expand_dims(action_embedding, 1)
        action_logits = tf.reduce_sum(avail_actions * intent_vector, axis=1)
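        # log(1) = 0 leaves allowed logits unchanged; log(0) = -inf is clipped
        # to the most negative float32 so masked actions are effectively ruled out
        inf_mask = tf.maximum(tf.log(action_mask), tf.float32.min)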
        return action_logits + inf_mask, state
    
    def value_function(self):
        return self.action_embed_model.value_function()
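To see why adding `tf.maximum(tf.log(action_mask), tf.float32.min)` to the logits blocks an action, here is a minimal sketch of the same arithmetic in plain Python. It is not the TensorFlow code path; `mask_logits` and `softmax` are hypothetical helpers written for this illustration only.

```python
import math

FLOAT32_MIN = -3.4028234663852886e38  # most negative finite float32


def mask_logits(logits, action_mask):
    """Add log(mask), clipped to FLOAT32_MIN, to each logit."""
    masked = []
    for logit, allowed in zip(logits, action_mask):
        # log(1) = 0 for allowed actions; treat log(0) as -inf for masked ones
        log_mask = math.log(allowed) if allowed > 0 else -math.inf
        masked.append(logit + max(log_mask, FLOAT32_MIN))
    return masked


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax(mask_logits([1.0, 2.0, 3.0], [1, 0, 1]))
print(probs)
```

The masked action ends up with a logit so negative that its softmax probability underflows to zero, while the remaining probabilities are renormalised over the allowed actions.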

In [4]:
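def create_env(config_env):
    # Note: config_env is not forwarded to the constructor here, so values in
    # "env_config" (such as "mask") are ignored and the class defaults apply.
    return ParametricKnapsack()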

ModelCatalog.register_custom_model("kp_param_model", KPParametricActionsModel)
tune.register_env("ParaKnapsack-v0", lambda config: create_env(config))

# ray.init(ignore_reinit_error=True)

results = tune.run(
        "PPO",
        stop={"training_iteration": 10},
        config={
            "env": "ParaKnapsack-v0",
            "env_config": {
                "mask": True
            },
            "model": {
                "custom_model": "kp_param_model"
            },
        },
        verbose=0,
        reuse_actors=True)

df = results.dataframe()
df.head()

2020-04-23 17:46:13,909	INFO resource_spec.py:216 -- Starting Ray with 3.47 GiB memory available for workers and up to 1.75 GiB for objects. You can adjust these settings with ray.init(memory=<bytes>, object_store_memory=<bytes>).
2020-04-23 17:46:14,475	INFO ray_trial_executor.py:121 -- Trial PPO_ParaKnapsack-v0_3c894ede: Setting up new remote runner.


(pid=4758) 2020-04-23 17:46:17,898	INFO trainer.py:371 -- Tip: set 'eager': true or the --eager flag to enable TensorFlow eager execution
(pid=4758) 2020-04-23 17:46:17,900	INFO trainer.py:512 -- Current log_level is WARN. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv flags.
(pid=4758)   ret = umr_sum(arr, axis, dtype, out, keepdims)


2020-04-23 17:47:06,374	INFO tune.py:334 -- Returning an analysis object by default. You can call `analysis.trials` to retrieve a list of trials. This message will be removed in future versions of Tune.


| | episode_reward_max | episode_reward_min | episode_reward_mean | episode_len_mean | episodes_this_iter | timesteps_this_iter | done | timesteps_total | episodes_total | training_iteration | ... | info/learner/default_policy/policy_loss | info/learner/default_policy/vf_loss | info/learner/default_policy/vf_explained_var | info/learner/default_policy/kl | info/learner/default_policy/entropy | info/learner/default_policy/entropy_coeff | config/env | config/env_config | config/model | logdir |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39.0 | 0.0 | 4.453585 | 3.015094 | 1325 | 4000 | True | 40000 | 13566 | 10 | ... | -0.093364 | 36.640034 | 0.028345 | 0.014002 | 1.084052 | 0.0 | ParaKnapsack-v0 | {'mask': True} | {'custom_model': 'kp_param_model'} | /home/christian/ray_results/PPO/PPO_ParaKnapsa... |


In [5]:
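def create_env(config_env):
    # Note: config_env is not forwarded to the constructor here, so values in
    # "env_config" (such as "mask") are ignored and the class defaults apply.
    return ParametricKnapsack()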

ModelCatalog.register_custom_model("kp_param_model", KPParametricActionsModel)
tune.register_env("ParaKnapsack-v0", lambda config: create_env(config))

ray.init(ignore_reinit_error=True)

results = tune.run(
        "PPO",
        stop={"training_iteration": 10},
        config={
            "env": "ParaKnapsack-v0",
            "env_config": {
                "mask": False
            },
            "model": {
                "custom_model": "kp_param_model"
            },
        },
        verbose=0,
        reuse_actors=True)

df = results.dataframe()
df.head()

2020-04-23 17:47:06,461	ERROR worker.py:679 -- Calling ray.init() again after it has already been called.
2020-04-23 17:47:06,480	INFO ray_trial_executor.py:121 -- Trial PPO_ParaKnapsack-v0_5b8bdcfc: Setting up new remote runner.


(pid=4761) 2020-04-23 17:47:08,956	INFO trainer.py:371 -- Tip: set 'eager': true or the --eager flag to enable TensorFlow eager execution
(pid=4761) 2020-04-23 17:47:08,958	INFO trainer.py:512 -- Current log_level is WARN. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv flags.
(pid=4761)   ret = umr_sum(arr, axis, dtype, out, keepdims)


2020-04-23 17:48:04,994	INFO tune.py:334 -- Returning an analysis object by default. You can call `analysis.trials` to retrieve a list of trials. This message will be removed in future versions of Tune.


| | episode_reward_max | episode_reward_min | episode_reward_mean | episode_len_mean | episodes_this_iter | timesteps_this_iter | done | timesteps_total | episodes_total | training_iteration | ... | info/learner/default_policy/policy_loss | info/learner/default_policy/vf_loss | info/learner/default_policy/vf_explained_var | info/learner/default_policy/kl | info/learner/default_policy/entropy | info/learner/default_policy/entropy_coeff | config/env | config/env_config | config/model | logdir |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 43.0 | 0.0 | 4.435268 | 2.97619 | 1344 | 4000 | True | 40000 | 13505 | 10 | ... | -0.098762 | 36.736065 | 0.019587 | 0.0146 | 1.083382 | 0.0 | ParaKnapsack-v0 | {'mask': False} | {'custom_model': 'kp_param_model'} | /home/christian/ray_results/PPO/PPO_ParaKnapsa... |
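The two knapsack runs report nearly identical statistics, and the minimum episode reward is 0.0 in both, so the -100 penalty was never incurred even with `"mask": False`. One possible reason, judging from the code as written, is that `create_env` ignores its `config_env` argument, so the `mask` setting never reaches the environment. A minimal sketch of the difference, using a hypothetical stand-in class `TinyEnv` rather than the real environment:

```python
class TinyEnv:
    """Hypothetical stand-in for an environment with a `mask` default."""

    def __init__(self, env_config=None):
        self.mask = True  # default, as in the environments above
        # Overwrite defaults with any values supplied in the config
        for key, value in (env_config or {}).items():
            setattr(self, key, value)


def create_env_dropping_config(config_env):
    # config_env is ignored, so the defaults always apply
    return TinyEnv()


def create_env_forwarding_config(config_env):
    # config_env is passed on, so "mask" can be switched off
    return TinyEnv(env_config=config_env)


config = {"mask": False}
print(create_env_dropping_config(config).mask)
print(create_env_forwarding_config(config).mask)
```

How the real environment consumes the config depends on `assign_env_config`, which is not shown here, so treat this as an assumption to verify rather than a confirmed diagnosis.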


# Add Parametric Actions to VM Packing

In [13]:
import or_gym
from or_gym.algos.rl_utils import *

In [14]:
class VMPackingEnv(gym.Env):
    
    def __init__(self, *args, **kwargs):
        self.cpu_capacity = 1
        self.mem_capacity = 1
        self.t_interval = 20
        self.tol = 1e-5
        self.step_limit = int(60 * 24 / self.t_interval)
        self.n_pms = 50
        self.load_idx = np.array([1, 2])
        self.seed = 0
        self.mask = True
        assign_env_config(self, kwargs)
        self.action_space = spaces.Discrete(self.n_pms)
        self.observation_space = spaces.Dict({
            "action_mask": spaces.Box(0, 1, shape=(self.n_pms,)),
            "avail_actions": spaces.Box(0, 1, shape=(self.n_pms,)),
            "state": spaces.Box(0, 1, shape=(self.n_pms+1, 3))
        })
        self.reset()
        
    def reset(self):
        self.demand = self.generate_demand()
        self.current_step = 0
        self.state = {
            "action_mask": np.ones(self.n_pms),
            "avail_actions": np.ones(self.n_pms),
            "state": np.vstack([
                np.zeros((self.n_pms, 3)),
                self.demand[self.current_step]])
        }
        self.assignment = {}
        return self.state
    
    def step(self, action):
        done = False
        pm_state = self.state["state"][:-1]
        demand = self.state["state"][-1, 1:]
        
        if action < 0 or action >= self.n_pms:
            raise ValueError("Invalid action: {}".format(action))
            
        elif any(pm_state[action, 1:] + demand > 1 + self.tol):
            # Demand doesn't fit into PM
            reward = -10000
            done = True
        else:
            if pm_state[action, 0] == 0:
                # Open PM if closed
                pm_state[action, 0] = 1
            pm_state[action, self.load_idx] += demand
            reward = np.sum(pm_state[:, 0] * (pm_state[:,1:].sum(axis=1) - 2))
            self.assignment[self.current_step] = action
            
        self.current_step += 1
        if self.current_step >= self.step_limit:
            done = True
        self.update_state(pm_state)
        return self.state, reward, done, {}
    
    def update_state(self, pm_state):
        # Make action selection impossible if the PM would exceed capacity
        step = self.current_step if self.current_step < self.step_limit else self.step_limit-1
        data_center = np.vstack([pm_state, self.demand[step]])
        data_center = np.where(data_center>1,1,data_center) # Fix rounding errors
        self.state["state"] = data_center
        self.state["action_mask"] = np.ones(self.n_pms)
        self.state["avail_actions"] = np.ones(self.n_pms)
        if self.mask:
            action_mask = (pm_state[:, 1:] + self.demand[step, 1:]) <= 1
            self.state["action_mask"] = (action_mask.sum(axis=1)==2).astype(int)
                    
    def generate_demand(self):
        cpu_demand = np.random.uniform(0, 1, size=self.step_limit)
        mem_demand = np.random.uniform(0, 1, size=self.step_limit)
        return np.vstack([np.zeros(self.step_limit), cpu_demand, mem_demand]).T
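The mask in `update_state` marks a physical machine (PM) as available only when both its CPU and memory loads can absorb the incoming demand. A minimal plain-Python sketch of that rule, with a hypothetical helper `vm_action_mask` and made-up loads for illustration:

```python
def vm_action_mask(pm_loads, demand, capacity=1.0):
    """Return a 0/1 mask with one entry per PM.

    pm_loads: list of (cpu_load, mem_load) pairs, one per PM
    demand:   (cpu_demand, mem_demand) of the incoming VM
    """
    cpu_demand, mem_demand = demand
    return [
        int(cpu + cpu_demand <= capacity and mem + mem_demand <= capacity)
        for cpu, mem in pm_loads
    ]


# Illustrative values only: an empty PM, one short on CPU, one short on memory
pm_loads = [(0.0, 0.0), (0.7, 0.2), (0.3, 0.95)]
print(vm_action_mask(pm_loads, (0.4, 0.1)))  # [1, 0, 0]
```

If every entry of the mask is 0, no placement is valid. The environment above does not treat that case specially, which may be worth keeping in mind when reading the results.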

In [15]:
class VMParametricActionsModel(TFModelV2):
    
    def __init__(self, obs_space, action_space, num_outputs,
        model_config, name, true_obs_shape=(51,3), action_embed_size=50,
        *args, **kwargs):
        super(VMParametricActionsModel, self).__init__(obs_space,
            action_space, num_outputs, model_config, name, *args, **kwargs)
#         print(model_config)
        self.action_embed_model = FullyConnectedNetwork(
            spaces.Box(0, 1, shape=true_obs_shape), action_space, action_embed_size,
            model_config, name + "_action_embedding")
        self.register_variables(self.action_embed_model.variables())
        
    def forward(self, input_dict, state, seq_lens):
        avail_actions = input_dict["obs"]["avail_actions"]
        action_mask = input_dict["obs"]["action_mask"]
        action_embedding, _ = self.action_embed_model({
            "obs": input_dict["obs"]["state"]
        })
        intent_vector = tf.expand_dims(action_embedding, 1)
        action_logits = tf.reduce_sum(avail_actions * intent_vector, axis=1)
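        # log(1) = 0 leaves allowed logits unchanged; log(0) = -inf is clipped
        # to the most negative float32 so masked actions are effectively ruled out
        inf_mask = tf.maximum(tf.log(action_mask), tf.float32.min)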
        return action_logits + inf_mask, state
    
    def value_function(self):
        return self.action_embed_model.value_function()

In [12]:
env = VMPackingEnv()
state = env.reset()
avail_actions = state["avail_actions"]
action_mask = state["action_mask"]
# action_embed_model = FullyConnectedNetwork(
#     spaces.Box(0, 1, shape=env.observation_space["state"].shape),
#     action_space=env.action_space.n,
#     num_outputs=env.action_space.n,
#     model_config={"custom_model": "vm_param_model"},
#     name="ParamVMPacking-v0")

In [18]:
m = VMParametricActionsModel

In [17]:
tf = try_import_tf()

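def create_env(config_env):
    # Note: config_env is not forwarded to the constructor here, so values in
    # "env_config" (such as "mask") are ignored and the class defaults apply.
    return VMPackingEnv()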

ModelCatalog.register_custom_model("vm_param_model", VMParametricActionsModel)
tune.register_env("ParaVMPacking-v0", lambda config: create_env(config))

ray.init(ignore_reinit_error=True)

results = tune.run(
        "PPO",
        stop={"training_iteration": 10},
        config={
            "env": "ParaVMPacking-v0",
            "env_config": {
                "mask": True
            },
            "model": {
                "custom_model": "vm_param_model"
            },
        },
#         verbose=0,
        reuse_actors=True)

df = results.dataframe()
df.head()

2020-04-22 14:29:59,443	ERROR worker.py:679 -- Calling ray.init() again after it has already been called.
2020-04-22 14:29:59,461	INFO ray_trial_executor.py:121 -- Trial PPO_ParaVMPacking-v0_a7ae71d4: Setting up new remote runner.


Trial name,status,loc
PPO_ParaVMPacking-v0_a7ae71d4,RUNNING,


(pid=31543) 2020-04-22 14:30:01,571	INFO trainer.py:371 -- Tip: set 'eager': true or the --eager flag to enable TensorFlow eager execution
(pid=31543) 2020-04-22 14:30:01,574	INFO trainer.py:512 -- Current log_level is WARN. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv flags.
(pid=31685)   ret = umr_sum(arr, axis, dtype, out, keepdims)
(pid=31543)   ret = umr_sum(arr, axis, dtype, out, keepdims)
Result for PPO_ParaVMPacking-v0_a7ae71d4:
  custom_metrics: {}
  date: 2020-04-22_14-30-12
  done: false
  episode_len_mean: 66.31666666666666
  episode_reward_max: -1705.7551730784082
  episode_reward_mean: -10771.542671799129
  episode_reward_min: -11978.651093070603
  episodes_this_iter: 60
  episodes_total: 60
  experiment_id: 5f6756fd18f1452496355556a6124e24
  experiment_tag: '0'
  hostname: ubuntu
  info:
    grad_time_ms: 4620.911
    learner:
      default_policy:
        cur_kl_coeff: 0.200000002980232

Trial name,status,loc,iter,total time (s),timesteps,reward
PPO_ParaVMPacking-v0_a7ae71d4,RUNNING,192.168.0.11:31543,1,8.65994,4000,-10771.5


Result for PPO_ParaVMPacking-v0_a7ae71d4:
  custom_metrics: {}
  date: 2020-04-22_14-30-19
  done: false
  episode_len_mean: 65.99
  episode_reward_max: -1756.725259203345
  episode_reward_mean: -10796.289258395978
  episode_reward_min: -11978.651093070603
  episodes_this_iter: 60
  episodes_total: 120
  experiment_id: 5f6756fd18f1452496355556a6124e24
  experiment_tag: '0'
  hostname: ubuntu
  info:
    grad_time_ms: 4550.819
    learner:
      default_policy:
        cur_kl_coeff: 0.30000001192092896
        cur_lr: 4.999999873689376e-05
        entropy: 2.9409902095794678
        entropy_coeff: 0.0
        kl: 0.14288713037967682
        policy_loss: -0.20204399526119232
        total_loss: 54315164.0
        vf_explained_var: 2.817953827616293e-05
        vf_loss: 54315164.0
    load_time_ms: 30.653
    num_steps_sampled: 8000
    num_steps_trained: 7936
    sample_time_ms: 2645.858
    update_time_ms: 272.438
  iterations_since_restore: 2
  node_ip: 192.168.0.11
  num_healthy_worke

Trial name,status,loc,iter,total time (s),timesteps,reward
PPO_ParaVMPacking-v0_a7ae71d4,RUNNING,192.168.0.11:31543,2,15.0494,8000,-10796.3


Result for PPO_ParaVMPacking-v0_a7ae71d4:
  custom_metrics: {}
  date: 2020-04-22_14-30-25
  done: false
  episode_len_mean: 66.07
  episode_reward_max: -1720.37181796013
  episode_reward_mean: -10788.576038139063
  episode_reward_min: -11938.512003571645
  episodes_this_iter: 61
  episodes_total: 181
  experiment_id: 5f6756fd18f1452496355556a6124e24
  experiment_tag: '0'
  hostname: ubuntu
  info:
    grad_time_ms: 4528.968
    learner:
      default_policy:
        cur_kl_coeff: 0.44999998807907104
        cur_lr: 4.999999873689376e-05
        entropy: 2.9669272899627686
        entropy_coeff: 0.0
        kl: 0.10123495757579803
        policy_loss: -0.18695253133773804
        total_loss: 55251548.0
        vf_explained_var: 1.4285887118603569e-05
        vf_loss: 55251548.0
    load_time_ms: 21.778
    num_steps_sampled: 12000
    num_steps_trained: 11904
    sample_time_ms: 2369.961
    update_time_ms: 182.69
  iterations_since_restore: 3
  node_ip: 192.168.0.11
  num_healthy_work

Trial name,status,loc,iter,total time (s),timesteps,reward
PPO_ParaVMPacking-v0_a7ae71d4,RUNNING,192.168.0.11:31543,3,21.365,12000,-10788.6


Result for PPO_ParaVMPacking-v0_a7ae71d4:
  custom_metrics: {}
  date: 2020-04-22_14-30-33
  done: false
  episode_len_mean: 66.02
  episode_reward_max: -1610.1013434851554
  episode_reward_mean: -10670.417503275268
  episode_reward_min: -11987.581390181427
  episodes_this_iter: 60
  episodes_total: 241
  experiment_id: 5f6756fd18f1452496355556a6124e24
  experiment_tag: '0'
  hostname: ubuntu
  info:
    grad_time_ms: 4816.979
    learner:
      default_policy:
        cur_kl_coeff: 0.675000011920929
        cur_lr: 4.999999873689376e-05
        entropy: 2.993440866470337
        entropy_coeff: 0.0
        kl: 0.07120780646800995
        policy_loss: -0.18975572288036346
        total_loss: 54589640.0
        vf_explained_var: 4.9241125452681445e-06
        vf_loss: 54589640.0
    load_time_ms: 17.34
    num_steps_sampled: 16000
    num_steps_trained: 15872
    sample_time_ms: 2243.152
    update_time_ms: 137.798
  iterations_since_restore: 4
  node_ip: 192.168.0.11
  num_healthy_worke

Trial name,status,loc,iter,total time (s),timesteps,reward
PPO_ParaVMPacking-v0_a7ae71d4,RUNNING,192.168.0.11:31543,4,28.9229,16000,-10670.4


Result for PPO_ParaVMPacking-v0_a7ae71d4:
  custom_metrics: {}
  date: 2020-04-22_14-30-41
  done: false
  episode_len_mean: 66.21
  episode_reward_max: -1610.1013434851554
  episode_reward_mean: -10793.508658061488
  episode_reward_min: -11987.581390181427
  episodes_this_iter: 60
  episodes_total: 301
  experiment_id: 5f6756fd18f1452496355556a6124e24
  experiment_tag: '0'
  hostname: ubuntu
  info:
    grad_time_ms: 4893.734
    learner:
      default_policy:
        cur_kl_coeff: 1.0125000476837158
        cur_lr: 4.999999873689376e-05
        entropy: 3.0297670364379883
        entropy_coeff: 0.0
        kl: 0.05637210234999657
        policy_loss: -0.20641808211803436
        total_loss: 50731368.0
        vf_explained_var: 2.584149797257851e-06
        vf_loss: 50731368.0
    load_time_ms: 14.952
    num_steps_sampled: 20000
    num_steps_trained: 19840
    sample_time_ms: 2320.442
    update_time_ms: 111.365
  iterations_since_restore: 5
  node_ip: 192.168.0.11
  num_healthy_wor

Trial name,status,loc,iter,total time (s),timesteps,reward
PPO_ParaVMPacking-v0_a7ae71d4,RUNNING,192.168.0.11:31543,5,36.7702,20000,-10793.5


Result for PPO_ParaVMPacking-v0_a7ae71d4:
  custom_metrics: {}
  date: 2020-04-22_14-30-47
  done: false
  episode_len_mean: 65.96
  episode_reward_max: -1559.7512052288355
  episode_reward_mean: -10598.364275083646
  episode_reward_min: -12105.00755060112
  episodes_this_iter: 61
  episodes_total: 362
  experiment_id: 5f6756fd18f1452496355556a6124e24
  experiment_tag: '0'
  hostname: ubuntu
  info:
    grad_time_ms: 4864.515
    learner:
      default_policy:
        cur_kl_coeff: 1.5187499523162842
        cur_lr: 4.999999873689376e-05
        entropy: 3.0277984142303467
        entropy_coeff: 0.0
        kl: 0.04154117777943611
        policy_loss: -0.1939602792263031
        total_loss: 49093676.0
        vf_explained_var: 6.268101628847944e-07
        vf_loss: 49093680.0
    load_time_ms: 13.306
    num_steps_sampled: 24000
    num_steps_trained: 23808
    sample_time_ms: 2284.815
    update_time_ms: 93.382
  iterations_since_restore: 6
  node_ip: 192.168.0.11
  num_healthy_worker

Trial name,status,loc,iter,total time (s),timesteps,reward
PPO_ParaVMPacking-v0_a7ae71d4,RUNNING,192.168.0.11:31543,6,43.6094,24000,-10598.4


Result for PPO_ParaVMPacking-v0_a7ae71d4:
  custom_metrics: {}
  date: 2020-04-22_14-30-54
  done: false
  episode_len_mean: 65.39
  episode_reward_max: -1587.7718336506312
  episode_reward_mean: -10648.717836836688
  episode_reward_min: -11878.79930205598
  episodes_this_iter: 61
  episodes_total: 423
  experiment_id: 5f6756fd18f1452496355556a6124e24
  experiment_tag: '0'
  hostname: ubuntu
  info:
    grad_time_ms: 4825.775
    learner:
      default_policy:
        cur_kl_coeff: 2.278125047683716
        cur_lr: 4.999999873689376e-05
        entropy: 3.0529417991638184
        entropy_coeff: 0.0
        kl: 0.029261332005262375
        policy_loss: -0.18305885791778564
        total_loss: 50421188.0
        vf_explained_var: 1.0459654049554956e-06
        vf_loss: 50421188.0
    load_time_ms: 12.021
    num_steps_sampled: 28000
    num_steps_trained: 27776
    sample_time_ms: 2233.067
    update_time_ms: 80.626
  iterations_since_restore: 7
  node_ip: 192.168.0.11
  num_healthy_work

Trial name,status,loc,iter,total time (s),timesteps,reward
PPO_ParaVMPacking-v0_a7ae71d4,RUNNING,192.168.0.11:31543,7,50.139,28000,-10648.7


Result for PPO_ParaVMPacking-v0_a7ae71d4:
  custom_metrics: {}
  date: 2020-04-22_14-31-01
  done: false
  episode_len_mean: 65.49
  episode_reward_max: -1619.3302662978067
  episode_reward_mean: -11347.811343544156
  episode_reward_min: -11955.067840410211
  episodes_this_iter: 61
  episodes_total: 484
  experiment_id: 5f6756fd18f1452496355556a6124e24
  experiment_tag: '0'
  hostname: ubuntu
  info:
    grad_time_ms: 4828.401
    learner:
      default_policy:
        cur_kl_coeff: 3.417187452316284
        cur_lr: 4.999999873689376e-05
        entropy: 3.0511744022369385
        entropy_coeff: 0.0
        kl: 0.018607523292303085
        policy_loss: -0.1532767117023468
        total_loss: 54535208.0
        vf_explained_var: 3.7300972621778783e-07
        vf_loss: 54535208.0
    load_time_ms: 11.262
    num_steps_sampled: 32000
    num_steps_trained: 31744
    sample_time_ms: 2202.416
    update_time_ms: 70.893
  iterations_since_restore: 8
  node_ip: 192.168.0.11
  num_healthy_work

Trial name,status,loc,iter,total time (s),timesteps,reward
PPO_ParaVMPacking-v0_a7ae71d4,RUNNING,192.168.0.11:31543,8,56.9876,32000,-11347.8


Result for PPO_ParaVMPacking-v0_a7ae71d4:
  custom_metrics: {}
  date: 2020-04-22_14-31-07
  done: false
  episode_len_mean: 65.79
  episode_reward_max: -1619.3302662978067
  episode_reward_mean: -11158.055945097667
  episode_reward_min: -11955.067840410211
  episodes_this_iter: 61
  episodes_total: 545
  experiment_id: 5f6756fd18f1452496355556a6124e24
  experiment_tag: '0'
  hostname: ubuntu
  info:
    grad_time_ms: 4797.406
    learner:
      default_policy:
        cur_kl_coeff: 3.417187452316284
        cur_lr: 4.999999873689376e-05
        entropy: 3.0295326709747314
        entropy_coeff: 0.0
        kl: 0.017420435324311256
        policy_loss: -0.14462466537952423
        total_loss: 56178296.0
        vf_explained_var: 2.1919127846103947e-07
        vf_loss: 56178300.0
    load_time_ms: 10.504
    num_steps_sampled: 36000
    num_steps_trained: 35712
    sample_time_ms: 2161.726
    update_time_ms: 63.345
  iterations_since_restore: 9
  node_ip: 192.168.0.11
  num_healthy_wor

Trial name,status,loc,iter,total time (s),timesteps,reward
PPO_ParaVMPacking-v0_a7ae71d4,RUNNING,192.168.0.11:31543,9,63.3855,36000,-11158.1


Result for PPO_ParaVMPacking-v0_a7ae71d4:
  custom_metrics: {}
  date: 2020-04-22_14-31-14
  done: true
  episode_len_mean: 65.54
  episode_reward_max: -1626.8245643665698
  episode_reward_mean: -10864.878687447072
  episode_reward_min: -11927.848078843032
  episodes_this_iter: 61
  episodes_total: 606
  experiment_id: 5f6756fd18f1452496355556a6124e24
  experiment_tag: '0'
  hostname: ubuntu
  info:
    grad_time_ms: 4776.208
    learner:
      default_policy:
        cur_kl_coeff: 3.417187452316284
        cur_lr: 4.999999873689376e-05
        entropy: 3.043523073196411
        entropy_coeff: 0.0
        kl: 0.02020074799656868
        policy_loss: -0.1774006485939026
        total_loss: 47420836.0
        vf_explained_var: 6.345010916675164e-08
        vf_loss: 47420836.0
    load_time_ms: 9.882
    num_steps_sampled: 40000
    num_steps_trained: 39680
    sample_time_ms: 2145.642
    update_time_ms: 57.315
  iterations_since_restore: 10
  node_ip: 192.168.0.11
  num_healthy_workers:

Trial name,status,loc,iter,total time (s),timesteps,reward
PPO_ParaVMPacking-v0_a7ae71d4,TERMINATED,,10,69.984,40000,-10864.9


2020-04-22 14:31:14,395	INFO tune.py:334 -- Returning an analysis object by default. You can call `analysis.trials` to retrieve a list of trials. This message will be removed in future versions of Tune.




| | episode_reward_max | episode_reward_min | episode_reward_mean | episode_len_mean | episodes_this_iter | timesteps_this_iter | done | timesteps_total | episodes_total | training_iteration | ... | info/learner/default_policy/policy_loss | info/learner/default_policy/vf_loss | info/learner/default_policy/vf_explained_var | info/learner/default_policy/kl | info/learner/default_policy/entropy | info/learner/default_policy/entropy_coeff | config/env | config/env_config | config/model | logdir |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -1626.824564 | -11927.848079 | -10864.878687 | 65.54 | 61 | 4000 | True | 40000 | 606 | 10 | ... | -0.177401 | 47420836.0 | 6.345011e-08 | 0.020201 | 3.043523 | 0.0 | ParaVMPacking-v0 | {'mask': True} | {'custom_model': 'vm_param_model'} | /home/christian/ray_results/PPO/PPO_ParaVMPack... |


In [12]:
56/15000*3000

11.2