
[question] Using keras in Custom Policy #220

Open
batu opened this issue Mar 4, 2019 · 10 comments
Labels: question (Further information is requested)

Comments

batu commented Mar 4, 2019

I am trying to use Keras to define my own custom policy; unfortunately, after several hours of trying, I couldn't get it to train on CartPole.

Here is the CustomPolicy example, which I have modified to work with CartPole; this one trains properly.

class CustomPolicy(ActorCriticPolicy):
    def __init__(self, sess, ob_space, ac_space, n_env, n_steps, n_batch, reuse=False, **kwargs):
        super(CustomPolicy, self).__init__(sess, ob_space, ac_space, n_env, n_steps, n_batch, reuse=reuse, scale=False)

        with tf.variable_scope("model", reuse=reuse):
            activ = tf.nn.tanh

            extracted_features = tf.layers.flatten(self.processed_obs)

            pi_h = extracted_features
            for i, layer_size in enumerate([64, 64]):
                pi_h = activ(tf.layers.dense(pi_h, layer_size, name='pi_fc' + str(i)))
            pi_latent = pi_h

            vf_h = extracted_features
            for i, layer_size in enumerate([64, 64]):
                vf_h = activ(tf.layers.dense(vf_h, layer_size, name='vf_fc' + str(i)))
            value_fn = tf.layers.dense(vf_h, 1, name='vf')
            vf_latent = vf_h

            self.proba_distribution, self.policy, self.q_value = \
                self.pdtype.proba_distribution_from_latent(pi_latent, vf_latent, init_scale=0.01)

        self.value_fn = value_fn
        self.initial_state = None        
        self._setup_init()

Here is the Keras version of my implementation, which runs but does NOT train. Using tf.keras.layers vs. keras.layers doesn't make a difference.

class KerasPolicy(ActorCriticPolicy):
    def __init__(self, sess, ob_space, ac_space, n_env, n_steps, n_batch, reuse=False, **kwargs):
        super(KerasPolicy, self).__init__(sess, ob_space, ac_space, n_env, n_steps, n_batch, reuse=reuse, scale=False)

        with tf.variable_scope("model", reuse=reuse):
            flat = tf.keras.layers.Flatten()(self.processed_obs)

            x = tf.keras.layers.Dense(64, activation="tanh", name='pi_fc_0')(flat)
            pi_latent = tf.keras.layers.Dense(64, activation="tanh", name='pi_fc_1')(x)

            x1 = tf.keras.layers.Dense(64, activation="tanh", name='vf_fc_0')(flat)
            vf_latent = tf.keras.layers.Dense(64, activation="tanh", name='vf_fc_1')(x1)

            value_fn = tf.keras.layers.Dense(1, name='vf')(vf_latent)

            self.proba_distribution, self.policy, self.q_value = \
                self.pdtype.proba_distribution_from_latent(pi_latent, vf_latent, init_scale=0.01)

        self.value_fn = value_fn
        self.initial_state = None
        self._setup_init()

I tried to ensure both implementations are as close to each other as possible. Any help at this point would be greatly appreciated.

Thank you in advance

Keras version: 2.2.2
Tensorflow version: 1.12.0
Stable Baselines version: 2.4.0a

Attached is the minimal code to reproduce the current issue with tensorboard graphs for comparison.
custom_model.py.zip

araffin (Collaborator) commented Mar 6, 2019

Hello,
I tested your code and ... it worked fine.

See below for the minimal code to reproduce it (I got reward > 100).

import tensorflow as tf

from stable_baselines import PPO2
from stable_baselines.common.policies import ActorCriticPolicy


class KerasPolicy(ActorCriticPolicy):
    def __init__(self, sess, ob_space, ac_space, n_env, n_steps, n_batch, reuse=False, **kwargs):
        super(KerasPolicy, self).__init__(sess, ob_space, ac_space, n_env, n_steps, n_batch, reuse=reuse, scale=False)

        with tf.variable_scope("model", reuse=reuse):
            flat = tf.keras.layers.Flatten()(self.processed_obs)

            x = tf.keras.layers.Dense(64, activation="tanh", name='pi_fc_0')(flat)
            pi_latent = tf.keras.layers.Dense(64, activation="tanh", name='pi_fc_1')(x)

            x1 = tf.keras.layers.Dense(64, activation="tanh", name='vf_fc_0')(flat)
            vf_latent = tf.keras.layers.Dense(64, activation="tanh", name='vf_fc_1')(x1)

            value_fn = tf.keras.layers.Dense(1, name='vf')(vf_latent)

            self.proba_distribution, self.policy, self.q_value = \
                self.pdtype.proba_distribution_from_latent(pi_latent, vf_latent, init_scale=0.01)

        self.value_fn = value_fn
        self.initial_state = None
        self._setup_init()

    def step(self, obs, state=None, mask=None, deterministic=False):
        if deterministic:
            action, value, neglogp = self.sess.run([self.deterministic_action, self._value, self.neglogp],
                                                   {self.obs_ph: obs})
        else:
            action, value, neglogp = self.sess.run([self.action, self._value, self.neglogp],
                                                   {self.obs_ph: obs})
        return action, value, self.initial_state, neglogp

    def proba_step(self, obs, state=None, mask=None):
        return self.sess.run(self.policy_proba, {self.obs_ph: obs})

    def value(self, obs, state=None, mask=None):
        return self.sess.run(self._value, {self.obs_ph: obs})

model = PPO2(KerasPolicy, "CartPole-v1", verbose=1)
model.learn(25000)

env = model.get_env()
obs = env.reset()

reward_sum = 0.0
for _ in range(1000):
    action, _ = model.predict(obs)
    obs, reward, done, _ = env.step(action)
    reward_sum += reward
    env.render()
    if done:
        print("Reward: ", reward_sum)
        reward_sum = 0.0
        obs = env.reset()

env.close()

I'm using tf-gpu (1.8.0) and the latest version of stable-baselines (2.5.0a0; this is the gail branch, but that should not affect the results).

araffin added the question label Mar 6, 2019
hill-a (Owner) commented Mar 6, 2019

Hey,

After trying the code, I am getting the same problem.

It seems that under TF 1.12.0, Keras is ignoring the reuse=True of the scope, meaning that the training model does not share all its parameters with the main model and ends up recreating a new, independent model (this is visible in TensorBoard: the main model only shares 4 tensors with the training model, rather than the 14 shared with the pure TF code).

There isn't much of a fix unfortunately, as Keras seems to be using tf.Variable rather than tf.get_variable (some reading here and here).
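To make the distinction concrete, here is a minimal illustrative sketch (not from this thread; it assumes TF 1.x): tf.get_variable honours the reuse flag of the enclosing variable_scope and hands back the existing variable, whereas a Keras layer creates its weights with tf.Variable, so building it a second time under reuse=True silently produces a new, independent set of parameters.

import tensorflow as tf

obs = tf.placeholder(tf.float32, [None, 4])

# tf.get_variable respects the reuse flag of the enclosing variable_scope
with tf.variable_scope("model"):
    w1 = tf.get_variable("w", shape=[4, 64])
with tf.variable_scope("model", reuse=True):
    w2 = tf.get_variable("w", shape=[4, 64])
print(w1 is w2)  # True -> the two models share this parameter

# A Keras layer does not go through get_variable, so on the TF versions
# affected here it simply creates a second, independent set of weights
# (under a uniquified name) instead of reusing the first
with tf.variable_scope("model"):
    out1 = tf.keras.layers.Dense(64, name="fc")(obs)
with tf.variable_scope("model", reuse=True):
    out2 = tf.keras.layers.Dense(64, name="fc")(obs)
print(sorted(v.name for v in tf.global_variables() if "fc" in v.name))
# two kernels and two biases instead of one shared pair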

batu (Author) commented Mar 7, 2019

@araffin, @hill-a thank you very much for looking into this! This problem has been haunting me for a while. I think the best-case scenario, as a band-aid, is to downgrade to TF 1.8.0.

The difference between tf.get_variable and tf.Variable is very unfortunate... Do you have an intuition as to how stable-baselines might change in the coming years, given that TF 2.0 is placing heavy bets on Keras as the future-facing way of doing things?

hill-a (Owner) commented Mar 8, 2019

If TF 2.0 were to be Keras-like, in my opinion the fix would be to have policies where the layers are created in the constructor, and the observation is then passed through them in a function like this:

class CustomPolicy(ActorCriticPolicy):
    def __init__(self, sess, ob_space, ac_space, n_env, n_steps, n_batch, reuse=False, **kwargs):
        super(CustomPolicy, self).__init__(sess, ob_space, ac_space, n_env, n_steps, n_batch, reuse=reuse, scale=True)

        self._build_kwargs = kwargs

        with tf.variable_scope("model", reuse=self.reuse):
            activ = tf.nn.relu
            # feature extractor stored as a callable, applied later in build()
            self.extracted_features = lambda obs: nature_cnn(obs, **self._build_kwargs)

            # layer objects are created once here and reused on every build() call
            self.pi_layers = []
            for i, layer_size in enumerate([128, 128, 128]):
                self.pi_layers.append(tf.layers.Dense(layer_size, activation=activ, name='pi_fc' + str(i)))

            self.vf_layers = []
            for i, layer_size in enumerate([32, 32]):
                self.vf_layers.append(tf.layers.Dense(layer_size, activation=activ, name='vf_fc' + str(i)))

            self.value_layer = tf.layers.Dense(1, name='vf')
        self._setup_init()  # in practice this would have to run after build()

    def build(self, obs):
        with tf.variable_scope("model", reuse=self.reuse):
            pi_h = vf_h = self.extracted_features(obs)

            for layer in self.pi_layers:
                pi_h = layer(pi_h)
            pi_latent = pi_h

            for layer in self.vf_layers:
                vf_h = layer(vf_h)
            value_fn = self.value_layer(vf_h)
            vf_latent = vf_h

            self.proba_distribution, self.policy, self.q_value = \
                self.pdtype.proba_distribution_from_latent(pi_latent, vf_latent, init_scale=0.01)

        self.value_fn = value_fn
        self.initial_state = None
Of course, this would require quite a bit of the backend to change (the init functions of the base policies, and how the models build their policies).
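For reference, a minimal sketch of the weight-sharing idiom the design above relies on (illustrative only, not stable-baselines code): a Keras layer object owns its weights, so calling the same instance on two different inputs reuses the same parameters, independently of variable_scope reuse.

import tensorflow as tf

# two placeholders standing in for the act-model and train-model observations
obs_act = tf.placeholder(tf.float32, [None, 4])
obs_train = tf.placeholder(tf.float32, [None, 4])

# create the layer objects once...
pi_fc0 = tf.keras.layers.Dense(64, activation="tanh", name="pi_fc_0")
pi_fc1 = tf.keras.layers.Dense(64, activation="tanh", name="pi_fc_1")

# ...and call the same instances on both inputs: the weights are shared by
# object identity rather than by variable_scope reuse
latent_act = pi_fc1(pi_fc0(obs_act))
latent_train = pi_fc1(pi_fc0(obs_train))

# each layer holds exactly one kernel and one bias, used by both graphs
print(len(pi_fc0.trainable_weights))  # 2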

michalgregor commented
Are there any further plans regarding this? Now that we know TF 2.0 is going to drop tf.variable_scope and even handle sessions differently, will everything pretty much have to be rewritten?

pirobot commented Sep 24, 2019

When I test the code from @araffin using tensorflow-gpu 1.8 and the latest pip install of stable-baselines on Ubuntu 16.04, I get the following error:

python3 test_custom_policy.py 
Creating environment from the given name, wrapped in a DummyVecEnv.
Traceback (most recent call last):
  File "test_custom_policy.py", line 46, in <module>
    model = PPO2(KerasPolicy, "CartPole-v1", verbose=1)
  File "/usr/local/lib/python3.5/dist-packages/stable_baselines/ppo2/ppo2.py", line 100, in __init__
    self.setup_model()
  File "/usr/local/lib/python3.5/dist-packages/stable_baselines/ppo2/ppo2.py", line 133, in setup_model
    n_batch_step, reuse=False, **self.policy_kwargs)
  File "test_custom_policy.py", line 25, in __init__
    self.pdtype.proba_distribution_from_latent(pi_latent, vf_latent, init_scale=0.01)
AttributeError: can't set attribute

jckastel commented Dec 6, 2019

I would like to add my vote here as well. Will this get fixed at some point, or will we have to wait for the TF 2.0-compatible version? Not being able to use predefined Keras layers means that a ton of really useful model and layer libraries are unusable with stable-baselines, and that model code will be less future-proof and much more difficult to read and maintain. This is a very unfortunate limitation of an otherwise really nice deep RL library.

AvisekNaug commented
(Quoting @pirobot's traceback above, ending in "AttributeError: can't set attribute".)

I made some changes to the code, as shown below, and it seems to be working on stable-baselines 2.9.0 with tf-gpu 1.14.x.

import tensorflow as tf
from stable_baselines import PPO2
from stable_baselines.common.policies import ActorCriticPolicy

class KerasPolicy(ActorCriticPolicy):
    def __init__(self, sess, ob_space, ac_space, n_env, n_steps, n_batch, reuse=False, **kwargs):
        super(KerasPolicy, self).__init__(sess, ob_space, ac_space, n_env, n_steps, n_batch, reuse=reuse, scale=False)

        with tf.variable_scope("model", reuse=reuse):
            flat = tf.keras.layers.Flatten()(self.processed_obs)

            x = tf.keras.layers.Dense(64, activation="tanh", name='pi_fc_0')(flat)
            pi_latent = tf.keras.layers.Dense(64, activation="tanh", name='pi_fc_1')(x)

            x1 = tf.keras.layers.Dense(64, activation="tanh", name='vf_fc_0')(flat)
            vf_latent = tf.keras.layers.Dense(64, activation="tanh", name='vf_fc_1')(x1)

            value_fn = tf.keras.layers.Dense(1, name='vf')(vf_latent)

            self._proba_distribution, self._policy, self.q_value = \
                self.pdtype.proba_distribution_from_latent(pi_latent, vf_latent, init_scale=0.01)

        self._value_fn = value_fn
        self._setup_init()

    def step(self, obs, state=None, mask=None, deterministic=False):
        if deterministic:
            action, value, neglogp = self.sess.run([self.deterministic_action, self.value_flat, self.neglogp],
                                                   {self.obs_ph: obs})
        else:
            action, value, neglogp = self.sess.run([self.action, self.value_flat, self.neglogp],
                                                   {self.obs_ph: obs})
        return action, value, self.initial_state, neglogp

    def proba_step(self, obs, state=None, mask=None):
        return self.sess.run(self.policy_proba, {self.obs_ph: obs})

    def value(self, obs, state=None, mask=None):
        return self.sess.run(self.value_flat, {self.obs_ph: obs})

model = PPO2(KerasPolicy, "CartPole-v1", verbose=1, tensorboard_log='./log')

model.learn(25000)

env = model.get_env()
obs = env.reset()

reward_sum = 0.0
for _ in range(1000):
    action, _ = model.predict(obs)
    obs, reward, done, _ = env.step(action)
    reward_sum += reward
    env.render()
    if done:
        print("Reward: ", reward_sum)
        reward_sum = 0.0
        obs = env.reset()

env.close()

jtromans commented Apr 8, 2020

Running Ubuntu 18.04.2 LTS, Docker 19.03.6 with the tensorflow/tensorflow:1.14.0-gpu-py3-jupyter image and stable_baselines 2.10.0.

FWIW, I cannot get the PPO2 agent to learn CartPole using this KerasPolicy as-is, whereas training works fine when I use the default MlpPolicy. The discounted reward chart is shown here:

[image: discounted reward chart]

@AvisekNaug, using your code presented above, I would have expected a like-for-like match with the default MlpPolicy, i.e. two dense layers of 64 neurons. Are you able to get training to work successfully?

AvisekNaug commented
(Quoting @jtromans's comment and chart above.)

Yeah, it does not, for the reasons discussed by @hill-a. It is an issue with Keras, where reuse=True does not seem to work as intended; see his response above. I merely tried to address @pirobot's issue for stable-baselines 2.10. But yeah, it does not train properly with Keras layers.
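As a quick sanity check for whether a given setup is affected (a diagnostic sketch only, assuming the KerasPolicy class from the earlier comments is defined and that the model exposes its graph as model.graph, as stable-baselines 2.x does), you can list the trainable variables after building the PPO2 model; duplicated, uniquified copies of the pi_fc_*/vf_fc_* weights indicate that the train model built its own parameters instead of reusing the act model's.

import tensorflow as tf
from stable_baselines import PPO2

# assumes the KerasPolicy class from the earlier comments is already defined
model = PPO2(KerasPolicy, "CartPole-v1", verbose=1)

with model.graph.as_default():
    for var in tf.trainable_variables():
        print(var.name)
# Proper sharing: each pi_fc_* / vf_fc_* kernel and bias appears exactly once.
# Broken sharing: a second, uniquified copy of each layer shows up as well.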
