
[question] Using keras in Custom Policy #220

Open
batu opened this issue Mar 4, 2019 · 10 comments
Labels: question (Further information is requested)

Comments

batu commented Mar 4, 2019

I am trying to use Keras to define my own custom policy; unfortunately, after several hours of trying, I couldn't get it to train on CartPole.

Here is the CustomPolicy example, which I have modified to work with CartPole; this one trains properly.

class CustomPolicy(ActorCriticPolicy):
    def __init__(self, sess, ob_space, ac_space, n_env, n_steps, n_batch, reuse=False, **kwargs):
        super(CustomPolicy, self).__init__(sess, ob_space, ac_space, n_env, n_steps, n_batch, reuse=reuse, scale=False)

        with tf.variable_scope("model", reuse=reuse):
            activ = tf.nn.tanh

            extracted_features = tf.layers.flatten(self.processed_obs)

            pi_h = extracted_features
            for i, layer_size in enumerate([64, 64]):
                pi_h = activ(tf.layers.dense(pi_h, layer_size, name='pi_fc' + str(i)))
            pi_latent = pi_h

            vf_h = extracted_features
            for i, layer_size in enumerate([64, 64]):
                vf_h = activ(tf.layers.dense(vf_h, layer_size, name='vf_fc' + str(i)))
            value_fn = tf.layers.dense(vf_h, 1, name='vf')
            vf_latent = vf_h

            self.proba_distribution, self.policy, self.q_value = \
                self.pdtype.proba_distribution_from_latent(pi_latent, vf_latent, init_scale=0.01)

        self.value_fn = value_fn
        self.initial_state = None        
        self._setup_init()

Here is the Keras version of my implementation, which runs but does NOT train. Using tf.keras.layers vs. keras.layers doesn't make a difference.

class KerasPolicy(ActorCriticPolicy):
    def __init__(self, sess, ob_space, ac_space, n_env, n_steps, n_batch, reuse=False, **kwargs):
        super(KerasPolicy, self).__init__(sess, ob_space, ac_space, n_env, n_steps, n_batch, reuse=reuse, scale=False)

        with tf.variable_scope("model", reuse=reuse):
            flat = tf.keras.layers.Flatten()(self.processed_obs)

            x = tf.keras.layers.Dense(64, activation="tanh", name='pi_fc_0')(flat)
            pi_latent = tf.keras.layers.Dense(64, activation="tanh", name='pi_fc_1')(x)

            x1 = tf.keras.layers.Dense(64, activation="tanh", name='vf_fc_0')(flat)
            vf_latent = tf.keras.layers.Dense(64, activation="tanh", name='vf_fc_1')(x1)

            value_fn = tf.keras.layers.Dense(1, name='vf')(vf_latent)

            self.proba_distribution, self.policy, self.q_value = \
                self.pdtype.proba_distribution_from_latent(pi_latent, vf_latent, init_scale=0.01)

        self.value_fn = value_fn
        self.initial_state = None
        self._setup_init()

I tried to ensure both implementations are as close to each other as possible. Any help at this point would be greatly appreciated.

Thank you in advance

Keras version: 2.2.2
Tensorflow version: 1.12.0
Stable Baselines version: 2.4.0a

Attached is the minimal code to reproduce the current issue with tensorboard graphs for comparison.
custom_model.py.zip

araffin (Collaborator) commented Mar 6, 2019

Hello,
I tested your code and ... it worked fine.

See below for the minimal code to reproduce it (I got reward > 100).

import tensorflow as tf

from stable_baselines import PPO2
from stable_baselines.common.policies import ActorCriticPolicy


class KerasPolicy(ActorCriticPolicy):
    def __init__(self, sess, ob_space, ac_space, n_env, n_steps, n_batch, reuse=False, **kwargs):
        super(KerasPolicy, self).__init__(sess, ob_space, ac_space, n_env, n_steps, n_batch, reuse=reuse, scale=False)

        with tf.variable_scope("model", reuse=reuse):
            flat = tf.keras.layers.Flatten()(self.processed_obs)

            x = tf.keras.layers.Dense(64, activation="tanh", name='pi_fc_0')(flat)
            pi_latent = tf.keras.layers.Dense(64, activation="tanh", name='pi_fc_1')(x)

            x1 = tf.keras.layers.Dense(64, activation="tanh", name='vf_fc_0')(flat)
            vf_latent = tf.keras.layers.Dense(64, activation="tanh", name='vf_fc_1')(x1)

            value_fn = tf.keras.layers.Dense(1, name='vf')(vf_latent)

            self.proba_distribution, self.policy, self.q_value = \
                self.pdtype.proba_distribution_from_latent(pi_latent, vf_latent, init_scale=0.01)

        self.value_fn = value_fn
        self.initial_state = None
        self._setup_init()

    def step(self, obs, state=None, mask=None, deterministic=False):
        if deterministic:
            action, value, neglogp = self.sess.run([self.deterministic_action, self._value, self.neglogp],
                                                   {self.obs_ph: obs})
        else:
            action, value, neglogp = self.sess.run([self.action, self._value, self.neglogp],
                                                   {self.obs_ph: obs})
        return action, value, self.initial_state, neglogp

    def proba_step(self, obs, state=None, mask=None):
        return self.sess.run(self.policy_proba, {self.obs_ph: obs})

    def value(self, obs, state=None, mask=None):
        return self.sess.run(self._value, {self.obs_ph: obs})

model = PPO2(KerasPolicy, "CartPole-v1", verbose=1)
model.learn(25000)

env = model.get_env()
obs = env.reset()

reward_sum = 0.0
for _ in range(1000):
    action, _ = model.predict(obs)
    obs, reward, done, _ = env.step(action)
    reward_sum += reward
    env.render()
    if done:
        print("Reward: ", reward_sum)
        reward_sum = 0.0
        obs = env.reset()

env.close()

I'm using tf-gpu (1.8.0) and the latest version of stable-baselines (2.5.0a0; this is the gail branch, but that should not affect the results).

araffin added the question label Mar 6, 2019
hill-a (Owner) commented Mar 6, 2019

Hey,

After trying the code, I am getting the same problem.

It seems that under TF 1.12.0, Keras is ignoring the reuse=True of the scope, meaning that the training model does not share all its parameters with the main model and ends up recreating a new, independent model (this is visible in TensorBoard: the main model only shares 4 tensors with the training model, rather than the 14 shared with the pure TF code).

There isn't much of a fix unfortunately, as Keras seems to be using tf.Variable rather than tf.get_variable (some reading here and here).
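To make the distinction concrete, here is a minimal illustrative sketch (not from this thread; it assumes TF 1.x): tf.get_variable honours the reuse flag of the enclosing variable_scope and hands back the existing variable, whereas a Keras layer creates its weights with tf.Variable, so building it a second time under reuse=True silently produces a new, independent set of parameters.

import tensorflow as tf

obs = tf.placeholder(tf.float32, [None, 4])

# tf.get_variable respects the reuse flag of the enclosing variable_scope
with tf.variable_scope("model"):
    w1 = tf.get_variable("w", shape=[4, 64])
with tf.variable_scope("model", reuse=True):
    w2 = tf.get_variable("w", shape=[4, 64])
print(w1 is w2)  # True -> the two models share this parameter

# A Keras layer does not go through get_variable, so on the TF versions
# affected here it simply creates a second, independent set of weights
# (under a uniquified name) instead of reusing the first
with tf.variable_scope("model"):
    out1 = tf.keras.layers.Dense(64, name="fc")(obs)
with tf.variable_scope("model", reuse=True):
    out2 = tf.keras.layers.Dense(64, name="fc")(obs)
print(sorted(v.name for v in tf.global_variables() if "fc" in v.name))
# two kernels and two biases instead of one shared pair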

batu (Author) commented Mar 7, 2019

@araffin, @hill-a thank you very much for looking into this! This problem has been haunting me for a while. I think the best-case scenario, as a band-aid, is to downgrade to TF 1.8.0.

The difference between tf.get_variable and tf.Variable is very unfortunate... Do you have an intuition as to how stable-baselines might change in the coming years, given that TF 2.0 is placing heavy bets on Keras as the future-facing way of doing things?

hill-a (Owner) commented Mar 8, 2019

If TF 2.0 were to be Keras-like, in my opinion the fix would be to have policies where the layers are created in the constructor, and the observation is then passed through them in a function like this:

class CustomPolicy(ActorCriticPolicy):
    def __init__(self, sess, ob_space, ac_space, n_env, n_steps, n_batch, reuse=False, **kwargs):
        super(CustomPolicy, self).__init__(sess, ob_space, ac_space, n_env, n_steps, n_batch, reuse=reuse, scale=True)

        self._build_kwargs = kwargs

        with tf.variable_scope("model", reuse=self.reuse):
            activ = tf.nn.relu
            # feature extractor stored as a callable, applied later in build()
            self.extracted_features = lambda obs: nature_cnn(obs, **self._build_kwargs)

            # layer objects are created once here and reused on every build() call
            self.pi_layers = []
            for i, layer_size in enumerate([128, 128, 128]):
                self.pi_layers.append(tf.layers.Dense(layer_size, activation=activ, name='pi_fc' + str(i)))

            self.vf_layers = []
            for i, layer_size in enumerate([32, 32]):
                self.vf_layers.append(tf.layers.Dense(layer_size, activation=activ, name='vf_fc' + str(i)))

            self.value_layer = tf.layers.Dense(1, name='vf')
        self._setup_init()  # in practice this would have to run after build()

    def build(self, obs):
        with tf.variable_scope("model", reuse=self.reuse):
            pi_h = vf_h = self.extracted_features(obs)

            for layer in self.pi_layers:
                pi_h = layer(pi_h)
            pi_latent = pi_h

            for layer in self.vf_layers:
                vf_h = layer(vf_h)
            value_fn = self.value_layer(vf_h)
            vf_latent = vf_h

            self.proba_distribution, self.policy, self.q_value = \
                self.pdtype.proba_distribution_from_latent(pi_latent, vf_latent, init_scale=0.01)

        self.value_fn = value_fn
        self.initial_state = None
Of course, this would require quite a bit of the backend to change (the init functions of the base policies, and how the models build their policies).
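For reference, a minimal sketch of the weight-sharing idiom the design above relies on (illustrative only, not stable-baselines code): a Keras layer object owns its weights, so calling the same instance on two different inputs reuses the same parameters, independently of variable_scope reuse.

import tensorflow as tf

# two placeholders standing in for the act-model and train-model observations
obs_act = tf.placeholder(tf.float32, [None, 4])
obs_train = tf.placeholder(tf.float32, [None, 4])

# create the layer objects once...
pi_fc0 = tf.keras.layers.Dense(64, activation="tanh", name="pi_fc_0")
pi_fc1 = tf.keras.layers.Dense(64, activation="tanh", name="pi_fc_1")

# ...and call the same instances on both inputs: the weights are shared by
# object identity rather than by variable_scope reuse
latent_act = pi_fc1(pi_fc0(obs_act))
latent_train = pi_fc1(pi_fc0(obs_train))

# each layer holds exactly one kernel and one bias, used by both graphs
print(len(pi_fc0.trainable_weights))  # 2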

michalgregor commented
Are there any further plans regarding this? Now that we know TF 2.0 is going to drop tf.variable_scope and even handle sessions differently, will everything pretty much have to be rewritten?

pirobot commented Sep 24, 2019

When I test the code from @araffin using tensorflow-gpu 1.8 and the latest pip install of stable-baselines on Ubuntu 16.04, I get the following error:

python3 test_custom_policy.py 
Creating environment from the given name, wrapped in a DummyVecEnv.
Traceback (most recent call last):
  File "test_custom_policy.py", line 46, in <module>
    model = PPO2(KerasPolicy, "CartPole-v1", verbose=1)
  File "/usr/local/lib/python3.5/dist-packages/stable_baselines/ppo2/ppo2.py", line 100, in __init__
    self.setup_model()
  File "/usr/local/lib/python3.5/dist-packages/stable_baselines/ppo2/ppo2.py", line 133, in setup_model
    n_batch_step, reuse=False, **self.policy_kwargs)
  File "test_custom_policy.py", line 25, in __init__
    self.pdtype.proba_distribution_from_latent(pi_latent, vf_latent, init_scale=0.01)
AttributeError: can't set attribute

jckastel commented Dec 6, 2019

I would like to add my vote here as well. Will this get fixed at some point, or will we have to wait for the TF 2.0-compatible version? Not being able to use predefined Keras layers means that a ton of really useful model and layer libraries are unusable with stable-baselines, and that model code will be less future-proof and much more difficult to read and maintain. This is a very unfortunate limitation of an otherwise really nice deep RL library.

AvisekNaug commented
(Quoting @pirobot's traceback above, ending in "AttributeError: can't set attribute".)

I made some changes to the code, as shown below, and it seems to be working on stable-baselines 2.9.0 with tf-gpu 1.14.x.

import tensorflow as tf
from stable_baselines import PPO2
from stable_baselines.common.policies import ActorCriticPolicy

class KerasPolicy(ActorCriticPolicy):
    def __init__(self, sess, ob_space, ac_space, n_env, n_steps, n_batch, reuse=False, **kwargs):
        super(KerasPolicy, self).__init__(sess, ob_space, ac_space, n_env, n_steps, n_batch, reuse=reuse, scale=False)

        with tf.variable_scope("model", reuse=reuse):
            flat = tf.keras.layers.Flatten()(self.processed_obs)

            x = tf.keras.layers.Dense(64, activation="tanh", name='pi_fc_0')(flat)
            pi_latent = tf.keras.layers.Dense(64, activation="tanh", name='pi_fc_1')(x)

            x1 = tf.keras.layers.Dense(64, activation="tanh", name='vf_fc_0')(flat)
            vf_latent = tf.keras.layers.Dense(64, activation="tanh", name='vf_fc_1')(x1)

            value_fn = tf.keras.layers.Dense(1, name='vf')(vf_latent)

            self._proba_distribution, self._policy, self.q_value = \
                self.pdtype.proba_distribution_from_latent(pi_latent, vf_latent, init_scale=0.01)

        self._value_fn = value_fn
        self._setup_init()

    def step(self, obs, state=None, mask=None, deterministic=False):
        if deterministic:
            action, value, neglogp = self.sess.run([self.deterministic_action, self.value_flat, self.neglogp],
                                                   {self.obs_ph: obs})
        else:
            action, value, neglogp = self.sess.run([self.action, self.value_flat, self.neglogp],
                                                   {self.obs_ph: obs})
        return action, value, self.initial_state, neglogp

    def proba_step(self, obs, state=None, mask=None):
        return self.sess.run(self.policy_proba, {self.obs_ph: obs})

    def value(self, obs, state=None, mask=None):
        return self.sess.run(self.value_flat, {self.obs_ph: obs})

model = PPO2(KerasPolicy, "CartPole-v1", verbose=1, tensorboard_log='./log')

model.learn(25000)

env = model.get_env()
obs = env.reset()

reward_sum = 0.0
for _ in range(1000):
    action, _ = model.predict(obs)
    obs, reward, done, _ = env.step(action)
    reward_sum += reward
    env.render()
    if done:
        print("Reward: ", reward_sum)
        reward_sum = 0.0
        obs = env.reset()

env.close()

jtromans commented Apr 8, 2020

Running Ubuntu 18.04.2 LTS, Docker 19.03.6 with the tensorflow/tensorflow:1.14.0-gpu-py3-jupyter image and stable_baselines 2.10.0.

FWIW, I cannot get the PPO2 agent to learn CartPole using this KerasPolicy as-is, whereas training works fine when I use the default MlpPolicy. The discounted reward chart is shown here:

[image: discounted reward chart]

@AvisekNaug, using your code presented above, I would have expected a like-for-like match with the default MlpPolicy, i.e. two dense layers of 64 neurons. Are you able to get training to work successfully?

AvisekNaug commented
(Quoting @jtromans's comment and chart above.)

Yeah, it does not, for the reasons discussed by @hill-a. It is an issue with Keras, where reuse=True does not seem to work as intended; see his response above. I merely tried to address @pirobot's issue for stable-baselines 2.10. But yeah, it does not train properly with Keras layers.
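As a quick sanity check for whether a given setup is affected (a diagnostic sketch only, assuming the KerasPolicy class from the earlier comments is defined and that the model exposes its graph as model.graph, as stable-baselines 2.x does), you can list the trainable variables after building the PPO2 model; duplicated, uniquified copies of the pi_fc_*/vf_fc_* weights indicate that the train model built its own parameters instead of reusing the act model's.

import tensorflow as tf
from stable_baselines import PPO2

# assumes the KerasPolicy class from the earlier comments is already defined
model = PPO2(KerasPolicy, "CartPole-v1", verbose=1)

with model.graph.as_default():
    for var in tf.trainable_variables():
        print(var.name)
# Proper sharing: each pi_fc_* / vf_fc_* kernel and bias appears exactly once.
# Broken sharing: a second, uniquified copy of each layer shows up as well.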
