
Why does env.render() create multiple render screens? | LSTM policy predict with one env [question] #166

Closed
SerialIterator opened this issue Jan 16, 2019 · 24 comments
Labels
question Further information is requested

Comments

@SerialIterator

SerialIterator commented Jan 16, 2019

When I run the code example from the docs for CartPole multiprocessing, it renders one window with all the envs playing the game. It also renders individual windows with the same envs playing the same games.

import gym
import numpy as np

from stable_baselines.common.policies import MlpPolicy
from stable_baselines.common.vec_env import SubprocVecEnv
from stable_baselines.common import set_global_seeds
from stable_baselines import ACKTR

def make_env(env_id, rank, seed=0):
    """
    Utility function for multiprocessed env.

    :param env_id: (str) the environment ID
    :param rank: (int) index of the subprocess
    :param seed: (int) the initial seed for the RNG
    """
    def _init():
        env = gym.make(env_id)
        env.seed(seed + rank)
        return env
    set_global_seeds(seed)
    return _init

env_id = "CartPole-v1"
num_cpu = 4  # Number of processes to use
# Create the vectorized environment
env = SubprocVecEnv([make_env(env_id, i) for i in range(num_cpu)])

model = ACKTR(MlpPolicy, env, verbose=1)
model.learn(total_timesteps=25000)

obs = env.reset()
for _ in range(1000):
    action, _states = model.predict(obs)
    obs, rewards, dones, info = env.step(action)
    env.render()

System Info
Describe the characteristics of your environment:

  • Vanilla install, followed the docs using pip
  • GPUs: 2x GTX 1080 Ti
  • Python version 3.6.5
  • Tensorflow version 1.12.0
  • ffmpeg 4.0

Additional context
[screenshot: cartpole]

@araffin araffin added the question label Jan 17, 2019
@hill-a
Owner

hill-a commented Jan 17, 2019

Hey,

Well, this seems to be on OpenAI's side.
The CartPole render function does not check whether a rendering window or an RGB image was requested.

Normally, when mode='rgb_array' is used, no rendering window is opened, as defined by the Gym docstring:

    def render(self, mode='human'):
        """Renders the environment.
        The set of supported modes varies per environment. (And some
        environments do not support rendering at all.) By convention,
        if mode is:
        - human: render to the current display or terminal and
          return nothing. Usually for human consumption.
        - rgb_array: Return an numpy.ndarray with shape (x, y, 3),
          representing RGB values for an x-by-y pixel image, suitable
          for turning into a video.
        - ansi: Return a string (str) or StringIO.StringIO containing a
          terminal-style text representation. The text can include newlines
          and ANSI escape sequences (e.g. for colors).

So you get one rendering window per environment because of CartPole, and one tiled window from SubprocVecEnv.

If you want to avoid this display issue but keep SubprocVecEnv for training, recreate the vectorized environment for the rendering code, but with only one environment:

...
model.learn(total_timesteps=25000)

env = DummyVecEnv([make_env(env_id, 0)])
obs = env.reset()
for _ in range(1000):
    action, _states = model.predict(obs)
    obs, rewards, dones, info = env.step(action)
    env.render()

It's a stopgap fix, but it's better than 5 windows.

@SerialIterator
Author

That works, but does it mean that after I've trained a model I have to load all the envs into memory just to use one of them for testing?

@araffin
Collaborator

araffin commented Jan 22, 2019

Except for LSTM policies, the predict() method only needs an observation or a batch of observations (cf. documentation), so you can use as many envs as you want (e.g. only one) for testing.

For LSTM policies, you need to feed the predict() method a batch of observations with the same shape as during training, which depends on the number of envs (to test with only one env, a trick is to pad the batch of observations with zeros).

@SerialIterator
Author

Once I get the model to converge, I'll probably need to pick your brain some more about the all-zeros trick.

@araffin
Collaborator

araffin commented Jan 22, 2019

To make it clearer: for LSTM policies, the predict() method expects a batch of shape (n_envs,) + obs_space.shape, so if you want to test with only one env, construct an ndarray of shape (1,) + obs_space.shape and then concatenate it with zeros to obtain the final ndarray.

Note: the first dimension may change (not sure if it is n_envs or minibatch_size), but you get the idea.
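For illustration, here is a minimal, self-contained sketch of that padding (the sizes n_envs = 4 and an observation space of shape (8,) are hypothetical, chosen only to show the shapes):

import numpy as np

# Hypothetical sizes, for illustration only
n_envs = 4          # number of envs the model was trained with
obs_shape = (8,)    # observation space shape

# Observation coming from a single test env (shape (1, 8))
single_obs = np.random.rand(1, *obs_shape)

# Zero-padded batch of shape (n_envs,) + obs_shape, with the real data in row 0
batch = np.zeros((n_envs,) + obs_shape)
batch[0, :] = single_obs
print(batch.shape)  # (4, 8), the batch shape an LSTM policy's predict() expects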

@araffin araffin closed this as completed Mar 1, 2019
@op1490

op1490 commented Jun 15, 2019

Hi @araffin, I followed your comments above but am really struggling to get it to work. I am using an LSTM policy with SubprocVecEnv. My code is below:

env = DummyVecEnv([self.make_env(test_gym, 0)])

# for LSTMPolicies, the predict method expect a shape of (n_envs, obs_space.shape),
# so if you want to test with only one env,
# construct an ndarray of shape (1, obs_space.shape) and then
# concatenate it with zeros to create the final ndarray.
obs = env.reset()

zeroes = np.zeros(shape=(n_envs - 1, env.observation_space.shape[1]))
obs = np.concatenate((obs, zeroes), axis=0)
print(obs.shape)
for _ in range(1000):
    action, _states = model.predict(obs)
    obs, rewards, dones, info = env.step(action)
    env.render()

With the above code, although print(obs.shape) gives me (8, 1, 77), I get the following error when attempting to predict: ValueError: Cannot feed value of shape (1, 1, 77) for Tensor 'input/Ob:0', which has shape '(8, 1, 77)'

Any ideas? Did I understand your comments correctly?

@araffin
Collaborator

araffin commented Jun 15, 2019

Hello,

You can find below a working example:

import gym
import numpy as np

from stable_baselines import PPO2
from stable_baselines.common.vec_env import DummyVecEnv

def make_env():
    def maker():
        env = gym.make("CartPole-v1")
        return env
    return maker

# Train with 2 envs
n_training_envs = 2
envs = DummyVecEnv([make_env() for _ in range(n_training_envs)])
model = PPO2("MlpLstmPolicy", envs, nminibatches=2)

# Create one env for testing
test_env = DummyVecEnv([make_env() for _ in range(1)])
test_obs = test_env.reset()

# model.predict(test_obs) would throw an error
# because the number of test envs is different from the number of training envs,
# so we need to complete the observation with zeros
zero_completed_obs = np.zeros((n_training_envs,) + envs.observation_space.shape)
zero_completed_obs[0, :] = test_obs

# IMPORTANT: with recurrent policies, don't forget the state
state = None
action, state = model.predict(zero_completed_obs, state=state)
# The test env is expecting only one action
new_obs, reward, done, info = test_env.step([action[0]])
# Update the obs
zero_completed_obs[0, :] = new_obs

Please look at the documentation on how to use recurrent policies during testing; here you were forgetting the state.
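For completeness, here is a sketch of how a full evaluation loop could look, reusing the variables from the example above (model, envs, test_env, n_training_envs); the done mask is passed through predict's mask argument so the recurrent state is reset when an episode ends:

# Evaluation loop, continuing the example above
state = None
# For recurrent policies, the mask flags which envs just finished an episode
done_mask = [False] * n_training_envs

zero_completed_obs = np.zeros((n_training_envs,) + envs.observation_space.shape)
zero_completed_obs[0, :] = test_env.reset()

for _ in range(1000):
    action, state = model.predict(zero_completed_obs, state=state, mask=done_mask)
    # The test env expects a single action
    new_obs, reward, done, info = test_env.step([action[0]])
    zero_completed_obs[0, :] = new_obs
    # Only the first entry corresponds to a real env
    done_mask = [done[0]] + [False] * (n_training_envs - 1)
    test_env.render()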

@araffin araffin changed the title Why does env.render() create multiple render screens [question] Why does env.render() create multiple render screens? | LSTM policy predict with one env [question] Jun 15, 2019
@hn2

hn2 commented Jun 16, 2019

This is the code used for prediction:

n_cpu = 1
env = PortfolioEnv(history=history, abbreviation=instruments, steps=settings['steps'], window_length=settings['window_length'], include_ta=settings['include_ta'],allow_short=settings['allow_short'], reward=settings['reward'])
env = SubprocVecEnv([lambda: env for _ in range(n_cpu)])
   
mdl = 'futures_20100101_20180101_5000000_2000_3_return_False'
model = PPO2.load(mdl)

# initialized here
obs = env.reset()
zero_completed_obs = np.zeros((n_cpu,) + env.observation_space.shape)
zero_completed_obs[0, :] = obs

state = None
#   state = model.initial_state   #   get the initial state vector for the recurrent network
#   done = np.zeros(state.shape[0])   #   set all environments to not done
    
weights, state = model.predict(zero_completed_obs, state)
    
#   print(weights)  

return weights, settings

I get this error in model.predict:

<class 'ValueError'>
Traceback (most recent call last):
  File "C:\Users\hanna\Anaconda3\lib\site-packages\quantiacsToolbox\quantiacsToolbox.py", line 871, in runts
    position, settings = TSobject.myTradingSystem(*argList)
  File "ppo2_quantiacs_test.py", line 47, in myTradingSystem
    weights, state = model.predict(zero_completed_obs, state)
  File "C:\Users\hanna\Anaconda3\lib\site-packages\stable_baselines\common\base_class.py", line 472, in predict
    actions, _, states, _ = self.step(observation, state, mask, deterministic=deterministic)
  File "C:\Users\hanna\Anaconda3\lib\site-packages\stable_baselines\common\policies.py", line 508, in step
    {self.obs_ph: obs, self.states_ph: state, self.dones_ph: mask})
  File "C:\Users\hanna\Anaconda3\lib\site-packages\tensorflow\python\client\session.py", line 929, in run
    run_metadata_ptr)
  File "C:\Users\hanna\Anaconda3\lib\site-packages\tensorflow\python\client\session.py", line 1128, in _run
    str(subfeed_t.get_shape())))
ValueError: Cannot feed value of shape (1, 675) for Tensor 'input/Ob:0', which has shape '(12, 675)'

@araffin
Collaborator

araffin commented Jun 17, 2019

Please read my example carefully: you have to use n_training_envs, not n_cpu.

@hn2

hn2 commented Jun 17, 2019

What's the difference between n_training_envs and n_cpu?
It's just a variable name.

@araffin
Collaborator

araffin commented Jun 17, 2019

You trained your agent with 12 envs (according to the error) and want to test it with only one.
But here, n_cpu != n_training_envs, so you get an error.

@hn2

hn2 commented Jun 17, 2019

I changed it according to your example:

n_env = 12
env = PortfolioEnv(history=history, abbreviation=instruments, steps=settings['steps'], window_length=settings['window_length'], include_ta=settings['include_ta'],allow_short=settings['allow_short'], reward=settings['reward'])
env = DummyVecEnv([lambda: env for _ in range(1)])
   
mdl = 'futures_20100101_20180101_5000000_2000_3_return_False'
#   mdl = 'futures_20100101_20180101_5000000_2000_3_return_False_c7616a5f58b141aa989379427458bbe8'
model = PPO2.load(mdl)

# initialized here
obs = env.reset()
zero_completed_obs = np.zeros((n_env,) + env.observation_space.shape)
zero_completed_obs[0, :] = obs

state = None
#   state = model.initial_state   #   get the initial state vector for the recurrent network
#   done = np.zeros(state.shape[0])   #   set all environments to not done
    
pos, state = model.predict(zero_completed_obs, state)

Still get:

ValueError: could not broadcast input array from shape (12,45) into shape (45)

@hn2

hn2 commented Jun 17, 2019

I guess that I have to take the first row of the pos matrix? pos[0] ?

@araffin
Collaborator

araffin commented Jun 17, 2019

I guess that I have to take the first row of the pos matrix? pos[0] ?

OK, you did not show all the code. Sure, your test env is expecting only one action; please try things yourself before asking a question at each step.

EDIT: I updated the example accordingly.

@hn2

hn2 commented Jun 18, 2019

I still have a problem with:

pos, state = model.predict(zero_completed_obs, state, done)

ValueError: Error: Unexpected observation shape (12, 5) for Box environment, please use (10,) or (n_env, 10) for the observation shape.

Model was trained with n_env = 12

Where does this 10 come from?

@hill-a
Owner

hill-a commented Jun 18, 2019

I still have a problem with:

pos, state = model.predict(zero_completed_obs, state, done)

ValueError: Error: Unexpected observation shape (12, 5) for Box environment, please use (10,) or (n_env, 10) for the observation shape.

Model was trained with n_env = 12

Where does this 10 come from?

A few things:

Your issue will not be addressed if you do not follow the format described in the issue template (https://github.com/hill-a/stable-baselines/blob/master/.github/ISSUE_TEMPLATE/issue-template.md)

@hn2

hn2 commented Jun 18, 2019

n_env = 12
env = PortfolioEnv(history=history, abbreviation=instruments, steps=settings['steps'], window_length=settings['window_length'], include_ta=settings['include_ta'],allow_short=settings['allow_short'], reward=settings['reward'], debug=settings['debug'])
env = SubprocVecEnv([lambda: env for _ in range(1)])

mdl = 'ES_19900102_20180101_5000000_7000_1_return_False_7a686c53e4a34338942a8b4bbe65fa47'
model = PPO2.load(mdl)

# initialized here
obs = env.reset()
zero_completed_obs = np.zeros((n_env,) + env.observation_space.shape)
zero_completed_obs[0, :] = obs

state = None
state = model.initial_state   
done = np.zeros(state.shape[0])   

pos, state = model.predict(zero_completed_obs, state, done)

Traceback (most recent call last):
File "C:\Users\hanna\Anaconda3\lib\site-packages\quantiacsToolbox\quantiacsToolbox.py", line 871, in runts
position, settings = TSobject.myTradingSystem(*argList)
File "ppo2_quantiacs_test.py", line 68, in myTradingSystem
pos, state = model.predict(zero_completed_obs, state, done)
File "C:\Users\hanna\Anaconda3\lib\site-packages\stable_baselines\common\base_class.py", line 469, in predict
vectorized_env = self._is_vectorized_observation(observation, self.observation_space)
File "C:\Users\hanna\Anaconda3\lib\site-packages\stable_baselines\common\base_class.py", line 399, in _is_vectorized_observation
.format(", ".join(map(str, observation_space.shape))))
ValueError: Error: Unexpected observation shape (12, 5) for Box environment, please use (10,) or (n_env, 10) for the observation shape.

@hill-a
Owner

hill-a commented Jun 18, 2019

Please use the Markdown code-highlighting format (https://help.github.com/en/articles/creating-and-highlighting-code-blocks)

Reading code in plain text is not pleasant, and formatting it only takes a few seconds.

Also, you are not using the latest version of stable-baselines, and you must:

follow the format described in the issue template (https://github.com/hill-a/stable-baselines/blob/master/.github/ISSUE_TEMPLATE/issue-template.md)

which asks you to state which version of stable-baselines you have.

You are loading a model that expects an observation shape of (n_env, 10). The error message is explicit.

@hn2

hn2 commented Jun 18, 2019

OK, it was my mistake, I realize it now.
I am getting NaN in prediction.
Anyway, I emailed you and Antonin privately.
Even if I do not get NaN, it is not working on new unseen data, and in fact it does not even work when testing on the same data it was trained on. I hope that you can help and finish this once and for all.

@hill-a
Owner

hill-a commented Jun 18, 2019

I am getting NaN in prediction.

You might want to have a look at this: https://stable-baselines.readthedocs.io/en/master/guide/checking_nan.html
It will help you find the NaNs in your code, specifically the VecCheckNan wrapper: https://stable-baselines.readthedocs.io/en/master/guide/checking_nan.html#vecchecknan-wrapper
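For example, a minimal sketch of the wrapper (using CartPole as a stand-in for your own environment):

import gym
from stable_baselines.common.vec_env import DummyVecEnv, VecCheckNan

# VecCheckNan raises an error as soon as a NaN or inf appears in the
# observations, actions or rewards, which helps pinpoint where it originates
env = DummyVecEnv([lambda: gym.make("CartPole-v1")])
env = VecCheckNan(env, raise_exception=True)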

Even if I do not get NaN, it is not working on new unseen data, and in fact it does not even work when testing on the same data it was trained on.

Reinforcement learning is not a magic bullet; it is in no way guaranteed to work all the time on every problem. For a mathematical reference, see the no free lunch theorem, which states:

Any two optimization algorithms are equivalent when their performance is averaged across all possible problems

including random optimization algorithms.

You might want to try some tricks like VecFrameStack, VecNormalize, or a hyperparameter search to help the algorithm optimize the way you would like.
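For instance, a minimal sketch of wiring in those wrappers (CartPole as a placeholder for your env; n_stack=4 is an arbitrary choice):

import gym
from stable_baselines.common.vec_env import DummyVecEnv, VecFrameStack, VecNormalize

env = DummyVecEnv([lambda: gym.make("CartPole-v1")])
env = VecFrameStack(env, n_stack=4)                        # stack the last 4 observations
env = VecNormalize(env, norm_obs=True, norm_reward=True)   # running normalization of obs/rewards
# Remember to save and reuse the normalization statistics when testing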

I hope that you can help and finish this once and for all.

If you believe you have found a bug in the code of stable-baselines, and can demonstrate it reliably:
We will address it.

If you need tech support or consulting:
We will not help.

We have neither the time nor the obligation to provide consulting on stable-baselines. The library is provided "as is", as described in the MIT license: https://github.com/hill-a/stable-baselines/blob/master/LICENSE.

@hn2

hn2 commented Jun 18, 2019

I understand that you do not have any obligation to provide consulting.
I am trying to implement this:
http://www-scf.usc.edu/~zhan527/post/cs599/
with stable-baselines.
In the original article it does work, even on unseen data.
He created his own DDPG agent, and I understood that PPO is supposed to be better.

@hill-a
Owner

hill-a commented Jun 18, 2019

In the original article it does work, even on unseen data.

Correction: on the given unseen data. It is possible to generate data that will not give a positive result for the algorithm. That is the whole point of adversarial learning.

He created his own DDPG agent, and I understood that PPO is supposed to be better.

How did you get that impression? Both have advantages and disadvantages.

EDIT:
If you are trying to replicate the results of the blog post, why don't you use their hyperparameters with DDPG?

If that fails, then try to find the underlying implementation differences between the blog post's DDPG and stable-baselines' DDPG.

In fact, why use stable-baselines at all? They have a GitHub repo of their solution: https://github.com/vermouth1992/drl-portfolio-management

@hn2

hn2 commented Jun 18, 2019

I know that they have a GitHub repository with their code. There are other similar works on GitHub, for example https://github.com/yuriak/RLQuant
or
https://github.com/liangzp/Reinforcement-learning-in-portfolio-management-
I was hoping that stable-baselines would let me test various agents and not be confined to DDPG only.
In addition, stable-baselines has TensorBoard integration.
In any case, at this point I still believe that the problem is with my code and not with the agent or the hyperparameters.
The original work is actually this:
https://arxiv.org/abs/1808.09940

@hill-a
Owner

hill-a commented Jun 18, 2019

Locking the issue; it is diverging too much from the original message.

Repository owner locked as off-topic and limited conversation to collaborators Jun 18, 2019