Why does env.render() create multiple render screens? | LSTM policy predict with one env [question] #166
Comments
Hey, well this seems to be on OpenAI's side. Normally, `env.render()` behaves as described in Gym's docstring:

```python
def render(self, mode='human'):
    """Renders the environment.

    The set of supported modes varies per environment. (And some
    environments do not support rendering at all.) By convention,
    if mode is:

    - human: render to the current display or terminal and
      return nothing. Usually for human consumption.
    - rgb_array: Return an numpy.ndarray with shape (x, y, 3),
      representing RGB values for an x-by-y pixel image, suitable
      for turning into a video.
    - ansi: Return a string (str) or StringIO.StringIO containing a
      terminal-style text representation. The text can include newlines
      and ANSI escape sequences (e.g. for colors).
    """
```

So you get one rendering window per environment (because of how CartPole renders), plus one tiled window from SubprocVecEnv. If you want to avoid this display issue but keep the multiprocessing for training, you can render a single env after training:
```python
model.learn(total_timesteps=25000)

# Re-create a single env just for rendering
env = DummyVecEnv([make_env(env_id, 0)])
obs = env.reset()
for _ in range(1000):
    action, _states = model.predict(obs)
    obs, rewards, dones, info = env.step(action)
    env.render()
```

It's a stopgap fix, but it is better than 5 windows.
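The snippet assumes a `make_env(env_id, rank)` factory like the one in the docs' multiprocessing example; a minimal sketch of such a helper (the seeding and parameter names are assumptions, not part of the comment above):

```python
import gym

def make_env(env_id, rank, seed=0):
    """Return a thunk that creates (and seeds) one env instance."""
    def _init():
        env = gym.make(env_id)
        env.seed(seed + rank)
        return env
    return _init
```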
That works, but does it mean that after I've trained a model I have to load all envs into memory just to use one of them for testing?
No, except for LSTM policies: for LSTMPolicies, you need to feed the predict method with the same observation shape as during training, which depends on the number of envs (to test it with only one env, a trick can consist in completing the batch of observations with zeros).
Once I get the model to converge, I'll probably need to pick your brain some more about the all-zeros trick.
To make it clearer: for LSTMPolicies, the predict method expects observations of shape `(n_envs,) + observation_space.shape`. Note: this shape convention may change in future versions (I am not sure it is definitive).
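To illustrate the zero-completion trick in isolation, here is a minimal sketch (the number of envs and the observation shape are made-up values for the example):

```python
import numpy as np

n_training_envs = 4        # number of envs the model was trained with (assumption for this sketch)
obs_shape = (8,)           # observation space shape (assumption for this sketch)

single_obs = np.ones((1,) + obs_shape)   # observation coming from one test env

# Complete the batch with zeros so its first dimension matches n_training_envs
padded_obs = np.zeros((n_training_envs,) + obs_shape)
padded_obs[0, :] = single_obs
# padded_obs now has the shape an LSTM policy trained with n_training_envs envs expects
```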
Hi @araffin, I followed your comments above but am really struggling to get it to work. I am using an LSTM policy with SubprocVecEnv. My code is below:

```python
env = DummyVecEnv([self.make_env(test_gym, 0)])
# for LSTMPolicies, the predict method expects a shape of (n_envs, obs_space.shape),
# so if you want to test with only one env,
# construct an ndarray of shape (1, obs_space.shape) and then
# concatenate it with zeros to create the final ndarray
obs = env.reset()
zeroes = np.zeros(shape=(n_envs - 1, env.observation_space.shape[1]))
obs = np.concatenate((obs, zeroes), axis=0)
print(obs.shape)
for _ in range(1000):
    action, _states = model.predict(obs)
    obs, rewards, dones, info = env.step(action)
    env.render()
```

With the above code, although the observation shape looks correct, it still does not work. Any ideas? Did I understand your comments correctly?
Hello, you can find below a working example:

```python
import gym
import numpy as np

from stable_baselines import PPO2
from stable_baselines.common.vec_env import DummyVecEnv


def make_env():
    def maker():
        env = gym.make("CartPole-v1")
        return env
    return maker


# Train with 2 envs
n_training_envs = 2
envs = DummyVecEnv([make_env() for _ in range(n_training_envs)])
model = PPO2("MlpLstmPolicy", envs, nminibatches=2)

# Create one env for testing
test_env = DummyVecEnv([make_env() for _ in range(1)])
test_obs = test_env.reset()

# model.predict(test_obs) would throw an error
# because the number of test envs is different from the number of training envs,
# so we need to complete the observation with zeros
zero_completed_obs = np.zeros((n_training_envs,) + envs.observation_space.shape)
zero_completed_obs[0, :] = test_obs

# IMPORTANT: with recurrent policies, don't forget the state
state = None
action, state = model.predict(zero_completed_obs, state=state)

# The test env is expecting only one action
new_obs, reward, done, info = test_env.step([action[0]])

# Update the obs
zero_completed_obs[0, :] = new_obs
```

Please look at the documentation on how to use recurrent policies during testing; here you were forgetting the state.
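To extend that example into a longer evaluation loop, a sketch could look like the following (reusing `model`, `envs`, `test_env` and `n_training_envs` from above; resetting the recurrent state at the end of an episode is my assumption, not something stated in the answer):

```python
# Evaluation loop sketch
state = None
test_obs = test_env.reset()
zero_completed_obs = np.zeros((n_training_envs,) + envs.observation_space.shape)
zero_completed_obs[0, :] = test_obs

for _ in range(500):
    action, state = model.predict(zero_completed_obs, state=state)
    new_obs, reward, done, info = test_env.step([action[0]])
    zero_completed_obs[0, :] = new_obs
    if done[0]:
        # DummyVecEnv resets the env automatically; here we only reset the LSTM state (assumption)
        state = None
```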
This is the code used for prediction:

```python
n_cpu = 1
env = PortfolioEnv(history=history, abbreviation=instruments, steps=settings['steps'],
                   window_length=settings['window_length'], include_ta=settings['include_ta'],
                   allow_short=settings['allow_short'], reward=settings['reward'])
env = SubprocVecEnv([lambda: env for _ in range(n_cpu)])
mdl = 'futures_20100101_20180101_5000000_2000_3_return_False'
model = PPO2.load(mdl)

# initialized here
obs = env.reset()
zero_completed_obs = np.zeros((n_cpu,) + env.observation_space.shape)
zero_completed_obs[0, :] = obs
state = None
# state = model.initial_state  # get the initial state vector for the recurrent network
# done = np.zeros(state.shape[0])  # set all environments to not done
weights, state = model.predict(zero_completed_obs, state)
# print(weights)
return weights, settings
```

I get this error in model.predict:

```
<class 'ValueError'>
Traceback (most recent call last):
  File "C:\Users\hanna\Anaconda3\lib\site-packages\quantiacsToolbox\quantiacsToolbox.py", line 871, in runts
    position, settings = TSobject.myTradingSystem(*argList)
  File "ppo2_quantiacs_test.py", line 47, in myTradingSystem
    weights, state = model.predict(zero_completed_obs, state)
  File "C:\Users\hanna\Anaconda3\lib\site-packages\stable_baselines\common\base_class.py", line 472, in predict
    actions, _, states, _ = self.step(observation, state, mask, deterministic=deterministic)
  File "C:\Users\hanna\Anaconda3\lib\site-packages\stable_baselines\common\policies.py", line 508, in step
    {self.obs_ph: obs, self.states_ph: state, self.dones_ph: mask})
  File "C:\Users\hanna\Anaconda3\lib\site-packages\tensorflow\python\client\session.py", line 929, in run
    run_metadata_ptr)
  File "C:\Users\hanna\Anaconda3\lib\site-packages\tensorflow\python\client\session.py", line 1128, in _run
    str(subfeed_t.get_shape())))
ValueError: Cannot feed value of shape (1, 675) for Tensor 'input/Ob:0', which has shape '(12, 675)'
```
Please read my example carefully: you have to use `n_training_envs` (the number of envs the model was trained with), not `n_cpu = 1`, when building the zero-completed observation.
What's the difference between n_training_envs and n_cpu?
You trained your agent with 12 envs (according to the error) and want to test it with only one.
I changed it according to your example:

```python
n_env = 12
env = PortfolioEnv(history=history, abbreviation=instruments, steps=settings['steps'],
                   window_length=settings['window_length'], include_ta=settings['include_ta'],
                   allow_short=settings['allow_short'], reward=settings['reward'])
env = DummyVecEnv([lambda: env for _ in range(1)])
mdl = 'futures_20100101_20180101_5000000_2000_3_return_False'
# mdl = 'futures_20100101_20180101_5000000_2000_3_return_False_c7616a5f58b141aa989379427458bbe8'
model = PPO2.load(mdl)

# initialized here
obs = env.reset()
zero_completed_obs = np.zeros((n_env,) + env.observation_space.shape)
zero_completed_obs[0, :] = obs
state = None
# state = model.initial_state  # get the initial state vector for the recurrent network
# done = np.zeros(state.shape[0])  # set all environments to not done
pos, state = model.predict(zero_completed_obs, state)
```

I still get:

```
ValueError: could not broadcast input array from shape (12,45) into shape (45)
```
I guess that I have to take the first row of the pos matrix, i.e. `pos[0]`?
OK, you did not show all the code. Sure, your test env is expecting only one action; please also try things by yourself before asking a question at each step. EDIT: I updated the example accordingly.
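Concretely, a one-line sketch reusing the names from the snippet above: only the first row of the predicted actions corresponds to the real (non zero-padded) observation, so the single test env should receive just that action.

```python
# Pass only the first predicted action to the single test env
new_obs, reward, done, info = env.step([pos[0]])
```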
Still have a problem with:

```python
pos, state = model.predict(zero_completed_obs, state, done)
```

```
ValueError: Error: Unexpected observation shape (12, 5) for Box environment, please use (10,) or (n_env, 10) for the observation shape.
```

The model was trained with n_env = 12. Where does this 10 come from?
A few things:

- Your issue will not be addressed if you do not follow the format described in the issue template (https://github.com/hill-a/stable-baselines/blob/master/.github/ISSUE_TEMPLATE/issue-template.md).
- Please use the Markdown code highlighting format (https://help.github.com/en/articles/creating-and-highlighting-code-blocks) instead of posting fragments like the following as plain text:

  > n_env = 12
  > Traceback (most recent call last):

  Reading code in pure text is not pleasant, and formatting it only takes a few seconds of your time. Also, you are not using the latest version of stable-baselines; you should follow the issue template, as it asks you to describe which version of stable-baselines you have.
- You are loading a model that expects (n_env, 10) for the observation shape. The message is explicit.
OK, it was my mistake, I realize it now.
You might want to have a look at this: https://stable-baselines.readthedocs.io/en/master/guide/checking_nan.html

Reinforcement learning is not a magic bullet; it is in no way guaranteed to work all the time on every problem. For mathematical reference see the no free lunch theorem, which states that all optimization algorithms perform equally well when their performance is averaged over all possible problems, including random optimization algorithms. You might want to try some tricks like frame stacking (VecFrameStack), VecNormalize, or hyperparameter search to help the algorithm optimize the way you would like.
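As an illustration of those wrappers, a minimal sketch (assuming a make_env() factory like in the CartPole example earlier in this thread; the numbers are arbitrary):

```python
from stable_baselines.common.vec_env import DummyVecEnv, VecFrameStack, VecNormalize

train_env = DummyVecEnv([make_env() for _ in range(4)])
train_env = VecFrameStack(train_env, n_stack=4)                        # stack the last 4 observations
train_env = VecNormalize(train_env, norm_obs=True, norm_reward=True)   # running normalization of obs and rewards
```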
If you believe you have found a bug in the code of stable-baselines, and can show it reliably, then please open an issue following the template. If you need tech support or consulting: we do not have the time, nor the obligation, for consulting on stable-baselines. The library is provided "as is", as described in the MIT license: https://github.com/hill-a/stable-baselines/blob/master/LICENSE.
I understand that you do not have any obligation to counsel.
Correction: on the given unseen data. It is possible to generate data that will not give a positive result for the algorithm; that is the whole point of adversarial learning.
How did you get that impression? Both have advantages and disadvantages. EDIT: If that fails, then try to find the underlying implementation differences between the blog post's DDPG and stable-baselines's DDPG. In fact, why use stable-baselines at all? They have a GitHub repo of their solution: https://github.com/vermouth1992/drl-portfolio-management
I know that they have a GitHub repository with their code. There are other similar works on GitHub, for example https://github.com/yuriak/RLQuant
Locking issue, diverging too much from the original message.
When I run the code example from the docs for CartPole multiprocessing, it renders one window with all envs playing the game. It also renders individual windows with the same envs playing the same games.