
Using Saved Model as Enemy Policy in Custom Environment (while training in a subprocvecenv) #835

Open
lukepolson opened this issue Apr 30, 2020 · 16 comments
Labels: custom gym env, question

Comments

@lukepolson

I am currently training in an environment that has multiple agents. In this case there are multiple snakes all on the same 11x11 grid moving around and eating food. There is one "player" snake and three "enemy" snakes. Every 1 million training steps I want to save the player model and update the enemies in such a way that they use that model to make their movements.

I can (sort of) do this in a SubprocVecEnv by importing tensorflow each time I call the update policy method of the Snake game class (which inherits from the gym environment class).

def set_enemy_policy(self, policy_path):
    # Imported inside the method so that TensorFlow is imported in the
    # child process spawned by SubprocVecEnv.
    import tensorflow as tf
    tf.reset_default_graph()
    # Reload the full enemy model from disk on every policy update.
    self.ENEMY_POLICY = A2C.load(policy_path)

I consider this a hackish method because it imports TensorFlow into each child process (of the SubprocVecEnv) every time the enemy policy is updated.

I use this hackish approach because I cannot simply pass model = A2C.load(policy_path) into some sort of callback, as these models can't be pickled.

Is there a standard solution for this sort of problem?

@lukepolson lukepolson changed the title from "Using Presaved Feature in Custom Environment" to "Using Saved Model as Enemy Policy in Custom Environment (while training in a subprocvecenv)" Apr 30, 2020
@araffin araffin added the custom gym env and question labels Apr 30, 2020
@araffin
Collaborator

araffin commented Apr 30, 2020

Hello,
it sounds like you should take a look at @AdamGleave's work (based on stable-baselines): https://github.com/HumanCompatibleAI/adversarial-policies

@Miffyli
Collaborator

Miffyli commented Apr 30, 2020

I did something similar here, where the opponent in the env loads a new policy at random times upon environment reset. You can use agent.load_parameters(path_to_file) to load only the policy parameters (network weights), and avoid creating new models each time.
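
A minimal sketch of that idea, assuming stable-baselines 2.x; OpponentPolicy, initial_path, and refresh are hypothetical names used only for illustration:

from stable_baselines import A2C

class OpponentPolicy:
    def __init__(self, initial_path):
        # Built once per process; later updates only swap the weights.
        self.model = A2C.load(initial_path)

    def refresh(self, policy_path):
        # load_parameters replaces the network weights in-place, so no new
        # model or TensorFlow graph is created on each update.
        self.model.load_parameters(policy_path)

    def act(self, observation):
        action, _states = self.model.predict(observation)
        return action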

@AdamGleave
Collaborator

Yeah, we've done something similar; the most relevant class is CurryVecEnv.

@lukepolson
Author

Cheers,
@Miffyli that looks like an elegant solution. I'm reading your code a bit, and I'm assuming that:

  1. Since create_env is called only once (per environment) in the SubprocVecEnv, the PPO model is only loaded n_envs times. When you call player2_reset(), it only loads the parameters of some new model and then updates the PPO model you had loaded before. Since you're only loading parameters, there is no memory error (as opposed to loading a new PPO model on every update).

  2. I noticed you opened the PPO model inside the create_env function. If instead you opened the PPO model in the actual environment, I assume you would get some sort of "can't be pickled" error? This was precisely the error I was getting.

@lukepolson
Author

@AdamGleave your repository seems like the way to go for my task going forward. I have a few questions:

  1. In your "step_async" method you have self.venv.step_async(actions). Now self.venv is a vectorized environment which takes in multiple actions. In a standard vectorized environment this would be multiple independent environments, each taking their own step and returning a new state. In this case, the environments are not independent, as the output state for one environment depends on the actions of the other three. (For example, in my snake environment, all snakes move at the same time, so if snake #2 moves right, its output state depends on how snakes #1, #3, and #4 also move.) Where exactly are these dependencies dealt with?

@Miffyli
Collaborator

Miffyli commented Apr 30, 2020

@lukepolson
Regarding questions towards me:

  1. Yup. The agent is kept as it is (only created once), and only the network parameters are updated with the ones from the file.
  2. I think the latter should work fine, too, as long as the env/agent is created in the new process. This is a bit of a bottleneck, since each env will have a separate instance of an agent. A VecEnv along the lines of what @AdamGleave did could use one agent to create actions for all envs.

@AdamGleave
Collaborator

@AdamGleave your repository seems like the way to go for my task going forward. I have a few questions:

  1. In your "step_async" method you have self.venv.step_async(actions). Now self.venv is a vectorized environment which takes in multiple actions. In a standard vectorized environment this would be multiple independent environments, each taking their own step and returning a new state. In this case, the environments are not independent, as the output state for one environment depends on the actions of the other three. (For example, in my snake environment, all snakes move at the same time, so if snake #2 moves right, its output state depends on how snakes #1, #3, and #4 also move.) Where exactly are these dependencies dealt with?

The way we have it set up is that there are multiple players per environment. The observation and action spaces are n-tuples, where n is the number of players. CurryVecEnv extracts the appropriate observation tuple and feeds it into the policy, then collects actions from all the players and reconstructs a full action tuple.

So each environment is still independent. We just use multiple environments to speed up training.
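
A rough sketch of that pattern, not the actual CurryVecEnv implementation (which also removes the fixed agent from the exposed observation/action tuples): a VecEnv wrapper that drives one agent with a fixed policy and merges its actions back into the full action tuple. FixedAgentVecEnv, policy, and agent_idx are names assumed here for illustration, on top of stable-baselines 2.x.

from stable_baselines.common.vec_env import VecEnvWrapper

class FixedAgentVecEnv(VecEnvWrapper):
    """Wraps a multi-agent VecEnv whose observations/actions are per-agent tuples."""

    def __init__(self, venv, policy, agent_idx=0):
        super().__init__(venv)
        self.policy = policy        # model whose .predict() drives the fixed agent
        self.agent_idx = agent_idx  # slot in the action tuple it controls
        self._fixed_obs = None      # cached batch of observations for that agent

    def reset(self):
        obs = self.venv.reset()     # tuple of per-agent observation batches
        self._fixed_obs = obs[self.agent_idx]
        return obs

    def step_async(self, actions):
        # Predict actions for the fixed agent from its latest observations,
        # then rebuild the full per-agent action tuple before stepping.
        fixed_actions, _ = self.policy.predict(self._fixed_obs)
        full_actions = list(actions)
        full_actions[self.agent_idx] = fixed_actions
        self.venv.step_async(tuple(full_actions))

    def step_wait(self):
        obs, rewards, dones, infos = self.venv.step_wait()
        self._fixed_obs = obs[self.agent_idx]
        return obs, rewards, dones, infos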

@lukepolson
Author

lukepolson commented Apr 30, 2020

@AdamGleave thank you for the clarification. So if I'm correct, the venv object is just a collection of objects, each of which is a multi-agent environment? If this is correct, then in the lines

def step_wait(self):
    observations, rewards, self._dones, infos = self.venv.step_wait()

observations would be an array that looks like

[[obs_agent_1, obs_agent_2, obs_agent_3], [obs_agent_1, obs_agent_2, obs_agent_3], ...]

supposing that there were 3 agents (and the length along axis=0 would be the number of vectorized environments). Is this correct?

@AdamGleave
Collaborator

@AdamGleave thank you for the clarification. So if I'm correct, the venv object is just a collection of objects, each of which is a multi-agent environment?

Yes, each environment in the VecEnv is a multi-agent environment.

@lukepolson
Author

@AdamGleave I just updated my comment above.

@AdamGleave
Collaborator

@AdamGleave I just updated my comment above.

Yeah, you have the right idea. You can see more information in our definition of MultiAgentEnv. Each environment in the VecEnv is an instance of this.

@lukepolson
Author

lukepolson commented Apr 30, 2020

Thanks a bunch for all your help @AdamGleave ! I appreciate the organization of your code as well. Before I go on a coding spree, I want to make sure I have the architecture of your package well summarized.

  1. One can create a VecEnv (SubprocVecEnv, DummyVecEnv, etc.) with many multi-agent environments inside it. This does not have to be one of your VecEnvs (it can be one of the Stable Baselines environments).

  2. Once the VecEnv is set up, it contains many multi-agent environments. Each multi-agent environment takes in (action_1, action_2, ...) and returns (obs_1, obs_2, ...), (reward_1, reward_2, ...), etc. per time step. The issue with this is that it does not work with standard Stable Baselines, ...

  3. ... so you created CurryVecEnv, which takes in a particular agent_idx (e.g. 2) and a policy (A2C, PPO, etc.), and this new vectorized environment is such that each environment step returns action_2, reward_2, ... per time step. Time steps are taken using the input policy.

Now say I want to train all 4 agents at once using one or more different policies in an adversarial sort of way. Furthermore, suppose that some agents can die (i.e. a snake dies and is removed from the board), so that at some points in the game you might only have 2 or 3 agents on the board at a time. Is your code well structured for this sort of problem? If not, I'm debating switching to an alternative software package like RLlib. If you think there's some other architecture that is particularly well suited for this task, let me know as well :)

@AdamGleave
Collaborator

Thanks a bunch for all your help @AdamGleave ! I appreciate the organization of your code as well. Before I go on a coding spree, I want to make sure I have the architecture of your package well summarized.

  1. One can create a VecEnv (SubprocVecEnv, DummyVecEnv, etc.) with many multi-agent environments inside it. This does not have to be one of your VecEnvs (it can be one of the Stable Baselines environments).

My code is set up to expect a VecMultiEnv, although this is just a slight modification to a VecEnv to add a num_agents parameter. This is all in the multi_agent.py file. They use the Stable Baselines implementations under the hood.

  2. Once the VecEnv is set up, it contains many multi-agent environments. Each multi-agent environment takes in (action_1, action_2, ...) and returns (obs_1, obs_2, ...), (reward_1, reward_2, ...), etc. per time step. The issue with this is that it does not work with standard Stable Baselines, ...

Yeah, that's right.

  3. ... so you created CurryVecEnv, which takes in a particular agent_idx (e.g. 2) and a policy (A2C, PPO, etc.), and this new vectorized environment is such that each environment step returns action_2, reward_2, ... per time step. Time steps are taken using the input policy.

Kind of. CurryVecEnv just fixes one agent and returns a VecMultiEnv with one fewer agent. So if you're in a two-player game, after one application of CurryVecEnv you have a VecMultiEnv with a single agent. This isn't the same as a single-agent environment though because everything's still in tuples (just of length 1), but you can use FlattenSingletonVecEnv to turn it into a standard single-agent VecEnv.
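
An illustrative composition for a two-player game (CurryVecEnv and FlattenSingletonVecEnv come from the adversarial-policies repo; two_player_venv and opponent_policy are hypothetical placeholders, and the exact constructor signatures/import paths may differ):

# two_player_venv: a VecMultiEnv with 2 agents; opponent_policy: the fixed agent's policy
single_agent_venv = FlattenSingletonVecEnv(CurryVecEnv(two_player_venv, opponent_policy))
# single_agent_venv now behaves like a standard single-agent VecEnv and can be
# passed directly to a Stable Baselines algorithm.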

Now say I want to train all 4 agents at once using one or more different policies in an adversarial sort of way. Furthermore, suppose that some agents can die (i.e. a snake dies and is removed from the board), so that at some points in the game you might only have 2 or 3 agents on the board at a time. Is your code well structured for this sort of problem? If not, I'm debating switching to an alternative software package like RLlib. If you think there's some other architecture that is particularly well suited for this task, let me know as well :)

The code isn't really set up for training multiple agents simultaneously. It wouldn't be hard to make your own interface to do this, though. But I've heard RLlib has good multi-agent support.

Note that Stable Baselines doesn't implement multi-agent RL algorithms. If you have 4 agents, I'm not sure self-play is going to converge to anything (the usual guarantees only hold in 2-player zero-sum games).

@lukepolson
Author

@Miffyli I was looking at your solution and implemented it in my code. I'm finding that doing this for 4 agents is relatively slow.

  • For reference, with 8 parallel environments I can chug through about 8000 training steps in 22 seconds when using the enemy policy, but in about 6 seconds without it. Furthermore, if I try to use a SubprocVecEnv with 16 different environments (with the 4 different agent policies), I run into memory errors on the graphics card; I don't run into these errors when I don't use enemy policies.

Is model.predict() usually the bottleneck in these sorts of things? My environment is only 39x39 with one channel, so I can't see why the computation would take so long...

@lukepolson
Author

lukepolson commented May 1, 2020

Kind of. CurryVecEnv just fixes one agent and returns a VecMultiEnv with one fewer agent. So if you're in a two-player game, after one application of CurryVecEnv you have a VecMultiEnv with a single agent. This isn't the same as a single-agent environment though because everything's still in tuples (just of length 1), but you can use FlattenSingletonVecEnv to turn it into a standard single-agent VecEnv.

@AdamGleave you mention that after one application of CurryVecEnv you fix one agent and get a VecMultiEnv with one fewer agent. I suppose then the proper protocol for three enemies on the board would be to apply CurryVecEnv three times? In pseudo-code:

FlattenSingletonVecEnv(
    CurryVecEnv(CurryVecEnv(CurryVecEnv(env, enemy_pol_1), enemy_pol_2), enemy_pol_3))

I suspect I'll just modify the CurryVecEnv to input multiple agent indices and take in multiple policies so that I can get rid of all three at once (unless you already have a class for this).

Now @Miffyli mentions that the CurryVecEnv uses one agent to create actions for all environments in the SubprocVecEnv (for each enemy). How does this work with parallelizing the code? Does 16 vs. 8 environments in a SubprocVecEnv then take twice as long?

Thanks again for answering all my questions. I'm still relatively new to machine learning and reinforcement learning in general. I tried using tf_agents for a month or two but found the documentation somewhat lacking. The help here has been much appreciated!

@Miffyli
Collaborator

Miffyli commented May 2, 2020

@lukepolson

Yeah, my solution is not very optimized, as there are separate agents for each env, each running at their own pace. Predict is not slow compared to train, but it does slow down gathering samples when there is an agent involved (I needed this for my own experiments, though). And indeed, it runs out of VRAM quickly if you use CNNs.

I recommend going Adam's way here and trying to create one agent that predicts for a bunch of environments at once (I should have done this myself, too :') ).
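
A minimal sketch of that batched-prediction idea, assuming stable-baselines 2.x; "enemy_policy.zip" and n_envs are hypothetical placeholders, and the observations would really come from the vectorized environments rather than np.zeros:

import numpy as np
from stable_baselines import A2C

n_envs = 8
enemy_model = A2C.load("enemy_policy.zip")  # one model shared by all envs

# One row of enemy observations per parallel environment.
obs_batch = np.zeros((n_envs,) + enemy_model.observation_space.shape)

# A single forward pass produces enemy actions for every environment at once,
# instead of one predict() call per env (and per enemy model).
enemy_actions, _ = enemy_model.predict(obs_batch)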
