
Calling the gym-env by name #73

Closed · lolanchen opened this issue Sep 27, 2020 · 19 comments
Labels: bug (Something isn't working), enhancement (New feature or request)

@lolanchen

Hi Haris! First of all, thank you for putting in the effort of making poke-env.

I ran your rl_with_open_ai_gym_wrapper.py and tried a bunch of other RL algorithms from keras-rl2 with the play_against method, and they worked just fine.

Naturally, I would then like to get poke-env working with newer and better-maintained RL libraries than keras-rl2.
I tried to get RLlib working with poke-env, specifically with the play_against method, but couldn't get it to work.

RLlib's training flow goes like this (code copied from RLlib's docs):

import ray
import ray.rllib.agents.ppo as ppo
from ray.tune.logger import pretty_print

ray.init()
config = ppo.DEFAULT_CONFIG.copy()
config["num_gpus"] = 0
config["num_workers"] = 1
config["eager"] = False
trainer = ppo.PPOTrainer(config=config, env="CartPole-v0")

# Can optionally call trainer.restore(path) to load a checkpoint.

for i in range(1000):
   # Perform one iteration of training the policy with PPO
   result = trainer.train()
   print(pretty_print(result))

   if i % 100 == 0:
       checkpoint = trainer.save()
       print("checkpoint saved at", checkpoint)

where the whole gym-env class is passed to the trainer object.
After three days of trial and error, I've concluded that there is no straightforward way to reconcile this syntax with the play_against method in poke-env.

I wonder if it's possible to wrap poke-env into a registered gym-env and make it callable by name,
like env = gym.make('poke_env-v0')?
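
For reference, this is roughly the registration pattern I have in mind with the standard gym API (a minimal sketch only: the id and entry point are hypothetical, and MyPokeEnv stands in for a concrete, user-defined subclass of Gen8EnvSinglePlayer):

import gym
from gym.envs.registration import register

# Hypothetical registration: the entry point would have to resolve to a
# concrete env class implementing embed_battle and friends.
register(
    id="PokeEnv-v0",
    entry_point="my_project.envs:MyPokeEnv",
)

env = gym.make("PokeEnv-v0")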

@hsahovic (Owner)

Hey @lolanchen !

Thanks for opening this issue. Revamping the gym API is next on the project todo list, after doubles support (#71).

I'll take a deeper look at RLlib's API and examples later today and let you know if I think of a workaround.

@lolanchen (Author)

Thanks for the prompt reply.
Among all the Python RL libraries I know, RLlib seems to be the most feature-rich and well-maintained, so it'd be really nice if poke-env worked with it.
In the meantime, I'll see if I can get any other RL libraries, like stable-baselines, to work with poke-env.

@hsahovic (Owner)

I was able to run this adapted version of the SimpleRLPlayer example code with RLlib:

import numpy as np
import ray
import ray.rllib.agents.ppo as ppo

from gym.spaces import Box, Discrete
from poke_env.player.env_player import Gen8EnvSinglePlayer
from poke_env.player.random_player import RandomPlayer


class SimpleRLPlayer(Gen8EnvSinglePlayer):
    def __init__(self, *args, **kwargs):
        Gen8EnvSinglePlayer.__init__(self)
        self.observation_space = Box(low=-10, high=10, shape=(10,))

    @property
    def action_space(self):
        return Discrete(22)

    def embed_battle(self, battle):
        # -1 indicates that the move does not have a base power
        # or is not available
        moves_base_power = -np.ones(4)
        moves_dmg_multiplier = np.ones(4)
        for i, move in enumerate(battle.available_moves):
            moves_base_power[i] = (
                move.base_power / 100
            )  # Simple rescaling to facilitate learning
            if move.type:
                moves_dmg_multiplier[i] = move.type.damage_multiplier(
                    battle.opponent_active_pokemon.type_1,
                    battle.opponent_active_pokemon.type_2,
                )

        # We count how many pokemons have not fainted in each team
        remaining_mon_team = (
            len([mon for mon in battle.team.values() if mon.fainted]) / 6
        )
        remaining_mon_opponent = (
            len([mon for mon in battle.opponent_team.values() if mon.fainted]) / 6
        )

        # Final vector with 10 components
        return np.concatenate(
            [
                moves_base_power,
                moves_dmg_multiplier,
                [remaining_mon_team, remaining_mon_opponent],
            ]
        )

    def compute_reward(self, battle) -> float:
        return self.reward_computing_helper(
            battle, fainted_value=2, hp_value=1, victory_value=30
        )



ray.init()
config = ppo.DEFAULT_CONFIG.copy()
config["num_gpus"] = 0
config["num_workers"] = 0  # Training will not work with poke-env if this value != 0
config["framework"] = "tfe"

trainer = ppo.PPOTrainer(config=config, env=SimpleRLPlayer)

def ray_training_function(player):
    for i in range(2):
        result = trainer.train()
        print(result)
    player.complete_current_battle()

env_player = trainer.workers.local_worker().env
opponent = RandomPlayer()

env_player.play_against(
    env_algorithm=ray_training_function,
    opponent=opponent,
)

Let me know if this workaround works for you!

Regardless, please do not close this issue, as I would like to come back to it once #71 is done - there is a lot of work to do to make these kinds of experiments easier to run :)

@hsahovic hsahovic self-assigned this Sep 28, 2020
@hsahovic hsahovic added bug Something isn't working enhancement New feature or request labels Sep 28, 2020
@hsahovic hsahovic added this to To do in Poke-env - general via automation Sep 28, 2020
@lolanchen (Author)

Hey!
Sorry for the late reply. I got caught up in the new semester and ran into some trouble with my remote server.

Your code does run on my server, and by modifying the range in ray_training_function, I ran PPO for 10 iterations (that is, 10*1000 steps, as steps/iteration is 1000 in the default config).
The resulting episode_reward_mean values are pasted below:

episode_reward_mean: -2.7280377745174103
episode_reward_mean: 0.35718424950124494
episode_reward_mean: 5.994585932058545
episode_reward_mean: 1.1399170581483733
episode_reward_mean: 0.8904312965046478
episode_reward_mean: -0.1688560565443128
episode_reward_mean: 6.294702439147294
episode_reward_mean: 11.273508633647912
episode_reward_mean: 7.598307024096139

from which it is really hard to tell whether the agent is learning or not.
So, to test whether it's really learning, I ran DQN with the default config for 10 iterations and evaluated it against RandomPlayer with the code below:

import ray
import ray.rllib.agents.dqn as dqn
from ray.tune.logger import pretty_print

ray.init()
#config = ppo.DEFAULT_CONFIG.copy()
config = dqn.DEFAULT_CONFIG.copy()
config["num_gpus"] = 0
config["num_workers"] = 0  # Training will not work with poke-env if this value != 0
config["framework"] = "tfe"

#trainer = ppo.PPOTrainer(config=config, env=SimpleRLPlayer)
trainer = dqn.DQNTrainer(config=config, env=SimpleRLPlayer)

def ray_training_function(player):
    for i in range(10):
        result = trainer.train()
        print(pretty_print(result))
    player.complete_current_battle()

def ray_evaluating_function(player):
    player.reset_battles()
    for _ in range(100):
        done = False
        obs = player.reset()
        while not done:
            action = trainer.compute_action(obs)
            obs, _, done, _ = player.step(action)
    player.complete_current_battle()

    print(
        "DQN Evaluation: %d victories out of %d episodes"
        % (player.n_won_battles, 100)
    )

env_player = trainer.workers.local_worker().env
opponent = RandomPlayer()

#training
env_player.play_against(
    env_algorithm=ray_training_function,
    opponent=opponent,
)

#evaluating
env_player.play_against(
    env_algorithm=ray_evaluating_function,
    opponent=opponent,
)

and the episode_reward_mean values are:

episode_reward_mean: -8.791754878163145
episode_reward_mean: -3.807043700629318
episode_reward_mean: 0.357559964225597
episode_reward_mean: 2.8638216535487953
episode_reward_mean: 6.245930132105003
episode_reward_mean: 8.699014366167646
episode_reward_mean: 13.316892510656178
episode_reward_mean: 16.019107498538155
episode_reward_mean: 18.09931113046513
episode_reward_mean: 22.781653675237703

and the evaluation result against RandomPlayer is
DQN Evaluation: 56 victories out of 100 episodes

The reward does seem to increase steadily, but the result against the random player is much worse than in your keras-rl-based example.
The increasing reward suggests it's working, but a 56% win rate against random isn't very convincing.
I will try to tune the hyperparameters to match your keras-rl2 example and see how it goes.

@hsahovic (Owner) commented Sep 29, 2020

Hey @lolanchen,

Thanks for keeping me posted!
I think the first parameter you should tune is gamma: the keras-rl example uses a relatively small value, which leads to a greedier policy that is a lot easier to learn than with a higher value, which would require learning long-term dependencies. If I recall correctly, RLlib uses 0.99 as its default value.

Edit: regarding reward values, you can customize compute_reward, potentially using reward_computing_helper. In the setting used in the examples above, you should probably aim for an average reward of 20+.
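
As a concrete sketch of both knobs, reusing names from the snippets above (TunedRLPlayer is just a made-up name, and the values are the ones already discussed in this thread, not tuned recommendations):

class TunedRLPlayer(SimpleRLPlayer):
    def compute_reward(self, battle) -> float:
        # Reward shaping via the built-in helper; adjust these weights to taste
        return self.reward_computing_helper(
            battle, fainted_value=2, hp_value=1, victory_value=30
        )

config = dqn.DEFAULT_CONFIG.copy()
config["num_gpus"] = 0
config["num_workers"] = 0
config["framework"] = "tfe"
config["gamma"] = 0.5  # RLlib defaults to 0.99; the keras-rl2 example uses 0.5

trainer = dqn.DQNTrainer(config=config, env=TunedRLPlayer)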

@lolanchen (Author)

@hsahovic

Yes, just changing config["gamma"] = 0.5 raised the win rate against Random after 10 iterations to 76%.
But interestingly, the mean reward dropped to around 15, compared to 22+ when gamma was 0.99.

Thanks a lot for all the advice! I'll try to play more with the parameters tomorrow.

@mancho1987

Hey @lolanchen,

Did you manage to make it work with other RL libraries?

@lolanchen (Author)

@mancho1987
I haven’t tried any others since RLlib has been working just fine for me. Tell me what you are trying to use; maybe I can help test it. :)

@mancho1987

@lolanchen thanks for the fast reply :) I am just starting out in RL, so my knowledge is quite limited. I found Acme from DeepMind and was wondering if it could be used here, but after checking out RLlib, it seems to me that there is no need for Acme. I will try it with RLlib as mentioned in the posts above and see if I can make it work :) By the way, what features are you selecting? I read about embeddings; are they useful for things like moves?

@pablovin

Hi all,

I was able to run the Stable Baselines (https://stable-baselines.readthedocs.io/en/master/index.html) examples with some small changes:

from stable_baselines.deepq.policies import MlpPolicy
from stable_baselines import DQN

import numpy as np
from gym.spaces import Box, Discrete

from poke_env.player.env_player import Gen8EnvSinglePlayer
from poke_env.player.random_player import RandomPlayer


class SimpleRLPlayer(Gen8EnvSinglePlayer):

    observation_space = Box(low=-10, high=10, shape=(10,))
    action_space = Discrete(22)

    def getThisPlayer(self):
        return self

    def __init__(self, *args, **kwargs):
        Gen8EnvSinglePlayer.__init__(self)

    def embed_battle(self, battle):
        # -1 indicates that the move does not have a base power
        # or is not available
        moves_base_power = -np.ones(4)
        moves_dmg_multiplier = np.ones(4)
        for i, move in enumerate(battle.available_moves):
            moves_base_power[i] = (
                move.base_power / 100
            )  # Simple rescaling to facilitate learning
            if move.type:
                moves_dmg_multiplier[i] = move.type.damage_multiplier(
                    battle.opponent_active_pokemon.type_1,
                    battle.opponent_active_pokemon.type_2,
                )

        # We count how many pokemons have not fainted in each team
        remaining_mon_team = (
            len([mon for mon in battle.team.values() if mon.fainted]) / 6
        )
        remaining_mon_opponent = (
            len([mon for mon in battle.opponent_team.values() if mon.fainted]) / 6
        )

        # Final vector with 10 components
        return np.concatenate(
            [
                moves_base_power,
                moves_dmg_multiplier,
                [remaining_mon_team, remaining_mon_opponent],
            ]
        )

    def compute_reward(self, battle) -> float:
        return self.reward_computing_helper(
            battle, fainted_value=2, hp_value=1, victory_value=30
        )





envPlayer = SimpleRLPlayer()
opponent = RandomPlayer()


model = DQN(MlpPolicy, envPlayer, gamma=0.5, verbose=1)
def ray_training_function(player):

    print ("Training...")
    model.learn(total_timesteps=1000)
    print("Training complete.")


def ray_evaluating_function(player):
    player.reset_battles()
    for _ in range(100):
        done = False
        obs = player.reset()
        while not done:
            action = model.predict(obs)[0]
            obs, _, done, _ = player.step(action)
            # print ("done:" + str(done))
    player.complete_current_battle()

    print(
        "DQN Evaluation: %d victories out of %d episodes"
        % (player.n_won_battles, 100)
    )

# Training
envPlayer.play_against(
    env_algorithm=ray_training_function,
    opponent=opponent,
)

envPlayer.play_against(
    env_algorithm=ray_evaluating_function,
    opponent=opponent,
)

model.save("/home/pablo/Documents/Datasets/PokeEnv/TrainedAgents/DQN/vsRandom/DQNvsRandom")

What I am trying to do now is to train an agent and use it to play against humans on the Showdown server. I am facing some problems when starting the game, and I am still not sure how to use a loaded agent to select moves. I am thinking of loading the saved model inside SimpleRLPlayer and using it to select moves via the "choose_move" and "embed_battle" functions. I am just having trouble mapping the predicted action (one of 22 values) to the string choice the method needs to return. Has anyone ever tried something like that?

Cheers,

Pablo

@pablovin

A quick update: I checked the closed issues and got my answer there. I would implement choose_move like this:

    def choose_move(self, battle):
        # Embed the current battle state and let the trained model pick an action
        observations = self.embed_battle(battle)
        action = self.model.predict(observations)[0]  # predict returns (action, state)
        return self._action_to_move(action, battle)

in case anyone is looking for it :)

@hsahovic hsahovic moved this from To do to Next 0.X release in Poke-env - general Dec 22, 2020
@mnguyen0226

@pablovin I assume you are referring to this: #119 (comment)
Could you post the full choose_move implementation script here? What is _action_to_move?

@hsahovic (Owner)

@mnguyen0226 _action_to_move is a method implemented in GenXEnvSinglePlayer which converts an action (e.g. an integer, as returned by the model) into a battle order.
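
For a minimal illustration of the conversion itself (FirstActionPlayer is a made-up name, and SimpleRLPlayer refers to the class from the snippets above; if I recall correctly, an illegal action falls back to a random legal move):

class FirstActionPlayer(SimpleRLPlayer):
    def choose_move(self, battle):
        # _action_to_move maps an integer index in the action space
        # to a poke-env battle order
        return self._action_to_move(0, battle)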

@mnguyen0226 commented Aug 16, 2021

@hsahovic Hmmm, I see. I'm approaching training with Stable Baselines. What I am trying to do is have 2 trained models fight each other on 2 different laptops/accounts. I was able to do this with the MaxDamage vs Random agents. My plan now is to have a TrainedRLAgent (with DQN) vs a Random agent. I have trained the simple RL DQN with the Stable Baselines code provided above, with a good evaluation score. The Random agent was able to take moves automatically, but the TrainedRLAgent loaded from the trained model was not able to select actions automatically.

Here is my script for running the TrainedRLPlayer:

import asyncio

from poke_env.player_configuration import PlayerConfiguration
from poke_env.server_configuration import ShowdownServerConfiguration
from minh_dev.rl_poke.sb_rl_agent import SimpleRLPlayer
from stable_baselines import DQN

trained_model = DQN.load(
    "/home/mnguyen/Documents/summer2021/pokemon/poke_env/trained_models/saved_simpleRL"
)

class TrainedRLPlayer(SimpleRLPlayer):
    def __init__(self, *args, **kwargs):
        SimpleRLPlayer.__init__(self, *args, **kwargs)
        self.model = trained_model

    def choose_move(self, battle):
        # if the player can attack, it will
        observations = self.embed_battle(battle=battle)
        action = self.model.predict(observations)
        return(self._action_to_move(action, battle))

async def main():
    simplerl_player = TrainedRLPlayer(
        player_configuration=PlayerConfiguration("bot_1_account", "bot_1_pw"),
        server_configuration=ShowdownServerConfiguration,
    )

    # Sending challenges to "bot_2_account"
    await simplerl_player.send_challenges("bot_2_account", n_challenges=1)

    # Accepting one challenge from any user
    await simplerl_player.accept_challenges(None, 1)

    # Accepting two challenges from "bot_2_account"
    await simplerl_player.accept_challenges("bot_2_account", 2)

    # Playing 5 games on the ladder
    await simplerl_player.ladder(5)

    # Print the rating of the player and its opponent after each battle
    for battle in simplerl_player.battles.values():
        print(battle.rating, battle.opponent_rating)


if __name__ == "__main__":
    asyncio.get_event_loop().run_until_complete(main())

^ Did I miss anything needed to get the loaded, trained DQN agent to take actions automatically?

@hsahovic (Owner)

Is any error raised? If not, can you set:

    simplerl_player = TrainedRLPlayer(
        player_configuration=PlayerConfiguration("bot_1_account", "bot_1_pw"),
        server_configuration=ShowdownServerConfiguration,
        log_level=20,
    )

and see where the logs stop?

@mnguyen0226

There is no error raised. It sets up and plays 1 game (I still have to choose the actions manually instead of the RL agent choosing them). There are no logs. Oddly, the terminal stated that the bot I sent the challenge to was not found, even though I had one bot challenge the other and got them into the same fight.

@hsahovic (Owner)

Can you open a separate issue with the logs (e.g. the output on the terminal)?

Poke-env - general automation moved this from Next 0.X release to Done Aug 16, 2021
@hsahovic hsahovic reopened this Aug 16, 2021
Poke-env - general automation moved this from Done to In progress Aug 16, 2021
@mnguyen0226

@hsahovic I was able to load the model and have the agent pick actions automatically now. I figured out the issue: the "action" variable returned a tuple instead of an int.

Thank you for the help!
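
For anyone hitting the same thing: stable-baselines' predict returns an (action, state) tuple, so a minimal fix, sketched on top of the TrainedRLPlayer above, is to unpack it before handing the action to _action_to_move:

    def choose_move(self, battle):
        observations = self.embed_battle(battle=battle)
        # predict returns (action, state); keep only the action and make it an int
        action, _ = self.model.predict(observations)
        return self._action_to_move(int(action), battle)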

@hsahovic (Owner)

This should be fixed by upgrading to 0.4.17. If this issue, or another similar RAM-increase bug, arises, please open a new issue :)

Poke-env - general automation moved this from In progress to Done Aug 19, 2021