
[action_mask error] #158

Closed · lewis841214 opened this issue Dec 4, 2023 · 6 comments

@lewis841214

Any game that sets the "action_mask" to something other than all ones, for example when creating the BaseEnv:

    if not self._continuous:
        action_mask = np.ones(self.discrete_action_num, 'int8')
        # Here I set action 2 to be invalid (guarded so the continuous
        # case, where action_mask is None, is untouched):
        action_mask[2] = 0
    else:
        action_mask = None

    obs = {'observation': obs, 'action_mask': action_mask, 'to_play': -1}
    return BaseEnvTimestep(obs, rew, done, info)

will result in the following error:

    Traceback (most recent call last):
      File "./zoo/custom/pkgir/config/pjk_disc_gumbel_muzero_config.py", line 93, in <module>
        train_muzero([main_config, create_config], seed=0, max_env_step=max_env_step)
      File "/home/LightZero-main/lzero/entry/train_muzero.py", line 174, in train_muzero
        train_data = replay_buffer.sample(batch_size, policy)
      File "/home/LightZero-main/lzero/mcts/buffer/game_buffer_muzero.py", line 76, in sample
        batch_target_policies_non_re = self._compute_target_policy_non_reanalyzed(
      File "/home/LightZero-main/lzero/mcts/buffer/game_buffer_muzero.py", line 681, in _compute_target_policy_non_reanalyzed
        batch_target_policies_non_re = np.asarray(batch_target_policies_non_re)
    ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 2 dimensions. The detected shape was (128, 6) + inhomogeneous part.
    Exception ignored in: <function MuZeroEvaluator.__del__ at 0x7f8bebff93a0>

After reading the code in game_buffer_muzero.py around line 661, I found that in the branch

    if self._cfg.env_type == 'not_board_games':

the legal_actions aren't processed, whereas in the board-game case they are.

So I guess the action_mask isn't supported for the not_board_games scenario?
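To make the shape mismatch concrete, here is a minimal standalone sketch (illustrative only, not LightZero's actual code) of the difference between the two branches:

    import numpy as np

    # Board-game branch: each visit-count distribution is re-expanded onto the
    # full action space via legal_actions, so every row ends up the same length.
    full_action_space = 4
    legal_actions = [0, 1, 3]          # action 2 masked out
    visit_counts = [10, 5, 5]          # one count per legal action
    row = np.zeros(full_action_space)
    row[legal_actions] = np.array(visit_counts) / sum(visit_counts)
    print(row)                         # [0.5  0.25 0.   0.25] -> fixed length

    # not_board_games branch (before the fix): raw distributions of varying
    # lengths are appended as-is, giving a ragged list that np.asarray rejects
    # on recent NumPy with exactly the "inhomogeneous shape" ValueError above.
    ragged = [[0.5, 0.25, 0.25], [0.25, 0.25, 0.25, 0.25]]
    np.asarray(ragged)                 # ValueError on NumPy >= 1.24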

@puyuan1996
Collaborator

Hello, I understand your concerns. In previous versions, we did not specifically test scenarios where the action_mask in not_board_games contains zeros. However, theoretically, our handling of variable action spaces should be extendable to not_board_games. Therefore, we have proposed splitting the original env_type into two variables: env_type and action_type.

In our latest PR #160, we have implemented and optimized this adjustment. We warmly invite you to review and test these modifications. Thank you for your valuable feedback; it is greatly appreciated and beneficial for the advancement of LightZero. Best wishes!

@puyuan1996 puyuan1996 added the enhancement New feature or request label Dec 4, 2023
@lewis841214
Author

Hi, I ran the following file to test:

    python3 ./zoo/box2d/lunarlander/config/lunarlander_disc_gumbel_muzero_config.py

I set the policy dict as:

    env_type='not_board_games',
    action_type='varied_action_space',

and set action_mask[0] = 0 in the env file.

If I don't set action_type='varied_action_space', the error mentioned above still occurs. But after setting it, the error disappeared, and the reward does increase as the training steps increase.
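For context, the two fields sit in the config roughly like this (a paraphrased excerpt, not the full stock config; the surrounding variable name is assumed from the filename and may differ):

    lunarlander_disc_gumbel_muzero_config = dict(
        # ... other top-level fields unchanged from the default config ...
        policy=dict(
            # ... other policy fields unchanged ...
            env_type='not_board_games',
            # Without the next line, the inhomogeneous-shape error reappears:
            action_type='varied_action_space',
        ),
    )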

But one weird part remains: the completed value becomes -inf throughout the training process, as:

    [12-04 22:47:17] INFO collect end:                        muzero_collector.py:729
        episode_count: 16
        envstep_count: 1248
        avg_envstep_per_episode: 78.0
        avg_envstep_per_sec: 224.66893464049363
        avg_episode_per_sec: 2.8803709569294056
        collect_time: 5.554840067217127
        reward_mean: -373.18820628837983
        reward_std: 212.4814746359075
        reward_max: -135.52152484840005
        reward_min: -788.0163128857876
        total_envstep_count: 1248
        total_episode_count: 32
        total_duration: 14.115308258408682
        visit_entropy: 1.3028649394086207
        completed_value: -inf
    [12-04 22:47:17] WARNING NaN or Inf found in input tensor.              x2num.py:14
    [12-04 22:47:17] WARNING NaN or Inf found in input tensor.

I don't know whether this is an issue?

By the way, I have a question:

    (screenshot: the else branch around line 661 of game_buffer_muzero.py)

Today, before you uploaded the fix PR, I was checking the same place and trying to fix this bug. I did exactly the same thing as you did, except I just used the "else" part to run the code. But the weird thing is: the code inside the else branch in the screenshot is independent of state_index and current_index, so does it just keep producing the same thing? I.e., does the variable "target_policies" keep appending the same thing?

Thanks for your reply!!

@puyuan1996
Collaborator

Thank you for your feedback.

  • Regarding the issue of encountering completed_value: -inf when running lunarlander_disc_gumbel_muzero_config.py, I would like to confirm, did you only use the default configuration and make no additional modifications? Did this problem arise at the very beginning of the program execution? On my macOS system, I executed 30K environment steps and did not encounter a similar issue. In order to pinpoint the problem more accurately, please provide more detailed information.

  • About your observation that the code segment does not use state_index and current_index: this is because our goal here is to transform the visit-count distribution obtained from the MCTS search into target_policies that comply with a specific data format. This is mainly accomplished through distributions = roots_distributions[policy_index] and policy_index += 1 (a sketch follows below). I acknowledge that there is redundancy in this section of the code and that more efficient implementations exist; we will optimize it in the coming weeks. I greatly appreciate your valuable suggestion.
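For illustration, a minimal sketch of that mechanism (hypothetical and simplified; the real loop in game_buffer_muzero.py also handles padding and masking):

    policy_index = 0
    target_policies = []
    for state_index in pos_in_game_segment_list:      # one entry per sampled position
        for current_index in range(state_index, state_index + num_unroll_steps + 1):
            # A different search result is consumed on every iteration via
            # policy_index, so the appended targets do vary even though the
            # branch body never reads state_index or current_index directly.
            distributions = roots_distributions[policy_index]
            policy_index += 1
            total = sum(distributions)
            target_policies.append([visit / total for visit in distributions])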

Best wishes!

@puyuan1996 puyuan1996 added the config New or improved configuration label Dec 5, 2023
@lewis841214
Author

Hi, I've run lunarlander_disc_gumbel_muzero_config.py under the default config, with

    action_type='varied_action_space',

added at line 43, and

    action_mask[0] = 0

added at line 139 of LightZero-fix-action-mask/zoo/box2d/lunarlander/envs/lunarlander_env.py, and completed_value: -inf occurs.

I guess that you didn't add action_mask[0] = 0 in the env file, so all elements of action_mask equal 1. If I don't put action_mask[0] = 0 into the env file, the error doesn't occur either, but that's not what we want, right? This enhancement was created precisely for the case where some action is masked as 0.

Thanks!

@puyuan1996
Collaborator

Hello, indeed, after following your modifications, we did encounter this issue. We are currently investigating the cause and searching for a solution. Thank you for your patience and feedback.

@karroyan
Collaborator

Hi, this problem occurred because masked actions were not handled properly in Gumbel MuZero collecting. The error has now been fixed in #178. We welcome you to review the change and test whether it resolves the problem you were facing. Please let us know if you have any other questions or feedback. Thank you for reporting this issue!
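For readers who hit the same symptom, here is a minimal illustrative sketch (this is not the actual #178 change; gumbel_top_k and its signature are made up here) of the usual way to respect an action mask during Gumbel top-k sampling:

    import numpy as np

    def gumbel_top_k(logits, action_mask, k, rng=None):
        # Push illegal logits to -inf so they can never win the top-k draw.
        rng = rng or np.random.default_rng()
        masked_logits = np.where(action_mask == 1, logits, -np.inf)
        scores = masked_logits + rng.gumbel(size=logits.shape)  # -inf stays -inf
        k = min(k, int(action_mask.sum()))    # never request more than the legal count
        return np.argsort(scores)[::-1][:k]   # indices of the k best legal actions

    logits = np.array([0.2, 1.5, -0.3, 0.7])
    mask = np.array([0, 1, 1, 1])             # action 0 illegal, as in this issue
    print(gumbel_top_k(logits, mask, k=3))    # e.g. [1 3 2]; never contains 0

If a downstream statistic is averaged over all actions, including entries forced to -inf, the -inf leaks into the logs; that is consistent with (though not necessarily the cause of) the completed_value: -inf observed above.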
