Some questions about population-based algorithms #1114
Specifically, I want to know how the performance of the algorithm can be evaluated, similar to how reinforcement learning algorithms can be evaluated by the reward obtained from the environment. Taking Leduc poker as an example, how can we show that the algorithm is effective? And after training is completed, what should we save: the RL model or the policy? I am not completely familiar with this, so I hope you can give some advice. Thank you!
Hi @Root970103. If you're using PSRO or some form of fictitious play, the thing you save is either the average strategy, or the entire set of policies coupled with the meta-strategy. The latter can be turned into one policy using the policy_aggregator (if the game is small enough). A good place to start is this example: https://github.com/deepmind/open_spiel/blob/master/open_spiel/python/examples/psro_v2_example.py Hope this helps, but please don't hesitate to ask more questions if it's not clear.
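The idea behind the aggregation can be illustrated without the open_spiel API: a population of tabular policies plus a meta-strategy is combined into one policy by taking the meta-strategy-weighted mixture of the per-state action probabilities. This is only a minimal sketch — the real `policy_aggregator` in open_spiel additionally accounts for reach probabilities in extensive-form games, which this toy version ignores, and the policy/state names here are made up for illustration.

```python
# Illustrative sketch (NOT the open_spiel policy_aggregator): mix a
# population of tabular policies according to meta-strategy weights.
# Each policy maps an information state -> {action: probability}.

def aggregate_policies(policies, meta_strategy):
    """Return the meta-strategy-weighted mixture of tabular policies."""
    assert abs(sum(meta_strategy) - 1.0) < 1e-9, "weights must sum to 1"
    aggregate = {}
    for policy, weight in zip(policies, meta_strategy):
        for state, action_probs in policy.items():
            bucket = aggregate.setdefault(state, {})
            for action, prob in action_probs.items():
                bucket[action] = bucket.get(action, 0.0) + weight * prob
    return aggregate

# Two hypothetical population members over one information state "s".
pi_a = {"s": {"call": 1.0, "fold": 0.0}}
pi_b = {"s": {"call": 0.0, "fold": 1.0}}
mixed = aggregate_policies([pi_a, pi_b], [0.25, 0.75])
# mixed["s"] -> {"call": 0.25, "fold": 0.75}
```

The aggregated policy is what you would then pass to NashConv computation or simulate against other agents.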
Thank you for your reply! I have run this example script and observed the changes in nash_conv. I also wonder whether I can use the trained model (or policy) against other algorithms in the Leduc poker environment. For example, if I want to test the trained model against CFR, should the entire set of policies or the aggregated policy be used? In addition, in an adversarial scenario, is it appropriate to use the Q-value to evaluate the algorithms? It's very kind of you to give this advice.
Yes, you can extract the policy (that is what the NashConv computation needs) and you can simulate the policy against CFR's policy.
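The head-to-head comparison boils down to estimating the expected payoff of one policy against the other. Here is a minimal sketch using a made-up zero-sum matrix game as a stand-in for simulating full Leduc poker episodes; the payoff matrix and the strategy vectors are assumptions for illustration only, not output of any real training run.

```python
# Sketch: expected payoff of one mixed strategy against another in a
# zero-sum matrix game (a stand-in for simulating episodes of Leduc poker).

def expected_payoff(row_strategy, col_strategy, payoff_matrix):
    """Expected row-player payoff when both players play mixed strategies."""
    total = 0.0
    for i, p in enumerate(row_strategy):
        for j, q in enumerate(col_strategy):
            total += p * q * payoff_matrix[i][j]
    return total

payoffs = [[0.0, 1.0],
           [-1.0, 0.5]]          # hypothetical row-player payoffs
psro_strategy = [0.6, 0.4]       # e.g. the aggregated PSRO policy
cfr_strategy = [0.5, 0.5]        # e.g. the CFR average policy
value = expected_payoff(psro_strategy, cfr_strategy, payoffs)
# value -> 0.2 in this toy example
```

In an actual Leduc experiment you would instead sample many episodes with each policy controlling one seat (swapping seats to remove positional bias) and average the returns.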
Q-values are just estimates of the value of a state and action. You can turn them into a policy by choosing argmax_a Q(s, a), but the result will be deterministic. So if the environment requires any kind of mixing, you would lose that by taking the argmax over the Q-values.
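The loss-of-mixing point can be made concrete with rock-paper-scissors, where the equilibrium is the uniform mixture: any deterministic argmax policy is maximally exploitable, while the uniform policy is not. The Q-value estimates below are hypothetical numbers chosen for illustration.

```python
# Sketch: why argmax over Q-values loses mixing. In rock-paper-scissors
# the equilibrium strategy is uniform; a deterministic policy is exploitable.

RPS = [[0, -1, 1],
       [1, 0, -1],
       [-1, 1, 0]]  # row-player payoffs: rock, paper, scissors

def exploitability(strategy):
    """Value the opponent's best response achieves against `strategy`."""
    col_values = [sum(p * RPS[i][j] for i, p in enumerate(strategy))
                  for j in range(3)]
    return -min(col_values)  # opponent picks the column worst for us

q_values = [0.05, 0.0, -0.05]  # hypothetical (noisy) learned Q-estimates
greedy = [0.0, 0.0, 0.0]
greedy[q_values.index(max(q_values))] = 1.0  # argmax -> deterministic
uniform = [1 / 3, 1 / 3, 1 / 3]
# exploitability(greedy) -> 1.0 (opponent always wins by best-responding)
# exploitability(uniform) -> 0.0 (uniform play cannot be exploited)
```

This is why, for evaluation in adversarial settings, exploitability-style metrics (like the NashConv already printed by the PSRO example) are more informative than raw Q-values.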
Thank you for your contribution in providing population-based algorithms such as fictitious play, PSRO, and so on. The examples you provide show the nash_conv value during the training process. I still have a question about how to evaluate the algorithm.