Piggy

The Rules

Pig is a strategic, two-player dice game. The rules are simple - each turn, a player repeatedly rolls a die until they either roll a 1 or they decide to 'hold':

  • If a 1 is rolled, the player loses all points accumulated that turn and it becomes the other player's turn
  • If any other number is rolled, that number gets added to their turn total and the player decides whether to roll again or hold
  • If a player chooses to hold, their turn total is added to their overall score and it becomes the other player's turn

The first player to reach 100 points wins!
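To make the dynamics concrete, here's a minimal sketch of a single turn in Python. It's a toy illustration rather than code from this repo, and the fixed 'hold at' threshold is just a stand-in policy:

```python
import random

def play_turn(hold_at: int) -> int:
    """Simulate one turn of Pig under a simple 'hold once the turn total reaches hold_at' policy.
    Returns the points banked this turn (0 if a 1 was rolled)."""
    turn_total = 0
    while turn_total < hold_at:
        roll = random.randint(1, 6)
        if roll == 1:
            return 0              # rolled a 1: all points accumulated this turn are lost
        turn_total += roll        # any other number is added to the turn total
    return turn_total             # hold: the turn total gets banked

# Example: estimate the average points banked per turn by the 'hold at 20' policy.
trials = 100_000
print(sum(play_turn(20) for _ in range(trials)) / trials)
```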

Basic Strategy Will Only Take You So Far

It is straightforward to calculate that in order to maximise the expected number of points gained in a single turn, you should roll until your turn score reaches 20 and then hold. Whilst this simple strategy will typically serve you well, there are many circumstances in which you'd do best to deviate from it - sometimes significantly. The reason for this is that maximising the number of points you add to your score in a single turn is not the same thing as maximising your chances of winning. Consider, for example, the extreme case in which your opponent sits on 99 points with you far behind at just 50. In this situation the 'hold at 20' policy is clearly of no use as your opponent is almost certain to win on their next go.
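(For the curious, the 'hold at 20' threshold drops out of a one-line calculation: with a turn total of t, one more roll loses those t points with probability 1/6 and otherwise adds 4 points on average - the mean of 2 to 6 - so the expected change from rolling is (1/6)(-t) + (5/6)(4) = (20 - t)/6, which is positive exactly when t < 20.)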

Finding the Optimal Policy

With a bit of thought it's easy to figure out roughly the circumstances in which you should play with more or less risk. And yet, determining precisely the optimal strategy in each of the roughly 100 x 100 x 100 possible states (your score, your opponent's score and your current turn total) is far beyond human comprehension. Thankfully, given that we have a complete understanding of the game's dynamics, we can use the methods of dynamic programming (such as value iteration and policy iteration) to compute the optimal policy. Todd Neller was the first to do this in his 2004 paper.
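To give a rough flavour of that computation, here's a minimal (and deliberately unoptimised) value iteration sketch in Python. It follows Neller's formulation - the state is (my banked score, opponent's banked score, current turn total) and the value is the probability that the player about to act goes on to win - but it's only a sketch of the idea, not the implementation in this repo:

```python
import numpy as np

GOAL = 100

def value_iteration(tol=1e-9):
    # P[i, j, k] = probability that the player about to act wins, given their banked
    # score i, the opponent's banked score j and a turn total of k so far this turn.
    P = np.zeros((GOAL, GOAL, GOAL))
    while True:
        delta = 0.0
        for i in range(GOAL):
            for j in range(GOAL):
                for k in range(GOAL - i):
                    # Hold: bank k points, then the opponent moves against a score of i + k.
                    p_hold = 1.0 - P[j, i + k, 0]
                    # Roll: a 1 loses the turn total and passes the turn; 2-6 grow the turn
                    # total, winning outright if banked score plus turn total reaches the goal.
                    p_roll = (1.0 - P[j, i, 0]) / 6.0
                    for r in range(2, 7):
                        p_roll += (1.0 if i + k + r >= GOAL else P[i, j, k + r]) / 6.0
                    new = max(p_hold, p_roll)
                    delta = max(delta, abs(new - P[i, j, k]))
                    P[i, j, k] = new
        if delta < tol:
            return P
```

The optimal policy then falls out by picking, in each state, whichever of 'hold' or 'roll' attains the converged win probability.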


The Purpose of this Work

I am currently working my way through Reinforcement Learning - the classic text by Sutton & Barto so this work is primarily to aid in my understanding of some of the topics there. Pig also has a small place in my heart as one of the things that inspired me to begin studying machine learning was a university project centered around discovering good strategy in Pig - although we were primarily focused on genetic algorithms which, looking back, are quite poorly suited to this problem but alas. I have included a copy of our report from that project in this repo for no particular reason.

My plan is to first reproduce Neller's optimal policy using both value iteration and policy iteration and try to understand the differences between them in terms of their rate of convergence. Neller also discusses some 'tricks' that can be used to improve the convergence rate by applying value/policy iteration incrementally on disjoint subsets of states. I'd like to analyse what impact these tricks have on convergence - are they a requirement for a reasonable convergence time or just nice to have? It'd also be interesting to investigate the impact that the initial policy has on the convergence to the optimal policy. For example, does starting from a reasonable policy like 'hold at 20' lead to significantly faster convergence?
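For comparison with the value iteration sketch above, here's an equally rough sketch of the policy iteration side, using the same (score, opponent score, turn total) state encoding and, again, not the code in this repo. The returned count of improvement steps is one crude way to compare convergence, and the 'hold at 20' array at the bottom shows how a non-random starting policy for the initial-policy experiment could be encoded:

```python
import numpy as np

GOAL = 100

def action_values(P, i, j, k):
    """Win probabilities of holding vs rolling in state (i, j, k) under value estimates P."""
    p_hold = 1.0 - P[j, i + k, 0]              # bank k points, opponent moves next
    p_roll = (1.0 - P[j, i, 0]) / 6.0          # rolling a 1 hands the turn to the opponent
    for r in range(2, 7):
        p_roll += (1.0 if i + k + r >= GOAL else P[i, j, k + r]) / 6.0
    return p_hold, p_roll

def policy_iteration(roll, eval_tol=1e-6):
    """Alternate policy evaluation and greedy improvement until the policy is stable.
    Returns the final policy and the number of improvement steps taken."""
    P = np.zeros((GOAL, GOAL, GOAL))
    steps = 0
    while True:
        # Policy evaluation: sweep until the win probabilities under the current policy settle.
        delta = float('inf')
        while delta > eval_tol:
            delta = 0.0
            for i in range(GOAL):
                for j in range(GOAL):
                    for k in range(GOAL - i):
                        p_hold, p_roll = action_values(P, i, j, k)
                        v = p_roll if roll[i, j, k] else p_hold
                        delta = max(delta, abs(v - P[i, j, k]))
                        P[i, j, k] = v
        # Policy improvement: act greedily with respect to the evaluated win probabilities.
        steps += 1
        stable = True
        for i in range(GOAL):
            for j in range(GOAL):
                for k in range(GOAL - i):
                    p_hold, p_roll = action_values(P, i, j, k)
                    greedy = p_roll > p_hold
                    if greedy != roll[i, j, k]:
                        roll[i, j, k] = greedy
                        stable = False
        if stable:
            return roll, steps

# One possible starting point for the initial-policy experiment: roll whenever the
# turn total is below 20, otherwise hold.
hold_at_20 = np.zeros((GOAL, GOAL, GOAL), dtype=bool)
for k in range(20):
    hold_at_20[:, :, k] = True
```

Comparing the number of improvement steps (and evaluation sweeps) needed from different starting policies would then give one answer to the initial-policy question above.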

Moving beyond 'vanilla' Pig, I'd also like to experiment with variants for which dynamic programming cannot be used. This will be the case whenever we no longer have a perfect model of the game's dynamics. One idea I have is to simply simulate a biased die for which the probability of rolling each number is no longer equal and is unknown to the agent. Of course, it'd be quite simple to just have the agent estimate the die's probability distribution from some initial set of experiences and then apply value/policy iteration, but where's the fun in that? I'm more interested in seeing if an agent can learn to approximate the optimal policy through self-play alone.
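To give a flavour of what that experiment might look like, here's a rough sketch of a biased die plus a very simple Monte Carlo self-play learner. Everything here is a placeholder - the die probabilities, the tabular learning rule, the hyperparameters - rather than a settled approach:

```python
import random
from collections import defaultdict

GOAL = 100
DIE = {1: 0.25, 2: 0.15, 3: 0.15, 4: 0.15, 5: 0.15, 6: 0.15}   # hidden from the agent

Q = defaultdict(float)          # Q[(i, j, k, action)], state seen from the mover's perspective
EPSILON, ALPHA = 0.1, 0.05      # exploration rate and learning rate (arbitrary values)

def roll_biased_die():
    return random.choices(list(DIE), weights=list(DIE.values()))[0]

def choose_action(i, j, k):
    """Epsilon-greedy over the learned action values."""
    if random.random() < EPSILON:
        return random.choice(['roll', 'hold'])
    return max(['roll', 'hold'], key=lambda a: Q[(i, j, k, a)])

def play_self_play_game():
    """Both players share the same Q table; every visited (state, action) is nudged
    towards 1 if that player went on to win and 0 otherwise (a Monte Carlo update)."""
    scores, player, turn_total = [0, 0], 0, 0
    history = []                # (player, state, action) at each decision point
    while True:
        state = (scores[player], scores[1 - player], turn_total)
        action = choose_action(*state)
        history.append((player, state, action))
        if action == 'hold':
            scores[player] += turn_total
            if scores[player] >= GOAL:
                winner = player
                break
            player, turn_total = 1 - player, 0
        else:
            r = roll_biased_die()
            if r == 1:
                player, turn_total = 1 - player, 0
            else:
                turn_total += r
                if scores[player] + turn_total >= GOAL:
                    winner = player     # a rational player would simply hold here
                    break
    for mover, state, action in history:
        target = 1.0 if mover == winner else 0.0
        Q[state + (action,)] += ALPHA * (target - Q[state + (action,)])

for _ in range(10_000):         # training loop; far more games would be needed in practice
    play_self_play_game()
```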
