Pig is a strategic, two-player dice game. The rules are simple - each turn, a player repeatedly rolls a die until they either roll a 1 or they decide to 'hold':
- If a 1 is rolled, the player loses all points accumulated that turn and it becomes the other player's turn
- If any other number is rolled, that number gets added to their turn total and the player decides whether to roll again or hold
- If a player chooses to hold, their turn total is added to their overall score and it becomes the other player's turn
The first player to reach 100 points wins!
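
To make the dynamics concrete, here's a minimal sketch of a single turn in Python (the function and policy names here are my own, purely for illustration - this isn't the code in this repo):

```python
import random

def play_turn(keep_rolling):
    """Play out one turn of Pig.

    `keep_rolling` takes the current turn total and returns True to roll
    again or False to hold. Returns the points banked this turn
    (0 if a 1 was rolled).
    """
    turn_total = 0
    while keep_rolling(turn_total):
        roll = random.randint(1, 6)
        if roll == 1:
            return 0              # bust: the whole turn total is lost
        turn_total += roll
    return turn_total             # hold: bank the turn total

# Example policy - the 'hold at 20' rule of thumb discussed below.
hold_at_20 = lambda turn_total: turn_total < 20
```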
It is straightforward to calculate that, in order to maximise the expected number of points gained in a single turn, you should roll until your turn total reaches 20 and then hold. Whilst this simple strategy will typically serve you well, there are many circumstances in which you'd do best to deviate from it - sometimes significantly. The reason for this is that maximising the number of points you add to your score in a single turn is not the same thing as maximising your chances of winning. Consider, for example, the extreme case in which your opponent sits on 99 points with you far behind at just 50. In this situation the 'hold at 20' policy is clearly of no use as your opponent is almost certain to win on their next go.
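
One way to see where the threshold comes from: rolling again when your turn total is t risks losing t with probability 1/6 and gains 4 on average (the mean of 2-6) with probability 5/6, so another roll is worth taking only while (5/6)*4 - (1/6)*t > 0, i.e. while t < 20. A quick sanity check of that marginal argument:

```python
def expected_gain_from_rolling(turn_total):
    # Expected change in this turn's points from taking one more roll,
    # scaled by 6 to keep the arithmetic exact: with probability 1/6 we
    # lose the whole turn total, otherwise we gain 4 on average.
    return 5 * 4 - turn_total

# Positive below 20, exactly zero at 20 - hence 'hold at 20'.
assert expected_gain_from_rolling(19) > 0
assert expected_gain_from_rolling(20) == 0
```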
With a bit of thought it's easy to figure out roughly when you should play with more or less risk. And yet, determining the precise optimal action in each of the roughly 100^3 possible states is far beyond human comprehension. Thankfully, since we have a complete model of the game's dynamics, we can use the methods of dynamic programming (such as value iteration and policy iteration) to compute the optimal policy. Todd Neller was the first to do this in his 2004 paper.
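
As a rough sketch of what value iteration looks like here (my own illustration of the formulation in Neller's paper, not the code in this repo): the value of a state (my score, opponent's score, turn total) is the probability that the player about to act wins under optimal play, and we repeatedly apply the Bellman optimality update until the values stop changing.

```python
GOAL = 100

def value_iteration(goal=GOAL, theta=1e-9):
    # V[(i, j, k)]: probability that the player about to act wins, where i is
    # their banked score, j the opponent's score and k the current turn total.
    V = {(i, j, k): 0.0
         for i in range(goal)
         for j in range(goal)
         for k in range(goal - i)}
    while True:
        delta = 0.0
        for (i, j, k) in V:
            # Hold: bank k points, then it's the opponent's turn.
            hold = 1.0 - V[(j, i + k, 0)]
            # Roll: a 1 wipes the turn total and passes play; faces 2-6
            # either win outright or add to the turn total.
            roll = (1.0 - V[(j, i, 0)]) / 6.0
            for r in range(2, 7):
                roll += 1.0 / 6.0 if i + k + r >= goal else V[(i, j, k + r)] / 6.0
            new_v = max(hold, roll)
            delta = max(delta, abs(new_v - V[(i, j, k)]))
            V[(i, j, k)] = new_v
        if delta < theta:
            return V
```

The optimal action in a state is then simply whichever of 'roll' and 'hold' achieves the maximum. This naive version sweeps every state on every pass, so it's slow in pure Python, but it shows the shape of the update.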
I am currently working my way through Reinforcement Learning - the classic text by Sutton & Barto - so this work is primarily to aid my understanding of some of the topics there. Pig also has a small place in my heart: one of the things that inspired me to begin studying machine learning was a university project centered around discovering good strategy in Pig - although we were primarily focused on genetic algorithms, which, looking back, are quite poorly suited to this problem, but alas. I have included a copy of our report from that project in this repo for no particular reason.
My plan is to first reproduce Neller's optimal policy using both value iteration and policy iteration, and to try to understand the differences between them in terms of their rate of convergence. Neller also discusses some 'tricks' that can be used to improve the convergence rate by applying value/policy iteration incrementally to disjoint subsets of states. I'd like to analyse what impact these tricks have on convergence - are they a requirement for a reasonable convergence time or just nice to have? It'd also be interesting to investigate the impact that the initial policy has on convergence to the optimal policy. For example, does starting from a reasonable policy like 'hold at 20' lead to significantly faster convergence?
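
My reading of the partitioning trick (and this is my interpretation rather than the paper's exact algorithm): banked scores never decrease, so a state with score sum i + j only ever depends on states with the same or a higher score sum. That means groups of states can be solved in decreasing order of score sum, each group being a small self-contained system that iterates against already-converged values. Roughly, reusing the update from the sketch above:

```python
def partitioned_value_iteration(goal=100, theta=1e-9):
    V = {}
    # Solve groups of states in decreasing order of score sum i + j.
    for score_sum in range(2 * (goal - 1), -1, -1):
        group = [(i, j, k)
                 for i in range(goal) for j in range(goal) if i + j == score_sum
                 for k in range(goal - i)]
        for state in group:
            V[state] = 0.0
        while True:  # value iteration restricted to this group
            delta = 0.0
            for (i, j, k) in group:
                hold = 1.0 - V[(j, i + k, 0)]
                roll = (1.0 - V[(j, i, 0)]) / 6.0
                for r in range(2, 7):
                    roll += 1.0 / 6.0 if i + k + r >= goal else V[(i, j, k + r)] / 6.0
                new_v = max(hold, roll)
                delta = max(delta, abs(new_v - V[(i, j, k)]))
                V[(i, j, k)] = new_v
            if delta < theta:
                break
    return V
```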
Moving beyond 'vanilla' Pig, I'd also like to experiment with variants for which dynamic programming cannot be used. This will be the case whenever we no longer have a perfect model of the game's dynamics. One idea is to simulate a biased die for which the probability of rolling each number is no longer equal and is unknown to the agent. Of course, it'd be quite simple to have the agent estimate the die's probability distribution from some initial set of experiences and then apply value/policy iteration, but where's the fun in that? I'm more interested in seeing whether an agent can learn to approximate the optimal policy through self-play alone.
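
For the biased-die variant, the environment only needs to expose a sampler - the agent should never see the underlying distribution. Something along these lines (the weights below are arbitrary placeholders):

```python
import random

class BiasedDie:
    """A die with unequal, hidden face probabilities; the agent only ever
    observes the outcomes of roll()."""

    def __init__(self, weights=None, seed=None):
        # Arbitrary example bias - hidden from the learning agent.
        self._weights = weights or [0.25, 0.10, 0.15, 0.15, 0.15, 0.20]
        self._rng = random.Random(seed)

    def roll(self):
        return self._rng.choices(range(1, 7), weights=self._weights, k=1)[0]
```

A model-free learner would then have to discover, purely from rolls and wins, that busting is more (or less) likely than the fair-die analysis assumes.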