An OCaml implementation of command-line backgammon with a bot trained using reinforcement learning.
Play against the bot here!
The current version plays about as well as the top human players, using a single neural network to evaluate board positions with no look-ahead or other modifications (achieving a 42–43% win rate against GNU Backgammon in single-game match play).
How it works
Backgammon was the first major board game to be successfully attacked with techniques that resemble modern reinforcement learning, by Gerald Tesauro in the early 1990s. A first-hand account of the development of his program, TD-Gammon, can be found here. A good second-hand summary, including an introduction to the TD(λ) algorithm on which the program is based, can be found in Sutton and Barto's Reinforcement Learning: An Introduction. Details of some later (re-)implementations can be found here and here.
Our implementation uses a closely-related but distinct algorithm. In outline, the general form of our algorithm is as follows.
- Set up a feedforward neural network that takes as input a representation of a board position from the perspective of a particular player, and is intended to output an estimate of that player's "equity", i.e. the probability of that player winning the game (one may additionally ask for the probabilities of more specific outcomes, i.e. winning or losing a gammon or a backgammon).
- Construct an "amplified" equity estimator, which is allowed to be more expensive than the neural network but should also be more accurate. In the simplest case, the amplified estimator can use 1-ply look-ahead, i.e. consider every possible roll of the dice and calculate the expected equity, as estimated by the neural network itself, once the opponent has played their next move. The amplified estimator should correctly identify who has won (and whether by a gammon or a backgammon) once the game is over.
- Generate a sequence of board positions (using every position evaluated by the amplified estimator during self-play, for example), evaluate each position using the amplified estimator, and use these as training examples for the neural network.
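The 1-ply case of the amplified estimator can be sketched as follows. This is a toy illustration, not the repository's actual code: `nn_equity`, `successors`, and the board type are hypothetical stand-ins.

```ocaml
(* Toy sketch of the 1-ply "amplified" estimator. [nn_equity b] is the
   neural network's equity estimate for the player to move in board [b];
   [successors board roll] enumerates the boards reachable when the
   opponent plays [roll]. Both, and the board type itself, are
   hypothetical stand-ins. *)

(* The 21 distinct dice rolls with their probabilities: a double occurs
   with probability 1/36, any other roll with 2/36. *)
let all_rolls =
  let rolls = ref [] in
  for d1 = 1 to 6 do
    for d2 = d1 to 6 do
      let p = if d1 = d2 then 1. /. 36. else 2. /. 36. in
      rolls := ((d1, d2), p) :: !rolls
    done
  done;
  !rolls

(* Expected equity after the opponent's reply: for each roll, assume the
   opponent picks the move minimising our equity, then average over all
   rolls. (A full implementation would also return the exact result once
   the game is over.) *)
let amplified_equity ~nn_equity ~successors board =
  List.fold_left
    (fun acc (roll, p) ->
      let replies = List.map nn_equity (successors board roll) in
      let worst = List.fold_left min 1.0 replies in
      acc +. (p *. worst))
    0.0 all_rolls
```

Averaging over all 21 rolls, weighted by probability, is what distinguishes this estimator from sampling a single random roll as in plain TD(λ).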
One may view the 1-ply look-ahead version of this algorithm as being obtained from TD(λ) by making the following modifications.
- Set the time-decay parameter λ to zero, i.e. only feed back errors one time step. Tesauro mentions here that most later development of TD-Gammon used this modification.
- Rather than choosing a single roll of the dice at random, consider every possible roll and take the expected value before feeding back the error.
- Train the neural network between games rather than learning online.
This algorithm cannot be used in quite as general a setting as TD(λ) itself: it requires us to have perfect knowledge of the environment, in order to consider every possible roll of the dice. But given that this knowledge is available in backgammon, the algorithm offers a number of advantages.
- The averaging should smooth the training process. Formally, it provides a stronger convergence property: with an idealised function approximator in place of the neural network (such as a lookup table), and assuming that every possible board position is reached infinitely many times, the algorithm converges to a completely accurate equity estimator without the learning rate needing to be lowered (whereas TD(λ) only converges if the learning rate tends to zero).
- It is straightforward to use a different amplification scheme in place of 1-ply look-ahead, such as Monte Carlo tree search or one of its variants. With such a modification the algorithm may be viewed as a simplified version of AlphaZero.
- The training of the neural network is decoupled from the process of generating training examples. This makes it possible to use the training examples more efficiently and to apply techniques such as experience replay to further smooth training. It also simplifies parallelisation of the algorithm.
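For illustration, a replay memory of the kind mentioned above can be sketched in a few lines of OCaml. This is a generic fixed-capacity ring buffer with uniform sampling, not the repository's actual data structure.

```ocaml
(* Toy fixed-capacity replay memory: a ring buffer that overwrites the
   oldest example once full, with uniform random minibatch sampling.
   A generic illustration, not this project's actual implementation. *)
type 'a replay = {
  buf : 'a option array;
  mutable next : int;    (* next slot to overwrite *)
  mutable filled : int;  (* number of examples currently stored *)
}

let create capacity = { buf = Array.make capacity None; next = 0; filled = 0 }

(* Store one training example, evicting the oldest if at capacity. *)
let add r x =
  r.buf.(r.next) <- Some x;
  r.next <- (r.next + 1) mod Array.length r.buf;
  r.filled <- min (r.filled + 1) (Array.length r.buf)

(* Draw a minibatch of [n] examples uniformly at random, with
   replacement. *)
let sample r n =
  List.init n (fun _ ->
      match r.buf.(Random.int r.filled) with
      | Some x -> x
      | None -> assert false)
```

Because the buffer holds positions from many recent games, minibatches mix examples from different games, which decorrelates consecutive updates.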
The implementation uses tensorflow-ocaml.
Training using a handcrafted bot
Tesauro remarks that with random play, backgammon games often last several hundred or even several thousand moves. This threatens to slow down the initial stages of training significantly when using self-play, especially since we only feed back errors by one time step: initially the neural network only learns to improve its evaluations of board positions that are one move away from the end of the game.
Since it is possible to change the amplified equity estimator in our algorithm, we therefore tried using a handcrafted bot as the amplified equity estimator, at least during the initial stages of training. Our handcrafted bot uses 1-ply look-ahead, but instead of using the neural network, it uses the ratio of the players' pip counts as a heuristic evaluation function. This ratio is a very poor estimate of the probability of winning, but with look-ahead it produces a bot that is able to implement simple strategies such as capturing and playing safe, preventing the game from lasting a very long time. We refer to this bot as the "pip count ratio bot".
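To make the heuristic concrete, here is a hedged sketch of a pip count ratio evaluation. The exact functional form used by our bot is not spelled out above, so the formula below (the opponent's pip count as a fraction of the combined total) is just one plausible way to turn the ratio into a number in [0, 1]; the representation of a player's checkers as a list of pip distances is likewise illustrative.

```ocaml
(* Illustrative pip count ratio evaluation. A player's checkers are
   abstracted as a list of pip distances (1-24 for the points, 25 for
   the bar); this representation, and the exact formula below, are
   assumptions for the sketch rather than the bot's actual code. *)

(* A player's pip count: total pips needed to bring every counter home
   and bear it off. *)
let pip_count distances = List.fold_left ( + ) 0 distances

(* Heuristic "equity" for the player on move: the opponent's share of
   the combined pip count, so 0.5 means level and higher is better. *)
let pip_ratio_equity ~own ~opponent =
  let o = float_of_int (pip_count own)
  and p = float_of_int (pip_count opponent) in
  p /. (o +. p)
```

Plugged into 1-ply look-ahead, even a crude evaluation like this rewards hitting blots (which sends an opponent's counter to the bar, raising their pip count) and racing home, which is enough to keep games short.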
We compared three variants of the algorithm over 2,000 training games: using self-play (with 1-ply look-ahead), using only the pip count ratio bot, and using a hybrid method in which the pip count ratio bot was used for the first 500 training games before switching to self-play. This experiment used a neural network architecture very similar to Tesauro's, with a single fully-connected hidden layer of 40 units and sigmoid activation. Only our board encoding differs slightly:
- in the encoding of the number n of a player's counters on a particular point, we replace the condition n = 3 by the condition n ≥ 3, to take advantage of the fact that this indicates that the player will still have a made point after removing a counter;
- the unit specifying the number of counters on the bar is split in two, with one unit specifying whether any counters are on the bar;
- and the unit indicating whose turn it is is removed, thereby removing a symmetry of the original representation.

For training, we use the Adam optimisation algorithm, training on 500 minibatches of size 128 after every 10 games, with a replay memory holding up to 50,000 board positions. For testing, after every 10 training games, 100 games were played between a bot that chooses moves using the neural network (with no look-ahead) and the pip count ratio bot.
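The modified per-point encoding can be sketched as follows. The four-unit layout follows Tesauro's published scheme, with the third unit changed to fire for n ≥ 3 as described above; the scaled fourth unit is carried over from that scheme as an assumption rather than something stated here.

```ocaml
(* Sketch of the per-point input encoding. Four units describe the
   number [n] of a player's counters on a point, following Tesauro's
   scheme except that the third unit fires for n >= 3 rather than
   n = 3. The scaled fourth unit is an assumption carried over from
   that scheme. *)
let encode_point n =
  [ (if n >= 1 then 1.0 else 0.0);
    (if n >= 2 then 1.0 else 0.0);
    (* modified: a made point (n >= 3) survives removing one counter *)
    (if n >= 3 then 1.0 else 0.0);
    (if n > 3 then float_of_int (n - 3) /. 2.0 else 0.0) ]
```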
Here are the results, with moving averages displayed in order to cancel out some of the noise.
As expected, training using the pip count ratio bot very quickly produces something with reasonable performance, but this performance plateaus at 50%, since naturally it cannot outperform the pip count ratio bot itself. The hybrid method clearly improves upon this, while training using self-play starts out much more slowly. However, the self-play method catches up with the hybrid method in under 1,500 training games. Interestingly, in this particular run, after the full 2,000 games, the bot produced using self-play was slightly better than the bot produced using the hybrid method, winning 52,024 games in a 100,000-game head-to-head. While this difference could be put down to noise in the training process, it certainly appears that any early advantage gained by using the hybrid method is at best modest after 2,000 training games. Our explanation for this is that, while the pip count ratio bot is able to make reasonable moves, its equity estimates are very poor, so the hybrid method has a lot of "unlearning" to do after it switches to self-play.
One thing the graph does not show, however, is the length of the early training games and the time saved by keeping them short. The best approach may therefore be to use the pip count ratio bot for a very small number of initial training games (a single game, say), to put the bot on the right track, before switching to self-play. Since the benefit appears to be modest, in our remaining experiments we forgo the pip count ratio bot entirely for the sake of simplicity. Nonetheless, the general idea could be useful in other reinforcement learning settings.
Increasing the size of the neural network
We studied the effect of increasing the size of the neural network on performance. As part of this we introduced an expanded board representation, with additional nodes specifying whether the number of a player's counters on a point is greater than or equal to 4, 5 and 6, and further nodes representing the number of a player's counters on the bar and the number of counters the player has borne off. This expanded representation has a total of 326 nodes rather than 198.
We tested three different neural network architectures:
- with 1 hidden layer of size 40, sigmoid activation, and the same board representation as before;
- with 2 hidden layers of size 80, sigmoid activation, and the expanded board representation;
- and with 5 hidden layers of size 400, ReLU activation, and the expanded board representation.
100 test games were played after every 10 training games between a bot that chooses moves using the neural network and a bot that chooses moves using a fully-trained neural network with the first architecture (both with no look-ahead).
Here are the results, again with moving averages displayed.
The larger architectures clearly performed better, though the training process for the largest architecture was less smooth.
Benchmarking against GNU Backgammon
We tested the largest architecture described in the previous section by pitting a bot that chooses moves using the neural network (with no look-ahead) against GNU Backgammon, a strong open-source program that uses three neural networks and an endgame database. 100 test games were played after every 10 training games.
Here are the results, again with moving averages displayed.
A win rate of around 41% was achieved after around 2,500 training games. With each game lasting around 57 plies on average, this corresponds to an "error rating" (thousandths of expected games given up per move made) of around 3.2 above that of GNU Backgammon, which is about as good as the top human players.
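The arithmetic behind that error rating can be checked directly: a 41% win rate gives up 0.5 − 0.41 = 0.09 games per game relative to a level opponent, and with around 57 plies per game the bot makes roughly 28.5 of the moves. The function below encodes only this back-of-the-envelope approximation, not the exact error-rate computation used by GNU Backgammon.

```ocaml
(* Rough per-move error rating, in thousandths of a game given up per
   move, implied by a win rate and an average game length in plies.
   This approximation is ours, not GNU Backgammon's exact formula. *)
let error_rating ~win_rate ~plies =
  (0.5 -. win_rate) /. (plies /. 2.0) *. 1000.0

(* error_rating ~win_rate:0.41 ~plies:57.0 gives roughly 3.2 *)
```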
It may appear that our bot requires relatively few training games: Tesauro notes that TD-Gammon achieved an intermediate level of play after 200,000 training games. However, our algorithm evaluates a few hundred times as many positions in its 1-ply look-ahead as TD-Gammon did, so the total computational effort of the two algorithms is more comparable than the game counts suggest.
Allowing the bot to use 1-ply look-ahead during play improves the win rate against GNU Backgammon by around 2 percentage points: the number of games won by the bot in a 10,000-game head-to-head against GNU Backgammon increased from 4,255 to 4,469 with this change. The latter win rate of just under 45% corresponds to an error rating of around 1.8 above that of GNU Backgammon.
- A comparison between 1-ply look-ahead and some form of Monte Carlo tree search guided by the output of the neural network
Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press.