Temporal difference learning for ultimate tic-tac-toe.
What is ultimate tic-tac-toe?
It's like tic-tac-toe, but each square of the game contains another game of tic-tac-toe in it! Win small games to claim the squares in the big game. Simple, right? But there is a catch: Whichever small square you pick is the next big square your opponent must play in. Read more...
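The "catch" described above can be sketched in a few lines. This is only an illustration, not code from this repository: the function name `legal_boards` and its arguments are hypothetical, and the rule that a player may choose any open board when sent to a finished one is a common convention that is assumed here rather than stated above.

```python
def legal_boards(last_move_cell, finished_boards):
    """Return the small boards (indexed 0-8) the next player may play in.

    last_move_cell: index 0-8 of the cell the previous player picked inside
        a small board, or None for the first move of the game.
    finished_boards: set of small-board indices already won or drawn.
    """
    open_boards = [b for b in range(9) if b not in finished_boards]

    # First move of the game: no constraint yet.
    if last_move_cell is None:
        return open_boards

    # ASSUMPTION: being sent to a finished board frees the player
    # to choose any board that is still open.
    if last_move_cell in finished_boards:
        return open_boards

    # Core rule: the cell just played selects the opponent's board.
    return [last_move_cell]
```

For example, if the previous player picked the centre cell (index 4) of any small board, the next player must play in the centre small board, as long as that board is still open.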
What is temporal difference learning?
Temporal difference (TD) learning is a reinforcement learning method; in this project the model is trained using only self-play. TD learning bootstraps from the current estimate of the value function, i.e. the value of a state is updated toward the current estimate of the value of the states that follow it. Read more...
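A minimal sketch of the bootstrapped update may help make this concrete. It uses a tabular TD(0) rule on a lookup table rather than a neural network, so it is a simplification and not the training code of this project. The names `td0_update`, `step_size` and `gamma`, and the toy states `s0`, `s1`, `s2`, are all hypothetical.

```python
from collections import defaultdict


def td0_update(values, state, next_state, reward,
               step_size=0.1, gamma=1.0, terminal=False):
    """Apply one TD(0) update to values[state] and return the new estimate.

    The target bootstraps from the current estimate of the next state:
        target = reward + gamma * V(next_state)
    For a terminal transition the target is the reward alone.
    """
    target = reward if terminal else reward + gamma * values[next_state]
    td_error = target - values[state]
    values[state] += step_size * td_error
    return values[state]


# Toy episode s0 -> s1 -> s2, where reaching s2 is a win (reward 1).
values = defaultdict(float)  # unseen states start at 0.0
td0_update(values, "s1", "s2", reward=1.0, terminal=True)
td0_update(values, "s0", "s1", reward=0.0)
```

After these two updates the value of `s1` has moved toward the win, and `s0` has moved toward the new estimate for `s1`, which is the bootstrapping step described above.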
How to use
To begin training with the default hyperparameters:
python train.py
or set the learning hyperparameters using any of the optional arguments:
python train.py --lr LEARN_RATE --a ALPHA --e EPSILON
You can play against a trained model using:
python player.py --params path/to/parameters.params
If no parameters are provided, the opponent will make moves randomly.
To do
- Scale the value of terminal results by the game length to prefer shorter games.
- Implement the UT3 neural network in other frameworks, e.g. TensorFlow.
- Make training asynchronous, i.e. run self-play, neural net training and model comparison in parallel.
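
One possible way to scale terminal results by game length, as suggested in the first item above, is a linear discount on the number of moves played. This is a sketch under stated assumptions, not an implemented feature: the function name `scaled_terminal_value`, the linear schedule and the `min_scale` floor are all hypothetical choices. The upper bound of 81 moves comes from the 9 x 9 cells of the board.

```python
def scaled_terminal_value(result, num_moves, max_moves=81, min_scale=0.5):
    """Scale a terminal result so that shorter games carry more weight.

    result: +1 for a win, 0 for a draw, -1 for a loss.
    num_moves: number of moves played when the game ended.
    max_moves: upper bound on game length (81 cells on the board).
    min_scale: ASSUMPTION, the weight given to a game of maximum length.
    """
    if not 0 <= num_moves <= max_moves:
        raise ValueError("num_moves must be between 0 and max_moves")

    # Linear interpolation from 1.0 (shortest) down to min_scale (longest).
    scale = 1.0 - (1.0 - min_scale) * (num_moves / max_moves)
    return result * scale
```

With this schedule a quick win is worth more than a slow win, and a quick loss is penalised more than a slow loss, so the learner is nudged both to finish won games early and to prolong lost ones.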