In [1]:
!pip uninstall mcts -y
!pip install .

Uninstalling mcts-0.4:
  Successfully uninstalled mcts-0.4
Processing /notebooks
Building wheels for collected packages: mcts
  Running setup.py bdist_wheel for mcts ... done
  Stored in directory: /tmp/pip-ephem-wheel-cache-dk8hca1w/wheels/b6/62/b1/600ed0c11030d88f67fd6813772ff38d9f0a25ea8277435239
Successfully built mcts
Installing collected packages: mcts
Successfully installed mcts-0.4


# Step 1: Build Environment

In [2]:
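from mcts.environments import TicTacToe, DotsAndBoxes
env = DotsAndBoxes()
# The later cells refer to `ttt`; TicTacToe is assumed to take no arguments, like DotsAndBoxes
ttt = TicTacToe()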

In [3]:
env.board()

.    .    .    .    .
                    
.    .    .    .    .
                    
.    .    .    .    .
                    
.    .    .    .    .
                    
.    .    .    .    .
Player 1 Score is 0
Player 2 Score is 0



# Step 2: Build Neural Network
I've built some utility scripts to aid in this. All that's required for a working model is to have both a policy output and a value output. We'll use the `load_zeronet` utility to load a neural-net architecture similar to the AlphaGo Zero architecture.

In [5]:
import tensorflow as tf
import keras.backend as K
from keras.models import load_model
from mcts.nn.utils import load_zeronet

from mcts.nn.model import Model
keras_model = load_zeronet(env.state.shape, env.action_space, lr=0.001, residual_layers=2)
mcts_model = Model(keras_model) # Takes a Keras/TF Model

pretrained_model = load_model('models/dotsandboxes/model4')
mcts_model_pretrained = Model(pretrained_model)


Using TensorFlow backend.





# Step 3: Configuring Policies
There are several policy types to choose from, but only a few are required for MCTS to run:
1. Selection - The policy that chooses an action during the selection phase of MCTS.
2. Expansion - The policy that expands a leaf node in MCTS.
3. Update - The policy that determines how nodes are updated at the end of an MCTS iteration.
4. Action - The policy that chooses which action to play based on the results of MCTS.
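
To make the selection phase a bit more concrete, here is a minimal sketch of a PUCT-style score in the spirit of AlphaGo Zero. This is not the library's implementation: the function names, the `children` layout and the exact form of the exploration term are assumptions made for illustration only.

```python
import math

def puct_score(q, prior, parent_visits, child_visits, c=1.14):
    """Mean value plus an exploration bonus weighted by the network prior."""
    exploration = c * prior * math.sqrt(parent_visits) / (1 + child_visits)
    return q + exploration

def select_action(children, c=1.14):
    """Return the action with the highest PUCT score.

    `children` is a hypothetical layout: action -> {'q', 'prior', 'visits'}.
    """
    parent_visits = sum(child['visits'] for child in children.values())
    return max(
        children,
        key=lambda a: puct_score(children[a]['q'], children[a]['prior'],
                                 parent_visits, children[a]['visits'], c),
    )

children = {
    0: {'q': 0.5, 'prior': 0.2, 'visits': 10},  # well explored, decent value
    1: {'q': 0.1, 'prior': 0.7, 'visits': 0},   # unexplored, high prior
}
chosen = select_action(children)
```

With these numbers the unexplored action with the high prior gets the larger score, which is the point of the exploration term: a larger `c` pushes the search toward moves the network likes but has not tried yet.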

Building a config is pretty straightforward! Just use a JSON-like dictionary. You can check the supported policy types with `mcts.SUPPORTED_POLICY_TYPES`, mentioned under "Building the MCTS" below.

A simple config dictionary is shown below. If you want to pass keyword arguments, which some policies take, just add a key made of the policy type plus `_kwargs` (for example `selection_kwargs`) and put the keyword arguments in a dictionary.

In [6]:
config = {
    'model' : mcts_model,
    'action' : 'proportional-to-visit-count',
    'selection' : 'puct',
    'selection_kwargs' : {'C' : 1.14},
    'expansion' : 'neural',
    'update' : 'value'
}
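
If the `_kwargs` convention feels a bit magical, the sketch below shows one way such a config could be split into policy names and keyword arguments. It is only an illustration: `POLICY_TYPES` and `split_config` are hypothetical names, and the library may resolve its config differently.

```python
# Hypothetical list of policy types, taken from the four listed above
POLICY_TYPES = ('selection', 'expansion', 'update', 'action')

def split_config(config):
    """Return {policy_type: (policy_name, kwargs)} for each type present."""
    resolved = {}
    for policy_type in POLICY_TYPES:
        if policy_type in config:
            # Keyword arguments live under '<type>_kwargs' and are optional
            kwargs = config.get(policy_type + '_kwargs', {})
            resolved[policy_type] = (config[policy_type], dict(kwargs))
    return resolved

example = {
    'action': 'proportional-to-visit-count',
    'selection': 'puct',
    'selection_kwargs': {'C': 1.14},
    'expansion': 'neural',
    'update': 'value',
}
resolved = split_config(example)
```

Here `resolved['selection']` pairs the name `'puct'` with `{'C': 1.14}`, while the other policy types get an empty kwargs dictionary.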

### Building the MCTS

If you don't care about actually training your model, then you can build the MCTS directly from a config dictionary. Just specify the policy _type_ as the key and the policy (here, its name) as the value.
You can check the supported policy types by using `mcts.SUPPORTED_POLICY_TYPES`.

In [8]:
from mcts.mcts import MCTS

m = MCTS(ttt, calculation_time=3)
m.build(config)

In [16]:
m.act()
ttt.board()

array([[ 1., -1.,  1.],
       [ 1., -1., -1.],
       [-1.,  1.,  0.]])

# Step 4: Building the Replay Table, Trainer, Evaluator and Terminal Callback
However, we don't have a pretrained neural net. In order to _train_ the neural net, we'll need some extra classes.
1. A Replay Table - This is just data storage for our training data.
2. An Evaluator - This class lets us pit old models against new models in a tournament. This is how we determine if the model we're training is ready to take over in guiding the MCTS.
3. A Trainer - This class handles the legwork in actually training the neural net.

The trainer we'll be using is `StagedModelTrainer` - this will load game results into a replay table and, once a certain number of games have been reached, train the model and evaluate it.

### The Replay Table
The replay table stores the training data. In order to format itself efficiently, it needs the dimensions of the state space and action space. 

In [10]:
from mcts.nn.replay import BasicReplay
from keras.callbacks import TensorBoard
replay = BasicReplay(ttt.state.shape, ttt.action_space, capacity=10000)
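
To show what a capacity-limited replay table does conceptually, here is a small stand-in built on `collections.deque`. It is not `BasicReplay`: the class name `TinyReplay` and its methods are hypothetical, and it skips the shape-aware storage that the real table uses.

```python
from collections import deque
import random

class TinyReplay:
    """Fixed-capacity store of (state, policy, value) training samples."""

    def __init__(self, capacity):
        # A deque with maxlen silently drops the oldest item when full
        self.buffer = deque(maxlen=capacity)

    def add(self, state, policy, value):
        self.buffer.append((state, policy, value))

    def sample(self, batch_size, rng=random):
        # Never ask for more samples than are stored
        k = min(batch_size, len(self.buffer))
        return rng.sample(list(self.buffer), k)

    def __len__(self):
        return len(self.buffer)

table = TinyReplay(capacity=3)
for i in range(5):
    table.add(state=i, policy=[1.0], value=0.0)
batch = table.sample(2)
```

After five inserts into a table of capacity three, only the three most recent samples remain, so older self-play data ages out as new games come in.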

You can save a replay table to a file in its current state by using the `save()` method. This comes in handy if you want to keep all the data your MCTS generates.

In [11]:
replay.save('replay/tictactoe/test')

You can load the saved replay table by using the `load_replay` function.

In [12]:
from mcts.nn.replay import load_replay
replay2 = load_replay('replay/tictactoe/test')

### The Evaluator
The evaluator is used to determine if one MCTS model is better than another. The `NNEvaluator` specifically runs a tournament between two MCTS trees that are identical except for the neural network each one uses.
To instantiate the evaluator, we only need the config dictionary for the MCTS tree.

We'll just use the `most-visited` action policy here for demonstration. This action policy simply chooses the action that has been explored the most.

In [13]:
from mcts.policies.action import MostVisited
from mcts.evaluators import NNEvaluator

evaluation_config = {
    'model' : mcts_model,
    'selection' : 'puct',
    'expansion' : 'neural',
    'update' : 'value',
    'action' : 'most-visited'
}
    
evaluator = NNEvaluator(ttt, evaluation_config)
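
The two action policies used in this notebook differ only in how they turn visit counts into a move. The sketch below illustrates that difference under the assumption that the search exposes a plain `{action: visit_count}` dictionary; the function names are hypothetical and this is not the library's code.

```python
import random

def most_visited(visit_counts):
    """Greedy choice: the action the search explored the most."""
    return max(visit_counts, key=visit_counts.get)

def proportional_to_visit_count(visit_counts, rng=random):
    """Stochastic choice: sample an action with probability ~ its visits."""
    actions = list(visit_counts)
    weights = [visit_counts[a] for a in actions]
    # random.choices requires Python 3.6 or newer
    return rng.choices(actions, weights=weights, k=1)[0]

counts = {0: 3, 1: 10, 2: 1}
greedy = most_visited(counts)
sampled = proportional_to_visit_count(counts)
```

Sampling keeps some exploration in the generated games, while the greedy choice is the natural fit for an evaluation tournament where we want each model to play its strongest move.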

If we ever want to run the evaluator manually (rather than let a terminal policy handle it), we can simply use the `.evaluate()` method. The NNEvaluator takes `incumbent_model` and `challenger_model`. We'll just test this briefly using the exact same model to see how it works.

In [14]:
# results = evaluator.evaluate(mcts_model, mcts_model, games=1)
# results.winner
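
For intuition, a tournament between an incumbent and a challenger can be sketched in a few lines. Everything here is an assumption for illustration: `run_tournament` and `play_game` are hypothetical, the return value is a plain tuple rather than the evaluator's result object, and the 0.55 win threshold is borrowed from AlphaGo Zero rather than taken from this library.

```python
def run_tournament(play_game, games=10, win_threshold=0.55):
    """Play `games` matches and decide which model should guide the search.

    `play_game(challenger_first=...)` is a caller-supplied callable returning
    1 if the challenger wins, -1 if the incumbent wins and 0 for a draw.
    """
    challenger_wins = 0
    incumbent_wins = 0
    for i in range(games):
        # Alternate who moves first so neither side keeps the first-move edge
        outcome = play_game(challenger_first=(i % 2 == 0))
        if outcome > 0:
            challenger_wins += 1
        elif outcome < 0:
            incumbent_wins += 1
    decisive = challenger_wins + incumbent_wins
    win_rate = challenger_wins / decisive if decisive else 0.0
    winner = "challenger" if win_rate >= win_threshold else "incumbent"
    return winner, win_rate

winner, win_rate = run_tournament(lambda challenger_first: 1, games=4)
```

Draws are ignored when computing the win rate, and a tournament with no decisive games keeps the incumbent, so the model only changes when the challenger has actually shown it is better.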

### The Trainer
The trainer is the thing that actually allows you to train a neural net with MCTS. To instantiate it, we require:
1. The environment
2. The config for our mcts (including the model)
3. The replay table
4. The evaluator
5. Any Keras Callbacks that we want. We'll use tensorboard here. (optional)
6. A model directory. The staged model trainer will save our model every time it gets updated. If no model directory is specified, then it just won't save the model. (optional)
7. A replay directory. The trainer will save the replay table to this directory at the end of every "data generation" stage. (optional)


In [16]:
# For the terminal policy
from mcts.nn.trainers import StagedModelTrainer
from keras.callbacks import TensorBoard

tensorboard = TensorBoard(log_dir='logs/demo', histogram_freq=1, write_grads=True, batch_size=16)
trainer = StagedModelTrainer(ttt, config, replay, evaluator, 
                             callbacks=[tensorboard], 
                             model_dir='models/tictactoe',
                             replay_dir='replay/tictactoe')

# Step 5: Initiate Self-Play
You can simply use the `trainer.train()` method. Just set the number of games you want to play and it'll do the rest!

In [17]:
trainer.train(epochs=2, generation_steps=10, training_steps=10, evaluation_steps=1)

[1531897477.6278813][localhost][/usr/local/lib/python3.5/dist-packages/ipykernel_launcher.py][StagedModelTrainer][INFO] Starting epoch 0
[1531897477.6289308][localhost][/usr/local/lib/python3.5/dist-packages/ipykernel_launcher.py][StagedModelTrainer][INFO] Entering Generation Phase
[1531897477.62938][localhost][/usr/local/lib/python3.5/dist-packages/ipykernel_launcher.py][StagedModelTrainer][INFO] Playing Generation Game 0
[1531897486.642571][localhost][/usr/local/lib/python3.5/dist-packages/ipykernel_launcher.py][StagedModelTrainer][INFO] Playing Generation Game 1
[1531897495.6545496][localhost][/usr/local/lib/python3.5/dist-packages/ipykernel_launcher.py][StagedModelTrainer][INFO] Playing Generation Game 2
[1531897502.6616416][localhost][/usr/local/lib/python3.5/dist-packages/ipykernel_launcher.py][StagedModelTrainer][INFO] Playing Generation Game 3
[1531897511.6705499][localhost][/usr/local/lib/python3.5/dist-p

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


[1531897587.432838][localhost][/usr/local/lib/python3.5/dist-packages/ipykernel_launcher.py][StagedModelTrainer][INFO] Entering Evaluation Phase
[1531897603.4395761][localhost][/usr/local/lib/python3.5/dist-packages/ipykernel_launcher.py][StagedModelTrainer][INFO] Challenger model wins - updating model...
[1531897604.0004056][localhost][/usr/local/lib/python3.5/dist-packages/ipykernel_launcher.py][StagedModelTrainer][INFO] Saving Model to models/tictactoe/model0
[1531897605.7654622][localhost][/usr/local/lib/python3.5/dist-packages/ipykernel_launcher.py][StagedModelTrainer][INFO] Starting epoch 1
[1531897605.7660315][localhost][/usr/local/lib/python3.5/dist-packages/ipykernel_launcher.py][StagedModelTrainer][INFO] Entering Generation Phase
[1531897605.7664115][localhost][/usr/local/lib/python3.5/dist-packages/ipykernel_launcher.py][StagedModelTrainer][INFO] Playing Generation Game 0
[1531897612.774538][localhost][

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


[1531897695.9104323][localhost][/usr/local/lib/python3.5/dist-packages/ipykernel_launcher.py][StagedModelTrainer][INFO] Entering Evaluation Phase
[1531897713.9195662][localhost][/usr/local/lib/python3.5/dist-packages/ipykernel_launcher.py][StagedModelTrainer][INFO] Challenger model wins - updating model...
[1531897713.9322832][localhost][/usr/local/lib/python3.5/dist-packages/ipykernel_launcher.py][StagedModelTrainer][INFO] Saving Model to models/tictactoe/model1
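
The log above follows a generate, train, evaluate cycle for each epoch. Here is a rough sketch of that staged loop, reconstructed only from the log messages. The function `staged_training` and its three callables are hypothetical stand-ins, and the real `StagedModelTrainer` may interpret its step arguments differently.

```python
def staged_training(generate_game, fit, evaluate,
                    epochs, generation_steps, training_steps, evaluation_steps):
    """Run the generate -> train -> evaluate cycle for `epochs` rounds.

    All three callables are hypothetical stand-ins supplied by the caller.
    """
    replay = []      # stand-in for the replay table
    promotions = []  # one bool per epoch: did the challenger win?
    for _epoch in range(epochs):
        # Generation phase: self-play games add training samples
        for _ in range(generation_steps):
            replay.extend(generate_game())
        # Training phase: fit a challenger on the collected data
        challenger = fit(replay, training_steps)
        # Evaluation phase: promote the challenger only if it wins
        promotions.append(bool(evaluate(challenger, evaluation_steps)))
    return promotions, len(replay)

promotions, samples = staged_training(
    generate_game=lambda: [("state", "policy", 0.0)],
    fit=lambda data, steps: "challenger",
    evaluate=lambda model, games: True,
    epochs=2, generation_steps=10, training_steps=10, evaluation_steps=1,
)
```

With the same numbers as the `trainer.train(...)` call, this toy loop plays 20 generation games in total and runs one evaluation per epoch, matching the shape of the log.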


In [6]:
good_model = load_model('models/dotsandboxes/model4')
good_mcts_model = Model(good_model)
keras_model = load_zeronet(env.state.shape, env.action_space, lr=0.001, residual_layers=4)
random_model = Model(keras_model) # Takes a Keras/TF Model

good_config = {
    'model' : good_mcts_model,
    'selection' : 'puct',
    'expansion' : 'neural',
    'update' : 'value',
    'action' : 'most-visited'
}

bad_config = {
    'model' : random_model,
    'selection' : 'puct',
    'expansion' : 'neural',
    'update' : 'value',
    'action' : 'most-visited'
}

In [13]:
from mcts.mcts import MCTS
good_player = MCTS(env, calculation_time=1)
bad_player = MCTS(env, calculation_time=1)

good_player.build(good_config)
bad_player.build(bad_config)

In [19]:
players = {
    1: bad_player,
    2: good_player
}

In [28]:
env.reset()
env.board()
while not env.terminal:
    # Look up the MCTS player whose turn it is
    m = players[env.player]
    node = m.tree.get_by_state(env.state)
    # Index 1 of the prediction is taken to be the value output
    v = m.expand.model.predict_from_node(node)[1]

    print("Value: {}".format(v[0][0]))
    m.act()
    env.board()

.    .    .    .    .
                    
.    .    .    .    .
                    
.    .    .    .    .
                    
.    .    .    .    .
                    
.    .    .    .    .
Player 1 Score is 0
Player 2 Score is 0

Value: 0.016941165551543236
.----.    .    .    .
                    
.    .    .    .    .
                    
.    .    .    .    .
                    
.    .    .    .    .
                    
.    .    .    .    .
Player 1 Score is 0
Player 2 Score is 0

Value: -0.2530079483985901
.----.    .    .    .
                    
.    .    .    .    .
                    
.    .    .    .    .
                    |    
.    .    .    .    .
                    
.    .    .    .    .
Player 1 Score is 0
Player 2 Score is 0

Value: 0.04569195955991745
.----.    .    .    .
                    
.    .    .    .    .
                    
.    .----.    .    .
                    |    
.    .    .    .    .
                    
.    .    .    .    .
Player 1 

.----.    .----.----.
|    |         |    |    
.    .    .----.----.
|    |    |    |    
.----.----.----.----.
|                   |    
.----.----.    .----.
|    |    |    |    |    
.____.____.____.____.
Player 1 Score is 4
Player 2 Score is 1

Value: 0.03987700864672661
.----.    .----.----.
|    |         |    |    
.    .    .----.----.
|    |    |    |    
.----.----.----.----.
|    |              |    
.----.----.    .----.
|    |    |    |    |    
.____.____.____.____.
Player 1 Score is 5
Player 2 Score is 1

Value: 0.021535372361540794
.----.    .----.----.
|    |         |    |    
.    .----.----.----.
|    |    |    |    
.----.----.----.----.
|    |              |    
.----.----.    .----.
|    |    |    |    |    
.____.____.____.____.
Player 1 Score is 6
Player 2 Score is 1

Value: 0.018151627853512764
.----.    .----.----.
|    |         |    |    
.    .----.----.----.
|    |    |    |    |    
.----.----.----.----.
|    |              |    
.----.----.    .----.
|

In [23]:
env.player

1