## Using the PV-MCTS method with a "perfect" copy of the environment in the stochastic setting

Assuming the implementation is correct (still to be checked), the most likely failure mode is planning in hindsight of the wrong event. The search samples one transition, treats it as the only possible one, and finds the best action in the resulting determinization of the environment, which is what re-planning in hindsight would do if that transition had actually happened. The problem is that the plan takes for granted an event that might not happen, and it accounts for neither the other possible outcomes nor the uncertainty over them.
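To make this failure mode concrete, here is a minimal sketch on a toy one-step problem. It is not taken from the `mcts` module: the `OUTCOMES` table, the action names and both planner functions are hypothetical and exist only for illustration. One planner samples a single outcome per action and plans as if that outcome were certain, the other plans on the expected reward.

```python
import random

# Hypothetical one-step problem: each action maps to a list of (probability, reward).
OUTCOMES = {
    "risky": [(0.3, 1.0), (0.7, -1.0)],
    "safe": [(1.0, 0.2)],
}

def sample_reward(action, rng):
    """Draw one reward for `action` from its outcome distribution."""
    u = rng.random()
    cumulative = 0.0
    for prob, reward in OUTCOMES[action]:
        cumulative += prob
        if u < cumulative:
            return reward
    return OUTCOMES[action][-1][1]  # guard against floating-point rounding

def expected_reward(action):
    """Exact expected reward of `action`."""
    return sum(prob * reward for prob, reward in OUTCOMES[action])

def plan_on_single_sample(rng):
    """Sample one outcome per action, then plan as if it were certain."""
    sampled = {action: sample_reward(action, rng) for action in OUTCOMES}
    return max(sampled, key=sampled.get)

def plan_on_expectation():
    """Plan on the expected reward, taking every outcome into account."""
    return max(OUTCOMES, key=expected_reward)

rng = random.Random(0)
n_trials = 10_000
risky_picks = sum(plan_on_single_sample(rng) == "risky" for _ in range(n_trials))
risky_fraction = risky_picks / n_trials

print("expectation planner picks:", plan_on_expectation())
print("single-sample planner picks 'risky' in fraction:", risky_fraction)
```

In this toy problem the expectation planner always prefers `safe` (expected reward 0.2 against -0.4), while the single-sample planner picks `risky` whenever the lucky outcome happens to be sampled, which is roughly 30% of the time.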

TODO:
- load stochastic environment - ok
- re-use code from notebook 9 to show that the simulated transitions can differ from those actually realized in the environment - ok
- run the script run_PVMCTS_v2 with plausible parameters for a baseline and the option --game_name groups_simple

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
import time
import copy
# custom imports
import utils
import train
import mcts
from rtfm import featurizer as X
import os
from torch import multiprocessing as mp
import random

Using device cuda:0


In [2]:
# Only check that the main logic of the training loop works
ucb_C = 1.0
discount = 0.9 # try with smaller discount
episode_length = 32
max_actions = 20
num_simulations = 10
num_trees = 3
#device = mcts.device
n_episodes = 4000
memory_size = 1024
batch_size = 32
n_steps = 5
tau = 0.1 # new_trg_params = (1-tau)*old_trg_params + tau*value_net_params
dir_noise = False
dirichlet_alpha = 0.5 # no real reason to choose this value, except it's < 1
exploration_fraction = 0.25
temperature = 1.
full_cross_entropy = True
entropy_bonus = True
entropy_weight = 1e-2
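The comment next to `tau` describes a soft update of the target parameters. As a minimal sketch of that formula only, using plain Python lists instead of network parameters (the `soft_update` helper is hypothetical; how the `train` module applies the update is not shown in this notebook):

```python
def soft_update(target_params, online_params, tau):
    """Return (1 - tau) * target + tau * online, element by element."""
    return [(1.0 - tau) * t + tau * o for t, o in zip(target_params, online_params)]

# Toy parameters standing in for the target and value networks (assumption).
target = [0.0, 1.0]
online = [1.0, 1.0]

for _ in range(3):
    target = soft_update(target, online, tau=0.1)

print(target)
```

With `tau = 0.1` the target moves 10% of the remaining distance towards the online parameters at every update, so after `k` updates the gap has shrunk by a factor of `0.9 ** k`.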

In [3]:
flags = utils.Flags(env="rtfm:groups_simple-v0")
gym_env = utils.create_env(flags)
featurizer = X.Render()
env = mcts.FullTrueSimulator(gym_env, featurizer)
object_ids = utils.get_object_ids_dict(env)
pv_net = mcts.DiscreteSupportPVNet_v3(gym_env)

In [4]:
# run a short MCTS in debug mode, look at the transition simulated for a particular action
# and then look at the same transition when the action is played in the real env

In [15]:
random.seed(1) 
frame, valid_actions = env.reset()
env.render()  # with the seed fixed, this frame should be the same even if the notebook is restarted


██████
█ @!n█
█    █
█    █
█?  y█
██████



In [16]:
def run_mcts_step(debug_render=True):
    # Build a search tree rooted at the current frame (uses the notebook-level globals)
    tree = mcts.PV_MCTS(
        frame,
        env,
        valid_actions,
        ucb_C,
        discount,
        max_actions,
        pv_net,
        render=debug_render,
        ucb_method='p-UCT-AlphaGo',
    )

    # Run the simulations from the root in prediction mode
    root, info = tree.run(
        num_simulations,
        mode='predict',
        dir_noise=dir_noise,
        dirichlet_alpha=dirichlet_alpha,
        exploration_fraction=exploration_fraction,
    )

    return tree, root

In [17]:
tree, root = run_mcts_step()

Valid actions as child:  [0 2 3 4]
Prior over the children:  [0.16766171 0.         0.24954093 0.31134227 0.27145508]
Weights over the children:  [0.2059684  0.         0.25127804 0.28067434 0.26207924]
Terminal node:  False

Simulation 1 started.

██████
█ @!n█
█    █
█    █
█?  y█
██████

actions:  [0 2 3 4]
value_terms:  [0 0 0 0]
exploration_terms:  [0.16766171 0.24954093 0.31134227 0.27145508]
ucb_values:  [0.16766171 0.24954093 0.31134227 0.27145508]
max_U:  0.3113422691822052
mask:  [False False  True False]
best_actions:  [3]
Current tree depth:  1
Action selected:  3 Left
Child node terminal:  False
Child node expanded:  False
Expansion phase started
valid_actions:  [0 2 4]
prior:  [0.23154376 0.         0.36660463 0.         0.40185165]
reward:  0
done:  False
Valid actions as child:  [0 2 4]
Prior over the children:  [0.23154376 0.         0.36660463 0.         0.40185165]
Weights over the children:  [0.27966624 0.         0.3519026  0.         0.36843118]
Terminal node:  Fa

actions:  [0 2 3 4]
value_terms:  [-1.         -0.22924642 -0.4527414  -1.        ]
exploration_terms:  [0.25149257 0.14972456 0.31134227 0.40718262]
ucb_values:  [-0.74850743 -0.07952186 -0.14139913 -0.59281738]
max_U:  -0.0795218610484153
mask:  [False  True False False]
best_actions:  [2]

██████
█ ! n█
█ @  █
█    █
█ ? y█
██████

Current tree depth:  1
Action selected:  2 Down
Child node terminal:  False
Child node expanded:  True
actions:  [0 1 2 3 4]
value_terms:  [ 0.00000000e+00  0.00000000e+00 -1.00000000e+00 -4.14535403e-04
 -1.14536544e-02]
exploration_terms:  [0.27364466 0.30874759 0.21983489 0.26666692 0.22230206]
ucb_values:  [ 0.27364466  0.30874759 -0.78016511  0.26625238  0.21084841]
max_U:  0.3087475895881653
mask:  [False  True False False False]
best_actions:  [1]
Current tree depth:  2
Action selected:  1 Up
Child node terminal:  False
Child node expanded:  False
Expansion phase started
valid_actions:  [0 1 2 3 4]
prior:  [0.13641916 0.15356545 0.2195319  0.26148
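The selection step in the log above can be checked by hand. In every logged block `ucb_values` equals `value_terms + exploration_terms`, and the exploration terms are consistent with the AlphaGo-style rule `c * P * sqrt(N_parent) / (1 + N_child)` with `c = ucb_C = 1.0`. The sketch below reproduces the second root-level selection. The visit counts are an assumption: they are not printed in the log and were chosen because they reproduce the logged `exploration_terms`. The `puct_scores` helper is hypothetical and is not the code of `mcts.PV_MCTS`.

```python
import math

def puct_scores(q_values, priors, child_visits, parent_visits, c=1.0):
    """Return Q + c * P * sqrt(N_parent) / (1 + N_child) for each child."""
    sqrt_n = math.sqrt(parent_visits)
    return [q + c * p * sqrt_n / (1 + n)
            for q, p, n in zip(q_values, priors, child_visits)]

# Numbers copied from the log above (root node, valid actions only).
actions = [0, 2, 3, 4]
priors = [0.16766171, 0.24954093, 0.31134227, 0.27145508]
value_terms = [-1.0, -0.22924642, -0.4527414, -1.0]

# Assumed visit counts: not printed in the log, inferred from exploration_terms.
child_visits = [1, 4, 2, 1]
parent_visits = 9

scores = puct_scores(value_terms, priors, child_visits, parent_visits, c=1.0)
best_index = max(range(len(scores)), key=scores.__getitem__)
best_action = actions[best_index]

print("ucb_values: ", [round(s, 8) for s in scores])
print("best action:", best_action)

# First simulation: nothing visited yet, so the scores reduce to the priors.
first_scores = puct_scores([0.0] * 4, priors, [0, 0, 0, 0], parent_visits=1)
```

Under these assumptions the best action is 2 (Down), as in the log, and in the first simulation the scores reduce to the priors, which is why action 3 (Left) is selected there.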

In [18]:
# let's look at the transition for the Left action (3)
env.step(3)
env.render()  # this transition is clearly different from the one used in the MCTS internal simulation


██████
█@! n█
█    █
█    █
█ ? y█
██████

