# Possible bugs

Two things are unconvincing:
- the UCT formula with prior: it doesn't seem to explore enough; also investigate the effect of the default Q value (e.g. 0 vs. `parent.value`)
- the entropy bonus: the flag is set to true, but no effect is observed, not even for h=0.1

## UCT formula

**Old code**:
```python
def ucb_score(self, parent, child, eps=1e-3):
    """
    The score for a node is based on its value, plus an exploration bonus.
    """
    exploration_term = self.ucb_c*np.sqrt(np.log(parent.visit_count)/(child.visit_count+eps))

    if child.visit_count > 0:
        # Mean value Q
        value_term = child.reward + self.discount*child.value() 
    else:
        value_term = 0

    return value_term + exploration_term, value_term, exploration_term
```

**Formula from Kocsis-Szepesvári-2006**:
$$
a_t = \arg \underset{a}{\max} \left[Q(s,a) + \sqrt{\frac{ 2\ln N_{s}}{N_{s,a}}}\right]
$$

Here it is assumed that every action has to be explored at least once before the Q values are considered (in other words, for all actions with $N_{s,a}=0$ the exploration bonus (or bias term) is infinite, and we need to sample one of them randomly).

I can't find any mention of what the default Q value of an action should be before it is explored, but given that we first explore every action once and only then start using the formula, the default value has no influence on the allocation of visit counts.
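As a minimal sketch of that selection rule (not the code used in this notebook; `uct_select` and its arguments are hypothetical names, and $N_s$ is assumed to be the sum of the child visit counts):

```python
import math
import random


def uct_select(q_values, visit_counts, rng=random):
    """Return the index of the action chosen by plain UCT.

    Unvisited actions have an infinite bonus, so one of them is drawn at
    random before any Q value is looked at.
    """
    unvisited = [a for a, n in enumerate(visit_counts) if n == 0]
    if unvisited:
        # The default Q value never enters the comparison here.
        return rng.choice(unvisited)

    total_visits = sum(visit_counts)  # assumed to play the role of N_s
    scores = [
        q + math.sqrt(2.0 * math.log(total_visits) / n)
        for q, n in zip(q_values, visit_counts)
    ]
    return max(range(len(scores)), key=scores.__getitem__)
```

With this rule the default Q only matters if an implementation replaces the infinite bonus with a finite one, which is the case examined next.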

**What I actually implemented**:
$$
a_t = \arg \underset{a}{\max} \left[Q(s,a) + c \sqrt{\frac{\ln N_{s}}{N_{s,a} + \epsilon}}\right]
$$

where $c=1$, $\epsilon=10^{-3}$. This formula does what it's supposed to, except that the exploration term lacks a factor $\sqrt{2}$ (so the original formula would explore more than my implementation).
Q values are defaulted at 0 when the action still has to be explored, but for an action space of at most 5 and the selected value of $\epsilon$, the UCB value of unexplored actions will be:
- 0 at the beginning (but it's the same for all actions)
- 26.32768848 after the first action has been selected
- 33.14532077 after the second action has been selected
- 37.23297411 after the third action has been selected
- 40.11780044 after the fourth action has been selected

Since these values are all much bigger than 1 (maximum Q value), the default initialization of the Qs does not change the exploration process.
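The numbers in the list above can be reproduced with a few lines. This is only a sketch: `unexplored_bonus` is a hypothetical helper, and it assumes the parent visit count starts at 1 and grows by one per selection.

```python
import numpy as np


def unexplored_bonus(parent_visits, c=1.0, eps=1e-3):
    """Exploration term of an action with visit_count == 0."""
    return c * np.sqrt(np.log(parent_visits) / eps)


# Parent visit counts 1..5 cover an action space of at most 5 actions.
bonuses = [unexplored_bonus(n) for n in range(1, 6)]
for n, b in zip(range(1, 6), bonuses):
    print(f"parent visits = {n}: bonus = {b:.8f}")
```

An already visited action has a bonus of at most $\sqrt{\ln 5 / 1.001} \approx 1.27$ in the same range, so even with $Q=1$ it cannot overtake an unexplored one.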


## p-UCT formulas

**Old code**:

```python
def ucb_score(self, parent, child):
    """
    The score for a node is based on its value, plus an exploration bonus.
    """
    exploration_term = self.ucb_c*child.prior*np.sqrt(np.log(parent.visit_count)/(child.visit_count+1))

    if child.visit_count > 0:
        # Mean value Q
        value_term = child.reward + self.discount*child.value() 
    else:
        value_term = 0

    return value_term + exploration_term, value_term, exploration_term
```

**Formula from Rosin-2012** (reference from AlphaGo when they talk about their own exploration bonus):
$$
a_t = \arg \underset{a}{\max} \left[Q(s,a) + c(N_s, N_{s,a}) - m(N_s, a) \right]
$$
where:
$$
c(N_s, N_{s,a}) =
\begin{cases}
\sqrt{\frac{3 \log N_s}{2 N_{s,a}}} & \text{if}~ N_{s,a} > 0 \\
0 & \text{otherwise}
\end{cases}
$$

$$
m(N_{s}, a) = 
\begin{cases}
\frac{2}{M(s,a)}\sqrt{\frac{\ln N_s}{N_s}} & \text{if}~ N_s > 1 \\
\frac{2}{M(s,a)} & \text{otherwise}
\end{cases}
$$

Where $M(s,a) = \frac{\sqrt{P(s,a)}}{\sum_b \sqrt{P(s,b)}}$ (Sec. 3.2 of the paper).

Also $Q(s,a)$ is defaulted to 1 (max possible payoff) if $N_{s,a}=0$.

**Formula from AlphaGo:**
$$
a_t = \arg \underset{a}{\max} \left[Q(s,a) + c P(s,a) \frac{\sqrt{\sum_b N_{s,b}}}{1 + N_{s,a}} \right]
$$

I still have to find out what value $Q(s,a)$ defaults to when $N_{s,a}=0$.

**What I actually implemented:**
$$
a_t = \arg \underset{a}{\max} \left[Q(s,a) + c P(s,a) \sqrt{\frac{\ln \sum_b N_{s,b}}{1 + N_{s,a}}} \right]
$$

Also $Q(s,a)$ is defaulted to 0 (payoffs are in interval [-1,1]) if $N_{s,a}=0$.
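To get a feeling for how the two exploration bonuses differ, here is a small sketch that evaluates both for the same statistics. The function names are hypothetical, and `parent_visits` stands for $\sum_b N_{s,b}$.

```python
import numpy as np


def bonus_implemented(prior, parent_visits, child_visits, c=1.0):
    """Bonus of the formula implemented here (logarithm inside the root)."""
    return c * prior * np.sqrt(np.log(parent_visits) / (1 + child_visits))


def bonus_alphago(prior, parent_visits, child_visits, c=1.0):
    """Bonus of the AlphaGo formula (sqrt of the parent count over 1 + N_sa)."""
    return c * prior * np.sqrt(parent_visits) / (1 + child_visits)


# Same prior and parent count, increasingly visited child.
for child_visits in (0, 1, 5, 20):
    mine = bonus_implemented(0.2, 50, child_visits)
    alphago = bonus_alphago(0.2, 50, child_visits)
    print(f"N_sa = {child_visits:2d}: implemented = {mine:.4f}, AlphaGo = {alphago:.4f}")
```

For rarely visited children the AlphaGo bonus is the larger of the two, because $\sqrt{N}$ grows much faster than $\sqrt{\ln N}$; it also decays faster with the child's visit count ($1/(1+N_{s,a})$ instead of $1/\sqrt{1+N_{s,a}}$). This may be part of why the implemented formula appears to explore too little, but that is a hypothesis to check against the experiments below.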

## Theoretical analysis of Rosin's p-UCT formula

Assume 2 actions with priors $p$ and $q=1-p$, with $p>q$, and unknown deterministic payoffs $Q_p, Q_q \in [-1,1]$ (this is a simplifying assumption; in reality each payoff follows a distribution that we average over). After how many "selection rounds" $n$ (i.e. $N_s$, the parent visit count) will action $q$ be chosen?

First we compute the weights $M_p$ and $M_q$:
$$
M_p = \frac{\sqrt{p}}{\sqrt{p}+\sqrt{q}} \\
M_q = \frac{\sqrt{q}}{\sqrt{p}+\sqrt{q}} \\
$$

Then we can compute the pUCB scores for the 2 actions in the initial round ($N_s = 1$):
$$
pUCB(p) = 1 + 0 - \frac{2(\sqrt{p}+\sqrt{q})}{\sqrt{p}} \\
pUCB(q) = 1 + 0 - \frac{2(\sqrt{p}+\sqrt{q})}{\sqrt{q}}
$$
where the default Q value is 1 because that is the maximum payoff/reward that can be achieved in the task. 

Since $p>q$ by assumption, $pUCB(p) > pUCB(q)$ and action $p$ will be selected.
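As a quick numerical check of the initial round, the two scores can be evaluated for a concrete prior. This is a sketch with a hypothetical helper name; $p=0.7$ is an arbitrary example value.

```python
import math


def initial_rosin_scores(p):
    """pUCB scores of the two actions when N_s = 1 and nothing was visited."""
    q = 1.0 - p
    norm = math.sqrt(p) + math.sqrt(q)
    m_p = math.sqrt(p) / norm  # weight M_p
    m_q = math.sqrt(q) / norm  # weight M_q
    # Default Q = 1, c term = 0, penalty = 2 / M
    return 1.0 - 2.0 / m_p, 1.0 - 2.0 / m_q


score_p, score_q = initial_rosin_scores(0.7)
print(f"pUCB(p) = {score_p:.4f}, pUCB(q) = {score_q:.4f}")
```

Both scores are negative, but the action with the larger prior is penalized less and is therefore selected first.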

Now for $n \ge 2$, assuming that action $p$ was selected in all previous rounds and the reward $Q_p$ was received each time, what is the minimum $n$ for which action $q$ will be selected?

The new pUCB scores are:
$$
pUCB(p) = Q_p + \sqrt{\frac{3 \log n}{2 (n-1)}} - \frac{2(\sqrt{p}+\sqrt{q})}{\sqrt{p}}\sqrt{\frac{\log n}{n}} \\
pUCB(q) = 1 + 0 - \frac{2(\sqrt{p}+\sqrt{q})}{\sqrt{q}}\sqrt{\frac{\log n}{n}}
$$

Action $q$ will be selected only if $pUCB(p) < pUCB(q)$. Rearranging the terms we get:

$$
\Delta Q^* \ge \sqrt{\frac{\log n}{n}} \left[ \sqrt{\frac{3}{2}} + 2 \frac{2p-1}{\sqrt{p(1-p)}}\right]
$$
where I approximated $n-1 \approx n$ for simplicity and defined $\Delta Q^* \equiv 1 - Q_p$.

Without doing any computation we can see that for every $\Delta Q^* > 0$ there exists an $\bar{n}$ such that this inequality is satisfied for all $n > \bar{n}$, since $\lim_{n\rightarrow \infty} \sqrt{\frac{\log n}{n}} = 0$.
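The threshold can also be found numerically by scanning $n$. The sketch below uses the approximate inequality above (with $n-1 \approx n$), so it is only indicative; `rosin_threshold` and `max_n` are hypothetical names.

```python
import math


def rosin_threshold(p, q_p, max_n=10**6):
    """Smallest n >= 2 for which the approximate inequality holds.

    p    : prior of the favoured action (p > 0.5)
    q_p  : deterministic payoff of the favoured action
    Returns None if no n <= max_n satisfies the inequality.
    """
    delta_q = 1.0 - q_p
    # Bracketed factor of the inequality
    factor = math.sqrt(1.5) + 2.0 * (2.0 * p - 1.0) / math.sqrt(p * (1.0 - p))
    for n in range(2, max_n + 1):
        if delta_q >= math.sqrt(math.log(n) / n) * factor:
            return n
    return None


for q_p in (0.0, 0.5, 0.9):
    print(f"p = 0.7, Q_p = {q_p}: first n = {rosin_threshold(0.7, q_p)}")
```

The closer $Q_p$ is to the maximum payoff, the longer the second action has to wait; for $Q_p = 1$ the inequality is never satisfied and the function returns `None`.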

Calling 
$$
C_p = \sqrt{\frac{3}{2}} + 2 \frac{2p-1}{\sqrt{p(1-p)}} 
$$
$\bar{n}$ has to satisfy:
$$
\sqrt{\frac{\log \bar{n}}{\bar{n}}} = \frac{\Delta Q^*}{C_p}
$$


## Development - compare the 3 functions

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
import time
import copy
# custom imports
import utils
import train
#import mcts
from rtfm import featurizer as X
import os

Using device cpu
Using device cpu


In [2]:
from mcts import *

In [3]:
# We can compare the AlphaGo pUCT formula with the one I implemented with just minimal changes 
# to the original class
class PV_MCTS_debug(PolicyValueMCTS):
    def __init__(self, 
             root_frame,
             simulator,
             valid_actions,
             ucb_c,
             discount,
             max_actions,
             pv_net,
             root=None,
             render=False,
             ucb_method="p-UCT-old"
                ):

        super().__init__(
            root_frame,
            simulator,
            valid_actions,
            ucb_c,
            discount,
            max_actions,
            pv_net,
            root,
            render
        )
        
        possible_ucb_methods = ["p-UCT-old", "p-UCT-AlphaGo"]
        assert ucb_method in possible_ucb_methods, \
            ("ucb method not recognized, should be one of: ", possible_ucb_methods)
        self.ucb_method = ucb_method
        
    def ucb_score(self, parent, child):
        if self.ucb_method == "p-UCT-old":
            return self.ucb_score_old(parent, child)
        else:
            return self.ucb_score_AlphaGo(parent, child)
        
    def ucb_score_old(self, parent, child):
        """
        The score for a node is based on its value, plus an exploration bonus.
        """
        exploration_term = self.ucb_c*child.prior*np.sqrt(np.log(parent.visit_count)/(child.visit_count+1))

        if child.visit_count > 0:
            # Mean value Q
            value_term = child.reward + self.discount*child.value() 
        else:
            value_term = 0

        return value_term + exploration_term, value_term, exploration_term
    
    def ucb_score_AlphaGo(self, parent, child):
        """
        The score for a node is based on its value, plus an exploration bonus.
        """
        exploration_term = self.ucb_c*child.prior/(child.visit_count+1)*np.sqrt(parent.visit_count)

        if child.visit_count > 0:
            # Mean value Q
            value_term = child.reward + self.discount*child.value() 
        else:
            value_term = 0

        return value_term + exploration_term, value_term, exploration_term

In [4]:
# To implement the Rosin p-UCT formula we also need to modify the node class (to store the
# weights M obtained from the priors) and its expand function

class RosinPVNode(PriorValueNode):
    def __init__(self, prior=0., weight=0.):
        super().__init__(prior)
        # this weight is sqrt(prior)/sum_{all_children}sqrt(prior), needs to be computed beforehand
        self.weight = weight 
    
    def expand(self, frame, valid_actions, priors, reward, done, simulator):
        self.expanded = True
        vprint("Valid actions as child: ", valid_actions)
        vprint("Prior over the children: ", priors)
        weights = np.sqrt(priors)/np.sqrt(priors).sum()
        vprint("Weights over the children: ", weights)
        vprint("Terminal node: ", done)
        self.full_action_space = len(priors) # trick to pass this information
        self.frame = frame
        self.reward = reward
        self.terminal = done
        self.valid_actions = valid_actions
        if not done:
            for action in valid_actions:
                self.children[action] = RosinPVNode(priors[action], weights[action])
        self.simulator_dict = simulator.save_state_dict()
        
    def add_exploration_noise(self, dirichlet_alpha=0.5, exploration_fraction=0.25):
        """
        At the start of each search, we add dirichlet noise to the prior of the root to
        encourage the search to explore new actions.
        """
        actions = list(self.children.keys())
        noise = np.random.dirichlet([dirichlet_alpha] * len(actions))
        frac = exploration_fraction
        priors = []
        for a, n in zip(actions, noise):
            self.children[a].prior = self.children[a].prior * (1 - frac) + n * frac
            priors.append(self.children[a].prior) 
        # recompute all weights as the square root of the prior and then normalize them to 1
        priors = np.array(priors)
        weights = np.sqrt(priors)/np.sqrt(priors).sum()
        for i,a in enumerate(actions):
            self.children[a].weight = weights[i]

In [5]:
class Rosin_PV_MCTS(PolicyValueMCTS):
    def __init__(self, 
             root_frame,
             simulator,
             valid_actions,
             ucb_c,
             discount,
             max_actions,
             pv_net,
             root=None,
             render=False
                ):

        super().__init__(
            root_frame,
            simulator,
            valid_actions,
            ucb_c,
            discount,
            max_actions,
            pv_net,
            root,
            render
        )
        
    def run(self, num_simulations, mode="simulate", dir_noise=False, dirichlet_alpha=1.0, exploration_fraction=0.25):
        """
        Runs num_simulations searches starting from the root node corresponding to the internal
        state of the simulator given during initialization.
        Returns the root node and an extra_info dictionary
        """
        if self.root is None or self.root.visit_count==0:
            self.root = RosinPVNode() 
            
            with torch.no_grad():
                _, root_prior = self.pv_net(self.root_frame)
                root_prior = root_prior.reshape(-1).cpu().numpy()
                
                
            self.root.expand(
                self.root_frame,
                self.valid_actions,
                root_prior,
                0, # reward to get to root
                False, # terminal node
                self.simulator # state of the simulator at the root node 
            )
                
            # not sure about this
            self.root.visit_count += 1
            
        if dir_noise:
            # add noise to root even if the tree was inherited from previous time-step 
            self.root.add_exploration_noise(dirichlet_alpha, exploration_fraction)
                
        max_tree_depth = 0
        root = self.root
        #print("root.simulator_dict :", root.simulator_dict) # not ok
        #print("root: ", root) # ok
        #print("self.simulator: ", self.simulator)
        for n in range(num_simulations):
            ### Start of a simulation/search ###
            vprint("\nSimulation %d started."%(n+1))
            node = root
            # make sure that the simulator internal state is reset to the original one
            self.simulator.load_state_dict(root.simulator_dict)
            search_path = [node]
            current_tree_depth = 0
            if self.render:
                node.render(self.simulator)
            ### Selection phase until leaf node is reached ###
            while node.expanded or (current_tree_depth<self.max_actions):
                current_tree_depth += 1
                action, node = self.select(node)
                if self.render and node.expanded:
                    node.render(self.simulator)
                vprint("Current tree depth: ", current_tree_depth)
                vprint("Action selected: ", action, action_dict[action])
                vprint("Child node terminal: ", node.terminal)
                vprint("Child node expanded: ", node.expanded)
                if node.expanded or node.terminal:
                    search_path.append(node)
                    if node.terminal:
                        break
                else:
                    break
                
            ### Expansion of leaf node (if not terminal)###
            vprint("Expansion phase started")
            if not node.terminal:
                parent = search_path[-1] # last expanded node on the search path
                node = self.expand(node, parent, action)
                if self.render:
                    node.render(self.simulator)
                search_path.append(node)
            
            ### Simulation phase for self.max_actions - current_tree_depth steps ###
            vprint("Value prediction/simulation phase started")
            if mode=="simulate":
                value = self.simulate(node, current_tree_depth)
            elif mode=="predict":
                value = self.predict(node)
            elif mode=="simulate_and_predict":
                value = self.simulate_and_predict(node, current_tree_depth)
            elif mode =="hybrid":
                value1 = self.simulate(node, current_tree_depth)
                value2 =self.predict(node)
                value = 0.5*value1 + 0.5*value2
            else:
                raise Exception("Mode "+mode+" not implemented.")
            vprint("Predicted/simulated value: ", value)
            
            ### Backpropagation of the leaf node value along the search_path ###
            vprint("Backpropagation phase started")
            self.backprop(search_path, value)
        
            max_tree_depth = max(max_tree_depth, current_tree_depth)
            vprint("Simulation %d done."%(n+1))
        extra_info = {
            "max_tree_depth": max_tree_depth
        }
        # just a check to see if root works as a shallow copy of self.root
        assert root.visit_count == self.root.visit_count, "self.root not updated during search"
        
        # make sure that the simulator internal state is reset to the original one
        self.simulator.load_state_dict(root.simulator_dict)
        return root, extra_info
    
    def ucb_score(self, parent, child):
        """
        The score for a node is based on its value, plus an exploration bonus.
        """
        # c_term increases p-UCT with time (i.e. parent's visit counts) and decreases with child's visit count
        if child.visit_count > 0:
            c_term = np.sqrt(3*np.log(parent.visit_count)/(2*child.visit_count))
        else:
            c_term = 0
        
        # m_term is a penalty term; increases for small probabilities and decreases with parent's visit counts
        if parent.visit_count > 1:
            m_term = 2/child.weight*np.sqrt(np.log(parent.visit_count)/parent.visit_count)
        else:
            m_term = 2/child.weight
            
        exploration_term = c_term - m_term
        
        if child.visit_count > 0:
            # Mean value Q
            value_term = child.reward + self.discount*child.value() 
        else:
            value_term = 1 # max payoff

        return value_term + exploration_term, value_term, exploration_term

In [3]:
def play_episode(
    frame,
    valid_actions,
    ucb_method,
    pv_net,
    env,
    episode_length,
    ucb_C,
    discount,
    max_actions,
    num_simulations,
    object_ids,
    mode="predict",
    dir_noise=False,
    render = True,
    debug_render=False,
    single_step = False
):
    """
    Plays an episode with a policy and value MCTS. 
    Starts building the tree from the sub-tree of the root's child node that has been selected at the previous step.
    
    If mode='simulate', it's identical to a policy MCTS with MC rollout evaluations, if mode='predict', the value network 
    is used to estimate the value of the leaf nodes (instead of a MC rollout).
    
    Chooses the best action as the one with the highest Q-value according to the MCTS step. Note that it does not return
    any signal on which to train the policy (I probably used this to test a policy trained in a supervised fashion to
    predict the optimal actions given by a hard-coded policy); the value net is not trained, thus this should be used
    only in 'simulate' mode.
    """
    action_dict = {
        0:"Stay",
        1:"Up",
        2:"Down",
        3:"Left",
        4:"Right"
    }
    #frame, valid_actions = env.reset()
    if render:
        env.render()
    total_reward = 0
    done = False
    new_root = None
    # variables used for training of value net
    frame_lst = [frame]
    reward_lst = []
    done_lst = []
    action_is_optimal = []
    if render:
        prior_is_optimal = []
    for i in range(episode_length):
        
        tree = PV_MCTS(
                     frame, 
                     env, 
                     valid_actions, 
                     ucb_C, 
                     discount, 
                     max_actions, 
                     pv_net,
                     render=debug_render, 
                     root=new_root,
                     ucb_method=ucb_method
                     )

        #print("Performing MCTS step")
        root, info = tree.run(num_simulations, mode=mode, dir_noise=dir_noise)
        #show_root_summary(root, discount)
        #print("Tree info: ", info)
        action = root.best_action(discount)
        best_actions = utils.get_optimal_actions(frame, object_ids)
        if render:
            #print("probs from MCTS: ", probs)
            best_prior = show_policy_summary(pv_net, frame, root, discount, action, best_actions)
            
            if best_prior in best_actions:
                prior_is_optimal.append(True)
            else:
                prior_is_optimal.append(False)

        # Evaluate chosen action against optimal policy
        if action in best_actions:
            action_is_optimal.append(True)
        else:
            action_is_optimal.append(False)
            
        new_root = tree.get_subtree(action)
        frame, valid_actions, reward, done = env.step(action)
        
        frame_lst.append(frame)
        reward_lst.append(reward)
        done_lst.append(done)
        
        if render:
            env.render()
            print("Reward received: ", reward)
            print("Done: ", done)
        total_reward += reward
        if done:
            break
            
        if single_step:
            break
            
    if render:
        return total_reward, frame_lst, reward_lst, done_lst, action_is_optimal, prior_is_optimal
    else:
        return total_reward, frame_lst, reward_lst, done_lst, action_is_optimal

In [9]:
def compare_pUCT_formulas(
    pv_net, 
    game_simulator, 
    episode_length,
    ucb_C,
    discount,
    max_actions,
    num_simulations,
    object_ids,
    dir_noise=False,
    render=True,
    debug_render=False,
    single_step=False
):
    
    # Save original frame and simulator state dictionary to compare the 3 methods on the exact same episode
    # with the exact same network
    frame, valid_actions = game_simulator.reset()
    original_sim_state = game_simulator.save_state_dict()

    game_simulator.load_state_dict(copy.deepcopy(original_sim_state))
    print("-"*40)
    print("p-UCT old:")
    _ = play_episode(
            frame,
            valid_actions,
            "p-UCT-old",
            pv_net,
            game_simulator,
            episode_length,
            ucb_C,
            discount,
            max_actions,
            num_simulations,
            object_ids,
            dir_noise = dir_noise,
            render = render,
            debug_render = debug_render,
            single_step = single_step
    )
    
    game_simulator.load_state_dict(copy.deepcopy(original_sim_state))
    print("-"*40)
    print("p-UCT AlphaGo:")
    _ = play_episode(
            frame,
            valid_actions,
            "p-UCT-AlphaGo",
            pv_net,
            game_simulator,
            episode_length,
            ucb_C,
            discount,
            max_actions,
            num_simulations,
            object_ids,
            dir_noise = dir_noise,
            render = render,
            debug_render = debug_render,
            single_step = single_step
    )
    
    game_simulator.load_state_dict(copy.deepcopy(original_sim_state))
    print("-"*40)
    print("p-UCT Rosin:")
    _ = play_episode(
            frame,
            valid_actions,
            "p-UCT-Rosin",
            pv_net,
            game_simulator,
            episode_length,
            ucb_C,
            discount,
            max_actions,
            num_simulations,
            object_ids,
            dir_noise = dir_noise,
            render = render,
            debug_render = debug_render,
            single_step = single_step
    )

## Test on untrained pv net

In [5]:
from play_functions import show_policy_summary

In [6]:
# Check only if main logic of the training loop works
ucb_C = 1.0
discount = 0.9 # try with smaller discount
episode_length = 32
max_actions = 20
num_simulations = 50
#device = mcts.device
n_episodes = 4000
memory_size = 1024
batch_size = 32
n_steps = 5
tau = 0.1 # new_trg_params = (1-tau)*old_trg_params + tau*value_net_params
dir_noise = False
dirichlet_alpha = 0.5 # no real reason to choose this value, except it's < 1
exploration_fraction = 0.25
temperature = 1.
full_cross_entropy = True
entropy_bonus = True
entropy_weight = 1e-2

In [7]:
flags = utils.Flags(env="rtfm:groups_simple_stationary-v0")
gym_env = utils.create_env(flags)
featurizer = X.Render()
game_simulator = FullTrueSimulator(gym_env, featurizer)
object_ids = utils.get_object_ids_dict(game_simulator)

pv_net = DiscreteSupportPVNet_v3(gym_env).to(device)

### Single step - no root noise

In [10]:
compare_pUCT_formulas(
    pv_net, 
    game_simulator, 
    episode_length,
    ucb_C,
    discount,
    max_actions,
    num_simulations,
    object_ids,
    render=True,
    debug_render=False,
    single_step=True,
    dir_noise=False
)

----------------------------------------
p-UCT old:

██████
█   n█
█@y  █
█?   █
█!   █
██████

Action  Stay : Prior=0.177 - Q-value=-0.443 - Visit counts=2
Action  Up : Prior=0.213 - Q-value=-0.193 - Visit counts=4
Action  Down : Prior=0.320 - Q-value=-1.000 - Visit counts=1
Action  Right : Prior=0.290 - Q-value=-0.040 - Visit counts=43
Action with best prior:  2 (Down)
Action selected from MCTS:  4 (Right)
Best actions:  [4] ['Right']

██████
█   n█
█ @  █
█?   █
█!   █
██████
blessed sword
Reward received:  0
Done:  False
----------------------------------------
p-UCT AlphaGo:

██████
█   n█
█@y  █
█?   █
█!   █
██████

Action  Stay : Prior=0.177 - Q-value=-0.443 - Visit counts=2
Action  Up : Prior=0.213 - Q-value=-0.093 - Visit counts=15
Action  Down : Prior=0.320 - Q-value=-1.000 - Visit counts=2
Action  Right : Prior=0.290 - Q-value=-0.037 - Visit counts=31
Action with best prior:  2 (Down)
Action selected from MCTS:  4 (Right)
Best actions:  [4] ['Right']

██████
█   n█
█ @  █
█

In [11]:
# Debug render on
compare_pUCT_formulas(
    pv_net, 
    game_simulator, 
    episode_length,
    ucb_C,
    discount,
    max_actions,
    num_simulations,
    object_ids,
    render=True,
    debug_render=True,
    single_step=True,
    dir_noise=False
)

----------------------------------------
p-UCT old:

██████
█ !  █
█n   █
█ @y █
█  ? █
██████

Valid actions as child:  [0 1 2 3 4]
Prior over the children:  [0.14916302 0.19584553 0.1838727  0.16505615 0.30606255]
Terminal node:  False

Simulation 1 started.

██████
█ !  █
█n   █
█ @y █
█  ? █
██████

actions:  [0 1 2 3 4]
value_terms:  [0 0 0 0 0]
exploration_terms:  [0. 0. 0. 0. 0.]
ucb_values:  [0. 0. 0. 0. 0.]
max_U:  0.0
mask:  [ True  True  True  True  True]
best_actions:  [0 1 2 3 4]
Current tree depth:  1
Action selected:  1 Up
Child node terminal:  False
Child node expanded:  False
Expansion phase started
valid_actions:  [0 1 2 3 4]
prior:  [0.13791168 0.19442321 0.1973946  0.1719308  0.29833978]
reward:  0
done:  False
Valid actions as child:  [0 1 2 3 4]
Prior over the children:  [0.13791168 0.19442321 0.1973946  0.1719308  0.29833978]
Terminal node:  False

██████
█ !  █
█n@  █
█  y █
█  ? █
██████

Value prediction/simulation phase started
Predicted/simulated value:  -0.

ucb_values:  [0.114819   0.16186794 0.16434178 0.05940156 0.24838416]
max_U:  0.2483841629820077
mask:  [False False False False  True]
best_actions:  [4]
Current tree depth:  2
Action selected:  4 Right
Child node terminal:  False
Child node expanded:  False
Expansion phase started
valid_actions:  [0 1 2 3 4]
prior:  [0.16371053 0.19814923 0.18433253 0.16769339 0.2861143 ]
reward:  0
done:  False
Valid actions as child:  [0 1 2 3 4]
Prior over the children:  [0.16371053 0.19814923 0.18433253 0.16769339 0.2861143 ]
Terminal node:  False

██████
█ !  █
█n @ █
█  y █
█  ? █
██████

Value prediction/simulation phase started
Predicted/simulated value:  -0.04655185714364052
Backpropagation phase started
Simulation 12 done.

Simulation 13 started.

██████
█ !  █
█n   █
█ @y █
█  ? █
██████

actions:  [0 1 2 3 4]
value_terms:  [-0.03206886 -0.03687645 -0.04017869 -0.03958509 -0.25070772]
exploration_terms:  [0.16892175 0.15682781 0.17001844 0.15261966 0.21921213]
ucb_values:  [ 0.13685289  0.

Predicted/simulated value:  -0.03506828099489212
Backpropagation phase started
Simulation 18 done.

Simulation 19 started.

██████
█ !  █
█n   █
█ @y █
█  ? █
██████

actions:  [0 1 2 3 4]
value_terms:  [-0.03314455 -0.20789826 -0.32678579 -0.03601072 -0.25070772]
exploration_terms:  [0.12797716 0.13719533 0.15775696 0.14161296 0.23486942]
ucb_values:  [ 0.09483261 -0.07070294 -0.16902883  0.10560224 -0.0158383 ]
max_U:  0.10560224232414564
mask:  [False False False  True False]
best_actions:  [3]

██████
█ !  █
█n   █
█@ y █
█  ? █
██████

Current tree depth:  1
Action selected:  3 Left
Child node terminal:  False
Child node expanded:  True
actions:  [0 1 2 4]
value_terms:  [-0.04166852  0.          0.         -0.03206886]
exploration_terms:  [0.12948467 0.23235171 0.24388687 0.27491569]
ucb_values:  [0.08781615 0.23235171 0.24388687 0.24284683]
max_U:  0.24388686772634677
mask:  [False False  True False]
best_actions:  [2]
Current tree depth:  2
Action selected:  2 Down
Child node te


██████
█ !  █
█n   █
█  @ █
█  ? █
██████
blessed sword
Value prediction/simulation phase started
Predicted/simulated value:  -0.03506828099489212
Backpropagation phase started
Simulation 24 done.

Simulation 25 started.

██████
█ !  █
█n   █
█ @y █
█  ? █
██████

actions:  [0 1 2 3 4]
value_terms:  [-0.03261668 -0.20789826 -0.32678579 -0.03346789 -0.25070772]
exploration_terms:  [0.10925408 0.14344656 0.16494508 0.10469811 0.24557113]
ucb_values:  [ 0.0766374  -0.0644517  -0.16184071  0.07123022 -0.00513659]
max_U:  0.07663739508590288
mask:  [ True False False False False]
best_actions:  [0]

██████
█ !  █
█n   █
█ @y █
█  ? █
██████

Current tree depth:  1
Action selected:  0 Stay
Child node terminal:  False
Child node expanded:  True
actions:  [0 1 2 3 4]
value_terms:  [ 0.         -0.03528887 -0.04328831  0.         -0.03349728]
exploration_terms:  [0.18923362 0.17568544 0.16494508 0.20939622 0.22417474]
ucb_values:  [0.18923362 0.14039657 0.12165678 0.20939622 0.19067746]
max_U:


██████
█ !  █
█n   █
█@ y █
█  ? █
██████

Current tree depth:  1
Action selected:  3 Left
Child node terminal:  False
Child node expanded:  True
actions:  [0 1 2 4]
value_terms:  [-0.04166852 -0.04181497 -0.04118785 -0.03063153]
exploration_terms:  [0.18311897 0.23235171 0.24388687 0.22446772]
ucb_values:  [0.14145045 0.19053675 0.20269902 0.19383619]
max_U:  0.20269901696791343
mask:  [False False  True False]
best_actions:  [2]

██████
█ !  █
█n   █
█  y █
█@ ? █
██████

Current tree depth:  2
Action selected:  2 Down
Child node terminal:  False
Child node expanded:  True
actions:  [0 1 4]
value_terms:  [0 0 0]
exploration_terms:  [0. 0. 0.]
ucb_values:  [0. 0. 0.]
max_U:  0.0
mask:  [ True  True  True]
best_actions:  [0 1 4]
Current tree depth:  3
Action selected:  4 Right
Child node terminal:  False
Child node expanded:  False
Expansion phase started
valid_actions:  [0 1 3 4]
prior:  [0.18398207 0.23482426 0.         0.2153004  0.3658933 ]
reward:  0
done:  False
Valid actions as

best_actions:  [1]
Current tree depth:  3
Action selected:  1 Up
Child node terminal:  False
Child node expanded:  False
Expansion phase started
valid_actions:  [0 1 2 4]
prior:  [0.17470732 0.22167854 0.23268382 0.         0.37093028]
reward:  0
done:  False
Valid actions as child:  [0 1 2 4]
Prior over the children:  [0.17470732 0.22167854 0.23268382 0.         0.37093028]
Terminal node:  False

██████
█ !  █
█n   █
█@ y █
█  ? █
██████

Value prediction/simulation phase started
Predicted/simulated value:  -0.04629835486412048
Backpropagation phase started
Simulation 36 done.

Simulation 37 started.

██████
█ !  █
█n   █
█ @y █
█  ? █
██████

actions:  [0 1 2 3 4]
value_terms:  [-0.11905867 -0.20789826 -0.32678579 -0.03250299 -0.25070772]
exploration_terms:  [0.08963342 0.15193116 0.17470127 0.07841164 0.26009619]
ucb_values:  [-0.02942525 -0.05596711 -0.15208452  0.04590865  0.00938848]
max_U:  0.045908653015035966
mask:  [False False False  True False]
best_actions:  [3]

██████
█ 

value_terms:  [-0.11905867 -0.20789826 -0.32678579 -0.07370803 -0.14094479]
exploration_terms:  [0.09147963 0.15506053 0.17829964 0.0754499  0.19785734]
ucb_values:  [-0.02757904 -0.05283774 -0.14848615  0.00174188  0.05691255]
max_U:  0.05691255020638489
mask:  [False False False False  True]
best_actions:  [4]

██████
█ !  █
█n   █
█  @ █
█  ? █
██████
blessed sword
Current tree depth:  1
Action selected:  4 Right
Child node terminal:  False
Child node expanded:  True
actions:  [0 1 2 3 4]
value_terms:  [-0.03156145 -0.03937012 -1.         -0.03958211 -0.03575352]
exploration_terms:  [0.16874618 0.19919092 0.20065908 0.16358028 0.20328647]
ucb_values:  [ 0.13718473  0.1598208  -0.79934092  0.12399817  0.16753295]
max_U:  0.16753294924237408
mask:  [False False False False  True]
best_actions:  [4]

██████
█ !  █
█n   █
█   @█
█  ? █
██████
blessed sword
Current tree depth:  2
Action selected:  4 Right
Child node terminal:  False
Child node expanded:  True
actions:  [0 1 2 3]
value_te

actions:  [0 1 2 3 4]
value_terms:  [-0.11905867 -0.20789826 -0.32678579 -0.07370803 -0.14287414]
exploration_terms:  [0.09305454 0.15773004 0.18136925 0.07674885 0.15589815]
ucb_values:  [-0.02600413 -0.05016822 -0.14541654  0.00304082  0.01302401]
max_U:  0.013024006315322523
mask:  [False False False False  True]
best_actions:  [4]

██████
█ !  █
█n   █
█  @ █
█  ? █
██████
blessed sword
Current tree depth:  1
Action selected:  4 Right
Child node terminal:  False
Child node expanded:  True
actions:  [0 1 2 3 4]
value_terms:  [-0.03349728 -0.03317676 -1.         -0.03958211 -0.1635519 ]
exploration_terms:  [0.15521703 0.15867394 0.22605279 0.18428161 0.17311732]
ucb_values:  [ 0.12171975  0.12549718 -0.77394721  0.1446995   0.00956542]
max_U:  0.14469950098216172
mask:  [False False False  True False]
best_actions:  [3]

██████
█ !  █
█n   █
█ @  █
█  ? █
██████
blessed sword
Current tree depth:  2
Action selected:  3 Left
Child node terminal:  False
Child node expanded:  True
action

valid_actions:  [0 1 2 3 4]
prior:  [0.14916302 0.19584553 0.1838727  0.16505615 0.30606255]
reward:  0
done:  False
Valid actions as child:  [0 1 2 3 4]
Prior over the children:  [0.14916302 0.19584553 0.1838727  0.16505615 0.30606255]
Terminal node:  False

██████
█ !  █
█n   █
█ @y █
█  ? █
██████

Value prediction/simulation phase started
Predicted/simulated value:  -0.03563206270337105
Backpropagation phase started
Simulation 5 done.

Simulation 6 started.

██████
█ !  █
█n   █
█ @y █
█  ? █
██████

actions:  [0 1 2 3 4]
value_terms:  [-0.03206886 -0.03528887 -0.04328831 -0.04166852 -0.03156145]
exploration_terms:  [0.18268665 0.23986081 0.22519715 0.20215168 0.37484854]
ucb_values:  [0.15061779 0.20457194 0.18190884 0.16048316 0.34328708]
max_U:  0.3432870847492852
mask:  [False False False False  True]
best_actions:  [4]

██████
█ !  █
█n   █
█  @ █
█  ? █
██████
blessed sword
Current tree depth:  1
Action selected:  4 Right
Child node terminal:  False
Child node expanded:  True

valid_actions:  [0 1 4]
prior:  [0.2336847  0.3153596  0.         0.         0.45095566]
reward:  0
done:  False
Valid actions as child:  [0 1 4]
Prior over the children:  [0.2336847  0.3153596  0.         0.         0.45095566]
Terminal node:  False

██████
█ !  █
█n   █
█  y █
█@ ? █
██████

Value prediction/simulation phase started
Predicted/simulated value:  -0.04576427862048149
Backpropagation phase started
Simulation 13 done.

Simulation 14 started.

██████
█ !  █
█n   █
█ @y █
█  ? █
██████

actions:  [0 1 2 3 4]
value_terms:  [-0.03023708 -0.03395261 -0.47164415 -0.03586652 -0.32240232]
exploration_terms:  [0.18603898 0.18319672 0.22932955 0.15439589 0.2862953 ]
ucb_values:  [ 0.15580189  0.1492441  -0.2423146   0.11852938 -0.03610702]
max_U:  0.15580189314209633
mask:  [ True False False False False]
best_actions:  [0]

██████
█ !  █
█n   █
█ @y █
█  ? █
██████

Current tree depth:  1
Action selected:  0 Stay
Child node terminal:  False
Child node expanded:  True
actions:  [0 


██████
█ !  █
█n   █
█  @ █
█  ? █
██████
blessed sword
Value prediction/simulation phase started
Predicted/simulated value:  -0.03506828099489212
Backpropagation phase started
Simulation 20 done.

Simulation 21 started.

██████
█ !  █
█n   █
█ @y █
█  ? █
██████

actions:  [0 1 2 3 4]
value_terms:  [-0.03373906 -0.25046446 -0.47164415 -0.03471658 -0.32240232]
exploration_terms:  [0.11392514 0.17949539 0.28087019 0.10805462 0.3506387 ]
ucb_values:  [ 0.08018608 -0.07096907 -0.19077396  0.07333804  0.02823638]
max_U:  0.08018608276097686
mask:  [ True False False False False]
best_actions:  [0]

██████
█ !  █
█n   █
█ @y █
█  ? █
██████

Current tree depth:  1
Action selected:  0 Stay
Child node terminal:  False
Child node expanded:  True
actions:  [0 1 2 3 4]
value_terms:  [ 0.         -0.03528887 -0.04328831 -0.04166852 -0.03156145]
exploration_terms:  [0.33353866 0.21896196 0.20557593 0.18453839 0.34218833]
ucb_values:  [0.33353866 0.18367309 0.16228762 0.14286987 0.31062688]
max_U:


██████
█ !  █
█n   █
█  @ █
█  ? █
██████
blessed sword
Value prediction/simulation phase started
Predicted/simulated value:  -0.03506828099489212
Backpropagation phase started
Simulation 27 done.

Simulation 28 started.

██████
█ !  █
█n   █
█ @y █
█  ? █
██████

actions:  [0 1 2 3 4]
value_terms:  [-0.12995478 -0.25046446 -0.47164415 -0.03399337 -0.20620908]
exploration_terms:  [0.08769961 0.20726343 0.32432096 0.0970439  0.2699218 ]
ucb_values:  [-0.04225516 -0.04320103 -0.1473232   0.06305052  0.06371272]
max_U:  0.06371272002718975
mask:  [False False False False  True]
best_actions:  [4]

██████
█ !  █
█n   █
█  @ █
█  ? █
██████
blessed sword
Current tree depth:  1
Action selected:  4 Right
Child node terminal:  False
Child node expanded:  True
actions:  [0 1 2 3 4]
value_terms:  [-0.03156145 -0.03937012 -1.          0.         -0.03960613]
exploration_terms:  [0.18502508 0.21840681 0.2200166  0.35872165 0.31522462]
ucb_values:  [ 0.15346363  0.17903669 -0.7799834   0.35872165 

max_U:  0.23456801341216826
mask:  [ True False False False]
best_actions:  [0]

██████
█ !  █
█n   █
█@ y █
█  ? █
██████

Current tree depth:  2
Action selected:  0 Stay
Child node terminal:  False
Child node expanded:  True
actions:  [0 1 2 4]
value_terms:  [0 0 0 0]
exploration_terms:  [0.17470732 0.22167854 0.23268382 0.37093028]
ucb_values:  [0.17470732 0.22167854 0.23268382 0.37093028]
max_U:  0.3709302842617035
mask:  [False False False  True]
best_actions:  [4]
Current tree depth:  3
Action selected:  4 Right
Child node terminal:  False
Child node expanded:  False
Expansion phase started
valid_actions:  [0 1 2 3 4]
prior:  [0.14916302 0.19584553 0.1838727  0.16505615 0.30606255]
reward:  0
done:  False
Valid actions as child:  [0 1 2 3 4]
Prior over the children:  [0.14916302 0.19584553 0.1838727  0.16505615 0.30606255]
Terminal node:  False

██████
█ !  █
█n   █
█ @y █
█  ? █
██████

Value prediction/simulation phase started
Predicted/simulated value:  -0.03563206270337105
Backpropagation phase started

valid_actions:  [0 1 2 3 4]
prior:  [0.16549146 0.195349   0.19678883 0.1604252  0.28194547]
reward:  0
done:  False
Valid actions as child:  [0 1 2 3 4]
Prior over the children:  [0.16549146 0.195349   0.19678883 0.1604252  0.28194547]
Terminal node:  False

██████
█ !  █
█n   █
█  @ █
█  ? █
██████
blessed sword
Value prediction/simulation phase started
Predicted/simulated value:  -0.03506828099489212
Backpropagation phase started
Simulation 40 done.

Simulation 41 started.

██████
█ !  █
█n   █
█ @y █
█  ? █
██████

actions:  [0 1 2 3 4]
value_terms:  [-0.12995478 -0.25046446 -0.47164415 -0.03361392 -0.097301  ]
exploration_terms:  [0.10612326 0.25080465 0.39245325 0.07549108 0.13998261]
ucb_values:  [-0.02383151  0.00034019 -0.07919091  0.04187716  0.04268161]
max_U:  0.04268161406338758
mask:  [False False False False  True]
best_actions:  [4]

██████
█ !  █
█n   █
█  @ █
█  ? █
██████
blessed sword
Current tree depth:  1
Action selected:  4 Right
Child node terminal:  False
Child

Child node expanded:  True
actions:  [0 1 2 4]
value_terms:  [-0.03586652 -0.0382363  -0.03867951 -0.03292621]
exploration_terms:  [0.18008419 0.22850101 0.191876   0.21848353]
ucb_values:  [0.14421767 0.19026471 0.15319648 0.18555732]
max_U:  0.19026470502665477
mask:  [False  True False False]
best_actions:  [1]

██████
█ !  █
█@   █
█  y █
█  ? █
██████
gleaming sword
Current tree depth:  2
Action selected:  1 Up
Child node terminal:  False
Child node expanded:  True
actions:  [0 1 2 4]
value_terms:  [ 0.          0.         -0.03594184 -0.04505144]
exploration_terms:  [0.33941464 0.4225319  0.21283106 0.27222113]
ucb_values:  [0.33941464 0.4225319  0.17688922 0.2271697 ]
max_U:  0.4225319007861361
mask:  [False  True False False]
best_actions:  [1]
Current tree depth:  3
Action selected:  1 Up
Child node terminal:  False
Child node expanded:  False
Expansion phase started
valid_actions:  [0 2 4]
prior:  [0.2527489  0.         0.32539406 0.         0.42185706]
reward:  0
done:  False

valid_actions:  [0 1 3]
prior:  [0.29727235 0.39036798 0.         0.3123597  0.        ]
reward:  0
done:  False
Valid actions as child:  [0 1 3]
Prior over the children:  [0.29727235 0.39036798 0.         0.3123597  0.        ]
Weights over the children:  [0.31535822 0.36137992 0.         0.3232618  0.        ]
Terminal node:  False

██████
█ !  █
█n   █
█    █
█  ?@█
██████
blessed sword
Value prediction/simulation phase started
Predicted/simulated value:  -0.044320493936538696
Backpropagation phase started
Simulation 3 done.

Simulation 4 started.

██████
█ !  █
█n   █
█ @y █
█  ? █
██████

actions:  [0 1 2 3 4]
value_terms:  [ 1.         1.         1.         1.        -0.0331722]
exploration_terms:  [-6.7588924  -5.89860484 -6.08761967 -6.42525332 -3.8859134 ]
ucb_values:  [-5.7588924  -4.89860484 -5.08761967 -5.42525332 -3.91908561]
max_U:  -3.9190856050948164
mask:  [False False False False  True]
best_actions:  [4]

██████
█ !  █
█n   █
█  @ █
█  ? █
██████
blessed sword
Curren

value_terms:  [ 1.         -0.03441277 -0.03594787  1.        ]
exploration_terms:  [-4.4458638  -3.17887271 -3.00482042 -4.29616504]
ucb_values:  [-3.4458638  -3.21328548 -3.04076828 -3.29616504]
max_U:  -3.040768281028467
mask:  [False False  True False]
best_actions:  [2]

██████
█ !  █
█n   █
█    █
█  ?@█
██████
blessed sword
Current tree depth:  3
Action selected:  2 Down
Child node terminal:  False
Child node expanded:  True
actions:  [0 1 3]
value_terms:  [ 1.         -0.03775286  1.        ]
exploration_terms:  [-3.83784505 -2.44137384 -3.74401182]
ucb_values:  [-2.83784505 -2.47912671 -2.74401182]
max_U:  -2.4791267071517558
mask:  [False  True False]
best_actions:  [1]

██████
█ !  █
█n   █
█   @█
█  ? █
██████
blessed sword
Current tree depth:  4
Action selected:  1 Up
Child node terminal:  False
Child node expanded:  True
actions:  [0 1 2 3]
value_terms:  [ 1.          1.         -0.03988844  1.        ]
exploration_terms:  [-4.96411136 -4.65079527 -3.43678698 -4.79696244]


██████
█ ! @█
█n   █
█    █
█  ? █
██████
blessed sword
Value prediction/simulation phase started
Predicted/simulated value:  -0.04109874367713928
Backpropagation phase started
Simulation 13 done.

Simulation 14 started.

██████
█ !  █
█n   █
█ @y █
█  ? █
██████

actions:  [0 1 2 3 4]
value_terms:  [ 1.          1.          1.          1.         -0.02740537]
exploration_terms:  [-4.98469046 -4.35022745 -4.48962609 -4.73863128 -2.92805409]
ucb_values:  [-3.98469046 -3.35022745 -3.48962609 -3.73863128 -2.95545946]
max_U:  -2.9554594608877283
mask:  [False False False False  True]
best_actions:  [4]

██████
█ !  █
█n   █
█  @ █
█  ? █
██████
blessed sword
Current tree depth:  1
Action selected:  4 Right
Child node terminal:  False
Child node expanded:  True
actions:  [0 1 2 3 4]
value_terms:  [ 1.          1.          1.          1.         -0.03006559]
exploration_terms:  [-4.85655245 -4.47002861 -4.4536456  -4.93264203 -3.1545424 ]
ucb_values:  [-3.85655245 -3.47002861 -3.4536456  -3


██████
█ !  █
█n   █
█   @█
█  ? █
██████
blessed sword
Value prediction/simulation phase started
Predicted/simulated value:  -0.04400680586695671
Backpropagation phase started
Simulation 18 done.

Simulation 19 started.

██████
█ !  █
█n   █
█ @y █
█  ? █
██████

actions:  [0 1 2 3 4]
value_terms:  [ 1.          1.          1.          1.         -0.02686958]
exploration_terms:  [-4.51962371 -3.94435549 -4.07074837 -4.29652161 -2.65985803]
ucb_values:  [-3.51962371 -2.94435549 -3.07074837 -3.29652161 -2.68672761]
max_U:  -2.686727606815966
mask:  [False False False False  True]
best_actions:  [4]

██████
█ !  █
█n   █
█  @ █
█  ? █
██████
blessed sword
Current tree depth:  1
Action selected:  4 Right
Child node terminal:  False
Child node expanded:  True
actions:  [0 1 2 3 4]
value_terms:  [ 1.          1.          1.          1.         -0.02954843]
exploration_terms:  [-4.38127952 -4.03258176 -4.01780203 -4.44992281 -2.85164349]
ucb_values:  [-3.38127952 -3.03258176 -3.01780203 -3.

max_U:  -3.1420441856781873
mask:  [False  True False False]
best_actions:  [1]

██████
█ ! @█
█n   █
█    █
█  ? █
██████
blessed sword
Current tree depth:  4
Action selected:  1 Up
Child node terminal:  False
Child node expanded:  True
actions:  [0 2 3]
value_terms:  [ 1.         -0.03528316  1.        ]
exploration_terms:  [-3.66852596 -2.30546037 -3.62328861]
ucb_values:  [-2.66852596 -2.34074353 -2.62328861]
max_U:  -2.3407435319614303
mask:  [False  True False]
best_actions:  [2]

██████
█ !  █
█n  @█
█    █
█  ? █
██████
blessed sword
Current tree depth:  5
Action selected:  2 Down
Child node terminal:  False
Child node expanded:  True
actions:  [0 1 2 3]
value_terms:  [1 1 1 1]
exploration_terms:  [-8.45322871 -7.80816011 -7.56542541 -8.23380172]
ucb_values:  [-7.45322871 -6.80816011 -6.56542541 -7.23380172]
max_U:  -6.5654254093470605
mask:  [False False  True False]
best_actions:  [2]
Current tree depth:  6
Action selected:  2 Down
Child node terminal:  False
Child node expan

Predicted/simulated value:  -0.04655185714364052
Backpropagation phase started
Simulation 30 done.

Simulation 31 started.

██████
█ !  █
█n   █
█ @y █
█  ? █
██████

actions:  [0 1 2 3 4]
value_terms:  [ 1.         -0.03649794  1.          1.         -0.11388383]
exploration_terms:  [-3.82116938 -1.7299686  -3.44166241 -3.63254507 -2.23869675]
ucb_values:  [-2.82116938 -1.76646653 -2.44166241 -2.63254507 -2.35258058]
max_U:  -1.7664665344491806
mask:  [False  True False False False]
best_actions:  [1]

██████
█ !  █
█n@  █
█  y █
█  ? █
██████

Current tree depth:  1
Action selected:  1 Up
Child node terminal:  False
Child node expanded:  True
actions:  [0 1 2 3 4]
value_terms:  [ 1.          1.          1.          1.         -0.04189667]
exploration_terms:  [-7.03038664 -5.92114461 -5.87640996 -6.29655291 -3.76029175]
ucb_values:  [-6.03038664 -4.92114461 -4.87640996 -5.29655291 -3.80218842]
max_U:  -3.8021884211211465
mask:  [False False False False  True]
best_actions:  [4]

█████

actions:  [0 1 2 3]
value_terms:  [ 1.          1.         -0.03960613  1.        ]
exploration_terms:  [-4.9764581  -4.59670299 -3.43413686 -4.84728033]
ucb_values:  [-3.9764581  -3.59670299 -3.47374299 -3.84728033]
max_U:  -3.473742985890685
mask:  [False False  True False]
best_actions:  [2]

██████
█ !  █
█n   █
█   @█
█  ? █
██████
blessed sword
Current tree depth:  4
Action selected:  2 Down
Child node terminal:  False
Child node expanded:  True
actions:  [0 1 2 3]
value_terms:  [1 1 1 1]
exploration_terms:  [-8.43225599 -7.90004363 -7.56992702 -8.14832955]
ucb_values:  [-7.43225599 -6.90004363 -6.56992702 -7.14832955]
max_U:  -6.569927016817344
mask:  [False False  True False]
best_actions:  [2]
Current tree depth:  5
Action selected:  2 Down
Child node terminal:  False
Child node expanded:  False
Expansion phase started
valid_actions:  [0 1 3]
prior:  [0.29727235 0.39036798 0.         0.3123597  0.        ]
reward:  0
done:  False
Valid actions as child:  [0 1 3]
Prior over the

valid_actions:  [0 1 3]
prior:  [0.29727235 0.39036798 0.         0.3123597  0.        ]
reward:  0
done:  False
Valid actions as child:  [0 1 3]
Prior over the children:  [0.29727235 0.39036798 0.         0.3123597  0.        ]
Weights over the children:  [0.31535822 0.36137992 0.         0.3232618  0.        ]
Terminal node:  False

██████
█ !  █
█n   █
█    █
█  ?@█
██████
blessed sword
Value prediction/simulation phase started
Predicted/simulated value:  -0.044320493936538696
Backpropagation phase started
Simulation 41 done.

Simulation 42 started.

██████
█ !  █
█n   █
█ @y █
█  ? █
██████

actions:  [0 1 2 3 4]
value_terms:  [ 1.         -0.02906986  1.          1.         -0.09868536]
exploration_terms:  [-3.42494506 -2.0940636  -3.08478989 -3.25587956 -1.98492199]
ucb_values:  [-2.42494506 -2.12313346 -2.08478989 -2.25587956 -2.08360735]
max_U:  -2.083607352727273
mask:  [False False False False  True]
best_actions:  [4]

██████
█ !  █
█n   █
█  @ █
█  ? █
██████
blessed sword


valid_actions:  [0 1 2 3]
prior:  [0.21834621 0.28340966 0.2746046  0.22363955 0.        ]
reward:  0
done:  False
Valid actions as child:  [0 1 2 3]
Prior over the children:  [0.21834621 0.28340966 0.2746046  0.22363955 0.        ]
Weights over the children:  [0.23403898 0.2666384  0.26246372 0.23685889 0.        ]
Terminal node:  False

██████
█ !  █
█n   █
█  y@█
█  ? █
██████

Value prediction/simulation phase started
Predicted/simulated value:  -0.04500729963183403
Backpropagation phase started
Simulation 48 done.

Simulation 49 started.

██████
█ !  █
█n   █
█ @y █
█  ? █
██████

actions:  [0 1 2 3 4]
value_terms:  [ 1.         -0.02912729 -0.46803757  1.         -0.09457893]
exploration_terms:  [-3.23560985 -1.96953956 -1.70618909 -3.0758905  -1.85613011]
ucb_values:  [-2.23560985 -1.99866685 -2.17422666 -2.0758905  -1.95070905]
max_U:  -1.9507090456571752
mask:  [False False False False  True]
best_actions:  [4]

██████
█ !  █
█n   █
█  @ █
█  ? █
██████
blessed sword
Current t

## Test on a trained policy-value net (`pv_net`)

In [11]:
# If enabled, load a trained policy-value network from a saved checkpoint
load = True
load_dir = "./save_dir"
ID = "BAA"
checkpoint = 6000
if load:
    # Map tensors to the CPU so the checkpoint loads without a GPU
    d = torch.load("%s/%s/training_dict_%s" % (load_dir, ID, checkpoint), map_location=torch.device('cpu'))
    pv_net = d["pv_net"]
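
The per-node debug output (`value_terms`, `exploration_terms`, `ucb_values`, `max_U`, `mask`, `best_actions`) can be read as the selection step sketched below. This is a reconstruction from the printed values, not the notebook's actual code. The function name `select_action` is hypothetical, and plain Python lists stand in for the NumPy arrays. Uniform random tie-breaking is an assumption, suggested by log entries where every element of `mask` is `True` and a single action is then selected.

```python
import random


def select_action(actions, value_terms, exploration_terms, rng=random):
    """Hypothetical sketch: pick an action with maximal UCB score.

    Ties are broken uniformly at random (an assumption, see the text above).
    """
    # Score of each child: value term plus exploration bonus.
    ucb_values = [q + u for q, u in zip(value_terms, exploration_terms)]
    max_U = max(ucb_values)
    # Boolean mask marking the children that attain the maximum score.
    mask = [u == max_U for u in ucb_values]
    best_actions = [a for a, m in zip(actions, mask) if m]
    return rng.choice(best_actions), ucb_values, best_actions


# Values copied from the root-node log entry of "Simulation 37".
actions = [0, 1, 2, 3, 4]
value_terms = [-0.11905867, -0.20789826, -0.32678579, -0.03250299, -0.25070772]
exploration_terms = [0.08963342, 0.15193116, 0.17470127, 0.07841164, 0.26009619]

chosen, ucb_values, best_actions = select_action(actions, value_terms, exploration_terms)
print(best_actions)  # [3]
```

With these inputs only action 3 attains the maximum, which agrees with `best_actions:  [3]` in that log entry. When all scores are equal (for example all zeros at the first simulation), every valid action lands in `best_actions` and the choice is random.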

In [12]:
compare_pUCT_formulas(
    pv_net, 
    game_simulator, 
    episode_length,
    ucb_C,
    discount,
    max_actions,
    num_simulations,
    object_ids)

----------------------------------------
p-UCT old:

██████
█    █
█   ?█
█  n!█
█@y  █
██████

Action  Stay : Prior=0.001 - Q-value=0.000 - Visit counts=0
Action  Up : Prior=0.002 - Q-value=0.000 - Visit counts=0
Action  Right : Prior=0.997 - Q-value=0.729 - Visit counts=50
Action with best prior:  4 (Right)
Action selected from MCTS:  4 (Right)
Best actions:  [4] ['Right']

██████
█    █
█   ?█
█  n!█
█ @  █
██████
blessed sword
Reward received:  0
Done:  False
Action  Stay : Prior=0.000 - Q-value=0.000 - Visit counts=0
Action  Up : Prior=0.000 - Q-value=0.000 - Visit counts=0
Action  Left : Prior=0.000 - Q-value=0.000 - Visit counts=0
Action  Right : Prior=1.000 - Q-value=0.810 - Visit counts=99
Action with best prior:  4 (Right)
Action selected from MCTS:  4 (Right)
Best actions:  [4] ['Right']

██████
█    █
█   ?█
█  n!█
█  @ █
██████
blessed sword
Reward received:  0
Done:  False
Action  Stay : Prior=0.000 - Q-value=0.000 - Visit counts=0
Action  Up : Prior=0.000 - Q-value=0.000

In [18]:
compare_pUCT_formulas(
    pv_net, 
    game_simulator, 
    episode_length,
    ucb_C,
    discount,
    max_actions,
    num_simulations,
    object_ids,
    render=True,
    debug_render=True,
    single_step=True,
    dir_noise=False
)

----------------------------------------
p-UCT old:

██████
█  @!█
█    █
█n  ?█
█y   █
██████

Valid actions as child:  [0 2 3 4]
Prior over the children:  [6.1928187e-02 0.0000000e+00 3.1541348e-02 9.0637797e-01 1.5249924e-04]
Terminal node:  False

Simulation 1 started.

██████
█  @!█
█    █
█n  ?█
█y   █
██████

actions:  [0 2 3 4]
value_terms:  [0 0 0 0]
exploration_terms:  [0. 0. 0. 0.]
ucb_values:  [0. 0. 0. 0.]
max_U:  0.0
mask:  [ True  True  True  True]
best_actions:  [0 2 3 4]
Current tree depth:  1
Action selected:  3 Left
Child node terminal:  False
Child node expanded:  False
Expansion phase started
valid_actions:  [0 2 3 4]
prior:  [0.3316533  0.         0.29976946 0.3617069  0.00687034]
reward:  0
done:  False
Valid actions as child:  [0 2 3 4]
Prior over the children:  [0.3316533  0.         0.29976946 0.3617069  0.00687034]
Terminal node:  False

██████
█ @ !█
█    █
█n  ?█
█y   █
██████

Value prediction/simulation phase started
Predicted/simulated value:  0.35472828


██████
█   !█
█    █
█@  ?█
█y   █
██████
gleaming sword
Value prediction/simulation phase started
Predicted/simulated value:  0.5686187744140625
Backpropagation phase started
Simulation 6 done.

Simulation 7 started.

██████
█  @!█
█    █
█n  ?█
█y   █
██████

actions:  [0 2 3 4]
value_terms:  [0.         0.         0.31973944 0.        ]
exploration_terms:  [8.63872709e-02 4.39988813e-02 4.77883145e-01 2.12730163e-04]
ucb_values:  [8.63872709e-02 4.39988813e-02 7.97622589e-01 2.12730163e-04]
max_U:  0.7976225885992247
mask:  [False False  True False]
best_actions:  [3]

██████
█ @ !█
█    █
█n  ?█
█y   █
██████

Current tree depth:  1
Action selected:  3 Left
Child node terminal:  False
Child node expanded:  True
actions:  [0 2 3 4]
value_terms:  [0.        0.        0.3553736 0.       ]
exploration_terms:  [0.44393989 0.40126127 0.19766102 0.00919641]
ucb_values:  [0.44393989 0.40126127 0.55303462 0.00919641]
max_U:  0.5530346240012164
mask:  [False False  True False]
best_actions:

Current tree depth:  6
Action selected:  4 Right
Child node terminal:  False
Child node expanded:  True
actions:  [0 1 3 4]
value_terms:  [0 0 0 0]
exploration_terms:  [0. 0. 0. 0.]
ucb_values:  [0. 0. 0. 0.]
max_U:  0.0
mask:  [ True  True  True  True]
best_actions:  [0 1 3 4]
Current tree depth:  7
Action selected:  0 Stay
Child node terminal:  False
Child node expanded:  False
Expansion phase started
valid_actions:  [0 1 3 4]
prior:  [0.00149575 0.95729417 0.         0.01875474 0.02245528]
reward:  0
done:  False
Valid actions as child:  [0 1 3 4]
Prior over the children:  [0.00149575 0.95729417 0.         0.01875474 0.02245528]
Terminal node:  False

██████
█   !█
█    █
█   ?█
█ @  █
██████
blessed sword
Value prediction/simulation phase started
Predicted/simulated value:  0.6839844584465027
Backpropagation phase started
Simulation 9 done.

Simulation 10 started.

██████
█  @!█
█    █
█n  ?█
█y   █
██████

actions:  [0 2 3 4]
value_terms:  [0.         0.         0.33362674 0.     

valid_actions:  [0 1 2 4]
prior:  [9.8140437e-05 9.8735631e-01 2.6330741e-05 0.0000000e+00 1.2519167e-02]
reward:  0
done:  False
Valid actions as child:  [0 1 2 4]
Prior over the children:  [9.8140437e-05 9.8735631e-01 2.6330741e-05 0.0000000e+00 1.2519167e-02]
Terminal node:  False

██████
█   !█
█@   █
█   ?█
█    █
██████
blessed sword
Value prediction/simulation phase started
Predicted/simulated value:  0.7727738618850708
Backpropagation phase started
Simulation 12 done.

Simulation 13 started.

██████
█  @!█
█    █
█n  ?█
█y   █
██████

actions:  [0 2 3 4]
value_terms:  [0.         0.         0.34314919 0.        ]
exploration_terms:  [9.91808350e-02 5.05149166e-02 4.02603052e-01 2.44234537e-04]
ucb_values:  [9.91808350e-02 5.05149166e-02 7.45752244e-01 2.44234537e-04]
max_U:  0.7457522437608857
mask:  [False False  True False]
best_actions:  [3]

██████
█ @ !█
█    █
█n  ?█
█y   █
██████

Current tree depth:  1
Action selected:  3 Left
Child node terminal:  False
Child node expa


██████
█ @ !█
█    █
█   ?█
█    █
██████
blessed sword
Value prediction/simulation phase started
Predicted/simulated value:  0.8999980092048645
Backpropagation phase started
Simulation 15 done.

Simulation 16 started.

██████
█  @!█
█    █
█n  ?█
█y   █
██████

actions:  [0 2 3 4]
value_terms:  [0.         0.         0.34580412 0.        ]
exploration_terms:  [1.03117195e-01 5.25197886e-02 3.77304580e-01 2.53927892e-04]
ucb_values:  [1.03117195e-01 5.25197886e-02 7.23108701e-01 2.53927892e-04]
max_U:  0.7231087014008499
mask:  [False False  True False]
best_actions:  [3]

██████
█ @ !█
█    █
█n  ?█
█y   █
██████

Current tree depth:  1
Action selected:  3 Left
Child node terminal:  False
Child node expanded:  True
actions:  [0 2 3 4]
value_terms:  [0.         0.         0.38633384 0.        ]
exploration_terms:  [0.54577379 0.49330526 0.15368785 0.01130594]
ucb_values:  [0.54577379 0.49330526 0.54002169 0.01130594]
max_U:  0.5457737886762746
mask:  [ True False False False]
best_act

ucb_values:  [0.09721405 0.39483418 0.94459585 0.00269634]
max_U:  0.944595849699678
mask:  [False False  True False]
best_actions:  [2]

██████
█   !█
█    █
█@  ?█
█y   █
██████
gleaming sword
Current tree depth:  4
Action selected:  2 Down
Child node terminal:  False
Child node expanded:  True
actions:  [0 1 2 4]
value_terms:  [0.         0.         0.55354454 0.        ]
exploration_terms:  [0.10912376 0.00127854 0.44206348 0.00909735]
ucb_values:  [0.10912376 0.00127854 0.99560802 0.00909735]
max_U:  0.995608019679727
mask:  [False False  True False]
best_actions:  [2]

██████
█   !█
█    █
█   ?█
█@   █
██████
blessed sword
Current tree depth:  5
Action selected:  2 Down
Child node terminal:  False
Child node expanded:  True
actions:  [0 1 4]
value_terms:  [0.         0.61379291 0.59999513]
exploration_terms:  [0.0555102  0.41207664 0.20870807]
ucb_values:  [0.0555102  1.02586954 0.80870319]
max_U:  1.0258695441269834
mask:  [False  True False]
best_actions:  [1]

██████
█   !█
█


██████
█  @!█
█    █
█n  ?█
█y   █
██████

actions:  [0 2 3 4]
value_terms:  [0.         0.         0.32509958 0.        ]
exploration_terms:  [1.10399978e-01 5.62290658e-02 3.29825588e-01 2.71861873e-04]
ucb_values:  [1.10399978e-01 5.62290658e-02 6.54925166e-01 2.71861873e-04]
max_U:  0.6549251656170073
mask:  [False False  True False]
best_actions:  [3]

██████
█ @ !█
█    █
█n  ?█
█y   █
██████

Current tree depth:  1
Action selected:  3 Left
Child node terminal:  False
Child node expanded:  True
actions:  [0 2 3 4]
value_terms:  [0.31913395 0.         0.38129562 0.        ]
exploration_terms:  [0.20763106 0.53081158 0.16012156 0.01216554]
ucb_values:  [0.52676502 0.53081158 0.54141719 0.01216554]
max_U:  0.5414171869705134
mask:  [False False  True False]
best_actions:  [3]

██████
█@  !█
█    █
█n  ?█
█y   █
██████

Current tree depth:  2
Action selected:  3 Left
Child node terminal:  False
Child node expanded:  True
actions:  [0 2 4]
value_terms:  [0.         0.43055933 0.31925


██████
█   !█
█ @  █
█n  ?█
█y   █
██████

Value prediction/simulation phase started
Predicted/simulated value:  0.4068705439567566
Backpropagation phase started
Simulation 26 done.

Simulation 27 started.

██████
█  @!█
█    █
█n  ?█
█y   █
██████

actions:  [0 2 3 4]
value_terms:  [0.         0.         0.32574402 0.        ]
exploration_terms:  [1.12427154e-01 5.72615496e-02 3.16672473e-01 2.76853829e-04]
ucb_values:  [1.12427154e-01 5.72615496e-02 6.42416490e-01 2.76853829e-04]
max_U:  0.6424164904887268
mask:  [False False  True False]
best_actions:  [3]

██████
█ @ !█
█    █
█n  ?█
█y   █
██████

Current tree depth:  1
Action selected:  3 Left
Child node terminal:  False
Child node expanded:  True
actions:  [0 2 3 4]
value_terms:  [0.31913395 0.36618349 0.37973725 0.        ]
exploration_terms:  [0.21165148 0.38260828 0.15388721 0.01240111]
ucb_values:  [0.53078543 0.74879177 0.53362446 0.01240111]
max_U:  0.7487917689382924
mask:  [False  True False False]
best_actions:  [2]

█

valid_actions:  [0 1 2 4]
prior:  [2.4347559e-03 3.0648735e-04 9.9627411e-01 0.0000000e+00 9.8462438e-04]
reward:  0
done:  False
Valid actions as child:  [0 1 2 4]
Prior over the children:  [2.4347559e-03 3.0648735e-04 9.9627411e-01 0.0000000e+00 9.8462438e-04]
Terminal node:  False

██████
█   !█
█@   █
█   ?█
█y   █
██████
gleaming sword
Value prediction/simulation phase started
Predicted/simulated value:  0.5220934152603149
Backpropagation phase started
Simulation 30 done.

Simulation 31 started.

██████
█  @!█
█    █
█n  ?█
█y   █
██████

actions:  [0 2 3 4]
value_terms:  [0.         0.         0.32474763 0.        ]
exploration_terms:  [1.14759249e-01 5.84493357e-02 3.01667019e-01 2.82596655e-04]
ucb_values:  [1.14759249e-01 5.84493357e-02 6.26414646e-01 2.82596655e-04]
max_U:  0.6264146459657866
mask:  [False False  True False]
best_actions:  [3]

██████
█ @ !█
█    █
█n  ?█
█y   █
██████

Current tree depth:  1
Action selected:  3 Left
Child node terminal:  False
Child node exp

actions:  [0 1 2 4]
value_terms:  [0.         0.         0.48425215 0.36618349]
exploration_terms:  [8.60271239e-02 3.42822067e-04 5.33255052e-01 1.68719646e-03]
ucb_values:  [8.60271239e-02 3.42822067e-04 1.01750720e+00 3.67870686e-01]
max_U:  1.017507200666028
mask:  [False False  True False]
best_actions:  [2]

██████
█   !█
█    █
█@  ?█
█y   █
██████
gleaming sword
Current tree depth:  4
Action selected:  2 Down
Child node terminal:  False
Child node expanded:  True
actions:  [0 1 2 4]
value_terms:  [0.         0.46988407 0.55059562 0.        ]
exploration_terms:  [0.09123229 0.00075584 0.58436461 0.00760579]
ucb_values:  [0.09123229 0.47063991 1.13496024 0.00760579]
max_U:  1.1349602368511866
mask:  [False False  True False]
best_actions:  [2]

██████
█   !█
█    █
█   ?█
█@   █
██████
blessed sword
Current tree depth:  5
Action selected:  2 Down
Child node terminal:  False
Child node expanded:  True
actions:  [0 1 4]
value_terms:  [0.58441162 0.         0.        ]
exploration_t

██████
█   !█
█ @  █
█n  ?█
█y   █
██████

Current tree depth:  2
Action selected:  2 Down
Child node terminal:  False
Child node expanded:  True
actions:  [0 1 2 3 4]
value_terms:  [0.         0.         0.         0.42204022 0.        ]
exploration_terms:  [0.13798637 0.00146454 0.03034677 0.41404391 0.00548795]
ucb_values:  [0.13798637 0.00146454 0.03034677 0.83608413 0.00548795]
max_U:  0.8360841299588518
mask:  [False False False  True False]
best_actions:  [3]

██████
█   !█
█@   █
█n  ?█
█y   █
██████

Current tree depth:  3
Action selected:  3 Left
Child node terminal:  False
Child node expanded:  True
actions:  [0 1 2 4]
value_terms:  [0.         0.         0.48470562 0.36618349]
exploration_terms:  [9.35797448e-02 3.72919610e-04 4.73626291e-01 1.83532132e-03]
ucb_values:  [9.35797448e-02 3.72919610e-04 9.58331907e-01 3.68018811e-01]
max_U:  0.9583319072153773
mask:  [False False  True False]
best_actions:  [2]

██████
█   !█
█    █
█@  ?█
█y   █
██████
gleaming sword
Current 

actions:  [0 2 3 4]
value_terms:  [0.         0.         0.72191473 0.855     ]
exploration_terms:  [4.26368636e-04 2.14479946e-05 1.03024799e-05 6.79511021e-01]
ucb_values:  [4.26368636e-04 2.14479946e-05 7.21925035e-01 1.53451102e+00]
max_U:  1.5345110208326087
mask:  [False False False  True]
best_actions:  [4]

██████
█  @!█
█    █
█   ?█
█    █
██████
blessed sword
Current tree depth:  10
Action selected:  4 Right
Child node terminal:  False
Child node expanded:  True
actions:  [0 2 3 4]
value_terms:  [0.9 0.  0.  0. ]
exploration_terms:  [1.29288373e-06 2.09799556e-07 8.56157616e-08 8.32552527e-01]
ucb_values:  [9.00001293e-01 2.09799556e-07 8.56157616e-08 8.32552527e-01]
max_U:  0.9000012928837281
mask:  [ True False False False]
best_actions:  [0]

██████
█  @!█
█    █
█   ?█
█    █
██████
blessed sword
Current tree depth:  11
Action selected:  0 Stay
Child node terminal:  False
Child node expanded:  True
actions:  [0 2 3 4]
value_terms:  [0 0 0 0]
exploration_terms:  [0. 0. 0.

actions:  [0 2 4]
value_terms:  [0.         0.42056015 0.31925545]
exploration_terms:  [0.0831411  0.31688027 0.21951238]
ucb_values:  [0.0831411  0.73744042 0.53876784]
max_U:  0.7374404177298437
mask:  [False  True False]
best_actions:  [2]

██████
█   !█
█@   █
█n  ?█
█y   █
██████

Current tree depth:  3
Action selected:  2 Down
Child node terminal:  False
Child node expanded:  True
actions:  [0 1 2 4]
value_terms:  [0.         0.39456024 0.4740398  0.        ]
exploration_terms:  [1.02687301e-01 2.89357741e-04 4.02574543e-01 2.84814430e-03]
ucb_values:  [0.1026873  0.3948496  0.87661434 0.00284814]
max_U:  0.8766143421414722
mask:  [False False  True False]
best_actions:  [2]

██████
█   !█
█    █
█@  ?█
█y   █
██████
gleaming sword
Current tree depth:  4
Action selected:  2 Down
Child node terminal:  False
Child node expanded:  True
actions:  [0 1 2 4]
value_terms:  [0.        0.        0.5234872 0.       ]
exploration_terms:  [0.11682503 0.00136878 0.399979   0.00973939]
ucb_val

ucb_values:  [4.84727184e-04 2.43836557e-05 7.21926445e-01 1.41491107e+00]
max_U:  1.4149110735945765
mask:  [False False False  True]
best_actions:  [4]

██████
█  @!█
█    █
█   ?█
█    █
██████
blessed sword
Current tree depth:  10
Action selected:  4 Right
Child node terminal:  False
Child node expanded:  True
actions:  [0 2 3 4]
value_terms:  [0.81449149 0.         0.         1.        ]
exploration_terms:  [1.49289354e-06 2.96701377e-07 1.21078971e-07 8.32552527e-01]
ucb_values:  [8.14492986e-01 2.96701377e-07 1.21078971e-07 1.83255253e+00]
max_U:  1.8325525269445797
mask:  [False False False  True]
best_actions:  [4]

██████
█   @█
█    █
█   ?█
█    █
██████
blessed sword
Current tree depth:  11
Action selected:  4 Right
Child node terminal:  True
Child node expanded:  True
Expansion phase started
Value prediction/simulation phase started
Predicted/simulated value:  0
Backpropagation phase started
Simulation 44 done.

Simulation 45 started.

██████
█  @!█
█    █
█n  ?█
█y   █
█

actions:  [0 1 2 4]
value_terms:  [0.         0.68213252 0.         0.78466802]
exploration_terms:  [1.51971831e-04 4.83491685e-01 4.07735184e-05 1.37080454e-02]
ucb_values:  [1.51971831e-04 1.16562420e+00 4.07735184e-05 7.98376062e-01]
max_U:  1.1656242017435057
mask:  [False  True False False]
best_actions:  [1]

██████
█@  !█
█    █
█   ?█
█    █
██████
blessed sword
Current tree depth:  8
Action selected:  1 Up
Child node terminal:  False
Child node expanded:  True
actions:  [0 2 4]
value_terms:  [0.         0.         0.75239971]
exploration_terms:  [2.01586786e-04 3.96677722e-05 4.94020810e-01]
ucb_values:  [2.01586786e-04 3.96677722e-05 1.24642052e+00]
max_U:  1.246420520517238
mask:  [False False  True]
best_actions:  [4]

██████
█ @ !█
█    █
█   ?█
█    █
██████
blessed sword
Current tree depth:  9
Action selected:  4 Right
Child node terminal:  False
Child node expanded:  True
actions:  [0 2 3 4]
value_terms:  [0.         0.         0.72191473 0.84434745]
exploration_terms: 

mask:  [False False False  True]
best_actions:  [4]

██████
█  @!█
█    █
█   ?█
█    █
██████
blessed sword
Current tree depth:  10
Action selected:  4 Right
Child node terminal:  False
Child node expanded:  True
actions:  [0 2 3 4]
value_terms:  [0.81449149 0.         0.         1.        ]
exploration_terms:  [1.76873391e-06 3.51522578e-07 1.43450605e-07 6.23842994e-01]
ucb_values:  [8.14493262e-01 3.51522578e-07 1.43450605e-07 1.62384299e+00]
max_U:  1.6238429940784331
mask:  [False False False  True]
best_actions:  [4]

██████
█   @█
█    █
█   ?█
█    █
██████
blessed sword
Current tree depth:  11
Action selected:  4 Right
Child node terminal:  True
Child node expanded:  True
Expansion phase started
Value prediction/simulation phase started
Predicted/simulated value:  0
Backpropagation phase started
Simulation 49 done.

Simulation 50 started.

██████
█  @!█
█    █
█n  ?█
█y   █
██████

actions:  [0 2 3 4]
value_terms:  [0.         0.         0.32855271 0.        ]
exploration_ter

Valid actions as child:  [0 1 2 4]
Prior over the children:  [6.1670009e-02 2.4575784e-04 9.3637371e-01 0.0000000e+00 1.7104850e-03]
Terminal node:  False

██████
█   !█
█@   █
█n  ?█
█y   █
██████

Value prediction/simulation phase started
Predicted/simulated value:  0.44550737738609314
Backpropagation phase started
Simulation 3 done.

Simulation 4 started.

██████
█  @!█
█    █
█n  ?█
█y   █
██████

actions:  [0 2 3 4]
value_terms:  [0.         0.         0.33304485 0.        ]
exploration_terms:  [1.23856373e-01 6.30826950e-02 4.53188986e-01 3.04998481e-04]
ucb_values:  [1.23856373e-01 6.30826950e-02 7.86233835e-01 3.04998481e-04]
max_U:  0.7862338352501392
mask:  [False False  True False]
best_actions:  [3]

██████
█ @ !█
█    █
█n  ?█
█y   █
██████

Current tree depth:  1
Action selected:  3 Left
Child node terminal:  False
Child node expanded:  True
actions:  [0 2 3 4]
value_terms:  [0.         0.         0.37771061 0.        ]
exploration_terms:  [0.57444036 0.51921594 0.2088315

actions:  [0 2 3 4]
value_terms:  [0.33304485 0.36352223 0.38998143 0.        ]
exploration_terms:  [0.24873997 0.29976946 0.27128018 0.02061103]
ucb_values:  [0.58178482 0.66329169 0.66126162 0.02061103]
max_U:  0.6632916937768459
mask:  [False  True False False]
best_actions:  [2]

██████
█   !█
█ @  █
█n  ?█
█y   █
██████

Current tree depth:  2
Action selected:  2 Down
Child node terminal:  False
Child node expanded:  True
actions:  [0 1 2 3 4]
value_terms:  [0.         0.         0.         0.40095664 0.        ]
exploration_terms:  [0.12601901 0.00133752 0.02771484 0.62706511 0.00501199]
ucb_values:  [0.12601901 0.00133752 0.02771484 1.02802175 0.00501199]
max_U:  1.0280217507711134
mask:  [False False False  True False]
best_actions:  [3]

██████
█   !█
█@   █
█n  ?█
█y   █
██████

Current tree depth:  3
Action selected:  3 Left
Child node terminal:  False
Child node expanded:  True
actions:  [0 1 2 4]
value_terms:  [0 0 0 0]
exploration_terms:  [6.16700090e-02 2.45757838e-04 9.

actions:  [0 1 2 4]
value_terms:  [0.        0.        0.5117569 0.       ]
exploration_terms:  [8.72145632e-02 3.47554067e-04 6.62116201e-01 2.41899104e-03]
ucb_values:  [8.72145632e-02 3.47554067e-04 1.17387310e+00 2.41899104e-03]
max_U:  1.1738730974855827
mask:  [False False  True False]
best_actions:  [2]

██████
█   !█
█    █
█@  ?█
█y   █
██████
gleaming sword
Current tree depth:  4
Action selected:  2 Down
Child node terminal:  False
Child node expanded:  True
actions:  [0 1 2 4]
value_terms:  [0 0 0 0]
exploration_terms:  [7.19136745e-02 8.42573994e-04 9.21248496e-01 5.99524938e-03]
ucb_values:  [7.19136745e-02 8.42573994e-04 9.21248496e-01 5.99524938e-03]
max_U:  0.9212484955787659
mask:  [False False  True False]
best_actions:  [2]
Current tree depth:  5
Action selected:  2 Down
Child node terminal:  False
Child node expanded:  False
Expansion phase started
valid_actions:  [0 1 4]
prior:  [0.0374486  0.68095183 0.         0.         0.28159958]
reward:  0
done:  False
Valid 

actions:  [0 1 2 3 4]
value_terms:  [0.         0.         0.         0.44913769 0.        ]
exploration_terms:  [0.1782178  0.00189155 0.03919471 0.44340199 0.00708802]
ucb_values:  [0.1782178  0.00189155 0.03919471 0.89253968 0.00708802]
max_U:  0.8925396816134452
mask:  [False False False  True False]
best_actions:  [3]

██████
█   !█
█@   █
█n  ?█
█y   █
██████

Current tree depth:  3
Action selected:  3 Left
Child node terminal:  False
Child node expanded:  True
actions:  [0 1 2 4]
value_terms:  [0.         0.         0.52580913 0.        ]
exploration_terms:  [1.06815589e-01 4.25665061e-04 5.40615614e-01 2.96264687e-03]
ucb_values:  [1.06815589e-01 4.25665061e-04 1.06642474e+00 2.96264687e-03]
max_U:  1.0664247406899994
mask:  [False False  True False]
best_actions:  [2]

██████
█   !█
█    █
█@  ?█
█y   █
██████
gleaming sword
Current tree depth:  4
Action selected:  2 Down
Child node terminal:  False
Child node expanded:  True
actions:  [0 1 2 4]
value_terms:  [0.         0.   

mask:  [False False False  True False]
best_actions:  [3]

██████
█   !█
█@   █
█n  ?█
█y   █
██████

Current tree depth:  3
Action selected:  3 Left
Child node terminal:  False
Child node expanded:  True
actions:  [0 1 2 4]
value_terms:  [0.        0.        0.5212326 0.       ]
exploration_terms:  [0.12334002 0.00049152 0.46818686 0.00342097]
ucb_values:  [1.23340018e-01 4.91515675e-04 9.89419460e-01 3.42096994e-03]
max_U:  0.9894194595813751
mask:  [False False  True False]
best_actions:  [2]

██████
█   !█
█    █
█@  ?█
█y   █
██████
gleaming sword
Current tree depth:  4
Action selected:  2 Down
Child node terminal:  False
Child node expanded:  True
actions:  [0 1 2 4]
value_terms:  [0.         0.         0.58441162 0.        ]
exploration_terms:  [0.12455814 0.00145938 0.53188307 0.01038408]
ucb_values:  [0.12455814 0.00145938 1.11629469 0.01038408]
max_U:  1.1162946868145953
mask:  [False False  True False]
best_actions:  [2]

██████
█   !█
█    █
█   ?█
█@   █
██████
blessed swo

value_terms:  [0.33190705 0.39828264 0.40047186 0.        ]
exploration_terms:  [0.25330439 0.17171453 0.18417215 0.03148387]
ucb_values:  [0.58521144 0.56999717 0.584644   0.03148387]
max_U:  0.5852114366035808
mask:  [ True False False False]
best_actions:  [0]

██████
█ @ !█
█    █
█n  ?█
█y   █
██████

Current tree depth:  2
Action selected:  0 Stay
Child node terminal:  False
Child node expanded:  True
actions:  [0 2 3 4]
value_terms:  [0.31925545 0.         0.38998143 0.        ]
exploration_terms:  [0.37079966 0.67030489 0.20220031 0.01536255]
ucb_values:  [0.69005511 0.67030489 0.59218175 0.01536255]
max_U:  0.6900551118164033
mask:  [ True False False False]
best_actions:  [0]

██████
█ @ !█
█    █
█n  ?█
█y   █
██████

Current tree depth:  3
Action selected:  0 Stay
Child node terminal:  False
Child node expanded:  True
actions:  [0 2 3 4]
value_terms:  [0 0 0 0]
exploration_terms:  [0.3316533  0.29976946 0.36170691 0.00687034]
ucb_values:  [0.3316533  0.29976946 0.36170691 0

mask:  [False False  True]
best_actions:  [4]
Current tree depth:  9
Action selected:  4 Right
Child node terminal:  False
Child node expanded:  False
Expansion phase started
valid_actions:  [0 2 3 4]
prior:  [3.6212418e-04 0.0000000e+00 1.8216249e-05 1.2374539e-05 9.9960726e-01]
reward:  0
done:  False
Valid actions as child:  [0 2 3 4]
Prior over the children:  [3.6212418e-04 0.0000000e+00 1.8216249e-05 1.2374539e-05 9.9960726e-01]
Terminal node:  False

██████
█ @ !█
█    █
█   ?█
█    █
██████
blessed sword
Value prediction/simulation phase started
Predicted/simulated value:  0.8999980092048645
Backpropagation phase started
Simulation 24 done.

Simulation 25 started.

██████
█  @!█
█    █
█n  ?█
█y   █
██████

actions:  [0 2 3 4]
value_terms:  [0.         0.         0.34126691 0.        ]
exploration_terms:  [0.30964093 0.15770674 0.18127559 0.0007625 ]
ucb_values:  [0.30964093 0.15770674 0.5225425  0.0007625 ]
max_U:  0.5225424995012395
mask:  [False False  True False]
best_action

actions:  [0 1 2 4]
value_terms:  [0.         0.72545656 0.         0.        ]
exploration_terms:  [1.69984224e-04 5.70050429e-01 4.56061814e-05 2.16838330e-02]
ucb_values:  [1.69984224e-04 1.29550699e+00 4.56061814e-05 2.16838330e-02]
max_U:  1.2955069885045516
mask:  [False  True False False]
best_actions:  [1]

██████
█@  !█
█    █
█   ?█
█    █
██████
blessed sword
Current tree depth:  8
Action selected:  1 Up
Child node terminal:  False
Child node expanded:  True
actions:  [0 2 4]
value_terms:  [0.         0.         0.80999821]
exploration_terms:  [1.92326812e-04 3.78456165e-05 7.06991636e-01]
ucb_values:  [1.92326812e-04 3.78456165e-05 1.51698984e+00]
max_U:  1.5169898442807939
mask:  [False False  True]
best_actions:  [4]

██████
█ @ !█
█    █
█   ?█
█    █
██████
blessed sword
Current tree depth:  9
Action selected:  4 Right
Child node terminal:  False
Child node expanded:  True
actions:  [0 2 3 4]
value_terms:  [0 0 0 0]
exploration_terms:  [3.62124178e-04 1.82162494e-05 1.2

actions:  [0 2 3 4]
value_terms:  [0.  0.  0.  0.9]
exploration_terms:  [7.24248355e-04 3.64324987e-05 2.47490789e-05 4.99803632e-01]
ucb_values:  [7.24248355e-04 3.64324987e-05 2.47490789e-05 1.39980363e+00]
max_U:  1.3998036324977874
mask:  [False False False  True]
best_actions:  [4]

██████
█  @!█
█    █
█   ?█
█    █
██████
blessed sword
Current tree depth:  10
Action selected:  4 Right
Child node terminal:  False
Child node expanded:  True
actions:  [0 2 3 4]
value_terms:  [0. 0. 0. 1.]
exploration_terms:  [3.80384108e-06 4.36468052e-07 1.78115462e-07 5.77348824e-01]
ucb_values:  [3.80384108e-06 4.36468052e-07 1.78115462e-07 1.57734882e+00]
max_U:  1.577348823853802
mask:  [False False False  True]
best_actions:  [4]

██████
█   @█
█    █
█   ?█
█    █
██████
blessed sword
Current tree depth:  11
Action selected:  4 Right
Child node terminal:  True
Child node expanded:  True
Expansion phase started
Value prediction/simulation phase started
Predicted/simulated value:  0
Backpropag

Valid actions as child:  [0 1 3 4]
Prior over the children:  [0.00149575 0.95729417 0.         0.01875474 0.02245528]
Terminal node:  False

██████
█   !█
█    █
█   ?█
█ @  █
██████
blessed sword
Value prediction/simulation phase started
Predicted/simulated value:  0.6839844584465027
Backpropagation phase started
Simulation 32 done.

Simulation 33 started.

██████
█  @!█
█    █
█n  ?█
█y   █
██████

actions:  [0 2 3 4]
value_terms:  [0.         0.         0.34019379 0.        ]
exploration_terms:  [0.35575035 0.18119125 0.15778015 0.00087604]
ucb_values:  [0.35575035 0.18119125 0.49797394 0.00087604]
max_U:  0.4979739441419143
mask:  [False False  True False]
best_actions:  [3]

██████
█ @ !█
█    █
█n  ?█
█y   █
██████

Current tree depth:  1
Action selected:  3 Left
Child node terminal:  False
Child node expanded:  True
actions:  [0 2 3 4]
value_terms:  [0.32918363 0.39502391 0.39671868 0.        ]
exploration_terms:  [0.20845715 0.15415929 0.14615166 0.03886453]
ucb_values:  [0.537


██████
█   !█
█@   █
█n  ?█
█y   █
██████

Value prediction/simulation phase started
Predicted/simulated value:  0.44550737738609314
Backpropagation phase started
Simulation 35 done.

Simulation 36 started.

██████
█  @!█
█    █
█n  ?█
█y   █
██████

actions:  [0 2 3 4]
value_terms:  [0.         0.         0.33590127 0.        ]
exploration_terms:  [0.37156912 0.18924809 0.151063   0.000915  ]
ucb_values:  [0.37156912 0.18924809 0.48696426 0.000915  ]
max_U:  0.4869642619644185
mask:  [False False  True False]
best_actions:  [3]

██████
█ @ !█
█    █
█n  ?█
█y   █
██████

Current tree depth:  1
Action selected:  3 Left
Child node terminal:  False
Child node expanded:  True
actions:  [0 2 3 4]
value_terms:  [0.32508516 0.39433269 0.38890519 0.        ]
exploration_terms:  [0.19620874 0.14778834 0.14265913 0.0406455 ]
ucb_values:  [0.5212939  0.54212103 0.53156432 0.0406455 ]
max_U:  0.5421210277488051
mask:  [False  True False False]
best_actions:  [2]

██████
█   !█
█ @  █
█n  ?█
█y  

Prior over the children:  [7.2688999e-04 6.6966198e-02 1.0355999e-05 1.1315590e-03 9.3116492e-01]
Terminal node:  False

██████
█   !█
█    █
█ @ ?█
█    █
██████
blessed sword
Value prediction/simulation phase started
Predicted/simulated value:  0.778236985206604
Backpropagation phase started
Simulation 37 done.

Simulation 38 started.

██████
█  @!█
█    █
█n  ?█
█y   █
██████

actions:  [0 2 3 4]
value_terms:  [0.         0.         0.33722841 0.        ]
exploration_terms:  [0.38175098 0.19443392 0.14703392 0.00094007]
ucb_values:  [0.38175098 0.19443392 0.48426233 0.00094007]
max_U:  0.4842623310653472
mask:  [False False  True False]
best_actions:  [3]

██████
█ @ !█
█    █
█n  ?█
█y   █
██████

Current tree depth:  1
Action selected:  3 Left
Child node terminal:  False
Child node expanded:  True
actions:  [0 2 3 4]
value_terms:  [0.32508516 0.39375667 0.39055065 0.        ]
exploration_terms:  [0.20173682 0.14026357 0.13751108 0.04179067]
ucb_values:  [0.52682198 0.53402025 0.52

valid_actions:  [0 1 2 3 4]
prior:  [7.2688999e-04 6.6966198e-02 1.0355999e-05 1.1315590e-03 9.3116492e-01]
reward:  0
done:  False
Valid actions as child:  [0 1 2 3 4]
Prior over the children:  [7.2688999e-04 6.6966198e-02 1.0355999e-05 1.1315590e-03 9.3116492e-01]
Terminal node:  False

██████
█   !█
█    █
█ @ ?█
█    █
██████
blessed sword
Value prediction/simulation phase started
Predicted/simulated value:  0.778236985206604
Backpropagation phase started
Simulation 41 done.

Simulation 42 started.

██████
█  @!█
█    █
█n  ?█
█y   █
██████

actions:  [0 2 3 4]
value_terms:  [0.         0.         0.33747744 0.        ]
exploration_terms:  [0.40134052 0.20441129 0.13985716 0.00098831]
ucb_values:  [0.40134052 0.20441129 0.4773346  0.00098831]
max_U:  0.4773345951588359
mask:  [False False  True False]
best_actions:  [3]

██████
█ @ !█
█    █
█n  ?█
█y   █
██████

Current tree depth:  1
Action selected:  3 Left
Child node terminal:  False
Child node expanded:  True
actions:  [0 2 3 

best_actions:  [2]

██████
█   !█
█    █
█@  ?█
█y   █
██████
gleaming sword
Current tree depth:  4
Action selected:  2 Down
Child node terminal:  False
Child node expanded:  True
actions:  [0 1 2 4]
value_terms:  [0.         0.         0.54881975 0.        ]
exploration_terms:  [0.24911628 0.00291876 0.26594153 0.02076815]
ucb_values:  [0.24911628 0.00291876 0.81476128 0.02076815]
max_U:  0.8147612810101752
mask:  [False False  True False]
best_actions:  [2]

██████
█   !█
█    █
█   ?█
█@   █
██████
blessed sword
Current tree depth:  5
Action selected:  2 Down
Child node terminal:  False
Child node expanded:  True
actions:  [0 1 4]
value_terms:  [0.         0.59941793 0.62297899]
exploration_terms:  [0.12420295 0.25094019 0.31132005]
ucb_values:  [0.12420295 0.85035812 0.93429904]
max_U:  0.9342990360094726
mask:  [False False  True]
best_actions:  [4]

██████
█   !█
█    █
█   ?█
█ @  █
██████
blessed sword
Current tree depth:  6
Action selected:  4 Right
Child node terminal:  False


██████
█ @ !█
█    █
█n  ?█
█y   █
██████

Current tree depth:  2
Action selected:  0 Stay
Child node terminal:  False
Child node expanded:  True
actions:  [0 2 3 4]
value_terms:  [0.33304485 0.38052252 0.401808   0.        ]
exploration_terms:  [0.27499239 0.24855571 0.23992922 0.02278635]
ucb_values:  [0.60803724 0.62907822 0.64173722 0.02278635]
max_U:  0.6417372232769176
mask:  [False False  True False]
best_actions:  [3]

██████
█@  !█
█    █
█n  ?█
█y   █
██████

Current tree depth:  3
Action selected:  3 Left
Child node terminal:  False
Child node expanded:  True
actions:  [0 2 4]
value_terms:  [0.         0.44913769 0.        ]
exploration_terms:  [0.09780677 0.38424906 0.36519697]
ucb_values:  [0.09780677 0.83338675 0.36519697]
max_U:  0.8333867506384849
mask:  [False  True False]
best_actions:  [2]

██████
█   !█
█@   █
█n  ?█
█y   █
██████

Current tree depth:  4
Action selected:  2 Down
Child node terminal:  False
Child node expanded:  True
actions:  [0 1 2 4]
value_terms:

value_terms:  [0.33880851 0.3947735  0.38620425 0.        ]
exploration_terms:  [0.17490003 0.12088917 0.13051251 0.0471007 ]
ucb_values:  [0.51370855 0.51566267 0.51671675 0.0471007 ]
max_U:  0.516716753396066
mask:  [False False  True False]
best_actions:  [3]

██████
█@  !█
█    █
█n  ?█
█y   █
██████

Current tree depth:  2
Action selected:  3 Left
Child node terminal:  False
Child node expanded:  True
actions:  [0 2 4]
value_terms:  [0.         0.440755   0.33717984]
exploration_terms:  [0.2074795  0.20377884 0.25823326]
ucb_values:  [0.2074795  0.64453384 0.59541309]
max_U:  0.6445338351290114
mask:  [False  True False]
best_actions:  [2]

██████
█   !█
█@   █
█n  ?█
█y   █
██████

Current tree depth:  3
Action selected:  2 Down
Child node terminal:  False
Child node expanded:  True
actions:  [0 1 2 4]
value_terms:  [0.         0.         0.49288637 0.        ]
exploration_terms:  [0.23884692 0.00095182 0.24177065 0.00662468]
ucb_values:  [0.23884692 0.00095182 0.73465703 0.00662

Action  Stay : Prior=0.062 - Q-value=0.000 - Visit counts=0
Action  Down : Prior=0.032 - Q-value=0.000 - Visit counts=0
Action  Left : Prior=0.906 - Q-value=0.339 - Visit counts=50
Action  Right : Prior=0.000 - Q-value=0.000 - Visit counts=0
Action with best prior:  3 (Left)
Action selected from MCTS:  3 (Left)
Best actions:  [3, 2] ['Left', 'Down']

██████
█ @ !█
█    █
█n  ?█
█y   █
██████

Reward received:  0
Done:  False
----------------------------------------
p-UCT Rosin:

██████
█  @!█
█    █
█n  ?█
█y   █
██████

Valid actions as child:  [0 2 3 4]
Prior over the children:  [6.1928187e-02 0.0000000e+00 3.1541348e-02 9.0637797e-01 1.5249924e-04]
Weights over the children:  [0.1789233  0.         0.12769175 0.6845062  0.00887885]
Terminal node:  False

Simulation 1 started.

██████
█  @!█
█    █
█n  ?█
█y   █
██████

actions:  [0 2 3 4]
value_terms:  [1 1 1 1]
exploration_terms:  [ -11.1779744   -15.66271953   -2.92181439 -225.25450523]
ucb_values:  [ -10.1779744   -14.66271953   

actions:  [0 1 4]
value_terms:  [1 1 1]
exploration_terms:  [-16.0128439   -3.75515673  -5.83942722]
ucb_values:  [-15.0128439   -2.75515673  -4.83942722]
max_U:  -2.7551567271489725
mask:  [False  True False]
best_actions:  [1]
Current tree depth:  6
Action selected:  1 Up
Child node terminal:  False
Child node expanded:  False
Expansion phase started
valid_actions:  [0 1 2 4]
prior:  [5.6809408e-04 9.6889561e-01 2.9525207e-04 0.0000000e+00 3.0240977e-02]
reward:  0
done:  False
Valid actions as child:  [0 1 2 4]
Prior over the children:  [5.6809408e-04 9.6889561e-01 2.9525207e-04 0.0000000e+00 3.0240977e-02]
Weights over the children:  [0.01987483 0.82078934 0.01432814 0.         0.14500773]
Terminal node:  False

██████
█   !█
█    █
█@  ?█
█    █
██████
blessed sword
Value prediction/simulation phase started
Predicted/simulated value:  0.7024410963058472
Backpropagation phase started
Simulation 6 done.

Simulation 7 started.

██████
█  @!█
█    █
█n  ?█
█y   █
██████

actions:  [0 

exploration_terms:  [-7.27227538 -1.13702344 -3.763494  ]
ucb_values:  [-6.27227538 -0.68854288 -2.763494  ]
max_U:  -0.6885428786416108
mask:  [False  True False]
best_actions:  [2]

██████
█   !█
█@   █
█n  ?█
█y   █
██████

Current tree depth:  3
Action selected:  2 Down
Child node terminal:  False
Child node expanded:  True
actions:  [0 1 2 4]
value_terms:  [1.        1.        0.5088726 1.       ]
exploration_terms:  [ -5.60269451 -88.75242399  -0.70467299 -33.6414178 ]
ucb_values:  [ -4.60269451 -87.75242399  -0.19580039 -32.6414178 ]
max_U:  -0.19580039187539844
mask:  [False False  True False]
best_actions:  [2]

██████
█   !█
█    █
█@  ?█
█y   █
██████
gleaming sword
Current tree depth:  4
Action selected:  2 Down
Child node terminal:  False
Child node expanded:  True
actions:  [0 1 2 4]
value_terms:  [1.         1.         0.56461281 1.        ]
exploration_terms:  [ -5.64645004 -52.16476145  -0.80070736 -19.55589023]
ucb_values:  [ -4.64645004 -51.16476145  -0.23609455 -18.

value_terms:  [1.  1.  1.  0.9]
exploration_terms:  [-6.35196690e+01 -2.83209324e+02 -3.43615274e+02 -1.89321903e-01]
ucb_values:  [ -62.51966897 -282.20932383 -342.6152742     0.7106781 ]
max_U:  0.7106780971985941
mask:  [False False False  True]
best_actions:  [4]

██████
█  @!█
█    █
█   ?█
█    █
██████
blessed sword
Current tree depth:  10
Action selected:  4 Right
Child node terminal:  False
Child node expanded:  True
actions:  [0 2 3 4]
value_terms:  [1 1 1 1]
exploration_terms:  [-1.35269011e+03 -3.99331247e+03 -6.25113212e+03 -2.00460922e+00]
ucb_values:  [-1.35169011e+03 -3.99231247e+03 -6.25013212e+03 -1.00460922e+00]
max_U:  -1.0046092155635105
mask:  [False False False  True]
best_actions:  [4]
Current tree depth:  11
Action selected:  4 Right
Child node terminal:  False
Child node expanded:  False
Expansion phase started
valid_actions:  [0 2 3]
prior:  [0.03875165 0.         0.6093712  0.35187712 0.        ]
reward:  1
done:  True
Valid actions as child:  [0 2 3]
Prior 


██████
█  @!█
█    █
█n  ?█
█y   █
██████

actions:  [0 2 3 4]
value_terms:  [1.         1.         0.35364358 1.        ]
exploration_terms:  [ -4.85314766  -6.80029207  -0.71674468 -97.79887981]
ucb_values:  [ -3.85314766  -5.80029207  -0.36310109 -96.79887981]
max_U:  -0.3631010909211096
mask:  [False False  True False]
best_actions:  [3]

██████
█ @ !█
█    █
█n  ?█
█y   █
██████

Current tree depth:  1
Action selected:  3 Left
Child node terminal:  False
Child node expanded:  True
actions:  [0 2 3 4]
value_terms:  [1.        1.        0.3961214 1.       ]
exploration_terms:  [ -2.78859356  -2.93314595  -2.10399968 -19.37484096]
ucb_values:  [ -1.78859356  -1.93314595  -1.70787828 -18.37484096]
max_U:  -1.7078782805892514
mask:  [False False  True False]
best_actions:  [3]

██████
█@  !█
█    █
█n  ?█
█y   █
██████

Current tree depth:  2
Action selected:  3 Left
Child node terminal:  False
Child node expanded:  True
actions:  [0 2 4]
value_terms:  [1.         0.44029259 1.       

value_terms:  [1.        1.        0.8099997]
exploration_terms:  [ -91.93811264 -207.2562795    -0.37476397]
ucb_values:  [ -90.93811264 -206.2562795     0.43523574]
max_U:  0.43523573558140793
mask:  [False False  True]
best_actions:  [4]

██████
█ @ !█
█    █
█   ?█
█    █
██████
blessed sword
Current tree depth:  9
Action selected:  4 Right
Child node terminal:  False
Child node expanded:  True
actions:  [0 2 3 4]
value_terms:  [1.  1.  1.  0.9]
exploration_terms:  [ -58.96234158 -262.88998607 -318.96200813   -0.38908501]
ucb_values:  [ -57.96234158 -261.88998607 -317.96200813    0.51091499]
max_U:  0.5109149867302508
mask:  [False False False  True]
best_actions:  [4]

██████
█  @!█
█    █
█   ?█
█    █
██████
blessed sword
Current tree depth:  10
Action selected:  4 Right
Child node terminal:  False
Child node expanded:  True
actions:  [0 2 3 4]
value_terms:  [1. 1. 1. 1.]
exploration_terms:  [-7.67450595e+02 -2.26561132e+03 -3.54658840e+03 -3.60439929e-01]
ucb_values:  [-7.66450

Current tree depth:  5
Action selected:  2 Down
Child node terminal:  False
Child node expanded:  True
actions:  [0 1 4]
value_terms:  [1.         0.59698304 1.        ]
exploration_terms:  [-7.2867335  -1.1266959  -2.65726377]
ucb_values:  [-6.2867335  -0.52971286 -1.65726377]
max_U:  -0.5297128557160271
mask:  [False  True False]
best_actions:  [1]

██████
█   !█
█    █
█@  ?█
█    █
██████
blessed sword
Current tree depth:  6
Action selected:  1 Up
Child node terminal:  False
Child node expanded:  True
actions:  [0 1 2 4]
value_terms:  [1.         0.65940183 1.         1.        ]
exploration_terms:  [-46.98350605  -0.53793511 -65.17169738  -6.43958094]
ucb_values:  [-45.98350605   0.12146672 -64.17169738  -5.43958094]
max_U:  0.12146672314522677
mask:  [False  True False False]
best_actions:  [1]

██████
█   !█
█@   █
█   ?█
█    █
██████
blessed sword
Current tree depth:  7
Action selected:  1 Up
Child node terminal:  False
Child node expanded:  True
actions:  [0 1 2 4]
value_term

actions:  [0 2 3 4]
value_terms:  [1.         1.         0.39356231 1.        ]
exploration_terms:  [ -2.51569565  -2.64610183  -1.9039092  -17.47877634]
ucb_values:  [ -1.51569565  -1.64610183  -1.51034689 -16.47877634]
max_U:  -1.5103468855520297
mask:  [False False  True False]
best_actions:  [3]

██████
█@  !█
█    █
█n  ?█
█y   █
██████

Current tree depth:  2
Action selected:  3 Left
Child node terminal:  False
Child node expanded:  True
actions:  [0 2 4]
value_terms:  [1.         0.43722216 1.        ]
exploration_terms:  [-5.63083917 -0.90505717 -2.91403011]
ucb_values:  [-4.63083917 -0.46783501 -1.91403011]
max_U:  -0.46783500982862036
mask:  [False  True False]
best_actions:  [2]

██████
█   !█
█@   █
█n  ?█
█y   █
██████

Current tree depth:  3
Action selected:  2 Down
Child node terminal:  False
Child node expanded:  True
actions:  [0 1 2 4]
value_terms:  [1.         1.         0.48848873 1.        ]
exploration_terms:  [ -4.26791193 -67.6080997   -0.56873296 -25.62670659]


actions:  [0 2 3 4]
value_terms:  [1.         1.         0.37771061 1.        ]
exploration_terms:  [ -3.79908576  -3.99601907  -2.73011153 -26.39562951]
ucb_values:  [ -2.79908576  -2.99601907  -2.35240093 -25.39562951]
max_U:  -2.3524009256068177
mask:  [False False  True False]
best_actions:  [3]

██████
█@  !█
█    █
█n  ?█
█y   █
██████

Current tree depth:  3
Action selected:  3 Left
Child node terminal:  False
Child node expanded:  True
actions:  [0 2 4]
value_terms:  [1.         0.40095664 1.        ]
exploration_terms:  [-8.11999342 -1.02868106 -4.20219875]
ucb_values:  [-7.11999342 -0.62772442 -3.20219875]
max_U:  -0.627724418641746
mask:  [False  True False]
best_actions:  [2]

██████
█   !█
█@   █
█n  ?█
█y   █
██████

Current tree depth:  4
Action selected:  2 Down
Child node terminal:  False
Child node expanded:  True
actions:  [0 1 2 4]
value_terms:  [1 1 1 1]
exploration_terms:  [ -10.25256932 -162.41120713   -2.63114688  -61.56162309]
ucb_values:  [  -9.25256932 -161.4


██████
█   !█
█@   █
█n  ?█
█y   █
██████

Current tree depth:  4
Action selected:  2 Down
Child node terminal:  False
Child node expanded:  True
actions:  [0 1 2 4]
value_terms:  [1.        1.        0.5212326 1.       ]
exploration_terms:  [ -6.03573893 -95.61229152  -0.71641474 -36.24163602]
ucb_values:  [ -5.03573893 -94.61229152  -0.19518214 -35.24163602]
max_U:  -0.1951821360632502
mask:  [False False  True False]
best_actions:  [2]

██████
█   !█
█    █
█@  ?█
█y   █
██████
gleaming sword
Current tree depth:  5
Action selected:  2 Down
Child node terminal:  False
Child node expanded:  True
actions:  [0 1 2 4]
value_terms:  [1.         1.         0.58441162 1.        ]
exploration_terms:  [ -6.02261328 -55.63994769  -0.77496104 -20.85869232]
ucb_values:  [ -5.02261328 -54.63994769  -0.19054942 -19.85869232]
max_U:  -0.19054942379114692
mask:  [False False  True False]
best_actions:  [2]

██████
█   !█
█    █
█   ?█
█@   █
██████
blessed sword
Current tree depth:  6
Action select

Action selected:  4 Right
Child node terminal:  False
Child node expanded:  False
Expansion phase started
valid_actions:  [0 2 3 4]
prior:  [3.6212418e-04 0.0000000e+00 1.8216249e-05 1.2374539e-05 9.9960726e-01]
reward:  0
done:  False
Valid actions as child:  [0 2 3 4]
Prior over the children:  [3.6212418e-04 0.0000000e+00 1.8216249e-05 1.2374539e-05 9.9960726e-01]
Weights over the children:  [0.01853615 0.         0.00415738 0.00342654 0.97387993]
Terminal node:  False

██████
█ @ !█
█    █
█   ?█
█    █
██████
blessed sword
Value prediction/simulation phase started
Predicted/simulated value:  0.8999980092048645
Backpropagation phase started
Simulation 28 done.

Simulation 29 started.

██████
█  @!█
█    █
█n  ?█
█y   █
██████

actions:  [0 2 3 4]
value_terms:  [1.         1.         0.34185089 1.        ]
exploration_terms:  [ -3.80894541  -5.33714262  -0.57089705 -76.75649293]
ucb_values:  [ -2.80894541  -4.33714262  -0.22904616 -75.75649293]
max_U:  -0.22904616347804302
mask:  [Fa

valid_actions:  [0 2 3]
prior:  [0.03875165 0.         0.6093712  0.35187712 0.        ]
reward:  1
done:  True
Valid actions as child:  [0 2 3]
Prior over the children:  [0.03875165 0.         0.6093712  0.35187712 0.        ]
Weights over the children:  [0.12533155 0.         0.49699986 0.37766856 0.        ]
Terminal node:  True

██████
█   @█
█    █
█   ?█
█    █
██████
blessed sword
Value prediction/simulation phase started
Predicted/simulated value:  0
Backpropagation phase started
Simulation 30 done.

Simulation 31 started.

██████
█  @!█
█    █
█n  ?█
█y   █
██████

actions:  [0 2 3 4]
value_terms:  [1.         1.         0.33998153 1.        ]
exploration_terms:  [ -3.72033116  -5.21297522  -0.55809237 -74.97077067]
ucb_values:  [ -2.72033116  -4.21297522  -0.21811083 -73.97077067]
max_U:  -0.21811083262188546
mask:  [False False  True False]
best_actions:  [3]

██████
█ @ !█
█    █
█n  ?█
█y   █
██████

Current tree depth:  1
Action selected:  3 Left
Child node terminal:  Fal

actions:  [0 1 2 4]
value_terms:  [1.         0.65863987 1.         1.        ]
exploration_terms:  [-43.69050388  -0.50611309 -60.60391265  -5.98824055]
ucb_values:  [-42.69050388   0.15252678 -59.60391265  -4.98824055]
max_U:  0.15252678139962395
mask:  [False  True False False]
best_actions:  [1]

██████
█   !█
█@   █
█   ?█
█    █
██████
blessed sword
Current tree depth:  7
Action selected:  1 Up
Child node terminal:  False
Child node expanded:  True
actions:  [0 1 2 4]
value_terms:  [1.         0.72840943 1.         1.        ]
exploration_terms:  [-100.48901339   -0.43562464 -194.00415216   -8.89723107]
ucb_values:  [ -99.48901339    0.29278479 -193.00415216   -7.89723107]
max_U:  0.29278478669089536
mask:  [False  True False False]
best_actions:  [1]

██████
█@  !█
█    █
█   ?█
█    █
██████
blessed sword
Current tree depth:  8
Action selected:  1 Up
Child node terminal:  False
Child node expanded:  True
actions:  [0 2 4]
value_terms:  [1.         1.         0.80999984]
explora

ucb_values:  [-4.42978237 -0.43789602 -1.80998069]
max_U:  -0.4378960216960235
mask:  [False  True False]
best_actions:  [2]

██████
█   !█
█@   █
█n  ?█
█y   █
██████

Current tree depth:  3
Action selected:  2 Down
Child node terminal:  False
Child node expanded:  True
actions:  [0 1 2 4]
value_terms:  [1.         1.         0.48728969 1.        ]
exploration_terms:  [ -4.10840562 -65.08135618  -0.54934443 -24.66894982]
ucb_values:  [-3.10840562e+00 -6.40813562e+01 -6.20547443e-02 -2.36689498e+01]
max_U:  -0.06205474425187135
mask:  [False False  True False]
best_actions:  [2]

██████
█   !█
█    █
█@  ?█
█y   █
██████
gleaming sword
Current tree depth:  4
Action selected:  2 Down
Child node terminal:  False
Child node expanded:  True
actions:  [0 1 2 4]
value_terms:  [1.         1.         0.53973388 1.        ]
exploration_terms:  [ -4.06292268 -37.53533469  -0.61977962 -14.07150851]
ucb_values:  [ -3.06292268 -36.53533469  -0.08004574 -13.07150851]
max_U:  -0.08004573959637462
mask:  [False False  True False]

actions:  [0 2 3 4]
value_terms:  [0.35364358 1.         0.39239244 1.        ]
exploration_terms:  [ -1.36040218  -2.1046166   -1.41203012 -13.90200578]
ucb_values:  [ -1.00675859  -1.1046166   -1.01963768 -12.90200578]
max_U:  -1.0067585911866477
mask:  [ True False False False]
best_actions:  [0]

██████
█ @ !█
█    █
█n  ?█
█y   █
██████

Current tree depth:  2
Action selected:  0 Stay
Child node terminal:  False
Child node expanded:  True
actions:  [0 2 3 4]
value_terms:  [1.        1.        0.3961214 1.       ]
exploration_terms:  [ -2.78859356  -2.93314595  -2.10399968 -19.37484096]
ucb_values:  [ -1.78859356  -1.93314595  -1.70787828 -18.37484096]
max_U:  -1.7078782805892514
mask:  [False False  True False]
best_actions:  [3]

██████
█@  !█
█    █
█n  ?█
█y   █
██████

Current tree depth:  3
Action selected:  3 Left
Child node terminal:  False
Child node expanded:  True
actions:  [0 2 4]
value_terms:  [1.         0.44029259 1.        ]
exploration_terms:  [-6.27656977 -1.00121

actions:  [0 1 2 4]
value_terms:  [1.         1.         0.49219485 1.        ]
exploration_terms:  [ -4.66548858 -73.90612129  -0.61520897 -28.01395829]
ucb_values:  [ -3.66548858 -72.90612129  -0.12301413 -27.01395829]
max_U:  -0.12301412512157073
mask:  [False False  True False]
best_actions:  [2]

██████
█   !█
█    █
█@  ?█
█y   █
██████
gleaming sword
Current tree depth:  5
Action selected:  2 Down
Child node terminal:  False
Child node expanded:  True
actions:  [0 1 2 4]
value_terms:  [1.        1.        0.5447096 1.       ]
exploration_terms:  [ -4.64667333 -42.92831848  -0.69851659 -16.09326795]
ucb_values:  [ -3.64667333 -41.92831848  -0.15380698 -15.09326795]
max_U:  -0.15380698008098437
mask:  [False False  True False]
best_actions:  [2]

██████
█   !█
█    █
█   ?█
█@   █
██████
blessed sword
Current tree depth:  6
Action selected:  2 Down
Child node terminal:  False
Child node expanded:  True
actions:  [0 1 4]
value_terms:  [1.         0.59842594 1.        ]
exploration_

exploration_terms:  [ -1.31886761  -2.02388299  -1.35366849 -13.36872142]
ucb_values:  [ -0.96588604  -1.02388299  -0.96170839 -12.36872142]
max_U:  -0.9617083944739739
mask:  [False False  True False]
best_actions:  [3]

██████
█@  !█
█    █
█n  ?█
█y   █
██████

Current tree depth:  2
Action selected:  3 Left
Child node terminal:  False
Child node expanded:  True
actions:  [0 2 4]
value_terms:  [1.        0.4353799 1.       ]
exploration_terms:  [-5.09268751 -0.82231293 -2.63552986]
ucb_values:  [-4.09268751 -0.38693303 -1.63552986]
max_U:  -0.3869330277628998
mask:  [False  True False]
best_actions:  [2]

██████
█   !█
█@   █
█n  ?█
█y   █
██████

Current tree depth:  3
Action selected:  2 Down
Child node terminal:  False
Child node expanded:  True
actions:  [0 1 2 4]
value_terms:  [1.         1.         0.48557678 1.        ]
exploration_terms:  [ -3.84303074 -60.87754618  -0.5163666  -23.07550457]
ucb_values:  [-2.84303074e+00 -5.98775462e+01 -3.07898208e-02 -2.20755046e+01]
max_U

actions:  [0 2 4]
value_terms:  [1.         1.         0.80999974]
exploration_terms:  [ -88.90200933 -200.41198546   -0.36930559]
ucb_values:  [ -87.90200933 -199.41198546    0.44069415]
max_U:  0.4406941540927197
mask:  [False False  True]
best_actions:  [4]

██████
█ @ !█
█    █
█   ?█
█    █
██████
blessed sword
Current tree depth:  10
Action selected:  4 Right
Child node terminal:  False
Child node expanded:  True
actions:  [0 2 3 4]
value_terms:  [1.  1.  1.  0.9]
exploration_terms:  [ -56.88828809 -253.64259396 -307.74223213   -0.38529242]
ucb_values:  [ -55.88828809 -252.64259396 -306.74223213    0.51470758]
max_U:  0.5147075833737532
mask:  [False False False  True]
best_actions:  [4]

██████
█  @!█
█    █
█   ?█
█    █
██████
blessed sword
Current tree depth:  11
Action selected:  4 Right
Child node terminal:  False
Child node expanded:  True
actions:  [0 2 3 4]
value_terms:  [1. 1. 1. 1.]
exploration_terms:  [-7.39200998e+02 -2.18221493e+03 -3.41603968e+03 -3.62290609e-01]
u


██████
█   !█
█@   █
█n  ?█
█y   █
██████

Current tree depth:  4
Action selected:  2 Down
Child node terminal:  False
Child node expanded:  True
actions:  [0 1 2 4]
value_terms:  [1.        1.        0.4900567 1.       ]
exploration_terms:  [ -4.451364   -70.51416845  -0.59054539 -26.72824578]
ucb_values:  [ -3.451364   -69.51416845  -0.10048869 -25.72824578]
max_U:  -0.10048868933474248
mask:  [False False  True False]
best_actions:  [2]

██████
█   !█
█    █
█@  ?█
█y   █
██████
gleaming sword
Current tree depth:  5
Action selected:  2 Down
Child node terminal:  False
Child node expanded:  True
actions:  [0 1 2 4]
value_terms:  [1.         1.         0.54249817 1.        ]
exploration_terms:  [ -4.42070057 -40.84066785  -0.66888597 -15.3106349 ]
ucb_values:  [ -3.42070057 -39.84066785  -0.1263878  -14.3106349 ]
max_U:  -0.12638780207624445
mask:  [False False  True False]
best_actions:  [2]

██████
█   !█
█    █
█   ?█
█@   █
██████
blessed sword
Current tree depth:  6
Action selected:  2 Down


██████
█   @█
█    █
█   ?█
█    █
██████
blessed sword
Current tree depth:  11
Action selected:  4 Right
Child node terminal:  True
Child node expanded:  True
Expansion phase started
Value prediction/simulation phase started
Predicted/simulated value:  0
Backpropagation phase started
Simulation 44 done.

Simulation 45 started.

██████
█  @!█
█    █
█n  ?█
█y   █
██████

actions:  [0 2 3 4]
value_terms:  [1.         1.         0.33799402 1.        ]
exploration_terms:  [ -3.25109036  -4.55546905  -0.48956395 -65.51479936]
ucb_values:  [ -2.25109036  -3.55546905  -0.15156993 -64.51479936]
max_U:  -0.15156992705502192
mask:  [False False  True False]
best_actions:  [3]

██████
█ @ !█
█    █
█n  ?█
█y   █
██████

Current tree depth:  1
Action selected:  3 Left
Child node terminal:  False
Child node expanded:  True
actions:  [0 2 3 4]
value_terms:  [0.35247532 1.         0.3914363  1.        ]
exploration_terms:  [ -1.26326095  -1.93653817  -1.29570962 -12.79176678]
ucb_values:  [ -0.9107

actions:  [0 1 2 4]
value_terms:  [1.         0.72829131 1.         1.        ]
exploration_terms:  [-105.62570587   -0.45333164 -203.92105387   -9.35203044]
ucb_values:  [-104.62570587    0.27495967 -202.92105387   -8.35203044]
max_U:  0.2749596691571341
mask:  [False  True False False]
best_actions:  [1]

██████
█@  !█
█    █
█   ?█
█    █
██████
blessed sword
Current tree depth:  9
Action selected:  1 Up
Child node terminal:  False
Child node expanded:  True
actions:  [0 2 4]
value_terms:  [1.        1.        0.8099998]
exploration_terms:  [ -83.67410135 -188.62670159   -0.35637596]
ucb_values:  [ -82.67410135 -187.62670159    0.45362384]
max_U:  0.4536238423526959
mask:  [False False  True]
best_actions:  [4]

██████
█ @ !█
█    █
█   ?█
█    █
██████
blessed sword
Current tree depth:  10
Action selected:  4 Right
Child node terminal:  False
Child node expanded:  True
actions:  [0 2 3 4]
value_terms:  [1.  1.  1.  0.9]
exploration_terms:  [ -53.31218259 -237.69814025 -288.39697272

value_terms:  [1.         1.         0.53674844 1.        ]
exploration_terms:  [ -3.52304896 -32.54770824  -0.54218159 -12.20171226]
ucb_values:  [-2.52304896e+00 -3.15477082e+01 -5.43314452e-03 -1.12017123e+01]
max_U:  -0.005433144523207423
mask:  [False False  True False]
best_actions:  [2]

██████
█   !█
█    █
█   ?█
█@   █
██████
blessed sword
Current tree depth:  5
Action selected:  2 Down
Child node terminal:  False
Child node expanded:  True
actions:  [0 1 4]
value_terms:  [1.         0.59346598 1.        ]
exploration_terms:  [-5.74580095 -0.89891165 -2.09532964]
ucb_values:  [-4.74580095 -0.30544567 -1.09532964]
max_U:  -0.30544567237123055
mask:  [False  True False]
best_actions:  [1]

██████
█   !█
█    █
█@  ?█
█    █
██████
blessed sword
Current tree depth:  6
Action selected:  1 Up
Child node terminal:  False
Child node expanded:  True
actions:  [0 1 2 4]
value_terms:  [1.         0.65753558 1.         1.        ]
exploration_terms:  [-36.61859378  -0.43143005 -50.79433


██████
█   @█
█    █
█   ?█
█    █
██████
blessed sword
Current tree depth:  12
Action selected:  4 Right
Child node terminal:  True
Child node expanded:  True
Expansion phase started
Value prediction/simulation phase started
Predicted/simulated value:  0
Backpropagation phase started
Simulation 49 done.

Simulation 50 started.

██████
█  @!█
█    █
█n  ?█
█y   █
██████

actions:  [0 2 3 4]
value_terms:  [1.         1.         0.33766109 1.        ]
exploration_terms:  [ -3.12664669  -4.381097    -0.47121768 -63.00705543]
ucb_values:  [ -2.12664669  -3.381097    -0.13355659 -62.00705543]
max_U:  -0.1335565887615352
mask:  [False False  True False]
best_actions:  [3]

██████
█ @ !█
█    █
█n  ?█
█y   █
██████

Current tree depth:  1
Action selected:  3 Left
Child node terminal:  False
Child node expanded:  True
actions:  [0 2 3 4]
value_terms:  [0.35207564 1.         0.39102087 1.        ]
exploration_terms:  [ -1.21497663  -1.86099113  -1.24551423 -12.29274222]
ucb_values:  [ -0.86290