## <font color='darkblue'>5. DQN Techniques: Experience Replay and Target Networks</font>
([article source](https://colab.corp.google.com/drive/1DEv8FSjMvsgCDPlOGQrUFoJeAf67cFSo?usp=sharing), [main page](https://developers.google.com/machine-learning/reinforcement-learning)) <b><font size='3ptx'>In the previous Colab, you trained a neural network on the results of every state transition. This approach tends to produce unstable training. </font></b>

Here you'll learn why training becomes unstable. Then, you'll learn the following two techniques that stabilize Deep Q-Network (DQN) training:
* Experience replay
* Target networks

## <font color='darkblue'>Disadvantages of Online DQN</font>
<b><font size='3ptx'>In the previous Colab, every state transition generated a tuple, and you trained your agent on that tuple. Training your agent only on tuples generated by live training is called <font color='darkblue'>online DQN</font>. Let's see why online DQN training is unstable.</font></b>

<b>The problem with online DQN is that the agent trains on a trajectory of states, and successive states in a trajectory are probably similar</b>. Therefore, the input data can be correlated. However, <b>neural network training generally assumes that input data is independent and identically distributed</b> (<font color='brown'>i.i.d.</font>). In practice, training on correlated input data means that the agent might not generalize well to other states, resulting in unstable training.

<b>In general, neural network training relies on the assumption that data is i.i.d. In this Colab, you'll apply a technique called experience replay to satisfy this assumption</b>.
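As a rough illustration of the correlation problem, the following sketch uses a hypothetical random walk over 16 states as a stand-in for an agent's trajectory. It is not FrozenLake, and the walk and its step sizes are assumptions made only for this demo. The sketch compares how strongly each state predicts the next one when states are consumed in order versus in shuffled order, which is roughly what sampling from a buffer does.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical trajectory: a random walk over states 0..15, so each state
# is close to the previous one (a stand-in for an agent's path).
num_steps = 5000
trajectory = np.empty(num_steps, dtype=int)
state = 0
for t in range(num_steps):
    state = int(np.clip(state + rng.choice([-1, 0, 1]), 0, 15))
    trajectory[t] = state


def lag1_correlation(states):
    """Correlation between each state and the state that follows it."""
    return np.corrcoef(states[:-1], states[1:])[0, 1]


# "Online" order: consecutive states exactly as the agent visited them.
online_corr = lag1_correlation(trajectory)

# "Replay" order: the same states, but visited in shuffled order.
replay_corr = lag1_correlation(rng.permutation(trajectory))

print(f"consecutive states: {online_corr:.3f}")
print(f"shuffled states:    {replay_corr:.3f}")
```

In this toy setting, consecutive states are strongly correlated, while shuffled states are close to uncorrelated.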

## <font color='darkblue'>Setup</font>
Run the following cell to import libraries and set up the environment:

In [35]:
import gym
import time
import numpy as np
import matplotlib.pyplot as plt
import random
from tensorflow import keras
from collections import deque
from IPython.display import clear_output # to clear output on every episode run

CHECK_SUCCESS_INTERVAL = 100
EPSILON_MIN = 0.01

env = gym.make('FrozenLake-v1', render_mode="rgb_array_list")

num_states = env.observation_space.n
num_actions = env.action_space.n

Run the following cell to define functions that perform the following tasks:
* Define the neural network.
* Calculate the Bellman update.
* Select an action.
* Check the agent's training for success.

<br/>

These functions are identical to functions in the previous Colab.

In [39]:
def one_hot_encode_state(state):
  """Turns state into a one-hot vector.
  
  Args:
    state: An integer representing the agent's state, or the
      (state, info) tuple returned by `env.reset()`.
     
  Returns:
    A one-hot encoded vector of the input `state`.
  """
  # env.reset() returns a tuple such as (0, {'prob': 1}); keep only the state.
  state = state[0] if isinstance(state, tuple) else state
  return np.identity(num_states)[state:state+1]


def compute_bellman_target(discount_factor, reward, model, state_next):
  '''Returns the updated return calculation given the reward and next state.

  Args:
    discount_factor: factor by which to reduce return from next state when
      updating Q-values using Bellman update.
    reward: reward from state transition.
    model: model used to predict Q-values
    state_next: next state after state transition.
    
  Returns:
    updated Q-value using Bellman update
  '''
  return reward + discount_factor * \
           np.max(model.predict(one_hot_encode_state(state_next)))


def define_model(learning_rate):
  '''Returns a shallow neural net defined using tf.keras.
  
  Args:
    learning_rate: optimizer learning rate
    
  Returns:
    model: A shallow neural net defined using tf.keras input dimension equal to
    num_states and output dimension equal to num_actions.
  '''
  model = keras.Sequential()
  model.add(keras.layers.Dense(
      input_dim = num_states,
      units = num_actions,
      activation = 'relu',
      use_bias = False,
      kernel_initializer = keras.initializers.RandomUniform(minval=1e-5, maxval=0.05)))
  
  model.compile(
      optimizer = keras.optimizers.SGD(learning_rate = learning_rate),
      loss = 'mse')
  
  print("======= Neural Network Summary =======")
  print(model.summary())
  return model


def select_action(epsilon, state, model):
  """Selects an action using the epsilon-greedy algorithm.
  
  Args:
    epsilon: Current value of epsilon used to select action using epsilon-greedy
      algorithm.
    state: Current state of the agent.
    model: Model used to predict Q-values for `state`.
      
  Returns:
    action: action to take from the state.
  """
  if(np.random.rand() < epsilon):
    return np.random.randint(num_actions)
  
  q_values = model.predict(one_hot_encode_state(state))
  return np.argmax(q_values)


def check_success(episode, epsilon, reward_history, length_history, time_history, success_percent_threshold):
  if((episode+1) % CHECK_SUCCESS_INTERVAL == 0):
    # Check the success % in the last 100 episodes (including the current one)
    success_percent = np.sum(reward_history[-100:])
    length_avg = int(np.sum(length_history[-100:])/100.0)
    time_avg = np.sum(time_history[-100:])/100.0
    print("Episode: " + f"{episode:0>4d}" + \
          ", Success: " + f"{success_percent:2.0f}" + "%" + \
          ", Avg length: " + f"{length_avg:0>2d}" + \
          ", Epsilon: " + f"{epsilon:.2f}" + \
          ", Avg time(s): " + f"{time_avg:.2f}"
         )
    
    if(success_percent > success_percent_threshold):
      print("Agent crossed success threshold of " + str(success_percent_threshold) + '%.')
      return(1)
  return(0)


learning_rate = 0.2
model = define_model(learning_rate)

Model: "sequential_4"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense_4 (Dense)             (None, 4)                 64        
                                                                 
Total params: 64
Trainable params: 64
Non-trainable params: 0
_________________________________________________________________
None
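One way to read the 64 parameters in the summary: with one-hot inputs and no bias, the single dense layer behaves like a 16 x 4 table in which each row holds the Q-values for one state. The following NumPy sketch illustrates this under stated assumptions. The `kernel` here is sampled with NumPy to mimic the initializer range and is not the Keras model's actual weights, and `forward` is a hypothetical stand-in for `model.predict`.

```python
import numpy as np

num_states, num_actions = 16, 4  # FrozenLake 4x4 sizes used in this Colab
rng = np.random.default_rng(0)

# Assumed stand-in for the layer's weights: 16 * 4 = 64 parameters.
kernel = rng.uniform(1e-5, 0.05, size=(num_states, num_actions))


def one_hot(state):
    """Returns a (1, num_states) one-hot row for an integer state."""
    return np.identity(num_states)[state:state + 1]


def forward(x):
    """Dense layer without bias, followed by ReLU."""
    return np.maximum(x @ kernel, 0.0)


q_values = forward(one_hot(3))

# Multiplying by a one-hot row selects row 3 of the kernel. ReLU leaves the
# row unchanged here because every sampled weight is positive.
print(q_values)
print(kernel[3])
```

Under this reading, training the network on a state adjusts only that state's row of weights.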


## <font color='darkblue'>Improving DQN with Experience Replay</font>
<b>In online DQN, all previous tuples are discarded. Instead, previous tuples can be collected in a buffer</b>. Now, the agent can replay those state transitions and train without needing to again experience those state transitions. This technique is called <b><font color='darkblue'>experience replay</font></b>. The buffer storing the tuples is called a <b>replay buffer</b>.

To implement experience replay, the agent follows these steps on every state transition:
1. Save the transition's tuple  `s,a,r,s′`  in the replay buffer.
2. Create a batch of tuples by sampling the buffer.
3. Train the neural network on the batch of tuples.

The following schematic shows these steps:
![RL replay buffer](images/5_1.PNG)

Implement the first step by creating a replay buffer using a Python [**deque**](https://docs.python.org/3/library/collections.html#collections.deque). Set the buffer size to 2000. The reasoning behind this buffer size is discussed later in this Colab.

In [12]:
replay_buffer_size = 2000
replay_buffer = deque(maxlen = replay_buffer_size)
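To see what `maxlen` does, here is a minimal sketch with a deliberately tiny buffer. The buffer size of 3 and the transition tuples are made up for illustration. Once the deque is full, appending a new tuple silently discards the oldest one, so the buffer always holds the most recent transitions.

```python
from collections import deque

# Tiny buffer so that eviction is easy to see (the Colab uses 2000).
small_buffer = deque(maxlen=3)

# Illustrative (state, action, reward, state_next) tuples.
transitions = [(0, 1, 0.0, 4), (4, 2, 0.0, 5), (0, 2, 0.0, 1), (1, 1, 0.0, 2)]

for transition in transitions:
    small_buffer.append(transition)

# The first tuple was dropped when the fourth one was appended.
print(list(small_buffer))
```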

Collect transitions by using a random policy for a few episodes:

In [13]:
help(env.action_space.sample)

Help on method sample in module gym.spaces.discrete:

sample(mask: Optional[numpy.ndarray] = None) -> int method of gym.spaces.discrete.Discrete instance
    Generates a single random sample from this space.
    
    A sample will be chosen uniformly at random with the mask if provided
    
    Args:
        mask: An optional mask for if an action can be selected.
            Expected `np.ndarray` of shape `(n,)` and dtype `np.int8` where `1` represents valid actions and `0` invalid / infeasible actions.
            If there are no possible actions (i.e. `np.all(mask == 0)`) then `space.start` will be returned.
    
    Returns:
        A sampled integer from the space



In [14]:
# Check what we get from env.step
env.reset()
action = env.action_space.sample()

# observation, reward, terminated, truncated, info
env.step(action)

(0, 0.0, False, False, {'prob': 0.3333333333333333})

In [16]:
state, reward, done, _, _ = env.step(env.action_space.sample())

while not done:
  action = env.action_space.sample()
  state_next, reward, done, _, _ = env.step(action)
  print(f'Take action={action}: state from {state} -> {state_next}, reward={reward}, done={done}')
  state = state_next
  
print('Bye')

Take action=2: state from 1 -> 2, reward=0.0, done=False
Take action=2: state from 2 -> 6, reward=0.0, done=False
Take action=2: state from 6 -> 2, reward=0.0, done=False
Take action=2: state from 2 -> 3, reward=0.0, done=False
Take action=0: state from 3 -> 2, reward=0.0, done=False
Take action=3: state from 2 -> 3, reward=0.0, done=False
Take action=2: state from 3 -> 3, reward=0.0, done=False
Take action=2: state from 3 -> 7, reward=0.0, done=True
Bye


In [17]:
replay_buffer = deque(maxlen = replay_buffer_size)
for episode in range(5):
  state, _ = env.reset()  # reset() returns (observation, info)
  done = False
  while not done:
    action = env.action_space.sample()
    state_next, reward, terminated, truncated, _ = env.step(action)
    done = terminated or truncated  # stop on holes, the goal, or the step limit
    replay_buffer.append((state, action, reward, state_next))
    state = state_next

print(replay_buffer)

deque([(0, 1, 0.0, 1), (1, 2, 0.0, 1), (1, 2, 0.0, 5), (0, 2, 0.0, 4), (4, 3, 0.0, 0), (0, 3, 0.0, 0), (0, 2, 0.0, 1), (1, 3, 0.0, 0), (0, 1, 0.0, 1), (1, 1, 0.0, 0), (0, 3, 0.0, 0), (0, 2, 0.0, 0), (0, 0, 0.0, 4), (4, 2, 0.0, 5), (0, 0, 0.0, 4), (4, 0, 0.0, 0), (0, 1, 0.0, 4), (4, 3, 0.0, 0), (0, 3, 0.0, 0), (0, 1, 0.0, 1), (1, 1, 0.0, 2), (2, 0, 0.0, 1), (1, 1, 0.0, 2), (2, 2, 0.0, 6), (6, 2, 0.0, 2), (2, 0, 0.0, 6), (6, 2, 0.0, 7), (0, 1, 0.0, 1), (1, 3, 0.0, 1), (1, 3, 0.0, 2), (2, 0, 0.0, 2), (2, 3, 0.0, 1), (1, 2, 0.0, 1), (1, 1, 0.0, 5), (0, 1, 0.0, 4), (4, 2, 0.0, 0), (0, 3, 0.0, 0), (0, 3, 0.0, 0), (0, 0, 0.0, 0), (0, 1, 0.0, 4), (4, 3, 0.0, 0), (0, 3, 0.0, 0), (0, 2, 0.0, 0), (0, 2, 0.0, 0), (0, 2, 0.0, 1), (1, 3, 0.0, 1), (1, 0, 0.0, 5)], maxlen=2000)


Implement experience replay by defining a function to sample a batch from `replay_buffer` and train the agent on every tuple in the batch. Vectorize the code to train the model on the entire batch because training the model on a single tuple at a time is slow.

In [18]:
def sample_from_replay_buffer_and_train_model(replay_buffer, batch_size, model, discount_factor):
  '''Samples a batch from the buffer and trains the agent on the batch.
  
  Unpacks feature data from tuples of (state, action, reward, state_next).
  Encodes states as one-hot vectors and stacks these vectors into a matrix.
  Creates matrix of target Q-values. Uses both matrices to train model in one
  call for faster training.
  
  Args:
    replay_buffer: deque containing recorded tuples.
    batch_size: integer specifying training batch size.
    model: neural network representing agent.
    discount_factor: factor by which to reduce return from next state when
      updating Q-values using Bellman update.
      
  Returns:
    model: neural network trained on sampled batch.
  '''
  if(len(replay_buffer) > batch_size):
    batch = random.sample(replay_buffer, batch_size)
    # extract s, a, r, s' from tuples into vectors
    states = [item[0] for item in batch]
    actions = [item[1] for item in batch]
    rewards = [item[2] for item in batch]
    states_next = [item[3] for item in batch]
    # encode states as a matrix of one-hot vectors
    one_hot_encoded_states = np.empty(shape=(0, num_states))
    for state in states:
      one_hot_encoded_states = np.vstack((one_hot_encoded_states, one_hot_encode_state(state)))
      
    # predict Q-values and update predictions using Bellman update
    target_q_values = model.predict(one_hot_encoded_states) # TODO. This TODO is
            # a placeholder. You'll fill in code later, in Part 2 of this Colab.
    for i in range(len(states)):
      target_q_values[i, actions[i]] = compute_bellman_target(discount_factor, rewards[i], model, states_next[i])
      
    # now, you can run the following training step without a loop
    model.fit(one_hot_encoded_states, target_q_values, epochs = 1, verbose = 0)
    
  return model
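The function above still builds the one-hot matrix row by row and calls `compute_bellman_target` once per tuple. As a possible further vectorization, the following sketch computes the whole batch of targets with array operations. It is a sketch under assumptions: `predict` is a hypothetical stand-in for `model.predict` that uses a fixed weight matrix so the example runs without Keras, and the batch tuples are made up. Like `compute_bellman_target`, it does not treat terminal states specially.

```python
import numpy as np

num_states, num_actions = 16, 4
discount_factor = 0.95

# Illustrative batch of (state, action, reward, state_next) tuples.
batch = [(0, 1, 0.0, 4), (14, 2, 1.0, 15), (4, 3, 0.0, 0)]
states, actions, rewards, states_next = map(np.array, zip(*batch))

# Indexing an identity matrix builds every one-hot row in one step.
one_hot_states = np.eye(num_states)[states]
one_hot_states_next = np.eye(num_states)[states_next]

# Hypothetical stand-in for the trained network's weights.
weights = np.linspace(0.0, 1.0, num_states * num_actions).reshape(
    num_states, num_actions)


def predict(one_hot_batch):
    """Stand-in for model.predict: returns Q-values for each row."""
    return one_hot_batch @ weights


# Start from the current predictions, then overwrite only the entry for the
# action actually taken with its Bellman target.
target_q_values = predict(one_hot_states)
bellman_targets = rewards + discount_factor * predict(one_hot_states_next).max(axis=1)
target_q_values[np.arange(len(batch)), actions] = bellman_targets

print(target_q_values.shape)
```

With a real Keras model, this approach would need two `predict` calls per batch instead of one call per tuple.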

Train the agent on the `replay_buffer` by running the following cell. Compare the best action for the first state before and after training.

In [19]:
batch_size = 8
discount_factor = 0.95
print("Q-values for state 0 -")
print("Before training epoch:", model.predict(one_hot_encode_state(0)))

model = sample_from_replay_buffer_and_train_model(
  replay_buffer, batch_size, model, discount_factor)

print("After training epoch: ", model.predict(one_hot_encode_state(0)))

Q-values for state 0 -
Before training epoch: [[0.02235954 0.02473554 0.0061638  0.01072832]]
After training epoch:  [[0.02235954 0.02498301 0.0068313  0.01104758]]


To summarize, on every state transition, the agent follows these steps:
* Save the tuple from the state transition to the buffer.
* Sample a batch of tuples from `replay_buffer` and train on the batch.
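One consequence of the `len(replay_buffer) > batch_size` guard in `sample_from_replay_buffer_and_train_model` is that no training happens until the buffer holds more than `batch_size` tuples. The following sketch counts when the first batch can be drawn. The placeholder transitions are made up, and no model is trained.

```python
import random
from collections import deque

random.seed(0)

replay_buffer = deque(maxlen=2000)
batch_size = 8
first_training_step = None

for step in range(1, 21):
    # Step 1: save a placeholder (state, action, reward, state_next) tuple.
    replay_buffer.append((0, step % 4, 0.0, 0))

    # Step 2: sample a batch only once the buffer is large enough.
    if len(replay_buffer) > batch_size:
        batch = random.sample(replay_buffer, batch_size)
        # Step 3 would train the model on `batch` here.
        if first_training_step is None:
            first_training_step = step

print("First training step:", first_training_step)
```

With `batch_size = 8`, the first eight transitions are only stored, and sampling starts at the ninth.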

## <font color='darkblue'>Train and Evaluate DQN</font>
<b><font size='3ptx'>Training with experience replay is slow. This slowness restricts how much you can explore the hyperparameter space.</font></b>

Follow these steps:
1. From the previous Colab, copy the values for `eps_decay`, `discount_factor`, `episodes`, and `learning_rate`.
1. Set `replay_buffer_size` to an initial value. How can you estimate such a value?
1. `batch_size` is typically 16, 32, or 64. These are standard values in DQN. However, because FrozenLake is a simple environment, set `batch_size = 8` for faster training.
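When choosing `eps_decay`, it can help to see how quickly epsilon shrinks when it is multiplied by `eps_decay` once per episode. The following sketch assumes `eps_decay = 0.99`, a starting epsilon of 1.0, and `EPSILON_MIN = 0.01`. The helper `epsilon_after` is a hypothetical name introduced for this example.

```python
import math

epsilon_start = 1.0
eps_decay = 0.99
EPSILON_MIN = 0.01


def epsilon_after(episodes_completed):
    """Epsilon after the given number of per-episode decays (floor ignored)."""
    return epsilon_start * eps_decay ** episodes_completed


for n in (10, 20, 30, 40):
    print(f"after {n} episodes: epsilon = {epsilon_after(n):.2f}")

# Number of decays needed before epsilon falls below EPSILON_MIN.
episodes_to_min = math.ceil(
    math.log(EPSILON_MIN / epsilon_start) / math.log(eps_decay))
print("episodes until epsilon < EPSILON_MIN:", episodes_to_min)
```

Under these assumptions, a 100-episode run ends with epsilon still well above `EPSILON_MIN`, so the agent keeps exploring substantially throughout.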

Run the cell and experiment with hyperparameter values to train the agent. How does training with experience replay compare with training with online DQN? Expand the following section for a discussion.

<a id='train_eval_dqn'></a>

In [22]:
# Hyperparameters
epsilon = 1.0
eps_decay = 0.99
discount_factor = 0.999
episodes = 100
learning_rate = 0.7
replay_buffer_size = 2000
batch_size = 8
CHECK_SUCCESS_INTERVAL = 10
# TODO. This TODO is a placeholder. You'll fill in code later,
# in Part 2 of this Colab.

# Parameters & model
model = define_model(learning_rate)
success_percent_threshold = 20 # in percent, so 20 = 20%
# TODO. This TODO is a placeholder. You'll fill in code later,
# in Part 2 of this Colab.
replay_buffer = deque(maxlen = replay_buffer_size) # create new replay_buffer

# Training metrics
length_history = []
reward_history = []
time_history = []

# Test if parameter values are valid
assert eps_decay < 1.0 and eps_decay > 0.
assert success_percent_threshold > 9 # agent could reach 9% randomly

print("======= Begin Training =======")
for episode in range(episodes):
  print(f'\tEpisode={episode}...')
  state = env.reset()
  done = False
  episode_reward = 0
  episode_length = 0
  episode_time_start = time.time()
  while not done:
    episode_length += 1
    action = select_action(epsilon, state, model)
    state_next, reward, done, _, _ = env.step(action)
    if done:
      print(f'Take action={action}: state from {state} -> {state_next}, reward={reward}, done={done}, episode_length={episode_length}')
    else:
      print(f'Take action={action}: state from {state} -> {state_next}, reward={reward}')
    state = state[0] if isinstance(state, tuple) else state
    replay_buffer.append((state, action, reward, state_next))
      
    model = sample_from_replay_buffer_and_train_model(
        replay_buffer, batch_size, model, discount_factor)
    
    # TODO. This TODO is a placeholder. You'll fill in code later,
    # in Part 2 of this Colab.
    episode_reward += reward
    state = state_next

  # Decreasing epsilon here instead of inside sample_from_replay_buffer_and_train_model introduces
  # the possible edge condition that epsilon decreases before the
  # model starts training because the batch doesn't build up
  if epsilon > EPSILON_MIN:
    epsilon *= eps_decay
  length_history.append(episode_length)
  reward_history.append(episode_reward)
  time_history.append(time.time() - episode_time_start)
  
  if check_success(episode, epsilon, reward_history, length_history, time_history, success_percent_threshold):
    break

	Episode=0...
Take action=1: state from (0, {'prob': 1}) -> 4, reward=0.0
Take action=1: state from 4 -> 8, reward=0.0
Take action=0: state from 8 -> 8, reward=0.0
Take action=2: state from 8 -> 12, reward=0.0, done=True, episode_length=4
	Episode=1...
Take action=0: state from (0, {'prob': 1}) -> 4, reward=0.0
Take action=2: state from 4 -> 8, reward=0.0
Take action=1: state from 8 -> 8, reward=0.0
Take action=0: state from 8 -> 8, reward=0.0
Take action=2: state from 8 -> 4, reward=0.0
Take action=1: state from 4 -> 4, reward=0.0
Take action=1: state from 4 -> 4, reward=0.0
Take action=3: state from 4 -> 4, reward=0.0
Take action=3: state from 4 -> 5, reward=0.0, done=True, episode_length=9
	Episode=2...
Take action=2: state from (0, {'prob': 1}) -> 1, reward=0.0
Take action=3: state from 1 -> 1, reward=0.0
Take action=1: state from 1 -> 2, reward=0.0
Take action=3: state from 2 -> 1, reward=0.0
Take action=3: state from 1 -> 2, reward=0.0
Take action=2: state from 2 -> 3, reward=0.0

Take action=3: state from 4 -> 0, reward=0.0
Take action=0: state from 0 -> 0, reward=0.0
Take action=3: state from 0 -> 1, reward=0.0
Take action=1: state from 1 -> 2, reward=0.0
Take action=3: state from 2 -> 1, reward=0.0
Take action=2: state from 1 -> 2, reward=0.0
Take action=2: state from 2 -> 3, reward=0.0
Take action=2: state from 3 -> 7, reward=0.0, done=True, episode_length=9
	Episode=4...
Take action=3: state from (0, {'prob': 1}) -> 0, reward=0.0
Take action=2: state from 0 -> 1, reward=0.0
Take action=3: state from 1 -> 1, reward=0.0
Take action=3: state from 1 -> 0, reward=0.0
Take action=1: state from 0 -> 0, reward=0.0
Take action=0: state from 0 -> 0, reward=0.0
Take action=3: state from 0 -> 1, reward=0.0
Take action=3: state from 1 -> 2, reward=0.0
Take action=2: state from 2 -> 2, reward=0.0
Take action=2: state from 2 -> 2, reward=0.0
Take action=0: state from 2 -> 6, reward=0.0
Take action=3: state from 6 -> 5, reward=0.0, done=True, episode_length=12
	Episode=5...
Take action=1: state from (0, {'prob': 1}) -> 0, reward=0.0
Take action=2: state from 0 -> 0, reward=0.0
Take action=1: state from 0 -> 4, reward=0.0
Take action=3: state from 4 -> 0, reward=0.0
Take action=1: state from 0 -> 1, reward=0.0
Take action=1: state from 1 -> 0, reward=0.0
Take action=3: state from 0 -> 0, reward=0.0
Take action=0: state from 0 -> 0, reward=0.0
Take action=1: state from 0 -> 4, reward=0.0
Take action=1: state from 4 -> 5, reward=0.0, done=True, episode_length=10
	Episode=6...
Take action=0: state from (0, {'prob': 1}) -> 4, reward=0.0
Take action=0: state from 4 -> 8, reward=0.0
Take action=1: state from 8 -> 9, reward=0.0
Take action=0: state from 9 -> 13, reward=0.0
Take action=1: state from 13 -> 14, reward=0.0
Take action=2: state from 14 -> 10, reward=0.0
Take action=3: state from 10 -> 11, reward=0.0, done=True, episode_length=7
	Episode=7...
Take action=0: state from (0, {'prob': 1}) -> 0, reward=0.0
Take action=0: state from 0 -> 0, reward=0.0
Take action=0: state from 0 -> 0, reward=0.0
Take action=0: state from 0 -> 0, reward=0.0
Take action=0: state from 0 -> 4, reward=0.0
Take action=0: state from 4 -> 4, reward=0.0
Take action=3: state from 4 -> 5, reward=0.0, done=True, episode_length=7
	Episode=8...
Take action=2: state from (0, {'prob': 1}) -> 4, reward=0.0
Take action=3: state from 4 -> 5, reward=0.0, done=True, episode_length=2
	Episode=9...
Take action=3: state from (0, {'prob': 1}) -> 0, reward=0.0
Take action=3: state from 0 -> 0, reward=0.0
Take action=0: state from 0 -> 4, reward=0.0
Take action=2: state from 4 -> 8, reward=0.0
Take action=2: state from 8 -> 12, reward=0.0, done=True, episode_length=5
Episode: 0009, Success:  0%, Avg length: 00, Epsilon: 0.90, Avg time(s): 0.28
success percent=0.0%
	Episode=10...
Take action=3: state from (0, {'prob': 1}) -> 0, reward=0.0
Take action=2: state from 0 -> 4, reward=0.0
Take action=3: state from 4 -> 5, reward=0.0, done=True, episode_length=3
	Episode=11...
Take action=0: state from (0, {'prob': 1}) -> 0, reward=0.0
Take action=0: state from 0 -> 4, reward=0.0
Take action=2: state from 4 -> 5, reward=0.0, done=True, episode_length=3
	Episode=12...
Take action=2: state from (0, {'prob': 1}) -> 4, reward=0.0
Take action=3: state from 4 -> 5, reward=0.0, done=True, episode_length=2
	Episode=13...
Take action=1: state from (0, {'prob': 1}) -> 0, reward=0.0
Take action=1: state from 0 -> 4, reward=0.0
Take action=1: state from 4 -> 5, 

	Episode=14...
Take action=2: state from (0, {'prob': 1}) -> 1, reward=0.0
Take action=1: state from 1 -> 2, reward=0.0
Take action=1: state from 2 -> 6, reward=0.0
Take action=3: state from 6 -> 5, reward=0.0, done=True, episode_length=4
	Episode=15...
Take action=2: state from (0, {'prob': 1}) -> 1, reward=0.0
Take action=3: state from 1 -> 1, reward=0.0
Take action=2: state from 1 -> 2, reward=0.0
Take action=1: state from 2 -> 6, reward=0.0
Take action=1: state from 6 -> 5, reward=0.0, done=True, episode_length=5
	Episode=16...
Take action=0: state from (0, {'prob': 1}) -> 0, reward=0.0
Take action=2: state from 0 -> 4, reward=0.0
Take action=1: state from 4 -> 8, reward=0.0
Take action=3: state from 8 -> 8, reward=0.0
Take action=3: state from 8 -> 8, reward=0.0
Take action=1: state from 8 -> 9, reward=0.0
Take action=0: state from 9 -> 13, reward=0.0
Take action=0: state from 13 -> 13, reward=0.0
Take action=0: state from 13 -> 9, reward=0.0
Take action=1: state from 9 -> 10, reward=0.0
Take action=2: state from 10 -> 6, reward=0.0
Take action=3: state from 6 -> 5, reward=0.0, done=True, episode_length=12
	Episode=17...
Take action=0: state from (0, {'prob': 1}) -> 4, reward=0.0
Take action=3: state from 4 -> 0, reward=0.0
Take action=2: state from 0 -> 0, reward=0.0
Take action=2: state from 0 -> 0, reward=0.0
Take action=3: state from 0 -> 1, reward=0.0
Take action=0: state from 1 -> 1, reward=0.0
Take action=1: state from 1 -> 0, reward=0.0
Take action=0: state from 0 -> 4, reward=0.0
Take action=3: state from 4 -> 5, reward=0.0, done=True, episode_length=9
	Episode=18...
Take action=3: state from (0, {'prob': 1}) -> 0, reward=0.0
Take action=1: state from 0 -> 1, reward=0.0
Take action=2: state from 1 -> 2, reward=0.0
Take action=3: state from 2 -> 2, reward=0.0
Take action=2: state from 2 -> 3, reward=0.0
Take action=2: state from 3 -> 3, reward=0.0
Take action=0: state from 3 -> 7, reward=0.0, done=True, episode_length=7
	Episode=19...
Take action=2: state from (0, {'prob': 1}) -> 4, reward=0.0
Take action=2: state from 4 -> 0, reward=0.0
Take action=0: state from 0 -> 0, reward=0.0
Take action=3: state from 0 -> 0, reward=0.0
Take action=0: state from 0 -> 0, reward=0.0
Take action=1: state from 0 -> 4, reward=0.0
Take action=1: state from 4 -> 8, reward=0.0
Take action=1: state from 8 -> 12, reward=0.0, done=True, episode_length=8
Episode: 0019, Success:  0%, Avg length: 01, Epsilon: 0.82, Avg time(s): 0.52
success percent=0.0%
	Episode=20...
Take action=0: state from (0, {'prob': 1}) -> 0, reward=0.0
Take action=2: state from 0 -> 1, reward=0.0
Take action=3: state from 1 -> 0, reward=0.0
Take action=2: state from 0 -> 1, reward=0.0
Take action=2: state from 1 -> 2, reward=0.0
Take action=1: state from 2 -> 3, reward=0.0
Take action=3: state from 3 -> 3, reward=0.0
Take action=3: state from 3 -> 3, reward=0.0
Take action=2: state from 3 -> 3, reward=0.0
Take action=0: state from 3 -> 2, reward=0.0
Take action=2: state from 2 -> 3, reward=0.0
Take action=0: state from 3 -> 7, reward=0.0, done=True, episode_length=12
	Episode=21...
Take action=2: state from (0, {'prob': 1}) -> 4, reward=0.0
Take action=2: state from 4 -> 0, reward=0.0
Take action=3: state from 0 -> 1, reward=0.0
Take action=2: state from 1 -> 2, reward=0.0
Take action=1: state from 2 -> 3, reward=0.0
Take action=1: state from 3 -> 7, reward=0.0, done=True, episode_length=6
	Episode=22...
Take action=3: state from (0, {'prob': 1}) -> 0, reward=0.0
Take action=3: state from 0 -> 1, reward=0.0
Take action=1: state from 1 -> 0, reward=0.0
Take action=3: state from 0 -> 0, reward=0.0
Take action=3: state from 0 -> 1, reward=0.0
Take action=1: state from 1 -> 2, reward=0.0
Take action=3: state from 2 -> 2, reward=0.0
Take action=2: state from 2 -> 2, reward=0.0
Take action=0: state from 2 -> 1, reward=0.0
Take action=3: state from 1 -> 0, reward=0.0
Take action=0: state from 0 -> 0, reward=0.0
Take action=1: state from 0 -> 1, reward=0.0
Take action=1: state from 1 -> 5, reward=0.0, done=True, episode_length=13
	Episode=23...
Take action=0: state from (0, {'prob': 1}) -> 0, reward=0.0
Take action=0: state from 0 -> 0, reward=0.0
Take action=0: state from 0 -> 4, reward=0.0
Take action=3: state from 4 -> 0, reward=0.0
Take action=1: state from 0 -> 1, reward=0.0
Take action=0: state from 1 -> 5, reward=0.0, done=True, episode_length=6
	Episode=24...
Take action=1: state from (0, {'prob': 1}) -> 0, reward=0.0
Take action=3: state from 0 -> 1, reward=0.0
Take action=0: state from 1 -> 5, reward=0.0, done=True, episode_length=3
	Episode=25...
Take action=0: state from (0, {'prob': 1}) -> 4, reward=0.0
Take action=0: state from 4 -> 8, reward=0.0
Take action=0: state from 8 -> 4, reward=0.0
Take action=3: state from 4 -> 0, reward=0.0
Take action=2: state from 0 -> 1, reward=0.0
Take action=0: state from 1 -> 0, reward=0.0
Take action=1: state from 0 -> 4, reward=0.0
Take action=3: state from 4 -> 4, reward=0.0
Take action=1: state from 4 -> 8, reward=0.0
Take action=0: state from 8 -> 4, reward=0.0
Take action=3: state from 4 -> 4, reward=0.0
Take action=1: state from 4 -> 5, reward=0.0, done=True, episode_length=12
	Episode=26...
Take action=2: state from (0, {'prob': 1}) -> 1, reward=0.0
Take action=3: state from 1 -> 1, reward=0.0
Take action=1: state from 1 -> 0, reward=0.0
Take action=1: state from 0 -> 0, reward=0.0
Take action=2: state from 0 -> 1, reward=0.0
Take action=2: state from 1 -> 5, reward=0.0, done=True, episode_length=6
	Episode=27...
Take action=3: state from (0, {'prob': 1}) -> 0, reward=0.0
Take action=0: state from 0 -> 0, reward=0.0
Take action=0: state from 0 -> 0, reward=0.0
Take action=3: state from 0 -> 0, reward=0.0
Take action=1: state from 0 -> 4, reward=0.0
Take action=1: state from 4 -> 4, reward=0.0
Take action=3: state from 4 -> 5, reward=0.0, done=True, episode_length=7
	Episode=28...
Take action=1: state from (0, {'prob': 1}) -> 4, reward=0.0
Take action=0: state from 4 -> 8, reward=0.0
Take action=3: state from 8 -> 9, reward=0.0
Take action=3: state from 9 -> 10, reward=0.0
Take action=0: state from 10 -> 14, reward=0.0
Take action=3: state from 14 -> 15, reward=1.0, done=True, episode_length=6
	Episode=29...
Take action=1: state from (0, {'prob': 1}) -> 4, reward=0.0
Take action=0: state from 4 -> 0, reward=0.0
Take action=3: state from 0 -> 0, reward=0.0
Take action=3: state from 0 -> 0, reward=0.0
Take action=0: state from 0 -> 0, reward=0.0
Take action=0: state from 0 -> 0, reward=0.0
Take action=2: state from 0 -> 4, reward=0.0
Take action=3: state from 4 -> 4, reward=0.0
Take action=2: state from 4 -> 8, reward=0.0
Take action=0: state from 8 -> 8, reward=0.0
Take action=3: state from 8 -> 4, reward=0.0
Take action=1: state from 4 -> 8, reward=0.0
Take action=2: state from 8 -> 9, reward=0.0
Take action=2: state from 9 -> 13, reward=0.0
Take action=1: state from 13 -> 13, reward=0.0
Take action=2: state from 13 -> 13, reward=0.0
Take action=1: state from 13 -> 12, reward=0.0, done=True, episode_length=17
Episode: 0029, Success:  1%, Avg length: 02, Epsilon: 0.74, Avg time(s): 0.90
success percent=100.0%
	Episode=30...
Take action=0: state from (0, {'prob': 1}) -> 0, reward=0.0
Take action=3: state from 0 -> 1, reward=0.0
Take action=3: state from 1 -> 2, reward=0.0
Take action=0: state from 2 -> 2, reward=0.0
Take action=2: state from 2 -> 3, reward=0.0
Take action=2: state from 3 -> 7, reward=0.0, done=True, episode_length=6
	Episode=31...
Take action=1: state from (0, {'prob': 1}) -> 4, reward=0.0
Take action=0: state from 4 -> 8, reward=0.0
Take action=3: state from 8 -> 8, reward=0.0
Take action=1: state from 8 -> 12, reward=0.0, done=True, episode_length=4
	Episode=32...
Take action=0: state from (0, {'prob': 1}) -> 4, reward=0.0
Take action=2: state from 4 -> 0, reward=0.0
Take action=0: state from 0 -> 4, reward=0.0
Take action=3: state from 4 -> 5, reward=0.0, done=True, episode_length=4
	Episode=33...
Take action=0: state from (0, {'prob': 1}) -> 4, reward=0.0
Take action=3: state from 4 -> 0, reward=0.0
Take action=0: state from 0 -> 0, reward=0.0
Take action=3: state from 0 -> 0, reward=0.0
Take action=3: state from 0 -> 1, reward=0.0
Take action=1: state from 1 -> 2, reward=0.0
Take action=2: state from 2 -> 6, reward=0.0
Take action=1: state from 6 -> 10, reward=0.0
Take action=1: state from 10 -> 11, reward=0.0, done=True, episode_length=9
	Episode=34...
Take action=0: state from (0, {'prob': 1}) -> 0, reward=0.0
Take action=0: state from 0 -> 0, reward=0.0
Take action=0: state from 0 -> 4, reward=0.0
Take action=3: state from 4 -> 4, reward=0.0
Take action=3: state from 4 -> 4, reward=0.0
Take action=1: state from 4 -> 5, reward=0.0, done=True, episode_length=6
	Episode=35...
Take action=3: state from (0, {'prob': 1}) -> 0, reward=0.0
Take action=3: state from 0 -> 0, reward=0.0
Take action=0: state from 0 -> 0, reward=0.0
Take action=0: state from 0 -> 0, reward=0.0
Take action=1: state from 0 -> 0, reward=0.0
Take action=1: state from 0 -> 0, reward=0.0
Take action=1: state from 0 -> 0, reward=0.0
Take action=0: state from 0 -> 4, reward=0.0
Take action=3: state from 4 -> 5, reward=0.0, done=True, episode_length=9
	Episode=36...
Take action=0: state from (0, {'prob': 1}) -> 4, reward=0.0
Take action=3: state from 4 -> 4, reward=0.0
Take action=3: state from 4 -> 4, reward=0.0
Take action=1: state from 4 -> 4, reward=0.0
Take action=2: state from 4 -> 8, reward=0.0
Take action=3: state from 8 -> 4, reward=0.0
Take action=0: state from 4 -> 4, reward=0.0
Take action=3: state from 4 -> 4, reward=0.0
Take action=3: state from 4 -> 4, reward=0.0
Take action=1: state from 4 -> 5, reward=0.0, done=True, episode_length=10
	Episode=37...
Take action=0: state from (0, {'prob': 1}) -> 0, reward=0.0
Take action=0: state from 0 -> 0, reward=0.0
Take action=1: state from 0 -> 1, reward=0.0
Take action=1: state from 1 -> 2, reward=0.0
Take action=0: state from 2 -> 1, reward=0.0
Take action=2: state from 1 -> 5, reward=0.0, done=True, episode_length=6
	Episode=38...
Take action=2: state from (0, {'prob': 1}) -> 4, reward=0.0
Take action=3: state from 4 -> 0, reward=0.0
Take action=0: state from 0 -> 0, reward=0.0
Take action=0: state from 0 -> 0, reward=0.0
Take action=2: state from 0 -> 0, reward=0.0
Take action=2: state from 0 -> 0, reward=0.0
Take action=1: state from 0 -> 0, reward=0.0
Take action=0: state from 0 -> 0, reward=0.0
Take action=0: state from 0 -> 0, reward=0.0
Take action=0: state from 0 -> 0, reward=0.0
Take action=0: state from 0 -> 4, reward=0.0
Take action=2: state from 4 -> 0, reward=0.0
Take action=0: state from 0 -> 0, reward=0.0
Take action=0: state from 0 -> 0, reward=0.0
Take action=0: state from 0 -> 0, reward=0.0
Take action=0: state from 0 -> 4, reward=0.0
Take action=0: state from 4 -> 8, reward=0.0
Take action=0: state from 8 -> 4, reward=0.0
Take action=3: state from 4 -> 4, reward=0.0
Take action=3: state from 4 -> 0, reward=0.0
Take action=0: state from 0 -> 0, reward=0.0
Take action=3: state from 0 -> 0, reward=0.0
Take action=0: state from 0 -> 0, reward=0.0
Take action=3: state from 0 -> 0, reward=0.0
Take action=3: state from 0 -> 0, reward=0.0
Take action=0: state from 0 -> 0, reward=0.0
Take action=0: state from 0 -> 4, reward=0.0
Take action=1: state from 4 -> 4, reward=0.0


Take action=0: state from 4 -> 4, reward=0.0
Take action=3: state from 4 -> 5, reward=0.0, done=True, episode_length=30
	Episode=39...
Take action=1: state from (0, {'prob': 1}) -> 1, reward=0.0
Take action=3: state from 1 -> 1, reward=0.0
Take action=2: state from 1 -> 5, reward=0.0, done=True, episode_length=3
Episode: 0039, Success:  1%, Avg length: 03, Epsilon: 0.67, Avg time(s): 1.38
success percent=100.0%
	Episode=40...
Take action=3: state from (0, {'prob': 1}) -> 0, reward=0.0
Take action=2: state from 0 -> 0, reward=0.0
Take action=1: state from 0 -> 0, reward=0.0
Take action=0: state from 0 -> 0, reward=0.0
Take action=3: state from 0 -> 0, reward=0.0
Take action=2: state from 0 -> 0, reward=0.0
Take action=2: state from 0 -> 1, reward=0.0
Take action=1: state from 1 -> 2, reward=0.0
Take action=0: state from 2 -> 6, reward=0.0
Take action=2: state from 6 -> 7, reward=0.0, done=True, episode_length=10


	Episode=41...
Take action=1: state from (0, {'prob': 1}) -> 0, reward=0.0
Take action=1: state from 0 -> 1, reward=0.0
Take action=1: state from 1 -> 2, reward=0.0
Take action=1: state from 2 -> 3, reward=0.0
Take action=3: state from 3 -> 2, reward=0.0
Take action=3: state from 2 -> 2, reward=0.0
Take action=1: state from 2 -> 6, reward=0.0
Take action=1: state from 6 -> 5, reward=0.0, done=True, episode_length=8
	Episode=42...
Take action=2: state from (0, {'prob': 1}) -> 4, reward=0.0
Take action=3: state from 4 -> 4, reward=0.0
Take action=3: state from 4 -> 5, reward=0.0, done=True, episode_length=3
	Episode=43...
Take action=0: state from (0, {'prob': 1}) -> 4, reward=0.0
Take action=3: state from 4 -> 5, reward=0.0, done=True, episode_length=2
	Episode=44...
Take action=0: state from (0, {'prob': 1}) -> 0, reward=0.0


Take action=2: state from 0 -> 1, reward=0.0
Take action=2: state from 1 -> 1, reward=0.0
Take action=0: state from 1 -> 5, reward=0.0, done=True, episode_length=4
	Episode=45...
Take action=2: state from (0, {'prob': 1}) -> 4, reward=0.0
Take action=3: state from 4 -> 4, reward=0.0
Take action=3: state from 4 -> 4, reward=0.0
Take action=3: state from 4 -> 5, reward=0.0, done=True, episode_length=4
	Episode=46...
Take action=0: state from (0, {'prob': 1}) -> 4, reward=0.0
Take action=3: state from 4 -> 0, reward=0.0
Take action=2: state from 0 -> 0, reward=0.0
Take action=0: state from 0 -> 4, reward=0.0
Take action=0: state from 4 -> 4, reward=0.0
Take action=3: state from 4 -> 4, reward=0.0
Take action=2: state from 4 -> 5, reward=0.0, done=True, episode_length=7
	Episode=47...
Take action=0: state from (0, {'prob': 1}) -> 0, reward=0.0


Take action=0: state from 0 -> 0, reward=0.0
Take action=2: state from 0 -> 1, reward=0.0
Take action=1: state from 1 -> 0, reward=0.0
Take action=0: state from 0 -> 0, reward=0.0
Take action=2: state from 0 -> 1, reward=0.0
Take action=0: state from 1 -> 0, reward=0.0
Take action=1: state from 0 -> 1, reward=0.0
Take action=1: state from 1 -> 2, reward=0.0
Take action=1: state from 2 -> 3, reward=0.0
Take action=3: state from 3 -> 2, reward=0.0
Take action=0: state from 2 -> 6, reward=0.0
Take action=0: state from 6 -> 2, reward=0.0
Take action=2: state from 2 -> 3, reward=0.0
Take action=0: state from 3 -> 7, reward=0.0, done=True, episode_length=15
	Episode=48...
Take action=2: state from (0, {'prob': 1}) -> 4, reward=0.0
Take action=3: state from 4 -> 0, reward=0.0


Take action=3: state from 0 -> 1, reward=0.0
Take action=3: state from 1 -> 0, reward=0.0
Take action=0: state from 0 -> 0, reward=0.0
Take action=1: state from 0 -> 0, reward=0.0
Take action=3: state from 0 -> 0, reward=0.0
Take action=0: state from 0 -> 0, reward=0.0
Take action=0: state from 0 -> 4, reward=0.0
Take action=3: state from 4 -> 0, reward=0.0
Take action=1: state from 0 -> 0, reward=0.0
Take action=0: state from 0 -> 0, reward=0.0
Take action=3: state from 0 -> 1, reward=0.0
Take action=1: state from 1 -> 2, reward=0.0
Take action=2: state from 2 -> 2, reward=0.0
Take action=3: state from 2 -> 1, reward=0.0
Take action=1: state from 1 -> 0, reward=0.0


Take action=3: state from 0 -> 0, reward=0.0
Take action=0: state from 0 -> 4, reward=0.0
Take action=0: state from 4 -> 8, reward=0.0
Take action=2: state from 8 -> 12, reward=0.0, done=True, episode_length=21
	Episode=49...
Take action=0: state from (0, {'prob': 1}) -> 0, reward=0.0
Take action=0: state from 0 -> 0, reward=0.0
Take action=0: state from 0 -> 0, reward=0.0
Take action=0: state from 0 -> 0, reward=0.0
Take action=2: state from 0 -> 0, reward=0.0
Take action=3: state from 0 -> 0, reward=0.0
Take action=2: state from 0 -> 0, reward=0.0
Take action=0: state from 0 -> 0, reward=0.0
Take action=0: state from 0 -> 0, reward=0.0
Take action=0: state from 0 -> 4, reward=0.0
Take action=3: state from 4 -> 0, reward=0.0


Take action=0: state from 0 -> 4, reward=0.0
Take action=3: state from 4 -> 4, reward=0.0
Take action=3: state from 4 -> 0, reward=0.0
Take action=0: state from 0 -> 0, reward=0.0
Take action=0: state from 0 -> 0, reward=0.0
Take action=1: state from 0 -> 4, reward=0.0
Take action=2: state from 4 -> 5, reward=0.0, done=True, episode_length=18
Episode: 0049, Success:  1%, Avg length: 03, Epsilon: 0.61, Avg time(s): 1.73
success percent=100.0%
	Episode=50...
Take action=3: state from (0, {'prob': 1}) -> 0, reward=0.0
Take action=0: state from 0 -> 4, reward=0.0
Take action=1: state from 4 -> 4, reward=0.0
Take action=0: state from 4 -> 0, reward=0.0
Take action=0: state from 0 -> 0, reward=0.0
Take action=0: state from 0 -> 0, reward=0.0
Take action=0: state from 0 -> 0, reward=0.0
Take action=3: state from 0 -> 0, reward=0.0


Take action=0: state from 0 -> 4, reward=0.0
Take action=2: state from 4 -> 8, reward=0.0
Take action=0: state from 8 -> 12, reward=0.0, done=True, episode_length=11
	Episode=51...
Take action=0: state from (0, {'prob': 1}) -> 0, reward=0.0
Take action=0: state from 0 -> 0, reward=0.0
Take action=0: state from 0 -> 0, reward=0.0
Take action=2: state from 0 -> 0, reward=0.0
Take action=0: state from 0 -> 0, reward=0.0
Take action=1: state from 0 -> 0, reward=0.0
Take action=0: state from 0 -> 0, reward=0.0
Take action=2: state from 0 -> 4, reward=0.0
Take action=3: state from 4 -> 5, reward=0.0, done=True, episode_length=9
	Episode=52...
Take action=3: state from (0, {'prob': 1}) -> 0, reward=0.0
Take action=1: state from 0 -> 0, reward=0.0
Take action=1: state from 0 -> 1, reward=0.0


Take action=3: state from 1 -> 2, reward=0.0
Take action=2: state from 2 -> 6, reward=0.0
Take action=1: state from 6 -> 10, reward=0.0
Take action=0: state from 10 -> 9, reward=0.0
Take action=2: state from 9 -> 5, reward=0.0, done=True, episode_length=8
	Episode=53...
Take action=3: state from (0, {'prob': 1}) -> 1, reward=0.0
Take action=0: state from 1 -> 0, reward=0.0
Take action=0: state from 0 -> 0, reward=0.0
Take action=1: state from 0 -> 4, reward=0.0
Take action=3: state from 4 -> 0, reward=0.0
Take action=0: state from 0 -> 0, reward=0.0
Take action=0: state from 0 -> 0, reward=0.0
Take action=2: state from 0 -> 4, reward=0.0
Take action=3: state from 4 -> 5, reward=0.0, done=True, episode_length=9
	Episode=54...
Take action=0: state from (0, {'prob': 1}) -> 4, reward=0.0


Take action=3: state from 4 -> 0, reward=0.0
Take action=0: state from 0 -> 0, reward=0.0
Take action=1: state from 0 -> 0, reward=0.0
Take action=0: state from 0 -> 0, reward=0.0
Take action=2: state from 0 -> 1, reward=0.0
Take action=1: state from 1 -> 0, reward=0.0
Take action=0: state from 0 -> 4, reward=0.0
Take action=3: state from 4 -> 4, reward=0.0
Take action=3: state from 4 -> 4, reward=0.0
Take action=3: state from 4 -> 5, reward=0.0, done=True, episode_length=11
	Episode=55...
Take action=0: state from (0, {'prob': 1}) -> 0, reward=0.0
Take action=0: state from 0 -> 4, reward=0.0
Take action=3: state from 4 -> 4, reward=0.0
Take action=1: state from 4 -> 5, reward=0.0, done=True, episode_length=4
	Episode=56...
Take action=0: state from (0, {'prob': 1}) -> 0, reward=0.0


Take action=1: state from 0 -> 0, reward=0.0
Take action=0: state from 0 -> 0, reward=0.0
Take action=2: state from 0 -> 1, reward=0.0
Take action=1: state from 1 -> 5, reward=0.0, done=True, episode_length=5
	Episode=57...
Take action=2: state from (0, {'prob': 1}) -> 1, reward=0.0
Take action=1: state from 1 -> 5, reward=0.0, done=True, episode_length=2
	Episode=58...
Take action=2: state from (0, {'prob': 1}) -> 0, reward=0.0
Take action=2: state from 0 -> 0, reward=0.0
Take action=3: state from 0 -> 0, reward=0.0
Take action=1: state from 0 -> 4, reward=0.0
Take action=0: state from 4 -> 4, reward=0.0
Take action=3: state from 4 -> 0, reward=0.0
Take action=0: state from 0 -> 4, reward=0.0
Take action=3: state from 4 -> 0, reward=0.0
Take action=3: state from 0 -> 0, reward=0.0


Take action=2: state from 0 -> 4, reward=0.0
Take action=2: state from 4 -> 8, reward=0.0
Take action=0: state from 8 -> 4, reward=0.0
Take action=3: state from 4 -> 5, reward=0.0, done=True, episode_length=13
	Episode=59...
Take action=2: state from (0, {'prob': 1}) -> 1, reward=0.0
Take action=1: state from 1 -> 2, reward=0.0
Take action=1: state from 2 -> 3, reward=0.0
Take action=3: state from 3 -> 3, reward=0.0
Take action=3: state from 3 -> 3, reward=0.0
Take action=3: state from 3 -> 3, reward=0.0
Take action=3: state from 3 -> 3, reward=0.0
Take action=1: state from 3 -> 2, reward=0.0
Take action=1: state from 2 -> 6, reward=0.0
Take action=2: state from 6 -> 2, reward=0.0
Take action=0: state from 2 -> 6, reward=0.0


Take action=1: state from 6 -> 5, reward=0.0, done=True, episode_length=12
Episode: 0059, Success:  1%, Avg length: 04, Epsilon: 0.55, Avg time(s): 2.16
success percent=100.0%
	Episode=60...
Take action=3: state from (0, {'prob': 1}) -> 0, reward=0.0
Take action=0: state from 0 -> 0, reward=0.0
Take action=0: state from 0 -> 0, reward=0.0
Take action=0: state from 0 -> 0, reward=0.0
Take action=0: state from 0 -> 0, reward=0.0
Take action=0: state from 0 -> 0, reward=0.0
Take action=0: state from 0 -> 0, reward=0.0
Take action=0: state from 0 -> 0, reward=0.0
Take action=2: state from 0 -> 1, reward=0.0
Take action=2: state from 1 -> 2, reward=0.0
Take action=0: state from 2 -> 2, reward=0.0
Take action=1: state from 2 -> 6, reward=0.0
Take action=2: state from 6 -> 2, reward=0.0
Take action=3: state from 2 -> 2, reward=0.0


Take action=3: state from 2 -> 3, reward=0.0
Take action=3: state from 3 -> 3, reward=0.0
Take action=0: state from 3 -> 2, reward=0.0
Take action=1: state from 2 -> 3, reward=0.0
Take action=2: state from 3 -> 3, reward=0.0
Take action=3: state from 3 -> 3, reward=0.0
Take action=2: state from 3 -> 7, reward=0.0, done=True, episode_length=21
	Episode=61...
Take action=0: state from (0, {'prob': 1}) -> 0, reward=0.0
Take action=1: state from 0 -> 0, reward=0.0
Take action=0: state from 0 -> 0, reward=0.0
Take action=0: state from 0 -> 0, reward=0.0
Take action=0: state from 0 -> 0, reward=0.0
Take action=0: state from 0 -> 4, reward=0.0
Take action=0: state from 4 -> 4, reward=0.0
Take action=3: state from 4 -> 5, reward=0.0, done=True, episode_length=8


	Episode=62...
Take action=0: state from (0, {'prob': 1}) -> 4, reward=0.0
Take action=0: state from 4 -> 8, reward=0.0
Take action=3: state from 8 -> 8, reward=0.0
Take action=2: state from 8 -> 4, reward=0.0
Take action=3: state from 4 -> 4, reward=0.0
Take action=0: state from 4 -> 4, reward=0.0
Take action=0: state from 4 -> 4, reward=0.0
Take action=3: state from 4 -> 0, reward=0.0
Take action=0: state from 0 -> 0, reward=0.0
Take action=0: state from 0 -> 0, reward=0.0
Take action=0: state from 0 -> 0, reward=0.0
Take action=1: state from 0 -> 4, reward=0.0
Take action=2: state from 4 -> 0, reward=0.0
Take action=0: state from 0 -> 4, reward=0.0
Take action=2: state from 4 -> 5, reward=0.0, done=True, episode_length=15


	Episode=63...
Take action=0: state from (0, {'prob': 1}) -> 4, reward=0.0
Take action=3: state from 4 -> 5, reward=0.0, done=True, episode_length=2
	Episode=64...
Take action=2: state from (0, {'prob': 1}) -> 1, reward=0.0
Take action=3: state from 1 -> 1, reward=0.0
Take action=2: state from 1 -> 5, reward=0.0, done=True, episode_length=3
	Episode=65...
Take action=2: state from (0, {'prob': 1}) -> 0, reward=0.0
Take action=0: state from 0 -> 0, reward=0.0
Take action=0: state from 0 -> 0, reward=0.0
Take action=1: state from 0 -> 4, reward=0.0
Take action=3: state from 4 -> 5, reward=0.0, done=True, episode_length=5
	Episode=66...
Take action=0: state from (0, {'prob': 1}) -> 0, reward=0.0
Take action=0: state from 0 -> 0, reward=0.0
Take action=0: state from 0 -> 0, reward=0.0
Take action=0: state from 0 -> 4, reward=0.0
Take action=2: state from 4 -> 5, reward=0.0, done=True, episode_length=5


	Episode=67...
Take action=0: state from (0, {'prob': 1}) -> 0, reward=0.0
Take action=0: state from 0 -> 0, reward=0.0
Take action=2: state from 0 -> 1, reward=0.0
Take action=1: state from 1 -> 5, reward=0.0, done=True, episode_length=4
	Episode=68...
Take action=0: state from (0, {'prob': 1}) -> 0, reward=0.0
Take action=3: state from 0 -> 0, reward=0.0
Take action=2: state from 0 -> 0, reward=0.0
Take action=0: state from 0 -> 0, reward=0.0
Take action=0: state from 0 -> 0, reward=0.0
Take action=1: state from 0 -> 4, reward=0.0
Take action=3: state from 4 -> 0, reward=0.0
Take action=0: state from 0 -> 4, reward=0.0
Take action=3: state from 4 -> 0, reward=0.0
Take action=0: state from 0 -> 0, reward=0.0
Take action=1: state from 0 -> 1, reward=0.0


Take action=1: state from 1 -> 0, reward=0.0
Take action=1: state from 0 -> 1, reward=0.0
Take action=1: state from 1 -> 5, reward=0.0, done=True, episode_length=14
	Episode=69...
Take action=0: state from (0, {'prob': 1}) -> 4, reward=0.0
Take action=1: state from 4 -> 5, reward=0.0, done=True, episode_length=2
Episode: 0069, Success:  1%, Avg length: 05, Epsilon: 0.49, Avg time(s): 2.59
success percent=100.0%
	Episode=70...
Take action=2: state from (0, {'prob': 1}) -> 0, reward=0.0
Take action=1: state from 0 -> 1, reward=0.0
Take action=2: state from 1 -> 2, reward=0.0
Take action=1: state from 2 -> 3, reward=0.0
Take action=3: state from 3 -> 2, reward=0.0
Take action=0: state from 2 -> 1, reward=0.0
Take action=1: state from 1 -> 0, reward=0.0
Take action=0: state from 0 -> 4, reward=0.0
Take action=3: state from 4 -> 5, reward=0.0, done=True, episode_length=9


	Episode=71...
Take action=0: state from (0, {'prob': 1}) -> 0, reward=0.0
Take action=0: state from 0 -> 0, reward=0.0
Take action=0: state from 0 -> 0, reward=0.0
Take action=0: state from 0 -> 0, reward=0.0
Take action=0: state from 0 -> 0, reward=0.0
Take action=0: state from 0 -> 0, reward=0.0
Take action=0: state from 0 -> 0, reward=0.0
Take action=3: state from 0 -> 0, reward=0.0
Take action=1: state from 0 -> 1, reward=0.0
Take action=1: state from 1 -> 0, reward=0.0
Take action=0: state from 0 -> 4, reward=0.0
Take action=0: state from 4 -> 8, reward=0.0
Take action=3: state from 8 -> 4, reward=0.0
Take action=3: state from 4 -> 0, reward=0.0
Take action=0: state from 0 -> 0, reward=0.0


Take action=0: state from 0 -> 0, reward=0.0
Take action=1: state from 0 -> 1, reward=0.0
Take action=2: state from 1 -> 1, reward=0.0
Take action=3: state from 1 -> 2, reward=0.0
Take action=1: state from 2 -> 3, reward=0.0
Take action=1: state from 3 -> 7, reward=0.0, done=True, episode_length=21
	Episode=72...
Take action=0: state from (0, {'prob': 1}) -> 0, reward=0.0
Take action=0: state from 0 -> 0, reward=0.0
Take action=0: state from 0 -> 0, reward=0.0
Take action=0: state from 0 -> 0, reward=0.0
Take action=3: state from 0 -> 0, reward=0.0
Take action=0: state from 0 -> 0, reward=0.0
Take action=0: state from 0 -> 0, reward=0.0
Take action=0: state from 0 -> 0, reward=0.0
Take action=0: state from 0 -> 0, reward=0.0


Take action=3: state from 0 -> 0, reward=0.0
Take action=2: state from 0 -> 0, reward=0.0
Take action=0: state from 0 -> 4, reward=0.0
Take action=1: state from 4 -> 8, reward=0.0
Take action=1: state from 8 -> 12, reward=0.0, done=True, episode_length=14
	Episode=73...
Take action=2: state from (0, {'prob': 1}) -> 4, reward=0.0
Take action=3: state from 4 -> 4, reward=0.0
Take action=0: state from 4 -> 0, reward=0.0
Take action=0: state from 0 -> 4, reward=0.0
Take action=3: state from 4 -> 4, reward=0.0
Take action=3: state from 4 -> 0, reward=0.0
Take action=1: state from 0 -> 1, reward=0.0
Take action=0: state from 1 -> 1, reward=0.0
Take action=1: state from 1 -> 0, reward=0.0
Take action=2: state from 0 -> 0, reward=0.0
Take action=1: state from 0 -> 1, reward=0.0


Take action=1: state from 1 -> 0, reward=0.0
Take action=2: state from 0 -> 0, reward=0.0
Take action=0: state from 0 -> 0, reward=0.0
Take action=0: state from 0 -> 4, reward=0.0
Take action=3: state from 4 -> 0, reward=0.0
Take action=1: state from 0 -> 4, reward=0.0
Take action=3: state from 4 -> 0, reward=0.0
Take action=0: state from 0 -> 0, reward=0.0
Take action=0: state from 0 -> 4, reward=0.0
Take action=0: state from 4 -> 0, reward=0.0
Take action=0: state from 0 -> 0, reward=0.0
Take action=0: state from 0 -> 4, reward=0.0
Take action=3: state from 4 -> 5, reward=0.0, done=True, episode_length=24
	Episode=74...
Take action=3: state from (0, {'prob': 1}) -> 0, reward=0.0
Take action=0: state from 0 -> 0, reward=0.0


Take action=0: state from 0 -> 0, reward=0.0
Take action=1: state from 0 -> 4, reward=0.0
Take action=0: state from 4 -> 0, reward=0.0
Take action=0: state from 0 -> 4, reward=0.0
Take action=1: state from 4 -> 4, reward=0.0
Take action=3: state from 4 -> 5, reward=0.0, done=True, episode_length=8
	Episode=75...
Take action=0: state from (0, {'prob': 1}) -> 0, reward=0.0
Take action=2: state from 0 -> 4, reward=0.0
Take action=3: state from 4 -> 5, reward=0.0, done=True, episode_length=3
	Episode=76...
Take action=0: state from (0, {'prob': 1}) -> 4, reward=0.0
Take action=3: state from 4 -> 4, reward=0.0
Take action=1: state from 4 -> 4, reward=0.0
Take action=3: state from 4 -> 0, reward=0.0
Take action=0: state from 0 -> 0, reward=0.0
Take action=1: state from 0 -> 1, reward=0.0


Take action=1: state from 1 -> 2, reward=0.0
Take action=1: state from 2 -> 1, reward=0.0
Take action=1: state from 1 -> 0, reward=0.0
Take action=0: state from 0 -> 0, reward=0.0
Take action=0: state from 0 -> 4, reward=0.0
Take action=3: state from 4 -> 0, reward=0.0
Take action=2: state from 0 -> 1, reward=0.0
Take action=1: state from 1 -> 0, reward=0.0
Take action=1: state from 0 -> 1, reward=0.0
Take action=1: state from 1 -> 0, reward=0.0
Take action=2: state from 0 -> 1, reward=0.0
Take action=1: state from 1 -> 2, reward=0.0
Take action=1: state from 2 -> 1, reward=0.0
Take action=1: state from 1 -> 5, reward=0.0, done=True, episode_length=20
	Episode=77...
Take action=0: state from (0, {'prob': 1}) -> 4, reward=0.0


Take action=3: state from 4 -> 0, reward=0.0
Take action=2: state from 0 -> 4, reward=0.0
Take action=3: state from 4 -> 5, reward=0.0, done=True, episode_length=4
	Episode=78...
Take action=0: state from (0, {'prob': 1}) -> 4, reward=0.0
Take action=3: state from 4 -> 4, reward=0.0
Take action=3: state from 4 -> 4, reward=0.0
Take action=3: state from 4 -> 4, reward=0.0
Take action=3: state from 4 -> 0, reward=0.0
Take action=0: state from 0 -> 0, reward=0.0
Take action=0: state from 0 -> 4, reward=0.0
Take action=3: state from 4 -> 5, reward=0.0, done=True, episode_length=8
	Episode=79...
Take action=1: state from (0, {'prob': 1}) -> 4, reward=0.0
Take action=2: state from 4 -> 5, reward=0.0, done=True, episode_length=2
Episode: 0079, Success:  1%, Avg length: 06, Epsilon: 0.45, Avg time(s): 3.13
success percent=100.0%
	Episode=80...
Take action=1: state from (0, {'prob': 1}) -> 4, reward=0.0


Take action=3: state from 4 -> 4, reward=0.0
Take action=3: state from 4 -> 4, reward=0.0
Take action=3: state from 4 -> 0, reward=0.0
Take action=0: state from 0 -> 0, reward=0.0
Take action=0: state from 0 -> 4, reward=0.0
Take action=3: state from 4 -> 4, reward=0.0
Take action=3: state from 4 -> 0, reward=0.0
Take action=3: state from 0 -> 0, reward=0.0
Take action=2: state from 0 -> 0, reward=0.0
Take action=2: state from 0 -> 1, reward=0.0
Take action=2: state from 1 -> 5, reward=0.0, done=True, episode_length=12
	Episode=81...
Take action=0: state from (0, {'prob': 1}) -> 4, reward=0.0
Take action=3: state from 4 -> 5, reward=0.0, done=True, episode_length=2
	Episode=82...
Take action=3: state from (0, {'prob': 1}) -> 1, reward=0.0
Take action=1: state from 1 -> 5, reward=0.0, done=True, episode_length=2


	Episode=83...
Take action=0: state from (0, {'prob': 1}) -> 0, reward=0.0
Take action=1: state from 0 -> 0, reward=0.0
Take action=0: state from 0 -> 0, reward=0.0
Take action=1: state from 0 -> 1, reward=0.0
Take action=2: state from 1 -> 5, reward=0.0, done=True, episode_length=5
	Episode=84...
Take action=0: state from (0, {'prob': 1}) -> 4, reward=0.0
Take action=3: state from 4 -> 0, reward=0.0
Take action=0: state from 0 -> 0, reward=0.0
Take action=0: state from 0 -> 0, reward=0.0
Take action=0: state from 0 -> 0, reward=0.0
Take action=0: state from 0 -> 0, reward=0.0
Take action=0: state from 0 -> 0, reward=0.0
Take action=3: state from 0 -> 1, reward=0.0
Take action=2: state from 1 -> 5, reward=0.0, done=True, episode_length=9
	Episode=85...
Take action=0: state from (0, {'prob': 1}) -> 0, reward=0.0


Take action=0: state from 0 -> 0, reward=0.0
Take action=0: state from 0 -> 0, reward=0.0
Take action=0: state from 0 -> 4, reward=0.0
Take action=3: state from 4 -> 0, reward=0.0
Take action=1: state from 0 -> 4, reward=0.0
Take action=1: state from 4 -> 8, reward=0.0
Take action=0: state from 8 -> 4, reward=0.0
Take action=3: state from 4 -> 0, reward=0.0
Take action=0: state from 0 -> 0, reward=0.0
Take action=0: state from 0 -> 0, reward=0.0
Take action=3: state from 0 -> 0, reward=0.0
Take action=0: state from 0 -> 4, reward=0.0
Take action=3: state from 4 -> 4, reward=0.0
Take action=3: state from 4 -> 5, reward=0.0, done=True, episode_length=15
	Episode=86...
Take action=0: state from (0, {'prob': 1}) -> 0, reward=0.0


Take action=0: state from 0 -> 4, reward=0.0
Take action=2: state from 4 -> 0, reward=0.0
Take action=0: state from 0 -> 0, reward=0.0
Take action=2: state from 0 -> 1, reward=0.0
Take action=1: state from 1 -> 5, reward=0.0, done=True, episode_length=6
	Episode=87...
Take action=0: state from (0, {'prob': 1}) -> 0, reward=0.0
Take action=0: state from 0 -> 0, reward=0.0
Take action=0: state from 0 -> 4, reward=0.0
Take action=0: state from 4 -> 8, reward=0.0
Take action=3: state from 8 -> 4, reward=0.0
Take action=3: state from 4 -> 5, reward=0.0, done=True, episode_length=6
	Episode=88...
Take action=0: state from (0, {'prob': 1}) -> 0, reward=0.0
Take action=0: state from 0 -> 4, reward=0.0
Take action=2: state from 4 -> 5, reward=0.0, done=True, episode_length=3
	Episode=89...


Take action=0: state from (0, {'prob': 1}) -> 0, reward=0.0
Take action=2: state from 0 -> 1, reward=0.0
Take action=1: state from 1 -> 0, reward=0.0
Take action=0: state from 0 -> 0, reward=0.0
Take action=0: state from 0 -> 0, reward=0.0
Take action=0: state from 0 -> 0, reward=0.0
Take action=0: state from 0 -> 0, reward=0.0
Take action=0: state from 0 -> 0, reward=0.0
Take action=0: state from 0 -> 0, reward=0.0
Take action=0: state from 0 -> 0, reward=0.0
Take action=0: state from 0 -> 0, reward=0.0
Take action=1: state from 0 -> 0, reward=0.0
Take action=3: state from 0 -> 1, reward=0.0
Take action=3: state from 1 -> 0, reward=0.0
Take action=3: state from 0 -> 1, reward=0.0
Take action=2: state from 1 -> 2, reward=0.0


Take action=1: state from 2 -> 3, reward=0.0
Take action=2: state from 3 -> 7, reward=0.0, done=True, episode_length=18
Episode: 0089, Success:  1%, Avg length: 07, Epsilon: 0.40, Avg time(s): 3.44
success percent=100.0%
	Episode=90...
Take action=0: state from (0, {'prob': 1}) -> 4, reward=0.0
Take action=3: state from 4 -> 5, reward=0.0, done=True, episode_length=2
	Episode=91...
Take action=0: state from (0, {'prob': 1}) -> 0, reward=0.0
Take action=1: state from 0 -> 4, reward=0.0
Take action=3: state from 4 -> 0, reward=0.0
Take action=3: state from 0 -> 1, reward=0.0
Take action=1: state from 1 -> 0, reward=0.0
Take action=0: state from 0 -> 0, reward=0.0
Take action=0: state from 0 -> 0, reward=0.0
Take action=0: state from 0 -> 0, reward=0.0
Take action=1: state from 0 -> 0, reward=0.0
Take action=0: state from 0 -> 0, reward=0.0


Take action=0: state from 0 -> 0, reward=0.0
Take action=3: state from 0 -> 0, reward=0.0
Take action=1: state from 0 -> 0, reward=0.0
Take action=0: state from 0 -> 0, reward=0.0
Take action=0: state from 0 -> 4, reward=0.0
Take action=1: state from 4 -> 4, reward=0.0
Take action=3: state from 4 -> 0, reward=0.0
Take action=3: state from 0 -> 0, reward=0.0
Take action=0: state from 0 -> 0, reward=0.0
Take action=0: state from 0 -> 0, reward=0.0
Take action=0: state from 0 -> 0, reward=0.0
Take action=0: state from 0 -> 4, reward=0.0
Take action=3: state from 4 -> 4, reward=0.0
Take action=3: state from 4 -> 4, reward=0.0
Take action=0: state from 4 -> 8, reward=0.0
Take action=3: state from 8 -> 4, reward=0.0


Take action=1: state from 4 -> 8, reward=0.0
Take action=3: state from 8 -> 9, reward=0.0
Take action=2: state from 9 -> 5, reward=0.0, done=True, episode_length=29
	Episode=92...
Take action=0: state from (0, {'prob': 1}) -> 0, reward=0.0
Take action=0: state from 0 -> 0, reward=0.0
Take action=0: state from 0 -> 4, reward=0.0
Take action=0: state from 4 -> 0, reward=0.0
Take action=0: state from 0 -> 0, reward=0.0
Take action=0: state from 0 -> 0, reward=0.0
Take action=0: state from 0 -> 0, reward=0.0
Take action=1: state from 0 -> 1, reward=0.0
Take action=1: state from 1 -> 2, reward=0.0
Take action=1: state from 2 -> 1, reward=0.0
Take action=1: state from 1 -> 5, reward=0.0, done=True, episode_length=11


	Episode=93...
Take action=0: state from (0, {'prob': 1}) -> 0, reward=0.0
Take action=3: state from 0 -> 0, reward=0.0
Take action=3: state from 0 -> 0, reward=0.0
Take action=1: state from 0 -> 4, reward=0.0
Take action=3: state from 4 -> 5, reward=0.0, done=True, episode_length=5
	Episode=94...
Take action=0: state from (0, {'prob': 1}) -> 0, reward=0.0
Take action=0: state from 0 -> 4, reward=0.0
Take action=1: state from 4 -> 8, reward=0.0
Take action=3: state from 8 -> 9, reward=0.0
Take action=2: state from 9 -> 10, reward=0.0
Take action=3: state from 10 -> 9, reward=0.0
Take action=2: state from 9 -> 10, reward=0.0
Take action=3: state from 10 -> 9, reward=0.0
Take action=0: state from 9 -> 13, reward=0.0
Take action=1: state from 13 -> 14, reward=0.0


Take action=2: state from 14 -> 14, reward=0.0
Take action=3: state from 14 -> 13, reward=0.0
Take action=1: state from 13 -> 13, reward=0.0
Take action=1: state from 13 -> 12, reward=0.0, done=True, episode_length=14
	Episode=95...
Take action=0: state from (0, {'prob': 1}) -> 0, reward=0.0
Take action=0: state from 0 -> 4, reward=0.0
Take action=3: state from 4 -> 0, reward=0.0
Take action=0: state from 0 -> 0, reward=0.0
Take action=0: state from 0 -> 0, reward=0.0
Take action=0: state from 0 -> 4, reward=0.0
Take action=2: state from 4 -> 5, reward=0.0, done=True, episode_length=7
	Episode=96...
Take action=0: state from (0, {'prob': 1}) -> 4, reward=0.0
Take action=3: state from 4 -> 0, reward=0.0
Take action=0: state from 0 -> 0, reward=0.0


Take action=0: state from 0 -> 4, reward=0.0
Take action=3: state from 4 -> 0, reward=0.0
Take action=0: state from 0 -> 0, reward=0.0
Take action=0: state from 0 -> 0, reward=0.0
Take action=0: state from 0 -> 0, reward=0.0
Take action=0: state from 0 -> 0, reward=0.0
Take action=0: state from 0 -> 0, reward=0.0
Take action=0: state from 0 -> 0, reward=0.0
Take action=0: state from 0 -> 4, reward=0.0
Take action=3: state from 4 -> 5, reward=0.0, done=True, episode_length=13
	Episode=97...
Take action=0: state from (0, {'prob': 1}) -> 0, reward=0.0
Take action=0: state from 0 -> 0, reward=0.0
Take action=0: state from 0 -> 0, reward=0.0
Take action=3: state from 0 -> 0, reward=0.0
Take action=0: state from 0 -> 0, reward=0.0


Take action=0: state from 0 -> 0, reward=0.0
Take action=0: state from 0 -> 0, reward=0.0
Take action=0: state from 0 -> 4, reward=0.0
Take action=0: state from 4 -> 8, reward=0.0
Take action=3: state from 8 -> 9, reward=0.0
Take action=0: state from 9 -> 8, reward=0.0
Take action=3: state from 8 -> 4, reward=0.0
Take action=3: state from 4 -> 5, reward=0.0, done=True, episode_length=13
	Episode=98...
Take action=0: state from (0, {'prob': 1}) -> 0, reward=0.0
Take action=0: state from 0 -> 4, reward=0.0
Take action=3: state from 4 -> 4, reward=0.0
Take action=2: state from 4 -> 5, reward=0.0, done=True, episode_length=4
	Episode=99...
Take action=3: state from (0, {'prob': 1}) -> 0, reward=0.0
Take action=0: state from 0 -> 0, reward=0.0
Take action=2: state from 0 -> 0, reward=0.0


Take action=0: state from 0 -> 0, reward=0.0
Take action=0: state from 0 -> 0, reward=0.0
Take action=0: state from 0 -> 4, reward=0.0
Take action=1: state from 4 -> 8, reward=0.0
Take action=3: state from 8 -> 4, reward=0.0
Take action=3: state from 4 -> 0, reward=0.0
Take action=3: state from 0 -> 0, reward=0.0
Take action=0: state from 0 -> 0, reward=0.0
Take action=3: state from 0 -> 1, reward=0.0
Take action=2: state from 1 -> 5, reward=0.0, done=True, episode_length=13
Episode: 0099, Success:  1%, Avg length: 08, Epsilon: 0.37, Avg time(s): 4.01
success percent=100.0%


### <font color='darkgreen'>Discussion</font>
<font size='3ptx'><b>Replay buffer size balances the weight given to new trajectories against the weight given to old trajectories.</b></font>

As your agent improves, new trajectories are probably more rewarding than old trajectories. However, <b>using old trajectories makes your training more stable because your agent trains on more diverse data</b>.

Here, each episode lasts about 7 steps, and the agent's initial success rate is about 2%. To ensure that memory holds at least a few successful episodes, size the replay buffer to hold about 200 episodes: at a 2% success rate, that is roughly 4 successful episodes. 200 episodes are equivalent to about $200\cdot7 = 1400$ state transitions. Any buffer size in that range is okay.
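The sizing estimate above can be sketched in code. This is a minimal illustration, not this Colab's implementation: it assumes the buffer is a `collections.deque` with `maxlen`, and the names `estimate_buffer_size`, `sample_batch`, and `replay_buffer` are hypothetical.

```python
import random
from collections import deque

def estimate_buffer_size(avg_episode_length, num_episodes):
  """Returns the number of transitions needed to hold `num_episodes` episodes."""
  return avg_episode_length * num_episodes

def sample_batch(buffer, batch_size):
  """Samples up to `batch_size` transitions uniformly, without replacement."""
  return random.sample(list(buffer), min(batch_size, len(buffer)))

buffer_size = estimate_buffer_size(avg_episode_length=7, num_episodes=200)

# A deque with `maxlen` discards its oldest entry when a new one is appended
# to a full buffer, so old trajectories age out automatically.
replay_buffer = deque(maxlen=buffer_size)
for step in range(buffer_size + 100):
  # Placeholder tuple (assumed layout): (state, action, reward, state_new, done).
  replay_buffer.append((step, 0, 0.0, step + 1, False))

batch = sample_batch(replay_buffer, batch_size=8)
```

A larger `maxlen` keeps old trajectories around longer, while a smaller one makes each batch lean toward recent experience.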

Hyperparameter values that let the agent solve the environment are:
* `epsilon = 1.0`
* `eps_decay = 0.999`
* `discount_factor = 0.99`
* `episodes = 2000`
* `learning_rate = 0.2`
* `replay_buffer_size = 2000`
* `batch_size = 8`

Observations from training:
* Training using experience replay is slower because you're training on a batch of tuples instead of a single tuple.
* When compared to the previous Colab, your agent solves the environment in approximately the same number of episodes. Possible causes are:
  * Frozen Lake is not a complex enough environment for experience replay to be advantageous.
  * The hyperparameters are not correctly optimized.

### <font color='darkgreen'>Visualize Performance of Trained Model</font>
Seeing the metrics plots is one thing, but watching your agent succeed at retrieving the frisbee is another. Run the following code to visualize your agent solving `FrozenLake`.

In [36]:
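# reset() returns (observation, info); keep only the observation.
state, _ = env.reset()
done = False

epsilon = 0. # greedy policy
while not done:
  action = select_action(epsilon, state, model)
  # step() returns separate `terminated` and `truncated` flags.
  state_new, reward, terminated, truncated, _ = env.step(action)
  done = terminated or truncated
  state = state_new
  clear_output()
  env.render()
  time.sleep(2)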

### <font color='darkgreen'>Advantages of Experience Replay</font>
The advantages of experience replay over online DQN are as follows:

* Makes training more stable by training on batches of tuples sampled from memory instead of on single, successive tuples.
* Allows the agent to generalize better by remembering past experience.

<b>However, experience replay does not fully address the instability in DQN. The next section describes another technique to stabilize DQN training: target networks.</b>

### <font color='darkgreen'>Target Networks</font>
When you train the neural network using the Bellman update, you calculate the target Q-values for training using the neural network itself. Because the neural network trains on its own predictions, you create a feedback loop. Changes in the neural network's predictions can reinforce each other because the neural network chases its own fluctuating Q-values.

The effect of fluctuations in target Q-values is magnified because the Q-values for a state depend on Q-values of successive states. Hence, changes in a state's Q-value can lead to changes in previous states' Q-values.
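
In symbols, using notation that is an assumption here rather than taken from the Colab, let $Q(s, a; \theta)$ be the network's prediction with weights $\theta$, and let $\gamma$ be the discount factor. For a non-terminal transition $(s, a, r, s')$, online DQN minimizes the squared error

$$L(\theta) = \left(r + \gamma \max_{a'} Q(s', a'; \theta) - Q(s, a; \theta)\right)^2$$

The weights $\theta$ appear in both the prediction $Q(s, a; \theta)$ and the target $r + \gamma \max_{a'} Q(s', a'; \theta)$, so each training step also moves the target it is trying to reach.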

<b>To break the feedback loop, calculate target Q-values using a separate neural network, called a <font color='darkblue'>target network</font></b>. To stabilize training, let the target network lag behind the main neural network and synchronize it only gradually. The simplest approach is to copy the main network's weights to the target network every $N$ steps. Alternatively, on every step, move the target network's weights a small amount toward the main network's weights.
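
The two update styles can be sketched with plain Python lists standing in for network weights. The function names `hard_update` and `soft_update` and the mixing rate `tau` are illustrative assumptions, not part of the Colab. With Keras models, the same arithmetic would be applied to each array returned by `get_weights()` and written back with `set_weights()`.

```python
def hard_update(main_weights):
  """Returns a copy of the main weights (used once every N steps)."""
  return list(main_weights)

def soft_update(main_weights, target_weights, tau=0.01):
  """Moves each target weight a fraction `tau` toward the main weight."""
  return [tau * m + (1.0 - tau) * t
          for m, t in zip(main_weights, target_weights)]

main_weights = [1.0, 1.0, 1.0]
target_weights = [0.0, 0.0, 0.0]

# Soft update: the target moves 10% of the way toward the main weights.
target_weights = soft_update(main_weights, target_weights, tau=0.1)
print(target_weights)

# Hard update: the target becomes an exact copy of the main weights.
target_weights = hard_update(main_weights)
print(target_weights == main_weights)
```

A small `tau` plays the same role as a large `N`: both keep the targets nearly fixed over short stretches of training.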

The following schematic shows Q-learning with experience replay and target networks:
![Q-learning with experience replay and target networks](images/5_2.PNG)

Define a function to update the target network to the main neural network at a fixed interval of episodes:

In [41]:
def update_target_network(
    episode, update_target_network_interval, main_network, target_network):
  '''Copies the main network's weights to the target network once every
  `update_target_network_interval` episodes.

  Args:
    episode: integer representing episode number in agent's training.
    update_target_network_interval: integer representing interval of episodes
      on which `target_network` is updated to `main_network`.
    main_network: main neural network used to choose actions and train.
    target_network: neural network used to predict target Q-values.
  Returns:
    the `target_network`, whether updated or not.
  '''
  if (episode + 1) % update_target_network_interval == 0:
    target_network.set_weights(main_network.get_weights())
  return target_network
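
Because the check uses `episode + 1`, the copy happens at the end of every `update_target_network_interval`-th episode when counting from 1. The short sketch below lists the zero-based episode indices that trigger a copy for an interval of 10. It reproduces only the condition, not the weight copy itself.

```python
update_target_network_interval = 10

update_episodes = [
    episode for episode in range(30)
    if (episode + 1) % update_target_network_interval == 0
]
print(update_episodes)  # [9, 19, 29]
```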

The remaining steps consist of editing previously defined code to implement target networks.

1. Add a hyperparameter to control the interval for the target network update:
  
  a. Go to this [line](#scrollTo=dyqN5EQuhEqx&line=9&uniqifier=1) marked by `TODO`.
  
  b. Set this hyperparameter.

  > `update_target_network_interval = 10`

1. Define the target network on this [line](#scrollTo=dyqN5EQuhEqx&line=14&uniqifier=1) marked by `TODO`. Insert this code:

  > `target_network = define_model(learning_rate)`

1. Update the target network:

  a. Go to this [cell](#train_eval_dqn).
  
  b. Insert the call to `update_target_network` at the appropriate place.

1. Predict Q-values by using `target_network` instead of `model`:

  a. Go to this [line](#scrollTo=Evho_UrWhEqn&line=30) marked by `TODO`. You are in the function definition for `sample_from_replay_buffer_and_train_model`.
  
  b. Edit the line to predict target Q-values using `target_network` instead of `model`.
  
  c. Similarly, edit the following call to `compute_bellman_target` to use `target_network` instead of `model`.
  
  d. In the function's argument list, append the argument `target_network`. Accordingly, update the call to `sample_from_replay_buffer_and_train_model`.
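
As a rough illustration of steps b and c, the sketch below computes Bellman targets for a batch from a separate table of target Q-values. It is not the Colab's `sample_from_replay_buffer_and_train_model` or `compute_bellman_target`. The helper `bellman_targets`, the `done` flag in each transition, and the hand-written Q-values are all assumptions made so that the example runs on its own.

```python
def bellman_targets(batch, target_q, discount_factor):
  """Computes r + gamma * max_a' Q_target(s', a') for each transition.

  `target_q[state]` is a list of Q-values, one per action, standing in for
  the target network's prediction. Terminal transitions use the reward only.
  """
  targets = []
  for state, action, reward, state_next, done in batch:
    if done:
      targets.append(reward)
    else:
      targets.append(reward + discount_factor * max(target_q[state_next]))
  return targets

# Hypothetical target-network predictions for a 3-state, 2-action problem.
target_q = {0: [0.0, 0.5], 1: [0.2, 0.8], 2: [0.0, 0.0]}
batch = [
    (0, 1, 0.0, 1, False),  # non-terminal: bootstraps from state 1
    (1, 1, 1.0, 2, True),   # terminal: target is the reward alone
]
print(bellman_targets(batch, target_q, discount_factor=0.99))
```

The main network is still the one being trained toward these targets. Only the source of the bootstrapped value changes.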
 

In [42]:
def train_model_with_target_network():
  # Hyperparameters
  epsilon = 1.0
  eps_decay = 0.99
  discount_factor = 0.999
  episodes = 100
  learning_rate = 0.7
  replay_buffer_size = 2000
  batch_size = 8
  CHECK_SUCCESS_INTERVAL = 10
  update_target_network_interval = 10  

  # Parameters & model
  model = define_model(learning_rate)
  target_network = define_model(learning_rate)
  success_percent_threshold = 20 # in percent, so 20 = 20%
  replay_buffer = deque(maxlen = replay_buffer_size) # create new replay_buffer

  # Training metrics
  length_history = []
  reward_history = []
  time_history = []

  # Test if parameter values are valid
  assert eps_decay < 1.0 and eps_decay > 0.
  assert success_percent_threshold > 9 # agent could reach 9% randomly

  print("======= Begin Training =======")
  for episode in range(episodes):
    print(f'\tEpisode={episode}...')
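    state = env.reset()
    # Newer gym versions return (observation, info) from reset(); keep the observation.
    state = state[0] if isinstance(state, tuple) else state
    done = False
    episode_reward = 0
    episode_length = 0
    episode_time_start = time.time()
    while not done:
      episode_length += 1
      # The main network chooses actions; the target network only supplies target Q-values.
      action = select_action(epsilon, state, model)
      state_next, reward, terminated, truncated, _ = env.step(action)
      done = terminated or truncated # also stop when the time limit is reached
      if done:
        print(f'Take action={action}: state from {state} -> {state_next}, reward={reward}, done={done}, episode_length={episode_length}')
      else:
        print(f'Take action={action}: state from {state} -> {state_next}, reward={reward}')
      replay_buffer.append((state, action, reward, state_next))
      
      # Assumes `target_network` was appended to the function's arguments (step d above).
      model = sample_from_replay_buffer_and_train_model(
          replay_buffer, batch_size, model, discount_factor,
          target_network=target_network)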
    
      episode_reward += reward
      state = state_next

    update_target_network(
        episode, update_target_network_interval, model, target_network)
    
    # Decreasing epsilon here, instead of inside
    # sample_from_replay_buffer_and_train_model, introduces a possible edge
    # case: epsilon can start decreasing before the model starts training,
    # because the replay buffer does not yet hold a full batch.
    if epsilon > EPSILON_MIN:
      epsilon *= eps_decay
      
    length_history.append(episode_length)
    reward_history.append(episode_reward)
    time_history.append(time.time() - episode_time_start)
  
    if check_success(episode, epsilon, reward_history, length_history, time_history, success_percent_threshold):
      break
      
  return model

In [43]:
train_model_with_target_network()

Model: "sequential_5"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense_5 (Dense)             (None, 4)                 64        
                                                                 
Total params: 64
Trainable params: 64
Non-trainable params: 0
_________________________________________________________________
None
Model: "sequential_6"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense_6 (Dense)             (None, 4)                 64        
                                                                 
Total params: 64
Trainable params: 64
Non-trainable params: 0
_________________________________________________________________
None
	Episode=0...
Take action=3: state from (0, {'prob': 1}) -> 1, reward=0.0
Take action=3: state from 1 -> 0, reward=0.0
Take action=3: state from 0 -> 0, reward=0.

Take action=3: state from 2 -> 3, reward=0.0
Take action=3: state from 3 -> 3, reward=0.0
Take action=1: state from 3 -> 7, reward=0.0, done=True, episode_length=6
	Episode=3...
Take action=3: state from (0, {'prob': 1}) -> 0, reward=0.0
Take action=3: state from 0 -> 0, reward=0.0
Take action=3: state from 0 -> 1, reward=0.0
Take action=1: state from 1 -> 2, reward=0.0
Take action=1: state from 2 -> 1, reward=0.0
Take action=3: state from 1 -> 0, reward=0.0
Take action=3: state from 0 -> 0, reward=0.0
Take action=0: state from 0 -> 0, reward=0.0
Take action=2: state from 0 -> 0, reward=0.0
Take action=1: state from 0 -> 0, reward=0.0
Take action=2: state from 0 -> 4, reward=0.0
Take action=2: state from 4 -> 0, reward=0.0
Take action=1: state from 0 -> 1, reward=0.0


Take action=2: state from 1 -> 5, reward=0.0, done=True, episode_length=14
	Episode=4...
Take action=3: state from (0, {'prob': 1}) -> 1, reward=0.0
Take action=0: state from 1 -> 1, reward=0.0
Take action=0: state from 1 -> 5, reward=0.0, done=True, episode_length=3
	Episode=5...
Take action=1: state from (0, {'prob': 1}) -> 4, reward=0.0
Take action=3: state from 4 -> 5, reward=0.0, done=True, episode_length=2
	Episode=6...
Take action=0: state from (0, {'prob': 1}) -> 0, reward=0.0
Take action=0: state from 0 -> 0, reward=0.0
Take action=0: state from 0 -> 4, reward=0.0
Take action=0: state from 4 -> 0, reward=0.0
Take action=1: state from 0 -> 0, reward=0.0
Take action=0: state from 0 -> 0, reward=0.0
Take action=0: state from 0 -> 4, reward=0.0
Take action=2: state from 4 -> 8, reward=0.0
Take action=3: state from 8 -> 8, reward=0.0


Take action=2: state from 8 -> 9, reward=0.0
Take action=3: state from 9 -> 5, reward=0.0, done=True, episode_length=11
	Episode=7...
Take action=0: state from (0, {'prob': 1}) -> 4, reward=0.0
Take action=1: state from 4 -> 5, reward=0.0, done=True, episode_length=2
	Episode=8...
Take action=0: state from (0, {'prob': 1}) -> 0, reward=0.0
Take action=3: state from 0 -> 0, reward=0.0
Take action=0: state from 0 -> 4, reward=0.0
Take action=3: state from 4 -> 5, reward=0.0, done=True, episode_length=4
	Episode=9...
Take action=2: state from (0, {'prob': 1}) -> 1, reward=0.0
Take action=1: state from 1 -> 2, reward=0.0
Take action=0: state from 2 -> 6, reward=0.0
Take action=0: state from 6 -> 5, reward=0.0, done=True, episode_length=4
	Episode=10...
Take action=2: state from (0, {'prob': 1}) -> 4, reward=0.0
Take action=0: state from 4 -> 4, reward=0.0
Take action=2: state from 4 -> 5, reward=0.0, done=True, episode_length=3


	Episode=11...
Take action=3: state from (0, {'prob': 1}) -> 0, reward=0.0
Take action=1: state from 0 -> 0, reward=0.0
Take action=2: state from 0 -> 4, reward=0.0
Take action=2: state from 4 -> 8, reward=0.0
Take action=0: state from 8 -> 4, reward=0.0
Take action=2: state from 4 -> 8, reward=0.0
Take action=3: state from 8 -> 8, reward=0.0
Take action=1: state from 8 -> 8, reward=0.0
Take action=2: state from 8 -> 9, reward=0.0
Take action=3: state from 9 -> 8, reward=0.0
Take action=2: state from 8 -> 4, reward=0.0
Take action=0: state from 4 -> 8, reward=0.0
Take action=0: state from 8 -> 12, reward=0.0, done=True, episode_length=13
	Episode=12...
Take action=3: state from (0, {'prob': 1}) -> 1, reward=0.0
Take action=2: state from 1 -> 1, reward=0.0
Take action=0: state from 1 -> 0, reward=0.0


Take action=1: state from 0 -> 0, reward=0.0
Take action=1: state from 0 -> 0, reward=0.0
Take action=3: state from 0 -> 0, reward=0.0
Take action=0: state from 0 -> 0, reward=0.0
Take action=3: state from 0 -> 0, reward=0.0
Take action=2: state from 0 -> 0, reward=0.0
Take action=1: state from 0 -> 0, reward=0.0
Take action=1: state from 0 -> 0, reward=0.0
Take action=1: state from 0 -> 0, reward=0.0
Take action=3: state from 0 -> 1, reward=0.0
Take action=1: state from 1 -> 5, reward=0.0, done=True, episode_length=14
	Episode=13...
Take action=1: state from (0, {'prob': 1}) -> 0, reward=0.0
Take action=2: state from 0 -> 1, reward=0.0
Take action=1: state from 1 -> 0, reward=0.0
Take action=0: state from 0 -> 0, reward=0.0
Take action=3: state from 0 -> 1, reward=0.0


Take action=3: state from 1 -> 0, reward=0.0
Take action=2: state from 0 -> 1, reward=0.0
Take action=2: state from 1 -> 2, reward=0.0
Take action=0: state from 2 -> 2, reward=0.0
Take action=0: state from 2 -> 6, reward=0.0
Take action=3: state from 6 -> 2, reward=0.0
Take action=1: state from 2 -> 6, reward=0.0
Take action=3: state from 6 -> 5, reward=0.0, done=True, episode_length=13
	Episode=14...
Take action=1: state from (0, {'prob': 1}) -> 1, reward=0.0
Take action=2: state from 1 -> 1, reward=0.0
Take action=1: state from 1 -> 0, reward=0.0
Take action=0: state from 0 -> 0, reward=0.0
Take action=3: state from 0 -> 0, reward=0.0
Take action=0: state from 0 -> 0, reward=0.0
Take action=0: state from 0 -> 0, reward=0.0


Take action=3: state from 0 -> 0, reward=0.0
Take action=1: state from 0 -> 4, reward=0.0
Take action=3: state from 4 -> 0, reward=0.0
Take action=3: state from 0 -> 0, reward=0.0
Take action=2: state from 0 -> 0, reward=0.0
Take action=3: state from 0 -> 0, reward=0.0
Take action=3: state from 0 -> 1, reward=0.0
Take action=3: state from 1 -> 2, reward=0.0
Take action=1: state from 2 -> 3, reward=0.0
Take action=0: state from 3 -> 3, reward=0.0
Take action=3: state from 3 -> 3, reward=0.0
Take action=3: state from 3 -> 2, reward=0.0
Take action=2: state from 2 -> 2, reward=0.0
Take action=2: state from 2 -> 2, reward=0.0
Take action=0: state from 2 -> 1, reward=0.0
Take action=1: state from 1 -> 0, reward=0.0


Take action=3: state from 0 -> 0, reward=0.0
Take action=1: state from 0 -> 4, reward=0.0
Take action=0: state from 4 -> 0, reward=0.0
Take action=0: state from 0 -> 0, reward=0.0
Take action=2: state from 0 -> 4, reward=0.0
Take action=0: state from 4 -> 0, reward=0.0
Take action=3: state from 0 -> 0, reward=0.0
Take action=0: state from 0 -> 0, reward=0.0
Take action=0: state from 0 -> 0, reward=0.0
Take action=0: state from 0 -> 4, reward=0.0
Take action=0: state from 4 -> 8, reward=0.0
Take action=1: state from 8 -> 12, reward=0.0, done=True, episode_length=35
	Episode=15...
Take action=3: state from (0, {'prob': 1}) -> 1, reward=0.0
Take action=0: state from 1 -> 0, reward=0.0
Take action=1: state from 0 -> 0, reward=0.0
Take action=1: state from 0 -> 4, reward=0.0


Take action=2: state from 4 -> 0, reward=0.0
Take action=1: state from 0 -> 0, reward=0.0
Take action=3: state from 0 -> 0, reward=0.0
Take action=0: state from 0 -> 0, reward=0.0
Take action=0: state from 0 -> 0, reward=0.0
Take action=2: state from 0 -> 1, reward=0.0
Take action=3: state from 1 -> 0, reward=0.0
Take action=3: state from 0 -> 0, reward=0.0
Take action=3: state from 0 -> 0, reward=0.0
Take action=1: state from 0 -> 4, reward=0.0
Take action=2: state from 4 -> 5, reward=0.0, done=True, episode_length=15
	Episode=16...
Take action=3: state from (0, {'prob': 1}) -> 1, reward=0.0
Take action=1: state from 1 -> 2, reward=0.0
Take action=3: state from 2 -> 3, reward=0.0
Take action=2: state from 3 -> 3, reward=0.0


Take action=2: state from 3 -> 7, reward=0.0, done=True, episode_length=5
	Episode=17...
Take action=3: state from (0, {'prob': 1}) -> 0, reward=0.0
Take action=3: state from 0 -> 0, reward=0.0
Take action=2: state from 0 -> 1, reward=0.0
Take action=0: state from 1 -> 0, reward=0.0
Take action=1: state from 0 -> 0, reward=0.0
Take action=3: state from 0 -> 1, reward=0.0
Take action=3: state from 1 -> 1, reward=0.0
Take action=3: state from 1 -> 2, reward=0.0
Take action=3: state from 2 -> 3, reward=0.0
Take action=0: state from 3 -> 7, reward=0.0, done=True, episode_length=10
	Episode=18...
Take action=3: state from (0, {'prob': 1}) -> 1, reward=0.0
Take action=1: state from 1 -> 0, reward=0.0
Take action=3: state from 0 -> 0, reward=0.0
Take action=0: state from 0 -> 0, reward=0.0


Take action=1: state from 0 -> 1, reward=0.0
Take action=3: state from 1 -> 2, reward=0.0
Take action=2: state from 2 -> 6, reward=0.0
Take action=3: state from 6 -> 5, reward=0.0, done=True, episode_length=8
	Episode=19...
Take action=3: state from (0, {'prob': 1}) -> 1, reward=0.0
Take action=0: state from 1 -> 1, reward=0.0
Take action=3: state from 1 -> 0, reward=0.0
Take action=0: state from 0 -> 0, reward=0.0
Take action=0: state from 0 -> 4, reward=0.0
Take action=2: state from 4 -> 0, reward=0.0
Take action=1: state from 0 -> 1, reward=0.0
Take action=2: state from 1 -> 1, reward=0.0
Take action=0: state from 1 -> 1, reward=0.0
Take action=2: state from 1 -> 1, reward=0.0
Take action=3: state from 1 -> 1, reward=0.0
Take action=3: state from 1 -> 1, reward=0.0


Take action=0: state from 1 -> 1, reward=0.0
Take action=2: state from 1 -> 5, reward=0.0, done=True, episode_length=14
	Episode=20...
Take action=0: state from (0, {'prob': 1}) -> 0, reward=0.0
Take action=1: state from 0 -> 4, reward=0.0
Take action=2: state from 4 -> 8, reward=0.0
Take action=0: state from 8 -> 8, reward=0.0
Take action=1: state from 8 -> 9, reward=0.0
Take action=1: state from 9 -> 10, reward=0.0
Take action=2: state from 10 -> 11, reward=0.0, done=True, episode_length=7
	Episode=21...
Take action=3: state from (0, {'prob': 1}) -> 1, reward=0.0
Take action=2: state from 1 -> 1, reward=0.0
Take action=0: state from 1 -> 5, reward=0.0, done=True, episode_length=3
	Episode=22...
Take action=3: state from (0, {'prob': 1}) -> 1, reward=0.0
Take action=2: state from 1 -> 2, reward=0.0
Take action=0: state from 2 -> 1, reward=0.0


Take action=0: state from 1 -> 0, reward=0.0
Take action=1: state from 0 -> 4, reward=0.0
Take action=2: state from 4 -> 8, reward=0.0
Take action=0: state from 8 -> 4, reward=0.0
Take action=2: state from 4 -> 5, reward=0.0, done=True, episode_length=8
	Episode=23...
Take action=0: state from (0, {'prob': 1}) -> 0, reward=0.0
Take action=0: state from 0 -> 0, reward=0.0
Take action=2: state from 0 -> 0, reward=0.0
Take action=1: state from 0 -> 0, reward=0.0
Take action=1: state from 0 -> 1, reward=0.0
Take action=2: state from 1 -> 2, reward=0.0
Take action=0: state from 2 -> 6, reward=0.0
Take action=0: state from 6 -> 2, reward=0.0
Take action=0: state from 2 -> 1, reward=0.0
Take action=3: state from 1 -> 1, reward=0.0


Take action=3: state from 1 -> 2, reward=0.0
Take action=1: state from 2 -> 1, reward=0.0
Take action=2: state from 1 -> 1, reward=0.0
Take action=0: state from 1 -> 1, reward=0.0
Take action=2: state from 1 -> 5, reward=0.0, done=True, episode_length=15
	Episode=24...
Take action=2: state from (0, {'prob': 1}) -> 0, reward=0.0
Take action=0: state from 0 -> 0, reward=0.0
Take action=2: state from 0 -> 0, reward=0.0
Take action=0: state from 0 -> 0, reward=0.0
Take action=3: state from 0 -> 0, reward=0.0
Take action=1: state from 0 -> 4, reward=0.0
Take action=2: state from 4 -> 5, reward=0.0, done=True, episode_length=7
	Episode=25...
Take action=0: state from (0, {'prob': 1}) -> 0, reward=0.0
Take action=0: state from 0 -> 4, reward=0.0
Take action=1: state from 4 -> 4, reward=0.0
Take action=0: state from 4 -> 0, reward=0.0


Take action=1: state from 0 -> 4, reward=0.0
Take action=2: state from 4 -> 5, reward=0.0, done=True, episode_length=6
	Episode=26...
Take action=0: state from (0, {'prob': 1}) -> 0, reward=0.0
Take action=3: state from 0 -> 0, reward=0.0
Take action=0: state from 0 -> 0, reward=0.0
Take action=1: state from 0 -> 0, reward=0.0
Take action=3: state from 0 -> 0, reward=0.0
Take action=0: state from 0 -> 4, reward=0.0
Take action=2: state from 4 -> 0, reward=0.0
Take action=1: state from 0 -> 4, reward=0.0
Take action=0: state from 4 -> 8, reward=0.0
Take action=3: state from 8 -> 9, reward=0.0
Take action=3: state from 9 -> 10, reward=0.0
Take action=2: state from 10 -> 6, reward=0.0
Take action=0: state from 6 -> 5, reward=0.0, done=True, episode_length=13


	Episode=27...
Take action=2: state from (0, {'prob': 1}) -> 4, reward=0.0
Take action=0: state from 4 -> 4, reward=0.0
Take action=1: state from 4 -> 8, reward=0.0
Take action=2: state from 8 -> 12, reward=0.0, done=True, episode_length=4
	Episode=28...
Take action=0: state from (0, {'prob': 1}) -> 4, reward=0.0
Take action=2: state from 4 -> 5, reward=0.0, done=True, episode_length=2
	Episode=29...
Take action=2: state from (0, {'prob': 1}) -> 4, reward=0.0
Take action=1: state from 4 -> 4, reward=0.0
Take action=2: state from 4 -> 0, reward=0.0
Take action=0: state from 0 -> 0, reward=0.0
Take action=2: state from 0 -> 1, reward=0.0
Take action=3: state from 1 -> 2, reward=0.0
Take action=3: state from 2 -> 1, reward=0.0
Take action=3: state from 1 -> 2, reward=0.0
Take action=3: state from 2 -> 1, reward=0.0
Take action=0: state from 1 -> 5, reward=0.0, done=True, episode_length=10


	Episode=30...
Take action=0: state from (0, {'prob': 1}) -> 0, reward=0.0
Take action=3: state from 0 -> 1, reward=0.0
Take action=1: state from 1 -> 2, reward=0.0
Take action=2: state from 2 -> 6, reward=0.0
Take action=0: state from 6 -> 10, reward=0.0
Take action=1: state from 10 -> 9, reward=0.0
Take action=0: state from 9 -> 5, reward=0.0, done=True, episode_length=7
	Episode=31...
Take action=0: state from (0, {'prob': 1}) -> 0, reward=0.0
Take action=3: state from 0 -> 1, reward=0.0
Take action=2: state from 1 -> 2, reward=0.0
Take action=3: state from 2 -> 2, reward=0.0
Take action=1: state from 2 -> 6, reward=0.0
Take action=0: state from 6 -> 2, reward=0.0
Take action=0: state from 2 -> 2, reward=0.0
Take action=0: state from 2 -> 1, reward=0.0


Take action=1: state from 1 -> 5, reward=0.0, done=True, episode_length=9
	Episode=32...
Take action=3: state from (0, {'prob': 1}) -> 0, reward=0.0
Take action=3: state from 0 -> 0, reward=0.0
Take action=0: state from 0 -> 0, reward=0.0
Take action=1: state from 0 -> 4, reward=0.0
Take action=2: state from 4 -> 0, reward=0.0
Take action=3: state from 0 -> 0, reward=0.0
Take action=1: state from 0 -> 1, reward=0.0
Take action=1: state from 1 -> 2, reward=0.0
Take action=0: state from 2 -> 2, reward=0.0
Take action=1: state from 2 -> 1, reward=0.0
Take action=1: state from 1 -> 2, reward=0.0
Take action=0: state from 2 -> 1, reward=0.0
Take action=0: state from 1 -> 1, reward=0.0
Take action=3: state from 1 -> 0, reward=0.0
Take action=3: state from 0 -> 0, reward=0.0


Take action=0: state from 0 -> 4, reward=0.0
Take action=3: state from 4 -> 5, reward=0.0, done=True, episode_length=17
	Episode=33...
Take action=1: state from (0, {'prob': 1}) -> 1, reward=0.0
Take action=0: state from 1 -> 1, reward=0.0
Take action=1: state from 1 -> 5, reward=0.0, done=True, episode_length=3
	Episode=34...
Take action=3: state from (0, {'prob': 1}) -> 0, reward=0.0
Take action=0: state from 0 -> 0, reward=0.0
Take action=0: state from 0 -> 0, reward=0.0
Take action=0: state from 0 -> 0, reward=0.0
Take action=1: state from 0 -> 4, reward=0.0
Take action=3: state from 4 -> 4, reward=0.0
Take action=3: state from 4 -> 0, reward=0.0
Take action=0: state from 0 -> 4, reward=0.0
Take action=1: state from 4 -> 8, reward=0.0
Take action=0: state from 8 -> 8, reward=0.0


Take action=0: state from 8 -> 8, reward=0.0
Take action=1: state from 8 -> 9, reward=0.0
Take action=2: state from 9 -> 13, reward=0.0
Take action=3: state from 13 -> 9, reward=0.0
Take action=2: state from 9 -> 5, reward=0.0, done=True, episode_length=15
	Episode=35...
Take action=0: state from (0, {'prob': 1}) -> 4, reward=0.0
Take action=2: state from 4 -> 0, reward=0.0
Take action=3: state from 0 -> 0, reward=0.0
Take action=0: state from 0 -> 0, reward=0.0
Take action=3: state from 0 -> 0, reward=0.0
Take action=2: state from 0 -> 4, reward=0.0
Take action=0: state from 4 -> 0, reward=0.0
Take action=3: state from 0 -> 0, reward=0.0
Take action=0: state from 0 -> 0, reward=0.0
Take action=0: state from 0 -> 0, reward=0.0


Take action=3: state from 0 -> 1, reward=0.0
Take action=3: state from 1 -> 2, reward=0.0
Take action=0: state from 2 -> 1, reward=0.0
Take action=3: state from 1 -> 1, reward=0.0
Take action=1: state from 1 -> 0, reward=0.0
Take action=3: state from 0 -> 0, reward=0.0
Take action=2: state from 0 -> 1, reward=0.0
Take action=3: state from 1 -> 0, reward=0.0
Take action=2: state from 0 -> 1, reward=0.0
Take action=2: state from 1 -> 1, reward=0.0
Take action=3: state from 1 -> 2, reward=0.0
Take action=3: state from 2 -> 3, reward=0.0
Take action=1: state from 3 -> 3, reward=0.0
Take action=3: state from 3 -> 2, reward=0.0
Take action=1: state from 2 -> 3, reward=0.0
Take action=0: state from 3 -> 2, reward=0.0


Take action=0: state from 2 -> 2, reward=0.0
Take action=0: state from 2 -> 1, reward=0.0
Take action=0: state from 1 -> 5, reward=0.0, done=True, episode_length=29
	Episode=36...
Take action=1: state from (0, {'prob': 1}) -> 1, reward=0.0
Take action=0: state from 1 -> 1, reward=0.0
Take action=3: state from 1 -> 0, reward=0.0
Take action=2: state from 0 -> 1, reward=0.0
Take action=2: state from 1 -> 2, reward=0.0
Take action=1: state from 2 -> 1, reward=0.0
Take action=3: state from 1 -> 0, reward=0.0
Take action=3: state from 0 -> 0, reward=0.0
Take action=1: state from 0 -> 4, reward=0.0
Take action=2: state from 4 -> 0, reward=0.0
Take action=3: state from 0 -> 0, reward=0.0
Take action=1: state from 0 -> 0, reward=0.0
Take action=1: state from 0 -> 1, reward=0.0


Take action=2: state from 1 -> 2, reward=0.0
Take action=1: state from 2 -> 3, reward=0.0
Take action=0: state from 3 -> 7, reward=0.0, done=True, episode_length=16
	Episode=37...
Take action=2: state from (0, {'prob': 1}) -> 4, reward=0.0
Take action=3: state from 4 -> 5, reward=0.0, done=True, episode_length=2
	Episode=38...
Take action=0: state from (0, {'prob': 1}) -> 0, reward=0.0
Take action=3: state from 0 -> 0, reward=0.0
Take action=0: state from 0 -> 0, reward=0.0
Take action=0: state from 0 -> 0, reward=0.0
Take action=2: state from 0 -> 1, reward=0.0
Take action=3: state from 1 -> 1, reward=0.0
Take action=0: state from 1 -> 0, reward=0.0
Take action=2: state from 0 -> 0, reward=0.0
Take action=0: state from 0 -> 0, reward=0.0
Take action=1: state from 0 -> 0, reward=0.0


Take action=1: state from 0 -> 0, reward=0.0
Take action=2: state from 0 -> 4, reward=0.0
Take action=2: state from 4 -> 0, reward=0.0
Take action=1: state from 0 -> 4, reward=0.0
Take action=1: state from 4 -> 5, reward=0.0, done=True, episode_length=15
	Episode=39...
Take action=1: state from (0, {'prob': 1}) -> 1, reward=0.0
Take action=3: state from 1 -> 1, reward=0.0
Take action=2: state from 1 -> 5, reward=0.0, done=True, episode_length=3
	Episode=40...
Take action=0: state from (0, {'prob': 1}) -> 0, reward=0.0
Take action=0: state from 0 -> 0, reward=0.0
Take action=2: state from 0 -> 4, reward=0.0
Take action=3: state from 4 -> 5, reward=0.0, done=True, episode_length=4
	Episode=41...
Take action=0: state from (0, {'prob': 1}) -> 4, reward=0.0
Take action=1: state from 4 -> 8, reward=0.0
Take action=2: state from 8 -> 4, reward=0.0


Take action=0: state from 4 -> 8, reward=0.0
Take action=2: state from 8 -> 9, reward=0.0
Take action=1: state from 9 -> 10, reward=0.0
Take action=0: state from 10 -> 6, reward=0.0
Take action=1: state from 6 -> 7, reward=0.0, done=True, episode_length=8
	Episode=42...
Take action=0: state from (0, {'prob': 1}) -> 4, reward=0.0
Take action=1: state from 4 -> 5, reward=0.0, done=True, episode_length=2
	Episode=43...
Take action=0: state from (0, {'prob': 1}) -> 0, reward=0.0
Take action=0: state from 0 -> 4, reward=0.0
Take action=1: state from 4 -> 8, reward=0.0
Take action=2: state from 8 -> 9, reward=0.0
Take action=3: state from 9 -> 8, reward=0.0
Take action=1: state from 8 -> 9, reward=0.0
Take action=3: state from 9 -> 10, reward=0.0
Take action=1: state from 10 -> 11, reward=0.0, done=True, episode_length=8


	Episode=44...
Take action=0: state from (0, {'prob': 1}) -> 0, reward=0.0
Take action=3: state from 0 -> 0, reward=0.0
Take action=2: state from 0 -> 0, reward=0.0
Take action=2: state from 0 -> 1, reward=0.0
Take action=1: state from 1 -> 0, reward=0.0
Take action=2: state from 0 -> 4, reward=0.0
Take action=3: state from 4 -> 4, reward=0.0
Take action=0: state from 4 -> 4, reward=0.0
Take action=0: state from 4 -> 0, reward=0.0
Take action=3: state from 0 -> 1, reward=0.0
Take action=2: state from 1 -> 1, reward=0.0
Take action=3: state from 1 -> 0, reward=0.0
Take action=0: state from 0 -> 0, reward=0.0
Take action=0: state from 0 -> 0, reward=0.0
Take action=0: state from 0 -> 0, reward=0.0


Take action=0: state from 0 -> 0, reward=0.0
Take action=0: state from 0 -> 0, reward=0.0
Take action=2: state from 0 -> 0, reward=0.0
Take action=1: state from 0 -> 4, reward=0.0
Take action=0: state from 4 -> 8, reward=0.0
Take action=2: state from 8 -> 9, reward=0.0
Take action=3: state from 9 -> 8, reward=0.0
Take action=3: state from 8 -> 4, reward=0.0
Take action=1: state from 4 -> 8, reward=0.0
Take action=0: state from 8 -> 4, reward=0.0
Take action=2: state from 4 -> 5, reward=0.0, done=True, episode_length=26
	Episode=45...
Take action=0: state from (0, {'prob': 1}) -> 4, reward=0.0
Take action=0: state from 4 -> 0, reward=0.0
Take action=0: state from 0 -> 0, reward=0.0
Take action=0: state from 0 -> 0, reward=0.0
Take action=3: state from 0 -> 1, reward=0.0


Take action=0: state from 1 -> 1, reward=0.0
Take action=3: state from 1 -> 2, reward=0.0
Take action=1: state from 2 -> 6, reward=0.0
Take action=1: state from 6 -> 10, reward=0.0
Take action=2: state from 10 -> 6, reward=0.0
Take action=2: state from 6 -> 7, reward=0.0, done=True, episode_length=11
	Episode=46...
Take action=0: state from (0, {'prob': 1}) -> 0, reward=0.0
Take action=0: state from 0 -> 4, reward=0.0
Take action=0: state from 4 -> 8, reward=0.0
Take action=1: state from 8 -> 8, reward=0.0
Take action=3: state from 8 -> 4, reward=0.0
Take action=0: state from 4 -> 8, reward=0.0
Take action=2: state from 8 -> 9, reward=0.0
Take action=3: state from 9 -> 5, reward=0.0, done=True, episode_length=8
	Episode=47...
Take action=0: state from (0, {'prob': 1}) -> 0, reward=0.0


Take action=3: state from 0 -> 0, reward=0.0
Take action=2: state from 0 -> 1, reward=0.0
Take action=3: state from 1 -> 2, reward=0.0
Take action=0: state from 2 -> 6, reward=0.0
Take action=2: state from 6 -> 10, reward=0.0
Take action=3: state from 10 -> 9, reward=0.0
Take action=0: state from 9 -> 5, reward=0.0, done=True, episode_length=8
	Episode=48...
Take action=3: state from (0, {'prob': 1}) -> 0, reward=0.0
Take action=0: state from 0 -> 4, reward=0.0
Take action=0: state from 4 -> 8, reward=0.0
Take action=2: state from 8 -> 12, reward=0.0, done=True, episode_length=4
	Episode=49...
Take action=0: state from (0, {'prob': 1}) -> 4, reward=0.0
Take action=0: state from 4 -> 4, reward=0.0
Take action=3: state from 4 -> 0, reward=0.0
Take action=1: state from 0 -> 4, reward=0.0


Take action=0: state from 4 -> 0, reward=0.0
Take action=0: state from 0 -> 4, reward=0.0
Take action=0: state from 4 -> 4, reward=0.0
Take action=0: state from 4 -> 8, reward=0.0
Take action=0: state from 8 -> 12, reward=0.0, done=True, episode_length=9
	Episode=50...
Take action=2: state from (0, {'prob': 1}) -> 4, reward=0.0
Take action=1: state from 4 -> 5, reward=0.0, done=True, episode_length=2
	Episode=51...
Take action=2: state from (0, {'prob': 1}) -> 4, reward=0.0
Take action=0: state from 4 -> 4, reward=0.0
Take action=0: state from 4 -> 0, reward=0.0
Take action=0: state from 0 -> 4, reward=0.0
Take action=0: state from 4 -> 4, reward=0.0
Take action=0: state from 4 -> 0, reward=0.0
Take action=0: state from 0 -> 0, reward=0.0
Take action=1: state from 0 -> 1, reward=0.0


Take action=1: state from 1 -> 2, reward=0.0
Take action=3: state from 2 -> 3, reward=0.0
Take action=3: state from 3 -> 3, reward=0.0
Take action=1: state from 3 -> 2, reward=0.0
Take action=2: state from 2 -> 6, reward=0.0
Take action=1: state from 6 -> 7, reward=0.0, done=True, episode_length=14
	Episode=52...
Take action=1: state from (0, {'prob': 1}) -> 1, reward=0.0
Take action=0: state from 1 -> 1, reward=0.0
Take action=3: state from 1 -> 1, reward=0.0
Take action=1: state from 1 -> 0, reward=0.0
Take action=3: state from 0 -> 1, reward=0.0
Take action=3: state from 1 -> 2, reward=0.0
Take action=1: state from 2 -> 6, reward=0.0
Take action=1: state from 6 -> 7, reward=0.0, done=True, episode_length=8
	Episode=53...
Take action=0: state from (0, {'prob': 1}) -> 0, reward=0.0


Take action=1: state from 0 -> 4, reward=0.0
Take action=3: state from 4 -> 4, reward=0.0
Take action=1: state from 4 -> 8, reward=0.0
Take action=2: state from 8 -> 12, reward=0.0, done=True, episode_length=5
	Episode=54...
Take action=3: state from (0, {'prob': 1}) -> 0, reward=0.0
Take action=0: state from 0 -> 0, reward=0.0
Take action=0: state from 0 -> 4, reward=0.0
Take action=0: state from 4 -> 4, reward=0.0
Take action=1: state from 4 -> 5, reward=0.0, done=True, episode_length=5
	Episode=55...
Take action=3: state from (0, {'prob': 1}) -> 0, reward=0.0
Take action=0: state from 0 -> 4, reward=0.0
Take action=3: state from 4 -> 5, reward=0.0, done=True, episode_length=3
	Episode=56...
Take action=2: state from (0, {'prob': 1}) -> 0, reward=0.0
Take action=0: state from 0 -> 0, reward=0.0
Take action=3: state from 0 -> 1, reward=0.0
Take action=2: state from 1 -> 5, reward=0.0, done=True, episode_length=4


	Episode=57...
Take action=0: state from (0, {'prob': 1}) -> 0, reward=0.0
Take action=0: state from 0 -> 0, reward=0.0
Take action=0: state from 0 -> 0, reward=0.0
Take action=1: state from 0 -> 0, reward=0.0
Take action=2: state from 0 -> 4, reward=0.0
Take action=0: state from 4 -> 8, reward=0.0
Take action=1: state from 8 -> 12, reward=0.0, done=True, episode_length=7
	Episode=58...
Take action=0: state from (0, {'prob': 1}) -> 4, reward=0.0
Take action=0: state from 4 -> 8, reward=0.0
Take action=2: state from 8 -> 4, reward=0.0
Take action=0: state from 4 -> 8, reward=0.0
Take action=2: state from 8 -> 9, reward=0.0
Take action=3: state from 9 -> 10, reward=0.0
Take action=3: state from 10 -> 6, reward=0.0
Take action=3: state from 6 -> 7, reward=0.0, done=True, episode_length=8


	Episode=59...
Take action=0: state from (0, {'prob': 1}) -> 0, reward=0.0
Take action=0: state from 0 -> 0, reward=0.0
Take action=2: state from 0 -> 0, reward=0.0
Take action=3: state from 0 -> 0, reward=0.0
Take action=3: state from 0 -> 0, reward=0.0
Take action=0: state from 0 -> 0, reward=0.0
Take action=0: state from 0 -> 4, reward=0.0
Take action=0: state from 4 -> 4, reward=0.0
Take action=1: state from 4 -> 5, reward=0.0, done=True, episode_length=9
	Episode=60...
Take action=3: state from (0, {'prob': 1}) -> 1, reward=0.0
Take action=3: state from 1 -> 0, reward=0.0
Take action=3: state from 0 -> 0, reward=0.0
Take action=2: state from 0 -> 4, reward=0.0
Take action=1: state from 4 -> 4, reward=0.0
Take action=0: state from 4 -> 4, reward=0.0


Take action=0: state from 4 -> 0, reward=0.0
Take action=0: state from 0 -> 0, reward=0.0
Take action=1: state from 0 -> 0, reward=0.0
Take action=3: state from 0 -> 1, reward=0.0
Take action=3: state from 1 -> 0, reward=0.0
Take action=0: state from 0 -> 4, reward=0.0
Take action=0: state from 4 -> 0, reward=0.0
Take action=2: state from 0 -> 1, reward=0.0
Take action=3: state from 1 -> 1, reward=0.0
Take action=3: state from 1 -> 1, reward=0.0
Take action=0: state from 1 -> 0, reward=0.0
Take action=0: state from 0 -> 0, reward=0.0
Take action=3: state from 0 -> 1, reward=0.0
Take action=3: state from 1 -> 2, reward=0.0
Take action=2: state from 2 -> 3, reward=0.0


Take action=2: state from 3 -> 3, reward=0.0
Take action=2: state from 3 -> 3, reward=0.0
Take action=0: state from 3 -> 3, reward=0.0
Take action=2: state from 3 -> 3, reward=0.0
Take action=2: state from 3 -> 3, reward=0.0
Take action=1: state from 3 -> 3, reward=0.0
Take action=0: state from 3 -> 3, reward=0.0
Take action=2: state from 3 -> 3, reward=0.0
Take action=2: state from 3 -> 3, reward=0.0
Take action=3: state from 3 -> 2, reward=0.0
Take action=1: state from 2 -> 6, reward=0.0
Take action=3: state from 6 -> 5, reward=0.0, done=True, episode_length=33
	Episode=61...
Take action=0: state from (0, {'prob': 1}) -> 0, reward=0.0
Take action=1: state from 0 -> 4, reward=0.0
Take action=0: state from 4 -> 8, reward=0.0


Take action=3: state from 8 -> 4, reward=0.0
Take action=0: state from 4 -> 4, reward=0.0
Take action=0: state from 4 -> 8, reward=0.0
Take action=2: state from 8 -> 12, reward=0.0, done=True, episode_length=7
	Episode=62...
Take action=2: state from (0, {'prob': 1}) -> 1, reward=0.0
Take action=3: state from 1 -> 0, reward=0.0
Take action=3: state from 0 -> 0, reward=0.0
Take action=2: state from 0 -> 4, reward=0.0
Take action=1: state from 4 -> 4, reward=0.0
Take action=2: state from 4 -> 8, reward=0.0
Take action=0: state from 8 -> 12, reward=0.0, done=True, episode_length=7
	Episode=63...
Take action=0: state from (0, {'prob': 1}) -> 4, reward=0.0
Take action=0: state from 4 -> 8, reward=0.0
Take action=3: state from 8 -> 8, reward=0.0
Take action=0: state from 8 -> 8, reward=0.0
Take action=3: state from 8 -> 4, reward=0.0


Take action=2: state from 4 -> 8, reward=0.0
Take action=2: state from 8 -> 9, reward=0.0
Take action=3: state from 9 -> 5, reward=0.0, done=True, episode_length=8
	Episode=64...
Take action=0: state from (0, {'prob': 1}) -> 0, reward=0.0
Take action=2: state from 0 -> 4, reward=0.0
Take action=1: state from 4 -> 5, reward=0.0, done=True, episode_length=3
	Episode=65...
Take action=0: state from (0, {'prob': 1}) -> 0, reward=0.0
Take action=1: state from 0 -> 0, reward=0.0
Take action=0: state from 0 -> 4, reward=0.0
Take action=2: state from 4 -> 0, reward=0.0
Take action=0: state from 0 -> 0, reward=0.0
Take action=0: state from 0 -> 0, reward=0.0
Take action=0: state from 0 -> 0, reward=0.0
Take action=0: state from 0 -> 0, reward=0.0
Take action=0: state from 0 -> 4, reward=0.0


Take action=0: state from 4 -> 8, reward=0.0
Take action=1: state from 8 -> 9, reward=0.0
Take action=0: state from 9 -> 8, reward=0.0
Take action=2: state from 8 -> 4, reward=0.0
Take action=0: state from 4 -> 8, reward=0.0
Take action=3: state from 8 -> 9, reward=0.0
Take action=3: state from 9 -> 10, reward=0.0
Take action=3: state from 10 -> 11, reward=0.0, done=True, episode_length=17
	Episode=66...
Take action=0: state from (0, {'prob': 1}) -> 0, reward=0.0
Take action=1: state from 0 -> 4, reward=0.0
Take action=0: state from 4 -> 0, reward=0.0
Take action=3: state from 0 -> 0, reward=0.0
Take action=3: state from 0 -> 1, reward=0.0
Take action=3: state from 1 -> 1, reward=0.0
Take action=2: state from 1 -> 2, reward=0.0


Take action=1: state from 2 -> 1, reward=0.0
Take action=3: state from 1 -> 0, reward=0.0
Take action=0: state from 0 -> 0, reward=0.0
Take action=0: state from 0 -> 4, reward=0.0
Take action=0: state from 4 -> 8, reward=0.0
Take action=1: state from 8 -> 8, reward=0.0
Take action=2: state from 8 -> 9, reward=0.0
Take action=3: state from 9 -> 5, reward=0.0, done=True, episode_length=15
	Episode=67...
Take action=0: state from (0, {'prob': 1}) -> 0, reward=0.0
Take action=0: state from 0 -> 0, reward=0.0
Take action=0: state from 0 -> 0, reward=0.0
Take action=1: state from 0 -> 0, reward=0.0
Take action=2: state from 0 -> 4, reward=0.0
Take action=0: state from 4 -> 0, reward=0.0
Take action=3: state from 0 -> 1, reward=0.0


Take action=3: state from 1 -> 2, reward=0.0
Take action=0: state from 2 -> 2, reward=0.0
Take action=0: state from 2 -> 6, reward=0.0
Take action=3: state from 6 -> 7, reward=0.0, done=True, episode_length=11
	Episode=68...
Take action=3: state from (0, {'prob': 1}) -> 0, reward=0.0
Take action=1: state from 0 -> 1, reward=0.0
Take action=3: state from 1 -> 1, reward=0.0
Take action=3: state from 1 -> 0, reward=0.0
Take action=0: state from 0 -> 0, reward=0.0
Take action=0: state from 0 -> 0, reward=0.0
Take action=3: state from 0 -> 1, reward=0.0
Take action=0: state from 1 -> 5, reward=0.0, done=True, episode_length=8
	Episode=69...
Take action=0: state from (0, {'prob': 1}) -> 4, reward=0.0
Take action=3: state from 4 -> 0, reward=0.0
Take action=2: state from 0 -> 1, reward=0.0


Take action=1: state from 1 -> 5, reward=0.0, done=True, episode_length=4
	Episode=70...
Take action=1: state from (0, {'prob': 1}) -> 0, reward=0.0
Take action=2: state from 0 -> 1, reward=0.0
Take action=3: state from 1 -> 0, reward=0.0
Take action=3: state from 0 -> 0, reward=0.0
Take action=0: state from 0 -> 0, reward=0.0
Take action=2: state from 0 -> 0, reward=0.0
Take action=2: state from 0 -> 1, reward=0.0
Take action=3: state from 1 -> 1, reward=0.0
Take action=3: state from 1 -> 2, reward=0.0
Take action=0: state from 2 -> 6, reward=0.0
Take action=1: state from 6 -> 5, reward=0.0, done=True, episode_length=11
	Episode=71...
Take action=0: state from (0, {'prob': 1}) -> 0, reward=0.0
Take action=1: state from 0 -> 4, reward=0.0
Take action=2: state from 4 -> 0, reward=0.0


Take action=2: state from 0 -> 0, reward=0.0
Take action=2: state from 0 -> 1, reward=0.0
Take action=3: state from 1 -> 0, reward=0.0
Take action=0: state from 0 -> 0, reward=0.0
Take action=0: state from 0 -> 4, reward=0.0
Take action=0: state from 4 -> 0, reward=0.0
Take action=3: state from 0 -> 0, reward=0.0
Take action=0: state from 0 -> 4, reward=0.0
Take action=0: state from 4 -> 8, reward=0.0
Take action=0: state from 8 -> 4, reward=0.0
Take action=0: state from 4 -> 0, reward=0.0
Take action=0: state from 0 -> 4, reward=0.0
Take action=0: state from 4 -> 8, reward=0.0
Take action=3: state from 8 -> 9, reward=0.0
Take action=2: state from 9 -> 13, reward=0.0


Take action=0: state from 13 -> 13, reward=0.0
Take action=0: state from 13 -> 12, reward=0.0, done=True, episode_length=20
	Episode=72...
Take action=0: state from (0, {'prob': 1}) -> 4, reward=0.0
Take action=3: state from 4 -> 5, reward=0.0, done=True, episode_length=2
	Episode=73...
Take action=3: state from (0, {'prob': 1}) -> 1, reward=0.0
Take action=3: state from 1 -> 2, reward=0.0
Take action=3: state from 2 -> 2, reward=0.0
Take action=1: state from 2 -> 6, reward=0.0
Take action=1: state from 6 -> 10, reward=0.0
Take action=3: state from 10 -> 9, reward=0.0
Take action=3: state from 9 -> 5, reward=0.0, done=True, episode_length=7
	Episode=74...
Take action=0: state from (0, {'prob': 1}) -> 4, reward=0.0
Take action=0: state from 4 -> 0, reward=0.0
Take action=0: state from 0 -> 0, reward=0.0


Take action=0: state from 0 -> 0, reward=0.0
Take action=0: state from 0 -> 4, reward=0.0
Take action=0: state from 4 -> 8, reward=0.0
Take action=2: state from 8 -> 4, reward=0.0
Take action=3: state from 4 -> 5, reward=0.0, done=True, episode_length=8
	Episode=75...
Take action=0: state from (0, {'prob': 1}) -> 4, reward=0.0
Take action=0: state from 4 -> 0, reward=0.0
Take action=1: state from 0 -> 1, reward=0.0
Take action=3: state from 1 -> 1, reward=0.0
Take action=3: state from 1 -> 1, reward=0.0
Take action=3: state from 1 -> 2, reward=0.0
Take action=3: state from 2 -> 1, reward=0.0
Take action=0: state from 1 -> 0, reward=0.0
Take action=0: state from 0 -> 4, reward=0.0
Take action=0: state from 4 -> 4, reward=0.0
Take action=0: state from 4 -> 8, reward=0.0


Take action=0: state from 8 -> 4, reward=0.0
Take action=3: state from 4 -> 4, reward=0.0
Take action=0: state from 4 -> 0, reward=0.0
Take action=0: state from 0 -> 0, reward=0.0
Take action=0: state from 0 -> 0, reward=0.0
Take action=1: state from 0 -> 4, reward=0.0
Take action=2: state from 4 -> 0, reward=0.0
Take action=0: state from 0 -> 0, reward=0.0
Take action=0: state from 0 -> 4, reward=0.0
Take action=0: state from 4 -> 4, reward=0.0
Take action=1: state from 4 -> 8, reward=0.0
Take action=3: state from 8 -> 9, reward=0.0
Take action=3: state from 9 -> 10, reward=0.0
Take action=2: state from 10 -> 14, reward=0.0
Take action=3: state from 14 -> 15, reward=1.0, done=True, episode_length=26


	Episode=76...
Take action=1: state from (0, {'prob': 1}) -> 4, reward=0.0
Take action=0: state from 4 -> 0, reward=0.0
Take action=0: state from 0 -> 0, reward=0.0
Take action=0: state from 0 -> 4, reward=0.0
Take action=0: state from 4 -> 0, reward=0.0
Take action=0: state from 0 -> 4, reward=0.0
Take action=3: state from 4 -> 4, reward=0.0
Take action=3: state from 4 -> 5, reward=0.0, done=True, episode_length=8
	Episode=77...
Take action=2: state from (0, {'prob': 1}) -> 1, reward=0.0
Take action=3: state from 1 -> 1, reward=0.0
Take action=3: state from 1 -> 0, reward=0.0
Take action=0: state from 0 -> 0, reward=0.0
Take action=0: state from 0 -> 0, reward=0.0
Take action=0: state from 0 -> 0, reward=0.0
Take action=2: state from 0 -> 1, reward=0.0


Take action=3: state from 1 -> 2, reward=0.0
Take action=1: state from 2 -> 1, reward=0.0
Take action=3: state from 1 -> 1, reward=0.0
Take action=3: state from 1 -> 1, reward=0.0
Take action=0: state from 1 -> 5, reward=0.0, done=True, episode_length=12
	Episode=78...
Take action=3: state from (0, {'prob': 1}) -> 0, reward=0.0
Take action=2: state from 0 -> 1, reward=0.0
Take action=2: state from 1 -> 2, reward=0.0
Take action=3: state from 2 -> 2, reward=0.0
Take action=3: state from 2 -> 1, reward=0.0
Take action=1: state from 1 -> 5, reward=0.0, done=True, episode_length=6
	Episode=79...
Take action=2: state from (0, {'prob': 1}) -> 1, reward=0.0
Take action=1: state from 1 -> 5, reward=0.0, done=True, episode_length=2
	Episode=80...
Take action=3: state from (0, {'prob': 1}) -> 1, reward=0.0
Take action=3: state from 1 -> 0, reward=0.0


Take action=2: state from 0 -> 0, reward=0.0
Take action=2: state from 0 -> 4, reward=0.0
Take action=0: state from 4 -> 8, reward=0.0
Take action=3: state from 8 -> 9, reward=0.0
Take action=2: state from 9 -> 10, reward=0.0
Take action=3: state from 10 -> 11, reward=0.0, done=True, episode_length=8
	Episode=81...
Take action=3: state from (0, {'prob': 1}) -> 0, reward=0.0
Take action=2: state from 0 -> 0, reward=0.0
Take action=0: state from 0 -> 4, reward=0.0
Take action=0: state from 4 -> 4, reward=0.0
Take action=0: state from 4 -> 4, reward=0.0
Take action=0: state from 4 -> 0, reward=0.0
Take action=2: state from 0 -> 4, reward=0.0
Take action=0: state from 4 -> 4, reward=0.0


Take action=1: state from 4 -> 8, reward=0.0
Take action=2: state from 8 -> 12, reward=0.0, done=True, episode_length=10
	Episode=82...
Take action=2: state from (0, {'prob': 1}) -> 4, reward=0.0
Take action=0: state from 4 -> 8, reward=0.0
Take action=3: state from 8 -> 9, reward=0.0
Take action=0: state from 9 -> 13, reward=0.0
Take action=0: state from 13 -> 12, reward=0.0, done=True, episode_length=5
	Episode=83...
Take action=2: state from (0, {'prob': 1}) -> 4, reward=0.0
Take action=0: state from 4 -> 8, reward=0.0
Take action=3: state from 8 -> 4, reward=0.0
Take action=0: state from 4 -> 4, reward=0.0
Take action=0: state from 4 -> 0, reward=0.0
Take action=2: state from 0 -> 0, reward=0.0
Take action=2: state from 0 -> 4, reward=0.0
Take action=0: state from 4 -> 0, reward=0.0


Take action=2: state from 0 -> 0, reward=0.0
Take action=2: state from 0 -> 1, reward=0.0
Take action=3: state from 1 -> 2, reward=0.0
Take action=3: state from 2 -> 3, reward=0.0
Take action=2: state from 3 -> 3, reward=0.0
Take action=2: state from 3 -> 3, reward=0.0
Take action=1: state from 3 -> 3, reward=0.0
Take action=0: state from 3 -> 7, reward=0.0, done=True, episode_length=16
	Episode=84...
Take action=2: state from (0, {'prob': 1}) -> 4, reward=0.0
Take action=0: state from 4 -> 0, reward=0.0
Take action=2: state from 0 -> 0, reward=0.0
Take action=2: state from 0 -> 4, reward=0.0
Take action=1: state from 4 -> 5, reward=0.0, done=True, episode_length=5
	Episode=85...
Take action=2: state from (0, {'prob': 1}) -> 0, reward=0.0
Take action=2: state from 0 -> 0, reward=0.0


Take action=2: state from 0 -> 0, reward=0.0
Take action=2: state from 0 -> 4, reward=0.0
Take action=2: state from 4 -> 8, reward=0.0
Take action=3: state from 8 -> 8, reward=0.0
Take action=2: state from 8 -> 9, reward=0.0
Take action=3: state from 9 -> 10, reward=0.0
Take action=3: state from 10 -> 11, reward=0.0, done=True, episode_length=9
	Episode=86...
Take action=2: state from (0, {'prob': 1}) -> 4, reward=0.0
Take action=0: state from 4 -> 8, reward=0.0
Take action=0: state from 8 -> 4, reward=0.0
Take action=0: state from 4 -> 4, reward=0.0
Take action=0: state from 4 -> 4, reward=0.0
Take action=3: state from 4 -> 4, reward=0.0
Take action=0: state from 4 -> 4, reward=0.0
Take action=0: state from 4 -> 8, reward=0.0


Take action=3: state from 8 -> 8, reward=0.0
Take action=3: state from 8 -> 9, reward=0.0
Take action=3: state from 9 -> 5, reward=0.0, done=True, episode_length=11
	Episode=87...
Take action=1: state from (0, {'prob': 1}) -> 4, reward=0.0
Take action=0: state from 4 -> 0, reward=0.0
Take action=3: state from 0 -> 0, reward=0.0
Take action=2: state from 0 -> 1, reward=0.0
Take action=3: state from 1 -> 2, reward=0.0
Take action=3: state from 2 -> 1, reward=0.0
Take action=3: state from 1 -> 2, reward=0.0
Take action=3: state from 2 -> 3, reward=0.0
Take action=2: state from 3 -> 3, reward=0.0
Take action=3: state from 3 -> 3, reward=0.0
Take action=2: state from 3 -> 7, reward=0.0, done=True, episode_length=11
	Episode=88...
Take action=2: state from (0, {'prob': 1}) -> 1, reward=0.0


Take action=3: state from 1 -> 0, reward=0.0
Take action=2: state from 0 -> 0, reward=0.0
Take action=2: state from 0 -> 4, reward=0.0
Take action=2: state from 4 -> 5, reward=0.0, done=True, episode_length=5
	Episode=89...
Take action=3: state from (0, {'prob': 1}) -> 0, reward=0.0
Take action=0: state from 0 -> 0, reward=0.0
Take action=1: state from 0 -> 0, reward=0.0
Take action=3: state from 0 -> 0, reward=0.0
Take action=3: state from 0 -> 0, reward=0.0
Take action=3: state from 0 -> 0, reward=0.0
Take action=1: state from 0 -> 4, reward=0.0
Take action=0: state from 4 -> 0, reward=0.0
Take action=2: state from 0 -> 1, reward=0.0
Take action=3: state from 1 -> 1, reward=0.0
Take action=3: state from 1 -> 0, reward=0.0


Take action=3: state from 0 -> 0, reward=0.0
Take action=3: state from 0 -> 0, reward=0.0
Take action=1: state from 0 -> 1, reward=0.0
Take action=3: state from 1 -> 2, reward=0.0
Take action=3: state from 2 -> 2, reward=0.0
Take action=3: state from 2 -> 1, reward=0.0
Take action=2: state from 1 -> 2, reward=0.0
Take action=3: state from 2 -> 3, reward=0.0
Take action=3: state from 3 -> 3, reward=0.0
Take action=2: state from 3 -> 3, reward=0.0
Take action=2: state from 3 -> 3, reward=0.0
Take action=2: state from 3 -> 7, reward=0.0, done=True, episode_length=23
	Episode=90...
Take action=2: state from (0, {'prob': 1}) -> 1, reward=0.0
Take action=1: state from 1 -> 5, reward=0.0, done=True, episode_length=2


	Episode=91...
Take action=2: state from (0, {'prob': 1}) -> 1, reward=0.0
Take action=3: state from 1 -> 2, reward=0.0
Take action=0: state from 2 -> 2, reward=0.0
Take action=3: state from 2 -> 2, reward=0.0
Take action=3: state from 2 -> 1, reward=0.0
Take action=3: state from 1 -> 1, reward=0.0
Take action=3: state from 1 -> 2, reward=0.0
Take action=3: state from 2 -> 3, reward=0.0
Take action=0: state from 3 -> 3, reward=0.0
Take action=2: state from 3 -> 7, reward=0.0, done=True, episode_length=10
	Episode=92...
Take action=2: state from (0, {'prob': 1}) -> 1, reward=0.0
Take action=3: state from 1 -> 0, reward=0.0
Take action=1: state from 0 -> 1, reward=0.0
Take action=3: state from 1 -> 1, reward=0.0
Take action=3: state from 1 -> 1, reward=0.0


Take action=0: state from 1 -> 1, reward=0.0
Take action=3: state from 1 -> 0, reward=0.0
Take action=2: state from 0 -> 0, reward=0.0
Take action=2: state from 0 -> 4, reward=0.0
Take action=0: state from 4 -> 0, reward=0.0
Take action=1: state from 0 -> 0, reward=0.0
Take action=2: state from 0 -> 0, reward=0.0
Take action=2: state from 0 -> 0, reward=0.0
Take action=2: state from 0 -> 0, reward=0.0
Take action=2: state from 0 -> 4, reward=0.0
Take action=0: state from 4 -> 0, reward=0.0
Take action=2: state from 0 -> 1, reward=0.0
Take action=3: state from 1 -> 1, reward=0.0
Take action=3: state from 1 -> 0, reward=0.0
Take action=3: state from 0 -> 0, reward=0.0


Take action=3: state from 0 -> 0, reward=0.0
Take action=2: state from 0 -> 1, reward=0.0
Take action=3: state from 1 -> 2, reward=0.0
Take action=3: state from 2 -> 1, reward=0.0
Take action=0: state from 1 -> 5, reward=0.0, done=True, episode_length=25
	Episode=93...
Take action=1: state from (0, {'prob': 1}) -> 0, reward=0.0
Take action=1: state from 0 -> 4, reward=0.0
Take action=0: state from 4 -> 4, reward=0.0
Take action=2: state from 4 -> 5, reward=0.0, done=True, episode_length=4
	Episode=94...
Take action=1: state from (0, {'prob': 1}) -> 1, reward=0.0
Take action=3: state from 1 -> 1, reward=0.0
Take action=3: state from 1 -> 1, reward=0.0
Take action=3: state from 1 -> 0, reward=0.0
Take action=2: state from 0 -> 4, reward=0.0
Take action=0: state from 4 -> 0, reward=0.0


Take action=2: state from 0 -> 0, reward=0.0
Take action=2: state from 0 -> 1, reward=0.0
Take action=1: state from 1 -> 5, reward=0.0, done=True, episode_length=9
	Episode=95...
Take action=2: state from (0, {'prob': 1}) -> 0, reward=0.0
Take action=2: state from 0 -> 0, reward=0.0
Take action=2: state from 0 -> 1, reward=0.0
Take action=3: state from 1 -> 1, reward=0.0
Take action=3: state from 1 -> 0, reward=0.0
Take action=2: state from 0 -> 0, reward=0.0
Take action=2: state from 0 -> 1, reward=0.0
Take action=3: state from 1 -> 1, reward=0.0
Take action=2: state from 1 -> 1, reward=0.0
Take action=3: state from 1 -> 2, reward=0.0
Take action=3: state from 2 -> 2, reward=0.0


Take action=3: state from 2 -> 1, reward=0.0
Take action=3: state from 1 -> 1, reward=0.0
Take action=1: state from 1 -> 5, reward=0.0, done=True, episode_length=14
	Episode=96...
Take action=2: state from (0, {'prob': 1}) -> 4, reward=0.0
Take action=0: state from 4 -> 4, reward=0.0
Take action=1: state from 4 -> 8, reward=0.0
Take action=3: state from 8 -> 4, reward=0.0
Take action=0: state from 4 -> 0, reward=0.0
Take action=2: state from 0 -> 0, reward=0.0
Take action=0: state from 0 -> 0, reward=0.0
Take action=0: state from 0 -> 4, reward=0.0
Take action=2: state from 4 -> 0, reward=0.0
Take action=2: state from 0 -> 1, reward=0.0
Take action=0: state from 1 -> 1, reward=0.0
Take action=1: state from 1 -> 0, reward=0.0
Take action=1: state from 0 -> 4, reward=0.0


Take action=0: state from 4 -> 0, reward=0.0
Take action=2: state from 0 -> 1, reward=0.0
Take action=3: state from 1 -> 2, reward=0.0
Take action=3: state from 2 -> 1, reward=0.0
Take action=3: state from 1 -> 2, reward=0.0
Take action=3: state from 2 -> 2, reward=0.0
Take action=3: state from 2 -> 1, reward=0.0
Take action=0: state from 1 -> 5, reward=0.0, done=True, episode_length=21
	Episode=97...
Take action=2: state from (0, {'prob': 1}) -> 4, reward=0.0
Take action=0: state from 4 -> 4, reward=0.0
Take action=1: state from 4 -> 5, reward=0.0, done=True, episode_length=3
	Episode=98...
Take action=2: state from (0, {'prob': 1}) -> 0, reward=0.0
Take action=2: state from 0 -> 0, reward=0.0
Take action=2: state from 0 -> 1, reward=0.0


Take action=3: state from 1 -> 2, reward=0.0
Take action=3: state from 2 -> 1, reward=0.0
Take action=3: state from 1 -> 1, reward=0.0
Take action=3: state from 1 -> 2, reward=0.0
Take action=2: state from 2 -> 6, reward=0.0
Take action=1: state from 6 -> 7, reward=0.0, done=True, episode_length=9
	Episode=99...
Take action=2: state from (0, {'prob': 1}) -> 1, reward=0.0
Take action=3: state from 1 -> 2, reward=0.0
Take action=3: state from 2 -> 2, reward=0.0
Take action=3: state from 2 -> 3, reward=0.0
Take action=2: state from 3 -> 3, reward=0.0
Take action=3: state from 3 -> 2, reward=0.0
Take action=3: state from 2 -> 1, reward=0.0
Take action=3: state from 1 -> 2, reward=0.0
Take action=3: state from 2 -> 3, reward=0.0


Take action=2: state from 3 -> 3, reward=0.0
Take action=2: state from 3 -> 7, reward=0.0, done=True, episode_length=11
Episode: 0099, Success:  1%, Avg length: 09, Epsilon: 0.37, Avg time(s): 4.54


<keras.engine.sequential.Sequential at 0x7fe61d43bee0>

## <font color='darkblue'>Conclusion and Next Steps</font>
You learned how to stabilize neural network training by using the following techniques:

* Experience replay
* Target networks

These two techniques are building blocks in the success of modern deep Q-learning programs.
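
As a compact recap, the two techniques can be sketched in plain Python. This is a minimal illustration, not the code used in this Colab: the names `ReplayBuffer`, `td_target`, and `should_sync_target`, and the default `gamma=0.99`, are assumptions made for this example.

```python
import random
from collections import deque


class ReplayBuffer:
    """Hypothetical fixed-size store of (state, action, reward, next_state, done) tuples."""

    def __init__(self, capacity):
        # A deque with maxlen silently drops the oldest tuple once it is full.
        self.buffer = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform sampling without replacement mixes tuples from different
        # points in time, which reduces correlation between training inputs.
        items = list(self.buffer)
        return random.sample(items, min(batch_size, len(items)))

    def __len__(self):
        return len(self.buffer)


def td_target(reward, next_q_values, done, gamma=0.99):
    """Training target; next_q_values are assumed to come from the target network."""
    if done:
        # Terminal transition: there is no future return to bootstrap from.
        return reward
    return reward + gamma * max(next_q_values)


def should_sync_target(step, sync_interval):
    """Return True on the steps where the target network copies the online weights."""
    return step % sync_interval == 0


# Usage: fill the buffer past capacity, then draw a minibatch.
buffer = ReplayBuffer(capacity=3)
for state in range(5):
    buffer.add(state, 0, 0.0, state + 1, False)
minibatch = buffer.sample(2)
```

With Keras models, the sync step itself is typically a weight copy such as `target_model.set_weights(model.get_weights())`, performed only when `should_sync_target` returns `True`, so the targets stay fixed between syncs.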

Congratulations! You've completed the course Colabs. Return to the course [landing page](https://developers.google.com/machine-learning/reinforcement-learning/) to explore the TensorFlow library for Reinforcement Learning.