## Wumpus World - DeepQAgent

## Action-value network with two inputs

**Note:** Running this code requires a GPU, because the model layers use the 'channels_first' data format.

**Q-learning with an action-value network that takes two inputs (the state and the action) and produces one output (the action-value). An epsilon-greedy policy was used.**

The encoded state goes through several convolutional layers. The proposed action goes into a separate input. The output of the convolutional layers is combined with the proposed action and passed through a dense layer.

The belief state is encoded as a 3-D tensor using 13 feature planes (each plane is a grid_height x grid_width matrix), so the state shape is (13, grid_height, grid_width). See the `DeepQAgent.encode_belief_state()` function.

Feature planes:
- Plane 1 - location of the agent
- Plane 2 - visited locations
- Plane 3 - stench locations
- Plane 4 - breeze locations
- Planes 5-8 - orientation of the agent
- Plane 9 - does the agent have gold?
- Plane 10 - does the agent perceive a glitter?
- Plane 11 - does the agent have an arrow?
- Plane 12 - has the agent heard a scream?
- Plane 13 - does the agent perceive a bump?
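
The plane layout above can be sketched without NumPy as nested lists. This is only an illustration: `encode_planes` and its parameters are hypothetical names, only a few of the 13 planes are filled in, and the real encoder is `DeepQAgent.encode_belief_state()`.

```python
# Hypothetical standalone helper; the notebook's real encoder is
# DeepQAgent.encode_belief_state(), which uses a NumPy array instead.
NUM_PLANES = 13
ORIENTATION_PLANE = {"north": 4, "south": 5, "east": 6, "west": 7}  # 0-based plane indices


def encode_planes(grid_width, grid_height, location, visited, orientation, has_arrow):
    """Return 13 planes, each a grid_height x grid_width nested list of 0/1 values."""
    planes = [[[0] * grid_width for _ in range(grid_height)] for _ in range(NUM_PLANES)]

    def fill(plane_index):
        # Set every cell of one plane to 1 (used for the "global flag" planes)
        for row in planes[plane_index]:
            for x in range(grid_width):
                row[x] = 1

    x, y = location
    planes[0][y][x] = 1                    # plane 1: location of the agent
    for vx, vy in visited:
        planes[1][vy][vx] = 1              # plane 2: visited locations
    fill(ORIENTATION_PLANE[orientation])   # planes 5-8: exactly one is all 1s
    if has_arrow:
        fill(10)                           # plane 11: the agent has an arrow
    return planes


state = encode_planes(4, 4, location=(0, 0), visited={(0, 0)},
                      orientation="east", has_arrow=True)
print(len(state), len(state[0]), len(state[0][0]))  # 13 4 4
```
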

The experience data generated by the probabilistic agent `ProbAgent` was used as the first experience set to train the `DeepQAgent`. Please see the notebook **"prob_agent_collect_experience.ipynb"**

The `DeepQAgent` learned to climb out of the cave without gold.

**The `Orientation` class: orientation of the Agent (north, south, east, west)**

In [1]:
import enum


class Orientation(enum.Enum):
    north = 1
    south = 2
    east = 3
    west = 4
    
    
    @property
    def turn_left(self):
        dict_turn_left = {
            Orientation.north: Orientation.west, 
            Orientation.south: Orientation.east, 
            Orientation.east: Orientation.north, 
            Orientation.west: Orientation.south
        }
        new_orientation = dict_turn_left.get(self)
        return new_orientation
    
    
    @property
    def turn_right(self):
        dict_turn_right = {
            Orientation.north: Orientation.east, 
            Orientation.south: Orientation.west, 
            Orientation.east: Orientation.south, 
            Orientation.west: Orientation.north
        }
        new_orientation = dict_turn_right.get(self)
        return new_orientation
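
A quick sanity check on the two turn tables: a left turn followed by a right turn should be a no-op, and four left turns should return to the starting orientation. The sketch below assumes a stand-in enum named `Heading` (hypothetical) with the same mappings as `Orientation`, so that it runs on its own.

```python
import enum


class Heading(enum.Enum):  # hypothetical stand-in for Orientation
    north = 1
    south = 2
    east = 3
    west = 4


# Same mappings as Orientation.turn_left and Orientation.turn_right
LEFT = {Heading.north: Heading.west, Heading.west: Heading.south,
        Heading.south: Heading.east, Heading.east: Heading.north}
RIGHT = {Heading.north: Heading.east, Heading.east: Heading.south,
         Heading.south: Heading.west, Heading.west: Heading.north}


def turns_are_consistent():
    """Check that right undoes left and that four left turns form a full circle."""
    for heading in Heading:
        if RIGHT[LEFT[heading]] is not heading:
            return False
        current = heading
        for _ in range(4):
            current = LEFT[current]
        if current is not heading:
            return False
    return True


print(turns_are_consistent())  # True
```
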

**The `Action` class**

The agent can move forward, turn left by 90 degrees, or turn right by 90 degrees. 

The action Grab can be used to pick up the gold if it is in the same square as the agent. 

The action Shoot can be used to fire an arrow in a straight line in the direction the agent is facing; the arrow continues until it either kills the Wumpus or hits a wall. The Agent has only one arrow. 

The action Climb can be used to climb out of the cave, but only from the start square.

In [2]:
class Action():
    def __init__(self, is_forward=False, is_turn_left=False, is_turn_right=False, 
                 is_shoot=False, is_grab=False, is_climb=False):
        # Exactly one action flag must be set (a chained XOR would also accept three or five flags)
        assert sum([is_forward, is_turn_left, is_turn_right, is_shoot, is_grab, is_climb]) == 1
        self.is_forward = is_forward
        self.is_turn_left = is_turn_left
        self.is_turn_right = is_turn_right
        self.is_shoot = is_shoot
        self.is_grab = is_grab
        self.is_climb = is_climb
    
    @classmethod
    def forward(cls):
        return Action(is_forward=True)
    
    @classmethod
    def turn_left(cls):
        return Action(is_turn_left=True)
    
    @classmethod
    def turn_right(cls):
        return Action(is_turn_right=True)
    
    @classmethod
    def shoot(cls):
        return Action(is_shoot=True)
    
    @classmethod
    def grab(cls):
        return Action(is_grab=True)
    
    @classmethod
    def climb(cls):
        return Action(is_climb=True)
    
    def show(self):
        if self.is_forward:
            action_str = "forward"
        elif self.is_turn_left:
            action_str = "turn_left"
        elif self.is_turn_right:
            action_str = "turn_right"
        elif self.is_shoot:
            action_str = "shoot"
        elif self.is_grab:
            action_str = "grab"
        else:
            action_str = "climb"
        return action_str
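
A constructor that takes one boolean flag per action has to reject calls where no flag or several flags are set. The sketch below compares two ways of writing that check, a chained XOR and an explicit count; the helper names `xor_chain` and `exactly_one` are hypothetical and not part of the notebook.

```python
def xor_chain(flags):
    """Chained XOR: True whenever an odd number of flags is set."""
    result = False
    for flag in flags:
        result ^= flag
    return result


def exactly_one(flags):
    """True only when a single flag is set."""
    return sum(bool(flag) for flag in flags) == 1


one_flag = [True, False, False, False, False, False]
three_flags = [True, True, True, False, False, False]

print(xor_chain(one_flag), exactly_one(one_flag))        # True True
print(xor_chain(three_flags), exactly_one(three_flags))  # True False
```

The two checks agree for valid input, but only the count rejects a call with three flags set.
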

**The `Coords` class** 

Each square on the grid has two coordinates: x (column) and y (row). The start square is `Coords(x=0, y=0)`

In [3]:
from collections import namedtuple


class Coords(namedtuple('Coords', 'x y')):
    def adjacent_cells(self, grid_width, grid_height):
        neighbors = []
        if self.x > 0: # to left
            neighbors.append(Coords(self.x - 1, self.y))
        if self.x < (grid_width - 1): # to right
            neighbors.append(Coords(self.x + 1, self.y))
        if self.y > 0: # below
            neighbors.append(Coords(self.x, self.y - 1))
        if self.y < (grid_height - 1): # above
            neighbors.append(Coords(self.x, self.y + 1))
        return neighbors
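
As a usage illustration, the number of adjacent cells depends on where a square sits: 2 in a corner, 3 on an edge, 4 in the interior. The sketch uses a standalone function `neighbors` on plain tuples (a hypothetical stand-in for `Coords.adjacent_cells`), so it runs on its own.

```python
def neighbors(x, y, grid_width, grid_height):
    """Return the directly (not diagonally) adjacent cells that lie on the grid."""
    candidates = [(x - 1, y), (x + 1, y), (x, y - 1), (x, y + 1)]
    return [(cx, cy) for cx, cy in candidates
            if 0 <= cx < grid_width and 0 <= cy < grid_height]


print(neighbors(0, 0, 4, 4))       # [(1, 0), (0, 1)]
print(len(neighbors(1, 0, 4, 4)))  # 3
print(len(neighbors(1, 1, 4, 4)))  # 4
```
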

**The `Percept` class**

The Agent has five sensors.

- In the square containing the Wumpus and in the directly (not diagonally) adjacent squares, the Agent will receive a Stench.
- In the squares directly adjacent to a pit, the Agent will perceive a Breeze.
- In the square where the gold is, the Agent will perceive a Glitter.
- When an Agent walks into a wall it will perceive a Bump.
- When the Wumpus is killed, the Agent will hear a Scream.

The percept also contains the reward calculated by the environment after each of the agent's actions: +1000 for climbing out of the cave with the gold, -1000 for falling into a pit or being eaten by the Wumpus, -1 for each action taken, and -10 for using the arrow.

`Percept.is_terminated`: The game ends either when the Agent dies or when the Agent climbs out of the cave.
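
Because the -1 step cost applies to every action, the per-action rewards that the environment actually emits are sums of the components listed above. A minimal sketch of that arithmetic, using a hypothetical helper `action_reward` that is not part of the notebook:

```python
STEP_COST = -1         # charged for every action
ARROW_COST = -10       # extra cost for using the arrow
GOLD_BONUS = 1000      # climbing out of the cave with the gold
DEATH_PENALTY = -1000  # falling into a pit or being eaten by the Wumpus


def action_reward(used_arrow=False, climbed_with_gold=False, died=False):
    """Combine the reward components for a single action."""
    reward = STEP_COST
    if used_arrow:
        reward += ARROW_COST
    if climbed_with_gold:
        reward += GOLD_BONUS
    if died:
        reward += DEATH_PENALTY
    return reward


print(action_reward())                        # -1
print(action_reward(used_arrow=True))         # -11
print(action_reward(climbed_with_gold=True))  # 999
print(action_reward(died=True))               # -1001
```

These are the same values (-1, -11, 999, -1001) that appear in `Environment.apply_action`.
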

In [4]:
class Percept():
    def __init__(self, stench, breeze, glitter, bump, scream, is_terminated, reward):
        self.stench = stench
        self.breeze = breeze
        self.glitter = glitter
        self.bump = bump
        self.scream = scream
        self.is_terminated = is_terminated
        self.reward = reward
    
    def show(self):
        print("stench: {}, breeze: {}, glitter: {}, bump: {}, scream: {}, is_terminated: {}, reward: {}"
              .format(self.stench, self.breeze, self.glitter, self.bump, self.scream, self.is_terminated, self.reward))

**The `AgentState` class**

Information about the Agent: its location and orientation, and whether the Agent is alive, has the gold, and has the arrow.

In [5]:
import copy


class AgentState():
    def __init__(self, location=Coords(0, 0), orientation=Orientation.east, has_gold=False, has_arrow=True, is_alive=True):
        self.location = location
        self.orientation = orientation
        self.has_gold = has_gold
        self.has_arrow = has_arrow
        self.is_alive = is_alive
    
    def turn_left(self):
        new_state = copy.deepcopy(self)
        new_state.orientation = new_state.orientation.turn_left
        return new_state
    
    def turn_right(self):
        new_state = copy.deepcopy(self)
        new_state.orientation = new_state.orientation.turn_right
        return new_state
    
    def forward(self, grid_width, grid_height):
        if self.orientation == Orientation.north:
            new_loc = Coords(self.location.x, min(grid_height - 1, self.location.y + 1))
        elif self.orientation == Orientation.south:
            new_loc = Coords(self.location.x, max(0, self.location.y - 1))
        elif self.orientation == Orientation.east:
            new_loc = Coords(min(grid_width - 1, self.location.x + 1), self.location.y)
        else:
            new_loc = Coords(max(0, self.location.x - 1), self.location.y) # if Orientation.west
        new_state = copy.deepcopy(self)
        new_state.location = new_loc
        return new_state
    
    def apply_move_action(self, action, grid_width, grid_height):
        if action.is_forward:
            return self.forward(grid_width, grid_height)
        if action.is_turn_left:
            return self.turn_left()
        if action.is_turn_right:
            return self.turn_right()
        if action.is_shoot:
            return self.use_arrow()
        if action.is_climb:
            return self
    
    def use_arrow(self):
        new_state = copy.deepcopy(self)
        new_state.has_arrow = False
        return new_state
    
    def show(self):
        print("location: {}, orientation: {}, has_gold: {}, has_arrow: {}, is_alive: {}"
              .format(self.location, self.orientation, self.has_gold, self.has_arrow, self.is_alive))

**Functions to create the list of all locations on the board and to generate the locations of gold, wumpus and pits**

The locations of the gold and the Wumpus are chosen randomly, with a uniform distribution, from the squares other than the start square. 

Each square other than the start can be a pit, with probability = `pit_prob`
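
Since each non-start square becomes a pit independently, the expected number of pits and the chance of a pit-free board follow directly from `pit_prob`. A small worked example, with hypothetical helper names:

```python
def expected_pit_count(grid_width, grid_height, pit_prob):
    """Expected number of pits: one independent trial per non-start square."""
    return (grid_width * grid_height - 1) * pit_prob


def prob_no_pits(grid_width, grid_height, pit_prob):
    """Probability that none of the non-start squares is a pit."""
    return (1 - pit_prob) ** (grid_width * grid_height - 1)


# For a 4x4 grid with pit_prob = 0.2 there are 15 candidate squares:
print(expected_pit_count(4, 4, 0.2))      # about 3 pits on average
print(round(prob_no_pits(4, 4, 0.2), 4))  # about 0.035
```
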

In [6]:
import random


# Create a list with all locations

def list_all_locations(grid_width, grid_height):
    all_cells = []
    for x in range(grid_width):
        for y in range(grid_height):
            all_cells.append(Coords(x, y))
    return all_cells



# Create locations for gold and wumpus

def random_location_except_origin(grid_width, grid_height):
    locations = list_all_locations(grid_width, grid_height)
    locations.remove(Coords(0, 0))
    return random.choice(locations)



# Create pit locations

def create_pit_locations(grid_width, grid_height, pit_prob):
    locations = list_all_locations(grid_width, grid_height)
    locations.remove(Coords(0, 0))
    pit_locations = [loc for loc in locations if random.random() < pit_prob]
    return pit_locations

**The `Environment` class**

An environment is initialized with these parameters:
- width of the grid
- height of the grid
- allow climb without gold
- pit probability: the probability of a pit being added to each square except (0, 0)

The standard game uses a 4x4 grid with a pit probability of 0.2 and allows climbing without the gold: `Environment.new_game(4, 4, 0.2, True)`.

In [7]:
import copy


class Environment():
    def __init__(self, grid_width, grid_height, pit_prob, allow_climb_without_gold, agent, pit_locations,
                 terminated, wumpus_loc, wumpus_alive, gold_loc):
        self.grid_width = grid_width
        self.grid_height = grid_height
        self.pit_prob = pit_prob
        self.allow_climb_without_gold = allow_climb_without_gold
        self.agent = agent
        self.pit_locations = pit_locations
        self.terminated = terminated
        self.wumpus_loc = wumpus_loc
        self.wumpus_alive = wumpus_alive
        self.gold_loc = gold_loc
    
    
    def is_pit_at(self, coords):
        return coords in self.pit_locations
    
    
    def is_wumpus_at(self, coords):
        return coords == self.wumpus_loc
    
    
    def is_agent_at(self, coords):
        return coords == self.agent.location
    
    
    def is_glitter(self):
        return self.gold_loc == self.agent.location
    
    
    def is_gold_at(self, coords):
        return coords == self.gold_loc
    
    
    def wumpus_in_line_of_fire(self):
        if self.agent.orientation == Orientation.west:
            return self.agent.location.x > self.wumpus_loc.x and self.agent.location.y == self.wumpus_loc.y
        if self.agent.orientation == Orientation.east:
            return self.agent.location.x < self.wumpus_loc.x and self.agent.location.y == self.wumpus_loc.y
        if self.agent.orientation == Orientation.south:
            return self.agent.location.x == self.wumpus_loc.x and self.agent.location.y > self.wumpus_loc.y
        if self.agent.orientation == Orientation.north:
            return self.agent.location.x == self.wumpus_loc.x and self.agent.location.y < self.wumpus_loc.y
    
    
    def kill_attempt_successful(self):
        return self.agent.has_arrow and self.wumpus_alive and self.wumpus_in_line_of_fire()
    
    
    def is_pit_adjacent(self, coords):
        for cell in coords.adjacent_cells(self.grid_width, self.grid_height):
            if cell in self.pit_locations:
                return True
        return False
    
    
    def is_wumpus_adjacent(self, coords):
        for cell in coords.adjacent_cells(self.grid_width, self.grid_height):
            if self.is_wumpus_at(cell):
                return True
        return False
    
    
    def is_breeze(self):
        return self.is_pit_adjacent(self.agent.location)
    
    
    def is_stench(self):
        return self.is_wumpus_adjacent(self.agent.location) or self.is_wumpus_at(self.agent.location)
    
    
    def apply_action(self, action):
        if self.terminated:
            return (self, Percept(False, False, False, False, False, True, 0))
        else:
            if action.is_forward:
                moved_agent = self.agent.forward(self.grid_width, self.grid_height)
                death = (self.is_wumpus_at(moved_agent.location) and self.wumpus_alive) or self.is_pit_at(moved_agent.location)
                new_agent = copy.deepcopy(moved_agent)
                new_agent.is_alive = not death
                new_gold_loc = new_agent.location if self.agent.has_gold else self.gold_loc
                new_env = Environment(self.grid_width, self.grid_height, self.pit_prob, self.allow_climb_without_gold, 
                                      new_agent, self.pit_locations, death, self.wumpus_loc, self.wumpus_alive, new_gold_loc)
                percept = Percept(new_env.is_stench(), new_env.is_breeze(), new_env.is_glitter(), 
                                  new_agent.location == self.agent.location, False, death, 
                                  -1 if new_agent.is_alive else -1001)
                return (new_env, percept)
            
            if action.is_turn_left:
                new_env = Environment(self.grid_width, self.grid_height, self.pit_prob, self.allow_climb_without_gold, 
                                      self.agent.turn_left(), self.pit_locations, self.terminated, self.wumpus_loc, 
                                      self.wumpus_alive, self.gold_loc)
                percept = Percept(self.is_stench(), self.is_breeze(), self.is_glitter(), False, False, False, -1)
                return (new_env, percept)
            
            if action.is_turn_right:
                new_env = Environment(self.grid_width, self.grid_height, self.pit_prob, self.allow_climb_without_gold, 
                                      self.agent.turn_right(), self.pit_locations, self.terminated, self.wumpus_loc, 
                                      self.wumpus_alive, self.gold_loc)
                percept = Percept(self.is_stench(), self.is_breeze(), self.is_glitter(), False, False, False, -1)
                return (new_env, percept)
            
            if action.is_grab:
                new_agent = copy.deepcopy(self.agent)
                new_agent.has_gold = self.is_glitter()
                new_gold_loc = new_agent.location if new_agent.has_gold else self.gold_loc
                new_env = Environment(self.grid_width, self.grid_height, self.pit_prob, self.allow_climb_without_gold, 
                                      new_agent, self.pit_locations, self.terminated, self.wumpus_loc, self.wumpus_alive, 
                                      new_gold_loc)
                percept = Percept(self.is_stench(), self.is_breeze(), self.is_glitter(), False, False, False, -1)
                return (new_env, percept)
            
            if action.is_climb:
                in_start_loc = self.agent.location == Coords(0, 0)
                success = self.agent.has_gold and in_start_loc
                is_terminated = success or (self.allow_climb_without_gold and in_start_loc)
                new_env = Environment(self.grid_width, self.grid_height, self.pit_prob, self.allow_climb_without_gold, 
                                      self.agent, self.pit_locations, is_terminated, self.wumpus_loc, self.wumpus_alive, 
                                      self.gold_loc)
                percept = Percept(self.is_stench(), self.is_breeze(), self.is_glitter(), False, False, is_terminated, 
                                  999 if success else -1)
                return (new_env, percept)
            
            if action.is_shoot:
                had_arrow = self.agent.has_arrow
                wumpus_killed = self.kill_attempt_successful()
                new_agent = copy.deepcopy(self.agent)
                new_agent.has_arrow = False
                new_env = Environment(self.grid_width, self.grid_height, self.pit_prob, self.allow_climb_without_gold, 
                                      new_agent, self.pit_locations, self.terminated, self.wumpus_loc, 
                                      self.wumpus_alive and (not wumpus_killed), self.gold_loc)
                percept = Percept(self.is_stench(), self.is_breeze(), self.is_glitter(), False, wumpus_killed, False, 
                                  -11 if had_arrow else -1)
                return (new_env, percept)
    
    
    @classmethod
    def new_game(cls, grid_width, grid_height, pit_prob, allow_climb_without_gold):
        new_pit_locations = create_pit_locations(grid_width, grid_height, pit_prob)
        new_wumpus_loc = random_location_except_origin(grid_width, grid_height)
        new_gold_loc = random_location_except_origin(grid_width, grid_height)
        env = Environment(grid_width, grid_height, pit_prob, allow_climb_without_gold, 
                          AgentState(), new_pit_locations, False, new_wumpus_loc, True, new_gold_loc)
        percept = Percept(env.is_stench(), env.is_breeze(), False, False, False, False, 0.0)
        return (env, percept)
    
    
    def visualize(self):
        wumpus_symbol = "W" if self.wumpus_alive else "w"
        all_rows = []
        for y in range(self.grid_height - 1, -1, -1):
            row = []
            for x in range (self.grid_width):
                agent = "A" if self.is_agent_at(Coords(x, y)) else " "
                pit = "P" if self.is_pit_at(Coords(x, y)) else " "
                gold = "G" if self.is_gold_at(Coords(x, y)) else " "
                wumpus = wumpus_symbol if self.is_wumpus_at(Coords(x, y)) else " "
                cell = agent + pit + gold + wumpus
                row.append(cell)
            row_str = "|".join(row)
            all_rows.append(row_str)
        final_str = "\n".join(all_rows)
        print(final_str)

**Functions to encode and decode actions**

In [8]:
# Convert action to int

def encode_action_to_int(action):
    if action.is_forward:
        action_int = 0
    elif action.is_turn_left:
        action_int = 1
    elif action.is_turn_right:
        action_int = 2
    elif action.is_shoot:
        action_int = 3
    elif action.is_grab:
        action_int = 4
    else: # climb
        action_int = 5
    return action_int



# Convert action index (int) to action

def decode_action_index(index):
    actions = [Action.forward(), Action.turn_left(), Action.turn_right(), Action.shoot(), Action.grab(), Action.climb()]
    return actions[index]
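
The integer codes double as positions in the one-hot action vector that is fed to the network's action input. A minimal sketch of that mapping, using plain strings and lists instead of `Action` objects and NumPy arrays (`ACTION_NAMES` and `one_hot` are hypothetical names):

```python
# Same order as encode_action_to_int / decode_action_index
ACTION_NAMES = ["forward", "turn_left", "turn_right", "shoot", "grab", "climb"]


def one_hot(action_index, num_actions=len(ACTION_NAMES)):
    """Return a list with a 1 at action_index and 0s elsewhere."""
    vector = [0] * num_actions
    vector[action_index] = 1
    return vector


index = ACTION_NAMES.index("grab")
print(index, one_hot(index))  # 4 [0, 0, 0, 0, 1, 0]
print(ACTION_NAMES[index])    # grab
```
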

**The `ExperienceBuffer` and `ExperienceCollector` classes: for handling experience data**

In [9]:
import numpy as np


# The ExperienceBuffer class to store the states, actions and rewards as NumPy arrays

class ExperienceBuffer:
    def __init__(self, states, actions, rewards):
        self.states = states
        self.actions = actions
        self.rewards = rewards
    
    def serialize(self, h5file):
        h5file.create_group('experience')
        h5file['experience'].create_dataset('states', data=self.states)
        h5file['experience'].create_dataset('actions', data=self.actions)
        h5file['experience'].create_dataset('rewards', data=self.rewards)



# Function to load the experience buffer from HDF5 file

def load_experience(h5file):
    return ExperienceBuffer(
        states=np.array(h5file['experience']['states']),
        actions=np.array(h5file['experience']['actions']),
        rewards=np.array(h5file['experience']['rewards']))



# Function to combine experience buffers

def combine_experience(buffers):
    combined_states = np.concatenate([b.states for b in buffers])
    combined_actions = np.concatenate([b.actions for b in buffers])
    combined_rewards = np.concatenate([b.rewards for b in buffers])

    return ExperienceBuffer(
        combined_states,
        combined_actions,
        combined_rewards)



# The ExperienceCollector class to collect all the states, decisions and rewards (as Python lists)

class ExperienceCollector:
    def __init__(self):
        self.states = []
        self.actions = []
        self.rewards = []
    
    def record_state(self, state):
        self.states.append(state)
    
    def record_action(self, action):
        self.actions.append(action)
    
    def record_reward(self, reward):
        self.rewards.append(reward)
    
    def to_buffer(self):
        return ExperienceBuffer(
            states=np.array(self.states), 
            actions=np.array(self.actions), 
            rewards=np.array(self.rewards))

In [10]:
class Agent:
    def __init__(self):
        pass
    
    def select_action(self, percept):
        raise NotImplementedError()

**The `DeepQAgent` class: a Q-learning agent**

In [11]:
import numpy as np
from tensorflow import keras



class DeepQAgent(Agent):
    def __init__(self, model, grid_width, grid_height, agent_state,
                 visited_locations, stench_locations, breeze_locations, 
                 perceives_glitter, heard_scream, perceives_bump):
        self.model = model
        self.grid_width = grid_width
        self.grid_height = grid_height
        self.agent_state = agent_state
        self.visited_locations = set(visited_locations)
        self.stench_locations = set(stench_locations)
        self.breeze_locations = set(breeze_locations)
        self.perceives_glitter = perceives_glitter
        self.heard_scream = heard_scream
        self.perceives_bump = perceives_bump
        self.epsilon = 0.0
        self.collector = None
        
    
    
    # Control the epsilon-greedy policy
    def set_epsilon(self, epsilon):
        self.epsilon = epsilon
    
    
    
    # Attach an ExperienceCollector object to record the experience data
    def set_collector(self, collector):
        self.collector = collector
    
    
    
    def select_action(self, percept):
        
        # Update agent's variables
        visiting_new_location = self.agent_state.location not in self.visited_locations
        if visiting_new_location:
            self.visited_locations.add(self.agent_state.location)
        if percept.breeze:
            self.breeze_locations.add(self.agent_state.location)
        if percept.stench:
            self.stench_locations.add(self.agent_state.location)
        new_heard_scream = self.heard_scream or percept.scream
        self.heard_scream = new_heard_scream
        self.perceives_glitter = percept.glitter
        self.perceives_bump = percept.bump
        
        num_actions = 6
        
        state_tensor = self.encode_belief_state() # encode belief state
        state_tensors_list = [state_tensor for i in range(num_actions)] # list with 6 state tensors (the same items)
        state_tensors_array = np.array(state_tensors_list)
        
        # One-hot encode all 6 actions
        action_vectors = np.zeros((num_actions, num_actions))
        for i in range(num_actions):
            action_vectors[i][i] = 1
        
        # Predict action-values (using two inputs)
        values = self.model.predict([state_tensors_array, action_vectors])
        values = values.reshape(num_actions) # convert a matrix to a vector
        ranked_moves = self.rank_moves_eps_greedy(values) # rank the actions according to the epsilon-greedy policy
        action_index = ranked_moves[0] # index of the top-ranked action (random when exploring)
        if self.collector is not None: # record the state and decision if collecting experience
            self.collector.record_state(state=state_tensor)
            self.collector.record_action(action_index)
        next_action = decode_action_index(action_index) # decode the action from index

        if next_action.is_grab:
            if percept.glitter and not self.agent_state.has_gold:
                self.agent_state.has_gold = True
        else:
            self.agent_state = self.agent_state.apply_move_action(next_action, self.grid_width, self.grid_height)
        return (self, next_action)
    
    

    def rank_moves_eps_greedy(self, values):
        if np.random.random() < self.epsilon:
            values = np.random.random(values.shape)
        ranked_moves = np.argsort(values) # rank the actions from worst to best
        # Return actions in best-to-worst order (a reversed vector)
        return ranked_moves[::-1]
    
    
    
    def train(self, experience, lr=0.01, batch_size=128, epochs=1):
        opt = keras.optimizers.Adam(learning_rate=lr) # `lr` is a deprecated alias of `learning_rate`
        self.model.compile(loss='mse', optimizer=opt)

        n = experience.states.shape[0] # number of experience samples
        num_actions = 6
        y = np.zeros((n,)) # the target vector with rewards
        actions = np.zeros((n, num_actions))
        for i in range(n):
            action = experience.actions[i]
            reward = experience.rewards[i] / 1001.0 # rescale the reward values, so they are in the range from -1 to +1
            actions[i][action] = 1 # one_hot encode actions
            y[i] = reward

        self.model.fit([experience.states, actions], y, batch_size=batch_size, epochs=epochs)
    
    
    
    @classmethod
    def new_agent(cls, model, grid_width, grid_height):
        return DeepQAgent(model, grid_width, grid_height, AgentState(), set(), set(), set(), False, False, False)
    
    
    
    # Encode belief state using 13 feature planes (each plane is a grid_height x grid_width matrix)
    # The state shape is (13, grid_height, grid_width)
    
    def encode_belief_state(self):
        state_tensor = np.zeros((13, self.grid_height, self.grid_width)) # create a 3-D tensor
        all_cells = list_all_locations(self.grid_width, self.grid_height)
        
        # The first plane has a 1 for agent's location and 0s for other locations
        state_tensor[0][self.agent_state.location.y][self.agent_state.location.x] = 1
        
        for cell in all_cells:
            if cell in self.visited_locations:
                state_tensor[1][cell.y][cell.x] = 1 # 1s for visited locations
            if cell in self.stench_locations:
                state_tensor[2][cell.y][cell.x] = 1 # 1s for stench locations
            if cell in self.breeze_locations:
                state_tensor[3][cell.y][cell.x] = 1 # 1s for breeze locations
        
        if self.agent_state.orientation == Orientation.north: # a plane filled with 1s if Orientation.north
            state_tensor[4] = 1
        elif self.agent_state.orientation == Orientation.south: # a plane filled with 1s if Orientation.south
            state_tensor[5] = 1
        elif self.agent_state.orientation == Orientation.east: # a plane filled with 1s if Orientation.east
            state_tensor[6] = 1
        else: # a plane filled with 1s if Orientation.west
            state_tensor[7] = 1
        
        if self.agent_state.has_gold: # a plane filled with 1s if agent has gold, and 0s otherwise
            state_tensor[8] = 1
        if self.perceives_glitter: # a plane filled with 1s if agent perceives glitter, and 0s otherwise
            state_tensor[9] = 1
        if self.agent_state.has_arrow: # a plane filled with 1s if agent has arrow, and 0s otherwise
            state_tensor[10] = 1
        if self.heard_scream: # a plane filled with 1s if the agent has heard a scream (the Wumpus is dead), and 0s otherwise
            state_tensor[11] = 1
        if self.perceives_bump: # a plane filled with 1s if agent perceives bump, and 0s otherwise
            state_tensor[12] = 1
        
        return state_tensor
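
The epsilon-greedy ranking in `rank_moves_eps_greedy` can be illustrated in isolation. The sketch below is a pure-Python stand-in for the NumPy version in the class (the standalone function signature and the `rng` parameter are assumptions made for this example): with probability epsilon the value estimates are replaced by random scores, which yields a random ranking; otherwise the actions are ranked by their estimated values.

```python
import random


def rank_moves_eps_greedy(values, epsilon, rng=random):
    """Return action indices ordered best-to-worst under an epsilon-greedy policy."""
    if rng.random() < epsilon:
        # Explore: swap the estimates for random scores, giving a random ranking
        values = [rng.random() for _ in values]
    return sorted(range(len(values)), key=lambda i: values[i], reverse=True)


q_values = [0.1, -0.3, 0.7, 0.0, 0.2, -0.9]

# epsilon = 0.0 never explores, so the ranking is purely greedy
print(rank_moves_eps_greedy(q_values, epsilon=0.0))  # [2, 4, 0, 3, 1, 5]

# epsilon = 1.0 always explores, so the ranking is a random permutation
print(rank_moves_eps_greedy(q_values, epsilon=1.0))
```
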

**Function to create an action-value network with two inputs (states and actions) and one output (the action-value)**

The encoded state goes through several convolutional layers. The proposed action goes into a separate input. The output of the convolutional layers is combined with the proposed action and passed through a dense layer.

The output layer is Dense(1, activation='tanh'). The rewards used for training are divided by 1001.0, so that they are in the range from -1 to +1.
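
To see what the rescaling does to each raw reward, the following sketch applies the same division to the four per-action reward values the environment can emit (`scale_reward` is a hypothetical helper name):

```python
REWARD_SCALE = 1001.0  # same divisor as in DeepQAgent.train()


def scale_reward(reward):
    """Map a raw reward into the [-1, +1] output range of the tanh layer."""
    return reward / REWARD_SCALE


# Raw rewards: win, ordinary step, arrow shot, death
for raw in (999, -1, -11, -1001):
    print(raw, round(scale_reward(raw), 4))
```

After scaling, the ordinary step cost is roughly -0.001, which is very small compared with the terminal rewards near +1 and -1.
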

In [12]:
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Flatten, Dense, concatenate
from tensorflow.keras.layers import ZeroPadding2D, Conv2D, BatchNormalization, Activation


def create_action_value_network(state_shape):
    state_input = Input(shape=state_shape, name='state_input')
    action_input = Input(shape=(6,), name='action_input')
    
    conv1a = ZeroPadding2D((1, 1), data_format='channels_first')(state_input)
    conv1b = Conv2D(64, (3, 3), data_format='channels_first')(conv1a)
    conv1c = BatchNormalization(axis=1)(conv1b)
    conv1d = Activation('relu')(conv1c)
    
    conv2a = ZeroPadding2D((1, 1), data_format='channels_first')(conv1d)
    conv2b = Conv2D(64, (3, 3), data_format='channels_first')(conv2a)
    conv2c = BatchNormalization(axis=1)(conv2b)
    conv2d = Activation('relu')(conv2c)
    
    conv3a = ZeroPadding2D((1, 1), data_format='channels_first')(conv2d)
    conv3b = Conv2D(48, (3, 3), data_format='channels_first')(conv3a)
    conv3c = BatchNormalization(axis=1)(conv3b)
    conv3d = Activation('relu')(conv3c)

    conv4a = ZeroPadding2D((1, 1), data_format='channels_first')(conv3d)
    conv4b = Conv2D(32, (3, 3), data_format='channels_first')(conv4a)
    conv4c = BatchNormalization(axis=1)(conv4b)
    conv4d = Activation('relu')(conv4c)
    
    flat = Flatten()(conv4d)
    processed_state = Dense(512, activation='relu')(flat)
    
    state_and_action = concatenate([action_input, processed_state])
    hidden_layer = Dense(256, activation='relu')(state_and_action)
    value_output = Dense(1, activation='tanh')(hidden_layer)
    
    model = Model(inputs=[state_input, action_input], outputs=value_output)
    return model
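
The layer sizes can be checked by hand. The sketch below uses hypothetical helper functions to compute the spatial size after padding and a 3x3 convolution, plus parameter counts for convolutional and dense layers; the first result matches the 7552 parameters that `model.summary()` reports for the first `Conv2D` layer on a 4x4 grid.

```python
def conv_output_size(size, kernel=3, padding=1):
    """Spatial size after zero padding followed by an unpadded, stride-1 convolution."""
    return size + 2 * padding - kernel + 1


def conv2d_params(kernel_height, kernel_width, in_channels, out_channels):
    """Weights plus one bias per output channel."""
    return kernel_height * kernel_width * in_channels * out_channels + out_channels


def dense_params(in_units, out_units):
    """Weights plus one bias per output unit."""
    return in_units * out_units + out_units


print(conv_output_size(4))            # 4: padding by 1 keeps the 4x4 grid size
print(conv2d_params(3, 3, 13, 64))    # 7552: first Conv2D layer (13 input planes)

flat_units = 32 * 4 * 4               # Flatten output of the last conv block
print(flat_units)                     # 512
print(dense_params(6 + 512, 256))     # 132864: hidden layer after the concatenation
```
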

## Collecting experience, training and evaluating the agent

In [13]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


### Train the DeepQAgent from the experience data generated by the probabilistic agent

In [14]:
# Load the experience datasets generated by the probabilistic agent (5,000 games each)

import h5py

prob_agent_experience_02 = load_experience(h5py.File('drive/My Drive/prob_agent_experience_02', 'r'))
print(prob_agent_experience_02.states.shape)

prob_agent_experience_03 = load_experience(h5py.File('drive/My Drive/prob_agent_experience_03', 'r'))
print(prob_agent_experience_03.states.shape)

(76528, 13, 4, 4)
(76398, 13, 4, 4)


In [15]:
# Combine two experience buffers

prob_agent_experience = combine_experience([prob_agent_experience_02, prob_agent_experience_03])
print(prob_agent_experience.states.shape)

(152926, 13, 4, 4)


In [16]:
# Create an action-value network

model_10_2 = create_action_value_network(prob_agent_experience.states[0].shape)
model_10_2.summary()

Model: "functional_1"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
state_input (InputLayer)        [(None, 13, 4, 4)]   0                                            
__________________________________________________________________________________________________
zero_padding2d (ZeroPadding2D)  (None, 13, 6, 6)     0           state_input[0][0]                
__________________________________________________________________________________________________
conv2d (Conv2D)                 (None, 64, 4, 4)     7552        zero_padding2d[0][0]             
__________________________________________________________________________________________________
batch_normalization (BatchNorma (None, 64, 4, 4)     256         conv2d[0][0]                     
_______________________________________________________________________________________

In [17]:
# Create a DeepQAgent (4x4 grid) and train it on the experience data
# Adam optimizer, lr=0.001, epochs=100

agent_10_2 = DeepQAgent.new_agent(model_10_2, 4, 4)
agent_10_2.train(prob_agent_experience, lr=0.001, epochs=100)

Epoch 1/100
Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100
Epoch 6/100
Epoch 7/100
Epoch 8/100
Epoch 9/100
Epoch 10/100
Epoch 11/100
Epoch 12/100
Epoch 13/100
Epoch 14/100
Epoch 15/100
Epoch 16/100
Epoch 17/100
Epoch 18/100
Epoch 19/100
Epoch 20/100
Epoch 21/100
Epoch 22/100
Epoch 23/100
Epoch 24/100
Epoch 25/100
Epoch 26/100
Epoch 27/100
Epoch 28/100
Epoch 29/100
Epoch 30/100
Epoch 31/100
Epoch 32/100
Epoch 33/100
Epoch 34/100
Epoch 35/100
Epoch 36/100
Epoch 37/100
Epoch 38/100
Epoch 39/100
Epoch 40/100
Epoch 41/100
Epoch 42/100
Epoch 43/100
Epoch 44/100
Epoch 45/100
Epoch 46/100
Epoch 47/100
Epoch 48/100
Epoch 49/100
Epoch 50/100
Epoch 51/100
Epoch 52/100
Epoch 53/100
Epoch 54/100
Epoch 55/100
Epoch 56/100
Epoch 57/100
Epoch 58/100
Epoch 59/100
Epoch 60/100
Epoch 61/100
Epoch 62/100
Epoch 63/100
Epoch 64/100
Epoch 65/100
Epoch 66/100
Epoch 67/100
Epoch 68/100
Epoch 69/100
Epoch 70/100
Epoch 71/100
Epoch 72/100
Epoch 73/100
Epoch 74/100
Epoch 75/100
Epoch 76/100
Epoch 77/100
Epoch 78/100
...

**Save the model**

In [18]:
# Save the trained model as the "model_10_2"

agent_10_2.model.save('drive/My Drive/model_10_2')

Instructions for updating:
This property should not be used in TensorFlow 2.0, as updates are applied automatically.
INFO:tensorflow:Assets written to: drive/My Drive/model_10_2/assets


### Evaluating the updated agent: running 1,000 games and calculating the average score per game

**A game counts as a win only if the agent climbs out with the gold.**
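
The evaluation cells in this section all repeat the same loop. As a hedged sketch, that loop could be factored into one helper. The names `evaluate`, `StubAgent`, `StubEnv` and `Percept` are hypothetical and not part of the notebook; the stubs only mimic the interface used here (`select_action`, `apply_action`, `percept.reward`, `percept.is_terminated`), not the real `DeepQAgent` or `Environment`.

```python
from collections import namedtuple

# Hypothetical stand-ins that mimic only the interface used in this notebook.
Percept = namedtuple("Percept", ["reward", "is_terminated"])


class StubEnv:
    """Toy environment: every action costs 1; 'climb' ends the game."""

    def apply_action(self, action):
        return self, Percept(reward=-1, is_terminated=(action == "climb"))


class StubAgent:
    """Toy agent that turns right once and then climbs out."""

    def __init__(self, moves=0):
        self.moves = moves

    def select_action(self, percept):
        action = "turn_right" if self.moves == 0 else "climb"
        return StubAgent(self.moves + 1), action


def evaluate(new_agent, new_game, n_games, max_moves=200):
    """Play n_games and return summary statistics as a dict."""
    rewards, total_moves, wins, games_stopped = [], 0, 0, 0
    for _ in range(n_games):
        agent = new_agent()
        env, percept = new_game()
        total_reward, num_moves = 0, 0
        while not percept.is_terminated:
            agent, action = agent.select_action(percept)
            env, percept = env.apply_action(action)
            total_reward += percept.reward
            num_moves += 1
            if num_moves >= max_moves:  # same cut-off as `num_moves > 199`
                games_stopped += 1
                break
        if total_reward > 0:  # a win: the agent climbed out with the gold
            wins += 1
        total_moves += num_moves
        rewards.append(total_reward)
    return {
        "avg_reward_per_game": sum(rewards) / n_games,
        "total_moves": total_moves,
        "win_rate": wins / n_games,
        "games_stopped": games_stopped,
        "rewards": rewards,
    }


stats = evaluate(
    new_agent=StubAgent,
    new_game=lambda: (StubEnv(), Percept(reward=0, is_terminated=False)),
    n_games=10,
)
print(stats["avg_reward_per_game"], stats["win_rate"], stats["games_stopped"])
```

With the notebook's own classes, the call would presumably look like `evaluate(lambda: DeepQAgent.new_agent(model, 4, 4), lambda: Environment.new_game(4, 4, 0.2, True), 1000)`.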

In [19]:
model = keras.models.load_model('drive/My Drive/model_10_2')
#model.summary()

In [20]:
# epsilon = 0.0
# Run 100 games to see the actions
# Observed: the agent either climbs out without gold right at the start, or turns right and then climbs out


n_games = 100
total_moves = 0
n_games_reward = 0
rewards_list = []
wins = 0
games_stopped = 0

for i in range(n_games):
    if (i + 1) % 100 == 0:
        print("Game %d/%d ..." % (i + 1, n_games))
    
    agent = DeepQAgent.new_agent(model, 4, 4)
    (env, percept) = Environment.new_game(4, 4, 0.2, True)
    #env.visualize()
    #percept.show()
    total_reward = 0
    num_moves = 0

    #agent.set_epsilon(0.5)
    
    while not percept.is_terminated:
        (agent, next_action) = agent.select_action(percept)
        print("next_action:", next_action.show())
        #agent.agent_state.show()

        (env, percept) = env.apply_action(next_action)
        #env.visualize()
        #percept.show()
        total_reward += percept.reward
        num_moves += 1
        if num_moves > 199:
            games_stopped += 1
            break
    
    if total_reward > 0: # only if the agent climbs out with gold, it is counted as a win
        wins += 1
    n_games_reward += total_reward
    total_moves += num_moves
    rewards_list.append(total_reward)

print("epsilon =", agent.epsilon)
print("Number of games: ", n_games)
print("Total number of moves: ", total_moves)
print("n_games_reward:", n_games_reward)
print("avg_reward_per_game: %.2f" % (n_games_reward / n_games))
print("wins/games: %.3f" % (wins / n_games))
print("games_stopped:", games_stopped)
print("rewards_list[:100]:", rewards_list[:100])

next_action: climb
next_action: turn_right
next_action: climb
next_action: climb
next_action: turn_right
next_action: climb
next_action: turn_right
next_action: climb
next_action: climb
next_action: turn_right
next_action: climb
next_action: climb
next_action: turn_right
next_action: climb
next_action: climb
next_action: turn_right
next_action: climb
next_action: turn_right
next_action: climb
next_action: turn_right
next_action: climb
next_action: climb
next_action: climb
next_action: turn_right
next_action: climb
next_action: turn_right
next_action: climb
next_action: climb
next_action: turn_right
next_action: climb
next_action: climb
next_action: climb
next_action: turn_right
next_action: climb
next_action: turn_right
next_action: climb
next_action: climb
next_action: climb
next_action: turn_right
next_action: climb
next_action: climb
next_action: climb
next_action: climb
next_action: climb
next_action: climb
next_action: climb
next_action: turn_right
next_action: climb
...

**Evaluate the updated agent (1,000 games), epsilon=0.0**

In [21]:
# epsilon = 0.0

# Number of games:  1000
# avg_reward_per_game: -1.36
# Total number of moves:  1360
# wins/games: 0.000
# games_stopped: 0 (after 200 moves)


n_games = 1000
total_moves = 0
n_games_reward = 0
rewards_list = []
wins = 0
games_stopped = 0 

for i in range(n_games):
    if (i + 1) % 100 == 0:
        print("Game %d/%d ..." % (i + 1, n_games))
    
    agent = DeepQAgent.new_agent(model, 4, 4)
    (env, percept) = Environment.new_game(4, 4, 0.2, True)
    #env.visualize()
    #percept.show()
    total_reward = 0
    num_moves = 0

    #agent.set_epsilon(0.5)
    
    while not percept.is_terminated:
        (agent, next_action) = agent.select_action(percept)
        #print("next_action:", next_action.show())
        #agent.agent_state.show()

        (env, percept) = env.apply_action(next_action)
        #env.visualize()
        #percept.show()
        total_reward += percept.reward
        num_moves += 1
        if num_moves > 199:
            games_stopped += 1
            break
    
    if total_reward > 0: # only if the agent climbs out with gold, it is counted as a win
        wins += 1
    n_games_reward += total_reward
    total_moves += num_moves
    rewards_list.append(total_reward)

print("epsilon =", agent.epsilon)
print("Number of games: ", n_games)
print("Total number of moves: ", total_moves)
print("n_games_reward:", n_games_reward)
print("avg_reward_per_game: %.2f" % (n_games_reward / n_games))
print("wins/games: %.3f" % (wins / n_games))
print("games_stopped:", games_stopped)
print("rewards_list[:100]:", rewards_list[:100])

Game 100/1000 ...
Game 200/1000 ...
Game 300/1000 ...
Game 400/1000 ...
Game 500/1000 ...
Game 600/1000 ...
Game 700/1000 ...
Game 800/1000 ...
Game 900/1000 ...
Game 1000/1000 ...
epsilon = 0.0
Number of games:  1000
Total number of moves:  1360
n_games_reward: -1360
avg_reward_per_game: -1.36
wins/games: 0.000
games_stopped: 0
rewards_list[:100]: [-1, -2, -1, -1, -2, -1, -2, -1, -2, -1, -1, -2, -1, -2, -2, -1, -1, -1, -1, -1, -1, -2, -2, -2, -2, -1, -1, -1, -1, -1, -1, -2, -2, -1, -1, -1, -1, -1, -1, -1, -1, -2, -2, -2, -1, -1, -2, -2, -1, -1, -1, -1, -1, -2, -2, -1, -2, -1, -1, -1, -2, -2, -2, -1, -2, -1, -1, -1, -1, -2, -1, -1, -1, -1, -1, -1, -2, -1, -1, -2, -2, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -2, -1, -1, -1, -1, -1, -2, -1]
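
The average of -1.36 can be read off directly, assuming (as the action trace and the reward list suggest) that every game ended after either one move (`climb`) or two moves (`turn_right`, then `climb`) at a cost of 1 per move. This small check is not part of the original notebook.

```python
# Figures reported by the evaluation cell above.
n_games = 1000
n_games_reward = -1360

# Assumption: each game scored either -1 (climb) or -2 (turn_right, climb).
# With x two-move games: -(n_games - x) - 2 * x = n_games_reward
two_move_games = -n_games_reward - n_games
one_move_games = n_games - two_move_games

print("one-move games:", one_move_games)
print("two-move games:", two_move_games)
print("avg_reward_per_game: %.2f" % (n_games_reward / n_games))
```

Under that assumption, the greedy policy turned right before climbing in roughly a third of the games.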


**Evaluate the updated agent (1,000 games), epsilon=0.5**
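
With `epsilon=0.5`, roughly half of the agent's moves are random. The sketch below shows a generic epsilon-greedy selection over a network that scores (state, action) pairs. It is not the notebook's `DeepQAgent.select_action()` implementation, and `q_value`, `ACTIONS` and this `select_action` are hypothetical names with toy scores.

```python
import random

# Hypothetical action set; the notebook's own action class may differ.
ACTIONS = ["forward", "turn_left", "turn_right", "shoot", "grab", "climb"]


def q_value(state, action):
    """Placeholder for the two-input network: a score for (state, action)."""
    toy_scores = {"climb": 0.9, "turn_right": 0.8}
    return toy_scores.get(action, 0.0)


def select_action(state, epsilon, rng=random):
    """Epsilon-greedy: explore with probability epsilon, otherwise exploit."""
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)  # explore: uniform random action
    # Exploit: score every action for this state and take the best one.
    return max(ACTIONS, key=lambda action: q_value(state, action))


rng = random.Random(0)
greedy = [select_action(None, 0.0, rng) for _ in range(5)]
mixed = [select_action(None, 0.5, rng) for _ in range(1000)]
print(greedy)
print("share of 'climb' at epsilon=0.5:", mixed.count("climb") / len(mixed))
```

At `epsilon=0.0` the toy policy always picks the highest-scoring action, while at `epsilon=0.5` the other actions are chosen often, which is consistent with the much longer and costlier games in the evaluation below.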

In [22]:
# epsilon = 0.5

# Number of games:  1000
# Average total reward per game = -84
# wins/games: 0.003
# games_stopped: 11 (after 200 moves)


n_games = 1000
total_moves = 0
n_games_reward = 0
rewards_list = []
wins = 0
games_stopped = 0

for i in range(n_games):
    if (i + 1) % 100 == 0:
        print("Game %d/%d ..." % (i + 1, n_games))
    
    agent = DeepQAgent.new_agent(model, 4, 4)
    (env, percept) = Environment.new_game(4, 4, 0.2, True)
    #env.visualize()
    #percept.show()
    total_reward = 0
    num_moves = 0

    agent.set_epsilon(0.5)
    
    while not percept.is_terminated:
        (agent, next_action) = agent.select_action(percept)
        #print("next_action:", next_action.show())
        #agent.agent_state.show()

        (env, percept) = env.apply_action(next_action)
        #env.visualize()
        #percept.show()
        total_reward += percept.reward
        num_moves += 1
        if num_moves > 199:
            games_stopped += 1
            break
    
    if total_reward > 0:
        wins += 1
    n_games_reward += total_reward
    total_moves += num_moves
    rewards_list.append(total_reward)

print("epsilon =", agent.epsilon)
print("Number of games:", n_games)
print("Total number of moves:", total_moves)
print("n_games_reward:", n_games_reward)
print("avg_reward_per_game: %.f" % (n_games_reward / n_games))
print("wins/games: %.3f" % (wins / n_games))
print("games_stopped:", games_stopped)
print("rewards_list[:100]:", rewards_list[:100])

Game 100/1000 ...
Game 200/1000 ...
Game 300/1000 ...
Game 400/1000 ...
Game 500/1000 ...
Game 600/1000 ...
Game 700/1000 ...
Game 800/1000 ...
Game 900/1000 ...
Game 1000/1000 ...
epsilon = 0.5
Number of games: 1000
Total number of moves: 8197
n_games_reward: -84477
avg_reward_per_game: -84
wins/games: 0.003
games_stopped: 11
rewards_list[:100]: [-1, -1, -2, -1, -46, -1, -3, -1, -1, -1, -1, -2, -2, -13, -3, -2, -2, -2, -1076, -2, -2, -2, -2, -1036, -2, -2, -14, -12, -13, -2, -1001, -59, -2, -1, -1, -1, -3, -3, -1, -1, -1005, -2, -2, -1, -2, -1001, -1, -2, -1, -1093, -2, -2, -1, -2, -2, -2, -2, -1, -1, -1015, -2, -2, -2, -3, -82, -2, -3, -1, -1, -1, -4, -1, -12, -1, -1, -3, -14, -1, -1100, -1, -1, -5, -1027, -4, -2, -2, -2, -1, -8, -1, -1, -13, -1, -1, -1, -13, -1, -3, -2, -3]
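
The average of -84 is dominated by a few very costly games. The sketch below uses only the first 20 rewards printed above and compares their mean with their median. The -1000 threshold for flagging a likely death is an assumption based on the standard Wumpus World penalty, which this section does not state.

```python
from statistics import mean, median

# First 20 entries of rewards_list from the epsilon=0.5 run above.
rewards = [-1, -1, -2, -1, -46, -1, -3, -1, -1, -1,
           -1, -2, -2, -13, -3, -2, -2, -2, -1076, -2]

# Assumption: a total at or below -1000 means the agent died in that game.
likely_deaths = [r for r in rewards if r <= -1000]

print("mean:   %.2f" % mean(rewards))
print("median: %.1f" % median(rewards))
print("games at or below -1000:", len(likely_deaths))
```

The median stays close to the greedy result, so a typical game is still short; the mean is pulled down by the rare games with very large penalties.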


### Collect new experience data using the updated DeepQAgent and save it as a file (10,000 games)

In [23]:
# 10,000 games
# epsilon = 0.5

# Average total reward per game = -83
# Total number of moves: 73792
# wins/games: 0.002
# games_stopped: 85 (after 200 moves)


import h5py


n_games = 10000
total_moves = 0
n_games_reward = 0
wins = 0
games_stopped = 0  # reset here; otherwise the count carries over from the previous cell
collector = ExperienceCollector()

for i in range(n_games):
    if (i + 1) % 100 == 0:
        print("Game %d/%d ..." % (i + 1, n_games))
    
    agent = DeepQAgent.new_agent(model, 4, 4)
    (env, percept) = Environment.new_game(4, 4, 0.2, True)
    total_reward = 0
    num_moves = 0
    
    agent.set_epsilon(0.5)

    while not percept.is_terminated:
        agent.set_collector(collector)
        (agent, next_action) = agent.select_action(percept)
        (env, percept) = env.apply_action(next_action)
        collector.record_reward(percept.reward) # add reward to collector
        total_reward += percept.reward
        num_moves += 1
        if num_moves > 199:
            games_stopped += 1
            break
        
        #print("Game %d/%d" % (i + 1, n_games))
        #print("Total reward:", total_reward)
        #print("Moves per episode:", num_moves)
    
    if total_reward > 0:
        wins += 1
    n_games_reward += total_reward
    total_moves += num_moves

print("epsilon =", agent.epsilon)
print("Number of games:", n_games)
print("Total number of moves:", total_moves)
print("n_games_reward:", n_games_reward)
print("avg_reward_per_game: %.f" % (n_games_reward / n_games))
print("wins/games: %.3f" % (wins / n_games))
print("games_stopped:", games_stopped)

experience = collector.to_buffer()
with h5py.File('drive/My Drive/q_agent_experience_10_2_1', 'w') as exp_out:
    experience.serialize(exp_out)

print("exp.states.shape:", experience.states.shape)
print("exp.actions.shape:", experience.actions.shape)
print("exp.rewards.shape:", experience.rewards.shape)
print("exp.states[0]:", experience.states[0])
print("exp.actions[0]:", experience.actions[0])
print("exp.rewards[0]:", experience.rewards[0])

Game 100/10000 ...
Game 200/10000 ...
Game 300/10000 ...
Game 400/10000 ...
Game 500/10000 ...
Game 600/10000 ...
Game 700/10000 ...
Game 800/10000 ...
Game 900/10000 ...
Game 1000/10000 ...
Game 1100/10000 ...
Game 1200/10000 ...
Game 1300/10000 ...
Game 1400/10000 ...
Game 1500/10000 ...
Game 1600/10000 ...
Game 1700/10000 ...
Game 1800/10000 ...
Game 1900/10000 ...
Game 2000/10000 ...
Game 2100/10000 ...
Game 2200/10000 ...
Game 2300/10000 ...
Game 2400/10000 ...
Game 2500/10000 ...
Game 2600/10000 ...
Game 2700/10000 ...
Game 2800/10000 ...
Game 2900/10000 ...
Game 3000/10000 ...
Game 3100/10000 ...
Game 3200/10000 ...
Game 3300/10000 ...
Game 3400/10000 ...
Game 3500/10000 ...
Game 3600/10000 ...
Game 3700/10000 ...
Game 3800/10000 ...
Game 3900/10000 ...
Game 4000/10000 ...
Game 4100/10000 ...
Game 4200/10000 ...
Game 4300/10000 ...
Game 4400/10000 ...
Game 4500/10000 ...
Game 4600/10000 ...
Game 4700/10000 ...
Game 4800/10000 ...
Game 4900/10000 ...
Game 5000/10000 ...
Game 5100/10000 ...
...

### Training the agent on the new experience

In [24]:
# Load the Q agent's experience

import h5py

q_agent_experience_10_2_1 = load_experience(h5py.File('drive/My Drive/q_agent_experience_10_2_1', 'r'))
q_agent_experience_10_2_1.states.shape

(73792, 13, 4, 4)

**Train with learning rate = 0.001**

In [25]:
# Run only a single epoch of training, because it is not yet known whether the new experience data is good
# lr=0.001

agent_10_2.train(q_agent_experience_10_2_1, lr=0.001)



In [26]:
# Save the model as "model_10_2_1"

agent_10_2.model.save('drive/My Drive/model_10_2_1')

INFO:tensorflow:Assets written to: drive/My Drive/model_10_2_1/assets


**Evaluating the updated agent**

In [27]:
model = keras.models.load_model('drive/My Drive/model_10_2_1')

In [30]:
# epsilon = 0.0

# Run 5 games to see the actions
# With lr=0.001 the agent keeps choosing the grab action in the start location and never leaves it
# This is worse than before, when it climbed out right away

# max number of moves is set to 20


n_games = 5
total_moves = 0
n_games_reward = 0
rewards_list = []
wins = 0
games_stopped = 0

for i in range(n_games):
    if (i + 1) % 100 == 0:
        print("Game %d/%d ..." % (i + 1, n_games))
    
    agent = DeepQAgent.new_agent(model, 4, 4)
    (env, percept) = Environment.new_game(4, 4, 0.2, True)
    #env.visualize()
    #percept.show()
    total_reward = 0
    num_moves = 0

    #agent.set_epsilon(0.5)
    
    while not percept.is_terminated:
        (agent, next_action) = agent.select_action(percept)
        print("next_action:", next_action.show())
        #agent.agent_state.show()

        (env, percept) = env.apply_action(next_action)
        #env.visualize()
        #percept.show()
        total_reward += percept.reward
        num_moves += 1
        if num_moves > 19:
            games_stopped += 1
            break
    
    if total_reward > 0:
        wins += 1
    n_games_reward += total_reward
    total_moves += num_moves
    rewards_list.append(total_reward)

print("epsilon =", agent.epsilon)
print("Number of games: ", n_games)
print("Total number of moves: ", total_moves)
print("n_games_reward:", n_games_reward)
print("avg_reward_per_game: %.2f" % (n_games_reward / n_games))
print("wins/games: %.3f" % (wins / n_games))
print("games_stopped:", games_stopped)
print("rewards_list[:100]:", rewards_list[:100])

next_action: grab
next_action: grab
next_action: grab
next_action: grab
next_action: grab
next_action: grab
next_action: grab
next_action: grab
next_action: grab
next_action: grab
next_action: grab
next_action: grab
next_action: grab
next_action: grab
next_action: grab
next_action: grab
next_action: grab
next_action: grab
next_action: grab
next_action: grab
next_action: grab
next_action: grab
next_action: grab
next_action: grab
next_action: grab
next_action: grab
next_action: grab
next_action: grab
next_action: grab
next_action: grab
next_action: grab
next_action: grab
next_action: grab
next_action: grab
next_action: grab
next_action: grab
next_action: grab
next_action: grab
next_action: grab
next_action: grab
next_action: grab
next_action: grab
next_action: grab
next_action: grab
next_action: grab
next_action: grab
next_action: grab
next_action: grab
next_action: grab
next_action: grab
next_action: grab
next_action: grab
next_action: grab
next_action: grab
next_action: grab
...

### Load the "model_10_2" and train on the new experience with learning rate = 0.0001 instead of 0.001

In [33]:
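# Reload "model_10_2" (the model before the lr=0.001 update) and train it with a smaller learning rate
model = keras.models.load_model('drive/My Drive/model_10_2')
agent = DeepQAgent.new_agent(model, 4, 4)
agent.train(q_agent_experience_10_2_1, lr=0.0001)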




In [34]:
# epsilon = 0.0

# Run 5 games to see the actions
# In the start location the agent either climbs out at once or turns right and then climbs out
# Training with lr=0.0001 gives a better result than lr=0.001

# max number of moves is set to 20


n_games = 5
total_moves = 0
n_games_reward = 0
rewards_list = []
wins = 0
games_stopped = 0

for i in range(n_games):
    if (i + 1) % 100 == 0:
        print("Game %d/%d ..." % (i + 1, n_games))
    
    agent = DeepQAgent.new_agent(model, 4, 4)
    (env, percept) = Environment.new_game(4, 4, 0.2, True)
    #env.visualize()
    #percept.show()
    total_reward = 0
    num_moves = 0

    #agent.set_epsilon(0.5)
    
    while not percept.is_terminated:
        (agent, next_action) = agent.select_action(percept)
        print("next_action:", next_action.show())
        #agent.agent_state.show()

        (env, percept) = env.apply_action(next_action)
        #env.visualize()
        #percept.show()
        total_reward += percept.reward
        num_moves += 1
        if num_moves > 19:
            games_stopped += 1
            break
    
    if total_reward > 0:
        wins += 1
    n_games_reward += total_reward
    total_moves += num_moves
    rewards_list.append(total_reward)

print("epsilon =", agent.epsilon)
print("Number of games: ", n_games)
print("Total number of moves: ", total_moves)
print("n_games_reward:", n_games_reward)
print("avg_reward_per_game: %.2f" % (n_games_reward / n_games))
print("wins/games: %.3f" % (wins / n_games))
print("games_stopped:", games_stopped)
print("rewards_list[:100]:", rewards_list[:100])

next_action: climb
next_action: turn_right
next_action: climb
next_action: climb
next_action: climb
next_action: turn_right
next_action: climb
epsilon = 0.0
Number of games:  5
Total number of moves:  7
n_games_reward: -7
avg_reward_per_game: -1.40
wins/games: 0.000
games_stopped: 0
rewards_list[:100]: [-1, -2, -1, -1, -2]


### Save this model and evaluate it at epsilon = 0.5

In [35]:
# Save the model as the "model_10_2_2"

model.save('drive/My Drive/model_10_2_2')

INFO:tensorflow:Assets written to: drive/My Drive/model_10_2_2/assets


In [36]:
# epsilon = 0.5

# Number of games: 1000
# avg_reward_per_game: -86
# Total number of moves: 7820
# wins/games: 0.001
# games_stopped: 8 (after 200 moves)


n_games = 1000
total_moves = 0
n_games_reward = 0
rewards_list = []
wins = 0
games_stopped = 0

for n in range(n_games):
    if (n + 1) % 100 == 0:
        print("Game %d/%d ..." % (n + 1, n_games))
    
    agent = DeepQAgent.new_agent(model, 4, 4)
    (env, percept) = Environment.new_game(4, 4, 0.2, True)
    #env.visualize()
    #percept.show()
    total_reward = 0
    num_moves = 0

    agent.set_epsilon(0.5)
    
    while not percept.is_terminated:
        (agent, next_action) = agent.select_action(percept)
        #print("next_action:", next_action.show())
        #agent.agent_state.show()

        (env, percept) = env.apply_action(next_action)
        #env.visualize()
        #percept.show()
        total_reward += percept.reward
        num_moves += 1
        if num_moves > 199:
            games_stopped += 1
            break
    
    if total_reward > 0:
        wins += 1
    n_games_reward += total_reward
    total_moves += num_moves
    rewards_list.append(total_reward)

print("epsilon =", agent.epsilon)
print("Number of games: ", n_games)
print("Total number of moves: ", total_moves)
print("n_games_reward:", n_games_reward)
print("avg_reward_per_game: %.f" % (n_games_reward / n_games))
print("wins/games: %.3f" % (wins / n_games))
print("games_stopped:", games_stopped)
print("rewards_list[:100]:", rewards_list[:100])

Game 100/1000 ...
Game 200/1000 ...
Game 300/1000 ...
Game 400/1000 ...
Game 500/1000 ...
Game 600/1000 ...
Game 700/1000 ...
Game 800/1000 ...
Game 900/1000 ...
Game 1000/1000 ...
epsilon = 0.5
Number of games:  1000
Total number of moves:  7820
n_games_reward: -85840
avg_reward_per_game: -86
wins/games: 0.001
games_stopped: 8
rewards_list[:100]: [-1, -2, -1, -2, -3, -1, -2, -1, -1, -29, -2, -2, -2, -2, -13, -12, -1, -1002, -3, -2, -12, -14, -2, -4, -2, -1, -1, -1, -7, -1, -1, -1, -15, -2, -1, -1, -27, -2, -1, -12, -1, -1, -199, -1, -2, -2, -1, -1, -1, -2, -1, -2, -2, -13, -1, -1158, -14, -2, -74, -3, -1, -12, -1, -1, -1006, -15, -2, -2, -148, -3, -1, -9, -1, -2, -1, -12, -12, -1, -2, -1053, -1, -1, -3, -210, -13, -50, -1035, -1, -3, -12, -14, -2, -2, -2, -4, -4, -2, -2, -2, -4]


### Evaluate the agent at epsilon = 0.0

In [38]:
# epsilon = 0.0

# Number of games: 1000
# avg_reward_per_game: -1.43
# Total number of moves: 1427
# wins/games: 0.000
# games_stopped: 0 (after 200 moves)


n_games = 1000
total_moves = 0
n_games_reward = 0
rewards_list = []
wins = 0
games_stopped = 0

for n in range(n_games):
    if (n + 1) % 100 == 0:
        print("Game %d/%d ..." % (n + 1, n_games))
    
    agent = DeepQAgent.new_agent(model, 4, 4)
    (env, percept) = Environment.new_game(4, 4, 0.2, True)
    #env.visualize()
    #percept.show()
    total_reward = 0
    num_moves = 0

    #agent.set_epsilon(0.5)
    
    while not percept.is_terminated:
        (agent, next_action) = agent.select_action(percept)
        #print("next_action:", next_action.show())
        #agent.agent_state.show()

        (env, percept) = env.apply_action(next_action)
        #env.visualize()
        #percept.show()
        total_reward += percept.reward
        num_moves += 1
        if num_moves > 199:
            games_stopped += 1
            break
    
    if total_reward > 0:
        wins += 1
    n_games_reward += total_reward
    total_moves += num_moves
    rewards_list.append(total_reward)

print("epsilon =", agent.epsilon)
print("Number of games: ", n_games)
print("Total number of moves: ", total_moves)
print("n_games_reward:", n_games_reward)
print("avg_reward_per_game: %.2f" % (n_games_reward / n_games))
print("wins/games: %.3f" % (wins / n_games))
print("games_stopped:", games_stopped)
print("rewards_list[:100]:", rewards_list[:100])

Game 100/1000 ...
Game 200/1000 ...
Game 300/1000 ...
Game 400/1000 ...
Game 500/1000 ...
Game 600/1000 ...
Game 700/1000 ...
Game 800/1000 ...
Game 900/1000 ...
Game 1000/1000 ...
epsilon = 0.0
Number of games:  1000
Total number of moves:  1427
n_games_reward: -1427
avg_reward_per_game: -1.43
wins/games: 0.000
games_stopped: 0
rewards_list[:100]: [-1, -1, -2, -1, -1, -1, -1, -1, -1, -2, -2, -2, -1, -1, -1, -2, -1, -2, -2, -1, -2, -1, -1, -2, -1, -2, -2, -1, -1, -1, -2, -1, -1, -2, -1, -1, -1, -1, -2, -2, -1, -1, -2, -2, -1, -1, -2, -1, -1, -2, -1, -1, -1, -1, -1, -2, -1, -1, -1, -1, -1, -2, -2, -1, -1, -2, -2, -1, -2, -2, -2, -2, -1, -2, -2, -1, -1, -2, -1, -2, -1, -1, -1, -1, -1, -1, -2, -1, -2, -2, -2, -1, -1, -2, -2, -1, -1, -2, -1, -1]


### Conclusions


- **The DeepQAgent trained on the probabilistic agent's experience data learned only to climb out of the cave without the gold**


- **The average score per game was about -85 with epsilon=0.5 and about -1.4 with epsilon=0.0. The win rate was very low: 0.3% and 0.1% in the two epsilon=0.5 evaluations, and 0% with epsilon=0.0. For comparison, the probabilistic agent achieved an average score of 266 and won 40% of its games (it reached the highest reward quite often)**


- **The agent needs further training for improvement**


- **The network and DeepQAgent can be used for larger grids**