# Lab 01 - Exit the dungeon!

In this first lab, we will create by hand our first Reinforcement Learning environment.
A lot of agents will be harmed in the process of solving the lab.

## The environment

The environment is a NxN array of integers. 
Each cell of this environment can have the following values:
- 0 : empty cell
- 1 : obstacle, non-traversable
- 2 : lava
- 3 : exit

All border cells are obstacles.
Upon initialization, the environment has:
- N/2 obstacles placed randomly in the maze.
- N/2 lava cells placed randomly in the cell.

## The game

The agent starts in a random empty cell, and has to reach the exit.
The exit is randomly positioned in an other empty cell.

At each timestep:
- the agent decides on an action (move up, left, right or down)
- the action is sent to the environment
- the environment sends back observations, rewards and a boolean that indicates whether the environment terminated.

The environment terminates if the agent reaches the exit, or if the environement reaches a time limit of N^2 timesteps.

## Observations

The agent receives a dictionary of observations:
- target: relative coordinates of the exit 
- proximity: a 3x3 array that encodes for the value of the cells around the agent.

## Rewards

When acting, an agent receives a reward depending on the cell it ends up on:
- if the agent moves towards an obstacle, it gets a reward of -5 and stays at its original position
- if the agent is on a lava cell after its action, it receives a reward of -20
- at each timestep, the agent receives an additional reward of -1
- when the agent reaches the goal, it receives a reward of N**2


In [1]:
import numpy as np
import matplotlib.pyplot as plt
from random import randrange

# Part 1 - Defining the environment.

We will define the environment as a class.
We are providing pseudo code which is incomplete and probably not completely error-free.

You have to fill the blanks.
We advise you to look at the pseudo-code for Part 2 and 3 to have an idea of how things work together.

In order to make sure that your environment runs as intended, you will create a display function.

In [256]:
class Dungeon:
    
    def __init__(self, N):

        # The environment is a NxN array of integers. Each cell of this environment can have the following values:
        self.N = N
        self.empty = 0
        self.obstacle = 1
        self.lava = 2
        self.exit = 3
        self.agent = 5
        
        self.reward = 0
        
        # Numpy array that holds the information about the environment
        self.dungeon = self.create_dungeon(N)
        
        # position of the agent and exit will be decided by resetting the environment.
        self.position_agent = None
        self.position_exit = None
        
        # run time
        self.time_elapsed = 0
        self.time_limit = N*N
        self.done = False
        
        self.observations = {'up':None,'down':None,'left':None,'right':None}
        
    def create_dungeon(self,N):
        num_obs = int(N/2)
        num_lava = int(N/2)
        dungeon = np.zeros((N, N))
        dungeon[:,[0,-1]] = dungeon[[0,-1]] = 1
        
        #add obstacle and lavas
        while num_obs > 0:
            obs_rnd_x,obs_rnd_y = randrange(1,N-1),randrange(1,N-1)
            if dungeon[obs_rnd_x][obs_rnd_x] == 0:
                dungeon[obs_rnd_x][obs_rnd_x] = self.obstacle
                num_obs -= 1
        while num_lava > 0:
            lava_rnd_x,lava_rnd_y = randrange(1,N-1),randrange(1,N-1)
            if dungeon[lava_rnd_x][lava_rnd_y] == 0:
                dungeon[lava_rnd_x][lava_rnd_y] = self.lava
                num_lava -= 1                

        #print(dungeon)
        return dungeon
        
        
    def step(self, action):
        # action is 'up', 'down', 'left', or 'right'
        actions = {'up':(-1,0),'down':(1,0),'left':(0,-1),'right':(0,1)}
        
        Y_move = actions[action][0]
        X_move = actions[action][1]
        
        
        # modify the position of the agent
        if ((self.position_agent[0]+Y_move) not in [0,self.N] and (self.position_agent[1]+X_move) not in [0,self.N] and
            ((self.dungeon[self.position_agent[0]+Y_move][self.position_agent[1]+X_move] == self.empty) or 
              self.dungeon[self.position_agent[0]+Y_move][self.position_agent[1]+X_move] == self.exit)):
            
            self.dungeon[self.position_agent[0]][self.position_agent[1]] = self.empty
            
            self.position_agent[0] += Y_move
            self.position_agent[1] += X_move
            #print(self.position_agent)
            self.dungeon[self.position_agent[0]][self.position_agent[1]] = self.agent
        
        # calculate total reward
        rewards = {1:-5,2:-20,0:-1,3:(self.N**2),5:-1}
        curr_pos = self.position_agent
        curr_status = self.dungeon[curr_pos[0]+Y_move][curr_pos[1]+X_move]
        self.reward = rewards[int(curr_status)]
        #print(self.reward)
        
        # calculate observations
        self.observations['up'] = self.dungeon[self.position_agent[0]-1][self.position_agent[1]]
        self.observations['down'] = self.dungeon[self.position_agent[0]+1][self.position_agent[1]]
        self.observations['left'] = self.dungeon[self.position_agent[0]][self.position_agent[1]-1]
        self.observations['right'] = self.dungeon[self.position_agent[0]][self.position_agent[1]+1]
        
        # update time
        self.time_elapsed += 1
        #print('time use',self.time_elapsed)
            
        # verify termination condition
        if self.time_elapsed == self.time_limit or (self.position_agent[0] == self.position_exit[0] and self.position_agent[1] == self.position_exit[1]):
            self.done = True
        
        return self.observations, self.reward, self.done
    
    def display(self):
        # prints the environment
        print(self.dungeon)
        # ...
        
    def reset(self):
        """
        This function resets the environment to its original state (time = 0).
        Then it places the agent and exit at new random locations.
        
        It is common practice to return the observations, 
        so that the agent can decide on the first action right after the resetting of the environment.
        
        """
        self.time_elapsed = 0
        self.time_limit = self.N**2  
        
        # position of the agent is a numpy array    
        while True: # 记得修改 X Y 顺序 ####################################################################
            agent_rnd_x,agent_rnd_y = randrange(1,(self.N-1)),randrange(1,(self.N-1))
            if self.dungeon[agent_rnd_x][agent_rnd_y] == 0:
                self.dungeon[agent_rnd_x][agent_rnd_y] = self.agent
                self.position_agent = np.array([agent_rnd_x,agent_rnd_y]) #get the agent XY point
                print('agnt',self.position_agent)
                break
        
        # position of the exit is a numpy array    
        while True:
            exit_rnd_x,exit_rnd_y = randrange(1,(self.N-1)),randrange(1,(self.N-1))
            if self.dungeon[exit_rnd_x][exit_rnd_y] == 0:
                self.dungeon[exit_rnd_x][exit_rnd_y] = self.exit
                self.position_exit = np.array([exit_rnd_x,exit_rnd_y]) #get the exit XY point
                print('exit',self.position_exit)
                break
                
        # Calculate observations
        self.observations['up'] = self.dungeon[self.position_agent[0]-1][self.position_agent[1]]
        self.observations['down'] = self.dungeon[self.position_agent[0]+1][self.position_agent[1]]
        self.observations['left'] = self.dungeon[self.position_agent[0]][self.position_agent[1]-1]
        self.observations['right'] = self.dungeon[self.position_agent[0]][self.position_agent[1]+1]
        #print(self.observations)
        
        return self.observations

In [231]:
# dungeon = Dungeon(10)
# dungeon.reset()
# dungeon.display()

# a = random_policy(a)
# obs, reward, done = dungeon.step(a)
# print(a)
# print(dungeon.position_agent,obs, reward, done)
# dungeon.display()

# Part 2 - Defining a policy

##### A policy tells the agent how to act depending on its current observation and internal beliefs.

As a first simple case, we will define policy as a function that maps observations to actions.

As your agent is stupid and doesn't have any way of learning what to do, in this first lab we will write by hand the policy.
Try to come up with a strategy to terminate the game with the maximum reward.

We advise you to start with a very simple policy, then maybe try a random policy, and finally an 'intelligent' policy.


In [232]:
# def basic_policy(observation):
    
#     ...
    
#     return action

def random_policy(observation):
    actions = ['up','down','left','right']
    action = np.random.choice(actions)
    return action
    
# def intelligent_policy(observation):
#     ...
    

# Part 3 - Evaluating your policy

Now that you have the environment and policies, you can simulate runs of your games under different policies and evaluate the reward that particular policies will get upon termination of the environment. 

To that effect, we will create a function run_single_experiment, which will have as input:
- an instance of an environment
- a policy

And it will return the reward obtained once the environment terminates.


In [255]:
def run_single_exp(envir, policy):
    
    obs = envir.reset()
    envir.display()
    done = False
    total_reward = 0
    
    while not done:
        action = policy(obs)
        obs, reward, done = dungeon.step(action)
        total_reward += reward
        #print(reward,total_reward,done)
    envir.display()
    return total_reward
    
    
dungeon = Dungeon(10)
run_single_exp(dungeon, random_policy)
# print(total_reward)

agnt [3 7]
exit [4 5]
[[1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
 [1. 0. 2. 0. 0. 0. 2. 0. 0. 1.]
 [1. 0. 0. 0. 0. 0. 0. 0. 0. 1.]
 [1. 0. 0. 1. 0. 0. 0. 5. 0. 1.]
 [1. 0. 0. 0. 1. 3. 0. 0. 0. 1.]
 [1. 0. 0. 0. 0. 1. 0. 0. 0. 1.]
 [1. 0. 0. 0. 0. 2. 1. 0. 0. 1.]
 [1. 0. 0. 0. 0. 0. 0. 0. 0. 1.]
 [1. 2. 0. 0. 0. 0. 2. 0. 1. 1.]
 [1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]]
time use 1
time use 2
time use 3
time use 4
time use 5
time use 6
time use 7
time use 8
time use 9
time use 10
time use 11
[[1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
 [1. 0. 2. 0. 0. 0. 2. 0. 0. 1.]
 [1. 0. 0. 0. 0. 0. 0. 0. 0. 1.]
 [1. 0. 0. 1. 0. 0. 0. 0. 0. 1.]
 [1. 0. 0. 0. 1. 5. 0. 0. 0. 1.]
 [1. 0. 0. 0. 0. 1. 0. 0. 0. 1.]
 [1. 0. 0. 0. 0. 2. 1. 0. 0. 1.]
 [1. 0. 0. 0. 0. 0. 0. 0. 0. 1.]
 [1. 2. 0. 0. 0. 0. 2. 0. 1. 1.]
 [1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]]


63

# Part 3 - Evaluating your policy

Now that you can evaluate how a policy performs on a particular environment, consider the following.
Because of stochasticity of initial agent position and exit position, different runs will lead to different total rewards.

To properly evaluate our policies, we must calculate the statistics over multiple runs.

To that effect, we will create a function run_experiments, which will have as input:
- an instance of an environment
- a policy
- a number of times that the experiment will be run

It will return the maximum reward obtained over all the runs, the average and variance over the rewards.


In [213]:
def run_experiments(envir, policy, number_exp):
    
    all_rewards = []
    
    for n in range(number_exp):
        
        final_reward = run_single_exp(envir, policy)
        all_rewards.append(final_reward)
    
    max_reward = 
    mean_reward = 
    var_reward = 
    
    return max_reward, mean_reward, var_reward
    

SyntaxError: invalid syntax (<ipython-input-213-55d0606c5724>, line 10)

# Part 4

Draw some plots to compare how your different policies perform depending on the environment size.

As the environment generation is also stochastic (random obstacles and lava), you might need to compute additional statistics.
