# Navigating a 3D grid with an obstacle by reinforcement learning (continuous action space)

## Salient features:
1) Custom Gym environment
2) Training a reinforcement learning agent using Stable-Baselines3 (`stable_baselines3`)

## Important links
Stable-Baselines: https://github.com/hill-a/stable-baselines

Documentation: https://stable-baselines.readthedocs.io/en/master/

RL Baselines zoo: https://github.com/araffin/rl-baselines-zoo

In [2]:
import gym
import numpy as np
from gym import spaces
from stable_baselines3 import A2C, DDPG, SAC
from stable_baselines3.common.env_checker import check_env
from stable_baselines3.common.env_util import make_vec_env

## The gym interface

The gym interface is built around three methods:
- `reset()`: called at the beginning of an episode; it returns an observation.
- `step(action)`: called to take an action in the environment; it returns the next observation, the immediate reward, whether the episode is over, and additional information.
- (Optional) `render(mode='human')`: visualizes the agent in action. Graphical rendering does not work on Google Colab, so it cannot be used directly there; use `mode='rgb_array'` to retrieve an image of the scene instead.

Under the hood, it also contains two useful properties:
- `observation_space`, which is one of the gym spaces (`Discrete`, `Box`, ...) and describes the type and shape of the observation
- `action_space`, which is also a gym space object and describes the type of action that can be taken

[Documentation on custom env](https://stable-baselines.readthedocs.io/en/master/guide/custom_env.html)
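
The `reset()`/`step()` contract described above can be illustrated with a minimal sketch. `LineWorld` and `run_episode` are hypothetical names introduced only for this illustration; `LineWorld` is a plain Python class that mimics the return signature of the two methods and does not inherit from `gym.Env`.

```python
import numpy as np


class LineWorld:
    """Hypothetical stand-in that mimics the reset/step contract (not a gym.Env)."""

    def __init__(self, length=5):
        self.length = length
        self.pos = 0

    def reset(self):
        # Start every episode at the left end and return the first observation
        self.pos = 0
        return np.array([self.pos], dtype=np.float32)

    def step(self, action):
        # action: +1 moves right, -1 moves left; the position stays on the line
        self.pos = int(np.clip(self.pos + action, 0, self.length - 1))
        done = self.pos == self.length - 1
        reward = 1.0 if done else -1.0
        info = {}
        return np.array([self.pos], dtype=np.float32), reward, done, info


def run_episode(env, policy, max_steps=100):
    """Roll out one episode and return the total reward and the number of steps."""
    obs = env.reset()
    total_reward = 0.0
    for t in range(max_steps):
        obs, reward, done, info = env.step(policy(obs))
        total_reward += reward
        if done:
            return total_reward, t + 1
    return total_reward, max_steps


# A policy that always moves right reaches the end of a 5-cell line in 4 steps
total, steps = run_episode(LineWorld(length=5), policy=lambda obs: 1)
print(total, steps)
```

The `max_steps` cap is a design choice of this sketch: without it, a policy that never reaches the terminal state would loop forever.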

## Gym environment

In [3]:
class ReachEndEnv(gym.Env):
    """
    This is a simple env where the agent must learn to reach the destination with obstacles in between. 
    """
    metadata = {'render.modes': ['console']}

    def __init__(self, grid_size=4, block_low=1, block_high=2):
        super(ReachEndEnv, self).__init__()

        # Size of the 3D grid
        self.grid_size = grid_size
        # Initialize the agent at the origin (0, 0, 0)
        self.initial_pos = np.zeros((3,), dtype=np.float32)
        self.agent_pos = np.zeros((3,), dtype=np.float32)
        # Blocked region: the cube [block_low, block_high] along each axis
        self.block_low = block_low
        self.block_high = block_high
        # Bounds on each component of the continuous action
        self.act_low = -1.0
        self.act_high = 1.0
        # Define action and observation space
        # They must be gym.spaces objects
        # The action is a continuous displacement along each of the three axes
        self.action_space = spaces.Box(low=self.act_low, high=self.act_high, shape=(3,), dtype=np.float32)
        # The observation is the (continuous) coordinate of the agent,
        # described by a Box space
        self.observation_space = spaces.Box(low=0, high=self.grid_size, shape=(3,), dtype=np.float32)
        # Flag set when the goal is reached
        self.endcheck = False

    def reset(self):
        """
        Important: the observation must be a numpy array
        :return: (np.array)
        """
        # Initialize the agent at the origin (0, 0, 0)
        self.agent_pos = self.initial_pos.copy()
        # Return a copy so callers cannot modify the internal state
        return self.agent_pos.copy()

    def step(self, action):
        action = np.asarray(action, dtype=np.float32)
        # Validate the action before applying it
        if np.any(action < self.act_low) or np.any(action > self.act_high):
            raise ValueError("Received invalid action={} which is not part of the action space".format(action))

        # Move the agent; a new array is created so that previously
        # returned observations are not modified in place
        self.agent_pos = self.agent_pos + action

        # Account for the boundaries of the grid
        self.agent_pos = np.clip(self.agent_pos, 0, self.grid_size-1)

        # Are we at the end (the corner farthest from the origin)?
        goal = self.grid_size - 1
        done = bool(np.all(self.agent_pos == goal))

        # Is the agent inside the blocked region?
        in_block = bool(np.all((self.agent_pos >= self.block_low) & (self.agent_pos <= self.block_high)))

        # Reward: +1 at the goal, -10 inside the blocked region, -1 elsewhere
        if done:
            reward = 1
        elif in_block:
            reward = -10
        else:
            reward = -1

        # Optionally we can pass additional info, we are not using that for now
        info = {}
        self.endcheck = done

        return self.agent_pos, reward, done, info

    def render(self, mode='console'):
        if mode != 'console':
            raise NotImplementedError()
        # Console rendering: print the agent's current position
        print("Agent position:", self.agent_pos)

    def close(self):
        pass
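
The dynamics and reward rules of `step` can also be read as a pure function, which makes them easy to check by hand. The sketch below is an illustration only: `transition` is a hypothetical helper, not part of the environment, and its defaults assume the `grid_size=10`, `block_low=3`, `block_high=6` configuration used for training in this notebook.

```python
import numpy as np


def transition(pos, action, grid_size=10, block_low=3, block_high=6):
    """Pure-function sketch of one step: returns (new_pos, reward, done)."""
    goal = grid_size - 1
    # Move, then clip so the agent stays inside [0, grid_size - 1] on each axis
    new_pos = np.clip(np.asarray(pos, dtype=np.float32) + action, 0, goal)
    at_goal = bool(np.all(new_pos == goal))
    in_block = bool(np.all((new_pos >= block_low) & (new_pos <= block_high)))
    if at_goal:
        reward = 1
    elif in_block:
        reward = -10
    else:
        reward = -1
    return new_pos, reward, at_goal


ones = np.ones(3, dtype=np.float32)
print(transition([8.5, 8.5, 8.5], ones))  # clipped onto the goal corner: reward 1
print(transition([2.5, 3.0, 4.0], ones))  # lands inside the blocked cube: reward -10
print(transition([0.0, 0.0, 0.0], ones))  # ordinary step: reward -1
```

Because the position is clipped to `grid_size - 1`, the exact equality test against the goal can succeed even with continuous actions: any step that overshoots the corner on all three axes is clipped onto it.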


### Validate the environment

Stable Baselines provides a [helper](https://stable-baselines.readthedocs.io/en/master/common/env_checker.html) to check that your environment follows the Gym interface. It also optionally checks that the environment is compatible with Stable-Baselines (and emits warnings if necessary).

In [4]:
env = ReachEndEnv()
# If the environment doesn't follow the interface, an error will be raised
check_env(env, warn=True)

In [5]:
# Instantiate the env
env = ReachEndEnv(grid_size=10,block_low=3,block_high=6)
# wrap it
env = make_vec_env(lambda: env, n_envs=1)

In [6]:
# Train the agent
# Using Deep Deterministic Policy Gradient (DDPG)
model = DDPG('MlpPolicy', env, train_freq=4, tensorboard_log='./log_files/', verbose=1).learn(500000)

Using cuda device
Logging to ./log_files/DDPG_1
----------------------------------
| rollout/           |           |
|    ep_len_mean     | 8.45e+04  |
|    ep_rew_mean     | -8.46e+04 |
| time/              |           |
|    episodes        | 4         |
|    fps             | 183       |
|    time_elapsed    | 1839      |
|    total_timesteps | 338001    |
| train/             |           |
|    actor_loss      | 95.2      |
|    critic_loss     | 0.0459    |
|    learning_rate   | 0.001     |
|    n_updates       | 337900    |
----------------------------------
----------------------------------
| rollout/           |           |
|    ep_len_mean     | 4.23e+04  |
|    ep_rew_mean     | -4.23e+04 |
| time/              |           |
|    episodes        | 8         |
|    fps             | 183       |
|    time_elapsed    | 1839      |
|    total_timesteps | 338055    |
| train/             |           |
|    actor_loss      | 95.1      |
|    critic_loss     | 0.188     |
|    le

In [7]:
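# Test the trained agent
obs = env.reset()
trajectory = [obs[0].copy()]  # visited positions, starting with the initial one
n_steps = 1000
for step in range(n_steps):
    action, _ = model.predict(obs, deterministic=True)
    print("Step {}".format(step + 1))
    print("Action: ", action)
    obs, reward, done, info = env.step(action)
    print('obs=', obs, 'reward=', reward, 'done=', done)
    # env.render(mode='console')
    if done[0]:
        # The VecEnv resets automatically when done is signalled, so `obs`
        # already holds the reset position; the final position of the episode
        # is stored in info[0]['terminal_observation']
        trajectory.append(info[0]['terminal_observation'])
        print("Goal reached!", "reward=", reward)
        break
    trajectory.append(obs[0].copy())
# One row per visited position (replaces the fixed 14-row buffer, which overflows on longer episodes)
OBS = np.array(trajectory)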

Step 1
Action:  [[0.9999958 1.        1.       ]]
obs= [[0.9999958 1.        1.       ]] reward= [-1.] done= [False]
Step 2
Action:  [[ 0.97656536  1.         -0.04836106]]
obs= [[1.9765612  2.         0.95163894]] reward= [-1.] done= [False]
Step 3
Action:  [[ 0.99939346  1.         -0.04262352]]
obs= [[2.9759545 3.        0.9090154]] reward= [-1.] done= [False]
Step 4
Action:  [[0.35319114 1.         0.29385602]]
obs= [[3.3291457 4.        1.2028714]] reward= [-1.] done= [False]
Step 5
Action:  [[0.38485265 1.         0.87581825]]
obs= [[3.7139983 5.        2.0786896]] reward= [-1.] done= [False]
Step 6
Action:  [[0.81266344 1.         0.83166504]]
obs= [[4.526662  6.        2.9103546]] reward= [-1.] done= [False]
Step 7
Action:  [[0.987561   1.         0.99249506]]
obs= [[5.514223  7.        3.9028497]] reward= [-1.] done= [False]
Step 8
Action:  [[0.9864454 1.        1.       ]]
obs= [[6.5006685 8.        4.9028497]] reward= [-1.] done= [False]
Step 9
Action:  [[0.9694021  0.589846
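
As a quick sanity check on the rollout printed above, the visited positions can be tested against the blocked cube. This is a minimal sketch: the coordinates are copied from steps 1 to 8 of the printed output (the log is cut off after that), the starting point is the origin returned by `reset()`, and `inside` and `path_length` are names introduced here for illustration.

```python
import numpy as np

# Start position plus the observations printed for steps 1-8
trajectory = np.array([
    [0.0, 0.0, 0.0],
    [0.9999958, 1.0, 1.0],
    [1.9765612, 2.0, 0.95163894],
    [2.9759545, 3.0, 0.9090154],
    [3.3291457, 4.0, 1.2028714],
    [3.7139983, 5.0, 2.0786896],
    [4.526662, 6.0, 2.9103546],
    [5.514223, 7.0, 3.9028497],
    [6.5006685, 8.0, 4.9028497],
])

block_low, block_high = 3, 6  # values passed to ReachEndEnv in this notebook

# A position is blocked only if all three coordinates lie in [block_low, block_high]
inside = np.all((trajectory >= block_low) & (trajectory <= block_high), axis=1)
# Euclidean length of the path travelled so far
path_length = np.linalg.norm(np.diff(trajectory, axis=0), axis=1).sum()

print("steps inside the blocked cube:", int(inside.sum()))
print("path length so far:", round(float(path_length), 2))
```

None of these positions falls inside the blocked cube, which agrees with the reward of -1 logged at every one of these steps.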

In [8]:
# Save the model
model.save('3d_grid_obstacle_continuous10x10')

In [5]:
# Load the model
# model = DDPG.load('3d_grid_obstacle_continuous10x10')

In [9]:
# Save the results
from scipy import io
io.savemat('3d_grid_obstacle.mat',{'OBS':OBS})