## Introduction to Flatland for Multi-Agent Reinforcement Learning on Railway Environments  

In this notebook, we will learn how to:  
   * Create Flatland railway environments
   * Use TensorFlow to build a neural network for a reinforcement learning agent
   * Take actions and visualize agents in the environment
   
The main aim is to introduce reinforcement learning concepts and understand different parts of the problem.  
In the next notebook, we will introduce a framework to solve problems at scale.

In [None]:
import numpy as np
from flatland.envs.observations import GlobalObsForRailEnv
from flatland.envs.rail_env import RailEnv
from environments.custom_rail_generator import simple_rail_generator
from environments.custom_schedule_generator import sparse_schedule_generator
from environments.env_utils import env_from_env_config
from environments.observations import TreeObsForRailEnv
from environments.visualization_utils import animate_env, get_patch, render_env

# Docker
# Start virtual display before importing RenderTool
from pyvirtualdisplay import Display
display = Display(visible=0, size=(1024, 768))
display.start()

from flatland.utils.rendertools import RenderTool, AgentRenderVariant

## Environment Setup

Here we generate a random environment for each episode, with characteristics specified by the environment config, `env_config`.

In [None]:
# Environment setup
n_trains = 3
env_config = {
    "obs_config": {"max_depth": 2},
    "rail_generator": "complex_rail_generator",
    "rail_config": {"nr_start_goal": 12, "nr_extra": 0, "min_dist": 8, "seed": 10},
    "width": 8,
    "height": 8,
    "number_of_agents": n_trains,
    "schedule_generator": "complex_schedule_generator",
    "schedule_config": {},
    "frozen": False,
    "remove_agents_at_target": True,
    "wait_for_all_done": False
}
env = env_from_env_config(env_config)

observation_space = env.observation_space.shape[0]
n_actions = env.action_space.n

* In the previous section, we discussed how the Flatland observation space is arranged in a tree structure.  
* While this is convenient for scalable architectures, we need to supply inputs to our neural networks in vector format.  
* To achieve this, we use a preprocessor to transform each tree observation into the flat vector representation required for training.
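
To make the tree-to-vector idea concrete, here is a minimal sketch of flattening a nested observation into a fixed-length vector. The dictionary layout, the branch labels, the two features per node, and the choice to map missing or infinite values to zero are all assumptions made for illustration; the course's `TreeObsPreprocessor` may normalise and order features differently.

```python
import math

N_FEATURES = 2              # assumed number of features per node (toy value)
BRANCHES = ("L", "F", "R")  # assumed branch labels: left, forward, right


def flatten_tree(node, depth):
    """Depth-first flattening of a hypothetical observation tree.

    `node` is a dict {"features": [...], "children": {...}} or None.
    Missing subtrees are zero-padded so the output length is always the same.
    """
    if node is None:
        # Number of nodes in a full subtree of the remaining depth
        n_nodes = sum(len(BRANCHES) ** d for d in range(depth + 1))
        return [0.0] * (N_FEATURES * n_nodes)

    # Replace infinite placeholder values with zeros (an assumption)
    vec = [0.0 if math.isinf(f) else float(f) for f in node["features"]]
    if depth > 0:
        for branch in BRANCHES:
            vec += flatten_tree(node["children"].get(branch), depth - 1)
    return vec


toy_tree = {
    "features": [3.0, 1.0],
    "children": {"F": {"features": [2.0, math.inf], "children": {}}},
}
vector = flatten_tree(toy_tree, depth=1)
print(len(vector), vector)
```

Because every missing branch is padded, observations from different trains always have the same length and can be concatenated into one network input.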

In [None]:
from environments.preprocessor import TreeObsPreprocessor
from gym.spaces import Box

preprocessor = TreeObsPreprocessor(Box(low=-np.inf, high=np.inf, shape=(observation_space,)), 
                                       {"custom_options": {"tree_depth": 2, "observation_radius": 0}})

def preprocess(observation):
    """
    Tree-like observations --> vector of observations to feed as input to our neural network
    """
    observation = [preprocessor.transform(o) for o in list(observation.values())]
    return np.concatenate(observation, axis=-1)  

# Neural Network

We are going to use TensorFlow to implement a neural network.  
* We will receive observations $o_t$ of the state of the environment, $s_t$ at time $t$
* We will use the neural network as a function approximator to map $o_t$ to the state-action value $Q(o_t, a_t)$
* With this, we can define a policy for the agent to take actions in the environment based on state-action value estimation

In [None]:
import tensorflow as tf
from tensorflow.keras import initializers, losses, Model, optimizers
from tensorflow.keras.layers import Dense, Flatten, Reshape

* Modify the code in the cell below to implement your own neural network
* You can add hidden layers by prepending entries to the `hiddens` list (each entry is the number of units in that layer), keeping the final entry as the output layer
* Remember to specify `activations` for each layer
* You might also like to consider a different architecture to a fully connected neural network

In [None]:
class TFModel(Model):
    def __init__(self, n_actions, n_trains):
        super(TFModel, self).__init__()
        self.flatten = Flatten()
        
        # TODO: prepend some layers
        hiddens = [n_actions**n_trains]
        activations = [None]
        
        self.dense_layers = [
            Dense(
                h,
                name=f"fc_{k}",
                activation=activation,
                kernel_initializer=initializers.glorot_normal(),
                bias_initializer=initializers.constant(0.1),
            )
            for k, (h, activation) in enumerate(
                zip(hiddens, activations)
            )
        ]
        
    def call(self, x):
        x = self.flatten(x)
        for layer in self.dense_layers:
            x = layer(x)
            
        return x

## DQN Introduction

For this demonstration of centralized control, we will make use of a simple DQN agent.  

DQN is a value-based algorithm built on Q-learning, which also makes use of frozen target networks and replay buffers.
  * The target networks improve learning stability by preventing us from having to train on a moving target.
  * Trajectories are stored in the replay buffer, and later sampled to facilitate off-policy training on decorrelated data.  

These details are not the focus of this course, but more information can be found in Google DeepMind's original paper:  
https://arxiv.org/abs/1312.5602
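
As a rough illustration of the replay buffer idea, the sketch below stores transitions in a fixed-capacity queue and samples them uniformly at random. The class name `SimpleReplayBuffer` and its capacity are hypothetical; it mirrors the `add`, `sample`, and `size` calls used later in this notebook, but it is not the implementation in `utils.replay_buffer`.

```python
import random
from collections import deque


class SimpleReplayBuffer:
    """Fixed-capacity FIFO buffer with uniform random sampling (illustrative)."""

    def __init__(self, capacity=10000):
        # A deque with maxlen silently discards the oldest transition when full
        self.buffer = deque(maxlen=capacity)

    def add(self, obs, action, reward, next_obs, done):
        self.buffer.append((obs, action, reward, next_obs, done))

    def sample(self, batch_size):
        # Uniform sampling breaks the temporal correlation between neighbours
        batch = random.sample(list(self.buffer), batch_size)
        # Transpose a list of transitions into one list per field
        obs, actions, rewards, next_obs, dones = map(list, zip(*batch))
        return obs, actions, rewards, next_obs, dones

    def size(self):
        return len(self.buffer)


demo_buffer = SimpleReplayBuffer(capacity=3)
for t in range(5):
    demo_buffer.add(obs=t, action=0, reward=-1.0, next_obs=t + 1, done=False)

sampled_obs, _, _, sampled_next_obs, _ = demo_buffer.sample(2)
print(demo_buffer.size(), sampled_obs, sampled_next_obs)
```

With a capacity of 3, only the three most recent transitions survive, and each sampled `obs` stays paired with its own `next_obs`.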

In [None]:
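class DQNAgent:
    def __init__(self):
        self.tau = 0.01  # Soft-update rate for the target network
        self.model = TFModel(n_actions, n_trains)  # Behavioural Model
        self.target_model = TFModel(n_actions, n_trains)  # Frozen Target Model
        # Clone the behavioural weights into the target model. Subclassed Keras
        # models create their weights on the first call, so this copy only
        # takes effect once both models have been built.
        self.target_model.set_weights(self.model.get_weights())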
        self.model.compile(
            optimizer=optimizers.Adam(learning_rate=1e-4),
            loss=losses.MeanSquaredError()
        )
        
    def update_target_networks(self):
        """Soft update: theta_target = (1 - tau) * theta_target + tau * theta"""
        blended = [
            t * (1 - self.tau) + m * self.tau
            for m, t in zip(self.model.get_weights(), self.target_model.get_weights())
        ]
        self.target_model.set_weights(blended)
    
dqn_agent = DQNAgent()
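
To get a feel for how slowly the target network follows the behavioural model with `tau = 0.01`, the sketch below applies the same soft-update rule to a single scalar weight. Holding the behavioural weight fixed is a simplifying assumption; during real training it moves at every gradient step.

```python
soft_tau = 0.01
theta = 1.0         # behavioural-model weight, held fixed for illustration
theta_target = 0.0  # target-model weight, starts far away

for step in range(1, 301):
    # Same rule as update_target_networks, applied to one scalar
    theta_target = (1 - soft_tau) * theta_target + soft_tau * theta
    if step in (1, 100, 300):
        print(f"after {step:3d} updates: theta_target = {theta_target:.3f}")
```

With a fixed `theta`, the remaining gap shrinks by a factor of `(1 - tau)` per update, so after `1 / tau = 100` updates the target has covered roughly 63% of the distance.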

DQN is a value-based algorithm: our neural network will output state-action values, $Q(s,a)$.


To know how to behave, we must define an explicit policy.  
* In this case, the best policy we can define is a greedy one, always taking the action corresponding to the state-action pair with the highest Q-value.
* We encourage some exploration by occasionally ignoring the greedy policy and taking a random action.
* Combining the two gives our policy of choice for this algorithm: $\epsilon$-greedy.

In [None]:
def e_greedy(Q, epsilon):
    """Epsilon greedy policy."""
    
    # Random action
    if np.random.uniform(0,1) < epsilon:
        return np.random.randint(0, n_actions**n_trains)
    
    # Greedy action
    return np.argmax(Q)

Since we are using a single agent to control all of the trains in a centralized manner, we need to translate the action index output by the $\epsilon$-greedy policy into a tuple of numbers that represent a real action to be taken by each train in the environment.

In [None]:
def actions_per_train(action):
    """
    E.g. for 3 trains, each with 5 actions:
         a model output of 11 --> (0,2,1).
         I.e. train 1 takes action 0,
              train 2 takes action 2,
              train 3 takes action 1.
    """
    return np.unravel_index(action, [n_actions] * n_trains)
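
The conversion above treats the joint action index as a number written in base `n_actions`, with one digit per train. The sketch below spells out that arithmetic in plain Python, together with the inverse mapping. The function names are hypothetical; the decoding follows the default row-major ordering of `np.unravel_index`.

```python
def index_to_actions(index, n_actions, n_trains):
    """Decode a joint action index into one action per train."""
    digits = []
    for _ in range(n_trains):
        index, action = divmod(index, n_actions)
        digits.append(action)  # least significant digit (last train) first
    return tuple(reversed(digits))


def actions_to_index(actions, n_actions):
    """Encode per-train actions back into a single joint action index."""
    index = 0
    for action in actions:
        index = index * n_actions + action
    return index


decoded = index_to_actions(11, n_actions=5, n_trains=3)
print(decoded)                                  # (0, 2, 1)
print(actions_to_index(decoded, n_actions=5))   # 11
```

The inverse mapping is handy when logging or debugging, for example to check which joint index corresponds to "all trains move forward".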

Not all trains enter and exit the environment in the same time step.  
We will import some helper functions to pad any missing observations with zeros.  
These values will not affect the training of the neural network.

In [None]:
import functools
from environments.env_utils import pad_initial_obs, pad_done_agents

pad_obs = functools.partial(pad_initial_obs, observation_space=observation_space, n_trains=n_trains)
pad_dones = functools.partial(pad_done_agents, observation_space=observation_space, n_trains=n_trains)

# DQN Algorithm

<img src="https://storage.cloud.google.com/gtc-2020/images/DQN_algorithm.png" width="500" height="400" align="left">    

[Image from https://arxiv.org/abs/1312.5602]

In the cell below is the first component of the DQN algorithm, which gets trajectories of experience in the environment
* Given an observation of the state of the environment, you need to use your model to make a `prediction` of the state-action values
* Based on the predicted Q-values, the agent takes an action, $a_t$, for each train in the environment according to an $\epsilon$-greedy policy
* The environment transitions into the next state, $s_{t+1}$, returning rewards, $r_t$, for the action taken for each train

These trajectories are highly correlated with each other, due to the sequential nature of their acquisition. For this reason, we do not train on them immediately, but prefer to store them in a `replay buffer`, for off-policy training later.

In [None]:
def get_trajectory(obs, epsilon):
    """
    Arguments:
        obs: current observation of the state of trains in the environment
        epsilon: current hyperparameter value for exploration
    Returns:
        trajectory of (obs, actions, rewards, next_obs, dones)
    """
    
    # TODO: use your neural network model to make a prediction of Q-values
    # given the observation, obs, of the current state of the environment
    # Q = ...
    
    # Epsilon greedy policy
    actions = e_greedy(np.squeeze(Q), epsilon=epsilon)
    
    # Convert single index to action for each train
    a = actions_per_train(actions)
    
    # Flatland env expects a dictionary of actions
    action_dict = {train: int(action) for train, action in enumerate(a)}

    # Take a step in the environment
    next_obs, rewards, dones, info = env.step(action_dict)
    
    next_obs = preprocess(next_obs)
    next_obs, rewards, dones = pad_dones(next_obs, rewards, dones)
    
    return obs, actions, rewards, next_obs, dones

Now we must consider how to train our agent.

* In the cell below is a function `train_dqn`, which receives trajectories of experience sampled from the replay buffer
* The neural network is trained off-policy
* This is achieved using the target model, with parameters $\theta^-$, to predict the target Q-values for the next observation, $Q(\phi_{j+1}, a'; \theta^-)$
* The behavioural model parameters $\theta$ are then adjusted through backpropagation to move the current Q-value towards the immediate reward plus the discounted future prediction, as specified by the Bellman optimality equation:

$y_j = r_j + \gamma \max_{a'} Q(\phi_{j+1}, a'; \theta^-)$

You will need to implement this equation in the training loop using the values predicted from your target Q model.
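
As a worked example of the arithmetic in this target, the sketch below evaluates it for a single transition using made-up numbers. The helper name `td_target` and the three Q-values are hypothetical; in the notebook the Q-values would come from the target model and cover the whole joint action space.

```python
def td_target(reward, q_next, done, gamma=0.99):
    """Bellman target for one transition.

    reward: scalar reward for the transition
    q_next: Q-values of every action in the next state
    done:   True if the episode ended, in which case there is no future term
    """
    if done:
        return reward
    return reward + gamma * max(q_next)


q_next = [0.2, 1.0, -0.3]  # made-up Q-values for three actions
y_running = td_target(reward=-0.5, q_next=q_next, done=False)
y_terminal = td_target(reward=-0.5, q_next=q_next, done=True)
print(round(y_running, 2), y_terminal)
```

For the non-terminal case the target is $-0.5 + 0.99 \times 1.0 = 0.49$; for the terminal case it is just the reward, $-0.5$.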

In [None]:
def train_dqn(trajectories, gamma=0.99):
    """
    Arguments:
        trajectories: batch of (obs, actions, rewards, next_obs, dones)
        gamma: future rewards discount factor
    """
    obs, actions, rewards, next_obs, dones = trajectories

    # TODO: use the target_model to predict the target Q-values for next_obs
    # Q_target = ...

    # Initialize training target
    y = np.zeros_like(Q_target)
    for k in range(batch_size):

        # We take the mean of rewards for all trains, to maintain r in [-1,1]
        r = sum(rewards[k]) / n_trains
        
        # We need to update the Q(s,a) value for the action taken
        action = actions[k]

        if dones[k]['__all__']:
            y[k][action] = r
        else:
            # TODO: fill in y[k][action] using the Bellman equation (see above).
            # (Hint: gamma is an argument of this function.)
            # y[k][action] = ...

            
    # Gradient Descent on (y - Q)^2
    with tf.GradientTape() as tape:
        # Forward pass
        Q = dqn_agent.model(obs)
        
        # Compute the loss on the actions taken
        actions = tf.one_hot(actions, n_actions**n_trains)
        loss = dqn_agent.model.loss(Q * actions, y)
        
    variables = dqn_agent.model.trainable_variables
    gradients = tape.gradient(loss, variables)
    dqn_agent.model.optimizer.apply_gradients(zip(gradients, variables))
    dqn_agent.update_target_networks()

## Main training loop

Run the cell below to execute training across multiple episodes of game play in the environment
* For best results, train for longer by increasing `n_episodes`
* For quicker results, decrease `n_episodes`

In [None]:
# Import a replay buffer for storing training trajectories
from utils.replay_buffer import ReplayBuffer
replay_buffer = ReplayBuffer()

# Training parameters
n_episodes = 100
max_steps_per_episode = 25

# Mini-batch size of trajectory samples
batch_size = 128

# Exploration
epsilon = 0.99
epsilon_decay = 0.995
min_epsilon = 0.0

for episode in range(n_episodes):
    obs = env.reset()
    obs = preprocess(obs)
    obs = pad_obs(obs)

    episode_reward = 0
    for _ in range(max_steps_per_episode):
        
        # Decay exploration coefficient
        epsilon = max(min_epsilon, epsilon*epsilon_decay)
        
        # Store trajectory experience tuple in replay memory
        obs, actions, rewards, next_obs, dones = get_trajectory(obs, epsilon)
        replay_buffer.add(obs, actions, rewards, next_obs, dones)
        
        # Update tally of current episode reward
        episode_reward += sum(rewards)
        
        if replay_buffer.size() > batch_size:
            # Sample de-correlated mini-batch of trajectories
            trajectories = replay_buffer.sample(batch_size)
            
            # Train DQN agent with the function you wrote above
            train_dqn(trajectories)
        
        if dones['__all__']:
            break

        obs = next_obs
    
    print(f"Episode {episode+1} Reward: {episode_reward/n_trains:.1f}")
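
The exploration schedule in the loop above multiplies `epsilon` by `epsilon_decay` at every environment step, so exploration fades geometrically. The sketch below reproduces that schedule on its own to show how quickly this happens. The helper name `epsilon_after` is hypothetical, and the step counts assume every episode runs for the full 25 steps, which will not hold if episodes finish early.

```python
import math

eps_start, eps_decay, eps_min = 0.99, 0.995, 0.0


def epsilon_after(n_steps):
    """Exploration rate after n_steps decay updates."""
    return max(eps_min, eps_start * eps_decay ** n_steps)


for n_steps in (25, 500, 2500):
    print(f"after {n_steps:4d} steps: epsilon = {epsilon_after(n_steps):.6f}")

# Number of steps until epsilon first drops below 0.1
steps_to_tenth = math.ceil(math.log(0.1 / eps_start) / math.log(eps_decay))
print(steps_to_tenth)
```

If the agent stops exploring before it has seen enough of the environment, a slower decay (a value closer to 1) or a non-zero `min_epsilon` are the usual knobs to try.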

# Visualize Performance  

* For this demonstration, we will visualize performance on a fixed test environment
* This test environment was not seen during training
* The test city map is also larger than the training maps
* After enough training, this can be used to test our agent's ability to generalize to new scenarios

In [None]:
speed_ratio_map = {1.: 1.}
rail_generator = simple_rail_generator(n_trains=n_trains, seed=42)
schedule_generator = sparse_schedule_generator(speed_ratio_map)
obs_builder_object = TreeObsForRailEnv(max_depth=2, predictor=None)

env = RailEnv(
            width=16,
            height=16,
            number_of_agents=n_trains,
            rail_generator=rail_generator,
            schedule_generator=schedule_generator,
            obs_builder_object=obs_builder_object,
            remove_agents_at_target=True,
        )

# Instantiate Renderer
env_renderer = RenderTool(env, gl="PILSVG",
                          agent_render_variant=AgentRenderVariant.AGENT_SHOWS_OPTIONS_AND_BOX,
                          show_debug=False,
                          screen_height=726,
                          screen_width=1240)

In [None]:
obs = env.reset()
obs = preprocess(obs[0])

env_renderer.reset()

frames = []
n_actions = env.action_space[0]

rollout_steps = 25
for step in range(rollout_steps):
    
    Q = dqn_agent.model.predict(np.expand_dims(obs, 0))
    
    # Test with a deterministic policy: epsilon=0
    actions = e_greedy(np.squeeze(Q, axis=0), epsilon=0)
    
    a = actions_per_train(actions)
    action_dict = {train: int(action) for train, action in enumerate(a)}

    obs, rewards, dones, info = env.step(action_dict)
    obs = preprocess(obs)

    env_renderer.render_env(show=False, frames=False, show_observations=False)
    frames.append(env_renderer.gl.get_image())
    
animate_env(frames)

## Scalability: from centralized control to MARL

In this section:
* We illustrated an example of the centralized control of multiple trains

* We learned how to use TensorFlow to build and train a neural network as a function approximator to help in this task


* In general, this approach works well, but it is not scalable:  
notice how, for N actions and t trains, our neural network had to produce $N^t$ values for a joint action space!

* This is manageable in our example with 3 trains, but what if we wanted to scale to hundreds or thousands of trains?
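
To put numbers on this growth, the sketch below compares the size of the joint action space with the total number of outputs needed if each train had its own set of Q-values. It assumes 5 actions per train, as in the `actions_per_train` example, and the per-train comparison is only a rough stand-in for the multi-agent setups discussed next.

```python
def joint_outputs(n_actions, n_trains):
    """Outputs needed by one centralized network over the joint action space."""
    return n_actions ** n_trains


def per_train_outputs(n_actions, n_trains):
    """Total outputs if every train has its own n_actions Q-values."""
    return n_actions * n_trains


for trains in (1, 3, 10, 20):
    print(
        f"{trains:2d} trains: joint = {joint_outputs(5, trains):,}, "
        f"per-train = {per_train_outputs(5, trains)}"
    )
```

Three trains need 125 joint outputs, but ten trains already need 9,765,625, while the per-train count grows only linearly.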


In the next section:
* We will explore multi-agent reinforcement learning as a more scalable alternative

* We will introduce the RLlib framework to help train more advanced algorithms than our naive implementation of DQN

* This will improve utilization of available compute resources