# Reinforcement Learning Tutorial
Tutorial adapted by Hongfei from [PyTorch Tutorial](https://pytorch.org/tutorials/intermediate/reinforcement_q_learning.html).

This tutorial shows how to use PyTorch to train Deep Q-Learning (DQL) agents
on the LunarLander task and the CartPole-v1 task from [Gymnasium](https://www.gymnasium.farama.org).

## Task -- LunarLander

State dimension: 8
- $x,y$ positions
- $x,y$ linear velocities
- angle and angular velocity
- two Boolean flags indicating whether the left and right legs touch the ground

Action space dimension: 4
- 0: do nothing
- 1: left engine fire
- 2: main engine fire
- 3: right engine fire

For details of this task, please refer to the SPH6004 lecture notes. The environment is provided by the [Gymnasium](https://gymnasium.farama.org/environments/box2d/lunar_lander/) project.

![LunarLander](Figs/lunar_lander.gif)

## Task -- CartPole-v1

The agent has to decide between two actions - moving the cart left or
right - so that the pole attached to it stays upright. You can find more
information about the environment at
[Gymnasium's website](https://gymnasium.farama.org/environments/classic_control/cart_pole/).

![CartPole-v1](Figs/cartpole.gif)

As the agent observes the current state of the environment and chooses
an action, the environment *transitions* to a new state and
returns a reward that indicates the consequences of the action. In this
task, the reward is +1 for every timestep, and the episode
terminates if the pole tilts too far or the cart moves more than 2.4
units away from the center. This means better-performing episodes run
longer and accumulate a larger return.
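
To make the notion of "return" concrete, the sketch below sums the rewards of a CartPole episode, with an optional discount factor (the tutorial later sets `GAMMA = 0.99`). The helper name `episode_return` and the two episode lengths are illustrative assumptions, not part of the tutorial code.

```python
def episode_return(rewards, gamma=1.0):
    """Sum of rewards, each weighted by gamma**t (gamma=1.0 means undiscounted)."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# CartPole gives +1 per timestep, so the undiscounted return equals the episode length.
short_episode = [1.0] * 20    # hypothetical: pole fell after 20 steps
long_episode = [1.0] * 200    # hypothetical: pole balanced for 200 steps

print(episode_return(short_episode))             # 20.0
print(episode_return(long_episode))              # 200.0
print(episode_return(long_episode, gamma=0.99))  # discounted, so smaller than 200
```

With discounting, later rewards count for less, but the longer episode still earns the larger return.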

The CartPole task is designed so that the inputs to the agent are 4 real
values representing the environment state (0: cart position, 1: cart velocity, 2: pole angle (in radians), 3: pole angular velocity).
We take these 4 inputs without any scaling and pass them through a 
small fully-connected network with 2 outputs, one for each action (0: left, 1: right). 

----

In [1]:
# %%bash
# pip3 install gymnasium[classic_control]
# pip3 install gymnasium[box2d]

In [2]:
import gymnasium as gym
import math
import random
from collections import namedtuple, deque
from itertools import count
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from time import sleep
from tqdm.notebook import tqdm
import os

task = ["LunarLander-v2", "CartPole-v1"][0]

# if gpu is to be used
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

In [3]:
# Helper function to render environment

def render(task="LunarLander-v2", policy_net = None, wind=False):
    # if policy_net is None, use randomly sampled policy for action.
    if task=="LunarLander-v2":
        env = gym.make(task,render_mode='human',enable_wind=wind,wind_power=10)
    elif task=="CartPole-v1":
        env = gym.make(task,render_mode='human')

    observation, info = env.reset()
    total_reward = 0
    with torch.no_grad():
        for i in count():
            if policy_net is None:
                action = env.action_space.sample()        
            else:
                q_values = policy_net(torch.tensor(observation,device=device).unsqueeze(0))
                action = q_values.argmax().item()

            observation, reward, terminated, truncated, _ = env.step(action)
            total_reward += reward
            _ = env.render()
            if terminated or truncated:
                policy_mode = 'random' if policy_net is None else 'policy_net'
                print('Policy is {}, episode duration: {}'.format(policy_mode,i+1))
                print('Total reward is {:.2f}'.format(total_reward))
                sleep(1)
                env.close()
                break

In [4]:
render(task)

Policy is random, episode duration: 44
Total reward is 44.00


## Replay Memory

Experience replay stores
the transitions that the agent observes, allowing us to reuse this data
later. By sampling from it randomly, the transitions that build up a
batch are decorrelated. It has been shown that this greatly stabilizes
and improves the DQL training procedure.

For this, we're going to need two classes:

-  ``Transition`` - a named tuple representing a single transition in
   our environment.
-  ``ReplayMemory`` - a cyclic buffer of bounded size that holds the
   transitions observed recently. It also implements a ``.sample()``
   method for selecting a random batch of transitions for training.
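
As a rough illustration of what "cyclic buffer of bounded size" means, the sketch below uses `collections.deque` with `maxlen`, which drops the oldest entries once capacity is reached. The capacity of 3 and the integers standing in for state tensors are assumptions chosen to keep the demo small.

```python
import random
from collections import deque, namedtuple

Transition = namedtuple('Transition', ('state', 'action', 'reward', 'next_state'))

buffer = deque([], maxlen=3)  # tiny capacity so eviction is easy to see
for step in range(5):
    # Integers stand in for real state tensors here.
    buffer.append(Transition(state=step, action=0, reward=1.0, next_state=step + 1))

# Only the 3 most recent transitions survive; steps 0 and 1 were evicted.
print([t.state for t in buffer])  # [2, 3, 4]

# Uniform sampling without replacement, as a training batch would be drawn.
batch = random.sample(list(buffer), 2)
print(len(batch))  # 2
```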




In [5]:
Transition = namedtuple('Transition',
                        ('state', 'action', 'reward', 'next_state'))


class ReplayMemory(object):

    def __init__(self, capacity):
        self.memory = deque([], maxlen=capacity)

    def push(self, *args):
        """Save a transition"""
        self.memory.append(Transition(*args))

    def sample(self, batch_size):
        return random.sample(self.memory, batch_size)

    def __len__(self):
        return len(self.memory)

Now, let's define our model.

### Q-network

We will use a feed-forward neural network $N$ to approximate the optimal utility (action-value) function $Q$.

#### LunarLander-v2

The input of network $N$ has dimension 8. It has four outputs, representing $Q(s,\text{do nothing})$, $Q(s,\text{left engine fire})$, $Q(s,\text{main engine fire})$, and $Q(s,\text{right engine fire})$ (where $s$ is the network input representing the current state). In effect, the network tries to predict the *optimal total reward* of
taking each action given the current input.

#### CartPole-v1

The input of network $N$ has dimension 4. It has two
outputs, representing $Q(s, \mathrm{left})$ and
$Q(s, \mathrm{right})$ (where $s$ is the network input
representing the current state). In effect, the network tries to predict the *optimal total reward* of
taking each action given the current input.




In [6]:
class DQN(nn.Module):

    def __init__(self, state_space_dim, n_actions):
        super(DQN, self).__init__()
        self.layer1 = nn.Linear(state_space_dim, 128)
        self.layer2 = nn.Linear(128, 128)
        self.layer3 = nn.Linear(128, n_actions)

    def forward(self, x):
        x = F.relu(self.layer1(x))
        x = F.relu(self.layer2(x))
        return self.layer3(x)

## $\varepsilon$-greedy algorithm

The following cell defines a function that implements the $\varepsilon$-greedy algorithm. Here we use a dynamic $\varepsilon$: at the beginning ($t=0$) we have a large $\varepsilon=\varepsilon_\text{start}$, and as the number of steps $t\to \infty$ the threshold $\varepsilon$ decays towards a small value $\varepsilon_\text{end}$. Both $\varepsilon_\text{start}$ and $\varepsilon_\text{end}$ are hyper-parameters defined later.
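
The schedule used by `select_action` is an exponential decay, $\varepsilon(t) = \varepsilon_\text{end} + (\varepsilon_\text{start} - \varepsilon_\text{end})\, e^{-t/\text{EPS\_DECAY}}$. The sketch below evaluates it at a few step counts. The helper name `epsilon_at` is an illustrative assumption, and its defaults mirror the hyper-parameters set later in this tutorial.

```python
import math

def epsilon_at(step, eps_start=0.9, eps_end=0.1, eps_decay=10000):
    """Exploration probability after `step` environment steps."""
    return eps_end + (eps_start - eps_end) * math.exp(-step / eps_decay)

for step in (0, 10_000, 50_000):
    print(step, round(epsilon_at(step), 4))
```

With these defaults, $\varepsilon$ starts at 0.9, is about 0.39 after 10,000 steps, and is about 0.105 after 50,000 steps, so the agent keeps exploring at least 10% of the time.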

About the `tensor.max()` method:
Suppose we have the tensor
```
a = tensor([[-0.5044, -1.5145],
            [-0.8489,  0.0904],
            [ 0.0348,  2.0637]])
```
then `a.max(dim=1)` returns two tensors, where the first contains the maximum along `dim=1`, i.e.
```
tensor([-0.5044,  0.0904,  2.0637])
```
and the second contains the indices of the maximum elements, i.e.
```
tensor([0, 1, 1])
```
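
For readers less familiar with the `dim` argument, the same row-wise reduction can be sketched in plain Python, without PyTorch. The list `a` copies the values above; `values` and `indices` are just illustrative names.

```python
a = [[-0.5044, -1.5145],
     [-0.8489,  0.0904],
     [ 0.0348,  2.0637]]

# Reducing along dim=1 means taking the maximum within each row.
values = [max(row) for row in a]
indices = [row.index(max(row)) for row in a]

print(values)   # [-0.5044, 0.0904, 2.0637]
print(indices)  # [0, 1, 1]
```

In `select_action`, only the indices are needed, because the column index of the largest Q-value is the action to take.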

In [7]:
def select_action(state):
    global steps_done
    sample = random.random()
    eps_threshold = EPS_END + (EPS_START - EPS_END) * \
        math.exp(-1. * steps_done / EPS_DECAY)
    steps_done += 1
    if sample > eps_threshold:
        with torch.no_grad():
            # pick the action with the largest predicted Q-value;
            # .max(dim=1)[1] returns the index of the maximum in each row
            return policy_net(state).max(dim=1)[1].view(1, 1)
    else:
        # we randomly select an action from environment's action space
        return torch.tensor([[env.action_space.sample()]], device=device, dtype=torch.long)


## Training

Finally, the code for training our model.

Here, you can find an ``optimize_model`` function that performs a
single step of the optimization. It performs the following:
1. Sample a batch of experience $(s,a,r,s')$ from our experience replay `memory` and concatenate
all the tensors into a single one.

2. Compute $Q(s, a)$ using our policy network.

3. Compute $V(s') = \max_{a'} Q(s', a')$ using our target network.

4. Combine $Q(s, a)$, $V(s')$ and the current reward $r$ into our loss: the Huber loss between $Q(s, a)$ and the target $r + \gamma V(s')$.

5. (We set $V(s') = 0$ if $s'$ is a terminal state.)

6. Update the policy network by stochastic gradient descent.

7. Update the target network by a [soft update](https://arxiv.org/pdf/1509.02971.pdf), controlled by the hyperparameter ``TAU``, which will be defined later.
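
Steps 3 to 5 can be traced by hand on a single transition. The sketch below is a plain-Python illustration, not the batched PyTorch code of `optimize_model`. The helper names `td_target` and `huber` and the sample numbers are assumptions. The Huber loss uses a threshold of 1, which matches the default `beta=1.0` of `nn.SmoothL1Loss`.

```python
GAMMA = 0.99

def td_target(reward, next_q_values, gamma=GAMMA):
    """r + gamma * max_a' Q(s', a'), with V(s') = 0 when s' is terminal (None)."""
    next_value = 0.0 if next_q_values is None else max(next_q_values)
    return reward + gamma * next_value

def huber(prediction, target, delta=1.0):
    """Quadratic for small errors, linear for large ones."""
    error = abs(prediction - target)
    if error <= delta:
        return 0.5 * error ** 2
    return delta * (error - 0.5 * delta)

# Non-terminal transition: the target-network Q-values for s' are hypothetical.
target = td_target(reward=1.0, next_q_values=[2.0, 3.0])
print(target)              # approximately 1 + 0.99 * 3 = 3.97
print(huber(3.5, target))  # small error, so the quadratic branch applies

# Terminal transition: no bootstrapping, the target is just the reward.
print(td_target(reward=1.0, next_q_values=None))  # 1.0
```

The linear branch for large errors is what makes this loss less sensitive to outlier targets than a plain squared error.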

In [8]:
def optimize_model():
    if len(memory) < BATCH_SIZE:
        # If we do not have enough experience in experience replay, do nothing
        return
    
    transitions = memory.sample(BATCH_SIZE)
    # Transpose the batch (see https://stackoverflow.com/a/19343/3343043 for
    # detailed explanation). This converts batch-array of Transitions
    # to Transition of batch-arrays.
    batch = Transition(*zip(*transitions))

    # Compute a mask of non-final states and concatenate the batch elements
    # (a final state would've been the one after which simulation ended)
    non_final_mask = torch.tensor(tuple(map(lambda s: s is not None,
                                          batch.next_state)), device=device, dtype=torch.bool)
    non_final_next_states = torch.cat([s for s in batch.next_state
                                                if s is not None])
    state_batch = torch.cat(batch.state)
    action_batch = torch.cat(batch.action)
    reward_batch = torch.cat(batch.reward)

    # Compute Q(s_t, a) - the model computes Q(s_t), then we select the
    # columns of actions taken. These are the actions which would've been taken
    # for each batch state according to policy_net
    state_action_values = policy_net(state_batch).gather(dim=1, index=action_batch)

    # Compute V(s_{t+1}) for all next states.
    # Expected values of actions for non_final_next_states are computed based
    # on the "older" target_net; selecting their best reward with max(1)[0].
    # This is merged based on the mask, such that we'll have either the expected
    # state value or 0 in case the state was final.
    next_state_values = torch.zeros(BATCH_SIZE, device=device)
    with torch.no_grad():
        next_state_values[non_final_mask] = target_net(non_final_next_states).max(dim=1)[0]
    # Compute the expected Q values
    expected_state_action_values = (next_state_values * GAMMA) + reward_batch

    # Compute Huber loss
    criterion = nn.SmoothL1Loss()
    loss = criterion(state_action_values, expected_state_action_values.unsqueeze(1))

    # Optimize the model
    optimizer.zero_grad()
    loss.backward()
    # In-place gradient clipping
    torch.nn.utils.clip_grad_value_(policy_net.parameters(), 100)
    optimizer.step()

    # Soft update of the target network's weights
    # θ′ ← τ θ + (1 −τ )θ′
    target_net_state_dict = target_net.state_dict()
    policy_net_state_dict = policy_net.state_dict()
    for key in policy_net_state_dict:
        target_net_state_dict[key] = policy_net_state_dict[key]*TAU + target_net_state_dict[key]*(1-TAU)
    target_net.load_state_dict(target_net_state_dict)
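
To get a feel for how slowly the target network follows the policy network, the soft update rule can be applied to a single scalar weight. This is a sketch on plain floats rather than state dicts; the `soft_update` helper and the starting values are assumptions, and the policy weight is held fixed for simplicity.

```python
TAU = 0.005

def soft_update(target_w, policy_w, tau=TAU):
    """One step of theta' <- tau * theta + (1 - tau) * theta'."""
    return tau * policy_w + (1.0 - tau) * target_w

target_w, policy_w = 0.0, 1.0  # hypothetical weights
for step in range(1, 201):
    target_w = soft_update(target_w, policy_w)
    if step in (1, 200):
        print(step, round(target_w, 4))
```

After one update the target has moved only 0.5% of the way, and after 200 updates it has covered roughly 63% of the gap, so the targets used in the loss change smoothly.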

In [9]:
# ------------------------ Hyper-parameters ------------------
# BATCH_SIZE is the number of transitions sampled from experience replay (we train on several transitions in parallel)
# GAMMA is the discount factor for future rewards
# EPS_START is the starting value of epsilon (exploration probability)
# EPS_END is the final value of epsilon
# EPS_DECAY controls the rate of exponential decay of epsilon, higher means a slower decay
# TAU is the update rate of the target network
# LR is the learning rate of the AdamW optimizer
BATCH_SIZE = 128
GAMMA = 0.99
EPS_START = 0.9
EPS_END = 0.1
EPS_DECAY = 10000
TAU = 0.005
LR = 1e-4
memory = ReplayMemory(10000)

steps_done = 0
num_episodes = 1000
total_rewards = []
best_total_rewards = float('-inf')

# ------------------------ Re-create environment ------------------

env = gym.make(task)

# Get number of actions from gym action space
n_actions = env.action_space.n
# Get the number of state observations
state, info = env.reset()
n_observations = len(state)
episode_durations = []

# ------------------------- Initialize networks ---------------------

policy_net = DQN(n_observations, n_actions).to(device)
target_net = DQN(n_observations, n_actions).to(device)
target_net.load_state_dict(policy_net.state_dict())

optimizer = optim.AdamW(policy_net.parameters(), lr=LR, amsgrad=True)


model_save_path = './ckpt'
os.makedirs(model_save_path,exist_ok=True)

Below, you can find the main training loop. At the beginning we reset
the environment and obtain the initial ``state`` Tensor. Then, we sample
an action, execute it, observe the next state and the reward, and optimize our model once. When the episode ends, we restart the loop.

Note that episodes are also cut off by a step limit: a LunarLander episode is truncated once it reaches 1000 steps, and a CartPole episode once it reaches 500 steps. In that case the environment sets `truncated` rather than `terminated`.
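
Gymnasium reports the end of an episode through two flags, `terminated` and `truncated`, and the training loop treats them differently when storing a transition. The sketch below mirrors that logic on plain Python values; the function name `stored_next_state` and the placeholder observation are assumptions.

```python
def stored_next_state(observation, terminated, truncated):
    """Return (next_state, done) in the way the training loop records them."""
    done = terminated or truncated
    # Only true termination marks s' as terminal; a step-limit cutoff keeps s',
    # so the TD target can still bootstrap from it.
    next_state = None if terminated else observation
    return next_state, done

obs = [0.0, 0.1, 0.02, -0.1]  # placeholder CartPole observation

print(stored_next_state(obs, terminated=True, truncated=False))   # (None, True)
print(stored_next_state(obs, terminated=False, truncated=True))   # (obs, True)
print(stored_next_state(obs, terminated=False, truncated=False))  # (obs, False)
```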

In [10]:
# Training loop
for i_episode in tqdm(range(num_episodes)):
    # Initialize the environment and get its state
    state, info = env.reset()
    total_reward = 0
    state = torch.tensor(state, dtype=torch.float32, device=device).unsqueeze(0)
    for t in count():
        action = select_action(state)
        observation, reward, terminated, truncated, _ = env.step(action.item())
        total_reward += reward
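        # Force float32: environment rewards can be NumPy float64 scalars,
        # which would not match the float32 Q-values in the loss
        reward = torch.tensor([reward], dtype=torch.float32, device=device)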
        done = terminated or truncated

        if terminated:
            next_state = None
        else:
            next_state = torch.tensor(observation, dtype=torch.float32, device=device).unsqueeze(0)

        # Store the transition in memory
        memory.push(state, action, reward, next_state)

        # Move to the next state
        state = next_state

        # Perform one step of the optimization (on the policy network)
        optimize_model()

        if done:
            episode_durations.append(t + 1)
            total_rewards.append(total_reward)
            # print and save model every 50 episodes
            aw = np.minimum(len(episode_durations),50)
            if i_episode%50 == 0:
                if best_total_rewards < np.mean(total_rewards[-aw:]):
                    # save model if avg performance is good
                    torch.save(policy_net.state_dict(),'./ckpt/{}_checkpoint.pt'.format(task))
                    best_total_rewards = np.mean(total_rewards[-aw:])

                print('Episode {}, avg episode_durations {:.2f}, avg total reward {:.2f}'.format(i_episode,np.mean(episode_durations[-aw:]),np.mean(total_rewards[-aw:])))
            break

print('Complete')

Episode 0, avg episode_durations 56.00, avg total reward 56.00
Episode 50, avg episode_durations 19.28, avg total reward 19.28
Episode 100, avg episode_durations 17.34, avg total reward 17.34
Episode 150, avg episode_durations 48.16, avg total reward 48.16
Episode 200, avg episode_durations 99.02, avg total reward 99.02
Episode 250, avg episode_durations 270.04, avg total reward 270.04
Episode 300, avg episode_durations 208.62, avg total reward 208.62
Episode 350, avg episode_durations 225.74, avg total reward 225.74
Episode 400, avg episode_durations 303.62, avg total reward 303.62
Episode 450, avg episode_durations 495.30, avg total reward 495.30
Complete


Lastly, we load the best saved checkpoint and visualize the trained policy network.

In [14]:

# map_location lets a checkpoint saved on GPU load on a CPU-only machine
policy_net.load_state_dict(torch.load('./ckpt/{}_checkpoint.pt'.format(task), map_location=device))
# Set policy_net=None to visualize a random policy.
render(task=task, policy_net=policy_net,wind=False)

Policy is policy_net, episode duration: 500
Total reward is 500.00
