# Reinforcement Learning Hackathon

### Setting Up Environment

Hopefully you have successfully set up and activated the environment. See the README.md if not.

### Testing Environment

We'll first use OpenAI [Gym](https://www.gymlibrary.dev) to set up a game that we can control manually.

In [None]:
# Importing relevant modules
import numpy as np
import gym
from gym.utils.play import play, PlayPlot

You can create a game environment by calling `gym.make(...)` with the ID of one of the available environments. Some simple examples are found in the *Classic Control* section, seen here: https://www.gymlibrary.dev/environments/classic_control/. Clicking on a game, you will find information about how to create it.

For this example, we will have a go at playing and training on the [Cart Pole](https://www.gymlibrary.dev/environments/classic_control/cart_pole/) environment. Have a look at its documentation page and you'll see that the *action* space is discrete with size 2. This means that at each step you can either push the cart left (0) or right (1). Let's first have a go at playing manually.

In [None]:
# Create the cart pole environment
env = gym.make("CartPole-v1")

# Let's play the game manually, mapping left/right movement to a/d keys.
play(gym.make("CartPole-v1", render_mode="rgb_array"), keys_to_action={
    "a": np.array(0), # 0 corresponds to moving the cart leftwards
    "d": np.array(1), # 1 is rightwards
}, noop=np.array(0)) # This bit maps 'no operation' (not pressing any button) to an action.

Before even running this cell, you might have realised there is a problem here. The action space only contains 'left' and 'right'; in this particular example there is no action for doing nothing. That does not transfer well to the keyboard, where some action must be chosen even when no key is pressed (in the cell above, `noop` falls back to pushing left).

To get around this, I reworked the cartpole code to include a 'no force applied' state. Don't worry about the specifics of this for now.

In [None]:
from src.cartPole.myCartPole import CartPoleEnv

# I have edited the cartpole code to attempt to include a zero state, and increased angle of termination
env = CartPoleEnv(render_mode="rgb_array")

play(env, keys_to_action={
    "a": np.array(0),
    "d": np.array(2),
}, noop=np.array(1))

Hopefully this was more playable! I also increased the angle the pole falls to reset the game, and made the pole nice and long.

As a bonus little game, see if you can get to the top of the mountain (it's *really* difficult ;)):

In [None]:
# Create new environment using gym.make
env = gym.make("MountainCar-v0")

play(gym.make("MountainCar-v0", render_mode="rgb_array"), keys_to_action={
    "a": np.array(0),
    "d": np.array(2),
}, noop=np.array(1))

Hopefully at this point you can see it's not too difficult to get these games up and running.

Now let's walk through how to go about training a network to learn to play a game.

# Training a Network to Play

Following on from the presentation from my glamorous assistant, which introduced the general ideas and concepts of reinforcement learning, we will now go through an example of how one might train a network to play the Cart Pole game.

I'll do my best to make it clear what is going on at each stage, but the overall idea is that you are trying to train a network to approximate a function, Q, which tells you the value of taking a given action from a given state. (Basically: is this move good or nah?) For example, say the pole is tilted to the right in the current state. Should the cart go left or right? It should go right, to attempt to re-align the pole and stop it from falling.

In [None]:
# Imports
import gym
import math
import random
import matplotlib
import matplotlib.pyplot as plt
from collections import namedtuple, deque
from itertools import count

import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F

First, we again set up the Cart Pole environment. `render_mode` is set to `"human"`, so we can watch what goes on as it trains.

A `Transition` is also defined: a named tuple recording a state, the action taken from it, the next state that action led to, and the reward received for that move.

In [None]:
env = gym.make("CartPole-v1", render_mode="human")

# Define a transition from one state to the next
Transition = namedtuple('Transition',
                        ('state', 'action', 'next_state', 'reward'))

In [None]:
# if GPU is to be used
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

### Storing transitions
This is a class to store the transitions that are observed.

Batches are then randomly sampled from it when training the network.

Sampling randomly breaks up the correlation between consecutive transitions, which makes training more stable.

In [None]:
class ReplayMemory(object):

    def __init__(self, capacity):
        self.memory = deque([], maxlen=capacity)

    def push(self, *args):
        """Save a transition"""
        self.memory.append(Transition(*args))

    def sample(self, batch_size):
        return random.sample(self.memory, batch_size)

    def __len__(self):
        return len(self.memory)
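
To see what the memory is doing, here is a small sketch using only the standard library. The tiny capacity and the integer "states" are made up for illustration (the real thing stores tensors), and a bare `deque` stands in for the `ReplayMemory` class. It shows old transitions falling off the end once the capacity is reached, and how a sampled batch can be transposed into one `Transition` of tuples.

```python
import random
from collections import deque, namedtuple

Transition = namedtuple("Transition", ("state", "action", "next_state", "reward"))

# Tiny capacity (an assumption for this demo) so the eviction is easy to see.
memory = deque([], maxlen=3)

# Push five made-up transitions; plain integers stand in for state tensors.
for step in range(5):
    memory.append(Transition(state=step, action=step % 2,
                             next_state=step + 1, reward=1.0))

# Only the three most recent transitions survive.
stored_states = [t.state for t in memory]

# Sample two transitions without replacement, then transpose the batch:
# a list of Transitions becomes one Transition whose fields are tuples.
batch = Transition(*zip(*random.sample(memory, 2)))

print(stored_states)
print(batch.state, batch.reward)
```

The transpose trick is the same one used later when a batch is unpacked for training.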

### Network
A Deep Q Network (DQN), which will try to approximate the Q function.

Currently just a simple fully connected network.

The input size is the number of values in each observation (for Cart Pole these are the cart position, cart velocity, pole angle and pole angular velocity), and the output size matches the size of the action space.

In [None]:
class DQN(nn.Module):

    def __init__(self, n_observations, n_actions):
        super(DQN, self).__init__()
        self.layer1 = nn.Linear(n_observations, 128)
        self.layer2 = nn.Linear(128, 128)
        self.layer3 = nn.Linear(128, n_actions)

    # Called with either one element to determine next action, or a batch
    # during optimization. Returns tensor([[left0exp,right0exp]...]).
    def forward(self, x):
        x = F.relu(self.layer1(x))
        x = F.relu(self.layer2(x))
        return self.layer3(x)

### Stuff for training

A bunch of parameters for training. `EPS_START`, `EPS_END` and `EPS_DECAY` control epsilon, which is the probability that the next action is a completely random one rather than the best network-predicted action. Random actions are taken to explore unseen states and transitions. Earlier on, more random actions are taken. As time goes on, epsilon decreases, and the moves are increasingly chosen based on the network's suggestion.

In [None]:
# BATCH_SIZE is the number of transitions sampled from the replay buffer
# GAMMA is the discount factor as mentioned in the previous section
# EPS_START is the starting value of epsilon
# EPS_END is the final value of epsilon
# EPS_DECAY controls the rate of exponential decay of epsilon, higher means a slower decay
# TAU is the update rate of the target network
# LR is the learning rate of the ``AdamW`` optimizer
BATCH_SIZE = 128
GAMMA = 0.99
EPS_START = 0.9
EPS_END = 0.05
EPS_DECAY = 1000
TAU = 0.005
LR = 1e-4

More stuff to set up for the training loop. Here we get the number of actions and observations. We then create **two** instances of the DQN: a policy net and a target net. Actions are chosen according to the policy net, which is the one actually being trained. The target net is only used to estimate the value of next states when computing the training targets, and it lags the policy net by being 'softly' updated each time. Because the targets then drift slowly rather than shifting with every gradient step, training is more stable.

In [None]:
# Get number of actions from gym action space
n_actions = env.action_space.n
# Get the number of state observations
state, info = env.reset()
n_observations = len(state)

policy_net = DQN(n_observations, n_actions).to(device)
target_net = DQN(n_observations, n_actions).to(device)
target_net.load_state_dict(policy_net.state_dict())

optimizer = optim.AdamW(policy_net.parameters(), lr=LR, amsgrad=True)
memory = ReplayMemory(10000) # define the memory as an instance of the class mentioned earlier

steps_done = 0
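
To get a feel for how much the target net lags, here is a rough sketch of the soft update rule θ′ ← τθ + (1 − τ)θ′ applied to a single number instead of a whole network. Holding the "policy weight" fixed at 1.0 is an assumption made purely for illustration; in real training it moves at every step.

```python
TAU = 0.005  # same value as in the parameter cell above

policy_weight = 1.0  # pretend the policy net's weight stays put
target_weight = 0.0  # the target net starts somewhere else

history = []
for step in range(1000):
    # Soft update: move the target a fraction TAU of the way to the policy.
    target_weight = TAU * policy_weight + (1 - TAU) * target_weight
    history.append(target_weight)

# Number of updates needed to close half of the gap.
updates_to_halfway = next(i + 1 for i, w in enumerate(history) if w >= 0.5)
print(updates_to_halfway, round(history[-1], 4))
```

With this value of `TAU` it takes a little under 140 updates to close half the gap, so the target net is always a smoothed, delayed copy of the policy net.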

Here we define the function that determines how an action is selected. This function uses the epsilon values mentioned and defined a few cells above. A number is randomly sampled between 0 and 1, and the epsilon threshold determines whether the policy net is used, or a random action is taken. Epsilon falls as time (steps_done) increases.

In [None]:
def select_action(state):
    global steps_done
    sample = random.random()
    eps_threshold = EPS_END + (EPS_START - EPS_END) * \
        math.exp(-1. * steps_done / EPS_DECAY)
    steps_done += 1
    if sample > eps_threshold:
        with torch.no_grad():
            # t.max(1) will return the largest column value of each row.
            # second column on max result is index of where max element was
            # found, so we pick action with the larger expected reward.
            return policy_net(state).max(1)[1].view(1, 1)
    else:
        return torch.tensor([[env.action_space.sample()]], device=device, dtype=torch.long)
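
As a quick sanity check on the schedule used in `select_action`, this sketch evaluates the same formula at a few step counts. The helper name `epsilon_at` is made up for this example. Note that `steps_done` counts individual actions, not episodes.

```python
import math

EPS_START, EPS_END, EPS_DECAY = 0.9, 0.05, 1000  # values from the parameter cell

def epsilon_at(step):
    """Probability of taking a random action after `step` actions."""
    return EPS_END + (EPS_START - EPS_END) * math.exp(-1.0 * step / EPS_DECAY)

for step in (0, 1000, 5000):
    print(step, round(epsilon_at(step), 3))
```

Epsilon starts at 0.9, drops to about 0.36 after 1000 actions, and is within a hair of `EPS_END` after 5000.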

In [None]:
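# Plot stuff
# Note: `is_ipython` and `display` are defined in the training cell further
# down, so only call plot_durations after that cell has started running.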
episode_durations = []

def plot_durations(show_result=False):
    plt.figure(1)
    durations_t = torch.tensor(episode_durations, dtype=torch.float)
    if show_result:
        plt.title('Result')
    else:
        plt.clf()
        plt.title('Training...')
    plt.xlabel('Episode')
    plt.ylabel('Duration')
    plt.plot(durations_t.numpy())
    # Take 100 episode averages and plot them too
    if len(durations_t) >= 100:
        means = durations_t.unfold(0, 100, 1).mean(1).view(-1)
        means = torch.cat((torch.zeros(99), means))
        plt.plot(means.numpy())

    plt.pause(0.001)  # pause a bit so that plots are updated
    if is_ipython:
        if not show_result:
            display.display(plt.gcf())
            display.clear_output(wait=True)
        else:
            display.display(plt.gcf())

### Optimise Loop
Here's the crux of it all. Inside this function, the policy net is used to get the state-action values for the actions that were actually taken. The expected state-action values are then calculated by using the target net to assess the value of the next states. The [Huber loss](https://en.wikipedia.org/wiki/Huber_loss) between these two is then calculated.

I appreciate that makes little sense. I also do not really know what's going on yet.

This loop trains on a sample taken randomly from the current memory (all transitions observed so far).
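
Written out, the quantities described above look roughly like this for one sampled transition $(s, a, r, s')$. The symbols $Q_{\text{policy}}$, $Q_{\text{target}}$, $y$ and $\delta$ are my own notation, and the loss assumes PyTorch's default `nn.SmoothL1Loss` settings (threshold of 1, mean over the batch).

$$
y =
\begin{cases}
r & \text{if } s' \text{ is a final state} \\
r + \gamma \max_{a'} Q_{\text{target}}(s', a') & \text{otherwise}
\end{cases}
$$

$$
\delta = Q_{\text{policy}}(s, a) - y
$$

$$
\mathcal{L}(\delta) =
\begin{cases}
\tfrac{1}{2}\delta^{2} & \text{if } |\delta| \le 1 \\
|\delta| - \tfrac{1}{2} & \text{otherwise}
\end{cases}
$$

Here $\gamma$ is `GAMMA`, and the final loss is the average of $\mathcal{L}(\delta)$ over the batch. Small errors are penalised quadratically and large ones only linearly, which keeps the odd wild estimate from dominating an update.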

In [None]:
"""
OPTIMIZE LOOP.
Sample from memory to get a batch and train on this.
Policy net and a target net (for soft update and stability).
"""

def optimize_model():
    if len(memory) < BATCH_SIZE:
        return
    transitions = memory.sample(BATCH_SIZE)
    # Transpose the batch (see https://stackoverflow.com/a/19343/3343043 for
    # detailed explanation). This converts batch-array of Transitions
    # to Transition of batch-arrays.
    batch = Transition(*zip(*transitions))

    # Compute a mask of non-final states and concatenate the batch elements
    # (a final state would've been the one after which simulation ended)
    non_final_mask = torch.tensor(tuple(map(lambda s: s is not None,
                                          batch.next_state)), device=device, dtype=torch.bool)
    non_final_next_states = torch.cat([s for s in batch.next_state
                                                if s is not None])
    state_batch = torch.cat(batch.state)
    action_batch = torch.cat(batch.action)
    reward_batch = torch.cat(batch.reward)

    # Compute Q(s_t, a) - the model computes Q(s_t), then we select the
    # columns of actions taken. These are the actions which would've been taken
    # for each batch state according to policy_net
    state_action_values = policy_net(state_batch).gather(1, action_batch)

    # Compute V(s_{t+1}) for all next states.
    # Expected values of actions for non_final_next_states are computed based
    # on the "older" target_net; selecting their best reward with max(1)[0].
    # This is merged based on the mask, such that we'll have either the expected
    # state value or 0 in case the state was final.
    next_state_values = torch.zeros(BATCH_SIZE, device=device)
    with torch.no_grad():
        next_state_values[non_final_mask] = target_net(non_final_next_states).max(1)[0]
    # Compute the expected Q values
    expected_state_action_values = (next_state_values * GAMMA) + reward_batch

    # Compute Huber loss
    criterion = nn.SmoothL1Loss()
    loss = criterion(state_action_values, expected_state_action_values.unsqueeze(1))

    # Optimize the model
    optimizer.zero_grad()
    loss.backward()
    # In-place gradient clipping
    torch.nn.utils.clip_grad_value_(policy_net.parameters(), 100)
    optimizer.step()
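
If the tensor code is hard to follow, this plain-Python sketch does the same arithmetic for one transition at a time. The function names and all the numbers are made up for illustration, and `huber` assumes a threshold of 1 (the PyTorch default for `nn.SmoothL1Loss`).

```python
GAMMA = 0.99  # discount factor from the parameter cell

def td_target(reward, next_q_values, gamma=GAMMA):
    """Expected Q value; next_q_values is None if the next state is final."""
    if next_q_values is None:
        return reward
    return reward + gamma * max(next_q_values)

def huber(error, delta=1.0):
    """Quadratic for small errors, linear for large ones."""
    abs_error = abs(error)
    if abs_error <= delta:
        return 0.5 * error ** 2
    return delta * (abs_error - 0.5 * delta)

# A mid-episode transition: the target net rates the next state's
# two actions at 2.0 and 3.0, so the best of them (3.0) is used.
target_mid = td_target(1.0, [2.0, 3.0])

# A final transition: there is no next state, so the target is the reward.
target_end = td_target(1.0, None)

# Compare against made-up policy net predictions of 3.5 and 4.0.
loss_small = huber(3.5 - target_mid)  # small error, quadratic branch
loss_large = huber(4.0 - target_end)  # large error, linear branch

print(target_mid, target_end, loss_small, loss_large)
```

The `None` check plays the same role as `non_final_mask` in `optimize_model`: final states contribute a next-state value of zero.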

In [None]:
# Set the number of training episodes based on the hardware available
if torch.cuda.is_available():
    num_episodes = 1000
else:
    num_episodes = 500

### Training
Finally, we can train. Each episode runs until the game ends, and at each step within an episode:

1. We select an action using the `select_action` function.
1. We take that action in the environment, and store the resulting transition in the memory.
1. We do one step of optimisation, which updates the policy net using a random batch from the memory-stored transitions.
1. We then update the target net 'softly'.

In [None]:
"""
TRAIN LOOP
"""
# set up matplotlib
is_ipython = 'inline' in matplotlib.get_backend()
if is_ipython:
    from IPython import display

plt.ion()

for i_episode in range(num_episodes):
    print(f"training {i_episode}")
    # Initialize the environment and get its state
    state, info = env.reset()
    state = torch.tensor(state, dtype=torch.float32, device=device).unsqueeze(0)
    for t in count():
        action = select_action(state)
        observation, reward, terminated, truncated, _ = env.step(action.item())
        reward = torch.tensor([reward], device=device)
        done = terminated or truncated

        if terminated:
            next_state = None
        else:
            next_state = torch.tensor(observation, dtype=torch.float32, device=device).unsqueeze(0)

        # Store the transition in memory
        memory.push(state, action, next_state, reward)

        # Move to the next state
        state = next_state

        # Perform one step of the optimization (on the policy network)
        optimize_model()

        # Soft update of the target network's weights
        # θ′ ← τ θ + (1 −τ )θ′
        target_net_state_dict = target_net.state_dict()
        policy_net_state_dict = policy_net.state_dict()
        for key in policy_net_state_dict:
            target_net_state_dict[key] = policy_net_state_dict[key]*TAU + target_net_state_dict[key]*(1-TAU)
        target_net.load_state_dict(target_net_state_dict)

        if done:
            episode_durations.append(t + 1)
            plot_durations()
            break
print('Complete')
plot_durations(show_result=True)
plt.ioff()
plt.show()

The robot can now play the game. But it still cannot love. It will never love.

I hope you enjoyed The Notebook.

<img src="the_notebook.png"  width=50%>