# Reinforcement Learning
## Loading the environment
I am running the RL algorithms on LunarLander-v3 , which is one of many gym environments. The goal of the game is to safely land a rover on a planet surface. The espisode ends after the rover touches the ground and a reward is given, which is calculated based on the time taken to land, the smoothness of the landing and potential damages to the rover. The action space consists of 4 (one-hot) actions and the observation space of 8 continuous measurements.

In [5]:
! apt-get install swig3.0

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
Suggested packages:
  swig3.0-examples swig3.0-doc
The following NEW packages will be installed:
  swig3.0
0 upgraded, 1 newly installed, 0 to remove and 35 not upgraded.
Need to get 1,109 kB of archives.
After this operation, 5,555 kB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu jammy/universe amd64 swig3.0 amd64 3.0.12-2.2ubuntu1 [1,109 kB]
Fetched 1,109 kB in 1s (1,567 kB/s)
Selecting previously unselected package swig3.0.
(Reading database ... 126111 files and directories currently installed.)
Preparing to unpack .../swig3.0_3.0.12-2.2ubuntu1_amd64.deb ...
Unpacking swig3.0 (3.0.12-2.2ubuntu1) ...
Setting up swig3.0 (3.0.12-2.2ubuntu1) ...
Processing triggers for man-db (2.10.2-1) ...


In [6]:
! ln -s /usr/bin/swig3.0 /usr/bin/swig

In [7]:
! pip install gymnasium[box2d]

Collecting box2d-py==2.3.5 (from gymnasium[box2d])
  Downloading box2d-py-2.3.5.tar.gz (374 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m374.4/374.4 kB[0m [31m18.1 MB/s[0m eta [36m0:00:00[0m
[?25h  Preparing metadata (setup.py) ... [?25l[?25hdone
Collecting swig==4.* (from gymnasium[box2d])
  Downloading swig-4.3.1-py3-none-manylinux_2_12_x86_64.manylinux2010_x86_64.whl.metadata (3.5 kB)
Downloading swig-4.3.1-py3-none-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (1.9 MB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.9/1.9 MB[0m [31m73.3 MB/s[0m eta [36m0:00:00[0m
[?25hBuilding wheels for collected packages: box2d-py
  Building wheel for box2d-py (setup.py) ... [?25l[?25hdone
  Created wheel for box2d-py: filename=box2d_py-2.3.5-cp311-cp311-linux_x86_64.whl size=2312070 sha256=731a48864dad0f720f54267b19821af5f477779d9421ab5da268bd553b02307f
  Stored in directory: /root/.cache/pip/wheels/ab/f1/0c/d56f4a2bdd12bae0a0693ec33f2f0daadb

In [66]:
import gymnasium as gym

env = gym.make("LunarLander-v3")

## Deep Q Learning
In the first part of notebook, I am working with value-based DRL algorithms, namely DQN. I am using a simple feedforward NN with architecture 8-128-64-4 and ReLU for activation.

### Baseline Deep Q Learning
I am starting with a basic Q Learning procedure. Namely, I run the model on every step of the simulation, selecting the action with the highest estimated value (ie. $ \arg\max_{a} Q(s, a) $ ), then transition to the next state on the environemnt $s'$ and optimize using the the MSQE loss function:
$$ \lVert \arg\max_{a'} Q(s', a') - ( R + γ Q(s, a) ) \rVert ^2 $$

I am also incorporating an $\epsilon$-greedy stategy of expoloration, in which case the algorithm a uniformly random action, ignoring the model outputs.


In [44]:
import numpy as np

In [55]:
import torch
import torch.nn as nn
import torch.nn.functional as F

class DQN(nn.Module):
    def __init__(self, state_dim, action_dim):
        super(DQN, self).__init__()
        self.fc1 = nn.Linear(state_dim, 128)
        self.fc2 = nn.Linear(128, 64)
        self.fc3 = nn.Linear(64, action_dim)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)

        return x

In [82]:
def dql_baseline(model, optimizer, num_episodes, gamma, epsilon, device):
    criterion = torch.nn.MSELoss()

    for episode in range(num_episodes):
        observation, info = env.reset()
        observation = torch.tensor(observation, dtype=torch.float32, device=device).unsqueeze(0)  # (1, obs_state_dim)

        action = 0 # start with a dummy action

        total_loss = 0.0
        total_reward = 0.0
        num_runs = 0

        while True:
            next_observation, reward, terminated, truncated, info = env.step(action)

            total_reward += reward

            if terminated or truncated:
                break

            next_observation = torch.tensor(next_observation, dtype=torch.float32, device=device).unsqueeze(0) # (1, obs_state_dim)

            optimizer.zero_grad()

            observations_bundled = torch.cat([observation, next_observation], dim=0)  # (2, obs_state_dim)
            q_values = model(observations_bundled)  # (2, num_actions)

            best_values = q_values.max(dim=1).values  # (prev_value, next_value)

            loss = criterion(best_values[0], reward + gamma * best_values[1].detach())

            total_loss += loss.item()
            num_runs += 1

            loss.backward()
            optimizer.step()

            if np.random.rand() < epsilon:
                action = env.action_space.sample()
            else:
                action = q_values.argmax(dim=1)[1].item()  # highest value on the 2nd prediction (next_value)

        avg_loss = total_loss / num_runs if num_runs > 0 else 0
        avg_reward = total_reward / num_runs if num_runs > 0 else 0

        print(f'Episode {episode}   Avg Loss: {avg_loss:.4f}, Avg Reward: {avg_reward:.2f}')


    env.close()


In [83]:
from datetime import datetime

In [84]:
num_episodes = 300
gamma = 0.99
epsilon = 0.05
lr = 0.001
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

obs_state_dim = env.observation_space.shape[0]
num_actions = env.action_space.n

model = DQN(obs_state_dim, num_actions).to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=lr)


start = datetime.now()

dql_baseline(model, optimizer, num_episodes, gamma, epsilon, device)

end = datetime.now()

print(end - start)

Episode 0   Avg Loss: 36.7307, Avg Reward: -2.36
Episode 1   Avg Loss: 22.8168, Avg Reward: -4.43
Episode 2   Avg Loss: 180.7674, Avg Reward: -4.24
Episode 3   Avg Loss: 283.7177, Avg Reward: -7.18
Episode 4   Avg Loss: 52.9152, Avg Reward: -2.36
Episode 5   Avg Loss: 728.9474, Avg Reward: -9.44
Episode 6   Avg Loss: 67.7815, Avg Reward: -8.02
Episode 7   Avg Loss: 47.8840, Avg Reward: -8.06
Episode 8   Avg Loss: 3.7803, Avg Reward: -9.05
Episode 9   Avg Loss: 14.8400, Avg Reward: -6.98
Episode 10   Avg Loss: 2.8627, Avg Reward: -8.43
Episode 11   Avg Loss: 2.8828, Avg Reward: -7.58
Episode 12   Avg Loss: 1.3779, Avg Reward: -6.98
Episode 13   Avg Loss: 4.8145, Avg Reward: -7.95
Episode 14   Avg Loss: 3.8329, Avg Reward: -6.73
Episode 15   Avg Loss: 186.9174, Avg Reward: -6.35
Episode 16   Avg Loss: 10.0212, Avg Reward: -8.91
Episode 17   Avg Loss: 3.4724, Avg Reward: -8.38
Episode 18   Avg Loss: 2.2988, Avg Reward: -8.39
Episode 19   Avg Loss: 5.5683, Avg Reward: -9.95
Episode 20   Av

## Improved DQL algoirhtm
With the baseline DQL algorithm, we don't get good results, the best we can do is having a mean reward close to 0, which is by far from a good performance in this simple game. I am adding 2 things to the above algorithm:

### Experience replay
One way to improve the performance of the agent is to use what we call an experience replay. This is essentially a memory bank of past experiences that we can collect while running the agent. During the training stage, we can construct our training set by sampling (using simple uniform sampling across all available experiences in the bank).

Essentially, in our training procedure, we can now have epochs, each of which consists of 2 parts. First, we play the game with the current state (weights) of the agent and for each step of the simulation, add the experience to the experience replay. An experience consists of a simple training sample, namely the previous state $s$, action $a$, reward $R$ and next state $s'$, esentially a tuple $(a, s, R, s')$. These are enought to feed to the model and get the Q values for the criterion.

2 remarks:
* First, we sample in order to avoid having the training in sequence and allowing the model to see training data not in sequential order. This is achieved by sampling.
* Additionally, it is very important to avoid data drift and hence lower quality predictions. Therefore, it is key to retain older data from the previous epochs and, when adding new data, to remove the oldest ones (in the order they were added to the structure). Essentially, this makes our experience replay a circular buffer that can be implemented using a (FIFO) queue, in a similar fashion to how the queue on MoCo works.

Esentially, a queue allows us to add elements to the front (the new data) and eject elements from them back (the old data) in $O(1)$.

Unfortunately, the deque implementaion in Python doesn't support access to intermediate elements. Also, it seems that a circular buffer wouldn't work, as it has a fixed size and we need the flexibility to support a varianble amount of experiences (like say in case the last epoch produces more samples than the threshold we set).

The best way to do this is to implement a queue using 2 stacks. This is a well know method (see here: https://www.geeksforgeeks.org/queue-using-stacks). A stack can be modeled as a list (which allows for efficiently adding/removing elements at its back). With this setup, we can enable sampling in $O(1)$

Let me implement the queue below:

In [48]:
# Note: the code is AI generated for the most part, with my own modifications to make it work properly

class Stack:
    def __init__(self):
        self.stack = []

    def get(self, index):
        if index >= len(self.stack):
            print('Warning: index out-of-bounds, functionality might not work as expected!')

        return self.stack[index]

    def push(self, element):
        self.stack.append(element)

    def pop(self):
        if not self.is_empty():
            return self.stack.pop()

    def is_empty(self):
        return len(self.stack) == 0

    def __len__(self):
        return len(self.stack)

class Queue:
    def __init__(self):
        self.stack1 = Stack()
        self.stack2 = Stack()

    def get(self, index):
        if index >= self.__len__():
            print('Warning: index out-of-bounds, functionality might not work as expected!')

        if index < len(self.stack1):
            return self.stack1.get(index)
        else:
            return self.stack2.get(index - len(self.stack1))

    def push(self, element):
        self.stack1.push(element)

    def pop(self):
        if self.stack2.is_empty():
            while not self.stack1.is_empty():
                self.stack2.push(self.stack1.pop())

        if not self.stack2.is_empty(): # empty means no elements in queue
            return self.stack2.pop()

    def is_empty(self):
        return self.stack1.is_empty() and self.stack2.is_empty()

    def __len__(self):
        return len(self.stack1) + len(self.stack2)



**Note:** the order which we do the sampling doesn't matter, as what really matters is to be able to sample every element in the queue unfiformly.

Moving on to the experience replay:

In [86]:
from collections import deque

# Structure for a single experience
class Experience:
    def __init__(self, state, action, reward, next_state, done=False):
        self.state = state
        self.action = action
        self.reward = reward
        self.next_state = next_state
        self.done = done

class ExperienceReplay:
    def __init__(self):
        self.memory = Queue()

    def add(self, experience):
        self.memory.push(experience)

    def remove(self):
        return self.memory.pop()

    def sample(self):
        memory_len = len(self.memory)

        random_index = np.random.randint(memory_len) # range: [0, memory_len - 1]
        return self.memory.get(random_index)

    def __len__(self):
        return len(self.memory)

In [87]:
# Some testing

exp_replay = ExperienceReplay()
exp_replay.add(Experience(1, 1, 1, 1))
exp_replay.add(Experience(2, 2, 2, 2))
exp_replay.add(Experience(3, 3, 3, 3))

print(f'len(exp_replay)={len(exp_replay)}')

print([ exp_replay.sample().state for _ in range(10) ]) # values between 1 and 3

print(exp_replay.memory.get(0).state) # 1

exp_replay.remove()
print(exp_replay.memory.get(0).state) # 3
print(exp_replay.memory.get(1).state) # 2

exp_replay.remove()
print(exp_replay.memory.get(0).state) # 3

len(exp_replay)=3
[1, 1, 2, 3, 3, 1, 3, 1, 2, 1]
1
3
2
3


## Offline/Online networks
One more thing we can do is, for each epoch, to keep have an online and an offline network. The offline network (target) estimates the Q value of the current state and the online network estimates the Q value of the next state.

Usually, after $C$ steps, we update the parameters of the offline network with the online one and we proceed as usual. In my implementaion, I am doing this once per epoch (every epoch would consist of about 5k training samples).

## Implementation details
Initially, I fill the experience replay with the 1st epoch experiences. On subsequent epochs, I eject about 1/3rd of the stored experiences and refill it with tne new ones.

Regarding training, the new loss function is as follows, given quadruplet $(s, a, r, s')$ and the online $Q_{on}$ and offline $Q_{off}$ networks:
$$ \lVert (r + \arg\max_{a'} Q_{off}(s', a')) - Q_{on}(s, a) \rVert^2 $$

For simplicity, I am running a batch gradient descent across the whole dataset at once.

To compute the loos we do the following:

1. Use the offline network to find the best Q value for the next state, by passing $s'$ to it. ALso, make sure it is frozen or detached, so that it doesn't get updated at all during backpropagation.
2. Use the online/current network to compute the Q value for the current state $s$, by taking the action $a$ the agent took during play. Basically, in PyTorch, we need the $a$-th (0-based) index of $Q_{on}(s)$.
3. Pass these values to the criterion and apply backprop. Make sure the detach the old model predictions.

In [108]:
def clean_er(experience_replay, num_experiences_to_remove):
    for _ in range(num_experiences_to_remove):
        experience_replay.remove()

def update_er(experience_replay, new_experiences):
    for exp in new_experiences:
        experience_replay.add(exp)

def play(model, env, epsilon, device, total_new_experiences):
    model.eval()
    experiences = []

    with torch.no_grad():
        while len(experiences) < total_new_experiences: # episode
            observation, info = env.reset()
            observation = torch.tensor(observation, dtype=torch.float32, device=device) # (obs_state_dim)

            action = 0

            while True:
                next_observation, reward, terminated, truncated, info = env.step(action)
                next_observation = torch.tensor(next_observation, dtype=torch.float32, device=device) # (obs_state_dim)

                done = terminated or truncated # new addition based on the text cell below

                #if done:
                    #break

                # (s, a, r, s')
                experience = Experience(observation, action, reward, next_observation, done)
                experiences.append(experience)

                if done:
                    break

                q_values = model(next_observation)  # (num_actions)

                # update the action

                if np.random.rand() < epsilon: # epsilon-greed
                    action = env.action_space.sample()
                else:
                    action = q_values.argmax().item()  # highest value on the 2nd prediction (next_value)

    return experiences

def train(offline_model, online_model, optimizer, experience_replay, gamma, device):
    criterion = torch.nn.MSELoss()

    online_model.train()

    train_size = 5 * len(experience_replay) # 5 times larger dataset than experiences

    # construct the dataset
    dataset = []

    for _ in range(train_size):
        dataset.append(experience_replay.sample())

    # For simplicity, I will use batch gradient descent (about 5k samples, 8 inputs)
    curr_states = torch.stack([ exp.state for exp in dataset ]).to(device) # (train_size, obs_state_dim)
    next_states = torch.stack([exp.next_state for exp in dataset]).to(device) # (train_size, obs_state_dim)
    actions = torch.tensor([ exp.action for exp in dataset ], dtype=torch.int64, device=device) # (train_size)
    rewards = torch.tensor([ exp.reward for exp in dataset ], dtype=torch.float32, device=device) # (train_size)
    is_done = torch.tensor([ exp.done for exp in dataset ], dtype=torch.float32, device=device) # (train_size), new addition

    # Compute the loss
    optimizer.zero_grad()

    # Find the best next state Q values
    next_state_q_values = offline_model(next_states) # (train_size, num_actions)
    best_next_state_q_value = next_state_q_values.max(dim=1).values # (train_size)

    # Find the curr state Q value
    curr_state_q_values = online_model(curr_states) # (train_size, num_actions)

    # The below function finds the index specified by the actions array along dim=1 (so we get the q_value of the corresponding action idx)
    curr_state_q_value = curr_state_q_values.gather(1, actions.unsqueeze(1)).squeeze(1) # (train_size)

    # Compute the loss & backprop
    loss = criterion(curr_state_q_value, rewards + gamma * best_next_state_q_value.detach()) # detach!!

    loss.backward()
    optimizer.step()

    # Return statistics
    avg_loss = loss.item()
    avg_reward = torch.mean(rewards).item()

    return avg_loss, avg_reward

def create_offline_model(online_model):
    offline_model = DQN(online_model.fc1.in_features, online_model.fc3.out_features)
    offline_model.load_state_dict(online_model.state_dict())

    return offline_model


def dql_optimized(env, online_model, optimizer, num_epochs, gamma, epsilon, device, num_experiences):
    criterion = torch.nn.MSELoss()
    experience_replay = ExperienceReplay()

    offline_model = create_offline_model(online_model)

    for epoch in range(num_epochs):
        if epoch > 0:
            # Remove 1/3rd of the experinces from the replay
            clean_er(experience_replay, num_experiences // 3)

        total_new_experiences = num_experiences - len(experience_replay)

        # Play phrase
        new_experiences = play(online_model, env, epsilon, device, total_new_experiences)
        update_er(experience_replay, new_experiences)

        # Train phrase
        avg_loss, avg_reward = train(offline_model, online_model, optimizer, experience_replay, gamma, device)

        # Update offline_model
        offline_model.load_state_dict(online_model.state_dict())

        print(f'Epoch {epoch+1}   Avg Loss: {avg_loss:.4f}, Avg Reward: {avg_reward:.2f}')


    env.close()


In [110]:
num_epochs = 40
num_experiences = 1000
gamma = 0.99
epsilon = 0.05
lr = 0.001
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

obs_state_dim = env.observation_space.shape[0]
num_actions = env.action_space.n

model = DQN(obs_state_dim, num_actions).to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=lr)


start = datetime.now()

dql_optimized(env, model, optimizer, num_epochs, gamma, epsilon, device, num_experiences)

end = datetime.now()

print(end - start)

Epoch 1   Avg Loss: 243.4455, Avg Reward: -8.53
Epoch 2   Avg Loss: 232.3337, Avg Reward: -9.27
Epoch 3   Avg Loss: 237.4430, Avg Reward: -9.21
Epoch 4   Avg Loss: 212.2434, Avg Reward: -8.58
Epoch 5   Avg Loss: 258.0019, Avg Reward: -8.48
Epoch 6   Avg Loss: 235.0309, Avg Reward: -7.89
Epoch 7   Avg Loss: 213.0520, Avg Reward: -7.47
Epoch 8   Avg Loss: 192.1595, Avg Reward: -7.01
Epoch 9   Avg Loss: 208.6154, Avg Reward: -7.25
Epoch 10   Avg Loss: 218.3159, Avg Reward: -7.31
Epoch 11   Avg Loss: 184.9039, Avg Reward: -7.16
Epoch 12   Avg Loss: 150.4380, Avg Reward: -6.12
Epoch 13   Avg Loss: 158.2692, Avg Reward: -5.87
Epoch 14   Avg Loss: 95.5065, Avg Reward: -4.80
Epoch 15   Avg Loss: 92.2774, Avg Reward: -4.71
Epoch 16   Avg Loss: 83.6665, Avg Reward: -4.51
Epoch 17   Avg Loss: 114.1888, Avg Reward: -5.04
Epoch 18   Avg Loss: 93.5967, Avg Reward: -4.19
Epoch 19   Avg Loss: 97.6360, Avg Reward: -4.85
Epoch 20   Avg Loss: 69.2206, Avg Reward: -3.52
Epoch 21   Avg Loss: 82.5441, Avg R

**Note:** I am still getting bad rewards. After doing some resarch, it seems that we should keep proper track of the final state. Also, include the done flag in the experience replay and only retain the reward $r$ in that case (so no consideration of the next state Q-value).

I am adapting the code (so it should be fresh), but the above text cells remain the same.


Trying some more stuff (like paying with decaying epsilon, a better model, etc.) and also grid search on the hyperparameters:

In [124]:
def dql_optimized_decay_epsilon(env, online_model, optimizer, num_epochs, gamma, epsilon_init, decay_epsilon, device, num_experiences, supress_outputs):
    criterion = torch.nn.MSELoss()
    experience_replay = ExperienceReplay()

    offline_model = create_offline_model(online_model)

    epsilon = epsilon_init

    for epoch in range(num_epochs):
        if epoch > 0:
            # Remove 1/3rd of the experinces from the replay
            clean_er(experience_replay, num_experiences // 3)

        total_new_experiences = num_experiences - len(experience_replay)

        # Play phrase
        new_experiences = play(online_model, env, epsilon, device, total_new_experiences)
        update_er(experience_replay, new_experiences)

        # Train phrase
        avg_loss, avg_reward = train(offline_model, online_model, optimizer, experience_replay, gamma, device)

        # Update offline_model
        offline_model.load_state_dict(online_model.state_dict())

        if not supress_outputs or epoch % 10 == 9:
            print(f'Epoch {epoch+1}   Avg Loss: {avg_loss:.4f}, Avg Reward: {avg_reward:.2f}')

        epsilon = max(0.01, epsilon * decay_epsilon)


    env.close()

In [125]:
num_epochs = 30
num_experiences = 2000
gamma = 0.99
epsilon_init = 0.7
decay_epsilon = 0.9
lr = 0.001
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

obs_state_dim = env.observation_space.shape[0]
num_actions = env.action_space.n

model = DQN(obs_state_dim, num_actions).to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=lr)

start = datetime.now()

for num_experiences in [ 500, 1000, 2000, 5000 ]:
    for epsilon_init in [ .7, .8, .9]:
        for decay_epsilon in [ .9, .8, .7 ]:
            for lr in [ .0005, .001, .005 ]:
                print(f'num_experiences={num_experiences}, epsilon_init={epsilon_init}, decay_epsilon={decay_epsilon}, lr={lr}:')

                model = DQN(obs_state_dim, num_actions).to(device)
                optimizer = torch.optim.Adam(model.parameters(), lr=lr)

                dql_optimized_decay_epsilon(env, model, optimizer, num_epochs, gamma, epsilon_init, decay_epsilon, device, num_experiences, True)


end = datetime.now()

print(end - start)

num_experiences=500, epsilon_init=0.7, decay_epsilon=0.9, lr=0.0005:
Epoch 10   Avg Loss: 177.6367, Avg Reward: -3.49
Epoch 20   Avg Loss: 172.3517, Avg Reward: -3.20
Epoch 30   Avg Loss: 134.8248, Avg Reward: -1.77
num_experiences=500, epsilon_init=0.7, decay_epsilon=0.9, lr=0.001:
Epoch 10   Avg Loss: 196.5792, Avg Reward: -6.52
Epoch 20   Avg Loss: 274.9839, Avg Reward: -4.45
Epoch 30   Avg Loss: 223.9106, Avg Reward: -4.24
num_experiences=500, epsilon_init=0.7, decay_epsilon=0.9, lr=0.005:
Epoch 10   Avg Loss: 188.6595, Avg Reward: -2.27
Epoch 20   Avg Loss: 174.2901, Avg Reward: -4.32
Epoch 30   Avg Loss: 243.0284, Avg Reward: -5.66
num_experiences=500, epsilon_init=0.7, decay_epsilon=0.8, lr=0.0005:
Epoch 10   Avg Loss: 182.1112, Avg Reward: -4.24
Epoch 20   Avg Loss: 176.6966, Avg Reward: -5.66
Epoch 30   Avg Loss: 84.7424, Avg Reward: -3.64
num_experiences=500, epsilon_init=0.7, decay_epsilon=0.8, lr=0.001:
Epoch 10   Avg Loss: 177.9313, Avg Reward: -3.71
Epoch 20   Avg Loss: 1

'\nstart = datetime.now()\n\ndql_optimized_decay_espsilon(env, model, optimizer, num_epochs, gamma, epsilon_init, decay_epsilon, device, num_experiences)\n\nend = datetime.now()\n\nprint(end - start)\n'

In [126]:
# Best Runs:

'''
num_experiences=2000, epsilon_init=0.7, decay_epsilon=0.7, lr=0.0005:
Epoch 10   Avg Loss: 101.0632, Avg Reward: -1.64
Epoch 20   Avg Loss: 141.5659, Avg Reward: -2.62
Epoch 30   Avg Loss: 121.9993, Avg Reward: -0.92
'''

'''
num_experiences=1000, epsilon_init=0.7, decay_epsilon=0.8, lr=0.001:
Epoch 10   Avg Loss: 101.6116, Avg Reward: -1.42
Epoch 20   Avg Loss: 52.0279, Avg Reward: -0.80
Epoch 30   Avg Loss: 93.6158, Avg Reward: -1.87
'''

## Double DQL
Moving on to the double Deep Q-Learning approach. It is similar to the offline/online approach, with the key difference being that the online network select the next action to take (ie. finds the action $a'$ that yields the best Q-value on the next state $s'$) and the offline network is used to evaluate the next state based solely on this selected action. And we also use the online network to find the Q-value of the current state $s$.

Here is the loss function for double DQL:
$$ \lVert (r + Q_{off}(s', a')) - Q_{on}(s, a) \rVert^2 $$

where $$ a' = \arg\max_{a'} Q_{on}(s', a') $$

For comparison this is the loss function for the online/offline approach:
$$ \lVert (r + \arg\max_{a'} Q_{off}(s', a')) - Q_{on}(s, a) \rVert^2 $$

Notice that the only differece between the 2 is that we use the online network to select the best action for the next state instead of using the offline for both this and evaluation like the previous approach.

In [127]:
def train_dql(offline_model, online_model, optimizer, experience_replay, gamma, device):
    criterion = torch.nn.MSELoss()

    online_model.train()

    train_size = 5 * len(experience_replay) # 5 times larger dataset than experiences

    # construct the dataset
    dataset = []

    for _ in range(train_size):
        dataset.append(experience_replay.sample())

    # For simplicity, I will use batch gradient descent (about 5k samples, 8 inputs)
    curr_states = torch.stack([ exp.state for exp in dataset ]).to(device) # (train_size, obs_state_dim)
    next_states = torch.stack([exp.next_state for exp in dataset]).to(device) # (train_size, obs_state_dim)
    actions = torch.tensor([ exp.action for exp in dataset ], dtype=torch.int64, device=device) # (train_size)
    rewards = torch.tensor([ exp.reward for exp in dataset ], dtype=torch.float32, device=device) # (train_size)
    is_done = torch.tensor([ exp.done for exp in dataset ], dtype=torch.float32, device=device) # (train_size), new addition

    # Compute the loss
    optimizer.zero_grad()

    ### START Different with the regural offline/online training ###

    # Find the best next state Q values
    next_state_q_values = online_model(next_states) # (train_size, num_actions)

    a_prime = next_state_q_values.argmax(dim=1) # (train_size), the next state selected actions

    # (same trick as below to get the right indices)
    next_action_q_value = offline_model(next_states).gather(1, a_prime.unsqueeze(1)).squeeze(1) # (train_size)

    ### END Different with the regural offline/online training ###

    # Find the curr state Q value
    curr_state_q_values = online_model(curr_states) # (train_size, num_actions)

    # The below function finds the index specified by the actions array along dim=1 (so we get the q_value of the corresponding action idx)
    curr_state_q_value = curr_state_q_values.gather(1, actions.unsqueeze(1)).squeeze(1) # (train_size)

    # Compute the loss & backprop
    loss = criterion(curr_state_q_value, rewards + gamma * next_action_q_value.detach()) # detach!!

    loss.backward()
    optimizer.step()

    # Return statistics
    avg_loss = loss.item()
    avg_reward = torch.mean(rewards).item()

    return avg_loss, avg_reward

def double_dql(env, online_model, optimizer, num_epochs, gamma, epsilon, device, num_experiences):
    criterion = torch.nn.MSELoss()
    experience_replay = ExperienceReplay()

    offline_model = create_offline_model(online_model)

    for epoch in range(num_epochs):
        if epoch > 0:
            # Remove 1/3rd of the experinces from the replay
            clean_er(experience_replay, num_experiences // 3)

        total_new_experiences = num_experiences - len(experience_replay)

        # Play phrase
        new_experiences = play(online_model, env, epsilon, device, total_new_experiences)
        update_er(experience_replay, new_experiences)

        # Train phrase
        # !! the only difference in this line !!
        avg_loss, avg_reward = train_dql(offline_model, online_model, optimizer, experience_replay, gamma, device)

        # Update offline_model
        offline_model.load_state_dict(online_model.state_dict())

        print(f'Epoch {epoch+1}   Avg Loss: {avg_loss:.4f}, Avg Reward: {avg_reward:.2f}')


    env.close()

In [128]:
num_epochs = 40
num_experiences = 1000
gamma = 0.99
epsilon = 0.05
lr = 0.001
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

obs_state_dim = env.observation_space.shape[0]
num_actions = env.action_space.n

model = DQN(obs_state_dim, num_actions).to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=lr)


start = datetime.now()

double_dql(env, model, optimizer, num_epochs, gamma, epsilon, device, num_experiences)

end = datetime.now()

print(end - start)

Epoch 1   Avg Loss: 162.2543, Avg Reward: -2.02
Epoch 2   Avg Loss: 173.9440, Avg Reward: -1.86
Epoch 3   Avg Loss: 170.6209, Avg Reward: -2.61
Epoch 4   Avg Loss: 157.2971, Avg Reward: -3.08
Epoch 5   Avg Loss: 157.1756, Avg Reward: -4.89
Epoch 6   Avg Loss: 123.8236, Avg Reward: -4.82
Epoch 7   Avg Loss: 203.3733, Avg Reward: -6.47
Epoch 8   Avg Loss: 244.4152, Avg Reward: -6.80
Epoch 9   Avg Loss: 253.5333, Avg Reward: -7.57
Epoch 10   Avg Loss: 228.4135, Avg Reward: -7.58
Epoch 11   Avg Loss: 229.0850, Avg Reward: -6.93
Epoch 12   Avg Loss: 173.1917, Avg Reward: -4.98
Epoch 13   Avg Loss: 145.6154, Avg Reward: -3.18
Epoch 14   Avg Loss: 139.6315, Avg Reward: -3.17
Epoch 15   Avg Loss: 212.2620, Avg Reward: -5.27
Epoch 16   Avg Loss: 209.5178, Avg Reward: -6.80
Epoch 17   Avg Loss: 245.0399, Avg Reward: -7.21
Epoch 18   Avg Loss: 233.7222, Avg Reward: -6.81
Epoch 19   Avg Loss: 225.4056, Avg Reward: -6.95
Epoch 20   Avg Loss: 197.8004, Avg Reward: -5.70
Epoch 21   Avg Loss: 187.9504

## Policy-based Deep RL
On the 2nd part of the notebook, I am diving into some policy-based RL algoithms.

The idea behind policy-based RL algorithms is to have an agent (in the form of a deep neural network) making decisions on the policy function. Concretely, it parameterizes the stochastic policy function $\pi_{θ}$, which outputs a probabilistic distribution on the actions space given a state $s$:

$$ \pi_{\theta}(a_0 \vert s_0) = pr(a = a_0 \vert s = s_0) $$

### Policy gradient
The fundmental policy-based method is policy gradient. The objective of policy gradient is to maximize the expected reward if we start from an initial state $s_0$:

$$ J(\theta) = \mathbb{E}_{\tau}[R] = \mathbb{E}_{\tau}[r_0 + \gamma r_1 + \gamma^2r_2 + \cdots \vert s_0, \pi_{\theta}] $$

Essentially, what this means is that we need to find such a policy $\pi_{\theta}$, such that, among all possible trajectories (paths) $\tau = (s_0, a_0, r_0, s_1, \cdots)$ that can possible occur with any probability, the expected reward we get is maximized. Essentially, we want to maximize our expected reward starting from the initial state $s_0$.

Policy gradient is about optimizing this objective. This can be done by gradient ascent, that is we do gradient descent but we move on the opposite side, in order to maximize our objective (instead of say minimizing a loss function). However, with an easy trick (trying to minimize $-J(\theta)$), we can use our familar gradient descent.

Okay, now let's try to compute the gradient:
$$ \nabla_{\theta} J(\theta) = \nabla_{\theta} \mathbb{E}_{\tau}[R] $$.

Having the gradient outside of the expectation doesn't help much. We should try to get it inside, as it will be very helpful with many practical policy-based algorithms.

One way to do this is using the log trick. We have
$$
\nabla_{\theta} lnf(\theta) = \frac {\nabla_{\theta} f(\theta)} {f(\theta)} ⟺ \nabla_{\theta} f(\theta) = f(\theta) \nabla_{\theta} lnf(\theta)
\tag{1}
$$

We can expand our expectation (integration or summation, depending if we have a continuous or discrete state/action space):
$$ \nabla_{\theta} \mathbb{E}_{\tau}[R] =
\nabla_{\theta} \sum_{\tau' \in \tau} p(\tau' | s_0, \pi(\theta)) R =
\sum_{\tau' \in \tau} (\nabla_{\theta} p(\tau')) \cdot R \overset{(1)}{=}
\sum_{\tau' \in \tau} p(\tau') \cdot (\nabla_{\theta} ln p(\tau') \cdot R ) = \mathbb{E}_{\tau}[\nabla_{\theta} lnp(\tau') \cdot R]
\tag{2}
$$

Note that this works both for integrals and summations and we can treat R as a constant in terms or $\theta$, so can factor it out of the gradient.

Okay, now let's try to compute the probability of a trajectory $\tau'$ to occur given our policy and the initial state. $\tau'$ can be decomposed to a path of the form $\tau' = (s_0, a_0, r_0, s_1, a_1, r_1, \cdots)$, essentially a sequence of a feeback loop of a state fed to the agent, the action taken by the agent, the reward given the action and the state and transition to a new state. One thing to note here is that we assume that the game respects the markovian property, that is both the enviornment and the agent operate based on the current state $s_{t}$ and do not consider the prior ones.

This allows us to compute the probability easily:
$$ p(\tau') = \pi_\theta (a_0 \vert s_0) p(s_1 | a_0, s_0) \pi_\theta (a_1 \vert s_1) p(s_2 | a_1, s_1) \cdots = \prod_{i=0} \pi_\theta (a_i \vert s_i) p(s_{i+1} | a_i, s_i)  $$

The neat thing is that, with the logarithm we introduced earlier, we can convert this to a sum:
$$ \nabla_{\theta} lnp(\tau') = \sum_{i=0} \nabla_{\theta} ln \pi_\theta (a_i \vert s_i) + \sum_{i=0} \nabla_{\theta} ln p(s_{i+1} | a_i, s_i) \tag{3} $$

Now, the right summation depends only on the environment and is constant in terms of $\theta$, so we can disregard it completely. Applying $(3)$ to $(2)$, we get:

$$ J(\theta) = \mathbb{E}_{\tau}[( \sum_{i=0} \nabla_{\theta} ln \pi_\theta (a_i \vert s_i) ) \cdot R] \tag{4} $$

$(4)$ is the acual formula used for policy gradient.

### REINFORCE
One algorithm that uses the results of policy gradient is REINFORCE, which, essentially, approximates the reward expectation using Monte Carlo. That is, given an agent (neural network that spits out action probabilities given an input state) $\pi(\theta)$, plays the game, gathering samples (similar to the PLAY part of the DQN) and uses those (and their associated rewards) for training.

One trick to make computations easier is to use the "casuality trick", that is, we are safe, for each step $i$ to multiply with the rewards starting from this state and not the whole $R$, which could make our training more stable. That is, our loss function becomes:
$$ L(\theta) = -\frac {1}{N} \sum_{k=1}^{N} \sum_{i=0} (ln \pi_\theta (a_i \vert s_i) R_i) \tag{5} $$

Some remarks for $(5)$:
* $R_i$ is still a constant, but this time is the reward starting from step $i$: $$ R_i = \sum_{t=i} \gamma_{t-i} r_{t} $$
* By using $(5)$ as loss function in PyTorch, essntially, during the backwards process, it actually optimizes $(4)$. I indroduce the negative sign at the beggining to transform this to a gradient descent optimization problem.

To sum up, we do something similar to the online/offline DQL. We can have multiple epochs. On each epoch, we play the game $N$ times with the model (agent), which doesn't update for the duration of the epoch. On each step, we sample an action from its policy output, as well as its probability $p(a_i)$. We can have our model return logits that we can pass through a softmax layer. Then, we compute $lnp(a_i)$ and store it.

When we are done, we compute $R_t$ backwards, ie $ R_t = r_t + \gamma R_{t+1} $ and multiply it with $lnp(a_i)$. Summing all these terms, we get the loss function for one run. Averaging thse values across all runs, we get our final loss function $(5)$. We can then return gradients and update our model.

On a side note, in our case, we can do batch gradient descent. However, if we had a complex game with a lot of data, we could use mini-batch GD with say batch size of 256. Each batch can be an independent "epoch" and average across the current batch.

Let me implement reinforce below:


In [144]:
import torch
import torch.nn as nn
import torch.nn.functional as F

class PolicyAgent(nn.Module): # policy
    def __init__(self, state_dim, action_dim):
        super(PolicyAgent, self).__init__()
        self.fc1 = nn.Linear(state_dim, 128)
        self.fc2 = nn.Linear(128, 64)
        self.fc3 = nn.Linear(64, action_dim)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x) # logits

        return F.softmax(x, dim=-1)

In [147]:
def compute_R_t(rewards, gamma):
    R_t = len(rewards) * [0] # no need for backprop (is a constant)

    R_t[-1] = rewards[-1]

    for i in range(len(rewards) - 2, -1, -1):
        R_t[i] = rewards[i] + gamma * (R_t[i + 1])

    return R_t

def sample_action(model, observation): # sample an action from the policy distribution for state
    action_probs = model(observation) # (action_space)

    action_dist = torch.distributions.Categorical(action_probs)

    action = action_dist.sample()
    action_prob_log = action_dist.log_prob(action)

    return action.item(), action_prob_log

def reinforce(env, model, optimizer, num_epochs, runs_per_epoch, gamma, device):
    for epoch in range(num_epochs):
        outputs = []
        mean_rewards = [] # for logging

        # PLAY
        for _ in range(runs_per_epoch):
            rewards = []
            action_prob_logs = []

            model.eval()

            observation, info = env.reset()
            observation = torch.tensor(observation, dtype=torch.float32, device=device) # (obs_state_dim)

            action, action_prob_log = sample_action(model, observation) # seed action
            action_prob_logs.append(action_prob_log)

            while True:
                observation, reward, terminated, truncated, info = env.step(action)

                rewards.append(reward)

                if terminated or truncated:
                    break

                observation = torch.tensor(observation, dtype=torch.float32, device=device) # (obs_state_dim)

                action, action_prob_log = sample_action(model, observation) # careful! not the max!
                action_prob_logs.append(action_prob_log)

            mean_rewards.append(np.mean(rewards))
            R_t = compute_R_t(rewards, gamma)

            R_t = torch.tensor(R_t, dtype=torch.float32, device=device)
            action_prob_logs = torch.stack(action_prob_logs).to(device)

            curr_loss_component = -torch.sum(action_prob_logs * R_t)
            outputs.append(curr_loss_component)

        # TRAIN
        model.train()

        outputs = torch.stack(outputs).to(device)

        optimizer.zero_grad()

        loss = torch.mean(outputs)
        loss.backward()

        optimizer.step()

        print(f'Epoch {epoch+1}   Avg Loss: {loss.item():.4f}, Avg Reward: {np.mean(mean_rewards):.2f}')

    env.close()

In [150]:
num_epochs = 500
runs_per_epoch = 50
gamma = 0.99
lr = 0.001
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

obs_state_dim = env.observation_space.shape[0]
num_actions = env.action_space.n

model = PolicyAgent(obs_state_dim, num_actions).to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=lr)

start = datetime.now()

reinforce(env, model, optimizer, num_epochs, runs_per_epoch, gamma, device)

end = datetime.now()

print(end - start)

Epoch 1   Avg Loss: -11750.0098, Avg Reward: -1.88
Epoch 2   Avg Loss: -12404.7490, Avg Reward: -1.98
Epoch 3   Avg Loss: -12305.9365, Avg Reward: -2.12
Epoch 4   Avg Loss: -10902.0840, Avg Reward: -1.85
Epoch 5   Avg Loss: -10650.5029, Avg Reward: -1.91
Epoch 6   Avg Loss: -12607.1328, Avg Reward: -2.15
Epoch 7   Avg Loss: -10442.2285, Avg Reward: -1.80
Epoch 8   Avg Loss: -12437.5527, Avg Reward: -2.06
Epoch 9   Avg Loss: -12936.0215, Avg Reward: -2.07
Epoch 10   Avg Loss: -9590.5879, Avg Reward: -1.63
Epoch 11   Avg Loss: -11193.7539, Avg Reward: -1.78
Epoch 12   Avg Loss: -12170.5967, Avg Reward: -2.02
Epoch 13   Avg Loss: -13207.8389, Avg Reward: -2.03
Epoch 14   Avg Loss: -10097.1162, Avg Reward: -1.72
Epoch 15   Avg Loss: -9052.3564, Avg Reward: -1.49
Epoch 16   Avg Loss: -11505.5078, Avg Reward: -1.65
Epoch 17   Avg Loss: -11414.4688, Avg Reward: -1.75
Epoch 18   Avg Loss: -11060.4004, Avg Reward: -1.77
Epoch 19   Avg Loss: -10181.6338, Avg Reward: -1.74
Epoch 20   Avg Loss: -1

KeyboardInterrupt: 

In [None]:
# 1hr 6min execution above

### Actor-Critic
REINFORCE baseline is a more advanced variation of REINFORCE, which allows us to reduce the variation introduced by the $R_i$ future reward at step $i$, as this value can vary significantly across trajectories. A baseline can be any function $B(s)$ that accepts as input the current state $s$ and returns a number. The policy gradient can then be written as:
$$ J(\theta) = \mathbb{E}_{\tau}[ \sum_{i=0} (\nabla_{\theta} ln \pi_\theta (a_i \vert s_i)  \cdot (R_i - B(s_i)) )] $$

The idea behind this is that, if we can find a good enough baseline $B$, we can reduce the variance of different trajectories and have smooth training. However, like with REINFORCE, we have to play the whole epoch before returning gradients and updating our model, which is not always the best.

Here comes the Actor-Critic (AC) framework. We can replace this $R_i - B(s_i)$ term with a Q-value estimation for the current state $s_i$ and action $a_i$. This can take the form of a 2nd agent, the critic, which work in a pretty similar way to a DQN. To summarize, for the plain AC, we get:
$$ J(\theta) = \mathbb{E}_{\tau}[ \sum_{i=0} (\nabla_{\theta} ln \pi_\theta (a_i \vert s_i)  \cdot Q(s_i, a_i) )] $$

There are advanced variations of the plain AC, like the Advantage Actor-Critic (A2C) that incorporates the advantage function as the baseline:
$$ A(s_i, a_i) = Q(s_i, a_i) - V(s_i) = r_i + \gamma V(s_{i+1}) - V(s_i)  $$
which utilizes a value function that maps a state $s$ to its estimated value.

The critic is also needs to be updated and it works like in DQL. Our goal is to have it return accurate estimations for input states. This can be done using the TD loss squared, which is a very similar MSQE to the one used in QL:
$$ J_c(w) = \lVert r_i + \gamma V(s_{i+1}; w) - V(s_i; w) \rVert^2 $$

We can update both the actor and the critic on each step of our episodes, but we need to be careful to detach the estimations of the critic while updating the actor. This also means, the loss for the actor is quite simple too:
$$ J_a(\theta) = (r_i + \gamma V(s_{i+1}) - V(s_{i})) \cdot \nabla_{\theta} ln \pi_\theta (a_i \vert s_i) $$

Below I am implementing the A2C to use the 1-step TD (there exist variations with multiple steps as well).

Here is an overview of the A2C algorithm:
* For each step on every episode:
  1. Sample the next action $a$ using the actor.
  2. Compute the loss for the actor $J_a(\theta)$ and update the actor. Make sure to detach the tensors that involved the critic.
  3. Compute the loss for the critic $J_c(w)$ and update the critic.

In [161]:
def sample_action(model, observation): # sample an action from the policy distribution for state
    action_probs = model(observation) # (action_space)

    action_dist = torch.distributions.Categorical(action_probs)

    action = action_dist.sample()
    action_prob_log = action_dist.log_prob(action)

    return action.item(), action_prob_log

def a2c(env, actor, optimizer_a, critic, optimizer_c, num_episodes, gamma, device, supress_output=False):
    critic_criterion = torch.nn.MSELoss()

    for episode in range(num_episodes):
        rewards_sum, num_steps = 0.0, 0

        observation, info = env.reset()
        observation = torch.tensor(observation, dtype=torch.float32, device=device) # (obs_state_dim)

        while True:
            # 1. Sample from actor
            action, action_prob_log = sample_action(actor, observation) # s_t -> a_t

            next_observation, reward, terminated, truncated, info = env.step(action) # s_{t+1}
            next_observation = torch.tensor(next_observation, dtype=torch.float32, device=device) # (obs_state_dim)

            curr_value = critic(observation) # cache values
            next_value = critic(next_observation)

            if terminated or truncated:
                next_value = torch.tensor([0.0], device=device) # no value if terminated, like in DQL

            # 2. Compute loss and update actor
            value_component = reward + gamma * next_value.detach() - curr_value.detach()
            actor_loss = -action_prob_log * value_component

            optimizer_a.zero_grad()
            actor_loss.backward()
            optimizer_a.step()

            # 3. Compute loss and update critic
            next_pred = reward + gamma * next_value
            critic_loss = critic_criterion(next_pred, critic(observation))

            optimizer_c.zero_grad()
            critic_loss.backward()
            optimizer_c.step()

            if terminated or truncated:
                break


            observation = next_observation # update the observation
            rewards_sum += reward
            num_steps += 1

        if not supress_output or (episode + 1) % 50 == 0:
            print(f'Episode {episode+1}   Avg Reward: {rewards_sum / num_steps:.2f}')

    env.close()

In [155]:
import torch
import torch.nn as nn
import torch.nn.functional as F

class ValueDNN(nn.Module):
    def __init__(self, state_dim):
        super(ValueDNN, self).__init__()
        self.fc1 = nn.Linear(state_dim, 128)
        self.fc2 = nn.Linear(128, 64)
        self.fc3 = nn.Linear(64, 1)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)

        return x

In [162]:
num_episodes = 300
gamma = 0.99
lr_a = 0.001
lr_c = 0.001
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

obs_state_dim = env.observation_space.shape[0]
num_actions = env.action_space.n

actor = PolicyAgent(obs_state_dim, num_actions).to(device)
optimizer_a = torch.optim.Adam(model.parameters(), lr=lr_a)

critic = ValueDNN(obs_state_dim).to(device)
optimizer_c = torch.optim.Adam(model.parameters(), lr=lr_c)

start = datetime.now()

a2c(env, actor, optimizer_a, critic, optimizer_c, num_episodes, gamma, device)

end = datetime.now()

print(end - start)

Episode 1   Avg Reward: 0.04
Episode 2   Avg Reward: -0.86
Episode 3   Avg Reward: -0.34
Episode 4   Avg Reward: -0.39
Episode 5   Avg Reward: -1.14
Episode 6   Avg Reward: -0.67
Episode 7   Avg Reward: 0.33
Episode 8   Avg Reward: -0.08
Episode 9   Avg Reward: -2.16
Episode 10   Avg Reward: -0.96
Episode 11   Avg Reward: -1.57
Episode 12   Avg Reward: -0.04
Episode 13   Avg Reward: -1.18
Episode 14   Avg Reward: -1.58
Episode 15   Avg Reward: -2.88
Episode 16   Avg Reward: 0.12
Episode 17   Avg Reward: 0.38
Episode 18   Avg Reward: -3.31
Episode 19   Avg Reward: 0.04
Episode 20   Avg Reward: 0.43
Episode 21   Avg Reward: 0.02
Episode 22   Avg Reward: -0.03
Episode 23   Avg Reward: -1.53
Episode 24   Avg Reward: -0.39
Episode 25   Avg Reward: -0.63
Episode 26   Avg Reward: -0.33
Episode 27   Avg Reward: -0.50
Episode 28   Avg Reward: -2.34
Episode 29   Avg Reward: -0.93
Episode 30   Avg Reward: -0.08
Episode 31   Avg Reward: 0.42
Episode 32   Avg Reward: -1.22
Episode 33   Avg Reward: 

Impressive, we managed to have positive rewards even in the first few episode runs! I run the A2C algorithm once again with more episodes:

In [163]:
num_episodes = 3000
gamma = 0.99
lr_a = 0.001
lr_c = 0.001
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

obs_state_dim = env.observation_space.shape[0]
num_actions = env.action_space.n

actor = PolicyAgent(obs_state_dim, num_actions).to(device)
optimizer_a = torch.optim.Adam(model.parameters(), lr=lr_a)

critic = ValueDNN(obs_state_dim).to(device)
optimizer_c = torch.optim.Adam(model.parameters(), lr=lr_c)

start = datetime.now()

a2c(env, actor, optimizer_a, critic, optimizer_c, num_episodes, gamma, device, supress_output=True)

end = datetime.now()

print(end - start)

Episode 50   Avg Reward: -0.18
Episode 100   Avg Reward: -1.82
Episode 150   Avg Reward: 0.06
Episode 200   Avg Reward: -1.12
Episode 250   Avg Reward: -1.26
Episode 300   Avg Reward: -2.14
Episode 350   Avg Reward: -2.11
Episode 400   Avg Reward: -2.97
Episode 450   Avg Reward: -1.70
Episode 500   Avg Reward: -1.40
Episode 550   Avg Reward: -0.13
Episode 600   Avg Reward: -0.24
Episode 650   Avg Reward: -0.72
Episode 700   Avg Reward: -0.31
Episode 750   Avg Reward: 0.03
Episode 800   Avg Reward: -1.40
Episode 850   Avg Reward: -2.05
Episode 900   Avg Reward: 0.30
Episode 950   Avg Reward: 0.11
Episode 1000   Avg Reward: -2.67
Episode 1050   Avg Reward: -3.68
Episode 1100   Avg Reward: 0.45
Episode 1150   Avg Reward: -0.24
Episode 1200   Avg Reward: -0.07
Episode 1250   Avg Reward: -1.77
Episode 1300   Avg Reward: -1.72
Episode 1350   Avg Reward: -1.19
Episode 1400   Avg Reward: 0.23
Episode 1450   Avg Reward: 0.49
Episode 1500   Avg Reward: -0.83
Episode 1550   Avg Reward: -3.56
Epis