# Deep Reinforcement Learning Laboratory

In this laboratory session we will work on getting more advanced versions of Deep Reinforcement Learning algorithms up and running. Deep Reinforcement Learning is **hard**, and getting agents to stably train can be frustrating and requires quite a bit of subtlety in analysis of intermediate results. We will start by refactoring (a bit) my implementation of `REINFORCE` on the [Cartpole environment](https://gymnasium.farama.org/environments/classic_control/cart_pole/).

Final remarks on Cartpole:

2. **Exploration**. The model is probably overfitting (or perhaps remaining too *plastic*, which can explain the unstable convergence). Our policy is *always* stochastic in that we sample from the output distribution. It would be interesting to add a temperature parameter to the policy so that we can control this behavior, or even implement a deterministic policy sampler that always selects the action with max probability to evaluate the quality of the learned policy network.

3. **Discount Factor**: The discount factor (default $\gamma = 0.99$) is an important hyperparameter that has an effect on the stability of training. Try different values for $\gamma$ and see how it affects training. Can you think of other ways to stabilize training?
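
To make the temperature idea from the exploration item concrete, the sketch below divides the logits by a temperature before the softmax and also picks the greedy (max-probability) action. It is a minimal, framework-free illustration: the names `softmax_with_temperature` and `greedy_action` and the example logits are made up here and are not part of the lab code.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Return softmax(logits / temperature) as a list of probabilities."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    # Subtract the maximum for numerical stability; the result is unchanged.
    shift = max(scaled)
    exps = [math.exp(s - shift) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def greedy_action(probs):
    """Index of the most probable action (deterministic policy)."""
    return max(range(len(probs)), key=probs.__getitem__)


# Hypothetical logits for the two Cartpole actions.
logits = [2.0, 0.0]
sharp = softmax_with_temperature(logits, temperature=1.0)
flat = softmax_with_temperature(logits, temperature=5.0)
```

With these logits the first action has probability of about 0.88 at temperature 1 and about 0.60 at temperature 5. A higher temperature flattens the distribution and so increases exploration, while the greedy action stays the same.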

Notes of my own to add:

Use a temperature in the softmax to avoid collapses in performance.

Also try reducing the hidden layer size.

Single training curves are usually not reported as results, since training is stochastic: run 5 seeds and report the mean curve with a confidence band.

When running many experiments it makes sense to disable rendering, which is only useful for debugging.
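
The note on reporting results over several seeds can be sketched as follows. This is a minimal, framework-free example under stated assumptions: the function name `mean_with_confidence_band` is made up for illustration, and the band uses the normal approximation (1.96 standard errors), which is a simplification. With only 5 seeds a Student-t multiplier (about 2.78 for a 95% interval) would give a wider band. The example values are the average total rewards logged after 100 and 200 training episodes for the five seeds of the "Standard run" section of this notebook.

```python
import math
import statistics


def mean_with_confidence_band(curves, multiplier=1.96):
    """Aggregate per-seed curves into a mean curve and a confidence band.

    curves: list of equally long lists, one list per seed.
    multiplier: half-width of the band in standard errors (assumption:
    1.96, the normal approximation of a 95% interval).
    """
    n_seeds = len(curves)
    means, lowers, uppers = [], [], []
    for values in zip(*curves):  # one tuple of per-seed values per evaluation point
        mean = statistics.fmean(values)
        # Standard error of the mean across seeds (0 when there is a single seed).
        sem = statistics.stdev(values) / math.sqrt(n_seeds) if n_seeds > 1 else 0.0
        means.append(mean)
        lowers.append(mean - multiplier * sem)
        uppers.append(mean + multiplier * sem)
    return means, lowers, uppers


# Average total reward after 100 and 200 training episodes, one row per seed.
curves = [
    [58.8, 217.5],
    [56.1, 153.85],
    [329.0, 306.4],
    [117.45, 224.75],
    [176.35, 453.9],
]
means, lowers, uppers = mean_with_confidence_band(curves)
```

The three returned lists can then be plotted as a mean line with a shaded band, for example with `plt.plot` and `plt.fill_between`.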

## Imports and Weights & Biases login

In [None]:
# Using weights and biases
!pip install wandb -qU

In [1]:
# Standard imports.
import numpy as np
import matplotlib.pyplot as plt
import gymnasium
import torch
import torch.nn as nn
import torch.nn.functional as F
import wandb
import os

# Plus one non standard one -- we need this to sample from policies.
from torch.distributions import Categorical

In [3]:
# Login to weights and biases account
wandb.login()

wandb: Currently logged in as: marco-chisci (marcouni). Use `wandb login --relogin` to force relogin


True

## Exercise 1: Improving my `REINFORCE` Implementation (warm up)

In this exercise we will refactor a bit and improve some aspects of my `REINFORCE` implementation.

**First Things First**: Spend some time playing with the environment to make sure you understand how it works.

## Policy Net

In [4]:
# A simple policy network with one hidden layer and a temperature parameter to smooth the output
class PolicyNet(nn.Module):
    def __init__(self, env, hidden_layers=32, temperature=1):
        super().__init__()
        self.temperature = temperature
        self.fc1 = nn.Linear(env.observation_space.shape[0], hidden_layers)
        self.fc2 = nn.Linear(hidden_layers, env.action_space.n)
        self.relu = nn.ReLU()

    def forward(self, s):
        s = F.relu(self.fc1(s))
        s = F.softmax(self.fc2(s)/self.temperature, dim=-1)
        return s

## Episode Runner

In [5]:
# A class that, given an environment, a policy network and the maximum episode length,
# is in charge of running episodes.
class Episode_runner:
    def __init__(self, env, policy, maxlen=500):
        self.env = env
        self.policy = policy
        self.maxlen= maxlen

    # Given observation and the policy, sample from pi(a | obs). Returns the
    # selected action and the log probability of that action (needed for policy gradient).
    def select_action(self, obs):
        dist = Categorical(self.policy(obs))
        action = dist.sample()
        log_prob = dist.log_prob(action)
        return (action.item(), log_prob.reshape(1))

    # Given the environment and the policy, run it up to the maximum number of steps.
    def run_episode(self):
        # Collect just about everything.
        observations = []
        actions = []
        log_probs = []
        rewards = []

        # Reset the environment and start the episode.
        (obs, info) = self.env.reset() 
        for i in range(self.maxlen):
            # Get the current observation, run the policy and select an action.
            obs = torch.tensor(obs)
            (action, log_prob) = self.select_action(obs)
            observations.append(obs)
            actions.append(action)
            log_probs.append(log_prob)

            # Advance the episode by executing the selected action.
            (obs, reward, term, trunc, info) = self.env.step(action)
            rewards.append(reward)
            if term or trunc:
                break
        return (observations, actions, torch.cat(log_probs), rewards)


## Deterministic Episode Runner

In [6]:
# A class that, given an episode runner (with an environment and a policy network), a rendering episode runner
# to show the policy, and the maximum length of each episode, evaluates the quality of the learned policy network
# by always selecting the action with the highest probability.
class Determinist_Test_Episode_runner:
    def __init__(self, episode_runner, episode_runner_render, maxlen=500):
        self.ep_runner = episode_runner
        self.ep_run_render = episode_runner_render
        self.maxlen= maxlen

    # Select the most probable action given the policy and the current observation.
    def select_action(self, obs):
        # No sampling and no gradients are needed here: take the argmax of the action probabilities.
        with torch.no_grad():
            action = torch.argmax(self.ep_runner.policy(obs), dim=-1)
        return action.item()

    # Given the environment and the policy, run it up to the maximum number of steps 
    def run_episode(self):
        # Collect just about everything.
        observations = []
        actions = []
        rewards = []

        # Reset the environment and start the episode.
        (obs, info) = self.ep_runner.env.reset() 
        for i in range(self.maxlen):
            # Get the current observation, run the policy and select an action.
            obs = torch.tensor(obs)
            action = self.select_action(obs)
            observations.append(obs)
            actions.append(action)

            # Advance the episode by executing the selected action.
            (obs, reward, term, trunc, info) = self.ep_runner.env.step(action)
            rewards.append(reward)
            if term or trunc:
                break
        return (observations, actions, rewards)

    def test(self, test_episodes):
        print('Testing the best policy')
        self.ep_runner.policy.eval()
        total_reward = 0
        episode_lengths = []
        for _ in range(test_episodes):
            (_, _, rewards) = self.run_episode()
            total_reward += np.sum(rewards)
            episode_lengths.append(len(rewards))
        test_average_episode_len_metric = {"test_average_episode_length": np.mean(episode_lengths)}
        test_average_rewards_metric = {"test_average_total_reward": total_reward / test_episodes}
        wandb.log({**test_average_rewards_metric, **test_average_episode_len_metric})

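        # Render one episode for visual inspection. Note that this runner samples its actions,
        # so the rendered episode is stochastic rather than greedy.
        (obs, _, _, _) = self.ep_run_render.run_episode()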
        self.ep_runner.policy.train()
        print(f'Average Total reward: {total_reward / test_episodes}')

## Reinforce

**Next Things Next**: Now get your `REINFORCE` implementation working on the environment. You can import my (probably buggy and definitely inefficient) implementation here. Or even better, refactor an implementation into a separate package from which you can `import` the stuff you need here.

**Last Things Last**: My implementation does a **super crappy** job of evaluating the agent performance during training. The running average is not a very good metric. Modify my implementation so that every $N$ iterations (make $N$ an argument to the training function) the agent is run for $M$ episodes in the environment. Collect and return: (1) the average **total** reward received over the $M$ episodes; and (2) the average episode length. Analyze the performance of your agents with these new metrics.

In [7]:
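# Utility to compute the discounted returns. Entry t is the sum over k >= t of gamma**(k+1) * r_k, i.e. every
# reward is discounted from the start of the episode (it is not re-based at step t as in the usual reward-to-go).
# Torch doesn't like flipped arrays (negative strides), so we need to .copy() the final numpy array.
# There's probably a better way to do this.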
def compute_returns(rewards, gamma):
    return np.flip(np.cumsum([gamma**(i+1)*r for (i, r) in enumerate(rewards)][::-1]), 0).copy()

# Implementation of the REINFORCE policy gradient algorithm.
# It receives the episode runner, the wandb run used to log the results (note that this parameter shadows the
# wandb module inside the function), the rendering episode runner used to monitor training when display=True,
# the gamma parameter, the number of training episodes, the type of baseline used, eval_every (after how many
# training episodes we evaluate the policy), eval_episodes (how many episodes we evaluate the policy on)
# and the learning rates of the policy and of the baseline network.
def reinforce(episode_runner, wandb, episode_runner_render=None, gamma=0.99, num_episodes=2000,
              baseline=None, display=False, eval_every=100, eval_episodes=100, lr= 1e-2, lr_baseline = 1e-3 ):
    # The only non-vanilla part: we use Adam instead of SGD.
    opt = torch.optim.Adam(episode_runner.policy.parameters(), lr= lr)

    # If we have a baseline network, create the optimizer.
    if isinstance(baseline, nn.Module):
        opt_baseline = torch.optim.Adam(baseline.parameters(), lr= lr_baseline)  
        baseline.train()
        print('Training agent with baseline value network.')
    elif baseline == 'std':
        print('Training agent with standardization baseline.')
    else:
        print('Training agent with no baseline.')

    # Collect the running rewards, the lengths of all episodes and the training losses.
    running_rewards = [0.0]
    all_episodes_lenghts = []
    training_losses = []
    value_losses = []

    # Keep the most recent policy that achieved the highest average total reward during evaluation.
    best_model_state_dict = None
    best_avg_tot_rew = 0
    
    # The main training loop.
    episode_runner.policy.train()
    for episode in range(1, num_episodes+1):
        # Run an episode of the environment, collect everything needed for policy update.
        (observations, actions, log_probs, rewards) = episode_runner.run_episode()

        # Compute the discounted reward for every step of the episode.
        returns = torch.tensor(compute_returns(rewards, gamma), dtype=torch.float32)
        # Keep a running average of total discounted rewards for the whole episode.
        running_rewards.append(0.05 * returns[0].item() + 0.95 * running_rewards[-1])
        running_rewards_metric = {"running_reward": running_rewards[-1]}

        # Handle baseline.
        if isinstance(baseline, nn.Module):
            with torch.no_grad():
                target = returns - baseline(torch.stack(observations))
        elif baseline == 'std':                                       #Standardize returns
            target = (returns - returns.mean()) / returns.std()
        else:
            target = returns

        # Make an optimization step
        opt.zero_grad()
        loss = (-log_probs * target).mean()
        loss.backward()
        opt.step()

        # Log only the mean training loss and episode length over the last 10 episodes to keep the graphs clean.
        all_episodes_lenghts.append(len(returns))
        training_losses.append(loss.detach().cpu().numpy())

        if episode % 10 == 0:
            loss_policy_metric = {"loss_policy": np.mean(training_losses[-10:])}
            episode_length_metric = {"episode_length": np.mean(all_episodes_lenghts[-10:])}
            wandb.log({**loss_policy_metric, **episode_length_metric}, commit = False)

        # Update baseline network.
        if isinstance(baseline, nn.Module):
            opt_baseline.zero_grad()
            loss_baseline = ((returns - baseline(torch.stack(observations)))**2.0).mean()
            loss_baseline.backward()
            opt_baseline.step()
            value_losses.append(loss_baseline.detach().cpu().numpy())
            if episode % 10 == 0:
                loss_value_metric = {"loss_value": np.mean(value_losses[-10:])}
                wandb.log({**loss_value_metric}, commit = False)

        # Render and evaluate the current policy after every "eval_every" policy updates.
        if episode % eval_every == 0:
            episode_runner.policy.eval()
            total_reward = 0
            episode_lengths = []
            # Evaluate the total reward and the episode length over "eval_episodes" episodes.
            for _ in range(eval_episodes):
                (_, _, _, rewards) = episode_runner.run_episode()
                total_reward += np.sum(rewards)
                episode_lengths.append(len(rewards))
            average_episode_len_metric = {"average_episode_length": np.mean(episode_lengths)}
            average_rewards_metric = {"average_total_reward": total_reward / eval_episodes}
            wandb.log({**average_rewards_metric, **average_episode_len_metric}, commit = False)
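            if  total_reward / eval_episodes >= best_avg_tot_rew:
                best_avg_tot_rew = total_reward / eval_episodes
                # Save a snapshot of the parameters of the best policy. state_dict() returns references to the
                # live tensors, so clone them; otherwise later updates would overwrite the "best" weights.
                best_model_state_dict = {k: v.detach().clone() for k, v in episode_runner.policy.state_dict().items()}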
            if display:
                (obs, _, _, _) = episode_runner_render.run_episode()
            episode_runner.policy.train()
            print(f'Running reward of episode {episode}/{num_episodes}: {running_rewards[-1]}')
            print(f'Average Total reward: {total_reward / eval_episodes}')
        
        wandb.log({**running_rewards_metric})

    # Lastly, compute and print the average episode length over the entire training.
    print(f'Average length of all episodes: {np.mean(all_episodes_lenghts)}')
    average_all_episodes_metric= {"average_lenght_all_episodes": np.mean(all_episodes_lenghts)}
    wandb.log({**average_all_episodes_metric})
    
    episode_runner.policy.eval()
    if isinstance(baseline, nn.Module):
        baseline.eval()
    return best_model_state_dict
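
For comparison with `compute_returns` above, here is a minimal, framework-free sketch of two conventions for discounted returns. `returns_from_start` mirrors what the one-liner above computes (each reward is discounted from the start of the episode, with one extra factor of gamma), while `rewards_to_go` is the usual reward-to-go, discounted relative to the current step. Both function names are made up for this illustration, and which variant trains more stably on Cartpole is not something this notebook tests.

```python
def returns_from_start(rewards, gamma):
    """Entry t is the sum over k >= t of gamma**(k + 1) * rewards[k]."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running += gamma ** (t + 1) * rewards[t]
        returns[t] = running
    return returns


def rewards_to_go(rewards, gamma):
    """Entry t is the sum over k >= t of gamma**(k - t) * rewards[k]."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        # Recursive form: G_t = r_t + gamma * G_{t+1}.
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


rewards = [1.0, 1.0, 1.0]
gamma = 0.5
from_start = returns_from_start(rewards, gamma)  # [0.875, 0.375, 0.125]
to_go = rewards_to_go(rewards, gamma)            # [1.75, 1.5, 1.0]
```

The two are related by `from_start[t] == gamma**(t + 1) * to_go[t]`. This also explains why the logged running reward saturates near 98.35 instead of 500: for a full 500-step episode with $\gamma = 0.99$, the first entry is $0.99 \cdot (1 - 0.99^{500}) / 0.01 \approx 98.35$.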
    

## Standard run

In [9]:
#random seeds, to reproduce the same results
seeds = [1, 11, 111, 1111, 11111]
for i in range(len(seeds)):

    # Training and architecture hyperparameters; initialise a wandb run
    run=wandb.init(
          project="Lab3-DRL-warmups",
          name = "Standard Run ",
          config={
              "hidden_layers": 32,
              "num_episodes": 2000,
              "gamma": 0.99,
              "baseline": 'std',
              "eval_every":100,
              "eval_episodes": 20,
              "test_episodes" : 200,
              "temperature" : 1,
              "lr" : 1e-2,
              "lr_baseline" : 1e-3
              })
    
    # Copy the configuration
    config = wandb.config

    # Instantiate two versions of cartpole, one that animates the episodes (which slows everything
    # down), and another that does not animate.
    env = gymnasium.make('CartPole-v1')
    env_render = gymnasium.make('CartPole-v1', render_mode='human')

    #set the seed
    torch.manual_seed(seeds[i])
    env.reset(seed = seeds[i])
    env_render.reset(seed = seeds[i])
    
    # Make a policy network.
    policy = PolicyNet(env, config.hidden_layers, config.temperature)
    
    # Create episode_runner
    episode_runner= Episode_runner(env, policy)
    episode_runner_rend= Episode_runner(env_render, policy)
    
    # Train the agent
    best_model_state_dict = reinforce(episode_runner, run, episode_runner_rend, gamma=config.gamma, num_episodes=config.num_episodes,
              baseline= config.baseline, display=False, eval_every=config.eval_every,
              eval_episodes=config.eval_episodes, lr= config.lr, lr_baseline = config.lr_baseline )
    
    # Load the best policy into the deterministic episode runner to test it
    episode_runner.policy.load_state_dict(best_model_state_dict)
    det_ep_runner = Determinist_Test_Episode_runner(episode_runner, episode_runner_rend )
    det_ep_runner.test(test_episodes=config.test_episodes)
    
    # Close up everything
    env_render.close()
    env.close()

Run history:
average_episode_length,▁█
average_total_reward,▁█
episode_length,▁▁▁▁▁▁▁▁▂▁▁▂▂▂▂▂▂▂▂▃▃▃▃▄▂▄█
loss_policy,▆▄▅▃▂▄▁▄▂▄▆▂▄█▄▃▄▅▁▂▂▂▃▄▃▄▃
running_reward,▁▁▂▂▂▂▂▃▃▃▃▃▃▄▄▄▄▄▄▅▅▅▅▅▅▅▆▆▆▆▆▆▇▇▇▇▆▇██

Run summary:
average_episode_length,217.5
average_total_reward,217.5
episode_length,429.8
loss_policy,-0.00798
running_reward,86.6903


Training agent with standardization baseline.
Running reward of episode 100/2000: 32.71589919446313
Average Total reward: 58.8
Running reward of episode 200/2000: 62.14137885745559
Average Total reward: 217.5
Running reward of episode 300/2000: 94.74059099161452
Average Total reward: 499.25
Running reward of episode 400/2000: 96.09381802522763
Average Total reward: 481.95
Running reward of episode 500/2000: 97.85864756470048
Average Total reward: 487.85
Running reward of episode 600/2000: 98.09670548972397
Average Total reward: 500.0
Running reward of episode 700/2000: 91.089020737204
Average Total reward: 500.0
Running reward of episode 800/2000: 98.23690690133812
Average Total reward: 500.0
Running reward of episode 900/2000: 93.5830460039909
Average Total reward: 479.6
Running reward of episode 1000/2000: 97.02925393227288
Average Total reward: 500.0
Running reward of episode 1100/2000: 95.56164338768357
Average Total reward: 378.55
Running reward of episode 1200/2000: 98.1845934928

Run history:
average_episode_length,▁▄████████▆██▇██████
average_lenght_all_episodes,▁
average_total_reward,▁▄████████▆██▇██████
episode_length,▁▁▂▂▃▄▆▇█▇███▃███▅██▅█████████▇▇██▇█████
loss_policy,▇▇▃▅▂▆▆▅▅▅▅▃▅▄▄▆▆▂▇▆▂▅▆▆▄▆▅▇▆▅▅▁▇▇▄▄▇▅█▆
running_reward,▁▂▃▄▆▆███████▇██████████████████████████
test_average_episode_length,▁
test_average_total_reward,▁

Run summary:
average_episode_length,500.0
average_lenght_all_episodes,421.7265
average_total_reward,500.0
episode_length,500.0
loss_policy,0.00443
running_reward,98.34952
test_average_episode_length,500.0
test_average_total_reward,500.0


Training agent with standardization baseline.
Running reward of episode 100/2000: 45.24261558588204
Average Total reward: 56.1
Running reward of episode 200/2000: 88.17505960616143
Average Total reward: 153.85
Running reward of episode 300/2000: 97.87176516515488
Average Total reward: 500.0
Running reward of episode 400/2000: 97.33309453649986
Average Total reward: 444.05
Running reward of episode 500/2000: 98.31654295569537
Average Total reward: 474.7
Running reward of episode 600/2000: 98.0786709152909
Average Total reward: 500.0
Running reward of episode 700/2000: 88.86195641509707
Average Total reward: 500.0
Running reward of episode 800/2000: 95.59207986596432
Average Total reward: 278.35
Running reward of episode 900/2000: 98.27400811155437
Average Total reward: 500.0
Running reward of episode 1000/2000: 74.98692654753931
Average Total reward: 108.5
Running reward of episode 1100/2000: 94.3809585845198
Average Total reward: 500.0
Running reward of episode 1200/2000: 98.3260294355

Run history:
average_episode_length,▁▃█▇███▅█▂██████▄███
average_lenght_all_episodes,▁
average_total_reward,▁▃█▇███▅█▂██████▄███
episode_length,▁▁▃▅▃██████▇▇▄█████▃▃▅████████████▇█████
loss_policy,█▇▃▅▄▅▆▅▆▆▆▂▃▅▅▃▆▅▆▄▁▆▃▆▅▃▇█▅▆▆▅▆▂▆▄▄▆▇▆
running_reward,▁▃▄▇▇████████▅█████▇▆▇██████████████████
test_average_episode_length,▁
test_average_total_reward,▁

Run summary:
average_episode_length,500.0
average_lenght_all_episodes,415.561
average_total_reward,500.0
episode_length,500.0
loss_policy,0.00062
running_reward,98.3495
test_average_episode_length,500.0
test_average_total_reward,500.0


Training agent with standardization baseline.
Running reward of episode 100/2000: 66.2024049949195
Average Total reward: 329.0
Running reward of episode 200/2000: 90.71400884176553
Average Total reward: 306.4
Running reward of episode 300/2000: 71.32711784746544
Average Total reward: 117.2
Running reward of episode 400/2000: 95.85379320565168
Average Total reward: 470.1
Running reward of episode 500/2000: 88.6751074013284
Average Total reward: 276.0
Running reward of episode 600/2000: 98.0568359457516
Average Total reward: 378.95
Running reward of episode 700/2000: 85.50913099454664
Average Total reward: 199.4
Running reward of episode 800/2000: 96.93070227213474
Average Total reward: 500.0
Running reward of episode 900/2000: 97.00646803282702
Average Total reward: 373.3
Running reward of episode 1000/2000: 98.33444586527519
Average Total reward: 500.0
Running reward of episode 1100/2000: 98.34943617252813
Average Total reward: 500.0
Running reward of episode 1200/2000: 97.219237168102

Run history:
average_episode_length,▅▅▁▇▄▆▃█▆███▁███████
average_lenght_all_episodes,▁
average_total_reward,▅▅▁▇▄▆▃█▆███▁███████
episode_length,▁▁▅▇▃▅▂▅▇▆██▄▄▄▇████████▇▇▅████████▇█▇██
loss_policy,▆▆▆▇▆▇▅▅▆▆▆▆▆▆▅▇▇▆▆▆▆▆▆▅▃▄█▆▆▆▇▇▇▇▅▁▆▇▆▆
running_reward,▁▂▆█▇▇▅▇█████▇▇▇██████████▇███████▇▇████
test_average_episode_length,▁
test_average_total_reward,▁

Run summary:
average_episode_length,500.0
average_lenght_all_episodes,392.965
average_total_reward,500.0
episode_length,500.0
loss_policy,-0.00823
running_reward,98.34767
test_average_episode_length,500.0
test_average_total_reward,500.0


Training agent with standardization baseline.
Running reward of episode 100/2000: 64.18655985126483
Average Total reward: 117.45
Running reward of episode 200/2000: 91.0291745439531
Average Total reward: 224.75
Running reward of episode 300/2000: 92.22766498992853
Average Total reward: 483.15
Running reward of episode 400/2000: 98.28326097775064
Average Total reward: 489.15
Running reward of episode 500/2000: 98.11194037135056
Average Total reward: 500.0
Running reward of episode 600/2000: 98.34567881654553
Average Total reward: 500.0
Running reward of episode 700/2000: 97.03570630367346
Average Total reward: 500.0
Running reward of episode 800/2000: 98.30298266780466
Average Total reward: 500.0
Running reward of episode 900/2000: 98.31993287170565
Average Total reward: 500.0
Running reward of episode 1000/2000: 98.06477831623546
Average Total reward: 500.0
Running reward of episode 1100/2000: 98.34783959792426
Average Total reward: 500.0
Running reward of episode 1200/2000: 98.3084952

Run history:
average_episode_length,▁▃██████████▄▃▃▂▄███
average_lenght_all_episodes,▁
average_total_reward,▁▃██████████▄▃▃▂▄███
episode_length,▁▂▃▄▃▆███▇█████████▇██▆███▂▂▄▂▅▄▃▃▇▇████
loss_policy,▆▁▄▂▄▇▅▆▇▅▆▇▆▇▆▆▆▆▆▃▆▆▅▆▆▄▂▅█▄▅▅▃▇▆▆▆▆▄▅
running_reward,▁▃▆▇▇█████████████████████▆▆▇▆▇▇▇▆██████
test_average_episode_length,▁
test_average_total_reward,▁

Run summary:
average_episode_length,490.4
average_lenght_all_episodes,377.853
average_total_reward,490.4
episode_length,499.5
loss_policy,-0.00849
running_reward,97.57143
test_average_episode_length,491.805
test_average_total_reward,491.805


Training agent with standardization baseline.
Running reward of episode 100/2000: 52.53875272596661
Average Total reward: 176.35
Running reward of episode 200/2000: 93.73580729045212
Average Total reward: 453.9
Running reward of episode 300/2000: 97.28332806210642
Average Total reward: 500.0
Running reward of episode 400/2000: 98.34321299886047
Average Total reward: 494.7
Running reward of episode 500/2000: 98.28014721429058
Average Total reward: 483.55
Running reward of episode 600/2000: 97.8272487276613
Average Total reward: 500.0
Running reward of episode 700/2000: 98.12234334143987
Average Total reward: 500.0
Running reward of episode 800/2000: 87.74643931950025
Average Total reward: 343.55
Running reward of episode 900/2000: 75.47979736878095
Average Total reward: 172.85
Running reward of episode 1000/2000: 97.0255859839134
Average Total reward: 416.3
Running reward of episode 1100/2000: 98.33920815719178
Average Total reward: 500.0
Running reward of episode 1200/2000: 97.25882011

## Standard run and higher temperature (5)

In [10]:
#random seeds, to reproduce the same results
seeds = [1, 11, 111, 1111, 11111]
for i in range(len(seeds)):

    # Training and architecture hyperparameters; initialise a wandb run
    run=wandb.init(
          project="Lab3-DRL-warmups",
          name = "Standard and temperature 5 ",
          config={
              "hidden_layers": 32,
              "num_episodes": 2000,
              "gamma": 0.99,
              "baseline": 'std',
              "eval_every":100,
              "eval_episodes": 20,
              "test_episodes" : 200,
              "temperature" : 5,
              "lr" : 1e-2,
              "lr_baseline" : 1e-3
              })
    
    # Copy the configuration
    config = wandb.config

    # Instantiate two versions of cartpole, one that animates the episodes (which slows everything
    # down), and another that does not animate.
    env = gymnasium.make('CartPole-v1')
    env_render = gymnasium.make('CartPole-v1', render_mode='human')

    #set the seed
    torch.manual_seed(seeds[i])
    env.reset(seed = seeds[i])
    env_render.reset(seed = seeds[i])
    
    # Make a policy network.
    policy = PolicyNet(env, config.hidden_layers, config.temperature)
    
    # Create episode_runner
    episode_runner= Episode_runner(env, policy)
    episode_runner_rend= Episode_runner(env_render, policy)
    
    # Train the agent
    best_model_state_dict = reinforce(episode_runner, run, episode_runner_rend, gamma=config.gamma, num_episodes=config.num_episodes,
              baseline= config.baseline, display=False, eval_every=config.eval_every,
              eval_episodes=config.eval_episodes, lr= config.lr, lr_baseline = config.lr_baseline )
    
    # Load the best policy into the deterministic episode runner to test it
    episode_runner.policy.load_state_dict(best_model_state_dict)
    det_ep_runner = Determinist_Test_Episode_runner(episode_runner, episode_runner_rend )
    det_ep_runner.test(test_episodes=config.test_episodes)
    
    # Close up everything
    env_render.close()
    env.close()

Run history:
average_episode_length,▁▇█████▅▁▆█▂▇▆██▂███
average_lenght_all_episodes,▁
average_total_reward,▁▇█████▅▁▆█▂▇▆██▂███
episode_length,▁▁▃▄▄███▇█▇██████▅▄█████▆█████▇███▄█████
loss_policy,▇▇▇▆▆▇██▇██▇▇█▇█▆▇▆▇▇▇█▆▆▇▇▇▇▇▁▇▆▇▆████▇
running_reward,▁▃▅▇█▇██████████▇█▇███████████▇███▇█████
test_average_episode_length,▁
test_average_total_reward,▁

Run summary:
average_episode_length,500.0
average_lenght_all_episodes,423.1615
average_total_reward,500.0
episode_length,500.0
loss_policy,-0.01007
running_reward,98.33906
test_average_episode_length,500.0
test_average_total_reward,500.0


Training agent with standardization baseline.
Running reward of episode 100/2000: 25.929437324089378
Average Total reward: 32.65
Running reward of episode 200/2000: 62.213972986772724
Average Total reward: 148.2
Running reward of episode 300/2000: 82.60212577121675
Average Total reward: 176.7
Running reward of episode 400/2000: 93.96289815535772
Average Total reward: 402.3
Running reward of episode 500/2000: 96.0098544531907
Average Total reward: 287.7
Running reward of episode 600/2000: 97.31797903324576
Average Total reward: 500.0
Running reward of episode 700/2000: 98.03687329085803
Average Total reward: 472.55
Running reward of episode 800/2000: 97.9945584991854
Average Total reward: 491.95
Running reward of episode 900/2000: 96.43207559436021
Average Total reward: 500.0
Running reward of episode 1000/2000: 98.28210346671098
Average Total reward: 500.0
Running reward of episode 1100/2000: 96.3751670397502
Average Total reward: 225.35
Running reward of episode 1200/2000: 98.29695469

Run history:
average_episode_length,▁▃▃▇▅█████▄█████████
average_lenght_all_episodes,▁
average_total_reward,▁▃▃▇▅█████▄█████████
episode_length,▁▁▁▁▂▄▂▆▇█▇███▇███▆█▇█▇█████████████████
loss_policy,▅▅▂▂▁▄▇▃▅▄▇▇▇▆▅▆█▇▃▆▃█▄█▆▇█▆▇█▅▆▆▇▄▅█▆▆▅
running_reward,▁▂▃▃▅▆▆▇████████████████████████████████
test_average_episode_length,▁
test_average_total_reward,▁

Run summary:
average_episode_length,500.0
average_lenght_all_episodes,415.677
average_total_reward,500.0
episode_length,500.0
loss_policy,-0.00603
running_reward,98.34952
test_average_episode_length,500.0
test_average_total_reward,500.0


Training agent with standardization baseline.
Running reward of episode 100/2000: 23.881031570633947
Average Total reward: 40.15
Running reward of episode 200/2000: 57.736844261333516
Average Total reward: 142.0
Running reward of episode 300/2000: 94.86828897512726
Average Total reward: 455.2
Running reward of episode 400/2000: 97.04106218913063
Average Total reward: 482.9
Running reward of episode 500/2000: 98.08374884239758
Average Total reward: 462.8
Running reward of episode 600/2000: 97.67164312127268
Average Total reward: 478.95
Running reward of episode 700/2000: 97.84771709146843
Average Total reward: 437.2
Running reward of episode 800/2000: 98.32861910395586
Average Total reward: 500.0
Running reward of episode 900/2000: 94.66916917965473
Average Total reward: 475.8
Running reward of episode 1000/2000: 98.18605883078813
Average Total reward: 500.0
Running reward of episode 1100/2000: 98.26400679820104
Average Total reward: 461.85
Running reward of episode 1200/2000: 93.270296

Run history:
average_episode_length,▁▃▇█▇█▇███▇▇▇███████
average_lenght_all_episodes,▁
average_total_reward,▁▃▇█▇█▇███▇▇▇███████
episode_length,▁▁▂▂▄▅▇▇▇█▇▆█████▆█████▄███████▆████████
loss_policy,▆▇▁▄▆▇▆▅▆▆█▆▆▅▆▆▅▇▇▅▆▆▇▆▆▇▄▅▅▅▄█▅▇▆▄▄▅▇▅
running_reward,▁▂▃▄▅▇███████████▇█████▇████████████████
test_average_episode_length,▁
test_average_total_reward,▁

Run summary:
average_episode_length,500.0
average_lenght_all_episodes,420.7645
average_total_reward,500.0
episode_length,500.0
loss_policy,-0.00822
running_reward,98.34064
test_average_episode_length,500.0
test_average_total_reward,500.0


Training agent with standardization baseline.
Running reward of episode 100/2000: 24.97650669296172
Average Total reward: 30.1
Running reward of episode 200/2000: 50.650234553148884
Average Total reward: 114.2
Running reward of episode 300/2000: 93.27920996501801
Average Total reward: 403.0
Running reward of episode 400/2000: 96.99587821207643
Average Total reward: 472.95
Running reward of episode 500/2000: 96.19530445629833
Average Total reward: 267.7
Running reward of episode 600/2000: 94.957603327465
Average Total reward: 480.75
Running reward of episode 700/2000: 92.80265601524734
Average Total reward: 379.2
Running reward of episode 800/2000: 92.58579432546331
Average Total reward: 500.0
Running reward of episode 900/2000: 98.28942081517984
Average Total reward: 500.0
Running reward of episode 1000/2000: 97.86123264788272
Average Total reward: 500.0
Running reward of episode 1100/2000: 98.34663449984711
Average Total reward: 500.0
Running reward of episode 1200/2000: 98.3495083356

Run history:
average_episode_length,▁▂▇█▅█▆█████████████
average_lenght_all_episodes,▁
average_total_reward,▁▂▇█▅█▆█████████████
episode_length,▁▁▁▂▃▆▆▅███▇█▆▃▇██████████▆█████▇████▄██
loss_policy,▃▂▃█▄▄▄▁▄▇▄▂▃▂▂▂▇▅▆▅▅▃▆▆▅▂▂▆▄▆▄▅▃▃▄▂▂▄▁▄
running_reward,▁▂▂▃▅▇████████▇███████████▇█████████████
test_average_episode_length,▁
test_average_total_reward,▁

Run summary:
average_episode_length,500.0
average_lenght_all_episodes,409.4595
average_total_reward,500.0
episode_length,500.0
loss_policy,-0.004
running_reward,98.33909
test_average_episode_length,500.0
test_average_total_reward,500.0


Training agent with standardization baseline.
Running reward of episode 100/2000: 23.553795438268196
Average Total reward: 36.95
Running reward of episode 200/2000: 56.55064634220968
Average Total reward: 142.8
Running reward of episode 300/2000: 91.64790728972982
Average Total reward: 404.05
Running reward of episode 400/2000: 94.10095109490055
Average Total reward: 442.35
Running reward of episode 500/2000: 98.0458981284741
Average Total reward: 487.9
Running reward of episode 600/2000: 98.34152819277885
Average Total reward: 500.0
Running reward of episode 700/2000: 98.34351303575056
Average Total reward: 500.0
Running reward of episode 800/2000: 98.34947365516523
Average Total reward: 500.0
Running reward of episode 900/2000: 97.98864749088678
Average Total reward: 500.0
Running reward of episode 1000/2000: 98.25435069169532
Average Total reward: 500.0
Running reward of episode 1100/2000: 98.34896196671255
Average Total reward: 500.0
Running reward of episode 1200/2000: 96.88332388

0,1
average_episode_length,▁▃▇▇████████████████
average_lenght_all_episodes,▁
average_total_reward,▁▃▇▇████████████████
episode_length,▁▁▁▂▃▆▆▅████████▆▇████▇█████▇███████████
loss_policy,▅▆▁▃▃▂▅▆▇▆▆▅█▇▇▆██▅▅▇▆▇▆▇▆▇▇▄▄█▇▆▅▇▇▇▇▇▇
running_reward,▁▂▂▃▅▇█▇████████████████████████████████
test_average_episode_length,▁
test_average_total_reward,▁

0,1
average_episode_length,500.0
average_lenght_all_episodes,431.7595
average_total_reward,500.0
episode_length,500.0
loss_policy,0.00168
running_reward,98.34953
test_average_episode_length,500.0
test_average_total_reward,500.0


Training agent with standardization baseline.
Running reward of episode 100/2000: 26.83923487458664
Average Total reward: 37.55
Running reward of episode 200/2000: 74.51195438229841
Average Total reward: 315.1
Running reward of episode 300/2000: 92.18660611213245
Average Total reward: 456.75
Running reward of episode 400/2000: 97.48870334249546
Average Total reward: 473.15
Running reward of episode 500/2000: 97.1706223988939
Average Total reward: 480.8
Running reward of episode 600/2000: 98.26989274668249
Average Total reward: 477.4
Running reward of episode 700/2000: 98.16480313211207
Average Total reward: 500.0
Running reward of episode 800/2000: 98.0017380198416
Average Total reward: 500.0
Running reward of episode 900/2000: 98.32794873415861
Average Total reward: 478.8
Running reward of episode 1000/2000: 97.97978904073473
Average Total reward: 447.55
Running reward of episode 1100/2000: 98.14453882088127
Average Total reward: 487.2
Running reward of episode 1200/2000: 96.580900341

## Standard run and higher temperature (10)
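
The runs below use a softmax temperature of 10, whereas the other runs in this section use 5. As a rough illustration of what a higher temperature does, the sketch below divides the logits by the temperature before the softmax. How `PolicyNet` actually applies its `temperature` argument is not shown in this section, so `softmax_with_temperature` is a hypothetical helper and not the notebook's implementation.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Softmax over logits / temperature (hypothetical helper, not from the notebook)."""
    scaled = [x / temperature for x in logits]
    # Subtract the maximum for numerical stability before exponentiating.
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Two actions, as in CartPole; the logits are made-up example values.
logits = [2.0, 0.0]
probs_t1 = softmax_with_temperature(logits, 1.0)
probs_t5 = softmax_with_temperature(logits, 5.0)
probs_t10 = softmax_with_temperature(logits, 10.0)
print(probs_t1, probs_t5, probs_t10)
```

Under this assumption, a higher temperature pushes the action probabilities towards uniform, so the sampled policy explores more for the same logits.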

In [12]:
#random seeds, to reproduce the same results
seeds = [1, 11, 111, 1111, 11111]
for i in range(len(seeds)):

    # Training and architecture hyperparameters; initialise a wandb run.
    run=wandb.init(
          project="Lab3-DRL-warmups",
          name = "Standard and temperature 10 ",
          config={
              "hidden_layers": 32,
              "num_episodes": 2000,
              "gamma": 0.99,
              "baseline": 'std',
              "eval_every":100,
              "eval_episodes": 20,
              "test_episodes" : 200,
              "temperature" : 10,
              "lr" : 1e-2,
              "lr_baseline" : 1e-3
              })
    
    # Copy the configuration
    config = wandb.config

    # Instantiate two versions of CartPole: one that does not animate, and another that
    # animates the episodes (which slows everything down).
    env = gymnasium.make('CartPole-v1')
    env_render = gymnasium.make('CartPole-v1', render_mode='human')

    #set the seed
    torch.manual_seed(seeds[i])
    env.reset(seed = seeds[i])
    env_render.reset(seed = seeds[i])
    
    # Make a policy network.
    policy = PolicyNet(env, config.hidden_layers, config.temperature)
    
    # Create episode_runner
    episode_runner= Episode_runner(env, policy)
    episode_runner_rend= Episode_runner(env_render, policy)
    
    # Train the agent
    best_model_state_dict = reinforce(episode_runner, run, episode_runner_rend, gamma=config.gamma, num_episodes=config.num_episodes,
              baseline= config.baseline, display=False, eval_every=config.eval_every,
              eval_episodes=config.eval_episodes, lr= config.lr, lr_baseline = config.lr_baseline )
    
    # Load the best policy into the deterministic episode runner to test it
    episode_runner.policy.load_state_dict(best_model_state_dict)
    det_ep_runner = Determinist_Test_Episode_runner(episode_runner, episode_runner_rend )
    det_ep_runner.test(test_episodes=config.test_episodes)
    
    # Close up everything
    env_render.close()
    env.close()

0,1
average_episode_length,▁▁▁▂▂▄▆█▇▆██▆██████
average_total_reward,▁▁▁▂▂▄▆█▇▆██▆██████
episode_length,▁▁▁▁▂▂▁▂▂▃▃▃▅▆█▆▇▆▆▆██▇███▆▇██████████▇█
loss_policy,▅▆▅▄▇▁▄▅▇▆▄▆▆▅▆▅▆▅▄▅▆▆▃▆▇▆▆▅▅▆▆▅▆▆▆▆▆▆▅█
running_reward,▁▆▆▇████████████████████████████████████

0,1
average_episode_length,500.0
average_total_reward,500.0
episode_length,500.0
loss_policy,0.00237
running_reward,9.0


Training agent with standardization baseline.
Running reward of episode 100/2000: 22.16568026537247
Average Total reward: 29.45
Running reward of episode 200/2000: 35.74352037674666
Average Total reward: 55.05
Running reward of episode 300/2000: 70.49331650832612
Average Total reward: 120.8
Running reward of episode 400/2000: 85.458920615001
Average Total reward: 202.4
Running reward of episode 500/2000: 91.57265888488924
Average Total reward: 372.55
Running reward of episode 600/2000: 89.44259666828034
Average Total reward: 208.3
Running reward of episode 700/2000: 92.22526037222525
Average Total reward: 293.85
Running reward of episode 800/2000: 96.1536516069143
Average Total reward: 436.5
Running reward of episode 900/2000: 95.63969546835189
Average Total reward: 486.35
Running reward of episode 1000/2000: 98.220242553561
Average Total reward: 437.0
Running reward of episode 1100/2000: 98.29579112130567
Average Total reward: 500.0
Running reward of episode 1200/2000: 97.972547426222

0,1
average_episode_length,▁▁▂▄▆▄▅▇█▇███████▅██
average_lenght_all_episodes,▁
average_total_reward,▁▁▂▄▆▄▅▇█▇███████▅██
episode_length,▁▁▁▁▂▃▃▅▅▄▅▅▅▅▆█▇▇████████████████▇█▇███
loss_policy,▆▄▇▃▅▁▇▆▅▆▂▅▆▆▂▆▄▅▇▇▅▆▆█▆▆▇▇▅▇▆▅▆█▃▆▇▄▆▇
running_reward,▁▂▂▂▃▄▆▇▇▇██▇▇▇█████████████████████████
test_average_episode_length,▁
test_average_total_reward,▁

0,1
average_episode_length,500.0
average_lenght_all_episodes,369.6655
average_total_reward,500.0
episode_length,500.0
loss_policy,0.00391
running_reward,98.27888
test_average_episode_length,500.0
test_average_total_reward,500.0


Training agent with standardization baseline.
Running reward of episode 100/2000: 22.93964205123031
Average Total reward: 39.45
Running reward of episode 200/2000: 41.18739140270995
Average Total reward: 57.1
Running reward of episode 300/2000: 78.89411137546924
Average Total reward: 217.15
Running reward of episode 400/2000: 84.41578798734405
Average Total reward: 359.5
Running reward of episode 500/2000: 95.60180401230649
Average Total reward: 451.3
Running reward of episode 600/2000: 97.92725010766306
Average Total reward: 459.7
Running reward of episode 700/2000: 95.27069536172121
Average Total reward: 488.55
Running reward of episode 800/2000: 98.01595174989652
Average Total reward: 473.15
Running reward of episode 900/2000: 97.91987447672201
Average Total reward: 498.5
Running reward of episode 1000/2000: 97.91030455652094
Average Total reward: 496.5
Running reward of episode 1100/2000: 98.07849872461578
Average Total reward: 500.0
Running reward of episode 1200/2000: 98.34792083

0,1
average_episode_length,▁▁▄▆▇▇██████████████
average_lenght_all_episodes,▁
average_total_reward,▁▁▄▆▇▇██████████████
episode_length,▁▁▁▁▂▃▄▃▆▅▇▇▇███▇███████████████████████
loss_policy,▅▅▂▃▇▅▅▅█▁▆▄▄▄▅▅▅▅▅█▇▆▇▆▅▆▅▆▅▆▅▅▆▅▇▇▆▆▅█
running_reward,▁▂▂▃▃▅▇▇▇▇██████████████████████████████
test_average_episode_length,▁
test_average_total_reward,▁

0,1
average_episode_length,500.0
average_lenght_all_episodes,405.7335
average_total_reward,500.0
episode_length,500.0
loss_policy,0.00832
running_reward,97.93908
test_average_episode_length,500.0
test_average_total_reward,500.0


Training agent with standardization baseline.
Running reward of episode 100/2000: 23.698670489439746
Average Total reward: 30.45
Running reward of episode 200/2000: 43.45003506164022
Average Total reward: 62.2
Running reward of episode 300/2000: 77.91009980628179
Average Total reward: 211.05
Running reward of episode 400/2000: 92.87547093184334
Average Total reward: 298.95
Running reward of episode 500/2000: 94.38419736565884
Average Total reward: 423.3
Running reward of episode 600/2000: 96.8122575933616
Average Total reward: 492.5
Running reward of episode 700/2000: 96.41839687834289
Average Total reward: 335.85
Running reward of episode 800/2000: 97.59170765350223
Average Total reward: 474.75
Running reward of episode 900/2000: 92.97206822071841
Average Total reward: 500.0
Running reward of episode 1000/2000: 95.01416784366324
Average Total reward: 500.0
Running reward of episode 1100/2000: 96.92506116702563
Average Total reward: 403.5
Running reward of episode 1200/2000: 95.6172160

0,1
average_episode_length,▁▁▄▅▇█▆███▇██▆██████
average_lenght_all_episodes,▁
average_total_reward,▁▁▄▅▇█▆███▇██▆██████
episode_length,▁▁▁▂▂▄▅▇▆▄▇▇███▇▇▄▇▄█▇▇██▆███▇█▄▇█▇█████
loss_policy,▅█▁▆▄▅▃▆▇▄▆▅▅▅▅▇▂▂▆▄▆▃▅▅▄▆▇▆▅▄▅▅▄▇▆▆▅▆▄▆
running_reward,▁▂▂▃▄▆▇█▇▇███████▇█▇█████▇███▇█▇████████
test_average_episode_length,▁
test_average_total_reward,▁

0,1
average_episode_length,500.0
average_lenght_all_episodes,391.0045
average_total_reward,500.0
episode_length,500.0
loss_policy,0.00021
running_reward,98.34952
test_average_episode_length,500.0
test_average_total_reward,500.0


Training agent with standardization baseline.
Running reward of episode 100/2000: 20.354387225086743
Average Total reward: 21.25
Running reward of episode 200/2000: 34.538424925053995
Average Total reward: 56.3
Running reward of episode 300/2000: 82.26923967252634
Average Total reward: 300.6
Running reward of episode 400/2000: 92.77205541514661
Average Total reward: 457.7
Running reward of episode 500/2000: 96.81175716839374
Average Total reward: 441.65
Running reward of episode 600/2000: 97.73970607537439
Average Total reward: 500.0
Running reward of episode 700/2000: 98.24179671261828
Average Total reward: 474.35
Running reward of episode 800/2000: 96.50784829359428
Average Total reward: 441.6
Running reward of episode 900/2000: 98.32422065570984
Average Total reward: 500.0
Running reward of episode 1000/2000: 98.2022674479129
Average Total reward: 500.0
Running reward of episode 1100/2000: 98.11677271625855
Average Total reward: 500.0
Running reward of episode 1200/2000: 97.21286433

0,1
average_episode_length,▁▂▅▇▇██▇████████████
average_lenght_all_episodes,▁
average_total_reward,▁▂▅▇▇██▇████████████
episode_length,▁▁▁▁▂▃▅▆▇▇▇████▇█████████████████▇▇█████
loss_policy,▆▆▆▁▆▂▆▇▅▆▄▆▇▇▅▆▆▆▆▅▆▅▇█▆▆▆▇▆▆█▆▅▄▆▆▆█▇▇
running_reward,▁▂▂▂▃▅▇█████████████████████████████████
test_average_episode_length,▁
test_average_total_reward,▁

0,1
average_episode_length,500.0
average_lenght_all_episodes,414.981
average_total_reward,500.0
episode_length,500.0
loss_policy,0.00284
running_reward,97.3879
test_average_episode_length,500.0
test_average_total_reward,500.0


Training agent with standardization baseline.
Running reward of episode 100/2000: 21.625619198686906
Average Total reward: 24.8
Running reward of episode 200/2000: 30.18874240659956
Average Total reward: 44.25
Running reward of episode 300/2000: 57.30448430834697
Average Total reward: 109.55
Running reward of episode 400/2000: 82.85195281131075
Average Total reward: 131.45
Running reward of episode 500/2000: 87.18454569224969
Average Total reward: 221.15
Running reward of episode 600/2000: 93.18461871015343
Average Total reward: 231.55
Running reward of episode 700/2000: 97.02780971509172
Average Total reward: 479.1
Running reward of episode 800/2000: 96.91635901242573
Average Total reward: 500.0
Running reward of episode 900/2000: 94.43567864952625
Average Total reward: 483.4
Running reward of episode 1000/2000: 98.19639645566181
Average Total reward: 483.3
Running reward of episode 1100/2000: 98.34861884696464
Average Total reward: 500.0
Running reward of episode 1200/2000: 98.042141

## Standard run and lower gamma

In [13]:
#random seeds, to reproduce the same results
seeds = [1, 11, 111, 1111, 11111]
for i in range(len(seeds)):

    # Training and architecture hyperparameters; initialise a wandb run.
    run=wandb.init(
          project="Lab3-DRL-warmups",
          name = "Standard and gamma 0.9 ",
          config={
              "hidden_layers": 32,
              "num_episodes": 2000,
              "gamma": 0.9,
              "baseline": 'std',
              "eval_every":100,
              "eval_episodes": 20,
              "test_episodes" : 200,
              "temperature" : 5,
              "lr" : 1e-2,
              "lr_baseline" : 1e-3
              })
    
    # Copy the configuration
    config = wandb.config

    # Instantiate two versions of CartPole: one that does not animate, and another that
    # animates the episodes (which slows everything down).
    env = gymnasium.make('CartPole-v1')
    env_render = gymnasium.make('CartPole-v1', render_mode='human')

    #set the seed
    torch.manual_seed(seeds[i])
    env.reset(seed = seeds[i])
    env_render.reset(seed = seeds[i])
    
    # Make a policy network.
    policy = PolicyNet(env, config.hidden_layers, config.temperature)
    
    # Create episode_runner
    episode_runner= Episode_runner(env, policy)
    episode_runner_rend= Episode_runner(env_render, policy)
    
    # Train the agent
    best_model_state_dict = reinforce(episode_runner, run, episode_runner_rend, gamma=config.gamma, num_episodes=config.num_episodes,
              baseline= config.baseline, display=False, eval_every=config.eval_every,
              eval_episodes=config.eval_episodes, lr= config.lr, lr_baseline = config.lr_baseline )
    
    # Load the best policy into the deterministic episode runner to test it
    episode_runner.policy.load_state_dict(best_model_state_dict)
    det_ep_runner = Determinist_Test_Episode_runner(episode_runner, episode_runner_rend )
    det_ep_runner.test(test_episodes=config.test_episodes)
    
    # Close up everything
    env_render.close()
    env.close()

0,1
average_episode_length,▁▁▂▃▄▄█████▇██████▇█
average_lenght_all_episodes,▁
average_total_reward,▁▁▂▃▄▄█████▇██████▇█
episode_length,▁▁▁▁▁▁▃▆▃▅▄▆▅████████████▇███████▇███▇██
loss_policy,▄▆▁▃█▅▃▄▄▆▁▃▃▄▄▃▅▅▄▄▅▅▄▄▅▂▅▅▆▅▅▅▄▂▅▃▄▅▅▅
running_reward,▁▂▂▂▃▄▅▇▆▇▇█▇██▇████████████████████████
test_average_episode_length,▁
test_average_total_reward,▁

0,1
average_episode_length,500.0
average_lenght_all_episodes,376.759
average_total_reward,500.0
episode_length,500.0
loss_policy,0.00401
running_reward,98.30638
test_average_episode_length,500.0
test_average_total_reward,500.0


Training agent with standardization baseline.
Running reward of episode 100/2000: 7.994547748675325
Average Total reward: 33.1
Running reward of episode 200/2000: 8.607860377723597
Average Total reward: 61.0
Running reward of episode 300/2000: 8.923415512879773
Average Total reward: 59.9
Running reward of episode 400/2000: 8.983670079746867
Average Total reward: 106.75
Running reward of episode 500/2000: 8.988613969397555
Average Total reward: 108.15
Running reward of episode 600/2000: 8.999476674461715
Average Total reward: 251.4
Running reward of episode 700/2000: 8.999936398746492
Average Total reward: 360.45
Running reward of episode 800/2000: 8.999999622998956
Average Total reward: 467.25
Running reward of episode 900/2000: 8.998725503422172
Average Total reward: 407.3
Running reward of episode 1000/2000: 8.993508044397183
Average Total reward: 346.5
Running reward of episode 1100/2000: 8.999961371523598
Average Total reward: 480.25
Running reward of episode 1200/2000: 8.999684541

0,1
average_episode_length,▁▁▁▂▂▄▆█▇▆██▆███████
average_lenght_all_episodes,▁
average_total_reward,▁▁▁▂▂▄▆█▇▆██▆███████
episode_length,▁▁▁▁▂▂▁▂▂▃▃▃▅▆██▇▄▆▃██▇▇███▆████████████
loss_policy,▆▆▅▄█▁▄▅▇▆▅▆▆▆▇▇▇▃▄▅▆▆▃▆▇▆▆▇▄▆█▆█▇▇▆█▇▅▇
running_reward,▁▆▆▇████████████████████████████████████
test_average_episode_length,▁
test_average_total_reward,▁

0,1
average_episode_length,500.0
average_lenght_all_episodes,340.3505
average_total_reward,500.0
episode_length,500.0
loss_policy,0.00283
running_reward,9.0
test_average_episode_length,500.0
test_average_total_reward,500.0


Training agent with standardization baseline.
Running reward of episode 100/2000: 8.610700356876613
Average Total reward: 53.5
Running reward of episode 200/2000: 8.93692639304487
Average Total reward: 70.75
Running reward of episode 300/2000: 8.99012332073504
Average Total reward: 163.6
Running reward of episode 400/2000: 8.999940542148764
Average Total reward: 371.9
Running reward of episode 500/2000: 8.924227289187757
Average Total reward: 146.5
Running reward of episode 600/2000: 8.98158204562783
Average Total reward: 417.7
Running reward of episode 700/2000: 8.952451993457252
Average Total reward: 498.35
Running reward of episode 800/2000: 8.999718475491187
Average Total reward: 500.0
Running reward of episode 900/2000: 8.999998333225896
Average Total reward: 500.0
Running reward of episode 1000/2000: 8.999999988882266
Average Total reward: 482.55
Running reward of episode 1100/2000: 8.99831160481705
Average Total reward: 395.85
Running reward of episode 1200/2000: 8.9999900038069

0,1
average_episode_length,▁▁▃▆▂▇████▆▆██████▆▄
average_lenght_all_episodes,▁
average_total_reward,▁▁▃▆▂▇████▆▆██████▆▄
episode_length,▁▁▁▂▂▂▄▅▇▃▄▄▇█▇██████▆▇█▆███▄▄████████▄▄
loss_policy,▆▆▇▆▅▆▇▅▆▅▆▅▆▅█▆▇▆█▇▆▇▆▇▇▇▆▆▅▁▅▆▆▆▅▆▆▅▄▅
running_reward,▁▆██████████████████████████████████████
test_average_episode_length,▁
test_average_total_reward,▁

0,1
average_episode_length,213.95
average_lenght_all_episodes,372.5655
average_total_reward,213.95
episode_length,249.6
loss_policy,-0.01006
running_reward,8.98799
test_average_episode_length,207.78
test_average_total_reward,207.78


Training agent with standardization baseline.
Running reward of episode 100/2000: 7.71236728667254
Average Total reward: 29.85
Running reward of episode 200/2000: 7.5045824692438705
Average Total reward: 19.75
Running reward of episode 300/2000: 8.844400446374683
Average Total reward: 61.6
Running reward of episode 400/2000: 8.976467059383236
Average Total reward: 99.7
Running reward of episode 500/2000: 8.972381102171436
Average Total reward: 220.9
Running reward of episode 600/2000: 8.99736446464825
Average Total reward: 302.95
Running reward of episode 700/2000: 8.999719314555634
Average Total reward: 426.9
Running reward of episode 800/2000: 8.953743257278715
Average Total reward: 481.2
Running reward of episode 900/2000: 8.999705806784347
Average Total reward: 500.0
Running reward of episode 1000/2000: 8.999998249774524
Average Total reward: 500.0
Running reward of episode 1100/2000: 8.999999378501848
Average Total reward: 500.0
Running reward of episode 1200/2000: 8.9999345882656

0,1
average_episode_length,▁▁▂▂▄▅▇█████████████
average_lenght_all_episodes,▁
average_total_reward,▁▁▂▂▄▅▇█████████████
episode_length,▁▁▁▁▁▁▂▂▂▂▃▄▇█▇▇████████████▇█████▇▇████
loss_policy,▆▇▃▁▄▆▆▇▇█▆█▆▇▇▆▇▆▆▇▆▇▇▇█▆▇▆▆▇▇▆▆▇▆▆▆▇▆▆
running_reward,▁▆▆▆▆▇██████████████████████████████████
test_average_episode_length,▁
test_average_total_reward,▁

0,1
average_episode_length,500.0
average_lenght_all_episodes,358.531
average_total_reward,500.0
episode_length,500.0
loss_policy,-0.00113
running_reward,8.99991
test_average_episode_length,500.0
test_average_total_reward,500.0


Training agent with standardization baseline.
Running reward of episode 100/2000: 7.806123420852687
Average Total reward: 24.85
Running reward of episode 200/2000: 8.839288587842182
Average Total reward: 66.45
Running reward of episode 300/2000: 8.985362091762077
Average Total reward: 215.35
Running reward of episode 400/2000: 8.992356651845018
Average Total reward: 235.9
Running reward of episode 500/2000: 8.999753145151516
Average Total reward: 299.2
Running reward of episode 600/2000: 8.998333289476468
Average Total reward: 320.35
Running reward of episode 700/2000: 8.968956133204532
Average Total reward: 413.95
Running reward of episode 800/2000: 8.990612746779243
Average Total reward: 455.5
Running reward of episode 900/2000: 8.99539400955497
Average Total reward: 489.6
Running reward of episode 1000/2000: 8.999954090911041
Average Total reward: 467.45
Running reward of episode 1100/2000: 8.999791028781559
Average Total reward: 472.75
Running reward of episode 1200/2000: 8.9999987

0,1
average_episode_length,▁▂▄▄▅▅▇▇███▅████████
average_lenght_all_episodes,▁
average_total_reward,▁▂▄▄▅▅▇▇███▅████████
episode_length,▁▁▁▁▂▂▃▃▆█▄▄▆▆▇▇▇█▇█▇▇█▅▆███████████████
loss_policy,▅▆▁▂▅▇▆██▅▇▅▅▆▅▆▅▇▅▅▅▆▆▆▅▆▆▆▆▆▅▇▆▆▅▆▅▅▆▆
running_reward,▁▆▆▇████████████████████████████████████
test_average_episode_length,▁
test_average_total_reward,▁

0,1
average_episode_length,500.0
average_lenght_all_episodes,374.3055
average_total_reward,500.0
episode_length,500.0
loss_policy,0.0042
running_reward,8.99999
test_average_episode_length,500.0
test_average_total_reward,500.0


Training agent with standardization baseline.
Running reward of episode 100/2000: 8.37756749381354
Average Total reward: 32.35
Running reward of episode 200/2000: 8.714103535021597
Average Total reward: 69.7
Running reward of episode 300/2000: 8.972797166807794
Average Total reward: 137.5
Running reward of episode 400/2000: 8.991433354967826
Average Total reward: 384.3
Running reward of episode 500/2000: 8.994922394511539
Average Total reward: 338.15
Running reward of episode 600/2000: 8.972180263806955
Average Total reward: 465.7
Running reward of episode 700/2000: 8.999828892203816
Average Total reward: 444.25
Running reward of episode 800/2000: 8.999941640250848
Average Total reward: 376.7
Running reward of episode 900/2000: 8.99989967599757
Average Total reward: 495.45
Running reward of episode 1000/2000: 8.999999405699516
Average Total reward: 490.9
Running reward of episode 1100/2000: 8.999999996481405
Average Total reward: 500.0
Running reward of episode 1200/2000: 8.99999992683
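
A note on reading these logs: `running_reward` plateaus at roughly 9.0 in the $\gamma = 0.9$ runs, against roughly 98.35 in the $\gamma = 0.99$ runs, even though episodes reach the 500-step cap in both cases. One reading that matches both numbers is that `running_reward` tracks a discounted return in which the first reward is already multiplied by $\gamma$. This is an inference from the logged values, not something confirmed by the code shown in this section. Under that assumption, the plateau is the finite geometric sum computed below (`discounted_return_cap` is a hypothetical helper).

```python
def discounted_return_cap(gamma, horizon=500, first_power=1):
    """Sum of gamma**k over `horizon` steps with reward 1 per step.

    first_power=1 assumes the first reward is already discounted by gamma
    (an assumption inferred from the logs); use first_power=0 otherwise.
    """
    return sum(gamma ** k for k in range(first_power, first_power + horizon))

cap_099 = discounted_return_cap(0.99)
cap_090 = discounted_return_cap(0.90)
print(round(cap_099, 4), round(cap_090, 4))
```

If this reading is right, `running_reward` is not comparable across different values of $\gamma$; the episode length and the undiscounted total reward are the quantities to compare.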

-----
## Exercise 2: `REINFORCE` with a Value Baseline (warm up)

In this exercise we will augment my implementation (or your own) of `REINFORCE` to subtract a baseline from the target in the update equation in order to stabilize (and hopefully speed up) convergence. For now we will stick to the Cartpole environment.



**First Things First**: Recall from the slides on Deep Reinforcement Learning that we can **subtract** any function that doesn't depend on the current action from the q-value without changing the (maximum of our) objective function $J$:

$$ \nabla J(\boldsymbol{\theta}) \propto \sum_{s} \mu(s) \sum_a \left( q_{\pi}(s, a) - b(s) \right) \nabla \pi(a \mid s, \boldsymbol{\theta}) $$
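
To see why the baseline term drops out, note that $b(s)$ can be pulled out of the sum over actions, and that the action probabilities sum to one for every $\boldsymbol{\theta}$. A short derivation, using the same symbols as the expression above:

$$ \sum_a b(s) \nabla \pi(a \mid s, \boldsymbol{\theta}) = b(s) \nabla \sum_a \pi(a \mid s, \boldsymbol{\theta}) = b(s) \nabla 1 = 0 $$

So the expected gradient is unchanged for any $b(s)$; only the variance of the sampled update is affected.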

In `REINFORCE` this means we can subtract from our target $G_t$:

$$ \boldsymbol{\theta}_{t+1} \triangleq \boldsymbol{\theta}_t + \alpha (G_t - b(S_t)) \frac{\nabla \pi(A_t \mid s, \boldsymbol{\theta})}{\pi(A_t \mid s, \boldsymbol{\theta})} $$

Since we are only interested in the **maximum** of our objective, we can also **rescale** our target by any positive function that also doesn't depend on the action. A **simple baseline** which is even independent of the state -- that is, it is **constant** for each episode -- is to just **standardize the returns within the episode**. So, we **subtract** the average return and **divide** by the standard deviation of the returns:

$$ \boldsymbol{\theta}_{t+1} \triangleq \boldsymbol{\theta}_t + \alpha \left(\frac{G_t - \bar{G}}{\sigma_G}\right) \frac{\nabla \pi(A_t \mid s, \boldsymbol{\theta})}{\pi(A_t \mid s, \boldsymbol{\theta})} $$

This baseline is **already** implemented in my implementation of `REINFORCE`. Experiment with and without this standardization baseline and compare the performance. We are going to do something more interesting.
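
As a minimal sketch of the standardization step (not the notebook's own code), the snippet below computes per-step discounted returns and standardizes them within one episode. It assumes the common convention $G_t = R_{t+1} + \gamma G_{t+1}$ and a population standard deviation with a small `eps` for numerical safety; the notebook's implementation may differ on both points. The helper names `discounted_returns` and `standardize` are hypothetical.

```python
import math

def discounted_returns(rewards, gamma):
    """Return G_t for every step t, accumulated backwards through the episode."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

def standardize(returns, eps=1e-8):
    """Subtract the mean return and divide by the standard deviation of returns."""
    mean = sum(returns) / len(returns)
    var = sum((g - mean) ** 2 for g in returns) / len(returns)
    std = math.sqrt(var)
    return [(g - mean) / (std + eps) for g in returns]

# A 5-step episode with reward 1 per step, as in CartPole.
rewards = [1.0] * 5
returns = discounted_returns(rewards, gamma=0.99)
standardized = standardize(returns)
print(returns)
print(standardized)
```

After standardization, early steps (large $G_t$) get positive weights and late steps get negative ones, regardless of how long the episode was.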

## Without the baseline
Since I have already experimented with `REINFORCE` using the standardization baseline, the next runs use no baseline of any kind.

In [14]:
#random seeds, to reproduce the same results
seeds = [1, 11, 111, 1111, 11111]
for i in range(len(seeds)):

    # Training and architecture hyperparameters; initialise a wandb run.
    run=wandb.init(
          project="Lab3-DRL-warmups",
          name = "Without the baseline ",
          config={
              "hidden_layers": 32,
              "num_episodes": 2000,
              "gamma": 0.99,
              "baseline": None,
              "eval_every":100,
              "eval_episodes": 20,
              "test_episodes" : 200,
              "temperature" : 5,
              "lr" : 1e-2,
              "lr_baseline" : 1e-3
              })
    
    # Copy the configuration
    config = wandb.config

    # Instantiate two versions of CartPole: one that does not animate, and another that
    # animates the episodes (which slows everything down).
    env = gymnasium.make('CartPole-v1')
    env_render = gymnasium.make('CartPole-v1', render_mode='human')

    #set the seed
    torch.manual_seed(seeds[i])
    env.reset(seed = seeds[i])
    env_render.reset(seed = seeds[i])
    
    # Make a policy network.
    policy = PolicyNet(env, config.hidden_layers, config.temperature)
    
    # Create episode_runner
    episode_runner= Episode_runner(env, policy)
    episode_runner_rend= Episode_runner(env_render, policy)
    
    # Train the agent
    best_model_state_dict = reinforce(episode_runner, run, episode_runner_rend, gamma=config.gamma, num_episodes=config.num_episodes,
              baseline= config.baseline, display=False, eval_every=config.eval_every,
              eval_episodes=config.eval_episodes, lr= config.lr, lr_baseline = config.lr_baseline )
    
    # Load the best policy into the deterministic episode runner to test it
    episode_runner.policy.load_state_dict(best_model_state_dict)
    det_ep_runner = Determinist_Test_Episode_runner(episode_runner, episode_runner_rend )
    det_ep_runner.test(test_episodes=config.test_episodes)
    
    # Close up everything
    env_render.close()
    env.close()

0,1
average_episode_length,▁▂▃▆▆▇▇▆█████████▄▇█
average_lenght_all_episodes,▁
average_total_reward,▁▂▃▆▆▇▇▆█████████▄▇█
episode_length,▁▁▁▁▁▂▃▄▆▅▆▆█▆▇▇▆█████▇█▇█▇██▇██▇█▇████▇
loss_policy,▁▆▁▂▆▅▄▅▄▅▃▃▄█▄▃▃▅▄▄▅▅▄▄▅▆▅▅▅▅▃▄▅▇▆▅▄▄▄▄
running_reward,▁▆▇█▇███████████████████████████████████
test_average_episode_length,▁
test_average_total_reward,▁

0,1
average_episode_length,499.15
average_lenght_all_episodes,387.739
average_total_reward,499.15
episode_length,459.7
loss_policy,0.00077
running_reward,9.0
test_average_episode_length,500.0
test_average_total_reward,500.0


Training agent with no baseline.
Running reward of episode 100/2000: 20.367188207392463
Average Total reward: 23.6
Running reward of episode 200/2000: 28.356962689132324
Average Total reward: 32.65
Running reward of episode 300/2000: 50.196384699792084
Average Total reward: 97.05
Running reward of episode 400/2000: 77.38302344667393
Average Total reward: 164.2
Running reward of episode 500/2000: 89.90828410855038
Average Total reward: 358.75
Running reward of episode 600/2000: 73.04345409524544
Average Total reward: 270.0
Running reward of episode 700/2000: 66.43086691635207
Average Total reward: 185.9
Running reward of episode 800/2000: 75.47564028859684
Average Total reward: 209.3
Running reward of episode 900/2000: 94.2267822808414
Average Total reward: 342.45
Running reward of episode 1000/2000: 92.01660900812905
Average Total reward: 500.0
Running reward of episode 1100/2000: 98.3102797721196
Average Total reward: 500.0
Running reward of episode 1200/2000: 98.34929309646748
Averag

0,1
average_episode_length,▁▁▂▃▆▅▃▄▆███▆▅███▇█▅
average_lenght_all_episodes,▁
average_total_reward,▁▁▂▃▆▅▃▄▆███▆▅███▇█▅
episode_length,▁▁▁▁▁▂▂▂▂▃▄▂▄▂▆▄▄▆▃▅██████▇█▂▅▇▇███████▃
loss_policy,▁▁▂▄▃▄▅▆▇▇▇▅▆▅▆█▇▆█▆▄▄▄▄▄▄▃▄▅▆▄▄▄▄▄▄▄▄▄▆
running_reward,▁▂▂▃▂▃▄▅▆▆▇▅▇▅▇▆▆▇▇▆██████▇█▅▇█████████▆
test_average_episode_length,▁
test_average_total_reward,▁

0,1
average_episode_length,283.15
average_lenght_all_episodes,299.879
average_total_reward,283.15
episode_length,124.8
loss_policy,14.09194
running_reward,66.97933
test_average_episode_length,500.0
test_average_total_reward,500.0


Training agent with no baseline.
Running reward of episode 100/2000: 19.94712459018216
Average Total reward: 20.65
Running reward of episode 200/2000: 19.52804798016683
Average Total reward: 25.1
Running reward of episode 300/2000: 18.746739144450807
Average Total reward: 20.0
Running reward of episode 400/2000: 51.74657560544224
Average Total reward: 97.65
Running reward of episode 500/2000: 74.99367015507983
Average Total reward: 182.3
Running reward of episode 600/2000: 79.68546876043298
Average Total reward: 161.95
Running reward of episode 700/2000: 85.04253764657325
Average Total reward: 184.2
Running reward of episode 800/2000: 94.52771599271692
Average Total reward: 330.3
Running reward of episode 900/2000: 96.24765622810492
Average Total reward: 494.75
Running reward of episode 1000/2000: 91.77900350005297
Average Total reward: 252.25
Running reward of episode 1100/2000: 83.04294433415521
Average Total reward: 231.3
Running reward of episode 1200/2000: 96.1195383919827
Average

0,1
average_episode_length,▁▁▁▂▃▃▃▆█▄▄▇▂▃▁▂██▇▄
average_lenght_all_episodes,▁
average_total_reward,▁▁▁▂▃▃▃▆█▄▄▇▂▃▁▂██▇▄
episode_length,▁▁▁▁▁▁▁▂▂▄▃▄▂▄▄▆▅███▄▃▆▇▇▅▂▂▁▁▇▅▅▅██▇█▇▃
loss_policy,▃▃▃▂▂▂▅▆▆██▇▆██▇▇▅▅▅█▇▆▅▆▆▄▅▂▁▅▇▆▄▅▅▅▅▅▃
running_reward,▁▂▂▂▂▁▂▄▅▆▆▆▆▇▇▇████▇▇▇███▅▄▃▂▆▇▆▆█████▅
test_average_episode_length,▁
test_average_total_reward,▁

0,1
average_episode_length,197.85
average_lenght_all_episodes,250.1985
average_total_reward,197.85
episode_length,131.2
loss_policy,7.18892
running_reward,58.00466
test_average_episode_length,358.38
test_average_total_reward,358.38


Training agent with no baseline.
Running reward of episode 100/2000: 22.890720422596
Average Total reward: 33.55
Running reward of episode 200/2000: 35.291455188592074
Average Total reward: 44.65
Running reward of episode 300/2000: 53.717657421400745
Average Total reward: 86.35
Running reward of episode 400/2000: 61.880180400354455
Average Total reward: 132.35
Running reward of episode 500/2000: 35.89277275919185
Average Total reward: 53.5
Running reward of episode 600/2000: 50.69716957968764
Average Total reward: 60.85
Running reward of episode 700/2000: 48.53642207868032
Average Total reward: 73.75
Running reward of episode 800/2000: 80.04372163275795
Average Total reward: 213.85
Running reward of episode 900/2000: 98.15710645086948
Average Total reward: 500.0
Running reward of episode 1000/2000: 98.34838622934323
Average Total reward: 500.0
Running reward of episode 1100/2000: 98.34951870686092
Average Total reward: 500.0
Running reward of episode 1200/2000: 98.3188037686289
Average

0,1
average_episode_length,▁▁▂▂▁▁▂▄████████████
average_lenght_all_episodes,▁
average_total_reward,▁▁▂▂▁▁▂▄████████████
episode_length,▁▁▁▁▁▂▂▂▂▂▄▃▂▂▂▃▅██████████████████▇████
loss_policy,▁▁▂▄▄▆▆▆▆▄▆▅▄▅▅█▇▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▅▄▃▄▄
running_reward,▁▂▂▂▃▄▄▅▅▄▄▆▄▆▄▅▇███████████████████████
test_average_episode_length,▁
test_average_total_reward,▁

0,1
average_episode_length,500.0
average_lenght_all_episodes,329.1145
average_total_reward,500.0
episode_length,487.5
loss_policy,10.36318
running_reward,98.17755
test_average_episode_length,500.0
test_average_total_reward,500.0


Training agent with no baseline.
Running reward of episode 100/2000: 21.736855232360256
Average Total reward: 30.6
Running reward of episode 200/2000: 27.39902428751464
Average Total reward: 36.75
Running reward of episode 300/2000: 41.56982564925054
Average Total reward: 76.8
Running reward of episode 400/2000: 65.0520883257497
Average Total reward: 133.85
Running reward of episode 500/2000: 51.80863923915272
Average Total reward: 79.9
Running reward of episode 600/2000: 95.16295773472105
Average Total reward: 478.1
Running reward of episode 700/2000: 88.90569476059035
Average Total reward: 225.45
Running reward of episode 800/2000: 77.05751689646934
Average Total reward: 259.7
Running reward of episode 900/2000: 53.12773204415513
Average Total reward: 68.2
Running reward of episode 1000/2000: 31.86528201448646
Average Total reward: 94.45
Running reward of episode 1100/2000: 68.95688293451661
Average Total reward: 102.45
Running reward of episode 1200/2000: 77.46725857204423
Average T

0,1
average_episode_length,▁▁▂▃▂█▄▅▂▂▂▃▃▃▂▃▂▂▂▂
average_lenght_all_episodes,▁
average_total_reward,▁▁▂▃▂█▄▅▂▂▂▃▃▃▂▃▂▂▂▂
episode_length,▁▁▁▁▁▂▁▂▂▄▃▆▇█▃▃▇▂▂▁▃▃▃▃▃▃▄▄▃▂▃▃▃▂▂▂▂▂▂▂
loss_policy,▁▃▃▃▂▅▄▇▇██▆▆▅█▇▅▄▄▁▆▇▇▆▇▇▆▇▇▅▇▆▅▄▃▃▄▄▄▄
running_reward,▁▂▂▂▃▃▄▄▅▆▅▇██▇▆▇▅▄▂▅▆▆▆▆▇▇█▇▆▆▆▆▅▄▅▅▅▅▆
test_average_episode_length,▁
test_average_total_reward,▁

0,1
average_episode_length,104.6
average_lenght_all_episodes,134.5575
average_total_reward,104.6
episode_length,109.9
loss_policy,10.44287
running_reward,66.05285
test_average_episode_length,110.66
test_average_total_reward,110.66


Training agent with no baseline.
Running reward of episode 100/2000: 21.247652941300803
Average Total reward: 27.85
Running reward of episode 200/2000: 38.12038895776915
Average Total reward: 44.25
Running reward of episode 300/2000: 50.365668498263894
Average Total reward: 91.3
Running reward of episode 400/2000: 83.9404025617387
Average Total reward: 246.55
Running reward of episode 500/2000: 77.85885357293179
Average Total reward: 268.5
Running reward of episode 600/2000: 76.79211822645209
Average Total reward: 198.6
Running reward of episode 700/2000: 80.21160043222355
Average Total reward: 134.95
Running reward of episode 800/2000: 57.933488226697996
Average Total reward: 114.0
Running reward of episode 900/2000: 97.16289646088337
Average Total reward: 500.0
Running reward of episode 1000/2000: 59.330597550833154
Average Total reward: 94.25
Running reward of episode 1100/2000: 14.857272275588404
Average Total reward: 26.6
Running reward of episode 1200/2000: 32.35128633998752
Aver

**The Real Exercise**: Standard practice is to use the state-value function $v(s)$ as a baseline. This is intuitively appealing: we are most interested in updating our policy for returns that the current **value** estimate predicts poorly. Our new update becomes:

$$ \boldsymbol{\theta}_{t+1} \triangleq \boldsymbol{\theta}_t + \alpha (G_t - \tilde{v}(S_t \mid \mathbf{w})) \frac{\nabla \pi(A_t \mid S_t, \boldsymbol{\theta}_t)}{\pi(A_t \mid S_t, \boldsymbol{\theta}_t)} $$

where $\tilde{v}(s \mid \mathbf{w})$ is a **deep neural network** with parameters $\mathbf{w}$ that estimates $v_\pi(s)$. Which neural network? Typically, we use the **same** architecture as the policy network.

**Your Task**: Modify your implementation to fit a second network that estimates the value function, and use it as the **baseline**.
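
The update above can be sketched as one gradient step on two losses. This is a minimal illustration on random data, not the `reinforce` function used in this notebook: the helper names `discounted_returns` and `baseline_losses`, the tiny `nn.Sequential` networks, and the choice of a mean-squared-error value loss are all assumptions made for the example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def discounted_returns(rewards, gamma):
    """Compute G_t = r_t + gamma * G_{t+1} for every step of one episode."""
    returns, running = [], 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    returns.reverse()
    return torch.tensor(returns, dtype=torch.float32)


def baseline_losses(log_probs, returns, values):
    """Policy loss weighted by (G_t - v(S_t)) and a regression loss for v."""
    # Detach the values so the policy loss does not send gradients into the value net.
    advantages = returns - values.detach()
    policy_loss = -(log_probs * advantages).mean()
    # Assumed choice: fit the value net to the observed returns with MSE.
    value_loss = F.mse_loss(values, returns)
    return policy_loss, value_loss


torch.manual_seed(0)
obs_dim, n_actions, T = 4, 2, 5  # CartPole-like sizes, hypothetical episode of 5 steps

policy = nn.Sequential(nn.Linear(obs_dim, 32), nn.ReLU(), nn.Linear(32, n_actions))
value = nn.Sequential(nn.Linear(obs_dim, 32), nn.ReLU(), nn.Linear(32, 1))
opt_policy = torch.optim.Adam(policy.parameters(), lr=1e-2)
opt_value = torch.optim.Adam(value.parameters(), lr=1e-3)

# Fake episode: random states, sampled actions, reward 1 per step.
states = torch.randn(T, obs_dim)
dist = torch.distributions.Categorical(logits=policy(states))
actions = dist.sample()
log_probs = dist.log_prob(actions)
returns = discounted_returns([1.0] * T, gamma=0.99)
values = value(states).squeeze(-1)

policy_loss, value_loss = baseline_losses(log_probs, returns, values)

# The two losses touch disjoint parameters, so they can be stepped independently.
opt_policy.zero_grad()
opt_value.zero_grad()
policy_loss.backward()
value_loss.backward()
opt_policy.step()
opt_value.step()
```

Using separate optimizers mirrors the separate `lr` and `lr_baseline` settings in the run configurations; whether the notebook's own training loop averages or sums the per-step terms is not shown here.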

## With Baseline Network

In [8]:
# A simple network with the same architecture as the policy network; it learns to estimate the state-value function.
class ValueNet(nn.Module):
    def __init__(self, env, hidden_layers=32):
        # Note: `hidden_layers` is the width of the single hidden layer, not a layer count.
        super().__init__()
        self.fc1 = nn.Linear(env.observation_space.shape[0], hidden_layers)
        self.fc2 = nn.Linear(hidden_layers, 1)
        self.relu = nn.ReLU()

    def forward(self, s):
        # Hidden layer with ReLU, then a linear head that outputs one scalar value per state.
        s = self.relu(self.fc1(s))
        s = self.fc2(s)
        return s

In [16]:
# Random seeds, so that the results are reproducible
seeds = [1, 11, 111, 1111, 11111]
for i in range(len(seeds)):

    # Training and architecture hyperparameters; initialise a wandb run
    run=wandb.init(
          project="Lab3-DRL-warmups",
          name = "Baseline Network ",
          config={
              "hidden_layers": 32,
              "num_episodes": 2000,
              "gamma": 0.99,
              "baseline": 'net',
              "eval_every":100,
              "eval_episodes": 20,
              "test_episodes" : 200,
              "temperature" : 5,
              "lr" : 1e-2,
              "lr_baseline" : 1e-3
              })
    
    # Copy the configuration
    config = wandb.config

    # Instantiate two versions of CartPole: one that animates the episodes (which slows everything
    # down), and another that does not animate.
    env = gymnasium.make('CartPole-v1')
    env_render = gymnasium.make('CartPole-v1', render_mode='human')

    # Set the seed
    torch.manual_seed(seeds[i])
    env.reset(seed = seeds[i])
    env_render.reset(seed = seeds[i])
    
    # Make a policy network.
    policy = PolicyNet(env, config.hidden_layers, config.temperature)

    # Make a value network
    value = ValueNet(env, config.hidden_layers)
    
    # Create episode_runner
    episode_runner= Episode_runner(env, policy)
    episode_runner_rend= Episode_runner(env_render, policy)
    
    # Train the agent
    best_model_state_dict = reinforce(episode_runner, run, episode_runner_rend, gamma=config.gamma, num_episodes=config.num_episodes,
              baseline= value, display=False, eval_every=config.eval_every,
              eval_episodes=config.eval_episodes, lr= config.lr, lr_baseline = config.lr_baseline )
    
    # Load the best policy and test it with the deterministic episode runner
    episode_runner.policy.load_state_dict(best_model_state_dict)
    det_ep_runner = Determinist_Test_Episode_runner(episode_runner, episode_runner_rend )
    det_ep_runner.test(test_episodes=config.test_episodes)
    
    # Close up everything
    env_render.close()
    env.close()

0,1
average_episode_length,▁▁▂▄▅▄▃▂█▂▁▂████████
average_lenght_all_episodes,▁
average_total_reward,▁▁▂▄▅▄▃▂█▂▁▂████████
episode_length,▁▁▁▁▂▂▃▃▃▂▂▂▅▃▃▂▃██▇▁▁▁▂▃███████████████
loss_policy,▃▃▃▅▅▆██▆▆▆▅▆▇▇▅▇▅▅▅▁▁▃▄▆▅▅▅▅▅▄▄▅▄▄▄▄▄▄▄
running_reward,▁▂▂▂▃▄▅▆▆▄▆▄▇▇▆▄▆██▇▂▁▂▃▅███████████████
test_average_episode_length,▁
test_average_total_reward,▁

0,1
average_episode_length,500.0
average_lenght_all_episodes,278.86
average_total_reward,500.0
episode_length,500.0
loss_policy,9.52762
running_reward,98.34953
test_average_episode_length,500.0
test_average_total_reward,500.0


Training agent with baseline value network.
Running reward of episode 100/2000: 18.385507946235805
Average Total reward: 24.8
Running reward of episode 200/2000: 27.337891315924985
Average Total reward: 40.3
Running reward of episode 300/2000: 48.94506384717461
Average Total reward: 94.2
Running reward of episode 400/2000: 87.7221432319426
Average Total reward: 358.5
Running reward of episode 500/2000: 96.1324490482994
Average Total reward: 473.25
Running reward of episode 600/2000: 98.33093439318282
Average Total reward: 500.0
Running reward of episode 700/2000: 98.25359321689321
Average Total reward: 500.0
Running reward of episode 800/2000: 94.15650484549951
Average Total reward: 225.55
Running reward of episode 900/2000: 97.45551554852823
Average Total reward: 472.75
Running reward of episode 1000/2000: 97.06465916173579
Average Total reward: 500.0
Running reward of episode 1100/2000: 98.07286247669177
Average Total reward: 500.0
Running reward of episode 1200/2000: 98.346828250632

0,1
average_episode_length,▁▁▂▆███▄████████████
average_lenght_all_episodes,▁
average_total_reward,▁▁▂▆███▄████████████
episode_length,▁▁▁▁▁▁▂▃▇▃█████▇▄██▇████████████████████
loss_policy,▅▇▆▅▇▇▇█▆█▃▂▂▂▂▂▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁
loss_value,▁▃▂▁▃▃▅▇▇█▆▆▆▆▆▆▆▆▆▅▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆
running_reward,▁▂▂▂▃▃▄▆▇▇██████▇███████████████████████
test_average_episode_length,▁
test_average_total_reward,▁

0,1
average_episode_length,500.0
average_lenght_all_episodes,398.129
average_total_reward,500.0
episode_length,500.0
loss_policy,-0.11073
loss_value,598.27344
running_reward,98.34953
test_average_episode_length,500.0
test_average_total_reward,500.0


Training agent with baseline value network.
Running reward of episode 100/2000: 16.83578667352332
Average Total reward: 17.7
Running reward of episode 200/2000: 22.965908322185395
Average Total reward: 28.9
Running reward of episode 300/2000: 55.90019131914127
Average Total reward: 128.55
Running reward of episode 400/2000: 87.75277591504432
Average Total reward: 234.25
Running reward of episode 500/2000: 92.53298793620372
Average Total reward: 239.55
Running reward of episode 600/2000: 96.16116664149013
Average Total reward: 488.55
Running reward of episode 700/2000: 96.46826864501728
Average Total reward: 490.9
Running reward of episode 800/2000: 70.62004861917573
Average Total reward: 103.15
Running reward of episode 900/2000: 78.2246071553638
Average Total reward: 158.25
Running reward of episode 1000/2000: 95.39791992764842
Average Total reward: 500.0
Running reward of episode 1100/2000: 98.33205038490816
Average Total reward: 500.0
Running reward of episode 1200/2000: 98.34942199

0,1
average_episode_length,▁▁▃▄▄██▂▃█████▅█████
average_lenght_all_episodes,▁
average_total_reward,▁▁▃▄▄██▂▃█████▅█████
episode_length,▁▁▁▁▁▂▃▅▄▅▆█▇▇▇▆▂▃▃▇██████████▅▄▄███████
loss_policy,▄▅▅▅▇▇█▇▇▆▄▃▂▂▂▃▃▄▃▁▂▂▂▂▂▂▂▂▂▂▂▃▃▁▂▂▂▂▂▂
loss_value,▁▁▁▁▃▄███▇▆▅▅▅▅▅▃▅▅▅▅▅▅▅▅▅▅▅▅▅▆▅▆▅▅▅▅▅▅▅
running_reward,▁▂▂▂▂▄▆▇▇▇▇█████▆▆▇███████████▇█████████
test_average_episode_length,▁
test_average_total_reward,▁

0,1
average_episode_length,500.0
average_lenght_all_episodes,350.9945
average_total_reward,500.0
episode_length,500.0
loss_policy,-0.07509
loss_value,598.49774
running_reward,98.3495
test_average_episode_length,500.0
test_average_total_reward,500.0


Training agent with baseline value network.
Running reward of episode 100/2000: 22.281848419537802
Average Total reward: 28.0
Running reward of episode 200/2000: 53.62815686467122
Average Total reward: 115.1
Running reward of episode 300/2000: 83.76246841043545
Average Total reward: 268.9
Running reward of episode 400/2000: 63.74858045082211
Average Total reward: 102.6
Running reward of episode 500/2000: 87.97375980632762
Average Total reward: 336.95
Running reward of episode 600/2000: 97.6647314586418
Average Total reward: 481.65
Running reward of episode 700/2000: 97.65936320372809
Average Total reward: 493.55
Running reward of episode 800/2000: 98.28467008855114
Average Total reward: 495.15
Running reward of episode 900/2000: 97.7117098206728
Average Total reward: 488.8
Running reward of episode 1000/2000: 98.01391223908752
Average Total reward: 488.7
Running reward of episode 1100/2000: 86.66527912355751
Average Total reward: 269.25
Running reward of episode 1200/2000: 98.240284673

0,1
average_episode_length,▁▂▅▂▆█████▅█████████
average_lenght_all_episodes,▁
average_total_reward,▁▂▅▂▆█████▅█████████
episode_length,▁▁▁▂▃▄▄▅▄█▆▇█████▇██▇███████▇███████▅▇██
loss_policy,▄▄▅▇██▇▆▅▃▃▂▄▃▃▂▂▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▃▁▂▂
loss_value,▁▁▂▄▇█▇▇▆▅▅▅▅▅▅▅▅▅▅▅▄▅▅▅▅▅▅▅▄▅▅▅▅▅▅▅▅▄▅▅
running_reward,▁▂▂▃▄▆▇▇▆█▇███████████▇█████████████████
test_average_episode_length,▁
test_average_total_reward,▁

0,1
average_episode_length,500.0
average_lenght_all_episodes,400.676
average_total_reward,500.0
episode_length,500.0
loss_policy,0.15962
loss_value,598.25964
running_reward,98.34289
test_average_episode_length,500.0
test_average_total_reward,500.0


Training agent with baseline value network.
Running reward of episode 100/2000: 17.44650506993677
Average Total reward: 21.3
Running reward of episode 200/2000: 25.488443079499636
Average Total reward: 32.15
Running reward of episode 300/2000: 41.63601094207207
Average Total reward: 88.9
Running reward of episode 400/2000: 87.6099692984882
Average Total reward: 226.35
Running reward of episode 500/2000: 85.55186051180435
Average Total reward: 127.85
Running reward of episode 600/2000: 95.44625588271197
Average Total reward: 370.9
Running reward of episode 700/2000: 98.15177528843111
Average Total reward: 487.85
Running reward of episode 800/2000: 98.3110262229343
Average Total reward: 500.0
Running reward of episode 900/2000: 98.27358853264109
Average Total reward: 500.0
Running reward of episode 1000/2000: 98.25829019228898
Average Total reward: 497.4
Running reward of episode 1100/2000: 98.27441217162068
Average Total reward: 500.0
Running reward of episode 1200/2000: 97.181552630923

0,1
average_episode_length,▁▁▂▄▃▆██████████████
average_lenght_all_episodes,▁
average_total_reward,▁▁▂▄▃▆██████████████
episode_length,▁▁▁▁▁▂▃▆▆█▃▇▆██████████▆████████████████
loss_policy,▅▅▄▅▅██▆▅▄▅▃▃▂▂▂▁▁▁▁▁▁▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁
loss_value,▁▂▁▃▂▆█▇▇▆▆▆▆▅▅▅▅▅▅▅▅▅▅▆▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅
running_reward,▁▂▁▂▂▃▅▇▇█▆█████████████████████████████
test_average_episode_length,▁
test_average_total_reward,▁

0,1
average_episode_length,500.0
average_lenght_all_episodes,399.6445
average_total_reward,500.0
episode_length,500.0
loss_policy,0.22547
loss_value,598.12701
running_reward,98.32808
test_average_episode_length,500.0
test_average_total_reward,500.0


Training agent with baseline value network.
Running reward of episode 100/2000: 21.38267050170289
Average Total reward: 28.0
Running reward of episode 200/2000: 23.217362229512833
Average Total reward: 35.8
Running reward of episode 300/2000: 36.94217072842432
Average Total reward: 39.45
Running reward of episode 400/2000: 62.832406715066156
Average Total reward: 110.65
Running reward of episode 500/2000: 93.43508517402675
Average Total reward: 285.9
Running reward of episode 600/2000: 95.65822266827351
Average Total reward: 418.4
Running reward of episode 700/2000: 95.69248309437596
Average Total reward: 487.45
Running reward of episode 800/2000: 98.08902420205092
Average Total reward: 456.25
Running reward of episode 900/2000: 97.94376891505759
Average Total reward: 412.05
Running reward of episode 1000/2000: 93.72715203393149
Average Total reward: 500.0
Running reward of episode 1100/2000: 98.32215855477298
Average Total reward: 500.0
Running reward of episode 1200/2000: 98.32225553
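
Because individual runs are noisy, it is more informative to report the mean and spread across the five seeds than any single curve. The sketch below does this for the first three evaluation checkpoints, using the "Average Total reward" values copied from the five "baseline value network" logs above. The helper name `aggregate_runs` is hypothetical, and mean plus or minus one sample standard deviation is only one possible choice of band.

```python
from statistics import mean, stdev

# "Average Total reward" at episodes 100, 200 and 300, one list per run,
# copied from the five training logs printed above.
eval_curves = [
    [24.8, 40.3, 94.2],
    [17.7, 28.9, 128.55],
    [28.0, 115.1, 268.9],
    [21.3, 32.15, 88.9],
    [28.0, 35.8, 39.45],
]


def aggregate_runs(curves):
    """Return a (mean, sample std) pair per checkpoint, computed across runs."""
    return [(mean(column), stdev(column)) for column in zip(*curves)]


summary = aggregate_runs(eval_curves)
for episode, (m, s) in zip((100, 200, 300), summary):
    print(f"episode {episode}: {m:.2f} +/- {s:.2f}")
```

The same function applies unchanged to full-length curves, as long as every run is evaluated at the same checkpoints.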

## With Baseline Network and higher temperature
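
How `PolicyNet` uses its `temperature` argument is not shown in this section. A common choice, assumed here, is to divide the logits by the temperature before the softmax, so that a higher temperature gives a flatter action distribution and therefore more exploration. The function `softmax_with_temperature` and the example logits are illustrative only.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Softmax over logits / temperature; higher temperature flattens the output."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    # Subtract the maximum for numerical stability.
    top = max(scaled)
    exps = [math.exp(x - top) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 0.0]  # hypothetical preferences for the two CartPole actions
probs_by_temperature = {
    t: softmax_with_temperature(logits, t) for t in (1.0, 5.0, 10.0)
}
for t, probs in probs_by_temperature.items():
    print(t, [round(p, 3) for p in probs])
```

As the temperature grows, both probabilities move toward 0.5, which is the behaviour the runs with temperature 5 and 10 rely on to avoid a prematurely peaked policy.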

In [17]:
# Random seeds, so that the results are reproducible
seeds = [1, 11, 111, 1111, 11111]
for i in range(len(seeds)):

    # Training and architecture hyperparameters; initialise a wandb run
    run=wandb.init(
          project="Lab3-DRL-warmups",
          name = "Baseline Network and temp 10",
          config={
              "hidden_layers": 32,
              "num_episodes": 2000,
              "gamma": 0.99,
              "baseline": 'net',
              "eval_every":100,
              "eval_episodes": 20,
              "test_episodes" : 200,
              "temperature" : 10,
              "lr" : 1e-2,
              "lr_baseline" : 1e-3
              })
    
    # Copy the configuration
    config = wandb.config

    # Instantiate two versions of CartPole: one that animates the episodes (which slows everything
    # down), and another that does not animate.
    env = gymnasium.make('CartPole-v1')
    env_render = gymnasium.make('CartPole-v1', render_mode='human')

    # Set the seed
    torch.manual_seed(seeds[i])
    env.reset(seed = seeds[i])
    env_render.reset(seed = seeds[i])
    
    # Make a policy network.
    policy = PolicyNet(env, config.hidden_layers, config.temperature)

    # Make a value network
    value = ValueNet(env, config.hidden_layers)
    
    # Create episode_runner
    episode_runner= Episode_runner(env, policy)
    episode_runner_rend= Episode_runner(env_render, policy)
    
    # Train the agent
    best_model_state_dict = reinforce(episode_runner, run, episode_runner_rend, gamma=config.gamma, num_episodes=config.num_episodes,
              baseline= value, display=False, eval_every=config.eval_every,
              eval_episodes=config.eval_episodes, lr= config.lr, lr_baseline = config.lr_baseline )
    
    # Load the best policy and test it with the deterministic episode runner
    episode_runner.policy.load_state_dict(best_model_state_dict)
    det_ep_runner = Determinist_Test_Episode_runner(episode_runner, episode_runner_rend )
    det_ep_runner.test(test_episodes=config.test_episodes)
    
    # Close up everything
    env_render.close()
    env.close()

0,1
average_episode_length,▁▁▁▂▅▇█▇▇████▆██████
average_lenght_all_episodes,▁
average_total_reward,▁▁▁▂▅▇█▇▇████▆██████
episode_length,▁▁▁▁▁▁▁▂▃▄▆▆▇▆████▃▆██████▆▇▇███████████
loss_policy,▇▆▅▅▇█▆▇█▇▅▄▂▄▃▃▂▂▅▂▁▂▂▂▂▂▃▂▁▁▂▂▂▂▂▂▂▂▂▂
loss_value,▂▁▁▁▃▄▂▄██▆▆▅▆▅▅▅▅▆▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅
running_reward,▁▂▂▂▂▃▃▄▅▇▇▇█▇████▇▇████████▇███████████
test_average_episode_length,▁
test_average_total_reward,▁

0,1
average_episode_length,500.0
average_lenght_all_episodes,362.4385
average_total_reward,500.0
episode_length,498.4
loss_policy,-0.07096
loss_value,599.25696
running_reward,98.34321
test_average_episode_length,500.0
test_average_total_reward,500.0


Training agent with baseline value network.
Running reward of episode 100/2000: 19.183506459172378
Average Total reward: 20.75
Running reward of episode 200/2000: 30.100428325060776
Average Total reward: 42.7
Running reward of episode 300/2000: 53.24085431990675
Average Total reward: 156.2
Running reward of episode 400/2000: 87.17611378731436
Average Total reward: 274.25
Running reward of episode 500/2000: 93.78258029846293
Average Total reward: 429.45
Running reward of episode 600/2000: 78.16152860895221
Average Total reward: 256.0
Running reward of episode 700/2000: 97.28110043913716
Average Total reward: 478.35
Running reward of episode 800/2000: 97.17399631955786
Average Total reward: 500.0
Running reward of episode 900/2000: 92.42908476865685
Average Total reward: 383.45
Running reward of episode 1000/2000: 97.25006782471317
Average Total reward: 480.45
Running reward of episode 1100/2000: 97.33184293619298
Average Total reward: 462.95
Running reward of episode 1200/2000: 97.81609

0,1
average_episode_length,▁▁▃▅▇▄██▆█▇█████████
average_lenght_all_episodes,▁
average_total_reward,▁▁▃▅▇▄██▆█▇█████████
episode_length,▁▁▁▁▁▁▄▄▄▄▇▅█▇█▇█▆▃▇▇▇███▇██████████████
loss_policy,▅▅▆▅▆▆██▇▆▃▄▁▁▁▂▁▂▃▁▁▂▁▁▂▂▁▁▁▂▁▁▁▁▁▁▁▁▁▁
loss_value,▁▁▂▂▃▃██▆▇▆▆▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅
running_reward,▁▂▂▂▃▃▆▇▇▇██▇█████▇█████████████████████
test_average_episode_length,▁
test_average_total_reward,▁

0,1
average_episode_length,500.0
average_lenght_all_episodes,383.9965
average_total_reward,500.0
episode_length,500.0
loss_policy,0.13483
loss_value,598.22485
running_reward,98.34951
test_average_episode_length,500.0
test_average_total_reward,500.0


Training agent with baseline value network.
Running reward of episode 100/2000: 16.115211315668343
Average Total reward: 26.8
Running reward of episode 200/2000: 21.81889781319484
Average Total reward: 29.4
Running reward of episode 300/2000: 25.638065698205068
Average Total reward: 32.6
Running reward of episode 400/2000: 51.82225212538218
Average Total reward: 149.25
Running reward of episode 500/2000: 79.54618308575843
Average Total reward: 218.7
Running reward of episode 600/2000: 94.63875154463302
Average Total reward: 481.9
Running reward of episode 700/2000: 95.08126311887636
Average Total reward: 484.35
Running reward of episode 800/2000: 97.50197131607042
Average Total reward: 467.0
Running reward of episode 900/2000: 97.40831860307311
Average Total reward: 500.0
Running reward of episode 1000/2000: 98.34395300901052
Average Total reward: 500.0
Running reward of episode 1100/2000: 94.89274654804902
Average Total reward: 269.05
Running reward of episode 1200/2000: 98.2254590131

0,1
average_episode_length,▁▁▁▃▄█████▅██▇██████
average_lenght_all_episodes,▁
average_total_reward,▁▁▁▃▄█████▅██▇██████
episode_length,▁▁▁▁▁▁▁▂▃▃▃▇▇▅▇█▇████▇███▇█▇████████████
loss_policy,▅▅▅▅▆▅▆▇██▆▃▂▃▂▂▂▂▂▂▂▂▁▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂
loss_value,▁▁▁▁▂▁▂▄▇█▇▆▆▇▆▆▆▆▆▆▆▇▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆
running_reward,▁▂▂▂▂▂▃▄▅▆▇▇█▇██████████████████████████
test_average_episode_length,▁
test_average_total_reward,▁

0,1
average_episode_length,491.35
average_lenght_all_episodes,371.8035
average_total_reward,491.35
episode_length,500.0
loss_policy,0.1577
loss_value,598.24377
running_reward,98.12196
test_average_episode_length,500.0
test_average_total_reward,500.0


Training agent with baseline value network.
Running reward of episode 100/2000: 18.866451379219736
Average Total reward: 20.65
Running reward of episode 200/2000: 26.69573728326189
Average Total reward: 39.75
Running reward of episode 300/2000: 54.860592637949644
Average Total reward: 77.9
Running reward of episode 400/2000: 85.2991098779191
Average Total reward: 321.1
Running reward of episode 500/2000: 95.85501774082272
Average Total reward: 440.05
Running reward of episode 600/2000: 97.4902415571414
Average Total reward: 492.05
Running reward of episode 700/2000: 96.89804264012697
Average Total reward: 483.2
Running reward of episode 800/2000: 94.64327280983144
Average Total reward: 319.6
Running reward of episode 900/2000: 96.95755066634251
Average Total reward: 500.0
Running reward of episode 1000/2000: 98.16432356286433
Average Total reward: 465.1
Running reward of episode 1100/2000: 97.78299076953908
Average Total reward: 493.3
Running reward of episode 1200/2000: 97.62746771256

0,1
average_episode_length,▁▁▂▅▇██▅█▇████████▆█
average_lenght_all_episodes,▁
average_total_reward,▁▁▂▅▇██▅█▇████████▆█
episode_length,▁▁▁▁▂▂▃▄▃▅▇██▇█▇▇███▇▇▇█▇█████████████▄█
loss_policy,▅▄▅▄▇▇█▇▇▅▃▃▃▂▂▂▂▂▁▁▂▂▂▁▂▁▁▁▁▂▁▁▁▁▁▁▁▁▂▁
loss_value,▁▁▂▁▄▅█▇▇▆▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▄▅
running_reward,▁▂▂▂▃▄▅▆▇▇████████████████████████████▇█
test_average_episode_length,▁
test_average_total_reward,▁

0,1
average_episode_length,500.0
average_lenght_all_episodes,387.4625
average_total_reward,500.0
episode_length,487.6
loss_policy,-0.69755
loss_value,606.10754
running_reward,97.47808
test_average_episode_length,500.0
test_average_total_reward,500.0


Training agent with baseline value network.
Running reward of episode 100/2000: 22.137660930667803
Average Total reward: 24.45
Running reward of episode 200/2000: 34.35531586944902
Average Total reward: 39.15
Running reward of episode 300/2000: 69.72722328299251
Average Total reward: 179.85
Running reward of episode 400/2000: 83.35566354013048
Average Total reward: 139.45
Running reward of episode 500/2000: 92.64241645580998
Average Total reward: 461.15
Running reward of episode 600/2000: 92.39572814183286
Average Total reward: 297.15
Running reward of episode 700/2000: 94.4515992615833
Average Total reward: 463.7
Running reward of episode 800/2000: 95.71002139925483
Average Total reward: 493.6
Running reward of episode 900/2000: 81.9532053384586
Average Total reward: 352.65
Running reward of episode 1000/2000: 96.15248828408872
Average Total reward: 500.0
Running reward of episode 1100/2000: 95.81721568027436
Average Total reward: 500.0
Running reward of episode 1200/2000: 98.29921840

0,1
average_episode_length,▁▁▃▃▇▅▇█▆████▇▃█▄███
average_lenght_all_episodes,▁
average_total_reward,▁▁▃▃▇▅▇█▆████▇▃█▄███
episode_length,▁▁▁▂▁▂▃▄▂▅▇▇▅▇▆▆▇▅▇█▇▇██████▂▂██▄▄██▅▇██
loss_policy,▅▆▆█▇██▇▆▅▄▄▅▃▃▂▂▃▂▂▂▁▂▂▂▂▂▂▃▂▁▁▃▃▁▂▃▁▂▂
loss_value,▁▁▂▄▃▆██▅▆▆▆▆▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▃▃▅▅▅▅▅▅▆▅▅▅
running_reward,▁▂▂▃▃▄▆▇▆▇██▇██▇█▇██████████▆▆▇██▇██████
test_average_episode_length,▁
test_average_total_reward,▁

0,1
average_episode_length,500.0
average_lenght_all_episodes,336.287
average_total_reward,500.0
episode_length,500.0
loss_policy,-0.25145
loss_value,599.67987
running_reward,95.33847
test_average_episode_length,500.0
test_average_total_reward,500.0


Training agent with baseline value network.
Running reward of episode 100/2000: 19.84480796851617
Average Total reward: 22.2
Running reward of episode 200/2000: 22.66890234889799
Average Total reward: 28.5
Running reward of episode 300/2000: 37.388597683732826
Average Total reward: 52.85
Running reward of episode 400/2000: 54.868267098938446
Average Total reward: 103.75
Running reward of episode 500/2000: 82.56332495435332
Average Total reward: 199.25
Running reward of episode 600/2000: 86.82295108422903
Average Total reward: 416.75
Running reward of episode 700/2000: 96.6440508326321
Average Total reward: 395.55
Running reward of episode 800/2000: 96.7059063122603
Average Total reward: 385.7
Running reward of episode 900/2000: 91.15909213629051
Average Total reward: 163.45
Running reward of episode 1000/2000: 90.42343576993066
Average Total reward: 246.75
Running reward of episode 1100/2000: 92.73186541422275
Average Total reward: 284.3
Running reward of episode 1200/2000: 97.05024269

## With Baseline Network, temperature 10 and hidden layer size 128

In [19]:
# Random seeds, so that the results are reproducible
seeds = [1, 11, 111, 1111, 11111]
for i in range(len(seeds)):

    # Training and architecture hyperparameters; initialise a wandb run
    run=wandb.init(
          project="Lab3-DRL-warmups",
          name = "Baseline Network 128 layers",
          config={
              "hidden_layers": 128,
              "num_episodes": 2000,
              "gamma": 0.99,
              "baseline": 'net',
              "eval_every":100,
              "eval_episodes": 20,
              "test_episodes" : 200,
              "temperature" : 10,
              "lr" : 1e-2,
              "lr_baseline" : 1e-3
              })
    
    # Copy the configuration
    config = wandb.config

    # Instantiate two versions of CartPole: one that animates the episodes (which slows everything
    # down), and another that does not animate.
    env = gymnasium.make('CartPole-v1')
    env_render = gymnasium.make('CartPole-v1', render_mode='human')

    # Set the seed
    torch.manual_seed(seeds[i])
    env.reset(seed = seeds[i])
    env_render.reset(seed = seeds[i])
    
    # Make a policy network.
    policy = PolicyNet(env, config.hidden_layers, config.temperature)

    # Make a value network
    value = ValueNet(env, config.hidden_layers)
    
    # Create episode_runner
    episode_runner= Episode_runner(env, policy)
    episode_runner_rend= Episode_runner(env_render, policy)
    
    # Train the agent
    best_model_state_dict = reinforce(episode_runner, run, episode_runner_rend, gamma=config.gamma, num_episodes=config.num_episodes,
              baseline= value, display=False, eval_every=config.eval_every,
              eval_episodes=config.eval_episodes, lr= config.lr, lr_baseline = config.lr_baseline )
    
    # Load the best policy and test it with the deterministic episode runner
    episode_runner.policy.load_state_dict(best_model_state_dict)
    det_ep_runner = Determinist_Test_Episode_runner(episode_runner, episode_runner_rend )
    det_ep_runner.test(test_episodes=config.test_episodes)
    
    # Close up everything
    env_render.close()
    env.close()

0,1
average_episode_length,▁▁▁▂▄▇▆▆▃▄▅████▇█▅██
average_lenght_all_episodes,▁
average_total_reward,▁▁▁▂▄▇▆▆▃▄▅████▇█▅██
episode_length,▁▁▁▁▁▂▂▂▂▃▅▅▇█▆█▇▇▃▅▃▅▅███▇███▇▇█▇███▇██
loss_policy,▅▆▅▅▆▇▇▇█▇▅▅▃▃▃▃▃▂▄▄▄▃▂▁▂▂▃▂▂▂▂▂▂▂▂▂▂▂▂▂
loss_value,▁▂▁▁▂▄▅▄██▇▇▆▆▇▆▆▆▆▇▅▆▇▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆
running_reward,▁▂▂▂▂▃▃▄▅▆▇▇▇█████▇▇▇███████████████████
test_average_episode_length,▁
test_average_total_reward,▁

0,1
average_episode_length,500.0
average_lenght_all_episodes,330.12
average_total_reward,500.0
episode_length,500.0
loss_policy,-0.35661
loss_value,598.45764
running_reward,98.33974
test_average_episode_length,500.0
test_average_total_reward,500.0


Training agent with baseline value network.
Running reward of episode 100/2000: 32.0094782451846
Average Total reward: 54.6
Running reward of episode 200/2000: 53.71425610165732
Average Total reward: 105.75
Running reward of episode 300/2000: 93.38760826609865
Average Total reward: 345.7
Running reward of episode 400/2000: 90.9043200569277
Average Total reward: 338.7
Running reward of episode 500/2000: 98.15367064571203
Average Total reward: 500.0
Running reward of episode 600/2000: 93.59512871292686
Average Total reward: 500.0
Running reward of episode 700/2000: 98.04483675426094
Average Total reward: 500.0
Running reward of episode 800/2000: 95.99248338343018
Average Total reward: 500.0
Running reward of episode 900/2000: 93.32621489131296
Average Total reward: 207.95
Running reward of episode 1000/2000: 85.33033286073864
Average Total reward: 228.35
Running reward of episode 1100/2000: 97.6791894842405
Average Total reward: 426.1
Running reward of episode 1200/2000: 94.6629006432714

0,1
average_episode_length,▁▂▆▅████▃▄▇▆████████
average_lenght_all_episodes,▁
average_total_reward,▁▂▆▅████▃▄▇▆████████
episode_length,▁▁▂▂▄▅▇▅▆█▇▇██████▃▃▄▇█▅████████████████
loss_policy,▆▆██▇▅▂▃▂▂▂▂▂▂▂▂▂▂▃▂▂▂▂▃▁▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂
loss_value,▁▁▄▅█▆▆▅▆▅▅▅▅▅▅▅▅▆▅▄▆▆▅▆▆▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅
running_reward,▁▂▃▄▅▇█▇▇█████████▇▇▇███████████████████
test_average_episode_length,▁
test_average_total_reward,▁

0,1
average_episode_length,500.0
average_lenght_all_episodes,403.6785
average_total_reward,500.0
episode_length,500.0
loss_policy,0.03259
loss_value,598.48242
running_reward,98.34953
test_average_episode_length,500.0
test_average_total_reward,500.0


Training agent with baseline value network.
Running reward of episode 100/2000: 21.82209116101363
Average Total reward: 42.25
Running reward of episode 200/2000: 71.86424379036274
Average Total reward: 176.95
Running reward of episode 300/2000: 91.56399804929937
Average Total reward: 461.5
Running reward of episode 400/2000: 96.46635909649882
Average Total reward: 461.9
Running reward of episode 500/2000: 69.45056391289467
Average Total reward: 119.4
Running reward of episode 600/2000: 96.84742894461216
Average Total reward: 490.0
Running reward of episode 700/2000: 96.94150844818222
Average Total reward: 500.0
Running reward of episode 800/2000: 98.33302265259974
Average Total reward: 500.0
Running reward of episode 900/2000: 96.60796007612899
Average Total reward: 470.35
Running reward of episode 1000/2000: 97.83744630024387
Average Total reward: 500.0
Running reward of episode 1100/2000: 98.18714837127611
Average Total reward: 500.0
Running reward of episode 1200/2000: 80.5946705820

0,1
average_episode_length,▁▃▇▇▂██████▂████▄▆██
average_lenght_all_episodes,▁
average_total_reward,▁▃▇▇▂██████▂████▄▆██
episode_length,▁▁▁▂▅▅█▇▇▃▄▇█▇█████████▇▃███████▇▇▄▆██▇█
loss_policy,▆▆▇█▆▅▃▄▃▄▃▁▂▂▂▂▃▃▂▂▃▂▃▃▃▂▃▂▂▂▃▂▃▂▂▁▂▂▂▂
loss_value,▁▁▃▆█▆▆▆▆▅▅▇▆▆▆▆▆▆▆▆▆▆▆▆▅▆▆▆▆▆▆▆▆▆▆▇▆▆▅▆
running_reward,▁▂▂▄▆▆███▇▆▇████████████▇████████▇▇█████
test_average_episode_length,▁
test_average_total_reward,▁

0,1
average_episode_length,500.0
average_lenght_all_episodes,390.1525
average_total_reward,500.0
episode_length,500.0
loss_policy,-0.34646
loss_value,598.45117
running_reward,98.18483
test_average_episode_length,500.0
test_average_total_reward,500.0


Training agent with baseline value network.
Running reward of episode 100/2000: 32.87386553492955
Average Total reward: 62.1
Running reward of episode 200/2000: 71.7765475513662
Average Total reward: 307.4
Running reward of episode 300/2000: 77.32814666953193
Average Total reward: 207.25
Running reward of episode 400/2000: 97.2856897095752
Average Total reward: 500.0
Running reward of episode 500/2000: 98.25577237769019
Average Total reward: 500.0
Running reward of episode 600/2000: 98.34897038384602
Average Total reward: 500.0
Running reward of episode 700/2000: 98.2959884278717
Average Total reward: 500.0
Running reward of episode 800/2000: 96.54816612731405
Average Total reward: 473.35
Running reward of episode 900/2000: 98.29286194993297
Average Total reward: 500.0
Running reward of episode 1000/2000: 98.34918997374223
Average Total reward: 500.0
Running reward of episode 1100/2000: 98.30676929946814
Average Total reward: 500.0
Running reward of episode 1200/2000: 96.93669745527039

0,1
average_episode_length,▁▅▃█████████████████
average_lenght_all_episodes,▁
average_total_reward,▁▅▃█████████████████
episode_length,▁▁▂▂▃▄▆████████▆▇███████▇██▅███████▇████
loss_policy,▆▆█▇▇▆▃▂▂▂▂▂▁▂▂▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂
loss_value,▁▁▆▅██▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▆▇▇▇▇
running_reward,▁▂▃▄▆▇▇████████████████████▇████████████
test_average_episode_length,▁
test_average_total_reward,▁

0,1
average_episode_length,500.0
average_lenght_all_episodes,428.02
average_total_reward,500.0
episode_length,500.0
loss_policy,-0.09432
loss_value,598.31781
running_reward,98.3495
test_average_episode_length,500.0
test_average_total_reward,500.0


Training agent with baseline value network.
Running reward of episode 100/2000: 26.444223843940154
Average Total reward: 43.85
Running reward of episode 200/2000: 43.736361093433004
Average Total reward: 109.45
Running reward of episode 300/2000: 86.90578161826217
Average Total reward: 248.1
Running reward of episode 400/2000: 88.41146762258899
Average Total reward: 202.4
Running reward of episode 500/2000: 93.32662692067937
Average Total reward: 419.8
Running reward of episode 600/2000: 75.0133115672801
Average Total reward: 142.8
Running reward of episode 700/2000: 88.50847363514167
Average Total reward: 500.0
Running reward of episode 800/2000: 98.29126121682144
Average Total reward: 500.0
Running reward of episode 900/2000: 98.13336392430519
Average Total reward: 500.0
Running reward of episode 1000/2000: 70.39035060759554
Average Total reward: 127.85
Running reward of episode 1100/2000: 97.9296358656161
Average Total reward: 500.0
Running reward of episode 1200/2000: 98.3470394830

average_episode_length,▁▂▄▃▇▃███▂███████▅█▄
average_lenght_all_episodes,▁
average_total_reward,▁▂▄▃▇▃███▂███████▅█▄
episode_length,▁▁▁▂▃▄▅▅▇█▆▃▂▃████▆▃▃██████████████▄▄▃█▇
loss_policy,▇▅▇▇█▆▄▅▁▂▂▃▁▃▁▂▂▂▃▂▃▂▂▁▁▂▂▂▂▂▂▂▂▂▂▃▃▂▂▂
loss_value,▂▁▄▄██▇▇▇▆▆▄▂▆▆▆▆▆▆▄▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▅▄▆▆
running_reward,▁▂▂▃▄▆▇█▇▇█▆▅▅████▇▆▆███████████████▇▆██
test_average_episode_length,▁
test_average_total_reward,▁

average_episode_length,268.7
average_lenght_all_episodes,357.0285
average_total_reward,268.7
episode_length,442.9
loss_policy,0.03808
loss_value,595.13416
running_reward,96.74298
test_average_episode_length,328.87
test_average_total_reward,328.87


Training agent with baseline value network.
Running reward of episode 100/2000: 27.32490215760632
Average Total reward: 46.85
Running reward of episode 200/2000: 50.07565717645881
Average Total reward: 143.55
Running reward of episode 300/2000: 94.660531243787
Average Total reward: 389.35
Running reward of episode 400/2000: 95.69167556889424
Average Total reward: 500.0
Running reward of episode 500/2000: 95.73467453263497
Average Total reward: 430.1
Running reward of episode 600/2000: 98.07848238386033
Average Total reward: 500.0
Running reward of episode 700/2000: 95.5742230383279
Average Total reward: 457.6
Running reward of episode 800/2000: 98.25599593002525
Average Total reward: 500.0
Running reward of episode 900/2000: 94.8329517703557
Average Total reward: 500.0
Running reward of episode 1000/2000: 97.20180244361106
Average Total reward: 500.0
Running reward of episode 1100/2000: 98.34273032405396
Average Total reward: 500.0
Running reward of episode 1200/2000: 98.34948522090842

## With baseline, hidden layer size 128 and temperature 5
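
The `temperature` entry in the configuration below is passed to `PolicyNet`, whose definition is not shown in this section. A common way to use such a parameter, and the one assumed in this sketch, is to divide the logits by the temperature before the softmax: values above 1 flatten the action distribution (more exploration), values below 1 sharpen it. The helper name `softmax_with_temperature` and the example logits are illustrative and not taken from the notebook.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Return softmax(logits / temperature) as a list of probabilities."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    # Subtract the maximum for numerical stability; the result is unchanged.
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for CartPole's two actions (push left, push right).
logits = [2.0, 0.0]
probs_t1 = softmax_with_temperature(logits, temperature=1.0)
probs_t5 = softmax_with_temperature(logits, temperature=5.0)
print(probs_t1)  # sharper distribution
print(probs_t5)  # flatter distribution, closer to uniform
```

With temperature 5 the preferred action is still more likely, but the gap between the two probabilities is much smaller, which is the effect the runs in this section rely on.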

In [9]:
# Fixed random seeds, so that the results can be reproduced.
seeds = [1, 11, 111, 1111, 11111]
for seed in seeds:

    # Training and architecture hyperparameters; initialise a wandb run.
    run = wandb.init(
        project="Lab3-DRL-warmups",
        name="Baseline Network 128 layers temp 5",
        config={
            "hidden_layers": 128,
            "num_episodes": 2000,
            "gamma": 0.99,
            "baseline": 'net',
            "eval_every": 100,
            "eval_episodes": 20,
            "test_episodes": 200,
            "temperature": 5,
            "lr": 1e-2,
            "lr_baseline": 1e-3,
        })

    # Copy the configuration.
    config = wandb.config

    # Instantiate two versions of CartPole: one that animates the episodes (which slows
    # everything down), and another that does not animate.
    env = gymnasium.make('CartPole-v1')
    env_render = gymnasium.make('CartPole-v1', render_mode='human')

    # Seed PyTorch and both environments.
    torch.manual_seed(seed)
    env.reset(seed=seed)
    env_render.reset(seed=seed)

    # Make a policy network.
    policy = PolicyNet(env, config.hidden_layers, config.temperature)

    # Make a value network to use as the baseline.
    value = ValueNet(env, config.hidden_layers)

    # Create the episode runners (without and with rendering).
    episode_runner = Episode_runner(env, policy)
    episode_runner_rend = Episode_runner(env_render, policy)

    # Train the agent.
    best_model_state_dict = reinforce(
        episode_runner, run, episode_runner_rend,
        gamma=config.gamma, num_episodes=config.num_episodes,
        baseline=value, display=False, eval_every=config.eval_every,
        eval_episodes=config.eval_episodes, lr=config.lr, lr_baseline=config.lr_baseline)

    # Load the best policy into the deterministic episode runner to test it.
    episode_runner.policy.load_state_dict(best_model_state_dict)
    det_ep_runner = Determinist_Test_Episode_runner(episode_runner, episode_runner_rend)
    det_ep_runner.test(test_episodes=config.test_episodes)

    # Close everything.
    env_render.close()
    env.close()



Training agent with baseline value network.
Running reward of episode 100/2000: 35.797448217256715
Average Total reward: 67.25
Running reward of episode 200/2000: 79.6415294325236
Average Total reward: 137.4
Running reward of episode 300/2000: 97.29908716932266
Average Total reward: 480.3
Running reward of episode 400/2000: 98.34330630111522
Average Total reward: 500.0
Running reward of episode 500/2000: 98.34948863099743
Average Total reward: 500.0
Running reward of episode 600/2000: 92.47221855127512
Average Total reward: 500.0
Running reward of episode 700/2000: 98.31472868441936
Average Total reward: 500.0
Running reward of episode 800/2000: 98.34931943638274
Average Total reward: 500.0
Running reward of episode 900/2000: 98.31785339862901
Average Total reward: 500.0
Running reward of episode 1000/2000: 98.3493379363445
Average Total reward: 500.0
Running reward of episode 1100/2000: 98.34952434147003
Average Total reward: 500.0
Running reward of episode 1200/2000: 95.3691838452254

average_episode_length,▁▂███████████████▇██
average_lenght_all_episodes,▁
average_total_reward,▁▂███████████████▇██
episode_length,▁▁▂▃▂█████▇▃███████████████████████████▇
loss_policy,▅▆██▄▃▃▂▂▂▂▂▁▂▂▂▁▂▂▂▂▂▁▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▁▂
loss_value,▁▂▅█▃▅▅▅▅▅▅▄▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅
running_reward,▁▂▃▅▆▇█████▇████████████████████████████
test_average_episode_length,▁
test_average_total_reward,▁

average_episode_length,500.0
average_lenght_all_episodes,439.1475
average_total_reward,500.0
episode_length,436.9
loss_policy,0.42298
loss_value,581.49048
running_reward,96.53181
test_average_episode_length,500.0
test_average_total_reward,500.0


Training agent with baseline value network.
Running reward of episode 100/2000: 20.841262645814915
Average Total reward: 29.65
Running reward of episode 200/2000: 71.9193425366524
Average Total reward: 144.95
Running reward of episode 300/2000: 95.45127602692487
Average Total reward: 469.2
Running reward of episode 400/2000: 94.12219114288489
Average Total reward: 257.5
Running reward of episode 500/2000: 98.18456820757844
Average Total reward: 500.0
Running reward of episode 600/2000: 96.43755045398139
Average Total reward: 500.0
Running reward of episode 700/2000: 98.32011739779324
Average Total reward: 478.8
Running reward of episode 800/2000: 98.34527953749006
Average Total reward: 500.0
Running reward of episode 900/2000: 98.3475551331099
Average Total reward: 500.0
Running reward of episode 1000/2000: 96.79266993178605
Average Total reward: 500.0
Running reward of episode 1100/2000: 95.1890426758746
Average Total reward: 500.0
Running reward of episode 1200/2000: 98.3308137210355

average_episode_length,▁▃█▄██████████████▇█
average_lenght_all_episodes,▁
average_total_reward,▁▃█▄██████████████▇█
episode_length,▁▁▁▂▆▅█▇█▇███████████▇███▇███████████▇██
loss_policy,▇▄▇█▆▇▂▂▁▂▂▂▂▂▂▂▂▁▂▂▂▁▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂
loss_value,▂▁▄▅▇█▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆
running_reward,▁▂▂▄▆▇██████████████████████████████████
test_average_episode_length,▁
test_average_total_reward,▁

average_episode_length,500.0
average_lenght_all_episodes,443.4365
average_total_reward,500.0
episode_length,500.0
loss_policy,0.29222
loss_value,598.1026
running_reward,98.34208
test_average_episode_length,500.0
test_average_total_reward,500.0


Training agent with baseline value network.
Running reward of episode 100/2000: 45.045412820910265
Average Total reward: 82.55
Running reward of episode 200/2000: 80.85112253017265
Average Total reward: 120.7
Running reward of episode 300/2000: 97.90786057883304
Average Total reward: 490.5
Running reward of episode 400/2000: 97.63810325223278
Average Total reward: 500.0
Running reward of episode 500/2000: 95.34413878063337
Average Total reward: 494.0
Running reward of episode 600/2000: 97.80194689958434
Average Total reward: 488.75
Running reward of episode 700/2000: 96.33997477460002
Average Total reward: 372.45
Running reward of episode 800/2000: 73.92080141748143
Average Total reward: 300.15
Running reward of episode 900/2000: 96.35612166099278
Average Total reward: 422.5
Running reward of episode 1000/2000: 98.16600390349913
Average Total reward: 424.9
Running reward of episode 1100/2000: 98.04089997865516
Average Total reward: 482.6
Running reward of episode 1200/2000: 98.20492934

average_episode_length,▁▂████▆▅▇▇███▃▃▃▅██▅
average_lenght_all_episodes,▁
average_total_reward,▁▂████▆▅▇▇███▃▃▃▅██▅
episode_length,▁▁▂▃▄██▅▇██▆██▅▂▄█▆██████▇█▄▅▅▅▄▄▃█▅▇▃██
loss_policy,▅▆▇█▇▃▃▃▂▂▂▃▂▂▃▂▃▁▂▂▂▂▂▂▂▂▂▄▂▂▂▂▂▃▁▃▂▃▂▂
loss_value,▁▂▅██▆▆▆▆▆▆▆▆▆▆▂▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▅▅▆▆▆▅▆▆
running_reward,▁▂▄▅▆██████████▅▇███████████▇▇▇▇▇▆███▇██
test_average_episode_length,▁
test_average_total_reward,▁

average_episode_length,345.65
average_lenght_all_episodes,367.0285
average_total_reward,345.65
episode_length,500.0
loss_policy,-0.58856
loss_value,601.22131
running_reward,97.72375
test_average_episode_length,348.29
test_average_total_reward,348.29


Training agent with baseline value network.
Running reward of episode 100/2000: 28.502079069726147
Average Total reward: 59.9
Running reward of episode 200/2000: 54.19332413032455
Average Total reward: 96.0
Running reward of episode 300/2000: 87.12481725002105
Average Total reward: 488.1
Running reward of episode 400/2000: 79.59950670360128
Average Total reward: 247.1
Running reward of episode 500/2000: 97.2750826406726
Average Total reward: 500.0
Running reward of episode 600/2000: 97.66652157414896
Average Total reward: 494.1
Running reward of episode 700/2000: 98.12435511492698
Average Total reward: 495.7
Running reward of episode 800/2000: 98.34062218908198
Average Total reward: 500.0
Running reward of episode 900/2000: 98.34947273963373
Average Total reward: 500.0
Running reward of episode 1000/2000: 98.34952513957684
Average Total reward: 500.0
Running reward of episode 1100/2000: 98.32981761882733
Average Total reward: 500.0
Running reward of episode 1200/2000: 98.3494087708598


average_episode_length,▁▂█▄█████████▇██████
average_lenght_all_episodes,▁
average_total_reward,▁▂█▄█████████▇██████
episode_length,▁▁▁▁▂▂▆▃▄██▆██▇██████████▇██████████▆███
loss_policy,▆▆▆▇█▆▄▄▄▁▂▃▂▂▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂
loss_value,▁▂▂▃█▅▆▆▇▇▆▆▇▆▇▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆
running_reward,▁▂▃▃▅▆▇▇▇███████████████████████████▇███
test_average_episode_length,▁
test_average_total_reward,▁

average_episode_length,500.0
average_lenght_all_episodes,414.4455
average_total_reward,500.0
episode_length,500.0
loss_policy,-0.17439
loss_value,598.13141
running_reward,98.34643
test_average_episode_length,500.0
test_average_total_reward,500.0


Training agent with baseline value network.
Running reward of episode 100/2000: 41.683340314953234
Average Total reward: 70.75
Running reward of episode 200/2000: 87.95169738306129
Average Total reward: 385.95
Running reward of episode 300/2000: 87.71938741314705
Average Total reward: 179.05
Running reward of episode 400/2000: 97.84658594168975
Average Total reward: 500.0
Running reward of episode 500/2000: 97.6137482486996
Average Total reward: 500.0
Running reward of episode 600/2000: 93.79955925892523
Average Total reward: 287.6
Running reward of episode 700/2000: 97.86824121423288
Average Total reward: 500.0
Running reward of episode 800/2000: 98.34667599426898
Average Total reward: 500.0
Running reward of episode 900/2000: 98.34950858136422
Average Total reward: 500.0
Running reward of episode 1000/2000: 98.34952535177885
Average Total reward: 500.0
Running reward of episode 1100/2000: 98.34952545106859
Average Total reward: 500.0
Running reward of episode 1200/2000: 98.3495254516
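
Because a single run is noisy, the results of this configuration are better summarised across seeds than read off one curve. The sketch below computes the mean, the sample standard deviation, and a rough 95% interval for `test_average_total_reward`, using the four test scores visible in the run summaries above (the fifth seed's summary does not appear in the output shown, so it is left out). The normal-approximation factor 1.96 is an assumption made for brevity; with so few seeds a Student-t quantile would give a wider interval.

```python
import math
import statistics

# test_average_total_reward of the runs whose summaries are shown above
# (hidden layer size 128, temperature 5). The fifth seed is not shown.
test_rewards = [500.0, 500.0, 348.29, 500.0]

mean = statistics.mean(test_rewards)
std = statistics.stdev(test_rewards)        # sample standard deviation
sem = std / math.sqrt(len(test_rewards))    # standard error of the mean

# Rough 95% interval with the normal approximation (an assumption).
half_width = 1.96 * sem
low, high = mean - half_width, mean + half_width
print(f"mean = {mean:.2f}, std = {std:.2f}, 95% CI ~ [{low:.2f}, {high:.2f}]")
```

The interval is not clipped to the 500-step cap of CartPole-v1, so its upper end can exceed the maximum achievable reward; it is meant only as a quick indication of the spread between seeds.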

## With baseline, hidden layer size 16 and temperature 5

In [11]:
# Fixed random seeds, so that the results can be reproduced.
seeds = [1, 11, 111, 1111, 11111]
for seed in seeds:

    # Training and architecture hyperparameters; initialise a wandb run.
    run = wandb.init(
        project="Lab3-DRL-warmups",
        name="Baseline Network 16 layers temp 5",
        config={
            "hidden_layers": 16,
            "num_episodes": 2000,
            "gamma": 0.99,
            "baseline": 'net',
            "eval_every": 100,
            "eval_episodes": 20,
            "test_episodes": 200,
            "temperature": 5,
            "lr": 1e-2,
            "lr_baseline": 1e-3,
        })

    # Copy the configuration.
    config = wandb.config

    # Instantiate two versions of CartPole: one that animates the episodes (which slows
    # everything down), and another that does not animate.
    env = gymnasium.make('CartPole-v1')
    env_render = gymnasium.make('CartPole-v1', render_mode='human')

    # Seed PyTorch and both environments.
    torch.manual_seed(seed)
    env.reset(seed=seed)
    env_render.reset(seed=seed)

    # Make a policy network.
    policy = PolicyNet(env, config.hidden_layers, config.temperature)

    # Make a value network to use as the baseline.
    value = ValueNet(env, config.hidden_layers)

    # Create the episode runners (without and with rendering).
    episode_runner = Episode_runner(env, policy)
    episode_runner_rend = Episode_runner(env_render, policy)

    # Train the agent.
    best_model_state_dict = reinforce(
        episode_runner, run, episode_runner_rend,
        gamma=config.gamma, num_episodes=config.num_episodes,
        baseline=value, display=False, eval_every=config.eval_every,
        eval_episodes=config.eval_episodes, lr=config.lr, lr_baseline=config.lr_baseline)

    # Load the best policy into the deterministic episode runner to test it.
    episode_runner.policy.load_state_dict(best_model_state_dict)
    det_ep_runner = Determinist_Test_Episode_runner(episode_runner, episode_runner_rend)
    det_ep_runner.test(test_episodes=config.test_episodes)

    # Close everything.
    env_render.close()
    env.close()

average_episode_length,▁▁▂▂▂▃█
average_total_reward,▁▁▂▂▂▃█
episode_length,▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▂▂▂▂▂▂▂▂▂▃▃▃▃▄▅█▇▇▇▆▃█
loss_policy,▆▅▅▄▅▅▆▅▆▆▆▆▅▆▇▇▇█▇▇▇████▇▇██▇▆▅▃▃▃▁▂▂▃▁
loss_value,▂▁▁▁▂▁▂▂▂▂▂▂▂▂▄▄▃▄▄▄▅▅▆▆▆▅▆██▇▆▆▅▆▅▅▅▅▅▅
running_reward,▁▁▂▂▂▂▂▂▂▂▂▂▂▂▃▃▃▄▄▄▄▄▄▅▅▅▅▆▇▇▇▇▇█████▇▇

average_episode_length,467.5
average_total_reward,467.5
episode_length,500.0
loss_policy,0.09622
loss_value,604.20288
running_reward,92.11111


Training agent with baseline value network.
Running reward of episode 100/2000: 21.198236506876636
Average Total reward: 30.55
Running reward of episode 200/2000: 31.232564610192572
Average Total reward: 44.6
Running reward of episode 300/2000: 44.692480121087954
Average Total reward: 68.05
Running reward of episode 400/2000: 66.62110245511452
Average Total reward: 180.4
Running reward of episode 500/2000: 88.40182042486495
Average Total reward: 332.6
Running reward of episode 600/2000: 83.32056016023441
Average Total reward: 157.65
Running reward of episode 700/2000: 91.81500888867096
Average Total reward: 406.55
Running reward of episode 800/2000: 96.77226491538515
Average Total reward: 454.4
Running reward of episode 900/2000: 97.48401223321831
Average Total reward: 476.15
Running reward of episode 1000/2000: 97.66977940926452
Average Total reward: 450.95
Running reward of episode 1100/2000: 97.60032193149338
Average Total reward: 500.0
Running reward of episode 1200/2000: 98.123901

average_episode_length,▁▁▂▃▆▃▇▇█▇██████████
average_lenght_all_episodes,▁
average_total_reward,▁▁▂▃▆▃▇▇█▇██████████
episode_length,▁▁▁▁▁▁▂▂▃▇▆▄▃▇█▇██████▇███████████▇█████
loss_policy,▅▆▆▇▇▇▇██▇▇▇▇▄▃▄▃▃▂▃▃▂▁▂▂▃▂▂▁▂▂▁▂▂▂▂▂▂▂▂
loss_value,▁▂▂▃▃▃▄▅▇██▇▇▆▆▅▆▆▅▆▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅
running_reward,▁▁▂▂▃▃▄▄▆▇▇▇▇▇██████████████████████████
test_average_episode_length,▁
test_average_total_reward,▁

average_episode_length,500.0
average_lenght_all_episodes,374.909
average_total_reward,500.0
episode_length,500.0
loss_policy,0.22988
loss_value,598.46936
running_reward,98.34873
test_average_episode_length,500.0
test_average_total_reward,500.0


Training agent with baseline value network.
Running reward of episode 100/2000: 17.617126928474022
Average Total reward: 18.95
Running reward of episode 200/2000: 21.983712912468164
Average Total reward: 24.15
Running reward of episode 300/2000: 32.688632384872285
Average Total reward: 53.8
Running reward of episode 400/2000: 46.311647207216595
Average Total reward: 86.35
Running reward of episode 500/2000: 55.40946724598233
Average Total reward: 85.45
Running reward of episode 600/2000: 77.91658142260447
Average Total reward: 176.95
Running reward of episode 700/2000: 91.96008081729212
Average Total reward: 467.5
Running reward of episode 800/2000: 92.64993100950878
Average Total reward: 487.55
Running reward of episode 900/2000: 93.43870505846611
Average Total reward: 458.7
Running reward of episode 1000/2000: 98.17758115866054
Average Total reward: 500.0
Running reward of episode 1100/2000: 98.348507450449
Average Total reward: 500.0
Running reward of episode 1200/2000: 98.349519424

average_episode_length,▁▁▂▂▂▃██▇███████████
average_lenght_all_episodes,▁
average_total_reward,▁▁▂▂▂▃██▇███████████
episode_length,▁▁▁▁▁▁▁▂▂▂▃▃▄█▆▇█▆██████████████████████
loss_policy,▅▅▆▇▆▆▇▇████▆▃▃▃▂▃▂▂▂▃▃▂▂▃▃▂▃▃▂▂▁▂▂▂▂▂▂▂
loss_value,▁▁▂▂▂▂▄▄▆▆██▇▆▅▅▅▅▅▅▅▅▅▅▅▅▆▅▆▆▅▅▆▅▅▅▅▅▅▅
running_reward,▁▁▂▂▂▂▃▃▄▅▅▆▇█▇▇████████████████████████
test_average_episode_length,▁
test_average_total_reward,▁

average_episode_length,500.0
average_lenght_all_episodes,353.0045
average_total_reward,500.0
episode_length,500.0
loss_policy,-0.01722
loss_value,598.54407
running_reward,98.34951
test_average_episode_length,500.0
test_average_total_reward,500.0


Training agent with baseline value network.
Running reward of episode 100/2000: 22.16306466148011
Average Total reward: 26.85
Running reward of episode 200/2000: 28.886228809847072
Average Total reward: 45.65
Running reward of episode 300/2000: 40.843385781872165
Average Total reward: 66.8
Running reward of episode 400/2000: 53.74016651778328
Average Total reward: 84.25
Running reward of episode 500/2000: 72.45635442557415
Average Total reward: 136.8
Running reward of episode 600/2000: 76.60291503680668
Average Total reward: 179.9
Running reward of episode 700/2000: 89.11852919517449
Average Total reward: 205.15
Running reward of episode 800/2000: 93.80659716912147
Average Total reward: 380.6
Running reward of episode 900/2000: 94.89933485951958
Average Total reward: 360.3
Running reward of episode 1000/2000: 96.63808651317343
Average Total reward: 474.2
Running reward of episode 1100/2000: 71.61713226011538
Average Total reward: 123.05
Running reward of episode 1200/2000: 95.223829797

average_episode_length,▁▁▂▂▃▃▄▆▆█▂█▆██████▆
average_lenght_all_episodes,▁
average_total_reward,▁▁▂▂▃▃▄▆▆█▂█▆██████▆
episode_length,▁▁▁▁▂▂▂▂▂▃▅▃▄▄▅▇▆▇▅▆▆▃▅▇▅▅█████████████▆
loss_policy,▅▅▆▆█▇██▇█▆▆▆▅▄▃▄▄▅▂▃▄▂▂▃▃▁▁▁▁▂▂▂▂▂▂▂▂▂▃
loss_value,▁▁▂▂▄▅▆▇▅██▇█▇▇▇▆▇▇▆▆▅▆▆▅▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆
running_reward,▁▂▂▂▃▃▄▅▅▆▆▆▇▇▇▇█████▇▇█▇███████████████
test_average_episode_length,▁
test_average_total_reward,▁

average_episode_length,341.55
average_lenght_all_episodes,318.242
average_total_reward,341.55
episode_length,345.8
loss_policy,2.31151
loss_value,637.19031
running_reward,94.85033
test_average_episode_length,248.67
test_average_total_reward,248.67


Training agent with baseline value network.
Running reward of episode 100/2000: 22.154159010378276
Average Total reward: 23.6
Running reward of episode 200/2000: 42.640429757797364
Average Total reward: 60.85
Running reward of episode 300/2000: 58.28422419342057
Average Total reward: 149.9
Running reward of episode 400/2000: 70.13383411703677
Average Total reward: 243.9
Running reward of episode 500/2000: 93.65025409918812
Average Total reward: 400.7
Running reward of episode 600/2000: 97.21920257720252
Average Total reward: 458.45
Running reward of episode 700/2000: 96.36180964935616
Average Total reward: 490.35
Running reward of episode 800/2000: 98.21942163017954
Average Total reward: 468.2
Running reward of episode 900/2000: 98.29505146390808
Average Total reward: 477.15
Running reward of episode 1000/2000: 81.07692798090767
Average Total reward: 180.8
Running reward of episode 1100/2000: 95.57818115147941
Average Total reward: 454.75
Running reward of episode 1200/2000: 94.6556184

wandb: Network error (ConnectionError), entering retry loop.


Running reward of episode 2000/2000: 97.57133208039082
Average Total reward: 500.0
Average length of all episodes: 368.136
Testing the best policy
Average Total reward: 500.0


average_episode_length,▁▂▃▄▇▇███▃▇█████████
average_lenght_all_episodes,▁
average_total_reward,▁▂▃▄▇▇███▃▇█████████
episode_length,▁▁▁▁▂▂▃▄▃▅▆█▇▇████▃▃▃██▇███████▇██████▆█
loss_policy,▆▆▅█▇██▇▆▅▄▃▂▄▃▃▃▄▇▆▅▁▁▁▁▁▁▁▁▁▁▂▁▁▁▁▁▁▂▁
loss_value,▁▁▁▄▄▇██▅▅▅▄▄▅▄▅▄▅▇▅▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▅▄
running_reward,▁▂▂▃▄▅▆▆▆▇▇███████▇▆▆███████████████████
test_average_episode_length,▁
test_average_total_reward,▁

average_episode_length,500.0
average_lenght_all_episodes,368.136
average_total_reward,500.0
episode_length,500.0
loss_policy,-0.95553
loss_value,600.33997
running_reward,97.57133
test_average_episode_length,500.0
test_average_total_reward,500.0


Training agent with baseline value network.
Running reward of episode 100/2000: 20.81301291478362
Average Total reward: 20.85
Running reward of episode 200/2000: 22.536055472466934
Average Total reward: 25.85
Running reward of episode 300/2000: 38.754879320110874
Average Total reward: 49.8
Running reward of episode 400/2000: 46.98872133705153
Average Total reward: 72.1
Running reward of episode 500/2000: 71.14876500291152
Average Total reward: 170.45
Running reward of episode 600/2000: 84.52754520848063
Average Total reward: 373.3
Running reward of episode 700/2000: 95.90888860307658
Average Total reward: 446.6
Running reward of episode 800/2000: 96.88433790762633
Average Total reward: 488.65
Running reward of episode 900/2000: 98.15760126368826
Average Total reward: 489.35
Running reward of episode 1000/2000: 96.86923938458297
Average Total reward: 463.85
Running reward of episode 1100/2000: 97.92858538331704
Average Total reward: 500.0
Running reward of episode 1200/2000: 97.43464217
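
In the logs above the running reward settles near 98.3495 once every episode reaches the 500-step cap, rather than near 500. How `reinforce` computes this quantity is not shown in this section, but the plateau matches the discounted sum $\sum_{k=1}^{500} \gamma^k$ with $\gamma = 0.99$ and a reward of 1 per step. This suggests, as an assumption and not a confirmed fact, that the logged value tracks a discounted return whose first reward is already multiplied by $\gamma$. The check below evaluates that sum directly and in closed form.

```python
gamma = 0.99
horizon = 500  # CartPole-v1 truncates episodes at 500 steps

# Direct sum of gamma^k for k = 1..horizon, with a reward of 1 per step.
direct = sum(gamma ** k for k in range(1, horizon + 1))

# Closed form of the same geometric series.
closed_form = gamma * (1 - gamma ** horizon) / (1 - gamma)

print(direct, closed_form)
```

If this reading is right, a running reward close to 98.35 means the agent is consistently reaching the episode cap, which agrees with the `average_total_reward` of 500 in the corresponding run summaries.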

# Longer runs

In [9]:
# Fixed random seeds, so that the results can be reproduced.
seeds = [1, 11, 111, 1111, 11111]
for seed in seeds:

    # Training and architecture hyperparameters; initialise a wandb run.
    run = wandb.init(
        project="Lab3-DRL-warmups",
        name="Long Standard",
        config={
            "hidden_layers": 32,
            "num_episodes": 5000,
            "gamma": 0.99,
            "baseline": 'std',
            "eval_every": 100,
            "eval_episodes": 50,
            "test_episodes": 500,
            "temperature": 5,
            "lr": 1e-2,
            "lr_baseline": 1e-3,
        })

    # Copy the configuration.
    config = wandb.config

    # Instantiate two versions of CartPole: one that animates the episodes (which slows
    # everything down), and another that does not animate.
    env = gymnasium.make('CartPole-v1')
    env_render = gymnasium.make('CartPole-v1', render_mode='human')

    # Seed PyTorch and both environments.
    torch.manual_seed(seed)
    env.reset(seed=seed)
    env_render.reset(seed=seed)

    # Make a policy network.
    policy = PolicyNet(env, config.hidden_layers, config.temperature)

    # Create the episode runners (without and with rendering).
    episode_runner = Episode_runner(env, policy)
    episode_runner_rend = Episode_runner(env_render, policy)

    # Train the agent with the standardization baseline ('std').
    best_model_state_dict = reinforce(
        episode_runner, run, episode_runner_rend,
        gamma=config.gamma, num_episodes=config.num_episodes,
        baseline=config.baseline, display=False, eval_every=config.eval_every,
        eval_episodes=config.eval_episodes, lr=config.lr, lr_baseline=config.lr_baseline)

    # Load the best policy into the deterministic episode runner to test it.
    episode_runner.policy.load_state_dict(best_model_state_dict)
    det_ep_runner = Determinist_Test_Episode_runner(episode_runner, episode_runner_rend)
    det_ep_runner.test(test_episodes=config.test_episodes)

    # Close everything.
    env_render.close()
    env.close()



Training agent with standardization baseline.
Running reward of episode 100/5000: 25.929437324089378
Average Total reward: 34.88
Running reward of episode 200/5000: 66.22344520865079
Average Total reward: 220.42
Running reward of episode 300/5000: 93.13194041076233
Average Total reward: 446.58
Running reward of episode 400/5000: 97.69084853801253
Average Total reward: 497.64
Running reward of episode 500/5000: 91.6378924596298
Average Total reward: 445.74
Running reward of episode 600/5000: 97.87126826870261
Average Total reward: 496.68
Running reward of episode 700/5000: 98.23325474611671
Average Total reward: 487.94
Running reward of episode 800/5000: 97.9213996811954
Average Total reward: 496.42
Running reward of episode 900/5000: 97.29732643216911
Average Total reward: 464.68
Running reward of episode 1000/5000: 96.43985168253451
Average Total reward: 465.9
Running reward of episode 1100/5000: 95.6770866155626
Average Total reward: 485.62
Running reward of episode 1200/5000: 96.459

average_episode_length,▁▄▇████▇██████████████▂▂█▆█████▅████████
average_lenght_all_episodes,▁
average_total_reward,▁▄▇████▇██████████████▂▂█▆█████▅████████
episode_length,▁▂▇███████████████████▂▂▆▆██████████████
loss_policy,▇▂▇▆▇▇▇▇▇▅▆▇▆▆▅▆▅▅▆▇▅▅▆▁▆▇▄▄▇▇▃▅▆█▇▄█▆█▇
running_reward,▁▁▇███████████████████▅▅████████████████
test_average_episode_length,▁
test_average_total_reward,▁

average_episode_length,500.0
average_lenght_all_episodes,445.5856
average_total_reward,500.0
episode_length,500.0
loss_policy,-0.00373
running_reward,98.34953
test_average_episode_length,500.0
test_average_total_reward,500.0


Training agent with standardization baseline.
Running reward of episode 100/5000: 23.881031570633947
Average Total reward: 43.96
Running reward of episode 200/5000: 61.357614959251784
Average Total reward: 137.44
Running reward of episode 300/5000: 90.70079348025416
Average Total reward: 293.94
Running reward of episode 400/5000: 97.04527767035847
Average Total reward: 449.1
Running reward of episode 500/5000: 88.00431507793684
Average Total reward: 324.28
Running reward of episode 600/5000: 90.81557652559184
Average Total reward: 342.0
Running reward of episode 700/5000: 98.26493084813076
Average Total reward: 500.0
Running reward of episode 800/5000: 98.34902460683786
Average Total reward: 492.16
Running reward of episode 900/5000: 98.33670367946526
Average Total reward: 498.64
Running reward of episode 1000/5000: 98.24892454394835
Average Total reward: 478.48
Running reward of episode 1100/5000: 93.7856017900028
Average Total reward: 305.54
Running reward of episode 1200/5000: 98.09

average_episode_length,▁▂▅▇▆███▅███████████████▇███████████████
average_lenght_all_episodes,▁
average_total_reward,▁▂▅▇▆███▅███████████████▇███████████████
episode_length,▁▁▅█▇███▆███████████████████████████████
loss_policy,▄▂▅▂▁▃█▃▄▃▅▅▅▁▃▅▄▂▅▄▅▆▃▃▁▅▃▃▄▅▆▄▃▃▄▂▆▃▇▄
running_reward,▁▂▇█▇███████████████████████████████████
test_average_episode_length,▁
test_average_total_reward,▁

average_episode_length,500.0
average_lenght_all_episodes,470.0716
average_total_reward,500.0
episode_length,500.0
loss_policy,0.00897
running_reward,98.34953
test_average_episode_length,500.0
test_average_total_reward,500.0


Training agent with standardization baseline.
Running reward of episode 100/5000: 24.97650669296172
Average Total reward: 31.3
Running reward of episode 200/5000: 45.921708582733174
Average Total reward: 86.4
Running reward of episode 300/5000: 87.27037965547721
Average Total reward: 324.7
Running reward of episode 400/5000: 92.80076757091595
Average Total reward: 444.28
Running reward of episode 500/5000: 97.74359558527486
Average Total reward: 474.4
Running reward of episode 600/5000: 96.04065731338116
Average Total reward: 457.82
Running reward of episode 700/5000: 97.14494061084474
Average Total reward: 486.8
Running reward of episode 800/5000: 97.57150191439311
Average Total reward: 485.28
Running reward of episode 900/5000: 98.3438696457183
Average Total reward: 480.34
Running reward of episode 1000/5000: 98.08748301236918
Average Total reward: 487.18
Running reward of episode 1100/5000: 97.68392732577747
Average Total reward: 500.0
Running reward of episode 1200/5000: 80.7937247

average_episode_length,▁▂▅▇▇████▂███████▇█████████████████▄██▇█
average_lenght_all_episodes,▁
average_total_reward,▁▂▅▇▇████▂███████▇█████████████████▄██▇█
episode_length,▁▁▄▇██▇█▄██████████████▅████████▇██▃████
loss_policy,█▁▆▄▆▃▅▆▃▄▅▅▅▅▄▅▆▄▅▅▄▃▂▂▃▃▃█▅▃▅▃▆▄▅▄▄▅▅█
running_reward,▁▁▅▇███████████████████████████████▅████
test_average_episode_length,▁
test_average_total_reward,▁

average_episode_length,499.08
average_lenght_all_episodes,446.8126
average_total_reward,499.08
episode_length,500.0
loss_policy,-0.00377
running_reward,98.34934
test_average_episode_length,500.0
test_average_total_reward,500.0


Training agent with standardization baseline.
Running reward of episode 100/5000: 23.553795438268196
Average Total reward: 36.92
Running reward of episode 200/5000: 48.850442055002205
Average Total reward: 120.02
Running reward of episode 300/5000: 87.3899667134479
Average Total reward: 381.84
Running reward of episode 400/5000: 96.9750300420671
Average Total reward: 405.3
Running reward of episode 500/5000: 94.92428300482032
Average Total reward: 424.54
Running reward of episode 600/5000: 95.3577596200834
Average Total reward: 375.52
Running reward of episode 700/5000: 96.93442734276559
Average Total reward: 353.04
Running reward of episode 800/5000: 96.63340597882762
Average Total reward: 480.56
Running reward of episode 900/5000: 93.35433063628705
Average Total reward: 500.0
Running reward of episode 1000/5000: 98.31995125479428
Average Total reward: 500.0
Running reward of episode 1100/5000: 97.41970080271187
Average Total reward: 479.3
Running reward of episode 1200/5000: 97.80065

average_episode_length,▁▂▆▇▆▆███▇█████████▇████████████████████
average_lenght_all_episodes,▁
average_total_reward,▁▂▆▇▆▆███▇█████████▇████████████████████
episode_length,▁▂▅▇▆███████▇███▇███▇▇████████▆█████████
loss_policy,▅█▃▃▁▅▆▆▅▅▂▅▂▅▆▆▅▄▄▅▂▃▄▄▄▃▅▇▄▄▂▃▅▆▅█▆▄▇▆
running_reward,▁▂▆███▇▇████████████████████████████████
test_average_episode_length,▁
test_average_total_reward,▁

average_episode_length,500.0
average_lenght_all_episodes,461.3368
average_total_reward,500.0
episode_length,500.0
loss_policy,-0.00442
running_reward,98.34953
test_average_episode_length,500.0
test_average_total_reward,500.0


Training agent with standardization baseline.
Running reward of episode 100/5000: 26.83923487458664
Average Total reward: 45.16
Running reward of episode 200/5000: 71.07414355893434
Average Total reward: 291.66
Running reward of episode 300/5000: 75.59806452324817
Average Total reward: 120.5
Running reward of episode 400/5000: 64.91610301583613
Average Total reward: 167.26
Running reward of episode 500/5000: 60.49810698208862
Average Total reward: 166.26
Running reward of episode 600/5000: 83.81780691453102
Average Total reward: 180.28
Running reward of episode 700/5000: 84.25354414056797
Average Total reward: 187.3
Running reward of episode 800/5000: 89.31049440453363
Average Total reward: 270.28
Running reward of episode 900/5000: 87.40660873201648
Average Total reward: 254.5
Running reward of episode 1000/5000: 91.23645942247299
Average Total reward: 299.2
Running reward of episode 1100/5000: 94.95247169705188
Average Total reward: 403.0
Running reward of episode 1200/5000: 97.71454
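
The runs above use `"baseline": 'std'`, which the logs report as a "standardization baseline". The implementation inside `reinforce` is not shown in this section; a common reading, assumed in this sketch, is that the discounted returns of each episode are shifted to zero mean and scaled to unit standard deviation before they weight the log-probabilities in the policy loss. The function names, the `eps` constant, and the use of the population standard deviation are illustrative choices.

```python
import statistics

def discounted_returns(rewards, gamma=0.99):
    """Compute G_t = r_t + gamma * G_{t+1} for every step of one episode."""
    returns = []
    running = 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    returns.reverse()
    return returns

def standardize(values, eps=1e-8):
    """Shift to zero mean and scale to unit (population) standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    # eps guards against division by zero when all values are equal.
    return [(v - mean) / (std + eps) for v in values]

# A short hypothetical episode with reward 1 at every step, as in CartPole.
rewards = [1.0] * 5
returns = discounted_returns(rewards)
advantages = standardize(returns)
print(returns)
print(advantages)
```

Early steps receive positive weights and late steps negative ones, so actions taken shortly before the pole falls are discouraged relative to the rest of the episode.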

In [10]:
#random seeds, to reproduce the same results
seeds = [1, 11, 111, 1111, 11111]
for i in range(len(seeds)):

    # Training and arhitecture hyperparameters, initialise a wandb run
    run=wandb.init(
          project="Lab3-DRL-warmups",
          name = "Long Baseline Network ",
          config={
              "hidden_layers": 32,
              "num_episodes": 5000,
              "gamma": 0.99,
              "baseline": 'net',
              "eval_every":100,
              "eval_episodes": 50,
              "test_episodes" : 500,
              "temperature" : 5,
              "lr" : 1e-2,
              "lr_baseline" : 1e-3
              })
    
    # Copy the configuration
    config = wandb.config

    # Instantiate two versions of CartPole: one that animates the episodes (which slows
    # everything down), and another that does not animate.
    env = gymnasium.make('CartPole-v1')
    env_render = gymnasium.make('CartPole-v1', render_mode='human')

    # Seed PyTorch and both environments with the seed for this run.
    torch.manual_seed(seeds[i])
    env.reset(seed = seeds[i])
    env_render.reset(seed = seeds[i])
    
    # Make a policy network.
    policy = PolicyNet(env, config.hidden_layers, config.temperature)

    # Make a value network
    value = ValueNet(env, config.hidden_layers)
    
    # Create episode_runner
    episode_runner= Episode_runner(env, policy)
    episode_runner_rend= Episode_runner(env_render, policy)
    
    # Train the agent
    best_model_state_dict = reinforce(episode_runner, run, episode_runner_rend, gamma=config.gamma, num_episodes=config.num_episodes,
              baseline= value, display=False, eval_every=config.eval_every,
              eval_episodes=config.eval_episodes, lr= config.lr, lr_baseline = config.lr_baseline )
    
    # Load the best policy into the deterministic episode runner to test it.
    episode_runner.policy.load_state_dict(best_model_state_dict)
    det_ep_runner = Determinist_Test_Episode_runner(episode_runner, episode_runner_rend )
    det_ep_runner.test(test_episodes=config.test_episodes)
    
    # Close up everything
    env_render.close()
    env.close()

metric,value
average_episode_length,▁▅▂▃▃▃▄▄▇█▇████▅▁▄▅▅▃▃▃▇▃▇▅█▇▃▄██▆█▅████
average_lenght_all_episodes,▁
average_total_reward,▁▅▂▃▃▃▄▄▇█▇████▅▁▄▅▅▃▃▃▇▃▇▅█▇▃▄██▆█▅████
episode_length,▁▂▃▃▄▃▄▅▆▇█████▅▂▄▄▆▃▃▃▇▄▇▇█▇▃▄▇█▇▅▅████
loss_policy,▅▁▄▇█▅█▄▄▅▅▅▅▆▆▄▂█▇▆▃▇▇▆▄▆▄▇▅▆▃▅▅▃▆▄█▅▄▆
running_reward,▁▁▆▄▅▆▇▇▇███████▄▆▇▇▇▇▇█▇████▆▇█████████
test_average_episode_length,▁
test_average_total_reward,▁

metric,value
average_episode_length,500.0
average_lenght_all_episodes,324.567
average_total_reward,500.0
episode_length,500.0
loss_policy,-0.01108
running_reward,98.08132
test_average_episode_length,473.328
test_average_total_reward,473.328


Training agent with baseline value network.
Running reward of episode 100/5000: 18.385507946235805
Average Total reward: 25.22
Running reward of episode 200/5000: 27.72983752575506
Average Total reward: 32.12
Running reward of episode 300/5000: 64.22468848056272
Average Total reward: 155.14
Running reward of episode 400/5000: 89.10697015616128
Average Total reward: 335.62
Running reward of episode 500/5000: 95.63168623651953
Average Total reward: 415.32
Running reward of episode 600/5000: 93.84854663320502
Average Total reward: 500.0
Running reward of episode 700/5000: 98.25783903420837
Average Total reward: 473.08
Running reward of episode 800/5000: 97.15665590218028
Average Total reward: 467.8
Running reward of episode 900/5000: 98.2840907545666
Average Total reward: 465.64
Running reward of episode 1000/5000: 98.18950170364234
Average Total reward: 500.0
Running reward of episode 1100/5000: 98.34615702986025
Average Total reward: 500.0
Running reward of episode 1200/5000: 98.3495055

metric,value
average_episode_length,▁▁▃▆███▇████████████▆▃██████▇███████▆███
average_lenght_all_episodes,▁
average_total_reward,▁▁▃▆███▇████████████▆▃██████▇███████▆███
episode_length,▁▁▂▅▇████████████████▇█▅████████████▇███
loss_policy,▅▅█▆▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂
loss_value,▁▁▇█▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▅▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆
running_reward,▁▁▃▇▇████████████████▇█▇████████████████
test_average_episode_length,▁
test_average_total_reward,▁

metric,value
average_episode_length,500.0
average_lenght_all_episodes,450.5462
average_total_reward,500.0
episode_length,500.0
loss_policy,-0.31685
loss_value,598.28729
running_reward,98.34952
test_average_episode_length,500.0
test_average_total_reward,500.0


Training agent with baseline value network.
Running reward of episode 100/5000: 16.83578667352332
Average Total reward: 20.38
Running reward of episode 200/5000: 20.976225015836192
Average Total reward: 30.86
Running reward of episode 300/5000: 36.95819186369441
Average Total reward: 70.76
Running reward of episode 400/5000: 38.819241672584
Average Total reward: 81.48
Running reward of episode 500/5000: 96.0395940629453
Average Total reward: 492.72
Running reward of episode 600/5000: 97.31956066094345
Average Total reward: 450.82
Running reward of episode 700/5000: 97.61191316214983
Average Total reward: 478.96
Running reward of episode 800/5000: 96.33065113235315
Average Total reward: 481.86
Running reward of episode 900/5000: 96.68219995549433
Average Total reward: 491.32
Running reward of episode 1000/5000: 97.21477078743885
Average Total reward: 500.0
Running reward of episode 1100/5000: 98.29640428899066
Average Total reward: 500.0
Running reward of episode 1200/5000: 98.071517791

metric,value
average_episode_length,▁▁▂▂▇██████████████████▃▄▄▆▆██▅▃▂▃▃▂▂▂▂▃
average_lenght_all_episodes,▁
average_total_reward,▁▁▂▂▇██████████████████▃▄▄▆▆██▅▃▂▃▃▂▂▂▂▃
episode_length,▁▁▁▂█▇███████▇█████████▂▃▄█▆██▃▃▂▃▃▂▂▂▂▃
loss_policy,▇▆██▄▄▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▃▃▁▂▂▂▃▃▂▂▂▁▂▂▂▂
loss_value,▂▁▃▅███▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▃▇▇██▇▇▇▅▅▇▅▂▃▂▅▆
running_reward,▁▁▂▂█████████▇█████████▅▆▇████▇▆▅▆▆▄▄▄▆▆
test_average_episode_length,▁
test_average_total_reward,▁

metric,value
average_episode_length,143.86
average_lenght_all_episodes,333.0886
average_total_reward,143.86
episode_length,145.9
loss_policy,0.51675
loss_value,471.48407
running_reward,74.94729
test_average_episode_length,148.828
test_average_total_reward,148.828


Training agent with baseline value network.
Running reward of episode 100/5000: 22.281848419537802
Average Total reward: 30.88
Running reward of episode 200/5000: 41.25732420820014
Average Total reward: 53.4
Running reward of episode 300/5000: 69.02338710732116
Average Total reward: 169.32
Running reward of episode 400/5000: 83.35198132907558
Average Total reward: 194.24
Running reward of episode 500/5000: 83.6997560802067
Average Total reward: 310.32
Running reward of episode 600/5000: 72.3976967477512
Average Total reward: 103.26
Running reward of episode 700/5000: 70.58062011509753
Average Total reward: 171.04
Running reward of episode 800/5000: 73.07261246170619
Average Total reward: 127.74
Running reward of episode 900/5000: 70.02051382243351
Average Total reward: 154.12
Running reward of episode 1000/5000: 90.96148124459195
Average Total reward: 303.32
Running reward of episode 1100/5000: 74.28175608240997
Average Total reward: 140.32
Running reward of episode 1200/5000: 97.19703

metric,value
average_episode_length,▁▁▃▃▂▃▂▃▃███▅▅▅▄███████▇█▄▇▅██████████▃▄
average_lenght_all_episodes,▁
average_total_reward,▁▁▃▃▂▃▂▃▃███▅▅▅▄███████▇█▄▇▅██████████▃▄
episode_length,▁▁▂▄▅▂▃▃▃▆███▆▄▄█████████▄█▅███████▆██▂▃
loss_policy,▅▇█▇▄▃▂▃▃▂▂▂▂▂▃▃▂▂▂▂▁▂▂▂▂▃▁▂▂▂▂▂▂▂▂▃▂▂▂▂
loss_value,▁▄▇█▅▃▄▅▅▆▅▅▅▆▆▅▅▅▅▅▅▅▅▅▅▅▅▆▅▅▅▅▅▅▅▆▅▅▃▅
running_reward,▁▂▄▇▆▅▇▅▇▇█████▇█████████▇████████████▄▇
test_average_episode_length,▁
test_average_total_reward,▁

metric,value
average_episode_length,260.12
average_lenght_all_episodes,372.0146
average_total_reward,260.12
episode_length,241.3
loss_policy,-0.5874
loss_value,625.31879
running_reward,87.4106
test_average_episode_length,296.92
test_average_total_reward,296.92


Training agent with baseline value network.
Running reward of episode 100/5000: 17.44650506993677
Average Total reward: 19.88
Running reward of episode 200/5000: 31.832088364179725
Average Total reward: 35.78
Running reward of episode 300/5000: 49.58172597979848
Average Total reward: 77.24
Running reward of episode 400/5000: 86.35361267541263
Average Total reward: 318.74
Running reward of episode 500/5000: 90.25031077442189
Average Total reward: 419.4
Running reward of episode 600/5000: 75.66654488330619
Average Total reward: 169.3
Running reward of episode 700/5000: 86.40439981796591
Average Total reward: 173.6
Running reward of episode 800/5000: 78.92960674017783
Average Total reward: 156.38
Running reward of episode 900/5000: 90.62637731209816
Average Total reward: 366.28
Running reward of episode 1000/5000: 98.27393842759136
Average Total reward: 499.92
Running reward of episode 1100/5000: 98.29516696862272
Average Total reward: 498.1
Running reward of episode 1200/5000: 98.3490281

metric,value
average_episode_length,▁▁▂▅▃▃▃▆██████████▇███████████▅▆█▇████▄█
average_lenght_all_episodes,▁
average_total_reward,▁▁▂▅▃▃▃▆██████████▇███████████▅▆█▇████▄█
episode_length,▁▁▂▅█▃▃█▆███▄███████████▇████████▇██████
loss_policy,▆██▇▃▆▃▁▃▃▃▃▃▃▃▃▂▃▂▃▃▃▃▃▃▃▃▃▃▃▃▂▃▃▃▃▃▃▂▃
loss_value,▁▂▅█▅▆▅▆▆▅▅▅▄▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅
running_reward,▁▁▃▆▇▆▆▇████████████████████████████████
test_average_episode_length,▁
test_average_total_reward,▁

metric,value
average_episode_length,500.0
average_lenght_all_episodes,423.678
average_total_reward,500.0
episode_length,500.0
loss_policy,0.21494
loss_value,598.36804
running_reward,98.34595
test_average_episode_length,500.0
test_average_total_reward,500.0


Training agent with baseline value network.
Running reward of episode 100/5000: 21.38267050170289
Average Total reward: 25.68
Running reward of episode 200/5000: 32.62097073353295
Average Total reward: 49.18
Running reward of episode 300/5000: 57.47753396836479
Average Total reward: 129.76
Running reward of episode 400/5000: 80.79634629702414
Average Total reward: 298.72
Running reward of episode 500/5000: 91.93705105788192
Average Total reward: 422.34
Running reward of episode 600/5000: 95.66825502449966
Average Total reward: 433.68
Running reward of episode 700/5000: 97.71020506950825
Average Total reward: 463.3
Running reward of episode 800/5000: 98.04008323667588
Average Total reward: 424.3
Running reward of episode 900/5000: 90.69683889750817
Average Total reward: 490.7
Running reward of episode 1000/5000: 98.2942283719292
Average Total reward: 500.0
Running reward of episode 1100/5000: 98.32808925030336
Average Total reward: 500.0
Running reward of episode 1200/5000: 98.349398538

## Video recording

In [None]:
from gymnasium.wrappers import RecordVideo

env_render = gymnasium.make('CartPole-v1', render_mode="rgb_array")

# Wrap the environment so that every 10th episode is recorded into video_folder.
recorder = RecordVideo(
    env=env_render,
    video_folder="/data01/dl24marchi/DLAProjects/videos",
    name_prefix="test-video",
    episode_trigger=lambda x: x % 10 == 0,
)

# Reset the environment for a fresh start.
observation, info = recorder.reset()

###
# Start the recorder
recorder.start_video_recorder()

# And run the final agent for a few episodes.
for _ in range(100):
    run_episode(recorder, policy)

####
# Don't forget to close the video recorder before the env!
recorder.close_video_recorder()

env_render.close()

-----
## Exercise 3: Going Deeper

As usual, pick **AT LEAST ONE** of the following exercises to complete.

### Exercise 3.1: Solving Lunar Lander with `REINFORCE` (easy)

Use my (or even better, improve on my) implementation of `REINFORCE` to solve the [Lunar Lander Environment](https://gymnasium.farama.org/environments/box2d/lunar_lander/). This environment is a little bit harder than Cartpole, but not much. Make sure you perform the same types of analyses we did during the lab session to quantify and qualify the performance of your agents.
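Because training is stochastic, one way to carry out "the same types of analyses" is to train with several seeds and report the mean evaluation curve together with a confidence band, rather than a single run. The sketch below assumes you have collected one evaluation curve per seed into a 2D array; the function name `mean_and_ci` and the numbers in `example` are placeholders for illustration, not results from this lab. It uses a normal approximation (`z = 1.96`), which is crude for only a handful of seeds.

```python
import numpy as np

def mean_and_ci(curves, z=1.96):
    """Return the per-checkpoint mean and a confidence half-width.

    curves: array-like of shape (num_seeds, num_checkpoints).
    z: normal quantile (1.96 gives an approximate 95% interval).
    """
    curves = np.asarray(curves, dtype=float)
    mean = curves.mean(axis=0)
    # Standard error of the mean across seeds (sample std, ddof=1).
    sem = curves.std(axis=0, ddof=1) / np.sqrt(curves.shape[0])
    return mean, z * sem

# Placeholder numbers only: 3 seeds x 4 evaluation checkpoints.
example = [[20.0, 150.0, 400.0, 500.0],
           [25.0,  90.0, 300.0, 480.0],
           [18.0, 120.0, 350.0, 500.0]]
mean, half_width = mean_and_ci(example)
```

The band `mean - half_width` to `mean + half_width` can then be drawn with `plt.fill_between` around the mean curve. With very few seeds, a Student-t quantile in place of `z` would give a more honest (wider) interval.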

### Exercise 3.2: Solving Cartpole and Lunar Lander with `Deep Q-Learning` (harder)

On-policy Deep Reinforcement Learning tends to be **very unstable**. Write an implementation (or adapt an existing one) of `Deep Q-Learning` to solve our two environments (Cartpole and Lunar Lander). To do this you will need to implement a **Replay Buffer** and use a second, slow-moving **target Q-Network** to stabilize learning.
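As a starting point for the Replay Buffer, here is a minimal sketch using only the standard library. The class name `ReplayBuffer`, the `Transition` fields, and the method names are illustrative choices, not part of any existing implementation in this notebook. It stores transitions in a fixed-capacity FIFO queue and samples uniformly at random.

```python
import random
from collections import deque, namedtuple

# One stored environment step; the field names are illustrative.
Transition = namedtuple("Transition", ["state", "action", "reward", "next_state", "done"])

class ReplayBuffer:
    """Fixed-capacity FIFO store of transitions with uniform random sampling."""

    def __init__(self, capacity):
        # A deque with maxlen drops the oldest transition once it is full.
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append(Transition(state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform sampling without replacement breaks the temporal correlation
        # between consecutive steps of the same episode.
        batch = random.sample(list(self.buffer), batch_size)
        # Transpose a list of Transitions into one Transition of tuples.
        return Transition(*zip(*batch))

    def __len__(self):
        return len(self.buffer)

# Toy usage: capacity 3, so only the last three of five pushes survive.
buffer = ReplayBuffer(capacity=3)
for step in range(5):
    buffer.push(state=step, action=0, reward=1.0, next_state=step + 1, done=False)
batch = buffer.sample(batch_size=2)
```

In a real agent the states would be observation arrays, and each field of `batch` would be converted to a tensor before computing the loss. The target Q-network is typically a copy of the online network whose weights are synchronized only every so many steps; in PyTorch this is commonly done with `target_net.load_state_dict(online_net.state_dict())`, where both network names are hypothetical.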

### Exercise 3.3: Solving the OpenAI CarRacing environment (hardest)

Use `Deep Q-Learning` -- or even better, an off-the-shelf implementation of **Proximal Policy Optimization (PPO)** -- to train an agent to solve the [OpenAI CarRacing](https://github.com/andywu0913/OpenAI-GYM-CarRacing-DQN) environment. This will be the most *fun*, but also the most *difficult*. Some tips:

1. Make sure you use the `continuous=False` argument to the environment constructor. This ensures that the action space is **discrete** (we haven't seen how to work with continuous action spaces).
2. Your Q-Network will need to be a CNN. A simple one should do, with two convolutional + maxpool layers, followed by two dense layers. You will **definitely** want to use a GPU to train your agents.
3. The observation space of the environment is a single **color image** (a single frame of the game). Most implementations stack multiple frames (e.g. 3) after converting them to grayscale images as an observation.
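The frame stacking mentioned in the last tip can be sketched as follows. This is a hand-rolled illustration, assuming 96x96 RGB frames as produced by CarRacing; the helper names `to_grayscale` and `FrameStack` are hypothetical, and Gymnasium also ships observation wrappers for the same purpose (check the documentation for your installed version).

```python
from collections import deque
import numpy as np

def to_grayscale(frame):
    """Convert an (H, W, 3) uint8 RGB frame to an (H, W) float32 image in [0, 1]."""
    weights = np.array([0.299, 0.587, 0.114], dtype=np.float32)  # BT.601 luma weights
    return (frame.astype(np.float32) @ weights) / 255.0

class FrameStack:
    """Keep the last num_frames grayscale frames as a (num_frames, H, W) array."""

    def __init__(self, num_frames=3):
        self.num_frames = num_frames
        self.frames = deque(maxlen=num_frames)

    def reset(self, frame):
        # At the start of an episode, fill the stack with copies of the first frame.
        gray = to_grayscale(frame)
        self.frames.clear()
        for _ in range(self.num_frames):
            self.frames.append(gray)
        return np.stack(self.frames, axis=0)

    def step(self, frame):
        # Appending to a full deque discards the oldest frame.
        self.frames.append(to_grayscale(frame))
        return np.stack(self.frames, axis=0)

# Dummy frames stand in for real observations: one black, then one white.
stack = FrameStack(num_frames=3)
obs = stack.reset(np.zeros((96, 96, 3), dtype=np.uint8))
obs = stack.step(np.full((96, 96, 3), 255, dtype=np.uint8))
```

The resulting array has the channel-first layout that `torch.nn.Conv2d` expects, so the number of stacked frames becomes the number of input channels of the first convolutional layer.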

