Name: Naser Kazemi

Student ID: 99102059

In this assignment, we will implement and test REINFORCE and PPO, which are both on-policy RL algorithms.

# REINFORCE algorithm **(40 points)**

## Setup

We must first install the required packages.

In [None]:
!pip install "gymnasium[mujoco]"
# !pip install "gymnasium[classic-control]"
!pip install imageio
# !pip install "imageio[ffmpeg]"
# !pip install "imageio[pyav]"

Collecting gymnasium[mujoco]
  Downloading gymnasium-0.29.1-py3-none-any.whl (953 kB)
Collecting farama-notifications>=0.0.1 (from gymnasium[mujoco])
  Downloading Farama_Notifications-0.0.4-py3-none-any.whl (2.5 kB)
Collecting mujoco>=2.3.3 (from gymnasium[mujoco])
  Downloading mujoco-3.1.5-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (5.5 MB)
Collecting glfw (from mujoco>=2.3.3->gymnasium[mujoco])
  Downloading glfw-2.7.0-py2.py27.py3.py30.py31.py32.py33.py34.py35.py36.py37.py38-none-manylinux2014_x86_64.whl (211 kB)
Installing collected packages: glfw, farama-notifications, gymnasium, mujoco
Successfully insta

In [None]:
import gymnasium as gym
import random
import matplotlib
from matplotlib import pyplot as plt
import numpy as np
from collections import namedtuple, deque
import imageio

import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from torch.distributions import MultivariateNormal, Normal
from torch.distributions import Categorical

import json

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device

## Explore the environment

We will train a REINFORCE agent on the `CartPole` environment.

This code displays a video given its path.

In [None]:
from IPython.display import HTML
from base64 import b64encode

def show_video(path):
    # Read the file inside a context manager so the handle is closed promptly
    with open(path, 'rb') as f:
        mp4 = f.read()
    data_url = "data:video/mp4;base64," + b64encode(mp4).decode()
    return HTML("""
    <video width=400 controls>
          <source src="%s" type="video/mp4">
    </video>
    """ % data_url)

Explore the `CartPole` environment using random actions. At each timestep, render the current frame, and use it to make a video of the trajectory.

In [None]:
env = gym.make("CartPole-v1", render_mode="rgb_array")
frames = []

env.reset()
for _ in range(100):
    frames.append(env.render())
    # TODO: select a random action
    action = env.action_space.sample()
    obs, reward, terminated, truncated, info = env.step(action)
    if terminated or truncated:
        break
env.close()
imageio.mimsave('./CartPole.mp4', frames, fps=25)
show_video('./CartPole.mp4')




## Policy Network **(10 points)**

Complete the following code to build an agent that predicts the probability of playing each action, given the state.

In [None]:
class PolicyNetwork(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(PolicyNetwork, self).__init__()
        # TODO
        # Define the Policy Network architecture
        self.fc1 = nn.Linear(input_size, hidden_size)
        self.fc2 = nn.Linear(hidden_size, output_size)

    def forward(self, x):
        # TODO
        # predict the probability of playing each action
        x = F.relu(self.fc1(x))
        x = self.fc2(x)
        action_probs = F.softmax(x, dim=-1)
        return action_probs
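
To make the output of `forward` concrete, here is a small standard-library sketch of what the softmax layer and the sampling step that follows it compute. The helpers `softmax` and `sample_action` are hypothetical names used only for illustration; the notebook itself relies on `F.softmax` and `torch.distributions.Categorical`.

```python
import math
import random


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_action(probs, rng=random):
    """Draw one action index according to the given probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


# Two logits, as for CartPole's two actions (the values are made up).
probs = softmax([2.0, 0.0])
action = sample_action(probs, random.Random(0))
log_prob = math.log(probs[action])  # the quantity REINFORCE stores per step
print(probs, action, log_prob)
```

The larger logit receives the larger probability, and the stored log probability is never positive because each probability is at most 1.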

## Agent **(20 points)**

The REINFORCE algorithm interacts with an environment by taking actions sampled from a policy. As the agent collects rewards, it records them along with the **log probabilities** of the actions it took. At the end of an episode, the algorithm calculates the total **discounted reward** from each step onward; this is known as the return.

$$ R_t = \sum_{k=t}^{T} \gamma^{k-t} r_k
 $$

These returns are used to weight the logged probabilities, so that actions leading to higher returns are made more probable.


$$ \theta \leftarrow \theta + \alpha \sum_{t=0}^{T-1} \gamma^t R_t \nabla_\theta \log \pi_\theta(a_t|s_t)
 $$
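
As a minimal sketch of the return formula above, the following pure-Python snippet computes the discounted returns in a single backward pass and then standardizes them, which is what `update_policy` does with tensors. The helper names `discounted_returns` and `standardize` are illustrative and not part of the assignment code. Note that the agent in this notebook weights each log probability by the standardized return and does not apply the extra $\gamma^t$ factor from the update rule.

```python
def discounted_returns(rewards, gamma=0.99):
    """Return [R_0, ..., R_{T-1}] using R_t = r_t + gamma * R_{t+1}."""
    returns = []
    running = 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    returns.reverse()
    return returns


def standardize(values, eps=1e-9):
    """Shift to zero mean and scale by the sample standard deviation.

    Uses the (n - 1) denominator and therefore needs at least two values.
    """
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / (n - 1)
    std = var ** 0.5
    return [(v - mean) / (std + eps) for v in values]


rewards = [1.0, 1.0, 1.0]
returns = discounted_returns(rewards, gamma=0.5)
print(returns)               # [1.75, 1.5, 1.0]
print(standardize(returns))  # earlier steps get larger weights
```

With a reward of 1 at every step, earlier steps accumulate more future reward, so after standardization they carry positive weights while the last steps carry negative ones.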


In [None]:
class REINFORCEAgent:
    def __init__(self, policy, optimizer, gamma=0.99):
        self.policy = policy
        self.optimizer = optimizer
        self.gamma = gamma
        self.log_probs = []
        self.rewards = []

    def select_action(self, state):
        # TODO
        # select an action by sampling from the actor's response
        state = torch.from_numpy(state).float().unsqueeze(0)
        probs = self.policy(state)
        m = torch.distributions.Categorical(probs)
        action = m.sample()
        self.log_probs.append(m.log_prob(action))
        action = action.item()
        return action

    def update_policy(self):
        R = 0
        policy_loss = []
        returns = []

        # TODO
        # Calculate the discounted reward
        for r in self.rewards[::-1]:
            R = r + self.gamma * R
            returns.insert(0, R)

        returns = torch.tensor(returns)
        returns = (returns - returns.mean()) / (returns.std() + 1e-9)

        # TODO
        # Calculate the policy loss
        for log_prob, R in zip(self.log_probs, returns):
            policy_loss.append(- log_prob * R)


        self.optimizer.zero_grad()
        policy_loss = torch.cat(policy_loss).sum()
        policy_loss.backward()
        self.optimizer.step()

        # Reset the rewards and log probabilities
        del self.rewards[:]
        del self.log_probs[:]

    def store_reward(self, reward):
        self.rewards.append(reward)


## Training **(5 points)**

Define the hyperparameters and complete the training loop.

In [None]:
env = gym.make('CartPole-v1')
input_size = env.observation_space.shape[0]
output_size = int(env.action_space.n) # 2 discrete actions: left, right
lr = 0.01

policy = PolicyNetwork(input_size, 128, output_size)
optimizer = optim.Adam(policy.parameters(), lr=lr)
agent = REINFORCEAgent(policy, optimizer)

num_episodes = 1000 # TODO

for episode in range(num_episodes):
    state, info = env.reset()
    total_reward = 0

    # TODO
    # collect rewards and log probabilities for updating the policy in a loop
    while True:
        action = agent.select_action(state)
        next_state, reward, terminated, truncated, _ = env.step(action)
        agent.store_reward(reward)
        total_reward += reward
        state = next_state

        if terminated or truncated:
            break

    agent.update_policy()
    if episode % 50 == 0:
        print(f'Episode {episode+1}: Total Reward = {total_reward}')
env.close()

Episode 1: Total Reward = 25.0
Episode 51: Total Reward = 185.0
Episode 101: Total Reward = 500.0
Episode 151: Total Reward = 203.0
Episode 201: Total Reward = 173.0
Episode 251: Total Reward = 210.0
Episode 301: Total Reward = 500.0
Episode 351: Total Reward = 123.0
Episode 401: Total Reward = 500.0
Episode 451: Total Reward = 500.0
Episode 501: Total Reward = 146.0
Episode 551: Total Reward = 500.0
Episode 601: Total Reward = 500.0
Episode 651: Total Reward = 500.0
Episode 701: Total Reward = 500.0
Episode 751: Total Reward = 500.0
Episode 801: Total Reward = 500.0
Episode 851: Total Reward = 500.0
Episode 901: Total Reward = 325.0
Episode 951: Total Reward = 500.0


## Evaluation **(5 points)**

Here we use the trained agent to collect a trajectory using its policy. Calculate the cumulative reward by adding the rewards at each time step. Save and display the video of this run at the end.

In [None]:
env = gym.make("CartPole-v1", render_mode="rgb_array")
state, _ = env.reset()
frames = []

total_reward = 0
# TODO
# run the policy in the environment in a loop
while True:
    action = agent.select_action(state)
    next_state, reward, terminated, truncated, _ = env.step(action)
    frames.append(env.render())
    total_reward += reward
    state = next_state

    if terminated or truncated:
        break


env.close()
print(f'Total Reward: {total_reward}')

imageio.mimsave('./eval_reinforce.mp4', frames, fps=25)
show_video('./eval_reinforce.mp4')



Total Reward: 500.0




# Proximal Policy Optimization **(60 points)**

## Setup

## Explore the environment

This code is needed to render MuJoCo-based environments.

In [None]:
# Configure MuJoCo to use the EGL rendering backend (requires GPU)
%env MUJOCO_GL=egl

env: MUJOCO_GL=egl


We will train a PPO agent in the `HalfCheetah` environment. This environment features continuous actions and more complex mechanics.

Explore this environment using random actions as well, and display the video of the resulting trajectory.

* What are the observation and action spaces of this environment?

* Are values bounded?

In [None]:
env = gym.make("HalfCheetah-v4", render_mode="rgb_array")
env.reset()
frames = []

for _ in range(100):
    frames.append(env.render())
    # TODO
    action = env.action_space.sample()
    obs, reward, terminated, truncated, info = env.step(action)
    if terminated or truncated:
        break
env.close()
imageio.mimsave('./HalfCheetah.mp4', frames, fps=25)
show_video('./HalfCheetah.mp4')


## Actor & Critic **(15 points)**

Proximal Policy Optimization (PPO) is a policy-gradient reinforcement learning algorithm that uses separate actor and critic networks.

The actor network maps the current state to a distribution over actions: a probability for each action in the discrete case, or the parameters of a distribution (here, the mean and standard deviation of a Gaussian) in the continuous case. The critic network estimates the value of a state, i.e. the expected discounted return from that state, which is used to judge how good the actor's actions turned out to be.


In [None]:
class Actor(nn.Module):
    def __init__(self, state_dim, hidden_size, action_dim):
        super(Actor, self).__init__()
        # TODO
        # Define the Actor architecture
        self.fc1 = nn.Linear(state_dim, hidden_size)
        self.mu = nn.Linear(hidden_size, action_dim)

        self.log_std = nn.Parameter(torch.zeros(action_dim), requires_grad=True)



    def forward(self, state):
        # TODO
        # In case of continuous environment, we usually
        # predict a mean and std for each action and sample
        # the action from a normal distribution
        x = F.relu(self.fc1(state))
        mu = torch.tanh(self.mu(x))
        std = torch.exp(self.log_std)
        return mu, std

class Critic(nn.Module):
    def __init__(self, state_dim, hidden_size):
        super(Critic, self).__init__()
        # TODO
        # Define the Critic architecture
        self.fc1 = nn.Linear(state_dim , hidden_size)
        self.value = nn.Linear(hidden_size, 1)

    def forward(self, state):
        # TODO
        # Predict the value of state
        x = F.relu(self.fc1(state))
        value = self.value(x)
        return value
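
For intuition about what the actor's mean and standard deviation are used for, here is a hedged standard-library sketch of the Gaussian log probability. The function names are hypothetical. `Normal(mu, std).log_prob(action)` in PyTorch returns one value per action dimension; summing those values gives the joint log probability under the assumption that the dimensions are independent (a diagonal Gaussian).

```python
import math


def normal_log_prob(x, mu, std):
    """Log density of a 1-D Gaussian N(mu, std^2) evaluated at x."""
    var = std ** 2
    return -((x - mu) ** 2) / (2 * var) - math.log(std) - 0.5 * math.log(2 * math.pi)


def diag_gaussian_log_prob(action, mus, stds):
    """Joint log probability of an action whose dimensions are independent."""
    return sum(normal_log_prob(a, m, s) for a, m, s in zip(action, mus, stds))


# A made-up 3-dimensional action; std = exp(log_std) with log_std = 0 gives 1.0.
action = [0.1, -0.3, 0.5]
mus = [0.0, 0.0, 0.4]
stds = [1.0, 1.0, 1.0]

per_dim = [normal_log_prob(a, m, s) for a, m, s in zip(action, mus, stds)]
joint = diag_gaussian_log_prob(action, mus, stds)
print(per_dim, joint)
```

Whether to keep per-dimension log probabilities or sum them is a design choice; it determines whether the PPO ratio is computed per action dimension or once per action.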

## Memory

PPO algorithms need to store sequences of actions, states, log probabilities, rewards, and state values to train the agent. This data is captured in the `Memory` class, which facilitates batch processing by holding and then clearing these elements at the end of each training iteration.

In [None]:
class Memory:
    def __init__(self):
        self.actions = []
        self.states = []
        self.logprobs = []
        self.rewards = []
        self.state_values = []

    def clear(self):
        del self.actions[:]
        del self.states[:]
        del self.logprobs[:]
        del self.rewards[:]
        del self.state_values[:]


## Agent **(35 points)**

In PPO, the actor's goal is to maximize the expected return. However, direct maximization can cause large policy updates, risking instability. To prevent this, PPO employs a clipping mechanism, limiting policy changes to a defined range.

$$ L^{CLIP}(\theta) = \hat{\mathbb{E}}_t \left[ \min(r_t(\theta) \hat{A}_t, \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon) \hat{A}_t) \right]
 $$

The objective is built on a probability ratio between the new and the old policy. This ratio measures how much more (or less) likely the sampled action has become under the updated policy, and it scales the advantage accordingly.

$$ r_t(\theta) = \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)}
 $$

 The critic aims to minimize the error between its predictions and the actual returns.

 $$ L^{VF}(\phi) = \left( V_\phi(s_t) - \hat{R}_t \right)^2
 $$
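
The three formulas above can be checked on single numbers. The snippet below is a minimal sketch with hypothetical helper names (`clipped_surrogate`, `value_loss`); it evaluates the objective for one timestep, whereas the agent averages it over a batch of tensors.

```python
import math


def clipped_surrogate(new_logp, old_logp, advantage, eps_clip=0.2):
    """Per-timestep PPO objective: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    ratio = math.exp(new_logp - old_logp)  # r_t = pi_new / pi_old
    clipped = min(max(ratio, 1.0 - eps_clip), 1.0 + eps_clip)
    return min(ratio * advantage, clipped * advantage)


def value_loss(value, ret):
    """Squared error between the critic's prediction and the return."""
    return (value - ret) ** 2


# Ratio of about 1.5 with a positive advantage: the gain is capped at 1.2 * A.
print(clipped_surrogate(math.log(1.5), 0.0, advantage=2.0))   # approximately 2.4
# Same ratio with a negative advantage: the unclipped, more pessimistic term wins.
print(clipped_surrogate(math.log(1.5), 0.0, advantage=-2.0))  # approximately -3.0
# Ratio of about 0.5 with a negative advantage: clipped at 0.8 * A.
print(clipped_surrogate(math.log(0.5), 0.0, advantage=-2.0))  # approximately -1.6
print(value_loss(1.0, 3.0))                                   # 4.0
```

Taking the minimum makes the objective a pessimistic bound: the policy gains nothing from pushing the ratio outside $[1-\epsilon, 1+\epsilon]$ in the direction that would improve the objective, but it is still penalized in full when a move outside that range makes things worse. The actor loss is the negative of this objective.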

In [None]:
class PPO(nn.Module):
    def __init__(self, state_dim, action_dim, hidden_size=64, lr=1e-4, gamma=0.99, epochs=4, eps_clip=0.2):
        super(PPO, self).__init__()
        self.gamma = gamma
        self.eps_clip = eps_clip
        self.epochs = epochs

        self.actor = Actor(state_dim, hidden_size, action_dim)
        self.critic = Critic(state_dim, hidden_size)

        self.optimizer_actor = optim.Adam(self.actor.parameters(), lr=lr)
        self.optimizer_critic = optim.Adam(self.critic.parameters(), lr=lr)
        self.memory = Memory()

    def select_action(self, state, train=True):
        # TODO
        # Save state, action, log probability and state value of current step in the memory buffer.
        # predict the actions by sampling from a normal distribution
        # based on the mean and std calculated by actor
        state = torch.tensor(state, dtype=torch.float).to(device)
        mu, std = self.actor(state)
        # dist = MultivariateNormal(mu, torch.diag_embed(std))
        dist = Normal(mu, std)
        action = dist.sample()
        logprob = dist.log_prob(action)
        state_value = self.critic(state)

        if train:
            self.memory.states.append(state)
            self.memory.actions.append(action)
            self.memory.logprobs.append(logprob)
            self.memory.state_values.append(state_value)

        return action.detach().cpu().numpy()

    def evaluate(self, state, action):
        # TODO
        # evaluate the state value of this state and log probability of choosing this action
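        # `action` is already a tensor when called from update(); as_tensor avoids the copy warning
        action = torch.as_tensor(action, dtype=torch.float, device=device)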
        mu, std = self.actor(state)
        # dist = MultivariateNormal(mu, torch.diag_embed(std))
        dist = Normal(mu, std)
        action_logprobs = dist.log_prob(action)
        state_value = self.critic(state)
        entropy = dist.entropy()
        return action_logprobs, state_value, entropy

    def update(self):
        rewards = []
        discounted_reward = 0
        # TODO
        # Calculate discounted rewards
        for reward in self.memory.rewards[::-1]:
            discounted_reward = reward + self.gamma * discounted_reward
            rewards.insert(0, discounted_reward)

        rewards = torch.tensor(rewards, dtype=torch.float32).to(device)
        rewards = (rewards - rewards.mean()) / (rewards.std() + 1e-7)


        # TODO
        # load saved states, actions, log probs, and state values
        old_states = torch.squeeze(torch.stack(self.memory.states, dim=0)).detach().to(device)
        old_actions = torch.squeeze(torch.stack(self.memory.actions, dim=0)).detach().to(device)
        old_logprobs = torch.squeeze(torch.stack(self.memory.logprobs, dim=0)).detach().to(device)
        old_state_values = torch.squeeze(torch.stack(self.memory.state_values, dim=0)).detach().to(device)


        # TODO
        # Calculate advantages for each timestep (usually difference of rewards and state values)
        advantages = rewards.detach() - old_state_values.detach()
        advantages = advantages.unsqueeze(-1)

        loss_ac = 0
        loss_cri = 0
        for _ in range(self.epochs):
            # calculate logprobs and state values based on the new policy
            logprobs, state_values, entropy = self.evaluate(old_states, old_actions) # TODO
            state_values = torch.squeeze(state_values)

            # TODO
            # Calculate the loss function and perform the optimization

            ratios = torch.exp(logprobs - old_logprobs.detach())

            # print(advantages.shape)

            surr1 = ratios * advantages
            surr2 = torch.clamp(ratios, 1 - self.eps_clip, 1 + self.eps_clip) * advantages
            loss_actor = -torch.min(surr1, surr2).mean() # TODO
            loss_critic = F.mse_loss(state_values, rewards)# TODO

            self.optimizer_actor.zero_grad()
            loss_actor.backward()
            loss_ac += loss_actor.item()
            self.optimizer_actor.step()

            self.optimizer_critic.zero_grad()
            loss_critic.backward()
            loss_cri += loss_critic.item()
            self.optimizer_critic.step()

        # clear the buffer
        self.memory.clear()
        return loss_ac, loss_cri

## Training **(5 points)**

Define the hyperparameters and complete the training loop.

In [None]:
env = gym.make("HalfCheetah-v4")
state_dim = env.observation_space.shape[0] # TODO
action_dim = env.action_space.shape[0] # TODO
hidden_size = 64 # TODO
lr = 3e-4 # TODO

agent = model = PPO(state_dim, action_dim, hidden_size=hidden_size, lr=lr).to(device)

# We need to train for many more steps to achieve acceptable results compared to the last environment
num_episodes = 1000 # TODO
update_period = 4

actor_losses = []
critic_losses = []
moving_rewards = np.array([])

In [None]:
def train(start_episode):
    moving_rewards = np.array([])
    for episode in range(num_episodes):
        state, _ = env.reset()
        total_reward = 0
        # TODO
        # write the training loop
        while True:
            action = agent.select_action(state)
            next_state, reward, terminated, truncated, info = env.step(action)
            agent.memory.rewards.append(reward)

            total_reward += reward
            state = next_state

            if terminated or truncated:
                break

        if episode % update_period == 0:
            loss_ac, loss_cri = agent.update()
            actor_losses.append(loss_ac)
            critic_losses.append(loss_cri)
            moving_rewards = np.append(moving_rewards, total_reward)
        if episode % 100 == 0:
            print(f"actor loss:\t{loss_ac:.6f}")
            print(f"critic loss:\t{loss_cri:.6f}")
            print(f'Episode {episode + start_episode}: Going Reward = {moving_rewards.mean():.1f}: Std = {moving_rewards.std():.1f}')
            moving_rewards = np.array([])

    env.close()


def save_memory_to_json(memory, filename):
    memory_data = {
        'actions': [action.tolist() for action in memory.actions],
        'states': [state.tolist() for state in memory.states],
        'logprobs': [logprob.tolist() for logprob in memory.logprobs],
        'rewards': [reward.tolist() for reward in memory.rewards],
        'state_values': [state_value.tolist() for state_value in memory.state_values]
    }

    with open(filename, 'w') as f:
        json.dump(memory_data, f)


def load_memory_from_json(filename):
    with open(filename, 'r') as f:
        memory_data = json.load(f)

    # Convert lists back to tensors
    memory = Memory()
    memory.actions = [torch.tensor(action) for action in memory_data['actions']]
    memory.states = [torch.tensor(state) for state in memory_data['states']]
    memory.logprobs = [torch.tensor(logprob) for logprob in memory_data['logprobs']]
    memory.rewards = [torch.tensor(reward) for reward in memory_data['rewards']]
    memory.state_values = [torch.tensor(state_value) for state_value in memory_data['state_values']]

    return memory

In [None]:
train(0)
torch.save(agent.actor.state_dict(), 'ppo_actor_model.pth')
torch.save(agent.critic.state_dict(), 'ppo_critic_model.pth')
save_memory_to_json(agent.memory, 'ppo_memory.json')

  action = torch.tensor(action, dtype=torch.float)


actor loss:	1.652674
critic loss:	6.691543
Episode 0: Going Reward = -1069.7: Std = 0.0
actor loss:	-0.018704
critic loss:	3.840977
Episode 100: Going Reward = -897.0: Std = 116.0
actor loss:	0.048294
critic loss:	3.963023
Episode 200: Going Reward = -835.1: Std = 98.8
actor loss:	-0.122607
critic loss:	3.930660
Episode 300: Going Reward = -783.1: Std = 91.4
actor loss:	-0.072026
critic loss:	3.946661
Episode 400: Going Reward = -723.4: Std = 62.4
actor loss:	0.042279
critic loss:	4.123845
Episode 500: Going Reward = -698.1: Std = 52.2
actor loss:	0.136185
critic loss:	4.051512
Episode 600: Going Reward = -691.3: Std = 56.6
actor loss:	-0.342282
critic loss:	3.880317
Episode 700: Going Reward = -639.8: Std = 60.1
actor loss:	0.204279
critic loss:	3.931245
Episode 800: Going Reward = -606.5: Std = 48.3
actor loss:	0.036396
critic loss:	3.980851
Episode 900: Going Reward = -571.9: Std = 61.8


In [None]:
train(1000)
torch.save(agent.actor.state_dict(), 'ppo_actor_model.pth')
torch.save(agent.critic.state_dict(), 'ppo_critic_model.pth')
save_memory_to_json(agent.memory, 'ppo_memory.json')



actor loss:	0.128073
critic loss:	3.476170
Episode 1000: Going Reward = -536.7: Std = 0.0
actor loss:	-0.030239
critic loss:	3.833135
Episode 1100: Going Reward = -510.9: Std = 54.9
actor loss:	-0.058190
critic loss:	3.611347
Episode 1200: Going Reward = -508.1: Std = 61.0
actor loss:	0.398291
critic loss:	3.808809
Episode 1300: Going Reward = -421.1: Std = 68.7
actor loss:	-0.134574
critic loss:	3.890572
Episode 1400: Going Reward = -368.9: Std = 49.2
actor loss:	0.153144
critic loss:	3.995360
Episode 1500: Going Reward = -343.1: Std = 70.9
actor loss:	0.128896
critic loss:	3.851589
Episode 1600: Going Reward = -299.4: Std = 54.8
actor loss:	0.152894
critic loss:	3.788832
Episode 1700: Going Reward = -254.1: Std = 63.8
actor loss:	0.103453
critic loss:	3.731679
Episode 1800: Going Reward = -208.8: Std = 81.6
actor loss:	0.331599
critic loss:	3.825966
Episode 1900: Going Reward = -162.0: Std = 48.6


In [None]:
train(2000)
torch.save(agent.actor.state_dict(), 'ppo_actor_model.pth')
torch.save(agent.critic.state_dict(), 'ppo_critic_model.pth')
save_memory_to_json(agent.memory, 'ppo_memory.json')



actor loss:	0.181833
critic loss:	3.854838
Episode 2000: Going Reward = -140.0: Std = 0.0
actor loss:	-0.002660
critic loss:	3.687878
Episode 2100: Going Reward = -101.6: Std = 67.9
actor loss:	-0.247535
critic loss:	3.839186
Episode 2200: Going Reward = -87.3: Std = 95.0
actor loss:	0.174657
critic loss:	3.772946
Episode 2300: Going Reward = -11.3: Std = 63.5
actor loss:	0.079238
critic loss:	3.344638
Episode 2400: Going Reward = 16.4: Std = 59.6
actor loss:	-0.152122
critic loss:	3.597556
Episode 2500: Going Reward = 15.8: Std = 80.9
actor loss:	0.087542
critic loss:	3.382423
Episode 2600: Going Reward = 74.5: Std = 100.2
actor loss:	-0.516331
critic loss:	3.720362
Episode 2700: Going Reward = 142.6: Std = 88.2
actor loss:	-0.080063
critic loss:	3.385840
Episode 2800: Going Reward = 161.4: Std = 82.9
actor loss:	0.094017
critic loss:	2.983063
Episode 2900: Going Reward = 195.1: Std = 90.8


In [None]:
train(3000)
torch.save(agent.actor.state_dict(), 'ppo_actor_model.pth')
torch.save(agent.critic.state_dict(), 'ppo_critic_model.pth')
save_memory_to_json(agent.memory, 'ppo_memory.json')



actor loss:	-0.395713
critic loss:	3.331043
Episode 3000: Going Reward = 229.6: Std = 0.0
actor loss:	-2.279335
critic loss:	3.406750
Episode 3100: Going Reward = 347.7: Std = 138.2
actor loss:	-0.758666
critic loss:	2.130825
Episode 3200: Going Reward = 473.7: Std = 184.0
actor loss:	-0.483033
critic loss:	1.264937
Episode 3300: Going Reward = 527.4: Std = 208.6
actor loss:	-0.471942
critic loss:	1.295983
Episode 3400: Going Reward = 682.0: Std = 264.7
actor loss:	-0.822821
critic loss:	1.539512
Episode 3500: Going Reward = 732.7: Std = 292.5
actor loss:	-0.087698
critic loss:	1.324136
Episode 3600: Going Reward = 694.2: Std = 258.3
actor loss:	-1.231445
critic loss:	2.242342
Episode 3700: Going Reward = 691.8: Std = 309.5
actor loss:	-1.929214
critic loss:	2.068665
Episode 3800: Going Reward = 776.3: Std = 384.7
actor loss:	0.694523
critic loss:	2.673688
Episode 3900: Going Reward = 830.6: Std = 309.7


In [None]:
train(4000)
torch.save(agent.actor.state_dict(), 'ppo_actor_model.pth')
torch.save(agent.critic.state_dict(), 'ppo_critic_model.pth')
save_memory_to_json(agent.memory, 'ppo_memory.json')



actor loss:	-1.671080
critic loss:	2.057370
Episode 4000: Going Reward = 500.4: Std = 0.0
actor loss:	-1.385398
critic loss:	3.008119
Episode 4100: Going Reward = 669.1: Std = 334.9
actor loss:	0.337063
critic loss:	1.327038
Episode 4200: Going Reward = 835.1: Std = 347.7
actor loss:	0.888704
critic loss:	1.742947
Episode 4300: Going Reward = 916.4: Std = 372.6
actor loss:	-1.146650
critic loss:	1.326077
Episode 4400: Going Reward = 1012.5: Std = 402.8
actor loss:	-0.232475
critic loss:	1.032105
Episode 4500: Going Reward = 892.3: Std = 438.9
actor loss:	0.082277
critic loss:	1.821228
Episode 4600: Going Reward = 1009.5: Std = 454.8
actor loss:	-1.130518
critic loss:	2.331015
Episode 4700: Going Reward = 971.3: Std = 460.4
actor loss:	0.863146
critic loss:	1.501559
Episode 4800: Going Reward = 1133.0: Std = 455.5
actor loss:	-1.407602
critic loss:	2.113835
Episode 4900: Going Reward = 1326.8: Std = 417.2


In [None]:
train(5000)
torch.save(agent.actor.state_dict(), 'ppo_actor_model.pth')
torch.save(agent.critic.state_dict(), 'ppo_critic_model.pth')
save_memory_to_json(agent.memory, 'ppo_memory.json')



actor loss:	-1.578427
critic loss:	1.781631
Episode 5000: Going Reward = 918.5: Std = 0.0
actor loss:	-1.259697
critic loss:	1.872303
Episode 5100: Going Reward = 1073.0: Std = 444.1
actor loss:	0.932035
critic loss:	1.534477
Episode 5200: Going Reward = 1178.9: Std = 519.7
actor loss:	-0.971309
critic loss:	1.401790
Episode 5300: Going Reward = 1019.9: Std = 480.1
actor loss:	1.742777
critic loss:	2.496648
Episode 5400: Going Reward = 1369.3: Std = 508.0
actor loss:	-0.345199
critic loss:	1.400091
Episode 5500: Going Reward = 1218.4: Std = 507.4
actor loss:	0.496828
critic loss:	1.235398
Episode 5600: Going Reward = 1482.0: Std = 567.7
actor loss:	0.388180
critic loss:	1.264069
Episode 5700: Going Reward = 1474.2: Std = 568.5
actor loss:	0.200110
critic loss:	0.993266
Episode 5800: Going Reward = 1619.8: Std = 449.6
actor loss:	0.489242
critic loss:	1.335235
Episode 5900: Going Reward = 1496.7: Std = 518.6


In [None]:
train(6000)
torch.save(agent.actor.state_dict(), 'ppo_actor_model.pth')
torch.save(agent.critic.state_dict(), 'ppo_critic_model.pth')
save_memory_to_json(agent.memory, 'ppo_memory.json')



actor loss:	-1.222552
critic loss:	1.781541
Episode 6000: Going Reward = 809.5: Std = 0.0
actor loss:	0.455338
critic loss:	1.459432
Episode 6100: Going Reward = 1459.1: Std = 648.3
actor loss:	-0.565222
critic loss:	0.987756
Episode 6200: Going Reward = 1568.9: Std = 590.8
actor loss:	-1.354901
critic loss:	1.587460
Episode 6300: Going Reward = 1582.4: Std = 616.7
actor loss:	0.903273
critic loss:	3.155771
Episode 6400: Going Reward = 1701.0: Std = 608.6
actor loss:	0.025149
critic loss:	1.079531
Episode 6500: Going Reward = 1683.5: Std = 698.2
actor loss:	-0.440988
critic loss:	1.226930
Episode 6600: Going Reward = 1430.6: Std = 619.8
actor loss:	0.893528
critic loss:	0.931975
Episode 6700: Going Reward = 1267.8: Std = 684.1
actor loss:	0.429397
critic loss:	1.273149
Episode 6800: Going Reward = 1445.0: Std = 746.8
actor loss:	-1.618469
critic loss:	1.556376
Episode 6900: Going Reward = 1672.7: Std = 722.8


In [None]:
train(7000)
torch.save(agent.actor.state_dict(), 'ppo_actor_model.pth')
torch.save(agent.critic.state_dict(), 'ppo_critic_model.pth')
save_memory_to_json(agent.memory, 'ppo_memory.json')



actor loss:	-0.959732
critic loss:	1.298393
Episode 7000: Going Reward = 2063.7: Std = 0.0
actor loss:	-2.246874
critic loss:	2.281144
Episode 7100: Going Reward = 1627.1: Std = 674.0
actor loss:	2.045805
critic loss:	4.384126
Episode 7200: Going Reward = 1551.1: Std = 699.0
actor loss:	-0.825182
critic loss:	0.957634
Episode 7300: Going Reward = 1565.0: Std = 730.2
actor loss:	-1.767600
critic loss:	2.410198
Episode 7400: Going Reward = 1668.3: Std = 733.4
actor loss:	0.711776
critic loss:	1.501264
Episode 7500: Going Reward = 1562.3: Std = 825.8
actor loss:	-0.290187
critic loss:	1.024779
Episode 7600: Going Reward = 1537.6: Std = 814.1
actor loss:	-1.976377
critic loss:	2.210938
Episode 7700: Going Reward = 1725.1: Std = 655.0
actor loss:	-0.866393
critic loss:	1.010949
Episode 7800: Going Reward = 1770.4: Std = 654.4
actor loss:	0.691232
critic loss:	2.354581
Episode 7900: Going Reward = 1815.3: Std = 769.4


In [None]:
train(8000)
torch.save(agent.actor.state_dict(), 'ppo_actor_model.pth')
torch.save(agent.critic.state_dict(), 'ppo_critic_model.pth')
save_memory_to_json(agent.memory, 'ppo_memory.json')



actor loss:	1.083672
critic loss:	1.855007
Episode 8000: Going Reward = 2482.4: Std = 0.0
actor loss:	0.583486
critic loss:	1.664475
Episode 8100: Going Reward = 1991.1: Std = 720.8
actor loss:	-1.552750
critic loss:	1.817653
Episode 8200: Going Reward = 1906.1: Std = 693.6
actor loss:	1.270662
critic loss:	4.174177
Episode 8300: Going Reward = 1887.8: Std = 776.2
actor loss:	0.376112
critic loss:	2.842135
Episode 8400: Going Reward = 2239.4: Std = 575.6
actor loss:	0.673780
critic loss:	1.155876
Episode 8500: Going Reward = 1957.3: Std = 699.7
actor loss:	-1.144424
critic loss:	1.549476
Episode 8600: Going Reward = 1693.5: Std = 764.3
actor loss:	0.852003
critic loss:	2.323795
Episode 8700: Going Reward = 2101.7: Std = 612.3
actor loss:	-1.341768
critic loss:	1.788214
Episode 8800: Going Reward = 1978.1: Std = 828.7
actor loss:	0.284812
critic loss:	1.037826
Episode 8900: Going Reward = 1688.5: Std = 812.6


In [None]:
train(9000)
torch.save(agent.actor.state_dict(), 'ppo_actor_model.pth')
torch.save(agent.critic.state_dict(), 'ppo_critic_model.pth')
save_memory_to_json(agent.memory, 'ppo_memory.json')



actor loss:	1.508298
critic loss:	2.894476
Episode 9000: Going Reward = 2713.7: Std = 0.0
actor loss:	-0.635450
critic loss:	0.936186
Episode 9100: Going Reward = 1900.2: Std = 775.0
actor loss:	-0.337288
critic loss:	1.536267
Episode 9200: Going Reward = 2076.1: Std = 765.0
actor loss:	-0.122129
critic loss:	0.691059
Episode 9300: Going Reward = 1826.3: Std = 875.8
actor loss:	-0.539403
critic loss:	0.813124
Episode 9400: Going Reward = 1886.5: Std = 739.4
actor loss:	-1.249101
critic loss:	0.938969
Episode 9500: Going Reward = 2068.0: Std = 851.0
actor loss:	-1.113430
critic loss:	1.416739
Episode 9600: Going Reward = 2321.6: Std = 684.0
actor loss:	-0.864280
critic loss:	1.049242
Episode 9700: Going Reward = 2279.5: Std = 711.2
actor loss:	0.618948
critic loss:	1.251532
Episode 9800: Going Reward = 2002.4: Std = 865.9
actor loss:	0.052994
critic loss:	1.162389
Episode 9900: Going Reward = 2216.2: Std = 698.8


In [None]:
train(10000)
torch.save(agent.actor.state_dict(), 'ppo_actor_model.pth')
torch.save(agent.critic.state_dict(), 'ppo_critic_model.pth')
save_memory_to_json(agent.memory, 'ppo_memory.json')



actor loss:	1.760358
critic loss:	2.423975
Episode 10000: Going Reward = 2573.0: Std = 0.0
actor loss:	0.125492
critic loss:	1.033644
Episode 10100: Going Reward = 1562.8: Std = 782.3
actor loss:	1.986222
critic loss:	4.218519
Episode 10200: Going Reward = 1958.8: Std = 844.2
actor loss:	0.530093
critic loss:	1.091257
Episode 10300: Going Reward = 2092.0: Std = 793.5
actor loss:	-0.405855
critic loss:	0.872290
Episode 10400: Going Reward = 1957.4: Std = 854.5
actor loss:	0.100350
critic loss:	1.071672
Episode 10500: Going Reward = 1846.3: Std = 811.9
actor loss:	-0.993702
critic loss:	1.366012
Episode 10600: Going Reward = 2169.1: Std = 808.1
actor loss:	-2.357786
critic loss:	2.562377
Episode 10700: Going Reward = 1932.3: Std = 947.9
actor loss:	-1.190859
critic loss:	1.356342
Episode 10800: Going Reward = 2042.8: Std = 783.0
actor loss:	1.471030
critic loss:	4.328387
Episode 10900: Going Reward = 2377.2: Std = 765.4


In [None]:
train(11000)
torch.save(agent.actor.state_dict(), 'ppo_actor_model.pth')
torch.save(agent.critic.state_dict(), 'ppo_critic_model.pth')
save_memory_to_json(agent.memory, 'ppo_memory.json')


actor loss:	-0.918202
critic loss:	0.976145
Episode 11000: Going Reward = 2300.8: Std = 0.0
actor loss:	0.272379
critic loss:	1.499477
Episode 11100: Going Reward = 2107.2: Std = 781.8
actor loss:	0.640745
critic loss:	3.814688
Episode 11200: Going Reward = 2386.3: Std = 767.2
actor loss:	0.127481
critic loss:	2.012447
Episode 11300: Going Reward = 2338.3: Std = 676.7
actor loss:	-0.122785
critic loss:	1.193419
Episode 11400: Going Reward = 2452.1: Std = 700.7
actor loss:	0.943301
critic loss:	4.062940
Episode 11500: Going Reward = 2496.6: Std = 616.5
actor loss:	0.335358
critic loss:	1.395730
Episode 11600: Going Reward = 2446.6: Std = 726.2
actor loss:	-2.523592
critic loss:	2.407605
Episode 11700: Going Reward = 2455.0: Std = 918.4
actor loss:	-0.053031
critic loss:	1.117975
Episode 11800: Going Reward = 2590.1: Std = 711.2
actor loss:	1.090303
critic loss:	4.054928
Episode 11900: Going Reward = 2364.7: Std = 844.3


In [None]:
train(12000)
torch.save(agent.actor.state_dict(), 'ppo_actor_model.pth')
torch.save(agent.critic.state_dict(), 'ppo_critic_model.pth')
save_memory_to_json(agent.memory, 'ppo_memory.json')


actor loss:	-0.204708
critic loss:	1.039251
Episode 12000: Going Reward = 1930.8: Std = 0.0
actor loss:	0.669831
critic loss:	1.640394
Episode 12100: Going Reward = 2676.6: Std = 656.0
actor loss:	0.058755
critic loss:	1.496151
Episode 12200: Going Reward = 2663.0: Std = 783.6
actor loss:	1.393255
critic loss:	1.370451
Episode 12300: Going Reward = 2384.6: Std = 910.4
actor loss:	1.782996
critic loss:	4.615757
Episode 12400: Going Reward = 2538.0: Std = 749.0
actor loss:	1.308171
critic loss:	2.033708
Episode 12500: Going Reward = 2581.7: Std = 717.2
actor loss:	-0.341454
critic loss:	0.724399
Episode 12600: Going Reward = 2025.5: Std = 955.4
actor loss:	-1.562526
critic loss:	1.564011
Episode 12700: Going Reward = 1960.0: Std = 947.5
actor loss:	-0.555066
critic loss:	0.642153
Episode 12800: Going Reward = 2358.7: Std = 792.2
actor loss:	0.487740
critic loss:	0.811773
Episode 12900: Going Reward = 2458.7: Std = 788.4


In [None]:
train(13000)
torch.save(agent.actor.state_dict(), 'ppo_actor_model.pth')
torch.save(agent.critic.state_dict(), 'ppo_critic_model.pth')
save_memory_to_json(agent.memory, 'ppo_memory.json')


actor loss:	-0.063949
critic loss:	1.113895
Episode 13000: Going Reward = 1629.0: Std = 0.0
actor loss:	0.050514
critic loss:	1.043421
Episode 13100: Going Reward = 2259.3: Std = 878.2
actor loss:	0.587449
critic loss:	1.691791
Episode 13200: Going Reward = 2505.5: Std = 865.9
actor loss:	1.220568
critic loss:	4.195149
Episode 13300: Going Reward = 2438.6: Std = 946.0
actor loss:	-0.046315
critic loss:	0.992700
Episode 13400: Going Reward = 2615.4: Std = 791.2
actor loss:	-0.317873
critic loss:	0.809228
Episode 13500: Going Reward = 2518.7: Std = 859.8
actor loss:	0.977924
critic loss:	4.129565
Episode 13600: Going Reward = 2821.2: Std = 648.2
actor loss:	0.838940
critic loss:	4.083984
Episode 13700: Going Reward = 2850.8: Std = 523.1
actor loss:	0.829072
critic loss:	4.113334
Episode 13800: Going Reward = 2655.0: Std = 787.3
actor loss:	0.715934
critic loss:	1.253542
Episode 13900: Going Reward = 2766.9: Std = 666.2


In [None]:
train(14000)
torch.save(agent.actor.state_dict(), 'ppo_actor_model.pth')
torch.save(agent.critic.state_dict(), 'ppo_critic_model.pth')
save_memory_to_json(agent.memory, 'ppo_memory.json')


actor loss:	-1.305408
critic loss:	0.906713
Episode 14000: Going Reward = 439.3: Std = 0.0
actor loss:	0.682399
critic loss:	1.280999
Episode 14100: Going Reward = 2447.6: Std = 968.5
actor loss:	1.422424
critic loss:	1.734107
Episode 14200: Going Reward = 1982.5: Std = 1005.1
actor loss:	-1.101760
critic loss:	1.338134
Episode 14300: Going Reward = 2950.8: Std = 509.4
actor loss:	-0.810579
critic loss:	1.331666
Episode 14400: Going Reward = 2648.5: Std = 864.1
actor loss:	-0.147816
critic loss:	1.173295
Episode 14500: Going Reward = 3006.6: Std = 473.3
actor loss:	0.827362
critic loss:	3.939340
Episode 14600: Going Reward = 3057.7: Std = 376.1
actor loss:	-2.376897
critic loss:	2.001214
Episode 14700: Going Reward = 2879.7: Std = 754.8
actor loss:	-1.196871
critic loss:	1.441862
Episode 14800: Going Reward = 2655.2: Std = 964.9
actor loss:	-0.405367
critic loss:	0.837522
Episode 14900: Going Reward = 2620.1: Std = 879.8


In [None]:
train(15000)
torch.save(agent.actor.state_dict(), 'ppo_actor_model.pth')
torch.save(agent.critic.state_dict(), 'ppo_critic_model.pth')
save_memory_to_json(agent.memory, 'ppo_memory.json')


actor loss:	-1.610340
critic loss:	1.548312
Episode 15000: Going Reward = 3216.6: Std = 0.0
actor loss:	1.148691
critic loss:	4.239946
Episode 15100: Going Reward = 2617.7: Std = 783.9
actor loss:	-0.712530
critic loss:	1.083287
Episode 15200: Going Reward = 3015.6: Std = 510.5
actor loss:	-1.053042
critic loss:	1.327524
Episode 15300: Going Reward = 2490.6: Std = 1010.5
actor loss:	-0.943783
critic loss:	1.034032
Episode 15400: Going Reward = 2500.1: Std = 955.5
actor loss:	-0.419472
critic loss:	0.716385
Episode 15500: Going Reward = 2298.3: Std = 1002.8
actor loss:	-1.781363
critic loss:	1.733323
Episode 15600: Going Reward = 2697.2: Std = 988.5
actor loss:	-1.916803
critic loss:	1.806646
Episode 15700: Going Reward = 2890.7: Std = 758.7
actor loss:	0.636459
critic loss:	4.010274
Episode 15800: Going Reward = 2872.7: Std = 764.7
actor loss:	-0.901222
critic loss:	1.569181
Episode 15900: Going Reward = 2836.8: Std = 790.3


In [None]:
train(16000)
torch.save(agent.actor.state_dict(), 'ppo_actor_model.pth')
torch.save(agent.critic.state_dict(), 'ppo_critic_model.pth')
save_memory_to_json(agent.memory, 'ppo_memory.json')


actor loss:	0.387231
critic loss:	3.935620
Episode 16000: Going Reward = 3429.7: Std = 0.0
actor loss:	0.764429
critic loss:	2.818906
Episode 16100: Going Reward = 3089.0: Std = 596.9
actor loss:	1.454127
critic loss:	1.661459
Episode 16200: Going Reward = 2518.1: Std = 977.8
actor loss:	-1.858835
critic loss:	1.918456
Episode 16300: Going Reward = 2471.8: Std = 936.2
actor loss:	1.167719
critic loss:	1.295110
Episode 16400: Going Reward = 2681.8: Std = 892.0
actor loss:	0.255375
critic loss:	1.532980
Episode 16500: Going Reward = 2643.7: Std = 893.4
actor loss:	-0.986740
critic loss:	2.166893
Episode 16600: Going Reward = 2458.8: Std = 1064.5
actor loss:	1.645145
critic loss:	4.492075
Episode 16700: Going Reward = 2184.7: Std = 955.8
actor loss:	-0.583145
critic loss:	1.329617
Episode 16800: Going Reward = 2710.9: Std = 885.6
actor loss:	-1.503880
critic loss:	1.850412
Episode 16900: Going Reward = 2850.1: Std = 757.1


In [None]:
train(17000)
torch.save(agent.actor.state_dict(), 'ppo_actor_model.pth')
torch.save(agent.critic.state_dict(), 'ppo_critic_model.pth')
save_memory_to_json(agent.memory, 'ppo_memory.json')


actor loss:	0.371530
critic loss:	0.708143
Episode 17000: Going Reward = 1023.5: Std = 0.0
actor loss:	-0.104432
critic loss:	1.602578
Episode 17100: Going Reward = 2590.1: Std = 1031.4
actor loss:	-0.467307
critic loss:	1.344976
Episode 17200: Going Reward = 3033.0: Std = 706.3
actor loss:	1.882612
critic loss:	4.741973
Episode 17300: Going Reward = 2557.9: Std = 930.2
actor loss:	0.665773
critic loss:	4.094248
Episode 17400: Going Reward = 3058.0: Std = 674.8
actor loss:	0.034847
critic loss:	1.339307
Episode 17500: Going Reward = 2910.2: Std = 742.5
actor loss:	-1.703453
critic loss:	1.257346
Episode 17600: Going Reward = 2793.4: Std = 941.1
actor loss:	0.389718
critic loss:	1.591747
Episode 17700: Going Reward = 2895.0: Std = 634.8
actor loss:	-0.300445
critic loss:	1.401207
Episode 17800: Going Reward = 2736.7: Std = 797.5
actor loss:	-1.868326
critic loss:	2.104671
Episode 17900: Going Reward = 3034.1: Std = 392.2


In [None]:
train(18000)
torch.save(agent.actor.state_dict(), 'ppo_actor_model.pth')
torch.save(agent.critic.state_dict(), 'ppo_critic_model.pth')
save_memory_to_json(agent.memory, 'ppo_memory.json')


actor loss:	-0.930445
critic loss:	1.399592
Episode 18000: Going Reward = 2790.3: Std = 0.0
actor loss:	-1.071731
critic loss:	1.451731
Episode 18100: Going Reward = 3209.3: Std = 565.4
actor loss:	0.339464
critic loss:	3.963279
Episode 18200: Going Reward = 3089.6: Std = 664.9
actor loss:	0.401374
critic loss:	4.003474
Episode 18300: Going Reward = 3051.4: Std = 708.7
actor loss:	-1.638298
critic loss:	1.063118
Episode 18400: Going Reward = 2816.9: Std = 941.1
actor loss:	0.091935
critic loss:	1.007602
Episode 18500: Going Reward = 3113.9: Std = 778.7
actor loss:	-0.884976
critic loss:	1.072968
Episode 18600: Going Reward = 2867.8: Std = 875.7
actor loss:	-0.737758
critic loss:	1.291255
Episode 18700: Going Reward = 2964.3: Std = 926.6
actor loss:	1.118991
critic loss:	4.227972
Episode 18800: Going Reward = 3133.4: Std = 769.0
actor loss:	-1.475305
critic loss:	1.145234
Episode 18900: Going Reward = 3069.3: Std = 745.7


In [None]:
train(19000)
torch.save(agent.actor.state_dict(), 'ppo_actor_model.pth')
torch.save(agent.critic.state_dict(), 'ppo_critic_model.pth')
save_memory_to_json(agent.memory, 'ppo_memory.json')


actor loss:	0.248218
critic loss:	1.981068
Episode 19000: Going Reward = 3321.2: Std = 0.0
actor loss:	0.649829
critic loss:	1.012041
Episode 19100: Going Reward = 2414.3: Std = 937.5
actor loss:	-0.696635
critic loss:	0.714832
Episode 19200: Going Reward = 2539.3: Std = 1002.8
actor loss:	0.608715
critic loss:	2.637791
Episode 19300: Going Reward = 2670.1: Std = 968.8
actor loss:	1.398447
critic loss:	4.321323
Episode 19400: Going Reward = 2882.7: Std = 813.2
actor loss:	0.115981
critic loss:	0.772255
Episode 19500: Going Reward = 3287.9: Std = 452.4
actor loss:	-1.667769
critic loss:	1.454974
Episode 19600: Going Reward = 3163.1: Std = 596.7
actor loss:	-0.198728
critic loss:	0.608767
Episode 19700: Going Reward = 2801.4: Std = 952.6
actor loss:	0.158064
critic loss:	1.902664
Episode 19800: Going Reward = 2905.7: Std = 907.8
actor loss:	0.602219
critic loss:	1.107381
Episode 19900: Going Reward = 2405.2: Std = 1172.4


In [None]:
train(20000)
torch.save(agent.actor.state_dict(), 'ppo_actor_model.pth')
torch.save(agent.critic.state_dict(), 'ppo_critic_model.pth')
save_memory_to_json(agent.memory, 'ppo_memory.json')


actor loss:	1.883685
critic loss:	4.836060
Episode 20000: Going Reward = 3513.2: Std = 0.0
actor loss:	0.993683
critic loss:	3.655667
Episode 20100: Going Reward = 2925.3: Std = 854.5
actor loss:	0.732522
critic loss:	1.575552
Episode 20200: Going Reward = 2869.8: Std = 843.1
actor loss:	0.144057
critic loss:	1.685840
Episode 20300: Going Reward = 3180.6: Std = 606.8
actor loss:	-2.664516
critic loss:	2.610653
Episode 20400: Going Reward = 3008.5: Std = 850.6
actor loss:	0.540946
critic loss:	3.883887
Episode 20500: Going Reward = 2927.6: Std = 892.4
actor loss:	-3.112492
critic loss:	3.293306
Episode 20600: Going Reward = 3266.3: Std = 460.6
actor loss:	1.050787
critic loss:	4.130845
Episode 20700: Going Reward = 3146.2: Std = 828.5
actor loss:	0.812542
critic loss:	3.972325
Episode 20800: Going Reward = 3345.3: Std = 415.1
actor loss:	0.467937
critic loss:	3.965305
Episode 20900: Going Reward = 3071.7: Std = 727.5
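
The cells above repeat one pattern: train a block of episodes, then checkpoint the actor, the critic, and the memory. Below is a minimal sketch of folding that pattern into a single loop. It assumes, as the logs suggest, that `train(start)` runs one block of 1000 episodes beginning at `start`. The names `run_in_chunks`, `train_fn`, and `save_fn` are hypothetical and not part of the notebook.

```python
from typing import Callable, List


def run_in_chunks(
    train_fn: Callable[[int], None],
    save_fn: Callable[[], None],
    start: int,
    stop: int,
    step: int = 1000,
) -> List[int]:
    """Call train_fn(chunk_start) for each chunk and checkpoint after each one.

    Returns the list of chunk start indices that were trained.
    """
    done = []
    for chunk_start in range(start, stop, step):
        # Assumption: train_fn trains `step` episodes starting at chunk_start.
        train_fn(chunk_start)
        # Checkpoint after every chunk so an interrupted session loses at most one chunk.
        save_fn()
        done.append(chunk_start)
    return done
```

In this notebook, `train_fn` would be `train`, and `save_fn` would be a small function wrapping the two `torch.save` calls and `save_memory_to_json`. Under those assumptions, `run_in_chunks(train, save_checkpoint, 9000, 21000)` would cover the same range as the cells above, where `save_checkpoint` is again a hypothetical helper.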


In [None]:
agent.memory = load_memory_from_json('ppo_memory.json')
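# map_location lets checkpoints saved on a GPU load on a CPU-only machine (and vice versa)
agent.actor.load_state_dict(torch.load('ppo_actor_model.pth', map_location=device))
agent.critic.load_state_dict(torch.load('ppo_critic_model.pth', map_location=device))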

<All keys matched successfully>

## Evaluation **(5 points)**

Evaluate the trained policy on the environment. Calculate the cumulative reward and display the video of the trajectory.

In [57]:
env = gym.make("HalfCheetah-v4", render_mode="rgb_array")
state, _ = env.reset()
frames = []

total_reward = 0
# Run the trained policy for one full episode, recording a rendered frame at each step
while True:
    action = agent.select_action(state, False)
    next_state, reward, terminated, truncated, info = env.step(action)
    frames.append(env.render())
    total_reward += reward
    state = next_state

    if terminated or truncated:
        break

env.close()
print(f'Total Reward: {total_reward}')

imageio.mimsave('./eval_ppo.mp4', frames, fps=25)
show_video('./eval_ppo.mp4')

Total Reward: 3593.877640595171
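
A single rollout is a noisy estimate of policy quality: the training logs above report a standard deviation of several hundred across episodes. Below is a minimal sketch of averaging the return over several evaluation episodes. It assumes only the Gymnasium `reset`/`step` interface and a `select_action(state)` callable. The name `evaluate_policy` and its parameters are hypothetical and not part of the assignment code.

```python
import numpy as np


def evaluate_policy(env, select_action, n_episodes=10, seed=0):
    """Roll out `select_action` for n_episodes; return (mean, std, per-episode returns)."""
    returns = []
    for ep in range(n_episodes):
        # A different seed per episode gives varied but reproducible start states.
        state, _ = env.reset(seed=seed + ep)
        total, done = 0.0, False
        while not done:
            action = select_action(state)
            state, reward, terminated, truncated, _ = env.step(action)
            total += float(reward)
            done = terminated or truncated
        returns.append(total)
    returns = np.asarray(returns)
    return returns.mean(), returns.std(), returns
```

With the objects from the cell above, a call might look like `evaluate_policy(env, lambda s: agent.select_action(s, False))`, mirroring the `select_action` call used in the single-episode loop. Create a fresh environment first, because the cell above closes `env`.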
