### A2C Pendulum Training Continuous Control 

#### A3C
[Mnih, Volodymyr, et al. "Asynchronous methods for deep reinforcement learning." International conference on machine learning. 2016.](http://proceedings.mlr.press/v48/mniha16.pdf)

### **Advantage Actor-Critic (A2C)**

The **Advantage Actor-Critic (A2C)** algorithm is a popular method for training reinforcement learning agents, including in continuous action spaces. A2C is an *on-policy* policy optimization algorithm that learns a policy to maximize the expected return by updating the policy parameters ($\theta$) using gradient ascent. The method trains two components together:
1. **The Actor**: Learns the policy by predicting action probabilities (or mean actions in continuous spaces).
2. **The Critic**: Estimates the value function to reduce variance in policy updates.

---

### **Policy Gradient in Continuous Spaces**

For a continuous action space, the policy is parameterized as a Gaussian distribution:

$$ \pi_\theta(a_t | s_t) = \mathcal{N}(\mu_\theta(s_t), \sigma_\theta^2) $$

where:
- $\mu_\theta(s_t)$: Mean action predicted by the actor.
- $\sigma_\theta$: Standard Deviation, either fixed or learned as a separate parameter.

The policy gradient to maximize the expected return $\mathcal{J}(\pi_\theta)$ is:

$$ \nabla_\theta \mathcal{J}(\pi_\theta) = \mathbb{E}_{\tau \sim \pi_\theta} \left[ \sum_{t=0}^T \nabla_\theta \log \pi_\theta(a_t | s_t) A^{\pi_\theta}(s_t, a_t) \right] $$

where:
- $\tau$ : A trajectory sampled from the policy.
- $A^{\pi_\theta}(s_t, a_t) = R_t - V(s_t)$ : Advantage function, which reduces variance in gradient updates.

---

#### **Critic Update**

The critic learns the value function $V(s_t; w)$ by minimizing the **mean squared error (MSE)** between the predicted value and the observed return $(R_t)$:


$$ L_{\text{critic}}(w) = \mathbb{E}\left[\left(R_t - V(s_t; w)\right)^2\right] $$

The return $(R_t)$ is computed using the bootstrapped target:

$$ R_t = r_t + \gamma V(s_{t+1}; w) $$

where $\gamma$ is the discount factor. The gradient update for the critic is:

$$ w_{k+1} = w_k - \beta \nabla_w L_{\text{critic}}(w_k) $$

where $\beta$ is the learning rate for the critic.
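
As a minimal sketch of the update above, the following plain-Python snippet performs one gradient step for a single scalar value estimate. The function name `critic_step` and the numbers are illustrative assumptions, not part of the notebook; the bootstrapped target is treated as a constant, as is usual for TD targets.

```python
def critic_step(v_s, v_next, reward, gamma=0.99, beta=0.1):
    """One gradient step on (R_t - V(s_t))^2 for a scalar value estimate (hypothetical helper)."""
    # Bootstrapped target R_t = r_t + gamma * V(s_{t+1}); no gradient flows through it
    target = reward + gamma * v_next
    # d/dV(s_t) of (R_t - V(s_t))^2
    grad = -2.0 * (target - v_s)
    # Gradient descent with critic learning rate beta
    return v_s - beta * grad, target


v_new, target = critic_step(v_s=0.0, v_next=1.0, reward=-1.0)
print(target, v_new)  # the estimate moves from 0.0 toward the target
```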

---

#### **Actor Update**

The actor updates the policy by adjusting the mean $\mu_\theta$ to increase the likelihood of actions weighted by their advantage:

$$ \nabla_\theta L_{\text{policy}} = - \sum_t \nabla_\theta \log \pi_\theta(a_t | s_t) A(s_t, a_t) $$

where the advantage function is approximated as:

$$ A(s_t, a_t) = R_t - V(s_t; w) $$

For a Gaussian policy:

$$ \log \pi_\theta(a_t | s_t) = -\frac{1}{2} \left( \frac{(a_t - \mu_\theta(s_t))^2}{\sigma^2} + \log(2\pi\sigma^2) \right) $$


Gradients are propagated through both the mean and variance (if trainable), enabling the actor to learn an optimal policy.
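
The Gaussian log-probability above can be checked with a few lines of plain Python. This is a sketch for a one-dimensional action; the helper names `gaussian_log_prob` and `grad_log_prob_mu` are assumptions introduced for illustration. The second function is the analytic gradient with respect to the mean, $(a_t - \mu)/\sigma^2$, which is the quantity the advantage scales in the actor update.

```python
import math


def gaussian_log_prob(a, mu, sigma):
    """log N(a; mu, sigma^2) for a scalar action."""
    return -0.5 * ((a - mu) ** 2 / sigma ** 2 + math.log(2 * math.pi * sigma ** 2))


def grad_log_prob_mu(a, mu, sigma):
    """Analytic derivative of the log-probability with respect to the mean."""
    return (a - mu) / sigma ** 2


# Compare the analytic gradient with a central finite difference
a, mu, sigma, h = 1.0, 0.2, 0.5, 1e-5
numeric = (gaussian_log_prob(a, mu + h, sigma) - gaussian_log_prob(a, mu - h, sigma)) / (2 * h)
print(grad_log_prob_mu(a, mu, sigma), numeric)
```

An action above the mean gives a positive gradient, so a positive advantage pushes the mean toward that action.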

---

### **Entropy Regularization**

To encourage exploration and prevent premature convergence to deterministic policies, **entropy regularization** is added to the policy loss:

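$$ L_{\text{entropy}} = - \sum_t \mathbb{E}_{a \sim \pi_\theta} \left[ \log \pi_\theta(a | s_t) \right] = \sum_t \mathcal{H}\left(\pi_\theta(\cdot | s_t)\right) $$

This term is the policy entropy; it is scaled by an entropy coefficient ($c_2$ in the combined loss below). Higher entropy encourages broader exploration of the action space.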
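
For a Gaussian policy the entropy has a closed form that depends only on the standard deviation: $\mathcal{H} = \tfrac{1}{2}\log(2\pi e \sigma^2)$. The sketch below (the function name `gaussian_entropy` is an assumption for illustration) evaluates it in plain Python. With $\sigma = 1$ it gives about 1.419; the training log further down reports an entropy loss near -1.426, which would correspond to a log-std of roughly 0.007, still close to its initial value of 0.

```python
import math


def gaussian_entropy(sigma):
    """Differential entropy of a 1-D Gaussian with standard deviation sigma."""
    return 0.5 * math.log(2 * math.pi * math.e * sigma ** 2)


for sigma in (0.5, 1.0, 2.0):
    # A wider policy has higher entropy, i.e. explores more
    print(sigma, gaussian_entropy(sigma))
```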

---

### **Combined Loss Function**

The overall loss for A2C combines the policy loss, value loss, and entropy regularization:

$$ L = L_{\text{policy}} + c_1 L_{\text{value}} - c_2 L_{\text{entropy}} $$

where:
- $c_1$: Weight for the value loss $L_{\text{value}}$ (the critic loss $L_{\text{critic}}$ defined above).
- $c_2$: Weight for the entropy regularization.
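
As a quick arithmetic check, the combined loss can be reproduced from the three components printed in the training log later in this notebook (update 100). The sketch assumes the default `vf_coef=0.5` and the initial `ent_coef=0.01`, and follows the sign convention of the `learn` method, where `entropy_loss` is the negative mean entropy and is therefore added rather than subtracted. The function name `combined_loss` is hypothetical.

```python
def combined_loss(policy_loss, value_loss, entropy_loss, vf_coef=0.5, ent_coef=0.01):
    """Total A2C loss; entropy_loss is the negative mean entropy, so adding it rewards entropy."""
    return policy_loss + vf_coef * value_loss + ent_coef * entropy_loss


# Components taken from the "Updates 100" line of the training log
total = combined_loss(policy_loss=-0.0177, value_loss=641.7707, entropy_loss=-1.4260)
print(f"{total:.4f}")  # 320.8534, the logged total loss
```

The value term dominates the total by several orders of magnitude here, which is worth keeping in mind when reading the loss curve.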



### Load dependencies for Env

In [58]:
import os
import datetime

import numpy as np 
import gymnasium as gym


env = gym.make("Pendulum-v1")
observation, info = env.reset()
print(observation.shape)
print(env.action_space)
print(env.observation_space)
print(env.observation_space.shape[-1])

(3,)
Box(-2.0, 2.0, (1,), float32)
Box([-1. -1. -8.], [1. 1. 8.], (3,), float32)
3


## Define Model, Optimizer, Loss with Torch


In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.distributions import Normal

from enum import Enum


# Prioritize device: CUDA > MPS > CPU
if torch.cuda.is_available():
    DEVICE = torch.device("cuda")
    print("CUDA is available. Using CUDA.")
elif torch.backends.mps.is_available():
    DEVICE = torch.device("mps")
    print("MPS backend is available. Using MPS.")
else:
    DEVICE = torch.device("cpu")
    print("Neither CUDA nor MPS is available. Using CPU.")

class Mode(Enum):
    COMBINED = "combined"
    SEPARATE = "separate"

class SimpleActorCriticCombined(nn.Module):

    def __init__(self, state_size, action_size, fc_units=64):
        """Initialize parameters and build model."""
        super(SimpleActorCriticCombined, self).__init__()

        # Shared body for the actor and the critic: in our case 64 ReLU units
        self.fc1 = nn.Linear(state_size, fc_units)

        # Actor head
        self.fc_actor = nn.Linear(fc_units, fc_units)

        # Actor Head presented in form of mean modeled by linear layer and log(std) to use Gaussian Distribution
        self.fc_mean = nn.Linear(fc_units, action_size) # fully connected 
        self.log_std = nn.Parameter(torch.zeros(action_size)) # standard initialization std = exp(log_std) = exp(0) = 1
       

        # Critic Head
        self.fc_critic = nn.Linear(fc_units, fc_units)
        self.fc_critic_out = nn.Linear(fc_units, 1)
    

    def forward(self, state):
        """Forward method implementation."""
        x = F.relu(self.fc1(state))
        
        value = self.fc_critic_out(F.relu(self.fc_critic(x)))
        
        action_mean = self.fc_mean(F.relu(self.fc_actor(x)))
        action_mean = torch.tanh(action_mean)*2.0 #lets make it between -2,2
        
        # action_mean = self.fc_mean(x) 
        action_std = self.log_std.exp() # Convert log-std to std
        distribution = Normal(action_mean, action_std)
        # sample() does not propagate gradients through the action, which is what A2C needs.
        # Use rsample() only if gradients must flow through sampled actions, e.g. in
        # SAC (off-policy RL), VAEs, or auxiliary losses involving actions.
        action = distribution.sample()
        log_prob = distribution.log_prob(action)
        entropy = distribution.entropy()

        return value, action, log_prob, entropy, action_mean
    
# Testing similar logic to the one presented by iKorotkov, with two completely separate networks for the critic and the actor
class SimpleActorCriticSeparate(nn.Module):

    def __init__(self, state_size, action_size, fc_units=64):
        """Initialize parameters and build model."""
        super(SimpleActorCriticSeparate, self).__init__()

        # Actor head
        self.fc_actor_0 = nn.Linear(state_size, fc_units)
        self.fc_actor_1 = nn.Linear(fc_units, fc_units)

        # Actor Head presented in form of mean modeled by linear layer and log(std) to use Gaussian Distribution
        self.fc_mean = nn.Linear(fc_units, action_size) # fully connected 
        self.log_std = nn.Parameter(torch.zeros(action_size)) # standard initialization std = exp(log_std) = exp(0) = 1
        
        # Critic Head
        self.fc_critic_0 = nn.Linear(state_size, fc_units)
        self.fc_critic_1 = nn.Linear(fc_units, fc_units)
        self.fc_critic_out = nn.Linear(fc_units, 1)
    

    def forward(self, state):
        """Forward method implementation."""
        
        value_x = F.relu(self.fc_critic_0(state))
        value_x = F.relu(self.fc_critic_1(value_x))
        value = self.fc_critic_out(value_x)
        
        actor_x = F.relu(self.fc_actor_0(state))
        actor_x = F.relu(self.fc_actor_1(actor_x))
        action_mean = self.fc_mean(actor_x)  # Unbounded mean
 
        action_std = self.log_std.exp() # Convert log-std to std
        distribution = Normal(action_mean, action_std)

        action_raw = distribution.sample()
        action = torch.tanh(action_raw) * 2.0 # Scale to [-2, 2]
        log_prob = distribution.log_prob(action_raw) - torch.log(torch.tensor(2.0)) - torch.log(1 - torch.tanh(action_raw).pow(2) + 1e-6)
        # log_prob = log_prob.sum(dim=-1)  # Sum over action dims if needed
        entropy = distribution.entropy()

        return value, action, log_prob, entropy, action_mean

MPS backend is available. Using MPS.
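
The log-probability correction in `SimpleActorCriticSeparate.forward` is the change-of-variables formula for $a = 2\tanh(u)$ with $u \sim \mathcal{N}(\mu, \sigma^2)$: $\log p(a) = \log p(u) - \log 2 - \log(1 - \tanh^2 u)$. The following plain-Python sketch (helper names are assumptions, not notebook code) checks numerically that the resulting density over $a \in (-2, 2)$ integrates to about 1, using a simple midpoint rule.

```python
import math


def normal_log_prob(u, mu, sigma):
    """log N(u; mu, sigma^2) for a scalar."""
    return -0.5 * ((u - mu) ** 2 / sigma ** 2 + math.log(2 * math.pi * sigma ** 2))


def squashed_log_prob(u, mu, sigma, scale=2.0, eps=1e-6):
    """Log-density of a = scale * tanh(u); eps mirrors the stabilizer used in the notebook."""
    return normal_log_prob(u, mu, sigma) - math.log(scale) - math.log(1.0 - math.tanh(u) ** 2 + eps)


def total_probability(mu=0.0, sigma=1.0, scale=2.0, bins=20000):
    """Midpoint-rule integral of the squashed density over (-scale, scale)."""
    width = 2 * scale / bins
    total = 0.0
    for i in range(bins):
        a = -scale + (i + 0.5) * width   # midpoint of the i-th bin in action space
        u = math.atanh(a / scale)        # invert the squashing
        total += math.exp(squashed_log_prob(u, mu, sigma, scale)) * width
    return total


print(total_probability())  # close to 1.0
```

Without the two correction terms the integral over the bounded action range would not be 1, so the log-probabilities used in the policy loss would belong to a different distribution than the one the actions are drawn from.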


## Rollout Buffer/Storage

For $n$-step Actor-Critic, the rollout storage collects and stores the following components for each step $t$ in a batch of $N$ parallel environments:
- states/observations $S_{t}$
- value estimations $V(s_{t})$
- rewards $r_{t}$
- actions $a_{t}$
- action log probabilities $\log \pi(a_{t}|s_{t}, {\theta})$
- mask $m_{t}$ - a binary mask indicating whether the environment is still active (1 if active, 0 once the episode has ended, by termination or truncation)
- truncates $tr_{t}$ - a binary mask indicating whether the environment was truncated by the time limit (**important: the reward must then be adjusted to include the next state value**)

At each step $t$, for environment $b$, the collected data is:

$$ \{S_{t,b}, a_{t,b}, r_{t+1,b}, V(s_{t,b}), \log \pi(a_{t,b}|s_{t,b}, {\theta}), m_{t+1,b}, tr_{t+1,b}  \} $$

This is collected iteratively for $n$ timesteps across $N$ environments, then used for gradient updates in the Actor-Critic framework.
    
Additional important hyperparameters:
- number of steps $n$ for the $TD(n)$ error method
- number of parallel environments executed 
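
The difference between the mask and the truncation flag can be sketched with a one-step target in plain Python. The function name `td_target` is an assumption; the arithmetic mirrors what `compute_returns_and_advantages` below does per step. This matters for Pendulum, whose episodes never terminate and only end by the 200-step time limit.

```python
def td_target(reward, next_value, terminated, truncated, gamma=0.99):
    """One-step target r + gamma * V(s') with separate handling of termination and truncation."""
    # Mask is 0 whenever the episode ended, for either reason
    mask = 0.0 if (terminated or truncated) else 1.0
    # On truncation the next-state value is folded into the reward instead
    adjusted_reward = reward + (gamma * next_value if truncated else 0.0)
    return adjusted_reward + gamma * mask * next_value


print(td_target(-1.0, -50.0, terminated=False, truncated=False))  # running: bootstraps
print(td_target(-1.0, -50.0, terminated=True, truncated=False))   # terminal: reward only
print(td_target(-1.0, -50.0, terminated=False, truncated=True))   # time limit: still bootstraps
```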

---

#### **Generalized Advantage Estimation (GAE)**
To reduce variance further while balancing bias, **Generalized Advantage Estimation (GAE)** is often used. The advantage function is computed as a weighted sum of temporal difference (TD) residuals over multiple steps, controlled by a parameter $\lambda \in [0, 1]$:

$$ A(s_t, a_t) = \sum_{l=0}^\infty (\gamma \lambda)^l \delta_{t+l}, $$

where the TD residual $\delta_t$ is given by:

$$ \delta_t = r_t + \gamma V(s_{t+1}; w) - V(s_t; w). $$

This introduces a bias-variance tradeoff: lower $\lambda$ relies more on immediate TD errors (lower variance), while higher $\lambda$ uses longer-horizon rewards (lower bias).

\begin{align*}
\text{If }  \lambda=0,  \; \hat{A}_t^{\text{GAE}(\gamma,0)} & := \delta_t = r_t + \gamma V(s_{t+1}) - V(s_t) \\
\text{If }  \lambda=1,  \; \hat{A}_t^{\text{GAE}(\gamma,1)} & :=  \sum_{l=0}^{\infty} \gamma^{l} \delta_{t+l} = \sum_{l=0}^{\infty} \gamma^{l} r_{t+l} - V(s_t)
\end{align*}
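
The backward recursion behind these formulas can be sketched in plain Python for a single environment and a short rollout. The function name `compute_gae` and the sample numbers are assumptions for illustration, and episode boundaries (masks) are left out to keep the arithmetic visible.

```python
def compute_gae(rewards, values, last_value, gamma=0.99, lam=0.95):
    """GAE over one rollout: A_t = delta_t + gamma * lam * A_{t+1}, computed backwards."""
    n = len(rewards)
    advantages = [0.0] * n
    gae = 0.0
    for t in reversed(range(n)):
        # Bootstrap from the value after the rollout at the final step
        next_value = last_value if t == n - 1 else values[t + 1]
        delta = rewards[t] + gamma * next_value - values[t]  # TD residual
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    return advantages


rewards, values, last_value = [1.0, 0.0, -1.0], [0.5, 0.4, 0.3], 0.2
print(compute_gae(rewards, values, last_value, gamma=0.9, lam=0.0))  # one-step TD residuals
print(compute_gae(rewards, values, last_value, gamma=0.9, lam=1.0))  # n-step return minus V(s_t)
```

With a finite rollout the $\lambda = 1$ case ends in the bootstrapped value of the state after the last step, rather than an infinite sum of rewards.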

---



In [107]:
import torch 
import numpy as np 

class RolloutStorage:
    """
    Rollout buffer used in on-policy algorithms like A2C/PPO.
    It corresponds to ``buffer_size`` transitions collected
    using the current policy or n-steps as we like to call it in TD.
    This experience will be discarded after the policy update.
    In order to use PPO objective, we also store the current value of each state
    and the log probability of each taken action.

    It is only involved in policy and value function training but not action selection.
    """
    def __init__(self, 
                 obs_shape: any,
                 action_space: any,
                 num_steps: int = 1, 
                 n_envs: int = 1,
                 device: torch.device = torch.device("cpu")):
        
        self.num_steps = num_steps
        self.n_envs = n_envs

        self.obs_shape = obs_shape
        self.action_space = action_space
        self.device = device

        self.last_obs = None # To store the last observation between rollouts

        if action_space.__class__.__name__ == 'Discrete' or action_space.__class__.__name__ == 'MultiDiscrete':
            self.action_dim = 1
        else:
            self.action_dim = action_space.shape[-1]

        # setup all the data   
        self.reset()  
    
    def reset(self) -> None:
        """
        Call reset whenever we start collecting the next n-step rollout of data.
        """
        self.obs = torch.zeros(self.num_steps, self.n_envs, *self.obs_shape, dtype=torch.float32, device=self.device)
        self.rewards = torch.zeros(self.num_steps, self.n_envs, 1,  dtype=torch.float32, device=self.device)
        self.values = torch.zeros(self.num_steps, self.n_envs,  1,  dtype=torch.float32, device=self.device)
        self.log_probs = torch.zeros(self.num_steps, self.n_envs, 1, dtype=torch.float32, device=self.device)
        self.entropies = torch.zeros(self.num_steps, self.n_envs, 1, dtype=torch.float32, device=self.device)
        self.actions = torch.zeros(self.num_steps, self.n_envs, self.action_dim, dtype=torch.float32, device=self.device)
        self.masks = torch.ones(self.num_steps, self.n_envs, 1, dtype=torch.int8, device=self.device)
        self.truncates = torch.zeros(self.num_steps, self.n_envs, 1, dtype=torch.bool, device=self.device)
        self.advantages = torch.zeros(self.num_steps, self.n_envs, 1, dtype=torch.float32, device=self.device)
        self.returns = torch.zeros(self.num_steps, self.n_envs, 1, dtype=torch.float32, device=self.device)

        self.step = 0
    
    def add(
        self,
        obs: torch.Tensor,
        actions: torch.Tensor,
        log_probs: torch.Tensor,
        entropies: torch.Tensor,
        values: torch.Tensor,
        rewards: torch.Tensor,
        masks: torch.Tensor,
        truncates: torch.Tensor
    ) -> None:
        """
        :param obs: Observations
        :param actions: Actions
        :param log_probs: log probability of the action
            following the current policy.
        :param entropies: entropy calculated for the current step
        :param values: estimated value of the current state
            following the current policy.
        :param rewards: rewards
        :param masks: 1 if the env is still active, 0 if the episode ended (terminated or truncated)
        :param truncates: True if the env was truncated; needed to calculate advantages correctly
        """
        self.obs[self.step].copy_(obs)
        self.actions[self.step].copy_(actions)
        self.log_probs[self.step] = log_probs.clone() #keep gradients
        self.values[self.step] = values.clone() #keep gradients
        self.entropies[self.step] = entropies.clone() #keep gradients
        self.rewards[self.step].copy_(rewards)
        self.masks[self.step].copy_(masks)
        self.truncates[self.step].copy_(truncates)

        self.step = (self.step + 1) % self.num_steps # hopefully this % is actually not needed here

    def compute_returns_and_advantages(
            self,
            last_values: torch.Tensor,
            gamma: float = 0.99,
            gae_lambda: float = 1.0,
            normalize: bool = True) -> None:
        """
        Post-processing step: compute the advantages A using TD(n) error method, to use in the gradient calculation in future
            - TD(1) or A_1 is one-step estimate with bootstrapping delta_t = (r_{t+1} + gamma * v(s_{t+1}) - v(s_t))
            ....
            - TD(n) or A_n is n-step estimate with bootstrapping SUM_{l=0}^{n}(gamma^{l}*delta_{t+l})
               (r_{t+1} + gamma*r_{t+2} + gamma^2*r_{t+3} + .....+ gamma^(n+1)*v(s_{t+n+1}) - v(s_t))
        
        We use Generalized Advantage Estimation, where the advantage is calculated as:

            - A_t^gae(gamma,lambda) = SUM_{l=0}^{inf}( (gamma*lambda)^{l} * delta_{t+l})

        :param last_values: state values estimation for the last step (one for each env)
        :param gamma: discount to be used in reward estimation
        :param gae_lambda: factor for trade off of bias vs variance for GAE
        :param normalize: if True, standardize the advantages to zero mean and unit std
        """
        gae = 0
        for step in reversed(range(self.num_steps)):
            if step == self.num_steps - 1:
                next_values = last_values.detach()
            else:
                next_values = self.values[step+1].detach()

            # Handle truncated episodes by incorporating next state value
            # https://github.com/DLR-RM/stable-baselines3/issues/633#issuecomment-961870101
            adjusted_rewards = self.rewards[step].clone()  # Start with original rewards
            adjusted_rewards[self.truncates[step]] += gamma * next_values[self.truncates[step]]
            
            delta = adjusted_rewards + gamma * self.masks[step] * next_values - self.values[step].detach() #td_error
            gae = delta + gamma * gae_lambda * self.masks[step] * gae
            self.advantages[step] = gae.detach()

        #R_t = A_t{GAE} + V(s_t) 
        self.returns = self.advantages + self.values.detach()

        # Normalize advantages to reduce skewness and improve convergence
        if normalize:
            self.advantages = (self.advantages - self.advantages.mean()) / (self.advantages.std() + 1e-8)



## A2C Agent 

In [153]:
from collections import deque

class A2CAgent:
    """
    Advantage Actor Critic (A2C)
    :param env(gym.Env): The environment to learn from / openAI Gym environment - vector environment
    :param n_steps: The number of steps to run for each environment per update
        (i.e. batch size is n_steps * n_env where n_env is number of environment copies running in parallel)
    :param gamma: Discount factor
    :param gae_lambda: Factor for trade-off of bias vs variance for Generalized Advantage Estimator.
        Equivalent to classic advantage when set to 1.
    :param ent_coef: Entropy coefficient for the loss calculation
    :param vf_coef: Value function coefficient for the loss calculation
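    :param net_mode: Mode.COMBINED for a shared body, Mode.SEPARATE for separate actor and critic networks
    :param min_ent_coef: Lower bound for the entropy coefficient as it decays during training
    :param lr: Learning rate for the Adam optimizer
    :param save_path: Name used in the checkpoint file name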
    :param device: Device (cpu, cuda, ...) on which the code should be run.
    """
    def __init__(self,
                 env: gym.vector.VectorEnv,
                 net_mode: Mode = Mode.COMBINED,
                 n_steps: int = 5,
                 gamma: float = 0.99,
                 gae_lambda: float = 1.0,
                 ent_coef: float = 0.01,
                 min_ent_coef: float = 0.001,
                 lr: float = 1e-3,
                 vf_coef: float = 0.5,
                 device: torch.device = torch.device("cpu"),
                 save_path: str = 'a2c'):
        
        self.env = env
        self.n_steps = n_steps
        self.n_envs = env.num_envs

        self.gamma = gamma
        self.gae_lambda = gae_lambda

        self.entropy_coef = ent_coef
        self.min_entropy_coef = min_ent_coef
        self.value_loss_coef = vf_coef
        self.device = device
        self.save_path = save_path

        # network and optimizer 
        action_space = self.env.action_space
        if isinstance(action_space, gym.spaces.MultiDiscrete):
            action_size = action_space[0].n  # This gives the dimensions of first env, everywhere else same
        elif isinstance(action_space, gym.spaces.Box): 
            action_size = action_space.shape[-1]
        else:
            action_size = action_space.n
        
        state_size = self.env.observation_space.shape[-1] # works only for the boxes

        if net_mode == Mode.COMBINED:
            self.actor_critic = SimpleActorCriticCombined(state_size, action_size).to(self.device)
        else:
            self.actor_critic = SimpleActorCriticSeparate(state_size, action_size).to(self.device)
        self.optimizer = torch.optim.Adam(self.actor_critic.parameters(), lr=lr)

        # rollout storage
        self.rollout_storage =  RolloutStorage(
            (state_size,), 
            self.env.action_space, 
            num_steps=n_steps, 
            n_envs=self.n_envs, 
            device=self.device)

    def collect_rollouts(self):
        """
        Collect experiences using the current policy and fill the ``RolloutStorage``.
        The term rollout here refers to the model-free notion and should not
        be confused with the concept of rollout used in model-based RL or planning.
        """

        last_obs = self.rollout_storage.last_obs 
        self.rollout_storage.reset()

        for _ in range(self.n_steps):
            # with torch.no_grad():
            values, actions, log_probs, entropies, _ =  self.actor_critic(last_obs)


            # Clamp the selected actions to the valid torque range
            selected_actions = actions.detach().cpu().clamp(-2.0, 2.0).numpy()
            
            # Take actions in env and look the results
            obs, rewards, terminates, truncates, infos = self.env.step(selected_actions)

            # Updating global number of steps agent done, while learning
            self.num_timesteps += self.env.num_envs

            dones = (terminates | truncates)
            masks = torch.tensor(1 - dones, dtype=torch.float32).unsqueeze(-1)
            rewards = torch.tensor(rewards, dtype=torch.float32).unsqueeze(-1)
            truncates = torch.tensor(truncates, dtype=torch.bool).unsqueeze(-1)

            if "final_info" in infos:
                source_info = infos["final_info"]
                self._extract_episode_data(source_info.get('episode'), source_info.get('_episode'))

            self.rollout_storage.add(last_obs, 
                            actions, 
                            log_probs,
                            entropies,
                            values,
                            rewards,
                            masks,
                            truncates)
            
            last_obs = torch.tensor(obs, dtype=torch.float32, device=self.device)

        # compute values for the last timestep
        # with torch.no_grad():
        last_values, _, _, _, _  = self.actor_critic(last_obs)

        self.rollout_storage.compute_returns_and_advantages(
            last_values, 
            self.gamma,
            self.gae_lambda)
        self.rollout_storage.last_obs = last_obs

    def learn(self):
        """
        Update policy using the currently gathered
        rollout buffer (one gradient step over whole data).
        
        A single module and optimizer hold both the actor (policy) and the critic (value).

        For the actor we use gradient ascent to maximize the accumulated n-step updates: action log probabilities * estimate of the advantage function,
        where the last state uses a one-step advantage, the second-to-last state a two-step advantage, and so on
          advantage = (r_{t+1} + gamma * v(s_{t+1}) - v(s_t))
        Additionally, for the actor we add an entropy term to favor exploration.

        For the critic we use gradient descent to minimize the mean squared error between the estimated returns and the value estimate for the state,
        where the last state uses a one-step return, the second-to-last state a two-step return, and so on
          return = (r_{t+1} + gamma * v(s_{t+1}))
        """
        advantages = self.rollout_storage.advantages
        log_probs = self.rollout_storage.log_probs
        entropies = self.rollout_storage.entropies
        returns = self.rollout_storage.returns
        values = self.rollout_storage.values
        
        
        # 1. Policy gradient loss or Actor gradient loss (gradient only on the log probs)
        #    gradient ascent on the objective, hence the minus sign
        policy_loss = -(advantages * log_probs).mean()

        # 2. Value loss using the TD(n) n-step error target (gradient only on values)
        #    gradient descent, so the sign is kept
        value_loss = F.mse_loss(values, returns)

        # 3. Entropy loss to favor exploration (for a Gaussian, gradient only on log_std)
        #    gradient ascent on the entropy, hence the minus sign
        entropy_loss = -torch.mean(entropies)
        
        # 4. Combine total loss for the network 
        loss = policy_loss + self.value_loss_coef * value_loss  +  self.entropy_coef * entropy_loss

        # 5. Optimization step
        self.optimizer.zero_grad()
        loss.backward()
        # nn.utils.clip_grad_norm_(self.actor_critic.parameters(), 0.5)
        self.optimizer.step()
        return loss, policy_loss, value_loss, entropy_loss

    def train(self, total_timesteps: int = 100_000, log_interval: int = 100):
        """Train the agent."""

        obs, _ = self.env.reset()
        self.rollout_storage.last_obs = torch.tensor(obs, dtype=torch.float32, device=self.device)

        self.num_timesteps = 0
        self.episode_rewards = deque(maxlen=15)
        self.episode_lengths = deque(maxlen=15)
        # Calculate the updates
        update = 1
        reached_best_score_times = 0

        while self.num_timesteps < total_timesteps:
            self.collect_rollouts()
            loss, policy_loss, value_loss, entropy_loss = self.learn()
            
            # Display training infos
            if update % log_interval == 0 and len(self.episode_rewards) > 1:
                print(
                    "Updates {}, num timesteps {}/{} Mean Length {:.1f}\n"
                    "Loss {:.4f}: Policy {:.4f} Value {:.4f} Entropy {:.4f}\n"
                    "Last {} training episodes: mean/median reward {:.1f}/{:.1f}, min/max reward {:.1f}/{:.1f}"
                    .format(
                        update, self.num_timesteps, total_timesteps, np.mean(self.episode_lengths),
                        loss, policy_loss, value_loss, entropy_loss,
                        len(self.episode_rewards), np.mean(self.episode_rewards),
                        np.median(self.episode_rewards), np.min(self.episode_rewards),
                        np.max(self.episode_rewards)
                    )
                )
                
                if self.entropy_coef != 0.0:
                    # 0.99**229 ~= 0.1, so decaying from 0.01 to 0.001 takes ~230 log intervals
                    # (e.g. 230 * 100 updates * 16 envs * 5 steps ~= 1.84M steps)
                    self.entropy_coef = max(self.min_entropy_coef, self.entropy_coef * 0.99)
                
                if update % (10 * log_interval) == 0:
                    print('Saving checkpoint...')
                    self._save_policy_with_normalize_obs(f'./models/checkpoint_{self.save_path}.pth')

                if  np.mean(self.episode_rewards) >= -250:
                    reached_best_score_times += 1
                    if reached_best_score_times > 3:
                        print('Saving best solution, finishing training...')
                        self._save_policy_with_normalize_obs(f'./models/checkpoint_{self.save_path}.pth')
                        break


            update += 1
        
    def _save_policy_with_normalize_obs(self, save_path):
        """
        Saves the policy's state_dict along with NormalizeObservation RMS data.
        """
        all_obs_rms = [env.obs_rms for env in self.env.envs]

        # Save policy and normalization data
        save_data = {
            "policy_state_dict": self.actor_critic.state_dict(),
            "obs_rms_data": all_obs_rms
        }
        torch.save(save_data, save_path)
        print(f"Policy and NormalizeObservation data saved to {save_path}") 

    def _extract_episode_data(self, episode_data, episode_flags):
        """
        Extract data for environments where '_episode' is True and append it to 
        the self.episode_rewards and self.episode_lengths deques
            {'r': -21.0, '_r': True, 'l': 944, '_l': True, 't': 5.089006, '_t': True}- example
            r - cumulative reward
            l - episode length
            t - elapsed time since beginning of episode

        :param episode_data: dict, data from environments
        :param episode_flags: np.ndarray, boolean array indicating done environments
        """
        done_envs = np.where(episode_flags)[0]  # Get indices of done environments
        for env_index in done_envs:
            env_specific_data = {}
            for key, value in episode_data.items():
                if isinstance(value, np.ndarray):  # Ensure it's an array
                    env_specific_data[key] = value[env_index]
            self.episode_rewards.append(env_specific_data['r'])
            self.episode_lengths.append(env_specific_data['l'])

Apparently Gymnasium provides a wrapper for clipping actions (`gymnasium.wrappers.ClipAction`); some people use it instead of writing their own clipping code. Both approaches could be tried here.


### Train and more

In [154]:
from helpers.envs import make_sync_vec, AutoresetMode


num_envs = 8
env_id = 'Pendulum-v1'
DEVICE = torch.device("cpu")

envs = make_sync_vec(env_id, 
                    num_envs=num_envs, 
                    wrappers=(gym.wrappers.RecordEpisodeStatistics, 
                              gym.wrappers.NormalizeObservation,),
                    autoreset_mode=AutoresetMode.SAME_STEP)

agent = A2CAgent(envs, 
                 net_mode=Mode.SEPARATE, 
                 n_steps=5, 
                 gae_lambda=0.96, 
                 gamma=0.99, 
                 device=DEVICE, 
                 ent_coef=0.01, 
                 lr=5e-4, 
                 save_path='a2c_pendulum_continuous_n_5')
# agent.actor_critic.load_state_dict(torch.load('./advanced_models/checkpoint_a2c_pendulum_continuous_n_10.pth'))
agent.train(total_timesteps=1e6, log_interval=100) # 8 envs * 5 steps * 100 updates = 4000 timesteps per log line


Updates 100, num timesteps 4000/1000000.0 Mean Length 200.0
Loss 320.8534: Policy -0.0177 Value 641.7707 Entropy -1.4260
Last 15 training episodes: mean/median reward -1172.0/-1165.8, min/max reward -1608.7/-890.1
Updates 200, num timesteps 8000/1000000.0 Mean Length 200.0
Loss 142.7148: Policy 0.1026 Value 285.2528 Entropy -1.4260
Last 15 training episodes: mean/median reward -1228.6/-1107.8, min/max reward -1685.9/-867.9
Updates 300, num timesteps 12000/1000000.0 Mean Length 200.0
Loss 236.1185: Policy -0.0095 Value 472.2841 Entropy -1.4321
Last 15 training episodes: mean/median reward -1397.4/-1403.6, min/max reward -1783.1/-758.0
Updates 400, num timesteps 16000/1000000.0 Mean Length 200.0
Loss 365.7928: Policy -0.1627 Value 731.9387 Entropy -1.4265
Last 15 training episodes: mean/median reward -1297.3/-1309.9, min/max reward -1717.4/-975.9
Updates 500, num timesteps 20000/1000000.0 Mean Length 200.0
Loss 126.8536: Policy -0.2471 Value 254.2288 Entropy -1.4264
Last 15 training epis

In [155]:
import os
import imageio
import numpy as np
from IPython.display import Video, display, HTML

def record_video(env, policy, out_directory, out_name, fps=60):
    """
    Generate a replay video of the agent and display it in the notebook.
    :param env: Environment to record.
    :param policy: Policy used to determine actions.
    :param out_directory: Directory in which to save the video.
    :param out_name: File name of the video.
    :param fps: Frames per second.
    """
    images = []
    done = False
    obs, _ = env.reset()
    img = env.render()

    times = 0
    while times != 5:
        # Convert the observation to a tensor for the network
        state = torch.tensor(obs, dtype=torch.float32)

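        # Calculate actions and values; act deterministically with the policy mean
        # (SimpleActorCriticSeparate returns the pre-tanh mean; Pendulum clips torques to [-2, 2])
        value, action, action_prob, _, action_mean = policy(state)
        action = action_mean.detach().cpu().numpy()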

        obs, reward, terminated, truncated, _ = env.step(action) 
        img = env.render()
        images.append(img)
        if terminated or truncated:
            obs, _ = env.reset() 
            times += 1
    
    # Save the video
    video_path = os.path.join(out_directory, out_name)
    imageio.mimsave(video_path, [np.array(img) for img in images], fps=fps)
    
    # Display the video in Jupyter notebook
    display(Video(video_path, embed=True, width=640, height=480))

def eval_policy(env, policy, num_episodes=10):
    """
    Run the policy for several episodes and report the per-episode return.
    :param env: Environment to evaluate in.
    :param policy: Policy used to determine actions.
    :param num_episodes: Number of evaluation episodes.
    :return: Mean and standard deviation of the episode returns.
    """
    # Store rewards for each episode
    episode_rewards = []

    # Evaluation loop
    for episode in range(num_episodes):
        obs, _ = env.reset()
        done = False
        total_reward = 0  # Keep track of total reward in the episode
        while not done:
            state = torch.tensor(obs, dtype=torch.float32)
            # Select the action using the trained model. Note that this uses the
            # sampled action, not action_mean, so the evaluation is stochastic.
            with torch.no_grad():
                value, action, action_prob, _, action_mean = policy(state)
            action = action.cpu().numpy()
            # Step the environment
            obs, reward, terminated, truncated, info = env.step(action)
            total_reward += reward
            done = terminated or truncated

        episode_rewards.append(total_reward)  # Store the total reward for the episode
        print(f"Episode {episode + 1}: Total Reward = {total_reward}")

    # Calculate mean and standard deviation
    mean_reward = np.mean(episode_rewards)
    std_reward = np.std(episode_rewards)

    print(f"Mean Reward: {mean_reward}, Standard Deviation: {std_reward}")
    return mean_reward, std_reward

In [None]:
env = gym.make('Pendulum-v1', render_mode='rgb_array')
env = gym.wrappers.NormalizeObservation(env)
# Freeze the normalization statistics during evaluation
env.update_running_mean = False

eval_model = SimpleActorCriticSeparate(env.observation_space.shape[-1], env.action_space.shape[-1])
checkpoint = torch.load('./models/checkpoint_a2c_pendulum_continuous_n_5.pth', weights_only=False)

# Restore the observation-normalization statistics by averaging
# the entries saved in the checkpoint
all_obs_rms = checkpoint['obs_rms_data']
mean = sum(obs_rms.mean for obs_rms in all_obs_rms) / len(all_obs_rms)
var = sum(obs_rms.var for obs_rms in all_obs_rms) / len(all_obs_rms)
env.obs_rms.mean = mean
env.obs_rms.var = var

eval_model.load_state_dict(checkpoint['policy_state_dict'])
# eval_model = agent.actor_critic.cpu()
# Ensure the model is in evaluation mode
eval_model.eval()

eval_policy(env, eval_model, num_episodes=100)
record_video(env, eval_model, './videos', 'output_pong_a2c_pendulum_continuous_n_5.mp4')

Episode 1: Total Reward = -510.48428910730814
Episode 2: Total Reward = -130.65933463454542
Episode 3: Total Reward = -3.272582928072358
Episode 4: Total Reward = -280.5471316885407
Episode 5: Total Reward = -129.16336851320654
Episode 6: Total Reward = -284.77348872801855
Episode 7: Total Reward = -5.0163425140145135
Episode 8: Total Reward = -135.71010467997215
Episode 9: Total Reward = -3.471146300109374
Episode 10: Total Reward = -136.54231164270857
Episode 11: Total Reward = -130.26161353405513
Episode 12: Total Reward = -262.6062685220581
Episode 13: Total Reward = -279.7445055214702
Episode 14: Total Reward = -137.35164181211968
Episode 15: Total Reward = -132.87544349640805
Episode 16: Total Reward = -440.49563139727167
Episode 17: Total Reward = -4.390144409088493
Episode 18: Total Reward = -4.736045917198095
Episode 19: Total Reward = -274.52846471909237
Episode 20: Total Reward = -406.4650487701025
Episode 21: Total Reward = -405.4099095312261
Episode 22: Total Reward = -540



<video width="640" height="480" controls>
  <source src="../assets/videos/output_pong_a2c_pendulum_continuous_n_5.mp4" type="video/mp4">
</video>
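
The evaluation cell above restores the normalization statistics with an unweighted average of the saved means and variances. Averaging variances this way ignores how far the individual means sit from each other, so it can understate the overall variance. Below is a minimal sketch of count-weighted pooling based on the law of total variance. It assumes each saved entry also carries a sample count (Gymnasium's `RunningMeanStd` exposes `count`, but whether the checkpoint entries do is an assumption). `pool_running_stats` is a hypothetical helper, not part of the notebook, and the toy numbers are made up for the check.

```python
import statistics


def pool_running_stats(means, variances, counts):
    """Pool per-source means and population variances, one feature at a time.

    means, variances: one sequence of per-feature values per source.
    counts: number of samples behind each source's statistics.
    """
    total = sum(counts)
    num_features = len(means[0])
    pooled_mean = [
        sum(n * m[d] for n, m in zip(counts, means)) / total
        for d in range(num_features)
    ]
    # Law of total variance: within-source variance plus the spread of the source means
    pooled_var = [
        sum(n * (v[d] + (m[d] - pooled_mean[d]) ** 2)
            for n, m, v in zip(counts, means, variances)) / total
        for d in range(num_features)
    ]
    return pooled_mean, pooled_var


# Toy check with two made-up one-feature data sets (illustrative numbers only)
data_a = [0.0, 1.0, 2.0, 3.0]
data_b = [10.0, 12.0]
means = [[statistics.fmean(data_a)], [statistics.fmean(data_b)]]
variances = [[statistics.pvariance(data_a)], [statistics.pvariance(data_b)]]
counts = [len(data_a), len(data_b)]

pooled_mean, pooled_var = pool_running_stats(means, variances, counts)
naive_var = statistics.fmean([v[0] for v in variances])

print("pooled mean:", pooled_mean[0])
print("pooled variance:", pooled_var[0])
print("unweighted average of variances:", naive_var)
```

When all training environments see similar observation distributions for the same number of steps, the per-entry means are close and the simple average used in the notebook is likely a reasonable approximation. The pooled form matters mainly when the means differ noticeably.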


## OFFTOPIC

In [62]:
from gymnasium.wrappers import RecordEpisodeStatistics

num_envs = 4
env_id = 'Pendulum-v1'

envs = gym.make_vec(env_id,
                    num_envs=num_envs,
                    wrappers=(RecordEpisodeStatistics,))

print(envs.action_space)
action_space = envs.action_space
state_size = envs.observation_space.shape[-1]
action_size = envs.action_space.shape[-1]

envs.reset()
# Sample one uniform random action per environment within the action bounds
random_moves = np.random.uniform(low=action_space.low, high=action_space.high, size=action_space.shape)
print(random_moves)
obs, _, _, _, _ = envs.step(random_moves)


processed_obs = torch.FloatTensor(obs)
print(processed_obs.shape)
model = SimpleActorCritic(state_size, action_size)
value, action, log_prob, entropy, mean = model(processed_obs)
# print(value)
# print(value.grad_fn)
print(action)
print('Mean', mean)
print(log_prob)
print(log_prob.sum(-1).unsqueeze(-1))
print(entropy)
print(value.shape, action.shape, log_prob.shape, entropy.shape)
rollouts = RolloutStorage((state_size,), envs.action_space, 5, 4)
# print(rollouts.obs.shape)
# print(rollouts.advantages.shape)
# print(rollouts.actions.shape)
# print(rollouts.log_probs.shape)
# print(rollouts.entropies.shape)
# print(rollouts.values.shape)
# print(rollouts.rewards.shape)
rollouts.add(processed_obs, action, log_prob, entropy, value, torch.Tensor([[0]]*4), torch.Tensor([[1]]*4), torch.Tensor([[0]]*4))

Box(-2.0, 2.0, (4, 1), float32)
[[-0.94284604]
 [ 0.50593257]
 [ 0.57670753]
 [ 1.5064863 ]]
torch.Size([4, 3])
tensor([[ 0.5802],
        [-0.4201],
        [ 0.4749],
        [-0.4455]])
Mean tensor([[ 0.2628],
        [ 0.2025],
        [ 0.2517],
        [-0.1390]], grad_fn=<MulBackward0>)
tensor([[-0.9693],
        [-1.1127],
        [-0.9439],
        [-0.9659]], grad_fn=<SubBackward0>)
tensor([[-0.9693],
        [-1.1127],
        [-0.9439],
        [-0.9659]], grad_fn=<UnsqueezeBackward0>)
tensor([[1.4189],
        [1.4189],
        [1.4189],
        [1.4189]], grad_fn=<AddBackward0>)
torch.Size([4, 1]) torch.Size([4, 1]) torch.Size([4, 1]) torch.Size([4, 1])
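
The printed log-probabilities and entropies can be checked by hand against the Gaussian policy $\pi_\theta(a_t | s_t) = \mathcal{N}(\mu_\theta(s_t), \sigma_\theta)$. The sketch below recomputes them with the standard library only, using the actions and means copied from the output above. It assumes $\sigma_\theta = 1$, which is inferred from the printed entropy of 1.4189 rather than stated anywhere in the notebook. The helper names are illustrative.

```python
import math

# Values copied from the printed output above (rounded to 4 decimals)
actions = [0.5802, -0.4201, 0.4749, -0.4455]
means = [0.2628, 0.2025, 0.2517, -0.1390]
printed_log_probs = [-0.9693, -1.1127, -0.9439, -0.9659]
printed_entropy = 1.4189

SIGMA = 1.0  # assumption: inferred from the printed entropy, not stated in the source


def gaussian_log_prob(a, mu, sigma):
    """Log-density of N(mu, sigma^2) evaluated at a."""
    return -0.5 * ((a - mu) / sigma) ** 2 - math.log(sigma) - 0.5 * math.log(2 * math.pi)


def gaussian_entropy(sigma):
    """Differential entropy of a one-dimensional Gaussian."""
    return 0.5 * math.log(2 * math.pi * math.e * sigma ** 2)


log_probs = [gaussian_log_prob(a, mu, SIGMA) for a, mu in zip(actions, means)]
entropy = gaussian_entropy(SIGMA)

for computed, printed in zip(log_probs, printed_log_probs):
    print(f"log_prob computed {computed:.4f}, printed {printed:.4f}")
print(f"entropy computed {entropy:.4f}, printed {printed_entropy:.4f}")
```

The recomputed values agree with the printed tensors to within the rounding of the copied inputs. Because the Pendulum action is one-dimensional, `log_prob.sum(-1).unsqueeze(-1)` leaves the values unchanged, which is why the third and fourth printed tensors are identical.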
