## Policy-based models

Previously, we looked at value-based methods such as Q-learning, where a Q-table stores the value of taking each action in each state, and deep Q-learning, where a neural network estimates the Q-value of every state-action pair.

In value-based methods, the optimal policy is obtained implicitly: in each state, pick the action with the highest Q-value. In policy-based methods, by contrast, we train the policy directly.
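
To make the value-based side of this contrast concrete, here is a minimal sketch of a greedy policy read off a Q-table. The table contents and the state and action names are made up for illustration.

```python
# Hypothetical Q-table: q_table[state][action] holds the estimated action value.
q_table = {
    "s0": {"left": 0.1, "right": 0.7},
    "s1": {"left": 0.4, "right": 0.2},
}

def greedy_policy(q_table, state):
    """Return the action with the highest Q-value in the given state."""
    action_values = q_table[state]
    return max(action_values, key=action_values.get)

print(greedy_policy(q_table, "s0"))  # right
print(greedy_policy(q_table, "s1"))  # left
```

The policy here is never stored or trained on its own; it is entirely determined by the values in the table.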

## What are the policy-based methods? 

The main goal of reinforcement learning is for our agent to learn the optimal policy, the one that maximizes the expected cumulative reward. RL is based on the reward hypothesis, which states that all goals can be described as the maximization of the expected cumulative reward.

In value-based methods, our agent learns a value function that implicitly contains the optimal policy. In policy-based methods, we instead learn an optimal policy directly, without a value function.
* First, we parameterize the policy. We use a policy network that outputs a probability distribution over all actions.
* $\pi_\theta(s) = P[A \mid s; \theta]$
* Our goal is then to maximize the performance (the reward gained) of the policy, using gradient ascent to update its parameters.
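
As a rough illustration of a parameterized policy, the sketch below uses one linear preference per action followed by a softmax. The linear form, the weight values, and the function names are assumptions chosen for brevity; the notebook itself uses a small neural network.

```python
import math
import random

def softmax(preferences):
    """Turn a list of real-valued preferences into probabilities."""
    shift = max(preferences)  # subtract the max for numerical stability
    exps = [math.exp(p - shift) for p in preferences]
    total = sum(exps)
    return [e / total for e in exps]

def policy(state, theta):
    """pi_theta(. | s): one linear preference per action, then softmax.

    theta[a] is the (hypothetical) weight vector for action a.
    """
    preferences = [sum(w * x for w, x in zip(weights, state)) for weights in theta]
    return softmax(preferences)

theta = [[0.5, -0.2], [-0.1, 0.3]]  # 2 actions, 2 state features
state = [1.0, 2.0]
probs = policy(state, theta)

# Acting means sampling from the distribution rather than taking an argmax.
action = random.choices(range(len(probs)), weights=probs)[0]
print(probs, action)
```

Changing `theta` changes the probabilities, which is exactly the handle that gradient ascent uses.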

## The difference between policy-based and policy-gradient methods

Policy-gradient methods are a subset of policy-based methods. In policy-based methods, optimization is usually on-policy, since each update uses only data collected by the most recent version of the policy.

The difference between policy-based and policy-gradient methods is based on how we optimize the parameters:

* In policy-based methods, we optimize the parameters indirectly by maximizing a local approximation of the objective function, with techniques such as hill climbing, simulated annealing, or evolution strategies.
* In policy-gradient methods, we optimize the parameters directly by performing gradient ascent on the objective function $J(\theta)$.
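
The two styles of optimization can be compared on a toy problem. The sketch below uses a made-up one-dimensional objective as a stand-in for the expected return; the objective, step sizes, and function names are illustrative assumptions.

```python
import random

def J(theta):
    """Toy objective with its maximum at theta = 3 (a stand-in for expected return)."""
    return -(theta - 3.0) ** 2

def grad_J(theta):
    """Analytic gradient of the toy objective."""
    return -2.0 * (theta - 3.0)

def hill_climbing(theta, steps=200, noise=0.5, seed=0):
    """Gradient-free: perturb theta randomly and keep the change if J improves."""
    rng = random.Random(seed)
    for _ in range(steps):
        candidate = theta + rng.uniform(-noise, noise)
        if J(candidate) > J(theta):
            theta = candidate
    return theta

def gradient_ascent(theta, steps=200, lr=0.1):
    """Gradient-based: move theta in the direction of the gradient of J."""
    for _ in range(steps):
        theta = theta + lr * grad_J(theta)
    return theta

theta_hc = hill_climbing(0.0)
theta_ga = gradient_ascent(0.0)
print(theta_hc, theta_ga)  # both should end up near 3
```

Hill climbing only ever evaluates `J`, while gradient ascent needs `grad_J`; in reinforcement learning that gradient has to be estimated from sampled trajectories.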

## Advantages and disadvantages of policy-gradient methods

### Advantages

#### The simplicity of integration 
We estimate the policy directly, so there is no need to store action values and no memory overhead from doing so.

#### Policy-gradient methods can learn stochastic policies 

Policy-gradient methods can learn stochastic policies, which value-based methods generally can't.

There are two consequences that arise from this: 

1. We don't need to implement an exploration/exploitation trade-off by hand. Since the policy outputs a probability distribution over actions, the agent explores the environment without always following the same trajectory.
2. We also get rid of the problem of perceptual aliasing. Perceptual aliasing occurs when two states look the same to the agent but require different actions. With a deterministic policy, the agent can get stuck in aliased states and never reach the goal, or take a long time to reach it.

A stochastic policy chooses actions randomly in the aliased states, so the agent does not get stuck and has a higher probability of reaching the goal.
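
The aliasing problem can be reproduced in a few lines. The corridor below is a hypothetical environment invented for this sketch: five positions, the goal in the middle, and positions 1 and 3 producing the same observation.

```python
import random

GOAL = 2  # positions 0..4, goal in the middle

def observe(position):
    """Positions 1 and 3 look identical to the agent (they are aliased)."""
    if position == 0:
        return "left_wall"
    if position == 4:
        return "right_wall"
    return "grey"

def run_episode(policy, start, rng, max_steps=100):
    """Return the number of steps needed to reach the goal, or None if it is never reached."""
    position = start
    for step in range(1, max_steps + 1):
        move = policy(observe(position), rng)  # -1 = left, +1 = right
        position = min(4, max(0, position + move))
        if position == GOAL:
            return step
    return None

def deterministic_policy(observation, rng):
    # Must pick the same move in both grey states because they look the same.
    return {"left_wall": +1, "right_wall": -1, "grey": +1}[observation]

def stochastic_policy(observation, rng):
    if observation == "grey":
        return rng.choice([-1, +1])  # 50/50 in the aliased states
    return +1 if observation == "left_wall" else -1

rng = random.Random(0)
print(run_episode(deterministic_policy, start=3, rng=rng))  # None: bounces between 3 and 4
print(run_episode(stochastic_policy, start=3, rng=rng))     # reaches the goal after a few steps
```

Whichever move the deterministic policy assigns to the grey observation, it is wrong for one of the two aliased positions; the stochastic policy escapes from both.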

#### Policy-gradient methods are more effective in high-dimensional and continuous action spaces

The main problem with deep Q-learning is that it assigns a value to each possible action at every step, which works well for discrete action spaces but is impractical for continuous or high-dimensional ones. Policy-gradient methods avoid this problem by outputting a probability distribution over actions.
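
For a continuous action space, one common choice (an assumption here, since the notes do not name a specific distribution) is a Gaussian policy: the policy produces a mean and a standard deviation, and the action is sampled from the resulting normal distribution. The values of `mean` and `std` below are placeholders for what a policy network would output.

```python
import math
import random

def gaussian_policy_sample(mean, std, rng):
    """Sample a continuous action a ~ N(mean, std^2)."""
    return rng.gauss(mean, std)

def gaussian_log_prob(action, mean, std):
    """Log-density of the action under N(mean, std^2)."""
    return (
        -0.5 * ((action - mean) / std) ** 2
        - math.log(std)
        - 0.5 * math.log(2 * math.pi)
    )

rng = random.Random(0)
mean, std = 0.3, 0.5  # hypothetical policy outputs for the current state
action = gaussian_policy_sample(mean, std, rng)
log_prob = gaussian_log_prob(action, mean, std)
print(action, log_prob)
```

Two numbers describe a distribution over infinitely many actions, whereas a Q-value per action would be impossible to enumerate.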

#### Policy-gradient methods are better at convergence 
In value-based methods, the policy can change abruptly: because we take the action with the maximum Q-value, even a small change in the estimated values can switch the chosen action. In policy-gradient methods, on the other hand, the action probabilities change smoothly over time.
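
A small numeric sketch of this difference, using made-up numbers: the same tiny change in the underlying scores flips a greedy choice completely, but only nudges softmax probabilities.

```python
import math

def greedy(values):
    """Index of the largest value (the action a greedy policy would pick)."""
    return max(range(len(values)), key=lambda a: values[a])

def softmax(values):
    shift = max(values)
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

before = [1.00, 1.01]  # hypothetical scores for two actions
after = [1.02, 1.01]   # the first score increased by only 0.02

print(greedy(before), greedy(after))    # 1 0  (the chosen action flips)
print(softmax(before), softmax(after))  # probabilities move by about 0.005
```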

### Disadvantages 
* Policy-gradient methods tend to converge to a local maximum instead of the global maximum.
* Policy-gradient methods improve the policy step by step, so training can take a long time.
* Policy-gradient estimates tend to have high variance.

## Policy-gradient methods

### Big picture
We know that in policy-gradient methods, our goal is to find parameters that maximize the expected return. 

The idea behind policy-gradient methods is that we have a parameterized stochastic policy. In our case, the policy is a neural network that outputs a probability distribution over all actions. The probability of taking each action is also called the action preference.

Our goal when training with policy-gradient methods is to adjust the probability distribution over actions so that good actions are sampled more often. We tweak the parameters each time the agent interacts with the environment.

**How do we optimize the weights of our neural net?**

We let our agent interact with the environment for an episode and collect the rewards. If the episode yields a high return, we update the weights so that the actions taken become more probable; if the return is low or negative, we update them so that those actions become less probable.

**How do we know if our policy is good?** We use a score/objective function called $J(\theta)$.

#### The objective function

The objective function gives us the performance of our agent given a trajectory (a sequence of states and actions, ignoring the rewards), and outputs the expected cumulative reward.

$J(\theta) = E_{\tau \sim \pi}[R(\tau)]$

$R(\tau) = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \dots$

where $\tau$ is a trajectory.
* The expected return is the weighted average of all possible values that the return $R(\tau)$ can take.
* $J(\theta) = \sum_{\tau} P(\tau; \theta) R(\tau)$
    - $R(\tau)$: return from an arbitrary trajectory.
    - $P(\tau; \theta)$: probability of each possible trajectory $\tau$, which depends on $\theta$.
    - $J(\theta)$: expected return, obtained by summing, over all possible trajectories, the probability of taking that trajectory given $\theta$ multiplied by the return of that trajectory.
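
For a problem small enough to enumerate, the sum over trajectories can be computed exactly. The sketch below assumes a made-up two-step task with two actions, where action 1 always yields reward 1 and action 0 yields reward 0, and a policy that picks action 1 with a fixed probability.

```python
from itertools import product

def discounted_return(rewards, gamma):
    """R(tau) = r_1 + gamma * r_2 + gamma^2 * r_3 + ..."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))

def expected_return(p_action1, gamma=0.9, horizon=2):
    """J = sum over all trajectories of P(tau) * R(tau)."""
    total = 0.0
    for actions in product([0, 1], repeat=horizon):
        # Probability of this action sequence under the policy.
        prob = 1.0
        for a in actions:
            prob *= p_action1 if a == 1 else 1.0 - p_action1
        rewards = [float(a) for a in actions]  # reward 1 for action 1, else 0
        total += prob * discounted_return(rewards, gamma)
    return total

print(expected_return(0.7))  # approximately 1.33
```

With longer horizons or stochastic environments the number of trajectories grows too quickly to enumerate, which is why the gradient is estimated from samples instead.
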
#### Gradient ascent and the policy-gradient theorem 
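Gradient ascent is the counterpart of gradient descent: it moves the parameters in the direction of steepest increase of the objective rather than steepest decrease, which is what we want since we are maximizing the expected return. We update our parameters with this formula:

$\theta \leftarrow \theta + \text{lr} \cdot \nabla_\theta J(\theta)$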

There are two problems we encounter when computing the gradient of $J(\theta)$:
1. We can't compute the true gradient of the objective function, since that requires the probability of every possible trajectory, which is computationally very expensive. Instead, we estimate the gradient from a sample of collected trajectories.
2. To differentiate the objective function, we would need to differentiate the Markov decision process dynamics, which belong to the environment. We usually don't know these dynamics, so we can't differentiate them.

To address these problems, we use the Policy Gradient Theorem, which reformulates the gradient of the objective function into an expression that does not require differentiating the environment dynamics.

**The Policy Gradient Theorem**

$\nabla_\theta J(\theta) = E_{\pi_\theta}[\nabla_\theta \log \pi_\theta(a_t \mid s_t) R(\tau)]$
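
The theorem can be checked numerically on a problem where the exact gradient is known. The sketch below assumes a one-step task with two actions and fixed rewards (a made-up bandit), with a softmax policy over two preferences, and compares the analytic gradient with the sampled estimate of $\nabla_\theta \log \pi_\theta(a) R$.

```python
import math
import random

REWARDS = [1.0, 3.0]  # hypothetical deterministic reward for each of two actions

def softmax(preferences):
    shift = max(preferences)
    exps = [math.exp(p - shift) for p in preferences]
    total = sum(exps)
    return [e / total for e in exps]

def exact_gradient(theta):
    """Gradient of J(theta) = sum_a pi(a) * r(a), computed analytically."""
    probs = softmax(theta)
    j = sum(p * r for p, r in zip(probs, REWARDS))
    return [probs[k] * (REWARDS[k] - j) for k in range(len(theta))]

def sampled_gradient(theta, n_samples=20000, seed=0):
    """Monte Carlo estimate: average of grad log pi(a) * R over sampled actions."""
    rng = random.Random(seed)
    probs = softmax(theta)
    grad = [0.0] * len(theta)
    for _ in range(n_samples):
        action = rng.choices(range(len(probs)), weights=probs)[0]
        for k in range(len(theta)):
            # d/d theta_k of log softmax(theta)[action] = 1[action == k] - pi(k)
            grad_log_pi = (1.0 if k == action else 0.0) - probs[k]
            grad[k] += grad_log_pi * REWARDS[action] / n_samples
    return grad

theta = [0.0, 0.0]
print(exact_gradient(theta))    # [-0.5, 0.5]
print(sampled_gradient(theta))  # close to the exact gradient
```

The sampled version never touches the reward function's internals; it only needs actions drawn from the policy and the returns that followed.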

### Monte Carlo REINFORCE
REINFORCE, also called Monte Carlo policy gradient, is a policy-gradient algorithm that uses the estimated return from an entire episode to update the policy parameters.

In a loop:
* Use the policy $\pi_\theta$ to collect an episode $\tau$.
* Use the episode to estimate the gradient $g \approx \nabla_\theta J(\theta)$:
    * $g = \sum_{t} \nabla_\theta \log \pi_\theta(a_t \mid s_t) R(\tau)$
* Update the weights of the policy using $\theta \leftarrow \theta + \text{lr} \cdot g$.
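
The loop above can be sketched without any deep learning library on a deliberately tiny problem. The one-step task, its rewards, and the hyperparameters below are assumptions for illustration; the notebook cells that follow implement the same loop with PyTorch on real environments.

```python
import math
import random

REWARDS = [1.0, 3.0]  # hypothetical one-step task: reward depends only on the action

def softmax(preferences):
    shift = max(preferences)
    exps = [math.exp(p - shift) for p in preferences]
    total = sum(exps)
    return [e / total for e in exps]

def reinforce_bandit(n_episodes=2000, lr=0.1, seed=0):
    """Run REINFORCE with a softmax policy over two preferences."""
    rng = random.Random(seed)
    theta = [0.0, 0.0]
    for _ in range(n_episodes):
        probs = softmax(theta)
        # 1. Collect an episode with the current policy (a single step here).
        action = rng.choices([0, 1], weights=probs)[0]
        episode_return = REWARDS[action]
        # 2. Estimate the gradient g = grad log pi(a) * R(tau).
        g = [((1.0 if k == action else 0.0) - probs[k]) * episode_return
             for k in range(2)]
        # 3. Gradient ascent step: theta <- theta + lr * g.
        theta = [t + lr * g_k for t, g_k in zip(theta, g)]
    return softmax(theta)

final_probs = reinforce_bandit()
print(final_probs)  # most of the probability mass should be on action 1
```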



## Example 

In [1]:
import numpy as np 
from collections import deque 

import matplotlib.pyplot as plt 
%matplotlib inline

import torch 
import torch.nn as nn 
import torch.nn.functional as F 
import torch.optim as optim 
from torch.distributions import Categorical 

import gym 
import gym_pygame
import ple.games.pixelcopter
from tqdm.auto import tqdm


couldn't import doomish
Couldn't import doom


In [2]:
device = 'cuda' if torch.cuda.is_available() else 'cpu'

In [3]:
print(device)

cuda


## CartPole-v1

In [4]:
env_id = "CartPole-v1"

env = gym.make(env_id)

eval_env = gym.make(env_id)

state_size = env.observation_space.shape[0]
action_size = env.action_space.n

In [5]:
print("____Observation Space_______")
print("The state space is: ", state_size)
print("Sample Observation", env.observation_space.sample())

____Observation Space_______
The state space is:  4
Sample Observation [-3.9728885e+00 -2.1653244e+36  3.6817831e-01  1.9501910e+38]


In [6]:
print("____Action Space_______")
print("The action space is: ", action_size)
print("Action Space Sample", env.action_space.sample())

____Action Space_______
The action space is:  2
Action Space Sample 1


In [7]:
class Policy(nn.Module):
    def __init__(self,state_size,action_size,hidden_size):
        super(Policy,self).__init__()
        self.fc1 = nn.Linear(state_size,hidden_size)
        self.fc2 = nn.Linear(hidden_size,action_size)
    def forward(self,x):
        x = F.relu(self.fc1(x))
        x = self.fc2(x)
        return F.softmax(x,dim=1)
    def act(self,state):
        """
        Given a state, take action
        """
        state = torch.from_numpy(state).float().unsqueeze(0).to(device)
        probs = self.forward(state).cpu()
        m = Categorical(probs)
        action = m.sample()
        return action.item(), m.log_prob(action)

In [8]:
debug_pol = Policy(state_size,action_size,64).to(device)
state,info = env.reset()
debug_pol.act(state)

(0, tensor([-0.5260], grad_fn=<SqueezeBackward1>))

In [9]:
def reinforce(policy,optimizer, n_training_episodes, max_t, gamma, print_every): 
    scores_deque = deque(maxlen=100)
    scores = []

    for i_episode in tqdm(range(1,n_training_episodes+1)):
        saved_log_probs = []
        rewards = []
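        # env.reset() returns (obs, info) in gym >= 0.26 and only obs in older versions
        reset_out = env.reset()
        state = reset_out[0] if isinstance(reset_out, tuple) else reset_out

        for t in range(max_t):
            action,log_prob = policy.act(state)
            saved_log_probs.append(log_prob)
            # The first three values are (obs, reward, done/terminated) in both
            # the 4-value and the 5-value gym step APIs
            state,reward,done = env.step(action)[:3]
            rewards.append(reward)
            if done: 
                break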
        scores_deque.append(sum(rewards))
        scores.append(sum(rewards))

        returns = deque(maxlen=max_t)
        n_steps = len(rewards)

        for t in range(n_steps)[::-1]:
            disc_return_t = (returns[0] if len(returns)>0 else 0)
            returns.appendleft(gamma* disc_return_t+rewards[t])
        eps = np.finfo(np.float32).eps.item()

        returns = torch.tensor(returns)
        returns = (returns - returns.mean()) / (returns.std() + eps)


        policy_loss = []
        for log_prob, disc_return in zip(saved_log_probs,returns):
            policy_loss.append(-log_prob * disc_return)
        policy_loss = torch.cat(policy_loss).sum()


        optimizer.zero_grad()
        policy_loss.backward()
        optimizer.step()

        if i_episode % print_every ==0:
            print("Episode {}\tAverage Score: {:.2f}".format(i_episode, np.mean(scores_deque)))
    
    return scores
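
The reverse loop and the normalization step inside `reinforce` can be looked at in isolation. The sketch below reproduces them in plain Python under the assumption that an episode has at least two steps; `discounted_returns` and `standardize` are names introduced here, not functions from the notebook.

```python
from collections import deque
import statistics

def discounted_returns(rewards, gamma):
    """Return G_t = r_t + gamma * G_{t+1} for every step, computed back to front."""
    returns = deque()
    running = 0.0
    for reward in reversed(rewards):
        running = reward + gamma * running
        returns.appendleft(running)
    return list(returns)

def standardize(values, eps=1e-8):
    """Rescale to zero mean and unit sample standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.stdev(values)  # sample std, matching the default of torch.Tensor.std()
    return [(v - mean) / (std + eps) for v in values]

returns = discounted_returns([1.0, 1.0, 1.0], gamma=0.5)
print(returns)               # [1.75, 1.5, 1.0]
print(standardize(returns))  # same ordering, centered on zero
```

Working backwards makes each return a single multiply-add on the next one, and standardizing keeps the scale of the loss comparable across episodes of different lengths.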

In [10]:
cartpole_hyperparams = {
    "h_size":16,
    "n_training_episodes": 1000,
    "n_evaluation_episodes": 10,
    "max_t": 1000,
    "gamma":1.0,
    "lr":1e-2,
    "env_id":env_id,
    "state_space":state_size,
    "action_space":action_size,
}

In [11]:
cartpole_policy = Policy(
    cartpole_hyperparams["state_space"],
    cartpole_hyperparams["action_space"],
    cartpole_hyperparams["h_size"],
).to(device)
cartpole_optimizer = optim.Adam(cartpole_policy.parameters(), lr = cartpole_hyperparams["lr"])

In [12]:
scores = reinforce(
    cartpole_policy,
    cartpole_optimizer,
    cartpole_hyperparams["n_training_episodes"],
    cartpole_hyperparams["max_t"],
    cartpole_hyperparams["gamma"],
    100
)

  0%|          | 0/1000 [00:00<?, ?it/s]



Episode 100	Average Score: 25.72
Episode 200	Average Score: 76.19
Episode 300	Average Score: 729.94
Episode 400	Average Score: 985.55
Episode 500	Average Score: 309.37
Episode 600	Average Score: 616.85
Episode 700	Average Score: 320.12
Episode 800	Average Score: 990.03
Episode 900	Average Score: 1000.00
Episode 1000	Average Score: 1000.00


In [12]:
disp_env = gym.make('CartPole-v1', render_mode="human")
observation, info = disp_env.reset()
done = False
score = 0
steps = 0
while not done:
    with torch.no_grad():
        action, _ = cartpole_policy.act(observation)
    observation, reward, terminated, truncated, info = disp_env.step(action)
    score += reward
    steps += 1
    # Stop when the pole falls (terminated) or the time limit is hit (truncated)
    done = terminated or truncated
    disp_env.render()
print(f"Score: {score}")
disp_env.close()



Score: 16.0


In [35]:
def evaluate_agent(env, max_steps, n_eval_episodes, policy):
    """
    Evaluate the agent for ``n_eval_episodes`` episodes and return the average reward and std of reward.
    :param env: The evaluation environment 
    :param n_eval_episodes: Number of episodes to evaluate the agent
    :param policy: The Reinforce agent
    """
    episode_rewards=[]
    for episode in tqdm(range(n_eval_episodes)):
        state,_ = env.reset()
        step = 0
        done = False
        total_rewards_ep=0

        for step in range(max_steps):
            with torch.no_grad():
                action,_ = policy.act(state)
            new_state, reward,done,truncated, info = env.step(action)
            total_rewards_ep+=reward
            if done: 
                break
            state = new_state
        episode_rewards.append(total_rewards_ep)
    mean_reward = np.mean(episode_rewards)
    std_reward = np.std(episode_rewards)

    return mean_reward, std_reward
            

In [15]:
evaluate_agent(
    eval_env, cartpole_hyperparams["max_t"], cartpole_hyperparams["n_evaluation_episodes"], cartpole_policy
)

  0%|          | 0/10 [00:00<?, ?it/s]

(1000.0, 0.0)

## PixelCopter Env

In [4]:
env_id = "Pixelcopter-PLE-v0"
env = gym.make(env_id)
eval_env = gym.make(env_id)
state_size = env.observation_space.shape[0]
action_size = env.action_space.n

In [5]:
print("_____OBSERVATION SPACE_______")
print("The State Space is: ", state_size)
print("State Space Sample: ", env.observation_space.sample())

_____OBSERVATION SPACE_______
The State Space is:  7
State Space Sample:  [ 1.4560156   0.71981394 -0.11523677 -1.836842    0.14641193  0.39520997
 -1.4625291 ]


In [6]:
print("_____ACTION SPACE_______")
print("The Action Space is: ", action_size)
print("Action Space Sample: ", env.action_space.sample())

_____ACTION SPACE_______
The Action Space is:  2
Action Space Sample:  1


In [7]:
class Policy2(nn.Module):
    def __init__(self,state_size, action_size, hidden_size):
        super(Policy2,self).__init__()
        self.fc1 = nn.Linear(state_size,hidden_size)
        self.fc2 = nn.Linear(hidden_size, hidden_size*2)
        self.fc3 = nn.Linear(hidden_size*2, action_size)
    
    def forward(self,x):
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return F.softmax(x,dim=1)

    def act(self, state):
        state = torch.from_numpy(state).float().unsqueeze(0).to(device)
        probs = self.forward(state).cpu()
        m = Categorical(probs)
        action = m.sample()
        return action.item(), m.log_prob(action)

In [36]:
pixelcopter_hyperparams = {
    "hidden_size" : 64,
    "n_training_episodes": 50000,
    "n_evaluation_episodes": 10,
    "max_t": 10000,
    "gamma":0.99,
    "lr": 1e-4,
    "env_id": env_id,
    "state_size": state_size,
    "action_size": action_size
}

In [9]:
pixelcopter_policy = Policy2(
    pixelcopter_hyperparams["state_size"],
    pixelcopter_hyperparams["action_size"],
    pixelcopter_hyperparams["hidden_size"]
).to(device)
pixelcopter_optim = optim.Adam(pixelcopter_policy.parameters(),lr = pixelcopter_hyperparams["lr"])

In [14]:
scores = reinforce(
    pixelcopter_policy,
    pixelcopter_optim,
    pixelcopter_hyperparams["n_training_episodes"],
    pixelcopter_hyperparams["max_t"],
    pixelcopter_hyperparams["gamma"],
    1000
)

  0%|          | 0/50000 [00:00<?, ?it/s]

Episode 1000	Average Score: 6.09
Episode 2000	Average Score: 5.92
Episode 3000	Average Score: 5.89
Episode 4000	Average Score: 8.83
Episode 5000	Average Score: 12.85
Episode 6000	Average Score: 13.92
Episode 7000	Average Score: 17.04
Episode 8000	Average Score: 17.83
Episode 9000	Average Score: 14.05
Episode 10000	Average Score: 18.26
Episode 11000	Average Score: 20.32
Episode 12000	Average Score: 22.01
Episode 13000	Average Score: 15.12
Episode 14000	Average Score: 23.52
Episode 15000	Average Score: 19.70
Episode 16000	Average Score: 26.63
Episode 17000	Average Score: 22.63
Episode 18000	Average Score: 13.52
Episode 19000	Average Score: 25.25
Episode 20000	Average Score: 26.05
Episode 21000	Average Score: 25.81
Episode 22000	Average Score: 31.77
Episode 23000	Average Score: 33.98
Episode 24000	Average Score: 25.72
Episode 25000	Average Score: 32.08
Episode 26000	Average Score: 35.19
Episode 27000	Average Score: 36.19
Episode 28000	Average Score: 33.71
Episode 29000	Average Score: 25.4

In [10]:
from pathlib import Path 
model_path = Path()
model_dir = model_path / "Pixelcopter.pth"
if model_dir.exists():
    print("Best model already saved")
else:
    print(f"Saving model to: {model_dir}")
    torch.save(obj = pixelcopter_policy.state_dict(),
          f = model_dir)

Best model already saved


## Loading model 

In [11]:
pixelcopter_policy.load_state_dict(torch.load(f=model_dir))

<All keys matched successfully>

In [34]:
disp_env = gym.make('Pixelcopter-PLE-v0')
episodes=10
observation=disp_env.reset()
done=False
score=0
steps=0;
while not done :
    with torch.no_grad():
        action, _ = pixelcopter_policy.act(observation)
    observation,reward,done,info=disp_env.step(action)
    score+=reward
    steps+=1;
    disp_env.render('human')
print(f"Score: {score}")
disp_env.close()



ImportError: cannot import name 'pyglet_rendering' from 'gym.utils' (C:\Users\jayan\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.11_qbz5n2kfra8p0\LocalCache\local-packages\Python311\site-packages\gym\utils\__init__.py)

In [13]:
!pip install gymgrid2

Collecting gymgrid2
  Downloading gymgrid2-1.2.6-py3-none-any.whl.metadata (1.5 kB)
Downloading gymgrid2-1.2.6-py3-none-any.whl (11 kB)
Installing collected packages: gymgrid2
Successfully installed gymgrid2-1.2.6


In [38]:
def evaluate_agent2(env, max_steps, n_eval_episodes, policy):
    """
    Evaluate the agent for ``n_eval_episodes`` episodes and return the average reward and std of reward.
    :param env: The evaluation environment 
    :param n_eval_episodes: Number of episodes to evaluate the agent
    :param policy: The Reinforce agent
    """
    episode_rewards=[]
    for episode in tqdm(range(n_eval_episodes)):
        state = env.reset()
        step = 0
        done = False
        total_rewards_ep=0

        for step in range(max_steps):
            with torch.no_grad():
                action,_ = policy.act(state)
            new_state, reward,done, info = env.step(action)
            total_rewards_ep+=reward
            if done: 
                break
            state = new_state
        episode_rewards.append(total_rewards_ep)
    mean_reward = np.mean(episode_rewards)
    std_reward = np.std(episode_rewards)

    return mean_reward, std_reward

In [39]:
evaluate_agent2(eval_env, pixelcopter_hyperparams["max_t"], pixelcopter_hyperparams["n_evaluation_episodes"],pixelcopter_policy)

  0%|          | 0/10 [00:00<?, ?it/s]



(43.0, 57.37246726435948)