# Chapter 14: Policy-Based Deep Reinforcement Learning

AlphaGo uses a deep reinforcement learning method called Actor-Critic (AC) to create a value head and a policy head. It then combines the two heads with Monte Carlo tree search (MCTS) to generate powerful plays. To understand AC, we first need to understand policy-based deep reinforcement learning, which this chapter introduces.

In the last two chapters, you have learned to use value-based reinforcement learning to solve the Frozen Lake and Cart Pole games. To train the model, you used trial and error to learn the value of each action in a certain state, $V(a|s)$. Once the reinforcement model is trained, you play the game by choosing the action with the highest value function $V(a|s)$ in the state $s$. 

In this chapter, you'll learn policy-based reinforcement learning. Instead of making decisions based on the value function, $V(a|s)$, the agent makes decisions based on a policy $\pi(a|s)$ that explicitly tells it which action to take in each state of the game environment. While a deterministic policy tells the agent exactly which action to take, a stochastic policy gives a probability distribution over all possible actions.

In particular, you'll learn a method called policy gradients. To learn the best policy, the agent plays the game many times. When playing the game, the agent takes action based on the model's predictions. The agent then observes the rewards from the actions taken and adjusts the model weights accordingly. If the prediction is smaller than the desired outcome, the agent adjusts the model weights so that the prediction will increase. Similarly, if the prediction is greater than the desired outcome, the agent adjusts the model weights so that the prediction will decrease. Further, the magnitude of the adjustment is directly proportional to the rewards: the greater the reward, the greater the adjustment. You'll use the policy gradient method to play the Cart Pole game. 

***
$\mathbf{\text{Create a subfolder for files in Chapter 14}}$<br>
***
We'll put all files in Chapter 14 in a subfolder /files/ch14. The code in the cell below will create the subfolder.

***

In [1]:
import os

os.makedirs("files/ch14", exist_ok=True)

# 1. Policy-Based Reinforcement Learning
This section introduces you to policy-based reinforcement learning.

## 1.1. What Is A Policy?
A policy, $\pi(a|s)$, can be any algorithm that tells the agent which action to take in a given state. Let's use the Cart Pole game we discussed in Chapter 13 as an example. We'll create a deep neural network that takes the current state as the input. We'll put a single neuron in the output layer so the output from the deep neural network is a single number. We use sigmoid activation in the output layer so the output is a number between 0 and 1.

Our policy could be: if the output from the neural network is greater than or equal to 0.5, we'll move the cart left (i.e., take action 0); otherwise, we'll move the cart right (i.e., take action 1). This is called a deterministic policy in the sense that the action is determined once we know the output from the DNN. A policy can be stochastic as well: since the output from the neural network, let's call it p, is a number between 0 and 1, we can have a stochastic policy as follows:
* Move the cart to the left (i.e., take action 0) with probability p;
* Move the cart to the right (i.e., take action 1) with probability 1-p.

The advantage of a stochastic policy is that it naturally allows for both exploitation and exploration. It allows for exploitation in the sense that the probability of action=0 is greater when the value of p increases. It also allows for exploration in the sense that the recommended action is not taken with 100% certainty.
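
The two kinds of policy described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the chapter's training code: the function names `deterministic_action` and `stochastic_action` are hypothetical, and `p` stands in for the output of the neural network.

```python
import numpy as np

def deterministic_action(p):
    # Always take action 0 (left) when p >= 0.5, otherwise action 1 (right)
    return 0 if p >= 0.5 else 1

def stochastic_action(p, rng):
    # Take action 0 with probability p and action 1 with probability 1 - p
    return 0 if rng.uniform(0, 1) < p else 1

rng = np.random.default_rng(0)
p = 0.8  # assumed network output, for illustration only
actions = [stochastic_action(p, rng) for _ in range(10_000)]
share_left = actions.count(0) / len(actions)
print(deterministic_action(p), share_left)
```

With p = 0.8, roughly 80% of the sampled actions are 0 (exploitation), while the remaining draws try action 1 (exploration).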

## 1.2. What is the Policy Gradient Method?
The policy gradient method is an algorithm that adjusts the model parameters to achieve the best outcome for a reinforcement learning agent.

In RL, the agent is trying to learn the best strategy in order to maximize his or her expected payoff over time. A strategy (also called a policy) maps a certain state to a certain action. A strategy is basically a decision rule that tells the agent what to do in a certain situation.

Let's say that the policy we are considering is $\pi _{\theta }(a_t|s_t)$, where $\theta$ denotes the model parameters (e.g., the weights in a neural network). That is, the agent chooses an action $a_t$ in time period $t$ based on the current state $s_t$, as a function of the model parameters $\theta$. Let's say that the agent needs to choose a sequence of actions $(a_{0},a_{1},\ldots ,a_{T-1})$ to maximize her expected cumulative reward. In period $t$, after observing the state $s_t$ and taking action $a_t$, the agent receives a reward of $r(s_t,a_t)$. If the discount factor is $\gamma$, the cumulative discounted reward to the agent is 
$$
R(s_{0},a_{0},\ldots ,a_{T-1},s_T)=\sum\limits_{t=0}^{T-1}\gamma ^{t}r(s_{t},a_{t})+\gamma ^{T}r(s_{T})
$$
where $s_T$ is the terminal state. 
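
As a quick worked example of the formula above, the sketch below computes the discounted cumulative reward for a short episode. The function name `cumulative_reward` and the sample numbers are assumptions for illustration; in Cart Pole each step earns a reward of 1, and here we assume the terminal state pays nothing.

```python
def cumulative_reward(step_rewards, terminal_reward, gamma):
    # Sum of gamma^t * r(s_t, a_t) for t = 0, ..., T-1
    total = 0.0
    for t, r in enumerate(step_rewards):
        total += gamma**t * r
    # Add the discounted terminal reward gamma^T * r(s_T)
    T = len(step_rewards)
    return total + gamma**T * terminal_reward

# Three steps with reward 1 each, no terminal reward, gamma = 0.95
R = cumulative_reward([1.0, 1.0, 1.0], terminal_reward=0.0, gamma=0.95)
print(R)
```

Here $R = 1 + 0.95 + 0.95^2 = 2.8525$.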

The objective of the agent is to find the optimal parameter values for the model, $\theta$, to maximize the expected cumulative payoff: 

$$
\max_{\theta }E[R(s_{0},a_{0},\ldots ,a_{T-1},s_T)|\pi _{\theta }]
$$


The above maximization problem can be solved by using a gradient ascent algorithm. That is, we can update the model parameters $\theta$ by using the following formula until the parameters converge: 
$$
\theta \leftarrow \theta +Learning\ Rate\ast \nabla _{\theta }E[R|\pi _{\theta }]
$$

where $Learning\ Rate$ is the learning rate hyperparameter that controls how fast we update the model parameters. This boils down to training the model to predict the probability of the correct action based on the state. The resulting update rule is 

$$
\theta \leftarrow \theta +Learning\ Rate \times E[\sum\limits_{t=0}^{T-1} \nabla _{\theta }\log \pi _{\theta }(a_{t}|s_{t})R|\pi _{\theta }]
$$
This is the formula we'll use in this chapter with the policy-gradient method. If you are interested, you can read the proof provided by OpenAI here 

https://spinningup.openai.com/en/latest/spinningup/extra_pg_proof1.html. 
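
To make the update rule concrete, here is a minimal NumPy sketch for a one-layer sigmoid policy, where $p=\sigma(\theta \cdot s)$ is the probability of action 0. It is an illustration under simplifying assumptions, not the network used later in this chapter: the helper names are hypothetical, the expectation is approximated by a single episode, and the gradients are written out by hand instead of using TensorFlow.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_prob(theta, s, a):
    # log pi_theta(a|s), where p is the probability of action 0
    p = sigmoid(theta @ s)
    return np.log(p) if a == 0 else np.log(1.0 - p)

def grad_log_prob(theta, s, a):
    # Gradient of log pi_theta(a|s) with respect to theta:
    # (1 - p) * s for action 0, and -p * s for action 1
    p = sigmoid(theta @ s)
    return (1.0 - p) * s if a == 0 else -p * s

def policy_gradient_step(theta, states, actions, R, learning_rate):
    # Single-episode estimate of E[sum_t grad log pi(a_t|s_t) * R]
    grad = sum(grad_log_prob(theta, s, a) for s, a in zip(states, actions)) * R
    return theta + learning_rate * grad

theta = np.zeros(4)
s = np.array([0.5, -1.0, 0.2, 0.7])   # made-up state
before = log_prob(theta, s, 0)
theta_new = policy_gradient_step(theta, [s], [0], R=1.0, learning_rate=0.1)
after = log_prob(theta_new, s, 0)
print(before, after)
```

Because the reward is positive, the step raises the log probability of the action that was taken; a negative reward would lower it.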

# 2. Policy Gradients in Cart Pole
We'll use policy gradients to solve the Cart Pole game. 


## 2.1. Create A Policy Network
First, we create a neural network to create a policy as follows:

In [2]:
from tensorflow import keras
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import Adam

model = Sequential()
model.add(Dense(5, activation="relu", input_dim=4))
model.add(Dense(1, activation="sigmoid"))

Next, we'll choose the optimizer and the loss function:

In [3]:
optimizer = keras.optimizers.Adam(learning_rate=0.01)
loss_function = keras.losses.binary_crossentropy

## 2.2. Calculate Gradients and Discounted Rewards
The policy gradient approach adjusts the parameters by the product of the rewards and the gradients. We'll play a game to obtain those values:

In [4]:
import gym
import numpy as np
import tensorflow as tf

env = gym.make('CartPole-v0')
def training():
    rewards = []
    grads = []
    obs = env.reset()
    for _ in range(200):
        with tf.GradientTape() as tape:
            # Network output: probability of moving left (action 0)
            aprob = model(np.array([obs]))[0,0]
            # Sample an action from the stochastic policy
            action = 0 if np.random.uniform(0,1)<aprob else 1            
            # The label is 1 for action 0 and 0 for action 1
            y=np.array([1-action]).reshape((1,1))
            loss=tf.reduce_mean(loss_function(y,model(np.array([obs]))))
        # Gradients of the loss with respect to each trainable variable
        grad=tape.gradient(loss,model.trainable_variables)
        obs, reward, done, _ =env.step(action)       
        rewards.append(reward)
        grads.append(np.array(grad))
        if done:
            break
    return rewards, grads

The training() function plays a full game and calculates the gradients and rewards.
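
One detail in training() is worth spelling out: the label is y = 1 - action because the network outputs the probability of action 0. With that label, the binary cross-entropy loss equals $-\log \pi(a|s)$ for the action actually taken, so descending the reward-weighted loss is the same as ascending the policy-gradient objective. The NumPy sketch below checks this identity by hand; `bce` is a hypothetical helper written for illustration, not the Keras function, and the value of `p` is made up.

```python
import numpy as np

def bce(y, p):
    # Binary cross-entropy for a single label y and predicted probability p
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

p = 0.7  # assumed network output: probability of action 0
for action in (0, 1):
    y = 1 - action
    prob_of_action = p if action == 0 else 1 - p
    # The loss equals the negative log probability of the chosen action
    assert np.isclose(bce(y, p), -np.log(prob_of_action))
    print(action, bce(y, p))
```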

In reinforcement learning, actions affect not only current period rewards, but also future rewards. We therefore use discounted rewards to assign credits properly as follows:

In [5]:
def discount_rs(r):
    discounted_rs = np.zeros(len(r))
    running_add = 0
    for i in reversed(range(0, len(r))):
        running_add = gamma*running_add + r[i]
        discounted_rs[i] = running_add
    discounted_rs -= np.mean(discounted_rs)
    discounted_rs /= np.std(discounted_rs)    
    
    return discounted_rs.tolist()
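
To see what this function produces, here is a small worked example on a three-step episode with a reward of 1 per step. The sketch uses a hypothetical variant, `discounted_returns`, that takes `gamma` as an argument and returns the values before normalization, so the numbers are easy to check by hand.

```python
import numpy as np

def discounted_returns(rewards, gamma):
    # Work backwards so each step accumulates its own and all later rewards
    returns = np.zeros(len(rewards))
    running_add = 0.0
    for i in reversed(range(len(rewards))):
        running_add = gamma * running_add + rewards[i]
        returns[i] = running_add
    return returns

raw = discounted_returns([1.0, 1.0, 1.0], gamma=0.95)
# Same normalization as in discount_rs(): zero mean, unit standard deviation
normalized = (raw - raw.mean()) / raw.std()
print(raw, normalized)
```

The raw values are 2.8525, 1.95, and 1.0: earlier actions receive more credit because more of the episode followed them. After normalization, the first step has a positive weight and the last step a negative one.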

## 2.3. Update Parameters
Instead of updating the model parameters after every episode, we update them after a certain number of episodes to make training more stable. Here we update the parameters every ten games, as follows:

In [6]:
batch_size=10
def create_batch(batch_size):
    gs=[]
    rs=[]
    for i in range(batch_size):
        rewards, grads = training()
        returns = discount_rs(rewards)
        gs += grads
        rs += returns
    return gs,rs

We'll use 200 batches of data to update the parameters and train the model as follows:

In [7]:
n_batches = 200
gamma = 0.95  
params=model.trainable_variables
num_layers=len(params)

for _ in range(n_batches):
    gs,rs=create_batch(batch_size)
    gradr=np.dot(np.array(gs).reshape(-1,num_layers).T,rs)/len(rs)
    optimizer.apply_gradients(zip(gradr,params))

model.save("files/ch14/cartpole_deep_pg.h5")  

Note that here we adjust the parameters by the product of the gradients and the discounted rewards. This corresponds to the solution to the reward maximization problem in Section 1.2.
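
To make the update concrete, here is a small sketch of the reward-weighted average on toy numbers. It assumes each step's gradients are stored as a list with one array per trainable variable; the helper name `reward_weighted_gradients` and the data are made up for illustration, and the explicit loop is a rewrite of the idea rather than the exact `np.dot` call used above.

```python
import numpy as np

def reward_weighted_gradients(step_grads, returns):
    # For each trainable variable, average (gradient * discounted reward)
    # over all recorded steps
    n_vars = len(step_grads[0])
    weighted = []
    for v in range(n_vars):
        total = sum(g[v] * r for g, r in zip(step_grads, returns))
        weighted.append(total / len(returns))
    return weighted

# Two steps, two "variables" (a 2x2 weight matrix and a bias of length 2)
step_grads = [
    [np.ones((2, 2)), np.ones(2)],
    [2 * np.ones((2, 2)), 2 * np.ones(2)],
]
returns = [1.0, -1.0]
avg = reward_weighted_gradients(step_grads, returns)
print(avg[0], avg[1])
```

Each entry works out to (1 × 1 + 2 × (-1)) / 2 = -0.5: the step with a negative normalized reward pushes the parameters in the opposite direction from the step with a positive one.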

# 3. Play A Game with the Trained Model

We'll use the trained model to play a game.

## 3.1. Play A Complete Cart Pole Game
You can play a complete game by using the trained model:

In [8]:
#model=keras.models.load_model("files/ch14/cartpole_deep_pg.h5")

obs=env.reset()
score=0
for step in range(200):
    score += 1
    aprob=model(np.array(obs).reshape(-1,4))
    action = 0 if np.random.uniform(0,1)<aprob else 1
    obs,reward,done,info=env.step(action)
    if done:
        print(f"Your score is {score}!")
        break

Your score is 200!


## 3.2. Average Performance of the Trained Model

We test ten games and see how many consecutive steps, on average, the cart pole can stay upright. We first define a test_a_game() function:

In [9]:
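def test_a_game():
    obs=env.reset()
    score=0
    for step in range(200):
        score += 1
        # Probability of moving left (action 0) from the trained model
        aprob=model(np.array(obs).reshape(-1,4))
        action = 0 if np.random.uniform(0,1)<aprob else 1
        obs,reward,done,info=env.step(action)
        if done:
            return score
    # Return the score even if the loop ends before done is set
    return score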

We then test ten games and print out the score in each game as well as the average score:

In [10]:
results=[]
for i in range(10):
    score=test_a_game()
    print(f"in game {i+1}, the score is {score}")
    results.append(score)
avg=sum(results)/len(results)   
print(f"the average score is {avg}")

in game 1, the score is 200
in game 2, the score is 165
in game 3, the score is 200
in game 4, the score is 200
in game 5, the score is 200
in game 6, the score is 156
in game 7, the score is 200
in game 8, the score is 140
in game 9, the score is 200
in game 10, the score is 200
the average score is 186.1


So the trained policy network kept the pole upright for the full 200 time steps in seven of the ten games, with an average score of 186.1.