# Policy Gradients

In Q-learning we estimate the future discounted reward for each state-action pair and then take the action with the best Q-value.
In policy gradient methods we directly optimize the score function instead: we learn a probability distribution over actions given the current state.

When there are many possible actions, it becomes expensive for a Q-learning method to compute the Q-value for every action.
Policy gradient methods tend to work better in these cases.
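
To make the contrast concrete, here is a minimal sketch using only the standard library. The numbers in `q_values` and `logits` are made up for illustration, and the helper names are hypothetical rather than part of the notebook. A value-based agent picks the action with the largest Q-value, while a policy-based agent samples from a softmax distribution over its outputs.

```python
import math
import random


def greedy_action(q_values):
    """Value-based selection: pick the action with the highest Q-value."""
    return max(range(len(q_values)), key=lambda a: q_values[a])


def softmax(logits):
    """Turn unnormalized preferences into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_action(probs, rng):
    """Policy-based selection: sample an action according to its probability."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


q_values = [0.2, 1.5, -0.3]  # hypothetical Q(s, a) estimates for one state
logits = [0.2, 1.5, -0.3]    # hypothetical policy outputs for the same state
rng = random.Random(0)

probs = softmax(logits)
print("greedy action:", greedy_action(q_values))
print("policy probabilities:", [round(p, 3) for p in probs])
print("sampled action:", sample_action(probs, rng))
```

The greedy rule needs one value per action before it can choose, whereas the policy only needs a distribution it can sample from.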


The Score Function:
$
\begin{align}
\text{J}(\theta) = E_{\pi}\left[\sum_{t} \gamma^{t} r_{t}\right] = E_{\pi}[r(\tau)]
\end{align}
$

Here $\tau$ is a trajectory sampled by following the policy $\pi$, and $r(\tau)$ is its total discounted reward.

$
\begin{align}
\nabla_{\theta} \text{J}(\theta) = \nabla_{\theta} \int \pi(\tau) r(\tau) d\tau
\end{align}
$

$
\begin{align}
\nabla_{\theta} \text{J}(\theta) = \int \nabla_{\theta} \pi(\tau) r(\tau) d\tau
\end{align}
$

$
\begin{align}
\nabla_{\theta} \text{J}(\theta) = \int \pi(\tau) \nabla_{\theta} \log(\pi(\tau)) r(\tau) d\tau \quad \text{(using the log derivative trick } \nabla x = x \nabla \log(x) \text{)}
\end{align}
$

$
\begin{align}
\nabla_{\theta} \text{J}(\theta) = E_{\pi}[ \nabla_{\theta} \log(\pi(\tau)) r(\tau) ]
\end{align}
$

So the policy gradient is the expected value, over trajectories sampled from the policy, of the gradient of the log-probability of a trajectory multiplied by the discounted reward of that trajectory.
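
The identity can be checked numerically on a toy problem. The sketch below assumes a one-step setting (a trajectory is a single action) with a softmax policy over three actions; the values of `theta` and `rewards` are arbitrary, and all function names are hypothetical. It computes the expectation of the gradient of the log-probability times the reward exactly by summing over actions, then compares it with a finite-difference gradient of the score function.

```python
import math


def softmax(theta):
    m = max(theta)
    exps = [math.exp(t - m) for t in theta]
    z = sum(exps)
    return [e / z for e in exps]


def expected_reward(theta, rewards):
    """J(theta) = sum_a pi(a) * r(a) for a one-step problem."""
    return sum(p * r for p, r in zip(softmax(theta), rewards))


def score_function_gradient(theta, rewards):
    """E_pi[grad log pi(a) * r(a)], computed exactly by summing over actions."""
    probs = softmax(theta)
    grad = [0.0] * len(theta)
    for a, (p_a, r_a) in enumerate(zip(probs, rewards)):
        for k in range(len(theta)):
            # For a softmax policy: d log pi(a) / d theta_k = 1[a == k] - pi(k)
            dlogp = (1.0 if a == k else 0.0) - probs[k]
            grad[k] += p_a * dlogp * r_a
    return grad


def finite_difference_gradient(theta, rewards, eps=1e-6):
    """Central-difference estimate of grad J(theta), used as a reference."""
    grad = []
    for k in range(len(theta)):
        up = list(theta)
        down = list(theta)
        up[k] += eps
        down[k] -= eps
        diff = expected_reward(up, rewards) - expected_reward(down, rewards)
        grad.append(diff / (2 * eps))
    return grad


theta = [0.1, -0.4, 0.7]   # arbitrary policy parameters (one logit per action)
rewards = [1.0, 0.0, 2.0]  # arbitrary reward for each action

g_score = score_function_gradient(theta, rewards)
g_fd = finite_difference_gradient(theta, rewards)
print("score-function gradient:  ", [round(g, 6) for g in g_score])
print("finite-difference gradient:", [round(g, 6) for g in g_fd])
```

In practice the expectation cannot be summed exactly, so it is estimated from sampled episodes, which is what the training loop in this notebook does.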

In [1]:
import numpy as np
import gym
import tensorflow as tf
import tensorflow.contrib.layers as layers

In [2]:
env = gym.make("CartPole-v0")
state_size = env.observation_space.shape[0]
action_size = env.action_space.n

In [3]:
max_episodes = 1000
learning_rate = 0.001
gamma = 0.95

In [4]:
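# Calculate future discounted rewards (also called rewards-to-go), then standardize them.
def discount_and_normalize_rewards(episode_rewards):
    # Use an explicit float array so integer rewards are not truncated.
    d_e_r = np.zeros(len(episode_rewards), dtype=np.float64)
    cumulative = 0.0
    # Walk backwards so each step accumulates the discounted rewards that follow it.
    for i in reversed(range(len(d_e_r))):
        cumulative = cumulative * gamma + episode_rewards[i]
        d_e_r[i] = cumulative

    mean = np.mean(d_e_r)
    std = np.std(d_e_r)

    # The small constant guards against division by zero (e.g. a one-step episode).
    d_e_r = (d_e_r - mean) / (std + 1e-8)
    return d_e_r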
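
As a worked example of the rewards-to-go calculation, the sketch below reimplements the same backward pass in plain Python so it can run on its own. The function names `rewards_to_go` and `normalize`, the reward list, and `gamma=0.5` are chosen only for illustration and are not the values used in this notebook.

```python
import math


def rewards_to_go(rewards, gamma):
    """Discounted sum of the current and all later rewards, for each step."""
    out = [0.0] * len(rewards)
    running = 0.0
    for i in reversed(range(len(rewards))):
        running = running * gamma + rewards[i]
        out[i] = running
    return out


def normalize(values):
    """Shift to zero mean and scale to unit (population) standard deviation."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / std for v in values]


returns = rewards_to_go([1.0, 1.0, 1.0], gamma=0.5)
print(returns)  # [1.75, 1.5, 1.0]

normalized = normalize(returns)
print([round(v, 3) for v in normalized])
```

The last step only sees its own reward (1.0), the middle step sees 1 + 0.5 * 1 = 1.5, and the first step sees 1 + 0.5 * 1.5 = 1.75. After normalization, early steps get positive weights and late steps get negative ones.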

### Building the Network

In [5]:


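# States visited during an episode, one row per step.
inputs = tf.placeholder(tf.float64, [None, state_size])
# One-hot encoded actions taken at each step.
actions = tf.placeholder(tf.float64, [None, action_size])
# Discounted and normalized rewards-to-go for each step.
d_e_r = tf.placeholder(tf.float64, [None, ])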

fc1 = layers.fully_connected(inputs = inputs,
                      num_outputs = 10,
                      activation_fn = tf.nn.relu,
                      weights_initializer=layers.xavier_initializer())

fc2 = layers.fully_connected(inputs = fc1,
                      num_outputs = 10,
                      activation_fn = tf.nn.relu,
                      weights_initializer=layers.xavier_initializer())

fc3 = layers.fully_connected(inputs = fc2,
                      num_outputs = action_size,
                      activation_fn = None,
                      weights_initializer=layers.xavier_initializer())

# The action distribution is the probability of each action. Softmax makes the probabilities sum to 1.
action_distribution = tf.nn.softmax(fc3)

# With one-hot labels, the softmax cross-entropy equals the negative log-probability of the action taken.
# TensorFlow optimizers minimize, so maximizing the score function is done by minimizing its negative.
neg_log_prob = tf.nn.softmax_cross_entropy_with_logits_v2(logits=fc3, labels=actions)

# Weight each negative log-probability by the future discounted reward.
loss = tf.reduce_mean(neg_log_prob * d_e_r)

train_opt = tf.train.AdamOptimizer(learning_rate).minimize(loss)
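
The loss relies on the fact that softmax cross-entropy against a one-hot label is the negative log-probability of the labelled action. The sketch below checks this for a single state in plain Python. The logits, the chosen action, and the helper names are made up for illustration; it mirrors the formula rather than calling TensorFlow.

```python
import math


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, one_hot):
    """-sum_k label_k * log(softmax(logits)_k)"""
    probs = softmax(logits)
    return -sum(y * math.log(p) for y, p in zip(one_hot, probs))


logits = [2.0, -1.0]  # hypothetical outputs of the last layer for one state
one_hot = [0.0, 1.0]  # the agent took action 1

neg_log_prob = cross_entropy(logits, one_hot)
direct = -math.log(softmax(logits)[1])
print(neg_log_prob, direct)

# Weighting by the reward-to-go sets the direction of the update:
# a positive weight raises the probability of the action, a negative one lowers it.
for weight in (1.2, -0.8):
    print("weight", weight, "loss term", neg_log_prob * weight)
```

Because the rewards-to-go are normalized, roughly half of the steps in an episode receive a negative weight, so their actions are made less likely.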

### Train Agent

In [6]:
sess = tf.Session()
sess.run(tf.global_variables_initializer())

for episode in range(max_episodes):
    episode_states, episode_actions, episode_rewards = [],[],[]
    done = False
    state = env.reset()

    while not done:
        # Run the session to get an action.
        action_probability_distribution = sess.run(action_distribution, feed_dict = {
            inputs : state.reshape([-1, state_size])
        })
        # From the action probabilities we need to sample an action.
        action = np.random.choice(range(action_probability_distribution.shape[1]), p=action_probability_distribution.ravel())

        new_state, reward, done, info = env.step(action)
        
        # One-hot encode the action to pass back to the network during optimization.
        one_hot_action = np.zeros(action_size)
        one_hot_action[action] = 1

        episode_states.append(state)
        episode_actions.append(one_hot_action)
        episode_rewards.append(reward)

        state = new_state
        if done:
            print(f"Episode: {episode} Reward: {np.sum(episode_rewards)}")
            
            d_e_r_ = discount_and_normalize_rewards(episode_rewards)
            
            # Optimize the score.
            loss_, _ = sess.run([loss, train_opt], feed_dict = {
                inputs : np.array(episode_states),
                actions : np.array(episode_actions),
                d_e_r : d_e_r_
            })



Episode: 0 Reward: 18.0
Episode: 1 Reward: 24.0
Episode: 2 Reward: 10.0
Episode: 3 Reward: 21.0
Episode: 4 Reward: 17.0
Episode: 5 Reward: 56.0
Episode: 6 Reward: 27.0
Episode: 7 Reward: 23.0
Episode: 8 Reward: 24.0
Episode: 9 Reward: 10.0
Episode: 10 Reward: 14.0
Episode: 11 Reward: 19.0
Episode: 12 Reward: 16.0
Episode: 13 Reward: 15.0
Episode: 14 Reward: 16.0
Episode: 15 Reward: 16.0
Episode: 16 Reward: 41.0
Episode: 17 Reward: 12.0
Episode: 18 Reward: 53.0
Episode: 19 Reward: 22.0
Episode: 20 Reward: 10.0
Episode: 21 Reward: 15.0
Episode: 22 Reward: 40.0
Episode: 23 Reward: 15.0
Episode: 24 Reward: 39.0
Episode: 25 Reward: 19.0
Episode: 26 Reward: 27.0
Episode: 27 Reward: 28.0
Episode: 28 Reward: 17.0
Episode: 29 Reward: 9.0
Episode: 30 Reward: 17.0
Episode: 31 Reward: 18.0
Episode: 32 Reward: 16.0
Episode: 33 Reward: 17.0
Episode: 34 Reward: 19.0
Episode: 35 Reward: 20.0
Episode: 36 Reward: 17.0
Episode: 37 Reward: 22.0
Episode: 38 Reward: 56.0
Episode: 39 Reward: 12.0
...

Episode: 325 Reward: 52.0
Episode: 326 Reward: 17.0
Episode: 327 Reward: 87.0
Episode: 328 Reward: 28.0
Episode: 329 Reward: 32.0
Episode: 330 Reward: 117.0
Episode: 331 Reward: 18.0
Episode: 332 Reward: 53.0
Episode: 333 Reward: 24.0
Episode: 334 Reward: 21.0
Episode: 335 Reward: 68.0
Episode: 336 Reward: 14.0
Episode: 337 Reward: 41.0
Episode: 338 Reward: 69.0
Episode: 339 Reward: 18.0
Episode: 340 Reward: 23.0
Episode: 341 Reward: 41.0
Episode: 342 Reward: 110.0
Episode: 343 Reward: 18.0
Episode: 344 Reward: 47.0
Episode: 345 Reward: 28.0
Episode: 346 Reward: 16.0
Episode: 347 Reward: 35.0
Episode: 348 Reward: 33.0
Episode: 349 Reward: 67.0
Episode: 350 Reward: 13.0
Episode: 351 Reward: 16.0
Episode: 352 Reward: 64.0
Episode: 353 Reward: 14.0
Episode: 354 Reward: 51.0
Episode: 355 Reward: 34.0
Episode: 356 Reward: 61.0
Episode: 357 Reward: 18.0
Episode: 358 Reward: 65.0
Episode: 359 Reward: 56.0
Episode: 360 Reward: 55.0
Episode: 361 Reward: 83.0
Episode: 362 Reward: 15.0
...

Episode: 636 Reward: 101.0
Episode: 637 Reward: 200.0
Episode: 638 Reward: 200.0
Episode: 639 Reward: 200.0
Episode: 640 Reward: 137.0
Episode: 641 Reward: 200.0
Episode: 642 Reward: 136.0
Episode: 643 Reward: 200.0
Episode: 644 Reward: 200.0
Episode: 645 Reward: 200.0
Episode: 646 Reward: 101.0
Episode: 647 Reward: 200.0
Episode: 648 Reward: 200.0
Episode: 649 Reward: 200.0
Episode: 650 Reward: 110.0
Episode: 651 Reward: 200.0
Episode: 652 Reward: 11.0
Episode: 653 Reward: 197.0
Episode: 654 Reward: 200.0
Episode: 655 Reward: 127.0
Episode: 656 Reward: 33.0
Episode: 657 Reward: 200.0
Episode: 658 Reward: 200.0
Episode: 659 Reward: 91.0
Episode: 660 Reward: 200.0
Episode: 661 Reward: 200.0
Episode: 662 Reward: 200.0
Episode: 663 Reward: 140.0
Episode: 664 Reward: 200.0
Episode: 665 Reward: 200.0
Episode: 666 Reward: 86.0
Episode: 667 Reward: 164.0
Episode: 668 Reward: 188.0
Episode: 669 Reward: 149.0
Episode: 670 Reward: 200.0
Episode: 671 Reward: 200.0
Episode: 672 Reward: 197.0
...

Episode: 941 Reward: 157.0
Episode: 942 Reward: 200.0
Episode: 943 Reward: 144.0
Episode: 944 Reward: 200.0
Episode: 945 Reward: 200.0
Episode: 946 Reward: 200.0
Episode: 947 Reward: 200.0
Episode: 948 Reward: 200.0
Episode: 949 Reward: 200.0
Episode: 950 Reward: 200.0
Episode: 951 Reward: 200.0
Episode: 952 Reward: 200.0
Episode: 953 Reward: 200.0
Episode: 954 Reward: 200.0
Episode: 955 Reward: 200.0
Episode: 956 Reward: 200.0
Episode: 957 Reward: 200.0
Episode: 958 Reward: 200.0
Episode: 959 Reward: 200.0
Episode: 960 Reward: 200.0
Episode: 961 Reward: 200.0
Episode: 962 Reward: 200.0
Episode: 963 Reward: 200.0
Episode: 964 Reward: 200.0
Episode: 965 Reward: 172.0
Episode: 966 Reward: 200.0
Episode: 967 Reward: 145.0
Episode: 968 Reward: 122.0
Episode: 969 Reward: 200.0
Episode: 970 Reward: 200.0
Episode: 971 Reward: 200.0
Episode: 972 Reward: 200.0
Episode: 973 Reward: 197.0
Episode: 974 Reward: 200.0
Episode: 975 Reward: 200.0
Episode: 976 Reward: 165.0
Episode: 977 Reward: 200.0
...
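
Per-episode rewards are noisy, so the trend is easier to read as an average over a window of episodes. The sketch below is one way to do that, using the rewards logged above for episodes 0 to 4 and 636 to 640. The helper `window_mean` is hypothetical and not part of the notebook.

```python
def window_mean(rewards):
    """Average reward over a window of consecutive episodes."""
    return sum(rewards) / len(rewards)


early = [18.0, 24.0, 10.0, 21.0, 17.0]      # episodes 0-4 from the log
late = [101.0, 200.0, 200.0, 200.0, 137.0]  # episodes 636-640 from the log

print("episodes 0-4:    ", window_mean(early))  # 18.0
print("episodes 636-640:", window_mean(late))   # 167.6
```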

### Evaluate the Agent's Performance

In [None]:
episodes = 100

total_reward = 0
for _ in range(episodes):
    state = env.reset()
    done = False
    while not done:
        action_probability_distribution = sess.run(action_distribution, feed_dict = {
            inputs : state.reshape([-1, state_size])
        })
        action = np.random.choice(range(action_probability_distribution.shape[1]), p=action_probability_distribution.ravel())
        state, reward, done, _ = env.step(action)
        total_reward += reward

print(f"Results after {episodes} episodes:")
print(f"Average Reward per episode: {total_reward / episodes}")

In [None]:
sess.close()