<a href="https://colab.research.google.com/github/pankajr141/experiments/blob/master/Reasoning/Reinforcement/Reasoning%20%7BRL%7D%20-%202%3A%20Policy_Gradient_Approach.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In the policy gradient approach we learn which action is best in a given state, whereas in the value-based approach we learn how good a state is. In other words, in value-based methods each state has a value, whereas in policy gradient methods the states themselves do not carry a value; any such value has to be computed.

Source: https://medium.com/@awjuliani/super-simple-reinforcement-learning-tutorial-part-1-fd544fab149

## Example 1 - Multi-Armed Bandit Problem

Essentially, there are n-many slot machines, each with a different fixed payout probability. The goal is to discover the machine with the best payout, and maximize the returned reward by always choosing it. 

In [0]:
import tensorflow as tf
import numpy as np

In [0]:
#List out our bandits. Currently bandit 4 (index#3) is set to most often provide a positive reward.
bandits = [0.2, 0, -0.2, -5]

num_bandits = len(bandits)

def pullBandit(bandit):

    #Get a random number.
    result = np.random.randn(1)
    # Bandit index 3 has the lowest threshold (-5), so a standard-normal draw is most likely to exceed it.
    if result > bandit:
        #return a positive reward.
        return 1
    else:
        #return a negative reward.
        return -1
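
Because `pullBandit` compares a standard-normal draw against the bandit's threshold, the chance of a +1 reward can be worked out in closed form. The sketch below is not part of the original notebook: `win_probability` is a hypothetical helper that evaluates the standard normal tail with `math.erf`, and the `bandits` list is copied from the cell above.

```python
import math

def win_probability(threshold):
    """P(z > threshold) for z ~ N(0, 1), i.e. the chance of a +1 reward."""
    return 0.5 * (1.0 - math.erf(threshold / math.sqrt(2.0)))

# Thresholds copied from the notebook cell above.
bandits = [0.2, 0, -0.2, -5]

win_probs = [win_probability(b) for b in bandits]
# Reward is +1 with probability p and -1 otherwise, so E[reward] = 2p - 1.
expected_rewards = [2.0 * p - 1.0 for p in win_probs]

for i, (p, r) in enumerate(zip(win_probs, expected_rewards)):
    print("bandit {}: P(+1) = {:.4f}, expected reward = {:+.4f}".format(i + 1, p, r))
```

A lower threshold means a higher win probability, which is why bandit 4 should come out on top.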

The code below establishes our simple neural agent. It consists of a set of values, one for each bandit. Each value is an estimate of the return from choosing that bandit. We use a policy gradient method to update the agent by moving the value for the selected action toward the received reward.



In [0]:
tf.reset_default_graph()

#These two lines establish the feed-forward part of the network, which does the actual choosing.
weights = tf.Variable(tf.ones([num_bandits]))
chosen_action = tf.argmax(weights, 0)

#The next six lines establish the training procedure. We feed the reward and chosen action into the network
#to compute the loss, and use it to update the network.
reward_holder = tf.placeholder(shape=[1],dtype=tf.float32)
action_holder = tf.placeholder(shape=[1],dtype=tf.int32)
responsible_weight = tf.slice(weights, action_holder,[1])

loss = -(tf.log(responsible_weight) * reward_holder)
optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.001)
update = optimizer.minimize(loss)
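
To make the update concrete, here is a minimal plain-Python sketch of what one `optimizer.minimize(loss)` step does to the weights, assuming plain gradient descent on `loss = -log(w[a]) * r`. The function name `policy_gradient_step` is hypothetical and is not part of the notebook or of TensorFlow.

```python
def policy_gradient_step(weights, action, reward, learning_rate=0.001):
    """One gradient-descent step on loss = -log(weights[action]) * reward.

    d(loss)/d(weights[action]) = -reward / weights[action], so stepping
    against the gradient adds learning_rate * reward / weights[action].
    Only the weight of the chosen action changes.
    """
    updated = list(weights)
    updated[action] += learning_rate * reward / weights[action]
    return updated

weights = [1.0, 1.0, 1.0, 1.0]  # same initial value as tf.ones
weights = policy_gradient_step(weights, action=3, reward=1)   # rewarded pull
weights = policy_gradient_step(weights, action=0, reward=-1)  # punished pull
print(weights)
```

A positive reward nudges the chosen weight up and a negative reward nudges it down, so `argmax(weights)` drifts toward the bandit that pays out most often.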

In [0]:
total_episodes = 1000   #Set total number of episodes to train agent on.
total_reward = np.zeros(num_bandits) #Set scoreboard for bandits to 0.
e = 0.1  #Set the chance of taking a random action.

init = tf.initialize_all_variables()

# Launch the tensorflow graph
with tf.Session() as sess:
    sess.run(init)
    i = 0

    while i < total_episodes:
        
        #Choose either a random action or one from our network.
        if np.random.rand(1) < e:
            action = np.random.randint(num_bandits)
        else:
            action = sess.run(chosen_action)
        
        reward = pullBandit(bandits[action]) #Get our reward from picking one of the bandits.
        
        #Update the network.
        _, resp, ww = sess.run([update, responsible_weight, weights], feed_dict={reward_holder:[reward], action_holder:[action]})
        
        #Update our running tally of scores.
        total_reward[action] += reward
        if i % 50 == 0:
            print("Running reward for the " + str(num_bandits) + " bandits: " + str(total_reward))
        i+=1

print("The agent thinks bandit " + str(np.argmax(ww)+1) + " is the most promising....")
if np.argmax(ww) == np.argmax(-np.array(bandits)):
    print("...and it was right!")
else:
    print("...and it was wrong!")

Instructions for updating:
Use `tf.global_variables_initializer` instead.
Running reward for the 4 bandits: [1. 0. 0. 0.]
Running reward for the 4 bandits: [ 0.  1.  0. 28.]
Running reward for the 4 bandits: [-2.  1. -1. 75.]
Running reward for the 4 bandits: [ -2.   0.  -1. 124.]
Running reward for the 4 bandits: [ -2.   1.  -2. 170.]
Running reward for the 4 bandits: [ -3.   1.  -2. 219.]
Running reward for the 4 bandits: [ -5.  -2.  -1. 263.]
Running reward for the 4 bandits: [ -5.  -2.   0. 308.]
Running reward for the 4 bandits: [ -5.  -2.   2. 356.]
Running reward for the 4 bandits: [ -4.   0.   2. 399.]
Running reward for the 4 bandits: [ -3.  -4.   2. 442.]
Running reward for the 4 bandits: [ -6.  -4.   2. 487.]
Running reward for the 4 bandits: [ -5.  -4.   1. 533.]
Running reward for the 4 bandits: [ -5.  -5.   0. 579.]
Running reward for the 4 bandits: [ -6.  -5.   1. 627.]
Running reward for the 4 bandits: [ -6.  -4.   0. 673.]
Running reward for the 4 bandits: [ -3.  -4.  
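
The training loop above picks a random bandit with probability `e` and otherwise the bandit with the largest weight. That selection rule (usually called epsilon-greedy) can be sketched without TensorFlow as follows. `choose_action` and the example `weights` are hypothetical and only illustrate the rule.

```python
import random

def choose_action(weights, epsilon, rng=random):
    """Return a random index with probability epsilon, else the argmax index."""
    if rng.random() < epsilon:
        return rng.randrange(len(weights))
    return max(range(len(weights)), key=lambda i: weights[i])

rng = random.Random(0)            # seeded generator so the run is repeatable
weights = [1.0, 0.98, 1.01, 1.2]  # illustrative values, not taken from the run above
counts = [0] * len(weights)
for _ in range(1000):
    counts[choose_action(weights, epsilon=0.1, rng=rng)] += 1
print(counts)  # most picks go to index 3, the rest are exploration
```

Exploration matters here because all weights start equal: without the random picks the agent would only ever pull the first bandit that `argmax` returns.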

## Example 2 - CartPole

The example above is simple. Let's look at a more advanced example that interacts with an environment and produces results we can observe. The objective is to keep the pole on the cart upright for as long as possible.

By default we cannot render Gym in Jupyter, since it runs in a sandbox and Jupyter does not currently support WebGL, so we will render to a virtual display and play the recorded video back in the notebook.

In [0]:
!apt-get install -y xvfb python-opengl > /dev/null 2>&1
!pip install pyvirtualdisplay ffmpeg > /dev/null 2>&1
!apt-get install -y x11-utils > /dev/null 2>&1
!pip install piglet pyglet
!pip install gym==0.14.0

In [2]:
import glob
import io
import base64
from IPython.display import HTML

import gym
from gym.wrappers import Monitor
import matplotlib.pyplot as plt
from IPython import display as ipythondisplay
from pyvirtualdisplay import Display

display = Display(visible=0, size=(400, 300))
display.start()

<Display cmd_param=['Xvfb', '-br', '-nolisten', 'tcp', '-screen', '0', '400x300x24', ':1097'] cmd=['Xvfb', '-br', '-nolisten', 'tcp', '-screen', '0', '400x300x24', ':1097'] oserror=None return_code=None stdout="None" stderr="None" timeout_happened=False>

In [0]:
''' Some utility functions '''

def show_video():  
  mp4list = glob.glob('video/*.mp4')
  if len(mp4list) > 0:
    mp4 = mp4list[0]
    video = io.open(mp4, 'rb').read()
    encoded = base64.b64encode(video)
    ipythondisplay.display(HTML(data='''<video alt="test" autoplay 
                loop controls style="height: 400px;">
                <source src="data:video/mp4;base64,{0}" type="video/mp4" />
             </video>'''.format(encoded.decode('ascii'))))
  else: 
    print("Could not find video")
    
def wrap_env(env):
  env = Monitor(env, './video', force=True)
  return env

In [5]:
# Wiki https://github.com/openai/gym/wiki/CartPole-v0
env = gym.make("CartPole-v0")
print("Observations:", env.observation_space)  ## [position of cart, velocity of cart, angle of pole, rotation rate of pole]
print("Actions:", env.action_space)  ## Actions are LEFT and RIGHT

Observations: Box(4,)
Actions: Discrete(2)


#### Let's try a random-action agent

In [19]:
# Try running environment with random actions
import time

env = wrap_env(gym.make("CartPole-v0"))
env.reset()
reward_sum = 0
num_games = 10
num_game = 0

while num_game < num_games:
    env.render()
    observation, reward, done, _ = env.step(env.action_space.sample())
    reward_sum += reward
    if done:
        print("Reward for this episode was: {}".format(reward_sum))
        reward_sum = 0
        num_game += 1
        env.reset()
env.close()
show_video()

Reward for this episode was: 15.0
Reward for this episode was: 24.0
Reward for this episode was: 17.0
Reward for this episode was: 23.0
Reward for this episode was: 15.0
Reward for this episode was: 12.0
Reward for this episode was: 11.0
Reward for this episode was: 17.0
Reward for this episode was: 58.0
Reward for this episode was: 18.0


<b>The only actions we can take in this environment are LEFT and RIGHT</b>
  
#### Let's try a neural agent

In [0]:
# Constants defining our neural network
hidden_layer_neurons = 10
batch_size = 50
learning_rate = 1e-2
gamma = .99
dimen = 4

In [13]:
import tensorflow as tf

# Defining Graphs
tf.reset_default_graph()

# Define input placeholder
observations = tf.placeholder(tf.float32, [None, dimen], name="input_x")

# First layer of weights
W1 = tf.get_variable("W1", shape=[dimen, hidden_layer_neurons],
                    initializer=tf.contrib.layers.xavier_initializer())
layer1 = tf.nn.relu(tf.matmul(observations,W1))

# Second layer of weights
W2 = tf.get_variable("W2", shape=[hidden_layer_neurons, 1],
                    initializer=tf.contrib.layers.xavier_initializer())
output = tf.nn.sigmoid(tf.matmul(layer1, W2))

# We need to define the parts of the network needed for learning a policy
trainable_vars = [W1, W2]

input_y = tf.placeholder(tf.float32, [None, 1], name="input_y")
advantages = tf.placeholder(tf.float32, name="reward_signal")

# Loss function
log_lik = tf.log(input_y * (input_y - output) + 
                  (1 - input_y) * (input_y + output))
loss = -tf.reduce_mean(log_lik * advantages)

# Gradients
new_grads = tf.gradients(loss, trainable_vars)
W1_grad = tf.placeholder(tf.float32, name="batch_grad1")
W2_grad = tf.placeholder(tf.float32, name="batch_grad2")

# Learning
batch_grad = [W1_grad, W2_grad]
adam = tf.train.AdamOptimizer(learning_rate=learning_rate)
update_grads = adam.apply_gradients(zip(batch_grad, [W1, W2]))

The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
  * https://github.com/tensorflow/io (for I/O related ops)
If you depend on functionality not listed there, please file an issue.
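
The `log_lik` expression in the graph above uses a single arithmetic formula to pick out the probability of the action that was actually taken. The plain-Python sketch below evaluates the same formula for a single sample, reading the network output as the probability of action 0 (which matches how actions are sampled later in the notebook). The names `log_likelihood` and `p_action0` are hypothetical.

```python
import math

def log_likelihood(y, p_action0):
    """Log-probability of the sampled action.

    y is the action taken (0 or 1); p_action0 is the sigmoid output,
    interpreted as P(action = 0).
    """
    # y == 0: first term vanishes, leaving p_action0.
    # y == 1: second term vanishes, leaving 1 - p_action0.
    return math.log(y * (y - p_action0) + (1 - y) * (y + p_action0))

p = 0.8
ll_action0 = log_likelihood(0, p)  # equals log(0.8)
ll_action1 = log_likelihood(1, p)  # equals log(0.2), up to rounding
print(ll_action0, ll_action1)
```

Multiplying this log-probability by the advantage and negating it gives the loss: actions followed by above-average reward become more likely, and the rest become less likely.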



In [0]:
def discount_rewards(r, gamma=0.99):
    """Takes 1d float array of rewards and computes discounted reward
    e.g. f([1, 1, 1], 0.99) -> [1, 0.99, 0.9801]
    """
    return np.array([val * (gamma ** i) for i, val in enumerate(r)])
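
`discount_rewards` scales the reward at step `i` by `gamma ** i`. Many policy gradient write-ups instead credit each step with the discounted sum of all rewards that follow it (often called reward-to-go). The sketch below shows that variant for comparison only. It is not what this notebook uses, and `discounted_returns` is a hypothetical name.

```python
def discounted_returns(rewards, gamma=0.99):
    """Return G_t = r_t + gamma * r_{t+1} + gamma**2 * r_{t+2} + ... for each t."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step reuses the return of the step after it.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

returns = discounted_returns([1, 1, 1], gamma=0.99)
print(returns)  # approximately [2.9701, 1.99, 1.0]
```

In CartPole every step yields a reward of 1, so both schemes give early steps a larger value than late ones. They differ in scale and shape, not in that ordering.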

In [15]:
import numpy as np

reward_sum = 0
init = tf.global_variables_initializer()

# Placeholders for our observations, outputs and rewards
xs = np.empty(0).reshape(0,dimen)
ys = np.empty(0).reshape(0,1)
rewards = np.empty(0).reshape(0,1)

# Setting up our environment
sess = tf.Session()
rendering = False
sess.run(init)
observation = env.reset()

# Placeholder for our gradients
gradients = np.array([np.zeros(var.get_shape()) for var in trainable_vars])
num_episodes = 50000
num_episode = 0

while num_episode < num_episodes:
    #env.render()
    # Append the observations to our batch
    x = np.reshape(observation, [1, dimen])
    
    # Run the neural net to determine output
    tf_prob = sess.run(output, feed_dict={observations: x})
    
    # Determine the output based on our net, allowing for some randomness
    y = 0 if tf_prob > np.random.uniform() else 1
    
    # Append the observations and outputs for learning
    xs = np.vstack([xs, x])
    ys = np.vstack([ys, y])

    # Determine the outcome of our action
    observation, reward, done, _ = env.step(y)
    reward_sum += reward
    rewards = np.vstack([rewards, reward])
    
    ''' When Episode is finished, time to calculate discounted rewards and gradient'''
    if done:
        # Determine standardized rewards
        discounted_rewards = discount_rewards(rewards, gamma)
        discounted_rewards -= discounted_rewards.mean()
        discounted_rewards /= discounted_rewards.std()

        # Add this episode's gradients to the running gradients
        gradients += np.array(sess.run(new_grads, feed_dict={observations: xs,
                                               input_y: ys,
                                               advantages: discounted_rewards}))
        
        # Clear out game variables
        xs = np.empty(0).reshape(0,dimen)
        ys = np.empty(0).reshape(0,1)
        rewards = np.empty(0).reshape(0,1)

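        # Once the batch is full (this also fires at episode 0, after a single episode,
        # so the first printed average is divided by batch_size and understated)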
        if num_episode % batch_size == 0:
            # Apply the accumulated gradients
            sess.run(update_grads, feed_dict={W1_grad: gradients[0],
                                             W2_grad: gradients[1]})
            # Clear out gradients
            gradients *= 0
            
            # Print status
            print("Average reward for episode {}: {}".format(num_episode, reward_sum/batch_size))
            
            ''' Break when reward is 200 '''
            if reward_sum / batch_size >= 200:
                print("Solved in {} episodes!".format(num_episode))
                break
            reward_sum = 0
        num_episode += 1
        observation = env.reset()
env.close()

Average reward for episode 0: 0.6
Average reward for episode 50: 21.82
Average reward for episode 100: 22.48
Average reward for episode 150: 25.64
Average reward for episode 200: 23.24
Average reward for episode 250: 23.5
Average reward for episode 300: 25.38
Average reward for episode 350: 24.1
Average reward for episode 400: 24.76
Average reward for episode 450: 26.92
Average reward for episode 500: 26.5
Average reward for episode 550: 29.16
Average reward for episode 600: 30.06
Average reward for episode 650: 26.64
Average reward for episode 700: 34.82
Average reward for episode 750: 34.06
Average reward for episode 800: 35.86
Average reward for episode 850: 30.38
Average reward for episode 900: 33.7
Average reward for episode 950: 35.16
Average reward for episode 1000: 33.56
Average reward for episode 1050: 31.62
Average reward for episode 1100: 41.4
Average reward for episode 1150: 34.5
Average reward for episode 1200: 35.04
Average reward for episode 1250: 43.74
Average reward fo
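
The training loop above standardizes the discounted rewards (subtract the mean, divide by the standard deviation) before feeding them in as `advantages`. A minimal plain-Python sketch of that step follows. The `eps` guard against a zero standard deviation is an assumption added here and is not in the notebook, and `standardize` is a hypothetical name.

```python
import math

def standardize(values, eps=1e-8):
    """Shift values to zero mean and scale to (roughly) unit standard deviation."""
    n = len(values)
    mean = sum(values) / n
    # Population standard deviation, matching NumPy's default for .std()
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / (std + eps) for v in values]

advantages = standardize([1.0, 0.99, 0.9801])
print(advantages)  # positive for early steps, negative for late ones
```

After standardizing, roughly half of the steps get a positive advantage and half a negative one, which keeps the gradient scale comparable between short and long episodes.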

In [0]:
! rm -rf video

**Let's see our trained bot in action**

In [21]:
env = wrap_env(gym.make("CartPole-v0"))
observation = env.reset()
observation
reward_sum = 0

while True:
    env.render()
    x = np.reshape(observation, [1, dimen])
    y = sess.run(output, feed_dict={observations: x})
    y = 0 if y > 0.5 else 1
    observation, reward, done, _ = env.step(y)
    reward_sum += reward
    if done:
        print("Total score: {}".format(reward_sum))
        break
env.close()
show_video()

Total score: 200.0
