# Siraj's Challenge: Policy Gradient Learning with Cart Pole V0

SIMONINI Thomas <br>
<a href="https://github.com/simoninithomas"> Github </a><br>
<a href="https://www.simoninithomas.com"> Website </a><br>
<a href="mailto:hello@simoninithomas.com"> My email </a>




The challenge of the week was: <b>solving a simple game using policy gradients (other than Pong).</b>
I chose CartPole-v0 because it is a basic game and there are plenty of tutorials and documentation about it.

<h3> Goal </h3>
CartPole-v0 defines "solving" as <b>getting an average reward of 195.0 over 100 consecutive trials.</b>
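
The "solved" criterion above can be checked with a few lines of NumPy. This is only a minimal sketch: `is_solved`, `threshold` and `window` are hypothetical names introduced for illustration, and `episode_rewards` is assumed to be a plain list holding the total reward of each finished episode.

```python
import numpy as np

def is_solved(episode_rewards, threshold=195.0, window=100):
    """Return True once the mean of the last `window` episode rewards reaches `threshold`."""
    # Not enough episodes played yet to fill the window
    if len(episode_rewards) < window:
        return False
    return bool(np.mean(episode_rewards[-window:]) >= threshold)

print(is_solved([200.0] * 100))
print(is_solved([150.0] * 100))
```

Checking only the last 100 episodes means a few bad early episodes do not count against the agent once it has learned.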

<br>
I chose a Jupyter notebook because <b>it makes the code much easier to understand.</b> In this notebook, I will <b>try to explain every part of the code.</b>

<h3> Some interesting discoveries </h3>
<ul>
    <li>Using ELU instead of ReLU <b> led to better results </b> </li>
    <li> Using RMSProp as the optimizer <b> led to better results </b></li>
</ul>


<h3> Cart Pole V0 </h3>

<img src="https://cdn-images-1.medium.com/max/1200/1*G_whtIrY9fGlw3It6HFfhA.gif" alt="Cart Pole game" />

4 kinds of information given by the state:
<ul>
    <li>Position of the cart</li>
    <li> Velocity of the cart </li>
    <li> Angle of the pole </li>
    <li> Angular velocity of the pole </li>
</ul>
<br>
An agent can push the cart:
<ul>
    <li> 0: left </li>
    <li> 1: right </li>
</ul>
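
To make the state and the actions easier to read while debugging, the four values can be given names. This is a small sketch, assuming the state is ordered as in the list above; `Observation`, `ACTION_NAMES` and `describe` are hypothetical helpers that are not used by the rest of the notebook.

```python
from collections import namedtuple

# Hypothetical names, introduced only for this illustration
Observation = namedtuple(
    "Observation",
    ["cart_position", "cart_velocity", "pole_angle", "pole_velocity"],
)
ACTION_NAMES = {0: "left", 1: "right"}

def describe(state, action):
    """Turn a raw 4-value state and an action index into a readable string."""
    obs = Observation(*state)
    return "cart at {:.2f}, pole at {:.2f} -> push {}".format(
        obs.cart_position, obs.pole_angle, ACTION_NAMES[action]
    )

print(describe([0.01, -0.02, 0.03, 0.04], 1))
# cart at 0.01, pole at 0.03 -> push right
```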



This was made possible thanks to these 2 fantastic resources:
<ul>
    <li> <a href="https://medium.com/@awjuliani/super-simple-reinforcement-learning-tutorial-part-2-ded33892c724">Simple Reinforcement Learning with Tensorflow: Part 2 - Policy-based Agents </a> : this article helped me define part of the architecture and helped me a lot with the training part.</li>
    
   
  <li> <a href="https://gist.github.com/shanest/535acf4c62ee2a71da498281c2dfc4f4" >Policy gradients for reinforcement learning in TensorFlow</a></li>
  </ul>

### Import the dependencies

In [1]:
import gym
import numpy as np
import tensorflow as tf

### Our game environment

In [None]:
env = gym.make("CartPole-v0")

# Watch the simulation with a random agent
env.reset()

for _ in range(100):
    env.render()
    
    # Take a random action
    state, reward, done, info = env.step(env.action_space.sample())
    
    # Start a new episode once the current one is over
    if done:
        env.reset()
env.close()


[2017-12-13 17:33:10,592] Making new env: CartPole-v0


### Define our hyperparameters

In [3]:
input_size = 4 # 4 values given by the state
action_size = 2 # 2 possible actions: left / right
hidden_size = 64 # Number of hidden neurons

learning_rate = 0.001 
gamma = 0.99 # Discount rate

train_episodes = 5000 # An episode is one game
max_steps = 900 # Max steps per episode
batch_size = 5

### Build our Deep Neural Network

<img src="assets/nn.png" />

<i>Originally taken from <a href="https://www.youtube.com/watch?v=pN7ETkOizGM">Siraj's Solving the basic game of Pong video</a>, modified with my exceptional skills in Paint </i>😂

In [4]:
class PGAgent():
    def __init__(self, input_size, action_size, hidden_size, learning_rate, gamma):
        
        self.input_size = input_size
        self.action_size = action_size
        self.hidden_size = hidden_size
        self.learning_rate = learning_rate
        self.gamma = gamma
        
        # Make the NN
        self.inputs = tf.placeholder(tf.float32,
                                     shape = [None, input_size])
        
        # ELU gave better results than ReLU here
        self.hidden_layer_1 = tf.contrib.layers.fully_connected(inputs = self.inputs,
                                                                num_outputs = hidden_size,
                                                                activation_fn = tf.nn.elu,
                                                                weights_initializer = tf.random_normal_initializer())

        self.output_layer = tf.contrib.layers.fully_connected(inputs = self.hidden_layer_1,
                                                              num_outputs = action_size,
                                                              activation_fn = tf.nn.softmax)
        
        # Log prob output
        self.output_log_prob = tf.log(self.output_layer)
        
        
        ### LOSS Function : feed the reward and chosen action in the DNN
        # Taken from this implementation https://gist.github.com/shanest/535acf4c62ee2a71da498281c2dfc4f4
        
        self.actions = tf.placeholder(tf.int32, shape = [None])
        self.rewards = tf.placeholder(tf.float32, shape = [None])
        
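        # Get the probability of the action actually taken at each step:
        # flatten the [steps, actions] output and index it with row * action_size + action
        self.indices = tf.range(0, tf.shape(self.output_log_prob)[0]) * tf.shape(self.output_log_prob)[1] + self.actions
        
        self.actions_probability = tf.gather(tf.reshape(self.output_layer, [-1]), self.indices)
        
        # Policy gradient loss: minus the mean of log(probability of the taken action) * discounted reward
        self.loss = -tf.reduce_mean(tf.log(self.actions_probability) * self.rewards)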
        
  

        # Compute the gradients for each episode, accumulate them outside the graph, then apply them in batches.
        # Not implemented by me, taken from https://medium.com/@awjuliani/super-simple-reinforcement-learning-tutorial-part-2-ded33892c724#.mtwpvfi8b
        tvars = tf.trainable_variables()
        self.gradient_holders = []
        for idx,var in enumerate(tvars):
            placeholder = tf.placeholder(tf.float32, name=str(idx)+ '_holder')
            self.gradient_holders.append(placeholder)
        
        self.gradients = tf.gradients(self.loss,tvars)
        
        
        ### OPTIMIZER
        
        # Better to use RMSProp
        optimizer = tf.train.RMSPropOptimizer(learning_rate=learning_rate)
        self.update_batch = optimizer.apply_gradients(zip(self.gradient_holders,tvars))
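
The index arithmetic used in the loss is easier to follow with concrete numbers. Here is a NumPy sketch of the same idea, with made-up probabilities, actions and rewards (none of these values come from a real run): flattening the `[steps, actions]` matrix and indexing it with `row * number_of_actions + action` picks out the probability of the action taken at each step.

```python
import numpy as np

# Made-up network outputs for 3 steps and 2 actions (each row sums to 1)
probs = np.array([[0.7, 0.3],
                  [0.2, 0.8],
                  [0.5, 0.5]])
actions = np.array([0, 1, 1])          # action taken at each step
returns = np.array([2.0, 1.0, 0.5])    # discounted reward of each step

n_steps, n_actions = probs.shape

# Same trick as in the graph: position of (row, action) in the flattened matrix
flat_indices = np.arange(n_steps) * n_actions + actions
chosen = probs.reshape(-1)[flat_indices]

# NumPy fancy indexing gives the same values without the arithmetic
same = probs[np.arange(n_steps), actions]

# Policy gradient loss: minus the mean of log-probability times discounted reward
loss = -np.mean(np.log(chosen) * returns)

print(flat_indices)  # [0 3 5]
print(chosen)        # [0.7 0.8 0.5]
print(loss)
```

Minimizing this loss raises the log-probability of the taken actions, weighted by how much discounted reward followed them.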
        

### Define our advantage function

<p>What we must understand here is that immediate rewards <b>are more important than delayed rewards.</b>
</p>
<p> That's why we use gamma as a discount factor </p>
<img src="assets/discountreward.png" alt="Discount reward"/>

Why? Because <b>delayed rewards have less impact</b>: imagine you make a mistake at step 5 (the pole leans too far). The rewards collected after that matter less and less, because you are going to lose anyway. That is why a reward is discounted more heavily the further away it is.

<img src="assets/d1.png"/>

<img src="assets/d2.png"/>

<i>Originally taken from <a href="https://www.youtube.com/watch?v=tqrcjHuNdmQ">DQN Bootcamp Lecture: Core Lecture 4b Pong from Pixels -- Andrej Karpathy </a></i>
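
Written out, the discounted reward described above is usually defined as follows. The notation is an assumption, since the figures above are images: $r_t$ is the reward at step $t$, $T$ is the last step of the episode and $\gamma$ is the discount rate `gamma`.

```latex
G_t = \sum_{k=0}^{T-t} \gamma^{k}\, r_{t+k} = r_t + \gamma\, G_{t+1}, \qquad G_{T+1} = 0
```

For example, with three steps that each give a reward of 1 and $\gamma = 0.99$: $G_2 = 1$, $G_1 = 1 + 0.99 \cdot 1 = 1.99$ and $G_0 = 1 + 0.99 \cdot 1.99 = 2.9701$. The recursive form on the right is why the computation can walk backwards through the episode.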

In [5]:
# Weight rewards differently: weight immediate rewards higher than delayed rewards

def discount_rewards(r, gamma=gamma):
    # Work on floats so the discounted values are never truncated to integers
    r = np.asarray(r, dtype=np.float64)
    
    # Init the discounted reward array
    discounted_reward = np.zeros_like(r)
    
    # running_add stores the discounted sum of the rewards from step t onward
    running_add = 0
    
    # Walk backwards through the episode
    for t in reversed(range(0, r.size)):
        
        running_add = running_add * gamma + r[t] # sum * gamma + reward
        discounted_reward[t] = running_add
    return discounted_reward

Remember that:
<ul>
    <li> A positive advantage --> make the action <b>more likely to happen in the future</b>, at that state </li>
    <li> A negative advantage --> make the action <b>less likely to happen in the future</b>, at that state</li>
</ul>
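
In CartPole every step gives a reward of +1, so the raw discounted rewards fed to the loss in this notebook are always positive. One common variant, which is not what this notebook does, standardizes them per episode so that some steps get a positive advantage and others a negative one. A minimal sketch, where `discounted_returns` and `standardize` are hypothetical helper names:

```python
import numpy as np

def discounted_returns(rewards, gamma=0.99):
    """Discounted sum of future rewards for every step of one episode."""
    rewards = np.asarray(rewards, dtype=np.float64)
    out = np.zeros_like(rewards)
    running = 0.0
    for t in reversed(range(rewards.size)):
        running = running * gamma + rewards[t]
        out[t] = running
    return out

def standardize(x, eps=1e-8):
    """Shift to zero mean and scale to unit variance (eps avoids division by zero)."""
    return (x - x.mean()) / (x.std() + eps)

returns = discounted_returns([1.0, 1.0, 1.0, 1.0, 1.0])
advantages = standardize(returns)

print(returns)     # largest at the first step, 1.0 at the last step
print(advantages)  # positive for early steps, negative for late steps
```

With this variant, the actions taken shortly before the pole fell are pushed down, while the early actions of the episode are pushed up.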

### Train the agent

In [6]:
# Clear the graph

tf.reset_default_graph()

agent = PGAgent(input_size, action_size, hidden_size, learning_rate, gamma)

# Launch the tensorflow graph
with tf.Session() as sess:
    saver = tf.train.Saver()
    sess.run(tf.global_variables_initializer())
    
    nb_episodes = 0
    
    # Define total_rewards and total_length
    total_reward = []
    total_length = []
    
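    # Not my implementation: buffer that accumulates the gradients between two updates.
    # It has the same shapes as the trainable variables and starts at zero.
    gradBuffer = sess.run(tf.trainable_variables())
    for ix,grad in enumerate(gradBuffer):
        gradBuffer[ix] = grad * 0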
        
    
    # While we have episodes to train
    while nb_episodes < train_episodes:
        state = env.reset()
        running_reward = 0
        episode_history = [] # Init the array that keep track the history in an episode
        
        for step in range(max_steps):
            # Probabilistically pick an action given our network outputs.
            # Adapted from the Udacity Q-learning cart notebook https://github.com/udacity/deep-learning/blob/master/reinforcement/Q-learning-cart.ipynb 
            action_distribution = sess.run(agent.output_layer, feed_dict={agent.inputs:[state]})
            # Sample the action index directly, using the network probabilities as weights
            action = np.random.choice(action_size, p=action_distribution[0])
            
            state_1, reward, done, info = env.step(action)
            
            # Append this step in the history of the episode
            episode_history.append([state, action, reward, state_1])
            
            # Now we are in this state (state is now state 1)
            state = state_1
            
            running_reward += reward
            
            if done:
                # Update the network
                # dtype=object because each row mixes arrays (states) and scalars
                episode_history = np.array(episode_history, dtype=object)
                episode_history[:,2] = discount_rewards(episode_history[:,2])
                feed_dict={agent.rewards:episode_history[:,2],
                        agent.actions:episode_history[:,1],agent.inputs:np.vstack(episode_history[:,0])}
                grads = sess.run(agent.gradients, feed_dict=feed_dict)
                
                # Accumulate the gradients of this episode in the buffer
                for idx,grad in enumerate(grads):
                    gradBuffer[idx] += grad

                # Every batch_size episodes, apply the accumulated gradients and reset the buffer
                if nb_episodes % batch_size == 0 and nb_episodes != 0:
                    feed_dict = dict(zip(agent.gradient_holders, gradBuffer))
                    _ = sess.run(agent.update_batch, feed_dict=feed_dict)
                    for ix,grad in enumerate(gradBuffer):
                        gradBuffer[ix] = grad * 0
                
                #(running_reward))
                total_reward.append(running_reward)
                total_length.append(step)
                break
                
        # For each 100 episodes
        if nb_episodes % 100 == 0:
            print("Episode: {}".format(nb_episodes),
                    "Total reward: {}".format(np.mean(total_reward[-100:])))
        nb_episodes += 1
    
    saver.save(sess, "checkpoints/cartPoleGame.ckpt")
        
        
  

  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "


Episode: 0 Total reward: 23.0
Episode: 100 Total reward: 46.47
Episode: 200 Total reward: 49.83
Episode: 300 Total reward: 57.72
Episode: 400 Total reward: 62.65
Episode: 500 Total reward: 89.78
Episode: 600 Total reward: 105.59
Episode: 700 Total reward: 133.91
Episode: 800 Total reward: 142.04
Episode: 900 Total reward: 151.3
Episode: 1000 Total reward: 159.59
Episode: 1100 Total reward: 159.25
Episode: 1200 Total reward: 168.02
Episode: 1300 Total reward: 153.65
Episode: 1400 Total reward: 166.91
Episode: 1500 Total reward: 175.99
Episode: 1600 Total reward: 180.67
Episode: 1700 Total reward: 182.99
Episode: 1800 Total reward: 183.64
Episode: 1900 Total reward: 191.79
Episode: 2000 Total reward: 184.14
Episode: 2100 Total reward: 180.97
Episode: 2200 Total reward: 190.58
Episode: 2300 Total reward: 190.56
Episode: 2400 Total reward: 185.67
Episode: 2500 Total reward: 192.32
Episode: 2600 Total reward: 191.32
Episode: 2700 Total reward: 192.74
Episode: 2800 Total reward: 190.67
Episo
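
The training loop does not update the network after every episode: it sums the gradients of several episodes in a buffer and applies them once per batch. Here is a stripped-down NumPy sketch of that pattern. Everything in it is a stand-in: the weights are two small arrays, every episode is assumed to produce a gradient of ones, and a plain gradient descent step replaces RMSProp.

```python
import numpy as np

batch_size = 5
learning_rate = 0.1

# Hypothetical stand-ins for the trainable variables of the network
weights = [np.zeros((2, 2)), np.zeros(2)]
# One buffer per variable, same shape, starting at zero
grad_buffer = [np.zeros_like(w) for w in weights]
updates = 0

for episode in range(1, 11):
    # Pretend each episode yields a gradient of ones for every variable
    episode_grads = [np.ones_like(w) for w in weights]

    # Accumulate instead of applying right away
    for i, grad in enumerate(episode_grads):
        grad_buffer[i] += grad

    # Every batch_size episodes: apply the summed gradients, then reset the buffer
    if episode % batch_size == 0:
        for i in range(len(weights)):
            weights[i] -= learning_rate * grad_buffer[i]
            grad_buffer[i][...] = 0.0
        updates += 1

print(updates)      # 2
print(weights[1])   # [-1. -1.]
```

Summing over a few episodes smooths out the noise of single episodes, which can vary a lot in length and reward.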

### Play the game

<p> Let's watch our agent play the game </p>

In [13]:
test_episodes = 10
test_max_steps = 400

with tf.Session() as sess:
    saver.restore(sess, tf.train.latest_checkpoint('checkpoints'))
    
    for episode in range(test_episodes):
        # Start every episode from a fresh initial state
        state = env.reset()
        
        for t in range(test_max_steps):
            env.render() 
            
            # Probabilistically pick an action given our network outputs.
            # Adapted from the Udacity Q-learning cart notebook https://github.com/udacity/deep-learning/blob/master/reinforcement/Q-learning-cart.ipynb 
            action_distribution = sess.run(agent.output_layer, feed_dict={agent.inputs:[state]})
            action = np.random.choice(action_size, p=action_distribution[0])
            
            # The next state becomes the current state
            state, reward, done, info = env.step(action)
            
            # Stop this episode as soon as the game is over
            if done:
                break
                
env.close()
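
Both the training loop and the test loop pick actions by sampling from the softmax output of the network instead of always taking the most likely action. A minimal sketch of that sampling step, with made-up probabilities and a seeded generator (the seed is only there to make the sketch repeatable):

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up softmax output for one state: P(left) = 0.8, P(right) = 0.2
action_probabilities = np.array([0.8, 0.2])

# Draw one action index, weighted by the probabilities
action = rng.choice(len(action_probabilities), p=action_probabilities)

# Over many draws, the share of each action approaches its probability
samples = rng.choice(len(action_probabilities), size=10000, p=action_probabilities)
left_share = np.mean(samples == 0)

print(action, left_share)
```

Sampling keeps some exploration during training. At test time, taking `np.argmax(action_probabilities)` instead would be a reasonable alternative if you want a deterministic agent.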