# Cartpole: REINFORCE Monte Carlo Policy Gradients

In this notebook, we'll implement an agent <b>that plays Cartpole</b> using REINFORCE, a Monte Carlo policy gradient method.

<img src="http://neuro-educator.com/wp-content/uploads/2017/09/DQN.gif" alt="Cartpole gif"/>


## This notebook is part of the free Deep Reinforcement Learning Course 📝
<img src="https://simoninithomas.github.io/Deep_reinforcement_learning_Course/assets/img/preview.jpg" alt="Deep Reinforcement Course" style="width: 500px;"/>

<p>The Deep Reinforcement Learning Course is a free series of blog posts about Deep Reinforcement Learning, where we'll learn the main algorithms <b>and how to implement them in TensorFlow.</b></p>

<p>The goal of these articles is to <b>explain each algorithm step by step</b>, from the big picture and the mathematical details behind it to the implementation with TensorFlow.</p>


<a href="https://simoninithomas.github.io/Deep_reinforcement_learning_Course/">Syllabus</a><br>
<a href="https://medium.freecodecamp.org/an-introduction-to-reinforcement-learning-4339519de419">Part 0: Introduction to Reinforcement Learning </a><br>
<a href="https://medium.freecodecamp.org/diving-deeper-into-reinforcement-learning-with-q-learning-c18d0db58efe"> Part 1: Q-learning with FrozenLake</a><br>
<a href="https://medium.freecodecamp.org/an-introduction-to-deep-q-learning-lets-play-doom-54d02d8017d8"> Part 2: Deep Q-learning with Doom</a><br>
<a href=""> Part 3: Policy Gradients with Doom </a><br>

## Checklist 📝
- To launch tensorboard : `tensorboard --logdir=/tensorboard/pg/1`
- ⚠️⚠️⚠️ You need to download vizdoom and place the folder in the repos.
- If you don't want to train, you must change **training to False** (in the hyperparameters step).


## Any questions 👨‍💻
<p> If you have any questions, feel free to ask me: </p>
<p> 📧: <a href="mailto:hello@simoninithomas.com">hello@simoninithomas.com</a>  </p>
<p> GitHub: https://github.com/simoninithomas/Deep_reinforcement_learning_Course </p>
<p> 🌐 : https://simoninithomas.github.io/Deep_reinforcement_learning_Course/ </p>
<p> Twitter: <a href="https://twitter.com/ThomasSimonini">@ThomasSimonini</a> </p>
<p> Don't forget to <b> follow me on <a href="https://twitter.com/ThomasSimonini">Twitter</a>, <a href="https://github.com/simoninithomas/Deep_reinforcement_learning_Course">GitHub</a> and <a href="https://medium.com/@thomassimonini">Medium</a> to be alerted of the new articles that I publish </b></p>
    

## How to help  🙌
3 ways:
- **Clap our articles a lot**: Clapping on Medium means that you really like our articles. The more claps we have, the more our articles are shared.
- **Share and speak about our articles**: By sharing our articles, you help us spread the word.
- **Improve our notebooks**: If you find a bug or **a better implementation**, you can send a pull request.
<br>

## Step 1: Import the libraries 📚

In [1]:
import tensorflow as tf
import numpy as np
import gym

## Step 2: Create our environment 🎮
This time we use <a href="https://gym.openai.com/">OpenAI Gym</a>, which provides many ready-made environments.

In [2]:
env = gym.make('CartPole-v0')
env = env.unwrapped  # Remove the TimeLimit wrapper so episodes are not capped at 200 steps
# Policy gradients have high variance, so seed the environment for reproducibility
env.seed(1)

WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.


[1]

## Step 3: Set up our hyperparameters ⚗️

In [3]:
## ENVIRONMENT Hyperparameters
state_size = 4
action_size = env.action_space.n

## TRAINING Hyperparameters
max_episodes = 10000
learning_rate = 0.01
gamma = 0.95 # Discount rate

## Step 4 : Define the preprocessing functions ⚙️
This function takes <b>the episode rewards, computes the discounted return at each step, and normalizes the result.</b>

In [4]:
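def discount_and_normalize_rewards(episode_rewards):
    # Discounted return for every step: G_t = r_t + gamma * G_{t+1}
    # Force a float array so integer rewards are not truncated
    discounted_episode_rewards = np.zeros_like(episode_rewards, dtype=np.float64)
    cumulative = 0.0
    for i in reversed(range(len(episode_rewards))):
        cumulative = cumulative * gamma + episode_rewards[i]
        discounted_episode_rewards[i] = cumulative
    
    # Normalize to zero mean and unit standard deviation
    mean = np.mean(discounted_episode_rewards)
    std = np.std(discounted_episode_rewards)
    # The small epsilon avoids a division by zero when all returns are equal
    discounted_episode_rewards = (discounted_episode_rewards - mean) / (std + 1e-8)
    
    return discounted_episode_rewards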
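
To see what the discounting step does to the numbers, here is a minimal plain-Python sketch for a hypothetical three-step episode with a reward of 1.0 per step and a discount rate of 0.95. The helper names `discounted_returns` and `normalize` and the sample rewards are illustrative assumptions, not part of the notebook.

```python
def discounted_returns(rewards, gamma=0.95):
    """Return G_t = r_t + gamma * G_{t+1} for every step of one episode."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step can reuse the return of the step after it
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def normalize(values, eps=1e-8):
    """Shift to zero mean and scale to (roughly) unit standard deviation."""
    mean = sum(values) / len(values)
    std = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
    return [(v - mean) / (std + eps) for v in values]


returns = discounted_returns([1.0, 1.0, 1.0])
normalized = normalize(returns)
print(returns)     # approximately [2.8525, 1.95, 1.0]
print(normalized)  # early steps end up positive, late steps negative
```

Early actions receive the largest return because they are credited with everything that happened after them. After normalization, roughly half of the steps get a negative weight, so the update pushes some action probabilities up and others down.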

## Step 5: Create our Policy Gradient Neural Network model 🧠

<img src="https://raw.githubusercontent.com/simoninithomas/Deep_reinforcement_learning_Course/master/Policy%20Gradients/Cartpole/assets/catpole.png">

The idea is simple:
- Our state, which is an array of 4 values, will be used as the input.
- Our NN is 3 fully connected layers.
- Our output activation function is softmax, which squashes the outputs into a probability distribution that sums to 1 (for instance, if we have 4, 2, 6 --> softmax --> approximately (0.117, 0.016, 0.867)).
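
To make the softmax step concrete, here is a minimal plain-Python sketch that applies it to the raw values 4, 2 and 6. The function name `softmax` is our own helper for illustration; the notebook itself relies on `tf.nn.softmax`.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    # Subtracting the maximum keeps exp() from overflowing on large scores
    highest = max(logits)
    exps = [math.exp(x - highest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([4.0, 2.0, 6.0])
print([round(p, 3) for p in probs])  # [0.117, 0.016, 0.867]
```

Notice that the largest raw value takes most of the probability mass, but the other actions keep a non-zero chance of being sampled.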

In [5]:
with tf.name_scope("inputs"):
    input_ = tf.placeholder(tf.float32, [None, state_size], name="input_")
    actions = tf.placeholder(tf.int32, [None, action_size], name="actions")
    discounted_episode_rewards_ = tf.placeholder(tf.float32, [None,], name="discounted_episode_rewards")
    
    # Add this placeholder for having this variable in tensorboard
    mean_reward_ = tf.placeholder(tf.float32 , name="mean_reward")

    with tf.name_scope("fc1"):
        fc1 = tf.contrib.layers.fully_connected(inputs = input_,
                                                num_outputs = 10,
                                                activation_fn=tf.nn.relu,
                                                weights_initializer=tf.contrib.layers.xavier_initializer())

    with tf.name_scope("fc2"):
        fc2 = tf.contrib.layers.fully_connected(inputs = fc1,
                                                num_outputs = action_size,
                                                activation_fn= tf.nn.relu,
                                                weights_initializer=tf.contrib.layers.xavier_initializer())
    
    with tf.name_scope("fc3"):
        fc3 = tf.contrib.layers.fully_connected(inputs = fc2,
                                                num_outputs = action_size,
                                                activation_fn= None,
                                                weights_initializer=tf.contrib.layers.xavier_initializer())

    with tf.name_scope("softmax"):
        action_distribution = tf.nn.softmax(fc3)

    with tf.name_scope("loss"):
        # tf.nn.softmax_cross_entropy_with_logits computes the cross entropy of the result after applying the softmax function
        # If you have single-class labels, where an object can only belong to one class, you might now consider using 
        # tf.nn.sparse_softmax_cross_entropy_with_logits so that you don't have to convert your labels to a dense one-hot array. 
        neg_log_prob = tf.nn.softmax_cross_entropy_with_logits_v2(logits = fc3, labels = actions)
        loss = tf.reduce_mean(neg_log_prob * discounted_episode_rewards_) 
        
    
    with tf.name_scope("train"):
        train_opt = tf.train.AdamOptimizer(learning_rate).minimize(loss)
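
The loss above multiplies the negative log-probability of each action taken by the normalized discounted return of that step, then averages over the episode. Here is a minimal plain-Python sketch of that arithmetic for a hypothetical two-step episode. The function names `log_softmax` and `reinforce_loss`, as well as the sample logits, actions and returns, are assumptions made for illustration and are not part of the notebook.

```python
import math


def log_softmax(logits):
    """Log of the softmax probabilities, computed in a numerically stable way."""
    highest = max(logits)
    log_sum = highest + math.log(sum(math.exp(x - highest) for x in logits))
    return [x - log_sum for x in logits]


def reinforce_loss(logits_per_step, actions_one_hot, returns):
    """Mean over steps of -log pi(a_t | s_t) * G_t."""
    total = 0.0
    for logits, one_hot, g in zip(logits_per_step, actions_one_hot, returns):
        log_probs = log_softmax(logits)
        # Cross entropy with a one-hot label picks out -log pi of the taken action
        neg_log_prob = -sum(y * lp for y, lp in zip(one_hot, log_probs))
        total += neg_log_prob * g
    return total / len(returns)


logits = [[0.0, 0.0], [1.0, -1.0]]   # hypothetical network outputs (fc3)
actions = [[1.0, 0.0], [0.0, 1.0]]   # one-hot actions taken at each step
returns = [1.0, -1.0]                # hypothetical normalized returns
loss = reinforce_loss(logits, actions, returns)
print(loss)
```

Minimizing this quantity raises the probability of actions that were followed by above-average returns and lowers the probability of actions followed by below-average returns.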

## Step 6: Set up Tensorboard 📊
For more information about tensorboard, please watch this <a href="https://www.youtube.com/embed/eBbEDRsCmv4">excellent 30min tutorial</a> <br><br>
To launch tensorboard : `tensorboard --logdir=/tensorboard/pg/1`

In [6]:
# Setup TensorBoard Writer
writer = tf.summary.FileWriter("/tensorboard/pg/1")

## Losses
tf.summary.scalar("Loss", loss)

## Reward mean
tf.summary.scalar("Reward_mean", mean_reward_)

write_op = tf.summary.merge_all()

## Step 7: Train our Agent 🏃‍♂️

In [7]:
allRewards = []
total_rewards = 0
maximumRewardRecorded = 0
episode = 0
episode_states, episode_actions, episode_rewards = [],[],[]

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    
    for episode in range(max_episodes):
        
        episode_rewards_sum = 0

        # Launch the game
        state = env.reset()
        
        env.render()
           
        while True:
            
            # Choose action a. The policy is stochastic: the network outputs action probabilities and we sample from them.
            action_probability_distribution = sess.run(action_distribution, feed_dict={input_: state.reshape([1,4])})
            
            action = np.random.choice(range(action_probability_distribution.shape[1]), p=action_probability_distribution.ravel())  # select action w.r.t the actions prob

            # Perform a
            new_state, reward, done, info = env.step(action)

            # Store s, a, r
            episode_states.append(state)
                        
            # The sampled action is only an index, but the loss expects a one-hot vector:
            # for example, we need [0., 1.] (if we push right), not just the index 1
            action_ = np.zeros(action_size)
            action_[action] = 1
            
            episode_actions.append(action_)
            
            episode_rewards.append(reward)
            if done:
                # Calculate sum reward
                episode_rewards_sum = np.sum(episode_rewards)
                
                allRewards.append(episode_rewards_sum)
                
                total_rewards = np.sum(allRewards)
                
                # Mean reward
                mean_reward = np.divide(total_rewards, episode+1)
                
                
                maximumRewardRecorded = np.amax(allRewards)
                
                print("==========================================")
                print("Episode: ", episode)
                print("Reward: ", episode_rewards_sum)
                print("Mean Reward", mean_reward)
                print("Max reward so far: ", maximumRewardRecorded)
                
                # Calculate discounted reward
                discounted_episode_rewards = discount_and_normalize_rewards(episode_rewards)
                                
                # Feedforward, gradient and backpropagation
                loss_, _ = sess.run([loss, train_opt], feed_dict={input_: np.vstack(np.array(episode_states)),
                                                                 actions: np.vstack(np.array(episode_actions)),
                                                                 discounted_episode_rewards_: discounted_episode_rewards 
                                                                })
                
 
                                                                 
                # Write TF Summaries
                summary = sess.run(write_op, feed_dict={input_: np.vstack(np.array(episode_states)),
                                                                 actions: np.vstack(np.array(episode_actions)),
                                                                 discounted_episode_rewards_: discounted_episode_rewards,
                                                                    mean_reward_: mean_reward
                                                                })
                
               
                writer.add_summary(summary, episode)
                writer.flush()
                
                # Reset the transition stores
                episode_states, episode_actions, episode_rewards = [],[],[]
                
                break
            
            state = new_state
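
Two small steps in the loop above are easy to miss: sampling an action from the probabilities the network outputs, and turning the sampled index into a one-hot vector for the loss. Here is a minimal sketch of both using only the standard library, so it runs without TensorFlow or Gym. The helper names `sample_action` and `one_hot` and the probabilities `[0.3, 0.7]` are hypothetical; the notebook uses `np.random.choice` for the sampling step.

```python
import random


def sample_action(probabilities, rng):
    """Draw an action index, where each index is chosen with its given probability."""
    return rng.choices(range(len(probabilities)), weights=probabilities, k=1)[0]


def one_hot(index, size):
    """Encode an action index as a vector with a single 1.0 entry."""
    encoded = [0.0] * size
    encoded[index] = 1.0
    return encoded


rng = random.Random(1)       # fixed seed so the sketch is repeatable
probabilities = [0.3, 0.7]   # hypothetical policy output for one state

# Sample many times to check that the frequencies follow the probabilities
counts = [0, 0]
for _ in range(10000):
    counts[sample_action(probabilities, rng)] += 1

print(counts)         # roughly 30% of the draws are action 0, 70% are action 1
print(one_hot(1, 2))  # [0.0, 1.0]
```

Sampling instead of always taking the most likely action is what lets the agent keep exploring while the policy is still poor.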

Episode:  0
Reward:  21.0
Mean Reward 21.0
Max reward so far:  21.0
Episode:  1
Reward:  10.0
Mean Reward 15.5
Max reward so far:  21.0
Episode:  2
Reward:  14.0
Mean Reward 15.0
Max reward so far:  21.0
Episode:  3
Reward:  13.0
Mean Reward 14.5
Max reward so far:  21.0
Episode:  4
Reward:  18.0
Mean Reward 15.2
Max reward so far:  21.0
Episode:  5
Reward:  22.0
Mean Reward 16.3333333333
Max reward so far:  22.0
Episode:  6
Reward:  27.0
Mean Reward 17.8571428571
Max reward so far:  27.0
Episode:  7
Reward:  21.0
Mean Reward 18.25
Max reward so far:  27.0
Episode:  8
Reward:  50.0
Mean Reward 21.7777777778
Max reward so far:  50.0
Episode:  9
Reward:  17.0
Mean Reward 21.3
Max reward so far:  50.0
Episode:  10
Reward:  23.0
Mean Reward 21.4545454545
Max reward so far:  50.0
Episode:  11
Reward:  17.0
Mean Reward 21.0833333333
Max reward so far:  50.0
Episode:  12
Reward:  21.0
Mean Reward 21.0769230769
Max reward so far:  50.0
Episode:  13
Reward:  11.0
Mean Reward 20.3571428571
Max r

Episode:  69
Reward:  16.0
Mean Reward 24.7285714286
Max reward so far:  87.0
Episode:  70
Reward:  20.0
Mean Reward 24.661971831
Max reward so far:  87.0
Episode:  71
Reward:  58.0
Mean Reward 25.125
Max reward so far:  87.0
Episode:  72
Reward:  22.0
Mean Reward 25.0821917808
Max reward so far:  87.0
Episode:  73
Reward:  33.0
Mean Reward 25.1891891892
Max reward so far:  87.0
Episode:  74
Reward:  37.0
Mean Reward 25.3466666667
Max reward so far:  87.0
Episode:  75
Reward:  17.0
Mean Reward 25.2368421053
Max reward so far:  87.0
Episode:  76
Reward:  15.0
Mean Reward 25.1038961039
Max reward so far:  87.0
Episode:  77
Reward:  27.0
Mean Reward 25.1282051282
Max reward so far:  87.0
Episode:  78
Reward:  53.0
Mean Reward 25.4810126582
Max reward so far:  87.0
Episode:  79
Reward:  10.0
Mean Reward 25.2875
Max reward so far:  87.0
Episode:  80
Reward:  24.0
Mean Reward 25.2716049383
Max reward so far:  87.0
Episode:  81
Reward:  16.0
Mean Reward 25.1585365854
Max reward so far:  87.0


Episode:  141
Reward:  72.0
Mean Reward 30.6549295775
Max reward so far:  136.0
Episode:  142
Reward:  36.0
Mean Reward 30.6923076923
Max reward so far:  136.0
Episode:  143
Reward:  28.0
Mean Reward 30.6736111111
Max reward so far:  136.0
Episode:  144
Reward:  27.0
Mean Reward 30.6482758621
Max reward so far:  136.0
Episode:  145
Reward:  93.0
Mean Reward 31.0753424658
Max reward so far:  136.0
Episode:  146
Reward:  20.0
Mean Reward 31.0
Max reward so far:  136.0
Episode:  147
Reward:  80.0
Mean Reward 31.3310810811
Max reward so far:  136.0
Episode:  148
Reward:  105.0
Mean Reward 31.8255033557
Max reward so far:  136.0
Episode:  149
Reward:  16.0
Mean Reward 31.72
Max reward so far:  136.0
Episode:  150
Reward:  93.0
Mean Reward 32.1258278146
Max reward so far:  136.0
Episode:  151
Reward:  71.0
Mean Reward 32.3815789474
Max reward so far:  136.0
Episode:  152
Reward:  25.0
Mean Reward 32.3333333333
Max reward so far:  136.0
Episode:  153
Reward:  73.0
Mean Reward 32.5974025974
Ma

Episode:  211
Reward:  47.0
Mean Reward 40.1698113208
Max reward so far:  239.0
Episode:  212
Reward:  17.0
Mean Reward 40.0610328638
Max reward so far:  239.0
Episode:  213
Reward:  152.0
Mean Reward 40.5841121495
Max reward so far:  239.0
Episode:  214
Reward:  121.0
Mean Reward 40.9581395349
Max reward so far:  239.0
Episode:  215
Reward:  104.0
Mean Reward 41.25
Max reward so far:  239.0
Episode:  216
Reward:  86.0
Mean Reward 41.4562211982
Max reward so far:  239.0
Episode:  217
Reward:  76.0
Mean Reward 41.6146788991
Max reward so far:  239.0
Episode:  218
Reward:  137.0
Mean Reward 42.0502283105
Max reward so far:  239.0
Episode:  219
Reward:  21.0
Mean Reward 41.9545454545
Max reward so far:  239.0
Episode:  220
Reward:  195.0
Mean Reward 42.6470588235
Max reward so far:  239.0
Episode:  221
Reward:  150.0
Mean Reward 43.1306306306
Max reward so far:  239.0
Episode:  222
Reward:  183.0
Mean Reward 43.7578475336
Max reward so far:  239.0
Episode:  223
Reward:  14.0
Mean Reward 4

Episode:  279
Reward:  234.0
Mean Reward 64.7035714286
Max reward so far:  373.0
Episode:  280
Reward:  240.0
Mean Reward 65.3274021352
Max reward so far:  373.0
Episode:  281
Reward:  172.0
Mean Reward 65.7056737589
Max reward so far:  373.0
Episode:  282
Reward:  139.0
Mean Reward 65.964664311
Max reward so far:  373.0
Episode:  283
Reward:  157.0
Mean Reward 66.2852112676
Max reward so far:  373.0
Episode:  284
Reward:  195.0
Mean Reward 66.7368421053
Max reward so far:  373.0
Episode:  285
Reward:  330.0
Mean Reward 67.6573426573
Max reward so far:  373.0
Episode:  286
Reward:  202.0
Mean Reward 68.1254355401
Max reward so far:  373.0
Episode:  287
Reward:  194.0
Mean Reward 68.5625
Max reward so far:  373.0
Episode:  288
Reward:  281.0
Mean Reward 69.2975778547
Max reward so far:  373.0
Episode:  289
Reward:  172.0
Mean Reward 69.6517241379
Max reward so far:  373.0
Episode:  290
Reward:  321.0
Mean Reward 70.5154639175
Max reward so far:  373.0
Episode:  291
Reward:  156.0
Mean R

Episode:  347
Reward:  211.0
Mean Reward 137.82183908
Max reward so far:  2382.0
Episode:  348
Reward:  266.0
Mean Reward 138.189111748
Max reward so far:  2382.0
Episode:  349
Reward:  254.0
Mean Reward 138.52
Max reward so far:  2382.0
Episode:  350
Reward:  216.0
Mean Reward 138.740740741
Max reward so far:  2382.0
Episode:  351
Reward:  335.0
Mean Reward 139.298295455
Max reward so far:  2382.0
Episode:  352
Reward:  257.0
Mean Reward 139.631728045
Max reward so far:  2382.0
Episode:  353
Reward:  198.0
Mean Reward 139.796610169
Max reward so far:  2382.0
Episode:  354
Reward:  220.0
Mean Reward 140.022535211
Max reward so far:  2382.0
Episode:  355
Reward:  146.0
Mean Reward 140.039325843
Max reward so far:  2382.0
Episode:  356
Reward:  199.0
Mean Reward 140.204481793
Max reward so far:  2382.0
Episode:  357
Reward:  198.0
Mean Reward 140.365921788
Max reward so far:  2382.0
Episode:  358
Reward:  201.0
Mean Reward 140.534818942
Max reward so far:  2382.0
Episode:  359
Reward:  1

Episode:  413
Reward:  158.0
Mean Reward 140.355072464
Max reward so far:  2382.0
Episode:  414
Reward:  175.0
Mean Reward 140.438554217
Max reward so far:  2382.0
Episode:  415
Reward:  164.0
Mean Reward 140.495192308
Max reward so far:  2382.0
Episode:  416
Reward:  195.0
Mean Reward 140.625899281
Max reward so far:  2382.0
Episode:  417
Reward:  150.0
Mean Reward 140.648325359
Max reward so far:  2382.0
Episode:  418
Reward:  145.0
Mean Reward 140.658711217
Max reward so far:  2382.0
Episode:  419
Reward:  185.0
Mean Reward 140.764285714
Max reward so far:  2382.0
Episode:  420
Reward:  203.0
Mean Reward 140.912114014
Max reward so far:  2382.0
Episode:  421
Reward:  151.0
Mean Reward 140.936018957
Max reward so far:  2382.0
Episode:  422
Reward:  145.0
Mean Reward 140.945626478
Max reward so far:  2382.0
Episode:  423
Reward:  157.0
Mean Reward 140.983490566
Max reward so far:  2382.0
Episode:  424
Reward:  183.0
Mean Reward 141.082352941
Max reward so far:  2382.0
Episode:  425
Re

Episode:  480
Reward:  249.0
Mean Reward 143.424116424
Max reward so far:  2382.0
Episode:  481
Reward:  272.0
Mean Reward 143.690871369
Max reward so far:  2382.0
Episode:  482
Reward:  299.0
Mean Reward 144.01242236
Max reward so far:  2382.0
Episode:  483
Reward:  496.0
Mean Reward 144.739669421
Max reward so far:  2382.0
Episode:  484
Reward:  238.0
Mean Reward 144.931958763
Max reward so far:  2382.0
Episode:  485
Reward:  454.0
Mean Reward 145.567901235
Max reward so far:  2382.0
Episode:  486
Reward:  321.0
Mean Reward 145.928131417
Max reward so far:  2382.0
Episode:  487
Reward:  337.0
Mean Reward 146.319672131
Max reward so far:  2382.0
Episode:  488
Reward:  378.0
Mean Reward 146.793456033
Max reward so far:  2382.0
Episode:  489
Reward:  242.0
Mean Reward 146.987755102
Max reward so far:  2382.0
Episode:  490
Reward:  363.0
Mean Reward 147.427698574
Max reward so far:  2382.0
Episode:  491
Reward:  247.0
Mean Reward 147.630081301
Max reward so far:  2382.0
Episode:  492
Rew

Episode:  547
Reward:  150.0
Mean Reward 162.062043796
Max reward so far:  2382.0
Episode:  548
Reward:  139.0
Mean Reward 162.02003643
Max reward so far:  2382.0
Episode:  549
Reward:  140.0
Mean Reward 161.98
Max reward so far:  2382.0
Episode:  550
Reward:  140.0
Mean Reward 161.940108893
Max reward so far:  2382.0
Episode:  551
Reward:  139.0
Mean Reward 161.898550725
Max reward so far:  2382.0
Episode:  552
Reward:  136.0
Mean Reward 161.851717902
Max reward so far:  2382.0
Episode:  553
Reward:  126.0
Mean Reward 161.78700361
Max reward so far:  2382.0
Episode:  554
Reward:  137.0
Mean Reward 161.742342342
Max reward so far:  2382.0
Episode:  555
Reward:  126.0
Mean Reward 161.678057554
Max reward so far:  2382.0
Episode:  556
Reward:  127.0
Mean Reward 161.615798923
Max reward so far:  2382.0
Episode:  557
Reward:  128.0
Mean Reward 161.555555556
Max reward so far:  2382.0
Episode:  558
Reward:  113.0
Mean Reward 161.468694097
Max reward so far:  2382.0
Episode:  559
Reward:  11

Episode:  614
Reward:  304.0
Mean Reward 165.920325203
Max reward so far:  2382.0
Episode:  615
Reward:  489.0
Mean Reward 166.444805195
Max reward so far:  2382.0
Episode:  616
Reward:  395.0
Mean Reward 166.815235008
Max reward so far:  2382.0
Episode:  617
Reward:  421.0
Mean Reward 167.226537217
Max reward so far:  2382.0
Episode:  618
Reward:  348.0
Mean Reward 167.518578352
Max reward so far:  2382.0
Episode:  619
Reward:  375.0
Mean Reward 167.853225806
Max reward so far:  2382.0
Episode:  620
Reward:  618.0
Mean Reward 168.578099839
Max reward so far:  2382.0
Episode:  621
Reward:  407.0
Mean Reward 168.961414791
Max reward so far:  2382.0
Episode:  622
Reward:  400.0
Mean Reward 169.332263242
Max reward so far:  2382.0
Episode:  623
Reward:  463.0
Mean Reward 169.802884615
Max reward so far:  2382.0
Episode:  624
Reward:  495.0
Mean Reward 170.3232
Max reward so far:  2382.0
Episode:  625
Reward:  424.0
Mean Reward 170.728434505
Max reward so far:  2382.0
Episode:  626
Reward:

Episode:  680
Reward:  1072.0
Mean Reward 363.635829662
Max reward so far:  5925.0
Episode:  681
Reward:  762.0
Mean Reward 364.219941349
Max reward so far:  5925.0
Episode:  682
Reward:  629.0
Mean Reward 364.60761347
Max reward so far:  5925.0
Episode:  683
Reward:  684.0
Mean Reward 365.074561404
Max reward so far:  5925.0
Episode:  684
Reward:  720.0
Mean Reward 365.59270073
Max reward so far:  5925.0
Episode:  685
Reward:  545.0
Mean Reward 365.854227405
Max reward so far:  5925.0
Episode:  686
Reward:  608.0
Mean Reward 366.206695779
Max reward so far:  5925.0
Episode:  687
Reward:  433.0
Mean Reward 366.30377907
Max reward so far:  5925.0
Episode:  688
Reward:  552.0
Mean Reward 366.57329463
Max reward so far:  5925.0
Episode:  689
Reward:  489.0
Mean Reward 366.750724638
Max reward so far:  5925.0
Episode:  690
Reward:  465.0
Mean Reward 366.892908828
Max reward so far:  5925.0
Episode:  691
Reward:  559.0
Mean Reward 367.170520231
Max reward so far:  5925.0
Episode:  692
Rewar

KeyboardInterrupt: 