# Cartpole: REINFORCE Monte Carlo Policy Gradients

In this notebook we'll implement an agent <b>that plays Cartpole</b>.

<img src="http://neuro-educator.com/wp-content/uploads/2017/09/DQN.gif" alt="Cartpole gif"/>


# This is a notebook from [Deep Reinforcement Learning Course with Tensorflow](https://simoninithomas.github.io/Deep_reinforcement_learning_Course/)
<img src="https://raw.githubusercontent.com/simoninithomas/Deep_reinforcement_learning_Course/master/docs/assets/img/DRLC%20Environments.png" alt="Deep Reinforcement Course"/>
<br>
<p>  Deep Reinforcement Learning Course is a free series of articles and video tutorials 🆕 about Deep Reinforcement Learning, where **we'll learn the main algorithms (Q-learning, Deep Q Nets, Dueling Deep Q Nets, Policy Gradients, A2C, Proximal Policy Optimization…), and how to implement them with Tensorflow.**
<br><br>
    
📜The articles explain the architectures from the big picture to the mathematical details behind them.
<br>
📹 The videos explain how to build the agents with Tensorflow.</p>
<br>
This course will give you a **solid foundation for understanding and implementing future state-of-the-art algorithms**. You'll also build a strong professional portfolio by creating **agents that learn to play awesome environments**: Doom© 👹, Space Invaders 👾, Outrun, Sonic the Hedgehog©, Michael Jackson’s Moonwalker, agents that navigate 3D environments with DeepMind Lab (Quake), and agents that walk with MuJoCo.
<br><br>
</p> 

## 📚 The complete [Syllabus HERE](https://simoninithomas.github.io/Deep_reinforcement_learning_Course/)


## Any questions 👨‍💻
<p> If you have any questions, feel free to ask me: </p>
<p> 📧: <a href="mailto:hello@simoninithomas.com">hello@simoninithomas.com</a>  </p>
<p> Github: https://github.com/simoninithomas/Deep_reinforcement_learning_Course </p>
<p> 🌐 : https://simoninithomas.github.io/Deep_reinforcement_learning_Course/ </p>
<p> Twitter: <a href="https://twitter.com/ThomasSimonini">@ThomasSimonini</a> </p>
<p> Don't forget to <b> follow me on <a href="https://twitter.com/ThomasSimonini">twitter</a>, <a href="https://github.com/simoninithomas/Deep_reinforcement_learning_Course">github</a> and <a href="https://medium.com/@thomassimonini">Medium</a> to be alerted of the new articles that I publish </b></p>
    
## How to help  🙌
3 ways:
- **Clap our articles and like our videos a lot**: Clapping on Medium means that you really like our articles, and the more claps we have, the more our articles are shared. Liking our videos helps them become much more visible to the deep learning community.
- **Share and speak about our articles and videos**: By sharing our articles and videos you help us to spread the word.
- **Improve our notebooks**: if you find a bug or **a better implementation**, you can send a pull request.
<br>

## Important note 🤔
<b>You can run it on your computer, but it's better to run it on GPU-based services</b>. Personally, I use Microsoft Azure and their Deep Learning Virtual Machine (they offer $170):
https://azuremarketplace.microsoft.com/en-us/marketplace/apps/microsoft-ads.dsvm-deep-learning
<br>
⚠️ I don't have any business relations with them. I just loved their excellent customer service.

If you have trouble using Microsoft Azure, follow the explanations in this excellent article (skipping the last part about fast.ai): https://medium.com/@manikantayadunanda/setting-up-deeplearning-machine-and-fast-ai-on-azure-a22eb6bd6429

## Prerequisites 🏗️
Before diving into the notebook, **you need to understand**:
- The foundations of reinforcement learning (MC, TD, reward hypothesis...) [Article](https://medium.freecodecamp.org/an-introduction-to-reinforcement-learning-4339519de419)
- Policy gradients [Article](https://medium.freecodecamp.org/an-introduction-to-policy-gradients-with-cartpole-and-doom-495b5ef2207f)

## Step 1: Import the libraries 📚

In [None]:
!pip install gym

In [None]:
import tensorflow as tf
import numpy as np
import gym

## Step 2: Create our environment 🎮
This time we use <a href="https://gym.openai.com/">OpenAI Gym</a>, which has a lot of great environments.

In [None]:
env = gym.make('CartPole-v0')
env = env.unwrapped
# Policy gradient has high variance, so seed for reproducibility
env.seed(1)

## Step 3: Set up our hyperparameters ⚗️

In [5]:
## ENVIRONMENT
state_size = 4
action_size = env.action_space.n

## TRAINING
max_episodes = 300
learning_rate = 0.01
gamma = 0.95
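To get a feel for what `gamma = 0.95` means, the sketch below computes the weight a reward receives when it arrives `k` steps in the future. The helper name `discount_weight` is made up for this illustration and is not part of the notebook.

```python
def discount_weight(gamma, k):
    """Weight applied to a reward received k steps in the future."""
    return gamma ** k


gamma = 0.95  # same value as in the hyperparameters above

# Print how quickly the weight decays with the delay k.
for k in (0, 1, 10, 50):
    print(k, round(discount_weight(gamma, k), 4))

# Sum of all weights over an infinite horizon: 1 / (1 - gamma).
effective_horizon = 1.0 / (1.0 - gamma)
print("effective horizon:", round(effective_horizon, 2))
```

With `gamma = 0.95` the weights sum to roughly 20, so rewards that arrive many dozens of steps later contribute very little to the return.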

## Step 4 : Define the preprocessing functions ⚙️
This function takes <b>the rewards of an episode, discounts them, and normalizes them.</b>

In [6]:
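def discount_and_normalize_rewards(episode_rewards):
    # Use a float array so integer rewards are not truncated when discounted
    discounted_episode_rewards = np.zeros_like(episode_rewards, dtype=np.float64)
    cumulative = 0.0

    # Walk backwards through the episode: G_t = r_t + gamma * G_{t+1}
    for i in reversed(range(len(episode_rewards))):
        cumulative = cumulative * gamma + episode_rewards[i]
        discounted_episode_rewards[i] = cumulative

    # Normalize to zero mean and unit standard deviation
    mean = np.mean(discounted_episode_rewards)
    std = np.std(discounted_episode_rewards)
    # Small epsilon guards against division by zero (e.g. a one-step episode)
    discounted_episode_rewards = (discounted_episode_rewards - mean) / (std + 1e-8)

    return discounted_episode_rewards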
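As a small worked example of the discounting and normalization steps, here is a plain-Python sketch on a three-step episode. The names `discounted_returns` and `normalize` are hypothetical helpers written for this illustration; they follow the same recurrence as the function above but are not the notebook's code.

```python
def discounted_returns(rewards, gamma):
    """Return G_t = r_t + gamma * G_{t+1} for every step t of one episode."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step can reuse the return of the step after it.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def normalize(values):
    """Shift to zero mean and scale by the population standard deviation."""
    mean = sum(values) / len(values)
    std = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
    return [(v - mean) / std for v in values]


# Three steps with reward 1 each, as Cartpole gives for every step survived.
returns = discounted_returns([1.0, 1.0, 1.0], gamma=0.95)
normalized = normalize(returns)
print(returns)
print(normalized)
```

Early steps get the largest return because they are followed by the most reward; after normalization, the steps with above-average return end up positive and the rest negative.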

## Step 5: Create our Policy Gradient Neural Network model 🧠

<img src="https://raw.githubusercontent.com/simoninithomas/Deep_reinforcement_learning_Course/master/Policy%20Gradients/Cartpole/assets/catpole.png">

The idea is simple:
- Our state, which is an array of 4 values, will be used as an input.
- Our NN has 3 fully connected layers.
- Our output activation function is softmax, which squashes the outputs into a probability distribution (for instance, if we have 4, 2, 6 --> softmax --> (0.117, 0.016, 0.867))
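The softmax step can be checked by hand with a few lines of plain Python. This is only a sketch of the math; in the notebook itself the computation is done by `tf.nn.softmax`, and the function name `softmax` below is defined locally for illustration.

```python
import math


def softmax(logits):
    """Turn a list of real-valued scores into probabilities that sum to 1."""
    # Subtracting the max does not change the result but avoids overflow.
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([4.0, 2.0, 6.0])
print([round(p, 3) for p in probs])
```

The largest score takes most of the probability mass, and the three outputs always add up to 1.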

In [7]:
with tf.name_scope('inputs'):
    input_ = tf.placeholder(tf.float32, [None, state_size], name='input_')
    actions = tf.placeholder(tf.int32, [None, action_size], name='actions')
    discounted_episode_rewards_ = tf.placeholder(tf.float32, [None,], name='discounted_episode_rewards')
    
    # Placeholder for variable in tensorboard
    mean_reward_ = tf.placeholder(tf.float32, name='mean_reward')
    
    with tf.name_scope('fc1'):
        fc1 = tf.contrib.layers.fully_connected(
            inputs = input_,
            num_outputs = 10,
            activation_fn = tf.nn.relu,
            weights_initializer = tf.contrib.layers.xavier_initializer(),
        )
        
    with tf.name_scope('fc2'):
        fc2 = tf.contrib.layers.fully_connected(
            inputs = fc1,
            num_outputs = action_size,
            activation_fn = tf.nn.relu,
            weights_initializer = tf.contrib.layers.xavier_initializer(),
        )
        
    with tf.name_scope('fc3'):
        fc3 = tf.contrib.layers.fully_connected(
            inputs = fc2,
            num_outputs = action_size,
            activation_fn = None,
            weights_initializer = tf.contrib.layers.xavier_initializer(),
        )
        
    with tf.name_scope('softmax'):
        action_distribution = tf.nn.softmax(fc3)
        
    with tf.name_scope('loss'):
        # Cross-entropy between the selected action (given as one-hot)
        # and softmax(fc3), where fc3 holds the "logits", reduces to
        # -log(probability of the selected action).
        # It is larger for lower-probability actions, so when the
        # return is positive those actions get the biggest boost.
        neg_log_prob = tf.nn.softmax_cross_entropy_with_logits_v2(
            logits = fc3,
            labels = actions,
        )
        # Minimizing this mean raises the probability of actions
        # followed by high returns (because of the neg_ sign);
        # the update is bigger when the return is bigger.
        loss = tf.reduce_mean(neg_log_prob * discounted_episode_rewards_)
        
    with tf.name_scope('train'):
        train_opt = tf.train.AdamOptimizer(learning_rate).minimize(loss)
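To make the loss concrete, here is a plain-Python sketch of the same quantity for a two-step episode: the mean of `-log(probability of the chosen action)` times the return. The function name `reinforce_loss` and the numbers are made up for illustration; the notebook computes this with `tf.nn.softmax_cross_entropy_with_logits_v2` on one-hot actions.

```python
import math


def reinforce_loss(probs, actions, returns):
    """Mean of -log(pi(a_t | s_t)) * G_t over the steps of an episode.

    probs:   per-step action probabilities (softmax outputs)
    actions: index of the action actually taken at each step
    returns: discounted (and possibly normalized) return for each step
    """
    total = 0.0
    for p, a, g in zip(probs, actions, returns):
        # Cross-entropy against a one-hot label keeps only the chosen action.
        neg_log_prob = -math.log(p[a])
        total += neg_log_prob * g
    return total / len(actions)


probs = [[0.9, 0.1], [0.5, 0.5]]   # illustrative policy outputs
actions = [0, 1]                   # actions taken at each step
returns = [1.0, -1.0]              # illustrative normalized returns
loss = reinforce_loss(probs, actions, returns)
print(loss)
```

Minimizing this value pushes up the probability of actions with positive return and pushes down the probability of actions with negative return.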


## Step 6: Set up Tensorboard 📊
For more information about TensorBoard, please watch this <a href="https://www.youtube.com/embed/eBbEDRsCmv4">excellent 30-minute tutorial</a>. <br><br>
To launch TensorBoard: `tensorboard --logdir=/tensorboard/pg/1`

## Step 7: Train our Agent 🏃‍♂️