# Homework Assignment 8 - Monte Carlo Policy Gradients <a class="tocSkip">

See Thomas Simonini’s example [here](https://github.com/simoninithomas/Deep_reinforcement_learning_Course/blob/master/Policy%20Gradients/Cartpole/Cartpole%20REINFORCE%20Monte%20Carlo%20Policy%20Gradients.ipynb): replicate it but for a different environment — Lunar Lander!

<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Mountain-Car:-REINFORCE-Monte-Carlo-Policy-Gradients" data-toc-modified-id="Mountain-Car:-REINFORCE-Monte-Carlo-Policy-Gradients-1">Mountain Car: REINFORCE Monte Carlo Policy Gradients</a></span></li><li><span><a href="#This-is-a-notebook-from-Deep-Reinforcement-Learning-Course-with-Tensorflow" data-toc-modified-id="This-is-a-notebook-from-Deep-Reinforcement-Learning-Course-with-Tensorflow-2">This is a notebook from <a href="https://simoninithomas.github.io/Deep_reinforcement_learning_Course/" target="_blank">Deep Reinforcement Learning Course with Tensorflow</a></a></span><ul class="toc-item"><li><span><a href="#Step-1:-Import-the-libraries" data-toc-modified-id="Step-1:-Import-the-libraries-2.1">Step 1: Import the libraries</a></span></li><li><span><a href="#Step-2:-Create-our-environment" data-toc-modified-id="Step-2:-Create-our-environment-2.2">Step 2: Create our environment</a></span></li><li><span><a href="#Step-3:-Set-up-our-hyperparameters" data-toc-modified-id="Step-3:-Set-up-our-hyperparameters-2.3">Step 3: Set up our hyperparameters</a></span></li><li><span><a href="#Step-4-:-Define-the-preprocessing-functions️" data-toc-modified-id="Step-4-:-Define-the-preprocessing-functions️-2.4">Step 4 : Define the preprocessing functions️</a></span></li><li><span><a href="#Step-5:-Create-our-Policy-Gradient-Neural-Network-model" data-toc-modified-id="Step-5:-Create-our-Policy-Gradient-Neural-Network-model-2.5">Step 5: Create our Policy Gradient Neural Network model</a></span></li><li><span><a href="#Step-6:-Set-up-Tensorboard" data-toc-modified-id="Step-6:-Set-up-Tensorboard-2.6">Step 6: Set up Tensorboard</a></span></li><li><span><a href="#Step-7:-Train-our-Agent" data-toc-modified-id="Step-7:-Train-our-Agent-2.7">Step 7: Train our Agent</a></span></li><li><span><a href="#Step-8:-Evaluate-our-trained-model" data-toc-modified-id="Step-8:-Evaluate-our-trained-model-2.8">Step 8: Evaluate our trained 
model</a></span></li></ul></li><li><span><a href="#Report" data-toc-modified-id="Report-3">Report</a></span></li></ul></div>

# Mountain Car: REINFORCE Monte Carlo Policy Gradients

In this notebook we'll implement an agent that plays **[MountainCar-v0](https://gym.openai.com/envs/MountainCar-v0/)**.

<br/>

<video controls src="../assets/mountain_car.mp4" />

# This is a notebook from [Deep Reinforcement Learning Course with Tensorflow](https://simoninithomas.github.io/Deep_reinforcement_learning_Course/)

The original notebook, with a solution for CartPole, is [here](https://github.com/simoninithomas/Deep_reinforcement_learning_Course/blob/master/Policy%20Gradients/Cartpole/Cartpole%20REINFORCE%20Monte%20Carlo%20Policy%20Gradients.ipynb).


## Step 1: Import the libraries

In [1]:
import tensorflow as tf
import numpy as np
import gym

## Step 2: Create our environment
This time we use <a href="https://gym.openai.com/">OpenAI Gym</a> which has a lot of great environments.

In [2]:
# env_name = 'CartPole-v0'
env_name = 'MountainCar-v0'
# env = gym.make('MountainCar-v0')
env = gym.make(env_name)
env = env.unwrapped
# Policy gradients have high variance; seed the environment for reproducibility
env.seed(1);

WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.


## Step 3: Set up our hyperparameters 

In [3]:
## ENVIRONMENT Hyperparameters
state_size = env.observation_space.shape[0]
action_size = env.action_space.n
# print(action_size, state_size)
## TRAINING Hyperparameters
max_episodes = 300
learning_rate = 0.01
STEP_MULTIPLE = 3.0
gamma = 0.95 # Discount rate

## Step 4 : Define the preprocessing functions️
This function <b>discounts the episode rewards and then normalizes them</b> to zero mean and unit variance.

In [4]:
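def discount_and_normalize_rewards(episode_rewards):
    # Use a float array so the running sum is never truncated to integers
    discounted_episode_rewards = np.zeros(len(episode_rewards), dtype=np.float64)
    cumulative = 0.0
    # Walk backwards so each entry holds the discounted return from that step onward
    for i in reversed(range(len(episode_rewards))):
        cumulative = cumulative * gamma + episode_rewards[i]
        discounted_episode_rewards[i] = cumulative

    # Normalize to zero mean and unit variance
    mean = np.mean(discounted_episode_rewards)
    std = np.std(discounted_episode_rewards)
    # The small constant avoids division by zero when all returns are equal
    discounted_episode_rewards = (discounted_episode_rewards - mean) / (std + 1e-8)

    return discounted_episode_rewards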
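
To make the discounting concrete, here is a minimal pure-Python sketch of the same backward recursion on a three-step episode. The reward values and the helper names `discounted_returns` and `normalize` are illustrative assumptions, not part of the notebook; `gamma = 0.95` matches the hyperparameter set in Step 3.

```python
import math

def discounted_returns(rewards, gamma=0.95):
    """Compute G_t = r_t + gamma * G_{t+1} for every step t (hypothetical helper)."""
    returns = [0.0] * len(rewards)
    cumulative = 0.0
    for i in reversed(range(len(rewards))):
        cumulative = cumulative * gamma + rewards[i]
        returns[i] = cumulative
    return returns

def normalize(values, eps=1e-8):
    """Shift to zero mean and scale by the population standard deviation."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / (std + eps) for v in values]

# Illustrative episode: two ordinary steps (-1 each), then a step with a +10 bonus (-1 + 10 = 9)
rewards = [-1.0, -1.0, 9.0]
returns = discounted_returns(rewards)
print(returns)             # approximately [6.1725, 7.55, 9.0]
print(normalize(returns))  # same ordering, zero mean, roughly unit variance
```

Because the recursion runs from the last step backwards, a late bonus also raises the return credited to earlier actions, scaled down by `gamma` for each step of distance.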

## Step 5: Create our Policy Gradient Neural Network model

The idea is simple:
- Our state, an array of 2 values (**position** and **velocity**), is used as the input.
- Our NN has 3 fully connected layers.
- Our output activation function is **softmax**, which squashes the outputs to a probability distribution:
    - for instance: $ softmax(4,\ 2,\ 6) \rightarrow (0.117,\ 0.016,\ 0.867) $
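
The softmax figures in the example can be checked with a few lines of plain Python. This is a sketch, not the notebook's TensorFlow code; the helper name `softmax` is our own.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    shift = max(logits)  # subtracting the max does not change the result
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([4.0, 2.0, 6.0])
print([round(p, 3) for p in probs])  # [0.117, 0.016, 0.867]
```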

<br/>

<img src="../assets/mountain_car.jpeg">

In [5]:
with tf.device("/device:GPU:1"):
    with tf.name_scope("inputs"):
        input_ = tf.placeholder(tf.float32, [None, state_size], name="input_")
        actions = tf.placeholder(tf.int32, [None, action_size], name="actions")
        discounted_episode_rewards_ = tf.placeholder(
            tf.float32, [None,], name="discounted_episode_rewards")

        # Placeholder so the mean reward can be logged to TensorBoard
        mean_reward_ = tf.placeholder(tf.float32 , name="mean_reward")

        with tf.name_scope("fc1"):
            fc1 = tf.contrib.layers.fully_connected(
                inputs = input_, num_outputs = 10,
                activation_fn=tf.nn.relu,
                weights_initializer=tf.contrib.layers.xavier_initializer()) 
        with tf.name_scope("fc2"):
            fc2 = tf.contrib.layers.fully_connected(
                inputs = fc1, num_outputs = action_size,
                activation_fn=tf.nn.relu,
                weights_initializer=tf.contrib.layers.xavier_initializer())

        with tf.name_scope("fc3"):
            fc3 = tf.contrib.layers.fully_connected(
                inputs = fc2, num_outputs = action_size,
                activation_fn= None,
                weights_initializer=tf.contrib.layers.xavier_initializer()) 
            
        with tf.name_scope("softmax"):
            action_distribution = tf.nn.softmax(fc3)

        with tf.name_scope("loss"):
            # tf.nn.softmax_cross_entropy_with_logits computes the cross entropy 
            # of the result after applying the softmax function
            # If you have single-class labels, where an object can only belong to one class,
            # you might now consider using tf.nn.sparse_softmax_cross_entropy_with_logits 
            # so that you don't have to convert your labels to a dense one-hot array. 
            neg_log_prob = tf.nn.softmax_cross_entropy_with_logits_v2(
                logits = fc3, labels = actions)
            loss = tf.reduce_mean(neg_log_prob * discounted_episode_rewards_) 


        with tf.name_scope("train"):
            train_opt = tf.train.AdamOptimizer(learning_rate).minimize(loss)
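
The loss relies on the fact that softmax cross-entropy against a one-hot label equals the negative log-probability of the action taken, so `loss` is the mean of that quantity weighted by the normalized discounted return. Below is a minimal pure-Python sketch of the same computation. The logits, actions, returns, and helper names are made-up illustrations, not values from the notebook.

```python
import math

def log_softmax(logits):
    """Log of the softmax probabilities, computed stably."""
    shift = max(logits)
    log_total = math.log(sum(math.exp(x - shift) for x in logits))
    return [x - shift - log_total for x in logits]

def cross_entropy(logits, one_hot):
    """Softmax cross-entropy between logits and a one-hot label."""
    log_probs = log_softmax(logits)
    return -sum(y * lp for y, lp in zip(one_hot, log_probs))

def reinforce_loss(batch_logits, batch_one_hot, returns):
    """Mean over steps of (negative log-probability of the action * return)."""
    terms = [cross_entropy(l, a) * g
             for l, a, g in zip(batch_logits, batch_one_hot, returns)]
    return sum(terms) / len(terms)

# Two illustrative steps with 3 actions each
batch_logits = [[0.2, -0.1, 0.5], [1.0, 0.0, -1.0]]
batch_one_hot = [[0.0, 0.0, 1.0], [1.0, 0.0, 0.0]]  # actions 2 and 0
returns = [1.0, -1.0]                                # normalized returns

loss = reinforce_loss(batch_logits, batch_one_hot, returns)
print(loss)
```

Minimizing this loss raises the probability of actions followed by above-average returns and lowers it for actions followed by below-average returns.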

## Step 6: Set up Tensorboard
For more information about tensorboard, please watch this <a href="https://www.youtube.com/embed/eBbEDRsCmv4">excellent 30min tutorial</a> <br><br>
To launch TensorBoard, point it at the directory the writer below uses: `tensorboard --logdir=./tensorboard/pg/1`

In [6]:
# Setup TensorBoard Writer
# writer = tf.summary.FileWriter("/tensorboard/pg/1")
!rm -Rf ./tensorboard
writer = tf.summary.FileWriter("./tensorboard/pg/1")

## Losses
tf.summary.scalar("Loss", loss)

## Reward mean
tf.summary.scalar("Reward_mean", mean_reward_)

write_op = tf.summary.merge_all()

## Step 7: Train our Agent 

In [7]:
allRewards = []
total_rewards = 0
maximumRewardRecorded = 0
episode = 0
episode_states, episode_actions, episode_rewards = [],[],[]

saver = tf.train.Saver()

import datetime

step_max = 20000
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    
    for episode in range(max_episodes + 10):
        # Reset per episode so the first step is not compared with the
        # previous episode's final velocity
        old_position, old_velocity = None, None
        
        episode_rewards_sum = 0

        # Launch the game
        state = env.reset()
        start_time = datetime.datetime.now()
        print("==========================================")
        print(f"The starting state is : {state}")
        
        #env.render()
        counter = 0
        episode_max_pos, episode_min_pos = float("-2.0"), float("2.0")
        direction_change_counter = 0
        fail = False
        while True:
            # Choose action a. The policy is stochastic: the network outputs
            # action probabilities and we sample from them.

            action_probability_distribution = sess.run(
                action_distribution, feed_dict={input_: state.reshape([1,state_size])})
                # select action w.r.t the actions prob 
            action = np.random.choice(
                range(
                    action_probability_distribution.shape[1]),
                    p=action_probability_distribution.ravel())

            new_state, reward, done, info = env.step(action)

            if old_position is None:
                old_position, old_velocity = new_state
            else:
                old_position, old_velocity = position, velocity
                
            position, velocity = new_state
            velocity_sign = velocity * old_velocity
            
            bonus = 0.0
            if velocity_sign < 0.0:
                new_record = False
                direction_change_counter += 1
                if position > episode_max_pos:
                    episode_max_pos = position
                    new_record = True
                elif position < episode_min_pos:
                    episode_min_pos = position
                    new_record = True

                if new_record:
                    bonus = 10.0  # bonus for gaining potential energy
                else:
                    bonus = -2.0  # penalty for wasting potential energy

            reward += bonus
            
            counter += 1
            #if counter == 10:
            #    break
            # Store s, a, r
            episode_states.append(state)
                        
            # The loss expects a one-hot vector for the action taken, not just
            # its index. With 3 actions, action index 2 becomes [0., 0., 1.]
            action_ = np.zeros(action_size)
            action_[action] = 1
            
            episode_actions.append(action_)
            
            episode_rewards.append(reward)
           
            if counter % 1000 == 0:
                print(f"Step: {counter}")
        
            if counter >= step_max:
                # Bad Ending
                if episode <= max_episodes:
                    done = True
                    fail = True
                else:
                    step_max = 1000000
                    
                
            if done:
               
                if counter < step_max / STEP_MULTIPLE :
                    step_max = counter * STEP_MULTIPLE
                    print(f"The new step_max is : {step_max}")
                    
                # Calculate sum reward
                episode_rewards_sum = np.sum(episode_rewards)
                
                allRewards.append(episode_rewards_sum)
                
                total_rewards = np.sum(allRewards)
                
                # Mean reward
                mean_reward = np.divide(total_rewards, episode+1)
                
                
                maximumRewardRecorded = np.amax(allRewards)
                end_time = datetime.datetime.now()
                print("==========================================")
                print(f"Total Time: {end_time - start_time}")
                print(f"Step max: {step_max}")
                print(f"Fail: {fail}")
                print("Episode: ", episode)
                print("Reward: ", episode_rewards_sum)
                print("Mean Reward", mean_reward)
                print("Max reward so far: ", maximumRewardRecorded)
                
                # Calculate discounted reward
                discounted_episode_rewards = discount_and_normalize_rewards(episode_rewards)
                                
                # Feedforward, gradient and backpropagation
                loss_, _ = sess.run(
                    [loss, train_opt],
                    feed_dict={
                        input_: np.vstack(np.array(episode_states)),
                        actions: np.vstack(np.array(episode_actions)),
                        discounted_episode_rewards_: discounted_episode_rewards 
                    }
                )
                
                # Write TF Summaries
                summary = sess.run(write_op,
                   feed_dict={
                       input_: np.vstack(np.array(episode_states)),
                       actions: np.vstack(np.array(episode_actions)),
                       discounted_episode_rewards_: discounted_episode_rewards,
                       mean_reward_: mean_reward
                   }
                )
                
                writer.add_summary(summary, episode)
                writer.flush()
                
                # Reset the transition stores
                episode_states, episode_actions, episode_rewards = [],[],[]
                
                break
            
            state = new_state
        
        # Save Model
        if episode % 100 == 0:
            saver.save(sess, "./models/model.ckpt")
            print("Model saved")

The starting state is : [-0.43852191  0.        ]
Step: 1000
Step: 2000
Step: 3000
Step: 4000
Step: 5000
Step: 6000
Step: 7000
Step: 8000
Step: 9000
Step: 10000
Step: 11000
Step: 12000
Step: 13000
Step: 14000
Step: 15000
Total Time: 0:00:12.609279
Step max: 20000
Fail: False
Episode:  0
Reward:  -15466.0
Mean Reward -15466.0
Max reward so far:  -15466.0
Model saved
The starting state is : [-0.49709999  0.        ]
Step: 1000
Step: 2000
Step: 3000
Step: 4000
Step: 5000
Step: 6000
Step: 7000
Step: 8000
Step: 9000
Step: 10000
Step: 11000
Step: 12000
Step: 13000
Step: 14000
Step: 15000
Step: 16000
Step: 17000
Step: 18000
Step: 19000
Step: 20000
Total Time: 0:00:15.978589
Step max: 20000
Fail: True
Episode:  1
Reward:  -20488.0
Mean Reward -17977.0
Max reward so far:  -15466.0
The starting state is : [-0.56177637  0.        ]
Step: 1000
Step: 2000
Step: 3000
Step: 4000
Step: 5000
Step: 6000
Step: 7000
Step: 8000
Step: 9000
Step: 10000
Step: 11000
Step: 12000
Step: 13000
Step: 14000
Step: 15

Step: 1000
Step: 2000
Step: 3000
Total Time: 0:00:02.913585
Step max: 4905.0
Fail: False
Episode:  23
Reward:  -3136.0
Mean Reward -7677.791666666667
Max reward so far:  -1393.0
The starting state is : [-0.58680719  0.        ]
Step: 1000
Step: 2000
Step: 3000
Step: 4000
Total Time: 0:00:03.951491
Step max: 4905.0
Fail: True
Episode:  24
Reward:  -4799.0
Mean Reward -7562.64
Max reward so far:  -1393.0
The starting state is : [-0.47480625  0.        ]
Step: 1000
Step: 2000
Total Time: 0:00:01.709857
Step max: 4905.0
Fail: False
Episode:  25
Reward:  -1880.0
Mean Reward -7344.076923076923
Max reward so far:  -1393.0
The starting state is : [-0.55947751  0.        ]
Step: 1000
Step: 2000
Step: 3000
Step: 4000
Total Time: 0:00:04.115505
Step max: 4905.0
Fail: True
Episode:  26
Reward:  -4751.0
Mean Reward -7248.037037037037
Max reward so far:  -1393.0
The starting state is : [-0.44888659  0.        ]
Step: 1000
Step: 2000
Step: 3000
Step: 4000
Total Time: 0:00:04.106099
Step max: 4905.0
F

Step: 1000
Step: 2000
Total Time: 0:00:01.719199
Step max: 4905.0
Fail: False
Episode:  49
Reward:  -1956.0
Mean Reward -5738.18
Max reward so far:  -1393.0
The starting state is : [-0.56101593  0.        ]
Step: 1000
Step: 2000
Step: 3000
Step: 4000
Total Time: 0:00:03.630490
Step max: 4905.0
Fail: False
Episode:  50
Reward:  -4189.0
Mean Reward -5707.803921568628
Max reward so far:  -1393.0
The starting state is : [-0.43464614  0.        ]
Step: 1000
Step: 2000
Step: 3000
Total Time: 0:00:02.959416
Step max: 4905.0
Fail: False
Episode:  51
Reward:  -3399.0
Mean Reward -5663.403846153846
Max reward so far:  -1393.0
The starting state is : [-0.4386806  0.       ]
Step: 1000
Step: 2000
Step: 3000
Step: 4000
Total Time: 0:00:03.672737
Step max: 4905.0
Fail: False
Episode:  52
Reward:  -4455.0
Mean Reward -5640.603773584906
Max reward so far:  -1393.0
The starting state is : [-0.46175162  0.        ]
Step: 1000
Step: 2000
Step: 3000
Total Time: 0:00:02.662484
Step max: 4905.0
Fail: False


Step: 1000
Step: 2000
Step: 3000
Total Time: 0:00:02.541368
Step max: 3339.0
Fail: False
Episode:  76
Reward:  -2967.0
Mean Reward -4686.454545454545
Max reward so far:  -933.0
The starting state is : [-0.55060091  0.        ]
The new step_max is : 2907.0
Total Time: 0:00:00.827583
Step max: 2907.0
Fail: False
Episode:  77
Reward:  -765.0
Mean Reward -4636.179487179487
Max reward so far:  -765.0
The starting state is : [-0.54791808  0.        ]
Step: 1000
Total Time: 0:00:01.647247
Step max: 2907.0
Fail: False
Episode:  78
Reward:  -1823.0
Mean Reward -4600.569620253164
Max reward so far:  -765.0
The starting state is : [-0.46029927  0.        ]
Step: 1000
Total Time: 0:00:01.169001
Step max: 2907.0
Fail: False
Episode:  79
Reward:  -1214.0
Mean Reward -4558.2375
Max reward so far:  -765.0
The starting state is : [-0.47634839  0.        ]
Step: 1000
Total Time: 0:00:00.825651
Step max: 2907.0
Fail: False
Episode:  80
Reward:  -891.0
Mean Reward -4512.962962962963
Max reward so far:  -7

The new step_max is : 1995.0
Total Time: 0:00:00.578056
Step max: 1995.0
Fail: False
Episode:  105
Reward:  -489.0
Mean Reward -3711.443396226415
Max reward so far:  -489.0
The starting state is : [-0.56510815  0.        ]
Step: 1000
Total Time: 0:00:01.207080
Step max: 1995.0
Fail: False
Episode:  106
Reward:  -1245.0
Mean Reward -3688.392523364486
Max reward so far:  -489.0
The starting state is : [-0.54550371  0.        ]
Step: 1000
Total Time: 0:00:01.068920
Step max: 1995.0
Fail: False
Episode:  107
Reward:  -1101.0
Mean Reward -3664.435185185185
Max reward so far:  -489.0
The starting state is : [-0.58676268  0.        ]
Step: 1000
Total Time: 0:00:01.601630
Step max: 1995.0
Fail: False
Episode:  108
Reward:  -1762.0
Mean Reward -3646.9816513761466
Max reward so far:  -489.0
The starting state is : [-0.45143215  0.        ]
The new step_max is : 1890.0
Total Time: 0:00:00.626118
Step max: 1890.0
Fail: False
Episode:  109
Reward:  -500.0
Mean Reward -3618.3727272727274
Max reward 

Total Time: 0:00:00.562486
Step max: 1542.0
Fail: False
Episode:  134
Reward:  -562.0
Mean Reward -3117.748148148148
Max reward so far:  -374.0
The starting state is : [-0.50122239  0.        ]
Step: 1000
Total Time: 0:00:01.271635
Step max: 1542.0
Fail: True
Episode:  135
Reward:  -1318.0
Mean Reward -3104.514705882353
Max reward so far:  -374.0
The starting state is : [-0.58020971  0.        ]
Step: 1000
Total Time: 0:00:00.908136
Step max: 1542.0
Fail: False
Episode:  136
Reward:  -880.0
Mean Reward -3088.2773722627735
Max reward so far:  -374.0
The starting state is : [-0.55176632  0.        ]
Total Time: 0:00:00.841464
Step max: 1542.0
Fail: False
Episode:  137
Reward:  -781.0
Mean Reward -3071.557971014493
Max reward so far:  -374.0
The starting state is : [-0.5862043  0.       ]
Step: 1000
Total Time: 0:00:01.354735
Step max: 1542.0
Fail: True
Episode:  138
Reward:  -1496.0
Mean Reward -3060.223021582734
Max reward so far:  -374.0
The starting state is : [-0.52805017  0.        

Total Time: 0:00:00.462438
Step max: 1359.0
Fail: False
Episode:  163
Reward:  -434.0
Mean Reward -2736.1524390243903
Max reward so far:  -363.0
The starting state is : [-0.43309979  0.        ]
Total Time: 0:00:00.783392
Step max: 1359.0
Fail: False
Episode:  164
Reward:  -773.0
Mean Reward -2724.2545454545457
Max reward so far:  -363.0
The starting state is : [-0.5476218  0.       ]
Total Time: 0:00:00.837899
Step max: 1359.0
Fail: False
Episode:  165
Reward:  -833.0
Mean Reward -2712.8614457831327
Max reward so far:  -363.0
The starting state is : [-0.4052487  0.       ]
Total Time: 0:00:00.633508
Step max: 1359.0
Fail: False
Episode:  166
Reward:  -630.0
Mean Reward -2700.3892215568862
Max reward so far:  -363.0
The starting state is : [-0.56601517  0.        ]
Total Time: 0:00:00.736637
Step max: 1359.0
Fail: False
Episode:  167
Reward:  -819.0
Mean Reward -2689.190476190476
Max reward so far:  -363.0
The starting state is : [-0.52368288  0.        ]
Total Time: 0:00:00.580822
Ste

Total Time: 0:00:00.752246
Step max: 834.0
Fail: True
Episode:  193
Reward:  -722.0
Mean Reward -2408.056701030928
Max reward so far:  -218.0
The starting state is : [-0.53097521  0.        ]
Total Time: 0:00:00.730654
Step max: 834.0
Fail: True
Episode:  194
Reward:  -834.0
Mean Reward -2399.9846153846156
Max reward so far:  -218.0
The starting state is : [-0.52710979  0.        ]
Total Time: 0:00:00.719929
Step max: 834.0
Fail: True
Episode:  195
Reward:  -804.0
Mean Reward -2391.841836734694
Max reward so far:  -218.0
The starting state is : [-0.51708582  0.        ]
Total Time: 0:00:00.734926
Step max: 834.0
Fail: True
Episode:  196
Reward:  -696.0
Mean Reward -2383.233502538071
Max reward so far:  -218.0
The starting state is : [-0.52280662  0.        ]
Total Time: 0:00:00.657062
Step max: 834.0
Fail: True
Episode:  197
Reward:  -680.0
Mean Reward -2374.631313131313
Max reward so far:  -218.0
The starting state is : [-0.4025149  0.       ]
Total Time: 0:00:00.731703
Step max: 834.

Total Time: 0:00:00.689083
Step max: 834.0
Fail: True
Episode:  223
Reward:  -734.0
Mean Reward -2174.6026785714284
Max reward so far:  -218.0
The starting state is : [-0.59722247  0.        ]
Total Time: 0:00:00.677511
Step max: 834.0
Fail: True
Episode:  224
Reward:  -694.0
Mean Reward -2168.0222222222224
Max reward so far:  -218.0
The starting state is : [-0.57750542  0.        ]
Total Time: 0:00:00.731707
Step max: 834.0
Fail: True
Episode:  225
Reward:  -864.0
Mean Reward -2162.2522123893805
Max reward so far:  -218.0
The starting state is : [-0.44138187  0.        ]
Total Time: 0:00:00.414235
Step max: 834.0
Fail: False
Episode:  226
Reward:  -355.0
Mean Reward -2154.2907488986784
Max reward so far:  -218.0
The starting state is : [-0.5477451  0.       ]
Total Time: 0:00:00.659824
Step max: 834.0
Fail: True
Episode:  227
Reward:  -756.0
Mean Reward -2148.157894736842
Max reward so far:  -218.0
The starting state is : [-0.50265473  0.        ]
Total Time: 0:00:00.516463
Step max: 

Total Time: 0:00:00.582284
Step max: 834.0
Fail: False
Episode:  253
Reward:  -572.0
Mean Reward -1988.015748031496
Max reward so far:  -218.0
The starting state is : [-0.43054406  0.        ]
Total Time: 0:00:00.676646
Step max: 834.0
Fail: True
Episode:  254
Reward:  -768.0
Mean Reward -1983.2313725490196
Max reward so far:  -218.0
The starting state is : [-0.53626377  0.        ]
Total Time: 0:00:00.566223
Step max: 834.0
Fail: False
Episode:  255
Reward:  -551.0
Mean Reward -1977.63671875
Max reward so far:  -218.0
The starting state is : [-0.50305372  0.        ]
Total Time: 0:00:00.625758
Step max: 834.0
Fail: False
Episode:  256
Reward:  -511.0
Mean Reward -1971.929961089494
Max reward so far:  -218.0
The starting state is : [-0.5582609  0.       ]
Total Time: 0:00:00.459604
Step max: 834.0
Fail: False
Episode:  257
Reward:  -448.0
Mean Reward -1966.0232558139535
Max reward so far:  -218.0
The starting state is : [-0.51077817  0.        ]
Total Time: 0:00:00.742370
Step max: 834

Total Time: 0:00:00.698314
Step max: 834.0
Fail: True
Episode:  283
Reward:  -692.0
Mean Reward -1848.2570422535211
Max reward so far:  -218.0
The starting state is : [-0.46885567  0.        ]
Total Time: 0:00:00.692920
Step max: 834.0
Fail: True
Episode:  284
Reward:  -686.0
Mean Reward -1844.178947368421
Max reward so far:  -218.0
The starting state is : [-0.58684818  0.        ]
Total Time: 0:00:00.705595
Step max: 834.0
Fail: True
Episode:  285
Reward:  -690.0
Mean Reward -1840.1433566433566
Max reward so far:  -218.0
The starting state is : [-0.52540056  0.        ]
Total Time: 0:00:00.491438
Step max: 834.0
Fail: False
Episode:  286
Reward:  -460.0
Mean Reward -1835.334494773519
Max reward so far:  -218.0
The starting state is : [-0.5127037  0.       ]
Total Time: 0:00:00.619065
Step max: 834.0
Fail: False
Episode:  287
Reward:  -573.0
Mean Reward -1830.951388888889
Max reward so far:  -218.0
The starting state is : [-0.45494638  0.        ]
Total Time: 0:00:00.530832
Step max: 8
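
The shrinking `Step max` values in the log come from the adaptive step limit in the training loop: whenever an episode finishes in fewer than `step_max / STEP_MULTIPLE` steps, the limit is tightened to `STEP_MULTIPLE` times that episode's length. The sketch below isolates that rule. The function name `update_step_limit` is hypothetical, and the episode lengths are illustrative, chosen so that the resulting limits match two values from the log (4905 = 3.0 × 1635 and 834 = 3.0 × 278).

```python
def update_step_limit(step_max, episode_steps, multiple=3.0):
    """Tighten the limit only when an episode beats step_max / multiple."""
    if episode_steps < step_max / multiple:
        return episode_steps * multiple
    return step_max

step_max = 20000
history = []
for episode_steps in [15500, 1635, 3000, 278]:  # illustrative episode lengths
    step_max = update_step_limit(step_max, episode_steps)
    history.append(step_max)

print(history)  # [20000, 4905.0, 4905.0, 834.0]
```

The limit never loosens during the first `max_episodes` episodes, so a single unusually fast episode constrains every later one.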

## Step 8: Evaluate our trained model

Load our model and see if it generalizes well by playing 10 random games and averaging the score.

In [8]:
with tf.Session() as sess:
    env.reset()
    rewards = []
    
    # Load the model
    saver.restore(sess, "./models/model.ckpt")

    for episode in range(10):
        state = env.reset()
        step = 0
        done = False
        total_rewards = 0
        print("****************************************************")
        print("EPISODE ", episode)

        while True:
            

            # Choose action a. The policy is stochastic: the network outputs action probabilities and we sample from them.
            action_probability_distribution = sess.run(action_distribution, feed_dict={input_: state.reshape([1, state_size])})
            #print(action_probability_distribution)
            action = np.random.choice(range(action_probability_distribution.shape[1]), p=action_probability_distribution.ravel())  # select action w.r.t the actions prob


            new_state, reward, done, info = env.step(action)

            total_rewards += reward

            if done:
                rewards.append(total_rewards)
                print ("Score", total_rewards)
                break
            state = new_state
    env.close()
    print ("Score over time: " +  str(sum(rewards)/10))

INFO:tensorflow:Restoring parameters from ./models/model.ckpt
****************************************************
EPISODE  0
Score -654.0
****************************************************
EPISODE  1
Score -673.0
****************************************************
EPISODE  2
Score -652.0
****************************************************
EPISODE  3
Score -536.0
****************************************************
EPISODE  4
Score -815.0
****************************************************
EPISODE  5
Score -627.0
****************************************************
EPISODE  6
Score -408.0
****************************************************
EPISODE  7
Score -1178.0
****************************************************
EPISODE  8
Score -781.0
****************************************************
EPISODE  9
Score -1017.0
Score over time: -734.1
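
As a quick check on the reported average, the ten scores above can be summarized in plain Python. The evaluation loop uses the raw environment reward with no shaping bonus, and MountainCar-v0 gives a reward of -1 per step, so each score is the negated episode length. The variable names below are our own.

```python
# Scores copied from the evaluation output above
scores = [-654.0, -673.0, -652.0, -536.0, -815.0,
          -627.0, -408.0, -1178.0, -781.0, -1017.0]

mean_score = sum(scores) / len(scores)
best, worst = max(scores), min(scores)

print(mean_score)   # -734.1
print(best, worst)  # -408.0 -1178.0

# With -1 reward per step, the episode lengths are the negated scores
steps = [int(-s) for s in scores]
print(sum(steps) / len(steps))  # 734.1
```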


# Report

1.  Base run with the _CartPole_ environment.
2.  Changed the environment to _MountainCar_.
3.  Changed the neural network input to match the new environment's state-space dimensions.
4.  Fitness function experiments:
    1.  The score initially improved but then stayed stuck for the rest of training. Not very promising.
    2.  Designed a new metric, **potential energy (PE)**:
        1.  Successfully improving PE during a direction change grants a bonus of +10 reward.
        2.  Failing to improve PE during a direction change incurs a penalty of -2 reward.
    3.  Added a **step limit multiplier** hyperparameter (`STEP_MULTIPLE`) that constrains the duration of each training episode to a multiple of our fastest training episode. The initial multiple was 1.5.
        1.  Combined with **experiment 2** (the PE bonus), this improved the score further during training.
        2.  Post-training evaluation results were not great. Rewards were consistently in the negative thousands (~ -5000).
    4.  We found that the last set of training episodes were very short because of the low step limit multiplier (1.5). To loosen this constraint, we increased the step limit multiplier from 1.5 to 3.0.
        1.  Rewards improved consistently during training, as before.
        2.  Post-training evaluation results were much better. Rewards were averaging ~ -500 (the evaluation run shown in Step 8 averaged -734.1), so this change led to roughly an order-of-magnitude improvement in our evaluation testing.
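
The potential-energy bonus from experiment 2 can be isolated as a small function. This is a sketch that mirrors the rule in the training loop (a direction change is detected when the velocity changes sign, and a new position extreme counts as gained PE). The function name `shaping_bonus` and the sample turning points are hypothetical.

```python
def shaping_bonus(position, velocity, old_velocity, max_pos, min_pos):
    """Return (bonus, max_pos, min_pos) for one step under the PE rule."""
    if velocity * old_velocity >= 0.0:
        return 0.0, max_pos, min_pos    # no direction change, no bonus
    if position > max_pos:
        return 10.0, position, min_pos  # new rightmost turning point
    if position < min_pos:
        return 10.0, max_pos, position  # new leftmost turning point
    return -2.0, max_pos, min_pos       # turned around without a new extreme

# Same initial extremes as the training loop
max_pos, min_pos = -2.0, 2.0

# Illustrative turning points: (position, velocity, old_velocity)
events = [(-0.3, -0.01, 0.02), (-0.8, 0.01, -0.02),
          (-0.4, -0.01, 0.01), (-0.2, -0.01, 0.03)]

bonuses = []
for position, velocity, old_velocity in events:
    bonus, max_pos, min_pos = shaping_bonus(
        position, velocity, old_velocity, max_pos, min_pos)
    bonuses.append(bonus)

print(bonuses)  # [10.0, 10.0, -2.0, 10.0]
```

The bonus is added on top of the environment's own reward during training only; the evaluation loop in Step 8 reports the unshaped reward.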