In [1]:
import os
from gymhelpers import ExperimentsManager
import scipy.io as scipyio
import numpy as np
from collections import defaultdict

In [2]:
strategies = ["sparsemax","softmax","epsilon"]
backuprules = ["sparsebellman","softbellman","bellman"]
temperatures = [1, 0.1, 0.01]
temperatures_name = ["high","mid","low"]
action_res_list = [2001, 1001, 101, 3]
action_res_name = ["large","midlarge","midsmall","small"]

env_name = "InvertedPendulum-v1"
min_avg_rwd = 930
stop_training_min_avg_rwd = 950
layers_size = [512, 512]
n_ep = 4000
n_exps = 1

gym_stats_dir_prefix = os.path.join('Gym_stats', env_name)
figures_dir = 'Figures'
api_key = '###'
alg_id = '###'

data = defaultdict(lambda : defaultdict(lambda : defaultdict(lambda : defaultdict(lambda : None))))
for action_res, action_name in zip(action_res_list,action_res_name):
    for temperature, temperature_name in zip(temperatures, temperatures_name):
        for strategy in strategies:
            for backuprule in backuprules:
                print("Problem: {}, Actions: {}, Temp: {}, Strategy: {}, Backup: {}".format(env_name,np.prod(action_res),temperature,strategy,backuprule))
                expsman = ExperimentsManager(env_name=env_name, agent_value_function_hidden_layers_size=layers_size,
                                     figures_dir=figures_dir, discount=0.99, decay_eps=0.995, eps_min=1E-4, learning_rate=3E-4,
                                     decay_lr=True, max_step=2000, replay_memory_max_size=100000, ep_verbose=False,
                                     exp_verbose=False, learning_rate_end=3E-5, batch_size=64, upload_last_exp=False, double_dqn=True, dueling=False,
                                     target_params_update_period_steps=75, replay_period_steps=4, min_avg_rwd=min_avg_rwd,
                                     per_proportional_prioritization=True, per_apply_importance_sampling=True, per_alpha=0.2,
                                     per_beta0=0.4,
                                     results_dir_prefix=gym_stats_dir_prefix, gym_api_key=api_key, gym_algorithm_id=alg_id,
                                     strategy=strategy,backuprule=backuprule,temperature=temperature,action_res=31,
                                     video_recording=False)
                _, _, Rwd_per_ep_v, Loss_per_ep_v = expsman.run_experiments(n_exps=n_exps, n_ep=n_ep, stop_training_min_avg_rwd=stop_training_min_avg_rwd, plot_results=False)
                data[action_name][temperature_name][strategy][backuprule] = {"reward_list":Rwd_per_ep_v,"loss_list":Loss_per_ep_v}

scipyio.savemat(env_name+".mat", data)
print("{} is finished and is saved".format(env_name))

Problem: InvertedPendulum-v1, Actions: 2001, Temp: 1, Strategy: sparsemax, Backup: sparsebellman


[2017-07-18 17:50:52,876] Making new env: InvertedPendulum-v1
[2017-07-18 17:50:53,044] Making new env: InvertedPendulum-v1



EXECUTING EXPERIMENT 0 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 1 experiment: 1000.0 (std = 0.0).


  factor = higher_border / mu
  lower_border = np.clip(lower_border, mu / factor, mu)
[2017-07-18 18:33:27,315] Making new env: InvertedPendulum-v1


Average episode duration: 637.920741 ms

EXECUTING EXPERIMENT 1 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 2 experiments: 1000.0 (std = 0.0).


[2017-07-18 19:10:40,856] Making new env: InvertedPendulum-v1


Average episode duration: 557.761610 ms

EXECUTING EXPERIMENT 2 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 3 experiments: 1000.0 (std = 0.0).
Average episode duration: 729.406539 ms
Average final reward: 1000.00 (std=0.00).

The 100-episode moving average reached 930 after 944 episodes.
Problem: InvertedPendulum-v1, Actions: 2001, Temp: 1, Strategy: sparsemax, Backup: softbellman


[2017-07-18 19:59:24,707] Making new env: InvertedPendulum-v1
[2017-07-18 19:59:24,712] Making new env: InvertedPendulum-v1



EXECUTING EXPERIMENT 0 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 1 experiment: 1000.0 (std = 0.0).


[2017-07-18 20:35:46,003] Making new env: InvertedPendulum-v1


Average episode duration: 544.680978 ms

EXECUTING EXPERIMENT 1 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 2 experiments: 1000.0 (std = 0.0).


[2017-07-18 21:16:35,644] Making new env: InvertedPendulum-v1


Average episode duration: 611.614140 ms

EXECUTING EXPERIMENT 2 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 3 experiments: 1000.0 (std = 0.0).
Average episode duration: 683.982644 ms
Average final reward: 1000.00 (std=0.00).

The 100-episode moving average reached 930 after 628 episodes.
Problem: InvertedPendulum-v1, Actions: 2001, Temp: 1, Strategy: sparsemax, Backup: bellman


[2017-07-18 22:02:18,190] Making new env: InvertedPendulum-v1
[2017-07-18 22:02:18,195] Making new env: InvertedPendulum-v1



EXECUTING EXPERIMENT 0 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 1 experiment: 1000.0 (std = 0.0).


[2017-07-18 22:40:20,542] Making new env: InvertedPendulum-v1


Average episode duration: 569.881363 ms

EXECUTING EXPERIMENT 1 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Final mean reward, averaged over 2 experiments: 573.616161616 (std = 426.383838384).


[2017-07-18 23:23:39,615] Making new env: InvertedPendulum-v1


Average episode duration: 648.983723 ms

EXECUTING EXPERIMENT 2 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 3 experiments: 715.744107744 (std = 401.998538013).
Average episode duration: 1147.478143 ms
Average final reward: 715.27 (std=115.16).

The 100-episode moving average reached 930 after 657 episodes.
Problem: InvertedPendulum-v1, Actions: 2001, Temp: 1, Strategy: softmax, Backup: sparsebellman


[2017-07-19 00:40:16,778] Making new env: InvertedPendulum-v1
[2017-07-19 00:40:16,782] Making new env: InvertedPendulum-v1



EXECUTING EXPERIMENT 0 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 1 experiment: 1000.0 (std = 0.0).


[2017-07-19 01:12:47,607] Making new env: InvertedPendulum-v1


Average episode duration: 487.030674 ms

EXECUTING EXPERIMENT 1 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 2 experiments: 1000.0 (std = 0.0).


[2017-07-19 01:45:21,662] Making new env: InvertedPendulum-v1


Average episode duration: 487.800445 ms

EXECUTING EXPERIMENT 2 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 3 experiments: 1000.0 (std = 0.0).
Average episode duration: 334.726136 ms
Average final reward: 1000.00 (std=0.00).

The 100-episode moving average reached 930 after 2816 episodes.
Problem: InvertedPendulum-v1, Actions: 2001, Temp: 1, Strategy: softmax, Backup: softbellman


[2017-07-19 02:07:45,798] Making new env: InvertedPendulum-v1
[2017-07-19 02:07:45,802] Making new env: InvertedPendulum-v1



EXECUTING EXPERIMENT 0 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 1 experiment: 982.525252525 (std = 0.0).


[2017-07-19 02:29:38,899] Making new env: InvertedPendulum-v1


Average episode duration: 327.714386 ms

EXECUTING EXPERIMENT 1 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 2 experiments: 991.262626263 (std = 8.73737373737).


[2017-07-19 02:47:13,394] Making new env: InvertedPendulum-v1


Average episode duration: 263.044901 ms

EXECUTING EXPERIMENT 2 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Final mean reward, averaged over 3 experiments: 942.62962963 (std = 69.146447411).
Average episode duration: 1347.312798 ms
Average final reward: 943.20 (std=103.72).

The 100-episode moving average reached 930 after 3995 episodes.
Problem: InvertedPendulum-v1, Actions: 2001, Temp: 1, Strategy: softmax, Backup: bellman


[2017-07-19 04:17:08,049] Making new env: InvertedPendulum-v1
[2017-07-19 04:17:08,055] Making new env: InvertedPendulum-v1



EXECUTING EXPERIMENT 0 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Final mean reward, averaged over 1 experiment: 44.898989899 (std = 0.0).


[2017-07-19 04:35:45,102] Making new env: InvertedPendulum-v1


Average episode duration: 278.727839 ms

EXECUTING EXPERIMENT 1 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Final mean reward, averaged over 2 experiments: 464.247474747 (std = 419.348484848).


[2017-07-19 06:01:09,087] Making new env: InvertedPendulum-v1


Average episode duration: 1280.413510 ms

EXECUTING EXPERIMENT 2 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 3 experiments: 642.831649832 (std = 425.464510479).
Average episode duration: 416.697868 ms
Average final reward: 643.20 (std=84.81).

The 100-episode moving average reached 930 after 0 episodes.
Problem: InvertedPendulum-v1, Actions: 2001, Temp: 1, Strategy: epsilon, Backup: sparsebellman


[2017-07-19 06:29:01,194] Making new env: InvertedPendulum-v1
[2017-07-19 06:29:01,198] Making new env: InvertedPendulum-v1



EXECUTING EXPERIMENT 0 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 1 experiment: 999.222222222 (std = 0.0).


[2017-07-19 07:21:35,324] Making new env: InvertedPendulum-v1


Average episode duration: 787.986902 ms

EXECUTING EXPERIMENT 1 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Final mean reward, averaged over 2 experiments: 517.257575758 (std = 481.964646465).


[2017-07-19 07:38:10,246] Making new env: InvertedPendulum-v1


Average episode duration: 248.161314 ms

EXECUTING EXPERIMENT 2 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 3 experiments: 678.171717172 (std = 454.584061286).
Average episode duration: 799.071103 ms
Average final reward: 678.12 (std=36.10).

The 100-episode moving average reached 930 after 0 episodes.
Problem: InvertedPendulum-v1, Actions: 2001, Temp: 1, Strategy: epsilon, Backup: softbellman


[2017-07-19 08:31:31,957] Making new env: InvertedPendulum-v1
[2017-07-19 08:31:31,962] Making new env: InvertedPendulum-v1



EXECUTING EXPERIMENT 0 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 1 experiment: 994.919191919 (std = 0.0).


[2017-07-19 09:20:32,229] Making new env: InvertedPendulum-v1


Average episode duration: 734.489666 ms

EXECUTING EXPERIMENT 1 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Final mean reward, averaged over 2 experiments: 499.262626263 (std = 495.656565657).


[2017-07-19 09:28:24,571] Making new env: InvertedPendulum-v1


Average episode duration: 117.545370 ms

EXECUTING EXPERIMENT 2 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Final mean reward, averaged over 3 experiments: 503.966329966 (std = 404.756556928).
Average episode duration: 217.716118 ms
Average final reward: 505.60 (std=156.99).

The 100-episode moving average reached 930 after 0 episodes.
Problem: InvertedPendulum-v1, Actions: 2001, Temp: 1, Strategy: epsilon, Backup: bellman


[2017-07-19 09:43:00,771] Making new env: InvertedPendulum-v1
[2017-07-19 09:43:00,775] Making new env: InvertedPendulum-v1



EXECUTING EXPERIMENT 0 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Final mean reward, averaged over 1 experiment: 69.1616161616 (std = 0.0).


[2017-07-19 09:57:54,763] Making new env: InvertedPendulum-v1


Average episode duration: 222.949732 ms

EXECUTING EXPERIMENT 1 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 2 experiments: 534.580808081 (std = 465.419191919).


[2017-07-19 10:35:27,860] Making new env: InvertedPendulum-v1


Average episode duration: 562.720143 ms

EXECUTING EXPERIMENT 2 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Final mean reward, averaged over 3 experiments: 369.845117845 (std = 445.741754172).
Average episode duration: 333.694547 ms
Average final reward: 369.82 (std=45.55).

The 100-episode moving average reached 930 after 0 episodes.
Problem: InvertedPendulum-v1, Actions: 2001, Temp: 0.1, Strategy: sparsemax, Backup: sparsebellman


[2017-07-19 10:57:48,643] Making new env: InvertedPendulum-v1
[2017-07-19 10:57:48,647] Making new env: InvertedPendulum-v1



EXECUTING EXPERIMENT 0 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 1 experiment: 1000.0 (std = 0.0).


[2017-07-19 11:33:56,804] Making new env: InvertedPendulum-v1


Average episode duration: 541.260567 ms

EXECUTING EXPERIMENT 1 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 2 experiments: 1000.0 (std = 0.0).


[2017-07-19 12:09:16,904] Making new env: InvertedPendulum-v1


Average episode duration: 529.453275 ms

EXECUTING EXPERIMENT 2 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 3 experiments: 1000.0 (std = 0.0).
Average episode duration: 552.382376 ms
Average final reward: 1000.00 (std=0.00).

The 100-episode moving average reached 930 after 653 episodes.
Problem: InvertedPendulum-v1, Actions: 2001, Temp: 0.1, Strategy: sparsemax, Backup: softbellman


[2017-07-19 12:46:11,558] Making new env: InvertedPendulum-v1
[2017-07-19 12:46:11,563] Making new env: InvertedPendulum-v1



EXECUTING EXPERIMENT 0 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 1 experiment: 1000.0 (std = 0.0).


[2017-07-19 13:33:01,669] Making new env: InvertedPendulum-v1


Average episode duration: 701.980077 ms

EXECUTING EXPERIMENT 1 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 2 experiments: 1000.0 (std = 0.0).


[2017-07-19 14:08:42,548] Making new env: InvertedPendulum-v1


Average episode duration: 534.446312 ms

EXECUTING EXPERIMENT 2 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 3 experiments: 1000.0 (std = 0.0).
Average episode duration: 642.910155 ms
Average final reward: 1000.00 (std=0.00).

The 100-episode moving average reached 930 after 766 episodes.
Problem: InvertedPendulum-v1, Actions: 2001, Temp: 0.1, Strategy: sparsemax, Backup: bellman


[2017-07-19 14:51:39,787] Making new env: InvertedPendulum-v1
[2017-07-19 14:51:39,791] Making new env: InvertedPendulum-v1



EXECUTING EXPERIMENT 0 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 1 experiment: 1000.0 (std = 0.0).


[2017-07-19 15:26:35,696] Making new env: InvertedPendulum-v1


Average episode duration: 523.283717 ms

EXECUTING EXPERIMENT 1 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 2 experiments: 1000.0 (std = 0.0).


[2017-07-19 16:05:55,083] Making new env: InvertedPendulum-v1


Average episode duration: 589.244804 ms

EXECUTING EXPERIMENT 2 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 3 experiments: 1000.0 (std = 0.0).
Average episode duration: 807.598549 ms
Average final reward: 1000.00 (std=0.00).

The 100-episode moving average reached 930 after 818 episodes.
Problem: InvertedPendulum-v1, Actions: 2001, Temp: 0.1, Strategy: softmax, Backup: sparsebellman


[2017-07-19 16:59:50,797] Making new env: InvertedPendulum-v1
[2017-07-19 16:59:53,454] Making new env: InvertedPendulum-v1



EXECUTING EXPERIMENT 0 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 1 experiment: 1000.0 (std = 0.0).


[2017-07-19 17:42:07,414] Making new env: InvertedPendulum-v1


Average episode duration: 632.711487 ms

EXECUTING EXPERIMENT 1 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 2 experiments: 854.858585859 (std = 145.141414141).


[2017-07-19 18:13:05,417] Making new env: InvertedPendulum-v1


Average episode duration: 463.734579 ms

EXECUTING EXPERIMENT 2 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 3 experiments: 903.239057239 (std = 136.840637561).
Average episode duration: 612.624477 ms
Average final reward: 904.21 (std=91.84).

The 100-episode moving average reached 930 after 844 episodes.
Problem: InvertedPendulum-v1, Actions: 2001, Temp: 0.1, Strategy: softmax, Backup: softbellman


[2017-07-19 18:54:02,801] Making new env: InvertedPendulum-v1
[2017-07-19 18:54:02,806] Making new env: InvertedPendulum-v1



EXECUTING EXPERIMENT 0 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 1 experiment: 1000.0 (std = 0.0).


[2017-07-19 19:29:22,679] Making new env: InvertedPendulum-v1


Average episode duration: 529.163472 ms

EXECUTING EXPERIMENT 1 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Final mean reward, averaged over 2 experiments: 522.368686869 (std = 477.631313131).


[2017-07-19 19:40:34,599] Making new env: InvertedPendulum-v1


Average episode duration: 167.219323 ms

EXECUTING EXPERIMENT 2 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 3 experiments: 681.579124579 (std = 450.315120563).
Average episode duration: 572.847520 ms
Average final reward: 681.57 (std=0.49).

The 100-episode moving average reached 930 after 0 episodes.
Problem: InvertedPendulum-v1, Actions: 2001, Temp: 0.1, Strategy: softmax, Backup: bellman


[2017-07-19 20:18:51,936] Making new env: InvertedPendulum-v1
[2017-07-19 20:18:51,952] Making new env: InvertedPendulum-v1



EXECUTING EXPERIMENT 0 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 1 experiment: 1000.0 (std = 0.0).


[2017-07-19 20:58:56,881] Making new env: InvertedPendulum-v1


Average episode duration: 600.562473 ms

EXECUTING EXPERIMENT 1 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 2 experiments: 1000.0 (std = 0.0).


[2017-07-19 21:36:46,125] Making new env: InvertedPendulum-v1


Average episode duration: 566.662166 ms

EXECUTING EXPERIMENT 2 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 3 experiments: 1000.0 (std = 0.0).
Average episode duration: 507.790081 ms
Average final reward: 1000.00 (std=0.00).

The 100-episode moving average reached 930 after 879 episodes.
Problem: InvertedPendulum-v1, Actions: 2001, Temp: 0.1, Strategy: epsilon, Backup: sparsebellman


[2017-07-19 22:10:42,914] Making new env: InvertedPendulum-v1
[2017-07-19 22:10:42,918] Making new env: InvertedPendulum-v1



EXECUTING EXPERIMENT 0 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 1 experiment: 1000.0 (std = 0.0).


[2017-07-19 22:38:08,779] Making new env: InvertedPendulum-v1


Average episode duration: 410.897103 ms

EXECUTING EXPERIMENT 1 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Final mean reward, averaged over 2 experiments: 967.161616162 (std = 32.8383838384).


[2017-07-19 23:52:33,900] Making new env: InvertedPendulum-v1


Average episode duration: 1115.584737 ms

EXECUTING EXPERIMENT 2 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Final mean reward, averaged over 3 experiments: 666.32996633 (std = 426.284258897).
Average episode duration: 253.400529 ms
Average final reward: 666.58 (std=64.74).

The 100-episode moving average reached 930 after 0 episodes.
Problem: InvertedPendulum-v1, Actions: 2001, Temp: 0.1, Strategy: epsilon, Backup: softbellman


[2017-07-20 00:09:33,059] Making new env: InvertedPendulum-v1
[2017-07-20 00:09:33,064] Making new env: InvertedPendulum-v1



EXECUTING EXPERIMENT 0 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 1 experiment: 1000.0 (std = 0.0).


[2017-07-20 00:39:55,200] Making new env: InvertedPendulum-v1


Average episode duration: 454.848335 ms

EXECUTING EXPERIMENT 1 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 2 experiments: 1000.0 (std = 0.0).


[2017-07-20 01:04:02,882] Making new env: InvertedPendulum-v1


Average episode duration: 361.004089 ms

EXECUTING EXPERIMENT 2 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 3 experiments: 1000.0 (std = 0.0).
Average episode duration: 544.683899 ms
Average final reward: 1000.00 (std=0.00).

The 100-episode moving average reached 930 after 2646 episodes.
Problem: InvertedPendulum-v1, Actions: 2001, Temp: 0.1, Strategy: epsilon, Backup: bellman


[2017-07-20 01:40:28,115] Making new env: InvertedPendulum-v1
[2017-07-20 01:40:28,395] Making new env: InvertedPendulum-v1



EXECUTING EXPERIMENT 0 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 1 experiment: 1000.0 (std = 0.0).


[2017-07-20 02:13:03,141] Making new env: InvertedPendulum-v1


Average episode duration: 488.111724 ms

EXECUTING EXPERIMENT 1 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Final mean reward, averaged over 2 experiments: 659.101010101 (std = 340.898989899).


[2017-07-20 02:29:04,366] Making new env: InvertedPendulum-v1


Average episode duration: 239.625670 ms

EXECUTING EXPERIMENT 2 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 3 experiments: 772.734006734 (std = 321.402649943).
Average episode duration: 497.685544 ms
Average final reward: 771.69 (std=148.87).

The 100-episode moving average reached 930 after 0 episodes.
Problem: InvertedPendulum-v1, Actions: 2001, Temp: 0.01, Strategy: sparsemax, Backup: sparsebellman


[2017-07-20 03:02:20,648] Making new env: InvertedPendulum-v1
[2017-07-20 03:02:20,653] Making new env: InvertedPendulum-v1



EXECUTING EXPERIMENT 0 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 1 experiment: 1000.0 (std = 0.0).


[2017-07-20 03:40:04,535] Making new env: InvertedPendulum-v1


Average episode duration: 565.392155 ms

EXECUTING EXPERIMENT 1 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 2 experiments: 1000.0 (std = 0.0).


[2017-07-20 04:17:15,673] Making new env: InvertedPendulum-v1


Average episode duration: 557.147882 ms

EXECUTING EXPERIMENT 2 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 3 experiments: 1000.0 (std = 0.0).
Average episode duration: 500.039525 ms
Average final reward: 1000.00 (std=0.00).

The 100-episode moving average reached 930 after 880 episodes.
Problem: InvertedPendulum-v1, Actions: 2001, Temp: 0.01, Strategy: sparsemax, Backup: softbellman


[2017-07-20 04:50:41,283] Making new env: InvertedPendulum-v1
[2017-07-20 04:50:43,957] Making new env: InvertedPendulum-v1



EXECUTING EXPERIMENT 0 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 1 experiment: 1000.0 (std = 0.0).


[2017-07-20 05:29:02,735] Making new env: InvertedPendulum-v1


Average episode duration: 574.071076 ms

EXECUTING EXPERIMENT 1 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 2 experiments: 1000.0 (std = 0.0).


[2017-07-20 06:04:42,704] Making new env: InvertedPendulum-v1


Average episode duration: 533.585209 ms

EXECUTING EXPERIMENT 2 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 3 experiments: 1000.0 (std = 0.0).
Average episode duration: 538.646127 ms
Average final reward: 1000.00 (std=0.00).

The 100-episode moving average reached 930 after 593 episodes.
Problem: InvertedPendulum-v1, Actions: 2001, Temp: 0.01, Strategy: sparsemax, Backup: bellman


[2017-07-20 06:40:43,638] Making new env: InvertedPendulum-v1
[2017-07-20 06:40:43,692] Making new env: InvertedPendulum-v1



EXECUTING EXPERIMENT 0 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 1 experiment: 1000.0 (std = 0.0).


[2017-07-20 07:13:34,914] Making new env: InvertedPendulum-v1


Average episode duration: 492.099525 ms

EXECUTING EXPERIMENT 1 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 2 experiments: 1000.0 (std = 0.0).


[2017-07-20 07:52:51,272] Making new env: InvertedPendulum-v1


Average episode duration: 588.358060 ms

EXECUTING EXPERIMENT 2 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 3 experiments: 1000.0 (std = 0.0).
Average episode duration: 592.666494 ms
Average final reward: 1000.00 (std=0.00).

The 100-episode moving average reached 930 after 1032 episodes.
Problem: InvertedPendulum-v1, Actions: 2001, Temp: 0.01, Strategy: softmax, Backup: sparsebellman


[2017-07-20 08:32:28,867] Making new env: InvertedPendulum-v1
[2017-07-20 08:32:28,872] Making new env: InvertedPendulum-v1



EXECUTING EXPERIMENT 0 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 1 experiment: 1000.0 (std = 0.0).


[2017-07-20 09:24:13,107] Making new env: InvertedPendulum-v1


Average episode duration: 775.271764 ms

EXECUTING EXPERIMENT 1 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 2 experiments: 1000.0 (std = 0.0).


[2017-07-20 10:27:21,832] Making new env: InvertedPendulum-v1


Average episode duration: 946.520731 ms

EXECUTING EXPERIMENT 2 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 3 experiments: 1000.0 (std = 0.0).
Average episode duration: 665.826372 ms
Average final reward: 1000.00 (std=0.00).

The 100-episode moving average reached 930 after 1083 episodes.
Problem: InvertedPendulum-v1, Actions: 2001, Temp: 0.01, Strategy: softmax, Backup: softbellman


[2017-07-20 11:11:51,767] Making new env: InvertedPendulum-v1
[2017-07-20 11:11:51,806] Making new env: InvertedPendulum-v1



EXECUTING EXPERIMENT 0 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 1 experiment: 1000.0 (std = 0.0).


[2017-07-20 11:49:15,059] Making new env: InvertedPendulum-v1


Average episode duration: 560.132555 ms

EXECUTING EXPERIMENT 1 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 2 experiments: 1000.0 (std = 0.0).


[2017-07-20 12:08:25,100] Making new env: InvertedPendulum-v1


Average episode duration: 286.728803 ms

EXECUTING EXPERIMENT 2 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 3 experiments: 1000.0 (std = 0.0).
Average episode duration: 704.689666 ms
Average final reward: 1000.00 (std=0.00).

The 100-episode moving average reached 930 after 3198 episodes.
Problem: InvertedPendulum-v1, Actions: 2001, Temp: 0.01, Strategy: softmax, Backup: bellman


[2017-07-20 12:55:30,936] Making new env: InvertedPendulum-v1
[2017-07-20 12:55:30,940] Making new env: InvertedPendulum-v1



EXECUTING EXPERIMENT 0 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 1 experiment: 1000.0 (std = 0.0).


[2017-07-20 13:40:10,396] Making new env: InvertedPendulum-v1


Average episode duration: 669.170603 ms

EXECUTING EXPERIMENT 1 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 2 experiments: 1000.0 (std = 0.0).


[2017-07-20 14:15:05,990] Making new env: InvertedPendulum-v1


Average episode duration: 523.121173 ms

EXECUTING EXPERIMENT 2 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 3 experiments: 1000.0 (std = 0.0).
Average episode duration: 510.986596 ms
Average final reward: 1000.00 (std=0.00).

The 100-episode moving average reached 930 after 759 episodes.
Problem: InvertedPendulum-v1, Actions: 2001, Temp: 0.01, Strategy: epsilon, Backup: sparsebellman


[2017-07-20 14:49:16,930] Making new env: InvertedPendulum-v1
[2017-07-20 14:49:16,935] Making new env: InvertedPendulum-v1



EXECUTING EXPERIMENT 0 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Final mean reward, averaged over 1 experiment: 3.37373737374 (std = 0.0).


[2017-07-20 15:01:51,030] Making new env: InvertedPendulum-v1


Average episode duration: 187.925752 ms

EXECUTING EXPERIMENT 1 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 2 experiments: 501.686868687 (std = 498.313131313).


[2017-07-20 15:42:13,633] Making new env: InvertedPendulum-v1


Average episode duration: 604.938668 ms

EXECUTING EXPERIMENT 2 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 3 experiments: 667.791245791 (std = 469.814125741).
Average episode duration: 573.208787 ms
Average final reward: 667.79 (std=0.19).

The 100-episode moving average reached 930 after 3578 episodes.
Problem: InvertedPendulum-v1, Actions: 2001, Temp: 0.01, Strategy: epsilon, Backup: softbellman


[2017-07-20 16:20:32,200] Making new env: InvertedPendulum-v1
[2017-07-20 16:20:32,456] Making new env: InvertedPendulum-v1



EXECUTING EXPERIMENT 0 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Final mean reward, averaged over 1 experiment: 33.4848484848 (std = 0.0).


[2017-07-20 16:34:00,003] Making new env: InvertedPendulum-v1


Average episode duration: 201.164590 ms

EXECUTING EXPERIMENT 1 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 2 experiments: 516.742424242 (std = 483.257575758).


[2017-07-20 17:17:10,156] Making new env: InvertedPendulum-v1


Average episode duration: 646.932397 ms

EXECUTING EXPERIMENT 2 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 3 experiments: 677.828282828 (std = 455.619611837).
Average episode duration: 603.105704 ms
Average final reward: 677.88 (std=5.67).

The 100-episode moving average reached 930 after 2052 episodes.
Problem: InvertedPendulum-v1, Actions: 2001, Temp: 0.01, Strategy: epsilon, Backup: bellman


[2017-07-20 17:57:28,567] Making new env: InvertedPendulum-v1
[2017-07-20 17:57:28,571] Making new env: InvertedPendulum-v1



EXECUTING EXPERIMENT 0 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Final mean reward, averaged over 1 experiment: 14.3434343434 (std = 0.0).


[2017-07-20 18:17:06,224] Making new env: InvertedPendulum-v1


Average episode duration: 293.754896 ms

EXECUTING EXPERIMENT 1 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Final mean reward, averaged over 2 experiments: 34.5808080808 (std = 20.2373737374).


[2017-07-20 18:30:39,127] Making new env: InvertedPendulum-v1


Average episode duration: 202.592285 ms

EXECUTING EXPERIMENT 2 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Final mean reward, averaged over 3 experiments: 48.0269360269 (std = 25.1918818407).
Average episode duration: 172.344683 ms
Average final reward: 48.06 (std=68.91).

The 100-episode moving average reached 930 after 0 episodes.
Problem: InvertedPendulum-v1, Actions: 1001, Temp: 1, Strategy: sparsemax, Backup: sparsebellman


[2017-07-20 18:42:13,955] Making new env: InvertedPendulum-v1
[2017-07-20 18:42:13,959] Making new env: InvertedPendulum-v1



EXECUTING EXPERIMENT 0 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 1 experiment: 1000.0 (std = 0.0).


[2017-07-20 19:32:50,297] Making new env: InvertedPendulum-v1


Average episode duration: 758.435388 ms

EXECUTING EXPERIMENT 1 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 2 experiments: 1000.0 (std = 0.0).


[2017-07-20 20:08:50,439] Making new env: InvertedPendulum-v1


Average episode duration: 539.386741 ms

EXECUTING EXPERIMENT 2 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 3 experiments: 1000.0 (std = 0.0).
Average episode duration: 713.064741 ms
Average final reward: 1000.00 (std=0.00).

The 100-episode moving average reached 930 after 646 episodes.
Problem: InvertedPendulum-v1, Actions: 1001, Temp: 1, Strategy: sparsemax, Backup: softbellman


[2017-07-20 20:56:28,163] Making new env: InvertedPendulum-v1
[2017-07-20 20:56:28,167] Making new env: InvertedPendulum-v1



EXECUTING EXPERIMENT 0 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 1 experiment: 873.676767677 (std = 0.0).


[2017-07-20 21:31:54,093] Making new env: InvertedPendulum-v1


Average episode duration: 530.833565 ms

EXECUTING EXPERIMENT 1 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 2 experiments: 886.601010101 (std = 12.9242424242).


[2017-07-20 22:05:19,732] Making new env: InvertedPendulum-v1


Average episode duration: 500.619330 ms

EXECUTING EXPERIMENT 2 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 3 experiments: 924.400673401 (std = 54.4884065892).
Average episode duration: 776.803112 ms
Average final reward: 925.16 (std=136.82).

The 100-episode moving average reached 930 after 717 episodes.
Problem: InvertedPendulum-v1, Actions: 1001, Temp: 1, Strategy: sparsemax, Backup: bellman


[2017-07-20 22:57:13,467] Making new env: InvertedPendulum-v1
[2017-07-20 22:57:13,470] Making new env: InvertedPendulum-v1



EXECUTING EXPERIMENT 0 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 1 experiment: 1000.0 (std = 0.0).


[2017-07-21 00:46:17,733] Making new env: InvertedPendulum-v1


Average episode duration: 1635.306125 ms

EXECUTING EXPERIMENT 1 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 2 experiments: 878.257575758 (std = 121.742424242).


[2017-07-21 01:12:51,118] Making new env: InvertedPendulum-v1


Average episode duration: 397.672589 ms

EXECUTING EXPERIMENT 2 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 3 experiments: 918.838383838 (std = 114.77985832).
Average episode duration: 714.678154 ms
Average final reward: 919.65 (std=135.56).

The 100-episode moving average reached 930 after 0 episodes.
Problem: InvertedPendulum-v1, Actions: 1001, Temp: 1, Strategy: softmax, Backup: sparsebellman


[2017-07-21 02:00:36,076] Making new env: InvertedPendulum-v1
[2017-07-21 02:00:36,144] Making new env: InvertedPendulum-v1



EXECUTING EXPERIMENT 0 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Final mean reward, averaged over 1 experiment: 93.3636363636 (std = 0.0).


[2017-07-21 02:20:58,937] Making new env: InvertedPendulum-v1


Average episode duration: 304.966756 ms

EXECUTING EXPERIMENT 1 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Final mean reward, averaged over 2 experiments: 61.595959596 (std = 31.7676767677).


[2017-07-21 02:31:03,677] Making new env: InvertedPendulum-v1


Average episode duration: 150.538456 ms

EXECUTING EXPERIMENT 2 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Final mean reward, averaged over 3 experiments: 53.4208754209 (std = 28.3981371575).
Average episode duration: 221.065929 ms
Average final reward: 53.34 (std=41.92).

The 100-episode moving average reached 930 after 0 episodes.
Problem: InvertedPendulum-v1, Actions: 1001, Temp: 1, Strategy: softmax, Backup: softbellman


[2017-07-21 02:45:54,546] Making new env: InvertedPendulum-v1
[2017-07-21 02:45:54,891] Making new env: InvertedPendulum-v1



EXECUTING EXPERIMENT 0 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Final mean reward, averaged over 1 experiment: 206.313131313 (std = 0.0).


[2017-07-21 03:00:03,702] Making new env: InvertedPendulum-v1


Average episode duration: 211.496445 ms

EXECUTING EXPERIMENT 1 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 2 experiments: 496.873737374 (std = 290.560606061).


[2017-07-21 03:28:42,954] Making new env: InvertedPendulum-v1


Average episode duration: 429.033018 ms

EXECUTING EXPERIMENT 2 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Final mean reward, averaged over 3 experiments: 502.757575758 (std = 237.387621773).
Average episode duration: 259.960649 ms
Average final reward: 506.28 (std=227.92).

The 100-episode moving average reached 930 after 0 episodes.
Problem: InvertedPendulum-v1, Actions: 1001, Temp: 1, Strategy: softmax, Backup: bellman


[2017-07-21 03:46:10,200] Making new env: InvertedPendulum-v1
[2017-07-21 03:46:10,204] Making new env: InvertedPendulum-v1



EXECUTING EXPERIMENT 0 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Final mean reward, averaged over 1 experiment: 6.0 (std = 0.0).


[2017-07-21 03:53:55,886] Making new env: InvertedPendulum-v1


Average episode duration: 115.732341 ms

EXECUTING EXPERIMENT 1 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 2 experiments: 503.0 (std = 497.0).


[2017-07-21 04:16:10,648] Making new env: InvertedPendulum-v1


Average episode duration: 333.044109 ms

EXECUTING EXPERIMENT 2 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 3 experiments: 668.666666667 (std = 468.576093666).
Average episode duration: 497.719513 ms
Average final reward: 668.66 (std=0.45).

The 100-episode moving average reached 930 after 0 episodes.
Problem: InvertedPendulum-v1, Actions: 1001, Temp: 1, Strategy: epsilon, Backup: sparsebellman


[2017-07-21 04:49:27,733] Making new env: InvertedPendulum-v1
[2017-07-21 04:49:27,737] Making new env: InvertedPendulum-v1



EXECUTING EXPERIMENT 0 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Final mean reward, averaged over 1 experiment: 16.4545454545 (std = 0.0).


[2017-07-21 04:57:00,126] Making new env: InvertedPendulum-v1


Average episode duration: 112.424549 ms

EXECUTING EXPERIMENT 1 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Final mean reward, averaged over 2 experiments: 16.9292929293 (std = 0.474747474747).


[2017-07-21 05:01:18,015] Making new env: InvertedPendulum-v1


Average episode duration: 63.692365 ms

EXECUTING EXPERIMENT 2 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 3 experiments: 344.61952862 (std = 463.424137686).
Average episode duration: 260.116009 ms
Average final reward: 344.55 (std=6.60).

The 100-episode moving average reached 930 after 0 episodes.
Problem: InvertedPendulum-v1, Actions: 1001, Temp: 1, Strategy: epsilon, Backup: softbellman


[2017-07-21 05:18:44,395] Making new env: InvertedPendulum-v1
[2017-07-21 05:18:44,400] Making new env: InvertedPendulum-v1



EXECUTING EXPERIMENT 0 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Final mean reward, averaged over 1 experiment: 3.0 (std = 0.0).


[2017-07-21 05:37:42,980] Making new env: InvertedPendulum-v1


Average episode duration: 283.855446 ms

EXECUTING EXPERIMENT 1 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 2 experiments: 487.247474747 (std = 484.247474747).


[2017-07-21 06:04:25,309] Making new env: InvertedPendulum-v1


Average episode duration: 399.816622 ms

EXECUTING EXPERIMENT 2 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 3 experiments: 658.164983165 (std = 463.417738727).
Average episode duration: 621.220924 ms
Average final reward: 658.26 (std=33.56).

The 100-episode moving average reached 930 after 3379 episodes.
Problem: InvertedPendulum-v1, Actions: 1001, Temp: 1, Strategy: epsilon, Backup: bellman


[2017-07-21 06:45:56,424] Making new env: InvertedPendulum-v1
[2017-07-21 06:45:56,445] Making new env: InvertedPendulum-v1



EXECUTING EXPERIMENT 0 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 1 experiment: 1000.0 (std = 0.0).


[2017-07-21 07:25:24,582] Making new env: InvertedPendulum-v1


Average episode duration: 591.448076 ms

EXECUTING EXPERIMENT 1 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 2 experiments: 1000.0 (std = 0.0).


[2017-07-21 07:59:14,879] Making new env: InvertedPendulum-v1


Average episode duration: 506.259676 ms

EXECUTING EXPERIMENT 2 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 3 experiments: 1000.0 (std = 0.0).
Average episode duration: 569.409982 ms
Average final reward: 1000.00 (std=0.00).

The 100-episode moving average reached 930 after 1211 episodes.
Problem: InvertedPendulum-v1, Actions: 1001, Temp: 0.1, Strategy: sparsemax, Backup: sparsebellman


[2017-07-21 08:37:18,080] Making new env: InvertedPendulum-v1
[2017-07-21 08:37:18,084] Making new env: InvertedPendulum-v1



EXECUTING EXPERIMENT 0 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 1 experiment: 1000.0 (std = 0.0).


[2017-07-21 09:14:51,554] Making new env: InvertedPendulum-v1


Average episode duration: 562.736892 ms

EXECUTING EXPERIMENT 1 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Final mean reward, averaged over 2 experiments: 502.267676768 (std = 497.732323232).


[2017-07-21 09:19:14,988] Making new env: InvertedPendulum-v1


Average episode duration: 65.245633 ms

EXECUTING EXPERIMENT 2 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 3 experiments: 668.178451178 (std = 469.266534631).
Average episode duration: 561.080386 ms
Average final reward: 668.18 (std=0.17).

The 100-episode moving average reached 930 after 0 episodes.
Problem: InvertedPendulum-v1, Actions: 1001, Temp: 0.1, Strategy: sparsemax, Backup: softbellman


[2017-07-21 09:56:44,743] Making new env: InvertedPendulum-v1
[2017-07-21 09:56:44,747] Making new env: InvertedPendulum-v1



EXECUTING EXPERIMENT 0 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 1 experiment: 1000.0 (std = 0.0).


[2017-07-21 10:40:30,753] Making new env: InvertedPendulum-v1


Average episode duration: 655.744425 ms

EXECUTING EXPERIMENT 1 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 2 experiments: 1000.0 (std = 0.0).


[2017-07-21 11:44:37,781] Making new env: InvertedPendulum-v1


Average episode duration: 960.894900 ms

EXECUTING EXPERIMENT 2 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 3 experiments: 1000.0 (std = 0.0).
Average episode duration: 534.046637 ms
Average final reward: 1000.00 (std=0.00).

The 100-episode moving average reached 930 after 721 episodes.
Problem: InvertedPendulum-v1, Actions: 1001, Temp: 0.1, Strategy: sparsemax, Backup: bellman


[2017-07-21 12:20:20,020] Making new env: InvertedPendulum-v1
[2017-07-21 12:20:20,024] Making new env: InvertedPendulum-v1



EXECUTING EXPERIMENT 0 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 1 experiment: 1000.0 (std = 0.0).


[2017-07-21 12:57:33,333] Making new env: InvertedPendulum-v1


Average episode duration: 557.019170 ms

EXECUTING EXPERIMENT 1 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 2 experiments: 1000.0 (std = 0.0).


[2017-07-21 13:38:11,978] Making new env: InvertedPendulum-v1


Average episode duration: 608.891228 ms

EXECUTING EXPERIMENT 2 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 3 experiments: 1000.0 (std = 0.0).
Average episode duration: 629.706459 ms
Average final reward: 1000.00 (std=0.00).

The 100-episode moving average reached 930 after 825 episodes.
Problem: InvertedPendulum-v1, Actions: 1001, Temp: 0.1, Strategy: softmax, Backup: sparsebellman


[2017-07-21 14:20:16,730] Making new env: InvertedPendulum-v1
[2017-07-21 14:20:16,735] Making new env: InvertedPendulum-v1



EXECUTING EXPERIMENT 0 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 1 experiment: 812.161616162 (std = 0.0).


[2017-07-21 14:49:30,250] Making new env: InvertedPendulum-v1


Average episode duration: 437.813377 ms

EXECUTING EXPERIMENT 1 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 2 experiments: 906.080808081 (std = 93.9191919192).


[2017-07-21 15:34:18,751] Making new env: InvertedPendulum-v1


Average episode duration: 671.515989 ms

EXECUTING EXPERIMENT 2 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 3 experiments: 937.387205387 (std = 88.5478633195).
Average episode duration: 602.674126 ms
Average final reward: 938.01 (std=117.47).

The 100-episode moving average reached 930 after 1276 episodes.
Problem: InvertedPendulum-v1, Actions: 1001, Temp: 0.1, Strategy: softmax, Backup: softbellman


[2017-07-21 16:14:37,214] Making new env: InvertedPendulum-v1
[2017-07-21 16:14:37,220] Making new env: InvertedPendulum-v1



EXECUTING EXPERIMENT 0 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 1 experiment: 1000.0 (std = 0.0).


[2017-07-21 16:56:17,426] Making new env: InvertedPendulum-v1


Average episode duration: 624.310370 ms

EXECUTING EXPERIMENT 1 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 2 experiments: 1000.0 (std = 0.0).


[2017-07-21 17:44:41,121] Making new env: InvertedPendulum-v1


Average episode duration: 725.219023 ms

EXECUTING EXPERIMENT 2 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 3 experiments: 1000.0 (std = 0.0).
Average episode duration: 544.071939 ms
Average final reward: 1000.00 (std=0.00).

The 100-episode moving average reached 930 after 750 episodes.
Problem: InvertedPendulum-v1, Actions: 1001, Temp: 0.1, Strategy: softmax, Backup: bellman


[2017-07-21 18:21:04,457] Making new env: InvertedPendulum-v1
[2017-07-21 18:21:04,462] Making new env: InvertedPendulum-v1



EXECUTING EXPERIMENT 0 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 1 experiment: 1000.0 (std = 0.0).


[2017-07-21 19:07:01,285] Making new env: InvertedPendulum-v1


Average episode duration: 688.404159 ms

EXECUTING EXPERIMENT 1 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 2 experiments: 1000.0 (std = 0.0).


[2017-07-21 19:46:01,876] Making new env: InvertedPendulum-v1


Average episode duration: 584.522952 ms

EXECUTING EXPERIMENT 2 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 3 experiments: 1000.0 (std = 0.0).
Average episode duration: 746.902926 ms
Average final reward: 1000.00 (std=0.00).

The 100-episode moving average reached 930 after 909 episodes.
Problem: InvertedPendulum-v1, Actions: 1001, Temp: 0.1, Strategy: epsilon, Backup: sparsebellman


[2017-07-21 20:35:56,129] Making new env: InvertedPendulum-v1
[2017-07-21 20:35:58,020] Making new env: InvertedPendulum-v1



EXECUTING EXPERIMENT 0 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Final mean reward, averaged over 1 experiment: 8.62626262626 (std = 0.0).


[2017-07-21 20:52:07,551] Making new env: InvertedPendulum-v1


Average episode duration: 241.725865 ms

EXECUTING EXPERIMENT 1 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Final mean reward, averaged over 2 experiments: 26.4494949495 (std = 17.8232323232).


[2017-07-21 21:04:45,204] Making new env: InvertedPendulum-v1


Average episode duration: 188.720551 ms

EXECUTING EXPERIMENT 2 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 3 experiments: 350.966329966 (std = 459.166778878).
Average episode duration: 649.084302 ms
Average final reward: 351.06 (std=6.52).

The 100-episode moving average reached 930 after 0 episodes.
Problem: InvertedPendulum-v1, Actions: 1001, Temp: 0.1, Strategy: epsilon, Backup: softbellman


[2017-07-21 21:48:08,045] Making new env: InvertedPendulum-v1
[2017-07-21 21:48:08,090] Making new env: InvertedPendulum-v1



EXECUTING EXPERIMENT 0 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 1 experiment: 1000.0 (std = 0.0).


[2017-07-21 22:09:22,892] Making new env: InvertedPendulum-v1


Average episode duration: 318.114920 ms

EXECUTING EXPERIMENT 1 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 2 experiments: 1000.0 (std = 0.0).


[2017-07-21 22:51:25,500] Making new env: InvertedPendulum-v1


Average episode duration: 629.993225 ms

EXECUTING EXPERIMENT 2 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 3 experiments: 1000.0 (std = 0.0).
Average episode duration: 620.285365 ms
Average final reward: 1000.00 (std=0.00).

The 100-episode moving average reached 930 after 3269 episodes.
Problem: InvertedPendulum-v1, Actions: 1001, Temp: 0.1, Strategy: epsilon, Backup: bellman


[2017-07-21 23:32:55,126] Making new env: InvertedPendulum-v1
[2017-07-21 23:32:55,417] Making new env: InvertedPendulum-v1



EXECUTING EXPERIMENT 0 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 1 experiment: 1000.0 (std = 0.0).


[2017-07-22 00:14:33,549] Making new env: InvertedPendulum-v1


Average episode duration: 623.886056 ms

EXECUTING EXPERIMENT 1 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 2 experiments: 1000.0 (std = 0.0).


[2017-07-22 00:49:45,082] Making new env: InvertedPendulum-v1


Average episode duration: 527.268395 ms

EXECUTING EXPERIMENT 2 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 3 experiments: 1000.0 (std = 0.0).
Average episode duration: 392.295860 ms
Average final reward: 1000.00 (std=0.00).

The 100-episode moving average reached 930 after 2569 episodes.
Problem: InvertedPendulum-v1, Actions: 1001, Temp: 0.01, Strategy: sparsemax, Backup: sparsebellman


[2017-07-22 01:15:59,682] Making new env: InvertedPendulum-v1
[2017-07-22 01:15:59,685] Making new env: InvertedPendulum-v1



EXECUTING EXPERIMENT 0 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 1 experiment: 1000.0 (std = 0.0).


[2017-07-22 01:49:06,152] Making new env: InvertedPendulum-v1


Average episode duration: 496.011855 ms

EXECUTING EXPERIMENT 1 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 2 experiments: 1000.0 (std = 0.0).


[2017-07-22 02:49:41,814] Making new env: InvertedPendulum-v1


Average episode duration: 908.291406 ms

EXECUTING EXPERIMENT 2 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 3 experiments: 1000.0 (std = 0.0).
Average episode duration: 824.834128 ms
Average final reward: 1000.00 (std=0.00).

The 100-episode moving average reached 930 after 886 episodes.
Problem: InvertedPendulum-v1, Actions: 1001, Temp: 0.01, Strategy: sparsemax, Backup: softbellman


[2017-07-22 03:44:46,722] Making new env: InvertedPendulum-v1
[2017-07-22 03:44:46,995] Making new env: InvertedPendulum-v1



EXECUTING EXPERIMENT 0 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 1 experiment: 1000.0 (std = 0.0).


[2017-07-22 04:54:22,304] Making new env: InvertedPendulum-v1


Average episode duration: 1043.137515 ms

EXECUTING EXPERIMENT 1 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 2 experiments: 1000.0 (std = 0.0).


[2017-07-22 05:32:54,852] Making new env: InvertedPendulum-v1


Average episode duration: 577.459308 ms

EXECUTING EXPERIMENT 2 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 3 experiments: 775.508417508 (std = 317.479040598).
Average episode duration: 276.485550 ms
Average final reward: 775.83 (std=49.40).

The 100-episode moving average reached 930 after 0 episodes.
Problem: InvertedPendulum-v1, Actions: 1001, Temp: 0.01, Strategy: sparsemax, Backup: bellman


[2017-07-22 05:51:26,370] Making new env: InvertedPendulum-v1
[2017-07-22 05:51:26,374] Making new env: InvertedPendulum-v1



EXECUTING EXPERIMENT 0 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 1 experiment: 1000.0 (std = 0.0).


[2017-07-22 06:29:07,710] Making new env: InvertedPendulum-v1


Average episode duration: 564.709631 ms

EXECUTING EXPERIMENT 1 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 2 experiments: 1000.0 (std = 0.0).


[2017-07-22 07:06:28,608] Making new env: InvertedPendulum-v1


Average episode duration: 559.563007 ms

EXECUTING EXPERIMENT 2 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 3 experiments: 1000.0 (std = 0.0).
Average episode duration: 1102.719591 ms
Average final reward: 1000.00 (std=0.00).

The 100-episode moving average reached 930 after 767 episodes.
Problem: InvertedPendulum-v1, Actions: 1001, Temp: 0.01, Strategy: softmax, Backup: sparsebellman


[2017-07-22 08:20:07,546] Making new env: InvertedPendulum-v1
[2017-07-22 08:20:07,550] Making new env: InvertedPendulum-v1



EXECUTING EXPERIMENT 0 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 1 experiment: 1000.0 (std = 0.0).


[2017-07-22 08:55:33,791] Making new env: InvertedPendulum-v1


Average episode duration: 530.969047 ms

EXECUTING EXPERIMENT 1 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Final mean reward, averaged over 2 experiments: 604.803030303 (std = 395.196969697).


[2017-07-22 09:19:52,482] Making new env: InvertedPendulum-v1


Average episode duration: 364.073754 ms

EXECUTING EXPERIMENT 2 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 3 experiments: 736.535353535 (std = 372.595276236).
Average episode duration: 663.658840 ms
Average final reward: 736.36 (std=70.67).

The 100-episode moving average reached 930 after 0 episodes.
Problem: InvertedPendulum-v1, Actions: 1001, Temp: 0.01, Strategy: softmax, Backup: softbellman


[2017-07-22 10:04:12,963] Making new env: InvertedPendulum-v1
[2017-07-22 10:04:12,968] Making new env: InvertedPendulum-v1



EXECUTING EXPERIMENT 0 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 1 experiment: 1000.0 (std = 0.0).


[2017-07-22 10:44:13,482] Making new env: InvertedPendulum-v1


Average episode duration: 599.475300 ms

EXECUTING EXPERIMENT 1 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 2 experiments: 1000.0 (std = 0.0).


[2017-07-22 11:18:58,376] Making new env: InvertedPendulum-v1


Average episode duration: 520.608016 ms

EXECUTING EXPERIMENT 2 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 3 experiments: 1000.0 (std = 0.0).
Average episode duration: 355.512172 ms
Average final reward: 1000.00 (std=0.00).

The 100-episode moving average reached 930 after 2699 episodes.
Problem: InvertedPendulum-v1, Actions: 1001, Temp: 0.01, Strategy: softmax, Backup: bellman


[2017-07-22 11:42:45,997] Making new env: InvertedPendulum-v1
[2017-07-22 11:42:46,001] Making new env: InvertedPendulum-v1



EXECUTING EXPERIMENT 0 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 1 experiment: 1000.0 (std = 0.0).


[2017-07-22 12:47:29,103] Making new env: InvertedPendulum-v1


Average episode duration: 970.161141 ms

EXECUTING EXPERIMENT 1 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 2 experiments: 1000.0 (std = 0.0).


[2017-07-22 13:31:23,971] Making new env: InvertedPendulum-v1


Average episode duration: 658.095102 ms

EXECUTING EXPERIMENT 2 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 3 experiments: 1000.0 (std = 0.0).
Average episode duration: 587.742496 ms
Average final reward: 1000.00 (std=0.00).

The 100-episode moving average reached 930 after 571 episodes.
Problem: InvertedPendulum-v1, Actions: 1001, Temp: 0.01, Strategy: epsilon, Backup: sparsebellman


[2017-07-22 14:10:40,650] Making new env: InvertedPendulum-v1
[2017-07-22 14:10:40,654] Making new env: InvertedPendulum-v1



EXECUTING EXPERIMENT 0 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 1 experiment: 1000.0 (std = 0.0).


[2017-07-22 14:52:59,568] Making new env: InvertedPendulum-v1


Average episode duration: 634.143233 ms

EXECUTING EXPERIMENT 1 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Final mean reward, averaged over 2 experiments: 538.656565657 (std = 461.343434343).


[2017-07-22 15:01:37,250] Making new env: InvertedPendulum-v1


Average episode duration: 128.143207 ms

EXECUTING EXPERIMENT 2 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Final mean reward, averaged over 3 experiments: 475.346801347 (std = 387.179668184).
Average episode duration: 724.745402 ms
Average final reward: 477.48 (std=97.49).

The 100-episode moving average reached 930 after 0 episodes.
Problem: InvertedPendulum-v1, Actions: 1001, Temp: 0.01, Strategy: epsilon, Backup: softbellman


[2017-07-22 15:50:02,301] Making new env: InvertedPendulum-v1
[2017-07-22 15:50:02,324] Making new env: InvertedPendulum-v1



EXECUTING EXPERIMENT 0 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 1 experiment: 1000.0 (std = 0.0).


[2017-07-22 16:29:07,965] Making new env: InvertedPendulum-v1


Average episode duration: 585.850336 ms

EXECUTING EXPERIMENT 1 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 2 experiments: 1000.0 (std = 0.0).


[2017-07-22 16:58:47,898] Making new env: InvertedPendulum-v1


Average episode duration: 444.396084 ms

EXECUTING EXPERIMENT 2 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Final mean reward, averaged over 3 experiments: 735.828282828 (std = 373.59522522).
Average episode duration: 320.998259 ms
Average final reward: 735.15 (std=108.81).

The 100-episode moving average reached 930 after 0 episodes.
Problem: InvertedPendulum-v1, Actions: 1001, Temp: 0.01, Strategy: epsilon, Backup: bellman


[2017-07-22 17:20:19,916] Making new env: InvertedPendulum-v1
[2017-07-22 17:20:20,188] Making new env: InvertedPendulum-v1



EXECUTING EXPERIMENT 0 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 1 experiment: 999.282828283 (std = 0.0).


[2017-07-22 17:46:46,377] Making new env: InvertedPendulum-v1


Average episode duration: 395.960714 ms

EXECUTING EXPERIMENT 1 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Final mean reward, averaged over 2 experiments: 547.207070707 (std = 452.075757576).


[2017-07-22 17:55:32,154] Making new env: InvertedPendulum-v1


Average episode duration: 130.775872 ms

EXECUTING EXPERIMENT 2 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 3 experiments: 698.138047138 (std = 426.390251239).
Average episode duration: 620.569858 ms
Average final reward: 698.05 (std=39.64).

The 100-episode moving average reached 930 after 2350 episodes.
Problem: InvertedPendulum-v1, Actions: 101, Temp: 1, Strategy: sparsemax, Backup: sparsebellman


[2017-07-22 18:37:02,713] Making new env: InvertedPendulum-v1
[2017-07-22 18:37:05,436] Making new env: InvertedPendulum-v1



EXECUTING EXPERIMENT 0 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 1 experiment: 1000.0 (std = 0.0).


[2017-07-22 19:15:08,176] Making new env: InvertedPendulum-v1


Average episode duration: 570.084874 ms

EXECUTING EXPERIMENT 1 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 2 experiments: 1000.0 (std = 0.0).


[2017-07-22 19:53:33,022] Making new env: InvertedPendulum-v1


Average episode duration: 575.518126 ms

EXECUTING EXPERIMENT 2 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 3 experiments: 996.794612795 (std = 4.53310205852).
Average episode duration: 784.873619 ms
Average final reward: 996.83 (std=31.57).

The 100-episode moving average reached 930 after 607 episodes.
Problem: InvertedPendulum-v1, Actions: 101, Temp: 1, Strategy: sparsemax, Backup: softbellman


[2017-07-22 20:45:58,142] Making new env: InvertedPendulum-v1
[2017-07-22 20:45:58,146] Making new env: InvertedPendulum-v1



EXECUTING EXPERIMENT 0 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 1 experiment: 1000.0 (std = 0.0).


[2017-07-22 21:25:27,372] Making new env: InvertedPendulum-v1


Average episode duration: 591.620983 ms

EXECUTING EXPERIMENT 1 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 2 experiments: 1000.0 (std = 0.0).


[2017-07-22 22:53:24,117] Making new env: InvertedPendulum-v1


Average episode duration: 1318.391581 ms

EXECUTING EXPERIMENT 2 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 3 experiments: 1000.0 (std = 0.0).
Average episode duration: 541.800858 ms
Average final reward: 1000.00 (std=0.00).

The 100-episode moving average reached 930 after 1040 episodes.
Problem: InvertedPendulum-v1, Actions: 101, Temp: 1, Strategy: sparsemax, Backup: bellman


[2017-07-22 23:29:37,272] Making new env: InvertedPendulum-v1
[2017-07-22 23:29:37,297] Making new env: InvertedPendulum-v1



EXECUTING EXPERIMENT 0 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 1 experiment: 1000.0 (std = 0.0).


[2017-07-23 00:09:30,927] Making new env: InvertedPendulum-v1


Average episode duration: 597.660226 ms

EXECUTING EXPERIMENT 1 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 2 experiments: 1000.0 (std = 0.0).


[2017-07-23 00:51:27,270] Making new env: InvertedPendulum-v1


Average episode duration: 628.428544 ms

EXECUTING EXPERIMENT 2 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 3 experiments: 1000.0 (std = 0.0).
Average episode duration: 757.804560 ms
Average final reward: 1000.00 (std=0.00).

The 100-episode moving average reached 930 after 728 episodes.
Problem: InvertedPendulum-v1, Actions: 101, Temp: 1, Strategy: softmax, Backup: sparsebellman


[2017-07-23 01:42:04,484] Making new env: InvertedPendulum-v1
[2017-07-23 01:42:04,757] Making new env: InvertedPendulum-v1



EXECUTING EXPERIMENT 0 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 1 experiment: 1000.0 (std = 0.0).


[2017-07-23 02:17:51,693] Making new env: InvertedPendulum-v1


Average episode duration: 536.137983 ms

EXECUTING EXPERIMENT 1 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Final mean reward, averaged over 2 experiments: 753.686868687 (std = 246.313131313).


[2017-07-23 03:04:08,213] Making new env: InvertedPendulum-v1


Average episode duration: 693.545104 ms

EXECUTING EXPERIMENT 2 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 3 experiments: 835.791245791 (std = 232.226247262).
Average episode duration: 525.658424 ms
Average final reward: 835.54 (std=90.63).

The 100-episode moving average reached 930 after 1998 episodes.
Problem: InvertedPendulum-v1, Actions: 101, Temp: 1, Strategy: softmax, Backup: softbellman


[2017-07-23 03:39:17,807] Making new env: InvertedPendulum-v1
[2017-07-23 03:39:17,894] Making new env: InvertedPendulum-v1



EXECUTING EXPERIMENT 0 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Final mean reward, averaged over 1 experiment: 216.505050505 (std = 0.0).


[2017-07-23 03:57:56,369] Making new env: InvertedPendulum-v1


Average episode duration: 278.880649 ms

EXECUTING EXPERIMENT 1 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 2 experiments: 608.252525253 (std = 391.747474747).


[2017-07-23 04:34:43,256] Making new env: InvertedPendulum-v1


Average episode duration: 550.410868 ms

EXECUTING EXPERIMENT 2 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 3 experiments: 738.835016835 (std = 369.343061209).
Average episode duration: 550.515525 ms
Average final reward: 738.32 (std=128.14).

The 100-episode moving average reached 930 after 1442 episodes.
Problem: InvertedPendulum-v1, Actions: 101, Temp: 1, Strategy: softmax, Backup: bellman


[2017-07-23 05:11:31,182] Making new env: InvertedPendulum-v1
[2017-07-23 05:11:31,187] Making new env: InvertedPendulum-v1



EXECUTING EXPERIMENT 0 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 1 experiment: 1000.0 (std = 0.0).


[2017-07-23 05:51:30,656] Making new env: InvertedPendulum-v1


Average episode duration: 599.129146 ms

EXECUTING EXPERIMENT 1 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Final mean reward, averaged over 2 experiments: 524.308080808 (std = 475.691919192).


[2017-07-23 06:05:52,186] Making new env: InvertedPendulum-v1


Average episode duration: 214.676511 ms

EXECUTING EXPERIMENT 2 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Final mean reward, averaged over 3 experiments: 620.892255892 (std = 411.71860181).
Average episode duration: 972.159098 ms
Average final reward: 621.52 (std=101.44).

The 100-episode moving average reached 930 after 0 episodes.
Problem: InvertedPendulum-v1, Actions: 101, Temp: 1, Strategy: epsilon, Backup: sparsebellman


[2017-07-23 07:10:47,048] Making new env: InvertedPendulum-v1
[2017-07-23 07:10:47,052] Making new env: InvertedPendulum-v1



EXECUTING EXPERIMENT 0 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Final mean reward, averaged over 1 experiment: 7.0303030303 (std = 0.0).


[2017-07-23 07:29:06,079] Making new env: InvertedPendulum-v1


Average episode duration: 274.125236 ms

EXECUTING EXPERIMENT 1 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 2 experiments: 503.515151515 (std = 496.484848485).


[2017-07-23 08:09:58,559] Making new env: InvertedPendulum-v1


Average episode duration: 612.438954 ms

EXECUTING EXPERIMENT 2 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 3 experiments: 669.01010101 (std = 468.09040416).
Average episode duration: 179.952487 ms
Average final reward: 669.02 (std=0.43).

The 100-episode moving average reached 930 after 0 episodes.
Problem: InvertedPendulum-v1, Actions: 101, Temp: 1, Strategy: epsilon, Backup: softbellman


[2017-07-23 08:22:04,045] Making new env: InvertedPendulum-v1
[2017-07-23 08:22:04,049] Making new env: InvertedPendulum-v1



EXECUTING EXPERIMENT 0 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 1 experiment: 1000.0 (std = 0.0).


[2017-07-23 09:00:54,699] Making new env: InvertedPendulum-v1


Average episode duration: 582.019107 ms

EXECUTING EXPERIMENT 1 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Final mean reward, averaged over 2 experiments: 527.333333333 (std = 472.666666667).


[2017-07-23 09:10:09,451] Making new env: InvertedPendulum-v1


Average episode duration: 138.077134 ms

EXECUTING EXPERIMENT 2 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 3 experiments: 684.888888889 (std = 445.634406988).
Average episode duration: 424.279757 ms
Average final reward: 684.94 (std=4.66).

The 100-episode moving average reached 930 after 0 episodes.
Problem: InvertedPendulum-v1, Actions: 101, Temp: 1, Strategy: epsilon, Backup: bellman


[2017-07-23 09:38:32,103] Making new env: InvertedPendulum-v1
[2017-07-23 09:38:32,107] Making new env: InvertedPendulum-v1



EXECUTING EXPERIMENT 0 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 1 experiment: 999.96969697 (std = 0.0).


[2017-07-23 10:05:49,109] Making new env: InvertedPendulum-v1


Average episode duration: 408.686227 ms

EXECUTING EXPERIMENT 1 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 2 experiments: 999.984848485 (std = 0.0151515151515).


[2017-07-23 10:46:50,105] Making new env: InvertedPendulum-v1


Average episode duration: 614.639968 ms

EXECUTING EXPERIMENT 2 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 3 experiments: 999.98989899 (std = 0.0142849854785).
Average episode duration: 327.314652 ms
Average final reward: 999.99 (std=0.10).

The 100-episode moving average reached 930 after 2742 episodes.
Problem: InvertedPendulum-v1, Actions: 101, Temp: 0.1, Strategy: sparsemax, Backup: sparsebellman


[2017-07-23 11:08:44,725] Making new env: InvertedPendulum-v1
[2017-07-23 11:08:44,730] Making new env: InvertedPendulum-v1



EXECUTING EXPERIMENT 0 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 1 experiment: 1000.0 (std = 0.0).


[2017-07-23 11:58:58,420] Making new env: InvertedPendulum-v1


Average episode duration: 752.852360 ms

EXECUTING EXPERIMENT 1 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Final mean reward, averaged over 2 experiments: 898.696969697 (std = 101.303030303).


[2017-07-23 13:00:20,054] Making new env: InvertedPendulum-v1


Average episode duration: 919.795220 ms

EXECUTING EXPERIMENT 2 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 3 experiments: 932.464646465 (std = 95.5094129094).
Average episode duration: 706.710739 ms
Average final reward: 933.14 (std=106.13).

The 100-episode moving average reached 930 after 3651 episodes.
Problem: InvertedPendulum-v1, Actions: 101, Temp: 0.1, Strategy: sparsemax, Backup: softbellman


[2017-07-23 13:47:32,666] Making new env: InvertedPendulum-v1
[2017-07-23 13:47:32,671] Making new env: InvertedPendulum-v1



EXECUTING EXPERIMENT 0 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 1 experiment: 1000.0 (std = 0.0).


[2017-07-23 14:53:18,802] Making new env: InvertedPendulum-v1


Average episode duration: 985.932933 ms

EXECUTING EXPERIMENT 1 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 2 experiments: 1000.0 (std = 0.0).


[2017-07-23 15:40:02,982] Making new env: InvertedPendulum-v1


Average episode duration: 700.456747 ms

EXECUTING EXPERIMENT 2 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 3 experiments: 1000.0 (std = 0.0).
Average episode duration: 531.289801 ms
Average final reward: 1000.00 (std=0.00).

The 100-episode moving average reached 930 after 684 episodes.
Problem: InvertedPendulum-v1, Actions: 101, Temp: 0.1, Strategy: sparsemax, Backup: bellman


[2017-07-23 16:15:33,574] Making new env: InvertedPendulum-v1
[2017-07-23 16:15:33,578] Making new env: InvertedPendulum-v1



EXECUTING EXPERIMENT 0 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 1 experiment: 1000.0 (std = 0.0).


[2017-07-23 17:04:49,305] Making new env: InvertedPendulum-v1


Average episode duration: 738.374253 ms

EXECUTING EXPERIMENT 1 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 2 experiments: 1000.0 (std = 0.0).


[2017-07-23 17:43:15,976] Making new env: InvertedPendulum-v1


Average episode duration: 576.055663 ms

EXECUTING EXPERIMENT 2 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 3 experiments: 745.404040404 (std = 360.053058986).
Average episode duration: 162.201389 ms
Average final reward: 745.48 (std=49.86).

The 100-episode moving average reached 930 after 0 episodes.
Problem: InvertedPendulum-v1, Actions: 101, Temp: 0.1, Strategy: softmax, Backup: sparsebellman


[2017-07-23 17:54:10,425] Making new env: InvertedPendulum-v1
[2017-07-23 17:54:10,429] Making new env: InvertedPendulum-v1



EXECUTING EXPERIMENT 0 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 1 experiment: 1000.0 (std = 0.0).


[2017-07-23 18:27:47,625] Making new env: InvertedPendulum-v1


Average episode duration: 503.662785 ms

EXECUTING EXPERIMENT 1 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 2 experiments: 1000.0 (std = 0.0).


[2017-07-23 19:07:12,894] Making new env: InvertedPendulum-v1


Average episode duration: 590.706640 ms

EXECUTING EXPERIMENT 2 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 3 experiments: 1000.0 (std = 0.0).
Average episode duration: 363.227599 ms
Average final reward: 1000.00 (std=0.00).

The 100-episode moving average reached 930 after 3178 episodes.
Problem: InvertedPendulum-v1, Actions: 101, Temp: 0.1, Strategy: softmax, Backup: softbellman


[2017-07-23 19:31:31,064] Making new env: InvertedPendulum-v1
[2017-07-23 19:31:31,067] Making new env: InvertedPendulum-v1



EXECUTING EXPERIMENT 0 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 1 experiment: 1000.0 (std = 0.0).


[2017-07-23 20:04:38,030] Making new env: InvertedPendulum-v1


Average episode duration: 496.144305 ms

EXECUTING EXPERIMENT 1 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 2 experiments: 1000.0 (std = 0.0).


[2017-07-23 20:40:47,706] Making new env: InvertedPendulum-v1


Average episode duration: 541.807815 ms

EXECUTING EXPERIMENT 2 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 3 experiments: 1000.0 (std = 0.0).
Average episode duration: 598.279193 ms
Average final reward: 1000.00 (std=0.00).

The 100-episode moving average reached 930 after 847 episodes.
Problem: InvertedPendulum-v1, Actions: 101, Temp: 0.1, Strategy: softmax, Backup: bellman


[2017-07-23 21:20:46,389] Making new env: InvertedPendulum-v1
[2017-07-23 21:20:46,393] Making new env: InvertedPendulum-v1



EXECUTING EXPERIMENT 0 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 1 experiment: 1000.0 (std = 0.0).


[2017-07-23 22:07:36,406] Making new env: InvertedPendulum-v1


Average episode duration: 701.864228 ms

EXECUTING EXPERIMENT 1 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 2 experiments: 1000.0 (std = 0.0).


[2017-07-23 22:41:30,463] Making new env: InvertedPendulum-v1


Average episode duration: 507.823491 ms

EXECUTING EXPERIMENT 2 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 3 experiments: 1000.0 (std = 0.0).
Average episode duration: 1109.151491 ms
Average final reward: 1000.00 (std=0.00).

The 100-episode moving average reached 930 after 1287 episodes.
Problem: InvertedPendulum-v1, Actions: 101, Temp: 0.1, Strategy: epsilon, Backup: sparsebellman


[2017-07-23 23:55:32,696] Making new env: InvertedPendulum-v1
[2017-07-23 23:55:32,701] Making new env: InvertedPendulum-v1



EXECUTING EXPERIMENT 0 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 1 experiment: 1000.0 (std = 0.0).


[2017-07-24 00:43:30,522] Making new env: InvertedPendulum-v1


Average episode duration: 718.881227 ms

EXECUTING EXPERIMENT 1 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 2 experiments: 1000.0 (std = 0.0).


[2017-07-24 01:03:38,748] Making new env: InvertedPendulum-v1


Average episode duration: 301.394145 ms

EXECUTING EXPERIMENT 2 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 3 experiments: 1000.0 (std = 0.0).
Average episode duration: 595.505959 ms
Average final reward: 1000.00 (std=0.00).

The 100-episode moving average reached 930 after 2517 episodes.
Problem: InvertedPendulum-v1, Actions: 101, Temp: 0.1, Strategy: epsilon, Backup: softbellman


[2017-07-24 01:43:26,856] Making new env: InvertedPendulum-v1
[2017-07-24 01:43:27,145] Making new env: InvertedPendulum-v1



EXECUTING EXPERIMENT 0 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 1 experiment: 1000.0 (std = 0.0).


[2017-07-24 02:20:29,087] Making new env: InvertedPendulum-v1


Average episode duration: 554.820901 ms

EXECUTING EXPERIMENT 1 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Final mean reward, averaged over 2 experiments: 530.772727273 (std = 469.227272727).


[2017-07-24 02:36:37,584] Making new env: InvertedPendulum-v1


Average episode duration: 241.573740 ms

EXECUTING EXPERIMENT 2 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Final mean reward, averaged over 3 experiments: 376.942760943 (std = 440.579322479).
Average episode duration: 206.640961 ms
Average final reward: 376.85 (std=35.27).

The 100-episode moving average reached 930 after 0 episodes.
Problem: InvertedPendulum-v1, Actions: 101, Temp: 0.1, Strategy: epsilon, Backup: bellman


[2017-07-24 02:50:29,807] Making new env: InvertedPendulum-v1
[2017-07-24 02:50:29,812] Making new env: InvertedPendulum-v1



EXECUTING EXPERIMENT 0 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Final mean reward, averaged over 1 experiment: 90.3737373737 (std = 0.0).


[2017-07-24 03:03:42,365] Making new env: InvertedPendulum-v1


Average episode duration: 197.496828 ms

EXECUTING EXPERIMENT 1 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Final mean reward, averaged over 2 experiments: 461.797979798 (std = 371.424242424).


[2017-07-24 04:06:49,293] Making new env: InvertedPendulum-v1


Average episode duration: 946.101930 ms

EXECUTING EXPERIMENT 2 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 3 experiments: 617.868686869 (std = 375.082359558).
Average episode duration: 604.676382 ms
Average final reward: 618.68 (std=126.72).

The 100-episode moving average reached 930 after 0 episodes.
Problem: InvertedPendulum-v1, Actions: 101, Temp: 0.01, Strategy: sparsemax, Backup: sparsebellman


[2017-07-24 04:47:14,109] Making new env: InvertedPendulum-v1
[2017-07-24 04:47:14,399] Making new env: InvertedPendulum-v1



EXECUTING EXPERIMENT 0 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 1 experiment: 976.161616162 (std = 0.0).


[2017-07-24 06:25:54,600] Making new env: InvertedPendulum-v1


Average episode duration: 1479.376045 ms

EXECUTING EXPERIMENT 1 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 2 experiments: 988.080808081 (std = 11.9191919192).


[2017-07-24 07:09:52,695] Making new env: InvertedPendulum-v1


Average episode duration: 658.836148 ms

EXECUTING EXPERIMENT 2 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 3 experiments: 992.053872054 (std = 11.2375219098).
Average episode duration: 578.215141 ms
Average final reward: 992.13 (std=41.24).

The 100-episode moving average reached 930 after 1048 episodes.
Problem: InvertedPendulum-v1, Actions: 101, Temp: 0.01, Strategy: sparsemax, Backup: softbellman


[2017-07-24 07:48:32,592] Making new env: InvertedPendulum-v1
[2017-07-24 07:48:32,596] Making new env: InvertedPendulum-v1



EXECUTING EXPERIMENT 0 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 1 experiment: 965.585858586 (std = 0.0).


[2017-07-24 08:43:17,757] Making new env: InvertedPendulum-v1


Average episode duration: 820.618607 ms

EXECUTING EXPERIMENT 1 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 2 experiments: 982.792929293 (std = 17.2070707071).


[2017-07-24 09:23:53,429] Making new env: InvertedPendulum-v1


Average episode duration: 608.180863 ms

EXECUTING EXPERIMENT 2 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 3 experiments: 988.528619529 (std = 16.2229818418).
Average episode duration: 696.109721 ms
Average final reward: 988.64 (std=55.71).

The 100-episode moving average reached 930 after 802 episodes.
Problem: InvertedPendulum-v1, Actions: 101, Temp: 0.01, Strategy: sparsemax, Backup: bellman


[2017-07-24 10:10:23,853] Making new env: InvertedPendulum-v1
[2017-07-24 10:10:23,858] Making new env: InvertedPendulum-v1



EXECUTING EXPERIMENT 0 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 1 experiment: 1000.0 (std = 0.0).


[2017-07-24 10:53:46,191] Making new env: InvertedPendulum-v1


Average episode duration: 649.912578 ms

EXECUTING EXPERIMENT 1 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 2 experiments: 1000.0 (std = 0.0).


[2017-07-24 11:36:06,452] Making new env: InvertedPendulum-v1


Average episode duration: 634.366832 ms

EXECUTING EXPERIMENT 2 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 3 experiments: 1000.0 (std = 0.0).
Average episode duration: 667.011802 ms
Average final reward: 1000.00 (std=0.00).

The 100-episode moving average reached 930 after 406 episodes.
Problem: InvertedPendulum-v1, Actions: 101, Temp: 0.01, Strategy: softmax, Backup: sparsebellman


[2017-07-24 12:20:40,652] Making new env: InvertedPendulum-v1
[2017-07-24 12:20:40,657] Making new env: InvertedPendulum-v1



EXECUTING EXPERIMENT 0 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 1 experiment: 1000.0 (std = 0.0).


[2017-07-24 12:59:52,375] Making new env: InvertedPendulum-v1


Average episode duration: 587.316016 ms

EXECUTING EXPERIMENT 1 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 2 experiments: 1000.0 (std = 0.0).


[2017-07-24 13:41:32,903] Making new env: InvertedPendulum-v1


Average episode duration: 624.483698 ms

EXECUTING EXPERIMENT 2 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 3 experiments: 1000.0 (std = 0.0).
Average episode duration: 691.372172 ms
Average final reward: 1000.00 (std=0.00).

The 100-episode moving average reached 930 after 566 episodes.
Problem: InvertedPendulum-v1, Actions: 101, Temp: 0.01, Strategy: softmax, Backup: softbellman


[2017-07-24 14:27:44,529] Making new env: InvertedPendulum-v1
[2017-07-24 14:27:44,594] Making new env: InvertedPendulum-v1



EXECUTING EXPERIMENT 0 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 1 experiment: 923.727272727 (std = 0.0).


[2017-07-24 15:01:08,799] Making new env: InvertedPendulum-v1


Average episode duration: 500.435516 ms

EXECUTING EXPERIMENT 1 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 2 experiments: 961.863636364 (std = 38.1363636364).


[2017-07-24 15:36:21,760] Making new env: InvertedPendulum-v1


Average episode duration: 527.590520 ms

EXECUTING EXPERIMENT 2 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 3 experiments: 974.575757576 (std = 35.9553084494).
Average episode duration: 585.002687 ms
Average final reward: 974.83 (std=67.24).

The 100-episode moving average reached 930 after 693 episodes.
Problem: InvertedPendulum-v1, Actions: 101, Temp: 0.01, Strategy: softmax, Backup: bellman


[2017-07-24 16:15:27,332] Making new env: InvertedPendulum-v1
[2017-07-24 16:15:27,337] Making new env: InvertedPendulum-v1



EXECUTING EXPERIMENT 0 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 1 experiment: 1000.0 (std = 0.0).


[2017-07-24 16:57:39,991] Making new env: InvertedPendulum-v1


Average episode duration: 632.528341 ms

EXECUTING EXPERIMENT 1 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 2 experiments: 1000.0 (std = 0.0).


[2017-07-24 17:35:49,317] Making new env: InvertedPendulum-v1


Average episode duration: 571.694357 ms

EXECUTING EXPERIMENT 2 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 3 experiments: 1000.0 (std = 0.0).
Average episode duration: 548.681645 ms
Average final reward: 1000.00 (std=0.00).

The 100-episode moving average reached 930 after 829 episodes.
Problem: InvertedPendulum-v1, Actions: 101, Temp: 0.01, Strategy: epsilon, Backup: sparsebellman


[2017-07-24 18:12:29,427] Making new env: InvertedPendulum-v1
[2017-07-24 18:12:29,430] Making new env: InvertedPendulum-v1



EXECUTING EXPERIMENT 0 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Final mean reward, averaged over 1 experiment: 65.1212121212 (std = 0.0).


[2017-07-24 18:25:29,374] Making new env: InvertedPendulum-v1


Average episode duration: 194.356807 ms

EXECUTING EXPERIMENT 1 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Final mean reward, averaged over 2 experiments: 37.202020202 (std = 27.9191919192).


[2017-07-24 18:45:25,642] Making new env: InvertedPendulum-v1


Average episode duration: 298.467515 ms

EXECUTING EXPERIMENT 2 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Final mean reward, averaged over 3 experiments: 125.575757576 (std = 127.041288957).
Average episode duration: 232.895006 ms
Average final reward: 124.64 (std=122.56).

The 100-episode moving average reached 930 after 0 episodes.
Problem: InvertedPendulum-v1, Actions: 101, Temp: 0.01, Strategy: epsilon, Backup: softbellman


[2017-07-24 19:01:03,098] Making new env: InvertedPendulum-v1
[2017-07-24 19:01:03,122] Making new env: InvertedPendulum-v1



EXECUTING EXPERIMENT 0 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 1 experiment: 1000.0 (std = 0.0).


[2017-07-24 19:46:37,563] Making new env: InvertedPendulum-v1


Average episode duration: 683.033174 ms

EXECUTING EXPERIMENT 1 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Final mean reward, averaged over 2 experiments: 554.792929293 (std = 445.207070707).


[2017-07-24 19:56:08,298] Making new env: InvertedPendulum-v1


Average episode duration: 142.102999 ms

EXECUTING EXPERIMENT 2 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 3 experiments: 652.471380471 (std = 388.872365763).
Average episode duration: 549.093865 ms
Average final reward: 652.80 (std=104.24).

The 100-episode moving average reached 930 after 0 episodes.
Problem: InvertedPendulum-v1, Actions: 101, Temp: 0.01, Strategy: epsilon, Backup: bellman


[2017-07-24 20:32:50,120] Making new env: InvertedPendulum-v1
[2017-07-24 20:32:50,390] Making new env: InvertedPendulum-v1



EXECUTING EXPERIMENT 0 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 1 experiment: 1000.0 (std = 0.0).


[2017-07-24 21:09:11,660] Making new env: InvertedPendulum-v1


Average episode duration: 544.735649 ms

EXECUTING EXPERIMENT 1 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Final mean reward, averaged over 2 experiments: 514.828282828 (std = 485.171717172).


[2017-07-24 21:26:47,820] Making new env: InvertedPendulum-v1


Average episode duration: 263.381467 ms

EXECUTING EXPERIMENT 2 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 3 experiments: 676.552188552 (std = 457.424281669).
Average episode duration: 611.431502 ms
Average final reward: 676.55 (std=2.26).

The 100-episode moving average reached 930 after 3443 episodes.
Problem: InvertedPendulum-v1, Actions: 3, Temp: 1, Strategy: sparsemax, Backup: sparsebellman


[2017-07-24 22:07:39,108] Making new env: InvertedPendulum-v1
[2017-07-24 22:07:39,112] Making new env: InvertedPendulum-v1



EXECUTING EXPERIMENT 0 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 1 experiment: 1000.0 (std = 0.0).


[2017-07-24 22:45:28,516] Making new env: InvertedPendulum-v1


Average episode duration: 566.708476 ms

EXECUTING EXPERIMENT 1 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 2 experiments: 1000.0 (std = 0.0).


[2017-07-24 23:41:12,687] Making new env: InvertedPendulum-v1


Average episode duration: 835.379554 ms

EXECUTING EXPERIMENT 2 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 3 experiments: 1000.0 (std = 0.0).
Average episode duration: 1442.004391 ms
Average final reward: 1000.00 (std=0.00).

The 100-episode moving average reached 930 after 859 episodes.
Problem: InvertedPendulum-v1, Actions: 3, Temp: 1, Strategy: sparsemax, Backup: softbellman


[2017-07-25 01:17:26,296] Making new env: InvertedPendulum-v1
[2017-07-25 01:17:26,588] Making new env: InvertedPendulum-v1



EXECUTING EXPERIMENT 0 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 1 experiment: 1000.0 (std = 0.0).


[2017-07-25 01:56:48,381] Making new env: InvertedPendulum-v1


Average episode duration: 589.869990 ms

EXECUTING EXPERIMENT 1 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 2 experiments: 1000.0 (std = 0.0).


[2017-07-25 02:37:55,113] Making new env: InvertedPendulum-v1


Average episode duration: 616.063899 ms

EXECUTING EXPERIMENT 2 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 3 experiments: 1000.0 (std = 0.0).
Average episode duration: 681.113465 ms
Average final reward: 1000.00 (std=0.00).

The 100-episode moving average reached 930 after 750 episodes.
Problem: InvertedPendulum-v1, Actions: 3, Temp: 1, Strategy: sparsemax, Backup: bellman


[2017-07-25 03:23:25,109] Making new env: InvertedPendulum-v1
[2017-07-25 03:23:25,112] Making new env: InvertedPendulum-v1



EXECUTING EXPERIMENT 0 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 1 experiment: 1000.0 (std = 0.0).


[2017-07-25 03:59:34,240] Making new env: InvertedPendulum-v1


Average episode duration: 541.632711 ms

EXECUTING EXPERIMENT 1 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 2 experiments: 1000.0 (std = 0.0).


[2017-07-25 05:11:50,982] Making new env: InvertedPendulum-v1


Average episode duration: 1083.566625 ms

EXECUTING EXPERIMENT 2 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 3 experiments: 1000.0 (std = 0.0).
Average episode duration: 604.562364 ms
Average final reward: 1000.00 (std=0.00).

The 100-episode moving average reached 930 after 734 episodes.
Problem: InvertedPendulum-v1, Actions: 3, Temp: 1, Strategy: softmax, Backup: sparsebellman


[2017-07-25 05:52:15,189] Making new env: InvertedPendulum-v1
[2017-07-25 05:52:15,192] Making new env: InvertedPendulum-v1



EXECUTING EXPERIMENT 0 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Final mean reward, averaged over 1 experiment: 168.696969697 (std = 0.0).


[2017-07-25 06:11:34,164] Making new env: InvertedPendulum-v1


Average episode duration: 289.151535 ms

EXECUTING EXPERIMENT 1 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Final mean reward, averaged over 2 experiments: 510.631313131 (std = 341.934343434).


[2017-07-25 07:40:59,449] Making new env: InvertedPendulum-v1


Average episode duration: 1340.694888 ms

EXECUTING EXPERIMENT 2 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 3 experiments: 673.754208754 (std = 362.166013912).
Average episode duration: 479.672783 ms
Average final reward: 673.89 (std=131.25).

The 100-episode moving average reached 930 after 0 episodes.
Problem: InvertedPendulum-v1, Actions: 3, Temp: 1, Strategy: softmax, Backup: softbellman


[2017-07-25 08:13:03,647] Making new env: InvertedPendulum-v1
[2017-07-25 08:13:03,651] Making new env: InvertedPendulum-v1



EXECUTING EXPERIMENT 0 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Final mean reward, averaged over 1 experiment: 824.363636364 (std = 0.0).


[2017-07-25 09:33:36,173] Making new env: InvertedPendulum-v1


Average episode duration: 1207.526087 ms

EXECUTING EXPERIMENT 1 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 2 experiments: 912.181818182 (std = 87.8181818182).


[2017-07-25 10:07:17,327] Making new env: InvertedPendulum-v1


Average episode duration: 504.676052 ms

EXECUTING EXPERIMENT 2 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 3 experiments: 941.454545455 (std = 82.7957758335).
Average episode duration: 496.973616 ms
Average final reward: 939.60 (std=106.05).

The 100-episode moving average reached 930 after 1210 episodes.
Problem: InvertedPendulum-v1, Actions: 3, Temp: 1, Strategy: softmax, Backup: bellman


[2017-07-25 10:40:31,136] Making new env: InvertedPendulum-v1
[2017-07-25 10:40:31,233] Making new env: InvertedPendulum-v1



EXECUTING EXPERIMENT 0 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Final mean reward, averaged over 1 experiment: 6.68686868687 (std = 0.0).


[2017-07-25 10:55:37,541] Making new env: InvertedPendulum-v1


Average episode duration: 225.950637 ms

EXECUTING EXPERIMENT 1 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 2 experiments: 503.343434343 (std = 496.656565657).


[2017-07-25 11:19:48,556] Making new env: InvertedPendulum-v1


Average episode duration: 362.125809 ms

EXECUTING EXPERIMENT 2 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 3 experiments: 668.895622896 (std = 468.252300662).
Average episode duration: 522.212755 ms
Average final reward: 668.89 (std=1.51).

The 100-episode moving average reached 930 after 0 episodes.
Problem: InvertedPendulum-v1, Actions: 3, Temp: 1, Strategy: epsilon, Backup: sparsebellman


[2017-07-25 11:54:43,195] Making new env: InvertedPendulum-v1
[2017-07-25 11:54:43,199] Making new env: InvertedPendulum-v1



EXECUTING EXPERIMENT 0 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Final mean reward, averaged over 1 experiment: 6.48484848485 (std = 0.0).


[2017-07-25 12:16:04,988] Making new env: InvertedPendulum-v1


Average episode duration: 319.749247 ms

EXECUTING EXPERIMENT 1 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Final mean reward, averaged over 2 experiments: 221.545454545 (std = 215.060606061).


[2017-07-25 12:28:28,182] Making new env: InvertedPendulum-v1


Average episode duration: 185.110517 ms

EXECUTING EXPERIMENT 2 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 3 experiments: 481.03030303 (std = 406.815456982).
Average episode duration: 632.585210 ms
Average final reward: 479.79 (std=157.61).

The 100-episode moving average reached 930 after 0 episodes.
Problem: InvertedPendulum-v1, Actions: 3, Temp: 1, Strategy: epsilon, Backup: softbellman


[2017-07-25 13:10:45,335] Making new env: InvertedPendulum-v1
[2017-07-25 13:10:47,143] Making new env: InvertedPendulum-v1



EXECUTING EXPERIMENT 0 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 1 experiment: 1000.0 (std = 0.0).


[2017-07-25 13:36:45,990] Making new env: InvertedPendulum-v1


Average episode duration: 389.031556 ms

EXECUTING EXPERIMENT 1 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 2 experiments: 1000.0 (std = 0.0).


[2017-07-25 14:21:24,667] Making new env: InvertedPendulum-v1


Average episode duration: 668.971063 ms

EXECUTING EXPERIMENT 2 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Final mean reward, averaged over 3 experiments: 939.707070707 (std = 85.2670783213).
Average episode duration: 1112.137386 ms
Average final reward: 940.31 (std=103.54).

The 100-episode moving average reached 930 after 2821 episodes.
Problem: InvertedPendulum-v1, Actions: 3, Temp: 1, Strategy: epsilon, Backup: bellman


[2017-07-25 15:35:39,646] Making new env: InvertedPendulum-v1
[2017-07-25 15:35:39,650] Making new env: InvertedPendulum-v1



EXECUTING EXPERIMENT 0 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Final mean reward, averaged over 1 experiment: 4.0303030303 (std = 0.0).


[2017-07-25 15:44:27,299] Making new env: InvertedPendulum-v1


Average episode duration: 131.204873 ms

EXECUTING EXPERIMENT 1 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Final mean reward, averaged over 2 experiments: 6.93939393939 (std = 2.90909090909).


[2017-07-25 15:53:16,974] Making new env: InvertedPendulum-v1


Average episode duration: 131.732644 ms

EXECUTING EXPERIMENT 2 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 3 experiments: 332.595959596 (std = 460.554056962).
Average episode duration: 573.852670 ms
Average final reward: 332.66 (std=37.05).

The 100-episode moving average reached 930 after 0 episodes.
Problem: InvertedPendulum-v1, Actions: 3, Temp: 0.1, Strategy: sparsemax, Backup: sparsebellman


[2017-07-25 16:31:38,513] Making new env: InvertedPendulum-v1
[2017-07-25 16:31:38,518] Making new env: InvertedPendulum-v1



EXECUTING EXPERIMENT 0 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 1 experiment: 1000.0 (std = 0.0).


[2017-07-25 17:14:09,504] Making new env: InvertedPendulum-v1


Average episode duration: 636.957296 ms

EXECUTING EXPERIMENT 1 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 2 experiments: 1000.0 (std = 0.0).


[2017-07-25 17:55:40,521] Making new env: InvertedPendulum-v1


Average episode duration: 622.057316 ms

EXECUTING EXPERIMENT 2 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 3 experiments: 1000.0 (std = 0.0).
Average episode duration: 1088.977740 ms
Average final reward: 1000.00 (std=0.00).

The 100-episode moving average reached 930 after 902 episodes.
Problem: InvertedPendulum-v1, Actions: 3, Temp: 0.1, Strategy: sparsemax, Backup: softbellman


[2017-07-25 19:08:23,267] Making new env: InvertedPendulum-v1
[2017-07-25 19:08:23,271] Making new env: InvertedPendulum-v1



EXECUTING EXPERIMENT 0 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 1 experiment: 1000.0 (std = 0.0).


[2017-07-25 19:44:27,204] Making new env: InvertedPendulum-v1


Average episode duration: 540.367999 ms

EXECUTING EXPERIMENT 1 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 2 experiments: 1000.0 (std = 0.0).


[2017-07-25 21:07:10,045] Making new env: InvertedPendulum-v1


Average episode duration: 1240.005771 ms

EXECUTING EXPERIMENT 2 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 3 experiments: 1000.0 (std = 0.0).
Average episode duration: 555.703023 ms
Average final reward: 1000.00 (std=0.00).

The 100-episode moving average reached 930 after 650 episodes.
Problem: InvertedPendulum-v1, Actions: 3, Temp: 0.1, Strategy: sparsemax, Backup: bellman


[2017-07-25 21:44:20,587] Making new env: InvertedPendulum-v1
[2017-07-25 21:44:20,591] Making new env: InvertedPendulum-v1



EXECUTING EXPERIMENT 0 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 1 experiment: 1000.0 (std = 0.0).


[2017-07-25 22:22:16,661] Making new env: InvertedPendulum-v1


Average episode duration: 568.328236 ms

EXECUTING EXPERIMENT 1 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 2 experiments: 982.393939394 (std = 17.6060606061).


[2017-07-25 23:01:21,187] Making new env: InvertedPendulum-v1


Average episode duration: 585.314653 ms

EXECUTING EXPERIMENT 2 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 3 experiments: 980.04040404 (std = 14.7555810017).
Average episode duration: 525.675914 ms
Average final reward: 980.24 (std=66.20).

The 100-episode moving average reached 930 after 856 episodes.
Problem: InvertedPendulum-v1, Actions: 3, Temp: 0.1, Strategy: softmax, Backup: sparsebellman


[2017-07-25 23:36:30,262] Making new env: InvertedPendulum-v1
[2017-07-25 23:36:30,266] Making new env: InvertedPendulum-v1



EXECUTING EXPERIMENT 0 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 1 experiment: 1000.0 (std = 0.0).


[2017-07-26 00:11:50,710] Making new env: InvertedPendulum-v1


Average episode duration: 529.330557 ms

EXECUTING EXPERIMENT 1 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 2 experiments: 1000.0 (std = 0.0).


[2017-07-26 00:49:33,011] Making new env: InvertedPendulum-v1


Average episode duration: 564.756314 ms

EXECUTING EXPERIMENT 2 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 3 experiments: 687.98989899 (std = 441.248916446).
Average episode duration: 152.046791 ms
Average final reward: 688.04 (std=1.41).

The 100-episode moving average reached 930 after 788 episodes.
Problem: InvertedPendulum-v1, Actions: 3, Temp: 0.1, Strategy: softmax, Backup: softbellman


[2017-07-26 00:59:48,669] Making new env: InvertedPendulum-v1
[2017-07-26 00:59:48,673] Making new env: InvertedPendulum-v1



EXECUTING EXPERIMENT 0 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 1 experiment: 1000.0 (std = 0.0).


[2017-07-26 01:40:25,181] Making new env: InvertedPendulum-v1


Average episode duration: 608.364933 ms

EXECUTING EXPERIMENT 1 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 2 experiments: 1000.0 (std = 0.0).


[2017-07-26 02:19:29,866] Making new env: InvertedPendulum-v1


Average episode duration: 585.415817 ms

EXECUTING EXPERIMENT 2 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 3 experiments: 1000.0 (std = 0.0).
Average episode duration: 635.181205 ms
Average final reward: 1000.00 (std=0.00).

The 100-episode moving average reached 930 after 828 episodes.
Problem: InvertedPendulum-v1, Actions: 3, Temp: 0.1, Strategy: softmax, Backup: bellman


[2017-07-26 03:01:57,055] Making new env: InvertedPendulum-v1
[2017-07-26 03:01:57,059] Making new env: InvertedPendulum-v1



EXECUTING EXPERIMENT 0 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 1 experiment: 1000.0 (std = 0.0).


[2017-07-26 03:37:10,134] Making new env: InvertedPendulum-v1


Average episode duration: 527.484744 ms

EXECUTING EXPERIMENT 1 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 2 experiments: 1000.0 (std = 0.0).


[2017-07-26 04:15:20,268] Making new env: InvertedPendulum-v1


Average episode duration: 571.711660 ms

EXECUTING EXPERIMENT 2 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 3 experiments: 1000.0 (std = 0.0).
Average episode duration: 565.792246 ms
Average final reward: 1000.00 (std=0.00).

The 100-episode moving average reached 930 after 869 episodes.
Problem: InvertedPendulum-v1, Actions: 3, Temp: 0.1, Strategy: epsilon, Backup: sparsebellman


[2017-07-26 04:53:10,528] Making new env: InvertedPendulum-v1
[2017-07-26 04:53:10,532] Making new env: InvertedPendulum-v1



EXECUTING EXPERIMENT 0 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Final mean reward, averaged over 1 experiment: 4.26262626263 (std = 0.0).


[2017-07-26 05:08:51,060] Making new env: InvertedPendulum-v1


Average episode duration: 233.801916 ms

EXECUTING EXPERIMENT 1 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 2 experiments: 502.131313131 (std = 497.868686869).


[2017-07-26 05:35:37,683] Making new env: InvertedPendulum-v1


Average episode duration: 400.980581 ms

EXECUTING EXPERIMENT 2 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 3 experiments: 668.087542088 (std = 469.3950995).
Average episode duration: 589.386963 ms
Average final reward: 668.10 (std=0.51).

The 100-episode moving average reached 930 after 2471 episodes.
Problem: InvertedPendulum-v1, Actions: 3, Temp: 0.1, Strategy: epsilon, Backup: softbellman


[2017-07-26 06:15:02,974] Making new env: InvertedPendulum-v1
[2017-07-26 06:15:02,978] Making new env: InvertedPendulum-v1



EXECUTING EXPERIMENT 0 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 1 experiment: 1000.0 (std = 0.0).


[2017-07-26 06:52:01,630] Making new env: InvertedPendulum-v1


Average episode duration: 553.842607 ms

EXECUTING EXPERIMENT 1 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Final mean reward, averaged over 2 experiments: 505.828282828 (std = 494.171717172).


[2017-07-26 07:03:03,087] Making new env: InvertedPendulum-v1


Average episode duration: 164.628672 ms

EXECUTING EXPERIMENT 2 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Final mean reward, averaged over 3 experiments: 338.511784512 (std = 467.753580338).
Average episode duration: 229.543346 ms
Average final reward: 338.50 (std=4.85).

The 100-episode moving average reached 930 after 0 episodes.
Problem: InvertedPendulum-v1, Actions: 3, Temp: 0.1, Strategy: epsilon, Backup: bellman


[2017-07-26 07:18:28,104] Making new env: InvertedPendulum-v1
[2017-07-26 07:18:28,144] Making new env: InvertedPendulum-v1



EXECUTING EXPERIMENT 0 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 1 experiment: 1000.0 (std = 0.0).


[2017-07-26 07:45:10,008] Making new env: InvertedPendulum-v1


Average episode duration: 399.710423 ms

EXECUTING EXPERIMENT 1 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 2 experiments: 1000.0 (std = 0.0).


[2017-07-26 08:12:45,419] Making new env: InvertedPendulum-v1


Average episode duration: 413.191387 ms

EXECUTING EXPERIMENT 2 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 3 experiments: 1000.0 (std = 0.0).
Average episode duration: 575.318732 ms
Average final reward: 1000.00 (std=0.00).

The 100-episode moving average reached 930 after 2779 episodes.
Problem: InvertedPendulum-v1, Actions: 3, Temp: 0.01, Strategy: sparsemax, Backup: sparsebellman


[2017-07-26 08:51:13,163] Making new env: InvertedPendulum-v1
[2017-07-26 08:51:13,168] Making new env: InvertedPendulum-v1



EXECUTING EXPERIMENT 0 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 1 experiment: 1000.0 (std = 0.0).


[2017-07-26 09:31:25,015] Making new env: InvertedPendulum-v1


Average episode duration: 602.236567 ms

EXECUTING EXPERIMENT 1 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 2 experiments: 1000.0 (std = 0.0).


[2017-07-26 10:06:27,541] Making new env: InvertedPendulum-v1


Average episode duration: 524.829269 ms

EXECUTING EXPERIMENT 2 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 3 experiments: 1000.0 (std = 0.0).
Average episode duration: 563.529842 ms
Average final reward: 1000.00 (std=0.00).

The 100-episode moving average reached 930 after 721 episodes.
Problem: InvertedPendulum-v1, Actions: 3, Temp: 0.01, Strategy: sparsemax, Backup: softbellman


[2017-07-26 10:44:08,277] Making new env: InvertedPendulum-v1
[2017-07-26 10:44:08,281] Making new env: InvertedPendulum-v1



EXECUTING EXPERIMENT 0 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 1 experiment: 350.262626263 (std = 0.0).


[2017-07-26 11:04:21,019] Making new env: InvertedPendulum-v1


Average episode duration: 302.401549 ms

EXECUTING EXPERIMENT 1 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 2 experiments: 675.131313131 (std = 324.868686869).


[2017-07-26 11:38:24,509] Making new env: InvertedPendulum-v1


Average episode duration: 510.239726 ms

EXECUTING EXPERIMENT 2 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 3 experiments: 783.420875421 (std = 306.289135307).
Average episode duration: 692.512565 ms
Average final reward: 782.73 (std=122.12).

The 100-episode moving average reached 930 after 846 episodes.
Problem: InvertedPendulum-v1, Actions: 3, Temp: 0.01, Strategy: sparsemax, Backup: bellman


[2017-07-26 12:24:41,854] Making new env: InvertedPendulum-v1
[2017-07-26 12:24:41,858] Making new env: InvertedPendulum-v1



EXECUTING EXPERIMENT 0 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 1 experiment: 1000.0 (std = 0.0).


[2017-07-26 13:37:03,994] Making new env: InvertedPendulum-v1


Average episode duration: 1084.787669 ms

EXECUTING EXPERIMENT 1 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 2 experiments: 1000.0 (std = 0.0).


[2017-07-26 14:12:19,017] Making new env: InvertedPendulum-v1


Average episode duration: 528.073458 ms

EXECUTING EXPERIMENT 2 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 3 experiments: 1000.0 (std = 0.0).
Average episode duration: 539.740430 ms
Average final reward: 1000.00 (std=0.00).

The 100-episode moving average reached 930 after 740 episodes.
Problem: InvertedPendulum-v1, Actions: 3, Temp: 0.01, Strategy: softmax, Backup: sparsebellman


[2017-07-26 14:48:25,461] Making new env: InvertedPendulum-v1
[2017-07-26 14:48:25,465] Making new env: InvertedPendulum-v1



EXECUTING EXPERIMENT 0 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 1 experiment: 981.191919192 (std = 0.0).


[2017-07-26 15:24:12,008] Making new env: InvertedPendulum-v1


Average episode duration: 535.831057 ms

EXECUTING EXPERIMENT 1 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 2 experiments: 989.409090909 (std = 8.21717171717).


[2017-07-26 16:03:21,547] Making new env: InvertedPendulum-v1


Average episode duration: 586.716769 ms

EXECUTING EXPERIMENT 2 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 3 experiments: 992.939393939 (std = 8.36305484405).
Average episode duration: 598.894393 ms
Average final reward: 993.01 (std=44.03).

The 100-episode moving average reached 930 after 857 episodes.
Problem: InvertedPendulum-v1, Actions: 3, Temp: 0.01, Strategy: softmax, Backup: softbellman


[2017-07-26 16:43:22,861] Making new env: InvertedPendulum-v1
[2017-07-26 16:43:22,865] Making new env: InvertedPendulum-v1



EXECUTING EXPERIMENT 0 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 1 experiment: 1000.0 (std = 0.0).


[2017-07-26 17:23:58,220] Making new env: InvertedPendulum-v1


Average episode duration: 608.116826 ms

EXECUTING EXPERIMENT 1 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 2 experiments: 1000.0 (std = 0.0).


[2017-07-26 18:00:04,629] Making new env: InvertedPendulum-v1


Average episode duration: 540.872443 ms

EXECUTING EXPERIMENT 2 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 3 experiments: 1000.0 (std = 0.0).
Average episode duration: 601.354751 ms
Average final reward: 1000.00 (std=0.00).

The 100-episode moving average reached 930 after 598 episodes.
Problem: InvertedPendulum-v1, Actions: 3, Temp: 0.01, Strategy: softmax, Backup: bellman


[2017-07-26 18:40:17,448] Making new env: InvertedPendulum-v1
[2017-07-26 18:40:17,452] Making new env: InvertedPendulum-v1



EXECUTING EXPERIMENT 0 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 1 experiment: 1000.0 (std = 0.0).


[2017-07-26 19:36:11,675] Making new env: InvertedPendulum-v1


Average episode duration: 837.899505 ms

EXECUTING EXPERIMENT 1 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Final mean reward, averaged over 2 experiments: 966.070707071 (std = 33.9292929293).


[2017-07-26 21:13:46,245] Making new env: InvertedPendulum-v1


Average episode duration: 1462.812048 ms

EXECUTING EXPERIMENT 2 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 3 experiments: 977.38047138 (std = 31.9888441482).
Average episode duration: 536.063541 ms
Average final reward: 977.61 (std=67.45).

The 100-episode moving average reached 930 after 1005 episodes.
Problem: InvertedPendulum-v1, Actions: 3, Temp: 0.01, Strategy: epsilon, Backup: sparsebellman


[2017-07-26 21:49:37,708] Making new env: InvertedPendulum-v1
[2017-07-26 21:49:37,712] Making new env: InvertedPendulum-v1



EXECUTING EXPERIMENT 0 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 1 experiment: 1000.0 (std = 0.0).


[2017-07-26 22:23:39,279] Making new env: InvertedPendulum-v1


Average episode duration: 509.645400 ms

EXECUTING EXPERIMENT 1 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 2 experiments: 1000.0 (std = 0.0).


[2017-07-26 22:50:37,615] Making new env: InvertedPendulum-v1


Average episode duration: 403.894908 ms

EXECUTING EXPERIMENT 2 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 3 experiments: 1000.0 (std = 0.0).
Average episode duration: 559.140167 ms
Average final reward: 1000.00 (std=0.00).

The 100-episode moving average reached 930 after 3655 episodes.
Problem: InvertedPendulum-v1, Actions: 3, Temp: 0.01, Strategy: epsilon, Backup: softbellman


[2017-07-26 23:28:01,198] Making new env: InvertedPendulum-v1
[2017-07-26 23:28:01,203] Making new env: InvertedPendulum-v1



EXECUTING EXPERIMENT 0 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Final mean reward, averaged over 1 experiment: 648.494949495 (std = 0.0).


[2017-07-27 00:01:08,431] Making new env: InvertedPendulum-v1


Average episode duration: 496.064020 ms

EXECUTING EXPERIMENT 1 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Final mean reward, averaged over 2 experiments: 328.585858586 (std = 319.909090909).


[2017-07-27 00:07:53,824] Making new env: InvertedPendulum-v1


Average episode duration: 100.628298 ms

EXECUTING EXPERIMENT 2 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 3 experiments: 552.390572391 (std = 410.371763332).
Average episode duration: 656.298435 ms
Average final reward: 551.52 (std=114.50).

The 100-episode moving average reached 930 after 0 episodes.
Problem: InvertedPendulum-v1, Actions: 3, Temp: 0.01, Strategy: epsilon, Backup: bellman


[2017-07-27 00:51:46,586] Making new env: InvertedPendulum-v1
[2017-07-27 00:51:46,590] Making new env: InvertedPendulum-v1



EXECUTING EXPERIMENT 0 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Final mean reward, averaged over 1 experiment: 66.9292929293 (std = 0.0).


[2017-07-27 01:06:49,039] Making new env: InvertedPendulum-v1


Average episode duration: 224.896041 ms

EXECUTING EXPERIMENT 1 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Final mean reward, averaged over 2 experiments: 195.01010101 (std = 128.080808081).


[2017-07-27 01:30:51,571] Making new env: InvertedPendulum-v1


Average episode duration: 359.827801 ms

EXECUTING EXPERIMENT 2 OF 3 IN ENVIRONMENT InvertedPendulum-v1.
Minimum average reward reached. Stop training and exploration.
Final mean reward, averaged over 3 experiments: 463.34006734 (std = 393.622158836).
Average episode duration: 635.741909 ms
Average final reward: 462.58 (std=119.72).

The 100-episode moving average reached 930 after 0 episodes.
InvertedPendulum-v1 is finished and is saved
