# Deep Deterministic Policy Gradient
### This notebook contains the code for creating a DDPG agent for solving the environment Reacher

#### Logic

This notebook follows the repository https://github.com/enginBozkurt/Deep-Reinforcement-Learning-for-Enterprise-Nanodegree/tree/master/Project%202%20-Continuous%20Control and reuses parts of my DQN Pong code \
The idea is as follows:
1) Create the actor, actor target, critic & critic target models \
2) At every step of an episode, pass the state to the actor to predict an action \
3) Pass the action to env.step() to obtain next_state, reward & terminate \
4) Keep adding these experiences to memory until it reaches a set length \
5) From then on, also update the model weights at every step by randomly sampling batch_size experiences from the agent's memory

When the mean reward of the last 100 episodes rises above the threshold, training stops and the model weights can be saved

#### Import modules

Please download models.py & memory.py into the same directory as this notebook before importing
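
memory.py is not reproduced in this notebook. As a rough guide to what the code below expects from it, here is a minimal sketch of a replay buffer with the same interface: append(), len() & sample() returning states, actions, rewards, dones, next_states. The class name ReplaySketch, the Experience field names and the internals are assumptions inferred from how the buffer is used here, not the actual contents of memory.py. The real buffer presumably returns NumPy arrays, while this sketch returns plain lists to stay dependency-free.

```python
import random
from collections import deque, namedtuple

# Field names are an assumption, chosen to match the order used in this notebook
Experience = namedtuple("Experience", ["state", "action", "reward", "done", "new_state"])

class ReplaySketch:
    """Hypothetical stand-in for ExperienceReplay in memory.py."""

    def __init__(self, capacity):
        # A deque with maxlen silently drops the oldest item once it is full
        self.buffer = deque(maxlen=capacity)

    def __len__(self):
        return len(self.buffer)

    def append(self, experience):
        self.buffer.append(experience)

    def sample(self, batch_size):
        # Uniform sampling without replacement, then regroup field by field
        batch = random.sample(list(self.buffer), batch_size)
        states, actions, rewards, dones, next_states = zip(*batch)
        return list(states), list(actions), list(rewards), list(dones), list(next_states)

# Fill a tiny buffer past its capacity to show that old experiences are discarded
demo = ReplaySketch(capacity=3)
for i in range(5):
    demo.append(Experience([float(i)], [0.0], -1.0, False, [float(i + 1)]))
```

With capacity=3 and five appends, only the three most recent experiences remain, which is the behavior max_memory_size relies on.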

In [1]:
import gym
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import numpy as np
from tqdm import tqdm
from models import *
from memory import *

We define our hyperparameters here:
1) gamma is the discount factor applied to the value of the next state. We use it in the equation q_target = reward + (1-done) * q_next * gamma to obtain the target action values for model training \
2) tau controls the gradual update of the target model weights. We use it in the equation target_weights = (1-tau) * target_weights + tau * model_weights \
3) num_states refers to the number of state variables. More details can be found here[https://www.gymlibrary.ml/environments/mujoco/reacher/] \
4) num_output refers to the output size of the critic model. The critic outputs a single action value for every state-action pair, so its input size is num_states + num_actions \
5) min_memory_len is the minimum number of stored experiences required before model training is allowed \
6) max_memory_size refers to the maximum length of the buffer deque. A deque holds only the latest max_memory_size experiences, with older ones discarded. Here, we set it equal to min_memory_len \
7) epsilon_start is the initial value of epsilon. At every step, we decay epsilon using the equation epsilon = max(epsilon_min, epsilon_decay * epsilon). This gradually reduces the share of exploratory actions \
8) We set epsilon_min to 0.03 to keep a small chance of taking an exploratory action towards the end of training \
9) We also define learning rates for both actor & critic models \
10) Finally, set device to GPU if one is available to speed up training. Else, use CPU
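
To make the three equations above concrete, here is a small worked sketch in plain Python. The helper names (td_target, soft_update, decay) and the input numbers are illustrative assumptions, not values from a training run. Only gamma, tau and the epsilon settings match this notebook.

```python
import math

gamma, tau = 0.99, 0.01
epsilon_start, epsilon_decay, epsilon_min = 1.0, 0.99999, 0.03

def td_target(reward, done, q_next):
    # done = 1 removes the bootstrap term for terminal transitions
    return reward + (1 - done) * q_next * gamma

def soft_update(target_weight, model_weight):
    # The target moves a fraction tau of the way towards the model weight
    return (1 - tau) * target_weight + tau * model_weight

def decay(epsilon):
    # Multiplicative decay with a floor at epsilon_min
    return max(epsilon_min, epsilon_decay * epsilon)

# Illustrative numbers only
print(td_target(reward=-0.5, done=0, q_next=-10.0))      # reward plus discounted next value
print(td_target(reward=-0.5, done=1, q_next=-10.0))      # reward only
print(soft_update(target_weight=1.0, model_weight=0.0))  # moves 1% of the way
print(decay(epsilon_start))

# Smallest n with epsilon_start * epsilon_decay**n <= epsilon_min
steps_to_floor = math.ceil(math.log(epsilon_min / epsilon_start) / math.log(epsilon_decay))
print(steps_to_floor)
```

With these settings epsilon needs roughly 350,000 environment steps to reach its floor, so exploration stays high for a large part of training.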

In [2]:
# episodes = 10000
gamma = 0.99
tau = 0.01
num_states = 11
num_output = 1
hidden_size = 256
batch_size = 128
num_actions = 2
min_memory_len = max_memory_size = 50000
epsilon_start = 1.0
epsilon_decay = 0.99999
epsilon_min = 0.03
critic_learning_rate = 0.001
actor_learning_rate = 0.0001
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')



#### Define classes 

We create class DDPG to define our agent behavior: \
&nbsp; a) __init__() initialises the parameters used for agent training. It assigns the input env to self.env, stores the experience buffer as self.buffer & calls _reset(), which also sets total_reward to 0 \
&nbsp; b) _reset() resets env & total_reward \
&nbsp; c) select_optimal_action() is used to observe the agent's behavior after training. The agent takes the action predicted by the actor, which is trained to maximise the action value at every state \
&nbsp; d) select_action() draws a uniform random number. If it is < epsilon, we take a random action. Else, we act greedily using the actor \
&nbsp; e) get_experience() adds replay experiences to agent memory. Logic is as follows: \
&emsp; 1) Create episode_reward variable & set as None \
&emsp; 2) Select next action by passing actor_model, epsilon & device to select_action() \
&emsp; 3) Pass the action to env.step() to obtain next_state, reward, terminate & info \
&emsp; 4) Create Experience namedtuple using self.state, action, reward, terminate, next_state \
&emsp; 5) Append Experience to experience buffer \
&emsp; 6) Set state as next_state, increment total_reward \
&emsp; 7) IF next state is terminal, set episode_reward as total_reward. Print episode_reward. Call _reset() method, return True with episode_reward \
&emsp; 8) ELSE, IF we have collected enough memory, call update_weights(), which also calls update_target_weights() \
&emsp; 9) Return False & episode_reward[with None value] \
&nbsp; f) update_target_weights() implements the gradual update of the target model weights towards the current model weights using the equation target_weights = (1-tau) * target_weights + tau * model_weights. Since this is a 'soft' update, we can run it after every call to update_weights() \
&nbsp; g) update_weights() is used to update model weights. Logic is as follows: \
&emsp; 1) Call sample method of experience buffer. Assign result as batch variable \
&emsp; 2) Assign resulting arrays to states, actions, rewards, dones, next_states \
&emsp; 3) Create tensors states_t, next_states_t, actions_t, rewards_t & done_mask \
&emsp; 4) Pass states_t, actions_t to critic_model to obtain action value prediction[q_values] \
&emsp; 5) Pass next_states_t to actor_target to obtain next state action[next_actions] \
&emsp; 6) Pass next_states_t, next_actions to critic_target to obtain next state action value[q_next] \
&emsp; 7) Calculate q_targets using q_next * gamma * (1-done_mask) + rewards_t \
&emsp; 8) Calculate critic_loss using nn.MSELoss()(q_values, q_targets), do backward propagation & step the critic optimizer \
&emsp; 9) Calculate actor_loss using -critic_model(states_t, actor_model(states_t)).mean(), do backward propagation & step the actor optimizer \
&emsp; 10) Update weights of target models
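
The epsilon-greedy rule in d) can be sketched without any deep learning library. This is an illustration only: select_action_sketch and toy_policy are hypothetical names, toy_policy stands in for the actor network, and the action bounds of -1 to 1 are an assumption about the environment rather than something read from env.action_space.

```python
import random

def select_action_sketch(policy, state, epsilon, low, high, num_actions, rng=random):
    """Epsilon-greedy choice for a continuous action space (illustrative only)."""
    if rng.random() < epsilon:
        # Explore: sample each action component uniformly within its bounds
        return [rng.uniform(low, high) for _ in range(num_actions)]
    # Exploit: defer to the deterministic policy
    return policy(state)

def toy_policy(state):
    # Hypothetical stand-in for the actor network
    return [0.5, -0.5]

rng = random.Random(0)
state = [0.0] * 11  # same length as num_states in this notebook

# epsilon = 0 always exploits, epsilon = 1 always explores
greedy = select_action_sketch(toy_policy, state, epsilon=0.0, low=-1.0, high=1.0, num_actions=2, rng=rng)
explore = select_action_sketch(toy_policy, state, epsilon=1.0, low=-1.0, high=1.0, num_actions=2, rng=rng)
print(greedy, explore)
```

In the notebook itself the random branch calls env.action_space.sample(), which plays the same role as the uniform draw here.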

In [3]:
class DDPG:
    # Relies on notebook-level globals: gamma, tau, batch_size, min_memory_len,
    # device, actor_optimizer & critic_optimizer
    def __init__(self, env, buffer):
        self.env = env
        self.buffer = buffer
        self._reset()
        
    def _reset(self):
        # Use the agent's own env rather than the global one
        self.state = self.env.reset()
        self.total_reward = 0
        
    def select_optimal_action(self, actor_model, device):
        # Greedy action from the actor, with no exploration
        state = torch.tensor(self.state).float().unsqueeze(0).to(device)
        action = actor_model(state).cpu().detach().numpy()[0]
        return action
        
    def select_action(self, actor_model, epsilon, device):
        # Epsilon-greedy: random action with probability epsilon, else the actor's action
        if np.random.random() < epsilon:
            action = self.env.action_space.sample()
        else:
            action = self.select_optimal_action(actor_model, device)
        return action
    
    def get_experience(self, actor_model, actor_target, critic_model, critic_target, epsilon, device):
        episode_reward = None
        action = self.select_action(actor_model, epsilon, device)
        next_state, reward, terminate, info = self.env.step(action)
        exp = Experience(self.state,action,reward,terminate,next_state)
        self.buffer.append(exp)
        self.state = next_state
        self.total_reward += reward
        
        if terminate:
            episode_reward = self.total_reward
            print(f"Score {episode_reward}")
            self._reset()
            return True, episode_reward
        
        # >= keeps training enabled even if max_memory_size is later set above min_memory_len
        if len(self.buffer) >= min_memory_len:
            self.update_weights(actor_model,actor_target,critic_model,critic_target) 
            
        return False, episode_reward    
        
    def update_target_weights(self, actor_model, actor_target, critic_model, critic_target):
        # Soft update: target = tau * model + (1 - tau) * target
        for target_param, param in zip(actor_target.parameters(), actor_model.parameters()):
            target_param.data.copy_(param.data * tau + target_param.data * (1.0 - tau))
       
        for target_param, param in zip(critic_target.parameters(), critic_model.parameters()):
            target_param.data.copy_(param.data * tau + target_param.data * (1.0 - tau))
        
    def update_weights(self, actor_model, actor_target, critic_model, critic_target):
        batch = self.buffer.sample(batch_size)
        states, actions, rewards, dones, next_states = batch
        
        # Cast everything to float32 so that (1 - done_mask) & the losses share one dtype
        states_t = torch.as_tensor(states, dtype=torch.float32, device=device)
        next_states_t = torch.as_tensor(next_states, dtype=torch.float32, device=device)
        actions_t = torch.as_tensor(actions, dtype=torch.float32, device=device)
        rewards_t = torch.as_tensor(rewards, dtype=torch.float32, device=device)
        done_mask = torch.as_tensor(dones, dtype=torch.float32, device=device)
        
        # Critic update: regress Q(s, a) towards the TD target built from the target networks
        q_values = critic_model(states_t,actions_t)
        with torch.no_grad():
            next_actions = actor_target(next_states_t)
            q_next = critic_target(next_states_t,next_actions)
            q_targets = q_next*gamma*(1-done_mask.unsqueeze(1)) + rewards_t.unsqueeze(1)
        
        critic_loss = nn.MSELoss()(q_values, q_targets)
        critic_optimizer.zero_grad()
        critic_loss.backward() 
        critic_optimizer.step()

        # Actor update: increase the critic's value of the actor's own actions
        actor_loss = -critic_model(states_t, actor_model(states_t)).mean()
        actor_optimizer.zero_grad()
        actor_loss.backward()
        actor_optimizer.step()
  
        self.update_target_weights(actor_model, actor_target, critic_model, critic_target)

#### Train agent

We start by creating env using gym.make() \
Next, we create our actor, actor_target, critic & critic_target models, with optimizers for the actor & critic \
Then, we create the experience buffer using ExperienceReplay() & the agent using DDPG() \
Finally, we set epsilon to epsilon_start & create an empty list episode_rewards to hold each episode's reward

In [4]:
env = gym.make('Reacher-v4')

actor = Actor(num_states, hidden_size, num_actions).to(device)
actor_target = copy.deepcopy(actor).to(device)
actor_optimizer = optim.Adam(actor.parameters(), lr=actor_learning_rate)
critic = Critic(num_states + num_actions, hidden_size, num_output).to(device)
critic_target = copy.deepcopy(critic).to(device)
critic_optimizer = optim.Adam(critic.parameters(), lr=critic_learning_rate)

buffer = ExperienceReplay(max_memory_size)
agent = DDPG(env, buffer)
epsilon = epsilon_start
episode_rewards = []



Let's begin agent training \
Until the mean reward of the previous 100 episodes rises above -6, we repeat the following:
1) Set terminate to False \
2) While NOT terminate: \
&nbsp; a) Set epsilon as max(epsilon*epsilon_decay,epsilon_min) \
&nbsp; b) Call get_experience() to obtain terminate & reward \
&nbsp; c) Render env \
&nbsp; d) IF terminate, append reward to episode_rewards & print mean reward of last 100 episodes \
&nbsp; e) IF terminate and the buffer has reached its minimum length, print 'model & target weights updated' 

Once the required mean reward has been attained, print 'training complete', then reset & close env
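
The stopping rule can be checked in isolation. The sketch below uses a deque with maxlen=100 instead of slicing a growing list, which is an alternative to what the training cell does rather than a copy of it. The helper name should_stop and the reward values are made up for illustration.

```python
from collections import deque

def should_stop(recent_rewards, threshold=-6.0):
    """True once the mean of the stored episode rewards exceeds the threshold."""
    return sum(recent_rewards) / len(recent_rewards) > threshold

# maxlen=100 keeps only the latest 100 episode rewards
recent = deque(maxlen=100)

# Made-up rewards: 100 poor episodes followed by 100 good ones
for episode_reward in [-40.0] * 100 + [-5.0] * 100:
    recent.append(episode_reward)

print(len(recent), should_stop(recent))
```

As in the training cell, the mean is taken over fewer than 100 episodes until 100 have been collected.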

In [5]:
# for episode in tqdm(range(episodes)):
while True:    
    terminate = False
    while not terminate:
        epsilon = max(epsilon*epsilon_decay,epsilon_min)
        terminate, reward = agent.get_experience(actor, actor_target, critic, critic_target, epsilon, device)
        agent.env.render()
        if terminate:
            episode_rewards.append(reward)
            mean_reward = round(np.mean(episode_rewards[-100:]),3)
            print(f"mean reward: {mean_reward}")
            if len(buffer) >= min_memory_len:
                print('model & target weights updated')
    if mean_reward > -6:
        print('training complete')
        break
        
env.reset()        
env.close() 

Score -38.96658240094558
mean reward: -38.967
Score -40.94329835069381
mean reward: -39.955
Score -41.255608473397224
mean reward: -40.388
Score -49.063982970467926
mean reward: -42.557
Score -43.004815843302254
mean reward: -42.647
Score -47.267952401844745
mean reward: -43.417
Score -43.367765819878045
mean reward: -43.41
Score -39.4344730452899
mean reward: -42.913
Score -38.722006072173166
mean reward: -42.447
Score -39.01721314659832
mean reward: -42.104
Score -46.0580254438627
mean reward: -42.464
Score -42.20276318372339
mean reward: -42.442
Score -52.32699440973057
mean reward: -43.202
Score -43.667407139906906
mean reward: -43.236
Score -37.13143326742037
mean reward: -42.829
Score -40.93063839194385
mean reward: -42.71
Score -44.675360791873665
mean reward: -42.826
Score -43.65017664329201
mean reward: -42.871
Score -39.025139993372306
mean reward: -42.669
Score -40.822530099564254
mean reward: -42.577
Score -51.20067311189741
mean reward: -42.987
Score -42.03935432062587

Score -49.19599225216037
mean reward: -41.296
Score -40.65558161872294
mean reward: -41.311
Score -37.53130835798348
mean reward: -41.194
Score -45.533395722954715
mean reward: -41.16
Score -42.49736324616597
mean reward: -41.219
Score -39.89961927785977
mean reward: -41.158
Score -36.68075714018205
mean reward: -41.143
Score -41.69921728425272
mean reward: -41.165
Score -38.70222910413904
mean reward: -41.14
Score -34.4978495013589
mean reward: -41.09
Score -44.98688583175597
mean reward: -41.144
Score -48.01505631028271
mean reward: -41.215
Score -35.04359819601353
mean reward: -41.167
Score -38.23774435253906
mean reward: -41.167
Score -41.719775880862315
mean reward: -41.159
Score -46.916213780331795
mean reward: -41.21
Score -43.32855055004077
mean reward: -41.281
Score -38.70625572137947
mean reward: -41.269
Score -38.27957530533739
mean reward: -41.176
Score -42.93711404250696
mean reward: -41.189
Score -42.7444404785956
mean reward: -41.172
Score -38.14632342729657

Score -42.33225041034887
mean reward: -39.754
Score -39.30646516824249
mean reward: -39.743
Score -50.40344555170881
mean reward: -39.824
Score -36.01565591386011
mean reward: -39.799
Score -43.4427771947466
mean reward: -39.804
Score -40.16813686662595
mean reward: -39.759
Score -31.758271034604647
mean reward: -39.711
Score -32.192224529110916
mean reward: -39.621
Score -39.61018424030741
mean reward: -39.614
Score -38.25538040514176
mean reward: -39.606
Score -33.726130476961394
mean reward: -39.58
Score -41.14690631439518
mean reward: -39.603
Score -42.20871032854368
mean reward: -39.597
Score -41.15012855854732
mean reward: -39.618
Score -29.763147406664793
mean reward: -39.553
Score -44.57702199986803
mean reward: -39.594
Score -30.953703486994648
mean reward: -39.444
Score -42.14245383445147
mean reward: -39.385
Score -36.00504883164408
mean reward: -39.335
Score -40.05709901621753
mean reward: -39.305
Score -39.712227732449406
mean reward: -39.248
Score -45.62772870913254

Score -35.17352661337884
mean reward: -38.021
Score -40.02258861164556
mean reward: -37.936
Score -36.100739018231565
mean reward: -37.893
Score -34.70971295553368
mean reward: -37.844
Score -32.09549173310715
mean reward: -37.74
Score -33.78607701320145
mean reward: -37.694
Score -36.16521991631602
mean reward: -37.734
Score -35.57473612612683
mean reward: -37.705
Score -33.39922393408087
mean reward: -37.686
Score -39.762175050971486
mean reward: -37.744
Score -36.73438461488423
mean reward: -37.762
Score -37.48717533609892
mean reward: -37.755
Score -35.89979800529501
mean reward: -37.729
Score -38.26524179822266
mean reward: -37.778
Score -44.060116586640554
mean reward: -37.792
Score -37.063152710790916
mean reward: -37.718
Score -31.580775488971586
mean reward: -37.628
Score -42.35215082072381
mean reward: -37.682
Score -43.08764690227108
mean reward: -37.743
Score -42.36911735459932
mean reward: -37.825
Score -40.16006082556677
mean reward: -37.886
Score -38.32527842378033

Score -36.89736863694721
mean reward: -37.642
Score -36.43979393674748
mean reward: -37.679
Score -37.984152083063215
mean reward: -37.74
Score -46.75254077974085
mean reward: -37.911
Score -44.01581528575701
mean reward: -37.87
Score -28.351529113185027
mean reward: -37.82
Score -35.21482657775352
mean reward: -37.792
Score -39.6722184906726
mean reward: -37.77
Score -30.80262234317547
mean reward: -37.683
Score -43.28695971558177
mean reward: -37.784
Score -34.460497639612704
mean reward: -37.788
Score -39.15247723961584
mean reward: -37.805
Score -43.4649205995587
mean reward: -37.843
Score -40.06927016677559
mean reward: -37.889
Score -37.67467668856406
mean reward: -37.902
Score -34.03753736722644
mean reward: -37.871
Score -31.215660271170645
mean reward: -37.798
Score -36.05520901454495
mean reward: -37.723
Score -35.730617711582475
mean reward: -37.68
Score -33.31858604817714
mean reward: -37.661
Score -30.441716467400514
mean reward: -37.578
Score -29.98387056648311

Score -30.05317768072946
mean reward: -35.884
Score -34.71875073082998
mean reward: -35.805
Score -41.10142598020261
mean reward: -35.893
Score -37.020643762555956
mean reward: -35.766
Score -37.42581515326179
mean reward: -35.777
Score -38.85355999891468
mean reward: -35.83
Score -34.911114805229744
mean reward: -35.827
Score -38.51282348340215
mean reward: -35.677
Score -46.07524293870919
mean reward: -35.805
Score -43.15064025155985
mean reward: -35.89
Score -37.776764520794174
mean reward: -35.847
Score -36.7445909492774
mean reward: -35.922
Score -35.129113845406195
mean reward: -36.008
Score -39.419658057721286
mean reward: -36.046
Score -30.973555064076823
mean reward: -36.07
Score -34.079509511372436
mean reward: -36.106
Score -33.94774504755705
mean reward: -36.111
Score -35.35075989144034
mean reward: -36.183
Score -37.343598601311385
mean reward: -36.202
Score -31.712796376070663
mean reward: -36.176
Score -34.07383694867366
mean reward: -36.228
Score -43.33180983685808

Score -30.649436462985083
mean reward: -33.839
model & target weights updated
Score -28.910790424592694
mean reward: -33.701
model & target weights updated
Score -26.411227145429365
mean reward: -33.594
model & target weights updated
Score -37.291333918059614
mean reward: -33.668
model & target weights updated
Score -34.11488036420062
mean reward: -33.596
model & target weights updated
Score -22.233899895300233
mean reward: -33.429
model & target weights updated
Score -35.92132523603595
mean reward: -33.515
model & target weights updated
Score -25.475525924178655
mean reward: -33.405
model & target weights updated
Score -28.06118817006423
mean reward: -33.37
model & target weights updated
Score -31.92828938045896
mean reward: -33.302
model & target weights updated
Score -25.626862047829928
mean reward: -33.154
model & target weights updated
Score -33.925318618553774
mean reward: -33.015
model & target weights updated
Score -28.694115529183353
mean reward: -32.959
model & target weights updated

Score -36.06452509635348
mean reward: -28.48
model & target weights updated
Score -29.411186236004006
mean reward: -28.52
model & target weights updated
Score -20.18798375137274
mean reward: -28.441
model & target weights updated
Score -26.957496790882242
mean reward: -28.391
model & target weights updated
Score -29.913816505541956
mean reward: -28.434
model & target weights updated
Score -26.648677585203835
mean reward: -28.362
model & target weights updated
Score -23.19868515242846
mean reward: -28.307
model & target weights updated
Score -36.94081622154231
mean reward: -28.377
model & target weights updated
Score -31.28589826429398
mean reward: -28.39
model & target weights updated
Score -38.861637324755044
mean reward: -28.498
model & target weights updated
Score -23.706522656459747
mean reward: -28.528
model & target weights updated
Score -27.14958593044045
mean reward: -28.508
model & target weights updated
Score -21.62643719663883
mean reward: -28.428
model & target weights updated

Score -32.20070102362645
mean reward: -27.719
model & target weights updated
Score -31.61044154017339
mean reward: -27.666
model & target weights updated
Score -26.080321561488788
mean reward: -27.614
model & target weights updated
Score -28.520411877849597
mean reward: -27.51
model & target weights updated
Score -31.20215858829997
mean reward: -27.585
model & target weights updated
Score -28.547682498780613
mean reward: -27.599
model & target weights updated
Score -20.672613978162044
mean reward: -27.59
model & target weights updated
Score -20.641180057187267
mean reward: -27.59
model & target weights updated
Score -26.06853212952148
mean reward: -27.481
model & target weights updated
Score -33.57149407650908
mean reward: -27.538
model & target weights updated
Score -31.891016698891708
mean reward: -27.582
model & target weights updated
Score -24.760105616708685
mean reward: -27.638
model & target weights updated
Score -28.672874402380472
mean reward: -27.654
model & target weights updated

Score -28.17599312319105
mean reward: -26.425
model & target weights updated
Score -32.56609828266862
mean reward: -26.544
model & target weights updated
Score -30.217716509412387
mean reward: -26.586
model & target weights updated
Score -23.974884912488704
mean reward: -26.49
model & target weights updated
Score -23.454729176423875
mean reward: -26.406
model & target weights updated
Score -27.871262585318387
mean reward: -26.437
model & target weights updated
Score -21.39854636565213
mean reward: -26.364
model & target weights updated
Score -26.52546452113196
mean reward: -26.43
model & target weights updated
Score -17.825352187633793
mean reward: -26.411
model & target weights updated
Score -25.18346967654029
mean reward: -26.37
model & target weights updated
Score -29.44847769598441
mean reward: -26.323
model & target weights updated
Score -25.552958158373745
mean reward: -26.403
model & target weights updated
Score -25.691194153141524
mean reward: -26.378
model & target weights updated

Score -22.330389572923686
mean reward: -25.486
model & target weights updated
Score -24.443422981194967
mean reward: -25.465
model & target weights updated
Score -32.1851930158226
mean reward: -25.609
model & target weights updated
Score -23.226187117830694
mean reward: -25.589
model & target weights updated
Score -25.96754424209323
mean reward: -25.555
model & target weights updated
Score -19.630905504281483
mean reward: -25.495
model & target weights updated
Score -16.643233670333363
mean reward: -25.405
model & target weights updated
Score -23.921470781625548
mean reward: -25.4
model & target weights updated
Score -20.81090273666069
mean reward: -25.346
model & target weights updated
Score -29.361623660049684
mean reward: -25.395
model & target weights updated
Score -25.325534145891158
mean reward: -25.386
model & target weights updated
Score -32.880092917477725
mean reward: -25.517
model & target weights updated
Score -22.535397246398052
mean reward: -25.513
model & target weights updated

Score -23.129000960521772
mean reward: -24.741
model & target weights updated
Score -28.333142592473
mean reward: -24.785
model & target weights updated
Score -19.498507051439226
mean reward: -24.772
model & target weights updated
Score -31.958203332404924
mean reward: -24.798
model & target weights updated
Score -27.516437661987958
mean reward: -24.82
model & target weights updated
Score -21.61080500947405
mean reward: -24.707
model & target weights updated
Score -25.952748436190987
mean reward: -24.741
model & target weights updated
Score -24.86866610632296
mean reward: -24.743
model & target weights updated
Score -26.11738205519441
mean reward: -24.778
model & target weights updated
Score -33.84239171658905
mean reward: -24.874
model & target weights updated
Score -30.313561082419824
mean reward: -24.949
model & target weights updated
Score -18.596379055338808
mean reward: -24.959
model & target weights updated
Score -22.92450536926705
mean reward: -24.938
model & target weights updated

Score -19.386474687972274
mean reward: -23.164
model & target weights updated
Score -24.88797981347783
mean reward: -23.164
model & target weights updated
Score -21.093729022270487
mean reward: -23.114
model & target weights updated
Score -18.70475782348102
mean reward: -22.963
model & target weights updated
Score -22.277190695573786
mean reward: -22.882
model & target weights updated
Score -22.946700489454948
mean reward: -22.926
model & target weights updated
Score -20.904299924126267
mean reward: -22.906
model & target weights updated
Score -19.956694598262487
mean reward: -22.919
model & target weights updated
Score -19.774527796120008
mean reward: -22.855
model & target weights updated
Score -16.493990639182286
mean reward: -22.774
model & target weights updated
Score -27.10115914816833
mean reward: -22.783
model & target weights updated
Score -18.978489776873484
mean reward: -22.679
model & target weights updated
Score -16.297641169945955
mean reward: -22.664
model & target weights updated

Score -18.70684502967579
mean reward: -22.345
model & target weights updated
Score -23.817451340086745
mean reward: -22.383
model & target weights updated
Score -22.10532876014436
mean reward: -22.407
model & target weights updated
Score -25.357250203682977
mean reward: -22.495
model & target weights updated
Score -17.74107224037101
mean reward: -22.402
model & target weights updated
Score -20.421590511244005
mean reward: -22.416
model & target weights updated
Score -19.56478750166064
mean reward: -22.449
model & target weights updated
Score -27.635898359095286
mean reward: -22.551
model & target weights updated
Score -28.65625375765779
mean reward: -22.656
model & target weights updated
Score -28.98761479443768
mean reward: -22.814
model & target weights updated
Score -17.366159064081952
mean reward: -22.738
model & target weights updated
Score -24.329801767767478
mean reward: -22.768
model & target weights updated
Score -17.813467960451664
mean reward: -22.695
model & target weights updated

Score -27.25389509118063
mean reward: -21.066
model & target weights updated
Score -34.32028465859775
mean reward: -21.133
model & target weights updated
Score -26.734136929291587
mean reward: -21.113
model & target weights updated
Score -19.361710031864522
mean reward: -21.017
model & target weights updated
Score -19.950561821010215
mean reward: -21.043
model & target weights updated
Score -24.2338715100756
mean reward: -21.042
model & target weights updated
Score -17.023121432992745
mean reward: -21.034
model & target weights updated
Score -23.695102019726384
mean reward: -21.019
model & target weights updated
Score -13.426084502128514
mean reward: -20.887
model & target weights updated
Score -22.762923410268293
mean reward: -20.909
model & target weights updated
Score -20.147613904445123
mean reward: -20.952
model & target weights updated
Score -25.57163614875428
mean reward: -21.017
model & target weights updated
Score -18.698467123533913
mean reward: -20.956
model & target weights updated

Score -15.731964115931042
mean reward: -19.866
model & target weights updated
Score -23.798424169814787
mean reward: -19.867
model & target weights updated
Score -24.407028281064036
mean reward: -19.977
model & target weights updated
Score -19.099123542647213
mean reward: -19.94
model & target weights updated
Score -14.017086111907929
mean reward: -19.879
model & target weights updated
Score -14.05551231475735
mean reward: -19.764
model & target weights updated
Score -14.991592066317404
mean reward: -19.727
model & target weights updated
Score -20.495195572344663
mean reward: -19.695
model & target weights updated
Score -19.051598075162996
mean reward: -19.637
model & target weights updated
Score -14.237779112321032
mean reward: -19.534
model & target weights updated
Score -25.505746748407514
mean reward: -19.597
model & target weights updated
Score -21.682301760926418
mean reward: -19.546
model & target weights updated
Score -18.69805851453214
mean reward: -19.4
model & target weights updated

Score -20.357191211501917
mean reward: -20.346
model & target weights updated
Score -28.66654697612853
mean reward: -20.427
model & target weights updated
Score -17.97166733699076
mean reward: -20.416
model & target weights updated
Score -28.825193453186547
mean reward: -20.562
model & target weights updated
Score -17.77776529510907
mean reward: -20.485
model & target weights updated
Score -25.05386286971201
mean reward: -20.519
model & target weights updated
Score -16.51060506195561
mean reward: -20.497
model & target weights updated
Score -18.720564676981578
mean reward: -20.54
model & target weights updated
Score -15.435179298568482
mean reward: -20.404
model & target weights updated
Score -18.720254299624244
mean reward: -20.413
model & target weights updated
Score -18.925321650415157
mean reward: -20.385
model & target weights updated
Score -14.694436098453071
mean reward: -20.339
model & target weights updated
Score -24.898136765895337
mean reward: -20.248
model & target weights updated

Score -14.326078476517456
mean reward: -18.874
model & target weights updated
Score -19.757112560565258
mean reward: -18.884
model & target weights updated
Score -15.375374316039789
mean reward: -18.883
model & target weights updated
Score -18.520515139408403
mean reward: -18.881
model & target weights updated
Score -18.48854655177485
mean reward: -18.877
model & target weights updated
Score -18.0601330849051
mean reward: -18.911
model & target weights updated
Score -16.0053470875609
mean reward: -18.822
model & target weights updated
Score -17.33614397207851
mean reward: -18.762
model & target weights updated
Score -16.44614049579353
mean reward: -18.738
model & target weights updated
Score -17.38399807325328
mean reward: -18.743
model & target weights updated
Score -16.41011745982646
mean reward: -18.67
model & target weights updated
Score -24.811065408250197
mean reward: -18.697
model & target weights updated
Score -21.63158924147337
mean reward: -18.732
model & target weights updated

Score -20.24923423592463
mean reward: -18.786
model & target weights updated
Score -11.548389180438566
mean reward: -18.728
model & target weights updated
Score -22.903160068787674
mean reward: -18.792
model & target weights updated
Score -24.097699584360686
mean reward: -18.859
model & target weights updated
Score -29.50113787689841
mean reward: -18.99
model & target weights updated
Score -22.40025071983372
mean reward: -18.966
model & target weights updated
Score -17.987813761705677
mean reward: -18.93
model & target weights updated
Score -13.550722756735459
mean reward: -18.866
model & target weights updated
Score -18.428385043180214
mean reward: -18.928
model & target weights updated
Score -13.860369931264497
mean reward: -18.795
model & target weights updated
Score -17.717892020036313
mean reward: -18.822
model & target weights updated
Score -12.224351887989231
mean reward: -18.7
model & target weights updated
Score -17.069136704608574
mean reward: -18.651
model & target weights updated

Score -18.61636099168656
mean reward: -17.332
model & target weights updated
Score -15.307853631005848
mean reward: -17.35
model & target weights updated
Score -14.697936018798778
mean reward: -17.312
model & target weights updated
Score -18.13527474132224
mean reward: -17.355
model & target weights updated
Score -14.402619440831849
mean reward: -17.322
model & target weights updated
Score -14.977053349423242
mean reward: -17.35
model & target weights updated
Score -18.173453007408654
mean reward: -17.361
model & target weights updated
Score -17.535034514182662
mean reward: -17.384
model & target weights updated
Score -17.503281853743605
mean reward: -17.302
model & target weights updated
Score -12.209636675856705
mean reward: -17.185
model & target weights updated
Score -18.201111739183442
mean reward: -17.22
model & target weights updated
Score -19.07382620932758
mean reward: -17.214
model & target weights updated
Score -13.975313076300319
mean reward: -17.138
model & target weights updated

Score -12.714416678935967
mean reward: -17.506
model & target weights updated
Score -14.293518911253996
mean reward: -17.474
model & target weights updated
Score -19.67058372136802
mean reward: -17.495
model & target weights updated
Score -11.407537972870042
mean reward: -17.487
model & target weights updated
Score -12.907415185302415
mean reward: -17.434
model & target weights updated
Score -16.060073734657955
mean reward: -17.404
model & target weights updated
Score -15.401567521657785
mean reward: -17.418
model & target weights updated
Score -23.283941033893342
mean reward: -17.495
model & target weights updated
Score -20.272417146113227
mean reward: -17.538
model & target weights updated
Score -15.972953126074145
mean reward: -17.565
model & target weights updated
Score -16.367256362831533
mean reward: -17.567
model & target weights updated
Score -17.105901785497622
mean reward: -17.484
model & target weights updated
Score -17.29219437405136
mean reward: -17.473
model & target weights updated

Score -18.41006938307523
mean reward: -16.048
model & target weights updated
Score -17.19655760887729
mean reward: -15.988
model & target weights updated
Score -22.09385000955119
mean reward: -16.006
model & target weights updated
Score -13.102552916594506
mean reward: -15.977
model & target weights updated
Score -16.00748536651671
mean reward: -15.973
model & target weights updated
Score -21.48529060830511
mean reward: -16.017
model & target weights updated
Score -7.961307376494971
mean reward: -15.924
model & target weights updated
Score -15.450034067510593
mean reward: -15.894
model & target weights updated
Score -20.88507296378281
mean reward: -15.922
model & target weights updated
Score -14.495573909645065
mean reward: -15.863
model & target weights updated
Score -12.985612657361829
mean reward: -15.805
model & target weights updated
Score -20.100986755125707
mean reward: -15.85
model & target weights updated
Score -10.293238738668176
mean reward: -15.809
model & target weights updated

Score -17.06054824128451
mean reward: -16.0
model & target weights updated
Score -11.653742620806574
mean reward: -15.962
model & target weights updated
Score -23.368068718417767
mean reward: -15.987
model & target weights updated
Score -13.4244847622481
mean reward: -15.976
model & target weights updated
Score -13.67436413512006
mean reward: -15.983
model & target weights updated
Score -13.481646712424874
mean reward: -15.916
model & target weights updated
Score -13.356955761045405
mean reward: -15.947
model & target weights updated
Score -17.871595908163766
mean reward: -15.946
model & target weights updated
Score -11.285077902128434
mean reward: -15.929
model & target weights updated
Score -11.97521047790397
mean reward: -15.86
model & target weights updated
Score -14.357692299111436
mean reward: -15.861
model & target weights updated
Score -12.416698860693309
mean reward: -15.855
model & target weights updated
Score -17.4664214319605
mean reward: -15.829
model & target weights updated

Score -14.791806468277855
mean reward: -14.58
model & target weights updated
Score -15.474771860245895
mean reward: -14.556
model & target weights updated
Score -8.069398009975504
mean reward: -14.524
model & target weights updated
Score -16.4951790687141
mean reward: -14.569
model & target weights updated
Score -14.545160577180267
mean reward: -14.571
model & target weights updated
Score -17.834968709687658
mean reward: -14.626
model & target weights updated
Score -15.064670784474682
mean reward: -14.602
model & target weights updated
Score -20.575171959695403
mean reward: -14.686
model & target weights updated
Score -10.034091189818872
mean reward: -14.705
model & target weights updated
Score -13.209686089784794
mean reward: -14.7
model & target weights updated
Score -13.542644659374837
mean reward: -14.66
model & target weights updated
Score -17.87768698059807
mean reward: -14.713
model & target weights updated
Score -14.052004468972177
mean reward: -14.713
model & target weights updated

Score -27.920352443320425
mean reward: -15.546
model & target weights updated
Score -16.765037967204826
mean reward: -15.508
model & target weights updated
Score -16.263781653573332
mean reward: -15.57
model & target weights updated
Score -12.294312254887817
mean reward: -15.561
model & target weights updated
Score -11.745722524833543
mean reward: -15.543
model & target weights updated
Score -17.050085743832373
mean reward: -15.535
model & target weights updated
Score -15.232257171904832
mean reward: -15.546
model & target weights updated
Score -16.300052638801066
mean reward: -15.515
model & target weights updated
Score -12.16661495061527
mean reward: -15.537
model & target weights updated
Score -10.92311170829022
mean reward: -15.387
model & target weights updated
Score -13.231307614881059
mean reward: -15.391
model & target weights updated
Score -9.641386962056021
mean reward: -15.353
model & target weights updated
Score -8.192809867930327
mean reward: -15.251
model & target weights updated

Score -16.333565773018083
mean reward: -14.072
model & target weights updated
Score -11.536088849963365
mean reward: -14.025
model & target weights updated
Score -16.549068326590785
mean reward: -14.069
model & target weights updated
Score -7.194067773077063
mean reward: -14.031
model & target weights updated
Score -9.591943176698692
mean reward: -13.995
model & target weights updated
Score -20.25575054779135
mean reward: -14.101
model & target weights updated
Score -14.290313941021655
mean reward: -14.162
model & target weights updated
Score -9.746226967646836
mean reward: -14.129
model & target weights updated
Score -12.259746292843706
mean reward: -14.15
model & target weights updated
Score -10.986708042085935
mean reward: -14.115
model & target weights updated
Score -18.754573707189405
mean reward: -14.201
model & target weights updated
Score -15.429709731851622
mean reward: -14.169
model & target weights updated
Score -13.237148839086336
mean reward: -14.166
model & target weights updated

Score -15.737653889749508
mean reward: -13.656
model & target weights updated
Score -12.594593986953159
mean reward: -13.684
model & target weights updated
Score -12.307952446429269
mean reward: -13.685
model & target weights updated
Score -11.827514965564164
mean reward: -13.693
model & target weights updated
Score -12.310813647760362
mean reward: -13.629
model & target weights updated
Score -11.61710613817791
mean reward: -13.591
model & target weights updated
Score -10.514941069579727
mean reward: -13.564
model & target weights updated
Score -14.708433312784269
mean reward: -13.556
model & target weights updated
Score -7.9768713230675985
mean reward: -13.504
model & target weights updated
Score -16.99568940125124
mean reward: -13.527
model & target weights updated
Score -13.259713973558975
mean reward: -13.536
model & target weights updated
Score -13.995660465847038
mean reward: -13.576
model & target weights updated
Score -11.925804313002793
mean reward: -13.607
model & target weights updated

Score -16.496105102928684
mean reward: -13.335
model & target weights updated
Score -9.617451473925435
mean reward: -13.284
model & target weights updated
Score -16.399398786009478
mean reward: -13.368
model & target weights updated
Score -15.228588064663366
mean reward: -13.35
model & target weights updated
Score -12.419938795150367
mean reward: -13.342
model & target weights updated
Score -7.005828897091215
mean reward: -13.272
model & target weights updated
Score -12.14208914184336
mean reward: -13.274
model & target weights updated
Score -9.969282773023345
mean reward: -13.246
model & target weights updated
Score -13.387360263942199
mean reward: -13.202
model & target weights updated
Score -16.132974094638467
mean reward: -13.223
model & target weights updated
Score -14.656516452633584
mean reward: -13.274
model & target weights updated
Score -14.019416330558954
mean reward: -13.248
model & target weights updated
Score -18.86602904147723
mean reward: -13.28
model & target weights updated

Score -8.217144795771864
mean reward: -12.453
model & target weights updated
Score -19.45060226509986
mean reward: -12.548
model & target weights updated
Score -16.082229410218837
mean reward: -12.575
model & target weights updated
Score -9.593651813947703
mean reward: -12.509
model & target weights updated
Score -9.078915068817636
mean reward: -12.454
model & target weights updated
Score -12.71230507583986
mean reward: -12.44
model & target weights updated
Score -12.133771376186406
mean reward: -12.373
model & target weights updated
Score -10.50077461955568
mean reward: -12.362
model & target weights updated
Score -13.471730351892607
mean reward: -12.371
model & target weights updated
Score -9.86513243486172
mean reward: -12.348
model & target weights updated
Score -8.849828504539124
mean reward: -12.31
model & target weights updated
Score -10.265655933770956
mean reward: -12.318
model & target weights updated
Score -13.788386085872045
mean reward: -12.296
model & target weights updated

Score -6.495857839558287
mean reward: -12.024
model & target weights updated
Score -15.908909660227186
mean reward: -12.048
model & target weights updated
Score -9.986415079621665
mean reward: -12.049
model & target weights updated
Score -17.415883823240165
mean reward: -12.135
model & target weights updated
Score -7.233965014079479
mean reward: -12.105
model & target weights updated
Score -13.268375371366375
mean reward: -12.1
model & target weights updated
Score -19.721295930134737
mean reward: -12.171
model & target weights updated
Score -11.54820340449087
mean reward: -12.216
model & target weights updated
Score -19.034998824019386
mean reward: -12.269
model & target weights updated
Score -11.787581169241859
mean reward: -12.31
model & target weights updated
Score -17.15110835908153
mean reward: -12.344
model & target weights updated
Score -15.372918421653713
mean reward: -12.347
model & target weights updated
Score -18.571486121949807
mean reward: -12.438
model & target weights updated

Score -10.504162069630318
mean reward: -12.178
model & target weights updated
Score -6.3781828889865935
mean reward: -12.126
model & target weights updated
Score -15.121021051368777
mean reward: -12.087
model & target weights updated
Score -15.676933392292286
mean reward: -12.126
model & target weights updated
Score -12.703377137736327
mean reward: -12.082
model & target weights updated
Score -12.565270474020986
mean reward: -12.054
model & target weights updated
Score -10.855625960780827
mean reward: -11.976
model & target weights updated
Score -12.61697964231026
mean reward: -11.982
model & target weights updated
Score -17.564274626350688
mean reward: -12.022
model & target weights updated
Score -17.331732835071715
mean reward: -12.116
model & target weights updated
Score -13.36900949161125
mean reward: -12.116
model & target weights updated
Score -8.26729065717476
mean reward: -12.111
model & target weights updated
Score -15.377251769149597
mean reward: -12.141
model & target weights updated

Score -17.54317427174192
mean reward: -12.728
model & target weights updated
Score -6.417803848591413
mean reward: -12.666
model & target weights updated
Score -5.4706948675691045
mean reward: -12.545
model & target weights updated
Score -15.236008307987358
mean reward: -12.524
model & target weights updated
Score -11.243959526105202
mean reward: -12.503
model & target weights updated
Score -12.60388634727173
mean reward: -12.546
model & target weights updated
Score -8.326858422651542
mean reward: -12.476
model & target weights updated
Score -9.99045524279714
mean reward: -12.321
model & target weights updated
Score -9.004795959272808
mean reward: -12.348
model & target weights updated
Score -14.618648197806184
mean reward: -12.408
model & target weights updated
Score -13.209542213604887
mean reward: -12.46
model & target weights updated
Score -14.42201917183478
mean reward: -12.559
model & target weights updated
Score -11.45897802850011
mean reward: -12.59
model & target weights updated

Score -12.768829830750102
mean reward: -11.411
model & target weights updated
Score -9.202156501714219
mean reward: -11.403
model & target weights updated
Score -5.680501321589291
mean reward: -11.37
model & target weights updated
Score -17.051593374281484
mean reward: -11.394
model & target weights updated
Score -7.2315370938922605
mean reward: -11.335
model & target weights updated
Score -11.482314849027532
mean reward: -11.305
model & target weights updated
Score -5.983200035229383
mean reward: -11.251
model & target weights updated
Score -5.940738672536812
mean reward: -11.192
model & target weights updated
Score -10.375496869349812
mean reward: -11.083
model & target weights updated
Score -13.193731576795805
mean reward: -11.079
model & target weights updated
Score -7.543194080674593
mean reward: -10.963
model & target weights updated
Score -9.178846902469736
mean reward: -10.929
model & target weights updated
Score -6.598514253747866
mean reward: -10.894
model & target weights updated

Score -10.327366267211572
mean reward: -10.821
model & target weights updated
Score -15.333668534539362
mean reward: -10.914
model & target weights updated
Score -9.028316067709284
mean reward: -10.901
model & target weights updated
Score -8.361674929928927
mean reward: -10.853
model & target weights updated
Score -9.687551684032947
mean reward: -10.874
model & target weights updated
Score -9.550517000396566
mean reward: -10.878
model & target weights updated
Score -8.962180334552704
mean reward: -10.901
model & target weights updated
Score -4.9784788129636075
mean reward: -10.857
model & target weights updated
Score -9.266952890320919
mean reward: -10.875
model & target weights updated
Score -9.037392005790998
mean reward: -10.889
model & target weights updated
Score -8.455078192850348
mean reward: -10.817
model & target weights updated
Score -9.351705806596891
mean reward: -10.74
model & target weights updated
Score -12.868330138874757
mean reward: -10.74
model & target weights updated

Score -9.831945076443915
mean reward: -10.025
model & target weights updated
Score -14.378688028311698
mean reward: -10.076
model & target weights updated
Score -8.512254877098758
mean reward: -10.071
model & target weights updated
Score -9.842025398049353
mean reward: -10.084
model & target weights updated
Score -12.422282972457637
mean reward: -10.115
model & target weights updated
Score -13.260783250539864
mean reward: -10.119
model & target weights updated
Score -9.354918157075462
mean reward: -10.113
model & target weights updated
Score -12.339087156527668
mean reward: -10.092
model & target weights updated
Score -6.9924284274690685
mean reward: -10.095
model & target weights updated
Score -11.625141293444871
mean reward: -10.13
model & target weights updated
Score -8.868977013395341
mean reward: -10.129
model & target weights updated
Score -15.887554087779488
mean reward: -10.171
model & target weights updated
Score -15.502359482192974
mean reward: -10.155
model & target weights updated

Score -10.229920370919357
mean reward: -10.081
model & target weights updated
Score -5.657644355843371
mean reward: -10.014
model & target weights updated
Score -13.249685913118594
mean reward: -10.077
model & target weights updated
Score -10.888994150609648
mean reward: -10.07
model & target weights updated
Score -6.307541744528952
mean reward: -10.044
model & target weights updated
Score -5.950391042248316
mean reward: -9.945
model & target weights updated
Score -19.986888217693213
mean reward: -9.989
model & target weights updated
Score -8.368231431984993
mean reward: -9.968
model & target weights updated
Score -7.214978724835033
mean reward: -9.949
model & target weights updated
Score -12.055704526738626
mean reward: -9.979
model & target weights updated
Score -11.629059075445726
mean reward: -10.029
model & target weights updated
Score -10.964816246269054
mean reward: -10.036
model & target weights updated
Score -5.197616314758426
mean reward: -9.976
model & target weights updated

Score -9.483993061786355
mean reward: -9.814
model & target weights updated
Score -10.771355414932215
mean reward: -9.802
model & target weights updated
Score -9.208538958368406
mean reward: -9.777
model & target weights updated
Score -10.7160458010539
mean reward: -9.775
model & target weights updated
Score -12.295633820460678
mean reward: -9.846
model & target weights updated
Score -5.218612574147112
mean reward: -9.779
model & target weights updated
Score -7.486846579317583
mean reward: -9.801
model & target weights updated
Score -7.679780328434272
mean reward: -9.733
model & target weights updated
Score -9.528828825022474
mean reward: -9.736
model & target weights updated
Score -10.388518410343893
mean reward: -9.738
model & target weights updated
Score -5.656079085445607
mean reward: -9.727
model & target weights updated
Score -11.08860391815205
mean reward: -9.776
model & target weights updated
Score -5.963708563849597
mean reward: -9.733
model & target weights updated
Score -11.

Score -10.86797408398802
mean reward: -9.717
model & target weights updated
Score -4.287039404605317
mean reward: -9.656
model & target weights updated
Score -5.526229612442494
mean reward: -9.655
model & target weights updated
Score -11.549375222862116
mean reward: -9.66
model & target weights updated
Score -7.275871877299691
mean reward: -9.673
model & target weights updated
Score -5.453037110783906
mean reward: -9.609
model & target weights updated
Score -5.304964449199576
mean reward: -9.573
model & target weights updated
Score -7.16421401257
mean reward: -9.584
model & target weights updated
Score -9.71490981612187
mean reward: -9.534
model & target weights updated
Score -9.791199522619698
mean reward: -9.502
model & target weights updated
Score -10.360740058821023
mean reward: -9.51
model & target weights updated
Score -5.100467788474374
mean reward: -9.497
model & target weights updated
Score -9.29258129703783
mean reward: -9.482
model & target weights updated
Score -7.698943220

Score -8.259876138280287
mean reward: -9.252
model & target weights updated
Score -7.618728799016876
mean reward: -9.231
model & target weights updated
Score -6.3994905697715385
mean reward: -9.191
model & target weights updated
Score -4.90025258375498
mean reward: -9.189
model & target weights updated
Score -6.960885345647506
mean reward: -9.166
model & target weights updated
Score -9.223882288364328
mean reward: -9.181
model & target weights updated
Score -8.452381402786466
mean reward: -9.195
model & target weights updated
Score -6.232023160210642
mean reward: -9.171
model & target weights updated
Score -9.694182911471223
mean reward: -9.221
model & target weights updated
Score -7.527370551860449
mean reward: -9.195
model & target weights updated
Score -6.192098545468901
mean reward: -9.187
model & target weights updated
Score -8.881186385328578
mean reward: -9.201
model & target weights updated
Score -7.921425146594784
mean reward: -9.191
model & target weights updated
Score -10.50

Score -7.251274926955637
mean reward: -8.494
model & target weights updated
Score -6.786888124285262
mean reward: -8.487
model & target weights updated
Score -6.162727372914195
mean reward: -8.487
model & target weights updated
Score -9.33410915427669
mean reward: -8.491
model & target weights updated
Score -6.191843918792687
mean reward: -8.474
model & target weights updated
Score -6.64629180775168
mean reward: -8.435
model & target weights updated
Score -7.477059289695121
mean reward: -8.461
model & target weights updated
Score -11.699304754381822
mean reward: -8.446
model & target weights updated
Score -10.287899171035683
mean reward: -8.481
model & target weights updated
Score -10.366757267529792
mean reward: -8.48
model & target weights updated
Score -12.773127799434375
mean reward: -8.542
model & target weights updated
Score -5.682226175574828
mean reward: -8.502
model & target weights updated
Score -10.734775268461497
mean reward: -8.488
model & target weights updated
Score -8.0

Score -11.563077275625755
mean reward: -8.788
model & target weights updated
Score -9.76319829615091
mean reward: -8.782
model & target weights updated
Score -5.886156762336711
mean reward: -8.713
model & target weights updated
Score -8.257821787072103
mean reward: -8.739
model & target weights updated
Score -7.065851591511384
mean reward: -8.702
model & target weights updated
Score -5.298857270946339
mean reward: -8.675
model & target weights updated
Score -4.8978610648167935
mean reward: -8.639
model & target weights updated
Score -13.94751662477258
mean reward: -8.698
model & target weights updated
Score -8.196333799891697
mean reward: -8.671
model & target weights updated
Score -9.15029936341059
mean reward: -8.687
model & target weights updated
Score -5.444425117987877
mean reward: -8.675
model & target weights updated
Score -5.558665658680853
mean reward: -8.657
model & target weights updated
Score -14.178349697891834
mean reward: -8.651
model & target weights updated
Score -5.04

Score -5.129268649217543
mean reward: -9.31
model & target weights updated
Score -10.4617633129739
mean reward: -9.323
model & target weights updated
Score -10.395371667583637
mean reward: -9.372
model & target weights updated
Score -6.0483915210433485
mean reward: -9.377
model & target weights updated
Score -6.878886852668781
mean reward: -9.304
model & target weights updated
Score -4.477421758953117
mean reward: -9.298
model & target weights updated
Score -8.266219028930104
mean reward: -9.283
model & target weights updated
Score -7.934827886643216
mean reward: -9.341
model & target weights updated
Score -6.530677510301962
mean reward: -9.332
model & target weights updated
Score -11.891404332161878
mean reward: -9.351
model & target weights updated
Score -10.937986659770894
mean reward: -9.311
model & target weights updated
Score -12.35687247904202
mean reward: -9.405
model & target weights updated
Score -8.222454296638295
mean reward: -9.398
model & target weights updated
Score -10.

Score -7.770085034395385
mean reward: -8.123
model & target weights updated
Score -10.612111315692609
mean reward: -8.11
model & target weights updated
Score -9.097613204492552
mean reward: -8.092
model & target weights updated
Score -3.7444323567456133
mean reward: -8.006
model & target weights updated
Score -9.605609690579838
mean reward: -8.019
model & target weights updated
Score -7.190492613937932
mean reward: -7.99
model & target weights updated
Score -8.735882687476568
mean reward: -8.022
model & target weights updated
Score -6.275119340856067
mean reward: -7.969
model & target weights updated
Score -11.27771435804396
mean reward: -7.986
model & target weights updated
Score -7.825042959584076
mean reward: -7.968
model & target weights updated
Score -9.012246080221775
mean reward: -7.961
model & target weights updated
Score -5.8015906905868855
mean reward: -7.962
model & target weights updated
Score -9.381834269931943
mean reward: -7.973
model & target weights updated
Score -7.37

Score -6.701499126763699
mean reward: -7.851
model & target weights updated
Score -9.125408682301693
mean reward: -7.864
model & target weights updated
Score -10.16009743858987
mean reward: -7.876
model & target weights updated
Score -10.19792774704761
mean reward: -7.92
model & target weights updated
Score -6.1422578313572
mean reward: -7.887
model & target weights updated
Score -7.131606136809615
mean reward: -7.885
model & target weights updated
Score -5.579256226832495
mean reward: -7.896
model & target weights updated
Score -12.238894996969416
mean reward: -7.963
model & target weights updated
Score -10.146897932029889
mean reward: -7.97
model & target weights updated
Score -7.376479946659446
mean reward: -7.992
model & target weights updated
Score -8.43915871277555
mean reward: -7.934
model & target weights updated
Score -7.726597539832055
mean reward: -7.944
model & target weights updated
Score -10.507490211411621
mean reward: -7.971
model & target weights updated
Score -10.9092

Score -6.937808507657425
mean reward: -7.924
model & target weights updated
Score -3.3130405929666145
mean reward: -7.883
model & target weights updated
Score -6.68064534331188
mean reward: -7.866
model & target weights updated
Score -1.7587173615896377
mean reward: -7.806
model & target weights updated
Score -6.213814973984379
mean reward: -7.763
model & target weights updated
Score -7.933035473169149
mean reward: -7.733
model & target weights updated
Score -8.395375959270547
mean reward: -7.732
model & target weights updated
Score -7.373777869879283
mean reward: -7.703
model & target weights updated
Score -4.747874710729999
mean reward: -7.639
model & target weights updated
Score -9.791576663555833
mean reward: -7.66
model & target weights updated
Score -10.014343055093683
mean reward: -7.715
model & target weights updated
Score -7.599825114765955
mean reward: -7.728
model & target weights updated
Score -10.04189394425697
mean reward: -7.792
model & target weights updated
Score -9.22

Score -9.421402991881525
mean reward: -7.778
model & target weights updated
Score -5.16637184454074
mean reward: -7.732
model & target weights updated
Score -6.523974521673828
mean reward: -7.697
model & target weights updated
Score -5.950395626912429
mean reward: -7.68
model & target weights updated
Score -10.036827262604303
mean reward: -7.68
model & target weights updated
Score -11.2696551946784
mean reward: -7.701
model & target weights updated
Score -7.280341853196808
mean reward: -7.716
model & target weights updated
Score -4.815434452375273
mean reward: -7.707
model & target weights updated
Score -7.572201282384305
mean reward: -7.707
model & target weights updated
Score -8.460130536805808
mean reward: -7.723
model & target weights updated
Score -7.914533196948067
mean reward: -7.666
model & target weights updated
Score -5.423052768292016
mean reward: -7.68
model & target weights updated
Score -8.460367300971225
mean reward: -7.709
model & target weights updated
Score -6.9676908

Score -4.6267014342915544
mean reward: -7.171
model & target weights updated
Score -5.672900038002037
mean reward: -7.143
model & target weights updated
Score -8.106869350513808
mean reward: -7.145
model & target weights updated
Score -7.629492535843123
mean reward: -7.167
model & target weights updated
Score -5.368273668092807
mean reward: -7.136
model & target weights updated
Score -6.559701445535195
mean reward: -7.132
model & target weights updated
Score -6.985144889485577
mean reward: -7.132
model & target weights updated
Score -8.2205534936973
mean reward: -7.144
model & target weights updated
Score -4.947350058137048
mean reward: -7.128
model & target weights updated
Score -5.48451722372257
mean reward: -7.135
model & target weights updated
Score -2.8704035177578495
mean reward: -7.125
model & target weights updated
Score -7.495104324352903
mean reward: -7.129
model & target weights updated
Score -12.669489495379356
mean reward: -7.147
model & target weights updated
Score -7.103

Score -9.46006256955524
mean reward: -6.892
model & target weights updated
Score -4.690804800385084
mean reward: -6.884
model & target weights updated
Score -6.817447974066843
mean reward: -6.924
model & target weights updated
Score -10.902112851921448
mean reward: -6.958
model & target weights updated
Score -8.4268457770592
mean reward: -6.915
model & target weights updated
Score -8.436861356340426
mean reward: -6.929
model & target weights updated
Score -8.818069464761114
mean reward: -6.915
model & target weights updated
Score -11.855277624334649
mean reward: -6.988
model & target weights updated
Score -9.110582272195478
mean reward: -7.027
model & target weights updated
Score -8.452116528355583
mean reward: -7.062
model & target weights updated
Score -6.042219307224056
mean reward: -7.033
model & target weights updated
Score -5.03886458052461
mean reward: -7.042
model & target weights updated
Score -8.350832835551511
mean reward: -7.027
model & target weights updated
Score -10.4506

Score -3.461881979149004
mean reward: -7.19
model & target weights updated
Score -4.582162503594982
mean reward: -7.151
model & target weights updated
Score -7.602245729733786
mean reward: -7.167
model & target weights updated
Score -6.665071664710524
mean reward: -7.183
model & target weights updated
Score -8.548979584957872
mean reward: -7.185
model & target weights updated
Score -9.315839813070237
mean reward: -7.174
model & target weights updated
Score -8.791330920844008
mean reward: -7.205
model & target weights updated
Score -11.968472010862543
mean reward: -7.274
model & target weights updated
Score -6.854166515386789
mean reward: -7.258
model & target weights updated
Score -4.8304748656564245
mean reward: -7.24
model & target weights updated
Score -2.983844679587324
mean reward: -7.209
model & target weights updated
Score -7.023772965876364
mean reward: -7.231
model & target weights updated
Score -9.319075051984322
mean reward: -7.235
model & target weights updated
Score -10.46

Score -8.08810036646123
mean reward: -9.825
model & target weights updated
Score -11.06809245125994
mean reward: -9.887
model & target weights updated
Score -6.399275987266783
mean reward: -9.921
model & target weights updated
Score -5.396963387250241
mean reward: -9.905
model & target weights updated
Score -5.189469412236589
mean reward: -9.864
model & target weights updated
Score -8.460232941085467
mean reward: -9.844
model & target weights updated
Score -6.138310681834575
mean reward: -9.862
model & target weights updated
Score -9.134649170162318
mean reward: -9.845
model & target weights updated
Score -8.563866754110496
mean reward: -9.873
model & target weights updated
Score -10.422385461195542
mean reward: -9.91
model & target weights updated
Score -4.580467009564727
mean reward: -9.88
model & target weights updated
Score -6.981677096859462
mean reward: -9.848
model & target weights updated
Score -10.31843219672768
mean reward: -9.879
model & target weights updated
Score -8.90082

Score -6.966381342568911
mean reward: -6.894
model & target weights updated
Score -2.891141718986938
mean reward: -6.818
model & target weights updated
Score -5.436471464302866
mean reward: -6.827
model & target weights updated
Score -2.2566929084154337
mean reward: -6.78
model & target weights updated
Score -6.059409439637351
mean reward: -6.737
model & target weights updated
Score -2.9830514639816976
mean reward: -6.678
model & target weights updated
Score -8.330784213083254
mean reward: -6.683
model & target weights updated
Score -6.776984090963574
mean reward: -6.663
model & target weights updated
Score -8.308210645536741
mean reward: -6.69
model & target weights updated
Score -6.3667034835706575
mean reward: -6.71
model & target weights updated
Score -7.7873870146861
mean reward: -6.713
model & target weights updated
Score -9.035592853271556
mean reward: -6.741
model & target weights updated
Score -6.5808140482833695
mean reward: -6.721
model & target weights updated
Score -13.415

Score -5.19767984850793
mean reward: -7.251
model & target weights updated
Score -8.126147365279174
mean reward: -7.269
model & target weights updated
Score -6.638065271076081
mean reward: -7.257
model & target weights updated
Score -5.456726078845964
mean reward: -7.221
model & target weights updated
Score -6.942688016028637
mean reward: -7.225
model & target weights updated
Score -6.015131713548586
mean reward: -7.151
model & target weights updated
Score -10.692595232897128
mean reward: -7.235
model & target weights updated
Score -1.3220107109522834
mean reward: -7.187
model & target weights updated
Score -4.2667451716835485
mean reward: -7.157
model & target weights updated
Score -6.889161259143671
mean reward: -7.133
model & target weights updated
Score -8.038665048707774
mean reward: -7.103
model & target weights updated
Score -5.774131684941063
mean reward: -7.072
model & target weights updated
Score -8.26394533623064
mean reward: -7.079
model & target weights updated
Score -8.69

Score -9.709898963843386
mean reward: -7.222
model & target weights updated
Score -11.88725125091857
mean reward: -7.272
model & target weights updated
Score -8.354557880269647
mean reward: -7.275
model & target weights updated
Score -7.064687280531722
mean reward: -7.288
model & target weights updated
Score -2.742571673761845
mean reward: -7.232
model & target weights updated
Score -7.542664764273674
mean reward: -7.221
model & target weights updated
Score -10.96590281524542
mean reward: -7.248
model & target weights updated
Score -10.813551634171903
mean reward: -7.29
model & target weights updated
Score -12.240145695682884
mean reward: -7.372
model & target weights updated
Score -9.850122368337704
mean reward: -7.399
model & target weights updated
Score -3.104665644703459
mean reward: -7.314
model & target weights updated
Score -3.1178502289557444
mean reward: -7.228
model & target weights updated
Score -6.870793214424065
mean reward: -7.267
model & target weights updated
Score -4.5

Score -4.959709342280469
mean reward: -6.927
model & target weights updated
Score -8.445394556334488
mean reward: -6.913
model & target weights updated
Score -8.183812115812051
mean reward: -6.963
model & target weights updated
Score -7.076737608174714
mean reward: -7.003
model & target weights updated
Score -4.490750339794755
mean reward: -6.979
model & target weights updated
Score -7.534246008590702
mean reward: -7.009
model & target weights updated
Score -2.570461015569559
mean reward: -6.985
model & target weights updated
Score -7.395840024177308
mean reward: -6.971
model & target weights updated
Score -6.868701716357097
mean reward: -6.971
model & target weights updated
Score -8.546524929057643
mean reward: -6.971
model & target weights updated
Score -4.122726298882339
mean reward: -6.919
model & target weights updated
Score -2.6516998777404495
mean reward: -6.891
model & target weights updated
Score -9.234781006958997
mean reward: -6.898
model & target weights updated
Score -9.17

Score -4.606210284317189
mean reward: -6.707
model & target weights updated
Score -6.410599201570475
mean reward: -6.686
model & target weights updated
Score -5.132152583304688
mean reward: -6.696
model & target weights updated
Score -8.194053044258384
mean reward: -6.751
model & target weights updated
Score -11.136191451268854
mean reward: -6.77
model & target weights updated
Score -8.446433971530169
mean reward: -6.763
model & target weights updated
Score -6.647474875784691
mean reward: -6.769
model & target weights updated
Score -8.069104491294341
mean reward: -6.742
model & target weights updated
Score -5.400602147415654
mean reward: -6.694
model & target weights updated
Score -4.031147351632692
mean reward: -6.683
model & target weights updated
Score -8.198676540431503
mean reward: -6.707
model & target weights updated
Score -7.268184290924189
mean reward: -6.665
model & target weights updated
Score -10.18479390867365
mean reward: -6.707
model & target weights updated
Score -4.479

Score -3.463636198673781
mean reward: -6.842
model & target weights updated
Score -3.099760123109133
mean reward: -6.833
model & target weights updated
Score -11.156216131924925
mean reward: -6.862
model & target weights updated
Score -4.387539692912568
mean reward: -6.833
model & target weights updated
Score -7.859414993927024
mean reward: -6.81
model & target weights updated
Score -7.822108540173671
mean reward: -6.844
model & target weights updated
Score -5.777588386473201
mean reward: -6.771
model & target weights updated
Score -3.1901836319358607
mean reward: -6.745
model & target weights updated
Score -3.2579379848198418
mean reward: -6.705
model & target weights updated
Score -2.339733730688102
mean reward: -6.667
model & target weights updated
Score -7.831299765759842
mean reward: -6.679
model & target weights updated
Score -3.341489621851865
mean reward: -6.612
model & target weights updated
Score -6.886585847360904
mean reward: -6.575
model & target weights updated
Score -3.5

Score -7.900177172480655
mean reward: -6.176
model & target weights updated
Score -3.6769963326771733
mean reward: -6.189
model & target weights updated
Score -10.055939966582816
mean reward: -6.211
model & target weights updated
Score -6.685557069890047
mean reward: -6.245
model & target weights updated
Score -6.474315565258274
mean reward: -6.241
model & target weights updated
Score -2.7079206768057227
mean reward: -6.232
model & target weights updated
Score -3.388707050525102
mean reward: -6.206
model & target weights updated
Score -7.509717729845144
mean reward: -6.23
model & target weights updated
Score -8.121193889129165
mean reward: -6.245
model & target weights updated
Score -4.8308270545257
mean reward: -6.224
model & target weights updated
Score -4.607316533640965
mean reward: -6.143
model & target weights updated
Score -5.605088249589707
mean reward: -6.127
model & target weights updated
Score -7.640843323246875
mean reward: -6.175
model & target weights updated
Score -9.749

SystemExit: 0

  warn("To exit: use 'exit', 'quit', or Ctrl-D.", stacklevel=1)


#### Save model weights

Save the actor and critic weights to "reacher_actor.pth" & "reacher_critic.pth" \
A new agent can load these trained weights instead of retraining from scratch

In [6]:
torch.save(actor.state_dict(), "reacher_actor.pth")
torch.save(critic.state_dict(), "reacher_critic.pth")
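
As a rough illustration of the save-and-reload round trip, the sketch below writes a model's `state_dict` to disk and loads it into a fresh instance. `TinyActor`, `save_and_reload`, the layer sizes and the file name are hypothetical stand-ins, not the notebook's own `Actor` from models.py. It only assumes the standard `torch.save`, `torch.load` and `load_state_dict` calls.

```python
import os
import tempfile

import torch
import torch.nn as nn


# Hypothetical stand-in for the notebook's Actor; the real class lives in models.py.
class TinyActor(nn.Module):
    def __init__(self, num_states, hidden_size, num_actions):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(num_states, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, num_actions),
            nn.Tanh(),
        )

    def forward(self, state):
        return self.net(state)


def save_and_reload(model, fresh_model, path):
    """Save model's weights to path, then load them into fresh_model."""
    torch.save(model.state_dict(), path)
    # map_location lets weights saved on a GPU be loaded on a CPU-only machine.
    state_dict = torch.load(path, map_location=torch.device("cpu"))
    fresh_model.load_state_dict(state_dict)
    return fresh_model


# Sizes are illustrative assumptions, not the notebook's hyperparameters.
trained = TinyActor(num_states=11, hidden_size=32, num_actions=2)
fresh = TinyActor(num_states=11, hidden_size=32, num_actions=2)

with tempfile.TemporaryDirectory() as tmp_dir:
    weights_path = os.path.join(tmp_dir, "tiny_actor.pth")
    restored = save_and_reload(trained, fresh, weights_path)

# Every tensor in the restored model should equal the one that was saved.
weights_match = all(
    torch.equal(a, b)
    for a, b in zip(trained.state_dict().values(), restored.state_dict().values())
)
print("weights match:", weights_match)
```

The fresh model must be built with the same layer sizes as the saved one, otherwise `load_state_dict` raises an error about mismatched shapes.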

### Run this if you only want to use pre-trained weights & observe agent in action

#### Observe agent
Let's see how our agent performs over 10 episodes

In [5]:
episodes = 10
env = gym.make('Reacher-v4')
buffer = ExperienceReplay(max_memory_size)

# Rebuild the networks and load the trained weights.
# map_location allows weights saved on a GPU to be loaded on a CPU-only machine.
trained_actor = Actor(num_states, hidden_size, num_actions).to(device)
trained_actor.load_state_dict(torch.load("reacher_actor.pth", map_location=device))
# Only the actor is needed to select actions; the critic is loaded for completeness.
trained_critic = Critic(num_states + num_actions, hidden_size, num_output).to(device)
trained_critic.load_state_dict(torch.load("reacher_critic.pth", map_location=device))

agent = DDPG(env, buffer)
episode_rewards = []

for episode in tqdm(range(episodes)):
    terminate = False
    episode_reward = 0
    agent._reset()
    while not terminate:
        # Act greedily with the trained actor (no exploration noise)
        action = agent.select_optimal_action(trained_actor, device=device)
        # 4-value step API (gym < 0.26); newer gym versions return 5 values
        next_state, reward, terminate, info = agent.env.step(action)
        env.render()
        episode_reward += reward
        agent.state = next_state
    # Record the total reward once the episode has ended
    episode_rewards.append(episode_reward)

mean_reward = sum(episode_rewards) / len(episode_rewards)
print("mean reward: %.3f" % mean_reward)

env.close()

100%|██████████| 10/10 [00:11<00:00,  1.19s/it]

mean reward: -4.901






#### Conclusion

As observed, our agent reaches the target within the time limit most of the time, with a mean reward of -4.901 over the 10 evaluation episodes
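
One way to make "most of the time" measurable is to count the fraction of evaluation episodes whose total reward clears a chosen cut-off. The sketch below is only an illustration: `success_rate`, the threshold of -5.0 and the reward values are hypothetical placeholders, not results or settings from this notebook.

```python
def success_rate(episode_rewards, threshold):
    """Return the fraction of episodes whose total reward is at least threshold."""
    if not episode_rewards:
        raise ValueError("episode_rewards must not be empty")
    successes = sum(1 for r in episode_rewards if r >= threshold)
    return successes / len(episode_rewards)


# Placeholder rewards for illustration only; these are not results from the notebook.
example_rewards = [-3.5, -4.2, -6.8, -2.9, -5.1]
# Hypothetical cut-off; pick one that matches what counts as reaching the target.
example_threshold = -5.0

rate = success_rate(example_rewards, example_threshold)
print("success rate: %.2f" % rate)
```

In the notebook, the `episode_rewards` list collected during the observation run could be passed in place of the placeholder values.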