# Continuous Control

---

In this notebook, you will learn how to use the Unity ML-Agents environment for the second project of the [Deep Reinforcement Learning Nanodegree](https://www.udacity.com/course/deep-reinforcement-learning-nanodegree--nd893) program.

### 1. Start the Environment

We begin by importing the necessary packages.  If the code cell below returns an error, please revisit the project instructions to double-check that you have installed [Unity ML-Agents](https://github.com/Unity-Technologies/ml-agents/blob/master/docs/Installation.md) and [NumPy](http://www.numpy.org/).

In [1]:
import os
import sys
from unityagents import UnityEnvironment
import numpy as np

import torch

os.chdir('..')

work_dir = os.getcwd()

sys.path.insert(0, 'continuous_control/src')

from model import (
    Critic,
    Actor,
    DDPG,
)

from util import _state_to_torch

from utils.buffer import ReplayBuffer

from torch.utils.tensorboard import SummaryWriter

env = UnityEnvironment(file_name='continuous_control/unity/Reacher_Linux_Many/Reacher.x86_64')

# get the default brain
brain_name = env.brain_names[0]
brain = env.brains[brain_name]

# reset the environment
env_info = env.reset(train_mode=True)[brain_name]

# number of agents
num_agents = len(env_info.agents)
print('Number of agents:', num_agents)

# size of each action
action_size = brain.vector_action_space_size
print('Size of each action:', action_size)

# examine the state space 
states = env_info.vector_observations
state_size = states.shape[1]
print('There are {} agents. Each observes a state with length: {}'.format(states.shape[0], state_size))
print('The state for the first agent looks like:', states[0])

%load_ext autoreload
%autoreload 2

INFO:unityagents:
'Academy' started successfully!
Unity Academy name: Academy
        Number of Brains: 1
        Number of External Brains : 1
        Lesson number : 0
        Reset Parameters :
		goal_size -> 5.0
		goal_speed -> 1.0
Unity brain name: ReacherBrain
        Number of Visual Observations (per agent): 0
        Vector Observation space type: continuous
        Vector Observation space size (per agent): 33
        Number of stacked Vector Observation: 1
        Vector Action space type: continuous
        Vector Action space size (per agent): 4
        Vector Action descriptions: , , , 


Number of agents: 20
Size of each action: 4
There are 20 agents. Each observes a state with length: 33
The state for the first agent looks like: [ 0.00000000e+00 -4.00000000e+00  0.00000000e+00  1.00000000e+00
 -0.00000000e+00 -0.00000000e+00 -4.37113883e-08  0.00000000e+00
  0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00
  0.00000000e+00  0.00000000e+00 -1.00000000e+01  0.00000000e+00
  1.00000000e+00 -0.00000000e+00 -0.00000000e+00 -4.37113883e-08
  0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00
  0.00000000e+00  0.00000000e+00  5.75471878e+00 -1.00000000e+00
  5.55726624e+00  0.00000000e+00  1.00000000e+00  0.00000000e+00
 -1.68164849e-01]


In [2]:
replaybuffer = ReplayBuffer(
    5000,
    state_size=30,
    action_size=4,
)

In [4]:
import time
start = time.time()

for i in range(10):
    
    state = np.random.rand(20, 30)
    state_prime = np.random.rand(20, 30)
    action = np.random.rand(20, 4)
    reward = np.random.rand(20)
    weight = np.random.rand(20)
    done = np.random.rand(20)

    replaybuffer.add(
        state=state,
        state_prime=state_prime,
        reward=reward,
        action=action,
        weight=weight,
        done=done
    )
    
    print(len(replaybuffer))
    
print(time.time() - start)

20
40
60
80
100
120
140
160
180
200
0.0034160614013671875


In [53]:
replaybuffer._action[0:4990, :]

array([[0.33284286, 0.75338119, 0.04289542, 0.53129129],
       [0.2982976 , 0.11867255, 0.47619062, 0.2646239 ],
       [0.47743836, 0.74935759, 0.60646312, 0.53024298],
       ...,
       [0.        , 0.        , 0.        , 0.        ],
       [0.        , 0.        , 0.        , 0.        ],
       [0.        , 0.        , 0.        , 0.        ]])

In [13]:
(
    state, 
    state_prime,
    action,
    reward,
    done, 
    weight
) = replaybuffer.draw(201)


In [8]:
len(replaybuffer)

200

### 2. Examine the State and Action Spaces

In this environment, a double-jointed arm can move to target locations. A reward of `+0.1` is provided for each step that the agent's hand is in the goal location. Thus, the goal of your agent is to maintain its position at the target location for as many time steps as possible.

The observation space consists of `33` variables corresponding to the position, rotation, velocity, and angular velocity of the arm. Each action is a vector of four numbers, corresponding to the torques applied to the two joints. Every entry in the action vector must be a number between `-1` and `1`.
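
As a minimal sketch of the action constraint, the snippet below draws random actions for a batch of agents and clips them into the valid range. The helper name `clip_actions` and the fixed seed are illustrative assumptions, not part of the project code; the agent count (20) and action size (4) match the values printed when the environment started.

```python
import numpy as np

NUM_AGENTS = 20   # number of arms in this version of the environment
ACTION_SIZE = 4   # entries per action vector


def clip_actions(actions, low=-1.0, high=1.0):
    """Return a copy of `actions` with every entry forced into [low, high]."""
    actions = np.asarray(actions, dtype=np.float64)
    return np.clip(actions, low, high)


rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible
# Scale the samples so that some of them fall outside [-1, 1].
raw_actions = rng.normal(size=(NUM_AGENTS, ACTION_SIZE)) * 2.0
actions = clip_actions(raw_actions)

print(actions.shape)
print(actions.min() >= -1.0, actions.max() <= 1.0)
```

Clipping keeps the shape of the batch unchanged, so the result can be passed to the environment as one row per agent.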

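The first code cell in this notebook already prints this information when the environment starts: the number of agents, the size of each action, and the state observed by the first agent.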

## Setting up the networks

In [3]:
actor = Actor(
    in_dim=state_size,
    out_dim=action_size,
    hidden_dim=128,
    squeeze_dim=64,
    res_block=5,
    limit_l=-1,
    limit_h=1,
)        

critic = Critic(
    in_dim=state_size+action_size,
    out_dim=1,
    hidden_dim=128,
    squeeze_dim=64,
    res_block=5,
)        

### Load checkpoints

In [None]:
actor.load_state_dict(torch.load("models/actor.ckp"))
critic.load_state_dict(torch.load("models/critic.ckp"))

### Deep Deterministic Policy Gradient Handler

In [4]:
ddpg = DDPG(
    critic=critic,
    actor=actor,
    lr_critic=10**-3,
    lr_actor=10**-3,
    tau_critic=10**-3,
    tau_actor=10**-3,
)
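
The `tau_critic` and `tau_actor` arguments suggest soft (Polyak) target-network updates, the usual choice in DDPG. The internals of the `DDPG` class are not shown in this notebook, so the following is only a sketch of that standard rule, `target = tau * online + (1 - tau) * target`, using plain NumPy arrays in place of network parameters. `soft_update` is a hypothetical helper name.

```python
import numpy as np


def soft_update(target_params, online_params, tau):
    """Blend online parameters into the target parameters in place.

    Applies target <- tau * online + (1 - tau) * target to each array.
    """
    for name, online in online_params.items():
        target_params[name] *= (1.0 - tau)
        target_params[name] += tau * online


# Toy "networks": one weight matrix and one bias vector each.
online = {"w": np.ones((2, 2)), "b": np.ones(2)}
target = {"w": np.zeros((2, 2)), "b": np.zeros(2)}

tau = 1e-3  # same value as tau_critic / tau_actor in the cell above
for _ in range(1000):
    soft_update(target, online, tau)

# With the online value fixed at 1 and the target starting at 0,
# the target equals 1 - (1 - tau)**n after n updates.
print(target["w"][0, 0])
```

A small `tau` makes the target networks trail the online networks slowly, which is meant to stabilize the bootstrapped critic targets.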


## Setting up the replay buffer


In [5]:
replaybuffer = ReplayBuffer(
    10000,
    state_size=state_size,
    action_size=action_size,
)
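
`ReplayBuffer` is imported from `utils.buffer`, whose source is not part of this notebook. As a rough sketch of how such a buffer could work, the class below preallocates NumPy arrays, overwrites the oldest entries once full, and returns `None` placeholders when fewer transitions are stored than requested (the `if state is not None` check in the training loop implies something similar). The class name `SimpleReplayBuffer` and all of its internals are assumptions for illustration; importance weights are omitted.

```python
import numpy as np


class SimpleReplayBuffer:
    """Hypothetical fixed-size ring buffer for batched transitions."""

    def __init__(self, capacity, state_size, action_size, seed=0):
        self.capacity = capacity
        self._state = np.zeros((capacity, state_size))
        self._state_prime = np.zeros((capacity, state_size))
        self._action = np.zeros((capacity, action_size))
        self._reward = np.zeros(capacity)
        self._done = np.zeros(capacity)
        self._next = 0   # index of the next slot to write
        self._size = 0   # number of valid transitions stored
        self._rng = np.random.default_rng(seed)

    def __len__(self):
        return self._size

    def add(self, state, state_prime, action, reward, done):
        """Store one transition per row (one row per agent)."""
        n = len(state)
        # Wrap around so the oldest entries are overwritten first.
        idx = (self._next + np.arange(n)) % self.capacity
        self._state[idx] = state
        self._state_prime[idx] = state_prime
        self._action[idx] = action
        self._reward[idx] = reward
        self._done[idx] = done
        self._next = (self._next + n) % self.capacity
        self._size = min(self._size + n, self.capacity)

    def draw(self, batch_size):
        """Sample without replacement; return Nones if too few transitions."""
        if batch_size > self._size:
            return None, None, None, None, None
        idx = self._rng.choice(self._size, size=batch_size, replace=False)
        return (
            self._state[idx],
            self._state_prime[idx],
            self._action[idx],
            self._reward[idx],
            self._done[idx],
        )


buffer = SimpleReplayBuffer(capacity=50, state_size=33, action_size=4)
for _ in range(3):
    buffer.add(
        state=np.random.rand(20, 33),
        state_prime=np.random.rand(20, 33),
        action=np.random.rand(20, 4),
        reward=np.random.rand(20),
        done=np.zeros(20),
    )

print(len(buffer))          # capped at the capacity of 50
state, *_ = buffer.draw(32)
print(state.shape)
```

Adding a batch of 20 rows per step mirrors how the training loop stores one transition per agent.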

## Setting up hyperparameters

In [6]:
experiement_name = "test"

episodes = 2000

batch_size = 64

gamma = 0.95

device = "cuda" if torch.cuda.is_available() else "cpu"  # fall back to CPU when no GPU is available

update_steps = 10

sigma_end = 0.1
sigma_decay = 0.99
sigma = 0.5

## Setting up the tensorboard writer


In [7]:
writer = SummaryWriter(f"continuous_control/runs/{experiement_name}")

In [11]:
action

array([[ 1.        ,  1.        , -1.        ,  1.        ],
       [ 1.        ,  1.        , -1.        ,  1.        ],
       [ 1.        ,  1.        , -1.        ,  1.        ],
       [ 1.        ,  1.        , -1.        ,  1.        ],
       [ 1.        ,  1.        , -1.        ,  1.        ],
       [ 1.        ,  1.        , -1.        , -1.        ],
       [ 1.        ,  1.        , -1.        ,  1.        ],
       [ 1.        ,  1.        , -1.        ,  1.        ],
       [ 1.        ,  1.        , -1.        ,  1.        ],
       [ 1.        ,  1.        , -1.        ,  1.        ],
       [ 1.        ,  1.        , -1.        ,  1.        ],
       [ 0.93134606,  1.        , -1.        ,  1.        ],
       [ 1.        ,  1.        , -1.        ,  1.        ],
       [ 1.        ,  1.        , -1.        ,  1.        ],
       [ 1.        ,  1.        , -1.        ,  1.        ],
       [ 1.        ,  1.        , -1.        ,  1.        ],
       [ 1.        ,  1.

In [29]:
for episode in range(episodes):
    
    sigma = max(sigma_end, sigma_decay*sigma) # decrease sigma
    
    print(episode, sigma)
    

0 0.495
1 0.49005
2 0.48514949999999996
3 0.480298005
4 0.47549502494999996
5 0.47074007470049994
6 0.46603267395349496
7 0.46137234721396
8 0.45675862374182036
9 0.45219103750440215
10 0.44766912712935814
11 0.44319243585806456
12 0.4387605114994839
13 0.43437290638448905
14 0.43002917732064416
15 0.4257288855474377
16 0.4214715966919633
17 0.4172568807250437
18 0.41308431191779327
19 0.4089534687986153
20 0.40486393411062915
21 0.4008152947695229
22 0.39680714182182764
23 0.3928390704036094
24 0.38891067969957327
25 0.38502157290257755
26 0.3811713571735518
27 0.37735964360181623
28 0.3735860471657981
29 0.3698501866941401
30 0.3661516848271987
31 0.3624901679789267
32 0.3588652662991374
33 0.35527661363614604
34 0.3517238474997846
35 0.34820660902478673
36 0.34472454293453886
37 0.3412772975051935
38 0.33786452453014154
39 0.33448587928484014
40 0.33114102049199173
41 0.3278296102870718
42 0.3245513141842011
43 0.32130580104235906
44 0.31809274303193547
45 0.31491181560161613
46 0.3
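
The printed values follow a geometric decay with a floor: after `k` updates, sigma equals `max(sigma_end, sigma_start * sigma_decay**k)`. The sketch below uses the hyperparameters defined earlier in this notebook, checks that closed form against the loop, and computes how many updates it takes to reach the floor. The helper names and `sigma_start` are illustrative.

```python
import math

sigma_start = 0.5   # initial value of sigma in this notebook
sigma_end = 0.1
sigma_decay = 0.99


def sigma_after(k):
    """Noise scale after k decay steps (closed form of the loop above)."""
    return max(sigma_end, sigma_start * sigma_decay ** k)


def updates_until_floor():
    """Smallest k with sigma_start * sigma_decay**k <= sigma_end."""
    return math.ceil(math.log(sigma_end / sigma_start) / math.log(sigma_decay))


# Reproduce the loop and compare each step with the closed form.
sigma = sigma_start
for k in range(1, 201):
    sigma = max(sigma_end, sigma_decay * sigma)
    assert math.isclose(sigma, sigma_after(k))

print(sigma_after(1))
print(updates_until_floor())
```

With these settings the exploration noise stops shrinking well before the 2000 configured episodes are over, so most of training runs at `sigma_end`.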

## Send models to device

In [9]:
ddpg = ddpg.to(device)

In [None]:
iteration = 0

for episode in range(episodes):

    env_info = env.reset(train_mode=True)[brain_name]    
    states = env_info.vector_observations                 
    scores = np.zeros(num_agents)                          
    
    losses_critic = []
    losses_actor = []
    
    sigma = max(sigma_end, sigma_decay*sigma) # decrease sigma
        
    while True:
        
        action, q = ddpg.forward(s=torch.from_numpy(states).float().to(device))
            
        # Add Gaussian exploration noise and clip to the valid action range
        action = action.detach().cpu().numpy()
        action = action + np.random.normal(loc=0.0, scale=sigma, size=action.shape)
        action = np.clip(action, -1, 1)
        
        env_info = env.step(action)[brain_name] 
        
        next_states = env_info.vector_observations
        
        rewards = env_info.rewards                      
        dones = env_info.local_done                       
        scores += env_info.rewards                         
    
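        # Stop as soon as any agent reports done. Note that this last
        # transition is not added to the replay buffer.
        if np.any(dones):
            break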
            
        replaybuffer.add(
            state=states,
            state_prime=next_states,
            reward=np.array(rewards),
            action=action,
            weight=None,
            done=np.array(dones)
        )
            
        states = next_states
        
        for i in range(update_steps):
            
            iteration += 1
        
            (
                state, 
                state_prime,
                action,
                reward,
                done, 
                weight
            ) = replaybuffer.draw(batch_size, replace=False)

            if state is not None:

                cl, al = ddpg.loss(
                    gamma=gamma,
                    s=torch.from_numpy(state).float().to(device),
                    sprime=torch.from_numpy(state_prime).float().to(device),
                    r=torch.from_numpy(reward).float().to(device),
                    a=torch.from_numpy(action).float().to(device),
                    d=torch.from_numpy(done).float().to(device),
                    w=torch.from_numpy(weight).float().to(device),
                )

                ddpg.update(cl=cl, al=al)

                losses_critic.append(cl.item())
                losses_actor.append(al.item())
                
                if iteration % 10 == 0:
                    print(iteration, np.mean(losses_critic), np.mean(losses_actor))
                
    writer.add_scalar("training/kpi/score", np.mean(scores), episode) # The score
    writer.add_scalar("training/kpi/loss_actor", np.mean(losses_actor), episode) # The actor loss
    writer.add_scalar("training/kpi/loss_critic", np.mean(losses_critic), episode) # The critic loss

40 1.8446524180471897 -0.274514901265502
50 1.0324084306135775 -0.2612938731908798
60 0.720352313419183 -0.3114666278163592
70 0.5536590100731701 -0.3383105706423521
80 0.44944866724312305 -0.3492603346705437
90 0.3789249758546551 -0.3714071698486805
100 0.32856666331312484 -0.38626342692545484
110 0.2895795177668333 -0.40027233529835937
120 0.2592107468802068 -0.41119028247065015
130 0.23501176279969513 -0.4174619419872761
140 0.21531412239948458 -0.42633118643002077
150 0.19867818478960544 -0.4371251820276181
160 0.18438850440658056 -0.44593246109210527
170 0.17211432297980148 -0.45589053109288213
180 0.16159178625792264 -0.46004232853651045
190 0.15223602753248996 -0.46412291331216693
200 0.1440647382508306 -0.46910518749671826
210 0.1367428654132204 -0.46884195349282687
220 0.13030047126996674 -0.4690875327116565
230 0.12446538262069225 -0.4718258760124445
240 0.11937299746399124 -0.47160908338569457
250 0.11465802169404923 -0.47134086638689043
260 0.11056420442688725 -0.4714476734
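
The two columns printed after the iteration counter are the running means of the critic and actor losses. The body of `ddpg.loss` is not shown here; assuming it follows the standard DDPG formulation, the critic is regressed onto the target `y = r + gamma * (1 - d) * Q_target(s', mu_target(s'))` and the actor minimizes `-Q(s, mu(s))`, which would explain the negative actor loss. The NumPy sketch below illustrates only the target and critic-loss arithmetic with made-up numbers; `td_target` and `critic_loss` are hypothetical helper names.

```python
import numpy as np


def td_target(reward, done, q_next, gamma):
    """One-step TD target; bootstrapping is switched off where done == 1."""
    return reward + gamma * (1.0 - done) * q_next


def critic_loss(q_pred, target, weight=None):
    """(Optionally weighted) mean squared TD error."""
    squared_error = (q_pred - target) ** 2
    if weight is not None:
        squared_error = weight * squared_error
    return float(np.mean(squared_error))


gamma = 0.95                            # discount factor used in this notebook
reward = np.array([0.1, 0.0, 0.1])      # made-up rewards
done = np.array([0.0, 0.0, 1.0])        # last transition is terminal
q_next = np.array([2.0, 1.0, 5.0])      # made-up target-critic values for s'
q_pred = np.array([1.9, 1.0, 0.1])      # made-up critic values for (s, a)

y = td_target(reward, done, q_next, gamma)
print(y)
print(critic_loss(q_pred, y))
```

For the terminal transition the target reduces to the reward alone, regardless of how large the bootstrapped value would have been.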

In [18]:
torch.__version__

'1.9.0'

In [70]:
action.shape

(20, 4)

In [63]:
states.shape

(20, 33)

In [None]:
replay_buffer.draw(10)

In [13]:
replay_buffer.draw(200).state_prime.shape

torch.Size([200, 33])

In [73]:
len(replay_buffer)

NameError: name 'replay_buffer' is not defined

### 3. Take Random Actions in the Environment

In the next code cell, you will learn how to use the Python API to control the agent and receive feedback from the environment.

Once this cell is executed, a window should pop up that lets you watch the agent as it selects a random action at each time step.

Of course, as part of the project, you'll have to change the code so that the agent is able to use its experience to gradually choose better actions when interacting with the environment!

In [9]:
env_info = env.reset(train_mode=False)[brain_name]     # reset the environment    
states = env_info.vector_observations                  # get the current state (for each agent)
scores = np.zeros(num_agents)                          # initialize the score (for each agent)
while True:
    actions = np.random.randn(num_agents, action_size) # select an action (for each agent)
    actions = np.clip(actions, -1, 1)                  # all actions between -1 and 1
    env_info = env.step(actions)[brain_name]           # send all actions to the environment
    next_states = env_info.vector_observations         # get next state (for each agent)
    rewards = env_info.rewards                         # get reward (for each agent)
    dones = env_info.local_done                        # see if episode finished
    scores += env_info.rewards                         # update the score (for each agent)
    states = next_states                               # roll over states to next time step
    if np.any(dones):                                  # exit loop if episode finished
        break
print('Total score (averaged over agents) this episode: {}'.format(np.mean(scores)))

KeyboardInterrupt: 

When finished, you can close the environment.

In [6]:
env.close()

### 4. It's Your Turn!

Now it's your turn to train your own agent to solve the environment! When training, set `train_mode=True`, so that the line for resetting the environment looks like the following:
```python
env_info = env.reset(train_mode=True)[brain_name]
```
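
While training, it helps to track a moving average of the per-episode score (the mean over all agents) rather than single noisy episodes. The sketch below keeps the last `window` scores in a `deque`. The class name `ScoreTracker`, the default window of 100 episodes, and the default target of 30 are assumptions for illustration, so check the project instructions for the actual success criterion.

```python
from collections import deque


class ScoreTracker:
    """Hypothetical helper that tracks a moving average of episode scores."""

    def __init__(self, window=100, target=30.0):
        self.scores = deque(maxlen=window)  # old scores drop out automatically
        self.target = target

    def add(self, score):
        """Record one episode score and return the current moving average."""
        self.scores.append(score)
        return self.average()

    def average(self):
        return sum(self.scores) / len(self.scores) if self.scores else 0.0

    def solved(self):
        """True once the window is full and its average reaches the target."""
        window_full = len(self.scores) == self.scores.maxlen
        return window_full and self.average() >= self.target


# Small window so the example is easy to follow by hand.
tracker = ScoreTracker(window=3, target=30.0)
for episode_score in [10.0, 29.0, 31.0, 33.0]:
    average = tracker.add(episode_score)

print(average)
print(tracker.solved())
```

Inside a training loop, `tracker.add(np.mean(scores))` could be called once per episode, with the returned average logged alongside the raw score.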