# Continuous Control

---

In this notebook, you will learn how to use the Unity ML-Agents environment for the second project of the [Deep Reinforcement Learning Nanodegree](https://www.udacity.com/course/deep-reinforcement-learning-nanodegree--nd893) program.

### 1. Start the Environment

We begin by importing the necessary packages.  If the code cell below returns an error, please revisit the project instructions to double-check that you have installed [Unity ML-Agents](https://github.com/Unity-Technologies/ml-agents/blob/master/docs/Installation.md) and [NumPy](http://www.numpy.org/).

In [1]:
from unityagents import UnityEnvironment
import numpy as np

Next, we will start the environment!  **_Before running the code cell below_**, change the `file_name` parameter to match the location of the Unity environment that you downloaded.

- **Mac**: `"path/to/Reacher.app"`
- **Windows** (x86): `"path/to/Reacher_Windows_x86/Reacher.exe"`
- **Windows** (x86_64): `"path/to/Reacher_Windows_x86_64/Reacher.exe"`
- **Linux** (x86): `"path/to/Reacher_Linux/Reacher.x86"`
- **Linux** (x86_64): `"path/to/Reacher_Linux/Reacher.x86_64"`
- **Linux** (x86, headless): `"path/to/Reacher_Linux_NoVis/Reacher.x86"`
- **Linux** (x86_64, headless): `"path/to/Reacher_Linux_NoVis/Reacher.x86_64"`

For instance, if you are using a Mac, then you downloaded `Reacher.app`.  If this file is in the same folder as the notebook, then the line below should appear as follows:
```
env = UnityEnvironment(file_name="Reacher.app")
```

In [3]:
env = UnityEnvironment(file_name='Reacher_multi.app', no_graphics=True) #Reacher_multi

INFO:unityagents:
'Academy' started successfully!
Unity Academy name: Academy
        Number of Brains: 1
        Number of External Brains : 1
        Lesson number : 0
        Reset Parameters :
		goal_size -> 5.0
		goal_speed -> 1.0
Unity brain name: ReacherBrain
        Number of Visual Observations (per agent): 0
        Vector Observation space type: continuous
        Vector Observation space size (per agent): 33
        Number of stacked Vector Observation: 1
        Vector Action space type: continuous
        Vector Action space size (per agent): 4
        Vector Action descriptions: , , , 


Environments contain **_brains_** which are responsible for deciding the actions of their associated agents. Here we check for the first brain available, and set it as the default brain we will be controlling from Python.

In [4]:
# get the default brain
brain_name = env.brain_names[0]
brain = env.brains[brain_name]

### 2. Examine the State and Action Spaces

In this environment, a double-jointed arm can move to target locations. A reward of `+0.1` is provided for each step that the agent's hand is in the goal location. Thus, the goal of your agent is to maintain its position at the target location for as many time steps as possible.

The observation space consists of `33` variables corresponding to position, rotation, velocity, and angular velocities of the arm.  Each action is a vector with four numbers, corresponding to torque applicable to two joints.  Every entry in the action vector must be a number between `-1` and `1`.

Run the code cell below to print some information about the environment.

In [5]:
# reset the environment
env_info = env.reset(train_mode=True)[brain_name]

# number of agents
num_agents = len(env_info.agents)
print('Number of agents:', num_agents)

# size of each action
action_size = brain.vector_action_space_size
print('Size of each action:', action_size)

# examine the state space 
states = env_info.vector_observations
state_size = states.shape[1]
print('There are {} agents. Each observes a state with length: {}'.format(states.shape[0], state_size))
print('The state for the first agent looks like:', states[0])

Number of agents: 20
Size of each action: 4
There are 20 agents. Each observes a state with length: 33
The state for the first agent looks like: [ 0.00000000e+00 -4.00000000e+00  0.00000000e+00  1.00000000e+00
 -0.00000000e+00 -0.00000000e+00 -4.37113883e-08  0.00000000e+00
  0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00
  0.00000000e+00  0.00000000e+00 -1.00000000e+01  0.00000000e+00
  1.00000000e+00 -0.00000000e+00 -0.00000000e+00 -4.37113883e-08
  0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00
  0.00000000e+00  0.00000000e+00  5.75471878e+00 -1.00000000e+00
  5.55726624e+00  0.00000000e+00  1.00000000e+00  0.00000000e+00
 -1.68164849e-01]


### 3. Take Random Actions in the Environment

In the next code cell, you will learn how to use the Python API to control the agent and receive feedback from the environment.

Once this cell is executed, you will watch the agent's performance, if it selects an action at random with each time step.  A window should pop up that allows you to observe the agent, as it moves through the environment.  

Of course, as part of the project, you'll have to change the code so that the agent is able to use its experience to gradually choose better actions when interacting with the environment!

In [6]:
env_info = env.reset(train_mode=False)[brain_name]     # reset the environment    
states = env_info.vector_observations                  # get the current state (for each agent)
scores = np.zeros(num_agents)                          # initialize the score (for each agent)
while True:
    actions = np.random.randn(num_agents, action_size) # select an action (for each agent)
    actions = np.clip(actions, -1, 1)                  # all actions between -1 and 1
    env_info = env.step(actions)[brain_name]           # send all actions to tne environment
    next_states = env_info.vector_observations         # get next state (for each agent)
    rewards = env_info.rewards                         # get reward (for each agent)
    dones = env_info.local_done                        # see if episode finished
    scores += env_info.rewards                         # update the score (for each agent)
    states = next_states                               # roll over states to next time step
    if np.any(dones):                                  # exit loop if episode finished
        break
print('Total score (averaged over agents) this episode: {}'.format(np.mean(scores)))

Total score (averaged over agents) this episode: 0.1099999975413084


When finished, you can close the environment.

### 4. It's Your Turn! : Train the Agent with DDPG

Now it's your turn to train your own agent to solve the environment!  When training the environment, set `train_mode=True`, so that the line for resetting the environment looks like the following:
```python
env_info = env.reset(train_mode=True)[brain_name]
```

In [7]:
states.shape

(20, 33)

In [8]:
%load_ext autoreload
%autoreload 2

In [9]:
len(rewards) #rewards is list

20

In [10]:
action_size

4

In [13]:
%reload_ext autoreload

from ddpg_agent_reacher import Agent
from collections import deque
agent = Agent(state_size=state_size, action_size=action_size, random_seed=2) #state_size 33, action_size 4

#The size of tensor a (51200) must match the size of tensor b (2560) at non-singleton 
#dimension 0
env_info = env.reset(train_mode=True)[brain_name]
states = env_info.vector_observations
print(states[0].shape)

def ddpg(n_episodes=1000, max_t=300, print_every=100):
    scores_deque = deque(maxlen=print_every)
    scores = []
    for i_episode in range(1, n_episodes+1):
        env_info = env.reset(train_mode=True)[brain_name]
        states = env_info.vector_observations
        agent.reset() 
        score = 0
        for t in range(max_t):
            action = agent.act(states) #(20,33) or (33,)
            action = np.clip(action, -1, 1)
            env_info = env.step([action])[brain_name]
            next_states = env_info.vector_observations
            rewards = env_info.rewards 
            dones = env_info.local_done    
             
            agent.step(states, action, rewards, next_states, dones)
            states = next_states
            score = score + np.average(rewards)
            if dones[0]:
                break 
        scores_deque.append(score)
        scores.append(score)
        print('\rEpisode {}\tAverage Score: {:.2f}'.format(i_episode, np.mean(scores_deque)), end="")
        #torch.save(agent.actor_local.state_dict(), 'p2_checkpoint_actor.pth')
        #torch.save(agent.critic_local.state_dict(), 'p2_checkpoint_critic.pth')
        if i_episode % print_every == 0:
            print('\rEpisode {}\tAverage Score: {:.2f}'.format(i_episode, np.mean(scores_deque)))
            
    return scores

scores = ddpg()

fig = plt.figure()
ax = fig.add_subplot(111)
plt.plot(np.arange(1, len(scores)+1), scores)
plt.ylabel('Score')
plt.xlabel('Episode #')
plt.show()

(33,)

 LALALALA self.memory  1

 LALALALA self.memory  2

 LALALALA self.memory  3

 LALALALA self.memory  4

 LALALALA self.memory  5

 LALALALA self.memory  6

 LALALALA self.memory  7

 LALALALA self.memory  8

 LALALALA self.memory  9

 LALALALA self.memory  10

 LALALALA self.memory  11

 LALALALA self.memory  12

 LALALALA self.memory  13

 LALALALA self.memory  14

 LALALALA self.memory  15

 LALALALA self.memory  16

 LALALALA self.memory  17

 LALALALA self.memory  18

 LALALALA self.memory  19

 LALALALA self.memory  20

 LALALALA self.memory  21

 LALALALA self.memory  22

 LALALALA self.memory  23

 LALALALA self.memory  24

 LALALALA self.memory  25

 LALALALA self.memory  26

 LALALALA self.memory  27

 LALALALA self.memory  28

 LALALALA self.memory  29

 LALALALA self.memory  30

 LALALALA self.memory  31

 LALALALA self.memory  32

 LALALALA self.memory  33

 LALALALA self.memory  34

 LALALALA self.memory  35

 LALALALA self.memory  36

 LALALALA self.memory  37

 LA


 LALALALA self.memory  308

 LALALALA self.memory  309

 LALALALA self.memory  310

 LALALALA self.memory  311

 LALALALA self.memory  312

 LALALALA self.memory  313

 LALALALA self.memory  314

 LALALALA self.memory  315

 LALALALA self.memory  316

 LALALALA self.memory  317

 LALALALA self.memory  318

 LALALALA self.memory  319

 LALALALA self.memory  320

 LALALALA self.memory  321

 LALALALA self.memory  322

 LALALALA self.memory  323

 LALALALA self.memory  324

 LALALALA self.memory  325

 LALALALA self.memory  326

 LALALALA self.memory  327

 LALALALA self.memory  328

 LALALALA self.memory  329

 LALALALA self.memory  330

 LALALALA self.memory  331

 LALALALA self.memory  332

 LALALALA self.memory  333

 LALALALA self.memory  334

 LALALALA self.memory  335

 LALALALA self.memory  336

 LALALALA self.memory  337

 LALALALA self.memory  338

 LALALALA self.memory  339

 LALALALA self.memory  340

 LALALALA self.memory  341

 LALALALA self.memory  342

 LALALALA self.memo


 LALALALA self.memory  622

 LALALALA self.memory  623

 LALALALA self.memory  624

 LALALALA self.memory  625

 LALALALA self.memory  626

 LALALALA self.memory  627

 LALALALA self.memory  628

 LALALALA self.memory  629

 LALALALA self.memory  630

 LALALALA self.memory  631

 LALALALA self.memory  632

 LALALALA self.memory  633

 LALALALA self.memory  634

 LALALALA self.memory  635

 LALALALA self.memory  636

 LALALALA self.memory  637

 LALALALA self.memory  638

 LALALALA self.memory  639

 LALALALA self.memory  640

 LALALALA self.memory  641

 LALALALA self.memory  642

 LALALALA self.memory  643

 LALALALA self.memory  644

 LALALALA self.memory  645

 LALALALA self.memory  646

 LALALALA self.memory  647

 LALALALA self.memory  648

 LALALALA self.memory  649

 LALALALA self.memory  650

 LALALALA self.memory  651

 LALALALA self.memory  652

 LALALALA self.memory  653

 LALALALA self.memory  654

 LALALALA self.memory  655

 LALALALA self.memory  656

 LALALALA self.memo


 LALALALA self.memory  932

 LALALALA self.memory  933

 LALALALA self.memory  934

 LALALALA self.memory  935

 LALALALA self.memory  936

 LALALALA self.memory  937

 LALALALA self.memory  938

 LALALALA self.memory  939

 LALALALA self.memory  940

 LALALALA self.memory  941

 LALALALA self.memory  942

 LALALALA self.memory  943

 LALALALA self.memory  944

 LALALALA self.memory  945

 LALALALA self.memory  946

 LALALALA self.memory  947

 LALALALA self.memory  948

 LALALALA self.memory  949

 LALALALA self.memory  950

 LALALALA self.memory  951

 LALALALA self.memory  952

 LALALALA self.memory  953

 LALALALA self.memory  954

 LALALALA self.memory  955

 LALALALA self.memory  956

 LALALALA self.memory  957

 LALALALA self.memory  958

 LALALALA self.memory  959

 LALALALA self.memory  960

 LALALALA self.memory  961

 LALALALA self.memory  962

 LALALALA self.memory  963

 LALALALA self.memory  964

 LALALALA self.memory  965

 LALALALA self.memory  966

 LALALALA self.memo


 LALALALA self.memory  1227

 LALALALA self.memory  1228

 LALALALA self.memory  1229

 LALALALA self.memory  1230

 LALALALA self.memory  1231

 LALALALA self.memory  1232

 LALALALA self.memory  1233

 LALALALA self.memory  1234

 LALALALA self.memory  1235

 LALALALA self.memory  1236

 LALALALA self.memory  1237

 LALALALA self.memory  1238

 LALALALA self.memory  1239

 LALALALA self.memory  1240

 LALALALA self.memory  1241

 LALALALA self.memory  1242

 LALALALA self.memory  1243

 LALALALA self.memory  1244

 LALALALA self.memory  1245

 LALALALA self.memory  1246

 LALALALA self.memory  1247

 LALALALA self.memory  1248

 LALALALA self.memory  1249

 LALALALA self.memory  1250

 LALALALA self.memory  1251

 LALALALA self.memory  1252

 LALALALA self.memory  1253

 LALALALA self.memory  1254

 LALALALA self.memory  1255

 LALALALA self.memory  1256

 LALALALA self.memory  1257

 LALALALA self.memory  1258

 LALALALA self.memory  1259

 LALALALA self.memory  1260

 LALALALA sel


 LALALALA self.memory  1510

 LALALALA self.memory  1511

 LALALALA self.memory  1512

 LALALALA self.memory  1513

 LALALALA self.memory  1514

 LALALALA self.memory  1515

 LALALALA self.memory  1516

 LALALALA self.memory  1517

 LALALALA self.memory  1518

 LALALALA self.memory  1519

 LALALALA self.memory  1520

 LALALALA self.memory  1521

 LALALALA self.memory  1522

 LALALALA self.memory  1523

 LALALALA self.memory  1524

 LALALALA self.memory  1525

 LALALALA self.memory  1526

 LALALALA self.memory  1527

 LALALALA self.memory  1528

 LALALALA self.memory  1529

 LALALALA self.memory  1530

 LALALALA self.memory  1531

 LALALALA self.memory  1532

 LALALALA self.memory  1533

 LALALALA self.memory  1534

 LALALALA self.memory  1535

 LALALALA self.memory  1536

 LALALALA self.memory  1537

 LALALALA self.memory  1538

 LALALALA self.memory  1539

 LALALALA self.memory  1540

 LALALALA self.memory  1541

 LALALALA self.memory  1542

 LALALALA self.memory  1543

 LALALALA sel


 LALALALA self.memory  1809

 LALALALA self.memory  1810

 LALALALA self.memory  1811

 LALALALA self.memory  1812

 LALALALA self.memory  1813

 LALALALA self.memory  1814

 LALALALA self.memory  1815

 LALALALA self.memory  1816

 LALALALA self.memory  1817

 LALALALA self.memory  1818

 LALALALA self.memory  1819

 LALALALA self.memory  1820

 LALALALA self.memory  1821

 LALALALA self.memory  1822

 LALALALA self.memory  1823

 LALALALA self.memory  1824

 LALALALA self.memory  1825

 LALALALA self.memory  1826

 LALALALA self.memory  1827

 LALALALA self.memory  1828

 LALALALA self.memory  1829

 LALALALA self.memory  1830

 LALALALA self.memory  1831

 LALALALA self.memory  1832

 LALALALA self.memory  1833

 LALALALA self.memory  1834

 LALALALA self.memory  1835

 LALALALA self.memory  1836

 LALALALA self.memory  1837

 LALALALA self.memory  1838

 LALALALA self.memory  1839

 LALALALA self.memory  1840

 LALALALA self.memory  1841

 LALALALA self.memory  1842

 LALALALA sel


 LALALALA self.memory  2110

 LALALALA self.memory  2111

 LALALALA self.memory  2112

 LALALALA self.memory  2113

 LALALALA self.memory  2114

 LALALALA self.memory  2115

 LALALALA self.memory  2116

 LALALALA self.memory  2117

 LALALALA self.memory  2118

 LALALALA self.memory  2119

 LALALALA self.memory  2120

 LALALALA self.memory  2121

 LALALALA self.memory  2122

 LALALALA self.memory  2123

 LALALALA self.memory  2124

 LALALALA self.memory  2125

 LALALALA self.memory  2126

 LALALALA self.memory  2127

 LALALALA self.memory  2128

 LALALALA self.memory  2129

 LALALALA self.memory  2130

 LALALALA self.memory  2131

 LALALALA self.memory  2132

 LALALALA self.memory  2133

 LALALALA self.memory  2134

 LALALALA self.memory  2135

 LALALALA self.memory  2136

 LALALALA self.memory  2137

 LALALALA self.memory  2138

 LALALALA self.memory  2139

 LALALALA self.memory  2140

 LALALALA self.memory  2141

 LALALALA self.memory  2142

 LALALALA self.memory  2143

 LALALALA sel


 LALALALA self.memory  2408

 LALALALA self.memory  2409

 LALALALA self.memory  2410

 LALALALA self.memory  2411

 LALALALA self.memory  2412

 LALALALA self.memory  2413

 LALALALA self.memory  2414

 LALALALA self.memory  2415

 LALALALA self.memory  2416

 LALALALA self.memory  2417

 LALALALA self.memory  2418

 LALALALA self.memory  2419

 LALALALA self.memory  2420

 LALALALA self.memory  2421

 LALALALA self.memory  2422

 LALALALA self.memory  2423

 LALALALA self.memory  2424

 LALALALA self.memory  2425

 LALALALA self.memory  2426

 LALALALA self.memory  2427

 LALALALA self.memory  2428

 LALALALA self.memory  2429

 LALALALA self.memory  2430

 LALALALA self.memory  2431

 LALALALA self.memory  2432

 LALALALA self.memory  2433

 LALALALA self.memory  2434

 LALALALA self.memory  2435

 LALALALA self.memory  2436

 LALALALA self.memory  2437

 LALALALA self.memory  2438

 LALALALA self.memory  2439

 LALALALA self.memory  2440

 LALALALA self.memory  2441

 LALALALA sel

RuntimeError: The size of tensor a (51200) must match the size of tensor b (2560) at non-singleton dimension 0

### 5. Watch a Smart Agent!

In [None]:
agent.actor_local.load_state_dict(torch.load('p2_checkpoint_actor.pth'))
agent.critic_local.load_state_dict(torch.load('p2_checkpoint_critic.pth'))

state = env.reset()
for t in range(200):
    action = agent.act(state, add_noise=False)
    env.render()
    state, reward, done, _ = env.step(action)
    if done:
        break 

env.close()

In [None]:
#env.close()