# Continuous Control

---

In this notebook, you will learn how to use the Unity ML-Agents environment for the second project of the [Deep Reinforcement Learning Nanodegree](https://www.udacity.com/course/deep-reinforcement-learning-nanodegree--nd893) program.

### 1. Start the Environment

We begin by importing the necessary packages.  If the code cell below returns an error, please revisit the project instructions to double-check that you have installed [Unity ML-Agents](https://github.com/Unity-Technologies/ml-agents/blob/master/docs/Installation.md) and [NumPy](http://www.numpy.org/).

In [1]:
from unityagents import UnityEnvironment
import numpy as np
import torch
import random

seed = 1337
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)
torch.backends.cudnn.deterministic = True
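
Seeding each random number generator is what makes a run repeatable: resetting a generator to the same seed replays the same sequence of draws. The minimal sketch below illustrates this with Python's built-in `random` module only. The helper name `draw_after_seeding` is hypothetical and not part of the notebook.

```python
import random

def draw_after_seeding(seed, n=3):
    """Seed Python's global RNG, then return the next n uniform draws."""
    random.seed(seed)
    return [random.random() for _ in range(n)]

first = draw_after_seeding(1337)
second = draw_after_seeding(1337)
other = draw_after_seeding(1338)

print(first == second)  # same seed, so the same sequence is replayed
print(first == other)   # a different seed gives a different sequence
```

The same reasoning applies to the NumPy and PyTorch generators seeded in the cell above, each of which keeps its own state.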

Next, we will start the environment!  **_Before running the code cell below_**, change the `file_name` parameter to match the location of the Unity environment that you downloaded.

- **Mac**: `"path/to/Reacher.app"`
- **Windows** (x86): `"path/to/Reacher_Windows_x86/Reacher.exe"`
- **Windows** (x86_64): `"path/to/Reacher_Windows_x86_64/Reacher.exe"`
- **Linux** (x86): `"path/to/Reacher_Linux/Reacher.x86"`
- **Linux** (x86_64): `"path/to/Reacher_Linux/Reacher.x86_64"`
- **Linux** (x86, headless): `"path/to/Reacher_Linux_NoVis/Reacher.x86"`
- **Linux** (x86_64, headless): `"path/to/Reacher_Linux_NoVis/Reacher.x86_64"`

For instance, if you are using a Mac, then you downloaded `Reacher.app`.  If this file is in the same folder as the notebook, then the line below should appear as follows:
```
env = UnityEnvironment(file_name="Reacher.app")
```

In [2]:
env = UnityEnvironment(file_name='Reacher.app', seed=seed)

INFO:unityagents:
'Academy' started successfully!
Unity Academy name: Academy
        Number of Brains: 1
        Number of External Brains : 1
        Lesson number : 0
        Reset Parameters :
		goal_speed -> 1.0
		goal_size -> 5.0
Unity brain name: ReacherBrain
        Number of Visual Observations (per agent): 0
        Vector Observation space type: continuous
        Vector Observation space size (per agent): 33
        Number of stacked Vector Observation: 1
        Vector Action space type: continuous
        Vector Action space size (per agent): 4
        Vector Action descriptions: , , , 


Environments contain **_brains_** which are responsible for deciding the actions of their associated agents. Here we check for the first brain available, and set it as the default brain we will be controlling from Python.

In [3]:
# get the default brain
brain_name = env.brain_names[0]
brain = env.brains[brain_name]

### 2. Examine the State and Action Spaces

In this environment, a double-jointed arm can move to target locations. A reward of `+0.1` is provided for each step that the agent's hand is in the goal location. Thus, the goal of your agent is to maintain its position at the target location for as many time steps as possible.
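
Because the reward is a fixed `+0.1` per step spent in the goal, an episode's score is `0.1` times the number of such steps. A small sketch of that arithmetic follows. The function `episode_score` and the example flags are illustrative assumptions, not part of the environment's API.

```python
REWARD_PER_STEP = 0.1  # reward stated above for each step spent in the goal

def episode_score(in_goal_flags):
    """Sum the per-step reward over the steps where the hand was in the goal."""
    return sum(REWARD_PER_STEP for in_goal in in_goal_flags if in_goal)

# hypothetical episode: the hand is in the goal on 3 of 5 steps
flags = [False, True, True, False, True]
print(round(episode_score(flags), 2))
```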

The observation space consists of `33` variables corresponding to the position, rotation, velocity, and angular velocity of the arm.  Each action is a vector of four numbers, corresponding to the torques applied to the two joints.  Every entry in the action vector must be a number between `-1` and `1`.
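
Since every entry of an action must lie in `[-1, 1]`, values sampled from an unbounded distribution need to be clipped before they are sent to the environment. The sketch below does this in plain Python for one four-entry action. `clip_action` is a hypothetical helper; inside the notebook, `np.clip` serves the same purpose.

```python
import random

ACTION_SIZE = 4  # action vector length reported by the environment above

def clip_action(action, low=-1.0, high=1.0):
    """Clamp each entry of an action vector to the interval [low, high]."""
    return [max(low, min(high, a)) for a in action]

rng = random.Random(0)                                   # local RNG for the example
raw = [rng.gauss(0.0, 1.0) for _ in range(ACTION_SIZE)]  # unbounded Gaussian sample
action = clip_action(raw)

print(clip_action([-2.5, 0.3, 1.7, -0.9]))  # [-1.0, 0.3, 1.0, -0.9]
```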

Run the code cell below to print some information about the environment.

In [4]:
# reset the environment
env_info = env.reset(train_mode=True)[brain_name]

# number of agents
num_agents = len(env_info.agents)
print('Number of agents:', num_agents)

# size of each action
action_size = brain.vector_action_space_size
print('Size of each action:', action_size)

# examine the state space 
states = env_info.vector_observations
state_size = states.shape[1]
print('There are {} agents. Each observes a state with length: {}'.format(states.shape[0], state_size))
print('The state for the first agent looks like:', states[0])

Number of agents: 1
Size of each action: 4
There are 1 agents. Each observes a state with length: 33
The state for the first agent looks like: [ 0.00000000e+00 -4.00000000e+00  0.00000000e+00  1.00000000e+00
 -0.00000000e+00 -0.00000000e+00 -4.37113883e-08  0.00000000e+00
  0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00
  0.00000000e+00  0.00000000e+00 -1.00000000e+01  0.00000000e+00
  1.00000000e+00 -0.00000000e+00 -0.00000000e+00 -4.37113883e-08
  0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00
  0.00000000e+00  0.00000000e+00  4.81451988e+00 -1.00000000e+00
  6.38908386e+00  0.00000000e+00  1.00000000e+00  0.00000000e+00
  8.53890657e-01]


### 3. Take Random Actions in the Environment

In the next code cell, you will learn how to use the Python API to control the agent and receive feedback from the environment.

Once this cell is executed, you will watch the agent's performance as it selects an action at random at each time step.  A window should pop up that lets you observe the agent as it moves through the environment.  

Of course, as part of the project, you'll have to change the code so that the agent is able to use its experience to gradually choose better actions when interacting with the environment!

In [5]:
# env_info = env.reset(train_mode=False)[brain_name]     # reset the environment    
# states = env_info.vector_observations                  # get the current state (for each agent)
# scores = np.zeros(num_agents)                          # initialize the score (for each agent)
# while True:
#     actions = np.random.randn(num_agents, action_size) # select an action (for each agent)
#     actions = np.clip(actions, -1, 1)                  # all actions between -1 and 1
#     env_info = env.step(actions)[brain_name]           # send all actions to the environment
#     next_states = env_info.vector_observations         # get next state (for each agent)
#     rewards = env_info.rewards                         # get reward (for each agent)
#     dones = env_info.local_done                        # see if episode finished
#     scores += env_info.rewards                         # update the score (for each agent)
#     states = next_states                               # roll over states to next time step
#     if np.any(dones):                                  # exit loop if episode finished
#         break
# print('Total score (averaged over agents) this episode: {}'.format(np.mean(scores)))

When finished, you can close the environment.

In [6]:
# env.close()

In [7]:
# Run agent

def run_agent(model_path):
    from ppo_agent import Agent

    brain_name = env.brain_names[0]
    brain = env.brains[brain_name]
    env_info = env.reset(train_mode=False)[brain_name]
    n_observations = env_info.vector_observations.shape[1]
    n_actions = brain.vector_action_space_size
    # the rollout below runs on the CPU, so map the checkpoint's tensors there
    agent = Agent(n_observations, n_actions)
    agent.load_state_dict(torch.load(model_path, map_location="cpu"))

    scores = np.zeros(1)                          # initialize the score (for each agent)
    while True:
        obs = torch.Tensor(np.expand_dims(env_info.vector_observations[0], 0))
        with torch.no_grad():
            action, _, _, _ = agent.get_action_and_value(obs)
        action = torch.clamp(action, -1, 1)               # keep every action entry within [-1, 1]
        action = action.numpy()
        env_info = env.step(action)[brain_name]           # send all actions to the environment
        rewards = env_info.rewards                         # get reward (for each agent)
        dones = env_info.local_done                        # see if episode finished
        scores += env_info.rewards                         # update the score (for each agent)
        if np.any(dones):                                  # exit loop if episode finished
            break
    print('Total score (averaged over agents) this episode: {}'.format(np.mean(scores)))
# run_agent('checkpoints/model_step_976.pickle')

### 4. It's Your Turn!

Now it's your turn to train your own agent to solve the environment!  When training the agent, set `train_mode=True`, so that the line for resetting the environment looks like the following:
```python
env_info = env.reset(train_mode=True)[brain_name]
```

In [None]:
from ppo import run_ppo

run_ppo(env)

update 1/1953. Last update in 2.86102294921875e-06s
last 100 returns: 0.07999999821186066
update 2/1953. Last update in 7.998077869415283s
last 100 returns: 0.08399999812245369
update 3/1953. Last update in 7.86649227142334s
last 100 returns: 0.10428571195474692
update 4/1953. Last update in 7.7167439460754395s
last 100 returns: 0.10777777536875671
update 5/1953. Last update in 7.829853057861328s
last 100 returns: 0.18090908686545762
update 6/1953. Last update in 8.042262077331543s
last 100 returns: 0.1738461499603895
update 7/1953. Last update in 8.097480773925781s
last 100 returns: 0.21066666195789974
update 8/1953. Last update in 7.799475908279419s
last 100 returns: 0.19705881912480383
update 9/1953. Last update in 7.708945035934448s
last 100 returns: 0.2084210479730054
update 10/1953. Last update in 7.753421068191528s
last 100 returns: 0.22142856647925718
update 11/1953. Last update in 7.9409849643707275s
last 100 returns: 0.20999999530613422
update 12/1953. Last update in 7.874290

last 100 returns: 2.2213999503478408
update 96/1953. Last update in 7.6938769817352295s
last 100 returns: 2.237299949992448
update 97/1953. Last update in 7.683856964111328s
last 100 returns: 2.288999948836863
update 98/1953. Last update in 7.736672878265381s
last 100 returns: 2.2863999488949776
update 99/1953. Last update in 7.674743890762329s
last 100 returns: 2.275799949131906
update 100/1953. Last update in 7.717886924743652s
last 100 returns: 2.2610999494604767
update 101/1953. Last update in 7.732435941696167s
last 100 returns: 2.281999948993325
update 102/1953. Last update in 7.683976888656616s
last 100 returns: 2.334799947813153
update 103/1953. Last update in 7.777731895446777s
last 100 returns: 2.3144999482668935
update 104/1953. Last update in 7.70491099357605s
last 100 returns: 2.35089994745329
update 105/1953. Last update in 7.690739154815674s
last 100 returns: 2.3660999471135438
update 106/1953. Last update in 7.723138093948364s
last 100 returns: 2.433599945604801
update 

last 100 returns: 2.7101999394223095
update 190/1953. Last update in 7.759142160415649s
last 100 returns: 2.7753999379649756
update 191/1953. Last update in 7.784319877624512s
last 100 returns: 2.777099937926978
update 192/1953. Last update in 7.756137132644653s
last 100 returns: 2.7899999376386404
update 193/1953. Last update in 7.681793928146362s
last 100 returns: 2.7818999378196896
update 194/1953. Last update in 7.726746082305908s
last 100 returns: 2.7473999385908248
update 195/1953. Last update in 7.742915153503418s
last 100 returns: 2.7073999394848944
update 196/1953. Last update in 7.673784017562866s
last 100 returns: 2.6658999404124915
update 197/1953. Last update in 7.730867147445679s
last 100 returns: 2.700599939636886
update 198/1953. Last update in 7.7046778202056885s
last 100 returns: 2.7095999394357206
update 199/1953. Last update in 7.696652173995972s
last 100 returns: 2.6745999402180316
update 200/1953. Last update in 7.661372900009155s
last 100 returns: 2.6972999397106

last 100 returns: 2.6355999410897493
update 284/1953. Last update in 7.658980131149292s
last 100 returns: 2.6096999416686595
update 285/1953. Last update in 7.751826047897339s
last 100 returns: 2.57779994238168
update 286/1953. Last update in 7.6570820808410645s
last 100 returns: 2.6115999416261912
update 287/1953. Last update in 7.668106317520142s
last 100 returns: 2.6210999414138496
update 288/1953. Last update in 7.763859987258911s
last 100 returns: 2.645399940870702
update 289/1953. Last update in 7.669438123703003s
last 100 returns: 2.6772999401576816
update 290/1953. Last update in 7.732395172119141s
last 100 returns: 2.6882999399118126
update 291/1953. Last update in 7.679421901702881s
last 100 returns: 2.725899939071387
update 292/1953. Last update in 7.665967226028442s
last 100 returns: 2.741399938724935
update 293/1953. Last update in 7.772955894470215s
last 100 returns: 2.7749999379739165
update 294/1953. Last update in 7.694005012512207s
last 100 returns: 2.77319993801415
u

last 100 returns: 2.583199942260981
update 378/1953. Last update in 8.238384246826172s
last 100 returns: 2.568599942587316
update 379/1953. Last update in 8.298235177993774s
last 100 returns: 2.5733999424800276
update 380/1953. Last update in 8.050881147384644s
last 100 returns: 2.551799942962825
update 381/1953. Last update in 7.837753772735596s
last 100 returns: 2.5542999429069457
update 382/1953. Last update in 7.686144113540649s
last 100 returns: 2.566899942625314
update 383/1953. Last update in 7.707397937774658s
last 100 returns: 2.545099943112582
update 384/1953. Last update in 7.942054033279419s
last 100 returns: 2.5046999440155924
update 385/1953. Last update in 7.790740966796875s
last 100 returns: 2.509399943910539
update 386/1953. Last update in 7.677792072296143s
last 100 returns: 2.518799943700433
update 387/1953. Last update in 7.65902304649353s
last 100 returns: 2.4977999441698193
update 388/1953. Last update in 7.691082000732422s
last 100 returns: 2.4999999441206455
upd

last 100 returns: 2.1687999515235425
update 472/1953. Last update in 7.672152042388916s
last 100 returns: 2.1336999523080884
update 473/1953. Last update in 7.607626914978027s
last 100 returns: 2.1296999523974955
update 474/1953. Last update in 7.598994016647339s
last 100 returns: 2.1021999530121684
update 475/1953. Last update in 7.72314977645874s
last 100 returns: 2.085199953392148
update 476/1953. Last update in 7.646453142166138s
last 100 returns: 2.0933999532088636
update 477/1953. Last update in 7.7278501987457275s
last 100 returns: 2.081699953470379
update 478/1953. Last update in 7.679636001586914s
last 100 returns: 2.0475999542325733
update 479/1953. Last update in 7.668774127960205s
last 100 returns: 1.976199955828488
update 480/1953. Last update in 7.639127016067505s
last 100 returns: 1.9802999557368457
update 481/1953. Last update in 7.672805309295654s
last 100 returns: 1.965799956060946
update 482/1953. Last update in 7.691009044647217s
last 100 returns: 1.967599956020713
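
The training output above interleaves `update i/N` lines with `last 100 returns: x` lines. To track progress programmatically, the returns can be pulled out of the captured text line by line. A minimal sketch, assuming the log format shown above; `parse_returns` is a hypothetical helper and not part of `ppo.py`.

```python
PREFIX = "last 100 returns:"

def parse_returns(log_text):
    """Collect the value from every 'last 100 returns: x' line, in order."""
    values = []
    for line in log_text.splitlines():
        line = line.strip()
        if line.startswith(PREFIX):
            try:
                values.append(float(line[len(PREFIX):]))
            except ValueError:
                pass  # skip truncated or malformed lines
    return values

# four lines copied from the start of the output above
sample = """update 1/1953. Last update in 2.86102294921875e-06s
last 100 returns: 0.07999999821186066
update 2/1953. Last update in 7.998077869415283s
last 100 returns: 0.08399999812245369
"""
returns = parse_returns(sample)
print(len(returns), max(returns))
```

Skipping lines that fail to parse keeps the helper usable on output that was cut off mid-line.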


In [None]:
def copy_model_and_plot_learning_curve():
    import pickle
    import matplotlib.pyplot as plt
    from collections import deque
    import os
    import datetime
    import shutil
    
    datetime_stamp = datetime.datetime.now().strftime('%y%m%d_%H%M')
    plot_path = f'checkpoints/{datetime_stamp}'
    
    if not os.path.exists(plot_path):
        os.makedirs(plot_path)
    else:
        print(f'directory {plot_path} already exists')
        return
    
    shutil.copyfile('checkpoints/eplen_and_returns_976.pickle', f'{plot_path}/eplen_and_returns.pickle')
    shutil.copyfile('checkpoints/model_step_976.pickle', f'{plot_path}/final_model.pickle')

    with open(f'{plot_path}/eplen_and_returns.pickle', 'rb') as f:
        _, total_rewards = zip(*pickle.load(f))

    smoothed = []
    queue = deque([], maxlen=10)
    for r in total_rewards:
        queue.append(r)
        smoothed.append(sum(queue)/len(queue))
    fig, ax = plt.subplots()
    ax.plot(smoothed)
    ax.set_xlabel('episodes')
    ax.set_ylabel('return (10-episode moving average)')
    plt.savefig(f'{plot_path}/learning_curve.png')
    plt.show()
copy_model_and_plot_learning_curve()
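
The smoothing loop in the cell above is a trailing moving average: each plotted point is the mean of the most recent 10 returns (fewer at the start of training). The same idea is sketched below as a standalone function. The name `moving_average` and the example values are illustrative only.

```python
from collections import deque

def moving_average(values, window=10):
    """Trailing mean over the last `window` values (fewer at the start)."""
    recent = deque(maxlen=window)  # drops the oldest value automatically
    smoothed = []
    for v in values:
        recent.append(v)
        smoothed.append(sum(recent) / len(recent))
    return smoothed

print(moving_average([1.0, 2.0, 3.0, 4.0], window=2))  # [1.0, 1.5, 2.5, 3.5]
```

A larger `window` gives a smoother curve at the cost of reacting more slowly to changes in performance.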

In [None]:
import os

path = 'checkpoints/03/eplen_and_returns_976.pickle'
print(os.path.dirname(path))

In [None]:
# from ddpg_agent import Agent

# agent = Agent(state_size=33, action_size=4, random_seed=2)
# scores = agent.run_unity_ddpg(env)
# env.close()

# fig = plt.figure()
# ax = fig.add_subplot(111)
# plt.plot(np.arange(1, len(scores)+1), scores)
# plt.ylabel('Score')
# plt.xlabel('Episode #')
# plt.show()