# Train a Continuous Control Agent (Many Arms)
---

In this notebook, we train a DDPG agent to solve the Unity ML-Agents Reacher (double-jointed arm) environment, using the build that runs many arms in parallel.

### 1. Import the Necessary Packages

In [2]:
import random
import torch
import numpy as np
from collections import deque
import matplotlib.pyplot as plt
%matplotlib inline

from ddpg_agent_con1 import Agent

# change these paths to point at the Reacher builds on your own machine
one_arm_file = '/home/robert/RL_ubuntu/udacity_rlnd_cont_cntl/Reacher_Linux/Reacher.x86_64'
many_arm_file = '/home/robert/RL_ubuntu/udacity_rlnd_cont_cntl/Reacher_Linux (2)/Reacher_Linux/Reacher.x86_64'

seed = 33

### 2. Instantiate the Environment and Agent

In [3]:
from unityagents import UnityEnvironment
# one arm
# env = UnityEnvironment(file_name=one_arm_file, seed=seed)

# many arms
env = UnityEnvironment(file_name=many_arm_file)

# get the default brain
brain_name = env.brain_names[0]
brain = env.brains[brain_name]

# reset the environment
env_info = env.reset(train_mode=True)[brain_name]

# number of agents
num_agents = len(env_info.agents)

# size of each action
action_size = brain.vector_action_space_size

# examine the state space
states = env_info.vector_observations
state_size = states.shape[1]

# create the DDPG agent, passing the seed for its random number generators
agent = Agent(state_size=state_size, action_size=action_size, random_seed=seed)

INFO:unityagents:
'Academy' started successfully!
Unity Academy name: Academy
        Number of Brains: 1
        Number of External Brains : 1
        Lesson number : 0
        Reset Parameters :
		goal_speed -> 1.0
		goal_size -> 5.0
Unity brain name: ReacherBrain
        Number of Visual Observations (per agent): 0
        Vector Observation space type: continuous
        Vector Observation space size (per agent): 33
        Number of stacked Vector Observation: 1
        Vector Action space type: continuous
        Vector Action space size (per agent): 4
        Vector Action descriptions: , , , 
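
The brain summary above reports 33 observation values and 4 continuous actions per agent. The following sketch, which does not need Unity, shows the array shapes the training loop works with. The agent count of 20 is an assumption for illustration (the notebook reads the real value from `env_info.agents`), and the random actions and constant rewards are placeholders, not output from the environment.

```python
import numpy as np

num_agents = 20    # assumption for illustration; the notebook uses len(env_info.agents)
state_size = 33    # "Vector Observation space size (per agent)" from the log above
action_size = 4    # "Vector Action space size (per agent)" from the log above

rng = np.random.default_rng(33)

# One row per agent: this is the layout of env_info.vector_observations
states = np.zeros((num_agents, state_size))

# A random policy: one 4-value action per agent, clipped to the range [-1, 1]
actions = np.clip(rng.standard_normal((num_agents, action_size)), -1, 1)

# Per-agent running score, updated with one reward per agent at each step
scores = np.zeros(num_agents)
rewards = [0.01] * num_agents   # placeholder rewards, not real environment output
scores += rewards

print(states.shape, actions.shape, scores.shape)
```

Because every array carries one row (or entry) per arm, a single `agent.act(states)` call can produce the actions for all arms at once.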


### 3. Train the Agent with DDPG

Run the code cell below to train the agent from scratch. To resume training from previously saved weights, call `ddpg(from_checkpoint=True)` instead. To skip training entirely, go to Section 4, which loads the saved checkpoint files.

In [4]:
def ddpg(n_episodes=1000, max_t=10000, from_checkpoint=False):
    if from_checkpoint:
        agent.actor_local.load_state_dict(torch.load('checkpoint_actor1.pth'))
        agent.critic_local.load_state_dict(torch.load('checkpoint_critic1.pth'))    
    scores_deque = deque(maxlen=100)
    scores_g = []
    for i_episode in range(1, n_episodes+1):
        env_info = env.reset(train_mode=True)[brain_name]      # reset the environment in training mode
        states = env_info.vector_observations                  # get the current state (for each agent)
        scores = np.zeros(num_agents)                          # initialize the score (for each agent)
        agent.reset()
        for t in range(max_t):
            actions = agent.act(states)
            env_info = env.step(actions)[brain_name]           # send all actions to the environment
            next_states = env_info.vector_observations         # get next state (for each agent)
            rewards = env_info.rewards                         # get reward (for each agent)
            dones = env_info.local_done                        # see if episode finished
            agent.step(states, actions, rewards, next_states, dones)
            states = next_states
            scores += rewards                                  # update the score (for each agent)
            if np.any(dones):
                break
        score = np.mean(scores)
        scores_deque.append(score)
        scores_g.append(score)
        print('\rEpisode {}\tAverage Score: {:.2f}\tScore: {:.2f}'.format(i_episode, np.mean(scores_deque), score), end="")
        if i_episode % 10 == 0:
            torch.save(agent.actor_local.state_dict(), 'checkpoint_actor.pth')
            torch.save(agent.critic_local.state_dict(), 'checkpoint_critic.pth')
            print('\rEpisode {}\tAverage Score: {:.2f}'.format(i_episode, np.mean(scores_deque)))   
        if len(scores_deque) == 100 and np.mean(scores_deque) > 30:
            torch.save(agent.actor_local.state_dict(), 'final_actor.pth')
            torch.save(agent.critic_local.state_dict(), 'final_critic.pth')
            print('\nEnvironment solved in {:d} episodes!\tAverage Score: {:.2f}'.format(i_episode-100, np.mean(scores_deque)))
            break                                              # stop training once the environment is solved
    return scores_g                                            # per-episode mean scores, one entry per episode

scores = ddpg()
#scores = ddpg(from_checkpoint=True)

fig = plt.figure()
ax = fig.add_subplot(111)
plt.plot(np.arange(1, len(scores)+1), scores)
plt.ylabel('Score')
plt.xlabel('Episode #')
plt.show()

Episode 10	Average Score: 2.66	Score: 4.05
Episode 20	Average Score: 4.00	Score: 5.63
Episode 30	Average Score: 4.92	Score: 6.13
Episode 40	Average Score: 5.53	Score: 7.08
Episode 50	Average Score: 5.18	Score: 1.72
Episode 60	Average Score: 4.63	Score: 2.23
Episode 70	Average Score: 4.20	Score: 1.53
Episode 80	Average Score: 3.82	Score: 1.41
Episode 90	Average Score: 3.49	Score: 0.91
Episode 100	Average Score: 3.23	Score: 0.90
Episode 110	Average Score: 3.02	Score: 0.72
Episode 120	Average Score: 2.55	Score: 0.45
Episode 130	Average Score: 1.92	Score: 0.72
Episode 140	Average Score: 1.25	Score: 0.41
Episode 150	Average Score: 0.92	Score: 0.47
Episode 153	Average Score: 0.88	Score: 0.26

KeyboardInterrupt: 
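
The training loop above treats the environment as solved once the mean of the last 100 episode scores exceeds 30. Below is a minimal sketch of that check as a standalone function that can be run on any saved list of per-episode scores. The function name and the example score lists are hypothetical, and the function only declares success once a full window of scores is available.

```python
from collections import deque


def first_solved_episode(episode_scores, target=30.0, window=100):
    """Return the 1-based episode at which the rolling mean first exceeds
    `target`, or None if that never happens."""
    recent = deque(maxlen=window)          # keeps only the last `window` scores
    for i_episode, score in enumerate(episode_scores, start=1):
        recent.append(score)
        if len(recent) == window and sum(recent) / window > target:
            return i_episode
    return None


# Hypothetical score histories, for illustration only
never_solved = [1.0] * 150
late_learner = [0.0] * 50 + [40.0] * 200

print(first_solved_episode(never_solved))
print(first_solved_episode(late_learner))
```

Requiring a full window avoids reporting a solve from a handful of lucky early episodes.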

In [None]:
env.close()  # run this only when you are finished; Section 4 below still needs the open environment

### 4. Watch a Smart Agent!

In the next code cell, you will load the trained weights from file to watch a smart agent!

In [None]:
agent.actor_local.load_state_dict(torch.load('checkpoint_actor.pth'))
agent.critic_local.load_state_dict(torch.load('checkpoint_critic.pth'))

env_info = env.reset(train_mode=False)[brain_name]     # reset the environment    
states = env_info.vector_observations                  # get the current state (for each agent)
scores = np.zeros(num_agents)                          # initialize the score (for each agent)
while True:
    actions =  agent.act(states)                       # select an action (for each agent)
    actions = np.clip(actions, -1, 1)                  # all actions between -1 and 1
    env_info = env.step(actions)[brain_name]           # send all actions to the environment
    next_states = env_info.vector_observations         # get next state (for each agent)
    rewards = env_info.rewards                         # get reward (for each agent)
    dones = env_info.local_done                        # see if episode finished
    scores += env_info.rewards                         # update the score (for each agent)
    states = next_states                               # roll over states to next time step
    if np.any(dones):                                  # exit loop if episode finished
        break
print('Total score (averaged over agents) this episode: {}'.format(np.mean(scores)))

### 5. Explore

In this exercise, we have provided a sample DDPG agent and demonstrated how to train it on the Unity Reacher environment. To continue your learning, you are encouraged to complete any (or all!) of the following tasks:
- Amend the various hyperparameters and network architecture to see if you can get your agent to solve the environment faster than this benchmark implementation. Once you build intuition for the hyperparameters that work well with this environment, try solving a different task, such as an OpenAI Gym environment!
- Write your own DDPG implementation.  Use this code as reference only when needed -- try as much as you can to write your own algorithm from scratch.
- You may also like to implement prioritized experience replay, to see if it speeds learning.  
- The current implementation adds Ornstein-Uhlenbeck noise to the action space. However, it has [been shown](https://blog.openai.com/better-exploration-with-parameter-noise/) that adding noise to the parameters of the neural network policy can improve performance. Make this change to the code, to verify it for yourself!
- Write a blog post explaining the intuition behind the DDPG algorithm and demonstrating how to use it to solve an RL environment of your choosing.  
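
The Ornstein-Uhlenbeck noise mentioned in the list above is a mean-reverting random process; successive samples are correlated, which is why it is often used for exploration with continuous actions. The notebook's own noise code lives in `ddpg_agent_con1` and is not shown here, so the class below is only a minimal sketch under assumed parameters (`theta=0.15`, `sigma=0.2`, Gaussian increments), not the author's implementation.

```python
import random


class OUNoise:
    """Discrete-time Ornstein-Uhlenbeck process (illustrative sketch)."""

    def __init__(self, size, mu=0.0, theta=0.15, sigma=0.2, seed=33):
        self.mu = [mu] * size          # long-run mean the process is pulled toward
        self.theta = theta             # strength of the pull back to the mean
        self.sigma = sigma             # scale of the random increments
        self.rng = random.Random(seed)
        self.reset()

    def reset(self):
        """Restart the process at its mean (e.g. at the start of an episode)."""
        self.state = list(self.mu)

    def sample(self):
        """Advance one step: x <- x + theta * (mu - x) + sigma * N(0, 1)."""
        self.state = [
            x + self.theta * (m - x) + self.sigma * self.rng.gauss(0.0, 1.0)
            for x, m in zip(self.state, self.mu)
        ]
        return list(self.state)


# Usage: perturb a hypothetical 4-value action, then clip it to [-1, 1]
noise = OUNoise(size=4)
action = [0.5, -0.9, 0.2, 1.0]
noisy_action = [max(-1.0, min(1.0, a + n)) for a, n in zip(action, noise.sample())]
print(noisy_action)
```

Switching to parameter noise, as the linked post suggests, would mean perturbing the actor network's weights instead of adding a term like this to its output.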