# Continuous Control - Project Submission

---

### 1. Start the Environment

Run the next code cell to install a few packages.  This line will take a few minutes to run!

In [1]:
!pip -q install ./python

tensorflow 1.7.1 has requirement numpy>=1.13.3, but you'll have numpy 1.12.1 which is incompatible.
ipython 6.5.0 has requirement prompt-toolkit<2.0.0,>=1.0.15, but you'll have prompt-toolkit 3.0.5 which is incompatible.


## Set up initial environment
The cell below instantiates the environment and sets some initial variables:

- brain_name: the name of the default brain, used to index the environment's responses
- action_size: the number of values in each action vector sent to the environment
- state_size: the number of values returned from the environment to represent the current state

In this case, the environment runs a single agent.

In [2]:
from unityagents import UnityEnvironment
import numpy as np

# select this option to load version 1 (with a single agent) of the environment
env = UnityEnvironment(file_name='/data/Reacher_One_Linux_NoVis/Reacher_One_Linux_NoVis.x86_64')

# select this option to load version 2 (with 20 agents) of the environment
# env = UnityEnvironment(file_name='/data/Reacher_Linux_NoVis/Reacher.x86_64')

# get the default brain
brain_name = env.brain_names[0]
brain = env.brains[brain_name]

# reset the environment
env_info = env.reset(train_mode=True)[brain_name]

# number of agents
num_agents = len(env_info.agents)
print('Number of agents:', num_agents)

# size of each action
action_size = brain.vector_action_space_size
print('Size of each action:', action_size)

# examine the state space 
states = env_info.vector_observations
state_size = states.shape[1]
print('There are {} agents. Each observes a state with length: {}'.format(states.shape[0], state_size))
print('The state for the first agent looks like:', states[0])


INFO:unityagents:
'Academy' started successfully!
Unity Academy name: Academy
        Number of Brains: 1
        Number of External Brains : 1
        Lesson number : 0
        Reset Parameters :
		goal_speed -> 1.0
		goal_size -> 5.0
Unity brain name: ReacherBrain
        Number of Visual Observations (per agent): 0
        Vector Observation space type: continuous
        Vector Observation space size (per agent): 33
        Number of stacked Vector Observation: 1
        Vector Action space type: continuous
        Vector Action space size (per agent): 4
        Vector Action descriptions: , , , 


Number of agents: 1
Size of each action: 4
There are 1 agents. Each observes a state with length: 33
The state for the first agent looks like: [  0.00000000e+00  -4.00000000e+00   0.00000000e+00   1.00000000e+00
  -0.00000000e+00  -0.00000000e+00  -4.37113883e-08   0.00000000e+00
   0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
   0.00000000e+00   0.00000000e+00  -1.00000000e+01   0.00000000e+00
   1.00000000e+00  -0.00000000e+00  -0.00000000e+00  -4.37113883e-08
   0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
   0.00000000e+00   0.00000000e+00   5.75471878e+00  -1.00000000e+00
   5.55726671e+00   0.00000000e+00   1.00000000e+00   0.00000000e+00
  -1.68164849e-01]


### Define networks and helper methods.

Define the Actor and Critic networks for Deep Deterministic Policy Gradient (DDPG), along with the helper methods used in training.
 - Actor Model:  Receives the environment state and outputs actions.  A basic fully connected network with ReLU activation on the two hidden layers and tanh activation on the output layer, since the environment expects action values between -1 and 1.
 - Critic Model:  Receives the environment state and actions as input and outputs the predicted value of taking those actions in that state (the action-value function, Q).
 - StepInfo class: defines one step of the replay buffer used for training.
 - reset_game - resets the environment and returns the initial states.
 - soft_update_target - updates the weights of a target model using another model as the source.  Weights are only updated fractionally, based upon the value of the input param ```tau```.
 - env_step - Performs a step upon the environment.  The actions are chosen by submitting the given state values to the actor model.  Epsilon is the probability of choosing a random action instead of using the model.  Values are stored as a StepInfo object added to the replay buffer.
 - getBatch - gets a tuple of values from the replay buffer for training.  Each member of the tuple is an array of random samples from the replay history with length equal to the desired batch size.  The same index across the returned arrays corresponds to the same sample from the replay history.
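
As a rough illustration of the fractional update performed by `soft_update_target`, the sketch below applies the same blend, `tau * local + (1 - tau) * target`, to plain Python lists instead of model parameters. The function name `soft_update_lists` and the example weights are made up for illustration; the notebook's own helper operates on `torch` parameters.

```python
def soft_update_lists(local_weights, target_weights, tau):
    """Return target weights moved a fraction tau toward the local weights."""
    return [tau * l + (1.0 - tau) * t for l, t in zip(local_weights, target_weights)]


# Hypothetical weights, chosen only to make the arithmetic easy to follow.
local = [1.0, -2.0, 0.5]
target = [0.0, 0.0, 0.0]

# With tau = 0.001 (the value used in the training cell) each update moves
# the target only 0.1% of the way toward the local weights.
target = soft_update_lists(local, target, tau=0.001)
print(target)
```

A small `tau` means the target networks trail the local networks slowly, which is the usual reason for keeping separate target copies in the first place.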


In [3]:
# define networks
import torch
import torch.nn.functional as F
import torch.optim as optim
import copy
import random

# actor - takes in a state and outputs a deterministic action vector with values in [-1, 1]
class ActorModel(torch.nn.Module):
    def __init__(self, state_size, action_size):
        super(ActorModel, self).__init__()
        self.state_size   = state_size
        self.action_size = action_size

        self.fc1 = torch.nn.Linear(state_size, 256)
        self.fc2 = torch.nn.Linear(256, 256)
        self.out = torch.nn.Linear(256, action_size)
        
    def forward(self, states):
        x = F.relu(self.fc1(states))
        x = F.relu(self.fc2(x))
        # tanh keeps each action component in [-1, 1]; torch.tanh replaces the deprecated F.tanh
        x = torch.tanh(self.out(x))
        return x

# critic - takes in a state AND actions - outputs the action value Q(s, a)
class CriticModel(torch.nn.Module):
    def __init__(self, state_size, action_size):
        super(CriticModel, self).__init__()
        self.state_size   = state_size
        self.action_size = action_size

        self.fc1 = torch.nn.Linear(state_size, 256)
        self.fc2 = torch.nn.Linear(256+action_size, 256)
        self.fc3 = torch.nn.Linear(256, 128)
        self.out = torch.nn.Linear(128, 1)
        
    def forward(self, states, actions):
        batch_size = states.size(0)
        xs = F.leaky_relu(self.fc1(states))
        x = torch.cat((xs, actions), dim=1) #add in actions to the network
        x = F.leaky_relu(self.fc2(x))
        x = F.leaky_relu(self.fc3(x))
        x = self.out(x)
        return x
    
class StepInfo:
    def __init__(self, step_number, states, actions, rewards, next_states, dones):
        self.step_number = step_number
        self.states = states
        self.actions = actions
        self.rewards = rewards
        self.next_states = next_states
        self.dones = dones
        
    def __str__(self):
        return "step_number: {},  states: {},  actions: {},  rewards: {},  next_states: {}".format(self.step_number, self.states, self.actions, self.rewards, self.next_states)

def reset_game(in_env, brain_name):
    # **important note** When training the environment, set `train_mode=True`
    env_info = in_env.reset(train_mode=True)[brain_name]      # reset the environment    
    states = env_info.vector_observations
    return states

def env_step(in_env, brain_name, states, replay_buffer, actor_model, epsilon, add_noise=True):
    # Perform a single environment step and append it to the replay_buffer
    if (len(replay_buffer) > 0):
        step_number = replay_buffer[-1].step_number + 1
    else:
        step_number = 0

    state_tensor = torch.from_numpy(states).float().cuda()

    rand_num = random.uniform(0, 1)
    if rand_num > epsilon:
        actor_model.eval()
        with torch.no_grad():
            actions_tensor = actor_model(state_tensor)
        actor_model.train()
        actions_np = actions_tensor.detach().cpu().numpy()
    else:
        # select a random action (for each agent), clipped to the valid range of [-1, 1]
        actions_np = np.clip(np.random.randn(num_agents, action_size), -1, 1)

    env_info = in_env.step(actions_np)[brain_name]         # send all actions to the environment
    next_states = env_info.vector_observations          # get next state (for each agent)
    rewards = env_info.rewards                          # get reward (for each agent)
    dones = env_info.local_done                         # see if episode finished

    this_step_info = StepInfo(step_number, states, actions_np, rewards, next_states, dones)
    replay_buffer.append(this_step_info)

    return next_states
            
def soft_update_target(local_model, target_model, tau):
    for target_param, local_param in zip(target_model.parameters(), local_model.parameters()):
        target_param.data.copy_(tau*local_param.data + (1.0-tau)*target_param.data)
        
def getBatch(replay_buffer, batch_size):
    return_states = np.zeros((batch_size, state_size))
    return_actions = np.zeros((batch_size, action_size))
    return_rewards = np.zeros((batch_size, 1))
    return_next_states = np.zeros((batch_size, state_size))
    return_next_actions = np.zeros((batch_size, action_size))
    
#     print("replay_buffer[0].states.shape: {}".format(replay_buffer[0].states.shape))
#     print("replay_buffer[0].rewards[0]: {}".format(replay_buffer[0].rewards[0]))
#     print("replay_buffer[0].actions.shape: {}".format(replay_buffer[0].actions.shape))
#     print("replay_buffer[0].next_states.shape: {}".format(replay_buffer[0].next_states.shape))

    for i in range(batch_size):
        rand_frame_index = random.randint(0,len(replay_buffer)-2)
        return_states[i] = replay_buffer[rand_frame_index].states[0]
        return_actions[i] = replay_buffer[rand_frame_index].actions[0]
        return_rewards[i] = replay_buffer[rand_frame_index].rewards[0]
        return_next_states[i] = replay_buffer[rand_frame_index].next_states[0]
        return_next_actions[i] = replay_buffer[rand_frame_index+1].actions[0]
        #### TODO - make sure "next" actions don't roll over onto the next  playthrough.
        
    return return_states, return_actions, return_rewards, return_next_states, return_next_actions

### Instantiate Objects

Objects in the next cell are used in the training loop.  These are stored in a separate 
cell to allow the training loop to be run multiple times without resetting the objects that 
need to be continuously updated.

Relevant hyperparameter values were chosen through a painful amount of trial-and-error.
 - buffer_length - 100000 seems like a big number, but it seems to work.
 - lr_actor / lr_critic - larger learning rates proved unstable.
 - weight_decay - a recommendation from a friend doing the same project.
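
The replay buffer in the next cell is a `collections.deque` created with `maxlen=buffer_length`, so once it is full each newly appended step silently evicts the oldest one. A minimal sketch of that behaviour, using a tiny buffer of 3 integers in place of 100000 `StepInfo` objects (the name `tiny_buffer` is illustrative only):

```python
from collections import deque

# A bounded buffer: appending to a full deque drops the item at the left end.
tiny_buffer = deque(maxlen=3)
for step in range(5):
    tiny_buffer.append(step)

print(list(tiny_buffer))  # [2, 3, 4]
print(tiny_buffer[-1])    # most recent entry, as env_step reads via replay_buffer[-1]
```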

In [4]:
from collections import deque

# instantiate objects that can be re-used
buffer_length = 100000
replay_buffer = deque(maxlen=buffer_length)

actor_model_local   = ActorModel(state_size, action_size).cuda()
actor_model_target  = ActorModel(state_size, action_size).cuda()
critic_model_local  = CriticModel(state_size, action_size).cuda()
critic_model_target = CriticModel(state_size, action_size).cuda()

if True:  # set to false if running for the first time or fresh models are desired.
    actor_model_local.load_state_dict(torch.load("actor_model_local.pt"))
    actor_model_target.load_state_dict(torch.load("actor_model_target.pt"))
    critic_model_local.load_state_dict(torch.load("critic_model_local.pt"))
    critic_model_target.load_state_dict(torch.load("critic_model_target.pt"))

lr_actor = .0002
lr_critic = .0002
weight_decay = 0.0
actor_optimizer = optim.Adam(actor_model_local.parameters(), lr=lr_actor)
critic_optimizer = optim.Adam(critic_model_local.parameters(), lr=lr_critic, weight_decay=weight_decay)

### Train the model

Training proceeds as follows:
 - Perform a step on the environment, saving data to the replay buffer.
 - Once the replay buffer holds at least 10000 steps, after each step is performed:
   - Grab a batch of random step data from the replay history.
   - Train the local critic and actor models.
   - Perform soft updates on the target critic and actor models.
 - Save the models when done.

This was run multiple times.  The printout below is a sample after the model had already been trained for about 200 epochs.
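
For reference, the critic's training target in DDPG is commonly written as `y = r + gamma * (1 - done) * Q_target(s', a')`, with `a'` produced by the target actor. The training cell in this notebook takes `a'` from the replay buffer and does not mask finished episodes, so the sketch below is not the notebook's code; it is a minimal pure-Python illustration of the masked target, with the function name `td_targets` and the sample numbers invented for the example.

```python
def td_targets(rewards, next_q_values, dones, gamma=0.99):
    """One-step TD targets; the bootstrap term is dropped when the episode ended."""
    return [
        r + gamma * (0.0 if done else q_next)
        for r, q_next, done in zip(rewards, next_q_values, dones)
    ]


# Hypothetical batch of three transitions; the last one ends an episode.
rewards = [0.0, 0.04, 0.04]
next_q_values = [3.0, 3.1, 2.9]
dones = [False, False, True]

targets = td_targets(rewards, next_q_values, dones)
print(targets)
```

For the terminal transition the target is just the reward, since there is no next state to bootstrap from.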


In [12]:
# train the model

epochs = 10 # 5
steps_per_epoch = 2000 
learning_iterations_per_step = 1 
gamma = 0.99
tau = 0.001
batch_size = 128
epsilon_decay_batches = 50
epsilon_max = 0.05
epsilon_min = 0.0
current_state = reset_game(env, brain_name)

for epoch in range(epochs):
    total_actor_loss = 0.0
    total_critic_loss = 0.0    
    total_rewards = 0
    epsilon = max(epsilon_min, epsilon_max-epoch/epsilon_decay_batches)
    
    for game_step in range(steps_per_epoch):
        #play a game
        #total_rewards = play_game(env, brain_name, replay_buffer, actor_model_local)
        current_state = env_step(env, brain_name, current_state, replay_buffer, actor_model_local, epsilon, add_noise=True)
        total_rewards += np.sum(replay_buffer[-1].rewards)
        
        #if the game is done, reset and continue
        if np.any(replay_buffer[-1].dones):
            # new game
            current_state = reset_game(env, brain_name)
            current_state = env_step(env, brain_name, current_state, replay_buffer, actor_model_local, epsilon, add_noise=True)
            total_rewards += np.sum(replay_buffer[-1].rewards)
        
        if len(replay_buffer) < 10000:
            continue  
    
        #do some learning
        for learning_iteration in range(learning_iterations_per_step):
            states, actions, rewards, next_states, next_actions = getBatch(replay_buffer, batch_size)

            # convert to tensors for input into the models.
            rewards_tensor = torch.from_numpy(rewards).float().cuda()
            next_actions_tensor = torch.from_numpy(next_actions).float().cuda()
            next_states_tensor = torch.from_numpy(next_states).float().cuda()
            states_tensor = torch.from_numpy(states).float().cuda()
            actions_tensor = torch.from_numpy(actions).float().cuda()
            
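            # ---------------------------- update critic ---------------------------- #
            # Compute Q targets for current states (y_i).
            # Note: the next actions come from the replay buffer (the action actually taken
            # on the following step), not from actor_model_target, and dones are not masked.
            # detach() keeps gradients from flowing into the target critic.
            Q_targets_next = critic_model_target(next_states_tensor, next_actions_tensor).detach()
            Q_targets = rewards_tensor + (gamma * Q_targets_next)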
                
            # Compute critic loss
            Q_expected = critic_model_local(states_tensor, actions_tensor)
            critic_loss = F.mse_loss(Q_expected, Q_targets)
            
            # Minimize the critic loss
            critic_optimizer.zero_grad()
            critic_loss.backward()
            torch.nn.utils.clip_grad_norm_(critic_model_local.parameters(), 1)
            critic_optimizer.step()
            total_critic_loss += critic_loss.item()

            # ---------------------------- update actor ---------------------------- #
            # Compute actor loss
            actions_pred = actor_model_local(states_tensor)
            actor_loss = -critic_model_local(states_tensor, actions_pred).mean()

            # Minimize the actor loss
            actor_optimizer.zero_grad()
            actor_loss.backward()
            actor_optimizer.step()
            total_actor_loss += actor_loss.item()

            # ----------------------- update target networks ----------------------- #
            #use very small Tau and update with every step
            soft_update_target(critic_model_local, critic_model_target, tau)
            soft_update_target(actor_model_local, actor_model_target, tau)
            
    print("epoch: {} - epsilon: {:.3f}, total_rewards: {:.3f}, critic_loss: {:.8f}, actor_loss: {:.6f}".format(epoch, epsilon, total_rewards, total_critic_loss/learning_iterations_per_step/steps_per_epoch, total_actor_loss/learning_iterations_per_step/steps_per_epoch))
    
torch.save(actor_model_local.state_dict(),  "actor_model_local.pt")
torch.save(actor_model_target.state_dict(), "actor_model_target.pt")
torch.save(critic_model_local.state_dict(), "critic_model_local.pt")
torch.save(critic_model_target.state_dict(),"critic_model_target.pt")


epoch: 0 - epsilon: 0.050, total_rewards: 42.420, critic_loss: 0.00156044, actor_loss: -3.061823
epoch: 1 - epsilon: 0.030, total_rewards: 57.680, critic_loss: 0.00168753, actor_loss: -3.019857
epoch: 2 - epsilon: 0.010, total_rewards: 51.430, critic_loss: 0.00189521, actor_loss: -3.045541
epoch: 3 - epsilon: 0.000, total_rewards: 46.930, critic_loss: 0.00184486, actor_loss: -3.055374
epoch: 4 - epsilon: 0.000, total_rewards: 65.500, critic_loss: 0.00196355, actor_loss: -3.084221
epoch: 5 - epsilon: 0.000, total_rewards: 66.210, critic_loss: 0.00223274, actor_loss: -3.121827
epoch: 6 - epsilon: 0.000, total_rewards: 57.790, critic_loss: 0.00247114, actor_loss: -3.147329
epoch: 7 - epsilon: 0.000, total_rewards: 48.800, critic_loss: 0.00250819, actor_loss: -3.118186
epoch: 8 - epsilon: 0.000, total_rewards: 64.550, critic_loss: 0.00245082, actor_loss: -3.132777
epoch: 9 - epsilon: 0.000, total_rewards: 47.650, critic_loss: 0.00268918, actor_loss: -3.132944
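
The epsilon column in the log above follows the linear decay in the training cell, `max(epsilon_min, epsilon_max - epoch/epsilon_decay_batches)`. The sketch below recomputes it in isolation with the same constants; the function name `epsilon_for_epoch` is an illustrative assumption and not part of the notebook.

```python
def epsilon_for_epoch(epoch, epsilon_max=0.05, epsilon_min=0.0, decay_batches=50):
    """Linear decay: drop by 1/decay_batches per epoch, floored at epsilon_min."""
    return max(epsilon_min, epsilon_max - epoch / decay_batches)


for epoch in range(5):
    print("epoch {}: epsilon {:.3f}".format(epoch, epsilon_for_epoch(epoch)))
```

With these constants the random-action probability reaches zero by epoch 3, so most of the run shown above is purely greedy with respect to the actor.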


### Play a series of games to see how the model performs.  

The score is printed after each game is done.

In [5]:
actor_model_local.load_state_dict(torch.load("actor_model_local.pt"))
actor_model_target.load_state_dict(torch.load("actor_model_target.pt"))
critic_model_local.load_state_dict(torch.load("critic_model_local.pt"))
critic_model_target.load_state_dict(torch.load("critic_model_target.pt"))
    
epsilon = 0.0
total_game_count = 100
total_rewards_over_all_games = 0

for game_count in range(total_game_count):
    total_rewards = 0.0
    current_state = reset_game(env, brain_name)
    replay = []
    
    for i in range(2000):
        current_state = env_step(env, brain_name, current_state, replay, actor_model_local, epsilon, add_noise=True)
        total_rewards += np.sum(replay[-1].rewards)
        total_rewards_over_all_games += np.sum(replay[-1].rewards)
        if np.any(replay[-1].dones):
            break
        
    print("game {}, total_rewards: {:.3f}".format(game_count+1, total_rewards))
    
print("Average reward over {} games: {:.3f}".format(total_game_count, total_rewards_over_all_games/total_game_count))

game 1, total_rewards: 37.760
game 2, total_rewards: 39.120
game 3, total_rewards: 35.020
game 4, total_rewards: 37.940
game 5, total_rewards: 34.610
game 6, total_rewards: 39.290
game 7, total_rewards: 38.300
game 8, total_rewards: 38.490
game 9, total_rewards: 38.270
game 10, total_rewards: 34.310
game 11, total_rewards: 34.340
game 12, total_rewards: 32.500
game 13, total_rewards: 13.660
game 14, total_rewards: 31.010
game 15, total_rewards: 39.040
game 16, total_rewards: 36.240
game 17, total_rewards: 38.430
game 18, total_rewards: 27.960
game 19, total_rewards: 36.820
game 20, total_rewards: 37.040
game 21, total_rewards: 26.520
game 22, total_rewards: 37.400
game 23, total_rewards: 37.150
game 24, total_rewards: 37.760
game 25, total_rewards: 35.870
game 26, total_rewards: 36.480
game 27, total_rewards: 33.940
game 28, total_rewards: 37.030
game 29, total_rewards: 33.470
game 30, total_rewards: 31.530
game 31, total_rewards: 32.720
game 32, total_rewards: 37.410
game 33, total_re