# Navigation

---

In this notebook, you will learn how to use the Unity ML-Agents environment for the first project of the [Deep Reinforcement Learning Nanodegree](https://www.udacity.com/course/deep-reinforcement-learning-nanodegree--nd893).

### 1. Start the Environment

We begin by importing some necessary packages.  If the code cell below returns an error, please revisit the project instructions to double-check that you have installed [Unity ML-Agents](https://github.com/Unity-Technologies/ml-agents/blob/master/docs/Installation.md) and [NumPy](http://www.numpy.org/).

In [1]:
from unityagents import UnityEnvironment
import numpy as np
import operator
import tensorflow as tf
from tensorflow.python import debug as tf_debug

Next, we will start the environment!  **_Before running the code cell below_**, change the `file_name` parameter to match the location of the Unity environment that you downloaded.

- **Mac**: `"path/to/Banana.app"`
- **Windows** (x86): `"path/to/Banana_Windows_x86/Banana.exe"`
- **Windows** (x86_64): `"path/to/Banana_Windows_x86_64/Banana.exe"`
- **Linux** (x86): `"path/to/Banana_Linux/Banana.x86"`
- **Linux** (x86_64): `"path/to/Banana_Linux/Banana.x86_64"`
- **Linux** (x86, headless): `"path/to/Banana_Linux_NoVis/Banana.x86"`
- **Linux** (x86_64, headless): `"path/to/Banana_Linux_NoVis/Banana.x86_64"`

For instance, if you are using a Mac, then you downloaded `Banana.app`.  If this file is in the same folder as the notebook, then the line below should appear as follows:
```
env = UnityEnvironment(file_name="Banana.app")
```

In [2]:
env = UnityEnvironment(file_name="Banana_Linux/Banana.x86_64")

I0909 19:17:47.966329 140205682263872 environment.py:105] 
'Academy' started successfully!
Unity Academy name: Academy
        Number of Brains: 1
        Number of External Brains : 1
        Lesson number : 0
        Reset Parameters :
		
Unity brain name: BananaBrain
        Number of Visual Observations (per agent): 0
        Vector Observation space type: continuous
        Vector Observation space size (per agent): 37
        Number of stacked Vector Observation: 1
        Vector Action space type: discrete
        Vector Action space size (per agent): 4
        Vector Action descriptions: , , , 


Environments contain **_brains_**, which are responsible for deciding the actions of their associated agents. Here we check for the first brain available and set it as the default brain we will be controlling from Python.

In [3]:
# get the default brain
brain_name = env.brain_names[0]
brain = env.brains[brain_name]

### 2. Examine the State and Action Spaces

The simulation contains a single agent that navigates a large environment.  At each time step, it has four actions at its disposal:
- `0` - walk forward 
- `1` - walk backward
- `2` - turn left
- `3` - turn right

The state space has `37` dimensions and contains the agent's velocity, along with ray-based perception of objects around the agent's forward direction.  A reward of `+1` is provided for collecting a yellow banana, and a reward of `-1` is provided for collecting a blue banana.

Run the code cell below to print some information about the environment.

In [4]:
# reset the environment
env_info = env.reset(train_mode=True)[brain_name]

# number of agents in the environment
print('Number of agents:', len(env_info.agents))

# number of actions
action_size = brain.vector_action_space_size
print('Number of actions:', action_size)

# examine the state space 
state = env_info.vector_observations[0]
print('States look like:', state)
state_size = len(state)
print('States have length:', state_size)

Number of agents: 1
Number of actions: 4
States look like: [1.         0.         0.         0.         0.84408134 0.
 0.         1.         0.         0.0748472  0.         1.
 0.         0.         0.25755    1.         0.         0.
 0.         0.74177343 0.         1.         0.         0.
 0.25854847 0.         0.         1.         0.         0.09355672
 0.         1.         0.         0.         0.31969345 0.
 0.        ]
States have length: 37


### 3. Take Random Actions in the Environment

In the next code cell, you will learn how to use the Python API to control the agent and receive feedback from the environment.

Once this cell is executed, you can watch how the agent performs when it selects an action uniformly at random at each time step.  A window should pop up that lets you observe the agent as it moves through the environment.

Of course, as part of the project, you'll have to change the code so that the agent is able to use its experience to gradually choose better actions when interacting with the environment!

In [5]:
env_info = env.reset(train_mode=False)[brain_name] # reset the environment
state = env_info.vector_observations[0]            # get the current state
score = 0                                          # initialize the score
while True:
    action = np.random.randint(action_size)        # select an action
    env_info = env.step(action)[brain_name]        # send the action to the environment
    next_state = env_info.vector_observations[0]   # get the next state
    reward = env_info.rewards[0]                   # get the reward
    done = env_info.local_done[0]                  # see if episode has finished
    score += reward                                # update the score
    state = next_state                             # roll over the state to next time step
    if done:                                       # exit loop if episode finished
        break
    
print("Score: {}".format(score))

Score: 1.0
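
One common way to move from the purely random policy above toward an agent that uses its experience is epsilon-greedy selection: act randomly with probability `epsilon`, otherwise pick the action with the highest estimated value. The sketch below is only an illustration under assumptions: `q_values` is a hypothetical list of value estimates (one per action) and `epsilon_greedy` is a helper name introduced here, not part of the Unity ML-Agents API.

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))  # explore
    # exploit: index of the largest estimated value
    return max(range(len(q_values)), key=lambda a: q_values[a])

# Hypothetical value estimates for the four actions
# (0: forward, 1: backward, 2: turn left, 3: turn right).
q_values = [0.2, -0.1, 0.7, 0.4]
print(epsilon_greedy(q_values, epsilon=0.0))  # prints 2 (always greedy)
```

With `epsilon=1.0` the helper behaves like the random policy in the cell above; lowering `epsilon` over time shifts the agent from exploring to exploiting.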


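When finished, you can close the environment. The call is left commented out here because the same environment is reused by the training cells later in this notebook.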

In [6]:
#env.close()

### 4. It's Your Turn!

Now it's your turn to train your own agent to solve the environment!  When training the agent, set `train_mode=True`, so that the line for resetting the environment looks like the following:
```python
env_info = env.reset(train_mode=True)[brain_name]
```

In [7]:
class QNetwork:

    def __init__(self, state_size, action_size, learning_rate, name):
        self.state_size = state_size
        self.action_size = action_size
        self.learning_rate = learning_rate
        self.name = name

        with tf.variable_scope(self.name):
            self.inputs_ = tf.placeholder(tf.float32, [None, *state_size], name='inputs')
            self.IS_weights_ = tf.placeholder(tf.float32, [None, 1], name='IS_weights')
            self.actions_ = tf.placeholder(tf.float32, [None, action_size], name='actions')
            self.target_Q = tf.placeholder(tf.float32, [None], name='target_Q')

            self.value_fc = tf.layers.dense(
                inputs=self.inputs_,
                units=512,
                activation=tf.nn.elu,
                kernel_initializer=tf.contrib.layers.xavier_initializer(),
                name='value_fc'
            )

            self.value = tf.layers.dense(
                inputs=self.value_fc,
                units=1,
                activation=None,
                kernel_initializer=tf.contrib.layers.xavier_initializer(),
                name='value'
            )

            self.advantage_fc = tf.layers.dense(
                inputs=self.inputs_,
                units=512,
                activation=tf.nn.elu,
                kernel_initializer=tf.contrib.layers.xavier_initializer(),
                name='advantage_fc'
            )

            self.advantage = tf.layers.dense(
                inputs=self.advantage_fc,
                units=action_size,
                activation=None,
                kernel_initializer=tf.contrib.layers.xavier_initializer(),
                name='advantage'
            )

            # Dueling aggregation: Q(s, a) = V(s) + (A(s, a) - mean over a of A(s, a)).
            self.output = self.value + tf.subtract(
                self.advantage,
                tf.reduce_mean(self.advantage, axis=1, keepdims=True))
            self.Q = tf.reduce_sum(tf.multiply(self.output, self.actions_), axis=1)
            self.absolute_errors = tf.abs(self.target_Q - self.Q)
            self.loss = tf.reduce_mean(self.IS_weights_ * tf.squared_difference(self.target_Q, self.Q))
            self.optimizer = tf.train.RMSPropOptimizer(self.learning_rate).minimize(self.loss)
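
The `output` tensor in the network above combines a scalar state value with per-action advantages, subtracting the mean advantage so that the two streams are identifiable (a dueling architecture). As a rough, framework-free sketch of that single step, the hypothetical helper `dueling_q_values` below does the same arithmetic on plain Python lists; the numbers are made up for illustration.

```python
def dueling_q_values(value, advantages):
    """Combine a scalar state value with per-action advantages."""
    mean_advantage = sum(advantages) / len(advantages)
    return [value + (a - mean_advantage) for a in advantages]

# Made-up numbers: V(s) = 1.0 and four advantages, one per action.
q = dueling_q_values(1.0, [1.0, 2.0, 3.0, 6.0])
print(q)  # [-1.0, 0.0, 1.0, 4.0]
```

Because the mean advantage is removed, the average of the resulting Q-values equals the state value, here `1.0`.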

In [8]:
class ReplayBuffer:
    def __init__(self, size):
        assert size > 0
        self._storage = []
        self._maxsize = size
        self._next_idx = 0

    def __len__(self):
        return len(self._storage)

    def add(self, obs_t, action, reward, obs_tp1, done):
        data = (obs_t, action, reward, obs_tp1, done)

        if self._next_idx >= len(self._storage):
            self._storage.append(data)
        else:
            self._storage[self._next_idx] = data
        self._next_idx += 1
        if self._next_idx == self._maxsize:
            self._next_idx = 0

    def _encode_sample(self, idxes):
        obses_t, actions, rewards, obses_tp1, dones = [], [], [], [], []
        for i in idxes:
            data = self._storage[i]
            obs_t, action, reward, obs_tp1, done = data
            obses_t.append(np.asarray(obs_t))      # no copy when already an ndarray
            actions.append(action)
            rewards.append(reward)
            obses_tp1.append(np.asarray(obs_tp1))  # no copy when already an ndarray
            dones.append(done)
        return np.array(obses_t), np.array(actions), np.array(rewards), np.array(obses_tp1), np.array(dones)

    def sample(self, batch_size):
        n = len(self._storage)
        idxes = [np.random.randint(n) for _ in range(batch_size)]
        return self._encode_sample(idxes)


class PrioritizedReplayBuffer(ReplayBuffer):
    def __init__(self, size, alpha):
        super(PrioritizedReplayBuffer, self).__init__(size)
        assert alpha >= 0
        self._alpha = alpha

        it_capacity = 1
        while it_capacity < size:
            it_capacity *= 2

        self._it_sum = SumSegmentTree(it_capacity)
        self._it_min = MinSegmentTree(it_capacity)
        self._it_max = MaxSegmentTree(it_capacity)

    def add(self, *args, **kwargs):
        idx = self._next_idx
        super(PrioritizedReplayBuffer, self).add(*args, **kwargs)

        max_priority = self._it_max.max()
        if max_priority <= 0:
            max_priority = 1.0

        self._it_sum[idx] = self._it_min[idx] = self._it_max[idx] = max_priority

    def _sample_proportional(self, batch_size):
        res = []
        p_total = self._it_sum.sum()
        every_range_len = p_total / batch_size
        for i in range(batch_size):
            mass = np.random.rand() * every_range_len + i * every_range_len
            idx = self._it_sum.find_prefixsum_idx(mass)
            res.append(idx)
        return res

    def sample(self, batch_size, beta):
        assert beta > 0

        idxes = self._sample_proportional(batch_size)

        weights = []
        p_sum = self._it_sum.sum()
        p_min = self._it_min.min() / p_sum
        n = len(self._storage)
        max_weight = (p_min * n) ** (-beta)

        for idx in idxes:
            p_sample = self._it_sum[idx] / p_sum
            weight = (p_sample * n) ** (-beta)
            weights.append(weight / max_weight)
        weights = np.array(weights)
        encoded_sample = self._encode_sample(idxes)
        return tuple(list(encoded_sample) + [weights, idxes])

    def update_priorities(self, idxes, priorities):
        assert len(idxes) == len(priorities)
        n = len(self._storage)
        for idx, priority in zip(idxes, priorities):
            priority += 0.01
            assert priority > 0
            assert 0 <= idx < n
            self._it_sum[idx] = \
                self._it_min[idx] = \
                self._it_max[idx] = min(1.0, priority) ** self._alpha
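
The `sample` method above corrects for non-uniform sampling with importance-sampling weights: each sampled transition gets `(N * P(i)) ** (-beta)`, divided by the largest possible weight so that all weights lie in `(0, 1]`. The following is a minimal sketch of that arithmetic on a plain list of priorities; `importance_weights` is a hypothetical helper written for this illustration and it scans the list instead of using segment trees.

```python
def importance_weights(priorities, sampled_idxes, beta):
    """Normalized importance-sampling weights for the sampled indices."""
    total = sum(priorities)
    n = len(priorities)
    # The largest weight belongs to the least likely transition.
    max_weight = (min(priorities) / total * n) ** (-beta)
    weights = []
    for idx in sampled_idxes:
        p_sample = priorities[idx] / total
        weights.append((p_sample * n) ** (-beta) / max_weight)
    return weights

# Three stored transitions; the third is twice as likely to be sampled,
# so its weight is roughly half that of the others when beta = 1.
print(importance_weights([1.0, 1.0, 2.0], [0, 2], beta=1.0))
```

As `beta` approaches `1`, the weights fully compensate for the sampling bias; smaller values of `beta` apply only a partial correction.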

In [9]:
class SegmentTree(object):
    def __init__(self, capacity, operation, neutral_element):
        assert capacity > 0 and capacity & (capacity - 1) == 0, "capacity must be positive and a power of 2"
        self._capacity = capacity
        self._value = [neutral_element for _ in range(2 * capacity)]
        self._operation = operation

    def _reduce_helper(self, start, end, node, node_start, node_end):
        if start == node_start and end == node_end:
            return self._value[node]
        mid = (node_start + node_end) // 2
        if end <= mid:
            return self._reduce_helper(start, end, 2 * node, node_start, mid)
        elif mid + 1 <= start:
            return self._reduce_helper(start, end, 2 * node + 1, mid + 1, node_end)
        else:
            return self._operation(
                self._reduce_helper(start, mid, 2 * node, node_start, mid),
                self._reduce_helper(mid + 1, end, 2 * node + 1, mid + 1, node_end)
            )

    def reduce(self, start=0, end=None):
        if end is None:
            end = self._capacity
        if end < 0:
            end += self._capacity
        end -= 1
        return self._reduce_helper(start, end, 1, 0, self._capacity - 1)

    def __setitem__(self, idx, val):
        idx += self._capacity
        self._value[idx] = val
        idx //= 2
        while idx >= 1:
            self._value[idx] = self._operation(
                self._value[2 * idx],
                self._value[2 * idx + 1]
            )
            idx //= 2

    def __getitem__(self, idx):
        assert 0 <= idx < self._capacity
        return self._value[self._capacity + idx]


class SumSegmentTree(SegmentTree):
    def __init__(self, capacity):
        super(SumSegmentTree, self).__init__(
            capacity=capacity,
            operation=operator.add,
            neutral_element=0.0
        )

    def sum(self, start=0, end=None):
        return super(SumSegmentTree, self).reduce(start, end)

    def find_prefixsum_idx(self, prefixsum):
        assert 0 <= prefixsum <= self.sum() + 1e-5
        idx = 1
        while idx < self._capacity:
            if self._value[2 * idx] > prefixsum:
                idx = 2 * idx
            else:
                prefixsum -= self._value[2 * idx]
                idx = 2 * idx + 1
        return idx - self._capacity

    def __setitem__(self, idx, val):
        assert val >= 0
        super(SumSegmentTree, self).__setitem__(idx, val)


class MinSegmentTree(SegmentTree):
    def __init__(self, capacity):
        super(MinSegmentTree, self).__init__(
            capacity=capacity,
            operation=min,
            neutral_element=float('inf')
        )

    def min(self, start=0, end=None):
        return super(MinSegmentTree, self).reduce(start, end)


class MaxSegmentTree(SegmentTree):
    def __init__(self, capacity):
        super(MaxSegmentTree, self).__init__(
            capacity=capacity,
            operation=max,
            neutral_element=float('-inf')
        )

    def max(self, start=0, end=None):
        return super(MaxSegmentTree, self).reduce(start, end)
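
To make the role of `find_prefixsum_idx` more concrete, here is a small reference sketch that does the same lookup with a linear scan, together with the stratified draw used by `_sample_proportional` in the prioritized buffer. It is meant as an aid for checking the tree-based version on small inputs, not as a replacement: the function names are hypothetical and the scan is O(n) per lookup rather than O(log n).

```python
import random

def find_prefixsum_idx_linear(priorities, prefixsum):
    """Linear-scan reference: first index whose cumulative sum exceeds prefixsum."""
    running = 0.0
    for i, p in enumerate(priorities):
        if running + p > prefixsum:
            return i
        running += p
    return len(priorities) - 1  # edge case: prefixsum equals the total

def sample_proportional(priorities, batch_size, rng=random):
    """Stratified sampling: one draw from each of batch_size equal mass ranges."""
    segment = sum(priorities) / batch_size
    idxes = []
    for i in range(batch_size):
        mass = (i + rng.random()) * segment
        idxes.append(find_prefixsum_idx_linear(priorities, mass))
    return idxes

print(find_prefixsum_idx_linear([1.0, 2.0, 3.0, 4.0], 2.9))  # 1
print(sample_proportional([1.0, 2.0, 3.0, 4.0], batch_size=4))
```

Indices with larger priorities cover a wider slice of the cumulative sum, so they are returned more often; an index with priority `0` is effectively never selected.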


In [10]:
class LinearSchedule:
    def __init__(self, schedule_timestamps, final_p, initial_p=1.0):
        self.schedule_timestamps = schedule_timestamps
        self.final_p = final_p
        self.initial_p = initial_p

    def value(self, t):
        fraction = min(1.0, float(t) / self.schedule_timestamps)
        return self.initial_p + fraction * (self.final_p - self.initial_p)
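
As a quick worked example of the schedule above, the standalone function below repeats the same interpolation and evaluates it with the settings used by the training cell later in this notebook (exploration from `1.0` to `0.01` over `20000` steps, `beta` from `0.4` to `1` over `6000` steps). `linear_value` is a hypothetical name used only for this sketch.

```python
def linear_value(t, duration, final_p, initial_p=1.0):
    """Linearly interpolate from initial_p to final_p over `duration` steps."""
    fraction = min(1.0, float(t) / duration)
    return initial_p + fraction * (final_p - initial_p)

# Exploration probability after 1000 steps.
print(round(linear_value(1000, 20000, 0.01), 4))      # 0.9505
# Importance-sampling exponent beta after 1000 steps.
print(round(linear_value(1000, 6000, 1.0, 0.4), 4))   # 0.5
# Past the end of the schedule the value stays at final_p.
print(round(linear_value(50000, 20000, 0.01), 4))     # 0.01
```

The first two values match the `explore_probability` and `beta` figures printed in the training log after the first 1000 steps.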



In [11]:
memory_size = 20000
memory = PrioritizedReplayBuffer(memory_size, 0.6)

possible_actions = np.identity(action_size, dtype=int).tolist()

env_info = env.reset(train_mode=True)[brain_name]
state = env_info.vector_observations[0]

pretrain_length = 20000
for i in range(pretrain_length):
    action = np.random.randint(action_size)
    env_info = env.step(action)[brain_name]
    reward = env_info.rewards[0]
    done = env_info.local_done[0]

    if done:
        next_state = np.zeros(state.shape)
    else:
        next_state = env_info.vector_observations[0]

    experience = state, action, reward, next_state, done
    memory.add(*experience)

    if done:
        env_info = env.reset(train_mode=True)[brain_name]
        state = env_info.vector_observations[0]
    else:
        state = next_state


In [12]:
tf.reset_default_graph()

learning_rate = 0.00025

DQNetwork = QNetwork([state_size], action_size, learning_rate, name='DQNetwork')

TargetNetwork = QNetwork([state_size], action_size, learning_rate, name='TargetNetwork')

W0909 19:18:12.990290 140205682263872 lazy_loader.py:50] 
The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
  * https://github.com/tensorflow/io (for I/O related ops)
If you depend on functionality not listed there, please file an issue.

W0909 19:18:12.990820 140205682263872 deprecation.py:323] From <ipython-input-7-388b859a4327>:20: dense (from tensorflow.python.layers.core) is deprecated and will be removed in a future version.
Instructions for updating:
Use keras.layers.dense instead.
W0909 19:18:13.284495 140205682263872 deprecation.py:506] From /home/kgdev/anaconda3/envs/drlnd/lib/python3.6/site-packages/tensorflow/python/training/rmsprop.py:119: calling Ones.__init__ (from tensorflow.python.ops.init_ops) with dtype is deprecated and will be removed in a future version.
Instructions for updating:
Cal

In [13]:
def update_target_graph():
    """Build ops that copy the online network's weights into the target network."""
    from_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, 'DQNetwork')
    to_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, 'TargetNetwork')
    op_holder = []
    for from_var, to_var in zip(from_vars, to_vars):
        op_holder.append(to_var.assign(from_var))
    return op_holder


def predict_action(explore_probability, state):
    """Epsilon-greedy choice: random action with probability `explore_probability`, else argmax Q."""
    tradeoff = np.random.rand()
    if explore_probability > tradeoff:
        action = np.random.randint(action_size)
    else:
        Qs = sess.run(DQNetwork.output, feed_dict={DQNetwork.inputs_: state.reshape((1, *state.shape))})
        action = np.argmax(Qs)
    return action
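
The training cell that follows builds its targets in the Double DQN style: the online network (`DQNetwork`) chooses the best next action and the target network (`TargetNetwork`) supplies the value of that action. As a hedged, framework-free sketch of the per-transition rule, the hypothetical helper below works on plain lists of next-state Q-values; the numbers are invented for illustration.

```python
def double_dqn_target(reward, done, q_online_next, q_target_next, gamma):
    """Bootstrap target for one transition.

    q_online_next / q_target_next: per-action values for the next state,
    from the online and the target network respectively.
    """
    if done:
        return reward  # no bootstrapping past a terminal state
    # The online network chooses the action...
    best_action = max(range(len(q_online_next)), key=lambda a: q_online_next[a])
    # ...and the target network evaluates it.
    return reward + gamma * q_target_next[best_action]

target = double_dqn_target(1.0, False,
                           [0.1, 0.9, 0.3, 0.2],   # online network prefers action 1
                           [1.0, 2.0, 3.0, 4.0],   # target network's estimates
                           gamma=0.95)
print(round(target, 2))  # 2.9
```

Note that the target uses the target network's estimate for action `1` (`2.0`) rather than its own maximum (`4.0`); decoupling selection from evaluation is what is intended to reduce overestimation.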




In [None]:
total_episodes = 5000
max_steps = 5000
batch_size = 64
max_tau = 10000
gamma = 0.95

with tf.Session(config=tf.ConfigProto()) as sess:
#    sess = tf_debug.LocalCLIDebugWrapperSession(sess)
    print(sess.list_devices())

    writer = tf.summary.FileWriter('tensorboard/1', sess.graph)
    tf.summary.scalar('loss', DQNetwork.loss)
    write_op = tf.summary.merge_all()

    sess.run(tf.global_variables_initializer())

    decay_step = 0
    tau = 0
    
    explore_schedule = LinearSchedule(20000, 0.01)
    beta_schedule = LinearSchedule(6000, 1, 0.4)

    update_target = update_target_graph()
    sess.run(update_target)

    for episode in range(total_episodes):
        step = 0
        total_rewards = 0

        env_info = env.reset(train_mode=True)[brain_name]
        state = env_info.vector_observations[0]

        while step < max_steps:
            step += 1
            decay_step += 1
            tau += 1
    
            explore_probability = explore_schedule.value(decay_step)
            action = predict_action(explore_probability, state)

            env_info = env.step(action)[brain_name]
            reward = env_info.rewards[0]
            total_rewards += reward
            done = env_info.local_done[0]

            if done:
                next_state = np.zeros(state.shape)
                step = max_steps
            else:
                next_state = env_info.vector_observations[0]

            experience = state, action, reward, next_state, done
            memory.add(*experience)

            if done:
                print('Episode: {}'.format(episode),
                      'Total reward: {}'.format(total_rewards))
            else:
                state = next_state

            beta = beta_schedule.value(decay_step)
            
            if decay_step % 1000 == 0:
                print(
                    'explore_probability: {}'.format(explore_probability),
                    'beta: {}'.format(beta),
                )
            
            states, actions, rewards, next_states, dones, weights, idxes = memory.sample(batch_size, beta)
            actions = [possible_actions[action] for action in actions]
            weights = [[weight] for weight in weights]

            q_next_states = sess.run(DQNetwork.output, feed_dict={DQNetwork.inputs_: next_states})
            q_target_next_states = sess.run(TargetNetwork.output, feed_dict={TargetNetwork.inputs_: next_states})

            # Double DQN targets: the online network picks the next action,
            # the target network evaluates it.
            target_Q = []
            for i in range(0, len(states)):
                done = dones[i]
                if done:
                    target_Q.append(rewards[i])
                else:
                    action = np.argmax(q_next_states[i])
                    target = rewards[i] + gamma * q_target_next_states[i][action]
                    target_Q.append(target)

            _, summary, loss, absolute_errors = sess.run(
                [DQNetwork.optimizer, write_op, DQNetwork.loss, DQNetwork.absolute_errors],
                feed_dict={
                    DQNetwork.inputs_: states,
                    DQNetwork.target_Q: target_Q,
                    DQNetwork.actions_: actions,
                    DQNetwork.IS_weights_: weights
                })

            memory.update_priorities(
                idxes,
                absolute_errors
            )

            writer.add_summary(summary, decay_step)  # one point per training step, not per episode
            writer.flush()

            if tau > max_tau:
                sess.run(update_target)
                tau = 0
                print('Model updated')

        if episode % 100 == 99:
            print('loss: {}'.format(loss))


[_DeviceAttributes(/job:localhost/replica:0/task:0/device:CPU:0, CPU, 268435456, 5297597527349262905), _DeviceAttributes(/job:localhost/replica:0/task:0/device:XLA_GPU:0, XLA_GPU, 17179869184, 4521961447181447024), _DeviceAttributes(/job:localhost/replica:0/task:0/device:XLA_CPU:0, XLA_CPU, 17179869184, 11660122383662626222), _DeviceAttributes(/job:localhost/replica:0/task:0/device:GPU:0, GPU, 7690629940, 3708971275919048207)]
Episode: 0 Total reward: -1.0
Episode: 1 Total reward: -2.0
Episode: 2 Total reward: 1.0
explore_probability: 0.9505 beta: 0.5
Episode: 3 Total reward: -1.0
Episode: 4 Total reward: 1.0
Episode: 5 Total reward: 1.0
explore_probability: 0.901 beta: 0.6
Episode: 6 Total reward: -2.0
Episode: 7 Total reward: 1.0
Episode: 8 Total reward: 3.0
Episode: 9 Total reward: 2.0
explore_probability: 0.8515 beta: 0.7
Episode: 10 Total reward: 0.0
Episode: 11 Total reward: -1.0
Episode: 12 Total reward: -1.0
explore_probability: 0.802 beta: 0.8
Episode: 13 Total reward: 1.0
Epi