# Banana Navigation

---

In this notebook, you will train a deep Q-network (DQN) agent to navigate the Banana environment from Unity ML-Agents.

### 1. Start the Environment

We begin by importing some necessary packages.  If the code cell below returns an error, please revisit the project instructions to double-check that you have installed [Unity ML-Agents](https://github.com/Unity-Technologies/ml-agents/blob/master/docs/Installation.md) and [NumPy](http://www.numpy.org/). The cell also imports PyTorch and Matplotlib, so both must be installed as well.

In [1]:
from unityagents import UnityEnvironment
import numpy as np
import random
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from collections import OrderedDict, namedtuple, deque
import matplotlib.pyplot as plt

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

Next, we will start the environment!  **_Before running the code cell below_**, change the `file_name` parameter to match the location of the Unity environment that you downloaded.

- **Mac**: `"path/to/Banana.app"`
- **Windows** (x86): `"path/to/Banana_Windows_x86/Banana.exe"`
- **Windows** (x86_64): `"path/to/Banana_Windows_x86_64/Banana.exe"`
- **Linux** (x86): `"path/to/Banana_Linux/Banana.x86"`
- **Linux** (x86_64): `"path/to/Banana_Linux/Banana.x86_64"`
- **Linux** (x86, headless): `"path/to/Banana_Linux_NoVis/Banana.x86"`
- **Linux** (x86_64, headless): `"path/to/Banana_Linux_NoVis/Banana.x86_64"`

For instance, if you are using a Mac, then you downloaded `Banana.app`.  If this file is in the same folder as the notebook, then the line below should appear as follows:
```
env = UnityEnvironment(file_name="Banana.app")
```

In [None]:
env = UnityEnvironment(file_name="...")

Environments contain **_brains_** which are responsible for deciding the actions of their associated agents. Here we check for the first brain available, and set it as the default brain we will be controlling from Python.

In [None]:
# get the default brain
brain_name = env.brain_names[0]
brain = env.brains[brain_name]

print(brain)

### 2. Examine the State and Action Spaces

The simulation contains a single agent that navigates a large environment.  At each time step, it has four actions at its disposal:
- `0` - walk forward 
- `1` - walk backward
- `2` - turn left
- `3` - turn right

The state space has `37` dimensions and contains the agent's velocity, along with ray-based perception of objects around the agent's forward direction.  A reward of `+1` is provided for collecting a yellow banana, and a reward of `-1` is provided for collecting a blue banana.

Run the code cell below to print some information about the environment.

In [None]:
# reset the environment
env_info = env.reset(train_mode=False)[brain_name]

# number of agents in the environment
print('Number of agents:', len(env_info.agents))

# number of actions
action_size = brain.vector_action_space_size
print('Number of actions:', action_size)

# examine the state space 
state = env_info.vector_observations[0]
print('States look like:', state)
state_size = len(state)
print('States have length:', state_size)

### 4. The DQN Agent

#### 4.1. Helper functions

- `greedy_act` performs epsilon-greedy action selection. The action-value function Q is approximated by the neural network (the model) defined in Section 4.2.
- `sample_random_from_buffer` draws a random minibatch of experiences ($s_t, a_t, r_t, s_{t+1}$, plus the done flag) from `buffer_memory`, where the experience of each time step is stored.
- `learn` updates the parameters of the local network Q so that its predictions match the targets computed from the target network Q', then soft-updates Q' toward Q.
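
As a minimal, framework-free sketch of the epsilon-greedy rule that `greedy_act` applies: with probability `epsilon` a uniformly random action is chosen, otherwise the action with the largest estimated value. The function name `epsilon_greedy` and the plain list of Q-values are illustrative assumptions, not part of the notebook's code.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick an action index from a list of Q-values (illustrative helper).

    With probability `epsilon` a uniformly random action is returned (exploration);
    otherwise the index of the largest Q-value is returned (exploitation).
    """
    if rng.random() > epsilon:
        # Greedy choice: index of the maximum Q-value
        return max(range(len(q_values)), key=lambda a: q_values[a])
    # Exploratory choice: any action with equal probability
    return rng.randrange(len(q_values))


# Hypothetical Q-values for the four actions (forward, backward, left, right)
q_values = [0.1, -0.3, 0.8, 0.2]
greedy_action = epsilon_greedy(q_values, epsilon=0.0)   # epsilon = 0: greedy choice
random_action = epsilon_greedy(q_values, epsilon=1.0)   # epsilon = 1: random choice
print(greedy_action, random_action)
```

The comparison `rng.random() > epsilon` mirrors the test used in `greedy_act`, so a larger `epsilon` means more exploration.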

In [None]:
### Helper functions

def greedy_act(input_state, epsilon, model):
    """Return an action for the given state following an epsilon-greedy policy.
    
    Params Input:
    ======
        input_state (np.ndarray): current state
        epsilon (float): epsilon for epsilon-greedy action selection
        model (nn.Sequential): the DQN used to learn the Q function
        
    Return:
    ======
        action (int in range(action_size)): the selected epsilon-greedy action
    """
    # Convert the state to a batched torch tensor on the selected device
    state = torch.from_numpy(input_state).float().unsqueeze(0).to(device)
        
    # Evaluate the Q-values without tracking gradients
    model.eval()
    with torch.no_grad():
        action_values = model(state)
        
    # Switch back to training mode
    model.train()

    # Epsilon-greedy action selection (action_size is the global set in Section 2)
    if random.random() > epsilon:
        return int(np.argmax(action_values.cpu().numpy()))
    else:
        return int(random.choice(np.arange(action_size)))
        

def sample_random_from_buffer(buffer_memory, batch_size):
    """Randomly sample a minibatch from the buffer memory.
    
    Params Input:
    ======
        buffer_memory (deque(maxlen)): the buffer memory that stores experiences 
        batch_size (int): the batch size (the number of experiences picked at random from buffer_memory)
        
    Return:
    ======
        Tuple[torch.Tensor]: states, actions, rewards, next states and dones
    """
    experiences = random.sample(buffer_memory, k=batch_size)
    states = torch.from_numpy(np.vstack([e.state for e in experiences if e is not None])).float().to(device)
    actions = torch.from_numpy(np.vstack([e.action for e in experiences if e is not None])).long().to(device)
    rewards = torch.from_numpy(np.vstack([e.reward for e in experiences if e is not None])).float().to(device)
    next_states = torch.from_numpy(np.vstack([e.next_state for e in experiences if e is not None])).float().to(device)
    dones = torch.from_numpy(np.vstack([e.done for e in experiences if e is not None]).astype(np.uint8)).float().to(device)
    return (states, actions, rewards, next_states, dones)


def learn(experiences, disc_fac, Q_target_model, Q_local_model, tau, lr):
    """Update the local network by gradient descent on a batch of experience tuples.

    Params Input:
    ======
        experiences (Tuple[torch.Tensor]): batch of (s, a, r, s', done) tensors
        disc_fac (float): discount factor
        Q_target_model (nn.Sequential): the DQN as target Q function
        Q_local_model (nn.Sequential): the DQN as local Q function
        tau (float): interpolation factor for the soft update of the target network
        lr (float): learning rate of the Adam optimizer
    """
    
    states, actions, rewards, next_states, dones = experiences
    
    # Get max predicted Q values (for next states) from target model
    Q_targets_next_action = Q_target_model(next_states).detach().max(1)[0].unsqueeze(1)
    # Compute Q targets for current states (no bootstrapping on terminal states)
    Q_targets = rewards + (disc_fac * Q_targets_next_action * (1 - dones))

    # Get expected Q values from local model
    Q_expected = Q_local_model(states).gather(1, actions)

    ## Training
    
    # Compute loss
    loss = F.mse_loss(Q_expected, Q_targets)
    # Create the Adam optimizer once and reuse it, so that its moment estimates
    # persist across calls instead of being reset on every update
    if not hasattr(Q_local_model, "_optimizer"):
        Q_local_model._optimizer = optim.Adam(Q_local_model.parameters(), lr=lr)
    optimizer = Q_local_model._optimizer
    # Minimize the loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Soft update of the target model: theta_target = tau*theta_local + (1 - tau)*theta_target
    for target_params, local_params in zip(Q_target_model.parameters(), Q_local_model.parameters()):
        target_params.data.copy_(tau*local_params.data + (1.0-tau)*target_params.data)
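
To make the two update rules in `learn` concrete, the following sketch reproduces them with plain Python lists: the TD target $r + \gamma \max_{a'} Q'(s', a')\,(1 - \text{done})$ and the soft update $\theta' \leftarrow \tau\,\theta + (1 - \tau)\,\theta'$. The helper names `td_targets` and `soft_update` and all numbers are made up for illustration; they are not part of the notebook's code.

```python
def td_targets(rewards, next_q_values, dones, gamma):
    """TD target per transition: r + gamma * max_a' Q'(s', a') * (1 - done)."""
    return [
        r + gamma * max(q_next) * (1.0 - done)
        for r, q_next, done in zip(rewards, next_q_values, dones)
    ]


def soft_update(target_params, local_params, tau):
    """Move each target parameter a fraction tau toward the local parameter."""
    return [
        tau * loc + (1.0 - tau) * tgt
        for tgt, loc in zip(target_params, local_params)
    ]


# Two hypothetical transitions: the second one ends the episode (done = 1),
# so its target is just the immediate reward.
targets = td_targets(
    rewards=[1.0, 0.0],
    next_q_values=[[0.5, 2.0], [3.0, 1.0]],  # target-network Q-values for s'
    dones=[0.0, 1.0],
    gamma=0.5,
)
print(targets)  # [2.0, 0.0]

# With tau = 0.1 the target weights move 10% of the way toward the local weights.
new_target = soft_update(target_params=[0.0, 10.0], local_params=[1.0, 0.0], tau=0.1)
print(new_target)
```

With a small `tau` such as the notebook's `1e-3`, the target network changes slowly, which keeps the regression targets comparatively stable between updates.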

#### 4.2. Initialization

After some initialization (see the comments in the code below), the next cell defines the model for the action-value function Q and the model for the target action-value function Q'. Both are neural networks with two fully connected hidden layers of 64 units each. The cell also creates the replay buffer and the experience tuple.
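
As a rough size check, the number of trainable parameters in such a network follows from the layer widths: a fully connected layer with `n_in` inputs and `n_out` outputs has `n_in * n_out` weights plus `n_out` biases. The sketch below assumes the 37-dimensional state and the 4 actions described in Section 2; `count_parameters` is an illustrative helper, not a PyTorch function.

```python
def count_parameters(layer_sizes):
    """Weights plus biases of a fully connected network with the given layer widths."""
    return sum(
        n_in * n_out + n_out
        for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])
    )


# state_size = 37, two hidden layers of 64 units, action_size = 4
n_params = count_parameters([37, 64, 64, 4])
print(n_params)  # 6852
```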

In [None]:
### Initialization

BUFFER_SIZE = int(1e5)  # replay buffer size
FIRST_HIDDEN = 64       # the size of the first hidden layer
SECOND_HIDDEN = 64      # the size of the second hidden layer

## The DQN-model

# The neural net for the action value function Q
dqn_model_local = nn.Sequential(OrderedDict([
    ('fc1', nn.Linear(state_size, FIRST_HIDDEN)),
    ('relu1', nn.ReLU()),
    ('fc2',nn.Linear(FIRST_HIDDEN, SECOND_HIDDEN)),
    ('relu2', nn.ReLU()),
    ('output', nn.Linear(SECOND_HIDDEN,action_size))
]))

dqn_model_local.to(device)

# The neural net for the target action value function Q'
dqn_model_target = nn.Sequential(OrderedDict([
    ('fc1', nn.Linear(state_size, FIRST_HIDDEN)),
    ('relu1', nn.ReLU()),
    ('fc2',nn.Linear(FIRST_HIDDEN, SECOND_HIDDEN)),
    ('relu2', nn.ReLU()),
    ('output', nn.Linear(SECOND_HIDDEN,action_size))
]))

dqn_model_target.to(device)


## The replay buffer
buffer_memory = deque(maxlen = BUFFER_SIZE)

## The tuple for each experience
experience = namedtuple("Experience", field_names=["state", "action", "reward", "next_state", "done"])
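
The replay buffer relies on two standard-library behaviors: a `deque` created with `maxlen` silently discards its oldest entry once it is full, and `random.sample` draws distinct entries without replacement. A small sketch with toy values (the integer states and the capacity of 3 are arbitrary choices for illustration):

```python
import random
from collections import deque, namedtuple

Experience = namedtuple("Experience", ["state", "action", "reward", "next_state", "done"])

toy_buffer = deque(maxlen=3)  # tiny capacity so the eviction is visible
for step in range(5):
    # Toy transition: state `step` leads to state `step + 1`
    toy_buffer.append(Experience(step, 0, 0.0, step + 1, False))

# Only the three most recent experiences remain
kept_states = [e.state for e in toy_buffer]
print(kept_states)  # [2, 3, 4]

# Draw a minibatch of two distinct experiences, uniformly at random
batch = random.sample(list(toy_buffer), k=2)
print(len(batch))  # 2
```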


### 5. Train the Agent and Collect Scores

In [None]:
### Hyperparameters
BATCH_SIZE = 64         # minibatch size of experiences sampled from buffer_memory
GAMMA = 0.99            # discount factor
TAU = 1e-3              # for soft update of target parameters
LR = 5e-4               # learning rate 
UPDATE_EVERY = 4        # how often to update the network

def run_dqn_for_banana(n_episode=2000, max_t_per_episode=1000, eps_gre_in=1., eps_gre_end=0.01, eps_dec=0.995):
    """Deep Q-learning for the Banana Unity environment.
    
    Params Input
    ==========
        n_episode (int): maximum number of episodes
        max_t_per_episode (int): maximum number of time steps (actions) per episode
        eps_gre_in (float): initial value of epsilon for epsilon-greedy action selection
        eps_gre_end (float): minimum value of epsilon
        eps_dec (float): multiplicative factor applied to epsilon after each episode
        
    Params Output
    ==========
        scores (list of floats): the score collected in each episode
        
    """
    
    ## Initialization
    scores = []                       # list of scores for each episode
    scores_window = deque(maxlen=100) # last 100 scores (used for the running average)
    epsilon = eps_gre_in              # starting epsilon
    
    
    for i_episode in range(1, n_episode+1):
        init_env_info = env.reset(train_mode=True)[brain_name]   # initial env infos
        state = init_env_info.vector_observations[0]             # initial state after reset env
        score = 0.0                                              # initial score
        
        for t in range(max_t_per_episode):
            action = greedy_act(state, epsilon, dqn_model_local)      #epsilon-greedy action selection
            env_info_after_step = env.step(action)[brain_name]        #performs the selected action on the env
            next_state = env_info_after_step.vector_observations[0]   #get the next state
            reward = env_info_after_step.rewards[0]                   #get the reward
            done = env_info_after_step.local_done[0]                  #done status
            
            #store the experience (s, a, r, s', done) in the buffer memory
            exp = experience(state, action, reward, next_state, done) 
            buffer_memory.append(exp)
            
            #train the Q-network every UPDATE_EVERY time steps
            tmp_t = (t+1) % UPDATE_EVERY
            if  tmp_t == 0:
                if len(buffer_memory) > BATCH_SIZE:
                    experiences = sample_random_from_buffer(buffer_memory, BATCH_SIZE) #sample BATCH_SIZE sequences from the buffer_memory
                    learn(experiences, GAMMA, dqn_model_target, dqn_model_local, TAU, LR) #train the Q-neural-net
             
            state = next_state
            score += reward
            if done:
                break 
        
        scores_window.append(score)       # save most recent score
        scores.append(score)              # save most recent score
        epsilon = max(eps_gre_end, eps_dec*epsilon) # decrease epsilon
        print('\rEpisode {}\tAverage Score: {:.2f}'.format(i_episode, np.mean(scores_window)), end="")
        if i_episode % 100 == 0:
            print('\rEpisode {}\tAverage Score: {:.2f}'.format(i_episode, np.mean(scores_window)))
        if np.mean(scores_window)>=13.0:
            print('\nEnvironment solved in {:d} episodes!\tAverage Score: {:.2f}'.format(i_episode-100, np.mean(scores_window)))
            torch.save(dqn_model_local.state_dict(), 'checkpoint.pth')
            break
    return scores
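
The exploration schedule in the loop above multiplies epsilon by `eps_dec` after every episode and clips it at `eps_gre_end`. The sketch below replays that schedule on its own, using the default arguments of `run_dqn_for_banana`, to show how quickly exploration fades; `epsilon_schedule` is an illustrative helper, not part of the notebook's code.

```python
def epsilon_schedule(n_episode, eps_start=1.0, eps_end=0.01, eps_decay=0.995):
    """Return the epsilon used in each episode under multiplicative decay with a floor."""
    epsilons = []
    epsilon = eps_start
    for _ in range(n_episode):
        epsilons.append(epsilon)                     # value used during this episode
        epsilon = max(eps_end, eps_decay * epsilon)  # decay applied after the episode
    return epsilons


epsilons = epsilon_schedule(2000)
# Number (1-based) of the first episode that runs with the floor value
first_floor_episode = next(i for i, e in enumerate(epsilons) if e <= 0.01) + 1
print(epsilons[0], round(epsilons[100], 3), first_floor_episode)
```

With these defaults, epsilon has dropped to roughly 0.6 after 100 episodes and reaches its floor after a little over 900 episodes, so most of a 2000-episode run is nearly greedy.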

### 6. Run the Training and Plot the Scores

In [None]:
scores = run_dqn_for_banana()

# plot the scores
fig = plt.figure()
ax = fig.add_subplot(111)
plt.plot(np.arange(len(scores)), scores)
plt.ylabel('Score')
plt.xlabel('Episode #')
plt.show()        
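
Per-episode scores are noisy, so it can help to overlay a moving average that uses the same 100-episode window as the solved criterion in the training loop. A minimal sketch, assuming `scores` is the list returned by `run_dqn_for_banana`; `moving_average` is an illustrative helper, and the short list below only stands in for real scores.

```python
def moving_average(values, window):
    """Trailing mean: entry i averages the last `window` values up to and including i."""
    averages = []
    running_sum = 0.0
    for i, value in enumerate(values):
        running_sum += value
        if i >= window:
            running_sum -= values[i - window]  # drop the value that left the window
        averages.append(running_sum / min(i + 1, window))
    return averages


demo_scores = [1, 2, 3, 4]  # placeholder data, not real training results
smoothed = moving_average(demo_scores, window=2)
print(smoothed)  # [1.0, 1.5, 2.5, 3.5]
```

In the plotting cell, calling `plt.plot(moving_average(scores, 100))` before `plt.show()` would draw the smoothed curve on the same axes as the raw scores.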