# Report for Banana Collector solved by a DQN algorithm
## Code implementation description
The code consists of four Python files:
1. `model.py` contains the class __Actor__, a neural network (NN) with two hidden layers (layer sizes 37, 64, 64, 4). This NN maps a state to the action values from which the policy is derived.
1. `dqn_agent.py` contains the class __Agent__, which has __Actor__ and __ReplayBuffer__ instances as members. Agent has three important methods: `act`, `step` and `learn`. __ReplayBuffer__ is an implementation of the Experience Replay technique proposed in the DQN algorithm.
1. `dqn_monitor.py` contains the function `dqn_interact`, which implements the interaction between the Unity environment and the class __Agent__ following the DQN algorithm.
1. `learn_and_prove.py` is the main program and has three options: `[-h|--help]` to show how to use the program, `[--train]` to train an agent, and `[--file FILE]` to save the output data from the program for post-processing.

## Learning algorithm
The __DQN algorithm__ uses the following function to let the Agent learn from experience. The function manages $\epsilon_i$ with a GLIE-style decay to handle the _exploration-exploitation dilemma_, while the $\epsilon$-greedy policy itself is implemented in `agent.act(...)`:

```python
from collections import deque

import numpy as np


def dqn_interact(env, agent,
                 n_episodes=2000, window=100, max_t=1000,
                 eps_start=1.0, eps_end=0.005, eps_decay=0.980,
                 filename='checkpoint.pth'):
    """ Deep Q-Learning Agent-Environment interaction.
    
    Params
    ======
        env: instance of UnityEnvironment class
        agent: instance of class Agent (see dqn_agent.py for details)
        n_episodes (int): maximum number of training episodes
        window (int): number of episodes to consider when calculating average rewards
        max_t (int): maximum number of timesteps per episode
        eps_start (float): starting value of epsilon, for epsilon-greedy action selection
        eps_end (float): minimum value of epsilon
        eps_decay (float): multiplicative factor (per episode) for decreasing epsilon
        filename (string): name of the file to save weights
    """
    # all returns
    all_returns = []
    # initialize average rewards
    avg_rewards = deque(maxlen=n_episodes)
    # initialize monitor for most recent rewards
    samp_rewards = deque(maxlen=window)
    # initialize best average reward
    best_avg_reward = -np.inf
    # initialize eps
    eps = eps_start
    # for each episode
    for i_episode in range(1, n_episodes+1):
        # begin the episode
        state = reset(env, train_mode=True)
        # initialize the sample reward
        samp_reward = 0
        for t in range(max_t):
            # agent selects an action
            action =  agent.act(state, eps)
            # agent performs the selected action
            next_state, reward, done = step(env, action)
            # agent performs internal updates based on sampled experience
            agent.step(state, action, reward, next_state, done)
            # updated the sample reward
            samp_reward += reward
            # update the state (s-> s') to next time step
            state = next_state
            if done:
                break
        # save final sampled reward
        samp_rewards.append(samp_reward)
        all_returns.append(samp_reward)
        # update epsilon with GLIE
        eps = max(eps_end, eps_decay*eps)
        # stopping criteria
        if np.mean(samp_rewards)>=15.0:
            # save weights
            break

    return

```
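
The $\epsilon$-greedy rule behind `agent.act(...)` is not listed in this report. A minimal sketch of the idea, assuming the action values are available as a plain list (the function name `epsilon_greedy` and the values in `q` are hypothetical, not taken from `dqn_agent.py`):

```python
import random


def epsilon_greedy(q_values, eps, rng=random):
    """Pick an action index from a list of action values.

    With probability eps a uniformly random action is returned (exploration);
    otherwise the action with the highest value is returned (exploitation).
    """
    if rng.random() < eps:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


# Hypothetical action values for the 4 available actions.
q = [0.1, 0.7, -0.2, 0.3]
greedy_action = epsilon_greedy(q, eps=0.0)  # eps=0 always exploits
```

With `eps=1.0` (the starting value) every action is random; as `eps` decays toward `eps_end`, the greedy action dominates.
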
The learning process is carried out inside the `agent.step(...)` method. The agent begins in a [_tabula rasa_](https://en.wikipedia.org/wiki/Tabula_rasa) state, and each interaction with the environment is saved in the replay buffer. Every `UPDATE_EVERY` time steps, provided the replay buffer holds more than `BATCH_SIZE` experiences, the agent learns from a random sample of `BATCH_SIZE` experiences drawn by the `ReplayBuffer.sample()` method.

```python
class Agent:
 
    # ...
    def step(self, state, action, reward, next_state, done):
        """ Interact Agent and Environment.
        """
        self.memory.add(state, action, reward, next_state, done)

        # Learn every UPDATE_EVERY time steps.
        self.t_step = (self.t_step + 1) % UPDATE_EVERY
        if self.t_step == 0:
            # If enough samples are available in memory, get random subset and learn
            if len(self.memory) > BATCH_SIZE:
                experiences = self.memory.sample()
                self.learn(experiences, GAMMA)

```
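
`ReplayBuffer` itself is not shown here. The sketch below illustrates the idea with the standard library only. The `Experience` tuple, the constructor arguments and the `seed` parameter are assumptions, and the real class presumably also converts the sampled batch to `torch` tensors before `learn(...)` uses it:

```python
import random
from collections import deque, namedtuple

Experience = namedtuple(
    "Experience", ["state", "action", "reward", "next_state", "done"])


class ReplayBuffer:
    """Fixed-size buffer of experience tuples with uniform random sampling."""

    def __init__(self, buffer_size, batch_size, seed=0):
        self.memory = deque(maxlen=buffer_size)  # oldest entries are dropped first
        self.batch_size = batch_size
        self.rng = random.Random(seed)

    def add(self, state, action, reward, next_state, done):
        """Store one interaction with the environment."""
        self.memory.append(Experience(state, action, reward, next_state, done))

    def sample(self):
        """Return batch_size experiences drawn uniformly without replacement."""
        return self.rng.sample(list(self.memory), self.batch_size)

    def __len__(self):
        return len(self.memory)


# Toy usage: capacity 5, so only the 5 most recent of 8 experiences survive.
buffer = ReplayBuffer(buffer_size=5, batch_size=2)
for t in range(8):
    buffer.add(state=t, action=0, reward=0.0, next_state=t + 1, done=False)
batch = buffer.sample()
```

Sampling uniformly from a large buffer breaks the correlation between consecutive time steps, which is the motivation for Experience Replay in DQN.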

The `learn(...)` method, called from `agent.step(...)`, performs the TD update using a target NN as proposed in the [DQN paper](https://storage.googleapis.com/deepmind-media/dqn/DQNNaturePaper.pdf), with one small change: the target NN is updated with the `soft_update` method proposed in the [DDPG paper](https://arxiv.org/pdf/1509.02971.pdf):

```python
class Agent:
    
    # ...
    def learn(self, experiences, gamma):
        """ Update value parameters using given batch of experience tuples.

        Params
        ======
            experiences (Tuple[torch.Tensor]): tuple of (s, a, r, s', done) tuples
            gamma (float): discount factor
        """
        states, actions, rewards, next_states, dones = experiences

        # Get max predicted Q values (for next states) from target model
        Q_targets_next = self.actor_target(next_states).detach().max(1)[0].unsqueeze(1)
        # Compute Q targets for current states
        Q_targets = rewards + (gamma * Q_targets_next * (1 - dones))

        # Get expected Q values from local model
        Q_expected = self.actor_local(states).gather(1,actions)

        # Compute loss
        loss = F.mse_loss(Q_expected, Q_targets)
        # Minimize the loss
        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()

        # Update target network
        self.soft_update(self.actor_local, self.actor_target, TAU)    
```
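
The `soft_update` method is not listed. Following the cited paper, it blends the target weights toward the local ones as $\theta^- \leftarrow \tau\,\theta + (1-\tau)\,\theta^-$. A framework-free sketch on plain lists of floats (the weight values are hypothetical; the actual method would apply the same rule to each pair of PyTorch parameters):

```python
def soft_update(local_params, target_params, tau):
    """Return target parameters blended toward the local ones.

    Applies theta_target <- tau * theta_local + (1 - tau) * theta_target
    element-wise on plain lists of floats.
    """
    return [tau * w_local + (1.0 - tau) * w_target
            for w_local, w_target in zip(local_params, target_params)]


TAU = 1e-3
local = [1.0, -2.0]   # hypothetical local-network weights, held fixed here
target = [0.0, 0.0]   # hypothetical target-network weights
for _ in range(1000):
    target = soft_update(local, target, TAU)

# Fraction of the initial gap closed after n updates is 1 - (1 - tau)**n.
closed = 1.0 - (1.0 - TAU) ** 1000
```

With `TAU = 1e-3` and the local weights held fixed, the target closes roughly 63% of the gap after 1000 updates, which shows how slowly the target network tracks the local one.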

__Hyperparameters:__ The most important parameters are the following:

```python
# dqn_monitor.py (defaults of dqn_interact)
max_t=1000              # maximum number of steps per episode
eps_start=1.0           # starting epsilon
eps_end=0.005           # final epsilon, keeping a bit of exploration
eps_decay=0.980         # epsilon decay
# dqn_agent.py
BUFFER_SIZE = int(1e5)  # replay buffer size
BATCH_SIZE = 64         # minibatch size
GAMMA = 0.99            # discount factor
TAU = 1e-3              # for soft update of target parameters
LR = 5e-4               # learning rate
UPDATE_EVERY = 4        # how often to update the network
FC1_UNITS = 64          # number of neurons in first layer
FC2_UNITS = 64          # number of neurons in second layer
```
After exploring some of them, the most influential one found was `eps_decay`.
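
To see why `eps_decay` matters, it helps to compute how long the agent keeps exploring. The sketch below (helper names are hypothetical) applies the same multiplicative rule as `dqn_interact`, written in closed form:

```python
import math


def episodes_to_floor(eps_start, eps_end, eps_decay):
    """Smallest number of episodes n with eps_start * eps_decay**n <= eps_end."""
    return math.ceil(math.log(eps_end / eps_start) / math.log(eps_decay))


def epsilon_at(episode, eps_start=1.0, eps_end=0.005, eps_decay=0.980):
    """Epsilon after `episode` decay steps, clipped from below at eps_end."""
    return max(eps_end, eps_start * eps_decay ** episode)


n_floor = episodes_to_floor(1.0, 0.005, 0.980)
```

With the defaults above, epsilon reaches its floor of 0.005 after 263 episodes; from then on the agent picks a random action with probability 0.005 only.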

Finally, __the neural network architecture__ is 37->64->64->4:

```python
Actor(
  (fc1): Linear(in_features=37, out_features=64, bias=True)
  (fc2): Linear(in_features=64, out_features=64, bias=True)
  (fc3): Linear(in_features=64, out_features=4, bias=True)
)

```

The definition can be found in `model.py`; the hidden layers use `relu` activation functions:

```python
class Actor(nn.Module):
    """ Actor (Policy) Model. """

    def __init__(self, state_size, action_size, seed, fc1_units=64, fc2_units=64):
        """ Initialize parameters and build model.
        Params
        ======
            state_size (int): Dimension of each state
            action_size (int): Dimension of each action
            seed (int): Random seed
            fc1_units (int): Number of nodes in first hidden layer
            fc2_units (int): Number of nodes in second hidden layer
        """
        super(Actor, self).__init__()
        self.seed = torch.manual_seed(seed)
        self.fc1 = nn.Linear(state_size, fc1_units)
        self.fc2 = nn.Linear(fc1_units, fc2_units)
        self.fc3 = nn.Linear(fc2_units, action_size)

    def forward(self, state):
        """ Build a network that maps state -> action values. """
        x = F.relu(self.fc1(state))
        x = F.relu(self.fc2(x))
        return self.fc3(x)
```
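
As a quick check on the size of this network, the number of trainable parameters follows directly from the layer widths. A small sketch (the helper `count_linear_params` is hypothetical and not part of `model.py`):

```python
def count_linear_params(layer_sizes):
    """Total weights and biases of a fully connected net with the given widths."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))


n_params = count_linear_params([37, 64, 64, 4])
```

For 37->64->64->4 this gives 6852 parameters (2432 + 4160 + 260 for `fc1`, `fc2` and `fc3`).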

## Plots of rewards
Reports the number of episodes needed to solve the environment

## Ideas for future work
The submission has concrete future ideas for improving performance