# Project 1 - Navigation

### 1. Install dependencies
Most importantly, install [Unity ML-Agents](https://github.com/Unity-Technologies/ml-agents/blob/master/docs/Installation.md), PyTorch, and NumPy.

In [None]:
from unityagents import UnityEnvironment
import numpy as np
import random
import torch
from collections import deque
import matplotlib.pyplot as plt
%matplotlib inline

Make sure you have downloaded the Unity environment, then change `file_name` to point to its location.

In [None]:
env = UnityEnvironment(file_name="Banana.app")

Environments contain brains which are responsible for deciding the actions of their associated agents. Here we check for the first brain available, and set it as the default brain we will be controlling from Python.

In [None]:
# get the default brain
brain_name = env.brain_names[0]
brain = env.brains[brain_name]

### 2. Examine the State and Action Spaces
The simulation contains a single agent that navigates a large environment. At each time step, it has four actions at its disposal:

- 0 - walk forward
- 1 - walk backward
- 2 - turn left
- 3 - turn right

The state space has 37 dimensions and contains the agent's velocity, along with ray-based perception of objects around the agent's forward direction. A reward of +1 is provided for collecting a yellow banana, and a reward of -1 is provided for collecting a blue banana.

Run the code cell below to print some information about the environment.

In [None]:
# reset the environment
env_info = env.reset(train_mode=True)[brain_name]

# number of agents in the environment
print("Number of agents:", len(env_info.agents))

# number of actions
action_size = brain.vector_action_space_size
print("Number of actions:", action_size)

# examine the state space
state = env_info.vector_observations[0]
print("Example of a state:", state)
state_size = len(state)
print("States have length of:", state_size)

### 3. Instantiate and initialize the agent
The learning agent is imported from a separate file, `./agent.py`, and takes `state_size`, `action_size`, and a `seed` as constructor arguments.

A few highlights of the agent:
- The agent follows an epsilon-greedy policy
- The agent stores recent steps as `(state, action, reward, next_state, done)` tuples in a buffer and replays them
- The agent estimates action values with a deep Q-learning network and acts to maximize reward
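
The first two highlights can be sketched in plain Python. This is a minimal illustration, not the code in `agent.py`: the names `Experience`, `ReplayBuffer`, and `epsilon_greedy` are hypothetical, and `q_values` is a plain list standing in for whatever the agent's network would output.

```python
import random
from collections import deque, namedtuple

# Hypothetical container for one transition; fields mirror the tuple described above.
Experience = namedtuple("Experience", ["state", "action", "reward", "next_state", "done"])


class ReplayBuffer:
    """Fixed-size buffer that stores transitions and samples them uniformly."""

    def __init__(self, buffer_size, batch_size, seed=0):
        self.memory = deque(maxlen=buffer_size)  # oldest transitions are dropped first
        self.batch_size = batch_size
        self.rng = random.Random(seed)

    def add(self, state, action, reward, next_state, done):
        self.memory.append(Experience(state, action, reward, next_state, done))

    def sample(self):
        # Uniform sampling without replacement from the stored transitions
        return self.rng.sample(list(self.memory), self.batch_size)

    def __len__(self):
        return len(self.memory)


def epsilon_greedy(q_values, eps, rng=random):
    """Pick a random action with probability eps, otherwise the greedy one."""
    if rng.random() < eps:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])
```

With `eps=0.0` the helper always returns the index of the largest value, and with `eps=1.0` it always picks uniformly at random.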

In [None]:
from agent import Agent

agent = Agent(state_size=state_size, action_size=action_size, seed=0)

### 4. Test the untrained agent
Run an **untrained** agent for 200 time steps to see what happens to the score.

In [None]:
env_info = env.reset(train_mode=False)[brain_name]      # reset environment
state = env_info.vector_observations[0]                 # get the first state from the reset environment
score = 0
for j in range(200):
    action = agent.act(state)                           # agent selects an action based on its policy and the current state
    env_info = env.step(action)[brain_name]             # send the action to the environment
    next_state = env_info.vector_observations[0]        # get the next state
    reward = env_info.rewards[0]                        # get the reward
    done = env_info.local_done[0]                       # check if the episode has finished
    score += reward                                     # update the total score
    state = next_state                                  # set the state as the next state for the following step
    if done:                                            # exit loop if episode finished
        break

print("Score: {}".format(score))

### 5. Train an agent with Deep Q-Network (DQN)
The agent approximates the Q-function with a neural network instead of a Q-table. The environment's state space is continuous and has 37 dimensions, so its states cannot be enumerated in a table, let alone updated at every step. A multi-layer neural network estimates the action values instead, hence the name Deep Q-Network (DQN).
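
To get a feel for why a table does not scale here, the arithmetic below counts the entries a Q-table would need if every state dimension were discretized. The choice of 10 bins per dimension is an arbitrary assumption made only for illustration, and `q_table_entries` is a hypothetical helper.

```python
def q_table_entries(state_dims, bins_per_dim, n_actions):
    """Number of Q-values in a table over a uniformly discretized state space."""
    return (bins_per_dim ** state_dims) * n_actions


# 37 state dimensions and 4 actions come from the environment described above;
# 10 bins per dimension is an assumed, fairly coarse discretization.
entries = q_table_entries(state_dims=37, bins_per_dim=10, n_actions=4)
print(f"{entries:.1e} table entries")
```

Even at this coarse resolution the table would need about 4 × 10^37 entries, which is why a function approximator is used instead.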

Let's train the agent until it achieves an average score of +13 over 100 consecutive episodes.
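
The training loop multiplies epsilon by `eps_decay` after every episode and clips it at `eps_end`. As a rough check of how long exploration lasts under the default arguments, the sketch below counts the episodes until epsilon reaches its floor. `episodes_until_floor` is a hypothetical helper, not part of the notebook, and it assumes `0 < eps_decay < 1`.

```python
def episodes_until_floor(eps_start=1.0, eps_end=0.01, eps_decay=0.995):
    """Count episodes until the decayed epsilon is clipped at eps_end.

    Assumes 0 < eps_decay < 1; otherwise epsilon never reaches the floor.
    """
    eps, episodes = eps_start, 0
    while eps > eps_end:
        eps = max(eps_end, eps_decay * eps)  # same update rule as in dqn()
        episodes += 1
    return episodes


print(episodes_until_floor())
```

With the defaults, epsilon reaches its floor after a little over 900 episodes, so an agent that solves the task earlier is still exploring to some degree when training stops.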

In [None]:
def dqn(n_episodes=2000, max_t=1000, eps_start=1.0, eps_end=0.01, eps_decay=0.995):
    '''
    -------------------------------------------
    Parameters
    
    n_episodes: # of episodes that the agent is training for
    max_t:      # of time steps (max) the agent is taking per episode
    eps_start:  start value of epsilon for the epsilon-greedy policy
    eps_end:    terminal value of epsilon
    eps_decay:  multiplicative decay factor applied to epsilon after each episode
    -------------------------------------------
    '''
    scores = []
    scores_window = deque(maxlen=100)
    eps = eps_start
    for i_episode in range(1, n_episodes+1):
        env_info = env.reset(train_mode=True)[brain_name]       # turn on train mode of the environment
        state = env_info.vector_observations[0]                 # select first state
        score = 0
        for t in range(max_t):
            action = agent.act(state, eps)                      # agent selects an action based on its policy and the current state
            env_info = env.step(action)[brain_name]             # send action to the environment
            next_state = env_info.vector_observations[0]        # get next state from the environment
            reward = env_info.rewards[0]                        # get reward
            done = env_info.local_done[0]                       # check if the episode has finished
            agent.step(state, action, reward, next_state, done) # agent records the environment's response to this step
            state = next_state                                  # set the state as the next state for the following step
            score += reward                                     # update the total score
            if done:                                            # exit loop if episode finished
                break
                
        scores_window.append(score)                           
        scores.append(score)
        eps = max(eps_end, eps_decay*eps)
        
        # print average 100-episode score for each episode
        print('\rEpisode {}\tAverage Score: {:.2f}'.format(i_episode, np.mean(scores_window)), end="")
        
        # every 100 episodes, print the average on its own line so it is not overwritten
        if i_episode % 100 == 0:
            print('\rEpisode {}\tAverage Score: {:.2f}'.format(i_episode, np.mean(scores_window)))
        
        # save the Q-Network weights once the average score over the last 100 episodes reaches +13
        if len(scores_window) == 100 and np.mean(scores_window) >= 13.0:
            print('\nEnvironment solved in {:d} episodes!\tAverage Score: {:.2f}'.format(i_episode-100, np.mean(scores_window)))
            torch.save(agent.qnetwork_local.state_dict(), 'checkpoint_wandb.pth')
            break
    return scores




In [None]:
scores = dqn()

### Visualize the scores
Plot the score of each episode. The red trendline is a linear fit and shows the scores rising gradually as training progresses.
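
Per-episode scores are noisy, so a moving average can make the trend easier to read than the raw curve. The helper below is a sketch in plain Python. `rolling_mean` is a hypothetical name, and the window of 100 is an assumption chosen to match the solve criterion.

```python
from collections import deque


def rolling_mean(values, window=100):
    """Mean of the last `window` values at each position (fewer at the start)."""
    recent = deque(maxlen=window)  # keeps only the most recent `window` values
    means = []
    for v in values:
        recent.append(v)
        means.append(sum(recent) / len(recent))
    return means
```

Calling `plt.plot(rolling_mean(scores))` would then draw the smoothed curve on top of the raw scores.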

In [None]:
fig = plt.figure()
x = np.arange(len(scores))
y = scores

# plot scores
plt.plot(x, y)
# plot trendline
z = np.polyfit(x, y, 1)
p = np.poly1d(z)
plt.plot(x,p(x),"r-", linewidth=5)
plt.ylabel('Scores')
plt.xlabel('Episode #')
plt.show()

In [None]:
# Do not close the environment yet: section 6 reuses `env`.
# env.close()

### 6. Test a trained agent
Run a **trained** agent for 200 time steps to see what happens to the score. Compare this with the score of the untrained agent from section 4.

In [None]:
def trained_agent(filepath):
    checkpoint = torch.load(filepath)
    agent.qnetwork_local.load_state_dict(checkpoint)
    
    return agent

In [None]:
agent = trained_agent("checkpoint_wandb.pth")           # file name must match the checkpoint saved by dqn()

env_info = env.reset(train_mode=False)[brain_name]      # reset environment
state = env_info.vector_observations[0]                 # get the first state from the reset environment
score = 0
for j in range(200):
    action = agent.act(state)                           # agent selects an action based on its policy and the current state
    env_info = env.step(action)[brain_name]             # send the action to the environment
    next_state = env_info.vector_observations[0]        # get the next state
    reward = env_info.rewards[0]                        # get the reward
    done = env_info.local_done[0]                       # check if the episode has finished
    score += reward                                     # update the total score
    state = next_state                                  # set the state as the next state for the following step
    if done:                                            # exit loop if episode finished
        break

print("Score: {}".format(score))

In [None]:
env.close()