# Reacher

---

This notebook contains a solution for the second project (Continuous Control) of the [Deep Reinforcement Learning Nanodegree](https://www.udacity.com/course/deep-reinforcement-learning-nanodegree--nd893). The project uses the Reacher environment from Unity ML-Agents.

First, choose how to run this notebook:

- Set `run_mode` to `'train'` to train a new agent from scratch.
- Set `run_mode` to `'test'` to evaluate a pre-trained model.

To run without visualizing the agent during training and testing, set `no_rendering` to `True`.


In [1]:
# !pip -q install ./python

In [2]:
# Set to 'train' to train an agent from scratch, or 'test' to evaluate a saved agent.
run_mode = 'test'

# Set to True to disable rendering of the agent.
no_rendering = False

# 'single' loads the one-agent build; any other value (e.g. 'many') loads the multi-agent build.
env_version = 'many'


### 1. Start the Environment

We begin by importing some necessary packages.

In [3]:
from collections import deque
import random

import numpy as np
import torch
import matplotlib.pyplot as plt
from unityagents import UnityEnvironment

%matplotlib inline

Next, we're going to start the environment! **_Before you run the code cell below_**, ensure the `file_name` parameter matches the location of the Unity environment that you downloaded.

- **Mac**: `"path/to/Reacher.app"`
- **Windows** (x86): `"path/to/Reacher_Windows_x86/Reacher.exe"`
- **Windows** (x86_64): `"path/to/Reacher_Windows_x86_64/Reacher.exe"`
- **Linux** (x86): `"path/to/Reacher_Linux/Reacher.x86"`
- **Linux** (x86_64): `"path/to/Reacher_Linux/Reacher.x86_64"`
- **Linux** (x86, headless): `"path/to/Reacher_Linux_NoVis/Reacher.x86"`
- **Linux** (x86_64, headless): `"path/to/Reacher_Linux_NoVis/Reacher.x86_64"`

Since I'm running this on Windows 10 64-bit, and the environment is located in `./Reacher_Windows_x86_64/`, I set up the environment like so:

```python
env = UnityEnvironment(file_name="./Reacher_Windows_x86_64/Reacher.exe")
```

The cell below loads either the single-agent or the multi-agent build, depending on `env_version`.

In [4]:
if env_version.lower() == 'single':
    env = UnityEnvironment(file_name='./Reacher_Windows_x86_64/Reacher.exe')
else: #many
    env = UnityEnvironment(file_name='./ManyReachers_Windows_x86_64/Reacher.exe')
    

INFO:unityagents:
'Academy' started successfully!
Unity Academy name: Academy
        Number of Brains: 1
        Number of External Brains : 1
        Lesson number : 0
        Reset Parameters :
		goal_speed -> 1.0
		goal_size -> 5.0
Unity brain name: ReacherBrain
        Number of Visual Observations (per agent): 0
        Vector Observation space type: continuous
        Vector Observation space size (per agent): 33
        Number of stacked Vector Observation: 1
        Vector Action space type: continuous
        Vector Action space size (per agent): 4
        Vector Action descriptions: , , , 


Environments contain **_brains_** which are responsible for deciding the actions of their associated agents. Here we check for the first brain available, and set it as the default brain we will be controlling from Python.

In [5]:
# get the default brain
brain_name = env.brain_names[0]
brain = env.brains[brain_name]

### 2. Examine the State and Action Spaces

In this environment, each agent controls a double-jointed arm that must keep its hand at a moving target location. The 'many' version used here contains 20 identical agents; the 'single' version contains one. At each time step, every agent outputs an action vector of four continuous numbers, corresponding to the torques applied to the arm's two joints. Each entry should lie between `-1` and `1`.

The state space has `33` dimensions per agent and contains the position, rotation, velocity, and angular velocity of the arm. A reward of `+0.1` is provided for each step in which the agent's hand is in the goal location.

Run the code cell below to print some information about the environment.

In [6]:
# reset the environment
env_info = env.reset(train_mode=True)[brain_name]

# number of agents
num_agents = len(env_info.agents)
print('Number of agents:', num_agents)

# size of each action
action_size = brain.vector_action_space_size
print('Size of each action:', action_size)

# examine the state space 
states = env_info.vector_observations
state_size = states.shape[1]
print('There are {} agents. Each observes a state with length: {}'.format(states.shape[0], state_size))
print('The state for the first agent looks like:', states[0])

Number of agents: 20
Size of each action: 4
There are 20 agents. Each observes a state with length: 33
The state for the first agent looks like: [ 0.00000000e+00 -4.00000000e+00  0.00000000e+00  1.00000000e+00
 -0.00000000e+00 -0.00000000e+00 -4.37113883e-08  0.00000000e+00
  0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00
  0.00000000e+00  0.00000000e+00 -1.00000000e+01  0.00000000e+00
  1.00000000e+00 -0.00000000e+00 -0.00000000e+00 -4.37113883e-08
  0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00
  0.00000000e+00  0.00000000e+00  5.75471878e+00 -1.00000000e+00
  5.55726624e+00  0.00000000e+00  1.00000000e+00  0.00000000e+00
 -1.68164849e-01]
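
Before training, it can help to see what a random policy would send to the environment. The sketch below does not touch the Unity environment: it only assumes the shapes printed above (20 agents, 4 action dimensions) and the Reacher convention that every action entry lies in [-1, 1]. The helper name `random_actions` is hypothetical and is not part of this project's code.

```python
import numpy as np

def random_actions(num_agents, action_size, rng):
    """Sample one random action vector per agent and clip it to [-1, 1]."""
    actions = rng.standard_normal((num_agents, action_size))
    return np.clip(actions, -1.0, 1.0)

rng = np.random.default_rng(0)
actions = random_actions(num_agents=20, action_size=4, rng=rng)
print(actions.shape)  # (20, 4)
```

An array of this shape is what `env.step` receives in the training and testing loops, one row per agent.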


### 4. Creating a Smart Agent

In this section, we instantiate the DDPG (Deep Deterministic Policy Gradient) agent defined in `ddpg_agent.py`. We only need to specify the state size, the action size, and a seed for the random number generator. A single agent instance selects actions for all of the arms.


In [7]:
from ddpg_agent import Agent
agent = Agent(state_size=state_size, action_size=action_size, random_seed=37)
# agents = []
# for i in range(num_agents):
#     agents.append(Agent(state_size=state_size, action_size=action_size, random_seed=i))
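
The contents of `ddpg_agent.py` are not shown in this notebook. DDPG agents commonly explore by adding Ornstein-Uhlenbeck noise to the actor's output, and the `agent.reset()` call in the training loop would be a natural place to reset such a process. The sketch below illustrates that idea under those assumptions; the class name `OUNoise` and the values `theta=0.15` and `sigma=0.2` are illustrative and are not taken from the author's file.

```python
import numpy as np

class OUNoise:
    """Ornstein-Uhlenbeck process: temporally correlated exploration noise."""

    def __init__(self, size, seed, mu=0.0, theta=0.15, sigma=0.2):
        self.mu = mu * np.ones(size)   # long-run mean of the process
        self.theta = theta             # strength of the pull back to the mean
        self.sigma = sigma             # scale of the random perturbation
        self.rng = np.random.default_rng(seed)
        self.reset()

    def reset(self):
        """Return the internal state to the mean (e.g. at episode start)."""
        self.state = self.mu.copy()

    def sample(self):
        """Advance the process by one step and return the new state."""
        drift = self.theta * (self.mu - self.state)
        diffusion = self.sigma * self.rng.standard_normal(self.state.shape)
        self.state = self.state + drift + diffusion
        return self.state

noise = OUNoise(size=4, seed=37)
# Add noise to a placeholder (all-zero) action and keep it in the valid range.
noisy_action = np.clip(np.zeros(4) + noise.sample(), -1.0, 1.0)
```

Because each sample depends on the previous state, consecutive perturbations are correlated, which tends to produce smoother exploratory motion than independent noise at every step.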

### 5. Training Loop

In this step, we train the agent using the `ddpg` function. Training runs for at most `n_episodes` episodes, or stops early once the average score over the last 100 episodes reaches 32, at which point the actor and critic weights are saved. After training, the per-episode scores are plotted to show how performance evolved.
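
The stopping rule can be looked at in isolation. The following is a minimal sketch of a moving-average check over a fixed-size window, using the threshold of 32 from the training code; the function name `episodes_until_solved` and the scores are made up for illustration. Like the training loop, it evaluates the average even before the window is full.

```python
from collections import deque

def episodes_until_solved(episode_scores, target=32.0, window=100):
    """Return the 1-based episode at which the windowed average first
    reaches `target`, or None if it never does."""
    recent = deque(maxlen=window)  # keeps only the last `window` scores
    for i_episode, score in enumerate(episode_scores, start=1):
        recent.append(score)
        if sum(recent) / len(recent) >= target:
            return i_episode
    return None

made_up_scores = [10.0, 20.0, 40.0, 50.0, 60.0]
solved_at = episodes_until_solved(made_up_scores)
print(solved_at)  # 5
```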




In [8]:
if run_mode == 'train':
    def ddpg(n_episodes=1000, max_t=1000, print_every=10):
        """Train the agent with DDPG (Deep Deterministic Policy Gradient).

        Params
        ======
            n_episodes (int): maximum number of training episodes
            max_t (int): maximum number of timesteps per episode
            print_every (int): number of episodes between printouts of the windowed average score
        """
        scores = []                        # list containing scores from each episode
        scores_window = deque(maxlen=100)  # last 100 scores
        for i_episode in range(1, n_episodes+1):
            env_info = env.reset(train_mode=True)[brain_name] # reset the environment
            agent.reset()
            states = env_info.vector_observations            # get the current state

            score = 0
            # agents_score = [0 for i in range(num_agents)]
            score_old = 0
            for t in range(max_t):               
                actions = agent.act(states)
                # print('actions: ', actions)
                env_info = env.step(actions)[brain_name]        # send the action to the environment
                next_state = env_info.vector_observations      # get the next state
                rewards = env_info.rewards                      # get the reward
                dones = env_info.local_done                     # see if episode has finished
                score += np.mean(rewards)

                interval = 50
                if t % interval == 0:
                    # print(f'\npartial mean score in t={t}: {max_t*score/t}')
                    slope = (score-score_old)#/interval
                    print(f'accumulated score over {t} steps: {score} - slope: {slope}')
                    score_old = score


                for i in range(num_agents):
                    agent.step(states[i], actions[i], rewards[i], next_state[i], dones[i], t)
                 
                states = next_state
                if np.any(dones):
                    break 

            scores_window.append(score)       # add the score to the 100-episode window
            scores.append(score)              # record the score for plotting
            print('\rEpisode {}\tScore: {:.2f}\n'.format(i_episode, score), end="")
            if i_episode % print_every == 0:
                print('\rwindow (100) average score at episode {}: {:.2f}\n'.format(i_episode, np.mean(scores_window)))
            if np.mean(scores_window)>=32:
                print('\nEnvironment solved in {:d} episodes!\tAverage Score: {:.2f}\n'.format(i_episode-100, np.mean(scores_window)))

                torch.save(agent.actor_local.state_dict(), 'checkpoint_actor.pth')
                torch.save(agent.critic_local.state_dict(), 'checkpoint_critic.pth')
                # for n, agent in enumerate(agents):
                #     torch.save(agent.actor_local.state_dict(), str(n)+'_checkpoint_actor.pth')
                #     torch.save(agent.critic_local.state_dict(), str(n)+'_checkpoint_critic.pth')
                break
        return scores

    scores = ddpg(max_t=10000)

    # plot the scores
    fig = plt.figure()
    ax = fig.add_subplot(111)
    plt.plot(np.arange(len(scores)), scores)
    plt.ylabel('Score')
    plt.xlabel('Episode #')
    plt.show()

else:
    # Load the weights
    agent.actor_local.load_state_dict(torch.load('checkpoint_actor.pth'))
    agent.critic_local.load_state_dict(torch.load('checkpoint_critic.pth'))
    # for n, agent in enumerate(agents):
    #     agent.actor_local.load_state_dict(torch.load(str(n)+'_checkpoint_actor.pth'))
    #     agent.critic_local.load_state_dict(torch.load(str(n)+'_checkpoint_critic.pth'))


In [9]:
# torch.save(agent.actor_local.state_dict(), 'checkpoint_actor.pth')
# torch.save(agent.critic_local.state_dict(), 'checkpoint_critic.pth')


### 6. Testing

In this section, we evaluate the trained model over 100 episodes. The agent passes if its average score is above 30. A lower average indicates that the agent needs further training or parameter tuning.


In [10]:
n_episodes = 100  
episodes_score = [] 
for i_episode in range(1, n_episodes+1):
    env_info = env.reset(train_mode=False)[brain_name] # reset the environment
    states = env_info.vector_observations            # get the current state   
    score = 0                                   # initialize the score
    
    while True:
        actions = agent.act(states)
        env_info = env.step(np.array(actions))[brain_name]        # send the action to the environment
        next_state = env_info.vector_observations      # get the next state
        rewards = env_info.rewards                      # get the reward
        dones = env_info.local_done                     # see if episode has finished
        score += np.mean(rewards)                       # average reward over all agents
        
        states = next_state
        if np.any(dones):
            break 
    episodes_score.append(score)
    print("Episode {} Score: {}".format(i_episode, score))

score_avg = sum(episodes_score) / len(episodes_score)
if score_avg > 30:
    print("Smart Agent PASSED :) Average score = ", score_avg)
else:
    print("Smart Agent FAILED :( Average score = ", score_avg)

Episode 1 Score: 38.869499131199085
Episode 2 Score: 38.66349913580357
Episode 3 Score: 38.52849913882102
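
Both the training and testing loops accumulate `np.mean(rewards)` at every step. As a sanity check on what the reported score means, the sketch below shows that this running sum equals the average of the per-agent episode returns. The reward values are randomly generated for illustration and are not output from the environment.

```python
import numpy as np

rng = np.random.default_rng(0)
# Made-up rewards: 1000 time steps (rows) for 20 agents (columns).
rewards = rng.uniform(0.0, 0.1, size=(1000, 20))

# What the loops do: add the mean reward over agents at each step.
score_stepwise = sum(np.mean(step_rewards) for step_rewards in rewards)

# Equivalent view: compute each agent's return, then average over agents.
score_per_agent = rewards.sum(axis=0).mean()

assert np.isclose(score_stepwise, score_per_agent)
```

So each episode score printed above can be read as the mean return across all agents for that episode.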


In [None]:
env.close()