# Reacher

---

This notebook presents a solution to the second project of the [Deep Reinforcement Learning Nanodegree](https://www.udacity.com/course/deep-reinforcement-learning-nanodegree--nd893), which uses the Unity ML-Agents environment called Reacher. It solves the second version of the environment, which contains 20 agents.

In this environment, a double-jointed arm can move to target locations. A reward of +0.1 is provided for each step that the agent's hand is in the goal location. Thus, the goal of your agent is to maintain its position at the target location for as many time steps as possible.

Each episode's score is the average of the 20 agents' total rewards for that episode. The environment is considered solved when the average of these episode scores over 100 consecutive episodes is at least +30.
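
As a minimal sketch of this criterion, the snippet below computes one episode score from an array of per-step rewards and then checks the 100-episode average. The array shape, the synthetic rewards, and the helper names `episode_score` and `is_solved` are assumptions made for illustration; they are not part of the project code.

```python
import numpy as np

def episode_score(step_rewards):
    """Average, over agents, of each agent's total reward in one episode.

    step_rewards: array of shape (n_steps, n_agents).
    """
    per_agent_return = np.sum(step_rewards, axis=0)  # total reward per agent
    return float(np.mean(per_agent_return))          # average over the agents

def is_solved(episode_scores, window=100, target=30.0):
    """True once the mean of the last `window` episode scores reaches `target`."""
    if len(episode_scores) < window:
        return False
    return float(np.mean(episode_scores[-window:])) >= target

# Synthetic example: 20 agents, 1000 steps, each agent in the goal on about 35% of steps.
rng = np.random.RandomState(0)
step_rewards = 0.1 * (rng.rand(1000, 20) < 0.35)
score = episode_score(step_rewards)
print("episode score: {:.2f}".format(score))
print("solved:", is_solved([score] * 100))
```

Because averaging is linear, adding `np.mean(rewards)` at every step, as the training loop in this notebook does, gives the same episode score as averaging the per-agent totals at the end.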


### 1. Start the Environment

Run the following code cell to install a few packages. This may take a few minutes to complete.


In [2]:
!pip -q install ./python

tensorflow 1.7.1 has requirement numpy>=1.13.3, but you'll have numpy 1.12.1 which is incompatible.
ipython 6.5.0 has requirement prompt-toolkit<2.0.0,>=1.0.15, but you'll have prompt-toolkit 3.0.36 which is incompatible.
jupyter-console 6.4.3 has requirement jupyter-client>=7.0.0, but you'll have jupyter-client 5.2.4 which is incompatible.


Now you need to choose how you want to run this notebook:

- Set `run_mode` to `'train'` to train a new agent from scratch.
- Set `run_mode` to `'test'` to evaluate a pre-trained model.

To disable visualization of the agent during training and testing, set `no_rendering` to `True`.

In [1]:
# Set to 'train' to train an agent from scratch, or 'test' to evaluate a saved agent.
run_mode = 'test'

# Set to True to disable visualization of the agent.
no_rendering = True
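
In the cells shown in this notebook, `no_rendering` is set but never read, and any `run_mode` other than `'train'` falls through to the weight-loading branch. One way to guard against both is sketched below. The helper names `check_run_mode` and `env_kwargs` are hypothetical, and passing `no_graphics` to `UnityEnvironment` assumes that the installed `unityagents` version accepts that keyword.

```python
VALID_RUN_MODES = ("train", "test")

def check_run_mode(run_mode):
    """Fail early on a mistyped mode instead of silently taking the 'test' branch."""
    if run_mode not in VALID_RUN_MODES:
        raise ValueError(
            "run_mode must be one of {}, got {!r}".format(VALID_RUN_MODES, run_mode)
        )
    return run_mode

def env_kwargs(file_name, no_rendering):
    """Keyword arguments for the environment constructor (assumed `no_graphics` flag)."""
    return {"file_name": file_name, "no_graphics": bool(no_rendering)}

run_mode = check_run_mode("test")
kwargs = env_kwargs("path/to/Reacher.app", no_rendering=True)
print(run_mode, kwargs)
# Hypothetical usage: env = UnityEnvironment(**kwargs)
```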

Import the necessary packages.

In [3]:
from collections import deque
import random

import matplotlib.pyplot as plt
import numpy as np
import torch
from unityagents import UnityEnvironment

%matplotlib inline

Next, we're going to start the environment! **_Before you run the code cell below_**, ensure the `file_name` parameter matches the location of the Unity environment that you downloaded.

- **Mac**: `"path/to/Reacher.app"`
- **Windows** (x86): `"path/to/Reacher_Windows_x86/Reacher.exe"`
- **Windows** (x86_64): `"path/to/Reacher_Windows_x86_64/Reacher.exe"`
- **Linux** (x86): `"path/to/Reacher_Linux/Reacher.x86"`
- **Linux** (x86_64): `"path/to/Reacher_Linux/Reacher.x86_64"`
- **Linux** (x86, headless): `"path/to/Reacher_Linux_NoVis/Reacher.x86"`
- **Linux** (x86_64, headless): `"path/to/Reacher_Linux_NoVis/Reacher.x86_64"`
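
If the notebook is shared across machines, the path can be chosen from the operating system name instead of being edited by hand. The sketch below is one possible approach: the helper `reacher_build_path` is hypothetical, only the 64-bit builds from the list above are covered, and `path/to` remains a placeholder for wherever the build was unpacked.

```python
import platform

# 64-bit build paths copied from the list above; "path/to" is a placeholder.
BUILD_PATHS = {
    "Darwin": "path/to/Reacher.app",
    "Windows": "path/to/Reacher_Windows_x86_64/Reacher.exe",
    "Linux": "path/to/Reacher_Linux/Reacher.x86_64",
}
HEADLESS_LINUX = "path/to/Reacher_Linux_NoVis/Reacher.x86_64"

def reacher_build_path(system=None, headless=False):
    """Return the build path for `system` (defaults to the current platform)."""
    system = system or platform.system()
    if system == "Linux" and headless:
        return HEADLESS_LINUX
    if system not in BUILD_PATHS:
        raise ValueError("no Reacher build listed for platform {!r}".format(system))
    return BUILD_PATHS[system]

print(reacher_build_path("Linux", headless=True))
```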

For example, on 64-bit Windows 10 with the environment located at `./Reacher_Windows_x86_64/Reacher.exe`, the call would be:

```python
env = UnityEnvironment(file_name="./Reacher_Windows_x86_64/Reacher.exe")
```

The cell below instead loads the headless Linux build from `/data/Reacher_Linux_NoVis/Reacher.x86_64`.

In [4]:
env = UnityEnvironment(file_name='/data/Reacher_Linux_NoVis/Reacher.x86_64') 

INFO:unityagents:
'Academy' started successfully!
Unity Academy name: Academy
        Number of Brains: 1
        Number of External Brains : 1
        Lesson number : 0
        Reset Parameters :
		goal_speed -> 1.0
		goal_size -> 5.0
Unity brain name: ReacherBrain
        Number of Visual Observations (per agent): 0
        Vector Observation space type: continuous
        Vector Observation space size (per agent): 33
        Number of stacked Vector Observation: 1
        Vector Action space type: continuous
        Vector Action space size (per agent): 4
        Vector Action descriptions: , , , 


Environments contain **_brains_** which are responsible for deciding the actions of their associated agents. Here we check for the first brain available, and set it as the default brain we will be controlling from Python.

In [5]:
# get the default brain
brain_name = env.brain_names[0]
brain = env.brains[brain_name]

### 2. Examine the State and Action Spaces

The observation space consists of 33 variables corresponding to the position, rotation, velocity, and angular velocity of the arm. Each action is a vector of four numbers, corresponding to the torques applied to the two joints. Every entry in the action vector must be a number between -1 and 1.
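
To illustrate the action format, the sketch below builds one batch of random actions for all agents and clips it to the valid range. The sizes 20 and 4 are hard-coded to match what this environment reports, and the random policy is only a stand-in; it is not the agent trained in this notebook.

```python
import numpy as np

num_agents, action_size = 20, 4  # values reported by this environment

rng = np.random.RandomState(0)
raw_actions = rng.randn(num_agents, action_size)  # unbounded Gaussian samples
actions = np.clip(raw_actions, -1.0, 1.0)         # enforce the valid range [-1, 1]

print(actions.shape, actions.min(), actions.max())
```

An array of this shape, one row per agent, is what `env.step` receives in the training and testing loops.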

In [6]:
# reset the environment
env_info = env.reset(train_mode=True)[brain_name]

# number of agents
num_agents = len(env_info.agents)
print('Number of agents:', num_agents)

# size of each action
action_size = brain.vector_action_space_size
print('Size of each action:', action_size)

# examine the state space 
states = env_info.vector_observations
state_size = states.shape[1]
print('There are {} agents. Each observes a state with length: {}'.format(states.shape[0], state_size))
print('The state for the first agent looks like:', states[0])

Number of agents: 20
Size of each action: 4
There are 20 agents. Each observes a state with length: 33
The state for the first agent looks like: [  0.00000000e+00  -4.00000000e+00   0.00000000e+00   1.00000000e+00
  -0.00000000e+00  -0.00000000e+00  -4.37113883e-08   0.00000000e+00
   0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
   0.00000000e+00   0.00000000e+00  -1.00000000e+01   0.00000000e+00
   1.00000000e+00  -0.00000000e+00  -0.00000000e+00  -4.37113883e-08
   0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
   0.00000000e+00   0.00000000e+00   5.75471878e+00  -1.00000000e+00
   5.55726624e+00   0.00000000e+00   1.00000000e+00   0.00000000e+00
  -1.68164849e-01]


### 4. Creating a Smart Agent

In this section, we will instantiate a DDPG agent defined in `ddpg_agent.py` and `model.py`.

In [7]:
from ddpg_agent import Agent
agent = Agent(state_size=state_size, action_size=action_size, random_seed=37)
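
The contents of `ddpg_agent.py` are not reproduced here. DDPG implementations typically store transitions in a replay buffer and learn from random minibatches; with 20 agents, each environment step contributes 20 transitions to a shared buffer. The sketch below shows a generic buffer under that assumption. It is not the author's implementation, and the class name `ReplayBuffer`, the `Experience` tuple, and the sizes are hypothetical.

```python
import random
from collections import deque, namedtuple

Experience = namedtuple(
    "Experience", ["state", "action", "reward", "next_state", "done"]
)

class ReplayBuffer:
    """Fixed-size store of transitions with uniform random sampling."""

    def __init__(self, buffer_size, batch_size, seed=0):
        self.memory = deque(maxlen=buffer_size)  # oldest transitions are dropped first
        self.batch_size = batch_size
        self.rng = random.Random(seed)

    def add(self, state, action, reward, next_state, done):
        self.memory.append(Experience(state, action, reward, next_state, done))

    def sample(self):
        """Draw `batch_size` distinct transitions uniformly at random."""
        return self.rng.sample(list(self.memory), self.batch_size)

    def __len__(self):
        return len(self.memory)

# One environment step with 20 agents adds 20 transitions.
buffer = ReplayBuffer(buffer_size=1000, batch_size=8)
for agent_id in range(20):
    buffer.add(
        state=[0.0] * 33,
        action=[0.0] * 4,
        reward=0.1,
        next_state=[0.0] * 33,
        done=False,
    )
batch = buffer.sample()
print(len(buffer), len(batch))
```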

### 5. Training Loop

In this step, we train the agent with the `ddpg` function. Training runs for at most `n_episodes` episodes, or until the average score over the most recent episodes (a window of up to 100) reaches 32, which leaves a margin above the solve threshold of 30. After training, the per-episode scores are plotted to show how the agent's performance evolved.




In [8]:
if run_mode == 'train':
    def ddpg(n_episodes=1000, max_t=1000, print_every=10):
        """
        Train the agent with DDPG (Deep Deterministic Policy Gradient).

        Args:
            n_episodes (int): maximum number of training episodes
            max_t (int): maximum number of timesteps per episode
            print_every (int): frequency of printing information
        """
        scores = []  # list containing scores from each episode
        scores_window = deque(maxlen=100)  # last 100 scores

        for i_episode in range(1, n_episodes + 1):
            env_info = env.reset(train_mode=True)[brain_name]  # reset the environment
            agent.reset()
            states = env_info.vector_observations  # get the current state
            score = 0
            score_old = 0

            for t in range(max_t):
                actions = agent.act(states)  # agent takes an action
                env_info = env.step(actions)[brain_name]  # send the action to the environment
                next_state = env_info.vector_observations  # get the next state
                rewards = env_info.rewards  # get the reward
                dones = env_info.local_done  # check if episode has finished
                score += np.mean(rewards)

                if t % 50 == 0:  # print accumulated score every 50 steps
                    slope = score - score_old
                    print(f'accumulated score over {t} steps: {score} - slope: {slope}')
                    score_old = score

                for i in range(num_agents):  # update the agent's state
                    agent.step(states[i], actions[i], rewards[i], next_state[i], dones[i], t)

                states = next_state
                if np.any(dones):
                    break

            scores_window.append(score)  # save most recent score
            scores.append(score)  # save most recent score

            print('\rEpisode {}\tScore: {:.2f}\n'.format(i_episode, score), end="")

            if i_episode % print_every == 0:  # print average score every "print_every" episodes
                print('\rwindow (100) average score at episode {}: {:.2f}\n'.format(i_episode, np.mean(scores_window)))

            if np.mean(scores_window) >= 32:  # stop when average score is 32 or above
                print('\nEnvironment solved in {:d} episodes!\tAverage Score: {:.2f}\n'.format(i_episode-100, np.mean(scores_window)))
                torch.save(agent.actor_local.state_dict(), 'checkpoint_actor.pth')
                torch.save(agent.critic_local.state_dict(), 'checkpoint_critic.pth')
                break

        return scores

    scores = ddpg(max_t=10000)

    # plot the scores
    fig = plt.figure()
    ax = fig.add_subplot(111)
    plt.plot(np.arange(len(scores)), scores)
    plt.ylabel('Score')
    plt.xlabel('Episode #')
    plt.show()

else:  # load weights if not in training mode
    agent.actor_local.load_state_dict(torch.load('checkpoint_actor.pth'))
    agent.critic_local.load_state_dict(torch.load('checkpoint_critic.pth'))


![Training scores per episode](./training.png)
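
Per-episode scores can be noisy, so a running average over the same 100-episode window used by the stopping rule is often plotted alongside them. A minimal sketch follows, assuming a list of per-episode scores like the one returned by `ddpg`; the synthetic data and the helper name `rolling_mean` are illustrative only.

```python
import numpy as np

def rolling_mean(scores, window=100):
    """Mean of the last `window` scores at each episode (shorter window at the start)."""
    scores = np.asarray(scores, dtype=float)
    cumulative = np.cumsum(np.insert(scores, 0, 0.0))  # cumulative[k] = sum of first k scores
    out = np.empty_like(scores)
    for i in range(len(scores)):
        start = max(0, i + 1 - window)
        out[i] = (cumulative[i + 1] - cumulative[start]) / (i + 1 - start)
    return out

# Synthetic scores that rise from 0 and level off near 38.
episodes = np.arange(150)
scores = 38.0 * (1.0 - np.exp(-episodes / 30.0))
smoothed = rolling_mean(scores)
print(smoothed[-1])
# To plot both curves: plt.plot(scores); plt.plot(smoothed)
```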

### 6. Testing

In this section, we evaluate the performance of the trained model over 100 episodes. Our agent is considered successful if it achieves an average score of 30 or higher. If the average score falls below this threshold, it indicates that the agent requires further training or parameter tuning.


In [9]:
n_episodes = 100  
episodes_score = [] 
for i_episode in range(1, n_episodes+1):
    env_info = env.reset(train_mode=False)[brain_name] # reset the environment
    states = env_info.vector_observations            # get the current state   
    score = 0                                   # initialize the score
    
    while True:
        actions = agent.act(states)
        env_info = env.step(np.array(actions))[brain_name]  # send the actions to the environment
        next_state = env_info.vector_observations           # get the next states
        rewards = env_info.rewards                          # get the rewards
        dones = env_info.local_done                         # check whether any episode has finished
        score += np.mean(rewards)                           # accumulate the mean reward across agents
        
        states = next_state
        if np.any(dones):
            break 
    episodes_score.append(score)
    print("Episode {} Score: {}".format(i_episode, score))

score_avg = sum(episodes_score) / len(episodes_score)
if score_avg >= 30:
    print("Smart Agent PASSED :) Average score = ", score_avg)
else:
    print("Smart Agent FAILED :( Average score = ", score_avg)

Episode 1 Score: 38.808999132551364
Episode 2 Score: 38.62599913664179
Episode 3 Score: 38.59999913722284
Episode 4 Score: 38.52199913896634
Episode 5 Score: 38.97349912887452
Episode 6 Score: 38.48499913979346
Episode 7 Score: 38.772999133356045
Episode 8 Score: 38.145999147370624
Episode 9 Score: 38.19249914633127
Episode 10 Score: 38.576999137737005
Episode 11 Score: 38.83549913195904
Episode 12 Score: 38.98149912869566
Episode 13 Score: 38.7299991343172
Episode 14 Score: 38.616999136842985
Episode 15 Score: 38.502999139391164
Episode 16 Score: 38.02399915009765
Episode 17 Score: 38.208999145962544
Episode 18 Score: 38.26649914467729
Episode 19 Score: 38.713999134674786
Episode 20 Score: 37.714499157015474
Episode 21 Score: 38.67749913549065
Episode 22 Score: 38.58149913763647
Episode 23 Score: 38.95949912918746
Episode 24 Score: 38.46349914027398
Episode 25 Score: 38.375999142229865
Episode 26 Score: 38.246999145113136
Episode 27 Score: 38.296999143995535
Episode 28 Score: 39.05799

In [10]:
env.close()