# Reacher

---

This notebook presents a solution to the second project of the [Deep Reinforcement Learning Nanodegree](https://www.udacity.com/course/deep-reinforcement-learning-nanodegree--nd893). The project uses the Unity ML-Agents environment called Reacher. Here we solve the second version of the environment, which includes 20 agents.

In this environment, a double-jointed arm can move to target locations. A reward of +0.1 is provided for each step that the agent's hand is in the goal location. Thus, the goal of your agent is to maintain its position at the target location for as many time steps as possible.

The environment is considered solved when the score, averaged over all 20 agents within each episode and then over 100 consecutive episodes, is at least +30.
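One way to read this criterion is sketched below, under the assumption that the rewards of one episode are stored as a `(timesteps, num_agents)` array. The helper `episode_score` and the toy data are illustrative only and are not part of the project code.

```python
import numpy as np

def episode_score(rewards):
    """Sum rewards over time for each agent, then average over agents.

    rewards: array-like of shape (timesteps, num_agents).
    """
    rewards = np.asarray(rewards, dtype=float)
    return rewards.sum(axis=0).mean()

# Toy data: 20 agents, each collecting +0.1 on 400 of 1000 steps,
# which gives a per-episode score of about 40.
toy_rewards = np.zeros((1000, 20))
toy_rewards[:400, :] = 0.1

# Pretend the same episode repeats 100 times and average over that window.
scores = [episode_score(toy_rewards)] * 100
window_average = float(np.mean(scores[-100:]))
print(window_average >= 30.0)
```

The per-episode mean over agents is taken first; the 100-episode average is then computed over those per-episode values.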


### 1. Start the Environment

Run the following code cell to install a few packages. This may take a few minutes to complete.


In [1]:
# !pip -q install ./python

Now you need to choose how you want to run this notebook:

- Set 'run_mode' to "train" if you want to train a new agent from scratch.
- Alternatively, set 'run_mode' to "test" if you want to evaluate a pre-trained model.

If you prefer not to visualize the agent during the training and testing process, change 'no_rendering' to True.

In [2]:
# Select 'train' to train an agent from scratch or 'test' to test a saved agent.
run_mode = 'train'

Importing some necessary packages.

In [3]:
from unityagents import UnityEnvironment
import numpy as np
from collections import deque
import random
import torch

import matplotlib.pyplot as plt
%matplotlib inline

Next, we're going to start the environment! **_Before you run the code cell below_**, ensure the `file_name` parameter matches the location of the Unity environment that you downloaded.

- **Mac**: `"path/to/Reacher.app"`
- **Windows** (x86): `"path/to/Reacher_Windows_x86/Reacher.exe"`
- **Windows** (x86_64): `"path/to/Reacher_Windows_x86_64/Reacher.exe"`
- **Linux** (x86): `"path/to/Reacher_Linux/Reacher.x86"`
- **Linux** (x86_64): `"path/to/Reacher_Linux/Reacher.x86_64"`
- **Linux** (x86, headless): `"path/to/Reacher_Linux_NoVis/Reacher.x86"`
- **Linux** (x86_64, headless): `"path/to/Reacher_Linux_NoVis/Reacher.x86_64"`
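
If you switch machines often, the choice in the list above can be made programmatically. The following is a minimal sketch: `default_env_path` is a hypothetical helper, the `path/to` prefix is the same placeholder used in the list, and the 64-bit check reflects the Python interpreter rather than the operating system.

```python
import platform
import sys

def default_env_path(system=None, is_64bit=None, headless=False):
    """Return the placeholder Reacher path for a platform (hypothetical helper)."""
    system = platform.system() if system is None else system
    # sys.maxsize tells us whether the *interpreter* is 64-bit (an assumption
    # that it matches the build of the environment you downloaded).
    is_64bit = (sys.maxsize > 2**32) if is_64bit is None else is_64bit

    if system == "Darwin":
        return "path/to/Reacher.app"
    if system == "Windows":
        folder = "Reacher_Windows_x86_64" if is_64bit else "Reacher_Windows_x86"
        return f"path/to/{folder}/Reacher.exe"
    if system == "Linux":
        folder = "Reacher_Linux_NoVis" if headless else "Reacher_Linux"
        binary = "Reacher.x86_64" if is_64bit else "Reacher.x86"
        return f"path/to/{folder}/{binary}"
    raise ValueError(f"Unsupported platform: {system}")

print(default_env_path())
```

Replace `path/to` with the directory where you unpacked the environment before passing the result as `file_name`.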

Since I'm running this on Windows 10 64-bit, and the environment is located in "./Reacher_Windows_x86_64/Reacher.exe", I'm going to set my environment like so:

```python
env = UnityEnvironment(file_name="./Reacher_Windows_x86_64/Reacher.exe")
```


In [4]:
import gym
from gym import spaces
import numpy as np

class RastriginEnv(gym.Env):
    def __init__(self, dim=1, num_agents=1, A=10, lower_bound=-5.12, upper_bound=5.12, action_bound=0.1, max_steps=1000):
        super(RastriginEnv, self).__init__()

        # Define the action and observation space
        self.A = A
        self.dim = dim
        self.num_agents = num_agents
        self.action_bound = action_bound
        self.action_space = spaces.Box(low=-1, high=1, shape=(num_agents, dim), dtype=np.float32)
        self.observation_space = spaces.Box(low=lower_bound, high=upper_bound, shape=(num_agents, dim), dtype=np.float32)
        self.max_steps = max_steps

        # Initialize state
        self.state = np.zeros((num_agents, dim))
        self.step_counter = 0

    def rastrigin(self, x):
        return self.A*self.dim + np.sum(x**2 - self.A*np.cos(2*np.pi*x), axis=-1)

    def step(self, action):
        scaled_action = action * self.action_bound
        self.state = np.clip(self.state + scaled_action, self.observation_space.low, self.observation_space.high)
        reward = -self.rastrigin(self.state)
        self.step_counter += 1
        done = self.step_counter >= self.max_steps
        return self.state, reward, done, {}

    def reset(self):
        self.state = self.observation_space.sample()
        self.step_counter = 0
        return self.state

    def render(self, mode='human'):
        pass  # We won't implement a visual render for this task
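
The `rastrigin` method in the class above evaluates the Rastrigin test function, f(x) = A·n + Σ (x_i² − A·cos(2π·x_i)), whose global minimum is 0 at the origin. Because `step` returns the negated value as the reward, the best achievable reward is 0. The standalone sketch below repeats the formula outside the class so it can be checked on a couple of points; the free function and the sample points are for illustration only.

```python
import numpy as np

def rastrigin(x, A=10):
    """Rastrigin function evaluated along the last axis of x."""
    x = np.asarray(x, dtype=float)
    n = x.shape[-1]
    return A * n + np.sum(x**2 - A * np.cos(2 * np.pi * x), axis=-1)

origin = np.zeros((1, 5))      # global minimum
ones = np.ones((1, 5))         # a nearby local minimum region
batch = np.zeros((20, 5))      # one row per agent, as in RastriginEnv

print(rastrigin(origin))
print(rastrigin(ones))
print(rastrigin(batch).shape)  # one value per agent
```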


In [5]:
env = RastriginEnv(dim=5, num_agents=20, lower_bound=-3, upper_bound=3, action_bound=0.5)
obs = env.reset()
print("Initial observation:", obs)
action = env.action_space.sample()
obs, reward, done, info = env.step(action)
print("Next observation:", obs)
print("Reward:", reward)


Initial observation: [[-1.8398514  -0.92133987  0.12140139  1.403267   -2.0415401 ]
 [-0.01456831 -2.1628232  -2.019801   -1.5671711   0.48864836]
 [-1.6684493  -2.818995   -1.0669568  -1.5613036  -2.5588317 ]
 [ 2.49198    -2.594826   -1.0294344   1.4365749  -1.2727402 ]
 [ 0.43790284 -0.56863374 -0.85580206  0.128077    2.23805   ]
 [ 2.0975633   1.0148455   2.5744777   1.4767663   1.6508489 ]
 [-1.4667577   1.5167792   0.3381715   0.095038   -1.4715967 ]
 [ 0.47181317 -2.9183812   1.8668735   0.69830495  0.54932064]
 [-1.9897789   0.17949325  0.9706049  -1.1805102  -1.9408485 ]
 [ 0.52977455 -2.8028681  -0.06323256 -0.4877141   1.5689406 ]
 [ 0.86965364  2.770099   -2.769549   -0.20977056  2.114638  ]
 [-1.4119538   1.89087    -0.66991854 -2.4841287  -0.89261997]
 [-0.15749717  0.8753817  -0.97738516 -0.29472995 -2.8710322 ]
 [-0.8818643   1.6479303  -2.7280545  -2.7971487   0.561728  ]
 [ 2.8042388  -2.765727    2.6656823   1.3633544  -0.89852905]
 [ 2.6841755   2.4390404   0.70006

In [6]:
# env = UnityEnvironment(file_name='/data/Reacher_Linux_NoVis/Reacher.x86_64') 

Environments contain **_brains_** which are responsible for deciding the actions of their associated agents. Here we check for the first brain available, and set it as the default brain we will be controlling from Python.

In [7]:
# # get the default brain
# brain_name = env.brain_names[0]
# brain = env.brains[brain_name]

### 2. Examine the State and Action Spaces

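In the original Reacher environment, the observation space consists of 33 variables corresponding to the position, rotation, velocity, and angular velocity of the arm. Each action is a vector of four numbers, corresponding to the torque applied to the two joints, and every entry in the action vector should be a number between -1 and 1. The `RastriginEnv` stand-in created above reports different sizes: with `dim=5`, each agent's state and action both have length 5.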
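
As a small illustration of the action bound, the sketch below draws random actions for all agents and clips them to [-1, 1]. The sizes are the Reacher values quoted above (20 agents, four action entries); the random generator and seed are arbitrary choices for the example.

```python
import numpy as np

num_agents, action_size = 20, 4   # Reacher values; RastriginEnv uses `dim` instead
rng = np.random.default_rng(0)

# Standard-normal samples can fall outside the valid range, so clip them.
actions = rng.standard_normal((num_agents, action_size))
actions = np.clip(actions, -1.0, 1.0)

print(actions.shape, actions.min(), actions.max())
```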

In [8]:
# reset the environment
states = env.reset()

# number of agents
num_agents = env.num_agents
print('Number of agents:', num_agents)

# size of each action
# action = env.action_space.sample()
# action_size = brain.vector_action_space_size
action_size = len(env.action_space.sample()[0])
print('Size of each action:', action_size)

# examine the state space 
state_size = states.shape[1]
print('There are {} agents. Each observes a state with length: {}'.format(states.shape[0], state_size))
print('The state for the first agent looks like:', states[0])

Number of agents: 20
Size of each action: 5
There are 20 agents. Each observes a state with length: 5
The state for the first agent looks like: [ 2.7110703  2.458031   2.2814033  2.891583  -1.2251228]


In [9]:
print(states.shape)
print(type(states))


(20, 5)
<class 'numpy.ndarray'>


### 4. Creating a Smart Agent

In this section, we instantiate a DDPG agent. The `Agent` class is imported from `ddpg_agent_curiosity.py`, with the network definitions in `model.py`.

In [10]:
from ddpg_agent_curiosity import Agent
agent = Agent(state_size=state_size, action_size=action_size, random_seed=37)

### 5. Training Loop

In this step, we train the agent with the `ddpg` function. Training runs for up to `n_episodes` episodes; the early-stopping check on the 100-episode average score is commented out in the cell below, so it must be re-enabled if you want training to stop once the target is reached. After training, the per-episode scores are plotted to show how performance evolved over the course of training.
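
The 100-episode stopping rule can be sketched independently of the agent. The example below keeps a rolling window in a `deque`, as the training cell does with `scores_window`, and reports the first episode at which the window mean reaches the target. The function name `first_solved_episode` and the synthetic scores are assumptions made for illustration.

```python
from collections import deque

import numpy as np

def first_solved_episode(scores, window=100, target=30.0):
    """Return the 1-based episode where the rolling mean first reaches `target`, else None."""
    recent = deque(maxlen=window)  # automatically discards scores older than `window`
    for i_episode, score in enumerate(scores, start=1):
        recent.append(score)
        if len(recent) == window and np.mean(recent) >= target:
            return i_episode
    return None

# Synthetic scores that ramp from 0 to 40 and then plateau (illustrative data only).
synthetic = list(np.linspace(0.0, 40.0, 200)) + [40.0] * 100
print(first_solved_episode(synthetic))
```

Requiring a full window before testing the mean avoids declaring success from a handful of lucky early episodes.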




In [11]:
if run_mode == 'train':
    def ddpg(n_episodes=10000, max_t=1000, print_every=10):
        """
        Train the agent with Deep Deterministic Policy Gradient (DDPG).

        Args:
            n_episodes (int): maximum number of training episodes
            max_t (int): maximum number of timesteps per episode
            print_every (int): frequency of printing information
        """
        scores = []  # list containing scores from each episode
        scores_window = deque(maxlen=100)  # last 100 scores

        for i_episode in range(1, n_episodes + 1):
            states = env.reset()  # reset the environment
            agent.reset()
            # states = env_info.vector_observations  # get the current state
            score = 0
            score_old = 0

            for t in range(max_t):
                # print(type(states))
                actions = agent.act(states)  # agent takes an action
                next_state, rewards, done, info  = env.step(actions)  # send the action to the environment
                # next_state = obs
                # rewards = env_info.rewards  # get the reward
                # dones = env_info.local_done  # check if episode has finished
                score += np.mean(-1*rewards)

                # print(f'Function value: {rewards}')

                # if t % 50 == 0:  # print accumulated score every 50 steps
                #     slope = score - score_old
                #     print(f'accumulated score over {t} steps: {score} - slope: {slope}')
                #     score_old = score

                for i in range(num_agents):  # update the agent's state
                    agent.step(states[i], actions[i], rewards[i], next_state[i], done, t)

                states = next_state
                if done:
                    break

            scores_window.append(score/max_t)  # save most recent score
            scores.append(score/max_t)  # save most recent score

            # `end` is an argument of print(), not of str.format()
            print('\rEpisode {}\tMin score: {:.2f}, average score: {:.2f}\n'.format(i_episode, min(-1*rewards), score/max_t), end="")

            # if i_episode % print_every == 0:  # print average score every "print_every" episodes
            #     print('\rwindow (100) average score at episode {}: {:.2f}\n'.format(i_episode, np.mean(scores_window)))

            # if np.mean(scores_window) >= 32:  # stop when average score is 32 or above
            #     print('\nEnvironment solved in {:d} episodes!\tAverage Score: {:.2f}\n'.format(i_episode-100, np.mean(scores_window)))
            #     torch.save(agent.actor_local.state_dict(), 'checkpoint_actor.pth')
            #     torch.save(agent.critic_local.state_dict(), 'checkpoint_critic.pth')
            #     break

        return scores

    scores = ddpg()

    # plot the scores
    fig = plt.figure()
    ax = fig.add_subplot(111)
    plt.plot(np.arange(len(scores)), scores)
    plt.ylabel('Score')
    plt.xlabel('Episode #')
    plt.show()

else:  # load weights if not in training mode
    agent.actor_local.load_state_dict(torch.load('checkpoint_actor.pth'))
    agent.critic_local.load_state_dict(torch.load('checkpoint_critic.pth'))


KeyboardInterrupt: 

### 6. Testing

In this section, we evaluate the performance of the trained model over 100 episodes. Our agent is considered successful if it achieves an average score of 30 or higher. If the average score falls below this threshold, it indicates that the agent requires further training or parameter tuning.


In [None]:
n_episodes = 100  
episodes_score = [] 
for i_episode in range(1, n_episodes+1):
    env_info = env.reset(train_mode=False)[brain_name] # reset the environment
    states = env_info.vector_observations            # get the current state   
    score = 0                                   # initialize the score
    
    while True:
        actions = agent.act(states)
        env_info = env.step(np.array(actions))[brain_name]        # send the action to the environment
        next_state = env_info.vector_observations      # get the next state
        rewards = env_info.rewards                      # get the reward
        dones = env_info.local_done                     # see if episode has finished
        score += np.mean(env_info.rewards)
        
        states = next_state
        if np.any(dones):
            break 
    episodes_score.append(score)
    print("Episode {} Score: {}".format(i_episode, score))

score_avg = sum(episodes_score) / len(episodes_score)
if score_avg >= 30:
    print("Smart Agent PASSED :) Average score = ", score_avg)
else:
    print("Smart Agent FAILED :( Average score = ", score_avg)

TypeError: reset() got an unexpected keyword argument 'train_mode'

In [None]:
env.close()