# Continuous Control

---

This repository is based on the continuous control problem studied in the [Deep Reinforcement Learning Nanodegree](https://www.udacity.com/course/deep-reinforcement-learning-nanodegree--nd893) program.



### 1. Start the Environment

The environment the agent operates in is based on a pre-built [Unity ML-Agent](https://github.com/Unity-Technologies/ml-agents) and is provided by the *CoControlEnv* class in the *cocontrol.environment* package.

This class encapsulates the pre-built Unity environment, which must be downloaded to one of the following locations **_before running the code cells below_**:

- **Mac**: `"cocontrol/resources/Reacher.app"`
- **Windows** (x86): `"cocontrol/resources/Reacher_Windows_x86/Reacher.exe"`
- **Windows** (x86_64): `"cocontrol/resources/Reacher_Windows_x86_64/Reacher.exe"`
- **Linux** (x86): `"cocontrol/resources/Reacher_Linux/Reacher.x86"`
- **Linux** (x86_64): `"cocontrol/resources/Reacher_Linux/Reacher.x86_64"`
- **Linux** (x86, headless): `"cocontrol/resources/Reacher_Linux_NoVis/Reacher.x86"`
- **Linux** (x86_64, headless): `"cocontrol/resources/Reacher_Linux_NoVis/Reacher.x86_64"`

It is important that the *cocontrol/resources* folder **_contains only the environment build for your operating system_**!
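
As a rough aid for picking the right location from the list above, the following sketch maps a platform to the expected path. The helper `expected_build_path` and its `headless` flag are hypothetical and not part of the repository; treating any machine name ending in `64` as 64-bit is a simplifying assumption.

```python
import platform


def expected_build_path(system=None, machine=None, headless=False):
    """Return the relative path where the Reacher build is expected.

    Hypothetical helper for illustration; defaults to the current platform.
    """
    system = system or platform.system()
    machine = (machine or platform.machine()).lower()
    is_64bit = machine.endswith("64")  # assumption: covers x86_64 and amd64

    if system == "Darwin":
        return "cocontrol/resources/Reacher.app"
    if system == "Windows":
        folder = "Reacher_Windows_x86_64" if is_64bit else "Reacher_Windows_x86"
        return f"cocontrol/resources/{folder}/Reacher.exe"
    if system == "Linux":
        folder = "Reacher_Linux_NoVis" if headless else "Reacher_Linux"
        binary = "Reacher.x86_64" if is_64bit else "Reacher.x86"
        return f"cocontrol/resources/{folder}/{binary}"
    raise ValueError(f"Unsupported platform: {system}")


print(expected_build_path("Linux", "x86_64", headless=True))
```

Calling `expected_build_path()` without arguments uses the values reported by `platform.system()` and `platform.machine()`.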

If the code cell below returns an error, please revisit the installation instructions to double-check that you have installed [Unity ML-Agents](https://github.com/Unity-Technologies/ml-agents/blob/master/docs/Installation.md) and [NumPy](http://www.numpy.org/).

In [None]:
import pkg_resources
import random
import torch
import numpy as np

from unityagents import UnityEnvironment
from unityagents.exception import UnityEnvironmentException

PLATFORM_PATHS = ['Reacher.app',
        'Reacher_Windows_x86/Reacher.exe',
        'Reacher_Windows_x86_64/Reacher.exe',
        'Reacher_Linux/Reacher.x86',
        'Reacher_Linux_NoVis/Reacher.x86',
        'Reacher_Linux/Reacher.x86_64',
        'Reacher_Linux_NoVis/Reacher.x86_64']

In [None]:
# %load -s CoControlEnv cocontrol/environment
class CoControlEnv:
    """Continuous control (Reacher) environment.

    The environment accepts actions and provides states and rewards in response.
    """

    def __init__(self):
        for path in PLATFORM_PATHS:
            try:
                unity_resource = pkg_resources.resource_filename('cocontrol', 'resources/' + path)
                self._env = UnityEnvironment(file_name=unity_resource)
                print("Environment loaded from " + path)
                break
            except UnityEnvironmentException as e:
                print("Attempted to load " + path + ":")
                print(e)
                print("")
                pass

        if not hasattr(self, '_env'):
            raise Exception("No unity environment found, setup the environment as described in the README.")

        # get the default brain
        self._brain_name = self._env.brain_names[0]
        self._brain = self._env.brains[self._brain_name]

        self._info = None
        self._scores = None

    def generate_episode(self, agent, max_steps=None, train_mode=False):
        """Create a generator for an episode driven by an agent.

        Args:
            agent: An agent that provides the next actions for the given states.
            max_steps: Maximum number of steps (int) to take in the episode. If
                None, the episode is generated until a terminal state is reached.
            train_mode: Whether the environment is reset in training mode.

        Returns:
            A generator providing tuples of the current states, the actions taken,
            the obtained rewards, the next states and flags indicating whether
            each next state is terminal or not.
        """
        states = self.reset(train_mode=train_mode)
        is_terminal = False
        count = 0

        while not is_terminal and (max_steps is None or count < max_steps):
            actions = agent.act(states)
            rewards, next_states, is_terminals = self.step(actions)

            step_data = (states, actions, rewards, next_states, is_terminals)

            states = next_states
            is_terminal = np.any(is_terminals)
            count += 1

            yield step_data

    def reset(self, train_mode=False):
        """Reset and initiate a new episode in the environment.

        Args:
            train_mode: Indicate if the environment should be initiated in
                training mode or not.

        Returns:
            The initial state of the episode (np.array).
        """
        if self._info is not None and not np.any(self._info.local_done):
            raise Exception("Env is active, call terminate first")

        self._info = self._env.reset(train_mode=train_mode)[self._brain_name]
        # one cumulative score per agent, kept as an array for element-wise updates
        self._scores = np.zeros(self.get_agent_size())

        return self._info.vector_observations

    def step(self, actions):
        """Execute the given actions on all agent instances.

        Args:
            actions: A tensor or array of floats with one action vector per
                agent instance.

        Returns:
            A tuple containing the rewards (floats), the next states (np.array) and
            booleans indicating whether each next state is terminal or not.
        """
        if self._info is None:
            raise Exception("Env is not active, call reset first")

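        if torch.is_tensor(actions):
            # detach and move to the CPU so the conversion also works for
            # tensors that track gradients or live on a GPU
            actions = actions.detach().cpu().numpy()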

        self._info = self._env.step(actions)[self._brain_name]
        next_states = self._info.vector_observations
        rewards = self._info.rewards
        is_terminals = self._info.local_done
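        # element-wise accumulation per agent (`+=` on a plain list would append instead)
        self._scores = np.add(self._scores, rewards)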

        return rewards, next_states, is_terminals

    def terminate(self):
        self._info = None
        self._scores = None

    def close(self):
        self._env.close()
        self._info = None

    def get_score(self):
        """Return the cumulative rewards of all agents in the current episode."""
        return self._scores

    def get_agent_size(self):
        if self._info is None:
            raise ValueError("No agents are initialized")

        return len(self._info.agents)

    def get_action_size(self):
        return self._brain.vector_action_space_size

    def get_state_size(self):
        return self._brain.vector_observation_space_size


To generate episodes in the environment, the *CoControlAgent* class provides an agent that selects actions for a given state of the environment based on a policy function:


In [None]:
# %load -s CoControlAgent cocontrol/environment
class CoControlAgent:
    """Agent based on a policy approximator."""

    def __init__(self, pi):
        """Initialize the agent.

        Args:
            pi: policy function that is callable with n states and returns an
                (n, a)-dim array-like containing the action vector for each state.
        """
        self._pi = pi

    def act(self, states):
        """Select actions for the given states.

        Args:
            states: An array-like of states to choose the actions for.
        Returns:
            An array-like of floats representing the actions.
        """
        if not torch.is_tensor(states):
            try:
                states = torch.from_numpy(states)
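            except (TypeError, ValueError):
                # e.g. a plain list of states; the np.float alias was removed in NumPy 1.24
                states = torch.from_numpy(np.array(states, dtype=np.float64))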

        states = states.float()

        with torch.no_grad():
            return self._pi(states)
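
To make the contract of `pi` concrete, here is a minimal sketch of a policy that could be passed to *CoControlAgent*. The network `policy_net`, its hidden size of 64, and the batch of two states are illustrative assumptions; the state size of 33 and action size of 4 are the values this notebook quotes for the Reacher environment in section 2.

```python
import torch
from torch import nn

STATE_SIZE = 33   # observation size quoted in section 2
ACTION_SIZE = 4   # action size quoted in section 2

# Hypothetical policy network; the layer sizes are arbitrary choices.
policy_net = nn.Sequential(
    nn.Linear(STATE_SIZE, 64),
    nn.ReLU(),
    nn.Linear(64, ACTION_SIZE),
    nn.Tanh(),  # squashes every action entry into [-1, 1]
)


def pi(states):
    """Map an (n, STATE_SIZE) float tensor to an (n, ACTION_SIZE) tensor."""
    return policy_net(states)


# Evaluate the untrained policy on a batch of two all-zero states.
with torch.no_grad():
    actions = pi(torch.zeros(2, STATE_SIZE))
print(actions.shape)
```

Such a function could then be wrapped as `CoControlAgent(pi)`, which takes care of converting the states to float tensors and disabling gradient tracking.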


Now we can test the environment and run an episode with a random or dummy policy. To run the test, uncomment the *test_env()* invocation and choose the policy you like.

**_After you have run the test, comment it out again and restart the kernel, as recreating the environment does not work!_**

In [None]:
def test_env():
    env = CoControlEnv()

    # all actions are between -1 and 1
    dummy_pi = lambda s: torch.rand(env.get_agent_size(), env.get_action_size()) * 2.0 - 1.0

    # Alternative dummy policy:
    #
    # cnt = 0
    # dummy_pi = lambda s: torch.ones(env.get_agent_size(), env.get_action_size()) \
    #     * ((cnt % 10)/10 * 2 - 1) * (-1) ** (cnt % 10)

    agent = CoControlAgent(dummy_pi)
    episode = enumerate(env.generate_episode(agent, max_steps=1000))
    for count, step_data in episode:
        # Consume the generated steps
        cnt = count

    env.close()
    
# test_env()


### 2. Examine the State and Action Spaces

In this environment, a double-jointed arm can move to target locations. A reward of `+0.1` is provided for each step that the agent's hand is in the goal location. Thus, the goal of your agent is to maintain its position at the target location for as many time steps as possible.
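
As a quick worked example of this reward scheme, the return of one agent in an episode is `0.1` times the number of steps spent in the goal location. The episode length of 1000 steps and the pattern of goal hits below are made up for illustration.

```python
REWARD_PER_STEP = 0.1  # reward for each step with the hand in the goal location


def episode_return(in_goal_flags):
    """Sum the per-step reward over one episode for a single agent."""
    steps_in_goal = sum(1 for in_goal in in_goal_flags if in_goal)
    return REWARD_PER_STEP * steps_in_goal


# Hypothetical episode: 1000 steps, hand in the goal on every 4th step.
flags = [step % 4 == 0 for step in range(1000)]
print(episode_return(flags))
```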

The observation space consists of `33` variables corresponding to the position, rotation, velocity, and angular velocity of the arm. Each action is a vector of four numbers, corresponding to the torques applied to the two joints. Every entry in the action vector must be a number between `-1` and `1`.
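
One simple way to respect these bounds is to clip the actions before passing them to the environment. This is a sketch with made-up action values for two agents, not necessarily how the environment or this repository handles out-of-range actions.

```python
import numpy as np

# Made-up raw actions for two agents, four entries each.
raw_actions = np.array([[1.7, -0.2, 0.0, -3.0],
                        [0.5, 0.9, -1.0, 1.0]])

# Clip every entry into the valid range [-1, 1].
actions = np.clip(raw_actions, -1.0, 1.0)
print(actions)
```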

Run the code cell below to print some information about the environment.

In [None]:
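# The raw Unity handles used in this cell (env, brain_name, brain) are not
# defined earlier in this notebook. As an assumption, they are taken here from
# the private attributes of the CoControlEnv wrapper defined in section 1.
wrapper = CoControlEnv()
env = wrapper._env
brain_name = wrapper._brain_name
brain = wrapper._brain

# reset the environment
env_info = env.reset(train_mode=True)[brain_name]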

# number of agents
num_agents = len(env_info.agents)
print('Number of agents:', num_agents)

# size of each action
action_size = brain.vector_action_space_size
print('Size of each action:', action_size)

# examine the state space 
states = env_info.vector_observations
state_size = states.shape[1]
print('There are {} agents. Each observes a state with length: {}'.format(states.shape[0], state_size))
print('The state for the first agent looks like:', states[0])

### 3. It's Your Turn!

Now it's your turn to train your own agent to solve the environment! When training the agent, set `train_mode=True`, so that the line for resetting the environment looks like the following:
```python
env_info = env.reset(train_mode=True)[brain_name]
```