# Continuous Control

---

In this notebook, you will learn how to use the Unity ML-Agents environment for the second project of the [Deep Reinforcement Learning Nanodegree](https://www.udacity.com/course/deep-reinforcement-learning-nanodegree--nd893) program.

### 1. Start the Environment

We begin by importing the necessary packages.  If the code cell below returns an error, please revisit the project instructions to double-check that you have installed [Unity ML-Agents](https://github.com/Unity-Technologies/ml-agents/blob/master/docs/Installation.md) and [NumPy](http://www.numpy.org/).

In [1]:
# Watch for changes
%load_ext autoreload
%autoreload 2

from unityagents import UnityEnvironment
import numpy as np

# Monkey patch missing attributes for newer numpy versions
if not hasattr(np, "float_"):
    np.float_ = np.float64
    
if not hasattr(np, "int_"):
    np.int_ = np.int64

Next, we will start the environment!  **_Before running the code cell below_**, change the `file_name` parameter to match the location of the Unity environment that you downloaded.

- **Mac**: `"path/to/Reacher.app"`
- **Windows** (x86): `"path/to/Reacher_Windows_x86/Reacher.exe"`
- **Windows** (x86_64): `"path/to/Reacher_Windows_x86_64/Reacher.exe"`
- **Linux** (x86): `"path/to/Reacher_Linux/Reacher.x86"`
- **Linux** (x86_64): `"path/to/Reacher_Linux/Reacher.x86_64"`
- **Linux** (x86, headless): `"path/to/Reacher_Linux_NoVis/Reacher.x86"`
- **Linux** (x86_64, headless): `"path/to/Reacher_Linux_NoVis/Reacher.x86_64"`

For instance, if you are using a Mac, then you downloaded `Reacher.app`.  If this file is in the same folder as the notebook, then the line below should appear as follows:
```
env = UnityEnvironment(file_name="Reacher.app")
```

In [2]:
env = UnityEnvironment(file_name='Reacher_Linux/Reacher.x86_64', no_graphics=True)

Found path: /home/oliver/project-showroom/projects/reinforcement-learning/continuous-control/p2_continuous-control-synchronous/Reacher_Linux/Reacher.x86_64
Mono path[0] = '/home/oliver/project-showroom/projects/reinforcement-learning/continuous-control/p2_continuous-control-synchronous/Reacher_Linux/Reacher_Data/Managed'
Mono config path = '/home/oliver/project-showroom/projects/reinforcement-learning/continuous-control/p2_continuous-control-synchronous/Reacher_Linux/Reacher_Data/MonoBleedingEdge/etc'
Preloaded 'libgrpc_csharp_ext.x64.so'
Unable to preload the following plugins:
	ScreenSelector.so
	libgrpc_csharp_ext.x86.so
	ScreenSelector.so
Logging to /home/oliver/.config/unity3d/Unity Technologies/Unity Environment/Player.log


INFO:unityagents:
'Academy' started successfully!
Unity Academy name: Academy
        Number of Brains: 1
        Number of External Brains : 1
        Lesson number : 0
        Reset Parameters :
		goal_speed -> 1.0
		goal_size -> 5.0
Unity brain name: ReacherBrain
        Number of Visual Observations (per agent): 0
        Vector Observation space type: continuous
        Vector Observation space size (per agent): 33
        Number of stacked Vector Observation: 1
        Vector Action space type: continuous
        Vector Action space size (per agent): 4
        Vector Action descriptions: , , , 


Environments contain **_brains_** which are responsible for deciding the actions of their associated agents. Here we check for the first brain available, and set it as the default brain we will be controlling from Python.

In [3]:
# get the default brain
brain_name = env.brain_names[0]
brain = env.brains[brain_name]

### 2. Examine the State and Action Spaces

In this environment, a double-jointed arm can move to target locations. A reward of `+0.1` is provided for each step that the agent's hand is in the goal location. Thus, the goal of your agent is to maintain its position at the target location for as many time steps as possible.

The observation space consists of `33` variables corresponding to the position, rotation, velocity, and angular velocity of the arm.  Each action is a vector of four numbers, corresponding to the torque applied to the two joints.  Every entry in the action vector must be a number between `-1` and `1`.

Run the code cell below to print some information about the environment.

In [4]:
# reset the environment
env_info = env.reset(train_mode=True)[brain_name]

# number of agents
num_agents = len(env_info.agents)
print('Number of agents:', num_agents)

# size of each action
action_size = brain.vector_action_space_size
print('Size of each action:', action_size)

# examine the state space 
states = env_info.vector_observations
print(states.shape)
state_size = states.shape[1]
print('There are {} agents. Each observes a state with length: {}'.format(states.shape[0], state_size))
print('The state for the first agent looks like:', states[0])

Number of agents: 20
Size of each action: 4
(20, 33)
There are 20 agents. Each observes a state with length: 33
The state for the first agent looks like: [ 0.00000000e+00 -4.00000000e+00  0.00000000e+00  1.00000000e+00
 -0.00000000e+00 -0.00000000e+00 -4.37113883e-08  0.00000000e+00
  0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00
  0.00000000e+00  0.00000000e+00 -1.00000000e+01  0.00000000e+00
  1.00000000e+00 -0.00000000e+00 -0.00000000e+00 -4.37113883e-08
  0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00
  0.00000000e+00  0.00000000e+00  5.75471878e+00 -1.00000000e+00
  5.55726624e+00  0.00000000e+00  1.00000000e+00  0.00000000e+00
 -1.68164849e-01]
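
The requirement that every action entry lies between `-1` and `1` can be enforced before an action batch is passed to `env.step`. The following is a minimal sketch, assuming the 20 agents and the action size of 4 printed above; the seeded generator `rng` and the variable names are illustrative choices, not part of the Unity API.

```python
import numpy as np

num_agents, action_size = 20, 4  # values printed by the cell above

rng = np.random.default_rng(0)  # seeded only to make the sketch reproducible

# Draw unbounded Gaussian actions, then clip every entry into [-1, 1].
raw_actions = rng.standard_normal((num_agents, action_size))
actions = np.clip(raw_actions, -1.0, 1.0)

print(actions.shape, float(actions.min()), float(actions.max()))
```

Sampling with `rng.uniform(-1, 1, ...)` instead would avoid the clipping step; clipped Gaussian samples place extra probability mass exactly on the bounds.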


### 3. Take Random Actions in the Environment

In the next code cell, you will learn how to use the Python API to control the agent and receive feedback from the environment.

Once this cell is executed, you can watch how the agent performs when it selects an action at random at each time step.  A window should pop up that allows you to observe the agent as it moves through the environment.  If the environment was started with `no_graphics=True`, as in the cell above, no window appears.

Of course, as part of the project, you'll have to change the code so that the agent is able to use its experience to gradually choose better actions when interacting with the environment!

In [None]:
def find_moving_observation_slice(env, brain_name, num_agents=20, action_size=4, steps=100):
    """Drive the environment with random actions and report which 3-wide
    observation slices of agent 0 change over time."""
    print("🔍 Running motion scan...")
    obs_log = []

    env_info = env.reset(train_mode=True)[brain_name]  # start from a fresh episode

    for i in range(steps):
        print(f"Step {i + 1}/{steps}")  # progress output, as in the log below
        actions = np.random.uniform(-1, 1, (num_agents, action_size))
        env_info = env.step(actions)[brain_name]
        state = env_info.vector_observations
        obs_log.append(state[0])  # only track agent 0

    obs_array = np.array(obs_log)  # shape: (steps, obs_dim)

    # Compute std per dimension
    std_per_dim = np.std(obs_array, axis=0)
    print(f"📐 Observation vector dim = {obs_array.shape[1]}")
    
    # Look for most moving 3D slices
    for i in range(0, len(std_per_dim) - 2):
        slice_std = np.sum(std_per_dim[i:i+3])
        if slice_std > 1e-3:  # filter out noise
            print(f"Candidate slice [{i}:{i+3}] has std sum {slice_std:.4f} → stds = {std_per_dim[i:i+3]}")


In [8]:
#env = UnityEnvironment(file_name="Reacher_Linux/Reacher.x86_64", no_graphics=True)
brain_name = env.brain_names[0]
find_moving_observation_slice(env, brain_name, steps=50)
#env.close()


🔍 Running motion scan...
Step 1/50
Step 2/50
Step 3/50
Step 4/50
Step 5/50
Step 6/50
Step 7/50
Step 8/50
Step 9/50
Step 10/50
Step 11/50
Step 12/50
Step 13/50
Step 14/50
Step 15/50
Step 16/50
Step 17/50
Step 18/50
Step 19/50
Step 20/50
Step 21/50
Step 22/50
Step 23/50
Step 24/50
Step 25/50
Step 26/50
Step 27/50
Step 28/50
Step 29/50
Step 30/50
Step 31/50
Step 32/50
Step 33/50
Step 34/50
Step 35/50
Step 36/50
Step 37/50
Step 38/50
Step 39/50
Step 40/50
Step 41/50
Step 42/50
Step 43/50
Step 44/50
Step 45/50
Step 46/50
Step 47/50
Step 48/50
Step 49/50
Step 50/50
📐 Observation vector dim = 33
Candidate slice [0:3] has std sum 2.3164 → stds = [1.31692796 0.1575788  0.84184435]
Candidate slice [1:4] has std sum 1.0095 → stds = [0.1575788  0.84184435 0.01011591]
Candidate slice [2:5] has std sum 1.0188 → stds = [0.84184435 0.01011591 0.16685545]
Candidate slice [3:6] has std sum 0.1938 → stds = [0.01011591 0.16685545 0.01682419]
Candidate slice [4:7] has std sum 0.2943 → stds = [0.16685545 0.

In [5]:
env_info = env.reset(train_mode=False)[brain_name]     # reset the environment    
states = env_info.vector_observations                 # get the current state (for each agent)
scores = np.zeros(num_agents)                         # initialize the score (for each agent)

initial_hand_pos = states[:, 29:32].copy()            # save initial hand positions
max_delta = 0.0                                        # track the max movement across any agent

step_count = 0
while True:
    actions = np.random.randn(num_agents, action_size)  # select random actions
    actions = np.clip(actions, -1, 1)                    # clip to [-1, 1]
    
    env_info = env.step(actions)[brain_name]            # step the environment
    next_states = env_info.vector_observations
    rewards = env_info.rewards
    dones = env_info.local_done
    scores += rewards
    states = next_states
    
    hand_pos = states[:, 29:32]
    step_delta = np.linalg.norm(hand_pos - initial_hand_pos, axis=1)
    max_delta = max(max_delta, np.max(step_delta))
    
    step_count += 1
    if step_count % 10 == 0:
        print(f"Step {step_count}: max hand movement from start: {np.max(step_delta):.4f}")
        print("Full obs (agent 0):", states[0])

    if np.any(dones):
        break

print(f"\nTotal score (averaged over agents): {np.mean(scores):.2f}")
print(f"Max hand movement during episode: {max_delta:.4f}")

if max_delta < 1e-3:
    print("⚠️  WARNING: Hand position did not change. Unity environment may be ignoring actions.")
else:
    print("✅ Hand movement detected. Unity responds to actions.")


Step 10: max hand movement from start: 0.0000
Full obs (agent 0): [ 2.85377502e-02 -3.64059973e+00 -1.66121674e+00  9.77393091e-01
  3.46945366e-03  7.50132371e-04 -2.11400598e-01  9.30635482e-02
 -1.49353864e-02  3.30568328e-02  1.45819858e-01  1.54888093e-01
 -3.40541005e-01 -1.32419968e+00 -7.77705240e+00 -6.31999969e-01
  8.48606527e-01 -2.04001009e-01  1.17901683e-01  4.73655820e-01
 -1.14352691e+00 -7.18460500e-01 -4.20570135e-01 -3.06774259e+00
  4.92994165e+00  4.79087770e-01  6.93596649e+00 -1.00000000e+00
  3.98652649e+00  0.00000000e+00  1.00000000e+00  0.00000000e+00
 -5.22214413e-01]
Step 20: max hand movement from start: 0.0000
Full obs (agent 0): [ 2.06256866e-01 -3.66967893e+00 -1.58326721e+00  9.79209483e-01
  2.51365490e-02  5.16924635e-03 -2.01221868e-01 -6.27368391e-01
 -2.63733286e-02  5.93717918e-02  2.60860652e-01 -1.01066196e+00
  2.30751181e+00 -1.94613647e+00 -6.69010878e+00 -8.62930298e-01
  6.72415912e-01 -2.97619790e-01  3.04072350e-01  6.05656147e-01
 -1.1

KeyboardInterrupt: 

When finished, you can close the environment.

In [9]:
env.close()

# **Architecture Decisions for TD3 Training with Unity ML-Agents**

## **1️⃣ Use Replay Buffer in Its Own Process**
### ✅ Justification:
- **Avoids race conditions** when both the training and data collection processes interact with the buffer.
- **Prevents memory contention** by isolating replay storage operations from CPU/GPU workloads.
- **Decouples storage from compute-intensive tasks**, making data access smoother.
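
The storage operations that such a process would own are insertion and sampling. Below is a minimal sketch of a fixed-size uniform replay buffer; `RingReplayBuffer` and its method names are hypothetical stand-ins for illustration and are not the implementation used in this project.

```python
import numpy as np


class RingReplayBuffer:
    """Hypothetical fixed-capacity buffer that overwrites its oldest entries."""

    def __init__(self, capacity, state_size=33, action_size=4, seed=0):
        self.capacity = capacity
        self.states = np.zeros((capacity, state_size), dtype=np.float32)
        self.actions = np.zeros((capacity, action_size), dtype=np.float32)
        self.rewards = np.zeros(capacity, dtype=np.float32)
        self.next_states = np.zeros((capacity, state_size), dtype=np.float32)
        self.dones = np.zeros(capacity, dtype=np.float32)
        self.pos = 0    # index of the next slot to write
        self.size = 0   # number of valid transitions stored
        self.rng = np.random.default_rng(seed)

    def add(self, state, action, reward, next_state, done):
        """Store one transition, overwriting the oldest one when full."""
        i = self.pos
        self.states[i] = state
        self.actions[i] = action
        self.rewards[i] = reward
        self.next_states[i] = next_state
        self.dones[i] = float(done)
        self.pos = (self.pos + 1) % self.capacity
        self.size = min(self.size + 1, self.capacity)

    def sample(self, batch_size):
        """Return a uniformly sampled batch (with replacement)."""
        idx = self.rng.integers(0, self.size, size=batch_size)
        return (self.states[idx], self.actions[idx], self.rewards[idx],
                self.next_states[idx], self.dones[idx])
```

Because only one process would ever call `add` and `sample`, the sketch needs no locks; that is the point of giving the buffer its own process.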

---

## **2️⃣ Replay Buffer as the Bridge Between Training and Data Collection**
### ✅ Justification:
- **Producer-Consumer Model**: The **data collection process (Unity environment)** produces experience data, while the **training process** consumes it for learning.
- **Ensures non-blocking behavior**: Training can proceed independently while new data is collected.
- **Avoids excessive synchronization overhead**, allowing each process to run at its own pace.
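
The non-blocking behavior described above can be sketched with two small helpers around a bounded queue. `try_push` and `drain` are hypothetical names; they rely only on `put_nowait` and `get_nowait`, which both `queue.Queue` and `multiprocessing.Queue` provide, raising `queue.Full` and `queue.Empty` respectively.

```python
import queue


def try_push(q, transition):
    """Producer side: enqueue without blocking; return False if the queue is full."""
    try:
        q.put_nowait(transition)
        return True
    except queue.Full:
        return False


def drain(q, max_items):
    """Consumer side: take up to max_items that are already waiting, never block."""
    items = []
    for _ in range(max_items):
        try:
            items.append(q.get_nowait())
        except queue.Empty:
            break
    return items


# Usage with an in-process queue; a multiprocessing.Queue could be passed instead.
q = queue.Queue(maxsize=2)
print([try_push(q, i) for i in range(3)])
print(drain(q, 5))
```

With a `multiprocessing.Queue`, an item may become visible to `get_nowait` slightly after `put_nowait` returns, because a background feeder thread transfers it, so an empty result from `drain` does not necessarily mean the producer is idle.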

---

## **3️⃣ Three Separate Processes for Efficiency**
### ✅ Justification:
1. **Replay Buffer Process**:
   - Manages stored experience and handles sampling/insertion requests efficiently.
   - Runs in a **dedicated process** to prevent contention with the training loop.

2. **Training Process**:
   - Fetches batches from the replay buffer asynchronously and updates the TD3 neural networks.
   - Utilizes the **GPU fully without waiting** for new experience data.

3. **Data Collection Process (Unity Environment)**:
   - Steps the environment in parallel for **20 agents**, collecting `(state, action, reward, next_state, done)` tuples.
   - Pushes data to the replay buffer **without blocking training**.
   - Offloads simulation work to the **CPU**, allowing for efficient resource utilization.
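
The three roles above can be sketched as three loops connected by queues. This is a rough illustration under stated assumptions, not the project's code: all function names, the `STOP` sentinel, and the placeholder transitions are hypothetical, the Unity stepping and the TD3 update are replaced by stubs, and batches are drawn uniformly at random.

```python
import multiprocessing as mp
import queue
import random

STOP = None  # hypothetical sentinel that tells the next stage to shut down


def collector_loop(transitions_q, steps, num_agents=20):
    """Stand-in for the Unity loop: emits one placeholder transition per agent per step."""
    for t in range(steps):
        for agent in range(num_agents):
            # (state, action, reward, next_state, done) with dummy values
            transitions_q.put((t, agent, 0.0, t + 1, False))
    transitions_q.put(STOP)


def buffer_loop(transitions_q, batches_q, capacity=10_000, batch_size=64):
    """Sole owner of the storage: inserts transitions and offers sampled batches."""
    storage, pos = [], 0
    while True:
        item = transitions_q.get()
        if item is STOP:
            break
        if len(storage) < capacity:
            storage.append(item)
        else:
            storage[pos] = item  # overwrite the oldest slot
        pos = (pos + 1) % capacity
        if len(storage) >= batch_size:
            try:
                batches_q.put_nowait(random.sample(storage, batch_size))
            except queue.Full:
                pass  # trainer is behind; skip this batch instead of blocking
    batches_q.put(STOP)


def trainer_loop(batches_q, results_q):
    """Stand-in for TD3 training: counts how many batches it would train on."""
    updates = 0
    while True:
        batch = batches_q.get()
        if batch is STOP:
            break
        updates += 1  # a TD3 update on `batch` would go here
    results_q.put(updates)


def launch(steps=50):
    """Run the three loops in separate processes and return the update count."""
    transitions_q = mp.Queue(maxsize=1000)
    batches_q = mp.Queue(maxsize=8)
    results_q = mp.Queue()
    procs = [
        mp.Process(target=collector_loop, args=(transitions_q, steps)),
        mp.Process(target=buffer_loop, args=(transitions_q, batches_q)),
        mp.Process(target=trainer_loop, args=(batches_q, results_q)),
    ]
    for p in procs:
        p.start()
    updates = results_q.get()  # read the result before joining
    for p in procs:
        p.join()
    return updates
```

In a script, `launch()` would be called under an `if __name__ == "__main__":` guard, which `multiprocessing` requires on platforms that start child processes by spawning a fresh interpreter. The loops only use `put`, `get`, and `put_nowait`, so they can also be exercised in a single process with plain `queue.Queue` objects.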

---

## **🌟 Summary**
- **Replay Buffer runs independently** to mediate between training and data collection.
- **Training and data collection processes operate in parallel**, preventing bottlenecks.
- **Each process is optimized for its specific task**, ensuring full utilization of CPU & GPU resources.

This architecture balances **parallelism, efficiency, and stability**, and is intended to **shorten training times** while keeping the system modular and scalable.




I also decided to use the ReplayBuffer implementation from https://github.com/ShangtongZhang/DeepRL/, because its code is better modularized.

Because I decided to use a producer-consumer model for the ReplayBuffer, I have to place all the code in Python files that are executed directly in a terminal. A Jupyter notebook is not well suited to parallel or asynchronous code.