# Continuous Control

---

In this notebook, you will learn how to use the Unity ML-Agents environment for the second project of the [Deep Reinforcement Learning Nanodegree](https://www.udacity.com/course/deep-reinforcement-learning-nanodegree--nd893) program.

### 1. Start the Environment

We begin by importing the necessary packages.  If the code cell below returns an error, please revisit the project instructions to double-check that you have installed [Unity ML-Agents](https://github.com/Unity-Technologies/ml-agents/blob/master/docs/Installation.md) and [NumPy](http://www.numpy.org/).

In [1]:
# !pip -q install ./python

In [2]:
from unityagents import UnityEnvironment
import numpy as np

# %load_ext autoreload
# %autoreload 2

Next, we will start the environment!  **_Before running the code cell below_**, change the `file_name` parameter to match the location of the Unity environment that you downloaded.

- **Mac**: `"path/to/Reacher.app"`
- **Windows** (x86): `"path/to/Reacher_Windows_x86/Reacher.exe"`
- **Windows** (x86_64): `"path/to/Reacher_Windows_x86_64/Reacher.exe"`
- **Linux** (x86): `"path/to/Reacher_Linux/Reacher.x86"`
- **Linux** (x86_64): `"path/to/Reacher_Linux/Reacher.x86_64"`
- **Linux** (x86, headless): `"path/to/Reacher_Linux_NoVis/Reacher.x86"`
- **Linux** (x86_64, headless): `"path/to/Reacher_Linux_NoVis/Reacher.x86_64"`

For instance, if you are using a Mac, then you downloaded `Reacher.app`.  If this file is in the same folder as the notebook, then the line below should appear as follows:
```
env = UnityEnvironment(file_name="Reacher.app")
```
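If the same notebook is run on several machines, the choice from the list above can be made in code. The following is only a sketch: the helper name `reacher_file_name` and the lookup table are hypothetical (they are not part of `unityagents`), and `path/to` remains a placeholder that you must replace with your own download location.

```python
import platform
import sys

# Paths copied from the list above; "path/to" is a placeholder, not a real location.
# Keys are (operating system, is 64-bit, headless).
REACHER_PATHS = {
    ("Darwin", True, False): "path/to/Reacher.app",
    ("Darwin", False, False): "path/to/Reacher.app",
    ("Windows", False, False): "path/to/Reacher_Windows_x86/Reacher.exe",
    ("Windows", True, False): "path/to/Reacher_Windows_x86_64/Reacher.exe",
    ("Linux", False, False): "path/to/Reacher_Linux/Reacher.x86",
    ("Linux", True, False): "path/to/Reacher_Linux/Reacher.x86_64",
    ("Linux", False, True): "path/to/Reacher_Linux_NoVis/Reacher.x86",
    ("Linux", True, True): "path/to/Reacher_Linux_NoVis/Reacher.x86_64",
}


def reacher_file_name(system=None, is_64bit=None, headless=False):
    """Hypothetical helper: look up the Reacher build path for a platform."""
    if system is None:
        system = platform.system()        # "Darwin", "Windows" or "Linux"
    if is_64bit is None:
        is_64bit = sys.maxsize > 2**32    # True on a 64-bit Python interpreter
    key = (system, is_64bit, headless)
    if key not in REACHER_PATHS:
        raise ValueError("No Reacher build listed for {}".format(key))
    return REACHER_PATHS[key]
```

The result could then be passed on as `UnityEnvironment(file_name=reacher_file_name(headless=True))` once the placeholder paths are filled in.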

In [3]:
# env = UnityEnvironment(file_name='/data/Reacher_One_Linux_NoVis/Reacher_One_Linux_NoVis.x86_64')

In [4]:
env = UnityEnvironment(file_name='Reacher_Linux_NoVis/Reacher.x86_64')

INFO:unityagents:
'Academy' started successfully!
Unity Academy name: Academy
        Number of Brains: 1
        Number of External Brains : 1
        Lesson number : 0
        Reset Parameters :
		goal_speed -> 1.0
		goal_size -> 5.0
Unity brain name: ReacherBrain
        Number of Visual Observations (per agent): 0
        Vector Observation space type: continuous
        Vector Observation space size (per agent): 33
        Number of stacked Vector Observation: 1
        Vector Action space type: continuous
        Vector Action space size (per agent): 4
        Vector Action descriptions: , , , 


Environments contain **_brains_**, which are responsible for deciding the actions of their associated agents. Here we check for the first brain available and set it as the default brain we will be controlling from Python.

In [5]:
import torchviz

In [6]:
# get the default brain
brain_name = env.brain_names[0]
brain = env.brains[brain_name]

### 2. Examine the State and Action Spaces

In this environment, a double-jointed arm can move to target locations. A reward of `+0.1` is provided for each step that the agent's hand is in the goal location. Thus, the goal of your agent is to maintain its position at the target location for as many time steps as possible.

The observation space consists of `33` variables corresponding to position, rotation, velocity, and angular velocities of the arm.  Each action is a vector with four numbers, corresponding to torque applicable to two joints.  Every entry in the action vector must be a number between `-1` and `1`.
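To make the `-1` to `1` bound concrete, here is a minimal sketch that draws one random action vector per agent and clamps every entry into the valid range. It uses only the standard library so it runs without the Unity environment; the function name `random_clipped_actions` is an assumption made for illustration.

```python
import random


def random_clipped_actions(num_agents, action_size, seed=None):
    """Sample one action vector per agent and clamp every entry to [-1, 1]."""
    rng = random.Random(seed)
    actions = []
    for _ in range(num_agents):
        # Unbounded Gaussian draw, one value per action dimension.
        raw = [rng.gauss(0.0, 1.0) for _ in range(action_size)]
        # Clamp each entry so it respects the environment's bounds.
        actions.append([max(-1.0, min(1.0, x)) for x in raw])
    return actions


# One agent, four action entries, matching the sizes reported for this environment.
actions = random_clipped_actions(num_agents=1, action_size=4, seed=0)
```

A learned policy would replace the Gaussian draw, but the clamping step stays the same whenever exploration noise can push an action outside the bounds.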

Run the code cell below to print some information about the environment.

In [7]:
# reset the environment
env_info = env.reset(train_mode=True)[brain_name]

# number of agents
num_agents = len(env_info.agents)
print('Number of agents:', num_agents)

# size of each action
action_size = brain.vector_action_space_size
print('Size of each action:', action_size)

# examine the state space 
states = env_info.vector_observations
state_size = states.shape[1]
print('There are {} agents. Each observes a state with length: {}'.format(states.shape[0], state_size))
print('The state for the first agent looks like:', states[0])

Number of agents: 1
Size of each action: 4
There are 1 agents. Each observes a state with length: 33
The state for the first agent looks like: [ 0.00000000e+00 -4.00000000e+00  0.00000000e+00  1.00000000e+00
 -0.00000000e+00 -0.00000000e+00 -4.37113883e-08  0.00000000e+00
  0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00
  0.00000000e+00  0.00000000e+00 -1.00000000e+01  0.00000000e+00
  1.00000000e+00 -0.00000000e+00 -0.00000000e+00 -4.37113883e-08
  0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00
  0.00000000e+00  0.00000000e+00  5.75471878e+00 -1.00000000e+00
  5.55726671e+00  0.00000000e+00  1.00000000e+00  0.00000000e+00
 -1.68164849e-01]


### 3. Take Random Actions in the Environment

In the next code cell, you will learn how to use the Python API to control the agent and receive feedback from the environment.

Once this cell is executed, the agent selects an action at random at each time step, and you can watch how it performs.  With a build that has visuals, a window should pop up that lets you observe the agent as it moves through the environment; the headless `NoVis` builds listed above do not open one.

Of course, as part of the project, you'll have to change the code so that the agent is able to use its experience to gradually choose better actions when interacting with the environment!
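One common way to let an agent "use its experience" is a replay buffer that stores past transitions and hands back random batches for learning. The sketch below is an assumption about how such a buffer might look, not the code in `ddpg_agent.py`. The names `ReplayBuffer` and `Experience` mirror what appears in the debugger output later in this notebook, while the method bodies are illustrative and use plain Python lists rather than tensors.

```python
import random
from collections import deque, namedtuple

# One stored transition: (s, a, r, s', done).
Experience = namedtuple(
    "Experience", ["state", "action", "reward", "next_state", "done"]
)


class ReplayBuffer:
    """Fixed-size buffer of experience tuples (illustrative sketch)."""

    def __init__(self, buffer_size, batch_size, seed=None):
        self.memory = deque(maxlen=buffer_size)  # oldest entries drop off automatically
        self.batch_size = batch_size
        self.rng = random.Random(seed)

    def add(self, state, action, reward, next_state, done):
        """Store a single transition."""
        self.memory.append(Experience(state, action, reward, next_state, done))

    def sample(self):
        """Return a tuple of five lists: states, actions, rewards, next_states, dones."""
        batch = self.rng.sample(list(self.memory), self.batch_size)
        # zip(*batch) regroups the batch field by field.
        return tuple(list(column) for column in zip(*batch))

    def __len__(self):
        return len(self.memory)
```

Sampling uniformly at random breaks the correlation between consecutive time steps, which is the usual motivation for this design; a real agent would typically convert each returned list to a tensor before the learning step.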

In [8]:
# env_info = env.reset(train_mode=False)[brain_name]     # reset the environment    
# states = env_info.vector_observations                  # get the current state (for each agent)
# scores = np.zeros(num_agents)                          # initialize the score (for each agent)
# while True:
#     actions = np.random.randn(num_agents, action_size) # select an action (for each agent)
#     actions = np.clip(actions, -1, 1)                  # all actions between -1 and 1
#     env_info = env.step(actions)[brain_name]           # send all actions to the environment
#     next_states = env_info.vector_observations         # get next state (for each agent)
#     rewards = env_info.rewards                         # get reward (for each agent)
#     dones = env_info.local_done                        # see if episode finished
#     scores += env_info.rewards                         # update the score (for each agent)
#     states = next_states                               # roll over states to next time step
#     if np.any(dones):                                  # exit loop if episode finished
#         break
# print('Total score (averaged over agents) this episode: {}'.format(np.mean(scores)))

In [9]:
# import timeit
# import ddpg_agent
# env_info = env.reset(train_mode=True)[brain_name]  # reset the environment
# state = env_info.vector_observations[0]            # get the current state (first agent only)
# score = 0.0                                        # initialize the score
# total_steps = 0
# total_env_time = 0.0                               # seconds spent inside env.step
# while True:
#     total_steps += 1
#     action = agent.act(state)                      # assumes `agent` is an existing ddpg_agent.Agent
#     env_start = timeit.default_timer()
#     env_info = env.step(action)[brain_name]        # send the action to the environment
#     total_env_time += timeit.default_timer() - env_start
#     next_state = env_info.vector_observations[0]   # get the next state
#     reward = env_info.rewards[0]                   # get the reward
#     done = env_info.local_done[0]                  # see if episode has finished
#     agent.step(state, action, reward, next_state, done)
#     state = next_state
#     score += reward
#     if done:
#         break
# print('Total score this episode: {}'.format(score))

In [None]:
import runner
import ddpg_agent

# %load_ext autoreload
# %autoreload 2b
from IPython.core.debugger import set_trace
set_trace()
agent = ddpg_agent.Agent(state_size=state_size, action_size=action_size, random_seed=1)
scores = runner.run(env, agent, brain_name=brain_name)

--Return--
None
> <ipython-input-10-2e4a8bc04803>(7)<module>()
      5 # %autoreload 2b
      6 from IPython.core.debugger import set_trace
----> 7 set_trace()
      8 agent = ddpg_agent.Agent(state_size=state_size, action_size=action_size, random_seed=1)
      9 scores = runner.run(env, agent, brain_name=brain_name)


ipdb>  b ddpg_agent.py:72


Breakpoint 1 at /home/jmh/udacity_rl/deep-reinforcement-learning/p2_continuous-control/ddpg_agent.py:72


ipdb>  c


    [... skipped 1 hidden frame]

    [... skipped 1 hidden frame]

    [... skipped 1 hidden frame]

    [... skipped 1 hidden frame]

> /home/jmh/udacity_rl/deep-reinforcement-learning/p2_continuous-control/ddpg_agent.py(72)step()
     70         if len(self.memory) > BATCH_SIZE:
     71             if self.t_step == 0:
1--> 72                 experiences = self.memory.sample()
     73                 self.learn(experiences, GAMMA)

ipdb>  n


> /home/jmh/udacity_rl/deep-reinforcement-learning/p2_continuous-control/ddpg_agent.py(73)step()
     71             if self.t_step == 0:
1    72                 experiences = self.memory.sample()
---> 73                 self.learn(experiences, GAMMA)
     74 
     75     def act(self, state, add_noise=True):


ipdb>  experiences


(tensor([[ 3.1671, -2.4554,  0.0806,  ...,  1.0000,  0.0000, -0.5330],
        [ 1.0665, -3.7361, -0.9581,  ...,  1.0000,  0.0000, -0.5330],
        [ 3.0599,  0.2080,  2.6059,  ...,  1.0000,  0.0000, -0.5330],
        ...,
        [ 1.4406, -3.5766, -1.0735,  ...,  1.0000,  0.0000, -0.5330],
        [ 3.2900, -2.2858,  0.1483,  ...,  1.0000,  0.0000, -0.5330],
        [ 0.0529, -3.9996, -0.0236,  ...,  1.0000,  0.0000, -0.5330]]), tensor([[ 0.4450,  0.3527, -0.7196,  0.1385],
        [-0.3562,  0.4302, -0.3297, -0.1066],
        [-0.3164, -0.0193, -0.1500, -0.2409],
        [ 0.2616, -0.0708, -0.3062,  0.0624],
        [ 0.6537, -0.1507,  0.9100,  0.1443],
        [ 0.2219,  0.5544,  0.1349, -0.1559],
        [-0.0358,  0.0632,  0.5828, -0.1998],
        [-0.4480, -0.3018,  0.1283, -0.4996],
        [ 0.7071,  0.0266, -1.0000,  0.2297],
        [ 0.0276, -0.4318,  0.1052,  0.0431],
        [ 0.1078,  0.4386, -0.0437,  0.2469],
        [ 0.1496,  0.4152, -0.2396,  0.1752],
        [ 0.

ipdb>  shape experiences


*** SyntaxError: invalid syntax


ipdb>  experiences.shape


*** AttributeError: 'tuple' object has no attribute 'shape'


ipdb>  np.shape(experiences)


*** ValueError: only one element tensors can be converted to Python scalars


ipdb>  type(experiences)


<class 'tuple'>


ipdb>  l


     68 
     69             # Learn, if enough samples are available in memory
     70         if len(self.memory) > BATCH_SIZE:
     71             if self.t_step == 0:
1    72                 experiences = self.memory.sample()
---> 73                 self.learn(experiences, GAMMA)
     74 
     75     def act(self, state

ipdb>  s


--Call--
> /home/jmh/udacity_rl/deep-reinforcement-learning/p2_continuous-control/ddpg_agent.py(90)learn()
     88         self.noise.reset()
     89 
---> 90     def learn(self, experiences, gamma):
     91         """Update policy and value parameters using given batch of experience tuples.
     92         Q_targets = r + γ * critic_target(next_state, actor_target(next_state))


ipdb>  n


> /home/jmh/udacity_rl/deep-reinforcement-learning/p2_continuous-control/ddpg_agent.py(102)learn()
    100             gamma (float): discount factor
    101         """
--> 102         states, actions, rewards, next_states, dones = experiences
    103 
    104         # ---------------------------- update critic ---------------------------- #


ipdb>  h l


Print lines of code from the current stack frame


ipdb>  l


     97         Params
     98         ======
     99             experiences (Tuple[torch.Tensor]): tuple of (s, a, r, s', done) tuples
    100             gamma (float): discount factor
    101         """
--> 102         states, actions, rewards, next_states, dones = experiences
    103 
    104

ipdb>  n


> /home/jmh/udacity_rl/deep-reinforcement-learning/p2_continuous-control/ddpg_agent.py(106)learn()
    104         # ---------------------------- update critic ---------------------------- #
    105         # Get predicted next-state actions and Q values from target models
--> 106         actions_next = self.actor_target(next_states)
    107         Q_targets_next = self.critic_target(next_states, actions_next)
    108         # Compute Q targets for current states (y_i)


ipdb>  np.shape(states)


torch.Size([64, 33])


ipdb>  np.shape(actions)


torch.Size([64, 4])


ipdb>  np.shape(rewards)


torch.Size([64, 1])
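
The `learn` docstring displayed earlier in this session gives the critic target as `Q_targets = r + γ * critic_target(next_state, actor_target(next_state))`, and the shapes above show that it is computed for a batch of 64 transitions at once. Below is a minimal sketch of that arithmetic on plain Python floats. The function name `td_targets` is hypothetical, and zeroing the bootstrap term when `done` is true is an assumption that is common in DDPG implementations; the session does not show whether `ddpg_agent.py` does this.

```python
def td_targets(rewards, next_q_values, dones, gamma):
    """Compute r + gamma * Q'(s', a') per transition, dropping Q' on terminal steps."""
    targets = []
    for r, q_next, done in zip(rewards, next_q_values, dones):
        # Assumption: no future value is bootstrapped once the episode has ended.
        bootstrap = 0.0 if done else gamma * q_next
        targets.append(r + bootstrap)
    return targets


# Two transitions: one mid-episode, one terminal.
targets = td_targets(
    rewards=[0.1, 0.0],
    next_q_values=[2.0, 5.0],
    dones=[False, True],
    gamma=0.99,
)
```

In the tensor version, `next_q_values` would correspond to `Q_targets_next` from the target critic, evaluated at the target actor's `actions_next`.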


ipdb>  b act


*** The specified object 'act' is not a function or was not found along sys.path.


ipdb>  b Agent.act


Breakpoint 2 at /home/jmh/udacity_rl/deep-reinforcement-learning/p2_continuous-control/ddpg_agent.py:75


ipdb>  self.memory


<ddpg_agent.ReplayBuffer object at 0x7fb4ed990518>


ipdb>  self.memory.memory


deque([Experience(state=array([ 0.00000000e+00, -4.00000000e+00,  0.00000000e+00,  1.00000000e+00,
       -0.00000000e+00, -0.00000000e+00, -4.37113883e-08,  0.00000000e+00,
        0.00000000e+00,  0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
        0.00000000e+00,  0.00000000e+00, -1.00000000e+01,  0.00000000e+00,
        1.00000000e+00, -0.00000000e+00, -0.00000000e+00, -4.37113883e-08,
        0.00000000e+00,  0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
        0.00000000e+00,  0.00000000e+00, -6.30408478e+00, -1.00000000e+00,
       -4.92529202e+00,  0.00000000e+00,  1.00000000e+00,  0.00000000e+00,
       -5.33014059e-01]), action=array([-0.13057055,  0.42653787, -0.03732488,  0.23326162], dtype=float32), reward=[0.0], next_state=array([ 1.06143951e-02, -3.99998260e+00,  7.24867499e-03,  9.99998748e-01,
        1.32018444e-03, -1.00042257e-06,  9.01577994e-04, -3.59757915e-02,
        5.71791061e-05,  5.27655929e-02,  2.12117672e-01,  4.92979190e-04,
        1.4462210

ipdb>   self.memory.memory[0]


Experience(state=array([ 0.00000000e+00, -4.00000000e+00,  0.00000000e+00,  1.00000000e+00,
       -0.00000000e+00, -0.00000000e+00, -4.37113883e-08,  0.00000000e+00,
        0.00000000e+00,  0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
        0.00000000e+00,  0.00000000e+00, -1.00000000e+01,  0.00000000e+00,
        1.00000000e+00, -0.00000000e+00, -0.00000000e+00, -4.37113883e-08,
        0.00000000e+00,  0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
        0.00000000e+00,  0.00000000e+00, -6.30408478e+00, -1.00000000e+00,
       -4.92529202e+00,  0.00000000e+00,  1.00000000e+00,  0.00000000e+00,
       -5.33014059e-01]), action=array([-0.13057055,  0.42653787, -0.03732488,  0.23326162], dtype=float32), reward=[0.0], next_state=array([ 1.06143951e-02, -3.99998260e+00,  7.24867499e-03,  9.99998748e-01,
        1.32018444e-03, -1.00042257e-06,  9.01577994e-04, -3.59757915e-02,
        5.71791061e-05,  5.27655929e-02,  2.12117672e-01,  4.92979190e-04,
        1.44622102e-01, 

ipdb>   self.memory.memory[0].reward


[0.0]


ipdb>   w


    [... skipping 31 hidden frame(s)]

None
  <ipython-input-10-2e4a8bc04803>(9)<module>()
      5 # %autoreload 2b
      6 from IPython.core.debugger import set_trace
      7 set_trace()
      8 agent = ddpg_agent.Agent(state_size=state_size, action_size=action_size, random_seed=1)
----> 9 scores = runner.run(env, agent, brain_name=brain_name)

ipdb>  b 191


Breakpoint 3 at /home/jmh/udacity_rl/deep-reinforcement-learning/p2_continuous-control/ddpg_agent.py:191


ipdb>  c


> /home/jmh/udacity_rl/deep-reinforcement-learning/p2_continuous-control/ddpg_agent.py(77)act()
2    75     def act(self, state, add_noise=True):
     76         """Returns actions for given state as per current policy."""
---> 77         state = torch.from_numpy(state).float().to(device)
     78         self.actor_local.eval()
     79         self.epsilon = max

ipdb>  c


> /home/jmh/udacity_rl/deep-reinforcement-learning/p2_continuous-control/ddpg_agent.py(77)act()
2    75     def act(self, state, add_noise=True):
     76         """Returns actions for given state as per current policy."""
---> 77         state = torch.from_numpy(state).float().to(device)
     78         self.actor_local.eval()
     79         self.epsilon = max

ipdb>  l


1    72                 experiences = self.memory.sample()
     73                 self.learn(experiences, GAMMA)
     74 
2    75     def act(self, state, add_noise=True):
     76         """Returns actions for given state as per current policy."""
---> 77         state = torch.from_numpy(state).float().to(device)

ipdb>  np.shape(state)


(33,)


ipdb>  h



Documented commands (type help <topic>):
EOF    cl         disable  interact  next    psource  rv           undisplay
a      clear      display  j         p       q        s            unt      
alias  commands   down     jump      pdef    quit     skip_hidden  until    
args   condition  enable   l         pdoc    r        source       up       
b      cont       exit     list      pfile   restart  step         w        
break  continue   h        ll        pinfo   return   tbreak       whatis   
bt     d          help     longlist  pinfo2  retval   u            where    
c      debug      ignore   n         pp      run      unalias    

Miscellaneous help topics:
exec  pdb



ipdb>  h bt


w(here)
        Print a stack trace, with the most recent frame at the bottom.
        An arrow indicates the "current frame", which determines the
        context of most commands.  'bt' is an alias for this command.


ipdb>  h cl


cl(ear) filename:lineno
cl(ear) [bpnumber [bpnumber...]]
        With a space separated list of breakpoint numbers, clear
        those breakpoints.  Without argument, clear all breaks (but
        first ask confirmation).  With a filename:lineno argument,
        clear all breaks at that line in that file.


ipdb>  h list


Print lines of code from the current stack frame


ipdb>  h EOF


EOF
        Handles the receipt of EOF as a command.


ipdb>  h a


a(rgs)
        Print the argument list of the current function.


ipdb>  h alias


alias [name [command [parameter parameter ...] ]]
        Create an alias called 'name' that executes 'command'.  The
        command must *not* be enclosed in quotes.  Replaceable
        parameters can be indicated by %1, %2, and so on, while %* is
        replaced by all the parameters.  If no command is given, the
        current alias for name is shown. If no name is given, all
        aliases are listed.

        Aliases may be nested and can contain anything that can be
        legally typed at the pdb prompt.  Note!  You *can* override
        internal pdb commands with aliases!  Those internal commands
        are then hidden until the alias is removed.  Aliasing is
        recursively applied to the first word of the command line; all
        other words in the line are left alone.

        As an example, here are two useful aliases (especially when
        placed in the .pdbrc file):

        # Print instance variables (usage "pi classInst")
        alias pi for k in %1.__di

ipdb>  h break


b(reak) [ ([filename:]lineno | function) [, condition] ]
        Without argument, list all breaks.

        With a line number argument, set a break at this line in the
        current file.  With a function name, set a break at the first
        executable line of that function.  If a second argument is
        present, it is a string specifying an expression which must
        evaluate to true before the breakpoint is honored.

        The line number may be prefixed with a filename and a colon,
        to specify a breakpoint in another file (probably one that
        hasn't been loaded yet).  The file is searched for on
        sys.path; the .py suffix may be omitted.


ipdb>  h whatis


whatis arg
        Print the type of the argument.


ipdb>  h up


u(p) [count]
        Move the current frame count (default one) levels up in the
        stack trace (to an older frame).

        Will skip hidden frames.


ipdb>  h unt


unt(il) [lineno]
        Without argument, continue execution until the line with a
        number greater than the current one is reached.  With a line
        number, continue execution until a line with a number greater
        or equal to that is reached.  In both cases, also stop when
        the current frame returns.


ipdb>  h retval


retval
        Print the return value for the last return of a function.


ipdb>  htbreak


*** NameError: name 'htbreak' is not defined


ipdb>  h tbreak


tbreak [ ([filename:]lineno | function) [, condition] ]
        Same arguments as break, but sets a temporary breakpoint: it
        is automatically deleted when first hit.


ipdb>  h d


d(own) [count]
        Move the current frame count (default one) levels down in the
        stack trace (to a newer frame).

        Will skip hidden frames.


ipdb>  h debug


debug code
        Enter a recursive debugger that steps through the code
        argument (which is an arbitrary expression or statement to be
        executed in the current environment).


ipdb>  h j


j(ump) lineno
        Set the next line that will be executed.  Only available in
        the bottom-most frame.  This lets you jump back and execute
        code again, or jump forward to skip code that you don't want
        to run.

        It should be noted that not all jumps are allowed -- for
        instance it is not possible to jump into the middle of a
        for loop or out of a finally clause.


ipdb>  h h 


h(elp)
        Without argument, print the list of available commands.
        With a command name as argument, print help about that command.
        "help pdb" shows the full pdb documentation.
        "help exec" gives help on the ! command.


ipdb>  h ignore


ignore bpnumber [count]
        Set the ignore count for the given breakpoint number.  If
        count is omitted, the ignore count is set to 0.  A breakpoint
        becomes active when the ignore count is zero.  When non-zero,
        the count is decremented each time the breakpoint is reached
        and the breakpoint is not disabled and any associated
        condition evaluates to true.


ipdb>  h rv


retval
        Print the return value for the last return of a function.


ipdb>  h psource


Print (or run through pager) the source code for an object.


ipdb>  h pp


pp expression
        Pretty-print the value of the expression.


ipdb>  h pdef


Print the call signature for any callable object.

        The debugger interface to %pdef


ipdb>  h ll


Print lines of code from the current stack frame.

        Shows more lines than 'list' does.


ipdb>  h



Documented commands (type help <topic>):
EOF    cl         disable  interact  next    psource  rv           undisplay
a      clear      display  j         p       q        s            unt      
alias  commands   down     jump      pdef    quit     skip_hidden  until    
args   condition  enable   l         pdoc    r        source       up       
b      cont       exit     list      pfile   restart  step         w        
break  continue   h        ll        pinfo   return   tbreak       whatis   
bt     d          help     longlist  pinfo2  retval   u            where    
c      debug      ignore   n         pp      run      unalias    

Miscellaneous help topics:
exec  pdb



ipdb>  h display


display [expression]

        Display the value of the expression if it changed, each time execution
        stops in the current frame.

        Without expression, list all display expressions for the current frame.


ipdb>  h disable


disable bpnumber [bpnumber ...]
        Disables the breakpoints given as a space separated list of
        breakpoint numbers.  Disabling a breakpoint means it cannot
        cause the program to stop execution, but unlike clearing a
        breakpoint, it remains in the list of breakpoints and can be
        (re-)enabled.


ipdb>  h condition


condition bpnumber [condition]
        Set a new condition for the breakpoint, an expression which
        must evaluate to true before the breakpoint is honored.  If
        condition is absent, any existing condition is removed; i.e.,
        the breakpoint is made unconditional.


ipdb>  h break


b(reak) [ ([filename:]lineno | function) [, condition] ]
        Without argument, list all breaks.

        With a line number argument, set a break at this line in the
        current file.  With a function name, set a break at the first
        executable line of that function.  If a second argument is
        present, it is a string specifying an expression which must
        evaluate to true before the breakpoint is honored.

        The line number may be prefixed with a filename and a colon,
        to specify a breakpoint in another file (probably one that
        hasn't been loaded yet).  The file is searched for on
        sys.path; the .py suffix may be omitted.


ipdb>  b


Num Type         Disp Enb   Where
1   breakpoint   keep yes   at /home/jmh/udacity_rl/deep-reinforcement-learning/p2_continuous-control/ddpg_agent.py:72
	breakpoint already hit 1 time
2   breakpoint   keep yes   at /home/jmh/udacity_rl/deep-reinforcement-learning/p2_continuous-control/ddpg_agent.py:75
	breakpoint already hit 2 times
3   breakpoint   keep yes   at /home/jmh/udacity_rl/deep-reinforcement-learning/p2_continuous-control/ddpg_agent.py:191


ipdb>  clear 2


Deleted breakpoint 2 at /home/jmh/udacity_rl/deep-reinforcement-learning/p2_continuous-control/ddpg_agent.py:75


ipdb>  b


Num Type         Disp Enb   Where
1   breakpoint   keep yes   at /home/jmh/udacity_rl/deep-reinforcement-learning/p2_continuous-control/ddpg_agent.py:72
	breakpoint already hit 1 time
3   breakpoint   keep yes   at /home/jmh/udacity_rl/deep-reinforcement-learning/p2_continuous-control/ddpg_agent.py:191


ipdb>  c


> /home/jmh/udacity_rl/deep-reinforcement-learning/p2_continuous-control/ddpg_agent.py(72)step()
     70         if len(self.memory) > BATCH_SIZE:
     71             if self.t_step == 0:
1--> 72                 experiences = self.memory.sample()
     73                 self.learn(experiences, GAMMA)
     74 


ipdb>  s


--Call--
> /home/jmh/udacity_rl/deep-reinforcement-learning/p2_continuous-control/ddpg_agent.py(187)sample()
    185         self.memory.append(e)
    186 
--> 187     def sample(self):
    188         """Randomly sample a batch of experiences from memory."""
    189         experiences = random.sample(self.memory, k=self.batch_size)


ipdb>  n


> /home/jmh/udacity_rl/deep-reinforcement-learning/p2_continuous-control/ddpg_agent.py(189)sample()
    187     def sample(self):
    188         """Randomly sample a batch of experiences from memory."""
--> 189         experiences = random.sample(self.memory, k=self.batch_size)
    190 
3   191         states = torch.from_numpy(np.vstack([e.state for e in experiences if e

ipdb>  c


> /home/jmh/udacity_rl/deep-reinforcement-learning/p2_continuous-control/ddpg_agent.py(191)sample()
    189         experiences = random.sample(self.memory, k=self.batch_size)
    190 
3-> 191         states = torch.from_numpy(np.vstack([e.state for e in experiences if e is not None])).float().to(device)
    192         actio

ipdb>  p experiences[0].reward


[0.0]


ipdb>  p experiences[0].state


array([ 3.74260521,  1.30481577,  0.72065175,  0.21013209,  0.14361505,
       -0.54456174,  0.79916942, -2.99421239,  0.59926176, -0.92209125,
        1.71133161, -0.89367723, -6.1378274 ,  7.62707138, -0.54108763,
        1.25264263, -0.72786838, -0.13838984,  0.66001654, -0.12423407,
       -4.16871071,  1.64133048, -0.20037714,  1.66415811, -3.14168978,
        3.82270241,  3.34989738, -1.        , -7.26486015,  0.        ,
        1.        ,  0.        , -0.53301406])


ipdb>  p experiences[0].action


array([-0.23741655, -0.43217555,  0.04944853, -0.34994152], dtype=float32)


ipdb>  l


    186 
    187     def sample(self):
    188         """Randomly sample a batch of experiences from memory."""
    189         experiences = random.sample(self.memory, k=self.batch_size)
    190 
3-> 191         states = torch.from_numpy(np.vstack([e.state for e in experiences if e is not None])).float(

ipdb>  p experiences[0].next_states


*** AttributeError: 'Experience' object has no attribute 'next_states'


ipdb>  p experiences[0].next_state


array([ 3.86730766,  1.10121477,  0.23466928,  0.0877711 ,  0.06690416,
       -0.59871387,  0.79332328, -4.0011344 ,  0.37668628, -1.19627059,
        1.50833106, -3.17391753, -6.04429054,  7.76144505, -0.82129276,
        1.17817092, -0.65274692, -0.01859114,  0.72584504, -0.21615945,
       -4.15250683,  1.84039509, -0.79419285,  0.94869405, -3.19777179,
        3.66718698,  3.61785316, -1.        , -7.13520432,  0.        ,
        1.        ,  0.        , -0.53301406])


ipdb>  np.shape(p experiences[0].next_state)


*** SyntaxError: invalid syntax


ipdb>  p np.shape(experiences[0].next_state)


(33,)


ipdb>  b 75


Breakpoint 4 at /home/jmh/udacity_rl/deep-reinforcement-learning/p2_continuous-control/ddpg_agent.py:75


ipdb>  c


> /home/jmh/udacity_rl/deep-reinforcement-learning/p2_continuous-control/ddpg_agent.py(191)<listcomp>()
    189         experiences = random.sample(self.memory, k=self.batch_size)
    190 
3-> 191         states = torch.from_numpy(np.vstack([e.state for e in experiences if e is not None])).float().to(device)
    192         a

ipdb>  c


> /home/jmh/udacity_rl/deep-reinforcement-learning/p2_continuous-control/ddpg_agent.py(191)<listcomp>()
    189         experiences = random.sample(self.memory, k=self.batch_size)
    190 
3-> 191         states = torch.from_numpy(np.vstack([e.state for e in experiences if e is not None])).float().to(device)
    192         a

ipdb>  w


    [... skipping 31 hidden frame(s)]

None
  <ipython-input-10-2e4a8bc04803>(9)<module>()
      5 # %autoreload 2b
      6 from IPython.core.debugger import set_trace
      7 set_trace()
      8 agent = ddpg_agent.Agent(state_size=state_size, action_size=action_size, random_seed=1)
----> 9 scores = runner.run(env, agent, brain_name=brain_name)

ipdb>  n


> /home/jmh/udacity_rl/deep-reinforcement-learning/p2_continuous-control/ddpg_agent.py(191)<listcomp>()
    189         experiences = random.sample(self.memory, k=self.batch_size)
    190 
3-> 191         states = torch.from_numpy(np.vstack([e.state for e in experiences if e is not None])).float().to(device)
    192         a

ipdb>  n


> /home/jmh/udacity_rl/deep-reinforcement-learning/p2_continuous-control/ddpg_agent.py(191)<listcomp>()
    189         experiences = random.sample(self.memory, k=self.batch_size)
    190 
3-> 191         states = torch.from_numpy(np.vstack([e.state for e in experiences if e is not None])).float().to(device)
    192         a

ipdb>  n


> /home/jmh/udacity_rl/deep-reinforcement-learning/p2_continuous-control/ddpg_agent.py(191)<listcomp>()
    189         experiences = random.sample(self.memory, k=self.batch_size)
    190 
3-> 191         states = torch.from_numpy(np.vstack([e.state for e in experiences if e is not None])).float().to(device)
    192         a

ipdb>  n


> /home/jmh/udacity_rl/deep-reinforcement-learning/p2_continuous-control/ddpg_agent.py(191)<listcomp>()
    189         experiences = random.sample(self.memory, k=self.batch_size)
    190 
3-> 191         states = torch.from_numpy(np.vstack([e.state for e in experiences if e is not None])).float().to(device)
    192         a

ipdb>  disable 3


Disabled breakpoint 3 at /home/jmh/udacity_rl/deep-reinforcement-learning/p2_continuous-control/ddpg_agent.py:191


ipdb>  n


> /home/jmh/udacity_rl/deep-reinforcement-learning/p2_continuous-control/ddpg_agent.py(191)<listcomp>()
    189         experiences = random.sample(self.memory, k=self.batch_size)
    190 
3-> 191         states = torch.from_numpy(np.vstack([e.state for e in experiences if e is not None])).float().to(device)
    192         a

ipdb>  n


> /home/jmh/udacity_rl/deep-reinforcement-learning/p2_continuous-control/ddpg_agent.py(191)<listcomp>()
    189         experiences = random.sample(self.memory, k=self.batch_size)
    190 
3-> 191         states = torch.from_numpy(np.vstack([e.state for e in experiences if e is not None])).float().to(device)
    192         a

ipdb>  b


Num Type         Disp Enb   Where
1   breakpoint   keep yes   at /home/jmh/udacity_rl/deep-reinforcement-learning/p2_continuous-control/ddpg_agent.py:72
	breakpoint already hit 2 times
3   breakpoint   keep no    at /home/jmh/udacity_rl/deep-reinforcement-learning/p2_continuous-control/ddpg_agent.py:191
	breakpoint already hit 3 times
4   breakpoint   keep yes   at /home/jmh/udacity_rl/deep-reinforcement-learning/p2_continuous-control/ddpg_agent.py:75


ipdb>  n


> /home/jmh/udacity_rl/deep-reinforcement-learning/p2_continuous-control/ddpg_agent.py(191)<listcomp>()
    189         experiences = random.sample(self.memory, k=self.batch_size)
    190 
3-> 191         states = torch.from_numpy(np.vstack([e.state for e in experiences if e is not None])).float().to(device)
    192         a

ipdb>  c


> /home/jmh/udacity_rl/deep-reinforcement-learning/p2_continuous-control/ddpg_agent.py(72)step()
     70         if len(self.memory) > BATCH_SIZE:
     71             if self.t_step == 0:
1--> 72                 experiences = self.memory.sample()
     73                 self.learn(experiences, GAMMA)
     74 


ipdb>  w


    [... skipping 31 hidden frame(s)]

None
  <ipython-input-10-2e4a8bc04803>(9)<module>()
      5 # %autoreload 2b
      6 from IPython.core.debugger import set_trace
      7 set_trace()
      8 agent = ddpg_agent.Agent(state_size=state_size, action_size=action_size, random_seed=1)
----> 9 scores = runner.run(env, agent, brain_name=brain_name)

ipdb>  l


     67         self.t_step = (self.t_step + 1) % UPDATE_EVERY
     68 
     69             # Learn, if enough samples are available in memory
     70         if len(self.memory) > BATCH_SIZE:
     71             if self.t_step == 0:
1--> 72                 experiences = self.memory.sample()
     73                 self.learn

ipdb>  ll


     62     def step(self, state, action, reward, next_state, done):
     63         """Save experience in replay memory, and use random sample from buffer to learn."""
     64         # Save experience / reward
     65         self.memory.add(state, action, reward, next_state, done)
     66 
     67         self.t_step = (self.t_step + 1) % UPDATE_

ipdb>  np.shape(next_state)


(33,)


ipdb>  p np.shape(experiences[0].next_state)


*** NameError: name 'experiences' is not defined


ipdb>  n


> /home/jmh/udacity_rl/deep-reinforcement-learning/p2_continuous-control/ddpg_agent.py(73)step()
     71             if self.t_step == 0:
1    72                 experiences = self.memory.sample()
---> 73                 self.learn(experiences, GAMMA)
     74 
4    75     def act(self, state, add_noise=True):


ipdb>  p np.shape(experiences[0].next_state)


*** AttributeError: 'Tensor' object has no attribute 'next_state'


ipdb>  type(experiences)


<class 'tuple'>


ipdb>   p np.shape(experiences.next_state[0])


*** AttributeError: 'tuple' object has no attribute 'next_state'


ipdb>   p np.shape(experiences[3])


torch.Size([64, 33])
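
The session above suggests that `sample()` does not return a list of `Experience` objects. Inside `sample()`, `experiences[0].next_state` is a single 33-element array, but once `sample()` has returned, `experiences` is a plain tuple and `experiences[3]` holds the stacked next states for the whole batch of 64. The sketch below reproduces that shape change with NumPy only. It is an illustration, not the code in `ddpg_agent.py`: `sample_batch` is a hypothetical helper, the field order of `Experience` is assumed from `experiences[3]` being the next states, and the real code additionally converts each array to a torch tensor.

```python
import random
from collections import deque, namedtuple

import numpy as np

# Field names follow the attributes printed in the debugger session;
# the order is an assumption inferred from experiences[3] being the next states.
Experience = namedtuple(
    "Experience", ["state", "action", "reward", "next_state", "done"]
)


def sample_batch(memory, batch_size):
    """Draw a random batch and stack each field into one array per field."""
    experiences = random.sample(list(memory), k=batch_size)
    states = np.vstack([e.state for e in experiences])
    actions = np.vstack([e.action for e in experiences])
    rewards = np.vstack([e.reward for e in experiences])
    next_states = np.vstack([e.next_state for e in experiences])
    dones = np.vstack([e.done for e in experiences]).astype(np.uint8)
    # A plain tuple: fields are reached by position, not by attribute name.
    return states, actions, rewards, next_states, dones


# Fill a small buffer with placeholder transitions (33-dim states, 4-dim actions).
memory = deque(maxlen=1000)
for _ in range(200):
    memory.append(
        Experience(np.zeros(33), np.zeros(4, dtype=np.float32), [0.0], np.zeros(33), False)
    )

batch = sample_batch(memory, batch_size=64)
print(type(batch).__name__, batch[3].shape)  # tuple (64, 33)
```

This is why `experiences[0].next_state` raises an `AttributeError` after the call returns: `experiences[0]` is then the whole batch of states, and the next states have to be read by position.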


When finished, you can close the environment.

In [None]:
env.close()

### 4. It's Your Turn!

Now it's your turn to train your own agent to solve the environment!  When training the agent, set `train_mode=True`, so that the line for resetting the environment looks like the following:
```python
env_info = env.reset(train_mode=True)[brain_name]
```
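
As a rough sketch of where that reset line might sit, the function below runs one training episode for a single agent. It assumes the `unityagents` attributes `vector_observations`, `rewards`, and `local_done` on the returned brain info, and an agent exposing `act(state)` and `step(state, action, reward, next_state, done)` as seen in the debugger session above. `run_episode` and `max_t` are hypothetical names chosen for this example, not part of the project code.

```python
def run_episode(env, agent, brain_name, max_t=1000):
    """Run one training episode for a single agent and return its total reward."""
    # train_mode=True resets the environment for training.
    env_info = env.reset(train_mode=True)[brain_name]
    state = env_info.vector_observations[0]  # observation of the first (only) agent
    score = 0.0
    for _ in range(max_t):
        action = agent.act(state)
        env_info = env.step(action)[brain_name]
        next_state = env_info.vector_observations[0]
        reward = env_info.rewards[0]
        done = env_info.local_done[0]
        # Store the transition; the agent decides when to learn from it.
        agent.step(state, action, reward, next_state, done)
        score += reward
        state = next_state
        if done:
            break
    return score
```

Calling `run_episode(env, agent, brain_name)` repeatedly and recording the returned scores gives a simple way to track progress across episodes.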