# Collaboration and Competition

---

In this notebook, you will learn how to use the Unity ML-Agents environment for the third project of the [Deep Reinforcement Learning Nanodegree](https://www.udacity.com/course/deep-reinforcement-learning-nanodegree--nd893) program.

### 1. Start the Environment

We begin by importing the necessary packages.  If the code cell below returns an error, please revisit the project instructions to double-check that you have installed [Unity ML-Agents](https://github.com/Unity-Technologies/ml-agents/blob/master/docs/Installation.md) and [NumPy](http://www.numpy.org/).

In [1]:
from unityagents import UnityEnvironment
import numpy as np

Next, we will start the environment!  **_Before running the code cell below_**, change the `file_name` parameter to match the location of the Unity environment that you downloaded.

- **Mac**: `"path/to/Tennis.app"`
- **Windows** (x86): `"path/to/Tennis_Windows_x86/Tennis.exe"`
- **Windows** (x86_64): `"path/to/Tennis_Windows_x86_64/Tennis.exe"`
- **Linux** (x86): `"path/to/Tennis_Linux/Tennis.x86"`
- **Linux** (x86_64): `"path/to/Tennis_Linux/Tennis.x86_64"`
- **Linux** (x86, headless): `"path/to/Tennis_Linux_NoVis/Tennis.x86"`
- **Linux** (x86_64, headless): `"path/to/Tennis_Linux_NoVis/Tennis.x86_64"`

For instance, if you are using a Mac, then you downloaded `Tennis.app`.  If this file is in the same folder as the notebook, then the line below should appear as follows:
```
env = UnityEnvironment(file_name="Tennis.app")
```
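
The list above can also be resolved programmatically. The helper below is a minimal sketch rather than part of the project: `tennis_env_path` is a hypothetical name, and it assumes the downloaded builds sit under a single `base_dir` with the folder names listed above.

```python
import os
import platform


def tennis_env_path(base_dir=".", system=None, machine=None, headless=False):
    """Return the path to the Tennis build for a platform (hypothetical helper)."""
    system = system or platform.system()      # e.g. "Darwin", "Windows", "Linux"
    machine = machine or platform.machine()   # e.g. "x86_64", "AMD64", "i386"
    is_64bit = machine.lower() in ("x86_64", "amd64")

    if system == "Darwin":
        relative = "Tennis.app"
    elif system == "Windows":
        folder = "Tennis_Windows_x86_64" if is_64bit else "Tennis_Windows_x86"
        relative = os.path.join(folder, "Tennis.exe")
    elif system == "Linux":
        # The headless build lives in the *_NoVis folder.
        folder = "Tennis_Linux_NoVis" if headless else "Tennis_Linux"
        binary = "Tennis.x86_64" if is_64bit else "Tennis.x86"
        relative = os.path.join(folder, binary)
    else:
        raise ValueError(f"No Tennis build listed for platform {system!r}")
    return os.path.join(base_dir, relative)
```

With this in place, the environment could be started with `UnityEnvironment(file_name=tennis_env_path("path/to"))`.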

In [2]:
env = UnityEnvironment(file_name="Tennis.app")

INFO:unityagents:
'Academy' started successfully!
Unity Academy name: Academy
        Number of Brains: 1
        Number of External Brains : 1
        Lesson number : 0
        Reset Parameters :
		
Unity brain name: TennisBrain
        Number of Visual Observations (per agent): 0
        Vector Observation space type: continuous
        Vector Observation space size (per agent): 8
        Number of stacked Vector Observation: 3
        Vector Action space type: continuous
        Vector Action space size (per agent): 2
        Vector Action descriptions: , 


Environments contain **_brains_** which are responsible for deciding the actions of their associated agents. Here we check for the first brain available, and set it as the default brain we will be controlling from Python.

In [3]:
# get the default brain
brain_name = env.brain_names[0]
brain = env.brains[brain_name]

### 2. Examine the State and Action Spaces

In this environment, two agents control rackets to bounce a ball over a net. If an agent hits the ball over the net, it receives a reward of +0.1.  If an agent lets a ball hit the ground or hits the ball out of bounds, it receives a reward of -0.01.  Thus, the goal of each agent is to keep the ball in play.

The observation space consists of 8 variables corresponding to the position and velocity of the ball and racket. The environment stacks 3 consecutive observations, so each agent's state vector has length 24. Two continuous actions are available, corresponding to movement toward (or away from) the net, and jumping.

Run the code cell below to print some information about the environment.

In [4]:
# reset the environment
env_info = env.reset(train_mode=True)[brain_name]

# number of agents 
num_agents = len(env_info.agents)
print('Number of agents:', num_agents)

# size of each action
action_size = brain.vector_action_space_size
print('Size of each action:', action_size)

# examine the state space 
states = env_info.vector_observations
state_size = states.shape[1]
print('There are {} agents. Each observes a state with length: {}'.format(states.shape[0], state_size))
print('The state for the first agent looks like:', states[0])

Number of agents: 2
Size of each action: 2
There are 2 agents. Each observes a state with length: 24
The state for the first agent looks like: [ 0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.         -6.65278625 -1.5
 -0.          0.          6.83172083  6.         -0.          0.        ]
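
The 24-element state printed above is consistent with the brain summary: 8 variables per observation, stacked 3 times. The snippet below is a small sketch of how such a state could be split back into its frames. The helper name `split_stacked_state` is hypothetical, and the frame order (oldest first, newest last) is an assumption inferred from the zeros at the start of the first state, not something the environment output confirms.

```python
import numpy as np

NUM_STACKED = 3   # "Number of stacked Vector Observation" reported by the brain
FRAME_SIZE = 8    # "Vector Observation space size (per agent)"


def split_stacked_state(state, num_stacked=NUM_STACKED, frame_size=FRAME_SIZE):
    """Reshape a flat stacked state (or a batch of them) into frames of frame_size."""
    state = np.asarray(state)
    if state.shape[-1] != num_stacked * frame_size:
        raise ValueError("state length does not match num_stacked * frame_size")
    return state.reshape(*state.shape[:-1], num_stacked, frame_size)


# First agent's state, copied from the output above.
first_state = np.array(
    [0.0] * 16 + [-6.65278625, -1.5, -0.0, 0.0, 6.83172083, 6.0, -0.0, 0.0]
)
frames = split_stacked_state(first_state)
print(frames.shape)   # (3, 8)
print(frames[-1])     # the only non-zero frame right after the reset
```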


### 3. It's Your Turn!

Now it's your turn to train your own agent to solve the environment! When training, set `train_mode=True`, so that the line for resetting the environment looks like the following:
```python
env_info = env.reset(train_mode=True)[brain_name]
```
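
For orientation, the reset call above typically sits at the top of an episode loop. The sketch below shows one possible shape of that loop, using only the `unityagents` attributes already seen in this notebook plus `env.step`, `rewards` and `local_done`. `run_episode` and `random_policy` are hypothetical names, the clipping range of [-1, 1] is an assumption about the action bounds, and this is not how `ppo.run_ppo` is implemented.

```python
import numpy as np


def run_episode(env, brain_name, policy, train_mode=True):
    """Play one episode and return each agent's undiscounted return."""
    env_info = env.reset(train_mode=train_mode)[brain_name]
    states = env_info.vector_observations
    returns = np.zeros(len(env_info.agents))
    while True:
        # Assumption: the environment expects actions in [-1, 1].
        actions = np.clip(policy(states), -1, 1)
        env_info = env.step(actions)[brain_name]
        returns += env_info.rewards
        states = env_info.vector_observations
        if np.any(env_info.local_done):   # stop as soon as any agent is done
            break
    return returns


def random_policy(states, action_size=2):
    """Placeholder policy (hypothetical): Gaussian noise, one action row per agent."""
    return np.random.randn(len(states), action_size)
```

Calling `run_episode(env, brain_name, random_policy)` would then return one return value per agent; a learned policy can be passed in place of `random_policy`.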

In [5]:
from ppo import run_ppo

run_ppo(env)

update 1/2000. Last update in 2.86102294921875e-06s
last 100 returns: -0.004999999888241291
update 2/2000. Last update in 7.3322999477386475s
last 100 returns: -0.004999999888241291
update 3/2000. Last update in 7.081874847412109s
last 100 returns: -0.004999999888241291
update 4/2000. Last update in 7.090569019317627s
last 100 returns: -0.004999999888241291
update 5/2000. Last update in 7.45884895324707s
last 100 returns: -0.004999999888241291
update 6/2000. Last update in 6.940677881240845s
last 100 returns: -0.004999999888241291
update 7/2000. Last update in 6.92144513130188s
last 100 returns: -0.004999999888241291
update 8/2000. Last update in 7.022716760635376s
last 100 returns: -0.004999999888241291
update 9/2000. Last update in 7.033691883087158s
last 100 returns: -0.004999999888241291
update 10/2000. Last update in 7.131771087646484s
last 100 returns: -0.004999999888241291
update 11/2000. Last update in 7.047396898269653s
last 100 returns: -0.004999999888241291
update 12/2000. L

last 100 returns: 0.002000000216066837
update 94/2000. Last update in 7.046533107757568s
last 100 returns: 0.009000000320374965
update 95/2000. Last update in 6.384293794631958s
last 100 returns: 0.006000000275671482
update 96/2000. Last update in 6.26589822769165s
last 100 returns: 0.007000000290572643
update 97/2000. Last update in 7.000575065612793s
last 100 returns: 0.011000000350177288
update 98/2000. Last update in 6.633790969848633s
last 100 returns: 0.007000000290572643
update 99/2000. Last update in 5.525175094604492s
last 100 returns: 0.009900000337511301
update 100/2000. Last update in 6.815004825592041s
last 100 returns: 0.014000000394880772
update 101/2000. Last update in 7.303357124328613s
last 100 returns: 0.010000000335276127
update 102/2000. Last update in 6.9667582511901855s
last 100 returns: 0.008000000305473804
update 103/2000. Last update in 5.576627016067505s
last 100 returns: 0.007000000290572643
update 104/2000. Last update in 5.412121057510376s
last 100 returns

last 100 returns: 0.010000000335276127
update 186/2000. Last update in 6.7697131633758545s
last 100 returns: 0.01300000037997961
update 187/2000. Last update in 6.84839391708374s
last 100 returns: 0.0010000002011656762
update 188/2000. Last update in 6.921565294265747s
last 100 returns: 0.009000000320374965
update 189/2000. Last update in 6.5138020515441895s
last 100 returns: 0.009900000337511301
update 190/2000. Last update in 5.749224901199341s
last 100 returns: 0.011000000350177288
update 191/2000. Last update in 5.916723966598511s
last 100 returns: 0.01300000037997961
update 192/2000. Last update in 6.500004768371582s
last 100 returns: 0.016000000424683095
update 193/2000. Last update in 7.036465883255005s
last 100 returns: 0.009000000320374965
update 194/2000. Last update in 7.09641695022583s
last 100 returns: 0.006900000292807818
update 195/2000. Last update in 6.846789121627808s
last 100 returns: 0.012000000365078449
update 196/2000. Last update in 6.946839809417725s
last 100 re

last 100 returns: 0.008000000324100255
update 277/2000. Last update in 5.288305997848511s
last 100 returns: 0.002000000216066837
update 278/2000. Last update in 6.1650390625s
last 100 returns: 0.007000000290572643
update 279/2000. Last update in 6.89237117767334s
last 100 returns: 0.005000000279396772
update 280/2000. Last update in 6.9305689334869385s
last 100 returns: 0.005000000260770321
update 281/2000. Last update in 6.861060857772827s
last 100 returns: 0.004000000245869159
update 282/2000. Last update in 6.415318965911865s
last 100 returns: 0.005000000260770321
update 283/2000. Last update in 6.204959154129028s
last 100 returns: 0.009000000320374965
update 284/2000. Last update in 5.738430023193359s
last 100 returns: 0.01300000037997961
update 285/2000. Last update in 6.653461933135986s
last 100 returns: 0.007000000290572643
update 286/2000. Last update in 5.613886833190918s
last 100 returns: 0.007000000290572643
update 287/2000. Last update in 5.528902769088745s
last 100 returns

last 100 returns: 0.009000000320374965
update 368/2000. Last update in 6.529462099075317s
last 100 returns: 0.005000000260770321
update 369/2000. Last update in 7.229299783706665s
last 100 returns: 0.011000000350177288
update 370/2000. Last update in 7.023618698120117s
last 100 returns: 0.008000000305473804
update 371/2000. Last update in 6.709799289703369s
last 100 returns: 0.0030000002309679987
update 372/2000. Last update in 6.770022869110107s
last 100 returns: 0.012000000365078449
update 373/2000. Last update in 6.72698712348938s
last 100 returns: 0.009000000320374965
update 374/2000. Last update in 6.93975305557251s
last 100 returns: 0.014000000394880772
update 375/2000. Last update in 6.727915048599243s
last 100 returns: 0.007000000290572643
update 376/2000. Last update in 7.174200057983398s
last 100 returns: 0.008000000305473804
update 377/2000. Last update in 6.999399900436401s
last 100 returns: 0.008000000305473804
update 378/2000. Last update in 7.314280033111572s
last 100 re

last 100 returns: 0.011000000350177288
update 460/2000. Last update in 8.262586832046509s
last 100 returns: 0.010000000335276127
update 461/2000. Last update in 7.151349067687988s
last 100 returns: 0.008000000305473804
update 462/2000. Last update in 6.8684117794036865s
last 100 returns: 0.011000000350177288
update 463/2000. Last update in 7.344277858734131s
last 100 returns: 0.010000000335276127
update 464/2000. Last update in 7.925554990768433s
last 100 returns: 0.006000000275671482
update 465/2000. Last update in 7.927506923675537s
last 100 returns: 0.012900000382214784
update 466/2000. Last update in 7.2133378982543945s
last 100 returns: 0.005000000260770321
update 467/2000. Last update in 7.272950887680054s
last 100 returns: 0.014000000394880772
update 468/2000. Last update in 7.656799793243408s
last 100 returns: 0.009000000320374965
update 469/2000. Last update in 6.548891067504883s
last 100 returns: 0.014900000412017106
update 470/2000. Last update in 7.392539739608765s
last 100

ValueError: Expected parameter loc (Tensor of shape (128, 2)) of distribution Normal(loc: torch.Size([128, 2]), scale: torch.Size([128, 2])) to satisfy the constraint Real(), but found invalid values:
tensor([[nan, nan],
        [nan, nan],
        [nan, nan],
        [nan, nan],
        [nan, nan],
        [nan, nan],
        [nan, nan],
        [nan, nan],
        [nan, nan],
        [nan, nan],
        [nan, nan],
        [nan, nan],
        [nan, nan],
        [nan, nan],
        [nan, nan],
        [nan, nan],
        [nan, nan],
        [nan, nan],
        [nan, nan],
        [nan, nan],
        [nan, nan],
        [nan, nan],
        [nan, nan],
        [nan, nan],
        [nan, nan],
        [nan, nan],
        [nan, nan],
        [nan, nan],
        [nan, nan],
        [nan, nan],
        [nan, nan],
        [nan, nan],
        [nan, nan],
        [nan, nan],
        [nan, nan],
        [nan, nan],
        [nan, nan],
        [nan, nan],
        [nan, nan],
        [nan, nan],
        [nan, nan],
        [nan, nan],
        [nan, nan],
        [nan, nan],
        [nan, nan],
        [nan, nan],
        [nan, nan],
        [nan, nan],
        [nan, nan],
        [nan, nan],
        [nan, nan],
        [nan, nan],
        [nan, nan],
        [nan, nan],
        [nan, nan],
        [nan, nan],
        [nan, nan],
        [nan, nan],
        [nan, nan],
        [nan, nan],
        [nan, nan],
        [nan, nan],
        [nan, nan],
        [nan, nan],
        [nan, nan],
        [nan, nan],
        [nan, nan],
        [nan, nan],
        [nan, nan],
        [nan, nan],
        [nan, nan],
        [nan, nan],
        [nan, nan],
        [nan, nan],
        [nan, nan],
        [nan, nan],
        [nan, nan],
        [nan, nan],
        [nan, nan],
        [nan, nan],
        [nan, nan],
        [nan, nan],
        [nan, nan],
        [nan, nan],
        [nan, nan],
        [nan, nan],
        [nan, nan],
        [nan, nan],
        [nan, nan],
        [nan, nan],
        [nan, nan],
        [nan, nan],
        [nan, nan],
        [nan, nan],
        [nan, nan],
        [nan, nan],
        [nan, nan],
        [nan, nan],
        [nan, nan],
        [nan, nan],
        [nan, nan],
        [nan, nan],
        [nan, nan],
        [nan, nan],
        [nan, nan],
        [nan, nan],
        [nan, nan],
        [nan, nan],
        [nan, nan],
        [nan, nan],
        [nan, nan],
        [nan, nan],
        [nan, nan],
        [nan, nan],
        [nan, nan],
        [nan, nan],
        [nan, nan],
        [nan, nan],
        [nan, nan],
        [nan, nan],
        [nan, nan],
        [nan, nan],
        [nan, nan],
        [nan, nan],
        [nan, nan],
        [nan, nan],
        [nan, nan],
        [nan, nan]], grad_fn=<AddmmBackward0>)
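
The error above reports NaNs in the `loc` argument of the `Normal` distribution, that is, in the policy's action means. The sketch below shows two generic PyTorch guards that may help narrow down or avoid this kind of failure: a check for non-finite parameters and an optimiser step with gradient-norm clipping. Both helper names and the `max_grad_norm=0.5` default are hypothetical and not taken from `ppo.py`, and neither helper identifies the actual cause of the NaNs in this run.

```python
import torch
from torch import nn


def non_finite_parameters(module):
    """Return the names of parameters that contain NaN or inf values."""
    return [name for name, p in module.named_parameters()
            if not torch.isfinite(p).all()]


def clipped_step(module, optimizer, loss, max_grad_norm=0.5):
    """Take one optimiser step with gradient clipping; skip it if the loss is non-finite."""
    if not torch.isfinite(loss):
        return False
    optimizer.zero_grad()
    loss.backward()
    nn.utils.clip_grad_norm_(module.parameters(), max_grad_norm)
    optimizer.step()
    return True


# Small demonstration on a stand-in layer (24 inputs, 2 outputs).
layer = nn.Linear(24, 2)
print(non_finite_parameters(layer))   # [] for a freshly initialised layer
with torch.no_grad():
    layer.weight[0, 0] = float("nan")
print(non_finite_parameters(layer))   # ['weight']
```

Running `non_finite_parameters` after each update would show at which update the weights first become non-finite, which is usually earlier than the point where the distribution constructor raises.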

In [None]:
import datetime
import os
import pickle
import shutil
from collections import deque

import matplotlib.pyplot as plt


def copy_model_and_plot_learning_curve():
    """Archive the latest scores and model checkpoint, then plot a smoothed learning curve."""
    # One archive directory per timestamp, formatted as yymmdd_HHMM
    datetime_stamp = datetime.datetime.now().strftime('%y%m%d_%H%M')
    plot_path = f'checkpoints/{datetime_stamp}'

    # Do not overwrite an archive created within the same minute
    if os.path.exists(plot_path):
        print(f'directory {plot_path} already exists')
        return
    os.makedirs(plot_path)

    # brain_name is the notebook-level variable defined in section 1
    shutil.copyfile(f'{brain_name}_scores.pickle', f'{plot_path}/scores.pickle')
    shutil.copyfile(f'{brain_name}_model_checkpoint.pickle', f'{plot_path}/model.pickle')

    with open(f'{plot_path}/scores.pickle', 'rb') as f:
        total_rewards = pickle.load(f)

    # Moving average over the 10 most recent entries
    smoothed = []
    queue = deque([], maxlen=10)
    for r in total_rewards:
        queue.append(r)
        smoothed.append(sum(queue) / len(queue))

    fig, ax = plt.subplots()
    ax.plot(smoothed)
    ax.set_xlabel('total episodes (across all agents)')
    ax.set_ylabel('return (moving average over 10 episodes)')
    plt.savefig(f'{plot_path}/learning_curve.png')
    plt.show()


copy_model_and_plot_learning_curve()

In [None]:
%debug

> /Users/hoonji/miniconda3/envs/drlnd/lib/python3.6/site-packages/torch/distributions/distribution.py(56)__init__()
     54                 if not valid.all():
     55                     raise ValueError(
---> 56                         f"Expected parameter {param} "
     57                         f"({type(value).__name__} of shape {tuple(value.shape)}) "
     58                         f"of distribution {repr(self)} "

ipdb> up
> /Users/hoonji/miniconda3/envs/drlnd/lib/python3.6/site-packages/torch/distributions/normal.py(50)__init__()
     48         else:


  import numpy as np


tensor([])
ipdb> p action_means
tensor([[nan, nan],
        [nan, nan],
        [nan, nan],
        [nan, nan],
        [nan, nan],
        [nan, nan],
        [nan, nan],
        [nan, nan],
        [nan, nan],
        [nan, nan],
        [nan, nan],
        [nan, nan],
        [nan, nan],
        [nan, nan],
        [nan, nan],
        [nan, nan],
        [nan, nan],
        [nan, nan],
        [nan, nan],
        [nan, nan],
        [nan, nan],
        [nan, nan],
        [nan, nan],
        [nan, nan],
        [nan, nan],
        [nan, nan],
        [nan, nan],
        [nan, nan],
        [nan, nan],
        [nan, nan],
        [nan, nan],
        [nan, nan],
        [nan, nan],
        [nan, nan],
        [nan, nan],
        [nan, nan],
        [nan, nan],
        [nan, nan],
        [nan, nan],
        [nan, nan],
        [nan, nan],
        [nan, nan],
        [nan, nan],
        [nan, nan],
        [nan, nan],
        [nan, nan],
        [nan, nan],
        [nan, nan],
        

ipdb> p x
tensor([[-5.0975e+00, -1.0197e+00,  3.0000e+01,  ...,  3.2085e+00,
         -3.0000e+01,  4.2532e+00],
        [-4.8998e+00,  2.2894e-01,  3.0000e+01,  ..., -2.2882e+00,
          3.0000e+01, -5.5568e+00],
        [-8.7491e-01,  6.6584e-01,  3.0000e+01,  ...,  2.1152e+00,
          3.0000e+01,  1.3300e-01],
        ...,
        [-1.0900e+01, -1.8522e+00,  0.0000e+00,  ...,  2.1152e+00,
         -9.5367e-06,  0.0000e+00],
        [ 0.0000e+00,  0.0000e+00,  0.0000e+00,  ...,  5.8587e+00,
         -3.0000e+01, -9.8100e-01],
        [-7.9626e+00, -1.5000e+00,  0.0000e+00,  ...,  5.6939e+00,
          3.0000e+01,  6.0190e+00]])
ipdb> for param in model.parameters():   print(param.data)
*** NameError: name 'model' is not defined
ipdb> for param in self.actor_means.parameters():   print(param.data)
tensor([[nan, nan, nan,  ..., nan, nan, nan],
        [nan, nan, nan,  ..., nan, nan, nan],
        [nan, nan, nan,  ..., nan, nan, nan],
        ...,
        [nan, nan, nan,  ..., nan, 