# Collaboration and Competition

---

In this notebook, you will learn how to use the Unity ML-Agents environment for the third project of the [Deep Reinforcement Learning Nanodegree](https://www.udacity.com/course/deep-reinforcement-learning-nanodegree--nd893) program.

### 1. Start the Environment

We begin by importing the necessary packages.  If the code cell below returns an error, please revisit the project instructions to double-check that you have installed [Unity ML-Agents](https://github.com/Unity-Technologies/ml-agents/blob/master/docs/Installation.md) and [NumPy](http://www.numpy.org/).

In [1]:
from unityagents import UnityEnvironment
import numpy as np

Next, we will start the environment!  **_Before running the code cell below_**, change the `file_name` parameter to match the location of the Unity environment that you downloaded.

- **Mac**: `"path/to/Tennis.app"`
- **Windows** (x86): `"path/to/Tennis_Windows_x86/Tennis.exe"`
- **Windows** (x86_64): `"path/to/Tennis_Windows_x86_64/Tennis.exe"`
- **Linux** (x86): `"path/to/Tennis_Linux/Tennis.x86"`
- **Linux** (x86_64): `"path/to/Tennis_Linux/Tennis.x86_64"`
- **Linux** (x86, headless): `"path/to/Tennis_Linux_NoVis/Tennis.x86"`
- **Linux** (x86_64, headless): `"path/to/Tennis_Linux_NoVis/Tennis.x86_64"`

For instance, if you are using a Mac, then you downloaded `Tennis.app`.  If this file is in the same folder as the notebook, then the line below should appear as follows:
```python
env = UnityEnvironment(file_name="Tennis.app")
```
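If the notebook is shared across machines, the path can be chosen programmatically instead of edited by hand. The following is only a sketch: `tennis_build_path` is a hypothetical helper (not part of `unityagents`), `base_dir` is a placeholder for wherever the build was downloaded, and the 64-bit check is a simple heuristic based on `platform.machine()`.

```python
import platform


def tennis_build_path(system=None, machine=None, headless=False, base_dir="path/to"):
    """Return the Tennis build path from the list above for the given platform.

    `system` and `machine` default to the values reported by the `platform`
    module; they are parameters so the mapping can be checked on any machine.
    """
    system = system or platform.system()
    machine = machine or platform.machine()
    # heuristic: 'x86_64', 'AMD64' and 'arm64' all end in '64'
    is_64bit = machine.endswith("64")

    if system == "Darwin":
        return f"{base_dir}/Tennis.app"
    if system == "Windows":
        folder = "Tennis_Windows_x86_64" if is_64bit else "Tennis_Windows_x86"
        return f"{base_dir}/{folder}/Tennis.exe"
    if system == "Linux":
        folder = "Tennis_Linux_NoVis" if headless else "Tennis_Linux"
        suffix = "x86_64" if is_64bit else "x86"
        return f"{base_dir}/{folder}/Tennis.{suffix}"
    raise ValueError(f"unsupported platform: {system}")


print(tennis_build_path("Linux", "x86_64", headless=True))
```

The result can then be passed as `UnityEnvironment(file_name=tennis_build_path(base_dir="."))` when the build sits next to the notebook.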

In [2]:
env = UnityEnvironment(file_name="Tennis.app")

INFO:unityagents:
'Academy' started successfully!
Unity Academy name: Academy
        Number of Brains: 1
        Number of External Brains : 1
        Lesson number : 0
        Reset Parameters :
		
Unity brain name: TennisBrain
        Number of Visual Observations (per agent): 0
        Vector Observation space type: continuous
        Vector Observation space size (per agent): 8
        Number of stacked Vector Observation: 3
        Vector Action space type: continuous
        Vector Action space size (per agent): 2
        Vector Action descriptions: , 


Environments contain **_brains_**, which are responsible for deciding the actions of their associated agents. Here we check for the first brain available and set it as the default brain we will be controlling from Python.

In [3]:
# get the default brain
brain_name = env.brain_names[0]
brain = env.brains[brain_name]

### 2. Examine the State and Action Spaces

In this environment, two agents control rackets to bounce a ball over a net. If an agent hits the ball over the net, it receives a reward of +0.1.  If an agent lets the ball hit the ground or hits the ball out of bounds, it receives a reward of -0.01.  Thus, the goal of each agent is to keep the ball in play.

Each observation consists of 8 variables corresponding to the position and velocity of the ball and racket. The environment stacks 3 consecutive observations, so each agent receives a state vector of length 24. Two continuous actions are available, corresponding to movement toward (or away from) the net, and jumping.

Run the code cell below to print some information about the environment.

In [4]:
# reset the environment
env_info = env.reset(train_mode=True)[brain_name]

# number of agents 
num_agents = len(env_info.agents)
print('Number of agents:', num_agents)

# size of each action
action_size = brain.vector_action_space_size
print('Size of each action:', action_size)

# examine the state space 
states = env_info.vector_observations
state_size = states.shape[1]
print('There are {} agents. Each observes a state with length: {}'.format(states.shape[0], state_size))
print('The state for the first agent looks like:', states[0])

Number of agents: 2
Size of each action: 2
There are 2 agents. Each observes a state with length: 24
The state for the first agent looks like: [ 0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.         -6.65278625 -1.5
 -0.          0.          6.83172083  6.         -0.          0.        ]
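
The length of 24 matches the brain summary printed earlier: 8 variables per observation times 3 stacked observations. As a minimal sketch, the state can be split back into its frames with a reshape. The helper name `split_stacked_state` is hypothetical, and the ordering (oldest frame first, newest last) is an assumption; it is suggested by the printout above, where only the last 8 entries are non-zero right after a reset.

```python
import numpy as np


def split_stacked_state(state, num_stacked=3):
    """Reshape a flat stacked state into (num_stacked, variables_per_frame)."""
    state = np.asarray(state)
    return state.reshape(num_stacked, -1)


# the first agent's state from the printout above
example_state = np.array(
    [0.0] * 16 + [-6.65278625, -1.5, -0.0, 0.0, 6.83172083, 6.0, -0.0, 0.0]
)
frames = split_stacked_state(example_state)
print(frames.shape)  # (3, 8)
print(frames[-1])    # assumed to be the most recent observation
```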


### 3. It's Your Turn!

Now it's your turn to train your own agents to solve the environment!  When training the agents, set `train_mode=True`, so that the line for resetting the environment looks like the following:
```python
env_info = env.reset(train_mode=True)[brain_name]
```
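
Before starting a long training run, it can help to confirm that the interaction loop works end to end. The sketch below plays one episode with random actions. It uses `env.reset`, `env.step`, `env_info.rewards` and `env_info.local_done` from `unityagents`; the function name `run_random_episode`, the `max_steps` cap, and the action range of [-1, 1] are assumptions made for illustration.

```python
import numpy as np


def run_random_episode(env, brain_name, num_agents, action_size, max_steps=1000):
    """Play one episode with uniformly random actions; return each agent's score."""
    env_info = env.reset(train_mode=False)[brain_name]
    scores = np.zeros(num_agents)
    for _ in range(max_steps):
        # assumption: every action component is expected to lie in [-1, 1]
        actions = np.random.uniform(-1.0, 1.0, size=(num_agents, action_size))
        env_info = env.step(actions)[brain_name]
        # accumulate the per-agent rewards returned for this step
        scores += np.asarray(env_info.rewards)
        # stop as soon as any agent reports the end of its episode
        if np.any(env_info.local_done):
            break
    return scores
```

With the variables defined in the earlier cells, a call would look like `scores = run_random_episode(env, brain_name, num_agents, action_size)`. Random play should score close to zero, which gives a baseline for reading the training log.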

In [5]:
from ppo import run_ppo

run_ppo(env)

Beginning training loop
update 10/50000. finished 718 episodes. Last update in 35.69816517829895s
average of last 100 returns: -0.004999999888241291
update 20/50000. finished 1439 episodes. Last update in 33.805360317230225s
average of last 100 returns: -0.004999999888241291
update 30/50000. finished 2159 episodes. Last update in 33.228659868240356s
average of last 100 returns: -0.004999999888241291
update 40/50000. finished 2880 episodes. Last update in 33.8143470287323s
average of last 100 returns: -0.004999999888241291
update 50/50000. finished 3597 episodes. Last update in 30.281995058059692s
average of last 100 returns: -0.004999999888241291
update 60/50000. finished 4317 episodes. Last update in 31.149674892425537s
average of last 100 returns: -0.004999999888241291
update 70/50000. finished 5039 episodes. Last update in 29.448139667510986s
average of last 100 returns: -0.004999999888241291
update 80/50000. finished 5760 episodes. Last update in 30.4218909740448s
average of last 1

update 650/50000. finished 42151 episodes. Last update in 29.541893243789673s
average of last 100 returns: 0.0210000004991889
update 660/50000. finished 42608 episodes. Last update in 30.819448947906494s
average of last 100 returns: 0.015000000409781934
update 670/50000. finished 43078 episodes. Last update in 30.448153018951416s
average of last 100 returns: 0.01300000037997961
update 680/50000. finished 43534 episodes. Last update in 29.44635009765625s
average of last 100 returns: 0.014000000394880772
update 690/50000. finished 43992 episodes. Last update in 27.895966291427612s
average of last 100 returns: 0.014000000394880772
update 700/50000. finished 44431 episodes. Last update in 27.88070297241211s
average of last 100 returns: 0.012000000365078449
update 710/50000. finished 44897 episodes. Last update in 26.828408002853394s
average of last 100 returns: 0.007000000309199094
update 720/50000. finished 45362 episodes. Last update in 26.922012090682983s
average of last 100 returns: 0.

update 1300/50000. finished 73406 episodes. Last update in 29.091934204101562s
average of last 100 returns: 0.004000000245869159
update 1310/50000. finished 73922 episodes. Last update in 28.034245014190674s
average of last 100 returns: 0.010000000335276127
update 1320/50000. finished 74450 episodes. Last update in 27.348743200302124s
average of last 100 returns: 0.010000000335276127
update 1330/50000. finished 74939 episodes. Last update in 28.77805995941162s
average of last 100 returns: 0.01300000037997961
update 1340/50000. finished 75418 episodes. Last update in 27.865222215652466s
average of last 100 returns: 0.012000000365078449
update 1350/50000. finished 75966 episodes. Last update in 27.263211011886597s
average of last 100 returns: 0.012000000383704901
update 1360/50000. finished 76477 episodes. Last update in 27.962546825408936s
average of last 100 returns: 0.009000000320374965
update 1370/50000. finished 76997 episodes. Last update in 29.28160524368286s
average of last 100 r

update 1940/50000. finished 107564 episodes. Last update in 27.110058069229126s
average of last 100 returns: 0.005000000260770321
update 1950/50000. finished 108090 episodes. Last update in 25.516396045684814s
average of last 100 returns: 0.011000000350177288
update 1960/50000. finished 108623 episodes. Last update in 28.424100160598755s
average of last 100 returns: 0.008000000305473804
update 1970/50000. finished 109170 episodes. Last update in 27.74460005760193s
average of last 100 returns: 0.005000000260770321
update 1980/50000. finished 109712 episodes. Last update in 28.430938005447388s
average of last 100 returns: 0.006000000275671482
update 1990/50000. finished 110257 episodes. Last update in 30.1386137008667s
average of last 100 returns: 0.010000000335276127
update 2000/50000. finished 110815 episodes. Last update in 30.371028900146484s
average of last 100 returns: 0.005000000260770321
update 2010/50000. finished 111351 episodes. Last update in 30.849958896636963s
average of la

update 2580/50000. finished 142260 episodes. Last update in 29.33025622367859s
average of last 100 returns: 0.007000000290572643
update 2590/50000. finished 142789 episodes. Last update in 26.82097315788269s
average of last 100 returns: 1.8626451492309571e-10
update 2600/50000. finished 143326 episodes. Last update in 29.599945068359375s
average of last 100 returns: 0.006000000275671482
update 2610/50000. finished 143860 episodes. Last update in 26.073863744735718s
average of last 100 returns: 0.008000000305473804
update 2620/50000. finished 144431 episodes. Last update in 31.407354831695557s
average of last 100 returns: 0.006000000275671482
update 2630/50000. finished 144987 episodes. Last update in 29.555095195770264s
average of last 100 returns: 0.010000000335276127
update 2640/50000. finished 145536 episodes. Last update in 31.188035011291504s
average of last 100 returns: 0.0030000002309679987
update 2650/50000. finished 146074 episodes. Last update in 30.05698299407959s
average of

update 3220/50000. finished 176909 episodes. Last update in 29.450183868408203s
average of last 100 returns: 0.002000000216066837
update 3230/50000. finished 177447 episodes. Last update in 28.845417976379395s
average of last 100 returns: 0.008000000305473804
update 3240/50000. finished 177985 episodes. Last update in 28.906162977218628s
average of last 100 returns: 0.008000000305473804
update 3250/50000. finished 178503 episodes. Last update in 29.98879599571228s
average of last 100 returns: 0.015000000409781934
update 3260/50000. finished 179035 episodes. Last update in 31.45686411857605s
average of last 100 returns: 0.004000000245869159
update 3270/50000. finished 179582 episodes. Last update in 31.38096523284912s
average of last 100 returns: 0.0010000002011656762
update 3280/50000. finished 180131 episodes. Last update in 31.2564640045166s
average of last 100 returns: 0.007000000290572643
update 3290/50000. finished 180673 episodes. Last update in 32.211796283721924s
average of las

update 3860/50000. finished 211666 episodes. Last update in 31.180580854415894s
average of last 100 returns: 0.006000000275671482
update 3870/50000. finished 212205 episodes. Last update in 32.26558065414429s
average of last 100 returns: 0.010000000335276127
update 3880/50000. finished 212733 episodes. Last update in 30.869125843048096s
average of last 100 returns: 0.016000000424683095
update 3890/50000. finished 213299 episodes. Last update in 29.967977046966553s
average of last 100 returns: 0.005000000260770321
update 3900/50000. finished 213849 episodes. Last update in 30.02523183822632s
average of last 100 returns: 0.009000000320374965
update 3910/50000. finished 214377 episodes. Last update in 30.657170057296753s
average of last 100 returns: 0.004000000245869159
update 3920/50000. finished 214927 episodes. Last update in 30.530492305755615s
average of last 100 returns: 0.008000000305473804
update 3930/50000. finished 215469 episodes. Last update in 32.5585880279541s
average of las

update 4500/50000. finished 246443 episodes. Last update in 32.3896119594574s
average of last 100 returns: 0.010000000335276127
update 4510/50000. finished 246988 episodes. Last update in 31.555157899856567s
average of last 100 returns: 0.006000000275671482
update 4520/50000. finished 247516 episodes. Last update in 31.351958990097046s
average of last 100 returns: 0.008000000305473804
update 4530/50000. finished 248048 episodes. Last update in 31.99430799484253s
average of last 100 returns: 0.018000000454485417
update 4540/50000. finished 248586 episodes. Last update in 30.816017150878906s
average of last 100 returns: 0.009000000320374965
update 4550/50000. finished 249139 episodes. Last update in 31.869002103805542s
average of last 100 returns: 0.005000000260770321
update 4560/50000. finished 249672 episodes. Last update in 30.42377018928528s
average of last 100 returns: 0.016000000424683095
update 4570/50000. finished 250212 episodes. Last update in 30.468272924423218s
average of las

ValueError: Expected parameter loc (Tensor of shape (204, 2)) of distribution Normal(loc: torch.Size([204, 2]), scale: torch.Size([204, 2])) to satisfy the constraint Real(), but found invalid values:
tensor([[nan, nan],
        [nan, nan],
        [nan, nan],
        ...,
        [nan, nan],
        [nan, nan],
        [nan, nan]], grad_fn=<AddmmBackward0>)
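
The traceback shows that the mean (`loc`) passed to the `Normal` distribution was NaN for every row, so the policy network's output had already diverged by the time the distribution was built. The cause is not diagnosed here. As a hedged sketch, a small guard can report non-finite values earlier, for example on the states and on the network output. `check_finite` is a hypothetical helper, not part of `ppo` or PyTorch.

```python
import numpy as np


def check_finite(name, values):
    """Raise a descriptive error if `values` contains NaN or infinity."""
    array = np.asarray(values, dtype=float)
    bad = ~np.isfinite(array)
    if bad.any():
        raise FloatingPointError(
            f"{name} contains {int(bad.sum())} non-finite value(s) out of {array.size}"
        )
    return array


check_finite("states", [[0.0, -1.5], [6.0, 0.0]])  # passes silently
```

For a PyTorch tensor, the analogous check is `torch.isfinite(tensor).all()`, or the tensor can be converted first with `tensor.detach().cpu().numpy()`. Gradient clipping via `torch.nn.utils.clip_grad_norm_` and a lower learning rate are commonly tried mitigations, though whether they apply to this run is an assumption.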

In [None]:
import datetime
import os
import pickle
import shutil
from collections import deque

import matplotlib.pyplot as plt


def copy_model_and_plot_learning_curve(window=10):
    """Archive the latest scores and model checkpoint, then plot a smoothed learning curve."""
    # timestamped directory so that each run gets its own checkpoint folder
    datetime_stamp = datetime.datetime.now().strftime('%y%m%d_%H%M')
    plot_path = f'checkpoints/{datetime_stamp}'

    # refuse to overwrite an archive created within the same minute
    if os.path.exists(plot_path):
        print(f'directory {plot_path} already exists')
        return
    os.makedirs(plot_path)

    # copy the score and checkpoint files from the working directory;
    # `brain_name` is the global defined in cell 3
    shutil.copyfile(f'{brain_name}_scores.pickle', f'{plot_path}/scores.pickle')
    shutil.copyfile(f'{brain_name}_model_checkpoint.pickle', f'{plot_path}/model.pickle')

    with open(f'{plot_path}/scores.pickle', 'rb') as f:
        total_rewards = pickle.load(f)

    # moving average over the last `window` episodes
    smoothed = []
    queue = deque(maxlen=window)
    for r in total_rewards:
        queue.append(r)
        smoothed.append(sum(queue) / len(queue))

    fig, ax = plt.subplots()
    ax.plot(smoothed)
    ax.set_xlabel('total episodes (across all agents)')
    ax.set_ylabel(f'return (moving average over {window} episodes)')
    fig.savefig(f'{plot_path}/learning_curve.png')
    plt.show()


copy_model_and_plot_learning_curve()