# Collaboration and Competition

---

In this notebook, you will learn how to use the Unity ML-Agents environment for the third project of the [Deep Reinforcement Learning Nanodegree](https://www.udacity.com/course/deep-reinforcement-learning-nanodegree--nd893) program.

### 1. Start the Environment

We begin by importing the necessary packages.  If the code cell below returns an error, please revisit the project instructions to double-check that you have installed [Unity ML-Agents](https://github.com/Unity-Technologies/ml-agents/blob/master/docs/Installation.md) and [NumPy](http://www.numpy.org/).

In [1]:
from unityagents import UnityEnvironment
import numpy as np

Next, we will start the environment!  **_Before running the code cell below_**, change the `file_name` parameter to match the location of the Unity environment that you downloaded.

- **Mac**: `"path/to/Tennis.app"`
- **Windows** (x86): `"path/to/Tennis_Windows_x86/Tennis.exe"`
- **Windows** (x86_64): `"path/to/Tennis_Windows_x86_64/Tennis.exe"`
- **Linux** (x86): `"path/to/Tennis_Linux/Tennis.x86"`
- **Linux** (x86_64): `"path/to/Tennis_Linux/Tennis.x86_64"`
- **Linux** (x86, headless): `"path/to/Tennis_Linux_NoVis/Tennis.x86"`
- **Linux** (x86_64, headless): `"path/to/Tennis_Linux_NoVis/Tennis.x86_64"`

For instance, if you are using a Mac, then you downloaded `Tennis.app`.  If this file is in the same folder as the notebook, then the line below should appear as follows:
```python
env = UnityEnvironment(file_name="Tennis.app")
```

In [2]:
env = UnityEnvironment(file_name="Tennis.app")

INFO:unityagents:
'Academy' started successfully!
Unity Academy name: Academy
        Number of Brains: 1
        Number of External Brains : 1
        Lesson number : 0
        Reset Parameters :
		
Unity brain name: TennisBrain
        Number of Visual Observations (per agent): 0
        Vector Observation space type: continuous
        Vector Observation space size (per agent): 8
        Number of stacked Vector Observation: 3
        Vector Action space type: continuous
        Vector Action space size (per agent): 2
        Vector Action descriptions: , 


Environments contain **_brains_** which are responsible for deciding the actions of their associated agents. Here we check for the first brain available, and set it as the default brain we will be controlling from Python.

In [3]:
# get the default brain
brain_name = env.brain_names[0]
brain = env.brains[brain_name]

### 2. Examine the State and Action Spaces

In this environment, two agents control rackets to bounce a ball over a net. If an agent hits the ball over the net, it receives a reward of +0.1.  If an agent lets a ball hit the ground or hits the ball out of bounds, it receives a reward of -0.01.  Thus, the goal of each agent is to keep the ball in play.

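The observation space consists of 8 variables corresponding to the position and velocity of the ball and racket. Each agent receives its own local observation, and the environment stacks 3 consecutive observations, so each agent's state vector has length 24. Two continuous actions are available, corresponding to movement toward (or away from) the net, and jumping.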

Run the code cell below to print some information about the environment.

In [4]:
# reset the environment
env_info = env.reset(train_mode=True)[brain_name]

# number of agents 
num_agents = len(env_info.agents)
print('Number of agents:', num_agents)

# size of each action
action_size = brain.vector_action_space_size
print('Size of each action:', action_size)

# examine the state space 
states = env_info.vector_observations
state_size = states.shape[1]
print('There are {} agents. Each observes a state with length: {}'.format(states.shape[0], state_size))
print('The state for the first agent looks like:', states[0])

Number of agents: 2
Size of each action: 2
There are 2 agents. Each observes a state with length: 24
The state for the first agent looks like: [ 0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.         -6.65278625 -1.5
 -0.          0.          6.83172083  6.         -0.          0.        ]
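
The 24-value state printed above can be read as 3 stacked frames of 8 variables each, matching the `Number of stacked Vector Observation: 3` line in the environment log. The sketch below splits the first agent's state into frames with NumPy. It assumes the frames are stored oldest first, which the leading zeros right after a reset suggest but which this notebook does not confirm.

```python
import numpy as np

# State of the first agent, copied from the output above
state = np.array([0.0] * 16 +
                 [-6.65278625, -1.5, -0.0, 0.0, 6.83172083, 6.0, -0.0, 0.0])

# Assumption: 3 stacked frames of 8 variables, oldest frame first
frames = state.reshape(3, 8)
newest_frame = frames[-1]

print('frames shape:', frames.shape)
print('newest frame:', newest_frame)
```

Right after a reset only the last frame is populated, because no earlier observations exist yet.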


### 3. It's Your Turn!

Now it's your turn to train your own agents to solve the environment!  When training, set `train_mode=True`, so that the line for resetting the environment looks like the following:
```python
env_info = env.reset(train_mode=True)[brain_name]
```
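
Any training routine builds on a rollout loop that resets the environment, steps it with one action vector per agent, and accumulates rewards. The sketch below shows one way to write that loop. It assumes the `unityagents` interface used in this notebook (`env.reset`, `env.step`, and the `vector_observations`, `rewards`, and `local_done` fields of the returned info object). `random_policy` and `run_episode` are hypothetical names introduced for illustration, and clipping actions to [-1, 1] is an assumption about the valid action range.

```python
import numpy as np

def random_policy(states, action_size=2):
    """Hypothetical placeholder policy: one random action vector per agent."""
    actions = np.random.randn(len(states), action_size)
    return np.clip(actions, -1, 1)  # assumed valid action range

def run_episode(env, brain_name, policy, train_mode=True, max_steps=1000):
    """Play one episode and return the undiscounted reward sum of each agent."""
    env_info = env.reset(train_mode=train_mode)[brain_name]
    states = env_info.vector_observations
    scores = np.zeros(len(env_info.agents))
    for _ in range(max_steps):
        actions = policy(states)                   # shape: (num_agents, action_size)
        env_info = env.step(actions)[brain_name]   # advance all agents one step
        scores += np.asarray(env_info.rewards)
        states = env_info.vector_observations
        if np.any(env_info.local_done):            # stop once any agent is done
            break
    return scores
```

With the environment started above, `run_episode(env, brain_name, random_policy)` would return one score per agent. Replacing `random_policy` with a learned policy leaves the loop unchanged.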

In [None]:
from ppo import run_ppo

run_ppo(env, seed=257)

episode 50. Last update in 2.855409860610962s
last 100 returns: 0.0020000002346932887
episode 100. Last update in 2.7011828422546387s
last 100 returns: 0.002000000216066837
episode 150. Last update in 3.070188045501709s
last 100 returns: 0.011000000350177288
episode 200. Last update in 2.6557908058166504s
last 100 returns: 0.002000000216066837
episode 250. Last update in 2.4497053623199463s
last 100 returns: -0.0009999998286366462
episode 300. Last update in 2.577122211456299s
last 100 returns: 0.0030000002309679987
episode 350. Last update in 2.450421094894409s
last 100 returns: -0.0009999998286366462
episode 400. Last update in 2.3683111667633057s
last 100 returns: -0.0009999998286366462
episode 450. Last update in 3.13635516166687s
last 100 returns: 0.012000000365078449
episode 500. Last update in 3.058702230453491s
last 100 returns: 0.011000000350177288
episode 550. Last update in 2.821558952331543s
last 100 returns: 0.004000000245869159
episode 600. Last update in 2.61165189743042

episode 4750. Last update in 3.3746469020843506s
last 100 returns: 0.01300000037997961
episode 4800. Last update in 3.2200779914855957s
last 100 returns: 0.011000000350177288
episode 4850. Last update in 3.2027220726013184s
last 100 returns: 0.011000000350177288
episode 4900. Last update in 3.0394668579101562s
last 100 returns: 0.009000000320374965
episode 4950. Last update in 2.855041980743408s
last 100 returns: 0.008000000305473804
episode 5000. Last update in 3.104552984237671s
last 100 returns: 0.007900000307708979
episode 5050. Last update in 2.7517142295837402s
last 100 returns: 0.0020000002346932887
episode 5100. Last update in 3.140267848968506s
last 100 returns: 0.009000000320374965
episode 5150. Last update in 2.7885239124298096s
last 100 returns: 0.0030000002309679987
episode 5200. Last update in 3.238213062286377s
last 100 returns: 0.01100000036880374
episode 5250. Last update in 3.0104939937591553s
last 100 returns: 0.005000000260770321
episode 5300. Last update in 2.93518

episode 9450. Last update in 2.6637251377105713s
last 100 returns: 0.0010000002011656762
episode 9500. Last update in 2.582023859024048s
last 100 returns: 1.8626451492309571e-10
episode 9550. Last update in 2.7538280487060547s
last 100 returns: 0.005000000260770321
episode 9600. Last update in 3.0488970279693604s
last 100 returns: 0.009000000320374965
episode 9650. Last update in 2.8406639099121094s
last 100 returns: 0.007000000290572643
episode 9700. Last update in 2.5056240558624268s
last 100 returns: -0.0009999998286366462
episode 9750. Last update in 2.896458864212036s
last 100 returns: 0.006000000275671482
episode 9800. Last update in 2.637148141860962s
last 100 returns: 0.002000000216066837
episode 9850. Last update in 2.9014971256256104s
last 100 returns: 0.0030000002309679987
episode 9900. Last update in 2.9789538383483887s
last 100 returns: 0.007000000290572643
episode 9950. Last update in 2.973422050476074s
last 100 returns: 0.011000000350177288
episode 10000. Last update in 

episode 14100. Last update in 3.002620220184326s
last 100 returns: 0.009000000320374965
episode 14150. Last update in 3.1576991081237793s
last 100 returns: 0.010000000335276127
episode 14200. Last update in 2.9420559406280518s
last 100 returns: 0.007000000290572643
episode 14250. Last update in 3.091214895248413s
last 100 returns: 0.007000000290572643
episode 14300. Last update in 3.4374639987945557s
last 100 returns: 0.011000000350177288
episode 14350. Last update in 2.742936134338379s
last 100 returns: 0.0030000002309679987
episode 14400. Last update in 3.7039949893951416s
last 100 returns: 0.012000000365078449
episode 14450. Last update in 3.325575113296509s
last 100 returns: 0.008000000305473804
episode 14500. Last update in 3.4786570072174072s
last 100 returns: 0.011900000367313623
episode 14550. Last update in 3.4622788429260254s
last 100 returns: 0.008000000305473804
episode 14600. Last update in 3.314865827560425s
last 100 returns: 0.010000000335276127
episode 14650. Last updat

In [None]:
import datetime
import os
import pickle
import shutil
from collections import deque

import matplotlib.pyplot as plt


def copy_model_and_plot_learning_curve(window=10):
    """Archive the latest scores and model checkpoint, then plot a smoothed learning curve."""
    # Timestamped folder (minute resolution) so that each run is archived separately
    datetime_stamp = datetime.datetime.now().strftime('%y%m%d_%H%M')
    plot_path = f'checkpoints/{datetime_stamp}'

    # Refuse to overwrite an archive created within the same minute
    if os.path.exists(plot_path):
        print(f'directory {plot_path} already exists')
        return
    os.makedirs(plot_path)

    # Assumes both files exist in the working directory, named after the brain
    shutil.copyfile(f'{brain_name}_scores.pickle', f'{plot_path}/scores.pickle')
    shutil.copyfile(f'{brain_name}_model_checkpoint.pickle', f'{plot_path}/model.pickle')

    with open(f'{plot_path}/scores.pickle', 'rb') as f:
        total_rewards = pickle.load(f)

    # Moving average over the most recent `window` episodes
    smoothed = []
    queue = deque(maxlen=window)
    for r in total_rewards:
        queue.append(r)
        smoothed.append(sum(queue) / len(queue))

    fig, ax = plt.subplots()
    ax.plot(smoothed)
    ax.set_xlabel('total episodes (across all agents)')
    ax.set_ylabel(f'return (moving average over {window} episodes)')
    fig.savefig(f'{plot_path}/learning_curve.png')
    plt.show()


copy_model_and_plot_learning_curve()