# Collaboration and Competition

---

In this notebook, you will learn how to use the Unity ML-Agents environment for the third project of the [Deep Reinforcement Learning Nanodegree](https://www.udacity.com/course/deep-reinforcement-learning-nanodegree--nd893) program.

### 1. Start the Environment

We begin by importing the necessary packages.  If the code cell below returns an error, please revisit the project instructions to double-check that you have installed [Unity ML-Agents](https://github.com/Unity-Technologies/ml-agents/blob/master/docs/Installation.md) and [NumPy](http://www.numpy.org/).

In [1]:
from unityagents import UnityEnvironment
import numpy as np

Next, we will start the environment!  **_Before running the code cell below_**, change the `file_name` parameter to match the location of the Unity environment that you downloaded.

- **Mac**: `"path/to/Tennis.app"`
- **Windows** (x86): `"path/to/Tennis_Windows_x86/Tennis.exe"`
- **Windows** (x86_64): `"path/to/Tennis_Windows_x86_64/Tennis.exe"`
- **Linux** (x86): `"path/to/Tennis_Linux/Tennis.x86"`
- **Linux** (x86_64): `"path/to/Tennis_Linux/Tennis.x86_64"`
- **Linux** (x86, headless): `"path/to/Tennis_Linux_NoVis/Tennis.x86"`
- **Linux** (x86_64, headless): `"path/to/Tennis_Linux_NoVis/Tennis.x86_64"`

For instance, if you are using a Mac, then you downloaded `Tennis.app`.  If this file is in the same folder as the notebook, then the line below should appear as follows:
```
env = UnityEnvironment(file_name="Tennis.app")
```

In [2]:
env = UnityEnvironment(file_name="Tennis.app")

INFO:unityagents:
'Academy' started successfully!
Unity Academy name: Academy
        Number of Brains: 1
        Number of External Brains : 1
        Lesson number : 0
        Reset Parameters :
		
Unity brain name: TennisBrain
        Number of Visual Observations (per agent): 0
        Vector Observation space type: continuous
        Vector Observation space size (per agent): 8
        Number of stacked Vector Observation: 3
        Vector Action space type: continuous
        Vector Action space size (per agent): 2
        Vector Action descriptions: , 


Environments contain **_brains_** which are responsible for deciding the actions of their associated agents. Here we check for the first brain available, and set it as the default brain we will be controlling from Python.

In [3]:
# get the default brain
brain_name = env.brain_names[0]
brain = env.brains[brain_name]

### 2. Examine the State and Action Spaces

In this environment, two agents control rackets to bounce a ball over a net. If an agent hits the ball over the net, it receives a reward of +0.1.  If an agent lets a ball hit the ground or hits the ball out of bounds, it receives a reward of -0.01.  Thus, the goal of each agent is to keep the ball in play.
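
To make the reward scheme concrete, here is a minimal sketch that sums per-step rewards into one return per agent. The reward values (+0.1 and -0.01) come from the description above; the `episode_returns` helper and the sample reward sequence are hypothetical and only for illustration.

```python
def episode_returns(step_rewards):
    """Sum per-step rewards for each agent.

    step_rewards: list of steps, each step a list with one reward per agent.
    """
    num_agents = len(step_rewards[0])
    totals = [0.0] * num_agents
    for rewards in step_rewards:
        for i, reward in enumerate(rewards):
            totals[i] += reward
    return totals


# Hypothetical three-step episode with two agents:
# agent 0 hits the ball over the net, agent 1 returns it, agent 0 then misses.
sample_rewards = [
    [0.1, 0.0],
    [0.0, 0.1],
    [-0.01, 0.0],
]
returns = episode_returns(sample_rewards)
print(returns)
```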

Each agent receives its own observation of 8 variables corresponding to the position and velocity of the ball and racket. The environment stacks 3 consecutive observations, so each agent's state vector has length 24. Two continuous actions are available, corresponding to movement toward (or away from) the net, and jumping.

Run the code cell below to print some information about the environment.

In [4]:
# reset the environment
env_info = env.reset(train_mode=False)[brain_name]

# number of agents 
num_agents = len(env_info.agents)
print('Number of agents:', num_agents)

# size of each action
action_size = brain.vector_action_space_size
print('Size of each action:', action_size)

# examine the state space 
states = env_info.vector_observations
state_size = states.shape[1]
print('There are {} agents. Each observes a state with length: {}'.format(states.shape[0], state_size))
# each 24-value state holds 3 stacked frames of 8 values; print one frame at a time
print(env_info.vector_observations[:,:8])
print(env_info.vector_observations[:,8:16])
print(env_info.vector_observations[:,16:24])


Number of agents: 2
Size of each action: 2
There are 2 agents. Each observes a state with length: 24
[[0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0.]]
[[0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0.]]
[[-6.65278625 -1.5        -0.          0.          6.83172083  6.
  -0.          0.        ]
 [-6.4669857  -1.5         0.          0.         -6.83172083  6.
   0.          0.        ]]
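
The printout above suggests that each 24-value state is three stacked 8-value frames, with the most recent frame last. That ordering is an inference from this output (the first two frames are all zeros right after a reset), not something confirmed elsewhere in this notebook. A minimal sketch of splitting one agent's state into frames follows; `split_frames` is a hypothetical helper, and the sample values are copied from the first agent's row above.

```python
FRAME_SIZE = 8  # vector observation size per agent, from the brain summary


def split_frames(observation, frame_size=FRAME_SIZE):
    """Split a flat stacked observation into a list of equally sized frames."""
    if len(observation) % frame_size != 0:
        raise ValueError("observation length must be a multiple of frame_size")
    return [
        list(observation[i:i + frame_size])
        for i in range(0, len(observation), frame_size)
    ]


# First agent's state right after reset, copied from the output above.
first_agent_state = [0.0] * 16 + [
    -6.65278625, -1.5, -0.0, 0.0, 6.83172083, 6.0, -0.0, 0.0,
]
frames = split_frames(first_agent_state)
for index, frame in enumerate(frames):
    print(index, frame)
```

With NumPy, `states.reshape(num_agents, 3, 8)` produces the same grouping for all agents in one call.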


### 3. It's Your Turn!

Now it's your turn to train your own agent to solve the environment!  When training, set `train_mode=True`, so that the line for resetting the environment looks like the following:
```python
env_info = env.reset(train_mode=True)[brain_name]
```
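
As a rough sketch of how the reset call fits into an interaction loop, the block below plays one episode with random actions and accumulates each agent's reward. Several things here are assumptions rather than part of this notebook: the `run_episode` and `random_actions` helpers are hypothetical, the action range of [-1, 1] is assumed, and `FakeEnv` is only a stand-in so the sketch runs without Unity. It mimics the `reset`/`step` calls and the `rewards`/`local_done` fields of the `unityagents` interface.

```python
import random


def random_actions(num_agents, action_size):
    """One action vector per agent; entries uniform in [-1, 1] (assumed range)."""
    return [
        [random.uniform(-1.0, 1.0) for _ in range(action_size)]
        for _ in range(num_agents)
    ]


def run_episode(env, brain_name, num_agents, action_size, train_mode=True):
    """Play one episode with random actions; return each agent's total reward."""
    env_info = env.reset(train_mode=train_mode)[brain_name]
    scores = [0.0] * num_agents
    while True:
        # With the real environment, np.array(actions) matches how the
        # notebook passes actions to env.step.
        actions = random_actions(num_agents, action_size)
        env_info = env.step(actions)[brain_name]
        for i, reward in enumerate(env_info.rewards):
            scores[i] += reward
        if any(env_info.local_done):
            return scores


class FakeInfo:
    """Stand-in for the per-brain info object (hypothetical)."""

    def __init__(self, rewards, local_done):
        self.rewards = rewards
        self.local_done = local_done


class FakeEnv:
    """Stand-in environment that ends every episode after `length` steps."""

    def __init__(self, brain_name, num_agents, length=5):
        self.brain_name = brain_name
        self.num_agents = num_agents
        self.length = length
        self.steps = 0
        self.train_mode = None

    def reset(self, train_mode=True):
        self.train_mode = train_mode
        self.steps = 0
        info = FakeInfo([0.0] * self.num_agents, [False] * self.num_agents)
        return {self.brain_name: info}

    def step(self, actions):
        self.steps += 1
        done = self.steps >= self.length
        info = FakeInfo([0.1] * self.num_agents, [done] * self.num_agents)
        return {self.brain_name: info}


fake_env = FakeEnv("TennisBrain", num_agents=2)
scores = run_episode(fake_env, "TennisBrain", num_agents=2, action_size=2)
print(scores)
```

In the notebook itself, `env`, `brain_name`, `num_agents`, and `action_size` from the earlier cells would take the place of the stand-in values.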

In [5]:
from ppo import run_ppo

run_ppo(env)

Beginning training loop
update 10/50000. finished 567 episodes. Last update in 29.78994107246399s
average of last 100 returns: 0.005000000260770321
max of last 100: 0.19000000320374966
update 20/50000. finished 1138 episodes. Last update in 29.221557140350342s
average of last 100 returns: 0.009000000320374965
max of last 100: 0.20000000298023224
update 30/50000. finished 1701 episodes. Last update in 28.6325581073761s
average of last 100 returns: 0.002000000216066837
max of last 100: 0.10000000149011612
update 40/50000. finished 2284 episodes. Last update in 30.17584490776062s
average of last 100 returns: 0.002000000216066837
max of last 100: 0.10000000149011612
update 50/50000. finished 2853 episodes. Last update in 30.26931118965149s
average of last 100 returns: 0.006000000294297933
max of last 100: 0.10000000149011612
update 60/50000. finished 3407 episodes. Last update in 28.236219882965088s
average of last 100 returns: 0.008000000305473804
max of last 100: 0.10000000149011612
...

update 510/50000. finished 24423 episodes. Last update in 27.502923011779785s
average of last 100 returns: 0.05900000108405948
max of last 100: 0.30000000447034836
update 520/50000. finished 24755 episodes. Last update in 27.522988080978394s
average of last 100 returns: 0.048000000901520255
max of last 100: 0.20000000298023224
update 530/50000. finished 25065 episodes. Last update in 27.35724902153015s
average of last 100 returns: 0.05400000100955367
max of last 100: 0.30000000447034836
update 540/50000. finished 25334 episodes. Last update in 28.782644033432007s
average of last 100 returns: 0.04600000087171793
max of last 100: 0.20000000298023224
update 550/50000. finished 25641 episodes. Last update in 28.249165058135986s
average of last 100 returns: 0.04200000083073974
max of last 100: 0.19000000320374966
update 560/50000. finished 25913 episodes. Last update in 28.540213108062744s
average of last 100 returns: 0.06590000117197632
max of last 100: 0.4000000059604645
...

update 1020/50000. finished 35170 episodes. Last update in 29.09603714942932s
average of last 100 returns: 0.11000000189989806
max of last 100: 0.490000007674098
update 1030/50000. finished 35355 episodes. Last update in 28.003173828125s
average of last 100 returns: 0.09500000160187483
max of last 100: 0.5000000074505806
update 1040/50000. finished 35539 episodes. Last update in 28.926176071166992s
average of last 100 returns: 0.09290000161156059
max of last 100: 0.30000000447034836
update 1050/50000. finished 35704 episodes. Last update in 29.289483070373535s
average of last 100 returns: 0.1190000019595027
max of last 100: 0.490000007674098
update 1060/50000. finished 35875 episodes. Last update in 29.332782983779907s
average of last 100 returns: 0.10000000167638064
max of last 100: 0.5000000074505806
update 1070/50000. finished 36040 episodes. Last update in 27.481988191604614s
average of last 100 returns: 0.09590000163763762
max of last 100: 0.4000000059604645
...

update 1530/50000. finished 42225 episodes. Last update in 28.482436656951904s
average of last 100 returns: 0.2920000045374036
max of last 100: 1.0000000149011612
update 1540/50000. finished 42310 episodes. Last update in 27.15563416481018s
average of last 100 returns: 0.2350000036880374
max of last 100: 1.2000000178813934
update 1550/50000. finished 42382 episodes. Last update in 25.178118228912354s
average of last 100 returns: 0.28890000449493525
max of last 100: 1.600000023841858
update 1560/50000. finished 42462 episodes. Last update in 28.34302592277527s
average of last 100 returns: 0.2560000040009618
max of last 100: 1.700000025331974
update 1570/50000. finished 42533 episodes. Last update in 26.568253993988037s
average of last 100 returns: 0.33400000520050527
max of last 100: 1.7900000270456076
update 1580/50000. finished 42616 episodes. Last update in 25.297611713409424s
average of last 100 returns: 0.2610000040754676
max of last 100: 1.4000000208616257
...

update 2040/50000. finished 45646 episodes. Last update in 26.159268856048584s
average of last 100 returns: 0.31100000482052564
max of last 100: 1.6900000255554914
update 2050/50000. finished 45711 episodes. Last update in 28.524893283843994s
average of last 100 returns: 0.35990000557154417
max of last 100: 2.2000000327825546
update 2060/50000. finished 45764 episodes. Last update in 27.195997953414917s
average of last 100 returns: 0.4360000066831708
max of last 100: 2.400000035762787
update 2070/50000. finished 45821 episodes. Last update in 27.046528100967407s
average of last 100 returns: 0.3710000057145953
max of last 100: 2.500000037252903
update 2080/50000. finished 45864 episodes. Last update in 29.000964879989624s
average of last 100 returns: 0.5392000082321465
max of last 100: 2.600000038743019
update 2090/50000. finished 45926 episodes. Last update in 29.02534794807434s
average of last 100 returns: 0.369000005684793
max of last 100: 2.2000000327825546
...

update 2550/50000. finished 49079 episodes. Last update in 29.00020718574524s
average of last 100 returns: 0.453000006955117
max of last 100: 2.2000000327825546
update 2560/50000. finished 49131 episodes. Last update in 29.261638879776s
average of last 100 returns: 0.46800000716000795
max of last 100: 2.1000000312924385
update 2570/50000. finished 49209 episodes. Last update in 28.32998299598694s
average of last 100 returns: 0.36500000562518836
max of last 100: 1.5000000223517418
update 2580/50000. finished 49283 episodes. Last update in 27.732496976852417s
average of last 100 returns: 0.3220000049844384
max of last 100: 2.1000000312924385
update 2590/50000. finished 49362 episodes. Last update in 28.919245958328247s
average of last 100 returns: 0.27670000432059166
max of last 100: 1.4000000208616257
update 2600/50000. finished 49432 episodes. Last update in 29.503748178482056s
average of last 100 returns: 0.35100000543519855
max of last 100: 1.9000000283122063
...

update 3060/50000. finished 51417 episodes. Last update in 28.61173415184021s
average of last 100 returns: 0.8364000126719475
max of last 100: 2.7000000402331352
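
The log above reports the mean and maximum of the last 100 episode returns. How `run_ppo` computes these is not shown in this notebook; one common approach, sketched here under that caveat, is a fixed-length `collections.deque` that silently drops the oldest return once the window is full. The `ReturnTracker` class and the sample returns are hypothetical.

```python
from collections import deque


class ReturnTracker:
    """Keep the most recent `window` episode returns and summarize them."""

    def __init__(self, window=100):
        self.returns = deque(maxlen=window)  # oldest entries fall off automatically
        self.episodes = 0

    def add(self, episode_return):
        self.returns.append(episode_return)
        self.episodes += 1

    def summary(self):
        """Return (mean, max) over the current window."""
        if not self.returns:
            raise ValueError("no episodes recorded yet")
        return sum(self.returns) / len(self.returns), max(self.returns)


tracker = ReturnTracker(window=100)
# Hypothetical data: 50 episodes with return 0.0, then 100 with return 0.1.
for episode in range(150):
    tracker.add(0.0 if episode < 50 else 0.1)

average, best = tracker.summary()
print("finished {} episodes".format(tracker.episodes))
print("average of last 100 returns:", average)
print("max of last 100:", best)
```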


In [10]:
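import ppo
import torch

def try_model(model_path=f'{brain_name}_model_final.pickle', hidden_size=32):
    """Load saved weights and play one episode with the trained policy."""
    env_info = env.reset(train_mode=False)[brain_name]
    agent = ppo.Agent(state_size, action_size, hidden_size).to('cpu')
    # map_location keeps loading working even if the weights were saved on a GPU
    agent.load_state_dict(torch.load(model_path, map_location='cpu'))
    agent.eval()

    # no gradients are needed for playback; step until either agent is done
    with torch.no_grad():
        while not any(env_info.local_done):
            a, _ = agent.pi(torch.Tensor(env_info.vector_observations))
            env_info = env.step(a.numpy())[brain_name]

try_model()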