# Collaboration and Competition

---

In this notebook, you will learn how to use the Unity ML-Agents environment for the third project of the [Deep Reinforcement Learning Nanodegree](https://www.udacity.com/course/deep-reinforcement-learning-nanodegree--nd893) program.

### 1. Start the Environment

We begin by importing the necessary packages.  If the code cell below returns an error, please revisit the project instructions to double-check that you have installed [Unity ML-Agents](https://github.com/Unity-Technologies/ml-agents/blob/master/docs/Installation.md) and [NumPy](http://www.numpy.org/).

In [1]:
from unityagents import UnityEnvironment
import numpy as np

Next, we will start the environment!  **_Before running the code cell below_**, change the `file_name` parameter to match the location of the Unity environment that you downloaded.

- **Mac**: `"path/to/Tennis.app"`
- **Windows** (x86): `"path/to/Tennis_Windows_x86/Tennis.exe"`
- **Windows** (x86_64): `"path/to/Tennis_Windows_x86_64/Tennis.exe"`
- **Linux** (x86): `"path/to/Tennis_Linux/Tennis.x86"`
- **Linux** (x86_64): `"path/to/Tennis_Linux/Tennis.x86_64"`
- **Linux** (x86, headless): `"path/to/Tennis_Linux_NoVis/Tennis.x86"`
- **Linux** (x86_64, headless): `"path/to/Tennis_Linux_NoVis/Tennis.x86_64"`

For instance, if you are using a Mac, then you downloaded `Tennis.app`.  If this file is in the same folder as the notebook, then the line below should appear as follows:
```
env = UnityEnvironment(file_name="Tennis.app")
```

In [2]:
#env = UnityEnvironment(file_name="Tennis_Linux/Tennis.x86_64")
env = UnityEnvironment(file_name="Tennis_Windows_x86_64/Tennis.exe")

INFO:unityagents:
'Academy' started successfully!
Unity Academy name: Academy
        Number of Brains: 1
        Number of External Brains : 1
        Lesson number : 0
        Reset Parameters :
		
Unity brain name: TennisBrain
        Number of Visual Observations (per agent): 0
        Vector Observation space type: continuous
        Vector Observation space size (per agent): 8
        Number of stacked Vector Observation: 3
        Vector Action space type: continuous
        Vector Action space size (per agent): 2
        Vector Action descriptions: , 


Environments contain **_brains_** which are responsible for deciding the actions of their associated agents. Here we check for the first brain available, and set it as the default brain we will be controlling from Python.

In [3]:
# get the default brain
brain_name = env.brain_names[0]
brain = env.brains[brain_name]

### 2. Examine the State and Action Spaces

In this environment, two agents control rackets to bounce a ball over a net. If an agent hits the ball over the net, it receives a reward of +0.1.  If an agent lets a ball hit the ground or hits the ball out of bounds, it receives a reward of -0.01.  Thus, the goal of each agent is to keep the ball in play.

The observation space consists of 8 variables corresponding to the position and velocity of the ball and racket. Two continuous actions are available, corresponding to movement toward (or away from) the net, and jumping. 

Run the code cell below to print some information about the environment.

In [4]:
# reset the environment
env_info = env.reset(train_mode=True)[brain_name]

# number of agents 
num_agents = len(env_info.agents)
print('Number of agents:', num_agents)

# size of each action
action_size = brain.vector_action_space_size
print('Size of each action:', action_size)

# examine the state space 
states = env_info.vector_observations
state_size = states.shape[1]
print('There are {} agents. Each observes a state with length: {}'.format(states.shape[0], state_size))
print('The state for the first agent looks like:', states[0])

Number of agents: 2
Size of each action: 2
There are 2 agents. Each observes a state with length: 24
The state for the first agent looks like: [ 0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.         -6.65278625 -1.5
 -0.          0.          6.83172083  6.         -0.          0.        ]


When finished, you can close the environment.

### 4. It's Your Turn!

Now it's your turn to train your own agent to solve the environment!  When training the environment, set `train_mode=True`, so that the line for resetting the environment looks like the following:
```python
env_info = env.reset(train_mode=True)[brain_name]
```

In [5]:
#from buffer import ReplayBuffer
from common.Memory import ReplayMemory
from maddpg import MADDPG
import torch
import numpy as np
from tensorboardX import SummaryWriter
import os
from utilities import transpose_list, transpose_to_tensor
from collections import deque

# keep training awake
#from workspace_utils import keep_awake

# for saving gif
#import imageio

%load_ext autoreload
%autoreload 2

In [6]:
seed = 12345
np.random.seed(seed)
torch.manual_seed(seed)

# number of parallel agents
#parallel_envs = 4
# number of training episodes.
# change this to higher number to experiment. say 30000.
number_of_episodes = 2000
episode_length = 200
batchsize = 128
# how many episodes to save policy and gif
save_interval = 1000
t = 0

In [7]:
# amplitude of OU noise
# this slowly decreases to 0
noise = 6
noise_reduction = 0.9999

# how many episodes before update
episode_per_update = 10

log_path = os.getcwd()+"/log"
model_dir= os.getcwd()+"/model_dir"

os.makedirs(model_dir, exist_ok=True)

#torch.set_num_threads(parallel_envs)
#env = envs.make_parallel_env(parallel_envs)

# keep 5000 episodes worth of replay
buffer = ReplayMemory(int(1e5))

# initialize policy and critic

in_actor = state_size
hidden_in_actor = 256
hidden_out_actor = 256
out_actor = 2
# critic input = obs from both agents + actions from both agents
in_critic = 2*state_size + 2*action_size
hidden_in_critic = 512
hidden_out_critic = 256

maddpg = MADDPG(in_actor, hidden_in_actor, hidden_out_actor, 
                out_actor, in_critic, hidden_in_critic, hidden_out_critic)
logger = SummaryWriter(log_dir=log_path)
agent0_reward = []
agent1_reward = []

In [8]:
# training loop
# show progressbar
import progressbar as pb
widget = ['episode: ', pb.Counter(),'/',str(number_of_episodes),' ', 
          pb.Percentage(), ' ', pb.ETA(), ' ', pb.Bar(marker=pb.RotatingMarker()), ' ' ]

timer = pb.ProgressBar(widgets=widget, maxval=number_of_episodes).start()
'''
env_info = env.reset(train_mode=True)[brain_name]
states = env_info.vector_observations
actions = np.random.randn(num_agents, action_size)
actions = np.clip(actions, -1, 1)
env_info = env.step(actions)[brain_name] 
'''

scores_deque = deque(np.zeros(100))

# initialize buffer with random actions

for _ in range(1500):
    env_info = env.reset(train_mode=True)[brain_name]
    obs = env_info.vector_observations
    actions = np.random.randn(num_agents, action_size) # select an action (for each agent)
    actions = np.clip(actions, -1, 1)                  # all actions between -1 and 1
    actions = actions.astype(np.float32)
    env_info = env.step(actions)[brain_name]           # send all actions to tne environment
    next_obs = env_info.vector_observations         # get next state (for each agent)
    rewards = env_info.rewards                         # get reward (for each agent)
    dones = env_info.local_done                        # see if episode finished
    
    # add data to buffer
    transition = (obs, actions, rewards, next_obs, dones)
    buffer.push(*transition)

for episode in range(0, number_of_episodes):

    timer.update(episode)

    reward_this_episode = np.zeros(num_agents)
    env_info = env.reset(train_mode=True)[brain_name]
    obs = env_info.vector_observations

    #for calculating rewards for this particular episode - addition of all time steps

    # save info or not
    save_info = ((episode) % save_interval == 0 or episode==number_of_episodes-1)
    #frames = []
    tmax = 0
    '''
    if save_info:
        frames.append(env.render('rgb_array'))
    '''
    #for episode_t in range(episode_length):
    while True:
        #t += 1 #parallel_envs

        # explore = only explore for a certain number of episodes
        # action input needs to be transposed
        actions = maddpg.act(torch.tensor(obs, dtype=torch.float), noise=noise)
        noise *= noise_reduction

        actions = torch.stack(actions).detach().numpy()

        # step forward one frame
        env_info = env.step(actions)[brain_name]
        next_obs = env_info.vector_observations            # get next state (for each agent)
        rewards = env_info.rewards                         # get reward (for each agent)
        dones = env_info.local_done                        # see if episode finished
 
        # add data to buffer
        transition = (obs, actions, rewards, next_obs, dones)
        buffer.push(*transition)

        reward_this_episode += rewards

        obs = next_obs
        '''
        # save gif frame
        if save_info:
            frames.append(env.render('rgb_array'))
            tmax+=1
        '''
        if np.any(dones):
            break
    
    scores_deque.append(np.sum(reward_this_episode))
    
    # update once after every episode_per_update
    if len(buffer) > batchsize and episode % episode_per_update == 0:
        samples = buffer.sample(batchsize)
        for a_i in range(num_agents):
            #samples = buffer.sample(batchsize)
            maddpg.update(samples, a_i, logger)
        maddpg.update_targets() #soft update the target network towards the actual networks

    agent0_reward.append(reward_this_episode[0])
    agent1_reward.append(reward_this_episode[1])

    if episode % 100 == 0 or episode == number_of_episodes-1:
        avg_rewards = [np.mean(agent0_reward), np.mean(agent1_reward)]
        agent0_reward = []
        agent1_reward = []
        for a_i, avg_rew in enumerate(avg_rewards):
            logger.add_scalar('agent%i/mean_episode_rewards' % a_i, avg_rew, episode)
        print('last 100 avg reward {} is {}'.format(episode, np.mean(scores_deque)))

    #saving model
    save_dict_list =[]
    if save_info:
        for i in range(2):

            save_dict = {'actor_params' : maddpg.maddpg_agent[i].actor.state_dict(),
                         'actor_optim_params': maddpg.maddpg_agent[i].actor_optimizer.state_dict(),
                         'critic_params' : maddpg.maddpg_agent[i].critic.state_dict(),
                         'critic_optim_params' : maddpg.maddpg_agent[i].critic_optimizer.state_dict()}
            save_dict_list.append(save_dict)

            torch.save(save_dict_list, 
                       os.path.join(model_dir, 'episode-{}.pt'.format(episode)))
        '''
        # save gif files
        imageio.mimsave(os.path.join(model_dir, 'episode-{}.gif'.format(episode)), 
                        frames, duration=.04)
        '''
        
logger.close()
timer.finish()
env.close()

episode: 0/2000   0% ETA:  --:--:-- |                                        | 

last 100 avg reward 0 is -9.900989877705526e-05


episode: 20/2000   1% ETA:  0:09:06 |                                        | 

last 100 avg reward 25 is -0.0004761904064151976


episode: 40/2000   2% ETA:  0:05:20 |                                        | 

last 100 avg reward 50 is 0.0005960266247687751


episode: 79/2000   3% ETA:  0:03:26 |/                                       | 

last 100 avg reward 75 is 0.00022727289152416316


episode: 95/2000   4% ETA:  0:03:12 |-                                       | 

last 100 avg reward 100 is -0.0005472635025557


episode: 112/2000   5% ETA:  0:02:58 |\\                                     | 

last 100 avg reward 125 is -0.0002654865219266014


episode: 149/2000   7% ETA:  0:02:37 |//                                     | 

last 100 avg reward 150 is -0.00043824679437149094


episode: 169/2000   8% ETA:  0:02:27 |---                                    | 

last 100 avg reward 175 is -0.0013043476079685101


episode: 204/2000  10% ETA:  0:02:17 ||||                                    | 

last 100 avg reward 200 is -0.0016943519359510207


episode: 219/2000  10% ETA:  0:02:15 |////                                   | 

last 100 avg reward 225 is -0.0011042942367265561


episode: 254/2000  12% ETA:  0:02:08 |\\\\                                   | 

last 100 avg reward 250 is -0.0014529912082473096


episode: 272/2000  13% ETA:  0:02:05 ||||||                                  | 

last 100 avg reward 275 is -0.002021276352411889


episode: 291/2000  14% ETA:  0:02:02 |/////                                  | 

last 100 avg reward 300 is -0.0020199498751588595


episode: 322/2000  16% ETA:  0:01:59 |\\\\\\                                 | 

last 100 avg reward 325 is -0.00201877908772426


episode: 339/2000  16% ETA:  0:01:57 |||||||                                 | 

last 100 avg reward 350 is -0.001352549619361461


episode: 376/2000  18% ETA:  0:01:52 |-------                                | 

last 100 avg reward 375 is -0.0015966383849873261


episode: 391/2000  19% ETA:  0:01:51 |\\\\\\\                                | 

last 100 avg reward 400 is -0.0014171653916081506


episode: 427/2000  21% ETA:  0:01:46 |////////                               | 

last 100 avg reward 425 is -0.0014448666399762657


episode: 447/2000  22% ETA:  0:01:44 |--------                               | 

last 100 avg reward 450 is -0.0016515423694182653


episode: 465/2000  23% ETA:  0:01:42 |\\\\\\\\\                              | 

last 100 avg reward 475 is -0.0016666663836480842


episode: 502/2000  25% ETA:  0:01:39 |/////////                              | 

last 100 avg reward 500 is -0.0016805321604211794


episode: 517/2000  25% ETA:  0:01:38 |----------                             | 

last 100 avg reward 525 is -0.001533546035710615


episode: 536/2000  26% ETA:  0:01:36 |\\\\\\\\\\                             | 

last 100 avg reward 550 is -0.0017050688345265645


episode: 575/2000  28% ETA:  0:01:32 |///////////                            | 

last 100 avg reward 575 is -0.0018639050357969556


episode: 594/2000  29% ETA:  0:01:30 |-----------                            | 

last 100 avg reward 600 is -0.001868758624294685


episode: 613/2000  30% ETA:  0:01:28 |\\\\\\\\\\\                            | 

last 100 avg reward 625 is -0.0018732779436121303


episode: 653/2000  32% ETA:  0:01:25 |////////////                           | 

last 100 avg reward 650 is -0.001877496376157442


episode: 673/2000  33% ETA:  0:01:23 |-------------                          | 

last 100 avg reward 675 is -0.0018814430024820504


episode: 692/2000  34% ETA:  0:01:22 |\\\\\\\\\\\\\                          | 

last 100 avg reward 700 is -0.001885143272606621


episode: 712/2000  35% ETA:  0:01:20 ||||||||||||||                          | 

last 100 avg reward 725 is -0.0020096849325421935


episode: 752/2000  37% ETA:  0:01:17 |--------------                         | 

last 100 avg reward 750 is -0.0022444180360657066


episode: 772/2000  38% ETA:  0:01:15 |\\\\\\\\\\\\\\\                        | 

last 100 avg reward 775 is -0.0023515978785546405


episode: 792/2000  39% ETA:  0:01:13 ||||||||||||||||                        | 

last 100 avg reward 800 is -0.0023418421010496086


episode: 811/2000  40% ETA:  0:01:12 |///////////////                        | 

last 100 avg reward 825 is -0.002548595817988944


episode: 851/2000  42% ETA:  0:01:09 |\\\\\\\\\\\\\\\\                       | 

last 100 avg reward 850 is -0.0027444792028073883


episode: 869/2000  43% ETA:  0:01:08 |||||||||||||||||                       | 

last 100 avg reward 875 is -0.0025204915064768714


episode: 885/2000  44% ETA:  0:01:07 |/////////////////                      | 

last 100 avg reward 900 is -0.002507492209543715


episode: 924/2000  46% ETA:  0:01:04 |\\\\\\\\\\\\\\\\\\                     | 

last 100 avg reward 925 is -0.0025925922950050754


episode: 944/2000  47% ETA:  0:01:03 |||||||||||||||||||                     | 

last 100 avg reward 950 is -0.0027687913311962623


episode: 977/2000  48% ETA:  0:01:01 |-------------------                    | 

last 100 avg reward 975 is -0.002843865875473253


episode: 991/2000  49% ETA:  0:01:00 |\\\\\\\\\\\\\\\\\\\                    | 

last 100 avg reward 1000 is -0.0030063575625988055


episode: 1020/2000  51% ETA:  0:00:59 |///////////////////                   | 

last 100 avg reward 1025 is -0.0030728238626494078


episode: 1048/2000  52% ETA:  0:00:58 |\\\\\\\\\\\\\\\\\\\                   | 

last 100 avg reward 1050 is -0.003049521859926208


episode: 1064/2000  53% ETA:  0:00:57 |||||||||||||||||||||                  | 

last 100 avg reward 1075 is -0.0030272105886113075


episode: 1100/2000  55% ETA:  0:00:54 |--------------------                  | 

last 100 avg reward 1100 is -0.003089092125950209


episode: 1117/2000  55% ETA:  0:00:53 |\\\\\\\\\\\\\\\\\\\\\                 | 

last 100 avg reward 1125 is -0.003148449947951182


episode: 1154/2000  57% ETA:  0:00:51 |/////////////////////                 | 

last 100 avg reward 1150 is -0.003205435355004075


episode: 1174/2000  58% ETA:  0:00:50 |----------------------                | 

last 100 avg reward 1175 is -0.00333855769868508


episode: 1203/2000  60% ETA:  0:00:48 |||||||||||||||||||||||                | 

last 100 avg reward 1200 is -0.003466563887728076


episode: 1219/2000  60% ETA:  0:00:47 |///////////////////////               | 

last 100 avg reward 1225 is -0.003514328514974491


episode: 1246/2000  62% ETA:  0:00:46 |\\\\\\\\\\\\\\\\\\\\\\\               | 

last 100 avg reward 1250 is -0.003634344637504249


episode: 1261/2000  63% ETA:  0:00:45 ||||||||||||||||||||||||               | 

last 100 avg reward 1275 is -0.003749999709069989


episode: 1296/2000  64% ETA:  0:00:43 |------------------------              | 

last 100 avg reward 1300 is -0.003647394425140629


episode: 1312/2000  65% ETA:  0:00:42 |\\\\\\\\\\\\\\\\\\\\\\\\              | 

last 100 avg reward 1325 is -0.0037587654866999194


episode: 1348/2000  67% ETA:  0:00:39 |/////////////////////////             | 

last 100 avg reward 1350 is -0.003866298813539731


episode: 1365/2000  68% ETA:  0:00:38 |-------------------------             | 

last 100 avg reward 1375 is -0.0039024387339892264


episode: 1402/2000  70% ETA:  0:00:36 |||||||||||||||||||||||||||            | 

last 100 avg reward 1400 is -0.004003997045822894


episode: 1421/2000  71% ETA:  0:00:35 |//////////////////////////            | 

last 100 avg reward 1425 is -0.004102227758972626


episode: 1453/2000  72% ETA:  0:00:33 |\\\\\\\\\\\\\\\\\\\\\\\\\\\           | 

last 100 avg reward 1450 is -0.004197291782465694


episode: 1470/2000  73% ETA:  0:00:32 ||||||||||||||||||||||||||||           | 

last 100 avg reward 1475 is -0.004289339815365709


episode: 1501/2000  75% ETA:  0:00:30 |----------------------------          | 

last 100 avg reward 1500 is -0.0042535912182687


episode: 1518/2000  75% ETA:  0:00:29 |\\\\\\\\\\\\\\\\\\\\\\\\\\\\          | 

last 100 avg reward 1525 is -0.004341943133370389


episode: 1548/2000  77% ETA:  0:00:27 |/////////////////////////////         | 

last 100 avg reward 1550 is -0.004367049986542824


episode: 1579/2000  78% ETA:  0:00:25 |\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\        | 

last 100 avg reward 1575 is -0.00433174195657162


episode: 1596/2000  79% ETA:  0:00:24 |||||||||||||||||||||||||||||||        | 

last 100 avg reward 1600 is -0.004415049684671428


episode: 1629/2000  81% ETA:  0:00:22 |------------------------------        | 

last 100 avg reward 1625 is -0.004495944095039492


episode: 1645/2000  82% ETA:  0:00:21 |\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\       | 

last 100 avg reward 1650 is -0.004517418332929818


episode: 1678/2000  83% ETA:  0:00:19 |///////////////////////////////       | 

last 100 avg reward 1675 is -0.004538288003311971


episode: 1692/2000  84% ETA:  0:00:18 |--------------------------------      | 

last 100 avg reward 1700 is -0.004558578282511942


episode: 1725/2000  86% ETA:  0:00:16 |||||||||||||||||||||||||||||||||      | 

last 100 avg reward 1725 is -0.004633077481498397


episode: 1741/2000  87% ETA:  0:00:15 |/////////////////////////////////     | 

last 100 avg reward 1750 is -0.004705564276406341


episode: 1774/2000  88% ETA:  0:00:13 |\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\     | 

last 100 avg reward 1775 is -0.004776119120490513


episode: 1791/2000  89% ETA:  0:00:12 |||||||||||||||||||||||||||||||||||    | 

last 100 avg reward 1800 is -0.004844818234851271


episode: 1824/2000  91% ETA:  0:00:10 |----------------------------------    | 

last 100 avg reward 1825 is -0.00491173388310713


episode: 1841/2000  92% ETA:  0:00:09 |\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\    | 

last 100 avg reward 1850 is -0.00497693462494946


episode: 1870/2000  93% ETA:  0:00:08 |///////////////////////////////////   | 

last 100 avg reward 1875 is -0.005040485550449626


episode: 1902/2000  95% ETA:  0:00:06 |\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\  | 

last 100 avg reward 1900 is -0.005102448496801862


episode: 1919/2000  95% ETA:  0:00:05 |||||||||||||||||||||||||||||||||||||  | 

last 100 avg reward 1925 is -0.005162882249019047


episode: 1950/2000  97% ETA:  0:00:03 |------------------------------------- | 

last 100 avg reward 1950 is -0.005221842725950587


episode: 1966/2000  98% ETA:  0:00:02 |\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\ | 

last 100 avg reward 1975 is -0.005279383152859691


episode: 2000/2000 100% Time: 0:02:04 |||||||||||||||||||||||||||||||||||||||| 
