# Mathias Babin - P3 Collaboration and Competition Training

This is my implementation of the P3 Collaboration and Competition project for [Udacity's Deep Reinforcement Learning course](https://www.udacity.com/course/deep-reinforcement-learning-nanodegree--nd893). Details on the project are provided in the **README** for this repository. The purpose of this notebook is to **train** an Agent to solve this environment. If you wish to watch a **finished** agent perform in this environment, please go to the **Collab-Test** notebook included in this repository.


### 1. Setting up the Environment

Running the following cell verifies that both [Unity ML-Agents](https://github.com/Unity-Technologies/ml-agents/blob/master/docs/Installation.md) and [NumPy](http://www.numpy.org/) have been installed correctly, along with several other packages.

In [None]:
from unityagents import UnityEnvironment
from agent import Agent
from collections import deque
import numpy as np
import torch
import matplotlib.pyplot as plt
%matplotlib inline

This project was built and tested on a 64-bit OSX system. To run it on a different OS, change the file path in the next cell to one of the following:

- **Mac**: `"path/to/Tennis.app"`
- **Windows** (x86): `"path/to/Tennis_Windows_x86/Tennis.exe"`
- **Windows** (x86_64): `"path/to/Tennis_Windows_x86_64/Tennis.exe"`
- **Linux** (x86): `"path/to/Tennis_Linux/Tennis.x86"`
- **Linux** (x86_64): `"path/to/Tennis_Linux/Tennis.x86_64"`
- **Linux** (x86, headless): `"path/to/Tennis_Linux_NoVis/Tennis.x86"`
- **Linux** (x86_64, headless): `"path/to/Tennis_Linux_NoVis/Tennis.x86_64"`

Note that all of these files **_should_** already be included in the repository as .zip files. Simply extract the one that matches your current OS (the OSX .app is already extracted).
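
The path choice above could also be made programmatically. The following is a minimal sketch, not part of the original notebook: `tennis_build_path` is a hypothetical helper, and `path/to` remains a placeholder that you would replace with the folder where you extracted the build.

```python
import platform

# Hypothetical helper (not in the original notebook): maps an OS and CPU
# architecture to one of the build paths listed above. "path/to" is a
# placeholder, exactly as in the list.
def tennis_build_path(system=None, machine=None, headless=False):
    system = system or platform.system()  # 'Darwin', 'Windows' or 'Linux'
    machine = (machine or platform.machine()).lower()
    is_64bit = machine in ("x86_64", "amd64")
    if system == "Darwin":
        return "path/to/Tennis.app"
    if system == "Windows":
        folder = "Tennis_Windows_x86_64" if is_64bit else "Tennis_Windows_x86"
        return "path/to/{}/Tennis.exe".format(folder)
    if system == "Linux":
        # The headless build lives in the NoVis folder.
        folder = "Tennis_Linux_NoVis" if headless else "Tennis_Linux"
        suffix = "x86_64" if is_64bit else "x86"
        return "path/to/{}/Tennis.{}".format(folder, suffix)
    raise ValueError("Unsupported platform: {}".format(system))
```

The returned string could then be passed as `file_name` when creating the `UnityEnvironment`.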

The next cell simply sets up the Environment. **_IMPORTANT:_** If the following cell opens a Unity window that crashes, this is because the rest of the cells in the project are not being executed fast enough. To avoid this, please select **Restart & Run All** under **Kernel**. This will execute all the cells in the project.

In [None]:
env = UnityEnvironment(file_name="Tennis.app")

### 2. Training the Agent

Start by retrieving the brain from the environment and initializing the values needed to train the agent.

In [None]:
# Get brains from Unity ML
brain_name = env.brain_names[0]
brain = env.brains[brain_name]

env_info = env.reset(train_mode=True)[brain_name] # reset the environment

num_agents = len(env_info.agents) # get number of agents

action_size = brain.vector_action_space_size # get action size

states = env_info.vector_observations
state_size = states.shape[1] # get state space size

# Initialize the agents
agents = Agent(state_size=state_size, action_size=action_size, seed=10)

Train the Agents for n episodes, and report their average score over the last 100 episodes. The score for an episode is the larger of the two agents' scores, and the environment is considered solved once the average score over 100 consecutive episodes reaches at least +0.50. Initially, the blue agent (agent 2 in the code) takes only random actions in order to add experiences to the shared replay buffer.

In [None]:
beRandom = True # set to true if agent 2 should use random policy
num_episodes = 1500 # number of episodes
scores_avg = deque(maxlen=100) # average over 100 episodes
all_scores = [] # scores used for visualization.

for i in range(1, num_episodes+1): # loop over all episodes
    env_info = env.reset(train_mode=True)[brain_name]
    states = env_info.vector_observations # Get initial states
    scores = np.zeros(num_agents) # scores that each agent receives
    score = 0 # score for each episode

    while True: # loop over all timesteps
        action1 = agents.act(states[0]) # action for agent 1
        action2 = agents.act(states[1]) # action for agent 2
        actions = np.random.randn(num_agents, action_size) # randomized actions
        actions = np.clip(actions, -1, 1) # clip random actions
        actions[0] = action1 # replace random action with agent 1 action
        if not beRandom:
            actions[1] = action2 # replace random action with agent 2 action
        elif i >= 100 and np.mean(scores_avg) >= 0.05: # if agent 1 has improved enough, switch agent 2 policy
            beRandom = False
        
        env_info = env.step(actions)[brain_name] # step in the environment
        next_states = env_info.vector_observations # get next state
        rewards = env_info.rewards # get rewards
        dones = env_info.local_done # get if episode is done
        scores += env_info.rewards # sum rewards as score
        
        # update NNs
        agents.step(states[0], actions[0], rewards[0], next_states[0], dones[0]) # add agent 1 experiences to buffer
        agents.step(states[1], actions[1], rewards[1], next_states[1], dones[1]) # add agent 2 experiences to buffer
        
        states = next_states  # prepare for next timestep by setting a new state
        if np.any(dones): # exit if done episode
            break

    score = np.max(scores) # score is largest of two agents
    scores_avg.append(score) # average score over 100 episodes
    all_scores.append(score) # keep track of all scores for graphs
    if i > 1:
        print('\rEpisode: {}\tAverage Score: {:.2f}\tScore: {:.2f}'.format(i, np.mean(scores_avg), score), end="")
    if i % 100 == 0:
        print('\rEpisode: {}\tAverage Score: {:.2f}'.format(i, np.mean(scores_avg)))
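
The training loop above always runs for `num_episodes` episodes and does not test the solved condition itself. As a minimal sketch, assuming the condition is an average of at least +0.50 over 100 consecutive episodes, a check could look like this. The `is_solved` helper and its parameters are hypothetical and are not part of the original notebook.

```python
from collections import deque

# Hypothetical helper (not in the original notebook): checks the solved
# condition on a rolling window of per-episode scores.
def is_solved(scores_window, target=0.5, window=100):
    """Return True once `window` scores exist and their mean is >= `target`."""
    if len(scores_window) < window:
        return False  # not enough episodes recorded yet
    return sum(scores_window) / len(scores_window) >= target

# Illustrative values only, not real training results.
example_window = deque([0.6] * 100, maxlen=100)
solved = is_solved(example_window)
```

In the loop, a call such as `is_solved(scores_avg)` placed after `scores_avg.append(score)` could be used to stop training early.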

Plot the training results (Score vs. Episode Number).

In [None]:
fig = plt.figure()
ax = fig.add_subplot(111)
plt.plot(np.arange(1, len(all_scores)+1), all_scores)
plt.ylabel('Score')
plt.xlabel('Episode Num')
plt.show()
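
Per-episode scores in this environment tend to be noisy, so a smoothed curve can make the trend easier to read. The following is a minimal sketch of a trailing moving average. `moving_average` is a hypothetical helper that is not part of the original notebook, and the window of 100 is an assumption chosen to match the 100-episode average used during training.

```python
# Hypothetical helper (not in the original notebook): trailing moving average
# that uses a shorter window for the first few entries.
def moving_average(values, window=100):
    averaged = []
    running_sum = 0.0
    for idx, value in enumerate(values):
        running_sum += value
        if idx >= window:
            # Drop the value that just left the window.
            running_sum -= values[idx - window]
        averaged.append(running_sum / min(idx + 1, window))
    return averaged
```

The result has the same length as its input, so `moving_average(all_scores)` could be passed to a second `plt.plot` call with the same x-axis as the raw scores.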

Finally, save the trained weights and close the environment down.

In [None]:
torch.save(agents.actor_local.state_dict(), 'checkpoint_actor.pth')
torch.save(agents.critic_local.state_dict(), 'checkpoint_critic.pth')
        
env.close()

### 3. Implementation Details

If you have any questions about the implementation details of this project, please refer to the **Report.pdf** file included with this repository for a full explanation of both the algorithms and the design decisions.