# Continuous Control

---

In this notebook, you will learn how to use the Unity ML-Agents environment for the second project of the [Deep Reinforcement Learning Nanodegree](https://www.udacity.com/course/deep-reinforcement-learning-nanodegree--nd893) program. We also demonstrate a possible solution for the second project.


### 1. Start the Environment

We begin by importing the necessary packages.  If the code cell below returns an error, please revisit the project instructions to double-check that you have installed [Unity ML-Agents](https://github.com/Unity-Technologies/ml-agents/blob/master/docs/Installation.md) and [NumPy](http://www.numpy.org/).

We use the headless version of the environment here, because the project runs in a Docker container so that it can also be executed on remote (possibly GPU-enabled) computing resources.


Next, we will start the environment!  **_Before running the code cell below_**, change the `file_name` parameter to match the location of the Unity environment that you downloaded.

- **Mac**: `"path/to/Reacher.app"`
- **Windows** (x86): `"path/to/Reacher_Windows_x86/Reacher.exe"`
- **Windows** (x86_64): `"path/to/Reacher_Windows_x86_64/Reacher.exe"`
- **Linux** (x86): `"path/to/Reacher_Linux/Reacher.x86"`
- **Linux** (x86_64): `"path/to/Reacher_Linux/Reacher.x86_64"`
- **Linux** (x86, headless): `"path/to/Reacher_Linux_NoVis/Reacher.x86"`
- **Linux** (x86_64, headless): `"path/to/Reacher_Linux_NoVis/Reacher.x86_64"`

For instance, if you are using a Mac, then you downloaded `Reacher.app`.  If this file is in the same folder as the notebook, then the line below should appear as follows:
```
env = UnityEnvironment(file_name="Reacher.app")
```


In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import torch
import progressbar as pb  # progress bar used by the training loop (pb.ProgressBar)

from unityagents import UnityEnvironment
from collections import deque
from ddpg_agent import Agent

sns.set()
%config InlineBackend.figure_format = 'retina'
%matplotlib inline

In [None]:
env = UnityEnvironment(file_name='../Reacher_Linux_NoVis/Reacher.x86_64')
brain_name = env.brain_names[0]
brain = env.brains[brain_name]

# reset the environment; train_mode=True because we want to train the agent here
env_info = env.reset(train_mode=True)[brain_name]

# number of agents
num_agents = len(env_info.agents)
print('Number of agents:', num_agents)

# size of each action
action_size = brain.vector_action_space_size

### 2. Train the agent with DDPG

For this assignment we train the agent with the `Deep Deterministic Policy Gradient` (DDPG) algorithm. A very good description of it is available [here](https://spinningup.openai.com/en/latest/algorithms/ddpg.html). The underlying neural networks for the actor and the critic (arguably more wide than deep) are defined in [model.py](./model.py), and the agent itself in [ddpg_agent.py](./ddpg_agent.py). These implementations are based on the examples provided in the actor-critic methods chapter of the course, [here](https://github.com/udacity/deep-reinforcement-learning/tree/master/ddpg-bipedal) and [here](https://github.com/udacity/deep-reinforcement-learning/tree/master/ddpg-pendulum).
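
As a rough illustration of two DDPG building blocks controlled by the `gamma` and `tau` hyperparameters used later in this notebook, here is a minimal plain-Python sketch of the bootstrapped critic target and the soft target-network update. The function names `td_target` and `soft_update` are hypothetical and chosen for this example only; the actual code in `ddpg_agent.py` works on PyTorch tensors and may differ in detail.

```python
def td_target(reward, next_q, done, gamma=0.99):
    """Critic target: r + gamma * Q_target(s', mu_target(s')), with no bootstrap at episode end."""
    return reward + gamma * next_q * (1.0 - float(done))


def soft_update(local_params, target_params, tau=0.001):
    """Move each target parameter a small step (tau) toward its local counterpart."""
    return [tau * local + (1.0 - tau) * target
            for local, target in zip(local_params, target_params)]


# Illustrative values only (not taken from a real training run)
y = td_target(reward=1.0, next_q=2.0, done=False, gamma=0.5)
new_target = soft_update([1.0, 2.0], [0.0, 0.0], tau=0.5)
```

A small `tau` keeps the target networks changing slowly, which is the usual motivation for soft updates in DDPG.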

I chose to solve the first version of the environment: the task is episodic, and in order to solve the environment the agent must reach an average score of +30 over 100 consecutive episodes.
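
The solving criterion can be checked with a fixed-size window over the most recent episode scores. The sketch below is a standalone illustration in plain Python; the helper name `first_solved_episode` is hypothetical and is not part of the project code.

```python
from collections import deque


def first_solved_episode(scores, target=30.0, window=100):
    """Return the 1-based episode at which the mean of the last `window` scores
    first reaches `target`, or None if that never happens."""
    recent = deque(maxlen=window)  # old scores fall out automatically
    for episode, score in enumerate(scores, start=1):
        recent.append(score)
        if len(recent) == window and sum(recent) / window >= target:
            return episode
    return None


# Made-up scores: 50 poor episodes followed by 100 good ones
example_scores = [0.0] * 50 + [40.0] * 100
solved_at = first_solved_episode(example_scores)
```

Note that the criterion cannot be met before episode 100, since the window must be full.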

Next we define a function that runs the training loop; we reuse it in the later hyperparameter experiments.

In [None]:
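def ddpg(n_episodes=10, max_t=1000, print_every=5, agent=None):
    # Build the default agent lazily: a default of Agent(...) in the signature would be
    # created once at definition time and shared (already trained) across calls.
    if agent is None:
        agent = Agent(state_size=33, action_size=4, random_seed=42)

    widget = ['training loop: ', pb.Percentage(), ' ', pb.Bar(), ' ', pb.ETA() ]
    timer = pb.ProgressBar(widgets=widget, maxval=n_episodes+1).start()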
    
    scores_deque = deque(maxlen=100)
    scores_all = []
    
    for i_episode in range(1, n_episodes + 1):
        env_info = env.reset(train_mode=True)[brain_name]

        state = env_info.vector_observations[0]  # current state of the first agent
        scores = np.zeros(num_agents)
        mean_score = 0
        agent.reset()
        for t in range(max_t):
            action = agent.act(state)
            env_info = env.step(action)[brain_name]
            next_state = env_info.vector_observations[0]
            reward = env_info.rewards[0]
            done = env_info.local_done[0]

            agent.step(state, action, reward, next_state, done)
            state = next_state
            scores += reward
            if done:
                break
        
        score = np.mean(scores)
        scores_deque.append(score)
        mean_score = np.mean(scores_deque)
        scores_all.append(score)

        if i_episode % print_every == 0 or (len(scores_deque) == 100 and np.mean(scores_deque) >= 30):
            timer.update(i_episode+1)
            torch.save(agent.actor_local.state_dict(), 'checkpoint_actor.pth')
            torch.save(agent.critic_local.state_dict(), 'checkpoint_critic.pth')
            print('\rEpisode {}\tAverage Score: {:.2f}\tMin Score: {:.2f}\tMax Score: {:.2f}'.format(i_episode, np.mean(scores_deque),
                                                                                                    np.min(scores_deque), np.max(scores_deque)))

        if len(scores_deque) == 100 and np.mean(scores_deque) >= 30:  
            print('Environment solved !')
            break

    timer.finish()
    return scores_all

#### 2.1. Baseline for experiments

Here we define our baseline for the later experiments, to check the effect of the chosen parameters on the convergence. 
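
Two of the parameters passed to `Agent` in this section, `buffer_size` and `batch_size`, describe the experience replay memory that DDPG samples from. The following is a minimal sketch of such a buffer in plain Python, under the assumption that the agent stores `(state, action, reward, next_state, done)` tuples; the class shown here is hypothetical and not necessarily how `ddpg_agent.py` implements it.

```python
import random
from collections import deque, namedtuple

Experience = namedtuple("Experience", ["state", "action", "reward", "next_state", "done"])


class ReplayBuffer:
    """Fixed-size memory of experiences with uniform random sampling."""

    def __init__(self, buffer_size, batch_size, seed=0):
        # deque needs an integer maxlen, so a value such as 1e5 is cast explicitly
        self.memory = deque(maxlen=int(buffer_size))
        self.batch_size = batch_size
        self.rng = random.Random(seed)

    def add(self, state, action, reward, next_state, done):
        """Store one transition; the oldest one is dropped when the buffer is full."""
        self.memory.append(Experience(state, action, reward, next_state, done))

    def sample(self):
        """Draw `batch_size` distinct experiences uniformly at random."""
        return self.rng.sample(list(self.memory), self.batch_size)

    def __len__(self):
        return len(self.memory)


# Tiny illustrative buffer: capacity 3, batches of 2
buffer = ReplayBuffer(buffer_size=3, batch_size=2)
for step in range(5):
    buffer.add(state=step, action=0.0, reward=0.1, next_state=step + 1, done=False)
batch = buffer.sample()
```

Sampling is only meaningful once the buffer holds at least `batch_size` experiences, so a learning step would typically be skipped until then.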


In [None]:
scores_default = ddpg(
    n_episodes=1000, max_t=1000, print_every=50,
    agent=Agent(state_size=33, action_size=4, random_seed=42,
                actor_fc1_size=128, actor_fc2_size=128, actor_fc3_size=64,
                critic_fcs1_size=128, critic_fc2_size=128, critic_fc3_size=64,
                lr_actor=0.0001, lr_critic=0.0001, batch_size=128,
                buffer_size=int(1e5),  # integer size; 1e5 on its own is a float
                gamma=0.99, tau=0.001, weight_decay=0))

plt.figure(figsize=(15,8))
plt.plot(np.arange(1, len(scores_default)+1),np.squeeze(np.vstack(scores_default)), label='default')
plt.ylabel('Score')
plt.xlabel('Episode #')
plt.title('Test run scores')
plt.legend()
plt.show();

#### 2.2. Experiments - learning rate

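Since the baseline converges rather slowly, we try increasing the learning rate (from 0.0001 to 0.0002) to speed up the training of the networks. Note that this run also uses wider layers and a larger `max_t` than the baseline, so the learning rate is not the only change.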
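
Per-episode scores are noisy, so when comparing runs it can help to look at a trailing average over the same 100-episode window used by the solving criterion. The helper below is a plain-Python sketch; the name `moving_average` is hypothetical and it is not used elsewhere in this notebook.

```python
def moving_average(scores, window=100):
    """Trailing mean of `scores`; early entries average over the episodes seen so far."""
    averages = []
    running = 0.0
    for i, score in enumerate(scores):
        running += score
        if i >= window:
            running -= scores[i - window]  # drop the score that left the window
        averages.append(running / min(i + 1, window))
    return averages


# Illustrative input only
smoothed = moving_average([1, 2, 3, 4], window=2)
```

The result has the same length as the input, so it can be plotted against the same episode axis as the raw scores.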

In [None]:
scores_lr1 = ddpg(
    n_episodes=1000, max_t=3000, print_every=50,
    agent=Agent(state_size=33, action_size=4, random_seed=42,
                actor_fc1_size=512, actor_fc2_size=256, actor_fc3_size=128,
                critic_fcs1_size=512, critic_fc2_size=256, critic_fc3_size=128,
                lr_actor=0.0002, lr_critic=0.0002, batch_size=128,
                buffer_size=int(1e5),  # integer size; 1e5 on its own is a float
                gamma=0.99, tau=0.001, weight_decay=0))

plt.figure(figsize=(15,8))
# plot the run defined above (scores_lr1); the original referenced an undefined scores_lr2
plt.plot(np.arange(1, len(scores_lr1)+1), np.squeeze(np.vstack(scores_lr1)), label='lr=0.0002')
plt.ylabel('Score')
plt.xlabel('Episode #')
plt.title('Test run scores')
plt.legend()
plt.show();