# Breakout Atari

In this environment, the observation is an **RGB image** of the screen, represented as an array of shape **(210, 160, 3)**. Each action is repeated for *k* frames, where *k* is sampled uniformly from {2, 3, 4}.

This is explored further in the **Create & Explore the environment** section.
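
To make the frame-skip behaviour concrete, here is a minimal sketch in plain Python. It does not touch the emulator; `sample_frame_skip` and `repeat_action` are hypothetical helper names used only for illustration.

```python
import random

FRAME_SKIP_CHOICES = (2, 3, 4)  # the set {2, 3, 4} described above


def sample_frame_skip(rng):
    """Draw k uniformly from {2, 3, 4}."""
    return rng.choice(FRAME_SKIP_CHOICES)


def repeat_action(action, rng):
    """Return the per-frame actions that result from one agent decision."""
    k = sample_frame_skip(rng)
    return [action] * k


rng = random.Random(0)  # seeded only to make the sketch reproducible
frames = repeat_action(action=2, rng=rng)
print(len(frames), frames)
```

Because *k* varies from step to step, the same action sequence can lead to slightly different outcomes, which makes the environment stochastic from the agent's point of view.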


### Game rules

Breakout begins with eight rows of bricks, with each two rows a different color. The color order from the bottom up is **yellow**, **green**, **orange** and **red**.   

Using a single ball, the player must knock down as many bricks as possible by using the walls and/or the paddle below to ricochet the ball against the bricks and eliminate them. If the player’s paddle misses the ball’s rebound, he or she will lose a turn. The player has three turns to try to clear two screens of bricks. 

Color to points:
- Yellow bricks earn one point each 
- Green bricks earn three points 
- Orange bricks earn five points 
- The top-level red bricks score seven points each
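
The point table above can be written down as a small lookup. This is only a sketch of the rules as quoted here: the helper names are hypothetical, and `bricks_per_row=14` is an assumption taken from the arcade layout rather than from this notebook. The emulated Atari 2600 game may lay out and score bricks differently.

```python
# Points per brick color, as listed above.
POINTS_BY_COLOR = {"yellow": 1, "green": 3, "orange": 5, "red": 7}
ROWS_PER_COLOR = 2  # eight rows in total, two per color


def score_hits(colors):
    """Total points for a sequence of broken bricks."""
    return sum(POINTS_BY_COLOR[color] for color in colors)


def max_screen_score(bricks_per_row):
    """Points for clearing one full screen of bricks."""
    return ROWS_PER_COLOR * bricks_per_row * sum(POINTS_BY_COLOR.values())


print(score_hits(["yellow", "yellow", "green", "red"]))  # 1 + 1 + 3 + 7 = 12
print(max_screen_score(bricks_per_row=14))  # assumed row width: 2 * 14 * 16 = 448
```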

The paddle shrinks to one-half its size after the ball has broken through the red row and hit the upper wall. Ball speed increases at specific intervals: after four hits, after twelve hits, and after making contact with the orange and red rows.


<img src="https://raw.githubusercontent.com/Kyushik/Unity_ML_Agent/master/Images/Breakout.png" width="400px" height="500px" />


*Rules description taken from [Wikipedia](https://en.wikipedia.org/wiki/Breakout_%28video_game%29)*


### Check out the example below

In [19]:
from IPython.display import HTML

HTML("""
<div align="middle">
<video width="20%" controls>
      <source src="./res/original.mp4" type="video/mp4">
</video>
</div>""")

## 1. Install necessary packages/libs

In [None]:
!pip install 'stable-baselines3[extra]'
!pip install tensorboard==1.15.0
!python -m atari_py.import_roms rars

**NOTE**: The ROMs are pre-downloaded and included in the repository.

You can also download them yourself from [Atari ROMS](http://www.atarimania.com/rom_collection_archive_atari_2600_roms.html)

## 2. Imports

In [1]:
import gym 
import os

from stable_baselines3 import A2C
from stable_baselines3.common.vec_env import VecFrameStack
from stable_baselines3.common.evaluation import evaluate_policy
from stable_baselines3.common.env_util import make_atari_env

## 3. Create & Explore the environment

In [2]:
def make_environment(env_name):
    return gym.make(env_name)
    
env = make_environment("Breakout-v0")

In [None]:
print(f"Action space type -> {env.action_space}")
print(f"Action space sample -> {env.action_space.sample()}")

In [None]:
env.unwrapped.get_action_meanings()

We can see that the action can take 4 values, which correspond to the following actions:

- 0 -> No operation 
- 1 -> Fire
- 2 -> Move right 
- 3 -> Move left
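
For logging or debugging it can help to translate action ids into readable labels. The sketch below simply hard-codes the table above; `ACTION_LABELS` and `describe_action` are hypothetical names, not part of Gym.

```python
# Action id -> label, copied from the list above.
ACTION_LABELS = {
    0: "No operation",
    1: "Fire",
    2: "Move right",
    3: "Move left",
}


def describe_action(action):
    """Return a readable label for a Breakout action id."""
    if action not in ACTION_LABELS:
        raise ValueError(f"Unknown action id: {action}")
    return ACTION_LABELS[action]


for action_id in range(4):
    print(action_id, "->", describe_action(action_id))
```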


In [4]:
print(f"Observation space type -> {env.observation_space}")
print(f"Observation space sample -> {env.observation_space.sample()}")

Observation space type -> Box([[[0 0 0]
  [0 0 0]
  [0 0 0]
  ...
  [0 0 0]
  [0 0 0]
  [0 0 0]]

 [[0 0 0]
  [0 0 0]
  [0 0 0]
  ...
  [0 0 0]
  [0 0 0]
  [0 0 0]]

 [[0 0 0]
  [0 0 0]
  [0 0 0]
  ...
  [0 0 0]
  [0 0 0]
  [0 0 0]]

 ...

 [[0 0 0]
  [0 0 0]
  [0 0 0]
  ...
  [0 0 0]
  [0 0 0]
  [0 0 0]]

 [[0 0 0]
  [0 0 0]
  [0 0 0]
  ...
  [0 0 0]
  [0 0 0]
  [0 0 0]]

 [[0 0 0]
  [0 0 0]
  [0 0 0]
  ...
  [0 0 0]
  [0 0 0]
  [0 0 0]]], [[[255 255 255]
  [255 255 255]
  [255 255 255]
  ...
  [255 255 255]
  [255 255 255]
  [255 255 255]]

 [[255 255 255]
  [255 255 255]
  [255 255 255]
  ...
  [255 255 255]
  [255 255 255]
  [255 255 255]]

 [[255 255 255]
  [255 255 255]
  [255 255 255]
  ...
  [255 255 255]
  [255 255 255]
  [255 255 255]]

 ...

 [[255 255 255]
  [255 255 255]
  [255 255 255]
  ...
  [255 255 255]
  [255 255 255]
  [255 255 255]]

 [[255 255 255]
  [255 255 255]
  [255 255 255]
  ...
  [255 255 255]
  [255 255 255]
  [255 255 255]]

 [[255 255 255]
  [255 255 25

In [None]:
print(f"Observation space shape -> {env.observation_space.sample().shape}")

#### More information on all possible [space types](https://gym.openai.com/docs/#spaces)

<br><br>
Breakout **does not** have a reward limit as, for example, CartPole does. 

This means it can be trained indefinitely.

## 4. Random dummy agent

In [None]:
EPISODES = 10

In [None]:
for ep in range(0, EPISODES):
    state = env.reset()
    is_done = False
    score = 0 
    
    while not is_done:
        
        env.render()
        
        # take a sample from action space (random action)
        action = env.action_space.sample()
        state, reward, is_done, additional_info = env.step(action)
        score += reward
    print(f'Episode -> {ep+1} | Score -> {score}')
env.close()
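
The loop above needs the ROMs and a display. To check the score bookkeeping without them, here is a sketch that runs the same loop against a stand-in environment. `StubEnv` and `run_episode` are hypothetical and exist only for this illustration; the stub mimics the 4-tuple returned by `env.step` in this notebook and pays a fixed reward of 1.0 per step.

```python
import random


class StubEnv:
    """Hypothetical stand-in for the Gym environment (not part of gym)."""

    def __init__(self, episode_length=5):
        self.episode_length = episode_length
        self.steps = 0

    def reset(self):
        self.steps = 0
        return 0  # dummy observation

    def step(self, action):
        self.steps += 1
        is_done = self.steps >= self.episode_length
        return 0, 1.0, is_done, {}  # observation, reward, done, info


def run_episode(env, policy):
    """Play one episode and return the accumulated reward."""
    state = env.reset()
    is_done = False
    score = 0.0
    while not is_done:
        action = policy(state)
        state, reward, is_done, _ = env.step(action)
        score += reward
    return score


def random_policy(state):
    """Ignore the state and pick one of the 4 actions at random."""
    return random.randrange(4)


print(run_episode(StubEnv(episode_length=5), random_policy))  # 5 steps * 1.0 = 5.0
```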

## 5. Vectorise Environment

Vectorizing lets us train the model faster because the agent collects experience from multiple environments in parallel (4 in our case).

In [3]:
env = make_atari_env('Breakout-v0', n_envs=4, seed=0)

In [4]:
env = VecFrameStack(env, n_stack=4)

<img src="./res/vectorized_envs.png" width="200px" height="250px" />

*Call to **env.render()** after vectorization*
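
A single frame does not show which way the ball is moving, which is why the last 4 frames are stacked into one observation. The sketch below illustrates the idea with a `deque`; it is a simplified illustration, not the implementation of `VecFrameStack`. The `FrameStack` class and the use of `0` as an empty frame are assumptions made for this example.

```python
from collections import deque


class FrameStack:
    """Keep the most recent n_stack frames (illustrative only)."""

    def __init__(self, n_stack, empty_frame=0):
        self.n_stack = n_stack
        self.empty_frame = empty_frame
        self.frames = deque(maxlen=n_stack)

    def reset(self, first_frame):
        """Start an episode: pad with empty frames, newest frame last."""
        self.frames.clear()
        for _ in range(self.n_stack - 1):
            self.frames.append(self.empty_frame)
        self.frames.append(first_frame)
        return list(self.frames)

    def step(self, new_frame):
        """Push a new frame; the oldest one drops out automatically."""
        self.frames.append(new_frame)
        return list(self.frames)


stack = FrameStack(n_stack=4)
print(stack.reset("f0"))  # [0, 0, 0, 'f0']
print(stack.step("f1"))   # [0, 0, 'f0', 'f1']
```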

## 6. Train Model

#### There are different kinds of RL algorithms that can be used for training, each with its own tradeoffs


<img src="https://spinningup.openai.com/en/latest/_images/rl_algorithms_9_15.svg" width="800px" height="1000px" />

*Source: [OpenAI Spinning Up docs](https://spinningup.openai.com/en/latest/spinningup/rl_intro2.html)*


### How do we know which RL algorithm to use?
Some algorithms work **only** with certain types of action spaces.

<img src="./res/action_space_to_algo.png" width="400px" height="500px" />


<br><br>
From observations about the environment performed above, we can see that  
we have a **Discrete** action space, which means we can use **A2C** for the task at hand.
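
The selection step can also be expressed as a lookup. The table in the sketch below reflects my understanding of the Stable-Baselines3 documentation, so treat it as an assumption and check it against the figure above. `SUPPORTED_ACTION_SPACES` and `algorithms_for` are hypothetical names.

```python
# Assumed mapping: algorithm -> action space types it can handle.
SUPPORTED_ACTION_SPACES = {
    "A2C": {"Box", "Discrete", "MultiDiscrete", "MultiBinary"},
    "PPO": {"Box", "Discrete", "MultiDiscrete", "MultiBinary"},
    "DQN": {"Discrete"},
    "DDPG": {"Box"},
    "SAC": {"Box"},
    "TD3": {"Box"},
}


def algorithms_for(space_type):
    """List the algorithms that accept the given action space type."""
    return sorted(
        name
        for name, spaces in SUPPORTED_ACTION_SPACES.items()
        if space_type in spaces
    )


print(algorithms_for("Discrete"))  # ['A2C', 'DQN', 'PPO']
```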


### Understanding training metrics

<div><img src="./res/training_metrics_a2c.png" width="400px" height="500px" /></div>

*Training metric example*

##### Explaining most important metrics:

- *ep_len_mean* -> average episode length, in steps
- *ep_rew_mean* -> average reward per episode
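
Both metrics are running averages over recent episodes rather than over the whole run. A minimal sketch of that idea follows, assuming a fixed-size window of the most recent episodes. The `EpisodeStats` class is hypothetical, and the default window of 100 episodes is my assumption about the library's default.

```python
from collections import deque
from statistics import mean


class EpisodeStats:
    """Running averages over the most recent episodes (illustrative)."""

    def __init__(self, window=100):
        self.rewards = deque(maxlen=window)
        self.lengths = deque(maxlen=window)

    def record(self, reward, length):
        """Store the total reward and step count of a finished episode."""
        self.rewards.append(reward)
        self.lengths.append(length)

    def ep_rew_mean(self):
        return mean(self.rewards)

    def ep_len_mean(self):
        return mean(self.lengths)


stats = EpisodeStats(window=3)
for reward, length in [(1.0, 100), (2.0, 200), (3.0, 300), (4.0, 400)]:
    stats.record(reward, length)

# Only the last three episodes remain in the window.
print(stats.ep_rew_mean())  # (2.0 + 3.0 + 4.0) / 3 = 3.0
print(stats.ep_len_mean())  # (200 + 300 + 400) / 3 = 300
```

With a windowed average, the curves respond to recent progress and are not dragged down indefinitely by the poor episodes at the start of training.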


In [7]:
log_path = os.path.join('breakout_training', 'logs')

#### Add training reward threshold callback

In [5]:
# Let's go with 20 points
REWARD_THRESHOLD = 20

from stable_baselines3.common.callbacks import (
    EvalCallback,
    StopTrainingOnRewardThreshold
)

best_model_save_path = os.path.join('breakout_training', 'trained_models')

stop_callback = StopTrainingOnRewardThreshold(reward_threshold=REWARD_THRESHOLD, 
                                             verbose=1)
evaluation_callback = EvalCallback(env,
                                  callback_on_new_best=stop_callback,
                                  eval_freq=6000,
                                  best_model_save_path=best_model_save_path,
                                  verbose=1)
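
A note on how often evaluation runs: as far as I understand the callback, `eval_freq` counts calls to the vectorized environment's `step`, and each call advances all 4 environments at once. The arithmetic below is a sketch under that assumption. `N_STEPS = 5` is assumed to be the A2C default rollout length, since this notebook does not set it. The training log in this notebook is consistent with both numbers: 2000 timesteps after 100 iterations, and evaluations 24000 timesteps apart.

```python
N_ENVS = 4        # n_envs passed to make_atari_env
EVAL_FREQ = 6000  # eval_freq passed to EvalCallback
N_STEPS = 5       # assumption: default A2C rollout length per environment

# Each training iteration collects N_STEPS steps from every environment.
timesteps_per_iteration = N_ENVS * N_STEPS

# The callback fires once per vectorized step, i.e. once per N_ENVS timesteps.
timesteps_between_evals = N_ENVS * EVAL_FREQ


def iterations_to_timesteps(iterations):
    """Convert the logged `iterations` value into `total_timesteps`."""
    return iterations * timesteps_per_iteration


print(timesteps_per_iteration)       # 4 * 5 = 20
print(iterations_to_timesteps(100))  # 100 * 20 = 2000
print(timesteps_between_evals)       # 4 * 6000 = 24000
```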

In [8]:
# Use CnnPolicy since our observations are returned as images
model = A2C("CnnPolicy", env, verbose=1, tensorboard_log=log_path)

Using cpu device
Wrapping the env in a VecTransposeImage.


In [9]:
model.learn(total_timesteps=1000000, callback=evaluation_callback)

Logging to breakout_training/logs/A2C_3




------------------------------------
| rollout/              |          |
|    ep_len_mean        | 264      |
|    ep_rew_mean        | 1.26     |
| time/                 |          |
|    fps                | 128      |
|    iterations         | 100      |
|    time_elapsed       | 15       |
|    total_timesteps    | 2000     |
| train/                |          |
|    entropy_loss       | -1.39    |
|    explained_variance | 0.0141   |
|    learning_rate      | 0.0007   |
|    n_updates          | 99       |
|    policy_loss        | 0.0396   |
|    value_loss         | 0.107    |
------------------------------------
------------------------------------
| rollout/              |          |
|    ep_len_mean        | 276      |
|    ep_rew_mean        | 1.48     |
| time/                 |          |
|    fps                | 153      |
|    iterations         | 200      |
|    time_elapsed       | 26       |
|    total_timesteps    | 4000     |
| train/                |          |
|

------------------------------------
| rollout/              |          |
|    ep_len_mean        | 351      |
|    ep_rew_mean        | 3.01     |
| time/                 |          |
|    fps                | 174      |
|    iterations         | 1400     |
|    time_elapsed       | 160      |
|    total_timesteps    | 28000    |
| train/                |          |
|    entropy_loss       | -1.1     |
|    explained_variance | 0.742    |
|    learning_rate      | 0.0007   |
|    n_updates          | 1399     |
|    policy_loss        | -0.093   |
|    value_loss         | 0.0945   |
------------------------------------
------------------------------------
| rollout/              |          |
|    ep_len_mean        | 361      |
|    ep_rew_mean        | 3.25     |
| time/                 |          |
|    fps                | 175      |
|    iterations         | 1500     |
|    time_elapsed       | 171      |
|    total_timesteps    | 30000    |
| train/                |          |
|

------------------------------------
| rollout/              |          |
|    ep_len_mean        | 419      |
|    ep_rew_mean        | 4.65     |
| time/                 |          |
|    fps                | 174      |
|    iterations         | 2700     |
|    time_elapsed       | 309      |
|    total_timesteps    | 54000    |
| train/                |          |
|    entropy_loss       | -0.23    |
|    explained_variance | 0.635    |
|    learning_rate      | 0.0007   |
|    n_updates          | 2699     |
|    policy_loss        | 0.198    |
|    value_loss         | 0.331    |
------------------------------------
------------------------------------
| rollout/              |          |
|    ep_len_mean        | 430      |
|    ep_rew_mean        | 4.9      |
| time/                 |          |
|    fps                | 175      |
|    iterations         | 2800     |
|    time_elapsed       | 319      |
|    total_timesteps    | 56000    |
| train/                |          |
|

------------------------------------
| rollout/              |          |
|    ep_len_mean        | 450      |
|    ep_rew_mean        | 5.4      |
| time/                 |          |
|    fps                | 173      |
|    iterations         | 4000     |
|    time_elapsed       | 462      |
|    total_timesteps    | 80000    |
| train/                |          |
|    entropy_loss       | -0.185   |
|    explained_variance | 0.975    |
|    learning_rate      | 0.0007   |
|    n_updates          | 3999     |
|    policy_loss        | -0.00485 |
|    value_loss         | 0.0512   |
------------------------------------
------------------------------------
| rollout/              |          |
|    ep_len_mean        | 453      |
|    ep_rew_mean        | 5.53     |
| time/                 |          |
|    fps                | 172      |
|    iterations         | 4100     |
|    time_elapsed       | 475      |
|    total_timesteps    | 82000    |
| train/                |          |
|

------------------------------------
| rollout/              |          |
|    ep_len_mean        | 503      |
|    ep_rew_mean        | 6.29     |
| time/                 |          |
|    fps                | 166      |
|    iterations         | 5300     |
|    time_elapsed       | 635      |
|    total_timesteps    | 106000   |
| train/                |          |
|    entropy_loss       | -0.315   |
|    explained_variance | 0.848    |
|    learning_rate      | 0.0007   |
|    n_updates          | 5299     |
|    policy_loss        | 0.0257   |
|    value_loss         | 0.0764   |
------------------------------------
------------------------------------
| rollout/              |          |
|    ep_len_mean        | 505      |
|    ep_rew_mean        | 6.27     |
| time/                 |          |
|    fps                | 167      |
|    iterations         | 5400     |
|    time_elapsed       | 646      |
|    total_timesteps    | 108000   |
| train/                |          |
|

------------------------------------
| rollout/              |          |
|    ep_len_mean        | 489      |
|    ep_rew_mean        | 6.01     |
| time/                 |          |
|    fps                | 171      |
|    iterations         | 6600     |
|    time_elapsed       | 768      |
|    total_timesteps    | 132000   |
| train/                |          |
|    entropy_loss       | -0.323   |
|    explained_variance | 0.904    |
|    learning_rate      | 0.0007   |
|    n_updates          | 6599     |
|    policy_loss        | -0.0262  |
|    value_loss         | 0.0752   |
------------------------------------
------------------------------------
| rollout/              |          |
|    ep_len_mean        | 506      |
|    ep_rew_mean        | 6.29     |
| time/                 |          |
|    fps                | 172      |
|    iterations         | 6700     |
|    time_elapsed       | 778      |
|    total_timesteps    | 134000   |
| train/                |          |
|

------------------------------------
| rollout/              |          |
|    ep_len_mean        | 443      |
|    ep_rew_mean        | 4.96     |
| time/                 |          |
|    fps                | 173      |
|    iterations         | 7900     |
|    time_elapsed       | 912      |
|    total_timesteps    | 158000   |
| train/                |          |
|    entropy_loss       | -0.344   |
|    explained_variance | 0.939    |
|    learning_rate      | 0.0007   |
|    n_updates          | 7899     |
|    policy_loss        | -0.157   |
|    value_loss         | 0.11     |
------------------------------------
------------------------------------
| rollout/              |          |
|    ep_len_mean        | 454      |
|    ep_rew_mean        | 5.19     |
| time/                 |          |
|    fps                | 173      |
|    iterations         | 8000     |
|    time_elapsed       | 922      |
|    total_timesteps    | 160000   |
| train/                |          |
|

------------------------------------
| rollout/              |          |
|    ep_len_mean        | 598      |
|    ep_rew_mean        | 8.22     |
| time/                 |          |
|    fps                | 171      |
|    iterations         | 9200     |
|    time_elapsed       | 1074     |
|    total_timesteps    | 184000   |
| train/                |          |
|    entropy_loss       | -0.0859  |
|    explained_variance | 0.748    |
|    learning_rate      | 0.0007   |
|    n_updates          | 9199     |
|    policy_loss        | 0.029    |
|    value_loss         | 0.188    |
------------------------------------
-------------------------------------
| rollout/              |           |
|    ep_len_mean        | 592       |
|    ep_rew_mean        | 8.15      |
| time/                 |           |
|    fps                | 171       |
|    iterations         | 9300      |
|    time_elapsed       | 1085      |
|    total_timesteps    | 186000    |
| train/                |    

------------------------------------
| rollout/              |          |
|    ep_len_mean        | 577      |
|    ep_rew_mean        | 7.71     |
| time/                 |          |
|    fps                | 166      |
|    iterations         | 10500    |
|    time_elapsed       | 1263     |
|    total_timesteps    | 210000   |
| train/                |          |
|    entropy_loss       | -0.262   |
|    explained_variance | 0.528    |
|    learning_rate      | 0.0007   |
|    n_updates          | 10499    |
|    policy_loss        | 0.0764   |
|    value_loss         | 0.149    |
------------------------------------
------------------------------------
| rollout/              |          |
|    ep_len_mean        | 580      |
|    ep_rew_mean        | 7.71     |
| time/                 |          |
|    fps                | 166      |
|    iterations         | 10600    |
|    time_elapsed       | 1274     |
|    total_timesteps    | 212000   |
| train/                |          |
|

-------------------------------------
| rollout/              |           |
|    ep_len_mean        | 587       |
|    ep_rew_mean        | 8.04      |
| time/                 |           |
|    fps                | 168       |
|    iterations         | 11800     |
|    time_elapsed       | 1398      |
|    total_timesteps    | 236000    |
| train/                |           |
|    entropy_loss       | -0.537    |
|    explained_variance | 0.126     |
|    learning_rate      | 0.0007    |
|    n_updates          | 11799     |
|    policy_loss        | -0.000134 |
|    value_loss         | 0.103     |
-------------------------------------
------------------------------------
| rollout/              |          |
|    ep_len_mean        | 577      |
|    ep_rew_mean        | 7.79     |
| time/                 |          |
|    fps                | 168      |
|    iterations         | 11900    |
|    time_elapsed       | 1409     |
|    total_timesteps    | 238000   |
| train/             

------------------------------------
| rollout/              |          |
|    ep_len_mean        | 667      |
|    ep_rew_mean        | 9.81     |
| time/                 |          |
|    fps                | 168      |
|    iterations         | 13100    |
|    time_elapsed       | 1551     |
|    total_timesteps    | 262000   |
| train/                |          |
|    entropy_loss       | -0.18    |
|    explained_variance | 0.821    |
|    learning_rate      | 0.0007   |
|    n_updates          | 13099    |
|    policy_loss        | 0.112    |
|    value_loss         | 0.233    |
------------------------------------
Eval num_timesteps=264000, episode_reward=9.60 +/- 1.62
Episode length: 695.00 +/- 72.51
------------------------------------
| eval/                 |          |
|    mean_ep_length     | 695      |
|    mean_reward        | 9.6      |
| time/                 |          |
|    total_timesteps    | 264000   |
| train/                |          |
|    entropy_loss      

Eval num_timesteps=288000, episode_reward=11.20 +/- 4.45
Episode length: 714.80 +/- 119.03
-------------------------------------
| eval/                 |           |
|    mean_ep_length     | 715       |
|    mean_reward        | 11.2      |
| time/                 |           |
|    total_timesteps    | 288000    |
| train/                |           |
|    entropy_loss       | -0.426    |
|    explained_variance | 0.226     |
|    learning_rate      | 0.0007    |
|    n_updates          | 14399     |
|    policy_loss        | -0.000236 |
|    value_loss         | 0.144     |
-------------------------------------
---------------------------------
| rollout/           |          |
|    ep_len_mean     | 657      |
|    ep_rew_mean     | 9.67     |
| time/              |          |
|    fps             | 169      |
|    iterations      | 14400    |
|    time_elapsed    | 1699     |
|    total_timesteps | 288000   |
---------------------------------
------------------------------------


---------------------------------
| rollout/           |          |
|    ep_len_mean     | 690      |
|    ep_rew_mean     | 10.1     |
| time/              |          |
|    fps             | 170      |
|    iterations      | 15600    |
|    time_elapsed    | 1831     |
|    total_timesteps | 312000   |
---------------------------------
------------------------------------
| rollout/              |          |
|    ep_len_mean        | 687      |
|    ep_rew_mean        | 10.2     |
| time/                 |          |
|    fps                | 170      |
|    iterations         | 15700    |
|    time_elapsed       | 1842     |
|    total_timesteps    | 314000   |
| train/                |          |
|    entropy_loss       | -0.273   |
|    explained_variance | 0.736    |
|    learning_rate      | 0.0007   |
|    n_updates          | 15699    |
|    policy_loss        | 0.0215   |
|    value_loss         | 0.111    |
------------------------------------
-------------------------------

-------------------------------------
| rollout/              |           |
|    ep_len_mean        | 641       |
|    ep_rew_mean        | 9.17      |
| time/                 |           |
|    fps                | 171       |
|    iterations         | 16900     |
|    time_elapsed       | 1974      |
|    total_timesteps    | 338000    |
| train/                |           |
|    entropy_loss       | -0.241    |
|    explained_variance | 0.946     |
|    learning_rate      | 0.0007    |
|    n_updates          | 16899     |
|    policy_loss        | -0.000524 |
|    value_loss         | 0.0441    |
-------------------------------------
------------------------------------
| rollout/              |          |
|    ep_len_mean        | 654      |
|    ep_rew_mean        | 9.44     |
| time/                 |          |
|    fps                | 171      |
|    iterations         | 17000    |
|    time_elapsed       | 1984     |
|    total_timesteps    | 340000   |
| train/             

------------------------------------
| rollout/              |          |
|    ep_len_mean        | 722      |
|    ep_rew_mean        | 11.1     |
| time/                 |          |
|    fps                | 171      |
|    iterations         | 18200    |
|    time_elapsed       | 2126     |
|    total_timesteps    | 364000   |
| train/                |          |
|    entropy_loss       | -0.203   |
|    explained_variance | 0.544    |
|    learning_rate      | 0.0007   |
|    n_updates          | 18199    |
|    policy_loss        | 0.0638   |
|    value_loss         | 0.178    |
------------------------------------
------------------------------------
| rollout/              |          |
|    ep_len_mean        | 723      |
|    ep_rew_mean        | 11.1     |
| time/                 |          |
|    fps                | 171      |
|    iterations         | 18300    |
|    time_elapsed       | 2137     |
|    total_timesteps    | 366000   |
| train/                |          |
|

------------------------------------
| rollout/              |          |
|    ep_len_mean        | 707      |
|    ep_rew_mean        | 11.1     |
| time/                 |          |
|    fps                | 172      |
|    iterations         | 19500    |
|    time_elapsed       | 2264     |
|    total_timesteps    | 390000   |
| train/                |          |
|    entropy_loss       | -0.242   |
|    explained_variance | 0.0448   |
|    learning_rate      | 0.0007   |
|    n_updates          | 19499    |
|    policy_loss        | 0.0571   |
|    value_loss         | 0.338    |
------------------------------------
------------------------------------
| rollout/              |          |
|    ep_len_mean        | 708      |
|    ep_rew_mean        | 11.2     |
| time/                 |          |
|    fps                | 172      |
|    iterations         | 19600    |
|    time_elapsed       | 2275     |
|    total_timesteps    | 392000   |
| train/                |          |
|

-------------------------------------
| rollout/              |           |
|    ep_len_mean        | 733       |
|    ep_rew_mean        | 11.6      |
| time/                 |           |
|    fps                | 171       |
|    iterations         | 20800     |
|    time_elapsed       | 2430      |
|    total_timesteps    | 416000    |
| train/                |           |
|    entropy_loss       | -0.00502  |
|    explained_variance | 0.895     |
|    learning_rate      | 0.0007    |
|    n_updates          | 20799     |
|    policy_loss        | -0.000501 |
|    value_loss         | 0.142     |
-------------------------------------
------------------------------------
| rollout/              |          |
|    ep_len_mean        | 727      |
|    ep_rew_mean        | 11.3     |
| time/                 |          |
|    fps                | 171      |
|    iterations         | 20900    |
|    time_elapsed       | 2442     |
|    total_timesteps    | 418000   |
| train/             

------------------------------------
| rollout/              |          |
|    ep_len_mean        | 731      |
|    ep_rew_mean        | 11.6     |
| time/                 |          |
|    fps                | 171      |
|    iterations         | 22100    |
|    time_elapsed       | 2575     |
|    total_timesteps    | 442000   |
| train/                |          |
|    entropy_loss       | -0.128   |
|    explained_variance | 0.78     |
|    learning_rate      | 0.0007   |
|    n_updates          | 22099    |
|    policy_loss        | -0.00692 |
|    value_loss         | 0.0347   |
------------------------------------
------------------------------------
| rollout/              |          |
|    ep_len_mean        | 739      |
|    ep_rew_mean        | 11.7     |
| time/                 |          |
|    fps                | 171      |
|    iterations         | 22200    |
|    time_elapsed       | 2585     |
|    total_timesteps    | 444000   |
| train/                |          |
|

------------------------------------
| rollout/              |          |
|    ep_len_mean        | 761      |
|    ep_rew_mean        | 12.2     |
| time/                 |          |
|    fps                | 172      |
|    iterations         | 23400    |
|    time_elapsed       | 2713     |
|    total_timesteps    | 468000   |
| train/                |          |
|    entropy_loss       | -0.0686  |
|    explained_variance | 0.925    |
|    learning_rate      | 0.0007   |
|    n_updates          | 23399    |
|    policy_loss        | -0.00584 |
|    value_loss         | 0.244    |
------------------------------------
------------------------------------
| rollout/              |          |
|    ep_len_mean        | 759      |
|    ep_rew_mean        | 12.1     |
| time/                 |          |
|    fps                | 172      |
|    iterations         | 23500    |
|    time_elapsed       | 2723     |
|    total_timesteps    | 470000   |
| train/                |          |
|

------------------------------------
| rollout/              |          |
|    ep_len_mean        | 726      |
|    ep_rew_mean        | 11.4     |
| time/                 |          |
|    fps                | 173      |
|    iterations         | 24700    |
|    time_elapsed       | 2852     |
|    total_timesteps    | 494000   |
| train/                |          |
|    entropy_loss       | -0.225   |
|    explained_variance | 0.891    |
|    learning_rate      | 0.0007   |
|    n_updates          | 24699    |
|    policy_loss        | 0.0143   |
|    value_loss         | 0.0309   |
------------------------------------
------------------------------------
| rollout/              |          |
|    ep_len_mean        | 721      |
|    ep_rew_mean        | 11.3     |
| time/                 |          |
|    fps                | 173      |
|    iterations         | 24800    |
|    time_elapsed       | 2863     |
|    total_timesteps    | 496000   |
| train/                |          |
|

------------------------------------
| rollout/              |          |
|    ep_len_mean        | 811      |
|    ep_rew_mean        | 13.3     |
| time/                 |          |
|    fps                | 173      |
|    iterations         | 26000    |
|    time_elapsed       | 2990     |
|    total_timesteps    | 520000   |
| train/                |          |
|    entropy_loss       | -0.149   |
|    explained_variance | 0.841    |
|    learning_rate      | 0.0007   |
|    n_updates          | 25999    |
|    policy_loss        | 0.00327  |
|    value_loss         | 0.11     |
------------------------------------
-------------------------------------
| rollout/              |           |
|    ep_len_mean        | 815       |
|    ep_rew_mean        | 13.6      |
| time/                 |           |
|    fps                | 173       |
|    iterations         | 26100     |
|    time_elapsed       | 3001      |
|    total_timesteps    | 522000    |
| train/                |    

------------------------------------
| rollout/              |          |
|    ep_len_mean        | 755      |
|    ep_rew_mean        | 12       |
| time/                 |          |
|    fps                | 174      |
|    iterations         | 27300    |
|    time_elapsed       | 3135     |
|    total_timesteps    | 546000   |
| train/                |          |
|    entropy_loss       | -0.122   |
|    explained_variance | 0.947    |
|    learning_rate      | 0.0007   |
|    n_updates          | 27299    |
|    policy_loss        | 0.0255   |
|    value_loss         | 0.101    |
------------------------------------
------------------------------------
| rollout/              |          |
|    ep_len_mean        | 747      |
|    ep_rew_mean        | 11.8     |
| time/                 |          |
|    fps                | 174      |
|    iterations         | 27400    |
|    time_elapsed       | 3147     |
|    total_timesteps    | 548000   |
| train/                |          |
|

------------------------------------
| rollout/              |          |
|    ep_len_mean        | 743      |
|    ep_rew_mean        | 12.2     |
| time/                 |          |
|    fps                | 174      |
|    iterations         | 28600    |
|    time_elapsed       | 3279     |
|    total_timesteps    | 572000   |
| train/                |          |
|    entropy_loss       | -0.00744 |
|    explained_variance | 0.552    |
|    learning_rate      | 0.0007   |
|    n_updates          | 28599    |
|    policy_loss        | 6.99e-05 |
|    value_loss         | 0.155    |
------------------------------------
------------------------------------
| rollout/              |          |
|    ep_len_mean        | 766      |
|    ep_rew_mean        | 12.8     |
| time/                 |          |
|    fps                | 174      |
|    iterations         | 28700    |
|    time_elapsed       | 3290     |
|    total_timesteps    | 574000   |
| train/                |          |
|

------------------------------------
| rollout/              |          |
|    ep_len_mean        | 781      |
|    ep_rew_mean        | 12.7     |
| time/                 |          |
|    fps                | 174      |
|    iterations         | 29900    |
|    time_elapsed       | 3417     |
|    total_timesteps    | 598000   |
| train/                |          |
|    entropy_loss       | -0.247   |
|    explained_variance | 0.678    |
|    learning_rate      | 0.0007   |
|    n_updates          | 29899    |
|    policy_loss        | 0.0129   |
|    value_loss         | 0.0747   |
------------------------------------
Eval num_timesteps=600000, episode_reward=13.80 +/- 5.95
Episode length: 783.60 +/- 180.26
------------------------------------
| eval/                 |          |
|    mean_ep_length     | 784      |
|    mean_reward        | 13.8     |
| time/                 |          |
|    total_timesteps    | 600000   |
| train/                |          |
|    entropy_loss    

Eval num_timesteps=624000, episode_reward=11.40 +/- 3.56
Episode length: 754.80 +/- 192.14
------------------------------------
| eval/                 |          |
|    mean_ep_length     | 755      |
|    mean_reward        | 11.4     |
| time/                 |          |
|    total_timesteps    | 624000   |
| train/                |          |
|    entropy_loss       | -0.134   |
|    explained_variance | 0.753    |
|    learning_rate      | 0.0007   |
|    n_updates          | 31199    |
|    policy_loss        | 0.00438  |
|    value_loss         | 0.107    |
------------------------------------
---------------------------------
| rollout/           |          |
|    ep_len_mean     | 829      |
|    ep_rew_mean     | 14.3     |
| time/              |          |
|    fps             | 175      |
|    iterations      | 31200    |
|    time_elapsed    | 3560     |
|    total_timesteps | 624000   |
---------------------------------
------------------------------------
| rollout/    

---------------------------------
| rollout/           |          |
|    ep_len_mean     | 818      |
|    ep_rew_mean     | 13.7     |
| time/              |          |
|    fps             | 175      |
|    iterations      | 32400    |
|    time_elapsed    | 3685     |
|    total_timesteps | 648000   |
---------------------------------
------------------------------------
| rollout/              |          |
|    ep_len_mean        | 810      |
|    ep_rew_mean        | 13.5     |
| time/                 |          |
|    fps                | 175      |
|    iterations         | 32500    |
|    time_elapsed       | 3695     |
|    total_timesteps    | 650000   |
| train/                |          |
|    entropy_loss       | -0.153   |
|    explained_variance | 0.64     |
|    learning_rate      | 0.0007   |
|    n_updates          | 32499    |
|    policy_loss        | -0.071   |
|    value_loss         | 0.295    |
------------------------------------
-------------------------------

------------------------------------
| rollout/              |          |
|    ep_len_mean        | 816      |
|    ep_rew_mean        | 14.1     |
| time/                 |          |
|    fps                | 176      |
|    iterations         | 33700    |
|    time_elapsed       | 3823     |
|    total_timesteps    | 674000   |
| train/                |          |
|    entropy_loss       | -0.291   |
|    explained_variance | 0.89     |
|    learning_rate      | 0.0007   |
|    n_updates          | 33699    |
|    policy_loss        | 0.0403   |
|    value_loss         | 0.0855   |
------------------------------------
------------------------------------
| rollout/              |          |
|    ep_len_mean        | 822      |
|    ep_rew_mean        | 14.4     |
| time/                 |          |
|    fps                | 176      |
|    iterations         | 33800    |
|    time_elapsed       | 3833     |
|    total_timesteps    | 676000   |
| train/                |          |
|

------------------------------------
| rollout/              |          |
|    ep_len_mean        | 799      |
|    ep_rew_mean        | 14       |
| time/                 |          |
|    fps                | 175      |
|    iterations         | 35000    |
|    time_elapsed       | 3994     |
|    total_timesteps    | 700000   |
| train/                |          |
|    entropy_loss       | -0.107   |
|    explained_variance | 0.84     |
|    learning_rate      | 0.0007   |
|    n_updates          | 34999    |
|    policy_loss        | 0.0121   |
|    value_loss         | 0.177    |
------------------------------------
------------------------------------
| rollout/              |          |
|    ep_len_mean        | 810      |
|    ep_rew_mean        | 14.5     |
| time/                 |          |
|    fps                | 175      |
|    iterations         | 35100    |
|    time_elapsed       | 4005     |
|    total_timesteps    | 702000   |
| train/                |          |
|

------------------------------------
| rollout/              |          |
|    ep_len_mean        | 814      |
|    ep_rew_mean        | 14       |
| time/                 |          |
|    fps                | 174      |
|    iterations         | 36300    |
|    time_elapsed       | 4163     |
|    total_timesteps    | 726000   |
| train/                |          |
|    entropy_loss       | -0.147   |
|    explained_variance | 0.833    |
|    learning_rate      | 0.0007   |
|    n_updates          | 36299    |
|    policy_loss        | -0.231   |
|    value_loss         | 0.131    |
------------------------------------
------------------------------------
| rollout/              |          |
|    ep_len_mean        | 830      |
|    ep_rew_mean        | 14.5     |
| time/                 |          |
|    fps                | 174      |
|    iterations         | 36400    |
|    time_elapsed       | 4173     |
|    total_timesteps    | 728000   |
| train/                |          |
|

------------------------------------
| rollout/              |          |
|    ep_len_mean        | 816      |
|    ep_rew_mean        | 14       |
| time/                 |          |
|    fps                | 174      |
|    iterations         | 37600    |
|    time_elapsed       | 4307     |
|    total_timesteps    | 752000   |
| train/                |          |
|    entropy_loss       | -0.15    |
|    explained_variance | 0.806    |
|    learning_rate      | 0.0007   |
|    n_updates          | 37599    |
|    policy_loss        | -0.0256  |
|    value_loss         | 0.126    |
------------------------------------
------------------------------------
| rollout/              |          |
|    ep_len_mean        | 822      |
|    ep_rew_mean        | 14.1     |
| time/                 |          |
|    fps                | 174      |
|    iterations         | 37700    |
|    time_elapsed       | 4322     |
|    total_timesteps    | 754000   |
| train/                |          |
|

------------------------------------
| rollout/              |          |
|    ep_len_mean        | 851      |
|    ep_rew_mean        | 14.8     |
| time/                 |          |
|    fps                | 174      |
|    iterations         | 38900    |
|    time_elapsed       | 4457     |
|    total_timesteps    | 778000   |
| train/                |          |
|    entropy_loss       | -0.243   |
|    explained_variance | 0.824    |
|    learning_rate      | 0.0007   |
|    n_updates          | 38899    |
|    policy_loss        | -0.0161  |
|    value_loss         | 0.131    |
------------------------------------
------------------------------------
| rollout/              |          |
|    ep_len_mean        | 865      |
|    ep_rew_mean        | 15.3     |
| time/                 |          |
|    fps                | 174      |
|    iterations         | 39000    |
|    time_elapsed       | 4467     |
|    total_timesteps    | 780000   |
| train/                |          |
|

------------------------------------
| rollout/              |          |
|    ep_len_mean        | 852      |
|    ep_rew_mean        | 14.9     |
| time/                 |          |
|    fps                | 172      |
|    iterations         | 40200    |
|    time_elapsed       | 4650     |
|    total_timesteps    | 804000   |
| train/                |          |
|    entropy_loss       | -0.144   |
|    explained_variance | 0.919    |
|    learning_rate      | 0.0007   |
|    n_updates          | 40199    |
|    policy_loss        | -0.117   |
|    value_loss         | 0.0952   |
------------------------------------
------------------------------------
| rollout/              |          |
|    ep_len_mean        | 842      |
|    ep_rew_mean        | 14.6     |
| time/                 |          |
|    fps                | 172      |
|    iterations         | 40300    |
|    time_elapsed       | 4665     |
|    total_timesteps    | 806000   |
| train/                |          |
|

<stable_baselines3.a2c.a2c.A2C at 0x7fe5704ea450>
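The logger tables printed above can also be read programmatically, for example to plot `ep_rew_mean` against `total_timesteps` without TensorBoard. The following is a minimal sketch, assuming the tables keep the `| name | value |` layout shown in the output; `parse_log_table` is a hypothetical helper, not part of stable-baselines3, and the sample text is one table copied from the output above.

```python
import re

# One logger table copied from the training output above.
LOG_TEXT = """
------------------------------------
| rollout/              |          |
|    ep_len_mean        | 799      |
|    ep_rew_mean        | 14       |
| time/                 |          |
|    fps                | 175      |
|    iterations         | 35000    |
|    time_elapsed       | 3994     |
|    total_timesteps    | 700000   |
| train/                |          |
|    entropy_loss       | -0.107   |
|    explained_variance | 0.84     |
|    learning_rate      | 0.0007   |
|    n_updates          | 34999    |
|    policy_loss        | 0.0121   |
|    value_loss         | 0.177    |
------------------------------------
"""

# Matches metric rows such as "|    ep_rew_mean        | 14       |".
# Group rows ("| rollout/ | |") and separator lines do not match.
ROW = re.compile(r"^\|\s+(\w+)\s+\|\s+(-?\d+(?:\.\d+)?)\s+\|$")


def parse_log_table(text):
    """Hypothetical helper: return {metric_name: float} for one printed table."""
    metrics = {}
    for line in text.splitlines():
        match = ROW.match(line.strip())
        if match:
            metrics[match.group(1)] = float(match.group(2))
    return metrics


metrics = parse_log_table(LOG_TEXT)
print(metrics["total_timesteps"], metrics["ep_rew_mean"])
```

The pattern only handles plain decimal values like the ones in this output; values printed in scientific notation would need a wider pattern.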

## 7. Save or reload the trained model

In [10]:
save_trained_models_path = os.path.join('breakout_training', 'trained_models', 'A2C_model')

In [11]:
model.save(save_trained_models_path)

### Reload

In [None]:
del model

In [14]:
env = make_atari_env('Breakout-v0', n_envs=1, seed=0)
env = VecFrameStack(env, n_stack=4)
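A single frame does not show which way the ball is moving, which is why the environment is wrapped so that the model sees the last four frames together. The sketch below illustrates the idea with a plain `deque`; it is a toy illustration under that assumption, not the stable-baselines3 implementation, and the `FrameStack` class and the integer "frames" are hypothetical stand-ins.

```python
from collections import deque

N_STACK = 4  # same value as n_stack in the cell above


class FrameStack:
    """Toy illustration only; not the stable-baselines3 implementation."""

    def __init__(self, n_stack, empty_frame):
        # Start with n_stack copies of an empty frame, oldest first.
        self.frames = deque([empty_frame] * n_stack, maxlen=n_stack)

    def push(self, frame):
        # Appending to a full deque drops the oldest frame automatically.
        self.frames.append(frame)
        return list(self.frames)


stack = FrameStack(N_STACK, empty_frame=0)
for frame_id in range(1, 7):  # integers stand in for image frames
    stacked = stack.push(frame_id)
print(stacked)
```

Each call to `push` returns the four most recent frames, oldest first, which is the kind of observation the model receives at every step.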

In [None]:
model = A2C.load(save_trained_models_path, env)

## 8. Evaluate model performance

In [18]:
evaluate_policy(model, env, n_eval_episodes=10, render=False)

(21.0, 7.224956747275377)
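The tuple returned by `evaluate_policy` is the mean and the standard deviation of the total reward per episode. As a small sketch of how those two numbers come about, the block below computes them for a list of per-episode rewards; the reward values are made up for illustration and are not the rewards from the run above.

```python
import statistics

# Hypothetical per-episode rewards (made up for illustration).
episode_rewards = [10.0, 15.0, 20.0, 25.0, 30.0]

mean_reward = statistics.fmean(episode_rewards)
# Population standard deviation (divides by N), like NumPy's default np.std.
std_reward = statistics.pstdev(episode_rewards)

print((mean_reward, std_reward))
```

With only 10 evaluation episodes the standard deviation is large relative to the mean, so repeated evaluations can give noticeably different means.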

### Explore logs

In [12]:
!tensorboard --logdir={log_path}


NOTE: Using experimental fast data loading logic. To disable, pass
    "--load_fast=false" and report issues on GitHub. More details:
    https://github.com/tensorflow/tensorboard/issues/4784

Serving TensorBoard on localhost; to expose to the network, use a proxy or pass --bind_all
TensorBoard 2.7.0 at http://localhost:6006/ (Press CTRL+C to quit)
^C


### Results (for 1M steps)

<img src="./res/breakout_1mil_eval.png" width="500px" height="700px" />
<img src="./res/breakout_1mil_rollout.png" width="500px" height="700px" />

The reward target of **20** was reached.  
`evaluate_policy` returns a mean reward of **15-21**, with a standard deviation of about 7.

## 9. Test env with trained model

In [None]:
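obs = env.reset()
try:
    # Runs until interrupted (Kernel > Interrupt); the vectorized env
    # resets itself automatically when an episode ends.
    while True:
        action, _states = model.predict(obs)
        obs, rewards, dones, info = env.step(action)
        env.render()
except KeyboardInterrupt:
    pass
finally:
    env.close()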

## 10. Interesting facts

Training a model for **1 million timesteps** took about **1.5 hours**.  
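The training time can be cross-checked against the throughput reported in the training logs. The sketch below takes `total_timesteps` and `time_elapsed` from one of the last complete log tables in the training output and extrapolates to 1 million steps, assuming the throughput stays constant for the rest of the run.

```python
# Figures copied from one of the last complete log tables in the training output.
total_timesteps = 804_000
time_elapsed_s = 4_650

steps_per_second = total_timesteps / time_elapsed_s
# Assumption: throughput stays constant for the rest of the run.
hours_for_1m = 1_000_000 / steps_per_second / 3600

print(f"{steps_per_second:.0f} steps/s, about {hours_for_1m:.2f} h for 1M steps")
```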

Laptop specs:
<img src="./res/laptop_specs.png" width="500px" height="700px" />