# Homework 3
## Dates and directions
* **Due 06 March 2020**
* **Description** This notebook introduces `ray` and `rllib` and explores `Pong-v0` (an Atari game). The *goal* is to train policies for Pong using PPO and Q-learning and to explore how configuration choices affect performance (both of the agent and of training). Other notes and snippets will likely be added as time goes on.

In [8]:
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import gym
from gym import wrappers

import ray
from ray.rllib.agents.ppo import PPOTrainer, DEFAULT_CONFIG
from ray.tune.logger import pretty_print

# References
1. https://ray.readthedocs.io/en/latest/auto_examples/plot_pong_example.html
1. https://pytorch.org/tutorials/intermediate/reinforcement_q_learning.html
1. https://gist.github.com/karpathy/a4166c7fe253700972fcbc77e4ea32c5
1. https://kyso.io/eoin/openai-gym-jupyter?utm_campaign=News&utm_medium=Community&utm_source=DataCamp.com

# Additional configuration
1. Install ffmpeg with `apt install ffmpeg` *(hopefully unnecessary when using the cluster resource)*

In [13]:
!pip install gputil

Collecting gputil
  Downloading https://files.pythonhosted.org/packages/ed/0e/5c61eedde9f6c87713e89d794f01e378cfd9565847d4576fa627d758c554/GPUtil-1.4.0.tar.gz
Building wheels for collected packages: gputil
  Building wheel for gputil (setup.py) ... done
  Created wheel for gputil: filename=GPUtil-1.4.0-cp37-none-any.whl size=7414 sha256=6240ce9c7dace460717e86e6f970ae8a8a5d493d01b8819174ca860c1089fd95
  Stored in directory: /root/.cache/pip/wheels/3d/77/07/80562de4bb0786e5ea186911a2c831fdd0018bda69beab71fd
Successfully built gputil
Installing collected packages: gputil
Successfully installed gputil-1.4.0


In [4]:
!apt-get -y install ffmpeg

Reading package lists... Done
Building dependency tree       
Reading state information... Done
The following additional packages will be installed:
  dbus fontconfig fontconfig-config fonts-dejavu-core i965-va-driver
  intel-media-va-driver libaacs0 libaom0 libapparmor1 libasound2
  libasound2-data libass9 libasyncns0 libavc1394-0 libavcodec58 libavdevice58
  libavfilter7 libavformat58 libavresample4 libavutil56 libbdplus0 libbluray2
  libbs2b0 libcaca0 libcairo2 libcap2 libcdio-cdda2 libcdio-paranoia2
  libcdio18 libchromaprint1 libcodec2-0.8.1 libcroco3 libcrystalhd3 libdatrie1
  libdbus-1-3 libdc1394-22 libfftw3-double3 libflac8 libflite1 libfontconfig1
  libfreetype6 libfribidi0 libgdk-pixbuf2.0-0 libgdk-pixbuf2.0-bin
  libgdk-pixbuf2.0-common libgme0 libgraphite2-3 libgsm1 libharfbuzz0b
  libiec61883-0 libigdgmm5 libjack-jackd2-0 libjbig0 libjpeg62-turbo
  liblilv-0-0 libmp3lame0 libmpg123-0 libmysofa0 libnorm1 libogg0
  libopenal-data libopenal1 libopenjp2-7 libopenmpt0 libopus0

Get:43 http://deb.debian.org/debian buster/main amd64 libshine3 amd64 3.1.1-2 [23.6 kB]
Get:44 http://deb.debian.org/debian buster/main amd64 libsnappy1v5 amd64 1.1.7-1 [17.0 kB]
Get:45 http://deb.debian.org/debian buster/main amd64 libspeex1 amd64 1.2~rc1.2-1+b2 [54.8 kB]
Get:46 http://deb.debian.org/debian buster/main amd64 libsoxr0 amd64 0.1.2-3 [65.9 kB]
Get:47 http://deb.debian.org/debian buster/main amd64 libswresample3 amd64 7:4.1.4-1~deb10u1 [95.7 kB]
Get:48 http://deb.debian.org/debian buster/main amd64 libogg0 amd64 1.3.2-1+b1 [21.3 kB]
Get:49 http://deb.debian.org/debian buster/main amd64 libtheora0 amd64 1.1.1+dfsg.1-15 [169 kB]
Get:50 http://deb.debian.org/debian buster/main amd64 libtwolame0 amd64 0.3.13-4 [49.3 kB]
Get:51 http://deb.debian.org/debian buster/main amd64 libvorbis0a amd64 1.3.6-2 [93.5 kB]
Get:52 http://deb.debian.org/debian buster/main amd64 libvorbisenc2 amd64 1.3.6-2 [80.2 kB]
Get:53 http://deb.debian.org/debian buster/main amd64 libvpx5 amd64 1.7.0-3+de

Get:131 http://deb.debian.org/debian buster/main amd64 libaacs0 amd64 0.9.0-2 [52.4 kB]
Get:132 http://deb.debian.org/debian buster/main amd64 libbdplus0 amd64 0.1.2-3 [47.5 kB]
Get:133 http://deb.debian.org/debian buster/main amd64 libgdk-pixbuf2.0-bin amd64 2.38.1+dfsg-1 [24.1 kB]
Get:134 http://deb.debian.org/debian buster/main amd64 librsvg2-common amd64 2.44.10-2.1 [23.3 kB]
Get:135 http://deb.debian.org/debian buster/main amd64 libvdpau-va-gl1 amd64 0.4.2-1+b1 [71.3 kB]
Get:136 http://deb.debian.org/debian buster/main amd64 mesa-va-drivers amd64 18.3.6-2+deb10u1 [1972 kB]
Get:137 http://deb.debian.org/debian buster/main amd64 mesa-vdpau-drivers amd64 18.3.6-2+deb10u1 [2101 kB]
Get:138 http://deb.debian.org/debian buster/main amd64 va-driver-all amd64 2.4.0-1 [13.0 kB]
Get:139 http://deb.debian.org/debian buster/main amd64 vdpau-driver-all amd64 1.1.1-10 [20.8 kB]
Fetched 52.3 MB in 4s (14.3 MB/s)              
debconf: delaying package configuration, since apt-utils is not instal

Selecting previously unselected package libshine3:amd64.
Preparing to unpack .../042-libshine3_3.1.1-2_amd64.deb ...
Unpacking libshine3:amd64 (3.1.1-2) ...
Selecting previously unselected package libsnappy1v5:amd64.
Preparing to unpack .../043-libsnappy1v5_1.1.7-1_amd64.deb ...
Unpacking libsnappy1v5:amd64 (1.1.7-1) ...
Selecting previously unselected package libspeex1:amd64.
Preparing to unpack .../044-libspeex1_1.2~rc1.2-1+b2_amd64.deb ...
Unpacking libspeex1:amd64 (1.2~rc1.2-1+b2) ...
Selecting previously unselected package libsoxr0:amd64.
Preparing to unpack .../045-libsoxr0_0.1.2-3_amd64.deb ...
Unpacking libsoxr0:amd64 (0.1.2-3) ...
Selecting previously unselected package libswresample3:amd64.
Preparing to unpack .../046-libswresample3_7%3a4.1.4-1~deb10u1_amd64.deb ...
Unpacking libswresample3:amd64 (7:4.1.4-1~deb10u1) ...
Selecting previously unselected package libogg0:amd64.
Preparing to unpack .../047-libogg0_1.3.2-1+b1_amd64.deb ...
Unpacking libogg0:amd64 (1.3.2-1+b1) ...
S

Selecting previously unselected package libzmq5:amd64.
Preparing to unpack .../090-libzmq5_4.3.1-4+deb10u1_amd64.deb ...
Unpacking libzmq5:amd64 (4.3.1-4+deb10u1) ...
Selecting previously unselected package libavfilter7:amd64.
Preparing to unpack .../091-libavfilter7_7%3a4.1.4-1~deb10u1_amd64.deb ...
Unpacking libavfilter7:amd64 (7:4.1.4-1~deb10u1) ...
Selecting previously unselected package libslang2:amd64.
Preparing to unpack .../092-libslang2_2.3.2-2_amd64.deb ...
Unpacking libslang2:amd64 (2.3.2-2) ...
Selecting previously unselected package libcaca0:amd64.
Preparing to unpack .../093-libcaca0_0.99.beta19-2.1_amd64.deb ...
Unpacking libcaca0:amd64 (0.99.beta19-2.1) ...
Selecting previously unselected package libcdio18:amd64.
Preparing to unpack .../094-libcdio18_2.0.0-2_amd64.deb ...
Unpacking libcdio18:amd64 (2.0.0-2) ...
Selecting previously unselected package libcdio-cdda2:amd64.
Preparing to unpack .../095-libcdio-cdda2_10.2+0.94+2-4_amd64.deb ...
Unpacking libcdio-cdda2:amd64 

Unpacking va-driver-all:amd64 (2.4.0-1) ...
Selecting previously unselected package vdpau-driver-all:amd64.
Preparing to unpack .../138-vdpau-driver-all_1.1.1-10_amd64.deb ...
Unpacking vdpau-driver-all:amd64 (1.1.1-10) ...
Setting up libgme0:amd64 (0.6.2-1) ...
Setting up libssh-gcrypt-4:amd64 (0.8.7-1) ...
Setting up libgraphite2-3:amd64 (1.3.13-7) ...
Setting up libpixman-1-0:amd64 (0.36.0-1) ...
Setting up libigdgmm5:amd64 (18.4.1+ds1-1) ...
Setting up libraw1394-11:amd64 (2.1.2-1+b1) ...
Setting up libapparmor1:amd64 (2.13.2-10) ...
Setting up libsodium23:amd64 (1.0.17-1) ...
Setting up libmpg123-0:amd64 (1.25.10-2) ...
Setting up libxcb-xfixes0:amd64 (1.13.1-2) ...
Setting up libogg0:amd64 (1.3.2-1+b1) ...
Setting up libspeex1:amd64 (1.2~rc1.2-1+b2) ...
Setting up libshine3:amd64 (3.1.1-2) ...
Setting up libtwolame0:amd64 (0.3.13-4) ...
Setting up libdatrie1:amd64 (0.2.12-2) ...
Setting up libgsm1:amd64 (1.0.18-2) ...
Setting up libx264-155:amd64 (2:0.155.2917+git0a84d98-2) ...
S

# Potentially helpful snippets

In [5]:
env = gym.make('Pong-v0')
env = wrappers.Monitor(env, "./gym-results", force=True)
env.reset()

for _ in range(1000):
    action = env.action_space.sample()
    observation, reward, done, info = env.step(action)
    if done: break
env.close()

In [6]:
import io
import base64
from IPython.display import HTML

video_path = './gym-results/openaigym.video.%s.video000000.mp4' % env.file_infix
with io.open(video_path, 'rb') as f:  # read-only binary mode; the handle is closed on exit
    video = f.read()
encoded = base64.b64encode(video)
HTML(data='''
    <video width="360" height="auto" alt="test" controls><source src="data:video/mp4;base64,{0}" type="video/mp4" /></video>'''
.format(encoded.decode('ascii')))

# *Problem 1:* Training pong with PPO and DQN using RLlib and ray

## Resources
* ray: https://ray.readthedocs.io/en/latest/
* rllib: https://ray.readthedocs.io/en/latest/rllib.html
* ray.tune: https://ray.readthedocs.io/en/latest/tune.html
* tutorial notebooks on ray, rllib, and tune: https://github.com/ray-project/tutorial

### Generate a random baseline
1. Generate the average return of a random agent on pong by updating the parts marked with 
1. Record a video of random behavior 
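The baseline above can be sketched as a small helper that averages the undiscounted return of uniformly random actions over several episodes. This is only a sketch: the function name `average_random_return` and the default of 10 episodes are assumptions, and it relies on the same four-value `env.step` interface used in the snippet earlier in this notebook.

```python
def average_random_return(env, num_episodes=10):
    """Mean undiscounted episode return when acting uniformly at random.

    `env` is assumed to follow the Gym interface used earlier in this
    notebook: reset(), action_space.sample(), and step() returning
    (observation, reward, done, info).
    """
    returns = []
    for _ in range(num_episodes):
        env.reset()
        done = False
        total = 0.0
        while not done:
            action = env.action_space.sample()  # random agent
            _, reward, done, _ = env.step(action)
            total += reward
        returns.append(total)
    return sum(returns) / len(returns)
```

With the environment from the snippets section, a call would look like `average_random_return(gym.make('Pong-v0'))`. Wrapping the environment in `wrappers.Monitor` first, as shown earlier, should also record the video.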

### Train pong using PPO
We will essentially work through https://github.com/ray-project/tutorial/blob/master/rllib_exercises/rllib_exercise02_ppo.ipynb, but using the Pong environment.

1. Complete exercises in that notebook.
1. Generate training artifacts for PPO (a reward curve, short videos of example behavior throughout training, etc.).
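One way to gather data for the reward curve is to call the trainer's `train()` method repeatedly and record the mean episode reward from each result. The helper below is a sketch under assumptions: the name `collect_reward_curve` is hypothetical, and it assumes each result dictionary exposes an `episode_reward_mean` entry, as RLlib trainers of this era report.

```python
def collect_reward_curve(trainer, num_iterations=10):
    """Run `num_iterations` training iterations and return the mean
    episode reward reported after each one.

    `trainer` is assumed to be an RLlib trainer (e.g. a PPOTrainer)
    whose train() returns a dict containing 'episode_reward_mean'.
    """
    curve = []
    for _ in range(num_iterations):
        result = trainer.train()
        curve.append(result['episode_reward_mean'])
    return curve
```

The returned list can be plotted directly, or stored alongside periodic checkpoints so that videos of intermediate behavior can be generated later.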

### Train pong using DQN
Repeat the PPO steps above with DQN.
1. Complete the exercises in that notebook, substituting DQN for PPO.
1. Generate training artifacts for DQN (a reward curve, short videos of example behavior throughout training, etc.).

### Compare DQN and PPO performance
1. Comment on differences in intermediate and final performance.
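Raw reward curves in Pong tend to be noisy, so smoothing both curves with the same window can make the intermediate and final performance easier to compare. The following is a minimal sketch; the name `moving_average` and the window size are assumptions, not part of the assignment.

```python
def moving_average(values, window=10):
    """Trailing moving average; early entries average over the
    values seen so far, so the output has the same length as the input."""
    smoothed = []
    running = 0.0
    for i, v in enumerate(values):
        running += v
        if i >= window:
            running -= values[i - window]  # drop the value leaving the window
        smoothed.append(running / min(i + 1, window))
    return smoothed
```

Because the two algorithms may consume different numbers of environment steps per training iteration, it is probably fairer to plot the smoothed curves against total timesteps or wall-clock time than against iteration index.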

In [10]:
ray.init(num_cpus=3, ignore_reinit_error=True, log_to_driver=False)

2020-03-06 16:03:17,562	ERROR worker.py:681 -- Calling ray.init() again after it has already been called.


In [14]:
config = DEFAULT_CONFIG.copy()
config['num_workers'] = 1
config['num_sgd_iter'] = 30
config['sgd_minibatch_size'] = 128
config['model']['fcnet_hiddens'] = [100, 100]
config['num_cpus_per_worker'] = 0  # This avoids running out of resources in the notebook environment when this cell is re-executed

agent = PPOTrainer(config, 'CartPole-v0')

2020-03-06 16:07:27,149	ERROR syncer.py:39 -- Log sync requires rsync to be installed.
  obj = yaml.load(type_)


# *Problem 2:* Complete the following Q-learning code
Complete the following Q-learning code written in PyTorch:

https://pytorch.org/tutorials/intermediate/reinforcement_q_learning.html

to train an agent for `Pong-v0`. Currently it trains CartPole from video frames.
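As a reminder of what the Q-learning update regresses toward, the sketch below computes one-step bootstrapped targets, `r + gamma * max_a Q(s', a)`, with the bootstrap term dropped on terminal transitions. It is written in plain Python for clarity rather than as batched tensor code; the name `td_targets` and its argument layout are assumptions for illustration.

```python
def td_targets(rewards, next_q_values, dones, gamma=0.99):
    """One-step Q-learning targets for a batch of transitions.

    rewards:       list of floats, one per transition
    next_q_values: list of lists, Q(s', a) for every action a
    dones:         list of bools, True if s' is terminal
    """
    targets = []
    for reward, q_next, done in zip(rewards, next_q_values, dones):
        # No future value after a terminal state.
        bootstrap = 0.0 if done else max(q_next)
        targets.append(reward + gamma * bootstrap)
    return targets
```

For example, `td_targets([1.0, -1.0], [[0.5, 2.0], [3.0, 1.0]], [False, True], gamma=0.5)` gives `[2.0, -1.0]`: the first transition bootstraps from the best next action, while the second is terminal and keeps only its reward.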

# *Problem 3:* Complete the following policy gradient code
Compare the performance of Ray's PPO to the simple policy gradient method implemented in a single file: https://ray.readthedocs.io/en/latest/auto_examples/plot_pong_example.html
1. Can you achieve the same performance? (as measured by reward and runtime)
2. Bonus: can you use the same trick with the decorator `@ray.remote` to accelerate the DQN code in **Problem 2**?
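A central piece of the simple policy gradient method is turning per-step rewards into discounted returns. The sketch below follows the idea used in the Karpathy gist listed in the references, including the Pong-specific reset of the running sum whenever a point is scored. It is a plain-Python illustration rather than a copy of either implementation, and the default `gamma` is an assumption.

```python
def discount_rewards(rewards, gamma=0.99):
    """Discounted return at every step of an episode.

    Pong-specific assumption: a nonzero reward marks the end of a
    point, so the running sum is reset there and credit does not
    leak across points.
    """
    discounted = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        if rewards[t] != 0:
            running = 0.0  # point boundary: start a fresh sum
        running = running * gamma + rewards[t]
        discounted[t] = running
    return discounted
```

For example, `discount_rewards([0, 1, 0, -1], gamma=0.5)` gives `[0.5, 1.0, -0.5, -1.0]`: each point's reward is propagated back only to the steps that preceded it within that point.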