# M2177.003100 Deep Learning <br>Assignment #5 Part 2: Playing Atari games using an A3C agent

Copyright (C) Data Science Laboratory, Seoul National University. This material is for educational uses only. Some contents are based on the material provided by other paper/book authors and may be copyrighted by them. Written by Hyemi Jang, November 2018

Now you will build an agent that plays Atari games. Your agent is based on A3C.<br>
In this notebook, you will implement A3C so that the agent learns to play Atari games, following the reference paper [1]. <br>
Follow the instructions to implement the given classes.

There is a base implementation of A3C in TensorFlow at https://github.com/openai/universe-starter-agent. This IPython notebook is largely adapted from that repository.

That said, you are allowed to copy and paste code from the original repo. However, try to implement the model yourself first, and treat the original source code as a last resort. You will learn a lot by wrapping your head around the implementation, and you will understand the nuts and bolts of RNNs more clearly at the code level.

1. [Play](#Play) ( 50 points )

**Note**: certain details are missing or ambiguous on purpose, in order to test your knowledge of the related materials. However, if you really feel that something essential is missing and you cannot proceed to the next step, contact the teaching staff with a clear description of your problem.

### Submitting your work:
<font color=red>**DO NOT clear the final outputs**</font> so that TAs can grade both your code and results.  
Once you have completed **both parts of the assignment**, run the *CollectSubmission.sh* script with your **Team number** as the input argument. <br>
This will produce a zipped file called *[Your team number].tar.gz*. Please submit this file on ETL. &nbsp;&nbsp; (Usage: ./*CollectSubmission.sh* &nbsp; Team_#)

### Some helpful references for assignment #5:
- [1] Mnih, Volodymyr, et al. "Asynchronous methods for deep reinforcement learning." International conference on machine learning. 2016.
- [2] OpenAI GYM website [[link]](https://gym.openai.com/envs) and [[git]](https://github.com/openai/gym)

In [1]:
#import matplotlib.pyplot as plt
import argparse
import os
import sys; sys.argv=['']
#del sys
from six.moves import shlex_quote
import tensorflow as tf
import cv2 # uncomment to render game screens
import gym
import numpy as np

print(tf.__version__)
print(gym.spec("Pong-v0").tags.get('atari', False))
print(gym.spec("PongDeterministic-v3").tags.get('flashgames', False))
import distutils.version
use_tf12_api = distutils.version.LooseVersion(tf.VERSION) >= distutils.version.LooseVersion('0.12.0')
print(use_tf12_api)

1.12.0
False
False
True


## 1. Make command for calling workers

In [2]:
def new_cmd(session, name, cmd, logdir, shell):
    if isinstance(cmd, (list, tuple)):
        cmd = " ".join(shlex_quote(str(v)) for v in cmd)
    return name, "tmux send-keys -t {}:{} {} Enter".format(session, name, shlex_quote(cmd))
   

def create_commands(session, num_workers, remotes, env_id, logdir, test=False, shell='bash', visualise=False):
    # for launching the TF workers and for launching tensorboard
    base_cmd = [
        'CUDA_VISIBLE_DEVICES=',
        sys.executable, 'worker.py',
        '--log-dir', logdir,
        '--env-id', env_id,
        '--num-workers', str(num_workers)]

    if visualise:
        base_cmd += ['--visualise']
    
    if test:
        base_cmd += ['--test']
        
    if remotes is None:
        remotes = ["1"] * num_workers
    else:
        remotes = remotes.split(',')
        assert len(remotes) == num_workers

    cmds_map = [new_cmd(session, "ps", base_cmd + ["--job-name", "ps"], logdir, shell)]
    for i in range(num_workers):
        cmds_map += [new_cmd(session,
            "w-%d" % i, base_cmd + ["--job-name", "worker", "--task", str(i), "--remotes", remotes[i]], logdir, shell)]

    cmds_map += [new_cmd(session, "tb", ["tensorboard", "--logdir", logdir, "--port", "12345"], logdir, shell)]
    
    cmds_map += [new_cmd(session, "htop", ["htop"], logdir, shell)]

    windows = [v[0] for v in cmds_map]

    notes = []
    cmds = [
        "mkdir -p {}".format(logdir),
        "echo {} {} > {}/cmd.sh".format(sys.executable, ' '.join([shlex_quote(arg) for arg in sys.argv if arg != '-n']), logdir),
    ]
    
    notes += ["Use `tmux attach -t {}` to watch process output".format(session)]
    notes += ["Use `tmux kill-session -t {}` to kill the job".format(session)]
    notes += ["Point your browser to http://localhost:12345 to see Tensorboard"]

    cmds += [
        "kill $( lsof -i:12345 -t ) > /dev/null 2>&1",  # kill any process using tensorboard's port
        "kill $( lsof -i:12222-{} -t ) > /dev/null 2>&1".format(num_workers+12222),  # kill any processes using ps / worker ports
        "tmux kill-session -t {}".format(session),
        "tmux new-session -s {} -n {} -d {}".format(session, windows[0], shell)
    ]
    for w in windows[1:]:
        cmds += ["tmux new-window -t {} -n {} {}".format(session, w, shell)]
    cmds += ["sleep 1"]
    for window, cmd in cmds_map:
        cmds += [cmd]

    return cmds, notes
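
The two-level quoting in `new_cmd` can be easy to misread, so here is a minimal standalone sketch of it. It assumes Python 3, where `six.moves.shlex_quote` is the standard library's `shlex.quote`; the helper name `build_send_keys` and the sample log directory `./my logs` are hypothetical and chosen only to show what happens when an argument contains a space.

```python
import shlex


def build_send_keys(session, window, cmd):
    """Hypothetical stand-in for new_cmd: build one `tmux send-keys` line."""
    if isinstance(cmd, (list, tuple)):
        # First level: quote each argument so the worker command is shell-safe.
        cmd = " ".join(shlex.quote(str(v)) for v in cmd)
    # Second level: quote the whole command so tmux receives it as one argument.
    return "tmux send-keys -t {}:{} {} Enter".format(session, window, shlex.quote(cmd))


line = build_send_keys(
    "a3c", "w-0",
    ["python", "worker.py", "--log-dir", "./my logs", "--task", 0],
)
print(line)

# Undo both quoting levels to confirm the original arguments survive.
outer = shlex.split(line)
inner = shlex.split(outer[4])
print(inner)
```

Arguments made only of safe characters, such as `CUDA_VISIBLE_DEVICES=` or `./pong`, are left unquoted by `shlex.quote`, which is why the generated commands printed later in this notebook look mostly unquoted.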


## <a name="train"></a>2. Train your agent

Now, we train an agent to play Pong-v0. Pong is generally one of the easiest Atari games to learn.<br>
The maximum total reward that can be acquired in one episode is 21 (when your agent wins 21:0).<br>
You have to implement the files a3c.py, envs.py, model.py, and worker.py. <br>
Attach the training process to the report.
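
To make the reward range concrete, the sketch below sums per-point rewards for a few made-up episodes. It assumes the usual Gym Pong convention of +1 when the agent scores and -1 when the opponent scores, with the game ending once one side reaches 21 points; `episode_return` is a hypothetical helper, not part of the assignment code.

```python
def episode_return(point_rewards):
    """Total reward of one Pong episode, given one reward per point played.

    Assumption: +1 when the agent scores, -1 when the opponent scores.
    """
    return sum(point_rewards)


best = episode_return([+1] * 21)                 # agent wins 21:0
worst = episode_return([-1] * 21)                # agent loses 0:21
close = episode_return([+1] * 21 + [-1] * 19)    # agent wins 21:19
print(best, worst, close)
```

Under this assumption the total reward of an episode is the score difference, so it always lies between -21 and 21.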

In [3]:
parser = argparse.ArgumentParser(description="Run commands")
parser.add_argument('-w', '--num-workers', default=10, type=int,
                    help="Number of workers")
parser.add_argument('-r', '--remotes', default=None,
                    help='The address of pre-existing VNC servers and '
                         'rewarders to use (e.g. -r vnc://localhost:5900+15900,vnc://localhost:5901+15901).')
parser.add_argument('-e', '--env-id', type=str, default='Pong-v0',
                    help="Environment id")
parser.add_argument('-l', '--log-dir', type=str, default="./pong",
                    help="Log directory path")
parser.add_argument('--test', action='store_true')
# Add visualise flag
parser.add_argument('--visualise', action='store_true',
                    help="Visualise the gym environment by running env.render() between each timestep")

_StoreTrueAction(option_strings=['--visualise'], dest='visualise', nargs=0, const=True, default=False, type=None, choices=None, help='Visualise the gym environment by running env.render() between each timestep', metavar=None)

In [4]:
args = parser.parse_args()
print(args)
cmds, notes = create_commands("a3c", args.num_workers, args.remotes, args.env_id, args.log_dir, args.test, visualise=args.visualise)
#print(cmds)
#print(notes)
print("Executing the following commands:")
print("\n".join(cmds))
print("") 
os.environ["TMUX"] = ""
os.system("\n".join(cmds))
print('\n'.join(notes))

Namespace(env_id='Pong-v0', log_dir='./pong', num_workers=10, remotes=None, test=False, visualise=False)
Executing the following commands:
mkdir -p ./pong
echo /home/bivtt4/anaconda3/bin/python '' > ./pong/cmd.sh
kill $( lsof -i:12345 -t ) > /dev/null 2>&1
kill $( lsof -i:12222-12232 -t ) > /dev/null 2>&1
tmux kill-session -t a3c
tmux new-session -s a3c -n ps -d bash
tmux new-window -t a3c -n w-0 bash
tmux new-window -t a3c -n w-1 bash
tmux new-window -t a3c -n w-2 bash
tmux new-window -t a3c -n w-3 bash
tmux new-window -t a3c -n w-4 bash
tmux new-window -t a3c -n w-5 bash
tmux new-window -t a3c -n w-6 bash
tmux new-window -t a3c -n w-7 bash
tmux new-window -t a3c -n w-8 bash
tmux new-window -t a3c -n w-9 bash
tmux new-window -t a3c -n tb bash
tmux new-window -t a3c -n htop bash
sleep 1
tmux send-keys -t a3c:ps 'CUDA_VISIBLE_DEVICES= /home/bivtt4/anaconda3/bin/python worker.py --log-dir ./pong --env-id Pong-v0 --num-workers 10 --job-name ps' Enter
tmux send-keys -t a3c:w-0 'CUDA_VISIBL

## <a name="train"></a>3. Test your agent
Now, we test your agent and compute its average reward over 20 episodes.
- -21 <= average reward < -10 : you can get 0 points
- -10 <= average reward < 0 : you can get 10 points
- 0 <= average reward < 10 : you can get 20 points
- 10 <= average reward < 17 : you can get 30 points
- 17 <= average reward <= 21 : you can get 50 points
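
The thresholds above can be written as a small lookup, which may help when checking your own test logs. This is only a sketch: `points_for` is a hypothetical helper, and the episode rewards are made-up numbers rather than results from a real agent.

```python
def points_for(average_reward):
    """Map an average test reward to points using the thresholds listed above."""
    if not -21 <= average_reward <= 21:
        raise ValueError("average reward must lie in [-21, 21]")
    if average_reward < -10:
        return 0
    if average_reward < 0:
        return 10
    if average_reward < 10:
        return 20
    if average_reward < 17:
        return 30
    return 50


# Made-up rewards for 20 test episodes (not real results).
episode_rewards = [18, 21, 15, 19, 20] * 4
average = sum(episode_rewards) / len(episode_rewards)
print(average, points_for(average))
```

Each lower bound is inclusive, so an average of exactly 17 already falls in the top bracket.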

In [None]:
args = parser.parse_args()
cmds, notes = create_commands("a3c", 1, args.remotes, args.env_id, args.log_dir, True, visualise=args.visualise)

print("Executing the following commands:")
print("\n".join(cmds))
print("")
os.environ["TMUX"] = ""
os.system("\n".join(cmds))
print('\n'.join(notes))