### Deep reinforcement learning: training an agent to play Atari Pong using DQN
Reinforcement learning is a branch of machine learning in which an agent learns by interacting with an environment and receiving rewards; games are a popular testbed for it. Many theoretical models were proposed in the past, but recent advances in computational resources and deep learning techniques have made reinforcement learning practical to implement. Google's DeepMind demonstrated its potential by teaching an agent to play Atari games, and later went on to defeat the world's best Go players.
	
This project explores a deep reinforcement learning technique for training an agent to play the Atari Pong game from OpenAI Gym, a toolkit for developing and comparing reinforcement learning algorithms. The learning agent takes raw pixels from the Atari emulator and predicts an action, which is fed back into the emulator through the OpenAI Gym interface. The network used in this project is a Deep Q-Network (DQN); it took on the order of 10 million timesteps of training for the agent to win the game perfectly.
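To make the DQN idea concrete, here is a minimal sketch of the one-step Q-learning target that a DQN is trained to regress towards. It is not taken from `run_dqn_atari`; the function name `dqn_targets`, the discount factor and the sample values are assumptions chosen for illustration.

```python
import numpy as np

def dqn_targets(rewards, next_q_values, dones, gamma=0.99):
    """One-step Q-learning targets: r + gamma * max_a' Q_target(s', a').

    rewards:       shape (batch,)
    next_q_values: shape (batch, n_actions), Q-values of the next states
    dones:         shape (batch,), 1.0 where the episode ended at this step
    gamma:         discount factor (hypothetical value, not from the report)
    """
    max_next_q = next_q_values.max(axis=1)
    # Terminal transitions do not bootstrap from the next state.
    return rewards + gamma * (1.0 - dones) * max_next_q

# Tiny hand-made batch; the numbers are illustrative, not from the training run.
rewards = np.array([0.0, 1.0, -1.0])
next_q = np.array([[0.5, 0.2], [0.1, 0.3], [0.4, 0.9]])
dones = np.array([0.0, 0.0, 1.0])
targets = dqn_targets(rewards, next_q, dones, gamma=0.9)
```

The network's prediction for the action actually taken is then pushed towards `targets`, typically with a squared or Huber loss.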

In [1]:
import os
import itertools
import time
import glob
import json
import pandas as pd
import run_dqn_atari
from bokeh.plotting import figure 
from bokeh.models import Legend
from bokeh.layouts import column
from bokeh.io import output_notebook, show
output_notebook()

In [2]:
USE_ARCHIVED_LOGS = True
RUN_MODEL = False
ARCHIVAL_ROOT = 'archived_logs'

LOGS_DIR = 'open_ai_logs'
MODELS_DIR = 'trained_models'
TRAINING_LOGS = 'train_logs.log'

TRAINED_MODEL_NAMES = ['random', 'model.ckpt-5260000', 'model.ckpt-8090000', 'model.ckpt-9500000']
MAX_EPISODE_COUNT = 100 # number of episodes to run for the stats

In [3]:
def parse_traning_logs(log_path):
    with open(log_path) as logfile :
        raw_log_data = logfile.readlines()

    log_data = {}
    set_count = 0
    set_label = None
    start_parsing = False
    for log in raw_log_data:
        
        if not start_parsing:
            if 'Timestep 60000' in log: 
                start_parsing = True
            else:
                continue
                
        log  = log.rstrip()
        if log.startswith('Timestep'):
            set_count = 0
            set_label = int(log.replace('Timestep ',''))
            log_data[set_label]={}
            continue

        if log.startswith('mean reward (100 episodes)'):
            log_data[set_label]['Mean reward (Past 100 episodes)'] = float(
                log.replace('mean reward (100 episodes) ','')
            )

        if log.startswith('best mean reward'):
            log_data[set_label]['Best Mean reward (Past 100 episodes)'] = float(
                log.replace('best mean reward ','')
            )

        if log.startswith('episodes'):
            log_data[set_label]['Episodes'] = int(log.replace('episodes ',''))

        if log.startswith('exploration'):
            log_data[set_label]['Exploration'] = float(log.replace('exploration ',''))

        if log.startswith('learning_rate'):
            log_data[set_label]['Learning Rate'] = float(log.replace('learning_rate ',''))
            
    log_data= pd.DataFrame(log_data)

    return log_data.transpose()

def plot_training_stats(log_data):
    plot_priority_order = [
        'Mean reward (Past 100 episodes)',
        'Best Mean reward (Past 100 episodes)',
        'Exploration',
        'Learning Rate',
        'Episodes'    
    ]
    
    plots = []
    for key in plot_priority_order:

        plots.append(figure(plot_width=900, plot_height=300, y_axis_label=key, x_axis_label='Timesteps'))
        plots[-1].line(log_data[key].index, log_data[key], line_width=2)

    main_row = column(*plots)

    show(main_row)

### Training

For training I used an AWS g2.2xlarge instance with the AWS Ubuntu Deep Learning AMI (it took me a while to finish this report). During training, stdout is streamed into a file, and the contents of that file are used to analyze the training in the next cell. An explanation of these plots is given in the report.
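As a rough illustration of the log layout that `parse_traning_logs` expects, the sketch below parses a synthetic excerpt. The line prefixes are inferred from the parser in this notebook; the numeric values are made up, and `parse_block` is a hypothetical simplified helper, not part of the project.

```python
# Synthetic excerpt shaped like the lines the parser looks for.
# The numbers are invented for illustration only.
sample_lines = [
    "Timestep 60000",
    "mean reward (100 episodes) -20.350000",
    "best mean reward -20.100000",
    "episodes 70",
    "exploration 0.946000",
    "learning_rate 0.000100",
]

def parse_block(lines):
    """Collect 'key value' pairs that follow each 'Timestep N' header."""
    records = {}
    current = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # Split on the last space: everything before it is the key.
        key, _, value = line.rpartition(' ')
        if key == 'Timestep':
            current = int(value)
            records[current] = {}
        elif current is not None:
            records[current][key] = float(value)
    return records

records = parse_block(sample_lines)
```

Each `Timestep` header opens a new record, which is why the full parser can turn the result into a DataFrame indexed by timestep.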

In [4]:
training_log_data = parse_traning_logs(TRAINING_LOGS)
plot_training_stats(training_log_data)

In [5]:
def mkdir(directory):
    if not os.path.exists(directory):
        os.makedirs(directory)
        
def _prepare_figure(stats, colors, lables, x_axis_label, y_axis_label):

    fig = figure(
        plot_width=900,
        plot_height=300,
        y_axis_label=y_axis_label,
        x_axis_label=x_axis_label #,
        #y_range=(-26,30)
    )
    
    fig_items = []
    
    fig_items.append((
        lables[0],
        [
            fig.line(
                stats.index,
                stats,
                color=colors[0],
                line_width=1
            )
        ]
    ))
    
    fig_items.append((
        lables[1],
        [
            fig.line(
                stats.index,
                stats.rolling(window=100, min_periods=1).mean(),
                color=colors[1],
                line_width=1
            )
        ]
    ))
    
    fig_legend = Legend(items=fig_items, orientation="horizontal")
    fig.add_layout(fig_legend)
    
    return fig
    
    
def plot_agent_performance(stats_path):
    
    if USE_ARCHIVED_LOGS:
        stats_path = os.path.join(ARCHIVAL_ROOT, stats_path)
    stats_path = glob.glob(os.path.join(stats_path,'*','*.stats.json'))[0]
    print(stats_path)
    with open(stats_path, 'r') as stats_file:
        stats = json.load(stats_file)
    
    del stats['initial_reset_timestamp']
    del stats['episode_types']
    
    final_stats = pd.DataFrame(stats)
    
    colors = ['#44AA99', '#332288'] #, '#999933', '#CC6677', '#AA4499']
    lables = ['Reward', 'Rolling average of past 100 rewards']
    
    fig1 = _prepare_figure(
        final_stats['episode_rewards'],
        colors=colors, 
        lables=lables,
        x_axis_label='Episodes',
        y_axis_label='Reward'
    )
    
    lables = ['Episode lengths', 'Rolling average of past 100 episode lengths']
    
    fig2 = _prepare_figure(
        final_stats['episode_lengths'],
        colors=colors, 
        lables=lables,
        x_axis_label='Episodes',
        y_axis_label='Episode Lengths (units : steps)'
    )
    
    fig_group = column(fig1, fig2)
    show(fig_group)


### Performance of the trained agents
In this section we run the agents for 500 episodes to analyze their performance. The agents we compare in this section are:
- Random agent
- DQN trained for 5260000 timesteps
- DQN trained for 8090000 timesteps
- DQN trained for 9500000 timesteps

For each agent we compare the reward, the rolling average of the past 100 rewards, the episode length, and the rolling average of the past 100 episode lengths.


### Random agent


In [6]:
# run the random agent
log_path = os.path.join(LOGS_DIR, 'random')
mkdir(log_path)

if RUN_MODEL:
    # Get Atari games.
    benchmark = run_dqn_atari.gym.benchmark_spec('Atari40M')
    # selecting Atari pong environment
    task = benchmark.tasks[3]
    seed = 0 
    env = run_dqn_atari.get_env(task, seed, log_path)

    episode_count = 0
    env.reset()
    for t in itertools.count():
        # stop once MAX_EPISODE_COUNT episodes have finished
        if episode_count >= MAX_EPISODE_COUNT:
            break
        action = env.action_space.sample()
        obs, reward, done, _ = env.step(action)
        if done:
            obs = env.reset()
            episode_count += 1
        #env.render()

    env.close()

plot_agent_performance(log_path)

archived_logs/open_ai_logs/random/gym/openaigym.episode_batch.0.47430.stats.json




### DQN trained for 5260000 timesteps

In [7]:
model = 'model.ckpt-5260000'
log_path = os.path.join(LOGS_DIR, model)
mkdir(log_path)

if RUN_MODEL:
    run_dqn_atari.run_model(
            model_path=os.path.join(MODELS_DIR, model),
            log_path=log_path,
            max_episode_count=MAX_EPISODE_COUNT
        )

plot_agent_performance(log_path)

archived_logs/open_ai_logs/model.ckpt-5260000/gym/openaigym.episode_batch.0.46796.stats.json




### DQN trained for 8090000 timesteps

In [8]:
model = 'model.ckpt-8090000'
log_path = os.path.join(LOGS_DIR, model)
mkdir(log_path)

if RUN_MODEL:
    run_dqn_atari.run_model(
            model_path=os.path.join(MODELS_DIR, model),
            log_path=log_path,
            max_episode_count=MAX_EPISODE_COUNT
        )

plot_agent_performance(log_path)



archived_logs/open_ai_logs/model.ckpt-8090000/gym/openaigym.episode_batch.0.47468.stats.json


### DQN trained for 9500000 timesteps

In [9]:
model = 'model.ckpt-9500000'
log_path = os.path.join(LOGS_DIR, model)
mkdir(log_path)

if RUN_MODEL:
    run_dqn_atari.run_model(
            model_path=os.path.join(MODELS_DIR, model),
            log_path=log_path,
            max_episode_count=MAX_EPISODE_COUNT
        )

plot_agent_performance(log_path)

archived_logs/open_ai_logs/model.ckpt-9500000/gym/openaigym.episode_batch.0.47397.stats.json




In [10]:
def get_color():
    for c in ['#332288', '#44AA99', '#999933', '#CC6677', '#AA4499']:
        yield c

def get_lable():
    for l in ['Random Agent',
              'Agent after 5.26M steps of training',
              'Agent after 8.09M steps of training',
              'Agent after 9.50M steps of training']:
        yield l

def plot_performance_stats(stats_path):
    
    if USE_ARCHIVED_LOGS:
        stats_path = os.path.join(ARCHIVAL_ROOT, stats_path)
        
    stats_path = glob.glob(os.path.join(stats_path,'*','*.stats.json'))[0]
    
    with open(stats_path, 'r') as stats_file:
        stats = json.load(stats_file)
    del stats['initial_reset_timestamp']
    del stats['episode_types']

    stats = pd.DataFrame(stats)
    cur_lable = next(labels)
    cur_color = next(colors)
    
    rolling_mean_rewards = stats['episode_rewards'].rolling(window=100, min_periods=1).mean()
    
    final_stats['agent'].append(cur_lable)
    final_stats['mean_reward'].append(round(stats['episode_rewards'].mean(), 2))
    final_stats['rolling_mean_reward'].append(round(rolling_mean_rewards.max(), 2))
    final_stats['mean_episode_len'].append(round(stats['episode_lengths'].mean(), 2))
      
    fig_1_legend_items.append((
        cur_lable,
        [fig_1.line(
            stats['episode_rewards'].index,
            stats['episode_rewards'],
            color=cur_color,
            line_width=1
        )]
    ))
    fig_2_legend_items.append((
        cur_lable,
        [fig_2.line(
            stats['episode_rewards'].index,
            rolling_mean_rewards,
            color=cur_color,
            line_width=1
        )]
    ))
    
    fig_3_legend_items.append((
        cur_lable,
        [fig_3.line(
            stats['episode_lengths'].index,
            stats['episode_lengths'].rolling(window=100, min_periods=1).mean(),
            color=cur_color,
            line_width=1
        )]
    ))


### Final stats

In [11]:
final_stats = {
    'agent': [],
    'mean_reward': [],
    'rolling_mean_reward': [],
    'mean_episode_len':[]
}
fig_1 = figure(
    plot_width=900,
    plot_height=300,
    y_axis_label='Reward', 
    x_axis_label='Episodes', 
    y_range=(-26,30)
)
fig_2 = figure(
    plot_width=900,
    plot_height=300,
    y_axis_label='Rolling Mean of recent 100 Rewards',
    x_axis_label='Episodes',
    y_range=(-26,30)
)
fig_3 = figure(
    plot_width=900,
    plot_height=300,
    y_axis_label='Rolling Mean of recent 100 Episode lengths',
    x_axis_label='Episodes',
    y_range=(3000,21000)
)
fig_1_legend_items = []
fig_2_legend_items = []
fig_3_legend_items = []

colors = get_color()
labels = get_lable()

for model in TRAINED_MODEL_NAMES:
    plot_performance_stats(os.path.join(LOGS_DIR, model))

fig_1_legend = Legend(items=fig_1_legend_items, location=(0, 260), orientation="horizontal")
fig_2_legend = Legend(items=fig_2_legend_items, location=(0, 260), orientation="horizontal")
fig_3_legend = Legend(items=fig_3_legend_items, location=(0, 260), orientation="horizontal")

fig_1.add_layout(fig_1_legend)
fig_2.add_layout(fig_2_legend)
fig_3.add_layout(fig_3_legend)

main_fig = column(fig_1, fig_2, fig_3)
show(main_fig)



### Metrics

***Average reward:***
In Atari Pong, the total reward the agent receives over an episode lies between -21 and 21, depending on its performance. The most direct way to measure the agent's performance is to average the rewards it received across episodes. In the results we use 500 episodes to determine the average reward.

***Best rolling 100-episode average reward:***
This is similar to the average reward, but instead of taking the mean of all the rewards we take rolling means \cite{10} over a window of 100 episodes and use the maximum of those means as the measure.
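A minimal sketch of this metric with pandas, using `min_periods=1` as the plotting cells in this notebook do. The helper name `best_rolling_mean` and the sample rewards are illustrative assumptions, and the window is shortened to 3 so the arithmetic is easy to follow.

```python
import pandas as pd

def best_rolling_mean(rewards, window=100):
    """Maximum of the rolling mean of `rewards` over `window` episodes."""
    series = pd.Series(rewards, dtype=float)
    # min_periods=1 means the first few windows hold fewer than `window` episodes.
    return series.rolling(window=window, min_periods=1).mean().max()

# Illustrative episode rewards, not taken from the archived logs.
rewards = [-21, -19, -15, 3, 10, 21]
best = best_rolling_mean(rewards, window=3)
```

Because `min_periods=1` lets early windows contain very few episodes, a single high reward near the start could set the maximum; setting `min_periods` equal to `window` would avoid that at the cost of ignoring the first `window - 1` episodes.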

***Average episode length for winning episodes:***
Another way to measure the agent's performance is the average number of steps it takes to win an episode. A better agent wins a game in fewer steps; similarly, a worse agent loses a game in fewer steps.
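The summary table in this notebook averages episode length over all episodes (`mean_episode_len`). The sketch below shows one way to restrict the average to winning episodes, assuming a win corresponds to a positive total episode reward; the helper name and the sample data are hypothetical.

```python
import pandas as pd

def mean_winning_episode_length(episode_rewards, episode_lengths):
    """Mean episode length over episodes with a positive total reward."""
    stats = pd.DataFrame({'reward': episode_rewards, 'length': episode_lengths})
    # Assumption: a positive total reward means the agent won the game.
    wins = stats[stats['reward'] > 0]
    if wins.empty:
        return float('nan')  # the agent never won, so the metric is undefined
    return wins['length'].mean()

# Illustrative values, not taken from the archived logs.
rewards = [-21, 5, 12, -3]
lengths = [3000, 9000, 7000, 11000]
avg_len = mean_winning_episode_length(rewards, lengths)
```

Returning NaN when there are no wins keeps a random agent, which may never win, from being reported with a misleading length of zero.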

In [12]:
from bokeh.models import ColumnDataSource
from bokeh.layouts import widgetbox
from bokeh.models.widgets import DataTable, TableColumn

source = ColumnDataSource(data=final_stats)

columns = [
    TableColumn(field='agent', title='Agent Name',width=300),
    TableColumn(field='mean_reward', title='Average Reward', width=150),
    TableColumn(field='rolling_mean_reward', title='Best Rolling Average of recent 100 Rewards',width=300),
    TableColumn(field='mean_episode_len', title='Average Episode Length', width=200)
]

data_table = DataTable(source=source,width=900, height=130, columns=columns)

show(widgetbox(data_table))