# Reinforcement Learning (OpenAI Gym: LunarLander-v2)

- Member 1: Lim Hur (2112589)
- Member 2: Yee Hang (2112675)

# Defining Objectives

1. Develop and evaluate reinforcement learning algorithms that land an agent successfully in the Lunar Lander Gym environment
2. Investigate applications of reinforcement learning algorithms

## !! About notebook!!
1. This notebook evaluates our RL algorithms and tunes our final model


# Project Initialization Setup

In [None]:
!pip install Box2D
!pip install box2d
!pip install box2d-py
!pip install gym[all]
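# Originally `gym[Box_2D]2`, which pip rejects as an invalid requirement (see the error in the output below)
!pip install gym[box2d]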
!pip install gym
!pip install wandb tqdm tensorflow_addons

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting Box2D
  Downloading Box2D-2.3.10-cp38-cp38-manylinux1_x86_64.whl (1.3 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.3/1.3 MB 17.0 MB/s eta 0:00:00
Installing collected packages: Box2D
Successfully installed Box2D-2.3.10
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting box2d-py
  Downloading box2d-py-2.3.8.tar.gz (374 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 374.5/374.5 KB 7.2 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: box2d-py
  error: subprocess-exited-with-error
  
  × python setup.py bdist_wheel did not run succ

ERROR: Invalid requirement: 'gym[Box_2D]2'
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting wandb
  Downloading wandb-0.13.9-py2.py3-none-any.whl (2.0 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 2.0/2.0 MB 27.0 MB/s eta 0:00:00
Collecting tensorflow_addons
  Downloading tensorflow_addons-0.19.0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.1 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.1/1.1 MB 61.8 MB/s eta 0:00:00
Collecting setproctitle
  Downloading setproctitle-1.3.2-cp38-cp38-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (31 kB)
Collecting pathtools
  Downloading pathtools-0.1.2.tar.gz (11 kB)
  Preparing metadata (setup.py) ... done
Collectin

> ### Installing necessary dependencies for OpenAI Gym

- `wandb`: Explained later
- `tqdm`: Progress bar

### Install X11 system and other dependencies

- Install Xvfb (a virtual X11 display server) and the related rendering dependencies so that OpenAI Gym environments can render in Google Colab.

In [None]:
!apt-get install -y xvfb x11-utils
!pip install pyvirtualdisplay==0.2.* PyOpenGL==3.1.* PyOpenGL-accelerate==3.1.*

Reading package lists... Done
Building dependency tree       
Reading state information... Done
The following additional packages will be installed:
  libxxf86dga1
Suggested packages:
  mesa-utils
The following NEW packages will be installed:
  libxxf86dga1 x11-utils xvfb
0 upgraded, 3 newly installed, 0 to remove and 27 not upgraded.
Need to get 990 kB of archives.
After this operation, 3,058 kB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu focal/main amd64 libxxf86dga1 amd64 2:1.1.5-0ubuntu1 [12.0 kB]
Get:2 http://archive.ubuntu.com/ubuntu focal/main amd64 x11-utils amd64 7.7+5 [199 kB]
Get:3 http://archive.ubuntu.com/ubuntu focal-updates/universe amd64 xvfb amd64 2:1.20.13-1ubuntu1~20.04.5 [780 kB]
Fetched 990 kB in 1s (1,029 kB/s)
Selecting previously unselected package libxxf86dga1:amd64.
(Reading database ... 129499 files and directories currently installed.)
Preparing to unpack .../libxxf86dga1_2%3a1.1.5-0ubuntu1_amd64.deb ...
Unpacking libxxf86dg

### Create virtual display in background

- Create a new virtual display in the background that the environment can connect to for rendering.
- `echo $DISPLAY` to confirm that a background display is running

In [None]:
import pyvirtualdisplay
_display = pyvirtualdisplay.Display(visible=False,
                                    size=(1024, 768) )
_display.start()
!echo $DISPLAY

In [None]:
# from google.colab import drive
# drive.mount('/content/drive', force_remount=True)
# %cd /content/drive/MyDrive

## Imports

In [None]:
from collections import *
import numpy as np
import pandas as pd
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from matplotlib import pyplot as plt
from tqdm import tqdm
import seaborn as sns
import plotly.express as px
from copy import deepcopy
import os, time, math, datetime, warnings,glob,random,wandb,sys,functools, plotly
from IPython.display import display, HTML
from matplotlib import animation, rc


os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'  # Ignore warnings
%matplotlib inline

In [None]:
plotly.offline.init_notebook_mode()

In [None]:
import absl.logging
absl.logging.set_verbosity(absl.logging.ERROR)
# ignore warning
import logging
logging.getLogger('tensorflow').disabled = True

# Hide the GPU when not running on Colab, because the local GPU is not very good
if 'google.colab'  not in sys.modules:
    os.environ['CUDA_VISIBLE_DEVICES'] = '-1'


In [None]:
import tensorflow as tf
from tensorflow.keras.layers import Dense, Input, Flatten, LeakyReLU, ReLU, Conv2D
from tensorflow.keras.models import load_model, Model, model_from_json

from tensorflow.keras.optimizers import Adam
from tensorflow_addons.layers import NoisyDense
from tensorflow.keras import backend as K


# OpenAI Gym 

OpenAI Gym is a toolkit for building, evaluating and comparing RL algorithms. It is compatible with algorithms written in any framework, such as TensorFlow, and it is simple and easy to comprehend. It makes no assumptions about the structure of our agent and provides a common interface to all RL tasks.
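As a rough illustration of that common interface, the sketch below runs the usual reset/step loop against a tiny stand-in environment. `CoinFlipEnv` and `run_episode` are hypothetical names made up for this example and are not part of Gym. The stand-in assumes the classic interface where `reset()` returns an observation and `step()` returns `(observation, reward, done, info)`; newer Gym releases return extra values, so the unpacking would need adjusting there.

```python
import random


class CoinFlipEnv:
    """Hypothetical stand-in that mimics the classic Gym reset/step interface."""

    def __init__(self, max_steps=10, seed=0):
        self.max_steps = max_steps
        self.rng = random.Random(seed)
        self.steps = 0
        self.coin = 0

    def reset(self):
        # Start a new episode and return the first observation
        self.steps = 0
        self.coin = self.rng.randint(0, 1)
        return self.coin

    def step(self, action):
        # Reward +1 for matching the current coin, -1 otherwise
        reward = 1.0 if action == self.coin else -1.0
        self.steps += 1
        self.coin = self.rng.randint(0, 1)
        done = self.steps >= self.max_steps
        return self.coin, reward, done, {}


def run_episode(env, policy):
    """Roll out one episode; `policy` is any callable mapping observation -> action."""
    obs = env.reset()
    total, done = 0.0, False
    while not done:
        obs, reward, done, _info = env.step(policy(obs))
        total += reward
    return total
```

Because the loop only needs a callable policy, the same `run_episode` would work with a lookup table, a neural network, or a random agent.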

In [None]:
import gym
from gym import RewardWrapper, ObservationWrapper,Wrapper, logger
warnings.filterwarnings('ignore')

In [None]:
# random seed for reproducibility
seed = 42
tf.random.set_seed(seed)
np.random.seed(seed)
random.seed(seed)

## Create environment


In [None]:
env = gym.make("LunarLander-v2" )
env.reset(seed=seed)
NUM_ACTIONS = 4

## Helper functions for plotting

In [None]:
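def reward_plot(df):
    """Plot 'Score', 'Average Score' and 'Solved Requirement' against column 'x'."""
    fig = px.line(df, x='x', y=['Average Score', 'Solved Requirement'], markers=True, title='Score Analysis')
    # Draw the two reference lines (average and solved requirement) dashed
    fig.update_traces(patch={"line": {"width": 4, "dash": 'dash'}})
    # Add the raw per-episode score as a solid line, then set width 4 on every trace
    fig.add_traces(go.Scatter(x= df['x'], y=df['Score'], mode='markers+lines', name='Score')).update_traces(patch={"line": {"width": 4}})
    fig.update_layout(legend_title="Legend")
    return fig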
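A minimal sketch of how the columns that `reward_plot` reads could be built from a list of episode scores. The helper name `build_score_columns` is hypothetical. The 100-episode window is an assumption taken from the `Avg-Reward-100e` metric name, and the threshold of 200 is the commonly quoted "solved" score for LunarLander-v2 rather than a value stated in this notebook.

```python
def build_score_columns(scores, window=100, solved_at=200.0):
    """Return the columns reward_plot reads, as a dict of equal-length lists."""
    averages = []
    for i in range(len(scores)):
        # Mean of the most recent `window` scores (fewer at the start)
        recent = scores[max(0, i - window + 1): i + 1]
        averages.append(sum(recent) / len(recent))
    return {
        "x": list(range(len(scores))),
        "Score": list(scores),
        "Average Score": averages,
        "Solved Requirement": [solved_at] * len(scores),
    }
```

With pandas available, `reward_plot(pd.DataFrame(build_score_columns(scores)))` would then produce the figure.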

## Hyperparameter tuning
1. For tuning, we use wandb's sweep function with random search to look for better hyperparameters for our noisy net model and see if we can further improve our average score.
2. We also stop a run early: if the algorithm has trained for more than 50 episodes and its average reward is below -200, training is halted. This saves time, since we know those hyperparameters are not useful.
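The early-stopping rule in point 2 can be sketched as a small check. This is only an illustration: the real check lives inside the training code in `utils`, which is not shown here, and the function name `should_halt` and the 100-episode averaging window are assumptions.

```python
def should_halt(scores, min_episodes=50, threshold=-200.0, window=100):
    """Return True when a run should stop early under the rule described above."""
    # Never stop before more than `min_episodes` episodes have been played
    if len(scores) <= min_episodes:
        return False
    # Average over the most recent `window` episodes (assumed window size)
    recent = scores[-window:]
    return sum(recent) / len(recent) < threshold
```

Calling this once per episode with the list of scores so far would end unpromising runs without waiting for all 700 episodes.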


Hyperparameters tuned: learning rate (min = 0.0001, max = 0.01) and target network update interval (5, 10, 20, 30, 50, 100, 130).
We wanted to investigate whether updating the target network more or less frequently would improve performance. The learning rate determines how large a step we take toward the optimal policy; in simple Q-learning terms, it is how much the Q-value is updated at each step.
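To make the role of the two tuned hyperparameters concrete, here is a minimal sketch in plain Python. `q_update` is the textbook tabular Q-learning step, not the network update used by our model, and `should_sync_target` is a hypothetical helper; whether the real interval counts episodes or training steps depends on the code in `utils`, which is not shown.

```python
def q_update(q_sa, reward, max_q_next, lr, gamma=0.99, done=False):
    """One tabular Q-learning step: move Q(s, a) a fraction `lr` toward the TD target."""
    target = reward if done else reward + gamma * max_q_next
    return q_sa + lr * (target - q_sa)


def should_sync_target(step, interval):
    """True on the steps where a target network would copy the online weights."""
    return step > 0 and step % interval == 0
```

A larger `lr` moves the estimate further toward each new target, while a larger `interval` keeps the targets fixed for longer between synchronisations.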


In [None]:
from utils import NoisyNet
from utilsdqn import Agent as DQNAgent

In [None]:
sweep_config = {
    "method": "random",
    "metric" : {
        "name" : "Avg-Reward-100e",
        "goal" : "maximize",
        "target" : 300
        }
    }
parameters_tuning = {
    "lr" : {
        "min" : 0.0001,
        "max" : 0.01,
        "distribution" : "uniform"
    },
    "gamma" : {
        "value" : 0.99
    },
    "epsilon" : {
        "value" : 1.0
    },
    "update_target_net_interval" : {
        "values" : [5, 10, 20, 30, 50, 100, 130]
    },
    "episodes" : {
        "value" : 700
    }
}
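For intuition, random search over a parameter dictionary of this shape can be approximated as below. This is a sketch only, not wandb's implementation: `sample_config` and `example_parameters` are hypothetical names, and the sketch handles just the three kinds of entry used above (`value`, `values`, and a uniform `min`/`max` range).

```python
import random

# A reduced copy of the search space, for illustration only
example_parameters = {
    "lr": {"min": 0.0001, "max": 0.01, "distribution": "uniform"},
    "gamma": {"value": 0.99},
    "update_target_net_interval": {"values": [5, 10, 20, 30, 50, 100, 130]},
}


def sample_config(parameters, rng):
    """Draw one configuration from a sweep-style parameter dict."""
    config = {}
    for name, spec in parameters.items():
        if "value" in spec:
            # Fixed constant, identical in every run
            config[name] = spec["value"]
        elif "values" in spec:
            # Categorical choice, each option equally likely
            config[name] = rng.choice(spec["values"])
        else:
            # Continuous range, sampled uniformly between min and max
            config[name] = rng.uniform(spec["min"], spec["max"])
    return config
```

Each call such as `sample_config(example_parameters, random.Random())` corresponds roughly to one run's configuration in the sweep.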


# D3QN tuning

In [None]:
sweep_config["parameters"] = parameters_tuning
project_name = "D3QN-TUNING"
sweep_id = wandb.sweep(sweep_config, project=project_name, entity= 'lhurr')

Create sweep with ID: tvdet1ks
Sweep URL: https://wandb.ai/lhurr/D3QN-TUNING/sweeps/tvdet1ks


## Tuning
- Due to limited computational power and the Colab runtime limit, we were unable to complete the sweep, which takes many hours to run.
- Scrolling through the runs, we see that the tuning found configurations with average scores above 220; however, none of them outperforms our current noisy net.
- As such, we move on to evaluation.

In [None]:
def train(config=None):
    with wandb.init(config=config):
        config = wandb.config
        env = gym.make("LunarLander-v2")
        model = NoisyNet(config.lr, config.gamma, 4, config.epsilon, 64, [8], config.update_target_net_interval)
        _ = model.train_model(env,config.episodes, graph=False)

In [None]:
wandb.agent(sweep_id, train, count=50, entity="lhurr", project=project_name)

wandb: Agent Starting Run: 89pyswg0 with config:
wandb: 	episodes: 700
wandb: 	epsilon: 1
wandb: 	gamma: 0.99
wandb: 	lr: 0.0017500570707847662
wandb: 	update_target_net_interval: 5
wandb: Currently logged in as: lhurr. Use `wandb login --relogin` to force relogin


Episode 0/700, Score: -311.2242576409687 (1), AVG Score: -311.2242576409687
Episode 1/700, Score: -490.10918183345285 (1), AVG Score: -400.66671973721077
Episode 2/700, Score: -273.59185386561336 (1), AVG Score: -358.30843111334497
Episode 3/700, Score: 21.53222013341562 (1), AVG Score: -263.3482683016548
Episode 4/700, Score: -151.90340908764279 (1), AVG Score: -241.0592964588524
Episode 5/700, Score: -123.93490483198858 (1), AVG Score: -221.53856452104176
Episode 6/700, Score: -213.73469614541943 (1), AVG Score: -220.42372618166715
Episode 7/700, Score: -102.70367932254912 (1), AVG Score: -205.70872032427738
Episode 8/700, Score: -95.2705832343059 (1), AVG Score: -193.43781620316943
Episode 9/700, Score: -108.46850948106693 (1), AVG Score: -184.94088553095918
Episode 10/700, Score: -152.0058851215494 (1), AVG Score: -181.9467945846492
Episode 11/700, Score: -108.10219317165296 (1), AVG Score: -175.79307780023285
Episode 12/700, Score: -174.73481310150547 (1), AVG Score: -175.71167282

Run history:

| Metric | History |
| --- | --- |
| Avg-Reward-100e | ▁▂▂▂▂▂▃▃▃▃▃▃▃▄▄▅▆▆▆▆▇▇▇▇▇▇▇▇███████▇▆▆▄▁ |
| Episode | ▁▁▁▁▂▂▂▂▂▃▃▃▃▃▃▄▄▄▄▄▅▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇▇███ |
| Reward | ▅▅▅▅▅▆▅▆▅▆▆▅▆██▇▇▇███▅█▆████████▇██▅█▂▁▃ |

Run summary:

| Metric | Final value |
| --- | --- |
| Avg-Reward-100e | -260.97585 |
| Episode | 699.0 |
| Reward | -725.66899 |


wandb: Agent Starting Run: zwvedbo3 with config:
wandb: 	episodes: 700
wandb: 	epsilon: 1
wandb: 	gamma: 0.99
wandb: 	lr: 0.007107885028230527
wandb: 	update_target_net_interval: 50


Episode 0/700, Score: -556.2768432193695 (1), AVG Score: -556.2768432193695
Episode 1/700, Score: -609.7870606482279 (1), AVG Score: -583.0319519337987
Episode 2/700, Score: -548.9136973639638 (1), AVG Score: -571.6592004105204
Episode 3/700, Score: -154.46974017227154 (1), AVG Score: -467.3618353509582
Episode 4/700, Score: -126.67725326437083 (1), AVG Score: -399.2249189336407
Episode 5/700, Score: -109.96013524883524 (1), AVG Score: -351.0141216528398
Episode 6/700, Score: -198.9430079706303 (1), AVG Score: -329.28967684109557
Episode 7/700, Score: -205.6876069029792 (1), AVG Score: -313.83941809883106
Episode 8/700, Score: -187.96161267812073 (1), AVG Score: -299.85299527430766
Episode 9/700, Score: -42.516691698152115 (1), AVG Score: -274.1193649166921
Episode 10/700, Score: -130.67587199522023 (1), AVG Score: -261.0790473783765
Episode 11/700, Score: -54.92498393891387 (1), AVG Score: -243.89954209175463
Episode 12/700, Score: -221.99290482386812 (1), AVG Score: -242.214416148071

- We were unable to complete the tuning with the limited resources available (over 10 hours of runtime).
- We also did not find a hyperparameter configuration that improves on our current model. We now move on to evaluation.

In [1]:
!jupyter nbconvert --to html "6_Tuning.ipynb"

[NbConvertApp] Converting notebook 6_Tuning.ipynb to html
  warn("Your element with mimetype(s) {mimetypes}"
[NbConvertApp] Writing 699373 bytes to 6_Tuning.html
