1. Creating environments with OpenAI Gym
2. Building a deep learning model with tf.keras
3. Training with keras-rl agents

From: Deep Reinforcement Learning Tutorial for Python in 20 Minutes

Key concepts:

1. Action
2. Reward
3. Environment
4. Agent
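To make these four terms concrete, here is a minimal sketch of the agent-environment loop in plain Python. `ToyEnv` is a hypothetical stand-in, not part of Gym: the agent picks an action, the environment returns the next state and a reward, and the episode ends when the environment reports that it is done. It borrows CartPole's "+1 reward per surviving step" idea.

```python
import random


class ToyEnv:
    """Hypothetical 1-D environment (not from Gym): the state is a position on a line.

    Action 0 moves left, action 1 moves right. The agent earns +1 reward for
    every step, and the episode ends once the position leaves [-3, 3] or
    after max_steps steps.
    """

    def __init__(self, max_steps=20):
        self.max_steps = max_steps

    def reset(self):
        self.position = 0
        self.steps = 0
        return self.position

    def step(self, action):
        self.position += 1 if action == 1 else -1
        self.steps += 1
        done = abs(self.position) > 3 or self.steps >= self.max_steps
        reward = 1.0
        return self.position, reward, done, {}


def run_episode(env, policy):
    """Agent-environment loop: observe the state, act, collect the reward."""
    state = env.reset()
    done = False
    score = 0.0
    while not done:
        action = policy(state)                     # the agent chooses an action
        state, reward, done, _ = env.step(action)  # the environment responds
        score += reward                            # accumulate the reward
    return score


def random_policy(state):
    # Ignores the state and picks left or right at random
    return random.choice([0, 1])


print(run_episode(ToyEnv(), random_policy))
```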

## Build a DRL Agent

1. Install Dependencies
2. Test a Random Environment with OpenAI Gym
3. Create a Deep Learning Model with Keras
4. Build Agent with Keras-RL
5. Reloading Agent from Memory

**Question**

1. What is the difference between DRL and RL?
2. What is the difference between install and import?
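On the second question, a short sketch using only the standard library: `pip install` puts a package on disk once per Python environment, while `import` loads an already-installed package into the running session and has to be repeated in every new session or kernel. `importlib.util.find_spec` checks whether a package is installed without importing it. The install name can also differ from the import name: this notebook installs `keras-rl2` but imports `rl`. The helper names `is_installed` and `load` are illustrative.

```python
import importlib
import importlib.util


def is_installed(name):
    """Return True if a top-level package can be found on sys.path (does not import it)."""
    return importlib.util.find_spec(name) is not None


def load(name):
    """Import a package by name; raises ImportError if it is not installed."""
    return importlib.import_module(name)


for pkg in ["json", "gym", "rl"]:
    print(pkg, "installed" if is_installed(pkg) else "not installed")

json_module = load("json")
print(json_module.dumps({"reward": 1.0}))
```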

1. Install Dependencies

In [4]:
#pip install --force-reinstall numpy==1.18.5   # TensorFlow 2.3.0 requires numpy < 1.19

Collecting numpy==1.18.5
  Using cached https://files.pythonhosted.org/packages/3e/00/0266fefaafb839760d5b25b884375b2ab0f842ebe138ee6c1ef807af44bb/numpy-1.18.5-cp37-cp37m-macosx_10_9_x86_64.whl
Installing collected packages: numpy
  Found existing installation: numpy 1.18.5
    Uninstalling numpy-1.18.5:
      Successfully uninstalled numpy-1.18.5
Successfully installed numpy-1.18.5
Note: you may need to restart the kernel to use updated packages.


In [5]:
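!pip install tensorflow==2.3.0
!pip install gym        # this notebook ran with gym 0.21.0; gym >= 0.26 changed what reset() and step() return
!pip install keras
!pip install keras-rl2   # installs the `rl` package imported in step 4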



2. Test a Random Environment with OpenAI Gym

In [6]:
#!pip install gym



In [7]:
#!pip show gym

Name: gym
Version: 0.21.0
Summary: Gym: A universal API for reinforcement learning environments.
Home-page: https://github.com/openai/gym
Author: OpenAI
Author-email: jkterry@umd.edu
License: UNKNOWN
Location: /Users/JeanJeongjinPark/opt/anaconda3/lib/python3.7/site-packages
Requires: importlib-metadata, numpy, cloudpickle
Required-by: 


In [8]:
import sys
# machine-specific workaround: make the Anaconda site-packages directory importable
sys.path.append('/Users/JeanJeongjinPark/opt/anaconda3/lib/python3.7/site-packages')

In [9]:
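# optional: installs every Gym environment extra; box2d-py and mujoco-py fail to build here, but CartPole does not need them
!pip install gym[all]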

Collecting box2d-py==2.3.5; extra == "all" (from gym[all])
  Using cached https://files.pythonhosted.org/packages/dd/5a/ad8d3ef9c13d5afcc1e44a77f11792ee717f6727b3320bddbc607e935e2a/box2d-py-2.3.5.tar.gz
Collecting mujoco-py<2.0,>=1.50; extra == "all" (from gym[all])
  Using cached https://files.pythonhosted.org/packages/cf/8c/64e0630b3d450244feef0688d90eab2448631e40ba6bdbd90a70b84898e7/mujoco-py-1.50.1.68.tar.gz
Building wheels for collected packages: box2d-py, mujoco-py
  Building wheel for box2d-py (setup.py) ... [?25lerror
[31m  ERROR: Command errored out with exit status 1:
   command: /Users/JeanJeongjinPark/opt/anaconda3/bin/python -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/private/var/folders/8b/jkr7fnlj16n_fck4d8gkz4v00000gn/T/pip-install-vt7wo6b_/box2d-py/setup.py'"'"'; __file__='"'"'/private/var/folders/8b/jkr7fnlj16n_fck4d8gkz4v00000gn/T/pip-install-vt7wo6b_/box2d-py/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('

In [10]:
import gym

# -- set up the environment -- #

# make the CartPole environment
env = gym.make('CartPole-v0')
# number of values in each observation (the state variables)
states = env.observation_space.shape[0]
# number of discrete actions the agent can take
actions = env.action_space.n

In [11]:
# 2 discrete actions: push the cart to the left (0) or to the right (1)
actions

2

In [12]:
# each observation has 4 values: cart position, cart velocity, pole angle, pole angular velocity
states

4

In [14]:
import random

# -- run a random agent for a few episodes -- #

episodes = 10
for episode in range(1, episodes + 1):
    state = env.reset()
    done = False
    score = 0

    while not done:
        # render the environment to watch the cart in action
        env.render()
        # pick a random action (0 = left, 1 = right)
        action = random.choice([0, 1])
        # apply the action; get the next state, reward, done flag and info dict
        n_state, reward, done, info = env.step(action)
        # accumulate the episode's total reward
        score += reward
    print('Episode: {} Score:{}'.format(episode, score))

Episode: 1 Score:23.0
Episode: 2 Score:20.0
Episode: 3 Score:64.0
Episode: 4 Score:19.0
Episode: 5 Score:26.0
Episode: 6 Score:15.0
Episode: 7 Score:13.0
Episode: 8 Score:14.0
Episode: 9 Score:40.0
Episode: 10 Score:16.0
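These scores are the baseline to compare the trained agent against. A short sketch that summarizes the ten random-agent scores printed above (the values are copied from that output):

```python
from statistics import mean, stdev

# Scores printed by the random-action loop above (episodes 1-10)
random_scores = [23.0, 20.0, 64.0, 19.0, 26.0, 15.0, 13.0, 14.0, 40.0, 16.0]

print("mean:", mean(random_scores))                    # mean: 25.0
print("min/max:", min(random_scores), max(random_scores))  # min/max: 13.0 64.0
print("stdev:", round(stdev(random_scores), 2))
```

CartPole-v0 caps each episode at 200 steps, so 200 is the highest score any agent can reach.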


3. Create a Deep Learning Model with Keras

In [25]:
import numpy as np
import tensorflow

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Flatten
from tensorflow.keras.optimizers import Adam

In [26]:
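def build_model(states, actions):
    model = Sequential()
    # input is (window_length=1, states); Flatten turns it into a vector of `states` values
    model.add(Flatten(input_shape=(1, states)))
    model.add(Dense(24, activation='relu'))
    model.add(Dense(24, activation='relu'))
    # one linear output per action: the estimated Q-value of taking that action
    model.add(Dense(actions, activation='linear'))
    return model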
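The `(1, states)` input shape matches the `SequentialMemory(window_length=1)` used in step 4: keras-rl feeds the model batches shaped `(batch, window_length, states)`, and `Flatten` turns each sample into a vector of `window_length * states` values. A NumPy sketch of that reshape. The batch size of 32 is only an example.

```python
import numpy as np

batch_size, window_length, states = 32, 1, 4

# A batch of observations stacked as (batch, window, state variables)
batch = np.random.rand(batch_size, window_length, states)

# What Flatten does per sample: keep the batch axis, merge the remaining axes
flat = batch.reshape(batch_size, -1)

print(batch.shape, "->", flat.shape)  # (32, 1, 4) -> (32, 4)
```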

In [27]:
model = build_model(states, actions)

In [28]:
model.summary()

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
flatten_1 (Flatten)          (None, 4)                 0         
_________________________________________________________________
dense_3 (Dense)              (None, 24)                120       
_________________________________________________________________
dense_4 (Dense)              (None, 24)                600       
_________________________________________________________________
dense_5 (Dense)              (None, 2)                 50        
Total params: 770
Trainable params: 770
Non-trainable params: 0
_________________________________________________________________
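The parameter counts follow from the layer sizes. A `Dense` layer with `n_in` inputs and `n_out` units has `n_in * n_out` weights plus `n_out` biases, and `Flatten` has none. A quick check in plain Python:

```python
def dense_params(n_in, n_out):
    """Weights (n_in * n_out) plus one bias per output unit."""
    return n_in * n_out + n_out


# (inputs, units) for dense_3, dense_4 and dense_5 in the summary above
layers = [(4, 24), (24, 24), (24, 2)]
counts = [dense_params(n_in, n_out) for n_in, n_out in layers]

print(counts)       # [120, 600, 50]
print(sum(counts))  # 770
```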


4. Build Agent with Keras-RL

In [29]:
from rl.agents import DQNAgent
from rl.policy import BoltzmannQPolicy
from rl.memory import SequentialMemory
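`SequentialMemory` acts as an experience-replay buffer. It keeps the most recent transitions up to `limit`, and the agent trains on random mini-batches drawn from it. The `ReplayBuffer` class below is a simplified, hypothetical sketch of that idea (a bounded `deque` plus uniform sampling), not keras-rl's implementation, which also handles `window_length` and episode boundaries.

```python
import random
from collections import deque


class ReplayBuffer:
    """Simplified experience replay: a bounded FIFO of (s, a, r, s', done) tuples."""

    def __init__(self, limit):
        self.buffer = deque(maxlen=limit)  # the oldest transitions drop out first

    def append(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform random mini-batch, without replacement
        return random.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)


memory = ReplayBuffer(limit=3)
for t in range(5):
    memory.append(state=t, action=0, reward=1.0, next_state=t + 1, done=False)

print(len(memory))                     # 3 (only the last 3 transitions are kept)
print([tr[0] for tr in memory.buffer])  # [2, 3, 4]
```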

In [30]:
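def build_agent(model, actions):
    # choose actions with probability proportional to exp(Q-value) (softmax exploration)
    policy = BoltzmannQPolicy()
    # experience replay: keep the last 50,000 transitions, one observation per state
    memory = SequentialMemory(limit=50000, window_length=1)
    # nb_steps_warmup: steps collected before training starts
    # target_model_update < 1: soft update of the target network with tau = 0.01
    dqn = DQNAgent(model=model, memory=memory, policy=policy,
                   nb_actions=actions, nb_steps_warmup=10, target_model_update=1e-2)
    return dqn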
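Two of these settings can be sketched in NumPy. `BoltzmannQPolicy` samples actions with probability proportional to `exp(Q / tau)`, a softmax over Q-values; keras-rl's defaults are `tau=1.0` with `Q / tau` clipped to `(-500, 500)`. Higher-valued actions are chosen more often, but not always. A `target_model_update` below 1 is treated by keras-rl as a soft update: on each training step, the target network's weights move a fraction `tau = 0.01` toward the online network's. Both functions below are illustrative sketches, not the library's code.

```python
import numpy as np


def boltzmann_action(q_values, tau=1.0, clip=(-500.0, 500.0), rng=None):
    """Sample an action with probability proportional to exp(Q / tau)."""
    if rng is None:
        rng = np.random.default_rng()
    exp_values = np.exp(np.clip(np.asarray(q_values) / tau, clip[0], clip[1]))
    probs = exp_values / exp_values.sum()
    return int(rng.choice(len(probs), p=probs)), probs


def soft_update(target_weights, online_weights, tau=1e-2):
    """Move each target weight array a fraction tau toward the online weights."""
    return [tau * w + (1.0 - tau) * tw for tw, w in zip(target_weights, online_weights)]


# Example: Q-values for CartPole's two actions (push left, push right)
action, probs = boltzmann_action([1.0, 2.0])
print(action, probs.round(3))  # the action is random; probs are [0.269 0.731]

target = [np.zeros(3)]
online = [np.ones(3)]
print(soft_update(target, online)[0])  # [0.01 0.01 0.01]
```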

In [31]:
dqn = build_agent(model, actions)
dqn.compile(Adam(learning_rate=1e-3), metrics=['mae'])
# 50,000 environment steps; verbose=1 prints a summary every 10,000 steps
dqn.fit(env, nb_steps=50000, visualize=False, verbose=1)

Training for 50000 steps ...
Interval 1 (0 steps performed)
Instructions for updating:
This property should not be used in TensorFlow 2.0, as updates are applied automatically.
   11/10000 [..............................] - ETA: 5:01 - reward: 1.0000 



101 episodes - episode_reward: 98.703 [10.000, 200.000] - loss: 3.025 - mae: 18.948 - mean_q: 38.361

Interval 2 (10000 steps performed)
52 episodes - episode_reward: 192.269 [148.000, 200.000] - loss: 7.443 - mae: 39.124 - mean_q: 79.063

Interval 3 (20000 steps performed)
50 episodes - episode_reward: 196.720 [165.000, 200.000] - loss: 6.929 - mae: 43.031 - mean_q: 86.579

Interval 4 (30000 steps performed)
51 episodes - episode_reward: 198.843 [176.000, 200.000] - loss: 5.552 - mae: 41.496 - mean_q: 83.364

Interval 5 (40000 steps performed)
done, took 1849.124 seconds


<tensorflow.python.keras.callbacks.History at 0x7faeea05b850>
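During `dqn.fit`, DQN trains the model so that Q(s, a) approaches the one-step target r + γ · max over a' of Q_target(s', a'). When the episode has ended, the target is just r, with no bootstrap term. keras-rl's agents use `gamma=0.99` by default. The sketch below computes the standard (non-double) DQN target for a small made-up batch.

```python
import numpy as np


def dqn_targets(rewards, next_q_target, dones, gamma=0.99):
    """One-step DQN targets: r + gamma * max_a' Q_target(s', a'), or just r if done."""
    rewards = np.asarray(rewards, dtype=float)
    dones = np.asarray(dones, dtype=float)
    max_next_q = np.max(next_q_target, axis=1)           # best next-state value per sample
    return rewards + gamma * (1.0 - dones) * max_next_q  # no bootstrap after a terminal step


# Made-up batch of 3 transitions, 2 actions each
rewards = [1.0, 1.0, 1.0]
next_q = np.array([[10.0, 12.0],
                   [5.0, 3.0],
                   [8.0, 9.0]])
dones = [False, False, True]

print(dqn_targets(rewards, next_q, dones))  # approximately [12.88, 5.95, 1.0]
```

CartPole pays a reward of 1 per step, so with γ = 0.99 the true discounted return can never exceed 1 / (1 − 0.99) = 100. That is consistent with the `mean_q` values of roughly 79 to 87 in intervals 2 to 4 of the log above.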

In [32]:
scores = dqn.test(env, nb_episodes=100, visualize=False)
print(np.mean(scores.history['episode_reward']))

Testing for 100 episodes ...
Episode 1: reward: 200.000, steps: 200
Episode 2: reward: 200.000, steps: 200
Episode 3: reward: 200.000, steps: 200
Episode 4: reward: 200.000, steps: 200
Episode 5: reward: 200.000, steps: 200
Episode 6: reward: 200.000, steps: 200
Episode 7: reward: 200.000, steps: 200
Episode 8: reward: 200.000, steps: 200
Episode 9: reward: 200.000, steps: 200
Episode 10: reward: 200.000, steps: 200
Episode 11: reward: 200.000, steps: 200
Episode 12: reward: 200.000, steps: 200
Episode 13: reward: 200.000, steps: 200
Episode 14: reward: 200.000, steps: 200
Episode 15: reward: 200.000, steps: 200
Episode 16: reward: 200.000, steps: 200
Episode 17: reward: 200.000, steps: 200
Episode 18: reward: 200.000, steps: 200
Episode 19: reward: 200.000, steps: 200
Episode 20: reward: 200.000, steps: 200
Episode 21: reward: 200.000, steps: 200
Episode 22: reward: 200.000, steps: 200
Episode 23: reward: 200.000, steps: 200
Episode 24: reward: 200.000, steps: 200
Episode 25: reward: 