[Deep Reinforcement Learning Tutorial for Python in 20 Minutes](https://www.youtube.com/watch?v=cO5g5qLrLSo&list=PLgNJO2hghbmjlE6cuKMws2ejC54BTAaWV&index=1)

# 0. Install Dependencies

In [6]:
# !pip install tensorflow
# !pip install gym
# !pip install keras
# !pip install keras-rl2

# 1. Test Random Environment with OpenAI Gym

In [7]:
import gym
import random

In [8]:
env_name = 'CartPole-v1'

env = gym.make(env_name)

### Observation Space

The observation is an `ndarray` with shape `(4,)` whose values correspond to the following positions and velocities:

| Num | Observation           | Min                  | Max                |
|-----|-----------------------|----------------------|--------------------|
| 0   | Cart Position         | -4.8                 | 4.8                |
| 1   | Cart Velocity         | -Inf                 | Inf                |
| 2   | Pole Angle            | ~ -0.418 rad (-24°)  | ~ 0.418 rad (24°)  |
| 3   | Pole Angular Velocity | -Inf                 | Inf                |
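
To make the table concrete, here is a minimal sketch that attaches names to the four entries of an observation. The `CartPoleObs` tuple and `label_observation` helper are hypothetical (they are not part of Gym), and the sample values are made up for illustration.

```python
from typing import NamedTuple, Sequence


class CartPoleObs(NamedTuple):
    """Hypothetical helper: names for the four observation entries."""
    cart_position: float
    cart_velocity: float
    pole_angle: float             # radians
    pole_angular_velocity: float


def label_observation(obs: Sequence[float]) -> CartPoleObs:
    """Wrap a length-4 observation in a named tuple, in table order."""
    if len(obs) != 4:
        raise ValueError(f"expected 4 values, got {len(obs)}")
    return CartPoleObs(*(float(x) for x in obs))


# Made-up observation, only to show how the indices map to names
sample = [0.02, -0.01, 0.03, 0.04]
labeled = label_observation(sample)
print(labeled.cart_position, labeled.pole_angle)
```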

In [9]:
env.observation_space.shape

(4,)

### Action Space

The action is an `ndarray` with shape `(1,)` which can take values `{0, 1}`, indicating the direction of the fixed force the cart is pushed with.

| Num | Action                 |
|-----|------------------------|
| 0   | Push cart to the left  |
| 1   | Push cart to the right |

In [10]:
env.action_space

Discrete(2)

In [11]:
env.action_space.n

2

In [12]:
# Get the number of observation values (states) and the number of discrete actions
states = env.observation_space.shape[0]  # 4 values per observation
actions = env.action_space.n  # 2 actions: push left or push right

In [13]:
episodes = 15

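# Note: this loop uses the classic Gym API (gym < 0.26), where reset() returns
# only the observation and step() returns a 4-tuple.
for episode in range(1, episodes+1):
    state = env.reset()
    done = False
    score = 0

    while not done:
        env.render()
        action = random.choice([0,1])  # pick left (0) or right (1) at random
        n_state, reward, done, info = env.step(action)
        score += reward  # CartPole gives +1 for every step the pole stays up

    print(f'Episode: {episode}, Score: {score}')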

Episode: 1, Score: 41.0
Episode: 2, Score: 13.0
Episode: 3, Score: 26.0
Episode: 4, Score: 12.0
Episode: 5, Score: 25.0
Episode: 6, Score: 27.0
Episode: 7, Score: 10.0
Episode: 8, Score: 45.0
Episode: 9, Score: 24.0
Episode: 10, Score: 26.0
Episode: 11, Score: 35.0
Episode: 12, Score: 36.0
Episode: 13, Score: 21.0
Episode: 14, Score: 18.0
Episode: 15, Score: 58.0


In [14]:
env.close()

With random actions, the best score across these 15 episodes is 58. This is where deep reinforcement learning comes in: a trained agent should learn to keep the pole up for far longer.
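
For a fuller picture of the random baseline than the maximum alone, the scores printed above can be summarized with the standard library. This is only a small sketch; the list is copied by hand from the output of this particular run, so other runs will give different numbers.

```python
from statistics import mean

# Scores copied from the 15 random-action episodes printed above
random_scores = [41.0, 13.0, 26.0, 12.0, 25.0, 27.0, 10.0, 45.0,
                 24.0, 26.0, 35.0, 36.0, 21.0, 18.0, 58.0]

best = max(random_scores)
average = mean(random_scores)
print(f"max: {best}, mean: {average:.1f}")  # max: 58.0, mean: 27.8
```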

# 2. Create Deep Learning Model with Keras

In [18]:
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Flatten
from tensorflow.keras.optimizers import Adam

In [19]:
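def build_model(states, actions):
    model = Sequential()
    # Input shape (1, states): the leading 1 matches window_length=1 in the agent's memory
    model.add(Flatten(input_shape=(1,states)))
    model.add(Dense(units=24, activation='relu'))
    model.add(Dense(units=24, activation='relu'))
    # One linear output per action: the estimated Q-value of that action
    model.add(Dense(units=actions, activation='linear'))
    return model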

In [25]:
# if the following error occurs, delete the model and rebuild it.
# AttributeError: 'Sequential' object has no attribute '_compile_time_distribution_strategy'
del model

In [26]:
model = build_model(states, actions)

In [27]:
model.summary()

Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 flatten_1 (Flatten)         (None, 4)                 0         
                                                                 
 dense_3 (Dense)             (None, 24)                120       
                                                                 
 dense_4 (Dense)             (None, 24)                600       
                                                                 
 dense_5 (Dense)             (None, 2)                 50        
                                                                 
Total params: 770
Trainable params: 770
Non-trainable params: 0
_________________________________________________________________
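
The `Param #` column in the summary can be checked by hand: a dense layer has one weight per input-output pair plus one bias per output. A small sketch of that arithmetic follows; the `dense_params` helper is hypothetical and written only for this check.

```python
def dense_params(n_inputs: int, n_outputs: int) -> int:
    """Weights (n_inputs * n_outputs) plus one bias per output unit."""
    return n_inputs * n_outputs + n_outputs


# Flattened input of 4 values, two hidden layers of 24 units, 2 outputs
layer_sizes = [4, 24, 24, 2]
counts = [dense_params(n_in, n_out)
          for n_in, n_out in zip(layer_sizes, layer_sizes[1:])]

print(counts)       # [120, 600, 50]
print(sum(counts))  # 770
```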


# 3. Build Agent with Keras-RL

In [28]:
from rl.agents import DQNAgent
from rl.policy import BoltzmannQPolicy
from rl.memory import SequentialMemory

In [29]:
def build_agent(model, actions):
    policy = BoltzmannQPolicy()  # samples actions with probabilities based on their Q-values
    memory = SequentialMemory(limit=50000, window_length=1)  # replay buffer of up to 50,000 steps
    dqn = DQNAgent(model=model, policy=policy, memory=memory,
                   nb_actions=actions, nb_steps_warmup=10, target_model_update=1e-2)
    return dqn
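
To illustrate what a Boltzmann policy does with the model's outputs, here is a plain-Python sketch of the idea: Q-values are turned into probabilities with a softmax, and an action is sampled from them. The function names and the `tau` temperature parameter are assumptions made for this illustration; this is not the keras-rl implementation.

```python
import math
import random
from typing import List, Sequence


def boltzmann_probs(q_values: Sequence[float], tau: float = 1.0) -> List[float]:
    """Softmax over Q-values; a lower tau makes the choice greedier."""
    scaled = [q / tau for q in q_values]
    top = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_action(q_values: Sequence[float], tau: float = 1.0, rng=random) -> int:
    """Sample an action index according to the Boltzmann probabilities."""
    probs = boltzmann_probs(q_values, tau)
    return rng.choices(range(len(q_values)), weights=probs, k=1)[0]


# Made-up Q-values for the two CartPole actions (left, right)
q = [1.0, 2.0]
print(boltzmann_probs(q))           # the second action is more likely
print(boltzmann_probs(q, tau=0.1))  # lower temperature: almost always action 1
print(sample_action(q, rng=random.Random(0)))
```

Unlike a purely greedy choice, this keeps some probability on the lower-valued action, which gives the agent room to explore during training.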

In [30]:
dqn = build_agent(model, actions)
dqn.compile(Adam(learning_rate=1e-3), metrics=['mae'])
dqn.fit(env, nb_steps=10000, visualize=False, verbose=1)

Training for 10000 steps ...
Interval 1 (0 steps performed)
    1/10000 [..............................] - ETA: 7:48 - reward: 1.0000


done, took 87.413 seconds


<keras.callbacks.History at 0x84693375c8>
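
The agent above was built with `target_model_update=1e-2`. In keras-rl, a value below 1 is interpreted as a soft-update rate: after each training step the target network's weights move a small fraction of the way toward the online network's weights. The sketch below shows that update rule on flat lists of numbers; the `soft_update` helper is hypothetical and is not the library's code.

```python
from typing import List


def soft_update(target: List[float], online: List[float],
                tau: float = 1e-2) -> List[float]:
    """Move each target weight a fraction tau toward the online weight."""
    return [tau * o + (1.0 - tau) * t for t, o in zip(target, online)]


# Made-up weights: the target slowly tracks the online values
target = [0.0, 0.0]
online = [1.0, -2.0]
for _ in range(3):
    target = soft_update(target, online)
print(target)
```

With a small `tau`, the target changes slowly, which keeps the Q-learning targets more stable than copying the online weights at every step.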

In [31]:
scores = dqn.test(env, nb_episodes=10, visualize=False)
print(np.mean(scores.history['episode_reward']))

Testing for 10 episodes ...
Episode 1: reward: 187.000, steps: 187
Episode 2: reward: 187.000, steps: 187
Episode 3: reward: 266.000, steps: 266
Episode 4: reward: 202.000, steps: 202
Episode 5: reward: 201.000, steps: 201
Episode 6: reward: 190.000, steps: 190
Episode 7: reward: 174.000, steps: 174
Episode 8: reward: 283.000, steps: 283
Episode 9: reward: 257.000, steps: 257
Episode 10: reward: 185.000, steps: 185
213.2
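
To put the trained agent's result next to the random baseline from section 1, the two sets of scores can be compared directly. This is a rough sketch based on a single run; both lists are copied by hand from the outputs shown in this notebook.

```python
from statistics import mean

# Episode rewards copied from the test output above
test_rewards = [187.0, 187.0, 266.0, 202.0, 201.0,
                190.0, 174.0, 283.0, 257.0, 185.0]
# Scores of the 15 random-action episodes from section 1
random_scores = [41.0, 13.0, 26.0, 12.0, 25.0, 27.0, 10.0, 45.0,
                 24.0, 26.0, 35.0, 36.0, 21.0, 18.0, 58.0]

trained_mean = mean(test_rewards)
random_mean = mean(random_scores)
print(f"trained mean: {trained_mean:.1f}")                  # 213.2
print(f"random mean:  {random_mean:.1f}")                   # 27.8
print(f"improvement:  {trained_mean / random_mean:.1f}x")   # 7.7x
```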


In [32]:
_ = dqn.test(env, nb_episodes=10, visualize=True)

Testing for 10 episodes ...
Episode 1: reward: 190.000, steps: 190
Episode 2: reward: 184.000, steps: 184
Episode 3: reward: 205.000, steps: 205
Episode 4: reward: 180.000, steps: 180
Episode 5: reward: 176.000, steps: 176
Episode 6: reward: 174.000, steps: 174
Episode 7: reward: 243.000, steps: 243
Episode 8: reward: 213.000, steps: 213
Episode 9: reward: 168.000, steps: 168
Episode 10: reward: 216.000, steps: 216


In [33]:
env.close()

# 4. Reloading Agent from Disk

In [34]:
dqn.save_weights('./Models/dqn_weights.h5f', overwrite=True)
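
The save call assumes the `./Models` directory can be written to. Depending on the Keras version and weights format, a missing directory may raise an error, so it can help to create it first. A minimal sketch using only the standard library; `ensure_parent_dir` is a hypothetical helper, and the demo uses a temporary directory instead of the notebook's real path.

```python
import tempfile
from pathlib import Path


def ensure_parent_dir(weights_path: str) -> Path:
    """Create the parent directory of weights_path if it is missing."""
    parent = Path(weights_path).parent
    parent.mkdir(parents=True, exist_ok=True)
    return parent


# Demo in a temporary directory; in the notebook this would be called as
# ensure_parent_dir('./Models/dqn_weights.h5f') before dqn.save_weights(...)
with tempfile.TemporaryDirectory() as tmp:
    parent = ensure_parent_dir(f"{tmp}/Models/dqn_weights.h5f")
    print(parent.name, parent.is_dir())
```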

In [35]:
# Delete the current model, agent, and environment to simulate a fresh session
del model
del dqn
del env

In [38]:
# Recreate the environment, model, and agent
env = gym.make(env_name)
states = env.observation_space.shape[0]
actions = env.action_space.n
model = build_model(states, actions)
dqn = build_agent(model, actions)
dqn.compile(Adam(learning_rate=1e-3), metrics=['mae'])

In [40]:
dqn.load_weights('./Models/dqn_weights.h5f')

In [41]:
_ = dqn.test(env=env, nb_episodes=5, visualize=True)

Testing for 5 episodes ...
Episode 1: reward: 185.000, steps: 185
Episode 2: reward: 206.000, steps: 206
Episode 3: reward: 186.000, steps: 186
Episode 4: reward: 179.000, steps: 179
Episode 5: reward: 189.000, steps: 189


<keras.callbacks.History at 0x846a92c148>

In [42]:
env.close()