# Module Five Assignment: Cartpole Problem
Review the code in this notebook and in the score_logger.py file in the *scores* directory. Then return to this notebook and select **Cell**, then **Run All**, from the menu bar to run the code. The code takes several minutes to run.

In [1]:
import random  
import gym  
import numpy as np  
from collections import deque  
from keras.models import Sequential  
from keras.layers import Dense  
from keras.optimizers import Adam  
  
  
from scores.score_logger import ScoreLogger  
  
ENV_NAME = "CartPole-v1"  
  
GAMMA = 0.95  
LEARNING_RATE = 0.001  
  
MEMORY_SIZE = 1000000  
BATCH_SIZE = 20  
  
EXPLORATION_MAX = 1.0  
EXPLORATION_MIN = 0.01  
EXPLORATION_DECAY = 0.995  
  
  
class DQNSolver:  
  
    def __init__(self, observation_space, action_space):  
        self.exploration_rate = EXPLORATION_MAX  
  
        self.action_space = action_space  
        self.memory = deque(maxlen=MEMORY_SIZE)  
  
        self.model = Sequential()  
        self.model.add(Dense(24, input_shape=(observation_space,), activation="relu"))  
        self.model.add(Dense(24, activation="relu"))  
        self.model.add(Dense(self.action_space, activation="linear"))  
        self.model.compile(loss="mse", optimizer=Adam(lr=LEARNING_RATE))  
  
    def remember(self, state, action, reward, next_state, done):  
        self.memory.append((state, action, reward, next_state, done))  
  
    def act(self, state):  
        if np.random.rand() < self.exploration_rate:  
            return random.randrange(self.action_space)  
        q_values = self.model.predict(state)  
        return np.argmax(q_values[0])  
  
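    def experience_replay(self):
        # Wait until the memory holds enough transitions to sample a batch.
        if len(self.memory) < BATCH_SIZE:
            return
        batch = random.sample(self.memory, BATCH_SIZE)
        for state, action, reward, state_next, terminal in batch:
            # Q-learning target: the reward alone for a terminal transition,
            # otherwise reward plus the discounted best next-state Q-value.
            q_update = reward
            if not terminal:
                q_update = (reward + GAMMA * np.amax(self.model.predict(state_next)[0]))
            # Overwrite only the taken action's Q-value, then fit on that target.
            q_values = self.model.predict(state)
            q_values[0][action] = q_update
            self.model.fit(state, q_values, verbose=0)
        # Decay exploration once per replay call, never below the minimum.
        self.exploration_rate *= EXPLORATION_DECAY
        self.exploration_rate = max(EXPLORATION_MIN, self.exploration_rate)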
  
  
def cartpole():  
    env = gym.make(ENV_NAME)  
    score_logger = ScoreLogger(ENV_NAME)  
    observation_space = env.observation_space.shape[0]  
    action_space = env.action_space.n  
    dqn_solver = DQNSolver(observation_space, action_space)  
    run = 0  
    while True:  
        run += 1  
        state = env.reset()  
        state = np.reshape(state, [1, observation_space])  
        step = 0  
        while True:  
            step += 1  
            #env.render()  
            action = dqn_solver.act(state)  
            state_next, reward, terminal, info = env.step(action)  
            # Negate the reward on the terminal step so that ending the episode is penalized.
            reward = reward if not terminal else -reward
            state_next = np.reshape(state_next, [1, observation_space])  
            dqn_solver.remember(state, action, reward, state_next, terminal)  
            state = state_next  
            if terminal:  
                print("Run: " + str(run) + ", exploration: " + str(dqn_solver.exploration_rate) + ", score: " + str(step))
                score_logger.add_score(step, run)  
                break  
            dqn_solver.experience_replay()  



Using TensorFlow backend.
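
The `experience_replay` method in the cell above trains the network toward a one-step Q-learning target: the stored reward alone for a terminal transition, otherwise the reward plus `GAMMA` times the largest predicted Q-value of the next state. The sketch below isolates that arithmetic with plain NumPy so it can be checked by hand. `td_target` is a hypothetical helper name that does not appear in the notebook, and the Q-values are made-up illustrative numbers.

```python
import numpy as np

GAMMA = 0.95  # same discount factor as the first cell


def td_target(reward, next_q_values, terminal, gamma=GAMMA):
    """Return the one-step Q-learning target for a single transition.

    Hypothetical helper: mirrors the q_update arithmetic in
    DQNSolver.experience_replay without calling the Keras model.
    """
    if terminal:
        # No future reward is added once the episode has ended.
        return reward
    return reward + gamma * float(np.amax(next_q_values))


# Illustrative (made-up) Q-value predictions for the next state.
next_q = np.array([0.5, 2.0])

# Non-terminal step: 1.0 + 0.95 * 2.0
non_terminal_target = td_target(1.0, next_q, terminal=False)
# Terminal step: the training loop stores the negated reward, -1.0
terminal_target = td_target(-1.0, next_q, terminal=True)
print(non_terminal_target, terminal_target)
```

Only the Q-value of the action actually taken is replaced by this target before `fit` is called, so the other output stays at the network's own prediction and contributes no error.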


In [2]:
cartpole()

Run: 1, exploration: 0.990025, score: 22
Scores: (min: 22, avg: 22, max: 22)

Run: 2, exploration: 0.8911090557802088, score: 22
Scores: (min: 22, avg: 22, max: 22)

Run: 3, exploration: 0.6696478204705644, score: 58
Scores: (min: 22, avg: 34, max: 58)

Run: 4, exploration: 0.6369088258938781, score: 11
Scores: (min: 11, avg: 28.25, max: 58)

Run: 5, exploration: 0.5790496471185967, score: 20
Scores: (min: 11, avg: 26.6, max: 58)

Run: 6, exploration: 0.5452463540625918, score: 13
Scores: (min: 11, avg: 24.333333333333332, max: 58)

Run: 7, exploration: 0.5134164023722473, score: 13
Scores: (min: 11, avg: 22.714285714285715, max: 58)

Run: 8, exploration: 0.47147873742168567, score: 18
Scores: (min: 11, avg: 22.125, max: 58)

Run: 9, exploration: 0.4439551321314536, score: 13
Scores: (min: 11, avg: 21.11111111111111, max: 58)

Run: 10, exploration: 0.40974000909221303, score: 17
Scores: (min: 11, avg: 20.7, max: 58)

Run: 11, exploration: 0.3819719776053028, score: 15
Scores: (min: 11,

Run: 89, exploration: 0.01, score: 166
Scores: (min: 8, avg: 83.65168539325843, max: 311)

Run: 90, exploration: 0.01, score: 167
Scores: (min: 8, avg: 84.57777777777778, max: 311)

Run: 91, exploration: 0.01, score: 113
Scores: (min: 8, avg: 84.89010989010988, max: 311)

Run: 92, exploration: 0.01, score: 244
Scores: (min: 8, avg: 86.6195652173913, max: 311)

Run: 93, exploration: 0.01, score: 147
Scores: (min: 8, avg: 87.26881720430107, max: 311)

Run: 94, exploration: 0.01, score: 144
Scores: (min: 8, avg: 87.87234042553192, max: 311)

Run: 95, exploration: 0.01, score: 165
Scores: (min: 8, avg: 88.6842105263158, max: 311)

Run: 96, exploration: 0.01, score: 151
Scores: (min: 8, avg: 89.33333333333333, max: 311)

Run: 97, exploration: 0.01, score: 132
Scores: (min: 8, avg: 89.77319587628865, max: 311)

Run: 98, exploration: 0.01, score: 154
Scores: (min: 8, avg: 90.42857142857143, max: 311)

Run: 99, exploration: 0.01, score: 155
Scores: (min: 8, avg: 91.08080808080808, max: 311)

Run: 189, exploration: 0.01, score: 111
Scores: (min: 35, avg: 138.4, max: 322)

Run: 190, exploration: 0.01, score: 126
Scores: (min: 35, avg: 137.99, max: 322)

Run: 191, exploration: 0.01, score: 149
Scores: (min: 35, avg: 138.35, max: 322)

Run: 192, exploration: 0.01, score: 119
Scores: (min: 35, avg: 137.1, max: 322)

Run: 193, exploration: 0.01, score: 194
Scores: (min: 35, avg: 137.57, max: 322)

Run: 194, exploration: 0.01, score: 120
Scores: (min: 35, avg: 137.33, max: 322)

Run: 195, exploration: 0.01, score: 212
Scores: (min: 35, avg: 137.8, max: 322)

Run: 196, exploration: 0.01, score: 158
Scores: (min: 35, avg: 137.87, max: 322)

Run: 197, exploration: 0.01, score: 12
Scores: (min: 12, avg: 136.67, max: 322)

Run: 198, exploration: 0.01, score: 164
Scores: (min: 12, avg: 136.77, max: 322)

Run: 199, exploration: 0.01, score: 91
Scores: (min: 12, avg: 136.13, max: 322)

Run: 200, exploration: 0.01, score: 167
Scores: (min: 12, avg: 136.42, max: 322)

Run: 201, exploration

Run: 290, exploration: 0.01, score: 391
Scores: (min: 12, avg: 182.58, max: 500)

Run: 291, exploration: 0.01, score: 211
Scores: (min: 12, avg: 183.2, max: 500)

Run: 292, exploration: 0.01, score: 202
Scores: (min: 12, avg: 184.03, max: 500)

Run: 293, exploration: 0.01, score: 197
Scores: (min: 12, avg: 184.06, max: 500)

Run: 294, exploration: 0.01, score: 115
Scores: (min: 12, avg: 184.01, max: 500)

Run: 295, exploration: 0.01, score: 90
Scores: (min: 12, avg: 182.79, max: 500)

Run: 296, exploration: 0.01, score: 166
Scores: (min: 12, avg: 182.87, max: 500)

Run: 297, exploration: 0.01, score: 159
Scores: (min: 12, avg: 184.34, max: 500)

Run: 298, exploration: 0.01, score: 114
Scores: (min: 12, avg: 183.84, max: 500)

Run: 299, exploration: 0.01, score: 341
Scores: (min: 12, avg: 186.34, max: 500)

Run: 300, exploration: 0.01, score: 433
Scores: (min: 12, avg: 189, max: 500)

Run: 301, exploration: 0.01, score: 302
Scores: (min: 12, avg: 190.48, max: 500)

Run: 302, exploration

NameError: name 'exit' is not defined
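
The exploration values in the log above can be reproduced from the scores alone, because `experience_replay` decays the rate once per call, and it is only called on non-terminal steps once the memory holds at least `BATCH_SIZE` transitions. The sketch below replays that bookkeeping without gym or Keras. `exploration_after_runs` is a hypothetical helper, and it assumes the memory never fills up to `MEMORY_SIZE`.

```python
import math

BATCH_SIZE = 20
EXPLORATION_MAX = 1.0
EXPLORATION_MIN = 0.01
EXPLORATION_DECAY = 0.995


def exploration_after_runs(scores):
    """Return the exploration rate printed at the end of each run.

    Hypothetical helper: replays only the decay bookkeeping of the
    training loop, given the number of steps (score) of each run.
    """
    rate = EXPLORATION_MAX
    memory_len = 0
    rates = []
    for score in scores:
        for step in range(1, score + 1):
            memory_len += 1  # remember() stores every transition
            if step == score:
                break  # terminal step: the loop prints and breaks before replaying
            if memory_len >= BATCH_SIZE:
                rate = max(EXPLORATION_MIN, rate * EXPLORATION_DECAY)
        rates.append(rate)
    return rates


# Scores of the first three runs in the log above.
rates = exploration_after_runs([22, 22, 58])
print(rates)

# Number of decays needed to fall from EXPLORATION_MAX to EXPLORATION_MIN.
decays_to_min = math.ceil(
    math.log(EXPLORATION_MIN / EXPLORATION_MAX) / math.log(EXPLORATION_DECAY)
)
print(decays_to_min)
```

Because the decay is tied to replay calls rather than to episodes, longer episodes use up the exploration budget faster, which is why the rate is already at its floor of 0.01 well before run 100.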

In [3]:
import random  
import gym  
import numpy as np  
from collections import deque  
from keras.models import Sequential  
from keras.layers import Dense  
from keras.optimizers import Adam  
  
  
from scores.score_logger import ScoreLogger  
  
ENV_NAME = "CartPole-v1"  
  
GAMMA = 0.95  
# Learning rate raised from 0.001 to 0.01
LEARNING_RATE = 0.01  
  
MEMORY_SIZE = 1000000  
BATCH_SIZE = 20  
  
EXPLORATION_MAX = 1.0  
EXPLORATION_MIN = 0.01  
EXPLORATION_DECAY = 0.995  
  
  
class DQNSolver:  
  
    def __init__(self, observation_space, action_space):  
        self.exploration_rate = EXPLORATION_MAX  
  
        self.action_space = action_space  
        self.memory = deque(maxlen=MEMORY_SIZE)  
  
        self.model = Sequential()  
        self.model.add(Dense(24, input_shape=(observation_space,), activation="relu"))  
        self.model.add(Dense(24, activation="relu"))  
        self.model.add(Dense(self.action_space, activation="linear"))  
        self.model.compile(loss="mse", optimizer=Adam(lr=LEARNING_RATE))  
  
    def remember(self, state, action, reward, next_state, done):  
        self.memory.append((state, action, reward, next_state, done))  
  
    def act(self, state):  
        if np.random.rand() < self.exploration_rate:  
            return random.randrange(self.action_space)  
        q_values = self.model.predict(state)  
        return np.argmax(q_values[0])  
  
    def experience_replay(self):  
        if len(self.memory) < BATCH_SIZE:  
            return  
        batch = random.sample(self.memory, BATCH_SIZE)  
        for state, action, reward, state_next, terminal in batch:  
            q_update = reward  
            if not terminal:  
                q_update = (reward + GAMMA * np.amax(self.model.predict(state_next)[0]))  
            q_values = self.model.predict(state)  
            q_values[0][action] = q_update  
            self.model.fit(state, q_values, verbose=0)  
        self.exploration_rate *= EXPLORATION_DECAY  
        self.exploration_rate = max(EXPLORATION_MIN, self.exploration_rate)  
  
  
def cartpole():  
    env = gym.make(ENV_NAME)  
    score_logger = ScoreLogger(ENV_NAME)  
    observation_space = env.observation_space.shape[0]  
    action_space = env.action_space.n  
    dqn_solver = DQNSolver(observation_space, action_space)  
    run = 0  
    while True:  
        run += 1  
        state = env.reset()  
        state = np.reshape(state, [1, observation_space])  
        step = 0  
        while True:  
            step += 1  
            #env.render()  
            action = dqn_solver.act(state)  
            state_next, reward, terminal, info = env.step(action)  
            reward = reward if not terminal else -reward  
            state_next = np.reshape(state_next, [1, observation_space])  
            dqn_solver.remember(state, action, reward, state_next, terminal)  
            state = state_next  
            if terminal:  
                print("Run: " + str(run) + ", exploration: " + str(dqn_solver.exploration_rate) + ", score: " + str(step))
                score_logger.add_score(step, run)  
                break  
            dqn_solver.experience_replay()  



In [4]:
cartpole()

Run: 1, exploration: 0.9046104802746175, score: 40
Scores: (min: 40, avg: 40, max: 40)

Run: 2, exploration: 0.7552531090661897, score: 37
Scores: (min: 37, avg: 38.5, max: 40)

Run: 3, exploration: 0.6730128848950395, score: 24
Scores: (min: 24, avg: 33.666666666666664, max: 40)

Run: 4, exploration: 0.6180388156137953, score: 18
Scores: (min: 18, avg: 29.75, max: 40)

Run: 5, exploration: 0.5819594443402982, score: 13
Scores: (min: 13, avg: 26.4, max: 40)

Run: 6, exploration: 0.5535075230322891, score: 11
Scores: (min: 11, avg: 23.833333333333332, max: 40)

Run: 7, exploration: 0.531750826943791, score: 9
Scores: (min: 9, avg: 21.714285714285715, max: 40)

Run: 8, exploration: 0.5032248303978422, score: 12
Scores: (min: 9, avg: 20.5, max: 40)

Run: 9, exploration: 0.46444185833082485, score: 17
Scores: (min: 9, avg: 20.11111111111111, max: 40)

Run: 10, exploration: 0.4417353564707963, score: 11
Scores: (min: 9, avg: 19.2, max: 40)

Run: 11, exploration: 0.41386834584198684, score: 

Run: 84, exploration: 0.01, score: 41
Scores: (min: 8, avg: 14.214285714285714, max: 77)

Run: 85, exploration: 0.01, score: 34
Scores: (min: 8, avg: 14.447058823529412, max: 77)

Run: 86, exploration: 0.01, score: 12
Scores: (min: 8, avg: 14.418604651162791, max: 77)

Run: 87, exploration: 0.01, score: 28
Scores: (min: 8, avg: 14.574712643678161, max: 77)

Run: 88, exploration: 0.01, score: 15
Scores: (min: 8, avg: 14.579545454545455, max: 77)

Run: 89, exploration: 0.01, score: 39
Scores: (min: 8, avg: 14.853932584269662, max: 77)

Run: 90, exploration: 0.01, score: 21
Scores: (min: 8, avg: 14.922222222222222, max: 77)

Run: 91, exploration: 0.01, score: 19
Scores: (min: 8, avg: 14.967032967032967, max: 77)

Run: 92, exploration: 0.01, score: 26
Scores: (min: 8, avg: 15.08695652173913, max: 77)

Run: 93, exploration: 0.01, score: 54
Scores: (min: 8, avg: 15.505376344086022, max: 77)

Run: 94, exploration: 0.01, score: 12
Scores: (min: 8, avg: 15.46808510638298, max: 77)

Run: 95, exp

Run: 187, exploration: 0.01, score: 11
Scores: (min: 8, avg: 21.22, max: 87)

Run: 188, exploration: 0.01, score: 20
Scores: (min: 8, avg: 21.27, max: 87)

Run: 189, exploration: 0.01, score: 36
Scores: (min: 8, avg: 21.24, max: 87)

Run: 190, exploration: 0.01, score: 13
Scores: (min: 8, avg: 21.16, max: 87)

Run: 191, exploration: 0.01, score: 23
Scores: (min: 8, avg: 21.2, max: 87)

Run: 192, exploration: 0.01, score: 20
Scores: (min: 8, avg: 21.14, max: 87)

Run: 193, exploration: 0.01, score: 9
Scores: (min: 8, avg: 20.69, max: 87)

Run: 194, exploration: 0.01, score: 49
Scores: (min: 8, avg: 21.06, max: 87)

Run: 195, exploration: 0.01, score: 20
Scores: (min: 8, avg: 20.8, max: 87)

Run: 196, exploration: 0.01, score: 29
Scores: (min: 8, avg: 20.98, max: 87)

Run: 197, exploration: 0.01, score: 34
Scores: (min: 8, avg: 20.88, max: 87)

Run: 198, exploration: 0.01, score: 16
Scores: (min: 8, avg: 20.84, max: 87)

Run: 199, exploration: 0.01, score: 11
Scores: (min: 8, avg: 20.7, 

Run: 293, exploration: 0.01, score: 15
Scores: (min: 9, avg: 24.73, max: 68)

Run: 294, exploration: 0.01, score: 13
Scores: (min: 9, avg: 24.37, max: 68)

Run: 295, exploration: 0.01, score: 27
Scores: (min: 9, avg: 24.44, max: 68)

Run: 296, exploration: 0.01, score: 11
Scores: (min: 9, avg: 24.26, max: 68)

Run: 297, exploration: 0.01, score: 10
Scores: (min: 9, avg: 24.02, max: 68)

Run: 298, exploration: 0.01, score: 26
Scores: (min: 9, avg: 24.12, max: 68)

Run: 299, exploration: 0.01, score: 16
Scores: (min: 9, avg: 24.17, max: 68)

Run: 300, exploration: 0.01, score: 37
Scores: (min: 9, avg: 24.36, max: 68)

Run: 301, exploration: 0.01, score: 15
Scores: (min: 9, avg: 24.07, max: 68)

Run: 302, exploration: 0.01, score: 15
Scores: (min: 9, avg: 24.12, max: 68)

Run: 303, exploration: 0.01, score: 36
Scores: (min: 9, avg: 24.02, max: 68)

Run: 304, exploration: 0.01, score: 22
Scores: (min: 9, avg: 24.14, max: 68)

Run: 305, exploration: 0.01, score: 31
Scores: (min: 9, avg: 24.

Run: 399, exploration: 0.01, score: 10
Scores: (min: 8, avg: 25.54, max: 69)

Run: 400, exploration: 0.01, score: 9
Scores: (min: 8, avg: 25.26, max: 69)

Run: 401, exploration: 0.01, score: 9
Scores: (min: 8, avg: 25.2, max: 69)

Run: 402, exploration: 0.01, score: 9
Scores: (min: 8, avg: 25.14, max: 69)

Run: 403, exploration: 0.01, score: 10
Scores: (min: 8, avg: 24.88, max: 69)

Run: 404, exploration: 0.01, score: 8
Scores: (min: 8, avg: 24.74, max: 69)

Run: 405, exploration: 0.01, score: 9
Scores: (min: 8, avg: 24.52, max: 69)

Run: 406, exploration: 0.01, score: 9
Scores: (min: 8, avg: 24.39, max: 69)

Run: 407, exploration: 0.01, score: 10
Scores: (min: 8, avg: 24.07, max: 69)

Run: 408, exploration: 0.01, score: 10
Scores: (min: 8, avg: 23.96, max: 69)

Run: 409, exploration: 0.01, score: 10
Scores: (min: 8, avg: 23.85, max: 69)

Run: 410, exploration: 0.01, score: 14
Scores: (min: 8, avg: 23.78, max: 69)

Run: 411, exploration: 0.01, score: 75
Scores: (min: 8, avg: 24.22, max

KeyboardInterrupt: 
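
To compare runs across hyperparameter settings, it can help to pull the numbers out of the printed log instead of reading it by eye. The sketch below parses lines in the `Run: N, exploration: X, score: S` format printed by `cartpole()`. `parse_runs` is a hypothetical helper that is not part of the notebook, and the sample text is copied from the output above, including one truncated line.

```python
import re

# Matches only complete "Run:" lines; "Scores:" and truncated lines do not match.
RUN_LINE = re.compile(r"^Run: (\d+), exploration: ([0-9.]+), score: (\d+)$")


def parse_runs(log_text):
    """Return (run, exploration, score) tuples for every complete 'Run:' line."""
    rows = []
    for line in log_text.splitlines():
        match = RUN_LINE.match(line.strip())
        if match:
            run, exploration, score = match.groups()
            rows.append((int(run), float(exploration), int(score)))
    return rows


sample = """Run: 399, exploration: 0.01, score: 10
Scores: (min: 8, avg: 25.54, max: 69)

Run: 400, exploration: 0.01, score: 9
Scores: (min: 8, avg: 25.26, max: 69)

Run: 411, exploration: 0.01, score: 75
Scores: (min: 8, avg: 24.22, max
"""

rows = parse_runs(sample)
mean_score = sum(score for _, _, score in rows) / len(rows)
print(rows, mean_score)
```

The same function can be applied to the saved output of each experiment, so that scores for the different `LEARNING_RATE` and `GAMMA` settings are compared over the same range of runs.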

In [None]:
import random  
import gym  
import numpy as np  
from collections import deque  
from keras.models import Sequential  
from keras.layers import Dense  
from keras.optimizers import Adam  
  
  
from scores.score_logger import ScoreLogger  
  
ENV_NAME = "CartPole-v1"  

# Discount factor (gamma) lowered from 0.95 to 0.5
GAMMA = 0.5  
LEARNING_RATE = 0.001  
  
MEMORY_SIZE = 1000000  
BATCH_SIZE = 20  
  
EXPLORATION_MAX = 1.0  
EXPLORATION_MIN = 0.01  
EXPLORATION_DECAY = 0.995  
  
  
class DQNSolver:  
  
    def __init__(self, observation_space, action_space):  
        self.exploration_rate = EXPLORATION_MAX  
  
        self.action_space = action_space  
        self.memory = deque(maxlen=MEMORY_SIZE)  
  
        self.model = Sequential()  
        self.model.add(Dense(24, input_shape=(observation_space,), activation="relu"))  
        self.model.add(Dense(24, activation="relu"))  
        self.model.add(Dense(self.action_space, activation="linear"))  
        self.model.compile(loss="mse", optimizer=Adam(lr=LEARNING_RATE))  
  
    def remember(self, state, action, reward, next_state, done):  
        self.memory.append((state, action, reward, next_state, done))  
  
    def act(self, state):  
        if np.random.rand() < self.exploration_rate:  
            return random.randrange(self.action_space)  
        q_values = self.model.predict(state)  
        return np.argmax(q_values[0])  
  
    def experience_replay(self):  
        if len(self.memory) < BATCH_SIZE:  
            return  
        batch = random.sample(self.memory, BATCH_SIZE)  
        for state, action, reward, state_next, terminal in batch:  
            q_update = reward  
            if not terminal:  
                q_update = (reward + GAMMA * np.amax(self.model.predict(state_next)[0]))  
            q_values = self.model.predict(state)  
            q_values[0][action] = q_update  
            self.model.fit(state, q_values, verbose=0)  
        self.exploration_rate *= EXPLORATION_DECAY  
        self.exploration_rate = max(EXPLORATION_MIN, self.exploration_rate)  
  
  
def cartpole():  
    env = gym.make(ENV_NAME)  
    score_logger = ScoreLogger(ENV_NAME)  
    observation_space = env.observation_space.shape[0]  
    action_space = env.action_space.n  
    dqn_solver = DQNSolver(observation_space, action_space)  
    run = 0  
    while True:  
        run += 1  
        state = env.reset()  
        state = np.reshape(state, [1, observation_space])  
        step = 0  
        while True:  
            step += 1  
            #env.render()  
            action = dqn_solver.act(state)  
            state_next, reward, terminal, info = env.step(action)  
            reward = reward if not terminal else -reward  
            state_next = np.reshape(state_next, [1, observation_space])  
            dqn_solver.remember(state, action, reward, state_next, terminal)  
            state = state_next  
            if terminal:  
                print("Run: " + str(run) + ", exploration: " + str(dqn_solver.exploration_rate) + ", score: " + str(step))
                score_logger.add_score(step, run)  
                break  
            dqn_solver.experience_replay()  
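
The cell above lowers `GAMMA` from 0.95 to 0.5, which shortens how far ahead the Q-learning target effectively looks. With a constant reward of 1 per step, the discounted return is the geometric sum 1 + gamma + gamma^2 + ..., which approaches 1 / (1 - gamma): roughly 20 at 0.95 and 2 at 0.5. The sketch below checks this numerically. `discounted_return` is a hypothetical helper, and the constant reward of 1 is CartPole's per-step reward before the loop's terminal penalty is applied.

```python
def discounted_return(gamma, steps, reward=1.0):
    """Sum of reward * gamma**k for k = 0 .. steps - 1."""
    return sum(reward * gamma ** k for k in range(steps))


for gamma in (0.95, 0.5):
    limit = 1.0 / (1.0 - gamma)              # closed form of the infinite sum
    partial = discounted_return(gamma, 500)  # 500 is the CartPole-v1 episode cap
    print(gamma, round(limit, 3), round(partial, 3))
```

With gamma at 0.5, rewards more than a few steps away contribute almost nothing to the target, so the agent has little signal for distinguishing actions whose consequences only show up later in the episode. This is one plausible reading of the results, not a conclusion the log alone can confirm.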



In [5]:
cartpole()

Run: 1, exploration: 1.0, score: 10
Scores: (min: 10, avg: 10, max: 10)

Run: 2, exploration: 1.0, score: 10
Scores: (min: 10, avg: 10, max: 10)

Run: 3, exploration: 0.9046104802746175, score: 21
Scores: (min: 10, avg: 13.666666666666666, max: 21)

Run: 4, exploration: 0.8265651079747222, score: 19
Scores: (min: 10, avg: 15, max: 21)

Run: 5, exploration: 0.7402609576967045, score: 23
Scores: (min: 10, avg: 16.6, max: 23)

Run: 6, exploration: 0.697046600835495, score: 13
Scores: (min: 10, avg: 16, max: 23)

Run: 7, exploration: 0.6498078359349755, score: 15
Scores: (min: 10, avg: 15.857142857142858, max: 23)

Run: 8, exploration: 0.6149486215357263, score: 12
Scores: (min: 10, avg: 15.375, max: 23)

Run: 9, exploration: 0.5848838636585911, score: 11
Scores: (min: 10, avg: 14.88888888888889, max: 23)

Run: 10, exploration: 0.5647174463480732, score: 8
Scores: (min: 8, avg: 14.2, max: 23)

Run: 11, exploration: 0.5264466124450268, score: 15
Scores: (min: 8, avg: 14.272727272727273, max

Run: 90, exploration: 0.01, score: 101
Scores: (min: 8, avg: 81.68888888888888, max: 424)

Run: 91, exploration: 0.01, score: 47
Scores: (min: 8, avg: 81.3076923076923, max: 424)

Run: 92, exploration: 0.01, score: 65
Scores: (min: 8, avg: 81.1304347826087, max: 424)

Run: 93, exploration: 0.01, score: 164
Scores: (min: 8, avg: 82.02150537634408, max: 424)

Run: 94, exploration: 0.01, score: 56
Scores: (min: 8, avg: 81.74468085106383, max: 424)

Run: 95, exploration: 0.01, score: 39
Scores: (min: 8, avg: 81.29473684210527, max: 424)

Run: 96, exploration: 0.01, score: 73
Scores: (min: 8, avg: 81.20833333333333, max: 424)

Run: 97, exploration: 0.01, score: 28
Scores: (min: 8, avg: 80.65979381443299, max: 424)

Run: 98, exploration: 0.01, score: 99
Scores: (min: 8, avg: 80.84693877551021, max: 424)

Run: 99, exploration: 0.01, score: 123
Scores: (min: 8, avg: 81.27272727272727, max: 424)

Run: 100, exploration: 0.01, score: 133
Scores: (min: 8, avg: 81.79, max: 424)

Run: 101, explorati

Run: 193, exploration: 0.01, score: 10
Scores: (min: 8, avg: 47.78, max: 266)

Run: 194, exploration: 0.01, score: 9
Scores: (min: 8, avg: 47.31, max: 266)

Run: 195, exploration: 0.01, score: 8
Scores: (min: 8, avg: 47, max: 266)

Run: 196, exploration: 0.01, score: 9
Scores: (min: 8, avg: 46.36, max: 266)

Run: 197, exploration: 0.01, score: 9
Scores: (min: 8, avg: 46.17, max: 266)

Run: 198, exploration: 0.01, score: 10
Scores: (min: 8, avg: 45.28, max: 266)

Run: 199, exploration: 0.01, score: 8
Scores: (min: 8, avg: 44.13, max: 266)

Run: 200, exploration: 0.01, score: 9
Scores: (min: 8, avg: 42.89, max: 266)

Run: 201, exploration: 0.01, score: 9
Scores: (min: 8, avg: 42.57, max: 266)

Run: 202, exploration: 0.01, score: 10
Scores: (min: 8, avg: 40.96, max: 266)

Run: 203, exploration: 0.01, score: 11
Scores: (min: 8, avg: 40.76, max: 266)

Run: 204, exploration: 0.01, score: 69
Scores: (min: 8, avg: 40.9, max: 266)

Run: 205, exploration: 0.01, score: 9
Scores: (min: 8, avg: 40.

Run: 299, exploration: 0.01, score: 22
Scores: (min: 8, avg: 11.6, max: 69)

Run: 300, exploration: 0.01, score: 13
Scores: (min: 8, avg: 11.64, max: 69)

Run: 301, exploration: 0.01, score: 10
Scores: (min: 8, avg: 11.65, max: 69)

Run: 302, exploration: 0.01, score: 9
Scores: (min: 8, avg: 11.64, max: 69)

Run: 303, exploration: 0.01, score: 9
Scores: (min: 8, avg: 11.62, max: 69)

Run: 304, exploration: 0.01, score: 8
Scores: (min: 8, avg: 11.01, max: 30)

Run: 305, exploration: 0.01, score: 13
Scores: (min: 8, avg: 11.05, max: 30)

Run: 306, exploration: 0.01, score: 15
Scores: (min: 8, avg: 11.11, max: 30)

Run: 307, exploration: 0.01, score: 10
Scores: (min: 8, avg: 11.11, max: 30)

Run: 308, exploration: 0.01, score: 10
Scores: (min: 8, avg: 11.11, max: 30)

Run: 309, exploration: 0.01, score: 21
Scores: (min: 8, avg: 11.23, max: 30)

Run: 310, exploration: 0.01, score: 9
Scores: (min: 8, avg: 11.23, max: 30)

Run: 311, exploration: 0.01, score: 14
Scores: (min: 8, avg: 11.26, m

Run: 405, exploration: 0.01, score: 18
Scores: (min: 8, avg: 11.33, max: 30)

Run: 406, exploration: 0.01, score: 10
Scores: (min: 8, avg: 11.28, max: 30)

Run: 407, exploration: 0.01, score: 10
Scores: (min: 8, avg: 11.28, max: 30)

Run: 408, exploration: 0.01, score: 8
Scores: (min: 8, avg: 11.26, max: 30)

Run: 409, exploration: 0.01, score: 9
Scores: (min: 8, avg: 11.14, max: 30)

Run: 410, exploration: 0.01, score: 27
Scores: (min: 8, avg: 11.32, max: 30)

Run: 411, exploration: 0.01, score: 10
Scores: (min: 8, avg: 11.28, max: 30)

Run: 412, exploration: 0.01, score: 8
Scores: (min: 8, avg: 11.27, max: 30)

Run: 413, exploration: 0.01, score: 10
Scores: (min: 8, avg: 11.25, max: 30)

Run: 414, exploration: 0.01, score: 19
Scores: (min: 8, avg: 11.35, max: 30)

Run: 415, exploration: 0.01, score: 9
Scores: (min: 8, avg: 11.33, max: 30)

Run: 416, exploration: 0.01, score: 10
Scores: (min: 8, avg: 11.34, max: 30)

Run: 417, exploration: 0.01, score: 9
Scores: (min: 8, avg: 11.32, m

Run: 511, exploration: 0.01, score: 8
Scores: (min: 8, avg: 11.82, max: 41)

Run: 512, exploration: 0.01, score: 26
Scores: (min: 8, avg: 12, max: 41)

Run: 513, exploration: 0.01, score: 11
Scores: (min: 8, avg: 12.01, max: 41)

Run: 514, exploration: 0.01, score: 10
Scores: (min: 8, avg: 11.92, max: 41)

Run: 515, exploration: 0.01, score: 10
Scores: (min: 8, avg: 11.93, max: 41)

Run: 516, exploration: 0.01, score: 8
Scores: (min: 8, avg: 11.91, max: 41)

Run: 517, exploration: 0.01, score: 10
Scores: (min: 8, avg: 11.92, max: 41)

Run: 518, exploration: 0.01, score: 10
Scores: (min: 8, avg: 11.87, max: 41)

Run: 519, exploration: 0.01, score: 9
Scores: (min: 8, avg: 11.76, max: 41)

Run: 520, exploration: 0.01, score: 9
Scores: (min: 8, avg: 11.75, max: 41)

Run: 521, exploration: 0.01, score: 9
Scores: (min: 8, avg: 11.73, max: 41)

Run: 522, exploration: 0.01, score: 10
Scores: (min: 8, avg: 11.72, max: 41)

Run: 523, exploration: 0.01, score: 12
Scores: (min: 8, avg: 11.76, max:

Run: 617, exploration: 0.01, score: 9
Scores: (min: 8, avg: 11.97, max: 51)

Run: 618, exploration: 0.01, score: 10
Scores: (min: 8, avg: 11.97, max: 51)

Run: 619, exploration: 0.01, score: 9
Scores: (min: 8, avg: 11.97, max: 51)

Run: 620, exploration: 0.01, score: 9
Scores: (min: 8, avg: 11.97, max: 51)

Run: 621, exploration: 0.01, score: 9
Scores: (min: 8, avg: 11.97, max: 51)

Run: 622, exploration: 0.01, score: 10
Scores: (min: 8, avg: 11.97, max: 51)

Run: 623, exploration: 0.01, score: 9
Scores: (min: 8, avg: 11.94, max: 51)

Run: 624, exploration: 0.01, score: 12
Scores: (min: 8, avg: 11.96, max: 51)

Run: 625, exploration: 0.01, score: 9
Scores: (min: 8, avg: 11.95, max: 51)

Run: 626, exploration: 0.01, score: 8
Scores: (min: 8, avg: 11.78, max: 51)

Run: 627, exploration: 0.01, score: 10
Scores: (min: 8, avg: 11.79, max: 51)

Run: 628, exploration: 0.01, score: 9
Scores: (min: 8, avg: 11.78, max: 51)

Run: 629, exploration: 0.01, score: 10
Scores: (min: 8, avg: 11.68, max:

Run: 723, exploration: 0.01, score: 9
Scores: (min: 8, avg: 11.45, max: 29)

Run: 724, exploration: 0.01, score: 8
Scores: (min: 8, avg: 11.41, max: 29)

Run: 725, exploration: 0.01, score: 11
Scores: (min: 8, avg: 11.43, max: 29)

Run: 726, exploration: 0.01, score: 10
Scores: (min: 8, avg: 11.45, max: 29)

Run: 727, exploration: 0.01, score: 24
Scores: (min: 8, avg: 11.59, max: 29)

Run: 728, exploration: 0.01, score: 15
Scores: (min: 8, avg: 11.65, max: 29)

Run: 729, exploration: 0.01, score: 17
Scores: (min: 8, avg: 11.72, max: 29)

Run: 730, exploration: 0.01, score: 15
Scores: (min: 8, avg: 11.78, max: 29)

Run: 731, exploration: 0.01, score: 14
Scores: (min: 8, avg: 11.72, max: 29)

Run: 732, exploration: 0.01, score: 8
Scores: (min: 8, avg: 11.71, max: 29)

Run: 733, exploration: 0.01, score: 9
Scores: (min: 8, avg: 11.7, max: 29)

Run: 734, exploration: 0.01, score: 15
Scores: (min: 8, avg: 11.72, max: 29)

Run: 735, exploration: 0.01, score: 10
Scores: (min: 8, avg: 11.59, m

Run: 829, exploration: 0.01, score: 8
Scores: (min: 8, avg: 11.35, max: 29)

Run: 830, exploration: 0.01, score: 16
Scores: (min: 8, avg: 11.36, max: 29)

Run: 831, exploration: 0.01, score: 9
Scores: (min: 8, avg: 11.31, max: 29)

Run: 832, exploration: 0.01, score: 8
Scores: (min: 8, avg: 11.31, max: 29)

Run: 833, exploration: 0.01, score: 11
Scores: (min: 8, avg: 11.33, max: 29)

Run: 834, exploration: 0.01, score: 15
Scores: (min: 8, avg: 11.33, max: 29)

Run: 835, exploration: 0.01, score: 9
Scores: (min: 8, avg: 11.32, max: 29)

Run: 836, exploration: 0.01, score: 8
Scores: (min: 8, avg: 11.13, max: 29)

Run: 837, exploration: 0.01, score: 10
Scores: (min: 8, avg: 11.14, max: 29)

Run: 838, exploration: 0.01, score: 9
Scores: (min: 8, avg: 11.15, max: 29)

Run: 839, exploration: 0.01, score: 10
Scores: (min: 8, avg: 11.03, max: 29)

Run: 840, exploration: 0.01, score: 9
Scores: (min: 8, avg: 10.98, max: 29)

Run: 841, exploration: 0.01, score: 9
Scores: (min: 8, avg: 10.97, max:

Run: 935, exploration: 0.01, score: 8
Scores: (min: 8, avg: 11.46, max: 30)

Run: 936, exploration: 0.01, score: 12
Scores: (min: 8, avg: 11.5, max: 30)

Run: 937, exploration: 0.01, score: 17
Scores: (min: 8, avg: 11.57, max: 30)

Run: 938, exploration: 0.01, score: 34
Scores: (min: 8, avg: 11.82, max: 34)

Run: 939, exploration: 0.01, score: 8
Scores: (min: 8, avg: 11.8, max: 34)

Run: 940, exploration: 0.01, score: 8
Scores: (min: 8, avg: 11.79, max: 34)

Run: 941, exploration: 0.01, score: 14
Scores: (min: 8, avg: 11.84, max: 34)

Run: 942, exploration: 0.01, score: 10
Scores: (min: 8, avg: 11.84, max: 34)

Run: 943, exploration: 0.01, score: 9
Scores: (min: 8, avg: 11.83, max: 34)

Run: 944, exploration: 0.01, score: 12
Scores: (min: 8, avg: 11.84, max: 34)

Run: 945, exploration: 0.01, score: 33
Scores: (min: 8, avg: 12.05, max: 34)

Run: 946, exploration: 0.01, score: 9
Scores: (min: 8, avg: 12.05, max: 34)

Run: 947, exploration: 0.01, score: 9
Scores: (min: 8, avg: 11.95, max:

Run: 1041, exploration: 0.01, score: 16
Scores: (min: 8, avg: 11.99, max: 42)

Run: 1042, exploration: 0.01, score: 8
Scores: (min: 8, avg: 11.97, max: 42)

Run: 1043, exploration: 0.01, score: 9
Scores: (min: 8, avg: 11.97, max: 42)

Run: 1044, exploration: 0.01, score: 9
Scores: (min: 8, avg: 11.94, max: 42)

Run: 1045, exploration: 0.01, score: 10
Scores: (min: 8, avg: 11.71, max: 42)

Run: 1046, exploration: 0.01, score: 21
Scores: (min: 8, avg: 11.83, max: 42)

Run: 1047, exploration: 0.01, score: 10
Scores: (min: 8, avg: 11.84, max: 42)

Run: 1048, exploration: 0.01, score: 9
Scores: (min: 8, avg: 11.83, max: 42)

Run: 1049, exploration: 0.01, score: 21
Scores: (min: 8, avg: 11.91, max: 42)

Run: 1050, exploration: 0.01, score: 11
Scores: (min: 8, avg: 11.92, max: 42)

Run: 1051, exploration: 0.01, score: 14
Scores: (min: 8, avg: 11.97, max: 42)

Run: 1052, exploration: 0.01, score: 13
Scores: (min: 8, avg: 12.01, max: 42)

Run: 1053, exploration: 0.01, score: 15
Scores: (min: 8,

Run: 1146, exploration: 0.01, score: 13
Scores: (min: 8, avg: 11.57, max: 28)

Run: 1147, exploration: 0.01, score: 10
Scores: (min: 8, avg: 11.57, max: 28)

Run: 1148, exploration: 0.01, score: 8
Scores: (min: 8, avg: 11.56, max: 28)

Run: 1149, exploration: 0.01, score: 16
Scores: (min: 8, avg: 11.51, max: 28)

Run: 1150, exploration: 0.01, score: 22
Scores: (min: 8, avg: 11.62, max: 28)

Run: 1151, exploration: 0.01, score: 10
Scores: (min: 8, avg: 11.58, max: 28)

Run: 1152, exploration: 0.01, score: 10
Scores: (min: 8, avg: 11.55, max: 28)

Run: 1153, exploration: 0.01, score: 10
Scores: (min: 8, avg: 11.5, max: 28)

Run: 1154, exploration: 0.01, score: 13
Scores: (min: 8, avg: 11.53, max: 28)

Run: 1155, exploration: 0.01, score: 23
Scores: (min: 8, avg: 11.65, max: 28)

Run: 1156, exploration: 0.01, score: 10
Scores: (min: 8, avg: 11.66, max: 28)

Run: 1157, exploration: 0.01, score: 9
Scores: (min: 8, avg: 11.65, max: 28)

Run: 1158, exploration: 0.01, score: 10
Scores: (min: 8

Run: 1251, exploration: 0.01, score: 10
Scores: (min: 8, avg: 12.43, max: 40)

Run: 1252, exploration: 0.01, score: 8
Scores: (min: 8, avg: 12.41, max: 40)

Run: 1253, exploration: 0.01, score: 8
Scores: (min: 8, avg: 12.39, max: 40)

Run: 1254, exploration: 0.01, score: 9
Scores: (min: 8, avg: 12.35, max: 40)

Run: 1255, exploration: 0.01, score: 24
Scores: (min: 8, avg: 12.36, max: 40)

Run: 1256, exploration: 0.01, score: 24
Scores: (min: 8, avg: 12.5, max: 40)

Run: 1257, exploration: 0.01, score: 38
Scores: (min: 8, avg: 12.79, max: 40)

Run: 1258, exploration: 0.01, score: 12
Scores: (min: 8, avg: 12.81, max: 40)

Run: 1259, exploration: 0.01, score: 10
Scores: (min: 8, avg: 12.78, max: 40)

Run: 1260, exploration: 0.01, score: 22
Scores: (min: 8, avg: 12.9, max: 40)

Run: 1261, exploration: 0.01, score: 9
Scores: (min: 8, avg: 12.9, max: 40)

Run: 1262, exploration: 0.01, score: 9
Scores: (min: 8, avg: 12.89, max: 40)

Run: 1263, exploration: 0.01, score: 9
Scores: (min: 8, avg:

Run: 1356, exploration: 0.01, score: 10
Scores: (min: 8, avg: 12.42, max: 51)

Run: 1357, exploration: 0.01, score: 10
Scores: (min: 8, avg: 12.14, max: 51)

Run: 1358, exploration: 0.01, score: 10
Scores: (min: 8, avg: 12.12, max: 51)

Run: 1359, exploration: 0.01, score: 40
Scores: (min: 8, avg: 12.42, max: 51)

Run: 1360, exploration: 0.01, score: 10
Scores: (min: 8, avg: 12.3, max: 51)

Run: 1361, exploration: 0.01, score: 10
Scores: (min: 8, avg: 12.31, max: 51)

Run: 1362, exploration: 0.01, score: 10
Scores: (min: 8, avg: 12.32, max: 51)

Run: 1363, exploration: 0.01, score: 10
Scores: (min: 8, avg: 12.33, max: 51)

Run: 1364, exploration: 0.01, score: 10
Scores: (min: 8, avg: 12.34, max: 51)

Run: 1365, exploration: 0.01, score: 10
Scores: (min: 8, avg: 12.33, max: 51)

Run: 1366, exploration: 0.01, score: 9
Scores: (min: 8, avg: 12.26, max: 51)

Run: 1367, exploration: 0.01, score: 9
Scores: (min: 8, avg: 12.25, max: 51)

Run: 1368, exploration: 0.01, score: 9
Scores: (min: 8,

Run: 1461, exploration: 0.01, score: 10
Scores: (min: 8, avg: 11.11, max: 35)

Run: 1462, exploration: 0.01, score: 10
Scores: (min: 8, avg: 11.11, max: 35)

Run: 1463, exploration: 0.01, score: 9
Scores: (min: 8, avg: 11.1, max: 35)

Run: 1464, exploration: 0.01, score: 10
Scores: (min: 8, avg: 11.1, max: 35)

Run: 1465, exploration: 0.01, score: 13
Scores: (min: 8, avg: 11.13, max: 35)

Run: 1466, exploration: 0.01, score: 10
Scores: (min: 8, avg: 11.14, max: 35)

Run: 1467, exploration: 0.01, score: 10
Scores: (min: 8, avg: 11.15, max: 35)

Run: 1468, exploration: 0.01, score: 11
Scores: (min: 8, avg: 11.17, max: 35)

Run: 1469, exploration: 0.01, score: 8
Scores: (min: 8, avg: 11.13, max: 35)

Run: 1470, exploration: 0.01, score: 11
Scores: (min: 8, avg: 11.02, max: 35)

Run: 1471, exploration: 0.01, score: 9
Scores: (min: 8, avg: 10.87, max: 35)

Run: 1472, exploration: 0.01, score: 9
Scores: (min: 8, avg: 10.87, max: 35)

Run: 1473, exploration: 0.01, score: 10
Scores: (min: 8, a

KeyboardInterrupt: 

In [None]:
import random  
import gym  
import numpy as np  
from collections import deque  
from keras.models import Sequential  
from keras.layers import Dense  
from keras.optimizers import Adam  
  
  
from scores.score_logger import ScoreLogger  
  
ENV_NAME = "CartPole-v1"  
  
GAMMA = 0.95  
LEARNING_RATE = 0.001  
  
MEMORY_SIZE = 1000000  
BATCH_SIZE = 20  
  
# Exploration decay lowered from 0.995 to 0.605
EXPLORATION_MAX = 1.0  
EXPLORATION_MIN = 0.01  
EXPLORATION_DECAY = 0.605  
  
  
class DQNSolver:  
  
    def __init__(self, observation_space, action_space):  
        self.exploration_rate = EXPLORATION_MAX  
  
        self.action_space = action_space  
        self.memory = deque(maxlen=MEMORY_SIZE)  
  
        self.model = Sequential()  
        self.model.add(Dense(24, input_shape=(observation_space,), activation="relu"))  
        self.model.add(Dense(24, activation="relu"))  
        self.model.add(Dense(self.action_space, activation="linear"))  
        self.model.compile(loss="mse", optimizer=Adam(lr=LEARNING_RATE))  
  
    def remember(self, state, action, reward, next_state, done):  
        self.memory.append((state, action, reward, next_state, done))  
  
    def act(self, state):  
        if np.random.rand() < self.exploration_rate:  
            return random.randrange(self.action_space)  
        q_values = self.model.predict(state)  
        return np.argmax(q_values[0])  
  
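    def experience_replay(self):
        # Wait until there are enough stored transitions to sample a batch.
        if len(self.memory) < BATCH_SIZE:
            return
        batch = random.sample(self.memory, BATCH_SIZE)
        for state, action, reward, state_next, terminal in batch:
            # Target is the reward alone on terminal steps, otherwise the
            # reward plus the discounted best Q-value of the next state.
            q_update = reward
            if not terminal:
                q_update = (reward + GAMMA * np.amax(self.model.predict(state_next)[0]))
            # Overwrite only the taken action's Q-value, then fit on that target.
            q_values = self.model.predict(state)
            q_values[0][action] = q_update
            self.model.fit(state, q_values, verbose=0)
        # Decay exploration once per replay call, never below the minimum.
        self.exploration_rate *= EXPLORATION_DECAY
        self.exploration_rate = max(EXPLORATION_MIN, self.exploration_rate)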
  
  
def cartpole():  
    env = gym.make(ENV_NAME)  
    score_logger = ScoreLogger(ENV_NAME)  
    observation_space = env.observation_space.shape[0]  
    action_space = env.action_space.n  
    dqn_solver = DQNSolver(observation_space, action_space)  
    run = 0  
    while True:  
        run += 1  
        state = env.reset()  
        state = np.reshape(state, [1, observation_space])  
        step = 0  
        while True:  
            step += 1  
            #env.render()  
            action = dqn_solver.act(state)  
            state_next, reward, terminal, info = env.step(action)  
            reward = reward if not terminal else -reward  
            state_next = np.reshape(state_next, [1, observation_space])  
            dqn_solver.remember(state, action, reward, state_next, terminal)  
            state = state_next  
            if terminal:  
                print(f"Run: {run}, exploration: {dqn_solver.exploration_rate}, score: {step}")
                score_logger.add_score(step, run)  
                break  
            dqn_solver.experience_replay()  



In [6]:
cartpole()

Run: 1, exploration: 0.9752487531218751, score: 25
Scores: (min: 25, avg: 25, max: 25)

Run: 2, exploration: 0.9275689688183278, score: 11
Scores: (min: 11, avg: 18, max: 25)

Run: 3, exploration: 0.8142285204175609, score: 27
Scores: (min: 11, avg: 21, max: 27)

Run: 4, exploration: 0.736559652908221, score: 21
Scores: (min: 11, avg: 21, max: 27)

Run: 5, exploration: 0.6935613678313175, score: 13
Scores: (min: 11, avg: 19.4, max: 27)

Run: 6, exploration: 0.6629680834613705, score: 10
Scores: (min: 10, avg: 17.833333333333332, max: 27)

Run: 7, exploration: 0.6088145090359074, score: 18
Scores: (min: 10, avg: 17.857142857142858, max: 27)

Run: 8, exploration: 0.5590843898207511, score: 18
Scores: (min: 10, avg: 17.875, max: 27)

Run: 9, exploration: 0.5238143793828016, score: 14
Scores: (min: 10, avg: 17.444444444444443, max: 27)

Run: 10, exploration: 0.4982051627146237, score: 11
Scores: (min: 10, avg: 16.8, max: 27)

Run: 11, exploration: 0.46677573701590436, score: 14
Scores: (mi

Run: 86, exploration: 0.01, score: 66
Scores: (min: 8, avg: 26.930232558139537, max: 109)

Run: 87, exploration: 0.01, score: 108
Scores: (min: 8, avg: 27.862068965517242, max: 109)

Run: 88, exploration: 0.01, score: 79
Scores: (min: 8, avg: 28.443181818181817, max: 109)

Run: 89, exploration: 0.01, score: 73
Scores: (min: 8, avg: 28.9438202247191, max: 109)

Run: 90, exploration: 0.01, score: 75
Scores: (min: 8, avg: 29.455555555555556, max: 109)

Run: 91, exploration: 0.01, score: 15
Scores: (min: 8, avg: 29.296703296703296, max: 109)

Run: 92, exploration: 0.01, score: 32
Scores: (min: 8, avg: 29.32608695652174, max: 109)

Run: 93, exploration: 0.01, score: 27
Scores: (min: 8, avg: 29.301075268817204, max: 109)

Run: 94, exploration: 0.01, score: 20
Scores: (min: 8, avg: 29.20212765957447, max: 109)

Run: 95, exploration: 0.01, score: 126
Scores: (min: 8, avg: 30.221052631578946, max: 126)

Run: 96, exploration: 0.01, score: 90
Scores: (min: 8, avg: 30.84375, max: 126)

Run: 97, ex

Run: 189, exploration: 0.01, score: 9
Scores: (min: 8, avg: 15.01, max: 126)

Run: 190, exploration: 0.01, score: 9
Scores: (min: 8, avg: 14.35, max: 126)

Run: 191, exploration: 0.01, score: 11
Scores: (min: 8, avg: 14.31, max: 126)

Run: 192, exploration: 0.01, score: 10
Scores: (min: 8, avg: 14.09, max: 126)

Run: 193, exploration: 0.01, score: 11
Scores: (min: 8, avg: 13.93, max: 126)

Run: 194, exploration: 0.01, score: 9
Scores: (min: 8, avg: 13.82, max: 126)

Run: 195, exploration: 0.01, score: 10
Scores: (min: 8, avg: 12.66, max: 99)

Run: 196, exploration: 0.01, score: 10
Scores: (min: 8, avg: 11.86, max: 99)

Run: 197, exploration: 0.01, score: 9
Scores: (min: 8, avg: 10.96, max: 23)

Run: 198, exploration: 0.01, score: 9
Scores: (min: 8, avg: 10.96, max: 23)

Run: 199, exploration: 0.01, score: 10
Scores: (min: 8, avg: 10.97, max: 23)

Run: 200, exploration: 0.01, score: 10
Scores: (min: 8, avg: 10.98, max: 23)

Run: 201, exploration: 0.01, score: 9
Scores: (min: 8, avg: 10.

Run: 296, exploration: 0.01, score: 8
Scores: (min: 8, avg: 10.08, max: 34)

Run: 297, exploration: 0.01, score: 9
Scores: (min: 8, avg: 10.08, max: 34)

Run: 298, exploration: 0.01, score: 9
Scores: (min: 8, avg: 10.08, max: 34)

Run: 299, exploration: 0.01, score: 9
Scores: (min: 8, avg: 10.07, max: 34)

Run: 300, exploration: 0.01, score: 10
Scores: (min: 8, avg: 10.07, max: 34)

Run: 301, exploration: 0.01, score: 10
Scores: (min: 8, avg: 10.08, max: 34)

Run: 302, exploration: 0.01, score: 17
Scores: (min: 8, avg: 10.16, max: 34)

Run: 303, exploration: 0.01, score: 12
Scores: (min: 8, avg: 10.19, max: 34)

Run: 304, exploration: 0.01, score: 9
Scores: (min: 8, avg: 10.18, max: 34)

Run: 305, exploration: 0.01, score: 11
Scores: (min: 8, avg: 10.19, max: 34)

Run: 306, exploration: 0.01, score: 13
Scores: (min: 8, avg: 10.21, max: 34)

Run: 307, exploration: 0.01, score: 13
Scores: (min: 8, avg: 10.25, max: 34)

Run: 308, exploration: 0.01, score: 9
Scores: (min: 8, avg: 10.22, ma

KeyboardInterrupt: 

CartPole is a simple game in which you control a cart (a block) that balances a pole vertically on top of it. The goal is to keep the pole upright for as long as possible by pushing the cart left or right. We can apply a reinforcement learning algorithm to this game, and the output above shows how the results played out. The agent learns from experience how to balance the pole, and its score for an episode is the number of time steps the pole stays up. At each step the agent observes a state made up of four values (cart position, cart velocity, pole angle, and pole angular velocity) and chooses one of two actions: push left or push right.

The algorithm is deep Q-learning (DQN), a Q-learning method in which a small neural network estimates the value of each action in a given state. It relies on experience replay: every transition is stored in memory, and random batches of past transitions are sampled to train the network. The algorithm also uses what is called a discount factor (gamma), which tells it how much weight to give future rewards relative to the immediate reward. A separate exploration rate controls how often the agent picks a random action instead of the action it currently believes is best, and this rate decays over time so the agent guesses less as it learns. Because some actions are still random and the network keeps changing, results can be better or worse from one iteration to the next. Since each transition is saved and remembered, the longer the algorithm runs, the more memory it takes, up to the MEMORY_SIZE limit. This is part of why using a neural network is so efficient: instead of storing a value for every possible state, which is not feasible when the state values are continuous, it generalizes from the sampled experience.
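
The role of the discount factor can be seen in isolation with a small sketch. It mirrors the target computed inside `experience_replay`, but the helper name `q_target` and the example Q-values are made up for illustration and are not part of the notebook.

```python
GAMMA = 0.95  # discount factor, same value as the notebook's default


def q_target(reward, next_q_values, terminal, gamma=GAMMA):
    """Return the training target for the action that was taken (hypothetical helper)."""
    if terminal:
        # No future reward is possible once the episode has ended.
        return reward
    # Immediate reward plus the discounted best estimate for the next state.
    return reward + gamma * max(next_q_values)


# Made-up Q-value estimates for the two actions (push left, push right).
next_q = [1.0, 2.0]

print(q_target(1.0, next_q, terminal=False))             # gamma = 0.95
print(q_target(1.0, next_q, terminal=False, gamma=0.5))  # gamma = 0.5
print(q_target(-1.0, next_q, terminal=True))             # terminal step
```

With a smaller gamma, the estimated future value contributes less to the target, so the agent is trained to care mostly about the immediate reward.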

As you can see from the first run, it finished after 303 total runs and was reported as solved in 203. This is relatively fast, and it reached the highest result quickly. In the next block, I changed the learning rate from 0.001 to 0.01. This made the algorithm run much slower, and I eventually had to exit the program manually. The next change I made was to the gamma value, which I lowered from 0.95 to 0.5. This time each iteration ran much faster, but the results did not vary much. I manually stopped the program after about 1,500 iterations because the results had barely changed. For the last change, I lowered the exploration decay from 0.995 to 0.605. Again, each iteration ran much faster, and this time the scores changed at a better pace. The results did vary more, however. For example, one iteration had a score of 17, and the very next one had a score of 155. After about 200 iterations, the algorithm seemed to get stuck, producing only scores around 10, so I stopped it.
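
To get a feel for how different the two exploration decay values are, the sketch below counts how many replay steps it takes for the exploration rate to fall from its maximum to its minimum. It assumes the same multiply-then-clamp update used in `experience_replay`. The function name `steps_to_floor` is hypothetical.

```python
EXPLORATION_MAX = 1.0
EXPLORATION_MIN = 0.01


def steps_to_floor(decay, start=EXPLORATION_MAX, floor=EXPLORATION_MIN):
    """Count replay steps until the exploration rate is clamped at the floor."""
    rate = start
    steps = 0
    while rate > floor:
        # Same update as the notebook: multiply by the decay, then clamp.
        rate = max(floor, rate * decay)
        steps += 1
    return steps


for decay in (0.995, 0.605):
    print(f"decay {decay}: {steps_to_floor(decay)} replay steps to reach {EXPLORATION_MIN}")
```

Under these assumptions the loop reports 919 steps for a decay of 0.995 and 10 steps for 0.605, so with the lower value the agent stops exploring almost as soon as replay begins.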

Sources:
Surma, G. (2021, October 13). Cartpole - Introduction to Reinforcement Learning (DQN - Deep Q-Learning). Medium. https://gsurma.medium.com/cartpole-introduction-to-reinforcement-learning-ed0eb5b58288