### Lawrence Artl
#### CS370 22EW1
#### Assignment 5 - Cartpole Problem with DQN
#### September 24, 2022

# Module Five Assignment: Cartpole Problem
Review the code in this notebook and in the score_logger.py file in the *scores* folder (directory). Once you have reviewed the code, return to this notebook and select **Cell** and then **Run All** from the menu bar to run this code. The code takes several minutes to run.

### Standard Run
(Below)
Solved in 134 runs, 234 total runs.

In [1]:
import random  
import gym  
import numpy as np  
from collections import deque  
from keras.models import Sequential  
from keras.layers import Dense  
from keras.optimizers import Adam  
  
  
from scores.score_logger import ScoreLogger  
  
ENV_NAME = "CartPole-v1"  
  
GAMMA = 0.95                # discount factor applied to future rewards
LEARNING_RATE = 0.001       # how much the NN learns each iteration
  
MEMORY_SIZE = 1000000  
BATCH_SIZE = 20             # how much memory is used to learn
  
EXPLORATION_MAX = 1.0       # rate at which agent randomly decides actions
                            # as opposed to predictions
EXPLORATION_MIN = 0.1      # explore at least this amount
EXPLORATION_DECAY = 0.995   # decrease number of explorations over time as
                            # agent ability improves
  
  
class DQNSolver:  
  
    def __init__(self, observation_space, action_space):  
        self.exploration_rate = EXPLORATION_MAX  
  
        self.action_space = action_space  
        self.memory = deque(maxlen=MEMORY_SIZE)  
  
        self.model = Sequential()  
        self.model.add(Dense(24, input_shape=(observation_space,), activation="relu"))  
        self.model.add(Dense(24, activation="relu"))  
        self.model.add(Dense(self.action_space, activation="linear"))  
        self.model.compile(loss="mse", optimizer=Adam(lr=LEARNING_RATE))  
  
    def remember(self, state, action, reward, next_state, done):  
        self.memory.append((state, action, reward, next_state, done))  
  
    def act(self, state):  
        if np.random.rand() < self.exploration_rate:  
            return random.randrange(self.action_space)  
        q_values = self.model.predict(state)  
        return np.argmax(q_values[0])  
  
    def experience_replay(self):  
        if len(self.memory) < BATCH_SIZE:  
            return  
        batch = random.sample(self.memory, BATCH_SIZE)  
        for state, action, reward, state_next, terminal in batch:  
            q_update = reward  
            if not terminal:  
                q_update = (reward + GAMMA * np.amax(self.model.predict(state_next)[0]))  
            q_values = self.model.predict(state)  
            q_values[0][action] = q_update  
            self.model.fit(state, q_values, verbose=0)  
        self.exploration_rate *= EXPLORATION_DECAY  
        self.exploration_rate = max(EXPLORATION_MIN, self.exploration_rate)  
  
  
def cartpole():  
    env = gym.make(ENV_NAME)  
    score_logger = ScoreLogger(ENV_NAME)  
    observation_space = env.observation_space.shape[0]  
    action_space = env.action_space.n  
    dqn_solver = DQNSolver(observation_space, action_space)  
    run = 0  
    while True:  
        run += 1  
        state = env.reset()  
        state = np.reshape(state, [1, observation_space])  
        step = 0  
        while True:  
            step += 1  
            #env.render()  
            action = dqn_solver.act(state)  
            state_next, reward, terminal, info = env.step(action)  
            reward = reward if not terminal else -reward  
            state_next = np.reshape(state_next, [1, observation_space])  
            dqn_solver.remember(state, action, reward, state_next, terminal)  
            state = state_next  
            if terminal:  
                print ("Run: " + str(run) + ", exploration: " + str(dqn_solver.exploration_rate) + ", score: " + str(step))  
                score_logger.add_score(step, run)  
                break  
            dqn_solver.experience_replay()  



Using TensorFlow backend.


In [2]:
cartpole()

Run: 1, exploration: 1.0, score: 16
Scores: (min: 16, avg: 16, max: 16)

Run: 2, exploration: 0.9275689688183278, score: 19
Scores: (min: 16, avg: 17.5, max: 19)

Run: 3, exploration: 0.7822236754458713, score: 35
Scores: (min: 16, avg: 23.333333333333332, max: 35)

Run: 4, exploration: 0.7076077347272662, score: 21
Scores: (min: 16, avg: 22.75, max: 35)

Run: 5, exploration: 0.6629680834613705, score: 14
Scores: (min: 14, avg: 21, max: 35)

Run: 6, exploration: 0.6337242817644086, score: 10
Scores: (min: 10, avg: 19.166666666666668, max: 35)

Run: 7, exploration: 0.5535075230322891, score: 28
Scores: (min: 10, avg: 20.428571428571427, max: 35)

Run: 8, exploration: 0.5159963842937159, score: 15
Scores: (min: 10, avg: 19.75, max: 35)

Run: 9, exploration: 0.49571413690105054, score: 9
Scores: (min: 9, avg: 18.555555555555557, max: 35)

Run: 10, exploration: 0.46912134373457726, score: 12
Scores: (min: 9, avg: 17.9, max: 35)

Run: 11, exploration: 0.43952667968844233, score: 14
Scores: 

Run: 91, exploration: 0.1, score: 13
Scores: (min: 8, avg: 11.351648351648352, max: 35)

Run: 92, exploration: 0.1, score: 12
Scores: (min: 8, avg: 11.358695652173912, max: 35)

Run: 93, exploration: 0.1, score: 12
Scores: (min: 8, avg: 11.365591397849462, max: 35)

Run: 94, exploration: 0.1, score: 10
Scores: (min: 8, avg: 11.351063829787234, max: 35)

Run: 95, exploration: 0.1, score: 13
Scores: (min: 8, avg: 11.368421052631579, max: 35)

Run: 96, exploration: 0.1, score: 10
Scores: (min: 8, avg: 11.354166666666666, max: 35)

Run: 97, exploration: 0.1, score: 9
Scores: (min: 8, avg: 11.329896907216495, max: 35)

Run: 98, exploration: 0.1, score: 10
Scores: (min: 8, avg: 11.316326530612244, max: 35)

Run: 99, exploration: 0.1, score: 14
Scores: (min: 8, avg: 11.343434343434344, max: 35)

Run: 100, exploration: 0.1, score: 17
Scores: (min: 8, avg: 11.4, max: 35)

Run: 101, exploration: 0.1, score: 22
Scores: (min: 8, avg: 11.46, max: 35)

Run: 102, exploration: 0.1, score: 16
Scores: (

Run: 195, exploration: 0.1, score: 186
Scores: (min: 9, avg: 117.53, max: 414)

Run: 196, exploration: 0.1, score: 224
Scores: (min: 9, avg: 119.67, max: 414)

Run: 197, exploration: 0.1, score: 221
Scores: (min: 10, avg: 121.79, max: 414)

Run: 198, exploration: 0.1, score: 243
Scores: (min: 10, avg: 124.12, max: 414)

Run: 199, exploration: 0.1, score: 248
Scores: (min: 10, avg: 126.46, max: 414)

Run: 200, exploration: 0.1, score: 163
Scores: (min: 10, avg: 127.92, max: 414)

Run: 201, exploration: 0.1, score: 285
Scores: (min: 10, avg: 130.55, max: 414)

Run: 202, exploration: 0.1, score: 237
Scores: (min: 10, avg: 132.76, max: 414)

Run: 203, exploration: 0.1, score: 231
Scores: (min: 10, avg: 134.83, max: 414)

Run: 204, exploration: 0.1, score: 260
Scores: (min: 10, avg: 137.22, max: 414)

Run: 205, exploration: 0.1, score: 195
Scores: (min: 10, avg: 139.01, max: 414)

Run: 206, exploration: 0.1, score: 308
Scores: (min: 10, avg: 141.96, max: 414)

Run: 207, exploration: 0.1, sc

NameError: name 'exit' is not defined

### Modified Run 1
In this run we changed `LEARNING_RATE` from `0.001` to `0.01` and saw a significant increase in the number of runs needed. A keyboard interrupt was used to keep the output short, but in some tests the run count exceeded 700. With a target average score of **195**, this configuration could easily take 1000 or more runs.

In [1]:
import random  
import gym  
import numpy as np  
from collections import deque  
from keras.models import Sequential  
from keras.layers import Dense  
from keras.optimizers import Adam  
  
  
from scores.score_logger import ScoreLogger  
  
ENV_NAME = "CartPole-v1"  
  
GAMMA = 0.95                # discount factor applied to future rewards
LEARNING_RATE = 0.01        # how much the NN learns each iteration
  
MEMORY_SIZE = 1000000  
BATCH_SIZE = 20             # how much memory is used to learn
  
EXPLORATION_MAX = 1.0       # rate at which agent randomly decides actions
                            # as opposed to predictions
EXPLORATION_MIN = 0.1      # explore at least this amount
EXPLORATION_DECAY = 0.995   # decrease number of explorations over time as
                            # agent ability improves
  
  
class DQNSolver:  
  
    def __init__(self, observation_space, action_space):  
        self.exploration_rate = EXPLORATION_MAX  
  
        self.action_space = action_space  
        self.memory = deque(maxlen=MEMORY_SIZE)  
  
        self.model = Sequential()  
        self.model.add(Dense(24, input_shape=(observation_space,), activation="relu"))  
        self.model.add(Dense(24, activation="relu"))  
        self.model.add(Dense(self.action_space, activation="linear"))  
        self.model.compile(loss="mse", optimizer=Adam(lr=LEARNING_RATE))  
  
    def remember(self, state, action, reward, next_state, done):  
        self.memory.append((state, action, reward, next_state, done))  
  
    def act(self, state):  
        if np.random.rand() < self.exploration_rate:  
            return random.randrange(self.action_space)  
        q_values = self.model.predict(state)  
        return np.argmax(q_values[0])  
  
    def experience_replay(self):  
        if len(self.memory) < BATCH_SIZE:  
            return  
        batch = random.sample(self.memory, BATCH_SIZE)  
        for state, action, reward, state_next, terminal in batch:  
            q_update = reward  
            if not terminal:  
                q_update = (reward + GAMMA * np.amax(self.model.predict(state_next)[0]))  
            q_values = self.model.predict(state)  
            q_values[0][action] = q_update  
            self.model.fit(state, q_values, verbose=0)  
        self.exploration_rate *= EXPLORATION_DECAY  
        self.exploration_rate = max(EXPLORATION_MIN, self.exploration_rate)  
  
  
def cartpole():  
    env = gym.make(ENV_NAME)  
    score_logger = ScoreLogger(ENV_NAME)  
    observation_space = env.observation_space.shape[0]  
    action_space = env.action_space.n  
    dqn_solver = DQNSolver(observation_space, action_space)  
    run = 0  
    while True:  
        run += 1  
        state = env.reset()  
        state = np.reshape(state, [1, observation_space])  
        step = 0  
        while True:  
            step += 1  
            #env.render()  
            action = dqn_solver.act(state)  
            state_next, reward, terminal, info = env.step(action)  
            reward = reward if not terminal else -reward  
            state_next = np.reshape(state_next, [1, observation_space])  
            dqn_solver.remember(state, action, reward, state_next, terminal)  
            state = state_next  
            if terminal:  
                print ("Run: " + str(run) + ", exploration: " + str(dqn_solver.exploration_rate) + ", score: " + str(step))  
                score_logger.add_score(step, run)  
                break  
            dqn_solver.experience_replay()  



In [5]:
cartpole()

Run: 1, exploration: 1.0, score: 12
Scores: (min: 12, avg: 12, max: 12)

Run: 2, exploration: 0.9322301194154049, score: 22
Scores: (min: 12, avg: 17, max: 22)

Run: 3, exploration: 0.8690529955452602, score: 15
Scores: (min: 12, avg: 16.333333333333332, max: 22)

Run: 4, exploration: 0.8224322824348486, score: 12
Scores: (min: 12, avg: 15.25, max: 22)

Run: 5, exploration: 0.6900935609921609, score: 36
Scores: (min: 12, avg: 19.4, max: 36)

Run: 6, exploration: 0.6337242817644086, score: 18
Scores: (min: 12, avg: 19.166666666666668, max: 36)

Run: 7, exploration: 0.5848838636585911, score: 17
Scores: (min: 12, avg: 18.857142857142858, max: 36)

Run: 8, exploration: 0.5618938591163328, score: 9
Scores: (min: 9, avg: 17.625, max: 36)

Run: 9, exploration: 0.531750826943791, score: 12
Scores: (min: 9, avg: 17, max: 36)

Run: 10, exploration: 0.5082950737585841, score: 10
Scores: (min: 9, avg: 16.3, max: 36)

Run: 11, exploration: 0.47862223409330756, score: 13
Scores: (min: 9, avg: 16, m

Run: 91, exploration: 0.1, score: 11
Scores: (min: 8, avg: 20.395604395604394, max: 97)

Run: 92, exploration: 0.1, score: 11
Scores: (min: 8, avg: 20.293478260869566, max: 97)

Run: 93, exploration: 0.1, score: 10
Scores: (min: 8, avg: 20.182795698924732, max: 97)

Run: 94, exploration: 0.1, score: 14
Scores: (min: 8, avg: 20.117021276595743, max: 97)

Run: 95, exploration: 0.1, score: 40
Scores: (min: 8, avg: 20.326315789473686, max: 97)

Run: 96, exploration: 0.1, score: 29
Scores: (min: 8, avg: 20.416666666666668, max: 97)

Run: 97, exploration: 0.1, score: 27
Scores: (min: 8, avg: 20.484536082474225, max: 97)

Run: 98, exploration: 0.1, score: 66
Scores: (min: 8, avg: 20.948979591836736, max: 97)

Run: 99, exploration: 0.1, score: 32
Scores: (min: 8, avg: 21.060606060606062, max: 97)

Run: 100, exploration: 0.1, score: 25
Scores: (min: 8, avg: 21.1, max: 97)

Run: 101, exploration: 0.1, score: 41
Scores: (min: 8, avg: 21.39, max: 97)

Run: 102, exploration: 0.1, score: 29
Scores: 

Run: 196, exploration: 0.1, score: 14
Scores: (min: 8, avg: 39.47, max: 325)

Run: 197, exploration: 0.1, score: 74
Scores: (min: 8, avg: 39.94, max: 325)

Run: 198, exploration: 0.1, score: 12
Scores: (min: 8, avg: 39.4, max: 325)

Run: 199, exploration: 0.1, score: 9
Scores: (min: 8, avg: 39.17, max: 325)

Run: 200, exploration: 0.1, score: 10
Scores: (min: 8, avg: 39.02, max: 325)

Run: 201, exploration: 0.1, score: 10
Scores: (min: 8, avg: 38.71, max: 325)

Run: 202, exploration: 0.1, score: 11
Scores: (min: 8, avg: 38.53, max: 325)

Run: 203, exploration: 0.1, score: 9
Scores: (min: 8, avg: 38.28, max: 325)

Run: 204, exploration: 0.1, score: 9
Scores: (min: 8, avg: 38.28, max: 325)

Run: 205, exploration: 0.1, score: 9
Scores: (min: 8, avg: 38.27, max: 325)

Run: 206, exploration: 0.1, score: 9
Scores: (min: 8, avg: 38.27, max: 325)

Run: 207, exploration: 0.1, score: 9
Scores: (min: 8, avg: 38.27, max: 325)

Run: 208, exploration: 0.1, score: 10
Scores: (min: 8, avg: 38.28, max:

Run: 302, exploration: 0.1, score: 10
Scores: (min: 8, avg: 33.93, max: 304)

Run: 303, exploration: 0.1, score: 10
Scores: (min: 8, avg: 33.94, max: 304)

Run: 304, exploration: 0.1, score: 10
Scores: (min: 8, avg: 33.95, max: 304)

Run: 305, exploration: 0.1, score: 8
Scores: (min: 8, avg: 33.94, max: 304)

Run: 306, exploration: 0.1, score: 10
Scores: (min: 8, avg: 33.95, max: 304)

Run: 307, exploration: 0.1, score: 9
Scores: (min: 8, avg: 33.95, max: 304)

Run: 308, exploration: 0.1, score: 10
Scores: (min: 8, avg: 33.95, max: 304)

Run: 309, exploration: 0.1, score: 11
Scores: (min: 8, avg: 33.97, max: 304)

Run: 310, exploration: 0.1, score: 9
Scores: (min: 8, avg: 33.95, max: 304)

Run: 311, exploration: 0.1, score: 9
Scores: (min: 8, avg: 33.94, max: 304)

Run: 312, exploration: 0.1, score: 9
Scores: (min: 8, avg: 33.93, max: 304)

Run: 313, exploration: 0.1, score: 12
Scores: (min: 8, avg: 33.96, max: 304)

Run: 314, exploration: 0.1, score: 20
Scores: (min: 8, avg: 34.08, ma

KeyboardInterrupt: 

### Modified Run 2
For this run we increased `EXPLORATION_MIN` from `0.1` to `0.9`. Once again the run count was heading towards a very large total (even larger than with the previous `LEARNING_RATE` modification); the block was stopped at Run **302** with an average score of just **26.71**. At this rate it would take thousands of runs to reach the goal of **195**.
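
To make the effect of the exploration floor concrete, the sketch below replays the decay rule from `experience_replay()` in isolation. The helper names `steps_to_floor` and `greedy_action_probability` are hypothetical and not part of the notebook; the second helper assumes the two CartPole actions are picked uniformly when exploring, as `random.randrange` does.

```python
def steps_to_floor(start, floor, decay):
    """Count decay updates until the exploration rate is clamped at `floor`."""
    rate, steps = start, 0
    while rate > floor:
        rate = max(floor, rate * decay)  # same update rule as experience_replay()
        steps += 1
    return steps


def greedy_action_probability(exploration_rate, n_actions=2):
    """Chance that the action taken matches the network's preferred action."""
    # With probability (1 - eps) the agent follows the network; otherwise it
    # picks uniformly, which still hits the preferred action 1/n of the time.
    return (1.0 - exploration_rate) + exploration_rate / n_actions


for floor in (0.1, 0.9):
    print(floor,
          steps_to_floor(1.0, floor, 0.995),
          greedy_action_probability(floor))
```

With a floor of `0.9` the rate bottoms out after only a couple of dozen updates, and from then on the agent follows its own prediction only slightly more than half of the time, which is consistent with the flat average score in the output below.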

In [6]:
import random  
import gym  
import numpy as np  
from collections import deque  
from keras.models import Sequential  
from keras.layers import Dense  
from keras.optimizers import Adam  
  
  
from scores.score_logger import ScoreLogger  
  
ENV_NAME = "CartPole-v1"  
  
GAMMA = 0.95                # discount factor applied to future rewards
LEARNING_RATE = 0.001        # how much the NN learns each iteration
  
MEMORY_SIZE = 1000000  
BATCH_SIZE = 20             # how much memory is used to learn
  
EXPLORATION_MAX = 1.0       # rate at which agent randomly decides actions
                            # as opposed to predictions
EXPLORATION_MIN = 0.9       # explore at least this amount
EXPLORATION_DECAY = 0.995   # decrease number of explorations over time as
                            # agent ability improves
  
  
class DQNSolver:  
  
    def __init__(self, observation_space, action_space):  
        self.exploration_rate = EXPLORATION_MAX  
  
        self.action_space = action_space  
        self.memory = deque(maxlen=MEMORY_SIZE)  
  
        self.model = Sequential()  
        self.model.add(Dense(24, input_shape=(observation_space,), activation="relu"))  
        self.model.add(Dense(24, activation="relu"))  
        self.model.add(Dense(self.action_space, activation="linear"))  
        self.model.compile(loss="mse", optimizer=Adam(lr=LEARNING_RATE))  
  
    def remember(self, state, action, reward, next_state, done):  
        self.memory.append((state, action, reward, next_state, done))  
  
    def act(self, state):  
        if np.random.rand() < self.exploration_rate:  
            return random.randrange(self.action_space)  
        q_values = self.model.predict(state)  
        return np.argmax(q_values[0])  
  
    def experience_replay(self):  
        if len(self.memory) < BATCH_SIZE:  
            return  
        batch = random.sample(self.memory, BATCH_SIZE)  
        for state, action, reward, state_next, terminal in batch:  
            q_update = reward  
            if not terminal:  
                q_update = (reward + GAMMA * np.amax(self.model.predict(state_next)[0]))  
            q_values = self.model.predict(state)  
            q_values[0][action] = q_update  
            self.model.fit(state, q_values, verbose=0)  
        self.exploration_rate *= EXPLORATION_DECAY  
        self.exploration_rate = max(EXPLORATION_MIN, self.exploration_rate)  
  
  
def cartpole():  
    env = gym.make(ENV_NAME)  
    score_logger = ScoreLogger(ENV_NAME)  
    observation_space = env.observation_space.shape[0]  
    action_space = env.action_space.n  
    dqn_solver = DQNSolver(observation_space, action_space)  
    run = 0  
    while True:  
        run += 1  
        state = env.reset()  
        state = np.reshape(state, [1, observation_space])  
        step = 0  
        while True:  
            step += 1  
            #env.render()  
            action = dqn_solver.act(state)  
            state_next, reward, terminal, info = env.step(action)  
            reward = reward if not terminal else -reward  
            state_next = np.reshape(state_next, [1, observation_space])  
            dqn_solver.remember(state, action, reward, state_next, terminal)  
            state = state_next  
            if terminal:  
                print ("Run: " + str(run) + ", exploration: " + str(dqn_solver.exploration_rate) + ", score: " + str(step))  
                score_logger.add_score(step, run)  
                break  
            dqn_solver.experience_replay()  



In [7]:
cartpole()

Run: 1, exploration: 0.995, score: 21
Scores: (min: 21, avg: 21, max: 21)

Run: 2, exploration: 0.9, score: 25
Scores: (min: 21, avg: 23, max: 25)

Run: 3, exploration: 0.9, score: 19
Scores: (min: 19, avg: 21.666666666666668, max: 25)

Run: 4, exploration: 0.9, score: 31
Scores: (min: 19, avg: 24, max: 31)

Run: 5, exploration: 0.9, score: 29
Scores: (min: 19, avg: 25, max: 31)

Run: 6, exploration: 0.9, score: 18
Scores: (min: 18, avg: 23.833333333333332, max: 31)

Run: 7, exploration: 0.9, score: 10
Scores: (min: 10, avg: 21.857142857142858, max: 31)

Run: 8, exploration: 0.9, score: 29
Scores: (min: 10, avg: 22.75, max: 31)

Run: 9, exploration: 0.9, score: 11
Scores: (min: 10, avg: 21.444444444444443, max: 31)

Run: 10, exploration: 0.9, score: 10
Scores: (min: 10, avg: 20.3, max: 31)

Run: 11, exploration: 0.9, score: 11
Scores: (min: 10, avg: 19.454545454545453, max: 31)

Run: 12, exploration: 0.9, score: 10
Scores: (min: 10, avg: 18.666666666666668, max: 31)

Run: 13, explorati

Run: 94, exploration: 0.9, score: 12
Scores: (min: 10, avg: 26.79787234042553, max: 115)

Run: 95, exploration: 0.9, score: 14
Scores: (min: 10, avg: 26.66315789473684, max: 115)

Run: 96, exploration: 0.9, score: 17
Scores: (min: 10, avg: 26.5625, max: 115)

Run: 97, exploration: 0.9, score: 85
Scores: (min: 10, avg: 27.164948453608247, max: 115)

Run: 98, exploration: 0.9, score: 26
Scores: (min: 10, avg: 27.153061224489797, max: 115)

Run: 99, exploration: 0.9, score: 18
Scores: (min: 10, avg: 27.060606060606062, max: 115)

Run: 100, exploration: 0.9, score: 26
Scores: (min: 10, avg: 27.05, max: 115)

Run: 101, exploration: 0.9, score: 13
Scores: (min: 10, avg: 26.97, max: 115)

Run: 102, exploration: 0.9, score: 28
Scores: (min: 10, avg: 27, max: 115)

Run: 103, exploration: 0.9, score: 12
Scores: (min: 10, avg: 26.93, max: 115)

Run: 104, exploration: 0.9, score: 59
Scores: (min: 10, avg: 27.21, max: 115)

Run: 105, exploration: 0.9, score: 20
Scores: (min: 10, avg: 27.12, max: 11

Run: 199, exploration: 0.9, score: 18
Scores: (min: 9, avg: 26.79, max: 172)

Run: 200, exploration: 0.9, score: 21
Scores: (min: 9, avg: 26.74, max: 172)

Run: 201, exploration: 0.9, score: 37
Scores: (min: 9, avg: 26.98, max: 172)

Run: 202, exploration: 0.9, score: 40
Scores: (min: 9, avg: 27.1, max: 172)

Run: 203, exploration: 0.9, score: 10
Scores: (min: 9, avg: 27.08, max: 172)

Run: 204, exploration: 0.9, score: 17
Scores: (min: 9, avg: 26.66, max: 172)

Run: 205, exploration: 0.9, score: 26
Scores: (min: 9, avg: 26.72, max: 172)

Run: 206, exploration: 0.9, score: 18
Scores: (min: 9, avg: 26.78, max: 172)

Run: 207, exploration: 0.9, score: 12
Scores: (min: 9, avg: 26.55, max: 172)

Run: 208, exploration: 0.9, score: 49
Scores: (min: 9, avg: 26.93, max: 172)

Run: 209, exploration: 0.9, score: 23
Scores: (min: 9, avg: 26.99, max: 172)

Run: 210, exploration: 0.9, score: 14
Scores: (min: 9, avg: 26.84, max: 172)

Run: 211, exploration: 0.9, score: 34
Scores: (min: 9, avg: 26.39

KeyboardInterrupt: 

### Modified Run 3
For this run, we changed `GAMMA` from `0.95` to `0.65`. `GAMMA` is the discount factor applied to future rewards. Higher values push the agent to give more weight to the total sum of future rewards when deciding what action to take in the current state. A lower value increases the agent's myopia, and a value of zero makes the agent consider only the reward gained in the present state, disregarding all rewards from future states.
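
As a rough illustration of what the discount factor does, the sketch below sums a stream of rewards of `1.0` under the two `GAMMA` values used in this notebook. The helpers `discounted_return` and `effective_horizon` are hypothetical names introduced here; `1 / (1 - gamma)` is the usual rule of thumb for how many future steps still carry noticeable weight.

```python
def discounted_return(rewards, gamma):
    """Sum of rewards where the reward k steps ahead is weighted by gamma**k."""
    total = 0.0
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


def effective_horizon(gamma):
    """Rule-of-thumb number of future steps that still matter (gamma < 1)."""
    return 1.0 / (1.0 - gamma)


rewards = [1.0] * 200  # one reward per step, as in a long CartPole run
for gamma in (0.95, 0.65):
    print(gamma,
          round(discounted_return(rewards, gamma), 2),
          round(effective_horizon(gamma), 2))
```

With `0.95` the agent effectively looks about 20 steps ahead, while with `0.65` it looks fewer than 3 steps ahead, so a pole that is about to fall several steps from now has very little influence on the current decision.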

In [None]:
import random  
import gym  
import numpy as np  
from collections import deque  
from keras.models import Sequential  
from keras.layers import Dense  
from keras.optimizers import Adam  
  
  
from scores.score_logger import ScoreLogger  
  
ENV_NAME = "CartPole-v1"  
  
GAMMA = 0.65                 # discount factor applied to future rewards
LEARNING_RATE = 0.001        # how much the NN learns each iteration
  
MEMORY_SIZE = 1000000  
BATCH_SIZE = 20             # how much memory is used to learn
  
EXPLORATION_MAX = 1.0       # rate at which agent randomly decides actions
                            # as opposed to predictions
EXPLORATION_MIN = 0.01       # explore at least this amount
EXPLORATION_DECAY = 0.995   # decrease number of explorations over time as
                            # agent ability improves
  
  
class DQNSolver:  
  
    def __init__(self, observation_space, action_space):  
        self.exploration_rate = EXPLORATION_MAX  
  
        self.action_space = action_space  
        self.memory = deque(maxlen=MEMORY_SIZE)  
  
        self.model = Sequential()  
        self.model.add(Dense(24, input_shape=(observation_space,), activation="relu"))  
        self.model.add(Dense(24, activation="relu"))  
        self.model.add(Dense(self.action_space, activation="linear"))  
        self.model.compile(loss="mse", optimizer=Adam(lr=LEARNING_RATE))  
  
    def remember(self, state, action, reward, next_state, done):  
        self.memory.append((state, action, reward, next_state, done))  
  
    def act(self, state):  
        if np.random.rand() < self.exploration_rate:  
            return random.randrange(self.action_space)  
        q_values = self.model.predict(state)  
        return np.argmax(q_values[0])  
  
    def experience_replay(self):  
        if len(self.memory) < BATCH_SIZE:  
            return  
        batch = random.sample(self.memory, BATCH_SIZE)  
        for state, action, reward, state_next, terminal in batch:  
            q_update = reward  
            if not terminal:  
                q_update = (reward + GAMMA * np.amax(self.model.predict(state_next)[0]))  
            q_values = self.model.predict(state)  
            q_values[0][action] = q_update  
            self.model.fit(state, q_values, verbose=0)  
        self.exploration_rate *= EXPLORATION_DECAY  
        self.exploration_rate = max(EXPLORATION_MIN, self.exploration_rate)  
  
  
def cartpole():  
    env = gym.make(ENV_NAME)  
    score_logger = ScoreLogger(ENV_NAME)  
    observation_space = env.observation_space.shape[0]  
    action_space = env.action_space.n  
    dqn_solver = DQNSolver(observation_space, action_space)  
    run = 0  
    while True:  
        run += 1  
        state = env.reset()  
        state = np.reshape(state, [1, observation_space])  
        step = 0  
        while True:  
            step += 1  
            #env.render()  
            action = dqn_solver.act(state)  
            state_next, reward, terminal, info = env.step(action)  
            reward = reward if not terminal else -reward  
            state_next = np.reshape(state_next, [1, observation_space])  
            dqn_solver.remember(state, action, reward, state_next, terminal)  
            state = state_next  
            if terminal:  
                print ("Run: " + str(run) + ", exploration: " + str(dqn_solver.exploration_rate) + ", score: " + str(step))  
                score_logger.add_score(step, run)  
                break  
            dqn_solver.experience_replay()  



In [None]:
cartpole()

### Analysis of the cartpole problem and code
The agent in the "cartpole" problem is tasked with keeping a pole balanced upright on a cart. It does this by moving the cart left or right (the only possible actions) and attempting to prevent the pole from tilting in either direction by more than 12 degrees from vertical. The cart is also expected to stay within 2.4 units of center. If either threshold is exceeded, the agent has failed. Each time the agent takes an action it receives feedback from the environment in the form of a reward (an increase in score) and the next state of the environment.

To solve the cartpole problem we used Deep Q-Learning. It differs from standard Q-Learning in that a Deep Q-Network (DQN) uses a **Neural Network** (NN). Standard Q-Learning stores a Q-value for every possible state-action pair in a Q-table and updates those entries directly, but with large numbers of state-action pairs (think thousands or even millions) this becomes infeasible. Instead, the NN is used to approximate the Q-values that the table would hold, and thus a DQN is created. The NN takes the state of the environment as input and outputs one Q-value per possible action. The best action for the given state is the one whose output has the highest Q-value.
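
For contrast with the network used in this notebook, here is a minimal sketch of the table-based update that standard Q-Learning performs. The function `q_table_update`, the step size `alpha`, and the string state labels are assumptions made for illustration; none of them appear in the assignment code.

```python
from collections import defaultdict


def q_table_update(q_table, state, action, reward, next_state, done,
                   alpha=0.1, gamma=0.95, n_actions=2):
    """Apply one tabular Q-Learning update and return the new Q-value."""
    # Value of the best action in the next state (zero if the run has ended).
    best_next = 0.0 if done else max(
        q_table[(next_state, a)] for a in range(n_actions))
    target = reward + gamma * best_next
    # Move the stored value a fraction `alpha` of the way toward the target.
    q_table[(state, action)] += alpha * (target - q_table[(state, action)])
    return q_table[(state, action)]


q_table = defaultdict(float)  # every unseen (state, action) pair starts at 0.0
q_table_update(q_table, state="s0", action=1, reward=1.0,
               next_state="s1", done=False)
print(dict(q_table))
```

A table like this needs one entry per distinct state. CartPole observations are four continuous numbers, so almost every state the agent sees is new, which is why the notebook lets a network estimate the Q-values instead.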

The `LEARNING_RATE` variable controls how large each update to the NN's weights is during training. Changing the learning rate changes how quickly the weights in the model move. A low learning rate typically results in more reliable training but slower optimization. Higher learning rates can produce higher losses during training (see 'Modified Run 1' above). Since the goal is ultimately a decrease in loss over time, the lower learning rate was preferred here.
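
The trade-off can be seen on a toy problem. The sketch below uses plain gradient descent on a one-parameter quadratic loss, which is a deliberate simplification: the notebook trains a network with Adam, so this does not reproduce the runs above, and the `curvature` value and the function name `gradient_descent` are assumptions chosen only for illustration.

```python
def gradient_descent(learning_rate, steps=50, start=1.0, curvature=10.0):
    """Minimize loss(w) = 0.5 * curvature * w**2; return the loss at each step."""
    w = start
    losses = []
    for _ in range(steps):
        losses.append(0.5 * curvature * w * w)
        gradient = curvature * w        # derivative of the loss with respect to w
        w -= learning_rate * gradient   # step size scales with the learning rate
    return losses


for lr in (0.001, 0.01, 0.25):
    losses = gradient_descent(lr)
    print(lr, losses[0], losses[-1])
```

On this toy loss the smallest rate improves slowly but steadily, the middle rate improves faster, and the largest rate overshoots the minimum on every step so the loss grows instead of shrinking (here any rate above `2 / curvature` diverges). Where that tipping point lies for the DQN is not something this sketch can tell us.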

To improve the long-term learning of the agent in the cartpole problem, Experience Replay is used within the DQN. Each experience (state, action, reward, next state, and whether the run ended) is stored in memory, and a random sample of these “experiences” is drawn at each training step (its size is set by the `BATCH_SIZE` variable). These experiences are used to train the agent in small batches, so it keeps learning from earlier runs rather than only from the most recent step. Random sampling also reduces the correlation between consecutive training samples, which helps keep the learned Q-values stable.

A discount factor (`GAMMA`) helps the agent perform better in the long term by giving it a sort of “future-thinking” when considering its actions (Wang, 2021). In practice this means the agent weighs distant future rewards as well as short-term ones. If gamma is equal to 1 (or very close to it), the agent considers the sum total of all future rewards when evaluating its actions; this keeps the agent “working” to always improve its own total score. A gamma of 0 produces an agent that is only concerned with actions that yield an immediate reward.
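
Replay and discounting meet in the target value that each sampled experience is trained toward. The sketch below mirrors the logic of `experience_replay()` without Keras: `toy_q_values` is a hypothetical stand-in for `model.predict`, the integer states are made up, and `bellman_targets` is a name introduced here for illustration.

```python
import random


def bellman_targets(batch, q_values_fn, gamma):
    """Compute one training target per (state, action, reward, next_state, terminal)."""
    targets = []
    for state, action, reward, next_state, terminal in batch:
        if terminal:
            # No future to discount once the run has ended.
            targets.append(reward)
        else:
            # Reward now plus the discounted value of the best next action.
            targets.append(reward + gamma * max(q_values_fn(next_state)))
    return targets


def toy_q_values(state):
    """Hypothetical stand-in for the network: one Q-value per action."""
    return [0.5 * state, 1.0 * state]


# A small made-up memory: nine ordinary steps and one terminal step.
memory = [(s, s % 2, 1.0, s + 1, False) for s in range(9)]
memory.append((9, 0, -1.0, 10, True))

random.seed(0)
batch = random.sample(memory, 4)  # same sampling call the notebook uses
print(bellman_targets(batch, toy_q_values, gamma=0.95))
```

Because the batch is drawn at random from the whole memory, neighbouring steps of the same run rarely end up in the same batch, which is the decorrelation effect described above.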


### References
* Beysolow, T. (2019). Chapter 3. In Applied reinforcement learning with Python: With OpenAI Gym, TensorFlow, and Keras. Apress.

* Gulli, A., & Pal, S. (2017). Chapter 8: AI Game Playing. In Deep learning with Keras: Implement neural networks with Keras on Theano and TensorFlow (pp. 271–274). Packt.

* Gupta, A. (2022, May 13). Deep Q-learning. GeeksforGeeks. Retrieved September 23, 2022, from https://www.geeksforgeeks.org/deep-q-learning/

* Surma, G. (2019, November 10). Cartpole - introduction to reinforcement learning (DQN - deep Q-learning). Medium. Retrieved September 23, 2022, from https://gsurma.medium.com/cartpole-introduction-to-reinforcement-learning-ed0eb5b58288

* Surmenok, P. (2021, April 19). Estimating an optimal learning rate for a deep neural network. Medium. Retrieved September 23, 2022, from https://towardsdatascience.com/estimating-optimal-learning-rate-for-a-deep-neural-network-ce32f2556ce0

* Wang, M. (2021, October 3). Deep Q-learning tutorial: Mindqn. Medium. Retrieved September 23, 2022, from https://towardsdatascience.com/deep-q-learning-tutorial-mindqn-2a4c855abffc